pragmatic data mining: novel paradigms for tackling key ... · pragmatic data mining: novel...

Pragmatic Data Mining: Novel Paradigms for

Tackling Key Challenges

A Project Report

Submitted in partial fulfilment of the

requirements for the Degree of

Master of Engineering

in

Faculty of Engineering

by

Vikas Kumar Garg

Computer Science & Automation (CSA)

Indian Institute of Science

BANGALORE – 560 012

June 2009

To

My Family

For

Their Unalloyed and Unconditional

Love, Prayers, and Support.

Acknowledgements

First and foremost, I owe all my endeavors and success to my family for enduring innumerable hardships

to progressively sustain and nurture me all through my life. They have sacrificed every bit of their

pleasure to provide me with the best facilities. I am falling short of words to express my indebtedness

for their love and blessings.

I am greatly indebted to my guide, MNM Sir, for his perspicacity in removing my doubts, willingness

to broaden the horizon of my knowledge, and ever availability despite being extremely busy due to his

commitments as the Chairman of our department. His contribution in shaping my career has been im-

mense and I admire him greatly for his simplicity and altruistic generosity. I am also profoundly grateful

to Narahari Sir, interactions with whom were instrumental in inspiring me towards inter-disciplinary

work. His guidance has been extremely fruitful and provided a terrific learning experience for me. I feel

really honored and privileged to have learnt the nuances of quality research from a person of his stature.

Frankly speaking, I have to pinch myself sometimes, to realize that I have had an opportunity to work

with MNM Sir and Narahari Sir, the persons I always wanted to emulate.

I would also like to thank Shevade Sir, whose extremely well-structured course on Pattern Recognition

went a long way in developing my fascination for meaningful research in related fields of intelligent

systems. My thanks are also due to my friends, especially Devansh Dikshit and Harsh Shrimal, who

made my stay at IISc, a memorable and pleasant experience. My sincere thanks are also due to Mr.

Ramasuri Narayanam, who provided his valuable comments on my work, time and again.

I would be failing in my duty if I do not acknowledge the contribution of my critics, who inspired

me to put in a whole-hearted and determined effort into my research. During these two years, at times

when I I got a bit lethargic, their criticism instigated the requisite spark in me to rejuvenate myself.

i

Vita

I received my B.E. in Information Technology from Netaji Subhas Institute of Tech-

nology (NSIT) (7 semesters) and Delhi College of Engineering (DCE)(1 semester), University of Delhi,

India in 2006. I worked as a Research and Design Engineer at VedaSoft Solutions, India from July

2006-July 2007. Since August 2007, I have been working as a graduate student in the Department of

Computer Science and Automation at the Indian Institute of Science, Bangalore. My research interests

can be broadly summarized as lying at the intersection of Intelligent Systems and Theoretical Computer

Science. More specifically, from an application point of view, I am fascinated with the fields of Artificial

Intelligence (Machine Learning, Data Mining, Computer Vision, NLP, Robotics), Game Theory, On-

line Algorithms, and Computational Neuroscience. I am also interested in Statistical Learning Theory,

Convex Optimization, and Complexity Theory.

ii

Introduction

Data Mining, a branch of science closely associated with other sub-fields of Artificial Intelligence such

as Machine Learning and Pattern Recognition, is a relatively new field that has generated tremendous

interest among researchers over the last decade. A plethora of contributions in the literature has resulted

in emergence of data mining as an interdisciplinary dossier of paramount potential and significance. Al-

though data mining is a field that encompasses a variety of subjects and constantly engulfs new topics,

the traditional areas of clustering and classification continue to garner considerable attention. Addi-

tionally, the endeavor to apply mining techniques to new applications, which involve high dimensional

data, has necessitated the design of improved techniques for dimensionality reduction. In this work, we

propose novel techniques to address some of the key issues pertaining to these areas.

Dimensionality reduction and thereafter deriving the features has been used as a common technique

to redefine a pattern from a high dimension space to a lower dimension space. Dimension reduction

is important for operations such as clustering, classification, indexing, and searching. In our work, we

focus on dimensionality reduction in the context of image data. In our work, the N-dimensional data

belonging to a manifold has been extracted. Our approach highlights the fact that highly dynamic real

life data can not be considered locally linear and the information contained therein should be understood

by conceptualizing the image in terms of outlines and contours. The outlines remain almost constant

during dynamic range of the image while the contours keep on changing and project the mood, gestures

and other expressions. Chapter 1 provides a detailed exposition of our approach for dimensionality

reduction, feature extraction, and embedding.

Clustering or unsupervised classification of patterns into groups based on similarity is another very

well studied problem in pattern recognition, data mining, information retrieval, and related disciplines.

Clustering finds numerous direct practical applications in pattern analysis, decision making, and machine

learning tasks such as image segmentation. Besides, clustering also acts as a precursor to many data

processing tasks including classification. The explosive rate of data generation has reinforced the need

for incremental learning. In Chapter 2, we provide a framework for necessary and sufficient conditions

to obtain order independence.

iii

iv

The k-means algorithm is a very widely used clustering technique for scientific and industrial ap-

plications. Several variants of the k-means algorithm have been proposed in the literature. A major

limitation of the k-means algorithm is that the number of distance computations performed in the k-

means algorithm is linear in k, the number of clusters. In Chapter 3, we propose an algorithm, RACK,

based on AVL trees, that effectively computes distances from O(lg k) cluster centers only, thereby con-

siderably improving the total time required for clustering. Simultaneously, RACK ensures that quality

of clustering does not degrade much compared to the k-means algorithm.

In addition, there are other shortcomings of k-means: (a) it may give poor results for an inappropriate

choice of k, (b) it may not converge to a globally optimal solution due to inappropriate initial selection

of cluster centers , and (c) it requires knowledge of the number of cluster centers, k, as an input.

To address these issues, we also propose a novel technique, SHARPC, based on the cooperative game

theoretic concept of Shapley value. The SHARPC algorithm not only obviates the need for specifying

k, but also gives optimal clustering in that it tends to minimize both the distances: (a) the average

distance between a point and its nearest cluster center, and (b) the average pair-wise distance within

a cluster. We note that the algorithms such as k-means only strive to minimize the average distance

between cluster centers and the points assigned to the corresponding clusters, without taking the intra-

cluster point-point distances into consideration. In that context, we believe, such algorithms do not

really capture the essence of clustering: grouping together points that are similar to each other. Our

game theoretic model is presented in Chapter 6.

On the other hand, the Leader algorithm is a popular single pass technique for clustering large

datasets on the fly. The Leader algorithm does not require prior information about the number k of

clusters. However, different orderings of input sequence may result in different number of clusters. In

other words, Leader is highly susceptible to ordering effects and may give extremely poor quality of

clustering on skewed data orders. In Chapter 2, we propose robust variants of the Leader algorithm

that improve the quality of clustering, while still preserving the one pass property of Leader.

We also address the unification of partition based clustering algorithms in Chapter 4. Partitional

algorithms form an extremely popular class of clustering algorithms. Primarily, these algorithms can be

classified into two sub-categories: a) k-means based algorithms that presume the knowledge of a suitable

k, and b) algorithms such as Leader, BIRCH, and DB-Scan, which take a distance threshold value, δ,

as an input. We propose a novel technique, EPIC, which is based on both the number of clusters, k

and the distance threshold, δ. We also establish a relationship between k and δ, and demonstrate that

EPIC achieves better performance than the standard k-means algorithm. In addition, we present a

generic scheme for integrating EPIC into different classification algorithms to reduce their training time

complexity.

v

As already mentioned, in many pattern classification applications, data are represented by high di-

mensional feature vectors. There are two reasons to reduce dimensionality of pattern representation.

First, low dimensional representation reduces computational overhead and improves classification speed.

Second, low dimensionality tends to improve the generalization ability of classification algorithms. More-

over, limiting the number of features cuts down the model capacity and thus may reduce the risk of

overfitting. Therefore, to deal with the issue of rapidly increasing computational cost in applications

requiring processing large feature sets, we introduce the α-Minimum Feature Cover (α-MFC) problem

in Chapter 5 and prove it to be NP-Hard. We also propose Feature Subspace Support vector Ma-

chines (FS-SVMs) to find an approximate solution to the α-MFC problem for efficient high dimensional

handwritten digit recognition.

The suffix tree is an immensely popular data structure for indexing colossal scale biological reposito-

ries. This is mainly due to its linear time and space complexity of construction in terms of the sequence

size, in addition to linear search complexity in terms of the pattern size. For most practical data mining

applications, the suffix tree needs to be disk-resident. To complicate the matter further, searching for

a pattern requires random traversal of suffix links connecting nodes across different pages that results

in increased I/O activity. A lot of research has been carried out into addressing this problem, primarily

focusing on building efficient disk-resident trees. One of the objectives of our work is to optimize the

layout of suffix trees with regard to assigning disk pages to tree nodes thereby resulting in improving

the search efficiency. In Chapter 7, we do a theoretical analysis of the whole problem based on our

approach and give bounded guarantee on the performance.

Abstract

Over the last few decades, Data Mining has progressively evolved into an extremely significant field

for active research. Accordingly, with a tremendous spurt in the amount of real data being generated,

the attention has diverted from synthesis and accumulation of data to its analysis and application.

Many of the well-established techniques in the literature, pertaining to some integral machine learning

and pattern recognition areas such as Dimensionality Reduction, Clustering and Classification, have

been rendered ineffective as a result of this paradigm shift in focus. In this work, we present a com-

prehensive overview of the key challenges facing these areas, and offer new insights into overcoming

these challenges. In particular, we make the following contributions: we (a) propose a generic dimen-

sion reduction technique for extracting significant information, especially in the context of image data

depicting dynamic scenes, (b) characterize the notion of order independence in incremental learning,

(c) propose improvements in the prototype Leader algorithm to obtain better quality of clustering, (d)

introduce an algorithm, RACK, based on height balanced trees, which significantly improves upon the

time taken by the popular k-means algorithm, without compromising much on the quality of clustering,

(e) demonstrate how the integration of partition based clustering techniques can be achieved using an

algorithm, EPIC, for elegantly incorporating the domain knowledge, (f) show how an order independent

algorithm based on Shapley value, SHARPC, views the problem of clustering as a natural manifestation

of the interactions among the points in a convex game setting, and thereby improves the quality of

clustering, (g) introduce the Q-Optimal Disk Layout problem in the context of suffix trees, show it to

be NP-Hard, and suggest an algorithm Approx. Q-OptDL to obtain a disk layout that is guaranteed

to have a performance within twice of the optimal layout asymptotically, and (h) introduce the α-MFC

problem for addressing the ‘curse of dimensionality’ in classification, and propose Feature Subspace

SVMs( FS-SVMs) for an approximate solution to the α-MFC problem in the context of high dimen-

sional handwritten digit recognition. Our experimental results strongly corroborate the efficacy of our

work.

vi

Contents

Acknowledgements i

Vita ii

Introduction iii

Abstract vi

1 Generic Non-Linear N-Dimension Reduction for Dynamic Scenes 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1 Nature of the local data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.1.2 Quality Degradation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.1.3 Application Domain and Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . 41.1.4 Online handling of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.1 N- Dimensional Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.2 Contours and Outlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.3 Vectored Scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.2.4 Schrodinger’s Solution to measure non-linearity . . . . . . . . . . . . . . . . . . . 81.2.5 Cumulated Adjustment Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.2.6 Embedding using Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Image Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.3.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.3.2 Extent of Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Bibliography 21

2 Characterizing Ordering Effects for Robust Incremental Clustering 242.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.1 Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.2.2 Incremental Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3 Characterizing Ordering Effects in Incremental Learners . . . . . . . . . . . . . . . . . . 272.3.1 Order Insensitive Incremental Learning through Commutative Monoids . . . . . 272.3.2 Dynamically Complete Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.4 Robust Incremental Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.4.1 The Leader Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.4.2 The Nearest Neighbor Leader (NN-Leader) Clustering Algorithm . . . . . . . . . 322.4.3 The Nearest Mean and Neighbor Leader(NMN-Leader) Clustering Algorithm . . 322.4.4 The Apogee-Mean-Perigee Leader (AMP-Leader) Clustering Algorithm . . . . . 34

vii

CONTENTS viii

2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.6 Conclusion/Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

Bibliography 40

3 RACK: RApid Clustering using K-means algorithm 423.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.2 Effective Clustering for large datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2.2 The RACK Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.2.3 Bound on the Quality of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 463.2.4 Analysis of Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Bibliography 55

4 EPIC: Towards Efficient Integration of Partitional Clustering Algorithms 574.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.2.1 k-means Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.3 The EPIC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3.1 Bound on Number of Distance Computations, Relation between τ and k, andMaximum Permissible Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4 Application of EPIC to classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634.4.1 Integration of EPIC into Support Vector Machines (SVMs) . . . . . . . . . . . . 644.4.2 Integration of Two-level EPIC into k-NNC . . . . . . . . . . . . . . . . . . . . . 65

4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.5.1 Integration of Two-level EPIC into SVM . . . . . . . . . . . . . . . . . . . . . . . 664.5.2 Integration of Two-level EPIC into k-NNC . . . . . . . . . . . . . . . . . . . . . 70

4.6 Conclusions/Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

Bibliography 72

5 Feature Subspace SVMs (FS-SVMs) for High Dimensional Handwritten Digit Recog-nition 745.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.3 The α-Minimum Feature Cover (α-MFC) Problem . . . . . . . . . . . . . . . . . . . . . 785.4 Feature Subspace SVMs (FS-SVMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795.5 A Greedy Algorithm for Approximating α-MFC . . . . . . . . . . . . . . . . . . . . . . . 815.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.6.1 Experimental Set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845.6.2 Analysis of Results obtained using Algorithm 1 . . . . . . . . . . . . . . . . . . . 855.6.3 Analysis of Results obtained using Algorithm 2 . . . . . . . . . . . . . . . . . . . 88

5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935.8 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Bibliography 97

CONTENTS ix

6 SHARPC: SHApley Value based Robust Pattern Clustering 1006.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036.2.1 The Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036.2.2 The Shapley Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1046.2.3 Convex Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1046.2.4 Shapley Value of Convex Games . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.3 Shapley Value based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056.3.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1066.3.2 An Algorithm for Clustering based on Shapley values . . . . . . . . . . . . . . . 1076.3.3 Convexity of the Underlying Game . . . . . . . . . . . . . . . . . . . . . . . . . . 1096.3.4 SHARPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.4 Order Independence of SHARPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136.4.1 Characterizing Ordering Effects in Incremental Learners . . . . . . . . . . . . . . 1136.4.2 Order Independence of SHARPC . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.5 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1176.6 Comparison of SHARPC with k-means and Leader . . . . . . . . . . . . . . . . . . . . . 122

6.6.1 Satisfiability of desirable Clustering Properties . . . . . . . . . . . . . . . . . . . 1226.6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

6.7 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1276.7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Bibliography 129

7 A 2-Approximation Algorithm for Optimal Disk Layout of Genome Scale SuffixTrees 1337.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1337.2 Hardness of the Disk Layout Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7.2.1 The Q-Optimal Disk Layout Problem . . . . . . . . . . . . . . . . . . . . . . . . 1367.3 Improving the Disk Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

7.3.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1387.3.2 Performance Bound on Approx. Q-OptDL . . . . . . . . . . . . . . . . . . . . . . 139

7.4 Conclusion/Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

Bibliography 143

Conclusion ii

List of Tables

2.1 Wine Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.2 Iris Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.1 Spam Dataset (4601 examples, 58 dimensions) . . . . . . . . . . . . . . . . . . . . . . . 523.2 Intrusion Dataset (494019 examples, 35 dimensions) . . . . . . . . . . . . . . . . . . . . 53

4.1 Training and testing timings for synthetic dataset 1 (using SVM light) . . . . . . . . . . 674.2 Training and testing timings for synthetic dataset 1 (using SVMperf ) . . . . . . . . . . 674.3 Training and testing timings for synthetic dataset 2 . . . . . . . . . . . . . . . . . . . . . 684.4 Comparison with CB-SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.5 Results for k-NNC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.1 Comparison between Leader, k-means, and SHARPC . . . . . . . . . . . . . . . . . . . . 1236.2 Spam Dataset (4601 examples, 58 dimensions) . . . . . . . . . . . . . . . . . . . . . . . 1246.3 Wine Dataset (178 examples, 13 dimensions) . . . . . . . . . . . . . . . . . . . . . . . . 1256.4 Cloud Dataset (1024 examples, 10 dimensions) . . . . . . . . . . . . . . . . . . . . . . . 1256.5 Network Intrusion Dataset (5000 examples, 37 dimensions) . . . . . . . . . . . . . . . . 125

x

List of Figures

1.1 Local non-linear behavior of a contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Transitions at a Sample Image Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3 Illustration of a contour and an outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4 Contour sketches are sufficient to depict essential features . . . . . . . . . . . . . . . . . 61.5 Formation of a Contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.6 Change in contour with change in vector orientation . . . . . . . . . . . . . . . . . . . . 71.7 Image Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.8 Contour formation from in-phase waves of different dimensions . . . . . . . . . . . . . . 91.9 Energy interpretation of vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.10 Selection of Contour Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.11 Interaction of low dimensional waves to form a higher dimensional wave . . . . . . . . . 101.12 Calculation of CAF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.13 Sample Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161.14 Processed image incorporating even dimensions from vector magnitude 22 down to 0 . . 161.15 Processed image incorporating vectors of magnitude 22 . . . . . . . . . . . . . . . . . . . 171.16 Processed image incorporating vectors of magnitude 16 . . . . . . . . . . . . . . . . . . . 171.17 Processed image incorporating vectors of magnitude 10 . . . . . . . . . . . . . . . . . . . 181.18 Final reduced image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181.19 Image Feature Extractor vs. LLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191.20 Results with Standard Lip Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.1 Wine Dataset: (a) α vs. δ, and (b) β vs. δ . . . . . . . . . . . . . . . . . . . . . . . . . 362.2 Iris Dataset: (a) α vs. δ, and (b) β vs. δ . . . . . . . . . . . . . . . . . . . . . . . . . . 372.3 Intrusion Dataset: (a) α vs. δ, and (b) β vs. δ . . . . . . . . . . . . . . . . . . . . . . . 37

3.1 k-means may not converge to a solution even after many iterations . . . . . . . . . . . . 443.2 A new data point is more likely to belong to a cluster with large number of data points 44

4.1 OCR 1 vs 6 : (a) accuracy vs. threshold (b) support vectors vs. threshold . . . . . . . . 684.2 OCR 3 vs 8 : (a) accuracy vs. threshold (b) support vectors vs. threshold . . . . . . . . 69

5.1 Iris Dataset: The need for segmentation of feature space . . . . . . . . . . . . . . . . . . 765.2 Segmentation of feature space for handwritten digit data . . . . . . . . . . . . . . . . . . 765.3 Different approaches for SVM Classification . . . . . . . . . . . . . . . . . . . . . . . . . 775.4 Steps in the modified classification process . . . . . . . . . . . . . . . . . . . . . . . . . . 775.5 The proposed feature reduction step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785.6 Ensemble of classifiers using segments of feature space . . . . . . . . . . . . . . . . . . . 815.7 Features near the periphery contain less discriminative information than those deep inside 825.8 Sample patterns of handwritten digit data . . . . . . . . . . . . . . . . . . . . . . . . . . 845.9 Similarity vs. Block Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855.10 Accuracy vs. Block Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

xi

LIST OF FIGURES xii

5.11 (Sample Dataset) Accuracy(%) results on training sets of different size . . . . . . . . . . 865.12 (MNIST ) Accuracy(%) results on training sets of different size . . . . . . . . . . . . . . 875.13 (CEDAR) Accuracy(%) results on training sets of different size . . . . . . . . . . . . . . 875.14 (USPS ) Accuracy(%) results on training sets of different size . . . . . . . . . . . . . . . 875.15 (Sample Dataset) Total relative time taken by Algorithm 1 on training sets of different size 885.16 (MNIST ) Total relative time taken by Algorithm 1 on training sets of different size . . . 885.17 (CEDAR) Total relative time taken by Algorithm 1 on training sets of different size . . 895.18 (USPS ) Total relative time taken by Algorithm 1 on training sets of different size . . . . 895.19 Accuracy vs. Number of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895.20 Reduction in Accuracy(%) vs. Reduction in Number of Features . . . . . . . . . . . . . . 905.21 Algorithm 2 vs. Random Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 905.22 (Sample Dataset) Accuracy(%) results obtained using Algorithm 2 on training sets of

different size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915.23 (MNIST ) Accuracy(%) results obtained using Algorithm 2 on training sets of different size 915.24 (CEDAR) Accuracy(%) results obtained using Algorithm 2 on training sets of different size 915.25 (USPS ) Accuracy(%) results obtained using Algorithm 2 on training sets of different size 925.26 Time Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 925.27 (Sample Dataset) Relative time taken by Algorithm 2 for training sets of different size . 935.28 (MNIST ) Relative time taken by Algorithm 2 for training sets of different size . . . . . 935.29 (CEDAR) Relative time taken by Algorithm 2 for training sets of different size . . . . . 945.30 (USPS ) Relative time taken by Algorithm 2 for training sets of different size . . . . . . 945.31 (Sample Dataset) Total relative time taken by Algorithm 2 for training sets of different size 945.32 (MNIST ) Total relative time taken by Algorithm 2 for training sets of different size . . 955.33 (CEDAR) Total relative time taken by Algorithm 2 for training sets of different size . . 955.34 (USPS ) Total relative time taken by Algorithm 2 for training sets of different size . . . 95

6.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1196.2 β − C plot (Spam Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means . . . . 1266.3 α− C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means . . 1266.4 β − C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means . . 1266.5 Wine: Potential (α) does not vary much with permutations (p) for a fixed threshold (δ) 127

Chapter 1

Generic Non-Linear N-Dimension

Reduction for Dynamic Scenes

Dimension reduction and thereafter deriving the features has been used as a common technique to

redefine a pattern in a lower dimensional space from higher dimensional space. This leads to better

classification and proves helpful in understanding the non-linear image information. Human beings

have a very fast intelligence mechanism. Any new dimension reduction technique can not be practically

useful unless it takes less processing time than the existing techniques, in addition to improving the

overall quality of the final reduced image. The recognition while working with reduced dimensions is

very fast. At the same time the features inherent to the scene should not be lost. While studying

various dimension reduction techniques, it has been noticed that though the techniques are getting

faster but the qualitative information content in the d-dimension representation is going down. The

task of dimension reduction is to operations such as clustering, classification, indexing, and searching

[1, 2]. Data projection [3] has been found fundamental to human perception. There are many methods

for estimating intrinsic dimensionality of data without actually projecting like Benett’s method [4],

Fukunaga and Olsen’s algorithms [5, 6] based upon space partition and PCA. In [7], a statistical approach

has been proposed. Pettis et al. developed an algorithm [8] taking average of distances to each point’s

k nearest neighbors. Nearest neighbor estimator was suggested by Verveer and Duin [9]. Bruske and

Sommer’s approach was based upon preserving topology maps [10]. Other than this, the data projection

approaches have been used by many [11], including Sammon’s Non-Linear Mapping (NLM) [12] and

Kohonen’s Self Organizing Map (SOM) [13]. Kruskal detected the similar features of NLM and MDS

[14] while Niemann improved the convergence of NLM [15]. Curvilinear Component Analysis (CCA)

[16] was better than NLM in the sense that it ignores the distances longer than a particular threshold.

Few methods based upon data projection like Isomap [17], Local Linear Embedding (LLE) [18], and

1

Chapter 1. Generic Non-Linear N-Dimension Reduction for Dynamic Scenes 2

Curvilinear Distance Analysis (CDA) [19] have also been suggested in the literature. Isomap finds the

geodesic distances and applies MDS; CDA is CCA with geodesic distances while LLE assumes that

the local data is linear. All such methods depend upon the neighborhood information. Such methods

fail when the data is spread over the image having a multiple number of clusters [20]. In case of

Isomap, for computing a low-dimensional embedding of a set of high dimensional data points, two

issues are important. First, the basic approach presented was akin to the methods described in the

context of flattening cortical surfaces using geodesic distances [21] and multidimensional scaling [22].

However, these ideas generalize to arbitrary dimensionality if the connectivity and metric information

of the manifold are correctly supplied. Second, due to topological considerations, this approach should

be used after careful preprocessing of the data. In the application domain of cortical flattening, it

is necessary to check manually for connectivity errors, so that points nearby in 3-space (for example,

on opposite banks of a cortical sulcus) are not taken to be nearby in the cortical surface. If such

care is taken, this method represents the preferred method for quasi-isometric cortical flattening. The

novelty of the Isomap technique is the way it defines the connectivity of each data point via its nearest

Euclidean neighbors in the input consisting of many images of a person’s face observed under different

pose and lighting conditions, in no particular order. These images can be thought of as points in a

high-dimensional vector space, with each input dimension corresponding to the brightness of one pixel

in the image. Although the input dimensionality may be quite high, the meaningful structure out of

these images has many fewer independent degrees of freedom. Yang proposed a distance preserving

method [23] by using the triangulation method [24]. The method improves upon the earlier projection

based techniques but fails because the data is N -Dimensional and making a neighborhood graph along

with preserving the previous points is still dependent upon the size of d. Recently, neural networks have

also been used to reduce dimensionality [25]. There, the authors describe a non-linear generalization of

PCA that uses an adaptive, multilayer encoder network to transform the high-dimensional data into a

low-dimensional code and a reverse decoder network. A Support Vector Machine based approach has

also been proposed [26]. However, these techniques require excessive time to train and converge.

In our work, the N-Dimensional data belonging to manifold has been extracted. The image is

divided into outline and contours and the N- Dimensional vectors are determined. For a particular

expression/gesture, the range of vector movement is determined. Schrodinger’s equation is solved for

the vectors belonging to individual dimension under the constraints of dynamic-vector-range. Finally,

the image is embedded using a mean-variance plot.


1.1 Motivation

Dimension reduction and feature extraction from complex data is a subject of paramount importance.

The need for a fast extraction technique that preserves the essential features of the underlying data

cannot be overemphasized, especially when the scene is non-static. In this regard, the need of the hour

is a fast and quality preserving algorithm that is applicable over a wide range of fields.

1.1.1 Nature of the local data

The underlying concept is based upon the N-dimensional nature of natural images. Here, N signifies

that in general more than two dimensions are required to correctly express the image in motion in

terms of features. Earlier theories focused on dimension reduction considering a few principal axes but

failed because of multi-dimensional nature of natural scenes. The human beings have been bestowed

with a fast learning mechanism that infers from data by looking at the scenes from different angles and

the greater the number of viewing principal directions, better is the analysis of the scene. Sphere is

nearest to the natural way of scene analysis. This dimension related behavior of the scene can be easily

understood by considering Fig. 1.1. As shown, the actual contour shape from A to S is lost due to the

local linear assumption. Furthermore, different contours in the vicinity would be projected along the

same hyperplane irrespective of their disparate local orientations. The actual orientation of the contour

in N dimensional space gets lost since we consider magnitudes in a considerably low dimensional space.

We note that between any two points there may be a large number of distinct contours. And even when

the geodesic distances are taken, still the existing techniques are incapable of bringing the distinguishing

features which convey the information change. Thus, it becomes clearly evident that for a contour to

be extracted satisfactorily, local non-linearity has to be accounted for.

Figure 1.1: Local non-linear behavior of a contour

Dynamic datum depicting non-static scenes can not be considered linear as its shape keeps on

changing. Hence the locally nonlinear behavior is being stressed upon. Most of the applications are

concerned with the dynamic scenes. The basic message conveyed is that for a face in a non-static scene


to be recognized efficiently when seen from any angle, a technique which considers the inherent local

non-linearity of the face is required.

1.1.2 Quality Degradation

Another important issue is quality degradation as a consequence of dimensionality reduction. Since

most of the existing techniques incorporate a few principal axes, they suffer from the problem of severe

quality degradation. To preserve the quality of the original image, we need to consider vectors in an

N-dimensional generalized space rather than in a restricted domain comprising a few selected viewing

directions. We emphasize that our technique preserves the essential features of an image while dispensing

away superfluous data.

1.1.3 Application Domain and Flexibility

The main objective behind developing any reduction technique is the extent of its usefulness in specific

and/or generic fields. The existing techniques are limited in their practical use because of restrictiveness

in their underlying theory. We need a reduction technique that has wide applicability, flexibility, and

stability to perform satisfactorily on diverse data from same/different fields of science. Our technique

manifests itself to extensive use in fields of medical image processing, space exploration, face identi-

fication, gesture and activity recognition, spam-resistant image-indexed engines to name a few. For

example, applying our non-linear technique to a sequence of pictures, we could have all the information

about the war and other emergency scenarios enabling us to take fast-track actions. Likewise, we could

have far better authentication systems that need to deal with only the important features. Similarly,

we could identify spam etc. on the Internet, design better brain-mapping devices and recognize the

gestures and expressions being conveyed. In short, we put forward a generic technique that caters to

a wide spectrum of applications. We envisage our work as a potential break-through in diverse fields

including brain mapping, medical image processing, signal processing, biometrics, defense operations,

gesture and activity recognition, and spam-resistant image indexing etc.

1.1.4 Online handling of data

A technique designed for dynamic scenes must be able to process data in real time. Most of the current

techniques require all the data to be present before deciding on the best direction(s) for projecting the

data. However, our feature extractor can process the data on-the-fly, eliminating the need for having all

the data beforehand. Only the data points adjacent to the point being considered have to be present and

stream buffers can be used. Significant dimension reduction is achieved during incremental step 1, and

our algorithm obviates the need for availability of entire data before processing. Additional reduction


can be achieved by an optional second pass.

1.2 Our Approach

1.2.1 N- Dimensional Vectors

We devise an efficient technique for N-dimension data reduction taking into account the inherent non-

linearity of the contours. Rather than considering only a few dimensions, we base our theory on a

concept of transitions that incorporate the different dimensions involved. A transition refers to a change

in intensity as we move from one image point to any of its adjacent points (Fig. 1.2).

Figure 1.2: Transitions at a Sample Image Point

1.2.2 Contours and Outlines

The outlines are those curves and boundaries which more or less remain unchanged with varying ex-

pression of a human face, for instance, the shape of the teeth, the periphery enclosing a nostril etc. The

contours on the other hand are temporary features formed as a result of change in speech and emotion

etc. and vanish when that particular expression is over e.g. the elevation/depression of the forehead,

different shapes involved in lip movement, different curves that occur on cheeks when a person expresses

himself. As shown in Fig. 1.3, the shape of the lips, when a person speaks something, is a contour

whereas the shape of the teeth is an outline.

Figure 1.3: Illustration of a contour and an outline


This concept is analogous to the way an image is sketched or expressed by a cartoonist in terms of a

few lines and shades. The major lines which may not change in the consecutive scenes are the outlines,

while the others which undergo modifications in the successive scenes are the contours. As shown in

Fig. 1.4, just by seeing a contour sketch of Albert Einstein or Mother Teresa, we can identify the great

personalities. We do not need to consider redundant image data to recognize a face.

Figure 1.4: Contour sketches are sufficient to depict essential features

Most of the existing techniques account for outlines to an extent but fail miserably when it comes

to identifying contours. The basic reason for this failure is ignorance of the underlying vectors involved

in formation of a contour. A contour is formed by a number of N-dimensional vectors emanating from

different directions as indicated in Fig. 1.5.

Figure 1.5: Formation of a Contour

Any contour can be perfectly analyzed by considering vectors of different dimensions arranged one

above another along the curve from its one end to another. Every vector is associated with a transition

in a given direction. The various vectors in tandem form a contour as shown. Note that our definition of

a dimension is slightly different from that implicit in many subspace methods that consider a dimension

as a feature or an attribute. In our context, each dimension contains all those vectors that have the

same magnitude of transition, irrespective of their spatial orientations. Thus, for instance, a transition

from image intensity 5 to 10 corresponds to the same dimension as a transition from intensity 23 to 28.


The existing theories neglect the vector nature of a contour and restrict themselves to only intensity

levels involved over an assumed local-linear region, which is unsatisfactory since a contour is formed by

orientation of the constituent vectors and the contour changes as these vectors change their orientation

following the image dynamics as shown in Fig. 1.6. It should be noted that a) the contours do not

change in a still scene, b) for a dynamic scene these particular contours play significant role because any

change incurred by them results in the change in expression or gesture. c) The dimensionality of image

can be considerably reduced while still preserving the essential features if these small changes are taken

care of.

Figure 1.6: Change in contour with change in vector orientation

1.2.3 Vectored Scene

We analyze the vectors of a contour as we move from one position to adjacent positions over all the

dimensions of the contour, and hence follow the very basis of contour formation, for instance, in an

8-bit image even though there are 256 distinct possible vector magnitudes involved, all of these are

not important. This is primarily because a point is involved in multiple vectors in its eight possible

directions and a point is very rarely involved in high magnitude vectors in all these directions; there

is, generally, at least one direction where the vector magnitude is lying in a lower range. Thus we

can restrict ourselves to vectors up to a low upper range only. This low upper range is more or less

constant for one application but it may vary from one application to another; it tends to be higher for

non-linear data compared to linear data. It is observed in most images that a point involved in high

magnitude vector in one direction is also, in general, involved in low magnitude vector(s) in some other

direction(s). This behavior is displayed in natural images used for processing. Thus, we dispense with

very high magnitude vectors and process each point for lower dimensions and include/exclude that point

from a particular dimension depending on whether it is involved in at least one corresponding vector.

Furthermore, we have observed empirically that a point involved in a particular vector in one direction

is also likely to have a nearly same magnitude vector in another direction as well for instance a point

involved in vector of magnitude 2 in one direction is likely to have a vector(s) of magnitude 3 and /or 1

in other direction(s). Thus, we may consider fewer dimensions by taking even or odd vectors, without


affecting much the quality of the image. We thus consider different transition frames or dimensions,

with each different magnitude vector defining a different dimension and all vectors with equal magnitude

included in the same dimension. This enables us to consider N-dimensional image features in terms of

different dimensions as shown in Fig. 1.7.

Figure 1.7: Image Feature Extractor

1.2.4 Schrodinger’s Solution to measure non-linearity

The general steady-state Schrodinger’s equation of a particle is

Grad(ψ) + 8π2m(E−U)h2 ψ = 0

where ψ, m, U , and E denote the wave function, mass, potential energy, and total energy of a par-

ticle while h is the Planck’s constant. Grad(ψ) denotes the gradient of the wave function. Solving

Schrodinger’s equation provides various parameters like energy of the particles, their momentum etc.

The most important parameter that the Schrodinger’s equation provides is the extent of the non-linearity

given by wave function. The quantity whose variations make up matter waves is called the wave func-

tion, ψ. The value of ψ associated with a moving object at a particular point in space and at a particular

time instant is related to the likelihood of finding that object there at that time. Once Schrodinger’s

equation has been solved for a particle in a given physical situation, the resulting wave function contains

all the information about the particle like the expected value of the particle position, particle energy,

angular momentum and linear momentum etc. as permitted by the Heisenberg’s uncertainty principle

which states that it is impossible to know both the exact position and exact momentum of an object

at the same time e.g. in one dimension, the expectation value < G(x) > of any quantity, for instance,

potential energy U(x), which is a function of position x of an unrestricted particle described by the wave

function ψ, is given as

< G(x) >=∫xG(x)|ψ|2dx


A contour as shown in Fig. 1.8 can be thought of being formed from a group of energy particles whose

motion is random and spread over various dimensions. Employing the wave concept, each dimension

is associated with a corresponding wave and for a contour to be formed; all the related waves should

be continuous and in phase with each other. The non-linearity of a contour is a direct implication

of the constituent dimensions which satisfy this condition. The existing techniques neglect this local

Figure 1.8: Contour formation from in-phase waves of different dimensions

interaction of particles and hence fail to account for the non-linearity. The Schrodinger’s equation is an

important tool to ascertain this nonlinearity. It is important to note that techniques like Wavelets and

Fourier Transform are also based on similar ideas, where any signal is shown to be expressible in terms

of infinite sinusoidal waveforms. Analyzing the Schrodinger’s equation, there are two terms on the LHS.

The first term on the LHS considers the gradient or the direction of the maximum energy change. The

second term again is a second order term that considers the kinetic energy given by (E −U) multiplied

by wave function. Rewriting Schrodinger’s equation,

Grad(ψ) = − 8π2m(E−U)h2 ψ

Thus, for the LHS to be equal to RHS, the (E−U) factor must take the maximum value (magnitude) to

account for the gradient term. Now, a particle that is sufficiently energetic to undergo a high transition

can easily undergo lower transitions as well. Thus, we need to substitute the value of maximum length

vector for Schrodinger’s equation to hold as shown in Fig. 1.9. Therefore to obtain the non-linearity

Figure 1.9: Energy interpretation of vectors

corresponding to a particular image, we solve the Schrodinger’s equation at every point of a contour


substituting the highest dimension vector magnitude for (E − U) at that point in a unit length region

(adjacent positions in Fig. 1.10), discarding those adjacent points which do not lie on the contour.

Figure 1.10: Selection of Contour Vectors

1.2.5 Cumulated Adjustment Factor

An important consideration needs to be taken care of. Since we substitute the maximum vector value

in the Schrodinger’s equation, we account for corresponding maximum change in intensity in all the

possible directions. In other words, the lower intensity changes are accounted for as well. Thus the non-

linearity involved in a higher vector should be greater than the non-linearity associated with a lower

vector. Therefore, we need to introduce an adjustment term, which we call the Cumulated Adjustment

Factor (CAF). We further note from the ideas presented earlier, from wave considerations, that a high

dimension wave is formed by in- phase interaction of the lower dimension waves, which leads to an

additive effect on the amplitude and consequently, the wave function for the higher transition (Fig.

1.11). Thus, we define CAF in such a way that the CAF for a higher vector is the sum of CAF of all

Figure 1.11: Interaction of low dimensional waves to form a higher dimensional wave

the lower dimensions and the wave function value obtained from Schrodinger’s equation for that vector

magnitude with the CAF for lowest odd/ even vector being equal to its probability function (Fig. 1.12).

The CAF for a higher dimension is the sum of its probability function and the CAF of all the lower

dimensions. Thus a higher value of CAF may be thought of as corresponding to higher energy. Note


that we do not consider zero as the lowest dimension since it represents no change and thus provides no

useful information.

Figure 1.12: Calculation of CAF

1.2.6 Embedding using Mean and Variance

We count the points involved in highest dimension and then move on to lower dimension values, without

considering those points in lower dimension which are involved in a higher transition, in accordance with

the concept that a high transition corresponds to sufficient energy to undergo lower changes in intensity.

Finally, we take the product of each CAF and its corresponding count and then calculate the mean and

variance for the reduced image. It has been shown in [27] that the relative spatial relationships existing

among the components present in an image are preserved by a set of triples consisting of mean, variance

and the total number of “keys”. The importance of mean and variance lies in the fact that the value

of weighted mean gives an idea of the average CAF and hence average transition energy involved in a

particular image frame whereas the variance indicates the variation in energy about the mean. Knowing

the mean, we get an idea as to which single dimension conveys maximum information which can be

easily retrieved based on the integral transition value that is nearest to the one corresponding to the

weighted mean. This is an added utility of our technique as we can extract the single self-contained

dimension that is a near optimal trade-off between amount of dimension reduction and the extent of

feature preserving. Further, when we speak different phonemes, different face shapes and energies are

involved and hence we can get an idea of the order of the energy involved while speaking a phoneme.

1.3 Image Feature Extractor

We now present an algorithm for the image feature extractor.

1.3.1 Algorithm 1

FeatureExtract(X, StepSize, First, Last)

X is the input image matrix with m rows and n cols. First and Last represent the highest and lowest


considered vector magnitudes respectively. Only those vectors in this range are considered whose mag-

nitude varies from First in some multiple of StepSize(s).

Step 1

Take a matrix Flag of same size as X.

Initialize Flag = 0 for all index positions;

for each point a in X, do

for each point b adjacent to a, do

if ((First - |X(a) - X(b)|) mod StepSize = 0) and Last ≤ |X(a) - X(b)| ≤ First))

Flag(a) = Flag(b) =1;

end if

end for

end for

/∗ At this point, Flag has value 1 at only those points which are involved in forming significant vectors ∗/

Step 2

2.1 Find all components C in X connecting those points that have corresponding Flag entry set to 1.

2.2 Discard all components in C that consist of only single vector.

Step 3

3.1 Apply Schrodinger’s equation to every component vector V to measure the extent of non-linearity

3.2 Obtain the value of CAF for each vector V.

Step 4

Embed the image using a variance−mean plot.

1.3.2 Extent of Dimension Reduction

Consider an image matrix (X)m×n. Define an adjacency function f : X ×X → {0, 1} as

f(x, y) = 1, x and y are adjacent to each other

= 0, otherwise

Let a and b denote the lowest and highest vector values considered in Step 1. Further let D be defined

as set of transition values between significant vectors. Define


Rx(a, b,D) = {xi ∈ X|f(x, xi) = 1 ∧ (a ≤ |x− xi| ≤ b) ∧ |x− xi| = kd,

k ∈ Z+ ∪ {0}, d ∈ D,x ∈ X}

Clearly Rx(a, b,D) contains the significant vector points adjacent to x with transitions lying in range

a to b and in multiples of gap(s) d belonging to set of differences D. Let Vmax and Vmin denote the

highest and lowest intensity values in X. Then,

0 ≤ d ≤ Vmax − Vmin ∀d ∈ D.

Note that, if a = Vmin, b = Vmax and D = {0, 1, . . ., Vmax − Vmin}, then

Rx(a,b,D) = {xi ∈ X|f(x, xi) = 1, x ∈ X}

Taking |Rx(a,b,D)| as the number of significant points adjacent to x, we obtain the total number of

significant points z in matrix X as

z = 12

∑x∈X Rx(a,b,D)

Number of points in Xm×n = mn

Therefore data reduction using step 1,

datar1 =mn− zmn

= 1−∑x

Rx(a, b,D)2mn

(1.1)

whereby data reduction for

Rx(Vmin, Vmax,{0, 1, . . ., Vmax − Vmin})

is given by

datar1 = 1− 2mn2mn

= 0

that is, we end up without any reduction as expected.

Let us call a new matrix that contains only significant points as Z. Define

R′

z(a′, b′, D′) = {zi ∈ Z|f(z, zi) = 1 ∧ (a

′ ≤ |z − zi| ≤ b′) ∧ |z − zi| = b

′ − kd′ , k ∈ Z+ ∪ {0}, d′ ∈ D′ ,

z ∈ Z},

where

a′ ≥ a and b

′ ≤ b, a and b being range values for matrix X.

At the end of step 1, the data reduction, datar2 is given by


datar2 =12

∑xRx(a, b,D)− 1

2

∑z R

′

z(a′, b′, D′)

12

∑xRx(a, b,D)

= 1−∑z R

′

z(a′, b′, D′)∑

xRx(a, b,D)(1.2)

Now, using (1.1), ∑xRx(a, b,D)

2mn= 1− datar1

⇒∑xRx(a, b,D) = 2mn(1− datar1)

Substituting in (1.2), we get

datar2 = 1−∑z R

′

z(a′, b′, D′)

2mn(1− datar1)

⇒∑z R

′

z(a′, b′, D′) = 2mn(1− datar1)(1− datar2)

⇒ 12

∑z

R′

z(a′, b′, D′) = mn(1− datar1)(1− datar2) (1.3)

Number of points in matrix X = mn

Number of points left after reduction = 12

∑z R

′

z(a′, b′, D′)

= mn(1− datar1)(1− datar2) [Using (1.3)]

Therefore, total reduction obtained by Image Feature Extractor in Algorithm 1

datar =mn−mn(1− datar1)(1− datar2)

mn

⇒ datar = 1− (1− datar1)(1− datar2)

Now, we proceed to obtain a mathematical expression for reduction in number of dimensions. Let the

number of dimensions in the matrices X and Rx be Dim1 and Dim2 respectively. If the number of

dimensions at the end of step 1 is Dim, then

Dim1 ≤ Vmax − Vmin + 1

Further, since two significant points adjacent to a particular point may also form a vector between

themselves if they are adjacent to each other,

Dim2 ≤ 2(b− a+ 1),

and

Dim ≤ 2(b′ − a′ + 1)

Further, Dim ≤ Dim2 ≤ Dim1.

Therefore, dimension reduction after taking into account Rx is,


dimr1 =(Dim1 −Dim2)

Dim1= 1− Dim2

Dim1

⇒ Dim2 = Dim1(1− dimr1) (1.4)

Now, dimension reduction at the end of step 1,

dimr2 =(Dim2 −Dim)

Dim2

and total dimension reduction,

dimr =(Dim1 −Dim)

Dim1

Now,

dimr =(Dim1 −Dim2) + (Dim2 −Dim)

Dim1

which, using (1.4), becomes

=(Dim1 ∗ dimr1) + (Dim2 ∗ dimr2)

Dim1

= dimr1 +Dim2 ∗ dimr2

Dim1

= dimr1 + (1− dimr1) ∗ dimr2

= dimr1 ∗ (1− dimr2) + dimr2

Note that datar denotes the dimension reduction according to the definition of a dimension as

a feature or an attribute in many subspace methods. On the other hand, dimr corresponds to our

definition of a dimension as a vector magnitude.

1.4 Experimental Results

To substantiate the theory, we conducted several experiments. First of all, we show the results of

applying Image Feature Extractor to extract significant information from a human face. We now show

how our generic algorithm works in case of human facial data. The steps for dimension reduction applied

to a sample human face (Fig. 1.13) are as follows,

Determining the StepSize

We reiterate an important observation: a point involved in a particular vector in one direction is also

likely to have a nearly same magnitude vector in another direction as well, as explained earlier. Thus

we may consider fewer dimensions by taking even or odd vectors, without affecting much the quality of


Figure 1.13: Sample Image

the image. Hence, StepSize for this case can be taken as 2 (include either even or odd vectors). Fig.

1.14 shows the processed image that incorporates even dimensions from vector magnitude 22 down to

0 (both inclusive). Note that there is a loss of information, in particular around the eyes and the lips;

however this loss is negligible.

Figure 1.14: Processed image incorporating even dimensions from vector magnitude 22 down to 0

Determining the value of First

As discussed previously, a point is associated in multiple vectors in its eight possible directions and is

rarely associated with high magnitude vectors in all of these, there is generally at least one direction

where the vector magnitude is lying in a lower range. Thus we can reduce the number of vectors further

without affecting much the contours and outlines. We further note that the points involved in high

intensity change become progressively less as the corresponding vector magnitude increases and the

no. of points included after 10-11 even (or odd) dimensions is negligible(for highly non-linear data like

human face in motion). Thus First can be taken as 22 or some higher value. Fig. 1.15 and Fig. 1.16

show the processed image that incorporates vectors corresponding to magnitude 22 and 16 respectively.

Note that image corresponding to vector magnitude 16 contains more information than magnitude 22.


Figure 1.15: Processed image incorporating vectors of magnitude 22


Determining the value of Last

We note that contours and outlines contain sufficient information represented by an image. Only the

vectors involved in contour and outline formation are important, others are responsible for redundancy.

Further the higher dimensions (vectors with greater intensity change) provide more useful information

than low vectors. Very low vector transitions close to zero contribute to redundant data. Thus we can

dispense with low dimensions. For the human face Last can be taken as a value close to 10. However,

the values of StepSize, First and Last may vary from one application to another. Fig. 1.17 shows the

processed image that incorporates vectors corresponding to magnitude 10.

Discarding singly connected points

The points which are involved in only one vector are discarded as they only add to superfluous data.

Thus now we are left with only contours and outlines which can be further processed for applications.

Fig. 1.18 shows the final N-dimensional reduced image incorporating all dimensions from 10-22 as

per our original sample image. Comparing it with the original sample image, it is clearly evident that

the Algorithm 1 extracts all the significant content from an image. More data can be dispensed with,

however at the cost of loss of information. The data associated with the dimensionally reduced image



Figure 1.18: Final reduced image

is input to the Schrodinger’s equation to measure the extent of non-linearity.

Image Feature Extractor Applied to Standard Face and Lip Images

We now show the results of applying Image Feature Extractor to Standard Face Images. For sake of

comparison, we are providing results applied on same face expressions using LLE also. An investigation

into results of these techniques clearly indicates that Image Feature Extractor outperforms LLE. Various

face expressions like seriousness, laugh, smile and forceful expression are well-separated from each other

in case of the proposed Image Feature Extractor, as shown in Fig. 1.19.

Similarly, our results are shown for standard lip images in Fig. 1.20. The results obtained clearly

indicate that data in dynamic scenes varies locally non-linearly and should not be linearly approximated.

Another important criterion is the execution time. The Image Feature Extractor algorithm is a

real-time, online technique since it requires knowledge of points in the immediate vicinity only without

considering the rest of the data. In other words, just by analyzing the sequence of face (part of face)

movement over a very short duration, we can know with reasonable accuracy, the message being conveyed

or the emotion being expressed by a person in a short interval of time. Though the above theory has

been explained primarily with facial data, it can be equally well used in many other application domains.

Only thing that needs to be done is to find the range of the vector magnitudes and the step size to be


Figure 1.19: Image Feature Extractor vs. LLE

Figure 1.20: Results with Standard Lip Images

considered for a specific application.

1.5 Conclusion

The approximation of non-linearity of dynamic features using local linear concepts by considering

geodesic distances and applying method of least square is not useful from application point of view.

Our technique overcomes this limitation. It processes the image while preserving the non-linear features

or contours that are formed in a multi-dimensional space. We first design an N-dimensional Image

Feature Extractor, which shows the contribution of individual vectors or dimensions (as we call them)

in making the face features and show how these different dimensions interact with each other to pro-

duce the original image. We then perform N-dimension data reduction and elimination of redundant

data but still preserving the contours and the outlines. Then we design an algorithm that isolates the

significant information in different connected components: the contours and the outlines. We apply the


Schrodinger’s equation to measure the extent of non-linearity in each dimension in terms of the wave

function and then make the adjustments encompassing energy considerations to obtain CAF. Finally,

we obtain the weighted mean and variance of the image. Different image expressions can therefore be

separated from each other and similar ones grouped together. The results on a sample face image besides

the standard lip and face images strongly indicate the efficacy of the proposed approach.

Bibliography

[1] A. K. Jain and R. C. Dubes. Algorithm for clustering data. Prentice Hall, 1988.

[2] A. K. Jain, R. Duin, and J. Mao. Statistical Pattern Recognition : A Review. IEEE Trans. Pattern

Analysis and Machine Intelligence (PAMI), 22, pp. 4–37, 2000.

[3] H. S. Seung and D. Lee. The Manifold Ways of Perception. Science, 290, pp. 2268–2269, 2000.

[4] R. S. Bennet. The Intrinsic Dimensionality of Signal Collections. IEEE Trans. Information Theory,

15(5), pp. 517–525, 1969.

[5] K. Fukunaga and D. R. Olsen. An Algorithm for Finding Intrinsic Dimensionality of Data. IEEE

Trans. Computers, 20(2), pp. 176–183, 1971.

[6] D. R. Olsen and K. Fukunaga. Representation of Non-Linear data surfaces. IEEE Trans. Computers,

22(10), pp. 915–922, 1973.

[7] G. V. Trunk. Statistical Estimation of the Intrinsic Dimensionality of a Noisy Signal Collection.

IEEE Trans. Computers, 25, pp. 165–171, 1976.

[8] K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes. An Intrinsic Dimensionality Estimator

from Near Neighbor Information. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),

1(1), pp. 25–37, 1979.

[9] P. J. Verveer and R. P. Duin. An evaluation of intrinsic dimensionality estimators. IEEE Trans.

Pattern Analysis and Machine Intelligence (PAMI), 17(1), pp. 81–86, 1995.

[10] J. Bruske and G. Sommer. Intrinsic Dimensionality Estimation with Optimally topology preserving

maps. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 20(5), pp. 572–575, 1998.

[11] G. Biswas, A. K. Jain, and R. C. Dubes. Evaluation of Projection Algorithms. IEEE Trans. Pattern

Analysis and Machine Intelligence (PAMI), 3(6), pp. 701–708, 1981.

[12] J. J. W. Sammon. A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Computers,

18(5), pp. 401–409, 1969.

21

BIBLIOGRAPHY 22

[13] T. Kohonen. Self-Organizing Maps. Springer, Second Edition, 1997.

[14] J. Kruskal. Comments on a Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Com-

puters, 20(12), pp. 1614, 1971.

[15] H. Niemann and J. Weiss. A fast converging algorithm for non-linear mapping of high dimensional

data to a plane. IEEE Trans. Computers, 28, pp. 142–147, 1979.

[16] P. Demartines and J. Herault. ‘Curvilinear Component Analysis: A Self-Organizing Neural Network

for Nonlinear Mapping of Data Sets. IEEE Trans. Neural Networks, 8(1), pp. 148–154, 1997.

[17] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear

Dimensionality Reduction. Science, 290, pp. 2319–2323, 2000.

[18] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding.

Science, 290, pp. 2323–2326, 2000.

[19] J. A. Lee, A. Lendasse, N. Donckers, and M. Verleysen. A Robust Nonlinear Projection Method.

Proc. Eighth European Symp. Artificial Neural Networks (ESANN 2000), pp. 13–20, 2000.

[20] M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios, and N. Koudas. Non-linear dimensionality

reduction techniques for classification and visualization. Proceedings of the eighth ACM SIGKDD

international conference on Knowledge Discovery and Data Mining, 2002.

[21] M. P. Young and S. Yamane. Sparse population coding of faces in the inferotemporal cortex. Science,

256, pp. 1327–1331, 1992.

[22] R. N. Shepard. Multidimensional scaling, tree fitting and clustering. Science, 210, pp. 390–398,

1980.

[23] L. Yang. Distance-Preserving Projection of High-Dimensional Data for Nonlinear Dimensionality

Reduction. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 26(9), pp. 1243–1246,

2004.

[24] R. C. T. Lee, J. R. Slagle, and H. Blum. A Triangulation Method for the Sequential Mapping of

Points from N-Space to Two-Space. IEEE Trans. Computers, 26(3), pp. 288–292, 1977.

[25] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks.

Science, 313, pp. 504–507, 2006.

[26] Q. Tao, D. Chu, and J. Wang. Recursive Support Vector Machines for Dimensionality Reduction.

IEEE Trans. Neural Networks, 19(1), 2008.

BIBLIOGRAPHY 23

[27] P. Punitha and D. S. Guru. An effective and efficient exact match retrieval scheme for image

database systems based on spatial reasoning: A logarithmic search time approach. IEEE Transactions

on Knowledge and Data Engineering (TKDE), 18(10), pp. 1368–1381, 2006.

Chapter 2

Characterizing Ordering Effects for

Robust Incremental Clustering

2.1 Introduction

Clustering or unsupervised classification of patterns into groups based on similarity is a very well studied

problem in pattern recognition, data mining, information retrieval and related disciplines. Clustering

finds numerous direct practical applications in pattern analysis, decision making and machine learning

tasks such as image segmentation. Besides clustering also acts as a precursor to many data processing

tasks including classification [1]. Almost all the algorithms proposed for clustering require availability

of the entire dataset before any processing is done. Lately, there has been an explosion in the rate

at which data is generated. The traditional clustering algorithms that presumed primary memory to

be sufficient for containing the complete dataset have been rendered ineffective and the disk I/O is

increasingly becoming a serious bottleneck. Further, in case of applications such as network intrusion

detection and stock market analysis, a huge amount of dynamic stream data is generated. This stream

data enters the clustering system continuously and incrementally and hence the clusters derived must

also be incrementally refined [14]. Therefore, there is an imminent need for devising efficient algorithms

that process each data instance only once and on the fly.

There has been a slight confusion in the literature regarding the notion of incremental learning. The

most widely acceptable definition of incremental learning is the one given in [2]. An incremental learner

inputs one training experience at a time, does not reprocess any previous experiences and retains only

one knowledge structure in the memory. Strictly enforcing these constraints rules out the incremental

nature of a number of algorithms such as Candidate Elimination Algorithm in Version Space [3], learning

structural descriptions [4] and decision tree induction [5].

24

Chapter 2. Characterizing Ordering Effects for Robust Incremental Clustering 25

Most of the existing incremental clustering algorithms are sensitive to the order of input data. That

is, given a set of data objects, such an algorithm might return different clusterings based on the order

of presentation of these objects. An incremental learner is said to be order dependent if it produces

different knowledge structures based on the sequence in which examples are provided as input. In [6],

two necessary properties of order independent incremental learners are outlined: 1) they are able to

focus on an optimal hypothesis from a set of current potential ones, and 2) they maintain sufficient

information so that they do not forget any potential hypothesis. Incremental variants of a clustering

algorithm COBWEB [7] have been proposed but they suffer from statistical independence assumption of

attributes in the underlying probability distribution. Moreover, the statistical representation makes it

expensive to update and store the clusters. In [8], the authors throw some light on the ordering effects in

incremental clustering. Order independence of a concept directed clustering approach using knowledge

structures has been established in [9]. In our work, we provide necessary and sufficient conditions to

achieve order independence.

Another key problem in the clustering domain concerns determining a suitable number k of output

clusters when k is not input as a parameter to the clustering algorithm. The knowledge of an appropri-

ate k is imperative for effectively solving the k-means problem as in [13]. Many algorithms have been

proposed in the literature to overcome the limitation of pre-defining the number of clusters. Some of

these algorithms are weakly incremental in that they only make one pass over the dataset. In [11], the

authors propose an incremental algorithm BIRCH to dynamically cluster incoming multi-dimensional

metric data points, using a Clustering Feature (CF) Tree. The Shortest Spanning Path (SSP) [12] al-

gorithm has been used for data reorganization and automatic auditing of records. In [7], the author

proposed COBWEB, an incremental conceptual clustering algorithm that was subsequently used for

many engineering tasks. The Leader algorithm [15] is an immensely popular single pass incremental

algorithm for clustering large datasets and particularly attractive for applications such as stream data

analysis that require fast processing. A shortcoming of the Leader algorithm is that it is highly suscep-

tible to ordering effects, that is, it might yield entirely different clusterings based on the order in which

the data points are input into the algorithm. In this chapter, we also propose robust variants of the

Leader algorithm that yield more robust clusters in that the total squared distance between the points

within a cluster is less compared to that obtained using the Leader algorithm.

2.2 Preliminaries

In the following section, we review some of the technical background required for rest of this chapter.


2.2.1 Group

A non-empty set of elements G together with a binary operation ∗ (called the product), defined on G,

is said to form a group (G, ∗) if the following axioms are satisfied [10],

a) a, b ∈ G ⇒ a ∗ b ∈ G (closed).

b) a, b, c ∈ G ⇒ a ∗ (b ∗ c) = (a ∗ b) ∗ c (associative law).

c) There exists an element e ∈ G such that a ∗ e = e ∗ a = a ∀a ∈ G (the existence of an identity

element in G).

d) For every a ∈ G there exists an element a−1 ∈ G such that a ∗ a−1 = a−1 ∗ a = e (the existence of

inverses in G).

Abelian Group

A group G is said to be abelian (or commutative) if for every a, b ∈ G, a ∗ b = b ∗ a.

Commutative Monoid

A monoid is a non-empty set of elements with a binary operation that satisfies axioms a), b) and c) of

a group. A monoid that satisfies the commutative property is called a commutative monoid.

2.2.2 Incremental Learning

A learner L is incremental if L (i) inputs one training experience at a time, (ii) does not reprocess any

previous experience, and (iii) retains only one knowledge structure in memory [2]. The first condition

avoids to consider as incremental the learning algorithms that process many instances at a time by storing

all the instances seen thus far and executing the procedure on all of them. Batch learning systems fail

to satisfy this condition. The second condition rules out those systems, for example artificial neural

networks, that reprocess the old data with the new data to generate a new model. The underlying

idea is to make sure that time required to process each experience must remain almost constant. The

final constraint requires the algorithm to memorize exactly one definition for each concept and rules

out algorithms like CE [3] that retain in memory a set of competing hypotheses summarizing the data.

These hypotheses may grow exponentially with the number of training experiences.

A learner L is order sensitive or order dependent if there exists a training set T on which L exhibits

an order effect. That is, given a set of data objects, such an algorithm might return different clusterings

based on the order of presentation of these objects. An incremental learner is order sensitive if it

produces different knowledge structures based on the sequence in which examples are provided as input.

There exist at least three different levels at which order effects can occur: attribute level, instance level,

and concept level [2]. In our work, we focus on mitigating order effects at the instance level. Our


incremental algorithm maintains a single memory knowledge structure consisting of a constant number

of abstractions (independent of the input dataset) summarizing the data objects or instances seen so

far.

We define an abstraction as a part of the knowledge structure being maintained in the memory

such that (i) each abstraction represents summary of members of one cluster, and (ii) the abstraction

corresponding to a cluster is updated only when a new instance is assigned to the cluster. In addition, we

restrict the number of abstractions to the number of required clusters so that only a constant number of

abstractions are maintained at any time irrespective of the number of training experiences. The current

abstraction Ak is updated to abstraction Ak+1 when the corresponding cluster is assigned training

experience xk+1, without reprocessing any previous experience. For the rest of this chapter, we use

the terms data instances, experiences, objects and points interchangeably. Likewise, the terms order

dependence and order sensitivity shall convey the same meaning.

2.3 Characterizing Ordering Effects in Incremental Learners

In the following discussion, we claim that any order insensitive incremental algorithm operates as a

structure that can be abstracted in terms of a commutative monoid.

2.3.1 Order Insensitive Incremental Learning through Commutative Monoids

We first show how the axioms of an abelian group are satisfied by a simple order independent incremental

algorithm that finds the linear sum of n d-dimensional data points.

Theorem 1. Any order independent incremental algorithm must maintain a knowledge structure A of

abstractions together with an operator ∗ defined on it, such that (A, ∗) is a commutative monoid.

Proof. Consider the impact of violation of any of the properties of a commutative monoid on order

sensitivity of the underlying incremental algorithm.

Closure Suppose a, b ∈ A does not imply a∗ b ∈ A. This implies that the structure obtained by incor-

porating a new instance to the current knowledge structure might result in a new structure that does

not belong to the range of legal structures, hence ruling out valid processing on any further instances

in the input sequence.

Associativity The violation of associativity clearly implies presence of more than one memory struc-

ture for the same input sequence depending on the order of processing. In fact, Catalan Number of

memory structures are possible depending on the order of processing the input sequence.

Identity The presence of an identity element is required to maintain idempotency and consistency dur-

ing three phases (i) initial structure prior to processing any input data, (ii) some intermediate structure


where the algorithm waits for more examples, and (iii) once all the input instances are exhausted.

Commutativity Violation of commutativity results in obtaining any one of a potential O(N!) final

memory structures on processing an N-input dataset making the algorithm order sensitive.

Now, we introduce the concept of a dynamically complete set to completely characterize the order

independence in incremental learners.

2.3.2 Dynamically Complete Set

Let f be a function defined as f : A × X→ A, where X = {x1, x2, ..., xn} represents the set of input data

instances and A represents the set of all valid memory structures. Then X is said to be a dynamically

complete set with respect to f and A if the following conditions are satisfied for k ∈ {0,1,..., n− 1},

1)f(Ak, xk+1) = Ak+1

2)f(Ak, x′

l) = f(f(f(f(f(A0, x1), x2)..., xl−1), xl+1)..., xk), 1 ≤ l < k

with f(Ak, φ′) = Ak, where φ represents empty or null instance and x

′

l denotes the removal operation

for disregarding xl from the abstraction being considered.

3) f(f(Ak, x′

l), xl) = f(Ak, φ) = Ak, where 1 ≤ l ≤ k

4) f(f(Ak, xl), xm) = f(f(Ak, xm), xl), where l, m ∈ {k + 1, k + 2, ..., n− 1}

A dynamically complete set incorporates the idea of a removal operation which imparts ability to

the current memory structure to return to a previous structure by deleting information gained through

subsequent insertions. In fact, it has a stronger effect in that it can generate all the memory structures

that can be obtained using the experiences seen thus far, by deleting one or more of these data instances

in any order. Thus, for any sequence of input instances, ordering effects are implicitly taken care of. The

basic notion of a complete set can be related to optimal sub-structure property of dynamic programming:

if a set S of instances is insensitive to order effects, then every subset of S must also be insensitive to

order. Thus, for an algorithm to be truly incremental, it should be amenable not only to an addition

operation for moving to a new memory structure but also to a deletion operation for reverting back to

any of the structures possible using a subset of the instances seen so far.

Theorem 2. Presence of a dynamically complete set X on (A, f) provides a sufficient condition for

order independence in any incremental algorithm that takes X as an input and uses A and f .

Proof. The theorem holds trivially for a singleton set X, starting from A0. For |X| ≥ 2, we proceed

as follows. Consider the case when X consists of two elements x1 and x2. Then by property 4) from

definition of a dynamically complete set, f(f(A0, x1), x2) = f(f(A0, x2), x1) and the theorem holds.

Now, suppose X = {x1, x2, x3}. Again using 4) considering 2-subsets of X, we get the following results,

(a) f(f(A0, x1), x2) = f(f(A0, x2), x1)

(b) f(f(A0, x1), x3) = f(f(A0, x3), x1), and


(c) f(f(A0, x2), x3) = f(f(A0, x3), x2)

Applying (a), (b) and (c) on 3-subsets of X, we obtain

f(f(f(A0, x1), x2), x3) = f(f(f(A0, x2), x1), x3) (2.1)

f(f(f(A0, x1), x3), x2) = f(f(f(A0, x3), x1), x2) (2.2)

f(f(f(A0, x2), x3), x1) = f(f(f(A0, x3), x2), x1) (2.3)

Also, using property 2) from definition of a dynamically complete set,

f(A3, x′

2) = f(f(A0, x1), x3)

⇒ f(f(f(A0, x1), x3), x2) = f(f(A3, x′

2), x2)

Using property 3) of a dynamically complete set, we get

f(f(f(A0, x1), x3), x2) = A3

Likewise, using property 2) and 3) in conjunction with (6.4), (6.5) and (6.6),

f(f(f(A0, x2), x1), x3) = f(f(f(A0, x1), x2), x3) = f(f(A3, x′

3), x3) = A3

f(f(f(A0, x2), x3), x1) = f(f(A3, x′

1), x1) = A3


2), x2) = A3


1), x1) = A3

Finally, using property 1) repeatedly,

f(f(f(A0, x1), x2), x3) = f(f(A1, x2)), x3) = f(A2, x3) = A3

Clearly, order independence is seen to hold for a 3-element set X. Let us assume that order independence

holds for some r-element set where r ≥ 3. Then, consider the set of r+1 elements. The instance xr+1 can

be at any position p, where 1 ≤ p ≤ r+1. Then by property 4), f(f(A(p−1), x(p)), x(p+1)) = f(f(A(p−1),

x(p+1)), x(p)) where A(p) refers to structure obtained by processing instances up to position p and x(p)

refers to instance at position p. Thus, xr+1 gets shifted to right by one position. Applying property 4)

iteratively in this manner, xr+1 reaches the end of the sequence whereby using property 1) in conjunction

with our assumption for r-element set yields f(Ar, xr+1) = Ar+1 for all sequences of length r + 1. The

theorem follows from the principle of mathematical induction.

The notion of a dynamically complete set relates closely to that of a commutative monoid. It is readily

seen that initial memory structure A0 in the wake of Theorem 2 takes care of all the requirements of a

commutative monoid. In fact, we can prove the following stronger result than Theorem 2.

Theorem 3. Presence of a dynamically complete set X on (A, f) provides both a necessary and sufficient

condition for order independence in any incremental algorithm that takes X as an input and uses A and

f .

Proof. Theorem 1 states that all properties of a commutative monoid are necessary for order insensitivity


of an incremental algorithm. Since a dynamically complete set satisfies all these properties, we only need

to show that the removal operation is also necessary for order independence. We prove this requirement

by means of a contradiction. Suppose, the removal operation does not hold. Then, for some Ak and

some xl,

f(f(Ak, x′

l), xl) 6= Ak (2.4)

Now two cases are possible,

Case 1: xl = φ

This indicates absence (even if transient) of any input when the current abstraction is Ak . Then,

f(f(f(Ak, φ′), φ)) 6= Ak, which implies the fallacy that the state of the algorithm changes in the

absence of any input.

Case 2: xl 6= φ

We have, using the property 2) from definition of a dynamically complete set,

f(f(Ak, x′

l), xl) = f(f(f(f(f(f(A0, x1), x2)..., xl−1), xl+1)..., xk), xl)

Using property 4), we get

f(f(Ak, x′

l), xl) = f(f(f(f(f(f(A0, x1), x2)..., xl−1), xl), xl+1)..., xk)

which yields,

f(f(Ak, x′

l), xl) = Ak (2.5)

Now, (6.2) and (4.2) together imply that Ak 6= Ak, an obvious contradiction.

Therefore, the presence of a removal operation is also necessary for obtaining insensitivity to order.

Thus, Theorem 3 holds since we have already shown the sufficiency condition in Theorem 2.

An important implication of Theorem 3 is that order insensitive incremental algorithms can be

proposed provided a suitable function that operates on one memory abstraction at a time is defined

over the domain of all the input instances.

2.4 Robust Incremental Clustering

Designing a strictly order independent incremental algorithm is desirable but might be difficult to obtain

in practice. However, if we relax the definition of incremental behavior to one pass algorithms, we can

improve the quality of clustering by incorporating slight modifications. In the following discussion, we

show how slight modifications, to the Leader algorithm, can be used to substantially improve the quality

of clustering.


2.4.1 The Leader Clustering Algorithm

The Leader algorithm [15] is a popular incremental clustering method whereby the input instances are

clustered based on a specified threshold parameter. The Leader algorithm starts by assuming the first

data point as representative of a cluster. Successive incoming instances are assigned one by one to this

cluster, provided they lie within a distance δ of the cluster representative in the d-dimensional space.

If a data point is farther than the specified threshold δ, it becomes a representative of a new cluster.

The subsequent data points are assigned to the cluster whose representative element is nearest to it and

within δ distance. The process is iteratively repeated till all input points are clustered.

Leader Algorithm

Input: The input dataset X = {x1, x2, ..., xn} to be clustered and a control distance parameter δ.

Output: A set of clusters with their centers.

Initialize the set of cluster centers, C = φ;

C = C⋃{x1};

for i = 2 to n, do

find cluster r whose center xCr ∈ C is closest to xi;

compute the distance, d(xi, xCr ) between xi and xCr ;

if d(xi, xCr ) ≤ δ

assign xi to cluster r;

else

C = C⋃{xi};

end if

end for

return C as the set of cluster centers;

It is obvious that the Leader clustering algorithm does not require prior information about the

number k of clusters. However, a skewed ordering of input sequence may result in extremely poor

quality of clustering. In our work, we propose robust variants of the Leader algorithm while essentially

preserving its one pass benefit. The basic idea behind these algorithms is that the ordering effects can

be ameliorated to an extent by considering more than one point during the decision process and thereby

quality of clustering can be improved upon. We first propose the Nearest Neighbor Leader Algorithm

which is a simple modification of the Leader algorithm.


2.4.2 The Nearest Neighbor Leader (NN-Leader) Clustering Algorithm

The conventional Leader algorithm considers each of the existing clusters one by one and assigns an

incoming data point to the cluster whose representative is nearest to this point. A disadvantage of this

approach is that the clustering decision is based entirely on the center representatives while neglecting

the role played by non-center members of these clusters. Algorithm 1 outlines the NN-Leader approach.

The algorithm first determines the closest neighbor xr of the incoming, unlabeled data point xi and then

computes the distance between xi and xCr , where xCr is the representative of the cluster containing xr.

If this distance lies within the stipulated threshold δ, then xi is assigned to xCr , otherwise, xi becomes

a representative of a new cluster.

Algorithm 1: NN-Leader




C = C⋃{x1};

for i = 2 to n, do

find the center xCr of the cluster assigned to point xr

that is closest to xi;

compute the distance, d(xi, xCr ) between xi and xCr ;

if d(xi, xCr ) ≤ δ

assign xi to cluster with center xCr ;

else

C = C⋃{xi};

end if

end for


2.4.3 The Nearest Mean and Neighbor Leader(NMN-Leader) Clustering Al-

gorithm

As mentioned earlier, the susceptibility of the Leader algorithm to skewness in the data can be primarily

attributed to the exclusion of non-representative cluster elements. The NN-Leader algorithm strives

to overcome this limitation by including the immediate neighbor of the unlabeled data point in the

decision process. However, the role of the nearest neighbor is restricted to identifying a potential cluster;

the extent of proximity of the neighbor is not taken into consideration. For many pattern clustering


applications, a more robust heuristic would be to find the sum of distances of the incoming point to

more than one point of each cluster and assign the incoming point to the cluster with the minimum

overall sum of the said distances.The Nearest Mean and Neighbor Leader (Algorithm 2) extends the

notion of the NN-Leader algorithm.

Specifically, the NMN-Leader algorithm proceeds in the following way. For each unlabeled data

point, all clusters whose representatives respect the δ threshold, are considered as potential clusters.

Then the distance of the unlabeled point to a nearest point in each of the potential clusters is computed.

The data point is assigned to a potential cluster based on the respective sum of the distance to each of

the cluster centers and the nearest point in the corresponding clusters; the unlabeled data point becomes

a new cluster representative, if no such potential cluster exists. Thus the NMN-Leader can be used to

ameliorate one of the main drawbacks of the the Leader and NN- Leader algorithms, namely, that only

a single distance value is taken into consideration while deciding a cluster for an incoming point. It is

important to note that the mean or center, in the foregoing discussion, refers to a cluster representative

unlike the general notion of an average value.

Algorithm 2: NMN-Leader




C = C⋃{x1};

for i = 2 to n, do

flag = 0;

for each cluster Cj in C, do

find representative point of Cj , xCjm ;

compute the distance, dji1 =d(xi, xCjm );

if dji1 > δ

continue with next cluster Cj in C;

else

compute the distance, dji2 = d(xi, xjr), between

xi and its nearest neighbor in Cj ;

flag = 1;

end if

end for

if flag = 1

assign xi to the cluster Cj with minimum value of dji1 + dji2 ;


else

C = C⋃{xi};

end if

end for


2.4.4 The Apogee-Mean-Perigee Leader (AMP-Leader) Clustering Algo-

rithm

The Nearest Mean and Neighbor Leader (NMN-Leader) algorithm considers the sum of distances to

each candidate cluster center and the nearest neighbor in that cluster. A more robust approach would

involve considering the farthest (apogee), representative (mean) and nearest (perigee) data points of

each cluster. Thus, a more robust clustering is likely to be achieved by applying AMP-Leader clustering

algorithm (Algorithm 3).

Algorithm 3: AMP-Leader




C = C⋃{x1};

for i = 2 to n, do

flag = 0;

for each cluster Cj in C, do

find representative point of Cj , xCjm

compute the distance, dji1 =d(xi, xCjm );

if dji1 > δ

continue with next cluster Cj in C;

end if

flag = 1;

find the point in Cj , xCjp , which is closest to xi;

compute the distance, dji2 = d(xi, xCjp );

find the point in Cj , xCja , which is farthest from xi;

compute the distance, dji3 =d(xi, xCja );

dij = dji1 + dji2 + dji3 ;

end for

if flag = 1


assign xi to cluster Cj , Cj ∈ C, with minimum

value of dij ;

else

C = C⋃{xi};

end if

end for



An important question needs to be addressed: how to quantify robustness of the clusters obtained using

different algorithms, in order to decide the suitability of one over the others? For our experiments, we

measure the robustness of an algorithm in terms of the following two parameters,

1. α =

√√√√∑i

|xi − xk|2

n, where xk is the representative of cluster Ck, Ck ∈ C, to which xi ∈ X =

{x1, x2, ..., xn} is assigned, and

2. β =

√√√√√√∑k

∑xi,xj∈Ck

|xi − xj |2

|Ck|(|Ck| − 1)

|C|.

It is easy to see that α and β can be used to measure the robustness of clustering algorithms satisfactorily:

α quantifies the deviation of data points from the representative element while β captures the scatter

among different elements assigned to the same cluster. The lower the values of α and β, the higher the

quality of clustering. Our experimental results support the intuition that the principal reason behind

the low quality of clustering by Leader algorithm is a high value of β, since the Leader algorithm only

tries to minimize the distance between the data points and their respective cluster centers.

We conducted extensive experimentation to measure the robustness of different algorithms proposed

in this work relative to the Leader algorithm. We present results of using the Leader, NN-Leader,

NMN-Leader, and the AMP-Leader algorithms on three real datasets: Iris, Wine, and KDD Network

Intrusion datasets, available at [16], [17]. The Wine dataset describes the chemical analysis of wines

derived from three different cultivars. The quantities of 13 constituents found in each of the three types

of wines forms the input to the clustering algorithms, after removing the class identifier. Fig. 2.1(a) and

Fig. 2.1(b) summarize, respectively, the α and β plots for different variants of the Leader algorithm.

It is observed that the AMP Leader outperforms the Leader algorithm by an order of magnitude, the

NN and the NMN Leader algorithms also perform better than the Leader algorithm, though the gap is

not so vastly pronounced. The NN algorithm performs slightly better than the NMN algorithm in the


Figure 2.1: Wine Dataset: (a) α vs. δ, and (b) β vs. δ

Table 2.1: Wine Dataset ResultsAlgorithm δ α β Time(sec)

Leader 25 103.459243 88.312259 0.03Leader 30 105.858316 94.513687 0.03Leader 50 236.117268 81.759452 0.04Leader 75 296.273254 64.230872 0.03

NN 25 12.255716 12.806741 0.04NN 30 14.402743 13.841357 0.04NN 50 25.951286 24.689666 0.04NN 75 36.319593 34.837875 0.04

NMN 25 12.693116 13.806690 0.12NMN 30 14.297954 14.705744 0.11NMN 50 25.426869 28.176638 0.11NMN 75 36.901905 40.978662 0.12AMP 25 3.268322 1.550220 0.19AMP 30 4.470130 1.799374 0.21AMP 50 11.458716 4.303440 0.36AMP 75 28.191802 11.122114 0.44

context of β values, however, the two perform equally well with respect to α. It is worthy to note that

there is not much increase in overall time taken to execute by employing NN, NMN, and AMP Leader

algorithms in place of the Leader algorithm. Hence, these algorithms provide an encouraging alternative

to the Leader algorithm.

Our second dataset, Iris, contains 3 classes of 50 instances each, where each class refers to a type of

Iris plant. We removed the categorical class label attribute and used different algorithms for clustering

based on the remaining 4 real-valued attributes. Fig. 2.2 (a) shows the comparison of α values obtained

at various thresholds using different clustering algorithms discussed here. It is observed that the Leader

algorithm results in a higher value of α than the NN, NMN and AMP Leader algorithms for almost all

values of δ, barring a brief interval around δ=2.5. The NMN and AMP algorithms result in nearly same

value of α marginally lower than that obtained using the NN algorithm; the difference becomes more

pronounced as δ is increased beyond 4.


Figure 2.2: Iris Dataset: (a) α vs. δ, and (b) β vs. δ

Figure 2.3: Intrusion Dataset: (a) α vs. δ, and (b) β vs. δ

On the other hand, Fig. 2.2 (b) shows the comparison of β values obtained at various values

of δ using different clustering algorithms. It is observed that the NMN algorithm performs poorly

as compared to the others, including the Leader algorithm. This counter-intuitive behavior can be

explained if we consider the number of clusters obtained in each case. A relatively low value of β due

to Leader algorithm is because of the large number of clusters resulting from the Leader algorithm; a

substantial number of clusters more than offsets the low β value of other algorithms. Also, since β is

inversely proportional to√|C|, it is seen that the value β

√|C| gives a better idea of the quality of

clustering than β, therefore for extremely low values of δ, β√|C| should be preferred.

We finally show the empirical results for the KDD Network Intrusion dataset. This dataset was

released for the KDD Competition with the goal of building a network intrusion detector, to distinguish

between intrusions and normal accesses. We removed the categorical attributes and used the remaining

37 features for the clustering process. Fig. 2.3(a) and Fig. 2.3(b) clearly indicate that the AMP,

NN, and NMN algorithms massively outperform the Leader algorithm in the quality of clustering. We

observe that the AMP, NN and NMN have almost same values for α resulting in only a single (visible)

curve. Our experiments with several other large datasets indicate that the gap between the Leader and

the proposed algorithms widens more substantially with increase in δ.


Table 2.2: Iris Dataset ResultsAlgorithm δ α β Time(sec)

Leader 1 0.604428 0.363442 0.01Leader 1.1 0.836421 0.517497 0.01Leader 1.2 0.769459 0.508818 0.02Leader 1.3 0.887168 0.556569 0.01Leader 1.4 0.840555 0.699171 0.02Leader 1.5 0.988433 0.565288 0.02Leader 1.6 1.075050 0.469647 0.01Leader 2 1.286390 0.565929 0.02Leader 2.5 0.999066 0.288164 0.02Leader 4 1.871648 0.353127 0.01Leader 5 2.816606 0.615548 0.01Leader 6 3.250200 1.191248 0.01

NN 1 0.569444 0.465093 0.02NN 1.1 0.608988 0.488164 0.03NN 1.2 0.648537 0.550964 0.03NN 1.3 0.678823 0.547391 0.03NN 1.4 0.705408 0.635412 0.02NN 1.5 0.797705 0.592036 0.02NN 1.6 0.864870 0.604020 0.03NN 2 1.024044 0.714045 0.02NN 2.5 1.151405 0.896195 0.02NN 4 1.246863 1.312673 0.02NN 5 2.653413 1.460232 0.03NN 6 3.153094 2.121179 0.03

NMN 1 0.607783 0.600252 0.08NMN 1.1 0.621450 0.628142 0.09NMN 1.2 0.646890 0.718134 0.09NMN 1.3 0.658989 0.722566 0.09NMN 1.4 0.749933 0.854402 0.08NMN 1.5 0.752418 0.847823 0.10NMN 1.6 0.787782 0.884730 0.10NMN 2 0.959965 0.940225 0.09NMN 2.5 1.054672 1.000977 0.10NMN 4 1.246863 1.312673 0.08NMN 5 2.089912 1.986403 0.08NMN 6 2.365953 2.066980 0.08AMP 1 0.378594 0.158689 0.51AMP 1.1 0.491189 0.283306 0.34AMP 1.2 0.596210 0.422448 0.18AMP 1.3 0.617306 0.531905 0.15AMP 1.4 0.744222 0.773295 0.11AMP 1.5 0.752418 0.847823 0.12AMP 1.6 0.787782 0.884730 0.12AMP 2 0.959965 0.940225 0.09AMP 2.5 1.054672 1.000977 0.11AMP 4 1.246863 1.312673 0.09AMP 5 2.115167 1.994683 0.08AMP 6 2.365953 2.066980 0.10


From our experiments on several datasets, we observe that the difference in quality of clustering

between the NN and NMN algorithms is not substantial enough to choose one over the other. However,

the fact that NN takes less time to execute than the NMN (Tables 6.3 and 2.2) suggests that NN can

be preferred to NMN until some additional domain knowledge indicates otherwise. Then the choice

between NN and AMP would be a selection trade-off between quality of clustering and time taken to

execute.

2.6 Conclusion/Future Work

Incremental clustering is an important data mining task. A major concern in incremental algorithms is to

obtain identical results on a dataset for all possible orderings of the input data. We analyzed the problem

using the ideas from algebraic structures and introduced the notion of a dynamically complete set. We

proved that a dynamically complete set provides both a necessary and sufficient condition for order

independence. We also proposed a suite of robust incremental algorithms based on the popular Leader

clustering algorithm. Our experimental results indicate that the proposed algorithms perform way better

than the Leader algorithm on a number of datasets of different sizes and from different application

domains. The NN-Leader algorithm takes less time to execute than the NMN-Leader algorithm while

affording almost same quality of clustering. The time-robustness trade-off could be used to choose

between the AMP and the NN algorithms with the former providing more robust clustering while the

latter accounting for considerably less time.

Bibliography

[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys,

31(3), 1999.

[2] P. Langley. Order Effects in Incremental Learning. Learning in humans and machines: Towards an

interdisciplinary learning science, Elsevier, 1995.

[3] T. Mitchell. Generalization as Search. Artificial Intelligence, 18, pp. 203–226, 1982.

[4] P. H. Winston. Learning structural descriptions from examples. The psychology of computer vision,

McGraw-Hill, 1975.

[5] J. C. Schlimmer and D. Fisher. A case study of incremental concept induction. Proceedings of the

Fifth National Conference on Artificial Intelligence, pp. 496–501, Morgan Kaufmann, 1986.

[6] A. Cornuejols. Getting Order Independence in Incremental Learning. Proceedings of the 1993 Euro-

pean Conference on Machine Learning (ECML), pp. 196–212, Springer-Verlag, 1993.

[7] D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), pp.

139-172, 1987.

[8] D. Fisher, L. Xu, and N. Zard. Ordering effects in clustering. Proceedings of the 9th International

Conference on Machine Learning (ICML), pp. 163–168, 1992.

[9] B. Shekar, M. N. Murty, and G. Krishna. Structural aspects of semantic-directed clusters. Pattern

Recognition, 22, pp. 65–74, 1989.

[10] I. Herstein. Topics in Algebra. John Wiley & Sons, Second Edition, 2006.

[11] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very

large Databases. ACM SIGMOD, 1996.

[12] J. R. Slagle, C. L. Chang, and S. R. Heller. A Clustering and Data Reorganizing Algorithm. IEEE

Trans. Systems, Man and Cybernetics, 5, pp. 125–128, 1975.

40

BIBLIOGRAPHY 41

[13] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Symposium on

Discrete Algorithms (SODA), 2007.

[14] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining Stream Statistics over Sliding Win-

dows. Proceedings of the thirteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA),

2002.

[15] H. Spath. Cluster Analysis Algorithms for Data Reduction and Classification, Ellis Horwood, Chich-

ester, 1980.

[16] http://archive.ics.uci.edu/ml/datasets.html.

[17] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

Chapter 3

RACK: RApid Clustering using

K-means algorithm

3.1 INTRODUCTION

Clustering or unsupervised classification of patterns into groups, based on similarity, is a very well studied




tasks including classification ([1], [2]). The k-means clustering problem can be expressed in the following

way. Given an integer k and a dataset of n d-dimensional data points, the objective is to choose k so

that the total squared distance between each point and its closest center is minimized. Several attempts

have been made to find efficient solution to the k-means problem. Some techniques ([7], [8], [9]) describe

O(1 + ε)-competitive algorithms for the k-means problem. However, these algorithms have exponential

time complexity in k and thus may not be practical. A (9+ε)-competitive algorithm has been suggested

in [10]. However, the application of this algorithm is also limited, since the time complexity is cubic in

the number of data points.

According to a recent survey [3], the k-means algorithm is the most widely used technique for

scientific and industrial applications. Several variants of the k-means algorithm have been proposed

in the literature. The Lloyd’s algorithm ([4], [5], [6]) is an extensively used k-means algorithm. It

begins with a set of k representatives or centers that are randomly chosen from the data points. In one

iteration, every data point is assigned to the cluster with nearest center. The center of each cluster is

updated to the mean of the points assigned to that cluster. Then, the algorithm proceeds with next

iteration until there is no significant change in the clusters obtained in successive iterations. Recently,

42

Chapter 3. RACK: RApid Clustering using K-means algorithm 43

an O(lg k)-competitive algorithm, k-means++, has been proposed [11]. The k-means++ algorithm

improves the speed of k-means. However, the speed deteriorates significantly with an increase in value

of k.

The Leader algorithm [12], a popular single pass incremental algorithm for clustering large datasets,

is particularly attractive for applications such as stream data analysis that require fast processing. A

shortcoming of the Leader algorithm is that the quality of clustering may degrade significantly in case

of skewed input orders ([13], [14]). In [15], the authors propose an incremental algorithm BIRCH to

dynamically cluster incoming multi-dimensional metric data points, using a Clustering Feature (CF)

Tree. The Shortest Spanning Path (SSP) [16] algorithm has been used for data reorganization and

automatic auditing of records. In [17], authors proposed COBWEB, an incremental conceptual clustering

algorithm that was subsequently used for many engineering tasks.

In this work, we propose a novel technique, which strives to obtain good quality of clustering like

k-means, while minimizing the time requirements like the Leader algorithm. In particular, we make the

following contributions,

1. we propose the RACK algorithm that chooses k centers by applying k-means on a randomly chosen

subset D′

of the input dataset D, builds a height balanced tree based on a measure governed by the

cardinality of the clusters and deviation about their respective means, and incrementally assigns

the remaining data points to suitable clusters, and

2. we prove that the average deviation resulting from RACK algorithm is bounded byO(k|D| |D\D′ |).

3.2 Effective Clustering for large datasets

3.2.1 Motivation

The k-means algorithm may not work well in case of skewed data. Fig. 3.1 shows a set of data points

arranged in the form of two concentric circles, A and B. The k-means algorithm would fail to converge to

a satisfactory solution even after an extremely large number of iterations. To ameliorate this problem, we

need a different heuristic from the closest center approach: a sampled subset may be used to determine

the cluster centers followed by fast one-pass clustering of the remaining data points using an appropriate

algorithm. The basic intuition is that a random subset of the original dataset may quickly converge

to a local optimal solution and the resulting cluster centers may be used to incrementally cluster the

remaining points. This avoids repeatedly computing k distances to cluster centers till convergence unlike

k-means, since most of the points may never change their cluster centers.

Now, consider the situation as shown in Fig. 3.2. Suppose that a sample subset of the original

dataset is already clustered using the k-means algorithm. Now, if a new data point x arrives, it is more


Figure 3.1: k-means may not converge to a solution even after many iterations

probable to be a part of the cluster with large number of data points. This is a reasonable assumption

to make, particularly if the sample set reflects the behavior of the entire dataset.

Figure 3.2: A new data point is more likely to belong to a cluster with large number of data points

At the same time, we do not want the quality of clustering, which is indicated by the average

squared deviation about the cluster representatives, to degrade significantly as in case of incremental

algorithms. Therefore, rather than making a decision on the basis of cardinality of clusters alone, we

also seek to minimize the deviation. We define a heuristic, the potential, to accomplish good clustering.

We also note that, in case of k-means, the number of distance computations is given by O(nkld), for

a d-dimensional dataset of n points, which converges to a solution in l iterations. Therefore, we can

significantly improve the time by devising an algorithm that effectively measures distance from only

O(lg k) centers, and needs only a single iteration for a major part of the original dataset like the Leader

algorithm.

The Leader algorithm starts by assuming the first data point as representative of a cluster. An

iterative process is followed, whereby successive incoming instances are assigned one by one to this

cluster, provided they lie within a distance δ of the cluster representative in the d-dimensional space. If

a data point is farther than the specified threshold δ, it becomes a representative of a new cluster.

Based on the aforesaid ideas, a meaningful and intuitively appealing heuristic would be to execute

the k-means algorithm on a much smaller set of data points, maintain the requisite information about

clusters in a balanced tree structure, and cluster the remaining points of the dataset incrementally along

a suitable path of the tree. The skewed behavior of Leader algorithm is likely to be avoided since a

reasonable set of cluster centers can be expected as a result of applying k-means to a sampled subset.


In this context, we propose the RACK algorithm to accomplish effective clustering.

3.2.2 The RACK Algorithm

The RACK algorithm proceeds in two phases. In Phase 1, k-means is applied, on a randomly chosen

subset D′

of the original dataset D, to obtain a set of clusters, C = {C1, C2, . . . , Ck}. Further, as

mentioned in Section 3.2.1, we define a new heuristic, potential p, for the subsequent clustering process

involving points in the set D\D′ . The potential for each of the k clusters is given by,

p = {p1 =|C1|α1

, p2 =|C2|α2

, . . . , pk =|Ck|αk}

where, denoting the center of Cj by cj ,

αj =1|Cj |

∑xi∈Cj

||xi − cj ||2, j ∈ {1, 2, . . . , k}

In Phase 2, an AVL tree is constructed using the set p. Intuitively, for each incoming point, we

look for a cluster with large size and low variance in order to minimize the computation cost, and to

improve the quality of clusters. The AVL tree [18] is an example of height balanced binary search trees,

wherein each of the search, delete and insert operations can be performed in O(lg r) time, r being the

total number of nodes in the tree. Any node or cluster, Cmi , in an AVL tree can be uniquely identified

by specifying its level m in a path i from the root. Then, starting from the root, additional increase

in deviation due to each point xt ∈ D\D′

is computed. If the point xt is deemed fit at a node, it is

assigned to the corresponding cluster, and the average deviation about the representative of that cluster

is given by,

αmi′ =|Cmi |αmi + ||xt − xim||2

|Cmi |+ 1

where xim denotes the representative of cluster Cmi . Accordingly, the potential at that node can also

be updated. However, if xt does not satisfy the clustering criterion at a node, then it has to be assigned

to a suitable cluster, in its right or left subtree. In the worst case, xt is assigned to a leaf node.

Algorithm: RACK

Input: A dataset D and k

Output: A set of k clusters and cluster centers

• Phase 1: Obtain a subset D′

of data points by drawing random samples from D.

• Apply the k-means algorithm on D′

to obtain the set of k clusters, C = {C1, C2, . . . , Ck}.


• Compute the overall deviation of points in each cluster about its mean, α = {α1, α2, . . . , αk}.

• Compute the corresponding set of potential values, p = {p1 =|C1|α1

, p2 =|C2|α2

, . . . , pk =|Ck|αk}.

• Phase 2: Insert a node for each of the k clusters, in an AVL tree H, in the order of non-increasing

p. Store a duplet (|Cj |, pj) with j ∈ {1, 2, . . . , k}.

• for each data point xt ∈ D\D′

1. Set CNode to root of H.

2. if both the left child L and the right child R of CNode exist, then

(a) Compute the value, val = ||xt − xCNode||2, where xCNode denotes the representative

element of CNode

(b) Compute the quantity, qR =2|CNode|+ 1

pR, where pR denotes the potential of R about

its representative xR. Also, compute qL =2|CNode|+ 1

pL

(c) if val < qR, then set CNode to R; else if val > qL, then set CNode to L; otherwise

assign xt to CNode, and update pCNode topCNode(|CNode|+ 1)2

|CNode|2 + val pCNodeand |CNode| to

|CNode|+ 1;

else if CNode is a leaf, then compute val as above and update pCNode topCNode(|CNode|+ 1)2

|CNode|2 + val pCNode

and |CNode| to |CNode|+ 1;

else if the right child R exists, then compute val and qR. If val ≥ qR, then update pCNode topCNode(|CNode|+ 1)2

|CNode|2 + val pCNodeand |CNode| to |CNode|+ 1, otherwise, set CNode to R;

else compute val and qL. If val ≤ qL, then update pCNode topCNode(|CNode|+ 1)2

|CNode|2 + val pCNodeand

|CNode| to |CNode|+ 1, otherwise, set CNode to L.

3. Return the set of clusters, C1, C2, . . . , Ck and their cluster representatives.

3.2.3 Bound on the Quality of Clustering

Theorem 4. Let D and k denote the original dataset and the number of clusters respectively. The

quality of clustering achieved by Phase 2 of RACK, as measured in terms of the average deviation, is

bounded by O(k|D| |D\D′ |), if a sampled dataset of size D′

is used for obtaining k cluster representatives

in Phase 1.

Proof. Consider a cluster Cmi , at a level m, in the height balanced tree, H. Then, if pmi is the current

potential of Cmi ,

pm+1il ≤ pmi ≤ pm+1

ir (3.1)


where pm+1il and pm+1

ir denote the current potential of the left child, Cm+1il , and the right child, Cm+1

ir ,

of cluster Cmi . Now, suppose a new data point xt is assigned to Cmi and let the new potential of Cmi be

pm′

i . Note that xt does not modify the potential of any cluster, other than Cmi . In the RACK algorithm,

val is computed along a suitable path down the tree till the updated potential of the corresponding node

(on inserting the new point) lies between that of its children, or a leaf node is reached. Therefore, we

must have

pm+1il ≤ pm

′

i ≤ pm+1ir (3.2)

Using (3.1) and (3.2), in conjunction with the definition of potential, we have

|Cm+1il |αm+1il

≤ |Cmi |αmi

≤ |Cm+1ir |αm+1ir

(3.3)

and,|Cm+1il |αm+1il

≤ |Cmi |+ 1αm′

i

≤ |Cm+1ir |αm+1ir

(3.4)

Further,

αmi′ =|Cmi |αmi + ||xt − xim||2

|Cmi |+ 1

where xim denotes the representative of cluster Cmi .

⇒ αmi =(|Cmi |+ 1)αm

′

i − ||xt − xim||2

|Cmi |

which using (3.3) yields,

αm′

i ∈ [|Cmi |2α

m+1ir + ||xt − xim||2|Cm+1

ir ||Cm+1ir |(|Cmi |+ 1)

,

|Cmi |2αm+1il + ||xt − xim||2|Cm+1

il ||Cm+1il |(|Cmi |+ 1)

]

whence using (3.4), αm′

i ≥ max((|Cmi |+ 1)αm+1

ir

|Cm+1ir |

,

|Cmi |2αm+1ir + ||xt − xim||2|Cm+1

ir ||Cm+1ir |(|Cmi |+ 1)

) (3.5)

and, αm′

i ≤ min((|Cmi |+ 1)αm+1

il

|Cm+1il |

,


il ||Cm+1il |(|Cmi |+ 1)

) (3.6)


Now,


ir ||Cm+1ir |(|Cmi |+ 1)

− (|Cmi |+ 1)αm+1ir

|Cm+1ir |

=|Cmi |2α

m+1ir + ||xt − xim||2|Cm+1

ir | − (|Cmi |+ 1)2αm+1ir

|Cm+1ir |(|Cmi |+ 1)

=||xt − xim||2|Cm+1

ir | − αm+1ir − 2|Cmi |α

m+1ir

|Cm+1ir |(|Cmi |+ 1)

=||xt − xim||2

|Cmi |+ 1− αm+1

ir (2|Cmi |+ 1)|Cm+1ir |(|Cmi |+ 1)

≥ 0, if

||xt − xim||2

|Cmi |+ 1≥ αm+1

ir (2|Cmi |+ 1)|Cm+1ir |(|Cmi |+ 1)

i.e. if ||xt − xim||2 ≥αm+1ir (2|Cmi |+ 1)|Cm+1ir |

=2|Cmi |+ 1pm+1ir

Therefore,


ir ||Cm+1ir |(|Cmi |+ 1)

≥ (|Cmi |+ 1)αm+1ir

|Cm+1ir |

if ||xt − xim||2 ≥2|Cmi |+ 1pm+1ir

Thus, the bound on αm′

i in (3.5) can be determined. Similarly, we can obtain the bound in (3.6). Now,

there are three possible cases, based on various possible intervals being considered.

Case 1:

||xt − xim||2 ∈ [2|Cmi |+ 1pm+1ir

,2|Cmi |+ 1pm+1il

]

Then,

αm′

i ∈ [|Cmi |2α

m+1ir + ||xt − xim||2|Cm+1

ir ||Cm+1ir |(|Cmi |+ 1)

,


il ||Cm+1il |(|Cmi |+ 1)

]


whence the length of interval, wherein α′

lies, is given by


il ||Cm+1il |(|Cmi |+ 1)

−|Cmi |2α

m+1ir + ||xt − xim||2|Cm+1

ir ||Cm+1ir |(|Cmi |+ 1)

=|Cmi |2|C

m+1ir |αm+1

il − |Cmi |2|Cm+1il |αm+1

ir

|Cm+1ir ||Cm+1

il |(|Cmi |+ 1)

=|Cmi |2(|Cm+1

ir |αm+1il − |Cm+1

il |αm+1ir )

|Cm+1ir ||Cm+1

il |(|Cmi |+ 1)

=|Cmi |2(pm+1

ir αm+1ir αm+1

il − pm+1il αm+1

il αm+1ir )

|Cm+1ir ||Cm+1

il |(|Cmi |+ 1)

=|Cmi |2(pm+1

ir αm+1ir αm+1

il − pm+1il αm+1

il αm+1ir )

pm+1ir αm+1

ir pm+1il αm+1

il (|Cmi |+ 1)

=|Cmi |2(pm+1

ir − pm+1il )

pm+1ir pm+1

il (|Cmi |+ 1)(3.7)

Case 2:

||xt − xim||2 <2|Cmi |+ 1pm+1ir

Clearly, this case is ruled out since, by our assumption, xt is assigned to Cmi (otherwise xt should have

been checked further, for a suitable cluster, in the right subtree of Cmi ).

Case 3:

||xt − xim||2 >2|Cmi |+ 1pm+1il

This case is also not possible, as observed in a way, analogous to Case 2.

Therefore, the resulting interval after inserting xt must satisfy (3.7). Further, since xt is assigned

to Cmi , none of the clusters in the path from root of H till Cmi may satisfy (3.7). Assuming that

the probability of a point falling in an interval is proportional to the length of that interval, using a

normalization constant z, the probability of xt being assigned to Cmi is given by,

P (Cmi ← xt)


≤ (|Cmi |2(pm+1

ir − pm+1il )

zpm+1ir pm+1

il (|Cmi |+ 1))m−1∏j=1

[1−|Cji |2(pj+1

ir − pj+1il )

zpj+1ir pj+1

il (|Cji |+ 1)]

≤|Cmi |2(pm+1

ir − pm+1il )

zpm+1ir pm+1

il (|Cmi |+ 1)(since ∀s ps+1

ir ≥ ps+1il )

Then, the expected change in αm′

i is

≤∑

xt∈D\D′

|Cmi |2(pm+1ir − pm+1

il )zpm+1ir pm+1

il (|Cmi |+ 1)||xt − xim||2

Since there are O(lg k) levels, therefore, the expected change along path i is

≤O(lg k)∑m=1

m∑

xt∈D\D′

|Cmi |2(pm+1ir − pm+1

il )zpm+1ir pm+1

il (|Cmi |+ 1)||xt − xim||2

≤O(lg k)∑m=1

m∑

xt∈D\D′

|Cmi |(pm+1ir − pm+1

il )zpm+1ir pm+1

il

||xt − xim||2

≤O(lg k)∑m=1

m∑

xt∈D\D′β|Cmi | ||xt − xim||2

(where, β = max(pm+1ir − pm+1

il

zpm+1ir pm+1

il

), m ∈ {1, 2, . . . , k − 1})

≤O(lg k)∑m=1

mβ|D\D′| |Cmi |d2

max

(where dmax is the maximum distance between any two data points in D)

≤O(lg k)∑m=1

mβ′|D\D

′||Cmi |

(where β′

= β d2max)

Now, there are O(lg k) nodes in any root to leaf path of H. Further, the total number of nodes is k.

Therefore, the number of paths is bounded by O(k/lg k). Therefore, expected quality of clustering, as

characterized by average deviation, using RACK is

≤O(k/lg k)∑

i=1

i

O(lg k)∑m=1

mβ′|Cmi | |D\D

′|

≤O(k/lg k)∑

i=1

i O(|D/k|) O(|D\D′| lg2 k)


(for large datasets, expected value of |Cmi | is O(|D/k|))

= O(k|D| |D\D′|)

3.2.4 Analysis of Time Complexity

RACK algorithm consists of two phases. Phase 1 employs the k-means algorithm on sampled dataset

D′, which incurs time bounded by O(|D′ | kl′d), where l′ and d denote the number of iterations and

the dimensionality of data respectively. The computation of overall deviation of points in each cluster

about its center, and the potential values takes O(D′) time. Therefore, the time complexity of Phase 1

can be expressed as O(|D′ | kl′d).

Phase 2 consists of primarily three steps: (a) sorting the potential values obtained in Phase 1 in

non-increasing order, (b) constructing the AVL tree using potential value as the key, and (c) assigning

the points in D\D′ to one of the nodes in the tree. Sorting in (a) can be accomplished in O(k lg k)

time using any standard algorithm such as heapsort. The AVL tree can be constructed from k sorted

potential values in O(k lg k) time, since each insertion operation requires O(lg k) comparisons and a

total of k insertions are needed. Finally, each point in D\D′ may have to go down a path from root of

the tree to one of its leaves. The length of any such path is bounded by O(lg k) nodes, as a consequence

of the height balancing property of the AVL tree. At each node in the path, its right and left child

may have to be accessed, for a maximum of two operations. The cluster size and potential value at

each node can be updated in O(1) time. Then, the overall time complexity for clustering |D\D′ | data

points, in Phase 2, is O((k + d|D\D′ | )lg k). However, since typically |D\D′ | is much greater than k,

therefore, Phase 2 requires time bounded by O(d|D\D′ | lg k). The RACK algorithm, as a result, takes

O(d(|D′ | kl′ + |D\D′ | lg k)) time.


We carried out extensive experimentation to compare RACK with the Leader, the k-means and the

k-means++ algorithms. For our experiments, we measured the quality of clustering of an algorithm in

terms of the deviation as given by,

α =1n

∑xi∈D

||xi − xij ||2,

where xj is the representative of cluster Cj that belongs to the set of clusters C, to which xi is assigned.

Clearly, a low value of α corresponds to a high quality of clustering. We conducted an empirical study

on a number of real-world datasets. However, due to space constraints, we provide the results for


Table 3.1: Spam Dataset (4601 examples, 58 dimensions)Algorithm Clusters Average α Time(sec)

Leader 10 1.5397e+05 0.78k-means 10 3.6843e+04 2.8

k-means++ 10 1.8650e+04 0.93RACK 10 4.9174e+04 0.85Leader 20 1.9995e+05 0.81k-means 20 3.3210e+04 7.67



k-means++ 100 1.0876e+03 3.65RACK 100 1.718e+03 1.97

Spam and Intrusion datasets. These datasets are available as archives at the UCI Machine Learning

Repository ([19] [20]). Moreover, we state the results obtained using 30 runs of experiments to account

for statistical significance. Since the Leader algorithm does not take k as an input parameter, we

executed the code for Leader algorithm for different distance thresholds, across different orders, and

observed the number of clusters. Then, we modulated the Leader threshold distance to obtain almost

the same number of clusters. Finally we averaged the deviation across different orders. The Spam

dataset consists of 4601 examples of 58 real valued features each. Table 3.1 shows the results obtained

using the different algorithms for varying number of clusters. The RACK algorithm used a sampled

dataset of 1500 instances. As indicated, RACK competes with Leader in the total time taken. The

quality of clustering also compares favorably with the k-means algorithm and tends to approach that

of the k-means++, especially as the value of k is increased. Clearly, RACK outperforms k-means and

k-means++ in the total time taken. On the other hand, RACK yields much better clusters than the

Leader algorithm.

Table 3.2, on the other hand, shows the results of our experiments on the network intrusion data.

The RACK algorithm selected a sampled dataset of 20000 instances. As the results indicate, the time

taken by RACK is at least an order of magnitude less than the other algorithms. Further, the average

deviation of clusters, yielded by RACK, is slightly worse than that obtained using the k-means. However,

RACK outperforms k-means as the number of clusters is increased.

We also conducted experiments on several other datasets like the yeast, wine, and cloud [19] and the

results indicate that RACK can be used to get a good clustering quickly. In our experiments, we found

that the number of samples required for good clustering varies with the input dataset. However, the


Table 3.2: Intrusion Dataset (494019 examples, 35 dimensions)Algorithm Clusters Average α Time(sec)

Leader 10 3.962e+09 20.43k-means 10 3.392e+08 62.23




k-means++ 100 5.367e+07 862.74RACK 100 1.39978e+08 63.49

number of samples required is a very small fraction of the entire dataset. In view of space constraints,

the results have been omitted. It would be interesting to devise some heuristic for choosing the minimum

sample size, in accordance with statistical learning theory, but that is beyond the scope of this work.

Clearly, RACK is a pragmatic approach to clustering large datasets, and offers a viable alternative to

the popularly used k-means and the Leader algorithms.

3.4 Conclusions

The k-means is an immensely popular clustering algorithm, and finds its use in several applications.

The k-means algorithm offers good quality of clustering, however, it may take excessive time to converge

to a solution because of large number of iterations. On the other hand, incremental techniques (such

as the Leader algorithm) enable fast clustering, however, the quality of clustering may be extremely

poor. To address these issues, we proposed a novel algorithm, RACK, in this work. RACK randomly

selects a sample of data points, D′, from the original dataset, D, and applies k-means on D

′to obtain k

reasonable cluster representatives. Then, these clusters are represented by k nodes in a height balanced

tree, such that every path from root to any leaf consists of O(lg k) nodes. Each data point in D\D′ is

checked for clustering in the tree based on an appropriate heuristic. We proved an asymptotic bound

on the quality of clustering obtained using RACK and showed that RACK takes O(D\D′ lg k) time

for clustering the set D\D′ . We also provided experimental results on two large scale datasets. We

compared RACK with the Leader, the k-means, and the k-means++ algorithms. Our observations are:


• The time taken for clustering by RACK is much smaller than that of Leader, k-means, and k-

means++, in case of large datasets, where the value of k is also typically large, and

• The quality of clustering obtained using RACK is much better than that of Leader and is com-

petitive with k-means.

3.5 Future Work

In this work, we proposed the RACK algorithm that selects k centers by applying k-means on the

sampled dataset D′. However, the centers may well be chosen using any other clustering algorithm like

the k-means++. Further, RACK does not update the cluster centers. It would be interesting to analyze

the quality of clustering when the change in center is incrementally reflected with the addition of each

data point. Additionally, rather than using random sampling to obtain D′, we may resort to employing

better sampling techniques so that the selected k centers reflect the distribution of the entire dataset

more closely.

Bibliography


31(3), 1999.

[2] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data, Prentice Hall, 1988.

[3] Pavel Berkhin, “Survey of clustering data mining techniques”, Technical report, Accrue Software,

San Jose, CA, 2002.

[4] P. K. Agarwal and N. H. Mustafa. k-means projective clustering. Proceedings of the twenty-third

ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS), pp. 155–

165, ACM press, New York, 2004.

[5] R. Herwig, A.J. Poustka, C. Muller, C. Bull, H. Lehrach, and J O’ Brien. Large-scale clustering of

cdna-fingerprinting data. Genome Research, 9, pp. 1093–1105, 1999.

[6] F. Gibou and R. Fedkiw. A fast hybrid k-means level set algorithm for segmentation. Fourth Annual

Hawaii International Conference on Statistics and Mathematics, pp. 281–291, 2005.

[7] W. Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems.

Proceedings of the thirty-fifth Annual ACM Symposium on Theory of Computing (STOC), pp. 50–

58, ACM Press, New York, 2003.

[8] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for

k-means clustering in any dimensions. Proceedings of the forty-fifth Annual IEEE Symposium on

Foundations of Computer Science (FOCS), pp. 454–462, Washington, 2004. .

[9] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. Proceedings

of the thirty-sixth Annual ACM Symposium on Theory of Computing (STOC), pp. 291–300, ACM

Press, New York, 2004.

[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local

search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3), pp. 89–

112, 2004.

55

BIBLIOGRAPHY 56


Discrete Algorithms (SODA), 2007.


ester, 1980.




Conference on Machine Learning, pp. 163–168, 1992.


Large Databases. In Proc. ACM-SIGMOD International Conference on Management of Data, pp.

103–114, 1996.

[16] J. R. Slagle, C. L. Chang, and S. R. Heller. A Clustering and Data Reorganizing Algorithm. IEEE

Trans. Systems, Man and Cybernetics, 5, pp. 125–128, 1975.

[17] D. Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning, 2, pp.

139–172, 1987.

[18] M. A. Weiss. Data Structures and Algorithm Analysis in C++, Pearson, 2006.

[19] http://archive.ics.uci.edu/ml/datasets/.


Chapter 4

EPIC: Towards Efficient Integration

of Partitional Clustering Algorithms

4.1 Introduction





tasks including classification [1]. The different clustering techniques can be categorized into hierarchical

and partitional algorithms. The hierarchical algorithms generate a hierarchy of clusters by determining

successive clusters using the previously established clusters. Hierarchical algorithms can be further di-

vided into two sub-categories: agglomerative and divisive. The agglomerative algorithms start with each

element in a separate cluster and iteratively merge the existing clusters into successively larger clusters

in a bottom-up fashion. The divisive hierarchical algorithms begin with a single cluster containing all

the data points and then proceed to generate smaller clusters following a top-down approach. Partitional

clustering algorithms, on the other hand, assign the data points into a pre-defined number of clusters.

These algorithms can also be broadly classified into two categories, based on how the number of clusters

is specified. The k-means algorithm [5] is an immensely popular clustering algorithm that takes k, the

number of clusters, as an input explicitly. There are many partitional clustering algorithms, such as

Leader [4], BIRCH [2], and DBSCAN [3], which take as input a distance threshold value, τ , instead.

This threshold value, indirectly, determines the number of clusters obtained using these techniques.

We believe that a hybrid technique, which uses both k and τ in the clustering process, would be

more useful since more domain knowledge can be easily incorporated. In our work, we propose a variant

57

Chapter 4. EPIC: Towards Efficient Integration of Partitional Clustering Algorithms 58

of the k-means algorithm, EPIC, to accomplish exactly the same goal. EPIC, an anagram of the initials

of “Efficient Integration of Partitional Clustering”, initially assigns the data points to k1 clusters,

where k1 < k, k being the tentative number of desired clusters. Then, an iterative process is followed

to refine the clusters using the specified threshold distance, τ . We demonstrate that the proposed

algorithm performs fewer distance computations than the k-means algorithm and thus provides better

time performance, without making any assumptions about the distribution of the input data. The

analysis of EPIC also facilitates understanding the relationship between the number of clusters and the

distance threshold. Further, we also provide a bound on the number of levels or iterations to guarantee

that EPIC makes lesser distance computations than the k-means algorithm. We also present a generic

scheme for integrating EPIC into classification algorithms to achieve better time performance.

4.2 Preliminaries

In this section, we present a brief overview of the k-means problem and the k-means clustering algorithms

that the proposed algorithm is based on.

4.2.1 k-means Algorithms

The k-means problem is to determine k points called centers so as to minimize the clustering error,

defined as mean squared distance from each data point to its nearest center. The most commonly used

algorithm for solving this problem is the Lloyd’s k-means algorithm [5, 6] which iteratively assigns the

patterns to clusters and computes the cluster centers. MacQueen’s k-means algorithm [7] is a two-pass

variant of the k-means algorithm:

1. Choose the first k patterns as the initial k centers. Assign each of the remaining N − k patterns

to the cluster whose center is closest. Calculate the new centers of the clusters obtained.

2. Assign each of the N patterns to one of the k clusters obtained in step 1 based on its distance

from the cluster centers and recompute the centers.

Analysis: Distance computation is the only time-consuming operation in this algorithm. So, we focus

on the number of distance computations performed.

In step 1, the number of distance computations needed is given by k(N − k). The number of distance

computations in step 2 equals Nk. This implies that the total number of distance computations required

in the MacQueen’s k-means algorithm equals k(N − k) +Nk = 2Nk− k2 and the complexity is O(Nk).

The Lloyd’s k-means algorithm may not converge to a solution in polynomial time, so a maximum of

m iterations is used to find an approximate solution. Then, the total number of distance computations

equals k(N − k) + (m− 1)Nk = mNk − k2.


4.3 The EPIC Algorithm

Inputs: A dataset to be clustered: X = {xi, yi}Ni=1, where xi ∈ <d; a radius threshold parameter: τ ;

an approximate number of clusters: k; the maximum number of iterations allowed for the conventional

k-means algorithm to converge: m (If m is not provided, take m to be 100, as is the common practice).

1. Let n be the maximum number of levels. Set n to some value ≤ bmk2 c+ 1. Initialize count = 1.

2. Cluster X into k1 =⌊

1kn−2

{2(n−1)m

}n−1⌋

clusters using MacQueen’s 2-pass k-means algorithm.

3. Compute the radius {r1i }k1i=1 of each cluster {c1i }

k1i=1 and determine D, the maximum radius of any

cluster.

4. Set τ1 = min(τ , D − ε), where ε→ 0 is an extremely small positive quantity. Set τ = τ1.

5. Set the level, t = 1.

6. For every cluster {cti}kti=1, if rti > τt

• split cti using k-means into ( rti

τt)d clusters.

7. Let kt+1 be the total number of clusters. If kt+1 < k,

• set τt+1 = τt√

kt

kt+1

• set t = t+ 1

• Set count = count+ 1

• If count < n and τcount−1 ≥ D{

2(count− 1)mk

} 1d

– Compute the radius {rti}kti=1 of each cluster {cti}

kti=1

– go to step 6.

8. Return the clusters with their centers.

EPIC is a multi-level hierarchical clustering algorithm. Every iteration contributes at most one level

to the hierarchy. Starting with k1 clusters, we want to split clusters having radius beyond a specified

threshold, while ensuring that the number of distance computations is less than the k-means algorithm.

Therefore, we also need to bound the number of levels. We note that if the user-specified τ is greater

than D, then no splitting of k1 clusters is possible. Hence, for meaningful analysis, we require τ < D.

For this reason, τ is reset to τ1 in Step 4. In case, no prior knowledge about τ is available, τ can simply

be taken as D − ε.


4.3.1 Bound on Number of Distance Computations, Relation between τ and

k, and Maximum Permissible Levels

Consider the given dataset X = {x1, x2, . . . , xN}, where xi ∈ <d are independent samples drawn from an

identical distribution. The number of distance computations using k-means on X for m ≥ 2 iterations

is given by,

ND1 = mNk − k2 (4.1)

Also, the number of distance computations in first step of EPIC, using MacQueen’s 2-pass algorithm, is

NDL1 = 2Nk1 − (k1)2 (4.2)

Let Cti denote the ith cluster at level t, with center cti. Then, after the first level of clustering, we have k1

clusters: C11 , C

12 , . . . , C

1k1

with centers c11, c12, . . . , c

1k1

respectively. Let us define rti , the radius of cluster

Cti , as

rti = maxxj∈Ctid(xj , cti) (4.3)

where d(x, y) is the distance between x and y. In the EPIC algorithm, the ith cluster at level t is

partitioned at level t +1 if rti ≥ τt. Let kt denote the number of clusters at level t. Clearly, k1 = k1.

For each cluster i, 1 ≤ i ≤ kt, define an indicator variable

Zti = 1{rti>τt} (4.4)

and a corresponding probability

pti = P (Zti = 1) (4.5)

Let |Cti | denote the number of data points assigned to Cti . If a cluster Cti is partitioned at t + 1, the

next level, then the expected number of distance computations at level t+ 1 is given by

NDLt+1 =kt∑i=1

pti

[2|Cti |

(rtiτt

)d−(rtiτt

)2d]

(4.6)

Suppose that the EPIC algorithm proceeds till the nth level. Then, the expected number of total

distance computations using EPIC is

ND2 = NDL1 +n−1∑t=1

NDLt+1 = 2Nk1 − (k1)2 +n−1∑t=1

kt∑i=1

pti

[2|Cti |

(rtiτt

)d−(rtiτt

)2d]

(4.7)


Now, we can derive the bounds for the difference in number of computations, ND1 - ND2 as follows,

ND1 −ND2 = mNk − k2 − 2Nk1 + (k1)2 −n−1∑t=1

kt∑i=1

pti

[2|Cti |

(rtiτt

)d−(rtiτt

)2d]

(4.8)

= mNk − k2 − 2Nk1 + (k1)2 −n−1∑t=1

kt∑i=1

2pti|Cti |(rtiτt

)d+n−1∑t=1

kt∑i=1

pti

(rtiτt

)2d

(4.9)

Taking A = mNk − k2 − 2Nk1 + (k1)2, we get

A−n−1∑t=1

kt∑i=1

2pti|Cti |(rtiτt

)d≤ ND1 −ND2 ≤ A+

n−1∑t=1

kt∑i=1

pti

(rtiτt

)2d

(4.10)

Let Pt = maxi pti. Further, define αt = max(1,maxi (

rtiτt

)) so that the following bound holds for all rti :

0 ≤ rti ≤ αtτt (4.11)

Then, (4.10) implies

A−n−1∑t=1

kt∑i=1

2Pt|Cti |αdt ≤ ND1 −ND2 ≤ A+n−1∑t=1

kt∑i=1

Ptα2dt (4.12)

⇒ A−n−1∑t=1

2Ptαdtkt∑i=1

|Cti | ≤ ND1 −ND2 ≤ A+n−1∑t=1

ktPtα2dt (4.13)

Since∑kt

i=1 |Cti | = N , therefore,

A− 2Nn−1∑t=1

Ptαdt ≤ ND1 −ND2 ≤ A+

n−1∑t=1

ktPtα2dt (4.14)

Further, since the radius of any cluster may never exceed D, the maximum distance between any pair

of points in X, we must have for all t,

αtτt ≤ D (4.15)

⇒ αdt ≤(D

τt

)d≤(

D

τn−1

)d= M(say) (4.16)

since for all t, τt ≥ τn−1. It follows from (4.14) that

A− 2MN(n− 1) ≤ ND1 −ND2 ≤ A+n−1∑t=1

ktM2 (4.17)


Now, the number of clusters at a level t+ 1 is maximum when all the clusters at the previous level t are

partitioned. That being the case, the number of clusters at level t+ 1,

kt+1 ≤ kt(rtiτt

)d≤ ktαdt ≤Mkt (4.18)

Recursively simplifying (4.18) till t equals 1, we get,

kt+1 ≤Mkt ≤M2kt−1 ≤M tk1 (4.19)

which in light of (4.17) yields,

A− 2MN(n− 1) ≤ ND1 −ND2 ≤ A+n−1∑t=1

M t+1k1 (4.20)

⇒ A− 2MN(n− 1) ≤ ND1 −ND2 ≤ A+ k1

{Mn+1 −M2

M − 1

}(4.21)

This gives a bound on difference in number of distance computations. Now, we want ND1 ≥ ND2.

Then,

A− 2MN(n− 1) ≥ 0⇒M ≤ A

2N(n− 1)(4.22)

Plugging in the values of A and M , we get

(D

τn−1

)d≤ mNk − k2 − 2Nk1 + (k1)2

2N(n− 1)=mk − 2k1

2(n− 1)− k2 − (k1)2

2N(n− 1)(4.23)

Since k ≥ k1, (D

τn−1

)d≤ mk

2(n− 1)(4.24)

Then, the relation between τ and k is given by,

⇒ D ≥ τ ≥ τn−1 ≥ D{

2(n− 1)mk

} 1d

≥ D{

2mk

} 1d

(4.25)

Since this bound is an invariant for EPIC, therefore, (4.22) is satisfied resulting in lesser number of

distance computations in case of EPIC than k-means. The parameter n in EPIC can be used to control

this gap in number of computations: a small value of n would correspond to a large gap. Finally, we

bound the maximum number of permissible levels using (4.25) as

D

{2(n− 1)mk

} 1d

≤ D (4.26)


⇒ 2(n− 1) ≤ mk (4.27)

nmax = bmk2c+ 1 (4.28)

where nmax is the maximum number of levels, which ensures that EPIC is computationally more efficient

than the k-means algorithm. We want to bound the value of M , since it is directly involved in the

expression for difference in number of computations. Using (4.16) and (4.24), we get

(D

τ

)d≤M ≤ mk

2(n− 1)(4.29)

Finally, to complete the unification of τ and k, we must ensure that the number of clusters at the

termination of EPIC algorithm is bounded by k, irrespective of the value of k1. Then, using (4.16), we

must have for any value of M given by (4.29),

k1Mn−1 ≤ k (4.30)

⇒ k1

{mk

2(n− 1)

}n−1

≤ k (4.31)

which in the wake of (4.29) yields

k1 =

⌊1

kn−2

{2(n− 1)

m

}n−1⌋

(4.32)

4.4 Application of EPIC to classification

A two-level implementation of EPIC can be employed to reduce the time complexity of various classifi-

cation algorithms. We present below a generic technique for the integration of EPIC into classification

algorithms to improve their performance.

Inputs: A set of training examples and corresponding class labels, X = {xi, yi}Ni=1, where xi ∈ <d, and

yi ∈ Γ, the set of labels; the number of clusters, k.

1. Cluster X into k clusters and determine the radius of each cluster.

2. Set τ to some value in the range indicated by (4.25).

3. Train the classifier using the centroids of those clusters that have their radius greater than τ .

4. Determine the clusters which form a part of the classification model. Sub-cluster these clusters.

5. Train the classifier using the centroids of the clusters (obtained in the previous step), which have

their radius greater than τ .


6. Again determine the clusters in the classification model and train the classifier with the patterns

in these clusters.

Analysis of Complexity

The entire dataset of N patterns is processed in Step 1 only. Step 2 takes linear time since the distance

of each pattern from its cluster center needs to be investigated for determining the maximum radius,

and τ subsequently. In Steps 3 and 5, only O(k) patterns are processed. The number of patterns

processed in Steps 4 and 6 is much less than N , for large N . In addition, significantly many patterns

are eliminated, either because they belong to clusters having a small radius or because they are not a

part of the model. Thus, the first step pre-dominantly determines the time complexity of the algorithm.

This clustering step has an O(n) time complexity for constant k and a value of d � N . Thus, the

training time complexity of a classifier integrated with two-level EPIC becomes linear.

4.4.1 Integration of EPIC into Support Vector Machines (SVMs)

Training an SVM [10, 11] involves solving a Quadratic Programming (QP) problem. The training time

complexity of training an SVM is O(N2), and the space complexity is at least quadratic. Decompo-

sition methods such as Chunking [12] and Sequential Minimal Optimization (SMO) [13] choose a set

of Lagrange variables to optimize in each iteration and solve the optimization problem involving these

variables. The SVM optimization problem has been reformulated using techniques such as the Least

Squares SVM and the Reduced SVM. Core Vector Machines (CVMs) [14] consider a Minimum Enclosing

Ball (MEB) problem and try to obtain an approximate solution. An optimization problem called the

Structural SVM has been used in a Cutting Plane algorithm [15]. In each iteration, it considers a few

constraint violations and finds the solution that satisfies these constraints. This process is continued till

a required approximate solution is obtained. Weight based sampling and selective sampling techniques

that iteratively choose the most useful training examples have also been proposed in the literature. Our

approach is similar to the Clustering based SVM (CB-SVM) [8] approach, which integrates a scalable

hierarchical micro-clustering algorithm, BIRCH [2], into an SVM to reduce the number of patterns that

are processed by the learner.

Inputs: Dataset χ = {xi, yi}Ni=1, xi ∈ <d, yi ∈ {−1, 1}; Number of clusters k; Radius threshold τ

1. Cluster the positive and negative patterns independently and calculate the radius of each cluster

obtained.

2. Train an SVM boundary function from the centers of the clusters whose radius is greater than τ .

3. Sub-cluster the clusters which are near the boundary .


4. Construct another SVM from the centers which are obtained in the previous step.

5. De-cluster the clusters near the boundary of the SVM constructed in step 4 and train an SVM on

the patterns in these clusters. This gives the final SVM boundary function.

In step 3, we need to determine the clusters which are close to the boundary. Closeness to the

boundary [8] is defined as follows: Let Di be the distance of a cluster center to the boundary. Let Ds be

the maximum of the distances from the support vector centers to the boundary. A cluster is said to be

near the boundary if Di −Ri < Ds.


The first step of the algorithm involves clustering the given patterns into k clusters. Using the Mac-

Queen’s k-means algorithm, we can obtain the clustering in O(Nk) time. Assuming that Sequen-

tial Minimal Optimization (SMO) is employed for training the SVM, the training time complexity

with N patterns would be O(N2). SVM training is performed in steps 2, 4 and 5. In step 2,

the number of training patterns equals the number of clusters which have a radius greater than the

threshold. This is equal to at most k clusters from the positive patterns plus at most k clusters

from the negative patterns. So the training complexity is O((2k)2) ∼ O(k2). In step 3, the num-

ber of patterns which are clustered is N1 =∑ki=1 wiYi, where wi is the number of patterns in the

ith cluster, and Yi = 1{ith cluster is close to the boundary}. Hence, the time complexity of this step is

O(N1k2). The number of patterns input to the SVM in step 4 is at most k2 and thus the time com-

plexity of this step is O(k4). The number of patterns input to the SVM constructed in step 5 is

N2 =∑ki=1

∑kj=1 Y

′

ijw′

ij , where w′

ij is the number of patterns in the jth sub-cluster of the ith clus-

ter, and Y′

ij = 1{Yi=1 and the jth sub−cluster is close to the boundary}. The training complexity of the final

SVM is O(N22 ). The total time complexity is hence given by O(Nk + k2 + N1k

2 + k4 + N22 ). When

the dataset is large in size, we will have N1 � N and N2 � N . For such large datasets, the time

complexity becomes O(Nk) ∼ O(N). Hence for large datasets, this SVM training process has linear

time complexity. At any point in time, it is required to store only the input patterns and the k cluster

centers. Hence the space complexity is O(N + k) ∼ O(N).

4.4.2 Integration of Two-level EPIC into k-NNC

K-NNC algorithm becomes computationally intensive when the size of the training set is large. Various

techniques have been proposed in literature to reduce the computational complexity of k-NNC. Cluster-

ing is used to realize an efficient k-NNC variant in [9]. This technique achieves considerable reduction

in computation but its time complexity is non-linear. We prove that, on incorporating two-level EPIC

algorithm into the k-NN classifier, the time complexity becomes linear. The two-level EPIC algorithm


can be incorporated into k-NN classification as follows:

Inputs: Training Dataset χ = {xi, yi}Ni=1, xi ∈ <d, yi ∈ Υ; Test Pattern TP ; Number of clusters k′;

Radius threshold τ ; Number of neighbors k

1. Cluster the dataset χ into k′ clusters and determine the radius of each cluster.

2. Find k cluster centroids nearest to TP from those centroids whose clusters have radius greater

than τ .

3. Sub-cluster the k clusters obtained in the previous step and find the k nearest sub-cluster centroids.

4. From the patterns in the nearest sub-clusters, find the k nearest patterns. Assign TP the class to

which majority of these patterns belong.


The first step of the algorithm involves clustering the given patterns into k′ clusters. Using the Mac-

Queen’s k-means algorithm, we can obtain the clustering in O(Nk′) time. In step 2, O(k′) distances need

to be computed. Sorting these distances and finding the k nearest centroids would have a complexity

of O((k′)2) even if a simple sorting algorithm such as bubble sort is employed. In step 3, the number

of patterns which are clustered equal N1 =∑ki=1 wiYi, where wi is the number of patterns in the ith

cluster and Yi = 1{ith centroid is among the k−NN}. To find the k nearest sub-cluster centroids, O((kk′)2)

effort is required. The time complexity of this step is O(N1kk′+ (kk′)2). The number of patterns input

to step 4 is N2 =∑ki=1

∑kj=1 Y

′

ijw′

ij , where w′

ij is the number of patterns in the jth sub-cluster of the

ith cluster, and Y′

ij = 1{Yi=1 and the jth sub−cluster is among the k−NN}. Finding the k-nearest neighbors

among these patterns requires the sorting of N2 distances which requires O(N22 ) effort. Hence, the total

time complexity is given by O(Nk′ + (k′)2 + N1kk′ + (kk′)2 + N2

2 ). When the dataset is large in size,

we will have N1 � N and N2 � N . Thus, the time complexity becomes O(N) for constant k and k′.

At any point in time, storage is required only for the input patterns, the k′ cluster centers and the k

nearest neighbors. Hence the space complexity is O(N + k + k′) ∼ O(N).


4.5.1 Integration of Two-level EPIC into SVM

Integration of two-level EPIC into SVM was tested on both synthetic and real datasets. We observed

that there was considerable reduction in the training and testing time and the accuracy was comparable

to that of an SVM trained on the entire dataset at once. The testing was performed on an Intel(R)


Table 4.1: Training and testing timings for synthetic dataset 1 (using SVM light)

NumClusters - 5 4 3 4Threshold - 3 5 10 7Training 1.01 1.01 1.00 0.69 0.72

TimeTesting 0.04 0.04 0.03 0.02 0.03Time

Accuracy 100 100 100 99.94 99.62NumSVs 57 54 49 15 10

Table 4.2: Training and testing timings for synthetic dataset 1 (using SVMperf )

NumClusters - 10 10Threshold - 5 5.5Training 0.94 0.94 0.91

TimeTesting 0.01 0.02 0.01Time

Accuracy 99.98 99.23 99.67NumSVs 2 2 2

Xeon(R) 2GHz machine with 4096KB cache and 4038MB memory.

Synthetic Datasets

To verify that there is a substantial decrease in the training and testing time when two-level EPIC

clustering is incorporated in SVM training, we tested the algorithm on two synthetic datasets:

Dataset 1: This dataset contains a total of 1,00,000 two-dimensional patterns. The patterns of each

class are drawn from independent normal distributions with means [1 5]T and [10 5]T and unit variance.

The two classes are linearly separable. The test dataset, consisting of 80,000 patterns, is drawn from

the same distribution. The results are presented in Tables 4.1 and 4.2. The first column in Table

4.1 shows the results corresponding to the SVM light V 6.01 implementation of SVM trained on the

entire dataset without performing clustering (Plain SVM ). The remaining columns in this table are the

results for SVM light with two-level EPIC incorporated. The first column in Table 4.2 shows the results

corresponding to the SVMperf V 2.10 implementation of SVM trained on the entire dataset without

performing clustering. The remaining columns in this table are the results for SVMperf with two-level

EPIC incorporated. We observe that an SVM with two-level EPIC performs better than Chunking or

Sequential Minimal Optimization and is on par with the Cutting Plane Algorithm.

Dataset 2: This dataset contains a total of 10,00,000 two-dimensional patterns. The patterns of

each class are drawn from independent normal distributions with means [1 5]T and [5 5]T and unit


Table 4.3: Training and testing timings for synthetic dataset 2

NumClusters - 10 7 10Threshold - 3 5 6Training 5287.98 2731.46 1037.95 12.74

TimeTesting 0.26 0.34 0.30 0.26Time

Accuracy 97.71 97.04 96.32 94.97NumSVs 57894 31825 37060 1209

variance. This dataset is not linearly separable and hence is more realistic. The test dataset, consisting

of 6,00,000 patterns, is drawn from the same distribution. The time taken for training and testing

(in seconds), the test accuracy achieved and the number of Support Vectors for varying values of the

parameters numClusters and threshold are recorded in Table 4.3. The first column in the table shows

the results corresponding to Plain SVM. The remaining columns are the results for SVM (SVM light)

with two-level EPIC incorporated. It can be inferred from the above results that incorporating two-level

EPIC in SVM training greatly reduces the running time complexity of SVM.

Real dataset

Figure 4.1: OCR 1 vs 6 : (a) accuracy vs. threshold (b) support vectors vs. threshold

We present the results obtained from experiments performed on different class combinations of the

Optical Character Recognition (OCR) dataset. This dataset consists of handwritten characters repre-

senting the numerals 0 to 9. The training and test sets, respectively, consist of 667 and 333 patterns

for each class. Each pattern has 192 features and a class label ranging from 0 to 9. For the purpose

of experimentation with SVM, we consider combinations of classes, 1 vs 6 and 3 vs 8. For each of the

two class combinations, we record the test accuracy and the number of support vectors. The number

of support vectors is an indicator of the complexity of the trained classifier and is also a bound on


Figure 4.2: OCR 3 vs 8 : (a) accuracy vs. threshold (b) support vectors vs. threshold

the probability of error[11]. Hence, lower the number of support vectors, the better the classifier. The

results are presented in Figs. 4.1(a), 4.1(b), 4.2(a) and 4.2(b). We can observe that the number of

support vectors after clustering is considerably lesser than the number of support vectors with Plain

SVM maintaining a comparable test accuracy. For the class combination 1 vs 6 (Figs. 4.1(a) and 4.1(b),

with a radius threshold of 90 units, the reduction in the number of support vectors is approximately

25% with no reduction in accuracy. For the class combination 3 vs 8 (Figs. 4.2(a) and 4.2(b)), with

a radius threshold of 300 units, there is an improvement in the test accuracy with a reduction of 27%

in the number of support vectors. This can be attributed to the fact that some of the patterns which

cause SVM to overfit the training data are eliminated during the cluster elimination process. For both

the class combinations, the accuracy reduces as the threshold increases beyond a certain limit. This is

because, as the threshold is increased beyond this limit, lesser number of examples are left for the SVM

to learn from and important information that is required for classification is lost.

Comparison with Other Techniques

In order to demonstrate that our algorithm is on par with the methods that are currently employed to

reduce the training time of SVM, we performed empirical comparisons with CB-SVM. To compare with

CB-SVM, we use the synthetic dataset described in [8]. A 2-dimensional dataset was generated with

parameter values k = 50, cl = 0.0, ch = 1.0, rl = 0.0, rh = 0.1, Nl = 0, Nh = 10000, and θ = 0.5. The

results are tabulated in Table 4.4. The first column contains results using CB-SVM as reported in [8].

The results show that our algorithm is on par with CB-SVM.


Table 4.4: Comparison with CB-SVM

Training 1,13,601 1,20,738 1,20,738 1,20,738set size

NumClusters - 10 17 20Threshold - 0.05 0.05 0.05Training 10.589 14.09 4.62 2.91

TimeAccuracy 99.00 99.00 96.00 95.00

Table 4.5: Results for k-NNC

Dataset Synthetic Synthetic OCRDataset 1 Dataset 2

Time taken 1.695 17.347 0.224by k-NNCAccuracy 100 97.1 92.49(k-NNC)

NumClusters 6 6 15Threshold 5 5 5

Time taken 1.382 9.852 0.2by k-NNC with

2-level EPICAccuracy(k -NNC 100 96.8 92.40with 2-level EPIC)

4.5.2 Integration of Two-level EPIC into k-NNC

The results corresponding to the integration of two-level EPIC in k-NNC are presented in Table 4.5. We

selected 1,000 test patterns from each of the synthetic datasets. Since k-NNC is a multi-class classifier,

the entire OCR dataset was used. For each dataset, we build a 5-NNC and record the average time

taken(in seconds) for the classification of a single example and the test set accuracy. A considerable

reduction in time is observed.

4.6 Conclusions/Future Work

We proposed an algorithm, EPIC, which is based on both k, and τ . EPIC makes significantly less number

of distance computations compared to the popular Lloyd’s k-means algorithm. We also established a

relation between τ and k. We also presented a generic technique to integrate a two-level EPIC algorithm

into different classifiers, in order to achieve linear time training complexity. Our experimental results

strongly suggest that EPIC can be efficiently integrated into SVM and k-NNC classifiers: the accuracy

obtained using EPIC is better than or competitive with state-of-the-art algorithms, while the time taken


is much less. EPIC does not make assumptions about the underlying distribution of the input data. A

prior knowledge of the input distribution would help in incorporating more knowledge into the skeletal

EPIC algorithm, thereby improving the performance further.

Bibliography


31(3), 1999.


large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management

of Data, pp. 103–114, 1996.

[3] M. Ester, H-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in

large spatial databases with noise. Proceedings of the Second International Conference on Knowledge

Discovery and Data Mining (KDD), pp. 226–231, 1996.


ester, 1980.

[5] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), pp.

129–137, 1982.

[6] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A.Y. Wu. An Efficient

k-means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis

and Machine Intelligence, pp. 881–892, 2002.

[7] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings

of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, pp. 281-297, 1967.

[8] H. Yu, J. Yang and J. Han. Classifying large data sets using SVMs with hierarchical clusters. In the

Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge discovery and Data

Mining, pp. 306–315, 2003.

[9] B. Zhang and S. N. Srihari. Fast k-Nearest Neighbor Classification Using Cluster-Based Trees. IEEE

Transactions on Pattern Analysis and Machine Intelligence (PAMI), pp. 525–528, 2004.

[10] V. N. Vapnik. Statistical Learning Theory, John Wiley & Sons, Inc.,1998.

72

BIBLIOGRAPHY 73

[11] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and

Knowledge Discovery, 2, pp. 121–167, Springer, 1998.

[12] T. Joachims. Making large-Scale SVM Learning Practical. Advances in kernel methods: Support

Vector Learning, MIT Press Cambridge, MA, pp. 169–184, 1999.

[13] J. Platt. Sequential Minimal Optimization: A fast algorithm for training Support Vector Machines.

Advances in Kernel Methods: Support Vector Learning, MIT Press,Cambridge, MA, pp. 185–208,

1999.

[14] I.W. Tsang, J.T. Kwok, and P.M. Cheung. Core Vector Machines: Fast SVM Training on Very

Large Data Sets. The Journal of Machine Learning Research, 6, pp. 363–392, 2005.

[15] T. Joachims. Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD Interna-

tional Conference on Knowledge Discovery and Data Mining, pp. 217–226, 2006.

Chapter 5

Feature Subspace SVMs (FS-SVMs)

for High Dimensional Handwritten

Digit Recognition

5.1 Introduction

In many pattern classification applications, data are represented by high dimensional feature vectors.

There are two reasons to reduce dimensionality of pattern representation. First, low dimensional repre-

sentation reduces computational overhead and improves classification speed. Second, low dimensionality

tends to improve generalization ability of classification algorithms. Moreover, limiting the number of

features cuts down the model capacity and thus may reduce the risk of over-fitting [35]. The clas-

sifier maximizing its performance may not always perform well on test data. The performance of a

classifier on test data depends on factors such as training sample size, the dimensionality of pattern

representation and the complexity of the classifier. SVMs are less likely to overfit data than other

non-regularized classification algorithms since the structural minimization principle of SVMs chooses

discriminative function that has the minimal risk bound [16]. One of the major drawbacks of SVMs is

that the training time grows almost quadratically in the number of examples. This issue becomes even

more critical for multi-class problems where a set of binary SVMs must be built and combined. This is

the case of One-Against-All approach which is widely used in implementation of SVMs.

Feature selection is a major approach to dimensionality reduction [1]. Feature selection refers to

selecting features in the input space and the features obtained form a subset of the original input

feature set. In the literature, just a few algorithms have been proposed for SVM feature selection. In

74

Chapter 5. Feature Subspace SVMs (FS-SVMs) for High Dimensional Handwritten Digit Recognition75

[2], a mathematical programming method which minimizes a concave function on a polyhedral set was

proposed. In [3], feature subset selection was done by optimizing a modified criterion that induces an

extra term to penalize the size of the feature subset. In [4], the authors introduced a binary vector

representing the presence or absence of a feature to the optimization criterion, with the motivation of

approximating the binary vector with a real valued vector so that gradient descent methods can be used

to find the optimal value of the binary vector and the corresponding feature subset. Basically, the three

methods mentioned above evaluate features on an individual basis, although the features actually work

in a collective way in the discriminative function. To deal with this problem, a SVM recursive feature

elimination (SVM RFE) algorithm that evaluates features on a collective basis was proposed [5] for gene

data. In [32], the authors proposed the Reduced Feature Support Vector Machine (RFSVM) algorithm

for completely arbitrary kernels. Recently, a feature selection algorithm has been proposed for scene

categorization using support vector machines [33].

Ensemble methods have been quite popular in literature. Ensemble methods are learning algorithms

that construct a set of classifiers and then classify new data points by taking a weighted vote of their

predictions. Ensemble methods have been shown to be effective in [6, 7, 8, 9, 17, 18, 19, 20]. Recently

some research has gone into devising ensemble methods for SVMs as well. In [10], the authors perform

a bias-variance analysis of Support Vector Machines for development of SVM-based ensemble methods.

A horizontal Divide and Conquer approach has been proposed which uses different experts for different

subsets of patterns [11]. However, the training time becomes a serious bottleneck with an increase in

number of examples. Another limitation of this approach lies in the fact that it can be used to separate

only a single class from other classes; an SVM has to be trained separately for each class.

5.2 Motivation

In our work, we propose a novel technique of incorporating a set of features dynamically to enable

training each SVM exactly once. First, rather than partitioning the input space on the basis of the

corresponding class, we partition the training set based on the subsets of features. That is, partitioning

is done in the feature space and not the input space and hence the name Feature Subspace Support

Vector Machines (FS-SVMs). Each of these subsets is used to train an SVM that is subsequently used

to test a data set. The accuracy of each SVM determines the weight for the corresponding subset. Then

we combine each of these weighted individual subsets and find the most likely class. These weights could

be used to classify test data. The need for such an approach is highlighted in Figure 5.1.

The Iris Plants Database [12] is one of the best known data sets in the literature. The data set

contains 3 classes: Iris Setosa, Iris Virginica, and Iris Versicolor where each class refers to a type of Iris

plant. The feature space consists of four attributes namely sepal length, sepal width, petal length and


Figure 5.1: Iris Dataset: The need for segmentation of feature space

petal width. It is observed that the class Setosa is linearly separable from Virginica and Versicolor using

petal length and width only. However, Virginica and Versicolor are not linearly separable. Therefore,

overfitting can be avoided by segmenting the feature space and keeping the features petal length and

petal width in the same partition. Versicolor and Virginica can be separated by using a non-linear SVM

trained on other features. Thus the FS-SVM approach can be effectively used to separate patterns by

employing different kernels in different regions of the feature space. Our experimental results on Iris

database support this observation whereby classification accuracy in excess of 98.5% was obtained.

The FS-SVM approach is all the more promising in the context of handwritten digit data. Figure

5.2 shows a sample each of 4, 7, 8, and 9. The features are shown to be partitioned into two subsets,

A and B. The features contained in B contain sufficient information to separate 8 from the rest. The

features contained in A can be used to separate the digits 4, 7, and 9 from each other. We note that

the feature subset A, if used alone, may not be able to distinguish between 8 and 9. However, A can be

used to separate all the four digits, if used in conjunction with B. Likewise, the other digits may also

be classified correctly by an appropriate partitioning of the feature space.

Figure 5.2: Segmentation of feature space for handwritten digit data

Figure 5.3 indicates where the FS-SVM approach fits in the existing paradigm of efficient classification

using SVMs. The linear SVMs can be effectively trained in linear time provided the data is linearly

separable [13], otherwise non-linear SVMs have to be used. For large data sets, clustering based SVMs

have been shown to yield good performance [14]. Ensemble methods using different experts in different

regions of the horizontally segmented input space have also been proposed [11]. The FS-SVMs use

different kernels based on partitioning of the feature space.

In many classification applications, data are represented by high dimensional feature vectors. As


Figure 5.3: Different approaches for SVM Classification

already noted, training SVMs on such feature vectors in one shot incurs high computational cost. In

addition, reduced dimensionality helps in keeping space requirements low; this becomes a critical factor

in context of high dimensional large datasets where the number of I/O operations becomes a bottleneck.

Besides, we may want to have different SVM formulations, such as the one in [34], for different feature

subsets and/or assign different weights to features based on their significance. This may help in tackling

overfitting.

For many practical applications such as high dimensional handwritten digit recognition, the dimen-

sionality of data remains very high even after feature selection. Therefore, there is a need to reduce the

feature set further but without sacrificing the classification accuracy much. In our work, we therefore

introduce another stage that we call the feature reduction step (Figure 5.4). This feature reduction

step reduces the dimensionality of the data without compromising much on generalization ability. The

features chosen during the feature selection step are input to the feature reduction step. In other words,

feature selection is done on the pre-processed data, using techniques such as [2, 4], and then the se-

lected features are input to the feature reduction algorithm. Henceforth, the description of the feature

reduction step will presume the availability of a suitable set of features.

Figure 5.4: Steps in the modified classification process

The essential idea behind incorporating the feature reduction step can be understood using Figure

5.5. The examples from two classes, shown in rectangles and circles, are well separated using the

maximum margin separating hyperplane SH 1. However, most of the examples are correctly classified

using separating hyperplane SH 2 alone. Therefore we can discard Feature 2 at a slight expense of

classification accuracy. This effect may become even more pronounced in case of high dimensional data.

We introduce the α-MFC problem to formalize the feature reduction step.


Figure 5.5: The proposed feature reduction step

5.3 The α-Minimum Feature Cover (α-MFC) Problem

Definition 1. For a given training set X and test set Y, a feature set D = {d1, d2, ..., dn} is defined to

be optimal if no subset of D results in a greater accuracy than that obtained using D, with respect to a

classifier trained on X and tested on Y.

Definition 2. Given a training set X and a test set Y on an optimal feature set D = {d1, d2, ..., dn},

define an α-Minimum Feature Cover (α-MFC) of D for an SVM as a subset D’ such that classification

accuracy of the SVM using D’ is no less than α times that obtained using the original set D(0 ≤ α ≤

1). Further there is no other subset D* of D having a) number of features less than D’ and accuracy

greater than α times that obtained using D, or, b) same number of features as D’ but greater accuracy.

The α-Minimum Feature Cover(α-MFC) problem can be rephrased in terms of a decision problem,

{(D, α, k): Feature set D has an α-MFC of size k} where size of a feature set refers to the number of

features contained in it.

Theorem 5. The α-MFC problem is NP- Hard.

Proof. We show a reduction from the Minimum Vertex Cover [21], a well-known NP-Complete problem.

Let w(di, dj) denote the magnitude of correlation between the features di and dj . It is clear that 0

≤ w(di, dj) ≤ 1 ∀di, dj ∈ D. Consider any arbitrary value β such that β ∈ (0,1). Now, draw a graph

G = (V,E), where the vertex set V consists of a collection of nodes, each corresponding to a feature in

original set D. Connect all those pairs of vertices i, j in V by an edge that have w(di, dj) ≥ β.

The reduction algorithm takes as inputs, an instance (G, k) of the Minimum Vertex Cover Problem

and a specified α, to generate an instance (D, α, k) of the Minimum Feature Cover problem. A node is

drawn corresponding to each vertex in G. Further, for any two vertices in G that are connected by an

edge, the magnitude of correlation is set to 1 (which is > β ∀β ∈ (0, 1)) while the non-adjacent vertices

incur a value 0, in the corresponding input instance of the α-MFC algorithm. Clearly this can be done


in polynomial time in number of edges of G by inspecting each pair of vertices for an edge. This has an

effect of retaining exactly those edges in the α-MFC instance as present in G. Now set the value of α to

1. Thus, an instance (D, 1, k) of the α-MFC is obtained corresponding to any graph G of the Minimum

Vertex Cover problem.

We claim that there is a k that satisfies 1-MFC iff there is a vertex cover of size k in G. If we can

find a polynomial time solution to the 1-MFC problem, then we may as well find the minimum vertex

cover in G. However, this cannot be true unless P = NP. Conversely, if there is a vertex cover of size

k, then the set consisting of all the features corresponding to vertices present in the vertex cover is

the desired 1-MFC set D’. This follows, since there cannot be a set D* having size less than k and

accuracy greater than that of D’(which is 1 and same as that of D), otherwise our original set D is not

the optimal feature set defined in Definition 1, and we arrive at a contradiction. Therefore, it follows

that the α-MFC problem is NP- Hard.

An important implication of Theorem 7 is that we cannot find a polynomial solution to isolate a

minimum subset of features that has an accuracy within a specified parameter α of the original feature

set, unless P = NP. Therefore we look for a greedy approach to determine a good reduced feature

subset. We will return to an algorithm for accomplishing exactly the same goal but first we show how

an incremental Feature Subspace approach can be employed for classification using SVMs.

5.4 Feature Subspace SVMs (FS-SVMs)

A lot of ensemble methods have been successfully employed in the pattern classification field, but almost

all of them use different experts in different parts by training each classifier on a subset of patterns. We

investigate the efficiency of an algorithm based on partitioning of the feature space in the context of

SVMs.

Let X={x1, x2, ..., xn} be the training set defined on D = {d1, d2, ..., dk}, the set of k features with

class label y ∈ {1, 2, ..., C} denoting one of the C classes. Without loss of generality, assume that D

is divided into M blocks P1, P2,..., PM with corresponding weights I1, I2, ..., IM . Each of the SVMs,

Si, is trained on a corresponding block Pi, i ∈ {1, 2, ..., M}. Note that the SVMs Si need not be

different and we may as well use a single SVM for all the blocks. The weights I1, I2, ..., IM represent

prior knowledge about the importance of the corresponding blocks and can be empirically determined

using the approach such as in (Grandvalet and Canu, 2003)[31]. In the absence of any knowledge about

the significance of blocks, all weights could be set to 1. Note that these blocks may not be all present

simultaneously and we train the corresponding SVM as and when one becomes available.

The Feature Subspace algorithm for SVMs is given in Algorithm 1. It is to be noted that the


Update: Feedback Phase adjusts the weights of individual SVMs based on their classification decision,

after every pattern is classified, correctly or not. At the time of classifying a new digit sample, these

individual weights are scaled with the corresponding class prediction and the prior importance of the

feature subsets. If an SVM Si incorrectly predicts a class j, it is penalized by reducing its weight, as the

formula for Wij suggests. The weights W , therefore, provide feedback for further classification decisions.

Algorithm 1: FS-SVMs

Divide: Training Phase

1.1 for i =1 to M do

1.2 train SVM Si on Pi

1.3 test Si using a test set X’

1.4 for j =1 to C do

1.5 Wij = Fraction of correct predictions in j by Si

1.6 end for

1.7 end for

Combine: Test Phase

1.8 for each test example x do

1.9 Let Aixj be an indicator variable denoting the class predicted by SVM

Si (∀ i= 1 to M) on x. Final predicted class for example x is given by,

argmaxj∑Mi=1 Ii * Wij *Aixj where j ∈ {1, 2, ..., C}

Update: Feedback Phase

1.10 Update the weights Wij of the SVMs based on the classification decision

1.11 end for

Algorithm 1 conceptually looks similar to many of the existing ensemble methods. However there

is a subtle difference. The novelty of Algorithm 1 lies in its scalability and computational efficiency

accomplished due to partitioning of the feature space and its generalization ability due to horizontal

weighted voting. In other words, Algorithm 1 has more flexibility due to amalgamation of vertical

partitioning and horizontal weighing, unlike conventional ensemble methods.

The algorithm is generic in that instead of using only SVMs we could as well use SVMs in conjunction

with other classifiers such as Decision trees and k-Nearest Neighbor Classifier, etc. as shown in Figure

5.6. It would be informative to compare the performance of an ensemble of diverse classifiers with

that of a combination of SVMs, however, a further discussion on the same is beyond the scope of

this work. Different SVM formulations could be used on different feature subsets. Further, based on


prior knowledge about the application domain, we may assign varying weights to different blocks based

on significance of the constituent features, for instance, in case of handwritten digit data the features

representing the top and bottom portions are less significant than those toward the center and can be

assigned smaller initial weights accordingly. In case all features are equally important, we can simply

do away by replacing all Ii with 1.

Figure 5.6: Ensemble of classifiers using segments of feature space

5.5 A Greedy Algorithm for Approximating α-MFC

As already pointed out, data dimensionality may remain prohibitively high despite feature selection.

Thus, we propose a feature reduction step prior to classification. The aim of this stage is to find a

Minimum Feature Cover corresponding to the specified parameter α. However, an important corollary

of Theorem 7 is that we cannot find a polynomial solution to isolate a minimum subset of features that

has an accuracy within a specified parameter α of the original feature set, unless P = NP. Therefore we

need to look for some heuristic approach to determine a suitable feature subset.

We suggest a greedy algorithm to obtain a good feature subset. This approach proves to be very

useful particularly in applications such as handwritten digit recognition. There are primarily two reasons

for this observation. First, the features that are in close proximity to each other are generally more

correlated than those farther apart. So, nearby features can be grouped together in one partition.

Second, the features near the center are found to have much more impact on the aggregate decision of

the Support Vector Machine than those toward the periphery [25]. Hence in general, we can discard the

partitions containing features that are away from the middle. These observations may not be true in

case of all domains but prove worthy in case of applications like handwritten digit recognition.


Figure 5.7: Features near the periphery contain less discriminative information than those deep inside

This behavior can be understood from Figure 5.7 displaying a sample of handwritten digits. Each

cell in the 7 x 7 grid denotes a feature. Note that the features at the top and bottom of each digit do

not contain much information. Further not much discrimination can be done between a pair of digits,

based primarily on these features. On the other hand, as we move closer to the central portion, the

dissimilarity between different digits becomes more prominent. Hence, in general we can dispense away

top and bottom parts without incurring too much loss in information.

We now provide an algorithm based on FS-SVM that follows a greedy approach to find an approx-

imate Minimum Feature Cover. Algorithm 2 partitions the original feature set D into partitions each

of which is used to train and test corresponding SVM. The least accurate of these partitions, Pr, is

kept apart and the overall accuracy of the combined remaining feature set, P , is determined. If this

accuracy is greater than the desired parameter α, we proceed in same manner to see if more features

can be reduced, otherwise we partition the sidelined subset Pr into two subsets, Pr1 and Pr2, of equal

size. We then merge the more accurate feature partition with P . If the classification accuracy of Pr1

is same as that of Pr2, then either Pr1 or Pr2 may be chosen (Algorithm 2 resolves the ties in favor of

Pr1). Iteratively following this process, we end up getting a reduced feature set that either satisfies or

is very close to satisfying the constraint α.

Algorithm 2: Approximate α-MFC

2.1 Divide the feature set D = {d1, d2, ..., dk} into M blocks P1, P2, ..., PM with y ∈ {1, 2, ..., C}

contained in each partition

2.2 for i = 1 to M do

2.3 train and test using features from Pi

2.4 end for

2.5 Choose the partition Pr with least classification accuracy,

r ∈ {1, 2, ..., M}

2.6 Remove Pr and check the overall accuracy using


P = P1

⋃...⋃

Pr−1

⋃Pr+1

⋃...⋃

PM

2.7 if the overall accuracy < α,

2.8 take partition Pr, split it into same (almost same if Pr contains odd

number of features) size partitions Pr1 and Pr2.

2.9 train and test separately on Pr1 and Pr2

2.10 if accuracy with Pr1 >= accuracy with Pr2,

2.11 P = P⋃

Pr1

2.12 else P = P⋃

Pr2

2.13 end if

2.14 return P as the greedy feature set close to satisfying the constraint α

2.15 else if the overall accuracy > α,

2.16 D = P

2.17 go to 2.1

2.18 else return P as the greedy feature set satisfying the constraint α

2.19 end if

Definition 3. The quality qi of a partition i is defined to be the fraction of accuracy that would be lost

if i is discarded. The quality of the entire feature set equals 1.

Theorem 6. Let q1, q2, ..., qt, qt+1 denote the respective quality of (t+1) partitions discarded before

termination of Algorithm 2. Then, Algorithm 2 finds a α1−qt+1

- approximate solution to the α-MFC

problem.

Proof. Let Q be the accuracy obtained using the entire feature set.Then,

Accuracy left after partition 1 is removed= (1− q1)Q

Accuracy left after partition 2 is removed= (1− q1)(1− q2)Q

Proceeding in the same way,

Accuracy left after t partitions are removed= (1− q1)(1− q2)...(1− qt)Q = k(say)

This must be greater than or equal to αQ since another partition is removed subsequently.

⇒ k ≥ α Q ...(1)

Also,

k(1 - qt+1) < α Q ...(2)

The result follows using (1) and (2).

Thus, Algorithm 2 yields a reduced feature set within α1−qt+1

of the optimal solution to the to the


Figure 5.8: Sample patterns of handwritten digit data

α-MFC problem. An important point is in place. Theorem 2 implies that the quality of the reduced

feature set would depend on quality of the (t+ 1)th partition. So the quality of the reduced feature set

depends on the size of the (t+ 1)th partition. Further, applying Algorithm 2 with a sequence of varying

partition sizes on the original feature set may result in a different final reduced feature set, even if the

number of eliminated features is same in all the cases. Thus the quality of the final reduced set depends

on the size of partitions for eliminating features and the quality of segments of features.


On Iris [12], which consists of 150 samples, we used 90 samples randomly as training data and the

remaining 60 samples as test data. The FS-SVM algorithm (Algorithm 1) resulted in an average classi-

fication accuracy in excess of 98.5%. This compares favorably with the popular variants of SVMs such

as Max-Wins voting, DAGSVM, and one-versus-all [26, 27, 28, 29] (see experimental results in [30]).

We provide a detailed analysis of our results on handwritten digit data in the following sections.

5.6.1 Experimental Set-up

SVMs were trained on different non-overlapping partitions of our handwritten digit training set consist-

ing of 6670 examples each of 192 features (16 x 12 grid). The test data comprised of 3333 examples on

the same features. Each example in the training and test sets belonged to exactly one of the 10 classes

(0-9). Moreover, each of the training and test sets was subdivided into almost equal number of patterns

per class. Figure 5.8 shows a few sample patterns of the handwritten data used in our experiments.

We conducted extensive experimentation on a number of other standard datasets including USPS

[22], MNIST [23] and CEDAR [24]. The USPS dataset consists of 7291 training examples and 2007

test examples. On the other hand, CEDAR consists of 5802 training examples and 707 test examples.

The original MNIST dataset has equal training and test data size (60,000 examples each). However for


Figure 5.9: Similarity vs. Block Size

our experiments, we chose a random set of 6,000 examples for training. Another set of 4,000 examples

was used as the test set. The software package primarily used was the freely available BSVM 2.06

software [15]. The BSVM package provides implementation of three different standard SVM techniques

for classification. For our experiments, we chose the implementation that solves a single optimization

problem for the purpose of classification. Further, default settings of BSVM were used for each of the

individual SVMs.

5.6.2 Analysis of Results obtained using Algorithm 1

With an SVM trained and tested over all the features, out of all the 3333 examples, 3065 examples

were classified correctly giving an overall accuracy of 91.96%. Define similarity of a partition as the

percentage of predictions made by the corresponding SVM that match with those predicted with the

entire feature set taken in one go. Intuitively, the higher the similarity of a partition, the greater the

importance of the features contained in it, for the purpose of classification. To predict a new example,

partitions are combined in proportion of their similarity to get the combined result. We note from our

experiments that we are able to achieve high similarity values and the Feature Subspace approach is

very encouraging.

Figure 5.9 shows the similarity when a combined decision is made using non-overlapping blocks of

features of equal size (a pair of oblique lines has been used to indicate, thenceforth, a change in scale

along the corresponding axis). As similarity is defined as the number of predictions that tally with

the predictions made using all the features simultaneously, the similarity of entire set of 192 features is

1. Note that there could be variation in similarity values based on the tie-resolving rule. When equal

favorable predictions about different classes are made, it is observed that the similarity results are better

when ties are resolved in favor of the latter class rather than the former class. However, using Algorithm

1, this disparity is averted since the probability of two classes getting equal overall weights on a real axis

is negligibly small. This shows that different weighing schemes have varying influence over the similarity

measure. We aim to analyze more of these schemes in our future work.


Figure 5.10: Accuracy vs. Block Size

Figure 5.11: (Sample Dataset) Accuracy(%) results on training sets of different size

Figure 5.10, on the other hand, shows the overall accuracy results. Again, Algorithm 1 outperforms

the equal weighing schemes with different tie-resolving criteria. Moreover, the accuracy curves seem to

follow the similarity curves almost invariably.

We also conducted experiments to analyze the impact of Algorithm 1 on smaller handwritten digit

training sets and different partition sizes. Specifically, we used training sets of size 50, 80, 100, 150, 200,

250 and 300 for our evaluation. The test set data was kept unchanged. Figure 5.11 shows the results

on our sample dataset. It is clearly seen that Algorithm 1 outperforms the standard bound-constrained

SVM in terms of classification accuracy, even in case of smaller training datasets, for different partition

sizes. Similar results were obtained with the other datasets: MNIST (Figure 5.12), CEDAR (Figure

5.13), and USPS (Figure 5.14). These results strongly suggest that the FS-SVM approach works well

even when only a limited training dataset is available.

We also analyzed the computational costs involved in our experiments. Figure 5.15 shows the total

time taken by Algorithm 1 for smaller training sets, relative to using an SVM trained on an entire

dataset. Clearly, partitioning the dataset and recombining the individual verdicts in Algorithm 1 takes


Figure 5.12: (MNIST ) Accuracy(%) results on training sets of different size

Figure 5.13: (CEDAR) Accuracy(%) results on training sets of different size

Figure 5.14: (USPS ) Accuracy(%) results on training sets of different size


Figure 5.15: (Sample Dataset) Total relative time taken by Algorithm 1 on training sets of different size

Figure 5.16: (MNIST ) Total relative time taken by Algorithm 1 on training sets of different size

significantly less time compared to an SVM trained on the original feature space. Similar results were

obtained for MNIST (Figure 5.16), CEDAR (Figure 5.17), and USPS (Figure 5.18) datasets, thereby

endorsing the vast improvements in computational cost using Algorithm 1.

5.6.3 Analysis of Results obtained using Algorithm 2

As mentioned, another important means to reducing the curse of dimensionality is to select the most

pertinent feature subset. Using Algorithm 2, we note that this approach is even more promising than

combining individual partitions together. As observed earlier, in case of handwritten data, the features

at the periphery contain very less discriminative information and thus can be discarded at a cost of

slight decrease in prediction accuracy. This fact is experimentally verified using Algorithm 2 on our

sample dataset, as indicated by Figure 5.19. We started off with M=8 and size of each partition as 24.

After successive runs of the Algorithm 2, it was found that using 132 features accounted for an accuracy


Figure 5.17: (CEDAR) Total relative time taken by Algorithm 1 on training sets of different size

Figure 5.18: (USPS ) Total relative time taken by Algorithm 1 on training sets of different size

Figure 5.19: Accuracy vs. Number of Features


Figure 5.20: Reduction in Accuracy(%) vs. Reduction in Number of Features

Figure 5.21: Algorithm 2 vs. Random Selection

of 87.31%. Even with an overall reduction of 50% in features from the original 192 features, we were

able to get an accuracy of 80.138% as against 91.96% using the entire feature set. With α= 85%, only

120 features ranging from 36-156 resulted in an overall accuracy of 85.06%, a slight reduction from what

was achieved using the whole feature set (Figure 5.20). Thus retaining only these 120 features seems

a good accuracy-size tradeoff. Figure 5.21 shows a comparison in accuracy between a reduced feature

set obtained by Algorithm 2 and a random selection of features. Clearly Algorithm 2 outperforms

a random strategy. We performed several such experiments and similar results were obtained. The

difference becomes more pronounced as the number of features is reduced further.

Figure 5.22 shows the accuracy results obtained using Algorithm 2 on smaller training sets, for

selected feature sets of different size. It is clearly observed that Algorithm 2 results in high classification

accuracy, even after discarding a substantial portion of the original feature set.

Similar results were obtained for MNIST (Figure 5.23), CEDAR (Figure 5.24), and USPS (Figure

5.25) datasets.

We also analyzed the computational costs associated with Algorithm 2. Figure 5.26 shows the relative


Figure 5.22: (Sample Dataset) Accuracy(%) results obtained using Algorithm 2 on training sets ofdifferent size

Figure 5.23: (MNIST ) Accuracy(%) results obtained using Algorithm 2 on training sets of different size

Figure 5.24: (CEDAR) Accuracy(%) results obtained using Algorithm 2 on training sets of different size


Figure 5.25: (USPS ) Accuracy(%) results obtained using Algorithm 2 on training sets of different size

Figure 5.26: Time Performance

time taken by SVMs to train and test feature partitions of different size, on our sample dataset. Note

that the total time taken by the feature reduction step includes not only the time to train the individual

SVMs but also the time to partition the feature subspace. In our experiments with handwritten digit

data, we found that the time to train SVMs is the predominant factor in determining the overall time.

However, the total time taken by Algorithm 2 may exceed that taken by an SVM on the complete

feature set, for some applications such as text classification, where the relative importance of features

may not be very well understood. However, as clearly indicated in Figure 5.26, huge savings in time

are obtained using Algorithm 2, in the context of handwritten digit recognition. The extension of the

method outlined in Algorithm 2, to applications other than handwritten digit recognition, seems an

interesting area for further research.

Figure 5.27 shows the time taken by selected feature subsets on smaller training sets. To give an

indication of the improvement in computational cost, the time is shown relative to an SVM trained on


Figure 5.27: (Sample Dataset) Relative time taken by Algorithm 2 for training sets of different size

Figure 5.28: (MNIST ) Relative time taken by Algorithm 2 for training sets of different size

the entire training set and using the original feature set. Clearly, improvement to an order of magnitude

is observed.

Similar results were obtained for MNIST (Figure 5.28), CEDAR (Figure 5.29), and USPS (Figure

5.30) datasets.

We also computed the overall time taken by Algorithm 2 to obtain smaller feature sets (see Figures

5.31, 5.32, 5.33, and 5.34). Our results clearly indicate that Algorithm 2 provides a promising approach

to reducing the feature sets without incurring significant computational overheads.

5.7 Conclusion

We introduced the concept of α-MFC in form of a feature reduction step and proved it to be NP-Hard.

We then proposed an algorithm (Algorithm 1 ) to show how the partitions of the original feature set

could be trained and tested individually, and then combined together, for high accuracy using FS-SVMs.


Figure 5.29: (CEDAR) Relative time taken by Algorithm 2 for training sets of different size

Figure 5.30: (USPS ) Relative time taken by Algorithm 2 for training sets of different size

Figure 5.31: (Sample Dataset) Total relative time taken by Algorithm 2 for training sets of differentsize


Figure 5.32: (MNIST ) Total relative time taken by Algorithm 2 for training sets of different size

Figure 5.33: (CEDAR) Total relative time taken by Algorithm 2 for training sets of different size

Figure 5.34: (USPS ) Total relative time taken by Algorithm 2 for training sets of different size


We also proposed an approximate α-MFC greedy algorithm (Algorithm 2) based on partitioning of the

feature space that was found to result in high classification accuracy on the experimental handwritten

digit data even after elimination of a large fraction of the original feature set.

5.8 Future Work

We intend to analyze the effect of different weighing schemes on the accuracy. We also look forward,

as a future work, to combining the various classifiers like k-NNC with FS-SVMs on real datasets as

suggested in this chapter. The complexity of Algorithm 2 depends on the quality of the partitions of

the features. It would be interesting to assess the variance in the final results due to different partitions.

This work dealt primarily with the handwritten digit recognition. The extension of the ideas presented

herein for other high dimensional pattern applications would be another future direction.

Bibliography

[1] A. K. Jain, R. P. W Duin, and J. Mao. statistical pattern recognition: A review. IEEE Trans. on

PAMI, 22, pp. 4–37, 2000.

[2] P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical program-

ming. INFORMS J. Comput., 10, pp. 209–217, 1998.

[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support

vector machines. Proceedings of 13th International Conference on Machine Learning (ICML), pp.

82–90, San Francisco, CA, 1998.

[4] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection

for SVMs. In: S. A. Solla, T. K. Leen, and K. R. Muller (eds.), Advances in Neural Information

Processing Systems, 13 , MIT Press, MA, Cambridge, 2001.

[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using

support vector machines. Machine Learning, 46(1-3), pp. 389–422, 2002.

[6] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging,

boosting, and variants. Machine Learning, 36(1-2), pp. 105–139, 1999.

[7] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of

decision trees: Bagging, boosting, and randomization. Machine Learning, 40, pp. 139–158, 2000.

[8] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. on PAMI, 12, pp. 993–1001,

1990.

[9] S. Guha, A. Meyerson, N. Mishra, and R. Motwani. Clustering Data Streams: Theory and Practice.

IEEE Trans. on TKDE, 15(3), pp. 515–528, 2003.

[10] G. Valentini and T. G. Dietterich. Bias-Variance Analysis of Support Vector Machines for the

Development of SVM-Based Ensemble Methods. Journal of Machine Learning Research (JMLR),

5 , pp. 725–775, 2004.

97

BIBLIOGRAPHY 98

[11] H. Nemmour and Y. Chibani. Multi-Class SVMs Based on Fuzzy Integral Mixture for Handwritten

Digit Recognition. Proceedings of the Geometric Modeling and Imaging: New Trends (GMAI), pp.

145–149, 2006.

[12] http://archive.ics.uci.edu/ml/machine-learning-databases/iris/.

[13] T. Joachims. Training Linear SVMs in Linear Time. Proceedings of the ACM Conference on

Knowledge Discovery and Data Mining (KDD), Philadelphia, Pennsylvania, 2006.

[14] H. Yu, J. Yang, and J. Han (2003). Classifying Large Data Sets Using SVMs with Hierarchical

Clusters. Proceedings of the 9th ACM SIGKDD, Washington, 2003.

[15] C. W. Hsu and C. J. Lin. BSVM 2.06, prepared by R. E. Fan, released in 2006.

[16] C. J. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and

Knowledge Discovery, 2(2), pp. 121–167, 1998.

[17] A. C. Tan and D. Gilbert. Ensemble machine learning on gene expression data for cancer classifi-

cation. Applied Bioinformatics, 2(3), pp. 75–83, 2003.

[18] P. M. Long and V. B. Vega. Boosting and microarray data. Machine Learning, 52, pp. 31–44,

2003.

[19] L. Lam and C. Y. Suen. Application of majority voting to pattern recognition: an analysis of its

behavior and performance’. IEEE Trans. Systems, Man, and Cybernetics, Part A, 27, pp. 553–68,

1997.

[20] L. Brieman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees,

Monterey, CA: Wadsworth and Brooks, 1984.

[21] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second

Edition, MIT Press, Cambridge, 2001.

[22] http://www.kernel-machines.org/data.html.

[23] http://yann.lecun.com/exdb/mnist.

[24] http://www.cedar.buffalo.edu/Databases/index.html.

[25] D. Gorgevik and D. Cakmakov. An Efficient Three-Stage Classifier for Handwritten Digit Recog-

nition. Proceedings of the 16th International Conference on Pattern Recognition (ICPR), 4, pp.

507–510, Cambridge, UK, 2004.

BIBLIOGRAPHY 99

[26] C. W. Hsu and C. J. Lin. A comparison of methods for multi-class support vector machines. IEEE

Trans. Neural Networks, 13(2), pp. 415–425, 2002.

[27] K. Duan and S. S. Keerthi. Which is the best multi-class SVM method? An empirical study.

Multiple Classifier Systems, pp. 278–285, 2005.

[28] D. Anguita, S. Ridella, and D. Sterpi. A new method for multi-class support vector machines.

Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), pp. 412–

417, 2004.

[29] J. C. Platt, N. Cristianini, and J. S. Taylor. Large margin DAGs for multi-class classification. Ad-

vances in Neural Information Processing Systems (NIPS), 12, pp. 547–553, MIT Press, Cambridge,

2000.

[30] Y. Liu, Z. You, and L. Cao. A novel and quick SVM-based multi-class classifier. Pattern Recogni-

tion, 39(11), pp. 2258–2264, 2006.

[31] Y. Grandvalet and S. Canu. Adaptive Scaling for Feature Selection in SVMs. Advances in Neural

Information Processing Systems (NIPS), 15, pp. 553–560, 2003.

[32] O. L. Mangasarian and E. W. Wild. Feature Selection for Nonlinear Kernel Support Vector Ma-

chines. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM) Work-

shops, Omaha NE, 2007.

[33] V. Devendran, H. Thiagarajan, A. K. Santra, and A. Wahi. Feature Selection for Scene Catego-

rization using Support Vector Machines. Proceedings of the 2008 Congress on Image and Signal

Processing, 1, pp. 588–592, Washinton, 2008.

[34] J. A. K.Suykens, T. V. Gestel, J. Vandewalle, and B. D. Moor. A Support Vector Machine For-

mulation to PCA Analysis and its Kernel Version. IEEE Transactions on Neural Networks, 14(2),

pp. 447–450, 2003.

[35] L. Hermes and J. M. Buhmann. Feature Selection for Support Vector Machines. International

Conference on Pattern Recognition (ICPR), 2, pp. 712–715, 2000.

Chapter 6

SHARPC: SHApley Value based

Robust Pattern Clustering

6.1 Introduction


problem in pattern recognition, data mining, information retrieval, and related disciplines. Clustering

finds numerous direct practical applications in pattern analysis, decision making, and machine learning

tasks such as image segmentation. Besides, clustering has also been used in solving extremely large scale

problems, e.g. in bioinformatics ([6], [7]), and graph theory([8]). Clustering also acts as a precursor to

many data processing tasks including classification (Jain, Murty and Flynn [1]). According to Backer

and Jain [2], “in cluster analysis a group of objects is split into a number of more or less homogeneous

subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively

based on its ability to create “interesting” clusters), such that the similarity between objects within

a subgroup is larger than the similarity between objects belonging to different subgroups“.

Similar views are echoed in other works on clustering, e.g. Jain and Dubes [3], Hansen and Jaumard

[4], and Xu and Wunsch [5].

The machine learning and pattern recognition literature abounds in algorithms on clustering. The

techniques such as ISODATA [10], Genetic k-means Algorithm (GKA) [11], and Partitioning Around

Medoids (PAM) [12] are based on vector quantization. The density estimation based models such

as Gaussian Mixture Density Decomposition (GMDD) [13], information theory based models such as

entropy maximization [14], graph theory based models such as Delaunay Triangulation graph (DTG)

[15], combinatorial search based models such as Genetically Guided Algorithm (GGA) [16], fuzzy models

such as Fuzzy c-means (FCM) [17], neural networks based models such as Self-Organizing Map (SOM)

100

Chapter 6. SHARPC: SHApley Value based Robust Pattern Clustering 101

[18], kernel based models such as Support Vector Clustering (SVC) [19], and data visualization based

models such as Principal Component Analysis (PCA) [20] have received considerable attention from

the research community. In this paper, we propose a clustering algorithm, SHARPC, based on the

celebrated game theoretic concept of Shapley value. There are a number of solution concepts such as

Shapley value, core, bargaining sets, nucleolus etc. to analyze cooperative games [26]. Shapley value

is a fair solution concept in that it divides the collective or total value of the game among the players

according to their marginal contributions in achieving that collective value. We strive to make best use

of this notion of fairness for efficient clustering. To the best of our knowledge, SHARPC is the first

approach based on cooperative game theory to the clustering problem.

A key problem in the clustering domain concerns determining a suitable number k of output clusters

when k is not input as a parameter to the clustering algorithm. Dubes has described this as ”the

fundamental problem of cluster validity“ [9]. It is often impractical to presume the availability of a

domain expert to select the number of clusters. A number of techniques and heuristics such as elbow

criterion [21], regularization framework [22] based on the Bayesian Information Criterion (BIC), L-

method [23], Minimum Description Length (MDL) framework [24], G-means [25] have been proposed to

tackle this problem. Our algorithm SHARPC obviates the need for specifying the number of clusters as

an input.

In his work on unification of clustering [40], Kleinberg considered three desirable properties: scale

invariance, richness, and consistency and proved an impossibility theorem, showing that no clustering

algorithm satisfies all of these properties simultaneously. In this paper, we introduce order independence

as another desirable property, and provide necessary and sufficient conditions for order independence.

SHARPC satisfies the scale invariance, the richness, and the order independence conditions. Addition-

ally, the SHARPC approach can be generalized to obtain hierarchical clusters.

The Leader algorithm [28] is a prototype incremental algorithm that dynamically assigns each in-

coming point to the nearest cluster. However, the Leader algorithm is highly susceptible to ordering

effects and may give extremely poor quality of clustering on skewed data orders. For a known value of

k, the k-means and its variants ([29, 30]), based on vector quantization are the most popular clustering

algorithms. However, there are primarily two drawbacks of k-means: (a) it may give poor results for an

inappropriate choice of k, and (b) it may not converge to a globally optimal solution due to inappro-

priate initial selection of cluster centers [33]. Our experimental results strongly suggest that SHARPC

outperforms both the Leader and the k-means algorithms in terms of the quality of clustering.

6.1.1 Motivation

Clustering is the assignment of data points to subsets such that points within a subset are more similar

to each other than points from other subsets. We believe that most of the existing popular algorithms


do not truly reflect the intrinsic notion of a cluster, since they try to minimize the distance of every

point from its closest cluster representative alone, while overlooking the importance of other points in

the same cluster. Although this approach succeeds in optimizing the average distance between a point

and its closest cluster center, it conspicuously fails to capture what has been described by Michalski

and others as the “context-sensitive” information: clustering should be done not just on the basis of

distance between a pair of points, A and B, but also on the relationship of A and B to other data points

([46, 47]). Therefore, there is a need for an algorithm that gives an optimal solution in keeping both

the point-to-center and point-to-point distances, within a cluster, to a minimum. We emphasize that

incorporating the gestalt or collective behavior of points within the same cluster is fundamental to the

very notion of clustering, and this provides the motivation for our work.

In addition, it is more intuitive to characterize similarity between different points as compared to

the distance between them since the distance measure may not necessarily be scale invariant. Moreover,

from an application point of view, it is more convenient to specify a similarity threshold parameter in an

identical range, compared to a distance threshold that may vary across domains. Further, as described

later, there are certain axioms that are directly relevant in the context of clustering. The Shapley

value is an important solution concept, from cooperative game theory, that satisfies these axioms and

thereby characterizes the notion of fairness in clustering. We strive to incorporate this idea of fairness

for efficient clustering.

6.1.2 Contributions

In this paper, we make the following contributions:

• We formulate the problem of clustering as a cooperative game among the data points and show

that the underlying characteristic form game is convex.

• We propose a novel approach, SHARPC, for clustering the data points based on their Shapley

values and the convexity of the proposed game theoretic model. SHARPC determines an optimal

number of clusters and satisfies desirable clustering properties such as scale invariance and richness.

• We provide both the necessary and sufficient conditions for order independence and prove that

SHARPC is an order independent algorithm.

• We also extend the idea of clustering using Shapley value approach to obtain hierarchical clusters

with minimum bounded similarity guarantee.

• We demonstrate the efficacy of our approach through detailed experimentation. SHARPC is

compared with the popular k-means and Leader algorithms and the results are shown for several

benchmark datasets.


The outline of the paper is as follows. In Sect. 6.2, a succinct background encompassing important

concepts from cooperative game theory is documented. Our Shapley value based clustering paradigm

is presented in Sect. 6.3, along with Algorithm 1, a clustering algorithm based on exact computation

of Shapley value, and SHARPC, based on approximation of the Shapley value. The ordering effects

are characterized in Sect. 6.4. We present the generalization of our approach to hierarchical clustering

in Sect. 6.5. We provide a brief description on the applicability of the Leader, the k-means, and the

SHARPC algorithms with respect to certain desirable clustering properties in Sect. 6.6.1. An analysis

of the experimental results is carried out in Sect. 6.6.2. Finally, we present a summary of our work in

Sect. 6.7 and indicate the future work in Sect. 6.7.1.

6.2 Preliminaries

A cooperative game with transferable utility (TU) [26] is defined as the pair (N, v) whereN = {1, 2, ..., n}

is the set of players and v : 2N → R is a mapping with v(φ) = 0. The mapping v is called the

characteristic function or the value function. Given any subset S of N , v(S) is often called the value

or the worth of the coalition S and represents the total transferable utility that can be achieved by the

players in S, without help from the players in N \ S. The set of players N is called the grand coalition

and v(N) is called the value of the grand coalition. In the sequel, we use the phrases cooperative game,

coalitional game, and TU game interchangeably.

A cooperative game can be analyzed using a solution concept, which provides a method of dividing the

total value of the game among individual players. We describe below two important solution concepts,

namely the core and the Shapley value.

6.2.1 The Core

A payoff allocation x = (x1, x2, ..., xn) denotes a vector in Rn with xi representing the utility of player

i where i ∈ N . The allocation x is said to be individually rational if xi ≥ v({i}), ∀i ∈ N . The payoff

allocation x is said to be coalitionally rational if∑i∈C xi ≥ v(C), ∀C ⊆ N . Note that coalitional

rationality implies individual rationality. Finally, the payoff allocation x is said to be collectively rational

if∑i∈N xi = v(N). The core of a TU game (N, v) is the collection of all payoff allocations that are

coalitionally rational and collectively rational. It can be shown that every payoff allocation lying in the

core of a game (N, v) is stable in the sense that no player will benefit by unilaterally deviating from a

given payoff allocation in the core. The elements of the core are therefore potential payoff allocations

that could result when rational players interact and negotiate among themselves. A limitation of the

concept of the core is that given a coalitional game, the core may be empty or very large.


6.2.2 The Shapley Value

The Shapley value is a solution concept that provides a unique expected payoff allocation for a given

coalitional game (N, v). It describes an effective approach to the fair allocation of gains obtained by

cooperation among the players of a cooperative game. Since some players may contribute more to the

total value than others, an important requirement is to distribute the gains fairly among the players.

The concept of Shapley value, which was developed axiomatically by Lloyd Shapley, takes into account

the relative importance of each player to the game in deciding the payoff to be allocated to the players.

We denote by

φ(N, v) = (φ1(N, v), φ2(N, v), . . . , φn(N, v))

the Shapley value of the TU game N, v). Mathematically, the Shapley value, φi(N, v), of a player i,

∀v ∈ R2n−1, is given by,

φi(N, v) =∑

C⊆N−i

|C|!(n− |C| − 1)!n!

{v(C ∪ {i})− v(C)}

where φi(N, v) is the expected payoff to player i and N − i denotes N \{i}. There are several equivalent

alternative formulations for the Shapley value.

The Shapley value is the unique mapping that satisfies three key properties: linearity, symmetry,

and carrier property [26]. The three properties imply that the Shapley value provides a fair way of

distributing the gains of cooperation among all the players in the game. A natural way of interpreting

the Shapley value φi(N, v) of player i is in terms of the average marginal contribution that player i

makes to any coalition of N assuming that all the orderings are equally likely. Thus the Shapley value

takes into account all possible coalitional dynamics and negotiation scenarios among the players and

comes up with a single unique way of distributing the value v(N) of the grand coalition among all the

players. The Shapley value of a player accurately reflects the bargaining power of the player and the

marginal value the player brings to the game.

Now we describe an important class of cooperative games called the convex games.

6.2.3 Convex Games

A cooperative game (N, v) is a convex game [27] if

v(C) + v(D) ≤ v(C ∪D) + v(C ∩D), ∀C,D ⊆ N

Equivalently, a TU game (N, v) is said to be convex if for every player i, the marginal contribution of i

to larger coalitions is larger. In other words,

v(C ∪ {i})− v(C) ≤ v(D ∪ {i})− v(D),


∀C ⊆ D ⊆ N − {i}, i ∈ N

where the marginal contribution m(S, j) of player j in a coalition S is given by,

m(S, j) = v(S ∪ {j})− v(S), S ⊆ N, j ∈ N, j /∈ S.

A very important property is that if a TU game (N, v) is convex, then the core of the game is non-empty

and moreover, the Shapley value belongs to the core.

6.2.4 Shapley Value of Convex Games

It can be shown the core of a convex game (N, v) is a convex polyhedron with a dimension of at most

|N | − 1. Consider a permutation π of players in the game. Then, for any of a possible |N |! such

permutations, the initial segments of the ordering are given by

Tπ,r = {i ∈ N : π(i) ≤ r}, r ∈ {1, ..., |N |}

where, Tπ,0 = 0 and Tπ,|N | = N . Note that π(i) refers to the position of the player i in the permutation

π. Now, to determine the core for a particular ordering π, we solve the equations

xπi (Tπ,r) = v(Tπ,r), r ∈ {1, ..., |N |}.

The solution to these equations defines a payoff vector xπ with elements given by

xπi = v(Tπ,π(i))− v(Tπ,π(i)−1), ∀i = 1, 2, ..., |N |.

In fact, the payoff vectors xπ precisely represent the extreme points of the core in convex games.

Moreover, it is known [27] that the Shapley value for a convex game is the center of gravity of xπ. Thus,

if Π is the set of all permutations of N , then the Shapley value of player i can be computed as

φi =1|N |!

∑π∈Π

xπi

This provides an efficient way of computing the Shapley value of a convex game and we use this fact

later in this paper.

6.3 Shapley Value based Clustering

A central idea of this paper is to map cluster formation to coalition formation in an appropriately defined

TU cooperative game.


6.3.1 The Model

Consider a dataset X = {x1, x2, ..., xn} of n input instances. We set up a cooperative game (N, v) among

the input data points in the following way. Given the dataset X, define a function, d : X×X → R+∪{0},

where d(xi, xj) ∀xi, xj ∈ X indicates the distance between xi and xj , with d(xi, xi) = 0; d can be any

distance metric such as the Euclidean distance, for instance, depending on the application domain. Let

f′

: R+ ∪ {0} → [0, 1) be a monotonically increasing dissimilarity function such that f′(0) = 0 and

f′(d(x, xi)) + f

′(d(x, xj)) ≥ f

′(d(x, xi) + d(x, xj)) (6.1)

Define a corresponding similarity mapping, f : R+ ∪ {0} → (0, 1], such that f(a) = 1 − f ′(a). In

this setting, each of the n points corresponds to a player in this game, thereby |N | = n. The problem

of clustering can be viewed as grouping together those points which are less dissimilar as given by f′

or

equivalently, more similar as indicated by f : each of the n points interacts with other points and tries

to form a coalition or cluster with them, in order to maximize its value. Now, we assign v({xi}) = 0,

for all xi such that xi is not a member of any coalition. This is based on the intuition that any isolated

point should have least value since it is not involved in any cluster. Two situations are accounted for

using this idea. First, all those points that have not been processed as yet are assigned an initial value

of 0. Second, after processing, if some point behaves as an outlier, then it can be discarded based on its

value as explained later. This motivates us to define, for a coalition T ,

v(T ) =12

∑xi,xj∈Txi 6=xj

f(d(xi, xj))

In other words, v(T ), the total value of a coalition T , is computed by taking the sum of similarities

over all (|T |2 ) distinct pairs of points. We emphasize the relevance of defining the value function v(.) for

a coalition in this way. Our approach computes the total worth of a coalition as the sum of pairwise

similarities between the points. Note that this formulation elegantly captures the notion of clustering

in its purest form: points within a cluster are similar to each other. Henceforth, we shall use the terms

data points, patterns and players interchangeably. Moreover, the phrase cluster center shall convey the

same meaning as cluster representative.

The usage of Shapley value for clustering is justified by interpreting certain axioms in the following

way:

• Symmetry (Permutation invariance): Given a game (N, v) and a permutation π on N , we

have ∑i∈N

φi(N, v) = φπ(i)(N, πv)


As a consequence of this property, the Shapley value remains same even if the points are arbitrarily

renamed or reordered. This is extremely significant for achieving order independence, a desirable

clustering property, as explained later.

• Preservation of Carrier: Given any game (N, v) such that v(S⋃{i}) = v(S) ∀S ⊆ N , we have

φi(N, v) = 0. This property implies that if a point does not contribute to the overall worth of a

cluster, then it does not receive any marginal contribution. Therefore, the outliers or the points

that are far off from the other data points do not derive any benefit by forming clusters with them.

• Additivity or Aggregation: For any two games, (N, v) and (N,w), we have

φi(N, v + w) = φi(N, v) + φi(N,w), where

(v + w)(S) = v(S) + w(S)

Additivity implies the linearity property: if the payoff function v is scaled by a real number α,

then the Shapley value is also scaled by the same factor. That is, φi(N,αv) = αφi(N, v). Linearity

is essential for achieving scale invariance with respect to the value function. Another important

consequence of additivity is that the overall marginal contribution of a point is just the sum of its

contributions in each of the games considered separately. In the context of clustering, for every

point i, the addition of a set of new points X ′ in the initial dataset X results in increasing the

marginal contribution of i with respect to X\{i} by an additional contribution incurred due to

X ′\{i}.

• Pareto Optimality: For any game (N, v), we have∑i∈N

φi(N, v) = v(N). As an implication of

this property, the overall worth of the dataset is distributed entirely among the different data

points.

In fact, Shapley value is the only solution concept that satisfies all the aforesaid axioms simultaneously,

and hence provides an appropriate tool for tackling clustering.

6.3.2 An Algorithm for Clustering based on Shapley values

Algorithm 1 outlines our approach to clustering. Algorithm 1 takes as input a threshold parameter of

similarity, δ, in addition to the dataset to be clustered.

Algorithm 1.

Input: The dataset X = {x1, x2, ..., xn} to be clustered and a threshold parameter of similarity δ ∈ (0, 1].

Output: A set of cluster centers and the clusters.


1.1 for i = 1 to n, do

1.2 v({xi}) = 0;

1.3 d(xi, xi) = 0;

1.4 end for

1.5 for every ordering π of data points, do

1.6 for r = 1 to n, do

1.7 Tπ,r = ∅;

1.8 for i = 1 to n, do

1.9 if π(i) ≤ r

1.10 Tπ,r = Tπ,r ∪ {i};

1.11 end if

1.12 end for

1.13 end for

1.14 end for

1.15 for i = 1 to n, do

1.16 for j = i to n, do

1.17 compute f(d(xi, xj));

1.18 end for

1.19 end for

1.20 for i = 1 to n, do

1.21 φxi = 0;

1.22 for every ordering π of data points, do

1.23 xπi = v(Tπ,π(i))− v(Tπ,π(i)−1);

1.24 φxi = φxi +1

n!xπi ;

1.25 end for

1.26 end for

1.27 K = ∅;

1.28 sort the points in X, in non-increasing order, based on the Shapley values φx1 , φx2 , . . . , φxn ;

1.29 Q = X;

1.30 while Q 6= ∅, do

1.31 choose the point x ∈ Q with maximum Shapley value in Q, as a new cluster center;

1.32 K = K ∪ {x};

1.33 P = {xi ∈ Q : f(d(x, xi)) ≥ δ}

1.34 assign the points in P to the cluster with center x;

1.35 Q = Q \ P ;


1.36 end while

1.37 return K as the set of cluster centers;

First, the Shapley value of each player is computed. Then, the cluster centers or representatives

are chosen in the following way. We sort the points in the non-increasing order of their Shapley values.

Then, the algorithm chooses the point x with the current highest Shapley value, and assigns all those

points that are at least δ-similar to x, to the same cluster as x. The points, which have already been

clustered, do not play any further part in the clustering process. The point with the highest Shapley

value among all currently unclustered points is chosen as a new cluster center, and the entire process

is repeated iteratively. The algorithm returns a set of cluster representatives on termination. It can be

observed (see Theorem 8 in Sect. 6.3.3) that data points close to a cluster center would also be having

almost the same Shapley value since they have similar distances to the remaining points. Hence, they

should not be treated as new cluster centers themselves. Therefore, by tuning the parameter δ, we can

obtain a good clustering by assigning the points nearer to the center to the same cluster. Note that

Algorithm 1 can be easily modified to discard outliers by adding a step wherein all those clusters that

are assigned fewer points than a minimum predefined number are discarded.

We now prove that the general setting described in Sect. 6.3.1 corresponds to a convex game among

the data points.

6.3.3 Convexity of the Underlying Game

Theorem 7. Define the total value of an individual point xi, v({xi}) = 0 ∀i ∈ {1, 2, ..., n}, and that

of a coalition T of n data points, v(T ) =12

∑xi,xj∈Txi 6=xj

f(d(xi, xj)), where f is a similarity function. In this

setting, the cooperative game (N, v) is a convex game.

Proof. Consider any two coalitions C and D, C ⊆ D ⊆ X \ {xp}, where xp ∈ X. Then, by definition,

v(D)− v(C) =1

2

Xxi,xj∈Dxi 6=xj

f(d(xi, xj))−1

2

Xxi,xj∈Cxi 6=xj

f(d(xi, xj))

=1

2

Xxi,xj∈D\Cxi 6=xj

f(d(xi, xj)) +X

xi∈D\Cxj∈C

f(d(xi, xj)) (6.2)

Again,

v(C ∪ {xp}) =1

2

Xxi,xj∈Cxi 6=xj

f(d(xi, xj)) +Xxi∈C

f(d(xi, xp))


Also,

v(D ∪ {xp}) =1

2

Xxi,xj∈Dxi 6=xj

f(d(xi, xj)) +Xxi∈D

f(d(xi, xp))

Then,

v(D ∪ {xp})− v(C ∪ {xp}) =1

2

Xxi,xj∈Dxi 6=xj

f(d(xi, xj)) +Xxi∈D

f(d(xi, xp)−1

2

Xxi,xj∈Cxi 6=xj

f(d(xi, xj))−Xxi∈C

f(d(xi, xp))

=1

2

Xxi,xj∈D\Cxi 6=xj

f(d(xi, xj)) +X

xi∈D\Cxj∈C

f(d(xi, xj)) +X

xi∈D\C

f(d(xi, xp))

= v(D)− v(C) +X

xi∈D\C

f(d(xi, xp)) (using (6.2))

≥ v(D)− v(C) ( since f : R+ ∪ {0} → (0, 1])

An important consequence of Theorem 7 is that the Shapley value belongs to the core. Therefore, we

can compute the Shapley value of each player as explained in Sect. 6.2.4. Further, as the next theorem

states, the points which are close to each other have almost same Shapley values.

Theorem 8. Any two points xi, xt, such that d(xi, xt) ≤ ε, where ε→ 0, in the convex game setting of

Sect. 6.3.1 have almost equal Shapley values.

Proof. As explained in Sect. 6.2.4, the Shapley value of a point xi is given by,

φi =1

n!

Xπ∈Π

xπi

=1

n!

Xπ∈Π

ˆv(Tπ,π(i))− v(Tπ,π(i)−1)

˜=

1

n!

Xπ∈Π

ˆ Xπ(p)≤π(i)π(q)<π(p)

f(d(xp, xq))−X

π(p)≤π(i)−1π(q)<π(p)

f(d(xp, xq))˜

=1

n!

Xπ∈Π

Xπ(p)<π(i)

f(d(xi, xp))

=1

n!

Xπ∈Π

Xπ(p)<π(i)

[1− f′(d(xi, xp))]


=1

n!

Xπ∈Π

[π(i)− 1]− 1

n!

Xπ∈Π

Xπ(p)<π(i)

f′(d(xi, xp))

The first term on the right is a sum that is invariant for each point over all permutations. The second term can

be written as,

D(i) =1

n!

Xπ∈Π

Xπ(p)<π(i)

f′(d(xi, xp))

=1

n!

Xπ∈Π

Xπ(p)<π(i)d(xi,xp)≤ε

f′(d(xi, xp)) +

1

n!

Xπ∈Π

Xπ(p)<π(i)d(xi,xp)>ε

f′(d(xi, xp))

It follows immediately using (6.1), for xt, t ∈ {1, 2, . . . , n}, t 6= i such that d(xi, xt) ≤ ε → 0, we have

f′(d(xi, xt))→ 0, and f

′(d(xi, xp))→ f

′(d(xt, xp)), thereby implying D(t)→ D(i).

Note that Theorem 8 does not say anything about points that are far apart from each other. In

particular, it does not forbid points, away from each other, from having similar Shapley values; it only

implies that points close to each other tend to have almost same Shapley values.

6.3.4 SHARPC

The exact computation of Shapley values for n players, as in Algorithm 1, is computationally a hard

problem since it involves taking the average over all the n! permutation orderings. However, as mentioned

earlier, the Shapley value for a convex game is the center of gravity of the extreme points of the non-

empty core. Therefore, making use of Theorem 7, we can approximate the Shapley value by averaging

marginal contributions over only p random permutations, where p << n!. Then, the error resulting

from this approximation can be bounded according to the concentration result proved in the following

lemma.

Lemma 1. Let Φ(p) = (φ1(p), φ2(p), . . . , φn(p)) denote the empirical Shapley values, of n data points,

computed using p permutations. Then, for some constants ε, c, and c1, such that ε ≥ 0 and c, c1 > 0,

P (|Φ(p)−E(Φ(p))| ≥ ε) ≤ c1e−cpε2

Proof. Define S =

pXi=1

Yi, where Y1, Y2, . . . , Yp denote p independent random permutations of length n, corre-

sponding to p n-dimensional points, randomly chosen from the boundary of a convex polyhedron. Clearly, S is

a random variable. Now, applying Hoeffding’s inequality, we can find constants c1, c2, and t, 0 ≤ t ≤ pE(S),

and c1, c2 > 0, such that

P (|S − E(S)| ≥ t) ≤ c1e−c2t

2

pE(S)

⇒ P (|S − E(S)| ≥ pε) ≤ c1e−c2pε

2

E(S) ( substituting t = pε)


⇒ P (1

p|S − E(S)| ≥ ε) ≤ c1e

−c2pε

2

E(S)

⇒ P (|Φ(p)− E(Φ(p))| ≥ ε) ≤ c1e−c2pε

2

E(S)

⇒ P (|Φ(p)− E(Φ(p))| ≥ ε) ≤ c1e−cpε2

[ since Φ(p) =S

p]

Lemma 1 provides a bound on deviation from the exact Shapley value. We want the probability of

error in estimation of Shapley value to be as small as possible. To ensure this, we develop the idea of order

independence in learning algorithms. In fact, we characterize a stronger notion of order independence

in incremental learning and subsequently prove that the Shapley value of points in our convex game

setting can be approximated to a high degree of accuracy in O(n2) number of computations. However,

we first propose an efficient algorithm, SHARPC (acronym for SHApley value based Robust Pattern

Clustering), based on the convexity of our game theoretic model (Theorem 7).

Algorithm 2. (SHARPC)

Input: The dataset X = {x1, x2, ..., xn} to be clustered and a threshold parameter of similarity δ ∈ (0, 1].

Output: A set of cluster centers and the clusters.

2.1 Find the pair-wise similarity between all points in the input dataset.

2.2 For each player xi in the input dataset, X, compute the value, φi =Xxj∈Xj 6=i

f(d(xi, xj)).

2.3 Arrange the points in non-increasing order of their φ-value, and assign them to clusters as in Algorithm 1.

2.4 Find the number of clusters, k, and their centers, resulting from Step 2.3.

2.5 Run the k-means algorithm, with the initial k centers set to the cluster centers that are obtained in Step 2.4.

Note that SHARPC is essentially the same as Algorithm 1. SHARPC performs an approximate

computation of Shapley values using only O(n2) similarity computations. In addition, Step 2.5 is

incorporated to employ the cluster centers obtained in Step 2.4 for more efficient clustering, in the sense

of minimizing the point-to-center distances.

Analysis of Time Complexity

The computation of pair-wise similarity, in Step 2.1 of SHARPC, can be done in (n2 ) = O(n2) time

steps. The computation of Sφi, for player i, takes O(n) steps since a sum over (n− 1) similarity values

(corresponding to players in X\{i}) needs to be performed. Therefore, for n players, the complexity of

Step 2.2 is O(n2). For the sake of analysis, let k clusters be obtained as a result of Step 2.3. Then, in

an expected sense, O(n/k) points are assigned to each cluster. Therefore, on an average, O(k) passes

need to be made for computing the points similar to each of the k clusters, and in each pass, O(n)


similarity computations are required. Therefore, the complexity of Step 2.3 is bounded by O(nk). Step

2.4 requires O(k) time corresponding to the k cluster centers. Finally, Step 2.5 can be accomplished

in O(nkl), l being the number of iterations till convergence. Generally, because SHARPC determines

suitable cluster centers, therefore, the k-means algorithm in Step 2.5 converges rapidly and thus, the

similarity computation is the pre-dominant factor in determining the total time. Therefore, the overall

complexity of SHARPC is given by O(n2). Note that SHARPC takes more time than k-means, with a

complexity O(nkl′) (l

′being the number of iterations till convergence), and Leader, with a complexity

O(nk), for the same number of clusters, k. However, the complexity of SHARPC can be greatly reduced

by further approximating the Shapley value considering only O(t) nearest neighbors, t << n, using

generic branch and bound techniques such in [41], locality sensitive hashing in [42] based techniques

such as in [44], application specific techniques such as in [43].

6.4 Order Independence of SHARPC

In this section, we show how SHARPC exploits the order independence property of the convex game

setting to estimate Shapley values to a high degree of accuracy. To set the stage, we characterize the

concept of order independence and provide the necessary and sufficient conditions for order indepen-

dence.

6.4.1 Characterizing Ordering Effects in Incremental Learners

In his celebrated work on unification of clustering [40], Kleinberg considered three properties: scale-

invariance, richness, and consistency and proved an impossibility result, showing that no clustering

algorithm satisfies all of these properties simultaneously. Order independence is another desirable fun-

damental property of clustering algorithms. In other words, we want the algorithms to produce the

same final clustering across different runs, irrespective of the sequence in which the input instances are

presented. We note that even though algorithms such as the Leader and the k-means can be shown

to satisfy some of the three properties: scale-invariance, richness, and consistency; they do not satisfy

order independence. In particular, the Leader algorithm is known to be susceptible to ordering effects.

On the other hand, the random selection of initial cluster centers precludes the k-means algorithm from

being truly order independent.

6.4.2 Order Independence of SHARPC

Next, we prove an important theorem, which highlights the order independence of SHARPC.

Theorem 9. SHARPC is order independent.


Proof. Let a dataset X = {x1, x2, . . . , xn} be provided as an input to SHARPC. For any permutation ordering

on the input instances, π ∈ Π, we may define an abstraction on i points, Tπ,i =X

π(p)≤π(i)π(q)<π(p)

f(d(xp, xq)), and a

function g such that g(Tπ,i, xi+1) = Tπ,i +X

π(p)≤π(i)

f(d(xi+1, xp)), where xi+1 is the current input instance.

Further, in case of SHARPC, g(Tπ,k, x′l) =

Xπ(p)≤π(k)π(q)<π(p)

p 6=lq 6=l

f(d(xp, xq)), where 1 ≤ l ≤ k.

In order to prove order independence of SHARPC, we need to verify that X is a dynamically complete set with

respect to T and g,

• g(Tπ,k, xk+1) = Tπ,k +X

π(p)≤π(k)

f(d(xk+1, xp))

=X

π(p)≤π(k)π(q)<π(p)

f(d(xp, xq)) +X

π(p)≤π(k)

f(d(xk+1, xp))

=X

π(p)≤π(k+1)π(q)<π(p)

f(d(xp, xq))

= Tπ,k+1

• g(Tπ,k, x′l) =

Xπ(p)≤π(k)π(q)<π(p)

p6=lq 6=l

f(d(xp, xq))

= f(d(x2, x1))+X

π(p)<3

f(d(x3, xp))+. . .+X

π(p)<π(l−1)

f(d(xl−1, xp))+X

π(p)<π(l)

f(d(xl+1, xp))+. . .+X

π(p)<π(k)

f(d(xk, xp))

= g(g(g(g(g(Tπ,0, x1), x2), . . . , xl−1), xl+1), . . . , xk)

• g(g(Tπ,k, x′l), xl)

=X


p6=lq 6=l

f(d(xp, xq)) +X

π(p)≤π(k)p 6=l

f(d(xl, xp))

[Note that this step follows since the incoming data point, xl, arrives at the (k+1)th position, whereas the

earlier instance is removed, as indicated by x′l, and hence does not contribute to the sum of similarities].

=X


f(d(xp, xq))

= Tπ,k

• g(g(Tπ,k, xl), xm)

= (X


f(d(xp, xq)) +X

π(p)≤π(k)

f(d(xl, xp))) +X

π(p)≤π(k+1)

f(d(xm, xp))

= (X


f(d(xp, xq)) +X

π(p)≤π(k)

f(d(xl, xp))) + (X

π(p)≤π(k)

f(d(xm, xp)) + f(d(xm, xl)))


= (X


f(d(xp, xq)) +X

π(p)≤π(k)

f(d(xm, xp))) + (X

π(p)≤π(k)

f(d(xl, xp)) + f(d(xl, xm)))

[since f(d(xl, xm)) = f(d(xm, xl))]

= g(g(Tπ,k, xm), xl)

Note that since updating the knowledge structure in memory on arrival of a new point requires its

similarity computations to each of the previously seen points in the given sequence, therefore, SHARPC

ceases to be incremental, nonetheless, SHARPC is an order independent algorithm. In the next theorem,

we prove that SHARPC estimates the Shapley value of points in the input dataset to an arbitrarily high

degree of accuracy.

Theorem 10. Let X = {x1, x2, . . . , xn} be an input dataset. Further, let Φ = (φ1, φ2, . . . , φn) denote

the approximate Shapley values, of n data points, given by φi =∑xj∈Xj 6=i

f(d(xi, xj)). Then, for some

constants ε, c, and c1, such that ε ≥ 0 and c, c1 > 0,

P (|Φ−E(Φ)| ≥ ε) ≤ c1e−c(n−1)!ε2 ,

where E(Φ) denotes the vector of exact Shapley values.

Proof. Consider an arbitrary permutation, π, on X, where data point xi is fixed at the nth position. Clearly,

there are (n − 1)! such permutations, corresponding to arrangement of the data points in X\{xi}. Then, the

marginal contribution of xi in any such permutation is given by,

xπi = v(Tπ,π(i))− v(Tπ,π(i)−1)

= v(Tπ,n)− v(Tπ,n−1)

[since π(i) = n]

=X

π(j)≤nπ(q)<π(j)

f(d(xj , xq))−X

π(j)≤n−1π(q)<π(j)

f(d(xj , xq))

=X

π(j)≤nπ(q)<π(j)

f(d(xj , xq))−X

π(j)≤n−1π(q)<π(j)

f(d(xj , xq))

=X

π(j)<n

f(d(xi, xj))

=Xxj∈Xj 6=i

f(d(xi, xj))

Now, using Theorem 9, all such permutations result in the same marginal contribution for xi. Thus, the Shapley


value of xi, approximated using (n− 1)! such permutations is given by,

φi =1

(n− 1)!

Xπ∈Ππ(i)=n

Xxj∈Xj 6=i

f(d(xi, xj))

=1

(n− 1)!(n− 1)!

Xxj∈Xj 6=i

f(d(xi, xj))

=Xxj∈Xj 6=i

f(d(xi, xj))

Now, using Lemma 1, the empirical Shapley value differs from the exact Shapley value according to the following

bound,

P (|Φ(p)−E(Φ(p))| ≥ ε) ≤ c1e−cpε2

whereby substituting p = (n− 1)!, we get

P (|Φ−E(Φ)| ≥ ε) ≤ c1e−c(n−1)!ε2

Theorem 10 essentially implies that we can obtain a highly accurate approximation of the Shapley

values by computing pairwise similarities between points in the dataset. Further, it also makes SHARPC

completely order independent since identical cluster centers (and identical clusters, subsequently) are

obtained using different runs of the algorithm. As mentioned, this is a highly desirable property, which

is conspicuously absent in the k-means and the Leader algorithms.

Moreover, since every finite subset of players in a convex game is also convex, therefore, Theorem

10 provides an effective mechanism for further enhancing the computational efficiency of existing Shap-

ley value approximation techniques such as the Multi-perturbation Shapley value analysis (MSA), by

using only a single permutation on a sampled subset of players. MSA employs a Shapley value based

approach to address the issue of defining and calculating the contributions of neural network elements

from a dataset of multiple lesions [45]. Only a subset of other elements is considered, across different

permutations, for obtaining the marginal contribution of each player. Theorem 10 implies that by using

a convex formulation of the network elements, for N players, only N permutations need be considered.

For each player i, a single permutation (where i is placed at the last position, and other players are

arranged arbitrarily in the remaining N − 1 positions) would suffice. This is an extremely significant

result in advancing the state of the art, as regards the computational efficiency of estimating Shapley

value for large datasets across hetrogeneous applications.


6.5 Hierarchical Clustering

In this section, we demonstrate another significant feature of how the Shapley value approach can be

extended to obtain hierarchical clusters. We start by showing that Shapley value based clustering en-

sures a bound on the extent of similarity among points in the same cluster for a suitable choice of d and f .

Lemma 2. The data points, which are assigned to the same cluster by Algorithm 1, are at least 2δ − 1

similar to each other, on an average, for a suitable choice of distance metric d, dissimilarity function

f′, and similarity function f (as defined in Sect. 6.3.1).

Proof. Consider any two points xi and xj , 1 ≤ i, j ≤ n, that are assigned to the same cluster C with center x.

Then,

f(d(x, xi)) ≥ δ

f(d(x, xj)) ≥ δ

Using the above inequalities, we get,

f(d(x, xi)) + f(d(x, xj)) ≥ 2δ

⇒ f′(d(x, xi)) + f

′(d(x, xj)) ≤ 2− 2δ (6.3)

Now d, being a metric, satisfies the triangle inequality. Thus,

d(x, xi) + d(x, xj) ≥ d(xi, xj)

Now, by definition, f′

is a monotonically increasing function, and therefore, we get the following inequality

f′(d(x, xi) + d(x, xj)) ≥ f

′(d(xi, xj))

Further, using (6.1), we get

f′(d(x, xi)) + f

′(d(x, xj)) ≥ f

′(d(x, xi) + d(x, xj))

Using (6.3), together with these inequalities, we get

f′(d(xi, xj)) ≤ 2− 2δ ∀xi, xj ∈ C

Now, since by definition, f′(d(xi, xj)) + f(d(xi, xj)) = 1, therefore,

f(d(xi, xj)) ≥ 2δ − 1 ∀xi, xj ∈ C


⇒ fC ≥ (2δ − 1)

where fC denotes the average or mean similarity between data points in C (corresponding to (|C|2 ) distinct pairs

of points).

Lemma 2 leads to the following important observations. The quality of clustering can be controlled

by varying the similarity parameter δ over the range (0, 1]. This is promising on several accounts,

(a) the number of clusters is not required as a pre-requisite unlike most clustering algorithms, (b) no

assumptions about the distribution of data are made, and (c) a (lower) bound on the intra-cluster

similarity is achieved. Note that the central idea in Shapley value based clustering is to obtain good

cluster centers, and this in conjunction with Lemma 2 ensures that a high quality of clustering is

obtained.

Now we discuss how the Shapley value approach can be extended to obtain hierarchical clustering.

Using Algorithm 1, we get clusters with a minimum average similarity δ as indicated by Theorem 1.

Now, we consider only the cluster centers selected by Algorithm 1. We choose the center x with the

highest Shapley value, merge the clusters with centers that are at least δ′

similar to x, and make x

the cluster center of the single cluster thus obtained. Then, we choose the cluster center with the

highest Shapley value among the remaining centers and repeat the process. The idea can be similarly

propagated to levels up the hierarchy till a single cluster remains. In the next theorem, we prove the

minimum similarity bound for hierarchical clusters at any level.

Theorem 11. The data points assigned to the same cluster using Algorithm 1, at a level i in the

hierarchy are, on an average, at least 2δ − 1 similar to each other, where δ is the similarity threshold

parameter for level i− 1. This minimum average similarity is independent of δ′, the threshold for level

i.

Proof. Let x′

and y′

be any two points in clusters represented by centers x and y at level (i − 1) respectively

(see Figure 6.1). After decreasing the threshold from δ to δ′, x′

and y′

are assigned to the same cluster at level

i. Without loss of generality, let x be the center of this cluster at level i. Consider the triangle yx′y′. Then,

defining the functions f and f′

as in Sect. 6.3.1, we get, using triangle inequality,

d(x′, y) + d(y, y

′) ≥ d(x

′, y′)

⇒ f′(d(x

′, y) + d(y, y

′)) ≥ f

′(d(x

′, y′))

Also,

f′(d(x

′, y)) + f

′(d(y, y

′)) ≥ f

′(d(x

′, y) + d(y, y

′))

⇒ f(d(y, y′)) ≤ 1 + f(d(x

′, y′)) + f(d(x

′, y)) (6.4)


Figure 6.1: Hierarchical Clustering

But since y′

is assigned to the cluster with center y at level i− 1, therefore

f(d(y, y′)) ≥ δ (6.5)

Using (6.4) and (6.5),

f(d(x′, y)) ≥ δ − 1− f(d(x

′, y′)) (6.6)

Similarly, considering ∆xyx′, we get

f(d(x′, y)) ≤ 1− f(d(x, y)) + f(d(x, x

′)) (6.7)

Using (6.6) and (6.7),

f(d(x, y)) ≤ 2 + f(d(x, x′)) + f(d(x

′, y′))− δ (6.8)

But since y is assigned to the cluster with center x at level i, therefore

f(d(x, y)) ≥ δ′

Using this inequality in conjunction with (6.8),

f(d(x′, y′)) ≥ δ + δ

′− 2− f(d(x, x

′)) (6.9)

Let gi : X×X → {0, 1} be a function indicating if the two points in the domain are assigned to the same cluster

at level i or not. If the two points are assigned to the same cluster at level i, then gi returns a 1, else a 0.

Consider the situation as highlighted in Figure 6.1. As mentioned earlier, x′

and y′

are assigned to different

clusters at level (i− 1) but to the same cluster at level i. Then,

gi(x′, y′) = 1 and gi−1(x

′, y′) = 0

Now, the total similarity among all the points assigned to the same cluster Ci at level i is given by,

1

2

Xxp,xq∈Cixp 6=xq

f(d(xp, xq))


=1

2

Xxp,xq∈Ci

gi−1(xp,xq)=0xp 6=xq

f(d(xp, xq)) +1

2

Xxp,xq∈Ci


f(d(xp, xq))

Let hi : X ×X → {0, 1} be a function indicating if the points being considered are assigned to the same cluster,

with one point as the center, at level i or not. That is, hi(x, x′) = 1 if and only if x is a cluster center, at level

i, with x′

as a data point in the same cluster.

Then, using (6.9), Xxp,xq∈Ci


f(d(xp, xq)) ≥X

x,xp∈Cihi−1(x,xp)=1hi(x,xp)=1

(δ + δ′− 2− f(d(x, xp)))

= 2 txi−1(|Ci| − txi−1)(δ + δ′− 2)−

Xx,xp,xq∈Cihi−1(x,xp)=1hi(x,xp)=1

f(d(x, xp))

where txi−1 is the number of elements in the cluster, at level (i− 1), of which x is a member. Then,

Xxp,xq∈Ci


f(d(xp, xq)) ≥ 2txi−1(|Ci| − txi−1)(δ + δ′− 2− 1)

(since, f(d(x, xp)) ≤ 1 ∀xp ∈ Ci)

= 2txi−1(|Ci| − txi−1)(δ + δ′− 3) (6.10)

Now, using Lemma 2,

Xxp,xq∈Ci


v(xp,xq)6=1

f(d(xp, xq)) ≥Xxp∈Ci

zi−1(xp)6=1

txp

i−1(txp

i−1 − 1)

2(2δ − 1) (6.11)

where v(xp, xq) = 1 if the unordered pair (xp, xq) has already been considered in similarity computations and

zi−1(xp) = 1 if the cluster assigned to xp, at level i− 1, has already been accounted for. Then,Xxp,xq∈Ci


f(d(xp, xq)) = 2X

xp,xq∈Cigi−1(xp,xq)=1

xp 6=xq

v(xp,xq)6=1

f(d(xp, xq))

Using (6.11),Xxp,xq∈Ci


f(d(xp, xq))

≥Xxp∈Ci

zi−1(xp)6=1

txp

i−1(txp

i−1 − 1)(2δ − 1) (6.12)

Then, using (6.10) and (6.12), the total similarity among all the points assigned to the same cluster Ci, at level

i is,


1

2

Xxp,xq∈Cixp 6=xq

f(d(xp, xq))

≥ txi−1(|Ci| − txi−1)(δ + δ′− 3)

+1

2

Xxp∈Ci

zi−1(xp) 6=1

txp

i−1(txp

i−1 − 1)(2δ − 1) (6.13)

= S (say)

Differentiating with respect to δ′, we get

∂S

∂δ′= txi−1(|Ci| − txi−1)

Then,∂S

∂δ′= 0⇒ tx

min

i−1 = 0

or

txmin

i−1 = |Ci|

But, txmin

i−1 6= 0, since at least x belongs to txi−1.

Therefore, txmin

i−1 = |Ci|, which is intuitive since the dissimilarity is maximum when all the data points are

assigned to the same cluster. The minimum value of S is obtained from (6.13),

Smin =1

2|Ci|(|Ci| − 1)(2δ − 1)

Thus,

fHC ≥ (2δ − 1)

where fHC is the mean similarity between data points in any hierarchical cluster at level i where δ is the similarity

threshold at level i− 1.

A similar result can be proved for SHARPC sans Step 2.5 (wherein the k-means algorithm is executed

to minimize the average point to closest center distance). An important implication of Theorem 11 is

that the minimum average similarity at any level in hierarchical clustering is achieved when all the data

points are assigned to the same cluster. Further, Theorem 11 suggests a lower bound on the extent of

similarity shared by patterns belonging to the same cluster. This is a significant result since the user

can input a suitable δ ∈ (0, 1] and obtain clusters, at any level, with a minimum average similarity

guarantee.


6.6 Comparison of SHARPC with k-means and Leader

SHARPC can be viewed as an optimal unification of Leader and k-means, since the similarity threshold

can be viewed as an extension of the idea of the distance threshold in the Leader algorithm, whereas

the k-means algorithm is directly employed in Step 2.5. The initial selection of cluster centers, based on

Shapley value, provides the missing link required for excellent clustering. In this section, we compare

SHARPC with Leader, the prototype one-pass incremental algorithm, and k-means, the most popu-

lar representative of partition based clustering algorithms. We also provide experimental evidence to

substantiate the efficacy of our approach.

6.6.1 Satisfiability of desirable Clustering Properties

Now, to facilitate a better comprehension of the efficacy of different approaches, we mention the various

desirable properties of clustering and indicate the algorithms that satisfy these properties.

• Scale Invariance: The Leader algorithm does not satisfy scale invariance since it decides the

clusters based on a distance threshold, and thus incurs a fundamental distance scale. The k-

means algorithm satisfies scale invariance since it assigns clusters to points depending only on

their relative distances to the k cluster centers, irrespective of the absolute distances. SHARPC

consits of two phases. In the first phase, clustering is done based on the similarity values, which

are again relative (e.g. consider a similarity function f(d(xi, xj)) = 1− d(xi, xj)dτ

where dτ > dmax,

with dmax denoting the maximum distance between any two points in the dataset). In the second

phase, the k-means algorithm is used, which is scale invariant, as already mentioned. Thus,

SHARPC satisfies scale invariance.

• Richness: The Leader algorithm satisfies the richness property since we can always adjust the

distances among points to generate any desired partition of the input dataset. For example, one

of the ways to obtain a single cluster is to set all pairwise distances to some value less than

the distance threshold, whereas to have each point assigned to a separate cluster, every pairwise

distance may be set to some value greater than the distance threshold. The k-means algorithm

satisfies the richness condition only if the value of k can be adjusted according to the desired

partitions. However, since in general, k is a constant input provided to the k-means algorithm, we

may not partition the input dataset into any number of clusters other than k, and this precludes

the k-means algorithm from satisfying richness. Note that this restriction of k-means does not

apply to the SHARPC algorithm, since the number of clusters, k, is not provided as an input

and is determined based on the Shapley values of points and similarities among them. Hence,

SHARPC satisfies the richness property.


• Consistency: The Leader and the k-means algorithms do not satisfy the consistency requirement.

This follows directly from Kleinberg’s result in [40], which states that there does not exist any

centroid based clustering function that satisfies the consistency property. SHARPC also does not

satisfy consistency as a consequence of Kleinberg’s impossibility theorem that no algorithm may

satisfy scale invariance, richness, and consistency simultaneously.

We summarize the foregoing discussion in Table 6.1. It is easy to infer that SHARPC provides

an excellent approach to clustering, since it satisfies three of the four properties, and is optimal

in that no other clustering algorithm can perform better, as a consequence of the impossibility

theorem.

Table 6.1: Comparison between Leader, k-means, and SHARPCProperty Leader k-means SHARPC

Scale Invariance X√ √

Richness√

X√

Consistency X X XOrder Independence X X

√

6.6.2 Experimental Results

We carried out extensive experimentation to compare SHARPC with the Leader and the k-means

algorithms. For our experiments, we measured the quality of clustering of an algorithm in terms of the

following two parameters,

• α =1n

∑xi∈X

||xi − xk||2, where xk is the representative of cluster, Ck ∈ C, to which xi is assigned.

• β =1|C|

∑xi,xj∈Ck

Ck∈C

||xi − xj ||2

|Ck|(|Ck| − 1)

where C is the set of clusters to which xi ∈ X = {x1, x2, . . . , xn} is assigned. The potential, α, quantifies

the deviation of data points from the representative element while the scatter, β, captures the spread

among different elements assigned to the same cluster. Clearly, the lower the values of α and β, the

higher the quality of clustering. The potential, α, is a standard measure, but we also characterize quality

of clustering in terms of β, since it is closer to the basic notion of a cluster as a group of points more

similar to each other than points belonging to other clusters. Further, we choose Euclidean distance as

our distance metric d, and set the dissimilarity between any two data points xi and xj , f′(d(xi, xj)), to

d(xi, xj)dmax + 1

where dmax denotes the maximum distance between any two points in the dataset.


Table 6.2: Spam Dataset (4601 examples, 58 dimensions)Algorithm Clusters Average α Average β

Leader 10 153974 190783k-means 10 36850 5619527

SHARPC 10 24368 190541Leader 25 110673 170557k-means 25 33281 2248452




SHARPC 100 809 2319

We conducted an experimental study on a number of real-world datasets. We provide the results

for Wine, Spam, Cloud and Intrusion (first 5000 points) datasets. These datasets are available as

archives at the UCI Machine Learning Repository [31, 32]. We implemented code in Matlab without

any optimizations. Moreover, we state the results obtained using 30 runs of experiments to account

for statistical significance. Further, since the Leader algorithm does not take δ as an input parameter,

we executed the code for Leader algorithm for different distance thresholds, across different orders,

and observed the number of clusters. Then, we modulated δ to obtain almost the same number of

clusters. Likewise, we varied δ for adjusting the SHARPC algorithm to the number of clusters used in

the k-means. Finally, we averaged the α and β values for a fixed number of clusters.

Table 6.3 shows the α and β values resulting from the Leader, the k-means and the SHARPC on

the Wine dataset. Clearly, SHARPC outperforms Leader by 2 to 3 orders of magnitude in terms of α.

Similarly, SHARPC improves Leader by an order of magnitude in terms of β. It readily follows that

SHARPC gives much better clustering than the Leader algorithm. Similarly, the comparison of SHARPC

with k-means reveals that even though k-means performs much better than the Leader algorithm, the

quality of clustering is the best in the SHARPC algorithm. In fact, the gap in the quality of clustering

becomes more prominent in the case of larger datasets, such as Spam (Table 6.2 and Fig. 6.2) and

Intrusion (Table 6.5, Fig. 6.3 and Fig. 6.4). Table 6.4 shows the comparison results on the Cloud

dataset. We observed from our experiments that in general, SHARPC not only outperforms k-means

and Leader algorithms in terms of β but also in terms of α, since SHARPC finds an optimal set of

cluster centers.

In our experiments, we observed that although SHARPC is of complexity O(n2), in practice, it takes

much less time since it tends to converge rapidly, due to an optimal selection of initial cluster centers.


Table 6.3: Wine Dataset (178 examples, 13 dimensions)Algorithm Clusters Average α Average β

Leader 5 187568 38772k-means 5 5673 13576




SHARPC 20 423 839

Table 6.4: Cloud Dataset (1024 examples, 10 dimensions)Algorithm Clusters Average α Average β

Leader 5 321983 84727k-means 5 17391 80844




SHARPC 20 2892 6722

Table 6.5: Network Intrusion Dataset (5000 examples, 37 dimensions)Algorithm Clusters Average α Average β

Leader 10 4.1139e+09 4.4772e+07k-means 10 1.4836e+06 1.3116e+08

SHARPC 10 7.28763e+05 3.5754e+07Leader 25 4.1014e+09 5.8943e+06k-means 25 8.9539e+05 3.1790e+07




SHARPC 100 1.7079e+03 1.7347e+04


Figure 6.2: β − C plot (Spam Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means

Figure 6.3: α− C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means

Figure 6.4: β − C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means


Figure 6.5: Wine: Potential (α) does not vary much with permutations (p) for a fixed threshold (δ)

For instance, on the Intrusion dataset, for 100 clusters, SHARPC takes about 10 seconds compared

to 2.5 seconds by k-means. Further, evaluating SHARPC on the Wine dataset, we found that if the

similarity threshold (δ) is kept fixed, α does not increase significantly with a decline in the number of

permutations, p (Figure 6.5), thereby supporting the result in Lemma 1. Similar behavior was observed

with other datasets. Therefore, as already mentioned, there is a lot of scope to further improve the time

complexity of SHARPC by incorporating modifications to techniques like MSA.

We also conducted experiments on several other real datasets and found that SHARPC provided

much better clustering than Leader and k-means. However, due to space constraints, we are unable to

present our detailed analyses here.

6.7 Summary and Future Work

In this paper, we proposed a novel approach to clustering based on a cooperative game theoretic frame-

work First, we proposed an algorithm for clustering based on the exact computation of Shapley value.

Then, an efficient approach, SHARPC, based on the convexity of our game theoretic model was put

forward. We also highlighted order independence as a desirable clustering property and provided both

the necessary and sufficient conditions for achieving order independence. In addition to being order

independent, SHARPC also satisfies scale invariance and richness, two other desirable clustering prop-

erties. We also showed how SHARPC can be readily generalized to obtain hierarchical clusters with

a minimum similarity bound. Our experiments on several standard datasets suggest that SHARPC

provides a significantly better clustering than the popular k-means and Leader algorithms.

6.7.1 Future Work

In this paper, we investigated the efficacy of SHARPC by conducting experiments using a particular

similarity function. It would be interesting to analyze the impact of different dissimilarity measures on

the overall quality of clustering. Further, as suggested in the paper, as a future work, existing techniques


may be employed to further reduce the complexity of the skeletal SHARPC algorithm. The extension

of ideas presented in this work to semi-supervised and supervised learning would be another interesting

direction. This paper may be used as a reference to build on Kleinberg’s work on unification of clustering

by extending the impossibility results incorporating richness, scale invariance, and consistency to include

order independence. The notion of order independence in incremental learners is extremely relevant in

the context of stream applications. Some important open problems, in this regard, emanating from our

work are:

• Can we come up with novel abstraction(s) to obtain efficient order independent incremental learn-

ers?

• Can some of the properties of a “Dynamic Set” be relaxed to achieve computationally efficient yet

effective weak incremental learning?

• Can some of the existing non-incremental techniques be made incremental by incorporating ap-

propriate abstractions?

It would also be worthwhile to investigate the applicability of other solutions concepts from cooperative

game theory, such as the Nucleolus, to various problems in machine learning and pattern recognition.

Bibliography

[1] A. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys,

31(3), pp. 264–323, 1999.

[2] E. Backer and A. Jain. A clustering performance measure based on fuzzy set decomposition. IEEE

Transactions Pattern Analysis and Machine Intelligence (PAMI), 3(1), pp. 66–75, 1981.

[3] A. Jain and R. Dubes. Algorithms for Clustering Data. Englewood Cliffs, Prentice Hall, NJ, 1988.

[4] P. Hansen and B. Jaumard. Cluster analysis and mathematical programming. Math. Program., 79,

pp. 191–215, 1997.

[5] R. Xu and D. Wunsch II. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks,

16(3), pp. 645–678, 2005.

[6] O. Sasson, N. Linial, and M. Linial. The metric space of proteins–Comparative study of clustering

algorithms. Bioinformatics, 18, pp. s14–s21, 2002.

[7] W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size

of large protein databases. Bioinformatics, 17, pp. 282–283, 2001.

[8] S. Mulder and D. Wunsch. Million city travelling salesman problem solution by divide and conquer

clustering with adaptive resonance neural networks. Neural Net., 16, pp. 827–832, 2003.

[9] R. Dubes. Cluster analysis and related issue. Handbook of Pattern Recognition and Computer Vision,

C. Chen, L. Pau, and P. Wang, Eds., World Scientific, pp. 3–32, 1993.

[10] G. Ball and D. Hall. A clustering technique for summarizing multi-variate data. Behav. Sci., 12,

pp. 153–155, 1967.

[11] K. Krishna and M. N. Murty. Genetic K-means algorithm, IEEE Trans. Syst., Man, Cybern. B

(SMC-B), 29(3), pp. 433–439, 1999.

[12] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis,

Wiley, 1990.

129

BIBLIOGRAPHY 130

[13] X. Zhuang, Y. Huang, K. Palaniappan, and Y. Zhao. Gaussian mixture density modeling, decom-

position, and applications. IEEE Trans. Image Process., 5(9), pp. 1293–1302, 1996.

[14] http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/

JavaPaper.

[15] J. Cherng and M. Lo. A hypergraph based clustering algorithm for spatial data sets. Proc. IEEE

Int. Conf. Data Mining (ICDM), pp. 83–90, 2001.

[16] L. Hall, I. Ozyurt, and J. Bedzek. Clustering with a genetically optimized approach. IEEE Trans.

Evol. Comput., 3(2), pp. 103–112, 1999.

[17] F. Hoppner, F. Klawonn, and R. Kruse. Fuzzy Cluster Analysis: Methods for Classification, Data

Analysis, and Image Recognition, Wiley, New York, 1999.

[18] T. Kohonen. The self-organizing map. Proc. of the IEEE, 78(9), pp. 1464–1480, 1990.

[19] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Support vector clustering. J. Mach. Learn.

Res., 2, pp. 125–137, 2001.

[20] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal.

and Mach. Intell. (PAMI), 22(1), pp. 4–37, 2000.

[21] D. Ketchen and C.L. Shook. The application of cluster analysis in strategic management. Strategic

Management Journal, 17(6), pp. 441–458, 1996.

[22] D. Pelleg and A. Moore. X-means: Extending K-means with efficient estimation of the number

of clusters. Proc. of the 17th International Conference on Machine Learning (ICML), pp. 727–734,

2000.

[23] S. Salvador and P. Chan. Determining the Number of Clusters/Segments in Hierarchical Clus-

tering/Segmentation Algorithms. Proc. of the 16th IEEE International Conference on Tools with

Artificial Intelligence (ICTAI), pp. 576–584, 2004.

[24] H. Bischof, A. Leonardis, and A. Selb. MDL principle for robust vector quantisation. Pattern

Analysis and Applications, 2, pp. 59–72, 1999.

[25] G. Hamerly and C. Elkan. Learning the k in k-means. Proc. of the 17th International Conference

on Neural Information Processing Systems (NIPS), pp. 281–288, 2003.

[26] R. B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, 1997.

[27] L. S. Shapley. Cores of convex games. International Journal of Game Theory, 1(1), pp. 11–26, 1971.

BIBLIOGRAPHY 131


ester, UK.

[29] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for

the k-Means problem. Symposium on Foundations of Computer Science (FOCS), pp. 165–176, 2006.


Discrete Algorithms (SODA), pp. 1027–1035, 2007.

[31] http://archive.ics.uci.edu/ml/datasets/.


[33] D. Arthur and S. Vassilvitskii. How Slow is the k-means Method?. Proc. of the Symposium on

Computational Geometry (SoCG), 2006.



[35] T. Mitchell. Generalization as Search. Artificial Intelligence, 18, pp. 203–226, 1982.

[36] A. Cornuejols. Getting Order Independence in Incremental Learning. Proceedings of the 1993 Eu-

ropean Conference on Machine Learning (ECML), pp. 196–212, Springer-Verlag, 1993.


Conference on Machine Learning (ICML), pp. 163–168, 1992.

[38] B. Shekar, M. N. Murty, and G. Krishna. Structural aspects of semantic-directed clusters. Pattern

Recognition, 22, pp. 65–74, 1989.

[39] I. N. Herstein. Topics in Algebra, John Wiley & Sons, Second Edition, 2006.

[40] J. Kleinberg. An Impossibility Theorem for Clustering. Proceedings of the Advances in Neural

Information Processing Systems, 15, pp. 463–470, 2002.

[41] B. Zhang and S.N. Srihari. Fast k-Nearest Neighbor Classification Using Cluster-Based Trees. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 26(4), pp. 525–528, 2004.

[42] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proceedings

of the 25th International Conference on Very Large Data Bases (VLDB), pp. 518–529, 1999.

[43] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using GPU. Proceedings

of the CVPR Workshop on Computer Vision on GPU, Alaska, 2008.

BIBLIOGRAPHY 132

[44] P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using lo-

cality sensitive hashing. Proceedings of the 12th International Conference on Extending Database

Technology: Advances in Database Technology, pp. 744–755, 2009.

[45] A. Keinan, B. Sandbank, C. C. Hilgetag, I. Meilijson, and E. Ruppin. Fair Attribution of Functional

Contribution in Artificial and Biological Networks. J. Neural Comput., 16(9), pp. 1887–1915, 2004.

[46] R. Anderberg. Cluster Analysis for Applications, Academic Press, New York, 1973.

[47] R. S. Michalski. Knowledge Acquisition Through Conceptual Clustering: A Theoretical Framework

and an Algorithm for Partitioning Data into Conjunctive Concepts. Journal of Policy Analysis and

Information Systems, 4(3), pp. 219–244, 1980.

Chapter 7

A 2-Approximation Algorithm for

Optimal Disk Layout of Genome

Scale Suffix Trees

7.1 Introduction

The suffix tree is an immensely popular data structure for indexing colossal scale biological repositories

[1]. This is mainly due to its linear time and space complexity of construction in terms of the sequence

size, in addition to linear search complexity in terms of the pattern size [2, 3]. However this comes at the

cost of increased storage space requirements, with the standard implementations consuming an order of

magnitude more space than the indexed data. As such for most practical data mining applications, the

suffix tree needs to be disk-resident. To complicate the matter further, searching for a pattern requires

random traversal of suffix links connecting nodes across different pages that results in increased I/O

activity. Dispensing away with suffix links not only affects the construction time [2, 3] but also renders

several search algorithms infeasible [4, 5, 6]. This is because sequence search algorithms typically involve

traversing both edges and suffix links. For instance, to find all maximal matching subsequences between

the sequence and the pattern, tree edges are used to walk down the tree matching the pattern sequence

along the way with the subsequent matches found by following the suffix links [7]. A lot of research has

gone subsequently into addressing this problem, primarily focusing on building efficient disk-resident

trees [8, 9, 10, 11, 18]. The objective of our work is to optimize the layout of suffix trees with regard to

assigning disk pages to tree nodes thereby resulting in improving the search efficiency. Layout strategies

have been mentioned in the literature for a variety of data structures [12, 13, 14, 15, 16].

133

Chapter 7. A 2-Approximation Algorithm for Optimal Disk Layout of Genome Scale Suffix Trees134

However, there has been only one notable contribution in the field of suffix trees [17]. Therein a

layout strategy, Stellar, is experimentally shown to improve search performance on a representative set

of real genomic sequences. In our work, we do a theoretical analysis of the whole problem based on our

approach and give bounded guarantee on the performance. For the rest of the chapter, we shall use

memory and main memory interchangeably unless mentioned otherwise explicitly. Similarly a search

pattern and query shall convey the same meaning.

7.2 Hardness of the Disk Layout Problem

Consider a suffix tree layout on the disk. Suppose we need to access a node x in the process of determining

a potential match of the pattern. There are two possible alternatives through which x could get into

the main memory (a) following an edge from its parent, or (b) following a suffix link at the node of last

mismatch. For each node present in the memory, accessing a memory resident child or a suffix child

costs less than one not present since the later involves an I/O operation. A theoretically equivalent

way of analyzing the same problem is to consider the cost of accessing a node resulting from absence or

presence of its parent or its suffix parent, where a suffix parent is defined as a node having a suffix link

to the node in consideration. Put succinctly, we may as well interpret that cost related with accessing a

node is less when at least one of its parent or suffix parent is present in the memory rather than when

none is memory resident. Now there are two distinct possibilities, either at least one of the parent or

suffix parent is present in the memory or none is.

Let, P1(x) and P2(x) respectively denote the probability that parent or a suffix parent of node x is

present in the memory. Further, let C1(x) and C2(x) be the costs of accessing x when at least one of its

parent or suffix parent is present in the memory and when none is present respectively. As pointed out

in the foregoing discussion, C2(x) ≥ C1(x) since C2(x) involves an additional I/O operation in order to

bring a parent or suffix parent of x into the memory. Now, for x,

Probability that the parent node is not present in the memory = 1- P1(x)

Probability that a suffix parent is not present in the memory = 1- P2(x)

⇒ Probability that at least a parent or suffix parent is present = 1− (1− P1(x))(1− P2(x))

Now expected cost of accessing x is given by C(x), where

C(x) = [Probability that parent(x)/suffix parent(x) is present in memory] * C1(x) + [Probability that

none of the parent(x) and suffix parent(x) is present in memory] * C2(x).

⇒ C(x) = (1− (1− P1(x))(1− P2(x))) ∗ C1(x) + (1− P1(x))(1− P2(x)) ∗ C2(x)

= C1(x)− (1− P1(x))(1− P2(x)) ∗ C1(x) + (1− P1(x))(1− P2(x)) ∗ C2(x)

= C1(x) + (C2(x)− C1(x))(1− P1(x))(1− P2(x))

= C1(x) + k(x)(C2(x)− C1(x)) where,


k(x) = (1− P1(x))(1− P2(x))

An important point is in place. The probabilities defined herein may be extremely difficult or even

impossible to calculate but nonetheless, for theoretical considerations, let these probabilities be pro-

vided by some Oracle. To earn a better understanding of this probabilistic formulation, consider a few

special cases, one by one.

Case 1: C1(x) = C2(x) ∀x

This is possible only when we have virtually infinite main memory so that the whole suffix tree is

memory-resident. Then expected cost of access to node x

C(x) = C1(x) + k(C2(x)− C1(x))

= C1(x) + k(C1(x)− C1(x))

= C1(x) = C1 (assuming a uniform memory access cost)

That is to say, if the whole suffix tree resides in the memory, then assuming uniform memory access,

cost of accessing any particular node x would be a constant cost independent of x and its parent/suffix

parent. This is a rather uninteresting case since for the genome scale applications; the size of the suffix

tree is much larger than the capacity of the main memory.

Case 2: P1(x) = P2(x) = 0

This case is applicable for the root of the suffix tree, which is the first node to be retrieved into the

memory from the disk. Then,

C(x) = C1(x) + (C2(x)− C1(x))(1− P1(x))(1− P2(x))

= C1(x) + (C2(x)− C1(x)) = C2(x)

That is, an I/O operation is required to get the root of the suffix tree into the main memory. This holds

since the root is the first node to be accessed in process of finding a match irrespective of the pattern

being searched for.

Case 3: P1(x) =1 or P2(x) = 1

This case applies when either the parent or a suffix parent of x is always present in the main memory


and rarely holds for genome scale suffix trees. It follows then,

C(x) = C1(x) + (C2(x)− C1(x))(1− P1(x))(1− P2(x))

= C1(x)

The cost of accessing x is C1(x) as expected.

7.2.1 The Q-Optimal Disk Layout Problem

Given a large scale suffix tree S and a set of patterns (possibly infinite) Q to be matched with S, the

Q-Optimal Disk Layout (Q-OptDL) problem is to find an arrangement L of nodes belonging to S on

disk such that the overall cost of accessing the nodes of S on patterns in Q is minimum for L.

Theorem 12. The Q-OptDL problem is NP- Hard.

Proof. We show a reduction from the 0/1 Knapsack, a well-known NP- Complete problem. In the 0/1

Knapsack problem, there are n kinds of items, 1 through n. Each item j has a value Pj and a weight

Wj . The capacity of the knapsack is W. Mathematically, the 0-1 knapsack problem can be formulated

as: maximize∑nj=1(Pj ∗Xj)

subject to∑nj=1(Wj ∗Xj) ≤ W, and

Xj ∈ {0,1} ∀j ∈ {1,2, ...,n}

As shown earlier,

C(x) = C1(x) + k(x)(C2(x)− C1(x)) where,

k(x) = (1− P1(x))(1− P2(x))

By definition, an optimal layout minimizes overall sum of costs over patterns in Q.

⇒ minimize∑Q

∑x

[C1(x) + k(x) ∗ (C2(x)− C1(x))]

Now, we relax the problem setting by assuming C1 and C2 as average memory and disk access costs

respectively. Then, objective function is given by

minimize∑Q

∑x

[C1 + k(x) ∗ (C2 − C1)]

⇒ minimize∑Q

∑x

[C1 + (1− P1(x))(1− P2(x)) ∗ (C2 − C1)]

⇒ minimize∑Q

∑x

(1− P1(x))(1− P2(x)) ∗ (C2 − C1)


Now, C2 ≥ C1. Therefore, we may as well

maximize∑Q

∑x

(C2 − C1)−∑Q

∑x

(1− P1(x))(1− P2(x)) ∗ (C2 − C1)

⇒ maximize∑Q

∑x

[1− (1− P1(x))(1− P2(x))] ∗ (C2 − C1)

⇒ maximize∑Q

∑x

[1− (1− P1(x))(1− P2(x))]

Let Xj be an indicator variable representing the access of a node x. A node j lying on the disk that

is not accessed does not contribute to the cost, thereby having corresponding Xj set to 0. Finally, the

objective function becomes

maximize∑Q

n∑j=1

[1− (1− P1(x))(1− P2(x))] ∗Xj

⇒ maximize∑Q

n∑j=1

P (x) ∗Xj

where

P (x) = [1− (1− P1(x))(1− P2(x)) (7.1)

Let the capacity of the main memory be M. The reduction algorithm takes a knapsack of capacity M,

a singleton set Q and tries to put some l nodes one by one into it, out of a total n potential candidates,

based on the probability given by expression (1) and subject to the constraint that the combined size

of these l nodes should not exceed M. If we can solve the Optimal Layout Problem, then we get a

polynomial solution to the 0/1 knapsack problem, which is impossible unless P = NP. Conversely, if

we do get the l nodes in the knapsack in polynomial time, then we can write them on the disk, in the

same order and proceeding in this manner would yield the optimal layout corresponding to Q. Thus,

the Optimal Layout Problem is NP-Hard.

Now that we have deduced that it is computationally hard to find the optimal disk layout, we are

faced with another problem. It may be very difficult to obtain the values of P1 and P2 for each node

because of the unsymmetric structure of the suffix trees. Nonetheless, we get some useful insights into

improving the layout:

1. We note from the foregoing discussion that there is a vast decline in cost if either the parent or

suffix parent of a node lies close to it. Equivalently, we can improve the layout by bringing a

node’s child and suffix child close to itself.


2. We note that in genome databases, the consecutive base nucleotides have relatively different pro-

portions. For instance, the pair AC has in general different percentage from that of AT, AG and

AA. We can exploit this fact by bringing a node’s more probable successors close to it.

3. Another possible approach is to consider the patterns that have been searched over the database

over a considerable period of time and then incorporate this knowledge to improve the layout from

time to time. This would work very well in case the database being queried is relatively static over

a substantial period and there is similarity in queries being searched for which is a common trend

with biological databases. This might also be useful in case the same set of queries is searched

again.

7.3 Improving the Disk Layout

Now we propose a post-construction algorithm that takes the root r of a disk-resident suffix tree and the

capacity B of the disk-page as its input. There is also a set Q of patterns to be matched. Q represents

some sort of prior knowledge about the search queries. In the absence of any such information, the

sequence corresponding to the suffix tree could be used to initialize the probability values. Basically,

the algorithm Approx. Q-OptDL assigns the child nodes of a node x to the same disk page as x in a

probabilistic fashion. The child representing a base with a higher proportion of co-occurrence with its

parent gets a higher probability of being assigned close to the parent. This is followed by accounting for

the suffix child. The process is recursively followed to get a new layout of the tree nodes. A BFS queue

is used as the primary data structure. When a new query comes, Q and the probability values in Q can

be updated incrementally. Algorithm 1 could be invoked periodically to improve the layout.

7.3.1 Algorithm 1

Approx. Q-OptDL(r,B,Q)

% r : Root of the subtree to be traversed %

% B : Capacity of the disk-page in terms of no. of nodes %

% Q : Set of patterns to be matched %

queue ← r

nodecount ← 0;

while queue not empty, do

{

r ← queue; //remove from the queue

if r not visited then


mark r as visited and increment nodecount

while there is an unmarked child c of r, do

{

PQrc ← Relative proportion of base at c among

all the unmarked base child nodes of r in Q

if c not marked visited AND nodecount < B then

{

mark c as visited with a probability PQrc

if c is marked visited then

{

increment nodecount;

queue ← c; //insert into the queue

s ← suffix-link(c);

if s not visited AND nodecount < B then

{

mark s as visited

increment nodecount

queue ← s

}

}

}

}

if nodecount ≥ B then

{

while queue not empty do

{

m ← queue

Approx. Q-OptDL(m,B,Q)

}

}

}

7.3.2 Performance Bound on Approx. Q-OptDL

Now we show that the Algorithm 1 performs no worse than twice the optimal layout asymptotically.

In the following discussion, cost refers to an I/O rate caused due to limitations of the underlying disk


layout.

Theorem 13. The suffix tree disk layout obtained using Approx. Q-OptDL (Algorithm 1) has an

asymptotic performance within twice that of the optimal disk layout.

Proof. Let Popt and P denote the cost associated with the optimal layout and the layout L obtained

using Algorithm 1 respectively, over an infinite number of patterns. Further, let Pk(e) denote the cost

of layout L while accessing Q, a set of k patterns. Then,

P = limk→∞

Pk(e)

Now, when we access a particular node x with the closest child node x′, an I/O operation may be required

if the next base in the pattern being matched is not present in the memory. Then, the conditional cost

admitted due to this mismatch is given by Pk(e|x, x′). Suppose that during the matching process, at

a particular selection step, the optimal layout chooses a node x with base θ (where θ ∈ {A,C,G,T} in

human DNA), while the layout L using Algorithm 1 chooses node x′

k with a base θ′

k, then since base θ

and θ′

k are conditionally independent of the nodes x and x′, we have

P (θ, θ′

k|x, x′

k) = P (θ|x)P (θ′

k|x′

k)

A mismatch between the two layouts happens if θ 6= θ′

k results in an I/O. Then, the conditional cost of

this mismatch is given by

Pk(e|x, x′

k) = 1−m∑i=1

P (θ = ti, θ′

k = ti|x, x′

k)

where m denotes the number of bases(m = 4 for DNA)

⇒ Pk(e|x, x′

k) = 1−m∑i=1

P (ti|x)P (ti|x′

k) (7.2)

We also note that instead of only Q, if different sets of patterns are used, then different layouts would

be chosen by Algorithm 1. So, we take an average layout, under which conditional cost P (e|x) is given

by

P (e|x) =∫P (e|x, x

′

k)p(x′

k|x)dx′

k (7.3)

where p(x′

k|x) represents the conditional density of x′

k on x Using (7.2) and (7.3), and taking the limits,

we obtain

limk→∞

Pk(e|x) =∫

[1−m∑i=1

P (ti|x)P (ti|x′

k)]δ(x′

k − x)dx′

k


= 1−m∑i=1

P 2(ti|x)

Thereby, the asymptotic cost under layout L is given by

P = limk→∞

Pk(e)

⇒ P = limk→∞

∫Pk(e|x)p(x)dx

⇒ P =∫

[1−m∑i=1

P 2(ti|x)]p(x)dx (7.4)

⇒ P '∫

[1− P 2(tmax|x)]p(x)dx

where tmax refers to the base with greatest probability, that is put into a disk page accordingly by

Algorithm 1.

⇒ P '∫

[2(1− P (tmax|x))]p(x)dx

Now,m∑i=1

P 2(ti|x) = P 2(tmax|x) +∑i 6=max

P 2(ti|x)

We seek to bound this sum by minimizing the second term subject to the following constraints:

• P (ti|x) ≥ 0 , and

•∑i 6=max P

2(ti|x) = 1 − P (tmax|x) = Popt(e|x) since the optimal layout would tend to have least

probability of incurring an I/O. Also,∑mi=1 P

2(ti|x) is minimized if all of the a posteriori condi-

tional costs except that pertaining to tmax are equal.

Considering this fact in the light of foregoing discussion, we get

P (ti|x) =Popt(e|x)m− 1

, i 6= max, or

= 1− Popt(e|x) i = max

We arrive at the following inequalities:

•m∑i=1

P 2(ti|x) ≥ (1− Popt(e|x))2 +P 2opt(e|x)m− 1

, and

•

1−m∑i=1

P 2(ti|x) ≤ 2Popt(e|x)− m

m− 1P 2opt(e|x) (7.5)


Noting that the conditional variance, V ar[Popt(e|x)] ≥ 0, we get

∫P 2opt(e|x)p(x)dx ≥ P 2

opt (7.6)

Using (7.4), (7.5), and (7.6), we obtain the bound in case Q consists of an infinite number of patterns

Popt ≤ P ≤ Popt(2−m

m− 1Popt) (7.7)

Put into words, the Approx. Q-OptDL algorithm outputs a suffix tree disk layout that is guaranteed

to perform no worse than twice the optimal disk layout asymptotically.

Corollary

The suffix tree disk layout on human genome obtained using Approx. Q-OptDL has an asymptotic

P ≤ Popt(2− 43Popt) optimal upper bound.

Proof. The proof follows immediately by substituting value of m = 4 in (7.7) since human DNA consists

of bases {A,C,G,T}.

7.4 Conclusion/Future Work

We discussed the concept of an optimal layout in the context of genome scale suffix trees. The Q-Optimal

Disk Layout has been shown as NP-hard employing a reduction from the 0/1 Knapsack Problem. We

then suggested an algorithm Approx. Q-OptDL to improve the layout of a disk-resident suffix tree.

The Approx. Q-OptDL results in a layout that is guaranteed to have a performance within twice of

the optimal layout asymptotically. This is an extremely important result keeping in view the explosive

growth in the genomic data and accordingly the size of suffix trees. As a future work, we intend to

improve the layouts that are based on suffix trees but have different structures. Further, it would be

interesting to discover if the 2-approximation bound given in this chapter can be further improved.

Bibliography

[1] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational

Biology. Cambridge University Press, Cambridge, 1997.

[2] E. M. McCreight. A Space-Efficient Suffix Tree Construction Algorithm. JACM, 23(2), 1976.

[3] E. Ukkonen. Online Construction of Suffix-trees. Algorithmica, 14(3), 1995.

[4] W. I. Chang and E. L. Lawler. Approximate String Matching in Sublinear Expected Time. Proc. of

the IEEE Symp. on Found. of Comp. Science (FOCS), 1, pp. 116–124, MO, USA, 1990.

[5] A. L. Cobbs. Fast Approximate Matching using Suffix Trees. Proc. of the 6th Annual Symp. on

Combinatorial Pattern Matching (CPM), 1995.

[6] E. Ukkonen. Approximate String Matching over Suffix Trees. Proc. of the 4th Annual Symp. on

Combinatorial Pattern Matching (CPM), 1993.

[7] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of

Whole Genomes. Nucleic Acids Research, 27(11), 1999.

[8] E. Hunt, M. P. Atkinson, and R. W. Irving. A Database Index to Large Biological Sequences. Proc.

of the 27th Intl. Conf. on Very Large Databases (VLDB), 2001.

[9] K.-B. Schurman, and J. Stoye. Suffix Tree Construction and Storage with Limited Main Memory.

Technical Report 2003-06, Universitat Bielefeld, 2003.

[10] S. Tata, R. A. Hankins, and J. M. Patel. Practical Suffix Tree Construction. Proc. of the 30th Intl.

Conf. on Very Large Databases (VLDB), 2004.

[11] S. Bedathur and J. Haritsa. Engineering a Fast Online Persistent Suffix Tree Construction. Proc.

of the IEEE Intl. Conf. on Data Engg., (ICDE), 2004.

[12] A. A. Diwan, S. Rane, S. Seshadri, and S. Sudarshan. Clustering Techniques for Minimizing External

Path Length. Proc. Of the 22nd Intl. Conf. on Very Large Databases (VLDB), 1996.

143

BIBLIOGRAPHY i

[13] S. Baswana and S. Sen. Planar Graph Blocking for External Searching. Algorithmica, 34(3), 2002.

[14] M. Nodine, M. Goodrich, and J. Vitter. Blocking for External Graph Searching. Proc. Of the 12th

ACM Symp. on Principles of Database Systems (PODS), 1993.

[15] S. Thite. Optimum Binary Search Trees on the Hierarchical Memory Model. Master’s thesis, Univ.

of Illinois at Urbana-Champaign, 2001.

[16] J. Gil and A. Itai. How to Pack Trees. Journal of Algorithms, 32(2), 1999.

[17] S. J. Bedathur and J.R. Haritsa. Search-Optimized Suffix-Tree Storage for Biological Applications.

Proc. of 12th IEEE Intl. Conf. on High performance(HIPC), 2005.

[18] Benjarath Phoophakdee and Mohammed J. Zaki. Genome-scale Disk-based Suffix Tree Indexing.

ACM International Conference on Management of Data, Beijing, 2007.

Conclusion

In this thesis, some key challenges in clustering, classification, and dimensionality reduction have been

identified and appropriate solutions have been suggested. We believe this work has made some funda-

mental contributions to these areas. An attempt has been made to tackle extremely important problems

such as characterizing incremental learners, which are bound to assume even greater importance with a

spurt in stream applications. A general goal of this work is to unify the different data mining algorithms,

especially in the context of clustering, for instance, we established order independence as a desirable

property, much like Kleinberg’s properties of scale invariance, richness, and consistency. On one hand,

we have proposed faster solutions to existing problems, while on the other, we have tried to emphasize

the importance of incorporating the ideas fundamental to these problems, in their most pristine essence.

Even though, at certain places, there is a considerable overlap between these contrasting approaches,

still they can be coarsely delineated: the FS-SVMs, the robust variants of the Leader algorithm, and the

RACK algorithm belong to the former, whereas the SHARPC algorithm, the Image Feature Extractor

technique, the improved Suffix Tree layout, and the EPIC algorithm belong to the latter. Wherever

possible, we have tried to present a big picture by providing a framework for evaluating the different

algorithms. We have also suggested some open problems and future directions. We hope this thesis

fosters more interesting research in years to come.

ii

pragmatic data mining: novel paradigms for tackling key ... · pragmatic data mining: novel...

Documents