Pragmatic Data Mining: Novel Paradigms for
Tackling Key Challenges
A Project Report
Submitted in partial fulfilment of the
requirements for the Degree of
Master of Engineering
in
Faculty of Engineering
by
Vikas Kumar Garg
Computer Science & Automation (CSA)
Indian Institute of Science
BANGALORE – 560 012
June 2009
To
My Family
For
Their Unalloyed and Unconditional
Love, Prayers, and Support.
Acknowledgements
First and foremost, I owe all my endeavors and success to my family for enduring innumerable hardships
to progressively sustain and nurture me all through my life. They have sacrificed every bit of their
pleasure to provide me with the best facilities. I am falling short of words to express my indebtedness
for their love and blessings.
I am greatly indebted to my guide, MNM Sir, for his perspicacity in removing my doubts, willingness
to broaden the horizon of my knowledge, and ever availability despite being extremely busy due to his
commitments as the Chairman of our department. His contribution in shaping my career has been im-
mense and I admire him greatly for his simplicity and altruistic generosity. I am also profoundly grateful
to Narahari Sir, interactions with whom were instrumental in inspiring me towards inter-disciplinary
work. His guidance has been extremely fruitful and provided a terrific learning experience for me. I feel
really honored and privileged to have learnt the nuances of quality research from a person of his stature.
Frankly speaking, I have to pinch myself sometimes, to realize that I have had an opportunity to work
with MNM Sir and Narahari Sir, the persons I always wanted to emulate.
I would also like to thank Shevade Sir, whose extremely well-structured course on Pattern Recognition
went a long way in developing my fascination for meaningful research in related fields of intelligent
systems. My thanks are also due to my friends, especially Devansh Dikshit and Harsh Shrimal, who
made my stay at IISc a memorable and pleasant experience. My sincere thanks are also due to Mr.
Ramasuri Narayanam, who provided his valuable comments on my work, time and again.
I would be failing in my duty if I do not acknowledge the contribution of my critics, who inspired
me to put in a whole-hearted and determined effort into my research. During these two years, at times
when I got a bit lethargic, their criticism instigated the requisite spark in me to rejuvenate myself.
Vita
I received my B.E. in Information Technology from Netaji Subhas Institute of Tech-
nology (NSIT) (7 semesters) and Delhi College of Engineering (DCE) (1 semester), University of Delhi, India, in 2006. I worked as a Research and Design Engineer at VedaSoft Solutions, India, from July 2006 to July 2007. Since August 2007, I have been working as a graduate student in the Department of
Computer Science and Automation at the Indian Institute of Science, Bangalore. My research interests
can be broadly summarized as lying at the intersection of Intelligent Systems and Theoretical Computer
Science. More specifically, from an application point of view, I am fascinated with the fields of Artificial
Intelligence (Machine Learning, Data Mining, Computer Vision, NLP, Robotics), Game Theory, On-
line Algorithms, and Computational Neuroscience. I am also interested in Statistical Learning Theory,
Convex Optimization, and Complexity Theory.
Introduction
Data Mining, a branch of science closely associated with other sub-fields of Artificial Intelligence such
as Machine Learning and Pattern Recognition, is a relatively new field that has generated tremendous
interest among researchers over the last decade. A plethora of contributions in the literature has resulted in the emergence of data mining as an interdisciplinary discipline of paramount potential and significance. Although data mining is a field that encompasses a variety of subjects and constantly absorbs new topics,
the traditional areas of clustering and classification continue to garner considerable attention. Addi-
tionally, the endeavor to apply mining techniques to new applications, which involve high dimensional
data, has necessitated the design of improved techniques for dimensionality reduction. In this work, we
propose novel techniques to address some of the key issues pertaining to these areas.
Dimensionality reduction followed by feature derivation is a common technique for redefining a pattern from a high dimensional space in a lower dimensional space. Dimension reduction is important for operations such as clustering, classification, indexing, and searching. In our work, we focus on dimensionality reduction in the context of image data, where the N-dimensional data belonging to a manifold is extracted. Our approach highlights the fact that highly dynamic real life data cannot be considered locally linear, and that the information contained therein should be understood by conceptualizing the image in terms of outlines and contours. The outlines remain almost constant over the dynamic range of the image, while the contours keep changing and project the mood, gestures,
and other expressions. Chapter 1 provides a detailed exposition of our approach for dimensionality
reduction, feature extraction, and embedding.
Clustering or unsupervised classification of patterns into groups based on similarity is another very
well studied problem in pattern recognition, data mining, information retrieval, and related disciplines.
Clustering finds numerous direct practical applications in pattern analysis, decision making, and machine
learning tasks such as image segmentation. Besides, clustering also acts as a precursor to many data
processing tasks including classification. The explosive rate of data generation has reinforced the need
for incremental learning. In Chapter 2, we provide a framework for necessary and sufficient conditions
to obtain order independence.
The k-means algorithm is a very widely used clustering technique for scientific and industrial ap-
plications. Several variants of the k-means algorithm have been proposed in the literature. A major
limitation of the k-means algorithm is that the number of distance computations it performs is linear in k, the number of clusters. In Chapter 3, we propose an algorithm, RACK,
based on AVL trees, that effectively computes distances from O(lg k) cluster centers only, thereby con-
siderably improving the total time required for clustering. Simultaneously, RACK ensures that quality
of clustering does not degrade much compared to the k-means algorithm.
In addition, there are other shortcomings of k-means: (a) it may give poor results for an inappropriate
choice of k, (b) it may not converge to a globally optimal solution due to inappropriate initial selection
of cluster centers, and (c) it requires knowledge of the number of cluster centers, k, as an input.
To address these issues, we also propose a novel technique, SHARPC, based on the cooperative game
theoretic concept of Shapley value. The SHARPC algorithm not only obviates the need for specifying
k, but also gives optimal clustering in that it tends to minimize both the distances: (a) the average
distance between a point and its nearest cluster center, and (b) the average pair-wise distance within
a cluster. We note that the algorithms such as k-means only strive to minimize the average distance
between cluster centers and the points assigned to the corresponding clusters, without taking the intra-
cluster point-point distances into consideration. In that context, we believe, such algorithms do not
really capture the essence of clustering: grouping together points that are similar to each other. Our
game theoretic model is presented in Chapter 6.
On the other hand, the Leader algorithm is a popular single pass technique for clustering large
datasets on the fly. The Leader algorithm does not require prior information about the number k of
clusters. However, different orderings of the input sequence may result in different numbers of clusters. In
other words, Leader is highly susceptible to ordering effects and may give extremely poor quality of
clustering on skewed data orders. In Chapter 2, we propose robust variants of the Leader algorithm
that improve the quality of clustering, while still preserving the one pass property of Leader.
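For concreteness, the following is a minimal Python sketch of the standard single-pass Leader scheme described above (not of our robust variants, which are developed in Chapter 2); the function and parameter names are ours.

import numpy as np

def leader_clustering(points, delta):
    # Single-pass Leader clustering: each point joins the first (leader)
    # cluster whose representative lies within distance delta; otherwise
    # the point itself becomes a new leader.
    leaders = []
    assignments = []
    for x in points:
        for i, leader in enumerate(leaders):
            if np.linalg.norm(np.asarray(x) - np.asarray(leader)) <= delta:
                assignments.append(i)
                break
        else:
            leaders.append(x)
            assignments.append(len(leaders) - 1)
    return leaders, assignments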
We also address the unification of partition based clustering algorithms in Chapter 4. Partitional
algorithms form an extremely popular class of clustering algorithms. Primarily, these algorithms can be
classified into two sub-categories: a) k-means based algorithms that presume the knowledge of a suitable
k, and b) algorithms such as Leader, BIRCH, and DB-Scan, which take a distance threshold value, δ,
as an input. We propose a novel technique, EPIC, which is based on both the number of clusters, k,
and the distance threshold, δ. We also establish a relationship between k and δ, and demonstrate that
EPIC achieves better performance than the standard k-means algorithm. In addition, we present a
generic scheme for integrating EPIC into different classification algorithms to reduce their training time
complexity.
As already mentioned, in many pattern classification applications, data are represented by high di-
mensional feature vectors. There are two reasons to reduce dimensionality of pattern representation.
First, low dimensional representation reduces computational overhead and improves classification speed.
Second, low dimensionality tends to improve the generalization ability of classification algorithms. More-
over, limiting the number of features cuts down the model capacity and thus may reduce the risk of
overfitting. Therefore, to deal with the issue of rapidly increasing computational cost in applications
requiring processing large feature sets, we introduce the α-Minimum Feature Cover (α-MFC) problem
in Chapter 5 and prove it to be NP-Hard. We also propose Feature Subspace Support Vector Ma-
chines (FS-SVMs) to find an approximate solution to the α-MFC problem for efficient high dimensional
handwritten digit recognition.
The suffix tree is an immensely popular data structure for indexing colossal scale biological reposito-
ries. This is mainly due to its linear time and space complexity of construction in terms of the sequence
size, in addition to linear search complexity in terms of the pattern size. For most practical data mining
applications, the suffix tree needs to be disk-resident. To complicate the matter further, searching for
a pattern requires random traversal of suffix links connecting nodes across different pages that results
in increased I/O activity. A lot of research has been carried out into addressing this problem, primarily
focusing on building efficient disk-resident trees. One of the objectives of our work is to optimize the
layout of suffix trees with regard to assigning disk pages to tree nodes, thereby improving the search efficiency. In Chapter 7, we present a theoretical analysis of the problem based on our approach and give a bounded guarantee on performance.
Abstract
Over the last few decades, Data Mining has progressively evolved into an extremely significant field
for active research. Accordingly, with a tremendous spurt in the amount of real data being generated,
attention has shifted from the synthesis and accumulation of data to its analysis and application.
Many of the well-established techniques in the literature, pertaining to some integral machine learning
and pattern recognition areas such as Dimensionality Reduction, Clustering and Classification, have
been rendered ineffective as a result of this paradigm shift in focus. In this work, we present a com-
prehensive overview of the key challenges facing these areas, and offer new insights into overcoming
these challenges. In particular, we make the following contributions: we (a) propose a generic dimen-
sion reduction technique for extracting significant information, especially in the context of image data
depicting dynamic scenes, (b) characterize the notion of order independence in incremental learning,
(c) propose improvements in the prototype Leader algorithm to obtain better quality of clustering, (d)
introduce an algorithm, RACK, based on height balanced trees, which significantly improves upon the
time taken by the popular k-means algorithm, without compromising much on the quality of clustering,
(e) demonstrate how the integration of partition based clustering techniques can be achieved using an
algorithm, EPIC, for elegantly incorporating the domain knowledge, (f) show how an order independent
algorithm based on Shapley value, SHARPC, views the problem of clustering as a natural manifestation
of the interactions among the points in a convex game setting, and thereby improves the quality of
clustering, (g) introduce the Q-Optimal Disk Layout problem in the context of suffix trees, show it to
be NP-Hard, and suggest an algorithm Approx. Q-OptDL to obtain a disk layout that is guaranteed
to have a performance asymptotically within twice that of the optimal layout, and (h) introduce the α-MFC problem for addressing the ‘curse of dimensionality’ in classification, and propose Feature Subspace SVMs (FS-SVMs) for an approximate solution to the α-MFC problem in the context of high dimen-
sional handwritten digit recognition. Our experimental results strongly corroborate the efficacy of our
work.
Contents
Acknowledgements
Vita
Introduction
Abstract

1 Generic Non-Linear N-Dimension Reduction for Dynamic Scenes
  1.1 Motivation
    1.1.1 Nature of the local data
    1.1.2 Quality Degradation
    1.1.3 Application Domain and Flexibility
    1.1.4 Online handling of data
  1.2 Our Approach
    1.2.1 N-Dimensional Vectors
    1.2.2 Contours and Outlines
    1.2.3 Vectored Scene
    1.2.4 Schrodinger's Solution to measure non-linearity
    1.2.5 Cumulated Adjustment Factor
    1.2.6 Embedding using Mean and Variance
  1.3 Image Feature Extractor
    1.3.1 Algorithm 1
    1.3.2 Extent of Dimension Reduction
  1.4 Experimental Results
  1.5 Conclusion
  Bibliography

2 Characterizing Ordering Effects for Robust Incremental Clustering
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Group
    2.2.2 Incremental Learning
  2.3 Characterizing Ordering Effects in Incremental Learners
    2.3.1 Order Insensitive Incremental Learning through Commutative Monoids
    2.3.2 Dynamically Complete Set
  2.4 Robust Incremental Clustering
    2.4.1 The Leader Clustering Algorithm
    2.4.2 The Nearest Neighbor Leader (NN-Leader) Clustering Algorithm
    2.4.3 The Nearest Mean and Neighbor Leader (NMN-Leader) Clustering Algorithm
    2.4.4 The Apogee-Mean-Perigee Leader (AMP-Leader) Clustering Algorithm
  2.5 Experimental Results
  2.6 Conclusion/Future Work
  Bibliography

3 RACK: RApid Clustering using K-means algorithm
  3.1 Introduction
  3.2 Effective Clustering for large datasets
    3.2.1 Motivation
    3.2.2 The RACK Algorithm
    3.2.3 Bound on the Quality of Clustering
    3.2.4 Analysis of Time Complexity
  3.3 Experimental Results
  3.4 Conclusions
  3.5 Future Work
  Bibliography

4 EPIC: Towards Efficient Integration of Partitional Clustering Algorithms
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 k-means Algorithms
  4.3 The EPIC Algorithm
    4.3.1 Bound on Number of Distance Computations, Relation between τ and k, and Maximum Permissible Levels
  4.4 Application of EPIC to classification
    4.4.1 Integration of EPIC into Support Vector Machines (SVMs)
    4.4.2 Integration of Two-level EPIC into k-NNC
  4.5 Experimental Results
    4.5.1 Integration of Two-level EPIC into SVM
    4.5.2 Integration of Two-level EPIC into k-NNC
  4.6 Conclusions/Future Work
  Bibliography

5 Feature Subspace SVMs (FS-SVMs) for High Dimensional Handwritten Digit Recognition
  5.1 Introduction
  5.2 Motivation
  5.3 The α-Minimum Feature Cover (α-MFC) Problem
  5.4 Feature Subspace SVMs (FS-SVMs)
  5.5 A Greedy Algorithm for Approximating α-MFC
  5.6 Experimental Results
    5.6.1 Experimental Set-up
    5.6.2 Analysis of Results obtained using Algorithm 1
    5.6.3 Analysis of Results obtained using Algorithm 2
  5.7 Conclusion
  5.8 Future Work
  Bibliography

6 SHARPC: SHApley Value based Robust Pattern Clustering
  6.1 Introduction
    6.1.1 Motivation
    6.1.2 Contributions
  6.2 Preliminaries
    6.2.1 The Core
    6.2.2 The Shapley Value
    6.2.3 Convex Games
    6.2.4 Shapley Value of Convex Games
  6.3 Shapley Value based Clustering
    6.3.1 The Model
    6.3.2 An Algorithm for Clustering based on Shapley values
    6.3.3 Convexity of the Underlying Game
    6.3.4 SHARPC
  6.4 Order Independence of SHARPC
    6.4.1 Characterizing Ordering Effects in Incremental Learners
    6.4.2 Order Independence of SHARPC
  6.5 Hierarchical Clustering
  6.6 Comparison of SHARPC with k-means and Leader
    6.6.1 Satisfiability of desirable Clustering Properties
    6.6.2 Experimental Results
  6.7 Summary and Future Work
    6.7.1 Future Work
  Bibliography

7 A 2-Approximation Algorithm for Optimal Disk Layout of Genome Scale Suffix Trees
  7.1 Introduction
  7.2 Hardness of the Disk Layout Problem
    7.2.1 The Q-Optimal Disk Layout Problem
  7.3 Improving the Disk Layout
    7.3.1 Algorithm 1
    7.3.2 Performance Bound on Approx. Q-OptDL
  7.4 Conclusion/Future Work
  Bibliography

Conclusion
List of Tables
2.1 Wine Dataset Results
2.2 Iris Dataset Results
3.1 Spam Dataset (4601 examples, 58 dimensions)
3.2 Intrusion Dataset (494019 examples, 35 dimensions)
4.1 Training and testing timings for synthetic dataset 1 (using SVMlight)
4.2 Training and testing timings for synthetic dataset 1 (using SVMperf)
4.3 Training and testing timings for synthetic dataset 2
4.4 Comparison with CB-SVM
4.5 Results for k-NNC
6.1 Comparison between Leader, k-means, and SHARPC
6.2 Spam Dataset (4601 examples, 58 dimensions)
6.3 Wine Dataset (178 examples, 13 dimensions)
6.4 Cloud Dataset (1024 examples, 10 dimensions)
6.5 Network Intrusion Dataset (5000 examples, 37 dimensions)
List of Figures
1.1 Local non-linear behavior of a contour
1.2 Transitions at a Sample Image Point
1.3 Illustration of a contour and an outline
1.4 Contour sketches are sufficient to depict essential features
1.5 Formation of a Contour
1.6 Change in contour with change in vector orientation
1.7 Image Feature Extractor
1.8 Contour formation from in-phase waves of different dimensions
1.9 Energy interpretation of vectors
1.10 Selection of Contour Vectors
1.11 Interaction of low dimensional waves to form a higher dimensional wave
1.12 Calculation of CAF
1.13 Sample Image
1.14 Processed image incorporating even dimensions from vector magnitude 22 down to 0
1.15 Processed image incorporating vectors of magnitude 22
1.16 Processed image incorporating vectors of magnitude 16
1.17 Processed image incorporating vectors of magnitude 10
1.18 Final reduced image
1.19 Image Feature Extractor vs. LLE
1.20 Results with Standard Lip Images
2.1 Wine Dataset: (a) α vs. δ, and (b) β vs. δ
2.2 Iris Dataset: (a) α vs. δ, and (b) β vs. δ
2.3 Intrusion Dataset: (a) α vs. δ, and (b) β vs. δ
3.1 k-means may not converge to a solution even after many iterations
3.2 A new data point is more likely to belong to a cluster with a large number of data points
4.1 OCR 1 vs 6: (a) accuracy vs. threshold (b) support vectors vs. threshold
4.2 OCR 3 vs 8: (a) accuracy vs. threshold (b) support vectors vs. threshold
5.1 Iris Dataset: The need for segmentation of feature space
5.2 Segmentation of feature space for handwritten digit data
5.3 Different approaches for SVM Classification
5.4 Steps in the modified classification process
5.5 The proposed feature reduction step
5.6 Ensemble of classifiers using segments of feature space
5.7 Features near the periphery contain less discriminative information than those deep inside
5.8 Sample patterns of handwritten digit data
5.9 Similarity vs. Block Size
5.10 Accuracy vs. Block Size
5.11 (Sample Dataset) Accuracy(%) results on training sets of different size
5.12 (MNIST) Accuracy(%) results on training sets of different size
5.13 (CEDAR) Accuracy(%) results on training sets of different size
5.14 (USPS) Accuracy(%) results on training sets of different size
5.15 (Sample Dataset) Total relative time taken by Algorithm 1 on training sets of different size
5.16 (MNIST) Total relative time taken by Algorithm 1 on training sets of different size
5.17 (CEDAR) Total relative time taken by Algorithm 1 on training sets of different size
5.18 (USPS) Total relative time taken by Algorithm 1 on training sets of different size
5.19 Accuracy vs. Number of Features
5.20 Reduction in Accuracy(%) vs. Reduction in Number of Features
5.21 Algorithm 2 vs. Random Selection
5.22 (Sample Dataset) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
5.23 (MNIST) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
5.24 (CEDAR) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
5.25 (USPS) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
5.26 Time Performance
5.27 (Sample Dataset) Relative time taken by Algorithm 2 for training sets of different size
5.28 (MNIST) Relative time taken by Algorithm 2 for training sets of different size
5.29 (CEDAR) Relative time taken by Algorithm 2 for training sets of different size
5.30 (USPS) Relative time taken by Algorithm 2 for training sets of different size
5.31 (Sample Dataset) Total relative time taken by Algorithm 2 for training sets of different size
5.32 (MNIST) Total relative time taken by Algorithm 2 for training sets of different size
5.33 (CEDAR) Total relative time taken by Algorithm 2 for training sets of different size
5.34 (USPS) Total relative time taken by Algorithm 2 for training sets of different size
6.1 Hierarchical Clustering
6.2 β − C plot (Spam Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means
6.3 α − C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means
6.4 β − C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means
6.5 Wine: Potential (α) does not vary much with permutations (p) for a fixed threshold (δ)
Chapter 1
Generic Non-Linear N-Dimension
Reduction for Dynamic Scenes
Dimension reduction followed by feature derivation is a common technique for redefining a pattern from a higher dimensional space in a lower dimensional space. This leads to better classification and proves helpful in understanding non-linear image information. Human beings have a very fast intelligence mechanism. Any new dimension reduction technique cannot be practically useful unless it takes less processing time than the existing techniques, in addition to improving the overall quality of the final reduced image. Recognition is very fast while working with reduced dimensions; at the same time, the features inherent to the scene should not be lost. While studying various dimension reduction techniques, it has been noticed that although the techniques are getting faster, the qualitative information content in the d-dimension representation is going down. Dimension reduction is important for operations such as clustering, classification, indexing, and searching
[1, 2]. Data projection [3] has been found fundamental to human perception. There are many methods for estimating the intrinsic dimensionality of data without actually projecting it, such as Bennett's method [4] and Fukunaga and Olsen's algorithms [5, 6] based upon space partitioning and PCA. In [7], a statistical approach has been proposed. Pettis et al. developed an algorithm [8] that averages the distances to each point's k nearest neighbors. A nearest neighbor estimator was suggested by Verveer and Duin [9]. Bruske and Sommer's approach was based upon topology preserving maps [10]. Apart from these, data projection approaches have been used by many [11], including Sammon's Non-Linear Mapping (NLM) [12] and Kohonen's Self Organizing Map (SOM) [13]. Kruskal noted the similarity between NLM and MDS [14], while Niemann improved the convergence of NLM [15]. Curvilinear Component Analysis (CCA) [16] improves upon NLM in the sense that it ignores distances longer than a particular threshold. A few methods based upon data projection, such as Isomap [17], Local Linear Embedding (LLE) [18], and
Curvilinear Distance Analysis (CDA) [19] have also been suggested in the literature. Isomap finds the
geodesic distances and applies MDS; CDA is CCA with geodesic distances while LLE assumes that
the local data is linear. All such methods depend upon the neighborhood information. Such methods
fail when the data is spread over the image in multiple clusters [20]. In the case of
Isomap, for computing a low-dimensional embedding of a set of high dimensional data points, two
issues are important. First, the basic approach presented was akin to the methods described in the
context of flattening cortical surfaces using geodesic distances [21] and multidimensional scaling [22].
However, these ideas generalize to arbitrary dimensionality if the connectivity and metric information
of the manifold are correctly supplied. Second, due to topological considerations, this approach should
be used after careful preprocessing of the data. In the application domain of cortical flattening, it
is necessary to check manually for connectivity errors, so that points nearby in 3-space (for example,
on opposite banks of a cortical sulcus) are not taken to be nearby in the cortical surface. If such
care is taken, this method represents the preferred method for quasi-isometric cortical flattening. The
novelty of the Isomap technique is the way it defines the connectivity of each data point via its nearest
Euclidean neighbors in the input consisting of many images of a person’s face observed under different
pose and lighting conditions, in no particular order. These images can be thought of as points in a
high-dimensional vector space, with each input dimension corresponding to the brightness of one pixel
in the image. Although the input dimensionality may be quite high, the meaningful structure out of
these images has many fewer independent degrees of freedom. Yang proposed a distance preserving
method [23] by using the triangulation method [24]. The method improves upon the earlier projection based techniques, but falls short because the data is N-dimensional, and building a neighborhood graph while preserving the previously embedded points still depends upon the size of d. Recently, neural networks have
also been used to reduce dimensionality [25]. There, the authors describe a non-linear generalization of
PCA that uses an adaptive, multilayer encoder network to transform the high-dimensional data into a
low-dimensional code and a reverse decoder network. A Support Vector Machine based approach has
also been proposed [26]. However, these techniques require excessive time to train and converge.
In our work, the N-dimensional data belonging to a manifold is extracted. The image is divided into outlines and contours, and the N-dimensional vectors are determined. For a particular expression/gesture, the range of vector movement is determined. Schrodinger's equation is solved for the vectors belonging to each individual dimension under the constraints of the dynamic vector range. Finally, the image is embedded using a mean-variance plot.
1.1 Motivation
Dimension reduction and feature extraction from complex data is a subject of paramount importance.
The need for a fast extraction technique that preserves the essential features of the underlying data
cannot be overemphasized, especially when the scene is non-static. In this regard, the need of the hour
is a fast and quality preserving algorithm that is applicable over a wide range of fields.
1.1.1 Nature of the local data
The underlying concept is based upon the N-dimensional nature of natural images. Here, N signifies
that in general more than two dimensions are required to correctly express the image in motion in
terms of features. Earlier theories focused on dimension reduction considering a few principal axes but
failed because of the multi-dimensional nature of natural scenes. Human beings have been bestowed with a fast learning mechanism that infers from data by looking at scenes from different angles: the greater the number of principal viewing directions, the better the analysis of the scene. A sphere of viewing directions is nearest to the natural way of scene analysis. This dimension related behavior of the scene can be easily
understood by considering Fig. 1.1. As shown, the actual contour shape from A to S is lost due to the
local linear assumption. Furthermore, different contours in the vicinity would be projected along the
same hyperplane irrespective of their disparate local orientations. The actual orientation of the contour
in N dimensional space gets lost since we consider magnitudes in a considerably low dimensional space.
We note that between any two points there may be a large number of distinct contours. And even when
the geodesic distances are taken, still the existing techniques are incapable of bringing the distinguishing
features which convey the information change. Thus, it becomes clearly evident that for a contour to
be extracted satisfactorily, local non-linearity has to be accounted for.
Figure 1.1: Local non-linear behavior of a contour
Dynamic data depicting non-static scenes cannot be considered linear, as its shape keeps changing; hence the emphasis on locally non-linear behavior. Most applications are concerned with dynamic scenes. The basic message is that for a face in a non-static scene
to be recognized efficiently when seen from any angle, a technique which considers the inherent local
non-linearity of the face is required.
1.1.2 Quality Degradation
Another important issue is quality degradation as a consequence of dimensionality reduction. Since
most of the existing techniques incorporate a few principal axes, they suffer from the problem of severe
quality degradation. To preserve the quality of the original image, we need to consider vectors in an
N-dimensional generalized space rather than in a restricted domain comprising a few selected viewing
directions. We emphasize that our technique preserves the essential features of an image while dispensing
with superfluous data.
1.1.3 Application Domain and Flexibility
The main objective behind developing any reduction technique is the extent of its usefulness in specific
and/or generic fields. The existing techniques are limited in their practical use because of restrictiveness
in their underlying theory. We need a reduction technique that has wide applicability, flexibility, and
stability to perform satisfactorily on diverse data from the same or different fields of science. Our technique lends itself to extensive use in medical image processing, space exploration, face identification, gesture and activity recognition, and spam-resistant image-indexed engines, to name a few. For example, applying our non-linear technique to a sequence of pictures, we could have all the information about war and other emergency scenarios, enabling us to take fast-track actions. Likewise, we could
have far better authentication systems that need to deal with only the important features. Similarly,
we could identify spam etc. on the Internet, design better brain-mapping devices and recognize the
gestures and expressions being conveyed. In short, we put forward a generic technique that caters to
a wide spectrum of applications. We envisage our work as a potential break-through in diverse fields
including brain mapping, medical image processing, signal processing, biometrics, defense operations,
gesture and activity recognition, and spam-resistant image indexing etc.
1.1.4 Online handling of data
A technique designed for dynamic scenes must be able to process data in real time. Most of the current
techniques require all the data to be present before deciding on the best direction(s) for projecting the
data. However, our feature extractor can process the data on-the-fly, eliminating the need for having all
the data beforehand. Only the data points adjacent to the point being considered have to be present and
stream buffers can be used. Significant dimension reduction is achieved during incremental step 1, and
our algorithm obviates the need for availability of entire data before processing. Additional reduction
can be achieved by an optional second pass.
1.2 Our Approach
1.2.1 N- Dimensional Vectors
We devise an efficient technique for N-dimension data reduction taking into account the inherent non-
linearity of the contours. Rather than considering only a few dimensions, we base our theory on a
concept of transitions that incorporate the different dimensions involved. A transition refers to a change
in intensity as we move from one image point to any of its adjacent points (Fig. 1.2).
Figure 1.2: Transitions at a Sample Image Point
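For illustration only, a small Python sketch (ours, not part of the original algorithm statement) that enumerates the transition magnitudes at a pixel, assuming the image is a 2-D numpy array of intensities and an 8-neighbourhood:

NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1),
              ( 0, -1),          ( 0, 1),
              ( 1, -1), ( 1, 0), ( 1, 1)]

def transitions_at(img, r, c):
    # Absolute intensity changes (transition magnitudes) from pixel (r, c)
    # to each of its in-bounds neighbours in the eight directions.
    h, w = img.shape
    mags = []
    for dr, dc in NEIGHBOURS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < h and 0 <= cc < w:
            mags.append(abs(int(img[rr, cc]) - int(img[r, c])))
    return mags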
1.2.2 Contours and Outlines
The outlines are those curves and boundaries which more or less remain unchanged with varying ex-
pression of a human face, for instance, the shape of the teeth, the periphery enclosing a nostril etc. The
contours on the other hand are temporary features formed as a result of change in speech and emotion
etc. and vanish when that particular expression is over e.g. the elevation/depression of the forehead,
different shapes involved in lip movement, different curves that occur on cheeks when a person expresses
himself. As shown in Fig. 1.3, the shape of the lips, when a person speaks something, is a contour
whereas the shape of the teeth is an outline.
Figure 1.3: Illustration of a contour and an outline
This concept is analogous to the way an image is sketched or expressed by a cartoonist in terms of a
few lines and shades. The major lines which may not change in the consecutive scenes are the outlines,
while the others which undergo modifications in the successive scenes are the contours. As shown in
Fig. 1.4, just by seeing a contour sketch of Albert Einstein or Mother Teresa, we can identify the great
personalities. We do not need to consider redundant image data to recognize a face.
Figure 1.4: Contour sketches are sufficient to depict essential features
Most of the existing techniques account for outlines to an extent but fail miserably when it comes
to identifying contours. The basic reason for this failure is that they ignore the underlying vectors involved in the formation of a contour. A contour is formed by a number of N-dimensional vectors emanating from
different directions as indicated in Fig. 1.5.
Figure 1.5: Formation of a Contour
Any contour can be perfectly analyzed by considering vectors of different dimensions arranged one
above another along the curve from its one end to another. Every vector is associated with a transition
in a given direction. The various vectors in tandem form a contour as shown. Note that our definition of
a dimension is slightly different from that implicit in many subspace methods that consider a dimension
as a feature or an attribute. In our context, each dimension contains all those vectors that have the
same magnitude of transition, irrespective of their spatial orientations. Thus, for instance, a transition
from image intensity 5 to 10 corresponds to the same dimension as a transition from intensity 23 to 28.
The existing theories neglect the vector nature of a contour and restrict themselves to only intensity
levels involved over an assumed local-linear region, which is unsatisfactory since a contour is formed by
orientation of the constituent vectors and the contour changes as these vectors change their orientation
following the image dynamics, as shown in Fig. 1.6. It should be noted that (a) the contours do not change in a still scene, (b) for a dynamic scene these particular contours play a significant role, because any change incurred by them results in a change in expression or gesture, and (c) the dimensionality of the image can be considerably reduced, while still preserving the essential features, if these small changes are taken care of.
Figure 1.6: Change in contour with change in vector orientation
1.2.3 Vectored Scene
We analyze the vectors of a contour as we move from one position to adjacent positions over all the dimensions of the contour, and hence follow the very basis of contour formation. For instance, in an 8-bit image, even though there are 256 distinct possible vector magnitudes involved, not all of them are important. This is primarily because a point is involved in multiple vectors in its eight possible
directions and a point is very rarely involved in high magnitude vectors in all these directions; there
is, generally, at least one direction where the vector magnitude is lying in a lower range. Thus we
can restrict ourselves to vectors up to a low upper range only. This low upper range is more or less
constant for one application but it may vary from one application to another; it tends to be higher for
non-linear data compared to linear data. It is observed in most images that a point involved in high
magnitude vector in one direction is also, in general, involved in low magnitude vector(s) in some other
direction(s). This behavior is displayed in natural images used for processing. Thus, we dispense with
very high magnitude vectors and process each point for lower dimensions and include/exclude that point
from a particular dimension depending on whether it is involved in at least one corresponding vector.
Furthermore, we have observed empirically that a point involved in a particular vector in one direction
is also likely to have a vector of nearly the same magnitude in another direction; for instance, a point involved in a vector of magnitude 2 in one direction is likely to have vector(s) of magnitude 3 and/or 1 in other direction(s). Thus, we may consider fewer dimensions by taking only even or only odd vectors, without
affecting much the quality of the image. We thus consider different transition frames or dimensions,
with each different magnitude vector defining a different dimension and all vectors with equal magnitude
included in the same dimension. This enables us to consider N-dimensional image features in terms of
different dimensions as shown in Fig. 1.7.
Figure 1.7: Image Feature Extractor
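To make this notion of a dimension concrete, the following illustrative sketch (our own, reusing the transitions_at helper from the earlier sketch) groups image points by the transition magnitudes they participate in, keeping only magnitudes up to a low upper range and, optionally, only even or only odd magnitudes:

from collections import defaultdict

def dimension_membership(img, upper, parity=None):
    # Map each considered transition magnitude ("dimension") to the set of
    # points involved in at least one vector of that magnitude.
    #   upper  -- low upper range of magnitudes taken into account
    #   parity -- 0 to keep only even magnitudes, 1 for only odd, None for all
    h, w = img.shape
    dims = defaultdict(set)
    for r in range(h):
        for c in range(w):
            for m in transitions_at(img, r, c):   # helper from the previous sketch
                if 1 <= m <= upper and (parity is None or m % 2 == parity):
                    dims[m].add((r, c))
    return dims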
1.2.4 Schrodinger’s Solution to measure non-linearity
The general steady-state Schrodinger’s equation of a particle is
Grad(ψ) + (8π²m(E − U)/h²) ψ = 0
where ψ, m, U , and E denote the wave function, mass, potential energy, and total energy of a par-
ticle while h is the Planck’s constant. Grad(ψ) denotes the gradient of the wave function. Solving
Schrodinger’s equation provides various parameters like energy of the particles, their momentum etc.
The most important parameter that the Schrodinger’s equation provides is the extent of the non-linearity
given by wave function. The quantity whose variations make up matter waves is called the wave func-
tion, ψ. The value of ψ associated with a moving object at a particular point in space and at a particular
time instant is related to the likelihood of finding that object there at that time. Once Schrodinger’s
equation has been solved for a particle in a given physical situation, the resulting wave function contains
all the information about the particle like the expected value of the particle position, particle energy,
angular momentum, and linear momentum, as permitted by Heisenberg's uncertainty principle, which states that it is impossible to know both the exact position and the exact momentum of an object at the same time. For example, in one dimension, the expectation value ⟨G(x)⟩ of any quantity G(x), for instance the potential energy U(x), that is a function of the position x of an unrestricted particle described by the wave function ψ, is given as

⟨G(x)⟩ = ∫ G(x) |ψ|² dx
A contour as shown in Fig. 1.8 can be thought of being formed from a group of energy particles whose
motion is random and spread over various dimensions. Employing the wave concept, each dimension
is associated with a corresponding wave, and for a contour to be formed, all the related waves should be continuous and in phase with each other. The non-linearity of a contour is a direct implication of the constituent dimensions which satisfy this condition.

Figure 1.8: Contour formation from in-phase waves of different dimensions

The existing techniques neglect this local interaction of particles and hence fail to account for the non-linearity. Schrodinger's equation is an
important tool to ascertain this nonlinearity. It is important to note that techniques like Wavelets and
Fourier Transform are also based on similar ideas, where any signal is shown to be expressible in terms
of infinite sinusoidal waveforms. Analyzing the Schrodinger’s equation, there are two terms on the LHS.
The first term on the LHS considers the gradient or the direction of the maximum energy change. The
second term again is a second-order term that considers the kinetic energy, given by (E − U), multiplied by the wave function. Rewriting Schrodinger's equation,

Grad(ψ) = −(8π²m(E − U)/h²) ψ
Thus, for the LHS to be equal to RHS, the (E−U) factor must take the maximum value (magnitude) to
account for the gradient term. Now, a particle that is sufficiently energetic to undergo a high transition
can easily undergo lower transitions as well. Thus, we need to substitute the value of maximum length
vector for Schrodinger's equation to hold, as shown in Fig. 1.9.

Figure 1.9: Energy interpretation of vectors

Therefore, to obtain the non-linearity corresponding to a particular image, we solve Schrodinger's equation at every point of a contour
substituting the highest dimension vector magnitude for (E − U) at that point in a unit length region
(adjacent positions in Fig. 1.10), discarding those adjacent points which do not lie on the contour.
Figure 1.10: Selection of Contour Vectors
1.2.5 Cumulated Adjustment Factor
An important consideration needs to be taken care of. Since we substitute the maximum vector value
in the Schrodinger’s equation, we account for corresponding maximum change in intensity in all the
possible directions. In other words, the lower intensity changes are accounted for as well. Thus the non-
linearity involved in a higher vector should be greater than the non-linearity associated with a lower
vector. Therefore, we need to introduce an adjustment term, which we call the Cumulated Adjustment
Factor (CAF). We further note from the ideas presented earlier, from wave considerations, that a high
dimension wave is formed by in-phase interaction of the lower dimension waves, which leads to an additive effect on the amplitude and, consequently, on the wave function for the higher transition (Fig. 1.11).

Figure 1.11: Interaction of low dimensional waves to form a higher dimensional wave

Thus, we define CAF in such a way that the CAF for a higher vector is the sum of the CAFs of all the lower dimensions and the wave function value obtained from Schrodinger's equation for that vector magnitude, with the CAF for the lowest odd/even vector being equal to its probability function (Fig. 1.12).
The CAF for a higher dimension is the sum of its probability function and the CAF of all the lower
dimensions. Thus a higher value of CAF may be thought of as corresponding to higher energy. Note
that we do not consider zero as the lowest dimension since it represents no change and thus provides no
useful information.
Figure 1.12: Calculation of CAF
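A minimal sketch of how the CAF could be accumulated, assuming the per-dimension wave function (probability) values have already been obtained from the Schrodinger step; the dictionary-based representation is our own simplification:

def cumulated_adjustment_factor(psi_values):
    # psi_values: dict mapping each considered dimension (vector magnitude)
    # to the wave-function / probability value obtained for it.
    # CAF(lowest) = its probability value; CAF(higher) = its own value plus
    # the sum of the CAFs of all lower considered dimensions.
    caf = {}
    running_sum = 0.0
    for dim in sorted(psi_values):
        caf[dim] = psi_values[dim] + running_sum
        running_sum += caf[dim]
    return caf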
1.2.6 Embedding using Mean and Variance
We count the points involved in highest dimension and then move on to lower dimension values, without
considering those points in lower dimension which are involved in a higher transition, in accordance with
the concept that a high transition corresponds to sufficient energy to undergo lower changes in intensity.
Finally, we take the product of each CAF and its corresponding count and then calculate the mean and
variance for the reduced image. It has been shown in [27] that the relative spatial relationships existing
among the components present in an image are preserved by a set of triples consisting of mean, variance
and the total number of “keys”. The importance of mean and variance lies in the fact that the value
of weighted mean gives an idea of the average CAF and hence average transition energy involved in a
particular image frame whereas the variance indicates the variation in energy about the mean. Knowing
the mean, we get an idea as to which single dimension conveys maximum information which can be
easily retrieved based on the integral transition value that is nearest to the one corresponding to the
weighted mean. This is an added utility of our technique as we can extract the single self-contained
dimension that is a near optimal trade-off between amount of dimension reduction and the extent of
feature preserving. Further, when we speak different phonemes, different face shapes and energies are
involved and hence we can get an idea of the order of the energy involved while speaking a phoneme.
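The following rough sketch reflects our reading of this embedding step: count points per dimension from the highest dimension downwards, skipping points already counted in a higher dimension, weight the counts by CAF, and summarize the frame by the weighted mean and variance; the exact weighting used here is our own assumption.

def embed_mean_variance(dims, caf):
    # dims: dict dimension -> set of points (e.g. from dimension_membership)
    # caf:  dict dimension -> CAF value
    seen, counts = set(), {}
    for d in sorted(dims, reverse=True):
        fresh = dims[d] - seen          # skip points counted in a higher dimension
        counts[d] = len(fresh)
        seen |= fresh
    n = sum(counts.values())
    if n == 0:
        return 0.0, 0.0
    mean = sum(caf[d] * counts[d] for d in counts) / n
    var = sum(counts[d] * (caf[d] - mean) ** 2 for d in counts) / n
    return mean, var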
1.3 Image Feature Extractor
We now present an algorithm for the image feature extractor.
1.3.1 Algorithm 1
FeatureExtract(X, StepSize, First, Last)
X is the input image matrix with m rows and n cols. First and Last represent the highest and lowest
considered vector magnitudes, respectively. Only those vectors in this range are considered whose magnitude differs from First by some multiple of StepSize.
Step 1
Take a matrix Flag of same size as X.
Initialize Flag = 0 for all index positions;
for each point a in X, do
for each point b adjacent to a, do
if (((First - |X(a) - X(b)|) mod StepSize = 0) and (Last ≤ |X(a) - X(b)| ≤ First))
Flag(a) = Flag(b) =1;
end if
end for
end for
/∗ At this point, Flag has value 1 at only those points which are involved in forming significant vectors ∗/
Step 2
2.1 Find all components C in X connecting those points that have corresponding Flag entry set to 1.
2.2 Discard all components in C that consist of only a single vector.
Step 3
3.1 Apply Schrodinger's equation to every component vector V to measure the extent of non-linearity.
3.2 Obtain the value of CAF for each vector V.
Step 4
Embed the image using a variance−mean plot.
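As a concrete rendering of Step 1, the following Python sketch implements the significant-vector flagging pass; Steps 2-4 are omitted since they rely on the connected-component and Schrodinger machinery described above, and the function name is ours.

import numpy as np

def flag_significant_vectors(X, step_size, first, last):
    # Step 1 of FeatureExtract: set Flag to 1 at every point involved in at
    # least one vector whose magnitude lies in [Last, First] and differs
    # from First by a multiple of StepSize.
    m, n = X.shape
    flag = np.zeros((m, n), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for a in range(m):
        for b in range(n):
            for dr, dc in offsets:
                r, c = a + dr, b + dc
                if 0 <= r < m and 0 <= c < n:
                    mag = abs(int(X[a, b]) - int(X[r, c]))
                    if last <= mag <= first and (first - mag) % step_size == 0:
                        flag[a, b] = 1
                        flag[r, c] = 1
    return flag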
1.3.2 Extent of Dimension Reduction
Consider an image matrix (X)m×n. Define an adjacency function f : X ×X → {0, 1} as
f(x, y) = 1, x and y are adjacent to each other
= 0, otherwise
Let a and b denote the lowest and highest vector values considered in Step 1. Further let D be defined
as set of transition values between significant vectors. Define
Chapter 1. Generic Non-Linear N-Dimension Reduction for Dynamic Scenes 13
Rx(a, b,D) = {xi ∈ X|f(x, xi) = 1 ∧ (a ≤ |x− xi| ≤ b) ∧ |x− xi| = kd,
k ∈ Z+ ∪ {0}, d ∈ D,x ∈ X}
Clearly Rx(a, b,D) contains the significant vector points adjacent to x with transitions lying in range
a to b and in multiples of gap(s) d belonging to set of differences D. Let Vmax and Vmin denote the
highest and lowest intensity values in X. Then,
0 ≤ d ≤ Vmax − Vmin ∀d ∈ D.
Note that, if a = Vmin, b = Vmax and D = {0, 1, . . ., Vmax − Vmin}, then
Rx(a,b,D) = {xi ∈ X|f(x, xi) = 1, x ∈ X}
Taking |Rx(a,b,D)| as the number of significant points adjacent to x, we obtain the total number of
significant points z in matrix X as
\[ z = \frac{1}{2}\sum_{x \in X} |R_x(a, b, D)|. \]
The number of points in $X_{m \times n}$ is $mn$. Therefore, the data reduction achieved using Step 1 is
\[ data_{r1} = \frac{mn - z}{mn} = 1 - \frac{\sum_{x \in X} |R_x(a, b, D)|}{2mn}, \tag{1.1} \]
whereby the data reduction for $R_x(V_{min}, V_{max}, \{0, 1, \ldots, V_{max} - V_{min}\})$ is given by
\[ data_{r1} = 1 - \frac{2mn}{2mn} = 0, \]
that is, we end up without any reduction, as expected.
Let us call the new matrix that contains only the significant points Z. Define
\[ R'_z(a', b', D') = \{\, z_i \in Z \mid f(z, z_i) = 1 \,\wedge\, (a' \le |z - z_i| \le b') \,\wedge\, |z - z_i| = b' - kd',\ k \in \mathbb{Z}^{+} \cup \{0\},\ d' \in D',\ z \in Z \,\}, \]
where $a' \ge a$ and $b' \le b$, $a$ and $b$ being the range values for matrix X.
At the end of step 1, the data reduction, datar2 is given by
\[ data_{r2} = \frac{\frac{1}{2}\sum_{x} |R_x(a, b, D)| - \frac{1}{2}\sum_{z} |R'_z(a', b', D')|}{\frac{1}{2}\sum_{x} |R_x(a, b, D)|} = 1 - \frac{\sum_{z} |R'_z(a', b', D')|}{\sum_{x} |R_x(a, b, D)|}. \tag{1.2} \]
Now, using (1.1),
\[ \frac{\sum_{x} |R_x(a, b, D)|}{2mn} = 1 - data_{r1} \;\Rightarrow\; \sum_{x} |R_x(a, b, D)| = 2mn(1 - data_{r1}). \]
Substituting in (1.2), we get
\[ data_{r2} = 1 - \frac{\sum_{z} |R'_z(a', b', D')|}{2mn(1 - data_{r1})} \;\Rightarrow\; \sum_{z} |R'_z(a', b', D')| = 2mn(1 - data_{r1})(1 - data_{r2}), \]
\[ \Rightarrow\; \frac{1}{2}\sum_{z} |R'_z(a', b', D')| = mn(1 - data_{r1})(1 - data_{r2}). \tag{1.3} \]
The number of points in matrix X is $mn$, and the number of points left after reduction is $\frac{1}{2}\sum_{z} |R'_z(a', b', D')| = mn(1 - data_{r1})(1 - data_{r2})$ [using (1.3)]. Therefore, the total reduction obtained by the Image Feature Extractor of Algorithm 1 is
\[ data_r = \frac{mn - mn(1 - data_{r1})(1 - data_{r2})}{mn} \;\Rightarrow\; data_r = 1 - (1 - data_{r1})(1 - data_{r2}). \]
Now, we proceed to obtain a mathematical expression for reduction in number of dimensions. Let the
number of dimensions in the matrices X and Rx be Dim1 and Dim2 respectively. If the number of
dimensions at the end of step 1 is Dim, then
Dim1 ≤ Vmax − Vmin + 1
Further, since two significant points adjacent to a particular point may also form a vector between
themselves if they are adjacent to each other,
Dim2 ≤ 2(b− a+ 1),
and
Dim ≤ 2(b′ − a′ + 1)
Further, Dim ≤ Dim2 ≤ Dim1.
Therefore, dimension reduction after taking into account Rx is,
\[ dim_{r1} = \frac{Dim_1 - Dim_2}{Dim_1} = 1 - \frac{Dim_2}{Dim_1} \;\Rightarrow\; Dim_2 = Dim_1(1 - dim_{r1}). \tag{1.4} \]
Now, the dimension reduction at the end of step 1 is
\[ dim_{r2} = \frac{Dim_2 - Dim}{Dim_2}, \]
and the total dimension reduction is
\[ dim_r = \frac{Dim_1 - Dim}{Dim_1} = \frac{(Dim_1 - Dim_2) + (Dim_2 - Dim)}{Dim_1}, \]
which, using (1.4), becomes
\[ dim_r = \frac{Dim_1\, dim_{r1} + Dim_2\, dim_{r2}}{Dim_1} = dim_{r1} + \frac{Dim_2}{Dim_1}\, dim_{r2} = dim_{r1} + (1 - dim_{r1})\, dim_{r2} = dim_{r1}(1 - dim_{r2}) + dim_{r2}. \]
Note that $data_r$ corresponds to the reduction obtained when a dimension is defined as a feature or an attribute, as in many subspace methods. On the other hand, $dim_r$ corresponds to our definition of a dimension as a vector magnitude.
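As a quick illustration with numbers of our own choosing (not taken from the experiments), the stage-wise reductions compose multiplicatively rather than additively: if $data_{r1} = 0.4$ and $data_{r2} = 0.5$, then
\[ data_r = 1 - (1 - 0.4)(1 - 0.5) = 0.7, \]
and similarly $dim_{r1} = 0.4$ and $dim_{r2} = 0.5$ give $dim_r = 0.4 + (1 - 0.4)(0.5) = 0.7$.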
1.4 Experimental Results
To substantiate the theory, we conducted several experiments. First, we show the results of applying the Image Feature Extractor to extract significant information from a human face, illustrating how our generic algorithm works in the case of human facial data. The steps for dimension reduction applied to a sample human face (Fig. 1.13) are as follows.
Determining the StepSize
We reiterate an important observation: a point involved in a particular vector in one direction is also likely to have a vector of nearly the same magnitude in another direction, as explained earlier. Thus
we may consider fewer dimensions by taking even or odd vectors, without affecting much the quality of
Figure 1.13: Sample Image
the image. Hence, StepSize for this case can be taken as 2 (include either even or odd vectors). Fig.
1.14 shows the processed image that incorporates even dimensions from vector magnitude 22 down to
0 (both inclusive). Note that there is a loss of information, in particular around the eyes and the lips;
however this loss is negligible.
Figure 1.14: Processed image incorporating even dimensions from vector magnitude 22 down to 0
Determining the value of First
As discussed previously, a point is associated with multiple vectors along its eight possible directions but is rarely associated with high-magnitude vectors in all of them; there is generally at least one direction in which the vector magnitude lies in a lower range. Thus we can reduce the number of vectors further without much affecting the contours and outlines. We further note that the points involved in a high intensity change become progressively fewer as the corresponding vector magnitude increases, and the number of points included after 10-11 even (or odd) dimensions is negligible (for highly non-linear data such as a human face in motion). Thus First can be taken as 22 or some higher value. Fig. 1.15 and Fig. 1.16 show the processed images that incorporate vectors corresponding to magnitudes 22 and 16, respectively. Note that the image corresponding to vector magnitude 16 contains more information than the one corresponding to magnitude 22.
Figure 1.15: Processed image incorporating vectors of magnitude 22
Figure 1.16: Processed image incorporating vectors of magnitude 16
Determining the value of Last
We note that the contours and outlines capture most of the significant information represented by an image. Only the vectors involved in contour and outline formation are important; the others are responsible for redundancy. Further, the higher dimensions (vectors with a greater intensity change) provide more useful information than the lower ones. Very low vector transitions, close to zero, contribute redundant data. Thus we can dispense with the low dimensions. For the human face, Last can be taken as a value close to 10. However,
the values of StepSize, First and Last may vary from one application to another. Fig. 1.17 shows the
processed image that incorporates vectors corresponding to magnitude 10.
Discarding singly connected points
The points which are involved in only one vector are discarded as they only add to superfluous data.
Thus now we are left with only contours and outlines which can be further processed for applications.
Fig. 1.18 shows the final N-dimensional reduced image incorporating all dimensions from 10-22 as
per our original sample image. Comparing it with the original sample image, it is clearly evident that Algorithm 1 extracts all the significant content from an image. More data can be dispensed with, though at the cost of a loss of information. The data associated with the dimensionally reduced image
Figure 1.17: Processed image incorporating vectors of magnitude 10
Figure 1.18: Final reduced image
is input to Schrödinger's equation to measure the extent of non-linearity.
Image Feature Extractor Applied to Standard Face and Lip Images
We now show the results of applying the Image Feature Extractor to standard face images. For the sake of comparison, we also provide the results obtained by applying LLE to the same face expressions. An investigation of the results of these techniques clearly indicates that the Image Feature Extractor outperforms LLE. Various
face expressions like seriousness, laugh, smile and forceful expression are well-separated from each other
in case of the proposed Image Feature Extractor, as shown in Fig. 1.19.
Similarly, our results are shown for standard lip images in Fig. 1.20. The results obtained clearly
indicate that data in dynamic scenes varies locally non-linearly and should not be linearly approximated.
Another important criterion is the execution time. The Image Feature Extractor algorithm is a
real-time, online technique since it requires knowledge of points in the immediate vicinity only without
considering the rest of the data. In other words, just by analyzing the sequence of face (part of face)
movement over a very short duration, we can know with reasonable accuracy, the message being conveyed
or the emotion being expressed by a person in a short interval of time. Though the above theory has
been explained primarily with facial data, it can be used equally well in many other application domains. The only thing that needs to be done is to determine the range of the vector magnitudes and the step size to be
Figure 1.19: Image Feature Extractor vs. LLE
Figure 1.20: Results with Standard Lip Images
considered for a specific application.
1.5 Conclusion
The approximation of non-linearity of dynamic features using local linear concepts by considering
geodesic distances and applying the method of least squares is not useful from an application point of view.
Our technique overcomes this limitation. It processes the image while preserving the non-linear features
or contours that are formed in a multi-dimensional space. We first design an N-dimensional Image
Feature Extractor, which shows the contribution of individual vectors or dimensions (as we call them) in forming the face features and how these different dimensions interact with each other to produce the original image. We then perform N-dimension data reduction and elimination of redundant
data but still preserving the contours and the outlines. Then we design an algorithm that isolates the
significant information in different connected components: the contours and the outlines. We apply the
Schrödinger's equation to measure the extent of non-linearity in each dimension in terms of the wave function, and then make the adjustments encompassing energy considerations to obtain the CAF. Finally,
we obtain the weighted mean and variance of the image. Different image expressions can therefore be
separated from each other and similar ones grouped together. The results on a sample face image besides
the standard lip and face images strongly indicate the efficacy of the proposed approach.
Bibliography
[1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[2] A. K. Jain, R. Duin, and J. Mao. Statistical Pattern Recognition : A Review. IEEE Trans. Pattern
Analysis and Machine Intelligence (PAMI), 22, pp. 4–37, 2000.
[3] H. S. Seung and D. Lee. The Manifold Ways of Perception. Science, 290, pp. 2268–2269, 2000.
[4] R. S. Bennet. The Intrinsic Dimensionality of Signal Collections. IEEE Trans. Information Theory,
15(5), pp. 517–525, 1969.
[5] K. Fukunaga and D. R. Olsen. An Algorithm for Finding Intrinsic Dimensionality of Data. IEEE
Trans. Computers, 20(2), pp. 176–183, 1971.
[6] D. R. Olsen and K. Fukunaga. Representation of Non-Linear data surfaces. IEEE Trans. Computers,
22(10), pp. 915–922, 1973.
[7] G. V. Trunk. Statistical Estimation of the Intrinsic Dimensionality of a Noisy Signal Collection.
IEEE Trans. Computers, 25, pp. 165–171, 1976.
[8] K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes. An Intrinsic Dimensionality Estimator
from Near Neighbor Information. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
1(1), pp. 25–37, 1979.
[9] P. J. Verveer and R. P. Duin. An evaluation of intrinsic dimensionality estimators. IEEE Trans.
Pattern Analysis and Machine Intelligence (PAMI), 17(1), pp. 81–86, 1995.
[10] J. Bruske and G. Sommer. Intrinsic Dimensionality Estimation with Optimally topology preserving
maps. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 20(5), pp. 572–575, 1998.
[11] G. Biswas, A. K. Jain, and R. C. Dubes. Evaluation of Projection Algorithms. IEEE Trans. Pattern
Analysis and Machine Intelligence (PAMI), 3(6), pp. 701–708, 1981.
[12] J. J. W. Sammon. A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Computers,
18(5), pp. 401–409, 1969.
[13] T. Kohonen. Self-Organizing Maps. Springer, Second Edition, 1997.
[14] J. Kruskal. Comments on a Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Com-
puters, 20(12), pp. 1614, 1971.
[15] H. Niemann and J. Weiss. A fast converging algorithm for non-linear mapping of high dimensional
data to a plane. IEEE Trans. Computers, 28, pp. 142–147, 1979.
[16] P. Demartines and J. Herault. Curvilinear Component Analysis: A Self-Organizing Neural Network
for Nonlinear Mapping of Data Sets. IEEE Trans. Neural Networks, 8(1), pp. 148–154, 1997.
[17] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear
Dimensionality Reduction. Science, 290, pp. 2319–2323, 2000.
[18] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding.
Science, 290, pp. 2323–2326, 2000.
[19] J. A. Lee, A. Lendasse, N. Donckers, and M. Verleysen. A Robust Nonlinear Projection Method.
Proc. Eighth European Symp. Artificial Neural Networks (ESANN 2000), pp. 13–20, 2000.
[20] M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios, and N. Koudas. Non-linear dimensionality
reduction techniques for classification and visualization. Proceedings of the eighth ACM SIGKDD
international conference on Knowledge Discovery and Data Mining, 2002.
[21] M. P. Young and S. Yamane. Sparse population coding of faces in the inferotemporal cortex. Science,
256, pp. 1327–1331, 1992.
[22] R. N. Shepard. Multidimensional scaling, tree fitting and clustering. Science, 210, pp. 390–398,
1980.
[23] L. Yang. Distance-Preserving Projection of High-Dimensional Data for Nonlinear Dimensionality
Reduction. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 26(9), pp. 1243–1246,
2004.
[24] R. C. T. Lee, J. R. Slagle, and H. Blum. A Triangulation Method for the Sequential Mapping of
Points from N-Space to Two-Space. IEEE Trans. Computers, 26(3), pp. 288–292, 1977.
[25] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks.
Science, 313, pp. 504–507, 2006.
[26] Q. Tao, D. Chu, and J. Wang. Recursive Support Vector Machines for Dimensionality Reduction.
IEEE Trans. Neural Networks, 19(1), 2008.
[27] P. Punitha and D. S. Guru. An effective and efficient exact match retrieval scheme for image
database systems based on spatial reasoning: A logarithmic search time approach. IEEE Transactions
on Knowledge and Data Engineering (TKDE), 18(10), pp. 1368–1381, 2006.
Chapter 2
Characterizing Ordering Effects for
Robust Incremental Clustering
2.1 Introduction
Clustering or unsupervised classification of patterns into groups based on similarity is a very well studied
problem in pattern recognition, data mining, information retrieval and related disciplines. Clustering
finds numerous direct practical applications in pattern analysis, decision making and machine learning
tasks such as image segmentation. Besides, clustering also acts as a precursor to many data processing
tasks including classification [1]. Almost all the algorithms proposed for clustering require availability
of the entire dataset before any processing is done. Lately, there has been an explosion in the rate
at which data is generated. The traditional clustering algorithms that presumed primary memory to
be sufficient for containing the complete dataset have been rendered ineffective and the disk I/O is
increasingly becoming a serious bottleneck. Further, in case of applications such as network intrusion
detection and stock market analysis, a huge amount of dynamic stream data is generated. This stream
data enters the clustering system continuously and incrementally and hence the clusters derived must
also be incrementally refined [14]. Therefore, there is an imminent need for devising efficient algorithms
that process each data instance only once and on the fly.
There has been a slight confusion in the literature regarding the notion of incremental learning. The
most widely accepted definition of incremental learning is the one given in [2]. An incremental learner
inputs one training experience at a time, does not reprocess any previous experiences and retains only
one knowledge structure in the memory. Strictly enforcing these constraints rules out the incremental
nature of a number of algorithms such as Candidate Elimination Algorithm in Version Space [3], learning
structural descriptions [4] and decision tree induction [5].
Most of the existing incremental clustering algorithms are sensitive to the order of input data. That
is, given a set of data objects, such an algorithm might return different clusterings based on the order
of presentation of these objects. An incremental learner is said to be order dependent if it produces
different knowledge structures based on the sequence in which examples are provided as input. In [6],
two necessary properties of order independent incremental learners are outlined: 1) they are able to
focus on an optimal hypothesis from a set of current potential ones, and 2) they maintain sufficient
information so that they do not forget any potential hypothesis. Incremental variants of a clustering
algorithm COBWEB [7] have been proposed, but they suffer from the assumption of statistical independence of
attributes in the underlying probability distribution. Moreover, the statistical representation makes it
expensive to update and store the clusters. In [8], the authors throw some light on the ordering effects in
incremental clustering. Order independence of a concept directed clustering approach using knowledge
structures has been established in [9]. In our work, we provide necessary and sufficient conditions to
achieve order independence.
Another key problem in the clustering domain concerns determining a suitable number k of output
clusters when k is not input as a parameter to the clustering algorithm. The knowledge of an appropri-
ate k is imperative for effectively solving the k-means problem as in [13]. Many algorithms have been
proposed in the literature to overcome the limitation of pre-defining the number of clusters. Some of
these algorithms are weakly incremental in that they only make one pass over the dataset. In [11], the
authors propose an incremental algorithm BIRCH to dynamically cluster incoming multi-dimensional
metric data points, using a Clustering Feature (CF) Tree. The Shortest Spanning Path (SSP) [12] al-
gorithm has been used for data reorganization and automatic auditing of records. In [7], the author
proposed COBWEB, an incremental conceptual clustering algorithm that was subsequently used for
many engineering tasks. The Leader algorithm [15] is an immensely popular single pass incremental
algorithm for clustering large datasets and particularly attractive for applications such as stream data
analysis that require fast processing. A shortcoming of the Leader algorithm is that it is highly suscep-
tible to ordering effects, that is, it might yield entirely different clusterings based on the order in which
the data points are input into the algorithm. In this chapter, we also propose robust variants of the
Leader algorithm that yield more robust clusters, in the sense that the total squared distance between the points within a cluster is smaller than that obtained using the Leader algorithm.
2.2 Preliminaries
In this section, we review some of the technical background required for the rest of this chapter.
2.2.1 Group
A non-empty set of elements G together with a binary operation ∗ (called the product), defined on G,
is said to form a group (G, ∗) if the following axioms are satisfied [10],
a) a, b ∈ G ⇒ a ∗ b ∈ G (closed).
b) a, b, c ∈ G ⇒ a ∗ (b ∗ c) = (a ∗ b) ∗ c (associative law).
c) There exists an element e ∈ G such that a ∗ e = e ∗ a = a ∀a ∈ G (the existence of an identity
element in G).
d) For every a ∈ G there exists an element a−1 ∈ G such that a ∗ a−1 = a−1 ∗ a = e (the existence of
inverses in G).
Abelian Group
A group G is said to be abelian (or commutative) if for every a, b ∈ G, a ∗ b = b ∗ a.
Commutative Monoid
A monoid is a non-empty set of elements with a binary operation that satisfies axioms a), b) and c) of
a group. A monoid that satisfies the commutative property is called a commutative monoid.
2.2.2 Incremental Learning
A learner L is incremental if L (i) inputs one training experience at a time, (ii) does not reprocess any
previous experience, and (iii) retains only one knowledge structure in memory [2]. The first condition
avoids considering as incremental those learning algorithms that process many instances at a time by storing
all the instances seen thus far and executing the procedure on all of them. Batch learning systems fail
to satisfy this condition. The second condition rules out those systems, for example artificial neural
networks, that reprocess the old data with the new data to generate a new model. The underlying
idea is to make sure that the time required to process each experience remains almost constant. The
final constraint requires the algorithm to memorize exactly one definition for each concept and rules
out algorithms like CE [3] that retain in memory a set of competing hypotheses summarizing the data.
These hypotheses may grow exponentially with the number of training experiences.
A learner L is order sensitive or order dependent if there exists a training set T on which L exhibits
an order effect. That is, given a set of data objects, such an algorithm might return different clusterings
based on the order of presentation of these objects. An incremental learner is order sensitive if it
produces different knowledge structures based on the sequence in which examples are provided as input.
There exist at least three different levels at which order effects can occur: attribute level, instance level,
and concept level [2]. In our work, we focus on mitigating order effects at the instance level. Our
incremental algorithm maintains a single memory knowledge structure consisting of a constant number
of abstractions (independent of the input dataset) summarizing the data objects or instances seen so
far.
We define an abstraction as a part of the knowledge structure being maintained in the memory
such that (i) each abstraction represents summary of members of one cluster, and (ii) the abstraction
corresponding to a cluster is updated only when a new instance is assigned to the cluster. In addition, we
restrict the number of abstractions to the number of required clusters so that only a constant number of
abstractions are maintained at any time irrespective of the number of training experiences. The current
abstraction Ak is updated to abstraction Ak+1 when the corresponding cluster is assigned training
experience xk+1, without reprocessing any previous experience. For the rest of this chapter, we use
the terms data instances, experiences, objects and points interchangeably. Likewise, the terms order
dependence and order sensitivity shall convey the same meaning.
2.3 Characterizing Ordering Effects in Incremental Learners
In the following discussion, we claim that any order insensitive incremental algorithm operates as a
structure that can be abstracted in terms of a commutative monoid.
2.3.1 Order Insensitive Incremental Learning through Commutative Monoids
We first show how the axioms of an abelian group are satisfied by a simple order independent incremental
algorithm that finds the linear sum of n d-dimensional data points.
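A minimal sketch of this running-sum learner is given below (the NumPy representation and the names are our own); since vector addition is closed, associative and commutative, the zero vector is the identity and −x is the inverse of x, the maintained abstraction is independent of the order in which the experiences are processed.

import numpy as np

def update(abstraction, x):
    # Incorporate one training experience x into the current abstraction (the running sum).
    return abstraction + x

d = 3
points = [np.array([1., 0., 2.]), np.array([0., 5., 1.]), np.array([2., 2., 2.])]

A_forward = np.zeros(d)                # identity element: the empty abstraction
for x in points:
    A_forward = update(A_forward, x)

A_reverse = np.zeros(d)
for x in reversed(points):             # process the same experiences in reverse order
    A_reverse = update(A_reverse, x)

assert np.allclose(A_forward, A_reverse)   # order independence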
Theorem 1. Any order independent incremental algorithm must maintain a knowledge structure A of
abstractions together with an operator ∗ defined on it, such that (A, ∗) is a commutative monoid.
Proof. Consider the impact of violation of any of the properties of a commutative monoid on order
sensitivity of the underlying incremental algorithm.
Closure Suppose a, b ∈ A does not imply a ∗ b ∈ A. Then the structure obtained by incorporating a new instance into the current knowledge structure might not belong to the range of legal structures, hence ruling out valid processing of any further instances in the input sequence.
Associativity The violation of associativity clearly implies the presence of more than one possible memory structure for the same input sequence, depending on the order of processing. In fact, a Catalan number of memory structures are possible depending on the order of processing the input sequence.
Identity The presence of an identity element is required to maintain idempotency and consistency during three phases: (i) the initial structure prior to processing any input data, (ii) some intermediate structure
where the algorithm waits for more examples, and (iii) once all the input instances are exhausted.
Commutativity Violation of commutativity results in obtaining any one of a potential O(N!) final
memory structures on processing an N-input dataset, making the algorithm order sensitive.
Now, we introduce the concept of a dynamically complete set to completely characterize the order
independence in incremental learners.
2.3.2 Dynamically Complete Set
Let f be a function defined as f : A × X→ A, where X = {x1, x2, ..., xn} represents the set of input data
instances and A represents the set of all valid memory structures. Then X is said to be a dynamically
complete set with respect to f and A if the following conditions are satisfied for k ∈ {0,1,..., n− 1},
1) f(Ak, xk+1) = Ak+1;
2) f(Ak, x′l) = f(. . . f(f(. . . f(f(A0, x1), x2), . . . , xl−1), xl+1), . . . , xk), 1 ≤ l < k, with f(Ak, φ′) = Ak, where φ represents the empty or null instance and x′l denotes the removal operation for disregarding xl from the abstraction being considered;
3) f(f(Ak, x′l), xl) = f(Ak, φ) = Ak, where 1 ≤ l ≤ k;
4) f(f(Ak, xl), xm) = f(f(Ak, xm), xl), where l, m ∈ {k + 1, k + 2, . . . , n − 1}.
A dynamically complete set incorporates the idea of a removal operation which imparts ability to
the current memory structure to return to a previous structure by deleting information gained through
subsequent insertions. In fact, it has a stronger effect in that it can generate all the memory structures
that can be obtained using the experiences seen thus far, by deleting one or more of these data instances
in any order. Thus, for any sequence of input instances, ordering effects are implicitly taken care of. The
basic notion of a complete set can be related to optimal sub-structure property of dynamic programming:
if a set S of instances is insensitive to order effects, then every subset of S must also be insensitive to
order. Thus, for an algorithm to be truly incremental, it should be amenable not only to an addition
operation for moving to a new memory structure but also to a deletion operation for reverting back to
any of the structures possible using a subset of the instances seen so far.
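As a toy illustration under our own representation, the running-sum learner discussed earlier can be equipped with a removal operation, making its input set dynamically complete in the sense of the conditions listed above:

import numpy as np

def f(A, x, remove=False):
    # Apply one experience to the abstraction; remove=True plays the role of x'.
    return A - x if remove else A + x

X = [np.array([1., 2.]), np.array([3., 0.]), np.array([0., 4.])]
A0 = np.zeros(2)                       # empty abstraction

A3 = A0
for x in X:
    A3 = f(A3, x)                      # property 1): A_k -> A_{k+1}

# property 3): removing x_2 and re-inserting it recovers A_3
assert np.allclose(f(f(A3, X[1], remove=True), X[1]), A3)

# property 4): insertions commute
assert np.allclose(f(f(A0, X[0]), X[1]), f(f(A0, X[1]), X[0]))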
Theorem 2. Presence of a dynamically complete set X on (A, f) provides a sufficient condition for
order independence in any incremental algorithm that takes X as an input and uses A and f .
Proof. The theorem holds trivially for a singleton set X, starting from A0. For |X| ≥ 2, we proceed
as follows. Consider the case when X consists of two elements x1 and x2. Then by property 4) from
definition of a dynamically complete set, f(f(A0, x1), x2) = f(f(A0, x2), x1) and the theorem holds.
Now, suppose X = {x1, x2, x3}. Again using 4) considering 2-subsets of X, we get the following results,
(a) f(f(A0, x1), x2) = f(f(A0, x2), x1)
(b) f(f(A0, x1), x3) = f(f(A0, x3), x1), and
(c) f(f(A0, x2), x3) = f(f(A0, x3), x2)
Applying (a), (b) and (c) on 3-subsets of X, we obtain
f(f(f(A0, x1), x2), x3) = f(f(f(A0, x2), x1), x3) (2.1)
f(f(f(A0, x1), x3), x2) = f(f(f(A0, x3), x1), x2) (2.2)
f(f(f(A0, x2), x3), x1) = f(f(f(A0, x3), x2), x1) (2.3)
Also, using property 2) from the definition of a dynamically complete set,
f(A3, x′2) = f(f(A0, x1), x3)
⇒ f(f(f(A0, x1), x3), x2) = f(f(A3, x′2), x2)
Using property 3) of a dynamically complete set, we get
f(f(f(A0, x1), x3), x2) = A3
Likewise, using properties 2) and 3) in conjunction with (2.1), (2.2) and (2.3),
f(f(f(A0, x2), x1), x3) = f(f(f(A0, x1), x2), x3) = f(f(A3, x′3), x3) = A3
f(f(f(A0, x2), x3), x1) = f(f(A3, x′1), x1) = A3
f(f(f(A0, x3), x1), x2) = f(f(f(A0, x1), x3), x2) = f(f(A3, x′2), x2) = A3
f(f(f(A0, x3), x2), x1) = f(f(f(A0, x2), x3), x1) = f(f(A3, x′1), x1) = A3
Finally, using property 1) repeatedly,
f(f(f(A0, x1), x2), x3) = f(f(A1, x2), x3) = f(A2, x3) = A3
Clearly, order independence is seen to hold for a 3-element set X. Let us assume that order independence
holds for some r-element set where r ≥ 3. Then, consider the set of r+1 elements. The instance xr+1 can
be at any position p, where 1 ≤ p ≤ r+1. Then by property 4), f(f(A(p−1), x(p)), x(p+1)) = f(f(A(p−1),
x(p+1)), x(p)) where A(p) refers to structure obtained by processing instances up to position p and x(p)
refers to instance at position p. Thus, xr+1 gets shifted to right by one position. Applying property 4)
iteratively in this manner, xr+1 reaches the end of the sequence whereby using property 1) in conjunction
with our assumption for r-element set yields f(Ar, xr+1) = Ar+1 for all sequences of length r + 1. The
theorem follows from the principle of mathematical induction.
The notion of a dynamically complete set relates closely to that of a commutative monoid. It is readily
seen that initial memory structure A0 in the wake of Theorem 2 takes care of all the requirements of a
commutative monoid. In fact, we can prove the following stronger result than Theorem 2.
Theorem 3. Presence of a dynamically complete set X on (A, f) provides both a necessary and sufficient
condition for order independence in any incremental algorithm that takes X as an input and uses A and
f .
Proof. Theorem 1 states that all properties of a commutative monoid are necessary for order insensitivity
of an incremental algorithm. Since a dynamically complete set satisfies all these properties, we only need
to show that the removal operation is also necessary for order independence. We prove this requirement
by means of a contradiction. Suppose, the removal operation does not hold. Then, for some Ak and
some xl,
f(f(Ak, x′l), xl) ≠ Ak (2.4)
Now two cases are possible,
Case 1: xl = φ
This indicates absence (even if transient) of any input when the current abstraction is Ak . Then,
f(f(Ak, φ′), φ) ≠ Ak, which implies the fallacy that the state of the algorithm changes in the absence of any input.
Case 2: xl 6= φ
We have, using property 2) from the definition of a dynamically complete set,
f(f(Ak, x′l), xl) = f(f(. . . f(f(. . . f(f(A0, x1), x2), . . . , xl−1), xl+1), . . . , xk), xl)
Using property 4), we get
f(f(Ak, x′l), xl) = f(. . . f(f(f(. . . f(f(A0, x1), x2), . . . , xl−1), xl), xl+1), . . . , xk)
which yields
f(f(Ak, x′l), xl) = Ak (2.5)
Now, (2.4) and (2.5) together imply that Ak ≠ Ak, an obvious contradiction.
Therefore, the presence of a removal operation is also necessary for obtaining insensitivity to order.
Thus, Theorem 3 holds since we have already shown the sufficiency condition in Theorem 2.
An important implication of Theorem 3 is that order insensitive incremental algorithms can be
proposed provided a suitable function that operates on one memory abstraction at a time is defined
over the domain of all the input instances.
2.4 Robust Incremental Clustering
Designing a strictly order independent incremental algorithm is desirable but might be difficult to achieve in practice. However, if we relax the definition of incremental behavior to one-pass algorithms, we can improve the quality of clustering by incorporating slight modifications. In the following discussion, we show how slight modifications to the Leader algorithm can substantially improve the quality of clustering.
2.4.1 The Leader Clustering Algorithm
The Leader algorithm [15] is a popular incremental clustering method whereby the input instances are
clustered based on a specified threshold parameter. The Leader algorithm starts by assuming the first
data point as representative of a cluster. Successive incoming instances are assigned one by one to this
cluster, provided they lie within a distance δ of the cluster representative in the d-dimensional space.
If a data point is farther than the specified threshold δ, it becomes a representative of a new cluster.
The subsequent data points are assigned to the cluster whose representative element is nearest to it and
within δ distance. The process is iteratively repeated till all input points are clustered.
Leader Algorithm
Input: The input dataset X = {x1, x2, ..., xn} to be clustered and a control distance parameter δ.
Output: A set of clusters with their centers.
Initialize the set of cluster centers, C = φ;
C = C⋃{x1};
for i = 2 to n, do
find cluster r whose center xCr ∈ C is closest to xi;
compute the distance, d(xi, xCr ) between xi and xCr ;
if d(xi, xCr ) ≤ δ
assign xi to cluster r;
else
C = C⋃{xi};
end if
end for
return C as the set of cluster centers;
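A minimal Python sketch of the Leader algorithm as described above follows; the list-based representation, the Euclidean distance and the function name are our own illustrative choices.

import numpy as np

def leader(X, delta):
    # Single-pass Leader clustering with threshold delta.
    centers = [X[0]]                       # the first point starts the first cluster
    labels = [0]
    for x in X[1:]:
        d = [np.linalg.norm(x - c) for c in centers]
        r = int(np.argmin(d))              # nearest existing center
        if d[r] <= delta:
            labels.append(r)               # assign to cluster r
        else:
            centers.append(x)              # x becomes a new cluster representative
            labels.append(len(centers) - 1)
    return centers, labels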
It is obvious that the Leader clustering algorithm does not require prior information about the
number k of clusters. However, a skewed ordering of the input sequence may result in extremely poor
quality of clustering. In our work, we propose robust variants of the Leader algorithm while essentially
preserving its one pass benefit. The basic idea behind these algorithms is that the ordering effects can
be ameliorated to an extent by considering more than one point during the decision process and thereby
the quality of clustering can be improved. We first propose the Nearest Neighbor Leader Algorithm
which is a simple modification of the Leader algorithm.
2.4.2 The Nearest Neighbor Leader (NN-Leader) Clustering Algorithm
The conventional Leader algorithm considers each of the existing clusters one by one and assigns an
incoming data point to the cluster whose representative is nearest to this point. A disadvantage of this
approach is that the clustering decision is based entirely on the center representatives while neglecting
the role played by non-center members of these clusters. Algorithm 1 outlines the NN-Leader approach.
The algorithm first determines the closest neighbor xr of the incoming, unlabeled data point xi and then
computes the distance between xi and xCr , where xCr is the representative of the cluster containing xr.
If this distance lies within the stipulated threshold δ, then xi is assigned to xCr , otherwise, xi becomes
a representative of a new cluster.
Algorithm 1: NN-Leader
Input: The input dataset X = {x1, x2, ..., xn} to be clustered and a control distance parameter δ.
Output: A set of clusters with their centers.
Initialize the set of cluster centers, C = φ;
C = C⋃{x1};
for i = 2 to n, do
find the center xCr of the cluster assigned to point xr
that is closest to xi;
compute the distance, d(xi, xCr ) between xi and xCr ;
if d(xi, xCr ) ≤ δ
assign xi to cluster with center xCr ;
else
C = C⋃{xi};
end if
end for
return C as the set of cluster centers;
2.4.3 The Nearest Mean and Neighbor Leader (NMN-Leader) Clustering Algorithm
As mentioned earlier, the susceptibility of the Leader algorithm to skewness in the data can be primarily
attributed to the exclusion of non-representative cluster elements. The NN-Leader algorithm strives
to overcome this limitation by including the immediate neighbor of the unlabeled data point in the
decision process. However, the role of the nearest neighbor is restricted to identifying a potential cluster;
the extent of proximity of the neighbor is not taken into consideration. For many pattern clustering
applications, a more robust heuristic would be to find the sum of distances of the incoming point to
more than one point of each cluster and assign the incoming point to the cluster with the minimum
overall sum of the said distances. The Nearest Mean and Neighbor Leader (Algorithm 2) extends the
notion of the NN-Leader algorithm.
Specifically, the NMN-Leader algorithm proceeds in the following way. For each unlabeled data
point, all clusters whose representatives respect the δ threshold, are considered as potential clusters.
Then the distance of the unlabeled point to a nearest point in each of the potential clusters is computed.
The data point is assigned to the potential cluster with the minimum sum of the distance to the cluster center and the distance to the nearest point in that cluster; the unlabeled data point becomes a new cluster representative if no such potential cluster exists. Thus the NMN-Leader can be used to
ameliorate one of the main drawbacks of the Leader and NN-Leader algorithms, namely, that only
a single distance value is taken into consideration while deciding a cluster for an incoming point. It is
important to note that the mean or center, in the foregoing discussion, refers to a cluster representative
unlike the general notion of an average value.
Algorithm 2: NMN-Leader
Input: The input dataset X = {x1, x2, ..., xn} to be clustered and a control distance parameter δ.
Output: A set of clusters with their centers.
Initialize the set of cluster centers, C = φ;
C = C⋃{x1};
for i = 2 to n, do
flag = 0;
for each cluster Cj in C, do
find representative point of Cj , xCjm ;
compute the distance, dji1 =d(xi, xCjm );
if dji1 > δ
continue with next cluster Cj in C;
else
compute the distance, dji2 = d(xi, xjr), between
xi and its nearest neighbor in Cj ;
flag = 1;
end if
end for
if flag = 1
assign xi to the cluster Cj with minimum value of dji1 + dji2 ;
else
C = C⋃{xi};
end if
end for
return C as the set of cluster centers;
2.4.4 The Apogee-Mean-Perigee Leader (AMP-Leader) Clustering Algorithm
The Nearest Mean and Neighbor Leader (NMN-Leader) algorithm considers the sum of distances to
each candidate cluster center and the nearest neighbor in that cluster. A more robust approach would
involve considering the farthest (apogee), representative (mean) and nearest (perigee) data points of
each cluster. Thus, a more robust clustering is likely to be achieved by applying the AMP-Leader clustering
algorithm (Algorithm 3).
Algorithm 3: AMP-Leader
Input: The input dataset X = {x1, x2, ..., xn} to be clustered and a control distance parameter δ.
Output: A set of clusters with their centers.
Initialize the set of cluster centers, C = φ;
C = C⋃{x1};
for i = 2 to n, do
flag = 0;
for each cluster Cj in C, do
find representative point of Cj , xCjm
compute the distance, dji1 =d(xi, xCjm );
if dji1 > δ
continue with next cluster Cj in C;
end if
flag = 1;
find the point in Cj , xCjp , which is closest to xi;
compute the distance, dji2 = d(xi, xCjp );
find the point in Cj , xCja , which is farthest from xi;
compute the distance, dji3 =d(xi, xCja );
dij = dji1 + dji2 + dji3 ;
end for
if flag = 1
assign xi to cluster Cj , Cj ∈ C, with minimum
value of dij ;
else
C = C⋃{xi};
end if
end for
return C as the set of cluster centers;
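The sketch below illustrates the AMP-Leader assignment rule under our own dictionary-based cluster representation; dropping the apogee term recovers the NMN-Leader rule, and dropping both the apogee and perigee terms recovers the plain Leader rule. The function name and data layout are illustrative assumptions.

import numpy as np

def amp_assign(x, clusters, delta):
    # Among clusters whose representative lies within delta of x, pick the one
    # minimizing the sum of distances to the representative (mean), the nearest
    # member (perigee) and the farthest member (apogee).
    best_j, best_score = None, np.inf
    for j, c in enumerate(clusters):       # c = {'center': ..., 'members': [...]}
        d1 = np.linalg.norm(x - c['center'])
        if d1 > delta:
            continue                       # not a potential cluster
        dists = [np.linalg.norm(x - m) for m in c['members']]
        score = d1 + min(dists) + max(dists)
        if score < best_score:
            best_j, best_score = j, score
    return best_j                          # None means x starts a new cluster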
2.5 Experimental Results
An important question needs to be addressed: how to quantify robustness of the clusters obtained using
different algorithms, in order to decide the suitability of one over the others? For our experiments, we
measure the robustness of an algorithm in terms of the following two parameters:
1. $\alpha = \sqrt{\frac{\sum_{i} |x_i - x_k|^2}{n}}$, where $x_k$ is the representative of the cluster $C_k \in C$ to which $x_i \in X = \{x_1, x_2, \ldots, x_n\}$ is assigned, and
2. $\beta = \sqrt{\frac{1}{|C|}\sum_{k} \frac{\sum_{x_i, x_j \in C_k} |x_i - x_j|^2}{|C_k|(|C_k| - 1)}}$.
It is easy to see that α and β can be used to measure the robustness of clustering algorithms satisfactorily:
α quantifies the deviation of data points from the representative element while β captures the scatter
among different elements assigned to the same cluster. The lower the values of α and β, the higher the
quality of clustering. Our experimental results support the intuition that the principal reason behind
the low quality of clustering by Leader algorithm is a high value of β, since the Leader algorithm only
tries to minimize the distance between the data points and their respective cluster centers.
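For reference, a direct Python computation of these two measures is sketched below; the (points, labels, centers) representation is our own choice, and the pairwise sum is doubled so that ordered pairs are counted, matching the normalization |Ck|(|Ck| − 1).

import numpy as np
from itertools import combinations

def alpha_beta(X, labels, centers):
    # alpha: root-mean-square deviation of points from their cluster representatives.
    n = len(X)
    alpha = np.sqrt(sum(np.linalg.norm(X[i] - centers[labels[i]]) ** 2
                        for i in range(n)) / n)
    # beta: average (over clusters) of the mean squared scatter within a cluster.
    per_cluster = []
    for k in range(len(centers)):
        members = [X[i] for i in range(n) if labels[i] == k]
        if len(members) < 2:
            per_cluster.append(0.0)        # a singleton cluster contributes no scatter
            continue
        s = sum(np.linalg.norm(a - b) ** 2 for a, b in combinations(members, 2))
        per_cluster.append(2.0 * s / (len(members) * (len(members) - 1)))
    beta = np.sqrt(sum(per_cluster) / len(centers))
    return alpha, beta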
We conducted extensive experimentation to measure the robustness of different algorithms proposed
in this work relative to the Leader algorithm. We present results of using the Leader, NN-Leader,
NMN-Leader, and the AMP-Leader algorithms on three real datasets: Iris, Wine, and KDD Network
Intrusion datasets, available at [16], [17]. The Wine dataset describes the chemical analysis of wines
derived from three different cultivars. The quantities of 13 constituents found in each of the three types
of wines forms the input to the clustering algorithms, after removing the class identifier. Fig. 2.1(a) and
Fig. 2.1(b) summarize, respectively, the α and β plots for different variants of the Leader algorithm.
It is observed that the AMP Leader outperforms the Leader algorithm by an order of magnitude; the NN and the NMN Leader algorithms also perform better than the Leader algorithm, though the gap is not as pronounced. The NN algorithm performs slightly better than the NMN algorithm in the
Figure 2.1: Wine Dataset: (a) α vs. δ, and (b) β vs. δ
Table 2.1: Wine Dataset Results

Algorithm   δ     α            β           Time (sec)
Leader      25    103.459243   88.312259   0.03
Leader      30    105.858316   94.513687   0.03
Leader      50    236.117268   81.759452   0.04
Leader      75    296.273254   64.230872   0.03
NN          25    12.255716    12.806741   0.04
NN          30    14.402743    13.841357   0.04
NN          50    25.951286    24.689666   0.04
NN          75    36.319593    34.837875   0.04
NMN         25    12.693116    13.806690   0.12
NMN         30    14.297954    14.705744   0.11
NMN         50    25.426869    28.176638   0.11
NMN         75    36.901905    40.978662   0.12
AMP         25    3.268322     1.550220    0.19
AMP         30    4.470130     1.799374    0.21
AMP         50    11.458716    4.303440    0.36
AMP         75    28.191802    11.122114   0.44
context of β values; however, the two perform equally well with respect to α. It is worth noting that there is not much increase in the overall execution time when the NN, NMN, and AMP Leader algorithms are employed in place of the Leader algorithm. Hence, these algorithms provide an encouraging alternative to the Leader algorithm.
Our second dataset, Iris, contains 3 classes of 50 instances each, where each class refers to a type of
Iris plant. We removed the categorical class label attribute and used different algorithms for clustering
based on the remaining 4 real-valued attributes. Fig. 2.2 (a) shows the comparison of α values obtained
at various thresholds using different clustering algorithms discussed here. It is observed that the Leader
algorithm results in a higher value of α than the NN, NMN and AMP Leader algorithms for almost all
values of δ, barring a brief interval around δ=2.5. The NMN and AMP algorithms result in nearly same
value of α marginally lower than that obtained using the NN algorithm; the difference becomes more
pronounced as δ is increased beyond 4.
Figure 2.2: Iris Dataset: (a) α vs. δ, and (b) β vs. δ
Figure 2.3: Intrusion Dataset: (a) α vs. δ, and (b) β vs. δ
On the other hand, Fig. 2.2 (b) shows the comparison of β values obtained at various values
of δ using different clustering algorithms. It is observed that the NMN algorithm performs poorly
as compared to the others, including the Leader algorithm. This counter-intuitive behavior can be
explained if we consider the number of clusters obtained in each case. The relatively low value of β for the Leader algorithm is a consequence of the large number of clusters it produces; this substantially larger number of clusters more than offsets the low β values of the other algorithms. Also, since β is inversely proportional to √|C|, the quantity β√|C| gives a better idea of the quality of clustering than β alone; therefore, for extremely low values of δ, β√|C| should be preferred.
We finally show the empirical results for the KDD Network Intrusion dataset. This dataset was
released for the KDD Competition with the goal of building a network intrusion detector, to distinguish
between intrusions and normal accesses. We removed the categorical attributes and used the remaining
37 features for the clustering process. Fig. 2.3(a) and Fig. 2.3(b) clearly indicate that the AMP,
NN, and NMN algorithms massively outperform the Leader algorithm in the quality of clustering. We
observe that the AMP, NN and NMN have almost same values for α resulting in only a single (visible)
curve. Our experiments with several other large datasets indicate that the gap between the Leader and
the proposed algorithms widens more substantially with increase in δ.
Table 2.2: Iris Dataset Results

Algorithm   δ     α           β           Time (sec)
Leader      1     0.604428    0.363442    0.01
Leader      1.1   0.836421    0.517497    0.01
Leader      1.2   0.769459    0.508818    0.02
Leader      1.3   0.887168    0.556569    0.01
Leader      1.4   0.840555    0.699171    0.02
Leader      1.5   0.988433    0.565288    0.02
Leader      1.6   1.075050    0.469647    0.01
Leader      2     1.286390    0.565929    0.02
Leader      2.5   0.999066    0.288164    0.02
Leader      4     1.871648    0.353127    0.01
Leader      5     2.816606    0.615548    0.01
Leader      6     3.250200    1.191248    0.01
NN          1     0.569444    0.465093    0.02
NN          1.1   0.608988    0.488164    0.03
NN          1.2   0.648537    0.550964    0.03
NN          1.3   0.678823    0.547391    0.03
NN          1.4   0.705408    0.635412    0.02
NN          1.5   0.797705    0.592036    0.02
NN          1.6   0.864870    0.604020    0.03
NN          2     1.024044    0.714045    0.02
NN          2.5   1.151405    0.896195    0.02
NN          4     1.246863    1.312673    0.02
NN          5     2.653413    1.460232    0.03
NN          6     3.153094    2.121179    0.03
NMN         1     0.607783    0.600252    0.08
NMN         1.1   0.621450    0.628142    0.09
NMN         1.2   0.646890    0.718134    0.09
NMN         1.3   0.658989    0.722566    0.09
NMN         1.4   0.749933    0.854402    0.08
NMN         1.5   0.752418    0.847823    0.10
NMN         1.6   0.787782    0.884730    0.10
NMN         2     0.959965    0.940225    0.09
NMN         2.5   1.054672    1.000977    0.10
NMN         4     1.246863    1.312673    0.08
NMN         5     2.089912    1.986403    0.08
NMN         6     2.365953    2.066980    0.08
AMP         1     0.378594    0.158689    0.51
AMP         1.1   0.491189    0.283306    0.34
AMP         1.2   0.596210    0.422448    0.18
AMP         1.3   0.617306    0.531905    0.15
AMP         1.4   0.744222    0.773295    0.11
AMP         1.5   0.752418    0.847823    0.12
AMP         1.6   0.787782    0.884730    0.12
AMP         2     0.959965    0.940225    0.09
AMP         2.5   1.054672    1.000977    0.11
AMP         4     1.246863    1.312673    0.09
AMP         5     2.115167    1.994683    0.08
AMP         6     2.365953    2.066980    0.10
From our experiments on several datasets, we observe that the difference in quality of clustering
between the NN and NMN algorithms is not substantial enough to choose one over the other. However,
the fact that NN takes less time to execute than the NMN (Tables 2.1 and 2.2) suggests that NN can
be preferred to NMN until some additional domain knowledge indicates otherwise. Then the choice
between NN and AMP would be a selection trade-off between quality of clustering and time taken to
execute.
2.6 Conclusion/Future Work
Incremental clustering is an important data mining task. A major concern in incremental algorithms is to
obtain identical results on a dataset for all possible orderings of the input data. We analyzed the problem
using the ideas from algebraic structures and introduced the notion of a dynamically complete set. We
proved that a dynamically complete set provides both a necessary and sufficient condition for order
independence. We also proposed a suite of robust incremental algorithms based on the popular Leader
clustering algorithm. Our experimental results indicate that the proposed algorithms perform considerably better
than the Leader algorithm on a number of datasets of different sizes and from different application
domains. The NN-Leader algorithm takes less time to execute than the NMN-Leader algorithm while
affording almost the same quality of clustering. The time-robustness trade-off could be used to choose between the AMP and NN algorithms, with the former providing more robust clustering and the latter requiring considerably less time.
Bibliography
[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys,
31(3), 1999.
[2] P. Langley. Order Effects in Incremental Learning. Learning in humans and machines: Towards an
interdisciplinary learning science, Elsevier, 1995.
[3] T. Mitchell. Generalization as Search. Artificial Intelligence, 18, pp. 203–226, 1982.
[4] P. H. Winston. Learning structural descriptions from examples. The psychology of computer vision,
McGraw-Hill, 1975.
[5] J. C. Schlimmer and D. Fisher. A case study of incremental concept induction. Proceedings of the
Fifth National Conference on Artificial Intelligence, pp. 496–501, Morgan Kaufmann, 1986.
[6] A. Cornuejols. Getting Order Independence in Incremental Learning. Proceedings of the 1993 Euro-
pean Conference on Machine Learning (ECML), pp. 196–212, Springer-Verlag, 1993.
[7] D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), pp.
139-172, 1987.
[8] D. Fisher, L. Xu, and N. Zard. Ordering effects in clustering. Proceedings of the 9th International
Conference on Machine Learning (ICML), pp. 163–168, 1992.
[9] B. Shekar, M. N. Murty, and G. Krishna. Structural aspects of semantic-directed clusters. Pattern
Recognition, 22, pp. 65–74, 1989.
[10] I. Herstein. Topics in Algebra. John Wiley & Sons, Second Edition, 2006.
[11] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very
large Databases. ACM SIGMOD, 1996.
[12] J. R. Slagle, C. L. Chang, and S. R. Heller. A Clustering and Data Reorganizing Algorithm. IEEE
Trans. Systems, Man and Cybernetics, 5, pp. 125–128, 1975.
[13] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Symposium on
Discrete Algorithms (SODA), 2007.
[14] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining Stream Statistics over Sliding Win-
dows. Proceedings of the thirteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA),
2002.
[15] H. Spath. Cluster Analysis Algorithms for Data Reduction and Classification, Ellis Horwood, Chich-
ester, 1980.
[16] http://archive.ics.uci.edu/ml/datasets.html.
[17] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
Chapter 3
RACK: RApid Clustering using
K-means algorithm
3.1 Introduction
Clustering or unsupervised classification of patterns into groups, based on similarity, is a very well studied
problem in pattern recognition, data mining, information retrieval and related disciplines. Clustering
finds numerous direct practical applications in pattern analysis, decision making and machine learning
tasks such as image segmentation. Besides clustering also acts as a precursor to many data processing
tasks including classification ([1], [2]). The k-means clustering problem can be expressed in the following
way. Given an integer k and a dataset of n d-dimensional data points, the objective is to choose k so
that the total squared distance between each point and its closest center is minimized. Several attempts
have been made to find efficient solution to the k-means problem. Some techniques ([7], [8], [9]) describe
O(1 + ε)-competitive algorithms for the k-means problem. However, these algorithms have exponential
time complexity in k and thus may not be practical. A (9+ε)-competitive algorithm has been suggested
in [10]. However, the application of this algorithm is also limited, since the time complexity is cubic in
the number of data points.
According to a recent survey [3], the k-means algorithm is the most widely used technique for
scientific and industrial applications. Several variants of the k-means algorithm have been proposed
in the literature. Lloyd's algorithm ([4], [5], [6]) is an extensively used k-means algorithm. It begins with a set of k representatives or centers that are randomly chosen from the data points. In each iteration, every data point is assigned to the cluster with the nearest center, and the center of each cluster is then updated to the mean of the points assigned to it. The algorithm proceeds with the next iteration until there is no significant change in the clusters obtained in successive iterations. Recently,
an O(lg k)-competitive algorithm, k-means++, has been proposed [11]. The k-means++ algorithm
improves the speed of k-means. However, the speed deteriorates significantly with an increase in value
of k.
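A bare-bones sketch of the Lloyd iteration just described is given below; the random initialization, the convergence test and the NumPy representation are our own illustrative choices, not a prescription from the cited works.

import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    # X: (n, d) array of data points; returns k centers and point labels.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # update each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                          # no significant change between iterations
        centers = new_centers
    return centers, labels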
The Leader algorithm [12], a popular single pass incremental algorithm for clustering large datasets,
is particularly attractive for applications such as stream data analysis that require fast processing. A
shortcoming of the Leader algorithm is that the quality of clustering may degrade significantly in case
of skewed input orders ([13], [14]). In [15], the authors propose an incremental algorithm BIRCH to
dynamically cluster incoming multi-dimensional metric data points, using a Clustering Feature (CF)
Tree. The Shortest Spanning Path (SSP) [16] algorithm has been used for data reorganization and
automatic auditing of records. In [17], the author proposed COBWEB, an incremental conceptual clustering
algorithm that was subsequently used for many engineering tasks.
In this work, we propose a novel technique, which strives to obtain good quality of clustering like
k-means, while minimizing the time requirements like the Leader algorithm. In particular, we make the
following contributions,
1. we propose the RACK algorithm, which chooses k centers by applying k-means on a randomly chosen subset D′ of the input dataset D, builds a height-balanced tree based on a measure governed by the cardinality of the clusters and the deviation about their respective means, and incrementally assigns the remaining data points to suitable clusters, and
2. we prove that the average deviation resulting from the RACK algorithm is bounded by O(k|D| |D\D′|).
3.2 Effective Clustering for large datasets
3.2.1 Motivation
The k-means algorithm may not work well in case of skewed data. Fig. 3.1 shows a set of data points
arranged in the form of two concentric circles, A and B. The k-means algorithm would fail to converge to
a satisfactory solution even after an extremely large number of iterations. To ameliorate this problem, we
need a different heuristic from the closest center approach: a sampled subset may be used to determine
the cluster centers followed by fast one-pass clustering of the remaining data points using an appropriate
algorithm. The basic intuition is that a random subset of the original dataset may quickly converge
to a local optimal solution and the resulting cluster centers may be used to incrementally cluster the
remaining points. Unlike k-means, this avoids repeatedly computing k distances to the cluster centers till convergence, since most of the points may never change their cluster centers.
Now, consider the situation as shown in Fig. 3.2. Suppose that a sample subset of the original
dataset is already clustered using the k-means algorithm. Now, if a new data point x arrives, it is more
Figure 3.1: k-means may not converge to a solution even after many iterations
probable to be a part of the cluster with a large number of data points. This is a reasonable assumption
to make, particularly if the sample set reflects the behavior of the entire dataset.
Figure 3.2: A new data point is more likely to belong to a cluster with large number of data points
At the same time, we do not want the quality of clustering, which is indicated by the average
squared deviation about the cluster representatives, to degrade significantly as in the case of incremental algorithms. Therefore, rather than making a decision on the basis of the cardinality of clusters alone, we
also seek to minimize the deviation. We define a heuristic, the potential, to accomplish good clustering.
We also note that, in the case of k-means, the number of distance computations is given by O(nkld), for
a d-dimensional dataset of n points, which converges to a solution in l iterations. Therefore, we can
significantly improve the time by devising an algorithm that effectively measures distance from only
O(lg k) centers, and needs only a single iteration for a major part of the original dataset like the Leader
algorithm.
The Leader algorithm starts by taking the first data point as the representative of a cluster. An iterative process is then followed, whereby successive incoming instances are assigned, one by one, to an existing cluster, provided they lie within a distance δ of that cluster's representative in the d-dimensional space. If a data point is farther than the specified threshold δ from every existing representative, it becomes the representative of a new cluster.
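For concreteness, a minimal sketch of this one-pass procedure is given below. It is illustrative only: the function name is ours, and Euclidean distance is assumed.

import numpy as np

def leader(points, delta):
    """One-pass Leader clustering: assign each point to the first
    representative within distance delta, else start a new cluster."""
    leaders = []          # cluster representatives
    assignments = []      # index of the cluster each point joins
    for x in points:
        for j, rep in enumerate(leaders):
            if np.linalg.norm(x - rep) <= delta:
                assignments.append(j)
                break
        else:
            leaders.append(x)                     # x becomes a new representative
            assignments.append(len(leaders) - 1)
    return np.array(leaders), assignments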
Based on the aforesaid ideas, a meaningful and intuitively appealing heuristic would be to execute
the k-means algorithm on a much smaller set of data points, maintain the requisite information about
clusters in a balanced tree structure, and cluster the remaining points of the dataset incrementally along
a suitable path of the tree. The skewed behavior of the Leader algorithm is likely to be avoided, since a reasonable set of cluster centers can be expected as a result of applying k-means to a sampled subset.
In this context, we propose the RACK algorithm to accomplish effective clustering.
3.2.2 The RACK Algorithm
The RACK algorithm proceeds in two phases. In Phase 1, k-means is applied on a randomly chosen subset D′ of the original dataset D to obtain a set of clusters, C = {C_1, C_2, . . . , C_k}. Further, as mentioned in Section 3.2.1, we define a new heuristic, the potential p, for the subsequent clustering process involving points in the set D\D′. The potential for each of the k clusters is given by

p = {p_1 = |C_1| / α_1, p_2 = |C_2| / α_2, . . . , p_k = |C_k| / α_k}

where, denoting the center of C_j by c_j,

α_j = (1 / |C_j|) ∑_{x_i ∈ C_j} ||x_i − c_j||²,  j ∈ {1, 2, . . . , k}
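As an illustration, these Phase 1 quantities can be computed directly from the k-means output. The NumPy sketch below is illustrative, with function and variable names of our choosing, and it assumes every cluster is non-empty and has non-zero deviation.

import numpy as np

def cluster_potentials(X, labels, centers):
    """Compute, for each cluster, the average squared deviation alpha_j
    and the potential p_j = |C_j| / alpha_j."""
    k = centers.shape[0]
    sizes = np.array([np.sum(labels == j) for j in range(k)])
    alphas = np.array([
        np.mean(np.sum((X[labels == j] - centers[j]) ** 2, axis=1))
        for j in range(k)
    ])
    potentials = sizes / alphas        # assumes alphas are strictly positive
    return sizes, alphas, potentials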
In Phase 2, an AVL tree is constructed using the set p. Intuitively, for each incoming point, we look for a cluster with large size and low variance, in order to minimize the computation cost and to improve the quality of clusters. The AVL tree [18] is an example of a height balanced binary search tree, wherein each of the search, delete and insert operations can be performed in O(lg r) time, r being the total number of nodes in the tree. Any node or cluster, C^m_i, in the AVL tree can be uniquely identified by specifying its level m on a path i from the root. Then, starting from the root, the additional increase in deviation due to each point x_t ∈ D\D′ is computed. If the point x_t is deemed fit at a node, it is assigned to the corresponding cluster, and the average deviation about the representative of that cluster becomes

α^{m′}_i = (|C^m_i| α^m_i + ||x_t − x^m_i||²) / (|C^m_i| + 1)

where x^m_i denotes the representative of cluster C^m_i. Accordingly, the potential at that node can also be updated. However, if x_t does not satisfy the clustering criterion at a node, then it has to be assigned to a suitable cluster in its right or left subtree. In the worst case, x_t is assigned to a leaf node.
Algorithm: RACK
Input: A dataset D and k
Output: A set of k clusters and cluster centers
• Phase 1: Obtain a subset D′ of data points by drawing random samples from D.
• Apply the k-means algorithm on D′ to obtain the set of k clusters, C = {C_1, C_2, . . . , C_k}.
• Compute the average deviation of the points in each cluster about its mean, α = {α_1, α_2, . . . , α_k}.
• Compute the corresponding set of potential values, p = {p_1 = |C_1|/α_1, p_2 = |C_2|/α_2, . . . , p_k = |C_k|/α_k}.
• Phase 2: Insert a node for each of the k clusters into an AVL tree H, in the order of non-increasing p. Store at each node the duplet (|C_j|, p_j), j ∈ {1, 2, . . . , k}.
• for each data point x_t ∈ D\D′
1. Set CNode to the root of H.
2. if both the left child L and the right child R of CNode exist, then
(a) Compute the value val = ||x_t − x_CNode||², where x_CNode denotes the representative element of CNode.
(b) Compute the quantity q_R = (2|CNode| + 1) / p_R, where p_R denotes the potential of R about its representative x_R. Also, compute q_L = (2|CNode| + 1) / p_L.
(c) if val < q_R, then set CNode to R; else if val > q_L, then set CNode to L; otherwise assign x_t to CNode, update p_CNode to p_CNode(|CNode| + 1)² / (|CNode|² + val p_CNode) and |CNode| to |CNode| + 1;
else if CNode is a leaf, then compute val as above and update p_CNode to p_CNode(|CNode| + 1)² / (|CNode|² + val p_CNode) and |CNode| to |CNode| + 1;
else if only the right child R exists, then compute val and q_R. If val ≥ q_R, then update p_CNode to p_CNode(|CNode| + 1)² / (|CNode|² + val p_CNode) and |CNode| to |CNode| + 1; otherwise, set CNode to R;
else compute val and q_L. If val ≤ q_L, then update p_CNode to p_CNode(|CNode| + 1)² / (|CNode|² + val p_CNode) and |CNode| to |CNode| + 1; otherwise, set CNode to L.
3. Return the set of clusters, C_1, C_2, . . . , C_k and their cluster representatives.
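To make the Phase 2 routing concrete, the sketch below builds a height balanced binary search tree over the k potential values by recursive median splitting (a stand-in for the AVL tree, which suffices here because all k nodes are known up front) and then routes each point of D\D′ down the tree following step 2 above. The class and function names are ours; entries is assumed to be the list of (representative, |C_j|, p_j) triples from Phase 1, sorted by increasing potential.

import numpy as np

class Node:
    def __init__(self, rep, size, potential):
        self.rep, self.size, self.p = rep, size, potential
        self.left = self.right = None
        self.members = []                       # points assigned in Phase 2

def build_balanced(entries):
    """entries sorted by increasing potential -> balanced BST keyed on potential."""
    if not entries:
        return None
    mid = len(entries) // 2
    rep, size, p = entries[mid]
    node = Node(rep, size, p)
    node.left = build_balanced(entries[:mid])
    node.right = build_balanced(entries[mid + 1:])
    return node

def assign(node, x, val):
    """Assign x to node: update potential and cardinality as in step 2(c)."""
    node.p = node.p * (node.size + 1) ** 2 / (node.size ** 2 + val * node.p)
    node.size += 1
    node.members.append(x)

def rack_phase2(root, points):
    for x in points:
        node = root
        while True:
            val = float(np.sum((x - node.rep) ** 2))
            if node.left is not None and node.right is not None:
                qR = (2 * node.size + 1) / node.right.p
                qL = (2 * node.size + 1) / node.left.p
                if val < qR:
                    node = node.right
                elif val > qL:
                    node = node.left
                else:
                    assign(node, x, val); break
            elif node.right is not None:          # only a right child exists
                qR = (2 * node.size + 1) / node.right.p
                if val >= qR:
                    assign(node, x, val); break
                node = node.right
            elif node.left is not None:           # only a left child exists
                qL = (2 * node.size + 1) / node.left.p
                if val <= qL:
                    assign(node, x, val); break
                node = node.left
            else:                                 # leaf node
                assign(node, x, val); break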
3.2.3 Bound on the Quality of Clustering
Theorem 4. Let D and k denote the original dataset and the number of clusters respectively. The quality of clustering achieved by Phase 2 of RACK, as measured in terms of the average deviation, is bounded by O(k|D| |D\D′|), if a sampled dataset D′ is used for obtaining the k cluster representatives in Phase 1.
Proof. Consider a cluster C^m_i, at a level m, in the height balanced tree H. Then, if p^m_i is the current potential of C^m_i,

p^{m+1}_{il} ≤ p^m_i ≤ p^{m+1}_{ir}    (3.1)

where p^{m+1}_{il} and p^{m+1}_{ir} denote the current potentials of the left child, C^{m+1}_{il}, and the right child, C^{m+1}_{ir}, of cluster C^m_i. Now, suppose a new data point x_t is assigned to C^m_i and let the new potential of C^m_i be p^{m′}_i. Note that x_t does not modify the potential of any cluster other than C^m_i. In the RACK algorithm, val is computed along a suitable path down the tree till the updated potential of the corresponding node (on inserting the new point) lies between that of its children, or a leaf node is reached. Therefore, we must have

p^{m+1}_{il} ≤ p^{m′}_i ≤ p^{m+1}_{ir}    (3.2)

Using (3.1) and (3.2), in conjunction with the definition of potential, we have

|C^{m+1}_{il}| / α^{m+1}_{il} ≤ |C^m_i| / α^m_i ≤ |C^{m+1}_{ir}| / α^{m+1}_{ir}    (3.3)

and

|C^{m+1}_{il}| / α^{m+1}_{il} ≤ (|C^m_i| + 1) / α^{m′}_i ≤ |C^{m+1}_{ir}| / α^{m+1}_{ir}    (3.4)

Further,

α^{m′}_i = (|C^m_i| α^m_i + ||x_t − x^m_i||²) / (|C^m_i| + 1)

where x^m_i denotes the representative of cluster C^m_i.

⇒ α^m_i = ((|C^m_i| + 1) α^{m′}_i − ||x_t − x^m_i||²) / |C^m_i|

which, using (3.3), yields

α^{m′}_i ∈ [ (|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}|) / (|C^{m+1}_{ir}| (|C^m_i| + 1)) , (|C^m_i|² α^{m+1}_{il} + ||x_t − x^m_i||² |C^{m+1}_{il}|) / (|C^{m+1}_{il}| (|C^m_i| + 1)) ]

whence, using (3.4),

α^{m′}_i ≥ max( (|C^m_i| + 1) α^{m+1}_{ir} / |C^{m+1}_{ir}| , (|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}|) / (|C^{m+1}_{ir}| (|C^m_i| + 1)) )    (3.5)

and

α^{m′}_i ≤ min( (|C^m_i| + 1) α^{m+1}_{il} / |C^{m+1}_{il}| , (|C^m_i|² α^{m+1}_{il} + ||x_t − x^m_i||² |C^{m+1}_{il}|) / (|C^{m+1}_{il}| (|C^m_i| + 1)) )    (3.6)

Now,

(|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}|) / (|C^{m+1}_{ir}| (|C^m_i| + 1)) − (|C^m_i| + 1) α^{m+1}_{ir} / |C^{m+1}_{ir}|
= (|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}| − (|C^m_i| + 1)² α^{m+1}_{ir}) / (|C^{m+1}_{ir}| (|C^m_i| + 1))
= (||x_t − x^m_i||² |C^{m+1}_{ir}| − α^{m+1}_{ir} − 2|C^m_i| α^{m+1}_{ir}) / (|C^{m+1}_{ir}| (|C^m_i| + 1))
= ||x_t − x^m_i||² / (|C^m_i| + 1) − α^{m+1}_{ir} (2|C^m_i| + 1) / (|C^{m+1}_{ir}| (|C^m_i| + 1))
≥ 0, if ||x_t − x^m_i||² / (|C^m_i| + 1) ≥ α^{m+1}_{ir} (2|C^m_i| + 1) / (|C^{m+1}_{ir}| (|C^m_i| + 1)),
i.e. if ||x_t − x^m_i||² ≥ α^{m+1}_{ir} (2|C^m_i| + 1) / |C^{m+1}_{ir}| = (2|C^m_i| + 1) / p^{m+1}_{ir}

Therefore,

(|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}|) / (|C^{m+1}_{ir}| (|C^m_i| + 1)) ≥ (|C^m_i| + 1) α^{m+1}_{ir} / |C^{m+1}_{ir}|  if  ||x_t − x^m_i||² ≥ (2|C^m_i| + 1) / p^{m+1}_{ir}

Thus, the bound on α^{m′}_i in (3.5) can be determined. Similarly, we can obtain the bound in (3.6). Now, there are three possible cases, based on the various possible intervals being considered.
Case 1:

||x_t − x^m_i||² ∈ [ (2|C^m_i| + 1) / p^{m+1}_{ir} , (2|C^m_i| + 1) / p^{m+1}_{il} ]

Then,

α^{m′}_i ∈ [ (|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}|) / (|C^{m+1}_{ir}| (|C^m_i| + 1)) , (|C^m_i|² α^{m+1}_{il} + ||x_t − x^m_i||² |C^{m+1}_{il}|) / (|C^{m+1}_{il}| (|C^m_i| + 1)) ]

whence the length of the interval wherein α^{m′}_i lies is given by

(|C^m_i|² α^{m+1}_{il} + ||x_t − x^m_i||² |C^{m+1}_{il}|) / (|C^{m+1}_{il}| (|C^m_i| + 1)) − (|C^m_i|² α^{m+1}_{ir} + ||x_t − x^m_i||² |C^{m+1}_{ir}|) / (|C^{m+1}_{ir}| (|C^m_i| + 1))
= |C^m_i|² (|C^{m+1}_{ir}| α^{m+1}_{il} − |C^{m+1}_{il}| α^{m+1}_{ir}) / (|C^{m+1}_{ir}| |C^{m+1}_{il}| (|C^m_i| + 1))
= |C^m_i|² (p^{m+1}_{ir} α^{m+1}_{ir} α^{m+1}_{il} − p^{m+1}_{il} α^{m+1}_{il} α^{m+1}_{ir}) / (|C^{m+1}_{ir}| |C^{m+1}_{il}| (|C^m_i| + 1))
= |C^m_i|² (p^{m+1}_{ir} α^{m+1}_{ir} α^{m+1}_{il} − p^{m+1}_{il} α^{m+1}_{il} α^{m+1}_{ir}) / (p^{m+1}_{ir} α^{m+1}_{ir} p^{m+1}_{il} α^{m+1}_{il} (|C^m_i| + 1))
= |C^m_i|² (p^{m+1}_{ir} − p^{m+1}_{il}) / (p^{m+1}_{ir} p^{m+1}_{il} (|C^m_i| + 1))    (3.7)
Case 2:

||x_t − x^m_i||² < (2|C^m_i| + 1) / p^{m+1}_{ir}

Clearly, this case is ruled out since, by our assumption, x_t is assigned to C^m_i (otherwise x_t should have been checked further, for a suitable cluster, in the right subtree of C^m_i).

Case 3:

||x_t − x^m_i||² > (2|C^m_i| + 1) / p^{m+1}_{il}

This case is also not possible, as observed in a way analogous to Case 2.

Therefore, the resulting interval after inserting x_t must satisfy (3.7). Further, since x_t is assigned to C^m_i, none of the clusters in the path from the root of H till C^m_i may satisfy (3.7). Assuming that the probability of a point falling in an interval is proportional to the length of that interval, using a normalization constant z, the probability of x_t being assigned to C^m_i is given by

P(C^m_i ← x_t) ≤ ( |C^m_i|² (p^{m+1}_{ir} − p^{m+1}_{il}) / (z p^{m+1}_{ir} p^{m+1}_{il} (|C^m_i| + 1)) ) ∏_{j=1}^{m−1} [ 1 − |C^j_i|² (p^{j+1}_{ir} − p^{j+1}_{il}) / (z p^{j+1}_{ir} p^{j+1}_{il} (|C^j_i| + 1)) ]

≤ |C^m_i|² (p^{m+1}_{ir} − p^{m+1}_{il}) / (z p^{m+1}_{ir} p^{m+1}_{il} (|C^m_i| + 1))    (since p^{s+1}_{ir} ≥ p^{s+1}_{il} for all s)
Then, the expected change in α^{m′}_i is

≤ ∑_{x_t ∈ D\D′} ( |C^m_i|² (p^{m+1}_{ir} − p^{m+1}_{il}) / (z p^{m+1}_{ir} p^{m+1}_{il} (|C^m_i| + 1)) ) ||x_t − x^m_i||²

Since there are O(lg k) levels, the expected change along path i is

≤ ∑_{m=1}^{O(lg k)} m ∑_{x_t ∈ D\D′} ( |C^m_i|² (p^{m+1}_{ir} − p^{m+1}_{il}) / (z p^{m+1}_{ir} p^{m+1}_{il} (|C^m_i| + 1)) ) ||x_t − x^m_i||²

≤ ∑_{m=1}^{O(lg k)} m ∑_{x_t ∈ D\D′} ( |C^m_i| (p^{m+1}_{ir} − p^{m+1}_{il}) / (z p^{m+1}_{ir} p^{m+1}_{il}) ) ||x_t − x^m_i||²

≤ ∑_{m=1}^{O(lg k)} m ∑_{x_t ∈ D\D′} β |C^m_i| ||x_t − x^m_i||²

(where β = max ( (p^{m+1}_{ir} − p^{m+1}_{il}) / (z p^{m+1}_{ir} p^{m+1}_{il}) ), m ∈ {1, 2, . . . , k − 1})

≤ ∑_{m=1}^{O(lg k)} m β |D\D′| |C^m_i| d²_max

(where d_max is the maximum distance between any two data points in D)

≤ ∑_{m=1}^{O(lg k)} m β′ |D\D′| |C^m_i|

(where β′ = β d²_max)

Now, there are O(lg k) nodes in any root-to-leaf path of H. Further, the total number of nodes is k. Therefore, the number of paths is bounded by O(k / lg k). Therefore, the expected quality of clustering, as characterized by the average deviation, using RACK is

≤ ∑_{i=1}^{O(k/lg k)} i ∑_{m=1}^{O(lg k)} m β′ |C^m_i| |D\D′|

≤ ∑_{i=1}^{O(k/lg k)} i O(|D|/k) O(|D\D′| lg² k)

(for large datasets, the expected value of |C^m_i| is O(|D|/k))

= O(k|D| |D\D′|)
3.2.4 Analysis of Time Complexity
The RACK algorithm consists of two phases. Phase 1 employs the k-means algorithm on the sampled dataset D′, which incurs time bounded by O(|D′|kl′d), where l′ and d denote the number of iterations and the dimensionality of the data respectively. The computation of the deviation of the points in each cluster about its center, and of the potential values, takes O(|D′|) time. Therefore, the time complexity of Phase 1 can be expressed as O(|D′|kl′d).
Phase 2 consists of primarily three steps: (a) sorting the potential values obtained in Phase 1 in non-increasing order, (b) constructing the AVL tree using the potential value as the key, and (c) assigning the points in D\D′ to one of the nodes in the tree. Sorting in (a) can be accomplished in O(k lg k) time using any standard algorithm such as heapsort. The AVL tree can be constructed from the k sorted potential values in O(k lg k) time, since each insertion operation requires O(lg k) comparisons and a total of k insertions are needed. Finally, each point in D\D′ may have to go down a path from the root of the tree to one of its leaves. The length of any such path is bounded by O(lg k) nodes, as a consequence of the height balancing property of the AVL tree. At each node in the path, its right and left children may have to be accessed, for a maximum of two additional operations. The cluster size and potential value at each node can be updated in O(1) time. Then, the overall time complexity for clustering the |D\D′| data points in Phase 2 is O((k + d|D\D′|) lg k). However, since typically |D\D′| is much greater than k, Phase 2 requires time bounded by O(d|D\D′| lg k). The RACK algorithm, as a result, takes O(d(|D′|kl′ + |D\D′| lg k)) time.
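For a rough, purely illustrative comparison, consider a dataset of n = 4601 points of dimensionality d = 58 (the size of the Spam dataset used in Section 3.3), with k = 50, a sample of |D′| = 1500 points, and, as an assumed figure rather than a measured one, l = l′ = 20 iterations of k-means. Plain k-means then performs on the order of nkld ≈ 4601 × 50 × 20 × 58 ≈ 2.7 × 10⁸ distance computations, whereas RACK performs roughly |D′|kl′d ≈ 1500 × 50 × 20 × 58 ≈ 8.7 × 10⁷ in Phase 1 plus about d|D\D′| lg k ≈ 58 × 3101 × 6 ≈ 1.1 × 10⁶ in Phase 2, i.e. roughly a three-fold reduction, dominated by the smaller sample used in Phase 1.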
3.3 Experimental Results
We carried out extensive experimentation to compare RACK with the Leader, the k-means and the k-means++ algorithms. For our experiments, we measured the quality of clustering of an algorithm in terms of the deviation

α = (1/n) ∑_{x_i ∈ D} ||x_i − x_j||²,

where x_j is the representative of the cluster C_j, belonging to the set of clusters C, to which x_i is assigned.
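For reference, this deviation can be computed in a few lines; the NumPy sketch below is illustrative, with the function name chosen by us, where X, labels and reps denote the data matrix, the cluster index of each point and the cluster representatives respectively.

import numpy as np

def average_deviation(X, labels, reps):
    """alpha = (1/n) * sum over points of the squared distance to the
    representative of the cluster to which each point is assigned."""
    diffs = X - reps[labels]            # representative of each point's cluster
    return float(np.mean(np.sum(diffs ** 2, axis=1)))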
Clearly, a low value of α corresponds to a high quality of clustering. We conducted an empirical study
on a number of real-world datasets. However, due to space constraints, we provide the results for the Spam and Intrusion datasets. These datasets are available as archives at the UCI Machine Learning Repository ([19], [20]). Moreover, we state the results obtained using 30 runs of experiments to account for statistical significance. Since the Leader algorithm does not take k as an input parameter, we executed the code for the Leader algorithm with different distance thresholds, across different input orders, and observed the number of clusters. Then, we modulated the Leader threshold distance to obtain almost the same number of clusters. Finally, we averaged the deviation across the different orders. The Spam dataset consists of 4601 examples of 58 real-valued features each. Table 3.1 shows the results obtained using the different algorithms for varying numbers of clusters.

Table 3.1: Spam Dataset (4601 examples, 58 dimensions)
Algorithm    Clusters    Average α     Time (sec)
Leader       10          1.5397e+05    0.78
k-means      10          3.6843e+04    2.8
k-means++    10          1.8650e+04    0.93
RACK         10          4.9174e+04    0.85
Leader       20          1.9995e+05    0.81
k-means      20          3.3210e+04    7.67
k-means++    20          1.3360e+04    2.89
RACK         20          2.7097e+04    1.63
Leader       50          8.2472e+05    1.23
k-means      50          3.2597e+04    13.32
k-means++    50          1.2915e+03    3.12
RACK         50          3.1088e+03    1.31
Leader       100         1.5397e+05    2.2
k-means      100         3.0447e+04    17.92
k-means++    100         1.0876e+03    3.65
RACK         100         1.718e+03     1.97

The RACK algorithm used a sampled dataset of 1500 instances. As indicated, RACK competes with Leader in the total time taken. The
quality of clustering also compares favorably with the k-means algorithm and tends to approach that
of the k-means++, especially as the value of k is increased. Clearly, RACK outperforms k-means and
k-means++ in the total time taken. On the other hand, RACK yields much better clusters than the
Leader algorithm.
Table 3.2, on the other hand, shows the results of our experiments on the network intrusion data. The RACK algorithm selected a sampled dataset of 20000 instances. As the results indicate, the time taken by RACK is at least an order of magnitude less than that of the other algorithms. Further, the average deviation of the clusters yielded by RACK is slightly worse than that obtained using k-means for small values of k; however, RACK outperforms k-means as the number of clusters is increased.

Table 3.2: Intrusion Dataset (494019 examples, 35 dimensions)
Algorithm    Clusters    Average α      Time (sec)
Leader       10          3.962e+09      20.43
k-means      10          3.392e+08      62.23
k-means++    10          2.176e+07      33.29
RACK         10          4.59207e+08    1.35
Leader       20          3.495e+09      45.72
k-means      20          3.217e+08      251.19
k-means++    20          1.776e+07      125.74
RACK         20          3.3994e+08     3.24
Leader       50          1.918e+09      61.38
k-means      50          3.084e+08      943.07
k-means++    50          1.229e+07      313.84
RACK         50          2.4573e+08     13.69
Leader       100         1.487e+10      114.69
k-means      100         2.865e+08      6172.35
k-means++    100         5.367e+07      862.74
RACK         100         1.39978e+08    63.49

We also conducted experiments on several other datasets, such as the yeast, wine, and cloud datasets [19], and the results indicate that RACK can be used to obtain a good clustering quickly. In our experiments, we found that the number of samples required for good clustering varies with the input dataset. However, the number of samples required is a very small fraction of the entire dataset. In view of space constraints,
the results have been omitted. It would be interesting to devise some heuristic for choosing the minimum
sample size, in accordance with statistical learning theory, but that is beyond the scope of this work.
Clearly, RACK is a pragmatic approach to clustering large datasets, and offers a viable alternative to
the popularly used k-means and the Leader algorithms.
3.4 Conclusions
k-means is an immensely popular clustering algorithm and finds use in several applications. The k-means algorithm offers good quality of clustering; however, it may take excessive time to converge to a solution because of the large number of iterations. On the other hand, incremental techniques (such as the Leader algorithm) enable fast clustering, but the quality of clustering may be extremely poor. To address these issues, we proposed a novel algorithm, RACK, in this work. RACK randomly selects a sample of data points, D′, from the original dataset, D, and applies k-means on D′ to obtain k reasonable cluster representatives. Then, these clusters are represented by k nodes in a height balanced tree, such that every path from the root to any leaf consists of O(lg k) nodes. Each data point in D\D′ is checked for clustering in the tree based on an appropriate heuristic. We proved an asymptotic bound on the quality of clustering obtained using RACK and showed that RACK takes O(|D\D′| lg k) time for clustering the set D\D′. We also provided experimental results on two large scale datasets. We
compared RACK with the Leader, the k-means, and the k-means++ algorithms. Our observations are:
• The time taken for clustering by RACK is much smaller than that of Leader, k-means, and k-means++ in the case of large datasets, where the value of k is also typically large, and
• The quality of clustering obtained using RACK is much better than that of Leader and is competitive with that of k-means.
3.5 Future Work
In this work, we proposed the RACK algorithm that selects k centers by applying k-means on the
sampled dataset D′. However, the centers may well be chosen using any other clustering algorithm like
the k-means++. Further, RACK does not update the cluster centers. It would be interesting to analyze
the quality of clustering when the change in center is incrementally reflected with the addition of each
data point. Additionally, rather than using random sampling to obtain D′, we may resort to employing
better sampling techniques so that the selected k centers reflect the distribution of the entire dataset
more closely.
Bibliography
[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys,
31(3), 1999.
[2] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data, Prentice Hall, 1988.
[3] P. Berkhin. Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software, San Jose, CA, 2002.
[4] P. K. Agarwal and N. H. Mustafa. k-means projective clustering. Proceedings of the twenty-third
ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS), pp. 155–
165, ACM press, New York, 2004.
[5] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9, pp. 1093–1105, 1999.
[6] F. Gibou and R. Fedkiw. A fast hybrid k-means level set algorithm for segmentation. Fourth Annual
Hawaii International Conference on Statistics and Mathematics, pp. 281–291, 2005.
[7] W. Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems.
Proceedings of the thirty-fifth Annual ACM Symposium on Theory of Computing (STOC), pp. 50–
58, ACM Press, New York, 2003.
[8] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for
k-means clustering in any dimensions. Proceedings of the forty-fifth Annual IEEE Symposium on
Foundations of Computer Science (FOCS), pp. 454–462, Washington, 2004.
[9] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. Proceedings
of the thirty-sixth Annual ACM Symposium on Theory of Computing (STOC), pp. 291–300, ACM
Press, New York, 2004.
[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local
search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3), pp. 89–
112, 2004.
[11] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Symposium on
Discrete Algorithms (SODA), 2007.
[12] H. Spath. Cluster Analysis Algorithms for Data Reduction and Classification, Ellis Horwood, Chich-
ester, 1980.
[13] P. Langley. Order Effects in Incremental Learning. Learning in humans and machines: Towards an
interdisciplinary learning science, Elsevier, 1995.
[14] D. Fisher, L. Xu, and N. Zard. Ordering effects in clustering. Proceedings of the 9th International
Conference on Machine Learning, pp. 163–168, 1992.
[15] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very
Large Databases. In Proc. ACM-SIGMOD International Conference on Management of Data, pp.
103–114, 1996.
[16] J. R. Slagle, C. L. Chang, and S. R. Heller. A Clustering and Data Reorganizing Algorithm. IEEE
Trans. Systems, Man and Cybernetics, 5, pp. 125–128, 1975.
[17] D. Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning, 2, pp.
139–172, 1987.
[18] M. A. Weiss. Data Structures and Algorithm Analysis in C++, Pearson, 2006.
[19] http://archive.ics.uci.edu/ml/datasets/.
[20] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
Chapter 4
EPIC: Towards Efficient Integration
of Partitional Clustering Algorithms
4.1 Introduction
Clustering or unsupervised classification of patterns into groups based on similarity is a very well studied
problem in pattern recognition, data mining, information retrieval and related disciplines. Clustering
finds numerous direct practical applications in pattern analysis, decision making and machine learning
tasks such as image segmentation. Besides clustering also acts as a precursor to many data processing
tasks including classification [1]. The different clustering techniques can be categorized into hierarchical
and partitional algorithms. The hierarchical algorithms generate a hierarchy of clusters by determining
successive clusters using the previously established clusters. Hierarchical algorithms can be further di-
vided into two sub-categories: agglomerative and divisive. The agglomerative algorithms start with each
element in a separate cluster and iteratively merge the existing clusters into successively larger clusters
in a bottom-up fashion. The divisive hierarchical algorithms begin with a single cluster containing all
the data points and then proceed to generate smaller clusters following a top-down approach. Partitional
clustering algorithms, on the other hand, assign the data points into a pre-defined number of clusters.
These algorithms can also be broadly classified into two categories, based on how the number of clusters
is specified. The k-means algorithm [5] is an immensely popular clustering algorithm that takes k, the
number of clusters, as an input explicitly. There are many partitional clustering algorithms, such as
Leader [4], BIRCH [2], and DBSCAN [3], which take as input a distance threshold value, τ , instead.
This threshold value, indirectly, determines the number of clusters obtained using these techniques.
We believe that a hybrid technique, which uses both k and τ in the clustering process, would be
more useful since more domain knowledge can be easily incorporated. In our work, we propose a variant
of the k-means algorithm, EPIC, to accomplish exactly the same goal. EPIC, an anagram of the initials
of “Efficient Integration of Partitional Clustering”, initially assigns the data points to k1 clusters,
where k1 < k, k being the tentative number of desired clusters. Then, an iterative process is followed
to refine the clusters using the specified threshold distance, τ . We demonstrate that the proposed
algorithm performs fewer distance computations than the k-means algorithm and thus provides better
time performance, without making any assumptions about the distribution of the input data. The
analysis of EPIC also facilitates understanding the relationship between the number of clusters and the
distance threshold. Further, we also provide a bound on the number of levels, or iterations, which guarantees that EPIC performs fewer distance computations than the k-means algorithm. We also present a generic
scheme for integrating EPIC into classification algorithms to achieve better time performance.
4.2 Preliminaries
In this section, we present a brief overview of the k-means problem and the k-means clustering algorithms
that the proposed algorithm is based on.
4.2.1 k-means Algorithms
The k-means problem is to determine k points called centers so as to minimize the clustering error,
defined as the mean squared distance from each data point to its nearest center. The most commonly used algorithm for solving this problem is Lloyd's k-means algorithm [5, 6], which iteratively assigns the
patterns to clusters and computes the cluster centers. MacQueen’s k-means algorithm [7] is a two-pass
variant of the k-means algorithm:
1. Choose the first k patterns as the initial k centers. Assign each of the remaining N − k patterns
to the cluster whose center is closest. Calculate the new centers of the clusters obtained.
2. Assign each of the N patterns to one of the k clusters obtained in step 1 based on its distance
from the cluster centers and recompute the centers.
Analysis: Distance computation is the only time-consuming operation in this algorithm. So, we focus
on the number of distance computations performed.
In step 1, the number of distance computations needed is k(N − k). The number of distance computations in step 2 equals Nk. This implies that the total number of distance computations required by MacQueen's k-means algorithm equals k(N − k) + Nk = 2Nk − k², and the complexity is O(Nk). The Lloyd's k-means algorithm may not converge to a solution in polynomial time, so a maximum of m iterations is used to find an approximate solution. Then, the total number of distance computations equals k(N − k) + (m − 1)Nk = mNk − k².
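A minimal sketch of this two-pass procedure is given below; it is illustrative only, assumes NumPy arrays and Euclidean distance, uses a function name of our choosing, and assumes that no cluster becomes empty in the second pass.

import numpy as np

def macqueen_two_pass(X, k):
    """MacQueen's two-pass k-means: seed with the first k patterns,
    assign the rest and update centers, then reassign everything once."""
    N = X.shape[0]
    centers = X[:k].copy()                    # first k patterns as initial centers
    labels = np.empty(N, dtype=int)
    labels[:k] = np.arange(k)
    # Pass 1: assign the remaining N - k patterns to the nearest center
    for i in range(k, N):
        labels[i] = np.argmin(np.sum((centers - X[i]) ** 2, axis=1))
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Pass 2: reassign all N patterns to the recomputed centers, then update once more
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers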
4.3 The EPIC Algorithm
Inputs: A dataset to be clustered: X = {x_i}_{i=1}^N, where x_i ∈ ℝ^d; a radius threshold parameter: τ; an approximate number of clusters: k; the maximum number of iterations allowed for the conventional k-means algorithm to converge: m (if m is not provided, take m to be 100, as is the common practice).

1. Let n be the maximum number of levels. Set n to some value ≤ ⌊mk/2⌋ + 1. Initialize count = 1.

2. Cluster X into k_1 = ⌊ (1 / k^{n−2}) {2(n − 1) / m}^{n−1} ⌋ clusters using MacQueen's 2-pass k-means algorithm.

3. Compute the radius {r^1_i}_{i=1}^{k_1} of each cluster {c^1_i}_{i=1}^{k_1} and determine D, the maximum radius of any cluster.

4. Set τ_1 = min(τ, D − ε), where ε → 0 is an extremely small positive quantity. Set τ = τ_1.

5. Set the level, t = 1.

6. For every cluster {c^t_i}_{i=1}^{k_t}, if r^t_i > τ_t
   • split c^t_i using k-means into (r^t_i / τ_t)^d clusters.

7. Let k_{t+1} be the total number of clusters. If k_{t+1} < k,
   • set τ_{t+1} = τ_t √(k_t / k_{t+1})
   • set t = t + 1
   • set count = count + 1
   • if count < n and τ_{count−1} ≥ D {2(count − 1) / (mk)}^{1/d}
     – compute the radius {r^t_i}_{i=1}^{k_t} of each cluster {c^t_i}_{i=1}^{k_t}
     – go to step 6.

8. Return the clusters with their centers.
EPIC is a multi-level hierarchical clustering algorithm. Every iteration contributes at most one level
to the hierarchy. Starting with k1 clusters, we want to split clusters having radius beyond a specified
threshold, while ensuring that the number of distance computations is less than that of the k-means algorithm. Therefore, we also need to bound the number of levels. We note that if the user-specified τ is greater than D, then no splitting of the k_1 clusters is possible. Hence, for meaningful analysis, we require τ < D. For this reason, τ is reset to τ_1 in Step 4. In case no prior knowledge about τ is available, τ can simply be taken as D − ε.
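For concreteness, the following sketch mirrors the flow of the above steps. It is illustrative only: scikit-learn's KMeans is used as a stand-in for MacQueen's 2-pass k-means, k1 and n are taken as inputs (step 2 gives one admissible choice of k1), the splitting factor is clipped to the cluster size, and all function and variable names are ours.

import numpy as np
from sklearn.cluster import KMeans   # stand-in for MacQueen's 2-pass k-means

def epic(X, tau, k, k1, n, m=100):
    """Sketch of EPIC: start from k1 clusters, then repeatedly split clusters
    whose radius exceeds the current threshold tau_t, shrinking tau_t until
    at least k clusters or the level bound is reached."""
    d = X.shape[1]
    km = KMeans(n_clusters=k1).fit(X)                                   # step 2
    clusters = [X[km.labels_ == i] for i in range(k1)]
    centers = [km.cluster_centers_[i] for i in range(k1)]
    radii = [np.max(np.linalg.norm(c - mu, axis=1)) for c, mu in zip(clusters, centers)]
    D = max(radii)                                                      # step 3
    tau_t = min(tau, D - 1e-9)                                          # step 4
    count, k_t = 1, k1
    while True:                                                         # steps 6-7
        new_clusters, new_centers = [], []
        for c, mu, r in zip(clusters, centers, radii):
            if r > tau_t and len(c) > 1:
                # number of sub-clusters (r / tau_t)^d, clipped to the cluster size
                s = int(min(len(c), max(2.0, (r / tau_t) ** d)))
                sub = KMeans(n_clusters=s).fit(c)
                for j in range(s):
                    part = c[sub.labels_ == j]
                    if len(part) > 0:
                        new_clusters.append(part)
                        new_centers.append(sub.cluster_centers_[j])
            else:
                new_clusters.append(c)
                new_centers.append(mu)
        k_next = len(new_clusters)
        clusters, centers = new_clusters, new_centers
        radii = [np.max(np.linalg.norm(c - mu, axis=1)) for c, mu in zip(clusters, centers)]
        if k_next >= k:                                                 # enough clusters
            break
        tau_t = tau_t * np.sqrt(k_t / k_next)                           # tau_{t+1}
        k_t = k_next
        count += 1
        if not (count < n and tau_t >= D * (2.0 * (count - 1) / (m * k)) ** (1.0 / d)):
            break
    return clusters, centers                                            # step 8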
4.3.1 Bound on Number of Distance Computations, Relation between τ and
k, and Maximum Permissible Levels
Consider the given dataset X = {x_1, x_2, . . . , x_N}, where the x_i ∈ ℝ^d are independent samples drawn from an identical distribution. The number of distance computations using k-means on X for m ≥ 2 iterations is given by

ND_1 = mNk − k²    (4.1)

Also, the number of distance computations in the first step of EPIC, using MacQueen's 2-pass algorithm, is

NDL_1 = 2Nk_1 − (k_1)²    (4.2)

Let C^t_i denote the i-th cluster at level t, with center c^t_i. Then, after the first level of clustering, we have k_1 clusters C^1_1, C^1_2, . . . , C^1_{k_1} with centers c^1_1, c^1_2, . . . , c^1_{k_1} respectively. Let us define r^t_i, the radius of cluster C^t_i, as

r^t_i = max_{x_j ∈ C^t_i} d(x_j, c^t_i)    (4.3)

where d(x, y) is the distance between x and y. In the EPIC algorithm, the i-th cluster at level t is partitioned at level t + 1 if r^t_i ≥ τ_t. Let k_t denote the number of clusters at level t. Clearly, k_1 equals the k_1 obtained in step 2. For each cluster i, 1 ≤ i ≤ k_t, define an indicator variable

Z^t_i = 1_{\{r^t_i > τ_t\}}    (4.4)

and a corresponding probability

p^t_i = P(Z^t_i = 1)    (4.5)

Let |C^t_i| denote the number of data points assigned to C^t_i. If a cluster C^t_i is partitioned at t + 1, the next level, then the expected number of distance computations at level t + 1 is given by

NDL_{t+1} = ∑_{i=1}^{k_t} p^t_i [ 2|C^t_i| (r^t_i / τ_t)^d − (r^t_i / τ_t)^{2d} ]    (4.6)

Suppose that the EPIC algorithm proceeds till the n-th level. Then, the expected number of total distance computations using EPIC is

ND_2 = NDL_1 + ∑_{t=1}^{n−1} NDL_{t+1} = 2Nk_1 − (k_1)² + ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} p^t_i [ 2|C^t_i| (r^t_i / τ_t)^d − (r^t_i / τ_t)^{2d} ]    (4.7)
Now, we can derive bounds for the difference in the number of computations, ND_1 − ND_2, as follows:

ND_1 − ND_2 = mNk − k² − 2Nk_1 + (k_1)² − ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} p^t_i [ 2|C^t_i| (r^t_i / τ_t)^d − (r^t_i / τ_t)^{2d} ]    (4.8)

= mNk − k² − 2Nk_1 + (k_1)² − ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} 2 p^t_i |C^t_i| (r^t_i / τ_t)^d + ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} p^t_i (r^t_i / τ_t)^{2d}    (4.9)

Taking A = mNk − k² − 2Nk_1 + (k_1)², we get

A − ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} 2 p^t_i |C^t_i| (r^t_i / τ_t)^d ≤ ND_1 − ND_2 ≤ A + ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} p^t_i (r^t_i / τ_t)^{2d}    (4.10)

Let P_t = max_i p^t_i. Further, define α_t = max(1, max_i (r^t_i / τ_t)) so that the following bound holds for all r^t_i:

0 ≤ r^t_i ≤ α_t τ_t    (4.11)

Then, (4.10) implies

A − ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} 2 P_t |C^t_i| α_t^d ≤ ND_1 − ND_2 ≤ A + ∑_{t=1}^{n−1} ∑_{i=1}^{k_t} P_t α_t^{2d}    (4.12)

⇒ A − ∑_{t=1}^{n−1} 2 P_t α_t^d ∑_{i=1}^{k_t} |C^t_i| ≤ ND_1 − ND_2 ≤ A + ∑_{t=1}^{n−1} k_t P_t α_t^{2d}    (4.13)

Since ∑_{i=1}^{k_t} |C^t_i| = N, therefore,

A − 2N ∑_{t=1}^{n−1} P_t α_t^d ≤ ND_1 − ND_2 ≤ A + ∑_{t=1}^{n−1} k_t P_t α_t^{2d}    (4.14)

Further, since the radius of any cluster may never exceed D, the maximum distance between any pair of points in X, we must have for all t,

α_t τ_t ≤ D    (4.15)

⇒ α_t^d ≤ (D / τ_t)^d ≤ (D / τ_{n−1})^d = M (say)    (4.16)

since τ_t ≥ τ_{n−1} for all t. It follows from (4.14) that

A − 2MN(n − 1) ≤ ND_1 − ND_2 ≤ A + ∑_{t=1}^{n−1} k_t M²    (4.17)
Now, the number of clusters at a level t + 1 is maximum when all the clusters at the previous level t are partitioned. That being the case, the number of clusters at level t + 1 satisfies

k_{t+1} ≤ k_t (r^t_i / τ_t)^d ≤ k_t α_t^d ≤ M k_t    (4.18)

Recursively simplifying (4.18) till t equals 1, we get

k_{t+1} ≤ M k_t ≤ M² k_{t−1} ≤ . . . ≤ M^t k_1    (4.19)

which, in light of (4.17), yields

A − 2MN(n − 1) ≤ ND_1 − ND_2 ≤ A + ∑_{t=1}^{n−1} M^{t+1} k_1    (4.20)

⇒ A − 2MN(n − 1) ≤ ND_1 − ND_2 ≤ A + k_1 { (M^{n+1} − M²) / (M − 1) }    (4.21)

This gives a bound on the difference in the number of distance computations. Now, we want ND_1 ≥ ND_2. Then,

A − 2MN(n − 1) ≥ 0 ⇒ M ≤ A / (2N(n − 1))    (4.22)

Plugging in the values of A and M, we get

(D / τ_{n−1})^d ≤ (mNk − k² − 2Nk_1 + (k_1)²) / (2N(n − 1)) = (mk − 2k_1) / (2(n − 1)) − (k² − (k_1)²) / (2N(n − 1))    (4.23)

Since k ≥ k_1,

(D / τ_{n−1})^d ≤ mk / (2(n − 1))    (4.24)

Then, the relation between τ and k is given by

D ≥ τ ≥ τ_{n−1} ≥ D {2(n − 1) / (mk)}^{1/d} ≥ D {2 / (mk)}^{1/d}    (4.25)

Since this bound is maintained as an invariant by EPIC, (4.22) is satisfied, resulting in a smaller number of distance computations for EPIC than for k-means. The parameter n in EPIC can be used to control this gap in the number of computations: a small value of n corresponds to a large gap. Finally, we bound the maximum number of permissible levels using (4.25) as

D {2(n − 1) / (mk)}^{1/d} ≤ D    (4.26)

⇒ 2(n − 1) ≤ mk    (4.27)

n_max = ⌊mk/2⌋ + 1    (4.28)

where n_max is the maximum number of levels which ensures that EPIC is computationally more efficient than the k-means algorithm. We also want to bound the value of M, since it is directly involved in the expression for the difference in the number of computations. Using (4.16) and (4.24), we get

(D / τ)^d ≤ M ≤ mk / (2(n − 1))    (4.29)

Finally, to complete the unification of τ and k, we must ensure that the number of clusters at the termination of the EPIC algorithm is bounded by k, irrespective of the value of k_1. Then, using (4.19), we must have for any value of M given by (4.29),

k_1 M^{n−1} ≤ k    (4.30)

⇒ k_1 {mk / (2(n − 1))}^{n−1} ≤ k    (4.31)

which, in the wake of (4.29), yields

k_1 = ⌊ (1 / k^{n−2}) {2(n − 1) / m}^{n−1} ⌋    (4.32)
4.4 Application of EPIC to classification
A two-level implementation of EPIC can be employed to reduce the time complexity of various classification algorithms. We present below a generic technique for the integration of EPIC into classification
algorithms to improve their performance.
Inputs: A set of training examples and corresponding class labels, X = {x_i, y_i}_{i=1}^N, where x_i ∈ ℝ^d and y_i ∈ Γ, the set of labels; the number of clusters, k.
1. Cluster X into k clusters and determine the radius of each cluster.
2. Set τ to some value in the range indicated by (4.25).
3. Train the classifier using the centroids of those clusters that have their radius greater than τ .
4. Determine the clusters which form a part of the classification model. Sub-cluster these clusters.
5. Train the classifier using the centroids of the clusters (obtained in the previous step), which have
their radius greater than τ .
6. Again determine the clusters in the classification model and train the classifier with the patterns
in these clusters.
Analysis of Complexity
The entire dataset of N patterns is processed in Step 1 only. Step 2 takes linear time since the distance
of each pattern from its cluster center needs to be investigated for determining the maximum radius,
and τ subsequently. In Steps 3 and 5, only O(k) patterns are processed. The number of patterns
processed in Steps 4 and 6 is much less than N , for large N . In addition, significantly many patterns
are eliminated, either because they belong to clusters having a small radius or because they are not a
part of the model. Thus, the first step predominantly determines the time complexity of the algorithm. This clustering step has O(N) time complexity for constant k and d ≪ N. Thus, the training time complexity of a classifier integrated with two-level EPIC becomes linear.
4.4.1 Integration of EPIC into Support Vector Machines (SVMs)
Training an SVM [10, 11] involves solving a Quadratic Programming (QP) problem. The time complexity of training an SVM is O(N²), and the space complexity is at least quadratic. Decomposition methods such as Chunking [12] and Sequential Minimal Optimization (SMO) [13] choose a set
of Lagrange variables to optimize in each iteration and solve the optimization problem involving these
variables. The SVM optimization problem has been reformulated using techniques such as the Least
Squares SVM and the Reduced SVM. Core Vector Machines (CVMs) [14] consider a Minimum Enclosing
Ball (MEB) problem and try to obtain an approximate solution. An optimization problem called the
Structural SVM has been used in a Cutting Plane algorithm [15]. In each iteration, it considers a few
constraint violations and finds the solution that satisfies these constraints. This process is continued till
a required approximate solution is obtained. Weight based sampling and selective sampling techniques
that iteratively choose the most useful training examples have also been proposed in the literature. Our
approach is similar to the Clustering based SVM (CB-SVM) [8] approach, which integrates a scalable
hierarchical micro-clustering algorithm, BIRCH [2], into an SVM to reduce the number of patterns that
are processed by the learner.
Inputs: Dataset χ = {x_i, y_i}_{i=1}^N, x_i ∈ ℝ^d, y_i ∈ {−1, 1}; number of clusters k; radius threshold τ
1. Cluster the positive and negative patterns independently and calculate the radius of each cluster
obtained.
2. Train an SVM boundary function from the centers of the clusters whose radius is greater than τ .
3. Sub-cluster the clusters which are near the boundary.
4. Construct another SVM from the centers obtained in the previous step.
5. De-cluster the clusters near the boundary of the SVM constructed in step 4 and train an SVM on
the patterns in these clusters. This gives the final SVM boundary function.
In step 3, we need to determine the clusters which are close to the boundary. Closeness to the boundary [8] is defined as follows: let D_i be the distance of the i-th cluster center to the boundary, R_i the radius of that cluster, and D_s the maximum of the distances from the support vector centers to the boundary. A cluster is said to be near the boundary if D_i − R_i < D_s.
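As an illustration of this test, the following sketch computes the mask of near-boundary clusters. It is illustrative only: it assumes a trained scikit-learn linear SVM (so that clf.coef_ and clf.decision_function are available), the cluster radii from step 1, and the support-vector cluster centers; the function name is ours.

import numpy as np

def clusters_near_boundary(centers, radii, support_centers, clf):
    """Mask of clusters with D_i - R_i < D_s, where distances to the separating
    hyperplane are obtained as |decision_function| / ||w|| (linear kernel assumed)."""
    w_norm = np.linalg.norm(clf.coef_)
    d_centers = np.abs(clf.decision_function(centers)) / w_norm             # D_i per cluster
    d_s = np.max(np.abs(clf.decision_function(support_centers)) / w_norm)   # D_s
    return (d_centers - radii) < d_s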
Analysis of Complexity
The first step of the algorithm involves clustering the given patterns into k clusters. Using MacQueen's k-means algorithm, we can obtain the clustering in O(Nk) time. Assuming that Sequential Minimal Optimization (SMO) is employed for training the SVM, the training time complexity with N patterns would be O(N²). SVM training is performed in steps 2, 4 and 5. In step 2, the number of training patterns equals the number of clusters which have a radius greater than the threshold. This is at most k clusters from the positive patterns plus at most k clusters from the negative patterns, so the training complexity is O((2k)²) ∼ O(k²). In step 3, the number of patterns which are clustered is N_1 = ∑_{i=1}^{k} w_i Y_i, where w_i is the number of patterns in the i-th cluster, and Y_i = 1_{\{i-th cluster is close to the boundary\}}. Hence, the time complexity of this step is O(N_1 k²). The number of patterns input to the SVM in step 4 is at most k², and thus the time complexity of this step is O(k⁴). The number of patterns input to the SVM constructed in step 5 is N_2 = ∑_{i=1}^{k} ∑_{j=1}^{k} Y′_{ij} w′_{ij}, where w′_{ij} is the number of patterns in the j-th sub-cluster of the i-th cluster, and Y′_{ij} = 1_{\{Y_i = 1 and the j-th sub-cluster is close to the boundary\}}. The training complexity of the final SVM is O(N_2²). The total time complexity is hence given by O(Nk + k² + N_1 k² + k⁴ + N_2²). When the dataset is large in size, we will have N_1 ≪ N and N_2 ≪ N. For such large datasets, the time complexity becomes O(Nk) ∼ O(N). Hence, for large datasets, this SVM training process has linear time complexity. At any point in time, it is required to store only the input patterns and the k cluster centers. Hence the space complexity is O(N + k) ∼ O(N).
4.4.2 Integration of Two-level EPIC into k-NNC
The k-NNC algorithm becomes computationally intensive when the size of the training set is large. Various techniques have been proposed in the literature to reduce the computational complexity of k-NNC. Clustering is used to realize an efficient k-NNC variant in [9]. This technique achieves considerable reduction in computation, but its time complexity is non-linear. We prove that, on incorporating the two-level EPIC algorithm into the k-NN classifier, the time complexity becomes linear. The two-level EPIC algorithm can be incorporated into k-NN classification as follows:
Inputs: Training dataset χ = {x_i, y_i}_{i=1}^N, x_i ∈ ℝ^d, y_i ∈ Υ; test pattern TP; number of clusters k′; radius threshold τ; number of neighbors k
1. Cluster the dataset χ into k′ clusters and determine the radius of each cluster.
2. Find k cluster centroids nearest to TP from those centroids whose clusters have radius greater
than τ .
3. Sub-cluster the k clusters obtained in the previous step and find the k nearest sub-cluster centroids.
4. From the patterns in the nearest sub-clusters, find the k nearest patterns. Assign TP the class to which the majority of these patterns belong.
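A compact sketch of this procedure is given below. It is illustrative only: scikit-learn's KMeans stands in for the clustering steps, X and y are assumed to be NumPy arrays, it is assumed that at least one cluster has radius greater than τ, and the function name is ours.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans   # stand-in for the clustering step

def knn_with_two_level_epic(X, y, tp, k_prime, tau, k):
    """k-NN classification with two-level EPIC-style pruning: keep only
    large-radius clusters, descend into the k nearest of them, and vote
    among the k nearest raw patterns found there."""
    km = KMeans(n_clusters=k_prime).fit(X)
    # Step 2: k nearest centroids among clusters whose radius exceeds tau
    kept = []
    for i in range(k_prime):
        members = np.where(km.labels_ == i)[0]
        radius = np.max(np.linalg.norm(X[members] - km.cluster_centers_[i], axis=1))
        if radius > tau:
            kept.append(i)
    kept = sorted(kept, key=lambda i: np.linalg.norm(tp - km.cluster_centers_[i]))[:k]
    # Step 3: sub-cluster the selected clusters and keep the k nearest sub-centroids
    sub_centroids, sub_members = [], []
    for i in kept:
        members = np.where(km.labels_ == i)[0]
        s = min(k, len(members))
        sub = KMeans(n_clusters=s).fit(X[members])
        for j in range(s):
            sub_centroids.append(sub.cluster_centers_[j])
            sub_members.append(members[sub.labels_ == j])
    order = np.argsort([np.linalg.norm(tp - c) for c in sub_centroids])[:k]
    # Step 4: vote among the k nearest patterns drawn from those sub-clusters
    pool = np.concatenate([sub_members[j] for j in order])
    nearest = pool[np.argsort(np.linalg.norm(X[pool] - tp, axis=1))[:k]]
    return Counter(y[nearest]).most_common(1)[0][0]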
Analysis of Complexity
The first step of the algorithm involves clustering the given patterns into k′ clusters. Using MacQueen's k-means algorithm, we can obtain the clustering in O(Nk′) time. In step 2, O(k′) distances need to be computed. Sorting these distances and finding the k nearest centroids has a complexity of O((k′)²) even if a simple sorting algorithm such as bubble sort is employed. In step 3, the number of patterns which are clustered equals N_1 = ∑_{i=1}^{k} w_i Y_i, where w_i is the number of patterns in the i-th cluster and Y_i = 1_{\{i-th centroid is among the k-NN\}}. To find the k nearest sub-cluster centroids, O((kk′)²) effort is required. The time complexity of this step is O(N_1 kk′ + (kk′)²). The number of patterns input to step 4 is N_2 = ∑_{i=1}^{k} ∑_{j=1}^{k} Y′_{ij} w′_{ij}, where w′_{ij} is the number of patterns in the j-th sub-cluster of the i-th cluster, and Y′_{ij} = 1_{\{Y_i = 1 and the j-th sub-cluster is among the k-NN\}}. Finding the k nearest neighbors among these patterns requires the sorting of N_2 distances, which takes O(N_2²) effort. Hence, the total time complexity is given by O(Nk′ + (k′)² + N_1 kk′ + (kk′)² + N_2²). When the dataset is large in size, we will have N_1 ≪ N and N_2 ≪ N. Thus, the time complexity becomes O(N) for constant k and k′. At any point in time, storage is required only for the input patterns, the k′ cluster centers and the k nearest neighbors. Hence the space complexity is O(N + k + k′) ∼ O(N).
4.5 Experimental Results
4.5.1 Integration of Two-level EPIC into SVM
Integration of two-level EPIC into SVM was tested on both synthetic and real datasets. We observed
that there was considerable reduction in the training and testing time and the accuracy was comparable
to that of an SVM trained on the entire dataset at once. The testing was performed on an Intel(R) Xeon(R) 2GHz machine with 4096KB cache and 4038MB memory.

Table 4.1: Training and testing timings for synthetic dataset 1 (using SVM light)
NumClusters      -       5       4       3       4
Threshold        -       3       5       10      7
Training Time    1.01    1.01    1.00    0.69    0.72
Testing Time     0.04    0.04    0.03    0.02    0.03
Accuracy         100     100     100     99.94   99.62
NumSVs           57      54      49      15      10

Table 4.2: Training and testing timings for synthetic dataset 1 (using SVMperf)
NumClusters      -        10       10
Threshold        -        5        5.5
Training Time    0.94     0.94     0.91
Testing Time     0.01     0.02     0.01
Accuracy         99.98    99.23    99.67
NumSVs           2        2        2
Synthetic Datasets
To verify that there is a substantial decrease in the training and testing time when two-level EPIC
clustering is incorporated in SVM training, we tested the algorithm on two synthetic datasets:
Dataset 1: This dataset contains a total of 1,00,000 two-dimensional patterns. The patterns of each
class are drawn from independent normal distributions with means [1 5]T and [10 5]T and unit variance.
The two classes are linearly separable. The test dataset, consisting of 80,000 patterns, is drawn from
the same distribution. The results are presented in Tables 4.1 and 4.2. The first column in Table
4.1 shows the results corresponding to the SVM light V 6.01 implementation of SVM trained on the
entire dataset without performing clustering (Plain SVM ). The remaining columns in this table are the
results for SVM light with two-level EPIC incorporated. The first column in Table 4.2 shows the results
corresponding to the SVMperf V 2.10 implementation of SVM trained on the entire dataset without
performing clustering. The remaining columns in this table are the results for SVMperf with two-level
EPIC incorporated. We observe that an SVM with two-level EPIC performs better than Chunking or
Sequential Minimal Optimization and is on par with the Cutting Plane Algorithm.
Dataset 2: This dataset contains a total of 10,00,000 two-dimensional patterns. The patterns of
each class are drawn from independent normal distributions with means [1 5]T and [5 5]T and unit
variance. This dataset is not linearly separable and hence is more realistic. The test dataset, consisting of 6,00,000 patterns, is drawn from the same distribution. The time taken for training and testing (in seconds), the test accuracy achieved and the number of support vectors for varying values of the parameters numClusters and threshold are recorded in Table 4.3. The first column in the table shows the results corresponding to Plain SVM. The remaining columns are the results for SVM (SVM light) with two-level EPIC incorporated. It can be inferred from these results that incorporating two-level EPIC in SVM training greatly reduces the running time of SVM training.

Table 4.3: Training and testing timings for synthetic dataset 2
NumClusters      -          10         7          10
Threshold        -          3          5          6
Training Time    5287.98    2731.46    1037.95    12.74
Testing Time     0.26       0.34       0.30       0.26
Accuracy         97.71      97.04      96.32      94.97
NumSVs           57894      31825      37060      1209
Real dataset
Figure 4.1: OCR 1 vs 6 : (a) accuracy vs. threshold (b) support vectors vs. threshold
We present the results obtained from experiments performed on different class combinations of the
Optical Character Recognition (OCR) dataset. This dataset consists of handwritten characters representing the numerals 0 to 9. The training and test sets, respectively, consist of 667 and 333 patterns
for each class. Each pattern has 192 features and a class label ranging from 0 to 9. For the purpose
of experimentation with SVM, we consider combinations of classes, 1 vs 6 and 3 vs 8. For each of the
two class combinations, we record the test accuracy and the number of support vectors. The number
of support vectors is an indicator of the complexity of the trained classifier and is also related to a bound on the probability of error [11]. Hence, the lower the number of support vectors, the better the classifier. The results are presented in Figs. 4.1(a), 4.1(b), 4.2(a) and 4.2(b).

Figure 4.2: OCR 3 vs 8 : (a) accuracy vs. threshold (b) support vectors vs. threshold

We can observe that the number of
support vectors after clustering is considerably less than the number of support vectors with Plain SVM, while maintaining a comparable test accuracy. For the class combination 1 vs 6 (Figs. 4.1(a) and 4.1(b)), with a radius threshold of 90 units, the reduction in the number of support vectors is approximately 25% with no reduction in accuracy. For the class combination 3 vs 8 (Figs. 4.2(a) and 4.2(b)), with a radius threshold of 300 units, there is an improvement in the test accuracy with a reduction of 27% in the number of support vectors. This can be attributed to the fact that some of the patterns which cause the SVM to overfit the training data are eliminated during the cluster elimination process. For both class combinations, the accuracy reduces as the threshold increases beyond a certain limit. This is because, as the threshold is increased beyond this limit, fewer examples are left for the SVM to learn from, and important information that is required for classification is lost.
Comparison with Other Techniques
In order to demonstrate that our algorithm is on par with the methods that are currently employed to
reduce the training time of SVM, we performed empirical comparisons with CB-SVM. To compare with
CB-SVM, we use the synthetic dataset described in [8]. A 2-dimensional dataset was generated with
parameter values k = 50, cl = 0.0, ch = 1.0, rl = 0.0, rh = 0.1, Nl = 0, Nh = 10000, and θ = 0.5. The
results are tabulated in Table 4.4. The first column contains results using CB-SVM as reported in [8].
The results show that our algorithm is on par with CB-SVM.
Table 4.4: Comparison with CB-SVM
Training set size    1,13,601    1,20,738    1,20,738    1,20,738
NumClusters          -           10          17          20
Threshold            -           0.05        0.05        0.05
Training Time        10.589      14.09       4.62        2.91
Accuracy             99.00       99.00       96.00       95.00
Table 4.5: Results for k-NNC
Dataset                                   Synthetic Dataset 1    Synthetic Dataset 2    OCR
Time taken by k-NNC                       1.695                  17.347                 0.224
Accuracy (k-NNC)                          100                    97.1                   92.49
NumClusters                               6                      6                      15
Threshold                                 5                      5                      5
Time taken by k-NNC with 2-level EPIC     1.382                  9.852                  0.2
Accuracy (k-NNC with 2-level EPIC)        100                    96.8                   92.40
4.5.2 Integration of Two-level EPIC into k-NNC
The results corresponding to the integration of two-level EPIC in k-NNC are presented in Table 4.5. We
selected 1,000 test patterns from each of the synthetic datasets. Since k-NNC is a multi-class classifier,
the entire OCR dataset was used. For each dataset, we built a 5-NNC and recorded the average time taken (in seconds) for the classification of a single example and the test set accuracy. A considerable
reduction in time is observed.
4.6 Conclusions/Future Work
We proposed an algorithm, EPIC, which is based on both k and τ. EPIC performs significantly fewer distance computations than the popular Lloyd's k-means algorithm. We also established a relation between τ and k. We also presented a generic technique to integrate a two-level EPIC algorithm into different classifiers, in order to achieve linear training time complexity. Our experimental results strongly suggest that EPIC can be efficiently integrated into SVM and k-NNC classifiers: the accuracy obtained using EPIC is better than or competitive with state-of-the-art algorithms, while the time taken is much less. EPIC does not make assumptions about the underlying distribution of the input data. Prior knowledge of the input distribution would help in incorporating more knowledge into the skeletal EPIC algorithm, thereby improving the performance further.
Bibliography
[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys,
31(3), 1999.
[2] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very
large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management
of Data, pp. 103–114, 1996.
[3] M. Ester, H-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in
large spatial databases with noise. Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD), pp. 226–231, 1996.
[4] H. Spath. Cluster Analysis Algorithms for Data Reduction and Classification, Ellis Horwood, Chich-
ester, 1980.
[5] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), pp.
129–137, 1982.
[6] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A.Y. Wu. An Efficient
k-means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, pp. 881–892, 2002.
[7] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings
of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, pp. 281-297, 1967.
[8] H. Yu, J. Yang and J. Han. Classifying large data sets using SVMs with hierarchical clusters. In the
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge discovery and Data
Mining, pp. 306–315, 2003.
[9] B. Zhang and S. N. Srihari. Fast k-Nearest Neighbor Classification Using Cluster-Based Trees. IEEE
Transactions on Pattern Analysis and Machine Intelligence (PAMI), pp. 525–528, 2004.
[10] V. N. Vapnik. Statistical Learning Theory, John Wiley & Sons, Inc., 1998.
[11] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 2, pp. 121–167, Springer, 1998.
[12] T. Joachims. Making large-Scale SVM Learning Practical. Advances in kernel methods: Support
Vector Learning, MIT Press Cambridge, MA, pp. 169–184, 1999.
[13] J. Platt. Sequential Minimal Optimization: A fast algorithm for training Support Vector Machines. Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, pp. 185–208, 1999.
[14] I.W. Tsang, J.T. Kwok, and P.M. Cheung. Core Vector Machines: Fast SVM Training on Very
Large Data Sets. The Journal of Machine Learning Research, 6, pp. 363–392, 2005.
[15] T. Joachims. Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 217–226, 2006.
Chapter 5
Feature Subspace SVMs (FS-SVMs)
for High Dimensional Handwritten
Digit Recognition
5.1 Introduction
In many pattern classification applications, data are represented by high dimensional feature vectors.
There are two reasons to reduce the dimensionality of pattern representation. First, a low dimensional representation reduces computational overhead and improves classification speed. Second, low dimensionality tends to improve the generalization ability of classification algorithms. Moreover, limiting the number of features cuts down the model capacity and thus may reduce the risk of over-fitting [35]. The classifier that maximizes its performance on the training data may not always perform well on test data. The performance of a classifier on test data depends on factors such as the training sample size, the dimensionality of the pattern representation and the complexity of the classifier. SVMs are less likely to overfit data than other non-regularized classification algorithms, since the structural risk minimization principle underlying SVMs chooses the discriminative function that has the minimal risk bound [16]. One of the major drawbacks of SVMs is that the training time grows almost quadratically in the number of examples. This issue becomes even more critical for multi-class problems, where a set of binary SVMs must be built and combined. This is the case for the One-Against-All approach, which is widely used in implementations of SVMs.
Feature selection is a major approach to dimensionality reduction [1]. Feature selection refers to
selecting features in the input space and the features obtained form a subset of the original input
feature set. In the literature, just a few algorithms have been proposed for SVM feature selection. In
[2], a mathematical programming method which minimizes a concave function on a polyhedral set was
proposed. In [3], feature subset selection was done by optimizing a modified criterion that induces an
extra term to penalize the size of the feature subset. In [4], the authors introduced a binary vector
representing the presence or absence of a feature to the optimization criterion, with the motivation of
approximating the binary vector with a real valued vector so that gradient descent methods can be used
to find the optimal value of the binary vector and the corresponding feature subset. Basically, the three
methods mentioned above evaluate features on an individual basis, although the features actually work
in a collective way in the discriminative function. To deal with this problem, an SVM recursive feature
elimination (SVM RFE) algorithm that evaluates features on a collective basis was proposed [5] for gene
data. In [32], the authors proposed the Reduced Feature Support Vector Machine (RFSVM) algorithm
for completely arbitrary kernels. Recently, a feature selection algorithm has been proposed for scene
categorization using support vector machines [33].
Ensemble methods have been quite popular in the literature. Ensemble methods are learning algorithms
that construct a set of classifiers and then classify new data points by taking a weighted vote of their
predictions. Ensemble methods have been shown to be effective in [6, 7, 8, 9, 17, 18, 19, 20]. Recently
some research has gone into devising ensemble methods for SVMs as well. In [10], the authors perform
a bias-variance analysis of Support Vector Machines for development of SVM-based ensemble methods.
A horizontal Divide and Conquer approach has been proposed which uses different experts for different
subsets of patterns [11]. However, the training time becomes a serious bottleneck with an increase in
number of examples. Another limitation of this approach lies in the fact that it can be used to separate
only a single class from other classes; an SVM has to be trained separately for each class.
5.2 Motivation
In our work, we propose a novel technique of incorporating a set of features dynamically to enable
training each SVM exactly once. First, rather than partitioning the input space on the basis of the
corresponding class, we partition the training set based on the subsets of features. That is, partitioning
is done in the feature space and not the input space and hence the name Feature Subspace Support
Vector Machines (FS-SVMs). Each of these feature subsets is used to train an SVM, which is subsequently evaluated on a test set; the accuracy of each SVM determines the weight of the corresponding subset. We then combine the weighted predictions of the individual SVMs to find the most likely class, and these weights are used to classify test data. The need for such an approach is highlighted in Figure 5.1.
The Iris Plants Database [12] is one of the best known data sets in the literature. The data set
contains 3 classes: Iris Setosa, Iris Virginica, and Iris Versicolor where each class refers to a type of Iris
plant. The feature space consists of four attributes namely sepal length, sepal width, petal length and
Figure 5.1: Iris Dataset: The need for segmentation of feature space
petal width. It is observed that the class Setosa is linearly separable from Virginica and Versicolor using
petal length and width only. However, Virginica and Versicolor are not linearly separable. Therefore,
overfitting can be avoided by segmenting the feature space and keeping the features petal length and
petal width in the same partition. Versicolor and Virginica can be separated by using a non-linear SVM
trained on other features. Thus the FS-SVM approach can be effectively used to separate patterns by
employing different kernels in different regions of the feature space. Our experimental results on the Iris database support this observation: a classification accuracy in excess of 98.5% was obtained.
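As a minimal sketch of this observation (the 90/60 split, the use of scikit-learn, and the choice of kernels below are illustrative assumptions, not the exact experimental setup; columns 2 and 3 correspond to petal length and petal width in the standard Iris ordering), one might compare the two feature groups as follows:

# Minimal sketch: a linear SVM on the petal features alone, and a non-linear SVM
# on the remaining (sepal) features, trained on a random 90/60 split of Iris.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target          # columns: sepal len, sepal wid, petal len, petal wid
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=90, random_state=0)

petal = [2, 3]                                         # petal length and petal width
linear_on_petal = SVC(kernel="linear").fit(X_tr[:, petal], y_tr)
print("linear SVM on petal features:", linear_on_petal.score(X_te[:, petal], y_te))

rbf_on_sepal = SVC(kernel="rbf", gamma="scale").fit(X_tr[:, :2], y_tr)
print("RBF SVM on sepal features   :", rbf_on_sepal.score(X_te[:, :2], y_te))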
The FS-SVM approach is all the more promising in the context of handwritten digit data. Figure
5.2 shows a sample each of 4, 7, 8, and 9. The features are shown to be partitioned into two subsets,
A and B. The features in B contain sufficient information to separate 8 from the rest. The
features contained in A can be used to separate the digits 4, 7, and 9 from each other. We note that
the feature subset A, if used alone, may not be able to distinguish between 8 and 9. However, A can be
used to separate all the four digits, if used in conjunction with B. Likewise, the other digits may also
be classified correctly by an appropriate partitioning of the feature space.
Figure 5.2: Segmentation of feature space for handwritten digit data
Figure 5.3 indicates where the FS-SVM approach fits in the existing paradigm of efficient classification
using SVMs. Linear SVMs can be trained effectively in linear time provided the data is linearly separable [13]; otherwise, non-linear SVMs have to be used. For large data sets, clustering based SVMs
have been shown to yield good performance [14]. Ensemble methods using different experts in different
regions of the horizontally segmented input space have also been proposed [11]. The FS-SVMs use
different kernels based on partitioning of the feature space.
In many classification applications, data are represented by high dimensional feature vectors. As
Figure 5.3: Different approaches for SVM Classification
already noted, training SVMs on such feature vectors in one shot incurs high computational cost. In
addition, reduced dimensionality helps in keeping space requirements low; this becomes a critical factor
in the context of large, high dimensional datasets, where the number of I/O operations becomes a bottleneck.
Besides, we may want to have different SVM formulations, such as the one in [34], for different feature
subsets and/or assign different weights to features based on their significance. This may help in tackling
overfitting.
For many practical applications such as high dimensional handwritten digit recognition, the dimen-
sionality of data remains very high even after feature selection. Therefore, there is a need to reduce the
feature set further but without sacrificing the classification accuracy much. In our work, we therefore
introduce another stage that we call the feature reduction step (Figure 5.4). This feature reduction
step reduces the dimensionality of the data without compromising much on generalization ability. The
features chosen during the feature selection step are input to the feature reduction step. In other words,
feature selection is done on the pre-processed data, using techniques such as [2, 4], and then the se-
lected features are input to the feature reduction algorithm. Henceforth, the description of the feature
reduction step will presume the availability of a suitable set of features.
Figure 5.4: Steps in the modified classification process
The essential idea behind incorporating the feature reduction step can be understood using Figure
5.5. The examples from two classes, shown in rectangles and circles, are well separated using the
maximum margin separating hyperplane SH 1. However, most of the examples are correctly classified
using separating hyperplane SH 2 alone. Therefore we can discard Feature 2 at a slight expense of
classification accuracy. This effect may become even more pronounced in case of high dimensional data.
We introduce the α-MFC problem to formalize the feature reduction step.
Figure 5.5: The proposed feature reduction step
5.3 The α-Minimum Feature Cover (α-MFC) Problem
Definition 1. For a given training set X and test set Y, a feature set D = {d1, d2, ..., dn} is defined to
be optimal if no subset of D results in a greater accuracy than that obtained using D, with respect to a
classifier trained on X and tested on Y.
Definition 2. Given a training set X and a test set Y on an optimal feature set D = {d1, d2, ..., dn},
define an α-Minimum Feature Cover (α-MFC) of D for an SVM as a subset D’ such that classification
accuracy of the SVM using D' is no less than α times that obtained using the original set D (0 ≤ α ≤ 1). Further, there is no other subset D* of D having (a) fewer features than D' and accuracy greater than α times that obtained using D, or (b) the same number of features as D' but greater accuracy.
The α-Minimum Feature Cover (α-MFC) problem can be rephrased as a decision problem, {(D, α, k): feature set D has an α-MFC of size k}, where the size of a feature set refers to the number of
features contained in it.
Theorem 5. The α-MFC problem is NP-Hard.
Proof. We show a reduction from the Minimum Vertex Cover [21], a well-known NP-Complete problem.
Let w(di, dj) denote the magnitude of correlation between the features di and dj . It is clear that 0
≤ w(di, dj) ≤ 1 ∀di, dj ∈ D. Consider any arbitrary value β such that β ∈ (0,1). Now, draw a graph
G = (V,E), where the vertex set V consists of a collection of nodes, each corresponding to a feature in
the original set D. Connect by an edge all those pairs of vertices i, j in V for which w(di, dj) ≥ β.
The reduction algorithm takes as inputs, an instance (G, k) of the Minimum Vertex Cover Problem
and a specified α, to generate an instance (D, α, k) of the Minimum Feature Cover problem. A node is
drawn corresponding to each vertex in G. Further, for any two vertices in G that are connected by an
edge, the magnitude of correlation is set to 1 (which is > β ∀β ∈ (0, 1)) while the non-adjacent vertices
incur a value 0, in the corresponding input instance of the α-MFC algorithm. Clearly this can be done
in polynomial time by inspecting each pair of vertices of G for an edge. This has the effect of retaining in the α-MFC instance exactly those edges present in G. Now set the value of α to
1. Thus, an instance (D, 1, k) of the α-MFC is obtained corresponding to any graph G of the Minimum
Vertex Cover problem.
We claim that D has a 1-MFC of size k iff there is a vertex cover of size k in G. If we can
find a polynomial time solution to the 1-MFC problem, then we may as well find the minimum vertex
cover in G. However, this cannot be true unless P = NP. Conversely, if there is a vertex cover of size
k, then the set consisting of all the features corresponding to vertices present in the vertex cover is
the desired 1-MFC set D'. This follows since there cannot be a set D* having size less than k and accuracy greater than that of D' (which is 1, the same as that of D); otherwise our original set D would not be the optimal feature set of Definition 1, and we would arrive at a contradiction. Therefore, it follows that the α-MFC problem is NP-Hard.
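A small, hypothetical sketch of the construction used in the proof (the function name, the dense matrix representation, and the triangle-graph example are illustrative assumptions):

# Build the correlation instance used in the reduction: adjacent vertices receive
# correlation 1 (which exceeds any beta in (0,1)); non-adjacent pairs receive 0.
def mfc_instance_from_graph(num_vertices, edges):
    w = [[0.0] * num_vertices for _ in range(num_vertices)]   # w[i][j] = |correlation(d_i, d_j)|
    for i, j in edges:
        w[i][j] = w[j][i] = 1.0
    return w                              # together with alpha = 1, this yields the instance (D, 1, k)

# Example: the triangle graph on three vertices
w = mfc_instance_from_graph(3, [(0, 1), (1, 2), (0, 2)])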
An important implication of Theorem 5 is that, unless P = NP, we cannot find a polynomial-time algorithm that isolates a minimum subset of features whose accuracy is within a specified parameter α of the original feature set. Therefore we look for a greedy approach to determine a good reduced feature subset. We will return to an algorithm for accomplishing exactly this goal, but first we show how
an incremental Feature Subspace approach can be employed for classification using SVMs.
5.4 Feature Subspace SVMs (FS-SVMs)
Many ensemble methods have been successfully employed in pattern classification, but almost all of them use different experts in different parts of the input space, training each classifier on a subset of the patterns. We investigate the efficiency of an algorithm based on partitioning the feature space in the context of SVMs.
Let X={x1, x2, ..., xn} be the training set defined on D = {d1, d2, ..., dk}, the set of k features with
class label y ∈ {1, 2, ..., C} denoting one of the C classes. Without loss of generality, assume that D
is divided into M blocks P1, P2,..., PM with corresponding weights I1, I2, ..., IM . Each of the SVMs,
Si, is trained on a corresponding block Pi, i ∈ {1, 2, ..., M}. Note that the SVMs Si need not be
different and we may as well use a single SVM for all the blocks. The weights I1, I2, ..., IM represent
prior knowledge about the importance of the corresponding blocks and can be determined empirically using an approach such as that of Grandvalet and Canu [31]. In the absence of any knowledge about the significance of the blocks, all weights can be set to 1. Note that these blocks need not all be available simultaneously; we train the corresponding SVM as and when a block becomes available.
The Feature Subspace algorithm for SVMs is given in Algorithm 1. It is to be noted that the
Update: Feedback Phase adjusts the weights of individual SVMs based on their classification decision,
after every pattern is classified, correctly or not. At the time of classifying a new digit sample, these
individual weights are scaled with the corresponding class prediction and the prior importance of the
feature subsets. If an SVM Si incorrectly predicts a class j, it is penalized by reducing its weight, as the
formula for Wij suggests. The weights W , therefore, provide feedback for further classification decisions.
Algorithm 1: FS-SVMs
Divide: Training Phase
1.1 for i =1 to M do
1.2 train SVM Si on Pi
1.3 test Si using a test set X’
1.4 for j =1 to C do
1.5 Wij = Fraction of correct predictions in j by Si
1.6 end for
1.7 end for
Combine: Test Phase
1.8 for each test example x do
1.9 Let Aixj be an indicator variable that equals 1 if SVM Si predicts class j for x and 0 otherwise (∀ i = 1 to M). The final predicted class for example x is given by
argmax_j Σ_{i=1}^{M} Ii · Wij · Aixj, where j ∈ {1, 2, ..., C}
Update: Feedback Phase
1.10 Update the weights Wij of the SVMs based on the classification decision
1.11 end for
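A minimal sketch of the Divide and Combine phases in Python with scikit-learn (the RBF kernel, the use of a held-out set X_val/y_val for computing the weights W, and equal prior weights Ii = 1 are assumptions; the Update phase is omitted for brevity):

# Sketch of FS-SVMs: one SVM per feature block, combined by weighted voting.
import numpy as np
from sklearn.svm import SVC

def train_fs_svms(X_tr, y_tr, X_val, y_val, blocks):
    # blocks: list of feature-index arrays P_1, ..., P_M
    svms, W = [], []
    classes = np.unique(y_tr)
    for P in blocks:
        svm = SVC(kernel="rbf", gamma="scale").fit(X_tr[:, P], y_tr)
        pred = svm.predict(X_val[:, P])
        # W[i][j]: fraction of held-out examples of class j predicted correctly by S_i
        w = np.array([np.mean(pred[y_val == c] == c) if np.any(y_val == c) else 0.0
                      for c in classes])
        svms.append(svm)
        W.append(w)
    return svms, np.array(W), classes

def predict_fs_svm(x, svms, W, classes, blocks, I=None):
    I = np.ones(len(blocks)) if I is None else I
    score = np.zeros(len(classes))
    for i, (svm, P) in enumerate(zip(svms, blocks)):
        pred = svm.predict(x[P].reshape(1, -1))[0]
        j = int(np.where(classes == pred)[0][0])
        score[j] += I[i] * W[i, j]                 # contribution I_i * W_ij * A_ixj
    return classes[np.argmax(score)]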
Algorithm 1 is conceptually similar to many existing ensemble methods, but there is a subtle difference. The novelty of Algorithm 1 lies in the scalability and computational efficiency achieved by partitioning the feature space, and in the generalization ability gained from horizontal weighted voting. In other words, unlike conventional ensemble methods, Algorithm 1 gains flexibility by amalgamating vertical partitioning with horizontal weighting.
The algorithm is generic in that, instead of using only SVMs, we could use SVMs in conjunction with other classifiers such as decision trees and the k-Nearest Neighbor classifier, as shown in Figure 5.6. It would be informative to compare the performance of an ensemble of diverse classifiers with that of a combination of SVMs; however, further discussion of this is beyond the scope of this work. Different SVM formulations could be used on different feature subsets. Further, based on
prior knowledge about the application domain, we may assign varying weights to different blocks based on the significance of the constituent features. For instance, in the case of handwritten digit data, the features representing the top and bottom portions are less significant than those toward the center and can accordingly be assigned smaller initial weights, as sketched below. If all features are equally important, we can simply set all Ii to 1.
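As a purely illustrative sketch (the 7 x 7 grid, the row-wise blocks, and the weighting formula are assumptions, not part of the experimental setup), such prior weights could be assigned as follows:

# Illustrative only: give feature blocks covering central rows of a 7 x 7 grid a
# larger prior weight I_i than blocks covering the top and bottom rows.
import numpy as np

rows = np.arange(49) // 7                                    # row index of each of the 49 cells
blocks = [np.where(rows == r)[0] for r in range(7)]          # one feature block per row
I = np.array([1.0 / (1.0 + abs(r - 3)) for r in range(7)])   # weight peaks at the central row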
Figure 5.6: Ensemble of classifiers using segments of feature space
5.5 A Greedy Algorithm for Approximating α-MFC
As already pointed out, data dimensionality may remain prohibitively high despite feature selection.
Thus, we propose a feature reduction step prior to classification. The aim of this stage is to find a
Minimum Feature Cover corresponding to the specified parameter α. However, an important corollary of Theorem 5 is that, unless P = NP, we cannot find a polynomial-time algorithm that isolates a minimum subset of features whose accuracy is within the specified parameter α of the original feature set. Therefore we need to look for a heuristic approach to determine a suitable feature subset.
We suggest a greedy algorithm to obtain a good feature subset. This approach proves to be very
useful particularly in applications such as handwritten digit recognition. There are primarily two reasons
for this observation. First, the features that are in close proximity to each other are generally more
correlated than those farther apart. So, nearby features can be grouped together in one partition.
Second, the features near the center are found to have much more impact on the aggregate decision of
the Support Vector Machine than those toward the periphery [25]. Hence in general, we can discard the
partitions containing features that are away from the middle. These observations may not be true in
case of all domains but prove worthy in case of applications like handwritten digit recognition.
Figure 5.7: Features near the periphery contain less discriminative information than those deep inside
This behavior can be understood from Figure 5.7 displaying a sample of handwritten digits. Each
cell in the 7 x 7 grid denotes a feature. Note that the features at the top and bottom of each digit do
not contain much information. Further, not much discrimination between a pair of digits can be achieved based primarily on these features. On the other hand, as we move closer to the central portion, the dissimilarity between different digits becomes more prominent. Hence, in general, we can dispense with the top and bottom parts without incurring too great a loss of information.
We now provide an algorithm based on FS-SVM that follows a greedy approach to find an approx-
imate Minimum Feature Cover. Algorithm 2 partitions the original feature set D into partitions, each of which is used to train and test a corresponding SVM. The least accurate of these partitions, Pr, is set aside and the overall accuracy of the combined remaining feature set, P, is determined. If this accuracy is greater than the desired parameter α, we proceed in the same manner to see whether more features can be removed; otherwise, we split the sidelined subset Pr into two subsets, Pr1 and Pr2, of equal size and merge the more accurate of the two with P. If the classification accuracy of Pr1 is the same as that of Pr2, either may be chosen (Algorithm 2 resolves ties in favor of Pr1). Iterating this process, we obtain a reduced feature set that either satisfies, or is very close to satisfying, the constraint α.
Algorithm 2: Approximate α-MFC
2.1 Divide the feature set D = {d1, d2, ..., dk} into M blocks P1, P2, ..., PM with y ∈ {1, 2, ..., C}
contained in each partition
2.2 for i = 1 to M do
2.3 train and test using features from Pi
2.4 end for
2.5 Choose the partition Pr with least classification accuracy,
r ∈ {1, 2, ..., M}
2.6 Remove Pr and check the overall accuracy using P = P1 ∪ ... ∪ Pr−1 ∪ Pr+1 ∪ ... ∪ PM
2.7 if the overall accuracy < α,
2.8 take partition Pr and split it into two partitions Pr1 and Pr2 of equal size (almost equal if Pr contains an odd number of features)
2.9 train and test separately on Pr1 and Pr2
2.10 if accuracy with Pr1 >= accuracy with Pr2,
2.11 P = P ∪ Pr1
2.12 else P = P ∪ Pr2
2.13 end if
2.14 return P as the greedy feature set close to satisfying the constraint α
2.15 else if the overall accuracy > α,
2.16 D = P
2.17 go to 2.1
2.18 else return P as the greedy feature set satisfying the constraint α
2.19 end if
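A minimal sketch of the greedy loop of Algorithm 2 in Python (the routine evaluate, which returns the classification accuracy of an SVM trained and tested on the given feature indices, is a caller-supplied assumption and is not specified in the text):

# Sketch of the approximate alpha-MFC greedy reduction.
import numpy as np

def approximate_alpha_mfc(features, M, alpha, evaluate):
    # evaluate(idx) -> accuracy obtained using feature indices idx (assumed helper)
    P = np.array(features)
    while True:
        blocks = np.array_split(P, M)                         # partitions P_1, ..., P_M (step 2.1)
        worst = min(blocks, key=evaluate)                     # least accurate partition P_r (step 2.5)
        rest = np.setdiff1d(P, worst)                         # union of the remaining blocks (step 2.6)
        acc = evaluate(rest)
        if acc > alpha:                                       # keep reducing (steps 2.15-2.17)
            P = rest
            continue
        if acc == alpha:                                      # step 2.18
            return rest
        mid = (len(worst) + 1) // 2                           # step 2.8: split P_r into two halves
        pr1, pr2 = worst[:mid], worst[mid:]
        keep = pr1 if evaluate(pr1) >= evaluate(pr2) else pr2  # steps 2.10-2.12, ties favor P_r1
        return np.union1d(rest, keep)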
Definition 3. The quality qi of a partition i is defined to be the fraction of accuracy that would be lost
if i is discarded. The quality of the entire feature set equals 1.
Theorem 6. Let q1, q2, ..., qt, q_{t+1} denote the respective qualities of the (t+1) partitions discarded before termination of Algorithm 2. Then, Algorithm 2 finds an α/(1 − q_{t+1})-approximate solution to the α-MFC problem.
Proof. Let Q be the accuracy obtained using the entire feature set. Then,
Accuracy left after partition 1 is removed = (1 − q1)Q
Accuracy left after partition 2 is removed = (1 − q1)(1 − q2)Q
Proceeding in the same way,
Accuracy left after t partitions are removed = (1 − q1)(1 − q2) · · · (1 − qt)Q = k (say).
This must be greater than or equal to αQ, since another partition is removed subsequently.
⇒ k ≥ αQ ...(1)
Also,
k(1 − q_{t+1}) < αQ ...(2)
The result follows using (1) and (2).
Figure 5.8: Sample patterns of handwritten digit data
Thus, Algorithm 2 yields a reduced feature set within α/(1 − q_{t+1}) of the optimal solution to the α-MFC problem. An important point is in order here. Theorem 6 implies that the quality of the reduced
feature set depends on the quality of the (t+1)-th partition, and consequently on the size of the (t+1)-th partition. Further, applying Algorithm 2 with a sequence of different partition sizes to the original feature set may result in a different final reduced feature set, even if the number of eliminated features is the same in all cases. Thus the quality of the final reduced set depends on the sizes of the partitions used to eliminate features and on the quality of the feature segments.
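As a brief numerical reading of this bound (with purely illustrative values), suppose α = 0.85 and the last discarded partition has quality q_{t+1} = 0.05. Then inequalities (1) and (2) give 0.85·Q ≤ k < (0.85/0.95)·Q ≈ 0.895·Q, i.e., the accuracy retained before the final split lies within the factor α/(1 − q_{t+1}) ≈ 0.895 of the accuracy Q of the full feature set.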
5.6 Experimental Results
On Iris [12], which consists of 150 samples, we used 90 samples randomly as training data and the
remaining 60 samples as test data. The FS-SVM algorithm (Algorithm 1) resulted in an average classi-
fication accuracy in excess of 98.5%. This compares favorably with the popular variants of SVMs such
as Max-Wins voting, DAGSVM, and one-versus-all [26, 27, 28, 29] (see experimental results in [30]).
We provide a detailed analysis of our results on handwritten digit data in the following sections.
5.6.1 Experimental Set-up
SVMs were trained on different non-overlapping partitions of our handwritten digit training set consist-
ing of 6670 examples, each with 192 features (a 16 x 12 grid). The test data comprised 3333 examples with the same features. Each example in the training and test sets belonged to exactly one of the 10 classes (0-9). Moreover, each of the training and test sets contained an almost equal number of patterns per class. Figure 5.8 shows a few sample patterns of the handwritten data used in our experiments.
We conducted extensive experimentation on a number of other standard datasets including USPS
[22], MNIST [23] and CEDAR [24]. The USPS dataset consists of 7291 training examples and 2007
test examples. On the other hand, CEDAR consists of 5802 training examples and 707 test examples.
The original MNIST dataset has 60,000 training examples and 10,000 test examples. However, for
Figure 5.9: Similarity vs. Block Size
our experiments, we chose a random set of 6,000 examples for training. Another set of 4,000 examples
was used as the test set. The software package primarily used was the freely available BSVM 2.06
software [15]. The BSVM package provides implementation of three different standard SVM techniques
for classification. For our experiments, we chose the implementation that solves a single optimization
problem for the purpose of classification. Further, default settings of BSVM were used for each of the
individual SVMs.
5.6.2 Analysis of Results obtained using Algorithm 1
With an SVM trained and tested over all the features, out of all the 3333 examples, 3065 examples
were classified correctly, giving an overall accuracy of 91.96%. Define the similarity of a partition as the fraction of predictions made by the corresponding SVM that match those made using the entire feature set in one go. Intuitively, the higher the similarity of a partition, the greater the importance of the features contained in it for the purpose of classification. To predict a new example, partitions are combined in proportion to their similarity to obtain the combined result. We note from our experiments that we are able to achieve high similarity values, making the Feature Subspace approach very encouraging.
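As a minimal sketch (the array names are assumptions), the similarity of a partition can be computed directly from its predictions:

# Similarity of a partition: fraction of the block-SVM's predictions that agree
# with the predictions obtained using the full feature set.
import numpy as np

def partition_similarity(pred_block, pred_full):
    return np.mean(np.asarray(pred_block) == np.asarray(pred_full))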
Figure 5.9 shows the similarity when a combined decision is made using non-overlapping blocks of
features of equal size (a pair of oblique lines has been used to indicate, thenceforth, a change in scale
along the corresponding axis). As similarity is defined as the fraction of predictions that tally with the predictions made using all the features simultaneously, the similarity of the entire set of 192 features is 1. Note that there could be variation in similarity values based on the tie-resolving rule. When different classes receive equally favorable predictions, it is observed that the similarity results are better when ties are resolved in favor of the latter class rather than the former. However, using Algorithm 1, this disparity is averted, since the probability of two classes receiving exactly equal overall weights on the real line is negligibly small. This shows that different weighting schemes have varying influence over the similarity measure. We aim to analyze more of these schemes in our future work.
Figure 5.10: Accuracy vs. Block Size
Figure 5.11: (Sample Dataset) Accuracy(%) results on training sets of different size
Figure 5.10, on the other hand, shows the overall accuracy results. Again, Algorithm 1 outperforms
the equal weighting schemes with different tie-resolving criteria. Moreover, the accuracy curves seem to
follow the similarity curves almost invariably.
We also conducted experiments to analyze the impact of Algorithm 1 on smaller handwritten digit
training sets and different partition sizes. Specifically, we used training sets of size 50, 80, 100, 150, 200,
250 and 300 for our evaluation. The test set data was kept unchanged. Figure 5.11 shows the results
on our sample dataset. It is clearly seen that Algorithm 1 outperforms the standard bound-constrained
SVM in terms of classification accuracy, even in case of smaller training datasets, for different partition
sizes. Similar results were obtained with the other datasets: MNIST (Figure 5.12), CEDAR (Figure
5.13), and USPS (Figure 5.14). These results strongly suggest that the FS-SVM approach works well
even when only a limited training dataset is available.
We also analyzed the computational costs involved in our experiments. Figure 5.15 shows the total
time taken by Algorithm 1 for smaller training sets, relative to using an SVM trained on an entire
dataset. Clearly, partitioning the dataset and recombining the individual verdicts in Algorithm 1 takes
Figure 5.12: (MNIST ) Accuracy(%) results on training sets of different size
Figure 5.13: (CEDAR) Accuracy(%) results on training sets of different size
Figure 5.14: (USPS ) Accuracy(%) results on training sets of different size
Figure 5.15: (Sample Dataset) Total relative time taken by Algorithm 1 on training sets of different size
Figure 5.16: (MNIST ) Total relative time taken by Algorithm 1 on training sets of different size
significantly less time compared to an SVM trained on the original feature space. Similar results were
obtained for MNIST (Figure 5.16), CEDAR (Figure 5.17), and USPS (Figure 5.18) datasets, thereby
endorsing the vast improvements in computational cost using Algorithm 1.
5.6.3 Analysis of Results obtained using Algorithm 2
As mentioned, another important means of mitigating the curse of dimensionality is to select the most pertinent feature subset. Using Algorithm 2, we note that this approach is even more promising than combining individual partitions together. As observed earlier, in the case of handwritten data, the features at the periphery contain very little discriminative information and thus can be discarded at the cost of a slight decrease in prediction accuracy. This fact is experimentally verified using Algorithm 2 on our sample dataset, as indicated by Figure 5.19. We started with M = 8 and a partition size of 24.
Figure 5.17: (CEDAR) Total relative time taken by Algorithm 1 on training sets of different size
Figure 5.18: (USPS) Total relative time taken by Algorithm 1 on training sets of different size
Figure 5.19: Accuracy vs. Number of Features
Figure 5.20: Reduction in Accuracy(%) vs. Reduction in Number of Features
Figure 5.21: Algorithm 2 vs. Random Selection
After successive runs of Algorithm 2, it was found that using 132 features yielded an accuracy of 87.31%. Even with an overall reduction of 50% in features from the original 192 features, we were
able to get an accuracy of 80.138%, as against 91.96% using the entire feature set. With α = 85%, only 120 features, ranging from 36 to 156, resulted in an overall accuracy of 85.06%, a slight reduction from what was achieved using the whole feature set (Figure 5.20). Thus retaining only these 120 features seems
a good accuracy-size tradeoff. Figure 5.21 shows a comparison in accuracy between a reduced feature
set obtained by Algorithm 2 and a random selection of features. Clearly Algorithm 2 outperforms
a random strategy. We performed several such experiments and similar results were obtained. The
difference becomes more pronounced as the number of features is reduced further.
Figure 5.22 shows the accuracy results obtained using Algorithm 2 on smaller training sets, for
selected feature sets of different size. It is clearly observed that Algorithm 2 results in high classification
accuracy, even after discarding a substantial portion of the original feature set.
Similar results were obtained for MNIST (Figure 5.23), CEDAR (Figure 5.24), and USPS (Figure
5.25) datasets.
We also analyzed the computational costs associated with Algorithm 2. Figure 5.26 shows the relative
Figure 5.22: (Sample Dataset) Accuracy(%) results obtained using Algorithm 2 on training sets ofdifferent size
Figure 5.23: (MNIST ) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
Figure 5.24: (CEDAR) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
Figure 5.25: (USPS ) Accuracy(%) results obtained using Algorithm 2 on training sets of different size
Figure 5.26: Time Performance
time taken by SVMs to train and test feature partitions of different size, on our sample dataset. Note
that the total time taken by the feature reduction step includes not only the time to train the individual
SVMs but also the time to partition the feature subspace. In our experiments with handwritten digit
data, we found that the time to train SVMs is the predominant factor in determining the overall time.
However, the total time taken by Algorithm 2 may exceed that taken by an SVM on the complete feature set for some applications, such as text classification, where the relative importance of features may not be well understood. Nevertheless, as clearly indicated in Figure 5.26, huge savings in time are obtained using Algorithm 2 in the context of handwritten digit recognition. The extension of the
method outlined in Algorithm 2, to applications other than handwritten digit recognition, seems an
interesting area for further research.
Figure 5.27 shows the time taken by selected feature subsets on smaller training sets. To give an
indication of the improvement in computational cost, the time is shown relative to an SVM trained on
Figure 5.27: (Sample Dataset) Relative time taken by Algorithm 2 for training sets of different size
Figure 5.28: (MNIST ) Relative time taken by Algorithm 2 for training sets of different size
the entire training set and using the original feature set. Clearly, an improvement of up to an order of magnitude is observed.
Similar results were obtained for MNIST (Figure 5.28), CEDAR (Figure 5.29), and USPS (Figure
5.30) datasets.
We also computed the overall time taken by Algorithm 2 to obtain smaller feature sets (see Figures
5.31, 5.32, 5.33, and 5.34). Our results clearly indicate that Algorithm 2 provides a promising approach
to reducing the feature sets without incurring significant computational overheads.
5.7 Conclusion
We introduced the concept of the α-MFC in the form of a feature reduction step and proved the corresponding problem to be NP-Hard. We then proposed an algorithm (Algorithm 1) to show how partitions of the original feature set could be trained and tested individually, and then combined, to obtain high accuracy using FS-SVMs.
Figure 5.29: (CEDAR) Relative time taken by Algorithm 2 for training sets of different size
Figure 5.30: (USPS ) Relative time taken by Algorithm 2 for training sets of different size
Figure 5.31: (Sample Dataset) Total relative time taken by Algorithm 2 for training sets of differentsize
Figure 5.32: (MNIST ) Total relative time taken by Algorithm 2 for training sets of different size
Figure 5.33: (CEDAR) Total relative time taken by Algorithm 2 for training sets of different size
Figure 5.34: (USPS ) Total relative time taken by Algorithm 2 for training sets of different size
We also proposed an approximate α-MFC greedy algorithm (Algorithm 2) based on partitioning of the
feature space that was found to result in high classification accuracy on the experimental handwritten
digit data even after elimination of a large fraction of the original feature set.
5.8 Future Work
We intend to analyze the effect of different weighting schemes on the accuracy. As future work, we also look forward to combining various classifiers, such as the k-NNC, with FS-SVMs on real datasets, as suggested in this chapter. The complexity of Algorithm 2 depends on the quality of the partitions of the features. It would be interesting to assess the variance in the final results due to different partitions. This work dealt primarily with handwritten digit recognition. The extension of the ideas presented herein to other high dimensional pattern recognition applications would be another future direction.
Bibliography
[1] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. on
PAMI, 22, pp. 4–37, 2000.
[2] P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical program-
ming. INFORMS J. Comput., 10, pp. 209–217, 1998.
[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support
vector machines. Proceedings of 13th International Conference on Machine Learning (ICML), pp.
82–90, San Francisco, CA, 1998.
[4] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection
for SVMs. In: S. A. Solla, T. K. Leen, and K. R. Muller (eds.), Advances in Neural Information
Processing Systems, 13 , MIT Press, MA, Cambridge, 2001.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using
support vector machines. Machine Learning, 46(1-3), pp. 389–422, 2002.
[6] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging,
boosting, and variants. Machine Learning, 36(1-2), pp. 105–139, 1999.
[7] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of
decision trees: Bagging, boosting, and randomization. Machine Learning, 40, pp. 139–158, 2000.
[8] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. on PAMI, 12, pp. 993–1001,
1990.
[9] S. Guha, A. Meyerson, N. Mishra, and R. Motwani. Clustering Data Streams: Theory and Practice.
IEEE Trans. on TKDE, 15(3), pp. 515–528, 2003.
[10] G. Valentini and T. G. Dietterich. Bias-Variance Analysis of Support Vector Machines for the
Development of SVM-Based Ensemble Methods. Journal of Machine Learning Research (JMLR),
5 , pp. 725–775, 2004.
[11] H. Nemmour and Y. Chibani. Multi-Class SVMs Based on Fuzzy Integral Mixture for Handwritten
Digit Recognition. Proceedings of the Geometric Modeling and Imaging: New Trends (GMAI), pp.
145–149, 2006.
[12] http://archive.ics.uci.edu/ml/machine-learning-databases/iris/.
[13] T. Joachims. Training Linear SVMs in Linear Time. Proceedings of the ACM Conference on
Knowledge Discovery and Data Mining (KDD), Philadelphia, Pennsylvania, 2006.
[14] H. Yu, J. Yang, and J. Han. Classifying Large Data Sets Using SVMs with Hierarchical
Clusters. Proceedings of the 9th ACM SIGKDD, Washington, 2003.
[15] C. W. Hsu and C. J. Lin. BSVM 2.06, prepared by R. E. Fan, released in 2006.
[16] C. J. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and
Knowledge Discovery, 2(2), pp. 121–167, 1998.
[17] A. C. Tan and D. Gilbert. Ensemble machine learning on gene expression data for cancer classifi-
cation. Applied Bioinformatics, 2(3), pp. 75–83, 2003.
[18] P. M. Long and V. B. Vega. Boosting and microarray data. Machine Learning, 52, pp. 31–44,
2003.
[19] L. Lam and C. Y. Suen. Application of majority voting to pattern recognition: an analysis of its
behavior and performance. IEEE Trans. Systems, Man, and Cybernetics, Part A, 27, pp. 553–568,
1997.
[20] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees,
Monterey, CA: Wadsworth and Brooks, 1984.
[21] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second
Edition, MIT Press, Cambridge, 2001.
[22] http://www.kernel-machines.org/data.html.
[23] http://yann.lecun.com/exdb/mnist.
[24] http://www.cedar.buffalo.edu/Databases/index.html.
[25] D. Gorgevik and D. Cakmakov. An Efficient Three-Stage Classifier for Handwritten Digit Recog-
nition. Proceedings of the 16th International Conference on Pattern Recognition (ICPR), 4, pp.
507–510, Cambridge, UK, 2004.
[26] C. W. Hsu and C. J. Lin. A comparison of methods for multi-class support vector machines. IEEE
Trans. Neural Networks, 13(2), pp. 415–425, 2002.
[27] K. Duan and S. S. Keerthi. Which is the best multi-class SVM method? An empirical study.
Multiple Classifier Systems, pp. 278–285, 2005.
[28] D. Anguita, S. Ridella, and D. Sterpi. A new method for multi-class support vector machines.
Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), pp. 412–
417, 2004.
[29] J. C. Platt, N. Cristianini, and J. S. Taylor. Large margin DAGs for multi-class classification. Ad-
vances in Neural Information Processing Systems (NIPS), 12, pp. 547–553, MIT Press, Cambridge,
2000.
[30] Y. Liu, Z. You, and L. Cao. A novel and quick SVM-based multi-class classifier. Pattern Recogni-
tion, 39(11), pp. 2258–2264, 2006.
[31] Y. Grandvalet and S. Canu. Adaptive Scaling for Feature Selection in SVMs. Advances in Neural
Information Processing Systems (NIPS), 15, pp. 553–560, 2003.
[32] O. L. Mangasarian and E. W. Wild. Feature Selection for Nonlinear Kernel Support Vector Ma-
chines. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM) Work-
shops, Omaha NE, 2007.
[33] V. Devendran, H. Thiagarajan, A. K. Santra, and A. Wahi. Feature Selection for Scene Catego-
rization using Support Vector Machines. Proceedings of the 2008 Congress on Image and Signal
Processing, 1, pp. 588–592, Washington, 2008.
[34] J. A. K. Suykens, T. V. Gestel, J. Vandewalle, and B. D. Moor. A Support Vector Machine For-
mulation to PCA Analysis and its Kernel Version. IEEE Transactions on Neural Networks, 14(2),
pp. 447–450, 2003.
[35] L. Hermes and J. M. Buhmann. Feature Selection for Support Vector Machines. International
Conference on Pattern Recognition (ICPR), 2, pp. 712–715, 2000.
Chapter 6
SHARPC: SHApley Value based
Robust Pattern Clustering
6.1 Introduction
Clustering or unsupervised classification of patterns into groups based on similarity is a very well studied
problem in pattern recognition, data mining, information retrieval, and related disciplines. Clustering
finds numerous direct practical applications in pattern analysis, decision making, and machine learning
tasks such as image segmentation. Besides, clustering has also been used in solving extremely large scale
problems, e.g. in bioinformatics ([6], [7]) and graph theory ([8]). Clustering also acts as a precursor to many data processing tasks including classification (Jain, Murty and Flynn [1]). According to Backer and Jain [2], “in cluster analysis a group of objects is split into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create ‘interesting’ clusters), such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups”.
Similar views are echoed in other works on clustering, e.g. Jain and Dubes [3], Hansen and Jaumard
[4], and Xu and Wunsch [5].
The machine learning and pattern recognition literature abounds with clustering algorithms. Techniques such as ISODATA [10], the Genetic k-means Algorithm (GKA) [11], and Partitioning Around
Medoids (PAM) [12] are based on vector quantization. The density estimation based models such
as Gaussian Mixture Density Decomposition (GMDD) [13], information theory based models such as
entropy maximization [14], graph theory based models such as Delaunay Triangulation graph (DTG)
[15], combinatorial search based models such as Genetically Guided Algorithm (GGA) [16], fuzzy models
such as Fuzzy c-means (FCM) [17], neural networks based models such as Self-Organizing Map (SOM)
[18], kernel based models such as Support Vector Clustering (SVC) [19], and data visualization based
models such as Principal Component Analysis (PCA) [20] have received considerable attention from
the research community. In this paper, we propose a clustering algorithm, SHARPC, based on the
celebrated game theoretic concept of the Shapley value. There are a number of solution concepts, such as the Shapley value, the core, bargaining sets, and the nucleolus, for analyzing cooperative games [26]. The Shapley value is a fair solution concept in that it divides the collective or total value of the game among the players according to their marginal contributions in achieving that collective value. We strive to make the best use of this notion of fairness for efficient clustering. To the best of our knowledge, SHARPC is the first
approach based on cooperative game theory to the clustering problem.
A key problem in the clustering domain concerns determining a suitable number k of output clusters
when k is not input as a parameter to the clustering algorithm. Dubes has described this as “the fundamental problem of cluster validity” [9]. It is often impractical to presume the availability of a domain expert to select the number of clusters. A number of techniques and heuristics, such as the elbow criterion [21], a regularization framework based on the Bayesian Information Criterion (BIC) [22], the L-method [23], the Minimum Description Length (MDL) framework [24], and G-means [25], have been proposed to tackle this problem. Our algorithm SHARPC obviates the need for specifying the number of clusters as
an input.
In his work on a unification of clustering [40], Kleinberg considered three desirable properties, namely scale invariance, richness, and consistency, and proved an impossibility theorem showing that no clustering algorithm satisfies all of these properties simultaneously. In this paper, we introduce order independence
as another desirable property, and provide necessary and sufficient conditions for order independence.
SHARPC satisfies the scale invariance, the richness, and the order independence conditions. Addition-
ally, the SHARPC approach can be generalized to obtain hierarchical clusters.
The Leader algorithm [28] is a prototype incremental algorithm that dynamically assigns each in-
coming point to the nearest cluster. However, the Leader algorithm is highly susceptible to ordering
effects and may give extremely poor quality of clustering on skewed data orders. For a known value of k, the k-means algorithm and its variants ([29, 30]), which are based on vector quantization, are the most popular clustering algorithms. However, k-means has primarily two drawbacks: (a) it may give poor results for an
inappropriate choice of k, and (b) it may not converge to a globally optimal solution due to inappro-
priate initial selection of cluster centers [33]. Our experimental results strongly suggest that SHARPC
outperforms both the Leader and the k-means algorithms in terms of the quality of clustering.
6.1.1 Motivation
Clustering is the assignment of data points to subsets such that points within a subset are more similar
to each other than points from other subsets. We believe that most of the existing popular algorithms
do not truly reflect the intrinsic notion of a cluster, since they try to minimize the distance of every
point from its closest cluster representative alone, while overlooking the importance of other points in
the same cluster. Although this approach succeeds in optimizing the average distance between a point
and its closest cluster center, it conspicuously fails to capture what has been described by Michalski
and others as the “context-sensitive” information: clustering should be done not just on the basis of
distance between a pair of points, A and B, but also on the relationship of A and B to other data points
([46, 47]). Therefore, there is a need for an algorithm that keeps both the point-to-center and the point-to-point distances within a cluster to a minimum. We emphasize that
incorporating the gestalt or collective behavior of points within the same cluster is fundamental to the
very notion of clustering, and this provides the motivation for our work.
In addition, it is more intuitive to characterize similarity between different points as compared to
the distance between them since the distance measure may not necessarily be scale invariant. Moreover,
from an application point of view, it is more convenient to specify a similarity threshold parameter in an
identical range, compared to a distance threshold that may vary across domains. Further, as described
later, there are certain axioms that are directly relevant in the context of clustering. The Shapley
value is an important solution concept, from cooperative game theory, that satisfies these axioms and
thereby characterizes the notion of fairness in clustering. We strive to incorporate this idea of fairness
for efficient clustering.
6.1.2 Contributions
In this paper, we make the following contributions:
• We formulate the problem of clustering as a cooperative game among the data points and show
that the underlying characteristic form game is convex.
• We propose a novel approach, SHARPC, for clustering the data points based on their Shapley
values and the convexity of the proposed game theoretic model. SHARPC determines an optimal
number of clusters and satisfies desirable clustering properties such as scale invariance and richness.
• We provide both the necessary and sufficient conditions for order independence and prove that
SHARPC is an order independent algorithm.
• We also extend the idea of clustering using Shapley value approach to obtain hierarchical clusters
with minimum bounded similarity guarantee.
• We demonstrate the efficacy of our approach through detailed experimentation. SHARPC is
compared with the popular k-means and Leader algorithms and the results are shown for several
benchmark datasets.
The outline of the paper is as follows. In Sect. 6.2, a succinct background encompassing important
concepts from cooperative game theory is documented. Our Shapley value based clustering paradigm
is presented in Sect. 6.3, along with Algorithm 1, a clustering algorithm based on exact computation
of Shapley value, and SHARPC, based on approximation of the Shapley value. The ordering effects
are characterized in Sect. 6.4. We present the generalization of our approach to hierarchical clustering
in Sect. 6.5. We provide a brief description on the applicability of the Leader, the k-means, and the
SHARPC algorithms with respect to certain desirable clustering properties in Sect. 6.6.1. An analysis
of the experimental results is carried out in Sect. 6.6.2. Finally, we present a summary of our work in
Sect. 6.7 and indicate the future work in Sect. 6.7.1.
6.2 Preliminaries
A cooperative game with transferable utility (TU) [26] is defined as the pair (N, v), where N = {1, 2, ..., n} is the set of players and v : 2^N → R is a mapping with v(∅) = 0. The mapping v is called the
characteristic function or the value function. Given any subset S of N , v(S) is often called the value
or the worth of the coalition S and represents the total transferable utility that can be achieved by the
players in S, without help from the players in N \ S. The set of players N is called the grand coalition
and v(N) is called the value of the grand coalition. In the sequel, we use the phrases cooperative game,
coalitional game, and TU game interchangeably.
A cooperative game can be analyzed using a solution concept, which provides a method of dividing the
total value of the game among individual players. We describe below two important solution concepts,
namely the core and the Shapley value.
6.2.1 The Core
A payoff allocation x = (x1, x2, ..., xn) denotes a vector in Rn with xi representing the utility of player
i where i ∈ N . The allocation x is said to be individually rational if xi ≥ v({i}), ∀i ∈ N . The payoff
allocation x is said to be coalitionally rational if Σ_{i∈C} xi ≥ v(C), ∀C ⊆ N. Note that coalitional
rationality implies individual rationality. Finally, the payoff allocation x is said to be collectively rational
if Σ_{i∈N} xi = v(N). The core of a TU game (N, v) is the collection of all payoff allocations that are
coalitionally rational and collectively rational. It can be shown that every payoff allocation lying in the
core of a game (N, v) is stable in the sense that no player will benefit by unilaterally deviating from a
given payoff allocation in the core. The elements of the core are therefore potential payoff allocations
that could result when rational players interact and negotiate among themselves. A limitation of the
concept of the core is that given a coalitional game, the core may be empty or very large.
6.2.2 The Shapley Value
The Shapley value is a solution concept that provides a unique expected payoff allocation for a given
coalitional game (N, v). It describes an effective approach to the fair allocation of gains obtained by
cooperation among the players of a cooperative game. Since some players may contribute more to the
total value than others, an important requirement is to distribute the gains fairly among the players.
The concept of Shapley value, which was developed axiomatically by Lloyd Shapley, takes into account
the relative importance of each player to the game in deciding the payoff to be allocated to the players.
We denote by
φ(N, v) = (φ1(N, v), φ2(N, v), . . . , φn(N, v))
the Shapley value of the TU game (N, v). Mathematically, the Shapley value, φi(N, v), of a player i, ∀v ∈ R^(2^n − 1), is given by
φi(N, v) = Σ_{C ⊆ N−i} [ |C|! (n − |C| − 1)! / n! ] {v(C ∪ {i}) − v(C)}
where φi(N, v) is the expected payoff to player i and N − i denotes N \{i}. There are several equivalent
alternative formulations for the Shapley value.
The Shapley value is the unique mapping that satisfies three key properties: linearity, symmetry,
and the carrier property [26]. These three properties imply that the Shapley value provides a fair way of
distributing the gains of cooperation among all the players in the game. A natural way of interpreting
the Shapley value φi(N, v) of player i is in terms of the average marginal contribution that player i
makes to any coalition of N assuming that all the orderings are equally likely. Thus the Shapley value
takes into account all possible coalitional dynamics and negotiation scenarios among the players and
comes up with a single unique way of distributing the value v(N) of the grand coalition among all the
players. The Shapley value of a player accurately reflects the bargaining power of the player and the
marginal value the player brings to the game.
Now we describe an important class of cooperative games called the convex games.
6.2.3 Convex Games
A cooperative game (N, v) is a convex game [27] if
v(C) + v(D) ≤ v(C ∪D) + v(C ∩D), ∀C,D ⊆ N
Equivalently, a TU game (N, v) is said to be convex if for every player i, the marginal contribution of i
to larger coalitions is larger. In other words,
v(C ∪ {i}) − v(C) ≤ v(D ∪ {i}) − v(D), ∀C ⊆ D ⊆ N − {i}, i ∈ N
where the marginal contribution m(S, j) of player j in a coalition S is given by,
m(S, j) = v(S ∪ {j})− v(S), S ⊆ N, j ∈ N, j /∈ S.
A very important property is that if a TU game (N, v) is convex, then the core of the game is non-empty
and moreover, the Shapley value belongs to the core.
6.2.4 Shapley Value of Convex Games
It can be shown that the core of a convex game (N, v) is a convex polyhedron with a dimension of at most |N| − 1. Consider a permutation π of the players in the game. Then, for any of the |N|! possible permutations, the initial segments of the ordering are given by
Tπ,r = {i ∈ N : π(i) ≤ r}, r ∈ {1, ..., |N|}
where Tπ,0 = ∅ and Tπ,|N| = N. Note that π(i) refers to the position of player i in the permutation π. Now, to determine the corresponding extreme point of the core for a particular ordering π, we solve the equations
Σ_{i ∈ Tπ,r} x^π_i = v(Tπ,r), r ∈ {1, ..., |N|}.
The solution to these equations defines a payoff vector x^π with elements given by
x^π_i = v(Tπ,π(i)) − v(Tπ,π(i)−1), ∀i = 1, 2, ..., |N|.
In fact, the payoff vectors xπ precisely represent the extreme points of the core in convex games.
Moreover, it is known [27] that the Shapley value of a convex game is the center of gravity of the vectors x^π. Thus,
if Π is the set of all permutations of N , then the Shapley value of player i can be computed as
φi = (1 / |N|!) Σ_{π∈Π} x^π_i
This provides an efficient way of computing the Shapley value of a convex game and we use this fact
later in this paper.
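As a minimal sketch of the permutation formula above (feasible only for small |N|, since all |N|! orderings are enumerated; the dictionary-based interface is an assumption):

# Exact Shapley value by averaging marginal contributions over all orderings.
from itertools import permutations

def shapley_value(players, v):
    # v maps a frozenset of players to its worth, with v(frozenset()) = 0 (assumed interface)
    phi = {i: 0.0 for i in players}
    orderings = list(permutations(players))
    for pi in orderings:
        coalition = frozenset()
        for i in pi:
            phi[i] += v(coalition | {i}) - v(coalition)    # marginal contribution x_i^pi
            coalition = coalition | {i}
    return {i: phi[i] / len(orderings) for i in phi}       # average over all |N|! orderings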
6.3 Shapley Value based Clustering
A central idea of this paper is to map cluster formation to coalition formation in an appropriately defined
TU cooperative game.
6.3.1 The Model
Consider a dataset X = {x1, x2, ..., xn} of n input instances. We set up a cooperative game (N, v) among
the input data points in the following way. Given the dataset X, define a function, d : X×X → R+∪{0},
where d(xi, xj) ∀xi, xj ∈ X indicates the distance between xi and xj , with d(xi, xi) = 0; d can be any
distance metric such as the Euclidean distance, for instance, depending on the application domain. Let
f′ : R+ ∪ {0} → [0, 1) be a monotonically increasing dissimilarity function such that f′(0) = 0 and
f′(d(x, xi)) + f′(d(x, xj)) ≥ f′(d(x, xi) + d(x, xj)) (6.1)
Define a corresponding similarity mapping, f : R+ ∪ {0} → (0, 1], such that f(a) = 1 − f ′(a). In
this setting, each of the n points corresponds to a player in the game, so that |N| = n. The problem of clustering can be viewed as grouping together those points which are less dissimilar as given by f′, or, equivalently, more similar as indicated by f: each of the n points interacts with other points and tries to form a coalition or cluster with them, in order to maximize its value. Now, we assign v({xi}) = 0,
for all xi such that xi is not a member of any coalition. This is based on the intuition that any isolated
point should have the least value since it is not involved in any cluster. Two situations are accounted for
using this idea. First, all those points that have not been processed as yet are assigned an initial value
of 0. Second, after processing, if some point behaves as an outlier, then it can be discarded based on its
value as explained later. This motivates us to define, for a coalition T ,
v(T) = (1/2) Σ_{xi, xj ∈ T, xi ≠ xj} f(d(xi, xj))
In other words, v(T), the total value of a coalition T, is computed by taking the sum of similarities over all |T|(|T| − 1)/2 distinct pairs of points. We emphasize the relevance of defining the value function v(·) for
a coalition in this way. Our approach computes the total worth of a coalition as the sum of pairwise
similarities between the points. Note that this formulation elegantly captures the notion of clustering
in its purest form: points within a cluster are similar to each other. Henceforth, we shall use the terms
data points, patterns and players interchangeably. Moreover, the phrase cluster center shall convey the
same meaning as cluster representative.
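A minimal sketch of this set-up follows (the Euclidean metric and the particular choice f′(a) = a/(1 + a), hence f(a) = 1/(1 + a), are illustrative assumptions; any f′ satisfying (6.1) would do):

# Sketch of the coalition value: pairwise similarities summed over distinct pairs.
import numpy as np
from itertools import combinations

def f(a):
    # similarity derived from the dissimilarity f'(a) = a / (1 + a); f(0) = 1
    return 1.0 / (1.0 + a)

def coalition_value(T, X):
    # v(T): sum of similarities over all distinct pairs in coalition T (row indices into X)
    return sum(f(np.linalg.norm(X[i] - X[j])) for i, j in combinations(T, 2))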
The usage of Shapley value for clustering is justified by interpreting certain axioms in the following
way:
• Symmetry (Permutation invariance): Given a game (N, v) and a permutation π on N , we
have φ_{π(i)}(N, πv) = φi(N, v), ∀i ∈ N.
As a consequence of this property, the Shapley value remains the same even if the points are arbitrarily
renamed or reordered. This is extremely significant for achieving order independence, a desirable
clustering property, as explained later.
• Preservation of Carrier: Given any game (N, v) such that v(S ∪ {i}) = v(S), ∀S ⊆ N, we have
φi(N, v) = 0. This property implies that if a point does not contribute to the overall worth of a
cluster, then it does not receive any marginal contribution. Therefore, the outliers or the points
that are far off from the other data points do not derive any benefit by forming clusters with them.
• Additivity or Aggregation: For any two games, (N, v) and (N,w), we have
φi(N, v + w) = φi(N, v) + φi(N,w), where
(v + w)(S) = v(S) + w(S)
Additivity implies the linearity property: if the payoff function v is scaled by a real number α,
then the Shapley value is also scaled by the same factor. That is, φi(N,αv) = αφi(N, v). Linearity
is essential for achieving scale invariance with respect to the value function. Another important
consequence of additivity is that the overall marginal contribution of a point is just the sum of its
contributions in each of the games considered separately. In the context of clustering, for every
point i, the addition of a set of new points X ′ in the initial dataset X results in increasing the
marginal contribution of i with respect to X\{i} by an additional contribution incurred due to
X ′\{i}.
• Pareto Optimality: For any game (N, v), we have ∑_{i∈N} φi(N, v) = v(N). As an implication of
this property, the overall worth of the dataset is distributed entirely among the different data
points.
In fact, Shapley value is the only solution concept that satisfies all the aforesaid axioms simultaneously,
and hence provides an appropriate tool for tackling clustering.
6.3.2 An Algorithm for Clustering based on Shapley values
Algorithm 1 outlines our approach to clustering. Algorithm 1 takes as input a threshold parameter of
similarity, δ, in addition to the dataset to be clustered.
Algorithm 1.
Input: The dataset X = {x1, x2, ..., xn} to be clustered and a threshold parameter of similarity δ ∈ (0, 1].
Output: A set of cluster centers and the clusters.
1.1 for i = 1 to n, do
1.2 v({xi}) = 0;
1.3 d(xi, xi) = 0;
1.4 end for
1.5 for every ordering π of data points, do
1.6 for r = 1 to n, do
1.7 Tπ,r = ∅;
1.8 for i = 1 to n, do
1.9 if π(i) ≤ r
1.10 Tπ,r = Tπ,r ∪ {i};
1.11 end if
1.12 end for
1.13 end for
1.14 end for
1.15 for i = 1 to n, do
1.16 for j = i to n, do
1.17 compute f(d(xi, xj));
1.18 end for
1.19 end for
1.20 for i = 1 to n, do
1.21 φxi = 0;
1.22 for every ordering π of data points, do
1.23 x_i^π = v(T_{π,π(i)}) − v(T_{π,π(i)−1});
1.24 φxi = φxi + (1/n!) · x_i^π;
1.25 end for
1.26 end for
1.27 K = ∅;
1.28 sort the points in X, in non-increasing order, based on the Shapley values φx1 , φx2 , . . . , φxn ;
1.29 Q = X;
1.30 while Q 6= ∅, do
1.31 choose the point x ∈ Q with maximum Shapley value in Q, as a new cluster center;
1.32 K = K ∪ {x};
1.33 P = {xi ∈ Q : f(d(x, xi)) ≥ δ}
1.34 assign the points in P to the cluster with center x;
1.35 Q = Q \ P ;
1.36 end while
1.37 return K as the set of cluster centers;
First, the Shapley value of each player is computed. Then, the cluster centers or representatives
are chosen in the following way. We sort the points in the non-increasing order of their Shapley values.
Then, the algorithm chooses the point x with the current highest Shapley value, and assigns all those
points that are at least δ-similar to x, to the same cluster as x. The points, which have already been
clustered, do not play any further part in the clustering process. The point with the highest Shapley
value among all currently unclustered points is chosen as a new cluster center, and the entire process
is repeated iteratively. The algorithm returns a set of cluster representatives on termination. It can be
observed (see Theorem 8 in Sect. 6.3.3) that data points close to a cluster center also have almost the same Shapley value, since they have similar distances to the remaining points. Hence, they
should not be treated as new cluster centers themselves. Therefore, by tuning the parameter δ, we can
obtain a good clustering by assigning the points nearer to the center to the same cluster. Note that
Algorithm 1 can be easily modified to discard outliers by adding a step wherein all those clusters that
are assigned fewer points than a minimum predefined number are discarded.
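To illustrate the selection loop just described, here is a small Python sketch (our own rendering; the helper name and data layout are assumptions, and the Shapley values φ are taken as already computed, exactly or approximately):

import numpy as np

def select_centers(S, phi, delta):
    # S: n x n matrix of pairwise similarities f(d(x_i, x_j)); phi: length-n Shapley values;
    # delta: similarity threshold in (0, 1]. Returns the center indices and a label per
    # point (the index of the center it was assigned to).
    n = len(phi)
    labels = -np.ones(n, dtype=int)
    centers = []
    for i in np.argsort(-np.asarray(phi)):       # non-increasing order of Shapley value
        if labels[i] != -1:                      # already clustered: plays no further part
            continue
        centers.append(i)
        for j in range(n):
            if labels[j] == -1 and S[i, j] >= delta:
                labels[j] = i                    # at least delta-similar to the new center
    return centers, labels

Discarding outliers, as noted above, would then amount to dropping any center whose cluster ends up with fewer than a preset minimum number of points.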
We now prove that the general setting described in Sect. 6.3.1 corresponds to a convex game among
the data points.
6.3.3 Convexity of the Underlying Game
Theorem 7. Define the total value of an individual point xi as v({xi}) = 0 ∀ i ∈ {1, 2, ..., n}, and that of a coalition T of data points as
v(T) = (1/2) ∑_{xi,xj∈T, xi≠xj} f(d(xi, xj)),
where f is a similarity function. In this setting, the cooperative game (N, v) is a convex game.
Proof. Consider any two coalitions C and D, C ⊆ D ⊆ X \ {xp}, where xp ∈ X. Then, by definition,
v(D) − v(C) = (1/2) ∑_{xi,xj∈D, xi≠xj} f(d(xi, xj)) − (1/2) ∑_{xi,xj∈C, xi≠xj} f(d(xi, xj))
= (1/2) ∑_{xi,xj∈D\C, xi≠xj} f(d(xi, xj)) + ∑_{xi∈D\C, xj∈C} f(d(xi, xj))   (6.2)
Again,
v(C ∪ {xp}) = (1/2) ∑_{xi,xj∈C, xi≠xj} f(d(xi, xj)) + ∑_{xi∈C} f(d(xi, xp))
Also,
v(D ∪ {xp}) = (1/2) ∑_{xi,xj∈D, xi≠xj} f(d(xi, xj)) + ∑_{xi∈D} f(d(xi, xp))
Then,
v(D ∪ {xp}) − v(C ∪ {xp}) = (1/2) ∑_{xi,xj∈D, xi≠xj} f(d(xi, xj)) + ∑_{xi∈D} f(d(xi, xp)) − (1/2) ∑_{xi,xj∈C, xi≠xj} f(d(xi, xj)) − ∑_{xi∈C} f(d(xi, xp))
= (1/2) ∑_{xi,xj∈D\C, xi≠xj} f(d(xi, xj)) + ∑_{xi∈D\C, xj∈C} f(d(xi, xj)) + ∑_{xi∈D\C} f(d(xi, xp))
= v(D) − v(C) + ∑_{xi∈D\C} f(d(xi, xp))   (using (6.2))
≥ v(D) − v(C)   (since f : R+ ∪ {0} → (0, 1])
An important consequence of Theorem 7 is that the Shapley value belongs to the core. Therefore, we
can compute the Shapley value of each player as explained in Sect. 6.2.4. Further, as the next theorem
states, the points which are close to each other have almost same Shapley values.
Theorem 8. Any two points xi, xt, such that d(xi, xt) ≤ ε, where ε→ 0, in the convex game setting of
Sect. 6.3.1 have almost equal Shapley values.
Proof. As explained in Sect. 6.2.4, the Shapley value of a point xi is given by
φi = (1/n!) ∑_{π∈Π} x_i^π
= (1/n!) ∑_{π∈Π} [v(T_{π,π(i)}) − v(T_{π,π(i)−1})]
= (1/n!) ∑_{π∈Π} [ ∑_{π(p)≤π(i), π(q)<π(p)} f(d(xp, xq)) − ∑_{π(p)≤π(i)−1, π(q)<π(p)} f(d(xp, xq)) ]
= (1/n!) ∑_{π∈Π} ∑_{π(p)<π(i)} f(d(xi, xp))
= (1/n!) ∑_{π∈Π} ∑_{π(p)<π(i)} [1 − f′(d(xi, xp))]
= (1/n!) ∑_{π∈Π} [π(i) − 1] − (1/n!) ∑_{π∈Π} ∑_{π(p)<π(i)} f′(d(xi, xp))
The first term on the right evaluates to (n − 1)/2 over all permutations and is thus the same for every point. The second term can be written as
D(i) = (1/n!) ∑_{π∈Π} ∑_{π(p)<π(i)} f′(d(xi, xp))
= (1/n!) ∑_{π∈Π} ∑_{π(p)<π(i), d(xi,xp)≤ε} f′(d(xi, xp)) + (1/n!) ∑_{π∈Π} ∑_{π(p)<π(i), d(xi,xp)>ε} f′(d(xi, xp))
It follows immediately, using (6.1), that for xt, t ∈ {1, 2, . . . , n}, t ≠ i, such that d(xi, xt) ≤ ε → 0, we have f′(d(xi, xt)) → 0 and f′(d(xi, xp)) → f′(d(xt, xp)), thereby implying D(t) → D(i).
Note that Theorem 8 does not say anything about points that are far apart from each other. In
particular, it does not forbid points, away from each other, from having similar Shapley values; it only
implies that points close to each other tend to have almost same Shapley values.
6.3.4 SHARPC
The exact computation of Shapley values for n players, as in Algorithm 1, is computationally a hard
problem since it involves taking the average over all the n! permutation orderings. However, as mentioned
earlier, the Shapley value for a convex game is the center of gravity of the extreme points of the non-
empty core. Therefore, making use of Theorem 7, we can approximate the Shapley value by averaging
marginal contributions over only p random permutations, where p << n!. Then, the error resulting
from this approximation can be bounded according to the concentration result proved in the following
lemma.
Lemma 1. Let Φ(p) = (φ1(p), φ2(p), . . . , φn(p)) denote the empirical Shapley values, of n data points,
computed using p permutations. Then, for some constants ε, c, and c1, such that ε ≥ 0 and c, c1 > 0,
P(|Φ(p) − E(Φ(p))| ≥ ε) ≤ c1 e^{−c p ε²}
Proof. Define S = ∑_{i=1}^{p} Yi, where Y1, Y2, . . . , Yp denote p independent random permutations of length n, corresponding to p n-dimensional points randomly chosen from the boundary of a convex polyhedron. Clearly, S is a random variable. Now, applying Hoeffding's inequality, we can find constants c1, c2, and t, with 0 ≤ t ≤ pE(S) and c1, c2 > 0, such that
P(|S − E(S)| ≥ t) ≤ c1 e^{−c2 t² / (p E(S))}
⇒ P(|S − E(S)| ≥ pε) ≤ c1 e^{−c2 p ε² / E(S)}   (substituting t = pε)
⇒ P((1/p)|S − E(S)| ≥ ε) ≤ c1 e^{−c2 p ε² / E(S)}
⇒ P(|Φ(p) − E(Φ(p))| ≥ ε) ≤ c1 e^{−c2 p ε² / E(S)}
⇒ P(|Φ(p) − E(Φ(p))| ≥ ε) ≤ c1 e^{−c p ε²}   [since Φ(p) = S/p]
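For concreteness, the permutation-sampling estimator analysed in the lemma can be sketched as follows (a minimal Python illustration of ours; it assumes the pairwise similarity matrix has already been computed):

import numpy as np

def shapley_by_sampling(S, p, seed=None):
    # Estimate the Shapley values of the clustering game by averaging marginal
    # contributions over p random permutations instead of all n! orderings.
    # S: n x n matrix of pairwise similarities f(d(x_i, x_j)).
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    phi = np.zeros(n)
    for _ in range(p):
        perm = rng.permutation(n)
        for r in range(n):
            i = perm[r]
            # marginal contribution of x_i: its total similarity to the points placed before it
            phi[i] += S[i, perm[:r]].sum()
    return phi / p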
Lemma 1 provides a bound on deviation from the exact Shapley value. We want the probability of
error in estimation of Shapley value to be as small as possible. To ensure this, we develop the idea of order
independence in learning algorithms. In fact, we characterize a stronger notion of order independence
in incremental learning and subsequently prove that the Shapley value of points in our convex game
setting can be approximated to a high degree of accuracy using O(n²) computations. However,
we first propose an efficient algorithm, SHARPC (acronym for SHApley value based Robust Pattern
Clustering), based on the convexity of our game theoretic model (Theorem 7).
Algorithm 2. (SHARPC)
Input: The dataset X = {x1, x2, ..., xn} to be clustered and a threshold parameter of similarity δ ∈ (0, 1].
Output: A set of cluster centers and the clusters.
2.1 Find the pair-wise similarity between all points in the input dataset.
2.2 For each player xi in the input dataset X, compute the value φi = ∑_{xj∈X, j≠i} f(d(xi, xj)).
2.3 Arrange the points in non-increasing order of their φ-value, and assign them to clusters as in Algorithm 1.
2.4 Find the number of clusters, k, and their centers, resulting from Step 2.3.
2.5 Run the k-means algorithm, with the initial k centers set to the cluster centers that are obtained in Step 2.4.
Note that SHARPC is essentially an efficient realization of Algorithm 1: it performs an approximate computation of the Shapley values using only O(n²) similarity computations. In addition, Step 2.5 is incorporated to employ the cluster centers obtained in Step 2.4 for more efficient clustering, in the sense of minimizing the point-to-center distances.
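The following is a compact end-to-end sketch of SHARPC in Python/NumPy (our rendering; the similarity function matches the one used in Sect. 6.6.2, and a simple hand-written Lloyd loop stands in for the k-means step 2.5):

import numpy as np

def sharpc(X, delta, lloyd_iters=20):
    # X: (n, d) array of points; delta: similarity threshold in (0, 1].
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Step 2.1: pairwise distances
    S = 1.0 - D / (D.max() + 1.0)                               # f(d) = 1 - d/(d_max + 1)
    phi = S.sum(axis=1) - 1.0                                   # Step 2.2: phi_i = sum_{j != i} f(d_ij)

    # Steps 2.3-2.4: greedy center selection in non-increasing order of phi
    labels = -np.ones(n, dtype=int)
    centers = []
    for i in np.argsort(-phi):
        if labels[i] != -1:
            continue
        centers.append(i)
        labels[(labels == -1) & (S[i] >= delta)] = len(centers) - 1

    # Step 2.5: Lloyd (k-means) refinement seeded with the SHARPC centers
    C = X[centers].astype(float).copy()
    for _ in range(lloyd_iters):
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(C.shape[0]):
            if np.any(labels == k):
                C[k] = X[labels == k].mean(axis=0)
    return C, labels

Note that the number of clusters k emerges from the choice of δ rather than being supplied as an input.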
Analysis of Time Complexity
The computation of pairwise similarities, in Step 2.1 of SHARPC, can be done in n(n − 1)/2 = O(n²) time steps. The computation of φi, for player i, takes O(n) steps since a sum over (n − 1) similarity values (corresponding to the players in X\{i}) needs to be performed. Therefore, for n players, the complexity of Step 2.2 is O(n²). For the sake of analysis, let k clusters be obtained as a result of Step 2.3. Then, in an expected sense, O(n/k) points are assigned to each cluster. Therefore, on an average, O(k) passes need to be made for computing the points similar to each of the k cluster centers, and in each pass, O(n) similarity computations are required. Therefore, the complexity of Step 2.3 is bounded by O(nk). Step 2.4 requires O(k) time corresponding to the k cluster centers. Finally, Step 2.5 can be accomplished in O(nkl) time, l being the number of iterations till convergence. Since SHARPC determines suitable cluster centers, the k-means algorithm in Step 2.5 generally converges rapidly, and thus the similarity computation is the predominant factor in determining the total time. Therefore, the overall complexity of SHARPC is O(n²). Note that SHARPC takes more time than k-means, with a complexity of O(nkl′) (l′ being the number of iterations till convergence), and Leader, with a complexity of O(nk), for the same number of clusters k. However, the complexity of SHARPC can be greatly reduced by further approximating the Shapley value using only O(t) nearest neighbors, t << n, employing generic branch and bound techniques such as in [41], locality sensitive hashing based techniques such as in [42, 44], or application specific techniques such as in [43].
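As a simple illustration of this idea (a brute-force sketch of ours; the indexing structures of [41]-[44] would replace the full similarity matrix in practice), φi can be approximated by summing over only the t most similar points:

import numpy as np

def approx_phi_knn(S, t):
    # Approximate phi_i using only the t largest similarities of each point (t <= n - 1).
    # S: n x n similarity matrix with S[i, i] = 1.
    n = S.shape[0]
    phi = np.empty(n)
    for i in range(n):
        sims = np.delete(S[i], i)                        # drop the self-similarity
        phi[i] = np.partition(sims, len(sims) - t)[-t:].sum()
    return phi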
6.4 Order Independence of SHARPC
In this section, we show how SHARPC exploits the order independence property of the convex game
setting to estimate Shapley values to a high degree of accuracy. To set the stage, we characterize the
concept of order independence and provide the necessary and sufficient conditions for order indepen-
dence.
6.4.1 Characterizing Ordering Effects in Incremental Learners
In his celebrated work on unification of clustering [40], Kleinberg considered three properties: scale-
invariance, richness, and consistency and proved an impossibility result, showing that no clustering
algorithm satisfies all of these properties simultaneously. Order independence is another desirable fun-
damental property of clustering algorithms. In other words, we want the algorithms to produce the
same final clustering across different runs, irrespective of the sequence in which the input instances are
presented. We note that even though algorithms such as the Leader and the k-means can be shown
to satisfy some of the three properties: scale-invariance, richness, and consistency; they do not satisfy
order independence. In particular, the Leader algorithm is known to be susceptible to ordering effects.
On the other hand, the random selection of initial cluster centers precludes the k-means algorithm from
being truly order independent.
6.4.2 Order Independence of SHARPC
Next, we prove an important theorem, which highlights the order independence of SHARPC.
Theorem 9. SHARPC is order independent.
Proof. Let a dataset X = {x1, x2, . . . , xn} be provided as an input to SHARPC. For any permutation ordering on the input instances, π ∈ Π, we may define an abstraction on i points,
T_{π,i} = ∑_{π(p)≤π(i), π(q)<π(p)} f(d(xp, xq)),
and a function g such that
g(T_{π,i}, x_{i+1}) = T_{π,i} + ∑_{π(p)≤π(i)} f(d(x_{i+1}, xp)),
where x_{i+1} is the current input instance. Further, in the case of SHARPC,
g(T_{π,k}, x′_l) = ∑_{π(p)≤π(k), π(q)<π(p), p≠l, q≠l} f(d(xp, xq)), where 1 ≤ l ≤ k.
In order to prove order independence of SHARPC, we need to verify that X is a dynamically complete set with respect to T and g:
• g(T_{π,k}, x_{k+1}) = T_{π,k} + ∑_{π(p)≤π(k)} f(d(x_{k+1}, xp))
= ∑_{π(p)≤π(k), π(q)<π(p)} f(d(xp, xq)) + ∑_{π(p)≤π(k)} f(d(x_{k+1}, xp))
= ∑_{π(p)≤π(k+1), π(q)<π(p)} f(d(xp, xq))
= T_{π,k+1}
• g(T_{π,k}, x′_l) = ∑_{π(p)≤π(k), π(q)<π(p), p≠l, q≠l} f(d(xp, xq))
= f(d(x2, x1)) + ∑_{π(p)<3} f(d(x3, xp)) + . . . + ∑_{π(p)<π(l−1)} f(d(x_{l−1}, xp)) + ∑_{π(p)<π(l)} f(d(x_{l+1}, xp)) + . . . + ∑_{π(p)<π(k)} f(d(xk, xp))
= g(g(g(g(g(T_{π,0}, x1), x2), . . . , x_{l−1}), x_{l+1}), . . . , xk)
• g(g(T_{π,k}, x′_l), xl)
= ∑_{π(p)≤π(k), π(q)<π(p), p≠l, q≠l} f(d(xp, xq)) + ∑_{π(p)≤π(k), p≠l} f(d(xl, xp))
[Note that this step follows since the incoming data point, xl, arrives at the (k+1)th position, whereas the earlier instance is removed, as indicated by x′_l, and hence does not contribute to the sum of similarities.]
= ∑_{π(p)≤π(k), π(q)<π(p)} f(d(xp, xq))
= T_{π,k}
• g(g(T_{π,k}, xl), xm)
= ( ∑_{π(p)≤π(k), π(q)<π(p)} f(d(xp, xq)) + ∑_{π(p)≤π(k)} f(d(xl, xp)) ) + ∑_{π(p)≤π(k+1)} f(d(xm, xp))
= ( ∑_{π(p)≤π(k), π(q)<π(p)} f(d(xp, xq)) + ∑_{π(p)≤π(k)} f(d(xl, xp)) ) + ( ∑_{π(p)≤π(k)} f(d(xm, xp)) + f(d(xm, xl)) )
= ( ∑_{π(p)≤π(k), π(q)<π(p)} f(d(xp, xq)) + ∑_{π(p)≤π(k)} f(d(xm, xp)) ) + ( ∑_{π(p)≤π(k)} f(d(xl, xp)) + f(d(xl, xm)) )
[since f(d(xl, xm)) = f(d(xm, xl))]
= g(g(T_{π,k}, xm), xl)
Note that, since updating the knowledge structure in memory on the arrival of a new point requires computing its similarity to each of the previously seen points in the given sequence, SHARPC ceases to be incremental; nonetheless, SHARPC is an order independent algorithm. In the next theorem,
we prove that SHARPC estimates the Shapley value of points in the input dataset to an arbitrarily high
degree of accuracy.
Theorem 10. Let X = {x1, x2, . . . , xn} be an input dataset. Further, let Φ = (φ1, φ2, . . . , φn) denote the approximate Shapley values of the n data points, given by φi = ∑_{xj∈X, j≠i} f(d(xi, xj)). Then, for some constants ε, c, and c1, such that ε ≥ 0 and c, c1 > 0,
P(|Φ − E(Φ)| ≥ ε) ≤ c1 e^{−c (n−1)! ε²},
where E(Φ) denotes the vector of exact Shapley values.
Proof. Consider an arbitrary permutation, π, on X, where data point xi is fixed at the nth position. Clearly, there are (n − 1)! such permutations, corresponding to arrangements of the data points in X\{xi}. Then, the marginal contribution of xi in any such permutation is given by
x_i^π = v(T_{π,π(i)}) − v(T_{π,π(i)−1})
= v(T_{π,n}) − v(T_{π,n−1})   [since π(i) = n]
= ∑_{π(j)≤n, π(q)<π(j)} f(d(xj, xq)) − ∑_{π(j)≤n−1, π(q)<π(j)} f(d(xj, xq))
= ∑_{π(j)<n} f(d(xi, xj))
= ∑_{xj∈X, j≠i} f(d(xi, xj))
Now, using Theorem 9, all such permutations result in the same marginal contribution for xi. Thus, the Shapley value of xi, approximated using the (n − 1)! such permutations, is given by
φi = (1/(n − 1)!) ∑_{π∈Π, π(i)=n} ∑_{xj∈X, j≠i} f(d(xi, xj))
= (1/(n − 1)!) · (n − 1)! ∑_{xj∈X, j≠i} f(d(xi, xj))
= ∑_{xj∈X, j≠i} f(d(xi, xj))
Now, using Lemma 1, the empirical Shapley value differs from the exact Shapley value according to the following bound,
P(|Φ(p) − E(Φ(p))| ≥ ε) ≤ c1 e^{−c p ε²}
whereby, substituting p = (n − 1)!, we get
P(|Φ − E(Φ)| ≥ ε) ≤ c1 e^{−c (n−1)! ε²}
Theorem 10 essentially implies that we can obtain a highly accurate approximation of the Shapley
values by computing pairwise similarities between points in the dataset. Further, it also makes SHARPC
completely order independent since identical cluster centers (and identical clusters, subsequently) are
obtained using different runs of the algorithm. As mentioned, this is a highly desirable property, which
is conspicuously absent in the k-means and the Leader algorithms.
Moreover, since the restriction of a convex game to any finite subset of players is also convex, Theorem 10 provides an effective mechanism for further enhancing the computational efficiency of existing Shapley value approximation techniques such as the Multi-perturbation Shapley value analysis (MSA), by using only a single permutation on a sampled subset of players. MSA employs a Shapley value based approach to address the issue of defining and calculating the contributions of neural network elements from a dataset of multiple lesions [45]. Only a subset of the other elements is considered, across different permutations, for obtaining the marginal contribution of each player. Theorem 10 implies that, by using a convex formulation of the network elements, for N players only N permutations need be considered: for each player i, a single permutation (where i is placed at the last position and the other players are arranged arbitrarily in the remaining N − 1 positions) would suffice. This is an extremely significant result in advancing the state of the art, as regards the computational efficiency of estimating the Shapley value for large datasets across heterogeneous applications.
6.5 Hierarchical Clustering
In this section, we demonstrate another significant feature of our approach: the Shapley value framework can be extended to obtain hierarchical clusters. We start by showing that Shapley value based clustering ensures a bound on the extent of similarity among points in the same cluster for a suitable choice of d and f.
Lemma 2. The data points, which are assigned to the same cluster by Algorithm 1, are at least 2δ − 1
similar to each other, on an average, for a suitable choice of distance metric d, dissimilarity function
f′, and similarity function f (as defined in Sect. 6.3.1).
Proof. Consider any two points xi and xj, 1 ≤ i, j ≤ n, that are assigned to the same cluster C with center x. Then,
f(d(x, xi)) ≥ δ and f(d(x, xj)) ≥ δ
Using the above inequalities, we get
f(d(x, xi)) + f(d(x, xj)) ≥ 2δ
⇒ f′(d(x, xi)) + f′(d(x, xj)) ≤ 2 − 2δ   (6.3)
Now d, being a metric, satisfies the triangle inequality. Thus,
d(x, xi) + d(x, xj) ≥ d(xi, xj)
Now, by definition, f′ is a monotonically increasing function, and therefore we get the following inequality:
f′(d(x, xi) + d(x, xj)) ≥ f′(d(xi, xj))
Further, using (6.1), we get
f′(d(x, xi)) + f′(d(x, xj)) ≥ f′(d(x, xi) + d(x, xj))
Using (6.3), together with these inequalities, we get
f′(d(xi, xj)) ≤ 2 − 2δ ∀ xi, xj ∈ C
Now, since by definition f′(d(xi, xj)) + f(d(xi, xj)) = 1, therefore
f(d(xi, xj)) ≥ 2δ − 1 ∀ xi, xj ∈ C
⇒ fC ≥ 2δ − 1
where fC denotes the average or mean similarity between data points in C (corresponding to the |C|(|C| − 1)/2 distinct pairs of points).
Lemma 2 leads to the following important observations. The quality of clustering can be controlled
by varying the similarity parameter δ over the range (0, 1]. This is promising on several accounts,
(a) the number of clusters is not required as a pre-requisite unlike most clustering algorithms, (b) no
assumptions about the distribution of data are made, and (c) a (lower) bound on the intra-cluster
similarity is achieved. Note that the central idea in Shapley value based clustering is to obtain good
cluster centers, and this in conjunction with Lemma 2 ensures that a high quality of clustering is
obtained.
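For example (an illustrative calculation using the dissimilarity function employed later in Sect. 6.6.2): with f′(d) = d/(dmax + 1), choosing δ = 0.8 guarantees an average intra-cluster similarity of at least 2δ − 1 = 0.6 and hence, since f is affine in d, an average intra-cluster distance of at most 0.4 (dmax + 1).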
Now we discuss how the Shapley value approach can be extended to obtain hierarchical clustering.
Using Algorithm 1, we get clusters with a minimum average intra-cluster similarity of 2δ − 1, as indicated by Lemma 2.
Now, we consider only the cluster centers selected by Algorithm 1. We choose the center x with the
highest Shapley value, merge the clusters with centers that are at least δ′
similar to x, and make x
the cluster center of the single cluster thus obtained. Then, we choose the cluster center with the
highest Shapley value among the remaining centers and repeat the process. The idea can be similarly
propagated to levels up the hierarchy till a single cluster remains. In the next theorem, we prove the
minimum similarity bound for hierarchical clusters at any level.
Theorem 11. The data points assigned to the same cluster using Algorithm 1, at a level i in the
hierarchy are, on an average, at least 2δ − 1 similar to each other, where δ is the similarity threshold
parameter for level i− 1. This minimum average similarity is independent of δ′, the threshold for level
i.
Proof. Let x′ and y′ be any two points in clusters represented by centers x and y at level (i − 1), respectively (see Figure 6.1). After decreasing the threshold from δ to δ′, x′ and y′ are assigned to the same cluster at level i. Without loss of generality, let x be the center of this cluster at level i.
Figure 6.1: Hierarchical Clustering
Consider the triangle yx′y′. Then, defining the functions f and f′ as in Sect. 6.3.1, we get, using the triangle inequality,
d(x′, y) + d(y, y′) ≥ d(x′, y′)
⇒ f′(d(x′, y) + d(y, y′)) ≥ f′(d(x′, y′))
Also,
f′(d(x′, y)) + f′(d(y, y′)) ≥ f′(d(x′, y) + d(y, y′))
⇒ f(d(y, y′)) ≤ 1 + f(d(x′, y′)) + f(d(x′, y))   (6.4)
But since y′ is assigned to the cluster with center y at level i − 1, therefore
f(d(y, y′)) ≥ δ   (6.5)
Using (6.4) and (6.5),
f(d(x′, y)) ≥ δ − 1 − f(d(x′, y′))   (6.6)
Similarly, considering the triangle xyx′, we get
f(d(x′, y)) ≤ 1 − f(d(x, y)) + f(d(x, x′))   (6.7)
Using (6.6) and (6.7),
f(d(x, y)) ≤ 2 + f(d(x, x′)) + f(d(x′, y′)) − δ   (6.8)
But since y is assigned to the cluster with center x at level i, therefore
f(d(x, y)) ≥ δ′
Using this inequality in conjunction with (6.8),
f(d(x′, y′)) ≥ δ + δ′ − 2 − f(d(x, x′))   (6.9)
Let gi : X × X → {0, 1} be a function indicating whether the two points in its arguments are assigned to the same cluster at level i or not: gi returns 1 if they are, and 0 otherwise. Consider the situation as highlighted in Figure 6.1. As mentioned earlier, x′ and y′ are assigned to different clusters at level (i − 1) but to the same cluster at level i. Then,
gi(x′, y′) = 1 and g_{i−1}(x′, y′) = 0
Now, the total similarity among all the points assigned to the same cluster Ci at level i is given by
(1/2) ∑_{xp,xq∈Ci, xp≠xq} f(d(xp, xq))
= (1/2) ∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=0, xp≠xq} f(d(xp, xq)) + (1/2) ∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=1, xp≠xq} f(d(xp, xq))
Let hi : X × X → {0, 1} be a function indicating whether the points being considered are assigned to the same cluster, with one point as the center, at level i or not; that is, hi(x, x′) = 1 if and only if x is a cluster center, at level i, with x′ as a data point in the same cluster.
Then, using (6.9),
∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=0, xp≠xq} f(d(xp, xq)) ≥ ∑_{x,xp∈Ci, h_{i−1}(x,xp)=1, hi(x,xp)=1} (δ + δ′ − 2 − f(d(x, xp)))
= 2 t^x_{i−1}(|Ci| − t^x_{i−1})(δ + δ′ − 2) − ∑_{x,xp∈Ci, h_{i−1}(x,xp)=1, hi(x,xp)=1} f(d(x, xp))
where t^x_{i−1} is the number of elements in the cluster, at level (i − 1), of which x is a member. Then,
∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=0, xp≠xq} f(d(xp, xq)) ≥ 2 t^x_{i−1}(|Ci| − t^x_{i−1})(δ + δ′ − 2 − 1)   (since f(d(x, xp)) ≤ 1 ∀ xp ∈ Ci)
= 2 t^x_{i−1}(|Ci| − t^x_{i−1})(δ + δ′ − 3)   (6.10)
Now, using Lemma 2,
∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=1, xp≠xq, v(xp,xq)≠1} f(d(xp, xq)) ≥ ∑_{xp∈Ci, z_{i−1}(xp)≠1} [t^{xp}_{i−1}(t^{xp}_{i−1} − 1)/2] (2δ − 1)   (6.11)
where v(xp, xq) = 1 if the unordered pair (xp, xq) has already been considered in the similarity computations, and z_{i−1}(xp) = 1 if the cluster assigned to xp, at level i − 1, has already been accounted for. Then,
∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=1, xp≠xq} f(d(xp, xq)) = 2 ∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=1, xp≠xq, v(xp,xq)≠1} f(d(xp, xq))
Using (6.11),
∑_{xp,xq∈Ci, g_{i−1}(xp,xq)=1, xp≠xq} f(d(xp, xq)) ≥ ∑_{xp∈Ci, z_{i−1}(xp)≠1} t^{xp}_{i−1}(t^{xp}_{i−1} − 1)(2δ − 1)   (6.12)
Then, using (6.10) and (6.12), the total similarity among all the points assigned to the same cluster Ci at level i is
(1/2) ∑_{xp,xq∈Ci, xp≠xq} f(d(xp, xq)) ≥ t^x_{i−1}(|Ci| − t^x_{i−1})(δ + δ′ − 3) + (1/2) ∑_{xp∈Ci, z_{i−1}(xp)≠1} t^{xp}_{i−1}(t^{xp}_{i−1} − 1)(2δ − 1)   (6.13)
= S (say)
Differentiating with respect to δ′, we get
∂S/∂δ′ = t^x_{i−1}(|Ci| − t^x_{i−1})
Then, ∂S/∂δ′ = 0 ⇒ t^x_{i−1} = 0 or t^x_{i−1} = |Ci| at the extremum. But t^x_{i−1} ≠ 0, since at least x itself belongs to its cluster at level i − 1. Therefore, the extremum is attained at t^x_{i−1} = |Ci|, which is intuitive since the dissimilarity is maximum when all the data points are assigned to the same cluster. The minimum value of S is then obtained from (6.13),
Smin = (1/2)|Ci|(|Ci| − 1)(2δ − 1)
Thus,
fHC ≥ 2δ − 1
where fHC is the mean similarity between data points in any hierarchical cluster at level i, where δ is the similarity threshold at level i − 1.
A similar result can be proved for SHARPC sans Step 2.5 (wherein the k-means algorithm is executed
to minimize the average point to closest center distance). An important implication of Theorem 11 is
that the minimum average similarity at any level in hierarchical clustering is achieved when all the data
points are assigned to the same cluster. Further, Theorem 11 suggests a lower bound on the extent of
similarity shared by patterns belonging to the same cluster. This is a significant result since the user
can input a suitable δ ∈ (0, 1] and obtain clusters, at any level, with a minimum average similarity
guarantee.
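The level-by-level merging of cluster centers described before Theorem 11 can be sketched as follows (our Python illustration; the base-level centers and the Shapley values are assumed to come from Algorithm 1 or SHARPC):

def merge_level(centers, phi, S, delta_prime):
    # One level of hierarchical merging over the current cluster centers.
    # centers: indices of current centers; phi: Shapley values of all points;
    # S: full pairwise similarity matrix; delta_prime: threshold for this level.
    remaining = sorted(centers, key=lambda c: -phi[c])
    survivors, merged = [], {}
    while remaining:
        x = remaining.pop(0)                                  # highest remaining Shapley value
        group = [c for c in remaining if S[x][c] >= delta_prime]
        merged[x] = [x] + group                               # clusters merged under center x
        survivors.append(x)
        remaining = [c for c in remaining if c not in group]
    return survivors, merged

Calling merge_level repeatedly with progressively smaller thresholds δ′ yields the hierarchy, terminating when a single center survives.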
6.6 Comparison of SHARPC with k-means and Leader
SHARPC can be viewed as an optimal unification of Leader and k-means, since the similarity threshold
can be viewed as an extension of the idea of the distance threshold in the Leader algorithm, whereas
the k-means algorithm is directly employed in Step 2.5. The initial selection of cluster centers, based on
Shapley value, provides the missing link required for excellent clustering. In this section, we compare
SHARPC with Leader, the prototype one-pass incremental algorithm, and k-means, the most popu-
lar representative of partition based clustering algorithms. We also provide experimental evidence to
substantiate the efficacy of our approach.
6.6.1 Satisfiability of desirable Clustering Properties
Now, to facilitate a better comprehension of the efficacy of different approaches, we mention the various
desirable properties of clustering and indicate the algorithms that satisfy these properties.
• Scale Invariance: The Leader algorithm does not satisfy scale invariance since it decides the
clusters based on a distance threshold, and thus incurs a fundamental distance scale. The k-
means algorithm satisfies scale invariance since it assigns clusters to points depending only on
their relative distances to the k cluster centers, irrespective of the absolute distances. SHARPC
consists of two phases. In the first phase, clustering is done based on the similarity values, which are again relative (e.g., consider a similarity function f(d(xi, xj)) = 1 − d(xi, xj)/dτ, where dτ > dmax, with dmax denoting the maximum distance between any two points in the dataset). In the second
phase, the k-means algorithm is used, which is scale invariant, as already mentioned. Thus,
SHARPC satisfies scale invariance.
• Richness: The Leader algorithm satisfies the richness property since we can always adjust the
distances among points to generate any desired partition of the input dataset. For example, one
of the ways to obtain a single cluster is to set all pairwise distances to some value less than
the distance threshold, whereas to have each point assigned to a separate cluster, every pairwise
distance may be set to some value greater than the distance threshold. The k-means algorithm
satisfies the richness condition only if the value of k can be adjusted according to the desired
partitions. However, since in general, k is a constant input provided to the k-means algorithm, we
may not partition the input dataset into any number of clusters other than k, and this precludes
the k-means algorithm from satisfying richness. Note that this restriction of k-means does not
apply to the SHARPC algorithm, since the number of clusters, k, is not provided as an input
and is determined based on the Shapley values of points and similarities among them. Hence,
SHARPC satisfies the richness property.
• Consistency: The Leader and the k-means algorithms do not satisfy the consistency requirement.
This follows directly from Kleinberg’s result in [40], which states that there does not exist any
centroid based clustering function that satisfies the consistency property. SHARPC also does not
satisfy consistency as a consequence of Kleinberg’s impossibility theorem that no algorithm may
satisfy scale invariance, richness, and consistency simultaneously.
We summarize the foregoing discussion in Table 6.1. It is easy to infer that SHARPC provides
an excellent approach to clustering, since it satisfies three of the four properties, and is optimal
in that no other clustering algorithm can perform better, as a consequence of the impossibility
theorem.
Table 6.1: Comparison between Leader, k-means, and SHARPC

Property              Leader   k-means   SHARPC
Scale Invariance      X        √         √
Richness              √        X         √
Consistency           X        X         X
Order Independence    X        X         √
6.6.2 Experimental Results
We carried out extensive experimentation to compare SHARPC with the Leader and the k-means
algorithms. For our experiments, we measured the quality of clustering of an algorithm in terms of the
following two parameters,
• α = (1/n) ∑_{xi∈X} ‖xi − xk‖², where xk is the representative of the cluster Ck ∈ C to which xi is assigned.
• β = (1/|C|) ∑_{Ck∈C} [ ∑_{xi,xj∈Ck} ‖xi − xj‖² / (|Ck|(|Ck| − 1)) ]
where C is the set of clusters to which xi ∈ X = {x1, x2, . . . , xn} is assigned. The potential, α, quantifies
the deviation of data points from the representative element while the scatter, β, captures the spread
among different elements assigned to the same cluster. Clearly, the lower the values of α and β, the
higher the quality of clustering. The potential, α, is a standard measure, but we also characterize quality
of clustering in terms of β, since it is closer to the basic notion of a cluster as a group of points more
similar to each other than points belonging to other clusters. Further, we choose the Euclidean distance as our distance metric d, and set the dissimilarity between any two data points xi and xj to f′(d(xi, xj)) = d(xi, xj)/(dmax + 1), where dmax denotes the maximum distance between any two points in the dataset.
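For reference, the two quality measures can be computed from a labelled clustering as follows (a small sketch of ours, written to match the definitions above with squared Euclidean distances):

import numpy as np

def potential_and_scatter(X, labels, centers):
    # alpha (potential): mean squared distance of each point to its cluster representative.
    # beta (scatter): mean, over clusters, of the average squared pairwise distance within
    # the cluster, using the |Ck|(|Ck| - 1) denominator from the definition of beta.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    alpha = float(np.mean(np.sum((X - centers[labels]) ** 2, axis=1)))
    betas = []
    for k in np.unique(labels):
        P = X[labels == k]
        m = len(P)
        if m < 2:
            continue
        D2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=2)
        betas.append(D2.sum() / (m * (m - 1)))
    beta = float(np.mean(betas)) if betas else 0.0
    return alpha, beta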
Table 6.2: Spam Dataset (4601 examples, 58 dimensions)

Algorithm   Clusters   Average α   Average β
Leader      10         153974      190783
k-means     10         36850       5619527
SHARPC      10         24368       190541
Leader      25         110673      170557
k-means     25         33281       2248452
SHARPC      25         6125        44113
Leader      35         86139       33248
k-means     35         32711       1607324
SHARPC      35         3380        20567
Leader      50         73901       23151
k-means     50         32626       1125174
SHARPC      50         1850        11154
Leader      100        65443       12598
k-means     100        14085       251347
SHARPC      100        809         2319
We conducted an experimental study on a number of real-world datasets. We provide the results
for Wine, Spam, Cloud and Intrusion (first 5000 points) datasets. These datasets are available as
archives at the UCI Machine Learning Repository [31, 32]. We implemented the code in Matlab without any optimizations. Moreover, we report the results averaged over 30 runs of each experiment to account for statistical variation. Further, since the Leader algorithm does not take δ as an input parameter,
we executed the code for Leader algorithm for different distance thresholds, across different orders,
and observed the number of clusters. Then, we modulated δ to obtain almost the same number of
clusters. Likewise, we varied δ for adjusting the SHARPC algorithm to the number of clusters used in
the k-means. Finally, we averaged the α and β values for a fixed number of clusters.
Table 6.3 shows the α and β values resulting from the Leader, the k-means and the SHARPC on
the Wine dataset. Clearly, SHARPC outperforms Leader by 2 to 3 orders of magnitude in terms of α.
Similarly, SHARPC improves upon Leader by an order of magnitude in terms of β. It readily follows that
SHARPC gives much better clustering than the Leader algorithm. Similarly, the comparison of SHARPC
with k-means reveals that even though k-means performs much better than the Leader algorithm, the
quality of clustering is the best in the SHARPC algorithm. In fact, the gap in the quality of clustering
becomes more prominent in the case of larger datasets, such as Spam (Table 6.2 and Fig. 6.2) and
Intrusion (Table 6.5, Fig. 6.3 and Fig. 6.4). Table 6.4 shows the comparison results on the Cloud
dataset. We observed from our experiments that in general, SHARPC not only outperforms k-means
and Leader algorithms in terms of β but also in terms of α, since SHARPC finds an optimal set of
cluster centers.
In our experiments, we observed that although SHARPC is of complexity O(n²), in practice, it takes
much less time since it tends to converge rapidly, due to an optimal selection of initial cluster centers.
Table 6.3: Wine Dataset (178 examples, 13 dimensions)

Algorithm   Clusters   Average α   Average β
Leader      5          187568      38772
k-means     5          5673        13576
SHARPC      5          5554        12181
Leader      10         172834      18904
k-means     10         2777        5203
SHARPC      10         1382        2590
Leader      15         159221      12316
k-means     15         925         2284
SHARPC      15         762         1726
Leader      20         146523      9037
k-means     20         618         1597
SHARPC      20         423         839
Table 6.4: Cloud Dataset (1024 examples, 10 dimensions)

Algorithm   Clusters   Average α   Average β
Leader      5          321983      84727
k-means     5          17391       80844
SHARPC      5          17293       80748
Leader      10         312737      39679
k-means     10         6579        39388
SHARPC      10         6485        21459
Leader      15         305732      25011
k-means     15         4955        27720
SHARPC      15         4056        8203
Leader      20         300436      17884
k-means     20         4297        21060
SHARPC      20         2892        6722
Table 6.5: Network Intrusion Dataset (5000 examples, 37 dimensions)

Algorithm   Clusters   Average α      Average β
Leader      10         4.1139e+09     4.4772e+07
k-means     10         1.4836e+06     1.3116e+08
SHARPC      10         7.28763e+05    3.5754e+07
Leader      25         4.1014e+09     5.8943e+06
k-means     25         8.9539e+05     3.1790e+07
SHARPC      25         5.2724e+04     9.7346e+05
Leader      35         4.0931e+09     4.2186e+05
k-means     35         2.8318e+05     1.9058e+07
SHARPC      35         2.0239e+04     2.7405e+05
Leader      50         4.0819e+09     3.1730e+05
k-means     50         2.2037e+05     1.3114e+07
SHARPC      50         8.8181e+03     7.8463e+04
Leader      100        1.7105e+07     3.1332e+05
k-means     100        1.8994e+05     6.5027e+06
SHARPC      100        1.7079e+03     1.7347e+04
Figure 6.2: β − C plot (Spam Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means
Figure 6.3: α− C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means
Figure 6.4: β − C plot (Intrusion Dataset): (a) SHARPC vs. Leader (b) SHARPC vs. k-means
Figure 6.5: Wine: Potential (α) does not vary much with permutations (p) for a fixed threshold (δ)
For instance, on the Intrusion dataset, for 100 clusters, SHARPC takes about 10 seconds compared
to 2.5 seconds by k-means. Further, evaluating SHARPC on the Wine dataset, we found that if the
similarity threshold (δ) is kept fixed, α does not increase significantly with a decline in the number of
permutations, p (Figure 6.5), thereby supporting the result in Lemma 1. Similar behavior was observed
with other datasets. Therefore, as already mentioned, there is a lot of scope to further improve the time
complexity of SHARPC by incorporating modifications to techniques like MSA.
We also conducted experiments on several other real datasets and found that SHARPC provided
much better clustering than Leader and k-means. However, due to space constraints, we are unable to
present our detailed analyses here.
6.7 Summary and Future Work
In this paper, we proposed a novel approach to clustering based on a cooperative game theoretic framework. First, we proposed an algorithm for clustering based on the exact computation of the Shapley value.
Then, an efficient approach, SHARPC, based on the convexity of our game theoretic model was put
forward. We also highlighted order independence as a desirable clustering property and provided both
the necessary and sufficient conditions for achieving order independence. In addition to being order
independent, SHARPC also satisfies scale invariance and richness, two other desirable clustering prop-
erties. We also showed how SHARPC can be readily generalized to obtain hierarchical clusters with
a minimum similarity bound. Our experiments on several standard datasets suggest that SHARPC
provides a significantly better clustering than the popular k-means and Leader algorithms.
6.7.1 Future Work
In this paper, we investigated the efficacy of SHARPC by conducting experiments using a particular
similarity function. It would be interesting to analyze the impact of different dissimilarity measures on
the overall quality of clustering. Further, as suggested in the paper, as a future work, existing techniques
may be employed to further reduce the complexity of the skeletal SHARPC algorithm. The extension
of ideas presented in this work to semi-supervised and supervised learning would be another interesting
direction. This paper may also be used as a reference for building on Kleinberg's work on the unification of clustering, by extending the impossibility result involving richness, scale invariance, and consistency to also include
order independence. The notion of order independence in incremental learners is extremely relevant in
the context of stream applications. Some important open problems, in this regard, emanating from our
work are:
• Can we come up with novel abstraction(s) to obtain efficient order independent incremental learn-
ers?
• Can some of the properties of a “Dynamic Set” be relaxed to achieve computationally efficient yet
effective weak incremental learning?
• Can some of the existing non-incremental techniques be made incremental by incorporating ap-
propriate abstractions?
It would also be worthwhile to investigate the applicability of other solution concepts from cooperative
game theory, such as the Nucleolus, to various problems in machine learning and pattern recognition.
Bibliography
[1] A. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys,
31(3), pp. 264–323, 1999.
[2] E. Backer and A. Jain. A clustering performance measure based on fuzzy set decomposition. IEEE
Transactions Pattern Analysis and Machine Intelligence (PAMI), 3(1), pp. 66–75, 1981.
[3] A. Jain and R. Dubes. Algorithms for Clustering Data. Englewood Cliffs, Prentice Hall, NJ, 1988.
[4] P. Hansen and B. Jaumard. Cluster analysis and mathematical programming. Math. Program., 79,
pp. 191–215, 1997.
[5] R. Xu and D. Wunsch II. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks,
16(3), pp. 645–678, 2005.
[6] O. Sasson, N. Linial, and M. Linial. The metric space of proteins–Comparative study of clustering
algorithms. Bioinformatics, 18, pp. s14–s21, 2002.
[7] W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size
of large protein databases. Bioinformatics, 17, pp. 282–283, 2001.
[8] S. Mulder and D. Wunsch. Million city travelling salesman problem solution by divide and conquer
clustering with adaptive resonance neural networks. Neural Net., 16, pp. 827–832, 2003.
[9] R. Dubes. Cluster analysis and related issue. Handbook of Pattern Recognition and Computer Vision,
C. Chen, L. Pau, and P. Wang, Eds., World Scientific, pp. 3–32, 1993.
[10] G. Ball and D. Hall. A clustering technique for summarizing multi-variate data. Behav. Sci., 12,
pp. 153–155, 1967.
[11] K. Krishna and M. N. Murty. Genetic K-means algorithm, IEEE Trans. Syst., Man, Cybern. B
(SMC-B), 29(3), pp. 433–439, 1999.
[12] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis,
Wiley, 1990.
[13] X. Zhuang, Y. Huang, K. Palaniappan, and Y. Zhao. Gaussian mixture density modeling, decom-
position, and applications. IEEE Trans. Image Process., 5(9), pp. 1293–1302, 1996.
[14] http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/JavaPaper.
[15] J. Cherng and M. Lo. A hypergraph based clustering algorithm for spatial data sets. Proc. IEEE
Int. Conf. Data Mining (ICDM), pp. 83–90, 2001.
[16] L. Hall, I. Ozyurt, and J. Bedzek. Clustering with a genetically optimized approach. IEEE Trans.
Evol. Comput., 3(2), pp. 103–112, 1999.
[17] F. Hoppner, F. Klawonn, and R. Kruse. Fuzzy Cluster Analysis: Methods for Classification, Data
Analysis, and Image Recognition, Wiley, New York, 1999.
[18] T. Kohonen. The self-organizing map. Proc. of the IEEE, 78(9), pp. 1464–1480, 1990.
[19] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Support vector clustering. J. Mach. Learn.
Res., 2, pp. 125–137, 2001.
[20] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal.
and Mach. Intell. (PAMI), 22(1), pp. 4–37, 2000.
[21] D. Ketchen and C.L. Shook. The application of cluster analysis in strategic management. Strategic
Management Journal, 17(6), pp. 441–458, 1996.
[22] D. Pelleg and A. Moore. X-means: Extending K-means with efficient estimation of the number
of clusters. Proc. of the 17th International Conference on Machine Learning (ICML), pp. 727–734,
2000.
[23] S. Salvador and P. Chan. Determining the Number of Clusters/Segments in Hierarchical Clus-
tering/Segmentation Algorithms. Proc. of the 16th IEEE International Conference on Tools with
Artificial Intelligence (ICTAI), pp. 576–584, 2004.
[24] H. Bischof, A. Leonardis, and A. Selb. MDL principle for robust vector quantisation. Pattern
Analysis and Applications, 2, pp. 59–72, 1999.
[25] G. Hamerly and C. Elkan. Learning the k in k-means. Proc. of the 17th International Conference
on Neural Information Processing Systems (NIPS), pp. 281–288, 2003.
[26] R. B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, 1997.
[27] L. S. Shapley. Cores of convex games. International Journal of Game Theory, 1(1), pp. 11–26, 1971.
[28] H. Spath. Cluster Analysis Algorithms for Data Reduction and Classification, Ellis Horwood, Chich-
ester, UK.
[29] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for
the k-Means problem. Symposium on Foundations of Computer Science (FOCS), pp. 165–176, 2006.
[30] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Symposium on
Discrete Algorithms (SODA), pp. 1027–1035, 2007.
[31] http://archive.ics.uci.edu/ml/datasets/.
[32] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[33] D. Arthur and S. Vassilvitskii. How Slow is the k-means Method?. Proc. of the Symposium on
Computational Geometry (SoCG), 2006.
[34] P. Langley. Order Effects in Incremental Learning. Learning in humans and machines: Towards an
interdisciplinary learning science, Elsevier, 1995.
[35] T. Mitchell. Generalization as Search. Artificial Intelligence, 18, pp. 203–226, 1982.
[36] A. Cornuejols. Getting Order Independence in Incremental Learning. Proceedings of the 1993 Eu-
ropean Conference on Machine Learning (ECML), pp. 196–212, Springer-Verlag, 1993.
[37] D. Fisher, L. Xu, and N. Zard. Ordering effects in clustering. Proceedings of the 9th International
Conference on Machine Learning (ICML), pp. 163–168, 1992.
[38] B. Shekar, M. N. Murty, and G. Krishna. Structural aspects of semantic-directed clusters. Pattern
Recognition, 22, pp. 65–74, 1989.
[39] I. N. Herstein. Topics in Algebra, John Wiley & Sons, Second Edition, 2006.
[40] J. Kleinberg. An Impossibility Theorem for Clustering. Proceedings of the Advances in Neural
Information Processing Systems, 15, pp. 463–470, 2002.
[41] B. Zhang and S.N. Srihari. Fast k-Nearest Neighbor Classification Using Cluster-Based Trees. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 26(4), pp. 525–528, 2004.
[42] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proceedings
of the 25th International Conference on Very Large Data Bases (VLDB), pp. 518–529, 1999.
[43] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using GPU. Proceedings
of the CVPR Workshop on Computer Vision on GPU, Alaska, 2008.
[44] P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using lo-
cality sensitive hashing. Proceedings of the 12th International Conference on Extending Database
Technology: Advances in Database Technology, pp. 744–755, 2009.
[45] A. Keinan, B. Sandbank, C. C. Hilgetag, I. Meilijson, and E. Ruppin. Fair Attribution of Functional
Contribution in Artificial and Biological Networks. J. Neural Comput., 16(9), pp. 1887–1915, 2004.
[46] R. Anderberg. Cluster Analysis for Applications, Academic Press, New York, 1973.
[47] R. S. Michalski. Knowledge Acquisition Through Conceptual Clustering: A Theoretical Framework
and an Algorithm for Partitioning Data into Conjunctive Concepts. Journal of Policy Analysis and
Information Systems, 4(3), pp. 219–244, 1980.
Chapter 7
A 2-Approximation Algorithm for
Optimal Disk Layout of Genome
Scale Suffix Trees
7.1 Introduction
The suffix tree is an immensely popular data structure for indexing colossal scale biological repositories
[1]. This is mainly due to its linear time and space complexity of construction in terms of the sequence
size, in addition to linear search complexity in terms of the pattern size [2, 3]. However this comes at the
cost of increased storage space requirements, with the standard implementations consuming an order of
magnitude more space than the indexed data. As such for most practical data mining applications, the
suffix tree needs to be disk-resident. To complicate the matter further, searching for a pattern requires
random traversal of suffix links connecting nodes across different pages that results in increased I/O
activity. Dispensing with suffix links not only affects the construction time [2, 3] but also renders
several search algorithms infeasible [4, 5, 6]. This is because sequence search algorithms typically involve
traversing both edges and suffix links. For instance, to find all maximal matching subsequences between
the sequence and the pattern, tree edges are used to walk down the tree matching the pattern sequence
along the way with the subsequent matches found by following the suffix links [7]. A lot of research has
gone subsequently into addressing this problem, primarily focusing on building efficient disk-resident
trees [8, 9, 10, 11, 18]. The objective of our work is to optimize the layout of suffix trees with regard to assigning tree nodes to disk pages, thereby improving the search efficiency. Layout strategies have been proposed in the literature for a variety of data structures [12, 13, 14, 15, 16].
However, there has been only one notable contribution in the field of suffix trees [17]. Therein a
layout strategy, Stellar, is experimentally shown to improve search performance on a representative set
of real genomic sequences. In our work, we do a theoretical analysis of the whole problem based on our
approach and give a bounded guarantee on the performance. For the rest of the chapter, we shall use memory and main memory interchangeably unless explicitly mentioned otherwise. Similarly, a search pattern and a query shall convey the same meaning.
7.2 Hardness of the Disk Layout Problem
Consider a suffix tree layout on the disk. Suppose we need to access a node x in the process of determining
a potential match of the pattern. There are two possible ways in which x could get into the main memory: (a) following an edge from its parent, or (b) following a suffix link at the node of the last mismatch. For a node present in the memory, accessing a memory-resident child or suffix child costs less than accessing one that is not resident, since the latter involves an I/O operation. A theoretically equivalent way of analyzing the same problem is to consider the cost of accessing a node as determined by the absence or presence of its parent or its suffix parent, where a suffix parent is defined as a node having a suffix link to the node in consideration. Put succinctly, the cost of accessing a node is lower when at least one of its parent or suffix parent is present in the memory than when neither is memory resident. Now there are two distinct possibilities: either at least one of the parent or suffix parent is present in the memory, or none is.
Let P1(x) and P2(x) denote the probabilities that the parent and a suffix parent of node x, respectively, are present in the memory. Further, let C1(x) and C2(x) be the costs of accessing x when at least one of its parent or suffix parent is present in the memory and when none is present, respectively. As pointed out
in the foregoing discussion, C2(x) ≥ C1(x) since C2(x) involves an additional I/O operation in order to
bring a parent or suffix parent of x into the memory. Now, for x,
Probability that the parent node is not present in the memory = 1 − P1(x)
Probability that a suffix parent is not present in the memory = 1 − P2(x)
⇒ Probability that at least the parent or a suffix parent is present = 1 − (1 − P1(x))(1 − P2(x))
Now the expected cost of accessing x is given by C(x), where
C(x) = [Probability that parent(x)/suffix parent(x) is present in memory] · C1(x) + [Probability that none of parent(x) and suffix parent(x) is present in memory] · C2(x)
⇒ C(x) = (1 − (1 − P1(x))(1 − P2(x))) · C1(x) + (1 − P1(x))(1 − P2(x)) · C2(x)
= C1(x) − (1 − P1(x))(1 − P2(x)) · C1(x) + (1 − P1(x))(1 − P2(x)) · C2(x)
= C1(x) + (C2(x) − C1(x))(1 − P1(x))(1 − P2(x))
= C1(x) + k(x)(C2(x) − C1(x)), where
k(x) = (1 − P1(x))(1 − P2(x))
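A trivial numerical rendering of this expected-cost expression (our notation; the probabilities are assumed to be supplied by an oracle, as discussed next):

def expected_access_cost(p1, p2, c1, c2):
    # p1, p2: probabilities that the parent and a suffix parent of the node are
    # memory-resident; c1: access cost when at least one of them is resident;
    # c2: cost when neither is (c2 >= c1, as it involves an extra I/O).
    k = (1.0 - p1) * (1.0 - p2)      # probability that neither is resident
    return c1 + k * (c2 - c1)

# Example: a cheap in-memory access versus an expensive disk I/O
print(expected_access_cost(p1=0.6, p2=0.3, c1=1.0, c2=100.0))   # 1 + 0.28 * 99 = 28.72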
An important point is in order. The probabilities defined herein may be extremely difficult or even impossible to calculate; nonetheless, for theoretical considerations, let these probabilities be provided by some oracle. To gain a better understanding of this probabilistic formulation, consider a few special cases, one by one.
Case 1: C1(x) = C2(x) ∀x
This is possible only when we have virtually infinite main memory so that the whole suffix tree is
memory-resident. Then the expected cost of access to node x is
C(x) = C1(x) + k(x)(C2(x) − C1(x))
= C1(x) + k(x)(C1(x) − C1(x))
= C1(x) = C1 (assuming a uniform memory access cost)
That is to say, if the whole suffix tree resides in the memory, then, assuming uniform memory access, the cost of accessing any particular node x would be a constant, independent of x and its parent/suffix parent. This is a rather uninteresting case since, for genome scale applications, the size of the suffix tree is much larger than the capacity of the main memory.
Case 2: P1(x) = P2(x) = 0
This case is applicable for the root of the suffix tree, which is the first node to be retrieved into the
memory from the disk. Then,
C(x) = C1(x) + (C2(x)− C1(x))(1− P1(x))(1− P2(x))
= C1(x) + (C2(x)− C1(x)) = C2(x)
That is, an I/O operation is required to get the root of the suffix tree into the main memory. This holds
since the root is the first node to be accessed in the process of finding a match, irrespective of the pattern
being searched for.
Case 3: P1(x) =1 or P2(x) = 1
This case applies when either the parent or a suffix parent of x is always present in the main memory
and rarely holds for genome scale suffix trees. It follows then,
C(x) = C1(x) + (C2(x)− C1(x))(1− P1(x))(1− P2(x))
= C1(x)
The cost of accessing x is C1(x) as expected.
7.2.1 The Q-Optimal Disk Layout Problem
Given a large scale suffix tree S and a (possibly infinite) set Q of patterns to be matched against S, the Q-Optimal Disk Layout (Q-OptDL) problem is to find an arrangement L of the nodes of S on disk such that the overall cost of accessing the nodes of S while matching the patterns in Q is minimized.
Theorem 12. The Q-OptDL problem is NP-Hard.
Proof. We show a reduction from the 0/1 Knapsack problem, a well-known NP-Complete problem. In the 0/1
Knapsack problem, there are n kinds of items, 1 through n. Each item j has a value Pj and a weight
Wj . The capacity of the knapsack is W. Mathematically, the 0-1 knapsack problem can be formulated
as:
maximize ∑_{j=1}^{n} Pj · Xj
subject to ∑_{j=1}^{n} Wj · Xj ≤ W, and
Xj ∈ {0, 1} ∀ j ∈ {1, 2, ..., n}
As shown earlier,
C(x) = C1(x) + k(x)(C2(x) − C1(x)), where k(x) = (1 − P1(x))(1 − P2(x))
By definition, an optimal layout minimizes the overall sum of costs over patterns in Q:
minimize ∑_Q ∑_x [C1(x) + k(x)(C2(x) − C1(x))]
Now, we relax the problem setting by assuming C1 and C2 to be the average memory and disk access costs respectively. Then, the objective function is given by
minimize ∑_Q ∑_x [C1 + k(x)(C2 − C1)]
⇒ minimize ∑_Q ∑_x [C1 + (1 − P1(x))(1 − P2(x))(C2 − C1)]
⇒ minimize ∑_Q ∑_x (1 − P1(x))(1 − P2(x))(C2 − C1)
Now, C2 ≥ C1. Therefore, we may as well
maximize ∑_Q ∑_x (C2 − C1) − ∑_Q ∑_x (1 − P1(x))(1 − P2(x))(C2 − C1)
⇒ maximize ∑_Q ∑_x [1 − (1 − P1(x))(1 − P2(x))](C2 − C1)
⇒ maximize ∑_Q ∑_x [1 − (1 − P1(x))(1 − P2(x))]
Let Xj be an indicator variable representing the access of a node xj. A node j lying on the disk that is not accessed does not contribute to the cost, thereby having its corresponding Xj set to 0. Finally, the objective function becomes
maximize ∑_Q ∑_{j=1}^{n} [1 − (1 − P1(xj))(1 − P2(xj))] · Xj
⇒ maximize ∑_Q ∑_{j=1}^{n} P(xj) · Xj
where
P(x) = 1 − (1 − P1(x))(1 − P2(x))   (7.1)
Let the capacity of the main memory be M. The reduction algorithm takes a knapsack of capacity M and a singleton set Q, and tries to put some l nodes, one by one, into the knapsack, out of a total of n potential candidates, based on the probability given by expression (7.1) and subject to the constraint that the combined size of these l nodes should not exceed M. If we can solve the Optimal Layout Problem, then we get a polynomial time solution to the 0/1 knapsack problem, which is impossible unless P = NP. Conversely, if we do get the l nodes in the knapsack in polynomial time, then we can write them on the disk in the same order, and proceeding in this manner would yield the optimal layout corresponding to Q. Thus, the Optimal Layout Problem is NP-Hard.
Now that we have deduced that it is computationally hard to find the optimal disk layout, we are faced with another problem: it may be very difficult to obtain the values of P1 and P2 for each node because of the irregular structure of suffix trees. Nonetheless, we get some useful insights into improving the layout:
1. We note from the foregoing discussion that there is a vast decline in cost if either the parent or
suffix parent of a node lies close to it. Equivalently, we can improve the layout by bringing a
node’s child and suffix child close to itself.
2. In genome databases, consecutive nucleotide bases occur in noticeably different proportions; for instance, the pair AC generally occurs with a different frequency than AT, AG, or AA. We can exploit this fact by bringing a node's more probable successors close to it.

3. Another possible approach is to record the patterns that have been searched over the database over a considerable period of time and to incorporate this knowledge to improve the layout from time to time. This works well when the database being queried is relatively static over a substantial period and the queries exhibit similarity, which is a common trend with biological databases. It is also useful when the same set of queries is searched repeatedly.
7.3 Improving the Disk Layout
Now we propose a post-construction algorithm that takes the root r of a disk-resident suffix tree and the
capacity B of the disk-page as its input. There is also a set Q of patterns to be matched. Q represents
some sort of prior knowledge about the search queries. In the absence of any such information, the
sequence corresponding to the suffix tree could be used to initialize the probability values. Basically,
the algorithm Approx. Q-OptDL assigns the child nodes of a node x to the same disk page as x in a
probabilistic fashion. The child representing a base with a higher proportion of co-occurrence with its
parent gets a higher probability of being assigned close to the parent. This is followed by accounting for
the suffix child. The process is recursively followed to get a new layout of the tree nodes. A BFS queue
is used as the primary data structure. When a new query comes, Q and the probability values in Q can
be updated incrementally. Algorithm 1 could be invoked periodically to improve the layout.
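To make the initialization step concrete, the following short Python sketch (ours, not from the thesis) estimates the relative proportion P^Q_rc of each base following a given base, either from the query set Q or, when no query history exists, from the indexed sequence itself. The function and variable names are hypothetical.

from collections import defaultdict

BASES = "ACGT"

def pair_proportions(strings):
    # Count, for every base `prev`, how often each base `nxt` follows it.
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    # Normalise the counts into relative proportions; fall back to the
    # uniform value 1/4 when a base was never observed as a predecessor.
    table = {}
    for prev in BASES:
        total = sum(counts[prev][b] for b in BASES)
        table[prev] = {b: (counts[prev][b] / total) if total else 1.0 / len(BASES)
                       for b in BASES}
    return table

# Example: with no query history, initialise from the indexed sequence itself.
proportions = pair_proportions(["ACGTACGGACGA"])
# proportions['A']['C'] then plays the role of P^Q_rc for the child labelled C
# of a node whose incoming edge ends in the base A.

This is only an approximation of P^Q_rc as defined in Algorithm 1 below, which is computed over the unmarked children of a specific node rather than globally.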
7.3.1 Algorithm 1
Approx. Q-OptDL(r, B, Q)
% r : Root of the subtree to be traversed %
% B : Capacity of the disk page in terms of number of nodes %
% Q : Set of patterns to be matched %
queue ← r
nodecount ← 0
while queue not empty, do
{
    r ← queue                        // remove from the queue
    if r not visited then
        mark r as visited and increment nodecount
    while there is an unmarked child c of r, do
    {
        P^Q_rc ← relative proportion of the base at c among all the unmarked base child nodes of r in Q
        if c not marked visited AND nodecount < B then
        {
            mark c as visited with probability P^Q_rc
            if c is marked visited then
            {
                increment nodecount
                queue ← c            // insert into the queue
                s ← suffix-link(c)
                if s not visited AND nodecount < B then
                {
                    mark s as visited
                    increment nodecount
                    queue ← s
                }
            }
        }
    }
    if nodecount ≥ B then
    {
        while queue not empty do
        {
            m ← queue
            Approx. Q-OptDL(m, B, Q)
        }
    }
}
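The pseudocode above can be transliterated almost line by line. The Python sketch below is ours and not from the thesis: the Node class, the proportions table (standing in for P^Q_rc, e.g., as produced by the earlier pair_proportions sketch), the page identifiers, and the handling of children that fail the coin toss (they are deferred and packed onto fresh pages once the current page is full) are hypothetical details added to make the sketch runnable; the thesis pseudocode leaves them implicit, and no tuning for genome-scale inputs is implied.

import random
from collections import deque
from itertools import count

class Node:
    # Minimal suffix-tree node: `label` is the first base on the incoming edge
    # (None at the root), `children` maps a base to a child node, and
    # `suffix_link` points to the suffix child, if any.
    def __init__(self, label=None):
        self.label = label
        self.children = {}
        self.suffix_link = None
        self.page = None          # disk page assigned by the layout

def approx_q_optdl(root, B, proportions, pages=None):
    # Greedy, probabilistic BFS packing of nodes into disk pages of B nodes each.
    if pages is None:
        pages = count()           # generator of fresh page identifiers
    page = next(pages)
    nodecount = 0
    queue = deque([root])
    deferred = []                 # children skipped by the coin toss or by a full page
    while queue:
        r = queue.popleft()
        if r.page is None:
            r.page = page
            nodecount += 1
        for base, child in r.children.items():
            if child.page is not None:
                continue
            # P^Q_rc: relative proportion of `base` following the base at r.
            p = proportions.get(r.label, {}).get(base, 0.25) if r.label else 0.25
            if nodecount < B and random.random() < p:
                child.page = page
                nodecount += 1
                queue.append(child)
                s = child.suffix_link
                if s is not None and s.page is None and nodecount < B:
                    s.page = page           # pull the suffix child onto the same page
                    nodecount += 1
                    queue.append(s)
            else:
                deferred.append(child)
        if nodecount >= B:
            break                           # this page is full
    # Start fresh pages for the remaining frontier and for the deferred children.
    for m in list(queue) + deferred:
        if m.page is None or m.children:
            approx_q_optdl(m, B, proportions, pages)

Writing the nodes to disk in increasing order of their assigned page numbers then realizes the layout; in practice one would iterate rather than recurse to avoid deep recursion on genome-scale trees.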
7.3.2 Performance Bound on Approx. Q-OptDL
Now we show that Algorithm 1 performs no worse than twice the optimal layout asymptotically. In the following discussion, cost refers to the I/O rate incurred due to limitations of the underlying disk layout.

Theorem 13. The suffix tree disk layout obtained using Approx. Q-OptDL (Algorithm 1) has an asymptotic performance within twice that of the optimal disk layout.
Proof. Let P_opt and P denote the costs associated with the optimal layout and the layout L obtained using Algorithm 1 respectively, over an infinite number of patterns. Further, let P_k(e) denote the cost of layout L while accessing Q, a set of k patterns. Then

P = \lim_{k→∞} P_k(e).

Now, when we access a particular node x together with its closest child node x′, an I/O operation may be required if the next base in the pattern being matched is not present in memory. The conditional cost incurred due to this mismatch is denoted P_k(e | x, x′). Suppose that during the matching process, at a particular selection step, the optimal layout chooses a node x with base θ (where θ ∈ {A, C, G, T} for human DNA), while the layout L produced by Algorithm 1 chooses a node x′_k with base θ′_k. Since the bases θ and θ′_k are conditionally independent given the nodes x and x′_k, we have

P(θ, θ′_k | x, x′_k) = P(θ | x) P(θ′_k | x′_k).

A mismatch between the two layouts, θ ≠ θ′_k, results in an I/O. The conditional cost of this mismatch is therefore

P_k(e | x, x′_k) = 1 − \sum_{i=1}^{m} P(θ = t_i, θ′_k = t_i | x, x′_k),

where m denotes the number of bases (m = 4 for DNA)

⇒ P_k(e | x, x′_k) = 1 − \sum_{i=1}^{m} P(t_i | x) P(t_i | x′_k).    (7.2)

We also note that if different sets of patterns were used instead of Q, different layouts would be chosen by Algorithm 1. So we consider an average layout, under which the conditional cost P(e | x) is given by

P(e | x) = \int P(e | x, x′_k) p(x′_k | x) dx′_k,    (7.3)

where p(x′_k | x) is the conditional density of x′_k given x. Using (7.2) and (7.3), and taking limits, we obtain

\lim_{k→∞} P_k(e | x) = \int [1 − \sum_{i=1}^{m} P(t_i | x) P(t_i | x′_k)] δ(x′_k − x) dx′_k
= 1 − \sum_{i=1}^{m} P^2(t_i | x).
Thereby, the asymptotic cost under layout L is given by

P = \lim_{k→∞} P_k(e)
⇒ P = \lim_{k→∞} \int P_k(e | x) p(x) dx
⇒ P = \int [1 − \sum_{i=1}^{m} P^2(t_i | x)] p(x) dx    (7.4)
⇒ P ≃ \int [1 − P^2(t_max | x)] p(x) dx,

where t_max refers to the base with the greatest probability, which Algorithm 1 accordingly places on the current disk page,

⇒ P ≃ \int [2(1 − P(t_max | x))] p(x) dx.

Now,

\sum_{i=1}^{m} P^2(t_i | x) = P^2(t_max | x) + \sum_{i ≠ max} P^2(t_i | x).

We seek to bound this sum by minimizing the second term subject to the following constraints:

• P(t_i | x) ≥ 0, and
• \sum_{i ≠ max} P(t_i | x) = 1 − P(t_max | x) = P_opt(e | x), since the optimal layout would tend to have the least probability of incurring an I/O. Also, \sum_{i=1}^{m} P^2(t_i | x) is minimized when all of the a posteriori conditional costs except the one pertaining to t_max are equal.

In the light of the foregoing discussion, this gives

P(t_i | x) = P_opt(e | x) / (m − 1) for i ≠ max, and P(t_max | x) = 1 − P_opt(e | x).

We arrive at the following inequalities:

• \sum_{i=1}^{m} P^2(t_i | x) ≥ (1 − P_opt(e | x))^2 + P_opt^2(e | x) / (m − 1), and
• 1 − \sum_{i=1}^{m} P^2(t_i | x) ≤ 2 P_opt(e | x) − (m / (m − 1)) P_opt^2(e | x).    (7.5)

Noting that the conditional variance Var[P_opt(e | x)] ≥ 0, we get

\int P_opt^2(e | x) p(x) dx ≥ P_opt^2.    (7.6)

Using (7.4), (7.5), and (7.6), we obtain the bound for the case where Q consists of an infinite number of patterns:

P_opt ≤ P ≤ P_opt (2 − (m / (m − 1)) P_opt).    (7.7)

Put into words, the Approx. Q-OptDL algorithm outputs a suffix tree disk layout that is guaranteed to perform no worse than twice the optimal disk layout asymptotically.
Corollary
The suffix tree disk layout on the human genome obtained using Approx. Q-OptDL satisfies the asymptotic upper bound P ≤ P_opt (2 − (4/3) P_opt).

Proof. The result follows immediately by substituting m = 4 in (7.7), since human DNA consists of the bases {A, C, G, T}.
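As a quick numerical illustration of the bound (the numbers are ours, not from the thesis): suppose the optimal layout incurs an I/O on 30% of node accesses, i.e., P_opt = 0.3. For human DNA (m = 4), the corollary gives

P ≤ 0.3 (2 − (4/3)(0.3)) = 0.3 × 1.6 = 0.48,

so the layout produced by Approx. Q-OptDL incurs an I/O on at most 48% of accesses asymptotically, well within the factor-of-two guarantee.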
7.4 Conclusion/Future Work
We discussed the concept of an optimal layout in the context of genome scale suffix trees. The Q-Optimal Disk Layout problem was shown to be NP-hard via a reduction from the 0/1 Knapsack problem. We then proposed the algorithm Approx. Q-OptDL to improve the layout of a disk-resident suffix tree; it produces a layout that is guaranteed to perform within twice the optimal layout asymptotically. This result is particularly relevant in view of the explosive growth in genomic data and, correspondingly, in the size of suffix trees. As future work, we intend to improve the layouts of indexes that are based on suffix trees but have different structures. It would also be interesting to determine whether the 2-approximation bound given in this chapter can be further improved.
Conclusion
In this thesis, some key challenges in clustering, classification, and dimensionality reduction have been identified and appropriate solutions have been suggested. We believe this work makes some fundamental contributions to these areas. An attempt has been made to tackle important problems such as characterizing incremental learners, which are bound to assume even greater importance with the growth of stream applications. A general goal of this work is to unify different data mining algorithms, especially in the context of clustering; for instance, we established order independence as a desirable property, much like Kleinberg's properties of scale invariance, richness, and consistency. On one hand, we have proposed faster solutions to existing problems; on the other, we have tried to emphasize the importance of incorporating the ideas fundamental to these problems in their most pristine essence. Even though there is considerable overlap between these contrasting approaches in places, they can be coarsely delineated: the FS-SVMs, the robust variants of the Leader algorithm, and the RACK algorithm belong to the former, whereas the SHARPC algorithm, the Image Feature Extractor technique, the improved suffix tree layout, and the EPIC algorithm belong to the latter. Wherever possible, we have tried to present the big picture by providing a framework for evaluating the different algorithms. We have also suggested some open problems and future directions. We hope this thesis fosters more interesting research in the years to come.