LATENT TREE MODELS: AN APPLICATION AND AN EXTENSION
by
KIN-MAN POON
A Thesis Submitted to The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in Computer Science
August 2012, Hong Kong
Copyright © by Kin-Man Poon 2012
Authorization
I hereby declare that I am the sole author of the thesis.
I authorize the Hong Kong University of Science and Technology to lend this thesis to
other institutions or individuals for the purpose of scholarly research.
I further authorize the Hong Kong University of Science and Technology to reproduce
the thesis by photocopying or by other means, in total or in part, at the request of other
institutions or individuals for the purpose of scholarly research.
KIN-MAN POON
LATENT TREE MODELS: AN APPLICATION AND AN EXTENSION
by
KIN-MAN POON
This is to certify that I have examined the above Ph.D. thesis
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by
the thesis examination committee have been made.
PROF. NEVIN L. ZHANG, THESIS SUPERVISOR
PROF. MOUNIR HAMDI, HEAD OF DEPARTMENT
Department of Computer Science and Engineering
1 August 2012
ACKNOWLEDGMENTS
I would like to take this opportunity to express my great gratitude to my supervisor,
Prof. Nevin L. Zhang, for his guidance, encouragement, and support throughout my PhD
study. I sincerely appreciate his help in suggesting research topics, revising my papers,
and preparing my presentations. This thesis would not have been completed without him.
I would like to thank my proposal and thesis examination committee members: Dr.
habil. Manfred Jaeger (from Department for Computer Science at Aalborg University),
Prof. Bing-Yi Jing (from Department of Mathematics), Prof. Dit-Yan Yeung, Prof. Brian
Mak, and Prof. James Kwok. I am grateful for their attendance and insightful comments.
Thanks are also due to my colleagues at HKUST for their encouragement, friendship,
and useful discussions. They include Tao Chen, Yi Wang, Tengfei Liu, and April Liu.
Finally, I am indebted to my parents and my wife for their love, patience, understand-
ing, and support. And I would like to include a quote here to express my gratefulness to
the invisible One:
“How can I repay the LORD
for all the great good done for me?”
— Psalm 116:12, NAB
TABLE OF CONTENTS
Title Page i
Authorization Page ii
Signature Page iii
Acknowledgments iv
Table of Contents v
List of Figures ix
List of Tables xi
Abstract xii
Chapter 1 Introduction 1
1.1 Clustering 1
1.1.1 Categories of Clustering Algorithms 2
1.1.2 Model-Based Clustering 3
1.1.3 Multidimensional Clustering 4
1.1.4 Spectral Clustering 5
1.2 Contributions 7
1.3 Organization 7
Chapter 2 Latent Variable Models 9
2.1 Measurement Models 10
2.2 Confirmatory and Exploratory Analyses 11
2.3 Mixture Models 11
2.4 Models with Multiple Latent Variables 12
2.4.1 Structural Equation Models 13
2.4.2 Multidimensional Measurement Models 13
2.4.3 Mixtures of Latent Variable Models 15
2.4.4 General Models 16
2.5 Latent Tree Models 16
2.6 Summary of Models 17
Chapter 3 Latent Tree Models 19
3.1 Notations 19
3.2 Bayesian Networks 19
3.3 Latent Tree Models 21
3.3.1 Model Scores 21
3.3.2 Model Equivalence 22
3.3.3 Root Walking 22
3.3.4 Regularity 23
3.4 Parameter Estimation 24
3.5 Learning Algorithms 25
3.5.1 Score-Based Methods 26
3.5.2 Constraint-Based Methods 33
3.5.3 Variable Clustering Methods 37
3.5.4 Comparison between Approaches 40
3.6 Applications 40
3.6.1 Multidimensional Clustering 41
3.6.2 Latent Structure Discovery 43
3.6.3 Density Estimation 43
3.6.4 Classification 44
3.6.5 Domains of Applications 46
Chapter 4 Application: Rounding in Spectral Clustering 48
4.1 Related Work 49
4.2 Basics of Spectral Clustering 50
4.2.1 Similarity Measure and Similarity Graph 50
4.2.2 Graph Laplacian 51
4.2.3 The Ideal Case 52
4.2.4 Spectral Clustering 53
4.2.5 Two Properties 54
4.3 A Naive Method for Rounding 55
4.3.1 Binarization of Eigenvectors 55
4.3.2 Rounding by Overlaying Partitions 56
4.3.3 Determining the Number of Eigenvectors to Use 57
4.4 Latent Class Models for Rounding 59
4.4.1 Known Number of Clusters 60
4.4.2 Unknown Number of Clusters 62
4.5 Latent Tree Models for Rounding 62
4.6 Empirical Evaluation on Synthetic Data 65
4.6.1 Performance in the Ideal Case 65
4.6.2 Graceful Degrading of Performance 66
4.6.3 Impact of an Assumption 68
4.6.4 Sensitivity Study 69
4.6.5 Running Time 70
4.7 Comparison with Alternative Methods 71
4.7.1 Synthetic Data 71
4.7.2 MNIST Digits Data 72
4.7.3 Image Segmentation 73
4.8 Conclusions 74
Chapter 5 Extension: Pouch Latent Tree Models 76
5.1 Pouch Latent Tree Models 76
5.2 Related Work 79
5.3 Inference 80
5.3.1 Clique Tree Propagation 80
5.3.2 Complexity 83
5.4 Parameter Estimation 83
5.5 Structure Learning 84
5.5.1 Search Operators 85
5.5.2 Search Phases 87
5.5.3 Operation Granularity 88
5.5.4 Efficient Model Evaluation 88
5.5.5 EAST-PLTM 89
5.6 Conclusions 90
Chapter 6 Variable Selection in Clustering 91
6.1 To Do or To Facilitate 91
6.2 Experimental Setup 93
6.2.1 Data Sets and Algorithms 93
6.2.2 Method of Comparison 94
6.3 Results 95
6.3.1 Synthetic Data 96
6.3.2 Image Data 98
6.3.3 Wine Data 100
6.3.4 WDBC Data 101
6.3.5 Discussions 102
Chapter 7 Multidimensional Clustering 104
7.1 Clustering Multifaceted Data 104
7.2 Related Work 106
7.3 Seasonal Statistics of NBA Players 107
7.4 PLTMs on NBA Data 108
7.4.1 Clusterings Obtained 108
7.4.2 Cluster Means 110
7.4.3 Relationships between Clusterings 111
7.5 Comparison with Other Methods 112
7.5.1 Multiple Independent GMMs by GS 112
7.5.2 Factor Analysis 113
7.5.3 LTM with Continuous Latent Variables 113
7.6 Discussions 115
Chapter 8 Conclusions 116
8.1 Summary of Work 116
8.2 Future Work 117
8.3 Possible Improvements 118
Appendix A List of Publications by the Author 134
LIST OF FIGURES
1.1 An example of a dendrogram. 2
1.2 Data points in the original feature space and the transformed eigenspace in spectral clustering. 6
2.1 Structures of latent class models, latent trait models, and latent profile models; and factor models. 10
2.2 Model structure of a finite mixture model. 12
2.3 Structures of the two types of multidimensional item response theory models. 14
2.4 A latent tree model as an extension to a latent class model. 16
3.1 An example of a Bayesian network. 20
3.2 An example of a latent tree model, root walking, and an unrooted model. 21
3.3 Examples of applying the node introduction, node deletion, and node relocation operators. 27
3.4 Four possible resulting structures of quartet test. 33
3.5 An example of information curves. 42
4.1 Examples of eigenvectors in spectral clustering. 53
4.2 Examples of binary vectors in spectral clustering. 56
4.3 Illustration of Naive-Rounding2. 58
4.4 Latent class model for rounding in spectral clustering. 59
4.5 Latent tree model for rounding. 63
4.6 Synthetic data set for the ideal case. 66
4.7 LTM-Rounding and ROT-Rounding on synthetic data for the non-ideal case. 67
4.8 Partitions obtained by LTM-Rounding1. 69
4.9 Sensitivity analysis on the parameters δ and K in LTM-Rounding. 70
4.10 Image segmentation results by LTM-Rounding and ROT-Rounding. 75
5.1 An example of PLTM. The numbers in parentheses show the cardinalities of the discrete variables. 77
5.2 Generative model for synthetic data. 77
5.3 A Gaussian mixture model as a special case of PLTM. 78
5.4 Examples of node introduction, node deletion, and node relocation in PLTMs. 86
5.5 Examples of pouching and unpouching in PLTMs. 87
6.1 Feature curves on synthetic data. 97
6.2 Structure of the PLTM learned from image data. 98
6.3 Feature curves on image data. 99
6.4 Structure of the PLTM learned from wine data. 100
6.5 Structure of the PLTM learned from wdbc data. 101
6.6 Feature curves on wdbc data. 102
7.1 Clustering multifaceted data. 105
7.2 PLTM obtained on NBA data. 109
7.3 Model obtained from the CLNJ method on NBA data. 114
LIST OF TABLES
2.1 Summary of latent variable models. 18
4.1 Performances of various rounding methods on synthetic data. 68
4.2 Comparison of LTM-Rounding and LTM-Rounding1. 69
4.3 Comparison of various rounding methods on MNIST digits data. 72
5.1 Discrete distributions in Example 1. 77
6.1 Descriptions of UCI data sets used in our experiments. 93
6.2 Clustering performances as measured by NMI. 96
6.3 Confusion matrix for PLTM on wdbc data. 101
7.1 Attributes on NBA data. 108
7.2 Attribute means conditional on the specified latent variables on NBA data. 109
7.3 Conditional distributions of Gen and Acc on NBA data. 111
7.4 Partition of attributes on NBA data by GS. 112
7.5 Results from significance tests of whether the factor models fit NBA data. 113
LATENT TREE MODELS: AN APPLICATION AND AN EXTENSION
by
KIN-MAN POON
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
ABSTRACT
Latent tree models are a class of probabilistic graphical models. These models have a tree
structure, in which the internal nodes represent latent variables whereas the leaf nodes
represent manifest variables. They allow fast inference and have been used for multidi-
mensional clustering, latent structure discovery, density estimation, and classification.
This thesis makes two contributions to the research of latent tree models. The first
contribution is a new application of latent tree models in spectral clustering. In spectral
clustering, one defines a similarity matrix for a collection of data points, transforms the
matrix to get the so-called Laplacian matrix, finds the eigenvectors of the Laplacian
matrix, and obtains a partition of the data points using the leading eigenvectors. The
last step is sometimes referred to as rounding, where one needs to decide how many leading
eigenvectors to use, to determine the number of clusters, and to partition the data points.
We propose to use latent tree models for rounding. The method differs from previous
rounding methods in three ways. First, we relax the assumption that the number of
clusters equals the number of eigenvectors used. Second, when deciding how many leading
eigenvectors to use, we not only rely on information contained in the leading eigenvectors
themselves, but also make use of the subsequent eigenvectors. Third, our method is
model-based and solves all the three subproblems of rounding using latent tree models.
We evaluate our method on both synthetic and real-world data. The results show that
our method works correctly in the ideal case where between-clusters similarity is 0, and
degrades gracefully as one moves away from the ideal case.
The second contribution is an extension to latent tree models. While latent tree models
have been shown to be useful in data analysis, they contain only discrete variables and are
thus limited to discrete data. One has to resort to discretization if analysis on continuous
data is needed. However, this leads to loss of information and may make the resulting
models harder to interpret.
We propose an extended class of models, called pouch latent tree models, that allow
the leaf nodes to represent continuous variables. This extension allows the models to
work on continuous data directly. These models also generalize Gaussian mixture models,
which are commonly used for model-based clustering on continuous data. We use pouch
latent tree models for facilitating variable selection in clustering, and demonstrate on
some benchmark data that it is more appropriate to facilitate variable selection than to
perform variable selection traditionally. We further demonstrate the usefulness of the
models by performing multidimensional clustering on some real-world basketball data.
Our results exhibit multiple meaningful clusterings and interesting relationships between
the clusterings.
CHAPTER 1
INTRODUCTION
Latent tree models (LTMs) [172] are a class of tree-structured probabilistic graphical
models. These models allow multiple latent variables and have found applications in
multidimensional clustering, latent structure discovery, and density approximation. In
this thesis, we investigate an application of LTMs in spectral clustering and propose an
extension to LTMs so that they can deal with continuous data.
In this introductory chapter, we motivate our work through an introduction to cluster-
ing. We show how LTMs can be used to solve two clustering problems. We also point out a
limitation of LTMs that our work aims to solve. After that, we highlight the contributions
of this thesis. We give an outline of the thesis at the end of this chapter.
1.1 Clustering
Clustering [56, 82, 153, 167] aims to find natural grouping of data. In general, the grouping
maximizes the intra-class similarity and minimizes the inter-class similarity [66]. Cluster-
ing is also referred to as unsupervised classification, where data are classified without any
class labels given beforehand. Since it does not require any labeling of the given data, it is
a useful technique for exploratory data analysis.
Clustering can be used to explain the heterogeneity in the data. For example, the
political stance of a person can affect one’s attitudes towards different social issues. Some-
times, however, we may not know the political stances of people. What we know is only
their attitudes. By clustering on the attitudes, we can find groups of people with similar
attitudes. The grouping possibly reflects the different political ideologies of people.
Clustering has been applied in various areas. For example, it has been applied in
business for market segmentation [10, 41, 49, 127, 160, 164], conjoint analysis in mar-
keting [131], and strategic management research [89]. It is also used in bioinformatics
for analyzing gene expression data [60, 110, 168] and in medical analysis for traditional
Chinese medicine data [176, 177]. Other applications include image segmentation, object
and character recognition, and information retrieval [82].
[Figure 1.1 here: a cluster dendrogram over the 50 U.S. states, produced by hclust with average linkage on dist(votes.repub); the vertical axis shows merge height.]
Figure 1.1: An example of a dendrogram. It shows the result of hierarchical clustering on the voting patterns in 31 years of the 50 states in the United States.
1.1.1 Categories of Clustering Algorithms
Clustering algorithms can be categorized along the following different aspects.
Clustering outputs. Based on the output, clustering algorithms can be generally clas-
sified as hierarchical or partitional. A hierarchical clustering algorithm yields a hierarchy
of nested clusterings. The hierarchy is usually represented by a dendrogram (see Figure 1.1
for an example1 and [82] for more details). The agglomerative clustering methods with
single-link [147] or complete-link [90] are two common methods for hierarchical clustering.
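As a concrete illustration, single-link agglomerative clustering can be sketched in a few lines. This is a naive cubic-time sketch on one-dimensional points, not an efficient implementation:

```python
def single_link(points, num_clusters):
    """Naive single-link agglomerative clustering on 1-D points.

    Repeatedly merges the two clusters whose closest members are
    nearest to each other, until num_clusters clusters remain.
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance: minimum over cross-cluster pairs.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

print(single_link([0.0, 0.1, 0.2, 5.0, 5.1], 2))
# → [[0.0, 0.1, 0.2], [5.0, 5.1]]
```

Recording the merge order and distances instead of stopping at a fixed number of clusters would yield the dendrogram of Figure 1.1.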
A partitional clustering algorithm, in contrast, yields a single clustering without any
hierarchical structure. One well-known example is the K-means algorithm [105].
Some clustering algorithms are known as density-based. They may also be considered
as partitional clustering algorithms, since they yield single clusterings without hierarchical
structure. However, these algorithms define clusters as dense regions that are separated
by low-density regions in data. Those data points in the low-density regions are treated
as noise and consequently are not assigned to any clusters. DBSCAN [46] is a well-known
density-based clustering algorithm.
Principles for clustering. Clustering algorithms can be classified as distance-based or
model-based according to their principles. In distance-based methods, a distance measure
has to be defined to measure the similarity or dissimilarity between the data points.
Hierarchical and partitional algorithms can then be used to find clusterings based on the
defined distance measure. For continuous data, the distance measures often used are the
Euclidean distance, Manhattan distance, or Mahalanobis distance [56].

1 The data set was obtained from http://cran.r-project.org/web/packages/cluster/index.html
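For instance, the Euclidean and Manhattan distances between two points can be computed directly. This is a minimal sketch; the Mahalanobis distance additionally requires an estimated covariance matrix and is omitted here:

```python
import math

def euclidean(x, y):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # → 5.0
print(manhattan(p, q))  # → 7.0
```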
In model-based methods [180], data are assumed to be generated from a probability
model. These methods search a given family of models for the one that gives the
best fit to the data. More details are given in the next section.
Cluster assignments. Clustering algorithms can give hard or soft assignments. In hard
assignment, a data point belongs to only one cluster. In soft assignment, in contrast, a data
point can belong to more than one cluster with different degrees of membership. Hard
assignments are used by the majority of clustering algorithms, whereas soft assignments
are usually used by fuzzy clustering algorithms [76] and model-based clustering algorithms.
1.1.2 Model-Based Clustering
Our work involves primarily model-based clustering. Compared with distance-based clus-
tering, model-based clustering offers two advantages. First, it provides a statistical basis
for analysis. In particular, it allows model selection, and one important use of model
selection is determining the number of clusters automatically.
Second, generative models for data can be obtained. These models can be interpreted
for better understanding of data. For example, the distribution of data for a cluster can
be given by the model. The posterior probability of the cluster of a data point can also
be computed.
Finite mixture models [47, 51, 111] are usually used in model-based clustering. Specifically,
in finite mixture modeling, the population is assumed to be made up of a finite
number of clusters. Suppose a variable Y is used to indicate the cluster, and variables
X represent the attributes in the data. The variable Y is a latent (unobserved) variable
whereas the variables X are manifest (observed) variables. The manifest variables X are
assumed to follow a mixture distribution
P(x) = ∑_y P(y) P(x|y).
The probability values of the distribution P(y) are known as mixing proportions and
the conditional distributions P(x|y) are known as component distributions. To generate
a sample, the model first picks a cluster y according to the distribution P(y) and then
uses the corresponding component distribution P(x|y) to generate values for the observed
variables.
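The two-step generative process can be sketched as follows. The mixing proportions and the (here, univariate Gaussian) component parameters are made-up values for illustration:

```python
import random

# Hypothetical two-component mixture: P(y) and component parameters.
mixing = [0.3, 0.7]                      # mixing proportions P(y)
components = [(-2.0, 0.5), (3.0, 1.0)]   # (mean, std) of each P(x|y)

def sample():
    """Draw one sample: first pick cluster y from P(y), then x from P(x|y)."""
    y = random.choices([0, 1], weights=mixing)[0]
    mean, std = components[y]
    return y, random.gauss(mean, std)

random.seed(0)
y, x = sample()
```

Over many draws, cluster 1 is chosen about 70% of the time, matching its mixing proportion.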
Gaussian distributions are often used as the component distributions due to compu-
tational convenience. A Gaussian mixture model (GMM) has a distribution given by
P(x) = ∑_y P(y) N(x|µ_y, Σ_y),
where N(x|µ_y, Σ_y) is a multivariate Gaussian distribution with mean vector µ_y and
covariance matrix Σ_y conditional on the value of Y.
The Expectation-Maximization (EM) algorithm [40] can be used to estimate the model
parameters. Once parameter estimation is done, the probability that a data sample d
belongs to cluster y can be computed by
P(y|d) ∝ P(y) N(d|µ_y, Σ_y),
where the symbol ∝ means that the exact values of the distribution P(y|d) can be obtained
by using the sum ∑_y P(y) N(d|µ_y, Σ_y) as a normalization constant. The sample d is
assigned to each cluster y with a different degree of association P(y|d). Hence, the latent
variable Y represents a soft partition of data.
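The posterior computation can be illustrated with a univariate sketch. The parameters below are made up, and a practical implementation would work with log-densities for numerical stability:

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a univariate Gaussian N(x | mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def posterior(x, mixing, components):
    """P(y|x), obtained by normalizing P(y) * N(x | mu_y, sigma_y) over y."""
    joint = [p * gauss_pdf(x, m, s) for p, (m, s) in zip(mixing, components)]
    total = sum(joint)
    return [j / total for j in joint]

mixing = [0.5, 0.5]
components = [(-2.0, 1.0), (2.0, 1.0)]
print(posterior(-2.0, mixing, components))  # almost all mass on cluster 0
```

A point at one component's mean receives nearly all of its posterior mass from that component, illustrating the soft partition represented by Y.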
The number of clusters can be given manually or determined automatically by model
selection. In the latter case, a score is used to evaluate a model with G clusters.
The number G that leads to the highest score is then chosen as the optimal number of
clusters. Many scores have been proposed, and the BIC score has been empirically shown
to perform well among them [50].
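Model selection with BIC can be sketched as follows. One common convention scores a model as its maximized log-likelihood minus (k/2) log n, where k is the number of free parameters and n the sample size; the candidate log-likelihoods and parameter counts below are hypothetical numbers:

```python
import math

def bic_score(log_likelihood, num_params, n):
    """BIC under the 'larger is better' convention: logL - (k/2) log n."""
    return log_likelihood - 0.5 * num_params * math.log(n)

# Hypothetical maximized log-likelihoods and parameter counts for
# candidate numbers of clusters G on n = 500 samples.
n = 500
candidates = {1: (-1450.0, 2), 2: (-1300.0, 5), 3: (-1295.0, 8)}
scores = {g: bic_score(ll, k, n) for g, (ll, k) in candidates.items()}
best_G = max(scores, key=scores.get)
print(best_G)  # → 2
```

Here G = 2 wins: the small likelihood gain of G = 3 does not justify its extra parameters, which is exactly the overfitting penalty BIC provides.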
As we can see from above, finite mixture models contain only one discrete latent
variable for clustering. This can sometimes be insufficient. A remedy is to allow multiple
discrete latent variables in a model. LTMs provide such an extension to latent class models,
which are commonly used for clustering discrete data. In the following two subsections, we
illustrate why one latent variable is sometimes insufficient. We also show how LTMs can
be used in these situations.
1.1.3 Multidimensional Clustering
Suppose we are given the scores of four subject tests of some students. The four subjects
are mathematics, science, literature, and history. Our aim is to cluster the students based
on those scores.
If we cluster the data using a model with one discrete latent variable, a single clustering
is obtained. The clustering will likely be based on the general intelligence of the students.
However, the four subject tests may require two different skills. Analytical skill is needed
for mathematics and science, while literal skill is needed for literature and history. The single
clustering obtained cannot reflect the two different skills of students. This example shows
a limitation of using models with a single discrete latent variable for clustering.
On the other hand, LTMs have multiple discrete latent variables. These variables
allow multiple clusterings on data. If we use an LTM to cluster the students, it may
result in three clusterings. The clusterings will likely be based on the analytic skill, literal
skill, and general intelligence, respectively, of the students. They allow us to cluster the
students along three different dimensions. Hence, this approach to clustering is known as
multidimensional clustering [28]. Note that while both hierarchical clustering and multi-
dimensional clustering yield multiple clusterings, there is an important difference between
them. In hierarchical clustering, the clusterings obtained are nested and represent cluster-
ings at different levels of granularity along the same dimension. But in multidimensional
clustering, multiple partitional clusterings are obtained along different dimensions and
they are not nested.
One limitation of LTMs is that they can work on discrete data only. To use LTMs
in the current example, we have to discretize the scores. However, this leads to loss of
information. Moreover, this may make the resulting models more difficult to interpret.
For example, if we discretize an attribute into fewer intervals, the discretized attribute may
represent the original attribute less well. On the other hand, if we discretize an attribute
into more intervals, the conditional probability tables in the resulting models will have
more entries and will be harder to comprehend. Due to these deficiencies, an extension
to LTMs for continuous data is needed.
1.1.4 Spectral Clustering
Many commonly used clustering methods yield ‘globular’ clusters. These include the
finite mixture models and K-means. They perform poorly when the true clusters are
non-convex, such as those shown in Figure 1.2(a). To overcome this shortcoming, spectral
clustering [159] has been proposed. This method has gained prominence in recent years.
In spectral clustering, one defines a similarity matrix for a collection of data points.
The matrix is transformed to get the so-called Laplacian matrix. One then finds the
eigenvectors of the Laplacian matrix. A partition of the data points is obtained using the
leading eigenvectors.
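The first two steps can be sketched as follows, using the common Gaussian similarity and the unnormalized Laplacian L = D − W; the eigen-decomposition and rounding steps are omitted here:

```python
import math

def squared_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def similarity_matrix(points, sigma=1.0):
    """Gaussian similarity: w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return [[math.exp(-squared_dist(p, q) / (2 * sigma ** 2)) for q in points]
            for p in points]

def unnormalized_laplacian(W):
    """L = D - W, where D is the diagonal matrix of row sums (degrees)."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

W = similarity_matrix([(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)])
L = unnormalized_laplacian(W)
# Every row of L sums to zero, a basic property of graph Laplacians.
```

Nearby points get similarity close to 1 and distant points close to 0, so in the ideal case the similarity matrix is nearly block-diagonal.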
Essentially, data are transformed from the feature space to an eigenspace in spectral
clustering. The eigenspace is given by the leading eigenvectors of the Laplacian matrix. It
5
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−2 −1 0 1 2
−2
−1
01
2
(a) Feature Space
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0.00 0.02 0.04 0.06 0.08
0.00
0.02
0.04
0.06
0.08
(b) Eigenspace
Figure 1.2: Data points in the original feature space and the transformed eigenspace inspectral clustering. Ideally, points in each cluster are transformed to the same coordinatesin the eigenspace. Also, each cluster has distinct coordinates in the eigenspace. In thisexample, the cluster with circular (red) points in the feature space are mapped to theupper-left point in the eigenspace, while the one with triangular (cyan) points to thelower-right point.
is expected to allow the data to be clustered more easily. Ideally, the points belonging to
the same cluster should have the same location in the eigenspace, while those belonging
to different clusters should have different locations. This is illustrated in Figure 1.2.
In general, the clusters may not be as well-separated as in the ideal case. The data
points in a cluster do not have the same location in the eigenspace. Instead, they are
perturbed from their ideal locations. This is why a clustering algorithm is usually used
to partition data in the eigenspace. This is done in the last step of spectral clustering,
which is sometimes referred to as rounding. Indeed, there are three related subproblems
in this step. One needs to decide how many leading eigenvectors to use, to determine the
number of clusters, and to partition the data points.
When the number of clusters is given, rounding is easy. The same number of leading
eigenvectors is usually used, and we can obtain a clustering from the leading eigenvectors
using K-means. But when the number of clusters is not given, rounding becomes
more difficult. To tackle this problem, some methods use Gaussian mixture models to
determine the number of clusters and to partition the data [166, 179]. This is done after
selecting the relevant eigenvectors using heuristics.
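A minimal K-means, run on the rows of the matrix of leading eigenvectors, can be sketched as follows. No restarts or convergence checks are included; real implementations use both:

```python
import random

def kmeans(rows, k, iters=50, seed=0):
    """Minimal K-means on a list of feature vectors, e.g. the rows of
    the matrix whose columns are the leading eigenvectors."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: attach each row to its nearest center.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(row, centers[c])))
                  for row in rows]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return labels

rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(kmeans(rows, 2))  # the two well-separated pairs form two clusters
```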
Gaussian mixture models have only one discrete latent variable. To partition data, the
variable has to consider all the leading eigenvectors at the same time. On the other hand,
an LTM has multiple discrete latent variables. It allows the latent variables to focus on
only subsets of eigenvectors. It also has a more flexible model structure than Gaussian
mixture models. These differences suggest that LTMs may find a useful application in
rounding.
1.2 Contributions
This thesis makes two contributions to the research of LTMs. The first contribution is an
application of LTMs in spectral clustering. We propose and study a novel model-based
approach to rounding using LTMs. The method differs from previous rounding methods
in three ways. First, we relax the assumption that the number of clusters equals the
number of eigenvectors used. Second, when deciding how many leading eigenvectors to
use, we not only rely on information contained in the leading eigenvectors themselves,
but also make use of the subsequent eigenvectors. Third, our method is model-based and
solves all the three subproblems of rounding using LTMs. We evaluate our method on
both synthetic and real-world data. The results show that our method works correctly
in the ideal case where between-clusters similarity is 0, and degrades gracefully as one
moves away from the ideal case.
The second contribution is an extension to LTMs. All variables in LTMs are dis-
crete. Therefore, we propose an extended class of models, called pouch latent tree models
(PLTMs), that allows the leaf nodes to represent continuous variables. This extension is
also a generalization of Gaussian mixture models. We develop an inference algorithm and
a learning algorithm for these models. We use these models for facilitating variable selec-
tion in clustering, and demonstrate on some benchmark data that it is more appropriate
to facilitate variable selection than to perform variable selection traditionally. We further
demonstrate the usefulness of these models by performing multidimensional clustering on
some real-world basketball data.
1.3 Organization
This thesis can be divided into three main parts. The first part consists of Chapters 2
and 3. It serves as a background for the thesis. In Chapter 2, we review different types
of latent variable models. We point out particularly the differences between those models
and LTMs. In Chapter 3, we review the background of LTMs. We also survey different
algorithms and applications of LTMs.
The second part consists of only Chapter 4. It is based on the work in [126]. In
Chapter 4, we propose an application of LTMs in spectral clustering. We use LTMs for
rounding, and evaluate our approach using both synthetic data and real-world data.
The third part consists of Chapters 5–7. It is based on the work in [123, 124, 125]. In
Chapter 5, we propose PLTMs as an extension to LTMs for continuous data. We also
describe an inference algorithm and a learning algorithm for PLTMs. In Chapter 6, we
consider PLTMs for variable selection in clustering. In particular, two approaches to variable
selection are compared. One approach facilitates variable selection using PLTMs, and
the other performs variable selection in traditional ways. We compare the two approaches
using some benchmark data. In Chapter 7, we perform multidimensional clustering on
seasonal statistics of NBA players. We show that multiple interesting clusterings can be
found using PLTMs.
Chapter 8 concludes this thesis after the three main parts. It summarizes the thesis
and points out some future directions.
CHAPTER 2
LATENT VARIABLE MODELS
Latent variables are often used in statistical modeling. Many definitions of latent variables
exist in the literature [15]. In this thesis, we use a simple one. We define latent variables
as random variables whose values are not observed. In contrast, those variables whose
values are observed are called manifest variables. And we refer to those models that
contain latent variables as latent variable models.
There are several reasons why latent variables are needed. Firstly, latent variables can
be used to represent some abstract concepts which cannot be observed. They can also
represent some concepts that are hard to measure in practice. Latent variables used in
this situation are sometimes called theoretical constructs or hypothetical constructs. They
are used in a model to hypothesize some hidden factors that affect the values of observed
variables. For example, item response theory models are often used in educational testing.
The latent variables in these models can be used to measure some unobserved intelligence
or ability of students.
Secondly, latent variables can be used to represent unobserved heterogeneities or
sources of variations. They are included in a model so that their values can be recovered
or estimated from the observed data. For example, a latent variable can be used in cluster
analysis to recover some unobserved grouping of similar samples.
Thirdly, latent variables can also be used for dimension reduction. The aim here is
to summarize the values of a large number of manifest variables using a much smaller
number of latent variables. The meanings of the latent variables may not be of interest.
For example, in principal component analysis, the components can be considered as latent
variables. The meanings of these components are often unimportant. The components
are only used to reduce the dimension of data for further analysis.
Finally, some related variables may have been excluded unintentionally during the
collection of data. By including these latent (excluded) variables, the model structure
can sometimes be simplified. For example, Elidan et al. [45] propose to simplify some
pattern of complex parts in a model structure by introducing latent variables into it.
Latent variables are of particular interest to social scientists. This is because abstract
concepts are usually involved in their fields of study. Latent variable models have long
Figure 2.1: Structures of: (a) latent class models, latent trait models, and latent profile models; and (b) factor models. Latent variables are shown as shaded nodes.
been used in these fields. Some earlier efforts include their uses in psychometrics [149],
biometrics [165], and econometrics [64]. Latent variable models have also been applied
in some other domains, including education [141], medical care [121], marketing [16], and
economics [98].
In the following, we survey latent variable models proposed in the literature. We put
more emphasis on models that are related to LTMs. We aim to point out the differences
between those models and LTMs.
2.1 Measurement Models
Many of the latent variable models traditionally used in the social sciences are sometimes
known as measurement models [146]. A fundamental characteristic of these models is
the assumption of a form of conditional independence called local independence. Specifi-
cally, the manifest variables are assumed to be conditionally independent given the latent
variables. This assumption forces the latent variables to account for the probabilistic
relationships between the manifest variables.
The measurement models can be categorized based on the types of the manifest and
latent variables used. If the latent and manifest variables are both discrete, we have a
latent class model (LCM) [62, 99]. When the manifest variables are discrete and the latent
variables are continuous, the resulting models are known as item response theory (IRT)
models [133, 157] or latent trait models [99]. When the manifest variables are continuous
and the latent variables are discrete, we have latent profile models [99]. If both latent and
manifest variables are continuous, we have factor models [8, 9, 67].
Among these four kinds of models, the first three share the same model structure.
Their model structure is shown in Figure 2.1(a). Each of these models has only one latent
variable. The difference between these models is in the types and hence the distributions
of the variables. On the other hand, a factor model can have multiple latent variables.
Its model structure is shown in Figure 2.1(b). As shown in the figure, the latent variables
in a factor model can be correlated. However, they are usually assumed to be mutually
independent.
Manifest variables in a factor model can have multiple parents. However, a post-
processing step called factor rotation is often carried out during factor analysis [70, 88].
This step tries to find an equivalent model with the simplest structure, such that each
manifest variable has the fewest parents in the equivalent model.
2.2 Confirmatory and Exploratory Analyses
Before we continue to survey other latent variable models, it is worth distinguishing
between two approaches to using these models. The first approach is a confirmatory
one. It aims to test a hypothesis. To do so, a model structure is specified based on the
hypothesis. The model is then estimated from and tested against the empirical data. The
hypothesis is confirmed or repudiated based on how well the model fits the data.
The second approach is an exploratory one. In this approach, only a class of models
is specified. The exact model structure is determined by model selection. This approach
aims to help users gain a better understanding of the data from the resulting model
structure and parameters.
As an example, we compare the two different approaches to using factor models. In
exploratory factor analysis, the number of factors (latent variables) is unknown before-
hand. Moreover, only a minimal number of constraints on parameters are imposed. The
number of factors and the model parameters are estimated during the analysis. On the
other hand, the number of factors is specified in confirmatory factor analysis. Restrictions
are also imposed based on the hypothesis under test. In particular, edges may be removed
from the model structure to indicate that some manifest and latent variables are mutually
independent. Some parameter values are set to zero for this purpose.
2.3 Mixture Models
Latent variable models are also used in machine learning. One of the most often used
models is the finite mixture model (FMM) [47, 54, 111]. The model has the following
distribution

P(X) = ∑_y P(Y = y) P(X | Y = y),
Figure 2.2: Model structure of a finite mixture model.
where X is a vector of manifest variables and Y is a latent variable. It assumes that
the samples can be divided into different groups. Each of these groups has a different
distribution of X. The model has a discrete latent variable Y . The variable indicates
which group a sample belongs to. However, the grouping of the samples is unobserved.
Therefore, it has to be represented by a latent variable. Figure 2.2 shows the
structure of a FMM.
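The mixture distribution above can be evaluated directly. The following sketch uses made-up probability values for a toy model with one three-state manifest variable X and a two-state latent group variable Y; it computes both the marginal P(X) and the posterior over the latent group for an observed value.

```python
import numpy as np

# A toy finite mixture model: |X| = 3, |Y| = 2.
# All probability values here are made up for illustration.
p_y = np.array([0.4, 0.6])                 # P(Y)
p_x_given_y = np.array([[0.7, 0.2, 0.1],   # P(X | Y = 0)
                        [0.1, 0.3, 0.6]])  # P(X | Y = 1)

# Marginal distribution: P(X) = sum_y P(Y = y) P(X | Y = y)
p_x = p_y @ p_x_given_y

# Posterior over the latent group for an observed value x = 2:
# P(Y = y | X = 2) is proportional to P(Y = y) P(X = 2 | Y = y)
x = 2
posterior = p_y * p_x_given_y[:, x]
posterior /= posterior.sum()
```

The posterior computation is exactly the "recovery" of the unobserved grouping mentioned above: each sample is assigned a distribution over the latent groups.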
The FMMs include a wide variety of models. The manifest variables in the models
can be discrete or continuous. Latent class models and latent profile models can also
be represented as FMMs. Among these varieties, Gaussian mixture models (GMMs) are
probably the most often used. In GMMs, the manifest variables are continuous and have
conditional Gaussian distributions. Latent profile models can be considered as a special
case of GMMs. Both have the same types of manifest and latent variables. However, the
former models assume local independence but the latter ones do not.
The component distribution P (X|Y ) can also be represented by other classes of models.
For example, Thiesson et al. [154] propose a mixture of conditional Gaussian networks.
A conditional Gaussian network [36, 95] is a Bayesian network containing both discrete
and continuous variables. However, the discrete variables can have only discrete parents.
Another example is a mixture of trees [115], in which the component distributions are
represented by tree-structured graphical models. There are also some mixtures of latent
variable models. Those models are introduced below.
Hidden Markov models [58, 129] have a structure similar to that of the finite mixture
models. However, they are used for time series modeling and have latent variables
representing values at different time instants.
2.4 Models with Multiple Latent Variables
Most of the models surveyed so far have only a single latent variable. This may limit the
effectiveness of the models. In fact, there are some models that allow multiple latent
variables.
2.4.1 Structural Equation Models
Structural equation models (SEMs) [11, 14, 87, 104] are a class of models that have multiple
latent variables. They have been widely used in the social sciences and are a generalization
of the factor models. A SEM can be divided into two overlapping parts. The first part is the
measurement model. It includes the latent variables and the manifest variables depending
on them. The relationships between the manifest variables and the latent variables are
similar to those in a factor model. The second part is the structural model. It includes the
latent variables and the relationships between them.
While all variables in SEMs are traditionally continuous, some recent frameworks
can include both types of variables. For example, the framework called Generalized
Linear Latent and Mixed Models has been introduced by Skrondal and Rabe-Hesketh
[145, 146]. It extends the SEMs with both types of manifest variables. Muthén [117]
presents another framework, which can have both types of latent and manifest variables.
SEMs are usually used for confirmatory analysis. However, there is also some work on
learning the model structure automatically. Spirtes et al. [150] present algorithms that
can determine the relationships between the variables in a SEM. However, the algorithms
cannot discover any new latent variables. Silva et al. [144] propose an algorithm for
learning the structure of a SEM given some manifest variables. Similarly to exploratory
factor analysis, it can determine the number of latent variables. The models they consider
contain only continuous variables.
2.4.2 Multidimensional Measurement Models
The factor model is one of the measurement models that allow multiple latent variables.
We refer to this kind of measurement models as multidimensional measurement models.
This is because in these models, the multiple latent variables can be considered as
different dimensions of factors or heterogeneities that affect the manifest variables.
Many latent variable models used in machine learning have a structure similar to that of
the factor models [138]. These include the models used for principal component analysis
(PCA) [85] and probabilistic PCA [137, 156]. The difference between these two models
and the factor models is that they assume different variances of the noise on the manifest
variables.
Figure 2.3: Structures of the two types of multidimensional item response theory models: (a) between-item multidimensionality; (b) within-item multidimensionality.
Independent component analysis (ICA) [78, 79] also uses a model with the same struc-
ture as the factor models. However, non-Gaussian distributions are assumed on the latent
variables. This is in contrast to the case for factor models, where Gaussian distributions
are usually assumed. Tree-dependent component analysis [6] relaxes the independence
assumption on the latent variables of the ICA models. It allows the latent variables to be
correlated and be represented as a tree-structured model.
There are factor models that have discrete manifest variables. Collectively, these
models are sometimes known as discrete component analysis (DCA) [19]. They include
non-negative matrix factorization (NMF) [100, 101], probabilistic latent semantic anal-
ysis (PLSA) [74, 75], multinomial PCA [18], latent Dirichlet allocation (LDA) [13], and
GaP [21]. These models differ from each other in the conditional distributions on the
manifest variables and the prior distributions on the latent variables. More detailed com-
parison between these models is given by Buntine and Jakulin [19].
The latent variables in the DCA models are continuous. However, those in PLSA,
multinomial PCA, and LDA are normalized, meaning that the sum of their values is al-
ways one. Each of the normalized variables represents one discrete state, and its value
represents the probability of that state. Together, the normalized variables in one model
are equivalent to a single discrete variable. Hence, those models are actually unidimen-
sional in this sense. They can show only one dimension of clustering on the data.
In addition to the factor models, there are some extensions of the other traditional
measurement models to allow multiple latent variables. For example, the traditional IRT
models are extended by the multidimensional IRT models [1, 17, 135]. These models
can be broadly divided into two types based on the number of parents that a manifest
variable can have [1]. The first type is the between-item multidimensional models. Each
manifest variable has only one parent latent variable. The whole model consists of multiple
unidimensional IRT models on disjoint subsets of manifest variables. The second type
is the within-item multidimensional models. They have a structure similar to that of the
factor models. In both types of models, the latent variables can be correlated. The structures
of the two types of models are shown in Figure 2.3.
The traditional LCMs have also been extended to the latent class factor models [107].
This extension can be considered as decomposing the discrete latent variable in the former
model into a joint variable consisting of multiple discrete latent variables in the latter model. The latent
variables in the latent class factor models are binary and are mutually independent. The
structure of the models is the same as that of the factor models. However, the parameters
are restricted in such a way that a manifest variable is affected by each latent variable
independently. This restriction reduces the number of parameters in the model.
2.4.3 Mixtures of Latent Variable Models
In a FMM, a discrete variable is used to indicate which group a data sample belongs
to. Since the grouping of data is unobserved, this variable is considered as a latent
variable. When the component distributions in a model are represented by some other
latent variable models, the whole model can also be considered as having multiple latent
variables.
One example of this kind of mixture models is the mixtures of factor models. This
includes mixture of factor analyzers [48, 59, 112], mixture of PCA [72], and mixture
of probabilistic PCA [155]. In these models, a factor model is used to represent the
component distribution. It is used as a dimension reduction technique. It replaces the
Gaussian distribution that has a complete covariance matrix in the original model. Hence,
the number of parameters in the component distribution can be reduced. However, the
latent variables in the factor models may not have any meaning.
There is also some work on mixtures of SEMs. Under these models, the behaviors
of different groups of samples can be represented by different SEMs. These models are
described by Jedidi et al. [84], Muthén [117], and Skrondal and Rabe-Hesketh [146].
The above mixture models have only a single discrete variable to indicate the grouping
of samples. In contrast, a hierarchical mixture of experts [12, 86] has multiple discrete
variables to indicate the grouping of samples. These discrete variables can be unobserved.
This means that the grouping is determined by some hidden factors. The model has
a tree structure of latent variables. However, it is not truly multidimensional. The
multiple discrete variables represent a hierarchical clustering of data. The grouping of
data indicated by a higher level of latent variable is further refined by some lower levels
of latent variables. Therefore, there is only one dimension of partition on the data.
The multiple latent variables represent only different levels of granularity along the same
dimension.

Figure 2.4: A latent tree model as an extension to a latent class model. (a) In the latent class model, local dependencies are observed. They are indicated by the dashed arrows between the observed nodes. (b) Latent variables Y2 and Y3 are introduced to account for the observed local dependencies. This results in the latent tree model shown.
2.4.4 General Models
In addition to the above categories of models that contain multiple latent variables, there
are some general models that can also allow multiple latent variables. For example,
Elidan et al. [45] consider adding latent variables to a Bayesian network. The Bayesian
networks they consider contain only discrete variables. Their algorithm introduces new
latent variables to replace some pattern of complex parts in the model structure. The
motivation is to simplify the model structure by these latent variables. However, it is
unclear whether the new latent variables can be interpreted easily. This work is extended
by Elidan and Friedman [44] so that the number of states in a latent variable can be
determined automatically.
2.5 Latent Tree Models
Latent tree models (LTMs) [171, 172] are the main topic of this thesis. They were
previously known as hierarchical latent class models. Similar to some multidimensional
measurement models, they extend the traditional LCMs to allow multiple latent variables.
The extension is motivated by an observation called local dependence. Consider the
LCM shown in Figure 2.4(a). Local independence is assumed by the model. However,
this assumption is sometimes violated in empirical data. Local dependencies may be
observed. This means that even given the value of the latent variable, correlations among
some manifest variables may be observed in the data.
In an LTM, these local dependencies can be accounted for by the introduction of some
latent variables. This is illustrated in Figure 2.4(b). The latent variables are introduced
as the parents of subsets of variables with local dependence. More latent variables can be
introduced if local dependencies are still observed.
The resulting model has a tree structure. It can have multiple latent variables. The
latent variables are found on the internal nodes, and the manifest variables are found on
the leaf nodes. Besides, as in LCMs, all variables in LTMs are discrete.
Compared with other latent variable models, LTMs have two remarkable features.
First, they can have multiple latent variables. Those multiple latent variables allow LTMs
to fit data better and provide a better explanation of the data.
Second, LTMs have a tree structure. This makes probabilistic inference in LTMs
tractable. It also limits the search space for structural learning. In addition, its simplicity
allows easier interpretation of the model.
2.6 Summary of Models
Table 2.1 summarizes the latent variable models reviewed in this chapter.
Model | MV | LV | MLV | Structure | Remarks
Latent class models [62, 99] | D | D | N | Star |
Item response models [133, 157], latent trait models [99] | D | C | N | Star |
Latent profile models [99] | C | D | N | Star |
Factor models [8, 9, 67] | C | C | Y | Factor |
Finite mixture models [47, 54, 111] | M | D | N | Mixture |
Gaussian mixture models | C | D | N | Mixture |
Mixtures of conditional Gaussian networks [154] | M | D | Y | Mixture |
Hidden Markov models [58, 129] | M | D | Y | Mixture | Latent variables represent values at different time instants.
Structural equation models [11, 14, 87, 104] | C | C | Y | General |
Generalized linear latent and mixed models [145, 146] | M | C | Y | General |
Muthén's framework [117] | M | M | Y | General |
PCA [85], probabilistic PCA [137, 156] | C | C | Y | Factor |
Independent component analysis [78, 79] | C | C | Y | Factor |
Tree-dependent component analysis [6] | C | C | Y | Factor | Latent variables are connected by a tree structure.
Discrete component analysis [19], e.g., NMF [100, 101], PLSA [74, 75], multinomial PCA [18], LDA [13], GaP [21] | D | C | Y | Factor | Latent variables are normalized in PLSA, multinomial PCA, and LDA.
Multidimensional IRT models [1, 17, 135] | D | C | Y | Factor |
Latent class factor models [107] | D | D | Y | Factor | Latent variables are binary and mutually independent.
Mixtures of factor analyzers [48, 59, 112], mixtures of PCA [72], mixtures of probabilistic PCA [155] | C | M | Y | Mixture | Each model has one discrete latent variable and multiple continuous latent variables.
Mixtures of structural equation models [84, 117, 146] | M | M | Y | Mixture |
Hierarchical mixtures of experts [12, 86] | M | D | Y | Tree | Latent variables in a model represent a hierarchical clustering.
Bayesian networks with latent variables [44, 45] | D | D | Y | General |
Latent tree models [171, 172] | D | D | Y | Tree |

Table 2.1: Summary of latent variable models. The second and third columns indicate the type of manifest variables (MV) and latent variables (LV) in the models. The type can be continuous (C), discrete (D), or mixed (M). The fourth column indicates whether multiple latent variables (MLV) are allowed in the models. The fifth column shows the model structures. They can be: star-shaped (Figure 2.1(a)), tree-shaped (Figure 2.4(b)), the same as those of factor models (Figure 2.1(b)), mixture models (in which the component distributions can be represented by some other structures, see Figure 2.2), or a general structure.
CHAPTER 3
LATENT TREE MODELS
At the end of the last chapter, we discussed how latent tree models extend latent class
models. We also compared LTMs with other latent variable models. In this chapter, we
describe LTMs in more detail. We begin with the notation used in the thesis. In
section 3.2, we review Bayesian networks. They provide a framework for defining LTMs.
In section 3.3, we define LTMs and discuss their properties. In section 3.4, we discuss
parameter estimation. We then review some learning algorithms that have been proposed
for LTMs in section 3.5. Finally, we survey some applications of these models in the
literature in section 3.6.
3.1 Notations
In this thesis, we use capital letters such as X and Y to denote random variables. We use
lower case letters such as x and y to denote their values. We use bold face letters such as
X, Y , x, and y to denote sets of variables or values. Manifest variables are denoted by
X and latent variables by Y . If a variable can be manifest or latent, we use V to denote
it. For a discrete variable V , we use |V | to denote its cardinality. Furthermore, we use
P (X) to denote the distribution of a variable X, and use P (x) as a shorthand for the
probability of having the value of x, that is, P (X = x).
We use the following notations for graphs. When the meaning is clear from context,
we use the terms ‘variable’ and ‘node’ interchangeably. Capital letter Π(V ) is used to
indicate the parent node of a node V , and lower case letter π(V ) to indicate its value.
3.2 Bayesian Networks
Bayesian networks (BNs) are a class of probabilistic graphical models [91]. They define
probability distributions over some random variables. Their graphical structure provides
a natural representation of the relationships between the variables. For a detailed
introduction, readers are referred to books in this area [e.g. 36, 38, 122].
A BN M = (m,θ) can be defined by its structure m and a set of parameters θ. The
structure m is given by a directed acyclic graph. The set of nodes V = {V1, . . . , Vn} in
the graph represents n variables. Each node V is associated with a probability distribution
Figure 3.1: An example of a Bayesian network (the burglary alarm network), with nodes Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), and MaryCalls (M). The conditional probability tables are: P(B = T) = .001; P(E = T) = .002; P(A = T | B, E) = .95, .94, .29, .001 for (B, E) = (T, T), (T, F), (F, T), (F, F); P(J = T | A) = .90 for A = T and .05 for A = F; P(M = T | A) = .70 for A = T and .01 for A = F.
P (V |Π(V )) conditional on its parents Π(V ).1 The set of parameters θ consists of those
parameters needed to specify all the conditional probability distributions.
The joint probability distribution defined by a BN is assumed to satisfy the Markov
condition. This condition means that every variable V in a BN is conditionally independent
of its non-descendants given the parents of V . Based on this condition, the joint
probability P (V ) can be factorized as a product of the conditional probability
distributions associated with the nodes:
P(V) = ∏_{i=1}^{n} P(Vi | Π(Vi)).
Figure 3.1 shows an example of a BN given by Russell and Norvig [139]. The BN has
five binary variables. It models a situation in which an alarm may be triggered by a
burglary or an earthquake. The triggered alarm may further lead to a call from John or
one from Mary. The conditional probability distributions in the BN model the probabilities
of these events.
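The factorization can be sketched directly on this example. The CPT values below follow Figure 3.1; the `joint` function is an illustrative helper (not code from the thesis) that multiplies the five conditional distributions together.

```python
# Joint probability in the alarm network via the factorization
# P(V) = prod_i P(Vi | Pi(Vi)).  CPT values follow Figure 3.1.
p_b = {True: 0.001, False: 0.999}                  # P(B)
p_e = {True: 0.002, False: 0.998}                  # P(E)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=T | B, E)
p_j = {True: 0.90, False: 0.05}                    # P(J=T | A)
p_m = {True: 0.70, False: 0.01}                    # P(M=T | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of the five CPDs."""
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pj = p_j[a] if j else 1 - p_j[a]
    pm = p_m[a] if m else 1 - p_m[a]
    return p_b[b] * p_e[e] * pa * pj * pm

# The alarm sounds and both John and Mary call, but there is neither
# a burglary nor an earthquake.
p = joint(False, False, True, True, True)
```

Summing `joint` over all 32 assignments gives 1, which is one way to check that the factorized product is a valid joint distribution.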
In BNs, discrete variables are usually considered. However, continuous variables can
also be included. In a Gaussian Bayesian network (GBN) [57, 142], all variables are con-
tinuous. Gaussian distributions are used to model the probabilistic relationships between
the variables. On the other hand, a conditional Gaussian Bayesian network [36, 95] can
contain both discrete and continuous variables. However, discrete variables cannot have
continuous parents in this model.
Figure 3.2: An example of a latent tree model, root walking, and an unrooted model: (a) the original model; (b) after root walking (from Y1 to Y2); (c) the unrooted model. The leaf nodes X1–X6 represent manifest variables, while the internal nodes Y1–Y3 represent latent variables. Latent variables are represented by shaded nodes.
3.3 Latent Tree Models
A latent tree model (LTM) [162, 172] is a tree-structured BN containing latent variables.
In this model, latent variables are represented by the internal nodes, whereas manifest
variables by the leaf nodes. Both latent and manifest variables are discrete. An example
of an LTM is shown in Figure 3.2(a).
Similar to the case for BNs, an LTM can be written as a pair M = (m,θ). The second
component θ is the set of parameters for specifying the distributions in the model. The
first component m represents the model structure of the LTM. It specifies the variables,
their cardinalities, and the edges between the variables. We sometimes also refer to the
first component m as an LTM.
3.3.1 Model Scores
Suppose D is a collection of data over a set of variables X. There can be infinitely many
possible LTMs having X as their leaf nodes. Therefore, a score is needed to evaluate how
well an LTM fits the data D. It is essential to select the best model among the possibilities.

1 If a node does not have any parent, it is assumed to be the child of a dummy node with one value. This allows all nodes to be treated in the same way.
The BIC score [140] is used for this purpose in this thesis. It has been empirically
shown to work well compared with some other scores [172]. The BIC score of a model m
is given by
BIC(m|D) = log P(D|m, θ∗) − (d(m)/2) log N,   (3.1)
where θ∗ is the maximum likelihood estimate (MLE) of the parameters and d(m) is the
number of independent parameters in m. The first term is known as the maximized log-
likelihood term. It favors models that fit data well. The second term is known as the
penalty term. It discourages complex models. Hence, the BIC score provides a trade-off
between model fitness and model parsimony.
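The trade-off can be illustrated with a direct computation of the two terms of the score. The numbers below are hypothetical; they only show how the penalty term can let a more parsimonious model win despite a slightly worse fit.

```python
import math

def bic_score(loglik: float, num_params: int, sample_size: int) -> float:
    """BIC(m | D) = log P(D | m, theta*) - d(m)/2 * log N  (Equation 3.1)."""
    return loglik - num_params / 2 * math.log(sample_size)

# Hypothetical numbers: the complex model fits slightly better but pays
# a much larger penalty for its extra parameters.
score_complex = bic_score(loglik=-1200.0, num_params=80, sample_size=500)
score_simple = bic_score(loglik=-1210.0, num_params=40, sample_size=500)
```

Here the simpler model obtains the higher BIC score, so it would be preferred during model selection.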
3.3.2 Model Equivalence
We use models to fit observed data, and we want to find those with the best fit. However,
in some cases, two different models fit the data equally well. This idea is formalized
through the concepts of model inclusion and equivalence.
Consider two LTMs m and m′ that share the same set of manifest variables X. We
say that m includes m′ if for any parameter value θ′ of m′, there exists a parameter value
θ of m such that P (V |m,θ) = P (V |m′,θ′).
When m includes m′, m can represent any distributions over the manifest variables
that m′ can. Hence, the maximized log-likelihood of m must be larger than or equal to
that of m′, that is, P (D|m,θ∗) ≥ P (D|m′,θ′∗).
When m and m′ include each other, we say that they are marginally equivalent.
Marginally equivalent models have equal maximized log-likelihood. This means they
can fit the data equally well. If they have the same number of independent parameters,
they become equivalent. Equivalent models are indistinguishable based on data if the BIC
score is used for model selection. In fact, this is also true if any other penalized likelihood
score [63] is used.
3.3.3 Root Walking
Consider an LTM m with root Y1. Suppose Y2 is a latent node and is a child of Y1. Define
another LTM m′ by reversing the direction of the edge Y1 → Y2. Now Y2 becomes the
root in the new model m′. We call this operation root walking — the root has walked
from Y1 to Y2. Figure 3.2(b) shows the model obtained after walking the root from Y1 to
Y2 in the original model in Figure 3.2(a).
22
Root walking results in equivalent models [172]. This implies that any choice of root in
an LTM leads to the same score. Hence, the root of an LTM cannot be determined
from data. What can be determined is only an equivalence class of models having the same
structure but different roots.
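Root walking amounts to re-parameterizing one edge by Bayes' rule. The sketch below (with made-up distributions for two binary latent variables) reverses the edge Y1 → Y2 while leaving the joint distribution, and hence the score, unchanged.

```python
import numpy as np

# Root walking from Y1 to Y2 (Figure 3.2(a) to (b)): reverse the edge
# Y1 -> Y2 by Bayes' rule.  Toy numbers; both latent variables are binary.
p_y1 = np.array([0.3, 0.7])              # P(Y1), with Y1 as the root
p_y2_given_y1 = np.array([[0.8, 0.2],    # P(Y2 | Y1 = 0)
                          [0.4, 0.6]])   # P(Y2 | Y1 = 1)

# Joint over (Y1, Y2) under the original parameterization.
joint = p_y1[:, None] * p_y2_given_y1

# Parameters with Y2 as the new root.  The joint is unchanged, so the
# two models are marginally equivalent.
p_y2 = joint.sum(axis=0)                 # P(Y2)
p_y1_given_y2 = joint / p_y2             # P(Y1 | Y2), columns indexed by Y2
```

Multiplying the new parameters back together recovers exactly the original joint over (Y1, Y2), which is why the root cannot be identified from data.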
An unrooted LTM can be used to represent this equivalence class of models. It is
obtained by dropping the directions of all edges in an LTM. Members of the equivalence
class can be obtained by using different latent nodes as roots in the unrooted model. An
example of an unrooted model is given in Figure 3.2(c).
Semantically, the unrooted model is a Markov random field over an undirected tree.
The external nodes are manifest whereas the interior nodes are latent. Model inclusion
and equivalence can be defined for unrooted models in the same way as for rooted models.
In this thesis, we use rooted LTMs to represent the models obtained from data. It
should be noted that any latent node can be chosen as the root. Our results do not
depend on the choice of the root.
3.3.4 Regularity
A larger number of model parameters usually results in a better model fit to data.
However, for some LTMs, it is possible to find marginally equivalent models with fewer
parameters. Such models are called irregular models.
Zhang [172] shows that any model that violates the following condition is irregular.
Condition 1. (Upper Bound on Cardinality). Let Y be a latent variable in an LTM, and
let V1, . . . , Vr be the r neighbors of Y . Then

|Y| ≤ (∏_{i=1}^{r} |Vi|) / max_{i=1}^{r} |Vi|.   (3.2)

If Y has only two neighbors, strict inequality holds and one of the neighbors must be a
latent node.
Given an irregular model m, it is possible to obtain a model m′ that is marginally
equivalent to m and does not violate the regularity condition. We refer to such a model
m′ as a regular model, and to the process of obtaining it as regularization.
Let m be an irregular model that violates Condition 1. A regular model m′ can be
obtained through the following regularization process:
1. For each latent variable Y in m, let V1, . . . , Vr be the r neighbors of Y .
(a) If Y violates Inequality (3.2), set

|Y| = (∏_{i=1}^{r} |Vi|) / max_{i=1}^{r} |Vi|.   (3.3)

(b) If Y has only two neighbors, one of which is a latent node, and if it violates
the strict version of Inequality (3.2), remove Y from m and connect the two
neighbors of Y .
2. Repeat Step 1 until there is no further change.
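Step 1(a) can be sketched as a small helper (the function name is ours, not from the thesis). Integer division is exact here because the maximum cardinality divides the product of all neighbor cardinalities.

```python
from math import prod

def capped_cardinality(card_y: int, neighbor_cards: list[int]) -> int:
    """Step 1(a) of the regularization procedure: cap |Y| at the bound of
    Inequality (3.2), i.e. the product of the neighbors' cardinalities
    divided by the largest of them."""
    bound = prod(neighbor_cards) // max(neighbor_cards)
    return min(card_y, bound)
```

For example, a latent node with three binary neighbors has bound (2·2·2)/2 = 4, so a cardinality of 5 would be reduced to 4 while a cardinality of 3 would be left unchanged.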
Suppose Y is a latent node in an LTM. If the cardinality of Y satisfies Equality (3.3),
we say that Y is saturated. In this case, Y is also said to subsume all its neighbors except
the one with the largest cardinality. Wang et al. [162] identify another regularity condition
related to saturated latent nodes. This condition is stated below.
Condition 2. (Non-Redundant Variables). In an LTM, there do not exist any two adja-
cent latent nodes Y1 and Y2, such that both Y1 and Y2 are saturated and one of them is
subsumed by the other.
Suppose Y1 and Y2 are two adjacent nodes that violate the above condition in an
LTM. If Y2 subsumes Y1, then Y1 is a redundant variable. The model can be regularized by
removing Y1 and connecting the neighbors of Y1 (except Y2) to Y2.
By definition, regular models must have a higher BIC score than irregular models.
Therefore, if we want to find the model with the highest BIC score, we can restrict the
search space to include only regular models. Note that given a set of manifest variables,
there are only a finite number of regular models in this search space [172].
3.4 Parameter Estimation
Statistical analysis involves learning an LTM from given data. If the model structure
is given, only the model parameters need to be estimated. If the model structure is
unknown, structural learning is needed as well. In this section, we assume that the model
structure is given and discuss parameter estimation. We discuss structural learning in
the next section.
Let D = {d1, . . . ,dN} be a collection of N data samples, where di denotes the values
of the manifest variables in the i-th sample. Also, suppose the model structure m is given.
To learn an LTM M from D, we need to find the maximum likelihood estimate (MLE) θ∗
of the model parameters θ. The learning result can then be given as M = (m,θ∗).
The EM algorithm [40] can be used to compute the MLE θ∗. It is usually used for mod-
els with latent variables. The algorithm starts with an initial estimate, θ(0). It improves
the estimate iteratively. Use θ(t) to denote the estimate after t iterations. The algorithm
iterates until an iteration fails to improve the model likelihood by a certain threshold δ.
In other words, it stops after the t-th iteration if P (D|m,θ(t))− P (D|m,θ(t−1)) ≤ δ. The
algorithm then returns θ∗ = θ(t).
There are two steps in each iteration, namely the E-step and the M-step. Suppose the
parameter estimate θ(t−1) is obtained after t − 1 iterations. In the E-step, we compute,
for each latent node Y and its parent Π(Y ), the distributions P (y, π(Y )|dk,θ(t−1)) and
P (y|dk,θ(t−1)) for each sample dk. In the M-step, we compute the MLE θ(t) based on
the distributions obtained in the E-step. For each of the manifest or latent nodes V , the
MLE of the parameters is given by
P(v | π(V), θ(t)) ∝ ∑_{k=1}^{N} P(v, π(V) | d_k, θ(t−1)),
where the ∝ symbol indicates that the exact values of the probability can be obtained
after normalization.
During an iteration, the E-step is also known as the data completion step. It can be
considered as completing the data by computing the values of the latent variables based
on the observed data and the current parameter estimates. The M-step then finds the
MLE based on the completed data. It is done as if all the variables are observed.
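As a concrete illustration, the E-step and M-step above can be sketched for the simplest LTM, an LCM with a single latent root and binary manifest variables. This is a minimal sketch under stated assumptions (the function name `em_lcm` and the restriction to binary manifest variables are ours, not from the thesis):

```python
import numpy as np

def em_lcm(data, n_states=2, n_iter=50, seed=0):
    """EM for a latent class model: one latent root Y with `n_states` states
    and binary manifest children X_1..X_M (the columns of `data`)."""
    rng = np.random.default_rng(seed)
    N, M = data.shape
    # theta^(0): random values, normalized where a distribution is needed.
    py = rng.uniform(size=n_states)
    py /= py.sum()
    px = rng.uniform(low=0.05, high=0.95, size=(M, n_states))  # P(X_j=1 | Y=k)
    loglik = -np.inf
    for _ in range(n_iter):
        # E-step: posterior P(Y = k | d_i) for every sample d_i.
        log_cond = data @ np.log(px) + (1.0 - data) @ np.log1p(-px)   # (N, K)
        log_joint = log_cond + np.log(py)
        shift = log_joint.max(axis=1, keepdims=True)
        w = np.exp(log_joint - shift)
        loglik = float((shift.ravel() + np.log(w.sum(axis=1))).sum())
        post = w / w.sum(axis=1, keepdims=True)
        # M-step: MLE as if the "completed" data were fully observed.
        py = post.mean(axis=0)
        px = np.clip((data.T @ post) / post.sum(axis=0), 1e-6, 1 - 1e-6)
    return py, px, loglik
```

Each iteration is guaranteed not to decrease the likelihood, which is what the stopping criterion above relies on.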
A few more details about the EM algorithm are worth noting. First, the initial values
θ(0) of the parameters have to be chosen. For P(v|π(V), θ(0)), the probability values are
randomly generated from a uniform distribution over the interval (0, 1]. The distributions
P(v|π(V), θ(0)) are then normalized so that they sum to 1.
Second, given the observed data, we need to compute probabilities in the E-step. This
computation is known as inference for probabilistic graphical models. Inference in an
LTM can be done using the clique tree propagation as in a discrete Bayesian network.
Readers are referred to other sources [e.g. 36, 38] for details.
3.5 Learning Algorithms
More often than not, the model structure is not known beforehand. It needs to be learnt
from given data.
For discrete Bayesian networks, two main approaches are used for structural learning.
The first approach is score-based [20, 30, 69]. It requires a score for the evaluation of
model structures. It then attempts to search for the structure that gives the highest score.
Given a base model structure, some operators are used to explore the search space. These
operators usually modify a small part of the base model. They include, for example,
addition, removal, and reversal of edges. A greedy search, also known as a hill-climbing
method, is often used in this approach.
The second approach is constraint-based [29, 150]. It first identifies some constraints
on the model structure based on the data. For example, conditional independencies
between variables can be tested against the data using statistical or information theoretic
measures. This approach then finds a model structure that is consistent with the identified
constraints.
The above two approaches are also used for learning the structures of LTMs. In
addition, a third approach has been used. It is based on variable clustering. This approach
is applicable to LTMs but not to general BNs because it exploits some characteristics
of LTMs.
In the following subsections, we review methods following those three approaches.
3.5.1 Score-Based Methods
The score-based methods for LTMs have much in common. They are based on hill-
climbing and use similar search operators. They are also based on the BIC score for model
selection. However, there are some differences between them that affect their efficiency
and effectiveness. A main difference is in how they divide the use of search operators into
different phases. In addition, the methods may adapt the BIC score differently for
model evaluation.
We begin this subsection with the description of a brute-force search. It serves to
illustrate the principles behind other score-based methods. After that, we review some
score-based methods proposed in the literature.
Brute-Force Search
The brute-force search we describe here is a hill-climbing method. It starts with an initial
model and then iteratively improves this model in each search step. The initial model
m(0) is the simplest LTM over the given manifest variables. Specifically, it is a 2-class
LCM. The root node is a latent variable with two possible values. The manifest variables
are connected as children to the root node.
Figure 3.3: Examples of applying the node introduction, node deletion, and node relocation operators. Introducing Y3 to mediate between Y1 and the pair X5 and X6 in m1 gives m2. Relocating X3 from Y2 to Y3 in m2 gives m3. In reverse, relocating X3 from Y3 to Y2 in m3 gives m2. Deleting Y3 in m2 gives m1.

Suppose a model m(j−1) is obtained after j − 1 iterations. In the j-th iteration, the
algorithm uses some search operators to generate candidate models by modifying the
base model m(j−1). The BIC score is then computed for each candidate model. Use m′ to
denote the candidate model with the highest BIC score. If m′ has a higher BIC score than
m(j−1), m′ is used as the new base model m(j) and the algorithm proceeds to the (j+1)-th
iteration. Otherwise, the algorithm terminates and returns m∗ = m(j−1) (together with
the MLE of the parameters).
When we learn the structure of an LTM, we need to determine the number of latent
variables, their cardinalities, and the connections between all variables. Search operators
are needed to modify these aspects of the structure to effectively explore the model space.
In the brute-force search, five search operators are used. They are state introduction,
state deletion, node introduction, node deletion, and node relocation.
Given an LTM and a latent variable in the model, the state introduction (SI) operator
creates a new model by adding a state to the domain of the variable. The state deletion
(SD) operator does the opposite. Applying SI on a model m results in another model that
includes m. Applying SD on a model m results in another model that is included by m.
Node introduction (NI) involves one latent node Y and two of its neighbors. It creates
a new model by introducing a new latent node Ynew to mediate between Y and the two
neighbors. The cardinality of Ynew is set to be that of Y . For example, in the model m1
of Figure 3.3, introducing a new latent node Y3 to mediate between Y1 and its neighbors X5 and
X6 results in m2. Applying NI on a model m results in another model that includes m.
A node deletion (ND) operator is the opposite of NI. It involves two neighboring latent
nodes Y and Ydelete. It creates a new model by deleting Ydelete and connecting all neighbors
of Ydelete (other than Y ) to Y . We refer to Y as the anchor variable of the deletion and
say that Ydelete is deleted with respect to Y . For example, in the model m2 of Figure 3.3,
deleting Y3 with respect to Y1 leads us back to the model m1. Applying ND on a model m
results in another model that is included by m if the deleted node has at least as many
states as the anchor node.
A node relocation (NR) involves a node V, one of its neighboring latent nodes Yorigin, and another latent node Ydest. The node V can be a latent node or a manifest node.
NR operator creates a new model by relocating V to Ydest. In other words, it removes the
edge between V and Yorigin and adds an edge between V and Ydest. For example, in m2
of Figure 3.3, relocating X3 from Y2 to Y3 results in m3.
Note that for the sake of computational efficiency, this brute-force search does not
consider introducing a new node to mediate Y and more than two of its neighbors. This
restriction can be compensated by considering a restricted version of node relocation after
a successful node introduction. Suppose Ynew is introduced to mediate between Y and its
two neighbors. The restricted version of NR relocates one of the neighbors of Y (other
than Ynew) to Ynew.
In principle, the above brute-force search should be able to find the LTM with the
highest BIC on some given data. However, the search has two problems. First, it may be
stuck at local maxima. Second, it is inefficient. We show how these problems have been
addressed by other methods below.
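The overall control flow of such a hill-climbing search can be sketched generically. Here `operators` and `score` are hypothetical stand-ins for the five LTM search operators and the BIC score; this is an illustrative skeleton, not the thesis's implementation:

```python
def hill_climb(initial_model, operators, score):
    """Greedy search: in each step, apply every operator to the base model,
    keep the best-scoring candidate, and stop when no candidate improves
    on the base model."""
    best = initial_model
    best_score = score(best)
    while True:
        candidates = [m for op in operators for m in op(best)]
        if not candidates:
            return best
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score <= best_score:
            return best  # no candidate beats the base model: terminate
        best, best_score = top, top_score
```

With a toy "model" (an integer) and a concave score, the skeleton climbs to the unique maximum, which is exactly the local-maximum behavior the text discusses.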
Double Hill-Climbing Algorithm
When Zhang [171, 172] reported his studies on LTMs, which were called HLCMs at that
time, he also devised a structural learning algorithm for them. The algorithm is called
the double hill-climbing (DHC) algorithm by Zhang and Kočka [173]. Like the brute-force
search, the DHC algorithm is an iterative method.
The algorithm uses the BIC score for model selection. It hill-climbs in two different
phases in each iteration. In the first phase, it fixes the cardinalities of the latent variables
and searches for the best model structure. It considers candidates generated by the NI,
ND, and NR operators during this phase. The best candidate model is passed to the
second phase. Note that only one operation has been used in the first phase to change
the original model to the best candidate model.
In the second phase, the algorithm searches for the optimal cardinalities for the latent
variables. It starts with the minimum cardinalities for all latent variables in the base
model. It then hill-climbs using the SI operator until it cannot find a better model. More
than one SI operations can be used in this phase. The best model is passed to the next
iteration. The whole algorithm stops when both phases fail to find a better model.
Compared with the brute-force search, the DHC algorithm considers SI in a separate
phase. A possible reason for this is that the candidate models generated by SI and NI
cannot be compared directly using the BIC score. This issue is addressed in a later
development, which is described next.
Operation Granularity
Some search operators may increase the complexity of the current model much more than
other search operators. This issue is known as operation granularity [26, 27]. As an ex-
ample, consider a 2-class LCM with 100 binary manifest variables. NI would introduce
2 additional parameters in this model, but SI would introduce 101 additional parameters.
This illustrates that candidate models resulting from SI usually have many more
parameters than those resulting from NI.
If we use BIC to evaluate candidate models given by the search operators, those
having a much larger increase in complexity are usually preferred. This might lead to
local maxima.
Zhang and Kočka [173] propose a cost-effective principle to address this issue. Let m
be the base model and m′ be a candidate model. They define the improvement ratio of
m′ over m given data D by
IR(m′, m|D) = [BIC(m′|D) − BIC(m|D)] / [d(m′) − d(m)], (3.4)
where d(·) denotes the number of independent parameters in a model. The ratio measures
the unit improvement of m′ over m. It is also related to the likelihood ratio test in a later
work [28]. Among those candidate models with more parameters than the base model, the
cost-effective principle stipulates that the one with the highest improvement ratio should
be chosen.
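The cost-effective selection can be sketched in a few lines, assuming hypothetical `bic` and `dim` scoring functions for candidate models (names ours):

```python
def improvement_ratio(bic_new, bic_base, d_new, d_base):
    # Equation (3.4): BIC gain per additional independent parameter.
    return (bic_new - bic_base) / (d_new - d_base)

def pick_cost_effective(base, candidates, bic, dim):
    # Among candidates with more parameters than the base model,
    # choose the one with the highest improvement ratio.
    larger = [m for m in candidates if dim(m) > dim(base)]
    return max(larger, key=lambda m: improvement_ratio(bic(m), bic(base),
                                                       dim(m), dim(base)))
```

Note how this differs from picking the highest raw BIC: a small, cheap improvement can win over a large improvement bought with many extra parameters.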
Single Hill-Climbing Algorithm
Zhang and Kočka [173] propose another hill-climbing algorithm for LTMs. The algorithm
is called the single hill-climbing (SHC) algorithm. This algorithm is similar to the DHC
algorithm. It is an iterative method and each iteration is divided into two phases. How-
ever, the search operators are grouped differently in those two phases. Moreover, both
phases may have multiple search steps. In other words, in each phase, the algorithm
repeatedly improves the model until it cannot find a better model.
The SHC algorithm adopts an expand-and-retract strategy for the search. This is
similar to the greedy equivalence search (GES) [30, 113], proposed for structural learning
of Bayesian networks when all variables are observed. In each iteration, the first phase
tries to improve the model by expansion, while the second tries to improve the model by
retraction.
In the first phase, the NI, SI, and NR operators are used. The algorithm considers
candidate models that can have more parameters than the base model. In fact, the NR operator
does not always result in models with more parameters. Suppose the NR operator has
generated some candidate models that do have more parameters than the base model. If
the best model among them has a higher BIC score than the base model, then that model
is used as the base model for the next search step. Otherwise, the remaining models
generated by the NR operator are compared along with the candidate models generated
by the SI and NI operators. The algorithm follows the cost-effective principle to choose
among these models.
In the second phase, the SD and ND operators are used. They result in models with
fewer parameters. The algorithm repeats in this phase until it cannot find a better model
using those two operators.
Heuristic Single Hill-Climbing Algorithm
In each search step, the SHC algorithm uses fewer operators than the brute-force search.
This means that it has to evaluate fewer models. In this sense, the SHC algorithm is more
efficient. However, to evaluate a model, it still needs to run the EM algorithm. And the
EM algorithm is known to be computationally expensive. Hence, the SHC algorithm is
still not very efficient.
Zhang and Kočka [173] propose an improved algorithm over SHC to address the effi-
ciency issue. The algorithm is called the heuristic single hill-climbing (HSHC) algorithm.
It is inspired by the structural EM [53]. The idea of the structural EM is to complete the
data using the current model, and then evaluate the candidate models using the completed
data. Heuristics based on this idea are proposed for the search operators except the SD
operator. They are used to select the best candidate model for each search operator.
This saves the calls to the EM algorithms. However, the candidate models generated by
different operators are evaluated as in SHC so that the evaluation can be more accurate.
Node Relocation Operator
The NR operator used in DHC, SHC, and HSHC is actually slightly different from the
one we describe in the brute-force search. The one used in those algorithms is known as
a restricted version of NR. Consider the relocation of a node Y from a neighboring latent
node Yorigin. In the restricted version of NR, Y can be moved to only those latent nodes
neighboring Yorigin. In contrast, Y can be moved to any latent nodes in the unrestricted
version of NR.
The two versions of NR are compared by Chen [24]. Using the restricted version is
found to be faster. However, it is more likely to get trapped in local maxima. Therefore,
it is suggested to use the unrestricted version for node relocation. In this thesis, the NR
operator refers to the unrestricted version unless otherwise stated.
Restricted Likelihood
HSHC uses heuristics to speed up model evaluation. On the other hand, Chen et al. [26]
propose another way to do this. They use the so-called restricted likelihood, which is
explained below.
Consider the current model m after a number of search steps. A candidate model m′
can be obtained from m by applying a search operator. Very often, the search operator
modifies only a small part of m, so that m and m′ share a large number of parameters.
For example, consider the model m2 in Figure 3.3. It is obtained from model m1 using
the NI operator. Models m1 and m2 share many parameters, such as P(x1|y2), P(x2|y2), P(x3|y2), P(y2|y1), and P(x4|y1). On the other hand, some parameters are not shared
by m and m′. In this example, parameters P (x5|y1) and P (x6|y1) are specific to m1,
while parameters P (y3|y1), P (x5|y3) and P (x6|y3) are specific to m2. The parameters θ′
of m′ can be divided into two groups. They can be written as θ′ = (θ′1,θ′2). The first
component θ′1 consists of parameters shared with m, whereas the second one θ′2 consists
of parameters specific to m′. Similarly, the parameters of m can be written as θ = (θ1,θ2)
with respect to m′.
Suppose we have computed the MLE θ∗ = (θ∗1, θ∗2) of the parameters of m. Parameters
θ∗1 can be used as estimates for the shared parameters of m′. Consequently, we can obtain
a likelihood function P (D|m′,θ∗1,θ′2) that depends only on the unshared parameters θ′2.
This function is referred to as the restricted likelihood function of m′.
The BIC score requires the maximum log-likelihood of m′. Instead of computing it over
all parameters of m′, we can approximate it by maximizing the restricted log-likelihood
over only the subset of parameters θ′2. This results in an approximate score given by
BICRL(m′|D) = max_{θ′2} log P(D|m′, θ∗1, θ′2) − (d(m′)/2) log N. (3.5)
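The penalty term in (3.5) has the usual BIC form. As a trivial illustration of computing such a penalized score (the function name is ours):

```python
import math

def bic_score(max_loglik, n_params, n_samples):
    # Penalized score of the form used in Equation (3.5):
    # maximized log-likelihood minus (d/2) * log N.
    return max_loglik - n_params / 2.0 * math.log(n_samples)
```

The only difference between BIC and BICRL is which parameters the log-likelihood is maximized over; the penalty d(m′)/2 · log N is the same in both.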
The advantage of using BICRL is that it allows a more efficient implementation of
EM. There are two reasons. First, it involves a maximization of fewer parameters in
the M-step. This also means fewer computations in the E-step, since only distributions
relevant to θ′2 need to be computed. Second, the E-step can exploit the fixed parameters
θ∗1 to allow sharing of computation among all iterations of EM. Specifically, the E-step
requires inference on an LTM, which in turn requires passing messages in the clique tree.
As the parameters θ∗1 do not change in these iterations, the messages that depend on
only these parameters remain the same in all iterations. Therefore, some messages can be
cached for each data case. They can then be shared for the inference used in subsequent
E-steps.
Chen et al. [26] propose to use both BIC and BICRL to evaluate models. However,
they are used in different situations. The approximation BICRL can be used for quickly
evaluating candidate models generated by the same search operator. On the other hand,
the real BIC can be used for accurately evaluating the candidate model with the highest
BICRL for each of the search operators. This improves the speed for the evaluation of
models within search operators, but maintains the accuracy for the evaluation of models
across search operators.
EAST
In addition to the restricted likelihood, Chen et al. [26] propose another modification to
the HSHC. They adopt a grow-restructure-thin strategy for the search. They divide the
operators into three phases. In the expansion phase, the SI and NI operators are used.
In the adjustment phase, the NR operator is used. In the simplification phase, the SD
and ND operators are used. Due to the names of the three phases, the whole algorithm
is called EAST.
The improvement ratio is used in the expansion phase. However, it is used only for
comparing the best candidate models given by different search operators. The BIC score
is used in the other two phases. To improve the efficiency, the restricted likelihood is also
used to approximate both the BIC score and the improvement ratio.
Figure 3.4: Four possible resulting structures of a quartet test over manifest variables X1–X4. (a) shows a fork. (b), (c), and (d) show three dogbones with different combinations of sibling variables. Directions of the edges are omitted since they cannot be determined from data.
EAST is considered the state-of-the-art method for learning LTMs. It is described
in more detail by Chen et al. [28]. Recently, methods using other approaches have been
proposed for learning LTMs. We describe those methods next.
3.5.2 Constraint-Based Methods
To learn Bayesian networks, conditional independencies are identified as constraints by
constraint-based methods [29, 150]. These constraints can then be used to determine the
connections between variables. They are sufficient when the variables are fixed in a model.
However, variables are not fixed when we learn LTMs. We need to determine the number
of latent variables and their cardinalities. Therefore, other constraints are needed.
For LTMs, some form of sibling test is usually used. Siblings are variables that share
the same parent, and a sibling cluster is the set of children of a common parent.
The sibling test not only allows the connections among variables to be determined, it may
also suggest that a latent variable can be added as the parent of a sibling cluster.
In the following, we review how constraints can be used to learn LTMs.
Quartet-Based Learning
Chen and Zhang [25] explore the possibility of using a quartet test to learn LTMs. Given
four manifest variables, the test determines the best regular LTM on them. There are
four possible structures (Figure 3.4). The structure is called a fork if there is only one
latent variable. It is called a dogbone if there are two latent variables. For the latter case,
there can be three different groupings of the sibling variables.
Chen and Zhang [25] propose a quartet-based method to find all sibling clusters given
some manifest variables. Their method assumes that the quartet test is always correct.
Suppose we want to determine whether two manifest variables X1 and X2 belong to the
same sibling cluster. Their method tries to find two other manifest variables Z1 and Z2,
such that the quartet test over the four manifest variables returns a dogbone in which X1
and X2 are not siblings. Chen and Zhang [25] prove that X1 and X2 belong to the same
sibling cluster if and only if such manifest variables Z1 and Z2 do not exist. Using this
result, their method finds all the sibling clusters by essentially checking every pair of
manifest variables to see whether they belong to the same cluster.
This method has two limitations. First, its analysis relies on the assumption that the
quartet test is always correct. However, this assumption seems unrealistic. Second, the
method is incomplete. It does not suggest how the quartet test can be done. It also stops
short of building an LTM after the sibling clusters are found.
Information Distance
In addition to the quartet test, the so-called information distance has also been used to
determine the sibling clusters.
Consider two discrete variables Vi and Vj. Denote the joint probability matrix between
Vi and Vj by Jij. It is defined as Jij = (p^{ij}_{ab}) ∈ R^{|Vi|×|Vj|}, where p^{ij}_{ab} = P(Vi = a, Vj = b).
Also, denote the marginal probability matrix of Vi by Mi. It is defined as a diagonal
matrix Mi ∈ R^{|Vi|×|Vi|}, with diagonal entries p^{i}_{aa} = P(Vi = a).
The information distance between Vi and Vj is defined by

dij = − log [ |det Jij| / √(det Mi · det Mj) ].
A nice property of the information distance is that it is additive [93]. Specifically,
consider two variables Vk and Vl in an LTM. Denote the path between them by path(k, l).
The information distance dkl is given by the sum of the distances along the path, that is,
dkl = ∑_{(i,j) ∈ path(k,l)} dij.
Due to the additivity property, Choi et al. [32] show that the information distance can
be used to find out some parent-child and sibling relationships among variables. For any
three variables Vi, Vj, and Vk, define Φijk = dik − djk to be the difference between the
information distances dik and djk. It has been shown that for any pair of variables Vi and
Vj in an LTM:
1. Φijk = dij for all Vk ≠ Vi, Vj if and only if Vi is a leaf node and Vj is its parent;
similarly, Φijk = −dij for all Vk ≠ Vi, Vj if and only if Vj is a leaf node and Vi is its
parent; and
2. −dij < Φijk = Φijk′ < dij for all Vk, Vk′ ≠ Vi, Vj if and only if both Vi and Vj are leaf
nodes and they belong to the same sibling cluster.
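The information distance itself is easy to compute from a joint probability table. Below is a minimal sketch assuming exact probabilities (in practice they are estimated from data) and equal cardinalities so that Jij is square; the function name is ours:

```python
import numpy as np

def information_distance(joint):
    # d_ij = -log( |det J_ij| / sqrt(det M_i * det M_j) ), where M_i, M_j are
    # the diagonal marginal probability matrices. Assumes V_i and V_j have
    # the same cardinality, so that the joint matrix J_ij is square.
    joint = np.asarray(joint, dtype=float)
    det_mi = np.prod(joint.sum(axis=1))   # determinant of diagonal M_i
    det_mj = np.prod(joint.sum(axis=0))   # determinant of diagonal M_j
    return -np.log(abs(np.linalg.det(joint)) / np.sqrt(det_mi * det_mj))
```

Two sanity checks follow from the definition: a perfectly correlated pair (a diagonal joint matrix) has distance 0, while weaker dependence gives a strictly positive distance.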
Recursive Grouping and Chow-Liu Recursive Grouping
Choi et al. [32] propose two algorithms to learn LTMs by using the above two properties
based on information distance to identify relationships among variables. The first algo-
rithm is called Recursive Grouping (RG). As its name suggests, it is an iterative method.
It maintains a working set S of leaf variables through iterations.
The method starts with adding all manifest variables into S. The information distance
is used to identify parent-child relationships and sibling relationships among the variables
in S. The variables are connected based on the identified relationships. A new working
set S ′ is then constructed for the next iteration.
In each iteration, variables in S are partitioned into sibling clusters along with their
identified parents. The new working set S ′ is constructed according to the three possible
cases for the sibling clusters. If a sibling cluster contains only one variable, the variable is
added to S′. If a sibling cluster contains an identified parent, the parent is added to S′.
If a sibling cluster has no identified parent, a new latent variable is added as the parent
of the sibling cluster, and the new variable is added to S′.
When the method moves to the next iteration with S ′, it has to compute information
distances for the variables added in the previous iteration. Choi et al. [32] show that these
distances can be inferred from the existing distances. The method then proceeds as in the
previous iteration. It stops when the working set has no more than two variables at the
start of an iteration. If the working set has two variables, they are connected by an edge
with an arbitrary direction, so that a tree is formed.
The second algorithm proposed by Choi et al. [32] is called Chow-Liu Recursive Group-
ing (CLRG). It is based on the well-known Chow-Liu algorithm [33] for constructing tree-
structured Bayesian networks. The algorithm starts with building a minimum spanning
tree over all manifest variables based on the information distances. Let I be the set of
internal nodes of the tree. For each node V in I, let N be a set containing V and its
neighbors. A subtree is built on N using the RG algorithm. The nodes in N in the main
tree are then replaced by the subtree. The algorithm repeats until all nodes in I have been
operated on.
The two proposed algorithms have been proved to be structurally consistent. This
means the model structure can be correctly recovered, provided that the number of sam-
ples is large enough. The RG algorithm is faster than CLRG when there are few latent
variables, and vice versa otherwise.
While strong theoretical guarantees are given for the algorithms, they rely on two
assumptions that may not be realistic. First, they assume that information distances between
variables are accurate. However, in reality these distances have to be estimated from finite
samples. The estimated distances may lower the accuracy of the identified relationships
between variables. Second, they assume that the data is sampled from a tree model. If the samples
are generated from a general model, there is no guarantee on the performance of the
models resulting from these algorithms.
The algorithms have another significant limitation. They require all variables, including
both latent and manifest variables, to have the same cardinality. This may not be a
problem if the resulting LTMs are used for density approximation or dimension reduction.
However, this may be less desirable if the latent variables are used for clustering, since we
may want to have different numbers of clusters for different partitions.
Note that the above two algorithms can also be used when all variables are Gaussian.
In addition, they allow manifest variables to be internal nodes in the resulting models.
Therefore, they can be used for learning models other than LTMs.
Spectral Recursive Grouping
Anandkumar et al. [3] propose a spectral quartet test for identifying sibling relationships
among variables. While their test can be applied to some other models, we focus on
LTMs with discrete variables here. Their method assumes that all manifest variables
have d states, whereas all latent variables have k ≤ d states.
Consider any four variables in an LTM. Suppose we sum out all other variables in
the model. The four variables can induce one of the four subtrees in Figure 3.4. Given
the four variables, the spectral quartet test recovers the induced subtree based on some
properties of the covariance matrices between variables. The test is conducted as follows.
Let Σij denote the covariance between variables Vi and Vj. Also, let σs(M) denote the
s-th largest eigenvalue of a matrix M, and detk(M) = ∏_{s=1}^{k} σs(M) denote the product of
the k largest eigenvalues of M . Under some mild conditions, Anandkumar et al. [3] show
that {Vi, Vj} and {Vi′ , Vj′} are two pairs of siblings in the induced subtree if and only if
detk(Σij)detk(Σi′j′) > detk(Σij′)detk(Σi′j). (3.6)
Intuitively, the inequality states that the correlations between sibling variables should be
larger than those between non-sibling variables.
The proposed test determines the induced subtree by checking whether there is any
combination of siblings that can fulfill the above inequality. If so, it returns one of the
dogbones in Figure 3.4. Otherwise, it returns a fork. Note that while a dogbone can
ascertain the sibling relationships, a fork cannot.
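The pairing check in inequality (3.6) can be sketched directly. The sketch below uses singular values to compute detk, omits the confidence adjustment ∆ij discussed next in the text, and the helper names are ours:

```python
import numpy as np

def det_k(m, k):
    # Product of the k largest singular values of m (np.linalg.svd
    # returns singular values in descending order).
    return float(np.prod(np.linalg.svd(m, compute_uv=False)[:k]))

def _cov(cov, x, y):
    # Look up a covariance matrix regardless of key order; det_k is
    # unchanged by transposition, so orientation does not matter here.
    return cov[(x, y)] if (x, y) in cov else cov[(y, x)]

def quartet_pairs(cov, k):
    """Check each of the three pairings of variables 1..4 against
    inequality (3.6); return the winning sibling pairs, or None (a fork)."""
    for (a, b), (c, d) in [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]:
        lhs = det_k(_cov(cov, a, b), k) * det_k(_cov(cov, c, d), k)
        alt1 = det_k(_cov(cov, a, c), k) * det_k(_cov(cov, b, d), k)
        alt2 = det_k(_cov(cov, a, d), k) * det_k(_cov(cov, b, c), k)
        if lhs > alt1 and lhs > alt2:
            return {(a, b), (c, d)}  # a dogbone with these sibling pairs
    return None  # no pairing dominates: a fork
```

When within-pair covariances dominate cross-pair ones the test recovers the dogbone; when all six covariance matrices are equal, no pairing strictly dominates and a fork is reported.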
In practice, the covariance matrices Σij are unknown and have to be estimated from
data. To account for the error in estimation, a confidence parameter ∆ij for each pair of
variables Vi and Vj is used. The eigenvalues of the covariance matrices are adjusted by
these confidence parameters to allow a larger margin of error in estimation.
Anandkumar et al. [3] propose a learning algorithm named Spectral Recursive
Grouping (SRG). The algorithm is based on the RG algorithm [32]. Instead of using information
distance, SRG uses the spectral quartet test to identify parent-child and sibling relation-
ships among variables.
SRG relaxes the assumption of RG to allow the latent variables and manifest variables
to have different cardinalities. However, it still assumes the same cardinality for all latent
variables. Similar to the case for RG, this may limit the effectiveness of using SRG for
clustering.
3.5.3 Variable Clustering Methods
Sibling variables in LTMs tend to be more correlated with each other than with other variables.
This intuition leads to some algorithms that build LTMs by grouping similar variables
together. The grouping can be given by variable clustering.
A variable clustering method usually builds an LTM in two main steps. It first clusters
the variables, and then builds an LTM based on the variable clusters. To cluster variables,
we need a similarity measure between variables. The mutual information [34] is often used
for this purpose. Consider two variables Vi and Vj. The mutual information MI(Vi;Vj)
between them is given by
MI(Vi;Vj) = ∑_{Vi,Vj} P(Vi, Vj) log [ P(Vi, Vj) / (P(Vi)P(Vj)) ]. (3.7)
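Equation (3.7) can be computed directly from a joint probability table; a minimal sketch (the function name is ours, and zero-probability cells are skipped by convention):

```python
import numpy as np

def mutual_information(joint):
    # MI(V_i; V_j) in nats, from a joint probability table whose rows are
    # values of V_i and columns are values of V_j.
    joint = np.asarray(joint, dtype=float)
    pi = joint.sum(axis=1, keepdims=True)   # marginal P(V_i)
    pj = joint.sum(axis=0, keepdims=True)   # marginal P(V_j)
    mask = joint > 0                        # 0 * log 0 is taken as 0
    return float((joint[mask] * np.log(joint[mask] / (pi @ pj)[mask])).sum())
```

Independent variables give MI = 0, and two perfectly correlated binary variables give MI = log 2, the two boundary cases of the measure.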
After the variable clusters have been obtained, a latent variable is added as a parent
for each variable cluster. The latent variables are then somehow connected together so
that an LTM can be obtained.
There are four remaining issues in this approach. First, a variable clustering algorithm
is needed. Second, it needs to determine the cardinalities of the latent variables. Third, it
needs to decide how the new latent variables are processed. Finally, the size of a variable
cluster may need to be determined.
In the following, we review how these issues are addressed by different methods.
Hierarchical Clustering Learning / LTAB
Hierarchical clustering produces nested partitions of objects. The nested partitions are
usually represented by a dendrogram (Figure 1.1). When we perform hierarchical clus-
tering on the variables, the dendrogram may look similar to the structure of an LTM.
This suggests that the structure of LTMs can possibly be learned based on hierarchical
clustering on variables.
Wang et al. [162] propose a learning algorithm based on hierarchical clustering. The
algorithm is called hierarchical clustering learning (HCL) by Wang [161]. However, it is
also known as LTAB in the literature, due to its application in density estimation.
The algorithm accepts a parameter for the cardinality of all latent variables. It con-
structs the model structure in the same way as how a dendrogram is built. Specifically,
in each step it finds the pair of variables that have the highest mutual information. Then
a new latent variable, with the given cardinality, is added as the parent of that pair of
variables. The latent variable replaces the pair of variables for consideration in the next
step. The algorithm repeats until only one variable is left for consideration.
Single link is used to estimate the mutual information between the new latent variable
and other variables. In other words, the mutual information between any two variables is
given by the maximum of the mutual information between these variables or any of their
descendant leaf variables.
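The HCL loop described above can be sketched as follows, assuming a precomputed matrix of pairwise mutual information estimates between the manifest variables. The function name and the toy MI values are illustrative only, not part of HCL itself:

```python
import numpy as np

def hcl_structure(mi, names):
    """Sketch of the HCL loop: repeatedly take the pair of active
    variables with the highest mutual information, add a latent parent
    for them, and let the parent replace the pair.  Single link is used
    for the new latent variable: its MI with any other variable is the
    maximum of the MIs of its children with that variable."""
    mi = mi.astype(float).copy()
    names = list(names)
    active = list(range(len(names)))
    edges, next_id = [], 0
    while len(active) > 1:
        i, j = max(((a, b) for k, a in enumerate(active)
                    for b in active[k + 1:]), key=lambda p: mi[p])
        parent = f"h{next_id}"
        next_id += 1
        edges += [(parent, names[i]), (parent, names[j])]
        row = np.maximum(mi[i], mi[j])          # single link
        mi = np.vstack([mi, row])
        mi = np.column_stack([mi, np.append(row, 0.0)])
        names.append(parent)
        active = [a for a in active if a not in (i, j)] + [len(names) - 1]
    return edges

mi = np.array([[0.0, 0.9, 0.1],
               [0.9, 0.0, 0.2],
               [0.1, 0.2, 0.0]])
print(hcl_structure(mi, ["x1", "x2", "x3"]))
```

On this toy matrix, x1 and x2 are merged first under a latent parent, which is then merged with x3, yielding the binary tree the text describes.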
The hierarchical clustering yields a binary tree. However, the resulting tree may be
irregular due to the violation of Conditions 1 and 2. If so, the model is regularized until no
violation is found. Due to regularization, a latent node may have more than two children
in the final model. Its cardinality may also be smaller than the one given as parameter.
BIN-G and BIN-A
Harmeling and Williams [68] propose two closely related algorithms for learning LTMs,
namely BIN-G and BIN-A. Similar to HCL, they are based on hierarchical clustering.
The difference between these two algorithms lies in how the pair of variables with the
highest mutual information is handled in each step. In BIN-G, an LCM is learnt on the pair
of variables. The cardinality of the new latent variable is given by the LCM obtained.
The pairwise mutual information between the new latent variable and the other variables
is computed based on a completion of the data.
In BIN-A, new latent variables are added only after hierarchical clustering has com-
pleted. During hierarchical clustering, the mutual information between two clusters of
variables is estimated by average link. In other words, it is estimated by the average of
the mutual information between any pair of leaf variables from the two different clusters.
After hierarchical clustering, a latent variable is added for each pair of variables clustered
together. Similar to BIN-G, its cardinality is estimated by building an LCM on its child
variables. The cardinalities of the new latent variables are estimated recursively, starting
from the leaf variables.
Unlike HCL, the LTMs produced by these two algorithms do not go through any regularization.
Hence, the final structures are always binary trees. The models are likely to have more
latent variables than those obtained from HCL. On the other hand, BIN-G and BIN-A
allow more flexible cardinalities of latent variables than HCL.
Pyramid
In hierarchical clustering, each cluster has two variables, and hence methods based on
hierarchical clustering yield binary trees. To allow clusters with a variable number of
variables, a criterion is needed to determine the size of clusters.
Wang [161] proposes such a criterion. A test called unidimensionality test determines
whether some variables should belong to the same cluster or different clusters. Given some
variables, the test works by learning an LTM on the variables. The LTM is restricted to
have at most two latent variables. This means the test returns an LTM with either one
or two latent variables. If the LTM has one latent variable, it indicates that the given
variables belong to the same cluster. Otherwise, the variables should belong to different
clusters.
An algorithm called Pyramid is proposed by Wang [161]. It is based on agglomerative
clustering. It uses the unidimensionality test to determine whether the growth of a sibling
cluster should stop. Specifically, the algorithm starts by finding the two variables with the
highest mutual information. The variables are put into a sibling cluster S. The algorithm
grows S by repeatedly finding the next variable that has the highest mutual information
with any of the variables in S. The unidimensionality test is run each time a variable
is being added. Denote the variable being added by V . If the test indicates that the
variables in S ∪ {V } belong to the same cluster, V is added to S. And the algorithm
continues to find the next variable to grow S. Otherwise, V is not added and the growing
of S stops.
After a sibling cluster is found, a latent variable is added as the parent of the con-
stituent variables. The cardinality is given by the LTM obtained during the last unidimen-
sionality test. The latent variable replaces its child variables for the consideration of the
next sibling cluster. The algorithm continues until one variable is left for consideration.
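The cluster-growing step of Pyramid can be sketched as below. Here `unidimensional` is a stand-in for the unidimensionality test; in the actual algorithm it would learn an LTM with at most two latent variables on the given variables. The toy MI table and the size-based stand-in test are illustrative only:

```python
def grow_sibling_cluster(variables, mi, unidimensional):
    """Sketch of one round of Pyramid's cluster growing.  mi(a, b)
    returns an MI estimate; unidimensional(S) stands in for the
    unidimensionality test, which in the real algorithm learns an LTM
    with at most two latent variables on the variables in S."""
    # seed the cluster with the pair that has the highest MI
    a, b = max(((a, b) for i, a in enumerate(variables)
                for b in variables[i + 1:]), key=lambda p: mi(*p))
    cluster, rest = [a, b], [v for v in variables if v not in (a, b)]
    while rest:
        # candidate with the highest MI to any current cluster member
        v = max(rest, key=lambda u: max(mi(u, s) for s in cluster))
        if not unidimensional(cluster + [v]):
            break                    # growth of this sibling cluster stops
        cluster.append(v)
        rest.remove(v)
    return cluster

# toy MI table; the size-based test below is only a stand-in
table = {frozenset(p): v for p, v in [(("a", "b"), 0.9), (("a", "c"), 0.5),
                                      (("a", "d"), 0.1), (("b", "c"), 0.4),
                                      (("b", "d"), 0.1), (("c", "d"), 0.1)]}
mi = lambda x, y: table[frozenset((x, y))]
print(grow_sibling_cluster(["a", "b", "c", "d"], mi, lambda S: len(S) <= 3))
```

With these toy values, the cluster is seeded with a and b, grows to include c, and stops before d, at which point a latent parent would be added over the cluster.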
Unlike those methods based on hierarchical clustering, a latent variable in the LTMs
obtained from Pyramid can have more than two children.
3.5.4 Comparison between Approaches
Among the three approaches, the constraint-based methods provide some theoretic guar-
antees that the other approaches do not. Those methods are particularly useful if it is
important to have correct relationships among the variables. However, the guarantees are
based on the assumption that the underlying distribution is generated from an LTM. They
are questionable when the assumption is not valid. Moreover, there are some restrictions
on the variables that these methods can handle. For example, many of those methods
require all latent variables to have the same cardinality. This may make those methods
unsuitable for clustering.
The score-based methods do not have any theoretical guarantees. On the other hand,
these methods aim to maximize the scores of the resulting LTMs. If they are not trapped
by local maxima, the maximized scores provide an assurance on the quality of the models.
Also, there are no restrictions on what models these methods can learn.
Those methods based on variable clustering do not have any guarantee on their per-
formance at all. The LTMs obtained from them may also have more latent variables than
those LTMs from other approaches. On the other hand, those methods in general have a
significant speed advantage over the score-based methods. However, a recent study shows
that they can be slower than some constraint-based methods [32].
3.6 Applications
Several types of applications of LTMs have been proposed. They include multidimensional
clustering, latent structure discovery, density estimation, and classification. LTMs have
also been applied in various domains, such as traditional Chinese medicine, text mining,
and financial data. We review the different types and domains of applications below.
3.6.1 Multidimensional Clustering
In a mixture model, the discrete latent variable can be used for clustering. Similarly, the
discrete latent variables in an LTM can be used for this purpose. And with multiple latent
variables in an LTM, we can cluster data in multiple ways.
Zhang [172] suggests that an LTM can give multiple clusterings with its multiple latent
variables. However, he does not describe in detail how this could be done at the time. This
idea is further developed through the subsequent work of Chen [24] and Chen et al. [28].
The development results in a theme of clustering called multidimensional clustering.
As an example of using LTMs for multidimensional clustering, consider the LTM shown
in Figure 3.2(a). There are three latent variables in this model. Each one represents a
different way to partition data. And each one partitions the data with different degrees
of dependence on different attributes (manifest variables). Hence, the latent variables
represent clusterings along three different dimensions.
A latent variable can be considered as partitioning data based on its neighboring
variables. Those variables can be latent or manifest. To interpret the meaning of the
clustering given by a latent variable, we need to determine which variables the clustering
depends on. On the one hand, a manifest variable contributes most to the clustering given
by its parent variable. This is due to the fact that, in an LTM, the mutual information
of a manifest variable with its parent variable must be larger than or equal to that with
any other latent variable [34, P. 34]. For example, in Figure 3.2(a), since Y2 is the parent
of X1, it follows that Y2 = arg max_{Y ∈ {Y1, Y2, Y3}} MI(X1; Y).
On the other hand, a latent variable may not be most related to its child manifest
variables. Although the latent variable often partitions data based mainly on its child
variables, this is not necessarily true. For example, in Figure 3.2(a), if MI(Y3;X5) and
MI(Y3;X6) are low, Y3 may have a higher mutual information with the child variables
of other latent variables, especially when MI(Y1;Y2) is high. Therefore, to interpret the
meaning of a clustering given by a latent variable, we need to know how the clustering
relates to the attributes.
Chen et al. [28] propose to use information curves to show the relationships between
a latent variable and the other attributes. An example of them is shown in Figure 3.5.
Each latent variable in an LTM can have a chart with two curves. The first curve is called
pairwise information curve. It shows the pairwise mutual information between the latent
Figure 3.5: Information curves of Y2 for the LTM in Figure 3.2(a). The red curve is the pairwise information curve (left axis). The blue curve is the cumulative information curve (right axis).
variable and each attribute. The attributes are sorted in descending order, so that those
with strongest dependence are shown first. The exact pairwise mutual information can
be computed with the model.
The second curve is called the cumulative information curve. It shows the cumulative
mutual information of the latent variable with the attributes, from the first attribute up
to a given attribute, divided by the mutual information between the latent variable and
all attributes. For example, the second point on the cumulative information curve in
the figure shows the value of MI(Y2; X1, X2) / MI(Y2; X1, . . . , X6). This ratio estimates
how much information about the latent variable can be accounted for by the first few
attributes. If the ratio is high, we can interpret the latent variable using only those
attributes. Since it is intractable to compute the cumulative mutual information exactly,
the computation is done with data sampled from the model.
Using the information curves, we can find a small subset of attributes that a latent
variable is mainly related to. We can then use those attributes to interpret the clusters
given by the latent variable.
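As an illustration of the pairwise information curve, the following sketch estimates the pairwise mutual information of a latent variable (represented here by a completed data column, an assumption made for illustration) with each attribute, and sorts the attributes in descending order. The cumulative curve is omitted, since it requires joint mutual information estimated by sampling from the model:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete
    sequences of equal length."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def pairwise_information_curve(latent, attributes):
    """Attributes sorted by descending MI with the latent variable,
    as on the pairwise information curve."""
    scores = {name: mutual_information(latent, col)
              for name, col in attributes.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 500)                     # stand-in latent column
attrs = {"x1": z.copy(), "x2": rng.integers(0, 2, 500)}
curve = pairwise_information_curve(z, attrs)
print([name for name, _ in curve])              # x1 tracks the latent, so it comes first
```

The attribute that perfectly tracks the latent variable heads the curve, while the unrelated attribute has MI near zero.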
In addition to having clusterings along different dimensions, LTMs allow one to understand
the relationships between different clusterings. Since latent variables are connected
in an LTM, their probabilistic relationships are represented in the model. As a consequence,
we can compute the joint probability or conditional probability of different latent
variables in an LTM. We can then examine the probability to understand the relationships
between clusterings.
Chen et al. [28] give an example of multidimensional clustering on real-world data.
The data was obtained from a survey by ICAC, which is an anti-corruption agency in Hong
Kong. The data has 31 attributes and 1200 samples. Eight latent variables were obtained
from an LTM given by EAST. Some of them are interpreted and found to be meaningful.
The LTM also reveals some interesting relationships between the latent variables.
3.6.2 Latent Structure Discovery
Latent structure discovery can be considered as a complementary analysis to multidimen-
sional clustering. In multidimensional clustering, we are interested in the groupings of
data. In latent structure discovery, we are more interested in the structure of the model.
For example, we may be interested in how manifest variables are grouped into sibling
clusters. This information reveals which attributes may share the same hidden factors.
Zhang et al. [175] perform latent structure discovery on the CoIL Challenge 2000 data [158].
The data set contains information on customers of an insurance company. The data set
they work on contains 42 attributes. Three attributes show the socio-demographic data
of the customers, while the others show the ownership of various insurance products by
the customers.
An LTM was obtained from HSHC. In the model, the attribute for the contribution to one
kind of insurance policy is almost always paired as a sibling variable with the attribute
for the number of policies of that kind. While this is obvious to human beings,
it is interesting to see that machines can do the pairing automatically.
In addition, some related attributes are also grouped under the same ancestor latent
variable. For example, of one latent variable, 8 out of 10 descendant attributes are related
to agriculture products. Of another latent variable, 11 out of 13 descendant attributes
are related to private vehicles. The model indicates that these latent variables are the
common factors for those related attributes.
3.6.3 Density Estimation
Another application of LTMs is in density estimation. In this application, we are given
an original distribution, or some samples obtained from that distribution. We aim to
approximate the distribution using an LTM. Here, the model structure is not that impor-
tant. What is more important is whether the LTM can estimate the original distribution
accurately.
Suppose we use an LTM M to approximate the distribution of another model M0. Note
that M0 is not an LTM in general. There are two usual ways to measure the accuracy
of the approximate distribution. First, it can be measured in terms of the likelihood of
the model M on some testing data generated from M0. Let D = {d1, . . . , dN} be N samples
of testing data. The likelihood of M is given by

P(D|M) = ∏_{i=1}^{N} P(di | M).
Second, the accuracy can be measured in terms of the KL divergence [34]. Let X be
the variables of M0. Also, let PM(X) and PM0(X) denote the distributions of X given
by models M and M0, respectively. The KL divergence between those two distributions
is
D[PM0(X) || PM(X)] = ∑_X PM0(X) log ( PM0(X) / PM(X) ).
The KL divergence achieves its minimum of zero if the approximate distribution matches
the original distribution exactly.
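For distributions over a small explicit support, the KL divergence above can be computed directly. The following sketch is illustrative only, since for an LTM the sum over X is exponential in the number of variables, which is one reason held-out likelihood is the other common measure:

```python
import numpy as np

def kl_divergence(p0, p):
    """D[P0 || P] = sum_x P0(x) log(P0(x) / P(x)) for two distributions
    given explicitly over the same support."""
    p0, p = np.asarray(p0, float), np.asarray(p, float)
    mask = p0 > 0                   # 0 * log(0 / q) = 0 by convention
    return float(np.sum(p0[mask] * np.log(p0[mask] / p[mask])))

p0 = [0.5, 0.3, 0.2]
print(kl_divergence(p0, p0))        # 0.0: the minimum, attained at an exact match
print(kl_divergence(p0, [0.4, 0.4, 0.2]))   # strictly positive for any mismatch
```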
Wang et al. [162] suggest the use of LTMs for approximate inference. This application
is a kind of density estimation. Their motivation is that probabilistic inference may be
intractable for some complex Bayesian networks. On the other hand, inference on LTMs
is efficient due to the tree structure. Hence, if an LTM can approximate the distribution
of the original BN accurately, it can be used for inference with greatly improved efficiency.
To carry out this idea, an algorithm called LTAB is proposed. Suppose we want
to do inference on a Bayesian network M0. The algorithm first generates samples from
M0. It then learns an LTM from the samples using the HCL method. After that, the
parameters of the LTM are estimated using EM. Finally, the resulting LTM can be used
for approximate inference.
Wang et al. [162] compare LTAB with a standard approximate inference scheme called
loopy belief propagation (LBP) [116]. The experiments were conducted on 8 BNs. Given
the same amount of inference time, LTAB consistently outperforms LBP in approximation
accuracy in terms of KL divergence. To achieve the same level of accuracy, LTAB can be
faster than LBP by one to three orders of magnitude.
There is one drawback of LTAB. Although the inference time can be fast, the training
time for LTMs can be slow. Therefore, this method can be considered as a tradeoff
between training time and inference time.
3.6.4 Classification
Classification aims to predict the class label of an instance given its attributes. LTMs
have also found use in this area.
Hierarchical Naïve Bayes Models
Zhang et al. [174] propose a class of models called hierarchical naïve Bayes (HNB) models
for classification. An HNB model is almost identical to an LTM. It is different from an
LTM in that its root node is used to represent the class variable. To learn HNB models,
an algorithm based on DHC can be used. Experiments show that HNB models have
comparable classification accuracy to naïve Bayes models [42]. But more importantly,
the HNB models can also reveal some interesting latent variables. This is like performing
classification with latent structure discovery at the same time.
The above method uses the BIC score to guide the model search. To improve the
classification accuracy of the resulting models, Langseth and Nielsen [94] propose to use
the classification accuracy for model selection during the search. The classification accu-
racy is estimated by cross-validation, so the testing data need not be used. The search
algorithm is also modified to improve the computational complexity.
HNB models obtained from this modified algorithm were compared with 7 other clas-
sifiers on 22 data sets. The experiments show that HNB models are better than other
classifiers on 12 data sets. And they are not significantly poorer than the winners on
the other 8 data sets. In addition to the good classification performance, some HNB models
obtained from this algorithm are also shown to have interesting structures.
Latent Tree Classifiers
Wang et al. [163] propose a different approach to using LTMs for classification. The
models used are called latent tree classifiers (LTC). They can be considered as mixtures
of LTMs, where the class variable is used as the mixture variable.
Specifically, in an LTC an LTM is built on the data of every class label. Let X be the
attributes in a data set. Also, let Dc denote those samples that belong to class c in the
data set. For each c, an LTM Mc is built on Dc. Model Mc can be used to estimate
the distribution P(X|c). Given a sample d, the classification c can then be obtained by
c = arg max_c P(X = d | c) P(c), where P(X = d | c) is given by Mc.
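The decision rule can be sketched as below, with simple callables standing in for the per-class models Mc; the stand-in models, class names, and probabilities are hypothetical:

```python
def ltc_classify(d, class_models, class_prior):
    """Sketch of the LTC decision rule c = argmax_c P(X = d | c) P(c).
    Each entry of class_models stands in for an LTM M_c learnt on the
    samples of class c; here they are plain callables returning the
    likelihood P(X = d | c)."""
    return max(class_prior, key=lambda c: class_models[c](d) * class_prior[c])

# hypothetical stand-in models for two classes over one binary attribute
models = {"pos": lambda d: 0.8 if d[0] == 1 else 0.2,
          "neg": lambda d: 0.1 if d[0] == 1 else 0.9}
prior = {"pos": 0.4, "neg": 0.6}
print(ltc_classify([1], models, prior))   # "pos": 0.8 * 0.4 > 0.1 * 0.6
```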
To learn an LTM for a class, the EAST algorithm is used. However, the AIC score [2]
is used instead of the BIC score. Here, the model Mc can be considered as an estimate
of P(X|c). Therefore, our objective is to minimize the difference between the distribution
estimated by Mc and the true distribution. The AIC score is an approximation to the
expected KL divergence of the model from the true distribution. It matches the current
objective better and hence is used. Another reason to use it is that it has been shown
empirically to perform better than the BIC score in a previous study [161].
The LTCs were compared with 4 other methods related to the naïve Bayes models. In
general, they achieve a higher classification accuracy. Similar to the HNB models, some
of the LTCs are also shown to have meaningful structures.
3.6.5 Domains of Applications
The above subsections reviewed four types of applications of LTMs. In this subsection,
we review the different domains of these applications.
Traditional Chinese Medicine
Traditional Chinese medicine (TCM) is an important domain of applications for LTMs.
Multidimensional clustering and latent structure discovery have been used in this domain.
Zhang et al. [176, 177] present a study on a TCM diagnosis called kidney deficiency. LTMs
are used to analyze some data related to this diagnosis. The data set contains 35 syndrome
attributes for 2600 subjects.
An LTM was learnt mainly by the HSHC algorithm. An initial model was obtained,
but it was modified based on domain knowledge to help escape a local maximum. The
HSHC algorithm was run again on the modified model and the resulting model was used
for analysis in their study.
The final LTM exhibits some interesting features. First, some latent variables are
found to have interesting meanings. For example, two latent variables can be interpreted
to mean kidney yang deficiency and kidney yin deficiency, respectively.
Second, the structure shows some reasonable groupings of variables. The latent vari-
able kidney yang deficiency (KYD) has five descendant manifest variables. They represent
five syndromes, namely loose stool, undigested grain in stool, intolerance to cold, cold limbs,
and cold lumbus and back. According to TCM theory, those syndromes may result from a
deficiency of kidney yang. The structure thus matches the TCM theory.
Third, the clusters given by some latent variables can also be interpreted meaningfully.
For example, the latent variable for kidney yang deficiency gives five clusters. Three of them
may mean no KYD, a medium level of KYD, and a severe level of KYD, respectively. The
other two may both mean a light level of KYD; however, different syndromes are observed
in these two clusters.
To sum up, the above analysis shows that LTMs can be used to validate some TCM
theories. Moreover, the clusters given by LTMs can also be used to classify the subjects.
Text Mining
LTMs have also been applied in the domain of text mining [32, 68]. Experiments were
conducted on the 20 Newsgroup data.2 The data were obtained from 16242 postings.
There are 100 binary attributes. Each attribute indicates whether a word appears in a
posting or not.
Harmeling and Williams [68] show an LTM obtained from the BIN-G algorithm. They
show that the LTM groups some manifest variables in a meaningful way. For example,
a subtree contains some words related to the topic “medicine” as leaf variables, such as
doctor, medicine, disease, patients, cancer, studies, aids, health, and insurance. Several other
similar subtrees can be found in the model.
The same data set is also analyzed by Choi et al. [32]. Models were obtained from
the RG and CLRG algorithms. The models are a generalization of LTMs, where mani-
fest variables can also be represented by internal nodes. They are shown to have fewer
latent variables than the one obtained from BIN-G. Their BIC scores are also higher. A
model obtained from a modified version of CLRG has some subtrees that can roughly
be interpreted as topics.
Data in text mining tends to have large numbers of attributes and samples. Therefore,
it may not be feasible to use those score-based methods that are currently available in
this domain. On the other hand, attributes in text mining are usually binary. This allows
the use of those learning algorithms that are restricted to have the same cardinality for
all manifest variables.
Finance
Finance is another domain of application for LTMs. Some monthly stock data were
analyzed by Choi et al. [32]. Each record of the data represents the monthly returns of
84 companies. The samples were collected from 1990 to 2007.
Choi et al. [32] demonstrate the result using a model obtained from a method similar
to CLRG. The model is slightly different from an LTM. Its variables are all continuous
and it has internal manifest nodes. The model has some subtrees of related companies.
For example, one subtree is related to the telecom industry. It contains Verizon, Sprint,
and T-mobile as descendants.
2 http://www.cs.nyu.edu/~roweis/data.html
CHAPTER 4
APPLICATION:ROUNDING IN SPECTRAL CLUSTERING
Spectral clustering [159] is one way to handle clusters of irregular shapes. The idea is to
convert clustering into a graph cut problem. More specifically, one first builds a similarity
graph over the data points using a measure of data similarity as edge weights and then
partitions the graph by cutting some of the edges. Each connected component in the
resulting graph is a cluster. The cut is done so as to simultaneously minimize the cost of
cut and balance the sizes of the resulting clusters [65, 143].
The graph cut problem is NP-hard and is hence relaxed. In the relaxed problem, the
cluster indicator functions are allowed to be real-valued. The solution is given by the
leading eigenvectors of the so-called Laplacian matrix, which is a simple transformation
of the original data similarity matrix. In a post-processing step, a partition of the data
points is obtained from those real-valued eigenvectors. This post-processing step is called
rounding [5, 136].
Although the spectral clustering literature is abundant, there are relatively few pa-
pers on rounding. In general, rounding is considered an open problem. There are three
subproblems: (1) decide how many (and more generally which) leading eigenvectors to
use; (2) determine the number of clusters; and (3) determine the members of each cluster.
Among the three, the first two subproblems are considered much harder.
In this chapter we focus on rounding. The remaining chapter is organized as follows.
In Section 4.1, we review some related work. We also indicate three main differences
between our work and the other work. In Section 4.2 we review the basics of spectral
clustering and point out two key properties of the eigenvectors of the Laplacian matrix
in the ideal case. In Section 4.3 we describe a straightforward method for rounding that
takes advantage of the two key properties. This method is fragile and breaks down as
soon as we move away from the ideal case. In Sections 4.4 and 4.5 we propose a model-
based method for rounding that exploits the same two properties. The method is named
LTM-Rounding. It is evaluated on synthetic data in Section 4.6 and is compared with
other methods in Section 4.7. This chapter concludes in Section 4.8.
4.1 Related Work
Previous rounding methods fall into two groups depending on whether they assume the
number of clusters is given. When the number of clusters is known to be k, rounding is
usually done based on the first k eigenvectors. The data points are projected onto the
subspace spanned by those eigenvectors and then the K-means algorithm is run on that
space to get k clusters [159]. Bach and Jordan [5] approximate the subspace using a space
spanned by k piecewise constant vectors and then run K-means on the latter space. This
turns out to be equivalent to a weighted K-means algorithm on the original subspace.
Zhang and Jordan [178] observe a link between rounding and the orthogonal Procrustes
problem in mathematics and iteratively use an analytical solution for the latter problem
to build a method for rounding. Rebagliati and Verri [134] ask the user to provide a
number K that is larger than k and obtain k clusters based on the first K eigenvectors
using a randomized algorithm that repeatedly calls K-means as a subroutine.
When the number of clusters is not given, one needs to estimate it. A common method
is to manually examine the difference between every two consecutive eigenvalues starting
from the first two. If a big gap appears for the first time between the k-th and (k+ 1)-th
eigenvalues, then one uses k as an estimate of the number of clusters. Zelnik-Manor and
Perona [169] propose an automatic method. The method considers a number of integers.
For each integer k, it tries to rotate the first k eigenvectors so as to align them with the
canonical coordinate system for the eigenspace spanned by those vectors. A cost function
is defined in terms of how well the alignment can be achieved. The k with the lowest cost
is chosen as an estimate for the number of clusters. Xiang and Gong [166] and Zhao et al.
[179] question the assumption that clustering should be based on all the eigenvectors from
a continuous block at the beginning of the eigenvector spectrum. They use heuristics to
choose a collection of eigenvectors which do not necessarily form a continuous block, and
then use Gaussian mixture models to determine the number of clusters and to partition
the data points. Socher et al.
[148] assume the number of leading eigenvectors to use is given. Based on those leading
eigenvectors, they determine the number of clusters and the membership of each cluster
using a non-parametric Bayesian clustering method.
In this chapter, we propose and study a novel model-based approach to rounding. The
method differs from the previous methods in three ways. First, we relax the assumption
that the number of clusters equals the number of eigenvectors that one uses for rounding.
In the ideal case where between-cluster similarity is 0, if one knows the number kt of true
clusters, one can indeed recover the kt clusters from the first kt eigenvectors. However,
this might not be the case in non-ideal cases or when the number of clusters one tries to
obtain is not kt. Our method allows the number of clusters to differ from the number of
eigenvectors. This is conducive to robust performance in non-ideal cases.
Second, we choose a continuous block of leading eigenvectors for rounding just as
Zelnik-Manor and Perona [169]. The difference is that, when deciding the appropriateness
of the first k eigenvectors, Zelnik-Manor and Perona use only information contained in
those eigenvectors, whereas we also use information contained in subsequent eigenvectors.
So our method uses more information and hence the choice is expected to be more robust.
Third, we solve all the three subproblems of rounding and we do so within one class
of models, namely LTMs. In contrast, most previous methods assume that the first
two subproblems are solved and the solutions are equal, and focus only on the third
subproblem. Xiang and Gong [166] and Zhao et al. [179] do consider all three subproblems.
However, they do not solve all the subproblems within one class of models. They first
choose a collection of eigenvectors based on some heuristics and then use Gaussian mixture
models to solve the other two subproblems. Zelnik-Manor and Perona [169] also consider
all three subproblems. However, their method is not model-based and it assumes the
number of clusters equals the number of eigenvectors. An advantage of the model-based
approach is that its performance degrades gracefully as we move away from the ideal case.
4.2 Basics of Spectral Clustering
In this section we review the basics of spectral clustering and point out two properties that
we exploit later.
4.2.1 Similarity Measure and Similarity Graph
Let X = {x1, . . . , xn} be a set of n data points in a Euclidean space Rd. In order to
partition the data, one needs to define a non-negative similarity measure sij for each pair
xi and xj of data points. This can be done in a number of ways. In our work we consider
two measures:
• k-NN similarity measure: sij = 1 if xi is one of the k nearest neighbors of xj , or
vice versa, and sij = 0 otherwise.
• Gaussian similarity measure: sij = exp(−||xi − xj||² / σ²), where σ is a parameter
that controls the width of the neighborhood of each data point.
The matrix S = (sij)i,j=1,...,n is called the similarity matrix.
Given a similarity measure, the data can be represented as a weighted undirected
graph G. In the graph there is a vertex vi representing each data point xi, and there is
an edge between two vertices vi and vj if and only if sij > 0. The value sij is used as the
edge weight and is sometimes denoted as wij. The graph is called the similarity graph
and its adjacency matrix W = (wij)i,j=1,...,n is the same as the similarity matrix S. Note
that the similarity graph G is a complete graph when the Gaussian similarity measure is
used, and it might not be so when the k-NN similarity measure is used.
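The two similarity measures can be sketched as follows; this is a minimal illustration, and in practice one would use a library routine for nearest-neighbour search:

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """s_ij = exp(-||x_i - x_j||^2 / sigma^2); gives a complete graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def knn_similarity(X, k):
    """s_ij = 1 if x_i is among the k nearest neighbours of x_j, or
    vice versa, and 0 otherwise; the graph need not be complete."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.zeros_like(d2)
    for i in range(len(X)):
        S[i, np.argsort(d2[i])[1:k + 1]] = 1.0   # skip the point itself
    return np.maximum(S, S.T)                    # "or vice versa"

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(knn_similarity(X, 1))   # points 0 and 1 are mutual neighbours
```

Note how the k-NN graph on this toy data has no edge between the far-away third point and the first point, whereas the Gaussian measure would assign it a small positive weight.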
4.2.2 Graph Laplacian
In spectral clustering one transforms the similarity matrix to get another matrix called
graph Laplacian matrix. There are a number of Laplacian matrices to choose from [159].
In this chapter, we use the normalized Laplacian matrix Lrw given by
Lrw = I −D−1S, (4.1)
where I ∈ Rn×n is the identity matrix, D = (dij) ∈ Rn×n is the diagonal degree matrix
given by dii = ∑_{j=1}^{n} sij, and D−1 is the inverse of D. The following proposition is
well-known [159].
Proposition 1. The Laplacian matrix Lrw satisfies the following properties:
1. The eigenvalues of Lrw are real, even though Lrw is not symmetric in general.
2. The eigenvalues of Lrw are non-negative and the smallest one is 0.
3. If the similarity graph is connected, then there is only one eigenvalue that equals 0.
4. The vector 1 ∈ Rn that consists of all 1's is an eigenvector for eigenvalue 0.
The eigenvalues of Lrw are arranged in ascending order as 0 = λ1 ≤ λ2 ≤ . . . ≤ λn
and the eigenvectors for the eigenvalues are arranged in the same order as e1, e2, . . . , en.
The eigenvectors at the front of the list are called the leading eigenvectors. Note that an
eigenvector of Lrw is a vector of n real numbers. It can also be viewed as a function over the
data points. As a matter of fact, in the graph cut formulation of spectral clustering [159],
an eigenvector is a cluster indicator function for a cut. Two example eigenvectors are
shown in Figure 4.1.
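A minimal sketch of computing Lrw from Equation 4.1 and extracting the leading eigenvectors follows; np.linalg.eig is used rather than a symmetric solver because Lrw need not be symmetric:

```python
import numpy as np

def normalized_laplacian_rw(S):
    """L_rw = I - D^{-1} S with d_ii = sum_j s_ij (Equation 4.1)."""
    d = S.sum(axis=1)
    return np.eye(len(S)) - S / d[:, None]

def leading_eigenvectors(L, m):
    """The first m eigenpairs of L in ascending order of eigenvalue.
    np.linalg.eig is used since L_rw need not be symmetric."""
    vals, vecs = np.linalg.eig(L)
    order = np.argsort(vals.real)
    return vals.real[order[:m]], vecs.real[:, order[:m]]

# connected triangle graph: one zero eigenvalue, constant eigenvector
S = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
vals, vecs = leading_eigenvectors(normalized_laplacian_rw(S), 1)
print(np.isclose(vals[0], 0.0))
```

On a connected similarity graph, the smallest eigenvalue is 0 and its eigenvector is constant over the data points, matching items 3 and 4 of the proposition.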
4.2.3 The Ideal Case
In theoretical analysis of spectral clustering, reference is often made to the so-called ideal
case. Suppose the data consist of kt true clusters C1, C2, . . . , Ckt . The ideal case refers to
the situation where the similarity graph G has exactly kt connected components, with each
corresponding to a true cluster. This is the same as saying that the in-cluster similarities
are strictly positive while the between-cluster similarities are 0. Assume that the data
points are ordered based on their cluster memberships. Then the Laplacian matrix Lrw has a block diagonal form and can be written as

Lrw = diag(L1, L2, . . . , Lkt),
where each block Lj is the Laplacian matrix for the corresponding connected component
Cj of G. The following proposition is evident.
Proposition 2. In the ideal case, the matrix Lrw has the following two properties:
1. The eigenvalues of Lrw are the union of the eigenvalues of L1, . . . , Lkt.
2. Eigenvectors of Lrw can be obtained from eigenvectors of L1, . . . , Lkt by appropriately
padding them with zeros.
Let ej be an eigenvector of the block Lj for an eigenvalue λ. As a function over data
points, ej is defined only over the component Cj. We can extend ej to get a function
e over all the n data points by letting e be the same as ej over Cj and be 0 elsewhere.
Proposition 2 says that λ is an eigenvalue of Lrw and e is an eigenvector of Lrw. Note
that the support of e is a subset of Cj or, to put it another way, Cj contains the support
of e.
For a subset of data points C, the indicator function 1C of C is a function over all
data points that takes value 1 on the data points in C and 0 elsewhere. It can also be
viewed as an n dimensional vector and is called the indicator vector of C. The following
proposition is a corollary of Propositions 1 and 2.
Proposition 3. In the ideal case, the matrix Lrw has exactly kt eigenvalues that equal 0.
The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1C1, . . . , 1Ckt.
The vectors 1C1, . . . , 1Ckt collectively form the canonical coordinate system of the eigenspace of eigenvalue 0. Create an n × kt matrix Uc using those vectors as column
Figure 4.1: Example eigenvectors: There are five true clusters C1, . . . , C5 in the data set shown in (a). The 10-NN similarity measure is used to produce the ideal case. The matrix Lrw is built accordingly and it consists of five blocks L1, . . . , L5, each corresponding to a true cluster. Two eigenvectors e1 and e6 of Lrw are shown. Each eigenvector is depicted in two diagrams. In the first diagram, the values of the eigenvector are indicated by different colors, with grey meaning 0. In the second diagram, the X-axis indexes the data points and the Y-axis shows the values of the eigenvector. Panels: (a) A data set; (b) Eigenvector e1; (c) Eigenvector e6.
vectors. Each row zi of the matrix corresponds to a data point xi. So we have a mapping
xi → zi that maps the data points from the original space Rd to points in the eigenspace
of eigenvalue 0.
For a row zi of Uc that corresponds to a data point in component Cj, the value at
position j is 1 and the values at all other positions are 0. Therefore, all the data points
in Cj become one point when mapped onto the eigenspace of eigenvalue 0. This fact is of
fundamental importance to spectral clustering.
4.2.4 Spectral Clustering
To partition a collection of n data points {x1, . . . ,xn} into k clusters using a similarity
matrix S, spectral clustering proceeds as follows:
1. Compute the Laplacian matrix Lrw.
2. Compute the k leading eigenvectors e1, . . . , ek of Lrw.
3. Form an n× k matrix U using e1, . . . , ek as columns.
4. For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U .
5. Cluster the points {y1, . . . ,yn} into k clusters using the K-means algorithm.
6. Obtain k clusters of the original data points accordingly.
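The six steps can be sketched in Python as follows (an illustrative sketch using NumPy and SciPy's kmeans2, with SciPy ≥ 1.7 assumed for the seed keyword; the function name is ours):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, k, seed=0):
    """Steps 1-6: Laplacian, k leading eigenvectors, K-means on the rows."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    L = np.eye(len(S)) - S / d[:, None]      # Step 1: Lrw = I - D^{-1} S
    vals, vecs = np.linalg.eig(L)
    order = np.argsort(vals.real)            # eigenvalues in ascending order
    U = vecs.real[:, order[:k]]              # Steps 2-3: n x k matrix U of leading eigenvectors
    _, labels = kmeans2(U, k, minit='++', seed=seed)  # Steps 4-5: K-means on rows y_i
    return labels                            # Step 6: cluster labels for the original points

# Ideal case with two components: points {0, 1} and {2, 3}.
S = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = spectral_clustering(S, 2)
```

In this ideal example the rows of U collapse to one point per component, so K-means recovers the two true clusters.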
Suppose the similarity matrix S satisfies the conditions for the ideal case and k = kt.
The eigenvectors e1, . . . , ek obtained at Step 2 are not necessarily the same as the true-cluster indicator vectors 1C1, . . . , 1Ckt. However, we have

U = UcR (4.2)

for some orthogonal matrix R ∈ R^{k×k}.
The matrix U gives us a mapping from the data space to the eigenspace of eigenvalue
0: xi → yi, where yi is the i-th row of U . The mapping can be broken up into two
steps: First map xi to the i-th row zi of Uc, and then obtain yi by rotating the vector zi using R, that is, yi = ziR. As pointed out earlier, data points in the same true cluster
are mapped into one point at the first step. Hence they are mapped into one point by
the whole mapping xi → yi. This means that, in the ideal case, spectral clustering can
recover the true clusters.
Spectral clustering is expected to also work well in non-ideal cases that are not far
away from the ideal case. A formal justification of this is provided by Ng et al. [118] in
terms of matrix perturbation theory.
4.2.5 Two Properties
To end this section, we point out two properties of the eigenvectors in the ideal case that
we exploit in our work. We know from Proposition 3 that the Laplacian matrix Lrw has
multiple eigenvalues that equal 0 in the ideal case. For simplicity, we assume that the
non-zero eigenvalues of Lrw are distinct. We call the eigenvectors for eigenvalue 0 the
primary eigenvectors and the others the secondary eigenvectors.
Proposition 4. In the ideal case, the eigenvectors of Lrw have the following properties:
1. The primary eigenvectors are piecewise constant, with each value identifying one
true cluster or the union of several true clusters.
2. The support of each secondary eigenvector is contained within one of the true clus-
ters.
The support of a vector is defined as the set of elements that have non-zero values.
The first item follows readily from Equation (4.2) and the fact that each column of Uc is
a true-cluster indicator vector. The second item is an obvious corollary of Proposition 2.
The two properties can be illustrated using Figure 4.1. The primary eigenvector e1 has two different values, 0.1 and 0. The value 0.1 identifies cluster C4, while 0 corresponds
to the union of the other clusters. The vector e6 is a secondary eigenvector. Its support
is contained in the true cluster C4.
4.3 A Naive Method for Rounding
In this section we explain how the two properties in Proposition 4 can be used for rounding.
We do so by giving two naive algorithms that are intended only for the ideal case. In
the next two sections, we develop a model-based algorithm that exploits the same two
properties and that works also in non-ideal cases.
4.3.1 Binarization of Eigenvectors
In the original graph cut formulation of spectral clustering [65, 143], an eigenvector has
two different values. It partitions the data points into two clusters. The graph cut problem
is NP-hard and is hence relaxed. In the relaxed problem, an eigenvector can have many
different values (see Figure 4.1(c)).
The first step of our method is to obtain, from each eigenvector ei, two clusters using
a confidence parameter δ that is between 0 and 1. Let eij be the value of ei at the j-th
data point. One of the clusters consists of data points j that satisfy
eij > 0 and eij > δ · max_j eij,

while the other cluster consists of data points j that satisfy

eij < 0 and eij < δ · min_j eij.
The indicator vectors of those two clusters are denoted as e+i and e−i respectively. We refer
to the process of obtaining those vectors from ei as binarization. Applying binarization to
the eigenvectors e1 and e6 of Figure 4.1 results in the binary vectors shown in Figure 4.2.
Note that e−1 is a degenerate binary vector in the sense that it is 0 everywhere. We still
refer to it as a binary vector for convenience. The following proposition follows readily
from Proposition 4.
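The binarization step described above can be sketched as follows (Python; illustrative, with the function name and example vector ours):

```python
import numpy as np

def binarize(e, delta=0.1):
    """Split an eigenvector e into indicator vectors e_plus and e_minus.

    e_plus marks points j with e_j > 0 and e_j > delta * max_j e_j;
    e_minus marks points j with e_j < 0 and e_j < delta * min_j e_j.
    """
    e = np.asarray(e, dtype=float)
    e_plus = ((e > 0) & (e > delta * e.max())).astype(int)
    e_minus = ((e < 0) & (e < delta * e.min())).astype(int)
    return e_plus, e_minus

e = np.array([0.12, 0.10, 0.003, -0.08, -0.001])
e_plus, e_minus = binarize(e)
# e_plus = [1, 1, 0, 0, 0]: only values above 0.1 * 0.12 survive
# e_minus = [0, 0, 0, 1, 0]: only values below 0.1 * (-0.08) survive
```

Values close to zero on either side are dropped, which is what makes the resulting vectors robust indicators in near-ideal cases.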
Figure 4.2: Binary vectors obtained from eigenvectors e1 and e6 through binarization with δ = 0.1. Data points with values 1 and 0 are indicated by red and green respectively. Panels: (a) e+1; (b) e−1; (c) e+6; (d) e−6.
Proposition 5. Let eb be a vector obtained from an eigenvector e of the Laplacian matrix
Lrw through binarization. In the ideal case, we have:
1. If e is a primary eigenvector, then the support of eb is either a true cluster or the
union of several true clusters.
2. If e is a secondary eigenvector, then the support of eb is a subset of one of the true
clusters.
4.3.2 Rounding by Overlaying Partitions
Each binary vector gives a partition of all the data points, with one cluster comprising
points with value 1 and another cluster comprising points with value 0. Suppose there
are two partitions. One binary vector divides the data into two clusters C1 and C2 and
the other into P1 and P2. Overlaying the two partitions results in a new partition that
consists of clusters C1 ∩ P1, C1 ∩ P2, C2 ∩ P1, and C2 ∩ P2. Note that there are not
necessarily exactly 4 clusters in the new partition as some of the 4 intersections might be
empty. It is straightforward to generalize the concept of overlaying to multiple partitions
with any number of clusters.
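Overlaying can be implemented by keying each data point on its tuple of labels, one label per partition; a sketch (Python, illustrative, with partitions given as label vectors):

```python
def overlay(partitions):
    """Overlay several partitions of the same n data points.

    Each partition is a list of n cluster labels.  Two points fall in
    the same overlaid cluster iff they agree in every partition, so the
    tuple of labels serves as the new cluster identifier.
    """
    keys = list(zip(*partitions))  # tuple of labels for each data point
    ids = {}
    return [ids.setdefault(k, len(ids)) for k in keys]

# Binary vectors (1, 1, 0, 0) and (1, 0, 0, 1) overlay into 4 non-empty cells.
print(overlay([[1, 1, 0, 0], [1, 0, 0, 1]]))  # [0, 1, 2, 3]
```

Empty intersections simply never appear as keys, which is why the number of resulting clusters can be smaller than the product of the partition sizes.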
We use a contiguous block of leading eigenvectors for rounding. There is a question of
how many eigenvectors to use. In this subsection we consider the case where the number
q of leading eigenvectors to use is given. Here is a simple method for rounding:
Naive-Rounding1(q):
1. Compute the q leading eigenvectors of the Laplacian matrix Lrw.
2. Binarize the q eigenvectors.
3. Obtain a partition of the data using each binary vector from the previous
step.
4. Overlay all the partitions to get the final partition.
Note that the number of clusters obtained by Naive-Rounding1 is determined by
q, but it is not necessarily the same as q in general. The following proposition is an easy
corollary of Proposition 5.
Proposition 6. In the ideal case and when q is smaller than or equal to the number of
true clusters, the clusters obtained by Naive-Rounding1 are either the true clusters or
unions of the true clusters.
4.3.3 Determining the Number of Eigenvectors to Use
Now consider the case when the number q of leading eigenvectors to use is not given. We
determine q, and hence the number of clusters, by making use of Proposition 5. Suppose
Pq is the partition given by Naive-Rounding1 by using the q leading eigenvectors. The
idea is to gradually increase q and test the partition Pq for each q to see whether it satisfies
the condition of Proposition 5 (2).
Suppose K is a sufficiently large integer. Denote a subroutine that tests the partition
Pq by cTest(Pq, q,K). To perform this test, we use the binary vectors obtained from
the eigenvectors from the range [q + 1, K]. If the support of every such binary vector
is contained by some cluster in Pq, we say that Pq satisfies the containment condition
and cTest returns true. Otherwise, Pq violates the condition and cTest returns false.
Proposition 5 states that the containment condition must be satisfied if Pq is the true
partition. If it is violated, Pq cannot be the true partition.
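A sketch of the containment test (Python; illustrative, assuming the partition is a label vector and the binary vectors are 0/1 lists):

```python
def c_test(partition, binary_vectors):
    """Containment test of Proposition 5(2): the support of every
    binary vector must lie inside a single cluster of the partition.

    partition: list of n cluster labels; binary_vectors: 0/1 lists.
    """
    for b in binary_vectors:
        # Clusters touched by the support of this binary vector.
        touched = {partition[j] for j, v in enumerate(b) if v == 1}
        if len(touched) > 1:  # support straddles two or more clusters
            return False
    return True

P = [0, 0, 1, 1]
print(c_test(P, [[1, 1, 0, 0], [0, 0, 1, 0]]))  # True: each support fits one cluster
print(c_test(P, [[0, 1, 1, 0]]))                # False: support spans both clusters
```

In the actual algorithm, the binary vectors passed in are those obtained from eigenvectors in the range [q + 1, K].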
Let kt be the number of true clusters. When q = kt, cTest(Pq, q,K) passes because
of Proposition 5. When q = kt + 1, Pq is likely to be finer than the true partition.
Consequently, cTest(Pq, q,K) may fail. The probability of this happening increases with
the number of binary vectors used in the test. To make the probability high, we pick
K such that K/2 is safely larger than kt and let q run from 2 to ⌊K/2⌋. When q < kt,
cTest(Pq, q,K) usually passes.
The above discussions suggest that if the test cTest(Pq, q,K) returns true for some
q = k and returns false for q = k + 1 for the first time, we can use k as an estimate of kt.
Consequently, we can use k leading eigenvectors for clustering and return Pk as the final
clustering result. This leads to the following algorithm.
Naive-Rounding2(K):
Figure 4.3: Illustration of Naive-Rounding2: (a) shows the partition obtained using the first 6 pairs of binary vectors. There are 7 clusters, each indicated by a different color. (b) and (c) show eigenvector e16 and the vectors obtained from it via binarization. The support of e+16 (the red region) is not contained in any of the clusters in (a). The same is true for the support of e−16.
1. For q = 2 to ⌊K/2⌋,
(a) Pcurrent ← Naive-Rounding1(q).
(b) Pnext ← Naive-Rounding1(q + 1).
(c) If cTest(Pcurrent, q,K) = true and cTest(Pnext, q + 1, K) = false,
return Pcurrent.
2. Return Pcurrent.
Suppose an integer K is given such that K/2 is safely larger than the number of true
clusters. The algorithm Naive-Rounding2 automatically decides how many leading
eigenvectors to use in rounding and automatically determines the number of clusters.
Consider running it on the dataset shown in Figure 4.1(a) with K = 40. It loops q
through 2 to 4 without terminating because both tests at Step 1(c) return true.
When q = 5, Pcurrent is the true partition and Pnext is the one as shown in Figure 4.3(a).
At Step 1(c), cTest(Pcurrent, q,K) returns true. However, cTest(Pnext, q + 1, K) returns
false, because the support of the binary vector e+16 or e−16 is not a subset of any cluster
in Pnext (Figure 4.3). Consequently, Naive-Rounding2 terminates at Step 1(c) and
returns Pcurrent (the true partition) as the final result.
Figure 4.4: Latent class model for rounding: The binary vectors e+1, e−1, . . . , e+q, e−q are regarded as random variables. The discrete latent variable Y represents the partition to find.
It is possible for Naive-Rounding2 to use more than kt eigenvectors. Because we
are talking about the ideal case, the containment condition is satisfied for Pcurrent at
q = kt. However, the condition might also be satisfied for Pnext at q = kt. In that
case, over-shooting occurs. In our empirical evaluation with the three data sets shown in
Figure 4.6, Naive-Rounding2 did pick the correct numbers of leading eigenvectors, and
it determined the number of clusters and the members of the clusters correctly.
4.4 Latent Class Models for Rounding
Naive-Rounding2 is fragile. It can break down as soon as we move away from the ideal
case. In this and the next sections, we describe a model-based method that also exploits
Proposition 5 and is more robust.
In the model-based method, the binary vectors are regarded as features and the prob-
lem is to cluster the data points based on those features. So what we face is a problem of
clustering discrete data. As in the previous section, there is a question of how many lead-
ing eigenvectors to use. In this section, we assume the number q of leading eigenvectors
to use is given. In the next section, we discuss how to determine q.
The problem we address in this section is how to cluster the data points based on the
first q pairs of binary vectors e+1, e−1, . . . , e+q, e−q. We solve the problem using latent
class models. LCMs are commonly used to cluster discrete data, just as Gaussian mixture
models are used to cluster continuous data. Technically, they are the same as the naïve
Bayes models except that the class variable is not observed.
The LCM for our problem is shown in Figure 4.4. So far we have been using the
notations e+s and e−s to denote vectors of n components or functions over the data points.
In this and the next sections, we overload the notations to denote random variables that
take different values at different data points. We do not use bold letters for this case. The
latent variable Y represents the partition to find and each state of Y represents a cluster.
So the number of states of Y , often called the cardinality of Y , is the number of clusters.
To learn an LCM from data means to determine:
1. The cardinality of Y , that is, the number of clusters; and
2. The probability distributions P (Y ), P (e+s |Y ) and P (e−s |Y ), that is, the characteri-
zation of the clusters.
After an LCM is learned, one can compute the posterior distribution of Y for each
data point. This gives a soft partition of the data. To get a hard partition, one can assign
each data point to the state of Y that has the maximum posterior probability. This is
called hard assignment.
4.4.1 Known Number of Clusters
There are two cases with the LCM learning problem, depending on whether the number
of clusters is known. When the number of clusters is known, we only need to determine
the probability distributions P (Y ), P (e+s |Y ), and P (e−s |Y ). This is done using the EM
algorithm [40].
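A compact EM sketch for such an LCM with binary features (Python/NumPy; illustrative, not the thesis implementation, and the random initialization scheme is our choice):

```python
import numpy as np

def lcm_em(X, k, iters=200, seed=0):
    """EM for a latent class model over binary features.

    X: n x m 0/1 matrix (rows are data points, columns are the
    binarized eigenvectors); k: number of clusters (states of Y).
    Returns P(Y), P(feature = 1 | Y), and the posteriors P(Y | x_i).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(k, 1.0 / k)                    # P(Y = r)
    theta = rng.uniform(0.3, 0.7, size=(k, m))  # P(feature = 1 | Y = r)
    for _ in range(iters):
        # E-step: log-posterior up to a constant, then normalize per row.
        logp = (np.log(pi)
                + X @ np.log(theta).T
                + (1 - X) @ np.log(1 - theta).T)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)       # n x k posteriors P(Y | x_i)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = R.sum(axis=0)
        pi = Nk / n
        theta = np.clip((R.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, R

# Two clean feature clusters; hard assignment recovers them.
X = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
labels = lcm_em(X, 2)[2].argmax(axis=1)
```

The final argmax over the posterior is the hard assignment described above.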
Before dealing with the case when the number of clusters is not known, we spend
some time to explain how the LCM method is related to Naive-Rounding1. Here is our
strategy:
1. We set the probability parameter values in such a way that the LCM gives the same
partition as Naive-Rounding1.
2. We show that those parameter values maximize the likelihood of the model.
It is well known that the EM algorithm aims at finding the maximum likelihood estimate
of the parameters. So we can conclude that the LCM method actually tries to find the
same partition as Naive-Rounding1.
Suppose Naive-Rounding1 produces k clusters C1, . . . , Ck. For each r ∈ {1, . . . , k}, let nr be the number of data points in Cr. It is clear that n = n1 + · · · + nk. The partition
{C1, . . . , Ck} is obtained by overlaying the partitions given by the first q pairs of binary
vectors e+1, e−1, . . . , e+q, e−q. In the LCM model, e+1, e−1, . . . , e+q, e−q are viewed as feature
variables. Use ebs to denote a general feature variable. Then we have that for each r, the
feature variable ebs
• Either takes value 1 at all the points in Cr,
• Or takes value 0 at all the points in Cr.
Moreover, all the points from one particular cluster have the same feature values.
Let the latent variable Y have k states {1, . . . , k}. For each r ∈ {1, . . . , k} and each
feature variable ebs, set
P(Y = r) = nr/n, (4.3)

P(ebs = 1 | Y = r) = 1 if ebs is 1 over Cr, and 0 otherwise. (4.4)
Under those parameter values, Y = r means the cluster Cr.
Use m to denote the LCM model, and θ to denote the collection of parameter values
given by Equations (4.3) and (4.4). Consider the conditional probability P (xi|Y = r,m,θ)
and the marginal probability P (xi|m,θ) of a general data point xi. Because Y = r means
the cluster Cr, we have
P(xi | Y = r, m, θ) = 1 if xi ∈ Cr, and 0 otherwise, (4.5)

P(xi | m, θ) = nr/n if xi ∈ Cr. (4.6)
Now consider the posterior distribution P (Y |xi,m,θ) of Y . It follows from Equation
(4.5) that
P(Y = r | xi, m, θ) = 1 if xi ∈ Cr, and 0 otherwise.

This means that, by hard assignment, the LCM gives us exactly the partition {C1, . . . , Ck}, the partition found by Naive-Rounding1.
We next show that the parameter values θ maximize the likelihood of the model. For
each r, enumerate the data points in Cr as xr1, . . . , xrnr. We know from Equation (4.6) that P(xr1 | m, θ) = · · · = P(xrnr | m, θ) = nr/n. Use D to denote the entire data set. So we have

log P(D | m, θ) = ∑_{r=1}^{k} ∑_{t=1}^{nr} log P(xrt | m, θ)
               = ∑_{r=1}^{k} nr log(nr/n)
               = n ∑_{r=1}^{k} (nr/n) log(nr/n).
Now consider another set θ′ of parameter values. Because all the points from one par-
ticular cluster have the same feature value, we have P (xr1|m,θ′) = . . . = P (xrnr |m,θ′).
Let that value be p′r. Then we have

log P(D | m, θ′) = n ∑_{r=1}^{k} (nr/n) log p′r
                ≤ n ∑_{r=1}^{k} (nr/n) log(nr/n)
                = log P(D | m, θ),
where Gibbs’ inequality is used in the second step. So the parameter values θ do maximize
the likelihood.
To summarize the arguments, the LCM method actually tries to find the same partition
as Naive-Rounding1 when the number of clusters is set properly.
4.4.2 Unknown Number of Clusters
Now consider the case when the number of clusters is not known. We follow the standard
practice in the literature and determine it using a search procedure guided by the BIC
score [140]. Specifically, we start by setting the number k of clusters to 1 and increase it
gradually. For each k, we estimate the probability parameters using the EM algorithm
and compute the BIC score of the model given by Equation (3.1). The BIC score would
initially increase with k. We stop the process as soon as it starts to decrease, and use the
k with the maximum BIC score as the estimate of the number of clusters.
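This search can be sketched as follows (Python; since Equation (3.1) is not reproduced here, the usual form of BIC as log-likelihood minus (d/2) log n is assumed, along with the parameter count (k − 1) + k·m for an LCM with m binary features; the function names are ours):

```python
import numpy as np

def bic_score(loglik, k, m, n):
    """BIC = log-likelihood - (d/2) log n, where d counts the free
    parameters of an LCM: (k - 1) for P(Y) plus k * m Bernoulli terms."""
    d = (k - 1) + k * m
    return loglik - 0.5 * d * np.log(n)

def select_num_clusters(loglik_for_k, m, n, k_max=20):
    """Increase k from 1 and stop as soon as the BIC score decreases."""
    best_k, best_score = 1, -np.inf
    for k in range(1, k_max + 1):
        score = bic_score(loglik_for_k(k), k, m, n)
        if score < best_score:
            break  # the score started to decrease; stop searching
        best_k, best_score = k, score
    return best_k

# Hypothetical log-likelihoods for k = 1..4 on n = 100 points, m = 4 features.
ll = {1: -100.0, 2: -50.0, 3: -40.0, 4: -39.5}
print(select_num_clusters(lambda k: ll[k], m=4, n=100))  # 2
```

In this hypothetical run, the gain in log-likelihood from k = 2 to k = 3 is too small to offset the penalty for the extra parameters, so the search stops at k = 2.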
We pointed out earlier that the LCM method tries to find the same partition as Naive-Rounding1 when the number of clusters is the same in both cases. However, the number
of clusters determined using the BIC score might not equal the number of clusters found
by Naive-Rounding1. When that happens, the LCM method produces a different
partition.
4.5 Latent Tree Models for Rounding
In this section we present a method for determining the number q of leading eigenvectors
to use in the model-based approach. The idea is to extend the LCM method of the
previous section using the strategy of Naive-Rounding2.
Consider an integer q between 2 and K/2. We first build an LCM using the first q
pairs of binary vectors and obtain a hard partition of the data using the LCM. Suppose
k clusters C1, . . . , Ck are obtained. Each cluster Cr corresponds to a state r of the latent
variable Y .
Figure 4.5: Latent tree model for rounding: Y is connected to the primary binary vectors e+1, e−1, . . . , e+q, e−q and to the new latent variables Y1, . . . , Yk, each of which is connected to some of the secondary binary vectors.
We extend the LCM model to obtain the model shown in Figure 4.5. We do this in
three steps. First, we introduce k new latent variables Y1, . . . , Yk into the model. They
are all connected to Y . Each Yr is a binary variable and its conditional distribution is set
as follows:
P(Yr = 1 | Y = r′) = 1 if r′ = r, and 0 otherwise. (4.7)
So the state Yr = 1 means the cluster Cr and the state Yr = 0 means the union of the
other clusters.
Next, we add binary vectors from the range [q + 1, K] to the model by connecting
them to the new latent variables. For convenience we call those vectors the secondary
binary vectors. This is not to be confused with the secondary eigenvectors mentioned
in Proposition 4. For each secondary binary vector ebs, let Ds be its support. When
determining to which Yr to connect ebs, we consider how well the cluster Cr covers Ds.
We connect ebs to the Yr such that Cr covers Ds the best, in the sense that the quantity
|Ds ∩ Cr| is maximized, where |.| stands for the number of data points in a set. Ties are
broken arbitrarily.
Finally, we set the conditional distribution P (ebs|Yr) as follows:
P(ebs = 1 | Yr = 1) = |Ds ∩ Cr| / |Cr|, (4.8)

P(ebs = 1 | Yr = 0) = |Ds − Cr| / (n − |Cr|). (4.9)
What we get is an LTM. The LCM part of the model is called its primary part,
while the newly added part is called the secondary part. The parameter values for the
primary part are determined during LCM learning, while those for the secondary part are
set manually by Equations (4.7), (4.8), and (4.9).
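The parent choice and Equations (4.8)-(4.9) can be sketched as follows (Python; the function names are ours, and clusters and supports are represented as index sets):

```python
def best_parent(support, clusters):
    """Pick the r maximizing |Ds ∩ Cr|; ties go to the first such r."""
    Ds = set(support)
    return max(range(len(clusters)), key=lambda r: len(Ds & set(clusters[r])))

def secondary_cpd(support, cluster, n):
    """Equations (4.8)-(4.9): P(ebs = 1 | Yr = 1) and P(ebs = 1 | Yr = 0)."""
    Ds, Cr = set(support), set(cluster)
    p1 = len(Ds & Cr) / len(Cr)        # |Ds ∩ Cr| / |Cr|
    p0 = len(Ds - Cr) / (n - len(Cr))  # |Ds - Cr| / (n - |Cr|)
    return p1, p0

clusters = [[0, 1, 2], [3, 4, 5]]  # a partition of n = 6 points
support = [1, 2, 3]                # support Ds of a secondary binary vector
r = best_parent(support, clusters) # r = 0: C_0 covers Ds best (2 of its 3 points)
p1, p0 = secondary_cpd(support, clusters[r], 6)  # p1 = 2/3, p0 = 1/3
```

Note that p1 < 1 here because C_0 does not cover Ds completely, which is exactly the imperfect-fit situation the BIC score is meant to detect.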
To determine the number q of leading eigenvectors to use, we examine all integers in
the range [2, K/2]. For each such integer q, we build an LTM as described above and
compute its BIC score. We pick the q with the maximum BIC score as the answer to our
question. After q is determined, we use the primary part of the LTM to partition the
data. In other words, the secondary part is used only to determine q. It is not used when
we determine the partition. We call this method the LTM method for rounding.
Here are the intuitions behind the LTM method. If the support Ds of ebs is contained
in cluster Cr, it fits the situation to connect ebs to Yr. The model construction no longer
fits the data well if Ds is not contained in any of the clusters. The worst case is when two
different clusters Cr and Cr′ cover Ds equally well and better than other clusters. In this
case, ebs can be either connected to Yr or to Yr′ . Different choices here lead to different
models. As such, neither choice is ‘ideal’. Even when there is only one cluster Cr that
covers Ds the best, connecting ebs to Yr is still intuitively not ‘perfect’ as long as Cr does
not cover Ds completely.
So when the support of every secondary binary vector is contained by one of the
clusters C1, . . . , Ck, the LTM that we build would fit the data well. However, when
the supports of some secondary binary vectors are not completely covered by any of the
clusters, the LTM would not fit the data well.
Now consider the ideal case. According to Proposition 5 (2), the fit would be good if
q is the number kt of true clusters, or equivalently the number of eigenvectors of the
Laplacian matrix for eigenvalue 0. The fit would not be as good otherwise. This is why
the likelihood of the LTM, hence its BIC score, contains information that can be used to
choose q.
To end this section, we summarize the LTM method for rounding:
LTM-Rounding:
• Inputs:
1. A data set D = {x1, . . . , xn} with similarity matrix S.
2. Parameters: δ, K.
• Algorithm:
1. Form the Laplacian matrix Lrw using Equation (4.1).
2. Compute the first K eigenvectors of Lrw.
3. Using δ as the threshold, binarize the eigenvectors as explained in
Section 4.3. (This results in K pairs of binary vectors, some of which
might be degenerate.)
4. S∗ ← −∞.
5. For q = 2 to ⌊K/2⌋,
(a) mlcm ← the LCM learnt using the first q pairs of binary vectors
as shown in Section 4.4.
(b) P ← hard partition obtained using mlcm.
(c) mltm ← the LTM extended from mlcm as explained in Section 4.5.
(d) S ← the BIC score of mltm.
(e) If S > S∗, then P ∗ ← P and S∗ ← S.
6. Return P ∗.
An implementation of LTM-Rounding can be obtained from http://www.cse.ust.hk/~lzhang/ltm/index.htm.
In general, we suggest setting the binarization threshold δ to 0.1. The parameter
K should be chosen so that K/2 is safely larger than the number of true clusters. Note that
LTM-Rounding does not allow q to be larger than K/2. This ensures that there is sufficient
information in the secondary part of the LTM to determine the appropriateness of using
the first q eigenvectors of the Laplacian matrix for rounding.
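The binarization step itself (Section 4.3) is not reproduced in this excerpt; the sketch below is one plausible reading, in which each eigenvector yields a pair of binary vectors, one marking entries above +δ and one marking entries below −δ, and a vector with no marked entries is degenerate.

```python
def binarize(eigenvector, delta=0.1):
    """Hypothetical binarization rule: split one eigenvector into a pair
    of binary indicator vectors using the threshold delta."""
    pos = [1 if v > delta else 0 for v in eigenvector]
    neg = [1 if v < -delta else 0 for v in eigenvector]
    return pos, neg

def is_degenerate(binary_vector):
    """A binary vector that marks no entries carries no information."""
    return not any(binary_vector)

pos, neg = binarize([0.5, 0.02, -0.4, -0.03])
assert pos == [1, 0, 0, 0] and neg == [0, 0, 1, 0]
```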
4.6 Empirical Evaluation on Synthetic Data
Our empirical investigations are designed to: (1) show that LTM-Rounding works per-
fectly in the ideal case and its performance degrades gracefully as we move away from the
ideal case, and (2) compare LTM-Rounding with alternative methods. Synthetic data
are used for the first purpose, and the results are discussed in this section. Both synthetic
and real-world data are used for the second purpose, and the results are discussed in the
next section.
LTM-Rounding has two parameters δ and K. We set δ = 0.1 and K = 40 in all our
experiments except in sensitivity analysis (Section 4.6.4). For each set of synthetic data,
10 repetitions were run.
4.6.1 Performance in the Ideal Case
Three data sets were used for the ideal case (Fig 4.6). They vary in the number and the
shape of clusters. Intuitively the first data set is the easiest, while the third one is the
hardest. To produce the ideal case, we used the 10-NN similarity measure for the first two
data sets. For the third data set, the 10-NN similarity measure gave a similarity graph
with one single connected component. So we used the 3-NN similarity measure instead.
[Figure 4.6 scatter plots omitted: (a) 6 clusters; (b) 5 clusters; (c) 2 clusters]
Figure 4.6: Synthetic data sets for the ideal case: The 10-NN similarity measure is used for the first two data sets, while the 3-NN similarity measure is used for the third data set. Each color and shape of the data points identifies a cluster. The clusters were recovered correctly by LTM-Rounding.
LTM-Rounding produced the same results on all 10 runs. The results are shown
in Fig 4.6 using colors and shapes of data points. Each color identifies a cluster. LTM-
Rounding correctly determined the numbers of clusters and recovered the true clusters
perfectly.
4.6.2 Graceful Degrading of Performance
To demonstrate that the performance of LTM-Rounding degrades gracefully as we move
away from the ideal case, we generated 8 new data sets by adding different levels of noise
to the second data set in Fig 4.6. The Gaussian similarity measure was adopted with
σ = 0.2. So the similarity graphs for all the data sets are complete.
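In its common form (assumed here; the thesis's exact expression is not shown in this excerpt), the Gaussian similarity s(xi, xj) = exp(−‖xi − xj‖² / 2σ²) is strictly positive for every pair of points, which is why the resulting similarity graph is complete:

```python
import math

def gaussian_similarity(x, y, sigma=0.2):
    """Gaussian (RBF) similarity between two points; always positive,
    so every pair of points is connected in the similarity graph."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

assert gaussian_similarity((0.0, 0.0), (0.0, 0.0)) == 1.0
assert 0 < gaussian_similarity((0.0, 0.0), (1.0, 1.0)) < 1
```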
We evaluated the quality of an obtained clustering by comparing it with the true
clustering using the Rand index (RI) [132] and variation of information (VI) [114]. Note
that higher RI values indicate better performance, while the opposite is true for VI. The
performance statistics of LTM-Rounding are shown in Table 4.1. We see that RI is 1 for
the first three data sets. This means that the true clusters have been perfectly recovered.
The index generally starts to drop from the 4th data set onwards, falling gracefully as
the level of noise in the data increases. A similar trend can be observed for VI.
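Both indices can be computed directly from two label vectors; the sketch below follows the standard definitions (RI as the fraction of point pairs on which the two partitions agree, VI as H(A|B) + H(B|A)).

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs grouped consistently by both partitions."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def variation_of_information(a, b):
    """VI(A, B) = H(A|B) + H(B|A); zero iff the partitions coincide."""
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    vi = 0.0
    for (x, y), nxy in joint.items():
        pxy = nxy / n
        vi -= pxy * (math.log(nxy / pa[x]) + math.log(nxy / pb[y]))
    return vi

assert rand_index([0, 0, 1, 1], [0, 0, 1, 1]) == 1.0
assert variation_of_information([0, 0, 1, 1], [0, 0, 1, 1]) == 0.0
```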
The partitions produced by LTM-Rounding at the best run (in terms of BIC score)
are shown in Fig 4.7(a) using colors and shapes of the data points. We see that on the
first four data sets, LTM-Rounding correctly determined the number of clusters and the
members of each cluster. On the next three data sets, it also correctly recovered the true
clusters except that the top and the bottom crescent clusters might have been broken into
two or three parts. Given the gaps between the different parts, the results are probably
the best one can hope for. The result on the last data set is less ideal but still reasonable.
[Figure 4.7(a) scatter plots omitted — partitions produced by LTM-Rounding: (1) 5 clusters, RI=1.00; (2) 5 clusters, RI=1.00; (3) 5 clusters, RI=1.00; (4) 5 clusters, RI=1.00; (5) 7 clusters, RI=0.97; (6) 6 clusters, RI=0.98; (7) 8 clusters, RI=0.94; (8) 12 clusters, RI=0.88]
[Figure 4.7(b) scatter plots omitted — partitions produced by ROT-Rounding: (1) 4 clusters, RI=0.92; (2) 4 clusters, RI=0.92; (3) 5 clusters, RI=1.00; (4) 6 clusters, RI=0.98; (5) 2 clusters, RI=0.52; (6) 2 clusters, RI=0.52; (7) 20 clusters, RI=0.88; (8) 11 clusters, RI=0.90]
Figure 4.7: The 8 synthetic data sets for the non-ideal case: These data sets were generated by adding various levels of noise to the second data set in Figure 4.6. The Gaussian similarity measure with σ = 0.2 is used for all the data sets. The color and shape of the data points and the picture labels show the partitions produced by (a) LTM-Rounding at the best run (in terms of BIC score of the LTM); and (b) ROT-Rounding. RI means Rand index computed against the true partition.
(a) Rand index

Data     1        2        3        4        5        6        7        8
LTM      1.0±.00  1.0±.00  1.0±.00  .99±.01  .97±.02  .98±.01  .94±.01  .88±.01
ROT      .92±.00  .92±.00  1.0±.00  .98±.00  .52±.00  .52±.00  .88±.00  .90±.00
K-means  1.0±.00  1.0±.00  1.0±.00  1.0±.00  .85±.00  .72±.00  .71±.00  .75±.00
GMM      1.0±.00  1.0±.00  1.0±.00  1.0±.00  .94±.00  .88±.00  .91±.00  .88±.00

(b) Variation of information

Data     1        2        3        4        5        6        7        8
LTM      .00±.00  .00±.00  .00±.00  .06±.09  .29±.19  .28±.14  .79±.12  1.64±.10
ROT      .40±.00  .40±.00  .00±.00  .20±.00  1.60±.00 1.60±.00 1.85±.00 1.42±.00
K-means  .00±.00  .00±.00  .00±.00  .00±.00  1.04±.00 1.41±.00 1.52±.00 1.97±.00
GMM      .00±.00  .00±.00  .00±.00  .00±.00  .85±.00  .91±.00  1.25±.00 1.74±.00

Table 4.1: Performance of the various methods on the 8 synthetic data sets in terms of (a) Rand index (RI) and (b) variation of information (VI). Higher values of RI or lower values of VI indicate better performance. K-means and GMM require extra information for rounding and should not be compared with LTM-Rounding and ROT-Rounding directly.
The above discussions indicate that the performance of LTM-Rounding degrades
gracefully as we move away from the ideal case.
4.6.3 Impact of an Assumption
Suppose it is determined that rounding is to be based on the first q eigenvectors of the
Laplacian matrix. Let mq be the number of clusters to be obtained based on those
eigenvectors. Previous work usually assumes that mq = q. We have argued against this
assumption. Now we empirically study its impact.
To carry out the study, we create a variant of LTM-Rounding by enforcing the
assumption. The change is at Step 5(a) and it concerns the number of states for the
latent variable in the LCM. As it stands, the algorithm determines the number using the
BIC score. In this study, we manually set it to q. No other changes are made. We refer
to the modified algorithm as LTM-Rounding1.
We tested LTM-Rounding and LTM-Rounding1 on the 8 synthetic data sets. The
performance statistics are shown in Table 4.2. We see that the RI values for LTM-Rounding1
are consistently lower than those for LTM-Rounding, and the VI values for LTM-Rounding1
are consistently higher. One exception occurs on the 4th data set, where RI and VI for
LTM-Rounding1 are not significantly better. The other exception occurs on the 8th
data set, where the RI values are the same for LTM-Rounding and LTM-Rounding1, but the VI values
for LTM-Rounding1 are significantly higher. Figure 4.8 shows the partitions obtained
(a) Rand index

Data   1        2        3        4        5        6        7        8
LTM    1.0±.00  1.0±.00  1.0±.00  .99±.01  .97±.02  .98±.01  .94±.01  .88±.01
LTM1   1.0±.00  1.0±.00  1.0±.00  1.0±.00  .95±.00  .95±.02  .91±.01  .88±.00

(b) Variation of information

Data   1        2        3        4        5        6        7        8
LTM    .00±.00  .00±.00  .00±.00  .06±.09  .29±.19  .28±.14  .79±.12  1.64±.10
LTM1   .00±.00  .00±.00  .00±.00  .00±.00  .69±.00  .66±.23  1.24±.10  1.84±.06

Table 4.2: Comparison of LTM-Rounding and LTM-Rounding1 in terms of (a) Rand index and (b) variation of information.
[Figure 4.8 scatter plots omitted — partitions obtained by LTM-Rounding1: (1) 5 clusters, RI=1.00; (2) 5 clusters, RI=1.00; (3) 5 clusters, RI=1.00; (4) 5 clusters, RI=1.00; (5) 10 clusters, RI=0.95; (6) 8 clusters, RI=0.95; (7) 11 clusters, RI=0.91; (8) 17 clusters, RI=0.88]
Figure 4.8: Partitions obtained by LTM-Rounding1 at the best run.
by LTM-Rounding1 at the best run. We see that LTM-Rounding1 estimated higher
numbers of clusters than LTM-Rounding on the last four data sets. Overall, it is fair
to say that LTM-Rounding1 is inferior to LTM-Rounding. So it is not a good idea to
enforce the constraint mq = q.
4.6.4 Sensitivity Study
LTM-Rounding has two parameters δ and K. How sensitive is the performance of the
algorithm to the choice of parameter values? To answer this question, we conducted
experiments on the 2nd, 5th, and 8th data sets in Fig 4.7. Those data sets contain
different levels of noise and hence are at different distances from the ideal case.
[Figure 4.9 plots omitted: (a) K = 40 with varying δ (X-axis); (b) δ = 0.1 with varying K (X-axis); one panel each for data sets (2), (5), and (8)]
Figure 4.9: Sensitivity analysis on parameters δ and K in LTM-Rounding. It was conducted on the 2nd, 5th, and 8th synthetic data sets. The Y-axis is the Rand index and the bars indicate the standard deviations. In general, we recommend δ = 0.1 and K = 40 (shown as blue points).
To determine the sensitivity of LTM-Rounding with respect to δ, we fix K = 40
and let δ vary between 0.01 and 0.95. The RI statistics are shown in Fig 4.9(a). We
see that on data set (2) the performance of LTM-Rounding is insensitive to δ except
when δ = 0.01. On data set (5), the performance of LTM-Rounding is more or less
robust when 0.1 ≤ δ ≤ 0.3. Its performance gets particularly worse when δ ≥ 0.8. Similar
behavior can be observed on data set (8). Those results suggest that the performance
of LTM-Rounding is robust with respect to δ in situations close to the ideal case and
it becomes sensitive in situations that are far away from the ideal case. In general, we
recommend δ = 0.1.
To determine the sensitivity of LTM-Rounding with respect to K, we fix δ = 0.1
and let K vary between 5 and 100. The RI statistics are shown in Fig 4.9(b). It is clear
that the performance of LTM-Rounding is robust with respect to K as long as it is not
too small.
4.6.5 Running Time
LTM-Rounding deals with tree-structured models. It is deterministic everywhere except
at Step 5(a), where the EM algorithm is called to estimate the parameters of the LCM. EM is
an iterative algorithm and is computationally expensive in general. However, it is efficient
on LCMs. So the running time of LTM-Rounding is relatively short. To process one of
the data sets in Figure 4.7, it took around 10 minutes on a laptop computer.
4.7 Comparison with Alternative Methods
To compare with LTM-Rounding, we included the method by Zelnik-Manor and Perona
[169] in our experiments.1 The latter method determines the appropriateness of using
the q leading eigenvectors by checking how well they can be aligned with the canonical
coordinates through rotation. So we name it ROT-Rounding. Both LTM-Rounding
and ROT-Rounding can determine the number q of eigenvectors and the number k of
clusters automatically. Therefore, they are directly comparable. The method by Xiang
and Gong [166] is also related to our work. However, its implementation is not available
to us and hence it is excluded from comparison.
K-means and GMM2 can also be used for rounding. However, neither method can
determine q, and K-means cannot even determine k. We therefore gave the number kt
of true clusters to K-means and used q = kt for both methods. Since the two methods
require additional information, they are included in our experiments only for reference.
They should not be compared with LTM-Rounding directly.
ROT-Rounding and GMM require the maximum allowable number of clusters. We
set that number to 20.
4.7.1 Synthetic Data
Table 4.1 shows the performance statistics of the three other methods on the 8 synthetic
data sets. LTM-Rounding performed better than ROT-Rounding except for the last
data set. The differences are particularly substantial on the 5th, 6th, and 7th data sets.
The partitions produced by ROT-Rounding at one run are shown in Fig 4.7(b).3
We see that on the first two data sets, ROT-Rounding underestimated the number of
clusters and merged the two small crescent clusters. This is serious under-performance
given the easiness of those data sets. On the 3rd data set, it recovered the true clusters
1 The code can be obtained from http://webee.technion.ac.il/~lihi/Demos/SelfTuningClustering.html.
2 MCLUST [52] was used with diagonal covariance matrices and default prior control as the implementation of GMM.
3 The same results were obtained in all 10 runs.
Method    #clusters   RI        VI
LTM       9.3±.82     .91±.00   2.17±.07
ROT       19.0±.00    .92±.00   2.40±.00
K-means   (10)        .90±.00   1.83±.00
GMM       16.0±.00    .91±.00   2.14±.00
sd-CRP    9.3±.96     .89±.00   2.72±.08

Table 4.3: Comparison of various rounding methods on the MNIST digits data. The bottom three methods required extra information for rounding. They should not be compared directly with LTM-Rounding and ROT-Rounding.
correctly. On the 4th data set, it broke the top crescent cluster into two. On the 5th data
set, it recovered the bottom cluster correctly but merged all the other clusters incorrectly.
This leads to a much lower RI. The story on the 6th data set is similar. On the 7th
data set, it produced many more clusters than the number of true clusters. Therefore,
its RI is also lower. On the last data set, the clustering obtained by ROT-Rounding is
not visually better than the one obtained by LTM-Rounding, even though RI suggests
otherwise. Given all the evidence presented, we conclude that the performance of LTM-
Rounding is significantly better than that of ROT-Rounding.
Like LTM-Rounding, both K-means and GMM recovered clusterings correctly on
the first three data sets (Table 4.1). They performed slightly better than LTM-Rounding
on the 4th data set. However, they were considerably worse on the next three data sets,
and were not better on the last one. This happened even though K-means and GMM
were given additional information for rounding. This demonstrates the superior performance of
LTM-Rounding.
4.7.2 MNIST Digits Data
In the next two experiments, we used real-world data to compare the rounding methods.
The MNIST digits data were used in this subsection. The data consist of 1000 samples
of handwritten digits from 0 to 9. They were preprocessed by the deep belief network as
described in [71] using their accompanying code.4 Table 4.3 shows the results averaged
over 10 runs.
We see that the number of clusters estimated by LTM-Rounding is close to the
ground truth (10), but that by ROT-Rounding is considerably larger than 10. In terms
of quality of clusterings, the results are inconclusive. RI suggests that ROT-Rounding
performed slightly better, but VI suggests that LTM-Rounding performed significantly
better.

4 We thank Richard Socher for sharing the preprocessed data with us. The original data can be found at http://yann.lecun.com/exdb/mnist.
Compared with LTM-Rounding, K-means obtained a better clustering (in terms of
VI), whereas GMM obtained one with similar quality. However, K-means was given the
number of true clusters and GMM the number of eigenvectors. GMM also overestimated
the number of clusters even with the extra information.
A non-parametric Bayesian clustering method, called sd-CRP, has recently been proposed
for rounding [148]. Although we obtained the same data from its authors, we
could not get a working implementation of their method. Therefore, we simply copied
their reported performance into Table 4.3. Note that sd-CRP can determine the number
of clusters automatically. However, it requires the number of eigenvectors as input, and
that number was set to 10 in this experiment. Hence, sd-CRP requires more information
than LTM-Rounding. Table 4.3 shows that it estimated a similar number of clusters to
LTM-Rounding. However, its clustering was significantly worse in terms of both RI and VI.
This shows that LTM-Rounding performed better than sd-CRP even though it was given
less information.
4.7.3 Image Segmentation
The final part of our empirical evaluation was conducted on real-world image segmentation
tasks. Five images from the Berkeley Segmentation Data Set (BSDS500) were used. They
are shown in the first column of Figure 4.10. The similarity matrices were built using the
method proposed by Arbeláez et al. [4]. The images and the Matlab code for similarity
matrix construction were downloaded from the webpage of the Berkeley Computer Vision
Group.5
The segmentation results obtained by ROT-Rounding and LTM-Rounding are
shown in the second and third columns of Figure 4.10 respectively. On the first two
images, ROT-Rounding did not identify any meaningful segments. In contrast, LTM-
Rounding identified the polar bear and detected the boundaries of the river on the first
image. It identified the bottle, the glass and a lobster on the second image.
An obvious undesirable aspect of the results is that some uniform regions are broken
up. Examples include the river bank and the river itself in the first image, and the
background and the table in the second image. This is a known problem of spectral
clustering when applied to image segmentation and can be dealt with using image analysis
techniques [4]. We do not deal with the problem in this work.
5http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html
On the third image the performance of LTM-Rounding was better because the lizard
was identified and the segmentation lines follow leaf edges more closely. On the fourth
image LTM-Rounding did a better job at detecting the edges around the lady’s hands
and skirt and on the left end of the silk scarf. However, ROT-Rounding did a better
job on the last image because it produced a cleaner segmentation.
Overall, the performance of LTM-Rounding is better than that of ROT-Rounding.
It should be noted that rounding is a post-processing step of spectral clustering. The
quality of the final results depends critically on the eigenvectors that are produced at
earlier steps. The objective of this section has been to compare LTM-Rounding and
ROT-Rounding on the same collection of eigenvectors. The conclusion is meaningful
even if the final segmentation results are not as good as the best that can be achieved by
image analysis techniques.
4.8 Conclusions
Rounding is an important step of spectral clustering that has not received sufficient
attention. Not many papers have been published on the topic, especially on the issue of
determining the number of leading eigenvectors. In this chapter, we have proposed a
novel method for the task. The method is based on LTMs. It can automatically select an
appropriate number of eigenvectors to use, determine the number of clusters, and finally
assign data points to clusters. We have shown that the method works correctly in the
ideal case and its performance degrades gracefully as we move away from the ideal case.
Original ROT LTM
Figure 4.10: Image segmentation results. The first column shows the original images.The second and third columns show the segmentation results using ROT-Rounding andLTM-Rounding respectively.
CHAPTER 5
EXTENSION: POUCH LATENT TREE MODELS
In LTMs, all variables are discrete. This limits the use of LTMs to discrete data. In this
chapter, we propose an extension to LTMs, resulting in a class of models called pouch
latent tree models.
PLTMs have continuous manifest variables, and hence can handle continuous data.
Note that although our description focuses on continuous data, PLTMs can readily work
on both continuous and discrete data. The algorithms described in this chapter work on
mixed data without any need for modification.
This chapter begins with a definition of PLTMs. PLTMs are then related to other
work in Section 5.2. We describe an inference algorithm in Section 5.3. In Section 5.4,
we discuss how the parameters of PLTMs can be estimated using the EM algorithm. In
Section 5.5, we describe a structural learning algorithm for PLTMs.
5.1 Pouch Latent Tree Models
A pouch latent tree model (PLTM) is a rooted tree, where each internal node represents
a latent variable, and each leaf node represents a set of manifest variables. All the latent
variables are discrete, whereas all the manifest variables are continuous. A leaf node may
contain a single manifest variable or several of them. Because of the second possibility, leaf
nodes are called pouch nodes. Figure 5.1 shows an example of a PLTM. In this example,
Y1–Y4 are discrete latent variables, where Y1–Y3 have three possible values and Y4 has
two. X1–X9 are continuous manifest variables. They are grouped into five pouch nodes:
{X1, X2}, {X3}, {X4, X5}, {X6}, and {X7, X8, X9}. Note that we reserve the bold
capital letter W for denoting the variables of a pouch node, such as W1 = {X1, X2}.
In a PLTM, the dependency of a discrete latent variable Y on its parent Π(Y) is
characterized by a conditional discrete distribution P(y|π(Y)).1 Let W be the variables
of a pouch node with a parent node Y = Π(W). We assume that, given a value y of
Y, W follows the conditional Gaussian distribution P(w|y) = N(w|µy, Σy) with mean
vector µy and covariance matrix Σy. A PLTM can be written as a pair M = (m, θ),
where m denotes the model structure and θ denotes the parameters.

1 The root node is regarded as the child of a dummy node with only one value, and hence is treated in the same way as other latent nodes.

Figure 5.1: An example of a PLTM. The numbers in parentheses show the cardinalities of the discrete variables.

Figure 5.2: Generative model for synthetic data.

y1    P(y1)
s1    0.33
s2    0.33
s3    0.34

      P(y2|y1)
y2    y1 = s1    y1 = s2    y1 = s3
s1    0.74       0.13       0.13
s2    0.13       0.74       0.13
s3    0.13       0.13       0.74

Table 5.1: Discrete distributions in Example 1.
Example 1. Figure 5.2 gives another example of a PLTM. In this model, there are two
discrete latent variables Y1 and Y2, each having three possible values {s1, s2, s3}. There
are six pouch nodes, namely {X1, X2}, {X3}, {X4}, {X5}, {X6}, and {X7, X8, X9}. The
variables in the pouch nodes are continuous.
Each node in the model is associated with a distribution. The discrete distributions
P (y1) and P (y2|y1), associated with the two discrete nodes, are given in Table 5.1.
The pouch nodes W are associated with conditional Gaussian distributions. These
distributions have parameters specifying the conditional means µπ(W) and conditional
covariances Σπ(W). For the four pouch nodes with single variables, {X3}, {X4}, {X5},
and {X6}, these parameters are scalar-valued. The conditional mean µy of each of these
variables is −2.5, 0, or 2.5, depending on whether y = s1, s2, or s3, where y is the
value of the corresponding parent variable Y ∈ {Y1, Y2}. The conditional covariances Σy
can also take different values for different values of their parents. However, for simplicity,
in this example we set Σy = 1 for all y ∈ {s1, s2, s3}.

Figure 5.3: A Gaussian mixture model as a special case of a PLTM.
Let p be the number of variables in a pouch node. The conditional means are specified
by p-vectors and the conditional covariances by p × p matrices. For example, the means
and covariances of the pouch node {X1, X2} conditional on its parent y1 are given by:
µy1 = (−2.5, −2.5) if y1 = s1,  (0, 0) if y1 = s2,  (2.5, 2.5) if y1 = s3,

and

Σy1 = | 1    0.5 |
      | 0.5  1   |,  for all y1 ∈ {s1, s2, s3}.

The conditional means and covariances are specified similarly for the pouch node
{X7, X8, X9}. They are given by:

µy2 = (−2.5, −2.5, −2.5) if y2 = s1,  (0, 0, 0) if y2 = s2,  (2.5, 2.5, 2.5) if y2 = s3,

and

Σy2 = | 1    0.5  0.5 |
      | 0.5  1    0.5 |
      | 0.5  0.5  1   |,  for all y2 ∈ {s1, s2, s3}.
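Viewed jointly, the two latent variables define mixing weights P(y1, y2) = P(y1)P(y2|y1) over nine configurations; with the values from Table 5.1 they form a valid distribution:

```python
# Discrete distributions from Table 5.1 (Example 1).
P_y1 = {"s1": 0.33, "s2": 0.33, "s3": 0.34}
P_y2_given_y1 = {
    "s1": {"s1": 0.74, "s2": 0.13, "s3": 0.13},
    "s2": {"s1": 0.13, "s2": 0.74, "s3": 0.13},
    "s3": {"s1": 0.13, "s2": 0.13, "s3": 0.74},
}

# Joint weights P(y1, y2) = P(y1) * P(y2 | y1) over all nine configurations.
weights = {(y1, y2): P_y1[y1] * P_y2_given_y1[y1][y2]
           for y1 in P_y1 for y2 in P_y1}

assert len(weights) == 9
assert abs(sum(weights.values()) - 1.0) < 1e-9
```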
PLTMs have a noteworthy two-way relationship with GMMs. On the one hand,
PLTMs generalize the structure of GMMs to allow more than one latent variable in a
model. Thus, a GMM can be considered as a PLTM with only one latent variable and
one pouch node containing all manifest variables. As an example, a GMM is depicted as
a PLTM in Figure 5.3, in which Y1 is a discrete latent variable and X1–X9 are continuous
manifest variables.
On the other hand, the distribution of a PLTM over the manifest variables can be
represented by a GMM. Consider a PLTM M. Suppose W1, . . . , Wb are the b pouch
nodes and Y1, . . . , Yl are the l latent nodes in M. Denote by X = W1 ∪ · · · ∪ Wb and
Y = {Yj : j = 1, . . . , l} the sets of all manifest variables and all latent variables in M, respectively.
The probability distribution defined by M over the manifest variables X is given by

P(x) = Σy P(x, y)
     = Σy ∏j=1..l P(yj | π(Yj)) ∏i=1..b N(wi | µπ(Wi), Σπ(Wi))    (5.1)
     = Σy P(y) N(x | µy, Σy).    (5.2)
Equation 5.1 follows from the model definition. Equation 5.2 follows from the fact that
Π(Wi), Π(Yj) ∈ Y and that a product of Gaussian distributions is also a Gaussian
distribution. Equation 5.2 shows that P(x) is a mixture of Gaussian distributions. Although
this means that PLTMs are no more expressive than GMMs with respect to the distributions
of observed data, PLTMs have two advantages over GMMs. First, the number of parameters
can be reduced in PLTMs by exploiting the conditional independence between variables, as
expressed by the factorization in Equation 5.1. Second, and more importantly, the multiple
latent variables in PLTMs allow multiple clusterings of data.
Example 2. In this example, we compare the numbers of parameters in a PLTM and
in a GMM. Consider a discrete node and its parent node with c and c′ possible values,
respectively. It requires (c − 1) × c′ parameters to specify the conditional discrete distribution
for this node. Next consider a pouch node with p variables whose parent variable has c′
possible values. This node has p × c′ parameters for the conditional mean vectors and
p(p + 1)/2 × c′ parameters for the conditional covariance matrices. Now consider the PLTM in
Figure 5.1 and the GMM in Figure 5.3. Both of them define a distribution over 9 manifest
variables. Based on the above expressions, the PLTM has 77 parameters and the GMM
has 164 parameters.
Given the same number of manifest variables, a PLTM may appear to be more complex
than a GMM due to a larger number of latent variables. However, this example shows
that a PLTM can still require fewer parameters than a GMM.
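These counts can be reproduced mechanically. The GMM total of 164 follows directly from the two counting rules above. The PLTM total of 77 also depends on the structure of Figure 5.1, which is not reproduced in this transcript, so the parent assignment below is a hypothetical one that is consistent with the stated count.

```python
def discrete_params(card, parent_card):
    """(c - 1) * c' parameters for a discrete node with c states."""
    return (card - 1) * parent_card

def pouch_params(p, parent_card):
    """p * c' mean parameters plus p(p+1)/2 * c' covariance parameters."""
    return (p + p * (p + 1) // 2) * parent_card

# GMM of Figure 5.3: one 3-state latent variable, one pouch of 9 variables.
gmm_total = discrete_params(3, 1) + pouch_params(9, 3)
assert gmm_total == 164

# Hypothetical parent assignment for the PLTM of Figure 5.1: root Y1 and
# latents Y2, Y3 with 3 states, Y4 with 2 states under a 3-state parent;
# pouches {X1,X2}, {X3}, {X4,X5}, {X6} under 3-state parents and
# {X7,X8,X9} under Y4.
pltm_total = (discrete_params(3, 1) + 2 * discrete_params(3, 3)
              + discrete_params(2, 3)
              + 2 * pouch_params(2, 3) + 2 * pouch_params(1, 3)
              + pouch_params(3, 2))
assert pltm_total == 77
```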
5.2 Related Work
The graphical structure of PLTMs looks similar to that of Bayesian networks (BNs) [122].
In fact, a PLTM is different from a BN only because of the possibility of multiple variables
in a single pouch node. It has been shown that any nonsingular multivariate Gaussian
distribution can be converted to a complete Gaussian Bayesian network (GBN) with an
equivalent distribution [57]. Therefore, a pouch node can be considered as a shorthand
notation of a complete GBN. If we convert each pouch node into a complete GBN, a
PLTM can be considered as a conditional Gaussian Bayesian network (i.e., a BN with
discrete distributions and conditional Gaussian distributions). It can also be considered
as a BN in general.
Some mixture models allow manifest variables to have multivariate normal distribu-
tions. These include the AutoClass models [23] and the MULTIMIX models [77]. The
manifest variables of those models are similar to the pouch nodes in PLTMs. However,
those mixture models do not allow multiple latent variables.
Galimberti and Soffritti [55] propose to build multiple GMMs on a given data set
rather than a single GMM. We refer to their method as GS. Their method partitions the
attributes and learns a GMM on each attribute subset. The collection of independent
GMMs forms the resulting model. The method obtains the initial partition of attributes
using variable clustering. To look for the optimal partition, it repeatedly merges the two
subsets of attributes that lead to the largest improvement of BIC, until it cannot find any
such pair.
The GS models can be considered as having multiple latent variables. They are similar
to PLTMs in this aspect. However, the latent variables in a GS model are disconnected
and hence are independent. In contrast, the latent variables in a PLTM are interdependent
and are connected by a tree structure.
5.3 Inference
A PLTM defines a probability distribution P(X, Y) over the manifest variables X and latent
variables Y. Consider observing values e for the evidence variables E ⊆ X. For a subset
of variables Q ⊆ X ∪ Y, we are often required to compute the posterior probability
P(q|e). For example, classifying a data point d to one of the clusters represented by a
latent variable Y requires us to compute P(y|X = d).
Inference refers to the computation of the posterior probability P(q|e). It can be done
on PLTMs in a way similar to clique tree propagation on conditional GBNs [96]. However,
due to the existence of pouch nodes in PLTMs, the propagation algorithm requires some
modifications. The inference algorithm is discussed in detail below.
5.3.1 Clique Tree Propagation
Consider a PLTM M with manifest variables X and latent variables Y . Recall that
inference on M refers to the computation of the posterior probability P (q|e) of some
Algorithm 1 Inference Algorithm

 1: procedure Propagate(M, T, E, e)
 2:   Initialize ψ(C) for every clique C
 3:   Incorporate evidence to the potentials
 4:   Choose an arbitrary clique in T as CP
 5:   for all C ∈ Ne(CP) do
 6:     CollectMessage(CP, C)
 7:   end for
 8:   for all C ∈ Ne(CP) do
 9:     DistributeMessage(CP, C)
10:   end for
11:   Normalize ψ(C) for every clique C
12: end procedure

13: procedure CollectMessage(C, C′)
14:   for all C″ ∈ Ne(C′) \ {C} do
15:     CollectMessage(C′, C″)
16:   end for
17:   SendMessage(C′, C)
18: end procedure

19: procedure DistributeMessage(C, C′)
20:   SendMessage(C, C′)
21:   for all C″ ∈ Ne(C′) \ {C} do
22:     DistributeMessage(C′, C″)
23:   end for
24: end procedure

25: procedure SendMessage(C, C′)
26:   φ ← RetrieveFactor(C ∩ C′)
27:   φ′ ← Σ_{C\C′} ψ(C)
28:   SaveFactor(C ∩ C′, φ′)
29:   ψ(C′) ← ψ(C′) × φ′/φ
30: end procedure

31: procedure RetrieveFactor(S)
32:   if SaveFactor(S, φ) has been called, return φ; otherwise, return 1
33: end procedure

// Ne(C) denotes the neighbors of C
variables of interest, Q ⊆X ∪ Y , after observing values e of evidence variables E ⊆X.
To perform inference, M has to be converted into a clique tree T . A propagation scheme
for message passing can then be carried out on T .
Construction of clique trees is simple due to the tree structure of PLTMs. To construct
T, a clique C is added to T for each edge in M, such that C = V ∪ {Π(V)} contains the
variable(s) V of the child node and the variable Π(V) of its parent node. Two cliques are
then connected in T if they share any common variable. The resulting clique tree contains
two types of cliques. The first type consists of discrete cliques, each containing two
discrete variables. The second type consists of mixed cliques, each containing the
continuous variables of a pouch node and the discrete variable of its parent node. Observe
that in a PLTM all internal nodes are discrete and only leaf nodes are continuous.
Consequently, the clique tree can be viewed as a tree of discrete cliques, with the mixed
cliques attached at its boundary.
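Because PLTMs are trees, the construction just described is straightforward. The sketch below is our own illustration (the function name and data layout are assumptions, not thesis code); for simplicity it only lists the candidate connections between cliques that share a variable:

```python
def build_cliques(node_vars, edges):
    """node_vars maps each node to its set of variables (a pouch node has
    several); edges lists (child, parent) pairs of the model tree."""
    # one clique per edge: the child's variable(s) plus the parent's variable
    cliques = [frozenset(node_vars[child] | node_vars[parent])
               for child, parent in edges]
    # candidate connections: pairs of cliques sharing a common variable
    # (a spanning tree over these links yields the clique tree T)
    links = [(i, j) for i in range(len(cliques))
             for j in range(i + 1, len(cliques))
             if cliques[i] & cliques[j]]
    return cliques, links

# a root Y1, a latent child Y2, and a pouch node {X1, X2} under Y2 give one
# discrete clique {Y1, Y2} and one mixed clique {X1, X2, Y2}, joined via Y2
cliques, links = build_cliques(
    {"Y1": {"Y1"}, "Y2": {"Y2"}, "P1": {"X1", "X2"}},
    [("Y2", "Y1"), ("P1", "Y2")])
```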
After a clique tree is constructed, propagation can be carried out on it. Algorithm 1
outlines a clique tree propagation, based on the Hugin architecture [36, 38], for PLTMs.
It consists of four main steps: initialization of cliques, incorporation of evidence, message
passing, and normalization. The propagation on the discrete part of the clique tree is
done as for discrete BNs [36, 38]. Here we focus on the part related to the mixed cliques.
Step 1: Initialization of cliques (line 2). Consider a mixed clique Cm containing
continuous variables W and discrete variable Y = Π(W). The potential ψ(w, y) of Cm
(also denoted as ψ(Cm) in the algorithm) is initialized with the corresponding conditional
distribution,

ψ(w, y) = P(w|y) = N(w | µy, Σy).
Step 2: Incorporation of evidence (line 3). The variables in a pouch node W can
be divided into two groups, depending on whether the values of the variables have been
observed. Let E′ = W ∩ E denote the variables whose values have been observed,
and let U = W \ E denote those whose values have not. Furthermore,
let [µ]S denote the part of mean vector µ containing elements corresponding to variables
S, and let [Σ]ST denote the part of the covariance matrix Σ that has rows and columns
corresponding to variables S and T , respectively. To incorporate the evidence e′ for E′,
the potential of Cm changes from ψ(w, y) to

ψ′(w, y) = P(e′|y) × P(w|y, e′) = N(e′ | [µy]E′, [Σy]E′E′) × N(w | µ′y, Σ′y),

where µ′y and Σ′y can be divided into two parts. The part related to the evidence variables
E′ is given by:

[µ′y]E′ = e′,  [Σ′y]E′E′ = 0,  [Σ′y]UE′ = 0,  [Σ′y]E′U = 0.

The other part is given by:

[µ′y]U = [µy]U + [Σy]UE′ ([Σy]E′E′)⁻¹ (e′ − [µy]E′),

[Σ′y]UU = [Σy]UU − [Σy]UE′ ([Σy]E′E′)⁻¹ [Σy]E′U.
Step 3: Message passing (lines 5–10). In this step, Cm involves two operations,
marginalization and combination. Marginalization of ψ′(w, y) over W is required for
sending out a message from Cm (line 27). It results in a potential ψ′(y), involving only
the discrete variable Y, given by:

ψ′(y) = P(e′|y) = N(e′ | [µy]E′, [Σy]E′E′).

Combination is required for sending a message to Cm (line 29). The combination of the
potential ψ′(w, y) with a discrete potential φ(y) is given by ψ″(w, y) = ψ′(w, y) × φ(y).
When the message passing completes (line 10), ψ″(w, y) represents the distribution

ψ″(w, y) = P(y, e) × P(w|y, e′) = P(y, e) × N(w | µ′y, Σ′y).
Step 4: Normalization (line 11). In this step, the potential changes from ψ″(w, y) to

ψ‴(w, y) = P(y|e) × P(w|y, e′) = P(y|e) × N(w | µ′y, Σ′y).
For implementation, the potential of a mixed clique is usually represented by two
types of data structures: one for the discrete distribution and one for the conditional
Gaussian distribution. More details for the general clique tree propagation can be found
in [35, 36, 96].
5.3.2 Complexity
The structure of PLTMs allows efficient inference. Let n be the number of nodes in
a PLTM, c be the maximum cardinality of a discrete variable, and p be the maximum
number of variables in a pouch node. The time complexity of the inference is dominated by
the steps related to message passing and incorporation of evidence on continuous variables.
The message passing step requires O(nc²) time, since each clique has at most two discrete
variables due to the tree structure. Incorporation of evidence requires O(ncp³) time.
Suppose a PLTM and a GMM are defined over the same observed variables. Since PLTMs
generally have smaller pouch nodes than GMMs, and hence a smaller p, the term O(ncp³)
shows that inference on PLTMs can be faster than on GMMs, even though a PLTM may have
more nodes and thus a larger n.
5.4 Parameter Estimation
Suppose there is a data set D with N samples d1, . . . ,dN . Each sample consists of values
for the manifest variables. Consider computing the maximum likelihood estimate (MLE)
θ∗ of the parameters for a given PLTM structure m. We do this using the EM algo-
rithm [40]. The algorithm starts with an initial estimate θ(0) and improves the estimate
iteratively.
Suppose the parameter estimate θ(t−1) is obtained after t − 1 iterations. The t-th
iteration consists of two steps, an E-step and an M-step. In the E-step, we compute,
for each latent node Y and its parent Π(Y ), the distributions P (y, π(Y )|dk,θ(t−1)) and
P (y|dk,θ(t−1)) for each sample dk. This is done by the inference algorithm discussed in
the previous section.
For each sample k, let wk be the values of the variables W of a pouch node for the sample
dk. In the M-step, the new estimate θ(t) is obtained as follows:

P(y|π(Y), θ(t)) ∝ Σ_{k=1}^{N} P(y, π(Y)|dk, θ(t−1)),

µy(t) = Σ_{k=1}^{N} P(y|dk, θ(t−1)) wk / Σ_{k=1}^{N} P(y|dk, θ(t−1)),

Σy(t) = Σ_{k=1}^{N} P(y|dk, θ(t−1)) (wk − µy(t))(wk − µy(t))′ / Σ_{k=1}^{N} P(y|dk, θ(t−1)),

where µy(t) and Σy(t) correspond to the distribution P(w|y, θ(t)) for the pouch node W
conditional on its parent Y = Π(W). The EM algorithm proceeds to the (t + 1)-th iteration
unless the improvement in log-likelihood, log P(D|θ(t)) − log P(D|θ(t−1)), falls below a
certain threshold.
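For a single pouch node and a fixed state y of its parent, these updates are just responsibility-weighted sample statistics, e.g. (an illustrative NumPy sketch, not the thesis implementation):

```python
import numpy as np

def m_step_pouch(W, resp):
    """One M-step update for a pouch node and one state y of its parent.

    W    -- (N, p) array whose rows are the pouch values w_k
    resp -- (N,) array of responsibilities P(y | d_k, theta^(t-1))
    """
    total = resp.sum()
    mu = resp @ W / total                             # weighted mean
    diff = W - mu
    Sigma = (resp[:, None] * diff).T @ diff / total   # weighted covariance
    return mu, Sigma

W = np.array([[0.0, 0.0], [2.0, 2.0]])
resp = np.array([1.0, 1.0])
mu, Sigma = m_step_pouch(W, resp)
# mu = [1, 1]; Sigma = [[1, 1], [1, 1]]
```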
The starting values of the parameters θ(0) are chosen as follows. For P(y|π(Y), θ(0)),
the probabilities are randomly generated from a uniform distribution over the interval
(0, 1] and are then normalized. The initial values of µy(0) are set equal to a random
sample from the data, while those of Σy(0) are set equal to the sample covariance.
As in the case of GMMs, the likelihood is unbounded in the case of PLTMs. This
might lead to spurious local maxima [111]. For example, consider a mixture component
that consists of only one data point. If we set the mean of the component to be equal
to that data point and set the covariance to zero, then the model will have an infinite
likelihood on the data. However, even though the likelihood of this model is higher than
that of some other models, it does not mean that the corresponding clustering is better.
An infinite likelihood can always be achieved by trivially grouping one of the data points
as a cluster. This is why we refer to this kind of local maxima as spurious.
To mitigate this problem, we use a variant of the method by Ingrassia [80]. In the
M-step of EM, we need to compute the covariance matrix Σy(t) for each pouch node W. We
impose the following constraints on the eigenvalues λ(t) of Σy(t):

σ²min / γ ≤ λ(t) ≤ σ²max × γ,

where σ²min and σ²max are the minimum and maximum of the sample variances of the
variables W, and γ is a parameter of our method.
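The constraint can be enforced as a post-processing step on each updated covariance matrix. A sketch of the eigenvalue clipping (our own illustration of this kind of variant, with assumed names):

```python
import numpy as np

def constrain_covariance(Sigma, s2_min, s2_max, gamma=20.0):
    """Clip the eigenvalues of Sigma into [s2_min/gamma, s2_max*gamma] and
    rebuild the matrix; this keeps the covariance away from singularity and
    so avoids the spurious maxima with unbounded likelihood."""
    vals, vecs = np.linalg.eigh(Sigma)   # Sigma is symmetric
    vals = np.clip(vals, s2_min / gamma, s2_max * gamma)
    return (vecs * vals) @ vecs.T        # vecs @ diag(vals) @ vecs.T

# a nearly singular covariance gets its tiny eigenvalue lifted to 1/20:
Sigma = np.array([[1e-12, 0.0], [0.0, 4.0]])
out = constrain_covariance(Sigma, s2_min=1.0, s2_max=4.0)
# out is approximately [[0.05, 0], [0, 4]]
```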
5.5 Structure Learning
Given a data set D, we aim at finding the model m∗ that maximizes the BIC score given
by Equation (3.1).
A hill-climbing algorithm can be used to search for m∗. It starts with a model m(0)
that contains one latent node as root and a separate pouch node for each manifest variable
as a leaf node. The latent variable at the root node has two possible values. Suppose a
model m(j−1) is obtained after j − 1 iterations. In the j-th iteration, the algorithm uses
some search operators to generate candidate models by modifying the base model m(j−1).
The BIC score is then computed for each candidate model. The candidate model m′ with
the highest BIC score is compared with the base model m(j−1). If m′ has a higher BIC
score than m(j−1), m′ is used as the new base model m(j) and the algorithm proceeds to
the (j + 1)-th iteration. Otherwise, the algorithm terminates and returns m∗ = m(j−1)
(together with the MLE of the parameters).
The above hill-climbing algorithm embodies the principles underlying our structure
learning algorithm. In the remainder of this section, we first describe the search
operators used in the algorithm. We then discuss how the efficiency of the hill-climbing
algorithm can be improved, based on the ideas of Chen et al. [26, 28]. We summarize the
resulting structure learning algorithm, called EAST-PLTM, in the last subsection.
5.5.1 Search Operators
There are four aspects of the structure m, namely, the number of latent variables, the
cardinalities of these latent variables, the connections between variables, and the composi-
tion of pouches. The search operators used in our hill-climbing algorithm modify all these
aspects to explore the search space effectively. There are seven search operators in
total. Five of them are borrowed from Zhang and Kočka [173], while the other two are new
for PLTMs.
The five borrowed operators are the state introduction (SI), state deletion (SD), node
introduction (NI), node deletion (ND), and node relocation (NR) operators. They are
described in Section 3.5.1 (page 27). Figure 5.4 gives some examples of NI, ND, and NR
in PLTMs.
The two new operators are the pouching (PO) and unpouching (UP) operators. The PO
operator creates a new model by combining a pair of sibling pouch nodes W1 and W2 into
a new pouch node Wpo = W1 ∪ W2. The UP operator creates a new model by separating
one manifest variable X from a pouch node Wup, resulting in two sibling pouch nodes
W1 = Wup \ {X} and W2 = {X}. Figure 5.5 shows some examples of the use of these
two operators.
The purpose of the PO and UP operators is to modify the conditional independencies
entailed by the model on the variables of the pouch nodes. For example, consider the two
[Figure omitted: models (a) m1, (b) m2, (c) m3]

Figure 5.4: Examples of applying the node introduction, node deletion, and node relocation operators. Introducing Y3 to mediate between Y1, {X4, X5}, and {X6} in m1 gives m2. Relocating {X4, X5} from Y3 to Y2 in m2 gives m3. In reverse, relocating {X4, X5} from Y2 to Y3 in m3 gives m2. Deleting Y3 in m2 gives m1.
models m1 and m2 in Figure 5.5. In m1, X4 and X5 are conditionally independent given
Y3, i.e., P(X4, X5|Y3) = P(X4|Y3)P(X5|Y3). In other words, the covariance between X4 and
X5 is zero given Y3. In m2, on the other hand, X4 and X5 need not be conditionally
independent given Y3: their covariances are allowed to be non-zero in the 2 × 2
conditional covariance matrices of the pouch node {X4, X5}.
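The effect on model complexity can be made concrete by counting the free covariance entries under a shared parent before and after pouching (a hypothetical helper of our own; the counts follow the formulas in Example 2):

```python
def cov_entries(pouch_sizes, c_parent):
    # free covariance entries summed over sibling pouch nodes: one symmetric
    # p x p block per pouch, repeated for each state of the common parent
    return sum(p * (p + 1) // 2 for p in pouch_sizes) * c_parent

# two singleton pouches {X4} and {X5} under a 2-state Y3: only variances,
# so the conditional covariance between X4 and X5 is fixed at zero
print(cov_entries([1, 1], 2))  # 4
# after PO, one pouch {X4, X5}: a full 2x2 covariance per state, freeing
# the off-diagonal entries that UP would remove again
print(cov_entries([2], 2))     # 6
```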
The PO operator in effect postulates that two sibling pouch nodes are correlated given
their parent node. It may improve the BIC score of the candidate model by increasing the
likelihood term when there is some degree of local dependence among those variables
in the empirical data. On the other hand, the UP operator postulates that one variable
in a pouch node is conditionally independent from other variables in the pouch node. It
reduces the number of parameters in the candidate model and hence may improve the
BIC score by decreasing the penalty term. These postulates are tested by comparing the
BIC scores of the corresponding models in each search step. The postulate that leads to
the model with the highest BIC score is considered as most appropriate.
For the sake of computational efficiency, we do not consider pouching more than two
manifest variables. This is similar to how NI is done. To compensate for the restriction,
we consider a restricted version of PO after a successful pouching. The restricted version
[Figure omitted: models (a) m1, (b) m2, (c) m3]

Figure 5.5: Examples of applying the pouching and unpouching operators. Pouching {X4} and {X5} in m1 gives m2, and pouching {X4, X5} and {X6} in m2 gives m3. In reverse, unpouching X6 from {X4, X5, X6} in m3 gives m2, and unpouching X5 from {X4, X5} gives m1.
combines the new pouch node resulting from PO with one of its sibling pouch nodes.
5.5.2 Search Phases
In every search step, a search operator generates all possible candidate models for consid-
eration. Let l and b be the numbers of latent nodes and pouch nodes, respectively, and
n = l + b be the total number of nodes. Let p, q, and r be the maximum number of variables in
a pouch node, the maximum number of sibling pouch nodes, and the maximum number
of neighbors of a latent node, respectively. The numbers of possible candidate models
that NI, ND, SI, SD, NR, PO, and UP can generate are O(lr(r−1)/2), O(lr), O(l), O(l),
O(nl), O(lq(q − 1)/2), and O(bp), respectively. If we consider all seven operators in each
search step, many candidate models are generated but only one of them is chosen for the
next step. The suboptimal models are discarded and not reused. Therefore, in some sense,
much time is wasted on considering suboptimal models.
A more efficient way is to consider fewer candidate models in each search step. This can
be achieved by considering only a subset of search operators at a time. Therefore, we follow
the idea of Chen et al. [26] and partition the search operators into three categories. SI, NI,
and PO belong to the expansion category, since all of them can create candidate models
that are more complex than the current one. SD, ND, and UP, which simplify a model,
belong to the simplification category. NR does not change the complexity considerably.
It belongs to the adjustment category. We perform search in three phases, each of which
considers only operators in one category. The best model found in each phase is used
to seed the next phase, and the search repeats the three phases until none of them can
find a better model.
5.5.3 Operation Granularity
Some search operators have the issue of operation granularity (see Section 3.5.1, page 29),
similar to the corresponding operators for LTMs. Therefore, we follow the cost-effective
principle and evaluate the models using the improvement ratio (Equation (3.4)) for the
affected operators.
The principle is applied only on candidate models generated by the SI, NI, and PO
operators. In other words, it is used only during the expansion phase. It is not applied
to other operators since those operators do not or do not necessarily increase model
complexity.
5.5.4 Efficient Model Evaluation
Similar to the other five operators, the PO and UP operators modify a small part of the
base model. Hence, the restricted likelihood (see Section 3.5.1, page 31) can also be used
to speed up the evaluation process.
We use an example to illustrate the evaluation of a candidate model given by a PO
operator. Consider the models m1 and m2 in Figure 5.5. m2 is obtained from m1 using
the PO operator. The two models share many parameters, such as P(x1, x2|y2), P(y2|y1),
P(y3|y1), and P(x6|y3). On the other hand, some parameters are not shared by the two
models. In this example, parameters P (x4|y3) and P (x5|y3) are specific to m1, while
parameter P(x4, x5|y3) is specific to m2. To compute the restricted likelihood of m2, we
keep the MLE of the shared parameters and update only P(x4, x5|y3) in EM. We can then
evaluate m2 using the approximate score BIC_RL given by Equation (3.5).
Algorithm 2 Search Algorithm

 1: procedure EAST-PLTM(m, D)
 2:   repeat
 3:     m′ ← m
 4:     m ← Expand(m, D)
 5:     m ← Adjust(m, D)
 6:     m ← Simplify(m, D)
 7:   until BIC(m|D) ≤ BIC(m′|D)
 8:   return m′
 9: end procedure

10: procedure Expand(m, D)
11:   loop
12:     m′ ← m
13:     M ← SI(m′) ∪ NI(m′) ∪ PO(m′)
14:     m ← PickModel-IR(M, D)
15:     if BIC(m|D) ≤ BIC(m′|D) then
16:       return m′
17:     end if
18:     if m ∈ NI(m′) ∪ PO(m′) then
19:       m ← Enhance(m, D)
20:     end if
21:   end loop
22: end procedure

23: procedure Adjust(m, D)
24:   return RepPickModel(m, NR, D)
25: end procedure

26: procedure Simplify(m, D)
27:   m ← RepPickModel(m, UP, D)
28:   m ← RepPickModel(m, ND, D)
29:   m ← RepPickModel(m, SD, D)
30:   return m
31: end procedure

32: procedure RepPickModel(m, Op, D)
33:   repeat
34:     m′ ← m
35:     m ← PickModel(Op(m′), D)
36:   until BIC(m|D) ≤ BIC(m′|D)
37:   return m′
38: end procedure
5.5.5 EAST-PLTM
The entire search algorithm for PLTMs is named EAST-PLTM. It is outlined in Algo-
rithm 2.
The search process starts with the initial model m(0) described in Section 5.5. The
procedure EAST-PLTM uses the initial model as the first current model. It then repeatedly
tries to improve the current model in the three different phases.
In procedure Expand, the improvement ratio IR is used to pick the best model among the
candidate models generated by SI, NI, and PO. It stops when the best candidate model
fails to improve over the previous model. On the other hand, if the best candidate model
is better, and if it comes from NI or PO, procedure Enhance is called. This procedure
iteratively improves the model using a restricted version of a search operator: if the
given model was generated by NI, the restricted version of NR is used; if it was generated
by PO, the restricted version of PO is used.
Procedure Adjust calls RepPickModel to improve the model using the NR operator
repeatedly. Procedure Simplify first tries to improve the model using UP; it then uses
ND, and finally SD.
5.6 Conclusions
In this chapter, we have described PLTMs. We have also presented an inference algorithm
and a learning algorithm for these models. In the next two chapters, we demonstrate the
usefulness of PLTMs by applying them to facilitate variable selection in clustering and
to perform multidimensional clustering.
CHAPTER 6
VARIABLE SELECTION IN CLUSTERING
Variable selection for cluster analysis is a difficult problem. The difficulty originates
not only from the lack of class information but also from the fact that high-dimensional
data are often multifaceted and can be meaningfully clustered in multiple ways. In such
a case, the
effort to find one subset of attributes that presumably gives the “best” clustering may be
misguided. It makes more sense to facilitate variable selection by domain experts, that
is, to systematically identify various facets of a data set (each being based on a subset of
attributes), cluster the data along each one, and present the results to the domain experts
for appraisal and selection.
In this chapter, we use PLTMs as a generalization of the Gaussian mixture model. We show
their ability to cluster data along multiple facets, and we demonstrate that it is often
more reasonable to facilitate variable selection than to perform it.
We begin this chapter by explaining our approach in Section 6.1. Then we describe
the experimental setup in Section 6.2. In Section 6.3, we present the empirical results of
comparison between the two approaches to variable selection. We also highlight findings
on some of the data sets.
6.1 To Do or To Facilitate
Variable selection is an important issue for cluster analysis of high-dimensional data. The
cluster structure of interest to domain experts can often be best described using a subset
of attributes. The inclusion of other attributes can degrade clustering performance and
complicate cluster interpretation. Recently, there has been growing interest in this
issue [43, 151, 170]. This chapter is concerned with variable selection for model-based
clustering.
In classification, variable selection is a clearly defined problem: the goal is to find the
subset of attributes that gives the best classification performance. The problem is less
clear for
cluster analysis due to the lack of class information. Several variable selection methods
have been proposed for model-based clustering. Most of them introduce flexibility into
the generative mixture model to allow clusters to be related to subsets of (instead of all)
attributes and determine the subsets alongside parameter estimation or during a separate
model selection phase.
Raftery and Dean [130] consider a variation of the Gaussian mixture model (GMM)
where the latent variable is related to a subset of attributes and is independent of other
attributes given the subset. A greedy algorithm is proposed to search among those models
for one with high BIC score. At each search step, two nested models are compared using
the Bayes factor and the better one is chosen to seed the next search step. Maugis et al.
[108, 109] extend this work by considering different possibilities of dependency among the
relevant and irrelevant attributes.
Law et al. [97] start with the Naïve Bayes model (that is, GMM with diagonal covari-
ance matrices) and add a saliency parameter for each attribute. The parameter ranges
between 0 and 1. When it is 1, the attribute depends on the latent variable. When it is
0, the attribute is independent of the latent variable and its distribution is assumed to be
unimodal. The saliency parameters are estimated together with other model parameters
using the EM algorithm. The work is extended by Li et al. [102] so that the saliency of
an attribute can vary across clusters.
The third line of work is based on GMMs where all clusters share a common diagonal
covariance matrix, while their means may vary. If the mean of a cluster along an attribute
turns out to coincide with the overall mean, then that attribute is irrelevant to the clustering.
Both Bayesian methods [73, 103] and regularization methods [120] have been developed
based on this idea.
Our work is based on two observations. First, while clustering algorithms identify
clusters in data based on the characteristics of data, domain experts are ultimately the
ones to judge the interestingness of the clusters found. Second, high-dimensional data are
often multifaceted in the sense that there may be multiple meaningful ways to partition
them. The first observation is the reason why variable selection for clustering is such a
difficult problem, whereas the second one suggests that the problem may be ill-conceived
from the start.
Instead of performing variable selection, we advocate facilitating variable selection by
domain experts. The idea is to systematically identify all the different facets of a data
set, cluster the data along each one, and present the results to the domain experts for
appraisal and selection. The analysis would be useful if one of the clusterings is found
interesting.
We use PLTMs to realize this idea. Analyzing data using a PLTM may result in
multiple latent variables. Each latent variable represents a partition (clustering) of the
data and is usually related primarily to only a subset of attributes. Consequently, the data
is clustered along multiple dimensions and the results can be used to facilitate variable
selection.
Data Set     Attributes  Classes  Samples  Latents
glass                 9        6      214      3.0
image               18¹        7     2310      4.4
ionosphere          33¹        2      351      9.9
iris                  4        3      150      1.0
vehicle              18        4      846      3.0
wdbc                 30        2      569      9.4
wine                 13        3      178      2.0
yeast                 8       10     1484      5.0
zernike              47       10     2000      6.9

Table 6.1: Descriptions of the UCI data sets used in our experiments. The last column shows the average number of latent variables obtained by PLTM analysis over 10 repetitions.
6.2 Experimental Setup
Our empirical study is designed to compare two types of analyses that can be applied
to unlabeled data: PLTM analysis and GMM analysis. PLTM analysis yields a model
with multiple latent variables. Each of the latent variables represents a partition of data
and may depend only on a subset of attributes. GMM analysis produces a model with
a single latent variable. It can be done with or without variable selection. Our study is
primarily concerned with GMM analysis with variable selection. GMM analysis without
variable selection is included for reference. When variable selection is performed, the
latent variable may depend on only a subset of attributes.
6.2.1 Data Sets and Algorithms
We used both synthetic and real-world data sets in our experiments. The synthetic data
were generated from the model described in Example 1, with the variable Y1 used as
the class variable. The real-world data sets were borrowed from the UCI machine learning
repository.² We chose nine labeled data sets that have often been used in the literature
and that contain only continuous attributes. Table 6.1 shows the basic information of
these data sets.
We compare PLTM analysis with four methods based on GMMs. The first method is
plain GMM (PGMM) analysis. The second one is MCLUST [52].³ This method reduces
the number of parameters by imposing constraints on the eigenvalue decomposition of the
¹ Attributes having single values had been removed.
² http://www.ics.uci.edu/~mlearn/MLRepository.html
³ http://cran.r-project.org/web/packages/mclust
covariance matrices. The third one is CLUSTVARSEL [130].⁴ It is denoted as CVS for
short. The last one is the method of Law et al. [97], which we call LFJ, using the first
letters of the three authors' names. Among these four methods, the last two perform
variable selection while the first two do not. Recently, a model-based method that also
produces multiple clusterings has been proposed [55]. It can also be used to facilitate
variable selection. Therefore, we include this method in our study and denote it as GS.
CVS and LFJ are described in Section 6.1, and GS in Section 5.2.
In our experiments, the parameters of PGMMs and PLTMs were estimated using
the EM algorithm. The same settings were used for both cases. EM was terminated
when it failed to improve the log-likelihood by 0.01 in one iteration or when the number
of iterations reached 500. We used a variant of the multiple-restart approach [31] with 64
starting points to avoid local maxima in parameter estimation. For the scheme to avoid
spurious local maxima in parameter estimation as described in Section 5.4, we set the
constant γ at 20. For PGMM and CVS, the true numbers of classes were given as input.
For MCLUST, LFJ, and GS, the maximum number of mixture components was set at 20.
6.2.2 Method of Comparison
Our experiments started with labeled data. In the training phase, models were learned
from data with the class labels removed. In the testing phase, the clusterings contained
in models were evaluated and compared based on the class labels. The objective is to see
which method recovers the class variable the best.
A model produced by a GMM-based method contains a single latent variable Y . It
represents one way to partition the data. We follow Strehl and Ghosh [152] and evaluate
the partition using normalized mutual information NMI(C;Y ) between Y and the class
variable C. The NMI is given by

NMI(C; Y) = MI(C; Y) / √(H(C) H(Y)),

where MI(C; Y) is the mutual information between C and Y, and H(V) is the entropy of a
variable V [34]. These quantities can be computed from P(c, y), which in turn is
estimated by

P(c, y) = (1/N) Σ_{k=1}^{N} I_{c_k}(c) P(y|d_k),

where d_1, ..., d_N are the samples in the testing data, I_{c_k}(c) is an indicator
function taking value 1 when c = c_k and 0 otherwise, and c_k is the class label of the
k-th sample.
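A direct NumPy transcription of this metric, starting from the estimated joint distribution P(c, y) (our own illustrative sketch):

```python
import numpy as np

def nmi(P):
    """Normalized mutual information from a joint distribution P(c, y),
    given as a 2-D array that sums to 1."""
    Pc, Py = P.sum(axis=1), P.sum(axis=0)   # marginals of C and Y
    outer = np.outer(Pc, Py)
    nz = P > 0                              # skip zero cells in the sum
    mi = np.sum(P[nz] * np.log(P[nz] / outer[nz]))

    def entropy(q):
        q = q[q > 0]
        return -np.sum(q * np.log(q))

    return mi / np.sqrt(entropy(Pc) * entropy(Py))

# a partition that matches the classes perfectly scores 1,
# an independent partition scores 0
perfect = np.array([[0.5, 0.0], [0.0, 0.5]])
independent = np.array([[0.25, 0.25], [0.25, 0.25]])
```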
A model resulting from PLTM analysis or GS contains a set Y of latent variables.
Each of the latent variables represents a partition of the data. In practice, the user may
find several of the partitions interesting and use them all in his work. In this section,
however, we are talking about comparing different clustering algorithms in terms of their
ability to recover the original class partition. So, the user needs to choose one of the
partitions as the final result. The question becomes whether this analysis provides the
possibility for the user to recover the original class partition. Consequently, we assume
that the user chooses, among all the partitions produced, the one closest to the class
partition, and we evaluate the performance of PLTM analysis and GS using the quantity

max_{Y∈Y} NMI(C; Y).

⁴ http://cran.r-project.org/web/packages/clustvarsel
Among the four GMM-based methods, CVS and LFJ make explicit efforts to perform
variable selection, while PGMM and MCLUST do not. PLTM and GS do not make explicit
efforts to perform variable selection either. However, they produce multiple partitions
of the data, and some of the partitions may depend only on subsets of attributes.
Consequently, they allow the user to examine clustering results based on various subsets
of attributes and to choose the ones the user deems interesting. In this sense, we can
view PLTM and GS as methods that facilitate variable selection. So, our empirical work
can be viewed as a comparison between two different approaches to variable selection: to
do (CVS and LFJ) or to facilitate (PLTM and GS).
Note that NMI was used in our experiments due to the absence of a domain expert
to evaluate the clustering results. In practice, class labels are not available when we
cluster data. Hence the NMI cannot be used to select the appropriate partitions for
the facilitation approach. The user needs to comprehend the clusterings and find those
interesting to him. We analyze some unlabeled NBA data in the next chapter. This can
also serve as a demonstration for the facilitation approach in practice.
6.3 Results
The results of our experiments are given in Table 6.2. PLTM clearly outperforms the two variable selection methods, CVS and LFJ. Specifically, it outperforms CVS on all but one data set and LFJ on all data sets. PLTM also has clear advantages over PGMM and MCLUST, the two methods that do not perform variable selection. PLTM outperformed PGMM on all but one data set and MCLUST on all but two data sets. In addition, PLTM performed better than GS, the other method that produces multiple clusterings, outperforming it on all but one data set. On these data sets, PLTM usually outperformed the other methods by large margins.
             Facilitate VS           Perform VS              No VS
Data Set     PLTM       GS          CVS        LFJ         PGMM       MCLUST
synthetic    .85 (.00)  .69 (.00)   .34 (.00)  .56 (.02)   .56 (.00)  .64 (.00)
glass        .43 (.03)  .38 (.00)   .29 (.00)  .35 (.03)   .28 (.03)  .33 (.00)
image        .71 (.03)  .65 (.00)   .41 (.00)  .51 (.03)   .52 (.04)  .66 (.00)
vehicle      .40 (.04)  .31 (.00)   .23 (.00)  .27 (.01)   .25 (.08)  .36 (.00)
wine         .97 (.00)  .83 (.00)   .71 (.00)  .70 (.19)   .50 (.06)  .69 (.00)
zernike      .50 (.02)  .39 (.00)   .33 (.00)  .45 (.01)   .44 (.03)  .41 (.00)
ionosphere   .36 (.01)  .26 (.00)   .41 (.00)  .13 (.07)   .57 (.04)  .32 (.00)
iris         .76 (.00)  .74 (.00)   .87 (.00)  .68 (.02)   .73 (.08)  .76 (.00)
wdbc         .45 (.03)  .36 (.00)   .34 (.00)  .41 (.02)   .44 (.08)  .68 (.00)
yeast        .18 (.00)  .22 (.00)   .04 (.00)  .11 (.04)   .16 (.01)  .11 (.00)

Table 6.2: Clustering performances as measured by NMI. The averages and standard deviations over 10 repetitions are reported. Best results are highlighted in bold. The first row categorizes the methods according to their approaches to variable selection (VS).
We next examine models produced by the various methods to gain insights into the
superior performance of PLTM analysis.
6.3.1 Synthetic Data
Before examining models obtained from synthetic data, we first take a look at the data
set itself. The data were sampled from the model shown in Figure 5.2, with information
about the two latent variables Y1 and Y2 removed. Nonetheless, the latent variables
represent two natural ways to partition the data. To see how the partitions are related
to the attributes, we plot the NMI5 between the latent variables and the attributes in
Figure 6.1(a). We call the curve for a latent variable its feature curve. We see that Y1 is
strongly correlated with X1–X3, but not with the other attributes. Hence it represents a
partition based on those three attributes. Similarly, Y2 represents a partition of the data
based on attributes X4–X9. So, we say that the data has two facets, one represented by
X1–X3 and another by X4–X9. The designated class partition Y1 is a partition along the
first facet.
The model produced by PLTM analysis has the same structure as the generative
model. We name the two latent variables in the model Z1 and Z2 respectively. Their
feature curves are also shown in Figure 6.1(a). We see that the feature curves of Z1 and
Z2 match those of Y1 and Y2 well. This indicates that PLTM analysis has successfully
recovered the two facets of the data. It has also produced a partition of the data along
each of the facets. If the user chooses the partition Z1 along the facet X1–X3 as the
Footnote 5: To compute NMI(X;Y) between a continuous variable X and a latent variable Y, we discretized X into 10 equal-width bins, so that P(X,Y) could be estimated as a discrete distribution.
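The binning step described in the footnote can be sketched as follows; this is an illustrative reconstruction, and the function and argument names are ours.

```python
# Sketch of discretizing a continuous attribute X into equal-width bins
# so that its joint distribution with a discrete latent variable Y can
# be estimated as a contingency table.
import numpy as np

def joint_from_binned(x, y, n_bins=10, n_states=None):
    """Estimate P(X, Y) with X discretized into n_bins equal-width bins."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # digitize against the inner edges maps each value to a bin in 0..n_bins-1
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    n_states = n_states or int(y.max()) + 1
    joint = np.zeros((n_bins, n_states))
    for b, s in zip(bins, y):
        joint[b, s] += 1
    return joint / len(x)
```

The resulting table can then be fed to the same NMI computation used for two discrete variables.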
[Figure 6.1: Feature curves of the partitions obtained by various methods and that of the original class partition on synthetic data. (a) PLTM analysis: curves for Y1, Y2, Z1, and Z2. (b) Variable selection methods and GS: curves for Y1, Y2, CVS, LFJ, GS1, and GS2. In both panels, the horizontal axis ranges over the features X1–X9 and the vertical axis shows NMI (0.0–0.5).]
final result, then the original class partition is well recovered. This explains the good
performance of PLTM (NMI=0.85).
The feature curves of the partitions obtained by LFJ and CVS are shown in Fig-
ure 6.1(b). We see that the LFJ partition is not along any of the two natural facets of
the data. Rather it is a partition based on a mixture of those two facets. Consequently,
the performance of LFJ (NMI=0.56) is not as good as that of PLTM. CVS did identify
the facet represented by X4–X9, but it is not the facet of the designated class parti-
tion. In other words, it picked the wrong facet. Consequently, the performance of CVS
(NMI=0.34) is the worst among all the methods considered. GS succeeded in identifying two facets. However, their feature curves do not match those of Y1 and Y2 well. Hence, its performance (NMI=0.69) is worse than that of PLTM.
Figure 6.2: Structure of the PLTM learned from image data.
6.3.2 Image Data
In the image data, each instance represents a 3× 3 region of an image. It is described by
18 attributes. The feature curve of the original class partition is given in Figure 6.3(a).
We see that it is a partition based on 10 color-related attributes from intensity to hue
and the attribute centroid.row.
The structure of the model produced by PLTM analysis is shown in Figure 6.2. It
contains 4 latent variables Y1–Y4. Their feature curves are shown in Figure 6.3(a). We
see that the feature curve of Y1 matches that of the class partition beautifully. If the user
chooses the partition represented by Y1 as the final result, then the original class partition
is well recovered. This explains the good performance of PLTM (NMI=0.71).
The feature curves of the partitions obtained by LFJ and CVS are shown in Fig-
ure 6.3(b). The LFJ curve matches that of the class partition quite well, but not as
well as the feature curve of Y1, especially on the attributes line.density.5, hue and
centroid.row. Consequently, the performance of LFJ (NMI=0.51) is not as good as that
of PLTM. Similar things can be said about the partition obtained by CVS. Its feature
curve differs from the class feature curve even more than the LFJ curve on the attribute
line.density.5, which is irrelevant to the class partition. Consequently, the performance
of CVS (NMI=0.41) is even worse than that of LFJ.
GS produced four partitions on image data. Two of them consist of only one com-
ponent, and are therefore discarded. The feature curves of the remaining two partitions
are shown in Figure 6.3(c). We see that one of them corresponds to the facet of the class
partition, but does not match that well. As a result, the performance of GS (NMI=0.65)
is better than that of most of the other methods but is not as good as that of PLTM.
Two remarks are in order. First, the 10 color-related attributes semantically form
a facet of the data. PLTM analysis has identified the facet in the pouch below Y1.
[Figure 6.3: Feature curves of the partitions obtained by various methods and that of the original class partition on image data. (a) PLTM analysis: curves for class and Y1–Y4. (b) Variable selection methods: curves for class, CVS, and LFJ. (c) GS: curves for class, GS1, and GS2. In all panels, the horizontal axis ranges over the 18 attributes (line.density.5, line.density.2, vedge.mean, vedge.sd, hedge.mean, hedge.sd, intensity, rawred, rawblue, rawgreen, exred, exblue, exgreen, value, saturation, hue, centroid.row, centroid.col) and the vertical axis shows NMI (0.0–0.6).]
Figure 6.4: Structure of the PLTM learned from wine data.
Moreover, it obtained a partition based on not only the color attributes, but also the
attribute centroid.row, the vertical location of a region in an image. This is interesting because centroid.row is closely related to the color facet. Intuitively, the vertical
location of a region should correlate with the color of the region. For example, the color
of the sky occurs more frequently at the top of an image and that of grass more frequently
at the bottom.
Second, the latent variable Y2 is strongly correlated with the two line density attributes.
This is another facet of the data that PLTM analysis has identified. PLTM analysis has
also identified the edge-related facet in the pouch node below Y3. However, it did not
obtain a partition along the facet. The partition represented by Y3 depends on not only
the edge attributes but others as well. The two coordinate attributes centroid.row and
centroid.col semantically form one facet. The facet has not been identified probably
because the two attributes are not correlated.
6.3.3 Wine Data
The PLTM learned from wine data is shown in Figure 6.4. The model structure also
appears to be interesting. While we are not experts on wine, it seems natural to have Ash
and Alcalinity_of_ash in one pouch as both are related to ash. Similarly, Flavanoids,
Nonflavanoid_phenols, and Total_phenols are related to phenolic compounds. These
compounds affect the color of wine, so it is reasonable to have them in one pouch along
with the opacity attribute OD280/OD315. Moreover, both Magnesium and Malic_acid play
a role in the production of ATP (adenosine triphosphate), the most essential chemical in
energy production. So, it is not a surprise to find them connected to a second latent
variable.
Figure 6.5: Structure of the PLTM learned from wdbc data.
Table 6.3: Confusion matrix for PLTM on wdbc data.
                      Y1
class        s1    s2    s3    s4    s5
malignant    43   116     6     0    47
benign        0     9   193   114    41
6.3.4 WDBC Data
While PLTM performed well on the data sets discussed above, there are some data sets on
which PLTM did not perform so well. One such example is the wdbc data. These data were obtained from 569 digitized images of cell nuclei aspirated from breast masses. They
involve 10 computed features of the cell nuclei in these images. Each instance corresponds
to one image and includes the mean value (m), standard error (s), and worst value (w)
for each feature. It is labeled as either benign or malignant.
Figure 6.5 shows the structure of a PLTM learned from this data. We can see that this
model identifies some meaningful facets. The pouch below Y1 identifies a facet related to
the size of nuclei. It includes attributes mainly related to area, perimeter, and radius. The
second facet is identified by the pouch below Y2. It is related to concavity and consists
primarily of the mean and worst values of the two features related to concavity. The
third facet is identified by the pouch below Y3. It includes the mean and worst values of
smoothness and symmetry, and appears to show whether the nuclei have regular shapes
or not. The pouch below Y9 identifies a facet related to texture. This facet includes three
texture-related attributes but also the attribute symmetry.s. The remaining attributes
are mostly standard errors of some features and may be considered as the amount of
variation of the features. They are connected to the rest of the model through Y4 and Y8.
[Figure 6.6: Feature curves of the partition Y1, obtained by PLTM, and that of the original class partition on wdbc data. Y1(2) is obtained by setting the cardinality of Y1 to 2. The horizontal axis ranges over the 30 attributes (radius.w, radius.s, radius.m, perimeter.w, perimeter.s, perimeter.m, compactness.m, area.w, area.s, area.m, fractal_dim.w, concavity.w, concavity.m, concave_pts.w, concave_pts.m, compactness.w, symmetry.w, symmetry.m, smoothness.w, smoothness.m, concave_pts.s, concavity.s, compactness.s, fractal_dim.s, fractal_dim.m, smoothness.s, texture.w, texture.s, texture.m, symmetry.s) and the vertical axis shows NMI (0.0–0.8).]
Although the model appears to have a reasonable structure, it did not achieve a high
NMI on this data set (NMI=0.45). To understand this better, we compare the feature
curve of the class partition with that of the closest partition (Y1) obtained by PLTM
in Figure 6.6. The two feature curves have roughly similar shapes. We also look at the
confusion matrix for Y1, which is shown in Table 6.3. We can see that Y1 gives a reasonable
partition. The first four states of Y1 group together the benign cases or malignant cases
almost perfectly, while the remaining state groups together some uncertain cases.
One possible reason for the relatively low NMI is that Y1 has 5 states but the class
variable has only 2 states. The higher number of states of Y1 may lead to a lower NMI
due to an increase of the entropy term. It may also lead to the mismatch of the feature
curves. To verify this explanation, we manually set the cardinality of Y1 to 2. The feature
curve of this adjusted latent variable (Y1(2)) is shown in Figure 6.6. It now matches
the feature curve of the class partition well. The adjustment also improved the NMI to
0.69, making it the highest on this data set. This example shows that an incorrect
estimation of the number of clusters could be a reason why PLTM performed worse than
other methods on some data sets.
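The entropy effect described above can be illustrated with a small hard-assignment example; this is our own sketch, not thesis code. When a partition refines the class partition, MI(C;Y) equals H(C) while H(Y) grows with the number of states, so NMI falls below 1; merging states restores it.

```python
# Illustration of the cardinality adjustment: merging states of a hard
# partition and recomputing NMI. The state mapping used in the test
# (first two states -> one class, last three -> the other) is hypothetical,
# chosen only to mirror the confusion-matrix pattern in Table 6.3.
import numpy as np

def nmi_discrete(a, b):
    """NMI between two hard labelings, from their empirical joint."""
    a, b = np.asarray(a), np.asarray(b)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= len(a)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    return mi / np.sqrt(ha * hb)

def merge_states(labels, mapping):
    """Collapse a fine partition into fewer states via a state mapping."""
    return [mapping[s] for s in labels]
```

With a 5-state partition that perfectly refines a 2-state class variable, the fine partition scores below 1 while the merged 2-state version scores exactly 1.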
6.3.5 Discussions
We have also performed PLTM analysis on the other data sets. The last column of
Table 6.1 lists the average numbers of latent variables obtained over 10 repetitions. We
see that multiple latent variables have been identified except on iris data, which has only
four attributes. Many of the latent variables represent partitions of data along natural
facets of the data.
In general, PLTM analysis has the ability to identify natural facets of data and cluster
data along those facets. In practice, a user may find several of the clusterings useful. In
the setting where clustering algorithms are evaluated using labeled data, PLTM performs
well if the original class partition is also along some of the natural facets, and poorly
otherwise.
CHAPTER 7
MULTIDIMENSIONAL CLUSTERING
Variable selection methods produce single clusterings. Hence, those methods become inadequate when multiple meaningful clusterings exist in the data. The question remains whether multiple clusterings often exist in real-world data.
This chapter attempts to answer this question. We performed PLTM analysis on
seasonal statistics of National Basketball Association (NBA) players. Unlike the data
used in the previous chapter, this data does not contain any class labels. The objective
here is not to recover any target partition. Rather, our aim is to see whether PLTM can
obtain multiple meaningful clusterings. We interpret the clustering results using basic
basketball knowledge.
This chapter is organized as follows. In Section 7.1, we use an example to compare
the multidimensional clustering with the traditional single clustering approach. We then
review some work related to multi-dimensional clustering in Section 7.2. Next, we describe
the NBA data in Section 7.3. In Section 7.4, we present our findings of PLTM analysis on
this data. In Section 7.5, we compare our method with other related methods. Finally,
we discuss our results in Section 7.6.
7.1 Clustering Multifaceted Data
Suppose we want to cluster the data shown in Figure 7.1(a). The data has two attributes.
As Figures 7.1(c) and (d) show, we can obtain a meaningful clustering from either one of
the attributes. For this reason, we say that the data is multifaceted.
If we use PLTMs to cluster the data, we will likely get the PLTM shown in Fig-
ure 7.1(b). The PLTM has two latent variables. They correspond to the clusterings along
the X and Y dimensions, respectively. This approach of clustering is called multidimen-
sional clustering because of the multiple clusterings obtained along different dimensions
of data.
On the other hand, if traditional clustering methods are used, only a single clustering
can be obtained. If variable selection is used, we will likely get a 3-cluster solution on either
one of the attributes (Figure 7.1(c) and (d)). This means we will miss the meaningful
clustering on the other attribute.
[Figure 7.1: Clustering data with two facets. (a) Data points. (b) PLTM likely to be obtained: the latent variables CX(3) and CY(3) denote the cluster variables for attributes X and Y, respectively. (c) Clustering on attribute X. (d) Clustering on attribute Y. (e) Clustering on both attributes. Clustering on only one of the attributes gives 3 clusters; clustering on both attributes gives 9 clusters.]
If we cluster the data traditionally without variable selection, we will likely get a 9-
cluster solution (Figure 7.1(e)). It is the cross product of the two 3-cluster solutions on
either attribute.
Compared with the traditional approach, the multidimensional approach has two ad-
vantages. These advantages can be observed by comparing how the distribution of data
P (X, Y ) is represented by the two different approaches. Using the multidimensional ap-
proach, P (X, Y ) can be given as
    P(X, Y) = ∑_{C_X, C_Y} P(C_X) P(C_Y|C_X) P(X|C_X) P(Y|C_Y),    (7.1)
where CX and CY are two clustering variables each with 3 states. And using the traditional
approach, P (X, Y ) can be given as
    P(X, Y) = ∑_C P(C) P(X, Y|C),    (7.2)
where C is a clustering variable with 9 states.
The first advantage of the multidimensional approach is that its solution is more
comprehensible. To understand the clusterings, we can inspect the terms P (X|CX) and
P (Y |CY ) in Equation (7.1). Each of them focuses on a single dimension, and each has
only three components. On the other hand, we need to inspect the term P (X, Y |C) in
Equation (7.2) to understand the clustering obtained from the traditional approach. This
term depends on two attributes and has nine components. Hence, the terms from the
multidimensional approach are simpler than the one from the traditional approach.
The second advantage is related to the numbers of parameters needed in the solutions
of the two approaches. Due to factorization, Equation (7.1) generally needs fewer parameters than Equation (7.2). In general, model selection penalizes models with more parameters. Therefore, when the number of clusters is determined automatically through model selection, the former factorization permits a larger number of clusters to be discovered, whereas the extra parameters required by the latter usually lead model selection to prohibit a large number of clusters.
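Under the simple two-attribute setting of Figure 7.1, the parameter counts can be tallied as follows. This is a sketch under our own assumptions (univariate Gaussians with free means and variances for Equation (7.1), full-covariance bivariate Gaussians for Equation (7.2)); the exact counts depend on the covariance structure assumed.

```python
# Rough free-parameter counts for the two factorizations:
# a kx-by-ky multidimensional solution versus one k-cluster solution.

def params_multidimensional(kx, ky):
    # P(C_X): kx-1 free probabilities; P(C_Y|C_X): kx*(ky-1)
    latent = (kx - 1) + kx * (ky - 1)
    # P(X|C_X) and P(Y|C_Y): one mean and one variance per component
    gaussians = 2 * kx + 2 * ky
    return latent + gaussians

def params_traditional(k):
    # P(C): k-1; P(X,Y|C): 2 means + 3 free covariance entries per component
    return (k - 1) + k * (2 + 3)

print(params_multidimensional(3, 3))  # -> 20
print(params_traditional(9))          # -> 53
```

For the 3 × 3 versus 9-cluster comparison in the text, the multidimensional factorization needs well under half the parameters, which is why model selection tolerates more clusters under it.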
7.2 Related Work
There is some recent work that produces multiple clusterings. While PLTMs acknowl-
edge the correlations between these clusterings, most other work attempts to find multiple
clusterings that are dissimilar from each other. Two approaches are commonly used. The
first approach finds multiple clusterings sequentially. In this approach, an alternative
clustering is found based on a given clustering. The alternative clustering is made dis-
similar to the given one with conditional information bottleneck [61], by “cannot-links”
constraints [7], by orthogonal projection [37], or with an optimization problem constrained
by the similarity between clusterings [128]. The second approach finds multiple clusterings
simultaneously. It includes a method based on k-means [83] and one based on spectral
clustering [119]. Both methods find dissimilar clusterings by adding to their objective functions penalty terms that discourage similarity between the clusterings. One other method finds multiple clusterings simultaneously using the suboptimal
solutions of spectral clustering [39].
There are some other lines of work that produce multiple clusterings. Caruana et al.
[22] generate numerous clusterings by applying random weights on the attributes. The
clusterings are presented in an organized way so that users can pick the clustering they
deem the best. Subspace clustering [92] tries to identify all clusters in all subspaces.
Biclustering [106] attempts to find groups of objects that exhibit the same pattern (e.g.,
synchronous rise and fall) over a subset of attributes. Unlike our approach, the above
related work is distance-based rather than model-based.
Relatively few model-based methods have been proposed to produce multiple cluster-
ings. The GS method [55] is one of these methods (see Section 5.2). However, due to
the model structure, the obtained clusterings are assumed to be mutually independent.
Jaeger et al. [81] propose the factorial logistic model for multiple clusterings. This
model has binary manifest variables and discrete latent variables. Logistic regression has
been used to model the conditional distributions of the manifest variables given the values
of the latent variables. The number of latent variables and the cardinalities of the latent
variables are assumed to be given in their work.
Factor models also have multiple latent variables. However, most of them have con-
tinuous latent variables rather than discrete latent variables. These models explain the
unobserved heterogeneity using continuous latent scores. This is different from the mul-
tidimensional clustering approach, where discrete clusters are used to explain the unob-
served heterogeneity.
7.3 Seasonal Statistics of NBA Players
Some NBA data were used for multidimensional clustering in our study. The data were
collected from players who played in at least one game in the 2009/10 season.1 The data
1 The data were obtained from: http://www.dougstats.com/09-10RD.txt
Attr.   Description                                                     Avg.  Pos.
games   Number of games played                                          No
starts  Number of games started                                         No
min     Minutes played                                                  Yes
fgm     Field goals made (including two-point and three-point shots)    Yes
fgp     Field goal percentage (shots made divided by shots attempted)   No
3pm     Three pointers                                                  Yes
3pp     Three pointer percentage                                        No
ftm     Free throws made (penalty shots)                                Yes
ftp     Free throw percentage                                           No
off     Offensive rebounds (gaining possession of ball after missed     Yes   F/C
        shots by teammates)
def     Defensive rebounds (gaining possession of ball after missed     Yes   F/C
        shots by opponents)
blk     Blocks (deflecting field goal attempts by opponents)            Yes   F/C
stl     Steals (defensive acts of causing turnovers by opponents)       Yes   G
ast     Assists (passes leading to scores by teammates)                 Yes   G
to      Turnovers (losing control of ball to opponents)                 Yes   G
pf      Personal fouls                                                  Yes
tf      Technical fouls                                                 Yes
dq      Disqualifications                                               Yes

Table 7.1: Descriptions of the attributes of NBA data. The second column indicates whether an attribute shows an average per game. The last column indicates which position usually gets a higher value of an attribute. Positions F and C stand for forward and center, and are played by taller players. Position G stands for guard and is played by shorter players.
set has 18 attributes and 441 samples. Each sample corresponds to one player. The
attributes are described in Table 7.1.
There are three main positions of players. They are guard (G), forward (F), and center
(C). The guard position is played by shorter players, while the forward and center positions
are played by taller players. According to our understanding of basketball games, some
statistics are more likely to be higher for a particular position. They are indicated in the
last column of Table 7.1.
7.4 PLTMs on NBA Data
In this section, we show the findings of a PLTM analysis on NBA data.
7.4.1 Clusterings Obtained
The structure of the model obtained from the PLTM analysis is shown in Figure 7.2. The
model contains seven latent variables, each of which identifies a different facet of data.
Figure 7.2: PLTM obtained on NBA data. The latent variables are shown in shadednodes and represent different clusterings on the players. They have been renamed basedon our interpretation of their meanings. The abbreviations in these names stand for: role(Role), general ability (Gen), technical fouls (T), disqualification (D), tall-player ability(Tall), shooting accuracy (Acc), and three-pointer ability (3pt).
Role           P(Role)  games  starts
occasional     0.32     29.1   2.0
irreg_starter  0.11     46.1   31.7
reg_sub        0.19     68.2   5.4
reg            0.13     75.8   32.8
reg_starter    0.25     76.0   73.7
overall        1.00     56.3   27.9

(a) Role

3pt      P(3pt)  3pm   3pp
never    0.29    0.00  0.00
seldom   0.12    0.08  0.26
fair     0.17    0.33  0.28
good     0.40    1.19  0.36
extreme  0.02    0.73  0.64
overall  1.00    0.55  0.23

(b) 3pt

Acc       P(Acc)  fgp   ftp
low_ftp   0.10    0.44  0.37
low_fgp   0.16    0.39  0.72
high_ftp  0.47    0.44  0.79
high_fgp  0.28    0.52  0.67
overall   1.00    0.45  0.71

(c) Acc

Table 7.2: Attribute means conditional on the specified latent variables on NBA data. The second column of each sub-table shows the marginal distribution of the latent variable. The last row shows the unconditional means of the attributes.
The first facet consists of attributes games and starts, which are related to the role of
a player. The second one consists of attributes min, fgm, ftm, ast, stl, and to, which
are related to some general performance of a player. The third and fourth facets each
contains only one attribute. They are related to tf and dq, respectively. The fifth facet
contains attributes blk, off, def, and pf. This is related to one aspect of performance in
which taller players usually have an advantage. The sixth facet consists of two attributes
ftp and fgp, which are related to the shooting accuracy. The last facet contains 3pm and
3pp, which are related to three pointers.
7.4.2 Cluster Means
To understand the clusters, we may examine the mean values of attributes of the clusters.
We use the clusterings Role, 3pt, and Acc as an illustration.
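Tabulations like those in Table 7.2 are straightforward to compute once each player is assigned to a cluster. A hypothetical sketch follows; the tiny data frame is fabricated purely for illustration, and the thesis computes such means from the model's soft assignments rather than hard labels.

```python
# Sketch of tabulating cluster means and marginal proportions with a
# group-by, given hard cluster assignments for the latent variable Role.
# All values below are made up for illustration.
import pandas as pd

stats = pd.DataFrame({
    "games":  [29, 46, 68, 76, 76, 30],
    "starts": [2, 32, 5, 33, 74, 1],
    "Role":   ["occasional", "irreg_starter", "reg_sub",
               "reg", "reg_starter", "occasional"],
})

# Attribute means conditional on the cluster, as in Table 7.2(a)
means = stats.groupby("Role")[["games", "starts"]].mean()
# Marginal distribution of the cluster variable, as in the P(Role) column
proportions = stats["Role"].value_counts(normalize=True)
```

With soft assignments, each row would instead contribute its posterior probability P(role | d_k) as a weight.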
Table 7.2(a) shows the means of games and starts conditional on the clustering Role.
Note that there are 82 games in an NBA season. We see that players belonging to the first
cluster did not play regularly. The second group of players also played less often than
average, but they usually started in a game when they played. This cluster probably
refers to those players who had the calibre of starters but had missed part of the season
due to injuries. The third group of players played often, but usually as a substitute (not
as a starter). The fourth group of players played regularly and sometimes started in a
game. The last group contains players who played and started regularly.
Table 7.2(b) shows the means of 3pm and 3pp conditional on 3pt. The variable 3pt
partitions players into five clusters. The first two clusters contain players that never and seldom made a three-pointer, respectively. The next two clusters contain players that have fair and good three-pointer accuracies, respectively. The last group is an extreme
case. It contains players shooting with surprisingly high accuracy. As indicated by the
marginal distribution, it consists of only a very small proportion of players. This is
possibly a group of players who had made some three pointers during the sporadic games
that they had played. The accuracy remained very high since they did not play often.
Table 7.2(c) shows the means of fgp and ftp conditional on Acc. The first group
of players had particularly poor free throw percentage. The second group of players
had low field goal percentage and average free throw percentage. The third group of
players had particularly high free throw percentage, while the fourth group of players had
particularly high field goal percentage. One may expect that since both ftp and fgp are
related to the shooting accuracies, these two attributes should be positively correlated and
the last two groups may look counter-intuitive. However, it is indeed reasonable in light of one observation about the NBA. Taller players usually stay closer to the basket in games. Therefore, they take high-percentage shots more often and have a higher field goal percentage. On the
other hand, taller players are usually poorer in making free throws and have lower free
throw percentage. One typical example is Dwight Howard. He had a relatively high fgp
(0.61) but a relatively low ftp (0.59). He was classified appropriately as “high_fgp” by
PLTM.
               Gen
Role           poor  fair  good
occasional     0.81  0.19  0.00
irreg_starter  0.00  0.69  0.31
reg_sub        0.22  0.78  0.00
regular        0.00  0.81  0.19
reg_starter    0.00  0.06  0.94

(a) P(Gen|Role)

            Acc
Tall        low_ftp  low_fgp  high_ftp  high_fgp
poor        0.28     0.53     0.00      0.18
fair        0.00     0.02     0.95      0.03
good        0.00     0.00     1.00      0.00
good_big    0.11     0.04     0.00      0.86
v_good      0.00     0.00     0.14      0.86

(b) P(Acc|Tall)
Table 7.3: Conditional distributions of Gen and Acc on NBA data.
7.4.3 Relationships between Clusterings
In addition to the distributions of individual clusterings, PLTMs also model the proba-
bilistic relationships between the clusterings. Users of PLTM analysis may also find these
relationships interesting. This can be demonstrated with the following two examples.
Table 7.3(a) shows the conditional distribution P (Gen|Role). The clustering Gen con-
sists of three clusters of players with poor, fair, and good general performances, respec-
tively. We observe that those players playing occasionally were mostly poor in general,
and almost all starters played well in general. While the other three groups of players
usually played fairly, more of the irregular starters (“irreg_starter”) played well than those
who played regularly (“regular”), and none of the regular substitutes (“reg_sub”) played
well. This relationship is reasonable because a player’s general performance should be
related to the role of the player.
Table 7.3(b) shows the conditional distribution P (Acc|Tall). It is consistent with our
observation that taller players usually shoot free throws more poorly. Most players who
played very well (“v_good”) or particularly well (“good_big”) as tall players belong to the
group that has high field goal percentage but low free throw percentage (“high_fgp”). On
the other hand, those who do not play well specifically as tall players (“fair” and “good”)
usually have average field goal percentage and higher free throw percentage (“high_ftp”).
For those who played poorly as tall players, we cannot tell much about them.
Subset No.  Attributes                                                   Components
1           starts, min, fgm, 3pm, ftm, off, def, ast, stl, to, blk, pf  11
2           games, tf                                                    3
3           fgp, 3pp, ftp, dq                                            2

Table 7.4: Partition of attributes on NBA data by GS. The last column lists the number of components that a GMM has on each attribute subset.
7.5 Comparison with Other Methods
We now compare PLTM analysis with other related methods. Since we do not have
knowledge on the number of clusters or the number of clusterings, we included only those
methods that can determine these numbers automatically in our study.
7.5.1 Multiple Independent GMMs by GS
The GS method is described in Section 5.2. It is similar to PLTM analysis in that both
methods are model-based and produce multiple discrete latent variables. For comparison
between the two methods, we performed the GS method on the NBA data. The result
is shown in Table 7.4. Each row corresponds to one clustering identified by GS. It shows
the subset of attributes that the clustering depends on and the number of clusters in it.
These clusterings have three weaknesses compared with those obtained from PLTM
analysis. First, the subsets of related attributes found by GS appear to be less natural
than the facets identified by PLTM analysis. In particular, the attribute games can be
related to many aspects of the statistics, yet GS groups it in subset 2 with a less
interesting attribute, tf, which indicates the number of technical fouls. In subset 3,
while fgp, 3pp, and ftp are all related to shooting percentages, they are grouped
together with an apparently unrelated attribute, dq. Subset 1 lumps together a large
number of attributes, missing some of the more specific and meaningful facets identified
by PLTM analysis.
The second weakness concerns the numbers of clusters obtained from GS. On the one
hand, subset 1 has a large number of clusters, which makes the clustering difficult to
comprehend, especially given the many attributes in this subset. On the other hand,
subsets 2 and 3 have only a few clusters, which means that some subtle clusters found
in PLTM analysis were not found by GS.
The third weakness is inherent in the structure of the models used by GS. Since
independent GMMs are used on the subsets of attributes, the latent variables are assumed
to be independent. Consequently, the GS model cannot show those possibly meaningful
Number of factors   Degrees of freedom   χ² statistic   p-value
1                   135                  2755.72        0
2                   118                  1134.12        1.31 × 10^-165
3                   102                  660.30         1.38 × 10^-82
4                   87                   412.17         9.58 × 10^-44
5                   73                   247.15         8.91 × 10^-21
6                   60                   169.29         2.35 × 10^-12
7                   48                   119.20         5.48 × 10^-8
8                   37                   72.20          0.000471
9                   (factanal failed)

Table 7.5: Results of significance tests of whether the factor models fit the NBA data.
relationships between the clusterings as PLTMs do.
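To illustrate, the per-subset modelling in GS can be sketched as follows. This is a minimal sketch assuming scikit-learn and a given attribute subset; the actual GS method also searches over the partition of attributes, which is omitted here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_subset_gmm(data, max_components=10, seed=0):
    """Fit a GMM on one attribute subset, choosing the number of
    components by the BIC score (lower is better)."""
    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(data)
        bic = gmm.bic(data)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best

# Toy subset with two attributes and three well-separated clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.1, size=(100, 2)) for c in (0.0, 5.0, 10.0)])
model = fit_subset_gmm(data)
print(model.n_components)  # 3: BIC recovers the number of components
```

Because each subset gets its own independent GMM, nothing in this procedure can express a dependence between the resulting latent variables, which is exactly the third weakness noted above.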
7.5.2 Factor Analysis
Factor analysis produces multiple continuous latent variables. To see whether this
approach can produce similar results, we performed factor analysis on the NBA data using
the factanal method of R.2 We tried one to nine factors in the analysis. The results of
the significance tests of whether these models fit the data are shown in Table 7.5.
The results show that the highest p-value is 0.000471. This value is considerably lower
than 0.05, the threshold for accepting a model at the 95% significance level. This shows
that factor analysis could not fit the data well enough even with eight factors. Moreover,
when we tried to fit the data with more than eight factors, the method factanal failed,
possibly because nine factors are too many for data with only 18 attributes. Altogether,
the results indicate that factor analysis failed on the NBA data.
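The p-values in Table 7.5 can be reproduced from the χ² statistics and degrees of freedom reported by factanal; for instance, the eight-factor row can be checked with scipy's chi-squared survival function (an illustrative check, not part of the original analysis):

```python
from scipy.stats import chi2

# Goodness-of-fit test for the 8-factor model: chi-square statistic
# 72.20 on 37 degrees of freedom (the last successful row of Table 7.5).
p_value = chi2.sf(72.20, df=37)   # survival function = upper-tail probability
print(round(p_value, 6))          # about 0.000471

# A model is accepted at the 95% significance level only if p_value > 0.05.
assert p_value < 0.05
```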
7.5.3 LTM with Continuous Latent Variables
Choi et al. [32] propose several methods for learning tree models with continuous latent
variables when the manifest variables are continuous. Their methods produce models with
structure similar to PLTMs. Therefore, we included one of those methods for comparison.
We followed their report and chose to present the result of the CLNJ method. The
resulting model structure is shown in Figure 7.3.
The model has two latent variables. Notice that they are continuous. The latent
variable Y1 appears to be related to the shooting accuracy. It is connected to the subtree
with attributes 3pm, 3pp, and ftp. Those attributes are expected to be higher when a
player can shoot more accurately.

2 http://www.r-project.org/

Figure 7.3: Model obtained from the CLNJ method [32]. The latent variables (Y1 and Y2) are continuous in this model.

The other latent variable, Y2, appears to be related to
the performance of taller players. It is connected to the subtree with attributes def, off,
blk, and fgp. Taller players usually have an advantage in these statistics.
Although the two latent variables look reasonable, the whole model obtained from
CLNJ is less desirable than the PLTM shown in Figure 7.2 for two reasons. First, the
CLNJ model has only two latent variables, while the PLTM has seven. Therefore, its
latent variables cannot show some other facets of basketball games. For example, the
PLTM has a latent variable that explains the different roles of players, but the CLNJ
model does not.
Second, the continuous latent variables in the CLNJ model cannot explain any non-linear
relationships between the variables. For example, some players missed many games due
to injury, but they had the calibre of starters and played significant minutes when
they could play. This group of players cannot be explained by the linear relationships
in the CLNJ model, such as the positive correlation between min and starts. The PLTM,
on the other hand, has discovered a cluster of such players.
Another example can be observed from the fact that the coefficients on the edges of
the CLNJ model are all positive. This means that when a higher value of the tall-man
ability (Y2) is observed, the CLNJ model expects a higher value of ftp. This contradicts
both our understanding of basketball games and the finding of the PLTM, which suggest
that a player who plays particularly well as a tall player usually attains a
below-average free throw percentage (ftp). These two examples show that the continuous latent
variables are insufficient to model the data.
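The point about linear relationships can be illustrated with a toy example (not drawn from the thesis data): a variable that depends on another through a non-monotone function can have near-zero linear correlation with it, so a Gaussian tree model would treat the two as nearly independent.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = x ** 2   # strong but non-monotone dependence on x

# The Pearson correlation, which is essentially all a linear-Gaussian
# tree model captures about a pair of variables, is close to zero here.
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 0.05)  # True: the dependence is invisible to linear models
```

A discrete latent variable, in contrast, can split the data into clusters within which the relationship is captured, which is how the PLTM accommodates such patterns.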
7.6 Discussions
If we think about how basketball games are played, we can expect the heterogeneity
of players to originate from various aspects, such as the positions of the players, their
competence at their corresponding positions, or their general competence. As our results
show, PLTM analysis identified these different facets from the NBA data and allowed
users to partition the data based on each of them separately. This is not possible with
traditional clustering methods, with or without variable selection.
Even though the number of attributes in the NBA data is small relative to some data
sets that are available nowadays, we can still identify multiple meaningful clusterings from
it. We can expect real-world data with higher dimensions to be multifaceted as well. Hence,
it is generally more appropriate to use the multidimensional clustering approach than
the traditional clustering approach.
Our experiments also showed that factor analysis and the CLNJ method could not find
a model that explains the NBA data as well as the PLTM. This suggests that discrete latent
variables are more appropriate than continuous latent variables for explaining the multiple
heterogeneities. By allowing LTMs to work on continuous data, PLTMs thus prove to be
a useful extension of LTMs.
CHAPTER 8
CONCLUSIONS
In this chapter, we first summarize what we have done in this thesis. We then point out
some future work and possible improvements.
8.1 Summary of Work
We have made two main contributions to the research of LTMs in this thesis. The first
main contribution is that we have applied LTMs in the rounding step for spectral clus-
tering. This includes:
• We have identified a new source of information for rounding in spectral clustering.
Most work uses only the primary eigenvectors for rounding. However, we have also
used the secondary eigenvectors, and we have specifically pointed out the property of
the secondary eigenvectors that we have exploited (Proposition 4 (2)).
• We have proposed an intuitively appealing method for rounding in the ideal case
(Naive-Rounding2). This method can automatically determine the number of
clusters by using the secondary eigenvectors. It also worked perfectly on the three
synthetic data sets in our test.
• We have proposed a model-based method for rounding for the general case (LTM-
Rounding). We have shown that this method worked perfectly for the ideal case
and degraded gracefully when the data deviated from the ideal case. We have com-
pared this method with another popular method, ROT-Rounding [169]. We have
shown that LTM-Rounding worked better than ROT-Rounding on synthetic
data.
• We have used LTM-Rounding for image segmentation. We have shown that its
results were comparable to those of ROT-Rounding, and that it could find some
meaningful segments that ROT-Rounding could not.
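As a toy illustration of the ideal case (not the thesis implementation), the eigenvectors of a block-diagonal similarity matrix are piecewise constant on the clusters, so grouping identical rows of the eigenvector embedding recovers the partition directly:

```python
import numpy as np

# Ideal case: a similarity matrix that is exactly block diagonal (2 clusters).
S = np.zeros((6, 6))
S[:3, :3] = 1.0
S[3:, 3:] = 1.0

# Normalized matrix D^{-1/2} S D^{-1/2}, whose top eigenvectors are used
# for rounding in spectral clustering.
d = S.sum(axis=1)
L = S / np.sqrt(np.outer(d, d))

vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
Y = vecs[:, np.argsort(-vals)[:2]]    # embedding by the top-2 eigenvectors

# In the ideal case the rows of Y are constant within each cluster, so
# grouping identical (rounded) rows recovers the partition.
_, labels = np.unique(np.round(Y, 8), axis=0, return_inverse=True)
print(labels)  # e.g. [0 0 0 1 1 1]: two clusters recovered
```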
The second main contribution is that we have extended LTMs for continuous data.
This includes:
• We have proposed PLTMs for handling continuous data, along with an inference
algorithm and a learning algorithm for those models. While we have focused on
continuous data for PLTMs in this thesis, PLTMs should work on both continuous
and discrete data without any modification.
• We have shown that PLTMs are also a generalization of GMMs. We have pointed
out two advantages of using PLTMs instead of GMMs: PLTMs allow fewer parameters
through factorization and, more importantly, they allow multiple latent variables and
a more flexible model structure.
• We have used PLTMs to facilitate variable selection in model-based clustering.
We have demonstrated that facilitating variable selection can perform better than
doing variable selection in the traditional way. We have examined the results on four
UCI data sets to explain the performance of PLTM analysis.
• We have demonstrated the usefulness of PLTMs by performing multidimensional
clustering on NBA data. Our analysis identified several meaningful clusterings. It
also found some interesting relationships among the clusterings. Besides, we have
compared our results with those obtained from other methods. We have shown that
the discrete latent variables in a PLTM explained the heterogeneity in the data better
than the continuous latent variables in two other models. We have also shown that
the tree structure of latent variables in a PLTM worked better than the disconnected
latent variables in another model.
8.2 Future Work
Our work has pointed out some directions for future work. First, most work in spectral
clustering uses only primary eigenvectors for rounding. However, our work has shown that
the secondary eigenvectors can contain useful information and can be used for rounding.
We hope this can open a new direction for rounding in spectral clustering. Other
rounding methods may be able to improve their performance by making use of the
secondary eigenvectors.
Second, we have compared the facilitation approach and the traditional approach to
variable selection in clustering. By using the traditional approach, one implicitly hopes for
a panacea solution that would be the most meaningful one for every interest. However, our
experiments showed that if one is interested in the partitions given by the class labels,
the clusterings obtained from the traditional approach are worse than those from the
facilitation approach. This suggests that the traditional approach should be used with
caution. Instead, it is more appropriate to present multiple clusterings with different
selections of variables and allow one to choose the clustering that is most meaningful to
a particular interest.
Third, most work on clustering aims to find a single clustering. However, our analysis
of the NBA data has demonstrated that more than one meaningful clustering can be found
on data with as few as 18 attributes. Data used for clustering nowadays usually have more
attributes. Our result suggests that those data potentially contain multiple meaningful
clusterings. Therefore, future clustering work should aim for multiple clusterings rather
than a single clustering.
8.3 Possible Improvements
There are some possible improvements for our work. First, our method for rounding
requires discretization of eigenvectors, which may lead to a loss of information. It would
therefore be better to work directly with the continuous values of the eigenvectors. This
suggests future work that uses PLTMs rather than LTMs for rounding.
Second, the training of PLTMs can take a long time to complete. For example, one
run of PLTM analysis took around 5 hours on data sets of moderate size (e.g., image,
ionosphere, and wdbc data) and around 2.5 days on the largest data set (zernike data)
in our experiments. This limits the use of PLTM analysis to data with lower dimensions.
Hence, PLTM analysis is currently infeasible for data with hundreds or thousands of
attributes, such as those for text clustering and gene expression analysis. More research
is needed to improve the efficiency of the training process. Future work may consider the
variable clustering approach or the constraint-based approach for learning PLTMs.
Bibliography
[1] Raymond J. Adams, Mark Wilson, and Wen-chung Wang. The multidimensional
random coefficients multinomial logit model. Applied Psychological Measurement,
21(1):1–23, 1997.
[2] Hirotugu Akaike. A new look at the statistical model identification. IEEE Trans-
actions on Automatic Control, 19(6):716–723, December 1974.
[3] Animashree Anandkumar, Kamalika Chaudhuri, Daniel Hsu, Sham M. Kakade,
Le Song, and Tong Zhang. Spectral methods for learning multivariate latent tree
structure. In Advances in Neural Information Processing Systems, 2012.
[4] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection
and hierarchical image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 33(5):898–916, 2011.
[5] Francis R. Bach and Michael I. Jordan. Learning spectral clustering, with appli-
cation to speech separation. Journal of Machine Learning Research, 7:1963–2001,
2006.
[6] Francis R. Bach and Michael I. Jordan. Beyond independent components: Trees
and clusters. Journal of Machine Learning Research, 4:1205–1233, 2003.
[7] Eric Bae and James Bailey. COALA: A novel approach for the extraction of an
alternative clustering of high quality and high dissimilarity. In Proceedings of the
Sixth IEEE International Conference on Data Mining, 2006.
[8] David J. Bartholomew and Martin Knott. Latent Variable Models and Factor Anal-
ysis. Arnold, 2nd edition, 1999.
[9] Alexander Basilevsky. Statistical Factor Analysis and Related Methods: Theory and
Applications. J. Wiley, 1994.
[10] Francesca Bassi. Latent class factor models for market segmentation: An application
to pharmaceuticals. Statistical Methods and Applications, 16:279–287, 2007.
[11] Peter M. Bentler and David G. Weeks. Linear structural equations with latent
variables. Psychometrika, 45(3):289–308, 1980.
[12] Christopher M. Bishop and Michael E. Tipping. A hierarchical latent variable
model for data visualization. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 20(3):281–293, 1998.
[13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003.
[14] Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley, New York,
1989.
[15] Kenneth A. Bollen. Latent variables in psychology and the social sciences. Annual
Review of Psychology, 53:605–634, 2002.
[16] Eric T. Bradlow and Alan M. Zaslavsky. A hierarchical latent variable model for
ordinal data from a customer satisfaction survey with “no answer” responses. Journal
of the American Statistical Association, 94(445):43–52, 1999.
[17] Derek C. Briggs. An introduction to multidimensional measurement using Rasch
models. Journal of Applied Measurement, 4(1):87–100, 2003.
[18] Wray Buntine. Variational extensions to EM and multinomial PCA. In Proceedings
of the 13th European Conference on Machine Learning, 2002.
[19] Wray Buntine and Aleks Jakulin. Discrete component analysis. In Subspace, Latent
Structure and Feature Selection, LNCS 3940, pages 1–33. Springer-Verlag, 2006.
[20] Wray L. Buntine. Operations for learning with graphical models. Journal of Arti-
ficial Intelligence Research, 2:159–225, 1994.
[21] John Canny. GaP: a factor model for discrete data. In Proceedings of the 27th
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2004.
[22] Rich Caruana, Mohamed Elhawary, Nam Nguyen, and Casey Smith. Meta cluster-
ing. In Proceedings of the Sixth International Conference on Data Mining, 2006.
[23] Peter Cheeseman and John Stutz. Bayesian classification (AutoClass): Theory and
results. In Advances in Knowledge Discovery and Data Mining, pages 153–180.
AAAI Press, 1996.
[24] Tao Chen. Search-Based Learning of Latent Tree Models. PhD thesis, Department
of Computer Science and Engineering, The Hong Kong University of Science and
Technology, 2009.
[25] Tao Chen and Nevin L. Zhang. Quartet-based learning of hierarchical latent class
models: Discovery of shallow latent variables. In 9th International Symposium of
Artificial Intelligence and Mathematics, 2006.
[26] Tao Chen, Nevin L. Zhang, and Yi Wang. Efficient model evaluation in the search-
based approach to latent structure discovery. In Proceedings of the Fourth European
Workshop on Probabilistic Graphical Models, pages 57–64, 2008.
[27] Tao Chen, Nevin L. Zhang, and Yi Wang. The role of operation granularity in
search-based learning of latent tree models. In The First International Workshop
on Advanced Methodologies for Bayesian Networks, 2010.
[28] Tao Chen, Nevin L. Zhang, Tengfei Liu, Kin Man Poon, and Yi Wang. Model-
based multidimensional clustering of categorical data. Artificial Intelligence, 176:
2246–2269, 2012.
[29] Jie Cheng, Russell Greiner, Jonathan Kelly, David Bell, and Weiru Liu. Learning
Bayesian networks from data: An information-theory based approach. Artificial
Intelligence, 137(1–2):43–90, 2002.
[30] David Maxwell Chickering. Optimal structure identification with greedy search.
Journal of Machine Learning Research, 3:507–554, 2002.
[31] David Maxwell Chickering and David Heckerman. Efficient approximations for the
marginal likelihood of Bayesian networks with hidden variables. Machine Learning,
29:181–212, 1997.
[32] Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky.
Learning latent tree graphical models. Journal of Machine Learning Research, 12:
1771–1812, 2011.
[33] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with
dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[34] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley,
2nd edition, 2006.
[35] Robert G. Cowell. Local propagation in conditional Gaussian Bayesian networks.
Journal of Machine Learning Research, 6:1517–1550, 2005.
[36] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter.
Probabilistic Networks and Expert Systems. Springer, 1999.
[37] Ying Cui, Xiaoli Z. Fern, and Jennifer G. Dy. Non-redundant multi-view clustering
via orthogonalization. In Proceedings of the Seventh IEEE International Conference
on Data Mining, 2007.
[38] Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge
University Press, 2009.
[39] Sajib Dasgupta and Vincent Ng. Mining clustering dimensions. In Proceedings of
the 27th International Conference on Machine Learning, 2010.
[40] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological), 39(1):1–38, 1977.
[41] Sara Dolnicar. A review of data-driven market segmentation in tourism. Journal of
Travel and Tourism Marketing, 12(1):1–22, 2002.
[42] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-
Interscience, 2nd edition, 2000.
[43] Jennifer G. Dy and Carla E. Brodley. Feature selection for unsupervised learning.
Journal of Machine Learning Research, 5:845–889, 2004.
[44] Gal Elidan and Nir Friedman. Learning the dimensionality of hidden variables. In
Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages
144–151, 2001.
[45] Gal Elidan, Noam Lotner, Nir Friedman, and Daphne Koller. Discovering hidden
variables: A structure-based approach. In Advances in Neural Information Process-
ing Systems, 2001.
[46] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based
algorithm for discovering clusters in large spatial databases with noise. In Proceed-
ings of the 2nd International Conference on Knowledge Discovery and Data Mining,
1996.
[47] Mario A. T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture
models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):
381–396, 2002.
[48] Ernest Fokoué and D. M. Titterington. Mixtures of factor analysers. Bayesian
estimation and inference by stochastic simulation. Machine Learning, 50:73–97,
2003.
[49] Jaime R. S. Fonseca and Margarida G. M. S. Cardoso. Retail clients latent segments.
In Progress in Artificial Intelligence, pages 348–358. Springer, 2005.
[50] Jaime R. S. Fonseca and Margarida G. M. S. Cardoso. Mixture-model cluster
analysis using information theoretical criteria. Intelligent Data Analysis, 11:155–
173, 2007.
[51] Chris Fraley and Adrian E. Raftery. Model-based clustering, discriminant analysis,
and density estimation. Journal of American Statistical Association, 97(458):611–
631, 2002.
[52] Chris Fraley and Adrian E. Raftery. MCLUST version 3 for R: Normal mixture
modeling and model-based clustering. Technical Report 504, Department of Statis-
tics, University of Washington, 2006. (revised 2009).
[53] Nir Friedman. Learning belief networks in the presence of missing values and hidden
variables. In Proceedings of the 14th International Conference on Machine Learning,
1997.
[54] Sylvia Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer,
2006.
[55] Giuliano Galimberti and Gabriele Soffritti. Model-based methods to identify mul-
tiple cluster structures in a data set. Computational Statistics and Data Analysis,
52:520–536, 2007.
[56] Guojun Gan, Chaoqun Ma, and Jianhong Wu. Data Clustering: Theory, Algo-
rithms, and Applications. ASA-SIAM, 2007.
[57] Dan Geiger and David Heckerman. Learning Gaussian networks. Technical Report
MSR-TR-94-10, Microsoft Research, 1994.
[58] Zoubin Ghahramani. An introduction to hidden Markov models and Bayesian net-
works. International Journal of Pattern Recognition and Artificial Intelligence, 15
(1):9–42, 2001.
[59] Zoubin Ghahramani and Matthew J. Beal. Variational inference for Bayesian
mixtures of factor analysers. In Advances in Neural Information Processing Systems
12, 2000.
[60] Debashis Ghosh and Arul M. Chinnaiyan. Mixture modelling of gene expression
data from microarray experiments. Bioinformatics, 18(2):275–286, 2002.
[61] David Gondek and Thomas Hofmann. Non-redundant data clustering. In Proceed-
ings of the Fourth IEEE International Conference on Data Mining, 2004.
[62] Leo A. Goodman. Exploratory latent structure analysis using both identifiable and
unidentifiable models. Biometrika, 61(2):215–231, 1974.
[63] Peter J. Green. Penalized likelihood. In Encyclopedia of Statistical Science, Update
Volume 3, pages 578–586, 1999.
[64] T. Haavelmo. The statistical implications of a system of simultaneous equations.
Econometrica, 11:1–12, 1943.
[65] Lars Hagen and Andrew B. Kahng. A new approach to effective circuit clustering.
In Proceedings of IEEE International Conference on Computer Aided Design, pages
422–427, 1992.
[66] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, San Francisco, 2000.
[67] Harry H. Harman. Modern Factor Analysis. The University of Chicago Press, 3rd
edition, 1976.
[68] Stefan Harmeling and Christopher K. I. Williams. Greedy learning of binary latent
trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6):1087–
1097, 2011.
[69] David Heckerman. A tutorial on learning with Bayesian networks. Technical Report
MSR-TR-95-06, Microsoft Research, 1995.
[70] A. E. Henderickson and P. O. White. PROMAX: A quick method for rotation to
oblique simple structure. British Journal of Mathematical and Statistical Psychol-
ogy, 17:65–70, 1964.
[71] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, 2006.
[72] Geoffrey E. Hinton, Michael Revow, and Peter Dayan. Recognizing handwritten
digits using mixtures of linear models. In Advances in Neural Information Processing
Systems 7, 1995.
[73] Peter D. Hoff. Model-based subspace clustering. Bayesian Analysis, 1(2):321–344,
2006.
[74] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the
22nd Annual International SIGIR Conference on Research and Development in
Information Retrieval, 1999.
[75] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis.
Machine Learning, 42:177–196, 2001.
[76] Frank Höppner, Frank Klawonn, Rudolf Druse, and Thomas Runkler. Fuzzy Cluster
Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley,
1999.
[77] Lynette Hunt and Murray Jorgensen. Mixture model clustering using the MUL-
TIMIX program. Australian & New Zealand Journal of Statistics, 41(2):154–171,
1999.
[78] Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and
applications. Neural Networks, 13:411–430, 2000.
[79] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis.
John Wiley & Sons, 2001.
[80] Salvatore Ingrassia. A likelihood-based constrained algorithm for multivariate nor-
mal mixture models. Statistical Methods and Applications, 13(2):151–166, 2004.
[81] Manfred Jaeger, Simon Lyager, Michael Vandborg, and Thomas Wohlgemuth. Fac-
torial clustering with an application to plant distribution data. In Proceedings of
the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clus-
terings, pages 31–42, 2011.
[82] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM
Computing Surveys, 31(3):264–323, 1999.
[83] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Simultaneous unsupervised
learning of disparate clusterings. In Proceedings of the Seventh SIAM International
Converence on Data Mining, pages 858–869, 2008.
[84] Kamel Jedidi, Harsharanjeet S. Jagpal, and Wayne S. DeSarbo. Finite-mixture
structural equational models for response-based segmentation and unobserved het-
erogeneity. Marketing Science, 16(1):39–59, 1997.
[85] I. T. Jolliffe. Principal Component Analysis. Springer, 2002.
[86] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the
EM algorithm. Neural Computation, 6(2):181–214, 1994.
[87] K. G. Jöreskog. Structural equation models in the social sciences: Specification,
estimation and testing. In Applications of Statistics, pages 265–287, 1977.
[88] H. F. Kaiser. The Varimax criterion for analytic rotation in factor analysis.
Psychometrika, 23:187–200, 1958.
[89] David J. Ketchen, Jr. and Christopher L. Shook. The application of cluster analysis
in strategic management research: An analysis and critique. Strategic Management
Journal, 17:441–458, 1996.
[90] B. King. Step-wise clustering procedures. Journal of American Statistical Associa-
tion, 69:86–101, 1967.
[91] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and
Techniques. The MIT Press, 2009.
[92] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Clustering high dimensional
data: A survey on subspace clustering, pattern-based clustering, and correlation
clustering. ACM Transactions on Knowledge Discovery from Data, 3(1):1–58, 2009.
[93] James K. Lake. Reconstructing evolutionary trees from DNA and protein sequences:
Paralinear distances. Proceedings of the National Academy of Sciences, 91:1455–1459,
1994.
[94] Helge Langseth and Thomas D. Nielsen. Classification using hierarchical naïve
Bayes models. Machine Learning, 63:135–159, 2006.
[95] Steffen L. Lauritzen. Propagation of probabilities, means, variances in mixed graph-
ical association models. Journal of the American Statistical Association, 87(420):
1098–1108, 1992.
[96] Steffen L. Lauritzen and Frank Jensen. Stable local computation with conditional
Gaussian distributions. Statistics and Computing, 11:191–203, 2001.
[97] Martin H. C. Law, Mário A. T. Figueiredo, and Anil K. Jain. Simultaneous feature
selection and clustering using mixture models. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(9):1154–1166, 2004.
[98] David F. Layton and Richard A. Levine. How much does the far future matter?
A hierarchical Bayesian analysis of the public’s willingness to mitigate ecological
impacts of climate change. Journal of the American Statistical Association, 98
(463):533–544, 2003.
[99] Paul F. Lazarsfeld and Neil W. Henry. Latent Structure Analysis. Houghton Mifflin,
Boston, 1968.
[100] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401:788–791, 1999.
[101] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factor-
ization. In Advances in Neural Information Processing Systems 13, 2001.
[102] Yuanhong Li, Ming Dong, and Jing Hua. Simultaneous localized feature selection
and model detection for Gaussian mixtures. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 31(5):953–960, 2009.
[103] Jun S. Liu, Junni L. Zhang, Michael J. Palumbo, and Charles E. Lawrence. Bayesian
clustering with variable and transformation selections (with discussion). Bayesian
Statistics, 7:249–275, 2003.
[104] John C. Loehlin. Latent Variable Models : An Introduction to Factor, Path, and
Structural Equation Analysis. L. Erlbaum Associates, 2004.
[105] J. MacQueen. Some methods for classification and analysis of multivariate
observations. In Proceedings of the Fifth Berkeley Symposium, volume 1, pages 281–297,
1967.
[106] Sara C. Madeira and Arlindo L. Oliveira. Biclustering algorithms for biological data
analysis: A survey. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 1(1):24–45, 2004.
[107] Jay Magidson and Jeroen K. Vermunt. Latent class factor and cluster models,
bi-plots, and related graphical displays. Sociological Methodology, 31:221–264, 2001.
[108] Cathy Maugis, Gilles Celeux, and Marie-Laure Martin-Magniette. Variable selection
for clustering with Gaussian mixture models. Biometrics, 65:701–709, 2009.
[109] Cathy Maugis, Gilles Celeux, and Marie-Laure Martin-Magniette. Variable selec-
tion in model-based clustering: A general variable role modeling. Computational
Statistics and Data Analysis, 53:3872–3882, 2009.
[110] G. J. McLachlan, R. W. Bean, and D. Peel. A mixture model-based approach to
the clustering of microarray expression data. Bioinformatics, 18(3):413–422, 2002.
[111] Geoffrey J. McLachlan and David Peel. Finite Mixture Models. Wiley, New York,
2000.
[112] Geoffrey J. McLachlan and David Peel. Mixtures of factor analyzers. In Proceedings
of the 17th International Conference on Machine Learning, 2000.
[113] Christopher Meek. Graphical Models: Selecting Causal and Statistical Models. PhD
thesis, Carnegie Mellon University, 1997.
[114] Marina Meila. Comparing clusterings—an information based distance. Journal of
Multivariate Analysis, 98:873–895, 2007.
[115] Marina Meila and Michael I. Jordan. Learning with mixtures of trees. Journal of
Machine Learning Research, 1:1–48, 2000.
[116] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for
approximate inference: An empirical study. In Proceedings of the 15th Conference
on Uncertainty in Artificial Intelligence, 1999.
[117] Bengt O. Muthén. Beyond SEM: General latent variable modeling. Behaviormetrika,
29(1):81–117, 2002.
[118] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis
and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.
[119] Donglin Niu, Jennifer G. Dy, and Michael I. Jordan. Multiple non-redundant spec-
tral clustering views. In Proceedings of the 27th International Conference on Ma-
chine Learning, 2010.
[120] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to
variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.
[121] Blossom H. Patterson, C. Mitchell Dayton, and Barry I. Graubard. Latent class
analysis of complex sample survey data: Application to dietary data. Journal of
the American Statistical Association, 97(459):721–741, 2002.
[122] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers, San Mateo, California, 1988.
[123] Leonard K. M. Poon, Nevin L. Zhang, Tengfei Liu, and April H. Liu. Variable
selection in model-based clustering: To do or to facilitate. International Journal of
Approximate Reasoning. Accepted with minor revisions.
[124] Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Using Bayesian
networks for model-based multiple clusterings: An example of exploratory analysis
on NBA data. In The 1st International Workshop on Advanced Methodologies for
Bayesian Networks, 2010.
[125] Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Variable selec-
tion in model-based clustering: To do or to facilitate. In Proceedings of the 27th
International Conference on Machine Learning, 2010.
[126] Leonard K. M. Poon, April H. Liu, Tengfei Liu, and Nevin L. Zhang. A model-based
approach to rounding in spectral clustering. In Proceedings of the 28th Conference
on Uncertainty in Artificial Intelligence, 2012.
[127] Girish Punj and David W. Stewart. Cluster analysis in marketing research: Review
and suggestions for application. Journal of Marketing Research, 20:134–148, 1983.
[128] ZiJie Qi and Ian Davidson. A principled and flexible framework for finding alterna-
tive clusterings. In Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2009.
[129] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[130] Adrian E. Raftery and Nema Dean. Variable selection for model-based clustering.
Journal of the American Statistical Association, 101(473):168–178, 2006.
[131] Venkatram Ramaswamy and Steven H. Cohen. Latent class models for conjoint
analysis. In Conjoint Measurement, pages 295–319. Springer, 2007.
[132] William M. Rand. Objective criteria for the evaluation of clustering methods. Jour-
nal of the American Statistical Association, 66(336):846–850, 1971.
[133] Georg Rasch. On general laws and the meaning of measurement in psychology. In
Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Proba-
bility, volume 4, 1961.
[134] Nicola Rebagliati and Alessandro Verri. Spectral clustering with more than k eigen-
vectors. Neurocomputing, 74:1391–1401, 2011.
[135] Mark D. Reckase. Multidimensional Item Response Theory. Springer, 2009.
[136] Franz Rendl and Henry Wolkowicz. A projection technique for partitioning the
nodes of a graph. Annals of Operations Research, 58(3):155–179, 1995.
[137] Sam Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information
Processing Systems 10, pages 626–632, 1998.
[138] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models.
Neural Computation, 11:305–345, 1999.
[139] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 1995.
[140] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6
(2):461–464, 1978.
[141] Steven L. Scott and Edward H. Ip. Empirical Bayes and item-clustering effects in
a latent variable hierarchical model: A case study from the National Assessment
of Educational Progress. Journal of the American Statistical Association, 97(458):
409–419, 2002.
[142] Ross D. Shachter and C. Robert Kenley. Gaussian influence diagrams. Management
Science, 35(5):527–550, 1989.
[143] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 731–737, 1997.
[144] Ricardo Silva, Richard Scheines, Clark Glymour, and Peter Spirtes. Learning the
structure of linear latent variable models. Journal of Machine Learning Research,
7:191–246, 2006.
[145] Anders Skrondal and Sophia Rabe-Hesketh. Generalized Latent Variable Modeling:
Multilevel, Longitudinal, and Structural Equation Models. Chapman & Hall/CRC,
2004.
[146] Anders Skrondal and Sophia Rabe-Hesketh. Latent variable modelling: A survey.
Scandinavian Journal of Statistics, 34(4):712–745, 2007.
[147] P. Sneath. The application of computers to taxonomy. Journal of General
Microbiology, 17:201–226, 1957.
[148] Richard Socher, Andrew Maas, and Christopher D. Manning. Spectral Chinese
restaurant processes: Nonparametric clustering based on similarities. In 14th Inter-
national Conference on Artificial Intelligence and Statistics, 2011.
[149] Charles Spearman. General intelligence, objectively determined and measured.
American Journal of Psychology, 15:201–293, 1904.
[150] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and
Search. The MIT Press, 2nd edition, 2000.
[151] Douglas Steinley and Michael J. Brusco. Selection of variables in cluster analysis:
An empirical comparison of eight procedures. Psychometrika, 73(1):125–144, 2008.
[152] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse
framework for combining multiple partitions. Journal of Machine Learning Re-
search, 3:583–617, 2002.
[153] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition. Elsevier,
4th edition, 2009.
[154] Bo Thiesson, Christopher Meek, David Maxwell Chickering, and David Heckerman.
Learning mixtures of Bayesian networks. In Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, 1998.
[155] Michael E. Tipping and Christopher M. Bishop. Mixtures of probabilistic principal
component analyzers. Neural Computation, 11:443–482, 1999.
[156] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component
analysis. Journal of the Royal Statistical Society, Series B (Statistical Methodology),
61(3):611–622, 1999.
[157] Wim J. van der Linden and Ronald K. Hambleton, editors. Handbook of Modern
Item Response Theory. Springer, 1997.
[158] Peter van der Putten and Maarten van Someren. A bias-variance analysis of a
real world learning problem: The CoIL challenge 2000. Machine Learning, 57(1–2):
177–195, 2004.
[159] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing,
17:395–416, 2007.
[160] Ralf Wagner, Sören W. Scholz, and Reinhold Decker. The number of clusters in mar-
ket segmentation. In Data Analysis and Decision Support, pages 157–176. Springer,
2005.
[161] Yi Wang. Latent Tree Models for Multivariate Density Estimation: Algorithms and
Applications. PhD thesis, Department of Computer Science and Engineering, The
Hong Kong University of Science and Technology, 2009.
[162] Yi Wang, Nevin L. Zhang, and Tao Chen. Latent tree models and approximate
inference in Bayesian networks. Journal of Artificial Intelligence Research, 32:879–
900, 2008.
[163] Yi Wang, Nevin L. Zhang, Tao Chen, and Leonard K. M. Poon. Latent tree classi-
fier. In Proceedings of the 11th European Conference on Symbolic and Quantitative
Approaches to Reasoning with Uncertainty, 2011.
[164] Michel Wedel and Wagner A. Kamakura. Market Segmentation: Conceptual and
Methodological Foundations. Kluwer Academic Publishers, 2nd edition, 2000.
[165] Sewall Wright. On the nature of size factors. Genetics, 3:367–374, 1918.
[166] Tao Xiang and Shaogang Gong. Spectral clustering with eigenvector selection. Pat-
tern Recognition, 41(3):1012–1029, 2008.
[167] Rui Xu and Donald C. Wunsch, II. Clustering. Wiley-IEEE Press, 2009.
[168] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. Model-based
clustering and data transformations for gene expression data. Bioinformatics, 17
(10):977–987, 2001.
[169] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In Advances
in Neural Information Processing Systems, 2005.
[170] Hong Zeng and Yiu-Ming Cheung. A new feature selection method for Gaussian
mixture clustering. Pattern Recognition, 42:243–250, 2009.
[171] Nevin L. Zhang. Hierarchical latent class models for cluster analysis. In Proceedings
of the 18th National Conference on Artificial Intelligence, 2002.
[172] Nevin L. Zhang. Hierarchical latent class models for cluster analysis. Journal of
Machine Learning Research, 5:697–723, 2004.
[173] Nevin L. Zhang and Tomáš Kočka. Efficient learning of hierarchical latent class
models. In Proceedings of the 16th IEEE International Conference on Tools with
Artificial Intelligence, pages 585–593, 2004.
[174] Nevin L. Zhang, Thomas D. Nielsen, and Finn V. Jensen. Latent variable discovery
in classification models. Artificial Intelligence in Medicine, 30:283–299, 2004.
[175] Nevin L. Zhang, Yi Wang, and Tao Chen. Discovery of latent structures: Experience
with the CoIL challenge 2000 data set. Journal of Systems Science and Complexity,
21:172–183, 2008.
[176] Nevin L. Zhang, Shihong Yuan, Tao Chen, and Yi Wang. Latent tree models and
diagnosis in traditional Chinese medicine. Artificial Intelligence in Medicine, 42:
229–245, 2008.
[177] Nevin L. Zhang, Shihong Yuan, Tao Chen, and Yi Wang. Statistical validation of
TCM theories. Journal of Alternative and Complementary Medicine, 14(5):583–587,
2008.
[178] Zhihua Zhang and Michael I. Jordan. Multiway spectral clustering: A margin-based
perspective. Statistical Science, 23(3):383–403, 2008.
[179] Feng Zhao, Licheng Jiao, Hanqiang Liu, Xinbo Gao, and Maoguo Gong. Spectral
clustering with eigenvector selection based on entropy ranking. Neurocomputing,
73:1704–1717, 2010.
[180] Shi Zhong and Joydeep Ghosh. A unified framework for model-based clustering.
Journal of Machine Learning Research, 4:1001–1037, 2003.
APPENDIX A
LIST OF PUBLICATIONS BY THE AUTHOR
• Tao Chen, Nevin L. Zhang, Tengfei Liu, Kin Man Poon, and Yi Wang. Model-
based multidimensional clustering of categorical data. Artificial Intelligence, 176:
2246–2269, 2012.
• Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, and Hua Liu. Fast
multidimensional clustering of categorical data. In Proceedings of the 2nd MultiClust
Workshop: Discovering, Summarizing and Using Multiple Clusterings, 2011.
• Leonard K. M. Poon, Nevin L. Zhang, Tengfei Liu, and April H. Liu. Variable
selection in model-based clustering: To do or to facilitate. International Journal of
Approximate Reasoning. Accepted with minor revision.
• Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Using Bayesian
networks for model-based multiple clusterings: An example of exploratory analysis
on NBA data. In The 1st International Workshop on Advanced Methodologies for
Bayesian Networks, 2010.
• Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Variable selec-
tion in model-based clustering: To do or to facilitate. In Proceedings of the 27th
International Conference on Machine Learning, 2010.
• Leonard K. M. Poon, April H. Liu, Tengfei Liu, and Nevin L. Zhang. A model-
based approach to rounding in spectral clustering. In Proceedings of the 28th Con-
ference on Uncertainty in Artificial Intelligence, 2012.
• Yi Wang, Nevin L. Zhang, Tao Chen, and Leonard K. M. Poon. Latent tree
classifier. In Proceedings of the 11th European Conference on Symbolic and Quanti-
tative Approaches to Reasoning with Uncertainty, 2011.
• Yi Wang, Nevin L. Zhang, Tao Chen, and Leonard K. M. Poon. LTC: A Latent
Tree Approach To Classification. International Journal of Approximate Reasoning.
Accepted with minor revision.