latent tree models: an application and an extensionlzhang/paper/pspdf/poon-thesis.pdf · latent...

LATENT TREE MODELS:AN APPLICATION AND AN EXTENSION

by

KIN-MAN POON

A Thesis Submitted toThe Hong Kong University of Science and Technology

in Partial Fulfillment of the Requirements forthe Degree of Doctor of Philosophy

in Computer Science

August 2012, Hong Kong

Copyright c© by Kin-Man Poon 2012

Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to

other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce

the thesis by photocopying or by other means, in total or in part, at the request of other

institutions or individuals for the purpose of scholarly research.

KIN-MAN POON

ii


by

KIN-MAN POON

This is to certify that I have examined the above Ph.D. thesis

and have found that it is complete and satisfactory in all respects,

and that any and all revisions required by

the thesis examination committee have been made.

PROF. NEVIN L. ZHANG, THESIS SUPERVISOR

PROF. MOUNIR HAMDI, HEAD OF DEPARTMENT

Department of Computer Science and Engineering

1 August 2012

iii

ACKNOWLEDGMENTS

I would like to take this opportunity to express my great gratitude to my supervisor,

Prof. Nevin L. Zhang, for his guidance, encouragement, and support throughout my PhD

study. I sincerely appreciate his help in suggesting research topics, revising my papers,

and preparing my presentations. This thesis would not have been completed without him.

I would like to thank my proposal and thesis examination committee members: Dr.

habil. Manfred Jaeger (from Department for Computer Science at Aalborg University),

Prof. Bing-Yi Jing (from Department of Mathematics), Prof. Dit-Yan Yeung, Prof. Brian

Mak, and Prof. James Kwok. I am grateful for their attendance and insightful comments.

Thanks are also due to my colleagues at HKUST for their encouragement, friendship,

and useful discussions. They include Tao Chen, Yi Wang, Tengfei Liu, and April Liu.

Finally, I am indebted to my parents and my wife for their love, patience, understand-

ing, and support. And I would like to include a quote here to express my gratefulness to

the invisible One:

“How can I repay the LORD

for all the great good done for me?”

— Psalm 116:12, NAB

iv

TABLE OF CONTENTS

Title Page i

Authorization Page ii

Signature Page iii

Acknowledgments iv

Table of Contents v

List of Figures ix

List of Tables xi

Abstract xii

Chapter 1 Introduction 1

1.1 Clustering 1

1.1.1 Categories of Clustering Algorithms 2

1.1.2 Model-Based Clustering 3

1.1.3 Multidimensional Clustering 4

1.1.4 Spectral Clustering 5

1.2 Contributions 7

1.3 Organization 7

Chapter 2 Latent Variable Models 9

2.1 Measurement Models 10

2.2 Confirmatory and Exploratory Analyses 11

2.3 Mixture Models 11

2.4 Models with Multiple Latent Variables 12

2.4.1 Structural Equation Models 13

2.4.2 Multidimensional Measurement Models 13

2.4.3 Mixtures of Latent Variable Models 15

2.4.4 General Models 16

2.5 Latent Tree Models 16

2.6 Summary of Models 17

v

Chapter 3 Latent Tree Models 19

3.1 Notations 19

3.2 Bayesian Networks 19

3.3 Latent Tree Models 21

3.3.1 Model Scores 21

3.3.2 Model Equivalence 22

3.3.3 Root Walking 22

3.3.4 Regularity 23

3.4 Parameter Estimation 24

3.5 Learning Algorithms 25

3.5.1 Score-Based Methods 26

3.5.2 Constraint-Based Methods 33

3.5.3 Variable Clustering Methods 37

3.5.4 Comparison between Approaches 40

3.6 Applications 40

3.6.1 Multidimensional Clustering 41

3.6.2 Latent Structure Discovery 43

3.6.3 Density Estimation 43

3.6.4 Classification 44

3.6.5 Domains of Applications 46

Chapter 4 Application: Rounding in Spectral Clustering 48

4.1 Related Work 49

4.2 Basics of Spectral Clustering 50

4.2.1 Similarity Measure and Similarity Graph 50

4.2.2 Graph Laplacian 51

4.2.3 The Ideal Case 52

4.2.4 Spectral Clustering 53

4.2.5 Two Properties 54

4.3 A Naive Method for Rounding 55

4.3.1 Binarization of Eigenvectors 55

4.3.2 Rounding by Overlaying Partitions 56

4.3.3 Determining the Number of Eigenvectors to Use 57

4.4 Latent Class Models for Rounding 59

4.4.1 Known Number of Clusters 60

vi

4.4.2 Unknown Number of Clusters 62

4.5 Latent Tree Models for Rounding 62

4.6 Empirical Evaluation on Synthetic Data 65

4.6.1 Performance in the Ideal Case 65

4.6.2 Graceful Degrading of Performance 66

4.6.3 Impact of an Assumption 68

4.6.4 Sensitivity Study 69

4.6.5 Running Time 70

4.7 Comparison with Alternative Methods 71

4.7.1 Synthetic Data 71

4.7.2 MNIST Digits Data 72

4.7.3 Image Segmentation 73

4.8 Conclusions 74

Chapter 5 Extension: Pouch Latent Tree Models 76

5.1 Pouch Latent Tree Models 76

5.2 Related Work 79

5.3 Inference 80

5.3.1 Clique Tree Propagation 80

5.3.2 Complexity 83

5.4 Parameter Estimation 83

5.5 Structure Learning 84

5.5.1 Search Operators 85

5.5.2 Search Phases 87

5.5.3 Operation Granularity 88

5.5.4 Efficient Model Evaluation 88

5.5.5 EAST-PLTM 89

5.6 Conclusions 90

Chapter 6 Variable Selection in Clustering 91

6.1 To Do or To Facilitate 91

6.2 Experimental Setup 93

6.2.1 Data Sets and Algorithms 93

6.2.2 Method of Comparison 94

6.3 Results 95

6.3.1 Synthetic Data 96

vii

6.3.2 Image Data 98

6.3.3 Wine Data 100

6.3.4 WDBC Data 101

6.3.5 Discussions 102

Chapter 7 Multidimensional Clustering 104

7.1 Clustering Multifaceted Data 104

7.2 Related Work 106

7.3 Seasonal Statistics of NBA Players 107

7.4 PLTMs on NBA Data 108

7.4.1 Clusterings Obtained 108

7.4.2 Cluster Means 110

7.4.3 Relationships between Clusterings 111

7.5 Comparison with Other Methods 112

7.5.1 Multiple Independent GMMs by GS 112

7.5.2 Factor Analysis 113

7.5.3 LTM with Continuous Latent Variables 113

7.6 Discussions 115

Chapter 8 Conclusions 116

8.1 Summary of Work 116

8.2 Future Work 117

8.3 Possible Improvements 118

Appendix A List of Publications by the Author 134

viii

LIST OF FIGURES

1.1 An example of dendrogram. 2

1.2 Data points in the original feature space and the transformed eigenspacein spectral clustering. 6

2.1 Structures of latent class models, latent trait models, and latent profilemodels; and factor models. 10

2.2 Model structure of a finite mixture model. 12

2.3 Structures of the two types of multidimensional item response theory mod-els. 14

2.4 A latent tree model as an extension to a latent class model. 16

3.1 An example of Bayesian network. 20

3.2 An example of latent tree model, root walking, and unrooted model. 21

3.3 Examples of applying the node introduction, node deletion, and noderelocation operators. 27

3.4 Four possible resulting structures of quartet test. 33

3.5 An example of information curves. 42

4.1 Examples of eigenvectors in spectral clustering. 53

4.2 Examples of binary vectors in spectral clustering. 56

4.3 Illustration of Naive-Rounding2. 58

4.4 Latent class model for rounding in spectral clustering. 59

4.5 Latent tree model for rounding. 63

4.6 Synthetic data set for ideal case 66

4.7 LTM-Rounding and ROT-Rounding on synthetic data for non-idealcase. 67

4.8 Partitions obtained by LTM-Rounding1. 69

4.9 Sensitivity analysis on the parameters δ and K in LTM-Rounding. 70

4.10 Image segmentation results by LTM-Rounding and ROT-Rounding. 75

5.1 An example of PLTM. The numbers in parentheses show the cardinalitiesof the discrete variables. 77

5.2 Generative model for synthetic data. 77

5.3 A Gaussian mixture model as a special case of PLTM. 78

5.4 Examples of node introduction, node deletion, and node relocation inPTLMs. 86

5.5 Examples of pouching and unpouching in PLTMs. 87

ix

6.1 Feature curves on synthetic data. 97

6.2 Structure of the PLTM learned from image data. 98

6.3 Feature curves on image data. 99

6.4 Structure of the PLTM learned from wine data. 100

6.5 Structure of the PLTM learned from wdbc data. 101

6.6 Feature curves on wdbc data. 102

7.1 Clustering multifaceted data. 105

7.2 PLTM obtained on NBA data. 109

7.3 Model obtained from the CLNJ method on NBA data. 114

x

LIST OF TABLES

2.1 Summary of latent variable models. 18

4.1 Performances of various rounding methods on synthetic data. 68

4.2 Comparison of LTM-Rounding and LTM-Rounding1. 69

4.3 Comparison of various rounding methods on MNIST digits data. 72

5.1 Discrete distributions in Example 1. 77

6.1 Descriptions of UCI data sets used in our experiments. 93

6.2 Clustering performances as measured by NMI. 96

6.3 Confusion matrix for PLTM on wdbc data. 101

7.1 Attributes on NBA data. 108

7.2 Attribute means conditional on the specified latent variables on NBA data. 109

7.3 Conditional distributions of Gen and Acc on NBA data. 111

7.4 Partition of attributes on NBA data by GS. 112

7.5 Results from significance tests whether the factor models fit NBA data. 113

xi


by

KIN-MAN POON

Department of Computer Science and Engineering

The Hong Kong University of Science and Technology

ABSTRACT

Latent tree models are a class of probabilistic graphical models. These models have a tree

structure, in which the internal nodes represent latent variables whereas the leaf nodes

represent manifest variables. They allow fast inference and have been used for multidi-

mensional clustering, latent structure discovery, density estimation, and classification.

This thesis makes two contributions to the research of latent tree models. The first

contribution is a new application of latent tree models in spectral clustering. In spectral

clustering, one defines a similarity matrix for a collection of data points, transforms the

matrix to get the so-called Laplacian matrix, finds the eigenvectors of the Laplacian

matrix, and obtains a partition of the data points using the leading eigenvectors. The

last step is sometimes referred to as rounding, where one needs to decide how many leading

eigenvectors to use, to determine the number of clusters, and to partition the data points.

We propose to use latent tree models for rounding. The method differs from previous

rounding methods in three ways. First, we relax the assumption that the number of

clusters equals the number of eigenvectors used. Second, when deciding how many leading

eigenvectors to use, we not only rely on information contained in the leading eigenvectors

themselves, but also make use of the subsequent eigenvectors. Third, our method is

model-based and solves all the three subproblems of rounding using latent tree models.

We evaluate our method on both synthetic and real-world data. The results show that

our method works correctly in the ideal case where between-clusters similarity is 0, and

degrades gracefully as one moves away from the ideal case.

xii

The second contribution is an extension to latent tree models. While latent tree models

have been shown to be useful in data analysis, they contain only discrete variables and are

thus limited to discrete data. One has to resort to discretization if analysis on continuous

data is needed. However, this leads to loss of information and may make the resulting

models harder to interpret.

We propose an extended class of models, called pouch latent tree models, that allow

the leaf nodes to represent continuous variables. This extension allows the models to

work on continuous data directly. These models also generalize Gaussian mixture models,

which are commonly used for model-based clustering on continuous data. We use pouch

latent tree models for facilitating variable selection in clustering, and demonstrate on

some benchmark data that it is more appropriate to facilitate variable selection than to

perform variable selection traditionally. We further demonstrate the usefulness of the

models by performing multidimensional clustering on some real-world basketball data.

Our results exhibit multiple meaningful clusterings and interesting relationships between

the clusterings.

xiii

CHAPTER 1

INTRODUCTION

Latent tree models (LTMs) [172] are a class of tree-structured probabilistic graphical

models. These models allow multiple latent variables and have found applications in

multidimensional clustering, latent structure discovery, and density approximation. In

this thesis, we investigate an application of LTMs in spectral clustering and propose an

extension to LTMs so that they can deal with continuous data.

In this introductory chapter, we motivate our work through an introduction to cluster-

ing. We show how LTMs can be used to solve two clustering problems. We also point out a

limitation of LTMs that our work aims to solve. After that, we highlight the contributions

of this thesis. We give an outline of the thesis at the end of this chapter.

1.1 Clustering

Clustering [56, 82, 153, 167] aims to find natural grouping of data. In general, the grouping

maximizes the intra-class similarity and minimizes the inter-class similarity [66]. Cluster-

ing is also referred to as unsupervised classification, where data are classified without any

class labels given beforehand. Since it does not need any labeling to the given data, it is

a useful technique for exploratory data analysis.

Clustering can be used to explain the heterogeneity in the data. For example, the

political stance of a person can affect one’s attitudes towards different social issues. Some-

times, however, we may not know the political stances of people. What we know is only

their attitudes. By clustering on the attitudes, we can find groups of people with similar

attitudes. The grouping possibly reflects the different political ideologies of people.

Clustering has been applied in various areas. For example, it has been applied in

business for market segmentation [10, 41, 49, 127, 160, 164], conjoint analysis in mar-

keting [131], and strategic management research [89]. It is also used in bioinformatics

for analyzing gene expression data [60, 110, 168] and in medical analysis for traditional

Chinese medicine data [176, 177]. Other applications include image segmentation, object

and character recognition, and information retrieval [82].

1

Mississippi

Sou

th C

arol

ina

Arkansas

Florida

Louisiana

Texas

Alabama

Georgia

Vermont

Oklahoma

Montana

Arizona

Nevada Utah

Colorado

Idaho

Wyoming

Virginia

Nor

th C

arol

ina

Tennessee

Kentucky

Maryland

Delaware

Wes

t Virg

inia

Missouri

New

Mex

ico

California

Oregon

Washington

Alaska

Michigan

New

Ham

pshi

reConnecticut

New

Yor

kIndiana

Ohio

Pennsylvania

Illinois

New

Jer

sey

Kansas

Nebraska

Iowa

Sou

th D

akot

aMinnesota

Nor

th D

akot

aWisconsin

Massachusetts

Rho

de Is

land Hawaii

Maine

020

4060

80100

120

140

Cluster Dendrogram

hclust (*, "average")dist(votes.repub)

Height

Figure 1.1: An example of dendrogram. It shows the result of hierarchical clustering onthe voting patterns in 31 years of the 50 states in the United States.

1.1.1 Categories of Clustering Algorithms

Clustering algorithms can be categorized from the following different aspects.

Clustering outputs. Based on the output, clustering algorithms can be generally clas-

sified as hierarchical or partitional. A hierarchical clustering algorithm yields a hierarchy

of nested clusterings. The hierarchy is usually represented by a dendrogram (see Figure 1.1

for an example1 and [82] for more details). The agglomerative clustering methods with

single-link [147] or complete-link [90] are two common methods for hierarchical clustering.

A partitional clustering algorithm, in contrast, yields a single clustering without any

hierarchical structure. One well-known example is the K-means algorithm [105].

Some clustering algorithms are known as density-based. They may also be considered

as partitional clustering algorithms, since they yield single clusterings without hierarchical

structure. However, these algorithms define clusters as dense regions that are separated

by low-density regions in data. Those data points in the low-density regions are treated

as noise and consequently are not assigned to any clusters. DBSCAN [46] is a well-known

density-based clustering algorithm.

Principles for clustering. Clustering algorithms can be classified as distance-based or

model-based according to their principles. In distance-based methods, a distance measure

1 The data set was obtained from http://cran.r-project.org/web/packages/cluster/index.html

2

http://cran.r-project.org/web/packages/cluster/index.html

has to be defined to measure the similarity or dissimilarity between the data points.

Hierarchical and partitional algorithms can then be used to find clusterings based on the

defined distance measure. For continuous data, the distance measures often used are the

Euclidean distance, Manhattan distance, or Mahalanobis distance [56].

In model-based methods [180], data are assumed to be generated from a probability

model. These methods search from a given family of models for the one that gives the

best fit to data. More details are given in the next section.

Cluster assignments. Clustering algorithms can give hard or soft assignments. In hard

assignment, a data point belongs to only one cluster. In contrast, a data point can belong

to more than one cluster with different degrees of membership in soft assignment. Hard

assignments are used by the majority of clustering algorithms, whereas soft assignments

are usually used by fuzzy clustering algorithms [76] and model-based clustering algorithms.

1.1.2 Model-Based Clustering

Our work involves primarily model-based clustering. Compared with distance-based clus-

tering, model-based clustering offers two advantages. First, it provides a statistical basis

for analysis. This allows model selection in particular. And one important use of the

model selection is for determining the number of clusters automatically.

Second, generative models for data can be obtained. These models can be interpreted

for better understanding of data. For example, the distribution of data for a cluster can

be given by the model. The posterior probability of the cluster of a data point can also

be computed.

Finite mixture models [47, 51, 111] are usually used in model-based clustering. Specif-

ically, in finite mixture modeling, the population is assumed to be made up from a finite

number of clusters. Suppose a variable Y is used to indicate this cluster, and variables

X represent the attributes in the data. The variable Y is a latent (unobserved) variable

whereas the variables X are manifest (observed) variables. The manifest variables X are

assumed to follow a mixture distribution

P (x) =∑y

P (y)P (x|y).

The probability values of the distribution P (y) are known as mixing proportions and

the conditional distributions P (x|y) are known as component distributions. To generate

a sample, the model first picks a cluster y according to the distribution P (y) and then

3

uses the corresponding component distribution P (x|y) to generate values for the observed

variables.

Gaussian distributions are often used as the component distributions due to compu-

tational convenience. A Gaussian mixture model (GMM) has a distribution given by

P (x) =∑y

P (y)N (x|µy,Σy),

where N (x|µy,Σy) is a multivariate Gaussian distribution, with mean vector µy and

covariance matrix Σy conditional on the value of Y .

The Expectation-Maximization (EM) algorithm [40] can be used to estimate the model

parameters. Once parameter estimation is done, the probability that a data sample d

belongs to cluster y can be computed by

P (y|d) ∝ P (y)N (d|µy,Σy),

where the symbol∝means that the exact values of the distribution P (y|d) can be obtained

by using the sum∑

y P (y)N (d|µy,Σy) as a normalization constant. The sample d is

assigned to each cluster y by a different degree of association P (y|d). Hence, the latent

variable Y represents a soft partition of data.

The number of clusters can be given manually or determined by model selection auto-

matically. In the latter case, a score is used to evaluate a model with G number of clusters.

The number G that leads to the highest score is then chosen as the optimal number of

clusters. Many scores have been proposed, and the BIC score has been empirically shown

to perform well among them [50].

As we can see from above, finite mixture models contain only one discrete latent

variable for clustering. This can sometimes be insufficient. A remedy is to allow multiple

discrete latent variables in a model. LTMs provide such extension to latent class models,

which is commonly used for clustering discrete data. In the following two subsections, we

illustrate why one latent variable is sometimes insufficient. We also show how LTMs can

be used in these situations.

1.1.3 Multidimensional Clustering

Suppose we are given the scores of four subject tests of some students. The four subjects

are mathematics, science, literature, and history. Our aim is to cluster the students based

on those scores.

If we cluster the data using a model with one discrete latent variable, a single clustering

is obtained. The clustering will likely be based on the general intelligence of the students.

4

However, the four subject tests may require two different skills. Analytical skill is needed

for mathematics and science, while literal skill for literature and history. The single

clustering obtained cannot reflect the two different skills of students. This example shows

a limitation of using models with single discrete latent variables for clustering.

On the other hand, LTMs have multiple discrete latent variables. These variables

allow multiple clusterings on data. If we use an LTM to cluster the students, it may

result in three clusterings. The clusterings will likely be based on the analytic skill, literal

skill, and general intelligence, respectively, of the students. They allow us to cluster the

students along three different dimensions. Hence, this approach to clustering is known as

multidimensional clustering [28]. Note that while both hierarchical clustering and multi-

dimensional clustering yield multiple clusterings, there is an important difference between

them. In hierarchical clustering, the clusterings obtained are nested and represent cluster-

ings at different levels of granularity along the same dimension. But in multidimensional

clustering, multiple partitional clusterings are obtained along different dimensions and

they are not nested.

One limitation of LTMs is that they can work on discrete data only. To use LTMs

in the current example, we have to discretize the scores. However, this leads to loss of

information. Moreover, this may make the resulting models more difficult to interpret.

For example, if we discrete an attribute into fewer intervals, the discretized attribute may

represent the original attribute less well. On the other hand, if we discretize an attribute

into more intervals, the conditional probability tables in the resulting models will have

more entries and will be harder to comprehend. Due to these deficiencies, an extension

to LTMs for continuous data is needed.

1.1.4 Spectral Clustering

Many commonly used clustering methods yield ‘globular’ clusters. These include the

finite mixture models and K-means. They perform poorly when the true clusters are

non-convex, such as those shown in Figure 1.2(a). To overcome this shortcoming, spectral

clustering [159] has been proposed. This method has gained prominence in recent years.

In spectral clustering, one defines a similarity matrix for a collection of data points.

The matrix is transformed to get the so-called Laplacian matrix. One then finds the

eigenvectors of the Laplacian matrix. A partition of the data points is obtained using the

leading eigenvectors.

Essentially, data are transformed from the feature space to an eigenspace in spectral

clustering. The eigenspace is given by the leading eigenvectors of the Laplacian matrix. It

5

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

−2 −1 0 1 2

−2

−1

01

2

(a) Feature Space

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0.00 0.02 0.04 0.06 0.08

0.00

0.02

0.04

0.06

0.08

(b) Eigenspace

Figure 1.2: Data points in the original feature space and the transformed eigenspace inspectral clustering. Ideally, points in each cluster are transformed to the same coordinatesin the eigenspace. Also, each cluster has distinct coordinates in the eigenspace. In thisexample, the cluster with circular (red) points in the feature space are mapped to theupper-left point in the eigenspace, while the one with triangular (cyan) points to thelower-right point.

is expected to allow the data to be clustered more easily. Ideally, the points belonging to

the same cluster should have the same location in the eigenspace, while those belonging

to different clusters should have different locations. This is illustrated in Figure 1.2.

In general, the clusters may not be as well-separated as in the ideal case. The data

points in a cluster do not have the same location in the eigenspace. Instead, they are

perturbed from their ideal locations. This is why a clustering algorithm is usually used

to partition data in the eigenspace. This is done in the last step of spectral clustering,

which is sometimes referred to as rounding. Indeed, there are three related subproblems

in this step. One needs to decide how many leading eigenvectors to use, to determine the

number of clusters, and to partition the data points.

When the number of clusters is given, rounding is easy. The same number of leading

eigenvectors are usually used. And we can obtain a clustering from the leading eigen-

vectors using K-means. But when the number of clusters is not given, rounding becomes

more difficult. To tackle this problem, some methods use Gaussian mixture models to

determine the number of clusters and to partition the data [166, 179]. This is done after

selecting the relevant eigenvectors using heuristics.

Gaussian mixture models have only one discrete latent variable. To partition data, the

variable has to consider all the leading eigenvectors at the same time. On the other hand,

an LTM has multiple discrete latent variables. It allows the latent variables to focus on

only subsets of eigenvectors. It also has a more flexible model structure than Gaussian

6

mixture models. These differences suggest that LTMs may find a useful application in

rounding.

1.2 Contributions

This thesis makes two contributions to the research of LTMs. The first contribution is an

application of LTMs in spectral clustering. We propose and study a novel model-based

approach to rounding using LTMs. The method differs from previous rounding methods

in three ways. First, we relax the assumption that the number of clusters equals the

number of eigenvectors used. Second, when deciding how many leading eigenvectors to

use, we not only rely on information contained in the leading eigenvectors themselves,

but also make use of the subsequent eigenvectors. Third, our method is model-based and

solves all the three subproblems of rounding using LTMs. We evaluate our method on

both synthetic and real-world data. The results show that our method works correctly

in the ideal case where between-clusters similarity is 0, and degrades gracefully as one

moves away from the ideal case.

The second contribution is an extension to LTMs. All variables in LTMs are dis-

crete. Therefore, we propose an extended class of models, called pouch latent tree models

(PLTMs), that allows the leaf nodes to represent continuous variables. This extension is

also a generalization of Gaussian mixture models. We develop an inference algorithm and

a learning algorithm for these models. We use these models for facilitating variable selec-

tion in clustering, and demonstrate on some benchmark data that it is more appropriate

to facilitate variable selection than to perform variable selection traditionally. We further

demonstrate the usefulness of these models by performing multidimensional clustering on

some real-world basketball data.

1.3 Organization

This thesis can be divided into three main parts. The first part consists of Chapters 2

and 3. It serves as a background for the thesis. In Chapter 2, we review different types

of latent variable models. We point out particularly the differences between those models

and LTMs. In Chpater 3, we review the backgrounds of LTMs. We also survey different

algorithms and applications of LTMs.

The second part consists of only Chapter 4. It is based on the work in [126]. In

chapter 4, we propose an application of LTMs in spectral clustering. We use LTMs for

rounding, and evaluate our approach using both synthetic data and real-world data.

7

The third part consists of Chapter 5–7. It is based on the work in [123, 124, 125]. In

Chapter 5, we propose PLTMs as an extension to LTMs for continuous data. We also

describe an inference algorithm and a learning algorithm for PLTMs. In Chapter 6, we

consider PLTMs for variable selection in clustering. In particular, two approaches to vari-

able selection are compared. One approach facilitates variable selection using PTLMs, and

the other performs variable selection in traditional ways. We compare the two approaches

using some benchmark data. In Chapter 7, we perform multidimensional clustering on

seasonal statistics of NBA players. We show that multiple interesting clusterings can be

found using PLTMs.

Chapter 8 concludes this thesis after the three main parts. It summarizes the thesis

and points out some future directions.

8

CHAPTER 2

LATENT VARIABLE MODELS

Latent variables are often used in statistical modeling. Many definitions of latent variables

exist in the literature [15]. In this thesis, we use a simple one. We define latent variables

as random variables whose values are not observed. In contrast, those variables whose

values are observed are called manifest variables. And we refer to those models that

contain latent variables as latent variable models.

There are several reasons why latent variables are needed. Firstly, latent variables can

be used to represent some abstract concepts which cannot be observed. They can also

represent some concepts that are hard to measure in practice. Latent variables used in

this situation are sometimes called theoretical constructs or hypothetical constructs. They

are used in a model to hypothesize some hidden factors that affect the values of observed

variables. For example, item response theory models are often used in educational testing.

The latent variables in these models can be used to measure some unobserved intelligence

or ability of students.

Secondly, latent variables can be used to represent unobserved heterogeneities or

sources of variations. They are included in a model so that their values can be recovered

or estimated from the observed data. For example, latent variable can be used in cluster

analysis to recover some unobserved grouping of similar samples.

Thirdly, latent variables can also be used for dimension reduction. The aim here is

to summarize the values of a large number of manifest variables using a much smaller

number of latent variables. The meanings of the latent variables may not be of interest.

For example, in principal component analysis, the components can be considered as latent

variables. The meanings of these components are often unimportant. The components

are only used to reduce the dimension of data for further analysis.

Finally, some related variables may have been excluded unintentionally during the

collection of data. By including these latent (excluded) variables, the model structure

can sometimes be simplified. For example, Elidan et al. [45] propose to simplify some

pattern of complex parts in a model structure by introducing latent variables into it.

Latent variables are of particular interest to social scientists. This is because abstract

concepts are usually involved in their fields of study. Latent variable models have long

9

X1 X2 X3 X4 X5 X6

Y1

(a)

X1 X2 X3 X4 X5 X6

Y1 Y2

(b)

Figure 2.1: Structures of:(a) latent class models, latent trait models, and latent profilemodels; and (b) factor models. Latent variables are shown in shaded nodes.

been used in these fields. Some earlier efforts include their uses in psychometrics [149],

biometrics [165], and econometrics [64]. Latent variable models have also been applied

in some other domains, including education [141], medical care [121], marketing [16], and

economics [98].

In the following, we survey latent variable models proposed in the literature. We put

more emphasis on models that are related to LTMs. We aim to point out the differences

between those models and LTMs.

2.1 Measurement Models

Many of the latent variable models traditionally used in the social sciences are sometimes

known as measurement models [146]. A fundamental characteristic of these models is

the assumption of a form of conditional independence called local independence. Specifi-

cally, the manifest variables are assumed to be conditionally independent given the latent

variables. This assumption forces the latent variables to account for the probabilistic

relationships between the manifest variables.

The measurement models can be categorized based on the types of the manifest and

latent variables used. If the latent and manifest variables are both discrete, we have a

latent class model (LCM) [62, 99]. When the manifest variables are discrete and the latent

variables are continuous, the resulting models are known as item response theory (IRT)

models [133, 157] or latent trait models [99]. When the manifest variables are continuous

and the latent variables are discrete, we have latent profile models [99]. If both latent and

manifest variables are continuous, we have factor models [8, 9, 67].

Among these four kinds of models, the first three share the same model structure.

Their model structure is shown in Figure 2.1(a). Each of these models have only one latent

variable. The difference between these models is in the types and hence the distributions

10

of the variables. On the other hand, a factor model can have multiple latent variables.

Its model structure is shown in Figure 2.1(b). As shown in the figure, the latent variables

in a factor model can be correlated. However, they are usually assumed to be mutually

independent.

Manifest variables in a factor model can have multiple parents. However, a post-

processing step called factor rotation is often carried out during factor analysis [70, 88].

This step tries to find an equivalent model with the simplest structure, such that each

manifest variable has the fewest parents in the equivalent model.

2.2 Confirmatory and Exploratory Analyses

Before we continue to survey other latent variable models, it is worth to distinguish

between two approaches to using these models. The first approach is a confirmatory

one. It aims to test a hypothesis. To do so, a model structure is specified based on the

hypothesis. The model is then estimated from and tested against the empirical data. The

hypothesis is confirmed or repudiated based on the model fit to data.

The second approach is an exploratory one. In this approach, only a class of models

is specified. The exact model structure is determined by model selection. This approach

aims to help users have better understanding on data from the resulting model structure

and parameters.

As an example, we compare the two different approaches to using factor models. In

exploratory factor analysis, the number of factors (latent variables) is unknown before-

hand. Moreover, only a minimal number of constraints on parameters are imposed. The

number of factors and the model parameters are estimated during the analysis. On the

other hand, number of factors are specified in confirmatory factory analysis. Restrictions

are also imposed based on the hypothesis under test. In particular, edges may be removed

from the model structure to indicate that some manifest and latent variables are mutually

independent. Some parameter values are set to zero for this purpose.

2.3 Mixture Models

Latent variable models are also used in machine learning. One of the most often used

models is the finite mixture model (FMM) [47, 54, 111]. The model has the following

distribution

P (X) =∑y

P (Y = y)P (X|Y = y),

11

X1, X2, X3, X4, X5, X6

Y

Figure 2.2: Model structure of a finite mixture model.

where X is a vector of manifest variables and Y is a latent variable. It assumes that

the samples can be divided into different groups. Each of these groups has a different

distribution of X. The model has a discrete latent variable Y . The variable indicates

which group a sample belongs to. However, the grouping of the samples are unobserved

from data. Therefore, it has to be represented by a latent variable. Figure 2.2 shows the

structure of a FMM.

The FMMs include a wide variety of models. The manifest variables in the models

can be discrete or continuous. Also, the latent class models and the latent profile models

can also be represented as these models. Among the varieties of these models, Gaussian

mixture models (GMMs) are probably most often used. In GMMs, the manifest variable

are continuous and have conditional Gaussian distributions. Latent profile models can

be considered as a special case of the GMMs. Both have the same types of manifest and

latent variables. However, the former models assume local independence but the latter

ones do not.

The component distribution P (X|Y ) can also be represented by other class of models.

For example, Thiesson et al. [154] propose a mixture of conditional Gaussian networks.

A conditional Gaussian networks [36, 95] is Bayesian networks containing both discrete

and continuous variables. However, the discrete variables can have only discrete parents.

Another example is a mixture of trees [115], in which the component distributions are

represented by tree-structured graphical models. There are also some mixtures of latent

variable models. Those models are introduced below.

Hidden Markov models [58, 129] have a similar structure as the finite mixture models.

However, they are used for time series modeling and have latent variables representing

values at different time instants.

2.4 Models with Multiple Latent Variables

Most of the models surveyed so far have only single latent variables. This may limit the

effectiveness of the models. In fact, there are some models that allow multiple latent

12

variables.

2.4.1 Structural Equation Models

Structural equation models (SEMs) [11, 14, 87, 104] are a class of models that have multiple

latent variables. It has been widely used in social sciences. It is also a generalization of

the factor models. A SEM can be divided into two overlapping parts. The first part is the

measurement model. It includes the latent variables and the manifest variables depending

on them. The relationships between the manifest variables and the latent variables are

similar to those in a factor model. The second part is the structure model. It includes the

latent variables and the relationships between them.

While all variables in SEMs are continuous traditionally, there are some recent frame-

works that can include both types of variables. For example, the framework called Gen-

eralized Linear Latent and Mixed Models has been introduced by Skrondal and Rabe-

Hesketh [145, 146]. It extends the SEMs with both types of manifest variables. Muthén

[117] presents another framework, which can have both types of latent and manifest vari-

ables.

SEMs are usually used for confirmatory analysis. However, there is also some work for

learning the model structure automatically. Spirtes et al. [150] present algorithms that

can determine the relationships between the variables in a SEM. However, the algorithm

cannot discover any new latent variables. Silva et al. [144] propose an algorithm for

learning the structure of a SEM given some manifest variables. Similarity to exploratory

factor analysis, it can determine the number of latent variables. The models they consider

contain only continuous variables.

2.4.2 Multidimensional Measurement Models

The factor model is one of the measurement models that allow multiple latent variables.

We refer to this kind of measurement models as multidimensional measurement models.

This is because in these models, the multiple latent variable can be considered as the

different dimensions of factors or heterogeneities that affect the manifest variables.

Many latent variable models used in machine learning have a similar structure as the

factor models [138]. These include the models used for principal component analysis

(PCA) [85] and probabilistic PCA [137, 156]. The difference between these two models

and the factor models is that they assume different variances of the noise on the manifest

variables.

13

X1 X2 X3 X4 X5 X6

Y1 Y2

(a) Between-item multidimensionality

X1 X2 X3 X4 X5 X6

Y1 Y2

(b) Within-item multidimensionality

Figure 2.3: Structures of the two types of multidimensional item response theory models.

Independent component analysis (ICA) [78, 79] also uses a model with the same struc-

ture as the factor models. However, non-Gaussian distributions are assumed on the latent

variables. This is in contrast to the case for factor models, where Gaussian distributions

are usually assumed. Tree-dependent component analysis [6] relaxes the independence

assumption on the latent variables of the ICA models. It allows the latent variables to be

correlated and be represented as a tree-structured model.

There are factor models that have discrete manifest variables. Collectively, these

models are sometimes known as discrete component analysis (DCA) [19]. They include

non-negative matrix factorization (NMF) [100, 101], probabilistic latent semantic anal-

ysis (PLSA) [74, 75], multinomial PCA [18], latent Dirichlet allocation (LDA) [13], and

GaP [21]. These models differ from each other in the conditional distributions on the

manifest variables and the prior distributions on the latent variables. More detailed com-

parison between these models is given by Buntine and Jakulin [19].

The latent variables in the DCA models are continuous. However, those in PLSA,

multinomial PCA, and LDA are normalized, meaning that the sum of their values is al-

ways one. Each of the normalized variables represents one discrete state, and its value

represents the probability of that state. Together, the normalized variables in one model

are equivalent to a single discrete variable. Hence, those models are actually unidimen-

sional in this sense. They can show only one dimension of clustering on the data.

In additional to the factor models, there are some extensions to the other traditional

measurement models to allow multiple latent variables. For example, the traditional IRT

models are extended by the multidimensional IRT models [1, 17, 135]. These models

can be broadly divided into two types based on the number of parents that a manifest

variable can have [1]. The first type is the between-item multidimensional models. Each

manifest variable has only one parent latent variable. The whole model consists of multiple

unidimensional IRT models on disjunct subsets of manifest variables. The second type

is the within-item multidimensional models. They have a similar structure as the factor

14

models. In both types of models, the latent variables can be correlated. The structures

of the two types of models are shown in Figure 2.3.

The traditional LCMs have also been extended to the latent class factor models [107].

It can be considered as decomposing the discrete latent variable in the former model into a

joint variable consisting of multiple discrete latent variables in the latter model. The latent

variables in the latent class factor models are binary and are mutually independent. The

structure of the models is the same as that of the factor models. However, the parameters

are restricted in a such way that a manifest variable is affected by each latent variable

independently. This restriction reduces the number of parameters in the model.

2.4.3 Mixtures of Latent Variable Models

In a FMM, a discrete variable is used to indicate which group a data sample belongs

to. Since the grouping of data is unobserved, this variable is considered as a latent

variable. When the component distributions in a model are represented by some other

latent variable models, the whole model can also be considered as having multiple latent

variables.

One example of this kind of mixture models is the mixtures of factor models. This

includes mixture of factor analyzers [48, 59, 112], mixture of PCA [72], and mixture

of probabilistic PCA [155]. In these models, a factor model is used to represent the

component distribution. It is used as a dimension reduction technique. It replaces the

Gaussian distribution that has a complete covariance matrix in the original model. Hence,

the number of parameters in the component distribution can be reduced. However, the

latent variables in the factor models may not have any meaning.

There is also some work on mixtures of SEMs. Under these models, the behaviors

of different groups of samples can be represented by different SEMs. These models are

described by Jedidi et al. [84], Muthén [117], and Skrondal and Rabe-Hesketh [146].

The above mixture models have only single discrete variables to indicate the grouping

of samples. In contrast, a hierarchical mixture of experts [12, 86] have multiple discrete

variables to indicate the grouping of samples. These discrete variables can be unobserved.

This means that the grouping is determined by some hidden factors. The model has

a tree structure of latent variables. However, it is not truly multidimensional. The

multiple discrete variables represent a hierarchical clustering of data. The grouping of

data indicated by a higher level of latent variable is further refined by some lower levels

of latent variables. Therefore, there is only one dimension of partition on the data.

The multiple latent variables represent only different levels of granularity along the same

15

X1 X2 X3 X4 X5 X6 X7

Y1

(a)

X1 X2 X3

X4

X5 X6 X7

Y2 Y3

Y1

(b)

Figure 2.4: A latent tree model as an extension to a latent class model. (a) In the latentclass model, local dependencies are observed. They are indicated by the dashed arrowsbetween the observed nodes. (b) Latent variables Y2 and Y3 are introduced to account forthe observed local dependencies. This results in the latent tree model shown.

dimension.

2.4.4 General Models

In addition to the above categories of models that contain multiple latent variables, there

are some general models that can also allow multiple latent variables. For example,

Elidan et al. [45] consider adding latent variables to a Bayesian network. The Bayesian

networks they consider contain only discrete variables. Their algorithm introduces new

latent variables to replace some pattern of complex parts in the model structure. The

motivation is to simplify the model structure by these latent variables. However, it is

unclear whether the new latent variables can be interpreted easily. This work is extended

by Elidan and Friedman [44] so that the number of states in a latent variable can be

determined automatically.

2.5 Latent Tree Models

Latent tree models (LTMs) [171, 172] are the main topic of this thesis. They are also

previously known as hierarchical latent class models. Similar to some multidimensional

measurement models, they extend the traditional LCMs to allow multiple latent variables.

The extension is motivated by an observation called local dependence. Consider the

LCM shown in Figure 2.4(a). Local independence is assumed by the model. However,

this assumption is sometimes violated on the empirical data. Local dependencies may be

observed. This means that even given the value of the latent variable, correlations among

some manifest variables may be observed from data.

In an LTM, these local dependencies can be accounted for by the introduction of some

16

latent variables. This is illustrated in Figure 2.4(b). The latent variables are introduced

as the parents of subsets of variables with local dependence. More latent variables can be

introduced if local dependencies are still observed.

The resulting model has a tree structure. It can have multiple latent variables. The

latent variables are found on the internal nodes, and the manifest variables are found on

the leaf nodes. Besides, same as in LCMs, all variables in LTMs are discrete.

Compared with other latent variable models, LTMs have two remarkable features.

First, they can have multiple latent variables. Those multiple latent variables allow LTMs

to fit data better and provide better explanation on data.

Second, LTMs have a tree structure. This allows a tractable probabilistic inference in

LTMs. It also limits the search space for structural learning. In addition, its simplicity

allows easier interpretation of the model.

2.6 Summary of Models

Table 2.1 summarizes the latent variable models reviewed in this chapter.

17

Model MV LV MLV Structure RemarksLatent class models [62, 99] D D N StarItem responsemodels [133, 157], latent traitmodels [99]

D C N Star

Latent profile models [99] C D N StarFactor models [8, 9, 67] C C Y FactorFinite mixturemodels [47, 54, 111]

M D N Mixture

Gaussian mixture models C D N MixtureMixtures of conditionalGaussian networks [154]

M D Y Mixture

Hidden Markovmodels [58, 129]

M D Y Mixture Latent variables represent val-ues at different time instants.

Structural equationmodels [11, 14, 87, 104]

C C Y General

Generalized linear latent andmixed models [145, 146]

M C Y General

Muthén’s framework [117] M M Y GeneralPCA [85], probabilisticPCA [137, 156]

C C Y Factor

Independent componentanalysis [78, 79]

C C Y Factor

Tree-dependent componentanalysis [6]

C C Y Factor Latent variables are connectedby a tree structure.

Discrete component analysis[19], e.g., NMF [100, 101],PLSA [74, 75], multinomialPCA [18], LDA [13], GaP [21]

D C Y Factor Latent variables are normalizedin PLSA, multinomial PCA,and LDA.

Multidimensional IRTmodels [1, 17, 135]

D C Y Factor

Latent class factormodels [107]

D D Y Factor Latent variables are binary andmutually independent.

Mixtures of factoranalyzers [48, 59, 112],mixtures of PCA [72], mixturesof probabilistic PCA [155]

C M Y Mixture Each model has one discrete la-tent variable and multiple con-tinuous latent variables.

Mixtures of structuralequation models [84, 117, 146]

M M Y Mixture

Hierarchical mixtures ofexperts [12, 86]

M D Y Tree Latent variables in a model rep-resent a hierarchical clustering.

Bayesian networks with latentvariables [44, 45]

D D Y General

Latent tree models [171, 172] D D Y Tree

Table 2.1: Summary of latent variable models. The second and third columns indicatethe type of manifest variables (MV) and latent variables (LV) in the models. The typecan be continuous (C), discrete (D), or mixed (M). The fourth column indicates whethermultiple latent variables (MLV) are allowed in the models. The fifth column shows themodel structures. They can be: star -shaped (Figure 2.1(a)), tree-shaped (Figure 2.4(b)),same as those of factor models (Figure 2.1(b)), ormixture models (in which the componentdistributions can be represented by some other structures, see Figure 2.2), or a generalstructure.

18

CHAPTER 3

LATENT TREE MODELS

At the end of last chapter, we discuss how latent tree models extend from latent class

models. We also compare LTMs with other latent variable models. In this chapter, we

describe LTMs in more details. We begin with the notations used in the thesis. In

section 3.2, we review Bayesian networks. They provide a framework for defining LTMs.

In section 3.3, we define LTMs and discuss their properties. We then review some learning

algorithms that have been proposed for LTMs in section 3.5. Finally, we survey some

applications of these models in the literature in section 3.6.

3.1 Notations

In this thesis, we use capital letters such as X and Y to denote random variables. We use

lower case letters such as x and y to denote their values. We use bold face letters such as

X, Y , x, and y to denote sets of variables or values. Manifest variables are denoted by

X and latent variables by Y . If a variable can be manifest or latent, we use V to denote

it. For a discrete variable V , we use |V | to denote its cardinality. Furthermore, we use

P (X) to denote the distribution of a variable X, and use P (x) as a shorthand for the

probability of having the value of x, that is, P (X = x).

We use the following notations for graphs. When the meaning is clear from context,

we use the terms ‘variable’ and ‘node’ interchangeably. Capital letter Π(V ) is used to

indicate the parent node of a node V , and lower case letter π(V ) to indicate its value.

3.2 Bayesian Networks

Bayesian networks (BNs) are a class of probabilistic graphical models [91]. They define

probability distributions over some random variables. Their graphical structure provides

a natural representation of the relationships between the variables. For a detailed intro-

duction, reader are referred to those books in this area [e.g. 36, 38, 122].

A BN M = (m,θ) can be defined by its structure m and a set of parameters θ. The

structure m is given by a directed acyclic graph. The set of nodes V = {V1, . . . , Vn} in

the graph represent n variables. Each node V is associated with a probability distribution

19

Burglary (B) Earthquake (E)

Alarm (A)

JohnCalls (J) MaryCalls (M)

P (B).001

P (E).002

B E P (A)T T .95

T F .94

F T .29

F F .001

A P (J)T .90

T .05

A P (M)T .70

T .01

Figure 3.1: An example of Bayesian network.

P (V |Π(V )) conditional on its parents Π(V ).1 The set of parameters θ consists of those

parameters needed to specify all the conditional probability distributions.

The joint probability defined by a BN is assumed to satisfy the Markov condition.

This condition means that every variable V in a BN is conditionally independent to

other non-descendent variables given the parent variables of V . Based on this condition,

the joint probability P (V ) can be factorized as a product of the conditional probability

distributions associated with the nodes:

P (V ) =n∏i=1

P (Vi|Π(Vi)).

Figure 3.1 shows an example of BN given by Russell and Norvig [139]. The BN has

five binary variables. It models the situation in which an alarm may be triggered by

burglary or earthquake. The triggered alarm may further leads to a call from John or one

from Mary. The conditional probability distributions in the BN models the probabilities

of these events.

In BNs, discrete variables are usually considered. However, continuous variables can

also be included. In a Gaussian Bayesian network (GBN) [57, 142], all variables are con-

tinuous. Gaussian distributions are used to model the probabilistic relationships between

the variables. On the other hand, a conditional Gaussian Bayesian network [36, 95] can

contain both discrete and continuous variables. However, discrete variables cannot have

continuous parents in this model.

20

X1 X2 X3

X4

X5 X6

Y2 Y3

Y1

(a) Original model

X1 X2 X3

X4

X5 X6

Y2

Y3

Y1

(b) After root walking (from Y1 to Y2)

X1 X2 X3 X4 X5 X6

Y2 Y3Y1

(c) unrooted model

Figure 3.2: An example of latent tree model, root walking, and unrooted model. Theleaf nodes X1–X6 represent manifest variables, while the internal nodes Y1–Y3 representlatent variables. Latent variables are represented by shaded nodes.

3.3 Latent Tree Models

A latent tree model (LTM) [162, 172] is a tree-structured BN containing latent variables.

In this model, latent variables are represented by the internal nodes, whereas manifest

variables by the leaf nodes. Both latent and manifest variables are discrete. An example

of LTM is shown in Figure 3.2(a).

Similar to the case for BN, an LTM can be written as a pair M = (m,θ). The second

component θ is the set of parameters for specifying the distributions in the model. The

first component m represents the model structure of the LTM. It specifies the variables,

their cardinalities, and the edges between the variables. We sometimes also refer to the

first component m as an LTM.

3.3.1 Model Scores

Suppose D is a collection of data over a set of variables X. There can be infinitely many

possible LTMs having X as their leaf nodes. Therefore, a score is needed to evaluate the

relevancy between an LTM and the data D. It is essential to select the best model among

1 If a node does not have any parent, it is assumed to be the child of a dummy node with one value.This allows all nodes to be treated in the same way.

21

possibilities.

The BIC score [140] is used for this purpose in this thesis. It has been empirically

shown to work well compared with some other scores [172]. The BIC score of a model m

is given by

BIC(m|D) = logP (D|m,θ∗)− d(m)

2logN, (3.1)

where θ∗ is the maximum likelihood estimate (MLE) of the parameters and d(m) is the

number of independent parameters in m. The first term is known as the maximized log-

likelihood term. It favors models that fit data well. The second term is known as the

penalty term. It discourages complex models. Hence, the BIC score provides a trade-off

between model fitness and model parsimony.

3.3.2 Model Equivalence

We use models to fit observed data. And we want to find those with the best fit. However,

in some cases, two different models fit the data equally well. This idea is formalized

through the concepts of model inclusion and equivalence.

Consider two LTMs m and m′ that share the same set of manifest variables X. We

say that m includes m′ if for any parameter value θ′ of m′, there exists parameter value

θ of m such that P (V |m,θ) = P (V |m′,θ′).

When m includes m′, m can represent any distributions over the manifest variables

that m′ can. Hence, the maximized log-likelihood of m must be larger than or equal to

that of m′, that is, P (D|m,θ∗) ≥ P (D|m′,θ′∗).

When m and m′ include each other, we say that they are marginally equivalent.

Marginally equivalent models have equal maximized log-likelihood. This means they

can fit the data equally well. If they have the same number of independent parameters,

they become equivalent. Equivalent models are indistinguishable based on data if the BIC

score is used for model selection. In fact, this is also true if any other penalized likelihood

score [63] is used.

3.3.3 Root Walking

Consider an LTM m with root Y1. Suppose Y2 is a latent node and is a child of Y1. Define

another LTM m′ by reversing the direction of the edge Y1 → Y2. Now Y2 becomes the

root in the new model m′. We call this operation root walking — the root has walked

from Y1 to Y2. Figure 3.2(b) shows the model obtained after walking the root from Y1 to

Y2 in the original model in Figure 3.2(a).

22

Root walking results in equivalent models [172]. This implies that using any root in

an LTM can lead to the same score. Hence, the root of an LTM cannot be determined

from data. What can be determined is only an equivalent class of models having the same

structure but with different roots.

An unrooted LTM can be used to represent this equivalent class of models. It is

obtained by dropping the direction of all edges in an LTM. Members of the equivalent

class can be obtained by using different latent nodes as roots in the unrooted model. An

example of unrooted model is given in Figure 3.2(c).

Semantically, the unrooted model is a Markov random field over an undirected tree.

The external nodes are manifest whereas the interior nodes are latent. Model inclusion

and equivalence can be defined for unrooted models in the same way as for rooted models.

In this thesis, we use rooted LTMs to represent the models obtained from data. It

should be noted that any latent nodes can indeed be chosen as root. Our results do not

depend on the choice of the root.

3.3.4 Regularity

A larger number of model parameters usually results in better model fit to data. How-

ever, for some LTMs, it is possible to find other marginally equivalent models with fewer

parameters. Those models are called irregular models.

Zhang [172] shows that any model that violates the following condition is irregular.

Condition 1. (Upper Bound on Cardinality). Let Y be a latent variable in an LTM. And

let V1, . . . , Vr be the r neighbors of Y . For any latent variable Y in an LTM,

|Y | ≤∏r

i=1 |Vi|maxri=1 |Vi|

. (3.2)

If Y has only two neighbors, strict inequality holds and one of the neighbors must be a

latent node.

Given an irregular model m, it is possible to obtain a model m′ that is marginally

equivalent to m and does not violate the regularity condition. We refer to such model m

as regular models, and the process to obtain it as regularization.

Let m be an irregular model that violates Condition 1. A regular model m′ can be

obtained through the following regularization process:

1. For each latent variable Y in m, let V1, . . . , Vr be the r neighbors of Y .

23

(a) If Y violates Inequality (3.2), set

|Y | =∏r

i=1 |Vi|maxri=1 |Vi|

. (3.3)

(b) If Y has only two neighbors, where one of them is a latent node, and if it

violates the strict version of Inequality (3.2), remove Y from m and connect

the two neighbors of Y .

2. Repeat Step 1 until there is no further change.

Suppose Y is a latent node in an LTM. If the cardinality of Y satisfies Equality (3.3),

we say that Y is saturated. In this case, Y is also said to subsume all its neighbors except

the one with the largest cardinality. Wang et al. [162] identify another regularity condition

related to saturated latent nodes. This condition is stated below.

Condition 2. (Non-Redundant Variables). In an LTM, there do not exist any two adja-

cent latent nodes Y1 and Y2, such that both Y1 and Y2 are saturated and one of them is

subsumed by the other.

Suppose Y1 and Y2 are the two adjacent nodes that violates the above condition in an

LTM. If Y2 subsumes Y1, Y1 is a redundant variable. The model can be regularized by

removing Y1 and connecting the neighbors of Y1 (except Y2) to Y2.

By definition, regular models must have a higher BIC score than irregular models.

Therefore, if we want to find the model with the highest BIC score, we can restrict the

search space to include only regular models. Note that given a set of manifest variables,

there are only a finite number of regular models in this search space [172].

3.4 Parameter Estimation

Statistical analysis involves learning an LTM from some given data. If the model struc-

ture is given, it needs to estimate only the model parameters. If the model structure is

unknown, structural learning is needed as well. We assume that the model structure is

given and discuss parameter estimation in this section. We discuss structure learning in

the next section.

Let D = {d1, . . . ,dN} be a collection of N data samples, where di denotes the values

of the manifest variables in the i-th sample. Also, suppose the model structure m is given.

To learn an LTM M from D, we need find the maximum likelihood estimate (MLE) θ∗

of the model parameters θ. The learning result can then be given as M = (m,θ∗).

24

The EM algorithm [40] can be used to compute the MLE θ∗. It is usually used for mod-

els with latent variables. The algorithm starts with an initial estimate, θ(0). It improves

the estimate iteratively. Use θ(t) to denote the estimate after t iterations. The algorithm

iterates until an iteration fails to improve the model likelihood by a certain threshold δ.

In other words, it stops after the t-th iteration if P (D|m,θ(t))− P (D|m,θ(t−1)) ≤ δ. The

algorithm then returns θ∗ = θ(t).

There are two steps in each iteration, namely the E-step and the M-step. Suppose the

parameter estimate θ(t−1) is obtained after t − 1 iterations. In the E-step, we compute,

for each latent node Y and its parent Π(Y ), the distributions P (y, π(Y )|dk,θ(t−1)) and

P (y|dk,θ(t−1)) for each sample dk. In the M-step, we compute the MLE θ(t) based on

the distributions obtained in the E-step. For each of the manifest or latent nodes V , the

MLE of the parameters is given by

P (v|π(V ),θ(t)) ∝N∑k=1

P (v, π(V )|dk,θ(t−1)),

where the ∝ symbol indicates that the exact values of the probability can be obtained

after normalization.

During an iteration, the E-step is also known as the data completion step. It can be

considered as completing the data by computing the values of the latent variables based

on the observed data and the current parameter estimates. The M-step then finds the

MLE based on the completed data. It is done as if all the variables are observed.

Some more details are worth to note about the EM algorithm. First, the initial values

θ(0) of the parameters has to be chosen. For P (v|π(V ),θ(0)), the probability values are

randomly generated from a uniform distribution over the interval (0, 1]. The distributions

P (v|π(V ),θ(0)) are then normalized so that their sums are equal to 1.

Second, given the observed data, we need to compute probabilities in the E-step. This

computation is known as inference for probabilistic graphical models. Inference in an

LTM can be done using the clique tree propagation as in a discrete Bayesian network.

Readers are referred to other sources [e.g. 36, 38] for details.

3.5 Learning Algorithms

More often than not, the model structure is not known beforehand. It needs to be learnt

from given data.

For discrete Bayesian networks, two main approaches are used for structural learning.

The first approach is scored-based [20, 30, 69]. It requires a score for the evaluation of

25

model structures. It then attempts to search for the structure that gives the highest score.

Given a base model structure, some operators are used to explore the search space. These

operators usually modify a small part of the base model. They include, for example,

addition, removal, and reversal of edges. A greedy search, also known as a hill-climbing

method, is often used in this approach.

The second approach is constraint-based [29, 150]. It first identifies some constraints

on the model structure based on the data. For example, conditional independencies

between variables can be tested against the data using statistical or information theoretic

measures. This approach then finds a model structure that is consistent with the identified

constraints.

The above two approaches are also used for learning the structures of LTMs. In

addition, a third approach has been used. It is based on variable clustering. This approach

is applicable to LTMs but not to general BNs because they exploit some characteristics

of LTMs.

In the following subsections, we review methods following those three approaches.

3.5.1 Score-Based Methods

The score-based methods for LTMs have much in common. They are based on hill-

climbing and use similar search operators. They are also based on the BIC score for model

selection. However, there are some differences between them that affect their efficiency

and effectiveness. A main difference is in how they divide the use of search operators into

different phases. Besides, methods may make different adaptations on the BIC score for

model evaluation.

We begin this subsection with the description of a brute-force search. It serves to

illustrate the principles behind other score-based methods. After that, we review some

score-based methods proposed in the literature.

Brute-Force Search

The brute-force search we describe here is a hill-climbing method. It starts with an initial

model and then iteratively improves this model in each search step. The initial model

m(0) is the simplest LTM over the given manifest variables. Specifically, it is a 2-class

LCM. The root node is a latent variable with two possible values. The manifest variables

are connected as children to the root node.

Suppose a model m(j−1) is obtained after j − 1 iterations. In the j-th iteration, the

26

X1 X2 X3

X4 X5 X6

Y2

Y1

(a) m1

X1 X2 X3

X4

X5 X6

Y2 Y3

Y1

(b) m2

X1 X2

X4

X3 X5 X6

Y2 Y3

Y1

(c) m3

Figure 3.3: Examples of applying the node introduction, node deletion, and node reloca-tion operators. Introducing Y3 to mediate between Y1 and the pair X5 and X6 in m1 givesm2. Relocating X3 from Y2 to Y3 in m2 gives m3. In reverse, relocating X3 from Y3 to Y2in m3 gives m2. Deleting Y3 in m2 gives m1.

algorithm uses some search operators to generate candidate models by modifying the

base model m(j−1). The BIC score is then computed for each candidate model. Use m′ to

denote the candidate model with the highest BIC score. If m′ has a higher BIC score than

m(j−1), m′ is used as the new base model m(j) and the algorithm proceeds to the (j+1)-th

iteration. Otherwise, the algorithm terminates and returns m∗ = m(j−1) (together with

the MLE of the parameters).

When we learn the structure of an LTM, we need to determine the number of latent

variables, their cardinalities, and the connections between all variables. Search operators

are needed to modify these aspects of the structure to effectively explore the model space.

In the brute-search search, five search operators are used. They are state introduction,

state deletion, node introduction, node deletion, and node relocation.

Given an LTM and a latent variable in the model, the state introduction (SI) operator

creates a new model by adding a state to the domain of the variable. The state deletion

(SD) operator does the opposite. Applying SI on a model m results in another model that

includes m. Applying SD on a model m results in another model that is included by m.

Node introduction (NI) involves one latent node Y and two of its neighbors. It creates

a new model by introducing a new latent node Ynew to mediate between Y and the two

neighbors. The cardinality of Ynew is set to be that of Y . For example, in the model m1

27

of Figure 3.3, introducing a new latent node Y3 to mediate Y1 and its neighbors X5 and

X6 results in m2. Applying NI on a model m results in another model that includes m.

A node deletion (ND) operator is the opposite of NI. It involves two neighboring latent

nodes Y and Ydelete. It creates a new model by deleting Ydelete and connecting all neighbors

of Ydelete (other than Y ) to Y . We refer to Y as the anchor variable of the deletion and

say that Ydelete is deleted with respect to Y . For example, in the model m2 of Figure 3.3,

deleting Y3 with respect to Y1 leads us back to the model m1. Applying ND on a model m

results in another model that is included by m if the node deleted has more or the same

number of states as the anchor node.

A node relocation (NR) involves a node V , one of its neighboring latent nodes Yoriginand another latent node Ydest. The node V can be a latent node or a manifest node. The

NR operator creates a new model by relocating V to Ydest. In other words, it removes the

edge between V and Yorigin and adds an edge between V and Ydest. For example, in m2

of Figure 3.3, relocating X3 from Y2 to Y3 results in m3.

Note that for the sake of computational efficiency, this brute-force search does not

consider introducing a new node to mediate Y and more than two of its neighbors. This

restriction can be compensated by considering a restricted version of node relocation after

a successful node introduction. Suppose Ynew is introduced to mediate between Y and its

two neighbors. The restricted version of NR relocates one of the neighbors of Y (other

than Ynew) to Ynew.

In principle, the above brute-force search should be able to find the LTM with the

highest BIC on some given data. However, the search has two problems. First, it may be

stuck at local maxima. Second, it is inefficient. We show how these problems have been

addressed by other methods below.

Double Hill-Climbing Algorithm

When Zhang [171, 172] reports his studies on LTMs, which is named HLCMs at that

time, he also devises a structural learning algorithm for LTMs. The algorithm is called

the double hill-climbing (DHC) algorithm by Zhang and Kočka [173]. Same as the brute-

force search, the DHC algorithm is an iterative method.

The algorithm uses the BIC score for model selection. It hill-climbs in two different

phases in each iteration. In the first phase, it fixes the cardinalities of the latent variables

and searches for the best model structure. It considers candidates generated by the NI,

ND, and NR operators during this phase. The best candidate model is passed to the

second phase. Note that only one operation has been used in the first phase to change

28

the original model to the best candidate model.

In the second phase, the algorithm searches for the optimal cardinalities for the latent

variables. It starts with the minimum cardinalities for all latent variables in the base

model. It then hill-climbs using the SI operator until it cannot find a better model. More

than one SI operations can be used in this phase. The best model is passed to the next

iteration. The whole algorithm stops when both phases fail to find a better model.

Compared with the brute-force search, the DHC algorithm considers SI in a separate

phase. A possible reason for this is that the candidate models generated by SI and NI

cannot be compared directly using the BIC score. This issue is addressed in a later

development, which is described next.

Operation Granularity

Some search operators may increase the complexity of the current model much more than

other search operators. This issue is known as operation granularity [26, 27]. As an ex-

ample, consider a 2-class LCM with 100 binary manifest variables. NI would introduce

2 additional parameters in this model, but SI would introduce 101 additional parame-

ters. This illustrates that candidate models resulting from SI usually have much more

parameters than those resulting from NI.

If we use BIC to evaluate candidate models given by the search operators, those

having a much larger increase in complexity are usually preferred. This might lead to

local maxima.

Zhang and Kočka [173] propose a cost-effective principle to address this issue. Let m

be the base model and m′ be a candidate model. They define the improvement ratio of

m′ over m given data D by

IR(m′,m|D) =BIC(m′|D)−BIC(m|D)

d(m′)− d(m), (3.4)

where d(·) denotes the number of independent parameters in a model. The ratio measures

the unit improvement of m′ over m. It is also related to the likelihood ratio test in a later

work [28]. Among those candidate models with more parameters than the base model, the

cost-effective principle stipulates that the one with the highest improvement ratio should

be chosen.

Single Hill-Climbing Algorithm

Zhang and Kočka [173] propose another hill-climbing algorithm for LTMs. The algorithm

29

is called the single hill-climbing (SHC) algorithm. This algorithm is similar to the DHC

algorithm. It is an iterative method and each iteration is divided into two phases. How-

ever, the search operators are grouped differently in those two phases. Moreover, both

phases may have multiple search steps. In other words, in each phase, the algorithm

repeatedly improves the model until it cannot find a better model.

The SHC algorithm adopts an expand-and-retract strategy for the search. This is

similar to the greedy equivalence search (GES) [30, 113], proposed for structural learning

of Bayesian networks when all variables are observed. In each iteration, the first phase

tries to improve the model by expansion, while the second tries to improve the model by

retraction.

In the first phase, the NI, SI, NR operators are used. The algorithm considers candi-

date models that can have more parameters than the base model. In fact, the NR operator

does not always result in models with more parameters. Suppose the NR operator has

generated some candidate models that do have more parameters than the base model. If

the best model among them has a higher BIC score than the base model, then that model

is used as the base model for the next search step. Otherwise, the remaining models

generated by the NR operator are compared along with the candidate models generated

by the SI and NI operators. The algorithm follows the cost-effective principle to choose

among these models.

In the second phase, the SD and ND operators are used. They result in model with

fewer parameters. The algorithm repeats in this phase until it cannot find a better model

using those two operators.

Heuristic Single Hill-Climbing Algorithm

In each search step, the SHC algorithm uses fewer operators than the brute-force search.

This means that it has to evaluate fewer models. In this sense, the SHC algorithm is more

efficient However, to evaluate a model, it still needs to run the EM algorithm. And the

EM algorithm is known to be computationally expensive. Hence, the SHC algorithm is

still not very efficient.

Zhang and Kočka [173] propose an improved algorithm over SHC to address the effi-

ciency issue. The algorithm is called the heuristic single hill-climbing (HSHC) algorithm.

It is inspired by the structural EM [53]. The idea of the structural EM is to complete the

data using the current model, and then evaluate the candidate models using the completed

data. Heuristics based on this idea are proposed for the search operators except the SD

operator. They are are used to select the best candidate model for each search operator.

30

This saves the calls to the EM algorithms. However, the candidate models generated by

different operators are evaluated as in SHC so that the evaluation can be more accurate.

Node Relocation Operator

The NR operator used in DHC, SHC, and HSHC is actually slightly different from the

one we describe in the brute-force search. The one used in those algorithms is known as

a restricted version of NR. Consider the relocation of a node Y from a neighboring latent

node Yorigin. In the restricted version of NR, Y can be moved to only those latent nodes

neighboring Yorigin. In contrast, Y can be moved to any latent nodes in the unrestricted

version of NR.

The two versions of NR are compared by Chen [24]. Using the restricted version is

found to be faster. However, this is more likely to get trapped in local maxima. Therefore,

it is suggested to use the unrestricted version for node relocation. In this thesis, the NR

operator refers to the unrestricted version unless otherwise stated.

Restricted Likelihood

HSHC uses heuristics to speed up model evaluation. On the other hand, Chen et al. [26]

propose another way to do this. They use the so-called restricted likelihood, which is

explained below.

Consider the current model m after a number of search steps. A candidate model m′

can be obtained from m by applying a search operator. Very often, the search operator

modifies only a small part of m, so that m and m′ share a large number of parameters.

For example, consider the model m2 in Figure 3.3. It is obtained from model m1 using

the NI operator. Models m1 and m2 shares many parameters, such as P (x1|y2), P (x2|y2),P (x3|y2), P (y2|y1), and P (x4|y1). On the other hand, some parameters are not shared

by m and m′. In this example, parameters P (x5|y1) and P (x6|y1) are specific to m1,

while parameters P (y3|y1), P (x5|y3) and P (x6|y3) are specific to m2. The parameters θ′

of m′ can be divided into two groups. They can be written as θ′ = (θ′1,θ′2). The first

component θ′1 consists of parameters shared with m, whereas the second one θ′2 consists

of parameters specific to m′. Similarly, the parameters of m can be written as θ = (θ1,θ2)

with respect to m′.

Suppose we have computed the MLE θ∗ = (θ∗1,θ∗2) of the parameters ofm. Parameters

θ∗1 can be used as estimates for the shared parameters of m′. Consequently, we can obtain

a likelihood function P (D|m′,θ∗1,θ′2) that depends only on the unshared parameters θ′2.

This function is referred to as restricted likelihood function of m′.

31

The BIC score requires the maximum log-likelihood ofm′. Instead of computing it over

all parameters of m′, we can approximate it by maximizing the restricted log-likelihood

over only the subset of parameters θ′2. This results in an approximate score given by

BICRL(m′|D) = maxθ′2

logP (D|m′,θ∗1,θ′2)−d(m′)

2logN. (3.5)

The advantage of using BICRL is that it allows a more efficient implementation of

EM. There are two reasons. First, it involves a maximization of fewer parameters in

the M-step. This also means fewer computations in the E-step, since only distributions

relevant to θ′2 need to be computed. Second, the E-step can exploit the fixed parameters

θ∗1 to allow sharing of computation among all iterations of EM. Specifically, the E-step

requires inference on an LTM, which in turn requires passing messages in the clique tree.

As the parameters θ∗1 do not change in these iterations, the messages that depend on

only these parameters remain the same in all iterations. Therefore, some messages can be

cached for each data case. They can then be shared for the inference used in subsequent

E-steps.

Chen et al. [26] propose to use both BIC and BICRL to evaluate models. However,

they are used at different situations. The approximation BICRL can be used for quickly

evaluating candidate models generated by the same search operator. On the other hand,

the real BIC can be used for accurately evaluating the candidate model with the highest

BICRL for each of the search operators. This improves the speed for the evaluation of

models within search operators, but maintains the accuracy for the evaluation of models

across search operators.

EAST

In addition to the restricted likelihood, Chen et al. [26] propose another modification to

the HSHC. They adopt a grow-restructure-thin strategy for the search. They divide the

operators into three phases. In the expansion phase, the SI and NI operators are used.

In the adjustment phase, the NR operator is used. In the simplification phase, the SD

and ND operators are used. Due to the names of the three phases, the whole algorithm

is called EAST.

The improvement ratio is used in the expansion phase. However, it is used only for

comparing the best candidate models given by different search operators. The BIC score

is used in the other two phases. To improve the efficiency, the restricted likelihood is also

used to approximate both the BIC score and the improvement ratio.

32

X1 X2 X3 X4

(a)

X1 X2 X3 X4

(b)

X1 X3 X2 X4

(c)

X1 X4 X2 X3

(d)

Figure 3.4: Four possible resulting structures of quartet test over manifest variables X1–X4. (a) shows a fork. (b), (c), and (d) show three dogbones with different combinationsof sibling variables. Directions of the edges are omitted since they cannot be determinedfrom data.

EAST is considered as the state-of-the-art method for learning LTMs. It is described

in more details by Chen et al. [28]. Recently, methods using other approaches have been

proposed for learning LTMs. We describe those methods next.

3.5.2 Constraint-Based Methods

To learn Bayesian networks, conditional independencies are identified as constraints by

constraint-based methods [29, 150]. These constraints can then be used to determine the

connections between variables. They are sufficient when the variables are fixed in a model.

However, variables are not fixed when we learn LTMs. We need to determine the number

of latent variables and their cardinalities. Therefore, other constraints are needed.

For LTMs, some form of sibling test is usually used. Siblings refer to variables that

have the same parent. And a sibling cluster refers to the child variables of the same parent.

The sibling test not only allows the connections among variables to be determined, it may

also suggest that a latent variable can be added as parent for a sibling cluster.

In the following, we review how constraints can be used to learn LTMs.

Quartet-Based Learning

Chen and Zhang [25] explore the possibility of using a quartet test to learn LTMs. Given

four manifest variables, the test determines the best regular LTM on them. There are

four possible structures (Figure 3.4). The structure is called a fork if there is only one

latent variable. It is called a dogbone if there are two latent variables. For the latter case,

there can be three different groupings of the sibling variables.

33

Chen and Zhang [25] propose a quartet-based method to find all sibling clusters given

some manifest variables. Their method assumes that the quartet test is always correct.

Suppose we want to determine whether two manifest variables X1 and X2 belong to the

same sibling cluster. Their method tries to find two other manifest variables Z1 and Z2,

such that the quartet test over the four manifest variables returns a dogbone in that X1

and X2 are not sibling. Chen and Zhang [25] prove that X1 and X2 belong to the same

sibling cluster if and only if such manifest variables Z1 and Z2 do not exist. Using this

result, their method finds all the sibling clusters by essentially checking every pair of

manifest variables to see whether they belong to the same cluster.

This method has two limitations. First, its analysis relies on the assumption that the

quartet test is always correct. However, this assumption seems unrealistic. Second, the

method is incomplete. It does not suggest how the quartet test can be done. It also stops

short of building a LTM after the sibling clusters are found.

Information Distance

In addition to quartet test, the so-called information distance has also been used to

determine the sibling clusters.

Consider two discrete variables Vi and Vj. Denote the joint probability matrix between

Vi and Vj by Jij. It is defined as Jij = (pijab) ∈ R|Vi|×|Vj |, where pijab = P (Vi = a, Vj = b).

Also, denote the marginal probability matrix of Vi by Mi. It is defined as a diagonal

matrix Mi = (piab) ∈ R|Vi|×|Vi|, where its diagonal entries are given by paa = P (Vi = a).

The information distance between Vi and Vj is defined by

dij = − log| det Jij|√

detMi detMj

.

A nice property of the information distance is that it is additive [93]. Specifically,

consider two variables Vk and Vl in an LTM. Denote the path between them by path(k, l).

The information distance dkl is given by the sum of the distances along the path, that is,

dkl =∑

(i,j)∈path(k,l)

dij.

Due to the additivity property, Choi et al. [32] show that the information distance can

be used to find out some parent-child and sibling relationships among variables. For any

three variables Vi, Vj, and Vk, define Φijk = dik − djk to be the difference between the

information distances dik and djk. It has been shown that for any pair of variables Vi and

Vj in an LTM:

34

1. Φijk = dij for all Vk 6= Vi, Vj if and only if Vi is a leaf node and Vj is its parent;

similarly, Φijk = −dij for all Vk 6= Vi, Vj if and only if Vj is a leaf node and Vi is its

parent; and

2. −dij < Φijk = Φijk′ < dij for all Vk, Vk′ 6= Vi, Vj if and only if both Vi and Vj are leaf

nodes and they belong to the same sibling cluster.

Recursive Grouping and Chow-Liu Recursive Grouping

Choi et al. [32] propose two algorithms to learn LTMs by using the above two properties

based on information distance to identify relationships among variables. The first algo-

rithm is called Recursive Grouping (RG). As its name suggests, it is an iterative method.

It maintains a working set S of leaf variables through iterations.

The method starts with adding all manifest variables into S. The information distance

is used to identify parent-child relationships and sibling relationships among the variables

in S. The variables are connected based on the identified relationships. A new working

set S ′ is then constructed for the next iteration.

In each iteration, variables in S are partitioned into sibling clusters along with their

identified parents. The new working set S ′ is constructed according to the three possible

cases for the sibling clusters. If a sibling cluster contains only one variable, the variable is

added to S ′. If a sibling cluster contains an identified parent, the parent is added to S ′.If a sibling cluster has no identified parent, a new latent variable is added as the parent

of the sibling cluster, and the new variable is added S ′.

When the method moves to the next iteration with S ′, it has to compute information

distances for the variables added in the previous iteration. Choi et al. [32] show that these

distances can be inferred from the existing distances. The method then proceeds as in the

previous iteration. It stops until a working set has no more than two variables when an

iteration starts. And if the working set has two variables, the two variables are connected

with an arbitrary direction, so that a tree is formed.

The second algorithm proposed by Choi et al. [32] is called Chow-Liu Recursive Group-

ing (CLRG). It is based on the well-known Chow-Liu algorithm [33] for constructing tree-

structured Bayesian networks. The algorithm starts with building a minimum spanning

tree over all manifest variables based on the information distances. Let I be the set of

internal nodes of the tree. For each node V in I, let N be a set containing V and its

neighbors. A subtree is built on N using the RG algorithm. The nodes N in the main

tree is then replaced by the subtree. The algorithm repeats until all nodes in I has been

operated on.

35

The two proposed algorithms have been proved to be structurally consistent. This

means the model structure can be correctly recovered, provided that the number of sam-

ples is large enough. The RG algorithm is faster than CLRG when there are few latent

variables, and vice versa otherwise.

While strong theoretical guarantees are given on the algorithms, they rely on two as-

sumptions that may not be realistic. First, it assumes that information distances between

variables are accurate. However, in reality they have to be estimated from finite samples.

The estimated distances may lower the accuracy of the identified relationships between

variables. Second, it assumes that the data is sampled from a tree model. If the samples

are generated from a general model, there is no guarantee on the performance of the

models resulting from these algorithms.

The algorithm has another significant limitation. It requires all variables, including

both latent and manifest variables, to have the same cardinality. This may not be a

problem if the resulting LTMs are used for density approximation or dimension reduction.

However, this may be less desirable if the latent variables are used for clustering, since we

may want to have different numbers of clusters for different partitions.

Note that the above two algorithms can also be used when all variables are Gaussian.

In addition, they allow manifest variables to be internal nodes in the resulting models.

Therefore, they can be used for learning models other than LTMs.

Spectral Recursive Grouping

Anandkumar et al. [3] propose a spectral quartet test for identifying sibling relationships

among variables. While their test can be applied to some other models, we focus on

LTMs with discrete variables here. Their method assumes that all manifest variables

have d states, whereas all latent variables have k ≤ d states.

Consider any four variables in an LTM. Suppose we sum out all other variables in

the model. The four variables can induce one of the four subtrees in Figure 3.4. Given

the four variables, the spectral quartet test recovers the induced subtree based on some

properties of the covariance matrices between variables. The test is conducted as follows.

Let Σij denote the covariance between variables Vi and Vj. Also, let σs(M) denote the

s-th largest eigenvalue of a matrix M , and detk(M) =∏k

s=1 σs(M) denote the product of

the k largest eigenvalues of M . Under some mild conditions, Anandkumar et al. [3] show

that {Vi, Vj} and {Vi′ , Vj′} are two pairs of siblings in the induced subtree if and only if

detk(Σij)detk(Σi′j′) > detk(Σij′)detk(Σi′j). (3.6)

36

Intuitively, the inequality states that the correlations between sibling variables should be

larger than those between non-sibling variables.

The proposed test determines the induced subtree by checking whether there is any

combination of siblings that can fulfill the above inequality. If so, it returns one of the

dogbones in Figure 3.4. Otherwise, it returns a fork. Note that while a dogbone can

ascertain the sibling relationships, a fork cannot.

In practice, the covariance matrices Σij are unknown and have to be estimated from

data. To account for the error in estimation, a confidence parameter ∆ij for each pair of

variables Vi and Vj is used. The eigenvalues of the covariance matrices are adjusted by

these confidence parameters to allow a larger margin of error in estimation.

Anandkumar et al. [3] propose an learning algorithm named Spectral Recursive Group-

ing (SRG). The algorithm is based on the RG algorithm [32]. Instead of using information

distance, SRG uses the spectral quartet test to identify parent-child and sibling relation-

ships among variables.

SRG relaxes the assumption of RG to allow the latent variables and manifest variables

to have different cardinalities. However, it still assumes the same cardinality for all latent

variables. Similar to the case for RG, this may limit the effectiveness of using SRG for

clustering.

3.5.3 Variable Clustering Methods

Sibling variables in LTMs are more likely to be more correlated than other variables.

This intuition leads to some algorithms that build LTMs by grouping similar variables

together. The grouping can be given by variable clustering.

A variable clustering method usually builds an LTM in two main steps. It first clusters

the variables, and then builds an LTM based on the variable clusters. To cluster variables,

we need a similarity measure between variables. The mutual information [34] is often used

for this purpose. Consider two variables Vi and Vj. The mutual information MI(Vi;Vj)

between them is given by

MI(Vi;Vj) =∑Vi,Vj

P (Vi, Vj) logP (Vi, Vj)

P (Vi)P (Vj). (3.7)

After the variable clusters have been obtained, a latent variable is added as a parent

for each variable cluster. The latent variables are then somehow connected together so

that an LTM can be obtained.

37

There are four remaining issues in this approach. First, a variable clustering algorithm

is needed. Second, it needs to determine the cardinalities of the latent variables. Third, it

needs to decide how the new latent variables are processed. Finally, the size of a variable

cluster may need to be determined.

In the following, we review how these issues are addressed by different methods.

Hierarchical Clustering Learning / LTAB

Hierarchical clustering produces nested partitions of objects. The nested partitions are

usually represented by a dendrogram (Figure 1.1). When we perform hierarchical clus-

tering on the variables, the dendrogram may look similar to the structure of an LTM.

This suggests that the structure of LTMs can possibly be learned based on hierarchical

clustering on variables.

Wang et al. [162] propose a learning algorithm based on hierarchical clustering. The

algorithm is called hierarchical clustering learning (HCL) by Wang [161]. However, it is

also known as LTAB in the literature, due to its application in density estimation.

The algorithm accepts a parameter for the cardinality of all latent variables. It con-

structs the model structure in the same way as how a dendrogram is built. Specifically,

in each step it finds the pair of variables that have the highest mutual information. Then

a new latent variable, with the given cardinality, is added as the parent of that pair of

variables. The latent variable replaces the pair of variables for consideration in the next

step. The algorithm repeats until only one variable is left for consideration.

Single link is used to estimate the mutual information between the new latent variable

and other variables. In other words, the mutual information between any two variables is

given by the maximum of the mutual information between these variables or any of their

descendant leaf variables.

The hierarchical clustering yields a binary tree. However, the resulting tree may be

irregular due to the violation of Conditions 1 and 2. If so, the model is regularized until no

violation is found. Due to regularization, a latent node may have more than two children

in the final model. Its cardinality may also be smaller than the one given as parameter.

BIN-G and BIN-A

Harmeling and Williams [68] propose two closely related algorithms for learning LTMs,

namely BIN-G and BIN-A. Similar to HCL, they are based on hierarchical clustering.

38

The difference between these two algorithms is in the handling of the pair of variables

with the highest mutual information in each step. In BIN-G, a LCM is learnt on the pair

of variables. The cardinality of the new latent variable is given by the LCM obtained.

The pairwise mutual information between new latent variable and other variables are

computed based on completion of data.

In BIN-A, new latent variables are added only after hierarchical clustering has com-

pleted. During hierarchical clustering, the mutual information between two clusters of

variables is estimated by average link. In other words, it is estimated by the average of

the mutual information between any pair of leaf variables from the two different clusters.

After hierarchical clustering, a latent variable is added for each pair of variables clustered

together. Similar to BIN-G, its cardinality is estimated by building an LCM on its child

variables. The cardinalities of the new latent variable are estimated recursively starting

from the leaf variables.

Unlike HCL, the LTMs by these two algorithms do not go through any regularization.

Hence, the final structures are always binary trees. The models are likely to have more

latent variables than those obtained from HCL. On the other hand, BIN-G and BIN-A

allow more flexible cardinalities of latent variables than HCL.

Pyramid

In hierarchical clustering, each cluster has two variables, and hence methods based on

hierarchical clustering yield binary trees. To allow clusters with a variable number of

variables, a criterion is needed to determine the size of clusters.

Wang [161] proposes such a criterion. A test called unidimensionality test determines

whether some variables should belong to the same cluster or different clusters. Given some

variables, the test works by learning an LTM on the variables. The LTM is restricted to

have at most two latent variables. This means the test returns an LTM with either one

or two latent variables. If the LTM has one latent variable, it indicates that the given

variables belong to the same cluster. Otherwise, the variables should belong to different

clusters.

An algorithm called Pyramid is proposed by Wang [161]. It is based on agglomerative

clustering. It uses the unidimensionality test to determine whether the growth of a sibling

cluster should stop. Specifically, the algorithm starts by finding the two variables with the

highest mutual information. The variables are put into a sibling cluster S. The algorithmgrows S by repeatedly finding the next variable that has the highest mutual information

with any of the variables in S. Unidimensionality test is run each time when a variable

39

is being added. Denote the variable being added by V . If the test indicates that the

variables in S ∪ {V } belong to the same cluster, V is added to S. And the algorithm

continues to find the next variable to grow S. Otherwise, V is not added and the growing

of S stops.

After a sibling cluster is found, a latent variable is added as the parent of the con-

stituent variables. The cardinality is given by the LTM obtained during the last unidimen-

sionality test. The latent variable replaces its child variables for the consideration of the

next sibling cluster. The algorithm continues until one variable is left for consideration.

Unlike those methods based on hierarchical clustering, a latent variable in the LTMs

obtained from Pyramid can have more than two children.

3.5.4 Comparison between Approaches

Among the three approaches, the constraint-based methods provide some theoretic guar-

antees that the other approaches do not. Those methods are particularly useful if it is

important to have correct relationships among the variables. However, the guarantees are

based on the assumption that the underlying distribution is generated from an LTM. They

are questionable when the assumption is not valid. Moreover, there are some restrictions

on the variables that these methods can handle. For example, all latent variables need

to have the same cardinality by many of those methods. This may make those methods

unsuitable for clustering.

The score-based methods do not have any theoretical guarantees. On the other hand,

these methods aim to maximize the scores of the resulting LTMs. If they are not trapped

by local maxima, the maximized scores provide an assurance on the quality of the models.

Also, there are no restrictions on what models these methods can learn.

Those methods based on variable clustering do not have any guarantee on their per-

formance at all. The LTMs obtained from them may also have more latent variables than

those LTMs from other approaches. On the other hand, those methods in general have a

significant speed advantage over the score-based methods. However, a recent study shows

that they can be slower than some constraint-based methods [32].

3.6 Applications

Several types of applications of LTMs have been proposed. They include multidimensional

clustering, latent structure discovery, density estimation, and classification. LTMs have

40

also been applied in various domains, such as traditional Chinese medicine, text mining,

and financial data. We review the different types and domains of applications below.

3.6.1 Multidimensional Clustering

In a mixture model, the discrete latent variable can be used for clustering. Similarly, the

discrete latent variables in an LTM can be used for this purpose. And with multiple latent

variables in an LTM, we can cluster data in multiple ways.

Zhang [172] suggests that an LTM can give multiple clusterings with its multiple latent

variables. However, he does not describe how this can done in detail at that time. This

idea is further developed through the subsequent work of Chen [24] and Chen et al. [28].

The development results in a theme of clustering called multidimensional clustering.

As an example of using LTMs for multidimensional clustering, consider the LTM shown

in Figure 3.2(a). There are three latent variables in this model. Each one represents a

different way to partition data. And each one partitions the data with different degrees

of dependence on different attributes (manifest variables). Hence, the latent variables

represent clusterings along three different dimensions.

A latent variable can be considered as partitioning data based on its neighboring

variables. Those variables can be latent or manifest. To interpret the meaning of the

clustering given by a latent variable, we need to determine which variables the clustering

depends on. On the one hand, a manifest variable contributes most to the clustering given

by its parent variable. This is due to the fact that, in an LTM, the mutual information

of a manifest variable with its parent variable must be larger than or equal to that with

any other latent variable [34, P. 34]. For example, in Figure 3.2(a), since Y2 is the parent

of X1, it follows that Y2 = arg maxY ∈{Y1,Y2,Y3}MI(X1;Y ).

On the other hand, a latent variable may not be most related to its child manifest

variables. Although the latent variable often partitions data based mainly on its child

variables, this is not necessarily true. For example, in Figure 3.2(a), if MI(Y3;X5) and

MI(Y3;X6) are low, Y3 may have a higher mutual information with the child variables

of other latent variables, especially when MI(Y1;Y2) is high. Therefore, to interpret the

meaning of a clustering given by a latent variable, we need to know how the clustering

relates to the attributes.

Chen et al. [28] propose to use information curves to show the relationships between

a latent variable and the other attributes. An example of them is shown in Figure 3.5.

Each latent variable in an LTM can have a chart with two curves. The first curve is called

pairwise information curve. It shows the pairwise mutual information between the latent

41

Figure 3.5: Information curves of Y2 for the LTM in Figure 3.2(a). The red curve is thepairwise information curve (left axis). The blue curve is the cumulative information curve(right axis).

variable with each attribute. The attributes are sorted in descending order, so that those

with strongest dependence are shown first. The exact pairwise mutual information can

be computed with the model.

The second curve is called cumulative information curve. It shows the the percentage

of cumulative mutual information of the latent variable with the attributes from the first

attribute up to an attribute, divided by the mutual information between latent variable

and all attributes. For example, the second point on the cumulative information curve in

the figure shows the value of MI(Y2;X1,X2)MI(Y2;X1,...,X6)

. This ratio estimates how much information of

the latent variable can be accounted for from the first few attributes. If the ratio is high,

we can interpret the latent variable using ;only those attributes. Since it is intractable to

compute the cumulative mutual information exactly, the computation is done with data

sampling from the model.

Using the information curves, we can find a small subset of attributes that a latent

variable is mainly related to. We can then use those attributes to interpret the clusters

given by the latent variable.

In addition to having clusterings along different dimensions, LTMs allow one to under-

stand the relationships between different clusterings. Since latent variables are connected

in an LTM, their probabilistic relationships have be represented in the model. As a conse-

quence, we can compute the joint probability or conditional probability of different latent

variables in an LTM. We can then examine the probability to understand the relationships

between clusterings.

Chen et al. [28] give an example of multidimensional clustering on a real-world data.

The data was obtained from a survey by ICAC, which is an anti-corruption agency in Hong

42

Kong. The data has 31 attributes and 1200 samples. Eight latent variables were obtained

from an LTM given by EAST. Some of them are interpreted and are found meaningful.

The LTM also reveals some interesting relationships between the latent variables.

3.6.2 Latent Structure Discovery

Latent structure discovery can be considered as a complementary analysis to multidimen-

sional clustering. In multidimensional clustering, we are interested in the groupings of

data. In latent structure discovery, we are more interested in the structure of the model.

For example, we may be interested in how manifest variables are grouped into sibling

clusters. This information reveals which attributes may share the same hidden factors.

Zhang et al. [175] perform latent discovery on the CoIL Challenge 2000 data [158].

The data set contains information on customers of an insurance company. The data set

they work on contains 42 attributes. Three attributes show the socio-demographic data

of the customers, while the others show the ownership of various insurance products by

the customers.

An LTM was obtained from HSHC. In the model, the attribute for contribution to one

kind of insurance policies is almost always paired as sibling variables with the attribute

for the number of that kind of insurance policies. While this is obvious to human beings,

it is interesting to see that machines can do the pairing automatically.

In addition, some related attributes are also grouped under the same ancestor latent

variable. For example, of one latent variable, 8 out of 10 descendant attributes are related

to agriculture products. Of another latent variable, 11 out of 13 descendant attributes

are related to private vehicles. The model indicates that these latent variables are the

common factors for those related attributes.

3.6.3 Density Estimation

Another application of LTMs is in density estimation. In this application, we are given

a original distribution, or some samples obtained from that distribution. We aim to

approximate the distribution using an LTM. Here, the model structure is not that impor-

tant. What is more important is whether the LTM can estimate the original distribution

accurately.

Suppose we use an LTMM to approximate the distribution of another modelM0. Note

that M0 is not an LTM in general. There are two usual ways to measure the accuracy

of the approximate distribution. First, it can be measured in terms of the likelihood of

43

the M on some testing data generated from M0. Let D = {d1, . . . ,dN} be N samples of

testing data. The likelihood of M is given by

P (D|M) =N∑i=1

P (di|M).

Second, the accuracy can be measured in terms of the KL divergence [34]. Let X be

the variables of M0. Also, let PM(X) and PM0(X) denote the distributions of X given

by models M and M0, respectively. The KL divergence between those two distributions

is

D[PM0(X)||PM(X)] =∑X

PM0(X) logPM0(X)

PM(X).

The KL divergence achieves its minimum of zero if the approximate distribution matches

the original distribution exactly.

Wang et al. [162] suggest the use of LTMs for approximate inference. This application

is a kind of density estimation. Their motivation is that probabilistic inference may be

intractable for some complex Bayesian networks. On the other hand, inference on LTMs

is efficient due to the tree structure. Hence, if an LTM can approximate the distribution

of the original BN accurately, it can be used for inference with greatly improved efficiency.

To carry out this idea, an algorithm called LTAB is proposed. Suppose we want

to do inference on a Bayesian network M0. The algorithm first generates samples from

M0. It then learns an LTM from the samples using the HCL method. After that, the

parameters of the LTM is estimated using EM. Finally, the resulting LTM can be used

for approximate inference.

Wang et al. [162] compares LTAB with a standard approximate inference scheme called

loopy belief propagation (LBP) [116]. The experiments were conducted on 8 BNs. Given

the same amount of inference time, LTAB consistently outperforms LBP in approximation

accuracy in terms of KL divergence. To achieve the same level of accuracy, LTAB can be

faster than than LBP by one to three orders of magnitude.

There is one drawback of LTAB. Although the inference time can be fast, the training

time for LTMs can be slow. Therefore, this method can be considered as a tradeoff

between training time and inference time.

3.6.4 Classification

Classification aims to predict the class label of an instance given its attributes. This area

also finds its use with LTMs.

44

Hierarchical Naïve Bayes Models

Zhang et al. [174] propose a class of models called hierarchical naïve Bayes (HNB) models

for classification. An HNB model is almost identical to an LTM. It is different from an

LTM in that its root node is used to represent the class variable. To learn HNB models,

an algorithm based on DHC can be used. Experiments show that HNB models have

comparable classification accuracy to naïve Bayes models [42]. But more importantly,

the HNB models can also reveal some interesting latent variables. This is like performing

classification with latent structure discovery at the same time.

The above method uses the BIC score to guide the model search. To improve the

classification accuracy of the resulting models, Langseth and Nielsen [94] propose to use

the classification accuracy for model selection during the search. The classification accu-

racy is estimated by cross-validation, so the testing data need not be used. The search

algorithm is also modified to improve the computational complexity.

HNB models obtained from this modified algorithm were compared with 7 other clas-

sifiers on 22 data sets. The experiments show that HNB models are better than other

classifiers on 12 data sets. And they are not significantly poorer than the winners on

other 8 data sets. In addition to the good classification performance, some HNB models

obtained from this algorithm are also shown to have interesting structures.

Latent Tree Classifiers

Wang et al. [163] propose a different approach to using LTMs for classification. The

models used are called latent tree classifiers (LTC). They can be considered as mixtures

of LTMs, where the class variable is used as the mixture variable.

Specifically, an LTM is built for data for every class label in a LTC. Let X be the

attributes in a data set. Also, let Dc denote those samples that belong to class c in the

data set. For each c, an LTM Mc is built on Dc. Model Mc can be used to estimate

the distribution P (X|c). Given a sample d, the classification c can be then obtained by

c = arg maxc P (X = d|c)P (c), where P (X = d|c) is given by Mc.

To learn an LTM for a class, the EAST algorithm is used. However, the AIC score [2]

is used instead of the BIC score. Here, the model Mc can be considered as an estimate

to P (X|c). Therefore, our objective is to minimize difference between the distribution

estimated by Mc and the true distribution. The AIC score is an approximation to the

expected KL divergence of the model to the true distribution. It matches better to the

current objective and hence it is used. Another reason to use it is that it is shown

empirically to perform better than the BIC score in a previous study [161].

45

The LTCs were compared with 4 other methods related to the naïve Bayes models. In

general, they achieve a higher classification accuracy. Similar to the HNB models, some

of the LTCs are also shown to have meaningful structures.

3.6.5 Domains of Applications

The above subsections reviewed four types of applications of LTMs. In this subsection,

we review the different domains of these applications.

Traditional Chinese Medicine

Traditional Chinese medicine (TCM) is an important domain of applications for LTMs.

Multidimensional clustering and latent structure discovery have been used in this domain.

Zhang et al. [176, 177] present a study on a TCM diagnosis called kidney deficiency. LTMs

are used to analyze some data related to this diagnosis. The data set contains 35 syndrome

attributes for 2600 subjects.

An LTM was learnt mainly by the HSHC algorithm. An initial model was obtained,

but it was modified based on domain knowledge to help escape a local maximum. The

HSHC algorithm was run again on the modified model and the resulting model was used

for analysis in their study.

The final LTM exhibits some interesting features. First, some latent variables are

found to have interesting meanings. For example, two latent variables can be interpreted

to mean kidney yang deficiency and kidney yin deficiency, respectively.

Second, the structure shows some reasonable groupings of variables. The latent vari-

able kidney yang deficiency (KYD) has five descendent manifest variables. They represent

five syndromes, namely loose stool, indigested grain in stool, intolerance to cold, cold limbs,

and cold lumbus and back. According to TCM theory, those syndromes may result from a

deficiency of kidney yang. The structure thus matches the TCM theory.

Third, the clusters given by some latent variables can also be interpreted meaningfully.

For example, the latent variable for kidney yang deficiency gives five clusters. Three of them

may mean no KYD, medium level of KYD, and severe level of KYD, respectively. The

other two may mean light level of KYD. However, different syndromes are observed from

these two clusters.

To sum up, the above analysis shows that LTMs can be used to validate some TCM

theories. Moreover, the clusters given by LTMs can also be used to classify the subjects.

46

Text Mining

LTMs have also been applied in the domain of text mining [32, 68]. Experiments were

conducted on the 20 Newsgroup data.2 The data were obtained from 16242 postings.

There are 100 binary attributes. Each attribute indicates whether a word appears in a

posting or not.

Harmeling and Williams [68] shows an LTM obtained from the BIN-G algorithm. They

show that the LTM groups some manifest variables in a meaningful way. For example,

a subtree contains some words related to the topic “medicine” as leaf variables, such as

doctor, medicine, disease, patients, cancer, studies, aids, health, and insurance. Several other

similar subtrees can be found in the model.

The same data set is also analyzed by Choi et al. [32]. Models were obtained from

the RG and CLRG algorithms. The models are a generalization of LTMs, where mani-

fest variables can also be represented by internal nodes. They are shown to have fewer

latent variables than the one obtained from BIN-G. Their BIC scores are also higher. An

model obtained from a modified version of CLRG have some subtrees that can roughly

be interpreted as topics.

Data in text mining tends to have large numbers of attributes and samples. Therefore,

it may not be feasible to use those score-based methods that are currently available in

this domain. On the other hand, attributes in text mining are usually binary. This allows

the use of those learning algorithms that are restricted to have the same cardinality for

all manifest variables.

Finance

Finance can also be one domain of application of LTMs. Some monthly stock data were

analyzed by Choi et al. [32]. Each record of the data represents the monthly returns of

84 companies. The samples were collected from 1990 to 2007.

Choi et al. [32] demonstrate the result using a model obtained from a method similar

to CLRG. The model is slightly different from an LTM. Its variables are all continuous

and it has internal manifest nodes. The model has some subtrees of related companies.

For example, one subtree is related to the telecom industry. It contains Verizon, Sprint,

and T-mobile as descendants.

2 http://www.cs.nyu.edu/~roweis/data.html

47

http://www.cs.nyu.edu/~roweis/data.html

CHAPTER 4

APPLICATION:ROUNDING IN SPECTRAL CLUSTERING

Spectral clustering [159] is one way to handle clusters of irregular shapes. The idea is to

convert clustering into a graph cut problem. More specifically, one first builds a similarity

graph over the data points using a measure of data similarity as edge weights and then

partitions the graph by cutting some of the edges. Each connected component in the

resulting graph is a cluster. The cut is done so as to simultaneously minimize the cost of

cut and balance the sizes of the resulting clusters [65, 143].

The graph cut problem is NP-hard and is hence relaxed. In the relaxed problem, the

cluster indicator functions are allowed to be real-valued. The solution is given by the

leading eigenvectors of the so-called Laplacian matrix, which is a simple transformation

of the original data similarity matrix. In a post-processing step, a partition of the data

points is obtained from those real-valued eigenvectors. This post-processing step is called

rounding [5, 136].

Although the spectral clustering literature is abundant, there are relatively few pa-

pers on rounding. In general, rounding is considered an open problem. There are three

subproblems: (1) decide how many (and more generally which) leading eigenvectors to

use; (2) determine the number of clusters; and (3) determine the members of each cluster.

Among the three, the first two subproblems are considered much harder.

In this chapter we focus on rounding. The remaining chapter is organized as follows.

In Section 4.1, we review some related work. We also indicate three main differences

between our work and the other work. In Section 4.2 we review the basics of spectral

clustering and point out two key properties of the eigenvectors of the Laplacian matrix

in the ideal case. In Section 4.3 we describe a straightforward method for rounding that

takes advantage of the two key properties. This method is fragile and breaks down as

soon as we move away from the ideal case. In Sections 4.4 and 4.5 we propose a model-

based method for rounding that exploits the same two properties. The method is named

LTM-Rounding. It is evaluated on synthetic data in Section 4.6 and is compared with

other methods in Section 4.7. This chapter concludes in Section 4.8.

48

4.1 Related Work

Previous rounding methods fall into two groups depending on whether they assume the

number of clusters is given. When the number of clusters is known to be k, rounding is

usually done based on the first k eigenvectors. The data points are projected onto the

subspace spanned by those eigenvectors and then the K-means algorithm is run on that

space to get k clusters [159]. Bach and Jordan [5] approximate the subspace using a space

spanned by k piecewise constant vectors and then run K-means on the latter space. This

turns out to be equivalent to a weighted K-means algorithm on the original subspace.

Zhang and Jordan [178] observe a link between rounding and the orthogonal Procrustes

problem in Mathematics and iteratively use an analytical solution for the latter problem

to build a method for rounding. Rebagliati and Verri [134] ask the user to provide a

number K that is larger than k and obtain k clusters based on the first K eigenvectors

using a randomized algorithm that repeatedly calls K-means as a subroutine.

When the number of clusters is not given, one needs to estimate it. A common method

is to manually examine the difference between every two consecutive eigenvalues starting

from the first two. If a big gap appears for the first time between the k-th and (k+ 1)-th

eigenvalues, then one uses k as an estimate of the number of clusters. Zelnik-Manor and

Perona [169] propose an automatic method. The method considers a number of integers.

For each integer k, it tries to rotate the first k eigenvectors so as to align them with the

canonical coordinate system for the eigenspace spanned by those vectors. A cost function

is defined in terms of how well the alignment can be achieved. The k with the lowest cost

is chosen as an estimate for the number of clusters. Xiang and Gong [166] and Zhao et al.

[179] question the assumption that clustering should be based on all the eigenvectors from

a continuous block at the beginning of the eigenvector spectrum. They use heuristics to

choose a collection of eigenvectors which do not necessarily form a continuous block, and

then use Gaussian mixture models to determine the number of clusters and to partition

the data points. use Gaussian mixture models to determine the clusters. Socher et al.

[148] assume the number of leading eigenvectors to use is given. Based on those leading

eigenvectors, they determine the number of clusters and the membership of each cluster

using a non-parametric Bayesian clustering method.

In this chapter, we propose and study a novel model-based approach to rounding. The

method differs from the previous methods in three ways. First, we relax the assumption

that the number of clusters equals the number of eigenvectors that one uses for rounding.

In the ideal case where between-cluster similarity is 0, if one knows the number kt of true

clusters, one can indeed recover the kt clusters from the first kt eigenvectors. However,

49

this might not be the case in non-ideal cases or when the number of clusters one tries to

obtain is not kt. Our method allows the number of clusters to differ from the number of

eigenvectors. This is conducive to robust performance in non-ideal cases.

Second, we choose a continuous block of leading eigenvectors for rounding just as

Zelnik-Manor and Perona [169]. The difference is that, when deciding the appropriateness

of the first k eigenvectors, Zelnik-Manor and Perona use only information contained in

those eigenvectors, whereas we also use information contained in subsequent eigenvectors.

So our method uses more information and hence the choice is expected to be more robust.

Third, we solve all the three subproblems of rounding and we do so within one class

of models, namely LTMs. In contrast, most previous methods assume that the first

two subproblems are solved and the solutions are equal, and focus only on the third

subproblem. Xiang and Gong [166] and Zhao et al. [179] do consider all three subproblems.

However, they do not solve all the subproblem within one class of models. They first

choose a collection of eigenvectors based on some heuristics and then use Gaussian mixture

models to solve the other two subproblems. Zelnik-Manor and Perona [169] also considers

all three subproblems. However, their method is not model-based and it assumes the

number of clusters equals the number of eigenvectors. An advantage of the model-based

approach is that its performance degrades gracefully as we move away from the ideal case.

4.2 Basics of Spectral Clustering

In this section we review the basics of spectral clusters and point out two properties that

we exploit later.

4.2.1 Similarity Measure and Similarity Graph

Let X = {x1, . . . ,xn} be a set of n data points in an Euclidean space Rd. In order to

partition the data, one needs to define a non-negative similarity measure sij for each pair

xi and xj of data points. This can be done in a number of ways. In our work we consider

two measures:

• k-NN similarity measure: sij = 1 if xi is one of the k nearest neighbors of xj , or

vice versa, and sij = 0 otherwise.

• Gaussian similarity measure: sij = exp(−||xi−xj ||2

σ2 ), where σ is a parameter that

controls the width of neighborhood of each data point.

50

The matrix S = (sij)i,j=1,...,n is called the similarity matrix.

Given a similarity measure, the data can be represented as an weighted undirected

graph G. In the graph there is a vertex vi representing each data point xi, and there is

an edge between two vertices vi and vj if and only if sij > 0. The value sij is used as the

edge weight and is sometimes denoted as wij. The graph is called the similarity graph

and its adjacency matrix W = (wij)i,j=1,...,n is the same as the similarity matrix S. Note

that the similarity graph G is a complete graph when the Gaussian similarity measure is

used, and it might not be so when the k-NN similarity measure is used.

4.2.2 Graph Laplacian

In spectral clustering one transforms the similarity matrix to get another matrix called

graph Laplacian matrix. There are a number of Laplacian matrices to choose from [159].

In this chapter, we use the normalized Laplacian matrix Lrw given by

Lrw = I −D−1S, (4.1)

where I ∈ Rn×n is the identity matrix, D = (dij) ∈ Rn×n is the diagonal degree matrix

given by dii =∑n

j=1 sij, and D−1 is the inverse of D. The following proposition is well-

known [159].

Proposition 1. The Laplacian matrix Lrw satisfies the following properties:

1. Lrw is symmetric and positive semi-definite.

2. The eigenvalues of Lrw are non-negative and the smallest one is 0.

3. If the similarity graph is connected, then there is only one eigenvalue that equals 0.

4. The unit vector 1 ∈ Rn that consists of all 1’s is an eigenvector for eigenvalue 0.

The eigenvalues of Lrw are arranged in ascending order as 0 = λ1 ≤ λ2 ≤ . . . ≤ λn

and the eigenvectors for the eigenvalues are arranged in the same order as e1, e2, . . . , en.

The eigenvectors at the front of the list are called the leading eigenvectors. Note that an

eigenvector of Lrw is a vector of n real numbers. It can also be viewed as a function over the

data points. As a matter of fact, in the graph cut formulation of spectral clustering [159],

an eigenvector is a cluster indicator function for a cut. Two example eigenvectors are

shown in Figure 4.1.

51

4.2.3 The Ideal Case

In theoretical analysis of spectral clustering, reference is often made to the so-called ideal

case. Suppose the data consist of kt true clusters C1, C2, . . . , Ckt . The ideal case refers to

the situation where the similarity graph G has exactly kt connected components, with each

corresponding to a true cluster. This is the same as saying that the in-cluster similarities

are strictly positive while the between-cluster similarities are 0. Assume that the data

points are ordered based on their cluster memberships. Then the Laplacian matrix Lrwhas a block diagonal form and can be written as follows

L =

L1

L2

. . .Lkt

,

where each block Lj is the Laplacian matrix for the corresponding connected component

Cj of G. The following proposition is evident.

Proposition 2. In the ideal case, the matrix Lrw has the following two properties:

1. The eigenvalues of Lrw are the union of the eigenvalues of L1, . . . , Lkt.

2. Eigenvectors of Lrw can be obtained from eigenvectors of L1, . . . , Lkt by appropriately

padding them with zeros.

Let ej be an eigenvector of the block Lj for an eigenvalue λ. As a function over data

points, ej is defined only over the component Cj. We can extend ej to get a function

e over all the n data points by letting e be the same as ej over Cj and be 0 elsewhere.

Proposition 2 says that λ is an eigenvalue of Lrw and e is an eigenvector of Lrw. Note

that the support of e is a subset of Cj or, to put it another way, Cj contains the support

of e.

For a subset of data points C, the indicator function 1C of C is a function over all

data points that takes value 1 on the data points in C and 0 elsewhere. It can also be

viewed as an n dimensional vector and is called the indicator vector of C. The following

proposition is a corollary of Propositions 1 and 2.

Proposition 3. In the ideal case, the matrix Lrw has exactly kt eigenvalues that equal 0.

The eigenpace of eigenvalue 0 is spanned by the indicator vectors 1C1, . . . , 1Ckt.

The vectors 1C1 , . . . , 1Cktcollectively form the canonical coordinate system of the

eigenspace of eigenvalue 0. Create a n × kt matrix Uc using those vectors as column

52

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

C4

C2C1

C3

C5

(a) A data set

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

−0.

2−

0.1

0.0

0.1

0.2

C1 C2 C3 C4 C5

(b) Eigenvector e1

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

−0.

2−

0.1

0.0

0.1

0.2

C1 C2 C3 C4 C5

(c) Eigenvector e6

Figure 4.1: Example eigenvectors: There are five true clusters C1, . . . , C5 in the data setshown in (a). The 10-NN similarity measure is used to produce the ideal case. The matrixLrw is built accordingly and it consists of five blocks L1, . . . , L5, each corresponding to atrue cluster. Two eigenvectors e1 and e6 of Lrw are shown. Each eigenvector is depicted intwo diagrams. In the first diagram, the values of the eigenvector are indicated by differentcolors, with grey meaning 0. In the second diagram, the X-axis indexes the data pointsand the Y-axis shows the values of the eigenvector.

vectors. Each row zi of the matrix corresponds to a data point xi. So we have a mapping

xi → zi that maps the data points from the original space Rd to points in the eigenspace

of eigenvalue 0.

For a row zi of Uc that corresponds to an data point in component Cj, the value at

position j is 1 and the values at all other positions are 0. Therefore, all the data points

in Cj become one point when mapped onto the eigenspace of eigenvalue 0. This fact is of

fundamental importance to spectral clustering.

4.2.4 Spectral Clustering

To partition a collection of n data points {x1, . . . ,xn} into k clusters using a similarity

matrix S, spectral clustering proceeds as follows:

1. Compute the Laplacian matrix Lrw.

2. Compute the k leading eigenvectors e1, . . . , ek of Lrw.

3. Form an n× k matrix U using e1, . . . , ek as columns.

53

4. For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U .

5. Cluster the points {y1, . . . ,yn} into k clusters using the K-means algorithm.

6. Obtain k clusters of the original data points accordingly.

Suppose the similarity matrix S satisfies the conditions for the ideal case and k = kt.

The eigenvectors e1, . . . , ek obtained at Step 2 are not necessarily the same as the true-

cluster indicator vectors 1C1 , . . . , 1Ckt. However, we have,

U = UcR. (4.2)

for some orthogonal matrix R ∈ Rk×k.

The matrix U gives us a mapping from the data space to the eigenspace of eigenvalue

0: xi → yi, where yi is the i-th row of U . The mapping can be broken up into two

steps: First map xi to the i-th row zi of Uc, and then obtain yi by rotating the vector ziusing R, that is, yi = ziR. As pointed out earlier, data points in the same true cluster

are mapped into one point at the first step. Hence they are mapped into one point by

the whole mapping xi → yi. This means that, in the ideal case, spectral clustering can

recover the true clusters.

Spectral clustering is expected to also work well in non-ideal cases that are not far

away from the ideal case. A formal justification of this is provided by Ng et al. [118] in

terms of matrix perturbation theory.

4.2.5 Two Properties

To end this section, we point out two properties of the eigenvectors in the ideal case that

we exploit in our work. We know from Proposition 3 that the Laplacian matrix Lrw has

multiple eigenvalues that equal 0 in the ideal case. For simplicity, we assume that the

non-zero eigenvalues of Lrw are distinct. We call the eigenvectors for eigenvalue 0 the

primary eigenvectors and the others the secondary eigenvectors.

Proposition 4. In the ideal case, eigenvectors of Lrw has the following properties:

1. The primary eigenvectors are piecewise constant, with each value identifying one

true cluster or the union of several true clusters.

2. The support of each secondary eigenvector is contained within one of the true clus-

ters.

54

The support of a vector is defined as the set of elements that have non-zero values.

The first item follows readily from Equation (4.2) and the fact that each column of Uc is

a true-cluster indicator vector. The second item is an obvious corollary of Proposition 2.

The two properties can be illustrated using Figure 4.1. The primary eigenvector e1has two different values, 0.1 and 0. The value 0.1 identifies cluster C4, while 0 corresponds

to the union of the other clusters. The vector e6 is a secondary eigenvector. Its support

is contained in the true cluster C4.

4.3 A Naive Method for Rounding

In this section we explain how the two properties in Proposition 4 can be used for rounding.

We do so by giving two naive algorithms that are intended only for the ideal case. In

the next two sections, we develop a model-based algorithm that exploits the same two

properties and that works also in non-ideal cases.

4.3.1 Binarization of Eigenvectors

In the original graph cut formulation of spectral clustering [65, 143], an eigenvector has

two different values. It partitions the data points into two clusters. The graph cut problem

is NP-hard and is hence relaxed. In the relaxed problem, an eigenvector can have many

different values (see Figure 4.1(c)).

The first step of our method is to obtain, from each eigenvector ei, two clusters using

a confidence parameter δ that is between 0 and 1. Let eij be the value of ei at the j-th

data point. One of the clusters consists of data points j that satisfy

eij > 0 and eij > δmaxjeij,

while the other cluster consists of data points j that satisfy

eij < 0 and eij < δminjeij.

The indicator vectors of those two clusters are denoted as e+i and e−i respectively. We refer

to the process of obtaining those vectors from ei as binarization. Applying binarization to

the eigenvectors e1 and e6 of Figure 4.1 results in the binary vectors shown in Figure 4.2.

Note that e−1 is a degenerate binary vector in the sense that it is 0 everywhere. We still

refer to it as a binary vector for convenience. The following proposition follows readily

from Proposition 4.

55

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

(a) e+1

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

(b) e−1

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

(c) e+6

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

(d) e−6

Figure 4.2: Binary vectors obtained from eigenvectors e1 and e6 through binarizationwith δ = 0.1. Data points with values 1 and 0 are indicated by red and green respectively.

Proposition 5. Let eb be a vector obtained from an eigenvectors e of the Laplacian matrix

Lrw through binarization. In the ideal case, we have:

1. If e is a primary eigenvector, then the support of eb is either a true cluster or the

union of several true clusters.

2. If e is a secondary eigenvector, then the support of eb is a subset of one of the true

clusters.

4.3.2 Rounding by Overlaying Partitions

Each binary vector gives a partition of all the data points, with one cluster comprising

points with value 1 and another cluster comprising points with value 0. Suppose there

are two partitions. One binary vector divides the data into two clusters C1 and C2 and

the other into P1 and P2. Overlaying the two partitions results in a new partition that

consists of clusters C1 ∩ P1, C1 ∩ P2, C2 ∩ P1, and C2 ∩ P2. Note that there are not

necessarily exactly 4 clusters in the new partition as some of the 4 intersections might be

empty. It is straightforward to generalize the concept of overlaying to multiple partitions

with any numbers of clusters.

We use a continuous block of leading eigenvectors for rounding. There is a question of

how many eigenvectors to use. In this subsection we consider the case where the number

q of leading eigenvectors to use is given. Here is a simple method for rounding:

Naive-Rounding1(q):

1. Compute the q leading eigenvectors of the Laplacian matrix Lrw.

2. Binarize the q eigenvectors.

56

3. Obtain a partition of the data using each binary vector from the previous

step.

4. Overlay all the partitions to get the final partition.

Note that the number of clusters obtained by Naive-Rounding1 is determined by

q, but it is not necessarily the same as q in general. The following proposition is an easy

corollary of Proposition 5.

Proposition 6. In the ideal case and when q is smaller than or equal to the number of

true clusters, the clusters obtained by Naive-Rounding1 are either the true clusters or

unions of the true clusters.

4.3.3 Determining the Number of Eigenvectors to Use

Now consider the case when the number q of leading eigenvector to use is not given. We

determine q, and hence the number of clusters, by making use of Proposition 5. Suppose

Pq is the partition given by Naive-Rounding1 by using the q leading eigenvectors. The

idea is to gradually increase q and test the partition Pq for each q to see whether it satisfies

the condition of Proposition 5 (2).

Suppose K is a sufficiently large integer. Denote a subroutine that tests the partition

Pq by cTest(Pq, q,K). To perform this test, we use the binary vectors obtained from

the eigenvectors from the range [q + 1, K]. If the support of every such binary vector

is contained by some cluster in Pq, we say that Pq satisfies the containment condition

and cTest returns true. Otherwise, Pq violates the condition and cTest returns false.

Proposition 5 states that the containment condition must be satisfied if Pq is the true

partition. If it is violated, Pq cannot be the true partition.

Let kt be the number of true clusters. When q = kt, cTest(Pq, q,K) passes because

of Proposition 5. When q = kt + 1, Pq is likely to be finer than the true partition.

Consequently, cTest(Pq, q,K) may fail. The probability of this happening increases with

the number of binary vectors used in the test. To make the probability high, we pick

K such that K/2 is safely larger than kt and let q run from 2 to bK/2c. When q < kt,

cTest(Pq, q,K) usually passes.

The above discussions suggest that if the test cTest(Pq, q,K) returns true for some

q = k and returns false for q = k + 1 for the first time, we can use k as an estimate of kt.

Consequently, we can use k leading eigenvectors for clustering and return Pk as the final

clustering result. This leads to the following algorithm.

Naive-Rounding2(K):

57

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

(a) 7 clusters

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●

●●●

●●

●

●

●

●

●●●●●●●●●●●●●●●●●●●●●

●●●●●

●●●

●

●●

●

●●●●

●●●●

●●●●●●●●●●●

●●●●

●

●●

●

●

●

●

●

●●●

●

●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

−0.

2−

0.1

0.0

0.1

0.2

C1 C2 C3 C4 C5

(b) Eigenvector e16

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●

●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●●●●

●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●

●●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●

●

●●●●●●

●●●●●●●●●●

●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●

●●●●●

●●●●

●●●●

●●●●●●

(c) Binary vectors e+16 and e−16

Figure 4.3: Illustration of Naive-Rounding2: (a) shows the partition obtained usingfirst 6 pairs of binary vectors. There are 7 clusters, each indicated by a different color.(b) and (c) show eigenvector e16 and the vectors obtained from it via binarization. Thesupport of e+16 (the red region) is not contained in any of the clusters in (a). The same istrue for the support e−16.

1. For q = 2 to bK/2c,

(a) Pcurrent ← Naive-Rounding1(q).

(b) Pnext ← Naive-Rounding1(q + 1).

(c) If cTest(Pcurrent, q,K) = true and cTest(Pnext, q + 1, K) = false,

return Pcurrent.

2. Return Pcurrent.

Suppose an integer K is given such that K/2 is safely larger than the number of true

clusters. The algorithm Naive-Rounding2 automatically decides how many leading

eigenvectors to use in rounding and automatically determines the number of clusters.

Consider running it on the dataset shown in Figure 4.1(a) with K = 40. It loops q

through 2 to 4 without terminating because both the two tests at Step 1(c) return true.

When q = 5, Pcurrent is the true partition and Pnext is the one as shown in Figure 4.3(a).

At Step 1(c), cTest(Pcurrent, q,K) returns true. However, cTest(Pnext, q + 1, K) returns

false, because the support of the binary vector e+16 or e−16 is not a subset of any cluster

in Pnext (Figure 4.3). Consequently, Naive-Rounding2 terminates at Step 1(c) and

returns Pcurrent (the true partition) as the final result.

58

Y

e+1 e−1 · · · e+q e−q

Figure 4.4: Latent class model for rounding: The binary vectors e+1 , e−1 , . . . , e+q , e−q are

regarded as random variables. The discrete latent variable Y represents the partition tofind.

It is possible for Naive-Rounding2 to use more than kt eigenvectors. Because we

are talking about the ideal case, the containment condition is satisfied for Pcurrent at

q = kt. However, the condition might also be satisfied for Pnext at q = kt. In that

case, over-shooting occurs. In our empirical evaluation with the three data sets shown in

Figure 4.6, Naive-Rounding2 did pick the correct numbers of leading eigenvectors, and

it determined the number of clusters and the members of the clusters correctly.

4.4 Latent Class Models for Rounding

Naive-Rounding2 is fragile. It can break down as soon as we move away from the ideal

case. In this and the next sections, we describe a model-based method that also exploits

Proposition 5 and is more robust.

In the model-based method, the binary vectors are regarded as features and the prob-

lem is to cluster the data points based on those features. So what we face is a problem of

clustering discrete data. As in the previous section, there is a question of how many lead-

ing eigenvectors to use. In this section, we assume the number q of leading eigenvectors

to use is given. In the next section, we discuss how to determine q.

The problem we address in this section is how to cluster the data points based on the

first q pairs of binary vectors e+1 , e−1 , . . . , e+q , e

−q . We solve the problem using latent

class models. LCMs are commonly used to cluster discrete data, just as Gaussian mixture

models are used to cluster continuous data. Technically, they are the same as the naïve

Bayes models except that the class variable is not observed.

The LCM for our problem is shown in Figure 4.4. So far we have been using the

notations e+s and e−s to denote vectors of n components or functions over the data points.

In this and the next sections, we overload the notations to denote random variables that

take different values at different data points. We do not use bold letters for this case. The

latent variable Y represents the partition to find and each state of Y represents a cluster.

So the number of states of Y , often called the cardinality of Y , is the number of clusters.

To learn an LCM from data means to determine:

59

1. The cardinality of Y , that is, the number of clusters; and

2. The probability distributions P (Y ), P (e+s |Y ) and P (e−s |Y ), that is, the characteri-

zation of the clusters.

After an LCM is learned, one can compute the posterior distribution of Y for each

data point. This gives a soft partition of the data. To get a hard partition, one can assign

each data point to the state of Y that has the maximum posterior probability. This is

called hard assignment.

4.4.1 Known Number of Clusters

There are two cases with the LCM learning problem, depending on whether the number

of clusters is known. When the number of clusters is known, we only need to determine

the probability distributions P (Y ), P (e+s |Y ), and P (e−s |Y ). This is done using the EM

algorithm [40].

Before dealing with the case when the number of clusters is not known, we spend

some time to explain how the LCM method is related to Naive-Rouding1. Here is our

strategy:

1. We set the probability parameter values in such a way that the LCM gives the same

partition as Naive-Rounding1.

2. We show that those parameter values maximize the likelihood of the model.

It is well known that the EM algorithm aims at finding the maximum likelihood estimate

of the parameters. So we can conclude that the LCM method actually tries to find the

same partition as Naive-Rounding1.

Suppose Naive-Rounding1 produces k clusters C1, . . . , Ck. For each r ∈ {1, . . . , k},let nr be the number of data points in Cr. It is clear that n = n1 + · · ·+nk. The partition

{C1, . . . , Ck} is obtained by overlaying the partitions given by the first q pairs of binary

vectors e+1 , e−1 , . . . , e+q , e−q . In the LCM model, e+1 , e

−1 , . . . , e+q , e−q are viewed as feature

variables. Use ebs to denote a general feature variable. Then we have that for each r, the

feature variable ebs

• Either takes value 1 at all the points in Cr,

• Or takes value 0 at all the points in Cr.

60

Moreover, all the points from one particular cluster have the same feature values.

Let the latent variable Y have k states {1, . . . , k}. For each r ∈ {1, . . . , k} and each

feature variable ebs, set

P (Y = r) =nrn, (4.3)

P (ebs = 1|Y = r) =

{1 if ebs is 1 over Cr0 otherwise (4.4)

Under those parameter values, Y = r means the cluster Cr.

Use m to denote the LCM model, and θ to denote the collection of parameter values

given by Equations (4.3) and (4.4). Consider the conditional probability P (xi|Y = r,m,θ)

and the marginal probability P (xi|m,θ) of a general data point xi. Because Y = r means

the cluster Cr, we have

P (xi|Y = r,m,θ) =

{1 if xi ∈ Cr0 otherwise (4.5)

P (xi|m,θ) =nrn

if xi ∈ Cr (4.6)

Now consider the posterior distribution P (Y |xi,m,θ) of Y . It follows from Equation

(4.5) that

P (Y = r|xi,m,θ) =

{1 if xi ∈ Cr0 otherwise

This means that, by hard assignment, the LCM gives us exactly the partition {C1, . . . , Ck}— the partition found by Naive-Rounding1.

We next show that the parameter values θ maximize the likelihood of the model. For

each r, enumerate the data points in Cr as xr1, . . . ,xrnr . We know from Equation (4.6)

that P (xr1|m,θ) = . . . = P (xrnr |m,θ) = nr

n. Use D to denote the entire data set. So we

have

logP (D|m,θ) =k∑r=1

nr∑t=1

logP (xrt|m,θ)

=k∑r=1

nr lognrn

= n

k∑r=1

nrn

lognrn

Now consider another set θ′ of parameter values. Because all the points from one par-

ticular cluster have the same feature value, we have P (xr1|m,θ′) = . . . = P (xrnr |m,θ′).

61

Let that value be p′r. Then we have

logP (D|m,θ′) = nk∑r=1

nrn

log p′r

≤ n∑r

nrn

lognrn

= logP (D|m,θ),

where Gibbs’ inequality is used in the second step. So the parameter values θ do maximize

the likelihood.

To summarize the arguments, the LCMmethod actually tries to find the same partition

as Naive-Rounding1 when the number of clusters is set properly.

4.4.2 Unknown Number of Clusters

Now consider the case when the number of clusters is not known. We follow the standard

practice in the literature and determine it using a search procedure guided by the BIC

score [140]. Specifically, we start by setting the number k of clusters to 1 and increase it

gradually. For each k, we estimate the probability parameters using the EM algorithm

and compute the BIC score of the model given by Equation (3.1). The BIC score would

initially increase with k. We stop the process as soon as it starts to decrease, and use the

k with the maximum BIC score as the estimate of the number of clusters.

We point out earlier that the LCM method tries to find the same partition as Naive-

Rounding1 when the number of clusters is the same in both cases. However, the number

of clusters determined using the BIC score might not equal the number of clusters found

by Naive-Rounding1. When that happens, the LCM method produces a different

partition.

4.5 Latent Tree Models for Rounding

In this section we present a method for determining the number q of leading eigenvectors

to use in the model-based approach. The idea is to extend the LCM method of the

previous section using the strategy of Naive-Rounding2.

Consider an integer q between 2 and K/2. We first build an LCM using the first q

pairs of binary vectors and obtain a hard partition of the data using the LCM. Suppose

k clusters C1, . . . , Ck are obtained. Each cluster Cr corresponds to a state r of the latent

variable Y .

62

Y

e+1 e−1 · · · e+q e−q Y1

ebs1 · · · ebs2

· · · Yk

ebs3 · · · ebs4

Figure 4.5: Latent tree model for rounding.

We extend the LCM model to obtain the model shown in Figure 4.5. We do this in

three steps. First, we introduce k new latent variables Y1, . . . , Yk into the model. They

are all connected to Y . Each Yr is a binary variable and its conditional distribution is set

as follows:

P (Yr = 1|Y = r′) =

{1 if r′ = r,

0 otherwise.(4.7)

So the state Yr = 1 means the cluster Cr and the state Yr = 0 means the union of the

other clusters.

Next, we add binary vectors from the range [q + 1, K] to the model by connecting

them to the new latent variables. For convenience we call those vectors the secondary

binary vectors. This is not to be confused with the secondary eigenvectors mentioned

in Proposition 4. For each secondary binary vector ebs, let Ds be its support. When

determining to which Yr to connect ebs, we consider how well the cluster Cr covers Ds.

We connect ebs to the Yr such that Cr covers Ds the best, in the sense that the quantity

|Ds ∩ Cr| is maximized, where |.| stands for the number of data points in a set. Ties are

broken arbitrarily.

Finally, we set the conditional distribution P (ebs|Yr) as follows:

P (ebs = 1|Yr = 1) =|Ds ∩ Cr||Cr|

(4.8)

P (ebs = 1|Yr = 0) =|Ds − Cr|n− |Cr|

(4.9)

What we get is an LTM. The LCM part of the model is called its primary part,

while the newly added part is called the secondary part. The parameter values for the

primary part is determined during LCM learning, while those for the secondary part are

set manually by Equations (4.7), (4.8), and (4.9).

To determine the number q of leading eigenvectors to use, we examine all integers in

the range [2, K/2]. For each such integer q, we build an LTM as described above and

63

compute its BIC score. We pick the q with the maximum BIC score as the answer to our

question. After q is determined, we use the primary part of the LTM to partition the

data. In other words, the secondary part is used only to determine q. It is not used when

we determine the partition. We call this method the LTM method for rounding.

Here are the intuitions behind the LTM method. If the support Ds of ebs is contained

in cluster Cr, it fits the situation to connect ebs to Yr. The model construction no longer

fits the data well if Ds is not contained in any of the clusters. The worst case is when two

different clusters Cr and Cr′ cover Ds equally well and better than other clusters. In this

case, ebs can be either connected to Yr or to Yr′ . Different choices here lead to different

models. As such, neither choice is ‘ideal’. Even when there is only one cluster Cr that

covers Ds the best, connecting ebs to Yr is still intuitively not ‘perfect’ as long as Cr does

not cover Ds completely.

So when the support of every secondary binary vector is contained by one of the

clusters C1, . . . , Ck, the LTM that we build would fit the data well. However, when

the supports of some secondary binary vectors are not completely covered by any of the

clusters, the LTM would not fit the data well.

Now consider the ideal case. According to Proposition 5 (2), the fit would be good if

q is the the number kt of true clusters, or equivalently the number of eigenvectors of the

Laplacian matrix for eigenvalue 0. The fit would not be as good otherwise. This is why

the likelihood of the LTM, hence its BIC score, contains information that can be used to

choose q.

To end this section, we summarize the LTM method for rounding:

LTM-Rounding:

• Inputs:

1. A data set D = {x1, . . . , xn} with similarity matrix S.

2. Parameters: δ, K.

• Algorithm:

1. Form the Laplacian matrix Lrw using Equation (4.1).

2. Compute the first K eigenvectors of Lrw.

3. Using δ as the threshold, binarize the eigenvectors as explained in

Section 4.3. (This results in K pairs of binary vectors, some of which

might be degenerate.)

4. S∗ ← −∞.

64

5. For q = 2 to bK/2c,

(a) mlcm ← the LCM learnt using the first q pairs of binary vectors

as shown in Section 4.4.

(b) P ← hard partition obtained using mlcm.

(c) mltm ← the LTM extended frommlcm as explained in Section 4.5.

(d) S ← the BIC score of mltm.

(e) If S > S∗, then P ∗ ← P and S∗ ← S.

6. Return P ∗.

An implementation of LTM-Rounding can be obtained from http://www.cse.ust.hk/

~lzhang/ltm/index.htm.

In general, we suggest to set the binarization threshold δ at 0.1. The parameter

K should be such that K/2 is safely larger than the number of true clusters. Note that

LTM-Rounding does not allow q to be larger than K/2. This is so that there is sufficient

information in the secondary part of the LTM to determine the appropriateness of using

the first q eigenvectors of the Laplacian matrix for rounding.

4.6 Empirical Evaluation on Synthetic Data

Our empirical investigations are designed to: (1) show that LTM-Rounding works per-

fectly in the ideal case and its performance degrades gracefully as we move away from the

ideal case, and (2) compare LTM-Rounding with alternative methods. Synthetic data

are used for the first purpose, and the results are discussed in this section. Both synthetic

and real-world data are used for the second purpose, and the results are discussed in the

next section.

LTM-Rounding has two parameters δ and K. We set δ = 0.1 and K = 40 in all our

experiments except in sensitivity analysis (Section 4.6.4). For each set of synthetic data,

10 repetitions were run.

4.6.1 Performance in the Ideal Case

Three data sets were used for the ideal case (Fig 4.6). They vary in the number and the

shape of clusters. Intuitively the first data set is the easiest, while the third one is the

hardest. To produce the ideal case, we used the 10-NN similarity measure for the first two

data sets. For the third data set, the 10-NN similarity measure gave a similarity graph

with one single connected component. So we used the 3-NN similarity measure instead.

65

http://www.cse.ust.hk/~lzhang/ltm/index.htm

http://www.cse.ust.hk/~lzhang/ltm/index.htm

●●●●

●●

●

●

●

●

●

●

●

●● ●

●

●

●

●

●●

●

●●●

●

●

●

●

●

●

●● ●

●

●

●

●

●●

●

●

●●

●

●●●

● ●●

●

●

●

●

●●

●●

●●

●●

●

●

●●

●

●

●

●

●●

●

●●

●

●●

●

● ●

●

●

●

●

●

●

●

●

●

●●

●●●

●

●

●

6 clusters(a) 6 clusters

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

5 clusters(b) 5 clusters

●● ●

●●●●●●●

●●●●

●●

●●●

●●●

●

●●●●●

●

●●

●●

●●

●●

●●●●

●●

●●●

●●●●

●●

●●

●●

●●●●●●

● ●●●

● ●

● ●●

●●●●●●●

●●●●

●

●●

●●●● ● ●

●

● ●●

●●

●

●●

●●●●

●

●●

●

●●

●●

●●

●

●●●●

●●●●

●●●

●●●

●●●

●●●●

●●

●

●●

●●

●

●●

●●

●

●

2 clusters(c) 2 clusters

Figure 4.6: Synthetic data set for the ideal case: The 10-NN similarity measure is usedfor the first two data sets, while 3-NN similarity measure for the third data set. Eachcolor and shape of the data points identifies a cluster. Clusters were recovered by LTM-Rounding correctly.

LTM-Rounding produced the same results on all 10 runs. The results are shown

in Fig 4.6 using colors and shapes of data points. Each color identifies a cluster. LTM-

Rounding correctly determined the numbers of clusters and recovered the true clusters

perfectly.

4.6.2 Graceful Degrading of Performance

To demonstrate that the performance of LTM-Rounding degrades gracefully as we move

away from the ideal case, we generated 8 new data sets by adding different levels of noise

to the second data set in Fig 4.6. The Gaussian similarity measure was adopted with

σ = 0.2. So the similarity graphs for all the data sets are complete.

We evaluated the quality of an obtained clustering by comparing it with the true

clustering using Rand index (RI) [132] and variation of information (VI) [114]. Note

that higher RI values indicate better performance, while the opposite is true for VI. The

performance statistics of LTM-Rounding are shown in Tables 4.1. We see that RI is 1 for

the first three data sets. This means that the true clusters have been perfectly recovered.

The index starts drop from the 4th data set onwards in general. It falls gracefully with

the increase in the level of noise in data. Similar trend can also be observed on VI.

The partitions produced by LTM-Rounding at the best run (in terms of BIC score)

are shown in Fig 4.7(a) using colors and shapes of the data points. We see that on the

first four data sets, LTM-Rounding correctly determined the number of clusters and the

members of each cluster. On the next three data sets, it also correctly recovered the true

clusters except that the top and the bottom crescent clusters might have been broken into

two or three parts. Given the gaps between the different parts, the results are probably

the best one can hope for. The result on the last data set is less ideal but still reasonable.

66

●●●● ●●●●●●

●●●●●

●●●●●●●

●●●●●●●●●

●●●

●●

●●●

●●●●

●●●●●●

●●●●

●●●

●●●

●●●●●●

●●●●●

●●●●●●●●

●●●●

●●

●●●●●●

●●●●

●●●●●

●

5 clusters(1) 5 clusters,RI=1.00

●●●

●●

● ●●●●

●●

●●●

●● ●●●●● ●

●●

●●●

● ●

●●●●

●

●

●●●

●●● ●●●●●●

●

●●●

●●

●●

●●

●

●●●

●

●●●

●●●

●●●

●●

●●●

●● ●●

●

●

●

●●●

●

●●

●

●●●●

●

●●●

●


●●

●

● ●●

●●

●

●●

●●

●●●

●

●

●

●● ●●●●

●●

●

●●

●

●

●

●

●●●

●

●

●●●

●●●

● ●●●●●

●

●

●

●●●

●

●●●●●

●

●

●●

●●●

●● ●

●●

●●

●●●

●●● ●

●

●

●●

●

●

● ●

●

●

●

●

●

●

●●


●

●

●

●

●

●

●●

●●

●●●

●

●●●

●

●

●

●●

● ●●●

●

●

●

●

●●●

●

●

● ●●●

●●

●

●●● ●

●●

●●

●●

● ●

●●

●●

●

●

●

●●●●

●

●●

●●●

●

● ●

●●

●

●● ●

●

●●

●

●●●

●

● ●●

●

● ●●

●●●●

●


●

●●● ●●

●

●●●

●●

●●

●

●●

●

●●

●

●

●

●

●

●

●

●●●

●●

● ●●

●

●

●●●

●

●●●●

●

●

●

●

●●

●

●●

●●


●●

●●

●

●

●●

●

●●

●

●

●●●

●

●

●

●

●●

●

●

●

●

●

●●

●

●●● ●

●

●

●

●

●

●

●●

●

● ●● ●


●

●

●

●

●

●

●

●

●

●

●●

● ●

●●

●

●

●

●

●●


●

●

●

●●

●

●

●

●

●●● ●

●

●

●

●●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

● ●

●

●●

●●

●

●

●

●

● ●

●

●● ●

●

●●●


(a) Partitions produced by LTM-Rounding

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●● ●● ●●●●●●●

●●

●●●

●●●●●●●●

●●●●●●●●●●●●

●●●

●●●●●●●●●●●●


● ●●●●●●●●

●●●●●●●

●●●●●

●●●●●●●●

●●●

●

●●●●●●●●●●●●●●●●●● ●●●●●●●

●●●●●●●●●●

●●●

●

●●●●

●

●●●●●●

●●●

●●●●●●● ●

●●●● ●●


●● ●●●●

●●●●●●●

●●●●●●

●●●

●●●●

●●

●●●

●●●●

●●

●

●

●●●●●●

●●

●● ● ●● ●● ●●●●●●●●●

●●●●●●

●

●

●●●●●●

●

●●

●●●●●●●●●●●●● ●●●●● ●●


●●●●●●●

●●

●●

●●●

●●

●●

●●●

●

●●

●

●

●

●

●

●●

●

●●●●

●

●

●●●

●●● ●●●●●● ●● ●

●●● ●●●

●

●

●●

●●

●

●

●●

●

●●

●●●

●

●

●●●

●

●●

●●●

●●

●

●● ●●●

●●

● ●●●


●●●

● ● ●●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●●●

●

●●●

●●●●

●

●●

●●●● ●●●● ●●●

● ●●●●●●●

●● ●

●

●●●●

●●●

●●●

●

●●●●●

●●

●●

●●

●

●●

●●● ●

●● ●●

●● ●

●●

●

●●

●●

●●●●

●

●

●●● ●

●

●

●●●●●

●

● ●●●

●●

●●●

●●

●

●●

●

●

●●

●

●●

● ●●

●

●

●

●●

●●

●

●

●

●●●●

●●

● ●

● ●●

●

●

●●

●

●

●

●

●

●●●

●●●

●

● ●

●

●

●

●●●●●

●

● ●●

●

●

● ●●

●

●●

●

●●

●●●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●●

● ●● ●

●●

●

●

●●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●●

●●●

●● ●● ●●

●●

●

●

●●

●●●●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●●

●

●

●●

● ●●

●

●●

●● ●

●●●

●●

●

●

●

●

● ●

●●●●

●

●

●

●

●●●

●

●●

●● ●●●

●

●●●

●●

●●

●

●●

●

●●

●

●

●

●

●

●

●

●●●

●●

● ●●

●

●

●●●

●

●●●●

●

●

●

●

●●

●

●●

●●


●● ●●●

●●●●●●

●●●●●

●

●●●

●●

●●

●

●

●

●

●

●

●●●

●

●

●

●●●●●

●

●

●●● ●●

●●●●● ● ●●●

●●

●

●

●

●

●●●

●●●

●

●

●●

●

●

●●

●

●●●

●●●

●

●●

●

●●

●●●●

● ●●●●●

●● ●

●●●

●

●●

●

●●

●●

●●

●

●

●

●●●

●

●●

●

●● ●●●

●

● ●

●

●

●

●

●

●

●●

●●

●

●●●

●

●

●●●●

●

●●

●●●●

●●

●

●

●

●

●

●●●

●

●●

●●

●● ●●

●

●

●●

●

●●

●●●●

●● ●●

● ●●

●●

●

●●

●

●

●

●

●●

●●

●●

●●

●● ●

●● ●●●

●●●

●●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●

●

●●

●●

●●

●●●●

●

●●●

●

●●●

●

●

●●

●

●●●

●●

●●

●●

●●

●

●●●

● ●●

●●●

●

●●●

●●

●●●

●●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●●

●●

●

●●●

●

●

●●

●

●

●

●

●●

●●

●

●

●

●

●●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●● ●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●●

●

●

●●

●●

●●

●

●

●●

●

●●

●

●●

●

●

●


●●●

●● ●●●●●●

●●

●●●

●●

●

●●

●●●●

●

●●

●●●

●

●

●

●●

●

●●●●●● ●●● ●● ●●● ●● ●

●●●

●

●●●

●●

●

●●●

●

●

●●●●●●●●●

●

●

●

●

●●

●●●

●●

●●

●●●●

●● ●●

●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●●●●●

●

●●

●

●●

●●●

●●

●

●●

●●

●●

●

●

●●

●●

●

●

●

●●●●

●

●●

●

●●

●

●

●

●●

●

●

●

●●

●●

●

●●

●●

●

●

●

●

●

●

●

●●

●●

●

●


● ●●●

●●●

●●

●

●

●

●●●●

●

●●

●

●

●

●●

●

●●●●●●●

●

●●

●●

●

●

●●●●●●●

●●●●

●● ●●

●

●●

●●●●

●

●

●

●

●

●●

●●

●

●

●●●●

●●

●

●●

●

●●

●

●●●●

●●●

●●●●●

●

●

●


(b) Partitions produced by ROT-Rounding

Figure 4.7: The 8 synthetic data sets for the non-ideal case: These data sets were generatedby adding various levels of noise to the second data set in Figure 4.6. The Gaussiansimilarity measure with σ = 0.2 is used in all the data sets. The color and shape of thedata points and the picture labels show the partitions produced by (a) LTM-Roundingat the best run (in terms of BIC score of LTM); and (b) ROT-Rounding. RI meansRand index computed against the true partition.

67

Data 1 2 3 4 5 6 7 8LTM 1.0±.00 1.0±.00 1.0±.00 .99±.01 .97±.02 .98±.01 .94±.01 .88±.01ROT .92±.00 .92±.00 1.0±.00 .98±.00 .52±.00 .52±.00 .88±.00 .90±.00

K-means 1.0±.00 1.0±.00 1.0±.00 1.0±.00 .85±.00 .72±.00 .71±.00 .75±.00GMM 1.0±.00 1.0±.00 1.0±.00 1.0±.00 .94±.00 .88±.00 .91±.00 .88±.00

(a) Rand index

Data 1 2 3 4 5 6 7 8LTM .00±.00 .00±.00 .00±.00 .06±.09 .29±.19 .28±.14 .79±.12 1.64±.10ROT .40±.00 .40±.00 .00±.00 .20±.00 1.60±.00 1.60±.00 1.85±.00 1.42±.00

K-means .00±.00 .00±.00 .00±.00 .00±.00 1.04±.00 1.41±.00 1.52±.00 1.97±.00GMM .00±.00 .00±.00 .00±.00 .00±.00 .85±.00 .91±.00 1.25±.00 1.74±.00

(b) Variation of information

Table 4.1: Performances of various methods on the 8 synthetic data sets in terms of(a) Rand index (RI) and (b) variation of information (VI). Higher values of RI or lowervalues of VI indicate better performance. K-means and GMM require extra informationfor rounding and should not be compared with LTM-Rounding and ROT-Roundingdirectly.

The above discussions indicate that the performance of LTM-Rounding degrades

gracefully as we move away from the ideal case.

4.6.3 Impact of an Assumption

Suppose it is determined that rounding is to be based on the first q eigenvectors of the

Laplacian matrix. Let mq be the number of clusters to be obtained based on those

eigenvectors. Previous work usually assumes that mq = q. We have argued against this

assumption. Now we empirically study its impact.

To carry out the study, we create a variant of LTM-Rounding by enforcing the

assumption. The change is at Step 5(a) and it concerns the number of states for the

latent variable in the LCM. As it stands, the algorithm determines the number using the

BIC score. In this study, we manually set it to q. No other changes are made. We refer

to the modified algorithm as LTM-Rounding1.

We tested LTM-Rounding and LTM-Rounding1 on the 8 synthetic data sets. The

performance statistics are shown in Table 4.2. We see that RI for LTM-Rounding1

are consistently lower than those for LTM-Rounding, and VI for LTM-Rounding1

are consistently higher. One exception occurs on the 4th data set, where RI and VI for

LTM-Rounding1 are not significantly better. The other exception occurs on the 8th

data set, where RI are the same for LTM-Rounding and LTM-Rounding1, but VI

for LTM-Rounding1 are significantly higher. Figure 4.8 shows the partitions obtained

68

Data 1 2 3 4 5 6 7 8LTM 1.0±.00 1.0±.00 1.0±.00 .99±.01 .97±.02 .98±.01 .94±.01 .88±.01LTM1 1.0±.00 1.0±.00 1.0±.00 1.0±.00 .95±.00 .95±.02 .91±.01 .88±.00

(a) Rand index

Data 1 2 3 4 5 6 7 8LTM .00±.00 .00±.00 .00±.00 .06±.09 .29±.19 .28±.14 .79±.12 1.64±.10LTM1 .00±.00 .00±.00 .00±.00 .00±.00 .69±.00 .66±.23 1.24±.10 1.84±.06

(b) Variation of information

Table 4.2: Comparison of LTM-Rounding and LTM-Rounding1 in terms of (a) Randindex and (b) variation of information.

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●● ●● ●●●●●●●

●●

●●●

●●●●●●●●

●●●●●●●●●●●●

●●●

●●●●●●●●●●●●


● ●●●●●●●●

●●●●●●●

●●●●●

●●●●●●●●

●●●

●

●●●●●●●●●●●●●●●●●● ●●●●●●●

●●●●●●●●●●

●●●

●

●●●●

●

●●●●●●

●●●

●●●●●●● ●

●●●● ●●


●●

●●●●

●●●

●

●●

●●●●●

●●●● ●●

●●

●

●●●

●●

●●

●●

●●●

●

●

●●

●

● ●●

●●●●

●

●●

●

●

●

●

●

●●

●●●●

●●●●●

●●●●● ●●

●●●

●●●● ●

●

●

●●●

●

●●●●

●●●● ●●


●●

●●●

●●

●●●

●●

●●

●

●●●

●

●●● ●

●●●

●●●● ●●●

●●●

●●

●●

●●

●

●

●

●

●●

●●

●●

●●

●●●

●●

●●

●

●

●

●●

●●

●

●●●

●●

●

● ● ●

●

●

●●●●●●

●●●●

●●

●●●●●●●●


●● ●

●●

●

●●

●●

●●●●

●

●

●●● ●●

●●●●●

●

● ●●●

●●

●●●

●●

●

●●

●

●

●●

●

●●

● ●●

●

●

●

●●

●●

●

●

●

●●●●

●●

● ●

● ●●

●

●

●●

●

●

●

●

●

●●●

●●●

●

● ●

●

●

●

●●●●●

●

● ●

●

●

●

●

●●

●

● ●

●

●●●

●

● ●● ●

●●

●●

●●●

●

●●

●

●

●

●●

●●

●●

●●

●●

●

●

●

●●●●

●

●

●

●

●●

●


●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●●

●●

●

●●●

●

●

●●

●

●

●

●

●●

●●

●

●

●

●

●●

●

●

●

●


●

● ●

●● ●●

●

● ●●●

●●

●

●

●●

●

●

●●

●

●●●

●

●●

●

●

●

●

●●

●●

● ●

● ●

●

●

●

●

●

●●

●

●●●

● ●

●

●

● ●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●●

●●

●

●

●

●●●

●

●

●

●

●

●

●●

●

●

●●

●●

●

●

●

●●

●●●●

●

●

●

●

●●

●

●


●

● ●

●

● ●

●●●●

●● ●●● ●

● ●●

●

●

●

●

●●

●

●●

●

●

●

●

●●

●

●

●

●●

● ●●

●●●●

●●

●

●

●●

●

●

●●●●

●

●

●●●

●

●

●

●

●●

●

●●

●

●●●

●●

●●

● ●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

● ●

●

●● ●

●

●●●


Figure 4.8: Partitions obtained by LTM-Rounding1 at the best run.

by LTM-Rounding1 at the best run. We see that LTM-Rounding1 estimated higher

numbers of clusters than LTM-Rounding on the last four data sets. Overall, it is fair

to say that LTM-Rounding1 is inferior to LTM-Rounding. So it is not a good idea to

enforce the constraint mq = q.

4.6.4 Sensitivity Study

LTM-Rounding has two parameters δ and K. How sensitive is the performance of the

algorithm to the choice of parameter values? To answer this question, we conducted

experiments on the 2nd, 5th, and 8th data sets in Fig 4.7. Those data sets contain

different levels of noise and hence are at different distances from the ideal case.

69

●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

(2)

●

●● ● ● ● ● ● ● ● ●

●

●

● ● ●

●

●

●

●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

(5)

● ● ● ● ● ● ●

● ● ●

● ● ●

●

●

●

●

●

●

●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

(8)(a) K = 40 with varying δ (X-axis)

●

● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ●

20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

(2)

●

●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

(5)

●

● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

20 40 60 80 100

0.0

0.2

0.4

0.6

0.8

1.0

(8)(b) δ = 0.1 with varying K (X-axis)

Figure 4.9: Sensitivity analysis on parameters δ and K in LTM-Rounding. It wasconducted on the 2nd, 5th, and 8th synthetic data sets. The Y-axis is Rand index andthe bars indicate the standard deviations. In general, we recommend δ = 0.1 and K = 40(shown as blue points).

To determine the sensitivity of LTM-Rounding with respect to δ, we fix K = 40

and let δ vary between 0.01 and 0.95. The RI statistics are shown in Fig 4.9(a). We

see that on data set (2) the performance of LTM-Rounding is insensitive to δ except

when δ = 0.01. On data set (5), the performance of LTM-Rounding is more or less

robust when 0.1 ≤ δ ≤ 0.3. Its performance gets particularly worse when δ ≥ 0.8. Similar

behavior can be observed on data set (8). Those results suggest that the performance

of LTM-Rounding is robust with respect to δ in situations close to the ideal case and

it becomes sensitive in situations that are far away from the ideal case. In general, we

recommend δ = 0.1.

To determine the sensitivity of LTM-Rounding with respect to K, we fix δ = 0.1

and let K vary between 5 and 100. The RI statistics are shown in Fig 4.9(b). It is clear

that the performance of LTM-Rounding is robust with respect to K as long as it is not

too small.

4.6.5 Running Time

LTM-Rounding deals with tree-structured models. It is deterministic everywhere except

70

at Step 5(a), where the EM algorithm is called to estimate the parameters of LCM. EM is

an iterative algorithm and is computationally expensive in general. However, it is efficient

on LCMs. So the running time of LTM-Rounding is relatively short. To process one of

the data sets in Figure 4.7, it took around 10 minutes on a laptop computer.

4.7 Comparison with Alternative Methods

To compare with LTM-Rounding, we included the method by Zelnik-Manor and Perona

[169] in our experiments.1 The latter method determines the appropriateness of using

the q leading eigenvectors by checking how well they can be aligned with the canonical

coordinates through rotation. So we name it ROT-Rounding. Both LTM-Rounding

and ROT-Rounding can determine the number q of eigenvectors and the number k of

clusters automatically. Therefore, they are directly comparable. The method by Xiang

and Gong [166] is also related to our work. However, its implementation is not available

to us and hence it is excluded from comparison.

K-means and GMM2 can also be used for rounding. However, both methods cannot

determine q, and K-means cannot even determine k. We therefore gave the number ktof true clusters for K-means and used q = kt for both methods. Since the two methods

required additional information, they are included in our experiments only for reference.

They should not be compared with LTM-Rounding directly.

ROT-Rounding and GMM require the maximum allowable number of clusters. We

set that number to 20.

4.7.1 Synthetic Data

Table 4.1 shows the performance statistics of the three other methods on the 8 synthetic

data sets. LTM-Rounding performed better than ROT-Rounding except for the last

data set. The differences are particularly substantial on the 5th, 6th, and 7th data sets.

The partitions produced by ROT-Rounding at one run are shown in Fig 4.7(b).3

We see that on the first two data sets, ROT-Rounding underestimated the number of

clusters and merged the two small crescent clusters. This is serious under-performance

given the easiness of those data sets. On the 3rd data set, it recovered the true clusters

1The code can be obtained from http://webee.technion.ac.il/~lihi/Demos/SelfTuningClustering.html.2MCLUST [52] was used with diagonal covariance matrices and default prior control as the implemen-tation of GMM.

3Same results were obtained for all 10 runs.

71

http://webee.technion.ac.il/~lihi/Demos/SelfTuningClustering.html

Method #clusters RI VILTM 9.3±.82 .91±.00 2.17±.07ROT 19.0±.00 .92±.00 2.40±.00

K-means (10) .90±.00 1.83±.00GMM 16.0±.00 .91±.00 2.14±.00

sd-CRP 9.3±.96 .89±.00 2.72±.08

Table 4.3: Comparison of various rounding methods on MNIST digits data. The bottomthree methods required extra information for rounding. They should not be compareddirectly with LTM-Rounding and ROT-Rounding.

correctly. On the 4th data set, it broke the top crescent cluster into two. On the 5th data

set, it recovered the bottom cluster correctly but merged all the other clusters incorrectly.

This leads to a much lower RI. The story on the 6th data set is similar. On the 7th

data set, it produced many more clusters than the number of true clusters. Therefore,

its RI is also lower. On the last data set, the clustering obtained by ROT-Rounding is

not visually better than the one obtained by LTM-Rounding, even though RI suggests

otherwise. Given all the evidence presented, we conclude that the performance of LTM-

Rounding is significantly better than that of ROT-Rounding.

Like LTM-Rounding, both K-means and GMM recovered clusterings correctly on

the first three data sets (Table 4.1). They performed slightly better than LTM-Rounding

on the 4th data set. However, they were considerably worse on the next three data sets,

and were not better on the last one. This happened even though K-means and GMM

were given additional information for rounding. This shows the superior performance of

LTM-Rounding for rounding.

4.7.2 MNIST Digits Data

In the next two experiments, we used real-world data to compare the rounding methods.

The MNIST digits data were used in this subsection. The data consist of 1000 samples

of handwritten digits from 0 to 9. They were preprocessed by the deep belief network as

described in [71] using their accompanying code.4 Table 4.3 shows the results averaged

over 10 runs.

We see that the number of clusters estimated by LTM-Rounding is close to the

ground truth (10), but that by ROT-Rounding is considerably larger than 10. In terms

of quality of clusterings, the results are inconclusive. RI suggests that ROT-Rounding

performed slightly better, but VI suggests that LTM-Rounding performed significantly

4We thank Richard Socher for sharing the preprocessed data with us. The original data can be foundat http://yann.lecun.com/exdb/mnist.

72

http://yann.lecun.com/exdb/mnist

better.

Compared with LTM-Rounding, K-means obtained a better clustering (in terms of

VI), whereas GMM obtained one with similar quality. However, K-means was given the

number of true clusters and GMM the number of eigenvectors. GMM also overestimated

the number of clusters even with the extra information.

A non-parametric Bayesian clustering method, called sd-CRP, has recently been pro-

posed for rounding [148]. Although we obtained the same data from their authors, we

could not get a working implementation for their method. Therefore, we simply copied

their reported performance to Table 4.3. Note that sd-CPR can determine the number

clusters automatically. However, it requires the number of eigenvectors as input, and the

number was set to 10 in this experiment. Hence, sd-CPR requires more information

than LTM-Rounding. Table 4.3 shows that it estimated a similar number of clusters as

LTM-Rounding. However, its clustering was significantly worse in terms of RI or VI.

This shows that LTM-Rounding performed better than sd-CPR even with less given

information.

4.7.3 Image Segmentation

The find part of our empirical evaluation was conducted on real-world image segmentation

tasks. Five images from the Berkeley Segmentation Data Set (BSDS500) were used. They

are shown in the first column of Figure 4.10. The similarity matrices were built using the

method proposed by Arbeláez et al. [4]. The images and the Matlab code for similarity

matrix construction were downloaded from the webpage of the Berkley Computer Vision

Group.5

The segmentation results obtained by ROT-Rounding and LTM-Rounding are

shown in the second and third columns of Figure 4.10 respectively. On the first two

images, ROT-Rounding did not identify any meaningful segments. In contrast, LTM-

Rounding identified the polar bear and detected the boundaries of the river on the first

image. It identified the bottle, the glass and a lobster on the second image.

An obvious undesirable aspect of the results is that some uniform regions are broken

up. Examples include the river bank and the river itself in the first image, and the

background and the table in the second image. This is a known problem of spectral

clustering when applied to image segmentation and can be dealt with using image analysis

techniques [4]. We do not deal with the problem in this work.

5http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html

73

http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html

On the third image the performance of LTM-Rounding was better because the lizard

was identified and the segmentation lines follow leaf edges more closely. On the fourth

image LTM-Rounding did a better job at detecting the edges around the lady’s hands

and skirt and on the left end of the silk scarf. However, ROT-Rounding did a better

job at the last image because it produced a cleaner segmentation.

Overall, the performance of LTM-Rounding is better than that of ROT-Rounding.

It should be noted that rounding is a post-processing step of spectral clustering. The

quality of the final results depends critically on the eigenvectors that are produced at

earlier steps. The objective of this section has been to compare LTM-Rounding and

ROT-Rounding on the same collection of eigenvectors. The conclusion is meaningful

even if the final segmentation results are not as good as the best that can be achieved by

image analysis techniques.

4.8 Conclusions

Rounding is an important step of spectral clustering that has not received sufficient at-

tention. Not many papers have been published on the topic, especially on the issues of

determining the number of leading eigenvectors. In this chapter, we have proposed a

novel method for the task. The method is based on LTMs. It can automatically select an

appropriate number of eigenvectors to use, determine the number of clusters, and finally

assign data points to clusters. We have shown that the method works correctly in the

ideal case and its performance degrades gracefully as we move away from the ideal case.

74

Original ROT LTM

Figure 4.10: Image segmentation results. The first column shows the original images.The second and third columns show the segmentation results using ROT-Rounding andLTM-Rounding respectively.

75

CHAPTER 5

EXTENSION:POUCH LATENT TREE MODELS

In LTMs, all variables are discrete. This limits the use of LTMs to discrete data. In this

chapter, we propose an extension to LTMs, resulting in a class of models called pouch

latent tree models.

PLTMs have continuous manifest variables, and hence can handle continuous data.

Note that although our description focuses on continuous data, PTLMs can readily work

on both continuous and discrete data. The algorithms described in this chapter work on

mixed data without any need for modification.

This chapter begins with a definition of PTLMs. PLTMs are then related to other

work in Section 5.2. We describe an interference algorithm in Section 5.3. In Section 5.4,

we discuss how the parameters of PLTMs can be estimated using the EM algorithm. In

Section 5.5, we describe a structural learning algorithm for PLTMs.

5.1 Pouch Latent Tree Models

A pouch latent tree model (PLTM) is a rooted tree, where each internal node represents

a latent variable, and each leaf node represents a set of manifest variables. All the latent

variables are discrete, whereas all the manifest variables are continuous. A leaf node may

contain a single manifest variable or several of them. Because of the second possibility, leaf

nodes are called pouch nodes. Figure 5.1 shows an example of PLTM. In this example,

Y1–Y4 are discrete latent variables, where Y1–Y3 have three possible values and Y4 has

two. X1–X9 are continuous manifest variables. They are grouped into five pouch nodes:

{X1, X2}, {X3}, {X4, X5}, {X6}, and {X7, X8, X9}. Note that we reserve the use of bold

capital letter W for denoting the variables of a pouch node, such as, W1 = {X1, X2}.

In a PLTM, the dependency of a discrete latent variable Y on its parent Π(Y ) is

characterized by a conditional discrete distribution P (y|π(Y )).1 Let W be the variables

of a pouch node with a parent node Y = Π(W ). We assume that, given a value y of

Y , W follows the conditional Gaussian distribution P (w|y) = N (w|µy,Σy) with mean

1 The root node is regarded as the child of a dummy node with only one value, and hence is treated inthe same way as other latent nodes.

76

Figure 5.1: An example of PLTM. The numbers in parentheses show the cardinalities ofthe discrete variables.

Figure 5.2: Generative model for synthetic data.

y1 P (y1)s1 0.33s2 0.33s3 0.34

P (y2|y1)y2 y1 = s1 y1 = s2 y1 = s3s1 0.74 0.13 0.13s2 0.13 0.74 0.13s3 0.13 0.13 0.74

Table 5.1: Discrete distributions in Example 1.

vector µy and covariance matrix Σy. A PLTM can be written as a pair M = (m,θ),

where m denotes the model structure and θ denotes the parameters.

Example 1. Figure 5.2 gives another example of PLTM. In this model, there are two

discrete latent variables Y1 and Y2, each having three possible values {s1, s2, s3}. There

are six pouch nodes, namely {X1, X2}, {X3}, {X4}, {X5}, {X6}, and {X7, X8, X9}. Thevariables in the pouch nodes are continuous.

Each node in the model is associated with a distribution. The discrete distributions

P (y1) and P (y2|y1), associated with the two discrete nodes, are given in Table 5.1.

The pouch nodes W are associated with conditional Gaussian distributions. These

distributions have parameters for specifying the conditional means µπ(W ) and conditional

covariances Σπ(W ). For the four pouch nodes with single variables, {X3}, {X4}, {X5},and {X6}, these parameters have scalar values. The conditional means µy for each of these

variables are either −2.5, 0, and 2.5, depending on whether y = s1, s2, or s3, where y is the

value of the corresponding parent variable Y ∈ {Y1, Y2}. The conditional covariances Σy

77

Figure 5.3: A Gaussian mixture model as a special case of PLTM.

can also have different values for different values of their parents. However, for simplicity

in this example, we set Σy = 1,∀y ∈ {s1, s2, s3}.

Let p be the number of variables in a pouch node. The conditional means are specified

by p-vectors and the conditional covariances by p × p matrices. For example, the means

and covariances of the pouch node {X1, X2} conditional on its parent y1 are given by:

µy1 =

(−2.5,−2.5) : y1 = s1

(0, 0) : y1 = s2

(2.5, 2.5) : y1 = s3

and Σy1 =

(1 0.5

0.5 1

),∀y1 ∈ {s1, s2, s3}.

The conditional means and covariances are specified similarly for pouch node {X7, X8, X9}.They are given by:

µy2 =

(−2.5,−2.5,−2.5) : y2 = s1

(0, 0, 0) : y2 = s2

(2.5, 2.5, 2.5) : y2 = s3

and Σy2 =

1 0.5 0.50.5 1 0.50.5 0.5 1

,∀y2 ∈ {s1, s2, s3}.

PLTMs have a noteworthy two-way relationship with GMMs. On the one hand,

PLTMs generalize the structure of GMMs to allow more than one latent variable in a

model. Thus, a GMM can be considered as a PLTM with only one latent variable and

one pouch node containing all manifest variables. As an example, a GMM is depicted as

a PLTM in Figure 5.3, in which Y1 is a discrete latent variable and X1–X9 are continuous

manifest variables.

On the other hand, the distribution of a PLTM over the manifest variables can be

represented by a GMM. Consider a PLTM M . Suppose W 1, . . . ,W b are the b pouch

nodes and Y1, . . . , Yl are the l latent nodes in M . Denote as X =⋃bi=1Wi and Y = {Yj :

j = 1, . . . , l} the sets of all manifest variables and all latent variables in M , respectively.

78

The probability distribution defined by M over the manifest variables X is given by

P (x) =∑y

P (x,y)

=∑y

l∏j=1

P (yj|π(Yj))b∏i=1

N (wi|µπ(W i),Σπ(W i)) (5.1)

=∑y

P (y)N (x|µy,Σy). (5.2)

Equation 5.1 follows from the model definition. Equation 5.2 follows from the fact that

Π(W i),Π(Yj) ∈ Y and the product of Gaussian distributions is also a Gaussian distri-

bution. Equation 5.2 shows that P (x) is a mixture of Gaussian distributions. Although

it means that PLTMs are not more expressive than GMMs on the distributions of ob-

served data, PLTMs have two advantages over GMMs. First, numbers of parameters can

be reduced in PLTMs by exploiting the conditional independence between variables, as

expressed by the factorization in Equation 5.1. Second, and more important, the multiple

latent variables in PLTMs allow multiple clusterings on data.

Example 2. In this example, we compare the numbers of parameters in a PLTM and

in a GMM. Consider a discrete node and its parent node with c and c′ possible values,

respectively. It requires (c−1)×c′ parameters to specify the conditional discrete distribution

for this node. Consider a pouch node with p variables, and its parent variable with c′

possible values. This node has p × c′ parameters for the conditional mean vectors andp(p+1)

2×c′ parameters for the conditional covariance matrices. Now consider the PLTM in

Figure 5.1 and the GMM in Figure 5.3. Both of them define a distribution on 9 manifest

variables. Based on the above expressions, the PLTM has 77 parameters and the GMM

has 164 parameters.

Given the same number of manifest variables, a PLTM may appear to be more complex

than a GMM due to a larger number of latent variables. However, this example shows

that a PLTM can still require fewer parameters than a GMM.

5.2 Related Work

The graphical structure of PLTMs looks similar to that of the Bayesian networks (BNs) [122].

In fact, a PLTM is different from a BN only because of the possibility of multiple variables

in a single pouch node. It has been shown that any nonsingular multivariate Gaussian

distribution can be converted to a complete Gaussian Bayesian network (GBN) with an

equivalent distribution [57]. Therefore, a pouch node can be considered as a shorthand

79

notation of a complete GBN. If we convert each pouch node into a complete GBN, a

PLTM can be considered as a conditional Gaussian Bayesian network (i.e., a BN with

discrete distributions and conditional Gaussian distributions). It can also be considered

as a BN in general.

Some mixture models allow manifest variables to have multivariate normal distribu-

tions. These include the AutoClass models [23] and the MULTIMIX models [77]. The

manifest variables of those models are similar to the pouch nodes in PLTMs. However,

those mixture models do not allow multiple latent variables.

Galimberti and Soffritti [55] propose to build multiple GMMs on a given data set

rather than a single GMM. We refer to their method as GS. Their method partitions the

attributes and learns a GMM on each attribute subset. The collection of independent

GMMs forms the resulting model. The method obtains the initial partition of attributes

using variable clustering. To look for the optimal partition, it repeatedly merges the two

subsets of attributes that lead to the largest improvement of BIC, until it cannot find any

such pair.

The GS models can be considered as having multiple latent variables. They are similar

to PLTMs in this aspect. However, the latent variables in a GS model are disconnected

and hence are independent. In contrast, the latent variables in a PLTM are interdependent

and are connected by a tree structure.

5.3 Inference

A PLTM defines a probability distribution P (X,Y ) over manifest variablesX and latent

variables Y . Consider observing values e for the evidence variables E ⊆X. For a subset

of variables Q ⊆ X ∪ Y , we are often required to compute the posterior probability

P (q|e). For example, classifying a data point d to one of the clusters represented by

latent variable Y requires us to compute P (y|X = d).

Inference refers to the computation of the posterior probability P (q|e). It can be done

on PLTMs similarly as the clique tree propagation on conditional GBNs [96]. However,

due to the existence of pouch nodes in PLTMs, this propagation algorithm requires some

modifications. The inference algorithm is discussed in details below.

5.3.1 Clique Tree Propagation

Consider a PLTM M with manifest variables X and latent variables Y . Recall that

inference on M refers to the computation of the posterior probability P (q|e) of some

80

Algorithm 1 Inference Algorithm1: procedure Propagate(M, T ,E, e)2: Initialize ψ(C) for every clique C3: Incorporate evidence to the potentials4: Choose an arbitrary clique in T as CP5: for all C ∈ Ne(CP ) do6: CollectMessage(CP , C)7: end for8: for all C ∈ Ne(CP ) do9: DistributeMessage(CP , C)10: end for11: Normalize ψ(C) for every clique C12: end procedure

13: procedure CollectMessage(C,C ′)14: for all C ′′ ∈ Ne(C ′)\{C} do15: CollectMessage(C ′, C ′′)16: end for17: SendMessage(C ′, C)18: end procedure

19: procedure DistributeMessage(C,C ′)20: SendMessage(C,C ′)21: for all C ′′ ∈ Ne(C ′)\{C} do22: DistributeMessage(C ′, C ′′)23: end for24: end procedure

25: procedure SendMessage(C,C ′)26: φ← RetrieveFactor(C ∩ C ′)27: φ′ ←

∑C\C′ ψ(C)

28: SaveFactor(C ∩ C ′, φ′)29: ψ(C ′)← ψ(C ′)× φ′/φ30: end procedure

31: procedure RetrieveFactor(S)32: If SaveFactor(S, φ) has been called,

return φ; otherwise, return 1.33: end procedure

// Ne(C) denotes the neighbors of C

variables of interest, Q ⊆X ∪ Y , after observing values e of evidence variables E ⊆X.

To perform inference, M has to be converted into a clique tree T . A propagation scheme

for message passing can then be carried out on T .

Construction of clique trees is simple due to the tree structure of PLTMs. To construct

T , a clique C is added to T for each edge in M , such that C = V ∪{Π(V )} contains thevariable(s) V of the child node and variable Π(V ) of its parent node. Two cliques are

then connected in T if they share any common variable. The resulting clique tree contains

two types of cliques. The first type are discrete cliques. Each one contains two discrete

variables. The second type are mixed cliques. Each one contains the continuous variables

of a pouch node and the discrete variable of its parent node. Observe that in a PLTM

all the internal nodes are discrete and only leaf nodes are continuous. Consequently, the

clique tree can be considered as a clique tree consisted of all discrete cliques, with some

mixed cliques attaching to it on the boundary.

After a clique tree is constructed, propagation can be carried out on it. Algorithm 1

outlines a clique tree propagation, based on the Hugin architecture [36, 38], for PLTMs.

It consists of four main steps: initialization of cliques, incorporation of evidence, message

passing, and normalization. The propagation on the discrete part of the clique tree is

done as for discrete BNs [36, 38]. Here we focus on the part related to the mixed cliques.

81

Step 1: Initialization of cliques (line 2). Consider a mixed clique Cm containing

continuous variables W and discrete variable Y = Π(W ). The potential ψ(w, y) of Cm(also denoted as ψ(Cm) in the algorithm) is initialized by the corresponding conditional

distribution,

ψ(w, y) = P (w|y) = N (w|µy,Σy).

Step 2: Incorporation of evidence (line 3). The variables in a pouch node W can

be divided into two groups, depending on whether the values of the variables have been

observed. Let E′ = W ∩ E denote those variables of which values have been observed,

and let U = W \E denote those of which values have not been observed. Furthermore,

let [µ]S denote the part of mean vector µ containing elements corresponding to variables

S, and let [Σ]ST denote the part of the covariance matrix Σ that has rows and columns

corresponding to variables S and T , respectively. To incorporate the evidence e′ for E′,

the potential of Cm changes from ψ(w, y) to

ψ′(w, y) = P (e′|y)× P (w|y, e′) = N(e′∣∣ [µy]E′ , [Σy]E′E′

)×N

(w|µ′y,Σ

′y

),

where µ′y and Σ′y can be divided into two parts. The part related to the evidence variables

E′ is given by:[µ′y]E′ = e′, [Σ′y]UE′ = 0, [Σ′y]E′E′ = 0, and [Σ′y]E′U = 0.

The other part is given by:[µ′y]U

=[µy]U

+ [Σy]UE′

[Σ−1y

]E′E′

(e′ −

[µy]E′

),

[Σ′y]UU = [Σy]UU − [Σy]UE′

[Σ−1y

]E′E′ [Σy]E′U .

Step 3: Message passing (lines 5–10). In this step, Cm involves two operations,

marginalization and combination. Marginalization of ψ′(w, y) over W is required for

sending out a message from Cm (line 27). It results in a potential ψ′(y), involving only

the discrete variable y, as given by:

ψ′(y) = P (e′|y) = N(e′∣∣ [µy]E′ , [Σy]E′E′

).

Combination is required for sending a message to Cm (line 29). The combination of the

potential ψ′(w, y) with a discrete potential φ(y) is given by ψ′′(w, y) = ψ′(w, y) × φ(y).

When the message passing completes (line 10), ψ′′(w, y) represents the distribution

ψ′′(w, y) = P (y, e)× P (w|y, e′) = P (y, e)×N(w|µ′y,Σ

′y

).

82

Step 4: Normalization (line 11). In this step, a potential changes from ψ′′(w, y) to

ψ′′′(w, y) = P (y|e)× P (w|y, e′) = P (y|e)×N(w|µ′y,Σ

′y

).

For implementation, the potential of a mixed clique is usually represented by two

types of data structures: one for the discrete distribution and one for the conditional

Gaussian distribution. More details for the general clique tree propagation can be found

in [35, 36, 96].

5.3.2 Complexity

The structure of PLTMs allows an efficient inference. Let n be the number of nodes in

a PLTM, c be the maximum cardinality of a discrete variable, and p be the maximum

number of variables in a pouch node. The time complexity of the inference is dominated by

the steps related to message passing and incorporation of evidence on continuous variables.

The message passing step requires O(nc2) time, since each clique has at most two discrete

variables due to the tree structure. Incorporation of evidence requires O(ncp3) time.

Suppose we have the same number of observed variables. Since PLTMs generally has

smaller pouch nodes than GMMs and hence a smaller p, the term O(ncp3) shows that

inference on PTLMs can be faster than that on GMMs. This happens even though PLTM

can have more nodes and thus a larger n.

5.4 Parameter Estimation

Suppose there is a data set D with N samples d1, . . . ,dN . Each sample consists of values

for the manifest variables. Consider computing the maximum likelihood estimate (MLE)

θ∗ of the parameters for a given PLTM structure m. We do this using the EM algo-

rithm [40]. The algorithm starts with an initial estimate θ(0) and improves the estimate

iteratively.

Suppose the parameter estimate θ(t−1) is obtained after t − 1 iterations. The t-th

iteration consists of two steps, an E-step and an M-step. In the E-step, we compute,

for each latent node Y and its parent Π(Y ), the distributions P (y, π(Y )|dk,θ(t−1)) and

P (y|dk,θ(t−1)) for each sample dk. This is done by the inference algorithm discussed in

the previous section.

For each sample k, let wk be the values of variablesW of a pouch node for the sample

83

dk. In the M-step, the new estimate θ(t) is obtained as follows:

P (y|π(Y ),θ(t)) ∝N∑k=1

P (y, π(Y )|dk,θ(t−1))

µ(t)y =

∑Nk=1 P (y|dk,θ(t−1))wk∑Nk=1 P (y|dk,θ(t−1))

Σ(t)y =

∑Nk=1 P (y|dk,θ(t−1))(wk − µ(t)

y )(wk − µ(t)y )′∑N

k=1 P (y|dk,θ(t−1)),

where µ(t)y and Σ(t)

y here correspond to the distribution P (w|y,θ(t)) for node W condi-

tional on its parent Y = Π(W ). The EM algorithm proceeds to the (t + 1)-th iteration

unless the improvement of log-likelihood logP (D|θ(t))− logP (D|θ(t−1)) falls below a cer-

tain threshold.

The starting values of the parameters θ(0) are chosen as follows. For P (y|π(Y ),θ(0)),

the probabilities are randomly generated from a uniform distribution over the interval

(0, 1] and are then normalized. The initial values of µ(0)y are set to equal to a random

sample from data, while those of Σ(0)y are set to equal to the sample covariance.

Like in the case of GMMs, the likelihood is unbounded in the case of PLTMs. This

might lead to spurious local maxima [111]. For example, consider a mixture component

that consists of only one data point. If we set the mean of the component to be equal

to that data point and set the covariance to zero, then the model will have an infinite

likelihood on the data. However, even though the likelihood of this model is higher than

some other models, it does not mean that the corresponding clustering is better. The

infinite likelihood can always be achieved by trivially grouping one of the data points as

a cluster. This is why we refer to this kind of local maxima as spurious.

To mitigate this problem, we use a variant of the method by Ingrassia [80]. In the M-

step of EM, we need to compute the covariance matrix Σy(t) for each pouch nodeW . We

impose the following constraints on the eigenvalues λ(t) of Σy(t): σ2

min/γ ≤ λ(t) ≤ σ2max×γ,

where σ2min and σ2

max are the minimum and maximum of the sample variances of the

variables W and γ is a parameter for our method.

5.5 Structure Learning

Given a data set D, we aim at finding the model m∗ that maximizes the BIC score given

by Equation (3.1).

84

A hill-climbing algorithm can be used to search for m∗. It starts with a model m(0)

that contains one latent node as root and a separate pouch node for each manifest variable

as a leaf node. The latent variable at the root node has two possible values. Suppose a

model m(j−1) is obtained after j − 1 iterations. In the j-th iteration, the algorithm uses

some search operators to generate candidate models by modifying the base model m(j−1).

The BIC score is then computed for each candidate model. The candidate model m′ with

the highest BIC score is compared with the base model m(j−1). If m′ has a higher BIC

score than m(j−1), m′ is used as the new base model m(j) and the algorithm proceeds to

the (j + 1)-th iteration. Otherwise, the algorithm terminates and returns m∗ = m(j−1)

(together with the MLE of the parameters).

The above hill-climbing algorithm explains the principles for our structural learning

algorithm. In the remaining of this section, we first describe the search operators used

in the algorithm. We then discuss how the efficiency of the hill-climbing algorithm can

be improved, based on the ideas of Chen et al. [26, 28]. We summarize the structural

learning algorithm, called EAST-PLTM, in the last subsection.

5.5.1 Search Operators

There are four aspects of the structure m, namely, the number of latent variables, the

cardinalities of these latent variables, the connections between variables, and the composi-

tion of pouches. The search operators used in our hill-climbing algorithm modify all these

aspects to effectively explore the search space. There are totally seven search operators.

Five of them are borrowed from Zhang and Kočka [173], while two of them are new for

PLTMs.

The five borrowed operators are the state introduction (SI), state deletion (SD), node

introduction (NI), node deletion (ND), and node relocation (NR) operators. They are

described in Section 3.5.1 (page 27). Figure 5.4 gives some examples of NI, ND, and NR

in PLTMs.

The two new operators are pouching (PO) and unpouching (UP) operators. The PO

operator creates a new model by combining a pair of sibling pouch nodesW 1 andW 2 into

a new pouch nodeW po = W 1∪W 2. The UP operator creates a new model by separating

one manifest variable X from a pouch node W up, resulting in two sibling pouch nodes

W 1 = W up\{X} and W 2 = {X}. Figure 5.5 shows some examples of the use of these

two operators.

The purpose of the PO and UP operators is to modify the conditional independencies

entailed by the model on the variables of the pouch nodes. For example, consider the two

85

(a) m1

(b) m2 (c) m3

Figure 5.4: Examples of applying the node introduction, node deletion, and node reloca-tion operators. Introducing Y3 to mediate between Y1, {X4, X5} and {X6} in m1 givesm2. Relocating {X4, X5} from Y3 to Y2 in m2 gives m3. In reverse, relocating {X4, X5}from Y2 to Y3 in m3 gives m2. Deleting Y3 in m2 gives m1.

models m1 and m2 in Figure 5.5. In m1, X4 and X5 are conditionally independent given

Y3, i.e., P (X4, X5|Y3) = P (X4|Y3)P (X5|Y3). In other words, covariance between X4 and

X5 is zero given Y3. On the other hand, X4 and X5 need not be conditionally independent

given Y3 in m2. The covariances between them are allowed to be non-zero in the 2 × 2

conditional covariance matrices for the pouch node {X4, X5}.

The PO operator in effect postulates that two sibling pouch nodes are correlated given

their parent node. It may improve the BIC score of the candidate model by increasing the

likelihood term, when there is some degree of local dependence between those variables

on the empirical data. On the other hand, the UP operator postulates that one variable

in a pouch node is conditionally independent from other variables in the pouch node. It

reduces the number of parameters in the candidate model and hence may improve the

BIC score by decreasing the penalty term. These postulates are tested by comparing the

BIC scores of the corresponding models in each search step. The postulate that leads to

the model with the highest BIC score is considered as most appropriate.

For the sake of computational efficiency, we do not consider pouching more than two

manifest variables. This is similar to how NI is done. To compensate for the restriction,

we consider a restricted version of PO after a successful pouching. The restricted version

86

(a) m1

(b) m2 (c) m3

Figure 5.5: Examples of applying the pouching and unpouching operators. Pouching {X4}and {X5} in m1 gives m2, and pouching {X4, X5} and {X6} in m2 gives m3. In reverse,unpouching X6 from {X4, X5, X6} in m3 gives m2, and unpouching X5 from {X4, X5}gives m1.

combines the new pouch node resulting from PO with one of its sibling pouch nodes.

5.5.2 Search Phases

In every search step, a search operator generates all possible candidate models for consid-

eration. Let l and b be the numbers of latent nodes and pouch nodes, respectively, and

n = l+b be total number of nodes. Let p, q, and r be the maximum number of variables in

a pouch node, the maximum number of sibling pouch nodes, and the maximum number

of neighbors of a latent node, respectively. The numbers of possible candidate models

that NI, ND, SI, SD, NR, PO, and UP can generate are O(lr(r−1)/2), O(lr), O(l), O(l),

O(nl), O(lq(q − 1)/2), and O(bp), respectively. If we consider all seven operators in each

search step, many candidate models are generated but only one of them is chosen for the

next step. Those suboptimal models are discarded and are not used for the next step.

Therefore, in some sense, much time are wasted for considering the suboptimal models.

A more efficient way is to consider fewer candidate models in each search step. This can

be achieved by considering only a subset of search operators at a time. Therefore, we follow

the idea of Chen et al. [26] and partition the search operators into three categories. SI, NI,

87

and PO belong to the expansion category, since all of them can create candidate models

that are more complex than the current one. SD, ND, and UP, which simplify a model,

belong to the simplification category. NR does not change the complexity considerably.

It belongs to the adjustment category. We perform search in three phases, each of which

considers only operators in one category. The best model found in each phase is used

to seed the next phase, and the search repeats these three phases until all three phases

cannot find a better model.

5.5.3 Operation Granularity

Some search operators have the issue of operation granularity (see Section 3.5.1, page 29).

This is similar to the those for LTMs. Therefore, we follow the cost-effective principle

and evaluate the models using the improvement ratio (Equation (3.4)) for those affected

operators.

The principle is applied only on candidate models generated by the SI, NI, and PO

operators. In other words, it is used only during the expansion phase. It is not applied

to other operators since those operators do not or do not necessarily increase model

complexity.

5.5.4 Efficient Model Evaluation

Similar to the other five operators, the PO and UP operators modify a small part of the

base model. Hence, the restricted likelihood (see Section 3.5.1, page 31) can also be used

to speed up the evaluation process.

We use an example to illustrate the evaluation of a candidate model given by a PO

operator. Consider the models m1 and m2 in Figure 5.5. m2 is obtained from m1 using

the PO operator. The two models share many parameters such as P (x1, x2|y2), P (y2|y1),P (y3|y1), P (x6|y3), etc. On the other hand, some parameters are not shared by the two

models. In this example, parameters P (x4|y3) and P (x5|y3) are specific to m1, while

parameter P (x4, x5|y3) is specific to m2. To compute the restricted likelihood of m2, we

keep the MLE of the shared parameters. And we update only the parameter P (x4, x5|y3),which is specific to m2, in EM. We can then evaluate m2 using the approximate score

BICRL given by Equation (3.5).

88

Algorithm 2 Search Algorithm1: procedure EAST-PLTM(m, D)2: repeat3: m′ ← m4: m← Expand(m)5: m← Adjust(m)6: m← Simplify(m)7: until BIC(m|D) ≤ BIC(m′|D)8: return m′

9: end procedure

10: procedure Expand(m, D)11: loop12: m′ ← m13: M← SI(m′) ∪NI(m′) ∪ PO(m′)14: m← PickModel-IR(M,D)15: if BIC(m|D) ≤ BIC(m′|D) then16: return m′

17: end if18: if m ∈ NI(m′) ∪ PO(m′) then19: m← Enhance(m,D)20: end if21: end loop22: end procedure

23: procedure Adjust(m, D)24: return RepPickModel(m,NR,D)25: end procedure

26: procedure Simplify(m, D)27: m← RepPickModel(m,UP,D)28: m← RepPickModel(m,ND,D)29: m← RepPickModel(m,SD,D)30: end procedure

31: procedure RepPickModel(m, Op, D)32: repeat33: m′ ← m34: m← PickModel(Op(m′),D)35: until BIC(m|D) ≤ BIC(m′|D)36: return m′

37: end procedure

5.5.5 EAST-PLTM

The entire search algorithm for PTLMs is named EAST-PLTM. It is outlined in Algo-

rithm 2.

The search process starts with the initial modelm(0) described in Section 5.5. The pro-

cedure EAST-PLTM uses the initial model as the first current model. It then repeatedly

tries to improve the current model in the three different phases.

In procedure Expand, the improvement ratio IR is used to pick the best models among

the candidate models generated by SI, NI, and PO. It stops when the best candidate model

fails to improve over the previous model. On the other hand, if the best candidate model

is better, and if it comes from NI or PO, procedure Enhance is called. This procedure

iteratively improves the model by using a restricted version of search operator. If the

given model is generated by NI, restricted version of NR is used. If it is generated by PO,

restricted version of PO is used.

In procedure Adjust, it calls the procedure RepPickModel to improves the model

using the NR operator repeatedly. In procedure Simplify, it first tries to improve the

model using UP. It then uses ND, and uses SD lastly.

89

5.6 Conclusions

In this chapter, we have described the PLTMs. We have also presented an inference al-

gorithm and a learning algorithm for these models. In the next two chapters, we demon-

strate the usefulness of PLTMs. We apply these models for facilitating variable selection

in clustering and for multidimensional clustering.

90

CHAPTER 6

VARIABLE SELECTION IN CLUSTERING

Variable selection for cluster analysis is a difficult problem. The difficulty originates not

only from the lack of class information but also the fact that high-dimensional data are

often multifaceted and can be meaningfully clustered in multiple ways. In such a case the

effort to find one subset of attributes that presumably gives the “best” clustering may be

misguided. It makes more sense to facilitate variable selection by domain experts, that

is, to systematically identify various facets of a data set (each being based on a subset of

attributes), cluster the data along each one, and present the results to the domain experts

for appraisal and selection.

In this chapter, we use PLTMs as a generalization of the Gaussian mixture model. We

show their ability to cluster data along multiple facets. And we demonstrate it is often

more reasonable to facilitate variable selection than to perform it.

We begin this chapter by explaining our approach in Section 6.1. Then we describe

the experimental setup in Section 6.2. In Section 6.3, we present the empirical results of

comparison between the two approaches to variable selection. We also highlight findings

on some of the data sets.

6.1 To Do or To Facilitate

Variable selection is an important issue for cluster analysis of high-dimensional data. The

cluster structure of interest to domain experts can often be best described using a subset

of attributes. The inclusion of other attributes can degrade clustering performance and

complicate cluster interpretation. Recently there is a growing interest in the issue [43,

151, 170]. This chapter is concerned with variable selection for model-based clustering.

In classification, variable selection is a clearly defined problem. It has to find the subset

of attributes that gives the best classification performance. The problem is less clear for

cluster analysis due to the lack of class information. Several variable selection methods

have been proposed for model-based clustering. Most of them introduce flexibility into

the generative mixture model to allow clusters to be related to subsets of (instead of all)

attributes and determine the subsets alongside parameter estimation or during a separate

model selection phase.

91

Raftery and Dean [130] consider a variation of the Gaussian mixture model (GMM)

where the latent variable is related to a subset of attributes and is independent of other

attributes given the subset. A greedy algorithm is proposed to search among those models

for one with high BIC score. At each search step, two nested models are compared using

the Bayes factor and the better one is chosen to seed the next search step. Maugis et al.

[108, 109] extend this work by considering different possibilities of dependency among the

relevant and irrelevant attributes.

Law et al. [97] start with the Naïve Bayes model (that is, GMM with diagonal covari-

ance matrices) and add a saliency parameter for each attribute. The parameter ranges

between 0 and 1. When it is 1, the attribute depends on the latent variable. When it is

0, the attribute is independent of the latent variable and its distribution is assumed to be

unimodal. The saliency parameters are estimated together with other model parameters

using the EM algorithm. The work is extended by Li et al. [102] so that the saliency of

an attribute can vary across clusters.

The third line of work is based on GMMs where all clusters share a common diagonal

covariance matrix, while their means may vary. If the mean of a cluster along an attribute

turns out to coincide with the overall mean, then that attribute is irrelevant to cluster.

Both Bayesian methods [73, 103] and regularization methods [120] have been developed

based on this idea.

Our work is based on two observations. First, while clustering algorithms identify

clusters in data based on the characteristics of data, domain experts are ultimately the

ones to judge the interestingness of the clusters found. Second, high-dimensional data are

often multifaceted in the sense that there may be multiple meaningful ways to partition

them. The first observation is the reason why variable selection for clustering is such a

difficult problem, whereas the second one suggests that the problem may be ill-conceived

from the start.

Instead of performing variable selection, we advocate to facilitate variable selection by

domain experts. The idea is to systematically identify all the different facets of a data

set, cluster the data along each one, and present the results to the domain experts for

appraisal and selection. The analysis would be useful if one of the clusterings is found

interesting.

We use PLTMs to to realize the idea. Analyzing data using a PLTM may result in

multiple latent variables. Each latent variable represents a partition (clustering) of the

data and is usually related primarily to only a subset of attributes. Consequently, the data

is clustered along multiple dimensions and the results can be used to facilitate variable

selection.

92

Data Set Attributes Classes Samples Latentsglass 9 6 214 3.0image 181 7 2310 4.4ionosphere 331 2 351 9.9iris 4 3 150 1.0vehicle 18 4 846 3.0wdbc 30 2 569 9.4wine 13 3 178 2.0yeast 8 10 1484 5.0zernike 47 10 2000 6.9

Table 6.1: Descriptions of UCI data sets used in our experiments. The last column showsthe average numbers of latent variables obtained by PTLM analysis over 10 repetitions.

6.2 Experimental Setup

Our empirical study is designed to compare two types of analyses that can be applied

to unlabeled data: PLTM analysis and GMM analysis. PLTM analysis yields a model

with multiple latent variables. Each of the latent variables represents a partition of data

and may depend only on a subset of attributes. GMM analysis produces a model with

a single latent variable. It can be done with or without variable selection. Our study is

primarily concerned with GMM analysis with variable selection. GMM analysis without

variable selection is included for reference. When variable selection is performed, the

latent variable may depend on only a subset of attributes.

6.2.1 Data Sets and Algorithms

We used both synthetic and real-world data sets in our experiments. The synthetic data

were generated from the model described in Example 1, of which the variable Y1 is used as

the class variable. The real-world data sets were borrowed from the UCI machine learning

repository.2 We chose nine labeled data sets that have been often used in the literature

and that contain only continuous attributes. Table 6.1 shows the basic information of

these data sets.

We compare PLTM analysis with four methods based on GMMs. The first method is

plain GMM (PGMM) analysis. The second one is MCLUST [52].3 This method reduces

the number of parameters by imposing constraints on the eigenvalue decomposition of the

1Attributes having single values had been removed.2 http://www.ics.uci.edu/~mlearn/MLRepository.html3 http://cran.r-project.org/web/packages/mclust

93

http://www.ics.uci.edu/~mlearn/MLRepository.html

http://cran.r-project.org/web/packages/mclust

covariance matrices. The third one is CLUSTVARSEL [130].4 It is denoted as CVS for

short. The last one is the method of Law et al. [97], which we call LFJ, using the the

first letters of the three author names. Among these four methods, the last two perform

variable selection while the first two do not. Recently, there is a model-based method that

also produces multiple clusterings [55]. It can also be used to facilitate variable selection.

Therefore, we include this method in our study and denote it as GS. CVS and LFJ are

described in Section 6.1, and GS in Section 5.2.

In our experiments, the parameters of PGMMs and PLTMs were estimated using

the EM algorithm. The same settings were used for both cases. EM was terminated

when it failed to improve the log-likelihood by 0.01 in one iteration or when the number

of iterations reached 500. We used a variant of multiple-restart approach [31] with 64

starting points to avoid local maxima in parameter estimation. For the scheme to avoid

spurious local maxima in parameter estimation as described in Section 5.4, we set the

constant γ at 20. For PGMM and CVS, the true numbers of classes were given as input.

For MCLUST, LFJ, and GS, the maximum number of mixture components was set at 20.

6.2.2 Method of Comparison

Our experiments started with labeled data. In the training phase, models were learned

from data with the class labels removed. In the testing phase, the clusterings contained

in models were evaluated and compared based on the class labels. The objective is to see

which method recovers the class variable the best.

A model produced by a GMM-based method contains a single latent variable Y . It

represents one way to partition the data. We follow Strehl and Ghosh [152] and evaluate

the partition using normalized mutual information NMI(C;Y ) between Y and the class

variable C. The NMI is given by

NMI(C;Y ) =MI(C;Y )√H(C)H(Y )

,

where MI(C;Y ) is the mutual information between C and Y and H(V ) is the entropy

of a variable V [34]. These quantities can be computed from P (c, y), which in turn is

estimated by P (c, y) = 1N

∑Nk=1 Ick(c)P (y|dk), where d1, . . . ,dN are the samples in testing

data, Ick(c) is an indicator function having value of 1 when c = ck and 0 otherwise, and

ck is the class label for the k-th sample.

A model resulting from PLTM analysis or GS contains a set Y of latent variables.

Each of the latent variables represents a partition of the data. In practice, the user may4 http://cran.r-project.org/web/packages/clustvarsel

94

http://cran.r-project.org/web/packages/clustvarsel

find several of the partitions interesting and use them all in his work. In this section,

however, we are talking about comparing different clustering algorithms in terms of the

ability to recover the original class partition. So, the user needs to choose one of the

partitions as the final result. The question becomes whether this analysis provides the

possibility for the user to recover the original class partition. Consequently, we assume

that the user chooses, among all the partitions produced, the one closest to the class

partition and we evaluate the performance of PLTM analysis and GS using this quantity:

maxY ∈Y

NMI(C;Y ).

Among the four GMM-based methods, CVS and LFJ make explicit efforts to per-

form variable selection, while PGMM and MCLUST do not. PLTM and GS do not

make explicit effort to perform variable selection either. However, they produces multiple

partitions of data and some of the partitions may depend only on subsets of attributes.

Consequently, they allow the user to examine the clustering results based on various sub-

sets of attributes and choose the ones the user deems interesting. In this sense, we can

view PLTM and GS as methods to facilitate variable selection. So, our empirical work

can be viewed as a comparison between two different approaches to variable selection: to

do (CVS and LFJ) or to facilitate (PLTM and GS).

Note that NMI was used in our experiments due to the absence of a domain expert

to evaluate the clustering results. In practice, class labels are not available when we

cluster data. Hence the NMI cannot be used to select the appropriate partitions for

the facilitation approach. The user needs to comprehend the clusterings and find those

interesting to him. We analyze some unlabeled NBA data in the next chapter. This can

also serve as a demonstration for the facilitation approach in practice.

6.3 Results

The results of our experiments are given in Table 6.2. PLTM has clearly superior perfor-

mances over the two variable selection methods, CVS and LFJ. Specifically, it outperforms

CVS on all but one data set and LFJ on all data sets. PLTM also has clear advantages

over PGMM and MCLUST, the two methods that do not do variable selection. PLTM

outperformed PGMM on all but one data sets and outperformed MCLUST except for two

data sets. In addition, PLTM performed better than GS, the other method that produces

multiple clusterings. It outperformed GS on all but one data. On these data sets, PLTM

usually outperformed the other methods by large margins.

95

Facilitate VS Perform VS No VSData Set PLTM GS CVS LFJ PGMM MCLUSTsynthetic .85 (.00) .69 (.00) .34 (.00) .56 (.02) .56 (.00) .64 (.00)glass .43 (.03) .38 (.00) .29 (.00) .35 (.03) .28 (.03) .33 (.00)image .71 (.03) .65 (.00) .41 (.00) .51 (.03) .52 (.04) .66 (.00)vehicle .40 (.04) .31 (.00) .23 (.00) .27 (.01) .25 (.08) .36 (.00)wine .97 (.00) .83 (.00) .71 (.00) .70 (.19) .50 (.06) .69 (.00)zernike .50 (.02) .39 (.00) .33 (.00) .45 (.01) .44 (.03) .41 (.00)ionosphere .36 (.01) .26 (.00) .41 (.00) .13 (.07) .57 (.04) .32 (.00)iris .76 (.00) .74 (.00) .87 (.00) .68 (.02) .73 (.08) .76 (.00)wdbc .45 (.03) .36 (.00) .34 (.00) .41 (.02) .44 (.08) .68 (.00)yeast .18 (.00) .22 (.00) .04 (.00) .11 (.04) .16 (.01) .11 (.00)

Table 6.2: Clustering performances as measured by NMI. The averages and standarddeviations over 10 repetitions are reported. Best results are highlighted in bold. The firstrow categorizes the methods according to their approaches to variable selection (VS).

We next examine models produced by the various methods to gain insights about the

superior performance of PLTM analysis.

6.3.1 Synthetic Data

Before examining models obtained from synthetic data, we first take a look at the data

set itself. The data were sampled from the model shown in Figure 5.2, with information

about the two latent variables Y1 and Y2 removed. Nonetheless, the latent variables

represent two natural ways to partition the data. To see how the partitions are related

to the attributes, we plot the NMI5 between the latent variables and the attributes in

Figure 6.1(a). We call the curve for a latent variable its feature curve. We see that Y1 is

strongly correlated with X1–X3, but not with the other attributes. Hence it represents a

partition based on those three attributes. Similarly, Y2 represents a partition of the data

based on attributes X4–X9. So, we say that the data has two facets, one represented by

X1–X3 and another by X4–X9. The designated class partition Y1 is a partition along the

first facet.

The model produced by PLTM analysis have the same structure as the generative

model. We name the two latent variables in the model Z1 and Z2 respectively. Their

feature curves are also shown in Figure 6.1(a). We see that the feature curves of Z1 and

Z2 match those of Y1 and Y2 well. This indicates that PLTM analysis has successfully

recovered the two facets of the data. It has also produced a partition of the data along

each of the facets. If the user chooses the partition Z1 along the facet X1–X3 as the

5To compute NMI(X;Y ) between a continuous variable X and a latent variable Y , we discretized Xinto 10 equal-width bins, so that P (X,Y ) could be estimated as a discrete distribution.

96

Y1Y2

Z1Z2

0.0

0.1

0.2

0.3

0.4

0.5

X1 X2 X3 X4 X5 X6 X7 X8 X9Features

NMI

(a) PLTM analysis

Y1Y2CVS

LFJGS1GS2

0.0

0.1

0.2

0.3

0.4

0.5

X1 X2 X3 X4 X5 X6 X7 X8 X9Features

NMI

(b) Variable selection methods and GS

Figure 6.1: Feature curves of the partitions obtained by various methods and that of theoriginal class partition on synthetic data.

final result, then the original class partition is well recovered. This explains the good

performance of PLTM (NMI=0.85).

The feature curves of the partitions obtained by LFJ and CVS are shown in Fig-

ure 6.1(b). We see that the LFJ partition is not along any of the two natural facets of

the data. Rather it is a partition based on a mixture of those two facets. Consequently,

the performance of LFJ (NMI=0.56) is not as good as that of PLTM. CVS did identify

the facet represented by X4–X9, but it is not the facet of the designated class parti-

tion. In other words, it picked the wrong facet. Consequently, the performance of CVS

(NMI=0.34) is the worst among all the methods considered. GS succeeded to identify two

facets. However, the their feature curves do not match those of Y1 and Y2 well. Hence,

its performance (NMI=0.69) is worse than PLTM.

97

Figure 6.2: Structure of the PLTM learned from image data.

6.3.2 Image Data

In the image data, each instance represents a 3× 3 region of an image. It is described by

18 attributes. The feature curve of the original class partition is given in Figure 6.3(a).

We see that it is a partition based on 10 color-related attributes from intensity to hue

and the attribute centroid.row.

The structure of the model produced by PLTM analysis is shown in Figure 6.2. It

contains 4 latent variables Y1–Y4. Their feature curves are shown in Figure 6.3(a). We

see that the feature curve of Y1 matches that of the class partition beautifully. If the user

chooses the partition represented by Y1 as the final result, then the original class partition

is well recovered. This explains the good performance of PLTM (NMI=0.71).

The feature curves of the partitions obtained by LFJ and CVS are shown in Fig-

ure 6.3(b). The LFJ curve matches that of the class partition quite well, but not as

well as the feature curve of Y1, especially on the attributes line.density.5, hue and

centroid.row. Consequently, the performance of LFJ (NMI=0.51) is not as good as that

of PLTM. Similar things can be said about the partition obtained by CVS. Its feature

curve differs from the class feature curve even more than the LFJ curve on the attribute

line.density.5, which is irrelevant to the class partition. Consequently, the performance

of CVS (NMI=0.41) is even worse than that of LFJ.

GS produced four partitions on image data. Two of them consist of only one com-

ponent, and are therefore discarded. The feature curves of the remaining two partitions

are shown in Figure 6.3(c). We see that one of them corresponds to the facet of the class

partition, but does not match that well. As a result, the performance of GS (NMI=0.65)

is better than the other methods but is not as good as that of PLTM.

Two remarks are in order. First, the 10 color-related attributes semantically form

a facet of the data. PLTM analysis has identified the facet in the pouch below Y1.

98

classY1Y2Y3Y4

0.0

0.2

0.4

0.6

line.density.5

line.density.2

vedge.mean

vedge.sd

hedge.mean

hedge.sd

intensityrawred

rawblue

rawgreenexredexblue

exgreenvalue

saturationhu

e

centroid.row

centroid.col

Features

NMI

(a) PLTM analysis

classCVSLFJ

0.0

0.2

0.4

0.6

line.density.5

line.density.2

vedge.mean

vedge.sd

hedge.mean

hedge.sd

intensityrawred

rawblue

rawgreenexredexblue

exgreenvalue

saturationhu

e

centroid.row

centroid.col

Features

NMI

(b) Variable selection methods

classGS1GS2

0.0

0.2

0.4

0.6

line.density.5

line.density.2

vedge.mean

vedge.sd

hedge.mean

hedge.sd

intensityrawred

rawblue

rawgreenexredexblue

exgreenvalue

saturationhu

e

centroid.row

centroid.col

Features

NMI

(c) GS

Figure 6.3: Feature curves of the partitions obtained by various methods and that of theoriginal class partition on image data.

99

Figure 6.4: Structure of the PLTM learned from wine data.

Moreover, it obtained a partition based on not only the color attributes, but also the

attribute centroid.row, the vertical location of a region in an image. This is interesting.

It is because centroid.row is closely related to the color facet. Intuitively, the vertical

location of a region should correlate with the color of the region. For example, the color

of the sky occurs more frequently at the top of an image and that of grass more frequently

at the bottom.

Second, the latent variable Y2 is strongly correlated with the two line density attributes.

This is another facet of the data that PLTM analysis has identified. PLTM analysis has

also identified the edge-related facet in the pouch node below Y3. However, it did not

obtain a partition along the facet. The partition represented by Y3 depends on not only

the edge attributes but others as well. The two coordinate attributes centroid.row and

centroid.col semantically form one facet. The facet has not been identified probably

because the two attributes are not correlated.

6.3.3 Wine Data

The PLTM learned from wine data is shown in Figure 6.4. The model structure also

appears to be interesting. While we are not experts on wine, it seems natural to have Ash

and Alcalinity_of_ash in one pouch as both are related to ash. Similarly, Flavanoids,

Nonflavanoid_phenols, and Total_phenols are related to phenolic compounds. These

compounds affect the color of wine, so it is reasonable to have them in one pouch along

with the opacity attribute OD280/OD315. Moreover, both Magnesium and Malic_acid play

a role in the production of ATP (adenosine triphosphate), the most essential chemical in

energy production. So, it is not a surprise to find them connected to a second latent

variable.

100

Figure 6.5: Structure of the PLTM learned from wdbc data.

Table 6.3: Confusion matrix for PLTM on wdbc data.

Y1class s1 s2 s3 s4 s5

malignant 43 116 6 0 47benign 0 9 193 114 41

6.3.4 WDBC Data

While PLTM performed well on the data sets discussed above, there are some data sets on

which PLTM did not perform so well. One such example is the wdbc data. This data were

obtained from 569 digitalized images of cell nuclei aspirated from breast masses. They

involve 10 computed features of the cell nuclei in these images. Each instance corresponds

to one image and includes the mean value (m), standard error (s), and worst value (w)

for each feature. It is labeled as either benign or malignant.

Figure 6.5 shows the structure of a PLTM learned from this data. We can see that this

model identifies some meaningful facets. The pouch below Y1 identifies a facet related to

the size of nuclei. It includes attributes mainly related to area, perimeter, and radius. The

second facet is identified by the pouch below Y2. It is related to concavity and consists

primarily of the mean and worst values of the two features related to concavity. The

third facet is identified by the pouch below Y3. It includes the mean and worst values of

smoothness and symmetry, and appears to show whether the nuclei have regular shapes

or not. The pouch below Y9 identifies a facet related to texture. This facet includes three

texture-related attributes but also the attribute symmetry.s. The remaining attributes

are mostly standard errors of some features and may be considered as the amount of

variation of the features. They are connected to the rest of the model through Y4 and Y8.

101

classY1Y1(2)

0.0

0.2

0.4

0.6

0.8

radius.w

radius.s

radius.m

perimeter.w

perimeter.s

perimeter.m

compactness.marea.warea.s

area.m

fractal_dim.w

concavity.w

concavity.m

concave_pts.w

concave_pts.m

compactness.w

symmetry.w

symmetry.m

smoothness.w

smoothness.m

concave_pts.s

concavity.s

compactness.s

fractal_dim.s

fractal_dim.m

smoothness.s

texture.w

texture.s

texture.m

symmetry.s

Features

NMI

Figure 6.6: Features curves of the partition Y1, obtained by PLTM, and that of the originalclass partition on wdbc data. Y1(2) is obtained by setting the cardinality of Y1 to 2.

Although the model appears to have a reasonable structure, it did not achieve a high

NMI on this data (NMI=0.45). To have better understanding, we compare the feature

curve of the class partition with that of the closest partition (Y1) obtained by PLTM

in Figure 6.6. The two feature curves have roughly similar shapes. We also look at the

confusion matrix for Y1, which is shown in Table 6.3. We can see that Y1 gives a reasonable

partition. The first four states of Y1 group together the benign cases or malignant cases

almost perfectly, while the remaining state groups together some uncertain cases.

One possible reason for the relatively low NMI is that Y1 has 5 states but the class

variable has only 2 states. The higher number of states of Y1 may lead to a lower NMI

due to an increase of the entropy term. It may also lead to the mismatch of the feature

curves. To verify this explanation, we manually set the cardinality of Y1 to 2. The feature

curve of this adjusted latent variable (Y1(2)) is shown in Figure 6.6. It now matches

the feature curve of the class partition well. The adjustment also improved the NMI to

0.69, and made it become highest on this data set. This example shows that an incorrect

estimation of the number of clusters could be a reason why PLTM performed worse than

other methods on some data sets.

6.3.5 Discussions

We have also performed PLTM analysis on the other data sets. The last column of

Table 6.1 lists the average numbers of latent variables obtained over 10 repetitions. We

see that multiple latent variables have been identified except on iris data, which has only

four attributes. Many of the latent variables represent partitions of data along natural

102

facets of the data.

In general, PLTM analysis has the ability to identify natural facets of data and cluster

data along those facets. In practice, a user may find several of the clusterings useful. In

the setting where clustering algorithms are evaluated using labeled data, PLTM performs

well if the original class partition is also along some of the natural facets, and poorly

otherwise.

103

CHAPTER 7

MULTIDIMENSIONAL CLUSTERING

Variable selection methods produce single clusterings. Hence, those methods become

inadequate when multiple meaningful clusterings exist on data. The question remains

whether multiple clusterings often exist on real-world data.

This chapter attempts to answer this question. We performed PLTM analysis on

seasonal statistics of National Basketball Association (NBA) players. Unlike the data

used in the previous chapter, this data does not contain any class labels. The objective

here is not to recover any target partition. Rather, our aim is to see whether PLTM can

obtain multiple meaningful clusterings. We interpret the clustering results using basic

basketball knowledge.

This chapter is organized as follows. In Section 7.1, we use an example to compare

the multidimensional clustering with the traditional single clustering approach. We then

review some work related to multi-dimensional clustering in Section 7.2. Next, we describe

the NBA data in Section 7.3. In Section 7.4, we present our findings of PLTM analysis on

this data. In Section 7.5, we compare our method with other related methods. Finally,

we discuss our results in Section 7.6.

7.1 Clustering Multifaceted Data

Suppose we want to cluster the data shown in Figure 7.1(a). The data has two attributes.

As Figures 7.1(c) and (d) show, we can obtain a meaningful clustering from either one of

the attributes. Due to this reason, we say that the data is multifaceted.

If we use PLTMs to cluster the data, we will likely get the PLTM shown in Fig-

ure 7.1(b). The PLTM has two latent variables. They correspond to the clusterings along

the X and Y dimensions, respectively. This approach of clustering is called multidimen-

sional clustering because of the multiple clusterings obtained along different dimensions

of data.

On the other hand, if traditional clustering methods are used, only a single clustering

can be obtained. If variable selection is used, we will likely get a 3-cluster solution on either

one of the attributes (Figure 7.1(c) and (d)). This means we will miss the meaningful

clustering on the other attribute.

104

-10 -5 0 5 10

-10

-50

510

X

Y

(a) Data points.

X Y

CX(3) CY (3)

(b) PLTM likely to be obtained.

-10 -5 0 5 10

-10

-50

510

X

Y

(c) Clustering on attribute X.

-10 -5 0 5 10

-10

-50

510

X

Y

(d) Clustering on attribute Y .

-10 -5 0 5 10

-10

-50

510

X

Y

(e) Clustering on both attributes.

Figure 7.1: Clustering data with two facets. (b) The latent variables CX and CY denotethe cluster variables for attributes X and Y , respectively. (c) and (d) Clustering on onlyone of the attributes give 3 clusters. (e) Clustering on both attributes give 9 clusters.

105

If we cluster the data traditionally without variable selection, we will likely get a 9-

cluster solution (Figure 7.1(e)). It is the cross product of the two 3-cluster solutions on

either attribute.

Compared with the traditional approach, the multidimensional approach has two ad-

vantages. These advantages can be observed by comparing how the distribution of data

P (X, Y ) is represented by the two different approaches. Using the multidimensional ap-

proach, P (X, Y ) can be given as

P (X, Y ) = P (X|CX)P (Y |CY )P (CX)P (CY |CX), (7.1)

where CX and CY are two clustering variables each with 3 states. And using the traditional

approach, P (X, Y ) can be given as

P (X, Y ) = P (X, Y |C)P (C), (7.2)

where C is a clustering variable with 9 states.

The first advantage of the multidimensional approach is that its solution is more

comprehensible. To understand the clusterings, we can inspect the terms P (X|CX) and

P (Y |CY ) in Equation (7.1). Each of them focuses on a single dimension, and each has

only three components. On the other hand, we need to inspect the term P (X, Y |C) in

Equation (7.2) to understand the clustering obtained from the traditional approach. This

term depends on two attributes and has nine components. Hence, the terms from the

multidimensional approach are simpler than the one from the traditional approach.

The second advantage is related to the numbers of parameters needed in the solutions

of the two approaches. Due to factorization, Equation (7.1) generally needs fewer param-

eters than Equation (7.2). In general, model selection penalizes models with more param-

eters. Therefore, the former equation allows a larger number of clusters to be discovered if

we need to determine the number of clusters automatically through model selection. And

since the latter equation needs more parameters, a larger number of clusters are usually

prohibited by model selection.

7.2 Related Work

There is some recent work that produces multiple clusterings. While PLTMs acknowl-

edge the correlations between these clusterings, most other work attempts to find multiple

clusterings that are dissimilar from each other. Two approaches are commonly used. The

first approach finds multiple clusterings sequentially. In this approach, an alternative

106

clustering is found based on a given clustering. The alternative clustering is made dis-

similar to the given one with conditional information bottleneck [61], by “cannot-links”

constraints [7], by orthogonal projection [37], or with an optimization problem constrained

by the similarity between clusterings [128]. The second approach finds multiple clusterings

simultaneously. It includes a method based on k-means [83] and one based on spectral

clustering [119]. Both methods find dissimilar clusterings by adding penalty terms that

discourage the similarity between the clusterings to their corresponding objective func-

tions. One other method finds multiple clusterings simultaneously using the suboptimal

solutions of spectral clustering [39].

There are some other lines of work that produce multiple clusterings. Caruana et al.

[22] generate numerous clusterings by applying random weights on the attributes. The

clusterings are presented in an organized way so that users can pick the clustering they

deem the best. Subspace clustering [92] tries to identify all clusters in all subspaces.

Biclustering [106] attempts to find groups of objects that exhibit the same pattern (e.g.,

synchronous rise and fall) over a subset set of attributes. Unlike our approach, the above

related work is distance-based rather than model-based.

Relatively few model-based methods have been proposed to produce multiple cluster-

ings. The GS method [55] is one of these methods. (see Section 5.2). However, due to

the model structure, the obtained clusterings are assumed to be mutually independent.

Jaeger et al. [81] propose the factorial logistic model for multiple clusterings. This

model has binary manifest variables and discrete latent variables. Logistic regression has

been used to model the conditional distributions of the manifest variables given the values

of the latent variables. The number of latent variables and the cardinalities of the latent

variables are assumed to be given in their work.

Factor models also have multiple latent variables. However, most of them have con-

tinuous latent variables rather than discrete latent variables. These models explain the

unobserved heterogeneity using continuous latent scores. This is different from the mul-

tidimensional clustering approach, where discrete clusters are used to explain the unob-

served heterogeneity.

7.3 Seasonal Statistics of NBA Players

Some NBA data were used for multidimensional clustering in our study. The data were

collected from players who played in at least one game in the 2009/10 season.1 The data

1 The data were obtained from: http://www.dougstats.com/09-10RD.txt

107

http://www.dougstats.com/09-10RD.txt

Attr. Description Avg. Pos.games Number of games played Nostarts Number of games started Nomin Minutes played Yesfgm Field goals made (including two-point and three-point shots) Yesfgp Field goal percentage (shots made divided by shot attempted) No3gm Three pointers Yes3gp Three pointer percentage Noftm Free throws made (penalty shots) Yesftp Free throw percentage Nooff Offensive rebounds (gaining possessions of ball after missed

shots by teammates)Yes F/C

def Defensive rebounds (gaining possessions of ball after missedshots by opponents)

Yes F/C

blk Blocks (deflecting field goal attempts by opponents) Yes F/Cstl Steals (defensive acts of causing turnovers by opponents) Yes Gast Assists (passes leading to scores by teammates) Yes Gto Turnovers (losing control of ball to opponents) Yes Gpf Personal fouls Yestf Technical fouls Yesdq Disqualifications Yes

Table 7.1: Descriptions of the attributes of NBA data. The second column indicateswhether an attribute shows an average per game. The last column indicates which positionusually gets a higher value of an attribute. Positions F and C stand for forward and center,and are played by taller players. Position G stands for guard and is played by shorterplayers.

set has 18 attributes and 441 samples. Each sample corresponds to one player. The

attributes are described in Table 7.1.

There are three main positions of players. They are guard (G), forward (F), and center

(C). The guard position is played by shorter players, while the forward and center positions

are played by taller players. According to our understanding of basketball games, some

statistics are more likely to be higher for a particular position. They are indicated in the

last column of Table 7.1.

7.4 PLTMs on NBA Data

In this section, we show the findings of an PLTM analysis on NBA data.

7.4.1 Clusterings Obtained

The structure of the model obtained from the PLTM analysis is shown in Figure 7.2. The

model contains seven latent variables, each of which identifies a different facet of data.

108

Figure 7.2: PLTM obtained on NBA data. The latent variables are shown in shadednodes and represent different clusterings on the players. They have been renamed basedon our interpretation of their meanings. The abbreviations in these names stand for: role(Role), general ability (Gen), technical fouls (T), disqualification (D), tall-player ability(Tall), shooting accuracy (Acc), and three-pointer ability (3pt).

Role P(Role) games startsoccasional 0.32 29.1 2.0irreg_starter 0.11 46.1 31.7reg_sub 0.19 68.2 5.4reg 0.13 75.8 32.8reg_starter 0.25 76.0 73.7overall 1.00 56.3 27.9

(a) Role

3pt P(3pt) 3pm 3ppnever 0.29 0.00 0.00seldom 0.12 0.08 0.26fair 0.17 0.33 0.28good 0.40 1.19 0.36extreme 0.02 0.73 0.64overall 1.00 0.55 0.23

(b) 3pt

Acc P(Acc) fgp ftplow_ftp 0.10 0.44 0.37low_fgp 0.16 0.39 0.72high_ftp 0.47 0.44 0.79high_fgp 0.28 0.52 0.67overall 1.00 0.45 0.71

(c) Acc

Table 7.2: Attribute means conditional on the specified latent variables on NBA data.The second columns show the marginal distributions of those latent variables. The lastrows show the unconditional means of the attributes.

The first facet consists of attributes games and starts, which are related to the role of

a player. The second one consists of attributes min, fgm, ftm, ast, stl, and to, which

are related to some general performance of a player. The third and fourth facets each

contains only one attribute. They are related to tf and dq, respectively. The fifth facet

contains attributes blk, off, def, and pf. This is related to one aspect of performance in

which taller players usually have an advantage. The sixth facet consists of two attributes

ftp and fgp, which are related to the shooting accuracy. The last facet contains 3pm and

3pp, which are related to three pointers.

109

7.4.2 Cluster Means

To understand the clusters, we may examine the mean values of attributes of the clusters.

We use the clusterings Role, 3pt, and Acc as an illustration.

Table 7.2(a) shows the means of games and starts conditional on the clustering Role.

Note that there are 82 games in a NBA season. We see that players belonging to the first

cluster did not play regularly. The second group of players also played less often than

average, but they usually started in a game when they played. This cluster probably

refers to those players who had the calibre of starters but had missed part of the season

due to injuries. The third group of players played often, but usually as a substitute (not

as a starter). The fourth group of players played regularly and sometimes started in a

game. The last group contains players who played and started regularly.

Table 7.2(b) shows the means of 3pm and 3pp conditional on 3pt. The variable 3pt

partitions players into five clusters. The first two clusters contains players that never and

seldom made a three-pointer, respectively. The next two clusters contains players that

have fair and good three-pointer accuracies, respectively. The last group is an extreme

case. It contains players shooting with surprisingly high accuracy. As indicated by the

marginal distribution, it consists of only a very small proportion of players. This is

possibly a group of players who had made some three pointers during the sporadic games

that they had played. The accuracy remained very high since they did not play often.

Table 7.2(c) shows the means of fgp and ftp conditional on Acc. The first group

of players had particularly poor free throw percentage. The second group of players

had low field goal percentage and average free throw percentage. The third group of

players had particularly high free throw percentage, while the fourth group of players had

particularly high field goal percentage. One may expect that since both ftp and fgp are

related to the shooting accuracies, these two attributes should be positively correlated and

the last two groups may look counter-intuitive. However, it is indeed reasonable due to

one observation in NBA. Taller players usually stay closer to basket in games. Therefore,

they take high-percentage shots more often and has higher field goal percentage. On the

other hand, taller players are usually poorer in making free throws and have lower free

throw percentage. One typical example is Dwight Howard. He had a relatively high fgp

(0.61) but a relatively low ftp (0.59). He was classified appropriately as “high_fgp” by

PLTM.

110

GenRole poor fair good

occasional 0.81 0.19 0.00irreg_starter 0.00 0.69 0.31reg_sub 0.22 0.78 0.00regular 0.00 0.81 0.19reg_starter 0.00 0.06 0.94

(a) P (Gen|Role)

AccTall low_ftp low_fgp high_ftp high_fgp

poor 0.28 0.53 0.00 0.18fair 0.00 0.02 0.95 0.03good 0.00 0.00 1.00 0.00good_big 0.11 0.04 0.00 0.86v_good 0.00 0.00 0.14 0.86

(b) P (Acc|Tall)

Table 7.3: Conditional distributions of Gen and Acc on NBA data.

7.4.3 Relationships between Clusterings

In addition to the distributions of individual clusterings, PLTMs also model the proba-

bilistic relationships between the clusterings. Users of PLTM analysis may also find these

relationships interesting. It can be demonstrated from the following two examples.

Table 7.3(a) shows the conditional distribution P (Gen|Role). The clustering Gen con-

sists of three clusters of players with poor, fair, and good general performances, respec-

tively. We observe that those players playing occasionally were mostly poor in general,

and almost all starters played well in general. While the other three groups of players

usually played fairly, more of the irregular starters (“irreg_starter”) played well than those

who played regularly (“regular”), and none of the regular substitutes (“reg_sub”) played

well. This relationship is reasonable because a player’s general performance should be

related to the role of the player.

Table 7.3(b) shows the conditional distribution P (Acc|Tall). It is consistent with our

observation that taller players usually shoot free throws more poorly. Most players who

played very well (“v_good”) or particularly well (“good_big”) as tall players belong to the

group that has high field goal percentage but low free throw percentage (“high_fgp”). On

the other hand, those who do not play well specifically as tall players (“fair” and “good”)

usually have average field goal percentage and higher free throw percentage (“high_ftp”).

For those who played poorly as tall players, we cannot tell much about them.

111

Subset No. Attributes Components1 starts, min, fgm, 3pm, ftm, off, def, ast, stl, to, blk, pf 112 games, tf 33 fgp, 3pp, ftp, dq 2

Table 7.4: Partition of attributes on NBA data by GS. The last column lists the numberof components that a GMM has on each attribute subset.

7.5 Comparison with Other Methods

We now compare PLTM analysis with other related methods. Since we do not have

knowledge on the number of clusters or the number of clusterings, we included only those

methods that can determine these numbers automatically in our study.

7.5.1 Multiple Independent GMMs by GS

The GS method is described in Section 5.2. It is similar to PLTM analysis in that both

methods are model-based and produce multiple discrete latent variables. For comparison

between the two methods, we performed the GS method on the NBA data. The result

is shown in Table 7.4. Each row corresponds to one clustering identified by GS. It shows

the subsets of attributes that the clustering depends on, and the number of clusters in it.

These clusterings have three weaknesses compared to those obtained from PLTM anal-

ysis. First, the subsets of related attributes found by GS appear to be less natural than

those facets identified by PLTM analysis. In particular, attribute games can be related

to many aspects of the statistics, but it is grouped together by GS with a less interesting

attribute tf, which indicates the number of technical fouls, in subset 2. In subset 3, while

fgp, 3pp, ftp are all related to shooting percentages, they are also grouped together with

an apparently unrelated attribute dq. Subset 1 lumps together a large number of at-

tributes. This misses some more specific and meaningful facets that have been identified

by PLTM analysis.

The second weakness is due to the numbers of clusters obtained from GS. One the

one hand, there are a large number of clusters in subset 1. This makes it difficult to

comprehend the clustering, especially with many attributes in this subset. On the other

hand, there are only few clusters from subset 2 and 3. This means some subtle clusters

found in PLTM analysis were not found by GS.

The third weakness is inherent in the structure of models used by GS. Since inde-

pendent GMMs are used on the subsets of attributes, the latent variables are assumed

to be independent. Consequently, the GS model cannot show those possibly meaningful

112

number of factors degrees of freedom χ2 statistic p-value1 135 2755.72 02 118 1134.12 1.31× 10−165

3 102 660.30 1.38× 10−82

4 87 412.17 9.58× 10−44

5 73 247.15 8.91× 10−21

6 60 169.29 2.35× 10−12

7 48 119.20 5.48× 10−8

8 37 72.20 0.0004719 factanal failed

Table 7.5: Results from significance tests whether the factor models fit NBA data.

relationships between the clusterings as PLTMs do.

7.5.2 Factor Analysis

Factor analysis produces multiple continuous latent variables. To see whether this ap-

proach can produce similar results, we perform factor analysis on the NBA data using the

factanal method of R.2 We tried one to nine factors in the analysis. The results of the

significance test whether these models fit the data are shown in Table 7.5.

The results show that the highest p-value is 0.000471. This value is considerably lower

than 0.05, which is the value for accepting a model at 95% significance level. This shows

that factor analysis could not fit the data well enough even when we used eight factors.

However, when we fit the data with more than eight factors, the method factanal failed.

The possible reason is that nine factors are too many for data with only 18 attributes.

Altogether, the results indicate that factor analysis failed on NBA data.

7.5.3 LTM with Continuous Latent Variables

Choi et al. [32] propose several methods for learning tree models with continuous latent

variables when the manifest variables are continuous. Their methods produce models with

structure similar to PLTMs. Therefore, we included one of those methods for comparison.

We followed their report and chose to present the result of the CLNJ method. The

resulting model structure is shown in Figure 7.3.

The model has two latent variables. Notice that they are continuous. The latent

variable Y1 appears to be related to the shooting accuracy. It is connected to the subtree

with attributes 3pm, 3pp, and ftp. Those attributes are expected to be higher when a

2http://www.r-project.org/

113

http://www.r-project.org/

min

starts

games

fgm

ftm to

ast tf

stl

Y1

3pm

3pp

ftp

Y2

def

off

fgp blk

pf

dq

Figure 7.3: Model obtained from CLNJ method [32]. The latent variables (Y1 and Y2) arecontinuous in this model.

player can shoot more accurately. The other latent variable, Y2, appears to be related to

the performance of taller players. It is connected to the subtree with attributes def, off,

blk, and fgp. The taller players usually have advantage for these statistics.

Although the two latent variables look reasonable, the whole model obtained from

CLNJ is less desirable than the PLTM shown in Figure 7.2 for two reasons. First, the

CLNJ model has only two latent variables, while the PLTM has seven. Therefore, its

latent variables cannot show some other facets of basketball games. For example, PLTM

has a latent variable to explain the different roles of players but the CLNJ model does

not.

Second, the continuous latent variables in the CLNJ model cannot explain any non-

linear relationships between the variables. For example, some player missed many games

due to injury, but they had the calibre of starters and played significant minutes when

they could play. This group of players cannot be explained by the linear relationships in

the CLNJ model, such as the positive correlation between min and starts. On the other

hand, the PLTM has discovered a cluster of such players.

Another example can be observed from the fact that the coefficients on the edges of

the CLNJ model are all positive. This means when a higher value of the tall-man ability

(Y2) is observed, the CLNJ model expects a higher value of ftp. This contradicts our

understanding of basketball games and the finding of the PLTM, both of which suggest

a player who played particularly well as a tall player usually attained a below average

percentage of free throws (ftp). These two examples show that the continuous latent

114

variables are insufficient to model the data.

7.6 Discussions

If we think about how basketball games are played, we can expect that the heterogene-

ity of players can originate from various aspects, such as positions of the players, their

competence on their corresponding positions, or their general competence. As our results

show, PLTM analysis identified these different facets from the NBA data and allowed

users to partition data based on them separately. This is not possible using traditional

clustering methods, with or without variable selection.

Even though the number of attributes in NBA data is small relatively to some data

sets that are available nowadays, we can still identify multiple meaningful clusterings from

it. We can expect real-world data with higher dimensions are also multifaceted. Hence,

it is generally more appropriate to use the multidimensional clustering approach than to

use the traditional clustering approach.

Our experiments also showed that factor analysis and the CLNJ method could not find

a model that explains NBA as well as the PLTM. This suggests that discrete latent vari-

ables are more appropriate to explain the multiple heterogeneities than continuous latent

variables. By allowing LTMs to work on continuous data, the experiments demonstrates

that PTLMs are an useful extension of LTMs.

115

CHAPTER 8

CONCLUSIONS

In this chapter, we first summarize what we have done in this thesis. We then point out

some future work and possible improvements.

8.1 Summary of Work

We have made two main contributions to the research of LTMs in this thesis. The first

main contribution is that we have applied LTMs in the rounding step for spectral clus-

tering. This includes:

• We have identified a new source of information for rounding in spectral clustering.

Most work uses only the primary eigenvectors for rounding. However, we have used

also the secondary eigenvectors. We have specifically pointed out the property of

the secondary eigenvectors that we have exploited (Proposition 4 (2)).

• We have proposed an intuitively appealing method for rounding in the ideal case

(Naive-Rounding2). This method can automatically determine the number of

clusters by using the secondary eigenvectors. It also worked perfectly on the three

synthetic data sets in our test.

• We have proposed a model-based method for rounding for the general case (LTM-

Rounding). We have shown that this method worked perfectly for the ideal case

and degraded gracefully when the data deviated from the ideal case. We have com-

pared this method with another popular method, ROT-Rounding [169]. We have

shown that LTM-Rounding worked better than ROT-Rounding on synthetic

data.

• We have used LTM-Rounding for image segmentation. We have shown that its

results were comparable to those of ROT-Rounding, and that it could find some

meaningful segments that ROT-Rounding could not.

The second main contribution is that we have extended LTMs for continuous data.

This includes:

116

• We have proposed PLTMs for handling continuous data, along with an inference

algorithm and a learning algorithm for those models. While we have focused on

continuous data for PLTMs in this thesis, PLTMs should work on both continuous

data and discrete data with any modification.

• We have shown that PLTMs are also an generalization of GMMs. We have pointed

out two advantages of using PLTMs instead of GMMs. PTLMs allow fewer parame-

ters by factorization. And more important, they allow multiple latent variables and

a more flexible model structure.

• We have used PLTMs for facilitating variable selection in model-based clustering.

We have demonstrated that facilitating variable selection can perform better than

doing variable selection traditionally. We have examined the results on four UCI

data sets to explain the performance of PLTM analysis.

• We have demonstrated the usefulness of PTLMs by performing multidimensional

clustering on NBA data. Our analysis identified several meaningful clusterings. It

also found some interesting relationships among the clusterings. Besides, we have

compared out results with results obtained from other methods. We have shown that

the discrete latent variables in an PLTM explained the heterogeneity on data better

than the continuous latent variables in two other models. We have also shown that

the tree structure of latent variables in an PTLM worked better than disconnected

latent variables in another model.

8.2 Future Work

Our work has pointed out some directions for future work. First, most work in spectral

clustering uses only primary eigenvectors for rounding. However, our work has shown that

the secondary eigenvectors can contain useful information and can be used for rounding.

We hope this can open a new direction for rounding in spectral clustering. It is possible

that other work for rounding can improve its performance by making use of the secondary

eigenvectors.

Second, we have compared the facilitation approach and the traditional approach to

variable selection in clustering. By using the traditional approach, one implicitly hopes for

a panacea solution that would be the most meaningful one for every interest. However, our

experiments showed that if one is interested in the partitions given by the class labels,

the clusterings obtained from the traditional approach are worse than those from the

facilitation approach. This suggests that the traditional approach should be used with

117

caution. Instead, it is more appropriate to present multiple clusterings with different

selections of variables and allow one to choose the clustering that is most meaningful to

a particular interest.

Third, most work on clustering aims to find a single clustering. However, our analysis

on NBA data have demonstrated that more than one meaningful clustering can be found

on data with as few as 18 attributes. Data used for clustering nowadays usually have more

attributes. Our result suggests that those data potentially contain multiple meaningful

clusterings. Therefore, future clustering work should aim for multiple clusterings rather

than a single clustering.

8.3 Possible Improvements

There are some possible improvements for our work. First, our method for rounding

requires discretization of eigenvectors. This may lead to loss of information. Therefore,

it would be better to work directly with the continuous values of the eigenvectors. And

this suggests a future work that uses PLTMs rather than LTMs for rounding.

Second, the training of PLTMs can take a long time to complete. For example, one

run of PLTM analysis took around 5 hours on data sets of moderate size (e.g., image,

ionosphere, and wdbc data) and around 2.5 days on the largest data set (zernike data)

in our experiments. This limits the use of PLTM analysis to data with lower dimensions.

Hence, PLTM analysis is currently infeasible for data with hundreds or thousands of

attributes, such as those for text clustering and gene expression analysis. More research

is needed to improve the efficiency of the training process. Future work may consider the

variable clustering approach or the constraint-based approach for learning PLTMs.

118

Bibliography

[1] Raymond J. Adams, Mark Wilson, and Wen-chung Wang. The multidimensional

random coefficients multinomial logit model. Applied Psychological Measurement,

21(1):1–23, 1997.

[2] Hirotugu Akaike. A new look at the statistical model identification. IEEE Trans-

actions on Automatic Control, 19(6):716–723, December 1974.

[3] Animashree Anandkumar, Kamalika Chaudhuri, Daniel Hsu, Sham M. Kakade,

Le Song, and Tong Zhang. Spectral methods for learning multivariate latent tree

structure. In Advances in Neural Information Processing Systems, 2012.

[4] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour de-

tection and hierarchical image segmentation. IEEE Transaction on Pattern Analysis

and Machine Intelligence, 33(5):898–916, 2011.

[5] Francis R. Bach and Michael I. Jordan. Learning spectral clustering, with appli-

cation to speech separation. Journal of Machine Learning Research, 7:1963–2001,

2006.

[6] Francis R. Bach and Michael I. Jordon. Beyond independent components: Trees

and clusters. Journal of Machine Learning Research, 4:1205–1233, 2003.

[7] Eric Bae and James Bailey. COALA: A novel approach for the extraction of an

alternative clustering of high quality and high dissimilarity. In Proceedings of the

Sixth IEEE International Conference on Data Mining, 2006.

[8] David J. Bartholomew and Martin Knott. Latent Variable Models and Factor Anal-

ysis. Arnold, 2nd edition, 1999.

[9] Alexander Basilevsky. Statistical Factor Analysis and Related Methods: Theory and

Applications. J. Wiley, 1994.

[10] Francesca Bassi. Latent class factor models for market segmentation: An application

to pharmaceuticals. Statistical Methods and Applications, 16:279–287, 2007.

[11] Peter M. Bentler and David G. Weeks. Linear structural equations with latent

variables. Psychometrika, 45(3):289–308, 1980.

119

[12] Christopher M. Bishop and Michael E. Tipping. A hierarchical latent variable

model for data visualization. IEEE Transaction on Pattern Analysis and Machine

Intelligence, 20(3):281–293, 1998.

[13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation.

Journal of Machine Learning Research, 3:993–1022, 2003.

[14] Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley, New York,

1989.

[15] Kenneth A. Bollen. Latent variables in psychology and the social sciences. Annual

Review of Psychology, 53:605–634, 2002.

[16] Eric T. Bradlow and Alan M. Zaslavsky. A hierarchical latent variable model for

ordinal data from a customer satisfaction survey with “no answer” responses. Journal

of the American Statistical Association, 94(445):43–52, 1999.

[17] Derek C. Briggs. An introduction to multidimensional measurement using Rasch

models. Journal of Applied Measurement, 4(1):87–100, 2003.

[18] Wray Buntine. Variational extensions to EM and multinomial PCA. In Proceedings

of the 13th European Conference on Machine Learning, 2002.

[19] Wray Buntine and Aleks Jakulin. Discrete component analysis. In Subspace, Latent

Structure and Feature Selection, LNCS 3940, pages 1–33. Springer-Verlag, 2006.

[20] Wray L. Buntine. Operations for learning with graphical models. Journal of Arti-

ficial Intelligence Research, 2:159–225, 1994.

[21] John Canny. GaP: a factor model for discrete data. In Proceedings of the 27th

Annual International ACM SIGIR Conference on Research and Development in

Information Retrieval, 2004.

[22] Rich Caruana, Mohamed Elhawary, Nam Nguyen, and Casey Smith. Meta cluster-

ing. In Proceedings of the Sixth International Conference on Data Mining, 2006.

[23] Peter Cheeseman and John Stutz. Bayesian classification (AutoClass): Theory and

results. In Advances in Knowledge Discovery and Data Mining, pages 153–180.

AAAI Press, 1996.

[24] Tao Chen. Search-Based Learning of Latent Tree Models. PhD thesis, Department

of Computer Science and Engineering, The Hong Kong University of Science and

Technology, 2009.

120

[25] Tao Chen and Nevin L. Zhang. Quartet-based learning of hierarchical latent class

models: Discovery of shallow latent variables. In 9th International Symposium of

Artificial Intelligence and Mathematics, 2006.

[26] Tao Chen, Nevin L. Zhang, and Yi Wang. Efficient model evaluation in the search-

based approach to latent structure discovery. In Proceedings of the Fourth European

Workshop on Probabilistic Graphical Models, pages 57–64, 2008.

[27] Tao Chen, Nevin L. Zhang, and Yi Wang. The role of operation granularity in

search-based learning of latent tree models. In The First International Workshop

on Advanced Methodologies for Bayesian Networks, 2010.

[28] Tao Chen, Nevin L. Zhang, Tengfei Liu, Kin Man Poon, and Yi Wang. Model-

based multidimensional clustering of categorical data. Artificial Intelligence, 176:

2246–2269, 2012.

[29] Jie Cheng, Russell Greiner, Jonathan Kelly, David Bell, and Weiru Liu. Learning

Bayesian networks from data: An information-theory based approach. Artificial

Intelligence, 137(1–2):43–90, 2002.

[30] David Maxwell Chickering. Optimal structure identification with greedy search.


[31] David Maxwell Chickering and David Heckerman. Efficient approximations for the

marginal likelihood of Bayesian networks with hidden variables. Machine learning,

29:181–212, 1997.

[32] Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky.

Learning latent tree graphical models. Journal of Machine Learning Research, 12:

1771–1812, 2011.

[33] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with

dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[34] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley,

2nd edition, 2006.

[35] Robert G. Cowell. Local propagation in conditional Gaussian Bayesian networks.


[36] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter.

Probabilistic Networks and Expert Systems. Springer, 1999.

121

[37] Ying Cui, Xiaoli Z. Fern, and Jennifer G. Dy. Non-reduntant multi-view clustering

via orthogonalization. In Proceedings of the Seventh IEEE International Conferene

on Data Mining, 2007.

[38] Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge

University Press, 2009.

[39] Sajib Dasgupta and Vincent Ng. Mining clustering dimensions. In Proceedings of

the 27th International Conference on Machine Learning, 2010.

[40] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incom-

plete data via the EM algorithm. Journal of the Royal Statistical Society. Series B

(Methodological), 39(1):1–38, 1977.

[41] Sara Dolnicar. A review of data-driven market segmentation in tourism. Journal of

Travel and Tourism Marketing, 12(1):1–22, 2002.

[42] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-

Interscience, 2nd edition, 2000.

[43] Jennifer G. Dy and Carla E. Brodley. Feature selection for unsupervised learning.


[44] Gal Elidan and Nir Friedman. Learning the dimensionality of hidden variables. In

Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages

144–151, 2001.

[45] Gal Elidan, Noam Lotner, Nir Friedman, and Daphne Koller. Discovering hidden

variables: A structure-based approach. In Advances in Neural Information Process-

ing Systems, 2001.

[46] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based

algorithm for discovering clusters in large spatial databases with noise. In Proceed-

ings of the 2nd International Conference on Knowledge Discovery and Data Mining,

1996.

[47] Mario A.T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture

models. IEEE Transaction on Pattern Analysis and Machine Intelligence, 24(3):

381–396, 2002.

[48] Ernest Fokoué and D. M. Titterington. Mixtures of factor analysers. Bayesian

estimation and inference by stochastic simulation. Machine Learning, 50:73–97,

2003.

122

[49] Jaime R. S. Fonseca and Margarida G. M. S. Cardoso. Retail clients latent segments.

In Progress in Artificial Intelligence, pages 348–358. Springer, 2005.

[50] Jaime R. S. Fonseca and Margarida G. M. S. Cardoso. Mixture-model cluster

analysis using information theoretical criteria. Intelligent Data Analysis, 11:155–

173, 2007.

[51] Chris Fraley and Adrian E. Raftery. Model-based clustering, discriminant analysis,

and density estimation. Journal of American Statistical Association, 97(458):611–

631, 2002.

[52] Chris Fraley and Adrian E. Raftery. MCLUST version 3 for R: Normal mixture

modeling and model-based clustering. Technical Report 504, Department of Statis-

tics, University of Washington, 2006. (revised 2009).

[53] Nir Friedman. Learning belief networks in the presence of missing values and hidden

variables. In Proceedings of the 14th International Conference on Machine Learning,

1997.

[54] Sylvia Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer,

2006.

[55] Giuliano Galimberti and Gabriele Soffritti. Model-based methods to identify mul-

tiple cluster structures in a data set. Computational Statistics and Data Analysis,

52:520–536, 2007.

[56] Guojun Gan, Chaoqun Ma, and Jianhong Wu. Data Clustering: Theory, Algo-

rithms, and Applications. ASA-SIAM, 2007.

[57] Dan Geiger and David Heckerman. Learning Gaussian networks. Technical Report

MSR-TR-94-10, Microsoft Research, 1994.

[58] Zoubin Ghahramani. An introduction to hidden Markov models and Bayesian net-

works. International Journal of Pattern Recognition and Artificial Intelligence, 15

(1):9–42, 2001.

[59] Zoubin Ghahramani and Matthew J. Beal. Variational interference for Bayesian

mixtures of factor analysers. In Advances in Neural Information Processing Systems

12, 2000.

[60] Debashis Ghosh and Arul M. Chinnaiyan. Mixture modelling of gene expression

data from microarray experiments. Bioinformatics, 18(2):275–286, 2002.

123

[61] David Gondek and Thomas Hofmann. Non-redundant data clustering. In Proceed-

ings of the Fourth IEEE International Conference on Data Mining, 2004.

[62] Leo A. Goodman. Exploratory latent structure analysis using both identifiable and

unidentifiable models. Biometrika, 61(2):215–231, 1974.

[63] Peter J. Green. Penalized likelihood. In Encyclopedia of Statistical Science, Update

Volume 3, pages 578–586, 1999.

[64] T. Haavelmo. The statistical implications of a system of simultaneous equations.

Econometrica, 11:1–12, 1943.

[65] Lars Hagen and Andrew B. Kahng. A new approach to effective circuit clustering.

In Proceedings of IEEE International Conference on Computer Aided Design, pages

422–427, 1992.

[66] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan

Kaufmann, San Francisco, 2000.

[67] Harry H. Harman. Modern Factor Analysis. The University of Chicago Press, 3rd

edition, 1976.

[68] Stefan Harmeling and Christopher K. I. Williams. Greedy learning of binary latent

trees. IEEE Transaction on Pattern Analysis and Machine Intelligence, 33(6):1087–

1097, 2011.

[69] David Heckerman. A tutorial on learning with Bayesian networks. Technical Report

MSR-TR-95-06, Microsoft Research, 1995.

[70] A. E. Henderickson and P. O. White. PROMAX: A quick method for rotation to

oblique simple structure. British Journal of Mathematical and Statistical Psychol-

ogy, 17:65–70, 1964.

[71] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with

neural networks. Science, 313(5786):504–507, 2006.

[72] Geoffrey E. Hinton, Michael Revow, and Peter Dayan. Recognizing handwritten

digits using mixtures of linear models. In Advances in Neural Information Processing

Systems 7, 1995.

[73] Peter D. Hoff. Model-based subspace clustering. Bayesian Analysis, 1(2):321–344,

2006.

124

[74] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the

22nd Annual international SIGIR Conferrence on Research and Development in

Information Retrieval, 1999.

[75] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis.

Machine Learning, 42:177–196, 2001.

[76] Frank Höppner, Frank Klawonn, Rudolf Druse, and Thomas Runkler. Fuzzy Cluster

Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley,

1999.

[77] Lynette Hunt and Murray Jorgensen. Mixture model clustering using the MUL-

TIMIX program. Australian & New Zealand Journal of Statistics, 41(2):154–171,

1999.

[78] Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and

applications. Neural Networks, 13:411–430, 2000.

[79] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis.

John Wiley & Sons, 2001.

[80] Salvatore Ingrassia. A likelihood-based constrained algorithm for multivariate nor-

mal mixture models. Statistical Methods and Applications, 13(2):151–166, 2004.

[81] Manfred Jaeger, Simon Lyager, Michael Vandborg, and Thomas Wohlgemuth. Fac-

torial clustering with an application to plant distribution data. In Proceedings of

the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clus-

terings, pages 31–42, 2011.

[82] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM

Computing Surveys, 31(3):264–323, 1999.

[83] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Simultaneous unsupervised

learning of disparate clusterings. In Proceedings of the Seventh SIAM International

Converence on Data Mining, pages 858–869, 2008.

[84] Kamel Jedidi, Harsharanjeet S. Jagpal, and Wayne S. DeSarbo. Finite-mixture

structural equational models for response-based segmentation and unobserved het-

erogeneity. Marketing Science, 16(1):39–59, 1997.

[85] I. T. Jolliffe. Principal Component Analysis. Springer, 2002.

125

[86] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the

EM algorithm. Neural Computation, 6(2):181–214, 1994.

[87] K. G. Jöreskog. Structural equation models in the social sciences: Specification,

estimation and testing. In Applications of Statistics, pages 265–287, 1977.

[88] H. F. Kaiser. The Varimax criterion for analystic rotation in factor analysis. Psy-

chometrika, 23:187–200, 1958.

[89] David J. Ketchen, JR. and Christopher L. Shook. The application of cluster analysis

in strategic management research: An analysis and critique. Strategic Management

Journal, 17:441–458, 1996.

[90] B. King. Step-wise clustering procedures. Journal of American Statistical Associa-

tion, 69:86–101, 1967.

[91] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and

Techniques. The MIT Press, 2009.

[92] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Clustering high dimensional

data: A survey on subspace clustering, pattern-based clustering, and correlation

clustering. ACM Transactions on Knowledge Discovery from Data, 3(1):1–58, 2009.

[93] James K. Lake. Reconstructing evolutionary trees from DNA and protein sequences:

Paraliner distances. Proceedings of the National Academy of Science, 91:1455–1459,

1994.

[94] Helge Langseth and Thomas D. Nielsen. Classification using hierarchical naïve

Bayes models. Machine Learning, 63:135–159, 2006.

[95] Steffen L. Lauritzen. Propagation of probabilities, means, variances in mixed graph-

ical association models. Journal of the American Statistical Association, 87(420):

1098–1108, 1992.

[96] Steffen L. Lauritzen and Frank Jensen. Stable local computation with conditional

Gaussian distributions. Statistics and Computing, 11:191–203, 2001.

[97] Martin H. C. Law, Mário A. T. Figueiredo, and Anil K. Jain. Simultaneous fea-

ture selection and clustering using mixture models. IEEE Transaction on Pattern

Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

[98] David F. Layton and Richard A. Levine. How much does the far future matter?

A hierarchical Bayesian analysis of the public’s willingness to mitigate ecological

126

impacts of climate change. Journal of the American Statistical Association, 98

(463):533–544, 2003.

[99] Paul F. Lazarsfeld and Neil W. Henry. Latent Structure Analysis. Houghton Mifflin,

Boston, 1968.

[100] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative

matrix factorization. Nature, 401:788–791, 1999.

[101] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factor-

ization. In Advances in Neural Information Processing Systems 13, 2001.

[102] Yuanhong Li, Ming Dong, and Jing Hua. Simultaneous localized feature selection

and model detection for gaussian mixtures. IEEE Transaction on Pattern Analysis

and Machine Intelligence, 31(5):953–960, 2009.

[103] Jun. S Liu, Junni L. Zhang, Michael J. Palumbo, and Charles E. Lawrence. Bayesian

clustering with variable and transformation selections (with discussion). Bayesian

Statistics, 7:249–275, 2003.

[104] John C. Loehlin. Latent Variable Models : An Introduction to Factor, Path, and

Structural Equation Analysis. L. Erlbaum Associates, 2004.

[105] J. MacQueen. Some methods for classification and analysis of multivariate obser-

vations. In Proceedings of the Fifth Berkley Symposium, volume 1, pages 281–297,

1967.

[106] Sara C. Madeira and Arlindo L. Oliveria. Biclustering algorithms for biological data

analysis: A survey. IEEE Transactions on Computational Biology and Bioinformat-

ics, 1(1):24–45, 2004.

[107] Jay Magidson and Jeroen K. Vermunt. Latent class factor and cluster models,

bi-plots, and related graphical displays. Sociological Methodology, 31:221–264, 2001.

[108] Cathy Maugis, Gilles Celeux, and Marie-Laure Martin-Magniette. Variable selection

for clustering with Gaussian mixture models. Biometrics, 65:701–709, 2009.

[109] Cathy Maugis, Gilles Celeux, and Marie-Laure Martin-Magniette. Variable selec-

tion in model-based clustering: A general variable role modeling. Computational

Statistics and Data Analysis, 53:3872–3882, 2009.

[110] G. J. McLachlan, R. W. Bean, and D. Peel. A mixture model-based approach to

the clustering of microarray expression data. Bioinformatics, 18(3):413–422, 2002.

127

[111] Geoffrey J. McLachlan and David Peel. Finite Mixture Models. Wiley, New York,

2000.

[112] Geoffrey J. McLachlan and David Peel. Mixtures of factor analyzers. In Proceedings

of the 17th International Conference on Machine Learning, 2000.

[113] Christopher Meek. Graphical Models: Selecting Causal and Statistical Models. PhD

thesis, Carnegie Mellon University, 1997.

[114] Marina Meila. Comparing clusterings—an information based distance. Journal of

Multivariate Analysis, 98:873–895, 2007.

[115] Marina Meila and Michael I. Jordan. Learning with mixtures of trees. Journal of

Machine Learning Research, 1:1–48, 2000.

[116] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for

approximate inference: An empirical study. In Proceedings of the 15th Conference

on Uncertainty in Artificial Intelligence, 1999.

[117] Bengt O. Muthén. Beyond SEM: General latent variable modeling. Behaviormetrika,

29(1):81–117, 2002.

[118] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clusterings: Analysis

and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.

[119] Donglin Niu, Jennifer G. Dy, and Michael I. Jordan. Multiple non-redundant spec-

tral clustering views. In Proceedings of the 27th International Conference on Ma-

chine Learning, 2010.

[120] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to

variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

[121] Blossom H. Patterson, C. Mitchell Dayton, and Barry I. Graubard. Latent class

analysis of complex sample survey data: Application to dietary data. Journal of

the American Statistical Association, 97(459):721–741, 2002.

[122] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible

Inference. Morgan Kaufmann Publishers, San Mateo, California, 1988.

[123] Leonard K. M. Poon, Nevin L. Zhang, Tengfei Liu, and April H. Liu. Variable

selection in model-based clustering: To do or to facilitate. International Journal of

Approximate Reasoning. Accepted with minor revisions.

128

[124] Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Using Bayesian

networks for model-based multiple clusterings: An example of exploratory analysis

on NBA data. In The 1st International Workshop on Advanced Methodologies for

Bayesian Networks, 2010.

[125] Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Variable selec-

tion in model-based clustering: To do or to facilitate. In Proceedings of the 27th

International Conference on Machine Learning, 2010.

[126] Leonard K. M. Poon, April H. Liu, Tengfei Liu, and Nevin L. Zhang. A model-based

approach to rounding in spectral clustering. In Proceedings of the 28th Conference

on Uncertainty in Artificial Intelligence, 2012.

[127] Girish Punj and David W. Stewart. Cluster analysis in marketing research: Review

and suggestions for application. Journal of Marketing Research, 20:134–148, 1983.

[128] ZiJie Qi and Ian Davidson. A principled and flexible framework for finding alterna-

tive clusterings. In Proceedings of the 15th ACM SIGKDD International Conference

on Knowledge Discovery and Data Mining, 2009.

[129] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications

in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989.

[130] Adrian E. Raftery and Nema Dean. Variable selection for model-based clustering.

Journal of American Statistical Association, 101(473):168–178, 2006.

[131] Venkatram Ramaswamy and Steven H. Cohen. Latent class models for conjoint

analysis. In Conjoint Measurement, pages 295–319. Springer, 2007.

[132] William M. Rand. Objective criteria for the evaluation of clustering methods. Jour-

nal of the American Statistical Association, 66(336):846–850, 1971.

[133] Georg Rasch. On general laws and the meaning of measurement in psychology. In

Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics, and Proba-

bility, volume 4, 1961.

[134] Nicola Rebagliati and Alessandro Verri. Spectral clustering with more than k eigen-

vectors. Neurocomputing, 74:1391–1401, 2011.

[135] Mark D. Reckase. Multidimensional Item Response Theory. Springer, 2009.

[136] Franz Rendl and Henry Wolkowicz. A projection technique for partitioning the

nodes of a graph. Annals of Operations Research, 58(3):155–179, 1995.

129

[137] Sam Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information

Processing Systems 10, pages 626–632, 1998.

[138] Sam Roweis. A unifying review of linear Gaussian models. Neural Computation,

11:305–345, 1999.

[139] Stuart Russell and Peter Norvig. Artificial Intelligence: a Modern Approach. Pren-

tice Hall, 1995.

[140] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6

(2):461–464, 1978.

[141] Steven L. Scott and Edward H. Ip. Empirical Bayes and item-clustering effects in

a latent variable hierarchical model: A case study from the national assessment

of educational progress. Journal of the American Statistical Association, 97(458):

409–419, 2002.

[142] Ross D. Shachter and C. Robert Kenley. Gaussian influence diagrams. Management

Science, 35(5):527–550, 1989.

[143] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. In Pro-

ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

pages 731–737, 1997.

[144] Ricardo Silva, Richard Scheines, Clark Glymour, and Peter Spirtes. Learning the

structure of linear latent variable models. Journal of Machine Learning Research,

7:191–246, 2006.

[145] Anders Skrondal and Sophia Rabe-Hesketh. Generalized Latent Variable Modeling:

Multilevel, Longitudinal, and Structural Equation Models. Chapman & Hall/CRC,

2004.

[146] Anders Skrondal and Sophia Rabe-Hesketh. Latent variable modelling: A survey.

Scandinavian Journal of Statistics, 34(4):712–745, 2007.

[147] P. Sneath. The applications of computers to taxonomy. Journal of General Micro-

biology, 17:201–226, 1957.

[148] Richard Socher, Andrew Maas, and Christopher D. Manning. Spectral Chinese

restaurant processes: Nonparametric clustering based on similarities. In 14th Inter-

national Conference on Artificial Intelligence and Statistics, 2011.

130

[149] Charles Spearman. General intelligence, objectively determined and measured.

American Journal of Psychology, 15:201–293, 1904.

[150] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and

Search. The MIT Press, 2nd edition, 2000.

[151] Douglas Steinley and Michael J. Brusco. Selection of variables in cluster analysis:

An empirical comparison of eight procedures. Psychometrika, 73(1):125–144, 2008.

[152] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse

framework for combining multiple partitions. Journal of Machine Learning Re-

search, 3:583–617, 2002.

[153] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition. Elsevier,

4th edition, 2009.

[154] Bo Thiesson, Christopher Meek, David Maxwell Chickering, and David Heckerman.

Learning mixtures of Bayesian networks. In Proceedings of the 14th Conference of

Uncertainty in Artificial Intelligence, 1998.

[155] Michael E. Tipping and Christopher M. Bishop. Mixtures of probabilistic principal

component analyzers. Neural Computation, 11:443–482, 1999.

[156] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component

analysis. Journal of the Royal Statistical Society, Series B (Statistical Methodology),

61(3):611–622, 1999.

[157] Wim J. van der Linden and Ronald K. Hambleton, editors. Handbook of Modern

Item Response Theory. Springer, 1997.

[158] Peter van der Putten and Maarten van Someren. A bias-variance analysis of a

real world learning problem: The CoIL challenge 2000. Machine learning, 57(1–2):

177–195, 2004.

[159] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing,

17:395–416, 2007.

[160] Ralf Wagner, Sören W. Scholz, and Reinhold Decker. The number of clusters in mar-

ket segmentation. In Data Analysis and Decision Support, pages 157–176. Springer,

2005.

[161] Yi Wang. Latent Tree Models for Multivarite Density Estimation: Algorithms and

Applications. PhD thesis, Department of Computer Science and Engineering, The

Hong Kong University of Science and Technology, 2009.

131

[162] Yi Wang, Nevin L. Zhang, and Tao Chen. Latent tree models and approximate

inference in Bayesian networks. Journal of Artificial Intelligence Research, 32:879–

900, 2008.

[163] Yi Wang, Nevin L. Zhang, Tao Chen, and Leonard K. M. Poon. Latent tree classi-

fier. In Proceedings of the 11th European Conference on Symbolic and Quantitative

Approaches to Reasoning with Uncertainty, 2011.

[164] Michel Wedel and Wagner A. Kamakura. Market Segmentation: Conceptual and

Methodological Foundations. Kluwer Academic Publishers, 2 edition, 2000.

[165] Sewell Wright. On the nature of size factors. Genetics, 3:367–374, 1918.

[166] Tao Xiang and Shaogang Gong. Spectral clustering with eigenvector selection. Pat-

tern Recognition, 41(3):1012–1029, 2008.

[167] Rui Xu and Donald C. Wunsch, II. Clustering. Wiley-IEEE Press, 2009.

[168] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. Model-based

clustering and data transformations for gene expression data. Bioinformatics, 17

(10):977–987, 2001.

[169] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In Advances

in Neural Information Processing Systems, 2005.

[170] Hong Zeng and Yiu-Ming Cheung. A new feature selection method for gaussian

mixture clustering. Pattern Recognition, 42:243–250, 2009.

[171] Nevin L. Zhang. Hierarchical latent class models for cluster analysis. In Proceedings

of the 18th National Conference on Artificial Intelligence, 2002.

[172] Nevin L. Zhang. Hierarchical latent class models for cluster analysis. Journal of

Machine Learning Research, 5:697–723, 2004.

[173] Nevin L. Zhang and Tomáš Kočka. Efficient learning of hierarchical latent class

models. In Proceedings of the 16th IEEE International Conference on Tools with

Artificial Intelligence, pages 585–593, 2004.

[174] Nevin L. Zhang, Thomas D. Nielsen, and Finn V. Jensen. Latent variable discovery

in classification models. Artificial Intelligence in Medicine, 30:283–299, 2004.

[175] Nevin L. Zhang, Yi Wang, and Tao Chen. Discovery of latent structures: Experience

with the CoIL challenge 2000 data set. Journal of Systems Science and Complexity,

21:172–183, 2008.

132

[176] Nevin L. Zhang, Shihong Yuan, Tao Chen, and Yi Wang. Latent tree models and

diagnosis in traditional Chinese medicine. Artificial Intelligence in Medicine, 42:

229–245, 2008.

[177] Nevin L. Zhang, Shihong Yuan, Tao Chen, and Yi Wang. Statistical validation of

TCM theories. Journal of Alternative and Complementary Medicine, 14(5):583–587,

2008.

[178] Zhihua Zhang and Michael I. Jordan. Multiway spectral clustering: A margin-based

perspective. Statistical Science, 23(3):383–403, 2008.

[179] Feng Zhao, Licheng Jiao, Hanqiang Liu, Xinbo Gao, and Maoguo Gong. Spectral

clustering with eigenvector selection based on entropy ranking. Neurocomputing,

73:1704–1717, 2010.

[180] Shi Zhong and Joydeep Ghosh. A unified framework for model-based clustering.


133

APPENDIX A

LIST OF PUBLICATIONS BY THE AUTHOR

• Tao Chen, Nevin L. Zhang, Tengfei Liu, Kin Man Poon, and Yi Wang. Model-

based multidimensional clustering of categorical data. Artificial Intelligence, 176:

2246–2269, 2012.

• Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, and Hua Liu. Fast

multidimensional clustering of categorical data. In Proceedings of the 2nd MultiClust

Workshop: Discovering, Summarizing and Using Multiple Clusterings, 2011

• Leonard K. M. Poon, Nevin L. Zhang, Tengfei Liu, and April H. Liu. Variable

selection in model-based clustering: To do or to facilitate. International Journal of

Approximate Reasoning. Accepted with minor revision.

• Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Using Bayesian

networks for model-based multiple clusterings: An example of exploratory analysis

on NBA data. In The 1st International Workshop on Advanced Methodologies for

Bayesian Networks, 2010.

• Leonard K. M. Poon, Nevin L. Zhang, Tao Chen, and Yi Wang. Variable selec-

tion in model-based clustering: To do or to facilitate. In Proceedings of the 27th

International Conference on Machine Learning, 2010.

• Leonard K. M. Poon, April H. Liu, Tengfei Liu, and Nevin L. Zhang. A model-

based approach to rounding in spectral clustering. In Proceedings of the 28th Con-

ference on Uncertainty in Artificial Intelligence, 2012.

• Yi Wang, Nevin L. Zhang, Tao Chen, Leonard K. M. Poon. Latent tree classi-

fier. In Proceedings of the 11th European Conference on Symbolic and Quantitative

Approaches to Reasoning with Uncertainty, 2011.

• Yi Wang, Nevin L. Zhang, Tao Chen, Leonard K. M. Poon. LTC: A Latent

Tree Approach To Classification. International Journal of Approximate Reasoning.

Accepted with minor revision.

134

latent tree models: an application and an extensionlzhang/paper/pspdf/poon-thesis.pdf · latent...

Documents