UNIVERSITY OF CALIFORNIA
Los Angeles
Mining Techniques for Data Streams and Sequences
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Computer Science
by
Fang Chu
2005
© Copyright by
Fang Chu
2005
The dissertation of Fang Chu is approved.
D. Stott Parker
Adnan Darwiche
Yingnian Wu
Carlo Zaniolo, Committee Chair
University of California, Los Angeles
2005
To Dad, Mom and Yizhou
TABLE OF CONTENTS
1 Introduction
  1.1 Issues in Stream Mining
  1.2 Mining High Dimensional Sequence Data
  1.3 Mining Quality
  1.4 Dissertation Overview
2 Background and Related Work
  2.1 Stream Classification Methods
    2.1.1 Ensemble Theory
    2.1.2 Ensemble Methods for Stream Classification
  2.2 Pattern-Based Subspace Clustering
  2.3 Mining Quality
3 Fast and Light Stream Boosting Ensembles
  3.1 Introduction
  3.2 Adaptive Boosting Ensembles
  3.3 Change Detection
  3.4 Comparison with Bagging Stream Ensembles
    3.4.1 Evaluation of Boosting Scheme
    3.4.2 Learning with Gradual Shifts
    3.4.3 Learning with Abrupt Shifts
    3.4.4 Experiments on Real Life Data
  3.5 Comparison with DWM
  3.6 Summary
4 Robust and Adaptive Stream Ensembles
  4.1 Introduction
  4.2 Adaptation to Concept Drift
  4.3 Robustness to Outliers
  4.4 Model Learning
    4.4.1 Model Formulation
    4.4.2 Inference and Computation
  4.5 Experiments and Discussions
    4.5.1 Evaluation of Adaptation
    4.5.2 Robustness in the Presence of Outliers
    4.5.3 Discussions on Performance Issue
    4.5.4 Experiments on Real Life Data
  4.6 Summary
5 Subspace Pattern Based Sequence Clustering
  5.1 Introduction
    5.1.1 Subspace Pattern Similarity
    5.1.2 Applications
    5.1.3 Our Contributions
  5.2 The Distance Function
    5.2.1 Tabular and Sequential Data
    5.2.2 Sequence-based Pattern Similarity
  5.3 The Clustering Algorithm
    5.3.1 Pattern and Pattern Grids
    5.3.2 The Counting Tree
    5.3.3 Counting Pattern Occurrences
    5.3.4 Clustering
  5.4 Experiments
    5.4.1 Data Sets
    5.4.2 Performance Analysis
    5.4.3 Cluster Analysis
  5.5 Related Work and Discussion
6 Mining Quality
  6.1 Introduction
  6.2 Markov Networks
    6.2.1 Graphical Representation
    6.2.2 Pairwise Markov Networks
    6.2.3 Solving Markov Networks
    6.2.4 Inference by Belief Propagation
  6.3 Application I: Cost-Efficient Sensor Probing
    6.3.1 Problem Description and Data Representation
    6.3.2 Problem Formulation
    6.3.3 Learning and Inference
    6.3.4 Experimental Results
    6.3.5 How BP Works
  6.4 Application II: Enhancing Protein Function Predictions
    6.4.1 Problem Description
    6.4.2 Learning Markov Network
    6.4.3 Experiments
  6.5 Application III: Sequence Data Denoising
    6.5.1 Problem Description and Data Representation
    6.5.2 Learning and Inference
    6.5.3 Experimental Results
  6.6 Related Work and Discussions
7 Conclusions
References
LIST OF FIGURES
1.1 An example of stream version linear regression.
1.2 An example of stream version linear regression.
1.3 An example of stream version linear regression.
3.1 Two types of significant changes. Type I: abrupt changes; Type II: gradual changes over a period of time. These are the changes we aim to detect.
3.2 Performance comparison of the adaptive boosting vs. the bagging on stationary data. The weighted bagging is omitted as it performs almost the same as the bagging.
3.3 Performance comparison of the three ensembles on data with small gradual concept shifts.
3.4 Performance comparison of the ensembles on data with moderate gradual concept shifts.
3.5 Performance comparison of the three ensembles on data with abrupt shifts. Base decision trees have no more than 8 terminal nodes.
3.6 Performance comparison of the three ensembles on data with both abrupt and small shifts. Base decision trees have no more than 8 terminal nodes.
3.7 Performance comparison of the three ensembles on credit card data. Concept shifts are simulated by sorting the transactions by the transaction amount.
3.8 Comparison of the adaptive boosting and the weighted bagging, in terms of (a) building time, and (b) average decision tree size. In (a), the total amount of data is fixed for different block sizes.
3.9 Dynamic Weighted Majority (DWM) ensemble performance on the SEA concepts with 10% class noise.
3.10 Adaptive Boosting ensemble performance on the SEA concepts with 10% class noise.
4.1 Adaptability comparison of the ensemble methods on data with three abrupt shifts.
4.2 Adaptability comparison of the ensemble methods on data with three abrupt shifts mixed with small shifts.
4.3 Robustness comparison of the three ensemble methods for different noise levels.
4.4 In the outliers detected, the normalized ratio of (1) true noisy samples (the upper bar) vs. (2) samples from an emerging concept (the lower bar). The bars correspond to blocks 0-59 in the experiments shown in Fig. 4.2.
4.5 Performance comparison of the ensemble methods with classifiers of different sizes. Robust regression with smaller classifiers is comparable to the others with larger classifiers.
4.6 Performance comparison of the ensembles on credit card data. Base decision trees have no more than 16 terminal nodes. Concept shifts are simulated by sorting the transactions by the transaction amount.
5.1 Objects form patterns in subspaces.
5.2 The meaning of dist_{k,S}(x, y) ≤ δ.
5.3 Pattern grids for subspace {t1, t2, t3}.
5.4 The Counting Tree.
5.5 The Cluster Tree.
5.6 Performance Study: scalability.
5.7 Time vs. distance threshold δ.
5.8 Scalability on sequential dataset.
5.9 A cluster in subspace {2,3,4,5,7,8,10,11,12,13,14,15,16}.
6.1 Example of a Pairwise Markov Network. In (a), the white circles denote the random variables, and the shaded circles denote the external evidence. In (b), the potential functions φ() and ψ() are shown.
6.2 Message passing in a Markov network. Messages are defined by Eqs. (6.3) or (6.4) under two types of rules, respectively.
6.3 Sensor site map in the states of Washington and Oregon.
6.4 Top-K recall rates vs. probing ratios. (a): results obtained by our BP-based probing; (b): by the naive probing. On average, the BP-based approach probes 8% less, and achieves a 13.6% higher recall rate for raw values and a 7.7% higher recall rate for discrete values.
6.5 Belief updates in 6 BP iterations ((0)-(5)). Initially only the four sensors at the corners are probed. The strong beliefs of these four sensors are carried over by their neighbors to sensors throughout the network, causing the beliefs of all sensors to be updated iteratively till convergence.
6.6 Logistic curve that is used to blur the margin between the belief on two classes.
6.7 Distribution of correlation values learned for two functions. Left column function: cell growth; right column function: protein destination. In each column, the distributions from top to bottom are learned from groups (a), (b) and (c), respectively.
6.8 A subgraph in which testing genes got correct class labels due to message passing.
LIST OF TABLES
3.1 Performance comparison of the ensembles on data with varying levels of concept shifts. Top accuracies shown in bold fonts.
3.2 Performance comparison of three ensembles on data with abrupt shifts or mixed shifts. Top accuracies are shown in bold fonts.
4.1 Summary of symbols used
5.1 Expression data of Yeast genes
5.2 A Stream of Events
5.3 A dataset of 3 objects
5.4 Clusters found in the Yeast dataset
5.5 Clusters found in NETVIEW
6.1 Distortion rules and error correction results. Columns 1 and 2 give the rule and mutation rate, respectively. Column 3 is the actual number of times a rule applies, and column 4 is the percentage corrected by BP inference.
ACKNOWLEDGMENTS
At the end of the long journey that was the making of this dissertation, I would
like to thank some of the many people who have helped me in various ways.
First, I would like to thank Professor Carlo Zaniolo for his support and guidance
over the years, and for the numerous and fruitful discussions that laid the foundation
of the research presented here. In particular, I thank him for passing knowledge on to
me, and for teaching me how to do research and solve new problems with persistence.
I also thank Professor Yingnian Wu and Professor Adnan Darwiche for valuable
discussions on statistical modeling and artificial intelligence. I am grateful to Professor
D. Stott Parker for many helpful discussions and brainstorms on data mining, and for
collaboration on several projects.
I would like to thank Dr. Haixun Wang, Dr. Philip Yu and Dr. Wei Fan for many
helpful and interesting discussions during my two summer internships at the IBM
T. J. Watson Research Center. Their help in cultivating my curiosity about machine
learning and data mining was indispensable to the formation of this dissertation.
I thank my colleagues and friends in the Web Information Group: Yijian Bai, Xin
Zhou, Yan-Nei Law, Hetal Thakkar, and Hyun Moon, and our alumnus Dr. Fusheng
Wang. Thank you for your helpful discussions and paper proofreading, and for the
enjoyable environment you have helped to maintain. I am also grateful to friends in
the dbUCLA group: Zhenyu Liu, Yi Xia and many others, for sharing concerns and
happiness, and for the inspiring Friday seminars.
Special thanks go to Yizhou Wang, my husband and co-author. He has not only
kept me cheerful and happy throughout the development of this dissertation, but has
also shown his amazingly broad research interest in data mining, which falls beyond
his major field of computer vision. It has been a wonderful experience to work together
with him. Thank you, Yizhou.
Finally, I would like to thank my parents, Shuquan and Zefu. For as long as I can
remember, they have been a source of continuous support and inspiration. I owe to
them much of my ethic in life and work.
VITA
1975 Born, Shandong, China.
1997 B.S., Computer Science, Peking University, China.
2000 M.S., Computer Science, Peking University, China.
2001 Summer Intern, IBM T. J. Watson Research Center, Hawthorne, New York.
2002 Summer Intern, IBM T. J. Watson Research Center, Hawthorne, New York.
2001–2003 Teaching Assistant, Computer Science Department, UCLA. Taught
143 (database course), 131 (programming language course) and
151B (computer architecture course).
2002 Research Assistant, Molecular Biology Institute, UCLA.
2000–2005 Research Assistant, Computer Science Department, UCLA.
PUBLICATIONS
Fang Chu, Yizhou Wang, Carlo Zaniolo, D. Stott Parker, Improving mining quality
by exploiting data dependency, in Proceedings of the 9th Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD), 2005.
Fang Chu, Yizhou Wang, Carlo Zaniolo, D. Stott Parker, Data Cleaning Using Belief
Propagation, in Proceedings of the 2nd International ACM SIGMOD Workshop on
Information Quality in Information Systems (IQIS), 2005.
Fang Chu, Yizhou Wang, Carlo Zaniolo, An adaptive learning approach for noisy data
streams, in Proceedings of the 4th IEEE International Conference on Data Mining
(ICDM), 2004.
Fang Chu, Carlo Zaniolo, Fast and light boosting for adaptive mining of data streams,
in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD), 2004.
Fang Chu, Yizhou Wang, Carlo Zaniolo, Mining noisy data streams via a discrimi-
native model, in Proceedings of the 7th International Conference on Discovery Science
(DS), 2004.
Haixun Wang, Fang Chu, Wei Fan, Philip S. Yu, Jian Pei, Sequence-based subspace
clustering by pattern similarity, in Proceedings of the 16th International Conference
on Scientific and Statistical Database Management (SSDBM), 2004.
Wei Fan, Fang Chu, Haixun Wang, Philip S. Yu, Pruning cost-sensitive ensembles for
efficient prediction, in Proceedings of the Eighteenth National Conference on Artificial
Intelligence (AAAI), 2002.
ABSTRACT OF THE DISSERTATION
Mining Techniques for Data Streams and Sequences
by
Fang Chu
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2005
Professor Carlo Zaniolo, Chair
Data stream mining and sequence mining have many applications and pose chal-
lenging research problems. Typical applications, such as network monitoring, web
searching, telephone services and credit card purchases, are characterized by the need
to mine continuously massive data streams to discover up-to-date patterns, which are
invaluable for timely strategic decisions. These new requirements call for the design
of new mining methods to replace the traditional ones, since those would require the
data to be first stored and then processed off-line using complex algorithms that make
several passes over the data. Therefore, a first research challenge is designing fast
and light mining methods for data streams — e.g., algorithms that only require one
pass over the data and work with limited memory. Another challenge is created by the
highly dynamic nature of data streams, whereby the stream mining algorithms need
to promptly detect changing concepts and data distributions and adapt to them. While
noise represents a general problem in data mining, it poses new challenges on data
streams insofar as adaptability becomes more difficult when the data stream contains
noise.
The main limitation of data stream mining methods is that they cannot reveal long
term trends, as they only keep a small snapshot of the most recent data. Neither can they
discover very complicated patterns that can be detected by methods that require exten-
sive computational resources. However, these patterns can be discovered by off-line
mining after data streams are stored as sequences. Sequence mining, in general, can
reveal long-term trends and more complicated patterns, defined in a multidimensional
space via some similarity criteria. The key research challenges that arise in this context
include (i) designing metrics that measure the similarity of sequences, (ii) dealing
with high dimensionality, and (iii) achieving scalability.
This dissertation makes a number of contributions toward the solution of these
problems, including the following ones:
1. Adaptive Boosting: A stream ensemble method is proposed that maintains a very
accurate predictive model with fast learning and light memory consumption. The
method is also highly adaptive through novel change detection techniques.
2. Robust Regression Ensemble: This method enhances stream ensemble methods
with outlier detection, within a statistically sound learning framework.
3. SeqClus: A pattern-based subspace clustering algorithm is introduced along
with a novel pattern similarity metric for sequences. The algorithm is scalable
and efficient.
4. Mining Quality: To deal with noise and improve mining quality, a general ap-
proach is introduced based on data dependency. The approach exploits local data
dependency between samples using pairwise Markov Networks and Bayesian
belief propagation techniques.
The efficacy of the techniques proposed was demonstrated through extensive ex-
periments, both on synthetic and on real-life data.
CHAPTER 1
Introduction
Today, many organizations produce and/or consume massive data streams. Mining such
data can reveal up-to-date patterns, which are invaluable for timely decisions. How-
ever, stream mining is strikingly different from traditional mining in several aspects.
First, the need for online responses requires mining to be done very fast. In fact, actual
online systems usually have limited CPU power and memory resources dedicated to
mining tasks. Secondly, the underlying concept that generates the data is highly dy-
namic. Moreover, data streams are very likely to be noisy due to lack of preprocessing.
All these make it compelling to investigate data mining techniques for continuous data
streams containing high volumes of data.
After online processing and mining, data streams are stored as sorted relations
known as sequences. The order is defined by a set of attributes which has a total order,
such as positions or account IDs. Often, the temporal order of the original stream data
is preserved, and such sequences are also referred to as time series data. Sequence data
mining plays a complementary role to data stream mining. While data stream mining
can discover up-to-date patterns invaluable for timely strategic decisions, sequence
data mining can reveal long-term trends and more complicated patterns that lead to
deeper insights.
This dissertation studies several major challenges raised in data stream mining and
sequence mining. Then, it extends the study to a more general problem of improving
mining quality. The specific problems addressed are:
• Performance, adaptability and robustness issues in data stream mining;
• Scalable pattern-based subspace clustering;
• Mining quality improvement by leveraging data dependency.
1.1 Issues in Stream Mining
The first issue is limited computation resources: in many applications, the computation
power and memory at hand does not measure up to the massive amount of data in the
input stream. For example, in a single day, Google serviced more than 150 million
searches; Walmart executed 20 million sales transactions; and Telstra generated 15
million call records. However, traditional data mining algorithms make the assumption
that the resources available will always match the amount of data they process. This
assumption does not hold in data stream mining. Stream mining algorithms must learn
fast and consume little memory.
Figure 1.1: An example of stream version linear regression.
Another characteristic of data streams is that data is no longer a snapshot, but rather
a continuous stream. This means that the concept underlying the data may change
over time. For effective decision making, stream mining must be adaptive to concept
change. For example, when customer purchasing patterns change, marketing strate-
gies based on out-dated transaction data must be modified in order to reflect current
customer needs.
(a) A stream with one underlying concept and noisy examples.
(b) A linear regression algorithm overfits noise if it is too adaptive.
Figure 1.2: An example of stream version linear regression.
Figure 1.1 uses a simple example to illustrate stream mining. The input stream
contains 2 dimensional points (x, y) coming over time t. The horizontal axis denotes
the x dimension, the left vertical axis the y dimension, and the right vertical axis the
time t. In the first time period, data is generated by a concept y = f1(x); in the second
time period the concept changes to y = f2(x). A stream version of linear regression
should be able to learn and adapt to the underlying function, f1 or f2, promptly.
(a) A stream with changing concepts.
(b) A linear regression algorithm overlooks concept change if it is too robust.
Figure 1.3: An example of stream version linear regression.
A particularly challenging issue is to learn changing concepts in the presence of
noise. To the existing model, both noisy examples and the examples from an emerging
new concept manifest themselves as misclassified examples. If an algorithm is designed
primarily to adapt to concept change, it may overfit noise by mistakenly interpreting
noisy examples as the sign of a new concept. In Figure 1.2, the noisy examples in
the middle of this time period are overfit, causing unstable and inaccurate fitting. On
the other hand, if an algorithm is too robust to noise, it may overlook new concepts
and inappropriately stick to outdated concepts. This is illustrated in Figure 1.3, where
the true concept shifts to a new one at time t1, and then comes back to the original at
t2, but the second concept is completely ignored because the regression method is too
“robust”.
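The tension between adaptivity and robustness can be made concrete with a toy sliding-window version of linear regression. The sketch below is purely illustrative (the class name and data are hypothetical, not an algorithm from this dissertation): the window size is the single knob that trades one property for the other.

```python
from collections import deque

class WindowedLinearRegression:
    """Least-squares fit of y = a*x + b over a sliding window.

    The window size is the adaptivity/robustness knob: a small window
    tracks concept change quickly but overfits noise; a large window
    smooths noise away but is slow to notice a new concept.
    """

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # old points fall out automatically

    def update(self, x, y):
        self.window.append((x, y))

    def fit(self):
        # Closed-form simple linear regression over the current window.
        n = len(self.window)
        sx = sum(x for x, _ in self.window)
        sy = sum(y for _, y in self.window)
        sxx = sum(x * x for x, _ in self.window)
        sxy = sum(x * y for x, y in self.window)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return a, b

# Concept f1: y = 2x for t < 50, then an abrupt shift to f2: y = -x + 30.
model = WindowedLinearRegression(window_size=20)
for t in range(100):
    x = float(t % 10)
    y = 2 * x if t < 50 else -x + 30
    model.update(x, y)
slope, intercept = model.fit()  # the window now holds only f2 points
```

Here a window of 20 recovers f2 exactly once the window has turned over; shrinking the window would adapt faster, at the price of fitting any noisy burst, which is precisely the dilemma illustrated in Figures 1.2 and 1.3.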
1.2 Mining High Dimensional Sequence Data
Many types of data streams are eventually stored as ordered sequences. This data contains
invaluable information on various aspects of system operation and usage. For example,
a network system generates event log sequences. Finding patterns in a large data set of
event logs is important to the understanding of the temporal causal relationships among
the events, which often provide actionable insights for determining problems in system
management. Another example is a web server that logs user browsing sessions and
paths. Finding access patterns from this log data gives important clues on profitable
marketing strategies as well as directions of how to improve user experiences for more
successful e-commerce business.
Subspace clustering represents a very useful mining technique that can cope with
high dimensionality. The main objective of clustering is to find high quality clusters
within a reasonable time. However, in high dimensional data, it is common for all ob-
jects in a dataset to be nearly equidistant from each other, completely masking the clus-
ters. This is well known as the curse of dimensionality [R.E61]. Subspace clustering is
an extension of traditional clustering that seeks to find clusters in different subspaces.
Subspace clustering algorithms localize the search for relevant dimensions, hence they
can find meaningful clusters despite the noisy dimensions. In other words, subspace
clustering alleviates the problem caused by the curse of dimensionality. But this gain
in cluster quality is achieved at the expense of a much higher computation complex-
ity, as the number of possible subspaces is huge in high dimensional space. In fact,
scalability is always the core concern of subspace clustering. Research has never
stopped seeking clustering algorithms that scale with respect to the number of
objects and the number of dimensions of the objects, as well as the dimensionality of
the subspaces where the clusters are found.
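The equidistance phenomenon is easy to demonstrate with a small experiment (an illustrative sketch of our own, not taken from the dissertation): as dimensionality grows, the farthest and nearest pairwise distances among random points become nearly equal, so a full-space distance can no longer separate clusters from background.

```python
import math
import random

def distance_contrast(dim, n_points=100, seed=0):
    """Ratio of the largest to the smallest pairwise Euclidean distance
    among n_points random points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n_points) for j in range(i + 1, n_points)]
    return max(dists) / min(dists)

# In 2 dimensions the nearest and farthest pairs differ sharply; in 500
# dimensions all pairs are nearly equidistant, masking any clusters.
low_dim_contrast = distance_contrast(dim=2)
high_dim_contrast = distance_contrast(dim=500)
```

Subspace clustering sidesteps this collapse by measuring distances only over the few dimensions relevant to each cluster, which is exactly why the search over subspaces, and hence scalability, becomes the central cost.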
A range of problems motivates further extensions of clustering. Traditional
clustering, including subspace clustering, focuses on grouping objects with
value-based similarity. That is, a similarity metric is defined on absolute values in a set
of dimensions. In applications of collaborative filtering and bio-data mining, however,
people are more interested in capturing the coherence exhibited by a subset of objects
in some subspace. In microarray data analysis, for example, finding coherent genes
means finding those that respond similarly to environmental conditions. The absolute
response rates are often very different, but the type of response and the timing may be
similar. Along this direction, several research efforts have studied clustering based on
pattern similarity.
This dissertation focuses on pattern-based subspace clustering. In particular, we
study sequential patterns and the scalability of algorithms.
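To make the contrast with value-based similarity concrete, a shift-based pattern dissimilarity can be sketched as below. This is a simplified illustration in the spirit of the subspace pattern similarity studied in Chapter 5; the function name and exact formula are ours, not the dissertation's definition.

```python
def shift_distance(x, y, subspace):
    """Shift-based pattern dissimilarity of objects x and y on a subspace.

    x and y follow the same "shifting pattern" on the chosen columns when
    x[c] - y[c] is (nearly) constant across them; absolute magnitudes are
    irrelevant. The spread of the per-column offsets is 0 for a perfect
    pattern match, growing as the rise-and-fall shapes diverge.
    """
    offsets = [x[c] - y[c] for c in subspace]
    return max(offsets) - min(offsets)

# Two "genes" with very different absolute expression levels but an
# identical response pattern on conditions {0, 1, 2}:
g1 = [10.0, 14.0, 11.0, 3.0]
g2 = [100.0, 104.0, 101.0, 50.0]
pattern_dist = shift_distance(g1, g2, subspace=[0, 1, 2])  # 0.0: coherent
# A value-based metric on the same columns sees them as far apart:
value_dist = sum((g1[c] - g2[c]) ** 2 for c in [0, 1, 2]) ** 0.5
```

Under a value-based metric these two objects would never share a cluster, yet under pattern similarity they are perfectly coherent on the chosen subspace, which is the behavior desired in collaborative filtering and microarray analysis.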
1.3 Mining Quality
The quality of mining results concerns the evaluation of many components, including
data quality, model quality, and method quality. Ongoing research also encompasses theo-
retical aspects including quality definition and quality model. Most approaches focus
on specific environments, such as document quality, data warehouse quality, ontology
quality, and so on.
This dissertation is interested in quality mining from low-quality data. A consensus
among data mining practitioners is that low data quality often leads to wrong decisions
or even ruins the projects—unless proper preprocessing techniques have been adopted
in advance. As a result, data quality related issues have become more and more crucial
and have consumed a majority of the time and budget of data mining.
Low data quality has various causes. During data collection and preparation,
data may be biased by human habits or by device faults. For example, it is well
known that when we human beings record readings from blood pressure monitors, we
tend to round the readings to a multiple of ten. Data can also be corrupted during
transmission through networks. In summary, low quality data is characterized by
missing values, ambiguity or redundancy.
We propose a general technique that can improve data quality by exploiting data
dependency, thus improving mining quality. By learning the data dependency, missing
values can be filled in and noisy values can be corrected. The general techniques
discussed here can be applied directly to data stream mining. One reason is that data
streams are noisy in many applications. Another scenario is when it is impossible or
undesirable to acquire all the data. For instance, due to resource limitations, there may
be multiple data streams but only some of them can be monitored.
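The intuition behind exploiting data dependency can be conveyed with a drastically simplified sketch (a hypothetical stand-in for the pairwise Markov network and belief propagation machinery developed in Chapter 6): each sample adopts the label most of its neighbors in a dependency graph agree on, so an isolated corrupted value is voted back into line.

```python
def denoise(labels, neighbors, rounds=5):
    """Repeatedly relabel each sample with the strict-majority label of its
    neighbors in the dependency graph; ties keep the current label."""
    labels = list(labels)
    for _ in range(rounds):
        updated = list(labels)
        for i, nbrs in enumerate(neighbors):
            votes = {}
            for j in nbrs:
                votes[labels[j]] = votes.get(labels[j], 0) + 1
            top = max(votes.values())
            winners = [lab for lab, v in votes.items() if v == top]
            if len(winners) == 1:  # a strict majority exists
                updated[i] = winners[0]
        labels = updated
    return labels

# A chain of six correlated readings; the reading at position 2 is corrupted.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
noisy = ['A', 'A', 'B', 'A', 'A', 'A']
cleaned = denoise(noisy, chain)  # the corrupted 'B' is voted back to 'A'
```

Belief propagation generalizes this hard vote to weighted, probabilistic messages, which is what makes it applicable to the sensor probing, protein function and denoising problems studied later.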
1.4 Dissertation Overview
The goal of this chapter has been to set up the appropriate context within which this
dissertation is developed. An introduction to the problems in data stream learning and
sequence data mining has been presented, along with some general thoughts on mining
quality.
The remaining chapters of this dissertation are organized as follows:
Chapter 2 introduces background work. For stream mining, it categorizes previ-
ous work into single model-based and ensemble-based approaches, and then puts the
focus on recent work using ensembles. It also includes a short discussion on tradi-
tional ensemble theory. For sequence data mining, it follows the search direction from
multi-dimensional subspace clustering to pattern-based subspace clustering. Finally it
discusses the existing work on improving data quality and mining quality.
Chapter 3 presents a novel approach (Adaptive Boosting) for stream learning. The
approach addresses two important issues in stream mining: performance and adaptation.
It shows that the approach is fast and light in terms of CPU time and memory
requirements, and highly adaptive through explicit concept change detection.
Chapter 4 introduces a robust stream learning algorithm (Robust Regression En-
semble). In addition to the performance and adaptation issues, this approach enhances the
stream ensemble methods with outlier detection. It is developed within a statistically
sound learning framework.
Chapter 5 presents a highly scalable clustering method on sequence data. Sub-
space pattern similarity is used as the similarity measure among objects, so as to find
strikingly coherent objects. An efficient grid and density based algorithm is presented.
Chapter 6 addresses a general problem in data mining field: mining quality. It
explores local data dependency, which is abundant in many applications,
and its potential use in improving data quality and, ultimately, mining quality.
Finally, Chapter 7 summarizes the work presented in this dissertation. It contains
a brief description of the new algorithms introduced in this dissertation, and points out
a few important directions for future research.
CHAPTER 2
Background and Related Work
2.1 Stream Classification Methods
As data stream mining has recently become an important research domain, much work
has been done on classification [DH00, HSD01, SK01], regression analysis [CDH+00]
and clustering [GMMO00]. In this dissertation we focus on stream classification.
As discussed in Chapter 1, concept drift is one of the central issues in stream data
mining. This problem has been addressed in both the machine learning and data mining
communities. The first systems capable of handling concept drift were STAGGER
[SG86], IB3 [AKA91] and the FLORA family [WK96]. These algorithms provided
valuable insights, but, as they were developed and tested only on small datasets, it has
not been established to what degree these approaches scale to large problems.
Several scalable learning algorithms designed for data streams have been proposed re-
cently. They either maintain a single model incrementally, or maintain an ensemble of
base learners. The first category includes the Hoeffding tree [HSD01], which grows a
decision tree by splitting a node on an attribute only when that attribute is statistically
predictive. Hoeffding-tree-like algorithms need a large training set in order to reach fair
performance, which makes them unsuitable for situations featuring frequent changes.
Domeniconi and Gunopulos [DG01] designed an incremental support vector machine
algorithm for continuous learning, but due to the high complexity of support
vector machines, the memory requirement and CPU time are still relatively
large.
The second category of learning algorithms for data streams is based on ensem-
bles. First we give a brief description of ensemble methods.
2.1.1 Ensemble Theory
Ensemble methods have long been studied in machine learning. They are meta-learning
techniques that construct a collection of classifiers and then classify new data points by
taking a vote of their predictions. The base classifiers in an ensemble are diverse, and this
diversity is achieved by manipulating one of three aspects of classifier construction: the
training samples, the learning procedure, or the output. A large body of evaluations
demonstrates that ensembles perform better than single classifiers [FS96, BR99,
DR99, Die00]. In [Die00], Dietterich gives three fundamental reasons why ensemble
methods often perform better than any single classifier: statistical, computational and
representational. (1) The first reason is statistical. A learning algorithm can be viewed
as searching a space H of hypotheses to identify the best approximation of the true
classification function f, but the training data is often not sufficient for this large hy-
pothesis space. By constructing an ensemble of classifiers and “averaging” their votes,
we can get a statistically better approximation of the true hypothesis. (2) The second
reason is computational. Even with sufficient training data, many learning algorithms
work by performing some local search that may get stuck in local optima. An ensemble
constructed by running the local search from many different starting points often
provides a better approximation to the true function than any individual classifier. (3)
The third reason is representational. In many applications, the true function f cannot
be represented by any of the hypotheses in H. By combining multiple hypotheses in
various ways, it is possible to expand the space of representable functions using the
hypotheses in H.
2.1.2 Ensemble Methods for Stream Classification
Because ensemble methods have statistical, computational and representational ad-
vantages, they have been adapted to stream scenarios. We review the most recent stream
ensembles: two of them have the flavor of traditional bagging ensembles, and the other
builds an ensemble using an incremental base learning algorithm.
Traditional bagging operates by invoking a base learning algorithm many times
with different training sets. Each training set is a bootstrap replica of the original
training set. In other words, given a training set S of n examples, a new training set S ′
is constructed by drawing n independent samples uniformly with replacement. With
the bagging method, classifiers are learned individually, and samples in a bootstrap replica
have uniform weights.
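The bootstrap replica construction can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import random

def bootstrap_replica(samples, rng=random.Random(0)):
    """Draw len(samples) examples uniformly with replacement,
    as traditional bagging does for each base classifier."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

S = [("x%d" % i, i % 2) for i in range(10)]  # toy labeled training set
S_prime = bootstrap_replica(S)
print(len(S_prime) == len(S))        # replica has the same size n
print(all(s in S for s in S_prime))  # every draw comes from the original set
```

Because the draws are made with replacement, some examples appear multiple times in S′ while others are absent, which is what makes the replicas (and hence the learned classifiers) differ.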
Two recent works on stream ensembles resemble the traditional bagging method.
Street et al. [SK01] propose an algorithm that builds an ensemble by partitioning
the input data stream into fixed-size data blocks and learning one classifier per block.
Adaptability is achieved solely by retiring old classifiers one at a time. Wang et al.
[WFYH03] propose a similar method, except that their algorithm tries to adapt to
changes by assigning weights to classifiers proportional to their accuracy on the most
recent data block. Both methods learn individual classifiers independently and use
uniform sample weights. In other words, they resemble the traditional bagging en-
semble methods, and hence are expected to perform well if the concept underlying the
data stream is stable. However, neither of them tackles concept change explicitly. They
rely on the natural adaptation yielded by learning new ensemble members and retiring
old ones gradually, and a little on ensemble weighting in the case of [WFYH03].
However, as we will show in Chapter 3, this reactive strategy is not sufficient.
For ease of later reference, we call them “Bagging” and “Weighted Bagging”, re-
spectively.
Another stream ensemble method, Dynamic Weighted Majority (DWM), is pro-
posed in [KM01]. Contrary to the aforementioned bagging-style stream methods, the clas-
sifiers in a DWM ensemble are continually updated in an incremental fashion, upon the
arrival of each single example xi. Not only is the classifier model updated by incor-
porating the knowledge from this new example xi, but the classifier weights are also
decreased by a damping factor β in case of a misclassification. The algorithm also
retires from the ensemble those classifiers whose weights drop below a user-
specified threshold θ.
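A simplified sketch of the DWM weight update just described (the β and θ values and the classifier interface here are illustrative choices of ours, not taken from [KM01]; the incremental training of each classifier is elided):

```python
def dwm_update(ensemble, x, y, beta=0.5, theta=0.01):
    """One DWM-style step on example (x, y): a misclassifying classifier
    has its weight damped by beta, and classifiers whose weight falls
    below theta are retired. `ensemble` is a list of
    (predict_fn, weight) pairs."""
    updated = []
    for predict, w in ensemble:
        if predict(x) != y:
            w *= beta          # damp the weight on a mistake
        if w >= theta:         # retire classifiers below the threshold
            updated.append((predict, w))
    return updated

always_one = lambda x: 1
always_zero = lambda x: 0
E = [(always_one, 1.0), (always_zero, 1.0)]
E = dwm_update(E, x=None, y=1)   # always_zero is wrong, its weight is halved
print([w for _, w in E])         # → [1.0, 0.5]
```

Repeated mistakes shrink a classifier's weight geometrically, so after enough misclassifications it falls below θ and is dropped from the ensemble.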
Updating weights and incrementally training all classifiers are the key design points
of DWM. The intention is to give the base learners the opportunity to recover
from concept drift. However, it is hard to set the parameter θ, which determines when
to discard a poor classifier. If θ is too high, the ensemble will be volatile to noise. If it
is too low, poor classifiers will have a negative effect on the overall ensemble perfor-
mance before they can be identified as out of date. Our conjecture is that a DWM ensemble
cannot recover very quickly from a sudden concept change, and this is verified in our
experiments, shown later in Chapter 3.
2.2 Pattern-Based Subspace Clustering
Pattern-based subspace clustering is an emerging new research area. We review the
research work to date.
Cheng et al. [CC00] introduced the bicluster model. The model was proposed in
the bioinformatics field and is used to discover clusters of genes showing very similar
rising or falling coherence in expression levels under a set of conditions. Let X be the
set of genes, Y the set of conditions. Let I ⊂ X and J ⊂ Y be subsets of genes and
conditions. The pair (I, J) specifies a submatrix A_IJ with the following mean squared
residue score:

H(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} (d_ij − d_iJ − d_Ij + d_IJ)²

where

d_iJ = (1/|J|) Σ_{j∈J} d_ij,   d_Ij = (1/|I|) Σ_{i∈I} d_ij,   d_IJ = (1/(|I||J|)) Σ_{i∈I, j∈J} d_ij

are the row means, the column means, and the mean of the submatrix A_IJ, respectively.
The submatrix A_IJ is called a δ-bicluster if H(I, J) ≤ δ for some δ > 0. A randomized
algorithm is designed to find such clusters in a DNA array.
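The mean squared residue score can be computed directly from its definition (a small illustration; the example matrix is ours):

```python
def mean_squared_residue(A, I, J):
    """H(I, J) for the submatrix of A restricted to rows I and columns J,
    following the delta-bicluster definition above."""
    d_iJ = {i: sum(A[i][j] for j in J) / len(J) for i in I}  # row means
    d_Ij = {j: sum(A[i][j] for i in I) / len(I) for j in J}  # column means
    d_IJ = sum(A[i][j] for i in I for j in J) / (len(I) * len(J))
    return sum((A[i][j] - d_iJ[i] - d_Ij[j] + d_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))

# Rows that differ by a constant shift rise and fall together perfectly,
# so the residue is zero:
A = [[1, 3, 2],
     [4, 6, 5]]   # row 1 = row 0 + 3
print(mean_squared_residue(A, I=[0, 1], J=[0, 1, 2]))  # → 0.0
```

Note that the score averages the residues over the whole submatrix, which is exactly why a submatrix of a δ-bicluster need not itself be a δ-bicluster.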
The limitations of this pioneering work are two-fold:
1. The mean squared residue is an averaged measurement of the coherence of a
set of objects. It does not have the desirable Apriori-like property; that is, a
submatrix of a δ-bicluster is not necessarily a δ-bicluster. This creates difficulty
in designing an efficient bottom-up or top-down algorithm.
2. Bicluster is a greedy algorithm. After finding a bicluster, it randomizes the data
in the corresponding submatrix before moving on to find other biclusters. This
randomization destroys clusters that overlap with already found ones.
Yang et al. [YWWY02] proposed the δ-cluster algorithm to find biclusters more
efficiently. Pearson's R correlation is used to measure coherence among instances, and
residue is used to measure the decrease in coherence that a particular attribute or in-
stance brings to a cluster. The algorithm starts with a random set of seeds and iteratively
improves the overall cluster quality by randomly swapping attributes and data points to improve
individual clusters. The iterative process terminates when individual improvement lev-
els off in each cluster. It avoids the cluster-overlapping problem by finding all clusters
in parallel.
One of the primary problems with δ-cluster is that it takes the number of clusters
as an input parameter. Setting this parameter relies on domain knowledge that is
not always available, while the running time is particularly sensitive to the cluster
size parameter: if the value chosen is very different from the optimal cluster size, the
algorithm can take considerably longer to terminate. δ-cluster does not have the Apriori-like
property either, and hence is still not very efficient.
Wang et al. [WWYY02] developed the pCluster model, in which the cluster defini-
tion has the Apriori property. Let O be a subset of objects and T a subset of attributes;
(O, T) forms a matrix. Given two objects x, y ∈ O and attributes a, b ∈ T, the pScore of the 2×2
matrix is defined as:

pScore( [ d_xa  d_xb ; d_ya  d_yb ] ) = |(d_xa − d_xb) − (d_ya − d_yb)|

(O, T) forms a pCluster if, for any 2×2 submatrix X in (O, T), pScore(X) ≤ δ
for some δ ≥ 0.
Since this definition has the Apriori property, an Apriori-like iterative algorithm
was developed in [WWYY02]. First, it finds all the correlated patterns for every 2 ob-
jects, and all the correlated patterns for every 2 attributes. Then, it iteratively generates
longer candidate patterns and finds larger pClusters.
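The pScore and the pCluster condition translate directly into code (a brute-force check of the definition, not the Apriori-style mining algorithm of [WWYY02]; the example matrix is ours):

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_pcluster(A, objects, attrs, delta):
    """(O, T) is a pCluster iff every 2x2 submatrix has pScore <= delta."""
    return all(p_score(A[x][a], A[x][b], A[y][a], A[y][b]) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(attrs, 2))

A = [[1, 4, 2],
     [3, 6, 4]]   # object 1 = object 0 shifted by 2
print(p_score(1, 4, 3, 6))                                   # → 0
print(is_pcluster(A, objects=[0, 1], attrs=[0, 1, 2], delta=0))  # → True
```

Because the condition is required of every 2×2 submatrix, any submatrix of a pCluster is itself a pCluster, which is precisely the Apriori property the iterative algorithm exploits.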
Although pCluster is the state-of-the-art pattern-based subspace clustering algorithm, its effi-
ciency is still far from desirable. In fact, the first step of finding all length-2
patterns has a complexity of O(N2M + M2N), where N is the number of objects and
M the dimensionality. The remaining work of finding all subspace patterns is NP-
hard, as it is equivalent to finding all cliques in a graph.
More efficient algorithms are desired for pattern-based subspace clustering. We
will describe our approach in Chapter 5.
2.3 Mining Quality
Mining quality can be improved by cleaning poor data, using more appropriate mining
models, or using more effective mining methods.
Techniques for improving data quality proposed in the literature have addressed
a wide range of problems caused by noise or missing data. In the information retrieval
field, grammatical rules are usually defined to remove noise [SM83]. A great amount
of work dealing with noise or missing values has been proposed for the purpose of a
specific mining task; for example, ASSEMBLE [BDM02], co-training [BM98]
and mixed models [NMTM00] are all for semi-supervised learning from both labeled and
unlabeled data. Another body of work corrects noisy labels or attributes, or fills in
missing values, using classification rules [ZWC03, Ten99].
Our work differs from the above in that it is a general-purpose method for data
cleaning, requiring neither domain knowledge nor a specific mining purpose. We
exploit local data dependencies to infer missing values or correct ambiguous
values.
Furthermore, the technique we use can not only improve data quality,
but can also enhance the mining results of data mining methods that assume inde-
pendence among data instances. We present our study in Chapter 6.
CHAPTER 3
Fast and Light Stream Boosting Ensembles
This chapter presents a novel approach for stream learning. The approach addresses
two important issues in stream mining: performance and adaptation.
3.1 Introduction
A substantial amount of recent work has focused on continuous mining of data streams
[DH00, GGRL02, HSD01, SK01, WFYH03]. Typical applications include network
traffic monitoring, credit card fraud detection and sensor network management sys-
tems. Challenges are posed by data ever increasing in amount and in speed, as well as
the constantly evolving concepts underlying the data. Two fundamental issues have to
be addressed by any continuous mining attempt.
Performance Issue. Constrained by the requirement of on-line response and by
limited computation and memory resources, continuous data stream mining should
conform to the following criteria: (1) Learning should be done very fast, preferably
in one pass of the data; (2) Algorithms should make very light demands on memory
resources, for the storage of either the intermediate results or the final decision models.
These fast and light requirements exclude high-cost algorithms, such as support vector
machines; also decision trees with many nodes should preferably be replaced by those
with fewer nodes as base decision models.
Adaptation Issue. For traditional learning tasks, the data is stationary. That is, the
underlying concept that maps the features to class labels is unchanging [WK96]. In
the context of data streams, however, the concept may drift due to gradual or sudden
changes of the external environment, such as increases of network traffic or failures
in sensors. In fact, mining changes is considered to be one of the core issues of data
stream mining [DHL+03].
In this chapter we focus on continuous learning tasks, and propose a novel Adaptive
Boosting Ensemble method to solve the above problems. In general, ensemble methods
combine the predictions of multiple base models, each learned using a learning algo-
rithm called the base learner [Die00]. In our method, we propose to use very simple
base models, such as decision trees with a few nodes, to achieve fast and light learn-
ing. Since simple models are often weak predictive models by themselves, we exploit
the boosting technique to improve the ensemble performance. Traditional boosting is
modified to handle data streams, retaining the essential idea of dynamic sample-weight
assignment yet eliminating the requirement of multiple passes through the data. This
is then extended to handle concept drift via change detection. Change detection aims
at significant changes that would cause serious deterioration of the ensemble perfor-
mance. The awareness of changes makes it possible to build an active learning system
that adapts to changes promptly.
The remainder of this chapter is organized as follows. Our adaptive boosting en-
semble method is presented in section 3.2, followed by a change detection technique
in section 3.3. Sections 3.4 and 3.5 contain experimental evaluation results against two
types of state-of-the-art stream ensembles, and we conclude in section 3.6.
3.2 Adaptive Boosting Ensembles
We use the boosting ensemble method since this learning procedure provides a number
of formal guarantees. Freund and Schapire proved a number of positive results about
its generalization performance [SFB97]. More importantly, Friedman et al. showed
that boosting is particularly effective when the base models are simple [FHT98]. This
is most desirable for fast and light ensemble learning on stream data.
In its original form, the boosting algorithm assumes a static training set. Earlier
classifiers increase the weights of misclassified samples, so that the later classifiers will
focus on them. A typical boosting ensemble usually contains hundreds of classifiers.
However, this lengthy learning procedure does not apply to data streams, where we
have limited storage but continuous incoming data. Past data cannot stay long before
making room for new data. In light of this, our boosting algorithm requires only two
passes of the data. At the same time, it is designed to retain the essential idea of
boosting—the dynamic sample weights modification.
Algorithm 1 is a summary of our boosting process. As data continuously flows
in, it is broken into blocks of equal size. A block Bj is scanned twice. The first pass
is to assign sample weights, in a way corresponding to AdaBoost.M1 [FS96]. That is,
if the ensemble error rate is ej , the weight of a misclassified sample xi is adjusted to
be wi = (1 − ej)/ej . The weight of a correctly classified sample is left unchanged.
The weights are normalized to be a valid distribution. In the second pass, a classifier
is constructed from this weighted training block.
The system keeps only the most recent classifiers, up to M. We use a traditional
scheme to combine the predictions of these base models, that is, by averaging the
probability predictions and selecting the class with the highest probability. Algorithm
1 is for binary classification, but can easily be extended to multi-class problems.
Algorithm 1 Adaptive boosting ensemble algorithm
Output: a boosting ensemble Eb with classifiers {C1, · · · , Cm}, m ≤ M.
1: while (1) do
2:   Given a new block Bj = {(x1, y1), · · · , (xn, yn)}, where yi ∈ {0, 1},
3:   Compute the ensemble prediction for sample i: Eb(xi) = round((1/m) Σ_{k=1}^{m} Ck(xi)),
4:   Change Detection: Eb ⇐ ∅ if a change is detected!
5:   if (Eb ≠ ∅) then
6:     Compute the error rate of Eb on Bj: ej = E[1_{Eb(xi) ≠ yi}],
7:     Set the new sample weight wi = (1 − ej)/ej if Eb(xi) ≠ yi; wi = 1 otherwise
8:   else
9:     Set wi = 1 for all i.
10:  end if
11:  Learn a new classifier Cm+1 from the weighted block Bj with weights {wi},
12:  Update Eb: add Cm+1, retire C1 if m = M.
13: end while
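To make the per-block loop of Algorithm 1 concrete, here is a minimal Python sketch of one iteration (the base learner, the change detector, and all names here are stand-ins of ours, not the actual implementation):

```python
def process_block(ensemble, block, train, change_detected, M=30):
    """One iteration of the while-loop in Algorithm 1.
    `ensemble`: list of classifiers (callables x -> 0 or 1);
    `train(block, weights)`: returns a new classifier;
    `change_detected(err)`: the test developed in Section 3.3."""
    xs = [x for x, _ in block]
    ys = [y for _, y in block]
    if ensemble:
        # ensemble prediction: average the votes and round
        preds = [round(sum(c(x) for c in ensemble) / len(ensemble)) for x in xs]
        err = sum(p != y for p, y in zip(preds, ys)) / len(ys)
        if change_detected(err):
            ensemble = []              # discard the obsolete ensemble
    if ensemble:
        # AdaBoost.M1-style reweighting of misclassified samples
        weights = [(1 - err) / err if p != y else 1.0
                   for p, y in zip(preds, ys)]
    else:
        weights = [1.0] * len(ys)
    ensemble = ensemble + [train(block, weights)]
    if len(ensemble) > M:
        ensemble = ensemble[1:]        # retire the oldest classifier
    return ensemble

# toy run: constant stubs stand in for real small decision trees
train = lambda block, w: (lambda x: 1)
E = process_block([], [((0.1,), 1), ((0.2,), 1)], train, lambda e: False)
E = process_block(E, [((0.3,), 1), ((0.4,), 0)], train, lambda e: False)
print(len(E))  # → 2
```

The two passes of the text correspond to computing `err`/`weights` (first pass) and calling `train` on the weighted block (second pass).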
Adaptability Note that there is a step called “Change Detection” (line 4) in Al-
gorithm 1. This is a distinguishing feature of our boosting ensemble, which guarantees
that the ensemble can adapt promptly to changes. Change detection is conducted at
every block. The details of how to detect changes are presented in the next section.
Our ensemble scheme achieves adaptability by actively detecting changes and dis-
carding the old ensemble when an alarm of change is raised. No previous learning
algorithm has used such a scheme. One argument is that old classifiers can be tuned to
the new concept by assigning them different weights. Our hypothesis, which is borne
out by experiment, is that obsolete classifiers have bad effects on overall ensemble
performance even when they are weighed down. Therefore, we propose to learn a new
ensemble from scratch when changes occur. Slow learning is not a concern here, as our
base learner is fast and light, and boosting ensures high accuracy. The main challenge
is to detect changes with a low false alarm rate.
Figure 3.1: Two types of significant changes. Type I: abrupt changes; Type II: gradual
changes over a period of time. These are the changes we aim to detect.
3.3 Change Detection
In this section we propose a technique for change detection based on the framework
of statistical decision theory. The objective is to detect changes that cause significant
deterioration in ensemble performance, while tolerating minor changes due to random
noise. Here, we view ensemble performance θ as a random variable. If data is sta-
tionary and fairly uniform, the ensemble performance fluctuations are caused only by
random noise, hence θ is normally assumed to follow a Gaussian distribution. When
data changes, yet most of the obsolete classifiers are kept, the overall ensemble per-
formance will undergo one of two types of decrease. In the case of an abrupt change, the distri-
bution of θ will change from one Gaussian to another, as shown in Figure 3.1(a).
Another situation is when the underlying concept has constant but small shifts. This
will cause the ensemble performance to deteriorate gradually, as shown in Figure 3.1(b).
Our goal is to detect both types of significant changes.
Every change detection algorithm is a certain form of hypothesis test. To make a
decision whether or not a change has occurred is to choose between two competing
hypotheses: the null hypothesis H0 or the alternative hypothesis H1, corresponding
to a decision of no-change or change, respectively. Suppose the ensemble has an
accuracy θj on block j. If the conditional probability density function (pdf) of θ under
the null hypothesis p(θ|H0) and that under the alternative hypothesis p(θ|H1) are both
known, we can make a decision using a likelihood ratio test:
L(θj) = p(θj|H1) / p(θj|H0)  ≷  τ.    (3.1)
The ratio is compared against a threshold τ . H1 is accepted if L(θj) ≥ τ , and
rejected otherwise. τ is chosen so as to ensure an upper bound of false alarm rate.
Now consider how to detect a possible type I change. When the null hypothesis
H0 (no change) is true, the conditional pdf is assumed to be a Gaussian, given by
p(θ|H0) = (1 / √(2πσ0²)) exp{ −(θ − µ0)² / (2σ0²) },    (3.2)
where the mean µ0 and the variance σ0² can be easily estimated if we just remember
a sequence of the most recent θ’s. But if the alternative hypothesis H1 is true, it is not
possible to estimate p(θ|H1) before sufficient information is collected. This would mean a
long delay before the change could be detected. In order to detect changes in a timely
fashion, we perform a significance test that uses H0 alone. A significance test assesses how well
the null hypothesis H0 explains the observed θ. Then the general likelihood ratio test
in Equation 3.1 is reduced to:
p(θj|H0)  ≷  τ.    (3.3)
When the likelihood p(θj|H0) ≥ τ, the null hypothesis is accepted; otherwise it is
rejected. Significance tests are effective in capturing large, abrupt changes.
For type II changes, we perform a typical hypothesis test as follows. First, we split
the history sequence of θ’s into two halves. A Gaussian pdf can be estimated from each
half, denoted as G0 and G1. Then a likelihood ratio test in Equation 3.1 is conducted.
So far we have described two techniques aimed at the two types of changes. They
are integrated into a two-stage method as follows. As a first step, a significance test is
performed. If no change is detected, then a hypothesis test is performed as a second
step. This two-stage detection method is shown experimentally to be very effective.
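The two-stage method can be sketched as follows (the thresholds, the variance floor, and the halving of the history are illustrative choices of ours, not values from this dissertation):

```python
import math

def gaussian_logpdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def estimate(thetas):
    mu = sum(thetas) / len(thetas)
    var = sum((t - mu) ** 2 for t in thetas) / len(thetas)
    return mu, max(var, 1e-6)   # floor the variance for numerical stability

def change_detected(history, theta_j, sig_tau=-8.0, lr_tau=2.0):
    """Stage 1: significance test under H0, for abrupt (type I) changes.
    Stage 2: likelihood ratio test between Gaussians fit to the two halves
    of the history, for gradual (type II) changes."""
    mu0, var0 = estimate(history)
    if gaussian_logpdf(theta_j, mu0, var0) < sig_tau:
        return True                         # abrupt change
    half = len(history) // 2
    mu_a, var_a = estimate(history[:half])  # older half -> H0
    mu_b, var_b = estimate(history[half:])  # recent half -> H1
    ratio = (gaussian_logpdf(theta_j, mu_b, var_b)
             - gaussian_logpdf(theta_j, mu_a, var_a))
    return ratio > lr_tau                   # gradual change

stable = [0.90, 0.91, 0.89, 0.90, 0.91, 0.90]
print(change_detected(stable, 0.90))   # → False
print(change_detected(stable, 0.40))   # → True (abrupt accuracy drop)
```

In practice the thresholds would be chosen to bound the false alarm rate, as discussed above.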
3.4 Comparison with Bagging Stream Ensembles
In this section, we first perform a controlled study on a synthetic data set, then apply
the method to a real-life application.
We evaluate our boosting scheme extended with change detection, named Adap-
tive Boosting, and compare it with Weighted Bagging [WFYH03] and Bagging [SK01].
These two bagging ensemble methods were described in Section 2.1.2.
In the following experiments, we use decision trees as our base model, but the
boosting technique can, in principle, be used with any other traditional learning model.
The standard C4.5 algorithm is modified to generate small decision trees as base mod-
els, with the number of terminal nodes ranging from 2 to 32. Full-grown decision trees
generated by C4.5 are also used for comparison, marked as fullsize in Figures 3.2-3.4
and Tables 3.1-3.2.
Synthetic Data
In the synthetic data set for the controlled study, a sample (x, y) has three independent
features x = < x1, x2, x3 >, xi ∈ [0, 1], i = 1, 2, 3. Geometrically, samples are points
in a 3-dimensional unit cube. The real class boundary is a sphere defined as

B(x) = Σ_{i=1}^{3} (xi − ci)² − r² = 0

where c = < c1, c2, c3 > is the center of the sphere and r the radius. y = 1 if B(x) ≤ 0,
y = 0 otherwise. This learning task is not easy due to the continuous feature space and
the non-linear class boundary.
To simulate a data stream with concept drift, we move the center c of the sphere
that defines the class boundary between adjacent blocks. The movement is along each
dimension with a step of ±δ. The value of δ controls the level of shifts from small,
moderate to large, and the sign of δ is randomly assigned independently along each
dimension. For example, if a block has c = (0.40, 0.60, 0.50), δ = 0.05, the sign along
each direction is (+1,−1,−1), then the next block would have c = (0.45, 0.55, 0.45).
The value of δ ought to be in a reasonable range, to keep the portion of samples that
change class labels reasonable. In our setting, we consider a concept shift small if δ is
around 0.02, and relatively large if δ is around 0.1.
To study the model robustness, we insert noise into the training data sets by ran-
domly flipping the class labels with a probability of p, p = 10%, 15%, 20%. Clean
testing data sets are used in all the experiments for accuracy evaluation.
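Under the stated setting, such a drifting stream can be generated as follows (a sketch; the radius r and the random seeds are our illustrative choices):

```python
import random

def make_block(center, r=0.25, n=1000, noise=0.0, rng=random.Random(42)):
    """Generate one data block: points uniform in the unit cube, labeled by
    the sphere B(x), with labels flipped with probability `noise`."""
    block = []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]
        y = 1 if sum((xi - ci) ** 2 for xi, ci in zip(x, center)) <= r ** 2 else 0
        if rng.random() < noise:
            y = 1 - y          # inject class-label noise
        block.append((x, y))
    return block

def drift(center, delta, rng=random.Random(7)):
    """Shift the sphere center by ±delta independently along each dimension."""
    return [c + rng.choice([-delta, delta]) for c in center]

c = [0.5, 0.5, 0.5]
blocks = []
for _ in range(5):
    blocks.append(make_block(c, noise=0.10))
    c = drift(c, delta=0.02)
print(len(blocks), len(blocks[0]))  # → 5 1000
```

Each call to `drift` moves the class boundary between adjacent blocks, producing the gradual concept shift described above.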
Credit Card Data
We also evaluate our algorithm on a real life data containing 100k credit card trans-
actions. The data has 20 features including the transaction amount, the time of the
transaction, etc. The task is to predict fraudulent transactions. Detailed data descrip-
tion is given in [SFL+97].
3.4.1 Evaluation of Boosting Scheme
The boosting scheme is first compared against two bagging ensembles on stationary
data. Samples are randomly generated in the unit cube. Noise is introduced in the
training data by randomly flipping the class labels with a probability of p. Each data
block has n samples and there are 100 blocks in total. The testing data set contains 50k
Figure 3.2: Performance comparison of the adaptive boosting vs the bagging on sta-
tionary data. The weighted bagging is omitted as it performs almost the same as the
bagging.
noiseless samples uniformly distributed in the unit cube. An ensemble of M classifiers
is maintained. It is updated after each block and evaluated on the test data set. Perfor-
mance is measured using the generalization accuracy averaged over 100 ensembles.
Figure 3.2 shows the generalization performance when p=5%, n=2k and M=30.
Weighted bagging is omitted from the figure because it makes almost the same predic-
tions as bagging, an unsurprising result for stationary data. Figure 3.2 shows that the
boosting scheme clearly outperforms bagging. Most importantly, boosting ensembles
with very simple trees perform well. In fact, the boosted two-level trees (2 termi-
nal nodes) have a performance comparable to bagging using the full-size trees. This
supports the theoretical result that boosting improves weak learners.
Higher accuracy of boosted weak learners is also observed for (1) block size n of
500, 1k, 2k and 4k, (2) ensemble size M of 10, 20, 30, 40, 50, and (3) noise level of
5%, 10% and 20%.
Figure 3.3: Performance comparison of the three ensembles on data with small gradual
concept shifts.
3.4.2 Learning with Gradual Shifts
Gradual concept shifts are introduced by moving the center of the class boundary be-
tween adjacent blocks. The movement is along each dimension with a step of ±δ.
The value of δ controls the level of shifts from small to moderate, and the sign of δ is
randomly assigned. The percentage of positive samples in these blocks ranges from
16% to 25%. Noise level p is set to be 5%, 10% and 20% across multiple runs.
The average accuracies are shown in Figure 3.3 for small shifts (δ = 0.01), and in
Figure 3.4 for moderate shifts (δ = 0.03). Results for other settings are shown in Table
3.1. These experiments are conducted with a block size of 2k; similar results are
obtained for other block sizes. The results are summarized below:
• Adaptive boosting outperforms the two bagging methods at all times, demonstrating
the benefits of the change detection technique; and
• Boosting is especially effective with simple trees (terminal nodes ≤ 8), achieving
a performance comparable with, or even better than, that of the bagging ensembles
with large trees.

Figure 3.4: Performance comparison of the ensembles on data with moderate gradual
concept shifts.

                    δ = .005                          δ = .02
                  2       4       8      fullsize   2       4       8      fullsize
Adaptive Boosting 89.2%   93.2%   93.9%  94.9%      92.2%   94.5%   95.7%  95.8%
Weighted Bagging  71.8%   84.2%   89.6%  91.8%      83.7%   92.0%   93.2%  94.2%
Bagging           71.8%   84.4%   90.0%  92.5%      83.7%   91.4%   92.4%  90.7%

Table 3.1: Performance comparison of the ensembles on data with varying levels of
concept shifts. Top accuracies shown in bold fonts.
Figure 3.5: Performance comparison of the three ensembles on data with abrupt shifts.
Base decision trees have no more than 8 terminal nodes.
3.4.3 Learning with Abrupt Shifts
We study learning with abrupt shifts in two sets of experiments. Abrupt concept
shifts are introduced every 40 blocks; three abrupt shifts occur at blocks 40, 80 and
120. In one set of experiments, the data stays stationary between these blocks. In the other
set, small shifts are mixed in between adjacent blocks. The concept drift parameters are
set to δ1 = ±0.1 for abrupt shifts, and δ2 = ±0.01 for small shifts.
Figures 3.5 and 3.6 show the experiments when base decision trees have no
more than 8 terminal nodes. Clearly the bagging ensembles, even with an empirical
weighting scheme, are seriously impaired at the changing points. Our hypothesis, that
obsolete classifiers are detrimental to overall performance even if they are weighed
down, is borne out experimentally. The adaptive boosting ensemble, on the other hand, is
able to respond promptly to abrupt changes by explicit change detection efforts. For
base models of different sizes, we show some of the results in Table 3.2. The accuracy
is averaged over 160 blocks for each run.
Figure 3.6: Performance comparison of the three ensembles on data with both abrupt
and small shifts. Base decision trees have no more than 8 terminal nodes.
                    δ2 = 0.00            δ2 = ±0.01
δ1 = ±0.1           4       fullsize     4       fullsize
Adaptive Boosting   93.2%   95.1%        93.1%   94.1%
Weighted Bagging    86.3%   92.5%        86.6%   91.3%
Bagging             86.3%   92.7%        85.0%   88.1%

Table 3.2: Performance comparison of three ensembles on data with abrupt shifts or
mixed shifts. Top accuracies are shown in bold fonts.
3.4.4 Experiments on Real Life Data
In this subsection we further evaluate our algorithm on a real-life data set containing 100k
credit card transactions. The data has 20 features, including the transaction amount, the
time of the transaction, etc.; the task is to predict fraudulent transactions, and a detailed
data description is given in [SFL+97]. The portion of the data we use contains 100k
transactions, each with a transaction amount between $0 and $21. Concept drift is
simulated by sorting the transactions by transaction amount.
Figure 3.7: Performance comparison of the three ensembles on credit card data. Con-
cept shifts are simulated by sorting the transactions by the transaction amount.
We study the ensemble performance using varying block sizes (1k, 2k, 3k and 4k),
and different base models (decision trees with terminal nodes no more than 2, 4, 8 and
full-size trees). We show one experiment in Figure 3.7, where the block size is 1k,
and the base models have at most 8 terminal nodes. The curve shows three dramatic
drops in accuracy for bagging, two for weighted bagging, but only a small one for
adaptive boosting. These drops occur when the transaction amount jumps. Overall,
the boosting ensemble is much better than the two bagging ensembles. This is also true
for the other experiments, whose details are omitted here for brevity.
Figure 3.8: Comparison of the adaptive boosting and the weighted bagging, in terms
of (a) building time, and (b) average decision tree size. In (a), the total amount of data
is fixed for different block sizes.
The boosting scheme is also the fastest. Moreover, the training time is almost
unaffected by the size of the base models. This is due to the fact that the later base
models tend to have very simple structures; many of them are just decision stumps
(one-level decision trees). On the other hand, the training time of the bagging methods
increases dramatically as the base decision trees grow larger. For example, when the
base decision tree is full-grown, weighted bagging takes 5 times longer to do the
training and produces a tree 7 times larger on average. The comparison is conducted
on a 2.26GHz Pentium 4 processor. Details are shown in Figure 3.8.
To summarize, the real application experiment confirms the advantages of our
boosting ensemble method over the bagging ensembles: it is fast and light, with good
adaptability.
3.5 Comparison with DWM
In this section, we compare our Adaptive Boosting method with the Dynamic Weighted
Majority (DWM) [KM01] method, which is described in Chapter 2.
Since the performance report of DWM on a synthetic problem is publicly available,
we evaluate Adaptive Boosting on the same problem. This problem, called the
“SEA Concepts”, has three attributes, xi ∈ R with 0.0 ≤ xi ≤ 10.0.
The target concept is x1 + x2 ≤ b; hence x3 is an irrelevant attribute. The presentation
of training examples lasts for 50,000 time steps. Concept change is simulated by
varying the value of b across quarters. For the first quarter (i.e., 12,500 time
steps), the target concept uses b = 8; for the second, b = 9; the third, b = 7;
and the fourth, b = 9.5. For each of these four periods, a training set of 12,500
examples is generated randomly, and 10% class noise is added. Another 2,500
examples are randomly generated for testing in each period. In the original DWM
experimental design, one example is fed to the method at each time step, and the
ensemble performance is evaluated against the testing samples at each time step. For
Adaptive Boosting, we feed a small block of examples periodically: once every 500
time steps, after a data block of 500 examples has accumulated, we learn a new
classifier from this block, update the ensemble, and evaluate the new ensemble
against the testing samples. As in DWM, we repeat this procedure ten
times, averaging accuracy over the runs. Both methods use Naive Bayes as the base
learner. The original DWM diagram also shows 95% confidence intervals; we do not
compute confidence intervals, as the experimental results are already sufficient to draw
a conclusion.
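The SEA data generation just described can be sketched as follows. This is a minimal illustration written for this discussion: the function names and seeding scheme are ours, not part of the original SEA experiments.

```python
import random

def sea_block(b, n, noise=0.10, seed=None):
    """Generate n SEA examples for threshold b: label = 1 iff x1 + x2 <= b.
    x3 is an irrelevant attribute; each label is flipped with probability
    `noise` to simulate class noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(0.0, 10.0) for _ in range(3)]
        y = 1 if x[0] + x[1] <= b else 0
        if rng.random() < noise:
            y = 1 - y          # 10% class noise by default
        data.append((x, y))
    return data

# Four quarters of 12,500 training examples each, with b = 8, 9, 7, 9.5,
# plus 2,500 test examples per quarter (generated noise-free here).
stream = [sea_block(b, 12500, seed=i) for i, b in enumerate([8, 9, 7, 9.5])]
tests = [sea_block(b, 2500, noise=0.0, seed=100 + i)
         for i, b in enumerate([8, 9, 7, 9.5])]
```

Blocks of 500 consecutive examples from `stream` would then play the role of the data blocks fed to Adaptive Boosting.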
The performance of DWM is shown in Figure 3.9, where DWM is denoted “DWM-NB”
(“NB” stands for Naive Bayes). (Two other methods are also shown there, but for our
purpose we can ignore them.) Adaptive Boosting is shown in Figure 3.10.
The first observation is that Adaptive Boosting suffers much less at concept-changing
points than DWM. Secondly, Adaptive Boosting is more accurate
than DWM on average. DWM is beaten for the same reason that Weighted Bagging
is: it keeps outdated classifiers until their weights drop below a user-specified
threshold θ. It is hard to set this parameter θ. If θ is too high, the ensemble will be
sensitive to noise; if it is too low, poor classifiers will have a negative effect on the
overall ensemble performance before they are identified as out of date. The inclusion
of outdated classifiers inevitably leads to slow adaptation and low average accuracy.
Figure 3.9: Dynamic Weighted Majority (DWM) ensemble performance on the SEA
concepts with 10% class noise.
3.6 Summary
In this chapter, we propose an adaptive boosting ensemble method that is different
from previous work in two aspects: (1) We boost very simple base models to build
effective ensembles with competitive accuracy; and (2) We propose a change detection
technique to actively adapt to changes in the underlying concept. We compare adaptive
Figure 3.10: Adaptive Boosting ensemble performance on the SEA concepts with 10%
class noise.
boosting ensemble methods with two bagging ensemble-based methods and the Dynamic
Weighted Majority method through extensive experiments. Results on both synthetic
and real-life data sets show that our method is much faster, demands less memory, and
is more adaptive and accurate.
The current method can be improved in several aspects. For example, our study of
the trend of the underlying concept is limited to the detection of significant changes.
If changes could be detected on a finer scale, new classifiers would not need to be built
when changes are trivial, and training time could be further reduced without loss of
accuracy. We also plan to study a classifier weighting scheme to improve ensemble
accuracy.
CHAPTER 4
Robust and Adaptive Stream Ensembles
The major limitation of Adaptive Boosting concerns noise. The boosting technique
has been demonstrated, under many scenarios, to be sensitive to noise. In this
chapter we discuss a novel discriminative model which, in addition to learning quickly
and adapting to changing concepts, is very robust to noise in data streams. The new
technique operates under the EM framework, in which noise identification and model
refinement mutually reinforce each other, leading to a robust discriminative model.
4.1 Introduction
Noise can severely impair the quality and speed of learning. This problem is encountered
in many applications where the source data can be unreliable, and errors can also
be injected during data transmission. The problem is even more challenging for data
streams, where it is difficult to distinguish noise from data caused by concept drift. If
an algorithm is too eager to adapt to concept changes, it may overfit noise by mistakenly
interpreting it as data from a new concept. If the algorithm is too conservative
and slow to adapt, it may overlook important changes (and, for instance, miss out on
the opportunities created by a timely identification of new trends in the marketplace).
In Chapter 3 we reviewed quite a number of stream learning algorithms, but
none of them provides a mechanism for noise identification, often referred to
interchangeably as outlier detection (the term we use hereafter). Although there have been
a number of off-line algorithms [AY01, RRS00, BKNS00, KM03, BF96] for outlier
detection, they are unsuitable for stream data: they assume a single unchanging data
model and hence are unable to distinguish noise from data caused by concept drift. In
addition, outlier detection in stream data faces general problems such as the choice of a
distance metric. Most of the traditional approaches use Euclidean distance, which
cannot handle categorical values.
Our Method - Robust Regression Ensemble Method
To address the three above-mentioned issues, we propose a novel discriminative
model, the Robust Regression Ensemble Method, for adaptive learning on noisy data
streams with modest resource consumption. For a learnable concept, the class of a
sample conditionally follows a Bernoulli distribution. Our method assigns classifier
weights in a way that maximizes the likelihood of the training data under the learned
distribution. This weighting scheme has a theoretical guarantee of adaptability. In
addition, as we have verified experimentally, our weighting scheme can also boost a
collection of weak classifiers into a strong ensemble. Examples of weak classifiers
include decision trees with very few nodes. Weak classifiers are desirable because
they learn faster and consume fewer resources.
Our outlier detection differs from previous approaches in that it is tightly integrated
into the adaptive model learning. The motivation is that outliers are directly defined by
the current concept, so the outlier identifying strategy needs to be modified whenever
the concept drifts away. In our integrated learning, outliers are defined as samples with
a small likelihood given the current model, and then the model is refined on the training
data with outliers removed. The overall learning is an iterative process in which the
model learning and outlier detection mutually reinforce each other.
Another advantage of our outlier detection technique is the general distance metric
for identifying outliers. We define a distance metric based on predictions of the current
35
ensemble, instead of a function in the data space. It can handle both numerical and
categorical values.
The remainder of this chapter is organized as follows. Sections 4.2 and 4.3 describe
the discriminative model with regard to adaptation and robustness, respectively.
Section 4.4 gives the model formulation and computation. Experimental results
are shown in Section 4.5.
4.2 Adaptation to Concept Drift
Ensemble weighting is the key to fast adaptation. Here we show that this problem can
be formulated as a statistical optimization problem solvable by logistic regression.
We first look at how an ensemble is constructed and maintained. The data stream
is simply partitioned into small blocks of fixed size, and a classifier is learned from
each block. The most recent K classifiers comprise the ensemble, and old classifiers
retire sequentially by age. Besides a set of training examples for classifier learning,
another set of training examples is needed for classifier weighting. If training data is
sufficient, we can reserve part of it for weight training; otherwise, randomly sampled
training examples can serve the purpose. We only need to keep the two data sets
as synchronized as possible. When sufficient training data is collected for classifier
learning and ensemble weighting, the following steps are conducted: (1) learn a new
classifier from the training block; (2) replace the oldest classifier in the ensemble with
the newly learned one; and (3) weight the ensemble.
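The three maintenance steps above can be sketched as follows. This is a schematic outline under our own naming: the stub learner and the uniform weighting are placeholders for the base learner and the logistic-regression weighting described in this chapter.

```python
from collections import deque

K = 8  # ensemble capacity: keep the K most recent classifiers

def train_classifier(block):
    # Stub learner: a decision stump on the first feature, thresholded at the
    # mean of the positive examples. Any base learner could be used here.
    pos = [x[0] for x, y in block if y == 1]
    t = sum(pos) / len(pos) if pos else 0.0
    return lambda x, t=t: 1 if x[0] <= t else 0

def weight_ensemble(ensemble, weight_block):
    # Stub weighting: uniform weights. The chapter replaces this with
    # logistic regression over the classifiers' predictions (Section 4.4).
    return [1.0 / len(ensemble)] * len(ensemble)

ensemble = deque(maxlen=K)   # the oldest classifier retires automatically

def on_new_block(train_block, weight_block):
    f_new = train_classifier(train_block)      # (1) learn a new classifier
    ensemble.append(f_new)                     # (2) oldest drops out if full
    return weight_ensemble(ensemble, weight_block)  # (3) weight the ensemble
```

The `deque` with `maxlen=K` directly implements retirement by age: appending the (K+1)-th classifier silently discards the oldest one.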
The rest of this section gives a formal description of ensemble weighting. A two-
class classification setting is considered for simplicity, but the treatment can be ex-
tended to multi-class tasks.
36
The training data for ensemble weighting is represented as

(X, Y) = {(xi, yi); i = 1, · · · , N}

where xi is a vector-valued sample attribute and yi ∈ {0, 1} is the sample class label.
We assume an ensemble of classifiers, denoted in vector form as

f = (f1(x), · · · , fK(x))^T

where each fk(x) is a classifier function producing a value for the belief on a class. The
individual classifiers in the ensemble may be weak or out-of-date. It is the goal of our
discriminative model M to make the ensemble strong by weighted voting. Classifier
weights are model parameters, denoted as

w = (w1, · · · , wK)^T

where wk is the weight associated with classifier fk. The model M also specifies a
weighted voting scheme for decision making, that is, w^T · f.

Because the ensemble prediction w^T · f is a continuous value, yet the class label yi
to be decided is discrete, a standard approach is to assume that yi conditionally follows
a Bernoulli distribution parameterized by a latent score ηi:

yi | xi; f, w ∼ Ber(q(ηi)),   ηi = w^T · f(xi)      (4.1)

where q(ηi) is the logit transformation of ηi:

q(ηi) := logit(ηi) = e^ηi / (1 + e^ηi)
37
Eq. 4.1 states that yi follows a Bernoulli distribution with parameter q; thus the
posterior probability is

p(yi | xi; f, w) = q^yi (1 − q)^(1−yi)      (4.2)
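As a concrete check of these formulas, the following is our own minimal sketch of the score ηi, the logistic transform q, and the Bernoulli likelihood of Eq. 4.2:

```python
import math

def q(eta):
    """The logistic transform q(eta) = e^eta / (1 + e^eta) of Eq. 4.1."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_likelihood(y, x, fs, w):
    """p(y | x; f, w) = q^y (1 - q)^(1 - y), as in Eq. 4.2.
    fs is the list of classifier functions, w the list of weights."""
    eta = sum(wk * fk(x) for wk, fk in zip(w, fs))   # eta = w^T f(x)
    p = q(eta)
    return p if y == 1 else 1.0 - p
```

With all weights zero, the ensemble is uninformative and both classes get likelihood 0.5; a large positive score pushes the likelihood of class 1 toward one.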
The above description leads to optimizing classifier weights using logistic regression.
Given a data set (X, Y) and an ensemble f, the logistic regression technique
optimizes the classifier weights by maximizing the likelihood of the data. The
optimization problem can be solved quickly by a standard iterative procedure. We
postpone the detailed model computation until Section 4.4.
Logistic regression is a well-established regression method, widely used in
traditional areas where the regressors are continuous and the responses are discrete [HTF00].
In our work, we formulate the classifier weighting problem as an optimization problem
and solve it using logistic regression. In Section 4.5 we show that this formulation
and solution provide much better adaptability than previous work. (Refer to Fig. 4.1-4.2
in Section 4.5 for a quick reference.)
4.3 Robustness to Outliers
Regression is adaptive because it always tries to fit the data from the current concept.
But it can potentially overfit outliers. We integrate the following outlier detection
technique into the model learning.
We define outliers as samples with a small likelihood under a given data model.
The goal of learning is to compute a model that best fits the bulk of the data, that is,
the inliers. Whether a sample is an outlier is hidden information in this problem. This
suggests solving the problem under the EM framework, using a robust statistical
formulation.
Previously we described a training data set as {(xi, yi), i = 1, · · · , N}, or (X, Y).
This is an incomplete data set, as the outlier information is missing. A complete data
set is a triplet

(X, Y, Z)

where Z = {z1, · · · , zN} is a hidden variable that distinguishes the outliers from the
inliers: zi = 1 if (xi, yi) is an outlier, zi = 0 otherwise. This Z is not observable and
needs to be inferred. After the values of Z are inferred, (X, Y) can be partitioned into
a clean sample set

(X0, Y0) = {(xi, yi, zi) : xi ∈ X, yi ∈ Y, zi = 0}

and an outlier set

(Xφ, Yφ) = {(xi, yi, zi) : xi ∈ X, yi ∈ Y, zi = 1}

The samples in (X0, Y0) are assumed to all come from one underlying distribution,
and they are used to fit the model parameters.
To infer the outlier indicator Z, we introduce a new model parameter λ. It is a
threshold value of sample likelihood. A sample is marked as an outlier if its likeli-
hood falls below λ. This λ, together with f (classifier functions) and w (classifier
weights) discussed earlier, constitutes the complete set of parameters of our discrimi-
native model M , denoted as M(x; f, w, λ).
4.4 Model Learning
In this section, we give the model formulation, followed by the model computation.
The symbols used are summarized in Table 4.1.
(xi, yi)     a sample, with xi the sample attribute and yi the sample class label,
(X, Y)       an incomplete data set without outlier information,
Z            a hidden variable,
(X, Y, Z)    a complete data set with outlier information,
(X0, Y0)     a clean data set,
(Xφ, Yφ)     an outlier set,
M            the discriminative model,
f            a vector of classifier functions, a model parameter,
w            a vector of classifier weights, a model parameter,
λ            a threshold on likelihood, a model parameter.
Table 4.1: Summary of symbols used
4.4.1 Model Formulation
Our model has a four-tuple representation M(x; f, w, λ). Given a training data set
(X, Y) and an ensemble of classifiers f = (f1(x), · · · , fK(x))^T, we want to achieve two
objectives.
1. To infer the hidden variable Z that distinguishes the inliers (X0, Y0) from the
outliers (Xφ, Yφ).
2. To compute the optimal fit for the model parameters w and λ in the discriminative
model M(x; f, w, λ).
Each inlier sample (xi, yi) ∈ (X0, Y0) is assumed to be drawn independently from an
identical distribution belonging to a probability family characterized by parameters w,
denoted by a density function p((x, y); f, w). The problem is to find the values of w
that maximize the likelihood of (X0, Y0) in the probability family. As customary, we
use the log-likelihood to simplify the computation:

log p((X0, Y0) | f, w)

A parametric model for the outlier distribution is not available because outliers are
highly irregular. We use instead a non-parametric statistic based on the number of
outliers, ‖(Xφ, Yφ)‖. The problem then becomes an optimization problem. The
score function to be maximized involves two parts: (i) the log-likelihood term for the
inliers (X0, Y0), and (ii) a penalty term for the outliers (Xφ, Yφ). That is:

(w, λ)∗ = arg max_(w,λ) { log p((X0, Y0) | f, w) − ζ((Xφ, Yφ); w, λ) }      (4.3)

where the penalty term, which penalizes having too many outliers, is defined as

ζ((Xφ, Yφ); w, λ) = e · ‖(Xφ, Yφ)‖      (4.4)

w and λ affect ζ implicitly. The value of e depends empirically on the size of the
training data. In our experiments we set e ∈ (0.2, 0.3).
After expanding the log-likelihood term, we have:

log p((X0, Y0) | f, w)
  = Σ_{xi ∈ X0} log p((xi, yi) | f, w)
  = Σ_{xi ∈ X0} log p(yi | xi; f, w) + Σ_{xi ∈ X0} log p(xi)

Absorbing Σ_{xi ∈ X0} log p(xi) into the penalty term ζ((Xφ, Yφ); w, λ), and replacing
the likelihood in Eq. 4.3 with the logistic form (Eq. 4.2), the optimization goal becomes
finding the best fit (w, λ)∗:

(w, λ)∗ = arg max_(w,λ) { Σ_{xi ∈ X0} [ yi log q + (1 − yi) log(1 − q) ] − ζ((Xφ, Yφ); w, λ) }      (4.5)
The score function to be maximized is not differentiable because of the non-parametric
penalty term. We have to resort to a more elaborate technique based on the Expectation-
Maximization (EM) [Bil98] algorithm to solve the problem.
4.4.2 Inference and Computation
The main goal of model computation is to infer the missing variables and compute the
optimal model parameters, under the EM framework. The EM in general is a method
for maximizing data likelihood in problems where data is incomplete. The algorithm
iteratively performs an Expectation-Step (E-Step) followed by an Maximization-Step
(M-Step) until convergence. In our case,
1. E-Step: to impute / infer the outlier indicator Z based on the current model
parameters (w, λ).
2. M-Step: to compute new values for (w, λ) that maximize the score function in
Eq. 4.3 with current Z.
Next we will discuss how to impute outliers in E-Step, and how to solve the maxi-
mization problem in M-Step. The M-Step is actually a Maximum Likelihood Estima-
tion (MLE) problem.
E-Step: Impute Outliers
With the current model parameters w (classifier weights), the model for clean data
is established as in Eq. 4.1; that is, the class label yi of a sample xi follows a Bernoulli
distribution parameterized by the ensemble prediction for this sample, w^T · f(xi).
Thus, yi's likelihood p(yi | xi; f, w) can be computed by Eq. 4.2.
Note that the line between outliers and inliers is drawn by λ, which is computed in
the previous M-Step. So the formulation for imputing outliers is straightforward:

zi = sign( log p(yi | xi; f, w) − λ )      (4.6)

where

sign(x) = 1 if x < 0, and 0 otherwise.
M-Step: MLE
The score function (Eq. 4.5) to be maximized is not differentiable because of
the penalty term. We consider a simple approach for an approximate solution, in which
the computation of λ and w is separated.
1. λ is computed by running the standard K-means clustering algorithm on the
log-likelihoods p(yi | xi; f, w). In our experiments we choose K = 3. The cluster
boundaries are candidates for the likelihood threshold λ∗ separating outliers from
inliers.
2. Fixing each candidate λ∗, w∗ can be computed using the standard MLE
procedure. An MLE procedure is run for each candidate λ∗, and the maximum
likelihood identifies the best fit (w, λ)∗.
The standard MLE procedure for computing w is described as follows. Taking the
derivative of the inlier likelihood with respect to w and setting it to zero, we have

∂/∂w Σ_{yi ∈ Y0} ( yi · e^ηi / (1 + e^ηi) + (1 − yi) · 1 / (1 + e^ηi) ) = 0

To solve this equation, we use the Newton-Raphson procedure, which requires the
first and second derivatives. For clarity of notation, we use h(w) to denote the inlier
likelihood function with regard to w. Starting from w^t, a single Newton-Raphson
update is

w^(t+1) = w^t − ( ∂²h(w^t) / ∂w∂w^T )^(−1) · ∂h(w^t) / ∂w

Here we have

∂h(w) / ∂w = Σ_{yi ∈ Y0} (yi − q) f(xi)

and

∂²h(w) / ∂w∂w^T = − Σ_{yi ∈ Y0} q(1 − q) f(xi) f^T(xi)
The initial value of w is important for the convergence of the computation. Since
there is no prior knowledge, we initially set w to be uniform.
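A minimal numpy sketch of this Newton-Raphson loop, using the gradient and Hessian above, might look like the following. It is our own illustration: `F` denotes the N×K matrix whose i-th row is f(xi), the base classifiers' outputs on sample i, and the small ridge term is a numerical safeguard we add, not part of the derivation.

```python
import numpy as np

def fit_weights(F, y, iters=20):
    """Fit classifier weights w by Newton-Raphson on the logistic
    log-likelihood h(w). F: (N, K) matrix of base-classifier outputs;
    y: (N,) array of labels in {0, 1}."""
    N, K = F.shape
    w = np.full(K, 1.0 / K)                 # uniform start (no prior knowledge)
    for _ in range(iters):
        q = 1.0 / (1.0 + np.exp(-(F @ w)))  # q(eta_i), with eta = F w
        grad = F.T @ (y - q)                # dh/dw = sum_i (y_i - q_i) f(x_i)
        hess = -(F * (q * (1.0 - q))[:, None]).T @ F  # d2h / dw dw^T
        hess -= 1e-8 * np.eye(K)            # tiny ridge for numerical safety
        w = w - np.linalg.solve(hess, grad) # Newton-Raphson update
    return w
```

On data generated from known weights, a few iterations suffice to recover the signs and rough magnitudes of the true weights.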
Algorithmic Summary
The learning of a discriminative model is summarized in Algorithm 2.

Algorithm 2 A Discriminative Model Learning Algorithm
Output: a model containing an ensemble of classifiers ordered by age,
f = (f1, · · · , fK)^T, the classifier weights w = (w1, · · · , wK)^T, and λ.
1: loop:
2:   Given a new training block Btrain and an evaluation block Beval,
3:   learn a new classifier fK+1 from block Btrain.
4:   Update f: add fK+1 to f; retire the oldest classifier if the ensemble size exceeds K.
5:   EM: (1) impute outliers in Beval, and
        (2) compute w and λ by maximizing the likelihood of Beval.
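The EM alternation of Algorithm 2 can be sketched as follows. This is a simplified illustration of our own: the weight fitter `mle` is a placeholder for the MLE procedure described above, and the threshold λ is held fixed here, whereas the chapter selects it among K-means cluster boundaries.

```python
import math

def sample_loglik(f_vec, yi, w):
    """log p(y_i | x_i; f, w) under the Bernoulli model of Eq. 4.2.
    f_vec holds the base classifiers' outputs on this sample."""
    eta = sum(wk * fk for wk, fk in zip(w, f_vec))
    q = 1.0 / (1.0 + math.exp(-eta))
    return math.log(q if yi == 1 else 1.0 - q)

def em_fit(F, y, mle, lam=-2.0, rounds=5):
    """E-step: mark sample i an outlier (z_i = 1) when its log-likelihood
    falls below lam (Eq. 4.6). M-step: refit the weights w on the inliers
    only. F is a list of per-sample base-classifier output vectors."""
    w = mle(F, y)                                    # initial fit on all data
    z = [0] * len(y)
    for _ in range(rounds):
        z = [0 if sample_loglik(fv, yi, w) >= lam else 1
             for fv, yi in zip(F, y)]                # E-step (Eq. 4.6)
        inliers = [(fv, yi) for fv, yi, zi in zip(F, y, z) if zi == 0]
        if not inliers:
            break
        w = mle([fv for fv, _ in inliers],
                [yi for _, yi in inliers])           # M-step
    return w, z
```

With a well-predicted sample and a badly-predicted one, the loop keeps the former as an inlier and flags the latter, which is the mutual reinforcement the chapter describes.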
4.5 Experiments and Discussions
We use both synthetic data and a real-life application to evaluate our discriminative
model, Robust Regression Ensemble Method, in terms of both adaptability to con-
cept shifts and robustness to noise. Our model is compared with the two previously
mentioned approaches: Bagging [SK01] and Weighted Bagging [WFYH03]. We show
that although the empirical weighting in Weighted Bagging performs better than un-
weighted voting, the robust regression weighting method is more superior, in terms of
both adaptability and robustness.
C4.5 decision trees are used in our experiments, but in principle our method can be
used with any base learning algorithm.
The synthetic data set is the one used in Chapter 3. In summary, a sample x is a vector
of three independent features ⟨xi⟩, xi ∈ [0, 1], i = 0, 1, 2. Geometrically, samples
are points in a 3-dimensional unit cube. The class boundary is a sphere defined as
B(x) = Σ_{i=0}^{2} (xi − ci)² − r² = 0, where c is the center of the sphere and r the
radius. x is labelled class 1 if B(x) ≤ 0, class 0 otherwise. This learning task is not easy,
because the feature space is continuous and the class boundary is non-linear.
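The generation just described can be sketched as follows (our illustration; the function and parameter names are ours):

```python
import random

def sphere_block(center, r, n, noise=0.10, seed=None):
    """Sample n points uniformly in the unit cube. A point is class 1 iff
    B(x) = sum_i (x_i - c_i)^2 - r^2 <= 0; each label is flipped with
    probability `noise` to simulate class noise."""
    rng = random.Random(seed)
    block = []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]
        b = sum((xi - ci) ** 2 for xi, ci in zip(x, center)) - r ** 2
        y = 1 if b <= 0 else 0
        if rng.random() < noise:
            y = 1 - y
        block.append((x, y))
    return block

def drift(center, delta):
    """Move the class boundary center by delta along each dimension,
    simulating concept drift between adjacent blocks."""
    return [c + delta for c in center]
```

Calling `drift` with a small delta between blocks produces the gradual shifts, and a large delta at blocks 40, 80 and 120 produces the abrupt ones.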
We also use the real-life problem described in Chapter 3.
4.5.1 Evaluation of Adaptation
In this subsection we compare our robust regression ensemble method with Bagging
and Weighted Bagging. Concept drift is simulated by moving the class boundary center
between adjacent data blocks. The moving distance δ along each dimension controls
the magnitude of the concept drift. We have two sets of experiments with different δ
values, both with abrupt large changes occurring at blocks 40, 80 and 120. In one
experiment, the data remains stationary between these changing points. In the other experiment,
Figure 4.1: Adaptability comparison of the ensemble methods on data with three
abrupt shifts.
Figure 4.2: Adaptability comparison of the ensemble methods on data with three
abrupt shifts mixed with small shifts.
small shifts are mixed between the abrupt ones, with δ ∈ (0.005, 0.03). The percentage
of positive samples fluctuates between 41% and 55%. The noise level is 10%.
As shown in Fig. 4.1 and Fig. 4.2, the robust regression model always gives the best
performance. The unweighted bagging ensemble has the worst predictive accuracy.
Figure 4.3: Robustness comparison of the three ensemble methods for different noise
levels.
Both bagging methods are seriously impaired at the concept changing points, but the
robust regression is able to catch up with the new concept quickly.
4.5.2 Robustness in the Presence of Outliers
Noise is the major source of outliers. Fig. 4.3 shows the ensemble performance for the
different noise levels: 0%, 5%, 10%, 15% and 20%. The accuracy is averaged over
100 runs spanning 160 blocks, with small gradual shifts between blocks. We can make
two major observations here:
1. The robust regression ensembles are the most accurate for all the noise levels, as
clearly shown in Fig. 4.3.
2. Robust regression also gives the smallest performance drops when noise increases.
This conclusion is confirmed using a paired t-test at the 0.05 level. In each case when
the noise level increases to 10%, 15% or 20%, the decrease in accuracy produced by
robust regression is the smallest, and the differences are statistically significant.
Figure 4.4: In the outliers detected, the normalized ratio of (1) true noisy samples
(the upper bar), vs. (2) samples from an emerging concept (the lower bar). The bars
correspond to blocks 0-59 in the experiments shown in Fig.4.2
To better understand why the robust regression method is less impacted by outliers,
we show the outliers it detects in Fig. 4.4. Outliers consist mostly of noisy samples
and of samples from a newly emerged concept. In the experiments shown in Fig. 4.2, we
record the outliers in blocks 0-59 and calculate the normalized ratio of the two parts.
As the figure shows, true noise dominates the identified outliers. At block 40, where the
concept drift is large, somewhat more samples reflecting the new concept are mistakenly
reported as outliers, but even there more true noisy samples are identified at the same time.
4.5.3 Discussion of Performance Issues
Constrained by the requirement of on-line responses and by limited computation and
memory resources, stream data mining methods should learn fast and produce simple
classifiers. For ensemble learning, simple classifiers help to achieve these goals. Here
we show that simple decision trees can be used in the logistic regression model for
Figure 4.5: Performance comparison of the ensemble methods with classifiers of
different sizes. Robust regression with smaller classifiers is comparable to the others
with larger classifiers.
better performance.
The simple classifiers we use are decision trees with 8, 16, or 32 terminal nodes.
Full-grown trees are also included for comparison and denoted as “fullsize” where
referred to. Fig. 4.5 compares the accuracies (averaged over 160 blocks) of the ensembles.
The first thing to note is that the robust regression method is always the best, regardless
of the tree size. More importantly, it boosts a collection of simple classifiers, which
individually are weak in classification capability, into a strong ensemble. In fact, the
robust regression ensemble of smaller classifiers is comparable or even superior to the
two bagging ensembles of larger classifiers. We observed this superior performance of
the robust regression method under different levels of noise.
For the computation time study, we verify that robust regression is comparable to
weighted bagging in terms of speed. In a set of experiments where the three methods
run for about 40 blocks, the learning and evaluation time totals 138 seconds
for unweighted bagging, 163 seconds for weighted bagging, and 199 seconds for
Figure 4.6: Performance comparison of the ensembles on credit card data. Base de-
cision trees have no more than 16 terminal nodes. Concept shifts are simulated by
sorting the transactions by the transaction amount.
the robust regression. These running times are obtained when full-grown decision trees
are used. If small decision trees are used instead, logistic regression learning can be
sped up further and still perform better than the other two methods with full-grown trees.
4.5.4 Experiments on Real Life Data
The real-life application is to build a classification model for the detection of fraudulent
credit card transactions. A transaction has 20 features, including the transaction
amount, the time of the transaction, etc.
We study the ensemble performance using different block sizes (1k, 2k, 3k and 4k)
and different base models (decision trees with no more than 8, 16, or 32 terminal nodes,
and full-size trees). We show one experiment in Fig. 4.6, where the block size is 1k
and the base models have at most 16 terminal nodes. Results of other experiments are
similar. The curve shows fewer and smaller drops in accuracy for the robust regression
than for the other methods. These drops occur when the transaction amount jumps.
Overall, the robust regression ensemble method performs better than the other two
ensemble methods.
4.6 Summary
In this chapter, we propose a model learning method that is highly adaptive to
concept changes and robust to noise. The model produces a weighted ensemble. The
weights of the classifiers are computed by a logistic regression technique, which
ensures good adaptability. Furthermore, this logistic regression-based weighting
scheme is capable of boosting a collection of weak classifiers, thus achieving the goal
of fast and light learning. Outlier detection is integrated into the model learning, so
that classifier weight training involves only the inliers, which leads to the robustness of
the resulting ensemble. For outlier detection, we assume that an inlier's membership in
a class follows a Bernoulli distribution, and outliers are samples with a small likelihood
under this distribution. The classifier weights are estimated in a way that maximizes
the training data likelihood. Compared with recent work [SK01, WFYH03], the
experimental results show that this statistical model achieves higher accuracy, adapts
to underlying concept drift more promptly, and is less sensitive to noise.
CHAPTER 5
Subspace Pattern Based Sequence Clustering
In this chapter, we introduce an algorithm that discovers clusters based on subspace
pattern similarity. Unlike traditional clustering methods that focus on grouping objects
with similar values on a set of dimensions, clustering by pattern similarity finds objects
that exhibit a coherent pattern of rise and fall in subspaces. Efficiency is the biggest
concern, due to the curse of dimensionality. In this new algorithm, we define a novel
distance function that not only captures subspace pattern similarity, but is also
conducive to efficient clustering implementations.
5.1 Introduction
Clustering large datasets is a challenging data mining task with many real-life
applications, including those in statistics, machine learning, pattern recognition, and image
processing. Much research has been devoted to the problem of finding subspace
clusters [APW+00, AY00, AGR98, CFZ99, JMN99]. Along this direction, we further
extended the concept of clustering to focus on pattern-based similarity [WWYY02].
Several research works have since studied clustering based on pattern similarity [YWWY02,
PZC+03], as opposed to traditional value-based similarity.
These efforts represent a step forward in bringing the techniques closer to the
demands of real-life applications, but at the same time they also introduce new challenges.
For instance, the clustering models in use [WWYY02, YWWY02, PZC+03]
are often too rigid to find objects that exhibit meaningful similarity, and the lack
of an efficient algorithm makes these models impractical for large-scale data. In this
chapter, we introduce a novel clustering model which is intuitive, capable of capturing
subspace pattern similarity effectively, and conducive to an efficient implementation.
Figure 5.1: Objects form patterns in subspaces. (a) Raw data: 3 objects, 10 columns.
(b) A shifting pattern in subspace {b, c, h, j, e}. (c) A scaling pattern in subspace
{f, d, a, g, i}.
5.1.1 Subspace Pattern Similarity
We present the concept of subspace pattern similarity with an example in Figure 5.1.
We have three objects. Here, the X axis represents a set of conditions, and the Y
axis represents object values under those conditions. In Figure 5.1(a), the similarity
among the three objects is not visually apparent, until we study them under two subsets
of conditions: in Figure 5.1(b), we find the same three objects form a shifting pattern in
subspace {b, c, h, j, e}, and in Figure 5.1(c), a scaling pattern in subspace {f, d, a, g, i}.
This means we should consider objects similar to each other as long as they man-
ifest a coherent pattern in a certain subspace, regardless of whether their coordinate
values in that subspace are close. It also means that many traditional distance
functions, such as the Euclidean distance, cannot effectively discover such similarity.
5.1.2 Applications
We motivate our work with applications in two important areas.
Analysis of Large Scientific Datasets. Scientific data sets often consist of many
numerical columns. One such example is the gene expression data. DNA micro-arrays
are an important breakthrough in experimental molecular biology, for they provide a
powerful tool in exploring gene expression on a genome-wide scale. By quantifying
the relative abundance of thousands of mRNA transcripts simultaneously, researchers
can discover new functional relationships among a group of genes [BB99, DLS99].
Investigations show that, more often than not, several genes contribute to one disease,
which motivates researchers to identify genes whose expression levels rise and
fall coherently under a subset of conditions, that is, genes that exhibit fluctuation of a
similar shape when conditions change [BB99, DLS99]. Table 5.1 shows that three genes,
VPS8, CYS3, and EFB1, respond to certain environmental changes coherently.
More generally, with the DNA micro-array as an example, we argue that the
following queries are of interest in scientific data analysis.
Example 1. Counting
How many genes have an expression level in sample CH1I that is about 100 ± 5 units
higher than that in CH2B, 280 ± 5 units higher than that in CH1D, and 75 ± 5 units
higher than that in CH2I?
Example 2. Clustering
Find clusters of genes that exhibit coherent subspace patterns, given the following
constraints: i) the subspace pattern has dimensionality higher than minCols; and ii)
the number of objects in the cluster is larger than minRows.
Answering the above queries efficiently is important in discovering gene correla-
tions [BB99, DLS99] from large scale DNA micro-array data. The counting problem
of Example 1 seems easy to implement, yet it constitutes the most primitive operation
in solving the clustering problem of Example 2, which is the focus of this chapter.
Current database techniques cannot solve the above problems efficiently. Algo-
rithms such as pCluster [WWYY02] have been proposed to find clusters of objects
that manifest coherent patterns. Unfortunately, they can only handle datasets contain-
ing no more than thousands of records.
        CH1I  CH1B  CH1D  CH2I  CH2B  · · ·
VPS8     401   281   120   275   298
SSA1     401   292   109   580   238
SP07     228   290    48   285   224
EFB1     318   280    37   277   215
MDM10    538   272   266   277   236
CYS3     322   288    41   278   219
DEP1     317   272    40   273   232
NTG1     329   296    33   274   228
...
Table 5.1: Expression data of Yeast genes
Event                  Timestamp
...                    ...
CiscoDCDLinkUp         19:08:01
MLMSocketClose         19:08:07
MLMStatusUp            19:08:21
...                    ...
MiddleLayerManagerUp   19:08:37
CiscoDCDLinkUp         19:08:39
...                    ...
Table 5.2: A Stream of Events
Discovery of Sequential Patterns. We use network event logs to demonstrate the
need to find clusters based on sequential patterns in large datasets. A network system
generates various events. We log each event, as well as the environment in which it
occurs, into a database. Finding patterns in a large dataset of event logs is important to
the understanding of the temporal causal relationships among the events, which often
provide actionable insights for determining problems in system management.
We focus on two attributes, Event and Timestamp (Table 5.2), of the log database.
A network event pattern contains multiple events. For instance, a candidate pattern
might be the following:
Example 3. Sequential Pattern
Event CiscoDCDLinkUp is followed by MLMStatusUp that is followed, in turn, by
CiscoDCDLinkUp, under the constraint that the interval between the first two events
is about 20±2 seconds, and the interval between the 1st and 3rd events is about 40±2
seconds.
Previous works [WPF+03, WPFY03] have studied the problem of efficiently lo-
cating a given sequential pattern; however, finding all interesting sequential patterns is
a difficult problem. A network event pattern becomes interesting if: i) it occurs fre-
quently, and ii) it is non-trivial, meaning it contains a certain number of events. The
challenge here is to find such patterns efficiently.
Although seemingly different from the problem shown in Figure 5.1, finding patterns
exhibited over time in sequential data is closely related to finding coherent patterns
in tabular data. It is another form of clustering by subspace pattern similarity: if we
think of the different types of events as conditions on the X axis of Figure 5.1, and their
timestamps as the Y axis, then we are actually looking for clusters of subsequences
that exhibit (time) shifting patterns as in Figure 5.1(b).
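To make the analogy concrete, here is a small illustrative sketch in Python. The event names follow Table 5.2; the timestamps and the "#2" suffix distinguishing the repeated event are hypothetical:

```python
# Two occurrences of the sequential pattern of Example 3, started at
# different times.  Viewing event types as conditions (the X axis) and
# timestamps as values (the Y axis), the two occurrences differ by a
# constant time shift, i.e. a shifting pattern as in Figure 5.1(b).
occ1 = {"CiscoDCDLinkUp": 0, "MLMStatusUp": 20, "CiscoDCDLinkUp#2": 40}
occ2 = {"CiscoDCDLinkUp": 300, "MLMStatusUp": 320, "CiscoDCDLinkUp#2": 340}

shifts = [occ2[e] - occ1[e] for e in occ1]
print(shifts)  # [300, 300, 300]: the same shift on every "condition"
```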
5.1.3 Our Contributions
This chapter presents a novel approach to clustering datasets based on pattern similarity.
• We present a novel model for subspace pattern similarity. In comparison with
previous models, the new model is intuitive for capturing subspace pattern sim-
ilarity, and reduces computational complexity dramatically.
• We unify pattern similarity analysis in tabular data and pattern similarity analysis
in sequential data into a single problem. Indeed, tabular data are transformed into
a sequential form which is conducive to an efficient implementation.
• We present a scalable sequence-based method, SeqClus, for clustering by sub-
space pattern similarity. The technique outperforms known state-of-the-art
pattern clustering algorithms and makes it feasible to perform pattern similarity
analysis on large datasets.
The rest of the chapter is organized as follows. We introduce a novel distance func-
tion for measuring subspace pattern similarity in Section 5.2. Section 5.3 presents an
efficient clustering algorithm based on a novel counting tree structure. Experiments
and results are reported in Section 5.4. In Section 5.5, we review related work and
conclude.
5.2 The Distance Function
The choice of distance function has great implications on the meaning of similarity,
and this is particularly important in subspace clustering because of its computational
complexity. Hence, we need a distance function that makes measuring the similar-
ity between two objects in high dimensional space meaningful and intuitive, and at the
same time admits an efficient implementation.
5.2.1 Tabular and Sequential Data
Finding objects that exhibit coherent patterns of rise and fall in a tabular dataset (e.g.
Table 5.1) is similar to finding subsequences in a sequential dataset (e.g. Table 5.2).
This indicates that we should unify the data representation of tabular and sequential
datasets so that a single similarity model and algorithm can apply to both tabular and
sequential datasets for clustering based on pattern similarity.
We use sequences to represent objects in a tabular dataset D. We assume there is
a total order among its attributes. For instance, let A = {c1, · · · , cn} be the set of
attributes, and assume c1 ≺ · · · ≺ cn is the total order. Thus, we can represent any
object x by a sequence

〈(c1, xc1), · · · , (cn, xcn)〉

where xci is the value of x in column ci. (We also use 〈xc1, · · · , xcn〉 to represent x
if no confusion arises.) We can then concatenate the objects in D into one
long sequence, which is a sequential representation of the tabular data.
After the conversion, pattern discovery on tabular datasets is no different from
pattern discovery in a sequential dataset. For instance, in the Yeast DNA micro-array,
we can use the following sequence to represent a pattern:
〈(CH1D, 0), (CH2B, 180), (CH2I, 205), (CH1I, 280)〉
In words, for genes that exhibit this pattern, their expression levels under conditions
CH2B, CH2I, and CH1I must be 180, 205, and 280 units higher, respectively, than
that under CH1D.
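The conversion from a table row to its sequential representation is straightforward. A minimal Python sketch, with column names taken from Table 5.1 and the total order assumed to be the listed order:

```python
# A fixed total order over the attributes: c1 < c2 < ... < cn.
COLUMNS = ["CH1I", "CH1B", "CH1D", "CH2I", "CH2B"]

def to_sequence(row):
    """Represent an object x as the sequence <(c1, x_c1), ..., (cn, x_cn)>."""
    return [(c, row[c]) for c in COLUMNS]

# Gene VPS8 from Table 5.1:
vps8 = {"CH1I": 401, "CH1B": 281, "CH1D": 120, "CH2I": 275, "CH2B": 298}
print(to_sequence(vps8))
# [('CH1I', 401), ('CH1B', 281), ('CH1D', 120), ('CH2I', 275), ('CH2B', 298)]
```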
5.2.2 Sequence-based Pattern Similarity
In this section, we propose a new distance measure that is capable of capturing sub-
space pattern similarity and is conducive to an efficient implementation.
Here we consider the shifting pattern of Figure 5.1(b) only, as scaling patterns are
equivalent to shifting patterns after a logarithmic transformation of the data.
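A quick numeric check of this equivalence (the values are illustrative):

```python
import math

x = [10.0, 20.0, 40.0]
y = [30.0, 60.0, 120.0]  # y = 3x on every dimension: a scaling pattern

# After a logarithmic transformation, the scaling factor 3 becomes a
# constant additive shift of log(3), i.e. a shifting pattern.
shifts = [math.log(b) - math.log(a) for a, b in zip(x, y)]
print(shifts)  # each entry is log(3), about 1.0986
```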
To tell whether two objects exhibit a shifting pattern in a given subspace S, the
simplest way is to normalize the two objects by subtracting x̄S from each of their
coordinate values xi (i ∈ S), where x̄S is the average coordinate value of x in subspace
S. This, however, requires us to compute and keep track of x̄S for every subspace S. As
there are as many as 2^|A| − 1 different ways of normalization, this makes the computation
of such a similarity model impractical for large datasets.
To find a distance function that admits an efficient implementation, we choose an
arbitrary dimension k ∈ S for normalization. We show that the choice of k has very
limited impact on the similarity measure.
More formally, given two objects x and y, a subspace S, and a dimension k ∈ S, we
define the sequence-based distance between x and y as follows:

dist_{k,S}(x, y) = max_{i ∈ S} |(x_i − y_i) − (x_k − y_k)|    (5.1)
Figure 5.2 demonstrates the intuition behind Eq (5.1). Let S = {k, a, b, c}. With
respect to dimension k, the distance between x and y in S is less than δ if the difference
between x and y on every dimension of S is within ∆ ± δ, where ∆ is the difference
between x and y on dimension k.
Figure 5.2: The meaning of dist_{k,S}(x, y) ≤ δ (the differences between objects x and
y on dimensions a, b, c all lie within ∆ ± δ, where ∆ is their difference on dimension k).
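A minimal Python sketch of Eq (5.1), with objects stored as dictionaries of coordinate values. The subspace and values below are illustrative, echoing Figure 5.1(b):

```python
def dist(x, y, S, k):
    """Sequence-based distance of Eq (5.1): the maximum, over dimensions
    i in subspace S, of |(x_i - y_i) - (x_k - y_k)|, where k in S is the
    base dimension chosen for normalization."""
    base = x[k] - y[k]
    return max(abs((x[i] - y[i]) - base) for i in S)

# Two objects forming a perfect shifting pattern in subspace S:
x = {"b": 30, "c": 50, "h": 20, "j": 60, "e": 40}
y = {"b": 45, "c": 65, "h": 35, "j": 75, "e": 55}  # y = x + 15 everywhere
S = ["b", "c", "h", "j", "e"]
print(dist(x, y, S, k="b"))  # 0: the shift is constant on every dimension
```

Note that the distance is 0 regardless of which base dimension k is chosen, since the shift is the same everywhere.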
Clearly, with a different choice of dimension k, we may obtain a different distance
between two objects. However, the difference is bounded by a factor of 2.

Property 1. For any two objects x, y and a subspace S, if ∃k ∈ S such that
dist_{k,S}(x, y) ≤ δ, then ∀j ∈ S, dist_{j,S}(x, y) ≤ 2δ.

Proof.

dist_{j,S}(x, y) = max_{i ∈ S} |(x_i − y_i) − (x_j − y_j)|
               ≤ max_{i ∈ S} |(x_i − y_i) − (x_k − y_k)| + |(x_j − y_j) − (x_k − y_k)|
               ≤ δ + δ = 2δ
Since δ is but a user-defined threshold, Property 1 shows that Eq (5.1)'s capability
of capturing pattern similarity does not depend on the choice of k, which can be an
arbitrary dimension in S. In fact, as long as we use a fixed dimension k
for any given subspace S, then, with a relaxed δ, we can always find the clusters
that Eq (5.1) would discover with a different choice of k. This gives us great
flexibility in defining and mining clusters based on subspace pattern similarity.
Problem Statement. Our task is to find subspace clusters of objects where the dis-
tance between two objects is measured by Eq (5.1). Since in Eq (5.1) any dimension
k is equally good at capturing subspace pattern similarity, we shall choose the one that
leads to the most efficient computation.
5.3 The Clustering Algorithm
We define the concept of pattern and then divide the pattern space into grids (Sec-
tion 5.3.1). We then construct a tree structure which provides a compact summary of
all of the frequent patterns in a data set (Section 5.3.2). We show that the tree struc-
ture enables us to efficiently find the number of occurrences of any specified pattern, or
equivalently, the density of any cell in the grid (Section 5.3.3). A density and grid based
clustering algorithm can then be applied to merge dense cells into clusters. Finally, we
introduce an Apriori-like method to find clusters in any subspace (Section 5.3.4).
5.3.1 Pattern and Pattern Grids
Let D be a dataset in a multidimensional space A. A pattern p is a tuple (T, δ), where
δ is a distance threshold and T is an ordered sequence of (column, value) pairs, that is,

T = 〈(t1, 0), (t2, v2), · · · , (tk, vk)〉

where ti ∈ A, and t1 ≺ · · · ≺ tk. Let S = {t1, · · · , tk}. An object x ∈ D exhibits
pattern p in subspace S if

v_i − δ ≤ x_{t_i} − x_{t_1} ≤ v_i + δ,   1 ≤ i ≤ k.   (5.2)
Apparently, if two objects x, y ∈ D are both instances of pattern p = (T, δ), then we
have

dist_{t_1,S}(x, y) ≤ 2δ.

In order to find clusters, we start with high density patterns: a pattern p = (T, δ) is
of high density if the number of objects that satisfy Eq (5.2) reaches a user-
defined density threshold.
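A direct check of Eq (5.2) can be sketched as follows. The pattern is the Yeast example of Section 5.2.1; the object values are hypothetical:

```python
def exhibits(x, pattern, delta):
    """True iff object x exhibits pattern p = (T, delta), i.e.
    v_i - delta <= x_{t_i} - x_{t_1} <= v_i + delta for every pair in T."""
    (t1, _), rest = pattern[0], pattern[1:]
    return all(v - delta <= x[t] - x[t1] <= v + delta for t, v in rest)

p = [("CH1D", 0), ("CH2B", 180), ("CH2I", 205), ("CH1I", 280)]
# A hypothetical gene whose offsets (181, 203, 278) are all within +/-5:
g = {"CH1D": 40, "CH2B": 221, "CH2I": 243, "CH1I": 318}
print(exhibits(g, p, delta=5))  # True
print(exhibits(g, p, delta=1))  # False: 243 - 40 = 203 misses 205 by 2
```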
Figure 5.3: Pattern grids for subspace {t1, t2, t3} (axes x_{t_2} − x_{t_1} and
x_{t_3} − x_{t_1}; dense cells are shaded).
We discretize the dataset so that patterns fall into grids. For any given subspace S ,
after we find the dense cells in S , we use a grid and density based clustering algorithm
to find the clusters (Figure 5.3).
The difficult part, however, lies in finding the dense cells efficiently for all
subspaces. The rest of this section deals with this issue.
5.3.2 The Counting Tree
The counting tree provides a compact summary of the dense patterns in a dataset. It is
motivated by the suffix trie, which, given a string, indexes all of its substrings. Here,
each record in the dataset is represented by a sequence, but sequences are different
from strings: we are interested in non-contiguous subsequence matches, while suffix
tries only handle contiguous substrings.
    c1  c2  c3  c4
x    4   3   0   2
y    3   4   1   3
z    1   2   3   1
Table 5.3: A dataset of 3 objects
Before we introduce the structure of the counting tree, we use an example to illus-
trate our purpose. Table 5.3 shows a dataset of 3 objects in a 4 dimensional space. We
start with the relevant subsequences of each object.
Definition 1. Relevant subsequences.
The relevant subsequences of an object x in an n-dimensional space are:

x^i = 〈x_{i+1} − x_i, · · · , x_n − x_i〉,   1 ≤ i < n
In relevant subsequence x^i, column ci is used as the base for comparison. Assum-
ing C is a cluster in a subspace S in which i is the minimal dimension, we shall search
for C in the dataset {x^i | x ∈ D}. In any such subspace S, we use ci as the base for com-
parison; in other words, ci serves as the dimension k in Eq (5.1). As an example, the
relevant subsequences of object z in Table 5.3 are:
      c1  c2  c3  c4
z^1        1   2   0
z^2            1  -1
z^3               -2
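Definition 1 is easy to implement; the following sketch reproduces the example above, with objects given as plain lists of values:

```python
def relevant_subsequences(x):
    """All relevant subsequences of an object x in n-dimensional space:
    the i-th subsequence is <x_{i+1} - x_i, ..., x_n - x_i>, 1 <= i < n."""
    n = len(x)
    return [[x[j] - x[i] for j in range(i + 1, n)] for i in range(n - 1)]

z = [1, 2, 3, 1]  # object z of Table 5.3
print(relevant_subsequences(z))  # [[1, 2, 0], [1, -1], [-2]]
```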
To create a counting tree for a dataset D, we insert, for each object z ∈ D, its
relevant subsequences into a tree structure. Also, assuming the insertion of a sequence,
say z^1, ends at node t in the tree (Figure 5.4), we increase the count associated with
node t by 1.
More often than not, we are interested in patterns of at least a given length,
say ξ ≥ 1. A relevant subsequence shorter than ξ cannot contain
patterns of length ξ. Thus, if ξ is known beforehand, we only need to insert x^i, where
1 ≤ i < n − ξ + 1, for each object x. Figure 5.4 shows the counting tree for the dataset
of Table 5.3 with ξ = 2.
Figure 5.4: The Counting Tree (each node is labeled with a triple [ID`, IDa, Count];
t and s denote two tree nodes referenced in the text).
In the second step, we label each tree node t with a triple (ID`, IDa, Count).
The first element of the triple, ID`, uniquely identifies node t, and the second element,
IDa, is the largest ID` among t's descendant nodes. The IDs are assigned by a depth-first
traversal of the tree structure, during which we assign sequential numbers (starting
from 0, which is assigned to the root node) to the nodes as they are encountered one
by one. If t is a leaf node, then the third element of the triple, Count, is the number
of objects in t's object set; otherwise, it is the sum of the counts of its child nodes.
Apparently, we can label a tree with a single depth-first traversal. Figure 5.4 shows a
labeled tree for the sample dataset.
To count pattern occurrences using the tree structure, we introduce counting lists.
For each column pair (ci, cj), i < j, and each possible value v = xj − xi (after data
discretization), we create a counting list (ci, cj, v). The counting lists are also con-
structed during the depth-first traversal. Suppose during the traversal we encounter
node t, which represents sequence element xj − xi = v. Assuming t is to be la-
beled (ID`, IDa, cnt), and the last element of counting list (ci, cj, v) is ( , , cnt′), we
append a new element (ID`, IDa, cnt + cnt′) to the list. (If list (ci, cj, v) is empty, we
make (ID`, IDa, cnt) its first element.)
link head       list of node labels
· · ·           · · ·
(c1, c3, −4) ⇒ [3, 4, 1]
(c1, c4, −2) ⇒ [4, 4, 1]
(c1, c4, 0)  ⇒ [7, 7, 1], [9, 9, 2]
(c2, c4, −1) ⇒ [12, 12, 2], [14, 14, 3]
· · ·           · · ·
Above is a part of the counting lists for the tree structure in Figure 5.4. For instance,
list (c2, c4, −1) contains two nodes, which are created during the insertion of x^2 and
z^2 (relevant subsequences of x and z in Table 5.3). The two nodes represent elements
x4 − x2 = −1 and z4 − z2 = −1 in sequences x^2 and z^2 respectively. We summarize
the process of building the counting tree in Algorithm 3.
Thus, our counting tree is composed of two structures, the tree and the counting
lists. We observe the following properties of the counting tree:
1. For any two nodes x and y labeled (ID`x, IDax, Countx) and (ID`y, IDay, County)
respectively, node y is a descendant of node x if ID`y ∈ [ID`x, IDax].
2. Each node appears once and only once in the counting lists.
3. Nodes in any counting list are in ascending order of their ID`.
The proof of the above properties is straightforward and we omit it here. These
properties are essential to finding the dense patterns efficiently (Section 5.3.3).
5.3.3 Counting Pattern Occurrences
We describe SeqClus, an efficient algorithm for finding the occurrence number of a
specified pattern using the counting tree structure introduced above.
Each node s in the counting tree represents a pattern p, which is embodied by the
path leading from the root node to s. For instance, the node s in Figure 5.4 represents
pattern 〈(c1, 0), (c2, 1)〉.
How do we find the number of occurrences of a pattern p′ that is one element longer
than p? That is,

p′ = 〈(ci, vi), · · · , (cj, vj), (ck, v)〉

where the prefix 〈(ci, vi), · · · , (cj, vj)〉 is p.
The counting tree structure makes this operation very easy. First, we only need to
look for nodes in counting list (ci, ck, v), since all nodes of xk − xi = v are in that
list. Second, we are only interested in nodes that are under node s, because only those
Algorithm 3 Build the Counting Tree
Input: D: a dataset in multidimensional space A
       ξ: minimal pattern length (dimensionality)
Output: F: a counting tree
 1: F ← empty tree;
 2: for all objects x ∈ D do
 3:   i ← 1;
 4:   while i < |A| − ξ + 1 do
 5:     insert x^i into F;
 6:     i ← i + 1;
 7:   end while
 8: end for
 9: make a depth-first traversal of F;
10: for each node s encountered in the traversal do
11:   let s represent sequence element xj − xi = v;
12:   label node s by [id`s, idas, count];
13:   lcnt ← count of the last element in list (ci, cj, v), or 0 if (ci, cj, v) is empty;
14:   append [id`s, idas, count + lcnt] to list (ci, cj, v);
15: end for
nodes satisfy pattern p, a prefix of p′. Assuming s is labeled (ID`s, IDas, count), we
know s's descendant nodes are in the range [ID`s, IDas]. By the counting tree
properties, elements in any counting list are in ascending order of their ID` values,
which means we can binary-search the list. Finally, assume list (ci, ck, v) contains the
following nodes:

· · · , ( , , cntu), (id`v, idav, cntv), · · · , (id`w, idaw, cntw), · · ·

where the nodes from (id`v, idav, cntv) through (id`w, idaw, cntw) are those falling
within [ID`s, IDas]. Then we know there are altogether cntw − cntu objects that satisfy
pattern p′ (or just cntw objects if (id`v, idav, cntv) is the first element of the list).
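Because the counts stored in a counting list are cumulative, the range lookup amounts to two binary searches. A sketch using Python's bisect module; list entries are (ID`, IDa, cumulative count), and the helper name is ours:

```python
import bisect

def count_in_range(counting_list, lo, hi):
    """Number of objects whose nodes fall inside the ID range [lo, hi],
    given a counting list sorted by ID_l that stores cumulative counts."""
    ids = [idl for idl, ida, cnt in counting_list]
    left = bisect.bisect_left(ids, lo)    # first node with ID_l >= lo
    right = bisect.bisect_right(ids, hi)  # one past the last with ID_l <= hi
    if left == right:                     # no node falls in the range
        return 0
    prev = counting_list[left - 1][2] if left > 0 else 0
    return counting_list[right - 1][2] - prev

# Counting list (c2, c4, -1) from the example: cumulative counts 2 and 3.
lst = [(12, 12, 2), (14, 14, 3)]
print(count_in_range(lst, 10, 14))  # 3: all objects
print(count_in_range(lst, 13, 14))  # 1: only the node inserted for z
```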
We denote the above process by count(r, ck, v), where r is a range, and in this case
r = [ID`s, IDas]. If, however, we are looking for patterns even longer than p′, then
instead of returning cntw − cntu, we shall continue the search. Let L denote the list of
the sub-ranges represented by the nodes within range [ID`s, IDas] in list (ci, ck, v), that
is,

L = {[id`v, idav], · · · , [id`w, idaw]}

Then, we repeat the above process for each range in L, and the final count comes to

Σ_{r ∈ L} count(r, c, v)

where (c, v) is the next element following p′.
We summarize the counting process described above in Algorithm 4.
5.3.4 Clustering
The counting algorithm in Section 5.3.3 finds the number of occurrences of a specified
pattern, or the density of the cells in the pattern grids of a given subspace (Figure 5.3).
We can then use a density and grid based clustering algorithm to group the dense cells
together.
We start with patterns containing only two columns (in a 2-dimensional subspace),
and grow the patterns by adding new columns to them. During this process, pat-
terns that correspond to fewer than minRows objects are pruned, as introducing new
columns into a pattern can only reduce the number of supporting objects.
Figure 5.5 shows a tree structure for growing the clusters. Each node t in the tree is
a triple (item, count, range-list). The items in the nodes along the path from the root
node to node t constitute the pattern represented by t. For instance, the node in the
3rd level in Figure 5.5 represents 〈(c0, 0), (c1, 0), (c2, 0)〉, a pattern in a 3-dimensional
space. The value count in the triple represents the number of occurrences of the pattern
Algorithm 4 Algorithm count()
Input: Q: a query pattern on dataset D
       F: the counting tree of D
Output: number of occurrences of Q in D
 1: assume Q = 〈(q1, 0), (q2, v2), · · · , (qj, vj), · · · 〉;
 2: (r, cnt) ← count(Universe, q1, 0)
 3: return countPattern(r, 2)

 4: Function countPattern(r, j)
 5:   the jth element of Q is (qj, vj)
 6:   (L, cnt) ← count(r, qj, vj)
 7:   if j = |Q| then
 8:     return cnt
 9:   else
10:     return Σ_{r′ ∈ L} countPattern(r′, j + 1)
11:   end if

12: Function count(r, c, v)
13:   cl ← the counting list for (q1, c, v)
14:   perform range query r on cl, and
15:   assume cl contains the following elements:
16:   · · · , ( , , cnt′), (id`j, idaj, cntj), · · · , (id`k, idak, cntk), · · ·
      where the run from (id`j, ·, ·) through (id`k, ·, ·) lies within r
17:   return (L, cnt) where:
18:     cnt = cntk − cnt′
19:     L = {[id`j, idaj], · · · , [id`k, idak]}
in the dataset, and range-list is the list of ranges of the IDs of those objects. Both count
and range-list are computed by the count() routine in Algorithm 4.
First of all, we count the occurrences of all patterns containing 2 columns, and
insert them under the root node if they are frequent (count ≥ minRows). Note there is
no need to consider all the columns: for any ci − cj = v to be the first item in a pattern
Figure 5.5: The Cluster Tree (nodes are triples of the form (Ci − Cj = v, cnt, L);
levels 1 to 3 shown, with nodes at one level joined to derive the nodes at the next level).
with at least minCols columns, ci must be less than cn−minCols+1 and cj must be less
than cn−minCols.
In the second step, for each node p on the current level, we join p with its eligible
nodes to derive the nodes on the next level. A node q is an eligible node of p if it satisfies
the following criteria:
• q is on the same level as p;
• if p denotes item a − b = v and q denotes c − d = v′, then a ≺ c and b = d.
Besides p's eligible nodes, we also join p with items of the form cn−minCols+k − b = v,
since column cn−minCols+k does not appear in levels less than k.
The join operation is easy to perform. Assume p, represented by the triple
(a − b = v, count, range-list), is to be joined with item c − b = v′. We simply compute
count(r, c, v′) for each range r in range-list. If the sum of the returned counts is larger
Algorithm 5 Clustering Algorithm
Input: minCols: dimensionality threshold
       minRows: cluster size threshold
       F: tree structure for D
Output: clusters of objects in D
 1: T ← create root node of tree
 2: Queue ← ∅
 3: for i = 1 to |A| − minCols do
 4:   (cnt, L) ← count(NULL, ci, 0)
 5:   if cnt ≥ minRows then
 6:     insert (ci, 0, cnt, L) under T and into Queue
 7:   end if
 8: end for
 9: while Queue ≠ ∅ do
10:   remove the 1st element x from Queue
11:   assume x = (ci, v, cnt, L)
12:   for each node y = (cj, v′, cnt′, L′) eligible for joining with x do
13:     (cnt′′, L′′) ← count(L, cj, v′)
14:     if cnt′′ ≥ minRows then
15:       insert (cj, v′, cnt′′, L′′) under x and into Queue
16:     end if
17:   end for
18: end while
19: for each leaf node x of the tree do
20:   assume x = (ci, v, cnt, L)
21:   columns ← path from root to x
22:   objects ← findAll(L)
23:   return cluster {columns, objects}
24: end for
than minRows, then we insert a new node (c − b = v′, count′, range-list′) under p,
where count′ is the sum of the returned counts, and range-list′ is the union of all the
ranges returned by count(). Algorithm 5 summarizes the clustering process described
above.
5.4 Experiments
We implemented the algorithms in C on a Linux machine with a 700 MHz CPU and
256 MB of main memory, and tested them on both synthetic and real life data sets.
5.4.1 Data Sets
We generate synthetic datasets in tabular and sequential forms. For real life datasets,
we use time-stamped event sequences generated by a production network (sequential
data), and DNA micro-arrays of yeast and mouse gene expressions under various con-
ditions (tabular data).
Synthetic Data We generate synthetic data sets in tabular form. Initially, the table
is filled with random values ranging from 0 to 300, and then we embed a fixed number
of clusters in the raw data. The embedded clusters can have varying quality: we
embed perfect clusters in the matrix, i.e., clusters in which the distance between any
two objects is 0 (δ = 0), as well as clusters whose distance threshold among the objects
is δ = 2, 4, 6, · · · . We also generate synthetic sequential datasets
in the form of · · · (id, timestamp) · · · , where instead of embedding clusters, we sim-
ply model the sequences by probabilistic distributions. Here, the ids are randomly
generated; however, the occurrence rate of different ids follows either a uniform or a
Zipf distribution. We generate ascending timestamps in such a way that the number
of elements in a unit window follows either a uniform or a Poisson distribution.
Gene Expression Data Gene expression data are presented as a matrix. The yeast
micro-array [THC+00] can be converted to a weighted sequence of 49,028 elements
(2,884 genes under 17 conditions). The expression levels of the yeast genes (after
transformation) range from 0 to 600, and they are discretized into 40 bins. The mouse
cDNA array is 535,766 elements in size (10,934 genes under 49 conditions) and is pre-
processed in the same way.
Event Management Data The data sets we use are taken from a production com-
puter network at a financial service company. NETVIEW [PWMH01] has six at-
tributes: Timestamp, EventType, Host, Severity, Interestingness, and DayOfWeek.
We are concerned with the attributes Timestamp and EventType, of which EventType
has 241 distinct values. TEC [PWMH01] has attributes Timestamp, EventType, Source,
Severity, Host, and DayOfYear. In TEC, there are 75 distinct values of EventType and
16 distinct types of Source. It is often interesting to differentiate events of the same type
from different sources, and this is realized by combining EventType and Source to produce
75 × 16 = 1200 symbols.
5.4.2 Performance Analysis
We evaluate the scalability of the clustering algorithm on synthetic tabular datasets and
compare it with pCluster [WWYY02]. The number of objects in the dataset increases
from 1,000 to 100,000, and the number of columns from 20 to 120. The results pre-
sented in Figure 5.6 are average response times obtained from a set of 10 synthetic
datasets.
Data sets used for Figure 5.6(a) are generated with the number of columns fixed at 30.
We embed a total of 10 perfect clusters (δ = 0) in the data. The minimal number of
columns of the embedded clusters is 6, and the minimal number of rows is set to 0.01N,
Figure 5.6: Performance Study: scalability. (a) Scalability with the # of rows in data
sets (pCluster vs. SeqClus, # of columns = 30). (b) Scalability with the # of columns
in data sets (SeqClus at 30K and 3K rows, pCluster at 3K rows).
where N is the number of rows of the synthetic data.
The pCluster algorithm is invoked with minCols = 5, minRows = 0.01N, and δ = 3,
and the SeqClus algorithm is invoked with δ = 3. Figure 5.6(a) shows an almost
linear relationship between the running time and the data size for the SeqClus al-
gorithm. The pCluster algorithm, on the other hand, is not scalable: it can only
handle datasets with sizes in the range of thousands.
For Figure 5.6(b), we increase the dimensionality of the synthetic datasets from 20
to 120. Each embedded cluster is in a subspace whose dimensionality is at least 0.02C,
where C is the number of columns of the data set. The pCluster algorithm is invoked
with δ = 3, minCols = 0.02C, and minRows = 30. The curve of SeqClus exhibits
quadratic behavior. It also shows that, with increasing dimensionality, SeqClus
can handle datasets roughly an order of magnitude larger than pCluster (30K
vs. 3K); we were unable to obtain performance results for pCluster on datasets of 30K
objects.
Figure 5.7: Time vs. distance threshold δ (pCluster at 3K rows, SeqClus at 30K rows).
Next we study the impact of the quality of the embedded clusters on the perfor-
mance of the clustering algorithms. We generate synthetic datasets containing 3K/30K
objects and 30 columns, with 30 embedded clusters (each containing 30 objects on
average, in subspaces whose dimensionality is 8 on average). Within each
cluster, the maximum distance (under the pCluster model) between any two objects
ranges from δ = 2 to δ = 6. Figure 5.7 shows that, while the performance of the
pCluster algorithm degrades as δ increases, the SeqClus algorithm is more ro-
bust. The reason is that much of the computation of SeqClus is
performed on the counting tree, which provides a compact summary of the dense pat-
terns in the dataset, while for pCluster, a higher δ value has a direct, negative impact
on its pruning effect [WWYY02].
We also study clustering performance on timestamped sequential datasets. The
dataset in use is in the form of · · · (id, timestamp) · · · , where every minute contains
on average 10 ids (uniform distribution). We place a sliding window of size 1 minute
Figure 5.8: Scalability on sequential dataset (SeqClus, window size = 1 min; the x axis
is the number of sequence elements ×1000, the y axis the average response time in
seconds).
on the sequence, and create a counting tree for the subsequences inside the windows.
The scalability result is shown in Figure 5.8. We also tried different distributions of id
and timestamp, but did not observe significant differences in performance.
5.4.3 Cluster Analysis
We report the clusters found in real life datasets. Table 5.4 shows the number of
clusters found by the pCluster and SeqClus algorithms in the raw Yeast micro-array
dataset.
δ   minCols   minRows   # of clusters
                        pCluster   SeqClus
0      9         30           5         5
0      7         50          11        13
0      5         30        9370     11537
Table 5.4: Clusters found in the Yeast dataset
For minCols= 9 and minRows= 30, the two algorithms found the same clusters.
But in general, using the same parameters, SeqClus produces more clusters. This is
76
because the similarity measure used in the pCluster model is more restrictive. We
find that the objects (genes) in those clusters overlooked by the pCluster algorithm but
discovered by the SeqClus method exhibit easily perceptible coherent patterns. For instance, the genes in Figure 5.9 show a coherent pattern in the specified subspace, and this subspace cluster is discovered by SeqClus but not by pCluster. This indicates that relaxing the similarity model not only improves performance but also provides extra insight for understanding the data.
[Plot omitted: expression levels versus conditions for the genes in the cluster.]
Figure 5.9: A cluster in subspace {2,3,4,5,7,8,10,11,12,13,14,15,16}.
The SeqClus algorithm works directly on both tabular and sequential datasets. Ta-
ble 5.5 shows event sequence clusters found in the NETVIEW dataset [PWMH01].
We apply the algorithm on 10 days’ worth of event logs (around 41M bytes) of the
production computer network.
δ        # events    # sequences    SeqClus
2 sec    10          500            31
4 sec    8           400            143
6 sec    6           300            2276
Table 5.5: Clusters found in NETVIEW
5.5 Related Work and Discussion
The study of clustering based on pattern similarity is related to previous work on sub-
space clustering. Many recent studies [APW+00, AY00, AGR98, CFZ99, JMN99]
focus on mining subspace clusters embedded in high-dimensional spaces.
Still, strong correlations may exist among a set of objects even if they are far apart
from each other as measured by distance functions (such as Euclidean) used frequently
in traditional clustering algorithms. Many scientific projects collect data in the form
of Figure 5.1, and it is essential to identify clusters of objects that manifest coherent
patterns. A variety of applications, including DNA microarray analysis and e-commerce collaborative filtering, will benefit from fast algorithms that can capture such patterns.
Cheng et al. [CC00] proposed the bicluster model, which captures the coherence of
genes and conditions in a sub-matrix of a DNA micro-array.
In this paper, we show that clustering by pattern similarity is closely related to the
problem of subsequence matching. There has been much research on string indexing
and substring matching. For instance, a suffix tree [McC76] is a very useful data struc-
ture that embodies a compact index to all the distinct, non-empty substrings of a given
string. Suffix arrays [MM93] and PAT-arrays [GBYS92] also provide fast searches on
text databases. Similarity based subsequence matching [FRM94, PWZP00] has been
a research focus for applications such as time series databases.
Clustering by pattern similarity is an interesting and challenging problem. The
computational complexity problem of subspace clustering is further aggravated by the
fact that we are concerned with patterns of rise and fall instead of value similarity. The
task of clustering by pattern similarity can be converted into a traditional subspace clustering problem by (i) creating a new dimension ij for every two dimensions i and j of any object x, and setting xij, the value of the new dimension, to xi − xj; or (ii) creating |A| copies of the original dataset (A is the entire dimension set), where xk, the value of x on the kth dimension in the ith copy, is changed to xk − xi, for k ∈ A.
For both cases, we need to find subspace clusters in the transformed dataset, which
is |A| times larger. These methods are apparently not feasible for datasets in high
dimensional spaces. They also cannot be applied to sequential datasets, for instance,
in event management systems where millions of timestamped events are generated on
a daily basis. In this paper, we introduced a sequence based similarity measure to
model pattern similarity. We proposed an efficient implementation, the counting tree,
which is based on the suffix tree structure. Experimental results show that the SeqClus
algorithm achieves an order of magnitude speedup over the current best algorithm
pCluster. The new model also enables us to identify clusters overlooked by previous
methods such as the pCluster model. Furthermore, the sequence model applies naturally and directly to sequential data.
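Transformation (i) above can be sketched in a few lines; the function name, the `objects` dict, and the pair-keyed derived dimensions are ours, for illustration only:

```python
from itertools import combinations

def pairwise_difference_transform(objects):
    """Transformation (i): for every pair of dimensions (i, j), create a
    derived dimension whose value is x_i - x_j.  The transformed dataset
    has |A|*(|A|-1)/2 dimensions, which is why this is infeasible for
    high-dimensional datasets."""
    transformed = {}
    for obj_id, x in objects.items():
        transformed[obj_id] = {
            (i, j): x[i] - x[j] for i, j in combinations(range(len(x)), 2)
        }
    return transformed

# Two objects that rise and fall coherently get identical derived values,
# so a value-based subspace clustering algorithm could group them.
data = {"g1": [3, 5, 4], "g2": [10, 12, 11]}
t = pairwise_difference_transform(data)
assert t["g1"] == t["g2"] == {(0, 1): -2, (0, 2): -1, (1, 2): 1}
```

The sketch also makes the cost concrete: the derived dimensionality grows quadratically in |A|, which is the blow-up the text cites as the reason these conversions are not feasible.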
CHAPTER 6
Mining Quality
In this chapter, we introduce a general framework to improve mining quality by exploiting local data dependency. The technique builds on Markov network modeling and belief propagation.
Our work on continuous stream mining can be viewed as mining with local tem-
poral constraints, and subspace pattern based clustering can be viewed as mining with
local spatial constraints. They are special cases with hidden local temporal/spatial re-
lationships. The work in this chapter addresses local data dependencies in a general
form.
6.1 Introduction
The usefulness of knowledge models produced by data mining methods critically de-
pends on two issues. (1) Data quality: Data mining tasks expect to have accurate and
complete input data. But, the reality is that in many situations, data is contaminated,
or is incomplete due to limited bandwidth for acquisition. (2) Model adequacy: Many
data mining methods, for reasons of efficiency or by design limitation, use a model incapable of capturing the rich relationships embedded in data. The mining results from an inadequate data model generally need to be improved.
Fortunately, a wide spectrum of applications exhibit strong dependencies between
data samples. For example, the readings of nearby sensors are correlated, and proteins
interact with each other when performing crucial functions. Data dependency has not
received sufficient attention in data mining research yet, but it can be exploited to
remedy the problems mentioned above. We study this in several typical scenarios.
Low Data Quality Issue Many data mining methods are not designed to deal with
noise or missing values; they take the data “as is” and simply deliver the best results
obtainable by mining such imperfect data. In order to get more useful mining results,
contaminated data needs to be cleaned, and missing values need to be inferred.
Data Contamination An example of data contamination is encountered in optical
character recognition (OCR), a technique that translates pictures of characters into
a machine readable encoding scheme. Current OCR algorithms often translate two
adjacent letters “ ff ” into a “# ” sign, or incur similar systematic errors.
In the OCR problem, the objective is not to ignore or discard noisy input, but
to identify and correct the errors. This is doable because the errors are introduced
according to certain patterns. The error patterns in OCR may be related to the shape
of individual characters, the adjacency of characters, or illumination and positions. It
is thus possible to correct a substantial number of errors with the aid of neighboring
characters.
Data Incompleteness A typical scenario where data is incomplete is found in
sensor networks where probing has to be minimized due to power restrictions, and
thus data is incomplete or only partially up-to-date. Many queries ask for the mini-
mum/maximum values among all sensor readings. For that, we need a cost-efficient
way to infer such extrema while probing the sensors as little as possible.
The problem here is related to filling in missing attributes in data cleansing [GNV96].
The latter basically learns a predictive model using available data, then uses that model
to predict the missing values. The model training there does not consider data correla-
tion. In the sensor problem, however, we can leverage the neighborhood relationship,
as sensor readings are correlated if the sensors are geographically close. Even knowl-
edge of far-away sensors helps, because that knowledge can be propagated via sensors
deployed in between. By exploiting sensor correlation, unprobed sensors can be accu-
rately inferred, and thus data quality can be improved.
Inadequate Data Model Issue Many well known mining tools are inadequate to
model complex data relationships. For example, most classification algorithms, such
as Naive Bayes and Decision Trees, approximate the posterior probability of hidden
variables (usually class labels) by examining individual data features. Such models fail to capture the strong dependencies or interactions among data samples.
Take protein function prediction as a concrete classification example. Proteins
are known to interact with some others to perform functions, and these interactions
connect genes to form a graph structure. If one chooses Naive Bayes or Decision Trees to predict unknown protein functions, one is basically confined to a tabular data model, and thus loses rich information about interactions.
Markov networks, as a type of descriptive model, provide a convenient represen-
tation for structuring complex relationships, and thus a solution for handling proba-
bilistic data dependency. In addition, efficient techniques are available to do inference
on Markov networks, including the powerful Belief Propagation [YFW00] algorithm.
The power in modeling data dependency, together with the availability of efficient
inference tools, makes Markov networks very useful data models. They have the po-
tential to enhance mining results obtained from data whose data dependencies are un-
derused.
Our Contribution The primary contribution of this chapter is a unified approach to improving mining quality by systematically considering data dependency in data mining. We adopt Markov networks as the data model, and use belief propagation to efficiently compute the marginal or maximum posterior probability, so as to clean the data, to infer missing values, or to generally improve the mining results of a model that ignores data dependency. This chapter may also contribute to data mining practice through our investigations of several real-life applications: by exploiting data dependency in these applications, clear improvements have been achieved in data quality and in the usefulness of mining results.
Outline We describe Markov networks in the next section, including pairwise Markov networks, a special form of Markov network that not only models local dependency well but also allows very efficient computation by belief propagation. We then address the three above-mentioned examples in Sections 6.3, 6.4 and 6.5, and conclude with related work and discussion in Section 6.6.
6.2 Markov Networks
Markov networks have been successfully applied to many problems in different fields,
such as artificial intelligence [Pea88], image analysis [SS94], turbo decoding [MMC98]
and condensed matter physics [AM01]. They also have the potential to become very useful tools for data mining.
Figure 6.1: Example of a pairwise Markov network. In (a), the white circles denote the random variables, and the shaded circles denote the external evidence. In (b), the potential functions φ() and ψ() are shown.
6.2.1 Graphical Representation
The Markov network is naturally represented as an undirected graph G = (V, E), where V is the vertex set in one-to-one correspondence with the set of random variables X = {xi} to be modeled, and E is the set of undirected edges, defining the neighborhood relationships among variables, that is, their local statistical dependencies. These local dependencies imply that the joint probability distribution over the whole graph can be factored into a product of local functions on cliques of the graph. A clique, denoted XC, is a completely connected subgraph (including singletons). This factorization is in fact the most favorable property of Markov networks.
Let C be a set of vertex indices of a clique, and let 𝒞 be the set of all such C. A potential function ψXC(xC) is a function on the possible realizations xC of the clique XC. Potential functions can be interpreted as "constraints" among the vertices of a clique: they favor certain local configurations by assigning them larger values.
The joint probability of a graph configuration p({x}) can then be factored into

    P({x}) = (1/Z) ∏_{C∈𝒞} ψXC(xC)        (6.1)

where Z is a normalizing constant:

    Z = ∑_{{x}} ∏_{C∈𝒞} ψXC(xC)
6.2.2 Pairwise Markov Networks
Computing joint probabilities on cliques reduces computational complexity, but the computation may still be difficult when cliques are large. In a category of problems where our interest involves only pairwise relationships among the samples, we can use pairwise Markov networks. A pairwise Markov network defines potential functions only on pairs of nodes that are connected by an edge.
In practical problems, we may observe some quantities of the underlying random variables {xi}, denoted {yi}. The {yi} are often called the evidence of the random variables. In the OCR example discussed in Section 6.1, for instance, the underlying segments of text are the variables, while the segments of the noisy text we observe are the evidence. This observed external evidence is used to make inferences about the values of the underlying variables. The statistical dependency between xi and yi is written as a joint compatibility function φi(xi, yi), which can be interpreted as the "external potential" from the external field.
A second type of potential function is defined between neighboring random variables: the compatibility function ψij(xi, xj) captures the "internal binding" between two neighboring nodes i and j. An example of a pairwise Markov network is illustrated in Figure 6.1(a), where the white circles denote the random variables and the shaded circles denote the evidence. Figure 6.1(b) shows the potential functions φ() and ψ().
Using the pairwise potentials defined above and incorporating the external evidence, the overall joint probability of a graph configuration in Eq. (6.1) is approximated by

    P({x}, {y}) = (1/Z) ∏_{(i,j)} ψij(xi, xj) ∏_i φi(xi, yi)        (6.2)

where Z is a normalization factor and the product over (i, j) runs over all pairs of connected neighbors.
6.2.3 Solving Markov Networks
Solving a Markov network involves two phases:
• The learning phase, a phase that builds up the graph structure of the Markov
network, and learns the two types of potential functions, φ()’s and ψ()’s, from
the training data.
• The inference phase, a phase that estimates the marginal posterior probabilities
or the local maximum posterior probabilities for each random variable, such that
the joint posterior probability is maximized.
In general, learning is an application-dependent process of statistics collection: the specific application defines the random variables, the neighborhood relationships, and hence the potential functions. We will look at the learning phase in detail with concrete applications in Sections 6.3-6.5.
The inference phase can be solved using a number of methods: simulated anneal-
ing [KGV83], mean-field annealing [PA87], Markov Chain Monte Carlo [GRS95], etc.
These methods either take an unacceptably long time to converge, or make oversimplified assumptions such as total independence between variables. We choose the belief propagation method, whose computational complexity is proportional to the number of nodes in the network, which assumes only local dependencies, and which has proved effective on a broad range of Markov networks.
Figure 6.2: Message passing in a Markov network. Messages are defined by Eqs.(6.3)
or (6.4) under two types of rules, respectively.
6.2.4 Inference by Belief Propagation
Belief propagation (BP) is a powerful inference tool on Markov networks. It was pi-
oneered by Judea Pearl [Pea88] in belief networks without loops. For Markov chains
and Markov networks without loops, BP is an exact inference method. Even for loopy
networks, BP has been successfully used in a wide range of applications [MMC98, MWJ99].
We give a short description of BP in this subsection.
The BP algorithm iteratively propagates “messages” in the network. Messages are
passed between neighboring nodes only, ensuring the local constraints, as shown in
Figure 6.2. The message from node i to node j is denoted as mij(xj), which intuitively
tells how likely node i thinks that node j is in state xj . The message mij(xj) is a vector
of the same dimensionality as xj .
There are two types of message passing rules:
• The SUM-product rule, which computes the marginal posterior probabilities.
• The MAX-product rule, which computes the maximum a posteriori probabilities.
For discrete variables, messages are updated using the SUM-product rule:

    m^{t+1}_ij(xj) = ∑_{xi} φi(xi, yi) ψij(xi, xj) ∏_{k∈N(i), k≠j} m^t_ki(xi)        (6.3)

or the MAX-product rule:

    m^{t+1}_ij(xj) = max_{xi} φi(xi, yi) ψij(xi, xj) ∏_{k∈N(i), k≠j} m^t_ki(xi)        (6.4)
where m^t_ki(xi) is the message computed in the previous iteration of BP, and k runs over all neighbors of node i except node j.
BP is an iterative algorithm. When the messages converge, the final belief b(xi) is computed. With the SUM-product rule, bi(xi) approximates the marginal probability p(xi), and is proportional to the product of the local compatibility at node i, φi(xi, yi), and the messages coming from all neighbors of node i:

    bi(xi)^SUM ∝ φi(xi, yi) ∏_{j∈N(i)} mji(xi)        (6.5)
where N(i) denotes the set of neighbors of node i.
If the MAX-product rule is used instead, b(xi) approximates the maximum a posteriori probability:

    bi(xi)^MAX = arg max_{xi} φi(xi, yi) ∏_{j∈N(i)} mji(xi)        (6.6)
For more theoretical details of the belief propagation and its generalization, we
refer the reader to [YFW00].
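A minimal sketch of the SUM-product updates (Eq. 6.3) and beliefs (Eq. 6.5) on a small discrete network may make the message passing concrete. All function and variable names are ours, and the three-node chain at the end is a toy example, not one of the networks studied later in this chapter:

```python
import numpy as np

def sum_product_bp(phi, psi, n_iters=20):
    """Sum-product BP (Eqs. 6.3 and 6.5) on a discrete pairwise Markov
    network.  phi[i] is the evidence vector phi_i(x_i, y_i) for node i;
    psi[(i, j)] is the compatibility matrix, rows indexed by x_i and
    columns by x_j.  Each dict key (i, j) defines an undirected edge."""
    neighbors = {i: [] for i in phi}
    for i, j in psi:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # initialize every directed message to a uniform distribution
    msgs = {}
    for i, j in psi:
        msgs[(i, j)] = np.full(len(phi[j]), 1.0 / len(phi[j]))
        msgs[(j, i)] = np.full(len(phi[i]), 1.0 / len(phi[i]))
    for _ in range(n_iters):
        new = {}
        for i, j in msgs:
            psi_ij = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            prod = phi[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod *= msgs[(k, i)]     # incoming messages m_ki
            m = psi_ij.T @ prod              # Eq. (6.3): sum over x_i
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in phi:
        b = phi[i].copy()
        for j in neighbors[i]:
            b *= msgs[(j, i)]                # Eq. (6.5)
        beliefs[i] = b / b.sum()
    return beliefs

# Toy example: a 3-node chain 0-1-2 with smoothing potentials.  Evidence
# at node 0 favors state 0; after BP, node 2 also leans toward state 0.
smooth = np.array([[0.9, 0.1], [0.1, 0.9]])
phi = {0: np.array([0.9, 0.1]), 1: np.array([0.5, 0.5]),
       2: np.array([0.5, 0.5])}
beliefs = sum_product_bp(phi, {(0, 1): smooth, (1, 2): smooth})
assert beliefs[2][0] > 0.5
```

The synchronous update (all new messages computed from the previous iteration's messages) is one of several scheduling choices; on loop-free networks such as this chain, the result is exact.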
6.3 Application I: Cost-Efficient Sensor Probing
In sensor networks, minimizing communication is among the key research issues. The challenging problem is how to probe only a small number of sensors, yet effectively
Figure 6.3: Sensor site map in the states of Washington and Oregon.
infer the unprobed sensors from the known ones. Cost-efficient sensor probing represents a category of problems where complete data is not available and must be compensated for by inference.
Our approach here is to model a sensor network with a pairwise Markov network,
and use BP to do inference. Each sensor is represented by a random variable in the
Markov network. Sensor neighborhood relationships are determined by spatial posi-
tions. For example, one can specify a distance threshold so that sensors within the
range are neighbors. Neighbors are connected by edges in the network.
In the rest of this section, we study a rainfall sensornet distributed over Washington
and Oregon [oW]. The sensor recordings were collected during 1949-1994. We use
167 sensor stations which have complete recordings during that period. The sensor
site map is shown in Figure 6.3.
6.3.1 Problem Description and Data Representation
The sensor recordings were collected in past decades over two states along the Pacific
Northwest. Since rain is a seasonal phenomenon, we split the data by week and build a
Markov network for each week.
We need to design the potential functions φi(xi, yi) and ψij(xi, xj) in Eq. (6.2) in
order to use belief propagation. One can use a Gaussian or its variants to compute the potential functions. But in the sensornet we study, we find that the sensor readings are dominated by zeroes, while the non-zero values span a wide range. Clearly a Gaussian is not a good choice for modeling such skewed data; neither are Gaussian mixtures, due to the limited data. Instead, we prefer to use discretized sensor readings in the computation. Our discretization scheme is given in Section 6.3.3.
The φ() functions should tell how likely we are to observe a reading yi for a given sensor state xi. It is natural to use the likelihood function:
φi(xi, yi) = P(yi|xi) (6.7)
The ψ() functions specify the dependence of sensor xj's reading on its neighbor xi:

    ψij(xi, xj) = P(xj|xi)        (6.8)
6.3.2 Problem Formulation
We give a theoretical analysis of the problem here. As we will see shortly, the problem fits well into maximum a posteriori (MAP) estimation on a Markov network, solvable by belief propagation.
Objective: MAP

Let X be the collection of all underlying sensor readings, and Y the collection of all probed readings. Using Bayes' rule, the joint posterior probability of X given Y is:

    P(X|Y) = P(Y|X) P(X) / P(Y)        (6.9)
Since P (Y ) is a constant over all possible X , we can simplify this problem of
maximizing the posterior probability to that of maximizing the joint probability

    P(X, Y) = P(Y|X) P(X)        (6.10)

Eq. (6.10) is the objective function to be maximized, and it is proportional to the posterior probability.
Likelihood

In a Markov network, the likelihood of the readings Y depends only on the variables they are directly connected to:

    P(Y|X) = ∏_{i=1}^{m} P(yi|xi)        (6.11)

where m is the number of probed sensors.
Prior

Priors shall be defined to capture the constraints between neighboring sensor readings. By exploiting the Markov property of the sensors, we define the prior to involve only the first-order neighborhood. Thus, the prior is proportional to the product of the compatibilities between all pairs of neighboring sensors:

    P(X) ∝ ∏_{(i,j)} P(xj|xi)        (6.12)
Solvable by BP

Substituting Eqs. (6.11) and (6.12) into the objective Eq. (6.10), we obtain the joint probability to be maximized:

    P(X, Y) = (1/Z) ∏_{(i,j)} P(xj|xi) ∏_{i=1}^{m} P(yi|xi)        (6.13)
[Plots omitted: probing ratio and Top-10 recall rates (raw and discrete values) over the 52 weeks, for BP-based probing (a) and naive probing (b).]
Figure 6.4: Top-K recall rates vs. probing ratios: (a) results obtained by our BP-based probing; (b) by the naive probing. On average, the BP-based approach probes 8% fewer sensors, yet achieves a 13.6% higher recall rate for raw values and a 7.7% higher recall rate for discrete values.
Looking back at the φ() and ψ() functions defined in Eqs. (6.7) and (6.8), we see that this objective function is of the form:

    P(X, Y) = (1/Z) ∏_{(i,j)} ψ(xi, xj) ∏_{i=1}^{m} φ(xi, yi)        (6.14)
where Z is a normalizing constant.
This is exactly the form in Eq.(6.2), where the joint probability over the pairwise
Markov network is factorized into products of localized potential functions. Therefore,
it is clear that the problem can be solved by belief propagation.
6.3.3 Learning and Inference
The learning part is to find the φ() and ψ() functions for each sensor, as defined in Eqs. (6.7) and (6.8). The learning is straightforward. We discretize the sensor readings of the past 46 years, using the first 30 years for training and the remaining 16 years for testing. In the discrete space, we simply count the frequency of each value a sensor can take, which gives φ(), and the conditional frequencies of sensor values given those of its neighbors, which give ψ().
We use a simple discretization with a fixed number of bins, 11 in our case, for each sensor. The first bin is dedicated to zeroes, which consistently account for over 50% of the readings. The 11 bins span the following ranges: [0, 0], [1, 5], [6, 10], [11, 30], [31, 60], [61, 100], [101, 200], [201, 400], [401, 1000], [1001, 1500], and (1500, ∞). This very simple discretization method has been shown to work well in the sensor experiments. More elaborate techniques, such as histogram equalization, which balances bin populations with adaptive bin numbers, may further boost performance.
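A minimal sketch of this discretization and of the frequency counting that yields φ(); the helper names and the toy readings are ours:

```python
import bisect
from collections import Counter

# Upper bin edges from Section 6.3.3; bin 0 holds the zero readings, and
# the last (11th) bin is open-ended (> 1500).
BIN_UPPER = [0, 5, 10, 30, 60, 100, 200, 400, 1000, 1500]

def discretize(reading):
    """Map an integer rainfall reading to one of the 11 bins."""
    return bisect.bisect_left(BIN_UPPER, reading)

def learn_phi(training_readings):
    """phi(): empirical frequency of each discrete value over the
    training period (the frequency counting described above)."""
    counts = Counter(discretize(r) for r in training_readings)
    total = len(training_readings)
    return [counts.get(b, 0) / total for b in range(11)]

phi = learn_phi([0, 0, 0, 2, 7, 45, 1600])
assert discretize(0) == 0 and discretize(5) == 1 and discretize(6) == 2
assert phi[0] == 3 / 7   # zeroes dominate, as in the real data
```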
For inference, belief propagation is not guaranteed to give the exact maximum a posteriori distribution, as there are loops in the Markov network. However, loopy belief propagation still gives satisfactory results, as we will see shortly.
6.3.4 Experimental Results
We evaluate our approach using Top-K queries. A Top-K query asks for the K sensors
with the highest values. It is not only a popular aggregation query that the sensor
community is interested in, but also a good metric for probing strategies as the exact
answer requires contacting all sensors.
We design a probing approach in which sensors are picked for probing based on
their local maximum a posteriori probabilities computed by belief propagation, as follows.
BP-based Probing:
1. Initialization: Compute the expected readings of sensors using the training data.
Pick the top 20.
Figure 6.5: Belief updates in six BP iterations ((0)-(5)). Initially only the four sensors at the corners are probed. The strong beliefs of these four sensors are carried over by their neighbors to sensors throughout the network, causing the beliefs of all sensors to be updated iteratively until convergence.
2. Probe the selected sensors.
3. True values acquired in step 2 become external evidence in the Markov network.
Propagate beliefs with all evidence acquired so far.
4. Again, pick the top sensors with the highest expectations for further probing, but
this time use the updated distributions to compute expectations. When there are
ties, pick them all.
5. Iterate steps 2-4, until beliefs in the network converge.
6. Pick the top K with the highest expectations according to BP MAP estimation.
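The numbered steps above can be sketched as the following outer loop; `probe` and `propagate` are caller-supplied functions (in the dissertation `propagate` is the belief propagation of Section 6.2.4; the toy `propagate` in the usage example merely echoes the evidence), and all names are ours:

```python
def bp_probing(expected, probe, propagate, k=10, batch=20, max_rounds=5):
    """Outer loop of BP-based probing (steps 1-6).  expected maps each
    sensor to its expected reading; probe(s) returns the true reading of
    sensor s; propagate(evidence) returns updated expectations given all
    probed values so far (belief propagation, in the dissertation)."""
    evidence = {}
    for _ in range(max_rounds):
        unprobed = [s for s in expected if s not in evidence]
        if not unprobed:
            break
        # steps 1 and 4: pick unprobed sensors with highest expectations
        for s in sorted(unprobed, key=expected.get, reverse=True)[:batch]:
            evidence[s] = probe(s)          # step 2: acquire true values
        expected = propagate(evidence)      # step 3: propagate beliefs
    # step 6: final top-K, trusting probed values where available
    final = {s: evidence.get(s, e) for s, e in expected.items()}
    return sorted(final, key=final.get, reverse=True)[:k]

# Usage with a toy echo "propagate": 30 sensors, only 20 are contacted.
truth = {f"s{i:02d}": float(i) for i in range(30)}
probed = []
def probe(s):
    probed.append(s)
    return truth[s]
top = bp_probing({s: 0.0 for s in truth}, probe,
                 propagate=lambda ev: {s: ev.get(s, 0.0) for s in truth},
                 k=3, batch=10, max_rounds=2)
assert len(probed) == 20
```

The convergence test of step 5 is simplified here to a fixed round budget; a real implementation would compare successive belief vectors.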
As a comparative baseline, we have also conducted experiments using a naive prob-
ing strategy as follows:
Naive Probing:
1. Compute the expectations of sensors. Pick the top 25% sensors.
2. Probe those selected sensors.
3. Pick the top K.
Performance of the two approaches is shown in Figure 6.4 (a) and (b), respectively.
On each diagram, the bottom curve shows the probing ratio, and the two curves on the
top show the recall rates for raw values and discrete values, respectively. We use the
standard formula to compute recall rate, i.e.:
    Recall = |S ∩ T| / |T|        (6.15)

where S is the returned top-K sensor set, and T is the true top-K set.
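Eq. (6.15) is a one-liner; the function and argument names are ours:

```python
def recall(returned_topk, true_topk):
    """Eq. (6.15): the fraction of the true top-K set that the probing
    strategy actually returned."""
    S, T = set(returned_topk), set(true_topk)
    return len(S & T) / len(T)

assert recall(["s1", "s2", "s3"], ["s2", "s3", "s4"]) == 2 / 3
```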
Since the sensor readings are discretized in our experiments, we can compute S and T using raw values or discrete values. Discrete recall demonstrates the effectiveness of BP, while raw recall may be of more interest for real applications. As can be seen from Figure 6.4, raw recall is lower than discrete recall, due to the error introduced in the discretization step. We expect raw recall to improve when a more elaborate discretization technique is adopted.
Figure 6.4 shows clearly that the BP-based approach outperforms the naive approach in terms of both recall rates, while requiring less probing. On average, the BP-based approach has a discrete recall of 88% and a raw recall of 78.2% after probing only 17.5% of the sensors. The naive approach has a discrete recall of only 79.3% and a raw recall of only 64.6% after probing 25% of the sensors.
The results shown in Figure 6.4 are obtained for K = 10. The relative performance
remains the same for other values K = 20, 30, 40.
6.3.5 How BP Works
A closer look at the changing sensor beliefs during the iterations shows how belief
propagation provides effective inference. We look at 49 sensors that form a 7× 7 grid,
each having the surrounding sensors (≤ 8) as its neighbors. Only the four sensors at the corners are probed. We use the ψ() functions acquired by learning, but set φ() to be uniform, solely for demonstration purposes. (The original φ() is so skewed that BP converges too fast to demonstrate a moderately sized sequence of belief changes.)
The beliefs are shown in Figure 6.5, one diagram per iteration. In the first diagram, only the four corner sensors have an impulse at the true value, while all the others show a flat distribution. But the probability histogram of each unprobed sensor grows notably sharper as BP iterates, showing how beliefs grow stronger by receiving messages from neighbors.
This sensor probing on a small scale gives a sense of how effective belief propaga-
tion can be in Markov networks.
• From Figure 6.5, we can see that beliefs are able to propagate through the net-
work via messages quickly. The messages of the four sensors at the corners are
first passed to the nearby sites, then carried all the way to the central sites in just
a few iterations.
• We can also see that well-informed nodes can help the less informed ones build up their beliefs. Informally, we say a node is well informed, or has stronger beliefs, if its belief distribution has a lower entropy. Figure 6.5 clearly shows that the four corner sensors pass strong beliefs to the others to help them compute a good approximation of the posterior.
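The informal notion of a "well-informed" node can be made concrete with Shannon entropy; a minimal sketch, with names ours:

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a belief distribution; a lower
    entropy means a better-informed node."""
    return -sum(p * math.log2(p) for p in belief if p > 0)

probed = [1.0, 0.0, 0.0]        # an impulse at the true value
uninformed = [1/3, 1/3, 1/3]    # flat distribution
assert entropy(probed) == 0.0
assert entropy(probed) < entropy(uninformed)
```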
6.4 Application II: Enhancing Protein Function Predictions
Local data dependency can not only help infer missing values, as in the sensor exam-
ple, but can also be exploited to enhance mining results. Many data mining methods,
for efficiency consideration or design limitation, use a model incapable of capturing
rich relationships embedded in data. Most discriminative models like Naive Bayes
and SVM belong to this category. Predictions of these models can be improved, by
exploiting local data dependency using Markov networks. The predictions are used as
the likelihood proposal, and message passing between variables refines and reinforces
the beliefs. Next we show how to improve protein function predictions in this way.
6.4.1 Problem Description
Proteins tend to localize in various parts of cells and interact with one another, in order
to perform crucial functions. One task in the KDD Cup 2001 [CHH+01] is to predict
protein functions. The training set contains 862 proteins with known functions, and
the testing set includes 381 proteins. The interactions between proteins, including the
testing genes, are given. Other information provided specifies a number of properties
of individual proteins or of the genes that encode the proteins. These include the chromosome on which the gene appears, the phenotype of organisms with differences in this gene,
etc.
Since the information about individual proteins or genes consists of fixed features, it becomes
Figure 6.6: Logistic curve that is used to blur the margin between the belief on two
classes.
crucial to learn from the interactions. According to the report of the cup organizers,
most competitors organized data in relational tables, and employed algorithms that deal
with tabular data. However, compared with tables, graphical models provide a much
more natural representation for interacting genes. With a Markov network model,
interactions can be modeled directly using edges, avoiding preparing a huge training
table. Interacting genes can pass messages to each other, thus getting their beliefs
refined together.
In the rest of this section, we show a general way of enhancing a weak classifier by simply leveraging local dependency. The classifier we use is Naive Bayes, learned from the relational table. We build a Markov network in which interacting genes are connected as neighbors. The φ() potentials come from the Naive Bayes predictions, and the ψ() potentials are learned from the gene interactions.
6.4.2 Learning Markov Network
We separate the learning of each function, as focusing on one function at a time is easier. There are 13 function categories, hence we build 13 Markov networks. To prepare the initial beliefs for a network, we first learn a Naive Bayes classifier, which outputs a probability vector b0() indicating how likely a gene is to perform the function in question or not.
Each gene i maps to a binary variable xi in the Markov network. First we design the φ() potentials for {xi}. One could set the Naive Bayes prediction b0() to be φ(), but that way the Naive Bayes classifier would be over-trusted, making it harder to correct misclassifications. Instead, we adopt a generalized logistic function to blur the margin between the beliefs on the two classes, while still keeping the prediction decision.
f = a / (1 + e^(−α(x−β))) + b (6.16)
In the experiments, we set a = 0.75, b = 0.125, α = 6, and β = 0.5. The logistic
curve is shown in Figure 6.6.
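With these parameter values, Eq. (6.16) maps a Naive Bayes probability in [0, 1] into roughly [0.16, 0.84], softening confident predictions while keeping the decision boundary at 0.5. A minimal sketch (the function name `squash` is ours):

```python
import math

def squash(p, a=0.75, b=0.125, alpha=6.0, beta=0.5):
    """Generalized logistic of Eq. (6.16): pulls a probability p toward
    0.5 while preserving which side of 0.5 it falls on."""
    return a / (1.0 + math.exp(-alpha * (p - beta))) + b
```

Since squash(0.5) = 0.5 and the curve is monotone, the hard decision is unchanged while extreme beliefs are tempered, leaving room for message passing to overturn them.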
The ψ() potentials are learned from protein interactions. Interactions are measured
by the correlation between the expression levels of the two encoding genes. At first
we tried to relate the functions of two genes in a simple way: a positive correlation
indicates that with a fixed probability both or neither of the genes perform the function,
while a negative correlation indicates that one and only one gene performs the function.
This leads to a single fixed ψ() function for all interacting genes. But a close look
at the interaction data shows that 25% of the time this assumption does not hold. In reality,
two genes participating in the same function may sometimes be negatively correlated;
a more influential phenomenon is that genes may participate in several functions, so
the observed correlation is a combined effect involving multiple functions.
We decided to learn the distribution of correlation values separately for three types
of interactions: (a) FF: a group for protein pairs where both perform the function,
(b) FNF: a group for pairs where one and only one performs the function, and (c) NFNF: a
group for protein pairs where neither performs the function. Thus, the potential function
ψi,j defines how likely a given correlation value is to be observed for genes xi and xj,
depending on whether each of xi and xj performs the function or not.
Figure 6.7: Distributions of correlation values learned for two functions. Left column:
cell growth; right column: protein destination. In each column, the distributions from
top to bottom are learned from groups (a), (b) and (c), respectively.
In Figure 6.7, we show the distributions of correlation values learned for two functions.
The left column is for a function related to cell growth, and the right column for a
function related to protein destination. From top to bottom in each column, the distributions
are learned from interaction groups (a), (b) and (c), respectively. The figures show that correlation
distributions differ among groups, and are specific to functions as well.
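The group-wise distributions can be estimated as smoothed histograms over the observed correlations; ψi,j then looks up the density of the observed correlation under each joint assignment of the two genes. A sketch under assumed data layouts (pairs of gene indices with correlations, 0/1 function labels; all names are ours):

```python
import numpy as np

def learn_psi_tables(pairs, labels, bins=20):
    """Empirical correlation histograms for the three interaction groups.
    pairs: list of (i, j, corr); labels: dict gene -> 0/1 for the function.
    Returns group -> normalized histogram over correlations in [-1, 1]."""
    groups = {"FF": [], "FNF": [], "NFNF": []}
    for i, j, corr in pairs:
        k = labels[i] + labels[j]          # 0, 1, or 2 genes have the function
        groups[("NFNF", "FNF", "FF")[k]].append(corr)
    edges = np.linspace(-1, 1, bins + 1)
    tables = {}
    for g, vals in groups.items():
        hist, _ = np.histogram(vals, bins=edges)
        tables[g] = (hist + 1) / (hist.sum() + bins)  # Laplace smoothing
    return tables

def psi(corr, xi, xj, tables, bins=20):
    """psi_{i,j}: likelihood of the observed correlation under a joint
    assignment (xi, xj) of the two genes."""
    g = ("NFNF", "FNF", "FF")[xi + xj]
    idx = min(int((corr + 1) / 2 * bins), bins - 1)
    return tables[g][idx]
```

The smoothing term is our addition; it keeps every (correlation, assignment) pair at nonzero probability so that belief propagation never multiplies by zero.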
Figure 6.8: A subgraph in which testing genes got correct class labels due to message
passing.
6.4.3 Experiments
Naive Bayes does not perform well on this problem, because it does not model the gene
interactions sufficiently, and thus cannot fully utilize the rich interaction information.
Taking the average predictive accuracy of all classifiers, one per function, the overall
accuracy of Naive Bayes is 88%. Belief propagation improves this to 90%.
To exemplify how misclassifications get corrected due to message passing, we
show a subgraph of genes in Figure 6.8. The white circles represent genes (variables),
and the shaded circles represent external evidence. Only training genes have corre-
sponding external evidence. The 1’s or 0’s in the circles tell whether a gene has the
function in question or not. For interested readers, we also put the gene ID below the
circle. The subgraph contains four training genes and five testing genes. All these
testing genes were misclassified by Naive Bayes. After receiving strong beliefs from
their neighboring genes, four out of five testing genes were correctly classified. The
other test gene ‘G230291’ was misclassified by both, but Naive Bayes predicted a 0%
probability that it has the function (which it in fact does), while belief propagation
increased this belief to 25%.
We also evaluated our approach using the score function originally used in the 2001
KDD Cup [CHH+01]. First we picked out all the functions we predicted for a gene.
If more functions are predicted than the true number (which is actually the number of
duplicates of that gene in the test table provided), we remove the ones with the smallest
confidence. The final score is the ratio of correct predictions, including both positive
and negative predictions. Our final score is 91.2%, close to the Cup winner’s 93.6%.
Although the winner scored reasonably high, they organized data in relational tables
and did not fully exploit gene interactions. We expect that their method could perform
better if integrated with our approach to exploit local dependencies between genes.
The Cup winner organized data in relational tables, which are not designed for
representing complex relationships. To make up for this, they manually created new features,
such as computing “neighbors” within k (k > 1) hops following neighbor links. Even
so, these new features can only be treated the same as the other individual features.
The rich relationship information in the original graph structure was lost. Graphical
models, on the other hand, are natural models for complex relationships. Markov
networks together with belief propagation provide a general and powerful modeling
and inference tool for problems governed by local constraints, such as protein function
prediction.
6.5 Application III: Sequence Data Denoising
Sequences are ordered lists of elements, such as text strings, DNA sequences, or binary
codes in channel transmission. This type of data often exhibits dependencies between
adjacent elements. For example, there are rich dependencies embedded in English
text. Such sequence data can be modeled using Markov chains, a degenerate form of
Markov networks.
Moreover, errors in sequence data often have neighborhood patterns. OCR, discussed
in Section 6.1, gives an example where errors are related to the shapes of characters
and to their relative positions. The mutation of a nucleotide is also influenced
by its nearby bases. That the Markov property is satisfied by both the sequence data
itself and by errors strongly suggests the applicability of belief propagation for
sequence data denoising. Indeed, for Markov chains, belief propagation is theoretically
guaranteed to give exact marginal or maximum a posteriori probabilities.
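On a chain, sum-product belief propagation reduces to a forward and a backward pass of messages. The following sketch (assuming a single shared pairwise potential; the function and variable names are ours) returns exact node marginals:

```python
import numpy as np

def chain_marginals(phi, psi):
    """Exact sum-product belief propagation on a Markov chain.
    phi: (T, S) node potentials; psi: (S, S) pairwise potential,
    psi[s, t] scoring state s followed by state t.
    Returns (T, S) normalized marginals."""
    T, S = phi.shape
    fwd = np.ones((T, S))   # message into node t from the left
    bwd = np.ones((T, S))   # message into node t from the right
    for t in range(1, T):
        m = (fwd[t - 1] * phi[t - 1]) @ psi
        fwd[t] = m / m.sum()
    for t in range(T - 2, -1, -1):
        m = psi @ (bwd[t + 1] * phi[t + 1])
        bwd[t] = m / m.sum()
    belief = phi * fwd * bwd
    return belief / belief.sum(axis=1, keepdims=True)
```

With a uniform ψ the marginals collapse to the normalized node potentials, which is a convenient sanity check; replacing the sums with maxima would give the max-product (MAP) variant mentioned above.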
In the rest of this section, we study the problem of correcting errors in noisy documents.
While simple, this problem exemplifies many basic characteristics of sequence
data mining.
6.5.1 Problem Description and Data Representation
A document is a text sequence consisting of characters from an alphabet, while a noisy
document is the output of some recognizer with systematic errors. We split the
sequence into small segments, each having n characters, and let neighboring segments
overlap by m characters, m < n. We use a random variable xi = (xi^(1), · · · , xi^(n)) to
represent each underlying clean segment i. The corresponding observed segment in
the noisy document is denoted yi = (yi^(1), · · · , yi^(n)). Each segment, except those
that start or end the sequence, has a neighbor segment on either side.
Now we design the potential functions φi(xi, yi) and ψij(xi, xj). For φ(), the
definition should specify how likely we are to observe yi given xi. A natural choice is to define
φ() to be a likelihood function
φi(xi, yi) = P (yi|xi) (6.17)
For a short segment, we can assume independence between characters. Thus, φ()
can be written as
φi(xi, yi) = ∏_{l=1}^{n} P(yi^(l) | xi^(l)) (6.18)
For ψ(), the definition should specify how compatible two neighboring segments xi
and xj are. Again, we can assume independence between the characters in the two segments,
except for those in the overlapping part. Consider two overlapping characters,
xi^(k) and xj^(l). If the probability is zero that xi^(k) will change to xj^(l) or vice versa, then the
two segments, xi and xj, are incompatible. The resulting mutation probability of the
overlapping part quantifies the compatibility of two neighboring segments. Non-adjacent
segments are incompatible. Formally, we define an asymmetric ψ() function
on xi and xj, where xi is the left neighbor of xj:
ψij(xi, xj) = ∏_{l=1}^{m} P(xj^(l) | xi^(n−m+l)) (6.19)
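Given the mutation probabilities P(b | a) between characters, Eqs. (6.18) and (6.19) are simple products over characters. A sketch with a nested-dict mutation table M[a][b] = P(b | a) (the helper names are ours):

```python
def phi_seg(x, y, M):
    """Eq. (6.18): likelihood of observed segment y given clean segment x,
    assuming per-character independence. M[a][b] = P(b | a)."""
    p = 1.0
    for cx, cy in zip(x, y):
        p *= M[cx][cy]
    return p

def psi_seg(x_left, x_right, M, m):
    """Eq. (6.19): compatibility of two neighboring clean segments that
    overlap by m characters; x_left is the left neighbor."""
    p = 1.0
    for a, b in zip(x_left[-m:], x_right[:m]):
        p *= M[a][b]
    return p
```

Note the asymmetry of psi_seg: the last m characters of the left segment are matched against the first m of the right one, exactly as in Eq. (6.19).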
6.5.2 Learning and Inference
The learning phase is to find the φ() and ψ() functions. For this purpose, we build a
mutation matrix M . Each matrix element m(i, j) is the unconditional mutation prob-
ability from the i-th character to the j-th: m(i, j) = P (chj|chi). This can be easily
computed from the training set, which consists of pairs of clean and noisy documents.
We partition the clean and noisy documents in the same way. The φ() of each pair of
clean and observed segments is given in Eq. (6.17), and the ψ() of each pair of
neighboring clean segments is given in Eq. (6.19).
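Estimating M amounts to counting aligned character pairs in the clean/noisy training pair; we add Laplace smoothing here (our choice, not stated in the text) so that unseen mutations keep nonzero probability:

```python
from collections import Counter

def learn_mutation_matrix(clean, noisy, alphabet):
    """Estimate m(i, j) = P(ch_j | ch_i) by counting aligned character
    pairs in a (clean, noisy) training document pair."""
    counts = {c: Counter() for c in alphabet}
    for cx, cy in zip(clean, noisy):
        counts[cx][cy] += 1
    return {
        ci: {cj: (counts[ci][cj] + 1) / (sum(counts[ci].values()) + len(alphabet))
             for cj in alphabet}
        for ci in alphabet
    }
```

Each row of the resulting table is a proper conditional distribution over the alphabet, which is all that φ() and ψ() above require.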
In inference, a subproblem is to find candidate underlying segments for a given
observed segment. One could enumerate all possible candidates using the mutation matrix.
But this method not only generates too many candidates, it also ignores valuable
information in the training data: the combinations of characters that actually occur
in segments. We restrict the candidates to the top matches among all training segments.
When the number of matches is too small, we generate extras using the mutation matrix.
By doing so, we exploit intra-segment constraints, fine details that the
Markov chain cannot model, as it operates at the scale of segments.
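Restricting candidates to training segments can be sketched as ranking all training segments by their likelihood of producing the observed segment, per Eq. (6.18) (function names are ours; M is a nested-dict mutation table):

```python
def top_candidates(y, train_segments, M, k=5):
    """Candidate clean segments for an observed segment y: the k training
    segments most likely to have produced y under the mutation matrix."""
    def likelihood(x):
        p = 1.0
        for cx, cy in zip(x, y):
            p *= M[cx][cy]
        return p
    return sorted(train_segments, key=likelihood, reverse=True)[:k]
```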
6.5.3 Experimental Results
We choose two conference papers on the same topic: motion modeling. Both docu-
ments are distorted, using the probabilistic mutation rules in Table 6.1, to form pairs
consisting of a clean document and a noisy document. One pair is used to train the
potential functions, while the other is used for testing. For simplicity, we change all
capitals into lower-case letters, replace all punctuation marks other than commas and
periods with commas, and remove all figures, tables and equations. The transformed
documents belong to an alphabet of size 38 (consisting of 26 letters, 10 digits, the
comma and the period).
A variety of distortion rules are used: unconditional mutation rules and k-order
conditional mutation rules, k = 1, 2, 3. (A k-order conditional mutation depends on k
neighbors on either side.) To compute the potential functions, all we need to learn is a
38-by-38 mutation matrix M of unconditional mutation rates only. Yet, we are able to
catch and correct most of the mutation errors, including the higher-order conditional
errors. In fact, the correction rates for conditional errors are even higher, as shown in
Table 6.1. This is achieved by exploiting the Markov property and by passing local
beliefs through the network using BP.
To help give an intuitive idea about how dependencies between text segments can
be used effectively for error correction, we enclose a paragraph of distorted text here,
followed by the corrected version. The misspelled words are underlined. We can see
that most of the misspellings are corrected.
rule          mutation prob.   # errors   % corrected
x → k         100%             56         91%
f → f         42%              -          -
f → d         30%              123        92%
f → z         28%              118        87%
th → th       48%              -          -
th → tn       52%              220        96%
se → se       36%              -          -
se → ue       18%              51         93%
se → le       25%              69         94%
se → ie       21%              58         95%
tio → tio     29%              -          -
tio → tho     20%              35         100%
tio → txo     20%              35         100%
tio → two     31%              57         98%
total words/errors: 3459/822   overall accuracy: 94%
Table 6.1: Distortion rules and error correction results. Columns 1 and 2 give the rule
and mutation rate, respectively. Column 3 is the actual number of times a rule applies,
and column 4 is the percentage corrected by BP inference.
Distorted text:
introductxon. natural scenes contain rich stochastic mothon patterns which are character-
ized by the movement od a large number od distinguishable or indistinguishable elements, such
as falling snow, zlock of birds, river waves, etc. tnele mothon patterns, called tektured, motion
temporal tekture and dynamic tektures in the literature, cannot be analyzed by conventwonal
optical zlow dields and have stimulated growing interests in both graphics and vision. in
graphics, the objective is to render photorealistic video iequences, or non photorealistic but
stylish cartoon animathons. both physics baued metnods such as partial didferential equatxons
and image baled such as video tekture and volume tekture are studied to simulate dire, fluid,
and gaseous phenomena. in vision, szummer and picard studied a spatial temporal auto re-
gression star model, which is a causal gaussian markov random zield model.
Text after denoising:
introduction. natural scenes contain rich stochastic motion patterns which are charac-
terized by the movement of a large number of distinguishable or indistinguishable elements,
such as falling snow, zlock of birds, river waves, etc. tnese motion patterns, called textured
motion, temporal texture and dynamic tektures in the literature, cannot be analyzed by conven-
tional optical flow fields, and have stimulated growing interests in both graphics and vision.
in graphics, the objective is to render photorealistic video sequences, or non photorealistic
but stylish cartoon animations. both physics based methods such as partial differential equa-
tions and image based such as video texture and volume texture are studied to simulate dire,
fluid, and gaseous phenomena. in vision, szummer and picard studied a spatial temporal auto
regression star model, which is a causal gaussian markov random field model.
6.6 Related Work and Discussions
Data dependency is present in a wide spectrum of applications. In this chapter, we propose
a unified approach that exploits data dependency to improve mining results, and
we approach this goal from two directions: (1) improving the quality of the input data, such
as by correcting contaminated data and by inferring missing values, and (2) improving
the mining results of a model that ignores data dependency.
Techniques for improving data quality proposed in the literature have addressed
a wide range of problems caused by noise and missing data. For better information
retrieval from text, data is usually filtered to remove noise defined by grammatical
errors [SM83]. In data warehouses, there has been work on noisy class label and
noisy attribute detection based on classification rules [ZWC03] [YWZ04], as well as
learning from both labeled and unlabeled data by assigning pseudo-classes for the
unlabeled data [BDM02] using boosting ensembles. All this previous work has its
own niche concerning data quality. Our work is more general in that it exploits local
data constraints using Markov networks.
A pioneering work in sensor networks, the BBQ system [DGM+04] has studied
the problem of cost-efficient probing. However, their method relies on a global mul-
tivariate Gaussian distribution. Global constraints are very strict assumptions, and are
not appropriate in many practical scenarios.
The primary contribution of this chapter is a unified approach to improving
mining quality by considering data dependency extensively in data mining. This
chapter also contributes to data mining practice through our investigation of several
real-life applications. By exploiting data dependency, clear improvements have been
achieved in data quality and the usefulness of mining results.
CHAPTER 7
Conclusions
Data stream mining and sequence mining each pose significant challenges. Stream
mining can discover up-to-date patterns invaluable for timely strategic decisions, but
this has to be done accurately and quickly with limited computation resources, and it
has to deal with both concept drift and noise. Sequence mining can reveal long-term
trends and more complicated patterns that lead to deeper insights, but more often than not
meaningful patterns can only be found in subspaces, which incurs high complexity in
pattern mining.
This dissertation introduces several novel algorithms in data stream mining, se-
quence data clustering, and improving data quality in general.
The dissertation starts by examining several stream learning algorithms, then introduces
our first stream learning method, Adaptive Boosting, which achieves the goals of
fast learning, light memory consumption, and prompt adaptation. Adaptive Boosting
is an online boosting ensemble method that constructs a highly accurate model with
fast learning and low memory consumption. It is further integrated with novel change
detection techniques, which ensure prompt adaptation.
We then explore the issue of robustness in the presence of noise. Combined
with the adaptation issue, this becomes a very hard problem, since both noisy
data and data from an emerging new concept appear as misclassified examples
to an existing learning model. We formulate this robust and adaptive approach under
the EM framework. In this framework, the noise label associated with each data entry
is represented by a hidden variable to be inferred. The weighted ensemble serves
as the classification model. Ensemble weights are obtained by Maximum Likelihood
Estimation (MLE), maximizing the posterior probability of the clean data only.
After having presented stream learning, the dissertation moves on to sequence data
clustering with spatial localization: pattern-based subspace clustering. Pattern-based
clustering can find objects that exhibit a coherent pattern of rise and fall in subspaces.
Efficiency is the biggest concern here, due to the curse of dimensionality.
Therefore, we propose SeqClus, which achieves efficiency through a novel distance
function that not only captures subspace pattern similarity, but is also conducive
to efficient clustering implementations. In our implementation, a novel tree structure
provides a compact summary of all the frequent patterns in a data set, and a density-
and grid-based clustering algorithm is developed to find clusters in any subspace.
The final part of this dissertation focuses on the general problem of improving
the quality of data mining by exploiting local data dependency. We stress how
to do quality mining from low-quality data. Poor quality is characterized by missing,
noisy or ambiguous values. We propose to improve data quality by exploiting
data interdependency using Markov Random Field (MRF) modeling. The local
constraints are described by pairwise Markov networks and quantified by potential
functions over pairs of variables. Efficient inference is performed by belief propagation, which
passes beliefs among variables so as to fill in missing values or to clean the data,
thus yielding quality mining results. We also observe that many existing data mining
methods are intrinsically incapable of modeling and leveraging data dependency. The
mining results of such methods can also be post-processed and improved by taking data
dependency into consideration. We have investigated several interesting real-life ap-
plications: cost-efficient sensor probing, protein function prediction and sequence data
de-noising. By exploiting data dependency, clear improvements in these applications
have been achieved.
Future Work
There is much research to be done to enhance the applicability
and efficiency of the methods in this dissertation.
The Robust Regression algorithm we have proposed for adaptive and robust learning
on data streams is among the first approaches to this problem. However,
we solved the essential problem but left a couple of concerns open. First, the current
model is basically a discriminative model with a hidden variable for modeling
noise. We have deliberately avoided touching the underlying data distribution due to
lack of knowledge. It would be interesting to explore domain knowledge to eliminate
noise. Second, our approach to achieving adaptability is basically that of updating
model parameters. A methodology that may be more suitable for general scenarios is
model transition. For example, when the concept changes to a multi-modal Gaussian
distribution, we can never fit the data well with a single-mode Gaussian. An imaginary
scenario is that of a complete, finite bank of models, where transition decisions are
made based on data entropy or other statistical properties collected over time. This
can provide an interesting topic for future research.
Probabilistic graphical models are powerful tools for modeling dependencies in
large data sets. In our current work of applying Markov networks to infer unprobed
sensor readings, graph structure learning is simplified: we assume that sensors give
correlated readings as long as they are close to each other. This simple assumption
may not always hold true, and can be corrected if domain knowledge (for example,
two nearby sensors are on the opposite slopes of a hill) is available or a sophisticated
correlation analysis is performed. Moreover, traditional graphical models are all
limited to a fixed graph structure. But many real scenarios feature graph
topology changes. For instance, mobile sensors travel over time, and their
neighborhood relations change correspondingly. As a result, the neighborhood of each graph
vertex is not fixed but becomes a set of random variables. How to perform learning
and inference on the dynamic Markov random field represents a challenging problem
of practical importance.
REFERENCES
[AGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of ACM-SIGMOD International Conference on Management of Data (SIGMOD), 1998.
[AKA91] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms.In Machine Learning 6(1), 37-66, 1991.
[AM01] S. Aji and R. McEliece. The generalized distributive law and free energyminimization. In Proceedings of the 39th Annual Allerton Conference onCommunication, Control, and Computing, 2001.
[APW+00] C. Aggarwal, C. Procopiuc, J. Wolf, P.S. Yu, and J.S. Park. Fast algo-rithms for projected clustering. In Proceedings of ACM-SIGMOD Inter-national Conference on Management of Data (SIGMOD), 2000.
[AY00] C. Aggarwal and P.S. Yu. Finding generalized projected clusters in highdimensional spaces. In Proceedings of ACM-SIGMOD InternationalConference on Management of Data (SIGMOD), 2000.
[AY01] C. Aggarwal and P. Yu. Outlier detection for high dimensional data.In Proceedings of ACM-SIGMOD International Conference on Manage-ment of Data (SIGMOD), 2001.
[BB99] P.O. Brown and D. Botstein. Exploring the new world of the genomewith DNA microarrays. In Nature Genetics, 21:33–37, 1999.
[BDM02] K. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in en-semble methods. In Proceedings of the 8th ACM-SIGKDD InternationalConference on Knowledge Discovery and Data Mining, 289-296, 2002.
[BF96] C. Brodley and M. Friedl. Identifying and eliminating mislabeled train-ing instances. In Proceedings of the 30th National Conference on Artifi-cial Intelligence, 799-805, 1996.
[Bil98] J. Bilmes. A gentle tutorial on the em algorithm and its application toparameter estimation for gaussian mixture and hidden markov models.In Technical Report ICSI-TR-97-021, 1998.
[BKNS00] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of ACM-SIGMOD InternationalConference on Management of Data (SIGMOD), 2000.
[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data withco-training. In COLT: Proceedings of the Workshop on ComputationalLearning and Theory, 1998.
[BR99] B. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. In Machine Learning, 36, 105-139, 1999.
[CC00] Y. Cheng and G. Church. Biclustering of expression data. In Proceed-ings of 8th International Conference on Intelligent System for MolecularBiology, 2000.
[CDH+00] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensionalregression analysis of time-series data streams. In Proceedings of VeryLarge Database (VLDB), 2000.
[CFZ99] C.H. Cheng, A.W. Fu, and Y. Zhang. Entropy-based subspace clus-tering for mining numerical data. In Proceedings of ACM-SIGKDDInternational Conference on Knowledge Discovery and Data Mining(SIGKDD), 1999.
[CHH+01] J. Cheng, C. Hatzis, H. Hayashi, M.-A. Krogel, S. Morishita, D. Page,and J. Sese. Kdd cup 2001 report. In SIGKDD Explorations, 3(2):47–64, 2001.
[DG01] C. Domeniconi and D. Gunopulos. Incremental support vector machineconstruction. In Proceedings of International Conference Data Mining(ICDM), 2001.
[DGM+04] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong.Model-driven data acquisition in sensor networks. In Proceedings ofVery Large Database (VLDB), 2004.
[DH00] P. Domingos and G. Hulten. Mining high-speed data streams. In Pro-ceedings of ACM-SIGKDD International Conference on Knowledge Dis-covery and Data Mining (SIGKDD), 2000.
[DHL+03] Guozhu Dong, Jiawei Han, Laks V.S. Lakshmanan, Jian Pei, HaixunWang, and Philip S. Yu. Online mining of changes from data streams:Research problems and preliminary results. In Proceedings of the2003 ACM SIGMOD Workshop on Management and Processing of DataStreams, 2003.
[Die00] T. Dietterich. Ensemble methods in machine learning. In Multiple Clas-sifier Systems, 2000.
[DLS99] P. D’haeseleer, S. Liang, and R. Somogyi. Gene expression analysisand genetic network modeling. In Pacific Symposium on Biocomputing,1999.
[DR99] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. In Journal of Artificial Intelligence Research, 11, 169-198, 1999.
[FHT98] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: Astatistical view of boosting. In The Annals of Statistics, 28(2):337–407,1998.
[FRM94] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequencematching in time-series databases. In Proceedings of ACM-SIGMODInternational Conference on Management of Data (SIGMOD), 1994.
[FS96] Y. Freund and R. Schapire. Experiments with a new boosting algo-rithm. In Proceedings of International Conference on Machine Learning(ICML), 1996.
[GBYS92] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: Pattrees and pat arrays. In Information Retrieval: Data Structures and Al-gorithms, 335–349, 1992.
[GGRL02] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. Mining data streamsunder block evolution. In ACM SIGKDD Explorations 3(2):1-10, 2002.
[GMMO00] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 359-366, 2000.
[GNV96] I. Guyon, N. Natic, and V. Vapnik. Discovering informative patterns anddata cleansing. In AAAI/MIT Press, pp. 181-203, 1996.
[GRS95] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain MonteCarlo in Practice. CRC Press, 1995.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing datastreams. In Proceedings of ACM-SIGKDD International Conference onKnowledge Discovery and Data Mining (SIGKDD), 2001.
[HTF00] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2000.
[JMN99] H.V. Jagadish, J. Madar, and R. Ng. Semantic compression and pat-tern extraction with fascicles. In Proceedings of Very Large Database(VLDB), 1999.
[KGV83] S. Kirkpatrick, C. Gelatt, and M. Vecchi. Optimization by simulatedannealing. In Science, vol. 220, no.4598, 1983.
[KM01] J. Kolter and M. Maloof. Dynamic weighted majority: A new ensem-ble method for tracking concept drift. In Proceedings of InternationalConference Data Mining (ICDM), 2001.
[KM03] J. Kubica and A. Moore. Probabilistic noise identification and datacleaning. In Proceedings of International Conference Data Mining(ICDM), 2003.
[McC76] E.M. McCreight. A space-economical suffix tree construction algorithm.In Journal of the ACM, 23(2):262-272, 1976.
[MM93] U. Manber and G. Myers. Suffix arrays: A new method for on-line stringsearches. In SIAM Journal On Computing, 935-948, 1993.
[MMC98] R. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instanceof pearl’s ’belief propagation’ algorithm. In IEEE Journal on SelectedAreas in Communication, 16(2), pp. 140-152, 1998.
[MWJ99] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings on Uncertainty in AI, 1999.
[NMTM00] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Using EM to classify text from labeled and unlabeled documents. In Machine Learning, 39:2:103-134, 2000.
[oW] University of Washington. http://www.jisao.washington.edu/data sets/widmann/.
[PA87] C. Peterson and J. Anderson. A mean-field theory learning algorithm forneural networks. In Complex Systems, vol.1, 1987.
[Pea88] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plau-sible inference. Morgan Kaufmann publishers, 1988.
[PWMH01] C-S. Perng, H. Wang, S. Ma, and J.L. Hellerstein. A framework forexploring mining spaces with multiple attributes. In Proceedings of In-ternational Conference Data Mining (ICDM), 2001.
[PWZP00] C.-S. Perng, H. Wang, S.R. Zhang, and D.S. Parker. Landmarks: a newmodel for similarity-based pattern querying in time series databases. InProceedings of International Conference on Data Engineering (ICDE),2000.
[PZC+03] J. Pei, X. Zhang, M. Cho, H. Wang, and P.S. Yu. Maple: A fast algorithmfor maximal pattern-based clustering. In Proceedings of InternationalConference Data Mining (ICDM), 2003.
[R.E61] R. E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[RRS00] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for min-ing outliers from large data sets. In Proceedings of ACM-SIGMOD In-ternational Conference on Management of Data (SIGMOD), 2000.
[SFB97] R. Schapire, Y. Freund, and P. Bartlett. Boosting the margin: A newexplanation for the effectiveness of voting methods. In Proceedings ofInternational Conference on Machine Learning (ICML), 1997.
[SFL+97] S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Credit cardfraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
[SG86] J. Schlimmer and F. Granger. Beyond incremental processing: Trackingconcept drift. In Int’l Conf. on Artificial Intelligence, 1986.
[SK01] W. Street and Y. Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of ACM-SIGKDD InternationalConference on Knowledge Discovery and Data Mining (SIGKDD),2001.
[SM83] G. Salton and M. McGill. Introduction to modern information retrieval.McGraw Hill, 1983.
[SS94] R. Schultz and R. Stevenson. A bayesian approach to image expansionfor improved definition. In IEEE Transactions on Image Processing,3(3), pp. 233-242, 1994.
[Ten99] C.M. Teng. Correcting noisy data. In Proceedings of the InternationalConference on Machine Learning, 239-248, 1999.
[THC+00] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, http://arep.med.harvard.edu/biclustering/yeast.matrix, 2000.
[WFYH03] H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting datastreams using ensemble classifiers. In Proceedings of ACM-SIGKDDInternational Conference on Knowledge Discovery and Data Mining(SIGKDD), 2003.
[WK96] G. Widmer and M. Kubat. Learning in the presence of concept drift andhidden contexts. In Machine Learning, 23 (1), 69-101, 1996.
[WPF+03] H. Wang, C-S. Perng, W. Fan, S. Park, and P.S. Yu. Indexing weightedsequences in large databases. In Proceedings of International Confer-ence on Data Engineering (ICDE), 2003.
[WPFY03] H. Wang, S. Park, W. Fan, and P.S. Yu. ViST: A dynamic index methodfor querying XML data by tree structures. In Proceedings of ACM-SIGMOD International Conference on Management of Data (SIGMOD),2003.
[WWYY02] H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern simi-larity in large data sets. In Proceedings of ACM-SIGMOD InternationalConference on Management of Data (SIGMOD), 2002.
[YFW00] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation.In Advances in Neural Information Processing Systems (NIPS), Vol 13,pp. 689-695, 2000.
[YWWY02] J. Yang, W. Wang, H. Wang, and P.S. Yu. δ-clusters: Capturing subspacecorrelation in a large data set. In Proceedings of International Confer-ence on Data Engineering (ICDE), 2002.
[YWZ04] Y. Yang, X. Wu, and X. Zhu. Dealing with predictive-but-unpredictableattributes in noisy data sources. In Proceedings of the 8th EuropeanConference on Principles and Practice of Knowledge Discovery in Data-bases (PKDD 04), 2004.
[ZWC03] X. Zhu, X. Wu, and Q. Chen. Eliminating class noise in large datasets.In Proceedings of the 20th International Conference Machine Learning(ICML 03), 2003.