UNIVERSITY OF CALIFORNIA
Los Angeles
Mining Techniques for Data Streams and Sequences
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Computer Science
by
Fang Chu
2005
© Copyright by
Fang Chu
2005
The dissertation of Fang Chu is approved.
D. Stott Parker
Adnan Darwiche
Yingnian Wu
Carlo Zaniolo, Committee Chair
University of California, Los Angeles
2005
To Dad, Mom and Yizhou
TABLE OF CONTENTS
1 Introduction
  1.1 Issues in Stream Mining
  1.2 Mining High Dimensional Sequence Data
  1.3 Mining Quality
  1.4 Dissertation Overview
2 Background and Related Work
  2.1 Stream Classification Methods
    2.1.1 Ensemble Theory
    2.1.2 Ensemble Methods for Stream Classification
  2.2 Pattern-Based Subspace Clustering
  2.3 Mining Quality
3 Fast and Light Stream Boosting Ensembles
  3.1 Introduction
  3.2 Adaptive Boosting Ensembles
  3.3 Change Detection
  3.4 Comparison with Bagging Stream Ensembles
    3.4.1 Evaluation of Boosting Scheme
    3.4.2 Learning with Gradual Shifts
    3.4.3 Learning with Abrupt Shifts
    3.4.4 Experiments on Real Life Data
  3.5 Comparison with DWM
  3.6 Summary
4 Robust and Adaptive Stream Ensembles
  4.1 Introduction
  4.2 Adaptation to Concept Drift
  4.3 Robustness to Outliers
  4.4 Model Learning
    4.4.1 Model Formulation
    4.4.2 Inference and Computation
  4.5 Experiments and Discussions
    4.5.1 Evaluation of Adaptation
    4.5.2 Robustness in the Presence of Outliers
    4.5.3 Discussions on Performance Issue
    4.5.4 Experiments on Real Life Data
  4.6 Summary
5 Subspace Pattern Based Sequence Clustering
  5.1 Introduction
    5.1.1 Subspace Pattern Similarity
    5.1.2 Applications
    5.1.3 Our Contributions
  5.2 The Distance Function
    5.2.1 Tabular and Sequential Data
    5.2.2 Sequence-based Pattern Similarity
  5.3 The Clustering Algorithm
    5.3.1 Pattern and Pattern Grids
    5.3.2 The Counting Tree
    5.3.3 Counting Pattern Occurrences
    5.3.4 Clustering
  5.4 Experiments
    5.4.1 Data Sets
    5.4.2 Performance Analysis
    5.4.3 Cluster Analysis
  5.5 Related Work and Discussion
6 Mining Quality
  6.1 Introduction
  6.2 Markov Networks
    6.2.1 Graphical Representation
    6.2.2 Pairwise Markov Networks
    6.2.3 Solving Markov Networks
    6.2.4 Inference by Belief Propagation
  6.3 Application I: Cost-Efficient Sensor Probing
    6.3.1 Problem Description and Data Representation
    6.3.2 Problem Formulation
    6.3.3 Learning and Inference
    6.3.4 Experimental Results
    6.3.5 How BP Works
  6.4 Application II: Enhancing Protein Function Predictions
    6.4.1 Problem Description
    6.4.2 Learning Markov Network
    6.4.3 Experiments
  6.5 Application III: Sequence Data Denoising
    6.5.1 Problem Description and Data Representation
    6.5.2 Learning and Inference
    6.5.3 Experimental Results
  6.6 Related Work and Discussions
7 Conclusions
References
LIST OF FIGURES
1.1 An example of stream version linear regression.
1.2 An example of stream version linear regression.
1.3 An example of stream version linear regression.
3.1 Two types of significant changes. Type I: abrupt changes; Type II: gradual changes over a period of time. These are the changes we aim to detect.
3.2 Performance comparison of the adaptive boosting vs. the bagging on stationary data. The weighted bagging is omitted as it performs almost the same as the bagging.
3.3 Performance comparison of the three ensembles on data with small gradual concept shifts.
3.4 Performance comparison of the ensembles on data with moderate gradual concept shifts.
3.5 Performance comparison of the three ensembles on data with abrupt shifts. Base decision trees have no more than 8 terminal nodes.
3.6 Performance comparison of the three ensembles on data with both abrupt and small shifts. Base decision trees have no more than 8 terminal nodes.
3.7 Performance comparison of the three ensembles on credit card data. Concept shifts are simulated by sorting the transactions by the transaction amount.
3.8 Comparison of the adaptive boosting and the weighted bagging, in terms of (a) building time, and (b) average decision tree size. In (a), the total amount of data is fixed for different block sizes.
3.9 Dynamic Weighted Majority (DWM) ensemble performance on the SEA concepts with 10% class noise.
3.10 Adaptive Boosting ensemble performance on the SEA concepts with 10% class noise.
4.1 Adaptability comparison of the ensemble methods on data with three abrupt shifts.
4.2 Adaptability comparison of the ensemble methods on data with three abrupt shifts mixed with small shifts.
4.3 Robustness comparison of the three ensemble methods for different noise levels.
4.4 In the outliers detected, the normalized ratio of (1) true noisy samples (the upper bar) vs. (2) samples from an emerging concept (the lower bar). The bars correspond to blocks 0-59 in the experiments shown in Fig. 4.2.
4.5 Performance comparison of the ensemble methods with classifiers of different sizes. Robust regression with smaller classifiers is comparable to the others with larger classifiers.
4.6 Performance comparison of the ensembles on credit card data. Base decision trees have no more than 16 terminal nodes. Concept shifts are simulated by sorting the transactions by the transaction amount.
5.1 Objects form patterns in subspaces.
5.2 The meaning of dist_{k,S}(x, y) ≤ δ.
5.3 Pattern grids for subspace {t1, t2, t3}.
5.4 The Counting Tree.
5.5 The Cluster Tree.
5.6 Performance Study: scalability.
5.7 Time vs. distance threshold δ.
5.8 Scalability on sequential dataset.
5.9 A cluster in subspace {2,3,4,5,7,8,10,11,12,13,14,15,16}.
6.1 Example of a Pairwise Markov Network. In (a), the white circles denote the random variables, and the shaded circles denote the external evidence. In (b), the potential functions φ() and ψ() are shown.
6.2 Message passing in a Markov network. Messages are defined by Eqs. (6.3) or (6.4) under two types of rules, respectively.
6.3 Sensor site map in the states of Washington and Oregon.
6.4 Top-K recall rates vs. probing ratios. (a): results obtained by our BP-based probing; (b): by the naive probing. On average, the BP-based approach probes 8% less, and achieves a 13.6% higher recall rate for raw values and a 7.7% higher recall rate for discrete values.
6.5 Belief updates in 6 BP iterations ((0)-(5)). Initially only the four sensors at the corners are probed. The strong beliefs of these four sensors are carried over by their neighbors to sensors throughout the network, causing the beliefs of all sensors to be updated iteratively till convergence.
6.6 Logistic curve that is used to blur the margin between the belief on two classes.
6.7 Distribution of correlation values learned for two functions. Left column function: cell growth; right column function: protein destination. In each column, the distributions from top to bottom are learned from groups (a), (b) and (c), respectively.
6.8 A subgraph in which testing genes got correct class labels due to message passing.
LIST OF TABLES
3.1 Performance comparison of the ensembles on data with varying levels of concept shifts. Top accuracies shown in bold fonts.
3.2 Performance comparison of three ensembles on data with abrupt shifts or mixed shifts. Top accuracies are shown in bold fonts.
4.1 Summary of symbols used
5.1 Expression data of Yeast genes
5.2 A Stream of Events
5.3 A dataset of 3 objects
5.4 Clusters found in the Yeast dataset
5.5 Clusters found in NETVIEW
6.1 Distortion rules and error correction results. Columns 1 and 2 give the rule and mutation rate, respectively. Column 3 is the actual number of times a rule applies, and column 4 is the percentage corrected by BP inference.
ACKNOWLEDGMENTS
At the end of the long journey that was the making of this dissertation, I would
like to thank some of the many people who have helped me in various ways.
First, I would like to thank Professor Carlo Zaniolo for his support and guidance
over the years, and for the numerous and fruitful discussions that laid the foundation
of the research presented here. In particular, I thank him for passing knowledge on to
me, and for teaching me how to do research and solve new problems with persistence.
I also thank Professor Yingnian Wu and Professor Adnan Darwiche for valuable
discussions on statistical modeling and artificial intelligence. I am grateful to Professor
D. Stott Parker for many helpful discussions and brainstorms on data mining, and for
collaboration on several projects.
I would like to thank Dr. Haixun Wang, Dr. Philip Yu and Dr. Wei Fan for many
helpful and interesting discussions during my two summer internships at the IBM
T. J. Watson Research Center. Their help in cultivating my curiosity about machine
learning and data mining was indispensable to the formation of this dissertation.
I thank my colleagues and friends in the Web Information Group: Yijian Bai, Xin
Zhou, Yan-Nei Law, Hetal Thakkar, and Hyun Moon, and our alumnus Dr. Fusheng
Wang. Thank you for your helpful discussions and paper proofreading, and for the
enjoyable environment you have helped to maintain. I am also grateful to friends in
the dbUCLA group: Zhenyu Liu, Yi Xia and many others, for sharing concerns and
happiness, and for the inspiring Friday seminars.
Special thanks go to Yizhou Wang, my husband and co-author. He has not only
kept me cheerful and happy throughout the development of this dissertation, but has
also shown his amazingly broad research interest in data mining, which falls beyond
his major field of computer vision. It has been a wonderful experience to work together
with him. Thank you, Yizhou.
Finally, I would like to thank my parents, Shuquan and Zefu. For as long as I can
remember, they have been a source of continuous support and inspiration. I owe to
them much of my ethic in life and work.
VITA
1975 Born, Shandong, China.
1997 B.S., Computer Science, Peking University, China.
2000 M.S., Computer Science, Peking University, China.
2001 Summer Intern, IBM T. J. Watson Research Center, Hawthorne, New York.
2002 Summer Intern, IBM T. J. Watson Research Center, Hawthorne, New York.
2001–2003 Teaching Assistant, Computer Science Department, UCLA. Taught
143 (database course), 131 (programming language course) and
151B (computer architecture course).
2002 Research Assistant, Molecular Biology Institute, UCLA.
2000–2005 Research Assistant, Computer Science Department, UCLA.
PUBLICATIONS
Fang Chu, Yizhou Wang, Carlo Zaniolo, D. Stott Parker, Improving mining quality
by exploiting data dependency, in Proceedings of the 9th Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD), 2005.
Fang Chu, Yizhou Wang, Carlo Zaniolo, D. Stott Parker, Data Cleaning Using Belief
Propagation, in Proceedings of the 2nd International ACM SIGMOD Workshop on
Information Quality in Information Systems (IQIS), 2005.
Fang Chu, Yizhou Wang, Carlo Zaniolo, An adaptive learning approach for noisy data
streams, in Proceedings of the 4th IEEE International Conference on Data Mining
(ICDM), 2004.
Fang Chu, Carlo Zaniolo, Fast and light boosting for adaptive mining of data streams,
in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD), 2004.
Fang Chu, Yizhou Wang, Carlo Zaniolo, Mining noisy data streams via a discrimi-
native model, in Proceedings of the 7th International Conference on Discovery Science
(DS), 2004.
Haixun Wang, Fang Chu, Wei Fan, Philip S. Yu, Jian Pei, Sequence-based subspace
clustering by pattern similarity, in Proceedings of the 16th International Conference
on Scientific and Statistical Database Management (SSDBM), 2004.
Wei Fan, Fang Chu, Haixun Wang, Philip S. Yu, Pruning cost-sensitive ensembles for
efficient prediction, in Proceedings of the Eighteenth National Conference on Artificial
Intelligence (AAAI), 2002.
ABSTRACT OF THE DISSERTATION
Mining Techniques for Data Streams and Sequences
by
Fang Chu
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2005
Professor Carlo Zaniolo, Chair
Data stream mining and sequence mining have many applications and pose chal-
lenging research problems. Typical applications, such as network monitoring, web
searching, telephone services and credit card purchases, are characterized by the need
to mine continuously massive data streams to discover up-to-date patterns, which are
invaluable for timely strategic decisions. These new requirements call for the design
of new mining methods to replace the traditional ones, since those would require the
data to be first stored and then processed off-line using complex algorithms that make
several passes over the data. Therefore, a first research challenge is designing fast
and light mining methods for data streams — e.g., algorithms that only require one
pass over the data and work with limited memory. Another challenge is created by the
highly dynamic nature of data streams, whereby the stream mining algorithms need
to promptly detect changing concepts and data distributions and adapt to them. While
noise represents a general problem in data mining, it poses new challenges on data
streams insofar as adaptability becomes more difficult when the data stream contains
noise.
The main limitation of data stream mining methods is that they cannot reveal long
term trends, as they only keep a small snapshot of the most recent data. Neither can they
discover very complicated patterns that can be detected by methods that require exten-
sive computational resources. However, these patterns can be discovered by off-line
mining after data streams are stored as sequences. Sequence mining, in general, can
reveal long-term trends and more complicated patterns, defined in a multidimensional
space via some similarity criteria. The key research challenges that arise in this context
include (i) designing metrics that measure the similarity of sequences, (ii) dealing
with high dimensionality, and (iii) achieving scalability.
This dissertation makes a number of contributions toward the solution of these
problems, including the following ones:
1. Adaptive Boosting: A stream ensemble method is proposed that maintains a very
accurate predictive model with fast learning and light memory consumption. The
method is also highly adaptive through novel change detection techniques.
2. Robust Regression Ensemble: This method enhances stream ensemble methods
with outlier detection, within a statistically sound learning framework.
3. SeqClus: A pattern-based subspace clustering algorithm is introduced along
with a novel pattern similarity metric for sequences. The algorithm is scalable
and efficient.
4. Mining Quality: To deal with noise and improve mining quality, a general ap-
proach is introduced based on data dependency. The approach exploits local data
dependency between samples using pairwise Markov Networks and Bayesian
belief propagation techniques.
The efficacy of the techniques proposed was demonstrated through extensive ex-
periments, both on synthetic and on real-life data.
CHAPTER 1
Introduction
Today, many organizations produce and/or consume massive data streams. Mining such
data can reveal up-to-date patterns, which are invaluable for timely decisions. How-
ever, stream mining is strikingly different from traditional mining in several aspects.
First, the need for online responses requires mining to be done very fast. In fact, actual
online systems usually have limited CPU power and memory resources dedicated to
mining tasks. Secondly, the underlying concept that generates the data is highly dy-
namic. Moreover, data streams are very likely to be noisy due to lack of preprocessing.
All these make it compelling to investigate data mining techniques for continuous data
streams containing high volumes of data.
After online processing and mining, data streams are stored as sorted relations
known as sequences. The order is defined by a set of attributes which has a total order,
such as positions or account IDs. Often, the temporal order of the original stream data
is preserved, and such sequences are also referred to as time series data. Sequence data
mining plays a complementary role to data stream mining. While data stream mining
can discover up-to-date patterns invaluable for timely strategic decisions, sequence
data mining can reveal long-term trends and more complicated patterns that lead to
deeper insights.
This dissertation studies several major challenges raised in data stream mining and
sequence mining. Then, it extends the study to a more general problem of improving
mining quality. The specific problems addressed are:
• Performance, adaptability and robustness issues in data stream mining;
• Scalable pattern-based subspace clustering;
• Mining quality improvement by leveraging data dependency.
1.1 Issues in Stream Mining
The first issue is limited computation resources: in many applications, the computation
power and memory at hand does not measure up to the massive amount of data in the
input stream. For example, in a single day, Google serviced more than 150 million
searches; Walmart executed 20 million sales transactions; and Telstra generated 15
million call records. However, traditional data mining algorithms make the assumption
that the resources available will always match the amount of data they process. This
assumption does not hold in data stream mining. Stream mining algorithms must learn
fast and consume little memory.
Figure 1.1: An example of stream version linear regression.
Another characteristic of data streams is that data is no longer a snapshot, but rather
a continuous stream. This means that the concept underlying the data may change
over time. For effective decision making, stream mining must be adaptive to concept
change. For example, when customer purchasing patterns change, marketing strate-
gies based on out-dated transaction data must be modified in order to reflect current
customer needs.
(a) A stream with one underlying concept and noisy examples.
(b) A linear regression algorithm overfits noise if it is too adaptive.
Figure 1.2: An example of stream version linear regression.
Figure 1.1 uses a simple example to illustrate stream mining. The input stream
contains 2 dimensional points (x, y) coming over time t. The horizontal axis denotes
the x dimension, the left vertical axis the y dimension, and the right vertical axis the
time t. In the first time period, data is generated by a concept y = f1(x); in the second
time period the concept changes to y = f2(x). A stream version of linear regression
should be able to learn and adapt to the underlying function, f1 or f2, promptly.
(a) A stream with changing concepts.
(b) A linear regression algorithm overlooks concept change if it is too robust.
Figure 1.3: An example of stream version linear regression.
A particularly challenging issue is to learn changing concepts in the presence of
noise. To the existing model, both noisy examples and the examples from an emerging
new concept manifest themselves as misclassified examples. If an algorithm is designed
primarily to adapt to concept change, it may overfit noise by mistakenly interpreting
noisy examples as the sign of a new concept. In Figure 1.2, the noisy examples in
the middle of this time period are overfit, causing unstable and inaccurate fitting. On
the other hand, if an algorithm is too robust to noise, it may overlook new concepts
and inappropriately stick to outdated concepts. This is illustrated in Figure 1.3, where
the true concept shifts to a new one at time t1, and then comes back to the original at
t2, but the second concept is completely ignored because the regression method is too
“robust”.
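The tension between adaptivity and robustness can be made concrete with a toy sliding-window version of linear regression. The sketch below is purely illustrative (the class name and data are hypothetical, not an algorithm from this dissertation): the window size is the single knob that trades one property for the other.

```python
from collections import deque

class WindowedLinearRegression:
    """Least-squares fit of y = a*x + b over a sliding window.

    The window size is the adaptivity/robustness knob: a small window
    tracks concept change quickly but overfits noise; a large window
    smooths noise away but is slow to notice a new concept.
    """

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # old points fall out automatically

    def update(self, x, y):
        self.window.append((x, y))

    def fit(self):
        # Closed-form simple linear regression over the current window.
        n = len(self.window)
        sx = sum(x for x, _ in self.window)
        sy = sum(y for _, y in self.window)
        sxx = sum(x * x for x, _ in self.window)
        sxy = sum(x * y for x, y in self.window)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return a, b

# Concept f1: y = 2x for t < 50, then an abrupt shift to f2: y = -x + 30.
model = WindowedLinearRegression(window_size=20)
for t in range(100):
    x = float(t % 10)
    y = 2 * x if t < 50 else -x + 30
    model.update(x, y)
slope, intercept = model.fit()  # the window now holds only f2 points
```

Here a window of 20 recovers f2 exactly once the window has turned over; shrinking the window would adapt faster, at the price of fitting any noisy burst, which is precisely the dilemma illustrated in Figures 1.2 and 1.3.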
1.2 Mining High Dimensional Sequence Data
Many types of data streams are eventually stored as ordered sequences. This data contains
invaluable information on various aspects of system operation and usage. For example,
a network system generates event log sequences. Finding patterns in a large data set of
event logs is important to the understanding of the temporal causal relationships among
the events, which often provide actionable insights for determining problems in system
management. Another example is a web server that logs user browsing sessions and
paths. Finding access patterns from this log data gives important clues on profitable
marketing strategies as well as directions of how to improve user experiences for more
successful e-commerce business.
Subspace clustering represents a very useful mining technique that can cope with
high dimensionality. The main objective of clustering is to find high quality clusters
within a reasonable time. However, in high dimensional data, it is common for all ob-
jects in a dataset to be nearly equidistant from each other, completely masking the clus-
ters. This is well known as the curse of dimensionality [R.E61]. Subspace clustering is
an extension of traditional clustering that seeks to find clusters in different subspaces.
Subspace clustering algorithms localize the search for relevant dimensions, hence they
can find meaningful clusters despite the noisy dimensions. In other words, subspace
clustering alleviates the problem caused by the curse of dimensionality. But this gain
in cluster quality is achieved at the expense of a much higher computation complex-
ity, as the number of possible subspaces is huge in high dimensional space. In fact,
scalability is always the core concern of subspace clustering. Research has never
stopped seeking clustering algorithms that scale with respect to the number of
objects and the number of dimensions of the objects, as well as the dimensionality of
the subspaces where the clusters are found.
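The equidistance phenomenon is easy to demonstrate with a small experiment (an illustrative sketch of our own, not taken from the dissertation): as dimensionality grows, the farthest and nearest pairwise distances among random points become nearly equal, so a full-space distance can no longer separate clusters from background.

```python
import math
import random

def distance_contrast(dim, n_points=100, seed=0):
    """Ratio of the largest to the smallest pairwise Euclidean distance
    among n_points random points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n_points) for j in range(i + 1, n_points)]
    return max(dists) / min(dists)

# In 2 dimensions the nearest and farthest pairs differ sharply; in 500
# dimensions all pairs are nearly equidistant, masking any clusters.
low_dim_contrast = distance_contrast(dim=2)
high_dim_contrast = distance_contrast(dim=500)
```

Subspace clustering sidesteps this collapse by measuring distances only over the few dimensions relevant to each cluster, which is exactly why the search over subspaces, and hence scalability, becomes the central cost.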
A range of problems motivates further extensions of clustering. Traditional
clustering, including subspace clustering, focuses on grouping objects with
value-based similarity. That is, a similarity metric is defined on absolute values in a set
of dimensions. In applications of collaborative filtering and bio-data mining, however,
people are more interested in capturing the coherence exhibited by a subset of objects
in some subspace. In microarray data analysis, for example, finding coherent genes
means finding those that respond similarly to environmental conditions. The absolute
response rates are often very different, but the type of response and the timing may be
similar. Along this direction, several research efforts have studied clustering based on
pattern similarity.
This dissertation focuses on pattern-based subspace clustering. In particular, we
study sequential patterns and the scalability of algorithms.
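To make the contrast with value-based similarity concrete, a shift-based pattern dissimilarity can be sketched as below. This is a simplified illustration in the spirit of the subspace pattern similarity studied in Chapter 5; the function name and exact formula are ours, not the dissertation's definition.

```python
def shift_distance(x, y, subspace):
    """Shift-based pattern dissimilarity of objects x and y on a subspace.

    x and y follow the same "shifting pattern" on the chosen columns when
    x[c] - y[c] is (nearly) constant across them; absolute magnitudes are
    irrelevant. The spread of the per-column offsets is 0 for a perfect
    pattern match, growing as the rise-and-fall shapes diverge.
    """
    offsets = [x[c] - y[c] for c in subspace]
    return max(offsets) - min(offsets)

# Two "genes" with very different absolute expression levels but an
# identical response pattern on conditions {0, 1, 2}:
g1 = [10.0, 14.0, 11.0, 3.0]
g2 = [100.0, 104.0, 101.0, 50.0]
pattern_dist = shift_distance(g1, g2, subspace=[0, 1, 2])  # 0.0: coherent
# A value-based metric on the same columns sees them as far apart:
value_dist = sum((g1[c] - g2[c]) ** 2 for c in [0, 1, 2]) ** 0.5
```

Under a value-based metric these two objects would never share a cluster, yet under pattern similarity they are perfectly coherent on the chosen subspace, which is the behavior desired in collaborative filtering and microarray analysis.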
1.3 Mining Quality
The quality of mining results concerns the evaluation of many components, including
data quality, model quality, and method quality. Ongoing research also encompasses theo-
retical aspects including quality definition and quality model. Most approaches focus
on specific environments, such as document quality, data warehouse quality, ontology
quality, and so on.
This dissertation is interested in quality mining from low-quality data. A consensus
among data mining practitioners is that low data quality often leads to wrong decisions
or even ruins the projects—unless proper preprocessing techniques have been adopted
in advance. As a result, data quality related issues have become more and more crucial
and have consumed a majority of the time and budget of data mining.
Low data quality has various causes. During data collection and preparation,
data may be biased by human habits or by device faults. For example, it is well
known that when we human beings record readings from blood pressure monitors, we
tend to round the readings to a multiple of ten. Data can also be corrupted during
transmission through networks. In summary, low quality data is characterized by
missing values, ambiguity or redundancy.
We propose a general technique that can improve data quality by exploiting data
dependency, thus improving mining quality. By learning the data dependency, missing
values can be filled in and noisy values can be corrected. The general techniques
discussed here can be applied directly to data stream mining. One reason is that data
streams are noisy in many applications. Another scenario is when it is impossible or
undesirable to acquire all the data. For instance, due to resource limitations, there may
be multiple data streams but only some of them can be monitored.
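The intuition behind exploiting data dependency can be conveyed with a drastically simplified sketch (a hypothetical stand-in for the pairwise Markov network and belief propagation machinery developed in Chapter 6): each sample adopts the label most of its neighbors in a dependency graph agree on, so an isolated corrupted value is voted back into line.

```python
def denoise(labels, neighbors, rounds=5):
    """Repeatedly relabel each sample with the strict-majority label of its
    neighbors in the dependency graph; ties keep the current label."""
    labels = list(labels)
    for _ in range(rounds):
        updated = list(labels)
        for i, nbrs in enumerate(neighbors):
            votes = {}
            for j in nbrs:
                votes[labels[j]] = votes.get(labels[j], 0) + 1
            top = max(votes.values())
            winners = [lab for lab, v in votes.items() if v == top]
            if len(winners) == 1:  # a strict majority exists
                updated[i] = winners[0]
        labels = updated
    return labels

# A chain of six correlated readings; the reading at position 2 is corrupted.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
noisy = ['A', 'A', 'B', 'A', 'A', 'A']
cleaned = denoise(noisy, chain)  # the corrupted 'B' is voted back to 'A'
```

Belief propagation generalizes this hard vote to weighted, probabilistic messages, which is what makes it applicable to the sensor probing, protein function and denoising problems studied later.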
1.4 Dissertation Overview
The goal of this chapter has been to set up the appropriate context within which this
dissertation is developed. An introduction to the problems in data stream learning and
sequence data mining has been presented, along with some general thoughts on mining
quality.
The remaining chapters of this dissertation are organized as follows:
Chapter 2 introduces background work. For stream mining, it categorizes previ-
ous work into single model-based and ensemble-based approaches, and then puts the
focus on recent work using ensembles. It also includes a short discussion on tradi-
tional ensemble theory. For sequence data mining, it follows the search direction from
multi-dimensional subspace clustering to pattern-based subspace clustering. Finally it
discusses the existing work on improving data quality and mining quality.
Chapter 3 presents a novel approach (Adaptive Boosting) for stream learning. The
approach addresses two important issues in stream mining: performance and adaptation.
It shows that the approach is fast and light in terms of CPU time and memory
requirements, and highly adaptive through explicit concept change detection.
Chapter 4 introduces a robust stream learning algorithm (Robust Regression En-
semble). In addition to the performance and adaptation issues, this approach enhances the
stream ensemble methods with outlier detection. It is developed within a statistically
sound learning framework.
Chapter 5 presents a highly scalable clustering method on sequence data. Sub-
space pattern similarity is used as the similarity measure among objects, so as to find
strikingly coherent objects. An efficient grid and density based algorithm is presented.
Chapter 6 addresses a general problem in data mining field: mining quality. It
explores local data dependency, which is abundant in many applications,
and its potential use in improving data quality and, ultimately, mining quality.
Finally, Chapter 7 summarizes the work presented in this dissertation. It contains
a brief description of the new algorithms introduced in this dissertation, and points out
a few important directions for future research.
CHAPTER 2
Background and Related Work
2.1 Stream Classification Methods
As data stream mining has recently become an important research domain, much work
has been done on classification [DH00, HSD01, SK01], regression analysis [CDH+00]
and clustering [GMMO00]. In this dissertation we focus on stream classification.
As discussed in Chapter 1, concept drift is one of the central issues in stream data
mining. This problem has been addressed in both the machine learning and data mining
communities. The first systems capable of handling concept drift were STAGGER
[SG86], IB3 [AKA91] and the FLORA family [WK96]. These algorithms provided
valuable insights, but, as they were developed and tested only on small datasets, it has
not been established to what degree these approaches scale to large problems.
Several scalable learning algorithms designed for data streams have been proposed re-
cently. They either maintain a single model incrementally, or maintain an ensemble of
base learners. The first category includes the Hoeffding tree [HSD01], which grows a
decision tree by splitting a node on an attribute only when that attribute is statistically
predictive. Hoeffding-tree-like algorithms need a large training set in order to reach fair
performance, which makes them unsuitable for situations featuring frequent changes.
Domeniconi and Gunopulos [DG01] designed an incremental support vector machine
algorithm for continuous learning, but due to the high complexity of support
vector machines, the memory requirement and CPU time are still relatively
large.
The second category of learning algorithms for data streams is based on ensem-
bles. First we give a brief description of ensemble methods.
2.1.1 Ensemble Theory
Ensemble methods have long been studied in machine learning. They are meta-learning
techniques that construct a collection of classifiers and then classify new data points by
taking a vote of their predictions. The base classifiers in an ensemble are diverse, and this
diversity is achieved by manipulating one of three aspects of classifier construction: the
training samples, the learning procedure, or the output. A large body of evaluations
demonstrates that ensembles perform better than single classifiers [FS96, BR99,
DR99, Die00]. In [Die00], Dietterich gives three fundamental reasons why ensemble
methods often perform better than any single classifier: statistical, computational and
representational. (1) The first reason is statistical. A learning algorithm can be viewed
as searching a space H of hypotheses to identify the best approximation of the true
classification function f, but the training data is often not sufficient for this large hy-
pothesis space. By constructing an ensemble of classifiers and “averaging” their votes,
we can get a statistically better approximation of the true hypothesis. (2) The second
reason is computational. Even with sufficient training data, many learning algorithms
work by performing some local search that may get stuck in local optima. An ensemble
constructed by running the local search from many different starting points often
provides a better approximation to the true function than any individual classifier. (3)
The third reason is representational. In many applications, the true function f cannot
be represented by any of the hypotheses in H. By combining multiple hypotheses in
various ways, it is possible to expand the space of representable functions using the
hypotheses in H.
2.1.2 Ensemble Methods for Stream Classification
Because ensemble methods have statistical, computational and representational ad-
vantages, they have been adapted to stream scenarios. We review the most recent stream
ensembles: two of them have the flavor of traditional bagging ensembles, and the other
builds an ensemble using an incremental base learning algorithm.
Traditional bagging operates by invoking a base learning algorithm many times
with different training sets. Each training set is a bootstrap replica of the original
training set. In other words, given a training set S of n examples, a new training set S ′
is constructed by drawing n independent samples uniformly with replacement. With
the bagging method, classifiers are learned individually, and samples in a bootstrap replica
have uniform weights.
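The bootstrap replica construction can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import random

def bootstrap_replica(samples, rng=random.Random(0)):
    """Draw len(samples) examples uniformly with replacement,
    as traditional bagging does for each base classifier."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

S = [("x%d" % i, i % 2) for i in range(10)]  # toy labeled training set
S_prime = bootstrap_replica(S)
print(len(S_prime) == len(S))        # replica has the same size n
print(all(s in S for s in S_prime))  # every draw comes from the original set
```

Because the draws are made with replacement, some examples appear multiple times in S′ while others are absent, which is what makes the replicas (and hence the learned classifiers) differ.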
Two recent works on stream ensembles resemble the traditional bagging method.
Street et al. [SK01] propose an algorithm that builds an ensemble by partitioning
the input data stream into fixed-size data blocks and learning one classifier per block.
Adaptability is achieved solely by retiring old classifiers one at a time. Wang et al.
[WFYH03] propose a similar method, except that their algorithm tries to adapt to
changes by assigning weights to classifiers proportional to their accuracy on the most
recent data block. Both methods learn individual classifiers independently and use
uniform sample weights. In other words, they resemble the traditional bagging en-
semble methods, and hence are expected to perform well if the concept underlying the
data stream is stable. However, neither of them tackles concept change explicitly. They
rely on the natural adaptation yielded by learning new ensemble members and retiring
old ones gradually, and a little on ensemble weighting in the case of [WFYH03].
However, as we will show in Chapter 3, this reactive strategy is not sufficient.
For ease of later reference, we call them “Bagging” and “Weighted Bagging”, re-
spectively.
Another stream ensemble method, Dynamic Weighted Majority (DWM), is pro-
posed in [KM01]. Contrary to the aforementioned bagging-style stream methods, the clas-
sifiers in a DWM ensemble are continually updated in an incremental fashion, upon the
arrival of each single example xi. Not only is the classifier model updated by incor-
porating the knowledge from this new example xi, but the classifier weights are also
decreased by a damping factor β in case of a misclassification. The algorithm also
retires from the ensemble those classifiers whose weights drop below a user-
specified threshold θ.
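A simplified sketch of the DWM weight update just described (the β and θ values and the classifier interface here are illustrative choices of ours, not taken from [KM01]; the incremental training of each classifier is elided):

```python
def dwm_update(ensemble, x, y, beta=0.5, theta=0.01):
    """One DWM-style step on example (x, y): a misclassifying classifier
    has its weight damped by beta, and classifiers whose weight falls
    below theta are retired. `ensemble` is a list of
    (predict_fn, weight) pairs."""
    updated = []
    for predict, w in ensemble:
        if predict(x) != y:
            w *= beta          # damp the weight on a mistake
        if w >= theta:         # retire classifiers below the threshold
            updated.append((predict, w))
    return updated

always_one = lambda x: 1
always_zero = lambda x: 0
E = [(always_one, 1.0), (always_zero, 1.0)]
E = dwm_update(E, x=None, y=1)   # always_zero is wrong, its weight is halved
print([w for _, w in E])         # → [1.0, 0.5]
```

Repeated mistakes shrink a classifier's weight geometrically, so after enough misclassifications it falls below θ and is dropped from the ensemble.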
Updating weights and incrementally training all classifiers are the key design points
of DWM. The intention is to give the base learners the opportunity to recover
from concept drift. However, it is hard to set the parameter θ, which determines when
to discard a poor classifier. If θ is too high, the ensemble will be volatile to noise. If it
is too low, poor classifiers will have a negative effect on the overall ensemble perfor-
mance before they can be identified as out of date. Our conjecture is that a DWM ensemble
cannot recover very quickly from a sudden concept change, and this is verified in our
experiments, shown later in Chapter 3.
2.2 Pattern-Based Subspace Clustering
Pattern-based subspace clustering is an emerging new research area. We review the
research work to date.
Cheng et al. [CC00] introduced the bicluster model. The model was proposed in
the bioinformatics field and is used to discover clusters of genes showing very similar
rising or falling coherence in expression levels under a set of conditions. Let X be the
set of genes, Y the set of conditions. Let I ⊂ X and J ⊂ Y be subsets of genes and
conditions. The pair (I, J) specifies a submatrix A_IJ with the following mean squared
residue score:

H(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} (d_ij − d_iJ − d_Ij + d_IJ)²

where

d_iJ = (1/|J|) Σ_{j∈J} d_ij,   d_Ij = (1/|I|) Σ_{i∈I} d_ij,   d_IJ = (1/(|I||J|)) Σ_{i∈I, j∈J} d_ij

are the row means, the column means, and the mean of the submatrix A_IJ, respectively.
The submatrix A_IJ is called a δ-bicluster if H(I, J) ≤ δ for some δ > 0. A randomized
algorithm is designed to find such clusters in a DNA array.
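The mean squared residue score can be computed directly from its definition (a small illustration; the example matrix is ours):

```python
def mean_squared_residue(A, I, J):
    """H(I, J) for the submatrix of A restricted to rows I and columns J,
    following the delta-bicluster definition above."""
    d_iJ = {i: sum(A[i][j] for j in J) / len(J) for i in I}  # row means
    d_Ij = {j: sum(A[i][j] for i in I) / len(I) for j in J}  # column means
    d_IJ = sum(A[i][j] for i in I for j in J) / (len(I) * len(J))
    return sum((A[i][j] - d_iJ[i] - d_Ij[j] + d_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))

# Rows that differ by a constant shift rise and fall together perfectly,
# so the residue is zero:
A = [[1, 3, 2],
     [4, 6, 5]]   # row 1 = row 0 + 3
print(mean_squared_residue(A, I=[0, 1], J=[0, 1, 2]))  # → 0.0
```

Note that the score averages the residues over the whole submatrix, which is exactly why a submatrix of a δ-bicluster need not itself be a δ-bicluster.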
The limitations of this pioneering work are two-fold:
1. The mean squared residue is an averaged measurement of the coherence of a
set of objects. It does not have the desirable Apriori-like property; that is, a
submatrix of a δ-bicluster is not necessarily a δ-bicluster. This creates difficulty
in designing an efficient bottom-up or top-down algorithm.
2. Bicluster is a greedy algorithm. After finding a bicluster, it randomizes the data
in the corresponding submatrix before moving on to find other biclusters. This
randomization destroys clusters that overlap with already found ones.
Yang et al. [YWWY02] proposed the δ-cluster algorithm to find biclusters more
efficiently. Pearson's R correlation is used to measure coherence among instances, and
residue is used to measure the decrease in coherence that a particular attribute or in-
stance brings to a cluster. The algorithm starts with a random set of seeds and iteratively
improves the overall cluster quality by randomly swapping attributes and data points to improve
individual clusters. The iterative process terminates when individual improvement lev-
els off in each cluster. It avoids the cluster-overlapping problem by finding all clusters
in parallel.
One of the primary problems with δ-cluster is that it takes the number of clusters
as an input parameter. Setting this parameter relies on domain knowledge that is
not always available, while the running time is particularly sensitive to the cluster
size parameter: if the value chosen is very different from the optimal cluster size, the
algorithm can take considerably longer to terminate. δ-cluster does not have the Apriori-like
property either, and hence is still not very efficient.
Wang et al. [WWYY02] developed the pCluster model, in which the cluster defini-
tion has the Apriori property. Let O be a subset of objects and T a subset of attributes;
(O, T) forms a matrix. Given two objects x, y ∈ O and attributes a, b ∈ T, the pScore of the 2×2
matrix is defined as:

pScore( [ d_xa  d_xb ; d_ya  d_yb ] ) = |(d_xa − d_xb) − (d_ya − d_yb)|

(O, T) forms a pCluster if, for any 2×2 submatrix X in (O, T), pScore(X) ≤ δ
for some δ ≥ 0.
Since this definition has the Apriori property, an Apriori-like iterative algorithm
was developed in [WWYY02]. First, it finds all the correlated patterns for every 2 ob-
jects, and all the correlated patterns for every 2 attributes. Then, it iteratively generates
longer candidate patterns and finds larger pClusters.
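The pScore and the pCluster condition translate directly into code (a brute-force check of the definition, not the Apriori-style mining algorithm of [WWYY02]; the example matrix is ours):

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_pcluster(A, objects, attrs, delta):
    """(O, T) is a pCluster iff every 2x2 submatrix has pScore <= delta."""
    return all(p_score(A[x][a], A[x][b], A[y][a], A[y][b]) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(attrs, 2))

A = [[1, 4, 2],
     [3, 6, 4]]   # object 1 = object 0 shifted by 2
print(p_score(1, 4, 3, 6))                                   # → 0
print(is_pcluster(A, objects=[0, 1], attrs=[0, 1, 2], delta=0))  # → True
```

Because the condition is required of every 2×2 submatrix, any submatrix of a pCluster is itself a pCluster, which is precisely the Apriori property the iterative algorithm exploits.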
Although pCluster is the state-of-the-art pattern-based subspace clustering algorithm, its effi-
ciency is still far from desirable. In fact, the first step of finding all length-2
patterns has a complexity of O(N2M + M2N), where N is the number of objects and
M the dimensionality. The remaining work of finding all subspace patterns is NP-
hard, as it is equivalent to finding all cliques in a graph.
More efficient algorithms are desired for pattern-based subspace clustering. We
will describe our approach in Chapter 5.
2.3 Mining Quality
Mining quality can be improved by cleaning poor data, using more appropriate mining
models, or using more effective mining methods.
Techniques for improving data quality proposed in the literature have addressed
a wide range of problems caused by noise or missing data. In the information retrieval
field, grammatical rules are usually defined to remove noise [SM83]. A great amount
of work dealing with noise or missing values has been proposed for the purpose of a
specific mining task; for example, ASSEMBLE [BDM02], co-training [BM98]
and mixed models [NMTM00] are all for semi-supervised learning from both labeled and
unlabeled data. Another body of work corrects noisy labels or attributes, or fills in
missing values, using classification rules [ZWC03, Ten99].
Our work differs from the above in that it is a general-purpose method for data
cleaning, requiring neither domain knowledge nor a specific mining purpose. We
exploit local data dependencies to infer missing values or correct ambiguous
values.
Furthermore, the technique we use can not only improve data quality,
but can also enhance the mining results of data mining methods that assume inde-
pendence among data instances. We present our study in Chapter 6.
CHAPTER 3
Fast and Light Stream Boosting Ensembles
This chapter presents a novel approach for stream learning. The approach addresses
two important issues in stream mining: performance and adaptation.
3.1 Introduction
A substantial amount of recent work has focused on continuous mining of data streams
[DH00, GGRL02, HSD01, SK01, WFYH03]. Typical applications include network
traffic monitoring, credit card fraud detection and sensor network management sys-
tems. Challenges are posed by data ever increasing in amount and in speed, as well as
the constantly evolving concepts underlying the data. Two fundamental issues have to
be addressed by any continuous mining attempt.
Performance Issue. Constrained by the requirement of on-line response and by
limited computation and memory resources, continuous data stream mining should
conform to the following criteria: (1) Learning should be done very fast, preferably
in one pass of the data; (2) Algorithms should make very light demands on memory
resources, for the storage of either the intermediate results or the final decision models.
These fast and light requirements exclude high-cost algorithms, such as support vector
machines; also decision trees with many nodes should preferably be replaced by those
with fewer nodes as base decision models.
Adaptation Issue. For traditional learning tasks, the data is stationary. That is, the
underlying concept that maps the features to class labels is unchanging [WK96]. In
the context of data streams, however, the concept may drift due to gradual or sudden
changes of the external environment, such as increases of network traffic or failures
in sensors. In fact, mining changes is considered to be one of the core issues of data
stream mining [DHL+03].
In this chapter we focus on continuous learning tasks, and propose a novel Adaptive
Boosting Ensemble method to solve the above problems. In general, ensemble methods
combine the predictions of multiple base models, each learned using a learning algo-
rithm called the base learner [Die00]. In our method, we propose to use very simple
base models, such as decision trees with a few nodes, to achieve fast and light learn-
ing. Since simple models are often weak predictive models by themselves, we exploit
the boosting technique to improve the ensemble performance. Traditional boosting is
modified to handle data streams, retaining the essential idea of dynamic sample-weight
assignment yet eliminating the requirement of multiple passes through the data. This
is then extended to handle concept drift via change detection. Change detection aims
at significant changes that would cause serious deterioration of the ensemble perfor-
mance. The awareness of changes makes it possible to build an active learning system
that adapts to changes promptly.
The remainder of this chapter is organized as follows. Our adaptive boosting en-
semble method is presented in section 3.2, followed by a change detection technique
in section 3.3. Sections 3.4 and 3.5 contain experimental evaluation results against two
types of state-of-the-art stream ensembles, and we conclude in section 3.6.
3.2 Adaptive Boosting Ensembles
We use the boosting ensemble method since this learning procedure provides a number
of formal guarantees. Freund and Schapire proved a number of positive results about
its generalization performance [SFB97]. More importantly, Friedman et al. showed
that boosting is particularly effective when the base models are simple [FHT98]. This
is most desirable for fast and light ensemble learning on stream data.
In its original form, the boosting algorithm assumes a static training set. Earlier
classifiers increase the weights of misclassified samples, so that the later classifiers will
focus on them. A typical boosting ensemble usually contains hundreds of classifiers.
However, this lengthy learning procedure does not apply to data streams, where we
have limited storage but continuous incoming data. Past data cannot stay long before
making room for new data. In light of this, our boosting algorithm requires only two
passes of the data. At the same time, it is designed to retain the essential idea of
boosting—the dynamic sample weights modification.
Algorithm 1 is a summary of our boosting process. As data continuously flows
in, it is broken into blocks of equal size. A block Bj is scanned twice. The first pass
is to assign sample weights, in a way corresponding to AdaBoost.M1 [FS96]. That is,
if the ensemble error rate is ej , the weight of a misclassified sample xi is adjusted to
be wi = (1 − ej)/ej . The weight of a correctly classified sample is left unchanged.
The weights are normalized to be a valid distribution. In the second pass, a classifier
is constructed from this weighted training block.
The system keeps only the most recent classifiers, up to M. We use a traditional
scheme to combine the predictions of these base models, that is, by averaging the
probability predictions and selecting the class with the highest probability. Algorithm
1 is for binary classification, but can easily be extended to multi-class problems.
Algorithm 1 Adaptive boosting ensemble algorithm
Output: a boosting ensemble Eb with classifiers {C1, · · · , Cm}, m ≤ M.
1: while (1) do
2:   Given a new block Bj = {(x1, y1), · · · , (xn, yn)}, where yi ∈ {0, 1},
3:   Compute the ensemble prediction for sample i: Eb(xi) = round((1/m) Σ_{k=1}^{m} Ck(xi)),
4:   Change Detection: Eb ⇐ ∅ if a change is detected!
5:   if (Eb ≠ ∅) then
6:     Compute the error rate of Eb on Bj: ej = E[1_{Eb(xi) ≠ yi}],
7:     Set the new sample weight wi = (1 − ej)/ej if Eb(xi) ≠ yi; wi = 1 otherwise
8:   else
9:     Set wi = 1 for all i.
10:  end if
11:  Learn a new classifier Cm+1 from the weighted block Bj with weights {wi},
12:  Update Eb: add Cm+1, retire C1 if m = M.
13: end while
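To make the per-block loop of Algorithm 1 concrete, here is a minimal Python sketch of one iteration (the base learner, the change detector, and all names here are stand-ins of ours, not the actual implementation):

```python
def process_block(ensemble, block, train, change_detected, M=30):
    """One iteration of the while-loop in Algorithm 1.
    `ensemble`: list of classifiers (callables x -> 0 or 1);
    `train(block, weights)`: returns a new classifier;
    `change_detected(err)`: the test developed in Section 3.3."""
    xs = [x for x, _ in block]
    ys = [y for _, y in block]
    if ensemble:
        # ensemble prediction: average the votes and round
        preds = [round(sum(c(x) for c in ensemble) / len(ensemble)) for x in xs]
        err = sum(p != y for p, y in zip(preds, ys)) / len(ys)
        if change_detected(err):
            ensemble = []              # discard the obsolete ensemble
    if ensemble:
        # AdaBoost.M1-style reweighting of misclassified samples
        weights = [(1 - err) / err if p != y else 1.0
                   for p, y in zip(preds, ys)]
    else:
        weights = [1.0] * len(ys)
    ensemble = ensemble + [train(block, weights)]
    if len(ensemble) > M:
        ensemble = ensemble[1:]        # retire the oldest classifier
    return ensemble

# toy run: constant stubs stand in for real small decision trees
train = lambda block, w: (lambda x: 1)
E = process_block([], [((0.1,), 1), ((0.2,), 1)], train, lambda e: False)
E = process_block(E, [((0.3,), 1), ((0.4,), 0)], train, lambda e: False)
print(len(E))  # → 2
```

The two passes of the text correspond to computing `err`/`weights` (first pass) and calling `train` on the weighted block (second pass).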
Adaptability Note that there is a step called “Change Detection” (line 4) in Al-
gorithm 1. This is a distinguishing feature of our boosting ensemble, which guarantees
that the ensemble can adapt promptly to changes. Change detection is conducted at
every block. The details of how to detect changes are presented in the next section.
Our ensemble scheme achieves adaptability by actively detecting changes and dis-
carding the old ensemble when an alarm of change is raised. No previous learning
algorithm has used such a scheme. One argument is that old classifiers can be tuned to
the new concept by assigning them different weights. Our hypothesis, which is borne
out by experiment, is that obsolete classifiers have bad effects on overall ensemble
performance even when they are weighed down. Therefore, we propose to learn a new
ensemble from scratch when changes occur. Slow learning is not a concern here, as our
base learner is fast and light, and boosting ensures high accuracy. The main challenge
is to detect changes with a low false alarm rate.
Figure 3.1: Two types of significant changes. Type I: abrupt changes; Type II: gradual
changes over a period of time. These are the changes we aim to detect.
3.3 Change Detection
In this section we propose a technique for change detection based on the framework
of statistical decision theory. The objective is to detect changes that cause significant
deterioration in ensemble performance, while tolerating minor changes due to random
noise. Here, we view ensemble performance θ as a random variable. If data is sta-
tionary and fairly uniform, the ensemble performance fluctuations are caused only by
random noise, hence θ is normally assumed to follow a Gaussian distribution. When
data changes, yet most of the obsolete classifiers are kept, the overall ensemble per-
formance will undergo one of two types of decrease. In the case of an abrupt change, the distri-
bution of θ will change from one Gaussian to another, as shown in Figure 3.1(a).
Another situation is when the underlying concept has constant but small shifts. This
will cause the ensemble performance to deteriorate gradually, as shown in Figure 3.1(b).
Our goal is to detect both types of significant changes.
Every change detection algorithm is a certain form of hypothesis test. To make a
decision whether or not a change has occurred is to choose between two competing
hypotheses: the null hypothesis H0 or the alternative hypothesis H1, corresponding
to a decision of no-change or change, respectively. Suppose the ensemble has an
accuracy θj on block j. If the conditional probability density function (pdf) of θ under
the null hypothesis p(θ|H0) and that under the alternative hypothesis p(θ|H1) are both
known, we can make a decision using a likelihood ratio test:
L(θj) = p(θj|H1) / p(θj|H0)  ≷  τ.    (3.1)
The ratio is compared against a threshold τ . H1 is accepted if L(θj) ≥ τ , and
rejected otherwise. τ is chosen so as to ensure an upper bound of false alarm rate.
Now consider how to detect a possible type I change. When the null hypothesis
H0 (no change) is true, the conditional pdf is assumed to be a Gaussian, given by
p(θ|H0) = (1 / √(2πσ0²)) exp{ −(θ − µ0)² / (2σ0²) },    (3.2)
where the mean µ0 and the variance σ0² can be easily estimated if we just remember
a sequence of the most recent θ’s. But if the alternative hypothesis H1 is true, it is not
possible to estimate p(θ|H1) before sufficient information is collected. This would mean a
long delay before the change could be detected. In order to detect changes in a timely
fashion, we perform a significance test that uses H0 alone. A significance test assesses how well
the null hypothesis H0 explains the observed θ. Then the general likelihood ratio test
in Equation 3.1 is reduced to:
p(θj|H0)  ≷  τ.    (3.3)
When the likelihood p(θj|H0) ≥ τ, the null hypothesis is accepted; otherwise it is
rejected. Significance tests are effective in capturing large, abrupt changes.
For type II changes, we perform a typical hypothesis test as follows. First, we split
the history sequence of θ’s into two halves. A Gaussian pdf can be estimated from each
half, denoted as G0 and G1. Then a likelihood ratio test in Equation 3.1 is conducted.
So far we have described two techniques aimed at the two types of changes. They
are integrated into a two-stage method as follows. As a first step, a significance test is
performed. If no change is detected, then a hypothesis test is performed as a second
step. This two-stage detection method is shown experimentally to be very effective.
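The two-stage method can be sketched as follows (the thresholds, the variance floor, and the halving of the history are illustrative choices of ours, not values from this dissertation):

```python
import math

def gaussian_logpdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def estimate(thetas):
    mu = sum(thetas) / len(thetas)
    var = sum((t - mu) ** 2 for t in thetas) / len(thetas)
    return mu, max(var, 1e-6)   # floor the variance for numerical stability

def change_detected(history, theta_j, sig_tau=-8.0, lr_tau=2.0):
    """Stage 1: significance test under H0, for abrupt (type I) changes.
    Stage 2: likelihood ratio test between Gaussians fit to the two halves
    of the history, for gradual (type II) changes."""
    mu0, var0 = estimate(history)
    if gaussian_logpdf(theta_j, mu0, var0) < sig_tau:
        return True                         # abrupt change
    half = len(history) // 2
    mu_a, var_a = estimate(history[:half])  # older half -> H0
    mu_b, var_b = estimate(history[half:])  # recent half -> H1
    ratio = (gaussian_logpdf(theta_j, mu_b, var_b)
             - gaussian_logpdf(theta_j, mu_a, var_a))
    return ratio > lr_tau                   # gradual change

stable = [0.90, 0.91, 0.89, 0.90, 0.91, 0.90]
print(change_detected(stable, 0.90))   # → False
print(change_detected(stable, 0.40))   # → True (abrupt accuracy drop)
```

In practice the thresholds would be chosen to bound the false alarm rate, as discussed above.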
3.4 Comparison with Bagging Stream Ensembles
In this section, we first perform a controlled study on a synthetic data set, then apply
the method to a real-life application.
We evaluate our boosting scheme extended with change detection, named Adap-
tive Boosting, and compare it with Weighted Bagging [WFYH03] and Bagging [SK01].
These two bagging ensemble methods were described in Section 2.1.2.
In the following experiments, we use decision trees as our base model, but the
boosting technique can, in principle, be used with any other traditional learning model.
The standard C4.5 algorithm is modified to generate small decision trees as base mod-
els, with the number of terminal nodes ranging from 2 to 32. Full-grown decision trees
generated by C4.5 are also used for comparison, marked as fullsize in Figures 3.2-3.4
and Tables 3.1-3.2.
Synthetic Data
In the synthetic data set for the controlled study, a sample (x, y) has three independent
features x = < x1, x2, x3 >, xi ∈ [0, 1], i = 1, 2, 3. Geometrically, samples are points
in a 3-dimensional unit cube. The real class boundary is a sphere defined as

B(x) = Σ_{i=1}^{3} (xi − ci)² − r² = 0

where c = < c1, c2, c3 > is the center of the sphere and r the radius. y = 1 if B(x) ≤ 0,
y = 0 otherwise. This learning task is not easy due to the continuous feature space and
the non-linear class boundary.
To simulate a data stream with concept drift, we move the center c of the sphere
that defines the class boundary between adjacent blocks. The movement is along each
dimension with a step of ±δ. The value of δ controls the level of shifts from small,
moderate to large, and the sign of δ is randomly assigned independently along each
dimension. For example, if a block has c = (0.40, 0.60, 0.50), δ = 0.05, the sign along
each direction is (+1,−1,−1), then the next block would have c = (0.45, 0.55, 0.45).
The value of δ ought to be in a reasonable range, to keep the portion of samples that
change class labels reasonable. In our setting, we consider a concept shift small if δ is
around 0.02, and relatively large if δ is around 0.1.
To study the model robustness, we insert noise into the training data sets by ran-
domly flipping the class labels with a probability of p, p = 10%, 15%, 20%. Clean
testing data sets are used in all the experiments for accuracy evaluation.
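Under the stated setting, such a drifting stream can be generated as follows (a sketch; the radius r and the random seeds are our illustrative choices):

```python
import random

def make_block(center, r=0.25, n=1000, noise=0.0, rng=random.Random(42)):
    """Generate one data block: points uniform in the unit cube, labeled by
    the sphere B(x), with labels flipped with probability `noise`."""
    block = []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]
        y = 1 if sum((xi - ci) ** 2 for xi, ci in zip(x, center)) <= r ** 2 else 0
        if rng.random() < noise:
            y = 1 - y          # inject class-label noise
        block.append((x, y))
    return block

def drift(center, delta, rng=random.Random(7)):
    """Shift the sphere center by ±delta independently along each dimension."""
    return [c + rng.choice([-delta, delta]) for c in center]

c = [0.5, 0.5, 0.5]
blocks = []
for _ in range(5):
    blocks.append(make_block(c, noise=0.10))
    c = drift(c, delta=0.02)
print(len(blocks), len(blocks[0]))  # → 5 1000
```

Each call to `drift` moves the class boundary between adjacent blocks, producing the gradual concept shift described above.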
Credit Card Data
We also evaluate our algorithm on a real life data containing 100k credit card trans-
actions. The data has 20 features including the transaction amount, the time of the
transaction, etc. The task is to predict fraudulent transactions. Detailed data descrip-
tion is given in [SFL+97].
3.4.1 Evaluation of Boosting Scheme
The boosting scheme is first compared against two bagging ensembles on stationary
data. Samples are randomly generated in the unit cube. Noise is introduced in the
training data by randomly flipping the class labels with a probability of p. Each data
block has n samples and there are 100 blocks in total. The testing data set contains 50k
Figure 3.2: Performance comparison of the adaptive boosting vs the bagging on sta-
tionary data. The weighted bagging is omitted as it performs almost the same as the
bagging.
noiseless samples uniformly distributed in the unit cube. An ensemble of M classifiers
is maintained. It is updated after each block and evaluated on the test data set. Perfor-
mance is measured using the generalization accuracy averaged over 100 ensembles.
Figure 3.2 shows the generalization performance when p=5%, n=2k and M=30.
Weighted bagging is omitted from the figure because it makes almost the same predic-
tions as bagging, an unsurprising result for stationary data. Figure 3.2 shows that the
boosting scheme clearly outperforms bagging. Most importantly, boosting ensembles
with very simple trees perform well. In fact, the boosted two-level trees (2 termi-
nal nodes) have a performance comparable to bagging using the full-size trees. This
supports the theoretical result that boosting improves weak learners.
Higher accuracy of boosted weak learners is also observed for (1) block size n of
500, 1k, 2k and 4k, (2) ensemble size M of 10, 20, 30, 40, 50, and (3) noise level of
5%, 10% and 20%.
Figure 3.3: Performance comparison of the three ensembles on data with small gradual
concept shifts.
3.4.2 Learning with Gradual Shifts
Gradual concept shifts are introduced by moving the center of the class boundary be-
tween adjacent blocks. The movement is along each dimension with a step of ±δ.
The value of δ controls the level of shifts from small to moderate, and the sign of δ is
randomly assigned. The percentage of positive samples in these blocks ranges from
16% to 25%. Noise level p is set to be 5%, 10% and 20% across multiple runs.
The average accuracies are shown in Figure 3.3 for small shifts (δ = 0.01), and in
Figure 3.4 for moderate shifts (δ = 0.03). Results for other settings are shown in Table
3.1. These experiments are conducted with a block size of 2k; similar results are
obtained for other block sizes. The results are summarized below:
• Adaptive boosting outperforms the two bagging methods at all times, demonstrating
the benefits of the change detection technique; and
• Boosting is especially effective with simple trees (terminal nodes ≤ 8), achieving
a performance comparable with, or even better than, that of the bagging ensembles
with large trees.

Figure 3.4: Performance comparison of the ensembles on data with moderate gradual
concept shifts.

                    δ = .005                          δ = .02
                  2       4       8      fullsize   2       4       8      fullsize
Adaptive Boosting 89.2%   93.2%   93.9%  94.9%      92.2%   94.5%   95.7%  95.8%
Weighted Bagging  71.8%   84.2%   89.6%  91.8%      83.7%   92.0%   93.2%  94.2%
Bagging           71.8%   84.4%   90.0%  92.5%      83.7%   91.4%   92.4%  90.7%

Table 3.1: Performance comparison of the ensembles on data with varying levels of
concept shifts. Top accuracies shown in bold fonts.
Figure 3.5: Performance comparison of the three ensembles on data with abrupt shifts.
Base decision trees have no more than 8 terminal nodes.
3.4.3 Learning with Abrupt Shifts
We study learning with abrupt shifts in two sets of experiments. Abrupt concept
shifts are introduced every 40 blocks; three abrupt shifts occur at blocks 40, 80 and
120. In one set of experiments, the data stays stationary between these blocks. In the other
set, small shifts are mixed in between adjacent blocks. The concept drift parameters are
set to δ1 = ±0.1 for abrupt shifts, and δ2 = ±0.01 for small shifts.
Figures 3.5 and 3.6 show the experiments when base decision trees have no
more than 8 terminal nodes. Clearly the bagging ensembles, even with an empirical
weighting scheme, are seriously impaired at the changing points. Our hypothesis, that
obsolete classifiers are detrimental to overall performance even if they are weighed
down, is borne out experimentally. The adaptive boosting ensemble, on the other hand, is
able to respond promptly to abrupt changes by explicit change detection efforts. For
base models of different sizes, we show some of the results in Table 3.2. The accuracy
is averaged over 160 blocks for each run.
Figure 3.6: Performance comparison of the three ensembles on data with both abrupt
and small shifts. Base decision trees have no more than 8 terminal nodes.
                    δ2 = 0.00            δ2 = ±0.01
δ1 = ±0.1           4       fullsize     4       fullsize
Adaptive Boosting   93.2%   95.1%        93.1%   94.1%
Weighted Bagging    86.3%   92.5%        86.6%   91.3%
Bagging             86.3%   92.7%        85.0%   88.1%

Table 3.2: Performance comparison of three ensembles on data with abrupt shifts or
mixed shifts. Top accuracies are shown in bold fonts.
3.4.4 Experiments on Real Life Data
In this subsection we further evaluate our algorithm on a real-life data set containing 100k
credit card transactions. The data has 20 features, including the transaction amount, the
time of the transaction, etc.; the task is to predict fraudulent transactions, and a detailed
data description is given in [SFL+97]. The portion of the data we use contains 100k
transactions, each with a transaction amount between $0 and $21. Concept drift is
simulated by sorting the transactions by transaction amount.
Figure 3.7: Performance comparison of the three ensembles on credit card data. Con-
cept shifts are simulated by sorting the transactions by the transaction amount.
We study the ensemble performance using varying block sizes (1k, 2k, 3k and 4k),
and different base models (decision trees with terminal nodes no more than 2, 4, 8 and
full-size trees). We show one experiment in Figure 3.7, where the block size is 1k,
and the base models have at most 8 terminal nodes. The curve shows three dramatic
drops in accuracy for bagging, two for weighted bagging, but only a small one for
adaptive boosting. These drops occur when the transaction amount jumps. Overall,
the boosting ensemble is much better than the two bagging ensembles. This is also true
for the other experiments, whose details are omitted here for brevity.
Figure 3.8: Comparison of the adaptive boosting and the weighted bagging, in terms
of (a) building time, and (b) average decision tree size. In (a), the total amount of data
is fixed for different block sizes.
The boosting scheme is also the fastest. Moreover, the training time is almost
unaffected by the size of the base models. This is due to the fact that the later base
models tend to have very simple structures; many of them are just decision stumps
(one-level decision trees). On the other hand, the training time of the bagging methods
increases dramatically as the base decision trees grow larger. For example, when the
base decision tree is full-grown, weighted bagging takes 5 times longer to do the
training and produces a tree 7 times larger on average. The comparison is conducted
on a 2.26GHz Pentium 4 processor. Details are shown in Figure 3.8.
To summarize, the real application experiment confirms the advantages of our
boosting ensemble method over the bagging ensembles: it is fast and light, with good
adaptability.
3.5 Comparison with DWM
In this section, we compare our Adaptive Boosting method with the Dynamic Weighted
Majority (DWM) [KM01] method, which is described in Chapter 2.
Since the performance report of DWM on a synthetic problem is publicly available,
we evaluate Adaptive Boosting on the same problem. This problem, called the
“SEA Concepts”, has three attributes, xi ∈ R with 0.0 ≤ xi ≤ 10.0.
The target concept is x1 + x2 ≤ b; hence x3 is an irrelevant attribute. The presentation
of training examples lasts for 50,000 time steps. Concept change is simulated by
varying the value of b across quarters. For the first quarter (i.e., 12,500 time
steps), the target concept uses b = 8; for the second, b = 9; the third, b = 7;
and the fourth, b = 9.5. For each of these four periods, a training set of 12,500
examples is generated randomly, and 10% class noise is added. Another 2,500
examples are randomly generated for testing in each period. In the original DWM
experimental design, one example is fed to the method at each time step, and the
ensemble performance is evaluated against the testing samples at each time step. For
Adaptive Boosting, we feed a small block of examples periodically: once every 500
time steps, after a data block of 500 examples has accumulated, we learn a new
classifier from this block, update the ensemble, and evaluate the new ensemble
against the testing samples. As in DWM, we repeat this procedure ten
times, averaging accuracy over the runs. Both methods use Naive Bayes as the base
learner. The original DWM diagram also shows 95% confidence intervals; we do not
compute confidence intervals, as the experimental results are already sufficient to draw
a conclusion.
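The SEA data generation just described can be sketched as follows. This is a minimal illustration written for this discussion: the function names and seeding scheme are ours, not part of the original SEA experiments.

```python
import random

def sea_block(b, n, noise=0.10, seed=None):
    """Generate n SEA examples for threshold b: label = 1 iff x1 + x2 <= b.
    x3 is an irrelevant attribute; each label is flipped with probability
    `noise` to simulate class noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(0.0, 10.0) for _ in range(3)]
        y = 1 if x[0] + x[1] <= b else 0
        if rng.random() < noise:
            y = 1 - y          # 10% class noise by default
        data.append((x, y))
    return data

# Four quarters of 12,500 training examples each, with b = 8, 9, 7, 9.5,
# plus 2,500 test examples per quarter (generated noise-free here).
stream = [sea_block(b, 12500, seed=i) for i, b in enumerate([8, 9, 7, 9.5])]
tests = [sea_block(b, 2500, noise=0.0, seed=100 + i)
         for i, b in enumerate([8, 9, 7, 9.5])]
```

Blocks of 500 consecutive examples from `stream` would then play the role of the data blocks fed to Adaptive Boosting.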
The performance of DWM is shown in Figure 3.9, where DWM is denoted “DWM-NB”
(“NB” stands for Naive Bayes). (Two other methods are also shown there, but for our
purpose we can ignore them.) Adaptive Boosting is shown in Figure 3.10.
The first observation is that Adaptive Boosting suffers much less at concept-changing
points than DWM. Secondly, Adaptive Boosting is more accurate
than DWM on average. DWM is beaten for the same reason that Weighted Bagging
is: it keeps outdated classifiers until their weights drop below a user-specified
threshold θ. It is hard to set this parameter θ. If θ is too high, the ensemble will be
sensitive to noise; if it is too low, poor classifiers will have a negative effect on the
overall ensemble performance before they are identified as out of date. The inclusion
of outdated classifiers inevitably leads to slow adaptation and low average accuracy.
Figure 3.9: Dynamic Weighted Majority (DWM) ensemble performance on the SEA
concepts with 10% class noise.
3.6 Summary
In this chapter, we propose an adaptive boosting ensemble method that is different
from previous work in two aspects: (1) We boost very simple base models to build
effective ensembles with competitive accuracy; and (2) We propose a change detection
technique to actively adapt to changes in the underlying concept. We compare adaptive
Figure 3.10: Adaptive Boosting ensemble performance on the SEA concepts with 10%
class noise.
boosting ensemble methods with two bagging ensemble-based methods and the Dynamic
Weighted Majority method through extensive experiments. Results on both synthetic
and real-life data sets show that our method is much faster, demands less memory, and
is more adaptive and accurate.
The current method can be improved in several aspects. For example, our study of
the trend of the underlying concept is limited to the detection of significant changes.
If changes could be detected on a finer scale, new classifiers would not need to be built
when changes are trivial, and training time could be further reduced without loss of
accuracy. We also plan to study a classifier weighting scheme to improve ensemble
accuracy.
CHAPTER 4
Robust and Adaptive Stream Ensembles
The major limitation of Adaptive Boosting concerns noise. The boosting technique
has been demonstrated, under many scenarios, to be sensitive to noise. In this
chapter we discuss a novel discriminative model which, in addition to learning quickly
and adapting to changing concepts, is very robust to noise in data streams. The new
technique operates under the EM framework, in which noise identification and model
refinement mutually reinforce each other, leading to a robust discriminative model.
4.1 Introduction
Noise can severely impair the quality and speed of learning. This problem is encountered
in many applications where the source data can be unreliable, and errors can also
be injected during data transmission. The problem is even more challenging for data
streams, where it is difficult to distinguish noise from data caused by concept drift. If
an algorithm is too eager to adapt to concept changes, it may overfit noise by mistakenly
interpreting it as data from a new concept. If the algorithm is too conservative
and slow to adapt, it may overlook important changes (and, for instance, miss out on
the opportunities created by a timely identification of new trends in the marketplace).
In Chapter 3 we reviewed quite a number of stream learning algorithms, but
none of them provides a mechanism for noise identification, often referred to
interchangeably as outlier detection (the term we use hereafter). Although there have been
a number of off-line algorithms [AY01, RRS00, BKNS00, KM03, BF96] for outlier
detection, they are unsuitable for stream data: they assume a single unchanging data
model and hence are unable to distinguish noise from data caused by concept drift. In
addition, outlier detection in stream data faces general problems such as the choice of a
distance metric. Most of the traditional approaches use Euclidean distance, which
cannot handle categorical values.
Our Method - Robust Regression Ensemble Method
To address the three above-mentioned issues, we propose a novel discriminative
model, the Robust Regression Ensemble Method, for adaptive learning on noisy data
streams with modest resource consumption. For a learnable concept, the class of a
sample conditionally follows a Bernoulli distribution. Our method assigns classifier
weights in a way that maximizes the likelihood of the training data under the learned
distribution. This weighting scheme has a theoretical guarantee of adaptability. In
addition, as we have verified experimentally, our weighting scheme can also boost a
collection of weak classifiers into a strong ensemble. Examples of weak classifiers
include decision trees with very few nodes. Weak classifiers are desirable because
they learn faster and consume fewer resources.
Our outlier detection differs from previous approaches in that it is tightly integrated
into the adaptive model learning. The motivation is that outliers are directly defined by
the current concept, so the outlier identifying strategy needs to be modified whenever
the concept drifts away. In our integrated learning, outliers are defined as samples with
a small likelihood given the current model, and then the model is refined on the training
data with outliers removed. The overall learning is an iterative process in which the
model learning and outlier detection mutually reinforce each other.
Another advantage of our outlier detection technique is the general distance metric
for identifying outliers. We define a distance metric based on predictions of the current
35
ensemble, instead of a function in the data space. It can handle both numerical and
categorical values.
The remainder of this chapter is organized as follows. Sections 4.2 and 4.3 describe
the discriminative model with regard to adaptation and robustness, respectively.
Section 4.4 gives the model formulation and computation. Experimental results
are shown in Section 4.5.
4.2 Adaptation to Concept Drift
Ensemble weighting is the key to fast adaptation. Here we show that this problem can
be formulated as a statistical optimization problem solvable by logistic regression.
We first look at how an ensemble is constructed and maintained. The data stream
is simply partitioned into small blocks of fixed size, and a classifier is learned from
each block. The most recent K classifiers comprise the ensemble, and old classifiers
retire sequentially by age. Besides a set of training examples for classifier learning,
another set of training examples is needed for classifier weighting. If training data is
sufficient, we can reserve part of it for weight training; otherwise, randomly sampled
training examples can serve the purpose. We only need to keep the two data sets
as synchronized as possible. When sufficient training data is collected for classifier
learning and ensemble weighting, the following steps are conducted: (1) learn a new
classifier from the training block; (2) replace the oldest classifier in the ensemble with
the newly learned one; and (3) weight the ensemble.
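The three maintenance steps above can be sketched as follows. This is a schematic outline under our own naming: the stub learner and the uniform weighting are placeholders for the base learner and the logistic-regression weighting described in this chapter.

```python
from collections import deque

K = 8  # ensemble capacity: keep the K most recent classifiers

def train_classifier(block):
    # Stub learner: a decision stump on the first feature, thresholded at the
    # mean of the positive examples. Any base learner could be used here.
    pos = [x[0] for x, y in block if y == 1]
    t = sum(pos) / len(pos) if pos else 0.0
    return lambda x, t=t: 1 if x[0] <= t else 0

def weight_ensemble(ensemble, weight_block):
    # Stub weighting: uniform weights. The chapter replaces this with
    # logistic regression over the classifiers' predictions (Section 4.4).
    return [1.0 / len(ensemble)] * len(ensemble)

ensemble = deque(maxlen=K)   # the oldest classifier retires automatically

def on_new_block(train_block, weight_block):
    f_new = train_classifier(train_block)      # (1) learn a new classifier
    ensemble.append(f_new)                     # (2) oldest drops out if full
    return weight_ensemble(ensemble, weight_block)  # (3) weight the ensemble
```

The `deque` with `maxlen=K` directly implements retirement by age: appending the (K+1)-th classifier silently discards the oldest one.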
The rest of this section gives a formal description of ensemble weighting. A two-
class classification setting is considered for simplicity, but the treatment can be ex-
tended to multi-class tasks.
36
The training data for ensemble weighting is represented as

(X, Y) = {(xi, yi); i = 1, · · · , N}

where xi is a vector-valued sample attribute and yi ∈ {0, 1} is the sample class label.
We assume an ensemble of classifiers, denoted in vector form as

f = (f1(x), · · · , fK(x))^T

where each fk(x) is a classifier function producing a value for the belief on a class. The
individual classifiers in the ensemble may be weak or out-of-date. It is the goal of our
discriminative model M to make the ensemble strong by weighted voting. Classifier
weights are model parameters, denoted as

w = (w1, · · · , wK)^T

where wk is the weight associated with classifier fk. The model M also specifies a
weighted voting scheme for decision making, that is, w^T · f.

Because the ensemble prediction w^T · f is a continuous value, yet the class label yi
to be decided is discrete, a standard approach is to assume that yi conditionally follows
a Bernoulli distribution parameterized by a latent score ηi:

yi | xi; f, w ∼ Ber(q(ηi)),   ηi = w^T · f(xi)      (4.1)

where q(ηi) is the logit transformation of ηi:

q(ηi) := logit(ηi) = e^ηi / (1 + e^ηi)
37
Eq. 4.1 states that yi follows a Bernoulli distribution with parameter q; thus the
posterior probability is

p(yi | xi; f, w) = q^yi (1 − q)^(1−yi)      (4.2)
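As a concrete check of these formulas, the following is our own minimal sketch of the score ηi, the logistic transform q, and the Bernoulli likelihood of Eq. 4.2:

```python
import math

def q(eta):
    """The logistic transform q(eta) = e^eta / (1 + e^eta) of Eq. 4.1."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_likelihood(y, x, fs, w):
    """p(y | x; f, w) = q^y (1 - q)^(1 - y), as in Eq. 4.2.
    fs is the list of classifier functions, w the list of weights."""
    eta = sum(wk * fk(x) for wk, fk in zip(w, fs))   # eta = w^T f(x)
    p = q(eta)
    return p if y == 1 else 1.0 - p
```

With all weights zero, the ensemble is uninformative and both classes get likelihood 0.5; a large positive score pushes the likelihood of class 1 toward one.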
The above description leads to optimizing classifier weights using logistic regression.
Given a data set (X, Y) and an ensemble f, the logistic regression technique
optimizes the classifier weights by maximizing the likelihood of the data. The
optimization problem can be solved quickly by a standard iterative procedure. We
postpone the detailed model computation until Section 4.4.
Logistic regression is a well-established regression method, widely used in
traditional areas where the regressors are continuous and the responses are discrete [HTF00].
In our work, we formulate the classifier weighting problem as an optimization problem
and solve it using logistic regression. In Section 4.5 we show that this formulation
and solution provide much better adaptability than previous work. (Refer to Fig. 4.1-4.2
in Section 4.5 for a quick reference.)
4.3 Robustness to Outliers
Regression is adaptive because it always tries to fit the data from the current concept.
But it can potentially overfit outliers. We integrate the following outlier detection
technique into the model learning.
We define outliers as samples with a small likelihood under a given data model.
The goal of learning is to compute a model that best fits the bulk of the data, that is,
the inliers. Whether a sample is an outlier is hidden information in this problem. This
suggests solving the problem under the EM framework, using a robust statistical
formulation.
Previously we described a training data set as {(xi, yi), i = 1, · · · , N}, or (X, Y).
This is an incomplete data set, as the outlier information is missing. A complete data
set is a triplet

(X, Y, Z)

where Z = {z1, · · · , zN} is a hidden variable that distinguishes the outliers from the
inliers: zi = 1 if (xi, yi) is an outlier, zi = 0 otherwise. This Z is not observable and
needs to be inferred. After the values of Z are inferred, (X, Y) can be partitioned into
a clean sample set

(X0, Y0) = {(xi, yi, zi) : xi ∈ X, yi ∈ Y, zi = 0}

and an outlier set

(Xφ, Yφ) = {(xi, yi, zi) : xi ∈ X, yi ∈ Y, zi = 1}

The samples in (X0, Y0) are assumed to all come from one underlying distribution,
and they are used to fit the model parameters.
To infer the outlier indicator Z, we introduce a new model parameter λ. It is a
threshold value of sample likelihood. A sample is marked as an outlier if its likeli-
hood falls below λ. This λ, together with f (classifier functions) and w (classifier
weights) discussed earlier, constitutes the complete set of parameters of our discrimi-
native model M , denoted as M(x; f, w, λ).
4.4 Model Learning
In this section, we give the model formulation, followed by the model computation.
The symbols used are summarized in Table 4.1.
(xi, yi)     a sample, with xi the sample attribute and yi the sample class label,
(X, Y)       an incomplete data set without outlier information,
Z            a hidden variable,
(X, Y, Z)    a complete data set with outlier information,
(X0, Y0)     a clean data set,
(Xφ, Yφ)     an outlier set,
M            the discriminative model,
f            a vector of classifier functions, a model parameter,
w            a vector of classifier weights, a model parameter,
λ            a threshold on likelihood, a model parameter.
Table 4.1: Summary of symbols used
4.4.1 Model Formulation
Our model has a four-tuple representation M(x; f, w, λ). Given a training data set
(X, Y) and an ensemble of classifiers f = (f1(x), · · · , fK(x))^T, we want to achieve two
objectives.
1. To infer the hidden variable Z that distinguishes the inliers (X0, Y0) from the
outliers (Xφ, Yφ).
2. To compute the optimal fit for the model parameters w and λ in the discriminative
model M(x; f, w, λ).
Each inlier sample (xi, yi) ∈ (X0, Y0) is assumed to be drawn independently from an
identical distribution belonging to a probability family characterized by parameters w,
denoted by a density function p((x, y); f, w). The problem is to find the values of w
that maximize the likelihood of (X0, Y0) in the probability family. As customary, we
use the log-likelihood to simplify the computation:

log p((X0, Y0) | f, w)

A parametric model for the outlier distribution is not available because outliers are
highly irregular. We use instead a non-parametric statistic based on the number of
outliers, ‖(Xφ, Yφ)‖. The problem then becomes an optimization problem. The
score function to be maximized involves two parts: (i) the log-likelihood term for the
inliers (X0, Y0), and (ii) a penalty term for the outliers (Xφ, Yφ). That is:

(w, λ)∗ = arg max_(w,λ) { log p((X0, Y0) | f, w) − ζ((Xφ, Yφ); w, λ) }      (4.3)

where the penalty term, which penalizes having too many outliers, is defined as

ζ((Xφ, Yφ); w, λ) = e · ‖(Xφ, Yφ)‖      (4.4)

w and λ affect ζ implicitly. The value of e depends empirically on the size of the
training data. In our experiments we set e ∈ (0.2, 0.3).
After expanding the log-likelihood term, we have:

log p((X0, Y0) | f, w)
  = Σ_{xi ∈ X0} log p((xi, yi) | f, w)
  = Σ_{xi ∈ X0} log p(yi | xi; f, w) + Σ_{xi ∈ X0} log p(xi)

Absorbing Σ_{xi ∈ X0} log p(xi) into the penalty term ζ((Xφ, Yφ); w, λ), and replacing
the likelihood in Eq. 4.3 with the logistic form (Eq. 4.2), the optimization goal becomes
finding the best fit (w, λ)∗:

(w, λ)∗ = arg max_(w,λ) { Σ_{xi ∈ X0} [ yi log q + (1 − yi) log(1 − q) ] − ζ((Xφ, Yφ); w, λ) }      (4.5)
The score function to be maximized is not differentiable because of the non-parametric
penalty term. We have to resort to a more elaborate technique based on the Expectation-
Maximization (EM) [Bil98] algorithm to solve the problem.
4.4.2 Inference and Computation
The main goal of model computation is to infer the missing variables and compute the
optimal model parameters, under the EM framework. The EM in general is a method
for maximizing data likelihood in problems where data is incomplete. The algorithm
iteratively performs an Expectation-Step (E-Step) followed by an Maximization-Step
(M-Step) until convergence. In our case,
1. E-Step: to impute / infer the outlier indicator Z based on the current model
parameters (w, λ).
2. M-Step: to compute new values for (w, λ) that maximize the score function in
Eq. 4.3 with current Z.
Next we will discuss how to impute outliers in E-Step, and how to solve the maxi-
mization problem in M-Step. The M-Step is actually a Maximum Likelihood Estima-
tion (MLE) problem.
E-Step: Impute Outliers
With the current model parameters w (classifier weights), the model for clean data
is established as in Eq. 4.1; that is, the class label yi of a sample xi follows a Bernoulli
distribution parameterized by the ensemble prediction for this sample, w^T · f(xi).
Thus, yi's likelihood p(yi | xi; f, w) can be computed by Eq. 4.2.
Note that the line between outliers and inliers is drawn by λ, which is computed in
the previous M-Step. So the formulation for imputing outliers is straightforward:

zi = sign( log p(yi | xi; f, w) − λ )      (4.6)

where

sign(x) = 1 if x < 0, and 0 otherwise.
M-Step: MLE
The score function (Eq. 4.5) to be maximized is not differentiable because of
the penalty term. We consider a simple approach for an approximate solution, in which
the computation of λ and w is separated.
1. λ is computed by running the standard K-means clustering algorithm on the
log-likelihoods p(yi | xi; f, w). In our experiments we choose K = 3. The cluster
boundaries are candidates for the likelihood threshold λ∗ separating outliers from
inliers.
2. Fixing each candidate λ∗, w∗ can be computed using the standard MLE
procedure. An MLE procedure is run for each candidate λ∗, and the maximum
likelihood identifies the best fit (w, λ)∗.
The standard MLE procedure for computing w is described as follows. Taking the
derivative of the inlier likelihood with respect to w and setting it to zero, we have

∂/∂w Σ_{yi ∈ Y0} ( yi · e^ηi / (1 + e^ηi) + (1 − yi) · 1 / (1 + e^ηi) ) = 0

To solve this equation, we use the Newton-Raphson procedure, which requires the
first and second derivatives. For clarity of notation, we use h(w) to denote the inlier
likelihood function with regard to w. Starting from w^t, a single Newton-Raphson
update is

w^(t+1) = w^t − ( ∂²h(w^t) / ∂w∂w^T )^(−1) · ∂h(w^t) / ∂w

Here we have

∂h(w) / ∂w = Σ_{yi ∈ Y0} (yi − q) f(xi)

and

∂²h(w) / ∂w∂w^T = − Σ_{yi ∈ Y0} q(1 − q) f(xi) f^T(xi)
The initial value of w is important for the convergence of the computation. Since
there is no prior knowledge, we initially set w to be uniform.
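A minimal numpy sketch of this Newton-Raphson loop, using the gradient and Hessian above, might look like the following. It is our own illustration: `F` denotes the N×K matrix whose i-th row is f(xi), the base classifiers' outputs on sample i, and the small ridge term is a numerical safeguard we add, not part of the derivation.

```python
import numpy as np

def fit_weights(F, y, iters=20):
    """Fit classifier weights w by Newton-Raphson on the logistic
    log-likelihood h(w). F: (N, K) matrix of base-classifier outputs;
    y: (N,) array of labels in {0, 1}."""
    N, K = F.shape
    w = np.full(K, 1.0 / K)                 # uniform start (no prior knowledge)
    for _ in range(iters):
        q = 1.0 / (1.0 + np.exp(-(F @ w)))  # q(eta_i), with eta = F w
        grad = F.T @ (y - q)                # dh/dw = sum_i (y_i - q_i) f(x_i)
        hess = -(F * (q * (1.0 - q))[:, None]).T @ F  # d2h / dw dw^T
        hess -= 1e-8 * np.eye(K)            # tiny ridge for numerical safety
        w = w - np.linalg.solve(hess, grad) # Newton-Raphson update
    return w
```

On data generated from known weights, a few iterations suffice to recover the signs and rough magnitudes of the true weights.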
Algorithmic Summary
The learning of a discriminative model is summarized in Algorithm 2.

Algorithm 2 A Discriminative Model Learning Algorithm
Output: a model containing an ensemble of classifiers ordered by age,
f = (f1, · · · , fK)^T, the classifier weights w = (w1, · · · , wK)^T, and λ.
1: loop:
2:   Given a new training block Btrain and an evaluation block Beval,
3:   learn a new classifier fK+1 from block Btrain.
4:   Update f: add fK+1 to f; retire the oldest classifier if the ensemble size exceeds K.
5:   EM: (1) impute outliers in Beval, and
        (2) compute w and λ by maximizing the likelihood of Beval.
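The EM alternation of Algorithm 2 can be sketched as follows. This is a simplified illustration of our own: the weight fitter `mle` is a placeholder for the MLE procedure described above, and the threshold λ is held fixed here, whereas the chapter selects it among K-means cluster boundaries.

```python
import math

def sample_loglik(f_vec, yi, w):
    """log p(y_i | x_i; f, w) under the Bernoulli model of Eq. 4.2.
    f_vec holds the base classifiers' outputs on this sample."""
    eta = sum(wk * fk for wk, fk in zip(w, f_vec))
    q = 1.0 / (1.0 + math.exp(-eta))
    return math.log(q if yi == 1 else 1.0 - q)

def em_fit(F, y, mle, lam=-2.0, rounds=5):
    """E-step: mark sample i an outlier (z_i = 1) when its log-likelihood
    falls below lam (Eq. 4.6). M-step: refit the weights w on the inliers
    only. F is a list of per-sample base-classifier output vectors."""
    w = mle(F, y)                                    # initial fit on all data
    z = [0] * len(y)
    for _ in range(rounds):
        z = [0 if sample_loglik(fv, yi, w) >= lam else 1
             for fv, yi in zip(F, y)]                # E-step (Eq. 4.6)
        inliers = [(fv, yi) for fv, yi, zi in zip(F, y, z) if zi == 0]
        if not inliers:
            break
        w = mle([fv for fv, _ in inliers],
                [yi for _, yi in inliers])           # M-step
    return w, z
```

With a well-predicted sample and a badly-predicted one, the loop keeps the former as an inlier and flags the latter, which is the mutual reinforcement the chapter describes.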
4.5 Experiments and Discussions
We use both synthetic data and a real-life application to evaluate our discriminative
model, Robust Regression Ensemble Method, in terms of both adaptability to con-
cept shifts and robustness to noise. Our model is compared with the two previously
mentioned approaches: Bagging [SK01] and Weighted Bagging [WFYH03]. We show
that although the empirical weighting in Weighted Bagging performs better than un-
weighted voting, the robust regression weighting method is more superior, in terms of
both adaptability and robustness.
C4.5 decision trees are used in our experiments, but in principle our method can be
used with any base learning algorithm.
The synthetic data set is the one used in Chapter 3. In summary, a sample x is a vector
of three independent features ⟨xi⟩, xi ∈ [0, 1], i = 0, 1, 2. Geometrically, samples
are points in a 3-dimensional unit cube. The class boundary is a sphere defined as
B(x) = Σ_{i=0}^{2} (xi − ci)² − r² = 0, where c is the center of the sphere and r the
radius. x is labelled class 1 if B(x) ≤ 0, class 0 otherwise. This learning task is not easy,
because the feature space is continuous and the class boundary is non-linear.
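The generation just described can be sketched as follows (our illustration; the function and parameter names are ours):

```python
import random

def sphere_block(center, r, n, noise=0.10, seed=None):
    """Sample n points uniformly in the unit cube. A point is class 1 iff
    B(x) = sum_i (x_i - c_i)^2 - r^2 <= 0; each label is flipped with
    probability `noise` to simulate class noise."""
    rng = random.Random(seed)
    block = []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]
        b = sum((xi - ci) ** 2 for xi, ci in zip(x, center)) - r ** 2
        y = 1 if b <= 0 else 0
        if rng.random() < noise:
            y = 1 - y
        block.append((x, y))
    return block

def drift(center, delta):
    """Move the class boundary center by delta along each dimension,
    simulating concept drift between adjacent blocks."""
    return [c + delta for c in center]
```

Calling `drift` with a small delta between blocks produces the gradual shifts, and a large delta at blocks 40, 80 and 120 produces the abrupt ones.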
We also use the real-life problem described in Chapter 3.
4.5.1 Evaluation of Adaptation
In this subsection we compare our robust regression ensemble method with Bagging
and Weighted Bagging. Concept drift is simulated by moving the class boundary center
between adjacent data blocks. The moving distance δ along each dimension controls
the magnitude of the concept drift. We have two sets of experiments with different δ
values, both with abrupt large changes occurring at blocks 40, 80 and 120. In one
experiment, the data remains stationary between these changing points. In the other experiment,
Figure 4.1: Adaptability comparison of the ensemble methods on data with three
abrupt shifts.
Figure 4.2: Adaptability comparison of the ensemble methods on data with three
abrupt shifts mixed with small shifts.
small shifts are mixed between the abrupt ones, with δ ∈ (0.005, 0.03). The percentage
of positive samples fluctuates between 41% and 55%. The noise level is 10%.
As shown in Fig. 4.1 and Fig. 4.2, the robust regression model always gives the best
performance. The unweighted bagging ensemble has the worst predictive accuracy.
Figure 4.3: Robustness comparison of the three ensemble methods for different noise
levels.
Both bagging methods are seriously impaired at the concept changing points, but the
robust regression is able to catch up with the new concept quickly.
4.5.2 Robustness in the Presence of Outliers
Noise is the major source of outliers. Fig. 4.3 shows the ensemble performance for the
different noise levels: 0%, 5%, 10%, 15% and 20%. The accuracy is averaged over
100 runs spanning 160 blocks, with small gradual shifts between blocks. We can make
two major observations here:
1. The robust regression ensembles are the most accurate for all the noise levels, as
clearly shown in Fig. 4.3.
2. Robust regression also gives the smallest performance drops when noise increases.
This conclusion is confirmed using a paired t-test at the 0.05 level. In each case when
the noise level increases to 10%, 15% or 20%, the decrease in accuracy produced by
robust regression is the smallest, and the differences are statistically significant.
Figure 4.4: In the outliers detected, the normalized ratio of (1) true noisy samples
(the upper bar), vs. (2) samples from an emerging concept (the lower bar). The bars
correspond to blocks 0-59 in the experiments shown in Fig.4.2
To better understand why the robust regression method is less impacted by outliers,
we show the outliers it detects in Fig. 4.4. Outliers consist mostly of noisy samples
and of samples from a newly emerged concept. In the experiments shown in Fig. 4.2, we
record the outliers in blocks 0-59 and calculate the normalized ratio of the two parts.
As the figure shows, true noise dominates the identified outliers. At block 40, where the
concept drift is large, somewhat more samples reflecting the new concept are mistakenly
reported as outliers, but even there more true noisy samples are identified at the same time.
4.5.3 Discussion of Performance Issues
Constrained by the requirement of on-line responses and by limited computation and
memory resources, stream data mining methods should learn fast and produce simple
classifiers. For ensemble learning, simple classifiers help to achieve these goals. Here
we show that simple decision trees can be used in the logistic regression model for
Figure 4.5: Performance comparison of the ensemble methods with classifiers of
different sizes. Robust regression with smaller classifiers is comparable to the others
with larger classifiers.
better performance.
The simple classifiers we use are decision trees with 8, 16, or 32 terminal nodes.
Full-grown trees are also included for comparison and denoted as “fullsize” where
referred to. Fig. 4.5 compares the accuracies (averaged over 160 blocks) of the ensembles.
The first thing to note is that the robust regression method is always the best, regardless
of the tree size. More importantly, it boosts a collection of simple classifiers, which
individually are weak in classification capability, into a strong ensemble. In fact, the
robust regression ensemble of smaller classifiers is comparable or even superior to the
two bagging ensembles of larger classifiers. We observed this superior performance of
the robust regression method under different levels of noise.
For the computation time study, we verify that robust regression is comparable to
weighted bagging in terms of speed. In a set of experiments where the three methods
run for about 40 blocks, the learning and evaluation time totals 138 seconds
for unweighted bagging, 163 seconds for weighted bagging, and 199 seconds for
Figure 4.6: Performance comparison of the ensembles on credit card data. Base de-
cision trees have no more than 16 terminal nodes. Concept shifts are simulated by
sorting the transactions by the transaction amount.
the robust regression. These running times are obtained when full-grown decision trees
are used. If small decision trees are used instead, logistic regression learning can be
sped up further and still perform better than the other two methods with full-grown trees.
4.5.4 Experiments on Real Life Data
The real-life application is to build a classification model for the detection of fraudulent
credit card transactions. A transaction has 20 features, including the transaction
amount, the time of the transaction, etc.
We study the ensemble performance using different block sizes (1k, 2k, 3k and 4k)
and different base models (decision trees with no more than 8, 16, or 32 terminal nodes,
and full-size trees). We show one experiment in Fig. 4.6, where the block size is 1k
and the base models have at most 16 terminal nodes. Results of other experiments are
similar. The curve shows fewer and smaller drops in accuracy for the robust regression
than for the other methods. These drops occur when the transaction amount jumps.
Overall, the robust regression ensemble method performs better than the other two
ensemble methods.
4.6 Summary
In this chapter, we propose a model learning method that is highly adaptive to
concept changes and robust to noise. The model produces a weighted ensemble. The
weights of the classifiers are computed by a logistic regression technique, which
ensures good adaptability. Furthermore, this logistic regression-based weighting
scheme is capable of boosting a collection of weak classifiers, thus achieving the goal
of fast and light learning. Outlier detection is integrated into the model learning, so
that classifier weight training involves only the inliers, which leads to the robustness of
the resulting ensemble. For outlier detection, we assume that an inlier's membership in
a class follows a Bernoulli distribution, and outliers are samples with a small likelihood
under this distribution. The classifier weights are estimated in a way that maximizes
the training data likelihood. Compared with recent work [SK01, WFYH03], the
experimental results show that this statistical model achieves higher accuracy, adapts
to underlying concept drift more promptly, and is less sensitive to noise.
CHAPTER 5
Subspace Pattern Based Sequence Clustering
In this chapter, we introduce an algorithm that discovers clusters based on subspace
pattern similarity. Unlike traditional clustering methods that focus on grouping objects
with similar values on a set of dimensions, clustering by pattern similarity finds objects
that exhibit a coherent pattern of rise and fall in subspaces. Efficiency is the biggest
concern, due to the curse of dimensionality. In this new algorithm, we define a novel
distance function that not only captures subspace pattern similarity, but is also
conducive to efficient clustering implementations.
5.1 Introduction
Clustering large datasets is a challenging data mining task with many real-life
applications, including those in statistics, machine learning, pattern recognition, and image
processing. Much research has been devoted to the problem of finding subspace
clusters [APW+00, AY00, AGR98, CFZ99, JMN99]. Along this direction, we further
extended the concept of clustering to focus on pattern-based similarity [WWYY02].
Several research works have since studied clustering based on pattern similarity [YWWY02,
PZC+03], as opposed to traditional value-based similarity.
These efforts represent a step forward in bringing the techniques closer to the
demands of real-life applications, but at the same time they also introduce new challenges.
For instance, the clustering models in use [WWYY02, YWWY02, PZC+03]
are often too rigid to find objects that exhibit meaningful similarity, and the lack
of an efficient algorithm makes these models impractical for large-scale data. In this
chapter, we introduce a novel clustering model which is intuitive, capable of capturing
subspace pattern similarity effectively, and conducive to an efficient implementation.
Figure 5.1: Objects form patterns in subspaces. (a) Raw data: 3 objects, 10 columns.
(b) A shifting pattern in subspace {b, c, h, j, e}. (c) A scaling pattern in subspace
{f, d, a, g, i}.
5.1.1 Subspace Pattern Similarity
We present the concept of subspace pattern similarity with an example in Figure 5.1.
We have three objects. Here, the X axis represents a set of conditions, and the Y
axis represents object values under those conditions. In Figure 5.1(a), the similarity
among the three objects is not visually apparent, until we study them under two subsets
of conditions: in Figure 5.1(b), we find the same three objects form a shifting pattern in
subspace {b, c, h, j, e}, and in Figure 5.1(c), a scaling pattern in subspace {f, d, a, g, i}.
This means we should consider objects similar to each other as long as they man-
ifest a coherent pattern in a certain subspace, regardless of whether their coordinate
values in that subspace are close. It also means that many traditional distance
functions, such as the Euclidean distance, cannot effectively discover such similarity.
5.1.2 Applications
We motivate our work with applications in two important areas.
Analysis of Large Scientific Datasets. Scientific data sets often consist of many
numerical columns. One such example is the gene expression data. DNA micro-arrays
are an important breakthrough in experimental molecular biology, for they provide a
powerful tool in exploring gene expression on a genome-wide scale. By quantifying
the relative abundance of thousands of mRNA transcripts simultaneously, researchers
can discover new functional relationships among a group of genes [BB99, DLS99].
Investigations show that, more often than not, several genes contribute to one disease,
which motivates researchers to identify genes whose expression levels rise and
fall coherently under a subset of conditions, that is, genes that exhibit fluctuation of a
similar shape when conditions change [BB99, DLS99]. Table 5.1 shows that three genes,
VPS8, CYS3, and EFB1, respond to certain environmental changes coherently.
More generally, with the DNA micro-array as an example, we argue that the
following queries are of interest in scientific data analysis.
Example 1. Counting
How many genes have an expression level in sample CH1I that is about 100 ± 5 units
higher than that in CH2B, 280 ± 5 units higher than that in CH1D, and 75 ± 5 units
higher than that in CH2I?
Example 2. Clustering
Find clusters of genes that exhibit coherent subspace patterns, given the following
constraints: i) the subspace pattern has dimensionality higher than minCols; and ii)
the number of objects in the cluster is larger than minRows.
Answering the above queries efficiently is important in discovering gene correla-
tions [BB99, DLS99] from large scale DNA micro-array data. The counting problem
of Example 1 seems easy to implement, yet it constitutes the most primitive operation
in solving the clustering problem of Example 2, which is the focus of this chapter.
Current database techniques cannot solve the above problems efficiently. Algo-
rithms such as pCluster [WWYY02] have been proposed to find clusters of objects
that manifest coherent patterns. Unfortunately, they can only handle datasets contain-
ing no more than thousands of records.
        CH1I  CH1B  CH1D  CH2I  CH2B  · · ·
VPS8     401   281   120   275   298
SSA1     401   292   109   580   238
SP07     228   290    48   285   224
EFB1     318   280    37   277   215
MDM10    538   272   266   277   236
CYS3     322   288    41   278   219
DEP1     317   272    40   273   232
NTG1     329   296    33   274   228
...
Table 5.1: Expression data of Yeast genes
Event                  Timestamp
...                    ...
CiscoDCDLinkUp         19:08:01
MLMSocketClose         19:08:07
MLMStatusUp            19:08:21
...                    ...
MiddleLayerManagerUp   19:08:37
CiscoDCDLinkUp         19:08:39
...                    ...
Table 5.2: A Stream of Events
Discovery of Sequential Patterns. We use network event logs to demonstrate the
need to find clusters based on sequential patterns in large datasets. A network system
generates various events. We log each event, as well as the environment in which it
occurs, into a database. Finding patterns in a large dataset of event logs is important to
the understanding of the temporal causal relationships among the events, which often
provide actionable insights for determining problems in system management.
We focus on two attributes, Event and Timestamp (Table 5.2), of the log database.
A network event pattern contains multiple events. For instance, a candidate pattern
might be the following:
Example 3. Sequential Pattern
Event CiscoDCDLinkUp is followed by MLMStatusUp that is followed, in turn, by
CiscoDCDLinkUp, under the constraint that the interval between the first two events
is about 20±2 seconds, and the interval between the 1st and 3rd events is about 40±2
seconds.
Previous works [WPF+03, WPFY03] have studied the problem of efficiently lo-
cating a given sequential pattern; however, finding all interesting sequential patterns is
a difficult problem. A network event pattern becomes interesting if: i) it occurs fre-
quently, and ii) it is non-trivial, meaning it contains a certain number of events. The
challenge here is to find such patterns efficiently.
Although seemingly different from the problem shown in Figure 5.1, finding patterns
exhibited over time in sequential data is closely related to finding coherent patterns
in tabular data. It is another form of clustering by subspace pattern similarity: if we
think of the different types of events as conditions on the X axis of Figure 5.1, and their
timestamps as the Y axis, then we are actually looking for clusters of subsequences
that exhibit (time) shifting patterns as in Figure 5.1(b).
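To make the analogy concrete, here is a small illustrative sketch in Python. The event names follow Table 5.2; the timestamps and the "#2" suffix distinguishing the repeated event are hypothetical:

```python
# Two occurrences of the sequential pattern of Example 3, started at
# different times.  Viewing event types as conditions (the X axis) and
# timestamps as values (the Y axis), the two occurrences differ by a
# constant time shift, i.e. a shifting pattern as in Figure 5.1(b).
occ1 = {"CiscoDCDLinkUp": 0, "MLMStatusUp": 20, "CiscoDCDLinkUp#2": 40}
occ2 = {"CiscoDCDLinkUp": 300, "MLMStatusUp": 320, "CiscoDCDLinkUp#2": 340}

shifts = [occ2[e] - occ1[e] for e in occ1]
print(shifts)  # [300, 300, 300]: the same shift on every "condition"
```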
5.1.3 Our Contributions
This chapter presents a novel approach to clustering datasets based on pattern similarity.
• We present a novel model for subspace pattern similarity. In comparison with
previous models, the new model is intuitive for capturing subspace pattern sim-
ilarity, and reduces computational complexity dramatically.
• We unify pattern similarity analysis in tabular data and pattern similarity analysis
in sequential data into a single problem. Indeed, tabular data are transformed into
a sequential form which is conducive to an efficient implementation.
• We present a scalable sequence-based method, SeqClus, for clustering by sub-
space pattern similarity. The technique outperforms known state-of-the-art
pattern clustering algorithms and makes it feasible to perform pattern similarity
analysis on large datasets.
The rest of the chapter is organized as follows. We introduce a novel distance func-
tion for measuring subspace pattern similarity in Section 5.2. Section 5.3 presents an
efficient clustering algorithm based on a novel counting tree structure. Experiments
and results are reported in Section 5.4. In Section 5.5, we review related work and
conclude.
5.2 The Distance Function
The choice of distance function has great implications on the meaning of similarity,
and this is particularly important in subspace clustering because of its computational
complexity. Hence, we need a distance function that makes measuring the similar-
ity between two objects in high dimensional space meaningful and intuitive, and at the
same time admits an efficient implementation.
5.2.1 Tabular and Sequential Data
Finding objects that exhibit coherent patterns of rise and fall in a tabular dataset (e.g.
Table 5.1) is similar to finding subsequences in a sequential dataset (e.g. Table 5.2).
This indicates that we should unify the data representation of tabular and sequential
datasets so that a single similarity model and algorithm can apply to both tabular and
sequential datasets for clustering based on pattern similarity.
We use sequences to represent objects in a tabular dataset D. We assume there is
a total order among its attributes. For instance, let A = {c1, · · · , cn} be the set of
attributes, and assume c1 ≺ · · · ≺ cn is the total order. Thus, we can represent any
object x by a sequence

〈(c1, xc1), · · · , (cn, xcn)〉

where xci is the value of x in column ci. (We also use 〈xc1, · · · , xcn〉 to represent x
if no confusion arises.) We can then concatenate the objects in D into one
long sequence, which is a sequential representation of the tabular data.
After the conversion, pattern discovery on tabular datasets is no different from
pattern discovery in a sequential dataset. For instance, in the Yeast DNA micro-array,
we can use the following sequence to represent a pattern:
〈(CH1D, 0), (CH2B, 180), (CH2I, 205), (CH1I, 280)〉
In words, for genes that exhibit this pattern, their expression levels under conditions
CH2B, CH2I, and CH1I must be 180, 205, and 280 units higher, respectively, than
that under CH1D.
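The conversion from a table row to its sequential representation is straightforward. A minimal Python sketch, with column names taken from Table 5.1 and the total order assumed to be the listed order:

```python
# A fixed total order over the attributes: c1 < c2 < ... < cn.
COLUMNS = ["CH1I", "CH1B", "CH1D", "CH2I", "CH2B"]

def to_sequence(row):
    """Represent an object x as the sequence <(c1, x_c1), ..., (cn, x_cn)>."""
    return [(c, row[c]) for c in COLUMNS]

# Gene VPS8 from Table 5.1:
vps8 = {"CH1I": 401, "CH1B": 281, "CH1D": 120, "CH2I": 275, "CH2B": 298}
print(to_sequence(vps8))
# [('CH1I', 401), ('CH1B', 281), ('CH1D', 120), ('CH2I', 275), ('CH2B', 298)]
```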
5.2.2 Sequence-based Pattern Similarity
In this section, we propose a new distance measure that is capable of capturing sub-
space pattern similarity and is conducive to an efficient implementation.
Here we consider the shifting pattern of Figure 5.1(b) only, as scaling patterns are
equivalent to shifting patterns after a logarithmic transformation of the data.
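A quick numeric check of this equivalence (the values are illustrative):

```python
import math

x = [10.0, 20.0, 40.0]
y = [30.0, 60.0, 120.0]  # y = 3x on every dimension: a scaling pattern

# After a logarithmic transformation, the scaling factor 3 becomes a
# constant additive shift of log(3), i.e. a shifting pattern.
shifts = [math.log(b) - math.log(a) for a, b in zip(x, y)]
print(shifts)  # each entry is log(3), about 1.0986
```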
To tell whether two objects exhibit a shifting pattern in a given subspace S, the
simplest way is to normalize the two objects by subtracting x̄S from each of their
coordinate values xi (i ∈ S), where x̄S is the average coordinate value of x in subspace
S. This, however, requires us to compute and keep track of x̄S for every subspace S. As
there are as many as 2^|A| − 1 different ways of normalization, this makes the computation
of such a similarity model impractical for large datasets.
To find a distance function that admits an efficient implementation, we choose an
arbitrary dimension k ∈ S for normalization. We show that the choice of k has very
limited impact on the similarity measure.
More formally, given two objects x and y, a subspace S, and a dimension k ∈ S, we
define the sequence-based distance between x and y as follows:

dist_{k,S}(x, y) = max_{i ∈ S} |(x_i − y_i) − (x_k − y_k)|    (5.1)
Figure 5.2 demonstrates the intuition behind Eq (5.1). Let S = {k, a, b, c}. With
respect to dimension k, the distance between x and y in S is less than δ if the difference
between x and y on every dimension of S is within ∆ ± δ, where ∆ is the difference
between x and y on dimension k.
Figure 5.2: The meaning of dist_{k,S}(x, y) ≤ δ (the differences between objects x and
y on dimensions a, b, c all lie within ∆ ± δ, where ∆ is their difference on dimension k).
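A minimal Python sketch of Eq (5.1), with objects stored as dictionaries of coordinate values. The subspace and values below are illustrative, echoing Figure 5.1(b):

```python
def dist(x, y, S, k):
    """Sequence-based distance of Eq (5.1): the maximum, over dimensions
    i in subspace S, of |(x_i - y_i) - (x_k - y_k)|, where k in S is the
    base dimension chosen for normalization."""
    base = x[k] - y[k]
    return max(abs((x[i] - y[i]) - base) for i in S)

# Two objects forming a perfect shifting pattern in subspace S:
x = {"b": 30, "c": 50, "h": 20, "j": 60, "e": 40}
y = {"b": 45, "c": 65, "h": 35, "j": 75, "e": 55}  # y = x + 15 everywhere
S = ["b", "c", "h", "j", "e"]
print(dist(x, y, S, k="b"))  # 0: the shift is constant on every dimension
```

Note that the distance is 0 regardless of which base dimension k is chosen, since the shift is the same everywhere.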
Clearly, with a different choice of dimension k, we may obtain a different distance
between two objects. However, the difference is bounded by a factor of 2.

Property 1. For any two objects x, y and a subspace S, if ∃k ∈ S such that
dist_{k,S}(x, y) ≤ δ, then ∀j ∈ S, dist_{j,S}(x, y) ≤ 2δ.

Proof.

dist_{j,S}(x, y) = max_{i ∈ S} |(x_i − y_i) − (x_j − y_j)|
               ≤ max_{i ∈ S} |(x_i − y_i) − (x_k − y_k)| + |(x_j − y_j) − (x_k − y_k)|
               ≤ δ + δ = 2δ
Since δ is but a user-defined threshold, Property 1 shows that Eq (5.1)'s capability
of capturing pattern similarity does not depend on the choice of k, which can be an
arbitrary dimension in S. In fact, as long as we use a fixed dimension k
for any given subspace S, then, with a relaxed δ, we can always find the clusters
that Eq (5.1) would discover with a different choice of k. This gives us great
flexibility in defining and mining clusters based on subspace pattern similarity.
Problem Statement. Our task is to find subspace clusters of objects where the dis-
tance between two objects is measured by Eq (5.1). Since in Eq (5.1) any dimension
k is equally good at capturing subspace pattern similarity, we shall choose the one that
leads to the most efficient computation.
5.3 The Clustering Algorithm
We define the concept of pattern and then divide the pattern space into grids (Sec-
tion 5.3.1). We then construct a tree structure which provides a compact summary of
all of the frequent patterns in a data set (Section 5.3.2). We show that the tree struc-
ture enables us to efficiently find the number of occurrences of any specified pattern, or
equivalently, the density of any cell in the grid (Section 5.3.3). A density and grid based
clustering algorithm can then be applied to merge dense cells into clusters. Finally, we
introduce an Apriori-like method to find clusters in any subspace (Section 5.3.4).
5.3.1 Pattern and Pattern Grids
Let D be a dataset in a multidimensional space A. A pattern p is a tuple (T, δ), where
δ is a distance threshold and T is an ordered sequence of (column, value) pairs, that is,

T = 〈(t1, 0), (t2, v2), · · · , (tk, vk)〉

where ti ∈ A, and t1 ≺ · · · ≺ tk. Let S = {t1, · · · , tk}. An object x ∈ D exhibits
pattern p in subspace S if

v_i − δ ≤ x_{t_i} − x_{t_1} ≤ v_i + δ,   1 ≤ i ≤ k.   (5.2)
Apparently, if two objects x, y ∈ D are both instances of pattern p = (T, δ), then we
have

dist_{t_1,S}(x, y) ≤ 2δ.

In order to find clusters, we start with high density patterns: a pattern p = (T, δ) is
of high density if the number of objects that satisfy Eq (5.2) reaches a user-
defined density threshold.
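A direct check of Eq (5.2) can be sketched as follows. The pattern is the Yeast example of Section 5.2.1; the object values are hypothetical:

```python
def exhibits(x, pattern, delta):
    """True iff object x exhibits pattern p = (T, delta), i.e.
    v_i - delta <= x_{t_i} - x_{t_1} <= v_i + delta for every pair in T."""
    (t1, _), rest = pattern[0], pattern[1:]
    return all(v - delta <= x[t] - x[t1] <= v + delta for t, v in rest)

p = [("CH1D", 0), ("CH2B", 180), ("CH2I", 205), ("CH1I", 280)]
# A hypothetical gene whose offsets (181, 203, 278) are all within +/-5:
g = {"CH1D": 40, "CH2B": 221, "CH2I": 243, "CH1I": 318}
print(exhibits(g, p, delta=5))  # True
print(exhibits(g, p, delta=1))  # False: 243 - 40 = 203 misses 205 by 2
```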
Figure 5.3: Pattern grids for subspace {t1, t2, t3} (axes x_{t_2} − x_{t_1} and
x_{t_3} − x_{t_1}; dense cells are shaded).
We discretize the dataset so that patterns fall into grids. For any given subspace S ,
after we find the dense cells in S , we use a grid and density based clustering algorithm
to find the clusters (Figure 5.3).
The difficult part, however, lies in finding the dense cells efficiently for all
subspaces. The rest of this section deals with this issue.
5.3.2 The Counting Tree
The counting tree provides a compact summary of the dense patterns in a dataset. It is
motivated by the suffix trie, which, given a string, indexes all of its substrings. Here,
each record in the dataset is represented by a sequence, but sequences are different
from strings: we are interested in non-contiguous subsequence matches, while suffix
tries only handle contiguous substrings.
    c1  c2  c3  c4
x    4   3   0   2
y    3   4   1   3
z    1   2   3   1
Table 5.3: A dataset of 3 objects
Before we introduce the structure of the counting tree, we use an example to illus-
trate our purpose. Table 5.3 shows a dataset of 3 objects in a 4 dimensional space. We
start with the relevant subsequences of each object.
Definition 1. Relevant subsequences.
The relevant subsequences of an object x in an n-dimensional space are:

x^i = 〈x_{i+1} − x_i, · · · , x_n − x_i〉,   1 ≤ i < n
In relevant subsequence x^i, column ci is used as the base for comparison. Assum-
ing C is a cluster in a subspace S in which i is the minimal dimension, we shall search
for C in the dataset {x^i | x ∈ D}. In any such subspace S, we use ci as the base for com-
parison; in other words, ci serves as the dimension k in Eq (5.1). As an example, the
relevant subsequences of object z in Table 5.3 are:
      c1  c2  c3  c4
z^1        1   2   0
z^2            1  -1
z^3               -2
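Definition 1 is easy to implement; the following sketch reproduces the example above, with objects given as plain lists of values:

```python
def relevant_subsequences(x):
    """All relevant subsequences of an object x in n-dimensional space:
    the i-th subsequence is <x_{i+1} - x_i, ..., x_n - x_i>, 1 <= i < n."""
    n = len(x)
    return [[x[j] - x[i] for j in range(i + 1, n)] for i in range(n - 1)]

z = [1, 2, 3, 1]  # object z of Table 5.3
print(relevant_subsequences(z))  # [[1, 2, 0], [1, -1], [-2]]
```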
To create a counting tree for a dataset D, we insert, for each object z ∈ D, its
relevant subsequences into a tree structure. Also, assuming the insertion of a sequence,
say z^1, ends at node t in the tree (Figure 5.4), we increase the count associated with
node t by 1.
More often than not, we are interested in patterns of at least a given length,
say ξ ≥ 1. A relevant subsequence shorter than ξ cannot contain
patterns of length ξ. Thus, if ξ is known beforehand, we only need to insert x^i, where
1 ≤ i < n − ξ + 1, for each object x. Figure 5.4 shows the counting tree for the dataset
of Table 5.3 with ξ = 2.
Figure 5.4: The Counting Tree (each node is labeled with a triple [ID`, IDa, Count];
t and s denote two tree nodes referenced in the text).
In the second step, we label each tree node t with a triple (ID`, IDa, Count).
The first element of the triple, ID`, uniquely identifies node t, and the second element,
IDa, is the largest ID` among t's descendant nodes. The IDs are assigned by a depth-first
traversal of the tree structure, during which we assign sequential numbers (starting
from 0, which is assigned to the root node) to the nodes as they are encountered one
by one. If t is a leaf node, then the third element of the triple, Count, is the number
of objects in t's object set; otherwise, it is the sum of the counts of its child nodes.
Apparently, we can label a tree with a single depth-first traversal. Figure 5.4 shows a
labeled tree for the sample dataset.
To count pattern occurrences using the tree structure, we introduce counting lists.
For each column pair (ci, cj), i < j, and each possible value v = xj − xi (after data
discretization), we create a counting list (ci, cj, v). The counting lists are also con-
structed during the depth-first traversal. Suppose during the traversal we encounter
node t, which represents sequence element xj − xi = v. Assuming t is to be la-
beled (ID`, IDa, cnt), and the last element of counting list (ci, cj, v) is ( , , cnt′), we
append a new element (ID`, IDa, cnt + cnt′) to the list. (If list (ci, cj, v) is empty, we
make (ID`, IDa, cnt) its first element.)
link head       list of node labels
· · ·           · · ·
(c1, c3, −4) ⇒ [3, 4, 1]
(c1, c4, −2) ⇒ [4, 4, 1]
(c1, c4, 0)  ⇒ [7, 7, 1], [9, 9, 2]
(c2, c4, −1) ⇒ [12, 12, 2], [14, 14, 3]
· · ·           · · ·
Above is a part of the counting lists for the tree structure in Figure 5.4. For instance,
list (c2, c4, −1) contains two nodes, which are created during the insertion of x^2 and
z^2 (relevant subsequences of x and z in Table 5.3). The two nodes represent elements
x4 − x2 = −1 and z4 − z2 = −1 in sequences x^2 and z^2 respectively. We summarize
the process of building the counting tree in Algorithm 3.
Thus, our counting tree is composed of two structures, the tree and the counting
lists. We observe the following properties of the counting tree:
1. For any two nodes x and y labeled (ID`x, IDax, Countx) and (ID`y, IDay, County)
respectively, node y is a descendant of node x if ID`y ∈ [ID`x, IDax].
2. Each node appears once and only once in the counting lists.
3. Nodes in any counting list are in ascending order of their ID`.
The proof of the above properties is straightforward and we omit it here. These
properties are essential to finding the dense patterns efficiently (Section 5.3.3).
5.3.3 Counting Pattern Occurrences
We describe SeqClus, an efficient algorithm for finding the occurrence number of a
specified pattern using the counting tree structure introduced above.
Each node s in the counting tree represents a pattern p, which is embodied by the
path leading from the root node to s. For instance, the node s in Figure 5.4 represents
pattern 〈(c1, 0), (c2, 1)〉.
How do we find the number of occurrences of a pattern p′ that is one element longer
than p? That is,

p′ = 〈(ci, vi), · · · , (cj, vj), (ck, v)〉

where the prefix 〈(ci, vi), · · · , (cj, vj)〉 is p.
The counting tree structure makes this operation very easy. First, we only need to
look for nodes in counting list (ci, ck, v), since all nodes of xk − xi = v are in that
list. Second, we are only interested in nodes that are under node s, because only those
Algorithm 3 Build the Counting Tree
Input: D: a dataset in multidimensional space A
       ξ: minimal pattern length (dimensionality)
Output: F: a counting tree
 1: F ← empty tree;
 2: for all objects x ∈ D do
 3:   i ← 1;
 4:   while i < |A| − ξ + 1 do
 5:     insert x^i into F;
 6:     i ← i + 1;
 7:   end while
 8: end for
 9: make a depth-first traversal of F;
10: for each node s encountered in the traversal do
11:   let s represent sequence element xj − xi = v;
12:   label node s by [id`s, idas, count];
13:   lcnt ← count of the last element in list (ci, cj, v), or 0 if (ci, cj, v) is empty;
14:   append [id`s, idas, count + lcnt] to list (ci, cj, v);
15: end for
nodes satisfy pattern p, a prefix of p′. Assuming s is labeled (ID`s, IDas, count), we
know s's descendant nodes are in the range [ID`s, IDas]. By the counting tree
properties, elements in any counting list are in ascending order of their ID` values,
which means we can binary-search the list. Finally, assume list (ci, ck, v) contains the
following nodes:

· · · , ( , , cntu), (id`v, idav, cntv), · · · , (id`w, idaw, cntw), · · ·

where the nodes from (id`v, idav, cntv) through (id`w, idaw, cntw) are those falling
within [ID`s, IDas]. Then we know there are altogether cntw − cntu objects that satisfy
pattern p′ (or just cntw objects if (id`v, idav, cntv) is the first element of the list).
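Because the counts stored in a counting list are cumulative, the range lookup amounts to two binary searches. A sketch using Python's bisect module; list entries are (ID`, IDa, cumulative count), and the helper name is ours:

```python
import bisect

def count_in_range(counting_list, lo, hi):
    """Number of objects whose nodes fall inside the ID range [lo, hi],
    given a counting list sorted by ID_l that stores cumulative counts."""
    ids = [idl for idl, ida, cnt in counting_list]
    left = bisect.bisect_left(ids, lo)    # first node with ID_l >= lo
    right = bisect.bisect_right(ids, hi)  # one past the last with ID_l <= hi
    if left == right:                     # no node falls in the range
        return 0
    prev = counting_list[left - 1][2] if left > 0 else 0
    return counting_list[right - 1][2] - prev

# Counting list (c2, c4, -1) from the example: cumulative counts 2 and 3.
lst = [(12, 12, 2), (14, 14, 3)]
print(count_in_range(lst, 10, 14))  # 3: all objects
print(count_in_range(lst, 13, 14))  # 1: only the node inserted for z
```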
We denote the above process by count(r, ck, v), where r is a range, and in this case
r = [ID`s, IDas]. If, however, we are looking for patterns even longer than p′, then
instead of returning cntw − cntu, we shall continue the search. Let L denote the list of
the sub-ranges represented by the nodes within range [ID`s, IDas] in list (ci, ck, v), that
is,

L = {[id`v, idav], · · · , [id`w, idaw]}

Then, we repeat the above process for each range in L, and the final count comes to

Σ_{r ∈ L} count(r, c, v)

where (c, v) is the next element following p′.
We summarize the counting process described above in Algorithm 4.
5.3.4 Clustering
The counting algorithm in Section 5.3.3 finds the number of occurrences of a specified
pattern, or the density of the cells in the pattern grids of a given subspace (Figure 5.3).
We can then use a density and grid based clustering algorithm to group the dense cells
together.
We start with patterns containing only two columns (in a 2-dimensional subspace),
and grow the patterns by adding new columns to them. During this process, pat-
terns that correspond to fewer than minRows objects are pruned, as introducing new
columns into a pattern can only reduce the number of supporting objects.
Figure 5.5 shows a tree structure for growing the clusters. Each node t in the tree is
a triple (item, count, range-list). The items in the nodes along the path from the root
node to node t constitute the pattern represented by t. For instance, the node in the
3rd level in Figure 5.5 represents 〈(c0, 0), (c1, 0), (c2, 0)〉, a pattern in a 3-dimensional
space. The value count in the triple represents the number of occurrences of the pattern
Algorithm 4 Algorithm count()
Input: Q: a query pattern on dataset D
       F: the counting tree of D
Output: number of occurrences of Q in D
 1: assume Q = 〈(q1, 0), (q2, v2), · · · , (qj, vj), · · · 〉;
 2: (r, cnt) ← count(Universe, q1, 0)
 3: return countPattern(r, 2)

 4: Function countPattern(r, j)
 5:   the jth element of Q is (qj, vj)
 6:   (L, cnt) ← count(r, qj, vj)
 7:   if j = |Q| then
 8:     return cnt
 9:   else
10:     return Σ_{r′ ∈ L} countPattern(r′, j + 1)
11:   end if

12: Function count(r, c, v)
13:   cl ← the counting list for (q1, c, v)
14:   perform range query r on cl, and
15:   assume cl contains the following elements:
16:   · · · , ( , , cnt′), (id`j, idaj, cntj), · · · , (id`k, idak, cntk), · · ·
      where the run from (id`j, ·, ·) through (id`k, ·, ·) lies within r
17:   return (L, cnt) where:
18:     cnt = cntk − cnt′
19:     L = {[id`j, idaj], · · · , [id`k, idak]}
in the dataset, and range-list is the list of ranges of the IDs of those objects. Both count
and range-list are computed by the count() routine in Algorithm 4.
First of all, we count the occurrences of all patterns containing 2 columns, and
insert them under the root node if they are frequent (count ≥ minRows). Note there is
no need to consider all the columns: for any ci − cj = v to be the first item in a pattern
Figure 5.5: The Cluster Tree (nodes are triples of the form (Ci − Cj = v, cnt, L);
levels 1 to 3 shown, with nodes at one level joined to derive the nodes at the next level).
with at least minCols columns, ci must be less than cn−minCols+1 and cj must be less
than cn−minCols.
In the second step, for each node p on the current level, we join p with its eligible
nodes to derive the nodes on the next level. A node q is an eligible node of p if it satisfies
the following criteria:
• q is on the same level as p;
• if p denotes item a − b = v and q denotes c − d = v′, then a ≺ c and b = d.
Besides p's eligible nodes, we also join p with items of the form cn−minCols+k − b = v,
since column cn−minCols+k does not appear in levels less than k.
The join operation is easy to perform. Assume p, represented by the triple
(a − b = v, count, range-list), is to be joined with item c − b = v′. We simply compute
count(r, c, v′) for each range r in range-list. If the sum of the returned counts is larger
Algorithm 5 Clustering Algorithm
Input: minCols: dimensionality threshold
       minRows: cluster size threshold
       F: tree structure for D
Output: clusters of objects in D
 1: T ← create root node of tree
 2: Queue ← ∅
 3: for i = 1 to |A| − minCols do
 4:   (cnt, L) ← count(NULL, ci, 0)
 5:   if cnt ≥ minRows then
 6:     insert (ci, 0, cnt, L) under T and into Queue
 7:   end if
 8: end for
 9: while Queue ≠ ∅ do
10:   remove the 1st element x from Queue
11:   assume x = (ci, v, cnt, L)
12:   for each node y = (cj, v′, cnt′, L′) eligible for joining with x do
13:     (cnt′′, L′′) ← count(L, cj, v′)
14:     if cnt′′ ≥ minRows then
15:       insert (cj, v′, cnt′′, L′′) under x and into Queue
16:     end if
17:   end for
18: end while
19: for each leaf node x of the tree do
20:   assume x = (ci, v, cnt, L)
21:   columns ← path from root to x
22:   objects ← findAll(L)
23:   return cluster {columns, objects}
24: end for
than minRows, then we insert a new node (c − b = v′, count′, range-list′) under p,
where count′ is the sum of the returned counts, and range-list′ is the union of all the
ranges returned by count(). Algorithm 5 summarizes the clustering process described
above.
5.4 Experiments
We implemented the algorithms in C on a Linux machine with a 700 MHz CPU and
256 MB of main memory, and tested them on both synthetic and real life data sets.
5.4.1 Data Sets
We generate synthetic datasets in tabular and sequential forms. For real life datasets,
we use time-stamped event sequences generated by a production network (sequential
data), and DNA micro-arrays of yeast and mouse gene expressions under various con-
ditions (tabular data).
Synthetic Data We generate synthetic data sets in tabular form. Initially, the table
is filled with random values ranging from 0 to 300, and then we embed a fixed number
of clusters in the raw data. The embedded clusters can have varying quality: we
embed perfect clusters in the matrix, i.e., clusters in which the distance between any
two objects is 0 (δ = 0), as well as clusters whose distance threshold among the objects
is δ = 2, 4, 6, · · · . We also generate synthetic sequential datasets
in the form of · · · (id, timestamp) · · · , where instead of embedding clusters, we sim-
ply model the sequences by probabilistic distributions. Here, the ids are randomly
generated; however, the occurrence rate of different ids follows either a uniform or a
Zipf distribution. We generate ascending timestamps in such a way that the number
of elements in a unit window follows either a uniform or a Poisson distribution.
Gene Expression Data Gene expression data are presented as a matrix. The yeast
micro-array [THC+00] can be converted to a weighted sequence of 49,028 elements
(2,884 genes under 17 conditions). The expression levels of the yeast genes (after
transformation) range from 0 to 600, and they are discretized into 40 bins. The mouse
cDNA array is 535,766 elements in size (10,934 genes under 49 conditions) and is pre-
processed in the same way.
Event Management Data The data sets we use are taken from a production com-
puter network at a financial service company. NETVIEW [PWMH01] has six at-
tributes: Timestamp, EventType, Host, Severity, Interestingness, and DayOfWeek.
We are concerned with the attributes Timestamp and EventType, of which EventType
has 241 distinct values. TEC [PWMH01] has attributes Timestamp, EventType, Source,
Severity, Host, and DayOfYear. In TEC, there are 75 distinct values of EventType and
16 distinct types of Source. It is often interesting to differentiate events of the same type
from different sources, and this is realized by combining EventType and Source to produce
75 × 16 = 1200 symbols.
5.4.2 Performance Analysis
We evaluate the scalability of the clustering algorithm on synthetic tabular datasets and
compare it with pCluster [WWYY02]. The number of objects in the dataset increases
from 1,000 to 100,000, and the number of columns from 20 to 120. The results pre-
sented in Figure 5.6 are average response times obtained from a set of 10 synthetic
datasets.
Data sets used for Figure 5.6(a) are generated with the number of columns fixed at 30.
We embed a total of 10 perfect clusters (δ = 0) in the data. The minimal number of
columns of the embedded clusters is 6, and the minimal number of rows is set to 0.01N,
Figure 5.6: Performance Study: scalability. (a) Scalability with the # of rows in data
sets (pCluster vs. SeqClus, # of columns = 30). (b) Scalability with the # of columns
in data sets (SeqClus at 30K and 3K rows, pCluster at 3K rows).
where N is the number of rows of the synthetic data.
The pCluster algorithm is invoked with minCols = 5, minRows = 0.01N, and δ = 3,
and the SeqClus algorithm is invoked with δ = 3. Figure 5.6(a) shows an almost
linear relationship between the running time and the data size for the SeqClus al-
gorithm. The pCluster algorithm, on the other hand, is not scalable: it can only
handle datasets with sizes in the range of thousands.
For Figure 5.6(b), we increase the dimensionality of the synthetic datasets from 20
to 120. Each embedded cluster is in a subspace whose dimensionality is at least 0.02C,
where C is the number of columns of the data set. The pCluster algorithm is invoked
with δ = 3, minCols = 0.02C, and minRows = 30. The curve of SeqClus exhibits
quadratic behavior. It also shows that, with increasing dimensionality, SeqClus
can handle datasets roughly an order of magnitude larger than pCluster (30K
vs. 3K); we were unable to obtain performance results for pCluster on datasets of 30K
objects.
Figure 5.7: Time vs. distance threshold δ (pCluster at 3K rows, SeqClus at 30K rows).
Next we study the impact of the quality of the embedded clusters on the perfor-
mance of the clustering algorithms. We generate synthetic datasets containing 3K/30K
objects and 30 columns, with 30 embedded clusters (each containing 30 objects on
average, in subspaces whose dimensionality is 8 on average). Within each
cluster, the maximum distance (under the pCluster model) between any two objects
ranges from δ = 2 to δ = 6. Figure 5.7 shows that, while the performance of the
pCluster algorithm degrades as δ increases, the SeqClus algorithm is more ro-
bust. The reason is that much of the computation of SeqClus is
performed on the counting tree, which provides a compact summary of the dense pat-
terns in the dataset, while for pCluster, a higher δ value has a direct, negative impact
on its pruning effect [WWYY02].
We also study clustering performance on timestamped sequential datasets. The
dataset in use is in the form of · · · (id, timestamp) · · · , where every minute contains
on average 10 ids (uniform distribution). We place a sliding window of size 1 minute
Figure 5.8: Scalability on sequential dataset (SeqClus, window size = 1 min; the x axis
is the number of sequence elements ×1000, the y axis the average response time in
seconds).
on the sequence, and create a counting tree for the subsequences inside the windows.
The scalability result is shown in Figure 5.8. We also tried different distributions of id
and timestamp, but did not observe significant differences in performance.
5.4.3 Cluster Analysis
We report the clusters found in real life datasets. Table 5.4 shows the number of
clusters found by the pCluster and SeqClus algorithms in the raw Yeast micro-array
dataset.
δ   minCols   minRows   # of clusters
                        pCluster   SeqClus
0      9         30           5         5
0      7         50          11        13
0      5         30        9370     11537
Table 5.4: Clusters found in the Yeast dataset
For minCols= 9 and minRows= 30, the two algorithms found the same clusters.
But in general, using the same parameters, SeqClus produces more clusters. This is
76
because the similarity measure used in the pCluster model is more restrictive. We
find that the objects (genes) in those clusters overlooked by the pCluster algorithm but
discovered by the SeqClus method exhibit easily perceptible coherent patterns. For instance, the genes in Figure 5.9 show a coherent pattern in the specified subspace, and this subspace cluster is discovered by SeqClus but not by pCluster. This indicates that relaxing the similarity model not only improves performance but also provides extra insight for understanding the data.
[Plot omitted: expression levels versus conditions for the genes in the cluster.]
Figure 5.9: A cluster in subspace {2,3,4,5,7,8,10,11,12,13,14,15,16}.
The SeqClus algorithm works directly on both tabular and sequential datasets. Ta-
ble 5.5 shows event sequence clusters found in the NETVIEW dataset [PWMH01].
We apply the algorithm on 10 days’ worth of event logs (around 41M bytes) of the
production computer network.
δ        # events    # sequences    SeqClus
2 sec    10          500            31
4 sec    8           400            143
6 sec    6           300            2276
Table 5.5: Clusters found in NETVIEW
5.5 Related Work and Discussion
The study of clustering based on pattern similarity is related to previous work on sub-
space clustering. Many recent studies [APW+00, AY00, AGR98, CFZ99, JMN99]
focus on mining subspace clusters embedded in high-dimensional spaces.
Still, strong correlations may exist among a set of objects even if they are far apart
from each other as measured by distance functions (such as Euclidean) used frequently
in traditional clustering algorithms. Many scientific projects collect data in the form
of Figure 5.1, and it is essential to identify clusters of objects that manifest coherent
patterns. A variety of applications, including DNA microarray analysis and e-commerce collaborative filtering, will benefit from fast algorithms that can capture such patterns.
Cheng et al. [CC00] proposed the bicluster model, which captures the coherence of
genes and conditions in a sub-matrix of a DNA micro-array.
In this paper, we show that clustering by pattern similarity is closely related to the
problem of subsequence matching. There has been much research on string indexing
and substring matching. For instance, a suffix tree [McC76] is a very useful data struc-
ture that embodies a compact index to all the distinct, non-empty substrings of a given
string. Suffix arrays [MM93] and PAT-arrays [GBYS92] also provide fast searches on
text databases. Similarity based subsequence matching [FRM94, PWZP00] has been
a research focus for applications such as time series databases.
Clustering by pattern similarity is an interesting and challenging problem. The
computational complexity problem of subspace clustering is further aggravated by the
fact that we are concerned with patterns of rise and fall instead of value similarity. The
task of clustering by pattern similarity can be converted into a traditional subspace clustering problem by (i) creating a new dimension ij for every two dimensions i and j of any object x, and setting xij, the value of the new dimension, to xi − xj; or (ii) creating |A| copies of the original dataset (A is the entire dimension set), where xk, the value of x on the kth dimension in the ith copy, is changed to xk − xi, for k ∈ A.
For both cases, we need to find subspace clusters in the transformed dataset, which
is |A| times larger. These methods are apparently not feasible for datasets in high
dimensional spaces. They also cannot be applied to sequential datasets, for instance,
in event management systems where millions of timestamped events are generated on
a daily basis. In this paper, we introduced a sequence based similarity measure to
model pattern similarity. We proposed an efficient implementation, the counting tree,
which is based on the suffix tree structure. Experimental results show that the SeqClus
algorithm achieves an order of magnitude speedup over the current best algorithm
pCluster. The new model also enables us to identify clusters overlooked by previous
methods such as the pCluster model. Furthermore, the sequence model applies naturally and directly to sequential data.
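Transformation (i) above can be sketched in a few lines; the function name, the `objects` dict, and the pair-keyed derived dimensions are ours, for illustration only:

```python
from itertools import combinations

def pairwise_difference_transform(objects):
    """Transformation (i): for every pair of dimensions (i, j), create a
    derived dimension whose value is x_i - x_j.  The transformed dataset
    has |A|*(|A|-1)/2 dimensions, which is why this is infeasible for
    high-dimensional datasets."""
    transformed = {}
    for obj_id, x in objects.items():
        transformed[obj_id] = {
            (i, j): x[i] - x[j] for i, j in combinations(range(len(x)), 2)
        }
    return transformed

# Two objects that rise and fall coherently get identical derived values,
# so a value-based subspace clustering algorithm could group them.
data = {"g1": [3, 5, 4], "g2": [10, 12, 11]}
t = pairwise_difference_transform(data)
assert t["g1"] == t["g2"] == {(0, 1): -2, (0, 2): -1, (1, 2): 1}
```

The sketch also makes the cost concrete: the derived dimensionality grows quadratically in |A|, which is the blow-up the text cites as the reason these conversions are not feasible.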
CHAPTER 6
Mining Quality
In this chapter, we introduce a general framework to improve mining quality by exploiting local data dependency. The technique builds on Markov network modeling and belief propagation.
Our work on continuous stream mining can be viewed as mining with local tem-
poral constraints, and subspace pattern based clustering can be viewed as mining with
local spatial constraints. They are special cases with hidden local temporal/spatial re-
lationships. The work in this chapter addresses local data dependencies in a general
form.
6.1 Introduction
The usefulness of knowledge models produced by data mining methods critically de-
pends on two issues. (1) Data quality: Data mining tasks expect to have accurate and
complete input data. But, the reality is that in many situations, data is contaminated,
or is incomplete due to limited bandwidth for acquisition. (2) Model adequacy: Many
data mining methods, for reasons of efficiency or by design limitation, use a model incapable of capturing the rich relationships embedded in data. The mining results from an inadequate data model generally need to be improved.
Fortunately, a wide spectrum of applications exhibit strong dependencies between
data samples. For example, the readings of nearby sensors are correlated, and proteins
interact with each other when performing crucial functions. Data dependency has not
received sufficient attention in data mining research yet, but it can be exploited to
remedy the problems mentioned above. We study this in several typical scenarios.
Low Data Quality Issue Many data mining methods are not designed to deal with
noise or missing values; they take the data “as is” and simply deliver the best results
obtainable by mining such imperfect data. In order to get more useful mining results,
contaminated data needs to be cleaned, and missing values need to be inferred.
Data Contamination An example of data contamination is encountered in optical
character recognition (OCR), a technique that translates pictures of characters into
a machine readable encoding scheme. Current OCR algorithms often translate two
adjacent letters “ ff ” into a “# ” sign, or incur similar systematic errors.
In the OCR problem, the objective is not to ignore or discard noisy input, but
to identify and correct the errors. This is doable because the errors are introduced
according to certain patterns. The error patterns in OCR may be related to the shape
of individual characters, the adjacency of characters, or illumination and positions. It
is thus possible to correct a substantial number of errors with the aid of neighboring
characters.
Data Incompleteness A typical scenario where data is incomplete is found in
sensor networks where probing has to be minimized due to power restrictions, and
thus data is incomplete or only partially up-to-date. Many queries ask for the mini-
mum/maximum values among all sensor readings. For that, we need a cost-efficient
way to infer such extrema while probing the sensors as little as possible.
The problem here is related to filling in missing attributes in data cleansing [GNV96].
The latter basically learns a predictive model using available data, then uses that model
to predict the missing values. The model training there does not consider data correla-
tion. In the sensor problem, however, we can leverage the neighborhood relationship,
as sensor readings are correlated if the sensors are geographically close. Even knowl-
edge of far-away sensors helps, because that knowledge can be propagated via sensors
deployed in between. By exploiting sensor correlation, unprobed sensors can be accu-
rately inferred, and thus data quality can be improved.
Inadequate Data Model Issue Many well known mining tools are inadequate to
model complex data relationships. For example, most classification algorithms, such
as Naive Bayes and Decision Trees, approximate the posterior probability of hidden
variables (usually class labels) by examining individual data features. Such models fail to capture the strong dependencies or interactions among data samples.
Take protein function prediction as a concrete classification example. Proteins
are known to interact with some others to perform functions, and these interactions
connect genes to form a graph structure. If one chooses Naive Bayes or Decision Trees to predict unknown protein functions, one is basically confined to a tabular data model, and thus loses rich information about interactions.
Markov networks, as a type of descriptive model, provide a convenient represen-
tation for structuring complex relationships, and thus a solution for handling proba-
bilistic data dependency. In addition, efficient techniques are available to do inference
on Markov networks, including the powerful Belief Propagation [YFW00] algorithm.
The power in modeling data dependency, together with the availability of efficient
inference tools, makes Markov networks very useful data models. They have the po-
tential to enhance mining results obtained from data whose data dependencies are un-
derused.
Our Contribution The primary contribution of this chapter is a unified approach to improving mining quality by systematically considering data dependency in data mining. We adopt Markov networks as the data model, and use belief propagation to efficiently compute the marginal or maximum posterior probability, so as to clean the data, to infer missing values, or to generally improve the mining results of a model that ignores data dependency. This chapter may also contribute to data mining practice through our investigations of several real-life applications: by exploiting data dependency in these applications, clear improvements have been achieved in data quality and in the usefulness of mining results.
Outline We describe Markov networks in the next section, including pairwise Markov networks, a special form of Markov network that not only models local dependency well but also allows very efficient computation by belief propagation. We then address the three above-mentioned examples in Sections 6.3, 6.4 and 6.5, and conclude with related work and discussion in Section 6.6.
6.2 Markov Networks
Markov networks have been successfully applied to many problems in different fields,
such as artificial intelligence [Pea88], image analysis [SS94], turbo decoding [MMC98]
and condensed matter physics [AM01]. They also have the potential to become very useful tools for data mining.
Figure 6.1: Example of a pairwise Markov network. In (a), the white circles denote the random variables, and the shaded circles denote the external evidence. In (b), the potential functions φ() and ψ() are shown.
6.2.1 Graphical Representation
The Markov network is naturally represented as an undirected graph G = (V, E), where V is the vertex set in one-to-one correspondence with the set of random variables X = {xi} to be modeled, and E is the set of undirected edges, defining the neighborhood relationships among variables, that is, their local statistical dependencies. These local dependencies imply that the joint probability distribution over the whole graph can be factored into a product of local functions on cliques of the graph. A clique, denoted XC, is a completely connected subgraph (including singletons). This factorization is in fact the most favorable property of Markov networks.
Let C be a set of vertex indices of a clique, and let 𝒞 be the set of all such C. A potential function ψXC(xC) is a function on the possible realizations xC of the clique XC. Potential functions can be interpreted as "constraints" among the vertices of a clique: they favor certain local configurations by assigning them larger values.
The joint probability of a graph configuration p({x}) can then be factored into

    P({x}) = (1/Z) ∏_{C∈𝒞} ψXC(xC)        (6.1)

where Z is a normalizing constant:

    Z = ∑_{{x}} ∏_{C∈𝒞} ψXC(xC)
6.2.2 Pairwise Markov Networks
Computing joint probabilities on cliques reduces computational complexity, but the computation may still be difficult when cliques are large. In a category of problems where our interest involves only pairwise relationships among the samples, we can use pairwise Markov networks. A pairwise Markov network defines potential functions only on pairs of nodes that are connected by an edge.
In practical problems, we may observe some quantities of the underlying random variables {xi}, denoted {yi}. The {yi} are often called the evidence of the random variables. In the OCR example discussed in Section 6.1, for instance, the underlying segments of text are the variables, while the segments of the noisy text we observe are the evidence. This observed external evidence is used to make inferences about the values of the underlying variables. The statistical dependency between xi and yi is written as a joint compatibility function φi(xi, yi), which can be interpreted as the "external potential" from the external field.
A second type of potential function is defined between neighboring random variables: the compatibility function ψij(xi, xj) captures the "internal binding" between two neighboring nodes i and j. An example of a pairwise Markov network is illustrated in Figure 6.1(a), where the white circles denote the random variables and the shaded circles denote the evidence. Figure 6.1(b) shows the potential functions φ() and ψ().
Using the pairwise potentials defined above and incorporating the external evidence, the overall joint probability of a graph configuration in Eq. (6.1) is approximated by

    P({x}, {y}) = (1/Z) ∏_{(i,j)} ψij(xi, xj) ∏_i φi(xi, yi)        (6.2)

where Z is a normalization factor and the product over (i, j) runs over all pairs of connected neighbors.
6.2.3 Solving Markov Networks
Solving a Markov network involves two phases:
• The learning phase, a phase that builds up the graph structure of the Markov
network, and learns the two types of potential functions, φ()’s and ψ()’s, from
the training data.
• The inference phase, a phase that estimates the marginal posterior probabilities
or the local maximum posterior probabilities for each random variable, such that
the joint posterior probability is maximized.
In general, learning is an application-dependent process of statistics collection: the specific application defines the random variables, the neighborhood relationships, and hence the potential functions. We will look at the learning phase in detail with concrete applications in Sections 6.3-6.5.
The inference phase can be solved using a number of methods: simulated anneal-
ing [KGV83], mean-field annealing [PA87], Markov Chain Monte Carlo [GRS95], etc.
These methods either take an unacceptably long time to converge, or make oversimplified assumptions such as total independence between variables. We choose the belief propagation method, whose computational complexity is proportional to the number of nodes in the network, which assumes only local dependencies, and which has proved effective on a broad range of Markov networks.
Figure 6.2: Message passing in a Markov network. Messages are defined by Eqs.(6.3)
or (6.4) under two types of rules, respectively.
6.2.4 Inference by Belief Propagation
Belief propagation (BP) is a powerful inference tool on Markov networks. It was pi-
oneered by Judea Pearl [Pea88] in belief networks without loops. For Markov chains
and Markov networks without loops, BP is an exact inference method. Even for loopy
networks, BP has been successfully used in a wide range of applications [MMC98, MWJ99].
We give a short description of BP in this subsection.
The BP algorithm iteratively propagates “messages” in the network. Messages are
passed between neighboring nodes only, ensuring the local constraints, as shown in
Figure 6.2. The message from node i to node j is denoted as mij(xj), which intuitively
tells how likely node i thinks that node j is in state xj . The message mij(xj) is a vector
of the same dimensionality as xj .
There are two types of message passing rules:
• The SUM-product rule, which computes the marginal posterior probabilities.
• The MAX-product rule, which computes the maximum a posteriori probabilities.
For discrete variables, messages are updated using the SUM-product rule:

    m^{t+1}_ij(xj) = ∑_{xi} φi(xi, yi) ψij(xi, xj) ∏_{k∈N(i), k≠j} m^t_ki(xi)        (6.3)

or the MAX-product rule:

    m^{t+1}_ij(xj) = max_{xi} φi(xi, yi) ψij(xi, xj) ∏_{k∈N(i), k≠j} m^t_ki(xi)        (6.4)
where m^t_ki(xi) is the message computed in the previous iteration of BP, and k runs over all neighbors of node i except node j.
BP is an iterative algorithm. When the messages converge, the final belief b(xi) is computed. With the SUM-product rule, bi(xi) approximates the marginal probability p(xi), and is proportional to the product of the local compatibility at node i, φi(xi, yi), and the messages coming from all neighbors of node i:

    bi(xi)^SUM ∝ φi(xi, yi) ∏_{j∈N(i)} mji(xi)        (6.5)
where N(i) denotes the set of neighbors of node i.
If the MAX-product rule is used instead, b(xi) approximates the maximum a posteriori probability:

    bi(xi)^MAX = arg max_{xi} φi(xi, yi) ∏_{j∈N(i)} mji(xi)        (6.6)
For more theoretical details of the belief propagation and its generalization, we
refer the reader to [YFW00].
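A minimal sketch of the SUM-product updates (Eq. 6.3) and beliefs (Eq. 6.5) on a small discrete network may make the message passing concrete. All function and variable names are ours, and the three-node chain at the end is a toy example, not one of the networks studied later in this chapter:

```python
import numpy as np

def sum_product_bp(phi, psi, n_iters=20):
    """Sum-product BP (Eqs. 6.3 and 6.5) on a discrete pairwise Markov
    network.  phi[i] is the evidence vector phi_i(x_i, y_i) for node i;
    psi[(i, j)] is the compatibility matrix, rows indexed by x_i and
    columns by x_j.  Each dict key (i, j) defines an undirected edge."""
    neighbors = {i: [] for i in phi}
    for i, j in psi:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # initialize every directed message to a uniform distribution
    msgs = {}
    for i, j in psi:
        msgs[(i, j)] = np.full(len(phi[j]), 1.0 / len(phi[j]))
        msgs[(j, i)] = np.full(len(phi[i]), 1.0 / len(phi[i]))
    for _ in range(n_iters):
        new = {}
        for i, j in msgs:
            psi_ij = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            prod = phi[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod *= msgs[(k, i)]     # incoming messages m_ki
            m = psi_ij.T @ prod              # Eq. (6.3): sum over x_i
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in phi:
        b = phi[i].copy()
        for j in neighbors[i]:
            b *= msgs[(j, i)]                # Eq. (6.5)
        beliefs[i] = b / b.sum()
    return beliefs

# Toy example: a 3-node chain 0-1-2 with smoothing potentials.  Evidence
# at node 0 favors state 0; after BP, node 2 also leans toward state 0.
smooth = np.array([[0.9, 0.1], [0.1, 0.9]])
phi = {0: np.array([0.9, 0.1]), 1: np.array([0.5, 0.5]),
       2: np.array([0.5, 0.5])}
beliefs = sum_product_bp(phi, {(0, 1): smooth, (1, 2): smooth})
assert beliefs[2][0] > 0.5
```

The synchronous update (all new messages computed from the previous iteration's messages) is one of several scheduling choices; on loop-free networks such as this chain, the result is exact.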
6.3 Application I: Cost-Efficient Sensor Probing
In sensor networks, minimizing communication is among the key research issues. The challenging problem is how to probe only a small number of sensors, yet effectively
Figure 6.3: Sensor site map in the states of Washington and Oregon.
infer the unprobed sensors from the known ones. Cost-efficient sensor probing represents a category of problems where complete data is not available and must be compensated for by inference.
Our approach here is to model a sensor network with a pairwise Markov network,
and use BP to do inference. Each sensor is represented by a random variable in the
Markov network. Sensor neighborhood relationships are determined by spatial posi-
tions. For example, one can specify a distance threshold so that sensors within the
range are neighbors. Neighbors are connected by edges in the network.
In the rest of this section, we study a rainfall sensornet distributed over Washington
and Oregon [oW]. The sensor recordings were collected during 1949-1994. We use
167 sensor stations which have complete recordings during that period. The sensor
site map is shown in Figure 6.3.
6.3.1 Problem Description and Data Representation
The sensor recordings were collected in past decades over two states along the Pacific
Northwest. Since rain is a seasonal phenomenon, we split the data by week and build a
Markov network for each week.
We need to design the potential functions φi(xi, yi) and ψij(xi, xj) in Eq. (6.2) in
order to use belief propagation. One can use a Gaussian or its variants to compute the potential functions. But in the sensornet we study, we find that the sensor readings are dominated by zeroes, while the non-zero values span a wide range. Clearly a Gaussian is not a good choice for modeling such skewed data; neither are Gaussian mixtures, due to the limited data. Instead, we prefer to use discretized sensor readings in the computation. Our discretization scheme is given in Section 6.3.3.
The φ() functions should tell how likely we are to observe a reading yi for a given sensor state xi. It is natural to use the likelihood function:
φi(xi, yi) = P(yi|xi) (6.7)
The ψ() functions specify the dependence of sensor xj's reading on its neighbor xi:

    ψij(xi, xj) = P(xj|xi)        (6.8)
6.3.2 Problem Formulation
We give a theoretical analysis of the problem here. As we will see shortly, the problem fits well into maximum a posteriori (MAP) estimation on a Markov network, solvable by belief propagation.
Objective: MAP

Let X be the collection of all underlying sensor readings, and Y the collection of all probed readings. Using Bayes' rule, the joint posterior probability of X given Y is:

    P(X|Y) = P(Y|X) P(X) / P(Y)        (6.9)
Since P (Y ) is a constant over all possible X , we can simplify this problem of
maximizing the posterior probability to that of maximizing the joint probability

    P(X, Y) = P(Y|X) P(X)        (6.10)

Eq. (6.10) is the objective function to be maximized, and it is proportional to the posterior probability.
Likelihood

In a Markov network, the likelihood of the readings Y depends only on the variables they are directly connected to:

    P(Y|X) = ∏_{i=1}^{m} P(yi|xi)        (6.11)

where m is the number of probed sensors.
Prior

Priors shall be defined to capture the constraints between neighboring sensor readings. By exploiting the Markov property of the sensors, we define the prior to involve only the first-order neighborhood. Thus, the prior is proportional to the product of the compatibilities between all pairs of neighboring sensors:

    P(X) ∝ ∏_{(i,j)} P(xj|xi)        (6.12)
Solvable by BP

Substituting Eqs. (6.11) and (6.12) into the objective Eq. (6.10), we obtain the joint probability to be maximized:

    P(X, Y) = (1/Z) ∏_{(i,j)} P(xj|xi) ∏_{i=1}^{m} P(yi|xi)        (6.13)
[Plots omitted: probing ratio and Top-10 recall rates (raw and discrete values) over the 52 weeks, for BP-based probing (a) and naive probing (b).]
Figure 6.4: Top-K recall rates vs. probing ratios: (a) results obtained by our BP-based probing; (b) by the naive probing. On average, the BP-based approach probes 8% fewer sensors, yet achieves a 13.6% higher recall rate for raw values and a 7.7% higher recall rate for discrete values.
Looking back at the φ() and ψ() functions defined in Eqs. (6.7) and (6.8), we see that this objective function is of the form:

    P(X, Y) = (1/Z) ∏_{(i,j)} ψ(xi, xj) ∏_{i=1}^{m} φ(xi, yi)        (6.14)
where Z is a normalizing constant.
This is exactly the form in Eq.(6.2), where the joint probability over the pairwise
Markov network is factorized into products of localized potential functions. Therefore,
it is clear that the problem can be solved by belief propagation.
6.3.3 Learning and Inference
The learning part is to find the φ() and ψ() functions for each sensor, as defined in Eqs. (6.7) and (6.8). The learning is straightforward. We discretize the sensor readings of the past 46 years, using the first 30 years for training and the remaining 16 years for testing. In the discrete space, we simply count the frequency of each value a sensor can take, which gives φ(), and the conditional frequencies of sensor values given those of its neighbors, which give ψ().
We use a simple discretization with a fixed number of bins, 11 in our case, for each sensor. The first bin is dedicated to zeroes, which consistently account for over 50% of the readings. The 11 bins span the following ranges: [0, 0], [1, 5], [6, 10], [11, 30], [31, 60], [61, 100], [101, 200], [201, 400], [401, 1000], [1001, 1500], and (1500, ∞). This very simple discretization method has been shown to work well in the sensor experiments. More elaborate techniques, such as histogram equalization, which balances bin populations with adaptive bin numbers, may further boost performance.
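A minimal sketch of this discretization and of the frequency counting that yields φ(); the helper names and the toy readings are ours:

```python
import bisect
from collections import Counter

# Upper bin edges from Section 6.3.3; bin 0 holds the zero readings, and
# the last (11th) bin is open-ended (> 1500).
BIN_UPPER = [0, 5, 10, 30, 60, 100, 200, 400, 1000, 1500]

def discretize(reading):
    """Map an integer rainfall reading to one of the 11 bins."""
    return bisect.bisect_left(BIN_UPPER, reading)

def learn_phi(training_readings):
    """phi(): empirical frequency of each discrete value over the
    training period (the frequency counting described above)."""
    counts = Counter(discretize(r) for r in training_readings)
    total = len(training_readings)
    return [counts.get(b, 0) / total for b in range(11)]

phi = learn_phi([0, 0, 0, 2, 7, 45, 1600])
assert discretize(0) == 0 and discretize(5) == 1 and discretize(6) == 2
assert phi[0] == 3 / 7   # zeroes dominate, as in the real data
```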
For inference, belief propagation is not guaranteed to give the exact maximum a posteriori distribution, as there are loops in the Markov network. However, loopy belief propagation still gives satisfactory results, as we will see shortly.
6.3.4 Experimental Results
We evaluate our approach using Top-K queries. A Top-K query asks for the K sensors
with the highest values. It is not only a popular aggregation query that the sensor
community is interested in, but also a good metric for probing strategies as the exact
answer requires contacting all sensors.
We design a probing approach in which sensors are picked for probing based on
their local maximum a posteriori probabilities computed by belief propagation, as follows.
BP-based Probing:
1. Initialization: Compute the expected readings of sensors using the training data.
Pick the top 20.
Figure 6.5: Belief updates in six BP iterations ((0)-(5)). Initially only the four sensors at the corners are probed. The strong beliefs of these four sensors are carried over by their neighbors to sensors throughout the network, causing the beliefs of all sensors to be updated iteratively until convergence.
2. Probe the selected sensors.
3. True values acquired in step 2 become external evidence in the Markov network.
Propagate beliefs with all evidence acquired so far.
4. Again, pick the top sensors with the highest expectations for further probing, but
this time use the updated distributions to compute expectations. When there are
ties, pick them all.
5. Iterate steps 2-4, until beliefs in the network converge.
6. Pick the top K with the highest expectations according to BP MAP estimation.
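The numbered steps above can be sketched as the following outer loop; `probe` and `propagate` are caller-supplied functions (in the dissertation `propagate` is the belief propagation of Section 6.2.4; the toy `propagate` in the usage example merely echoes the evidence), and all names are ours:

```python
def bp_probing(expected, probe, propagate, k=10, batch=20, max_rounds=5):
    """Outer loop of BP-based probing (steps 1-6).  expected maps each
    sensor to its expected reading; probe(s) returns the true reading of
    sensor s; propagate(evidence) returns updated expectations given all
    probed values so far (belief propagation, in the dissertation)."""
    evidence = {}
    for _ in range(max_rounds):
        unprobed = [s for s in expected if s not in evidence]
        if not unprobed:
            break
        # steps 1 and 4: pick unprobed sensors with highest expectations
        for s in sorted(unprobed, key=expected.get, reverse=True)[:batch]:
            evidence[s] = probe(s)          # step 2: acquire true values
        expected = propagate(evidence)      # step 3: propagate beliefs
    # step 6: final top-K, trusting probed values where available
    final = {s: evidence.get(s, e) for s, e in expected.items()}
    return sorted(final, key=final.get, reverse=True)[:k]

# Usage with a toy echo "propagate": 30 sensors, only 20 are contacted.
truth = {f"s{i:02d}": float(i) for i in range(30)}
probed = []
def probe(s):
    probed.append(s)
    return truth[s]
top = bp_probing({s: 0.0 for s in truth}, probe,
                 propagate=lambda ev: {s: ev.get(s, 0.0) for s in truth},
                 k=3, batch=10, max_rounds=2)
assert len(probed) == 20
```

The convergence test of step 5 is simplified here to a fixed round budget; a real implementation would compare successive belief vectors.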
As a comparative baseline, we have also conducted experiments using a naive prob-
ing strategy as follows:
Naive Probing:
1. Compute the expectations of sensors. Pick the top 25% sensors.
2. Probe those selected sensors.
3. Pick the top K.
Performance of the two approaches is shown in Figure 6.4 (a) and (b), respectively.
On each diagram, the bottom curve shows the probing ratio, and the two curves on the
top show the recall rates for raw values and discrete values, respectively. We use the
standard formula to compute recall rate, i.e.:
    Recall = |S ∩ T| / |T|        (6.15)

where S is the returned top-K sensor set, and T is the true top-K set.
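Eq. (6.15) is a one-liner; the function and argument names are ours:

```python
def recall(returned_topk, true_topk):
    """Eq. (6.15): the fraction of the true top-K set that the probing
    strategy actually returned."""
    S, T = set(returned_topk), set(true_topk)
    return len(S & T) / len(T)

assert recall(["s1", "s2", "s3"], ["s2", "s3", "s4"]) == 2 / 3
```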
Since the sensor readings are discretized in our experiments, we can compute S and T using raw values or discrete values. Discrete recall demonstrates the effectiveness of BP, while raw recall may be of more interest for real applications. As can be seen from Figure 6.4, raw recall is lower than discrete recall, due to the error introduced in the discretization step. We expect raw recall to improve when a more elaborate discretization technique is adopted.
Figure 6.4 shows clearly that the BP-based approach outperforms the naive approach in terms of both recall rates, while requiring less probing. On average, the BP-based approach has a discrete recall of 88% and a raw recall of 78.2% after probing only 17.5% of the sensors. The naive approach has a discrete recall of only 79.3% and a raw recall of only 64.6% after probing 25% of the sensors.
The results shown in Figure 6.4 are obtained for K = 10. The relative performance
remains the same for other values K = 20, 30, 40.
6.3.5 How BP Works
A closer look at the changing sensor beliefs during the iterations shows how belief
propagation provides effective inference. We look at 49 sensors that form a 7× 7 grid,
each having the surrounding sensors (≤ 8) as its neighbors. Only the four sensors at the corners are probed. We use the ψ() functions acquired by learning, but set φ() to be uniform, solely for demonstration purposes. (The original φ() is so skewed that BP converges too fast to demonstrate a moderately sized sequence of belief changes.)
The beliefs are shown in Figure 6.5, one diagram per iteration. In the first diagram, only the four corner sensors have an impulse at the true value, while all the others show a flat distribution. But the probability histogram of each unprobed sensor grows notably sharper as BP iterates, showing how beliefs grow stronger by receiving messages from neighbors.
This sensor probing on a small scale gives a sense of how effective belief propaga-
tion can be in Markov networks.
• From Figure 6.5, we can see that beliefs are able to propagate through the net-
work via messages quickly. The messages of the four sensors at the corners are
first passed to the nearby sites, then carried all the way to the central sites in just
a few iterations.
• We can also see that well-informed nodes can help the less informed ones build up their beliefs. Informally, we say a node is well informed, or has stronger beliefs, if its belief distribution has a lower entropy. Figure 6.5 clearly shows that the four corner sensors pass strong beliefs to the others to help them compute a good approximation of the posterior.
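The informal notion of a "well-informed" node can be made concrete with Shannon entropy; a minimal sketch, with names ours:

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a belief distribution; a lower
    entropy means a better-informed node."""
    return -sum(p * math.log2(p) for p in belief if p > 0)

probed = [1.0, 0.0, 0.0]        # an impulse at the true value
uninformed = [1/3, 1/3, 1/3]    # flat distribution
assert entropy(probed) == 0.0
assert entropy(probed) < entropy(uninformed)
```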
6.4 Application II: Enhancing Protein Function Predictions
Local data dependency can not only help infer missing values, as in the sensor exam-
ple, but can also be exploited to enhance mining results. Many data mining methods,
for efficiency consideration or design limitation, use a model incapable of capturing
rich relationships embedded in data. Most discriminative models like Naive Bayes
and SVM belong to this category. Predictions of these models can be improved, by
exploiting local data dependency using Markov networks. The predictions are used as
the likelihood proposal, and message passing between variables refines and reinforces
the beliefs. Next we show how to improve protein function predictions in this way.
6.4.1 Problem Description
Proteins tend to localize in various parts of cells and interact with one another, in order
to perform crucial functions. One task in the KDD Cup 2001 [CHH+01] is to predict
protein functions. The training set contains 862 proteins with known functions, and
the testing set includes 381 proteins. The interactions between proteins, including the
testing genes, are given. Other information provided specifies a number of properties
of individual proteins or of the genes that encode the proteins. These include the chromosome on which the gene appears, the phenotype of organisms with differences in this gene,
etc.
Since the information about individual proteins or genes consists of fixed features, it becomes
Figure 6.6: Logistic curve that is used to blur the margin between the belief on two
classes.
crucial to learn from the interactions. According to the report of the cup organizers,
most competitors organized data in relational tables, and employed algorithms that deal
with tabular data. However, compared with tables, graphical models provide a much
more natural representation for interacting genes. With a Markov network model,
interactions can be modeled directly using edges, avoiding preparing a huge training
table. Interacting genes can pass messages to each other, thus getting their beliefs
refined together.
In the rest of this section, we show a general way of enhancing a weak classifier by simply leveraging local dependency. The classifier we use is Naive Bayes, learned from the relational table. We build a Markov network in which interacting genes are connected as neighbors. The φ() potentials come from the Naive Bayes predictions, and the ψ() potentials are learned from the gene interactions.
6.4.2 Learning Markov Network
We separate the learning of each function, as focusing on one function at a time is easier. There are 13 function categories, hence we build 13 Markov networks. To prepare the initial beliefs for a network, we first learn a Naive Bayes classifier, which outputs a probability vector b0() indicating how likely a gene is to perform the function in question or not.
Each gene i maps to a binary variable xi in the Markov network. First we design the φ() potentials for {xi}. One could set the Naive Bayes prediction b0() to be φ(), but that way the Naive Bayes classifier would be over-trusted, making it harder to correct misclassifications. Instead, we adopt a generalized logistic function to blur the margin between the beliefs on the two classes, while still keeping the prediction decision.
f = a / (1 + e^(−α(x−β))) + b (6.16)
In the experiments, we set a = 0.75, b = 0.125, α = 6, and β = 0.5. The logistic
curve is shown in Figure 6.6.
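With these parameter values, Eq. (6.16) maps a Naive Bayes probability in [0, 1] into roughly [0.16, 0.84], softening confident predictions while keeping the decision boundary at 0.5. A minimal sketch (the function name `squash` is ours):

```python
import math

def squash(p, a=0.75, b=0.125, alpha=6.0, beta=0.5):
    """Generalized logistic of Eq. (6.16): pulls a probability p toward
    0.5 while preserving which side of 0.5 it falls on."""
    return a / (1.0 + math.exp(-alpha * (p - beta))) + b
```

Since squash(0.5) = 0.5 and the curve is monotone, the hard decision is unchanged while extreme beliefs are tempered, leaving room for message passing to overturn them.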
The ψ() potentials are learned from protein interactions. Interactions are measured
by the correlation between the expression levels of the two encoding genes. At first
we tried to relate the functions of two genes in a simple way: a positive correlation
indicates that with a fixed probability both or neither of the genes perform the function,
while a negative correlation indicates that one and only one gene performs the function.
This leads to a single fixed ψ() function for all interacting genes. But a close look
at the interaction data shows that 25% of the time this assumption does not hold. In reality,
two genes participating in the same function may sometimes be negatively correlated;
a more influential phenomenon is that genes may participate in several functions, so
the observed correlation is a combined effect involving multiple functions.
We decided to learn the distribution of correlation values separately for three types
of interactions: (a) FF: a group for protein pairs where both perform the function,
(b) FNF: a group for pairs where one and only one performs the function, and (c) NFNF: a
group for protein pairs where neither performs the function. Thus, the potential function
ψi,j defines how likely a given correlation value is to be observed for genes xi and xj,
depending on whether each of xi and xj performs the function or not.
Figure 6.7: Distributions of correlation values learned for two functions. Left column:
cell growth; right column: protein destination. In each column, the distributions from
top to bottom are learned from groups (a), (b) and (c), respectively.
In Figure 6.7, we show the distributions of correlation values learned for two functions.
The left column is for a function related to cell growth, and the right column for a
function related to protein destination. From top to bottom in each column, the distributions
are learned from interaction groups (a), (b) and (c), respectively. The figures show that correlation
distributions differ among groups, and are specific to functions as well.
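The group-wise distributions can be estimated as smoothed histograms over the observed correlations; ψi,j then looks up the density of the observed correlation under each joint assignment of the two genes. A sketch under assumed data layouts (pairs of gene indices with correlations, 0/1 function labels; all names are ours):

```python
import numpy as np

def learn_psi_tables(pairs, labels, bins=20):
    """Empirical correlation histograms for the three interaction groups.
    pairs: list of (i, j, corr); labels: dict gene -> 0/1 for the function.
    Returns group -> normalized histogram over correlations in [-1, 1]."""
    groups = {"FF": [], "FNF": [], "NFNF": []}
    for i, j, corr in pairs:
        k = labels[i] + labels[j]          # 0, 1, or 2 genes have the function
        groups[("NFNF", "FNF", "FF")[k]].append(corr)
    edges = np.linspace(-1, 1, bins + 1)
    tables = {}
    for g, vals in groups.items():
        hist, _ = np.histogram(vals, bins=edges)
        tables[g] = (hist + 1) / (hist.sum() + bins)  # Laplace smoothing
    return tables

def psi(corr, xi, xj, tables, bins=20):
    """psi_{i,j}: likelihood of the observed correlation under a joint
    assignment (xi, xj) of the two genes."""
    g = ("NFNF", "FNF", "FF")[xi + xj]
    idx = min(int((corr + 1) / 2 * bins), bins - 1)
    return tables[g][idx]
```

The smoothing term is our addition; it keeps every (correlation, assignment) pair at nonzero probability so that belief propagation never multiplies by zero.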
Figure 6.8: A subgraph in which testing genes got correct class labels due to message
passing.
6.4.3 Experiments
Naive Bayes does not perform well on this problem, because it does not model the gene
interactions sufficiently, and thus cannot fully utilize the rich interaction information.
Taking the average predictive accuracy of all classifiers, one per function, the overall
accuracy of Naive Bayes is 88%. Belief propagation improves this to 90%.
To exemplify how misclassifications get corrected due to message passing, we
show a subgraph of genes in Figure 6.8. The white circles represent genes (variables),
and the shaded circles represent external evidence. Only training genes have corre-
sponding external evidence. The 1’s or 0’s in the circles tell whether a gene has the
function in question or not. For interested readers, we also put the gene ID below the
circle. The subgraph contains four training genes and five testing genes. All these
testing genes were misclassified by Naive Bayes. After receiving strong beliefs from
their neighboring genes, four out of five testing genes were correctly classified. The
other test gene ‘G230291’ was misclassified by both, but Naive Bayes predicted a 0%
probability that it has the function (which it in fact does), while belief propagation
increased this belief to 25%.
We also evaluated our approach using the score function originally used in the 2001
KDD Cup [CHH+01]. First we picked out all the functions we predicted for a gene.
If more functions are predicted than the true number (which is actually the number of
duplicates of that gene in the test table provided), we remove the ones with the smallest
confidence. The final score is the ratio of correct predictions, including both positive
and negative predictions. Our final score is 91.2%, close to the Cup winner’s 93.6%.
Although the winner scored reasonably high, they organized data in relational tables
and did not fully exploit gene interactions. We expect that their method could perform
better if integrated with our approach to exploit local dependencies between genes.
The Cup winner organized data in relational tables, which are not designed for
representing complex relationships. To make up for this, they manually created new features,
such as computing “neighbors” within k (k > 1) hops following neighbor links. Even
so, these new features can only be treated the same as the other individual features.
The rich relationship information in the original graph structure was lost. Graphical
models, on the other hand, are natural models for complex relationships. Markov
networks together with belief propagation provide a general and powerful modeling
and inference tool for problems governed by local constraints, such as protein function
prediction.
6.5 Application III: Sequence Data Denoising
Sequences are ordered lists of elements, such as text strings, DNA sequences, or binary
codes in channel transmission. This type of data often exhibits dependencies between
adjacent elements. For example, there are rich dependencies embedded in English
text. Such sequence data can be modeled using Markov chains, a degenerate form of
Markov networks.
Moreover, errors in sequence data often have neighborhood patterns. OCR, discussed
in Section 6.1, gives an example where errors are related to the shapes of characters
and to their relative positions. The mutation of a nucleotide is also influenced
by its nearby bases. That the Markov property is satisfied by both the sequence data
itself and by errors strongly suggests the applicability of belief propagation for
sequence data denoising. Indeed, for Markov chains, belief propagation is theoretically
guaranteed to give exact marginal or maximum a posteriori probabilities.
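On a chain, sum-product belief propagation reduces to a forward and a backward pass of messages. The following sketch (assuming a single shared pairwise potential; the function and variable names are ours) returns exact node marginals:

```python
import numpy as np

def chain_marginals(phi, psi):
    """Exact sum-product belief propagation on a Markov chain.
    phi: (T, S) node potentials; psi: (S, S) pairwise potential,
    psi[s, t] scoring state s followed by state t.
    Returns (T, S) normalized marginals."""
    T, S = phi.shape
    fwd = np.ones((T, S))   # message into node t from the left
    bwd = np.ones((T, S))   # message into node t from the right
    for t in range(1, T):
        m = (fwd[t - 1] * phi[t - 1]) @ psi
        fwd[t] = m / m.sum()
    for t in range(T - 2, -1, -1):
        m = psi @ (bwd[t + 1] * phi[t + 1])
        bwd[t] = m / m.sum()
    belief = phi * fwd * bwd
    return belief / belief.sum(axis=1, keepdims=True)
```

With a uniform ψ the marginals collapse to the normalized node potentials, which is a convenient sanity check; replacing the sums with maxima would give the max-product (MAP) variant mentioned above.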
In the rest of this section, we study the problem of correcting errors in noisy documents.
While simple, this problem exemplifies many basic characteristics of sequence
data mining.
6.5.1 Problem Description and Data Representation
A document is a text sequence consisting of characters from an alphabet, while a noisy
document is the output of some recognizer with systematic errors. We split the
sequence into small segments, each having n characters, and let neighboring segments
overlap by m characters, m < n. We use a random variable xi = (xi^(1), · · · , xi^(n)) to
represent each underlying clean segment i. The corresponding observed segment in
the noisy document is denoted yi = (yi^(1), · · · , yi^(n)). Each segment, except those
that start or end the sequence, has a neighbor segment on either side.
Now we design the potential functions φi(xi, yi) and ψij(xi, xj). For φ(), the
definition should specify how likely we are to observe yi given xi. A natural choice is to define
φ() to be a likelihood function
φi(xi, yi) = P (yi|xi) (6.17)
For a short segment, we can assume independence between characters. Thus, φ()
can be written as
φi(xi, yi) = ∏_{l=1}^{n} P(yi^(l) | xi^(l)) (6.18)
For ψ(), the definition should specify how compatible two neighboring segments xi
and xj are. Again, we can assume independence between the characters in the two segments,
except for those in the overlapping part. Consider two overlapping characters,
xi^(k) and xj^(l). If the probability is zero that xi^(k) will change to xj^(l) or vice versa, then the
two segments, xi and xj, are incompatible. The resulting mutation probability of the
overlapping part quantifies the compatibility of two neighboring segments. Non-adjacent
segments are incompatible. Formally, we define an asymmetric ψ() function
on xi and xj, where xi is the left neighbor of xj:
ψij(xi, xj) = ∏_{l=1}^{m} P(xj^(l) | xi^(n−m+l)) (6.19)
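Given the mutation probabilities P(b | a) between characters, Eqs. (6.18) and (6.19) are simple products over characters. A sketch with a nested-dict mutation table M[a][b] = P(b | a) (the helper names are ours):

```python
def phi_seg(x, y, M):
    """Eq. (6.18): likelihood of observed segment y given clean segment x,
    assuming per-character independence. M[a][b] = P(b | a)."""
    p = 1.0
    for cx, cy in zip(x, y):
        p *= M[cx][cy]
    return p

def psi_seg(x_left, x_right, M, m):
    """Eq. (6.19): compatibility of two neighboring clean segments that
    overlap by m characters; x_left is the left neighbor."""
    p = 1.0
    for a, b in zip(x_left[-m:], x_right[:m]):
        p *= M[a][b]
    return p
```

Note the asymmetry of psi_seg: the last m characters of the left segment are matched against the first m of the right one, exactly as in Eq. (6.19).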
6.5.2 Learning and Inference
The learning phase is to find the φ() and ψ() functions. For this purpose, we build a
mutation matrix M . Each matrix element m(i, j) is the unconditional mutation prob-
ability from the i-th character to the j-th: m(i, j) = P (chj|chi). This can be easily
computed from the training set, which consists of pairs of clean and noisy documents.
We partition the clean and noisy documents in the same way. The φ() of each pair of
clean and observed segments is given in Eq. (6.17), and the ψ() of each pair of
neighboring clean segments is given in Eq. (6.19).
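Estimating M amounts to counting aligned character pairs in the clean/noisy training pair; we add Laplace smoothing here (our choice, not stated in the text) so that unseen mutations keep nonzero probability:

```python
from collections import Counter

def learn_mutation_matrix(clean, noisy, alphabet):
    """Estimate m(i, j) = P(ch_j | ch_i) by counting aligned character
    pairs in a (clean, noisy) training document pair."""
    counts = {c: Counter() for c in alphabet}
    for cx, cy in zip(clean, noisy):
        counts[cx][cy] += 1
    return {
        ci: {cj: (counts[ci][cj] + 1) / (sum(counts[ci].values()) + len(alphabet))
             for cj in alphabet}
        for ci in alphabet
    }
```

Each row of the resulting table is a proper conditional distribution over the alphabet, which is all that φ() and ψ() above require.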
In inference, a subproblem is to find candidate underlying segments for a given
observed segment. One could enumerate all possible candidates using the mutation matrix.
But this method not only generates too many candidates, it also ignores valuable
information in the training data: the combinations of characters that actually occur
in segments. We restrict the candidates to the top matches among all training segments.
When the number of matches is too small, we generate extras using the mutation matrix.
By doing so, we exploit intra-segment constraints, fine details that the
Markov chain cannot model, as it operates at the scale of segments.
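Restricting candidates to training segments can be sketched as ranking all training segments by their likelihood of producing the observed segment, per Eq. (6.18) (function names are ours; M is a nested-dict mutation table):

```python
def top_candidates(y, train_segments, M, k=5):
    """Candidate clean segments for an observed segment y: the k training
    segments most likely to have produced y under the mutation matrix."""
    def likelihood(x):
        p = 1.0
        for cx, cy in zip(x, y):
            p *= M[cx][cy]
        return p
    return sorted(train_segments, key=likelihood, reverse=True)[:k]
```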
6.5.3 Experimental Results
We choose two conference papers on the same topic: motion modeling. Both docu-
ments are distorted, using the probabilistic mutation rules in Table 6.1, to form pairs
consisting of a clean document and a noisy document. One pair is used to train the
potential functions, while the other is used for testing. For simplicity, we change all
capitals into lower-case letters, replace all punctuation marks other than commas and
periods with commas, and remove all figures, tables and equations. The transformed
documents belong to an alphabet of size 38 (consisting of 26 letters, 10 digits, the
comma and the period).
A variety of distortion rules are used: unconditional mutation rules and k-order
conditional mutation rules, k = 1, 2, 3. (A k-order conditional mutation depends on k
neighbors on either side.) To compute the potential functions, all we need to learn is a
38-by-38 mutation matrix M of unconditional mutation rates only. Yet, we are able to
catch and correct most of the mutation errors, including the higher-order conditional
errors. In fact, the correction rates for conditional errors are even higher, as shown in
Table 6.1. This is achieved by exploiting the Markov property and by passing local
beliefs through the network using BP.
To help give an intuitive idea about how dependencies between text segments can
be used effectively for error correction, we enclose a paragraph of distorted text here,
followed by the corrected version. The misspelled words are underlined. We can see
that most of the misspellings are corrected.
rule          mutation prob.   # errors   % corrected
x → k         100%             56         91%
f → f         42%              -          -
f → d         30%              123        92%
f → z         28%              118        87%
th → th       48%              -          -
th → tn       52%              220        96%
se → se       36%              -          -
se → ue       18%              51         93%
se → le       25%              69         94%
se → ie       21%              58         95%
tio → tio     29%              -          -
tio → tho     20%              35         100%
tio → txo     20%              35         100%
tio → two     31%              57         98%
total words/errors: 3459/822   overall accuracy: 94%
Table 6.1: Distortion rules and error correction results. Columns 1 and 2 give the rule
and mutation rate, respectively. Column 3 is the actual number of times a rule applies,
and column 4 is the percentage corrected by BP inference.
Distorted text:
introductxon. natural scenes contain rich stochastic mothon patterns which are character-
ized by the movement od a large number od distinguishable or indistinguishable elements, such
as falling snow, zlock of birds, river waves, etc. tnele mothon patterns, called tektured, motion
temporal tekture and dynamic tektures in the literature, cannot be analyzed by conventwonal
optical zlow dields and have stimulated growing interests in both graphics and vision. in
graphics, the objective is to render photorealistic video iequences, or non photorealistic but
stylish cartoon animathons. both physics baued metnods such as partial didferential equatxons
and image baled such as video tekture and volume tekture are studied to simulate dire, fluid,
and gaseous phenomena. in vision, szummer and picard studied a spatial temporal auto re-
gression star model, which is a causal gaussian markov random zield model.
Text after denoising:
introduction. natural scenes contain rich stochastic motion patterns which are charac-
terized by the movement of a large number of distinguishable or indistinguishable elements,
such as falling snow, zlock of birds, river waves, etc. tnese motion patterns, called textured
motion, temporal texture and dynamic tektures in the literature, cannot be analyzed by conven-
tional optical flow fields, and have stimulated growing interests in both graphics and vision.
in graphics, the objective is to render photorealistic video sequences, or non photorealistic
but stylish cartoon animations. both physics based methods such as partial differential equa-
tions and image based such as video texture and volume texture are studied to simulate dire,
fluid, and gaseous phenomena. in vision, szummer and picard studied a spatial temporal auto
regression star model, which is a causal gaussian markov random field model.
6.6 Related Work and Discussions
Data dependency is present in a wide spectrum of applications. In this chapter, we propose
a unified approach that exploits data dependency to improve mining results, and
we approach this goal from two directions: (1) improving the quality of the input data, such
as by correcting contaminated data and by inferring missing values, and (2) improving
the mining results of a model that ignores data dependency.
Techniques for improving data quality proposed in the literature have addressed
a wide range of problems caused by noise and missing data. For better information
retrieval from text, data is usually filtered to remove noise defined by grammatical
errors [SM83]. In data warehouses, there has been work on noisy class label and
noisy attribute detection based on classification rules [ZWC03] [YWZ04], as well as
learning from both labeled and unlabeled data by assigning pseudo-classes for the
unlabeled data [BDM02] using boosting ensembles. All this previous work has its
own niche concerning data quality. Our work is more general in that it exploits local
data constraints using Markov networks.
A pioneering work in sensor networks, the BBQ system [DGM+04] has studied
the problem of cost-efficient probing. However, their method relies on a global mul-
tivariate Gaussian distribution. Global constraints are very strict assumptions, and are
not appropriate in many practical scenarios.
The primary contribution of this chapter is a unified approach to improving
mining quality by considering data dependency extensively in data mining. This
chapter also contributes to data mining practice through our investigation of several
real-life applications. By exploiting data dependency, clear improvements have been
achieved in data quality and the usefulness of mining results.
CHAPTER 7
Conclusions
Data stream mining and sequence mining each pose significant challenges. Stream
mining can discover up-to-date patterns invaluable for timely strategic decisions, but
this has to be done accurately and quickly with limited computation resources, and it
has to deal with both concept drift and noise. Sequence mining can reveal long-term
trends and more complicated patterns that lead to deeper insights, but more often than not
meaningful patterns can only be found in subspaces, which incurs high complexity in
pattern mining.
This dissertation introduces several novel algorithms in data stream mining, se-
quence data clustering, and improving data quality in general.
The dissertation starts by examining several stream learning algorithms, then introduces
our first stream learning method, Adaptive Boosting, which achieves the goals of
fast learning, light memory consumption, and prompt adaptation. Adaptive Boosting
is an online boosting ensemble method that constructs a highly accurate model with
fast learning and low memory consumption. It is further integrated with novel change
detection techniques, which ensure prompt adaptation.
We then explore the issue of robustness in the presence of noise. Combined
with the adaptation issue, this becomes a very hard problem, since both noisy
data and data from an emerging new concept appear as misclassified examples
to an existing learning model. We formulate this robust and adaptive approach under
the EM framework. In this framework, the noise label associated with each data entry
is represented by a hidden variable to be inferred. The weighted ensemble serves
as the classification model. Ensemble weights are obtained by Maximum Likelihood
Estimation (MLE), maximizing the posterior probability of the clean data only.
After having presented stream learning, the dissertation moves on to sequence data
clustering with spatial localization: pattern-based subspace clustering. Pattern-based
clustering can find objects that exhibit a coherent pattern of rise and fall in subspaces.
Efficiency is the biggest concern here, due to the curse of dimensionality.
Therefore, we propose SeqClus, which achieves efficiency through a novel distance
function that not only captures subspace pattern similarity, but is also conducive
to efficient clustering implementations. In our implementation, a novel tree structure
provides a compact summary of all the frequent patterns in a data set, and a density-
and grid-based clustering algorithm is developed to find clusters in any subspace.
The final part of this dissertation focuses on the general problem of improving
the quality of data mining by exploiting local data dependency. We stress how
to do quality mining from low-quality data. Poor quality is characterized by missing,
noisy or ambiguous values. We propose to improve data quality by exploiting
data interdependency using Markov Random Field (MRF) modeling. The local
constraints are described by pairwise Markov networks and quantified by potential
functions over pairs of variables. Efficient inference is performed by belief propagation, which
passes beliefs among variables so as to fill in missing values or to clean the data,
thus yielding quality mining results. We also observe that many existing data mining
methods are intrinsically incapable of modeling and leveraging data dependency. The
mining results of such methods can also be post-processed and improved by taking data
dependency into consideration. We have investigated several interesting real-life ap-
plications: cost-efficient sensor probing, protein function prediction and sequence data
de-noising. By exploiting data dependency, clear improvements in these applications
have been achieved.
Future Work
There is much research to be done to enhance the applicability
and efficiency of the methods in this dissertation.
The Robust Regression algorithm we have proposed for adaptive and robust learning
on data streams is among the first approaches to this problem. However,
we solved the essential problem but left a couple of concerns open. First, the current
model is basically a discriminative model with a hidden variable for modeling
noise. We have deliberately avoided touching the underlying data distribution due to
lack of knowledge. It would be interesting to explore domain knowledge to eliminate
noise. Second, our approach to achieving adaptability is basically that of updating
model parameters. A methodology that may be more suitable for general scenarios is
model transition. For example, when the concept changes to a multi-modal Gaussian
distribution, we can never fit the data well with a single-mode Gaussian. An imaginary
scenario is that of a complete, finite bank of models, where transition decisions are
made based on data entropy or other statistical properties collected over time. This
can provide an interesting topic for future research.
Probabilistic graphical models are powerful tools for modeling dependencies in
large data sets. In our current work of applying Markov networks to infer unprobed
sensor readings, graph structure learning is simplified: we assume that sensors give
correlated readings as long as they are close to each other. This simple assumption
may not always hold true, and can be corrected if domain knowledge (for example,
two nearby sensors are on the opposite slopes of a hill) is available or a sophisticated
correlation analysis is performed. Moreover, traditional graphical models are all
limited to a fixed graph structure. But many real scenarios feature graph
topology changes. For instance, mobile sensors travel over time, and their
neighborhood relations change correspondingly. As a result, the neighborhood of each graph
vertex is not fixed but becomes a set of random variables. How to perform learning
and inference on the dynamic Markov random field represents a challenging problem
of practical importance.
REFERENCES
[AGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of ACM-SIGMOD International Conference on Management of Data (SIGMOD), 1998.
[AKA91] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms.In Machine Learning 6(1), 37-66, 1991.
[AM01] S. Aji and R. McEliece. The generalized distributive law and free energyminimization. In Proceedings of the 39th Annual Allerton Conference onCommunication, Control, and Computing, 2001.
[APW+00] C. Aggarwal, C. Procopiuc, J. Wolf, P.S. Yu, and J.S. Park. Fast algo-rithms for projected clustering. In Proceedings of ACM-SIGMOD Inter-national Conference on Management of Data (SIGMOD), 2000.
[AY00] C. Aggarwal and P.S. Yu. Finding generalized projected clusters in highdimensional spaces. In Proceedings of ACM-SIGMOD InternationalConference on Management of Data (SIGMOD), 2000.
[AY01] C. Aggarwal and P. Yu. Outlier detection for high dimensional data.In Proceedings of ACM-SIGMOD International Conference on Manage-ment of Data (SIGMOD), 2001.
[BB99] P.O. Brown and D. Botstein. Exploring the new world of the genomewith DNA microarrays. In Nature Genetics, 21:33–37, 1999.
[BDM02] K. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in en-semble methods. In Proceedings of the 8th ACM-SIGKDD InternationalConference on Knowledge Discovery and Data Mining, 289-296, 2002.
[BF96] C. Brodley and M. Friedl. Identifying and eliminating mislabeled train-ing instances. In Proceedings of the 30th National Conference on Artifi-cial Intelligence, 799-805, 1996.
[Bil98] J. Bilmes. A gentle tutorial on the em algorithm and its application toparameter estimation for gaussian mixture and hidden markov models.In Technical Report ICSI-TR-97-021, 1998.
[BKNS00] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of ACM-SIGMOD InternationalConference on Management of Data (SIGMOD), 2000.
[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data withco-training. In COLT: Proceedings of the Workshop on ComputationalLearning and Theory, 1998.
[BR99] B. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. In Machine Learning, 36, 105-139, 1999.
[CC00] Y. Cheng and G. Church. Biclustering of expression data. In Proceed-ings of 8th International Conference on Intelligent System for MolecularBiology, 2000.
[CDH+00] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensionalregression analysis of time-series data streams. In Proceedings of VeryLarge Database (VLDB), 2000.
[CFZ99] C.H. Cheng, A.W. Fu, and Y. Zhang. Entropy-based subspace clus-tering for mining numerical data. In Proceedings of ACM-SIGKDDInternational Conference on Knowledge Discovery and Data Mining(SIGKDD), 1999.
[CHH+01] J. Cheng, C. Hatzis, H. Hayashi, M.-A. Krogel, S. Morishita, D. Page,and J. Sese. Kdd cup 2001 report. In SIGKDD Explorations, 3(2):47–64, 2001.
[DG01] C. Domeniconi and D. Gunopulos. Incremental support vector machineconstruction. In Proceedings of International Conference Data Mining(ICDM), 2001.
[DGM+04] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong.Model-driven data acquisition in sensor networks. In Proceedings ofVery Large Database (VLDB), 2004.
[DH00] P. Domingos and G. Hulten. Mining high-speed data streams. In Pro-ceedings of ACM-SIGKDD International Conference on Knowledge Dis-covery and Data Mining (SIGKDD), 2000.
[DHL+03] Guozhu Dong, Jiawei Han, Laks V.S. Lakshmanan, Jian Pei, HaixunWang, and Philip S. Yu. Online mining of changes from data streams:Research problems and preliminary results. In Proceedings of the2003 ACM SIGMOD Workshop on Management and Processing of DataStreams, 2003.
[Die00] T. Dietterich. Ensemble methods in machine learning. In Multiple Clas-sifier Systems, 2000.
[DLS99] P. D’haeseleer, S. Liang, and R. Somogyi. Gene expression analysisand genetic network modeling. In Pacific Symposium on Biocomputing,1999.
[DR99] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. In Journal of Artificial Intelligence Research, 11, 169-198, 1999.
[FHT98] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: Astatistical view of boosting. In The Annals of Statistics, 28(2):337–407,1998.
[FRM94] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequencematching in time-series databases. In Proceedings of ACM-SIGMODInternational Conference on Management of Data (SIGMOD), 1994.
[FS96] Y. Freund and R. Schapire. Experiments with a new boosting algo-rithm. In Proceedings of International Conference on Machine Learning(ICML), 1996.
[GBYS92] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: Pattrees and pat arrays. In Information Retrieval: Data Structures and Al-gorithms, 335–349, 1992.
[GGRL02] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. Mining data streamsunder block evolution. In ACM SIGKDD Explorations 3(2):1-10, 2002.
[GMMO00] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 359-366, 2000.
[GNV96] I. Guyon, N. Natic, and V. Vapnik. Discovering informative patterns anddata cleansing. In AAAI/MIT Press, pp. 181-203, 1996.
[GRS95] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain MonteCarlo in Practice. CRC Press, 1995.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing datastreams. In Proceedings of ACM-SIGKDD International Conference onKnowledge Discovery and Data Mining (SIGKDD), 2001.
[HTF00] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2000.
[JMN99] H.V. Jagadish, J. Madar, and R. Ng. Semantic compression and pat-tern extraction with fascicles. In Proceedings of Very Large Database(VLDB), 1999.
[KGV83] S. Kirkpatrick, C. Gelatt, and M. Vecchi. Optimization by simulatedannealing. In Science, vol. 220, no.4598, 1983.
[KM01] J. Kolter and M. Maloof. Dynamic weighted majority: A new ensem-ble method for tracking concept drift. In Proceedings of InternationalConference Data Mining (ICDM), 2001.
[KM03] J. Kubica and A. Moore. Probabilistic noise identification and datacleaning. In Proceedings of International Conference Data Mining(ICDM), 2003.
[McC76] E.M. McCreight. A space-economical suffix tree construction algorithm.In Journal of the ACM, 23(2):262-272, 1976.
[MM93] U. Manber and G. Myers. Suffix arrays: A new method for on-line stringsearches. In SIAM Journal On Computing, 935-948, 1993.
[MMC98] R. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instanceof pearl’s ’belief propagation’ algorithm. In IEEE Journal on SelectedAreas in Communication, 16(2), pp. 140-152, 1998.
[MWJ99] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings on Uncertainty in AI, 1999.
[NMTM00] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Using EM to classify text from labeled and unlabeled documents. In Machine Learning, 39:2:103-134, 2000.
[oW] University of Washington. http://www.jisao.washington.edu/data sets/widmann/.
[PA87] C. Peterson and J. Anderson. A mean-field theory learning algorithm forneural networks. In Complex Systems, vol.1, 1987.
[Pea88] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plau-sible inference. Morgan Kaufmann publishers, 1988.
[PWMH01] C-S. Perng, H. Wang, S. Ma, and J.L. Hellerstein. A framework forexploring mining spaces with multiple attributes. In Proceedings of In-ternational Conference Data Mining (ICDM), 2001.
[PWZP00] C.-S. Perng, H. Wang, S.R. Zhang, and D.S. Parker. Landmarks: a newmodel for similarity-based pattern querying in time series databases. InProceedings of International Conference on Data Engineering (ICDE),2000.
[PZC+03] J. Pei, X. Zhang, M. Cho, H. Wang, and P.S. Yu. Maple: A fast algorithmfor maximal pattern-based clustering. In Proceedings of InternationalConference Data Mining (ICDM), 2003.
[R.E61] R. E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[RRS00] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for min-ing outliers from large data sets. In Proceedings of ACM-SIGMOD In-ternational Conference on Management of Data (SIGMOD), 2000.
[SFB97] R. Schapire, Y. Freund, and P. Bartlett. Boosting the margin: A newexplanation for the effectiveness of voting methods. In Proceedings ofInternational Conference on Machine Learning (ICML), 1997.
[SFL+97] S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Credit cardfraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
[SG86] J. Schlimmer and F. Granger. Beyond incremental processing: Trackingconcept drift. In Int’l Conf. on Artificial Intelligence, 1986.
[SK01] W. Street and Y. Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of ACM-SIGKDD InternationalConference on Knowledge Discovery and Data Mining (SIGKDD),2001.
[SM83] G. Salton and M. McGill. Introduction to modern information retrieval.McGraw Hill, 1983.
[SS94] R. Schultz and R. Stevenson. A bayesian approach to image expansionfor improved definition. In IEEE Transactions on Image Processing,3(3), pp. 233-242, 1994.
[Ten99] C.M. Teng. Correcting noisy data. In Proceedings of the InternationalConference on Machine Learning, 239-248, 1999.
[THC+00] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, http://arep.med.harvard.edu/biclustering/yeast.matrix, 2000.
[WFYH03] H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting datastreams using ensemble classifiers. In Proceedings of ACM-SIGKDDInternational Conference on Knowledge Discovery and Data Mining(SIGKDD), 2003.
[WK96] G. Widmer and M. Kubat. Learning in the presence of concept drift andhidden contexts. In Machine Learning, 23 (1), 69-101, 1996.
[WPF+03] H. Wang, C-S. Perng, W. Fan, S. Park, and P.S. Yu. Indexing weightedsequences in large databases. In Proceedings of International Confer-ence on Data Engineering (ICDE), 2003.
[WPFY03] H. Wang, S. Park, W. Fan, and P.S. Yu. ViST: A dynamic index methodfor querying XML data by tree structures. In Proceedings of ACM-SIGMOD International Conference on Management of Data (SIGMOD),2003.
[WWYY02] H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern simi-larity in large data sets. In Proceedings of ACM-SIGMOD InternationalConference on Management of Data (SIGMOD), 2002.
[YFW00] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation.In Advances in Neural Information Processing Systems (NIPS), Vol 13,pp. 689-695, 2000.
[YWWY02] J. Yang, W. Wang, H. Wang, and P.S. Yu. δ-clusters: Capturing subspacecorrelation in a large data set. In Proceedings of International Confer-ence on Data Engineering (ICDE), 2002.
[YWZ04] Y. Yang, X. Wu, and X. Zhu. Dealing with predictive-but-unpredictableattributes in noisy data sources. In Proceedings of the 8th EuropeanConference on Principles and Practice of Knowledge Discovery in Data-bases (PKDD 04), 2004.
[ZWC03] X. Zhu, X. Wu, and Q. Chen. Eliminating class noise in large datasets.In Proceedings of the 20th International Conference Machine Learning(ICML 03), 2003.