Online Detection of Change in Data Streams
Shai Ben-David, School of Computer Science, U. Waterloo

TRANSCRIPT
Some Change Detection Tasks
Quality control – Factory products are being regularly tested and scored. Can we detect when the distribution of scores changes?
Real estate prices – Following selling prices of houses in K-W. Can we tell when market trends change?
Problem Formalization
Data points are generated sequentially and independently by some underlying probability distribution. Viewing the generated stream of data points, we wish to detect when the underlying data-generating distribution changes (and how it changes).
Detection in Sensor Networks
We consider large-scale networks of sensors. Each sensor makes local binary decisions about the monitored physical phenomena: RED/GREEN. An observer collects a random sample of the sensors' readings.
[Figure: first and second rounds of data collection from the sensor network]
Is there a change in the underlying data-generating distribution? If a change has been detected, what exactly has changed?
Similar Issues in Other Disciplines
Ecology – Tracing the distribution of species over geographical locations.
Public Health – Tracing the spread of various diseases.
Census data analysis.
Our Basic Paradigm
Compare two sliding windows over the data stream:
[Figure: two windows S1 and S2 sliding along the time axis]
This reduces the change detection problem to the "two-sample" problem: given two samples S1, S2 generated by distributions P1, P2, infer from S1, S2 whether P1 = P2.
Meta-Algorithm for Online Change Detection
Explanation
Note that the meta-algorithm is actually running k independent algorithms in parallel – one for each triple (m1,i, m2,i, αi).
Each keeps a baseline window Xi, containing the m1,i points following the last-detected change point c0, and a second window Yi, containing the most recent m2,i points in the stream.
We declare CHANGE whenever d(Xi, Yi) > αi.
At such a point we reset c0 and Xi.
The different αi's reflect different levels of change sensitivity. The mi's are computed from the αi's using the theory outlined below.
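A minimal Python sketch of this meta-algorithm (not the authors' implementation): `dist` stands in for the distance d, and the (m1, m2, α) triples are caller-chosen illustrative parameters.

```python
from collections import deque

def run_change_detector(stream, dist, params):
    """Run k independent window-pair tests in parallel.

    `params` is a list of (m1, m2, alpha) triples; `dist` is any
    sample-distance function (e.g. an empirical d_F).  Yields the
    stream index at which each CHANGE is declared.
    """
    # Per-test state: baseline window X_i (frozen once filled) and a
    # sliding window Y_i holding the most recent m2 points.
    baselines = [[] for _ in params]
    recents = [deque(maxlen=m2) for (_, m2, _) in params]

    for t, x in enumerate(stream):
        for i, (m1, m2, alpha) in enumerate(params):
            if len(baselines[i]) < m1:
                baselines[i].append(x)     # still filling the baseline after c0
                continue
            recents[i].append(x)
            if len(recents[i]) == m2 and dist(baselines[i], list(recents[i])) > alpha:
                yield t                    # declare CHANGE
                baselines[i] = []          # reset c0 and X_i
                recents[i].clear()
```

For example, with `dist` taken as the absolute difference of window means (a crude stand-in for d), a stream that jumps from 0 to 1 triggers a single alarm shortly after the jump.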
Statistical Requirements
We wish to support our statistical tests with formal, finite sample size guarantees that:
Control the rate of False Positives (`false alarms').
Control the rate of False Negatives (`missed detections').
Ensure the reliability of the change description.
Previous Work on the Two-Sample Problem
Mostly within the context of parametric statistics (assuming the underlying distributions come from a known family of `nice' distributions).
Previous applications were not concerned with memory and computation time limitations.
Performance guarantees are asymptotic – they apply only in the limit as sample sizes go to infinity.
Previous work focused on detection only – we wish to also describe the change.
The Need for a Probability-Distance Measure
False Positive guarantees are straightforward:
"If S1, S2 are samples of the same distribution, then the probability that the test will declare `CHANGE' is small."
False Negative guarantees are more delicate:
"If S1, S2 come from different distributions then, w.h.p., declare `CHANGE'."
This is infeasible as stated. One needs to quantify the difference: "d(P1, P2) > ε".
Inadequacy of Common Measures
The L1 norm (or `total variation')

L1(P1, P2) = 2 · Sup_{A ∈ E} |P1(A) − P2(A)|

(where E is the collection of measurable events) is too sensitive: for every sample-based test and every m, there are P1, P2 s.t. L1(P1, P2) > ¼, but the test fails to detect the change from m-samples.
Lp's for p > 1 are too insensitive.
A New Measure of Distance
Given a family F of domain subsets, we define

d_F(P1, P2) = Sup_{A ∈ F} |P1(A) − P2(A)|

Note that this is a pseudo-metric over probability distributions. Intuitively, F is chosen as a family of sets that the user cares about; d_F measures the largest change in probability over sets in F.
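As an added illustration (not from the talk): when F is the family of one-sided intervals (−∞, t] over the reals, the empirical d_F of two samples is exactly the two-sample Kolmogorov–Smirnov statistic, computable by a scan over pooled thresholds:

```python
import bisect

def d_initial_segments(s1, s2):
    """Empirical d_F for F = {(-inf, t]}: the largest gap between the two
    empirical CDFs, i.e. the two-sample Kolmogorov-Smirnov statistic."""
    s1s, s2s = sorted(s1), sorted(s2)
    best = 0.0
    # The empirical CDFs only change at sample points, so it suffices
    # to evaluate the gap at each pooled data value.
    for t in sorted(set(s1) | set(s2)):
        f1 = bisect.bisect_right(s1s, t) / len(s1)
        f2 = bisect.bisect_right(s2s, t) / len(s2)
        best = max(best, abs(f1 - f2))
    return best
```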
Major Merits of the F-Distance
If F is the family of disks or rectangles, d_F captures an intuitive notion of `localized change'.
If the family of sets F has a finite VC-dimension, then one gets finite sample-size guarantees against false negatives (w.r.t. d_F).
Background: VC-Dimension
The Vapnik-Chervonenkis dimension (VC-dim) is a parameter that measures the `combinatorial complexity' of a family of sets F:

Π_F(n) = Max { |{A ∩ B : A ∈ F}| : B ⊆ X, |B| = n }

VC-dim(F) = Sup { n : Π_F(n) = 2^n }

For algebraically defined families it is roughly the number of parameters needed to define a set in the family. So, VC-dim(planar disks) = 3 and VC-dim(axis-aligned rectangles) = 4.
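To make the definition concrete, here is an illustrative brute-force aside (not from the talk) for closed intervals on the line: intervals realize all labelings of any 2-point set but can never isolate the two outer points of a 3-point set, so VC-dim(intervals) = 2.

```python
from itertools import combinations

def shatters(points, k):
    """Check whether closed intervals [a, b] shatter some k-subset of `points`."""
    for subset in combinations(points, k):
        labelings = set()
        # On a finite point set, every interval's trace is determined by
        # endpoints chosen among the points; add the empty labeling too.
        for a in points:
            for b in points:
                if a <= b:
                    labelings.add(tuple(a <= p <= b for p in subset))
        labelings.add(tuple([False] * k))
        if len(labelings) == 2 ** k:
            return True
    return False
```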
VC-Based Guarantees
Let P1, P2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S1, S2 are i.i.d. samples of size m each, drawn by P1, P2 (respectively), then

P[ ∃A ∈ F : | |P1(A) − P2(A)| − |S1(A) − S2(A)| | > ε ] ≤ (2m)^d · 4 · e^{−mε²/4}
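A guarantee of the form P[...] ≤ (2m)^d · 4 · e^{−mε²/4} can be inverted numerically to choose window sizes: given d, ε, and a target failure probability δ, find the smallest m making the bound at most δ. A sketch (the bound's exact form is taken from this slide; the doubling-plus-bisection search is my own choice):

```python
import math

def required_sample_size(d, eps, delta):
    """Smallest m with (2m)^d * 4 * exp(-m * eps**2 / 4) <= delta.

    Works in log space; the log-bound rises (log term) then falls
    (linear term), so it crosses zero exactly once for delta <= 1.
    """
    def log_bound(m):
        return d * math.log(2 * m) + math.log(4) - m * eps**2 / 4 - math.log(delta)

    m = 1
    while log_bound(m) > 0:   # doubling search for an upper bracket
        m *= 2
    lo, hi = m // 2, m        # log_bound(lo) > 0 >= log_bound(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if log_bound(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return hi
```

For instance, `required_sample_size(3, 0.5, 0.05)` gives a window size of a few hundred points; smaller ε or δ inflates m rapidly, which is why the meta-algorithm runs several window sizes in parallel.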
VC-Based Guarantees (2)
In particular, we get

P[ |d_F(P1, P2) − d_F(S1, S2)| > ε ] ≤ (2m)^d · 4 · e^{−mε²/4}

where S_i(A) is the empirical measure

S_i(A) = |A ∩ S_i| / |S_i|
A Relativized Discrepancy
To focus on small-weight subsets, we define a variation of the d_F distance:

Φ_F(P1, P2) = Sup_{A ∈ F} |P1(A) − P2(A)| / sqrt( Min( (P1(A) + P2(A))/2 , 1 − (P1(A) + P2(A))/2 ) )
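An empirical version of Φ for F = one-sided intervals, as an added sketch: the denominator uses p = (S1(A) + S2(A)) / 2 following the relativized formula, and degenerate sets with min(p, 1 − p) = 0 are skipped (both details are my assumptions about how the slide's definition specializes to samples).

```python
import bisect
import math

def phi_initial_segments(s1, s2):
    """Empirical relativized discrepancy over F = {(-inf, t]}:
    |S1(A) - S2(A)| / sqrt(min(p, 1 - p)) with p = (S1(A) + S2(A)) / 2,
    maximized over all thresholds from the pooled sample."""
    s1s, s2s = sorted(s1), sorted(s2)
    best = 0.0
    for t in sorted(set(s1) | set(s2)):
        f1 = bisect.bisect_right(s1s, t) / len(s1)
        f2 = bisect.bisect_right(s2s, t) / len(s2)
        p = (f1 + f2) / 2
        denom = math.sqrt(min(p, 1 - p))
        if denom > 0:                       # skip degenerate sets
            best = max(best, abs(f1 - f2) / denom)
    return best
```

Note how the denominator amplifies discrepancies on low-weight sets, which plain d_F would flatten.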
Statistical Guarantees for the Relativized Discrepancy
Let P1, P2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S1, S2 are i.i.d. samples of size m each, drawn by P1, P2 (respectively), then

P[ |Φ_F(P1, P2) − Φ_F(S1, S2)| > ε ] ≤ (2m)^d · 4 · e^{−mε²/4}
Algorithms for Computing d_F(S1, S2)
We developed several basic algorithms that take a pair of samples S1, S2 as input and output the sets A in F that exhibit maximal empirical discrepancy. Our focus is the computational complexity of the algorithms as a function of the input sample sizes.
Algorithms – The Basic Ideas (1)
We say that a collection H of subsets is F-complete w.r.t. a sample S if for every A in F there exists a set B in H such that A ∩ S = B ∩ S.
It follows that, if H is F-complete w.r.t. S1 ∪ S2, then

d_F(S1, S2) ≤ d_H(S1, S2) and d_H(S1, S2) ≤ d_F(S1, S2)
Algorithms – The Basic Ideas (2)
The next step is to find finite collections of subsets that are F-complete for some natural families F. For example,

H_disk(S) = { D(s1, s2, s3) : s1, s2, s3 ∈ S }

where D(s1, s2, s3) is the disk whose boundary is defined by this triple of points.
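A brute-force sketch of the exhaustive disk algorithm this completeness result enables (illustrative only, not the authors' optimized code; the boundary tolerance is my assumption): enumerate H_disk(S) via circumcircles of pooled-sample triples and keep the maximal empirical discrepancy.

```python
from itertools import combinations

def circumcircle(p, q, r):
    """Center and squared radius of the circle through three points
    (None for collinear triples)."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), (ux - ax)**2 + (uy - ay)**2

def d_disk(s1, s2):
    """Exhaustive empirical d_F for planar disks: O(|S|^3) candidate
    disks, each checked against all |S| points, i.e. O(|S|^4) overall."""
    pooled = list(s1) + list(s2)
    best = 0.0
    for p, q, r in combinations(pooled, 3):
        cc = circumcircle(p, q, r)
        if cc is None:
            continue
        (ux, uy), r2 = cc
        inside = lambda pt: (pt[0] - ux)**2 + (pt[1] - uy)**2 <= r2 + 1e-9
        f1 = sum(map(inside, s1)) / len(s1)
        f2 = sum(map(inside, s2)) / len(s2)
        best = max(best, abs(f1 - f2))
    return best
```

When the two samples occupy disjoint clusters, the maximizing disk covers one cluster entirely and the discrepancy reaches 1.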
Running Times of Our Algorithms
For real-valued data we designed a data structure and an algorithm that require O(m1,i + m2,i) time at every initialization and O(log(m1,i + m2,i)) time for incremental updates for every newly arriving data point.
Running Times of Our Algorithms (2)
For two-dimensional data points, we consider two basic families F of sets in the plane: axis-aligned rectangles and planar disks.
For rectangles we get computational complexity O(|S|^3).
For disks we get an exhaustive algorithm that runs in time O(|S|^4), and an approximation algorithm of complexity O(|S|^2 log |S|).
Summary
We defined notions of spatial distances between probability distributions – changes that are detectable within local geometric regions (say, circles):

d_F(P1, P2) = Max_{A ∈ F} |P1(A) − P2(A)|

We apply Vapnik-Chervonenkis theory to derive confidence guarantees.
We develop efficient detection and estimation algorithms.
Novelty of Our Approach
Non-parametric statistics (we make no prior assumptions about the underlying distributions).
We provide performance guarantees for manageable (finite) sample sizes.
We develop computationally efficient algorithms for change detection and change estimation.