Online Detection of Change in Data Streams
Shai Ben-David, School of Computer Science, U. Waterloo

TRANSCRIPT
Some Change Detection Tasks
Quality control – Factory products are being regularly tested and scored. Can we detect when the distribution of scores changes?
Real estate prices – Following selling prices of houses in K-W. Can we tell when market trends change?
Problem Formalization
Data points are generated sequentially and independently by some underlying probability distribution. Viewing the generated stream of data points, we wish to detect when the underlying data-generating distribution changes (and how it changes).
Detection in Sensor Networks
We consider large-scale networks of sensors. Each sensor makes local binary decisions about the monitored physical phenomena: RED/GREEN. An observer collects a random sample of the sensors' readings.
[Figure: first and second rounds of data collection from the sensor network]
Is there a change in the underlying data-generating distribution? If a change has been detected, what exactly has changed?
Similar Issues in Other Disciplines
Ecology – Tracing the distribution of species over geographical locations.
Public Health – Tracing the spread of various diseases.
Census data analysis.
Our Basic Paradigm
Compare two sliding windows over the data stream:
[Figure: two windows S1 and S2 sliding along the time axis]
This reduces the change detection problem to the "two-sample" problem: given two samples S1, S2 generated by distributions P1, P2, infer from S1, S2 whether P1 = P2.
Meta-Algorithm for Online Change Detection
Explanation
Note that the meta-algorithm is actually running k independent algorithms in parallel – one for each triple (m1,i, m2,i, αi).
Each keeps a baseline window Xi, containing the m1,i points following the last-detected change point c0, and a second window Yi, containing the most recent m2,i points in the stream.
We declare CHANGE whenever d(Xi, Yi) > αi.
At such a point we reset c0 and Xi.
The different αi's reflect different levels of change sensitivity. The mi's are computed from the αi's using the theory outlined below.
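A minimal Python sketch of this meta-algorithm (not the authors' implementation): `dist` stands in for the distance d, and the (m1, m2, α) triples are caller-chosen illustrative parameters.

```python
from collections import deque

def run_change_detector(stream, dist, params):
    """Run k independent window-pair tests in parallel.

    `params` is a list of (m1, m2, alpha) triples; `dist` is any
    sample-distance function (e.g. an empirical d_F).  Yields the
    stream index at which each CHANGE is declared.
    """
    # Per-test state: baseline window X_i (frozen once filled) and a
    # sliding window Y_i holding the most recent m2 points.
    baselines = [[] for _ in params]
    recents = [deque(maxlen=m2) for (_, m2, _) in params]

    for t, x in enumerate(stream):
        for i, (m1, m2, alpha) in enumerate(params):
            if len(baselines[i]) < m1:
                baselines[i].append(x)     # still filling the baseline after c0
                continue
            recents[i].append(x)
            if len(recents[i]) == m2 and dist(baselines[i], list(recents[i])) > alpha:
                yield t                    # declare CHANGE
                baselines[i] = []          # reset c0 and X_i
                recents[i].clear()
```

For example, with `dist` taken as the absolute difference of window means (a crude stand-in for d), a stream that jumps from 0 to 1 triggers a single alarm shortly after the jump.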
Statistical Requirements
We wish to support our statistical tests with formal, finite sample size guarantees that:
Control the rate of False Positives (`false alarms').
Control the rate of False Negatives (`missed detections').
Ensure the reliability of the change description.
Previous Work on the Two-Sample Problem
Mostly within the context of parametric statistics (assuming the underlying distributions come from a known family of `nice' distributions).
Previous applications were not concerned with memory and computation time limitations.
Performance guarantees are asymptotic – they apply only in the limit as sample sizes go to infinity.
Previous work focused on detection only – we wish to also describe the change.
The Need for a Probability-Distance Measure
False Positive guarantees are straightforward:
"If S1, S2 are samples of the same distribution, then the probability that the test will declare `CHANGE' is small."
False Negative guarantees are more delicate:
"If S1, S2 come from different distributions then, w.h.p., declare `CHANGE'."
This is infeasible as stated. One needs to quantify the difference: "d(P1, P2) > ε".
Inadequacy of Common Measures
The L1 norm (or `total variation')

L1(P1, P2) = 2 · Sup_{A ∈ E} |P1(A) − P2(A)|

(where E is the collection of measurable events) is too sensitive: for every sample-based test and every m, there are P1, P2 s.t. L1(P1, P2) > ¼, but the test fails to detect the change from m-samples.
Lp's for p > 1 are too insensitive.
A New Measure of Distance
Given a family F of domain subsets, we define

d_F(P1, P2) = Sup_{A ∈ F} |P1(A) − P2(A)|

Note that this is a pseudo-metric over probability distributions. Intuitively, F is chosen as a family of sets that the user cares about; d_F measures the largest change in probability over sets in F.
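As an added illustration (not from the talk): when F is the family of one-sided intervals (−∞, t] over the reals, the empirical d_F of two samples is exactly the two-sample Kolmogorov–Smirnov statistic, computable by a scan over pooled thresholds:

```python
import bisect

def d_initial_segments(s1, s2):
    """Empirical d_F for F = {(-inf, t]}: the largest gap between the two
    empirical CDFs, i.e. the two-sample Kolmogorov-Smirnov statistic."""
    s1s, s2s = sorted(s1), sorted(s2)
    best = 0.0
    # The empirical CDFs only change at sample points, so it suffices
    # to evaluate the gap at each pooled data value.
    for t in sorted(set(s1) | set(s2)):
        f1 = bisect.bisect_right(s1s, t) / len(s1)
        f2 = bisect.bisect_right(s2s, t) / len(s2)
        best = max(best, abs(f1 - f2))
    return best
```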
Major Merits of the F-Distance
If F is the family of disks or rectangles, d_F captures an intuitive notion of `localized change'.
If the family of sets F has a finite VC-dimension, then one gets finite sample-size guarantees against false negatives (w.r.t. d_F).
Background: VC-Dimension
The Vapnik-Chervonenkis dimension (VC-dim) is a parameter that measures the `combinatorial complexity' of a family of sets F:

Π_F(n) = Max { |{A ∩ B : A ∈ F}| : B ⊆ X, |B| = n }

VC-dim(F) = Sup { n : Π_F(n) = 2^n }

For algebraically defined families it is roughly the number of parameters needed to define a set in the family. So, VC-dim(planar disks) = 3 and VC-dim(axis-aligned rectangles) = 4.
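To make the definition concrete, here is an illustrative brute-force aside (not from the talk) for closed intervals on the line: intervals realize all labelings of any 2-point set but can never isolate the two outer points of a 3-point set, so VC-dim(intervals) = 2.

```python
from itertools import combinations

def shatters(points, k):
    """Check whether closed intervals [a, b] shatter some k-subset of `points`."""
    for subset in combinations(points, k):
        labelings = set()
        # On a finite point set, every interval's trace is determined by
        # endpoints chosen among the points; add the empty labeling too.
        for a in points:
            for b in points:
                if a <= b:
                    labelings.add(tuple(a <= p <= b for p in subset))
        labelings.add(tuple([False] * k))
        if len(labelings) == 2 ** k:
            return True
    return False
```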
VC-Based Guarantees
Let P1, P2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S1, S2 are i.i.d. samples of size m each, drawn by P1, P2 (respectively), then

P[ ∃A ∈ F : | |P1(A) − P2(A)| − |S1(A) − S2(A)| | > ε ] ≤ (2m)^d · 4 · e^{−mε²/4}
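A guarantee of the form P[...] ≤ (2m)^d · 4 · e^{−mε²/4} can be inverted numerically to choose window sizes: given d, ε, and a target failure probability δ, find the smallest m making the bound at most δ. A sketch (the bound's exact form is taken from this slide; the doubling-plus-bisection search is my own choice):

```python
import math

def required_sample_size(d, eps, delta):
    """Smallest m with (2m)^d * 4 * exp(-m * eps**2 / 4) <= delta.

    Works in log space; the log-bound rises (log term) then falls
    (linear term), so it crosses zero exactly once for delta <= 1.
    """
    def log_bound(m):
        return d * math.log(2 * m) + math.log(4) - m * eps**2 / 4 - math.log(delta)

    m = 1
    while log_bound(m) > 0:   # doubling search for an upper bracket
        m *= 2
    lo, hi = m // 2, m        # log_bound(lo) > 0 >= log_bound(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if log_bound(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return hi
```

For instance, `required_sample_size(3, 0.5, 0.05)` gives a window size of a few hundred points; smaller ε or δ inflates m rapidly, which is why the meta-algorithm runs several window sizes in parallel.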
VC-Based Guarantees (2)
In particular, we get

P[ |d_F(P1, P2) − d_F(S1, S2)| > ε ] ≤ (2m)^d · 4 · e^{−mε²/4}

where S_i(A) is the empirical measure

S_i(A) = |A ∩ S_i| / |S_i|
A Relativized Discrepancy
To focus on small-weight subsets, we define a variation of the d_F distance:

Φ_F(P1, P2) = Sup_{A ∈ F} |P1(A) − P2(A)| / sqrt( Min( (P1(A) + P2(A))/2 , 1 − (P1(A) + P2(A))/2 ) )
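An empirical version of Φ for F = one-sided intervals, as an added sketch: the denominator uses p = (S1(A) + S2(A)) / 2 following the relativized formula, and degenerate sets with min(p, 1 − p) = 0 are skipped (both details are my assumptions about how the slide's definition specializes to samples).

```python
import bisect
import math

def phi_initial_segments(s1, s2):
    """Empirical relativized discrepancy over F = {(-inf, t]}:
    |S1(A) - S2(A)| / sqrt(min(p, 1 - p)) with p = (S1(A) + S2(A)) / 2,
    maximized over all thresholds from the pooled sample."""
    s1s, s2s = sorted(s1), sorted(s2)
    best = 0.0
    for t in sorted(set(s1) | set(s2)):
        f1 = bisect.bisect_right(s1s, t) / len(s1)
        f2 = bisect.bisect_right(s2s, t) / len(s2)
        p = (f1 + f2) / 2
        denom = math.sqrt(min(p, 1 - p))
        if denom > 0:                       # skip degenerate sets
            best = max(best, abs(f1 - f2) / denom)
    return best
```

Note how the denominator amplifies discrepancies on low-weight sets, which plain d_F would flatten.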
Statistical Guarantees for the Relativized Discrepancy
Let P1, P2 be any probability distributions over some domain set X, and let F be a family of subsets of X of finite VC-dimension d. For every 0 < ε < 1, if S1, S2 are i.i.d. samples of size m each, drawn by P1, P2 (respectively), then

P[ |Φ_F(P1, P2) − Φ_F(S1, S2)| > ε ] ≤ (2m)^d · 4 · e^{−mε²/4}
Algorithms for Computing d_F(S1, S2)
We developed several basic algorithms that take a pair of samples S1, S2 as input and output the sets A in F that exhibit maximal empirical discrepancy. Our focus is the computational complexity of the algorithms as a function of the input sample sizes.
Algorithms – The Basic Ideas (1)
We say that a collection H of subsets is F-complete w.r.t. a sample S if for every A in F there exists a set B in H such that A ∩ S = B ∩ S.
It follows that, if H is F-complete w.r.t. S1 ∪ S2, then

d_F(S1, S2) ≤ d_H(S1, S2) and d_H(S1, S2) ≤ d_F(S1, S2)
Algorithms – The Basic Ideas (2)
The next step is to find finite collections of subsets that are F-complete for some natural families F. For example,

H_disk(S) = { D(s1, s2, s3) : s1, s2, s3 ∈ S }

where D(s1, s2, s3) is the disk whose boundary is defined by this triple of points.
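A brute-force sketch of the exhaustive disk algorithm this completeness result enables (illustrative only, not the authors' optimized code; the boundary tolerance is my assumption): enumerate H_disk(S) via circumcircles of pooled-sample triples and keep the maximal empirical discrepancy.

```python
from itertools import combinations

def circumcircle(p, q, r):
    """Center and squared radius of the circle through three points
    (None for collinear triples)."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), (ux - ax)**2 + (uy - ay)**2

def d_disk(s1, s2):
    """Exhaustive empirical d_F for planar disks: O(|S|^3) candidate
    disks, each checked against all |S| points, i.e. O(|S|^4) overall."""
    pooled = list(s1) + list(s2)
    best = 0.0
    for p, q, r in combinations(pooled, 3):
        cc = circumcircle(p, q, r)
        if cc is None:
            continue
        (ux, uy), r2 = cc
        inside = lambda pt: (pt[0] - ux)**2 + (pt[1] - uy)**2 <= r2 + 1e-9
        f1 = sum(map(inside, s1)) / len(s1)
        f2 = sum(map(inside, s2)) / len(s2)
        best = max(best, abs(f1 - f2))
    return best
```

When the two samples occupy disjoint clusters, the maximizing disk covers one cluster entirely and the discrepancy reaches 1.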
Running Times of Our Algorithms
For real-valued data we designed a data structure and an algorithm that require O(m1,i + m2,i) time at every initialization and O(log(m1,i + m2,i)) time for incremental updates for every newly arriving data point.
Running Times of Our Algorithms (2)
For two-dimensional data points, we consider two basic families F of sets in the plane: axis-aligned rectangles and planar disks.
For rectangles we get computational complexity O(|S|^3).
For disks we get an exhaustive algorithm that runs in time O(|S|^4), and an approximation algorithm of complexity O(|S|^2 log |S|).
Summary
We defined notions of spatial distances between probability distributions – changes that are detectable within local geometric regions (say, circles):

d_F(P1, P2) = Max_{A ∈ F} |P1(A) − P2(A)|

We apply Vapnik-Chervonenkis theory to derive confidence guarantees.
We develop efficient detection and estimation algorithms.
Novelty of Our Approach
Non-parametric statistics (we make no prior assumptions about the underlying distributions).
We provide performance guarantees for manageable (finite) sample sizes.
We develop computationally efficient algorithms for change detection and change estimation.