Post on 15-Jan-2016
DB/IR '06 C. Faloutsos #1
Carnegie Mellon
Data Mining on Streams
Christos Faloutsos
CMU
THANK YOU!
• Prof. Panos Ipeirotis
• Julia Mills
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions
Problem definition - example
Each sensor collects data (x1, x2, …, xt, …)
Problem definition
• Given: one or more sequences
x1, x2, …, xt, …
(y1, y2, …, yt, …
…)
• Find
– patterns; correlations; outliers
– incrementally!
Limitations / Challenges
Find patterns using a method that is
• nimble: limited resources
– Memory
– Bandwidth, power, CPU
• incremental: on-line, ‘any-time’ response
– single pass (‘you get to see it only once’)
• automatic: no human intervention
– e.g., in remote environments
Application domains
• Sensor devices
– Temperature, weather measurements
– Road traffic data
– Geological observations
– Patient physiological data
• Embedded devices
– Network routers
– Intelligent (active) disks
Motivation - Applications (cont’d)
• ‘Smart house’
– sensors monitor temperature, humidity, air quality
• video surveillance
Motivation - Applications (cont’d)
• civil/automobile infrastructure
– bridge vibrations [Oppenheim+02]
– road conditions / traffic monitoring
Motivation - Applications (cont’d)
• Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
Motivation - Applications (cont’d)
• Computer systems
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
InteMon
w/ Evan Hoke, Jimeng Sun
self-* PetaByte data center at CMU
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• conclusions
Single sequence mining - AWSOM
• with Spiros Papadimitriou (CMU -> IBM)
• Anthony Brockwell (CMU/Stat)
Problem definition
• Semi-infinite streams of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…
Periodicity? (daily)
Periodicity? (twice daily)
“Noise”??
Requirements / Goals
• Adapt and handle arbitrary periodic components
and
• nimble (limited resources, single pass)
• on-line, any-time
• automatic (no human intervention/tuning)
Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results
Wavelets
Example – Haar transform
[Figure: Haar decomposition of the signal xt on a time-frequency grid: detail coefficients W1,1 to W1,4, W2,1 and W2,2, W3,1, and the smooth component V4,1 (“constant”)]
Wavelets
Why we like them
• Wavelets compress many real signals well:
– Image compression and processing
– Vision
– Astronomy, seismology, …
• Wavelet coefficients can be updated as new points arrive
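The last bullet can be illustrated with a minimal sketch, assuming the unnormalized averaging/differencing form of the Haar transform with detail = (left - right) / 2. The class name `StreamingHaar` and its attributes are illustrative and do not come from the talk. Only one pending value per level is kept, so memory grows with lg N.

```python
class StreamingHaar:
    """Maintains Haar coefficients as points arrive (illustrative sketch)."""

    def __init__(self):
        # pending[l] holds a smooth value at level l that is still waiting
        # for its right-hand sibling, or None if nothing is waiting.
        self.pending = []
        # Emitted detail coefficients as (level, value) pairs, in arrival order.
        self.details = []

    def push(self, x):
        value, level = float(x), 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                self.pending[level] = value      # wait for the sibling
                return
            left = self.pending[level]
            self.pending[level] = None
            # Assumed convention: detail = half-difference, smooth = average.
            self.details.append((level + 1, (left - value) / 2.0))
            value = (left + value) / 2.0
            level += 1                           # carry the average upward


haar = StreamingHaar()
for v in [2, 4, 6, 8]:
    haar.push(v)
print(haar.details)   # detail coefficients emitted so far
print(haar.pending)   # one slot per level
```

Each arriving point touches at most one slot per level, and on average only a constant number of levels, which is what makes the update cheap.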
Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results
AWSOM
[Figure: the stream xt and its wavelet coefficients (W1,1 to W1,4, W2,1, W2,2, W3,1, V4,1) on the time-frequency grid]
AWSOM - idea
$W_{l,t} = \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} = \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$
More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection (single-pass)
– Not all correlations are significant
– Throw away the insignificant ones (“noise”)
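The “incremental; RLS” note refers to recursive least squares. What follows is a minimal sketch of the textbook RLS update for a linear model y ≈ β·x, not AWSOM's actual code. The class name `RLS`, the initialisation constant `delta`, and the absence of a forgetting factor are assumptions made for illustration. Each update costs O(k^2) for k regression coefficients and needs no access to old data.

```python
class RLS:
    """Textbook recursive least squares for y ~ beta . x (illustrative sketch)."""

    def __init__(self, k, delta=1e-3):
        self.beta = [0.0] * k
        # P approximates the inverse of X^T X; a large initial value means a
        # weak prior on beta (delta is an assumed tuning constant).
        self.P = [[(1.0 / delta if i == j else 0.0) for j in range(k)]
                  for i in range(k)]

    def update(self, x, y):
        k = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(b * xi for b, xi in zip(self.beta, x))  # prediction error
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # Rank-one downdate of P (P is symmetric, so x^T P equals Px^T).
        self.P = [[self.P[i][j] - gain[i] * Px[j] for j in range(k)]
                  for i in range(k)]
        return err


model = RLS(k=2)
for t in range(1, 51):
    x = [float(t % 7), float((3 * t) % 5)]
    model.update(x, 2.0 * x[0] - 3.0 * x[1])   # noise-free target for the demo
print(model.beta)
```

In the setting of the previous slide, `x` would hold earlier wavelet coefficients such as W_{l,t-1} and W_{l,t-2}, and `y` the newest coefficient W_{l,t}.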
Complexity
• Model update
Space: $O(\lg N + m k^2) \approx O(\lg N)$
Time: $O(k^2) \approx O(1)$
Where
– N: number of points (so far)
– k: number of regression coefficients; fixed
– m: number of linear models; $O(\lg N)$
Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results
Results - Synthetic data
• Triangle pulse
• Mix (sine + square)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails
[Figure: panels for AWSOM, AR, and Seasonal AR]
Results - Real data
• Automobile traffic
– Daily periodicity
– Bursty “noise” at smaller scales
• AR fails to capture any trend
• Seasonal AR estimation fails
Results - real data
• Sunspot intensity
– Slightly time-varying “period”
• AR captures wrong trend
• Seasonal ARIMA
– wrong downward trend, despite help by human!
Conclusions
Adapt and handle arbitrary periodic components
and
nimble
– Limited memory (logarithmic)
– Constant-time update
on-line, any-time
– Single pass over the data
automatic: No human intervention/tuning
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• conclusions
Part 2
SPIRIT: Mining co-evolving streams
[Papadimitriou, Sun, Faloutsos, VLDB05]
Motivation
• E.g., chlorine concentration in water distribution network
Motivation
water distribution network
normal operation
May have hundreds of measurements, but it is unlikely they are completely unrelated!
[Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3]
Motivation
water distribution network
normal operation; major leak
[Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3, for sensors near the leak and sensors away from the leak]
Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: actual measurements (n streams) of chlorine concentrations next to the k hidden variable(s), built up in three steps: Phase 1, k = 1; Phases 1 and 2, k = 2; Phases 1 to 3, k = 1]
Goals
• Discover “hidden” (latent) variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
and the usual:
• nimble: Limited memory requirements
• on-line, any-time: (single pass etc)
• automatic: No special parameters to tune
Related work
Stream mining
• Stream SVD [Guha, Gunopulos, Koudas / KDD03]
• StatStream [Zhu, Shasha / VLDB02]
• Clustering
[Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
• Classification
[Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
• Piecewise approximations
[Palpanas, Vlachos, Keogh, et al / ICDE 2004]
• Queries on streams
[Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]
• …
Overview
Part 2
• Method
• Experiments
• Conclusions & Other work
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?
1. How to capture correlations?
[Figure: temperature (20°C to 30°C) over time for the first sensor (t1) and the second sensor (t2)]
1. How to capture correlations
Correlations:
Let’s take a closer look at the first three value-pairs…
[Figure: temperature t2 against temperature t1, both axes from 20°C to 30°C]
1. How to capture correlations
First three lie (almost) on a line in the space of value-pairs…
– O(n) numbers for the slope, and
– one number for each value-pair (offset on line)
[Figure: temperature t2 against temperature t1, both axes from 20°C to 30°C; the points for time=1, time=2, time=3 lie near a line; offset along the line = “hidden variable”]
1. How to capture correlations
Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: temperature t2 against temperature t1, both axes from 20°C to 30°C]
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?
Incremental updates
• Algorithm runs in O(n), where n = # of streams
• no need to access old data
[Figure: temperature T2 against temperature T1, both axes from 20°C to 30°C, showing the error of a new point relative to the current line]
Stream correlations
Principal Component Analysis (PCA)
• The “line” is the first principal component (PC)
• This line is optimal: it minimizes the sum of squared projection errors
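As a small batch illustration of the optimality claim, the sketch below finds the first principal component of 2-D value-pairs with the closed-form major-axis angle, then measures the sum of squared projection errors. The function names are illustrative, and centring the line on the mean of the points is an assumption of this sketch.

```python
import math


def first_pc_2d(points):
    """Return (mean, unit direction) of the first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Angle of the major axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def squared_projection_error(points, mean, w):
    """Sum of squared distances to the line through `mean` along unit `w`."""
    total = 0.0
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        along = dx * w[0] + dy * w[1]
        total += dx * dx + dy * dy - along * along
    return total


# Made-up temperature pairs (t1, t2), for illustration only.
pairs = [(20.0, 21.0), (24.0, 24.5), (27.0, 27.0), (30.0, 31.0)]
mean, w = first_pc_2d(pairs)
print(w, squared_projection_error(pairs, mean, w))
```

Trying any other direction through the mean in `squared_projection_error` should give an error at least as large.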
2. Incremental update
Given number of hidden variables k
• Assuming k is known
• We know how to update the slope
For each new point x and for i = 1, …, k:
• $y_i := w_i^T x$ (proj. onto $w_i$)
• $d_i \leftarrow d_i + y_i^2$ (energy ∝ i-th eigenval.)
• $e_i := x - y_i w_i$ (error)
• $w_i \leftarrow w_i + (1/d_i) \, y_i e_i$ (update estimate)
• $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: a point x, its projection y1 onto w1, the error e1, and the updated w1]
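The five update steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `spirit_update` is made up, the energies `d` are assumed to start at a small positive value, and the remainder is computed with the already-updated w_i (the slide leaves both points open).

```python
import math


def spirit_update(W, d, x):
    """Apply the slide's update for one new point x.

    W: list of k direction vectors (updated in place)
    d: list of k running energies (updated in place)
    Returns the k hidden-variable values y_1..y_k for this point.
    """
    x = list(x)
    y = []
    for i in range(len(W)):
        w = W[i]
        yi = sum(a * b for a, b in zip(w, x))                    # projection onto w_i
        d[i] += yi * yi                                          # energy
        e = [xj - yi * wj for xj, wj in zip(x, w)]               # error
        W[i] = [wj + (yi / d[i]) * ej for wj, ej in zip(w, e)]   # update estimate
        # Assumption: the remainder uses the updated w_i.
        x = [xj - yi * wj for xj, wj in zip(x, W[i])]
        y.append(yi)
    return y


# Two streams whose values always lie along the direction (1, 2).
u = (1 / math.sqrt(5), 2 / math.sqrt(5))
W, d = [[1.0, 0.0]], [1e-3]          # k = 1; the starting values are assumptions
for t in range(200):
    s = 1.0 + (t % 3)
    spirit_update(W, d, [s * u[0], s * u[1]])
print(W[0])   # should point along u
```

The cost per point is O(nk), and no earlier points are revisited.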
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust k, the number of hidden variables?
Answer
• When the reconstruction accuracy is too low (say, <95%)
• then introduce another hidden variable (k++)
• [How to initialize its values: tricky]
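One way to sketch the k++ rule is shown below. It assumes that “reconstruction accuracy” is measured as the fraction of the total energy (sum of squares) of the input captured by the hidden variables. The class name `EnergyTracker` is illustrative, and the tricky initialisation of the new hidden variable is not addressed here.

```python
class EnergyTracker:
    """Decides when to add a hidden variable (illustrative sketch)."""

    def __init__(self, n, k=1, low=0.95):
        self.n, self.k, self.low = n, k, low
        self.total = 0.0      # running energy of the raw measurements
        self.captured = 0.0   # running energy of the hidden variables

    def observe(self, x, y):
        """x: the n measurements at this tick; y: the k hidden-variable values."""
        self.total += sum(v * v for v in x)
        self.captured += sum(v * v for v in y)
        if self.k < self.n and self.captured < self.low * self.total:
            self.k += 1       # accuracy too low: introduce another hidden variable
        return self.k


tracker = EnergyTracker(n=3)
print(tracker.observe([3.0, 4.0, 0.0], [5.0]))   # all energy captured
print(tracker.observe([0.0, 0.0, 5.0], [0.0]))   # half of the energy is missed
```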
Missing values
[Figure: temperature T2 against temperature T1, both axes from 20°C to 30°C, showing the true values (pair), all possible value pairs (given only t1), and the best guess (given correlations: intersection)]
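For two streams, the “intersection” can be sketched as follows: the known t1 fixes a vertical line, and the best guess for t2 is where that line meets the correlation line. The function name `guess_missing_t2`, and the assumption that the correlation line passes through a known point `mean` with direction `w`, are illustrative choices.

```python
def guess_missing_t2(t1, mean, w):
    """Estimate t2 from t1 using the line through `mean` with direction `w`."""
    m1, m2 = mean
    w1, w2 = w
    if abs(w1) < 1e-12:
        # A vertical correlation line carries no information about t2 from t1.
        raise ValueError("direction has no component along t1")
    return m2 + (t1 - m1) * (w2 / w1)


# Hypothetical numbers: the line passes through (25, 25) with slope 1/2.
print(guess_missing_t2(29.0, mean=(25.0, 25.0), w=(2.0, 1.0)))
```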
Forecasting
• Assume we want to forecast the next value for a particular stream (e.g. auto-regression)
[Figure: n streams, with the next value marked “?”]
Forecasting
• Option 1: One complex model per stream
– Next value = function of previous values on all streams
– Captures correlations
– Too costly! [ ~ O(n^3) ]
[Figure: n streams]
Forecasting
• Option 1: One complex model per stream
• Option 2: One simple model per stream
– Next value = function of previous value on same stream
– Worse accuracy, but maybe acceptable
– But, still need n models
[Figure: n streams]
Forecasting
[Figure: n streams summarized by k hidden variables]
• k hidden vars: k << n and already capture correlations
• Only k simple models
• Efficiency & robustness
Time/space requirements
Incremental PCA
O(nk) space (total) and time (per tuple), i.e.,
• Independent of # points
• Linear w.r.t. # streams (n)
• Linear w.r.t. # hidden variables (k)
In fact,
• Can be done in real time
Overview
Part 2
• Method
• Experiments
• Conclusions & Other work
Experiments
Chlorine concentration
166 streams
2 hidden variables (~4% error)
[Figure: measurements and reconstruction]
[CMU Civil Engineering]
Experiments
Chlorine concentration
[Figure: hidden variables]
• Both capture global, periodic pattern
• Second: ~ first, but phase-shifted
• Can express any phase-shift…
[CMU Civil Engineering]
Experiments
Light measurements
54 sensors
2-4 hidden variables (~6% error)
[Figure: measurement and reconstruction]
Experiments
Light measurements
• 1 & 2: main trend (as before)
• 3 & 4: potential anomalies and outliers
[Figure: hidden variables, two of them labelled “intermittent”]
Conclusions
SPIRIT:
• Discovers hidden variables for
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
• Incremental, real time computation
• nimble: With limited memory
• automatic: No special parameters to tune
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions
Part 3:
BRAID: Discovering Lag Correlations in Multiple Streams
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD’05
Lag Correlations
• Examples
– A decrease in interest rates typically precedes an increase in house sales by a few months
– Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
Lag Correlations
• Example of lag-correlated sequences
These sequences are correlated with lag l=1300 time-ticks
[Figure: CCF (Cross-Correlation Function)]
How to compute it
• quickly
• cheaply
• incrementally
Challenging Problems
• Problem definitions
– Given two co-evolving sequences X and Y, determine
• Whether there is a lag correlation
• If yes, what the lag length l is
– Given k numerical sequences, X1, …, Xk, report
• Which pairs have a lag correlation
• The corresponding lag for each pair
Our solution
• Ideal characteristics:
– ‘Any-time’ processing, and fast
Computation time per time tick is constant
– Nimble
Memory space requirement is sub-linear in the sequence length
– Accurate
Approximation introduces small error
Related Work
• Sequence indexing
– Agrawal et al. (FODO 1993)
– Faloutsos et al. (SIGMOD 1994)
– Keogh et al. (SIGMOD 2001)
• Compression (wavelet and random projections)
– Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
– Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)
• Data Stream Management
– Abadi et al. (VLDB Journal 2003)
– Motwani et al. (CIDR 2003)
– Chandrasekaran et al. (CIDR 2003)
– Cranor et al. (SIGMOD 2003)
Related Work
• Pattern discovery
– Clustering for data streams
Guha et al. (TKDE 2003)
– Monitoring multiple streams
Zhu et al. (VLDB 2002)
– Forecasting
Yi et al. (ICDE 2000)
Papadimitriou et al. (VLDB 2003)
• None of the previously published methods focuses on this problem
Overview
• Introduction / Related work
• Background
• Main ideas
• Theoretical analysis
• Experimental results
Main Idea (1)
• Incremental computation
– Sufficient statistics
• Sum of X: $Sx(1,n) = \sum_{t=1}^{n} x_t$
• Square sum of X: $Sxx(1,n) = \sum_{t=1}^{n} x_t^2$
• Inner-product for X and the shifted Y: $Sxy(l) = \sum_{t=l+1}^{n} x_t \, y_{t-l}$
– Compute R(l) incrementally:
$R(l) = \frac{C(l)}{\sqrt{Vx(l+1,n) \cdot Vy(1,n-l)}}$
• Covariance of X and Y: $C(l) = Sxy(l) - \frac{Sx(l+1,n) \cdot Sy(1,n-l)}{n-l}$
• Variance of X: $Vx(l+1,n) = Sxx(l+1,n) - \frac{(Sx(l+1,n))^2}{n-l}$
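The sufficient statistics above can be maintained with running sums, as in the following sketch for one fixed lag l. This is the exact, naive variant: it buffers the last l values of Y, so it does not include the smoothing and geometric probing that BRAID uses to save memory. The class name `LagCorrelation` and its method names are illustrative.

```python
import math
from collections import deque


class LagCorrelation:
    """Tracks R(l) for one fixed lag l >= 0 from running sums (sketch)."""

    def __init__(self, lag):
        self.lag = lag
        self.n = 0
        self.ybuf = deque()        # values of y not yet paired with an x
        self.sx = self.sxx = 0.0   # Sx(l+1, n) and Sxx(l+1, n)
        self.sy = self.syy = 0.0   # Sy(1, n-l) and Syy(1, n-l)
        self.sxy = 0.0             # Sxy(l)

    def push(self, x, y):
        self.n += 1
        self.ybuf.append(y)
        if self.n > self.lag:
            y_old = self.ybuf.popleft()      # this is y_{t-l}
            self.sx += x
            self.sxx += x * x
            self.sy += y_old
            self.syy += y_old * y_old
            self.sxy += x * y_old

    def r(self):
        m = self.n - self.lag                # number of paired values, n - l
        if m < 2:
            return None
        c = self.sxy - self.sx * self.sy / m
        vx = self.sxx - self.sx ** 2 / m
        vy = self.syy - self.sy ** 2 / m
        if vx <= 0.0 or vy <= 0.0:
            return None                      # a constant sequence has no correlation
        return c / math.sqrt(vx * vy)


# Synthetic demo: x repeats y with a delay of 3 ticks.
ys = [float((7 * t) % 11) for t in range(1, 61)]
xs = [0.0, 0.0, 0.0] + ys[:-3]
at_lag_3, at_lag_0 = LagCorrelation(3), LagCorrelation(0)
for x, y in zip(xs, ys):
    at_lag_3.push(x, y)
    at_lag_0.push(x, y)
print(at_lag_3.r(), at_lag_0.r())
```

Each tick costs O(1) time per tracked lag; the O(l) buffer is the part that the ideas on the following slides aim to remove.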
Main Idea (2)
• Sequence smoothing
– Means of windows for each level
– Sufficient statistics computed from the means
– CCF computed from the sufficient statistics
– But, it allows a partial redundancy
[Figure: axes Lag, Correlation, Level (starting at h=0) and Time (up to t=n)]
Main Idea (3)
• Geometric lag probing
– Use colored windows
– Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
– Use a cubic spline to interpolate
[Figure: axes Lag, Correlation, Level (starting at h=0) and Time (up to t=n)]
Overview
• Introduction / Related work
• Background
• Main ideas
• Theoretical analysis
• Experimental results
Experimental results
• Setup
– Intel Xeon 2.8GHz, 1GB memory, Linux
– Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots
– Enhanced BRAID, b=16
• Evaluation
– Estimation error of lag correlations
– Computation time
Detecting Lag Correlations (2)
• SpikeTrains
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
Detecting Lag Correlations (3)
• Humidity
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
Detecting Lag Correlations (4)
• Light
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
Detecting Lag Correlations (5)
• Kursk
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
Estimation Error
• Largest relative error is about 1%
Datasets      Lag correlation (Naive)   Lag correlation (BRAID)   Estimation error (%)
Sines         716                       716                       0.000
SpikeTrains   2841                      2830                      0.387
Humidity      3842                      3855                      0.338
Light         567                       570                       0.529
Kursk         1463                      1472                      0.615
Sunspots      1156                      1168                      1.038
Performance
• Almost linear w.r.t. sequence length
• Up to 40,000 times faster
Group Lag Correlations
• Two correlated pairs from 55 Temperature sequences
• Each sensor is located in a different place
[Figure: sequences #16 and #19 with the estimation of their CCF; sequences #47 and #48 with the estimation of their CCF]
Conclusions
Automatic lag correlation detection on stream data
• incremental – online, ‘any-time’
• nimble
– O(log n) space, O(1) time to update the statistics
– Up to 40,000 times faster than the naive implementation
• Accurate
– Detecting the correct lag within 1% relative error or less
Overall Conclusions
• Mining streaming numerical data: challenging!
• Extensions: streaming matrix data (e.g., network traffic matrix)
[Figure: network traffic data with axes IP-source, IP-destination, and time]
Thank you
• christos <at> cs.cmu.edu
• www.cs.cmu.edu/~christos
• [InteMon demo]