Incremental Pattern Discovery on Streams, Graphs and
Tensors
Jimeng Sun
Ph.D.Thesis Proposal
May 15, 2006
2
Thesis Committee
Christos Faloutsos (Chair)
Tom Mitchell
Hui Zhang
David Steier, PricewaterhouseCoopers
Philip Yu, IBM Watson Research Center
3
Thesis Proposal
Goal: incremental pattern discovery on streaming applications
- Streams: E1: Environmental sensor networks; E2: Cluster/data center monitoring
- Graphs: E3: Social network analysis
- Tensors: E4: Network forensics; E5: Financial auditing; E6: fMRI brain image analysis
How to summarize streaming data efficiently and incrementally?
4
E1: Environmental Sensor Monitoring
Water distribution network, normal operation.
May have hundreds of measurements, and they are often related!
[Figure: chlorine concentration time series over Phases 1-3, for sensors near the leak and sensors away from the leak]
CMU Civil Engineering Department, Prof. Jeanne M. VanBriesen
5
E1: Environmental Sensor Monitoring
Water distribution network: normal operation vs. major leak.
[Figure: chlorine concentration time series over Phases 1-3 during a major leak, for sensors near the leak and sensors away from the leak]
CMU Civil Engineering Department, Prof. Jeanne M. VanBriesen
May have hundreds of measurements, and they are often related!
6
E1: Environmental Sensor Monitoring
We would like to discover a few "hidden" (latent) variables that summarize the key trends.
[Figure: actual measurements (n streams) over Phases 1-3 are summarized by k hidden variables, k = 1-2]
SPIRIT
7
E3: Social network analysis
Traditionally, people focus on static networks and find community structures. We plan to monitor the change of the community structure over time and identify abnormal individuals.
[Figure: author-keyword graphs with DB and DM communities, evolving from 1990 to 2004]
8
E4: Network forensics
Directional network flows. A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]: 450 GB/hour with compression.
Task: identify abnormal traffic patterns and find out the cause.
[Figure: source x destination traffic matrices, normal traffic vs. abnormal traffic]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
9
Commonality of all
Data: continuously arriving, large volume, multi-dimensional, unlabeled
Task: incremental pattern discovery
- Main trends
- Anomalies
10
Thesis statement
Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.
11
Outline
- Motivating examples
- Data model and mining framework
- Related work
- Current work
- Proposed work
- Conclusion
12
Static Data Model: Tensor
Formally, an order-M tensor is an element of R^{N_1 x N_2 x ... x N_M}: a generalization of matrices, represented as a multi-array (data cube).

Order:          1st       2nd                  3rd
Correspondence: Vector    Matrix               3D array
Example:        Sensors   Authors x Keywords   Sources x Destinations x Ports
13
Dynamic Data Model (our focus)
Tensor streams: a sequence of Mth order tensors X_1, ..., X_n, where each X_t is in R^{N_1 x ... x N_M} and n is increasing over time.

Order:          1st                2nd                          3rd
Correspondence: Multiple streams   Time-evolving graphs         3D arrays over time
Example:        Sensors x time     Author x Keyword over time   Sources x Destinations x Ports over time
14
Application Modules
Our framework for incremental pattern discovery. Mining flow:
Data streams -> (preprocessing) -> tensor streams -> (tensor analysis) -> core tensors + projections -> anomaly detection, clustering, prediction.
15
Outline
- Motivating examples
- Data model and mining framework
- Related work
- Current work
- Proposed work
- Conclusion
16
Related work
- Low-rank approximation: PCA, SVD (orthogonal projection); CUR [Drineas05] (example-based projection)
- Multilinear analysis: tensors (matricizing, mode-product); tensor decompositions (Tucker, PARAFAC, HOSVD)
- Stream mining: scan data once to identify patterns; sampling [Vitter85], [Gibbons98]; sketches [Indyk00], [Cormode03]
- Graph mining: explorative [Faloutsos04], [Kumar99], [Leskovec05], ...; algorithmic [Yan05], [Cormode05], ...
Our work sits at the intersection of these areas.
17
Background: Singular Value Decomposition (SVD)
SVD: A_{m x n} ≈ U_{m x k} Σ_{k x k} V^T_{k x n}
- Best rank-k approximation in L2
- PCA is an important application of SVD
- Note that U and V are dense and may have negative entries
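A minimal sketch of a rank-k SVD approximation in NumPy; the matrix A and the rank k here are illustrative placeholders, not data from the proposal.

```python
import numpy as np

def rank_k_svd(A, k):
    """Return the rank-k SVD factors and the best rank-k approximation in L2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-k approximation
    return U_k, s_k, Vt_k, A_k

A = np.random.rand(6, 5)              # illustrative dense matrix
U_k, s_k, Vt_k, A_k = rank_k_svd(A, k=2)
# U_k and Vt_k are dense and may have negative entries, as the slide notes
print(np.linalg.norm(A - A_k))        # Frobenius reconstruction error
```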
18
Background: Latent Semantic Indexing (LSI)
Singular vectors are useful for clustering. Example: a document-term matrix over the terms {pattern, cluster, query, cache, frequent}, containing two document clusters (DM and DB):

    [1 1 1 0 0]       [0.18 0   ]
    [2 2 2 0 0]       [0.36 0   ]
    [1 1 1 0 0]       [0.18 0   ]   [9.64 0   ]   [0.58 0.58 0.58 0    0   ]
    [5 5 5 0 0]   =   [0.90 0   ] x [0    5.29] x [0    0    0    0.71 0.71]
    [0 0 0 2 2]       [0    0.53]
    [0 0 0 3 3]       [0    0.80]
    [0 0 0 1 1]       [0    0.27]

The three factors give document-to-concept similarity, concept association (strength), and concept-to-term similarity.
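The slide's numbers can be reproduced directly; a sketch with NumPy on the exact matrix above (note that SVD sign conventions may flip the singular vectors relative to the slide):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(U[:, :2], 2))   # document-to-concept: two clear clusters
print(np.round(s[:2], 2))      # concept strengths: ~9.64 and ~5.29
print(np.round(Vt[:2], 2))     # concept-to-term: {pattern,cluster,query} vs {cache,frequent}
```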
19
Background: Tensor Operations
Matricizing: unfold a tensor into a matrix.
[Figure: a source x destination x port tensor unfolded along the source mode into a source x (destination*port) matrix]
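A minimal sketch of mode-d matricizing with NumPy; the tensor shape (sources x destinations x ports) follows the slide, but the values are random placeholders.

```python
import numpy as np

def matricize(T, mode):
    """Unfold tensor T along `mode`: rows index that mode, columns the rest.
    Column ordering conventions differ across the tensor literature; this
    sketch uses NumPy's default C (row-major) ordering."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.random.rand(4, 3, 2)   # source x destination x port
X0 = matricize(T, 0)          # 4 x 6: source vs. (destination*port)
print(X0.shape)
```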
20
Background: Tensor Operations
Mode-product: multiply a tensor with a matrix along one mode.
[Figure: a source x destination x port tensor multiplied along the source mode by a grouping matrix, yielding a source-group x destination x port tensor]
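A minimal mode-product sketch via np.tensordot; here a hypothetical matrix U groups the 4 sources into 2 "groups" as in the slide (values are random placeholders).

```python
import numpy as np

def mode_product(T, U, mode):
    """Multiply tensor T by matrix U along `mode` (new dim = U.shape[0])."""
    out = np.tensordot(U, T, axes=(1, mode))   # contracted axis comes out first
    return np.moveaxis(out, 0, mode)           # put it back in place

T = np.random.rand(4, 3, 2)       # source x destination x port
U = np.random.rand(2, 4)          # maps 4 sources to 2 source-groups
print(mode_product(T, U, 0).shape)   # (2, 3, 2)
```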
21
Outline
- Data model
- Framework
- Related work
- Current work
  - Dynamic and Streaming Tensor Analysis (DTA/STA)
  - Compact Matrix Decomposition (CMD)
- Proposed work
- Conclusion
22
Methodology map (rows: data order; columns: static vs. dynamic)

Order    Static                      Dynamic
1st      -                           1st order DTA, SPIRIT (1st order STA)
2nd      SVD, PCA, CMD               DTA, STA
>=3rd    PARAFAC, HOSVD, TensorPCA   DTA, STA
23
Tensor analysis
Given a sequence of tensors X_1, ..., X_t, ..., find the projection matrices U_1, ..., U_M such that the reconstruction error e is minimized. Note that this is a generalization of PCA when n is a constant.
[Figure: a sources x destinations x ports tensor decomposed into a core tensor and source, destination, and port projection matrices]
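The objective formula itself was lost in extraction; the following is a plausible LaTeX reconstruction, assuming the Frobenius norm and the per-mode projections U_d, consistent with the DTA formulation later in the proposal:

```latex
e \;=\; \sum_{t} \left\| \mathcal{X}_t \;-\; \mathcal{X}_t \times_1 (U_1 U_1^{\top}) \times_2 (U_2 U_2^{\top}) \cdots \times_M (U_M U_M^{\top}) \right\|_F^{2}
```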
24
Why do we care?
- Anomaly detection: reconstruction-error driven, at multiple resolutions
- Multiway latent semantic indexing (LSI)
[Figure: author x keyword tensors from 1990 to 2004 with projection matrices UA (authors) and UK (keywords); DB and DM concept groups, their members (e.g., Philip Yu, Michael Stonebraker) and keywords (query, pattern) drift over time]
25
1st order DTA: problem
Given x_1, ..., x_n, where each x_i is in R^N, find U in R^{N x R} such that the error e is small.
[Figure: an n x N matrix X (time x sensors, e.g., indoor and outdoor sensors) is compressed to an n x R matrix Y via U]
Note that Y = XU.
26
1st order DTA
Input: new data vector x in R^N; old variance matrix C in R^{N x N}
Output: new projection matrix U in R^{N x R}
Algorithm:
1. Update the variance matrix: C_new = x^T x + C (x as a row vector)
2. Diagonalize: U Λ U^T = C_new
3. Determine the rank R and return U
Diagonalization has to be done for every new x!
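A sketch of this update in NumPy. Rank selection is simplified to a fixed R (the slide says "determine the rank R" without specifying a rule), and the stream is random placeholder data.

```python
import numpy as np

def dta_update(C, x, R):
    """x: new data vector (length N); C: old N x N variance matrix."""
    C_new = C + np.outer(x, x)              # C_new = x^T x + C (x as row)
    evals, evecs = np.linalg.eigh(C_new)    # diagonalize: U Λ U^T = C_new
    order = np.argsort(evals)[::-1]         # largest eigenvalues first
    U = evecs[:, order[:R]]                 # N x R projection matrix
    return C_new, U

N, R = 10, 2
C = np.zeros((N, N))
for x in np.random.rand(100, N):            # stream of data vectors
    C, U = dta_update(C, x, R)              # full diagonalization each step
```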
27
1st order STA: SPIRIT
Adjust U smoothly when new data arrive, without diagonalization.
For each new point x:
- Project onto the current line
- Estimate the error
- Rotate the line in the direction of the error and in proportion to its magnitude
For each new point x and for i = 1, ..., k:
  y_i := U_i^T x                  (projection onto U_i)
  d_i <- d_i + y_i^2              (energy ~ i-th eigenvalue)
  e_i := x - y_i U_i              (error)
  U_i <- U_i + (1/d_i) y_i e_i    (update estimate)
  x <- x - y_i U_i                (repeat with remainder)
[Figure: in a Sensor 1 vs. Sensor 2 plot, U rotates toward the error of each new point]
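A sketch of the SPIRIT update exactly as listed on this slide; k and the initialization below are illustrative choices, not prescribed by the proposal.

```python
import numpy as np

def spirit_update(U, d, x):
    """U: N x k estimated eigenvectors; d: per-component energies; x: new point."""
    x = x.copy()
    y = np.zeros(U.shape[1])
    for i in range(U.shape[1]):
        y[i] = U[:, i] @ x                   # project onto current U_i
        d[i] += y[i] ** 2                    # energy ~ i-th eigenvalue
        e = x - y[i] * U[:, i]               # estimation error
        U[:, i] += (1.0 / d[i]) * y[i] * e   # rotate toward the error
        x -= y[i] * U[:, i]                  # repeat with the remainder
    return U, d, y

N, k = 5, 2
U, d = np.eye(N, k), np.full(k, 1e-4)        # tiny d avoids division by zero
for x in np.random.rand(200, N):
    U, d, y = spirit_update(U, d, x)         # y: the k hidden variables
```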
28
Mth order DTA
For each mode d of the new tensor X:
1. Matricize X along mode d (with transpose as needed), giving X_(d); construct the variance matrix of the incremental tensor, X_(d)^T X_(d)
2. Update the variance matrix: C_d <- C_d + X_(d)^T X_(d)
3. Diagonalize: C_d = U_d S_d U_d^T and reconstruct as needed
[Diagram: matricizing/transpose -> construct -> update -> diagonalize -> reconstruct, repeated per mode]
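A sketch of one Mth-order DTA step following this flow. The matricization convention (and hence whether the product appears as X_(d) X_(d)^T or X_(d)^T X_(d)) is an assumption of this sketch; the per-mode ranks are illustrative.

```python
import numpy as np

def matricize(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def dta_mth_update(Cs, T, ranks):
    """Cs[d]: N_d x N_d variance matrix per mode; ranks[d]: target rank R_d."""
    Us = []
    for d in range(T.ndim):
        Xd = matricize(T, d)                 # N_d x (prod of other N_i)
        Cs[d] = Cs[d] + Xd @ Xd.T            # update mode-d variance matrix
        evals, evecs = np.linalg.eigh(Cs[d])
        order = np.argsort(evals)[::-1]
        Us.append(evecs[:, order[:ranks[d]]])  # N_d x R_d projection
    return Cs, Us

T = np.random.rand(4, 3, 2)                  # e.g., source x dest x port
Cs = [np.zeros((n, n)) for n in T.shape]
Cs, Us = dta_mth_update(Cs, T, ranks=[2, 2, 1])
```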
29
Mth order DTA: complexity
- Storage: O(Π_i N_i), i.e., the size of an input tensor at a single timestamp
- Computation: Σ_d N_d^3 (or Σ_d N_d^2) for diagonalizing each C_d, plus Σ_d N_d Π_i N_i for the matrix multiplications X_(d)^T X_(d)
- For low-order tensors (order < 3), diagonalization is the main cost
- For high-order tensors, matrix multiplication is the main cost
30
Streaming Tensor Analysis (STA)
Run SPIRIT along each mode of the matricized tensor.
Complexity:
- Storage: O(Π_i N_i)
- Computation: Σ_d R_d Π_i N_i, which is smaller than DTA
[Figure: a new fiber x is projected onto U_1 (giving y_1), the error e_1 is computed, and U_1 is updated]
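A minimal STA sketch following this slide: run the SPIRIT update along each mode, treating every column of the mode-d matricization as an incoming point. It reuses spirit_update() from the earlier sketch; the per-mode ranks are illustrative.

```python
import numpy as np

def sta_update(Us, ds, T):
    """Us[d]: N_d x R_d SPIRIT estimate; ds[d]: energies for mode d."""
    for d in range(T.ndim):
        Xd = np.moveaxis(T, d, 0).reshape(T.shape[d], -1)   # matricize
        for fiber in Xd.T:               # each column is a length-N_d point
            Us[d], ds[d], _ = spirit_update(Us[d], ds[d], fiber)
    return Us, ds

T = np.random.rand(4, 3, 2)
ranks = [2, 2, 1]
Us = [np.eye(n, r) for n, r in zip(T.shape, ranks)]
ds = [np.full(r, 1e-4) for r in ranks]
Us, ds = sta_update(Us, ds, T)           # no diagonalization needed
```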
31
Experiment
Goals:
- Computational efficiency
- Accurate approximation
- Real applications: anomaly detection, clustering
32
Data set 1: Network data
- TCP flows collected at the CMU backbone; raw data 500 GB with compression
- Construct 2nd or 3rd order tensors with hourly windows: <source, destination, value> or <source, destination, port, value>
- Each tensor: 500 x 500 or 500 x 500 x 100, biased-sampled from over 22k hosts
- 1200 timestamps (hours)
- Sparse data with a power-law distribution
[Figure: traffic distribution, 10AM to 11AM on 01/06/2005]
33
Data set 2: Bibliographic data (DBLP)
- Papers from VLDB and KDD conferences
- Construct 2nd order tensors with yearly windows: <author, keywords, num>
- Each tensor: 4584 x 3741
- 11 timestamps (years)
34
Computational cost
3rd order network tensor and 2nd order DBLP tensor; OTA is the offline tensor analysis. Performance metric: CPU time (sec).
Observations:
- DTA and STA are orders of magnitude faster than OTA
- The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
35
Accuracy comparison
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%.
Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger year-to-year changes.
[Plots: 3rd order network tensor; 2nd order DBLP tensor]
36
Network anomaly detection
Reconstruction error gives an indication of anomalies. The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus network administrators).
[Figure: reconstruction error over time, normal traffic vs. abnormal traffic]
37
Multiway LSI
- 1995 (DB): authors michael carey, michael stonebraker, h. jagadish, hector garcia-molina; keywords queri, parallel, optimization, concurr, objectorient
- 2004 (DB): authors surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel; keywords distribut, systems, view, storage, servic, process, cache
- 2004 (DM): authors jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal; keywords streams, pattern, support, cluster, index, gener, queri
Two groups are correctly identified: databases (DB) and data mining (DM). People and concepts are drifting over time.
38
Quick summary of DTA/STA
- The tensor stream is a general data model
- DTA/STA incrementally decompose tensors into core tensors and projection matrices
- The results of DTA/STA can be used in other applications: anomaly detection, multiway LSI
- Incremental computation!
39
Outline
- Data model
- Framework
- Related work
- Current work
  - Dynamic and Streaming Tensor Analysis (DTA/STA)
  - Compact Matrix Decomposition (CMD)
- Proposed work
- Conclusion
40
Methodology map (rows: data order; columns: static vs. dynamic)

Order    Static                      Dynamic
1st      -                           1st order DTA, SPIRIT (1st order STA)
2nd      SVD, PCA, CMD               DTA, STA
>=3rd    PARAFAC, HOSVD, TensorPCA   DTA, STA
41
Disadvantage of orthogonal projection on sparse data
Real data are often (very) sparse, but orthogonal projection does not preserve sparsity:
- more space than the original data
- large computational cost

Data                        Size           Nonzero percentage
Network flow                22k-by-22k     0.0025%
DBLP (author, conference)   428k-by-3.6k   0.004%
42
Interpretability problem of orthogonal projection
Each column of the projection matrix U_i is a linear combination of all dimensions along a certain mode, e.g., U_i(:,1) = [0.5; -0.5; 0.5; 0.5].
All the data are projected onto the span of U_i.
This makes the projections hard to interpret.
43
Compact matrix decomposition (CMD)
Example-based projection: use actual rows and columns to specify the subspace.
Given a matrix A in R^{m x n}, find three matrices C in R^{m x c}, U in R^{c x r}, R in R^{r x n} such that ||A - CUR|| is small.
[Figure: A (m x n) approximated by C (sampled columns) x U x R (sampled rows), where U is the pseudo-inverse of X, the intersection of C and R; contrasted with orthogonal projection, the example-based projection is built from actual data entries]
44
CMD algorithm (high level)
CMU from 4K feet
45
CMD algorithm (high level)
1. Draw a biased sample, with replacement, of columns and rows from A
2. Remove duplicates, with proper scaling
3. Construct U from C and R (pseudo-inverse of the intersection of C and R)
[Example: sampling columns and rows of a 0/1 matrix A with replacement yields duplicated columns C_d and rows R_d; removing duplicates with scaling gives the final C and R]
46
CMD algorithm (low level)
CMU from 3 feet
47
CMD algorithm (low level)
Remove duplicates with proper scaling: if column C_i occurs u_i times and row R_i occurs v_i times in the sample, keep one copy of each, scaled as
  C'_i = u_i^{1/2} C_i
  R'_i = v_i R_i
Theorem: the matrices C and C_d have the same singular values and left singular vectors. Proof: see [Sun06].
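A sketch of a CMD-style decomposition per these two slides: biased sampling with replacement (probability proportional to squared norms), duplicate removal with the scaling above. For simplicity this sketch builds U as the Frobenius-optimal C^+ A R^+ rather than the intersection-based construction in [Sun06], and the sampling sizes are illustrative.

```python
import numpy as np

def biased_unique_sample(A, k, axis):
    """Sample k column (axis=1) or row (axis=0) indices with replacement,
    with probability ~ squared norm; return unique indices and counts."""
    sq = (A ** 2).sum(axis=0) if axis == 1 else (A ** 2).sum(axis=1)
    idx = np.random.choice(len(sq), size=k, replace=True, p=sq / sq.sum())
    return np.unique(idx, return_counts=True)

def cmd(A, kc, kr):
    cols, u = biased_unique_sample(A, kc, axis=1)
    rows, v = biased_unique_sample(A, kr, axis=0)
    C = A[:, cols] * np.sqrt(u)      # C'_i = u_i^(1/2) C_i (duplicate-free)
    R = A[rows, :] * v[:, None]      # R'_i = v_i R_i       (duplicate-free)
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # simplification, see lead-in
    return C, U, R

A = np.random.rand(50, 40)
C, U, R = cmd(A, kc=15, kr=15)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))  # relative error
```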
48
Experiments
Datasets:

Data                                 Dimensions     Nonzeros
Network flow (source, destination)   22k-by-22k     12K
DBLP (author, conference)            428K-by-3.6K   64K

Performance metrics:
- Space ratio to the original data
- CPU time (sec)
- Accuracy = 1 - reconstruction error
49
Space efficiency
- CMD uses much less space to achieve the same accuracy
- CUR limitation: duplicate columns and rows
- SVD limitation: orthogonal projection densifies the data
[Plots: space vs. accuracy on Network and DBLP]
50
Computational efficiency
- CMD is the fastest of the three
- CMD and CUR require SVD only on the sampled columns
- CUR is much slower than CMD due to duplicate columns
- SVD is slowest since it operates on the entire data
[Plots: CPU time on Network and DBLP]
51
Quick summary of CMD: A ≈ C U R
- C/R: sampled and scaled columns and rows (sparse); U: a small dense matrix
Properties:
- Interpretability: interpret the matrix through sampled rows and columns
- Efficiency: in both computation and space
Application: anomaly detection
Efficient computation, intuitive model.
52
My related publications
- Sun, J., Tao, D., Faloutsos, C. Beyond Streams and Graphs: Dynamic Tensor Analysis. Submitted.
- Sun, J., Xie, Y., Zhang, H., Faloutsos, C. Compact Matrix Decomposition for Large Graphs: Theory and Practice. Submitted.
- Hoke, E., Sun, J., Faloutsos, C. InteMon: Intelligent Monitoring System for Large Clusters. Submitted.
- Sun, J., Papadimitriou, S., Faloutsos, C. Distributed Pattern Discovery in Multiple Streams. PAKDD 2006.
- Papadimitriou, S., Sun, J., Faloutsos, C. Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005.
- Sun, J., Papadimitriou, S., Faloutsos, C. Online Latent Variable Detection in Sensor Networks. ICDE 2005.
53
Outline
- Motivating examples
- Data model and mining framework
- Background and related work
- Current work
  - Dynamic and Streaming Tensor Analysis (DTA/STA)
  - Compact Matrix Decomposition (CMD)
- Proposed work
- Conclusion
54
Proposed work: methodology
Evaluation goal: real data, real applications, real patterns.
[Figure: methodology map. Tensor analysis branches into orthogonal projection (1st order: SPIRIT, distributed SPIRIT; Mth order: DTA, STA) and example-based projection (2nd order: CMD [P1, P2]; Mth order: [P3]), plus other divergences [P4].]
55
P1: Effective example-based projection
Occasionally CMD does not give an accurate result, especially when the "large" columns and rows lie in nearly parallel subspaces: the current heuristic keeps sampling those "large" columns/rows.
Recent work [Drineas06] provides a relative error guarantee: ||A - CUR|| <= (1 + ε) ||A - A_k||, where A_k is the best rank-k approximation from SVD.
Our idea: pick the column that disagrees the most with the already selected columns.
[Example: on a 0/1 matrix, CMD repeatedly samples the dominant duplicated columns; the new scheme picks a disagreeing column instead]
56
P2: Incremental CMD
Given time-evolving graphs (a 2nd order tensor stream), we currently need to apply CMD at every timestamp to a new (slightly changed) graph. How can we compute CMD efficiently over time?
Our idea: exploit the small change between consecutive graphs rather than recomputing from scratch.
[Example: adjacency matrices at t = 1 and t = 2 that differ in only a few entries]
57
P3: Example-based tensor decomposition
CMD is currently defined on matrices (2nd order tensors) only; we plan to generalize CMD to higher orders.
Build infrastructure: a sparse tensor package [Kolda06]; prototype sparse tensor access methods.
- How to store a sparse tensor?
- How to access a subset of a tensor?
Our goal: implement tensor CMD efficiently.
58
P4: Other divergences
Currently the model implicitly assumes a Gaussian distribution and Euclidean distance, but much real data is not Gaussian. Our goal: support other distributions and distance measures.

Divergence           Distribution
Euclidean distance   Gaussian
KL divergence        Multinomial
Bregman divergence   Exponential family
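For reference, and not from the proposal itself: the standard Bregman divergence generated by a strictly convex function φ, which recovers squared Euclidean distance for φ(x) = ||x||^2 and (generalized) KL divergence for φ(x) = Σ_i x_i log x_i:

```latex
D_{\varphi}(x, y) \;=\; \varphi(x) \;-\; \varphi(y) \;-\; \langle \nabla \varphi(y),\, x - y \rangle
```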
59
Evaluation plan
Real data, real applications, real success or failure.
- Environmental monitoring (1st order): temperature and humidity in a large building; chlorine concentration in water distribution; real-time summarization and anomaly detection
- Machine monitoring (1st order): monitor a number of system parameters; identify unusual patterns in real time
- DBLP/IMDB (2nd order): time-evolving graphs; find community structure
- Network flow (2nd or 3rd order): identify interesting patterns; identify attacks
Other data:
- fMRI data (3rd order): brain image data; classification
- Financial data (order >= 1): transaction data; identify anomalies that may indicate frauds or errors
60
Timeline
- P1: Effective example-based projection: months 1-3
- P2: Incremental CMD: months 4-6
- P3: Example-based tensor decomposition: months 7-8
- P4: Other divergences: months 9-11
- Writing thesis: months 7-12
- Defense: after 12 months