Incremental Pattern Discovery on Streams, Graphs and
Tensors
Jimeng Sun
Ph.D.Thesis Proposal
May 15, 2006
2
Thesis Committee
Christos Faloutsos (Chair)
Tom Mitchell
Hui Zhang
David Steier, PricewaterhouseCoopers
Philip Yu, IBM Watson Research Center
3
Thesis Proposal
Goal: incremental pattern discovery on streaming applications
- Streams: E1: Environmental sensor networks; E2: Cluster/data center monitoring
- Graphs: E3: Social network analysis
- Tensors: E4: Network forensics; E5: Financial auditing; E6: fMRI brain image analysis
How to summarize streaming data efficiently and incrementally?
4
E1: Environmental Sensor Monitoring
Water distribution network, normal operation.
May have hundreds of measurements, and they are often related!
[Figure: chlorine concentration time series over Phases 1-3, for sensors near the leak and sensors away from the leak]
CMU Civil Engineering Department, Prof. Jeanne M. VanBriesen
5
E1: Environmental Sensor Monitoring
Water distribution network: normal operation vs. major leak.
[Figure: chlorine concentration time series over Phases 1-3 during a major leak, for sensors near the leak and sensors away from the leak]
CMU Civil Engineering Department, Prof. Jeanne M. VanBriesen
May have hundreds of measurements, and they are often related!
6
E1: Environmental Sensor Monitoring
We would like to discover a few "hidden" (latent) variables that summarize the key trends.
[Figure: actual measurements (n streams) over Phases 1-3 are summarized by k hidden variables, k = 1-2]
SPIRIT
7
E3: Social network analysis
Traditionally, people focus on static networks and find community structures. We plan to monitor the change of the community structure over time and identify abnormal individuals.
[Figure: author-keyword graphs with DB and DM communities, evolving from 1990 to 2004]
8
E4: Network forensics
Directional network flows. A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]: 450 GB/hour with compression.
Task: identify abnormal traffic patterns and find out the cause.
[Figure: source x destination traffic matrices, normal traffic vs. abnormal traffic]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
9
Commonality of all
Data: continuously arriving, large volume, multi-dimensional, unlabeled
Task: incremental pattern discovery
- Main trends
- Anomalies
10
Thesis statement
Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.
11
Outline
- Motivating examples
- Data model and mining framework
- Related work
- Current work
- Proposed work
- Conclusion
12
Static Data Model: Tensor
Formally, an order-M tensor is an element of R^{N_1 x N_2 x ... x N_M}: a generalization of matrices, represented as a multi-array (data cube).

Order:          1st       2nd                  3rd
Correspondence: Vector    Matrix               3D array
Example:        Sensors   Authors x Keywords   Sources x Destinations x Ports
13
Dynamic Data Model (our focus)
Tensor streams: a sequence of Mth order tensors X_1, ..., X_n, where each X_t is in R^{N_1 x ... x N_M} and n is increasing over time.

Order:          1st                2nd                          3rd
Correspondence: Multiple streams   Time-evolving graphs         3D arrays over time
Example:        Sensors x time     Author x Keyword over time   Sources x Destinations x Ports over time
14
Application Modules
Our framework for incremental pattern discovery. Mining flow:
Data streams -> (preprocessing) -> tensor streams -> (tensor analysis) -> core tensors + projections -> anomaly detection, clustering, prediction.
15
Outline
- Motivating examples
- Data model and mining framework
- Related work
- Current work
- Proposed work
- Conclusion
16
Related work
- Low-rank approximation: PCA, SVD (orthogonal projection); CUR [Drineas05] (example-based projection)
- Multilinear analysis: tensors (matricizing, mode-product); tensor decompositions (Tucker, PARAFAC, HOSVD)
- Stream mining: scan data once to identify patterns; sampling [Vitter85], [Gibbons98]; sketches [Indyk00], [Cormode03]
- Graph mining: explorative [Faloutsos04], [Kumar99], [Leskovec05], ...; algorithmic [Yan05], [Cormode05], ...
Our work sits at the intersection of these areas.
17
Background: Singular Value Decomposition (SVD)
SVD: A_{m x n} ≈ U_{m x k} Σ_{k x k} V^T_{k x n}
- Best rank-k approximation in L2
- PCA is an important application of SVD
- Note that U and V are dense and may have negative entries
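A minimal sketch of a rank-k SVD approximation in NumPy; the matrix A and the rank k here are illustrative placeholders, not data from the proposal.

```python
import numpy as np

def rank_k_svd(A, k):
    """Return the rank-k SVD factors and the best rank-k approximation in L2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-k approximation
    return U_k, s_k, Vt_k, A_k

A = np.random.rand(6, 5)              # illustrative dense matrix
U_k, s_k, Vt_k, A_k = rank_k_svd(A, k=2)
# U_k and Vt_k are dense and may have negative entries, as the slide notes
print(np.linalg.norm(A - A_k))        # Frobenius reconstruction error
```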
18
Background: Latent Semantic Indexing (LSI)
Singular vectors are useful for clustering. Example: a document-term matrix over the terms {pattern, cluster, query, cache, frequent}, containing two document clusters (DM and DB):

    [1 1 1 0 0]       [0.18 0   ]
    [2 2 2 0 0]       [0.36 0   ]
    [1 1 1 0 0]       [0.18 0   ]   [9.64 0   ]   [0.58 0.58 0.58 0    0   ]
    [5 5 5 0 0]   =   [0.90 0   ] x [0    5.29] x [0    0    0    0.71 0.71]
    [0 0 0 2 2]       [0    0.53]
    [0 0 0 3 3]       [0    0.80]
    [0 0 0 1 1]       [0    0.27]

The three factors give document-to-concept similarity, concept association (strength), and concept-to-term similarity.
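The slide's numbers can be reproduced directly; a sketch with NumPy on the exact matrix above (note that SVD sign conventions may flip the singular vectors relative to the slide):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(U[:, :2], 2))   # document-to-concept: two clear clusters
print(np.round(s[:2], 2))      # concept strengths: ~9.64 and ~5.29
print(np.round(Vt[:2], 2))     # concept-to-term: {pattern,cluster,query} vs {cache,frequent}
```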
19
Background: Tensor Operations
Matricizing: unfold a tensor into a matrix.
[Figure: a source x destination x port tensor unfolded along the source mode into a source x (destination*port) matrix]
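A minimal sketch of mode-d matricizing with NumPy; the tensor shape (sources x destinations x ports) follows the slide, but the values are random placeholders.

```python
import numpy as np

def matricize(T, mode):
    """Unfold tensor T along `mode`: rows index that mode, columns the rest.
    Column ordering conventions differ across the tensor literature; this
    sketch uses NumPy's default C (row-major) ordering."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.random.rand(4, 3, 2)   # source x destination x port
X0 = matricize(T, 0)          # 4 x 6: source vs. (destination*port)
print(X0.shape)
```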
20
Background: Tensor Operations
Mode-product: multiply a tensor with a matrix along one mode.
[Figure: a source x destination x port tensor multiplied along the source mode by a grouping matrix, yielding a source-group x destination x port tensor]
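A minimal mode-product sketch via np.tensordot; here a hypothetical matrix U groups the 4 sources into 2 "groups" as in the slide (values are random placeholders).

```python
import numpy as np

def mode_product(T, U, mode):
    """Multiply tensor T by matrix U along `mode` (new dim = U.shape[0])."""
    out = np.tensordot(U, T, axes=(1, mode))   # contracted axis comes out first
    return np.moveaxis(out, 0, mode)           # put it back in place

T = np.random.rand(4, 3, 2)       # source x destination x port
U = np.random.rand(2, 4)          # maps 4 sources to 2 source-groups
print(mode_product(T, U, 0).shape)   # (2, 3, 2)
```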
21
Outline
- Data model
- Framework
- Related work
- Current work
  - Dynamic and Streaming Tensor Analysis (DTA/STA)
  - Compact Matrix Decomposition (CMD)
- Proposed work
- Conclusion
22
Methodology map (rows: data order; columns: static vs. dynamic)

Order    Static                      Dynamic
1st      -                           1st order DTA, SPIRIT (1st order STA)
2nd      SVD, PCA, CMD               DTA, STA
>=3rd    PARAFAC, HOSVD, TensorPCA   DTA, STA
23
Tensor analysis
Given a sequence of tensors X_1, ..., X_t, ..., find the projection matrices U_1, ..., U_M such that the reconstruction error e is minimized. Note that this is a generalization of PCA when n is a constant.
[Figure: a sources x destinations x ports tensor decomposed into a core tensor and source, destination, and port projection matrices]
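The objective formula itself was lost in extraction; the following is a plausible LaTeX reconstruction, assuming the Frobenius norm and the per-mode projections U_d, consistent with the DTA formulation later in the proposal:

```latex
e \;=\; \sum_{t} \left\| \mathcal{X}_t \;-\; \mathcal{X}_t \times_1 (U_1 U_1^{\top}) \times_2 (U_2 U_2^{\top}) \cdots \times_M (U_M U_M^{\top}) \right\|_F^{2}
```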
24
Why do we care?
- Anomaly detection: reconstruction-error driven, at multiple resolutions
- Multiway latent semantic indexing (LSI)
[Figure: author x keyword tensors from 1990 to 2004 with projection matrices UA (authors) and UK (keywords); DB and DM concept groups, their members (e.g., Philip Yu, Michael Stonebraker) and keywords (query, pattern) drift over time]
25
1st order DTA: problem
Given x_1, ..., x_n, where each x_i is in R^N, find U in R^{N x R} such that the error e is small.
[Figure: an n x N matrix X (time x sensors, e.g., indoor and outdoor sensors) is compressed to an n x R matrix Y via U]
Note that Y = XU.
26
1st order DTA
Input: new data vector x in R^N; old variance matrix C in R^{N x N}
Output: new projection matrix U in R^{N x R}
Algorithm:
1. Update the variance matrix: C_new = x^T x + C (x as a row vector)
2. Diagonalize: U Λ U^T = C_new
3. Determine the rank R and return U
Diagonalization has to be done for every new x!
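A sketch of this update in NumPy. Rank selection is simplified to a fixed R (the slide says "determine the rank R" without specifying a rule), and the stream is random placeholder data.

```python
import numpy as np

def dta_update(C, x, R):
    """x: new data vector (length N); C: old N x N variance matrix."""
    C_new = C + np.outer(x, x)              # C_new = x^T x + C (x as row)
    evals, evecs = np.linalg.eigh(C_new)    # diagonalize: U Λ U^T = C_new
    order = np.argsort(evals)[::-1]         # largest eigenvalues first
    U = evecs[:, order[:R]]                 # N x R projection matrix
    return C_new, U

N, R = 10, 2
C = np.zeros((N, N))
for x in np.random.rand(100, N):            # stream of data vectors
    C, U = dta_update(C, x, R)              # full diagonalization each step
```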
27
1st order STA: SPIRIT
Adjust U smoothly when new data arrive, without diagonalization.
For each new point x:
- Project onto the current line
- Estimate the error
- Rotate the line in the direction of the error and in proportion to its magnitude
For each new point x and for i = 1, ..., k:
  y_i := U_i^T x                  (projection onto U_i)
  d_i <- d_i + y_i^2              (energy ~ i-th eigenvalue)
  e_i := x - y_i U_i              (error)
  U_i <- U_i + (1/d_i) y_i e_i    (update estimate)
  x <- x - y_i U_i                (repeat with remainder)
[Figure: in a Sensor 1 vs. Sensor 2 plot, U rotates toward the error of each new point]
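A sketch of the SPIRIT update exactly as listed on this slide; k and the initialization below are illustrative choices, not prescribed by the proposal.

```python
import numpy as np

def spirit_update(U, d, x):
    """U: N x k estimated eigenvectors; d: per-component energies; x: new point."""
    x = x.copy()
    y = np.zeros(U.shape[1])
    for i in range(U.shape[1]):
        y[i] = U[:, i] @ x                   # project onto current U_i
        d[i] += y[i] ** 2                    # energy ~ i-th eigenvalue
        e = x - y[i] * U[:, i]               # estimation error
        U[:, i] += (1.0 / d[i]) * y[i] * e   # rotate toward the error
        x -= y[i] * U[:, i]                  # repeat with the remainder
    return U, d, y

N, k = 5, 2
U, d = np.eye(N, k), np.full(k, 1e-4)        # tiny d avoids division by zero
for x in np.random.rand(200, N):
    U, d, y = spirit_update(U, d, x)         # y: the k hidden variables
```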
28
Mth order DTA
For each mode d of the new tensor X:
1. Matricize X along mode d (with transpose as needed), giving X_(d); construct the variance matrix of the incremental tensor, X_(d)^T X_(d)
2. Update the variance matrix: C_d <- C_d + X_(d)^T X_(d)
3. Diagonalize: C_d = U_d S_d U_d^T and reconstruct as needed
[Diagram: matricizing/transpose -> construct -> update -> diagonalize -> reconstruct, repeated per mode]
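A sketch of one Mth-order DTA step following this flow. The matricization convention (and hence whether the product appears as X_(d) X_(d)^T or X_(d)^T X_(d)) is an assumption of this sketch; the per-mode ranks are illustrative.

```python
import numpy as np

def matricize(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def dta_mth_update(Cs, T, ranks):
    """Cs[d]: N_d x N_d variance matrix per mode; ranks[d]: target rank R_d."""
    Us = []
    for d in range(T.ndim):
        Xd = matricize(T, d)                 # N_d x (prod of other N_i)
        Cs[d] = Cs[d] + Xd @ Xd.T            # update mode-d variance matrix
        evals, evecs = np.linalg.eigh(Cs[d])
        order = np.argsort(evals)[::-1]
        Us.append(evecs[:, order[:ranks[d]]])  # N_d x R_d projection
    return Cs, Us

T = np.random.rand(4, 3, 2)                  # e.g., source x dest x port
Cs = [np.zeros((n, n)) for n in T.shape]
Cs, Us = dta_mth_update(Cs, T, ranks=[2, 2, 1])
```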
29
Mth order DTA: complexity
- Storage: O(Π_i N_i), i.e., the size of an input tensor at a single timestamp
- Computation: Σ_d N_d^3 (or Σ_d N_d^2) for diagonalizing each C_d, plus Σ_d N_d Π_i N_i for the matrix multiplications X_(d)^T X_(d)
- For low-order tensors (order < 3), diagonalization is the main cost
- For high-order tensors, matrix multiplication is the main cost
30
Streaming Tensor Analysis (STA)
Run SPIRIT along each mode of the matricized tensor.
Complexity:
- Storage: O(Π_i N_i)
- Computation: Σ_d R_d Π_i N_i, which is smaller than DTA
[Figure: a new fiber x is projected onto U_1 (giving y_1), the error e_1 is computed, and U_1 is updated]
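A minimal STA sketch following this slide: run the SPIRIT update along each mode, treating every column of the mode-d matricization as an incoming point. It reuses spirit_update() from the earlier sketch; the per-mode ranks are illustrative.

```python
import numpy as np

def sta_update(Us, ds, T):
    """Us[d]: N_d x R_d SPIRIT estimate; ds[d]: energies for mode d."""
    for d in range(T.ndim):
        Xd = np.moveaxis(T, d, 0).reshape(T.shape[d], -1)   # matricize
        for fiber in Xd.T:               # each column is a length-N_d point
            Us[d], ds[d], _ = spirit_update(Us[d], ds[d], fiber)
    return Us, ds

T = np.random.rand(4, 3, 2)
ranks = [2, 2, 1]
Us = [np.eye(n, r) for n, r in zip(T.shape, ranks)]
ds = [np.full(r, 1e-4) for r in ranks]
Us, ds = sta_update(Us, ds, T)           # no diagonalization needed
```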
31
Experiment
Goals:
- Computational efficiency
- Accurate approximation
- Real applications: anomaly detection, clustering
32
Data set 1: Network data
- TCP flows collected at the CMU backbone; raw data 500 GB with compression
- Construct 2nd or 3rd order tensors with hourly windows: <source, destination, value> or <source, destination, port, value>
- Each tensor: 500 x 500 or 500 x 500 x 100, biased-sampled from over 22k hosts
- 1200 timestamps (hours)
- Sparse data with a power-law distribution
[Figure: traffic distribution, 10AM to 11AM on 01/06/2005]
33
Data set 2: Bibliographic data (DBLP)
- Papers from VLDB and KDD conferences
- Construct 2nd order tensors with yearly windows: <author, keywords, num>
- Each tensor: 4584 x 3741
- 11 timestamps (years)
34
Computational cost
3rd order network tensor and 2nd order DBLP tensor; OTA is the offline tensor analysis. Performance metric: CPU time (sec).
Observations:
- DTA and STA are orders of magnitude faster than OTA
- The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
35
Accuracy comparison
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%.
Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger year-to-year changes.
[Plots: 3rd order network tensor; 2nd order DBLP tensor]
36
Network anomaly detection
Reconstruction error gives an indication of anomalies. The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus network administrators).
[Figure: reconstruction error over time, normal traffic vs. abnormal traffic]
37
Multiway LSI
- 1995 (DB): authors michael carey, michael stonebraker, h. jagadish, hector garcia-molina; keywords queri, parallel, optimization, concurr, objectorient
- 2004 (DB): authors surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel; keywords distribut, systems, view, storage, servic, process, cache
- 2004 (DM): authors jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal; keywords streams, pattern, support, cluster, index, gener, queri
Two groups are correctly identified: databases (DB) and data mining (DM). People and concepts are drifting over time.
38
Quick summary of DTA/STA
- The tensor stream is a general data model
- DTA/STA incrementally decompose tensors into core tensors and projection matrices
- The results of DTA/STA can be used in other applications: anomaly detection, multiway LSI
- Incremental computation!
39
Outline
- Data model
- Framework
- Related work
- Current work
  - Dynamic and Streaming Tensor Analysis (DTA/STA)
  - Compact Matrix Decomposition (CMD)
- Proposed work
- Conclusion
40
Methodology map (rows: data order; columns: static vs. dynamic)

Order    Static                      Dynamic
1st      -                           1st order DTA, SPIRIT (1st order STA)
2nd      SVD, PCA, CMD               DTA, STA
>=3rd    PARAFAC, HOSVD, TensorPCA   DTA, STA
41
Disadvantage of orthogonal projection on sparse data
Real data are often (very) sparse, but orthogonal projection does not preserve sparsity:
- more space than the original data
- large computational cost

Data                        Size           Nonzero percentage
Network flow                22k-by-22k     0.0025%
DBLP (author, conference)   428k-by-3.6k   0.004%
42
Interpretability problem of orthogonal projection
Each column of the projection matrix U_i is a linear combination of all dimensions along a certain mode, e.g., U_i(:,1) = [0.5; -0.5; 0.5; 0.5].
All the data are projected onto the span of U_i.
This makes the projections hard to interpret.
43
Compact matrix decomposition (CMD)
Example-based projection: use actual rows and columns to specify the subspace.
Given a matrix A in R^{m x n}, find three matrices C in R^{m x c}, U in R^{c x r}, R in R^{r x n} such that ||A - CUR|| is small.
[Figure: A (m x n) approximated by C (sampled columns) x U x R (sampled rows), where U is the pseudo-inverse of X, the intersection of C and R; contrasted with orthogonal projection, the example-based projection is built from actual data entries]
44
CMD algorithm (high level)
CMU from 4K feet
45
CMD algorithm (high level)
1. Draw a biased sample, with replacement, of columns and rows from A
2. Remove duplicates, with proper scaling
3. Construct U from C and R (pseudo-inverse of the intersection of C and R)
[Example: sampling columns and rows of a 0/1 matrix A with replacement yields duplicated columns C_d and rows R_d; removing duplicates with scaling gives the final C and R]
46
CMD algorithm (low level)
CMU from 3 feet
47
CMD algorithm (low level)
Remove duplicates with proper scaling: if column C_i occurs u_i times and row R_i occurs v_i times in the sample, keep one copy of each, scaled as
  C'_i = u_i^{1/2} C_i
  R'_i = v_i R_i
Theorem: the matrices C and C_d have the same singular values and left singular vectors. Proof: see [Sun06].
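A sketch of a CMD-style decomposition per these two slides: biased sampling with replacement (probability proportional to squared norms), duplicate removal with the scaling above. For simplicity this sketch builds U as the Frobenius-optimal C^+ A R^+ rather than the intersection-based construction in [Sun06], and the sampling sizes are illustrative.

```python
import numpy as np

def biased_unique_sample(A, k, axis):
    """Sample k column (axis=1) or row (axis=0) indices with replacement,
    with probability ~ squared norm; return unique indices and counts."""
    sq = (A ** 2).sum(axis=0) if axis == 1 else (A ** 2).sum(axis=1)
    idx = np.random.choice(len(sq), size=k, replace=True, p=sq / sq.sum())
    return np.unique(idx, return_counts=True)

def cmd(A, kc, kr):
    cols, u = biased_unique_sample(A, kc, axis=1)
    rows, v = biased_unique_sample(A, kr, axis=0)
    C = A[:, cols] * np.sqrt(u)      # C'_i = u_i^(1/2) C_i (duplicate-free)
    R = A[rows, :] * v[:, None]      # R'_i = v_i R_i       (duplicate-free)
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # simplification, see lead-in
    return C, U, R

A = np.random.rand(50, 40)
C, U, R = cmd(A, kc=15, kr=15)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))  # relative error
```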
48
Experiments
Datasets:

Data                                 Dimensions     Nonzeros
Network flow (source, destination)   22k-by-22k     12K
DBLP (author, conference)            428K-by-3.6K   64K

Performance metrics:
- Space ratio to the original data
- CPU time (sec)
- Accuracy = 1 - reconstruction error
49
Space efficiency
- CMD uses much less space to achieve the same accuracy
- CUR limitation: duplicate columns and rows
- SVD limitation: orthogonal projection densifies the data
[Plots: space vs. accuracy on Network and DBLP]
50
Computational efficiency
- CMD is the fastest of the three
- CMD and CUR require SVD only on the sampled columns
- CUR is much slower than CMD due to duplicate columns
- SVD is slowest since it operates on the entire data
[Plots: CPU time on Network and DBLP]
51
Quick summary of CMD: A ≈ C U R
- C/R: sampled and scaled columns and rows (sparse); U: a small dense matrix
Properties:
- Interpretability: interpret the matrix through sampled rows and columns
- Efficiency: in both computation and space
Application: anomaly detection
Efficient computation, intuitive model.
52
My related publications
- Sun, J., Tao, D., Faloutsos, C. Beyond Streams and Graphs: Dynamic Tensor Analysis. Submitted.
- Sun, J., Xie, Y., Zhang, H., Faloutsos, C. Compact Matrix Decomposition for Large Graphs: Theory and Practice. Submitted.
- Hoke, E., Sun, J., Faloutsos, C. InteMon: Intelligent Monitoring System for Large Clusters. Submitted.
- Sun, J., Papadimitriou, S., Faloutsos, C. Distributed Pattern Discovery in Multiple Streams. PAKDD 2006.
- Papadimitriou, S., Sun, J., Faloutsos, C. Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005.
- Sun, J., Papadimitriou, S., Faloutsos, C. Online Latent Variable Detection in Sensor Networks. ICDE 2005.
53
Outline
- Motivating examples
- Data model and mining framework
- Background and related work
- Current work
  - Dynamic and Streaming Tensor Analysis (DTA/STA)
  - Compact Matrix Decomposition (CMD)
- Proposed work
- Conclusion
54
Proposed work: methodology
Evaluation goal: real data, real applications, real patterns.
[Figure: methodology map. Tensor analysis branches into orthogonal projection (1st order: SPIRIT, distributed SPIRIT; Mth order: DTA, STA) and example-based projection (2nd order: CMD [P1, P2]; Mth order: [P3]), plus other divergences [P4].]
55
P1: Effective example-based projection
Occasionally CMD does not give an accurate result, especially when the "large" columns and rows lie in nearly parallel subspaces: the current heuristic keeps sampling those "large" columns/rows.
Recent work [Drineas06] provides a relative error guarantee: ||A - CUR|| <= (1 + ε) ||A - A_k||, where A_k is the best rank-k approximation from SVD.
Our idea: pick the column that disagrees the most with the already selected columns.
[Example: on a 0/1 matrix, CMD repeatedly samples the dominant duplicated columns; the new scheme picks a disagreeing column instead]
56
P2: Incremental CMD
Given time-evolving graphs (a 2nd order tensor stream), we currently need to apply CMD at every timestamp to a new (slightly changed) graph. How can we compute CMD efficiently over time?
Our idea: exploit the small change between consecutive graphs rather than recomputing from scratch.
[Example: adjacency matrices at t = 1 and t = 2 that differ in only a few entries]
57
P3: Example-based tensor decomposition
CMD is currently defined on matrices (2nd order tensors) only; we plan to generalize CMD to higher orders.
Build infrastructure: a sparse tensor package [Kolda06]; prototype sparse tensor access methods.
- How to store a sparse tensor?
- How to access a subset of a tensor?
Our goal: implement tensor CMD efficiently.
58
P4: Other divergences
Currently the model implicitly assumes a Gaussian distribution and Euclidean distance, but much real data is not Gaussian. Our goal: support other distributions and distance measures.

Divergence           Distribution
Euclidean distance   Gaussian
KL divergence        Multinomial
Bregman divergence   Exponential family
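For reference, and not from the proposal itself: the standard Bregman divergence generated by a strictly convex function φ, which recovers squared Euclidean distance for φ(x) = ||x||^2 and (generalized) KL divergence for φ(x) = Σ_i x_i log x_i:

```latex
D_{\varphi}(x, y) \;=\; \varphi(x) \;-\; \varphi(y) \;-\; \langle \nabla \varphi(y),\, x - y \rangle
```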
59
Evaluation plan
Real data, real applications, real success or failure.
- Environmental monitoring (1st order): temperature and humidity in a large building; chlorine concentration in water distribution; real-time summarization and anomaly detection
- Machine monitoring (1st order): monitor a number of system parameters; identify unusual patterns in real time
- DBLP/IMDB (2nd order): time-evolving graphs; find community structure
- Network flow (2nd or 3rd order): identify interesting patterns; identify attacks
Other data:
- fMRI data (3rd order): brain image data; classification
- Financial data (order >= 1): transaction data; identify anomalies that may indicate frauds or errors
60
Timeline
- P1: Effective example-based projection: months 1-3
- P2: Incremental CMD: months 4-6
- P3: Example-based tensor decomposition: months 7-8
- P4: Other divergences: months 9-11
- Writing thesis: months 7-12
- Defense: after 12 months