measuring two-event structural correlations on graphs

MEASURING TWO-EVENT STRUCTURAL CORRELATIONS ON GRAPHS Ziyu Guan, Nan Li, Xifeng Yan Department of Computer Science UC Santa Barbara


Post on 24-Feb-2016


TRANSCRIPT

Page 1: Measuring Two-Event Structural Correlations on Graphs

MEASURING TWO-EVENT STRUCTURAL CORRELATIONS ON GRAPHS

Ziyu Guan, Nan Li, Xifeng Yan

Department of Computer Science

UC Santa Barbara

Page 2: Measuring Two-Event Structural Correlations on Graphs

OUTLINE

Motivations

Measuring Two Event Structural Correlation (TESC)

Efficient Computation

Experiments

Discussions and Future work

Page 3: Measuring Two-Event Structural Correlations on Graphs

INTRUSION

Attraction: Ping Sweep, SMB Service Sweep

Page 4: Measuring Two-Event Structural Correlations on Graphs

PRODUCT SALES

How are the sales of two products related in a social network?

Attraction

Repulsion

Page 5: Measuring Two-Event Structural Correlations on Graphs


A NEW NOTION OF CORRELATION

Two-Event Structural Correlation (TESC)

Defined on graph structures

Capture relationships between distributions of two events on a graph

Events can be different things in different contexts:

Topics or products (social networks)

Viruses (contact networks)

Intrusion alerts (computer networks)

Page 6: Measuring Two-Event Structural Correlations on Graphs

IT IS A NONTRIVIAL PROBLEM

Simply computing the average distance between occurrences of two events will not work: the distance for a positively correlated pair could be longer than that for a negatively correlated pair.

gScore cannot be adapted [Z. Guan et al., SIGMOD 2011].

Significance cannot be assessed by randomization!

Page 7: Measuring Two-Event Structural Correlations on Graphs

OUTLINE

Motivations

Measuring Two Event Structural Correlation (TESC)

Efficient Computation

Experiments

Discussions and Future work

Page 8: Measuring Two-Event Structural Correlations on Graphs

HOW TO MEASURE?

Positive correlation: the presence of event A tends to imply the presence of event B. More A also tends to attract more B.

Negative correlation: the presence of one event is likely to imply the absence of the other. More A means less B.

Our idea: employ reference nodes in the graph as observers to capture these characteristics quantitatively, and avoid randomization for significance testing.

Page 9: Measuring Two-Event Structural Correlations on Graphs

PRELIMINARIES

A graph $G = (V, E)$ and an event set $Q = \{q_i\}$. Given two events $a$ and $b$ in $Q$, $V_a$ and $V_b$ are the sets of nodes having $a$ and $b$, respectively.

Def. (Node h-hop neighborhood): given a node, the subgraph induced by the nodes within distance $h$ of that node.

Def. (Node set h-hop neighborhood): given a node set, the subgraph induced by the union of all nodes that are within distance $h$ of at least one node in the set.

Notation: $V_{a \cup b}$ is the set of nodes having event $a$ or $b$ (the event nodes); $V^h_{a \cup b}$ is the node set of the h-hop neighborhood of $V_{a \cup b}$.

Page 10: Measuring Two-Event Structural Correlations on Graphs

MEASURING CONCORDANCE

Concordance score for two reference nodes $r_i$ and $r_j$:

$$c(r_i, r_j) = \begin{cases} 1 & \text{if } (s_a^h(r_i) - s_a^h(r_j))(s_b^h(r_i) - s_b^h(r_j)) > 0 \\ -1 & \text{if } (s_a^h(r_i) - s_a^h(r_j))(s_b^h(r_i) - s_b^h(r_j)) < 0 \\ 0 & \text{otherwise} \end{cases}$$

The three cases are: the density changes are consistent, the density changes are inconsistent, and a tie.

Density function:

$$s_a^h(r) = \frac{|V_a \cap V_r^h|}{|V_r^h|}$$

This is the fraction of nodes possessing event $a$ in $r$’s h-hop neighborhood.
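The density and concordance definitions can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the adjacency-dict graph representation and the function names `h_hop_nodes`, `density`, and `concordance` are assumptions made for the example, and the graph is assumed to be undirected.

```python
from collections import deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within distance h of `source` (plain BFS)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue  # do not expand beyond h hops
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def density(adj, event_nodes, r, h):
    """s^h(r): fraction of nodes in r's h-hop neighborhood that carry the event."""
    vicinity = h_hop_nodes(adj, r, h)
    return len(vicinity & event_nodes) / len(vicinity)


def concordance(sa_i, sa_j, sb_i, sb_j):
    """c(r_i, r_j): +1 if the density changes agree, -1 if they disagree, 0 on a tie."""
    product = (sa_i - sa_j) * (sb_i - sb_j)
    return (product > 0) - (product < 0)
```

For a path graph `0-1-2-3-4` with event `a` on node 0 only, `density(adj, {0}, 1, 1)` is 1/3 because node 1 sees nodes 0, 1, and 2 within one hop.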

Page 11: Measuring Two-Event Structural Correlations on Graphs

KENDALL’S TAU AS THE MEASURE

Kendall’s Tau rank correlation is used to compute the overall concordance among reference nodes with regard to density changes of the two events:

$$\tau(a, b) = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c(r_i, r_j)}{\frac{1}{2} N (N - 1)}$$

$N = |V^h_{a \cup b}|$: the number of all reference nodes.

$\tau(a, b)$ lies in $[-1, 1]$. A higher positive value means a stronger positive correlation. A lower negative value means a stronger negative correlation. 0 means no correlation.

Example 1, $\tau(a, b) = 1$:

Node   Density of a   Density of b
r_1    0.1            0.2
r_2    0.2            0.4
r_3    0.3            0.5

Example 2, $\tau(a, b) = -1$:

Node   Density of a   Density of b
r_1    0.1            0.5
r_2    0.2            0.4
r_3    0.3            0.2
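As a small check of the definition, the sketch below computes Kendall's Tau from two lists of densities, one entry per reference node, and reproduces the two three-node examples from this slide. The function name `tesc_tau` is an assumption for illustration.

```python
from itertools import combinations


def tesc_tau(densities_a, densities_b):
    """Kendall's Tau over reference nodes: sum of concordance scores / (N(N-1)/2)."""
    n = len(densities_a)
    total = 0
    for i, j in combinations(range(n), 2):
        product = (densities_a[i] - densities_a[j]) * (densities_b[i] - densities_b[j])
        total += (product > 0) - (product < 0)  # +1, -1, or 0 for a tie
    return total / (0.5 * n * (n - 1))


# Densities of a and b at r_1, r_2, r_3 from the two examples.
print(tesc_tau([0.1, 0.2, 0.3], [0.2, 0.4, 0.5]))  # 1.0
print(tesc_tau([0.1, 0.2, 0.3], [0.5, 0.4, 0.2]))  # -1.0
```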

Page 12: Measuring Two-Event Structural Correlations on Graphs

SIGNIFICANCE TESTING

It is impractical to compute $\tau(a, b)$ directly.

Testing: choose uniformly a sample of $n$ reference nodes from $V^h_{a \cup b}$, and compute the score $t(a, b)$ over this sample.

It is proved that the distribution of $t(a, b)$ under the null hypothesis tends to the normal distribution with mean 0 and variance related to $n$.

Thus, the correlation significance (z-score) is

$$z(a, b) = \frac{t(a, b) - E(t(a, b))}{\sqrt{Var(t(a, b))}} = \frac{t(a, b)}{\sqrt{Var(t(a, b))}}$$
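The z-score can be illustrated with a short sketch. The slide only says the null variance is related to $n$; the code below assumes the standard tie-free variance of Kendall's Tau, $2(2n + 5) / (9n(n - 1))$, which is a simplification because a tie-corrected variance would be needed when many reference nodes share the same density. The function names are hypothetical.

```python
import math


def null_variance_no_ties(n):
    """Variance of Kendall's Tau under the null hypothesis, assuming no ties."""
    return 2 * (2 * n + 5) / (9 * n * (n - 1))


def z_score(t, n):
    """z = (t - E[t]) / sqrt(Var[t]) with E[t] = 0 under the null hypothesis."""
    return t / math.sqrt(null_variance_no_ties(n))
```

Under this assumption, a sample score of `t = 0.5` over `n = 10` reference nodes gives a z-score slightly above 2.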

Page 13: Measuring Two-Event Structural Correlations on Graphs

REFERENCE NODES

The reasons for choosing $V^h_{a \cup b}$ to be the set of all reference nodes:

Nodes outside $V^h_{a \cup b}$ (out-of-sight nodes) cannot reach any event node within $h$ hops.

Incorporating them can only increase the number of consistent pairs and the number of ties (decreasing the variance in the null case), leading to unexpectedly high z-scores.

Page 14: Measuring Two-Event Structural Correlations on Graphs

OUTLINE

Motivations

Measuring Two Event Structural Correlation (TESC)

Efficient Computation

Experiments

Discussions and Future work

Page 15: Measuring Two-Event Structural Correlations on Graphs

EFFICIENT COMPUTATION

The key problem in efficient computation is how to get a uniform sample of reference nodes from $V^h_{a \cup b}$ when we only have $V_{a \cup b}$.

We explore three algorithms for reference node sampling: BFS, importance sampling, and whole graph sampling.

Page 16: Measuring Two-Event Structural Correlations on Graphs

BATCH_BFS

Batch_BFS is just like an h-hop breadth-first search, but with the queue initialized with a set of nodes.

Initialize the queue with all event nodes ($V_{a \cup b}$) to enumerate all reference nodes ($V^h_{a \cup b}$).

Correctness can be easily verified by imagining that we start with a virtual node $v_0$ which connects to all nodes in $V_{a \cup b}$ and then do an (h+1)-hop BFS.

[Figure: example with $V_{a \cup b} = \{v_1, v_2, v_3, v_4\}$; the queue initially holds $v_1, v_2, v_3, v_4$, and $v_0$ is the virtual node.]
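A minimal sketch of the Batch_BFS idea, assuming an undirected graph stored as an adjacency dict; the function name `batch_bfs` and the graph representation are illustrative choices, not taken from the slides.

```python
from collections import deque


def batch_bfs(adj, event_nodes, h):
    """Enumerate all nodes within h hops of at least one event node.

    The queue starts with every event node at distance 0, which is equivalent
    to an (h+1)-hop BFS from a virtual node linked to all event nodes.
    """
    dist = {v: 0 for v in event_nodes}
    queue = deque(event_nodes)
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue  # reached the h-hop boundary, do not expand further
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)
```

On the path graph `0-1-2-3-4-5-6` with event nodes 0 and 6 and `h = 2`, the result is `{0, 1, 2, 4, 5, 6}`: node 3 is out of sight of both event nodes.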

Page 17: Measuring Two-Event Structural Correlations on Graphs

IMPORTANCE SAMPLING (1)

The sample size $n$ is usually much smaller than $|V^h_{a \cup b}|$.

The idea is to sample nodes directly from $V^h_{a \cup b}$ and avoid enumerating $V^h_{a \cup b}$. The time cost then depends on $n$ rather than on $|V^h_{a \cup b}|$.

The basic operation is peeking at the h-hop neighborhood of an event node.

Difficulties: (1) different nodes have h-hop neighborhoods of different sizes; (2) there could be many overlapping regions.

Page 18: Measuring Two-Event Structural Correlations on Graphs

IMPORTANCE SAMPLING (2)

Uniform sampling by rejection sampling:

Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood.

Step 2: perform an h-hop BFS to retrieve u’s h-hop neighborhood.

Step 3: randomly sample a node r from u’s h-hop neighborhood.

Step 4: do an h-hop BFS from r to see how many event nodes it can reach (say, c event nodes).

Step 5: with probability 1 / c, accept r as a reference node. Otherwise get nothing from this run.

Problem: heavy overlap leads to a high failure probability! For example, with $V_{a \cup b} = \{u, v, w\}$, the success probability of one run is

$$P_{succ} = \frac{|V_u^h \cup V_v^h \cup V_w^h|}{|V_u^h| + |V_v^h| + |V_w^h|}$$
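The five rejection-sampling steps can be sketched as below. This is an illustration under stated assumptions: the graph is undirected and stored as an adjacency dict, the names `h_hop_nodes` and `rejection_sample` are hypothetical, and the neighborhoods of all event nodes are recomputed on every call for brevity, whereas a practical version would reuse the neighborhood sizes.

```python
import random
from collections import deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within distance h of `source` (plain BFS)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def rejection_sample(adj, event_nodes, h, rng):
    """One run of steps 1-5: return a reference node, or None if the run is rejected."""
    events = sorted(event_nodes)
    hoods = [sorted(h_hop_nodes(adj, u, h)) for u in events]
    # Step 1: pick an event node with probability proportional to neighborhood size.
    u_idx = rng.choices(range(len(events)), weights=[len(x) for x in hoods])[0]
    # Steps 2-3: draw r uniformly from that event node's h-hop neighborhood.
    r = rng.choice(hoods[u_idx])
    # Step 4: count the event nodes that r can reach within h hops.
    c = len(h_hop_nodes(adj, r, h) & set(event_nodes))
    # Step 5: accept with probability 1 / c so every reference node is equally likely.
    return r if rng.random() < 1.0 / c else None
```

Each run accepts a given reference node with the same probability, one over the sum of the neighborhood sizes, which is why the accepted nodes are uniform and why overlapping neighborhoods make rejections more frequent.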

Page 19: Measuring Two-Event Structural Correlations on Graphs

IMPORTANCE SAMPLING (3)

Follow the same sampling scheme, but do not reject any node. This results in a nonuniform distribution $P = \{p(r)\}$ over all reference nodes, where $p(r)$ is proportional to the number of event nodes that $r$ can reach within $h$ hops.

Intrinsically, $t(a, b)$ is an estimator of $\tau(a, b)$. The goal is to design a proper estimator for $\tau(a, b)$ which can leverage samples from $P$ to estimate $\tau(a, b)$, as a surrogate for $t(a, b)$.

A consistent estimator:

$$\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)} c(r_i, r_j)}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)}}$$

Here $c(r_i, r_j)$ are the concordance scores and $w_j$ is the number of times $r_j$ is sampled.
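The weighted estimator can be written as a short function. This is a sketch that assumes each distinct sampled reference node is summarized as a tuple `(w, p, s_a, s_b)`: its sample count, its sampling probability, and its two event densities. The tuple layout and the name `weighted_tau` are assumptions for illustration.

```python
from itertools import combinations


def weighted_tau(samples):
    """Importance-weighted Kendall's Tau.

    samples: list of (w, p, s_a, s_b) tuples, one per distinct reference node.
    Each pair is weighted by w_i * w_j / (p(r_i) * p(r_j)).
    """
    numerator = 0.0
    denominator = 0.0
    for (w_i, p_i, sa_i, sb_i), (w_j, p_j, sa_j, sb_j) in combinations(samples, 2):
        weight = (w_i * w_j) / (p_i * p_j)
        product = (sa_i - sa_j) * (sb_i - sb_j)
        numerator += weight * ((product > 0) - (product < 0))
        denominator += weight
    return numerator / denominator
```

Because the same weights appear in the numerator and the denominator, `p` only needs to be known up to a constant factor; the number of event nodes reachable from each sampled node within `h` hops is enough.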

Page 20: Measuring Two-Event Structural Correlations on Graphs

IMPORTANCE SAMPLING (4)

The importance sampling procedure:

Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood.

Step 2: perform an h-hop BFS to retrieve u’s h-hop neighborhood.

Step 3: randomly sample a node r from u’s h-hop neighborhood.

Step 4: if r has been selected before, increment $w_r$; otherwise add r to the sample set and set $w_r = 1$.

The sample is then used in the estimator:

$$\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)} c(r_i, r_j)}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)}}$$
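The four-step procedure can be sketched as follows, again assuming an undirected adjacency-dict graph. The names `h_hop_nodes` and `importance_sample` are hypothetical, `n` is taken to be the number of draws, and the returned `p` values are unnormalized (the number of event nodes within `h` hops of each sampled node), which is an assumption that relies on the normalizing constant cancelling in the estimator.

```python
import random
from collections import Counter, deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within distance h of `source` (plain BFS)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def importance_sample(adj, event_nodes, h, n, rng):
    """Draw n reference nodes without rejection; return (counts w_r, unnormalized p(r))."""
    events = sorted(event_nodes)
    hoods = [sorted(h_hop_nodes(adj, u, h)) for u in events]
    sizes = [len(x) for x in hoods]
    counts = Counter()
    for _ in range(n):
        u_idx = rng.choices(range(len(events)), weights=sizes)[0]  # step 1
        r = rng.choice(hoods[u_idx])                               # steps 2-3
        counts[r] += 1                                             # step 4: w_r
    event_set = set(event_nodes)
    # p(r) up to a constant: number of event nodes within h hops of r.
    p = {r: len(h_hop_nodes(adj, r, h) & event_set) for r in counts}
    return counts, p
```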

Page 21: Measuring Two-Event Structural Correlations on Graphs

WHOLE GRAPH SAMPLING

When the set of all reference nodes, i.e. $V^h_{a \cup b}$, is large enough, we simply sample nodes from the whole graph.
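A possible reading of whole graph sampling is sketched below: draw nodes uniformly from the whole graph and keep only those that can see an event node within `h` hops. The discard step is an assumption (the slide does not spell it out), as are the names `h_hop_nodes` and `whole_graph_sample` and the choice to sample with replacement. The loop does not terminate if no node can reach an event node.

```python
import random
from collections import deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within distance h of `source` (plain BFS)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def whole_graph_sample(adj, event_nodes, h, n, rng):
    """Sample n reference nodes by uniform draws from all nodes; count failed draws."""
    nodes = sorted(adj)
    event_set = set(event_nodes)
    sample, failed = [], 0
    while len(sample) < n:
        r = rng.choice(nodes)
        if h_hop_nodes(adj, r, h) & event_set:
            sample.append(r)   # r is within h hops of an event node
        else:
            failed += 1        # out-of-sight node, discarded
    return sample, failed
```

The number of failed draws grows as the reference set shrinks relative to the graph, which is why this approach only pays off when the reference set is large.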

Page 22: Measuring Two-Event Structural Correlations on Graphs

COMPLEXITY COMPARISON

Space cost is the same: $O(|E|)$

Reference node sampling:

Batch_BFS: $O(|V^h_{a \cup b}| + |E^h_{a \cup b}|)$, linear in the number of nodes and edges in the h-hop neighborhood of $V_{a \cup b}$

Importance sampling: $O(n c_B)$, where $c_B$ is the average cost of an h-hop BFS

Whole graph sampling: $O(n_f c_B)$, with $E(n_f) = n(|V| / |V^h_{a \cup b}| - 1)$, inversely proportional to the size of the h-hop neighborhood of $V_{a \cup b}$

Additional costs in common:

Event density computation: $O(n c_B)$

Z-score computation: $O(n^2)$

We do not need too many sample reference nodes, since the variance of $t(a, b)$ is upper bounded.

Page 23: Measuring Two-Event Structural Correlations on Graphs

OUTLINE

Motivations

Measuring Two Event Structural Correlation (TESC)

Efficient Computation

Experiments

Discussions and Future work

Page 24: Measuring Two-Event Structural Correlations on Graphs

EXPERIMENTS – DATASETS

DBLP: co-author network. Events: keywords in paper titles. 964,677 nodes, 3,547,014 edges, 0.19M keywords.

Intrusion: obtained from a log of intrusion alerts in a computer network. Events: intrusion alerts. 200,858 nodes, 703,020 edges, 545 alerts.

Twitter: 20 million nodes and 0.16 billion edges (scalability).

Page 25: Measuring Two-Event Structural Correlations on Graphs

EXPERIMENTS – EVENT SIMULATION (1)

Simulate positive and negative correlations (on the DBLP graph).

Generate for three h levels: 1, 2, 3.

Positive: linked pairs, with Gaussian-distributed distance.

Negative: every b is kept h+1 hops away from a.

Noise: break the correlation structure by relocating a fraction of nodes.

Page 26: Measuring Two-Event Structural Correlations on Graphs

EXPERIMENTS – EVENT SIMULATION (2)

Results for the positive case

[Figure: recall versus noise, one panel each for h = 1, h = 2, h = 3.]

Page 27: Measuring Two-Event Structural Correlations on Graphs

EXPERIMENTS – REAL EVENTS (DBLP)

Highly positive pairs:

#   Pair                      z-score (h = 1)   z-score (h = 2)   z-score (h = 3)   P(b|a)/P(b)
1   Texture vs. Image         6.22              19.85             30.58             10.96
2   Wireless vs. Sensor       5.99              23.09             32.12             14.49
3   Multicast vs. Network     4.21              18.37             26.66             6.81
4   Wireless vs. Network      2.06              17.41             27.90             4.80
5   Semantic vs. RDF          1.72              16.02             24.94             27.00

Highly negative pairs:

#   Pair                      z-score (h = 1)   z-score (h = 2)   z-score (h = 3)   P(b|a)/P(b)
1   Texture vs. Java          -23.63            -9.41             -6.40             1.67
2   GPU vs. RDF               -24.47            -14.64            -6.31             2.02
3   SQL vs. Calibration       -21.29            -12.70            -5.45             0.73
4   Hardware vs. Ontology     -22.31            -8.85             -5.01             1.36
5   Transaction vs. Camera    -22.20            -7.91             -4.26             2.19

The ratio $P(b \mid a) / P(b)$ is computed by treating nodes as baskets.
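Assuming the ratio reported next to the z-scores is $P(b \mid a) / P(b)$ and that "treating nodes as baskets" means each node is a basket holding the events that occur on it, the ratio could be computed as in this sketch. The function name `lift` and the set-based inputs are illustrative assumptions.

```python
def lift(nodes_a, nodes_b, num_nodes):
    """P(b | a) / P(b) when every node is a basket of the events it carries."""
    p_b = len(nodes_b) / num_nodes
    p_b_given_a = len(nodes_a & nodes_b) / len(nodes_a)
    return p_b_given_a / p_b
```

A value above 1 means the two events co-occur on the same node more often than independence would suggest. Unlike TESC, this ratio ignores the graph structure around each node.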

Page 28: Measuring Two-Event Structural Correlations on Graphs

EXPERIMENTS – REAL EVENTS (INTRUSION)

Highly positive pairs:

#   Pair                                                  1-hop z-score   P(b|a)/P(b)
1   Ping Sweep vs. SMB Service Sweep                      13.64           1.68
2   Ping Flood vs. ICMP Flood                             12.53           3.51
3   Email Pipe vs. Email Command Overflow                 12.15           0.96
4   HTML Hostname Overflow vs. HTML NullChar Evasion      9.08            1.27
5   Email Error vs. Email Pipe                            4.34            0.47

Highly negative pairs:

#   Pair                                                  2-hop z-score   P(b|a)/P(b)
1   Audit TFTP Get Filename vs. LDAP Auth Failed          -31.30          0.00
2   LDAP Auth Failed vs. TFTP Put                         -31.12          0.00
3   DPS Magic Number DoS vs. HTTP Auth TooLong            -30.96          0.00
4   LDAP BER Sequence DoS vs. TFTP Put                    -30.30          0.00
5   Email Executable Extension vs. UDP Service Sweep      -26.93          0.00

Page 29: Measuring Two-Event Structural Correlations on Graphs

EXPERIMENTS – SCALABILITY

Running time when increasing the number of event nodes ($|V_{a \cup b}|$). Results are obtained from the Twitter graph.

[Figure: running time plots for h = 1 and h = 3.]

Page 30: Measuring Two-Event Structural Correlations on Graphs

OUTLINE

Motivations

Measuring Two Event Structural Correlation (TESC)

Efficient Computation

Experiments

Discussions and Future work

Page 31: Measuring Two-Event Structural Correlations on Graphs

DISCUSSIONS

TESC as correlation of local densities

Why a nonparametric statistic:

No distribution assumption, no linear assumption

Nonparametric statistics are less powerful because they use less information

Model nonlinear correlation of data [Kaplan et al., JSTSP, 2009]

Kendall correlation and Spearman correlation:

Both can be used

Choose Kendall’s Tau because of its intuitive interpretation and because it facilitates importance sampling

Intra-correlation and inter-correlation?

(based on constructive comments from Dr. Kaplan)

[Figure: rank example. Values 0.45, 0.5, 0.1 with ranks 2, 3, 1; values 0.3, 0.2, 0.1.]

Page 32: Measuring Two-Event Structural Correlations on Graphs

FUTURE WORK

Structure helps explain the distribution of events. Conversely, events could also help explain structure.

Discuss very similar topics

Buy very similar products

Page 33: Measuring Two-Event Structural Correlations on Graphs

THANK YOU!!!

QUESTIONS?