measuring two-event structural correlations on graphs

MEASURING TWO-EVENT STRUCTURAL CORRELATIONS ON GRAPHS

Ziyu Guan, Nan Li, Xifeng Yan

Department of Computer Science, UC Santa Barbara


OUTLINE

Motivations

Measuring Two-Event Structural Correlation (TESC)

Efficient Computation

Experiments

Discussions and Future work


INTRUSION

Attraction: Ping Sweep, SMB Service Sweep


PRODUCT SALES

What is the relationship between the sales of two products in a social network?

Attraction

Repulsion




A NEW NOTION OF CORRELATION

Two-Event Structural Correlation (TESC)

Defined on graph structures

Capture relationships between distributions of two events on a graph

Events can be different things in different contexts: topics or products (social networks), viruses (contact networks), intrusion alerts (computer networks).


IT IS A NONTRIVIAL PROBLEM

Simply computing the average distance between occurrences of two events will not work: the distance for a positively correlated pair could be longer than that for a negatively correlated pair.

gScore cannot be adapted [Z. Guan et al., SIGMOD 2011].

Significance cannot be assessed by randomization!




HOW TO MEASURE?

Positive correlation: the presence of event A tends to imply the presence of event B. More A also tends to attract more B.

Negative correlation: the presence of one event is likely to imply the absence of the other. More A means less B.

Our idea: employ reference nodes in the graph as observers to capture these characteristics quantitatively. Avoid randomization for significance testing.


PRELIMINARIES

A graph $G = (V, E)$ and an event set $Q = \{q_i\}$. Given two events $a$ and $b$ in $Q$, $V_a$ and $V_b$ are the sets of nodes having $a$ and $b$, respectively.

Def. (Node h-hop neighborhood): given a node, the subgraph induced by the nodes within distance h of that node.

Def. (Node set h-hop neighborhood): given a node set, the subgraph induced by the union of all nodes that are within distance h of at least one node in the set.

Notation: $V_{a \cup b} = V_a \cup V_b$ is the set of event nodes, and $V^h_{a \cup b}$ is the node set of its h-hop neighborhood.



MEASURING CONCORDANCE

Concordance score:

$$c(r_i, r_j) = \begin{cases} 1 & \text{if } (s_a^h(r_i) - s_a^h(r_j))(s_b^h(r_i) - s_b^h(r_j)) > 0 \quad \text{(the density changes are consistent)} \\ -1 & \text{if } (s_a^h(r_i) - s_a^h(r_j))(s_b^h(r_i) - s_b^h(r_j)) < 0 \quad \text{(the density changes are inconsistent)} \\ 0 & \text{otherwise (tie)} \end{cases}$$

Density function:

$$s_a^h(r) = \frac{|V_a \cap V_r^h|}{|V_r^h|}$$

This is the fraction of nodes possessing event a in r’s h-hop neighborhood.
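The density and concordance definitions on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an undirected graph stored as an adjacency dict, and the names `h_hop_nodes`, `density` and `concordance` are hypothetical.

```python
from collections import deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within h hops of source (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # do not expand beyond h hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def density(adj, event_nodes, r, h):
    """s^h(r): fraction of nodes in r's h-hop neighborhood that have the event."""
    hood = h_hop_nodes(adj, r, h)
    return len(hood & event_nodes) / len(hood)


def concordance(sa_i, sa_j, sb_i, sb_j):
    """c(r_i, r_j): +1 if both densities change in the same direction,
    -1 if they change in opposite directions, 0 on a tie."""
    prod = (sa_i - sa_j) * (sb_i - sb_j)
    return (prod > 0) - (prod < 0)


# Toy path graph 0-1-2-3-4 (an assumption for illustration only).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
V_a, V_b = {0, 1}, {0}

c_02 = concordance(
    density(adj, V_a, 0, 1), density(adj, V_a, 2, 1),
    density(adj, V_b, 0, 1), density(adj, V_b, 2, 1),
)
```

In this toy graph both densities drop when moving from node 0 to node 2, so the pair is counted as consistent.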


KENDALL’S TAU AS THE MEASURE

Kendall’s Tau rank correlation is used to compute the overall concordance among reference nodes with regard to density changes of the two events:

$$\tau(a, b) = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c(r_i, r_j)}{\frac{1}{2} N (N - 1)}$$

$N = |V^h_{a \cup b}|$: the number of all reference nodes.

$\tau(a, b)$ lies in $[-1, 1]$. A higher positive value means a stronger positive correlation. A lower negative value means a stronger negative correlation. 0 means no correlation.

Example 1, $\tau(a, b) = 1$:

Node  Density of a  Density of b
r1    0.1           0.2
r2    0.2           0.4
r3    0.3           0.5

Example 2, $\tau(a, b) = -1$:

Node  Density of a  Density of b
r1    0.1           0.5
r2    0.2           0.4
r3    0.3           0.2
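The two three-node examples on this slide can be checked directly against the definition. The sketch below is an illustration only; `tesc_tau` is a hypothetical name, and the input lists are the densities of a and b read off the two examples.

```python
from itertools import combinations


def concordance(sa_i, sa_j, sb_i, sb_j):
    """+1 for a consistent pair, -1 for an inconsistent pair, 0 for a tie."""
    prod = (sa_i - sa_j) * (sb_i - sb_j)
    return (prod > 0) - (prod < 0)


def tesc_tau(dens_a, dens_b):
    """Kendall's Tau over all pairs of reference nodes."""
    n = len(dens_a)
    total = sum(
        concordance(dens_a[i], dens_a[j], dens_b[i], dens_b[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (0.5 * n * (n - 1))


# Densities of a and b at reference nodes r1, r2, r3.
tau_pos = tesc_tau([0.1, 0.2, 0.3], [0.2, 0.4, 0.5])  # both densities rise together
tau_neg = tesc_tau([0.1, 0.2, 0.3], [0.5, 0.4, 0.2])  # b falls while a rises
```

All three pairs are consistent in the first example and inconsistent in the second, which gives the extreme values 1 and -1.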


SIGNIFICANCE TESTING

Impractical to compute $\tau(a, b)$ directly.

Testing: choose uniformly a sample of n reference nodes from $V^h_{a \cup b}$, and compute the score $t(a, b)$ over this sample.

It is proved that the distribution of $t(a, b)$ under the null hypothesis tends to the normal distribution with mean 0 and variance related to n.

Thus, correlation significance (z-score) is

$$z(a, b) = \frac{t(a, b) - E(t(a, b))}{\sqrt{Var(t(a, b))}} = \frac{t(a, b)}{\sqrt{Var(t(a, b))}}$$
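As a rough illustration of the z-score, the sketch below plugs in the classical tie-free null variance of Kendall’s Tau, $2(2n + 5) / (9n(n - 1))$. This is an assumption: the slide only says the variance is related to n, and density values often contain ties, in which case a tie-corrected variance would be needed. The function names are hypothetical.

```python
import math


def kendall_null_variance(n):
    """Null variance of Kendall's Tau for n samples, assuming no ties."""
    return 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))


def z_score(t, n):
    """z(a, b) = t(a, b) / sqrt(Var(t(a, b))), using E(t(a, b)) = 0 under the null."""
    return t / math.sqrt(kendall_null_variance(n))


# The same sample score is more significant with more reference nodes.
z_small = z_score(0.2, 100)
z_large = z_score(0.2, 900)
```

Because the null variance shrinks as n grows, a fixed score t(a, b) yields a larger z-score on a larger sample.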


REFERENCE NODES

The reasons for choosing $V^h_{a \cup b}$ to be the set of all reference nodes:

Nodes outside $V^h_{a \cup b}$ (out-of-sight nodes) cannot reach any event node within h hops.

Incorporating them can only increase the number of consistent pairs and the size of ties (decreasing the variance in the null case), leading to unexpectedly high z-scores.




EFFICIENT COMPUTATION

The key problem in efficient computation is how to get a uniform sample of reference nodes from $V^h_{a \cup b}$ when we only have $V_{a \cup b}$.

We explore three algorithms for reference node sampling: BFS, importance sampling, and whole graph sampling.


BATCH_BFS

Batch_BFS is just like an h-hop breadth-first search, but with the queue initialized with a set of nodes.

Initialize the queue with all event nodes ($V_{a \cup b}$) to enumerate all reference nodes ($V^h_{a \cup b}$).

Example queue: $V_{a \cup b} = \{v_1, v_2, v_3, v_4\}$

Correctness can be easily verified by imagining that we start from a virtual node $v_0$ connected to all nodes in $V_{a \cup b}$ and then do an (h+1)-hop BFS.
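A multi-source BFS of this kind might look as follows. It is a minimal sketch assuming an undirected graph stored as an adjacency dict; the function name `batch_bfs` and the toy graph are illustrative, not taken from the authors' code.

```python
from collections import deque


def batch_bfs(adj, event_nodes, h):
    """Enumerate every node within h hops of at least one event node."""
    dist = {v: 0 for v in event_nodes}  # all event nodes start at depth 0
    queue = deque(event_nodes)
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # depth limit reached, do not expand further
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


# Toy path graph 0-1-2-3-4-5-6 with event nodes at both ends.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
reference_nodes = batch_bfs(adj, {0, 6}, 2)
```

Node 3 is three hops from both event nodes, so it is the only node left out of the reference set here.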


IMPORTANCE SAMPLING (1)

Sample size n is usually much smaller than $|V^h_{a \cup b}|$.

The idea is to sample nodes directly from $V^h_{a \cup b}$ and avoid enumerating $V^h_{a \cup b}$. Time cost then depends on n rather than $|V^h_{a \cup b}|$.

The basic operation is peeking into the h-hop neighborhood of an event node.

Difficulties: (1) different nodes have h-hop neighborhoods of different sizes; (2) there could be many overlapping regions.



IMPORTANCE SAMPLING (2)

Uniform sampling by rejection sampling:

Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood.

Step 2: perform an h-hop BFS to retrieve u’s h-hop neighborhood.

Step 3: randomly sample a node r from u’s h-hop neighborhood.

Step 4: do an h-hop BFS from r to see how many event nodes it can reach (say, c event nodes).

Step 5: with probability 1 / c, accept r as a reference node. Otherwise get nothing from this run.

Example with $V_{a \cup b} = \{u, v, w\}$: the probability that a run succeeds is

$$P_{succ} = \frac{|V_u^h \cup V_v^h \cup V_w^h|}{|V_u^h| + |V_v^h| + |V_w^h|}$$

Problem: heavy overlap leads to a high failure probability!
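The five steps can be sketched as a single run that either returns a reference node or fails. This is an illustration under stated assumptions: the graph is undirected and stored as an adjacency dict, and for brevity the sketch computes every event node's neighborhood eagerly, which a real implementation would presumably avoid. The names `h_hop_nodes` and `rejection_sample_once` are hypothetical.

```python
import random
from collections import deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within h hops of source (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def rejection_sample_once(adj, event_nodes, h, rng):
    """One run of steps 1-5; returns a reference node, or None on rejection."""
    events = sorted(event_nodes)
    hoods = [sorted(h_hop_nodes(adj, u, h)) for u in events]
    # Step 1: pick an event node with probability proportional to |V_u^h|.
    idx = rng.choices(range(len(events)), weights=[len(x) for x in hoods])[0]
    # Steps 2-3: sample r uniformly from that event node's h-hop neighborhood.
    r = rng.choice(hoods[idx])
    # Step 4: count the event nodes r can reach within h hops (c >= 1 here).
    c = len(h_hop_nodes(adj, r, h) & set(event_nodes))
    # Step 5: accept with probability 1 / c.
    return r if rng.random() < 1.0 / c else None


# Toy path graph 0-1-2-3-4 with event nodes 1 and 3; node 2 is in the overlap.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
runs = [rejection_sample_once(adj, {1, 3}, 1, rng) for _ in range(6000)]
accepted = [r for r in runs if r is not None]
```

In this toy graph the two neighborhoods have 3 nodes each and share node 2, so a run should succeed with probability 5/6 and each of the five reference nodes should be accepted about equally often.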


IMPORTANCE SAMPLING (3)

Follow the same sampling scheme, but do not reject any node, resulting in a nonuniform distribution $P = \{p(r)\}$ over all reference nodes, where $p(r)$ is proportional to the number of event nodes r can reach in h hops.

Intrinsically, $t(a, b)$ is an estimator of $\tau(a, b)$. The goal is to design a proper estimator for $\tau(a, b)$ that can leverage samples from $P$, as a surrogate for $t(a, b)$.

A consistent estimator:

$$\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} c(r_i, r_j) \frac{w_i w_j}{p(r_i) p(r_j)}}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)}}$$

Here $c(r_i, r_j)$ are the concordance scores and $w_j$ is the number of times $r_j$ is sampled.


IMPORTANCE SAMPLING (4)

The importance sampling procedure:

Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood.

Step 2: perform an h-hop BFS to retrieve u’s h-hop neighborhood.

Step 3: randomly sample a node r from u’s h-hop neighborhood.

Step 4: if r has been selected before, $w_r$++; else add r to the sample set and set $w_r = 1$.

The estimate is then computed as

$$\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} c(r_i, r_j) \frac{w_i w_j}{p(r_i) p(r_j)}}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)}}$$
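The procedure and the weighted estimator can be sketched together as follows. This is an illustration, not the authors' code: it assumes an undirected adjacency-dict graph, uses the unnormalized count of reachable event nodes as $p(r)$ (the normalizing constant cancels between numerator and denominator), and all function names are hypothetical.

```python
import random
from collections import Counter, deque
from itertools import combinations


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within h hops of source (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def importance_sample(adj, event_nodes, h, n_draws, rng):
    """Steps 1-4: return a Counter mapping each sampled node r to w_r."""
    events = sorted(event_nodes)
    hoods = [sorted(h_hop_nodes(adj, u, h)) for u in events]
    sizes = [len(x) for x in hoods]
    weights = Counter()
    for _ in range(n_draws):
        idx = rng.choices(range(len(events)), weights=sizes)[0]  # step 1
        r = rng.choice(hoods[idx])                               # steps 2-3
        weights[r] += 1                                          # step 4
    return weights


def estimate_t(adj, V_a, V_b, h, weights):
    """Weighted estimator using unnormalized p(r) = number of reachable event nodes."""
    events = V_a | V_b
    hood = {r: h_hop_nodes(adj, r, h) for r in weights}
    p = {r: len(hood[r] & events) for r in weights}
    s_a = {r: len(hood[r] & V_a) / len(hood[r]) for r in weights}
    s_b = {r: len(hood[r] & V_b) / len(hood[r]) for r in weights}
    num = den = 0.0
    for ri, rj in combinations(sorted(weights), 2):
        coef = weights[ri] * weights[rj] / (p[ri] * p[rj])
        prod = (s_a[ri] - s_a[rj]) * (s_b[ri] - s_b[rj])
        num += coef * ((prod > 0) - (prod < 0))  # concordance score c(r_i, r_j)
        den += coef
    return num / den


# Toy path graph 0-1-2-3-4-5 with overlapping events (illustrative only).
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
V_a, V_b = {0, 1}, {1, 2}
weights = importance_sample(adj, V_a | V_b, 1, 200, random.Random(1))
t_hat = estimate_t(adj, V_a, V_b, 1, weights)
```

Dividing by $p(r_i) p(r_j)$ down-weights nodes that are reachable from many event nodes and are therefore drawn more often than a uniform sample would draw them.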


WHOLE GRAPH SAMPLING

When the set of all reference nodes, i.e. $V^h_{a \cup b}$, is large enough, we simply sample nodes from the graph.
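One way to read this slide is sketched below: draw graph nodes uniformly and keep those that can reach an event node within h hops. The membership test and the early stop after n accepted nodes are assumptions made for illustration (the slide only says nodes are sampled from the graph), and the names `reaches_event` and `whole_graph_sample` are hypothetical.

```python
import random
from collections import deque


def reaches_event(adj, r, event_nodes, h):
    """True if r can reach at least one event node within h hops."""
    dist = {r: 0}
    queue = deque([r])
    while queue:
        u = queue.popleft()
        if u in event_nodes:
            return True
        if dist[u] == h:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False


def whole_graph_sample(adj, event_nodes, h, n, rng):
    """Examine nodes in uniform random order until n reference nodes are found."""
    nodes = sorted(adj)
    rng.shuffle(nodes)  # uniform order, without replacement
    sample, examined = [], 0
    for r in nodes:
        examined += 1
        if reaches_event(adj, r, event_nodes, h):
            sample.append(r)
            if len(sample) == n:
                break
    return sample, examined


# Toy path graph 0-1-...-9 with a single event node at 0.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
sample, examined = whole_graph_sample(adj, {0}, 2, 2, random.Random(0))
```

The number of nodes examined grows as the reference set becomes a smaller fraction of the graph, which is why this strategy only pays off when $V^h_{a \cup b}$ is large.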


COMPLEXITY COMPARISON

Space cost is the same: $O(|E|)$

Reference node sampling:

Batch_BFS: $O(|V^h_{a \cup b}| + |E^h_{a \cup b}|)$, linear in the number of nodes and edges in the h-hop neighborhood of $V_{a \cup b}$.

Importance sampling: $O(n c_B)$, where $c_B$ is the average cost of an h-hop BFS.

Whole graph sampling: $O(n_f c_B)$, where $n_f$ is the number of graph nodes examined; $E(n_f)$ is inversely proportional to the size of the h-hop neighborhood of $V_{a \cup b}$.

Additional costs in common:

Event density computation: $O(n c_B)$

Z-score computation: $O(n^2)$

We do not need many sample reference nodes, since the variance of $t(a, b)$ is upper bounded.




EXPERIMENTS – DATASETS

DBLP: co-author network. Events: keywords in paper titles. 964,677 nodes, 3,547,014 edges, 0.19M keywords.

Intrusion: obtained from the log of intrusion alerts in a computer network. Events: intrusion alerts. 200,858 nodes, 703,020 edges, 545 alerts.

Twitter: 20 million nodes and 0.16 billion edges (used for scalability).


EXPERIMENTS – EVENT SIMULATION (1)

Simulate positive and negative correlations (on the DBLP graph).

Generate event pairs for three h levels: 1, 2, 3.

Positive: linked pairs, with Gaussian-distributed distance.

Negative: every b is kept h+1 hops away from a.

Noise: break the correlation structure by relocating a fraction of nodes.


EXPERIMENTS – EVENT SIMULATION (2)

Results for the positive case.

[Figure: recall versus noise level, with one panel each for h = 1, h = 2 and h = 3]


EXPERIMENTS – REAL EVENTS (DBLP)

The last column, $P(b|a) / P(b)$, is obtained by treating nodes as baskets.

Highly positive pairs:

#  Pair                     z-score (h = 1)  z-score (h = 2)  z-score (h = 3)  P(b|a)/P(b)
1  Texture vs. Image        6.22             19.85            30.58            10.96
2  Wireless vs. Sensor      5.99             23.09            32.12            14.49
3  Multicast vs. Network    4.21             18.37            26.66            6.81
4  Wireless vs. Network     2.06             17.41            27.90            4.80
5  Semantic vs. RDF         1.72             16.02            24.94            27.00

Highly negative pairs:

#  Pair                     z-score (h = 1)  z-score (h = 2)  z-score (h = 3)  P(b|a)/P(b)
1  Texture vs. Java         -23.63           -9.41            -6.40            1.67
2  GPU vs. RDF              -24.47           -14.64           -6.31            2.02
3  SQL vs. Calibration      -21.29           -12.70           -5.45            0.73
4  Hardware vs. Ontology    -22.31           -8.85            -5.01            1.36
5  Transaction vs. Camera   -22.20           -7.91            -4.26            2.19
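For comparison with the z-scores, the basket-style ratio $P(b|a) / P(b)$ ignores graph structure and only looks at which nodes carry both events. A minimal sketch, with a hypothetical function name `basket_ratio` and made-up node sets:

```python
def basket_ratio(V_a, V_b, num_nodes):
    """P(b|a) / P(b), treating every node as an independent basket."""
    p_b = len(V_b) / num_nodes
    p_b_given_a = len(V_a & V_b) / len(V_a)
    return p_b_given_a / p_b


# Illustrative numbers only: 100 nodes, 10 with event a, 20 with event b, 5 with both.
V_a = set(range(10))
V_b = set(range(5, 25))
ratio = basket_ratio(V_a, V_b, 100)  # P(b|a) = 0.5, P(b) = 0.2
```

A ratio above 1 means b co-occurs with a on the same node more often than chance, but it says nothing about events that sit on neighboring nodes, which is what the h-hop z-scores capture.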


EXPERIMENTS – REAL EVENTS (INTRUSION)

Highly positive pairs:

#  Pair                                                1-hop z-score  P(b|a)/P(b)
1  Ping Sweep vs. SMB Service Sweep                    13.64          1.68
2  Ping Flood vs. ICMP Flood                           12.53          3.51
3  Email Pipe vs. Email Command Overflow               12.15          0.96
4  HTML Hostname Overflow vs. HTML NullChar Evasion    9.08           1.27
5  Email Error vs. Email Pipe                          4.34           0.47

Highly negative pairs:

#  Pair                                                2-hop z-score  P(b|a)/P(b)
1  Audit TFTP Get Filename vs. LDAP Auth Failed        -31.30         0.00
2  LDAP Auth Failed vs. TFTP Put                       -31.12         0.00
3  DPS Magic Number DoS vs. HTTP Auth TooLong          -30.96         0.00
4  LDAP BER Sequence Dos vs. TFTP Put                  -30.30         0.00
5  Email Executable Extension vs. UDP Service Sweep    -26.93         0.00


EXPERIMENTS – SCALABILITY

[Figure: running time when increasing the number of event nodes ($|V_{a \cup b}|$), with panels for h = 1 and h = 3. Results are obtained from the Twitter graph.]




DISCUSSIONS

TESC as correlation of local densities.

Why a nonparametric statistic? No distribution assumption, no linearity assumption. Nonparametric statistics are less powerful because they use less information. Model nonlinear correlation of data [Kaplan et al., JSTSP, 2009].

Kendall correlation and Spearman correlation: both can be used. We choose Kendall’s Tau because it has an intuitive interpretation and facilitates importance sampling.

Intra-correlation and inter-correlation?

(based on constructive comments from Dr. Kaplan)



FUTURE WORK

Structure helps explain the distribution of events. Conversely, events could also help explain structure.

Examples: discuss very similar topics, buy very similar products.


THANK YOU!!! QUESTIONS?
