Measuring Two-Event Structural Correlations on Graphs
DESCRIPTION
Measuring Two-Event Structural Correlations on Graphs. Ziyu Guan, Nan Li, Xifeng Yan, Department of Computer Science, UC Santa Barbara. Outline: Motivations; Measuring Two-Event Structural Correlation (TESC); Efficient Computation; Experiments; Discussions and Future Work.
TRANSCRIPT
MEASURING TWO-EVENT STRUCTURAL CORRELATIONS ON GRAPHS
Ziyu Guan, Nan Li, Xifeng Yan
Department of Computer Science
UC Santa Barbara
Z. Guan, Nan Li, Xifeng Yan
Measuring Two-Event Structural Correlations on Graphs
OUTLINE
Motivations
Measuring Two-Event Structural Correlation (TESC)
Efficient Computation
Experiments
Discussions and Future Work
INTRUSION
Attraction: Ping Sweep and SMB Service Sweep
PRODUCT SALES
What is the relationship between the sales of two products in a social network? It may be attraction or repulsion.
A NEW NOTION OF CORRELATION
Two-Event Structural Correlation (TESC):
Defined on graph structures.
Captures relationships between the distributions of two events on a graph.
Events can be different things in different contexts: topics or products (social networks), viruses (contact networks), intrusion alerts (computer networks).
IT IS A NONTRIVIAL PROBLEM
Simply computing the average distance between occurrences of the two events will not work: the distance for a positively correlated pair could be longer than that for a negatively correlated pair.
gScore cannot be adapted [Z. Guan et al., SIGMOD 2011].
Significance cannot be assessed by randomization!
HOW TO MEASURE?
Positive correlation: the presence of event A tends to imply the presence of event B. More A also tends to attract more B.
Negative correlation: the presence of one event is likely to imply the absence of the other. More A means less B.
Our idea: employ reference nodes in the graph as observers to capture these characteristics quantitatively. This avoids randomization for significance testing.
PRELIMINARIES
A graph $G = (V, E)$ and an event set $Q = \{q_i\}$. Given two events $a$ and $b$ in $Q$, $V_a$ and $V_b$ are the sets of nodes having $a$ and $b$, respectively.
Def. (Node h-hop neighborhood): given a node, the subgraph induced by the nodes within distance h of that node.
Def. (Node set h-hop neighborhood): given a node set, the subgraph induced by the union of all nodes that are within distance h of at least one node in the set.
Notation: $V_{a \cup b} = V_a \cup V_b$ is the set of event nodes, and $V^h_{a \cup b}$ is the node set of its h-hop neighborhood.
MEASURING CONCORDANCE
Concordance score:
$$c(r_i, r_j) = \begin{cases} 1 & \text{if } (s^h_a(r_i) - s^h_a(r_j))(s^h_b(r_i) - s^h_b(r_j)) > 0 \\ -1 & \text{if } (s^h_a(r_i) - s^h_a(r_j))(s^h_b(r_i) - s^h_b(r_j)) < 0 \\ 0 & \text{otherwise} \end{cases}$$
The three cases are: the density changes are consistent (1), the density changes are inconsistent (-1), and a tie (0).
Density function:
$$s^h_a(r) = \frac{|V_a \cap V^h_r|}{|V^h_r|}$$
This is the fraction of nodes possessing event a in r’s h-hop neighborhood, where $V^h_r$ is the node set of that neighborhood.
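A minimal Python sketch of the density function and the concordance score, assuming an undirected graph stored as an adjacency dictionary. The names `h_hop_nodes`, `density`, and `concordance` are illustrative and do not come from the slides.

```python
from collections import deque


def h_hop_nodes(adj, source, h):
    """Return the set of nodes within distance h of `source` (inclusive)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue  # do not expand beyond h hops
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def density(adj, event_nodes, r, h):
    """Fraction of nodes in r's h-hop neighborhood that carry the event."""
    hood = h_hop_nodes(adj, r, h)
    return len(hood & event_nodes) / len(hood)


def concordance(sa_i, sa_j, sb_i, sb_j):
    """1 if the two density changes agree in sign, -1 if they disagree, 0 on a tie."""
    prod = (sa_i - sa_j) * (sb_i - sb_j)
    return (prod > 0) - (prod < 0)


# Toy path graph 0-1-2-3-4 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
V_a = {0, 1}
s_a_at_2 = density(adj, V_a, 2, 1)  # neighborhood {1, 2, 3}, one node has event a
```

Each density evaluation costs one h-hop BFS, which is why the number of reference nodes matters for running time.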
KENDALL’S TAU AS THE MEASURE
Kendall’s Tau rank correlation is used to compute the overall concordance among reference nodes with regard to the density changes of the two events:
$$\tau(a, b) = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c(r_i, r_j)}{\frac{1}{2} N (N - 1)}$$
$N = |V^h_{a \cup b}|$: the number of all reference nodes.
$\tau(a, b)$ lies in [-1, 1]. A higher positive value means a stronger positive correlation. A lower negative value means a stronger negative correlation. 0 means no correlation.
Example 1, $\tau(a, b) = 1$:
Node  Density of a  Density of b
r1    0.1           0.2
r2    0.2           0.4
r3    0.3           0.5
Example 2, $\tau(a, b) = -1$:
Node  Density of a  Density of b
r1    0.1           0.5
r2    0.2           0.4
r3    0.3           0.2
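The two three-node examples on this slide can be checked with a short Python sketch. It sums the concordance score over all pairs of reference nodes and divides by N(N-1)/2. The function name `kendall_tau` is illustrative.

```python
from itertools import combinations


def kendall_tau(dens_a, dens_b):
    """Kendall's Tau over paired density lists, one entry per reference node."""
    n = len(dens_a)
    total = 0
    for i, j in combinations(range(n), 2):
        prod = (dens_a[i] - dens_a[j]) * (dens_b[i] - dens_b[j])
        total += (prod > 0) - (prod < 0)  # concordance score c(r_i, r_j)
    return total / (0.5 * n * (n - 1))


# Densities of a and b at r1, r2, r3, taken from the slide's two examples.
tau_pos = kendall_tau([0.1, 0.2, 0.3], [0.2, 0.4, 0.5])  # all pairs consistent
tau_neg = kendall_tau([0.1, 0.2, 0.3], [0.5, 0.4, 0.2])  # all pairs inconsistent
```

In the first example both densities rise together, so every pair is consistent. In the second the density of b falls as the density of a rises, so every pair is inconsistent.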
SIGNIFICANCE TESTING
It is impractical to compute $\tau(a, b)$ directly.
Testing: choose a uniform sample of n reference nodes from $V^h_{a \cup b}$, and compute the score $t(a, b)$ over this sample.
It is proved that the distribution of $t(a, b)$ under the null hypothesis tends to the normal distribution with mean 0 and variance related to n.
Thus, the correlation significance (z-score) is
$$z(a, b) = \frac{t(a, b) - E(t(a, b))}{\sqrt{Var(t(a, b))}} = \frac{t(a, b)}{\sqrt{Var(t(a, b))}}$$
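A hedged Python sketch of the z-score step. The slide only says the null variance is "related to n". The sketch assumes the textbook no-ties variance of Kendall's Tau, 2(2n + 5) / (9n(n - 1)). Real density data usually contains ties, which require a corrected variance, so this is an illustration rather than the authors' exact formula.

```python
import math


def null_variance_no_ties(n):
    """Variance of Kendall's Tau under the null hypothesis, assuming no ties."""
    return 2 * (2 * n + 5) / (9 * n * (n - 1))


def z_score(t, n):
    """z = t / sqrt(Var(t)), since E(t) = 0 under the null hypothesis."""
    return t / math.sqrt(null_variance_no_ties(n))


# Hypothetical sample: t(a, b) = 0.2 measured over n = 900 reference nodes.
z = z_score(0.2, 900)
significant = abs(z) > 1.96  # two-sided test at the 5% level
```

Because the variance shrinks roughly as 1/n, even a moderate value of t(a, b) becomes significant with a few hundred reference nodes.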
REFERENCE NODES
The reasons for choosing $V^h_{a \cup b}$ to be the set of all reference nodes:
Nodes outside $V^h_{a \cup b}$ (out-of-sight nodes) cannot reach any event node within h hops.
Incorporating them can only increase the number of consistent pairs and the size of ties (decreasing the variance in the null case), leading to unexpectedly high z-scores.
EFFICIENT COMPUTATION
The key problem in efficient computation is how to get a uniform sample of reference nodes from $V^h_{a \cup b}$ when we only have $V_{a \cup b}$.
We explore three algorithms for reference node sampling: BFS, importance sampling, and whole graph sampling.
BATCH_BFS
Batch_BFS is just like an h-hop breadth-first search, but with the queue initialized with a set of nodes.
Initialize the queue with all event nodes ($V_{a \cup b}$) to enumerate all reference nodes ($V^h_{a \cup b}$).
Example queue: $V_{a \cup b} = \{v_1, v_2, v_3, v_4\}$.
Correctness can be easily verified by imagining that we start with a virtual node $v_0$ which connects to all nodes in $V_{a \cup b}$ and then do an (h+1)-hop BFS.
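Batch_BFS can be sketched in a few lines of Python, assuming an undirected graph stored as an adjacency dictionary. The function name `batch_bfs` and the toy graph are illustrative.

```python
from collections import deque


def batch_bfs(adj, event_nodes, h):
    """Enumerate every node within h hops of at least one event node."""
    dist = {v: 0 for v in event_nodes}  # all event nodes start at distance 0
    queue = deque(event_nodes)          # queue seeded with the whole set
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue  # neighbors would be more than h hops away
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


# Toy path graph 0-1-2-3-4-5-6 with event nodes at both ends (hypothetical data).
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
reference_nodes = batch_bfs(adj, {0, 6}, 2)  # node 3 is out of sight
```

Each node and edge in the neighborhood is touched at most once, no matter how many event nodes there are. Running a separate BFS per event node would revisit the overlapping regions.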
IMPORTANCE SAMPLING (1)
The sample size n is usually much smaller than $|V^h_{a \cup b}|$.
The idea is to sample nodes directly from $V^h_{a \cup b}$ and avoid enumerating $V^h_{a \cup b}$, so that the time cost depends on n rather than on $|V^h_{a \cup b}|$.
The basic operation is peeking at the h-hop neighborhood of an event node.
Difficulties: (1) different nodes have h-hop neighborhoods of different sizes; (2) there could be many overlapping regions.
IMPORTANCE SAMPLING (2)
Uniform sampling by rejection sampling:
Step 1: Select an event node u with probability proportional to the size of its h-hop neighborhood.
Step 2: Perform an h-hop BFS search to retrieve u’s h-hop neighborhood.
Step 3: Randomly sample a node r from u’s h-hop neighborhood.
Step 4: Do an h-hop BFS search from r to see how many event nodes it can reach (say, c event nodes).
Step 5: With probability 1 / c, accept r as a reference node. Otherwise get nothing from this run.
Example with $V_{a \cup b} = \{u, v, w\}$: the probability that a run succeeds is
$$P_{succ} = \frac{|V^h_u \cup V^h_v \cup V^h_w|}{|V^h_u| + |V^h_v| + |V^h_w|}$$
Problem: heavy overlap leads to a high failure probability!
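The five rejection-sampling steps can be sketched in Python as follows. The sketch assumes an undirected graph, so the sampled node r always reaches at least the event node u it was drawn from. It also assumes the neighborhood sizes needed in Step 1 are obtained by one BFS per event node, which the slide does not specify. All function names are illustrative.

```python
import random
from collections import deque


def h_hop_nodes(adj, source, h):
    """Nodes within distance h of `source` (inclusive), found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def rejection_sample_once(adj, event_nodes, h, rng):
    """One run of Steps 1-5; returns a reference node, or None on rejection."""
    events = sorted(event_nodes)
    # Assumption: sizes for Step 1 come from one BFS per event node.
    hoods = [sorted(h_hop_nodes(adj, u, h)) for u in events]
    # Step 1: pick an event node with probability proportional to |V_u^h|.
    idx = rng.choices(range(len(events)), weights=[len(x) for x in hoods])[0]
    # Steps 2-3: draw r uniformly from that node's h-hop neighborhood.
    r = rng.choice(hoods[idx])
    # Step 4: count the event nodes that r can reach within h hops.
    c = len(h_hop_nodes(adj, r, h) & set(events))
    # Step 5: accept with probability 1 / c.
    return r if rng.random() < 1.0 / c else None


# Toy path graph 0-1-2-3-4 with event nodes 1 and 3 and h = 1 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
runs = [rejection_sample_once(adj, {1, 3}, 1, rng) for _ in range(2000)]
accepted = [r for r in runs if r is not None]
```

In this toy graph the two neighborhoods {0, 1, 2} and {2, 3, 4} overlap only at node 2, so the success probability is 5/6. Heavier overlap pushes it lower.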
IMPORTANCE SAMPLING (3)
Follow the same sampling scheme, but do not reject any node. This results in a nonuniform distribution $P = \{p(r)\}$ over all reference nodes, where $p(r)$ is proportional to the number of event nodes r can reach within h hops.
Intrinsically, $t(a, b)$ is an estimator of $\tau(a, b)$. The goal is to design a proper estimator for $\tau(a, b)$ which can leverage samples from $P$ to estimate $\tau(a, b)$, as a surrogate for $t(a, b)$.
A consistent estimator:
$$\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i)\, p(r_j)}\, c(r_i, r_j)}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i)\, p(r_j)}}$$
Here $c(r_i, r_j)$ are the concordance scores and $w_j$ is the number of times $r_j$ is sampled.
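A short derivation, offered as a sketch, of why the sampling probability without rejection is proportional to the number of reachable event nodes. The symbols $c_r$ (the number of event nodes within h hops of r) and $Z$ (the normalizing constant) are introduced here for illustration.

```latex
% c_r: number of event nodes within h hops of r (assumed symbol).
% Z:   sum of the h-hop neighborhood sizes of all event nodes (assumed symbol).
\begin{align*}
p(r) &= \sum_{u \in V_{a \cup b},\; r \in V^h_u}
        \frac{|V^h_u|}{Z} \cdot \frac{1}{|V^h_u|}
      = \frac{c_r}{Z},
\qquad Z = \sum_{v \in V_{a \cup b}} |V^h_v| , \\
\frac{w_i w_j}{p(r_i)\, p(r_j)} &= Z^2 \, \frac{w_i w_j}{c_{r_i}\, c_{r_j}} .
\end{align*}
```

The factor $Z^2$ appears in both the numerator and the denominator of the weighted estimator, so it cancels. In practice $p(r)$ can therefore be replaced by $c_r$, and $Z$ never needs to be computed.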
IMPORTANCE SAMPLING (4)
The importance sampling procedure:
Step 1: Select an event node u with probability proportional to the size of its h-hop neighborhood.
Step 2: Perform an h-hop BFS search to retrieve u’s h-hop neighborhood.
Step 3: Randomly sample a node r from u’s h-hop neighborhood.
Step 4: If r has been selected before, increment $w_r$; otherwise add r to the sample set and set $w_r = 1$.
The estimate is then computed as
$$\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i)\, p(r_j)}\, c(r_i, r_j)}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i)\, p(r_j)}}$$
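A Python sketch of the importance sampling procedure and the weighted estimator, under the same assumptions as before: an undirected adjacency-dictionary graph, and neighborhood sizes obtained by one BFS per event node. Because the estimator is a ratio, the sketch uses the reachable-event count of each node in place of p(r); the normalizing constant cancels. All names are illustrative.

```python
import random
from collections import Counter, deque
from itertools import combinations


def h_hop_nodes(adj, source, h):
    """Nodes within distance h of `source` (inclusive), found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def importance_sample(adj, event_nodes, h, n_draws, rng):
    """Steps 1-4: returns a Counter mapping each sampled node r to w_r."""
    events = sorted(event_nodes)
    hoods = [sorted(h_hop_nodes(adj, u, h)) for u in events]
    sizes = [len(x) for x in hoods]
    counts = Counter()
    for _ in range(n_draws):
        idx = rng.choices(range(len(events)), weights=sizes)[0]  # Step 1
        r = rng.choice(hoods[idx])                               # Steps 2-3
        counts[r] += 1                                           # Step 4
    return counts


def weighted_tau(counts, reach, dens_a, dens_b):
    """Weighted estimator; reach[r] stands in for p(r) up to a constant."""
    num = den = 0.0
    for ri, rj in combinations(sorted(counts), 2):
        weight = counts[ri] * counts[rj] / (reach[ri] * reach[rj])
        prod = (dens_a[ri] - dens_a[rj]) * (dens_b[ri] - dens_b[rj])
        num += weight * ((prod > 0) - (prod < 0))
        den += weight
    return num / den  # needs at least two distinct sampled nodes


# Toy path graph 0-1-2-3-4 with V_a = {1}, V_b = {3}, h = 1 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
V_a, V_b, h = {1}, {3}, 1
counts = importance_sample(adj, V_a | V_b, h, 200, random.Random(0))
hood = {r: h_hop_nodes(adj, r, h) for r in counts}
reach = {r: len(hood[r] & (V_a | V_b)) for r in counts}
dens_a = {r: len(hood[r] & V_a) / len(hood[r]) for r in counts}
dens_b = {r: len(hood[r] & V_b) / len(hood[r]) for r in counts}
estimate = weighted_tau(counts, reach, dens_a, dens_b)
```

No draw is wasted here, unlike in rejection sampling. The price is the reweighting by 1 / p(r), which down-weights nodes that lie in many overlapping neighborhoods and are therefore drawn more often.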
WHOLE GRAPH SAMPLING
When the set of all reference nodes, i.e. $V^h_{a \cup b}$, is large enough, we simply sample nodes from the whole graph and keep those that belong to $V^h_{a \cup b}$.
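A minimal Python sketch of whole graph sampling, assuming nodes are drawn uniformly with replacement and that membership in the reference set is tested by an h-hop BFS that stops as soon as an event node is found. The function names and the `max_draws` safety cap are illustrative additions.

```python
import random
from collections import deque


def reaches_event(adj, r, event_nodes, h):
    """True if some event node lies within h hops of r."""
    if r in event_nodes:
        return True
    dist = {r: 0}
    queue = deque([r])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue
        for nbr in adj[node]:
            if nbr in event_nodes:
                return True  # early exit: r is a reference node
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return False


def whole_graph_sample(adj, event_nodes, h, n, rng, max_draws=100000):
    """Draw nodes uniformly from the graph; keep those that are reference nodes."""
    nodes = sorted(adj)
    sample, draws = [], 0
    while len(sample) < n and draws < max_draws:
        r = rng.choice(nodes)
        draws += 1
        if reaches_event(adj, r, event_nodes, h):
            sample.append(r)
    return sample, draws


# Toy path graph 0-1-...-9 with one event node at 0 and h = 1 (hypothetical data).
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
sample, draws = whole_graph_sample(adj, {0}, 1, 20, random.Random(0))
```

Here only 2 of the 10 nodes are reference nodes, so most draws are wasted. The approach pays off only when the reference set covers a large share of the graph.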
COMPLEXITY COMPARISON
Space cost is the same: $O(|E|)$.
Reference node sampling:
Batch_BFS: $O(|V^h_{a \cup b}| + |E^h_{a \cup b}|)$, linear in the number of nodes and edges in the h-hop neighborhood of $V_{a \cup b}$.
Importance sampling: $O(n c_B)$, where $c_B$ is the average cost of an h-hop BFS search.
Whole graph sampling: $O(n_f c_B)$, with $E(n_f) = n(|V| / |V^h_{a \cup b}| - 1)$, which is inversely proportional to the size of the h-hop neighborhood of $V_{a \cup b}$.
Additional costs in common:
Event density computation: $O(n c_B)$.
Z-score computation: $O(n^2)$.
Not many sample reference nodes are needed, since the variance of $t(a, b)$ is upper bounded.
EXPERIMENTS – DATASETS
DBLP: co-author network. Events: keywords in paper titles. 964,677 nodes, 3,547,014 edges, 0.19M keywords.
Intrusion: obtained from a log of intrusion alerts in a computer network. Events: intrusion alerts. 200,858 nodes, 703,020 edges, 545 alerts.
Twitter: 20 million nodes and 0.16 billion edges (scalability).
EXPERIMENTS – EVENT SIMULATION (1)
Simulate positive and negative correlations (on the DBLP graph).
Generate for three h levels: 1, 2, 3.
Positive: linked pairs, with Gaussian-distributed distance.
Negative: every b is kept h+1 hops away from a.
Noise: break the correlation structure by relocating a fraction of the nodes.
EXPERIMENTS – EVENT SIMULATION (2)
Results for the positive case.
[Figure: recall vs. noise, one panel each for h = 1, h = 2, h = 3]
EXPERIMENTS – REAL EVENTS (DBLP)
Highly positive pairs:
#  Pair                    z-score (h = 1)  z-score (h = 2)  z-score (h = 3)  P(b|a)/P(b)
1  Texture vs. Image       6.22             19.85            30.58            10.96
2  Wireless vs. Sensor     5.99             23.09            32.12            14.49
3  Multicast vs. Network   4.21             18.37            26.66            6.81
4  Wireless vs. Network    2.06             17.41            27.90            4.80
5  Semantic vs. RDF        1.72             16.02            24.94            27.00
Highly negative pairs:
#  Pair                    z-score (h = 1)  z-score (h = 2)  z-score (h = 3)  P(b|a)/P(b)
1  Texture vs. Java        -23.63           -9.41            -6.40            1.67
2  GPU vs. RDF             -24.47           -14.64           -6.31            2.02
3  SQL vs. Calibration     -21.29           -12.70           -5.45            0.73
4  Hardware vs. Ontology   -22.31           -8.85            -5.01            1.36
5  Transaction vs. Camera  -22.20           -7.91            -4.26            2.19
The P(b|a)/P(b) column is computed treating nodes as baskets.
EXPERIMENTS – REAL EVENTS (INTRUSION)
Highly positive pairs:
#  Pair                                               1-hop z-score  P(b|a)/P(b)
1  Ping Sweep vs. SMB Service Sweep                   13.64          1.68
2  Ping Flood vs. ICMP Flood                          12.53          3.51
3  Email Pipe vs. Email Command Overflow              12.15          0.96
4  HTML Hostname Overflow vs. HTML NullChar Evasion   9.08           1.27
5  Email Error vs. Email Pipe                         4.34           0.47
Highly negative pairs:
#  Pair                                               2-hop z-score  P(b|a)/P(b)
1  Audit TFTP Get Filename vs. LDAP Auth Failed       -31.30         0.00
2  LDAP Auth Failed vs. TFTP Put                      -31.12         0.00
3  DPS Magic Number DoS vs. HTTP Auth TooLong         -30.96         0.00
4  LDAP BER Sequence Dos vs. TFTP Put                 -30.30         0.00
5  Email Executable Extension vs. UDP Service Sweep   -26.93         0.00
EXPERIMENTS – SCALABILITY
Running time when increasing the number of event nodes ($|V_{a \cup b}|$). Results are obtained from the Twitter graph.
[Figure: running time, one panel each for h = 1 and h = 3]
DISCUSSIONS
TESC as a correlation of local densities.
Why a nonparametric statistic?
No distribution assumption, no linearity assumption.
Nonparametric statistics are less powerful because they use less information.
Model nonlinear correlation of data [Kaplan et al., JSTSP, 2009].
Kendall correlation and Spearman correlation:
Both can be used.
We choose Kendall’s Tau because it has an intuitive interpretation and it facilitates importance sampling.
Intra-correlation and inter-correlation?
(Based on constructive comments from Dr. Kaplan.)
[Figure: rank illustration with values 0.45, 0.5, 0.1 (ranks 2, 3, 1) and 0.3, 0.2, 0.1]
FUTURE WORK
Structure helps explain the distribution of events. Conversely, events could also help explain structure.
[Figure labels: "Discuss very similar topics", "Buy very similar products"]
THANK YOU!!! QUESTIONS?