Statistics for dynamic network analysis
Dependence between point processes
Combining p-values
Statistical questions related to the analysis of dynamic network data
Patrick Rubin-Delanchy
University of Bristol & Heilbronn Institute for Mathematical Research
Joint work with
Nicholas A Heard (Imperial College London)
Daniel J Lawson (University of Bristol)
Axel Gandy (Imperial College London)
5th November 2014
P. Rubin-Delanchy Statistical questions related to the analysis of dynamic network data
Dynamic network data
Data recording interactions between entities through time. Examples:
social networks: Twitter, Facebook, ...
email: e.g. the Enron email corpus (made public by the Federal Energy Regulatory Commission), a store of emails sent and received by Enron top executives leading up to the scandal
collaboration networks: academic, cosponsorship of legislation [3], music
recommender systems (e.g. Netflix challenge, music recommendation)
computer networks (e.g. LANL, Imperial College London)
biological networks (e.g. neural networks)
Example in music (with thanks to Theo Dickson): MusicBrainz.org is an open-source, user-created database of song details. 40GB, 850,000 different artists, 13.5 million recordings and 722,667 collaborations.
Cyber-security
United Nations International Telecommunications Union announces [1]:
almost 3 billion internet users by end 2014
mobile-cellular subscriptions to reach almost 7.6 billion
UK figures:
1 in 5 pounds earned on the internet
81% of large corporations and 60% of small businesses reported a cyber-breach in the UK
Average cost is 600K – 1.2M pounds per breach [2]
A typical attack pattern
Taken from [5]:
A. Opportunistic infection
B. Network traversal
C. Data exfiltration
Key statistical themes in the analysis of network data
A. Point process networks
B. Information flow
C. Network anomaly detection
D. Combining p-values
E. Big data
Outline of talk
In this talk we will specifically cover:
A. Detecting dependence between two point processes. Applications: diagnose information flow in a communication network; detect tunnelling in a computer network; some biological applications (e.g. neuronal spikes, ecology, molecular biology); sport; finance, ... (2/3)
B. Combining Monte Carlo p-values. Applications: anomaly detection, change detection, feature discovery, ... (1/3)
Testing for dependence
On networks, important forms of dependence include:
A ‘causal’ relationship: do events by A trigger events by B?
Correlation: do events by B occur surprisingly close to events by A?
Anti-correlation: do A and B alternate?
Inhibition: does A inhibit B?
[Figure: event times of processes A and B on a time axis, with the intensity r0(t) of B.]
Null hypothesis (H0): B is a non-homogeneous Poisson process with intensity r0(t).
Let b1 = volume until the first response time.
Let b2 = volume until the second response time.
Let b3 = volume until the third response time.
Lemma
Under H0, the volumes b1, b2, . . . are the event times of a homogeneous Poisson process.
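The lemma is close in spirit to the classical time-change result for Poisson processes: measuring time by the volume accumulated under the intensity turns a non-homogeneous process into a unit-rate one. The sketch below illustrates only that classical result, not the response-time volumes b1, b2, . . . themselves. The intensity r0(t) = 2t and every function name are assumptions chosen for illustration.

```python
import math
import random

def cumulative_intensity(t):
    """Volume under the assumed intensity r0(s) = 2s over [0, t], i.e. t**2."""
    return t * t

def inverse_cumulative_intensity(v):
    """Inverse of cumulative_intensity on [0, inf)."""
    return math.sqrt(v)

def simulate_unit_rate(volume, rng):
    """Event times of a unit-rate homogeneous Poisson process on [0, volume)."""
    times = []
    s = rng.expovariate(1.0)
    while s < volume:
        times.append(s)
        s += rng.expovariate(1.0)
    return times

def simulate_nonhomogeneous(horizon, rng):
    """Events with intensity r0 on [0, horizon), obtained by inverting the time change."""
    unit_times = simulate_unit_rate(cumulative_intensity(horizon), rng)
    return [inverse_cumulative_intensity(v) for v in unit_times]

def rescale(event_times):
    """Map each event time to the volume under r0 accumulated up to it."""
    return [cumulative_intensity(t) for t in event_times]

rng = random.Random(0)
events = simulate_nonhomogeneous(horizon=5.0, rng=rng)
volumes = rescale(events)  # event times of a unit-rate process on [0, 25)
```

In this toy setting the rescaled times recover the unit-rate events exactly, because the simulation was built by inverting the same time change.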
[Figure: processes A and B, with the intensity r1(t) of B.]
Alternative hypothesis: B has intensity r1(t) = r0(t) f(t − a(t)), where f(x) ∝ exp(−βx) and a(t) is the closest A event to t.
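As a rough illustration of this intensity, the sketch below evaluates r1(t) for given A event times. It reads "closest A event to t" as the most recent A event at or before t, which is an assumption, and it leaves f unnormalised since the slide specifies it only up to proportionality. The function names and the behaviour before the first A event are hypothetical choices.

```python
import bisect
import math

def most_recent_event(a_times, t):
    """Latest A event at or before t, or None if there is none.

    Assumes a_times is sorted in increasing order.
    """
    idx = bisect.bisect_right(a_times, t)
    return a_times[idx - 1] if idx > 0 else None

def alternative_intensity(t, a_times, r0, beta):
    """r1(t) = r0(t) * f(t - a(t)) with the unnormalised f(x) = exp(-beta * x)."""
    a = most_recent_event(a_times, t)
    if a is None:
        # Assumption: before the first A event the null intensity is left unchanged.
        return r0(t)
    return r0(t) * math.exp(-beta * (t - a))

def flat_rate(t):
    """A constant null intensity, used here purely as an example."""
    return 2.0

a_times = [1.0, 4.0]
at_event = alternative_intensity(4.0, a_times, flat_rate, beta=0.5)     # f(0) = 1
one_unit_later = alternative_intensity(5.0, a_times, flat_rate, beta=0.5)
```

With β > 0 the intensity of B is highest immediately after an A event and decays until the next one, which is the 'triggering' behaviour the test is designed to pick up.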
[Figure: the same alternative shown in transformed time, with labels B̃ and f(t̃).]
Alternative hypothesis: B has intensity r1(t) = r0(t) f(t − a(t)), where f is a step function with one step.
Alternative hypothesis: B has intensity r1(t) = r0(t) f(t − a(t)), where f is a decreasing function. (The f shown, and hence r1, is the maximum likelihood estimate.)
Theorem
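Under H0, f̂(0) is distributed as n/(UT), where f̂ is the maximum likelihood estimate of f, U is a uniform random variable over [0, 1] and T is the length of the observation period. Equivalently, n/{T f̂(0)} is uniform on [0, 1] under H0.
p-value for A causing B: n/{T f̂(0)}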
In previous example: p ≈ 0.15 (not significant)
Daily emailing behaviour of an individual in the Enron dataset:
[Figure: email times by day of the year (vertical axis, 1 to 365) and time of day (horizontal axis, 0 to 24 hours).]
Bayesian intensity estimation for an individual of interest (change-point model for daily binned data, and density estimation for within-day behaviour)
[Figure: estimated intensity λ(t) against time.]
‘Information flow’ through JD in the Enron data. Over 2001, 12 individuals contact and are contacted back by JD. Full black means p < 0.0001, half-black means p ≤ 0.05, white means not significant.
[Figure: 12 × 12 grid of circles, rows and columns labelled 1 to 12.]
7 → j ⇏ j → 7: only one email on each edge; they are sent about one month apart and appear to be unrelated judging by their subject-lines. The p-value is computed to be about 0.2.
10 → j ⇏ j → 10: 14 emails from 10 to i and 9 from i to 10, the most coincidental email times falling in July, about 3½ hours from each other. The p-value is 0.07. If time is not transformed, the raw p-value is 0.0035, suggesting a significant interaction. However, upon inspecting the subject-lines of 10 → i and i → 10, it is not in fact obvious that there is consistent reciprocation. For example, the subject-lines of the two most coincidental emails are “FW: Enron Complaint” and “Dunn hearing link?”, which are not obviously related.
Based on the subject-lines of the most coincidental email times, the three black circles are correct detections.
6 → i ⇒ i → 5: The p-value is around 2 × 10⁻⁵. The most coincidental email times are two hours and twenty minutes apart, and have the same subject-line, “Re: FW: SoCalGas Capacity”.
9 → i ⇒ i → 3: The p-value is around 6 × 10⁻⁸. The most coincidental email times are 50 minutes apart, and have the subject-lines “California Update–Legislative Push Underway” and “Re: California Update–Legislative Push Underway”.
12 → i ⇒ i → 11: The p-value is around 2 × 10⁻⁵. The most coincidental email times are 30 minutes apart and have the subject-lines “RE: CA Unbundling” and “CA Unbundling”.
Other tests of H0 : f ∝ 1:
Likelihood ratio test based on H1 : f step-function, params λ1 ≥ λ2, τ “time-out”
Linear-time recursive solution for small N (catastrophic cancellation for N ≈ 100)
Quadratic-time recursive solution for larger N
Asymptotically, it is a weighted upper K-S test
Or: treat the reordered events b_i as if they were p-values (Fisher’s method):
$-2 \sum_{i=1}^{N} \log(b_i) \sim \chi^2_{2N}$
All implemented in the R package ‘mppa’ (available on CRAN).
Extensions
1 unknown base rate (see next week’s talk)
2 other forms of dependence (correlation, inhibition, ...)
3 chains of interaction (e.g. A causes B causes C )
4 single response model (only a few causal events)
Combining p-values
We often generate many test statistics and want to determine if there is an overall effect. Applications in cyber-security:
Detecting traffic forwarding by timing correlations (a test for each in/out pair)
Change detection in Netflow (e.g. testing port usage, connectivity histogram, data flow, ...)
Figure: Outgoing traffic from a client computer over 5 minutes, split by server IP:port (horizontal axis: time in minutes:seconds; vertical axis: server IP address:port).
The p-value
The p-value is a measure of the significance of an effect. Framework:
1 A null hypothesis, denoted H0, under which there is no effect
2 An alternative hypothesis, H1, under which the effect is present
3 A test statistic T for the effect, which would tend to be larger under H1 than under H0
The p-value is
p = P[T ≥ t],
where P is the probability measure under the null hypothesis and t is the observed statistic.
Example: permutation test
Let a1, ..., ak and b1, ..., bl denote two groups of data.
1 H0: all elements are exchangeable
2 H1: elements are only exchangeable within groups
3 t = |ā − b̄| (the absolute difference between the group means)
Then p = P[T ≥ t], where T = |Ā − B̄| and A, B are formed from random permutations of the indices.
Monte Carlo p-value
Often p cannot be computed exactly. Instead, it is estimated via:
$\hat{p} = \frac{1}{N} \sum_{i=1}^{N} I(T^*_i \geq t)$,
where T*_i is the ith simulated replicate of T under the null hypothesis.
Previous example (permutation test):
1 Randomly permute the indices of a1, . . . , ak and b1, . . . , bl to form A*_1, . . . , A*_k and B*_1, . . . , B*_l
2 Compute T* = |Ā* − B̄*|
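The two steps above can be sketched as follows. This is a minimal illustration of the Monte Carlo estimate for the permutation test, assuming the statistic is the absolute difference between group means; the function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def monte_carlo_permutation_p_value(a, b, n_sim, rng):
    """Estimate p = P[T >= t] for t = |mean(a) - mean(b)| using n_sim random permutations."""
    t = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    k = len(a)
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)  # random relabelling of the pooled data
        t_star = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if t_star >= t:
            exceed += 1
    return exceed / n_sim    # can be exactly 0 when no replicate reaches t

p_separated = monte_carlo_permutation_p_value([0] * 5, [10] * 5, 2000, random.Random(2))
p_identical = monte_carlo_permutation_p_value([3, 3, 3], [3, 3, 3], 200, random.Random(1))
```

With well-separated groups almost no permutation reaches the observed difference, so the estimate is small; with identical groups every permutation ties with the observed statistic and the estimate is 1.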
Combining p-values
Suppose we generate m ordered p-values p1 ≤ . . . ≤ pm.
Under H0 all of the p-values are independent and uniformly distributed (aside from ordering).
Some ways of combining the p-values into one, overall score:
p1 ∼ Beta(1, m) (the minimum p-value)
$-2 \sum_{i=1}^{m} \log(p_i) \sim \chi^2_{2m}$ (Fisher’s method)
$\min_i (p_i m / i) \sim \mathrm{Uniform}[0, 1]$ (Simes’s test)
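The three combination rules listed above can be sketched with the standard library alone. Each function returns an overall p-value from the stated null distribution; the function names are assumptions, and the chi-squared tail uses the closed form that holds for even degrees of freedom.

```python
import math

def min_p_combined(pvalues):
    """Minimum p-value: under H0 it is Beta(1, m), so P[min <= x] = 1 - (1 - x)**m."""
    m = len(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** m

def chi2_sf_even_df(x, half_df):
    """P[X >= x] for a chi-squared variable with 2 * half_df degrees of freedom.

    Closed form for even degrees of freedom:
    exp(-x / 2) * sum over k < half_df of (x / 2)**k / k!
    """
    term = 1.0
    total = 1.0
    for k in range(1, half_df):
        term *= (x / 2.0) / k
        total += term
    return math.exp(-x / 2.0) * total

def fisher_combined(pvalues):
    """Fisher's method: -2 * sum(log p_i) is chi-squared with 2m degrees of freedom."""
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(statistic, len(pvalues))

def simes_combined(pvalues):
    """Simes's test: min over i of p_(i) * m / i, with p_(i) the ordered p-values."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return min(p * m / i for i, p in enumerate(ordered, start=1))

example = [0.01, 0.04, 0.5]
overall = {
    "min": min_p_combined(example),
    "fisher": fisher_combined(example),
    "simes": simes_combined(example),
}
```

Note that fisher_combined fails on a p-value of exactly zero, because math.log(0) raises ValueError.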
Some difficulties
Why it’s a bit harder than that:
1 ‘Needle-in-a-haystack’ problems (work with NA Heard)
2 Discrete p-values (work with NA Heard)
3 Bayesian p-values (work with DJ Lawson, my talk next week)
4 Monte Carlo p-values (work with A Gandy)
Combining Monte Carlo p-values
Naïve approach: do N simulations for each.
Under H0, the number of p-values estimated to be zero is a Binomial variable with success probability 1/(N + 1) and size m. For N ≪ m there is a high probability of calculating a p-value of 0!
Let p = [p1, . . . , pm] and let f be a function f : [0, 1]^m → R that combines the p-values.
Lemma
If there exists i ∈ {1, . . . , m} such that
$\min_{j \neq i} \sup_{x \in [0,1]^m} |\nabla_i f(x)| / |\nabla_j f(x)| = \infty$,
then asymptotically
$\sup_{p \in (0,1)^m} \operatorname{var}(f(\hat{p})_{\text{naïve}}) / \operatorname{var}(f(\hat{p})_{\text{opt}}) = m$.
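The Binomial statement on this slide gives a quick way to see how likely a zero estimate is under the naïve approach. The helper names below are hypothetical; the formulas follow directly from a Binomial count with size m and success probability 1/(N + 1).

```python
def prob_any_zero_estimate(m, n_sim):
    """P[at least one of m estimated p-values equals 0] under H0 with n_sim simulations each."""
    return 1.0 - (1.0 - 1.0 / (n_sim + 1)) ** m

def expected_zero_estimates(m, n_sim):
    """Mean of the Binomial(m, 1 / (n_sim + 1)) count of zero estimates."""
    return m / (n_sim + 1)

few_simulations = prob_any_zero_estimate(m=10000, n_sim=100)    # N much smaller than m
many_simulations = prob_any_zero_estimate(m=10, n_sim=100000)   # N much larger than m
```

When N is much smaller than m a zero estimate is almost certain, which matters for scores such as Fisher's that take the logarithm of each p-value.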
Our algorithm
Our algorithm looks for a more clever allocation of the simulation effort
During simulation, it adaptively identifies ‘which p-values need most work’
Our algorithm appears to reach the optimal asymptotic variance
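To give a feel for what adaptive allocation can look like, here is a simple greedy scheme for Fisher's score. It is not the algorithm presented in the talk, whose details are not given on the slides. It assumes the delta-method approximation var(log p̂) ≈ (1 − p)/(n p) and sends each new simulation to the test where one more draw would reduce that variance most. All names are hypothetical.

```python
import random

def adaptive_allocation(samplers, budget, initial, rng):
    """Greedily allocate `budget` null simulations across several Monte Carlo p-values.

    samplers[i](rng) returns True when a simulated statistic for test i is at
    least the observed one. After `initial` draws per test, each further draw
    goes to the test with the largest approximate variance reduction.
    """
    if initial < 1:
        raise ValueError("initial must be at least 1")
    m = len(samplers)
    hits = [0] * m
    draws = [0] * m
    for i in range(m):
        for _ in range(initial):
            hits[i] += samplers[i](rng)
            draws[i] += 1

    def gain(i):
        p = (hits[i] + 1) / (draws[i] + 2)   # smoothed estimate, never 0 or 1
        variance_factor = (1 - p) / p        # var(log p_hat) is roughly this over n
        return variance_factor * (1 / draws[i] - 1 / (draws[i] + 1))

    for _ in range(budget - initial * m):
        i = max(range(m), key=gain)
        hits[i] += samplers[i](rng)
        draws[i] += 1
    estimates = [(h + 1) / (n + 2) for h, n in zip(hits, draws)]
    return estimates, draws

def bernoulli(p):
    """Stand-in for a null simulation whose exceedance probability is p."""
    return lambda rng: rng.random() < p

estimates, draws = adaptive_allocation(
    [bernoulli(0.01), bernoulli(0.5)], budget=2000, initial=20, rng=random.Random(0)
)
```

In this toy run most of the budget goes to the small p-value, since its logarithm is the hardest to estimate precisely.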
Example: change detection in Netflow
Traffic from one computer on Imperial College’s network (with thanks to Andy Thomas) over a day
The data has an artificial changepoint where the user knowingly changed his behaviour
We split the computer’s traffic by edge (the other IP address), bin the data per hour, and throw away any edge with fewer than three bins
This results in approximately 100 time series
Example: change detection in Netflow cont’d
1 For any proposed changepoint, compute the absolute difference between the number of flows before and after the changepoint on each edge, resulting in a statistic ti.
2 For each edge:
  1 Randomly permute the binned data
  2 Compute the same absolute difference between the number of flows, resulting in a simulated statistic T*i.
  3 Get a running estimate of the changepoint p-value for that edge, pi
3 Use our algorithm to combine the p-values using Fisher’s score
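Steps 1 and 2 for a single edge can be sketched as below. This version uses a fixed number of permutations per edge, i.e. the naïve allocation rather than the adaptive algorithm of step 3, and the hourly counts are made-up illustrative data. Function names are hypothetical.

```python
import random

def changepoint_statistic(bins, changepoint):
    """Absolute difference between the flow counts before and after the changepoint."""
    return abs(sum(bins[:changepoint]) - sum(bins[changepoint:]))

def edge_p_value(bins, changepoint, n_sim, rng):
    """Monte Carlo p-value for one edge, obtained by permuting its hourly bins."""
    t = changepoint_statistic(bins, changepoint)
    shuffled = list(bins)
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if changepoint_statistic(shuffled, changepoint) >= t:
            exceed += 1
    return exceed / n_sim

# Illustrative data only: 24 hourly bins, with a change at noon on the second edge.
steady_edge = [5] * 24
changed_edge = [0] * 12 + [10] * 12
p_steady = edge_p_value(steady_edge, 12, 100, random.Random(0))
p_changed = edge_p_value(changed_edge, 12, 1000, random.Random(1))
```

The per-edge estimates would then be combined across edges, for example with Fisher's score.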
Figure: Overall p-value in favour of a changepoint (vertical axis: p-value on a log scale, 1e−07 to 1e−01; horizontal axis: time, 0 to 24).
Figure: Most significant edges (edges 1 to 5 against time, 0 to 24). Samples taken: 15039, 14767, 11598, 7985, 6931
Figure: Least significant edges (edges 1 to 5 against time, 0 to 24). Samples taken: 55, 56, 58, 50, 61
Extensions
1 explicitly tackle the search and multiple testing problem (my work here with people at ACS)
2 develop methodology for non-smooth functions (e.g. Simes’s test)
3 handle correlated data
Conclusion
1 Probabilistic and statistical methods for analysing point process networks are becoming increasingly important.
2 We discussed testing for dependence without any regard to computational issues (e.g. the search through the graph) or the use of marks on the points (e.g. subject lines). Of course these (arguably more difficult) points still need to be addressed.
3 Monte Carlo tests are a very promising approach for detecting features/anomalies on networks. Much more work is needed on making them algorithmically feasible on Big data.
[1] ITU releases 2014 ICT figures. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/60942/THe-COST-OF-CYBER-CRIME-SUMMARY-FINAL.pdf. Accessed: 2014-06-05.
[2] Policy: keeping the UK safe in cyber space. https://www.gov.uk/government/policies/keeping-the-uk-safe-in-cyberspace. Accessed: 2014-06-05.
[3] James H Fowler. Connecting the congress: A study of cosponsorship networks. Political Analysis, 14(4):456–487, 2006.
[4] Axel Gandy and Patrick Rubin-Delanchy. An algorithm to compute the power of Monte Carlo tests with guaranteed precision. The Annals of Statistics, 41(1):125–142, 2013.
[5] Joshua Neil, Curtis Hash, Alexander Brugh, Mike Fisk, and Curtis B Storlie. Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414, 2013.
[6] Patrick Rubin-Delanchy and Nicholas A Heard. A test for dependence between two point processes on the real line. arXiv preprint arXiv:1408.3845, 2014.