Statistical learning with Hawkes processes
Stephane Gaıffas
1 Introduction
2 Sparse and Low Rank MHP
3 New matrix concentration inequalities
4 Faster inference: a dedicated mean field approximation
5 A more direct approach: cumulants matching
Introduction
You have users of a system
You want to quantify their level of interactions
You don’t want to use only declared interactions: they are often deprecated and not related to the users’ actual activity
You really want levels of interaction driven by users’ actions, using the patterns of their timestamps
Example 1: Twitter. Timestamps of users’ messages. Find something better than the graph given by links of type “user 1 follows user 2”
Example 2: MemeTracker. Publication times of articles on websites/blogs, with hyperlinks. Quantify the influence of the publication activity of websites on the others.
Introduction
Data: large number of irregular timestamped events recorded in continuous time
Activity of users on a social network [DARPA Twitter Bot Challenge 2016, etc.]
High-frequency variations of signals in finance [Bacry et al. 2013]
Earthquakes and aftershocks in geophysics [Ogata 1998]
Crime activity [Mohler 2011 and the PredPol startup]
Genomics, neurobiology [Reynaud-Bouret et al. 2010]
Methods: in the context of social networks, survival analysis and modeling based on counting processes [Gomez et al. 2013, 2015], [Xu et al. 2016]
Introduction
Setting
For each node i ∈ I = {1, . . . , d} we have a set Z^i of events
Any τ ∈ Z^i is the occurrence time of an event related to node i
Counting process
Put N_t = [N^1_t · · · N^d_t]^⊤ with

N^i_t = ∑_{τ∈Z^i} 1_{τ≤t}
Intensity
Stochastic intensities λ_t = [λ^1_t · · · λ^d_t]^⊤, where λ^i_t is the intensity of N^i_t:

λ^i_t = lim_{dt→0} P(N^i_{t+dt} − N^i_t = 1 | F_t) / dt

λ^i_t = instantaneous rate of event occurrence at time t for node i
λt characterizes the distribution of Nt [Daley et al. 2007]
Patterns can be captured by putting structure on λt
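The counting process above is easy to sketch numerically; a minimal illustration with made-up event times (nothing from the talk's data):

```python
import numpy as np

# Hypothetical event sets Z^i for d = 2 nodes (illustrative values only)
Z = [np.array([0.5, 1.2, 3.1]), np.array([0.9, 2.4])]

def counting_process(events, t):
    """N^i_t = sum over tau in Z^i of 1_{tau <= t}."""
    return np.array([int(np.sum(tau <= t)) for tau in events])

print(counting_process(Z, 2.0))  # -> [2 1]
```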
The Multivariate Hawkes Process (MHP)
Scaling
We observe Nt on [0,T ]. “Asymptotics” in T → +∞. d is “large”
The Hawkes process
A particular structure for λt : auto-regression
Nt is called a Hawkes process [Hawkes 1971] if
λ^i_t = μ_i + ∑_{j=1}^d ∫_0^t ϕ_ij(t − t′) dN^j_{t′},
μ_i ∈ R₊ is the exogenous intensity
ϕ_ij are non-negative, integrable and causal (support in R₊) functions
The ϕ_ij are called kernels; ϕ_ij encodes the impact of an action by node j on the activity of node i
Captures auto-excitation and cross-excitation across nodes, a phenomenon observed in social networks [Crane et al. 2008]
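A direct way to read the definition is to evaluate λ^i_t numerically. The sketch below assumes the exponential kernels ϕ_ij(u) = a_ij α_ij e^{−α_ij u} introduced later in the talk; all numerical values are hypothetical:

```python
import numpy as np

def hawkes_intensity(t, mu, A, alpha, events):
    """lambda^i_t = mu_i + sum_j int_0^t phi_ij(t - s) dN^j_s, sketched with
    exponential kernels phi_ij(u) = a_ij * alpha_ij * exp(-alpha_ij * u)."""
    lam = np.array(mu, dtype=float)
    for j, taus in enumerate(events):
        past = taus[taus < t]                 # events of node j strictly before t
        for i in range(len(lam)):
            lam[i] += A[i, j] * alpha[i, j] * np.exp(-alpha[i, j] * (t - past)).sum()
    return lam

# Toy configuration: d = 2, one past event of node 0 at time 1.0
mu = np.array([0.1, 0.2])
A = np.array([[0.5, 0.0], [0.3, 0.4]])
alpha = np.ones((2, 2))
events = [np.array([1.0]), np.array([])]
print(hawkes_intensity(2.0, mu, A, alpha, events))
```

With no past event before t, the intensity reduces to the baseline μ, as the formula requires.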
Stability condition of the MHP
Stability condition
Introduce the matrix with entries
G_ij = ∫_0^{+∞} ϕ_ij(t) dt
Its spectral norm must satisfy ‖G‖ < 1 to ensure stability of the process (and stationarity)
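Checking this condition is a one-liner once G is known; a sketch with a hypothetical G (for exponential kernels a_ij α_ij e^{−α_ij t}, the integral G_ij is simply a_ij):

```python
import numpy as np

# Hypothetical kernel-integral matrix G for d = 2 nodes
G = np.array([[0.3, 0.1],
              [0.2, 0.4]])

spectral_norm = np.linalg.norm(G, 2)   # largest singular value of G
print(spectral_norm < 1.0)             # stability (and stationarity) condition
```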
A brief history of MHP
Brief history
Introduced in Hawkes 1971
Earthquakes and geophysics [Kagan and Knopoff 1981], [Zhuang et al. 2012]
Genomics [Reynaud-Bouret and Schbath 2010]
High-frequency Finance [Bacry et al. 2013]
Terrorist activity [Mohler et al. 2011, Porter and White 2012]
Neurobiology [Hansen et al. 2012]
Social networks [Crane and Sornette 2008], [Zhou et al. 2013]
And even FPGA-based implementation [Guo and Luk 2013]
Estimation for MHP
Parametric estimation (Maximum likelihood)
First work [Ogata 1978]
and [Simma and Jordan 2010], [Zhou et al. 2013] → Expectation-Maximization (EM) algorithms, with priors
Non-parametric estimation
[Marsan and Lengliné 2008], generalized by [Lewis, Mohler 2010] → EM for a penalized likelihood function → monovariate Hawkes processes
[Reynaud-Bouret et al. 2011] → ℓ1-penalization over a dictionary
[Bacry and Muzy 2014] → another approach: Wiener-Hopf equations, larger datasets
MHP in large dimension
What for?
Infer influence and causality directly from the actions of users
Exploit the hidden lower-dimensional structure of model parameters for inference/prediction
The number of events and the dimension d are large. We want:
a simple parametric model on μ = [μ_i] and ϕ = [ϕ_ij]
a tractable and scalable optimization problem
to encode prior assumptions using (convex) penalization
A simple parametrization of the MHP
Simple parametrization
Consider

ϕ_ij(t) = a_ij × α_ij e^{−α_ij t}

a_ij = level of interaction between nodes i and j
α_ij = lifetime of instantaneous excitation of node i by node j
The matrix A = [a_ij]_{1≤i,j≤d} is understood as a weighted adjacency matrix of mutual influence between the nodes {1, . . . , d}
A is non-symmetric: “oriented graph”
A simple parametrization of the MHP
We end up with intensities
λ^i_{θ,t} = μ_i + ∫_{(0,t)} ∑_{j=1}^d a_ij α_ij e^{−α_ij(t−s)} dN^j_s

for i ∈ {1, . . . , d}, where θ = [μ, A, α] with
baselines μ = [μ_1 · · · μ_d]^⊤ ∈ R^d₊
interactions A = [a_ij]_{1≤i,j≤d} ∈ R^{d×d}₊
decays α = [α_ij]_{1≤i,j≤d} ∈ R^{d×d}₊
A simple parametrization of the MHP
For d = 1, the intensity λ_{θ,t} jumps at each event time and decays exponentially in between [figure omitted]
Goodness-of-fit functionals
Minus log-likelihood
−ℓ_T(θ) = (1/T) ∑_{i=1}^d { ∫_0^T λ^i_{θ,t} dt − ∫_0^T log λ^i_{θ,t} dN^i_t }

Least-squares

R_T(θ) = (1/T) ∑_{i=1}^d { ∫_0^T (λ^i_{θ,t})² dt − 2 ∫_0^T λ^i_{θ,t} dN^i_t }

with

λ^i_{θ,t} = μ_i + ∑_{j=1}^d a_ij α_ij ∫_{(0,t)} e^{−α_ij(t−s)} dN^j_s

where θ = [μ, A, α] with μ = [μ_i], A = [a_ij], α = [α_ij]
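The least-squares functional can be approximated on a time grid; a rough sketch for the exponential parametrization, with the dN^i_t-integral computed as a sum of λ^i over the events of node i (the discretization and helper names are ours, not the talk's):

```python
import numpy as np

def intensity_i(t, i, mu, A, alpha, events):
    """lambda^i_{theta,t} for exponential kernels a_ij*alpha_ij*exp(-alpha_ij*u)."""
    lam = mu[i]
    for j, taus in enumerate(events):
        past = taus[taus < t]
        lam += A[i, j] * alpha[i, j] * np.exp(-alpha[i, j] * (t - past)).sum()
    return lam

def least_squares_risk(mu, A, alpha, events, T, dt=0.01):
    """Discretized R_T(theta): the dt-integral is approximated on a grid,
    the dN^i_t-integral is a sum of lambda^i over the events of node i."""
    risk = 0.0
    grid = np.arange(0.0, T, dt)
    for i in range(len(mu)):
        lam_sq = sum(intensity_i(t, i, mu, A, alpha, events) ** 2 for t in grid) * dt
        lam_ev = sum(intensity_i(tau, i, mu, A, alpha, events) for tau in events[i])
        risk += lam_sq - 2.0 * lam_ev
    return risk / T
```

Sanity check: with no events and constant baseline μ_1 = 0.5, the risk is (1/T)∫_0^T 0.25 dt = 0.25.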
A simple framework
Put ‖λ_θ‖²_T = ⟨λ_θ, λ_θ⟩_T with

⟨λ_θ, λ_{θ′}⟩_T = (1/T) ∑_{i=1}^d ∫_{[0,T]} λ^i_{θ,t} λ^i_{θ′,t} dt

so that the least-squares functional writes

R_T(θ) = ‖λ_θ‖²_T − (2/T) ∑_{i=1}^d ∫_{[0,T]} λ^i_{θ,t} dN^i_t

It is natural: if N has ground-truth intensity λ∗ then

E[R_T(θ)] = E‖λ_θ‖²_T − 2 E⟨λ_θ, λ∗⟩_T = E‖λ_θ − λ∗‖²_T − E‖λ∗‖²_T,

where we used the “signal + noise” decomposition (Doob–Meyer):

dN^i_t = λ^{∗i}_t dt + dM^i_t

with M^i a martingale
1 Introduction
2 Sparse and Low Rank MHP
3 New matrix concentration inequalities
4 Faster inference: a dedicated mean field approximation
5 A more direct approach: cumulants matching
A simple framework
A strong assumption: assume that

ϕ_ij(t) = a_ij h_ij(t)

for known h_ij, meaning that

λ^i_{θ,t} = μ_i + ∫_{(0,t)} ∑_{j=1}^d a_ij h_ij(t − s) dN^j_s,

where θ = [μ, A] with μ = [μ_1, . . . , μ_d]^⊤ and A = [a_ij]_{1≤i,j≤d}
However
Most papers using high-dimensional MHP assume h_ij(t) = α e^{−αt} for a known α!
e.g. [Yang and Zha 2013], [Zhou et al. 2013], [Farajtabar et al. 2015]
More on this problem later
Prior encoding by penalization
Prior assumptions
Some users are basically inactive and react only if stimulated:
µ is sparse
Not everybody interacts with everybody:
A is sparse
Interactions have community structure, possibly overlapping; a small number of factors explains interactions:
A is low-rank
Prior encoding by penalization
Standard convex relaxations [Tibshirani (01), Srebro et al. (05), Bach (08), Candès & Tao (09), etc.]
Convex relaxation of ‖A‖₀ = ∑_{ij} 1_{A_ij > 0} is the ℓ1-norm:

‖A‖₁ = ∑_{ij} |A_ij|
Convex relaxation of the rank is the trace-norm:

‖A‖∗ = ∑_j σ_j(A) = ‖σ(A)‖₁

where σ₁(A) ≥ · · · ≥ σ_d(A) are the singular values of A
Prior encoding by penalization
So, we use the following penalizations
Use ℓ1 penalization on μ
Use ℓ1 penalization on A
Use trace-norm penalization on A
[but other choices might be interesting...]
NB1: to induce sparsity AND low rank on A, we use the mixed penalization

A ↦ γ∗‖A‖∗ + γ₁‖A‖₁
NB2: there exist better ways to induce sparsity and low rank than this, cf. Richard et al. (2013), but they are much harder to minimize
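Proximal algorithms for such composite objectives rely on the proximal operators of the penalties. A sketch of the two standard building blocks (the talk does not specify its solver, and the prox of the sum ‖·‖₁ + ‖·‖∗ is not simply their composition — handling both terms requires e.g. an alternating or splitting scheme):

```python
import numpy as np

def prox_l1(A, gamma):
    """Prox of gamma * ||.||_1: entrywise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - gamma, 0.0)

def prox_trace(A, gamma):
    """Prox of gamma * ||.||_*: shrink the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt
```

Both operators promote the prior directly: soft-thresholding zeroes small entries (sparsity), singular-value shrinkage zeroes small singular values (low rank).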
Sparse and low-rank matrices
{A : ‖A‖∗ ≤ 1} {A : ‖A‖1 ≤ 1} {A : ‖A‖1 + ‖A‖∗ ≤ 1}
The balls are computed on the set of 2× 2 symmetric matrices, which isidentified with R3.
Algorithm
We end up with the problem

θ̂ = (μ̂, Â) ∈ argmin_{θ=(μ,A) ∈ R^d₊ × R^{d×d}₊} { R_T(θ) + pen(θ) },
with mixed penalizations
pen(θ) = τ1‖µ‖1 + γ1‖A‖1 + γ∗‖A‖∗
A problem: the “feature scaling” problem
Feature scaling is necessary for “linear approaches” in supervised learning
No features and labels here!
⇒ Can be solved here by fine-tuning the penalization terms
Algorithm
Consider instead
θ̂ = (μ̂, Â) ∈ argmin_{θ=(μ,A) ∈ R^d₊ × R^{d×d}₊} { R_T(θ) + pen(θ) },
where this time
pen(θ) = ‖µ‖1,w + ‖A‖1,W + w∗‖A‖∗
Penalization tuned by data-driven weights w, W and w∗ to solve the “scaling” problem
Comes from sharp controls of the noise terms, using new probabilistic tools
Ugly (but computationally easy) formulas
Numerical experiment
Toy example
Ground truth parameters µ and A, with d = 30 and T = 2000
Numerical experiment
Ground truth A and instances of recoveries using 6 procedures
Numerical experiment
Averaged AUC, estimation error and Kendall rank correlation over 100 simulations
top: non-weighted vs weighted ℓ1; bottom: non-weighted vs weighted ℓ1 + trace norm
Numerical experiment: likelihood VS least-squares
Convergence speed of least-squares vs likelihood (proximal gradient descent with/without acceleration)
Numerical experiment: likelihood VS least-squares
Performance achieved by least squares VS likelihood
Theoretical results
A sharp oracle inequality
Recall ⟨λ₁, λ₂⟩_T = (1/T) ∑_{i=1}^d ∫_0^T λ^i_{1,t} λ^i_{2,t} dt and ‖λ‖²_T = ⟨λ, λ⟩_T

Assume RE (Restricted Eigenvalues) holds in our setting, a mandatory assumption to obtain fast rates for convex-relaxation based procedures

Theorem. We have

‖λ_θ̂ − λ∗‖²_T ≤ inf_θ { ‖λ_θ − λ∗‖²_T + κ(θ)² ( (5/4) ‖(w)_{supp(μ)}‖²₂ + (9/8) ‖(W)_{supp(A)}‖²_F + (9/8) w²∗ rank(A) ) }

with a probability larger than 1 − 146 e^{−x}.
Theoretical results
Roughly, θ̂ achieves an optimal tradeoff between approximation and complexity, given by

(‖μ‖₀ log d / T) max_i N^i([0,T])/T + (‖A‖₀ log d / T) max_{ij} v^{ij}_T + (rank(A) log d / T) λ_max(V_T)

Complexity is measured both by sparsity and rank
Convergence has shape (log d)/T, where T = length of the observation interval
These terms are balanced by “empirical variance” terms
Theoretical results
Data-driven weights come from new “empirical” Bernstein inequalities, entrywise and for the operator norm of the noise Z_T (a matrix martingale)
Leads to a data-driven scaling of the penalization: deals correctly with the inhomogeneity of information over nodes
The noise term is

Z_t = ∫_0^t diag[dM_s] H_s,

with H_t a predictable process with entries

(H_t)_ij = ∫_{(0,t)} h_ij(t − s) dN^j_s

We need to control (1/T) ‖Z_T‖_op
Theoretical results
A consequence of our new concentration inequalities (more after):
P[ ‖Z_t‖_op / t ≥ √( 2v(x + log(2d)) / t ) + b(x + log(2d)) / (3t), b_t ≤ b, λ_max(V_t) ≤ v ] ≤ e^{−x},

for any v, x, b > 0, where

V_t = (1/t) ∫_0^t ‖H_s‖²_{2,∞} [ diag[λ∗_s]  0 ; 0  H_s^⊤ diag[H_s H_s^⊤]^{−1} diag[λ∗_s] H_s ] ds

and b_t = sup_{s∈[0,t]} ‖H_s‖_{2,∞} (‖·‖_{2,∞} = maximum ℓ2 row norm)

Useless for statistical learning! The event λ_max(V_t) ≤ v is annoying, and V_t is not observable (it depends on λ∗)!
Theoretical results
Theorem [Something better]. For any x > 0, we have

‖Z_t‖_op / t ≤ 8 √( (x + log d + ℓ̂_{x,t}) λ_max(V̂_t) / t ) + (x + log d + ℓ̂_{x,t})(10.34 + 2.65 b_t) / t

with a probability larger than 1 − 84.9 e^{−x}, where

V̂_t = (1/t) ∫_0^t ‖H_s‖²_{2,∞} [ diag[dN_s]  0 ; 0  H_s^⊤ diag[H_s H_s^⊤]^{−1} diag[dN_s] H_s ] ds

and ℓ̂_{x,t} is a small ugly term:

ℓ̂_{x,t} = 4 log log( (2λ_max(V̂_t) + 2(4 + b²_t/3)x) / x ∨ e ) + 2 log log( b²_t ∨ e ).

This is a non-commutative deviation inequality with an observable variance term
1 Introduction
2 Sparse and Low Rank MHP
3 New matrix concentration inequalities
4 Faster inference: a dedicated mean field approximation
5 A more direct approach: cumulants matching
New matrix concentration inequalities
Main tool: new concentration inequalities for matrix martingales in continuous time
Introduce

Z_t = ∫_0^t A_s (C_s ⊙ dM_s) B_s,

where {A_t}, {C_t} and {B_t} are predictable and {M_t}_{t≥0} is a “white” matrix martingale, in the sense that [vec M]_t is diagonal
NB: the entries of Z_t are given by

(Z_t)_{i,j} = ∑_{k=1}^p ∑_{l=1}^q ∫_0^t (A_s)_{i,k} (C_s)_{k,l} (B_s)_{l,j} (dM_s)_{k,l}.
New matrix concentration inequalities
Concentration for a purely discontinuous matrix martingale:
M_t is purely discontinuous and we have

⟨M⟩_t = ∫_0^t λ_s ds

for a non-negative and predictable intensity process {λ_t}_{t≥0}.
Standard moment assumptions (subexponential tails)
Introduce

V_t = ∫_0^t ‖A_s‖²_{∞,2} ‖B_s‖²_{2,∞} W_s ds

where

W_t = [ W¹_t  0 ; 0  W²_t ],    (1)

W¹_t = A_t diag[A_t^⊤ A_t]^{−1} diag[(C_t^{⊙2} ⊙ λ_t) 1] A_t^⊤
W²_t = B_t^⊤ diag[B_t B_t^⊤]^{−1} diag[(C_t^{⊙2} ⊙ λ_t)^⊤ 1] B_t
New matrix concentration inequalities
Introduce also
b_t = sup_{s∈[0,t]} ‖A_s‖_{∞,2} ‖B_s‖_{2,∞} ‖C_s‖_∞.
Theorem.
P[ ‖Z_t‖_op ≥ √( 2v(x + log(m + n)) ) + b(x + log(m + n)) / 3, b_t ≤ b, λ_max(V_t) ≤ v ] ≤ e^{−x},
First result of this type for matrix martingales in continuous time
New matrix concentration inequalities
Corollary. Let {N_t} be a p × q matrix whose entries (N_t)_{i,j} are independent inhomogeneous Poisson processes with intensities (λ_t)_{i,j}. Consider the martingale M_t = N_t − Λ_t, where Λ_t = ∫_0^t λ_s ds, and let {C_t} be deterministic and bounded. Then

‖ ∫_0^t C_s ⊙ d(N_s − Λ_s) ‖_op ≤ √( 2 ( ‖∫_0^t C_s^{⊙2} ⊙ λ_s ds‖_{1,∞} ∨ ‖∫_0^t C_s^{⊙2} ⊙ λ_s ds‖_{∞,1} ) (x + log(p + q)) ) + sup_{s∈[0,t]} ‖C_s‖_∞ (x + log(p + q)) / 3

holds with a probability larger than 1 − e^{−x}.
New matrix concentration inequalities
Corollary. Even more particular: let N be a random matrix whose entries N_{i,j} are independent Poisson variables with intensities λ_{i,j}. We have

‖N − λ‖_op ≤ √( 2(‖λ‖_{1,∞} ∨ ‖λ‖_{∞,1})(x + log(p + q)) ) + (x + log(p + q)) / 3

with a probability larger than 1 − e^{−x}.
To the best of our knowledge, not previously stated in the literature
NB: in the Gaussian case, the variance depends on the maximum ℓ2 norm of rows and columns (cf. Tropp (2011))
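This last corollary is easy to probe by simulation; a sketch that draws a Poisson matrix and compares the operator-norm deviation to the stated bound (reading ‖λ‖_{1,∞} and ‖λ‖_{∞,1} as maximum row and column sums; all sizes and intensities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, x = 50, 50, 5.0
lam = rng.uniform(0.5, 2.0, size=(p, q))   # intensities lambda_ij (hypothetical)
N = rng.poisson(lam)                       # independent Poisson entries

deviation = np.linalg.norm(N - lam, 2)     # operator norm of N - lambda
v = max(lam.sum(axis=1).max(),             # ||lambda||_{1,inf}: max row sum
        lam.sum(axis=0).max())             # ||lambda||_{inf,1}: max column sum
bound = np.sqrt(2 * v * (x + np.log(p + q))) + (x + np.log(p + q)) / 3

print(deviation <= bound)  # holds with probability >= 1 - e^{-x}
```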
New matrix concentration inequalities
Some remarks
A non-commutative Hoeffding inequality when M_t has Brownian motion entries (allowing Itô’s formula...), with a similar variance term
A mix of tools from stochastic calculus and random matrix theory
A family of results leading to a generalization to continuous-time martingales of matrix deviation inequalities (papers by J. Tropp et al.)
[For experts: as a by-product, we give a proof in the discrete-time case that does not require Lieb’s concavity theorem for the trace exponential]
1 Introduction
2 Sparse and Low Rank MHP
3 New matrix concentration inequalities
4 Faster inference: a dedicated mean field approximation
5 A more direct approach: cumulants matching
Mean-field inference for Hawkes
Going back to maximum-likelihood estimation, with d very large
For inference, exploit the fact that d is large
⇒ use a Mean-Field approximation! (from Delattre et al. 2015)
[Figure: simulation results showing the relative fluctuation E^{1/2}[(λ^1_t/Λ^1 − 1)²] versus d for d = 1, 16, 128, decaying like d^{−1/2}]
When d is large, we have

λ^i_t ≈ Λ^i with Λ^i = E[dN^i_t]/dt
Mean-field inference for Hawkes
Use the quadratic approximation
log λ^i_t ≈ log Λ^i + (λ^i_t − Λ^i)/Λ^i − (λ^i_t − Λ^i)²/(2(Λ^i)²)

in the log-likelihood
⇒ reduces inference to linear systems
[Figure: fluctuations E^{1/2}[(λ^1_t/Λ^1 − 1)²] as a function of d and ‖Φ‖]
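The approximation above is just a second-order Taylor expansion of log around the mean intensity Λ^i; a quick numerical check of its accuracy near the mean (toy values, not from the talk):

```python
import numpy as np

def log_quadratic(lam, Lam):
    """Second-order expansion of log(lam) around Lam, which makes the
    log-likelihood quadratic in the parameters."""
    return np.log(Lam) + (lam - Lam) / Lam - (lam - Lam) ** 2 / (2 * Lam ** 2)

Lam = 2.0
for lam in (1.8, 2.0, 2.2):
    print(lam, np.log(lam), log_quadratic(lam, Lam))
```

The error is O((λ − Λ)³), which is exactly why the approximation improves as the fluctuations of λ^i_t around Λ^i shrink with d.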
Mean-field inference for Hawkes
No clean proof yet (only on a toy example)
But it works very well empirically
Mean-field inference for Hawkes
[Figure: relative error E^{1/2}[(α_inf/α_tr − 1)²] versus T for α = 0.3 and α = 0.7, with d = 4, 8, 16, 32, decaying like T^{−1/2}]
Mean-field inference for Hawkes
It is faster by several orders of magnitude than state-of-the-art solvers
[Figure: minus log-likelihood −log P(N_t | θ_inf) and relative error E^{1/2}[(α_inf/α_tr − 1)²] versus computational time (s) for the BFGS, EM, CF and MF solvers]
1 Introduction
2 Sparse and Low Rank MHP
3 New matrix concentration inequalities
4 Faster inference: a dedicated mean field approximation
5 A more direct approach: cumulants matching
Cumulants matching for MHP
Some thoughts
Our original motivation for MHP is influence and causality recovery between nodes
Knowledge of the full parametrization of the MHP is of little interest by itself
A reminder
λ^i_t = μ_i + ∑_{j=1}^d ∫_0^t ϕ_ij(t − t′) dN^j_{t′},
Idea
Let’s not estimate the kernels ϕij , but their integrals only!
Nonparametric approach, no structure imposed on the kernels ϕij
Let’s not use a dictionary either (over-parametrization)
A way more direct approach
Cumulants matching for MHP
We want to estimate G = [g_ij] where

g_ij = ∫_0^{+∞} ϕ_ij(u) du ≥ 0 for 1 ≤ i, j ≤ d
Remark
g_ij = average total number of events of node i whose direct ancestor is an event of node j
introducing N^{i←j}_t that counts the number of events of i whose direct ancestor is an event of j, we can prove that

E[dN^{i←j}_t] = g_ij E[dN^j_t] = g_ij Λ^j dt
Consequence
G describes mutual influences between nodes
We know from Eichler et al. (2010) that N^j_t does not Granger-cause N^i_t iff g^{ij} = 0

Recall the stability condition ‖G‖ < 1, which entails that I − G is invertible
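These properties are easy to check numerically; a minimal sketch with a hypothetical 3-node influence matrix G (entries made up for illustration):

```python
import numpy as np

# Hypothetical influence matrix G = [g^{ij}] for d = 3 nodes:
# g^{ij} = 0 means node j does not Granger-cause node i (Eichler et al. 2010)
G = np.array([[0.2, 0.0, 0.1],
              [0.3, 0.1, 0.0],
              [0.0, 0.4, 0.2]])

# Stability: spectral norm ||G|| < 1, so I - G is invertible
assert np.linalg.norm(G, 2) < 1

R = np.linalg.inv(np.eye(3) - G)      # R = (I - G)^{-1}

# Non-causality pairs (i, j) read directly from the zero entries of G
non_causal = [(i, j) for i in range(3) for j in range(3) if G[i, j] == 0]
print(non_causal)                     # -> [(0, 1), (1, 2), (2, 0)]
```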
Cumulants matching for MHP
Cumulant matching method for the estimation of G:

Compute estimates of the third-order cumulants of the process

Find G that matches these empirical cumulants

Highly non-convex problem: polynomial of order 10 with respect to the entries of (I − G)^{−1}

Actually not so hard: local minima turn out to be good (cf. the deep learning literature)

Cumulant matching is quite powerful for latent topic models, such as Latent Dirichlet Allocation [Bach et al. 2015]
Cumulants matching for MHP
The first three cumulants can be estimated as

Λ^i = (1/T) ∑_{τ∈Z^i} 1 = N^i_T / T

C^{ij} = (1/T) ∑_{τ∈Z^i} ( N^j_{τ+H} − N^j_{τ−H} − 2H Λ^j )

K^{ijk} = (1/T) ∑_{τ∈Z^i} ( N^j_{τ+H} − N^j_{τ−H} − 2H Λ^j )( N^k_{τ+H} − N^k_{τ−H} − 2H Λ^k )
        − (Λ^i/T) ∑_{τ∈Z^j} ( N^k_{τ+2H} − N^k_{τ−2H} − 4H Λ^k )
        + (2Λ^i/T) ∑_{τ∈Z^j} ∑_{τ′∈Z^k : τ−2H ≤ τ′ < τ} (τ − τ′) − 4H² Λ^i Λ^j Λ^k
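The first two estimators above can be sketched directly in numpy (the third-order estimator follows the same counting pattern and is omitted for brevity); the toy timestamps and the window H are arbitrary choices:

```python
import numpy as np

def first_cumulant(events, T):
    """Lambda^i = N^i_T / T, the mean intensity of each node."""
    return np.array([len(z) for z in events]) / T

def second_cumulant(events, T, H):
    """C^{ij} = (1/T) sum_{tau in Z^i} (N^j_{tau+H} - N^j_{tau-H} - 2 H Lambda^j)."""
    d = len(events)
    lam = first_cumulant(events, T)
    C = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            zj = np.sort(events[j])
            for tau in events[i]:
                # number of events of node j in the window (tau - H, tau + H]
                n = (np.searchsorted(zj, tau + H, side="right")
                     - np.searchsorted(zj, tau - H, side="right"))
                C[i, j] += n - 2 * H * lam[j]
    return C / T

events = [np.array([1.0, 2.0, 3.0]), np.array([1.5])]   # toy timestamps, T = 4
print(first_cumulant(events, T=4.0))                    # -> [0.75 0.25]
```

The double loop is O(d² N); a real implementation would vectorize over the event arrays.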
Cumulants matching for MHP
Defining R = (I − G)^{−1}, we can make a link between the cumulants and G:

Λ^i dt = E(dN^i_t)

C^{ij} dt = ∫_{τ∈(−∞,+∞)} ( E(dN^i_t dN^j_{t+τ}) − E(dN^i_t) E(dN^j_{t+τ}) )

K^{ijk} dt = ∫∫_{τ,τ′∈(−∞,+∞)} ( E(dN^i_t dN^j_{t+τ} dN^k_{t+τ′})
           + 2 E(dN^i_t) E(dN^j_{t+τ}) E(dN^k_{t+τ′})
           − E(dN^i_t dN^j_{t+τ}) E(dN^k_{t+τ′}) − E(dN^i_t dN^k_{t+τ′}) E(dN^j_{t+τ})
           − E(dN^j_{t+τ} dN^k_{t+τ′}) E(dN^i_t) )
Cumulants matching for MHP
and

Λ^i = ∑_{m=1}^d R^{im} µ^m

C^{ij} = ∑_{m=1}^d Λ^m R^{im} R^{jm}

K^{ijk} = ∑_{m=1}^d ( R^{im} R^{jm} C^{km} + R^{im} C^{jm} R^{km} + C^{im} R^{jm} R^{km} − 2 Λ^m R^{im} R^{jm} R^{km} )
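These closed-form relations can be checked numerically; a minimal sketch, with a small made-up G for illustration:

```python
import numpy as np

def cumulants_from_R(R, mu):
    """Theoretical Lambda, C, K of an MHP from R = (I - G)^{-1} and baselines mu."""
    L = R @ mu                                  # Lambda^i = sum_m R^{im} mu^m
    C = np.einsum("m,im,jm->ij", L, R, R)       # C^{ij} = sum_m Lambda^m R^{im} R^{jm}
    K = (np.einsum("im,jm,km->ijk", R, R, C)    # R^{im} R^{jm} C^{km}
         + np.einsum("im,jm,km->ijk", R, C, R)  # R^{im} C^{jm} R^{km}
         + np.einsum("im,jm,km->ijk", C, R, R)  # C^{im} R^{jm} R^{km}
         - 2 * np.einsum("m,im,jm,km->ijk", L, R, R, R))
    return L, C, K

# made-up example: C comes out symmetric, K invariant under index permutations
G = np.array([[0.1, 0.2], [0.0, 0.3]])
L, C, K = cumulants_from_R(np.linalg.inv(np.eye(2) - G), np.array([0.5, 0.5]))
assert np.allclose(C, C.T) and np.allclose(K, np.transpose(K, (2, 1, 0)))
```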
Why order three and not two?

The integrated covariance (order two) contains only symmetric information, and is thus unable to provide causal information

The skewness of the process breaks the symmetry between past and future, which uniquely fixes G
Cumulants matching for MHP
Our algorithm [NPHC: Non Parametric Hawkes Cumulant]

Compute estimators of the d² third-order cumulant components {K^{iij}}_{1≤i,j≤d} (not d³!), stacked into K^c

Find R ∈ argmin_R ‖K^c(R) − K^c‖²_2 using a first-order stochastic gradient descent algorithm (AdaGrad in our case)

Set G = I − R^{−1}
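A minimal sketch of the matching step for the K^{iij} components, with C(R) = R diag(Λ) Rᵀ plugged in from the relations of the previous slides. The tiny synthetic example is made up, and scipy's L-BFGS-B with numerical gradients stands in for the AdaGrad solver used in practice:

```python
import numpy as np
from scipy.optimize import minimize

def kc_model(R, lam):
    """Model K^{iij} components, with C(R) = R diag(Lambda) R^T plugged
    into the third-order cumulant relation."""
    C = np.einsum("m,im,jm->ij", lam, R, R)
    # K^{iij} = sum_m (R_im^2 C_jm + 2 R_im C_im R_jm - 2 Lambda_m R_im^2 R_jm)
    return (np.einsum("im,im,jm->ij", R, R, C)
            + 2 * np.einsum("im,im,jm->ij", R, C, R)
            - 2 * np.einsum("m,im,im,jm->ij", lam, R, R, R))

def nphc_objective(r_flat, lam, kc_hat, d):
    """Squared Frobenius distance ||K^c(R) - K^c||_2^2."""
    return np.sum((kc_model(r_flat.reshape(d, d), lam) - kc_hat) ** 2)

# tiny synthetic ground truth (values are made up)
d = 2
G_true = np.array([[0.3, 0.1], [0.2, 0.25]])
R_true = np.linalg.inv(np.eye(d) - G_true)
lam = R_true @ np.array([0.5, 0.8])        # Lambda^i = sum_m R^{im} mu^m
kc_hat = kc_model(R_true, lam)             # plays the role of the empirical K^c

res = minimize(nphc_objective, np.eye(d).ravel() + 0.1,
               args=(lam, kc_hat, d), method="L-BFGS-B")
G_hat = np.eye(d) - np.linalg.inv(res.x.reshape(d, d))
```

In practice K^c comes from the estimators of the previous slides, d is large, and a stochastic first-order method replaces the batch solver.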
Cumulants matching for MHP
Metrics

Relative Error:

RelErr(A, B) = (1/d²) ∑_{i,j} ( |a_{ij} − b_{ij}| / |a_{ij}| · 1_{a_{ij}≠0} + |b_{ij}| · 1_{a_{ij}=0} )

Mean Kendall Rank Correlation:

MRankCorr(A, B) = (1/d) ∑_{i=1}^d RankCorr([a_{i•}], [b_{i•}]),

where

RankCorr(x, y) = ( N_concordant(x, y) − N_discordant(x, y) ) / ( d(d − 1)/2 )

with N_concordant(x, y) = number of pairs (i, j) such that x_i > x_j and y_i > y_j, or x_i < x_j and y_i < y_j, and N_discordant(x, y) defined conversely
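Both metrics can be sketched in a few lines of numpy/scipy; note that RankCorr as defined above is Kendall's tau, so scipy's kendalltau can be used directly:

```python
import numpy as np
from scipy.stats import kendalltau

def rel_err(A, B):
    """RelErr(A, B): entrywise relative error, with |b_ij| used where a_ij = 0."""
    nz = A != 0
    err = np.where(nz, np.abs(A - B) / np.where(nz, np.abs(A), 1.0), np.abs(B))
    return err.mean()

def mean_rank_corr(A, B):
    """MRankCorr(A, B): Kendall rank correlation averaged over matching rows."""
    return np.mean([kendalltau(a, b).correlation for a, b in zip(A, B)])

A = np.array([[1.0, 0.5], [0.0, 2.0]])
assert rel_err(A, A) == 0.0 and np.isclose(mean_rank_corr(A, A), 1.0)
```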
bad param. MLE — best param. MLE — NPHC — ground truth G
top row: rectangular kernel (d = 10)

middle row: power-law kernel (d = 10) (usually very hard...)

bottom row: exponential kernel (d = 100)
Cumulants matching for MHP
Experiments with MemeTracker dataset
keep the 100 most active sites

contains the publication times of articles on many websites/blogs, with hyperlinks

≈ 8 million events

use the hyperlinks to establish an estimated ground truth for the matrix G
NPHC on MemeTracker
[Figure: estimated influence matrix G on MemeTracker]
Cumulants matching for MHP
Method      | Best HMLE (for RelErr) | Best HMLE (for RankCorr) | NPHC
RelErr      | 0.153                  | 0.154                    | 0.064
MRankCorr   | 0.035                  | 0.032                    | 0.175
Results on the MemeTracker dataset
Conclusion
Take-home message
Hawkes processes for “time-oriented” machine learning

Surprisingly good at reproducing real-world phenomena (self-excitation, user influence)

Main contributions

Sharp theoretical guarantees for low-rank-inducing penalization for Hawkes models

New results on the concentration of matrix martingales in continuous time

Improved training time of the Hawkes model using a “mean-field” approximation

Going beyond the parametric approach: unveiling causality using integrated cumulants matching
Conclusion
Bibliography
A bound for generalization error for sparse and low-rank multivariate Hawkes processes, with E. Bacry and J-F Bacry [in revision in JMLR]

Concentration inequalities for matrix martingales in continuous time, with E. Bacry and J-F Bacry [in revision in PTRF]

Mean-field inference of Hawkes point processes, with E. Bacry, J-F Muzy and I. Mastromatteo [Journal of Physics A]

Uncovering causality from multivariate Hawkes integrated cumulants, with M. Achab, E. Bacry, S.G., I. Mastromatteo and J-F Muzy [submitted to JMLR]
Thank you!