manuel gomez rodriguez 1,2 jure leskovec 1 andreas krause 3
DESCRIPTION
1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology. Inferring Networks of Diffusion and Influence. Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3. Networks and Processes. - PowerPoint PPT PresentationTRANSCRIPT
1
1 Stanford University2 MPI for Biological Cybernetics
3 California Institute of Technology
Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez1,2
Jure Leskovec1
Andreas Krause3
2
Networks and Processes
Many times hard to directly observe a social or information network: Hidden/hard-to-reach populations:
Drug injection users Implicit connections:
Network of information sharing in online media
Often easier to observe results of the processes taking place on such (invisible) networks: Virus propagation:
People get sick, they see the doctor Information networks:
Blogs mention information
3
Information Diffusion Network
Information diffuses through the network
We only see who mentions but not where they got the information from
Can we reconstruct the (hidden) diffusion network?
4
More Examples
Virus propagation Word of mouth & Viral marketing
Can we infer the underlying social or information network?
Viruses propagate through the network
We only observe when people get sick
But NOT who infected them
Recommendations and influence propagate
We only observe when people buy products
But NOT who influenced them
Process
We observe
It’s hidden
5
Inferring the Network
There is a hidden directed network:
We only see times when nodes get infected: Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4) Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6)
Want to infer the who-infects-whom network
bb
dd
ee
aa
cc
aa
cc
bb
eecc
aabb
dd
6
Our Problem Formulation
Plan for the talk:
1. Define a continuous time model of diffusion
2. Define the likelihood of the observed propagation data given a graph
3. Show how to efficiently compute the likelihood
4. Show how to efficiently optimize the likelihood Find a graph G that maximizes the likelihood There is a super-exponential number of graphs G Our method finds a near-optimal solution in O(N2)!
7
cccc
ee ffee ff
cc
bbaa bbaaaa bb
dd
Cascade Generation Model
Cascade generation model: Cascade reaches u at tu, and spreads to u’s
neighbors v: with probability β cascade propagates along (u, v) and
tv = tu + Δ, with Δ ~ f()
ta tb tcΔ1 Δ2
We assume each node v has only one parent!
Δ3 Δ4
te tf
8
Likelihood of a Cascade
bb
dd
ee
aa
cc
aa
cc
bb
ee
If u infected v in a cascade c, its transmission probability is:
Pc(u, v) f(tv - tu) with tv > tu and (u, v) are neighbors
Prob. that cascade c propagates in a tree T:
To model that in reality any node v in a cascade can have been infected by an external influence m: Pc(m, j) = ε
mmεεε
Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)
9
Finding the Diffusion Network
There are many possible propagation trees: c: (a, 1), (c, 2), (b, 3), (e, 4)
Need to consider all possible propagation trees T supported by G:
bb
dd
ee
aa
cc
aa
cc
bb
ee
bb
dd
ee
aa
cc
aa
cc
bb
ee
bb
dd
ee
aa
cc
aa
cc
bb
ee
Likelihood of a set of cascades C on G: Want to find:
Good news
Computing P(c|G) is tractable:For each c, consider all O(nn) possible transmission trees of G.
Matrix Tree Theorem can compute this sum in O(n3)!
Bad news
We actually want to find:
We have a super-exponential number of graphs!
10
An Alternative Formulation
We consider only the most likely tree Maximum log-likelihood for a cascade c under a
graph G:
Log-likelihood of G given a set of cascades C:The problem is NP-hard: MAX-k-COVER
Our algorithm can do it near-optimally in O(N2)
11
Max Directed Spanning Tree
Given a cascade c, What is the most likely propagation tree?
where
bb
dd
aa
cc
Local greedy selection gives optimal tree!
A maximum directed spanning tree (MDST): Just need to compute the MDST of a the sub-
graph of G induced by c (i.e., a DAG) For each node, just picks an in-edge of max-
weight:
3
5
4
2
6
1
Subgraph of G induced by c doesn’t have loops (DAG)
aa
bb
dd
cc
12
Great News: Submodularity
Theorem:Log-likelihood Fc(G) is monotonic, and submodular in the edges of the graph G
Gain of adding an edge to a “small” graph Gain of adding an edge to a “large“ graph
Fc(A {e}) – Fc (A) ≥ Fc (B {e}) – Fc (B)
Proof:
ss
A B VxV
w
w’
xA
Bjj
oo
Single cascade c, edge e with weight x Let w be max weight in-edge of s in A Let w’ be max weight in-edge of s in B We know: w ≤ w’ Now: Fc(A {e}) – Fc(A) = max (w, x) – w
≥ max (w’, x) – w’ = Fc(B {e}) – Fc(B)
rr
aa
kk
iiii
kk
Then, log-likelihood FC(G) is monotonic, and submodular too
13
Finding the Diffusion Graph
Use the greedy hill-climbing to maximize FC(G):
At every step, pick the edge that maximizes the marginal improvement
bb
dd
ee
aa
ccLocalized update
Marginal gainsb ac ad aa bc bd be b
: 12 : 3 : 6 : 20 : 18 : 4 : 5
a cb cb dc de d
: 15 : 8 : 16 : 8 : 10
b ed e
: 7 : 13
: 17 : 2 : 3
Localized update
Localized update
Localized update
: 1 : 1
: 8 : 7
: 6
1. Approximation guarantee (≈ 0.63 of OPT)
2. Tight on-line bounds on the solution quality
3. Speed-ups:Lazy evaluation (by submodularity)
Localized update (by the structure of the problem)
Benefits
14
Experimental Setup
We validate our method on:
How many edges of G can we find?
Precision-Recall Break-even point
How many cascades do we need?
How fast are we?
How well do we optimize Fc(G)?
Synthetic dataGenerate a graph G on k edgesGenerate cascadesRecord node infection timesReconstruct G
Real dataMemeTracker: 172m news articlesAug ’08 – Sept ‘09343m textual phrases (quotes)
15
Small synthetic network:
True networkTrue network Baseline networkBaseline network Our methodOur method
15
Small Synthetic Example
Pick k strongest edges:
16
Synthetic Networks
Our performance does not depend on the network structure:
Synthetic Networks: Forest Fire, Kronecker, etc. Prob. of transmission: Exponential, Power Law
Break-even points of > 90% when the baseline gets 30-50%!
1024 node hierarchical Kronecker exponential transmission model
1000 node Forest Fire (α = 1.1) power law transmission model
17
How good is our graph?
We achieve ≈ 90 % of the best possible network!
18
How many cascades do we need?
With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!
19
Running Time
Lazy evaluation and localized updates speed up 2 orders of magnitude!
20
Real Data
MemeTracker dataset: 172m news articles Aug ’08 – Sept ‘09 343m textual phrases
(quotes) Times tc(w) when site w
mentions phrase (quote) c http://memetracker.org
Given times when sites mention phrases We infer the network of information diffusion:
Who tends to copy (repeat after) whom
21
Real Network
We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground truth G
From the MemeTracker dataset, we have the timestamps of: 1. cascades of hyperlinks:
sites link other sites
2. cascades of (MemeTracker) quotes
sites copy quotes from other sites
Can we infer the hyperlinks network from…
…cascades of hyperlinks? …cascades of MemeTracker
quotes?
Are they correlated?
ee
ffccaa
ee
ffccaa
22
Real Network
500 node hyperlink network using hyperlinks cascades
500 node hyperlink network using MemeTracker cascades
Break-even points of 50% for hyperlinks cascades and 30% for MemeTracker cascades!
23
5,000 news sites:
BlogsMainstream media
Diffusion Network
24BlogsMainstream media
Diffusion Network (small part)
25
Networks and Processes We infer hidden networks based on diffusion data
(timestamps)
Problem formulation in a maximum likelihood framework NP-hard problem to solve exactly We develop an approximation algorithm that:
It is efficient -> It runs in O(N2) It is invariant to the structure of the underlying network It gives a sub-optimal network with tight bound
Future work: Learn both the network and the diffusion model Extensions to other processes taking place on networks Applications to other domains: biology, neuroscience, etc.
26
Thanks!For more (Code & Data):http://snap.stanford.edu/netinf