1
Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1,2), Jure Leskovec (1), Andreas Krause (3)
1 Stanford University
2 MPI for Biological Cybernetics
3 California Institute of Technology
2
Hidden and implicit networks
Many social or information networks are implicit or hard to observe:
Hidden/hard-to-reach populations: network of needle sharing between drug injection users
Implicit connections: network of information propagation in online news media
But we can observe the results of the processes taking place on such (invisible) networks:
Virus propagation: drug users get sick, and we observe when they see the doctor
Information networks: we observe when media sites mention information
3
Information Diffusion Network
Information diffuses through the network
We only see who mentions but not where they got the information from
Question: Can we infer the hidden networks?
[Figure: mentions of the information ordered along a time axis]
4
Examples and Applications
Virus propagation:
Process: viruses propagate through the network
We observe: only when people get sick
It's hidden: who infected whom
Word of mouth & viral marketing:
Process: recommendations and influence propagate
We observe: only when people buy products
It's hidden: who influenced whom
Can we infer the underlying network?
5
Inferring the Network
There is a directed social network over which diffusions take place:
[Figure: directed network over nodes a, b, c, d, e]
But we do not observe the edges of the network
We only see the time when a node gets infected:
Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)
Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)
Task: inferring the underlying network
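The two cascades above can be held as plain mappings from node to infection time. The following is a minimal sketch, not the authors' code: the helper name `candidate_edges` is hypothetical, and it only encodes the observation that an edge (u, v) can explain an infection in a cascade if u was infected before v.

```python
# Cascades from the slide: node -> infection time.
cascades = {
    "c1": {"a": 1, "c": 2, "b": 6, "e": 9},
    "c2": {"c": 1, "a": 4, "b": 5, "d": 8},
}


def candidate_edges(cascades):
    """Return directed edges (u, v) with t_u < t_v in at least one cascade."""
    edges = set()
    for times in cascades.values():
        for u, t_u in times.items():
            for v, t_v in times.items():
                if t_u < t_v:
                    edges.add((u, v))
    return edges


edges = candidate_edges(cascades)
print(sorted(edges))
```

Any edge outside this candidate set can never carry a cascade forward in time, so a search over graphs only needs to consider these edges.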
6
Our Problem Formulation
Plan for the talk:
1. Define a continuous time model of diffusion
2. Define the likelihood of the observed cascades given a network
3. Show how to efficiently compute the likelihood of cascades
4. Show how to efficiently find a graph G that maximizes the likelihood
Note: There is a super-exponential number of graphs, $O(N^{N \cdot N})$
Our method finds a near-optimal graph in $O(N^2)$!
7
Cascade Generation Model
Continuous-time cascade diffusion model:
Cascade c reaches node u at time $t_u$ and spreads to u's neighbors:
With probability β the cascade propagates along edge (u, v), and we determine the infection time of node v as $t_v = t_u + \Delta$
e.g.: Δ ~ Exponential or Power-law
We assume each node v has only one parent!
[Figure: example cascade over nodes a, b, c, d, e, f with infection times $t_a$, $t_b$, $t_c$, $t_e$, $t_f$ and transmission delays Δ1, Δ2, Δ3, Δ4]
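The generation model above can be simulated with an event queue. This is a sketch under stated assumptions rather than the authors' implementation: the function name `generate_cascade`, the adjacency-list input, and the choice of an exponential delay with parameter `rate` are illustrative. Each node keeps only the earliest successful transmission, which gives it exactly one parent.

```python
import heapq
import itertools
import random


def generate_cascade(graph, source, beta=0.5, rate=1.0, seed=None):
    """Simulate one cascade; return (infection times, parent of each node).

    graph: dict mapping a node to the list of its out-neighbors.
    beta:  probability that the cascade propagates along an edge.
    rate:  rate of the exponential delay (an assumed delay model).
    """
    rng = random.Random(seed)
    tie_breaker = itertools.count()  # keeps heap entries comparable
    times, parent = {}, {}
    events = [(0.0, next(tie_breaker), source, None)]
    while events:
        t, _, v, u = heapq.heappop(events)
        if v in times:
            continue  # v was already infected by an earlier event
        times[v], parent[v] = t, u
        for w in graph.get(v, []):
            if w not in times and rng.random() < beta:
                delay = rng.expovariate(rate)  # t_w = t_v + delay
                heapq.heappush(events, (t + delay, next(tie_breaker), w, v))
    return times, parent


graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
times, parent = generate_cascade(graph, "a", beta=1.0, seed=0)
```

Only `times` would be visible to an observer; `parent` is the hidden information that the inference task tries to recover.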
8
Likelihood of a Single Cascade
[Figure: network over nodes a, b, c, d, e, with the tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)]
Probability that cascade c propagates from node u to node v:
$P_c(u, v) \propto P(t_v - t_u)$, with $t_v > t_u$
Probability that cascade c propagates in a tree pattern T:
$P(c \mid T) = \prod_{(u,v) \in T} P_c(u, v)$
Since not all nodes get infected by the diffusion process, we introduce the external influence node m: $P_c(m, v) = \varepsilon$
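As a worked example, the log-likelihood of one tree pattern can be computed edge by edge. This sketch assumes the tree likelihood is the product of per-edge transmission probabilities, an exponential delay density with rate `alpha`, and a hypothetical chain-shaped tree a → b → c → e on the cascade above; the external node m and its ε edges are left out for brevity.

```python
import math


def log_likelihood_of_tree(times, tree_edges, alpha=1.0):
    """Sum of log transmission densities over the edges of a tree pattern."""
    total = 0.0
    for u, v in tree_edges:
        delta = times[v] - times[u]
        if delta <= 0:
            raise ValueError(f"edge ({u}, {v}) does not point forward in time")
        # log of the exponential density alpha * exp(-alpha * delta)
        total += math.log(alpha) - alpha * delta
    return total


times = {"a": 1, "b": 2, "c": 4, "e": 8}
tree = [("a", "b"), ("b", "c"), ("c", "e")]  # hypothetical tree pattern
print(log_likelihood_of_tree(times, tree))
```

With `alpha = 1` each edge contributes minus its delay, so shorter delays give more likely trees.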
9
Finding the Diffusion Network
There are many possible propagation trees that are consistent with the observed data:
c: (a, 1), (c, 2), (b, 3), (e, 4)
[Figure: three different propagation trees over nodes a, b, c, d, e, each consistent with cascade c]
Likelihood of a set of cascades C:
$P(C \mid G) = \prod_{c \in C} P(c \mid G)$
Want to find a graph:
$\hat{G} = \arg\max_{|G| \le k} P(C \mid G)$
Need to consider all possible propagation trees T supported by the graph G:
$P(c \mid G) = \sum_{T \in \mathcal{T}_c(G)} P(c \mid T)\, P(T \mid G)$
Bad news
We actually want to search over graphs:
There is a super-exponential number of graphs!
Good news
Computing $P(c \mid G)$ is tractable, even though there are $O(n^n)$ possible propagation trees.
The Matrix Tree Theorem can compute this in $O(n^3)$!
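The Matrix Tree Theorem step can be illustrated in a few lines. The sketch below uses the directed (Tutte) form of the theorem: the sum over all spanning trees rooted at a given node of the product of edge weights equals the determinant of the in-degree Laplacian with the root's row and column removed. The function names and the plain Gaussian-elimination determinant are illustrative choices, not taken from the talk.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting, O(n^3)."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return det


def tree_weight_sum(nodes, root, weight):
    """Sum over spanning trees rooted at `root` of the product of edge weights.

    weight: dict mapping a directed edge (u, v) to its weight.
    """
    others = [v for v in nodes if v != root]
    index = {v: i for i, v in enumerate(others)}
    n = len(others)
    laplacian = [[0.0] * n for _ in range(n)]
    for (u, v), w in weight.items():
        if v == root or u == v:
            continue  # edges into the root and self-loops never appear in a tree
        laplacian[index[v]][index[v]] += w  # weighted in-degree of v
        if u != root:
            laplacian[index[u]][index[v]] -= w
    return determinant(laplacian)


# Hypothetical weights; the three trees rooted at r are
# {r->a, r->b}, {r->a, a->b}, {r->b, b->a}.
w = {("r", "a"): 2.0, ("r", "b"): 3.0, ("a", "b"): 5.0, ("b", "a"): 7.0}
total = tree_weight_sum(["r", "a", "b"], "r", w)
```

One determinant replaces the enumeration of all trees, which is what makes the per-cascade likelihood tractable.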
10
An Alternative Formulation
We consider only the most likely tree
Maximum log-likelihood for a cascade c under a graph G:
$F_c(G) = \max_{T \in \mathcal{T}_c(G)} \log P(c \mid T)$
Log-likelihood of G given a set of cascades C:
$F_C(G) = \sum_{c \in C} F_c(G)$
The problem is still intractable (NP-hard)
But we present an algorithm that finds near-optimal networks in $O(N^2)$
11
Max Directed Spanning Tree
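Given a cascade c and a network G, what is the most likely propagation tree?
$\hat{T} = \arg\max_{T \in \mathcal{T}_c(G)} \sum_{(u,v) \in T} w_c(u, v)$, where $w_c(u, v) = \log P_c(u, v)$
A maximum directed spanning tree (MDST):
The sub-graph of G induced by the nodes in the cascade c is a DAG, because edges point forward in time
For each node, just pick an in-edge of maximum weight:
$\mathrm{parent}(v) = \arg\max_{u:\,(u,v) \in G,\ t_u < t_v} w_c(u, v)$
Greedy parent selection for each node gives the globally optimal tree!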
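The greedy parent selection can be sketched as follows. The edge set, the function name `most_likely_tree`, and the log-weight `-delta` (an exponential delay with rate 1) are assumptions for illustration; a node with no admissible in-edge is simply left without a parent here, where the talk would attach it to the external node m.

```python
import math


def most_likely_tree(times, graph_edges, log_weight):
    """Pick, for every node, its max-weight in-edge from an earlier node.

    Because every usable edge points forward in time, the chosen edges
    cannot form a cycle, so the per-node choices form a tree (or forest).
    """
    parent = {}
    total = 0.0
    for v, t_v in times.items():
        best_u, best_w = None, -math.inf
        for u, t_u in times.items():
            if t_u < t_v and (u, v) in graph_edges:
                w = log_weight(t_v - t_u)
                if w > best_w:
                    best_u, best_w = u, w
        if best_u is not None:
            parent[v] = best_u
            total += best_w
    return parent, total


times = {"a": 1, "c": 2, "b": 3, "e": 4}
graph_edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e"), ("c", "e")}
parent, score = most_likely_tree(times, graph_edges, lambda delta: -delta)
```

Each node is handled independently, so the cost is linear in the number of edges among the cascade's nodes.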
12
Objective function is Submodular
Given a set of cascades C, how do we find the network G that maximizes $F_C(G)$?
Theorem: The log-likelihood $F_C(G)$ of a set of cascades C is monotonic and submodular in the edges of the graph G:
$F_C(A \cup \{e\}) - F_C(A) \ge F_C(B \cup \{e\}) - F_C(B)$ for $A \subseteq B \subseteq V \times V$
Left-hand side: gain of adding an edge to a “small” graph
Right-hand side: gain of adding an edge to a “large” graph
Proof:
$F_c(G)$ of a single cascade c is monotonic and submodular
Hence $F_C(G)$ of a set of cascades C, a sum of such terms, is monotonic and submodular
13
Objective function is Submodular
Proof:
Single cascade c, edge e with weight x pointing into node s
Let w be the max-weight in-edge of s in A
Let w’ be the max-weight in-edge of s in B
We know: w ≤ w’, since $A \subseteq B \subseteq V \times V$
Now: $F_c(A \cup \{e\}) - F_c(A) = \max(w, x) - w \ge \max(w', x) - w' = F_c(B \cup \{e\}) - F_c(B)$
The left-hand side is the gain of adding an edge to a “small” graph, the right-hand side the gain of adding it to a “large” graph
[Figure: node s with in-edges of weight w (in A) and w’ (in B) and the new edge e of weight x, among nodes a, i, j, k, o, r]
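The diminishing-returns inequality can be checked numerically on a small case. This sketch assumes a single-cascade objective equal to the sum over nodes of the best in-edge weight, with nonnegative weights so that a node without any in-edge contributes 0; the weights and names are hypothetical.

```python
from itertools import combinations


def f_c(edges, weight, nodes):
    """Sum over nodes of the largest in-edge weight among `edges` (0 if none)."""
    total = 0.0
    for s in nodes:
        in_weights = [weight[(u, v)] for (u, v) in edges if v == s]
        total += max(in_weights, default=0.0)
    return total


def gain(edges, e, weight, nodes):
    """Marginal gain of adding edge e to the edge set `edges`."""
    return f_c(edges | {e}, weight, nodes) - f_c(edges, weight, nodes)


def has_diminishing_returns(weight, nodes):
    """Exhaustively check gain(A, e) >= gain(B, e) for all A subset of B."""
    all_edges = list(weight)
    subsets = [frozenset(c) for r in range(len(all_edges) + 1)
               for c in combinations(all_edges, r)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for e in all_edges:
                if e not in b and gain(a, e, weight, nodes) < gain(b, e, weight, nodes) - 1e-12:
                    return False
    return True


# w = 2 (in A), w' = 5 (in B), x = 4 (new edge e), all pointing into s.
weight = {("a", "s"): 2.0, ("b", "s"): 5.0, ("e", "s"): 4.0, ("a", "b"): 1.0}
nodes = ["a", "b", "e", "s"]
small = frozenset({("a", "s")})
large = frozenset({("a", "s"), ("b", "s")})
```

Here the gain of e is max(2, 4) - 2 = 2 for the small graph and max(5, 4) - 5 = 0 for the large one, matching the inequality in the proof.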
14
Finding the Diffusion Graph
Use greedy hill-climbing to maximize $F_C(G)$:
For i = 1…k: at every step, pick the edge that maximizes the marginal improvement
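[Figure: network over nodes a, b, c, d, e with the marginal gain of each candidate edge]
Marginal gains:
(a, b): 20, (c, b): 18, (d, b): 4, (e, b): 5
(a, c): 15, (b, c): 8, (b, d): 16, (c, d): 8, (e, d): 10
(b, e): 7, (d, e): 13
[Figure: the gains are re-evaluated after each greedy step; the later values shown are 17, 2, 3, 1, 1, then 8, 7, then 6]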
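The hill-climbing loop can be sketched directly from this description. The objective below sums, over cascades and nodes, the best in-edge weight available in the current edge set; the weight `1 / delta` is only an illustrative nonnegative stand-in for the talk's log-likelihood weights, and the function names are hypothetical.

```python
def objective(edges, cascades, weight):
    """Sum over cascades and nodes of the best in-edge weight in `edges`."""
    total = 0.0
    for times in cascades:
        for v, t_v in times.items():
            best = 0.0
            for u, t_u in times.items():
                if t_u < t_v and (u, v) in edges:
                    best = max(best, weight(t_v - t_u))
            total += best
    return total


def greedy_network(cascades, k, weight=lambda delta: 1.0 / delta):
    """Add, k times, the candidate edge with the largest marginal gain."""
    candidates = {(u, v) for times in cascades
                  for u in times for v in times if times[u] < times[v]}
    chosen = set()
    for _ in range(k):
        base = objective(chosen, cascades, weight)
        best_edge, best_gain = None, 0.0
        for e in sorted(candidates - chosen):
            g = objective(chosen | {e}, cascades, weight) - base
            if g > best_gain:
                best_edge, best_gain = e, g
        if best_edge is None:
            break  # no remaining edge improves the objective
        chosen.add(best_edge)
    return chosen


cascades = [
    {"a": 1, "c": 2, "b": 6, "e": 9},
    {"c": 1, "a": 4, "b": 5, "d": 8},
]
network = greedy_network(cascades, k=2)
```

This naive version re-evaluates every candidate at every step; it is meant to show the selection rule, not to be fast.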
Benefits:
1. Approximation guarantee (≈ 0.63 of OPT)
2. Tight on-line bounds on the solution quality
3. Speed-ups:
Lazy evaluation (by submodularity)
Localized update (by the structure of the problem)
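Lazy evaluation relies on submodularity: a gain computed in an earlier step is an upper bound on the current gain, so most candidates never need to be re-evaluated. The sketch below is a generic lazy greedy loop with a priority queue, written under that assumption; the names, the example objective, and the weights are hypothetical and not the authors' implementation.

```python
import heapq


def make_objective(weights):
    """Objective: sum over target nodes of the best chosen in-edge weight."""
    def objective(edges):
        best = {}
        for (u, v) in edges:
            best[v] = max(best.get(v, 0.0), weights[(u, v)])
        return sum(best.values())
    return objective


def lazy_greedy(candidates, objective, k):
    """Greedy selection that re-evaluates only the most promising candidate."""
    chosen = set()
    current = objective(chosen)
    # Max-heap (via negation) of possibly stale gains, which are upper bounds.
    heap = [(-(objective({e}) - current), e) for e in candidates]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(chosen) < k:
        _, e = heapq.heappop(heap)
        fresh = objective(chosen | {e}) - current
        evaluations += 1
        if not heap or fresh >= -heap[0][0]:
            if fresh <= 0:
                break  # nothing left can improve the objective
            chosen.add(e)
            current += fresh
        else:
            heapq.heappush(heap, (-fresh, e))  # keep with its updated bound
    return chosen, evaluations


weights = {("a", "b"): 5.0, ("c", "b"): 4.0, ("a", "c"): 3.0,
           ("b", "d"): 2.5, ("c", "d"): 1.0}
chosen, evaluations = lazy_greedy(list(weights), make_objective(weights), k=3)
```

In this small example the plain greedy loop would evaluate 5 + 4 + 3 = 12 gains, while the lazy version needs fewer because stale bounds rule out most candidates.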
15
Experimental Setup
We validate our method on:
Synthetic data:
Generate a graph G on k edges
Generate cascades
Record node infection times
Reconstruct G
Real data:
MemeTracker: 172m news articles, Aug ’08 – Sept ’09, 343m textual phrases (quotes)
Flickr:
Questions:
How many edges of G can we find? (Precision-Recall Break-even point)
How many cascades do we need?
How fast is the algorithm?
How well do we optimize the likelihood $F_C(G)$?
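The precision-recall break-even point used in the evaluation can be computed as below. This is a sketch under the assumption that inferred edges arrive as a ranked list (for instance in the order the greedy algorithm adds them); precision and recall coincide once as many edges are kept as the true graph contains. The function names and the toy edge lists are illustrative.

```python
def precision_recall(inferred, true_edges):
    """Precision and recall of a set of inferred edges."""
    inferred, true_edges = set(inferred), set(true_edges)
    hits = len(inferred & true_edges)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    return precision, recall


def break_even_point(ranked_edges, true_edges):
    """Precision (equal to recall) when keeping as many edges as the truth has.

    Assumes ranked_edges contains at least len(true_edges) distinct edges.
    """
    k = len(set(true_edges))
    precision, _ = precision_recall(ranked_edges[:k], true_edges)
    return precision


true_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
ranked = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("c", "d")]
```

In this toy case three of the first four ranked edges are correct, so the break-even point is 0.75.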
16
Small Synthetic Example
Small synthetic network:
[Figure: three panels: True network, Baseline network, Our method]
Baseline network: pick the k strongest edges
17
Synthetic Networks
Performance does not depend on the network structure:
Synthetic networks: Forest Fire, Kronecker, etc.
Transmission time distribution: Exponential, Power Law
Break-even point of > 90%
[Figure panels: 1024-node hierarchical Kronecker network, exponential transmission model; 1000-node Forest Fire network (α = 1.1), power-law transmission model]
18
How good is our graph?
We achieve ≈ 90 % of the best possible network!
19
How many cascades do we need?
With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!
20
Running Time
Lazy evaluation and localized updates give a speed-up of 2 orders of magnitude!
Can infer a network of 10k nodes in several hours
21
Real Data: Information diffusion
MemeTracker dataset:
172m news articles from Aug ’08 – Sept ’09
343m textual phrases (quotes)
Want to infer the network of information diffusion
We use the hyperlinks between sites to generate the edges of a ground truth G
From the MemeTracker dataset, we have the timestamps of:
1. cascades of hyperlinks: time when a site creates a link
2. cascades of (MemeTracker) textual phrases: time when a site mentions the information
[Figure: two example cascades over sites a, c, e, f]
22
Real Network: Performance
[Figure panels: 500-node hyperlink network using hyperlink cascades; 500-node hyperlink network using MemeTracker cascades]
Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!
23
Information Diffusion Network
5,000 news sites:
[Figure legend: Blogs, Mainstream media]
24
Information Diffusion Network (small part)
[Figure legend: Blogs, Mainstream media]
25
Real Data: Trips reconstruction
Flickr dataset:
60k Flickr users
6M time-stamped geo-localized photos
For every user we have: time and place where a photo was taken
20425816@N05;Argentina;Ciudad de Buenos Aires;Cafayate;2008-04-02
9603517@N06;Spain;Andalucia;Granada;2008-04-09
9603517@N06;Belgium;Oost-Vlaanderen;Ghent;2006-05-20
95311862@N00;Italy;Piedmont;San Pietro Mosezzo;2005-03-10
Want to infer the network of frequent trips…
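One way to turn such records into cascades is to order each user's places by date. The sketch below assumes the field layout user id; country; region; city; date, which is inferred from the sample rows rather than documented, and treats each user's time-ordered list of cities as one cascade over places. The function name is hypothetical.

```python
from collections import defaultdict
from datetime import date

records = """20425816@N05;Argentina;Ciudad de Buenos Aires;Cafayate;2008-04-02
9603517@N06;Spain;Andalucia;Granada;2008-04-09
9603517@N06;Belgium;Oost-Vlaanderen;Ghent;2006-05-20
95311862@N00;Italy;Piedmont;San Pietro Mosezzo;2005-03-10"""


def trips_by_user(text):
    """Map each user to the list of cities visited, ordered by photo date."""
    visits = defaultdict(list)
    for line in text.splitlines():
        # Assumed layout: user;country;region;city;YYYY-MM-DD
        user, _country, _region, city, day = line.strip().split(";")
        visits[user].append((date.fromisoformat(day), city))
    return {user: [city for _, city in sorted(stops)]
            for user, stops in visits.items()}


trips = trips_by_user(records)
```

In this view a place plays the role of a node and the photo date plays the role of the infection time.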
26
Trips Network
27
Conclusions
We infer hidden networks based on diffusion data (timestamps)
Problem formulation in a maximum likelihood framework
NP-hard problem to solve exactly
We develop an approximation algorithm that:
is efficient: it runs in $O(N^2)$
is invariant to the structure of the underlying network
gives a near-optimal network with a tight bound on the solution quality
Future work:
Learn both the network and the diffusion model
Applications to other domains: biology, neuroscience, etc.
28
Thanks!
For more (Code & Data): http://snap.stanford.edu/netinf