manuel gomez rodriguez 1,2 jure leskovec 1 andreas krause 3

1

1 Stanford University2 MPI for Biological Cybernetics

3 California Institute of Technology

Inferring Networks of Diffusion and Influence

Manuel Gomez Rodriguez1,2

Jure Leskovec1

Andreas Krause3

2

Networks and Processes

Many times hard to directly observe a social or information network: Hidden/hard-to-reach populations:

Drug injection users Implicit connections:

Network of information sharing in online media

Often easier to observe results of the processes taking place on such (invisible) networks: Virus propagation:

People get sick, they see the doctor Information networks:

Blogs mention information

3

Information Diffusion Network

Information diffuses through the network

We only see who mentions but not where they got the information from

Can we reconstruct the (hidden) diffusion network?

4

More Examples

Virus propagation Word of mouth & Viral marketing

Can we infer the underlying social or information network?

Viruses propagate through the network

We only observe when people get sick

But NOT who infected them

Recommendations and influence propagate

We only observe when people buy products

But NOT who influenced them

Process

We observe

It’s hidden

5

Inferring the Network

There is a hidden directed network:

We only see times when nodes get infected: Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4) Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6)

Want to infer the who-infects-whom network

bb

dd

ee

aa

cc

aa

cc

bb

eecc

aabb

dd

6

Our Problem Formulation

Plan for the talk:

1. Define a continuous time model of diffusion

2. Define the likelihood of the observed propagation data given a graph

3. Show how to efficiently compute the likelihood

4. Show how to efficiently optimize the likelihood Find a graph G that maximizes the likelihood There is a super-exponential number of graphs G Our method finds a near-optimal solution in O(N2)!

7

cccc

ee ffee ff

cc

bbaa bbaaaa bb

dd

Cascade Generation Model

Cascade generation model: Cascade reaches u at tu, and spreads to u’s

neighbors v: with probability β cascade propagates along (u, v) and

tv = tu + Δ, with Δ ~ f()

ta tb tcΔ1 Δ2

We assume each node v has only one parent!

Δ3 Δ4

te tf

8

Likelihood of a Cascade

bb

dd

ee

aa

cc

aa

cc

bb

ee

If u infected v in a cascade c, its transmission probability is:

Pc(u, v) f(tv - tu) with tv > tu and (u, v) are neighbors

Prob. that cascade c propagates in a tree T:

To model that in reality any node v in a cascade can have been infected by an external influence m: Pc(m, j) = ε

mmεεε

Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)

9

Finding the Diffusion Network

There are many possible propagation trees: c: (a, 1), (c, 2), (b, 3), (e, 4)

Need to consider all possible propagation trees T supported by G:

bb

dd

ee

aa

cc

aa

cc

bb

ee

bb

dd

ee

aa

cc

aa

cc

bb

ee

bb

dd

ee

aa

cc

aa

cc

bb

ee

Likelihood of a set of cascades C on G: Want to find:

Good news

Computing P(c|G) is tractable:For each c, consider all O(nn) possible transmission trees of G.

Matrix Tree Theorem can compute this sum in O(n3)!

Bad news

We actually want to find:

We have a super-exponential number of graphs!

10

An Alternative Formulation

We consider only the most likely tree Maximum log-likelihood for a cascade c under a

graph G:

Log-likelihood of G given a set of cascades C:The problem is NP-hard: MAX-k-COVER

Our algorithm can do it near-optimally in O(N2)

11

Max Directed Spanning Tree

Given a cascade c, What is the most likely propagation tree?

where

bb

dd

aa

cc

Local greedy selection gives optimal tree!

A maximum directed spanning tree (MDST): Just need to compute the MDST of a the sub-

graph of G induced by c (i.e., a DAG) For each node, just picks an in-edge of max-

weight:

3

5

4

2

6

1

Subgraph of G induced by c doesn’t have loops (DAG)

aa

bb

dd

cc

12

Great News: Submodularity

Theorem:Log-likelihood Fc(G) is monotonic, and submodular in the edges of the graph G

Gain of adding an edge to a “small” graph Gain of adding an edge to a “large“ graph

Fc(A {e}) – Fc (A) ≥ Fc (B {e}) – Fc (B)

Proof:

ss

A B VxV

w

w’

xA

Bjj

oo

Single cascade c, edge e with weight x Let w be max weight in-edge of s in A Let w’ be max weight in-edge of s in B We know: w ≤ w’ Now: Fc(A {e}) – Fc(A) = max (w, x) – w

≥ max (w’, x) – w’ = Fc(B {e}) – Fc(B)

rr

aa

kk

iiii

kk

Then, log-likelihood FC(G) is monotonic, and submodular too

13

Finding the Diffusion Graph

Use the greedy hill-climbing to maximize FC(G):

At every step, pick the edge that maximizes the marginal improvement

bb

dd

ee

aa

ccLocalized update

Marginal gainsb ac ad aa bc bd be b

: 12 : 3 : 6 : 20 : 18 : 4 : 5

a cb cb dc de d

: 15 : 8 : 16 : 8 : 10

b ed e

: 7 : 13

: 17 : 2 : 3

Localized update

Localized update

Localized update

: 1 : 1

: 8 : 7

: 6

1. Approximation guarantee (≈ 0.63 of OPT)

2. Tight on-line bounds on the solution quality

3. Speed-ups:Lazy evaluation (by submodularity)

Localized update (by the structure of the problem)

Benefits

14

Experimental Setup

We validate our method on:

How many edges of G can we find?

Precision-Recall Break-even point

How many cascades do we need?

How fast are we?

How well do we optimize Fc(G)?

Synthetic dataGenerate a graph G on k edgesGenerate cascadesRecord node infection timesReconstruct G

Real dataMemeTracker: 172m news articlesAug ’08 – Sept ‘09343m textual phrases (quotes)

15

Small synthetic network:

True networkTrue network Baseline networkBaseline network Our methodOur method

15

Small Synthetic Example

Pick k strongest edges:

16

Synthetic Networks

Our performance does not depend on the network structure:

Synthetic Networks: Forest Fire, Kronecker, etc. Prob. of transmission: Exponential, Power Law

Break-even points of > 90% when the baseline gets 30-50%!

1024 node hierarchical Kronecker exponential transmission model

1000 node Forest Fire (α = 1.1) power law transmission model

17

How good is our graph?

We achieve ≈ 90 % of the best possible network!

18

How many cascades do we need?

With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!

19

Running Time

Lazy evaluation and localized updates speed up 2 orders of magnitude!

20

Real Data

MemeTracker dataset: 172m news articles Aug ’08 – Sept ‘09 343m textual phrases

(quotes) Times tc(w) when site w

mentions phrase (quote) c http://memetracker.org

Given times when sites mention phrases We infer the network of information diffusion:

Who tends to copy (repeat after) whom

21

Real Network

We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground truth G

From the MemeTracker dataset, we have the timestamps of: 1. cascades of hyperlinks:

sites link other sites

2. cascades of (MemeTracker) quotes

sites copy quotes from other sites

Can we infer the hyperlinks network from…

…cascades of hyperlinks? …cascades of MemeTracker

quotes?

Are they correlated?

ee

ffccaa

ee

ffccaa

22

Real Network

500 node hyperlink network using hyperlinks cascades

500 node hyperlink network using MemeTracker cascades

Break-even points of 50% for hyperlinks cascades and 30% for MemeTracker cascades!

23

5,000 news sites:

BlogsMainstream media

Diffusion Network

24BlogsMainstream media

Diffusion Network (small part)

25

Networks and Processes We infer hidden networks based on diffusion data

(timestamps)

Problem formulation in a maximum likelihood framework NP-hard problem to solve exactly We develop an approximation algorithm that:

It is efficient -> It runs in O(N2) It is invariant to the structure of the underlying network It gives a sub-optimal network with tight bound

Future work: Learn both the network and the diffusion model Extensions to other processes taking place on networks Applications to other domains: biology, neuroscience, etc.

26

Thanks!For more (Code & Data):http://snap.stanford.edu/netinf

manuel gomez rodriguez 1,2 jure leskovec 1 andreas krause 3

Documents

cascade c

set of cascades c

probability cascade

network of information

node v

graph g

likely propagation tree

cascadeif u infected