manuel gomez rodriguez 1,2 jure leskovec 1 andreas krause 3

26
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

Upload: suzuki

Post on 17-Jan-2016

46 views

Category:

Documents


0 download

DESCRIPTION

1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology. Inferring Networks of Diffusion and Influence. Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3. Networks and Processes. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

1

1 Stanford University2 MPI for Biological Cybernetics

3 California Institute of Technology

Inferring Networks of Diffusion and Influence

Manuel Gomez Rodriguez1,2

Jure Leskovec1

Andreas Krause3

Page 2: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

2

Networks and Processes

Many times hard to directly observe a social or information network: Hidden/hard-to-reach populations:

Drug injection users Implicit connections:

Network of information sharing in online media

Often easier to observe results of the processes taking place on such (invisible) networks: Virus propagation:

People get sick, they see the doctor Information networks:

Blogs mention information

Page 3: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

3

Information Diffusion Network

Information diffuses through the network

We only see who mentions but not where they got the information from

Can we reconstruct the (hidden) diffusion network?

Page 4: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

4

More Examples

Virus propagation Word of mouth & Viral marketing

Can we infer the underlying social or information network?

Viruses propagate through the network

We only observe when people get sick

But NOT who infected them

Recommendations and influence propagate

We only observe when people buy products

But NOT who influenced them

Process

We observe

It’s hidden

Page 5: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

5

Inferring the Network

There is a hidden directed network:

We only see times when nodes get infected: Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4) Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6)

Want to infer the who-infects-whom network

bb

dd

ee

aa

cc

aa

cc

bb

eecc

aabb

dd

Page 6: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

6

Our Problem Formulation

Plan for the talk:

1. Define a continuous time model of diffusion

2. Define the likelihood of the observed propagation data given a graph

3. Show how to efficiently compute the likelihood

4. Show how to efficiently optimize the likelihood Find a graph G that maximizes the likelihood There is a super-exponential number of graphs G Our method finds a near-optimal solution in O(N2)!

Page 7: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

7

cccc

ee ffee ff

cc

bbaa bbaaaa bb

dd

Cascade Generation Model

Cascade generation model: Cascade reaches u at tu, and spreads to u’s

neighbors v: with probability β cascade propagates along (u, v) and

tv = tu + Δ, with Δ ~ f()

ta tb tcΔ1 Δ2

We assume each node v has only one parent!

Δ3 Δ4

te tf

Page 8: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

8

Likelihood of a Cascade

bb

dd

ee

aa

cc

aa

cc

bb

ee

If u infected v in a cascade c, its transmission probability is:

Pc(u, v) f(tv - tu) with tv > tu and (u, v) are neighbors

Prob. that cascade c propagates in a tree T:

To model that in reality any node v in a cascade can have been infected by an external influence m: Pc(m, j) = ε

mmεεε

Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)

Page 9: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

9

Finding the Diffusion Network

There are many possible propagation trees: c: (a, 1), (c, 2), (b, 3), (e, 4)

Need to consider all possible propagation trees T supported by G:

bb

dd

ee

aa

cc

aa

cc

bb

ee

bb

dd

ee

aa

cc

aa

cc

bb

ee

bb

dd

ee

aa

cc

aa

cc

bb

ee

Likelihood of a set of cascades C on G: Want to find:

Good news

Computing P(c|G) is tractable:For each c, consider all O(nn) possible transmission trees of G.

Matrix Tree Theorem can compute this sum in O(n3)!

Bad news

We actually want to find:

We have a super-exponential number of graphs!

Page 10: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

10

An Alternative Formulation

We consider only the most likely tree Maximum log-likelihood for a cascade c under a

graph G:

Log-likelihood of G given a set of cascades C:The problem is NP-hard: MAX-k-COVER

Our algorithm can do it near-optimally in O(N2)

Page 11: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

11

Max Directed Spanning Tree

Given a cascade c, What is the most likely propagation tree?

where

bb

dd

aa

cc

Local greedy selection gives optimal tree!

A maximum directed spanning tree (MDST): Just need to compute the MDST of a the sub-

graph of G induced by c (i.e., a DAG) For each node, just picks an in-edge of max-

weight:

3

5

4

2

6

1

Subgraph of G induced by c doesn’t have loops (DAG)

aa

bb

dd

cc

Page 12: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

12

Great News: Submodularity

Theorem:Log-likelihood Fc(G) is monotonic, and submodular in the edges of the graph G

Gain of adding an edge to a “small” graph Gain of adding an edge to a “large“ graph

Fc(A {e}) – Fc (A) ≥ Fc (B {e}) – Fc (B)

Proof:

ss

A B VxV

w

w’

xA

Bjj

oo

Single cascade c, edge e with weight x Let w be max weight in-edge of s in A Let w’ be max weight in-edge of s in B We know: w ≤ w’ Now: Fc(A {e}) – Fc(A) = max (w, x) – w

≥ max (w’, x) – w’ = Fc(B {e}) – Fc(B)

rr

aa

kk

iiii

kk

Then, log-likelihood FC(G) is monotonic, and submodular too

Page 13: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

13

Finding the Diffusion Graph

Use the greedy hill-climbing to maximize FC(G):

At every step, pick the edge that maximizes the marginal improvement

bb

dd

ee

aa

ccLocalized update

Marginal gainsb ac ad aa bc bd be b

: 12 : 3 : 6 : 20 : 18 : 4 : 5

a cb cb dc de d

: 15 : 8 : 16 : 8 : 10

b ed e

: 7 : 13

: 17 : 2 : 3

Localized update

Localized update

Localized update

: 1 : 1

: 8 : 7

: 6

1. Approximation guarantee (≈ 0.63 of OPT)

2. Tight on-line bounds on the solution quality

3. Speed-ups:Lazy evaluation (by submodularity)

Localized update (by the structure of the problem)

Benefits

Page 14: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

14

Experimental Setup

We validate our method on:

How many edges of G can we find?

Precision-Recall Break-even point

How many cascades do we need?

How fast are we?

How well do we optimize Fc(G)?

Synthetic dataGenerate a graph G on k edgesGenerate cascadesRecord node infection timesReconstruct G

Real dataMemeTracker: 172m news articlesAug ’08 – Sept ‘09343m textual phrases (quotes)

Page 15: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

15

Small synthetic network:

True networkTrue network Baseline networkBaseline network Our methodOur method

15

Small Synthetic Example

Pick k strongest edges:

Page 16: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

16

Synthetic Networks

Our performance does not depend on the network structure:

Synthetic Networks: Forest Fire, Kronecker, etc. Prob. of transmission: Exponential, Power Law

Break-even points of > 90% when the baseline gets 30-50%!

1024 node hierarchical Kronecker exponential transmission model

1000 node Forest Fire (α = 1.1) power law transmission model

Page 17: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

17

How good is our graph?

We achieve ≈ 90 % of the best possible network!

Page 18: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

18

How many cascades do we need?

With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!

Page 19: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

19

Running Time

Lazy evaluation and localized updates speed up 2 orders of magnitude!

Page 20: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

20

Real Data

MemeTracker dataset: 172m news articles Aug ’08 – Sept ‘09 343m textual phrases

(quotes) Times tc(w) when site w

mentions phrase (quote) c http://memetracker.org

Given times when sites mention phrases We infer the network of information diffusion:

Who tends to copy (repeat after) whom

Page 21: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

21

Real Network

We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground truth G

From the MemeTracker dataset, we have the timestamps of: 1. cascades of hyperlinks:

sites link other sites

2. cascades of (MemeTracker) quotes

sites copy quotes from other sites

Can we infer the hyperlinks network from…

…cascades of hyperlinks? …cascades of MemeTracker

quotes?

Are they correlated?

ee

ffccaa

ee

ffccaa

Page 22: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

22

Real Network

500 node hyperlink network using hyperlinks cascades

500 node hyperlink network using MemeTracker cascades

Break-even points of 50% for hyperlinks cascades and 30% for MemeTracker cascades!

Page 23: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

23

5,000 news sites:

BlogsMainstream media

Diffusion Network

Page 24: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

24BlogsMainstream media

Diffusion Network (small part)

Page 25: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

25

Networks and Processes We infer hidden networks based on diffusion data

(timestamps)

Problem formulation in a maximum likelihood framework NP-hard problem to solve exactly We develop an approximation algorithm that:

It is efficient -> It runs in O(N2) It is invariant to the structure of the underlying network It gives a sub-optimal network with tight bound

Future work: Learn both the network and the diffusion model Extensions to other processes taking place on networks Applications to other domains: biology, neuroscience, etc.

Page 26: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3

26

Thanks!For more (Code & Data):http://snap.stanford.edu/netinf