dynamics of large networks

55
Dynamics of large networks Jure Leskovec Machine Learning Department Carnegie Mellon University Thesis defense, September 3 2008

Upload: bradley-summers

Post on 02-Jan-2016

21 views

Category:

Documents


1 download

DESCRIPTION

Jure Leskovec Machine Learning Department Carnegie Mellon University. Dynamics of large networks. Thesis defense, September 3 2008. Web: Rich data. Today: L arge on-line systems have detailed records of human activity On-line communities: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Dynamics of large networks

Dynamics of large networks

Jure LeskovecMachine Learning DepartmentCarnegie Mellon University

Thesis defense, September 3 2008

Page 2: Dynamics of large networks

2

Web: Rich data

Today: Large on-line systems have detailed records of human activity On-line communities:

▪ Facebook (64 million users, billion dollar business)▪ MySpace (300 million users)

Communication:▪ Instant Messenger (~1 billion users)

News and Social media:▪ Blogging (250 million blogs world-wide, presidential candidates run

blogs) On-line worlds:

▪ World of Warcraft (internal economy 1 billion USD)▪ Second Life (GDP of 700 million USD in ‘07)

Can study phenomena and behaviors at scales that

before were never possible

Page 3: Dynamics of large networks

3

Rich data: Networks

b) Internet (AS)c) Social networksa) World wide web

d) Communication e) Citationsf) Biological networks

Page 4: Dynamics of large networks

4

Networks: What do we know? We know lots about the network structure:

Properties: Scale free [Barabasi ’99], Clustering [Watts-Strogatz ‘98], Navigation [Adamic-Adar ’03, LibenNowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ‘02], Communities [Newman ‘99], Conductance [Mihail-Papadimitriou-Saberi ‘06], Hubs and authorities [Page et al. ’98, Kleinberg ‘99]

Models: Preferential attachment [Barabasi ’99], Small-world [Watts-Strogatz ‘98], Copying model [Kleinberg el al. ’99], Heuristically optimized tradeoffs [Fabrikant et al. ‘02], Congestion [Mihail et al. ‘03], Searchability [Kleinberg ‘00], Bowtie [Broder et al. ‘00], Transit-stub [Zegura ‘97], Jellyfish [Tauro et al. ‘01]

We know much less about processes and

dynamics of networks

Page 5: Dynamics of large networks

5

This thesis: Network dynamics

Network evolution How network structure changes as the

network grows and evolves? Diffusion and cascading behavior

How do rumors and diseases spread over networks?

Large data Observe phenomena that is “invisible” at

smaller scales

Page 6: Dynamics of large networks

6

Data size matters

We need massive network data for the patterns to emerge MSN Messenger network [WWW ’08]

(the largest social network ever analyzed)▪ 240M people, 255B messages, 4.5 TB data

Product recommendations [EC ‘06]

▪ 4M people, 16M recommendations Blogosphere [in progress]

▪ 164M posts, 127M links

Page 7: Dynamics of large networks

7

This thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Q1: How does network structure

evolve over time?

Q4: What are patterns of diffusion in networks?

Q7: What are the properties of a

social network of the whole planet?

ModelsQ2: How to

model individual edge

attachment?

Q5: How do we model influence

propagation?

Q8: What is community

structure of large networks?

Algorithms

(applications)

Q3: How to generate

realistic looking networks?

Q6: How to identify

influential nodes and epidemics?

Q9: How to predict search result quality from the web

graph?

Page 8: Dynamics of large networks

8

This thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Q1: How does network structure

evolve over time?

Q4: What are patterns of diffusion in networks?

Q7: What are the properties of a

social network of the whole planet?

ModelsQ2: How to

model individual edge

attachment?

Q5: How do we model influence

propagation?

Q8: What is community

structure of large networks?

Algorithms

(applications)

Q3: How to generate

realistic looking networks?

Q6: How to identify

influential nodes and epidemics?

Q9: How to predict search result quality from the web

graph?

Page 9: Dynamics of large networks

9

Background: Network models

Empirical findings on real graphs led to new network models

Such models make assumptions/predictions about other network properties

What about network evolution?

log degree

log pro

b. Model

Power-law degree distribution

Preferential attachment

Explains

Page 10: Dynamics of large networks

10

Networks are denser over time Densification Power Law:

a … densification exponent (1 ≤ a ≤ 2)

Q1) Network evolution

What is the relation between the number of nodes and the edges over time?

Prior work assumes: constant average degree over time

Internet

Citations

a=1.2

a=1.6

N(t)

E(t

)

N(t)

E(t

)

[w/ Kleinberg-Faloutsos, KDD ’05]

Page 11: Dynamics of large networks

11

Q1) Network evolution

Prior models and intuition say that the network diameter slowly grows (like log N, log log N)

time

diam

eter

diam

eter

size of the graph

Internet

Citations

Diameter shrinks over time as the network grows the

distances between the nodes slowly decrease

[w/ Kleinberg-Faloutsos, KDD ’05]

Page 12: Dynamics of large networks

12

Q2) Modeling edge attachment

We directly observe atomic events of network evolution (and not only network snapshots)

We can model evolution at finest scale Test individual edge attachment

Directly observe events leading to network properties Compare network models by likelihood

(and not by just summary network statistics)

and so on for

millions…

[w/ Backstrom-Kumar-Tomkins, KDD ’08]

Page 13: Dynamics of large networks

13

Setting: Edge-by-edge evolution

Network datasets Full temporal information from the first edge onwards LinkedIn (N=7m, E=30m), Flickr (N=600k, E=3m), Delicious

(N=200k, E=430k), Answers (N=600k, E=2m)

We model 3 processes governing the evolution P1) Node arrival: node enters the network P2) Edge initiation: node wakes up, initiates an edge, goes

to sleep P3) Edge destination: where to attach a new edge

▪ Are edges more likely to attach to high degree nodes?▪ Are edges more likely to attach to nodes that are close?

[w/ Backstrom-Kumar-Tomkins, KDD ’08]

Page 14: Dynamics of large networks

14

Edge attachment degree bias

Are edges more likely to connect to higher degree nodes?

kkpe )(Gnp

PA

Flickr

Network

τ

Gnp 0

PA 1

Flickr 1

Delicious

1

Answers

0.9

LinkedIn

0.6

First direct proof of preferential

attachment!

[w/ Backstrom-Kumar-Tomkins, KDD ’08]

Page 15: Dynamics of large networks

15

uw

v

But, edges also attach locally Just before the edge (u,w) is placed how many

hops is between u and w?

Network

% Δ

Flickr 66%

Delicious

28%

Answers

23%

LinkedIn

50%

Fraction of triad closing

edges

Real edges are local.Most of them close

triangles!

[w/ Backstrom-Kumar-Tomkins, KDD ’08]

GnpPA

Flickr

Page 16: Dynamics of large networks

16

New triad-closing edge (u,w) appears next We model this as:

1. u chooses neighbor v 2. v chooses neighbor w3. Connect (u,w) We consider 25 triad closing strategies and compute their log-likelihood

Triad closing is best explained by choosing a node based on the number of common

friends and time since last activity (just choosing random neighbor also works well)

How to best close a triangle?

[w/ Backstrom-Kumar-Tomkins, KDD ’08]

uw

v

Page 17: Dynamics of large networks

17

Q3) Generating realistic graphs

Problem: generate a realistic looking synthetic network

Why synthetic graphs? Anomaly detection, Simulations, Predictions, Null-model,

Sharing privacy sensitive graphs, …

Q: Which network properties do we care about? Q: What is a good model and how do we fit it?

Compare graph properties, e.g.,

degree distribution

Given a real

network

Generate a synthetic network

Page 18: Dynamics of large networks

18

Q3) The model: Kronecker graphs

We prove Kronecker graphs mimic real graphs: Power-law degree distribution, Densification,

Shrinking/stabilizing diameter, Spectral properties

Initiator

(9x9)(3x3)

(27x27)

Kronecker product of graph adjacency matrices

[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05]

pij

Edge probabilityEdge probability

Given a real graph. How to estimate the initiator G1?

Page 19: Dynamics of large networks

19

Maximum likelihood estimation

Q5) Kronecker graphs: Estimation

Naïve estimation takes O(N!N2): N! for different node labelings:

▪ Our solution: Metropolis sampling: N! (big) const N2 for traversing graph adjacency matrix

▪ Our solution: Kronecker product (E << N2): N2 E Do stochastic gradient descent

a bc d

P( | ) Kronecker

arg max

We estimate the model in O(E)

[w/ Faloutsos, ICML ’07]

Page 20: Dynamics of large networks

20

Estimation: Epinions (N=76k, E=510k)

We search the space of ~101,000,000

permutations Fitting takes 2 hours Real and Kronecker are very close

Degree distribution

node degree

pro

babili

ty

0.99 0.54

0.49 0.13

Path lengths

number of hops

# r

each

able

pair

s

“Network” values

ranknetw

ork

valu

e

[w/ Faloutsos, ICML ’07]

Page 21: Dynamics of large networks

21

Thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Q1: How does network structure

evolve over time?

Q4: What are patterns of diffusion in networks?

Q7: What are the properties of a

social network of the whole planet?

ModelsQ2: How to

model individual edge

attachment?

Q5: How do we model influence

propagation?

Q8: What is community

structure of large networks?

Algorithms

(applications)

Q3: How to generate

realistic looking networks?

Q6: How to identify

influential nodes and epidemics?

Q9: How to predict search result quality from the web

graph?

Page 22: Dynamics of large networks

22

Thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Q1: How does network structure

evolve over time?

Q4: What are patterns of diffusion in networks?

Q7: What are the properties of a

social network of the whole planet?

ModelsQ2: How to

model individual edge

attachment?

Q5: How can we model influence

propagation?

Q8: What is community

structure of large networks?

Algorithms

(applications)

Q3: How to generate

realistic looking networks?

Q6: How to identify

influential nodes and epidemics?

Q9: How to predict search result quality from the web

graph?

Page 23: Dynamics of large networks

23

Part 2: Diffusion and Cascades

Behavior that cascades from node to node like an epidemic News, opinions, rumors Word-of-mouth in marketing Infectious diseases

As activations spread through the network they leave a trace – a cascade

Cascade (propagation graph)Network

We observe cascading behavior in large

networks

Page 24: Dynamics of large networks

24

Setting 1: Viral marketing

People send and receive product recommendations, purchase products

Data: Large online retailer: 4 million people, 16 million recommendations, 500k products

10% credit 10% off

[w/ Adamic-Huberman, EC ’06]

Page 25: Dynamics of large networks

25

Setting 2: Blogosphere

Bloggers write posts and refer (link) to other posts and the information propagates

Data: 10.5 million posts, 16 million links

[w/ Glance-Hurst et al., SDM ’07]

Page 26: Dynamics of large networks

26

Viral marketing cascades are more social: Collisions (no summarizers) Richer non-tree structures

Q4) What do cascades look like? Are they stars? Chains? Trees?

Information cascades (blogosphere):

Viral marketing (DVD recommendations):

(ordered by frequency)

pro

pagati

on

[w/ Kleinberg-Singh, PAKDD ’06]

Page 27: Dynamics of large networks

27

Q5) Human adoption curves Prob. of adoption depends on the number of friends

who have adopted [Bass ‘69, Granovetter ’78] What is the shape?

Distinction has consequences for models and algorithms

k = number of friends adopting

Pro

b.

of

ad

op

tion

k = number of friends adopting

Pro

b.

of

ad

op

tion

Diminishing returns? Critical mass?

To find the answer we need lots of

data

Page 28: Dynamics of large networks

28

Q5) Adoption curve: Validation

Later similar findings were made for group membership [Backstrom-Huttenlocher-Kleinberg ‘06], and probability of communication [Kossinets-Watts ’06]

Prob

abili

ty o

f pur

chas

ing

0 5 10 15 20 25 30 35 400

0.010.020.030.040.050.060.070.080.09

0.1DVD

recommendations(8.2 million

observations)

# recommendations received

Adoption curve follows the diminishing returns.Can we exploit this?

[w/ Adamic-Huberman, EC ’06]

Page 29: Dynamics of large networks

29

Q6) Cascade & outbreak detection

Blogs – information epidemics Which are the influential/infectious blogs?

Viral marketing Who are the trendsetters? Influential people?

Disease spreading Where to place monitoring stations to detect

epidemics?

Page 30: Dynamics of large networks

30

Q6) The problem: Detecting cascades

How to quickly detect epidemics as they spread?

c1

c2

c3

[w/ Krause-Guestrin et al., KDD ’07]

Page 31: Dynamics of large networks

31

Two parts to the problem Cost:

Cost of monitoring is node dependent

Reward: Minimize the number of affected

nodes:▪ If A are the monitored nodes, let R(A)

denote the number of nodes we save

We also consider other rewards: ▪ Minimize time to detection▪ Maximize number of detected outbreaks

R(A)A

( )

[w/ Krause-Guestrin et al., KDD ’07]

Page 32: Dynamics of large networks

32

Reward for detecting cascade i

Given: Graph G(V,E), budget M Data on how cascades C1, …, Ci, …,CK spread over time

Select a set of nodes A maximizing the reward

subject to cost(A) ≤ M

Solving the problem exactly is NP-hard Max-cover [Khuller et al. ’99]

Optimization problem

Page 33: Dynamics of large networks

33

Solution: CELF Algorithm

We develop CELF (cost-effective lazy forward-selection) algorithm: Two independent runs of a modified greedy

▪ Solution set A’: ignore cost, greedily optimize reward▪ Solution set A’’: greedily optimize reward/cost ratio

Pick best of the two: arg max(R(A’), R(A’’))

Theorem: If R is submodular then CELF is near optimal CELF achieves ½(1-1/e) factor approximation

[w/ Krause-Guestrin et al., KDD ’07]

Page 34: Dynamics of large networks

34

Problem structure: Submodularity

Theorem: Reward function R is submodular (diminishing returns, think of it as “concavity”)

[w/ Krause-Guestrin et al., KDD ’07]

Gain of adding a node to a small set

Gain of adding a node to a large set

R(A {u}) – R(A) ≥ R(B {u}) – R(B)A B

S1

S2

Placement A={S1, S2}

S’

New monitored

node:

Adding S’ helps a lot

S2

S4

S1

S3

Placement B={S1, S2, S3, S4}

S’

Adding S’

helps very little

Page 35: Dynamics of large networks

35

Blogs: Information epidemics Question: Which blogs should one read to

catch big stories? Idea: Each blog covers part of the blogosphere

• Each dot is a blog• Proximity is based on the number of common cascades

Page 36: Dynamics of large networks

36

Blogs: Information epidemics

For more info see our website: www.blogcascade.org

Which blogs should one read to catch big stories?

CELF

In-links

Random

# postsOut-links

Number of selected blogs (sensors)

Reward

(higher is

better)

(used by Technorati)

Page 37: Dynamics of large networks

37

CELF: Scalability

CELF

GreedyExhaustive search

Number of selected blogs (sensors)

Run time

(seconds)

(lower is better)

CELF runs 700x faster than simple greedy algorithm

Page 38: Dynamics of large networks

38

Same problem: Water Network Given:

a real city water distribution network

data on how contaminants spread over time

Place sensors (to save lives)

Problem posed by the US Environmental Protection Agency

SS

c1

c2

[w/ Krause et al., J. of Water Resource Planning]

Page 39: Dynamics of large networks

39

Water network: Results

Our approach performed best at the Battle of Water Sensor Networks competition

CELF

PopulationRandom

Flow

Degree

Author Score

CMU (CELF) 26

Sandia 21

U Exter 20

Bentley systems 19

Technion (1) 14

Bordeaux 12

U Cyprus 11

U Guelph 7

U Michigan 4

Michigan Tech U 3

Malcolm 2

Proteo 2

Technion (2) 1

Number of placed sensors

Population

saved(higher is better)

[w/ Krause et al., J. of Water Resource Planning]

Page 40: Dynamics of large networks

40

Thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Q1: How does network structure

evolve over time?

Q4: What are patterns of diffusion in networks?

Q7: What are the properties of a

social network of the whole planet?

Models

Q2: How to model

individual edge

attachment?

Q5: How do we model influence

propagation?

Q8: What is community

structure of large networks?

Algorithms

(applications)

Q3: How to generate

realistic looking networks?

Q6: How to identify

influential nodes and epidemics?

Q9: How to predict search result quality from the web

graph?

Page 41: Dynamics of large networks

41

Thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Q1: How does network structure

evolve over time?

Q4: What are patterns of diffusion in networks?

Q7: What are the properties of a

social network of the whole planet?

ModelsQ2: How to

model individual edge

attachment?

Q5: How do we model influence

propagation?

Q8: What is community

structure of large networks?

Algorithms

(applications)

Q3: How to generate

realistic looking networks?

Q6: How to identify

influential nodes and epidemics?

Q9: How to predict search result quality from the web

graph?

Page 42: Dynamics of large networks

42

3 case studies on large data

Benefits from working with large data: Q7) Can test hypothesis at planetary scale

6 degrees of separation

Q8) Observe phenomena previously invisible Network community structure

Q9) Making global predictions from local network structure Web search

Page 43: Dynamics of large networks

43

Q7) Planetary look on a small-world

Milgram’s small world experiment

(i.e., hops + 1)

Small-world experiment [Milgram ‘67] People send letters from Nebraska to Boston

How many steps does it take? Messenger social network – largest network analyzed

240M people, 255B messages, 4.5TB data

[w/ Horvitz, WWW ’08]

MSN Messenger network

Avg. is 6.2. Thus, 6 degrees of separation

Avg. is 6.6! Wikipedia calls it 7 degrees of

separation

Page 44: Dynamics of large networks

44

Q8) Network community structure

How community like is a set of nodes?

Need a natural intuitive measure

Conductance:Φ(S) = # edges cut / # edges inside

Plot: Score of best cut of volume k=|S|

S

S’

[w/ Dasgupta-Lang-Mahoney, WWW ’08]

Page 45: Dynamics of large networks

45

Example: Small network

Collaborations between scientists (N=397, E=914)Cluster size, log k

log Φ

(k)

Page 46: Dynamics of large networks

46

Example: Large network

Collaboration network (N=4,158, E=13,422)

Cluster size, log k

log Φ

(k)

[w/ Dasgupta-Lang-Mahoney, WWW ’08]

Page 47: Dynamics of large networks

47

Q8) Suggested network structure

Good cuts exist only to size of 100

nodes

Denser and denser

core of the network

Expander like core contains

60% of the nodes and 80% edges

Network structure: Core-

periphery (jellyfish, octopus)

• Good cuts exist at small scales• Communities blend into the core as they grow • Consequences: There is a scale to a cluster (community) size

100

Page 48: Dynamics of large networks

48

Q9) Web Projections

User types in a query to a search engine Search engine returns results:

Is this a good set of search results?

Result returned by the search engine

Hyperlinks

Non-search resultsconnecting the graph

[w/ Dumais-Horvitz, WWW ’07]

Page 49: Dynamics of large networks

49

Q9) Web Projections: Results We can predict search result quality with

80% accuracy just from the connection patterns between the results

Predict “Good” Predict “Poor”

Good?

Poor?

[w/ Dumais-Horvitz, WWW ’07]

Page 50: Dynamics of large networks

50

Thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Densification and shrinking

diameterCascade shapes

7 degrees of separation of

MSN

Models Triangle closing model

Diminishing returns of

human adoption

Network community structure

Algorithms

(applications)

Kronecker graphs and

fitting

Cascade and outbreak detection

Web projections

Page 51: Dynamics of large networks

51

Future directions: Evolution

Why are networks the way they are? Health of a social network

Steer the network evolution Better design networked services

Predictive modeling of large communities Online massively multi-player games are closed

worlds with detailed traces of activity

Page 52: Dynamics of large networks

52

Future directions: Diffusion

Predictive models of information diffusion When, where and what post will create a cascade? Where should one tap the network to get the

effect they want? Social Media Marketing

How do news and information spread New ranking and influence measures for blogs Sentiment analysis from cascade structure

Page 53: Dynamics of large networks

53

What’s next?

Observations: Data analysis

Models: Predictions

Algorithms: Applications

Actively influencing the network

Page 54: Dynamics of large networks

54

Christos FaloutsosJon KleinbergEric Horvitz

Carlos GuestrinAndreas KrauseSusan Dumais

Andrew TomkinsRavi Kumar

Lada AdamicBernardo Huberman

Kevin LangMichael Manohey

Zoubin GharamaniAnirban Dasgupta

Tom MitchellJohn Lafferty

Larry WassermanAvrim Blum

Sam MaddenMichalis Faloutsos

Ajit SinghJohn Shawe-Taylor

Lars BackstromMarko Grobelnik

Natasa Milic-FraylingDunja Mladenic

Jeff HammerbacherMatt Hurst

Deepay ChakrabartiMary McGlohonNatalie Glance

Jeanne VanBriesen

Microsoft ResearchYahoo Research

FacebookEve online

Spinn3rLinkedIn

Nielsen BuzzMetricsMicrosoft Live Labs

Google

Thanks!

Everyone is invited toWean 4623 at 5pmThere will be food

Page 55: Dynamics of large networks

55

Thesis: The structure

Network Evolution

Network Cascades

Large Data

Observations

Densification and shrinking

diameterCascade shapes

7 degrees of separation of

MSN

Models Triangle closing model

Diminishing returns of

human adoption

Network community structure

Algorithms

(applications)

Kronecker graphs and

fitting

Cascade and outbreak detection

Web projections