CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu Today 6:00-7:30pm: SNAP.PY recitation, Nvidia Auditorium 9/26 4:15-5:45pm: Review of Probability, Gates B01


Page 1: CS224W: Social and Information Network Analysis Jure ...snap.stanford.edu/class/cs224w-2014/slides/02-gnp.pdf · Viral Marketing, Blogosphere, Memetracking . Scale-Free . Densification


9/25/2014 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

Properties

Small diameter, Edge clustering

Patterns of signed edge creation

Viral Marketing, Blogosphere, Memetracking

Scale-Free

Densification power law, Shrinking diameters

Strength of weak ties, Core-periphery

Models

Erdös-Renyi model, Small-world model

Structural balance, Theory of status

Independent cascade model, Game theoretic model

Preferential attachment, Copying model

Microscopic model of evolving networks

Kronecker Graphs

Algorithms

Decentralized search

Models for predicting edge signs

Influence maximization, Outbreak detection, LIM

PageRank, Hubs and authorities

Link prediction, Supervised random walks

Community detection: Girvan-Newman, Modularity

For example, last time we talked about observations and models for the Web graph:
1) We took a real system: the Web
2) We represented it as a directed graph
3) We used the language of graph theory: strongly connected components
4) We designed a computational experiment: find the In- and Out-components of a given node v
5) We learned something about the structure of the Web: BOWTIE!
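The computational experiment in step 4 can be sketched as two breadth-first searches, one along the edges and one against them. This is a minimal illustration only: the toy edge list and the helper names `reachable` and `in_out_components` are assumptions, not part of the lecture.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of nodes reachable from start by BFS over adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def in_out_components(edges, v):
    """Return (In(v), Out(v)): nodes that can reach v, and nodes reachable from v."""
    fwd, rev = {}, {}
    for a, b in edges:
        fwd.setdefault(a, []).append(b)
        rev.setdefault(b, []).append(a)
    return reachable(rev, v), reachable(fwd, v)

# Hypothetical toy graph: A -> B -> C -> A, C -> D, E -> A
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "A")]
in_v, out_v = in_out_components(edges, "A")
scc_of_v = in_v & out_v  # the strongly connected component of v is In(v) ∩ Out(v)
```

Intersecting the two sets gives the strongly connected component that contains v, which is the building block of the bowtie picture.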


Undirected graphs
Links: undirected (symmetrical, reciprocal relations)
Undirected links: collaborations, friendship on Facebook

Directed graphs
Links: directed (asymmetrical relations)
Directed links: phone calls, following on Twitter


$A_{ij} = 1$ if there is a link from node i to node j
$A_{ij} = 0$ otherwise

[Figure: two example graphs on nodes 1-4 (undirected, left; directed, right), each shown with its 4×4 adjacency matrix A]

Note that for a directed graph (right) the matrix is not symmetric.

Undirected
Node degree, $k_i$: the number of edges adjacent to node i

Directed
In directed networks we define an in-degree and an out-degree. The (total) degree of a node is the sum of its in- and out-degrees.
Example (node C in the figure): $k_C^{in} = 2$, $k_C^{out} = 1$, $k_C = 3$
Avg. degree: $\bar{k} = \frac{E}{N}$, with $\bar{k}^{in} = \bar{k}^{out}$
Source: node with $k^{in} = 0$
Sink: node with $k^{out} = 0$
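As a small illustration of the degree definitions for directed graphs, the sketch below computes in-degrees, out-degrees, sources and sinks from an edge list. The edge list and the helper name `degrees` are hypothetical and do not correspond to the graph in the slide figure.

```python
from collections import Counter

def degrees(edges):
    """Return (k_in, k_out) Counters for a directed edge list of (source, target) pairs."""
    k_out = Counter(src for src, _ in edges)
    k_in = Counter(dst for _, dst in edges)
    return k_in, k_out

# Hypothetical directed graph (not the one in the lecture figure)
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]
nodes = {n for e in edges for n in e}
k_in, k_out = degrees(edges)

total_C = k_in["C"] + k_out["C"]            # total degree = in-degree + out-degree
avg_in = sum(k_in.values()) / len(nodes)    # every edge adds one in-degree ...
avg_out = sum(k_out.values()) / len(nodes)  # ... and one out-degree, so both equal E/N
sources = sorted(n for n in nodes if k_in[n] == 0)
sinks = sorted(n for n in nodes if k_out[n] == 0)
```

Because each edge contributes exactly one in-degree and one out-degree, the two averages always coincide.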

The maximum number of edges in an undirected graph on N nodes is
$E_{max} = \binom{N}{2} = \frac{N(N-1)}{2}$

An undirected graph with the number of edges $E = E_{max}$ is called a complete graph, and its average degree is N-1

Most real-world networks are sparse: $E \ll E_{max}$ (or $k \ll N-1$)

Network                      N             ⟨k⟩
WWW (Stanford-Berkeley)      319,717       9.65
Social networks (LinkedIn)   6,946,668     8.87
Communication (MSN IM)       242,720,596   11.1
Coauthorships (DBLP)         317,080       6.62
Internet (AS-Skitter)        1,719,037     14.91
Roads (California)           1,957,027     2.82
Proteins (S. Cerevisiae)     1,870         2.39
(Source: Leskovec et al., Internet Mathematics, 2009)

Consequence: Adjacency matrix is filled with zeros!
(Density of the matrix ($E/N^2$): WWW = $1.51 \times 10^{-5}$, MSN IM = $2.27 \times 10^{-8}$)
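The quoted densities can be approximately reproduced from N and ⟨k⟩ alone. The sketch below assumes an undirected graph, so that E = ⟨k⟩·N/2; the function name `matrix_density` is illustrative. Because ⟨k⟩ is rounded in the table, the last digit may differ slightly from the quoted values.

```python
def matrix_density(n_nodes, avg_degree):
    """Density E / N^2 of an undirected graph, using E = <k> * N / 2."""
    n_edges = avg_degree * n_nodes / 2
    return n_edges / n_nodes ** 2

www = matrix_density(319_717, 9.65)
msn = matrix_density(242_720_596, 11.1)
print(f"WWW: {www:.2e}, MSN IM: {msn:.2e}")
```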

Unweighted (undirected)
$A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$
Examples: Friendship, Hyperlink

Weighted (undirected)
$A_{ij} = \begin{pmatrix} 0 & 2 & 0.5 & 0 \\ 2 & 0 & 1 & 4 \\ 0.5 & 1 & 0 & 0 \\ 0 & 4 & 0 & 0 \end{pmatrix}$
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$
Examples: Collaboration, Internet, Roads

Self-edges (self-loops) (undirected)
$A_{ij} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}$
$A_{ii} \neq 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1, i \neq j}^{N} A_{ij} + \sum_{i=1}^{N} A_{ii}$
Examples: Proteins, Hyperlinks

Multigraph (undirected)
$A_{ij} = \begin{pmatrix} 0 & 2 & 1 & 0 \\ 2 & 0 & 1 & 3 \\ 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix}$
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$
Examples: Communication, Collaboration
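The edge-count formulas for these matrix variants can be checked on the 4-node examples. The sketch below is one possible reading of them: `count_edges` is a hypothetical helper that counts nonzero off-diagonal entries once per pair and, when asked, adds the diagonal for self-loops (for a 0/1 matrix this matches the half-sum of the entries).

```python
def count_edges(A, self_loops=False):
    """Count undirected edges from a symmetric adjacency matrix (list of lists).

    Off-diagonal nonzero entries are counted once per pair (hence the 1/2);
    with self_loops=True the diagonal entries are added on top.
    """
    n = len(A)
    off_diag = sum(1 for i in range(n) for j in range(n) if i != j and A[i][j] != 0)
    diag = sum(A[i][i] for i in range(n)) if self_loops else 0
    return off_diag // 2 + diag

weighted = [[0, 2, 0.5, 0], [2, 0, 1, 4], [0.5, 1, 0, 0], [0, 4, 0, 0]]
with_loops = [[1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]

E_weighted = count_edges(weighted)                   # nonzero() formula
E_loops = count_edges(with_loops, self_loops=True)   # off-diagonal half-sum plus diagonal
avg_degree = 2 * E_weighted / len(weighted)
```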

WWW > directed multigraph with self-edges
Facebook friendships > undirected, unweighted
Citation networks > unweighted, directed, acyclic
Collaboration networks > undirected multigraph or weighted graph
Mobile phone calls > directed, (weighted?) multigraph
Protein interactions > undirected, unweighted with self-interactions

A bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets

Examples:
Authors-to-papers (they authored)
Actors-to-movies (they appeared in)
Users-to-movies (they rated)

“Folded” networks:
Author collaboration networks
Movie co-rating networks

[Figure: a bipartite graph with node sets U and V, and the folded version of that graph on nodes A-E]
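Folding can be sketched as follows: two nodes on the U side become linked whenever they share a neighbor on the V side. The author and paper labels below are invented for illustration, and `fold` is a hypothetical helper name.

```python
from itertools import combinations

def fold(bipartite_edges):
    """Project a bipartite graph of (u, v) edges onto its U side.

    Two U-nodes are linked in the folded graph if they share at least one V-neighbor.
    """
    members = {}  # v -> set of U-nodes attached to it
    for u, v in bipartite_edges:
        members.setdefault(v, set()).add(u)
    folded = set()
    for us in members.values():
        for a, b in combinations(sorted(us), 2):
            folded.add((a, b))
    return folded

# Hypothetical authors-to-papers data (labels are illustrative only)
edges = [("A", "p1"), ("B", "p1"), ("B", "p2"), ("C", "p2"), ("D", "p2"), ("E", "p3")]
collab = fold(edges)
```

Author E wrote a paper alone, so E ends up isolated in the folded collaboration network.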

Degree distribution P(k): probability that a randomly chosen node has degree k
$N_k$ = # nodes with degree k
Normalized histogram: $P(k) = N_k / N$ ➔ plot

[Figure: histogram of P(k) versus k, for k = 1 to 4]
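A minimal sketch of the normalized histogram, assuming an undirected edge list as input. The 5-node example graph and the function name `degree_distribution` are hypothetical.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected edge list, with P(k) = N_k / N."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_k = Counter(degree.values())  # N_k: number of nodes with degree k
    n = len(degree)                 # N: only nodes that appear in at least one edge
    return {k: count / n for k, count in sorted(n_k.items())}

# Hypothetical 5-node graph
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5)]
P = degree_distribution(edges)
```

Note that this sketch only sees nodes that occur in some edge; isolated nodes (degree 0) would have to be supplied separately.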

A path is a sequence of nodes in which each node is linked to the next one
$P_n = \{i_0, i_1, i_2, \dots, i_n\}$
$P_n = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \dots, (i_{n-1}, i_n)\}$

A path can intersect itself and pass through the same edge multiple times
E.g.: ACBDCDEG
In a directed graph a path can only follow the direction of the “arrow”

[Figure: example graph on nodes A-H]

Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes
*If the two nodes are disconnected, the distance is usually defined as infinite
Example (undirected graph in the figure): $h_{B,D} = 2$

In directed graphs paths need to follow the direction of the arrows
Consequence: distance is not symmetric: $h_{A,C} \neq h_{C,A}$
Example (directed graph in the figure): $h_{B,C} = 1$, $h_{C,B} = 2$

Diameter: the maximum (shortest path) distance between any pair of nodes in a graph

Average path length for a connected graph (component) or a strongly connected (component of a) directed graph:
$\bar{h} = \frac{1}{2 E_{max}} \sum_{i, j \neq i} h_{ij}$
where $h_{ij}$ is the distance from node i to node j

Many times we compute the average only over the connected pairs of nodes (that is, we ignore “infinite” length paths)
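Distance, diameter and average path length can all be obtained from repeated breadth-first search. The sketch below averages over ordered, connected pairs, which matches the $1/(2E_{max})$ normalization when every pair is connected. The small directed graph and the helper names are assumptions made for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_stats(adj):
    """Diameter and average path length over ordered, connected pairs (i != j)."""
    lengths = [d for s in adj for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical directed graph: the cycle A -> B -> C -> D -> A plus a shortcut B -> D
adj = {"A": ["B"], "B": ["C", "D"], "C": ["D"], "D": ["A"]}
h_BC = bfs_distances(adj, "B")["C"]
h_CB = bfs_distances(adj, "C")["B"]
diameter, avg_h = path_stats(adj)
```

In this toy graph the distance from B to C differs from the distance from C to B, which is the asymmetry that directed paths introduce.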

Clustering coefficient: what portion of i’s neighbors are connected?
Node i with degree $k_i$:
$C_i = \frac{2 e_i}{k_i (k_i - 1)}$, $C_i \in [0,1]$
where $e_i$ is the number of edges between the neighbors of node i

[Figure: three example neighborhoods of node i, with $C_i = 0$, $C_i = 1/3$ and $C_i = 1$]

Average clustering coefficient:
$C = \frac{1}{N} \sum_{i}^{N} C_i$

Example (graph on nodes A-H in the figure):
$k_B = 2$, $e_B = 1$, $C_B = 2/2 = 1$
$k_D = 4$, $e_D = 2$, $C_D = 4/12 = 1/3$
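The same arithmetic can be sketched in code. The graph below is a hypothetical one (the edges of the lecture figure are not recoverable from the text), chosen so that one node has k=4, e=2 and another has k=2, e=1. Returning 0 for nodes with fewer than two neighbors is an assumption of this sketch.

```python
from itertools import combinations

def clustering(adj, i):
    """C_i = 2 e_i / (k_i (k_i - 1)); taken to be 0 when k_i < 2 (an assumption)."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # e_i: number of neighbor pairs that are themselves connected
    e = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * e / (k * (k - 1))

# Hypothetical undirected graph stored as neighbor sets
adj = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"A", "D"},
}
C_A = clustering(adj, "A")   # k=4, e=2 (B-C and D-E) -> 4/12
C_B = clustering(adj, "B")   # k=2, e=1 (A-C)         -> 2/2
C_avg = sum(clustering(adj, v) for v in adj) / len(adj)
```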


Degree distribution: P(k)

Path length: h

Clustering coefficient: C


MSN Messenger activity in June 2006:
245 million users logged in
180 million users engaged in conversations
More than 30 billion conversations
More than 255 billion exchanged messages

Network: 180M people, 1.3B edges

Messaging as an undirected graph
• Edge (u,v) if users u and v exchanged at least 1 msg
• N=180 million people
• E=1.3 billion edges

Note: We plotted the same data as on the previous slide, just the axes are now logarithmic.

$C_k$: average $C_i$ of nodes i of degree k:
$C_k = \frac{1}{N_k} \sum_{i: k_i = k} C_i$

Avg. clustering of the MSN: C = 0.1140

Number of links between pairs of nodes
Avg. path length 6.6
90% of the nodes can be reached in < 8 hops

# nodes as we do BFS out of a random node:

Steps   #Nodes
0       1
1       10
2       78
3       3,96
4       8,648
5       3,299,252
6       28,395,849
7       79,059,497
8       52,995,778
9       10,321,008
10      1,955,007
11      518,410
12      149,945
13      44,616
14      13,740
15      4,476
16      1,542
17      536
18      167
19      71
20      29
21      16
22      10
23      3
24      2
25      3

Degree distribution: heavily skewed, avg. degree = 14.4
Path length: 6.6
Clustering coefficient: 0.11

Are these values “expected”? Are they “surprising”?
To answer this we need a null-model!

$P(k) = \delta(k-4)$, i.e. $k_i = 4$ for all nodes
$C = \frac{1}{N}\left(\frac{1}{2}(N-4) + 2 + 2 \cdot \frac{2}{3}\right) \to \frac{1}{2}$ as $N \to \infty$
Path length: $h_{max} = \frac{N-1}{2} = O(N)$
Avg. shortest path-length: $\bar{h} = O(N)$

So, we have: constant degree, constant avg. clustering coeff., linear avg. path-length

Note about calculations: we are interested in quantities as graphs get large (N→∞)
We will use big-O: f(x) = O(g(x)) as x→∞ if f(x) < g(x)*c for all x > x0 and some constant c.

$P(k) = \delta(k-6)$, i.e. k = 6 for each inside node
C = 6/15 for inside nodes
Path length: in general, for lattices, the average path-length is $\bar{h} \approx N^{1/D}$ (D… lattice dimensionality)

Constant degree, constant clustering coefficient

What did we learn so far?
The MSN network is neither a chain nor a grid

Erdős-Rényi Random Graphs [Erdős-Rényi, ‘60]
Two variants:
$G_{n,p}$: undirected graph on n nodes where each edge (u,v) appears i.i.d. with probability p
$G_{n,m}$: undirected graph with n nodes and m edges picked uniformly at random

What kinds of networks does such a model produce?
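Both variants are easy to sample for small n. The sketch below uses only Python's standard library; the function names `gnp` and `gnm` and the `seed` parameter are illustrative choices, and enumerating all pairs is only practical for small graphs.

```python
import random
from itertools import combinations

def gnp(n, p, seed=None):
    """Sample G(n,p): each of the n(n-1)/2 possible edges appears independently with prob. p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, seed=None):
    """Sample G(n,m): m distinct edges chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(list(combinations(range(n), 2)), m)

g1 = gnp(10, 1 / 6, seed=1)
g2 = gnp(10, 1 / 6, seed=2)   # same n and p, generally a different realization
g3 = gnm(10, 7, seed=1)
```

In G(n,m) the number of edges is fixed, while in G(n,p) it is itself a random quantity.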

n and p do not uniquely determine the graph!
The graph is a result of a random process
We can have many different realizations given the same n and p

[Figure: several realizations of Gnp for n = 10, p = 1/6]

How likely is a graph on E edges?
P(E): the probability that a given Gnp generates a graph on exactly E edges:
$P(E) = \binom{E_{max}}{E} p^E (1-p)^{E_{max}-E}$
where $E_{max} = n(n-1)/2$ is the maximum possible number of edges in an undirected graph of n nodes

P(E) is exactly the Binomial distribution: the number of successes in a sequence of $E_{max}$ independent yes/no experiments
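For a concrete feel, the binomial formula for P(E) can be evaluated directly. The sketch below uses n = 10 and p = 1/6 as example values; the function name `prob_num_edges` is hypothetical.

```python
from math import comb

def prob_num_edges(n, p, E):
    """P(E) = C(Emax, E) p^E (1-p)^(Emax-E) with Emax = n(n-1)/2."""
    e_max = n * (n - 1) // 2
    return comb(e_max, E) * p**E * (1 - p) ** (e_max - E)

n, p = 10, 1 / 6
e_max = n * (n - 1) // 2                                  # 45 possible edges
pmf = [prob_num_edges(n, p, E) for E in range(e_max + 1)]
expected_edges = sum(E * q for E, q in enumerate(pmf))    # binomial mean p * Emax
```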

What is the expected degree of a node?
Let $X_v$ be a rnd. var. measuring the degree of node v
We want to know: $E[X_v] = \sum_{j=0}^{n-1} j \, P(X_v = j)$

For the calculation we will need: linearity of expectation
For any random variables $Y_1, Y_2, \dots, Y_k$: if $Y = Y_1 + Y_2 + \dots + Y_k$, then $E[Y] = \sum_i E[Y_i]$

An easier way: decompose $X_v$ into $X_v = X_{v,1} + X_{v,2} + \dots + X_{v,n-1}$
where $X_{v,u}$ is a {0,1}-random variable which tells if edge (v,u) exists or not

How to think about this?
• Prob. of node u linking to node v is p
• u can link (flips a coin) to all other (n-1) nodes
• Thus, the expected degree of node u is: p(n-1)

Degree distribution: P(k)
Path length: h
Clustering coefficient: C

What are the values of these properties for Gnp?

Fact: Degree distribution of Gnp is Binomial.
Let P(k) denote the fraction of nodes with degree k:
$P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
• $\binom{n-1}{k}$: select k nodes out of n-1
• $p^k$: probability of having k edges
• $(1-p)^{n-1-k}$: probability of missing the rest of the n-1-k edges

Mean, variance of a binomial distribution: $\bar{k} = p(n-1)$

By the law of large numbers, as the network size increases, the distribution becomes increasingly narrow: we are increasingly confident that the degree of a node is in the vicinity of $\bar{k}$.

[Figure: P(k) versus k]
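The narrowing claim can be made concrete with the standard moments of a binomial distribution. The symbol $\sigma_k^2$ for the variance is notation introduced here for the worked step; it does not appear in the extracted slide text.

```latex
\bar{k} = p(n-1), \qquad
\sigma_k^2 = p(1-p)(n-1), \qquad
\frac{\sigma_k}{\bar{k}} = \sqrt{\frac{1-p}{p\,(n-1)}}
```

For fixed p, the relative width $\sigma_k / \bar{k}$ therefore decays like $(n-1)^{-1/2}$ as the graph grows.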

Remember: $C_i = \frac{2 e_i}{k_i (k_i - 1)}$
where $e_i$ is the number of edges between i’s neighbors

Edges in Gnp appear i.i.d. with prob. p
So, in expectation: $e_i = p \cdot \frac{k_i (k_i - 1)}{2}$
(each pair is connected with prob. p; $\frac{k_i (k_i - 1)}{2}$ is the number of distinct pairs of neighbors of node i of degree $k_i$)
Then: $C = \frac{p \cdot k_i (k_i - 1)}{k_i (k_i - 1)} = p = \frac{\bar{k}}{n-1} \approx \frac{\bar{k}}{n}$

Clustering coefficient of a random graph is small.
For a fixed avg. degree (that is, $p = \bar{k}/n$ with $\bar{k}$ constant), C decreases with the graph size n.

Degree distribution: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
Clustering coefficient: $C = p = \bar{k}/n$
Path length: next!

To prove the diameter of a Gnp we define a few concepts
Define: random k-regular graph
Assume each node has k spokes (half-edges)
Randomly pair them up!
k=1: graph is a set of pairs
k=2: graph is a set of cycles
k=3: arbitrarily complicated graphs

Graph G(V, E) has expansion α if ∀ S ⊆ V: # of edges leaving S ≥ α⋅min(|S|, |V\S|)

Or equivalently:
$\alpha = \min_{S \subseteq V} \frac{\#\,\text{edges leaving } S}{\min(|S|, |V \setminus S|)}$

[Figure: node set V split into S and V \ S]
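For tiny graphs the expansion can be computed by brute force over all node subsets. The sketch below does exactly that on two hypothetical 6-node graphs; the helper name `expansion` is an assumption, and the exponential running time makes this an illustration of the definition rather than a practical algorithm.

```python
from itertools import combinations

def expansion(nodes, edges):
    """Brute-force expansion: min over nonempty proper S of (#edges leaving S) / min(|S|, |V\\S|).

    Exponential in the number of nodes, so only meant for tiny illustrative graphs.
    """
    nodes = list(nodes)
    best = float("inf")
    # S and V\S have the same cut, so it suffices to try |S| <= |V|/2,
    # in which case min(|S|, |V\S|) is simply |S|.
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            s = set(subset)
            cut = sum(1 for u, v in edges if (u in s) != (v in s))
            best = min(best, cut / size)
    return best

# Hypothetical examples: a 6-node cycle and a 6-node path
cycle = [(i, (i + 1) % 6) for i in range(6)]
path = [(i, i + 1) for i in range(5)]
alpha_cycle = expansion(range(6), cycle)
alpha_path = expansion(range(6), path)
```

The path has lower expansion than the cycle because cutting a single middle edge already separates half of its nodes.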

[Figure: (a big) graph with “good” expansion: S nodes with α·S edges leaving them, and S’ nodes with α·S’ edges leaving them]

Expansion is a measure of robustness:
To disconnect l nodes, we need to cut ≥ α⋅l edges

[Figure: examples of a low-expansion graph and a high-expansion graph]

Social networks: “Communities”

k-regular graph (every node has degree k):
Expansion is at most k (when S is a single node)

Is there a graph on n nodes (n→∞), of fixed max deg. k, so that expansion α remains const?

Examples:
n×n grid: k=4: $\alpha = 2n/(n^2/4) \to 0$ (S = n/2 × n/2 square in the center)
Complete binary tree: α → 0 for |S| = (n/2)-1

Fact: For a random 3-regular graph on n nodes, there is some const α (α > 0, independent of n) such that w.h.p. the expansion of the graph is ≥ α

Fact: In a graph on n nodes with expansion α, for all pairs of nodes s and t there is a path of O((log n)/α) edges connecting them.

Proof:
Proof strategy: We want to show that from any node s there is a path of length O((log n)/α) to any other node t
Let $S_j$ be the set of all nodes found within j steps of BFS from s.
How does $S_j$ increase as a function of j?

[Figure: BFS from s with successive layers $S_0$, $S_1$, $S_2$]

Proof (continued):
Let $S_j$ be the set of all nodes found within j steps of BFS from s.
We want to relate $S_j$ and $S_{j+1}$:
$|S_{j+1}| \ge |S_j| + \frac{\alpha |S_j|}{k} = |S_j| \left(1 + \frac{\alpha}{k}\right)$
• Expansion: at least $\alpha |S_j|$ edges leave $S_j$
• Each node has degree k, so at most k edges “collide” at a node

[Figure: BFS layers $S_0$, $S_1$, $S_2$ around s; $|S_j|$ nodes, at least $\alpha |S_j|$ edges leaving them, $|S_{j+1}|$ nodes, each of degree k]

Proof (continued):
In how many steps of BFS do we reach > n/2 nodes?
Need j so that: $|S_j| \ge \left(1 + \frac{\alpha}{k}\right)^j \ge \frac{n}{2}$
Let’s set: $j = \frac{k \log_2 n}{\alpha}$
Then: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \ge 2^{\log_2 n} = n > \frac{n}{2}$

Claim: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \ge 2^{\log_2 n}$
Remember n > 0, α ≤ k, then:
if α = k: $(1+1)^{\frac{1}{1} \log_2 n} = 2^{\log_2 n}$
if α → 0 then $\frac{k}{\alpha} = x \to \infty$ and: $\left(1 + \frac{1}{x}\right)^{x \log_2 n} = e^{\log_2 n} > 2^{\log_2 n}$
(using $e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x$)

In 2k/α·log n steps |Sj| grows to Θ(n).
So, the diameter of G is O(log(n)/α)

[Figure: BFS from s and BFS from t. In j steps (that is, in O(log n) steps) each search reaches > n/2 nodes, so the two searches must meet ⇒ Diameter = 2·j ⇒ Diameter = 2 log(n)]
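The choice of j can be sanity-checked numerically. The sketch below iterates the guaranteed growth factor $(1 + \alpha/k)$ until the lower bound on the BFS set size passes n/2, and compares the number of steps with $k \log_2 n / \alpha$. It works with the bound only, not with an actual BFS on a graph, and the values of n, k and α are arbitrary illustrative choices with α ≤ k.

```python
from math import log2, ceil

def steps_to_half(n, k, alpha):
    """Smallest j with (1 + alpha/k)^j >= n/2, found by direct iteration."""
    size, j = 1.0, 0
    while size < n / 2:
        size *= 1 + alpha / k
        j += 1
    return j

n, k = 1_000_000, 3
for alpha in (3.0, 1.0, 0.1):               # illustrative values with alpha <= k
    j_needed = steps_to_half(n, k, alpha)
    j_bound = ceil(k * log2(n) / alpha)     # the choice of j used in the proof
    assert j_needed <= j_bound
    print(alpha, j_needed, j_bound)
```

In each case the number of steps actually needed by the bound stays below the j chosen in the proof, and both scale like (log n)/α.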

Degree distribution: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
Path length: O(log n)
Clustering coefficient: $C = p = \bar{k}/n$

                          MSN      Gnp
Degree distribution:      [plot]   [plot]
Path length:              6.6      O(log n) ≈ 8.2
Clustering coefficient:   0.11     $\bar{k}/n \approx 8 \cdot 10^{-8}$

Are real networks like random graphs?
Giant connected component:
Average path length:
Clustering coefficient:
Degree distribution:

Problems with the random networks model:
Degree distribution differs from that of real networks
Giant component in most real networks does NOT emerge through a phase transition
No local structure – clustering coefficient is too low

Most important: Are real networks random?
The answer is simply: NO!

If Gnp is wrong, why did we spend time on it?
It is the reference model for the rest of the class.
It will help us calculate many quantities that can then be compared to the real data
It will help us understand to what degree a particular property is the result of some random process

So, while Gnp is WRONG, it will turn out to be extremely USEFUL!

What happens to Gnp when we vary p?

Remember, expected degree: $E[X_v] = (n-1)p$
We want $E[X_v]$ to be independent of n
So let: p = c/(n-1)
Observation: If we build random graph Gnp with p = c/(n-1) we have many isolated nodes
Why?
$P[v \text{ has degree } 0] = (1-p)^{n-1} = \left(1 - \frac{c}{n-1}\right)^{n-1} \to e^{-c}$ as $n \to \infty$

Use the substitution $\frac{1}{x} = \frac{c}{n-1}$:
$\lim_{n \to \infty} \left(1 - \frac{c}{n-1}\right)^{n-1} = \lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{x \cdot c} = \left[\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{x}\right]^{c} = e^{-c}$
By definition: $e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}$, and likewise $\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{x} = e^{-1}$

How big do we have to make p before we are likely to have no isolated nodes?

We know: $P[v \text{ has degree } 0] = e^{-c}$

Event we are asking about is: I = some node is isolated
$I = \bigcup_{v \in N} I_v$, where $I_v$ is the event that v is isolated

We have:
$P(I) = P\left(\bigcup_{v \in N} I_v\right) \le \sum_{v \in N} P(I_v) = n e^{-c}$

Union bound: $\left|\bigcup_i A_i\right| \le \sum_i |A_i|$

We just learned: $P(I) \le n e^{-c}$

Let’s try:
c = ln n, then: $n e^{-c} = n e^{-\ln n} = n \cdot 1/n = 1$
c = 2 ln n, then: $n e^{-2 \ln n} = n \cdot 1/n^2 = 1/n$

So if:
p = ln n/(n-1), then: P(I) ≤ 1
p = 2 ln n/(n-1), then: P(I) ≤ 1/n → 0 as n→∞
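Both steps of this argument can be checked numerically: the exact probability that a given node is isolated approaches $e^{-c}$, and the union bound $n e^{-c}$ drops from 1 to 1/n when c goes from ln n to 2 ln n. The function names below are illustrative, and p = c/(n-1) is assumed throughout.

```python
from math import exp, log

def p_isolated(n, c):
    """Exact P[v has degree 0] = (1 - p)^(n-1) in G(n,p) with p = c/(n-1)."""
    return (1 - c / (n - 1)) ** (n - 1)

def union_bound(n, c):
    """Upper bound n * e^(-c) on the probability that some node is isolated."""
    return n * exp(-c)

n = 100_000
exact = p_isolated(n, 2.0)            # close to e^-2 for large n
bound_1 = union_bound(n, log(n))      # c = ln n   -> bound is 1
bound_2 = union_bound(n, 2 * log(n))  # c = 2 ln n -> bound is 1/n
```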

Graph structure of Gnp as p changes:

p = 0: empty graph
p = 1/(n-1): giant component appears
p = c/(n-1): avg. deg. const., lots of isolated nodes
p = log(n)/(n-1): fewer isolated nodes
p = 2*log(n)/(n-1): no isolated nodes
p = 1: complete graph

Emergence of a giant component: avg. degree k = 2E/n or p = k/(n-1)
k = 1-ε: all components are of size O(log n)
k = 1+ε: 1 component of size Ω(n), others have size O(log n)

[Figure: fraction of nodes in the largest component of Gnp, n = 100k, as p(n-1) varies from 0.5 to 3]
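A small simulation along the lines of this experiment can be sketched with a union-find structure. Two assumptions are made for speed and are not from the lecture: the graph is smaller (n = 20,000), and instead of flipping a coin for every node pair the sketch draws about $\bar{k} n / 2$ random pairs, a G(n,m)-style shortcut that may occasionally repeat a pair or pick a self-loop.

```python
import random

def largest_component_fraction(n, avg_degree, seed=0):
    """Fraction of nodes in the largest component of a random graph with ~avg_degree*n/2 edges.

    Assumption: samples m = avg_degree*n/2 random node pairs (a G(n,m)-style shortcut)
    instead of testing all n(n-1)/2 pairs with probability p.
    """
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for _ in range(int(avg_degree * n / 2)):
        u, v = rng.randrange(n), rng.randrange(n)
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge the two components

    sizes = {}
    for x in range(n):
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values()) / n

for k in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(k, round(largest_component_fraction(20_000, k), 3))
```

The printed fractions should stay near zero below average degree 1 and grow quickly above it, which is the emergence of the giant component.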