random graph modelssrijan kumar, georgia tech, cse6240 spring 2020: web search and text mining 3...
TRANSCRIPT
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 1
CSE 6240: Web Search and Text Mining. Spring 2020
Random Graph Models
Prof. Srijan Kumar
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 2
Today’s Lecture: Networks• Networks introduction • Web as a network • Networks properties• Random graph model: Erdos-Renyi Random Graph Model• Random graph model: Small-world Random Graph Model
Some slides are inspired by Prof. Jure Leskovec’s slides
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3
Simplest Model of Graphs¡ Erdös-Renyi Random Graphs [Erdös-Renyi, 1960]• Two variants:
– Gn,p: undirected graph on n nodes and each edge (u,v) appears i.i.d. with probability p
– Gn,m: undirected graph with n nodes and m edges, where edges are picked uniformly at random
• What kind of networks do such models produce?
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 4
Random Graph Models: Intuition
• n and p do not uniquely determine the graph!– The graph is a result of a random process
• We can have many different realizations given the same n and p
n = 10 p= 1/6
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 5
Random Graph Model: Edges
• How likely is a graph on E edges?• P(E): the probability that a given Gnp generates a graph on
exactly E edges:
where Emax=n(n-1)/2 is the maximum possible number of edges in an undirected graph of n nodes
• P(E) is a Binomial distribution: NumberofsuccessesinasequenceofEmax independentyes/noexperiments
EEE ppEE
EP --÷÷ø
öççè
æ= max)1()( max
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 6
Node Degrees in a Random Graph
• What is expected degree of a node?
• Probability of node u linking to node v is p• u can link (flips a coin) to all other (n-1) nodes• Thus, the expected degree of node u is: p(n-1)
E[Xv ]= E[Xvu ]= (n−1)pu=1
n−1
∑
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
Key Network Properties
7
• Degree distribution: P(k)
• Clustering coefficient: C
• Path length: h
What are the values of these properties for Gnp?
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 8
Degree Distribution• Degree distribution of Gnp is binomial• Let P(k) denote the fraction of nodes with degree k:
• Mean and variance of a binomial distribution
knk ppkn
kP ---÷÷ø
öççè
æ -= 1)1(
1)(
Select k nodes out of n-1
Probability of having k edges
Probability of missing the rest of the n-1-k edges
σ 2 = p(1− p)(n−1)
)1( -= npk
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 9
Degree Distribution• As the network size increases, the distribution becomes
increasingly narrow—we are increasingly confident that the degree of a node is in the vicinity of k.
σk=1− pp
1(n−1)
"
#$
%
&'
1/2
≈1
(n−1)1/2
P(k)
k
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 10
Clustering Coefficient of Gnp
• Clustering coefficient – Where ei is the number of edges between i’s neighbors
• So,
• Clustering coefficient of a random graph is small– Bigger graphs with the same average degree k have lower
clustering coefficient
nk
nkp
kkkkpCii
ii »-
==--×
=1)1(
)1(
)1(2-
=ii
ii kk
eC
ei = pki (ki −1)2
Number of distinct pairs of neighbors of node i of degree ki
Each pair is connected with prob. p
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
Key Network Properties
11
• Degree distribution:
• Clustering coefficient: C=p=k/n
• Path length: h
What are the values of these properties for Gnp?
knk ppkn
kP ---÷÷ø
öççè
æ -= 1)1(
1)(
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 12
Average Shortest Path• Average path length = O(log n)• Erdös-Renyi networks can grow to be very large but nodes
will be just a few hops apart
0 200000 400000 600000 800000 1000000
05
1015
20
num nodes
aver
age
shor
test
pat
h
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
MSN Network Properties vs. Gnp Properties
13
Degree distribution:
Path length: 6.6 O(log n) ~ 8.2
Clustering coefficient: 0.11 k / n ≈ 8·10-8
MSN Gnp
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 14
Clustering Implies Edge Locality• MSN network has 7 orders of magnitude larger clustering
than the corresponding Gnp!• Other examples:
– Actor Collaborations (IMDB): N = 225,226 nodes, avg. degree k = 61– Electrical power grid: N = 4,941 nodes, k = 2.67– Network of neurons: N = 282 nodes, k = 14
Network hactual hrandom Cactual Crandom
Film actors 3.65 2.99 0.00027
Power Grid 18.70 12.40 0.005
C. elegans 2.65 2.25 0.05
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 15
Gnp Simulation Experiment: Giant Component
• n=100,000, k=p(n-1) = 0.5 … 3• Emergence of a giant component: average degree k=2E/n
or p=k/(n-1)– When k=1-ε: all
components are of size Ω(log n)
– k=1+ε: 1 component of size Ω(n), others have size Ω(log n)
Fraction of nodes in the largest component
p*(n-1)=1
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 16
Real Networks vs. Gnp
• Are real networks like random graphs?– Giant connected component: YES – Average path length: YES – Clustering Coefficient: NO– Degree Distribution: NO
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 17
Real Networks vs. Gnp
• Problems with the random networks model:– Degree distribution differs from that of real networks– Giant component in most real networks does NOT emerge
through a phase transition– No local structure – clustering coefficient is too low
• Most important: Are real networks random?– The answer is simply: NO!
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 18
Real Networks vs. Gnp
• If Gnp is wrong, why did we spend time on it?– It is the reference model for the rest of the class. – It will help us calculate many quantities, that can then be
compared to the real data– It will help us understand to what degree is a particular
property the result of some random process• While Gnp is not realistic, it is useful
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 19
Problem with the ER Model• Gnp model has short paths: O(log n)
– This is the smallest diameter we can get if we have a constant degree.
– But clustering is low!• But real networks have “local” structure
– Triadic closure: Friend of a friend is my friend– High clustering but diameter is also high
• Can we generate graphs with high clustering coefficient while having short paths (low diameter)?
• Solution: Small-World Model
Low diameterLow clustering coefficient
High clustering coefficientHigh diameter
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 20
Today’s Lecture: Networks• Networks introduction • Web as a network • Networks properties• Random graph model: Erdos-Renyi Random Graph Model• Random graph model: Small-world Random Graph Model
Some slides are inspired by Prof. Jure Leskovec’s slides
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 21
Six Degrees of Kevin Bacon
Origins of a small-world idea:• The Bacon number:
– Create a network of Hollywood actors– Connect two actors if they co-appeared in
the movie– Bacon number: number of steps to Kevin
Bacon• As of Dec 2007, the highest Bacon
number reported is 8• Only approx. 12% of all actors cannot be
linked to Bacon
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 22
Erdos Number
• Erdos Number: number of hops in scientific co-author graph to reach Paul Erdos
• Srijan’ Erdos number is 4.
• Find out your Erdos number: http://www.ams.org/mathscinet/collaborationDistance.html
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 23
The Small-World Experiment• What is the typical shortest path length
between any two people?– Experiment on the global friendship network
• Can’t measure, need to probe explicitly • Small-world experiment [Milgram ’67]
– Picked 300 people in Omaha, Nebraska and Wichita, Kansas
– Ask them to get a letter to a stock-broker in Boston by passing it through friends only
• How many steps do you think it took?
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 24
The Small-World Experiment• 64 chains completed (letters reached)
– It took 6.2 steps on the average, thus “6 degrees of separation”
• Further observations:– People who owned stock
had shorter paths to the stockbroker than random people: 5.4 vs. 6.7
– People from the Boston area have even closer paths: 4.4
• On average, you are 6 hops away from anyone in the world!
Milgram’s small world experiment
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 25
The Small-World Experiment #2: Columbia• In 2003 Dodds, Muhamad and Watts performed similar
experiments using e-mail:– 18 targets of various backgrounds– 24,000 first steps (~1,500 per target)– 65% dropout per step– 384 chains completed (1.5%)
Average chain length = 4.01Problem: People stop participating
Path length, h
n(h)
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 26
6-Degrees: Should We Be Surprised?• Assume each human is connected to 100 other people,
then:– Step 1: reach 100 people– Step 2: reach 100*100 = 10,000 people– Step 3: reach 100*100*100 = 1,000,000 people– Step 4: reach 100*100*100*100 = 100M people– In 5 steps we can reach 10 billion people
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 27
Small-World: How?• Could a network with high clustering be at the same
time a small world?– How can we at the same time have high clustering and small
diameter?
• Intuition:– Clustering implies edge “locality”– Randomness enables “shortcuts”
High clusteringHigh diameter
Low clusteringLow diameter
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 28
The Small-World Model [Watts-Strogatz ‘98]
Two components to the model:• (1) Start with a low-dimensional regular lattice
– In our case, we are using a ring as a lattice– Has high clustering coefficient, but has high diameter
• (2) Rewire: introduce randomness (“shortcuts”)– Add/remove edges to create shortcuts to join remote parts
of the lattice– For each edge with probability p move
the other end to a random node– Reduces the diameter by adding shortcuts
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 29
Tuning Randomness in Small-World Model• Rewiring allows us to “interpolate” between a regular lattice
and a random graph
High clusteringHigh diameter
High clusteringLow diameter
Low clusteringLow diameter
43
2== C
kNh
NkCNh ==
loglog
aC = ½
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 30
Randomness vs Clustering• Intuition: It takes a lot of randomness to ruin the clustering,
but a very small amount to create shortcuts.
Clustering coefficientC = 1/n ∑ Ci
Parameter region of high clustering and low path length
Probability of rewiring, p
Clustering Coefficient
Average path length
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 31
• Alternative formulation of the model:1. Start with a square grid2. Add 1 random long-range edge per node
• Each node has 1 spoke. Then randomly connect them.
• Each node has 8 + 1 = 9 edges• Each node’s neighbors have 12 edges • Clustering
• Diameter: O(log(n)) – Why?
Ci =2 ⋅ei
ki (ki −1)=2 ⋅129 ⋅8
≥ 0.33
Diameter of the Watts-Strogatz Model
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 32
Proof: O(log n) Diameter of Small World Model• Convert 2x2 subgraphs into supernodes:
– Each supernode has 4 long-range edges sticking out: a 4-regular random graph!• Ignore the edges between neighboring supernodes
– Recall Gnp: short paths between super nodesÞ Path in the original graph = add at most 2 steps
per long range edge (by traversing within supernodes)
Þ Diameter of the model is O(2 + log n) = O (log n)• Edges between neighboring supernodes: these edges
will reduce the diameter further. 4-regular randomgraph on supernodes
Super node
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 33
Small-World: Summary
• Could a network with high clustering be at the same time a small world?– Yes! You don’t need more than a few random links
• The Watts-Strogatz/Small-World Model:– Provides insight on the interplay between clustering and the
small-world – Captures the structure of many realistic networks– Accounts for the high clustering of real networks– Does not lead to the correct degree distribution
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 34
Today’s Lecture: Networks• Networks introduction • Web as a network • Networks properties• Random graph model: Erdos-Renyi Random Graph Model• Random graph model: Small-world Random Graph Model
Some slides are inspired by Prof. Jure Leskovec’s slides