jure leskovec (jure@cs.stanford.edu) computer science department cornell university / stanford...
Post on 28-Dec-2015
Size matters:
1) Cluster structure of large networks
2) Searching the world’s social network
Jure Leskovec (jure@cs.stanford.edu)
Computer Science Department
Cornell University / Stanford University
Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta
Rich data: Networks
Large on-line computing applications have detailed records of human activity:
▪ On-line communities: Facebook (120 million)
▪ Communication: Instant Messenger (~1 billion)
▪ News and social media: Blogging (250 million)
We model the data as a network (an interaction graph)
Can observe and study phenomena at scales not possible before
[Figure: Communication network]
Outline
The Small-world experiment:
▪ On a 240 million node communication network of Microsoft Instant Messenger
Small vs. large networks:
▪ Modeling community (cluster) structure of large networks
[Figures: Zachary’s karate club (N=34); tiny part of a large social network]
How expressed are communities?
How community-like is a set of nodes?
Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.
Conductance (normalized cut):
Φ(S) = # edges cut / # edges inside
Small Φ(S) corresponds to more community-like sets of nodes
[Figure: node sets S and S’]
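A minimal Python sketch of the score as it is written on the slide (cut edges divided by edges inside the set). The function name `community_score`, the toy graph, and the choice to return infinity for a set with no internal edges are illustrative assumptions, not part of the talk.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]

def community_score(edges: Iterable[Edge], nodes: Set[int]) -> float:
    """Score of node set S: (# edges cut) / (# edges inside), as on the slide."""
    cut = 0     # edges with exactly one endpoint in S
    inside = 0  # edges with both endpoints in S
    for u, v in edges:
        in_u, in_v = u in nodes, v in nodes
        if in_u and in_v:
            inside += 1
        elif in_u or in_v:
            cut += 1
    if inside == 0:
        # Assumption: a set with no internal edges gets the worst possible score.
        return float("inf")
    return cut / inside

# Hypothetical toy graph: two triangles joined by one bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
triangle_score = community_score(toy_edges, {0, 1, 2})  # 1 cut edge, 3 inside
mixed_score = community_score(toy_edges, {1, 2, 3})     # 4 cut edges, 2 inside
```

On this toy graph the triangle scores lower (better) than the set that straddles the bridge, which matches the reading that a small Φ(S) marks a more community-like set.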
Community score (quality)
Score: Φ(S) = # edges cut / # edges inside
What is the “best” community of 5 nodes?
Community score (quality)
Score: Φ(S) = # edges cut / # edges inside
Bad community: Φ=5/6 = 0.83
What is the “best” community of 5 nodes?
Community score (quality)
Score: Φ(S) = # edges cut / # edges inside
Bad community: Φ=5/7 = 0.7
Better community: Φ=2/5 = 0.4
What is the “best” community of 5 nodes?
Community score (quality)
Score: Φ(S) = # edges cut / # edges inside
Bad community: Φ=5/7 = 0.7
Better community: Φ=2/5 = 0.4
Best community: Φ=2/8 = 0.25
What is the “best” community of 5 nodes?
Network Community Profile Plot
We define: Network community profile (NCP) plot
Plot the score of the best community of size k
[Figure: log Φ(k) vs. community size, log k; Φ(5)=0.25 at k=5, Φ(7)=0.18 at k=7]
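To make the definition concrete, here is a brute-force Python sketch that computes Φ(k) as the minimum score over all node sets of size k. It enumerates every subset, so it is only feasible for tiny graphs; the talk itself relies on approximation algorithms for large networks. The names `score` and `ncp` and the toy graph are hypothetical.

```python
from itertools import combinations

def score(edges, nodes):
    """(# edges cut) / (# edges inside); infinity if the set has no internal edge."""
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    cut = sum(1 for u, v in edges if (u in nodes) != (v in nodes))
    return cut / inside if inside else float("inf")

def ncp(edges, max_k):
    """Return {k: best (minimum) score over all node sets of size k}."""
    all_nodes = sorted({x for edge in edges for x in edge})
    profile = {}
    for k in range(1, max_k + 1):
        profile[k] = min(score(edges, set(c)) for c in combinations(all_nodes, k))
    return profile

# Hypothetical toy graph: two triangles joined by one bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
profile = ncp(toy_edges, max_k=3)
```

Plotting `profile` on log-log axes would give the NCP plot for the toy graph; here the best set of size 3 is one of the triangles.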
NCP plot: Low-dimensional and random graphs
[Panels: d-dimensional meshes; hierarchically nested clusters]
NCP plot: Zachary’s karate club
Zachary’s university karate club social network
▪ During the study the club split into 2
▪ The split (squares vs. circles) corresponds to cut B
NCP plot: Network Science
Collaborations between scientists in Networks [Newman, 2005]
Present work: Large networks
Previous work mostly focused on community structure of small networks (~100 nodes)
We examined 108 different large networks
Example of a large network
Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)
More NCP plots of networks
NCP: LiveJournal (N=5M, E=42M)
[Figure: NCP plot; y-axis Φ(k) (conductance), x-axis k (community size)]
Better and better communities
Communities get worse and worse
Best community has ~100 nodes
Explanation: Downward part
Small clusters on the edge of the network are responsible for the downward part of the NCP plot
[Figure: NCP plot with the best cluster marked]
Explanation: Upward part
Each additional edge inside the cluster costs more:
Φ=1/3 = 0.33
Φ=2/4 = 0.5
Φ=8/6 = 1.3
Φ=64/14 = 4.6
Each node has twice as many children
[Figure: NCP plot]
Suggested network structure
Network structure: Core-periphery (jellyfish, octopus)
Whiskers are responsible for good communities
Denser and denser core of the network
Core contains ~60% nodes and ~80% edges
What is a good model?
What is a good model that explains such network structure?
▪ Pref. attachment: Flat
▪ Small World: Down and Flat
▪ Geometric Pref. Attachment: Flat and Down
Forest Fire model works
Forest Fire [LKF05]: connections spread like a fire
▪ New node joins the network
▪ Selects a seed node
▪ Connects to some of its neighbors
▪ Continue recursively
Notes:
• Preferential attachment flavor - second neighbor is not uniform at random.
• Copying flavor - since we burn the seed’s neighbors.
• Hierarchical flavor - seed is parent.
• “Local” flavor - burn “near” (in a diffusion sense) the seed vertex.
As community grows it blends into the core of the network
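The four steps above can be sketched in Python as follows. This is a simplified, undirected variant written for illustration, not the exact model of [LKF05]: each neighbor of a burning node catches fire independently with probability `burn_prob`, and the new node links to every burned node. The function name, the parameters, and the uniform choice of seed node are assumptions.

```python
import random
from collections import deque

def forest_fire_graph(n, burn_prob, seed=0):
    """Grow an undirected graph of n nodes with a simplified forest-fire process."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        # Assumption: the seed ("ambassador") node is chosen uniformly at random.
        ambassador = rng.randrange(new)
        burned = {ambassador}
        frontier = deque([ambassador])
        while frontier:
            node = frontier.popleft()
            # Spread the fire to each unburned neighbor with probability burn_prob.
            for nbr in sorted(adj[node]):
                if nbr not in burned and rng.random() < burn_prob:
                    burned.add(nbr)
                    frontier.append(nbr)
        # The new node connects to every node the fire reached.
        adj[new] = set()
        for node in burned:
            adj[new].add(node)
            adj[node].add(new)
    return adj

example = forest_fire_graph(n=200, burn_prob=0.3, seed=1)
```

With `burn_prob=0` the fire never spreads and the result is a random tree; larger values let new nodes attach deeper into the neighborhood of their seed.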
Forest Fire NCP plot
[Figure legend: rewired network]
Typical cluster size
How does the size of best cluster scale with the size of the network?
Size of best cluster over time
Cluster size remains constant (even if one allows nesting) over time
[Figure: LinkedIn network over time]
Cluster size vs. network size
Each dot is a different network
Connections
The Dunbar number
▪ 150 individuals is the maximum community size
What edges “mean” and community identification
▪ Using node and edge types/attributes
Implications for machine learning
▪ No large clusters
▪ No/little (assortative) hierarchical structure
▪ Can’t be well embedded – no underlying geometry
The small-world of the MSN Instant Messenger
Joint work with Eric Horvitz, Microsoft Research
The Small-world experiment
Milgram’s small world experiment
The Small-world experiment [Milgram ’67, Dodds-Muhamad-Watts ’03]
▪ People send letters from Nebraska to Boston
How many steps does it take?
▪ 6.2 on the average, thus “6 degrees of separation”
The Small-world experiment
1) Short paths exist in a social network
2) People are able to find them (using only partial knowledge of the network)
Local search: forwarding a message
[Figure: source s and target t, d(s,t)=h]
Good nodes: d=h-1
Bad nodes: d≥h
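A small Python sketch of the good/bad split, assuming an undirected graph stored as adjacency lists. A breadth-first search from the target t gives every node's distance to t; a neighbor of s is "good" if it is one hop closer than s. The names `bfs_distances` and `split_neighbors` and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def split_neighbors(adj, s, t):
    """Return (h, good, bad): h = d(s,t); good neighbors have d = h-1, bad have d >= h."""
    dist_to_t = bfs_distances(adj, t)
    h = dist_to_t[s]
    unreachable = float("inf")
    good = sorted(v for v in adj[s] if dist_to_t.get(v, unreachable) == h - 1)
    bad = sorted(v for v in adj[s] if dist_to_t.get(v, unreachable) >= h)
    return h, good, bad

# Hypothetical toy graph (undirected, as adjacency lists).
toy_adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 3], 3: [0, 2], 4: [1, 5], 5: [4]}
h, good, bad = split_neighbors(toy_adj, s=0, t=5)
# Chance that forwarding to a uniformly random neighbor of s picks a good node.
success_prob = len(good) / len(toy_adj[0])
```

Here only one of the three neighbors of s lies on a shortest path, so a random forward succeeds one time in three.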
Our dataset: Instant Messaging
[Screenshots: contact (buddy) list; messaging window]
MSN communication
We collected the data for June 2006
4.5 TB of compressed data:
▪ 245 million users logged in
▪ 180 million users engaged in conversations
▪ 255 billion exchanged messages
▪ 1 billion conversations / day
MSN network
The network: 180M nodes, 1.3B undirected edges
MSN: path lengths
MSN Messenger network
Number of steps between pairs of people
Avg. path length 6.6
90% of the people can be reached in < 8 hops

Hops  Nodes
0     1
1     10
2     78
3     3,96
4     8,648
5     3,299,252
6     28,395,849
7     79,059,497
8     52,995,778
9     10,321,008
10    1,955,007
11    518,410
12    149,945
13    44,616
14    13,740
15    4,476
16    1,542
17    536
18    167
19    71
20    29
21    16
22    10
23    3
24    2
25    3
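A table of hops versus node counts like the one above can be produced by a breadth-first search from a single start node. The Python sketch below shows that computation on a hypothetical toy graph, along with the mean hop count and the fraction of nodes reachable in fewer than a given number of hops. It measures distances from one source only, whereas the average on the slide is over pairs of people; all function names are assumptions.

```python
from collections import Counter, deque

def hop_histogram(adj, source):
    """Count how many nodes sit at each hop distance from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return Counter(dist.values())

def mean_hops(hist):
    """Average hop distance to the reached nodes, excluding the source itself."""
    reached = sum(count for hops, count in hist.items() if hops > 0)
    return sum(hops * count for hops, count in hist.items()) / reached

def fraction_below(hist, limit):
    """Fraction of reached nodes (excluding the source) at fewer than `limit` hops."""
    reached = sum(count for hops, count in hist.items() if hops > 0)
    close = sum(count for hops, count in hist.items() if 0 < hops < limit)
    return close / reached

# Hypothetical toy graph: a star around node 0 with a short tail 3-4.
toy_adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
hist = hop_histogram(toy_adj, source=1)
```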
Degree distribution:
A node that exchanged messages with ~2 million people
Robustness of shortest paths
Short paths exist and they are robust
[Figure legend: randomized network (same degree distr.); all links; both-way links]
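One common way to build a randomized network with the same degree distribution is repeated double-edge swaps: pick two edges (a,b) and (c,d) and replace them with (a,d) and (c,b). The Python sketch below shows that technique; whether the talk's randomization was done this way is an assumption, and the function names and the ring example are hypothetical.

```python
import random
from collections import Counter

def degree_sequence(edges):
    """Map each node to its degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def rewire_preserving_degrees(edges, num_swaps, seed=0):
    """Randomize an undirected simple graph by double-edge swaps; degrees stay fixed."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < num_swaps and attempts < 100 * num_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # the two edges share a node; skip to avoid self-loops
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # skip swaps that would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Hypothetical example: a ring of 12 nodes, rewired with up to 20 swaps.
ring = [(i, (i + 1) % 12) for i in range(12)]
rewired = rewire_preserving_degrees(ring, num_swaps=20)
```

Each swap removes one edge and adds one edge at every node it touches, so the degree of each node is unchanged while the wiring is shuffled.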
Learning to search in a network
What is the decision function that makes me forward the message to the target?
[Figure: source s and target t, d(s,t)=h]
Good nodes: d=h-1
Bad nodes: d≥h
What are the characteristics of shortest paths?
How hard is it to find them?
Does geography help?
Does geography help?
How hard is it to find a good node?
How hard is it to find a good node?
Probability of success if we forward to a random neighbor
Algorithm accuracy at hops
Algorithm accuracy at hops
Use a decision tree to learn a classifier:
Model: 0.4128
Random: 0.0207
The learned model
Green bar is prob. that node is good
Comparing search heuristics
Pick a pair of nodes: start at s
Walk until we hit the target t, where the next node is chosen by:

Search alg.   % found   Mean path length
Random        0.0008    3,709
MinGeoDist    0.0282    778
MaxDeg        0.0158    4,964
Deg/Geo2      0.1446    2,676
Cntry         0.0108    402
Cntry*Deg     0.1313    3,114
Lang          0.0055    1,699
Lang*Deg      0.0496    3,163
Age           0.0012    2,890
Age*Deg       0.0203    5,324

It works! (in a network with 180 million nodes)
-- Milgram’s path completion is 29%
-- Dodds, Muhamad, Watts: 0.015% comp
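The walk described above can be sketched in Python with the heuristic passed in as a function. Two of the table's rules are illustrated, Random and MaxDeg. The step limit, the visited-set preference, the shortcut of stepping straight to t when it is a direct neighbor, and all names are assumptions made for this sketch, not details taken from the talk.

```python
import random

def walk_to_target(adj, s, t, choose_next, max_steps, rng):
    """Walk from s until t is hit or max_steps is used up; return steps or None."""
    current, steps, visited = s, 0, {s}
    while current != t and steps < max_steps:
        current = choose_next(adj, current, t, visited, rng)
        visited.add(current)
        steps += 1
    return steps if current == t else None

def random_neighbor(adj, current, t, visited, rng):
    """Random heuristic: forward to a uniformly random neighbor."""
    return rng.choice(adj[current])

def max_degree_neighbor(adj, current, t, visited, rng):
    """MaxDeg heuristic: forward to the highest-degree neighbor."""
    if t in adj[current]:
        return t  # assumption: deliver directly when the target is a neighbor
    # Assumption: prefer neighbors not yet visited, to avoid bouncing between hubs.
    candidates = [v for v in adj[current] if v not in visited] or adj[current]
    return max(candidates, key=lambda v: len(adj[v]))

# Hypothetical toy graph (undirected, as adjacency lists).
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4, 5], 4: [2, 3], 5: [3]}
rng = random.Random(0)
maxdeg_steps = walk_to_target(toy_adj, 0, 5, max_degree_neighbor, 50, rng)
random_steps = walk_to_target(toy_adj, 0, 5, random_neighbor, 50, rng)
```

Running many (s, t) pairs through `walk_to_target` and recording how often the result is not `None`, together with the mean of the returned step counts, would give the two columns of a comparison like the table above.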
Conclusions and reflections
Why are networks the way they are?
Only recently have basic properties been observed on a large scale
▪ Confirms social science intuitions; calls others into question
Benefits of working with large data
▪ Observe structures not visible at smaller scales