Properties of network community structure

Jure Leskovec, CMU
Kevin Lang, Anirban Dasgupta, Michael Mahoney, Yahoo! Research


Post on 17-Dec-2015




Networks

• Big data
• Study emerging behaviors
• How are small networks different from large ones?


Network community structure

Communities (groups, clusters, modules): sets of nodes with many connections inside and few to the outside (the rest of the network)


Example 1: Biology

• Nodes represent proteins
• Edges represent interactions/associations
• Proteins with the same function interact more
• Can use the network to discover functional groups

Figure: Yeast transcriptional regulatory modules [Bar-Joseph et al., 2003]


Example 2: Social networks

Clusters correspond to social communities, organizational units (e.g., departments)

Zachary’s Karate club network
• During the study the club split into 2
• The split corresponds to the min-cut (● vs. ■)


Example 3: Web (blogs)

Democrat vs. Republican blogs [Adamic-Glance 2005]


Example 4: Science

Figure panels: Citations; Collaborations [Newman 2003]


Example 5: Hierarchies

Nested communities: the modular structure of networks is hierarchically organized

Figure (hierarchy, top to bottom):
• University
• Science, Arts
• CS, Math, Drama, Music


Example 6: Hierarchies

Recursive hierarchical network

(a) N=5, E=8
(b) N=25, E=56
(c) N=125, E=344
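
The three (N, E) pairs above are consistent with a recursive rule in which each level makes four copies of the previous network and wires the copies' peripheral nodes to the original center. The slide does not state the rule, so the sketch below is an assumption chosen because it reproduces those counts; `hierarchical_network` is a hypothetical helper name.

```python
def hierarchical_network(levels):
    """Return (node_count, edge_set) for the assumed recursive construction."""
    # Level 1 (assumed): center node 0 plus four peripheral nodes in a ring.
    edges = {(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (1, 4)}
    n = 5
    peripheral = [1, 2, 3, 4]
    for _ in range(levels - 1):
        grown = set(edges)
        new_peripheral = []
        for copy in range(1, 5):
            offset = copy * n
            # Replicate the previous level with shifted node ids.
            grown |= {(u + offset, v + offset) for u, v in edges}
            # Wire the replica's peripheral nodes to the original center.
            for p in peripheral:
                grown.add((0, p + offset))
                new_peripheral.append(p + offset)
        edges, n, peripheral = grown, n * 5, new_peripheral
    return n, edges


for level in (1, 2, 3):
    n, edges = hierarchical_network(level)
    print(level, n, len(edges))
```

Under this rule the edge count follows E(next) = 5·E + 4·(number of peripheral nodes), which gives 8, 56, and 344.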


Discovering communities

• Intuition: find nodes that can be easily separated from the rest of the network
• Various objective functions: min-cut, normalized cut, centrality, modularity
• Various algorithms: spectral clustering (random walks), Girvan-Newman (centrality), Metis (contraction based)

Girvan-Newman:
1) Betweenness centrality: number of shortest paths passing through an edge
2) Remove edges by decreasing centrality
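
The two Girvan-Newman steps can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes an unweighted, undirected graph stored as a dict of neighbor sets, recomputes betweenness after every removal (the usual form of the algorithm, which the slide's summary does not spell out), and stops at the first split. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every path was counted once from each endpoint.
    return {e: c / 2 for e, c in score.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph falls apart."""
    adj = {v: set(ns) for v, ns in adj.items()}
    while len(components(adj)) == 1:
        eb = edge_betweenness(adj)
        if not eb:
            break
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(girvan_newman_split(adj))
```

On this toy graph the bridge carries all nine cross-triangle shortest paths, so it is removed first and the two triangles separate.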


Our question: How well are communities expressed?


Statistical properties of community structure
• Instead of searching for communities, we measure how well expressed communities are

Questions
• What is the community structure of real-world networks?
• How to measure and quantify this?
• What does this tell us about network structure?
• What is a good model (intuition)?
• What are the consequences for clustering/partitioning algorithms?


Community score (quality)

• How community-like is a set of nodes?
• Need a natural, intuitive measure
• Conductance (normalized cut): Φ(S) = # edges cut / # edges inside
• Small Φ(S) corresponds to more community-like sets of nodes
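
A minimal sketch of the score as written on the slide, assuming an undirected graph stored as a dict of neighbor sets. Conductance as usually defined divides the cut by the smaller volume (sum of degrees) rather than by the number of internal edges; the code follows the slide's simplified form. `community_score` is a hypothetical name.

```python
def community_score(adj, S):
    """Slide's score: (# edges leaving S) / (# edges with both ends in S)."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    # Each internal edge is seen from both endpoints, hence the // 2.
    inside = sum(1 for u in S for v in adj[u] if v in S) // 2
    return cut / inside if inside else float("inf")


# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(community_score(adj, {0, 1, 2}))  # one cut edge over three internal edges
print(community_score(adj, {2, 3}))     # four cut edges over one internal edge
```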



Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

What is the “best” community of 5 nodes?


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

Bad community: Φ = 5/6 ≈ 0.83

What is the “best” community of 5 nodes?


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

Bad community: Φ = 5/7 ≈ 0.71
Better community: Φ = 2/5 = 0.4

What is the “best” community of 5 nodes?


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

Bad community: Φ = 5/7 ≈ 0.71
Better community: Φ = 2/5 = 0.4
Best community: Φ = 2/8 = 0.25

What is the “best” community of 5 nodes?


Network Community Profile Plot

We define the network community profile (NCP) plot: plot the score of the best community of size k.

• Search over all subsets of size k and find the best: Φ(k=5) = 0.25
• The NCP plot is intractable to compute
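
To make the definition concrete, here is a brute-force sketch that enumerates every subset of size k. It is only feasible on toy graphs, which is the intractability the slide refers to. It assumes the slide's score (cut edges over internal edges) and restricts k to at most half the nodes; both the restriction and the helper names are assumptions.

```python
from itertools import combinations


def community_score(adj, S):
    """Slide's score: (# edges leaving S) / (# edges with both ends in S)."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    inside = sum(1 for u in S for v in adj[u] if v in S) // 2
    return cut / inside if inside else float("inf")


def ncp_brute_force(adj):
    """Best (lowest) score over all node subsets of each size k."""
    nodes = sorted(adj)
    profile = {}
    for k in range(1, len(nodes) // 2 + 1):
        profile[k] = min(
            community_score(adj, set(S)) for S in combinations(nodes, k)
        )
    return profile


# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
profile = ncp_brute_force(adj)
print(profile)
```

The number of subsets of size k grows combinatorially with the number of nodes, so this loop is hopeless beyond a few dozen nodes.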


Network Community Profile Plot

We define the network community profile (NCP) plot: plot the score of the best community of size k.

Plot: x-axis community size, log k; y-axis community score, log Φ(k). Marked points: k=5, Φ(k)=0.25; k=7, Φ(k)=0.18


NCP: Example

Plot: x-axis community size, log k; y-axis community score, log Φ(k)


Scaling to large networks

Local spectral clustering algorithm:
• Pick a seed node
• Slowly diffuse mass around it (via a PageRank-like random walk)
• Find the bottleneck

Repeat many times:
• Many seed nodes for very local walks
• Fewer seed nodes for more global (longer) walks
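
The seed, diffuse, and bottleneck steps can be sketched as a personalized PageRank followed by a sweep cut. This is an illustrative stand-in rather than the authors' implementation: it uses plain power iteration, the standard conductance (cut over the smaller volume), and a graph with no isolated nodes. The teleport probability `alpha`, the iteration count, and both function names are assumptions.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Random walk that restarts at `seed` with probability alpha."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha
        for u, mass in p.items():
            share = (1 - alpha) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p


def sweep_cut(adj, p):
    """Order nodes by p[v]/deg(v); return the prefix with lowest conductance."""
    total_vol = sum(len(ns) for ns in adj.values())
    order = sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)
    S, vol, cut = set(), 0, 0
    best, best_phi = None, float("inf")
    for v in order[:-1]:            # never take the whole graph
        links_into_S = sum(1 for w in adj[v] if w in S)
        # Edges from v into S stop being cut; its other edges become cut.
        cut += len(adj[v]) - 2 * links_into_S
        vol += len(adj[v])
        S.add(v)
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best, best_phi


# Two triangles joined by the bridge 2-3; seed the walk in the left triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
best, phi = sweep_cut(adj, personalized_pagerank(adj, seed=0))
print(best, phi)
```

A larger `alpha` keeps the walk closer to the seed (more local), while a smaller one lets the mass spread further, which corresponds to the local versus global walks mentioned above.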


What do NCP plots of real networks look like?


NCP plot: Small Social Network

• Dolphin social network
• Two communities of dolphins

Figure panels: NCP plot; Network


NCP plot: Zachary’s karate club

• Zachary’s university karate club social network
• During the study the club split into 2
• The split (squares vs. circles) corresponds to cut B

Figure panels: NCP plot; Network


NCP plot: Network Science

Collaborations between scientists working on networks

Figure panels: NCP plot; Network


NCP plot: Grids

Figure panels: NCP plot; Network


NCP plot: Graphs with geometry

Figure panels: NCP plot; Network


NCP plot: Manifold dataset

Manifold learning dataset (Hands)

Figure panels: NCP plot; Network


NCP plot: Power Grid

Eastern US power grid:


NCP plot: Hierarchical network

Figure panels: NCP plot; Network

Small social networks, geometric networks, and hierarchical networks all have a downward-sloping NCP plot.

What about large networks?


NCP of Large Networks


Our work: Large networks

• Previously, researchers examined the community structure of small networks (~100 nodes)
• We examined more than 70 different large networks

Large real-world networks look very different!


Example of our findings

Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)


NCP: LiveJournal (N=5M, E=42M)

Plot: x-axis community size; y-axis community score. Annotations:
• Better and better communities
• Best community has 100 nodes
• Best communities get worse and worse


Explanation: Downward part

• Whiskers are responsible for the downward slope of the NCP plot
• A whisker is a set of nodes connected to the rest of the network by a single edge

Figure: NCP plot, with the largest whisker marked
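
One simple way to find whiskers under the definition above is to test each edge: if removing it disconnects the graph, the smaller side is a candidate whisker. The sketch below assumes a connected, undirected graph with comparable node labels, stored as a dict of neighbor sets, and keeps only maximal whiskers. It is quadratic in graph size and meant as an illustration, not as the method used for the large networks in the talk; function names are hypothetical.

```python
def reachable(adj, start, banned):
    """Nodes reachable from `start` without crossing the `banned` edge."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if {v, w} == banned or w in seen:
                continue
            seen.add(w)
            stack.append(w)
    return seen


def whiskers(adj):
    """Maximal node sets attached to the rest of the graph by one edge."""
    n = len(adj)
    found = []
    for u in adj:
        for v in adj[u]:
            if u < v:
                side = reachable(adj, u, {u, v})
                if v not in side:  # removing (u, v) disconnects the graph
                    small = side if 2 * len(side) <= n else set(adj) - side
                    found.append(frozenset(small))
    # Drop whiskers that sit strictly inside a larger whisker.
    return [w for w in found if not any(w < other for other in found)]


# A 4-clique core (0..3) with the path 3-4-5 hanging off it.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
print(whiskers(adj))
```

Cutting a whisker costs exactly one edge, so its score under the slide's definition is 1 / (# edges inside the whisker), which improves as the whisker gets larger.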


Explanation: Upward part

• Each new edge inside the community costs more
• Each node has twice as many children

Figure: NCP plot, with example scores Φ = 1/3 ≈ 0.33, Φ = 2/4 = 0.5, Φ = 8/6 ≈ 1.33, Φ = 64/14 ≈ 4.57


Comparison to rewired network

• Take a real network G
• Rewire edges for a long time
• We obtain a random graph with the same degree distribution as the real network G
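
A common way to do this kind of rewiring is the double edge swap: pick two edges a-b and c-d and replace them with a-d and c-b, rejecting swaps that would create self-loops or duplicate edges. The slide does not say which procedure was used, so the sketch below is an assumption; `rewire`, `degrees`, and the parameter names are hypothetical.

```python
import random
from collections import Counter


def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization of a simple undirected graph."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:      # randomize the second edge's orientation
            c, d = d, c
        if a == d or c == b:        # would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:  # would duplicate an edge
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degrees(edges):
    return Counter(x for e in edges for x in e)


# A 12-node ring with a few chords.
edges = [(i, (i + 1) % 12) for i in range(12)]
edges += [(i, (i + 4) % 12) for i in range(0, 12, 2)]
rewired = rewire(edges, n_swaps=50)
print(degrees(rewired) == degrees(edges))
```

Every accepted swap leaves each node's degree unchanged, so after many swaps the result has the same degree sequence as G but otherwise randomized connections.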


Comparison to a rewired network

Rewired network: a random network with the same degree distribution


LiveJournal whisker sizes

Whiskers in real networks are larger than expected


Whisker shapes

Whiskers in real networks are non-trivial (richer than trees)

Figure label: Edge to cut


Caveat: Bag of whiskers

What if we allow cuts that give disconnected communities?

• Cut all whiskers
• Compose communities out of whiskers
• How good are the “communities” we get?


Communities made of whiskers

We get better community scores when composing disconnected sets of whiskers

Plot: x-axis community size; y-axis community score. Curves: Connected communities; Bag of whiskers


What if I remove whiskers?

Nothing happens! Now we have 2-edge-connected whiskers to deal with.


Conclusion: Core+Whiskers

Plot curves: Connected communities; Bag of whiskers; Rewired network


Conclusion: Network structure

• Network structure: core-periphery (jellyfish, octopus)
• Whiskers are responsible for good communities
• Denser and denser core of the network
• Core contains 60% of the nodes and 80% of the edges


What is a good model? Intuition?


(Sparse) Random graphs

(Sparse) random graph:
• Start with N nodes
• Pick pairs of nodes uniformly at random and connect them

NCP plot: Flat (long random connections)

Theorem (works for any degree distribution)

Sparsity does not explain our observation


Preferential attachment

Preferential attachment [Price 1965, Albert & Barabasi 1999]:
• Add a new node, create m out-links
• The probability of linking to a node i is proportional to its degree k_i
• Based on Herbert Simon’s result: power laws arise from “rich get richer” (cumulative advantage)

NCP plot: Flat (connections to hubs, no locality)
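
The linking rule can be sketched with a list in which every node appears once per incident edge, so a uniform draw from the list picks a node with probability proportional to its degree. The starting graph (a clique on m+1 nodes), the rule of drawing m distinct targets, and the function name are assumptions for illustration.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow an undirected graph; each new node links to m existing nodes."""
    rng = random.Random(seed)
    # Assumed starting graph: a clique on m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # A node with degree k appears k times in this list.
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))  # degree-proportional draw
        for target in sorted(chosen):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


edges = preferential_attachment(n=50, m=2)
print(len(edges))
```

Because targets are chosen by degree alone, new links go to hubs wherever they sit in the graph, which matches the “no locality” remark above.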


Small-World model

Let’s exploit local connections

NCP plot: Down (locally the network looks like a mesh) and Flat (at large scale the network looks random)


Geometry + Preferential attachment

Geometric preferential attachment:
• Place nodes at random in 2D
• Pick a node
• Pick nodes in a radius
• Connect preferentially

NCP plot: Flat (locally the network is random) and Down (globally the network is a mesh, a union of local expanders)


Model that works: Forest Fire

Forest Fire: connections spread like a fire
• New node joins the network
• Selects a seed node
• Connects to some of its neighbors
• Continue recursively

As a community grows, it blends into the core of the network
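
The four steps above can be sketched as a small generator. This is a simplified, undirected variant written for illustration: the burning probability `p`, the geometric rule for how many neighbors catch fire, and the function name are assumptions rather than details given on the slide.

```python
import random


def forest_fire(n, p=0.35, seed=0):
    """Grow an undirected graph where links spread outward from a seed node."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        ambassador = rng.randrange(new)      # seed node, chosen uniformly
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            candidates = sorted(x for x in adj[w] if x not in burned)
            rng.shuffle(candidates)
            # Geometric count: keep burning while a p-coin comes up heads.
            k = 0
            while rng.random() < p:
                k += 1
            for x in candidates[:k]:
                burned.add(x)
                frontier.append(x)           # the fire continues from x
        adj[new] = set(burned)               # link to everything that burned
        for b in burned:
            adj[b].add(new)
    return adj


adj = forest_fire(200)
print(sum(len(ns) for ns in adj.values()) // 2, "edges")
```

Raising `p` makes fires burn further, so new nodes pick up more links and the graph densifies; with small `p` most nodes attach only near their seed.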


Forest Fire NCP plot

Plot curves: Rewired network; Bag of whiskers


Conclusion and connections

Whiskers:
• Largest whisker has ~100 nodes
• Independent of network size
• Dunbar number: a person can maintain social relationships with at most 150 people

Core:
• Core has little structure (hard to cut)
• Still more structure than the random network


Connections to previous work

• Other researchers examined small networks, so they did not hit Dunbar’s limit
• Some evidence: 400k-node Amazon co-purchasing network [Clauset et al. 2004]
  ▪ Largest community has 50% of all nodes
  ▪ It was labeled “Miscellaneous”
• Karate club has no significant community structure [Newman et al. 2007]


Other explanations

• Bond vs. identity communities
• Multiple hierarchies that blur the community boundaries


Is there still hope?

• Ground truth
• Yes: use attributes, better link semantics


Conclusion and connections

• The NCP plot is a way to analyze network community structure
• Our results agree with previous work on small networks (which are commonly used for testing community-finding algorithms)
• But large networks are different
• Large networks: whiskers + core structure
• Small, well-isolated communities blend into the core of the network as they grow