Model-based Overlapping Seed ExpanSion (MOSES)
Aaron McDaid and Neil Hurley. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.
Clique: Graph & Network Analysis Cluster, School of Computer Science & Informatics
University College Dublin, Ireland
Overview
- Community finding
- The MOSES model
- The MOSES algorithm
- Evaluation
- Scalability
- Other/future work
August 7, 2010 2
Communities
- Traud et al. Community Structure in Online Collegiate Social Networks
- M. Salter-Townshend and T.B. Murphy. Variational Bayesian Inference for the Latent Position Cluster Model
- Marlow et al. Maintained relationships on Facebook
Communities
- Some nodes are assigned to multiple communities.
- Most edges are assigned to just one community.
- Multiple researchers have found Facebook members belonging to 6 or 7 communities.
Communities
- A partition will break some of the communities in that simple example.
- Graclus breaks synthetic communities with low levels of overlap. (A. Lancichinetti and S. Fortunato, Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.)
- Graclus breaks communities found by MOSES in Facebook networks. (Traud et al., Community Structure in Online Collegiate Social Networks)
- Modularity has known problems, but we need to go further and move on from partitioning.
- Traud et al's five university networks.
- Average of 7 communities per node.
Community finding
A general-purpose community finding algorithm must allow:
- Each node to be assigned to any number of communities.
- Pervasive overlap. Ahn et al. Link communities reveal multiscale complexity in networks. (Nature)
- The intersection (number of shared nodes) between a pair of communities can vary. It can be small, even when the number of communities-per-node is high.
MOSES
- MOSES deals only with undirected, unweighted networks.
- No attributes/weights associated with nodes or edges.
The MOSES model
A model in which:
- Every pair of nodes has a chance of having an edge.
- Edges are independent for each pair of nodes, given the communities, but the probability is higher for pairs that share a community (or communities).
- (This is an OSBM: Latouche et al., Annals of Applied Statistics, http://www.imstat.org/aoas/next_issue.html.)
MOSES model
Ignoring the observed edges for now. Just consider the nodes and a (proposed) set of communities.
MOSES model
These communities create probabilities for the edges.
P(v1 ∼ v2) = pout where the two vertices do NOT share a community.
P(v1 ∼ v2) = 1 − (1 − pout)(1 − pin) where the two vertices do share 1 community.
MOSES model
These communities create probabilities for the edges (writing qout = 1 − pout and qin = 1 − pin for the no-edge probabilities).
P(v1 ≁ v2) = qout where the two vertices do NOT share a community.
P(v1 ≁ v2) = qout qin where the two vertices do share 1 community.
P(v1 ≁ v2) = qout qin^s(v1,v2) where s(v1, v2) is the number of communities shared by v1 and v2.
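The complement form above can be sketched in a few lines of Python; `p_edge` is our own helper name, not part of the MOSES code.

```python
def p_edge(shared, p_out, p_in):
    """Probability of an edge between two nodes that share `shared`
    communities: the edge is absent only if the background (q_out)
    AND every one of the shared communities (q_in each) fail to
    create it."""
    return 1.0 - (1.0 - p_out) * (1.0 - p_in) ** shared
```

With the benchmark values used later (pin = 0.4, pout = 0.005), a pair sharing one community is joined with probability 1 − 0.995 × 0.6 ≈ 0.403.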
MOSES model
- We now have a model that, for a given set of communities, assigns probabilities for edges.
- P(g | z, pin, pout)
- g is the observed graph of nodes and edges. z is the proposed set of communities.
- How do we match that with the observed edges to get a good estimate of the set of communities?
- Naive approach: find (z, pin, pout) that maximizes P(g | z, pin, pout).
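The naive objective can be written down directly; this is a sketch under the model's pairwise-independence assumption (the function and argument names are ours, it enumerates all pairs so it only suits small graphs, and it assumes 0 < p_out < 1).

```python
import itertools
import math

def log_likelihood(n_nodes, edges, comms, p_in, p_out):
    """log P(g | z, p_in, p_out): every pair of nodes is an independent
    Bernoulli trial whose success probability depends only on how many
    communities the pair shares."""
    edge_set = {frozenset(e) for e in edges}
    ll = 0.0
    for u, v in itertools.combinations(range(n_nodes), 2):
        shared = sum(1 for c in comms if u in c and v in c)
        p = 1.0 - (1.0 - p_out) * (1.0 - p_in) ** shared
        ll += math.log(p if frozenset((u, v)) in edge_set else 1.0 - p)
    return ll
```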
MOSES model
- P(g | z, pin, pout) is maximized when pin = 1, pout = 0, and when z is defined as exactly one community around each edge.
- i.e. we don't want to maximize P(g | z, pin, pout).
MOSES model
- Instead, consider the posterior: P(z, pin, pout | g)
MOSES model
- Apply Bayes' Theorem:
- P(z, pin, pout | g) ∝ P(g | z, pin, pout) P(z) P(pin, pout)
- P(z) ∝ k! ∏_{1≤i≤k} [ 1/(N+1) ] [ 1/C(N, ni) ]
- where k is the number of communities, ni is the number of nodes in community i, N is the total number of nodes, and C(N, ni) is the binomial coefficient.
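As a sketch, the prior term can be evaluated in log form (up to the proportionality constant); `log_prior` is our name, and it takes N to be the total number of nodes, as in the formula above.

```python
import math

def log_prior(community_sizes, n_nodes):
    """Unnormalized log P(z): log k! plus, for each community i,
    -log(N + 1) - log C(N, n_i), with N = n_nodes."""
    total = math.lgamma(len(community_sizes) + 1)   # log k!
    for n_i in community_sizes:
        total -= math.log(n_nodes + 1)              # size chosen uniformly in 0..N
        total -= math.log(math.comb(n_nodes, n_i))  # members chosen uniformly
    return total
```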
MOSES model
- We can correctly integrate out the number of communities, k, and search across the resulting varying-dimensional space.
- No need for model selection, e.g. BIC.
MOSES Algorithm
- For the MOSES algorithm, we chose to look at the joint distribution over (z, pin, pout) and aim to maximize it.
- The algorithm is a heuristic approximate algorithm, and we do not claim that it finds the maximum.
MOSES Algorithm
- Choose an edge at random to form a seed, and expand.
- Accept/reject those expanded seeds that contribute positively to the objective.
- Update pin, pout based on the graph and the current set of communities.
- Delete communities that don't make a positive contribution to the objective.
- Final fine-tuning that moves nodes one at a time.
- It is not a Markov Chain, nor an EM algorithm. We can make no such guarantees.
- The algorithm will be reaching a local maximum, and may well have strong biases.
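The steps above can be sketched as a control-flow skeleton; everything here (the names, and the caller-supplied `expand_seed` and `objective` callables) is illustrative, and the re-estimation, deletion, and fine-tuning steps are only noted in comments.

```python
import random

def moses_sketch(edges, objective, expand_seed, n_trials=1000):
    """Illustrative control flow only: repeatedly seed a community on a
    random edge, greedily expand it, and keep the expanded community
    only if it raises the objective (the joint distribution)."""
    communities = []
    for _ in range(n_trials):
        seed = set(random.choice(edges))   # a random edge as the seed
        candidate = expand_seed(seed)      # greedy expansion
        if objective(communities + [candidate]) > objective(communities):
            communities.append(candidate)  # accept; otherwise reject
    # The full algorithm also re-estimates p_in/p_out from the current
    # communities, deletes communities with a negative contribution,
    # and finishes with node-by-node fine-tuning.
    return communities
```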
Evaluation
Synthetic benchmarks
- Networks created randomly by software.
- Ground truth communities are built in to these networks.
- Check if the algorithms can discover the correct communities when fed the network.
- To measure the similarity between the found communities and the ground truth communities, overlapping NMI is used. (Lancichinetti et al. Detecting the overlapping and hierarchical community structure in complex networks)
Evaluation
- 2000 nodes
- Define hundreds of communities.
- Each community contains 20 nodes chosen at random from the 2000 nodes.
- Some nodes may be assigned to many communities. Some may not be assigned to a community.
- pin = 0.4. About 40% of the pairs of nodes that share a community are then joined.
- pout = 0.005. Finally, a small amount of background noise is added.
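The recipe on this slide can be reproduced with a short generator; a sketch under our own naming, where n_comms = 200 is an assumption standing in for "hundreds".

```python
import itertools
import random

def synthetic_graph(n_nodes=2000, n_comms=200, comm_size=20,
                    p_in=0.4, p_out=0.005, seed=0):
    """Benchmark generator: random equal-size communities, then an edge
    between each pair with prob 1 - (1 - p_out)(1 - p_in)**shared."""
    rng = random.Random(seed)
    comms = [set(rng.sample(range(n_nodes), comm_size))
             for _ in range(n_comms)]
    edges = []
    for u, v in itertools.combinations(range(n_nodes), 2):
        shared = sum(1 for c in comms if u in c and v in c)
        if rng.random() < 1.0 - (1.0 - p_out) * (1.0 - p_in) ** shared:
            edges.append((u, v))
    return comms, edges
```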
Evaluation
20-node communities (pin = 0.4), pout = 0.005
[Plot: overlapping NMI (0.0–1.0) against average overlap (1–15) for MOSES, LFM (default), LFM (last collection), GCE, the Louvain method, COPRA, 5-clique percolation, 4-clique percolation (dashed), and Iterative Scan (dashed).]
Evaluation, LFR benchmarks
[Plot: overlapping NMI (0.0–1.0) against communities per node (1–10) for MOSES, LFM2-firstCol, LFM2-lastCol, GCE, SCP-3, the Louvain method, COPRA, and SCP-4. degree = 15, 15 ≤ c ≤ 60.]
Evaluation, LFR benchmarks
[Plot: overlapping NMI (0.0–1.0) against communities per node (1–10) for MOSES, LFM2-firstCol, LFM2-lastCol, GCE, the Louvain method, COPRA, and SCP-4. degree ∼ 15, maxdegree = 45, 15 ≤ c ≤ 60.]
[Plot: density against degree (1–500, log scale).]
[Plot: density against communities-per-person (1–100, log scale).]
[Plot: density against size of community (1–500, log scale), one curve per network: Oklahoma, Princeton, UNC, Georgetown, Caltech.]
[Plot: communities per node (0–70) against degree (0–1200), with counts per bin.]
Table: Summary of Traud et al's five university Facebook datasets, and of MOSES's output.

                     Caltech  Princeton  Georgetown     UNC  Oklahoma
Edges                  16656     293320      425638  766800    892528
Nodes                    769       6596        9414   18163     17425
Average Degree          43.3       88.9        90.4    84.4     102.4
Communities found         62        832        1284    2725      3073
Average Overlap         3.29       6.28        6.67    6.96      7.46
MOSES runtime (s)         41        553         839    1585      2233
Scalability
[Plot: runtime in seconds (1e−02 to 1e+02, log scale) against communities per node (1–10) for MOSES, LFM2-firstCol, LFM2-lastCol, GCE, blondel (Louvain), COPRA, and SCP-4. degree ∼ 15, maxdegree = 45, 15 ≤ c ≤ 60.]
Scalability
- In general, community finding means overlapping community finding (in my interpretation).
- Partitioning breaks communities.
- So, partitioning is scalable, but partitioning doesn't help with community finding.
- Challenge: a very scalable algorithm that can credibly claim to be a community-finding algorithm.
Other/future research
- Markov Chain Monte Carlo
  - Working with Prof. Brendan Murphy on an MCMC method.
  - Very different algorithm, which allows us to investigate the model directly.
  - The MOSES algorithm may have many biases we'll never fully grasp.
- Different model (still an OSBM) where each community has its own internal-connection probability.
  - MOSES breaks down on synthetic data if the communities are not equally dense (pin).
  - Draw from this distribution: P(z, pout, p1, p2, p3, ... | g)
- Multiple MCMC chains, where chains propose splits/merges to each other.
- (Modern) statisticians are innovative about scalability, e.g. Hybrid Monte Carlo.
Take home messages
- Community finding should be about discovering structure, not forcing the structure. Overlapping, hierarchy, et cetera.
- MOSES is a proof-of-concept: we show that quality results, overlapping communities, and scalability are not incompatible.
- Very scalable community finding algorithms don't exist yet. This is an interesting challenge.
Acknowledgments
This research was supported by Science Foundation Ireland (SFI)Grant No. 08/SRC/I1407.
- http://clique.ucd.ie/software
- http://www.aaronmcdaid.com
- [email protected], [email protected]