Stanford Infolab Technical Report

Overlapping Communities Explain Core-Periphery Organization of Networks

Jaewon Yang, Jure Leskovec∗

Stanford University

∗To whom correspondence should be addressed;

E-mail: [email protected], [email protected]

October 14, 2014


PROCEEDINGS OF THE IEEE, DECEMBER 2014 1

Overlapping Communities Explain Core-Periphery Organization of Networks

Jaewon Yang and Jure Leskovec

Abstract: Networks provide a powerful way to study complex systems of interacting objects. Detecting network communities (groups of objects that often correspond to functional modules) is crucial to understanding social, technological, and biological systems. Revealing communities allows for analysis of system properties that are invisible when considering only individual objects or the entire system, such as the identification of module boundaries and relationships or the classification of objects according to their functional roles. However, in networks where objects can belong to multiple modules at once, the decomposition of a network into overlapping communities remains a challenge.

Here we present a new paradigm for uncovering the modular structure of complex networks, based on a decomposition of a network into any combination of overlapping, non-overlapping, and hierarchically organized communities. We demonstrate on a diverse set of networks coming from a wide range of domains that our approach leads to more accurate communities and improved identification of community boundaries. We also unify two fundamental organizing principles of complex networks: the modularity of communities and the commonly observed core-periphery structure. We show that dense network cores form as an intersection of many overlapping communities. We discover that communities in social, information, and foodweb networks have a single central dominant core, while communities in protein-protein interaction as well as product co-purchasing networks have small overlaps and form many local cores.

Index Terms—Networks, Community detection, Ground-truth communities, Core-periphery structure.


1 INTRODUCTION

NETWORKS provide a way to represent systems of interacting objects, where nodes denote objects (people, proteins, webpages) and edges between the objects denote interactions (friendships, physical interactions, links). Nodes in networks organize into communities [1], which often correspond to groups of nodes that share a common property, role, or function, such as functionally related proteins [2], social communities [3], or topically related webpages [4]. Communities in networks often overlap, as nodes might belong to multiple communities at once. Identifying such overlapping communities in networks is a crucial step in studying the structure and dynamics of social, technological, and biological systems [2], [3], [4], [5]. For example, community detection allows us to gain insights into metabolic and protein-protein interactions, ecological foodwebs, social networks like Facebook, collaboration networks, information networks of interlinked documents, and even networks of co-purchased products [6], [7], [8], [9], [10], [11], [12]. In particular, communities allow for analysis of system properties that cannot be studied when considering only individual objects or the entire system, such as the identification of module boundaries and relationships and the classification of objects according to their functional roles [13], [14], [15], [16], [17].

• J. Yang is with the Department of Electrical Engineering, Stanford University, Stanford, CA. E-mail: [email protected]

• J. Leskovec is with the Department of Computer Science, Stanford University, Stanford, CA. E-mail: [email protected]

Manuscript received March 18, 2014; revised August 1, 2014.

Here we explore the community structure of a number of networks from many domains. We distinguish between structural and functional definitions of communities [18]. Communities are often structurally defined as sets of nodes with many connections among the members of the set and few connections to the rest of the network [1]. Communities can also be defined functionally, based on the function or role of their members. For example, functional communities may correspond to social groups in social networks, scientific disciplines or research groups in scientific collaboration networks, and biological modules in protein-protein interaction networks. The premise of community detection is that these functional communities share, to some degree, a common structural signature, which allows us to extract them from the network structure.

Based on this distinction one can state that the goal of community detection is to build a bridge between network structure and function. That is, the goal is to identify communities based on the network structure, with the aim that such structurally identified communities would correspond to functional communities. Thus, the aim is to use community detection to identify functional communities based on their structural connectivity patterns.

In this paper we build on this view of network community detection and identify networks where we can obtain reliable external labels of functional communities. We refer to such explicitly labeled functional communities as ground-truth communities [18]. We study structural properties of such ground-truth functional communities and find that they exhibit a particular structural pattern. We discover that the probability of nodes being connected increases with the number of ground-truth communities they share. Our observation means that nodes residing in overlaps of ground-truth communities are more densely connected than nodes in the non-overlapping parts of communities. Interestingly, we also find that assumptions behind many existing overlapping community detection methods lead to the opposite conclusion: the more communities a pair of nodes shares, the less likely they are to be connected [6], [7], [8], [9], [10], [11]. As a consequence, many overlapping community detection methods may not be able to properly detect ground-truth communities.

Based on the above observations we develop a new overlapping community detection method, the Community-Affiliation Graph Model (AGM), which views communities as overlapping “tiles” whose density corresponds to edge density [19]. Figure 1 illustrates the concept. Our methodology decomposes the network into a combination of overlapping, non-overlapping, and hierarchically organized communities. We compare AGM to a number of widely used overlapping and non-overlapping community detection methods [6], [7], [10], [20] and show that AGM leads to more accurate functional communities. On average, AGM gives a 50% relative improvement over existing methods in assigning nodes to their ground-truth communities in social, co-authorship, product co-purchasing, and biological networks.

Fig. 1: Communities as tiles. (a) Communities in networks behave as overlapping tiles. (b) Many methods view communities as clusters with a homogeneous edge density and thus they may break the tiles. (c) Our AGM methodology successfully decomposes the network into different tiles (communities).

Finally, we unify two fundamental organizing principles of complex networks: overlapping communities and the commonly observed core-periphery structure. While network communities are often thought of as densely linked clusters of nodes, in core-periphery network structure the network is composed of a densely connected core and a sparsely connected periphery [21], [22], [23]. Many large networks may exhibit core-periphery structure. The network core was traditionally viewed as a single giant community, and therefore it was conjectured that the core lacks internal communities [24], [25], [26], [27]. We unify those two organizing principles and show that dense network cores form as a result of many overlapping communities. Moreover, we find that foodweb, social, and web networks exhibit a single dominant core, while protein-protein interaction and product co-purchasing networks contain many local cores formed around the central core.

Our methodology to decompose networks into communities provides a powerful tool for studying social, technological, and biological systems by uncovering their modular structure. Our work represents a new way of studying networks of complex systems by bringing a shift in perspective from defining communities as densely connected nodes to conceptualizing them as overlapping tiles.

Fig. 2: Three structural definitions of network communities. Networks (top) and corresponding adjacency matrices (bottom), where rows/columns denote nodes and dots denote edges: (a) two non-overlapping communities; (b) two overlapping communities where the overlap is less connected than the non-overlapping parts of communities; (c) two overlapping communities where the nodes in the overlap are better connected. Based on (c), we structurally define communities as analogous to “tiles”, where community overlaps lead to higher density of edges.

2 FROM STRUCTURAL TO FUNCTIONAL DEFINITIONS OF COMMUNITIES

The traditional structural view of network communities is based on two fundamental social network processes: triadic closure [28] and the strength of weak ties theory [29], [30]. Under this view, structural communities are often defined as sets of nodes with many “strong” connections between the members of the community and few “weak” connections with the rest of the network (Figure 2a). However, in many domains nodes may belong to multiple communities at once, and thus the notion of structural communities has also been extended to include overlapping, hierarchical, and disassortative community structures [6], [31], [32], [33], [34].

Despite great progress in the field, we find that extending the traditional structural view to overlapping communities leads to an unnoticed consequence: nodes in community overlaps are less densely connected than nodes in the non-overlapping parts of communities (Figure 2b). (Refer to the extended version of the paper [35] for details.) We find this hidden consequence to be present in many existing approaches to overlapping community detection [6], [7], [8], [9], [10], [11].

[Figure 3 plots: edge probability P(k) versus k, the number of shared communities (k = 1 to 7); panel (a) Social network, panel (b) Product network.]

Fig. 3: Community overlaps have higher edge density than the non-overlapping parts of communities. Edge probability P(k) as a function of the number of common community memberships k in the social network (a) and in the product co-purchasing network (b) (Table 1). Results in (a) and (b) suggest that as nodes share multiple communities, they are more likely to be connected, which leads to higher edge density in community overlaps as illustrated in Figure 2c.

We examine a diverse set of six networks drawn from a wide range of domains, including social, collaboration, and co-purchasing networks, for which we obtain explicitly labeled functional communities, which we refer to as ground-truth communities [18]. For example, in social networks we take ground-truth communities to be social interest-based groups that people explicitly join. In product networks, ground-truth communities correspond to product categories [35]. Note that ground-truth communities are not defined based on some observed node attribute or property (such as a user’s age or a user’s homework in the case of a social network). The idea behind ground-truth communities is that they would correspond to true functional modules in complex networks. While the obtained ground-truth labels may sometimes be noisy or incomplete, the consistency and robustness of the results suggest that the ground-truth labels are overall reliable.¹

By studying the structure of ground-truth communities we find that two nodes are more likely to be connected if they have multiple ground-truth communities in common (Figure 3). For example, in the LiveJournal online social network (Table 1), the edge probability jumps from ∼10^-6 for nodes that share no ground-truth communities to 0.1 for nodes that have one ground-truth community in common, and keeps increasing all the way to 0.7 as nodes share more communities (Figure 3a). This implies that the area of overlap between two communities has a higher average density of edges than an area that falls in just a single community (Figure 2c).

1. Networks with ground-truth communities can be downloaded from http://snap.stanford.edu/agm.
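The edge probability P(k) discussed here can be estimated empirically by counting, for each k, how many node pairs share exactly k communities and what fraction of those pairs are linked. The following is a minimal Python sketch of that counting; the function name `edge_probability_by_shared`, the data layout, and the toy graph are illustrative assumptions rather than the authors' code, and the all-pairs loop is only practical for small networks.

```python
from collections import Counter
from itertools import combinations


def edge_probability_by_shared(nodes, edges, memberships):
    """Return {k: fraction of node pairs sharing k communities that are linked}."""
    edge_set = {frozenset(e) for e in edges}
    pairs = Counter()   # number of node pairs sharing exactly k communities
    linked = Counter()  # number of those pairs joined by an edge
    for u, v in combinations(nodes, 2):
        k = len(memberships.get(u, set()) & memberships.get(v, set()))
        pairs[k] += 1
        if frozenset((u, v)) in edge_set:
            linked[k] += 1
    return {k: linked[k] / pairs[k] for k in pairs}


# Hypothetical toy data: communities X and Y overlap in nodes 2 and 3.
nodes = [0, 1, 2, 3, 4, 5]
memberships = {0: {"X"}, 1: {"X"}, 2: {"X", "Y"},
               3: {"X", "Y"}, 4: {"Y"}, 5: {"Y"}}
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
p_of_k = edge_probability_by_shared(nodes, edges, memberships)
```

On this toy graph the estimate rises with k (0.0 for k = 0, 0.5 for k = 1, 1.0 for k = 2), which is the qualitative pattern the text reports for the real networks.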

Our observation is intuitive and consistent across several domains. For example, proteins belonging to multiple common functional modules are more likely to interact [2], people who share multiple interests have a higher chance of becoming friends [36], and researchers with many common interests are more likely to collaborate [36].

2.1 Defining Structural Communities as Tiles

We think of communities as analogous to overlapping “tiles”. Just as the overlap of two tiles leads to a higher tile height in the overlapping area, the overlap of two communities leads to higher density of edges in the overlap. (Figure 1 illustrates the concept.) The composition of many overlapping communities then gives rise to the global structure of the network.

Conceptually, our methodology represents a shift in perspective from structurally modeling communities as sets of densely linked nodes to modeling communities as overlapping tiles, where the network emerges as a result of the overlap of many communities. Our structural definition of communities departs from the strength of weak ties theory [30] and is consistent with the earlier web of group affiliations social network theory [37], which postulates that edges arise due to shared community affiliations.

Our findings here also have implications for the understanding of homophily, which is one of the primary forces that shape the formation of social networks [36]. Homophily is the tendency of individuals to connect to others with similar tastes and preferences. Based on [30], it has been commonly assumed that homophily operates in “pockets” and thus nodes that have neighbors in other communities are less likely to share the attributes of those neighbors (as in Figures 2a, 2b). In contrast, our results imply pluralistic homophily, where the similarity of nodes is proportional to the number of shared memberships/functions, not just their similarity along a single dimension. In a multi-dimensional network, the most central nodes are those that have the most shared dimensions.

3 DECOMPOSITION OF NETWORKS INTO COMMUNITIES

In order to model communities in a network we define a Community-Affiliation Graph Model (AGM) [19]. In our model, edges of the underlying network arise due to shared community memberships (Figure 4a) [38], [39]. The AGM parameterizes each community A with a single parameter p_A. Two nodes that belong to community A then form an edge in the underlying network with probability p_A. Each community A generates edges between its members independently; however, if two nodes have already been connected, then the duplicate edge is not included in the network.

The AGM naturally models communities with dense overlaps (Figures 4a, 4b). Pairs of nodes that belong to multiple common communities become connected in the underlying network with a higher probability, since for each shared community the nodes are given an independent chance of forming an edge.
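To make the dense-overlap argument concrete, the following minimal Python sketch evaluates the edge probability stated in the Figure 4 caption, p(u, v) = 1 − ∏ (1 − p_A) over the communities that u and v share, with a small background probability ε when they share none. The function name `agm_edge_probability`, its argument layout, and the numeric values are illustrative assumptions.

```python
from math import prod


def agm_edge_probability(shared_probs, eps=0.0):
    """Edge probability for a node pair under the AGM.

    shared_probs: the p_A values of the communities both nodes belong to.
    eps: background probability used when the pair shares no community.
    """
    if not shared_probs:
        return eps
    # Each shared community gives an independent chance to form the edge,
    # so the pair stays unlinked only if every one of those chances fails.
    return 1.0 - prod(1.0 - p for p in shared_probs)


one_shared = agm_edge_probability([0.3])        # one shared community
two_shared = agm_edge_probability([0.3, 0.5])   # a second shared community
none_shared = agm_edge_probability([], eps=1e-6)
```

With these made-up values, sharing one community with p_A = 0.3 gives probability 0.3, while sharing a second community with p_B = 0.5 raises it to 1 − 0.7 × 0.5 = 0.65.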

The flexible nature of the AGM allows for modeling a wide range of network community structures, such as non-overlapping, hierarchically nested, and overlapping communities (Figures 4c, 4d, 4e). Given a bipartite community affiliation graph and a probability p_A for each community A, the AGM allows us to generate synthetic networks with realistic community structures, a procedure useful in and of itself.
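The generative procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `generate_agm` and the dictionary layout are hypothetical, node labels are assumed to be sortable, and the small background probability ε for pairs sharing no community is omitted for brevity.

```python
import random
from itertools import combinations


def generate_agm(communities, probs, seed=None):
    """Sample an undirected edge set from a community affiliation graph.

    communities: {community name: iterable of member nodes}
    probs:       {community name: p_A}
    """
    rng = random.Random(seed)
    edges = set()
    for name, members in communities.items():
        # Every community links each pair of its members independently.
        for u, v in combinations(sorted(members), 2):
            if rng.random() < probs[name]:
                # Pairs are stored as (smaller, larger), so an edge created
                # by a second shared community is not added twice.
                edges.add((u, v))
    return edges


# Hypothetical example: two communities overlapping in nodes 1 and 2.
communities = {"A": [0, 1, 2], "B": [1, 2, 3]}
dense = generate_agm(communities, {"A": 1.0, "B": 1.0}, seed=0)
empty = generate_agm(communities, {"A": 0.0, "B": 0.0}, seed=0)
```

With p_A = p_B = 1 every within-community pair is linked and the pair (1, 2), which lies in both communities, still appears only once, matching the rule that duplicate edges are not included.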

Using the AGM, we can also identify and analyze the community structure of real-world networks. We accomplish decomposition of a given network into communities by fitting the AGM to the network with tools of statistical inference. We combine a maximum-likelihood approach with convex optimization and a Monte Carlo sampling algorithm on the space of community affiliation graphs [19], [35], [40]. This technique allows us to efficiently search for the community affiliation graph that gives the observed network the greatest likelihood. To automatically determine the number of communities in a given network, we apply techniques from statistical regularization and sparse model estimation [35].
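As a rough illustration of the quantity that such a maximum-likelihood fit would compare across candidate affiliation graphs, the sketch below computes the log-likelihood of an observed network under the AGM, assuming each node pair is an independent Bernoulli trial with the AGM edge probability. It only evaluates the objective; the convex optimization and Monte Carlo search described above are not reproduced. The function name `agm_log_likelihood`, the default ε, and the clamping constant are assumptions made for this example.

```python
from itertools import combinations
from math import log, prod


def agm_log_likelihood(nodes, edges, memberships, probs, eps=1e-6):
    """Log-likelihood of an observed edge set given affiliations and p_A values."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in combinations(nodes, 2):
        shared = memberships[u] & memberships[v]
        if shared:
            p = 1.0 - prod(1.0 - probs[a] for a in shared)
        else:
            p = eps  # assumed background probability
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        if frozenset((u, v)) in edge_set:
            total += log(p)
        else:
            total += log(1.0 - p)
    return total


# Hypothetical example: a fully connected triangle in one community.
nodes = [0, 1, 2]
edges = [(0, 1), (0, 2), (1, 2)]
memberships = {0: {"A"}, 1: {"A"}, 2: {"A"}}
ll_high = agm_log_likelihood(nodes, edges, memberships, {"A": 0.9})
ll_low = agm_log_likelihood(nodes, edges, memberships, {"A": 0.1})
```

For the fully linked triangle, a high p_A explains the data better than a low one, so `ll_high` exceeds `ll_low`.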

4 ACCURACY OF DETECTED COMMUNITIES

Next, we aim to infer functional communities based only on the structure of a given unlabeled undirected network.


[Figure 4 panels: (a) community affiliation graph with communities A, B, C and probabilities p_A, p_B, p_C; (b) generated network; (c)–(e) community structures (top row) and the corresponding AGMs (bottom row).]

Fig. 4: Community-Affiliation Graph Model (AGM) [19]. (a) Squares represent communities and circles represent the nodes of a network. Edges represent node community memberships. For each community A that two nodes share, they create a link independently with probability p_A. The probability that a pair of nodes u, v creates a link is thus

$p(u, v) = 1 - \prod_{A \in C_{uv}} (1 - p_A)$,

where $C_{uv}$ is the set of communities that u and v share. If u and v do not share any communities, we assume they link with a small probability ε. (b) Network generated by the Community-Affiliation Graph Model in (a). As pairs of nodes that share multiple communities get multiple chances to create edges, the AGM naturally generates networks where nodes in the community overlaps are more densely connected than the nodes in non-overlapping regions. (c–e) AGM is capable of modeling any combination of (c) non-overlapping, (d) hierarchically nested, as well as (e) overlapping communities.

4.1 Qualitative Evaluation

As an illustrative example, we consider a Facebook friendship network of a single user’s friends (Figure 5a and Table 1). In order to obtain labels for ground-truth communities, we asked the user to manually organize his Facebook friends into communities. The user classified his friends into four communities corresponding to his high school, workplace, and two communities of university friends. The visualization of the same network using communities in Figure 5b shows that the network in Figure 5a is in fact composed of the overlaps of the four communities. In this example, the goal of community detection is to identify the communities in Figure 5b based only on the connectivity structure of the network in Figure 5a.

Due to an implicit assumption that nodes in community overlaps are less densely connected than nodes in the non-overlapping parts of communities (Figure 2b), many overlapping community detection approaches [6], [7], [8], [9], [10], [11] fail to properly detect communities in this network. For example, Figures 5c, 5d, and 5e illustrate the result of applying Clique Percolation [10], Link Clustering [6], and Mixed-Membership Stochastic Block Model [7] to the Facebook network in Figure 5a. We also give a formal argument that explains the behavior of these methods in Appendix A.1 and the extended version [35].

When we use the AGM to analyze the Facebook network, the AGM automatically detects four communities (Figure 6), which is the same as the number identified by the user. Moreover, the communities detected by the AGM nearly perfectly correspond to communities identified by the user. The AGM correctly determines community overlaps and community memberships for 94% of the user’s friends.

4.2 Quantitative Evaluation

We also perform a large-scale quantitative evaluation of AGM on biological, social, collaboration, and product networks where functional communities are explicitly labeled [18]. The networks represent a wide range of sizes and edge densities, as well as amounts of community overlap. We compare the AGM to a number of widely used overlapping and non-overlapping community detection methods [6], [7], [10], [20] and quantify the correspondence between the explicitly labeled ground-truth communities and the communities detected by a given method. The performance metrics quantify the accuracy of the method in assigning nodes to their ground-truth communities. (Refer to Appendix A.2 for further details.)

                   Properties of networks                         Properties of detected communities
Network            N          E           ⟨C⟩    D      ⟨k⟩      K        ⟨S⟩    ⟨A⟩
Facebook           183        2,873       0.56   2.80   31.40    4        70.8   1.5
Social network     3,997,962  34,681,189  0.28   6.47   17.35    29,774   83.3   0.6
Foodweb            128        2,075       0.33   1.90   32.42    5        54.4   2.1
Web graph          255,265    1,941,926   0.62   9.36   15.21    5,000    83.3   1.6
PPI network        1,213      2,556       0.33   10.50  4.21     40       31.6   1.0
Product network    334,863    925,872     0.40   15.00  5.53     9,020    50.0   1.3

TABLE 1: Network statistics and properties of detected communities. We consider the Facebook ego-network of a particular user, the full LiveJournal online social network, the Florida bay foodweb network, the Stanford University web graph, the literature-curated Saccharomyces cerevisiae protein-protein interaction (PPI) network, and the Amazon product co-purchasing network. Network statistics: N: Number of nodes, E: Number of edges, ⟨C⟩: Average clustering coefficient, D: Effective diameter, ⟨k⟩: Average node degree. Properties of detected communities: K: Number of communities, ⟨S⟩: Average detected community size, ⟨A⟩: Average number of community memberships per node. The networks vary from those with modular to highly overlapping community structure and represent a wide range of edge densities. While the number of communities detected by AGM varies, the average community size is quite stable across the networks. The average number of community memberships per node reveals that communities in the foodweb overlap most pervasively, while in PPI and social networks overlaps are smallest.

On a set of social, collaboration, and product networks, AGM on average outperforms existing methods by 50% in four different metrics that quantify the accuracy in assigning nodes to their ground-truth communities (Figure 11a). In particular, AGM gives a 50% relative improvement over Clique Percolation [10]. Link Clustering [6] detects overlapping as well as hierarchical communities, and AGM improves 61% over it. Similar levels of improvement are achieved when comparing AGM to other overlapping and non-overlapping methods [7], [20]. Furthermore, AGM gives a 14% relative improvement over Link Clustering using the same networks and same data-driven benchmarks as used in the Link Clustering work [6].

Furthermore, we also experiment with AGM on a set of four different biological protein-protein interaction networks. Remarkably, even though AGM was developed based on insights gained primarily on social networks, we find that AGM performs surprisingly well on biological networks as well. As performance metrics, we compute the average statistical significance of detected communities (p-value) for the three types of Gene Ontology (GO) terms [41]. We consider the negative logarithm of average p-values for each of the three GO term types as three separate scores. On average, the AGM outperforms Link Clustering by 150%, CPM by 163%, Infomap by 148%, and MMSB by a factor of 12 (Figure 11b). Further experimental details are in the Appendix and [35].

Overall, the AGM approach yields substantially more accurate communities. The success of our approach relies on the AGM’s flexible nature, which allows the AGM to decompose a given network into a combination of overlapping, non-overlapping, and hierarchical communities.

5 COMMUNITIES, PLURALISTIC HOMOPHILY, AND CORE-PERIPHERY STRUCTURE

The AGM also makes it possible to gain well-founded insights into the community structure of networks. In particular, we discover that overlapping communities lead to a global core-periphery network structure. Core-periphery structure captures the notion that many networks decompose into a densely connected core and a sparsely connected periphery [21], [22]. The core-periphery structure is a pervasive and crucial characteristic of large networks [23], [24], [42].

We discover that a network core forms as a result of pluralistic homophily, where the connectedness of nodes is proportional to the number of shared community memberships, and not just their similarity along a single dimension or community. Thus, the network core forms as a result of many overlapping communities. The average number of community memberships of a node decreases with its distance from the center of the network (Figure 7). Moreover, the edge likelihood increases as a function of community memberships (Figure 3). Thus, the nodes in the center of the network have higher density of edges than nodes in the periphery. Therefore, we show that even in the presence of many communities, pluralistic homophily leads to dense community overlaps, which cause a global core-periphery network structure.

Fig. 5: An example on a Facebook friendship network of a particular user. (a) Facebook friendship network of a single user. (b) The same network but with communities explicitly labeled by the user: high-school friends, colleagues at the workplace, and university friends with whom the user plays basketball and squash. Communities are denoted by filled regions. Notice that nodes in the overlap of communities have higher density of edges. (c–e) Results of applying (c) Clique Percolation, (d) Link Clustering, and (e) Mixed-Membership Stochastic Block Model to the Facebook network.

Fig. 6: AGM on the Facebook network from Figure 5. AGM successfully decomposes the network into different tiles (communities) and correctly determines community overlaps as well as community memberships for 94% of the nodes.

A further examination of the amount of community overlap reveals that social, web, and foodweb networks in Table 1 have a single central dominant core (Figure 8a). On the other hand, communities in protein and product networks have small overlaps and also form many local cores (Figure 8b). In particular, protein communities only slightly overlap and form local cores as well as a small global core (Figure 8d). Small overlaps of protein communities can be explained by the fact that communities act as functional modules, and it would be hard for the cell to independently control heavily overlapping modules [2], [6]. Communities of co-purchased products can also be thought of as functional modules, since the products in a community are bought together for a specific purpose. On the other hand, foodweb communities overlap pervasively while forming a single dominant core. This leads to a flower-like overlapping community structure (Figure 8c), where tiles (communities) overlap to form a single core of the network. The heavily overlapping foodweb communities form due to the closed nature of the studied Florida bay ecosystem [43]. Web communities overlap moderately and form a single global core. Many of these communities form around common interests or topics, which may overlap with each other [4].

[Figure 7 plots: average number of community memberships ⟨m⟩ versus farness centrality d; panel (a) Social network, panel (b) PPI network.]

Fig. 7: Overlapping communities lead to global core-periphery network structure. The average (and the 10th percentiles) of the number of community memberships ⟨m⟩(d) of a node as a function of its farness centrality d, defined as the average shortest path length of a given node to all other nodes of the network [3]. (a) LiveJournal social network, (b) Saccharomyces cerevisiae PPI network. The number of community memberships of a node decreases with its farness centrality. Nodes that reside in the center of the network (and have small shortest path distances to other nodes of the network) belong to the highest number of communities. This means that core-periphery structure forms due to community overlaps. Communities in the periphery tend to be non-overlapping, while communities in the core overlap pervasively.
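The farness centrality used in Figure 7 is defined there as the average shortest path length from a node to all other nodes. A minimal Python sketch of that definition for an unweighted graph, using breadth-first search, follows. The function name `farness_centrality` and the adjacency-dictionary layout are assumptions for this example, and averaging only over reachable nodes is a choice made here for disconnected graphs, not something the text specifies.

```python
from collections import deque


def farness_centrality(adj, source):
    """Average shortest-path length from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over the unweighted graph
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    others = len(dist) - 1
    return sum(dist.values()) / others if others else 0.0


# Hypothetical example: the path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
center = farness_centrality(path, 2)
end = farness_centrality(path, 0)
```

On the path graph the middle node has farness 1.5 and an end node has farness 2.5, so lower values mark more central nodes; averaging community memberships over nodes grouped by this value gives a curve of the kind shown in Figure 7.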

6 CONCLUSION

In closing, we note that our approach builds on the previous work on community detection [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. We examine an implicit assumption of sparsely connected community overlaps and find that regions of the network where communities overlap have higher density of edges than the non-overlapping regions.

We then rethink classical structural definitions of communities and develop the AGM, which models structural communities as overlapping tiles. Using our well-founded approach we find that all networks considered in this study exhibit a core-periphery structure where nodes that belong to multiple communities reside in the core of the network. However, networks have different kinds of core-periphery structure depending on the mechanism for community formation in the networks. Dense community overlaps also explain the mixed success of present community detection methods when applied to large networks [24], [27].

Our work also enhances our understanding of homophily as one of the most fundamental social forces. Homophily in networks has traditionally been thought to operate in small pockets/clusters. Thus, nodes that have neighbors in other communities were considered less likely to share properties of those neighbors. In contrast, our results imply pluralistic homophily, where the similarity of nodes’ properties is proportional to the number of shared community memberships. In a network, the most central nodes are those that have the most shared properties/functions/communities with others. More generally, our work provides a shift in perspective from conceptualizing communities as densely connected sets of nodes to defining them as overlapping tiles and represents a new way of studying complex systems.

Acknowledgments. We thank R. Sosic, P. Mason, M. Macy, S. Fortunato, D. McFarland, and H. Garcia-Molina for invaluable discussions and feedback. Supported by NSF Career Award IIS-1149837, DARPA XDATA and GRAPHS, Alfred P. Sloan Fellowship, and the Microsoft Faculty Fellowship.

APPENDIX A

A.1 Detecting Densely Overlapping Communities

We next show that three popular community detection methods, Clique Percolation [10], [44], Link Clustering [6], and Stochastic Block Model [7], [45], cannot properly detect communities with dense overlaps.

A.1.1 Clique Percolation

First, we analyze the Clique Percolation method and show that it may not properly detect the two overlapping communities from Figure 2c. Clique Percolation Method (CPM) has a single input parameter k which determines the size of the maximal cliques that the algorithm looks for. For example, Figure 9 shows the result of CPM on the network of Figure 2c, where the overlap between the two communities is denser than the individual communities. When k = 3, CPM finds a community that covers the whole network because the clique in the overlap connects the cliques in the left community and the right community, whereas when k = 4 CPM finds a community consisting of the overlap.


Fig. 8: Primary and secondary cores in networks. [Plots omitted. Panel (a): C(a) versus a, node community memberships; panel (b): P(o) versus o, maximum overlap fraction; curves for the Social, Foodweb, Web, PPI, and Products networks.] (a) The fraction of nodes C(a) in the largest connected component of the induced subgraph on the nodes that belong to at least a communities. By thinking of a network as a valley where peaks correspond to cores and peripheries to lowlands, our methodology is analogous to flooding lowlands and measuring the fraction of the largest island. A high C(a) means that there is a single dominant core (peak), while a low C(a) suggests the existence of nontrivial secondary cores. (b) Probability density P(o) of the maximum overlap o. The maximum overlap oA of a given community A is defined as the fraction of A's nodes that are in the overlap with any other community. Communities in the PPI, social, and product co-purchasing networks are mostly non-overlapping, whereas the communities in the foodweb and the web graph are pervasively overlapping. (c) Communities detected by the AGM in the foodweb form a single central core. (d) Communities in the PPI network form many secondary cores.

In addition to the Clique Percolation Method, there are many other overlapping community detection methods that are based on expanding maximal cliques. These methods (for example, Greedy Clique Expansion [46] and EAGLE [47]) suffer from the same problem.

A.1.2 Stochastic Block Models

We show that three variants of stochastic block models are unable to correctly discover communities with dense overlaps: the traditional Stochastic Block Model [45], the Degree-Corrected Stochastic Block Model [48], and the Mixed-Membership Stochastic Block Model [7]. Based on the input matrix from Figure 2c, all three models identify three blocks, as illustrated in Figure 10. The reason is that the edge probability between two nodes that both belong to communities A and B is a weighted average of P(A,A) and P(B,B), where P(X,Y) is the edge probability between a node in community X and a node in community Y. This means that the edge probability between two nodes that share multiple communities is no larger than the maximum of P(A,A) and P(B,B) (due to the weighted summation). Therefore, the edge probability between overlapping nodes cannot be higher than the edge probability between nodes in an individual community. We also note that in principle one could post-process the communities detected by stochastic block models to identify which of the detected structural communities actually correspond to overlaps of functional communities. However, it is not immediately clear how to develop such a post-processing method.

Fig. 9: The Clique Percolation Method cannot detect communities with dense overlaps. [Panels: (a) k = 3, (b) k = 4.] Given a network with two communities and a dense overlap, the Clique Percolation Method reports, depending on the parameter setting, either (a) a single community that includes both communities, or (b) a small community consisting only of the overlap.

Fig. 10: The result of a Stochastic Block Model and the Mixed-Membership Stochastic Block Model on a network of two communities with a dense overlap. The adjacency matrix of the network in Figure 2c is shown, and the bold lines denote the three partitions discovered by the stochastic block models, where the overlap is mistaken for a separate community.
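The weighted-average argument for stochastic block models can be written out as a one-line inequality. The mixing weight $\alpha$ below is notation assumed here for illustration; it is not a symbol used in the text.

```latex
% alpha in [0, 1] is an assumed mixing weight, introduced only for this illustration.
P(u,v) \;=\; \alpha\, P(A,A) + (1-\alpha)\, P(B,B)
       \;\le\; \max\{P(A,A),\, P(B,B)\},
\qquad 0 \le \alpha \le 1 .
```

Under this form, a node pair in the overlap can never be assigned a higher edge probability than both parent communities, so a denser-than-average overlap can only be fitted by giving it a block of its own.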

A.1.3 Link Clustering

Lastly, we show that Link Clustering [6] is not able to correctly detect overlapping communities with dense overlaps. Link Clustering performs hierarchical clustering of the edges of the given network based on the Jaccard similarity between the adjacent nodes of the edges. Since the edge density in the area of community overlap is higher, the Jaccard similarity between the adjacent nodes there is also higher, which in turn means that Link Clustering will identify the edges in the overlap as a separate community. (Refer to the extended version [35] for details.)
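To make the similarity computation concrete, the sketch below scores a pair of edges that share a node by the Jaccard similarity of the inclusive neighborhoods (a node plus its neighbors) of their two non-shared endpoints. This is the edge similarity as we understand it from [6]; the toy edge list and helper names are hypothetical and chosen only for illustration.

```python
def build_adjacency(edges):
    """Undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_pair_similarity(adj, i, j):
    """Jaccard similarity of the inclusive neighborhoods of nodes i and j.

    Intended for two edges (i, k) and (j, k) that share the node k.
    """
    n_i = adj[i] | {i}
    n_j = adj[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)


# Hypothetical toy network: nodes 3-6 form a dense overlap of two communities.
TOY_EDGES = [
    (1, 2), (1, 3), (2, 3), (2, 4),
    (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6),
    (5, 7), (6, 7), (6, 8), (7, 8),
]
ADJ = build_adjacency(TOY_EDGES)

# Edges (3, 4) and (3, 5) both lie inside the overlap.
inside_overlap = edge_pair_similarity(ADJ, 4, 5)
# Edges (1, 2) and (2, 4) lie in the sparser left part.
outside_overlap = edge_pair_similarity(ADJ, 1, 4)
print(inside_overlap, outside_overlap)
```

In this example the edge pair inside the overlap scores higher than the pair outside it, so a hierarchical clustering of edges would merge the overlap edges first.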

A.2 Metrics of Community Detection Accuracy

We focus the evaluation of community detection methods on their ability to correctly identify overlapping communities.

To quantify the performance, we measure the level of agreement between the detected and the ground-truth communities. Given a network $G(V,E)$, we consider a set of ground-truth communities $C^*$ and a set of detected communities $\hat{C}$, where each ground-truth community $C_i \in C^*$ and each detected community $\hat{C}_i \in \hat{C}$ is defined by a set of its member nodes. To compare $\hat{C}$ and $C^*$, we use four performance metrics:

Average F1 score [49]: We compute $F_g(C_i) = \max_j F1(C_i, \hat{C}_j)$ for each ground-truth community $C_i$ and $F_d(\hat{C}_i) = \max_j F1(C_j, \hat{C}_i)$ for each detected community $\hat{C}_i$, where $F1(S_1, S_2)$ is the harmonic mean of precision and recall between two node sets $S_1, S_2$. The average F1 score is $\frac{1}{2}(\bar{F}_g + \bar{F}_d)$, where $\bar{F}_g = \frac{1}{|C^*|}\sum_i F_g(C_i)$ and $\bar{F}_d = \frac{1}{|\hat{C}|}\sum_i F_d(\hat{C}_i)$.
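The average F1 score defined above can be computed directly from its definition. The following is a minimal sketch; the function names and the two small covers are hypothetical, chosen only for illustration.

```python
def f1(set_a, set_b):
    """Harmonic mean of precision and recall between two node sets."""
    a, b = set(set_a), set(set_b)
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(b)
    recall = common / len(a)
    return 2 * precision * recall / (precision + recall)


def average_f1(truth, detected):
    """Average of the best-match F1 in both directions."""
    # Best detected match for every ground-truth community.
    f_g = sum(max(f1(c, d) for d in detected) for c in truth) / len(truth)
    # Best ground-truth match for every detected community.
    f_d = sum(max(f1(c, d) for c in truth) for d in detected) / len(detected)
    return 0.5 * (f_g + f_d)


# Hypothetical example: one community is recovered exactly, the other
# is recovered without the shared node 3.
truth = [{1, 2, 3}, {3, 4, 5}]
detected = [{1, 2, 3}, {4, 5}]
score = average_f1(truth, detected)
print(round(score, 3))
```

Matching in both directions penalizes a method both for missing ground-truth communities and for reporting spurious ones.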

Omega Index [50]: For each pair of nodes $u, v \in V$, we define $C_{uv}$ to be the set of ground-truth communities to which both $u$ and $v$ belong and $\hat{C}_{uv}$ to be the set of detected communities to which both nodes belong. Then the Omega Index is $\frac{1}{|V|^2}\sum_{u,v \in V} \mathbf{1}\{|C_{uv}| = |\hat{C}_{uv}|\}$.
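The Omega Index as written above counts the node pairs on which the two covers agree about the number of shared communities. The sketch below implements that formula literally, under the assumption that the sum runs over all ordered pairs including u = v (which matches the $|V|^2$ normalization); the function name and example covers are hypothetical.

```python
from itertools import product


def omega_index(nodes, truth, detected):
    """Fraction of ordered node pairs with equal shared-community counts."""
    nodes = list(nodes)

    def shared(u, v, cover):
        # Number of communities in the cover that contain both u and v.
        return sum(1 for community in cover if u in community and v in community)

    agreements = sum(
        1
        for u, v in product(nodes, repeat=2)
        if shared(u, v, truth) == shared(u, v, detected)
    )
    return agreements / len(nodes) ** 2


# Hypothetical example on four nodes.
truth = [{1, 2, 3}, {3, 4}]
detected = [{1, 2}, {3, 4}]
omega = omega_index([1, 2, 3, 4], truth, detected)
print(omega)
```

This brute-force loop is quadratic in the number of nodes, so it is only meant for small illustrative inputs.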

Normalized Mutual Information [12]: We compute $1 - \frac{1}{2}\bigl(H(C^*|\hat{C}) + H(\hat{C}|C^*)\bigr)$, where $H(A|B)$ is the extension of entropy to the case where $A, B$ are sets of sets [12].

Accuracy in the number of communities: $1 - \frac{\bigl||C^*| - |\hat{C}|\bigr|}{|C^*|}$, that is, one minus the relative error in predicting the number of communities.

A.3 Applying AGM to Social, Product, and Collaboration Networks

Figure 11a displays the composite performance of each of the five methods over the six networks with ground-truth communities. Overall, we notice that AGM gives the best composite performance on all networks except Amazon, where it ties with MMSB. Furthermore, AGM detects the highest-quality communities for most individual performance metrics in all networks. On average, the composite performance of AGM is 3.40, which is 61% higher than that of Link Clustering (2.10), 50% higher than that of CPM (2.41), 30% higher than that of Infomap, and 8% higher than that of MMSB (3.25). The absolute average value of the Omega Index of AGM over the six networks is 0.46, which is 21% higher than Link Clustering (0.38), 22% higher than CPM (0.37), 5% higher than Infomap (0.44), and 26% higher than MMSB (0.36).

In terms of absolute values of scores, AGM achieves an average F1 score of 0.57, an average Omega Index of 0.46, a Normalized Mutual Information of 0.15, and an accuracy in the number of communities of 0.42. We also note that AGM outperforms CPM with other values of k (k = 3, 4, 6).

A.4 Applying AGM to Biological Networks

We also evaluate the performance of AGM on the four types of protein-protein interaction (PPI) networks of Saccharomyces cerevisiae [6]. As performance metrics, we compute the average statistical significance of detected communities (p-value) for the three types of Gene Ontology (GO) terms (biological process, cellular component, and molecular function) [41]. We consider the negative logarithm of the average p-values for each of the three GO term types as three separate scores.

Figure 11b displays the composite performance in the four PPI networks. We observe that AGM attains the best composite performance in all four networks. On average, the composite performance of AGM is 3.00, which is 150% higher than that of Link Clustering (1.20), 163% higher than that of CPM (1.14), 148% higher than that of Infomap (1.21), and 12 times higher than that of MMSB (0.08). We further investigated the poor performance of MMSB on these networks and found that it is due to the fact that MMSB tends to find very large communities consisting of more than 80% of the nodes.

REFERENCES

[1] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.

[2] N. Krogan, G. Cagney, H. Yu, et al., “Global landscape of protein complexes in the yeast Saccharomyces cerevisiae,” Nature, vol. 440, no. 7084, pp. 637–643, 2006.

[3] S. Wasserman and K. Faust, Social Network Analysis. Cambridge University Press, 1994.

[4] G. Flake, S. Lawrence, C. Giles, and F. Coetzee, “Self-organization and identification of web communities,” Computer, vol. 35, no. 3, pp. 66–71, 2002.

[5] M. Newman, Networks: An Introduction. Oxford University Press, Inc., 2010.

[6] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal multi-scale complexity in networks,” Nature, vol. 466, pp. 761–764, Oct. 2010.

[7] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” Journal of Machine Learning Research, vol. 9, pp. 1981–2014, 2007.

[8] M. Sales-Pardo, R. Guimera, A. Moreira, and L. A. N. Amaral, “Extracting the hierarchical organization of complex systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, pp. 18 874–18 874, 2007.

[9] I. Psorakis, S. Roberts, M. Ebden, and B. Sheldon, “Overlapping community detection using bayesian non-negative matrix factorization,” Physical Review E, vol. 83, p. 066114, 2011.

[10] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.

[11] T. S. Evans and R. Lambiotte, “Line graphs, link partitions, and overlapping communities,” Physical Review E, vol. 80, p. 016105, 2009.

[12] A. Lancichinetti and S. Fortunato, “Community detection algorithms: A comparative analysis,” Physical Review E, vol. 80, no. 5, p. 056117, 2009.

[13] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.

[14] M. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, p. 026113, 2004.

[15] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela, “Community structure in time-dependent, multiscale, and multiplex networks,” Science, vol. 328, no. 5980, pp. 876–878, 2010.

[16] C. Granell, S. Gomez, and A. Arenas, “Hierarchical multiresolution method to overcome the resolution limit in complex networks,” International Journal of Bifurcation and Chaos, vol. 22, no. 7, 2012.

[17] B. Ball, B. Karrer, and M. E. J. Newman, “Efficient and principled method for detecting communities in networks,” Physical Review E, vol. 84, p. 036103, 2011.

[18] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth communities,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2012, pp. 745–754.

[19] J. Yang and J. Leskovec, “Community-affiliation graph model for overlapping community detection,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2012.

[20] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex networks reveal community structure,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, pp. 1118–1123, 2008.

[21] S. P. Borgatti and M. G. Everett, “Models of core/periphery structures,” Social Networks, vol. 21, pp. 375–395, 1999.

[22] P. Holme, “Core-periphery organization of complex networks,” Physical Review E, vol. 72, p. 046111, 2005.

[23] F. D. Rossa, F. Dercole, and C. Piccardi, “Profiling core-periphery network structure by random walkers,” Scientific Reports, vol. 3, 2013.

[24] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,” Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.

Fig. 11: The composite performance of the community detection methods on: (a) six networks with externally labeled ground-truth communities (social, collaboration, and product networks) and (b) four biological networks (PPI (Y2H), PPI (AP/MS), PPI (LC), PPI (All)). [Bar charts omitted; y-axis: composite performance. Methods: A, AGM; L, Link Clustering; C, Clique Percolation; M, Mixed-Membership Stochastic Block Model; I, Infomap. Measures in (a): Normalized Mutual Information, Number of Communities, Ω-index, F1-score. Measures in (b): Biological Process, Molecular Function, Cellular Component.]

[25] A. Clauset, M. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, p. 066111, 2004.

[26] M. Coscia, G. Rossetti, F. Giannotti, and D. Pedreschi, “Demon: a local-first discovery method for overlapping communities,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2012, pp. 615–623.

[27] J. Leskovec, K. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in Proceedings of the International Conference on World Wide Web (WWW), 2010.

[28] D. Watts and S. Strogatz, “Collective dynamics of small-world networks,” Nature, vol. 393, pp. 440–442, 1998.

[29] J. A. Davis, “Clustering and Structural Balance in Graphs,” Human Relations, vol. 20, pp. 181–187, 1967.

[30] M. S. Granovetter, “The strength of weak ties,” American Journal of Sociology, vol. 78, pp. 1360–1380, 1973.

[31] A. Clauset, C. Moore, and M. Newman, “Hierarchical structure and the prediction of missing links in networks,” Nature, vol. 453, no. 7191, pp. 98–101, 2008.

[32] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 9, pp. 2658–2663, 2004.

[33] J. Yang and J. Leskovec, “Community detection in networks with node attributes,” in ICDM '13: Proceedings of the IEEE International Conference on Data Mining, 2013.

[34] J. Yang, J. McAuley, and J. Leskovec, “Detecting cohesive and 2-mode communities in directed and undirected networks,” in WSDM '14: Proceedings of the ACM International Conference on Web Search and Data Mining, 2014.

[35] J. Yang and J. Leskovec, “Structure and overlaps of communities in networks.” Stanford InfoLab, Technical Report, October 2014. The data and code that were used for the experiments are available at http://snap.stanford.edu/agm.

[36] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual Review of Sociology, vol. 27, pp. 415–444, 2001.

[37] G. Simmel, Conflict: the Web of Group Affiliations. Trans. by Kurt H. Wolff and Reinhard Bendix. Free Press, 1955.

[38] R. L. Breiger, “The duality of persons and groups,” Social Forces, vol. 53, no. 2, pp. 181–190, 1974.

[39] S. Lattanzi and D. Sivakumar, “Affiliation networks,” in Proceedings of the 41st annual ACM Symposium on Theory of Computing, 2009, pp. 427–434.

[40] J. Yang and J. Leskovec, “Overlapping community detection at scale: A non-negative factorization approach,” in WSDM '13: Proceedings of the ACM International Conference on Web Search and Data Mining, 2013.

[41] E. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. Cherry, and G. Sherlock, “GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes,” Bioinformatics, vol. 20, no. 18, pp. 3710–3715, 2004.

[43] R. E. Ulanowicz, C. Bondavalli, and M. S. Egnotovich, “Network analysis of trophic dynamics in South Florida ecosystem, FY 97: The Florida Bay ecosystem,” Annual Report to the United States Geological Service Biological Resources Division, pp. 98–123, 1998.

[44] S. Lehmann, M. Schwartz, and L. K. Hansen, “Biclique communities,” Phys. Rev. E, vol. 78, p. 016108, 2008.

[45] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.

[46] C. Lee, F. Reid, A. McDaid, and N. Hurley, “Detecting highly overlapping community structure by greedy clique expansion,” in Proceedings of the Fourth International Workshop on Advances in Social Network Mining and Analysis, 2010.

[47] H. Shen, X. Cheng, K. Cai, and M.-B. Hu, “Detect overlapping and hierarchical community structure in networks,” Physica A: Statistical Mechanics and its Applications, vol. 388, no. 8, pp. 1706–1712, 2009.

[48] B. Karrer and M. Newman, “Stochastic blockmodels and community structure in networks,” Physical Review E, vol. 83, p. 016107, 2010.

[49] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

[50] S. Gregory, “Fuzzy overlapping communities in networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2011, no. 02, p. P02017, 2011.


Stanford Infolab Technical Report

Overlapping Communities Explain Core-Periphery Organization of Networks

Jaewon Yang, Jure Leskovec∗

Stanford University

∗To whom correspondence should be addressed;

E-mail: [email protected], [email protected]

October 14, 2014


Contents

S1 Data description: Networks with ground-truth communities

S2 Empirical observation: Community overlaps have higher density of edges

S3 Consequences for present community detection approaches
    S3.1 Clique percolation
    S3.2 Link clustering
    S3.3 Stochastic block models
    S3.4 Other models of network communities

S4 Mathematical model of communities: the AGM model
    S4.1 The Community-Affiliation Graph Model
    S4.2 Flexibility of AGM
    S4.3 Community detection with AGM
    S4.4 Automatically finding the number of communities
    S4.5 AGM does not suffer from the “resolution” limit
    S4.6 Anecdotal comparison between AGM and the existing methods

S5 Experiments: Networks with ground-truth communities
    S5.1 Experimental setup
    S5.2 Methods for comparison
    S5.3 Evaluation metrics
    S5.4 Results
    S5.5 Experiments on modeling the network structure

S6 Experiments: Small networks

S7 Experiments: Biological networks
    S7.1 Dataset description
    S7.2 Evaluation metrics
    S7.3 Results

S8 Experiments: Networks in Ahn et al.
    S8.1 Dataset description
    S8.2 Evaluation metrics
    S8.3 Results

S9 Overlapping communities give rise to core-periphery network structure
    S9.1 Community overlaps lead to global core-periphery structure
    S9.2 Comparison to other notions of core-periphery

A Appendix
    A.1 Raw performance scores of the experiments with ground-truth communities
    A.2 Raw performance scores of the experiments with biological networks
    A.3 Raw performance scores of the experiments in Ahn et al.

S1 Data description: Networks with ground-truth communities

One of the challenges of research on network community detection is the evaluation of the obtained network communities. The main difficulty stems from the fact that labeled ground-truth communities require significant effort to obtain. Thus, it is very hard to quantify on a large scale the performance of a given community detection method. In this sense, identifying a diverse set of networks where ground-truth communities are explicitly labeled can have significant impact on the field.

We identified networks where nodes explicitly state their ground-truth community memberships. We first describe the source of labels for ground-truth communities and then argue why they correspond to “real” underlying communities.

We consider a set of 6 large social, collaboration, and information networks, where for each network we identify a graph and a set of explicitly labeled ground-truth communities. We did our best to identify networks in which such ground-truth communities can be reliably defined and identified based on functional roles of the nodes. In particular, we define ground-truth communities based on common affiliations, social circles, roles, activities, interests, functions, or some other properties around which networks organize into communities. Network sizes range from hundreds of thousands to hundreds of millions of nodes and billions of edges. Even though our networks come from a diverse set of domains and the labels of individual ground-truth communities may include some noise, the results are surprisingly robust and consistent.

Social networks. First we consider four very different online social networks (the LiveJournal blogging community [5], the Friendster online network [42], the Orkut social network [42], and the Youtube social network [42]) where users create explicit groups which other users then join. Such groups serve as organizing principles of nodes in social networks and are focused on specific topics, interests, hobbies, affiliations, and geographical regions. Groups range from small to very large. For example, LiveJournal categorizes groups into the following types: culture, entertainment, expression, fandom, life/style, life/support, gaming, sports, student life, and technology. Overall, there are over three hundred thousand explicitly defined groups in LiveJournal. Similarly, users in Friendster as well as in Orkut and Youtube define topic-based groups that others then join. The Friendster and Orkut networks have more than a million explicitly defined groups, and each user can join one or more groups. We consider each such explicitly created group as a ground-truth community.

The LiveJournal network was provided to us by Lars Backstrom [5], the Friendster network was made public by the Internet Archive¹, and the Orkut and Youtube networks were kindly provided to us by Alan Mislove [42].

                  Network statistics                                        Ground-truth communities
Network           N              E                〈C〉    〈D〉     〈k〉     K            S         A
LiveJournal [5]   4,036,538      34,916,684       0.36   6.57    17.30   311,782      40.06     3.09
Friendster [42]   117,751,379    2,586,147,869    0.21   5.98    43.93   1,449,666    26.72     0.33
Orkut [42]        3,072,441      117,185,083      0.17   5.28    76.28   8,455,253    34.86     95.93
Youtube [42]      1,138,873      2,990,443        0.17   6.28    5.25    30,087       9.75      0.26
DBLP [5]          425,957        1,348,244        0.61   6.57    6.33    2,547        429.79    2.57
Amazon [37]       334,863        925,872          0.43   12.98   5.53    49,732       99.86     14.83

Table S1: Networks with ground-truth communities. N: Number of nodes, E: Number of edges, 〈C〉: Average clustering coefficient [61], 〈D〉: Average shortest path length, 〈k〉: Average node degree. Properties of ground-truth communities: K: Number of communities, S: Average community size, A: Community memberships per node. Additional networks used in this study are described in Table S2. All our networks are complete and publicly available at http://snap.stanford.edu/agm.

Amazon product co-purchasing network. The second type of network data we consider is the Amazon product co-purchasing network [37]. The nodes of the network represent products, and edges link commonly co-purchased products. Each product (i.e., node) belongs to one or more hierarchically organized product categories, and products from the same category define a group which we view as a ground-truth community. This means that members of the same community share a common function or role, and each level of the product hierarchy defines a set of hierarchically nested and overlapping communities. We crawled this network using the Amazon API [37].

DBLP collaboration network. We consider the collaboration network of DBLP [5], where nodes represent authors and edges connect authors who have co-authored a paper. We use publication venues as ground-truth communities, which serve as proxies for highly overlapping scientific communities around which the network then organizes. This network was provided to us by Lars Backstrom.

Ground-truth network characteristics. Table S1 gives the dataset statistics. Observe that the size of the networks ranges from hundreds of thousands to hundreds of millions of nodes and billions of edges. The number of ground-truth communities varies from thousands to millions, and there is also a range in ground-truth community sizes and node membership distribution.

All our networks are complete and are publicly available at http://snap.stanford.edu/agm. For each of these networks we identified a sensible way of defining ground-truth communities that serve as organizational units of these networks.

Even though our networks come from very different domains and individual labels may be noisy or even incomplete, the results we present here are robust and consistent across all the datasets. Our work is consistent with the premise that is implicit in all network community literature: members of “real” communities share some (latent/unobserved) property or affiliation that serves as an organizing principle of the nodes and makes them well-connected in the network. Here we use groups around which communities organize to explicitly define ground-truth.

¹ http://www.archive.org/details/friendster-dataset-201107

Data preprocessing. To represent all networks in a consistent way, we drop edge directions and consider each network as an unweighted, undirected, static graph. Because members of a group may be disconnected in the network, we consider each connected component of the group as a separate ground-truth community. However, we allow ground-truth communities to be nested and to overlap (i.e., a node can be a member of multiple groups at once).
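The preprocessing step can be sketched as follows: build an undirected adjacency map from (possibly directed) edges, then split each group into the connected components of the subgraph it induces. The function names and the toy data are hypothetical and serve only as an illustration of the procedure described above.

```python
from collections import deque


def undirected_adjacency(edges):
    """Drop edge directions: store every edge in both directions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def split_group_into_components(adj, group):
    """Connected components of the subgraph induced by the group's members."""
    members = set(group)
    seen = set()
    components = []
    for start in sorted(members):
        if start in seen:
            continue
        # Breadth-first search restricted to group members.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


# Hypothetical toy data: a path 1-2-3 and a separate edge 4-5.
adj = undirected_adjacency([(1, 2), (3, 2), (4, 5)])
print(split_group_into_components(adj, {1, 2, 4, 5}))  # two components
print(split_group_into_components(adj, {1, 3}))        # 1 and 3 only meet via 2
```

Note that connectivity is checked inside the group only, so two members linked solely through a non-member end up in different ground-truth communities.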

Community size and membership size distribution. Next we present the distribution of the various properties of ground-truth communities. Our goal here is to investigate properties of ground-truth communities and demonstrate that such sets of nodes in fact correspond to “real” network communities.

Previous literature found that the size of communities, i.e., the number of nodes in a community, follows a heavy-tailed distribution [1, 47, 65]. Figure S1 shows the CCDF (complementary cumulative distribution function) of the sizes of ground-truth communities in the 6 networks. The distribution appears to be heavy-tailed, and for LiveJournal, YouTube, and Amazon it appears to follow a power law.
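For reference, an empirical CCDF of community sizes can be computed in a few lines. The sketch below assumes the convention Fc(s) = P(S ≥ s); the text does not state whether the inequality is strict, so this choice is an assumption, and the sample sizes are hypothetical.

```python
from collections import Counter


def empirical_ccdf(values):
    """Map each observed value x to the fraction of samples that are >= x."""
    counts = Counter(values)
    total = len(values)
    ccdf = {}
    remaining = total  # number of samples >= the current value
    for x in sorted(counts):
        ccdf[x] = remaining / total
        remaining -= counts[x]
    return ccdf


# Hypothetical community sizes.
sizes = [2, 2, 3, 5]
size_ccdf = empirical_ccdf(sizes)
print(size_ccdf)
```

Plotting these pairs on log-log axes gives curves of the kind shown in Figure S1, where an approximately straight line suggests power-law behavior.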

Figure S2 shows the CCDF of the number of communities that a node is a member of. We observe that it exhibits a power-law decay, although the distributions do not show a long tail in some datasets such as Orkut, DBLP, and Amazon. This is in accordance with Palla et al. [47], who reported that the distribution of node memberships, i.e., the number of communities that a node belongs to, tends to follow a power law.

Last, we also examine the statistics of the community overlaps. We focus on overlaps between pairs of ground-truth communities and report the absolute and fractional size of the overlap between two communities. Figure S3 shows the distribution of the absolute overlap sizes. We observe that the distributions follow a power law, as also observed by Palla et al. [47] on detected (not ground-truth) communities. In addition, we investigate how ground-truth communities overlap: Do they overlap in a nested structure, or do they overlap only in a small fraction of their members? To answer this, we measure the ratio f of the size of the overlap A ∩ B between two communities A, B to the size of the smaller community, f = |A ∩ B| / min(|A|, |B|). A value of f close to 1 means a nested structure where the larger community includes the smaller one, and a small f means that the communities overlap only at the fringe. Figure S4 plots the distribution of the overlap fraction f. The Amazon network shows high probability at f = 1 because its ground-truth communities form a nested structure by construction. In the social networks and the DBLP network, most overlaps take up a small fraction of the individual communities, which is reasonable as each community has its own special interests.
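The overlap fraction f defined above is straightforward to compute for a pair of communities. The following minimal sketch uses a hypothetical function name and hypothetical example sets.

```python
def overlap_fraction(community_a, community_b):
    """f = |A ∩ B| / min(|A|, |B|) for two non-empty node sets."""
    a, b = set(community_a), set(community_b)
    return len(a & b) / min(len(a), len(b))


# Nested case: the smaller community lies entirely inside the larger one.
nested = overlap_fraction({1, 2, 3, 4}, {3, 4})
# Fringe case: the communities share a single node.
fringe = overlap_fraction({1, 2, 3, 4}, {4, 5, 6})
print(nested, fringe)
```

Normalizing by the smaller community is what makes f = 1 correspond exactly to nesting, regardless of how large the enclosing community is.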


Figure S1: Ground-truth community size distribution. Complementary cumulative distribution function Fc(s) of the size of ground-truth communities, s. The size of a ground-truth community denotes the number of nodes belonging to the ground-truth community. [Log-log plots omitted. Panels: (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon. x-axis: s, Community Size; y-axis: Fc(s), Complementary CDF.]

Figure S2: Node membership distribution. Complementary cumulative distribution function Fc(m) of the number of communities m that nodes belong to. [Log-log plots omitted. Panels: (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon. x-axis: m, Memberships; y-axis: Fc(m), Complementary CDF.]


[Figure S3. Panels: (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon. Log-log axes: x = o, overlap size; y = Fc(o), complementary CDF.]

Figure S3: Community overlap distribution. Complementary cumulative distribution function Fc(o) of the size of overlaps between pairs of ground-truth communities, o. The size of an overlap is the number of nodes that belong to the overlap.

[Figure S4. Panels: (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon. Axes: x = f, fraction of overlap; y = p(f), probability.]

Figure S4: Relative community overlap. Histogram (probability) p(f) of the relative overlap size f. When ground-truth communities A, B have overlap A ∩ B, then f = |A ∩ B| / min(|A|, |B|), where min(x, y) is the smaller of x and y.


S2 Empirical observation: Community overlaps have higher density of edges

The availability of reliable ground-truth communities allows us to empirically study their structure and to better understand how nodes organize themselves into communities. We analyze how nodes of ground-truth communities connect to each other, how they connect to the rest of the network, and how they overlap. This way we can empirically study, on a large scale, how real communities map onto the underlying network structure. Based on the empirical findings presented here, we later develop a novel model and a new method for detecting overlapping communities in networks.

Empirical observation. Communities in networks overlap, in that nodes belong to multiple communities simultaneously. As demonstrated in the previous section, ground-truth communities overlap and many nodes belong to multiple communities at once (Figures S2 and S3).

We study the structure of community overlaps by asking: what is the probability that a pair of nodes is connected, given that they share membership in k common ground-truth communities? Figure S5 plots this probability for all six data sets for which we have ground-truth community data.
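The measurement described above can be sketched as follows. This is an illustrative brute-force version that enumerates all node pairs, so it is only practical for small graphs; the function name `edge_prob_by_shared` and the toy input are assumptions made for this example and do not reflect how the statistics were computed on the full data sets.

```python
from collections import Counter
from itertools import combinations


def edge_prob_by_shared(nodes, edges, communities):
    """Estimate P(k): fraction of node pairs sharing k communities that are linked."""
    membership = {u: set() for u in nodes}
    for cid, members in enumerate(communities):
        for u in members:
            membership[u].add(cid)
    edge_set = {frozenset(e) for e in edges}
    pairs, linked = Counter(), Counter()
    for u, v in combinations(nodes, 2):
        k = len(membership[u] & membership[v])  # number of shared communities
        pairs[k] += 1
        if frozenset((u, v)) in edge_set:
            linked[k] += 1
    return {k: linked[k] / pairs[k] for k in sorted(pairs)}


# Hypothetical toy input: two communities that overlap in nodes 1 and 2.
P = edge_prob_by_shared(
    nodes=[0, 1, 2, 3],
    edges=[(0, 1), (1, 2)],
    communities=[{0, 1, 2}, {1, 2, 3}],
)
print(P)
```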

We notice that all curves are steeply increasing. This means that the more communities a pair of nodes has in common, the higher the probability that they are connected. Notice that the effect of shared memberships on the edge probability is very strong. For example, in LiveJournal, if a pair of nodes has 4 communities in common, the probability of friendship is nearly 50%. To appreciate how strong the effect of shared communities is on the edge probability, one has to note that all of our networks are extremely sparse. The background probability of a random pair of nodes being connected is ≈ $10^{-5}$, while as soon as a pair of nodes shares two communities, their probability of linking increases by 4 orders of magnitude (from $10^{-5}$ to $10^{-1}$).

We note that all other data sets exhibit similar behavior: the probability of a pair of nodes being connected approaches 1 as the number of common communities increases. While in online social networks the edge probability exhibits a diminishing-returns-like growth, in DBLP it appears to follow a threshold-like behavior.

In retrospect, the above result is very intuitive: people sharing multiple interests have a higher chance of becoming friends [41], researchers with many common interests are more likely to work together [49], and proteins belonging to multiple common functional modules are more likely to interact [22, 32].

Implications for community detection. Our finding in Figure S5 suggests that communities overlap as illustrated in Figure S6a. In particular, we can think of communities as being analogous to tiles: just as overlapping tiles lead to a greater thickness of tiles, community overlaps lead to a higher density of edges.

Even though this notion of network communities is very intuitive, it is also very different from the present literature, which mostly defines communities as clusters of densely connected nodes. In particular, the predominant view of network communities today is based on two fundamental social network processes: triadic closure [61] and the “strength of weak ties” [25]. This leads to the picture of network communities illustrated in Figure S6b. Applying this view to the case of overlapping communities leads to the (arguably unnatural) structure of community overlaps illustrated in Figure S6c. On the other hand, our novel view of network communities as overlapping tiles is consistent with the work of Simmel [57] on the web of affiliations and of Feld [19] on the focused organization of social ties. In both of these views, networks consist of overlapping “tiles” or “social circles” that serve as organizing principles of nodes in networks.

[Figure S5. Panels: (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon. Axes: x = k, number of shared communities; y = P(k), edge probability.]

Figure S5: Community overlaps are more densely connected than the non-overlapping parts of communities. Edge probability P(k) between two nodes given that the two nodes share k communities. We observe that P(k) is an increasing function of k in all the networks. For the purpose of this plot, we use the 5,000 ground-truth communities that are most cohesive in each data set [62].


Figure S6: Three different definitions of network communities. Three networks (top) and the corresponding adjacency matrices (bottom). We find (a) that as nodes share multiple communities, they are more likely to link, which leads to densely connected community overlaps. However, most existing community detection methods assume either that (b) communities do not overlap or that (c) community overlaps are less well connected than the non-overlapping parts of communities. Moreover, most existing community detection methods cannot properly detect communities with overlaps as in (a).


[Figure S7. Panels: (a) $k_{CPM} = 3$, (b) $k_{CPM} = 4$.]

Figure S7: Clique percolation method cannot detect communities with dense overlaps. Given a network with two communities and a dense overlap, the Clique percolation method reports, depending on the parameter setting, either a single community that includes both communities (a) or a small community consisting only of the overlap (b).

S3 Consequences for present community detection approaches

Having established that communities overlap as illustrated in Figure S6a, we next show that present state-of-the-art community detection methods fail to properly detect communities with such overlaps. In particular, we show that three state-of-the-art overlapping community detection methods, Clique percolation [47, 36], Link clustering [1], and the Stochastic block model [2, 28], all fail to properly detect communities with dense overlaps (Figure S6a).

S3.1 Clique percolation

First we analyze the Clique percolation method and show that it fails to properly detect two overlapping communities as illustrated in Figure S6a.

The Clique percolation method (CPM) has a single input parameter k, which determines the size of the cliques that the algorithm looks for. After finding all the k-cliques in the given network, the method merges two k-cliques if they share k − 1 nodes. Overlaps arise when the nodes in an overlap belong to multiple k-cliques that cannot be merged. When an overlap is denser, however, the nodes in the overlap form many k-cliques among themselves, and these k-cliques are likely to merge together. In this case, the method either identifies the overlap as a separate community or merges the adjacent communities through the k-cliques in the overlap.

For example, Figure S7 shows the result of CPM on the network of Figure S6a, where the overlap between the two communities is denser than the individual communities. When k = 3, CPM finds a single community that covers the whole network, because the cliques in the overlap connect the cliques in the left community with those in the right community. When k = 4, CPM finds only a community consisting of the overlap.
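The behavior described above can be reproduced on a small hand-built graph. The sketch below is a brute-force illustration of clique percolation (enumerate all k-cliques, then merge cliques that share k − 1 nodes with a union-find structure); it is not the CPM implementation used in the experiments, and the 12-node toy graph is a hypothetical stand-in for the network of Figure S6a, with a 4-clique overlap {4, 5, 6, 7} between two sparser communities.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Brute-force clique percolation: merge k-cliques that share k - 1 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Enumerate all k-cliques (feasible only for small graphs).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy graph: left part {0..3}, dense overlap {4..7}, right part {8..11}.
left = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
overlap = [(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
right = [(11, 10), (11, 9), (10, 9), (10, 8), (9, 8), (9, 7), (8, 7), (8, 6)]
toy_edges = left + overlap + right

print(k_clique_communities(toy_edges, 3))  # one community spanning all 12 nodes
print(k_clique_communities(toy_edges, 4))  # only the overlap is reported
```

In neither setting does the method report two communities that both contain the overlap nodes.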

In addition to the Clique percolation method, there are many other overlapping community detection methods that are based on expanding maximal cliques. These methods (for example, Greedy clique expansion [35] and EAGLE [56]) suffer from the same problem.


Figure S8: Link clustering cannot detect communities with dense overlaps. A network with two overlapping communities and the outcome of Link clustering. Link clustering builds the dendrogram shown with solid lines, which finds the overlap as a separate community. The merger of the overlap with the single-community regions (the right and the left communities), shown by the dotted lines in the dendrogram, cannot happen because Link clustering stops once the partition density decreases.

By the same reasoning that we used for Clique percolation, we can see that none of these clique expansion methods is able to discover densely overlapping communities. For example, in the networks in Figure S7, neither EAGLE nor Greedy clique expansion can correctly identify the red overlapping nodes. As all the red nodes form a maximal clique, both methods regard the red nodes as a single community, and there is no way to tell that the red nodes belong to more than one community.

S3.2 Link clustering

Next we show that Link clustering [1] suffers from problems similar to those of Clique percolation. Link clustering performs hierarchical clustering on the edges of the given network. For each pair of edges (i, k) and (j, k) that share a single node k, Link clustering computes the Jaccard similarity JAC(n(i), n(j)) between the sets of neighbors n(i) and n(j) of nodes i and j, and builds a dendrogram by repeatedly merging the pair of edges with the highest Jaccard similarity. Finally, Link clustering cuts the dendrogram at the point where the partition density, a quality function proposed in [1], is maximized. In the following, we show that Link clustering does not discover the true communities when their overlaps are more densely connected than each individual community.
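The edge-pair similarity described above can be sketched as follows. The code only computes the similarities that determine the order of the first merges; it does not build the full dendrogram or evaluate the partition density. It follows the description in this section and uses plain neighbor sets n(i); the function names and the toy adjacency structure are assumptions made for this illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_pair_similarities(adj):
    """Score every pair of edges (i, k), (j, k) sharing node k by JAC(n(i), n(j)).

    Returns a list of ((i, k), (j, k), similarity), most similar pair first.
    """
    scored = []
    for k, nbrs in adj.items():
        for i, j in combinations(sorted(nbrs), 2):
            scored.append(((i, k), (j, k), jaccard(adj[i], adj[j])))
    return sorted(scored, key=lambda t: -t[2])


# Hypothetical toy graph: the path 0 - 1 - 2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(edge_pair_similarities(path))
```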


We consider a network with two overlapping communities A and B with overlap O (Figure S8). Let A and B each contain X + Y nodes and let O contain X nodes, so that the total number of nodes in the network is X + 2Y. Moreover, assume that the nodes in an individual community are connected with probability p, and that the nodes in the overlap have a higher probability of being connected, say 2p. We consider the case where the number of nodes in the overlap is not larger than the number of nodes belonging to only a single community (X ≤ Y).

Now consider the Jaccard similarity that Link clustering computes between the neighbors of nodes u and v. Without loss of generality, we have one of the following four cases:

• (1) u ∈ O and v ∈ O

• (2) u ∈ A \O and v ∈ A \O

• (3) u ∈ A \O and v ∈ O

• (4) u ∈ A \O and v ∈ B \O

We will now show that the Jaccard similarity between a pair of edges in case (1) is higher than in case (2), and that in case (2) it is at least as high as in case (3), which is naturally higher than in case (4). This means that Link clustering first merges the edges between nodes in O, and only then merges the edges between nodes in A \ O and the edges between nodes in B \ O. Last, Link clustering merges the edges with one endpoint in O and the other in A \ O (or B \ O). This process produces the dendrogram illustrated in Figure S8. In particular, this means that regardless of where one cuts the dendrogram, Link clustering fails to correctly identify the community structure of the simple network in Figure S8.

To show this more formally, we proceed as follows. Consider nodes $a_1, a_2 \in A \setminus O$ and nodes $o_1, o_2 \in O$, and their neighbor sets $n(a_1)$, $n(a_2)$, $n(o_1)$, $n(o_2)$. We show that, in expectation, the following holds:

$$\frac{|n(o_1) \cap n(o_2)|}{|n(o_1)| + |n(o_2)|} > \frac{|n(a_1) \cap n(a_2)|}{|n(a_1)| + |n(a_2)|} \ge \frac{|n(a_1) \cap n(o_1)|}{|n(a_1)| + |n(o_1)|}$$

The above inequalities are equivalent to $\mathrm{JAC}(n(o_1), n(o_2)) > \mathrm{JAC}(n(a_1), n(a_2)) \ge \mathrm{JAC}(n(a_1), n(o_1))$, because $\mathrm{JAC}(S, T) = r/(1 - r)$ with $r = |S \cap T| / (|S| + |T|)$, which is increasing in $r$. In expectation, we have $|n(o_1)| = |n(o_2)| = 2pX + 2pY$ and $|n(a_1)| = |n(a_2)| = pX + pY$, and we aim to compute the expected sizes of the intersections between $n(o_1)$, $n(o_2)$, $n(a_1)$, and $n(a_2)$. For example, $|n(o_1) \cap n(o_2)|$ is $4p^2X + 2p^2Y$ in expectation, because a given node in the single-community regions ($2Y$ nodes) is a common neighbor of $o_1$ and $o_2$ with probability $p^2$, and a given node in the overlap ($X$ nodes) with probability $(2p)^2$. By the same logic, $|n(a_1) \cap n(a_2)| = p^2X + p^2Y$ and $|n(a_1) \cap n(o_1)| = 2p^2X + p^2Y$. From these, we derive the following:

$$\frac{|n(o_1) \cap n(o_2)|}{|n(o_1)| + |n(o_2)|} = p\,\frac{2X + Y}{2X + 2Y} > \frac{p}{2} = \frac{|n(a_1) \cap n(a_2)|}{|n(a_1)| + |n(a_2)|} \ge p\,\frac{2X + Y}{3X + 3Y} = \frac{|n(a_1) \cap n(o_1)|}{|n(a_1)| + |n(o_1)|},$$

where the last inequality, $1/2 \ge (2X + Y)/(3X + 3Y)$, follows from our assumption that $X \le Y$. Therefore, Link clustering yields the dendrogram in Figure S8, which first merges edges in $O$, then merges edges in the two non-overlapping parts ($A \setminus O$ and $B \setminus O$), and only then merges edges between the overlapping and the non-overlapping parts ($(O, A \setminus O)$ and $(O, B \setminus O)$).
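As a quick numerical check of the chain of inequalities above, the following sketch plugs the expected degrees and expected common-neighbor counts from the text into the three ratios. The function name `expected_ratios` and the values X = 10, Y = 20, p = 0.1 are arbitrary choices made for illustration.

```python
def expected_ratios(X, Y, p):
    """Return the three ratios |n(.) ∩ n(.)| / (|n(.)| + |n(.)|) in expectation.

    Order: (overlap, overlap), (single, single), (single, overlap).
    """
    deg_o = 2 * p * X + 2 * p * Y        # expected degree of a node in O
    deg_a = p * X + p * Y                # expected degree of a node in A \ O
    common_oo = 4 * p**2 * X + 2 * p**2 * Y
    common_aa = p**2 * X + p**2 * Y
    common_ao = 2 * p**2 * X + p**2 * Y
    return (common_oo / (2 * deg_o),
            common_aa / (2 * deg_a),
            common_ao / (deg_a + deg_o))


r_oo, r_aa, r_ao = expected_ratios(X=10, Y=20, p=0.1)
print(r_oo, r_aa, r_ao)
```

For any X ≤ Y the first ratio is the largest and the third is the smallest, which is the merge order stated above.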


[Figure S9. Top: example network with nodes 1 to 12. Bottom: dendrogram produced by Link clustering, with the edges of the network as leaves.]

Figure S9: Example of Link clustering on densely overlapping communities. From the network at the top, Link clustering produces the dendrogram at the bottom. Since the method first groups the overlapping nodes (nodes 5, 6, 7, 8) together, it is unable to find that these overlapping nodes belong to both communities.

Figure S9 gives a concrete example of a network on which Link clustering gives counterintuitive results. Green nodes (1, 2, 3, 4) belong to one community, blue nodes (9, 10, 11, 12) belong to the second community, and red nodes (5, 6, 7, 8) belong to both communities. As our analysis showed, Link clustering first merges the edges inside the overlap (between the red nodes) and then merges the edges in each non-overlapping part (between the green nodes or between the blue nodes). Consequently, Link clustering identifies the overlap of the communities as a separate community and is unable to find that the overlapping nodes belong to two communities at the same time.

S3.3 Stochastic block models

Last, we briefly show that various kinds of stochastic block models [2, 28, 31] are also unable to correctly discover communities with dense overlaps. We show this for three variants: the traditional Stochastic block model [28], the Degree-corrected stochastic block model [31], and the Mixed-membership stochastic block model [2].

The Stochastic block model [28] partitions a network into disjoint blocks and assigns an edge probability to each block. The only way for the model to increase the edge probability among the nodes in a community overlap is to regard the overlap as a separate community with higher edge density than the individual non-overlapping parts of communities. For example, Figure S10 illustrates the adjacency matrix of the network from Figure S6a and the block structure discovered by the Stochastic block model. Instead of two overlapping communities, three communities are discovered.

The Degree-corrected stochastic block model [31] relaxes the assumption that nodes in the same community have similar degrees. By assigning a high degree to nodes in community overlaps, it might be possible to extend the model to increase the edge probability between overlapping nodes without treating the overlap as a separate community. However, the present version of the model assumes a non-overlapping community structure, and thus it cannot tell that the overlapping nodes belong to multiple communities.

Last, the Mixed-membership stochastic block model [2] can discover overlapping communities. However, the model cannot express dense community overlaps. In fact, based on the input matrix from Figure S6a, the model identifies three blocks, as illustrated in Figure S10. The reason is that the edge probability between two nodes that both belong to communities A and B is a weighted average of P(A, A), P(A, B), and P(B, B), where P(X, Y) is the edge probability between a node in community X and a node in community Y. This means that the edge probability between two nodes that share multiple communities cannot exceed the maximum of P(A, A) and P(B, B) (due to the weighted averaging). Therefore, the edge probability between overlapping nodes cannot be higher than the edge probability between nodes in an individual community.
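The weighted-average argument can be illustrated numerically. The sketch below assumes that the edge probability between two nodes is the bilinear form of their membership vectors and the block matrix, which is our reading of the "weighted average" described above rather than a statement about any particular implementation. The function name `mmsb_edge_prob`, the block matrix, and the membership vectors are hypothetical.

```python
def mmsb_edge_prob(pi_u, pi_v, P):
    """Weighted average of block probabilities P[a][b] with weights pi_u[a] * pi_v[b].

    pi_u and pi_v are membership vectors that each sum to 1 (an assumption of
    this sketch).
    """
    return sum(pi_u[a] * pi_v[b] * P[a][b]
               for a in range(len(pi_u))
               for b in range(len(pi_v)))


# Hypothetical block matrix: dense within A and within B, sparse across.
P = [[0.30, 0.01],
     [0.01, 0.30]]
pure_a = [1.0, 0.0]   # node that belongs only to community A
mixed = [0.5, 0.5]    # node in the overlap of A and B

within_a = mmsb_edge_prob(pure_a, pure_a, P)
within_overlap = mmsb_edge_prob(mixed, mixed, P)
print(within_a, within_overlap)
```

Under these assumptions a pair of overlap nodes links with lower probability than a pair of nodes inside a single community, which is the opposite of the pattern in Figure S5.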

S3.4 Other models of network communities

Our findings present a new way of thinking about network communities as overlapping tiles. In particular, in the same way that the thickness of tiles increases in the areas where the tiles overlap, the “thickness” (i.e., density) of edges increases in the areas where two or more communities overlap. To the best of our knowledge, our work is the first to make these observations and build this conceptual understanding. However, we were able to identify a small number of other community detection methods that should, in principle, be able to correctly identify densely overlapping communities (as illustrated in Figure S6a). We note that none of these methods has been specifically tested for its ability to detect densely overlapping communities.

The first method that is likely to be able to detect densely overlapping communities is the statistical model of network communities by Ball et al. [6]. In particular, Ball et al. present an extension of the stochastic block model in which node community memberships are not modeled by a multinomial distribution; instead, every node i maintains a factor $k_{i,c}$ that models the amount by which node i belongs to community c. This way one can think of node membership as being described by a non-normalized multinomial distribution, which allows for modeling an increased density of edges in community overlaps. However, due to the model's generality, the model inference method is not particularly scalable. Similarly, Mørup et al. [43] developed a non-parametric Bayesian multiple-membership latent feature model for networks in which the edges of the network are generated using a “soft-OR” function. Last, Gregory et al. [26] developed a heuristic method for network community detection that, according to our analysis, might be able to correctly identify densely overlapping communities.

S4 Mathematical model of communities: the AGM model

We just illustrated that the most commonly used state-of-the-art community detection methods fail to properly detect communities with dense overlaps. Thus, a new approach is needed.


Figure S10: The result of a Stochastic block model and the Mixed-membership stochastic block model on a network of two communities with dense overlap. The adjacency matrix of the network in Figure S6a is shown, and the bold lines denote the partitions discovered by the stochastic block models. See the main text for further discussion.

S4.1 The Community-Affiliation Graph Model

Next we develop a model-based community detection method that successfully detects overlapping, non-overlapping, as well as nested communities in networks. First, we present a simple conceptual model of network formation, which naturally leads to densely overlapping communities. Then, we describe a method to ‘fit’ the model to a given network and thus discover communities. The code of our approach to community detection is available at http://snap.stanford.edu/agm.

We build on Breiger's foundational work [11], which recognized that communities arise due to shared group affiliations [11, 19, 57]. We present the Community-Affiliation Graph Model (AGM), a simple probabilistic generative model for networks that captures the observed phenomena and reliably reproduces the organization of networks into communities and the structure of community overlaps [?].

Consider a pair of people who are members of several different interest-based communities. By having more interests in common, they are more likely to meet and link. Based on this example, we require two ingredients: a way to capture the memberships of nodes in communities, and a mechanism that gives nodes that share several communities multiple chances to create links with each other.

To capture the memberships of nodes in communities, we use a bipartite affiliation graph that links the nodes of the network to the communities that they belong to. We obtain the second ingredient by assigning a single parameter to each community. This parameter captures the probability that two nodes belonging to that community create an edge. Nodes belonging to multiple common communities get multiple chances to form links. Thus, naturally, the more communities a pair of nodes shares, the higher the probability of linking.

Figure S11a illustrates the essence of our model. We start with a bipartite graph where the nodes at the bottom represent the nodes of the social network and the nodes at the top represent communities. The edges between the nodes of the network (bottom) and the communities (top) indicate node-community memberships. We denote the bipartite affiliation graph as B(V, C, M), where V denotes the set of nodes of the original network G, C is a set of communities, and there is an edge (u, c) ∈ M from node u ∈ V to community c ∈ C if node u belongs to community c.

[Figure S11. Panels: (a) Community Affiliation Network, with communities A, B, C and parameters $p_A$, $p_B$, $p_C$; (b) Overlap.]

Figure S11: Community-Affiliation Graph Model. (a) Community-Affiliation Graph Model (AGM): circles represent communities and squares represent the nodes of the observed network. Edges represent node-community memberships. (b) Network generated by the Community-Affiliation Graph Model in (a). The AGM accurately models dense community overlaps (b).

Now, given the affiliation network B(V, C, M), we describe the process that generates the underlying network G(V, E). To achieve this, we need to specify the process that generates the edges E of G given the affiliation graph B. We consider a simple parameterization where we assign a parameter $p_c$ to every community c ∈ C. The parameter $p_c$ models the probability of an edge forming between two members of the community c. In other words, we generate an edge between each pair of nodes that belong to community c with probability $p_c$. Each community c creates edges independently. However, if two nodes have already been connected via some other common community membership, then the duplicate edge is not included in the graph G(V, E).

Definition 1. Let B(V, C, M) be a bipartite graph where V is a set of nodes, C is a set of communities, and an edge (u, c) ∈ M connects node u ∈ V to community c ∈ C if u belongs to community c in the network G. Also, let $\{p_c\}$ be a set of probabilities for all c ∈ C. Given the affiliation network B(V, C, M) and $\{p_c\}$, the Community-Affiliation Graph Model generates a graph G(V, E) where the node set V and the edge set E are defined as follows. For each pair of nodes u, v ∈ V, the AGM creates edge (u, v) ∈ E with probability p(u, v):

$$p(u, v) = 1 - \prod_{k \in C_{uv}} (1 - p_k), \qquad (1)$$

where $C_{uv} \subset C$ is the set of communities that u and v share ($C_{uv} = \{c \mid (u, c), (v, c) \in M\}$).

Note that this simple process already ensures that pairs of nodes that belong to multiple common communities are more likely to link. This is because nodes that share multiple community memberships receive multiple chances to create a link. For example, pairs of nodes in the overlap of communities A and B (but not in C) in Figure S11a get two chances to create an edge. First, they can be connected with probability $p_A$ (due to their membership in community A), and then also with probability $p_B$ (due to their membership in B). While pairs of nodes residing in the non-overlapping region of A link with probability $p_A$, nodes in the overlap link with probability $1 - (1 - p_A)(1 - p_B) = p_A + p_B - p_A p_B \ge p_A$, which means that overlaps of communities are more densely connected than the non-overlapping parts.

[Figure S12. Panels: (a) Non-overlapping, (b) Nested, (c) Overlapping.]

Figure S12: Flexibility of AGM. AGM allows for rich modeling of network communities: (a) non-overlapping, (b) nested, (c) overlapping. In (a) we can assume that nodes in disjoint communities connect with a small probability ε, which allows for sparse links between communities A and B.

In the formulation of Equation 1, the AGM does not account for edges between nodes that do not share any communities. To account for these, we assume that nodes that have no communities in common link with a small probability ε. In all our experiments we set $\varepsilon = 1/|V|^2$, where |V| is the number of nodes in the given network.
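The generative process of this section can be sketched in a few lines. The code below is an illustration written for this text and is not the implementation distributed at the URL above; the names `edge_probability` and `generate_agm`, the dictionary-based representation of the affiliation graph, and the toy memberships are all assumptions.

```python
import random
from itertools import combinations


def edge_probability(u, v, membership, p, eps=0.0):
    """Equation 1: p(u, v) = 1 - prod_{k in C_uv} (1 - p_k); eps if C_uv is empty."""
    shared = membership[u] & membership[v]
    if not shared:
        return eps
    no_edge = 1.0
    for c in shared:
        no_edge *= 1.0 - p[c]   # every shared community fails to create the edge
    return 1.0 - no_edge


def generate_agm(membership, p, eps=0.0, seed=0):
    """Sample an undirected edge set from the affiliation graph and {p_c}."""
    rng = random.Random(seed)
    nodes = sorted(membership)
    return {(u, v) for u, v in combinations(nodes, 2)
            if rng.random() < edge_probability(u, v, membership, p, eps)}


# Hypothetical affiliation graph: nodes 1 and 2 lie in the overlap of A and B.
membership = {0: {"A"}, 1: {"A", "B"}, 2: {"A", "B"}, 3: {"B"}}
p = {"A": 0.3, "B": 0.3}
eps = 1.0 / len(membership) ** 2   # the setting used in the text

print(edge_probability(0, 1, membership, p, eps))  # single shared community
print(edge_probability(1, 2, membership, p, eps))  # two shared communities
print(edge_probability(0, 3, membership, p, eps))  # no shared community
print(generate_agm(membership, p, eps))
```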

S4.2 Flexibility of AGM

Figure S12 illustrates the flexible nature of the Community-Affiliation Graph Model, which allows for modeling any combination of network community structures. Traditional non-overlapping communities can be modeled by an affiliation graph where each network node links to only a single community node (Figure S12a). Hierarchically nested communities can be modeled by an affiliation graph where subsets of network nodes belong to more and more specialized communities (Figure S12b). Communities with overlaps can be modeled by an affiliation graph where some nodes belong to multiple communities while other nodes belong to only one community (Figure S12c).

S4.3 Community detection with AGM

Having defined the AGM, we now explain how we detect network communities by fitting the AGM to a given network. Given a network G(V, E), we aim to detect communities by fitting the AGM (i.e., finding the affiliation graph B and parameters $\{p_c\}$) to the underlying network G by maximizing the likelihood $L(B, \{p_c\}) = P(G \mid B, \{p_c\})$:

$$\arg\max_{B, \{p_c\}} L(B, \{p_c\}) = \prod_{(u,v) \in E} p(u, v) \prod_{(u,v) \notin E} (1 - p(u, v)) \qquad (2)$$
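For concreteness, the logarithm of the likelihood in Equation 2 can be evaluated as in the following sketch. It enumerates all node pairs, so it is intended only for small networks, and it uses the probability ε for pairs without a shared community, as described in Section S4.1. The function name `log_likelihood` and the toy inputs are assumptions made for this illustration; probabilities are assumed to lie strictly between 0 and 1 so that every logarithm is finite.

```python
import math
from itertools import combinations


def log_likelihood(nodes, edges, membership, p, eps):
    """Log of Equation 2 for a given affiliation graph and parameters {p_c}."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in combinations(nodes, 2):
        shared = membership[u] & membership[v]
        if shared:
            p_uv = 1.0 - math.prod(1.0 - p[c] for c in shared)
        else:
            p_uv = eps
        if frozenset((u, v)) in edge_set:
            total += math.log(p_uv)
        else:
            total += math.log(1.0 - p_uv)
    return total


# Hypothetical toy network with two disconnected edges.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (2, 3)]
p = {"A": 0.9, "B": 0.9}
matching = {0: {"A"}, 1: {"A"}, 2: {"B"}, 3: {"B"}}     # communities follow the edges
mismatched = {0: {"A"}, 2: {"A"}, 1: {"B"}, 3: {"B"}}   # communities cut across them

print(log_likelihood(nodes, edges, matching, p, eps=0.01))
print(log_likelihood(nodes, edges, mismatched, p, eps=0.01))
```

The affiliation graph whose communities agree with the observed edges receives the higher log-likelihood.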


To maximize the likelihood L, we employ a coordinate ascent strategy in which we update $\{p_c\}$ with B fixed and then update B with $\{p_c\}$ fixed.

For now we assume that the number of communities K (K = |C|) is known. We later show how to determine K automatically. We start the process by generating a random affiliation graph B on |V| network nodes and K community nodes. We then proceed by updating $\{p_c\}$.

Updating $\{p_c\}$. With B fixed, we aim to find $\{p_c\}$ by solving the following optimization problem:

$$\arg\max_{\{p_c\}} \prod_{(u,v) \in E} \Big(1 - \prod_{k \in C_{uv}} (1 - p_k)\Big) \prod_{(u,v) \notin E} \Big(\prod_{k \in C_{uv}} (1 - p_k)\Big) \qquad (3)$$

with the constraints $0 \le p_c \le 1$. We now show that this can be converted into a convex optimization problem.

We maximize the logarithm of the likelihood and change variables with $e^{-x_k} = 1 - p_k$:

$$\arg\max_{\{x_c\}} \sum_{(u,v) \in E} \log\Big(1 - e^{-\sum_{k \in C_{uv}} x_k}\Big) - \sum_{(u,v) \notin E} \sum_{k \in C_{uv}} x_k$$

The constraints $0 \le p_c \le 1$ become $x_c \ge 0$. This is a convex optimization problem in $\{x_c\}$ and can thus be solved efficiently using convex optimization techniques.
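One simple way to solve the transformed problem is projected gradient ascent on $\{x_c\}$, sketched below. This is only an illustration of the change of variables; the text does not say which convex solver was used, and the fixed step size, the iteration count, and the function name `fit_edge_probs` are assumptions. Node pairs that share no community are left out, since their probability does not depend on $\{x_c\}$.

```python
import math


def fit_edge_probs(pairs, num_comms, steps=2000, lr=0.05):
    """Projected gradient ascent on the log-likelihood in the variables x_c.

    pairs: list of (shared, is_edge), where shared is a non-empty tuple of
    community indices common to a node pair and is_edge marks an observed edge.
    Returns the estimated probabilities p_c = 1 - exp(-x_c).
    """
    x = [1.0] * num_comms
    for _ in range(steps):
        grad = [0.0] * num_comms
        for shared, is_edge in pairs:
            s = sum(x[k] for k in shared)
            # d/dx_k log(1 - e^{-s}) = 1 / (e^s - 1) for edges; -1 for non-edges.
            w = 1.0 / math.expm1(s) if is_edge else -1.0
            for k in shared:
                grad[k] += w
        # Gradient step followed by projection onto x_c > 0.
        x = [max(1e-9, xk + lr * g) for xk, g in zip(x, grad)]
    return [1.0 - math.exp(-xk) for xk in x]


# Hypothetical input: one community with 6 member pairs, 3 of which are linked.
toy_pairs = [((0,), True)] * 3 + [((0,), False)] * 3
print(fit_edge_probs(toy_pairs, num_comms=1))
```

With a single community the estimate converges to the fraction of linked member pairs, here 0.5, which is the maximum-likelihood value in that special case.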

Updating B. Now, given fixed $\{p_c\}$, we aim to update B so as to maximize the likelihood. To this end we use the Metropolis-Hastings algorithm [14, 46], where we gradually update B using small local modifications. Given the current community affiliation graph B(V, C, M), we consider three kinds of local modifications to generate a new community affiliation graph B′(V, C, M′). As V and C remain the same, we just need to update M to M′:

• LEAVE: Choose a node-community pair (u, c) ∈ M uniformly at random and let M′ = M \ {(u, c)}.

• JOIN: Choose a node-community pair (u, c) ∉ M uniformly at random and let M′ = M ∪ {(u, c)}.

• SWITCH: Choose node-community pairs (u, c1) ∈ M and (u, c2) ∉ M uniformly at random and let M′ = (M \ {(u, c1)}) ∪ {(u, c2)}.

Once we have generated the new community affiliation graph B′, we decide whether to accept it using the Metropolis-Hastings rule: if B′ achieves a higher likelihood than B, we accept B′; otherwise, we accept B′ with probability $L(B', \{p_c\}) / L(B, \{p_c\})$.
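The three local modifications and the acceptance rule can be sketched as follows. The code represents M as a set of (node, community) pairs and works with log-likelihoods, so the acceptance probability L(B′)/L(B) becomes exp(log L(B′) − log L(B)). The function names `propose` and `accept`, the uniform choice among the three move types, and the toy inputs are assumptions of this sketch; the text does not specify how a move type is selected.

```python
import math
import random


def propose(M, nodes, comms, rng):
    """Return a modified copy of M using a LEAVE, JOIN, or SWITCH move."""
    move = rng.choice(["LEAVE", "JOIN", "SWITCH"])
    present = sorted(M)
    absent = sorted((u, c) for u in nodes for c in comms if (u, c) not in M)
    if move == "LEAVE" and present:
        return M - {rng.choice(present)}
    if move == "JOIN" and absent:
        return M | {rng.choice(absent)}
    if move == "SWITCH":
        # All ways for a node u to replace a current community c1 with a new c2.
        cands = [(u, c1, c2) for (u, c1) in present
                 for c2 in comms if (u, c2) not in M]
        if cands:
            u, c1, c2 = rng.choice(cands)
            return (M - {(u, c1)}) | {(u, c2)}
    return set(M)   # the chosen move was not applicable; keep M unchanged


def accept(log_l_new, log_l_old, rng):
    """Accept if the likelihood does not decrease, else with probability L'/L."""
    if log_l_new >= log_l_old:
        return True
    return rng.random() < math.exp(log_l_new - log_l_old)


rng = random.Random(0)
M = {(0, "A"), (1, "B")}                         # hypothetical current memberships
M_new = propose(M, nodes=[0, 1, 2], comms=["A", "B"], rng=rng)
print(sorted(M_new), accept(-10.0, -12.0, rng))
```

In a full fitting loop, the log-likelihoods passed to `accept` would be those of Equation 2 evaluated for B′ and B with the current $\{p_c\}$.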

We experimented with many synthetic and real-world networks and found that the Markov chain for fitting the AGM converges relatively quickly: the likelihood does not increase after roughly $O(|V|^2)$ steps. Although this is not a rigorous performance guarantee, the results show that the fitting method works quite well in practice. The algorithm can fit the AGM to networks with up to a few thousand nodes in a reasonable amount of time. Figure S14b shows how the log-likelihood of the Facebook ego-network (Figure 3a of the main paper) converges after about 10,000 iterations.

Identifiability of AGM. Our first test of the AGM community detection method is to examine whether our fitting method can recover the model parameters given a synthetic network that was generated by the AGM itself. Assume a synthetic network $G^*$ that is generated by the AGM using some input parameters $B^*$, $\{p_c\}^*$. Our goal is to recover $B^*$, $\{p_c\}^*$ based only on the network $G^*$.

[Figure S13. Diagram of a node u and communities C1, C2 under the LEAVE, JOIN, and SWITCH modifications.]

Figure S13: Local modifications for updating the community affiliation graph B. Given the current community affiliation graph on the left, we consider three local modifications. LEAVE considers a node quitting a community. JOIN considers a node joining a new community. SWITCH causes a node to replace one of its current communities with a new one.

For example, we generated a network with overlapping communities such that 100 nodes belong to community A, 100 nodes belong to community B, and 50 nodes belong to both communities. We set p*_A = p*_B = 0.3 and generated the network G*. Given G*, we then identified communities by fitting AGM to it. The AGM discovers communities A and B with perfect accuracy and estimates p_A and p_B very closely (p_A = 0.30, p_B = 0.29). Figure S14a shows the likelihood as a function of the number of iterations. After 3,000 iterations, the likelihood reaches a plateau and converges.

We also considered more general cases in which we generated a random B* and randomly assigned parameters {p_c}*. In nearly all cases our algorithm successfully recovered the parameters B* and {p_c}* given only the synthetic network G*.
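To make the identifiability setup concrete, the following Python sketch generates a synthetic network of the kind described above. It rests on two assumptions that are not stated in this section: that each community c shared by two nodes independently creates an edge between them with probability p_c (our reading of the AGM edge rule), and that "100 nodes in A, 100 in B, 50 in both" means the 50 overlapping nodes are counted within each 100. The function name `generate_agm` is hypothetical.

```python
import random
from itertools import combinations

def generate_agm(memberships, p, rng):
    """Sample an edge set given node memberships and per-community probabilities.

    memberships: dict node -> set of community ids
    p:           dict community id -> edge probability p_c
    Assumed rule: P(edge u-v) = 1 - prod over shared c of (1 - p_c).
    """
    edges = set()
    for u, v in combinations(sorted(memberships), 2):
        prob_no_edge = 1.0
        for c in memberships[u] & memberships[v]:
            prob_no_edge *= 1.0 - p[c]
        if rng.random() < 1.0 - prob_no_edge:
            edges.add((u, v))
    return edges

# Nodes 0..99 are in A, nodes 50..149 are in B, so nodes 50..99 are in both.
memberships = {u: set() for u in range(150)}
for u in range(0, 100):
    memberships[u].add("A")
for u in range(50, 150):
    memberships[u].add("B")

edges = generate_agm(memberships, {"A": 0.3, "B": 0.3}, random.Random(0))
```

Under this rule, two nodes in the overlap are linked with probability 1 - 0.7 × 0.7 = 0.51, compared with 0.3 for nodes sharing a single community, which is what makes the overlap denser than either community alone.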

S4.4 Automatically finding the number of communities

To initialize B(V, C, M), we need to set the number of communities K = |C|, which in practice is not known in advance. To resolve this, we develop a method to automatically estimate the number of communities. Our approach is based on statistical regularization techniques [58].

Our strategy begins with a candidate set of a large number of communities that might exist in a given network. We generate this candidate set by fitting AGM using a very large number of communities K. Then, we keep removing redundant communities using l1-regularization until we observe a drop in the log-likelihood. We observe threshold-like behavior of the log-likelihood as a function of the regularization parameter. We choose the minimum number k* of communities before the log-likelihood drops. This way we find the minimum number k* of communities that is still sufficient to model the structure of a given network.

[Figure S14 plots: log-likelihood (y-axis) versus iterations (x-axis, 0 to 40,000); panels (a) Synthetic network and (b) Facebook.]

Figure S14: Convergence of fitting the AGM to a given network. The likelihood of AGM versus the number of iterations of Metropolis-Hastings measured on a synthetic network generated by AGM (a) and a Facebook ego-network (b). In both cases AGM converges after around 10,000 iterations and accurately discovers communities.

More precisely, we proceed as follows. First, we use a very large number of communities (|C_0| = O(|V|)) to fit AGM on the given network G(V, E) and obtain the resulting bipartite community affiliation graph B_0(V, C_0, M_0).

Note that not every community found in B_0 is important. B_0 is a set of candidate communities from which we select the communities that are very likely to exist. The key intuition is that we can ignore a community if its corresponding p_c is 0. Therefore, we aim to reduce the number of communities in B_0 by forcing more and more parameters p_c to zero.

We apply l1-regularization with parameter λ to the problem of fitting {p_c} to B_0. At each value of λ, we solve the following problem:

$$\{\hat{p}_c(\lambda)\} = \arg\max_{\{p_c\}} \; P(G \mid B_0, \{p_c\}) - \lambda \sum_c |p_c| \tag{4}$$

The solution {p̂_c(λ)} is a sparse vector with only a few entries p̂_c ≠ 0. The non-zero p̂_c act as indicators of active communities in B_0. We construct B(λ)(V, C(λ), M(λ)) by taking the communities in B_0 with non-zero p̂_c(λ). Each such B(λ) represents the set of active communities at regularization intensity λ.

Now our goal is to find the value of the regularization parameter λ at which we discover the true number of communities in the network. We do this by measuring how well B(λ) represents G(V, E), using its likelihood L(B(λ)) = max_{{p_c}} P(G | B(λ), {p_c}). The likelihood L(B(λ)) tells us how well G(V, E) can be explained when we use only the communities in C(λ). For example, Figure S15 plots L(B(λ)) and the number of active communities K(λ) measured on a network that has K* = 2 true communities.

Notice that whereas K(λ) is an almost strictly decreasing function of λ, L(B(λ)) appears to be a step function that is flat until λ reaches some threshold and then suddenly drops.

[Figure S15 plot: x-axis λ (logarithmic, 10^2 to 10^6); left y-axis: number of communities (0 to 12); right y-axis: value (0 to 1); curves K*, K(λ), L′(B(λ)), σ(λ), and the line σ(λ) = 0.25.]

Figure S15: Automatically determining the number of communities in a given network. We plot various quantities as a function of regularization intensity λ using two y-axes. K*: the true number of communities (left axis). K(λ): the number of communities we estimate under regularization intensity λ (left axis). L′(B(λ)): the normalized likelihood of B(λ) (right axis). σ(λ): the sigmoid function fit to the normalized L(B(λ)) (right axis). Pink dotted vertical line: λ* at which σ(λ*) falls below 0.25 (right axis). Red horizontal line: the estimated (as well as the true) number of communities.

This suggests that no more than K(λ) communities are needed to explain G(V, E) when λ is relatively small. In other words, K(λ) with high L(B(λ)) gives us an upper bound on the number of communities that exist in the network. The tightest upper bound occurs at the point at which L(B(λ)) suddenly drops, and we report the K(λ) measured at this drop as our estimate of the number of communities.

Since we cannot examine all possible values of λ, detecting the exact value of λ at which L(B(λ)) suddenly drops is a challenging task. To find this λ accurately, we approximate L(B(λ)) by the sigmoid function $\sigma(\lambda) = \frac{1}{1 + e^{\alpha\lambda + \beta}}$. We first normalize L(B(λ)) into L′(B(λ)) so that the maximum over λ is 1 and the minimum is 0, and then we fit the sigmoid function σ(λ) to L′(B(λ)) by finding the optimal parameters α and β [12]. Then we compute λ* such that σ(λ*) = δ for some small constant δ < 1, and K(λ*) is our estimate of the number of communities. We experimented with various values of δ and found that setting δ = 0.25 is a reasonable choice.
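Once α and β have been fitted, the threshold λ* has a closed form, since σ(λ*) = δ is equivalent to αλ* + β = ln(1/δ − 1). The Python sketch below shows the normalization and this last step only; the fitting of α and β itself is left out (the text cites [12] for it), and the function names are hypothetical.

```python
import math

def normalize(values):
    """Rescale likelihood values so that the maximum is 1 and the minimum is 0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sigmoid(lam, alpha, beta):
    """Sigmoid from the text: 1 / (1 + exp(alpha * lam + beta))."""
    return 1.0 / (1.0 + math.exp(alpha * lam + beta))

def lambda_star(alpha, beta, delta=0.25):
    """Solve sigmoid(lam) = delta for lam.

    sigmoid(lam) = delta  <=>  alpha * lam + beta = ln(1 / delta - 1).
    """
    return (math.log(1.0 / delta - 1.0) - beta) / alpha
```

For a decreasing sigmoid α is positive, so a smaller δ moves λ* further into the region where the likelihood has already dropped.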

For example, in Figure S15, our strategy estimates the true number of communities correctly. In the experiments on real-world networks, our strategy estimates the true number of communities more accurately than other methods (Section S5).

The run time of this method mostly depends on fitting B_0, because Problem 4 can be solved efficiently due to its convexity. Therefore, automatically finding the number of communities adds little computational overhead in practice. More importantly, with this method we can use the AGM without any parameters by following a two-step strategy: when a network is given, we first estimate the number of communities K in the network, and then fit the AGM using our estimate of K.

Figure S16: Ring of cliques. Ring of cliques K_m (a complete graph of m nodes) considered in [21], where the authors show that Modularity-based methods fail to detect each K_m as a separate community. In our experiments, AGM successfully discovers each K_m as a separate community.

S4.5 AGM does not suffer from the “resolution” limit

Many community detection methods suffer from the “resolution limit” [8, 21, 24]. In particular, Fortunato et al. [21] showed that Modularity has a resolution limit in the sense that it cannot detect communities if they are too small.

A ring of cliques (illustrated in Figure S16) is an example of a graph on which the resolution limit can be reliably studied. The ring of cliques consists of n cliques K_m, where each K_m is a complete graph on m nodes. The cliques are connected into a ring by adding a single edge between each pair of consecutive cliques. On such a graph we expect a community detection method to find n communities, with each K_m as a separate community. However, [21] proved that Modularity-based methods fail to discover each K_m as a separate community if n is larger than the square root of the number of edges in the network, i.e., Modularity-based methods fail for a network consisting of many small modules.

We ran AGM for the same values of n and m used in [21] (n = 30, m = 5), and found that AGM correctly identifies the number of communities n and detects each K_m as a separate community. We also experimented with many other values of n and m (e.g., n = 50, m = 10 and n = 100, m = 10) and observed that AGM perfectly identifies each K_m as a community. From these experiments we conclude that AGM does not suffer from the resolution limit.
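As a small check of the condition quoted from [21], the following Python sketch builds a ring of cliques and compares n with the square root of the number of edges. Which node of each clique carries the connecting edge is an arbitrary choice of this sketch, and `ring_of_cliques` is a hypothetical helper name.

```python
import math
from itertools import combinations

def ring_of_cliques(n, m):
    """Edge set of n cliques K_m joined into a ring by one edge per consecutive pair.

    Clique i uses nodes i*m .. i*m + m - 1; the first node of clique i is
    linked to the first node of clique i+1 (an arbitrary choice).
    """
    edges = set()
    for i in range(n):
        members = range(i * m, (i + 1) * m)
        edges.update(combinations(members, 2))
        next_first = ((i + 1) % n) * m
        edges.add(tuple(sorted((i * m, next_first))))
    return edges

edges = ring_of_cliques(30, 5)
num_edges = len(edges)                      # 30 * 10 clique edges + 30 ring edges
in_failure_regime = 30 > math.sqrt(num_edges)
```

With n = 30 and m = 5 the graph has 330 edges, and 30 exceeds the square root of 330 (about 18.2), so this setting lies in the regime where [21] shows that Modularity-based methods merge cliques.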


S4.6 Anecdotal comparison between AGM and the existing methods

In the main text, we gave an example of a Facebook network where AGM correctly identifies overlapping network communities while the existing methods fail. In this section, we show two more pieces of anecdotal evidence: the network of famous philosophers in Wikipedia [1] and the E. coli metabolic network [1].

First, we apply the AGM to the network of Wikipedia pages for famous philosophers [1] (Table S2), where the nodes are Wikipedia pages and the edges are hyperlinks between the pages. For the sake of visualization, we show the communities that Francis Bacon belongs to (Fig. S17). The AGM detects two communities that Bacon belongs to: one with scientists and the other with philosophers (Fig. S17a), whereas the existing methods fail to produce results as interpretable as those of AGM (Fig. S17b, S17c, and S17d).

Second, we also consider the E. coli metabolic network [1], where nodes are metabolites and edges are interactions. Here, the existing methods miss the communities shared by very important metabolites. For example, for H2O and CO2, the three existing methods that we consider [47, 1, 2] report that the two metabolites share only one community, whereas the AGM detects 18 communities shared by the two metabolites.

[Figure S17, panels (a) to (d): each panel draws the same set of philosopher pages around Francis Bacon (for example Aristotle, Isaac Newton, John Locke, Thomas Hobbes, and Galileo Galilei); in panel (a) the two communities are labeled Science and Philosophy.]

Figure S17: Examples of detected communities in the network of Wikipedia articles for philosophers (a–d). The detected communities that Francis Bacon belongs to are displayed by filled regions. (a) The AGM detects two communities, which represent Bacon’s connections to scientists and to philosophers, respectively. On the other hand, clique percolation (b), link clustering (c), and mixed-membership block models (d) detect less interpretable communities.


S5 Experiments: Networks with ground-truth communities

In this section we evaluate the performance of AGM and compare it to state-of-the-art community detection methods on a range of networks from a number of different domains and research areas. We perform experiments on the 6 networks described in Section S1 (Table S1) for which we have explicitly labeled ground-truth communities. The availability of ground-truth communities allows us to quantify the accuracy of community detection methods by measuring the level of correspondence between the detected and the ground-truth communities.

S5.1 Experimental setup

We focus on evaluating community detection methods by their ability to correctly identify overlapping communities. With this purpose in mind, running community detection algorithms on a whole network is not effective, for two reasons. First, for some nodes we have no ground-truth community labels. Second, and more importantly, none of the community detection algorithms that we consider scales to the size of the networks we consider here.

To resolve this, we proceed by finding subnetworks with highly overlapping community structure in a given network G(V, E). To obtain one such subnetwork, we pick a random node u ∈ V that belongs to more than one ground-truth community and then take the induced subgraph of G consisting of all the nodes that share at least one ground-truth community membership with u. Figure S18 illustrates how a subnetwork (right) is created from G(V, E) (left) when the red node u is chosen. We identify all the member nodes of the communities that the red node belongs to and then construct the induced subgraph on the right. This way we obtain subgraphs that are of reasonable size and contain fully labeled overlapping communities.
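The sampling procedure can be sketched in a few lines of Python. The data layout (an edge list plus a dictionary from community id to member set) and the function name `sample_subnetwork` are assumptions made for illustration.

```python
import random

def sample_subnetwork(edges, communities, rng):
    """Sample one subnetwork around a node with overlapping memberships.

    edges:       iterable of (u, v) pairs
    communities: dict community id -> set of member nodes (ground truth)
    Returns the seed node, the selected node set, and the induced edges.
    """
    membership = {}
    for c, members in communities.items():
        for u in members:
            membership.setdefault(u, set()).add(c)
    # Seed candidates: nodes that belong to more than one community.
    seeds = sorted(u for u, cs in membership.items() if len(cs) > 1)
    seed = rng.choice(seeds)
    # All nodes sharing at least one community with the seed.
    nodes = set().union(*(communities[c] for c in membership[seed]))
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return seed, nodes, induced
```

Every node in the returned subgraph belongs to at least one community of the seed, so the subgraph carries ground-truth labels for all of its nodes.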

We sample 500 subnetworks for each of the six networks from Table S1. We control the sampling so that the subnetworks have a similar number of ground-truth communities across the data sets. In particular, for each network we sample 100 subnetworks for each of 2, 3, 4, and 5 communities, and the last 100 sampled subnetworks have more than 5 communities.

We also considered many alternative ways of obtaining subnetworks small enough for community detection algorithms to be run. For example, we considered a strategy where, given a network G(V, E), we pick a random node u ∈ V and find the set of nodes V_u that are less than 2 hops away from u. We then construct the induced subgraph on V_u and take the communities that have more than 50% of their members in this induced subgraph. We also considered an approach where we created a random set of “connected” communities that either share an edge or a node. In all these cases the results we obtained were qualitatively similar and led to the same conclusions.

Figure S18: Sampling subnetworks with community overlaps. On the left is a part of a full network G(V, E) and on the right is the subnetwork that we sample. We randomly pick a node (the red node on the left) and then construct a subnetwork consisting of the communities that the red node belongs to.

S5.2 Methods for comparison

In order to evaluate performance, we compare AGM to existing, state-of-the-art community detection methods. We choose the four most prominent community detection methods:

• Link clustering (LC) [1]

• Clique Percolation Method (CPM) [47]

• Infomap [55]

• Mixed membership stochastic blockmodel (MMSB) [2]

Link clustering, the Clique percolation method, and Mixed membership stochastic blockmodels are regarded as state-of-the-art algorithms for detecting overlapping communities, and Infomap is a state-of-the-art method for detecting non-overlapping communities.

While Infomap and Link clustering are parameter-free, the Clique percolation method requires an input parameter k that determines the size of the cliques to be percolated. We choose the clique size k = 5, as we find that CPM with k = 5 estimates the number of communities most accurately in all the networks. CPM with k = 6 tends to estimate too few communities and CPM with k = 4 detects too many communities. MMSB requires the number of communities to be given. To this end we use the Bayesian Information Criterion (BIC) described by the authors [2] to determine the number of communities.

S5.3 Evaluation metrics

To quantify the performance we measure the level of agreement between the detected and the ground-truth communities. Given a network G(V, E), we consider a set of ground-truth communities C* and a set of detected communities Ĉ, where each ground-truth community C_i ∈ C* and each detected community Ĉ_i ∈ Ĉ is defined by the set of its member nodes. To compare Ĉ and C*, we use four performance metrics:

• Average F1 score [40] is the average of the F1-score of the best-matching ground-truth community for each detected community, and the F1-score of the best-matching detected community for each ground-truth community. In particular, we compute $F_g(C_i) = \max_j F1(C_i, \hat{C}_j)$ for each ground-truth community $C_i$ and $F_d(\hat{C}_i) = \max_j F1(C_j, \hat{C}_i)$ for each detected community $\hat{C}_i$, where $F1(S_1, S_2)$ is the harmonic mean of precision and recall between two node sets $S_1, S_2$. The average F1 score is $\frac{1}{2}(\bar{F}_g + \bar{F}_d)$, where $\bar{F}_g = \frac{1}{|C^*|} \sum_i F_g(C_i)$ and $\bar{F}_d = \frac{1}{|\hat{C}|} \sum_i F_d(\hat{C}_i)$.

• Omega Index [27] is the accuracy in estimating the number of communities that each pair of nodes shares. For each pair of nodes u, v ∈ V we define $C_{uv}$ to be the set of ground-truth communities to which both u and v belong and $\hat{C}_{uv}$ to be the set of detected communities to which both nodes belong. Then the Omega Index is $\frac{1}{|V|^2} \sum_{u,v \in V} \mathbf{1}\{|C_{uv}| = |\hat{C}_{uv}|\}$.

• Normalized Mutual Information [33] adopts a criterion used in information theory. The Normalized Mutual Information is $1 - \frac{1}{2}\left(H(C^*|\hat{C}) + H(\hat{C}|C^*)\right)$, where H(A|B) is the extension of entropy to the case where A, B are sets of sets [33].

• Accuracy in the number of communities is $1 - \frac{\left| |C^*| - |\hat{C}| \right|}{|C^*|}$, that is, one minus the relative error in predicting the number of communities.

Figure S19: An example of two detected communities (shaded regions Ĉ_1 and Ĉ_2) and two ground-truth communities C_1, C_2 (“X”-marked nodes belong to C_1 and yellow nodes belong to C_2). In this example the detected communities Ĉ_1 and Ĉ_2 achieve good correspondence to the ground-truth communities C_1, C_2 (only a single mistake is made).

Note that all performance metrics take values in the interval [0, 1], and higher values correspond to better performance. For every metric, a score of 1.0 is achieved when the detected communities Ĉ are exactly the same as the ground-truth communities C*.
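Three of the four metrics above are simple enough to sketch directly in Python. The sketch follows the formulas as written in this section (including the sum over all ordered node pairs in the Omega Index); Normalized Mutual Information is omitted because it depends on the entropy extension defined in [33]. Communities are represented as Python sets, and the function names are hypothetical.

```python
def f1(a, b):
    """Harmonic mean of precision and recall between node sets a and b."""
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(b)
    recall = common / len(a)
    return 2 * precision * recall / (precision + recall)

def average_f1(truth, detected):
    """Average of best-match F1 from both directions."""
    f_g = sum(max(f1(c, d) for d in detected) for c in truth) / len(truth)
    f_d = sum(max(f1(c, d) for c in truth) for d in detected) / len(detected)
    return 0.5 * (f_g + f_d)

def omega_index(nodes, truth, detected):
    """Fraction of node pairs whose number of shared communities agrees."""
    def shared(u, v, comms):
        return sum(1 for c in comms if u in c and v in c)
    nodes = list(nodes)
    agree = sum(1 for u in nodes for v in nodes
                if shared(u, v, truth) == shared(u, v, detected))
    return agree / len(nodes) ** 2

def count_accuracy(truth, detected):
    """One minus the relative error in the number of communities."""
    return 1 - abs(len(truth) - len(detected)) / len(truth)
```

As a usage example, a ground-truth community of five nodes matched by a detected community containing four of them gives precision 1.0, recall 0.8, and an F1 score of 8/9.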

Figure S19 gives a simple example of two ground-truth communities C_1, C_2 and two detected communities Ĉ_1, Ĉ_2. Ĉ_1 and Ĉ_2 are denoted by shaded ellipses. The nodes marked with “X” belong to C_1 and the nodes with yellow background belong to C_2.

In this example the detected community Ĉ_1 perfectly corresponds to C_1, but the detected community Ĉ_2 only partially corresponds to C_2. For this particular case the average F1 score is 0.94, since F_g(C_1) = 1.0, F_g(C_2) = 0.89 and F_d(Ĉ_1) = 1.0, F_d(Ĉ_2) = 0.89. The Omega Index is 0.85 and the Normalized Mutual Information is 0.78. The accuracy in the number of communities is 1.

[Figure S20 plot: composite performance (y-axis, 0 to 4) of each method on each of the six networks. Measures: Normalized Mutual Information, Number of Communities, Ω-index, F1-score. Methods: L = Link Clustering, C = Clique Percolation, I = Infomap, M = Mixed-Membership Stochastic Block Model, A = AGM.]

Figure S20: The composite performance of the community detection methods on six networks with ground-truth communities. The AGM gives the best overall performance.

S5.4 Results

For each of the 6 networks and each of its 500 subnetworks (3,000 subnetworks in total) we run AGM as well as the four other methods. For each subnetwork and method we compute the four evaluation metrics. We then compute the average value of each performance metric for each method and network. Next, for each metric, we normalize the scores of the methods so that the best performing method for that metric has the value 1.0. Finally, we compute the composite performance by summing up the four normalized scores. If a method achieves a better value than every other method on all four metrics, then its composite performance is 4.0.
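The composite score can be sketched as follows. The text only says that scores are normalized so that the best method gets 1.0; dividing each score by the best score for that metric is one natural reading and is an assumption of this sketch, as are the function name and the toy numbers in the usage example.

```python
def composite_performance(scores):
    """Sum of per-metric scores, each divided by the best score for that metric.

    scores: dict method -> dict metric -> average score (higher is better).
    Assumes the best score for every metric is positive.
    """
    metrics = list(next(iter(scores.values())).keys())
    composite = {method: 0.0 for method in scores}
    for metric in metrics:
        best = max(scores[method][metric] for method in scores)
        for method in scores:
            composite[method] += scores[method][metric] / best
    return composite

# Toy numbers for illustration only (not results from the paper).
example = composite_performance({
    "X": {"f1": 0.6, "omega": 0.4},
    "Y": {"f1": 0.3, "omega": 0.2},
})
```

With four metrics, a method that is best on every metric reaches the maximum composite value of 4.0, matching the description above.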

Figure S20 displays the composite performance of each of the 5 methods over the six networks with ground-truth communities. Overall, we notice that AGM gives superior overall performance on all networks except Amazon, where it ties with MMSB. Furthermore, AGM detects the highest quality communities for most individual performance metrics in all networks. On average, the composite performance of AGM is 3.40, which is 61% higher than that of Link clustering (2.10), 50% higher than that of CPM (2.41), 30% higher than that of Infomap and 8% higher than that of MMSB (3.25). The absolute average value of the Omega Index of AGM over the 6 networks is 0.46, which is 21% higher than Link clustering (0.38), 22% higher than CPM (0.37), 5% higher than Infomap (0.44) and 26% higher than MMSB (0.36).

In terms of absolute values of scores, AGM achieves an average F1 score of 0.57, an average Omega Index of 0.46, a Normalized Mutual Information of 0.15, and an accuracy in the number of communities of 0.42. We also note that AGM heavily outperforms CPM with other values of k (e.g., CPM with k = 3, 4, 6).

S5.5 Experiments on modeling the network structure

Having shown that AGM reliably discovers the community structure of real-world networks, we now evaluate how accurately AGM models the network connectivity structure itself. For a given network G, we estimate the input parameters of AGM by fitting AGM to G, and then we generate a synthetic network with AGM using our estimates of the input parameters (Equation 1). We then compare the structure of the synthetic network with that of G.

First, we verify that the edge probability between a pair of nodes in the synthetic network is an increasing function of the number of communities that the nodes share (Figure S5). Figure S21 plots the edge probability as a function of the number of common communities between a pair of nodes in G (red curves) and in the synthetic network (green curves). We note that the AGM successfully reproduces the increase in the edge probability, which means that AGM naturally produces a network with dense community overlaps.

[Figure S21 plots: edge probability P(k) (y-axis) versus k, the number of shared memberships (x-axis), for the real network and AGM; panels (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon.]

Figure S21: Edge probability measured on the network reproduced by AGM. Edge probability P(k) as a function of the number of common community memberships k in the 6 networks (red) and as modeled by the AGM (green), which reliably captures the pattern.
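The empirical edge probability P(k) can be computed by grouping node pairs by the number of communities they share. The Python sketch below does this by brute force over all pairs, which is only practical for small networks; the data layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def edge_probability_by_overlap(nodes, edges, memberships):
    """Return dict k -> fraction of node pairs sharing k communities that are linked.

    memberships: dict node -> set of community ids (missing nodes count as no memberships).
    """
    edge_set = {frozenset(e) for e in edges}
    pairs = defaultdict(int)
    linked = defaultdict(int)
    for u, v in combinations(nodes, 2):
        k = len(memberships.get(u, set()) & memberships.get(v, set()))
        pairs[k] += 1
        if frozenset((u, v)) in edge_set:
            linked[k] += 1
    return {k: linked[k] / pairs[k] for k in pairs}
```

Running the same function on a real network and on a network generated from the fitted model gives the two curves that are compared in each panel.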

Second, we study whether the AGM is able to generate realistic networks overall. We examine how well the global structural properties of the synthetic network match those of the ground-truth network G. Figure S22 plots the node degree distribution. Red curves are the statistics of the ground-truth networks and green curves are the statistics of the synthetic networks. The figures show that synthetic networks generated by AGM exhibit structural properties similar to those of real-world networks. Overall, these results demonstrate that the AGM not only reliably discovers the network communities but also accurately captures the structure of the underlying networks.
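For reference, a complementary cumulative degree distribution can be computed from an edge list as in the Python sketch below. It takes the complementary CDF at k to be the fraction of nodes with degree at least k, which is an assumption (the convention "strictly greater than k" is also common), and it ignores isolated nodes because they do not appear in an edge list.

```python
from collections import Counter

def degree_ccdf(edges):
    """Return dict k -> fraction of nodes with degree >= k, for observed degrees k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())
    ccdf = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        ccdf[k] = remaining / n
        remaining -= counts[k]
    return ccdf
```

Plotting the result on doubly logarithmic axes for a real network and for its synthetic counterpart gives a comparison of the kind summarized above.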

[Figure S22 plots: complementary CDF Fc(k) (y-axis, logarithmic) versus node degree k (x-axis, logarithmic), for the real network and AGM; panels (a) LiveJournal, (b) Friendster, (c) Orkut, (d) Youtube, (e) DBLP, (f) Amazon.]

Figure S22: Degree distribution of networks reproduced by AGM. Complementary cumulative degree distribution function Fc(k) as a function of node degree k in the 6 networks (red) and as modeled by AGM (green). The synthetic networks generated by AGM exhibit a heavy-tailed degree distribution similar to that of the real-world networks.

S6 Experiments: Small networks

As a point of comparison and a sanity check, we also run the AGM on small networks that have been thoroughly studied in the literature [3, 6, 16, 20, 23, 34, 45]. Our goal here is to show that the communities obtained by the AGM agree with the previous literature. Note that since these networks are known to have non-overlapping community structure, our experiments evaluate how well AGM discovers non-overlapping communities.

NCAA football network. The first network we examine is the network of American college football studied in [23, 51]. This network represents the schedule of Division I games for the 2000 season: nodes are college teams and an edge between two nodes means that the two teams played a game during the season. By construction, this football network has a very well-defined community structure: the teams are divided into conferences, within which most match-ups are made. On average, teams played 7 games within their own conference and 4 games outside it.

Figure S23 shows the football network and the communities detected by AGM. Node color specifies which conference a node belongs to, and red circles display the communities detected by AGM. We observe that AGM discovers most conferences perfectly, with only a few exceptions, which can in fact be nicely explained. For example, the 4 white nodes that AGM assigns to “any community” are “independent” teams that do not actually belong to any conference. The grey nodes that are split into two communities belong to the Sun Belt conference, which did not require its members to play games against each other [23]. This result implies that AGM can accurately detect non-overlapping community structure.

Bottlenose dolphins of Doubtful Sound. We also consider the social network of a community of 62 bottlenose dolphins living in Doubtful Sound, New Zealand (Figure S24). The network was compiled from seven years of field studies of the dolphins by Lusseau et al. [39]. Nodes are dolphins and edges indicate statistically significant frequent association. Lusseau et al. observed a division of the dolphins into two groups, represented by black and white nodes. The two shaded regions are the two communities detected by AGM, which match the known division quite well. The AGM does not assign some nodes to either community, which is plausible because those nodes are very sparsely connected to other nodes in the network and thus violate the assumption that all nodes of the same community connect with a uniform probability.

[Figure S23 legend (node color by conference): Mid American, Big East, Atlantic Coast, SEC, Conference USA, Big 12, Western Athletic, Pacific 10, Mountain West, Big 10, Sun Belt, Independents.]

Figure S23: AGM on the NCAA football network. Community structure in a network reflecting the schedule of regular season Division I college football games for the year 2000 [23]. Nodes are universities and edges represent games between universities. Node colors represent the NCAA conference that the node belongs to. AGM communities, specified by circles, correspond to NCAA conferences with high accuracy.

Figure S24: AGM on a network of bottlenose dolphins of Doubtful Sound. Community structure in the social network of bottlenose dolphins assembled by Lusseau et al. [39], detected by AGM. Lusseau et al. [39] reported that dolphins are separated into two groups (black and white nodes). AGM reliably captures the existence of two separate groups.

S7 Experiments: Biological networks

Having found that the AGM can accurately discover the community structure of social and information networks, we now evaluate the AGM on biological networks.

S7.1 Dataset description

We consider the protein-protein interaction (PPI) networks of Saccharomyces cerevisiae, which are among the most complete protein-protein interaction networks available today [63, 32, 1]. The PPI data are compiled into three different genome-scale networks: yeast two-hybrid (Y2H), affinity purification followed by mass spectrometry (AP/MS), and literature curated (LC). Edges correspond to statistically significant interactions among proteins. In addition to these three networks, we also use their union (PPI (All)). Basic statistics of the networks are shown in Table S2.

S7.2 Evaluation metrics

We measure the quality of detected communities by using high-quality node metadata. We use the Gene Ontology terms (GO terms) as metadata [4]. The GO terms provide the most elaborate annotations for the biological roles of groups of proteins in the protein-protein interaction network, for three different types: Biological process, Cellular component, and Molecular function. Moreover, there are statistical tools [7, 10] that find the most relevant GO term for a group of proteins. We quantify the quality of the protein communities detected by a method through the statistical significance of their associated GO terms.

Given a PPI network, we detect communities $C$ over the whole network. For each detected community $C_i \in C$, we find the most statistically relevant GO term and its p-value, $pv(C_i)$, using the GO term finder [10]. We then report the average p-value over the detected communities:

$$\bar{p} = \frac{1}{|C|} \sum_{i} pv(C_i).$$

We compute the average p-value $\bar{p}$ for the three types of GO terms (biological process, cellular component, and molecular function) and use $-\log(\bar{p})$ for each GO term type as a separate score. We take the negative logarithm of the p-value so that the performance score is a nonnegative, increasing function of the quality. Finally, we normalize the scores so that the best method achieves the value of 1.0, as we did in Section S5.
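The scoring procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the p-values are made-up toy numbers, the method names are hypothetical, the natural logarithm is assumed (the text does not state the base), and the helper names `go_score` and `normalize_by_best` are not from the original work.

```python
import math


def go_score(p_values):
    """Return -log of the mean p-value over the detected communities.

    Assumption: natural logarithm; the text does not state the base.
    """
    if not p_values:
        raise ValueError("need at least one community p-value")
    mean_p = sum(p_values) / len(p_values)
    return -math.log(mean_p)


def normalize_by_best(scores):
    """Scale the scores so that the best method gets exactly 1.0."""
    best = max(scores.values())
    return {method: score / best for method, score in scores.items()}


# Hypothetical per-community p-values for two made-up methods.
toy_p_values = {
    "method_a": [1e-4, 1e-2, 1e-3],
    "method_b": [0.5, 0.1, 0.3],
}
raw_scores = {m: go_score(p) for m, p in toy_p_values.items()}
normalized = normalize_by_best(raw_scores)
print(normalized)
```

Because the mean is taken before the logarithm, a single community with a poor p-value can dominate the score, which is consistent with the later remark that very large communities lead to very poor p-values.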

S7.3 Results

Figure S25 displays the composite performance on the biological networks for each of the five methods (Link clustering, Clique percolation (CPM), Infomap, MMSB, and AGM). Note that the AGM attains the best composite performance in all four networks by a large margin. CPM is the second best, Link clustering is the third, and the Mixed-membership stochastic blockmodel (MMSB) scores the worst.

Since the comparison is made on the logarithms of p-values, this result suggests that communities detected by the AGM are far more statistically relevant than those detected by other methods. For example, the average p-value of the AGM communities over all the networks is 0.008, which is 13 times better than that of Link clustering (0.11), 15 times better than that of Clique percolation (0.12), 12 times better than that of Infomap (0.096), and 119 times better than that of MMSB (0.95). We further investigated the poor performance of MMSB on these networks and found that it is due to the fact that MMSB tends to find very large communities, which in turn leads to very poor p-values.

| Dataset | N | E | 〈C〉 | 〈D〉 | 〈k〉 |
|---|---|---|---|---|---|
| Protein-protein interaction networks [1] (Section S7) | | | | | |
| PPI (Y2H) [63] | 1,647 | 2,518 | 0.10 | 6.60 | 3.06 |
| PPI (AP/MS) [15] | 1,004 | 8,319 | 0.72 | 6.51 | 16.57 |
| PPI (LC) [50] | 1,213 | 2,556 | 0.46 | 8.56 | 4.21 |
| PPI (All) [1] | 1,647 | 12,784 | 0.41 | 6.24 | 8.60 |
| Networks from Ahn et al. [1] (Section S8) | | | | | |
| Metabolic [18] | 1,042 | 8,756 | 0.74 | 3.15 | 16.81 |
| Philosophers [1] | 1,218 | 5,972 | 0.30 | 4.25 | 9.81 |
| Word Association [44] | 5,018 | 55,232 | 0.19 | 4.04 | 22.01 |
| Social networks (Section S9) | | | | | |
| MSN Finland | 592,982 | 2,448,213 | 0.13 | 7.95 | 8.26 |
| LinkedIn | 254,151 | 482,286 | 0.09 | 7.18 | 3.80 |
| Web graphs (Section S9) | | | | | |
| Web-Stanford | 255,265 | 1,941,926 | 0.62 | 9.36 | 15.21 |
| Web-Notre Dame | 325,729 | 1,090,108 | 0.23 | 9.42 | 6.69 |
| Web-Berkeley/Stanford | 654,782 | 6,581,871 | 0.61 | 10.06 | 20.10 |
| Foodweb networks (Section S9) | | | | | |
| Foodweb-wet | 128 | 2,075 | 0.33 | 1.90 | 32.42 |
| Foodweb-dry | 128 | 2,106 | 0.33 | 1.90 | 32.90 |

Table S2: Network statistics. N: Number of nodes, E: Number of edges, 〈C〉: Average clustering coefficient [61], 〈D〉: Average shortest path length, 〈k〉: Average node degree. The 4 networks in the first block are the protein-protein interaction networks of Saccharomyces cerevisiae, which are described in Section S7. The 3 networks in the second block are the networks used in Ahn et al. [1], and we describe these networks in Section S8. The remaining 3 blocks are different types of networks in which different kinds of core-periphery structures arise (Section S9). The 2 networks in the third block are social networks, the 3 networks in the fourth block are web graphs, and the 2 networks in the last block are foodweb networks.

In terms of the absolute values of the p-values, the AGM also performs quite well. For example, in the AP/MS network, the AGM achieves an average p-value of $1.9 \times 10^{-5}$, which suggests the high significance of the detected communities.

Discovering interactions among proteins remains an active research area; only ≈ 20% of all protein-protein interactions in yeast have been reported so far [63]. The high-quality protein communities detected by the AGM can suggest plausible candidates that biologists can investigate for undiscovered protein-protein interactions [48, 52].


[Figure S25: bar charts, one panel per network: PPI (Y2H), PPI (AP/MS), PPI (LC), PPI (All). The y-axis is the composite performance (0 to 3), stacked by measure: Biological Process, Molecular Function, Cellular Component. Methods: L = Link Clustering, C = Clique Percolation, I = Infomap, M = Mixed-Membership Stochastic Block Model, A = AGM.]

Figure S25: The composite performance of the algorithms on the protein-protein interaction networks. The AGM gives overall best performance by a large margin.


S8 Experiments: Networks in Ahn et al.

Finally, we also evaluate the performance of the AGM under exactly the same conditions as used in the original Link Clustering paper by Ahn et al. [1]. Ahn et al. [1] kindly shared with us the exact networks, metadata and the code. Ahn et al. provide objective evaluations of community detection methods with data-driven measures. We replicate the experiment in [1] with the same data sets and the same evaluation methodology.

S8.1 Dataset description

The seven networks used in [1] were kindly made available to us. We thank Sune Lehmann for generously providing data. We consider the 4 PPI networks that were described in Section S7. In addition, we test on the metabolic network of the E. coli K-12 MG1655 strain (iAF1260), which is regarded as one of the state-of-the-art metabolic network reconstructions [18]. Two metabolites have an edge if they share a cellular reaction. In total, we have five biological networks.

In addition, we also examine other types of networks. We consider the network of famous philosophers constructed from Wikipedia [1]. If the Wikipedia page of a philosopher has a hyperlink to the Wikipedia page of another philosopher, then the two philosophers have an edge between them. We use the Word association network from the data sets of the University of South Florida and the University of Kansas [17, 44], which record which words human subjects associate with given words. As the data set provides a weighted, directed graph among words, we convert the graph into an undirected and unweighted version [1, 47]. Basic statistics of the networks are in Table S2. Further details are in [1].
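The conversion of a weighted, directed graph into an undirected, unweighted one can be illustrated as follows. The exact rule used in [1, 47] is not restated here, so this sketch assumes the simplest possible one: every arc, regardless of weight or direction, becomes one undirected edge, and self-loops are dropped. The function name and the toy arcs are hypothetical.

```python
def to_undirected_unweighted(weighted_arcs):
    """Collapse (source, target, weight) arcs into a set of undirected edges.

    Assumptions (not stated in the text): weights are ignored entirely,
    reciprocal arcs collapse into one edge, and self-loops are dropped.
    """
    edges = set()
    for source, target, _weight in weighted_arcs:
        if source != target:
            edges.add(frozenset((source, target)))
    return edges


# Toy word-association arcs (made-up weights).
toy_arcs = [
    ("cat", "dog", 0.4),
    ("dog", "cat", 0.1),   # reciprocal arc, collapses with the one above
    ("cat", "cat", 0.2),   # self-loop, dropped
    ("dog", "bone", 0.3),
]
toy_edges = to_undirected_unweighted(toy_arcs)
print(sorted(sorted(edge) for edge in toy_edges))
```

If a weight threshold were required instead, the same loop could skip arcs whose weight falls below it.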

S8.2 Evaluation metrics

We adopt the 4 data-driven measures defined in [1]:

• The community coverage is the fraction of the nodes that belong to at least one detected community.

• The overlap coverage is the average number of communities that a node belongs to. If the method detects many communities that share large overlaps, then the overlap coverage will be high.

• The community quality assumes that the similarity µ(i, j) is available for any pair of nodes i and j. Given the similarities, the community quality is the average similarity between all pairs of nodes that share a community, divided by the average similarity between all pairs of nodes [1].

• The overlap quality requires, for each node i, the information W(i), which is related to the number of true communities that i belongs to. On the protein-protein interaction networks, for example, a protein annotated with many GO terms is expected to belong to many protein communities. On the word association network, words with many definitions are likely to belong to many communities of words. The overlap quality is the mutual information between W(i) and the number of detected communities that i belongs to.

[Figure S26: bar charts, one panel per network: Philosophers, PPI (Y2H), PPI (AP/MS), PPI (LC), PPI (All), Metabolic, Word Association. The y-axis is the composite performance (0 to 4), stacked by measure: Community Quality, Overlap Quality, Overlap Coverage, Community Coverage. Methods: L = Link Clustering, C = Clique Percolation, I = Infomap, M = Mixed-Membership Stochastic Block Model, A = AGM.]

Figure S26: The data-driven benchmark presented in Y.-Y. Ahn et al. [1]. Community coverage, Overlap coverage, Community quality, and Overlap quality measure the quality of the communities detected by the algorithms. The AGM gives overall best performance.

For each network, we apply the AGM, Link clustering, Clique percolation, Infomap and the Mixed-membership stochastic block model. For evaluation we use exactly the same metadata and the same parameters as in [1].
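To make the first three measures concrete, the following sketch computes them on a toy example. It is not the evaluation code of [1]: the function names, the toy graph and the similarity function are hypothetical, and the overlap coverage is averaged over all nodes, which is one reading of the definition given above.

```python
from itertools import combinations


def community_coverage(nodes, communities):
    """Fraction of nodes that belong to at least one detected community."""
    covered = set().union(*communities)
    return len(covered & set(nodes)) / len(nodes)


def overlap_coverage(nodes, communities):
    """Average number of memberships per node (assumed: over all nodes)."""
    node_set = set(nodes)
    total_memberships = sum(len(set(c) & node_set) for c in communities)
    return total_memberships / len(nodes)


def mean_similarity(pairs, similarity):
    pairs = list(pairs)
    return sum(similarity(i, j) for i, j in pairs) / len(pairs)


def community_quality(nodes, communities, similarity):
    """Mean similarity of co-community pairs divided by that of all pairs."""
    all_pairs = set(combinations(sorted(nodes), 2))
    shared_pairs = {
        pair for c in communities for pair in combinations(sorted(c), 2)
    }
    return mean_similarity(shared_pairs, similarity) / mean_similarity(
        all_pairs, similarity
    )


# Toy data: node 5 is in no community, node 3 is in two.
toy_nodes = [1, 2, 3, 4, 5]
toy_communities = [{1, 2, 3}, {3, 4}]


def toy_similarity(i, j):
    # Hypothetical similarity: 1 for consecutive labels, 0 otherwise.
    return 1.0 if abs(i - j) == 1 else 0.0


coverage = community_coverage(toy_nodes, toy_communities)
overlap = overlap_coverage(toy_nodes, toy_communities)
quality = community_quality(toy_nodes, toy_communities, toy_similarity)
print(coverage, overlap, quality)
```

A community quality above 1 means that pairs of nodes sharing a community are, on average, more similar than arbitrary pairs.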

S8.3 Results

Similar to the previous experiments, we compute the composite performance by normalizing the scores in the same way as in the experiments with ground-truth communities. Figure S26 shows the composite performance of the five methods. The AGM achieves the best composite performance in 3 networks (PPI (Y2H), PPI (LC) and Philosophers), Link clustering performs slightly better in the Word association and the metabolic networks, and MMSB is the best in the PPI (Y2H) and PPI (All) networks. On average, the AGM achieves a composite performance score of 3.06, outperforming Link clustering (2.67) by 14%, Clique percolation (1.49) by 104%, Infomap (1.82) by 67% and MMSB (2.84) by 8%. Thus, the AGM gives the overall best performance on this diverse set of networks and evaluation metrics.


S9 Overlapping communities give rise to core-periphery network structure

In the main text, we showed how the core-periphery structure [9, 29, 54] arises in many different types of networks. In this section, we provide further evidence for our explanations by showing that the same results hold for many other networks as well.

S9.1 Community overlaps lead to global core-periphery structure

We begin by describing the networks that we consider for these experiments (Table S2):

• Three social networks. LiveJournal is the online social network that we used in the main text, MSN Finland is the MSN network of users in Finland, and the LinkedIn network is a snapshot of the LinkedIn social network when it had 254,151 users.

• The Amazon product network, as in the main text.

• Three web graphs, where nodes are web pages and edges are hyperlinks [38]. Web-Stanford is a network of the web pages of Stanford University, Web-NotreDame is a network of the web pages of the University of Notre Dame, and Web-BerkStan is a network of the web pages of Stanford University and the University of California, Berkeley.

• The 4 protein-protein interaction networks described in Section S7.

• 2 Florida Bay food web networks [59]: one for the wet season (Foodweb-wet) and one for the dry season (Foodweb-dry).

Global core-periphery structure. Given the network $G(V, E)$, we measure the average number of communities $m$ that a node belongs to as a function of the farness centrality $d$ of the node. The farness centrality [30, 60] $d$ of node $u$ is the average shortest path length from $u$ to all other nodes in the network, i.e.,

$$d = \frac{1}{|V|} \sum_{v \in V} d(u, v),$$

where $d(u, v)$ is the shortest path length from $u$ to $v$.
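As a small illustration of the definition above, farness centrality on an unweighted graph can be computed with one breadth-first search per node. The sketch below assumes a connected, undirected graph given as an adjacency dictionary; the function name and the toy path graph are hypothetical.

```python
from collections import deque


def farness_centrality(adjacency, source):
    """Average shortest path length from `source`, normalized by |V|.

    Assumes an unweighted, connected graph: {node: set of neighbours}.
    The term d(u, u) = 0 is included in the sum, matching the 1/|V| factor.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    if len(distance) != len(adjacency):
        raise ValueError("graph must be connected")
    return sum(distance.values()) / len(adjacency)


# Toy path graph a - b - c - d.
toy_path = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}
farness = {u: farness_centrality(toy_path, u) for u in toy_path}
print(farness)
```

On the path graph the two end nodes have the largest farness and the two middle nodes the smallest, which matches the intuition that low farness marks the center of the network.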

Figure S27 displays the plots of m against d for all 13 networks. We use the community memberships detected by AGM to determine m. In all networks but the two web graphs, the number of community memberships of a node decreases with the farness centrality of the node. This implies that nodes residing in the center of the network, which have small shortest path distances to the other nodes of the network, tend to belong to the highest number of communities. This result shows that our observation that overlapping communities lead to the core-periphery structure of large networks (Figure 3b in the main text) applies to a wide range of real-world networks.

Emergence of local cores. We also examine the existence of local cores in the networks by measuring c(l), the fraction of nodes in the largest connected component (LCC) of the subgraph induced by the nodes that belong to at least l communities. Thinking of a network as a landscape where peaks correspond to cores and lowlands to peripheries, our methodology is analogous to flooding the lowlands and measuring the fraction of the largest island (which was a peak before flooding). A high c(l) means that there is a single dominant core (peak), while a low c(l) suggests the existence of nontrivial secondary cores.
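The flooding procedure can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes that the fraction is taken relative to the number of nodes that survive the membership threshold, and the function name and the toy graph (two triangles joined by a low-membership bridge node) are hypothetical.

```python
def largest_component_fraction(adjacency, memberships, l):
    """Fraction of the largest connected component among nodes with
    at least `l` community memberships.

    Assumption: the fraction is relative to the nodes kept after the
    threshold; returns 0.0 if no node qualifies.
    """
    kept = {v for v in adjacency if memberships.get(v, 0) >= l}
    if not kept:
        return 0.0
    seen = set()
    largest = 0
    for start in kept:
        if start in seen:
            continue
        # Depth-first search restricted to the induced subgraph.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour in kept and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        largest = max(largest, size)
    return largest / len(kept)


# Two triangles (a, b, c) and (e, f, g) joined through bridge node d.
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
toy_memberships = {"a": 2, "b": 2, "c": 2, "d": 1, "e": 2, "f": 2, "g": 2}
c_values = {
    l: largest_component_fraction(toy_graph, toy_memberships, l)
    for l in (1, 2, 3)
}
print(c_values)
```

In the toy graph, raising the threshold from 1 to 2 removes the bridge node and splits the network into two equally sized islands, the signature of local cores rather than a single dominant core.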


[Figure S27, panels (a) to (m): (a) LiveJournal, (b) MSN Finland, (c) LinkedIn, (d) Amazon, (e) Web-Stanford, (f) Web-NotreDame, (g) Web-BerkStan, (h) PPI (Y2H), (i) PPI (AP/MS), (j) PPI (LC), (k) PPI (All), (l) Foodweb-wet, (m) Foodweb-dry. In each panel the x-axis is d, the farness centrality, and the y-axis is ⟨m⟩, the number of community memberships.]

Figure S27: Overlapping communities lead to global core-periphery network structure. The average (and the 90th percentile) of the number of community memberships m of a node as a function of the average shortest path length d to all other nodes of the network. The number of community memberships increases with the centrality of a node. Nodes that reside in the center of the network, and have small shortest path distances to other nodes of the network, tend to belong to the highest number of communities.


[Figure S28, panels (a) to (e): (a) Social networks (LiveJournal, MSN, LinkedIn), (b) Product network (Amazon), (c) Web graphs (Web-Stanford, Web-Notre Dame, Web-Berkeley/Stanford), (d) PPI networks (PPI-LC, PPI-Y2H, PPI-AP/MS, PPI-All), (e) Foodwebs (Florida bay (wet), Florida bay (dry)). In each panel the x-axis is l, the number of community memberships, and the y-axis is c(l), the connected component size (0 to 1).]

Figure S28: Largest connected component size on an induced subgraph of nodes belonging to at least l communities.

Figure S28 displays c(l) for each of the 5 types of networks. As shown in the main text, we observe that the protein-protein interaction networks and the Amazon product network have local cores, while the other types have a global core.

Maximum overlap fraction. Finally, we characterize how much communities overlap with each other in different types of networks. The maximum overlap fraction $o_c$ of a given community $c$ quantifies the fraction of $c$'s members in the largest overlap with any other community.
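One way to read this definition is $o_c = \max_{c' \neq c} |c \cap c'| / |c|$. The sketch below implements that reading on toy data; the function names and the example communities are hypothetical, and a community with no overlap at all is assigned $o_c = 0$.

```python
def maximum_overlap_fraction(community, others):
    """Largest overlap of `community` with any other community,
    as a fraction of the size of `community`."""
    if not community:
        raise ValueError("community must be non-empty")
    largest_overlap = max(
        (len(community & other) for other in others), default=0
    )
    return largest_overlap / len(community)


def all_maximum_overlap_fractions(communities):
    """Compute o_c for every community in a list of node sets."""
    return [
        maximum_overlap_fraction(c, communities[:i] + communities[i + 1:])
        for i, c in enumerate(communities)
    ]


# Toy communities over integer node labels.
toy_communities = [{1, 2, 3, 4}, {3, 4, 5}, {4, 6}]
overlap_fractions = all_maximum_overlap_fractions(toy_communities)
print(overlap_fractions)
```

The distribution of these values over all communities of a network is what a histogram such as p(o) summarizes.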

Figure S29 shows the distribution of o in the 5 types of networks. Communities in the protein-protein interaction networks, the social networks and the product co-purchasing network are mainly non-overlapping, whereas the communities in the foodwebs and the web graphs are pervasively overlapping.

S9.2 Comparison to other notions of core-periphery

To reason about the core-periphery structure of networks, we have so far used the fact that communities behave as tiles, in the sense that the overlap of two communities leads to higher edge density (higher tile thickness). Combining this with the observation that communities overlap most pervasively in the center of the network leads to our conclusion about the global core-periphery structure. However, there are many other methods that identify core-periphery structure in networks.

We therefore aim to quantify the agreement between the cores we find here and the cores detected by existing methods. In particular, we compare against the method proposed by Rombach et al. [53].


[Figure S29, panels (a) to (e): (a) Social networks (LiveJournal, MSN, LinkedIn), (b) Product network (Amazon), (c) Web graphs (Web-Stanford, Web-Notre Dame, Web-Berkeley/Stanford), (d) PPI networks (PPI-LC, PPI-Y2H, PPI-AP/MS, PPI-All), (e) Foodwebs (Florida bay (wet), Florida bay (dry)). In each panel the x-axis is o, the maximum overlap fraction, and the y-axis is p(o).]

Figure S29: Distribution of maximum overlap fraction.

Since AGM is proposed for community detection rather than core detection, our goal here is to measure the correspondence between the core determined by [53] and the core (i.e., high membership nodes) detected by the AGM.

Rombach et al. compute a real-valued "core score" CS(i) for each node i, which specifies how likely i is to belong to a core. In our experiments we use the number of community memberships m(i) of node i to indicate whether i belongs to the core or not. Since m(i) and CS(i) are scores rather than binary indicators, we measure the Pearson correlation coefficient [13] between m(i) and CS(i).

For these experiments we consider two networks that were also considered in Rombach et al. [53]. First, we use Zachary's karate club network [64]. Second, we consider the London underground network of metro stations. Since the London underground network is a weighted network, we build an unweighted network for AGM by connecting two nodes when the edge weight is larger than 2. In Table S3, we observe that the correlation coefficient is 0.774 for Zachary's karate club network and 0.408 for the London underground network. In the second row of the table, we also report the p-value for the null hypothesis that there is no positive correlation between the two values, computed with Student's t-test. As both p-values are far below the conventional threshold of 0.05, we conclude that the cores that we find by AGM (i.e., the high membership nodes) correspond well to the cores found by the state-of-the-art method. The level of correlation is lower in the London underground network, which can be explained by the fact that some information is lost when converting the weighted network to an unweighted network.
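The correlation test described above can be illustrated with a short, dependency-free sketch. It computes the Pearson correlation coefficient and the corresponding t statistic, $t = r\sqrt{(n-2)/(1-r^2)}$, which follows Student's t distribution with $n-2$ degrees of freedom under the null hypothesis; turning t into a p-value requires the t distribution's tail probability and is left out here. The function names and the toy scores are hypothetical, not data from the experiments.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing zero correlation (n - 2 degrees of freedom).

    Undefined for |r| = 1, where the denominator vanishes.
    """
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy example: made-up membership counts m(i) and core scores CS(i).
toy_memberships = [1, 1, 2, 3, 4]
toy_core_scores = [0.1, 0.2, 0.4, 0.7, 0.9]
r_toy = pearson_r(toy_memberships, toy_core_scores)
t_toy = t_statistic(r_toy, len(toy_memberships))
print(r_toy, t_toy)
```

As a plausibility check, Zachary's karate club network has 34 nodes, so the reported coefficient of 0.774 corresponds to a t statistic of roughly 6.9 on 32 degrees of freedom, which is in line with a very small one-sided p-value.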


| Dataset | Zachary | London underground |
|---|---|---|
| Correlation coefficient | 0.774 | 0.408 |
| p-value | $4.01 \times 10^{-8}$ | $6.12 \times 10^{-4}$ |

Table S3: Comparison with the cores detected by Rombach et al. [53]. The Pearson correlation coefficient between the core score computed by Rombach et al. [53] and the number of communities the node belongs to as determined by AGM. p-values are computed for the null hypothesis of no positive correlation using Student's t-test. A high correlation coefficient implies that high membership nodes under the AGM are more likely to belong to the network core as detected by Rombach et al.

References

[1] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multi-scale complexity in networks. Nature, 466:761–764, Oct. 2010.

[2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

[3] A. Arenas, L. Danon, A. Díaz-Guilera, P. Gleiser, and R. Guimerà. Community analysis in social networks. The European Physical Journal B, 38(2):373–380, 2004.

[4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25:25–29, 2000.

[5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44–54, 2006.

[6] B. Ball, B. Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Physical Review E, 84:036103, 2011.

[7] G. F. Berriz, J. E. Beaver, C. Cenik, M. Tasan, and F. P. Roth. Next generation software for functional trend analysis. Bioinformatics, 25(22):3043–3044, 2009.

[8] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[9] S. P. Borgatti and M. G. Everett. Models of core/periphery structures. Social Networks, 21:375–395, 1999.

[10] E. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. Cherry, and G. Sherlock. GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20(18):3710–3715, 2004.

[11] R. L. Breiger. The duality of persons and groups. Social Forces, 53(2):181–190, 1974.

[12] G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, 2001.

[13] S. Chatterjee and A. Hadi. Regression Analysis by Example. Wiley Series in Probability and Statistics. Wiley, 2006.

[14] A. Clauset, C. Moore, and M. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.

[15] S. R. Collins, P. Kemmeren, X.-C. Zhao, J. F. Greenblatt, F. Spencer, F. C. P. Holstege, J. S. Weissman, and N. J. Krogan. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Molecular & Cellular Proteomics, 6(3):439–450, March 2007.

[16] L. Danon, J. Duch, A. Diaz-Guilera, and A. Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 29(09):P09008, 2005.

[17] T. S. Evans and R. Lambiotte. Line graphs, link partitions, and overlapping communities. Physical Review E, 80:016105, 2009.

[18] A. M. Feist, C. S. Henry, J. L. Reed, M. Krummenacker, A. R. Joyce, P. D. Karp, L. J. Broadbelt, V. Hatzimanikatis, and B. Ø. Palsson. A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Molecular Systems Biology, 3(121):121, 2007.

[19] S. L. Feld. The focused organization of social ties. American Journal of Sociology, 86(5):1015–1035, 1981.

[20] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[21] S. Fortunato and M. Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1):36–41, 2007.

[22] A. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. Rick, B. Kuster, P. Bork, R. Russell, and G. Superti-Furga. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084):631–636, 2006.

[23] M. Girvan and M. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99(12):7821–7826, 2002.


[24] C. Granell, S. Gomez, and A. Arenas. Hierarchical multiresolution method to overcome the resolution limit in complex networks. International Journal of Bifurcation and Chaos, 22(7), 2012.

[25] M. S. Granovetter. The strength of weak ties. American Journal of Sociology, 78:1360–1380, 1973.

[26] S. Gregory. Finding overlapping communities in networks by label propagation. New Journal of Physics, 12(10):103018, 2010.

[27] S. Gregory. Fuzzy overlapping communities in networks. Journal of Statistical Mechanics: Theory and Experiment, 2011(02):P02017, 2011.

[28] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[29] P. Holme. Core-periphery organization of complex networks. Physical Review E, 72:046111, 2005.

[30] P. Holme and G. Ghoshal. Dynamics of networking agents competing for high centrality and low degree. Physical Review Letters, 96:098701, 2006.

[31] B. Karrer and M. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83:016107, 2010.

[32] N. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. Tikuisis, T. Punna, J. Peregrín-Alvarez, M. Shales, X. Zhang, M. Davey, M. Robinson, A. Paccanaro, J. Bray, A. Sheung, B. Beattie, D. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M. Canete, J. Vlasblom, S. Wu, C. Orsi, S. Collins, S. Chandran, R. Haw, J. Rilstone, K. Gandi, N. Thompson, G. Musso, P. St Onge, S. Ghanny, M. Lam, G. Butland, A. Altaf-Ui, S. Kanaya, A. Shilatifard, E. O'Shea, J. Weissman, C. Ingles, T. Hughes, J. Parkinson, M. Gerstein, S. Wodak, A. Emili, and J. Greenblatt. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440(7084):637–643, 2006.

[33] A. Lancichinetti and S. Fortunato. Community detection algorithms: A comparative analysis. Physical Review E, 80(5):056117, 2009.

[34] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato. Finding statistically significant communities in networks. PLoS ONE, 6(4):e18961, 2011.

[35] C. Lee, F. Reid, A. McDaid, and N. Hurley. Detecting highly overlapping community structure by greedy clique expansion. In Proceedings of the Fourth International Workshop on Advances in Social Network Mining and Analysis, 2010.

[36] S. Lehmann, M. Schwartz, and L. K. Hansen. Biclique communities. Physical Review E, 78:016108, 2008.


[37] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), 2007.

[38] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

[39] D. Lusseau, K. Schneider, O. Boisseau, P. Haase, E. Slooten, and S. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54:396–405, 2003.

[40] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[41] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27:415–444, 2001.

[42] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC '07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 29–42, 2007.

[43] M. Mørup, M. N. Schmidt, and L. K. Hansen. Infinite multiple membership relational modeling for complex networks. CoRR, abs/1101.5097, 2011.

[44] D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association, rhyme, and word fragment norms, 1998. http://www.usf.edu/FreeAssociation/.

[45] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103(23):8577–8582, 2006.

[46] M. Newman and G. Barkema. Monte Carlo Methods in Statistical Physics. Oxford University Press, 1999.

[47] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.

[48] Y. Park, C. Moore, and J. S. Bader. Dynamic networks from hierarchical Bayesian graph clustering. PLoS ONE, 5(1):e8118, 2010.

[49] W. W. Powell, D. R. White, K. W. Koput, and J. Owen-Smith. Network dynamics and field evolution: The growth of interorganizational collaboration in the life sciences. American Journal of Sociology, 110(4):1132–1205, 2005.

[50] T. Reguly, A. Breitkreutz, L. Boucher, B.-J. Breitkreutz, G. Hon, C. Myers, A. Parsons, H. Friesen, R. Oughtred, A. Tong, C. Stark, Y. Ho, D. Botstein, B. Andrews, C. Boone, O. Troyanskya, T. Ideker, K. Dolinski, N. Batada, and M. Tyers. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology, 5(4):11, 2006.


[51] J. Reichardt and S. Bornholdt. Detecting fuzzy community structures in complex net-works with a potts model. Physical Review Letter, 93:218701, Nov 2004.

[52] C. Rivera, R. Vakil, and J. Bader. Nemo: Network module identification in cytoscape.BMC Bioinformatics, 11(Suppl 1):S61, 2010.

[53] M. P. Rombach, M. A. Porter, J. H. Fowler, and P. J. Mucha. Core-periphery structurein networks. SIAM Journal of Applied Mathematics, 74(1):167–190, 2014.

[54] F. D. Rossa, F. Dercole, and C. Piccardi. Profiling core-periphery network structure byrandom walkers. Scientific Reports, 3, 2013.

[55] M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks revealcommunity structure. Proceedings of the National Academy of Sciences of the UnitedStates of America, 105:1118–1123, 2008.

[56] H. Shen, X. Cheng, K. Cai, and M.-B. Hu. Detect overlapping and hierarchical com-munity structure in networks. Physica A: Statistical Mechanics and its Applications,388(8):1706 – 1712, 2009.

[57] G. Simmel. Conflict: the Web of Group Affiliations. Trans. by Kurt H. Wolff andReinhard Bendix. Free Press, 1955.

[58] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the RoyalStatistical Society. Series B (Methodological), 58(1):267–288, 1996.

[59] R. E. Ulanowicz, C. Bondavalli, and M. S. Egnotovich. Network analysis of trophicdynamics in south florida ecosystem, FY 97: The florida bay ecosystem. Annual Reportto the United States Geological Service Biological Resources Division, pages 98–123,1998.

[60] S. Wasserman and K. Faust. Social Network Analysis. Cambridge University Press, 1994.

[61] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998.

[62] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth communities. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 745–754, 2012.

[63] H. Yu, P. Braun, M. A. Yıldırım, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J.-F. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A.-S. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone, M. Snyder, F. P. Roth, A.-L. Barabasi, J. Tavernier, D. E. Hill, and M. Vidal. High-quality binary protein interaction map of the yeast interactome network. Science, 322(5898):104–110, 2008.

[64] W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.

[65] E. Zheleva, H. Sharara, and L. Getoor. Co-evolution of social and affiliation networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1007–1016, 2009.

A Appendix

A.1 Raw performance scores of the experiments with ground-truth communities

Table S4 reports the unnormalized values of the evaluation metrics from the experiments in Section S5.

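| Data | Metric | L | C | I | M | A |
|---|---|---|---|---|---|---|
| LiveJournal | Omega Index | 0.39 | 0.41 | 0.46 | 0.39 | 0.50 |
| LiveJournal | F1 Score | 0.35 | 0.39 | 0.45 | 0.63 | 0.57 |
| LiveJournal | Mutual Information | 0.13 | 0.12 | 0.17 | 0.08 | 0.14 |
| LiveJournal | Number of Communities | 0.00 | 0.27 | 0.00 | 0.54 | 0.34 |
| Friendster | Omega Index | 0.50 | 0.52 | 0.61 | 0.35 | 0.56 |
| Friendster | F1 Score | 0.34 | 0.39 | 0.47 | 0.59 | 0.56 |
| Friendster | Mutual Information | 0.17 | 0.18 | 0.28 | 0.10 | 0.20 |
| Friendster | Number of Communities | 0.00 | 0.29 | 0.00 | 0.51 | 0.37 |
| Orkut | Omega Index | 0.46 | 0.51 | 0.55 | 0.40 | 0.57 |
| Orkut | F1 Score | 0.34 | 0.39 | 0.46 | 0.62 | 0.57 |
| Orkut | Mutual Information | 0.17 | 0.18 | 0.26 | 0.11 | 0.21 |
| Orkut | Number of Communities | 0.00 | 0.00 | 0.00 | 0.54 | 0.35 |
| Youtube | Omega Index | 0.48 | 0.42 | 0.53 | 0.42 | 0.49 |
| Youtube | F1 Score | 0.33 | 0.22 | 0.42 | 0.57 | 0.49 |
| Youtube | Mutual Information | 0.12 | 0.05 | 0.17 | 0.05 | 0.08 |
| Youtube | Number of Communities | 0.00 | 0.37 | 0.00 | 0.49 | 0.46 |
| DBLP | Omega Index | 0.40 | 0.37 | 0.41 | 0.44 | 0.50 |
| DBLP | F1 Score | 0.41 | 0.38 | 0.43 | 0.59 | 0.53 |
| DBLP | Mutual Information | 0.24 | 0.22 | 0.21 | 0.10 | 0.17 |
| DBLP | Number of Communities | 0.00 | 0.00 | 0.00 | 0.39 | 0.52 |
| Amazon | Omega Index | 0.04 | 0.02 | 0.08 | 0.19 | 0.13 |
| Amazon | F1 Score | 0.47 | 0.26 | 0.61 | 0.78 | 0.70 |
| Amazon | Mutual Information | 0.10 | 0.05 | 0.11 | 0.07 | 0.10 |
| Amazon | Number of Communities | 0.00 | 0.27 | 0.10 | 0.44 | 0.49 |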

Table S4: Performance of the methods on the networks with ground-truth communities. Raw scores of the methods in the experiments in Section S5. L: Link clustering, C: Clique percolation, I: Infomap, M: Mixed membership stochastic blockmodels, and A: AGM.
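As a reading aid, the following minimal Python sketch picks the best-scoring method per metric for one dataset. It uses the LiveJournal rows of Table S4 verbatim. The helper name `best_method` is hypothetical, and the sketch assumes that higher is better for the Omega Index, F1 Score, and Mutual Information; the Number of Communities row is left out because its direction is not stated in this appendix.

```python
# Raw LiveJournal scores copied from Table S4.
# Column order follows the table: L, C, I, M, A.
METHODS = ["L", "C", "I", "M", "A"]
LIVEJOURNAL = {
    "Omega Index": [0.39, 0.41, 0.46, 0.39, 0.50],
    "F1 Score": [0.35, 0.39, 0.45, 0.63, 0.57],
    "Mutual Information": [0.13, 0.12, 0.17, 0.08, 0.14],
}


def best_method(scores):
    """Return (method, score) for the highest raw score.

    Assumes higher is better; ties go to the earlier column.
    """
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return METHODS[idx], scores[idx]


if __name__ == "__main__":
    for metric, scores in LIVEJOURNAL.items():
        method, score = best_method(scores)
        print(f"{metric}: best = {method} ({score:.2f})")
```

The same helper can be applied to any other dataset block of the table by swapping in its rows.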

A.2 Raw performance scores of the experiments with biological networks

Table S5 gives the unnormalized values of the evaluation metrics (i.e., p-values) from the experiments in Section S7. Lower scores are better.

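| Data | Metric | L | C | I | M | A |
|---|---|---|---|---|---|---|
| PPI (Y2H) | Cellular Component | 0.43 | 0.39 | 0.26 | 1.00 | 0.02 |
| PPI (Y2H) | Biological Process | 0.24 | 0.16 | 0.12 | 1.00 | 0.02 |
| PPI (Y2H) | Molecular Function | 0.18 | 0.16 | 0.14 | 1.00 | 0.02 |
| PPI (AP/MS) | Cellular Component | 0.08 | 0.07 | 0.07 | 0.80 | 0.01 |
| PPI (AP/MS) | Biological Process | 0.03 | 0.02 | 0.03 | 0.80 | 1.5 × 10⁻⁶ |
| PPI (AP/MS) | Molecular Function | 0.07 | 0.05 | 0.06 | 0.80 | 0.01 |
| PPI (LC) | Cellular Component | 0.06 | 0.10 | 0.06 | 1.00 | 1.9 × 10⁻⁶ |
| PPI (LC) | Biological Process | 1.6 × 10⁻³ | 3.6 × 10⁻³ | 0.01 | 1.00 | 2.4 × 10⁻⁸ |
| PPI (LC) | Molecular Function | 0.04 | 0.05 | 0.01 | 1.00 | 5.5 × 10⁻⁵ |
| PPI (All) | Cellular Component | 0.08 | 0.24 | 0.21 | 1.00 | 0.01 |
| PPI (All) | Biological Process | 0.06 | 0.08 | 0.09 | 1.00 | 0.01 |
| PPI (All) | Molecular Function | 0.07 | 0.09 | 0.09 | 1.00 | 3.4 × 10⁻³ |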

Table S5: Performance of the methods on the biological networks measured by the GO term finder. The average p-value of the detected communities computed by the GO term finder in the experiments in Section S7. L: Link clustering, C: Clique percolation, I: Infomap, M: Mixed membership stochastic blockmodels, and A: AGM.

A.3 Raw performance scores of the experiments in Ahn et al.

Table S6 shows the unnormalized values of the evaluation metrics from the experiments in Section S8.

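| Data | Metric | L | C | I | M | A |
|---|---|---|---|---|---|---|
| PPI (Y2H) | Community Coverage | 0.56 | 0.16 | 0.99 | 1.00 | 1.00 |
| PPI (Y2H) | Overlap Coverage | 0.73 | 0.18 | 0.99 | 11.00 | 1.62 |
| PPI (Y2H) | Community Quality | 2.30 | 2.18 | 2.33 | 1.00 | 1.75 |
| PPI (Y2H) | Overlap Quality | 0.08 | 0.04 | 0.00 | 0.14 | 0.11 |
| PPI (AP/MS) | Community Coverage | 0.84 | 0.77 | 0.99 | 1.00 | 0.97 |
| PPI (AP/MS) | Overlap Coverage | 2.58 | 0.82 | 0.99 | 9.66 | 3.21 |
| PPI (AP/MS) | Community Quality | 2.90 | 2.16 | 2.76 | 1.04 | 2.70 |
| PPI (AP/MS) | Overlap Quality | 0.28 | 0.08 | 0.00 | 0.24 | 0.34 |
| PPI (LC) | Community Coverage | 0.56 | 0.56 | 0.99 | 1.00 | 0.98 |
| PPI (LC) | Overlap Coverage | 0.93 | 0.60 | 0.99 | 10.55 | 1.52 |
| PPI (LC) | Community Quality | 4.71 | 2.94 | 3.67 | 1.00 | 3.59 |
| PPI (LC) | Overlap Quality | 0.15 | 0.09 | 0.00 | 0.21 | 0.14 |
| PPI (All) | Community Coverage | 0.44 | 0.55 | 0.99 | 1.00 | 0.99 |
| PPI (All) | Overlap Coverage | 1.28 | 0.59 | 0.99 | 12.07 | 2.31 |
| PPI (All) | Community Quality | 5.43 | 1.51 | 3.99 | 1.01 | 3.37 |
| PPI (All) | Overlap Quality | 0.15 | 0.08 | 0.00 | 0.09 | 0.12 |
| Metabolic | Community Coverage | 0.95 | 0.67 | 1.00 | 1.00 | 1.00 |
| Metabolic | Overlap Coverage | 4.66 | 0.88 | 1.00 | 9.82 | 6.25 |
| Metabolic | Community Quality | 5.33 | 1.25 | 4.91 | 1.00 | 6.39 |
| Metabolic | Overlap Quality | 0.31 | 0.12 | 0.00 | 0.14 | 0.38 |
| Philosophers | Community Coverage | 0.82 | 0.75 | 0.99 | 1.00 | 1.00 |
| Philosophers | Overlap Coverage | 2.66 | 0.77 | 0.99 | 10.59 | 4.64 |
| Philosophers | Community Quality | 2.37 | 1.17 | 1.98 | 1.01 | 2.43 |
| Philosophers | Overlap Quality | 0.46 | 0.13 | 0.00 | 0.29 | 0.60 |
| Word associations | Community Coverage | 0.92 | 0.94 | 1.00 | 1.00 | 1.00 |
| Word associations | Overlap Coverage | 5.12 | 1.41 | 1.00 | 8.16 | 10.48 |
| Word associations | Community Quality | 86.20 | 1.18 | 12.72 | 1.01 | 28.06 |
| Word associations | Overlap Quality | 0.09 | 0.06 | 0.00 | 0.03 | 0.20 |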

Table S6: Performance of the methods in the experiments of Ahn et al. The unnormalized scores in the experiments of Ahn et al. [1] in Section S8. L: Link clustering, C: Clique percolation, I: Infomap, M: Mixed membership stochastic blockmodels, and A: AGM.
