
The Trouble with Community Detection

Aaron Clauset
@aaronclauset
Computer Science Dept. & BioFrontiers Institute
University of Colorado, Boulder
External Faculty, Santa Fe Institute

1 June 2016
© 2016 Aaron Clauset

community detection

[Figure: sequence-alignment pipeline (panels A-D). Steps: calculate alignment scores a(t) over alignment position t; convert to alignment indicators; remove short aligned regions; extract highly variable regions.]

community interaction matrix

large-scale structure = community structure

vertices with same pattern of inter-community connections

lies, damned lies, and community detection


many networks include metadata x on their nodes:

social networks: age, sex, ethnicity or race, etc.

food webs: feeding mode, species body mass, etc.

Internet: data capacity, physical location, etc.

protein interactions: molecular weight, association with cancer, etc.

metadata is often used to evaluate the accuracy of community detection algorithms:

if a community detection method A finds a partition P that correlates with x, then we say that A is good



module belong. We find that most modules contain metabolites mostly from one major pathway. For example, in 17 of the 19 modules identified for E. coli, more than one-third of the metabolites belong to a single pathway. Interestingly, some other modules (two in the case of E. coli) cannot be trivially associated with a single traditional pathway. These modules are typically central in the metabolism and contain, mostly, metabolites that are classified in KEGG as belonging to carbohydrate and amino-acid metabolism.

Next we identify the role of each metabolite. In Fig. 2b we show the roles identified in the metabolic network of E. coli. Other organisms show a similar distribution of the nodes in the different roles, even though they correspond to organisms that are very distant from an evolutionary standpoint (see Supplementary Information). Role R1, which contains ultra-peripheral metabolites with small degree and no between-module links, comprises 76–86% of all the metabolites in the networks. This considerably simplifies the coarse-grained representation of the network as these nodes do not need to be identified separately. Note that this finding alone represents an important step towards the goal of extracting scale-specific information from complex networks.

The information about modules and roles enables us to build a cartographic representation of the metabolic network of, for example, E. coli (Fig. 3). This representation enables us to recover relevant biological information. For instance, we find that the metabolism is mostly organized around the module containing pyruvate, which in turn is strongly connected to the module whose hub is acetyl-coenzyme A (CoA). These two molecules are key to connecting the metabolism of carbohydrates, amino acids and lipids to the tricarboxylic acid (TCA) cycle from which ATP is obtained. These two modules are connected to more peripheral ones by key metabolites such as D-glyceraldehyde 3-phosphate and D-fructose 6-phosphate (which connect to the glucose and galactose metabolisms), D-ribose 5-phosphate (which connects to the metabolism of certain nucleotides), and glycerone phosphate (which connects to the metabolism of certain lipids).

Our analysis also uncovers nodes with key connector roles that take part in only a small but fundamental set of reactions. For example, N-carbamoyl-L-aspartate takes part in only three reactions but is vital because it connects the pyrimidine metabolism, whose hub is uracil, to the core of the metabolism through the alanine and aspartate metabolism. The potential importance of such non-hub connectors points to another consideration. It is a plausible hypothesis that nodes with different roles are under different evolutionary constraints and pressures. In particular, we expect that nodes with structurally relevant roles are more necessary and therefore more conserved across species.

To quantify the relation between roles and conservation, we define the loss rate p_lost(R) (see Methods). Structurally relevant roles are expected to have low values of p_lost(R) and vice versa. We find that the different roles have different loss rates (Fig. 4). As expected, ultra-peripheral nodes (role R1) have the highest loss rate whereas connector hubs (role R6) are the most conserved across all species considered.

The results for the comparison of p_lost(R) for ultra-peripheral nodes and connector hubs is illustrative, but hardly surprising. The comparison of p_lost(R) for non-hub connectors (role R3) and provincial hubs (role R5), however, yields a surprising finding. The metabolites in the provincial hubs class have many within-module

Figure 3: Cartographic representation of the metabolic network of E. coli. Each circle represents a module and is coloured according to the KEGG pathway classification of the metabolites it contains. Certain important nodes are depicted as triangles (non-hub connectors), hexagons (connector hubs) and squares (provincial hubs). Interactions between modules and nodes are depicted using lines, with thickness proportional to the number of actual links. Inset: metabolic network of E. coli, which contains 473 metabolites and 574 links. This representation was obtained using the program Pajek. Each node is coloured according to the 'main' colour of its module, as obtained from the cartographic representation.

letters to nature, NATURE | VOL 433 | 24 FEBRUARY 2005 | www.nature.com/nature, p. 898. © 2005 Nature Publishing Group


several recent studies claim these are the exception

• real networks either do not contain structural communities or communities exist but they do not correlate with metadata groups

[1] see Leskovec et al. (2009), and Yang & Leskovec (2012), and Hric, Darst & Fortunato (2014)

Figure 5: (Color online) NMI scores between structural communities and metadata groups for different networks. Scores are grouped by datasets on the x-axis. The height of each column is the maximal NMI score between any partition layer of the metadata partitions and any layer returned by the community detection method, considering only those comparisons where the overlap of the partitions is larger than 10% of total number of nodes.

methods do not align with partitions built from metadata, but what about specific groups? Can we detect any of the groups well? Are some groups reflected in the graph structure and detectable, but lost in the bulk noise of the graph? This is what we wish to investigate here.

The basis of our analysis is the Jaccard score between two groups. Let $C_i$ represent (the set of nodes of) the known group $i$, and $D_j$ represent (the set of nodes of) the detected community $j$. The Jaccard score between these two sets is defined as

$$J(C_i, D_j) = \frac{|C_i \cap D_j|}{|C_i \cup D_j|}, \quad (1)$$

with $|\cdots|$ set cardinality, $\cap$ set intersection, and $\cup$ set union. The Jaccard score ranges from one (perfect match) to zero and roughly indicates the fraction of nodes shared between the two sets: the match quality.

The recall score measures how well one known group is detected. The recall score of one known group $C_i$ is defined as the maximal Jaccard score between it and every detected community $D_j$,

$$R(C_i) = \max_{D_j \in \{D\}} J(C_i, D_j). \quad (2)$$

It is near one if the group is well detected and low otherwise. We can study the distribution of these scores to see how many groups can be detected at any given quality level. Recall measures the detection of known groups, and to measure the significance of detected communities, we can reverse the measure to calculate a precision score

$$P(D_j) = \max_{C_i \in \{C\}} J(D_j, C_i). \quad (3)$$

The precision score tells us how well one detected community corresponds to any known group.

We can now directly quantify the two conditions for good community detection: every known group must correspond to some detected community, and every detected community must represent some known group. Both of these measures are still interesting independently: a high recall but low precision indicates that the known groups are reflected in the network structurally, but there are many structural communities that are not known. We visualize the scores by means of rank-Jaccard plots which give an overview of the network's detection quality. We compute the recall (precision) for every known (detected) group and sort the groups in order of ascending Jaccard score. We plot recall (precision) vs the group rank, sorted by recall (precision) score so that the horizontal scale is the relative group rank, i.e. the ratio between the rank of the group and the number of groups (yielding a value between 0 and 1). Similar to our treatment of the partition-level analysis, we only plot matchings whose intersection covers more than 10% of total nodes in the graph. In our final plots, the average value of the curve (proportional to the area under it) is the average recall or precision score over all groups. The shape of the curve can tell us if all groups are detected equally well (yielding a high plateau) or if there is a large inequality in detection (a high slope). Furthermore, this allows us to compactly represent multiple layers. Each independent layer of known (detected) groups can be plotted in the same figure. We would generally look for the highest curve to know if any layer has a high recall (precision). When computing recall (precision), unless otherwise specified, as detected communities we consider the communities of all partitions delivered by a method, whereas the metadata groups are those present in all metadata partitions (if more than one partition is available in either case). This will give us the maximum possible recall (precision), which might be far higher than
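The Jaccard, recall and precision scores defined in the excerpt can be sketched in a few lines of Python. This is only an illustration: the function names and the toy groups are hypothetical, and the 10% overlap filter mentioned in the text is left out for brevity.

```python
def jaccard(a, b):
    """Jaccard score |a ∩ b| / |a ∪ b| between two node sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recall(known_group, detected):
    """Best Jaccard match of one known group among all detected communities."""
    return max(jaccard(known_group, d) for d in detected)

def precision(detected_community, known):
    """Best Jaccard match of one detected community among all known groups."""
    return max(jaccard(detected_community, c) for c in known)

def rank_curve(scores):
    """(relative rank, score) pairs in ascending score order, as in a rank-Jaccard plot."""
    ordered = sorted(scores)
    k = len(ordered)
    return [((i + 1) / k, s) for i, s in enumerate(ordered)]

# Toy example (hypothetical data): two known groups, three detected communities.
known = [{1, 2, 3, 4}, {5, 6, 7, 8}]
detected = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
recalls = [recall(c, detected) for c in known]
precisions = [precision(d, known) for d in detected]
```

In this toy case both known groups are recovered fairly well, while the small detected community `{4, 5}` matches neither group, which is the "high recall, lower precision" situation the text describes.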

maximum NMI between any partition layer of the metadata partitions and any layer returned by the community detection method

Hric et al. (2014)

"classic" data sets online data sets

but wait!

the idea

use metadata $x$ to help select a partition $P^* \in \{P\}$ that correlates with $x$, from among the exponential number of plausible partitions

a metadata-aware degree-corrected stochastic block model

• metadata: $x = \{x_u\}$
• degree: $d = \{d_u\}$
• prior on community assignments: $P(s \mid \Gamma, x) = \prod_i \gamma_{s_i, x_i}$
• edge probability: $p_{uv} = d_u d_v \theta_{s_u, s_v}$
• block parameters: $\theta_{st}$

[1] technical details in Newman & Clauset (2015) arXiv:1507.04001, http://arxiv.org/abs/1507.04001

• model likelihood (with network adjacency matrix $A$ and metadata $x$)

$$P(A \mid \Theta, \Gamma, x) = \sum_s P(A \mid \Theta, s)\, P(s \mid \Gamma, x) = \sum_s \prod_{u<v} p_{uv}^{A_{uv}} (1 - p_{uv})^{1 - A_{uv}} \prod_u \gamma_{s_u, x_u}$$

• inference performed using EM algorithm to choose $\Theta$ and $\Gamma$
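As a minimal sketch of the likelihood on this slide, the Python below evaluates the term inside the sum for one fixed assignment s, and then sums over all assignments by brute force. Assumptions: the names are hypothetical, `theta[s][t]` and `gamma[s][x]` are nested lists with strictly positive entries, every edge probability lies strictly between 0 and 1, and the enumeration is only feasible for tiny networks. It is not the EM procedure used in the talk.

```python
import itertools
import math

def complete_log_likelihood(A, s, x, d, theta, gamma):
    """log [ P(A | Theta, s) P(s | Gamma, x) ] for one assignment s."""
    n = len(A)
    ll = 0.0
    for u in range(n):
        # Prior term: gamma[s_u][x_u]
        ll += math.log(gamma[s[u]][x[u]])
        for v in range(u + 1, n):
            # Edge probability p_uv = d_u d_v theta[s_u][s_v]
            p = d[u] * d[v] * theta[s[u]][s[v]]
            ll += math.log(p) if A[u][v] else math.log(1.0 - p)
    return ll

def marginal_likelihood(A, x, d, theta, gamma):
    """P(A | Theta, Gamma, x): brute-force sum over all k^n assignments."""
    n, k = len(A), len(theta)
    return sum(
        math.exp(complete_log_likelihood(A, s, x, d, theta, gamma))
        for s in itertools.product(range(k), repeat=n)
    )

# Toy example (hypothetical values): two connected nodes, two communities,
# a single metadata value that carries no information.
A = [[0, 1], [1, 0]]
x = [0, 0]
d = [1.0, 1.0]
theta = [[0.5, 0.1], [0.1, 0.5]]
gamma = [[0.5], [0.5]]
likelihood = marginal_likelihood(A, x, d, theta, gamma)
```

The sum has exponentially many terms, which is why a practical method needs an approximation scheme rather than direct enumeration.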


networks with planted structure

does this method recover known structure in synthetic data?

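• we use the planted partition test: 2 groups, equal sizes, constant mean degree, with within-group and between-group edge probabilities

$$\begin{pmatrix} c_{in}/n & c_{out}/n \\ c_{out}/n & c_{in}/n \end{pmatrix}$$

• assign metadata $x_u$ = ground truth with prob. $\rho = \{0.5, \ldots, 0.9\}$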
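A planted partition benchmark of this kind can be generated with a short Python sketch. The function name, the alternating group assignment and the seeding are assumptions made for illustration; the O(n²) loop is for clarity and would be slow at n = 10 000.

```python
import random

def planted_partition(n, c_in, c_out, rho, seed=0):
    """Two equal-sized groups; edge prob. c_in/n within and c_out/n between groups.
    Each metadata value equals the true group with probability rho, else it is flipped."""
    rng = random.Random(seed)
    truth = [u % 2 for u in range(n)]  # alternate nodes between groups 0 and 1
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = (c_in if truth[u] == truth[v] else c_out) / n
            if rng.random() < p:
                edges.append((u, v))
    x = [g if rng.random() < rho else 1 - g for g in truth]
    return truth, x, edges

truth, x, edges = planted_partition(n=200, c_in=12, c_out=4, rho=0.8)
```

With equal group sizes the mean degree is approximately (c_in + c_out)/2, so the values above give a mean degree near 8. At rho = 0.5 the metadata are independent of the planted groups.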

[Figure: fraction of correctly assigned nodes vs. c_in − c_out (weaker to stronger community structure, from undetectable to easy to detect), one curve per ρ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}]

let mean degree c = 8

• when ρ = 0.5, metadata is irrelevant and we recover regular SBM behavior

[1] n = 10 000


• for ρ > 0.5, metadata + SBM beats any algorithm without metadata, and beats metadata alone.

real-world networks

1. high school social network: 795 students in a medium-sized American high school and its feeder middle school

2. marine food web: predator-prey interactions among 488 species in Weddell Sea in Antarctica

3. Malaria gene recombinations: recombination events among 297 var genes

4. Facebook friendships: online friendships among 15,126 Harvard students and alumni

5. Internet graph: peering relations among 46,676 Autonomous Systems

real-world networks


• x = {grade 7-12, ethnicity, gender}

• method finds a good partition between high-school and middle-school

• NMI = 0.881

• without metadata: NMI ∈ [0.105, 0.384]

[1] Add Health network data, designed by Udry, Bearman & Harris

real-world networks

• method finds a good partition between blacks and whites (with others scattered among)

• NMI = 0.820

• without metadata: NMI ∈ [0.120, 0.239]

real-world networks

• method finds no good partition between males/females; instead, it chooses a mixture of grade/ethnicity partitions

• NMI = 0.003

• without metadata: NMI ∈ [0.000, 0.010]


conclusions

the trouble with community detection
1. many good local optima
2. naive optimization inconsistent against node metadata


two birds with one stone
a metadata-aware stochastic block model

• choose a partition that correlates with edges, and metadata (if possible)
• learns metadata associations with communities
• more accurate than any algorithm that uses only edges or metadata
• highly scalable EM + Belief Propagation algorithm


acknowledgements

Mark Newman (Michigan)

Benjamin H. Good (Harvard)

Yves-Alexandre de Montjoye (MIT)

Funding support:

arXiv:1507.04001v1 [cs.SI] 14 Jul 2015

Structure and inference in annotated networks

M. E. J. Newman (1, 2) and Aaron Clauset (2, 3)

(1) Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, USA
(2) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501
(3) Department of Computer Science and BioFrontiers Institute, University of Colorado, Boulder, CO 80309, USA

For many networks of scientific interest we know both the connections of the network and information about the network nodes, such as the age or gender of individuals in a social network, geographic location of nodes in the Internet, or cellular function of nodes in a gene regulatory network. Here we demonstrate how this "metadata" can be used to improve our analysis and understanding of network structure. We focus in particular on the problem of community detection in networks and develop a mathematically principled approach that combines a network and its metadata to detect communities more accurately than can be done with either alone. Crucially, the method does not assume that the metadata are correlated with the communities we are trying to find. Instead the method learns whether a correlation exists and correctly uses or ignores the metadata depending on whether they contain useful information. The learned correlations are also of interest in their own right, allowing us to make predictions about the community membership of nodes whose network connections are unknown. We demonstrate our method on synthetic networks with known structure and on real-world networks, large and small, drawn from social, biological, and technological domains.

I. INTRODUCTION

Networks arise in many fields and provide a powerful and compact representation of the internal structure of a wide range of complex systems [1]. Examples include social networks of interactions among people, technological and information networks such as the Internet or the World Wide Web, and biological networks of molecules, cells, or entire species. The last two decades have witnessed rapid growth both in the availability of network data and in the number and sophistication of network analysis techniques. Borrowing ideas from graph theory, statistical physics, computer science, statistics, and other areas, network analysis typically aims to characterize a network's structural features in a way that sheds light on the behavior of the system the network describes. Studies of social networks, for instance, might identify the most influential or central individuals in a population. Studies of road networks can shed light on traffic flows or bottlenecks within a city or country. Studies of pathways in metabolic networks can lead to a more complete understanding of the molecular machinery of the cell.

Most research in this area treats networks as objects of pure topology, unadorned sets of nodes and their interactions. Most network data, however, are accompanied by annotations or metadata that describe properties of nodes such as a person's age, gender, or ethnicity in a social network, feeding mode or body mass of species in a food web, data capacity or physical location of nodes on the Internet, and so forth. (There can be metadata on the edges of a network as well as on the nodes [2], but our focus here is on the node case.) In this paper, we consider how to extend the analysis of networks to directly incorporate such metadata. Our approach is based on methods of statistical inference and can in principle be applied to a range of different network analysis tasks. Here, we focus specifically on one of the most widely studied tasks, the community detection problem. Community detection, also called node clustering or classification, searches for a good division of a network's nodes into groups or classes [3]. Typically, one searches for assortative structure, groupings of nodes such that connections are denser within groups than between them. This structure is common in social networks, for example, where groups may correspond to sets of friends or coworkers, but it also occurs in other cases, including biological and ecological networks, the Web, transportation and distribution networks, and others. Less common, but no less important, is disassortative structure, in which network connections are sparser within groups than between them, and mixtures of assortative and disassortative structure can also occur, where different groups may have varying propensities for within- or between-group connections.

In many cases, the groups identified by community detection correlate meaningfully with other network properties or functions, such as allegiances or personal interests in social networks [3, 4] or biological function in metabolic networks [5, 6]. Some recent research, however, has suggested that these cases may be the exception rather than the rule [7, 8], an important point that we address later in this paper.

A large number of methods have been proposed for detecting communities in unannotated networks [3]. Among these, some of the most powerful, both in terms of rigorously provable performance and of raw speed, are those based on statistical inference. Here we build on these methods to incorporate node metadata into the community detection problem in a principled and flexible manner. The resulting methods have several attrac-

to appear, Nature Communications (arXiv:1507.04001)

Performance of modularity maximization in practical contexts

Benjamin H. Good (1, 2), Yves-Alexandre de Montjoye (3, 2), and Aaron Clauset (2)

(1) Department of Physics, Swarthmore College, Swarthmore, Pennsylvania 19081, USA
(2) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA
(3) Department of Applied Mathematics, Université Catholique de Louvain, 4 Avenue Georges Lemaitre, B-1348 Louvain-la-Neuve, Belgium

(Received 1 October 2009; published 15 April 2010)

Although widely used in practice, the behavior and accuracy of the popular module identification technique called modularity maximization is not well understood in practical contexts. Here, we present a broad characterization of its performance in such situations. First, we revisit and clarify the resolution limit phenomenon for modularity maximization. Second, we show that the modularity function Q exhibits extreme degeneracies: it typically admits an exponential number of distinct high-scoring solutions and typically lacks a clear global maximum. Third, we derive the limiting behavior of the maximum modularity Q_max for one model of infinitely modular networks, showing that it depends strongly both on the size of the network and on the number of modules it contains. Finally, using three real-world metabolic networks as examples, we show that the degenerate solutions can fundamentally disagree on many, but not all, partition properties such as the composition of the largest modules and the distribution of module sizes. These results imply that the output of any modularity maximization procedure should be interpreted cautiously in scientific contexts. They also explain why many heuristics are often successful at finding high-scoring partitions in practice and why different heuristics can disagree on the modular structure of the same network. We conclude by discussing avenues for mitigating some of these behaviors, such as combining information from many degenerate solutions or using generative models.

DOI: 10.1103/PhysRevE.81.046106 PACS number(s): 89.75.Hc, 89.20.-a, 87.16.-b, 02.70.-c

I. INTRODUCTION

Networks are a powerful tool for understanding the structure, dynamics, robustness and evolution of complex biological, technological and social systems [1,2]. The automatic detection of modular structures in networks (also called communities [3] or compartments [4], and conventionally understood to be large subgraphs with high internal densities) can provide a scalable way to identify functionally important or closely related classes of nodes from interaction data alone [5,6].

The identification of modular structures has broad implications for many systems-level questions. For instance, it has strong consequences for the behavior of dynamical processes on networks [7,8], and can provide a principled way to reduce or coarse-grain a system by dividing global heterogeneity into relatively homogeneous substructures. The search for such modular substructures has been particularly intense in molecular networks. This is, in part, because modules have theoretical significance for molecular networks [9–11]: they can correspond to functional clusters of genes or proteins [12,13], they may represent targets of natural selection above the level of individual genes or proteins but below the level of the whole organism, and they may provide evidence of past evolutionary constraints or pressures [13,14]. Past work along these lines has identified modular structures in signaling, metabolic and protein-interaction systems [14–17], although some questions remain about their statistical significance [18] and functional relevance [19]. Naturally, many of these questions apply equally well to modules in social and technological networks.

Empirical evidence for a modular organization is typically derived using computer algorithms that automatically identify modules using network connectivity data, and among the many techniques now available (see [5,6,20] for reviews), the method of modularity maximization [3] is by far the most popular. Under this method, each decomposition or partition of a network into k disjoint modules is given a score Q, called the modularity or sometimes "Newman-Girvan modularity"

$$Q = \sum_{i=1}^{k} \left[ \frac{e_i}{m} - \left( \frac{d_i}{2m} \right)^2 \right], \quad (1)$$

where $e_i$ is the number of edges in module $i$, $d_i$ is the total degree of nodes in module $i$ and $m$ is the total number of edges in the network. Intuitively, Q measures the difference between the observed connectivity within modules and its expected value for a random graph with the same degree sequence [21]. Thus, a "good" partition, with Q closer to unity, identifies groups with many more internal connections than expected at random; in contrast, a "bad" partition, with Q closer to zero, identifies groups with no more internal connections than we expect at random. This reasonable formulation recasts the problem of identifying modules as a problem of finding the so-called optimal partition, i.e., the partition that maximizes the modularity function Q.
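The modularity score of Eq. (1) is straightforward to evaluate for a given partition. The Python sketch below assumes an undirected simple graph given as an edge list without self-loops, and a hypothetical dictionary `module_of` mapping each node to its module label.

```python
def modularity(edges, module_of):
    """Q = sum_i [ e_i/m - (d_i/(2m))^2 ] for an undirected edge list."""
    m = len(edges)
    internal = {}  # e_i: number of edges with both endpoints in module i
    degree = {}    # d_i: total degree of the nodes in module i
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy example (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}
q_split = modularity(edges, two_modules)
q_whole = modularity(edges, one_module)
```

Splitting the graph into its two triangles scores higher than placing every node in one module, which scores zero. Evaluating Q for one partition is cheap; the difficulty discussed in the paper lies in searching over partitions.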

Despite the popularity of modularity maximization, much remains unknown about the quality and significance of its output when applied to real-world networks with unknown

PHYSICAL REVIEW E 81, 046106 (2010)

1539-3755/2010/81(4)/046106(19) ©2010 The American Physical Society 046106-1

[1] a C implementation of the metadata SBM is available by request



fin


real-world networks

2. marine food web: predator-prey interactions among 488 species in Weddell Sea in Antarctica

• x = {species body mass, feeding mode, oceanic zone}

• partition recovers known correlation between body mass, trophic level, and ecosystem role

[Figure: three-community division of the food web (communities 1, 2, 3); node legend: Detritivore, Carnivore, Omnivore, Herbivore, Primary producer]

[1] here, we're using a continuous metadata model
[2] Brose et al. (2005)

[Figure panels: (a) Without metadata, (b) With metadata]

FIG. S3: Inferred communities, without metadata and with, for the HVR 5 gene recombination network of the human malaria parasite P. falciparum, where metadata values are the CP labels for the HVR 6 network.

predictive (96% probability) of that gene being in one group, while having four cysteines is modestly predictive (67% probability) of being in the other group. Thus the method has discovered by itself that the motif sequences that define the CP labels, along with their corresponding network communities, correlate with cysteine counts and their associated severe disease phenotypes [8, 11].

The communities in the HVR 6 network represent highly non-random patterns of recombination, which are thought to indicate functional constraints on protein structure. Previous work has conjectured that common constraints on recombination span distinct HVRs [9]. We can test this hypothesis using the methods described in this paper. There is no reason a priori to expect that the community structure of HVR 6 should correlate with that of HVR 5 because the Cys and CP labels are derived from outside the HVR 5 sequences: Cys labels reflect cysteine counts in HVR 6 while CP labels subdivide Cys labels based on sequence motifs adjacent to, but outside of, HVR 5. Applying our methods to HVR 5 without any metadata (Fig. S3a), we find mixing of the HVR 6 Cys labels across the HVR 5 communities. By contrast, using the CP labels as metadata for the HVR 5 network, our method finds a much cleaner partition (Fig. S3b), indicating that indeed the HVR 6 Cys labels correlate with the community structure of HVR 5.

[Figure axes: mean body mass (g), from 10^-12 to 10^9, vs. probability of community membership, from 0 to 1]

FIG. S4: Learned priors, as a function of body mass, for the three-community division of the Weddell Sea network shown in Fig. 4 of the main paper.

C. Weddell Sea food web

As discussed in the main text, the Weddell Sea food web provides an example of the "ordered" metadata type in the body mass of species. A three-way community division of the network with the log of species' average body mass as metadata produces the division shown in Fig. 4 of the paper. The prior probabilities as functions of body mass are of interest in their own right. They are shown in Fig. S4. Although, as described in Section IC of the paper, the log mass is rescaled in our calculations to the range [0, 1], the horizontal axis in the figure is calibrated to read in terms of the original mass in grams, so the prior probabilities of belonging to each of the three communities can be simply read from the figure. The blue, green, and red curves correspond respectively to the communities labeled 1, 2, and 3 in Fig. 4. Thus a species with a low mean mass of 10^-12 g has about an 80% probability of being in community 1, a 20% probability of being in community 2, and virtually no chance of being in community 3. Conversely, a species with mean body mass of 10^8 g (which could only be a whale) has about a 90% chance of being in community 3, 10% of being in community 2, and almost no chance of being in community 1.
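The rescaling of log body mass to [0, 1] mentioned above could look like the following Python sketch. The excerpt does not say how the rescaling is done; a linear min-max rescaling of log10 mass is assumed here purely for illustration, and the function name is hypothetical.

```python
import math

def rescale_log_mass(masses_g):
    """Map body masses (grams) to [0, 1] via min-max rescaling of log10 mass.
    Assumes at least two distinct positive masses."""
    logs = [math.log10(m) for m in masses_g]
    lo, hi = min(logs), max(logs)
    return [(v - lo) / (hi - lo) for v in logs]

# Toy example (hypothetical masses spanning twenty orders of magnitude).
scaled = rescale_log_mass([1e-12, 1e-2, 1e8])
```

Working on the log scale keeps very small and very large species from collapsing onto the ends of the interval.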

D. School friendship network data use

Our use of the school friendship data described in themain text requires that we make the following statement:

This work uses data from Add Health, a program project designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris, and funded by a grant P01-HD31921 from the National Institute of Child Health

real-world networks

• x = {Cys-PoLV labels for HVR6 region}

• with metadata, partition discovers correlation with Cys labels (which are associated with severe disease)

HVR6
without metadata: NMI ∈ [0.077, 0.675]
with metadata: NMI = 0.596

[1] Larremore, Clauset & Buckee (2013)

real-world networks

• x = {Cys-PoLV labels for HVR6 region}

• on adjacent region of gene, we find Cys-PoLV labels correlate with recombinant structure here, too

HVR5
[Figure panels: without metadata, with metadata]

[1] Larremore, Clauset & Buckee (2013)

real-world networks

x =

[1] Traud, Mucha & Porter (2012)

communities is less than the number of metadata values, in some cases by a wide margin. Assuming the values of both to be reasonably broadly distributed, this implies that the entropy H(s) of the communities will be smaller than that of the metadata H(x) and hence, normally, min[H(s), H(x)] = H(s). Thus if we define

$$\mathrm{NMI} = \frac{I(s; x)}{\min[H(s), H(x)]}, \quad (B4)$$

we ensure that the normalized mutual information lies between zero and one, that it has a symmetric definition with respect to s and x, and that it will achieve its maximum value of one when the metadata perfectly predict the community membership. Other definitions, normalized using the mean or maximum of the two entropies, satisfy the first two of these three conditions but not the third, giving values smaller than one by an unpredictable margin even when the metadata perfectly predict the communities.

We use the definition (B4) in all the calculations presented in this paper.
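Definition (B4) can be computed directly from two label vectors. The Python sketch below uses empirical frequencies and natural logarithms; the function names are hypothetical, and returning 0 when either entropy is zero is an assumption, not something the excerpt specifies.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(s, x):
    """Empirical mutual information I(s; x) between two label sequences."""
    n = len(s)
    ps, px, joint = Counter(s), Counter(x), Counter(zip(s, x))
    return sum((c / n) * math.log(c * n / (ps[a] * px[b]))
               for (a, b), c in joint.items())

def nmi_min(s, x):
    """NMI = I(s; x) / min[H(s), H(x)], as in (B4)."""
    denom = min(entropy(s), entropy(x))
    return mutual_information(s, x) / denom if denom > 0 else 0.0

# Toy example (hypothetical labels): four metadata values, two communities,
# and the metadata perfectly predict the community.
s = [0, 0, 1, 1]
x = ["a", "b", "c", "d"]
score = nmi_min(s, x)
```

Here H(x) is twice H(s), so normalizing by the mean of the two entropies would give 2/3 even though the metadata determine the communities exactly, whereas the min normalization gives one.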

Appendix C: Further examples

In this appendix we present a number of additionalapplications of our methods as well as some additionaldetails on examples described in the main text. Summarystatistics on all the networks studied are given in Table I.

1. Facebook friendship network

The FB100 data set of Traud et al. [31] is a set of friendship networks among college students at US universities compiled from friend relations on the social networking website Facebook. The networks date from the early days of Facebook when its services were available only to college students and each university formed a separate and unconnected subgraph in the larger network. The nodes in these networks represent the students, the edges represent friend relations on Facebook, and in addition to the network structure there are metadata of several types, including gender, college year (i.e., year of college graduation), major (i.e., principal subject of study, if known), and a numerical code indicating which dorm they lived in.

The primary divisions in these networks appear to be by age, or more specifically by college year. For instance, we have looked in some detail at the data for Harvard University, which was the birthplace of Facebook and its biggest institutional participant at the time the data were gathered, with 15,126 students in the network, spanning college years 2003 to 2009. There are also a small number of Harvard alumni (i.e., former students) in the data set, primarily those recently graduated (graduation years 2000–2002). The top panel in Fig. 4 shows results from a five-way division of the network using our algorithm with year as metadata. Year, for the purposes of this calculation, was treated as an unordered variable, placing no constraints on the value of the prior probabilities of community membership for adjacent years. One could have treated it as an ordered variable, which would have constrained adjacent years to have similar priors, but we did not do that here. Nonetheless, as we will see, the algorithm finds communities in which adjacent years tend to be grouped together.

[Figure 4, two panels: prior probability of membership (vertical axis, 0 to 1) plotted against year (top panel; None, 2000 to 2009) and dorm (bottom panel).]

FIG. 4: Learned prior probability of community membership for two five-way divisions of the Facebook friendship network of Harvard students described in the text. The horizontal axis is (top) year of graduation and (bottom) dorm, and the colors represent the prior probabilities of membership in each of the communities.

This network provides a good example of the usefulness of the learned priors in shedding light on the structure of the network. The figure shows a visualization of the priors as a function of year, with the colors showing the relative probability of belonging to each of the communities. Each of the bars in the plot has the same height of 1 since the prior probabilities are required to sum to 1, while the balance of colors shows the distribution over communities. Examination of the top panel in the figure shows clearly a division of the network along age lines. Two groups, in orange and yellow at the right of the plot, correspond to the most recent two years of students at the time of the study (graduation years 2008 and 2009) and the next, in red, accounts for the two years before that (2006 and 2007). The purple community corresponds to the next three years, 2003–2005, while the

NMI ∈ [0.573, 0.641]

NMI = 0.668

4. Facebook friendships: online friendships among 15,126 Harvard students and alumni (in Sept. 2005)

• {graduation year, dormitory}

• method finds a good partition between alumni, recent graduates, upperclassmen, sophomores, and freshmen

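The stacked bars of Fig. 4 show, for each metadata value, a distribution over communities that sums to 1. As a rough illustration only, the sketch below builds an empirical analogue from a given partition: the fraction of nodes with each metadata value that fall in each community. This is an assumption-laden stand-in, not the learned priors of the method itself, and the function name and toy data are hypothetical.

```python
import numpy as np

def membership_fractions(communities, metadata):
    """Fraction of nodes with each metadata value assigned to each community.

    Returns (metadata values, community labels, fractions), where
    fractions[i, j] is the share of nodes with metadata value i that
    belong to community j. Each row sums to 1, like the bars in Fig. 4.
    """
    comm_vals, c_idx = np.unique(np.asarray(communities), return_inverse=True)
    meta_vals, m_idx = np.unique(np.asarray(metadata), return_inverse=True)
    counts = np.zeros((len(meta_vals), len(comm_vals)))
    np.add.at(counts, (m_idx, c_idx), 1)                  # contingency table
    fractions = counts / counts.sum(axis=1, keepdims=True)
    return meta_vals, comm_vals, fractions

# Hypothetical toy data: graduation years and a made-up 3-way partition.
years = [2006, 2006, 2007, 2007, 2008, 2008, 2009, 2009]
parts = [0, 0, 0, 1, 1, 1, 2, 2]
meta_vals, comm_vals, frac = membership_fractions(parts, years)
for year, row in zip(meta_vals, frac):
    print(year, row)
```

Plotting each row as a stacked bar would give a picture of the same form as the top panel of Fig. 4, though the values here come from hard assignments rather than from fitted prior probabilities.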


real-world networks

[1] Traud, Mucha & Porter (2012)

4. Facebook friendships: online friendships among 15,126 Harvard students and alumni (in Sept. 2005)

• x = {graduation year, dormitory}

• method finds a good partition among the dorms




NMI ∈ [0.074, 0.224]

NMI = 0.255

real-world networks

5. Internet graph: 262,953 peering relations among 46,676 Autonomous Systems

• x = {country location of AS}

• method finds a good partition along the lines of the 173 countries

[1] here, we’re using a continuous metadata model

NMI = 0.870

NMI ∈ [0.398, 0.626]
