the trouble with community detectiontuvalu.santafe.edu/...2016_metadatasbm_netsci.pdf · lies,...
TRANSCRIPT
HVR 6
100150
200250
300
The Trouble with Community Detection
Aaron Clauset@aaronclausetComputer Science Dept. & BioFrontiers InstituteUniversity of Colorado, BoulderExternal Faculty, Santa Fe Institute
1 June 2016© 2016 Aaron Clauset
community detection00.5
1
a(t) 0
1
0
1
020
040
060
0
0
1
alig
nmen
t pos
ition
t
1
23
4
56
78
9
calcu
late
alig
nmen
t sco
res
conv
ert t
o al
ignm
ent i
ndica
tors
rem
ove
shor
t alig
ned
regi
ons
extra
ct h
ighl
y va
riabl
e re
gion
s
NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY
NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY
NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY
16
6
13
166
13
A
B
C
D
00.5
1
a(t) 0
1
0
1
020
040
060
0
0
1
alig
nmen
t pos
ition
t
1
23
4
56
78
9
calcu
late
alig
nmen
t sco
res
conv
ert t
o al
ignm
ent i
ndica
tors
rem
ove
shor
t alig
ned
regi
ons
extra
ct h
ighl
y va
riabl
e re
gion
s
NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY
NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY
NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY
16
6
13
166
13
A
B
C
D
community interaction matrix
large-scale structure = community structure
vertices with same pattern of inter-community connections
lies, damned lies, and community detection
lies, damned lies, and community detection
many networks include metadata on their nodes:
metadata is often used to evaluate the accuracy of community detection algs.
if community detection method finds a partition that correlates with then we say that is good
social networks age, sex, ethnicity or race, etc.
food webs feeding mode, species body mass, etc.
Internet data capacity, physical location, etc.
protein interactions molecular weight, association with cancer, etc.
x
P xAA
lies, damned lies, and community detection
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25 26
27
28
29
30
31
32
33
34
0
1
2
3
4
5
6
78
9
10
11
1213
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
3940
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95 96
97
98
99
101
102
103
104
105
106
107
108
109 110
111
112
113
114
100
!"#$%$&'()')*+,$-
module
belong. Wefind
that most m
odulescontain
metabolites
mostly
fromone
major
pathway. Forexam
ple, in17
ofthe
19
modules identified
for E. coli, more than
one-thirdof the m
etabo-
litesbelong
toasingle
pathway. Interestingly, someother
mod-
ules—two in
the case of E. coli—cannot be trivially associated with a
single traditional pathway. Thesemodules are typically central in the
metabolism
andcontain, m
ostly, metabolites that are classified
in
KEGGas belonging
tocarbohydrate and
amino-acid
metabolism
.
Next we identifythe role of each
metabolite. In
Fig. 2bwe show
theroles
identifiedinthe
metabolic
networkofE. coli. O
ther
organisms show
a similar distribution
of the nodes inthe different
roles, eventhough
theycorrespond
toorganism
sthat
arevery
distant fromanevolutionary standpoint (see Supplem
entary Infor-
mation). Role R1, which
contains ultra-peripheral metabolites with
small degree and no between-m
odule links, comprises 76–86%
of all
themetabolites in
thenetworks. This considerably
simplifies
the
coarse-grainedrepresentation
of the network as these nodes donot
needtobe
identifiedseparately.
Note
thatthis
findingalone
represents animportant step
towards thegoal of extracting
scale-
specific information
fromcom
plexnetworks.
Theinform
ationabout m
odulesand
roles enablesus to
build
acartographic
representationofthe
metabolic
networkof, for
example, E. coli (Fig. 3). This representation
enables us torecover
relevantbiological inform
ation. Forinstance, we
findthat
the
metabolism
ismostly
organizedaround
themodule
containing
pyruvate, whichinturn
is strongly connectedtothe m
odule whose
hubis acetyl-coenzym
eA(CoA). These
twomolecules are
keyto
connecting themetabolism
of carbohydrates, amino acids and lipids
tothe tricarboxylic acid
(TCA) cycle fromwhich
ATPis obtained.
These twomodules are connected
tomore peripheral ones by
key
metabolites such
asD-glyceraldehyde
3-phosphateand
D-fructose
6-phosphate (whichconnect to
the glucose andgalactose m
etabo-
lisms), D-ribose 5-phosphate (which
connects tothe m
etabolismof
certainnucleotides), and
glycerone phosphate (whichconnects to
the metabolism
of certainlipids).
Our analysis also
uncovers nodes withkey
connector roles that
takepart
inonly
asmall but
fundamental set
of reactions. For
example,N-carbam
oyl-L-aspartate takes part in only three reactions
but is vital because it connects the pyrimidine m
etabolism, whose
hub is uracil, tothe core of the m
etabolismthrough
the alanine and
aspartate metabolism
. The potential importance of such
non-hub
connectorspoints
toanother
consideration.Itisaplausible
hypothesisthat
nodeswith
differentroles
areunder
different
evolutionaryconstraints
andpressures. In
particular, weexpect
that nodes withstructurally
relevant roles are more necessary
and
therefore more conserved
across species.
Toquantify
therelation
betweenroles
andconservation, we
definethe
lossrate
plost (R)
(seeMethods). Structurally
relevant
roles are expectedtohave low
values of plost (R) and
vice versa. We
findthat the
different roleshave
different lossrates
(Fig. 4). As
expected, ultra-peripheral nodes (role R1) have the highest loss rate
whereas connector hubs (role R6) are the most conserved
across all
species considered.
Theresults
forthe
comparison
of plost (R)
forultra-peripheral
nodes andconnector hubs is illustrative, but hardly
surprising. The
comparison
ofplost (R)
fornon-hub
connectors(role
R3)and
provincial hubs (role R5), however, yields a surprisingfinding. The
metabolites in
theprovincial hubs class have
many
within-module
Figure 3Cartographic representation
of the metabolic network of E. coli. Eachcircle
represents a module and is coloured according to the KEGGpathway classification of the
metabolites it contains. Certain important nodes are depicted as triangles (non-hub
connectors), hexagons (connector hubs) and squares (provincial hubs). Interactions
between modules and nodes are depicted using lines, with thickness proportional to the
number of actual links. Inset: metabolic network of E. coli, which contains 473metabolites
and 574links. This representation was obtained using the program
Pajek. Each node is
coloured according to the ‘main’ colour of its module, as obtained fromthe cartographic
representation.
letterstonature
NATURE |VOL433 | 24
FEBRUARY2005 | www.nature.com
/nature
898
© 2005 N
ature Publishing G
roup
lies, damned lies, and community detection
several recent studies claim these are the exception
• real networks either do not contain structural communities or communities exist but they do not correlate with metadata groups
[1] see Leskovec et al. (2009), and Yang & Leskovec (2012), and Hric, Darst & Fortunato (2014)
6
Figure 5: (Color online) NMI scores between structural communities and metadata groups for di↵erent networks. Scores aregrouped by datasets on the x -axis. The height of each column is the maximal NMI score between any partition layer of themetadata partitions and any layer returned by the community detection method, considering only those comparisons wherethe overlap of the partitions is larger than 10% of total number of nodes.
methods do not align with partitions built from meta-data, but what about specific groups? Can we detectany of the groups well? Are some groups reflected in thegraph structure and detectable, but lost in the bulk noiseof the graph? This is what we wish to investigate here.
The basis of our analysis is the Jaccard score betweentwo groups. Let Ci represent (the set of nodes of) theknown group i, and Dj represent (the set of nodes of)the detected community j. The Jaccard score betweenthese two sets is defined as
J(Ci, Dj) =|Ci \Dj ||Ci [Dj |
, (1)
with |· · · | set cardinality, \ set intersection, and [ setunion. The Jaccard score ranges from one (perfectmatch) to zero and roughly indicates the fraction of nodesshared between the two sets: the match quality.
The recall score measures how well one known group isdetected. The recall score of one known group Ci is de-fined as the maximal Jaccard score between it and everydetected community Dj ,
R(Ci) = maxDj2{D}
J(Ci, Dj). (2)
It is near one if the group is well detected and low oth-erwise. We can study the distribution of these scores tosee how many groups can be detected at any given qual-ity level. Recall measures the detection of known groups,and to measure the significance of detected communities,we can reverse the measure to calculate a precision score
P (Dj) = maxCi2{C}
J(Dj , Ci). (3)
The precision score tells us how well one detected com-munity corresponds to any known group.
We can now directly quantify the two conditions forgood community detection: every known group must cor-respond to some detected community, and every detectedcommunity must represent some known group. Both ofthese measures are still interesting independently: a highrecall but low precision indicates that the known groupsare reflected in the network structurally, but there aremany structural communities that are not known. Wevisualize the scores by means of rank-Jaccard plots whichgive an overview of the network’s detection quality. Wecompute the recall (precision) for every known (detected)group and sort the groups in order of ascending Jaccardscore. We plot recall (precision) vs the group rank, sortedby recall (precision) score so that the horizontal scale isthe relative group rank, i.e. the ratio between the rank ofthe group and the number of groups (yielding a value be-tween 0 and 1). Similar to our treatment of the partition-level analysis, we only plot matchings whose intersectioncovers more than 10% of total nodes in the graph. In ourfinal plots, the average value of the curve (proportional tothe area under it) is the average recall or precision scoreover all groups. The shape of the curve can tell us if allgroups are detected equally well (yielding a high plateau)or if there is a large inequality in detection (a high slope).Furthermore, this allows us to compactly represent mul-tiple layers. Each independent layer of known (detected)groups can be plotted in the same figure. We would gen-erally look for the highest curve to know if any layer hasa high recall (precision). When computing recall (preci-sion), unless otherwise specified, as detected communitieswe consider the communities of all partitions delivered bya method, whereas the metadata groups are those presentin all metadata partitions (if more than one partition isavailable in either case). This will give us the maximumpossible recall (precision), which might be far higher than
maximum NMI between any partition layer of the metadata partitions and any layer returned by the community detection method
Hric et al. (2014)
"classic" data sets online data sets
the idea
use metadata to help select a partition that correlates with , from among the exponential number of plausible partitions
a metadata-aware degree-corrected stochastic block model
P⇤ 2 {P}
but wait!
x x
metadatadegree
prior on community assignments
edge probability
block parameters
x = {xu}d = {du}P (s |�,x) =
Y
i
�si,xi
puv = dudv✓su,sv
✓st
[1] technical details in Newman & Clauset (2015) arxiv:1507.04001
the idea
use metadata to help select a partition that correlates with , from among the exponential number of plausible partitions
a metadata-aware degree-corrected stochastic block model• model likelihood (with adjacency matrix )
• inference performed using EM algorithm to choose and
P⇤ 2 {P}
but wait!
x x
[1] technical details in Newman & Clauset (2015) arxiv:1507.04001
P (A |⇥,�,x) =X
s
P (A |⇥, s)P (s |�,x)
=X
s
Y
u<v
pAuvuv
(1� puv
)1�AuvY
u
�su,xu
network metadata
A
⇥ �
networks with planted structure
does this method recover known structure in synthetic data?
• we use the planted partition test: 2 groups, equal sizes, constant mean degree
• assign metadata ground truth with prob.
cout
n
cinn
cinn
cout
n
xu = ⇢ = {0.5, ..., 0.9}
cin-cout
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Frac
tion
of c
orre
ctly
ass
igne
d no
des
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1 0.50.60.70.80.9
networks with planted structure
undetectable
let mean degree • when metadata is irrelevant and we recover regular SBM behavior
c = 8
⇢ = 0.5
[1] n = 10 000
weaker stronger
easy to detect
networks with planted structure
cin-cout
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Frac
tion
of c
orre
ctly
ass
igne
d no
des
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1 0.50.60.70.80.9
• for
metadata + SBM
beats any algorithm without metadata, and
beats metadata alone.
let mean degree • when metadata is irrelevant and we recover regular SBM behavior⇢ = 0.5
⇢ > 0.5
c = 8
undetectable
weaker stronger
[1] n = 10 000
easy to detect
real-world networks
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
2. marine food web: predator-prey interactions among 488 species in Weddell Sea in Antarctica
3. Malaria gene recombinations: recombination events among 297 var genes
4. Facebook friendships: online friendships among 15,126 Harvard students and alumni
5. Internet graph: peering relations among 46,676 Autonomous Systems
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• {grade 7-12, ethnicity, gender}x =
• method finds a good partition between high-school and middle-school
• .
• without metadata:
NMI = 0.881
[1] Add Health network data, designed by Udry, Bearman & Harris
NMI 2 [0.105, 0.384]
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• {grade 7-12, ethnicity, gender}x =
• method finds a good partition between blacks and whites (with others scattered among)
• without metadata:
NMI = 0.820
NMI 2 [0.120, 0.239]
[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• {grade 7-12, ethnicity, gender}x =
• method finds no good partition between males/females. instead, chooses a mixture of grade/ethnicity partitions
• without metadata:
NMI = 0.003
NMI 2 [0.000, 0.010]
[1] Add Health network data, designed by Udry, Bearman & Harris
conclusions
0
0.51
a(t)
0
1
0
1
0200
400600
0
1
alignment position t
1
23 4
56
78
9
calculate alignment scoresconvert to alignment indicatorsremove short aligned regionsextract highly variable regions
NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY
NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY
NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY
16
6
13
16 6
13
A
B
C
D
the trouble with community detection1. many good local optima2. naive optimization inconsistent against node metadata
conclusions
0
0.51
a(t)
0
1
0
1
0200
400600
0
1
alignment position t
1
23 4
56
78
9
calculate alignment scoresconvert to alignment indicatorsremove short aligned regionsextract highly variable regions
NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY
NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY
NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY
16
6
13
16 6
13
A
B
C
D
the trouble with community detection1. many good local optima2. naive optimization inconsistent against node metadata
two birds with one stonea metadata-aware stochastic block model
choose a partition that correlates with edges, and metadata (if possible)learns metadata associations with communitiesmore accurate than any algorithm that uses only edges or metadatahighly scalable EM + Belief Propagation algorithm
conclusions
0
0.51
a(t)
0
1
0
1
0200
400600
0
1
alignment position t
1
23 4
56
78
9
calculate alignment scoresconvert to alignment indicatorsremove short aligned regionsextract highly variable regions
NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY
NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY
NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY
16
6
13
16 6
13
A
B
C
D
P (A |⇥,�,x)
�
Mark Newman(Michigan)
acknowledgements Funding support:
Benjamin H. Good(Harvard)
Yva de Montjoye(MIT)
arX
iv:1
507.
0400
1v1
[cs.S
I] 1
4 Ju
l 201
5Structure and inference in annotated networks
M. E. J. Newman1, 2 and Aaron Clauset2, 3
1Department of Physics and Center for the Study of Complex Systems,University of Michigan, Ann Arbor, MI 48109, USA
2Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 875013Department of Computer Science and BioFrontiers Institute,
University of Colorado, Boulder, CO 80309, USA
For many networks of scientific interest we know both the connections of the network and informa-tion about the network nodes, such as the age or gender of individuals in a social network, geographiclocation of nodes in the Internet, or cellular function of nodes in a gene regulatory network. Herewe demonstrate how this “metadata” can be used to improve our analysis and understanding ofnetwork structure. We focus in particular on the problem of community detection in networks anddevelop a mathematically principled approach that combines a network and its metadata to detectcommunities more accurately than can be done with either alone. Crucially, the method does notassume that the metadata are correlated with the communities we are trying to find. Instead themethod learns whether a correlation exists and correctly uses or ignores the metadata depending onwhether they contain useful information. The learned correlations are also of interest in their ownright, allowing us to make predictions about the community membership of nodes whose networkconnections are unknown. We demonstrate our method on synthetic networks with known struc-ture and on real-world networks, large and small, drawn from social, biological, and technologicaldomains.
I. INTRODUCTION
Networks arise in many fields and provide a powerfuland compact representation of the internal structure ofa wide range of complex systems [1]. Examples includesocial networks of interactions among people, technolog-ical and information networks such as the Internet or theWorld Wide Web, and biological networks of molecules,cells, or entire species. The last two decades have wit-nessed rapid growth both in the availability of networkdata and in the number and sophistication of networkanalysis techniques. Borrowing ideas from graph theory,statistical physics, computer science, statistics, and otherareas, network analysis typically aims to characterize anetwork’s structural features in a way that sheds light onthe behavior of the system the network describes. Stud-ies of social networks, for instance, might identify themost influential or central individuals in a population.Studies of road networks can shed light on traffic flowsor bottlenecks within a city or country. Studies of path-ways in metabolic networks can lead to a more completeunderstanding of the molecular machinery of the cell.Most research in this area treats networks as objects of
pure topology, unadorned sets of nodes and their inter-actions. Most network data, however, are accompaniedby annotations or metadata that describe properties ofnodes such as a person’s age, gender, or ethnicity in asocial network, feeding mode or body mass of species ina food web, data capacity or physical location of nodeson the Internet, and so forth. (There can be metadatadata on the edges of a network as well as on the nodes [2],but our focus here is on the node case.) In this paper,we consider how to extend the analysis of networks to di-rectly incorporate such metadata. Our approach is basedon methods of statistical inference and can in principle
be applied to a range of different network analysis tasks.Here, we focus specifically on one of the most widelystudied tasks, the community detection problem. Com-munity detection, also called node clustering or classifi-cation, searches for a good division of a network’s nodesinto groups or classes [3]. Typically, one searches forassortative structure, groupings of nodes such that con-nections are denser within groups than between them.This structure is common in social networks, for exam-ple, where groups may correspond to sets of friends orcoworkers, but it also occurs in other cases, including bi-ological and ecological networks, the Web, transportationand distribution networks, and others. Less common, butno less important, is disassortative structure, in whichnetwork connections are sparser within groups than be-tween them, and mixtures of assortative and disassorta-tive structure can also occur, where different groups mayhave varying propensities for within- or between-groupconnections.In many cases, the groups identified by community de-
tection correlate meaningfully with other network prop-erties or functions, such as allegiances or personal in-terests in social networks [3, 4] or biological function inmetabolic networks [5, 6]. Some recent research, how-ever, has suggested that these cases may be the exceptionrather than the rule [7, 8], an important point that weaddress later in this paper.A large number of methods have been proposed
for detecting communities in unannotated networks [3].Among these, some of the most powerful, both in termsof rigorously provable performance and of raw speed, arethose based on statistical inference. Here we build onthese methods to incorporate node metadata into thecommunity detection problem in a principled and flex-ible manner. The resulting methods have several attrac-
to appear, Nature Communications (arxiv:1507.0400)
Performance of modularity maximization in practical contexts
Benjamin H. Good,1,2,* Yves-Alexandre de Montjoye,3,2,† and Aaron Clauset2,‡
1Department of Physics, Swarthmore College, Swarthmore, Pennsylvania 19081, USA2Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA
3Department of Applied Mathematics, Université Catholique de Louvain, 4 Avenue Georges Lemaitre, B-1348 Louvain-la-Neuve, Belgium!Received 1 October 2009; published 15 April 2010"
Although widely used in practice, the behavior and accuracy of the popular module identification techniquecalled modularity maximization is not well understood in practical contexts. Here, we present a broad charac-terization of its performance in such situations. First, we revisit and clarify the resolution limit phenomenon formodularity maximization. Second, we show that the modularity function Q exhibits extreme degeneracies: ittypically admits an exponential number of distinct high-scoring solutions and typically lacks a clear globalmaximum. Third, we derive the limiting behavior of the maximum modularity Qmax for one model of infinitelymodular networks, showing that it depends strongly both on the size of the network and on the number ofmodules it contains. Finally, using three real-world metabolic networks as examples, we show that the degen-erate solutions can fundamentally disagree on many, but not all, partition properties such as the composition ofthe largest modules and the distribution of module sizes. These results imply that the output of any modularitymaximization procedure should be interpreted cautiously in scientific contexts. They also explain why manyheuristics are often successful at finding high-scoring partitions in practice and why different heuristics candisagree on the modular structure of the same network. We conclude by discussing avenues for mitigatingsome of these behaviors, such as combining information from many degenerate solutions or using generativemodels.
DOI: 10.1103/PhysRevE.81.046106 PACS number!s": 89.75.Hc, 89.20.!a, 87.16.!b, 02.70.!c
I. INTRODUCTION
Networks are a powerful tool for understanding the struc-ture, dynamics, robustness and evolution of complex biologi-cal, technological and social systems #1,2$. The automaticdetection of modular structures in networks—also calledcommunities #3$ or compartments #4$, and conventionallyunderstood to be large subgraphs with high internaldensities—can provide a scalable way to identify function-ally important or closely related classes of nodes from inter-action data alone #5,6$.
The identification of modular structures has broad impli-cations for many systems-level questions. For instance, it hasstrong consequences for the behavior of dynamical processeson networks #7,8$, and can provide a principled way to re-duce or coarse-grain a system by dividing global heteroge-neity into relatively homogeneous substructures. The searchfor such modular substructures has been particularly intensein molecular networks. This is, in part, because moduleshave theoretical significance for molecular networks #9–11$:they can correspond to functional clusters of genes or pro-teins #12,13$, they may represent targets of natural selectionabove the level of individual genes or proteins but below thelevel of the whole organism, and they may provide evidenceof past evolutionary constraints or pressures #13,14$. Pastwork along these lines has identified modular structures insignaling, metabolic and protein-interaction systems#14–17$, although some questions remain about their statis-
tical significance #18$ and functional relevance #19$. Natu-rally, many of these questions apply equally well to modulesin social and technological networks.
Empirical evidence for a modular organization is typicallyderived using computer algorithms that automatically iden-tify modules using network connectivity data, and among themany techniques now available !see #5,6,20$ for reviews",the method of modularity maximization #3$ is by far the mostpopular. Under this method, each decomposition or partitionof a network into k disjoint modules is given a score Q,called the modularity or sometimes “Newman-Girvan modu-larity”
Q = %i=1
k & ei
m− ' di
2m(2) , !1"
where ei is the number of edges in module i, di is the totaldegree of nodes in module i and m is the total number ofedges in the network. Intuitively, Q measures the differencebetween the observed connectivity within modules and itsexpected value for a random graph with the same degreesequence #21$. Thus, a “good” partition—with Q closer tounity—identifies groups with many more internal connec-tions than expected at random; in contrast, a “bad”partition—with Q closer to zero—identifies groups with nomore internal connections than we expect at random. Thisreasonable formulation recasts the problem of identifyingmodules as a problem of finding the so-called optimal parti-tion, i.e., the partition that maximizes the modularity func-tion Q.
Despite the popularity of modularity maximization, muchremains unknown about the quality and significance of itsoutput when applied to real-world networks with unknown
*[email protected]†[email protected]‡[email protected]
PHYSICAL REVIEW E 81, 046106 !2010"
1539-3755/2010/81!4"/046106!19" ©2010 The American Physical Society046106-1
[1] a C implementation of the metadata SBM is available by request
fin
0
0.51
a(t)
0
1
0
1
0200
400600
0
1
alignment position t
1
23 4
56
78
9
calculate alignment scoresconvert to alignment indicatorsremove short aligned regionsextract highly variable regions
NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY
NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY
NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY
16
6
13
16 6
13
A
B
C
D
community detection
real-world networks
2. marine food web: predator-prey interactions among 488 species in Weddell Sea in Antarctica
• {species body mass, feeding mode, oceanic zone}
• partition recovers known correlation between body mass, trophic level, and ecosystem role:
x =
1
2
3
Detritivore
Carnivore
Omnivore
Herbivore
Primary producer[1] here, we’re using a continuous metadata model[2] Brose et al. (2005)
8
(a) Without metadata
(b) With metadata
FIG. S3: Inferred communities, without metadata and with,for the HVR 5 gene recombination network of the humanmalaria parasite P. falciparum, where metadata values arethe CP labels for the HVR 6 network.
predictive (96% probability) of that gene being in onegroup, while having four cysteines is modestly predictive(67% probability) of being in the other group. Thus themethod has discovered by itself that the motif sequencesthat define the CP labels, along with their correspondingnetwork communities, correlate with cysteine counts andtheir associated severe disease phenotypes [8, 11].
The communities in the HVR 6 network representhighly non-random patterns of recombination, which arethought to indicate functional constraints on proteinstructure. Previous work has conjectured that commonconstraints on recombination span distinct HVRs [9]. Wecan test this hypothesis using the methods described inthis paper. There is no reason a priori to expect thatthe community structure of HVR 6 should correlate withthat of HVR 5 because the Cys and CP labels are derivedfrom outside the HVR 5 sequences—Cys labels reflectcysteine counts in HVR 6 while CP labels subdivide Cyslabels based on sequence motifs adjacent to, but outsideof, HVR 5. Applying our methods to HVR 5 withoutany metadata (Fig. S3a), we find mixing of the HVR 6Cys labels across the HVR 5 communities. By contrast,using the CP labels as metadata for the HVR 5 network,our method finds a much cleaner partition (Fig. S3b), in-dicating that indeed the HVR 6 Cys labels correlate withthe community structure of HVR 5.
10-12
10-9
10-6
10-3
100
103
106
109
Mean body mass (g)
0
0.5
1
Pro
bab
ilit
y o
f co
mm
un
ity
mem
ber
ship
FIG. S4: Learned priors, as a function of body mass, for thethree-community division of the Weddell Sea network shownin Fig. 4 of the main paper.
C. Weddell Sea food web
As discussed in the main text, the Weddell Sea foodweb provides an example of the “ordered” metadata typein the body mass of species. A three-way community di-vision of the network with the log of species’ averagebody mass as metadata produces the division shown inFig. 4 of the paper. The prior probabilities as functionsof body mass are of interest in their own right. They areshown in Fig. S4. Although, as described in Section ICof the paper, the log mass is rescaled in our calculationsto the range [0, 1], the horizontal axis in the figure is cal-ibrated to read in terms of the original mass in grams, sothe prior probabilities of belonging to each of the threecommunities can be simply read from the figure. Theblue, green, and red curves correspond respectively tothe communities labeled 1, 2, and 3 in Fig. 4. Thus aspecies with a low mean mass of 10�12 g has about an80% probability of being in community 1, a 20% proba-bility of being in community 2, and virtually no chance ofbeing in community 3. Conversely, a species with meanbody mass of 108 g (which could only be a whale) hasabout a 90% chance of being in community 3, 10% ofbeing in community 2, and almost no chance of being incommunity 1.
D. School friendship network data use
Our use of the school friendship data described in themain text requires that we make the following statement:
This work uses data from Add Health, a pro-gram project designed by J. Richard Udry,Peter S. Bearman, and Kathleen Mullan Har-ris, and funded by a grant P01–HD31921from the National Institute of Child Health
3
2
1
real-world networks
x =
[1] Larremore, Clauset & Buckee (2013)
3. Malaria gene recombinations: recombination events among 297 var genes
• {Cys-PoLV labels for HVR6 region}
• with metadata, partition discovers correlation with Cys labels (which are associated with severe disease)
without metadata with metadata
HVR6
NMI 2 [0.077, 0.675] NMI = 0.596
real-world networks
x =
[1] Larremore, Clauset & Buckee (2013)
3. Malaria gene recombinations: recombination events among 297 var genes
• {Cys-PoLV labels for HVR6 region}
• on adjacent region of gene, we find Cys-PoLV labels correlate with recombinant structure here, too
HVR5
without metadata with metadata
real-world networks
x =
[1] Traud, Mucha & Porter (20012)
13
communities is less than the number of metadata values,in some cases by a wide margin. Assuming the values ofboth to be reasonably broadly distributed, this impliesthat the entropy H(s) of the communities will be smallerthan that of the metadata H(x) and hence, normally,min[H(s), H(x)] = H(s). Thus if we define
NMI =I(s ;x)
min[H(s), H(x)], (B4)
we ensure that the normalized mutual information liesbetween zero and one, that it has a symmetric defini-tion with respect to s and x, and that it will achieveits maximum value of one when the metadata perfectlypredict the community membership. Other definitions,normalized using the mean or maximum of the two en-tropies, satisfy the first two of these three conditions butnot the third, giving values smaller than one by an unpre-dictable margin even when the metadata perfectly pre-dict the communities.We use the definition (B4) in all the calculations pre-
sented in this paper.
Appendix C: Further examples
In this appendix we present a number of additionalapplications of our methods as well as some additionaldetails on examples described in the main text. Summarystatistics on all the networks studied are given in Table I.
1. Facebook friendship network
The FB100 data set of Traud et al. [31] is a set offriendship networks among college students at US uni-versities compiled from friend relations on the social net-working website Facebook. The networks date from theearly days of Facebook when its services were availableonly to college students and each university formed a sep-arate and unconnected subgraph in the larger network.The nodes in these networks represent the students, theedges represent friend relations on Facebook, and in ad-dition to the network structure there are metadata ofseveral types, including gender, college year (i.e., yearof college graduation), major (i.e., principal subject ofstudy, if known), and a numerical code indicating whichdorm they lived in.The primary divisions in these networks appear to be
by age, or more specifically by college year. For instance,we have looked in some detail at the data for HarvardUniversity, which was the birthplace of Facebook andits biggest institutional participant at the time the datawere gathered, with 15 126 students in the network, span-ning college years 2003 to 2009. There are also a smallnumber of Harvard alumni (i.e., former students) in thedata set, primarily those recently graduated—graduationyears 2000–2002. The top panel in Fig. 4 shows results
None 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Year
0
0.5
1Pr
ior p
roba
bilit
y of
mem
bers
hip
Dorm0
0.5
1
Prio
r pro
babi
lity
of m
embe
rshi
p
FIG. 4: Learned prior probability of community membershipfor two five-way divisions of the Facebook friendship networkof Harvard students described in the text. The horizontal axisis (top) year of graduation and (bottom) dorm, and the colorsrepresent the prior probabilities of membership in each of thecommunities.
from a five-way division of the network using our algo-rithm with year as metadata. Year, for the purposes ofthis calculation, was treated as an unordered variable,placing no constraints on the value of the prior probabil-ities of community membership for adjacent years. Onecould have treated it as an ordered variable, which wouldhave constrained adjacent years to have similar priors,but we did not do that here. Nonetheless, as we will see,the algorithm finds communities in which adjacent yearstend to be grouped together.This network provides a good example of the useful-
ness of the learned priors in shedding light on the struc-ture of the network. The figure shows a visualization ofthe priors as a function of year, with the colors show-ing the relative probability of belonging to each of thecommunities. Each of the bars in the plot has the sameheight of 1 since the prior probabilities are required tosum to 1, while the balance of colors shows the distribu-tion over communities. Examination of the top panel inthe figure shows clearly a division of the network alongage lines. Two groups, in orange and yellow at the rightof the plot, correspond to the most recent two years ofstudents at the time of the study (graduation years 2008and 2009) and the next, in red, accounts for the two yearsbefore that (2006 and 2007). The purple community cor-responds to the next three years, 2003–2005, while the
NMI 2 [0.573, 0.641]
NMI = 0.668
4. Facebook friendships: online friendships among 15,126 Harvard students and alumni (in Sept. 2005)
• {graduation year, dormitory}
• method finds a good partition between alumni, recent graduates, upperclassmen, sophomores, and freshmen
• .
• without metadata:
real-world networks
x =
[1] Traud, Mucha & Porter (20012)
4. Facebook friendships: online friendships among 15,126 Harvard students and alumni (in Sept. 2005)
• {graduation year, dormitory}
• method finds a good partition among the dorms
• .
• without metadata:
13
communities is less than the number of metadata values,in some cases by a wide margin. Assuming the values ofboth to be reasonably broadly distributed, this impliesthat the entropy H(s) of the communities will be smallerthan that of the metadata H(x) and hence, normally,min[H(s), H(x)] = H(s). Thus if we define
NMI =I(s ;x)
min[H(s), H(x)], (B4)
we ensure that the normalized mutual information liesbetween zero and one, that it has a symmetric defini-tion with respect to s and x, and that it will achieveits maximum value of one when the metadata perfectlypredict the community membership. Other definitions,normalized using the mean or maximum of the two en-tropies, satisfy the first two of these three conditions butnot the third, giving values smaller than one by an unpre-dictable margin even when the metadata perfectly pre-dict the communities.We use the definition (B4) in all the calculations pre-
sented in this paper.
Appendix C: Further examples
In this appendix we present a number of additionalapplications of our methods as well as some additionaldetails on examples described in the main text. Summarystatistics on all the networks studied are given in Table I.
1. Facebook friendship network
The FB100 data set of Traud et al. [31] is a set offriendship networks among college students at US uni-versities compiled from friend relations on the social net-working website Facebook. The networks date from theearly days of Facebook when its services were availableonly to college students and each university formed a sep-arate and unconnected subgraph in the larger network.The nodes in these networks represent the students, theedges represent friend relations on Facebook, and in ad-dition to the network structure there are metadata ofseveral types, including gender, college year (i.e., yearof college graduation), major (i.e., principal subject ofstudy, if known), and a numerical code indicating whichdorm they lived in.The primary divisions in these networks appear to be
by age, or more specifically by college year. For instance,we have looked in some detail at the data for HarvardUniversity, which was the birthplace of Facebook andits biggest institutional participant at the time the datawere gathered, with 15 126 students in the network, span-ning college years 2003 to 2009. There are also a smallnumber of Harvard alumni (i.e., former students) in thedata set, primarily those recently graduated—graduationyears 2000–2002. The top panel in Fig. 4 shows results
None 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Year
0
0.5
1
Prio
r pro
babi
lity
of m
embe
rshi
p
Dorm0
0.5
1Pr
ior p
roba
bilit
y of
mem
bers
hip
FIG. 4: Learned prior probability of community membershipfor two five-way divisions of the Facebook friendship networkof Harvard students described in the text. The horizontal axisis (top) year of graduation and (bottom) dorm, and the colorsrepresent the prior probabilities of membership in each of thecommunities.
from a five-way division of the network using our algo-rithm with year as metadata. Year, for the purposes ofthis calculation, was treated as an unordered variable,placing no constraints on the value of the prior probabil-ities of community membership for adjacent years. Onecould have treated it as an ordered variable, which wouldhave constrained adjacent years to have similar priors,but we did not do that here. Nonetheless, as we will see,the algorithm finds communities in which adjacent yearstend to be grouped together.This network provides a good example of the useful-
ness of the learned priors in shedding light on the struc-ture of the network. The figure shows a visualization ofthe priors as a function of year, with the colors show-ing the relative probability of belonging to each of thecommunities. Each of the bars in the plot has the sameheight of 1 since the prior probabilities are required tosum to 1, while the balance of colors shows the distribu-tion over communities. Examination of the top panel inthe figure shows clearly a division of the network alongage lines. Two groups, in orange and yellow at the rightof the plot, correspond to the most recent two years ofstudents at the time of the study (graduation years 2008and 2009) and the next, in red, accounts for the two yearsbefore that (2006 and 2007). The purple community cor-responds to the next three years, 2003–2005, while the
NMI 2 [0.074, 0.224]
NMI = 0.255
real-world networks
5. Internet graph: 262,953 peering relations among 46,676 Autonomous Systems
• {country location of AS}
• method finds a good partition along the lines of the 173 countries
• .
• without metadata:
x =
[1] here, we’re using a continuous metadata model
NMI = 0.870
NMI 2 [0.398, 0.626]