Supplementary Online Materials
"Super-families of evolved and designed networks"
Network data:
Networks and their sources are listed in Table S1. When mutual edges constituted less
than 0.1% of the total number of edges, they were removed from the network (this
applies to the two yeast networks, which have one mutual edge in YEAST-1 and two
in YEAST-2).
Method for counting subgraphs:
All results were obtained with mfinder1.1, a network motif detection tool (software
and user's guide available at www.weizmann.ac.il/mcb/UriAlon). For large networks,
we used a subgraph-sampling algorithm (Kashtan et al., Bioinformatics 2004, in
press). In cases where a subgraph appeared a very small number of times (here <2) in
both the real and random networks, it was counted as appearing zero times.
Methods for uniformly generating random graphs with a given degree sequence:
All results in the present paper were obtained with the 'switching' algorithm
implemented in mfinder1.1 (www.weizmann.ac.il/mcb/UriAlon, see also
http://arxiv.org/abs/cond-mat/0312028). In the 'switching' algorithm, a pair of edges is
randomly chosen and switched (A→B, C→D becomes A→D, C→B) (1). This is
repeated for a predetermined number of trials T, which is uniformly chosen at random
from the range Q*E<T<2Q*E, where E is the number of edges in the network and Q
is a positive number (we typically use Q=100). If the switch generates multiple edges
or self-edges, or makes no change in the network, it is not performed (but is counted
as a trial in order to ensure detailed balance). See
http://arxiv.org/abs/cond-mat/0312028 for a discussion of different methods for generating random graphs with
a given degree distribution and empirical results that demonstrate that the present
method samples the graphs sufficiently uniformly for the purposes of the present
analysis.
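The switching procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the mfinder1.1 implementation: the function name `switch_randomize` and its parameters are hypothetical, Q is assumed to be an integer, and the sketch conserves only the in- and out-degree of each node (it does not treat mutual edges as a separate class, which the ensemble used in the paper does).

```python
import random

def switch_randomize(edges, Q=100, rng=None):
    """Randomize a directed edge list by repeated edge switching.

    edges: iterable of (source, target) pairs, assumed free of
    self-edges and duplicates. Returns a new list of edges in which
    every node keeps its in-degree and out-degree.
    """
    rng = rng or random.Random()
    edge_list = list(edges)
    present = set(edge_list)
    E = len(edge_list)
    if E < 2:
        return edge_list
    # Number of trials T, drawn uniformly from Q*E < T < 2*Q*E.
    T = rng.randint(Q * E + 1, 2 * Q * E - 1)
    for _ in range(T):
        i, j = rng.sample(range(E), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Proposed switch: A->B, C->D becomes A->D, C->B.
        if a == d or c == b:
            continue  # would create a self-edge; trial is still counted
        if (a, d) in present or (c, b) in present:
            continue  # multiple edge or no change; trial is still counted
        present.discard((a, b))
        present.discard((c, d))
        present.add((a, d))
        present.add((c, b))
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
    return edge_list
```

Rejected switches fall through to the next loop iteration, so they consume a trial without changing the network, as the text requires.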
In contrast to our previous study (2), when analyzing tetrads in the present
study we compared the networks to randomized networks that are not constrained to
preserve the triad counts.
Transitivity and intransitivity in subgraphs:
An ordered triplet of nodes (x,y,z) is transitive if x→y, y→z and x→z (3-5). For
example, triad 13, termed ‘clique’, has six transitive triplets, which is the highest
transitivity possible in a triad (Table S2). An ordered triplet of nodes is intransitive, as
defined for example by Harary and Kommel (5), if x→y, y→z but no edge is directed
from x to z. Table S2 summarizes the number of transitive and intransitive triplets for
all triad subgraphs.
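The two definitions can be checked by enumerating all ordered triplets of a triad. The following Python sketch is illustrative only; the function name `count_triplets` and the node labels are hypothetical.

```python
from itertools import permutations

def count_triplets(nodes, edges):
    """Return (transitive, intransitive) ordered-triplet counts.

    A triplet (x, y, z) with x->y and y->z is transitive if x->z is
    also present, and intransitive if it is not.
    """
    present = set(edges)
    transitive = intransitive = 0
    for x, y, z in permutations(nodes, 3):
        if (x, y) in present and (y, z) in present:
            if (x, z) in present:
                transitive += 1
            else:
                intransitive += 1
    return transitive, intransitive

# Feedforward loop (triad 7): one transitive triplet.
ffl = count_triplets("xyz", [("x", "y"), ("y", "z"), ("x", "z")])
# 3-loop (triad 8): three intransitive triplets.
loop = count_triplets("xyz", [("x", "y"), ("y", "z"), ("z", "x")])
# Clique (triad 13): all six ordered triplets are transitive.
clique = count_triplets("xyz", list(permutations("xyz", 2)))
```

These three cases agree with the corresponding rows of Table S2.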
SP can highlight similarities not evident merely from the subgraph content: The
similarities seen in the SP of different networks can be further appreciated by a
contrast with the distribution of the concentrations of the subgraphs. The
concentration of triad i, Ci, is the number of times it appears in the network divided by
the total number of appearances of all thirteen connected triads:
$C_i = \mathrm{Nreal}_i / \sum_{j=1}^{13} \mathrm{Nreal}_j$
Unlike the SP, the concentration profile is based only on the real network and not on
the randomized networks. Networks of a given type are found to have a similar
distribution (Fig S1). On the other hand, even though they showed similar TSP
profiles, networks of different types in the same super-family can show quite different
concentration distributions, as seen in the case of WWW and social networks. The SP
can therefore highlight similarities that are not evident from subgraph counts.
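For concreteness, the concentration profile can be computed from the triad counts of the real network as in this short Python sketch. The function name `concentration_profile` is hypothetical, and the counts in the example are made-up numbers rather than data from any network in this study.

```python
def concentration_profile(counts):
    """Divide each subgraph count by the total count of all subgraphs."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no subgraphs were counted")
    return [n / total for n in counts]

# Made-up counts for three subgraph types, for illustration only.
example = concentration_profile([2, 3, 5])
```

Because only the real counts enter, two networks with similar significance profiles can still give very different concentration vectors.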
Dimensionality reduction using conservation rules for subgraphs:
The triad significance profiles (TSPs) display certain relations between subgraph
types. For example, networks with excess triangle-shaped subgraphs tend to have a
deficit of V-shaped subgraphs. Are there conservation rules for subgraphs? Here we
show that there are nine triad conservation rules in networks that conserve the degree
sequences of single and mutual edges. Thus the values of the 16 possible 3-node
subgraphs (13 connected triads and 3 non-connected 3-node patterns, see Table S2)
are determined by only seven degrees of freedom. We present an intuitive way to
interpret these conservation laws in terms of reactions that convert V-shaped
subgraphs into triangles, preserving the degrees of all participating nodes (Fig. S2a).
In each reaction a triangle-shaped subgraph is either created from V-shaped subgraphs
or annihilated, creating V-shaped subgraphs.
The triad census T of a given network consists of 16 values for the number of
appearances of each of the possible 3 node subgraphs (4, 6). 13 of these 16 subgraphs
are connected. Conserving the single node properties (degree sequence) in the random
ensemble of networks leads to several conservation laws on the number of
appearances of different subgraphs (6). To illustrate this, consider the fact that the
total number of edges in the network is conserved. The conservation of the total
number of edges, E can be used to find a conserved expression w'T where the weight
vector w indicates how many single edges exist in each of the 16 types of 3-node
subgraphs. The scalar product w'T is equal to the total number of appearances of
single edges within all 3-node subgraphs in the network. Every edge appears in N-2
subgraphs (including non-connected subgraphs such as subgraph 15 in Table S2).
Therefore w'T = (N-2) E = const. The fact that this value is a constant in the ensemble
of random networks represents a conservation rule. The value of the constant in each
of the randomized networks that conserve the degree sequence is equal to its value in
the real network. The above argument can be used for the nine properties given below
in the ensemble conserving the degree sequence. Denoting the in-degree sequence by
$p_i$, the out-degree sequence by $q_i$, and the mutual-degree sequence by $m_i$, the
conserved properties are: number of single edges ($c_1=\sum_i p_i=\sum_i q_i$), number of mutual
edges ($c_2=\sum_i m_i$), variance of the single-edge incoming degree sequence ($c_3=\sum_i p_i^2$) (the
conservation of the variance is equivalent to the conservation of the second moment
of the degree distribution), variance of the single-edge outgoing degree sequence
($c_4=\sum_i q_i^2$), variance of the mutual-edge degree sequence ($c_5=\sum_i m_i^2$), covariance of the
single-edge incoming degree with outgoing degree ($c_6=\sum_i p_i q_i$), covariance of
single-edge incoming degree with mutual-edge degree ($c_7=\sum_i p_i m_i$), covariance of single-edge
outgoing degree with mutual-edge degree ($c_8=\sum_i q_i m_i$), and the total number of nodes
($c_9=N$). Using these 9 conservation rules the number of independent values in the
triad census reduces from 16 to 7. The analysis is similar for non-directed networks.
The non-directed triad census consists of 4 subgraphs of which 2 are connected. There
are three conservation rules: number of nodes, number of edges, and variance of the
degree distribution. The non-directed triad profile thus reduces to a single degree of
freedom. This type of analysis can be extended to n-node subgraphs. The relative
reduction in the number of degrees of freedom due to conservation rules decreases
with n.
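The non-directed case is small enough to verify numerically. The Python sketch below computes the four-valued triad census (triples of nodes with 0, 1, 2 or 3 edges) and three linear combinations of it that depend only on the number of nodes, the number of edges and the degree sequence. The function names and the example graph are hypothetical, and the edge list is assumed to have no self-edges or duplicates.

```python
from itertools import combinations
from math import comb

def undirected_triad_census(n, edges):
    """census[k] = number of node triples that contain exactly k edges."""
    present = {frozenset(e) for e in edges}
    census = [0, 0, 0, 0]
    for trio in combinations(range(n), 3):
        k = sum(frozenset(pair) in present for pair in combinations(trio, 2))
        census[k] += 1
    return census

def conserved_values(n, edges):
    """Quantities fixed by the node count, edge count and degree sequence."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n_triples = comb(n, 3)                    # all triples of nodes
    edge_appearances = (n - 2) * len(edges)   # each edge lies in N-2 triples
    v_shapes = sum(k * (k - 1) // 2 for k in degree)  # paths of length 2
    return n_triples, edge_appearances, v_shapes

# Small example graph: a triangle with a two-edge tail.
n, edges = 5, [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
t0, t1, t2, t3 = undirected_triad_census(n, edges)
totals = (t0 + t1 + t2 + t3, t1 + 2 * t2 + 3 * t3, t2 + 3 * t3)
```

Here `totals` equals `conserved_values(n, edges)`. With four census values and three such constraints, a single degree of freedom remains, for example the number of triangles.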
The Reaction profile:
The triad structure of a network can be described by the reaction profile – the relative
strength of each of the seven effective reactions (Fig. S2b). The reaction profile can
be used to reduce the dimensionality of the SP. The strength of the seven triad
reactions is calculated by the difference in the number of triangle-shaped subgraphs
between the real network and the average of the random ensemble. Each type of
triangle-shaped subgraph is related to a reaction as shown in Fig. S2a, yielding a set
of linear equations for the reaction strengths $R_i$. The reaction profile is the set of
normalized reaction strengths: $RP_i = R_i / \sqrt{\sum_j R_j^2}$, where $R_i$ is the strength of
reaction i.
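The normalization step amounts to scaling the vector of reaction strengths to unit length. A minimal Python sketch follows; the function name `normalize_profile` is hypothetical, and returning zeros for an all-zero input is an assumption made here, not something the text specifies.

```python
from math import sqrt

def normalize_profile(strengths):
    """Scale a vector of reaction strengths R_i to unit Euclidean length."""
    norm = sqrt(sum(r * r for r in strengths))
    if norm == 0:
        # Assumption: a network with no deviation gets an all-zero profile.
        return [0.0 for _ in strengths]
    return [r / norm for r in strengths]

# Made-up strengths for two reactions, for illustration only.
rp = normalize_profile([3.0, -4.0])
```

The sign of each entry is preserved, so a negative reaction strength stays negative in the profile.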
Consider for example a network that has no mutual edges. There are only 5 possible
connected triads (triads 1,2,3,7 and 8) and only 2 possible reactions (reactions 1 and
2). This applies approximately to sensory transcription networks. It can be seen that in
sensory transcription networks, reaction 1 is dominant (Fig S2a). For social networks,
almost all reactions are active. The reaction profiles allow a compact description of
the difference between networks and their randomized counterparts. One can view the
real networks as pushed out of equilibrium by the selection pressures under which
they evolved. It is important to note that these reactions are merely a way of
representing conservation laws, and are not meant to represent realistic processes
during network evolution.
Tetrad profile for directed networks:
There are 199 possible connected directed tetrads. To concentrate only on the most
important subgraphs we depict the profile for subgraphs which do not have a
"dangling" edge, that is, they are not composed of a 3-node subgraph plus one
incoming or outgoing edge (7). Furthermore, we plot only subgraphs whose absolute
normalized significance was among the highest three in at least one network. The
results are shown in Fig. S3. We find that directed networks in the same triad SP
super-family generally show similar tetrad profiles. The differences between the
tetrad SP of networks within a super-family are generally more pronounced than in
the triad profiles, suggesting that some super-families on the triad level may break up
into finer sub-families when considering higher order subgraphs.
Calculation of the correlation coefficient and results for non-directed tetrads:
To calculate the correlation coefficient between networks we used the Pearson
correlation coefficient of the two profiles. The correlation matrix for non-directed
networks at the level of tetrads is shown in Fig. S4. One can use clustering methods
(8, 9) to group different networks according to the correlations in their TSP profiles.
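As an illustration, the pairwise Pearson correlation between profiles can be computed as in the following Python sketch. The function names and the profile values are hypothetical; only the use of the Pearson coefficient comes from the text.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(u)
    if n != len(v) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(a * b for a, b in zip(du, dv)) / denom

def correlation_matrix(profiles):
    """Map each pair of network names to the correlation of their profiles."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names}
            for a in names}

# Made-up profiles for three hypothetical networks.
matrix = correlation_matrix({
    "net_a": [0.1, 0.5, -0.8],
    "net_b": [0.2, 1.0, -1.6],
    "net_c": [-0.1, -0.5, 0.8],
})
```

The resulting matrix can then be passed to a clustering method of choice.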
Additional comments:
- Robustness of the profile. The present method was checked for robustness against
random data errors by comparing the TSP of real-world networks to perturbed versions
of these networks generated by addition of random edges, deletion of edges at random
or switching edges randomly. It was found that in most cases the resulting profiles
were insensitive to addition of 50% random edges, random deletion of 30% of the
edges and randomly switching 50% of the edges. In Fig S5 we show the robustness
analysis for the network of synaptic connections in C. elegans. This analysis suggests
that the TSP is robust to missing data or random data errors.
There were nonetheless cases where the results were not robust, as follows.
When a network has no mutual edges the addition of mutual edges can change the
profile significantly. Therefore, as mentioned above, when mutual edges represented
less than 0.1% of the edges, they were removed (YEAST-1 and YEAST-2).
In isolated cases, the results can be sensitive to the details of the method. This occurs
for subgraphs where the significance is based on a tiny relative difference in numbers
between the real and random networks: std(Nrand) << |Nreal - Nrand| << Nreal +
Nrand (in the present study these are subgraphs 1-3 in transcription networks of
bacteria and yeast, and subgraphs 1-6 in language networks). This can occur for n-node
subgraphs with n-1 edges, such as triads 1-3, since they usually appear a large
number of times in random networks (10). For example, if one computes the
significance based on the concentration of appearances rather than the absolute
number, the profile may change for subgraphs which obey these inequalities, because
of slight variations in the total number of connected subgraphs in the randomized
networks. For these subgraphs one should be careful in the interpretation of the TSP.
The method could be improved by filtering out such cases (see section: Method for
counting subgraphs).
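One possible form of such a filter is sketched below in Python. The text does not say how large a gap "<<" requires, so the `factor` parameter and its default of 10 are assumptions, as are the function name `is_fragile` and the use of the population standard deviation.

```python
from statistics import mean, pstdev

def is_fragile(n_real, n_rand_samples, factor=10.0):
    """Flag a subgraph whose significance rests on a tiny relative difference.

    Returns True when std(Nrand) << |Nreal - Nrand| << Nreal + Nrand,
    with "<<" read as "smaller by at least `factor`" (an assumption).
    """
    n_rand = mean(n_rand_samples)
    spread = pstdev(n_rand_samples)
    diff = abs(n_real - n_rand)
    return factor * spread < diff and factor * diff < n_real + n_rand

# Made-up counts: a 1% difference with very low spread is flagged.
flagged = is_fragile(10000, [9898, 9900, 9902])
# Made-up counts: a fivefold difference is not flagged.
not_flagged = is_fragile(50, [9, 10, 11])
```

Subgraphs flagged this way have a large Z-score that comes from a small relative change in counts, which is the situation the paragraph above warns about.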
- Different types of edges. The representation of complex systems as networks with
nodes and edges of one type is often a severe simplification. A directed edge between
two nodes can represent many different kinds of interactions and interaction strengths.
This can be partially dealt with by using weighted edges that discern between
different types and strengths of interactions. The methods described here can in
principle be extended to more elaborate representations.
- Same structure can have different functions. After network motifs are discovered,
one may try to understand whether they have defined functions in the network. Since
a network representation of a system usually does not represent all of the details of
each interaction, it is not always possible to understand the function or dynamics of a
given structure without additional information. Also, the way each instance of the
structure is embedded in the network can affect its behavior. For these reasons, it is
important to notice that different instances of the same connectivity structure found in
a network need not have identical functions. For example, consider the case of the
feedforward loop (FFL) 3-node subgraph in transcription networks. The FFL can have
one of several different functions depending on the signs of regulation on each of the
edges and on the regulatory logic governing the output gene (11-13). We can know
the function of a given FFL instance only if we know the signs and logic function.
Interestingly, the FFL functions do not qualitatively depend on the strengths of the
interactions: there is a large range of parameters where the FFL has a defined,
"robust" function (11-13). Similar considerations apply to the other motifs in
transcription networks. More generally, further research will be needed to show to
what level similar structures imply similar functions in each type of network.
- Profiles of higher order subgraphs. Networks that are part of a super-family when
analyzed at the level of 3-node subgraphs will not necessarily share the same profile
for higher order subgraphs (see section: Tetrad profile for directed networks). It
remains a computational challenge to efficiently analyze similarities and differences at
larger subgraph sizes (see N. Kashtan et al., Bioinformatics 2004, in press).
- Autoregulatory edges. Throughout the present study, the network representation
did not include edges pointing from a node to itself (also referred to as autoregulatory
or self edges). In some networks these are structurally not allowed (social networks,
protein structure, etc.) whereas in others they can play an important role (transcription
networks (14-17)). The present algorithms can accommodate such edges, but the
treatment becomes more complex because the number of possible subgraphs becomes
much larger. In some of the networks, such as the E. coli transcription network,
autoregulatory edges appear much more often than expected in random ensembles in
which edges are allowed to close on their node of origin while preserving the degree
sequence. Autoregulatory edges are therefore "single-node network motifs" in these
networks. They play important functions such as speeding response times (17) and
reducing noise (16).
- Profiles of random networks. Due to the process of normalization, random
networks or nearly random networks may have a seemingly well-defined TSP
resulting from noise. To avoid this one can perform the following test on the
unnormalized vector of Z-scores. A random network is expected by definition to have a
vector of Z-scores that has on the average a mean of zero and standard deviation of
one (we neglect effects of the conservation laws described above). On the other hand,
any network that has significant deviations from randomness in its local structure will
have a Z-score vector with standard deviation larger than one. The standard deviation
of the Z-score vector supplies a measure of the non-randomness of the net. All of the
real-world networks in the present study had a standard deviation of the Z-score
vector>2, except for TRANSC_SEA_URCHIN in which the standard deviation is 1.7.
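The test can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the Z-score of each subgraph is taken as (Nreal - mean(Nrand)) / std(Nrand) using the sample standard deviation over the randomized networks, and the spread of the Z-score vector is measured with the population standard deviation. The text does not specify either choice, nor what to do when a subgraph count never varies in the random ensemble.

```python
from statistics import mean, pstdev, stdev

def z_scores(real_counts, random_counts):
    """Z-score per subgraph type.

    real_counts[i]: count of subgraph i in the real network.
    random_counts[k][i]: count of subgraph i in the k-th randomized network.
    """
    scores = []
    for i, n_real in enumerate(real_counts):
        sample = [row[i] for row in random_counts]
        spread = stdev(sample)
        if spread == 0:
            # Assumption: treat a non-varying count as an error.
            raise ValueError(f"subgraph {i} has zero spread in the ensemble")
        scores.append((n_real - mean(sample)) / spread)
    return scores

def nonrandomness(scores):
    """Standard deviation of the Z-score vector."""
    return pstdev(scores)

# Made-up counts for two subgraph types in three randomized networks.
zs = z_scores([12, 2], [[8, 4], [10, 6], [12, 8]])
spread_of_z = nonrandomness(zs)
```

A value of `nonrandomness` close to one would indicate that the profile may be dominated by noise.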
- The importance of the random ensemble. The present method of detecting
network motifs is based on contrasting the number of appearances of subgraphs in a
network to an ensemble of random networks. The random networks used here
conserve the degree sequence of the original network for incoming edges, outgoing
edges and mutual edges. This ensemble conserves all one and two-node properties of
the original network. It has the virtue that it preserves the hubs, which can have a
large effect on the number of subgraphs (10). It also conserves two-node structures
(single and mutual edges). Other random ensembles, such as those preserving higher
order subgraphs, can also be useful (2). In general, the SP depends on what type of
random ensemble is used.
[Figs. S1, S2a, S2b, S3, S4 and S5 appear here in the original; their captions follow.]
Fig S1: The triad concentration profile for directed networks. Networks are grouped
based on super-families found according to the TSP (Fig 1a in the paper).
Fig S2: Triad reaction profiles of networks. (a) The seven possible reactions in an
ensemble that conserves the degree sequence. The reactions represent the seven
degrees of freedom of the triad counts in the network. In each reaction a switch is
performed that conserves the in- and out-degrees of each node but changes the
subgraph distribution. (b) The normalized reaction profile for different networks. The
profile is calculated from the deviation of the triad counts of the real network from the
random ensemble.
Fig S3: The subgraph ratio profile (SRP) at the level of tetrads for directed networks.
Networks are grouped based on the super-families found at the level of triads (Fig 1a
in the paper). Only a subset of all 199 possible tetrads is displayed as explained in the
text.
Fig S4: The correlation coefficient matrix of the tetrad SRP for non-directed
networks.
Fig S5: Robustness analysis for the triad significance profile (TSP). The TSP of
synaptic connections between neurons in C. elegans and TSP of this network
perturbed by adding, switching or removing 10%, 30% and 50% of the edges at
random.
Table S1
Directed Network Nodes Edges Description
TRANSC-COLI 424 519 Direct transcriptional regulation between operons in
Escherichia coli (18) based on (19) and additional
literature search.
TRANSC-B.SUBTILIS 516 577 Direct transcriptional regulation between genes in
Bacillus subtilis based on (20).
TRANSC-YEAST1 685 1052 Direct transcriptional regulation between genes in
Saccharomyces cerevisiae (21) based on literature
database (2).
TRANSC-YEAST2 2341 3969 Direct transcriptional regulation between genes in
Saccharomyces cerevisiae based on genome-wide
location analysis experiment (22).
SIGNAL-TRANSDUCTION 491 989 Signal-transduction interactions in mammalian cells
based on the Signal Transduction Knowledge
Environment (STKE) http://stke.sciencemag.org/ and
additional literature search (present study).
TRANSC-DROSOPHILA 110 307 Developmental transcription network of Drosophila
melanogaster from the genet literature database
(www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm).
All "direct" or "possibly direct" interactions were
included.
TRANSC-SEA URCHIN 45 83 Developmental transcription network for sea-urchin
endomesoderm development, based on experiments
by the Davidson lab (23).
NEURONS 280 2170 Neuronal synaptic network of the nematode C.
elegans (24). Included all data except muscle cells.
Unlike (2), all synaptic connections were used, not
only those with >=5 synapses.
WWW-1 325729 1469678 Hyperlinks in nd.edu (25).
WWW-2 277114 927400 Hyperlinks related to literary studies of Shakespeare
(26).
WWW-3 47870 235441 Hyperlinks related to tango (specifically the music of
Piazzolla) (26).
SOCIAL-1 67 182 Inmates in prison choose "What fellows on the tier are
you closest friends with?" (27).
SOCIAL-2 28 110 Friendship network among sociology freshmen after 6
weeks of acquaintance (28).
SOCIAL-3 32 96 College students in a course about leadership choose
which three members they wanted to have in a
committee (29).
ENGLISH 7724 46281 Word adjacency network of text from Darwin "The
Origin of Species".
FRENCH 9424 24295 Word adjacency network of text from Jules Verne "De
la Terre a la Lune".
SPANISH 12642 45129 Word adjacency network of text from Cervantes "Don
Quixote".
JAPANESE 3177 8300 Word adjacency network of text from Murasaki
Shikibu "The Tale of Genji".
BI-PARTITE 1010 1261 Bi-partite model with two groups of nodes of sizes
N1=1000 and N2=10 with probability of a directed or
mutual edge between nodes of different groups being
p=0.06 and q=0.003 respectively, and no edges
between nodes within the same group.
Non-directed Network Nodes Edges Description
POWERGRID 4941 6594 Stations in the electrical power grid of the western
United States (30).
GEO-MODEL-PG 5000 7499 A geometric model with number of nodes, edges and
clustering coefficient similar to that of the
POWERGRID network. Nodes are aligned on a line.
Points which are closer than a distance of R=5 on the
lattice are linked by an edge with probability p=0.3.
Points at a distance greater than R are unlinked.
PROTEIN-STRUCTURE-1 95 213
PROTEIN-STRUCTURE-2 53 123
PROTEIN-STRUCTURE-3 99 212
Nodes are secondary structure elements (alpha-helices
and beta-strands) and two nodes are connected if their
distance is smaller than 10Å. The proteins are (PDB
id): 1A4J an immunoglobulin, 1EAW a serine
protease inhibitor and 1AOR an oxidoreductase.
Structure based on PDB database
(http://www.rcsb.org/pdb/).
GEO-MODEL-PS 53 136 A geometric model with number of nodes, edges and
clustering coefficient similar to that of PROTEIN-2.
Nodes are aligned on a line. Points which are closer
than a distance of R=5 on the lattice are linked by an
edge with probability p=0.46. Points at a distance
greater than R are unlinked.
AUTONOMOUS SYSTEM-1 3015 5156
AUTONOMOUS SYSTEM-2 3522 6324
AUTONOMOUS SYSTEM-3 4517 8376
AUTONOMOUS SYSTEM-4 5357 10328
AUTONOMOUS SYSTEM-5 7956 15943
AUTONOMOUS SYSTEM-6 10515 21455
The internet at the Autonomous System (31) level,
from the COSIN database http://www.cosin.org.
The networks correspond to sampling of the internet
on the following dates: 8/11/1997, 2/4/1998,
14/1/1999, 2/7/1999, 2/7/2000, 16/3/2001.
BA m=1 N=1000 1000 1000
BA m=1 N=3000 3000 3000
BA m=10 N=1000 1000 9901
BA m=10 N=3000 3000 29901
Networks evolved under the preferential attachment
model of Barabasi and Albert (BA) (25): a non-directed
network is grown node by node, connecting
each new node to m existing ones. The probability of
connecting to an existing node i increases with the
number of edges it already has.
Table S2
Subgraph Name Transitive triplets Intransitive triplets
1 V-out 0 0
2 V-in 0 0
3 3-Chain 0 1
4 Mutual in 0 1
5 Mutual out 0 1
6 Mutual V 0 2
7 FFL 1 0
8 3-Loop 0 3
9 Regulated mutual 2 0
10 Regulating mutual 2 0
11 Mutual and 3-Chain 1 2
12 Semi clique 3 1
13 Clique 6 0
14 Null 0 0
15 Directed edge 0 0
16 Mutual edge 0 0
Table S1: Networks and databases. Shown are the number of nodes and edges, and a
short description for each of the networks used in this study.
Table S2: Transitivity (number of transitive triplets) and intransitivity (number of
intransitive triplets) in the thirteen connected and three non-connected triads.
References:
1. S. Maslov, K. Sneppen, Science 296, 910-3 (May 3, 2002).
2. R. Milo et al., Science 298, 824-7 (Oct 25, 2002).
3. Cartwright, Harary, Psychological Review 63, 277-293 (1956).
4. S. Wasserman, K. Faust, Social Network Analysis (Cambridge University Press, 1994).
5. F. Harary, J. Kommel, Journal of Mathematical Sociology 6, 199-210 (1979).
6. P. Holland, S. Leinhardt, D. Heise, Ed. (Jossey-Bass, San Francisco, 1975) pp. 1-45.
7. J. Berg, M. Lassig, cond-mat/0308251 (2003).
8. R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification (John Wiley & Sons, 1973).
9. M. Blatt, S. Wiseman, E. Domany, Physical Review Letters 76, 3251-3254 (Apr 29, 1996).
10. S. Itzkovitz, R. Milo, N. Kashtan, G. Ziv, U. Alon, Phys Rev E Stat Nonlin Soft Matter Phys 68, 026127 (Aug, 2003).
11. S. Mangan, U. Alon, Proc Natl Acad Sci U S A 100, 11980-5 (Oct 14, 2003).
12. S. Mangan, A. Zaslaver, U. Alon, Journal of Molecular Biology Vol 334, 197-204 (2003).
13. Y. Setty, A. E. Mayo, M. G. Surette, U. Alon, Proc Natl Acad Sci U S A 100, 7702-7 (Jun 24, 2003).
14. R. Thomas, J Theor Biol 73, 631-56 (Aug 21, 1978).
15. R. Thomas, D. Thieffry, M. Kaufman, Bull Math Biol 57, 247-76 (Mar, 1995).
16. A. Becskei, L. Serrano, Nature 405, 590-3 (Jun 1, 2000).
17. N. Rosenfeld, M. B. Elowitz, U. Alon, J Mol Biol 323, 785-93 (Nov 8, 2002).
18. H. Salgado et al., Nucleic Acids Res 29, 72-4 (2001).
19. S. Shen-Orr, R. Milo, S. Mangan, U. Alon, Nat Genet 31, 64-8 (2002).
20. T. Ishii, K. Yoshida, G. Terai, Y. Fujita, K. Nakai, Nucleic Acids Res 29, 278-80 (Jan 1, 2001).
21. M. C. Costanzo et al., Nucleic Acids Res 29, 75-9 (2001).
22. T. I. Lee et al., Science 298, 799-804 (Oct 25, 2002).
23. E. H. Davidson et al., Science 295, 1669-78 (Mar 1, 2002).
24. J. White, E. Southgate, J. Thomson, S. Brenner, Phil. Trans. Roy. Soc. London Ser. B 314 (1986).
25. A. L. Barabasi, R. Albert, Science 286, 509-12 (1999).
26. J. P. Eckmann, E. Moses, Proc Natl Acad Sci U S A 99, 5825-9 (Apr 30, 2002).
27. Macrae, Sociometry 23, 360-371 (1960).
28. M. A. J. Van Duijn, M. Huisman, F. N. Stokman, F. W. Wasseur, E. P. H. Zeggelink, Journal of Mathematical Sociology 27 (2003).
29. L. D. Zeleny, Sociometry 13, 314-328 (1950).
30. D. J. Watts, S. H. Strogatz, Nature 393, 440-2 (Jun 4, 1998).
31. L. Peterson, B. Davie, Computer networks: a systems approach (Morgan Kaufmann, ed. 2nd edition, 1999).