Supplementary Online Materials

"Super-families of evolved and designed networks"

Network data:

Networks and their sources are listed in Table S1. When mutual edges constituted less

than 0.1% of the total number of edges, they were removed from the network (this

applies to the two yeast networks, which have one mutual edge in YEAST-1 and two

in YEAST-2).

Method for counting subgraphs:

All results were obtained with mfinder1.1, a network motif detection tool (software

and user's guide available at www.weizmann.ac.il/mcb/UriAlon). For large networks,

we used a subgraph-sampling algorithm (Kashtan et al., Bioinformatics 2004, in

press). In cases where a subgraph appeared a very small number of times (here <2) in

both the real and random networks, it was counted as appearing zero times.

Methods for uniformly generating random graphs with a given degree sequence:

All results in the present paper were obtained with the 'switching' algorithm

implemented in mfinder1.1 (www.weizmann.ac.il/mcb/UriAlon, see also

http://arxiv.org/abs/cond-mat/0312028). In the 'switching' algorithm, a pair of edges is

randomly chosen and switched (A→B, C→D becomes A→D, C→B) (1). This is

repeated for a predetermined number of trials T, which is uniformly chosen at random

from the range Q*E<T<2Q*E, where E is the number of edges in the network and Q

is a positive number (we typically use Q=100). If the switch generates multiple edges

or self-edges, or makes no change in the network, it is not performed (but is counted

as a trial in order to ensure detailed balance). See http://arxiv.org/abs/cond-mat/0312028

for a discussion of different methods for generating random graphs with

a given degree distribution and empirical results that demonstrate that the present

method samples the graphs sufficiently uniformly for the purposes of the present

analysis.
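
The switching procedure described above can be sketched in Python as follows. This is a minimal illustration and not the mfinder1.1 implementation: the function name switch_randomize and its parameters are hypothetical, Q is assumed to be an integer, and the sketch treats every edge as a single directed edge, so it does not separately conserve the mutual-edge degrees as the ensemble used in this study does.

```python
import random


def switch_randomize(edges, q=100, rng=None):
    """Return a degree-preserving randomization of a directed edge list.

    edges: distinct (source, target) pairs without self-edges.
    q: the constant Q from the text (assumed integer here); the number of
       trials T is drawn uniformly at random from Q*E < T < 2*Q*E.
    """
    rng = rng or random.Random()
    edge_list = list(edges)
    edge_set = set(edge_list)
    n_edges = len(edge_list)
    if q * n_edges < 2:
        raise ValueError("need Q*E >= 2 so that the trial range is non-empty")
    trials = rng.randint(q * n_edges + 1, 2 * q * n_edges - 1)
    for _ in range(trials):
        i, j = rng.randrange(n_edges), rng.randrange(n_edges)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Proposed switch: A->B, C->D becomes A->D, C->B.
        if a == d or c == b:
            continue  # would create a self-edge; the trial is still counted
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # multiple edge or no change; the trial is still counted
        edge_set.difference_update([(a, b), (c, d)])
        edge_set.update([(a, d), (c, b)])
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list
```

Rejected switches fall through to the next loop iteration, so they consume a trial without changing the network, as the text requires. Passing a seeded random.Random instance as rng makes a run reproducible.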

In contrast to our previous study (2), when analyzing tetrads in the present

study we compared the networks to randomized networks that are not constrained to

preserve the triad counts.

Transitivity and intransitivity in subgraphs:

An ordered triplet of nodes (x,y,z) is transitive if x→y, y→z and x→z (3-5). For

example, triad 13, termed ‘clique’, has six transitive triplets, which is the highest

transitivity possible in a triad (Table S2). An ordered triplet of nodes is intransitive, as

defined for example by Harary and Kommel (5), if x→y, y→z but no edge is directed

from x to z. Table S2 summarizes the number of transitive and intransitive triplets for

all triad subgraphs.
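
The two definitions above translate directly into a short counting routine. The following Python sketch is illustrative only; the function name count_triplets and the node labels are hypothetical, and ordered triplets are assumed to consist of three distinct nodes.

```python
from itertools import permutations


def count_triplets(nodes, edges):
    """Count (transitive, intransitive) ordered triplets in a small subgraph.

    nodes: iterable of node labels; edges: iterable of directed (x, y) pairs.
    """
    edge_set = set(edges)
    transitive = 0
    intransitive = 0
    for x, y, z in permutations(nodes, 3):
        if (x, y) in edge_set and (y, z) in edge_set:
            if (x, z) in edge_set:
                transitive += 1      # x->y, y->z and x->z
            else:
                intransitive += 1    # x->y, y->z but no edge x->z
    return transitive, intransitive


# Feedforward loop (triad 7): a->b, b->c, a->c
print(count_triplets("abc", [("a", "b"), ("b", "c"), ("a", "c")]))  # (1, 0)
# 3-Loop (triad 8): a->b, b->c, c->a
print(count_triplets("abc", [("a", "b"), ("b", "c"), ("c", "a")]))  # (0, 3)
```

The printed values agree with the FFL and 3-Loop rows of Table S2.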

SP can highlight similarities not evident merely from the subgraph content: The

similarities seen in the SP of different networks can be further appreciated by a

contrast with the distribution of the concentrations of the subgraphs. The

concentration of triad i, Ci, is the number of times it appears in the network divided by

the total number of appearances of all thirteen connected triads:

$C_i = N^{real}_i / \sum_{j=1}^{13} N^{real}_j$

Unlike the SP, the concentration profile is based only on the real network and not on

the randomized networks. Networks of a given type are found to have a similar

distribution (Fig S1). On the other hand, even though they showed similar TSPs,

profiles, networks of different types in the same super-family can show quite different

concentration distributions, as seen in the case of WWW and social networks. The SP

can therefore highlight similarities that are not evident from subgraph counts.
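
As a small illustration of the concentration defined above, the following Python sketch divides each triad count by the total over the connected triads. The function name concentrations and the example counts are hypothetical and are not data from this study.

```python
def concentrations(counts):
    """Return C_i = N_i / sum_j N_j for a list of connected-triad counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("the network contains no connected triads")
    return [n / total for n in counts]


# Hypothetical counts for the thirteen connected triads of one network.
example_counts = [5, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
profile = concentrations(example_counts)
```

Only counts from the real network enter this calculation; no randomized networks are needed.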

Dimensionality reduction using conservation rules for subgraphs:

The triad significance profiles (TSPs) display certain relations between subgraph

types. For example, networks with excess triangle-shaped subgraphs tend to have a

deficit of V-shaped subgraphs. Are there conservation rules for subgraphs? Here we

show that there are nine triad conservation rules in networks that conserve the degree

sequences of single and mutual edges. Thus the values of the 16 possible 3-node

subgraphs (13 connected triads and 3 non-connected 3-node patterns, see Table S2)

are determined by only seven degrees of freedom. We present an intuitive way to

interpret these conservation laws in terms of reactions that convert V-shaped

subgraphs into triangles, preserving the degrees of all participating nodes (Fig. S2a).

In each reaction a triangle-shaped subgraph is either created from V-shaped subgraphs

or annihilated, creating V-shaped subgraphs.

The triad census T of a given network consists of 16 values for the number of

appearances of each of the possible 3 node subgraphs (4, 6). 13 of these 16 subgraphs

are connected. Conserving the single node properties (degree sequence) in the random

ensemble of networks leads to several conservation laws on the number of

appearances of different subgraphs (6). To illustrate this, consider the fact that the

total number of edges in the network is conserved. The conservation of the total

number of edges, E can be used to find a conserved expression w'T where the weight

vector w indicates how many single edges exist in each of the 16 types of 3-node

subgraphs. The scalar product w'T is equal to the total number of appearances of

single edges within all 3-node subgraphs in the network. Every edge appears in N-2

subgraphs (including non-connected subgraphs such as subgraph 15 in Table S2).

Therefore w'T = (N-2) E = const. The fact that this value is a constant in the ensemble

of random networks represents a conservation rule. The value of the constant in each

of the randomized networks that conserve the degree sequence is equal to its value in

the real network. The above argument can be used for the nine properties given below

in the ensemble conserving the degree sequence. Denoting the in-degree sequence by

$p_i$, the out-degree sequence by $q_i$, and the mutual-degree sequence by $m_i$, the

conserved properties are: number of single edges ($c_1=\sum_i p_i=\sum_i q_i$), number of mutual

edges ($c_2=\sum_i m_i$), variance of the single-edge incoming degree sequence ($c_3=\sum_i p_i^2$) (the

conservation of the variance is equivalent to the conservation of the second moment

of the degree distribution), variance of the single-edge outgoing degree sequence

($c_4=\sum_i q_i^2$), variance of the mutual-edge degree sequence ($c_5=\sum_i m_i^2$), covariance of the

single-edge incoming degree with outgoing degree ($c_6=\sum_i p_i q_i$), covariance of

single-edge incoming degree with mutual-edge degree ($c_7=\sum_i p_i m_i$), covariance of single-edge

outgoing degree with mutual-edge degree ($c_8=\sum_i q_i m_i$), and the total number of nodes

($c_9=N$). Using these 9 conservation rules the number of independent values in the

triad census reduces from 16 to 7. The analysis is similar for non-directed networks.

The non-directed triad census consists of 4 subgraphs of which 2 are connected. There

are three conservation rules: number of nodes, number of edges, and variance of the

degree distribution. The non-directed triad profile thus reduces to a single degree of

freedom. This type of analysis can be extended to n-node subgraphs. The relative

reduction in the number of degrees of freedom due to conservation rules decreases

with n.
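
The edge-count rule w'T = (N-2) E from the paragraph above can be checked numerically. The following Python sketch sums the number of edges found inside every 3-node subset of a network, which equals w'T because each subset contributes the edge count of its subgraph type. The function name edges_within_triples and the example network are hypothetical; the example has no mutual edges, so every edge is a single edge.

```python
from itertools import combinations


def edges_within_triples(n_nodes, edges):
    """Sum, over all 3-node subsets, the number of edges inside the subset."""
    edge_set = set(edges)
    total = 0
    for triple in combinations(range(n_nodes), 3):
        members = set(triple)
        total += sum(1 for u, v in edge_set if u in members and v in members)
    return total


# Hypothetical network with N = 5 nodes and E = 5 single edges.
example_edges = [(0, 1), (1, 2), (2, 0), (3, 4), (0, 4)]
total = edges_within_triples(5, example_edges)
print(total, (5 - 2) * len(example_edges))  # 15 15
```

Because the identity depends only on N and E, it holds equally for every randomized network that conserves the degree sequence, which is what makes it a conservation rule.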

The Reaction profile:

The triad structure of a network can be described by the reaction profile – the relative

strength of each of the seven effective reactions (Fig. S2b). The reaction profile can

be used to reduce the dimensionality of the SP. The strength of the seven triad

reactions is calculated by the difference in the number of triangle-shaped subgraphs

between the real network and the average of the random ensemble. Each type of

triangle-shaped subgraph is related to a reaction as shown in Fig. S2a, yielding a set

of linear equations for the reaction strengths $R_i$. The reaction profile is the set of

normalized reaction strengths: $RP_i = R_i / \sqrt{\sum_j R_j^2}$, where $R_i$ is the strength of reaction

i.
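
The normalization of the reaction strengths can be sketched as follows. This Python fragment covers only the final normalization step and assumes the reaction strengths have already been obtained from the linear equations; the function name normalize_profile and the example values are hypothetical.

```python
import math


def normalize_profile(strengths):
    """Scale a vector of reaction strengths R_i to unit Euclidean length."""
    norm = math.sqrt(sum(r * r for r in strengths))
    if norm == 0:
        raise ValueError("all reaction strengths are zero")
    return [r / norm for r in strengths]


# Hypothetical strengths for the seven triad reactions.
example_strengths = [12.0, -3.0, 0.0, 4.0, 0.0, 0.0, 0.0]
reaction_profile = normalize_profile(example_strengths)
```

After normalization the squared entries sum to one, so profiles of networks with very different sizes can be compared on the same scale.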

Consider for example a network that has no mutual edges. There are only 5 possible

connected triads (triads 1,2,3,7 and 8) and only 2 possible reactions (reactions 1 and

2). This applies approximately to sensory transcription networks. It can be seen that in

sensory transcription networks, reaction 1 is dominant (Fig S2a). For social networks,

almost all reactions are active. The reaction profiles allow a compact description of

the difference between networks and their randomized counterparts. One can view the

real networks as pushed out of equilibrium by the selection pressures under which

they evolved. It is important to note that these reactions are merely a way of

representing conservation laws, and are not meant to represent realistic processes

during network evolution.

Tetrad profile for directed networks:

There are 199 possible connected directed tetrads. To concentrate only on the most

important subgraphs we depict the profile for subgraphs which do not have a

"dangling" edge, that is, they are not composed of a 3-node subgraph plus one

incoming or outgoing edge (7). Furthermore, we plot only subgraphs whose absolute

normalized significance was among the highest three in at least one network. The

results are shown in Fig. S3. We find that directed networks in the same triad SP

super-family generally show similar tetrad profiles. The differences between the

tetrad SP of networks within a super-family are generally more pronounced than in

the triad profiles, suggesting that some super-families on the triad level may break up

into finer sub-families when considering higher order subgraphs.

Calculation of the correlation coefficient and results for non-directed tetrads:

To calculate the correlation coefficient between networks we used the Pearson

correlation coefficient of the two profiles. The correlation matrix for non-directed

networks at the level of tetrads is shown in Fig. S4. One can use clustering methods

(8, 9) to group different networks according to the correlations in their TSPs.
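
A minimal Python sketch of the correlation calculation is given below. It implements the standard Pearson coefficient for two profiles of equal length and builds a matrix over a set of named profiles; the function names and the example profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def correlation_matrix(profiles):
    """Map each ordered pair of network names to the correlation of their profiles."""
    names = list(profiles)
    return {(a, b): pearson(profiles[a], profiles[b]) for a in names for b in names}


# Hypothetical profiles for three networks.
example = {"NET-A": [0.5, -0.2, 0.1], "NET-B": [0.4, -0.3, 0.2], "NET-C": [-0.5, 0.2, -0.1]}
matrix = correlation_matrix(example)
```

The resulting matrix can be passed to a clustering method of choice to group networks with similar profiles.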

Additional comments:

- Robustness of the profile. The present method was checked for robustness against

random data errors by comparing the TSPs of real-world networks to perturbed versions

of these networks generated by addition of random edges, deletion of edges at random

or switching edges randomly. It was found that in most cases the resulting profiles

were insensitive to addition of 50% random edges, random deletion of 30% of the

edges and randomly switching 50% of the edges. In Fig S5 we show the robustness

analysis for the network of synaptic connections in C. elegans. This analysis suggests

that the TSP is robust to missing data or random data errors.

There were nonetheless cases where the results were not robust, as follows.

When a network has no mutual edges the addition of mutual edges can change the

profile significantly. Therefore, as mentioned above, when mutual edges represented

less than 0.1% of the edges, they were removed (YEAST-1 and YEAST-2).

In isolated cases, the results can be sensitive to the details of the method. This occurs

for subgraphs where the significance is based on a tiny relative difference in numbers

between the real and random network: std(Nrand) << |Nreal - Nrand| << Nreal + Nrand

(in the present study these are subgraphs 1-3 in transcription networks of

bacteria and yeast, and subgraphs 1-6 in language networks). This can occur for

n-node subgraphs with n-1 edges, such as triads 1-3, since they usually appear a large

number of times in random networks (10). For example, if one computes the

significance based on the concentration of appearances rather than the absolute

number, the profile may change for subgraphs which obey these inequalities, because

of slight variations in the total number of connected subgraphs in the randomized

networks. For these subgraphs one should be careful in the interpretation of the TSP.

The method could be improved by filtering out such cases (see section: Method for

counting subgraphs).

- Different types of edges. The representation of complex systems as networks with

nodes and edges of one type is often a severe simplification. A directed edge between

two nodes can represent many different kinds of interactions and interaction strengths.

This can be partially dealt with by using weighted edges that discern between

different types and strengths of interactions. The methods described here can in

principle be extended to more elaborate representations.

- Same structure can have different functions. After network motifs are discovered,

one may try to understand whether they have defined functions in the network. Since

a network representation of a system usually does not represent all of the details of

each interaction, it is not always possible to understand the function or dynamics of a

given structure without additional information. Also, the way each instance of the

structure is embedded in the network can affect its behavior. For these reasons, it is

important to notice that different instances of the same connectivity structure found in

a network need not have identical functions. For example, consider the case of the

feedforward loop (FFL) 3-node subgraph in transcription networks. The FFL can have

one of several different functions depending on the signs of regulation on each of the

edges and on the regulatory logic governing the output gene (11-13). We can know

the function of a given FFL instance only if we know the signs and logic function.

Interestingly, the FFL functions do not qualitatively depend on the strengths of the

interactions: there is a large range of parameters where the FFL has a defined,

"robust" function (11-13). Similar considerations apply to the other motifs in

transcription networks. More generally, further research will be needed to show to

what level similar structures imply similar functions in each type of network.

- Profiles of higher order subgraphs. Networks that are part of a super-family when

analyzed at the level of 3-node subgraphs will not necessarily share the same profile

for higher order subgraphs (see the tetrad profile section above). It remains a computational

challenge to efficiently analyze similarities and differences at larger subgraph sizes

(see N. Kashtan et al., Bioinformatics 2004, in press).

- Autoregulatory edges. Throughout the present study, the network representation

did not include edges pointing from a node to itself (also referred to as autoregulatory

or self edges). In some networks these are structurally not allowed (social networks,

protein structure, etc.) whereas in others they can play an important role (transcription

networks (14-17)). The present algorithms can accommodate such edges, but the

treatment becomes more complex because the number of possible subgraphs becomes

much larger. In some of the networks, such as the E. coli transcription network,

autoregulatory edges appear much more often than expected in random ensembles in

which edges are allowed to close on their node of origin while preserving the degree

sequence. Autoregulatory edges are therefore "single-node network motifs" in these

networks. They play important functions such as speeding response times (17) and

reducing noise (16).

- Profiles of random networks. Due to the process of normalization, random

networks or nearly random networks may have a seemingly well-defined TSP

resulting from noise. To avoid this one can perform the following test on the

un-normalized vector of Z-scores. A random network is expected by definition to have a

vector of Z-scores that has on average a mean of zero and standard deviation of

one (we neglect effects of the conservation laws described above). On the other hand,

any network that has significant deviations from randomness in its local structure will

have a Z-score vector with standard deviation larger than one. The standard deviation

of the Z-score vector supplies a measure of the non-randomness of the net. All of the

real-world networks in the present study had a standard deviation of the Z-score

vector>2, except for TRANSC_SEA_URCHIN in which the standard deviation is 1.7.

- The importance of the random ensemble. The present method of detecting

network motifs is based on contrasting the number of appearances of subgraphs in a

network to an ensemble of random networks. The random networks used here

conserve the degree sequence of the original network for incoming edges, outgoing

edges and mutual edges. This ensemble conserves all one and two-node properties of

the original network. It has the virtue that it preserves the hubs, which can have a

large effect on the number of subgraphs (10). It also conserves two-node structures

(single and mutual edges). Other random ensembles, such as those preserving higher

order subgraphs, can also be useful (2). In general, the SP depends on what type of

random ensemble is used.
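
The test on the un-normalized Z-score vector described under "Profiles of random networks" can be sketched in Python as follows. The sketch assumes the usual definition Z_i = (Nreal_i - mean(Nrand_i)) / std(Nrand_i), which this supplement does not restate. The function names, the choice of the sample standard deviation, the decision to skip subgraphs whose random counts do not vary, and the example counts are all assumptions made for illustration.

```python
import statistics


def z_score_vector(real_counts, random_counts):
    """Z-score of each subgraph count against an ensemble of randomized networks.

    real_counts: one count per subgraph type in the real network.
    random_counts: list of count lists, one per randomized network.
    """
    z_scores = []
    for i, n_real in enumerate(real_counts):
        samples = [counts[i] for counts in random_counts]
        spread = statistics.stdev(samples)
        if spread == 0:
            continue  # Z-score undefined; skipped in this sketch
        z_scores.append((n_real - statistics.mean(samples)) / spread)
    return z_scores


def z_score_spread(z_scores):
    """Standard deviation of the Z-score vector, a measure of non-randomness."""
    return statistics.stdev(z_scores)


# Hypothetical counts for three subgraph types and three randomized networks.
real = [14, 20, 1]
randomized = [[8, 20, 5], [12, 22, 7], [10, 18, 3]]
spread = z_score_spread(z_score_vector(real, randomized))
```

A spread close to one is what the text expects for a random network, while the real-world networks in this study gave values above 1.7.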

Fig. S1

Fig. S2a

Fig. S2b

Fig. S3

Fig. S4

Fig. S5

Fig S1: The triad concentration profile for directed networks. Networks are grouped

based on super-families found according to the TSP (Fig 1a in the paper).

Fig S2: Triad reaction profiles of networks. (a) The seven possible reactions in an

ensemble that conserves the degree sequence. The reactions represent the seven

degrees of freedom of the triad counts in the network. In each reaction a switch is

performed that conserves the in- and out-degrees of each node but changes the

subgraph distribution. (b) The normalized reaction profile for different networks. The

profile is calculated from the deviation of the triad counts of the real network from the

random ensemble.

Fig S3: The subgraph ratio profile (SRP) at the level of tetrads for directed networks.

Networks are grouped based on the super-families found at the level of triads (Fig 1a

in the paper). Only a subset of all 199 possible tetrads is displayed as explained in the

text.

Fig S4: The correlation coefficient matrix of the tetrad SRP for non-directed

networks.

Fig S5: Robustness analysis for the triad significance profile (TSP). The TSP of

synaptic connections between neurons in C. elegans and TSP of this network

perturbed by adding, switching or removing 10%, 30% and 50% of the edges at

random.

Table S1

Directed Network | Nodes | Edges | Description
TRANSC-COLI | 424 | 519 | Direct transcriptional regulation between operons in Escherichia coli (18) based on (19) and additional literature search.
TRANSC-B.SUBTILIS | 516 | 577 | Direct transcriptional regulation between genes in Bacillus subtilis based on (20).
TRANSC-YEAST1 | 685 | 1052 | Direct transcriptional regulation between genes in Saccharomyces cerevisiae (21) based on literature database (2).
TRANSC-YEAST2 | 2341 | 3969 | Direct transcriptional regulation between genes in Saccharomyces cerevisiae based on genome-wide location analysis experiment (22).
SIGNAL-TRANSDUCTION | 491 | 989 | Signal-transduction interactions in mammalian cells based on the Signal Transduction Knowledge Environment (STKE) http://stke.sciencemag.org/ and additional literature search (present study).
TRANSC-DROSOPHILA | 110 | 307 | Developmental transcription network of Drosophila melanogaster from the genet literature database (www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm). All "direct" or "possibly direct" interactions were included.
TRANSC-SEA URCHIN | 45 | 83 | Developmental transcription network for sea-urchin endomesoderm development, based on experiments by the Davidson lab (23).
NEURONS | 280 | 2170 | Neuronal synaptic network of the nematode C. elegans (24). Included all data except muscle cells. Unlike (2), all synaptic connections were used, not only those with >=5 synapses.
WWW-1 | 325729 | 1469678 | Hyperlinks in nd.edu (25).
WWW-2 | 277114 | 927400 | Hyperlinks related to literary studies of Shakespeare (26).
WWW-3 | 47870 | 235441 | Hyperlinks related to tango (specifically the music of Piazzolla) (26).
SOCIAL-1 | 67 | 182 | Inmates in prison choose "What fellows on the tier are you closest friends with?" (27).
SOCIAL-2 | 28 | 110 | Friendship network among sociology freshmen after 6 weeks of acquaintance (28).
SOCIAL-3 | 32 | 96 | College students in a course about leadership choose which three members they wanted to have in a committee (29).
ENGLISH | 7724 | 46281 | Word adjacency network of text from Darwin "The Origin of Species".
FRENCH | 9424 | 24295 | Word adjacency network of text from Jules Verne "De la Terre a la Lune".
SPANISH | 12642 | 45129 | Word adjacency network of text from Cervantes "Don Quixote".
JAPANESE | 3177 | 8300 | Word adjacency network of text from Murasaki Shikibu "The Tale of Genji".
BI-PARTITE | 1010 | 1261 | Bi-partite model with two groups of nodes of sizes N1=1000 and N2=10 with probability of a directed or mutual edge between nodes of different groups being p=0.06 and q=0.003 respectively, and no edges between nodes within the same group.

Non-directed Network | Nodes | Edges | Description
POWERGRID | 4941 | 6594 | Stations in the electrical power grid of the western United States (30).
GEO-MODEL-PG | 5000 | 7499 | A geometric model with number of nodes, edges and clustering coefficient similar to that of the POWERGRID network. Nodes are aligned on a line. Points which are closer than a distance of R=5 on the lattice are linked by an edge with probability p=0.3. Points at a distance greater than R are unlinked.
PROTEIN-STRUCTURE-1 / -2 / -3 | 95 / 53 / 99 | 213 / 123 / 212 | Nodes are secondary structure elements (alpha-helices and beta-strands) and two nodes are connected if their distance is smaller than 10Å. The proteins are (PDB id): 1A4J an immunoglobulin, 1EAW a serine protease inhibitor and 1AOR an oxidoreductase. Structure based on PDB database (http://www.rcsb.org/pdb/).
GEO-MODEL-PS | 53 | 136 | A geometric model with number of nodes, edges and clustering coefficient similar to that of PROTEIN-STRUCTURE-2. Nodes are aligned on a line. Points which are closer than a distance of R=5 on the lattice are linked by an edge with probability p=0.46. Points at a distance greater than R are unlinked.
AUTONOMOUS SYSTEM-1 / -2 / -3 / -4 / -5 / -6 | 3015 / 3522 / 4517 / 5357 / 7956 / 10515 | 5156 / 6324 / 8376 / 10328 / 15943 / 21455 | The internet at the Autonomous System (31) level, from the COSIN database http://www.cosin.org. The networks correspond to sampling of the internet on the following dates: 8/11/1997, 2/4/1998, 14/1/1999, 2/7/1999, 2/7/2000, 16/3/2001.
BA m=1 N=1000 / BA m=1 N=3000 / BA m=10 N=1000 / BA m=10 N=3000 | 1000 / 3000 / 1000 / 3000 | 1000 / 3000 / 9901 / 29901 | Networks evolved under the preferential attachment model of Barabasi and Albert (BA) (25): a non-directed network is grown node by node, connecting each new node to m existing ones. The probability of connecting to an existing node i increases with the number of edges it already has.

Table S2

Subgraph | Name | Transitive triplets | Intransitive triplets
1 | V-out | 0 | 0
2 | V-in | 0 | 0
3 | 3-Chain | 0 | 1
4 | Mutual in | 0 | 1
5 | Mutual out | 0 | 1
6 | Mutual V | 0 | 2
7 | FFL | 1 | 0
8 | 3-Loop | 0 | 3
9 | Regulated mutual | 2 | 0
10 | Regulating mutual | 2 | 0
11 | Mutual and 3-Chain | 1 | 2
12 | Semi clique | 3 | 1
13 | Clique | 6 | 0
14 | Null | 0 | 0
15 | Directed edge | 0 | 0
16 | Mutual edge | 0 | 0

Table S1: Networks and databases. Shown are the number of nodes and edges, and a

short description for each of the networks used in this study.

Table S2: Transitivity (number of transitive triplets) and intransitivity (number of

intransitive triplets) in the thirteen connected and three non-connected triads.

References:

1. S. Maslov, K. Sneppen, Science 296, 910-3 (May 3, 2002).
2. R. Milo et al., Science 298, 824-7 (Oct 25, 2002).
3. Cartwright, Harary, Psychological Review 63, 277-293 (1956).
4. S. Wasserman, K. Faust, Social Network Analysis (Cambridge University Press, 1994).
5. F. Harary, J. Kommel, Journal of Mathematical Sociology 6, 199-210 (1979).
6. P. Holland, S. Leinhardt, D. Heise, Ed. (Jossey-Bass, San Francisco, 1975) pp. 1-45.
7. J. Berg, M. Lassig, cond-mat/0308251 (2003).
8. R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification (John Wiley & Sons, 1973).
9. M. Blatt, S. Wiseman, E. Domany, Physical Review Letters 76, 3251-3254 (Apr 29, 1996).
10. S. Itzkovitz, R. Milo, N. Kashtan, G. Ziv, U. Alon, Phys Rev E Stat Nonlin Soft Matter Phys 68, 026127 (Aug, 2003).
11. S. Mangan, U. Alon, Proc Natl Acad Sci U S A 100, 11980-5 (Oct 14, 2003).
12. S. Mangan, A. Zaslaver, U. Alon, Journal of Molecular Biology Vol 334, 197-204 (2003).
13. Y. Setty, A. E. Mayo, M. G. Surette, U. Alon, Proc Natl Acad Sci U S A 100, 7702-7 (Jun 24, 2003).
14. R. Thomas, J Theor Biol 73, 631-56 (Aug 21, 1978).
15. R. Thomas, D. Thieffry, M. Kaufman, Bull Math Biol 57, 247-76 (Mar, 1995).
16. A. Becskei, L. Serrano, Nature 405, 590-3 (Jun 1, 2000).
17. N. Rosenfeld, M. B. Elowitz, U. Alon, J Mol Biol 323, 785-93 (Nov 8, 2002).
18. H. Salgado et al., Nucleic Acids Res 29, 72-4 (2001).
19. S. Shen-Orr, R. Milo, S. Mangan, U. Alon, Nat Genet 31, 64-8 (2002).
20. T. Ishii, K. Yoshida, G. Terai, Y. Fujita, K. Nakai, Nucleic Acids Res 29, 278-80 (Jan 1, 2001).
21. M. C. Costanzo et al., Nucleic Acids Res 29, 75-9 (2001).
22. T. I. Lee et al., Science 298, 799-804 (Oct 25, 2002).
23. E. H. Davidson et al., Science 295, 1669-78 (Mar 1, 2002).
24. J. White, E. Southgate, J. Thomson, S. Brenner, Phil. Trans. Roy. Soc. London Ser. B 314 (1986).
25. A. L. Barabasi, R. Albert, Science 286, 509-12 (1999).
26. J. P. Eckmann, E. Moses, Proc Natl Acad Sci U S A 99, 5825-9 (Apr 30, 2002).
27. Macrae, Sociometry 23, 360-371 (1960).
28. M. A. J. Van Duijn, M. Huisman, F. N. Stokman, F. W. Wasseur, E. P. H. Zeggelink, Journal of Mathematical Sociology 27 (2003).
29. L. D. Zeleny, Sociometry 13, 314-328 (1950).
30. D. J. Watts, S. H. Strogatz, Nature 393, 440-2 (Jun 4, 1998).
31. L. Peterson, B. Davie, Computer networks: a systems approach (Morgan Kaufmann, ed. 2nd edition, 1999).
