an algorithm and metric for network decomposition from - mit

13
Available online at www.sciencedirect.com Social Networks 30 (2008) 146–158 An algorithm and metric for network decomposition from similarity matrices: Application to positional analysis Mo-Han Hsieh a,, Christopher L. Magee b a Engineering Systems Division, Massachusetts Institute of Technology, 77 Massachusetts Avenue, NE20-392 Cambridge, MA 02139-4307, USA b Engineering Systems Division and Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA Abstract We present an algorithm for decomposing a social network into an optimal number of structurally equivalent classes. The k-means method is used to determine the best decomposition of the social network for various numbers of subgroups. The best number of subgroups into which to decompose a network is determined by minimizing the intra-cluster variance of similarity subject to the constraint that the improvement in going to more subgroups is better than a random network would achieve. We also describe a decomposability metric that assesses how closely the derived decomposition approaches an ideal network having only structurally equivalent classes. Three well-known network data sets were used to demonstrate the algorithm and decomposability metric. These demonstrations indicate the utility of the approach and suggest how it can be used in a complementary way to Generalized Blockmodeling. © 2007 Elsevier B.V. All rights reserved. Keywords: Positional analysis; Structural equivalence; Decomposability; k-Means method; Generalized Blockmodeling 1. Introduction In the network analysis literature, two lines of research have been pursued to develop methods of decomposing networks into meaningful subgroups (Wasserman and Faust, 1994). These are (1) research that seeks to identify cohesive subgroups (Frank, 1995); and (2) research that seeks equivalence classes in a net- work (Breiger et al., 1975; Lorrain and White, 1971). While numerous methods have been proposed to conceptualize the idea of cohesive subgroups (including the algorithm recently proposed by Newman and Girvan (2004)), the recent efforts in social networks research have been on developing methods that identify equivalence classes. The initial concept of the equivalence class was proposed by Lorrain and White (1971) in the form of structural equivalence. By conceiving nodes in a network as equivalence classes or “positions” that relate in a similar way to other positions, a net- work can be transformed into a simplified model where nodes are combined into positions and the relations between nodes Corresponding author. Tel.: +1 617 577 5843; fax: +1 617 258 0485. E-mail address: [email protected] (M.-H. Hsieh). become relations between positions. For example, if two nodes link to and are linked by exactly the same set of other nodes, they are structurally equivalent to each other. Many definitions of equivalence have been proposed, see (Wasserman and Faust, 1994) for further discussion. Among the methods that identify structural equivalence classes, Batagelj et al. (1992b) proposed to divide them into direct and indirect methods. While the direct method involves optimizing a pre-specified block model with the network data, an indirect method typically composes two major parts: (1) a defini- tion of dissimilarity for the selected type of equivalence (e.g. the corrected Euclidean-like dissimilarity (Burt and Minor, 1983)) and (2) an algorithm that produces good clustering solutions (e.g. hierarchical clustering). The indirect method is indirect in the sense that the relational information among vertices is first used to create a partition, and the partition is then evalu- ated with an explicit criterion function (Batagelj et al., 1992b). The evaluation of the partition with a criterion function is not imperative for the indirect method; one example of the criterion function is the specified goodness-of-fit measure proposed by Batagelj et al. (1992a) originally designed for the use of their optimization approach in finding equivalence classes. While most of these indirect methods generate dissimilarity measures 0378-8733/$ – see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.socnet.2007.11.002

Upload: others

Post on 09-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

A

udmd

K

1

bm(1wnipsi

LB“wa

0d

Available online at www.sciencedirect.com

Social Networks 30 (2008) 146–158

An algorithm and metric for network decomposition from similaritymatrices: Application to positional analysis

Mo-Han Hsieh a,∗, Christopher L. Magee b

a Engineering Systems Division, Massachusetts Institute of Technology, 77 Massachusetts Avenue,NE20-392 Cambridge, MA 02139-4307, USA

b Engineering Systems Division and Department of Mechanical Engineering, Massachusetts Institute of Technology,Cambridge, MA 02139, USA

bstract

We present an algorithm for decomposing a social network into an optimal number of structurally equivalent classes. The k-means method issed to determine the best decomposition of the social network for various numbers of subgroups. The best number of subgroups into which toecompose a network is determined by minimizing the intra-cluster variance of similarity subject to the constraint that the improvement in going toore subgroups is better than a random network would achieve. We also describe a decomposability metric that assesses how closely the derived

ecomposition approaches an ideal network having only structurally equivalent classes.Three well-known network data sets were used to demonstrate the algorithm and decomposability metric. These demonstrations indicate the

tility of the approach and suggest how it can be used in a complementary way to Generalized Blockmodeling.2007 Elsevier B.V. All rights reserved.

ans m

blto1

cdoitca(i

eywords: Positional analysis; Structural equivalence; Decomposability; k-Me

. Introduction

In the network analysis literature, two lines of research haveeen pursued to develop methods of decomposing networks intoeaningful subgroups (Wasserman and Faust, 1994). These are

1) research that seeks to identify cohesive subgroups (Frank,995); and (2) research that seeks equivalence classes in a net-ork (Breiger et al., 1975; Lorrain and White, 1971). Whileumerous methods have been proposed to conceptualize thedea of cohesive subgroups (including the algorithm recentlyroposed by Newman and Girvan (2004)), the recent efforts inocial networks research have been on developing methods thatdentify equivalence classes.

The initial concept of the equivalence class was proposed byorrain and White (1971) in the form of structural equivalence.y conceiving nodes in a network as equivalence classes or

positions” that relate in a similar way to other positions, a net-ork can be transformed into a simplified model where nodes

re combined into positions and the relations between nodes

∗ Corresponding author. Tel.: +1 617 577 5843; fax: +1 617 258 0485.E-mail address: [email protected] (M.-H. Hsieh).

fiaTifBom

378-8733/$ – see front matter © 2007 Elsevier B.V. All rights reserved.oi:10.1016/j.socnet.2007.11.002

ethod; Generalized Blockmodeling

ecome relations between positions. For example, if two nodesink to and are linked by exactly the same set of other nodes,hey are structurally equivalent to each other. Many definitionsf equivalence have been proposed, see (Wasserman and Faust,994) for further discussion.

Among the methods that identify structural equivalencelasses, Batagelj et al. (1992b) proposed to divide them intoirect and indirect methods. While the direct method involvesptimizing a pre-specified block model with the network data, anndirect method typically composes two major parts: (1) a defini-ion of dissimilarity for the selected type of equivalence (e.g. theorrected Euclidean-like dissimilarity (Burt and Minor, 1983))nd (2) an algorithm that produces good clustering solutionse.g. hierarchical clustering). The indirect method is indirectn the sense that the relational information among vertices isrst used to create a partition, and the partition is then evalu-ted with an explicit criterion function (Batagelj et al., 1992b).he evaluation of the partition with a criterion function is not

mperative for the indirect method; one example of the criterion

unction is the specified goodness-of-fit measure proposed byatagelj et al. (1992a) originally designed for the use of theirptimization approach in finding equivalence classes. Whileost of these indirect methods generate dissimilarity measures

cial N

tss

cmcRvot

leudhtsdiGneitpbic

ua1fiwflfcants

iitndbc

tt

tabwddaoi

eanidiiS

2c

tnbrgdOiwcie

eMcvtLaatftdt

M.-H. Hsieh, C.L. Magee / So

hat are compatible1 with structural equivalence, the decompo-itions based on these dissimilarity measures are generally notatisfying.

The often used method, CONCOR (Breiger et al., 1975), isonsidered as having the aspects of both the indirect and directethod (Batagelj et al., 1992b). However, the CONCOR pro-

edure always splits a set of vertices into exactly two subsets.epeated application of CONCOR results in a series of subdi-ided bi-partitions of the original network. Thus, the partitionutcome is at least partially determined by the procedure, not byhe actual structure of the network (Schwartz, 1977).

The most recently developed approach in identifying equiva-ence classes is Generalized Blockmodeling (GBM) (Doreiant al., 2005). The method considers ideal blockmodels andses optimization methods to fit them to empirical data. Thisirect method allows for use of context information in formingypotheses and gives a criterion function (i.e. inconsistencies)hat measures the fit of a specified blockmodel or decompositiontructure to the actual data. GBM has been shown to give “better”ecompositions of social network data based upon comparingnconsistencies (Batagelj et al., 1992b; Doreian et al., 2005).BM finds better decompositions by a clear procedure, but asoted in (Doreian et al., 2005) hypotheses with a greater vari-ty of block types can always be found to lower the number ofnconsistencies towards zero. In the case of using BGM to findhe structural equivalence partition of a network, though it isossible to identify the most appropriate number of subgroupsy observing the jump of inconsistencies, to some extent it stillnvolves subjective judgment and thus lacks a fully objectiveriterion for stopping decomposition.

Another approach to decomposition of networks is basedpon network models (Fienberg and Wasserman, 1981; Snijdersnd Nowicki, 1997; Tallberg, 2005; Wasserman and Anderson,987). Recently, the development of stochastic models in theeld of cluster analysis has lead to its application to social net-orks (Handcock et al., 2007; Hoff et al., 2002). The attractive

eatures of using these approaches to find structural equiva-ence classes include, for example, statistical inferences withull models and statistical criteria for determining the number oflasses. However, the potential disadvantages of this approachre the difficulty of model selection and the potentially largeumber of parameters to be estimated. Both of these disadvan-ages make theoretical interpretation of positions and blocks forocial networks problematic.

The recent survey by Schaeffer (2007) indicates that select-ng the best number of clusters (and other “parameter selection”ssues) is one of the major open problems of graph clustering. Inhis paper, we propose a new indirect method for partitioning aetwork into structural equivalence classes and for this domain

evelop a method that also addresses the issue of clustering num-er selection. Overall, the method consists of (1) an unsupervisedlustering method, in which vertices are assigned to clusters

1 A dissimilarity measure is compatible to structural equivalence if it satisfieshe condition that the dissimilarity of a pair of nodes is zero if and only if thewo nodes are structurally equivalent (Doreian et al., 2005, p. 181).

wtcgui

d

etworks 30 (2008) 146–158 147

o minimize the intra-cluster variance of dissimilarity; (2) anpproach that takes into consideration not only the dissimilarityetween the pair of vertices but also the pair’s dissimilaritiesith all other vertices; (3) a quantitative stopping criteria foretermining the number of subgroups that a network should beivided into to better represent its underlying structural equiv-lence structure. The method is seen as a companion to GBMffering additional insight in certain kinds of studies (wherenductive learning is useful) and having a similar limitation.

The paper presents the new method for finding structuralquivalence classes and its application to ideal structural equiv-lence networks in Section 2. In Section 3, we develop aormalized decomposability metric for assessing how close non-deal networks are to the ideal networks found by our (or any)ecomposition methodology. Application of our method includ-ng the decomposability metric to three known social networkss presented in Section 4. Brief concluding remarks are given inection 5.

. A new method for finding structural equivalencelasses

The method starts with any dissimilarity measure of verticeshat is compatible with structural equivalence. For an n-nodeetwork, the dissimilarity measures can be arranged in an ny n matrix, whose entries give the dissimilarity between theow vertices i and the column vertices j. Hierarchical clusteringenerates the hierarchy of vertices by using these measures andifferent definitions of dissimilarity between the new clusters.ur method treats the n by n dissimilarity matrix as n data points

n the n-dimensional space that we wish to partition. That is,e read row i of the dissimilarity matrix as the n-dimensional

oordinates of the ith data point. Since the dissimilarity matrixs symmetric, the coordinates can also be read as the columnlements.

With n data points in the n-dimensional space, we then repeat-dly apply the k-means method (Hartigan and Wong, 1978;acQueen, 1967) to partition the n data points into k = 2 to k = n

lusters. Information about the k-means method and its manyariations can be found in (Kaufman and Rousseeuw, 2005). Inhis study, Lloyd’s k-means algorithm (Lloyd, 1982) was used.loyd’s algorithm begins with a set of k reference points whichre randomly selected from the data set. All of the data pointsre partitioned into k clusters by assigning each point to the clus-er of its closest reference point. In each iteration, the centroidor each cluster is calculated. A partition is then made usinghe newly calculated centroids as reference points for all of theata points. It has been proven (Bottou and Bengio, 1995) thathe iterative process will eventually converge to a configurationhere each data point is closer to the reference point of its cluster

han to any other reference point and each reference point is theentroid of its cluster. Since different initial reference points canenerate different partitions, multiple sets of initial points are

sed to evaluate whether the obtained partition has approachedts minimum sum of intra-cluster distances.

For each round of the k-means method that partitions the nata points into k clusters, we have the sum of the within cluster

1 cial N

p

D

wm

gcwzhn(azmsleoltfa

sBcwtnbd

igsltoos

F

wtr

carnr

i

tkblc

cndtpt

N

wdder

oecenetotsodwl

fcdn

shc

48 M.-H. Hsieh, C.L. Magee / So

oints-to-centroid distances as

k =∑k

i=1

∑j ∈ Si

∥∥xj−ci∥∥2

(1)

here Si (i = 1,2,. . .,k) is the cluster and ci is the centroid orean point of all of the data points xj in cluster Si.In the process of decomposing the network into more sub-

roups (i.e. as k increases), Dk gradually decreases as moreentroids are generated. A smaller Dk is desirable because weant a partition that has a smaller intra-cluster variance. Dk is

ero when all of the equivalence classes (including singletons)ave been identified by at least one centroid. We define idealetworks as those having only structural equivalence classesi.e. zero discrepancies for GBM), and for such networks anlgorithm that stops further partitioning the network when Dk isero would be appropriate. However, for most real networks, theonotonically decreasing Dk goes to zero only after numerous

ingletons have been individually identified as unique equiva-ence classes. In the case of k = n, Dk is always zero becausevery node is identified as itself an equivalence class. The resultf identifying a great number of singletons is relatively meaning-ess since it does not inform us about the underlying structure ofhe network. To avoid generating an excessive number of classesor real networks, a quantitative criterion must be designed toppropriately stop further decomposition of the network.

For any assigned number of subgroups, the k-means methodeeks to minimize Dk with the same number of centroids.ecause nodes of the same equivalence class have the sameoordinates, a lower Dk can be obtained by first grouping themith centroids. Therefore, if a network has equivalence classes

hat have more than one node, Dk decreases significantly withewly added centroids until every such equivalence class haseen identified by at least one centroid. The decrease of Dk slowsown with larger k when singletons start to appear as classes.

These singletons, with their unique linkage patterns, are sim-lar to randomly wired nodes in a network. We found that theradual decrease of Dk during the generation of singletons isimilar to that of the random networks with the same size andinkage density.2 We stop further dividing a network into addi-ional subgroups if the decrease of Dk (form k to k + 1) is smallerr equal to the average decrease of Dk obtained from a samplef these random networks.3 We thus define a fitness index asimply:

k = Drandomk − Dreal

k (2)

here Drandomk is the sum of intra-cluster point-to-centroid dis-

ances obtained by averaging over the results of a sample ofandom networks and Dreal

k is that of the real network. We find

2 We use the Erdos-Renyi model to generate random networks. The modelonsiders all pairs of nodes in a graph and puts an edge between the nodes withfixed probability (which in our case equals to the linkage density). Since a

andom network with larger size and density has larger step decrease of Dk, theetwork’s decrease of Dk (from k to k + 1) should be compared with that of theandom networks with the same size and density.3 These random networks can be viewed as null models for cluster validation

n the sense discussed by Gordon (1999).

vwo

teiwnmif

etworks 30 (2008) 146–158

he maximum of Fk as a function of k, and the correspondingrepresents the most appropriate subdivision of the network

ecause further subdivision is only reducing Drealk at random (or

ess than random) rates. The nodes belonging to the k differentlusters then form the equivalence classes of the network.

To obtain an appropriate estimate of Drandomk in Eq. (2), a

ertain number of random networks have to be sampled. Theumber of random networks sampled is determined by the stan-ard deviation of Drandom

k relative to the decrease of Drealk . Since

he simulation indicate that Drandomk is normally distributed, our

rocedure is to sample 30 random networks and then determinehe appropriate sample size, N, according to

≥[

zα/2sk

�Fk/2

]2

(3)

here z is the ordinate on the normal curve corresponding to theesired probability α (.05 in our case), sk the sampled standardeviation of Drandom

k , and �Fk is the difference between Fk andither Fk−1 or Fk+1. We iteratively increase sample size until theepeatedly recalculated N satisfies Eq. (3).

In theory, our method should work for ideal networks havingnly structural equivalence classes because nodes of the samequivalence class cause a larger decrease of the sum of intra-luster point-to-centroid distances than nodes that belong to noquivalence class (i.e. nodes of random networks). It should beoted that, if an ideal network has a singular node as a structuralquivalence class, it is possible that our algorithm will not iden-ify this node as an equivalence class. This is due to the difficultyf differentiating the linkage pattern of a meaningful node fromhat of a randomly placed node. In this case, our fitness index ashown in Eq. (2) can fail to indicate the most appropriate numberf structural equivalence classes and thus the most appropriateecomposition. Nevertheless, our method works for ideal net-orks with all of their structural equivalence classes having at

east two nodes.Our method works in practice as we have tried the algorithm

or a variety of ideal networks, and the algorithm identifies theorrect subgroups for all of them. However, there are easy andifficult cases of using the fitness index to identify the rightumber of classes that the network has.

The difficult cases are the networks whose decrease of theum of intra-cluster point-to-centroid distances is only slightlyigher than that of a random network. Fig. 1 shows two sets ofomparison between these difficult and easy cases. Each fitnessalue in the figure is normalized between zero and one so thate can compare their relative easiness of identifying the peakf fitness index.

Fig. 1(a) shows the fitness index for two ideal networks withhe same minimum equivalence class size (i.e. C = 5) but differ-nt network size (i.e. n = 25 and 100). As shown in the figure,t is easier to identify the peak of fitness index for the networkith smaller size. Fig. 1(b) shows the fitness index for two ideal

etworks with the same network size (i.e. n = 50) but differentinimum equivalence class size (i.e. C = 2 and 10). As shown

n the figure, identifying the peak of fitness index is now easieror the network with larger minimum equivalence class size.

M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158 149

F classi nt mi

lfclitt

3

ienItc

aawcfstwoacd

ig. 1. (a) Fitness index for ideal networks with the same minimum equivalentndex for two ideal networks with the same network size (i.e. n = 50) but differe

In general, the ideal networks having only small equiva-ence classes and larger network sizes are the difficult casesor use of the fitness index to identify the right number oflasses. Nonetheless, the method identifies the correct equiva-ence classes of these ideal networks. The more important issues how to assess its value in the less-than-ideal networks that areypically observed for social networks. The next two sections ofhe paper address this topic.

. Measuring the decomposability of a network

By applying our class finding algorithm, networks are dividednto subgroups that correspond to their underlying structuralquivalence structures. However, we want to differentiate among

etworks whose subgroups are not all ideal equivalence classes.n this case, we define perfect decomposability of a network ashat achieved when a network is composed of only equivalencelasses.

wct(

size (i.e. C = 5) but different network size (i.e. n = 25 and 100) and (b) fitnessnimum equivalent class size (i.e. C = 2 and 10).

Having a normalized objective measurement of decompos-bility is useful. For example, we can compare two networksnd determine which network is more similar to an ideal net-ork having only equivalence classes. Lower decomposability

an be used to infer that the suggested decomposition is moreorced and thus should be cautiously utilized in further analy-is. Moreover, if other variables (or time series data) are known,he change of the decomposability metric with the variables (orith time) affecting the network can be found. This can allowne to find how various variables influence the structural roles ingiven network or a variety of different networks. To be able toompare the decomposability of networks of different size andensity, it is necessary to normalize the metric for these effects.

To determine the normalized decomposability of a network,

e construct a metric that places networks with only equivalence

lasses at one end and those without any equivalence class athe other. We use the sum of intra-cluster distance, Dk, of Eq.1), to quantify the similarity between a real network and an

150 M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158

Table 1Dmax for network with different sizes and number of clusters

Number of clusters (k) Size of network (n)

6 7 8 9 10 11 12 13 14 15

2 21.3 32.2 44.6 60.3 76.7 96.4 120 146 168 196345

icivct

tntd

Q

wDohsiCttmdatoa

aetnaDt

2f7fin2f8

m

Q

a

Q

3sd

Q

wwnw

ctiiiphfiics

dewTrlsand 60 were sampled. Furthermore, we sample ideal networkswith the assumption that the number of classes for each net-work is normally distributed and the size of each class within a

14.3 23.2 33.5 46.88.39 16.4 26.5 36.84.51 10.5 17.8 27.6

deal network. For an ideal network having only equivalencelasses, its sum of intra-cluster distance, Dideal, equals zero. Thiss because every member of the same equivalence class, wheniewed as a node in the multidimensional space, has the sameoordinates. Therefore, their intra-cluster distances are zeros andhe sum of these distances, Dideal, is zero.

In addition to the value of Dideal, we want the upper bound ofhe sum of intra-cluster distance, Dmax(n,k), for networks having

nodes and k clusters. With the lower bound, Dideal = 0, andhe upper bound, Dmax(n,k), we can thus obtain the normalizedecomposability metric, Q, for the network as

= 1 − Dk − Dideal

Dmax(n,k) − Dideal= 1 − Dk

Dmax(n,k)(4)

hich defines Q as 1 for perfect decomposability and 0 fork = Dmax(n,k) which is equivalent to no decomposability. Tobtain the upper bound, Dmax(n,k), we are seeking a network thatas the maximum possible value of Dk while having the sameize and is divided into the same number of clusters as that of thedeal network. To obtain the upper bound of Dk, various Montearlo methods can be applied to obtain an approximate solu-

ion for the network with size, n, and number of clusters, k. Inhis study, we used a genetic algorithm (GA) as the optimization

ethod to search for Dmax(n,k). To apply the GA, the solutionomain was represented by rearranging the adjacency matrix ofnetwork into an array of bits, the fitness function was Dk, and

he two-point crossover was used to generate a new generationf solutions. For more information and implementation detailsbout GA, see (Mitchell, 1996).

By using the corrected Euclidean-like dissimilarity (Burtnd Minor, 1983) as the dissimilarity measure for structuralquivalence. Table 1 shows some examples of Dmax(n,k) (withhree significant figures) for network with different sizes andumber of classes. In Table 1, the maximum possible Dk for9-node network, for example, divided into three classes ismax(n,k) = Dmax(9,3) = 46.8. With this information, we consider

hree 9-node networks as shown in Fig. 2.Note that the only difference between network 1 and network

is the directed link from node 7 to node 1. network 3 differsrom network 2 by its additional links from node 2 to nodeand from node 4 to node 8. The result of applying our class

nding algorithm to network 1 and network 2 shows that the two

etworks are divided into the same k = 3 subgroups (i.e. node 1,, and 3, node 4, 5, and 6, and node 7, 8, and 9). Moreover,ollowing Eq. (4), with k = 3, we have Dk = D3 for network 1 as.71 and for network 2 as 11.2. Therefore, the decomposability

n

a

61.7 77.4 97.2 121 144 16849.5 65.1 82.8 102 125 14641.4 54.6 68.8 86.9 106 131

etric for network 1 is

1 = 1 − D3

Dmax(9,3)= 1 − 8.71

46.8= 0.81

nd the decomposability metric for network 2 is

2 = 1 − 11.2

46.8= 0.76

Similarly, our class finding algorithm tells us that networkshould be divided into still the same k = 3 classes. With its

um of intra-cluster distance, D3, equal to 16.2, we obtain itsecomposability metric as

3 = 1 − 16.2

46.8= 0.65

With network 1 having the highest decomposability and net-ork 3 having the lowest, the decomposability metric confirmshat visual inspection tells us; network 2 is closer to the idealetwork than is network 3 but is further from ideal than is net-ork 1.It should be noted that, the decomposability metric can be

alculated only after we know the number of subgroups thathe network should be divided into. Since the decomposabil-ty metric monotonically increases as the number of subgroupsncreases (and equals to unity as every node of the network istself a subgroup), it cannot be used to determine the appro-riate number of subgroups in a network. To do so, we stillave to use the fitness index as introduced in Eq. (2). Thetness index compares the decrease of Dreal

k resulting fromncreases in the number of subgroups to that of Drandam

k and thusan be used for determining the most appropriate number ofubgroups.

Since the decomposability can be viewed as a measure ofeviation of real networks from ideal networks that contain onlyquivalence classes, we explored the relationship between a net-ork’s decomposability and its deviation from an ideal network.o do this, we examine the decomposability of 10,000 pseudoeal networks generated from randomly perturbing4 all possibleinkages of ideal networks (i.e. adding or removing links) withix different percentages. Ideal networks with sizes between 30

etwork is also normally distributed. Since real networks typi-

4 The perturbation can be viewed as arising from an error in observation orrising because real social relationships are more complex than the ideal.

M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158 151

3 and

cd

ptnhcdal

vo4nqpa

4

iaaBiamr

Tmt

Fig. 2. Networks 1, 2, and

ally have very low density, we only sample ideal networks withensity lower than 0.2.

Fig. 3 shows the average decomposability metric of theseudo real networks (with one standard deviation also plot-ed) versus their percentage of linkage perturbation from idealetworks. Although, the linear relationship between the twoas an R-square value of 0.99, the results also show somelear non-linearity. However, if we limit the applicability of theecomposability metric to networks with decomposability of 0.4nd higher, the linear equation gives a reasonable estimate of theinkage perturbation.

With this result, we can calculate the deviation of our pre-ious three networks. Referring to Fig. 3, the decomposabilityf networks 1, 2, and 3 (calculated above) are equivalent to.8, 6.1, and 8.9% linkage perturbation of their underlying ideal

etwork. We note that the lower the decomposability the moreuestionable the ideal network that we associate with the decom-osition is. Other networks with slightly more deviations mightlso describe the network in these cases.

skop

their adjacency matrixes.

. Application of the method

In the previous sections, we propose a method for cluster-ng nodes of networks into structural equivalence classes anddecomposability metric to quantify a network’s level of link-

ge perturbation from a hypothetical underlying ideal network.ecause our method can identify the number of classes for any

deal network having only structural equivalence classes (witht least two nodes in each class), in this section we test ourethod and the decomposability metric with three examples of

eal networks.The social network of the 15 office workers reported by

hurman (1979) is used as the first example to evaluate ourethod. The network is shown in Fig. 4. By applying our method

o partition the social network into k = 2 to k = 15 subgroups, the

um of intra-cluster point-to-centroid distances as a function ofis obtained and is shown in Fig. 5(a) as dark gray bars. To

btain the fitness index, we need the comparable sum of a sam-le of random networks that have the same size and density as

152 M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158

ge pe

tdbsF

kdFb

Tid

waahtatshi

Fig. 3. Decomposability versus linka

he social network. This sum of intra-cluster point-to-centroidistances as a function of k is shown in Fig. 5(a) as light grayars. The fitness index generated by subtracting the one of theocial network from that of the random networks is shown inig. 5(b).

As shown in Fig. 5(b), the fitness index has its maximum at= 6, which by our method indicates that the most appropriateecomposition of the network is into six equivalence classes.ig. 6 shows these six classes and the block model as revealedy using our method.

As shown in Fig. 6, the first class includes Amy, Katy, andina, and the second class includes Ann, Pete, and Lisa. There

s strong interaction within and between the two classes. Whatifferentiates them is that the second class has strong interaction

gi(

Fig. 4. The social network of t

rturbation of various ideal networks.

ith the President. According to Thurman (1979), Pete is char-cterized as the center of a social circle that included Lisa, Katynd Amy. Ann arrived under the sponsorship of Pete, and Lisaas the ear of the President (Thurman, 1979). It is worth noticinghat the fourth class comprises only Emma, who has strong inter-ction with the President, the members of the second class, andhe members of the fifth class. According to Thurman (1979),he plays a special role in the social network and thus identifyinger as a unique class seems appropriate considering the contextnformation.

With the network size equal to 15 and the number of sub-roups equal to six, we have the upper bound of the sum ofntra-cluster distances, Dmax(15,6), equal to 113.07. By using Eq.4), the decomposability metrics for the social network is 0.65.

he Thurman office data.

M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158 153

Fig. 5. (a) The sum of intra-cluster point-to-centroid distances of Thurman’s office social network (dark gray bars) and that of the random networks with the samesize and density (light gray bars) and (b) the fitness index of Thurman’s office social network as a function of the number of subgroups.

Fig. 6. Class members, block density, and image graph of the Thurman social network found by the algorithm presented in this paper.

154 M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158

Table 2Kansas SAR inter-organizational network

A B C D E F G H I J K L M N O P Q R S T

Osage County Sheriff’s Department A 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 0 0 1Osage County Civil Defense Office B 1 0 1 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 0 0Osage County Coroner’s Office C 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0Osage County Attorney’s Office D 1 1 1 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0Kansas State Highway Patrol E 1 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1Kansas State Parks and Resources Authority F 1 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0Kansas State Game and Fish Commission G 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0Kansas State Department of Transportation H 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0U.S. Army Corps of Engineers I 1 0 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 0 0U.S. Army Reserve J 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0Crable Ambulance K 1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0Franklin County Ambulance L 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0Lee’s Summit Underwater Rescue Team M 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0Shawney County Underwater Rescue Team N 1 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1Burlingame Police Department O 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1Lyndon Police Department P 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0American Red Cross Q 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0Topeka Fire Department Rescue #1 R 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0C 1T 0

MiFptbi

csSn

C

12345

spcct

12345

ttcoi

iem

eKrfotsedp

12345. {H, J, L, M, R, S, T}.

This partition breaks the third class of our four-class par-tition into two classes as {B, N, O} and {D, K, P, Q} thus

arbondale Fire Department S 1 1 0 0 1opeka Radiator and Body Works T 1 0 0 0 1

oreover, by using the relationship between the decomposabil-ty and the linkage perturbation of the ideal network shown inig. 3, we can infer that the social network is about 8.9% linkageerturbation from the ideal network. This result indicates thathe inferred equivalence structure of the social network mighte substituted for easily with more observation or slight changesn interaction patterns.

The inter-organizational Search and Rescue (SAR) networkreated after a disaster in Kansas (Drabek, 1981) is used as theecond example to demonstrate the use of the new method. TheAR network has 20 organizations. The dichotomized commu-ication data among these organizations are shown in Table 2.

To present the basic structure of the network, Drabek usedONCOR to partition the network into five clusters as:

. Authority position: {A, E}.

. Primary support: {C, F, G, I, K}.

. Critical resources: {D, L, N}.

. Secondary support, 1: {M, O, P, Q, R, T}.

. Secondary support, 2: {B, H, J, S}.

While these five subgroups are potentially useful in under-tanding this network, Doreian et al. (2005) showed that thisartition has 79 inconsistencies when examined with their GBMriterion function for structural equivalence. They found a five-luster alternative that has only 57 inconsistencies (indicatinghe weakness of CONCOR discussed in Section 1 to this paper):

. Authority: {A, E}.

. Bodies and survivors: {C, F, G, I}.

. Infrastructure: {B, D, K, N, P, Q}.

. Primary rescue operators: {H, J, L, M, R, S, T}.

. Secondary rescue operators: {O}.ef

1 0 1 0 0 0 0 0 0 0 1 0 0 00 0 0 0 0 0 0 0 1 1 1 0 0 0

Applying the method presented in this paper to find the struc-ural equivalence classes of the SAR network, we again partitionhe network into k = 2 to k = 20 subgroups. The sum of intra-luster point-to-centroid distances of the SAR network and thatf the random network with the same size and density is shownn Fig. 7(a). The fitness index is shown in Fig. 7(b).

The fitness index shown in Fig. 7(b) has its maximum at k = 4,ndicating that the most appropriate decomposition is into fourquivalence classes. Fig. 8 shows these four classes and the blockodel as revealed by using our method.As shown in Fig. 8, our partition differs from that of Doreian

t al. (2005) only in that ours combines their two classes, {B, D,, N, P, Q} and {O}, into one class thus including “secondary

escue operators” with “infrastructure”. By using the criterionunction for structural equivalence proposed by Doreian et al.,ur partition has 64 inconsistencies, which is considerably betterhan the 79 for the five subgroups suggested by CONCOR buteven more than the five subgroup partition proposed by Dor-ian et al using their direct method. Since more subgroups willecrease the inconsistencies, we examine the five-class decom-osition of our method5:

. {A, E},

. {C, F, G, I},

. {B, N, O},

. {D, K, P, Q},

5 Our stopping algorithm indicates that four groups are appropriate but wexamine what the k-means method would yield if allowed for five groups onlyor comparative purposes.

M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158 155

F ork ano

dT5wac

ci

ig. 7. (a) The sum of intra-cluster point-to-centroid distances of the SAR netwf the SAR network.

ecomposing “infrastructure” but differently than Doreian et al.his decomposition has the same number of inconsistencies (i.e.

7) as that of the different five-class partition of Doreian et al.hen examined with their criterion function. Thus, our method

ppears more effective than CONCOR and relative to GBM isapable of finding interesting decompositions that are worthy of

gpd

Fig. 8. Class members, block density, and image graph of th

d the random networks with the same size and density and (b) the fitness index

onsideration along with various hypotheses arrived at by othernformation.

With the network size equal to 20 and the number of sub-roups equals to four for our first partition and five for the otherartition, we have the upper bound of the sum of intra-clusteristance, Dmax(20,4) = 298.51 and Dmax(20,5) = 271.56. With these

e SAR network found by the algorithm in this paper.

156 M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158

ncons

u0eDtdtbeulti

D

prta

sss

cc

Fig. 9. Political actor network with 32 i

pper bounds, our first partition has a decomposability metric of.42, which is greater than the five subgroup partition of Drabekt al. (i.e. 0.41) and slightly lower than that of the partition oforeian et al. (i.e. 0.44) and that of our five subgroup parti-

ion (i.e. 0.45). We feel it is more important to notice that theecomposability metric of 0.42 is about 15% perturbation fromhe ideal network. With this high percentage of linkage pertur-ation, we should be cautious when using any of the inferredquivalence structures of the SAR network. Conversely, we canse the low decomposability of the SAR network data and theack of clarity about structure derived from that data to support

he contention that communication structures were weak in thisnstance (Drabek, 1981).

Our third example is the political actor network reported byoreian and Albert (1989). In this network, the nodes are the

ceao

Fig. 10. Political actor network with 26 incon

istencies and decomposability of 0.42.

rominent political actors in a local community and the links rep-esent “strong political ally” among the actors. Fig. 9 shows thehree-class partition obtained by using CONCOR in the originalnalysis.

According to Doreian et al. (2005), this partition has 32 incon-istencies when examined with the GBM criterion function fortructural equivalence. They proposed a three-cluster alternativehown in Fig. 10 that has only 26 inconsistencies.

By applying our method to find the structural equivalencelasses of the network, maximization of the fitness index indi-ates that the network is best decomposed into four equivalence

lasses. The four-class partition is shown in Fig. 11. Whenxamined with the criterion function proposed by Doreian etl. (2005), it has 25 inconsistencies, which is one less than thatf the partition shown in Fig. 10.

sistencies and decomposability of 0.37.

M.-H. Hsieh, C.L. Magee / Social Networks 30 (2008) 146–158 157

incon

gipTs0sabtamtsa

5

adtecasde

BidtpI

cstfm(

amdcscncoa

A

e

R

B

B

B

B

Fig. 11. Political actor network with 25

Since reduced inconsistency is expected with more sub-roups, we also explore the three subgroup solution from thenductive method. In this case, our method suggests the sameartition as derived by CONCOR (i.e. the partition in Fig. 9).hough it has more inconsistencies than that of the partitionhown in Fig. 10, the former has a decomposability metric of.42 that is higher than 0.37 of the latter. This result clearlyhows that our four-class partition, with 25 inconsistencies anddecomposability metric of 0.49, has the best quality in terms ofoth the criterion function and the decomposability. However,he relatively low decomposability of this network indicates thatny of these interpretations is open to change if more or slightlyodified data was obtained about these networks. Alternatively,

he relatively low decomposability indicates that the structure isignificantly deviated from any ideal model and thus the politicalctor network is relatively weakly structured.

. Conclusion and discussion

The algorithm described in this paper appears to bringdditional theoretical utility to existing methodology forecomposing networks into structural equivalence classes. Theheoretical advantage is its ability to find all ideal structuralquivalence classes but yet has an objective stopping criterion forontinuing decomposition of non-ideal networks. The algorithmlso appears to bring additional practical utility to existing toolsuch as the Generalized Blockmodeling by suggesting differentecompositions of clear comparative merit to even well-studiedxamples as shown in Section 4.

When the algorithm is used in combination with Generalizedlockmodeling, one might obtain the advantages of combin-

ng inductive and deductive approaches. For example, with new

ata sets, one could start with finding the decompositions induc-ively (best and near best) and by in-context study of theseossibly arrive at a new hypothesis to test by various criteria.n general, applying both methods seems to be appropriate in all

B

sistencies and decomposability of 0.49.

ases because the results in Section 4 indicate they can deliverlightly different and yet interesting decompositions. In addi-ion, the examples show the potential merit of using our metricor decomposability. The metric provides an objective assess-ent of the normalized decomposability of various networks

and for various decompositions).The algorithm can be used in combination with the widely

pplied hierarchical clustering. For structural equivalence theethod described here can quickly suggest a more appropriate

ecomposition into a specific set of block models. This can beompared with the suggested hierarchy and provide additionaltructural information of interest. Interesting future researchould include (1) application of the algorithm in biological, eco-omic and engineering system classification problems and (2)omparison of the results of this algorithm with the one devel-ped by Newman and Girvan based upon cohesive subgroups inwide variety of network types.

cknowledgements

The authors are grateful for the helpful comments by theditor and the reviewers of this paper.

eferences

atagelj, V., Doreian, P., Ferligoj, A., 1992a. An optimizational approach toregular equivalence. Social Networks 14, 121–135.

atagelj, V., Ferligoj, A., Doreian, P., 1992b. Direct and indirect methods forstructural equivalence. Social Networks 14, 63–90.

ottou, L., Bengio, Y., 1995. Convergence properties of the k-means algo-rithm. In: Tesauro, G., Touretzky, D. (Eds.), Advances in Neural InformationProcessing Systems, vol. 7. MIT Press, Cambridge MA, pp. 585–592.

reiger, R.L., Boorman, S.A., Arabie, P., 1975. Algorithm for clustering rela-

tional data with applications to social network analysis and comparisonwith multidimensional-scaling. Journal of Mathematical Psychology 12,328–383.

urt, R.S., Minor, M.J., 1983. Applied Network Analysis: A MethodologicalIntroduction. Sage Publications, Beverly Hills.

1 cial N

D

D

D

F

FGH

H

H

K

L

L

M

M

N

SS

S

T

T

58 M.-H. Hsieh, C.L. Magee / So

oreian, P., Albert, L.H., 1989. Partitioning political actor networks: somequantitative tools for analyzing qualitative networks. Journal of QuantitativeAnthropology, 279–291.

oreian, P., Batagelj, V., Ferligoj, A., 2005. Generalized Blockmodeling. Cam-bridge University Press, Cambridge, UK.

rabek, T.E., 1981. Managing multiorganizational emergency responses: emer-gent search and rescue networks in natural disaster and remote area settings.Institute of Behavioral Science, University of Colorado, Boulder, Colorado.

ienberg, S.E., Wasserman, S., 1981. An expotential family of probability-distributions for directed-graphs—comment. Journal of the AmericanStatistical Association 76, 54–57.

rank, K.A., 1995. Identifying cohesive subgroups. Social Networks 17, 27–56.ordon, A.D., 1999. Classification. Chapman & Hall/CRC, Boca Raton.andcock, M.S., Raftery, A.E., Tantrum, J.M., 2007. Model-based clustering for

social networks. Journal of the Royal Statistical Society Series A-Statisticsin Society 170, 301–322.

artigan, J.A., Wong, M.A., 1978. A k-means clustering algorithm. AppliedStatistics 28, 100–108.

off, P.D., Raftery, A.E., Handcock, M.S., 2002. Latent space approaches tosocial network analysis. Journal of the American Statistical Association 97,

1090–1098.

aufman, L., Rousseeuw, P.J., 2005. Finding Groups in Data: An Introductionto Cluster Analysis. N.J. Wiley, Hoboken.

loyd, S.P., 1982. Least-squares quantization in Pcm. IEEE Transactions onInformation Theory 28, 129–137.

W

W

etworks 30 (2008) 146–158

orrain, F., White, H.C., 1971. Structural equivalence of individuals in socialnetworks. Journal of Mathematical Sociology, 49–80.

acQueen, J.B., 1967. Some methods for classification and analysis of multi-variate observations. In: Proceedings of the 5th Berkeley Symposium onMathematical Statistics and Probability, vol. 1. University of CaliforniaPress, Berkeley, pp. 281–297.

itchell, M., 1996. An Introduction to Genetic Algorithms. MIT Press, Cam-bridge, MA.

ewman, M.E.J., Girvan, M., 2004. Finding and evaluating community structurein networks. Physical Review E 69.

chaeffer, S.E., 2007. Graph Clustering. Computer Science Review 1, 27–64.chwartz, J.E., 1977. An examination of CONCOR and related methods for

blocking sociometric data. In: Heise, D.R. (Ed.), Sociological Methodology.Jossey-Bass, San Francisco, pp. 255–282.

nijders, T.A.B., Nowicki, K., 1997. Estimation and prediction for stochasticblockmodels for graphs with latent block structure. Journal of Classification14, 75–100.

allberg, C., 2005. A Bayesian approach to modeling stochastic blockstructureswith covariates. Journal of Mathematical Sociology 29, 1–23.

hurman, B., 1979. In the office—networks and coalitions. Social Networks 2,

47–63.

asserman, S., Anderson, C., 1987. Stochastic a posteriori blockmodels—construction and assessment. Social Networks 9, 1–36.

asserman, S., Faust, K., 1994. Social network analysis: methods and applica-tions. Cambridge University Press, Cambridge; New York.