1 the search landscape of graph partitioning problems using coupling and cohesion as the clustering...

1

The Search Landscape ofGraph Partitioning Problems

using Coupling and Cohesion as the Clustering Criteria

Brian S. Mitchell & Spiros Mancoridis{bmitchel,smancori}@mcs.drexel.eduhttp://www.mcs.drexel.edu/~{bmitchel,smancori}Department of Computer ScienceSoftware Engineering Research Grouphttp://serg.mcs.drexel.eduDrexel University, Philadelphia, PA, USA

10/05/2002

2Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu

Software Clustering with Bunch

Source CodeAnalysis Tools

MDG File

Bunch ClusteringTool

Partitioned MDG File

Visualization ToolSource Code

void main(){ printf(“hello”);}

Acacia Chava

M1

M2

M3

M5M4

M6

M7 M8

M1

M2

M3

M5M4

M6

M7 M8

Bunch GUI

ClusteringAlgorithms

Clustering Tools

ProgrammingAPI


Software Clustering as a Search Problem

Source CodeAnalysis Tools

MDG

Source Codevoid main(){ printf(“hello”);}

Acacia Chava

M1

M2

M3

M5M4

M6

M7 M8

Software ClusteringSearch Algorithms

bP = null;

while(searching()){ p = selectNext(); if(p.isBetter(bP)) bP = p;}

return bP;

“GOOD” MDG Partition“GOOD” MDG Partition

M1

M2

M3

M5M4

M6

M7 M8

SEARCH SPACESet of All

MDG Partitions

M1

M2

M3

M5M4

M6

M8 M7

M1

M2

M3

M5M4

M6

M8 M7

Total = 4140 Partitions


The Search Space is Enormous

1 = 12 = 23 = 54 = 155 = 52

6 = 2037 = 8778 = 41409 = 2114710 = 115975

11 = 67857012 = 421359713 = 2764443714 = 19089932215 = 1382958545

16 = 1048014214717 = 8286486980418 = 68207680615919 = 583274220505720 = 51724158235372

otherwisekSS

nkkifS

knknkn

,11,1,

11

The number of MDG partitions grows very quickly, as the number of modules in the system increases…

A 15 Module System is about the limit for performing Exhaustive Analysis


Our Assumption…“Well designed software systems are organized into cohesive clusters that are loosely interconnected.”

We designed a measurement called MQ that embodies our assumptionThe MQ measurement balances cohesion and coupling We apply MQ to partitions of the MDG


Not all Partitions of the MDG are Good Solutions

MDG

Good Partition! Bad Partition!

M1

M2

M1

M2 M3

M1

M2

M4

M3M5

M6M3

M4

M5 M6

M4

M5

M6

MQ(Good Partition) > MQ(Bad Partition)


The Software Clustering Problem:Algorithm Objectives

“Find a good partition of the MDG.”

A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.A good partition is a partition where: highly interdependent nodes are grouped in

the same clusters independent nodes are assigned to separate

clusters

The better the partition the higher the MQ


Bunch Hill Climbing Clustering Algorithm

Generate a Random Decomposition of MDG

Iteration Step

GenerateNext

Neighbor

MeasureMQ

Compare to BestNeighboring Partition

Better

Measu

re M

Q

Best Neighboring Partition

New

Best

N

eig

hb

ori

ng

Part

itio

n

Convergence

Best Neighboring Partition for Iteration

CurrentPartition

A neighborpartition iscreated byaltering the

currentpartition slightly.

NeighborPartition

Better?


Bunch Hill Climbing Clustering Algorithm

Generate a Random Decomposition of MDG

Iteration Step

GenerateNext

Neighbor

MeasureMQ

Compare to BestNeighboring Partition

Better

Measu

re M

Q

Best Neighboring Partition

New

Best

N

eig

hb

ori

ng

Part

itio

n

Convergence

Best Neighboring Partition for Iteration

CurrentPartition

A neighborpartition iscreated byaltering the

currentpartition slightly.

NeighborPartition

Better?

Other Things of Interest

We have implemented a family ofhill-climbing algorithms

We also implemented an Exhaustiveand Genetic Algorithm

Other Things of Interest

We have implemented a family ofhill-climbing algorithms

We also implemented an Exhaustiveand Genetic Algorithm


Hierarchical Clustering (1):Nested View

1.

2. Default

4.

3.


Hierarchical Clustering (2):Consolidated View

1.

2. Default

4.

3.


Hierarchical Clustering (3):Tree View


Hierarchical Clustering (3):Tree View

Observations

•The number of levels for a givensystem’s clustering hierarchy isbounded by:

O(log2N)

because Bunch places at least 2nodes in each cluster.

Observations

•The number of levels for a givensystem’s clustering hierarchy isbounded by:

O(log2N)

because Bunch places at least 2nodes in each cluster.


Evaluating The Software Clustering Results

Over the past few years we have spent a lot of time evaluating Bunch’s software clustering results Empirically Semi-formally Measuring Similarity


What We Know

Given a particular MDG, the results produced by Bunch converge to a family of related solutionsThe search space is large, and the probability of finding a good solution by random sampling is infinitesimal


Software Clustering using Graph Partitioning Techniques

Running Bunch multiple times produces a family of related clustering results Bunch starts with a random partition of the

MDG, and makes random moves to explore the search space



How related are these clustering results?



Given that there are 2,7644,437 distinct partitionsof this MDG, there is a lot of agreement…


Software Clustering using Graph Partitioning TechniquesWhy Some Modules Don’t Agree…

Library Modules

IsomorphismOmnipresent

Module Influences


Special ModulesIsomorphic – Modules that are connected to multiple clusters with equal strengthLibrary – All edges fan-inDriver – All edges fan-outOmnipresent – Modules that are strongly connected to many other modules in the system


Clustering a SystemMany Times (1)…

RCS (Random)

0

0.5

1

1.5

2

2.5

0 10 20 30

Number of Clusters

MQ

Va

lue

RCS (Bunch)

0

0.5

1

1.5

2

2.5

0 10 20 30

Number of Clusters

MQ

Va

lue

RCS

0

5

10

15

20

25

30

0 250 500 750 1000

Sample

Nu

mb

er

Clu

ste

rs

RCS

0

0.5

1

1.5

2

2.5

0 250 500 750 1000

Sample

MQ

Dot (Random)

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

0 10 20 30 40

Number of Clusters

MQ

Va

lue

Dot (Bunch)

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

0 10 20 30 40

Number of Clusters

MQ

Va

lue

Dot

0

5

10

15

20

25

30

35

40

45

0 250 500 750 1000

Sample

Nu

mb

er

Clu

ste

rs

Dot

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

0 250 500 750 1000

Sample

MQ

RC

SD

ot

Random

Bunch


Bunch

0

25

50

75

100

125

0 250 500 750 1000

Sample

Nu

mb

er

Clu

ste

rs


Swing (Random)

0

1

2

3

4

5

6

7

0 100 200 300 400

Number of Clusters

MQ

Val

ue

Swing (Bunch)

0

1

2

3

4

5

6

7

0 100 200 300 400

Number of Clusters

MQ

Va

lue

Swing

0

50

100

150

200

250

300

350

400

450

0 250 500 750 1000

Sample

Nu

mb

er

Clu

ste

rs

Swing

0

1

2

3

4

5

6

7

0 250 500 750 1000

Sample

MQ

Sw

ing

Bunch (Random)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 25 50 75 100 125

Number of Clusters

MQ

Val

ue

Bunch (Bunch)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 25 50 75 100 125

Number of Clusters

MQ

Va

lue

Bunch

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 250 500 750 1000

Sample

MQ

Bu

nch

Random

Bunch


Bunch

0

25

50

75

100

125

0 250 500 750 1000

Sample

Nu

mb

er

Clu

ste

rs


Swing (Random)

0

1

2

3

4

5

6

7

0 100 200 300 400

Number of Clusters

MQ

Val

ue

Swing (Bunch)

0

1

2

3

4

5

6

7

0 100 200 300 400

Number of Clusters

MQ

Va

lue

Swing

0

50

100

150

200

250

300

350

400

450

0 250 500 750 1000

Sample

Nu

mb

er

Clu

ste

rs

Swing

0

1

2

3

4

5

6

7

0 250 500 750 1000

Sample

MQ

Sw

ing

Bunch (Random)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 25 50 75 100 125

Number of Clusters

MQ

Val

ue

Bunch (Bunch)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 25 50 75 100 125

Number of Clusters

MQ

Va

lue

Bunch

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 250 500 750 1000

Sample

MQ

Bu

nch

Random

Bunch

Observations

•As the number of clusters increasedin the random samples, MQ decreased

•Bunch converged to a consistent“family” of solutions, no matter wherethe random starting point was generated

•Some solutions were multi-modal•Random solutions were consistentlyworse than Bunch’s solutions.

Observations

•As the number of clusters increasedin the random samples, MQ decreased

•Bunch converged to a consistent“family” of solutions, no matter wherethe random starting point was generated

•Some solutions were multi-modal•Random solutions were consistentlyworse than Bunch’s solutions.


Example - Detailed Results: Bunch System

MQ For Random Clusters (4-8)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 250 500 750 1000

Sample

MQ

MQ For Random Clusters (11-16)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 250 500 750 1000

Sample

MQ

MQ versus Number of Clusters

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0 5 10 15 20

Number of ClustersM

Q

The search spacehas some inherent

structure, as randomclusters constrained

to the area whereBunch converged didnot produce better

MQ values.

77%

23%


Understanding the Search Space

There are characteristics of Bunch’s clustering algorithms that are interesting: It seems unusual that the clustering algorithms

produce consistent MQ values given the large search space

Other approaches [spectral methods] to solving the clustering problem using Bunch’s MQ have not produced better clustering results

The median clustering level is a good tradeoff between cluster size and number of clusters Harman et al. examined using a target granularity

[GECCO’02] to bias the desired cluster sizes


Investigating the Search Space

Examined multiple systems of different size: 15 open source systems developed in

C, C++, or Java 13 randomly generated graphs with

different properties that we wanted to investigate

We clustered each MDG 500 times and examinedthe clustering data to gain some insight into thesearch space.


45

50

55

60

65

70

L1 L2

L3 L4

L5 L6

L7 Median

Example: Median Clustering Level

Cu

mu

lati

ve M

Q

45

50

55

60

65

70

75

L1 L2

L3 L4

L5 L6

L7 Median Cu

mu

lati

ve M

Q

swing Kerbos v.5


0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

L1 L2

L3 Median

2

3

4

5

6

7

8

9

L1 L2L3 L4Median


MQ

MQ

telnetd php


8

10

12

14

16

4

6

8

10


bash mod_ssl

45

50

55

60

65

70

4

6

8

10ping_libc elm

4

6

8

10

2

3

4

5

6

lynx

mailx

X Axis:MQ Value


2

4

6

8

10

18

23

28

33

Example: Median Clustering Level – Random Bipartite Graphs

bip-100-1

2

4

6

8bip-100-2 bip-100-5

2

4

6

8

10

2

3

4

5bip-100-25 bip-100-75

X Axis:MQ Value


2

3

4

5

2

4

6

8

8

13

18

18

23

28

33

38

18

23

28

33

38

Example: Median Clustering Level – Random Graphs

rnd-100-1 rnd-100-2 rnd-100-5

rnd-100-25 rnd-100-75

X Axis:MQ Value


10

15

20

25

20

30

40

50

Example: Median Clustering Level – Random “Circle” Graphs

circle-50

35

45

55

65

75

circle-100

circle-150

X Axis:MQ Value


4

4.1

4.2

4.3

25 35 45

2.22.252.3

2.352.4

5 10 15

4.454.5

4.554.6

4.65

10 15 20

MQ versus #Clusters

44.845

45.245.445.6

150 160 170 180

4646.246.446.646.8

47

170 180 1900

1

2

3

0 5 10

krb5 swing telnetd php

4.94.95

55.055.1

5.15

25 35 45

8.2

8.3

8.4

8.5

40 45 50

4646.246.446.646.8

47

170 180 190

4.054.1

4.154.2

4.254.3

20 30 40

bash mod_ssl ping_libc elm

lynx mailx

X Axis: #ClustersY Axis: MQ Value


11.812

12.212.412.6

20 25 30

23.5

24

24.5

25

40 45 50

36

36.5

37

37.5

65 70 75

25.67

25.67

25.67

25.67

30 31 32

10

10.5

11

11.5

35 40 45 50

3.53.63.73.83.9

30 40 50

1.6

1.7

1.8

1.9

30 35 40

4.74.75

4.84.85

4.94.95

10 12 14

3.853.9

3.954

4.05

38 40 42

1.77

1.78

1.79

1.8

20 30 40

19.3819.4

19.4219.4419.46

20 25 30

MQ versus #Clustersbip-100-1 bip-100-5 bip-100-25 bip-100-75

rnd-100-1 rnd-100-5 rnd-100-25 rnd-100-75

cir-50 cir-100

cir-150

X Axis: #ClustersY Axis: MQ Value


1450

1500

1550

1600

0 200 400

0

100

200

300

0 100 200

900

920940

960

980

100 150 200

900

920940

960

980

100 150 200

2260

2280

2300

2320

500 550 600

125

130135

140

145

0 50 100

2260

2280

2300

2320

500 550 6001180119012001210122012301240

250 300 350

0

20

40

60

80

10 30 50

125

130

135

140

145

0 50 100

Internal- versus External Edges

krb5 swing telnetd php

bash mod_ssl ping_libc elm

lynx mailx

X Axis: External EdgesY Axis: Internal Edges

36Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu20212223242526

20 25 30

42

44

46

48

50

50 55 60

66

68

70

72

74

75 80 85

0

5

10

15

0 50

175

180

185

190

195

0 50 100

1080

1100

1120

1140

0 100 200

3200

3300

3400

3500

3600

0 500

0

5

10

15

0 20 40

130132134136138140142

85 90 95 100

985

990

995

1000

100 110 120

2250

2300

2350

2400

2450

0 200 400

Internal- versus External Edgesbip-100-1 bip-100-5 bip-100-25 bip-100-75

rnd-100-1 rnd-100-5 rnd-100-25 rnd-100-75

cir-50 cir-100

cir-150

X Axis: External EdgesY Axis: Internal Edges


Real SystemsSimilarity of Clustering Results

0

10

20

30

40

50

60

70

80

90

100

telnetd

crond

mailx

joe

dhcpd

php

elm

inn

bash

bunch

mod_ssl

lynx

swing

ping_libc

krb5

System

Pe

rce

nta

ge

IntraEdge AgreementIsomporphic Nodes


Random SystemsSimilarity of Clustering Results

0

10

20

30

40

50

60

70

80

90

100

bip-100-1

bip-100-2

bip-100-5

bip-100-25

bip-100-75

rnd-100-1

rnd-100-2

rnd-100-5

rnd-100-25

rnd-100-75

circle-50

circle-100

circle-150

System

Pe

rce

nta

ge

IntraEdge AgreementIsomporphic Nodes


Real SystemsSimilarity of Clustering Results

0

10

20

30

40

50

60

70

80

90

100

telnetd

crond

mailx

joe

dhcpd

php

elm

inn

bash

bunch

mod_ssl

lynx

swing

ping_libc

krb5

System

Pe

rce

nta

ge

IntraEdge Agreement


Random SystemsSimilarity of Clustering Results

0

10

20

30

40

50

60

70

80

90

100

bip-100-1

bip-100-2

bip-100-5

bip-100-25

bip-100-75

rnd-100-1

rnd-100-2

rnd-100-5

rnd-100-25

rnd-100-75

circle-50

circle-100

circle-150

System

Pe

rce

nta

ge

IntraEdge Agreement


What we Learned From Studying the Search Landscape

Not all modules are “equal” - Some modules: Are connected to many other modules Are connected to few other modules Have a large fan-in Have a large fan-out Are uniformly connected to other system

components Are not uniformly connected to other system

componentsSome modules may have a more “natural” home thanother subsystems with respect to their assigned cluster



Bunch tends to converge to a consistent solution with respect to MQ There is a very low probability of finding one

of these partitions by random selection The partitions found by Bunch are a very

small subset of the overall search landscape

The degree of isomorphism in the clustering results was larger than expected



When examining the median level of the clustering hierarchy we observed that all systems tend to converge to at most 2 levels

The systems that we studied range from under 100 modules to several thousand modules

The number of levels in the clustering hierarchy is bounded by O(log2N)

We expect that studying systems with several hundred thousand modules would produce results where the median level converges to more than 2 levels.

We observed this in very sparse graphs (e.g., rnd-100-1, and bip-100-1)


Conclusions (1)

Understanding the search landscape is important A single run of Bunch is helpful, but it

does not highlight modules/classes that tend to drift between clusters

Analysis of many Bunch runs helps build a mental model of the search landscape


Conclusions (2)

A best practice for program understanding Cluster a system many times in order to

understand the search landscape Identify and separate omnipresent, library and

supplier modules Identify that tend to drift between many

subsystems Assign to other clusters manually, or influence the

clustering algorithm by adjusting the edge weights Bunch supports manual and semi-automatic

clustering features to help with this type of analysis


Questions

Special Thanks To: AT&T Research Sun Microsystems DARPA NSF US Army

SEMINAL Group

1 the search landscape of graph partitioning problems using coupling and cohesion as the clustering...

Documents

neighbor partition

iteration current partition

mq slide

clustering algorithm

software clustering

hierarchical clustering

neighbor measure mq

genetic algorithm slide