Modeling the Search Landscape of Metaheuristic Software Clustering Algorithms
Dagstuhl – Software Architecture
Brian S. Mitchell
[email protected] or http://www.mcs.drexel.edu/~bmitchel
Department of Computer Science, College of Engineering
Drexel University, Philadelphia, PA, 19104 USA
Drexel University Software Engineering Research Group (SERG), http://serg.mcs.drexel.edu
Understanding Large Systems is HARD
Example: RedHat Linux 7.1 [http://www.dwheeler.com/sloc]
• Kernel: 1,400 modules, 2.5M LOC
• System: 350K modules, 30M LOC
• Languages: > 19 (including scripting)
(1) Manual Analysis is Tedious and Error Prone
(2) Source Code Analysis Approaches Create Large Repositories
(3) Software Clustering Approaches Create Abstract Representations
Software Clustering
Bunch Tool
Requires a Representation...
…A Clustering Algorithm…
…And a way to Represent Results…
Researchers Have Examined Many Different Approaches for Software Clustering
Search-Based Software Clustering with Bunch
Preparation Phase
• Source Code: System.out.println(…);
• Source Code Analysis (e.g., cia, Acacia)
• Software Structure Graph Generation (e.g., MDG)
• MDG
Clustering Phase
• Search Space
• Metaheuristic Search Software Clustering Algorithms (e.g., Bunch)
Analysis & Visualization Phase
• Visualization
• Additional Analysis
<gxl>
  <graph id="G1">
    <node id="C1">
      <node id="M1"/>
      <node id="M2"/>
      <edge from="M1" to="M2"/>
      ...
    </node>
  </graph>
</gxl>
Bunch Uses Metaheuristic Search Algorithms for Software Clustering
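The clustering phase can be illustrated with a minimal hill-climbing sketch in Python. This is not Bunch's implementation. The slides do not define MQ, so the `mq` function below assumes the cluster-factor form (each cluster contributes 2*intra / (2*intra + inter), and MQ is the sum over clusters). The names `mq`, `hill_climb`, `NODES`, and `EDGES`, and the toy MDG itself, are hypothetical.

```python
import random

# Toy MDG (hypothetical): two triangles of modules joined by one dependency.
NODES = ["M1", "M2", "M3", "M4", "M5", "M6"]
EDGES = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M6"), ("M6", "M4"),
         ("M3", "M4")]

def mq(edges, assign):
    """Assumed MQ: sum over clusters of 2*intra / (2*intra + inter)."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = assign[u], assign[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            # An inter-edge counts against both clusters it connects.
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    # Clusters without intra-edges contribute 0 and are skipped.
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

def hill_climb(nodes, edges, rng):
    """Move one module at a time to the cluster that most improves MQ."""
    # Random start point: every module gets a random cluster label.
    assign = {n: rng.randrange(len(nodes)) for n in nodes}
    best = mq(edges, assign)
    improved = True
    while improved:
        improved = False
        for n in nodes:
            keep = assign[n]
            for c in range(len(nodes)):
                if c == keep:
                    continue
                assign[n] = c
                score = mq(edges, assign)
                if score > best + 1e-12:  # accept strict improvements only
                    best, keep, improved = score, c, True
            assign[n] = keep
    return assign, best

ASSIGN, BEST = hill_climb(NODES, EDGES, random.Random(0))
```

Because the search starts from a random partition and stops at the first local optimum, different seeds can return different partitions. That is the reason consistently similar results across runs are worth investigating.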
Bunch Example
[Figure: The MDG; The Random Start Point; The Solution]
Evaluating Bunch’s Results
Observation: Bunch produces similar results
• This is desirable, but
• This is unexpected considering the use of metaheuristic search algorithms
Some evaluation has been done
• “Good Enough” via empirical studies
• Similarity Analysis [WCRE01, ICSM01]
• Comparing to spectral clustering techniques [WCRE02]
We were intrigued to investigate why Bunch’s results are consistently similar
Bunch Produces A “Family” of Related Results
The Search Landscape
Search Landscape Modeler
• Structural Landscape: What are some common properties, if any, in the MDG partitions?
• Similarity Landscape: How similar are the contents of the MDG partitions?
[Figure: MDG, Bunch Tool, Clustering Results]
Cluster a System Many Times, Look for Patterns in the Clustering Results that Provide Insight into the Search Space
Can Modeling the Search Space be useful for Evaluation?
The Structural Landscape – What do we Expect?
The Structural Landscape is Modeled using a Series of Views
• MQ vs Number of Clusters: We expect to see a relationship between MQ and the number of clusters. Both MQ and the number of clusters in the partitioned MDG should not vary widely across clustering runs.
• Intra-Edge Density: We expect a good result to produce a high percentage of intraedges (edges that start and end in the same cluster) consistently.
• MQ Value: We expect repeated clustering runs to produce similar MQ results.
• Number of Clusters: We expect that the number of clusters remains relatively consistent across multiple clustering runs.
Comparing Bunch’s Final Results against the Initial Random Partitioned MDG
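Two of the structural views can be sketched in a few lines of Python. The sketch assumes a partition is stored as a dict from module name to cluster label. The function names and the three example partitions are hypothetical and are not output from Bunch.

```python
# Hypothetical MDG: two triangles of modules joined by one dependency.
EDGES = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M6"), ("M6", "M4"),
         ("M3", "M4")]

def intra_edge_percent(edges, assign):
    """(|IntraEdges| / |E|) as a percentage for one partition."""
    intra = sum(1 for u, v in edges if assign[u] == assign[v])
    return 100.0 * intra / len(edges)

def cluster_count(assign):
    """Number of distinct clusters used by one partition."""
    return len(set(assign.values()))

def spread(values):
    """(min, max) of a view across runs; a narrow range means consistency."""
    return min(values), max(values)

# Three made-up partitions standing in for repeated clustering runs.
RUNS = [
    {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1, "M6": 1},
    {"M1": 0, "M2": 0, "M3": 0, "M4": 0, "M5": 1, "M6": 1},
    {"M1": 0, "M2": 0, "M3": 1, "M4": 1, "M5": 2, "M6": 2},
]

INTRA = [intra_edge_percent(EDGES, r) for r in RUNS]
COUNTS = [cluster_count(r) for r in RUNS]
INTRA_RANGE = spread(INTRA)
COUNT_RANGE = spread(COUNTS)
```

Plotting `INTRA` and `COUNTS` against the run index gives the intra-edge density and number-of-clusters views. Computing the same values for the initial random partitions gives the baseline for comparison.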
The Similarity Landscape – What do we Expect?
[Figure: a CLUSTER containing nodes a, b, c, connected to Other Clusters; edges inside the cluster are Intra-Edges, edges to other clusters are Inter-Edges]
1. Create a counter C<u,v> for each edge, initialize to zero
2. Cluster a system many times. For each run:
• For each edge, increment C<u,v> if <u,v> is an intraedge
3. After all runs, determine P<u,v>, which is the percentage of times that each <u,v> appears as an intraedge
Aggregate the P<u,v> based on the level of agreement: None, Low, Medium, High
Our Expectations
[Figure labels: LARGE Dissimilarity; MODERATE Dissimilarity; NOT Similar; VERY Similar]
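The three counting steps and the aggregation can be sketched as follows. The slides do not state where the Low, Medium, and High boundaries lie, so `low_max` and `high_min` are assumed parameters with illustrative defaults. The function names and the example runs are also hypothetical.

```python
from collections import Counter

def edge_agreement(edges, runs):
    """Return P<u,v>: percentage of runs in which <u,v> is an intraedge."""
    counts = {edge: 0 for edge in edges}          # step 1: C<u,v> = 0
    for assign in runs:                           # step 2: one pass per run
        for u, v in edges:
            if assign[u] == assign[v]:
                counts[(u, v)] += 1
    # step 3: convert each counter into a percentage of all runs
    return {edge: 100.0 * counts[edge] / len(runs) for edge in edges}

def bucket(p, low_max=10.0, high_min=75.0):
    """Map one P<u,v> to an agreement level (thresholds are assumptions)."""
    if p == 0:
        return "None"
    if p <= low_max:
        return "Low"
    if p < high_min:
        return "Medium"
    return "High"

def similarity_landscape(edges, runs):
    """Percentage of edges that fall into each agreement level."""
    hist = Counter(bucket(p) for p in edge_agreement(edges, runs).values())
    return {level: 100.0 * hist[level] / len(edges)
            for level in ("None", "Low", "Medium", "High")}

# Made-up example: a four-module chain clustered four times.
EDGES = [("a", "b"), ("b", "c"), ("c", "d")]
RUNS = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 1, "d": 2},
]
P = edge_agreement(EDGES, RUNS)
LANDSCAPE = similarity_landscape(EDGES, RUNS)
```

In this example, edge <a,b> is an intraedge in every run, while <b,c> and <c,d> are intraedges in one and two of the four runs. Results that agree strongly would put most edges in the None and High levels.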
Case Study
System Name   Number of Modules   Number of Relations   Description
Telnet                       28                    81   Terminal Emulator
PHP                          62                   191   Internet Scripting Language
Bash                         92                   901   Unix Terminal Environment
Lynx                        148                 1,745   Text-Based HTML Browser
Bunch                       220                   764   Software Clustering Tool
Swing                       413                 1,513   Standard Java User Interface Framework
Kerberos 5                  558                 3,793   Security Services Infrastructure
We also looked at 6 randomly generated MDGs
Structural Landscape (1)
[Figure: one row of four plots each for TELNET, PHP, and BASH. X-axis / Y-axis per plot: MQ / Cluster Count; Sample Number / (|IntraEdges|/|E|)%; Sample Number / MQ; Sample Number / Cluster Count. Black = Bunch, Gray = Random.]
The independent samples were ordered by MQ to highlight some relationships that would not be obvious otherwise.
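Ordering the samples by MQ before plotting can be sketched as below. The `Run` record, its field names, and the example values are hypothetical. Ascending order is an assumption, since the slides do not say which direction was used.

```python
from typing import NamedTuple

class Run(NamedTuple):
    """One clustering run, summarized by the three plotted quantities."""
    mq: float
    clusters: int
    intra_pct: float

def order_by_mq(runs):
    """Return (sample_number, run) pairs with runs sorted by ascending MQ."""
    return list(enumerate(sorted(runs, key=lambda r: r.mq)))

# Made-up values for three independent runs.
RUNS = [Run(2.1, 5, 60.0), Run(1.7, 4, 55.0), Run(2.4, 6, 70.0)]
ORDERED = order_by_mq(RUNS)
```

With the sample number defined this way, the MQ plot increases monotonically by construction. Any trend in the cluster-count or intra-edge plots then reflects how those quantities vary with MQ.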
Structural Landscape (2)
[Figure: one row of four plots each for LYNX, BUNCH, SWING, and KERBEROS5. X-axis / Y-axis per plot: MQ / Cluster Count; Sample Number / (|IntraEdges|/|E|)%; Sample Number / MQ; Sample Number / Cluster Count. Black = Bunch, Gray = Random.]
Structural Landscape (3) – Random MDGs
[Figure: one row of four plots each for RND5, RND50, and RND75. X-axis / Y-axis per plot: MQ / Cluster Count; Sample Number / (|IntraEdges|/|E|)%; Sample Number / MQ; Sample Number / Cluster Count. Black = Bunch, Gray = Random.]
Structural Landscape (4) – Random MDGs
[Figure: one row of four plots each for BIP5, BIP50, and BIP75. X-axis / Y-axis per plot: MQ / Cluster Count; Sample Number / (|IntraEdges|/|E|)%; Sample Number / MQ; Sample Number / Cluster Count. Black = Bunch, Gray = Random.]
Structural Landscape - Observations
• There was significant commonality across the clustering results
• Many desirable aspects
• A lot of commonality between the random and open source systems
• Some additional variability in the MQ vs Cluster Size relationship for the random MDGs
• More variability in the clustering results for the random graphs with higher edge densities
Similarity Landscape (1)
[Figure: bar chart on a 0 to 100 scale over the agreement levels Zero, Low, Medium, High. Series: Open Source Systems, Random MDGs.]
Similarity Landscape (2)
[Figure: bar chart on a 0 to 100 scale over the agreement levels Zero, Low, Medium, High. Series: Open Source Systems, Random MDGs - Low, Random MDGs - High.]
Observations – Similarity Landscape
• Open Source systems exhibited expected trends
• High dissimilarity and high similarity
• Low medium similarity
• Random MDGs had much higher medium similarity, and almost no high similarity
• We think that this might be due to isomorphism in the clustering results
• Why: The variability in the number of clusters with similar MQ that we observed from the structural landscape
Conclusions
• Ideally, evaluation can be performed by comparing Bunch’s results to a benchmark
• Not possible – Graph partitioning is NP-Hard
• Empirical feedback indicates that the results are “good enough”
• Up to this point in time, no investigation has been performed on why Bunch produces consistent results
• The Search Landscape model provided a lot of intuition into Bunch’s behavior
• We examined both the structural and similarity aspects of the search landscape
• The Search Landscape approach seems appropriate for modeling other metaheuristic search algorithms