Software Clustering Based on Information Loss Minimization
Periklis Andritsos, University of Toronto
Vassilios Tzerpos, York University
The 10th Working Conference on Reverse Engineering
November 2003 Vassilios Tzerpos 2
The Software Clustering Problem
• Input:
• A set of software artifacts (files, classes)
• Structural information, i.e. interdependencies between the artifacts (invocations, inheritance)
• Non-structural information (timestamps, ownership)
• Goal: Partition the artifacts into “meaningful” groups in order to help understand the software system at hand
Example
[Figure: an example decomposition. Program files are grouped because they are used by the same program files; utility files are grouped because they have almost the same dependencies]
Open questions
o Validity of clusters discovered based on high cohesion and low coupling
  There is no guarantee that legacy software was developed in such a way
o Discovering utility subsystems
  Utility subsystems are low-cohesion / high-coupling
  They commonly occur in manual decompositions
o Utilizing non-structural information
  What types of information have value?
  LOC, timestamps, ownership, directory structure
Our goals
o Create decompositions that convey as much information as possible about the artifacts they contain
o Discover utility subsystems as well as subsystems based on high-cohesion and low-coupling
o Evaluate the usefulness of any combination of structural and non-structural information
Information Theory Basics
o Entropy H(A): measures the uncertainty in a random variable A
o Conditional entropy H(B|A): measures the uncertainty of a variable B, given a value for variable A
o Mutual information I(A;B): measures the dependence of two random variables A and B
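In standard notation, over discrete distributions p(a) and p(b|a), these three quantities are:

```latex
H(A)   = -\sum_{a} p(a)\,\log p(a)
H(B|A) = -\sum_{a} p(a) \sum_{b} p(b \mid a)\,\log p(b \mid a)
I(A;B) = H(B) - H(B|A) = \sum_{a,b} p(a,b)\,\log \frac{p(a,b)}{p(a)\,p(b)}
```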
Information Bottleneck (IB) Method
o A : a random variable that ranges over the artifacts to be clustered
o B : a random variable that ranges over the artifacts’ features
o I(A;B) : the mutual information of A and B
o Information Bottleneck Method [TPB’99]: compress A into a clustering Ck so that the information preserved about B is maximal (k = number of clusters)
o Optimization criterion: minimize I(A;B) - I(Ck;B), or equivalently minimize H(B|Ck) - H(B|A)
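The two forms of the criterion agree, since I(A;B) = H(B) - H(B|A) and I(Ck;B) = H(B) - H(B|Ck):

```latex
I(A;B) - I(C_k;B) = \bigl(H(B) - H(B|A)\bigr) - \bigl(H(B) - H(B|C_k)\bigr)
                  = H(B|C_k) - H(B|A)
```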
Information Bottleneck Method
[Figure: IB schematic. The artifacts a1 ... an (variable A) are compressed into clusters c1 ... ck (variable C) while preserving information about the features b1 ... bm (variable B): minimize the loss of I(A;C), maximize I(C;B)]
Agglomerative IB
o Conceptualize the graph as an n×m matrix (artifacts by features):

  A\B   f1    f2    f3    u1    u2    p
  f1    0     1/4   1/4   1/4   1/4   1/5
  f2    1/4   0     1/4   1/4   1/4   1/5
  f3    1/4   1/4   0     1/4   1/4   1/5
  u1    1/3   1/3   1/3   0     0     1/5
  u2    1/3   1/3   1/3   0     0     1/5

o Compute an n×n matrix indicating the information loss we would incur if we joined any two artifacts into a cluster:

  A\A   f1    f2    f3    u1    u2
  f1    -     .10   .10   .17   .17
  f2    .10   -     .10   .17   .17
  f3    .10   .10   -     .17   .17
  u1    .17   .17   .17   -     0
  u2    .17   .17   .17   0     -

o Merge the tuples with the minimum information loss (here u1 and u2, with zero loss)
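The merge criterion can be sketched as follows. This is an illustrative implementation (names are my own choosing) of the AIB information-loss computation: the loss of merging two clusters is their combined weight times the Jensen-Shannon divergence of their feature distributions.

```python
import numpy as np

def merge_loss(p_i, p_j, pb_i, pb_j):
    """Information loss incurred by merging clusters i and j.

    p_i, p_j   : cluster priors p(c)
    pb_i, pb_j : conditional feature distributions p(B|c) as numpy arrays

    The loss is (p_i + p_j) times the Jensen-Shannon divergence of the
    two conditionals, with weights proportional to the cluster priors.
    """
    w_i, w_j = p_i / (p_i + p_j), p_j / (p_i + p_j)
    pb_merged = w_i * pb_i + w_j * pb_j          # p(B | merged cluster)

    def kl(p, q):
        # KL divergence in bits; 0 * log(0) terms are dropped
        mask = p > 0
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    js = w_i * kl(pb_i, pb_merged) + w_j * kl(pb_j, pb_merged)
    return (p_i + p_j) * js
```

On the example matrix, merging u1 and u2 (identical rows, prior 1/5 each) costs 0 bits, while merging f1 and f2 costs 0.10 bits, matching the loss-matrix entries.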
Adding Non-Structural Data
o If we have information about the developer and location of the files, we express the artifacts to be clustered using a new matrix
o Instead of B we use B’ to include the non-structural data
o We can compute I(A;B’) and proceed as before

  A\B’  f1    f2    f3    u1    u2    Alice  Bob   p1    p2    p3
  f1    0     1/6   1/6   1/6   1/6   1/6    0     1/6   0     0
  f2    1/6   0     1/6   1/6   1/6   0      1/6   0     1/6   0
  f3    1/6   1/6   0     1/6   1/6   0      1/6   0     1/6   0
  u1    1/5   1/5   1/5   0     0     1/5    0     0     0     1/5
  u2    1/5   1/5   1/5   0     0     1/5    0     0     0     1/5
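As a sketch of this construction (the developer and location assignments below are read off the example matrix), the extended matrix B’ can be built by appending the non-structural indicator columns to the structural dependency columns and renormalizing each row into a distribution:

```python
import numpy as np

# Structural features: which artifacts each file depends on (f1, f2, f3, u1, u2).
structural = np.array([
    [0, 1, 1, 1, 1],   # f1
    [1, 0, 1, 1, 1],   # f2
    [1, 1, 0, 1, 1],   # f3
    [1, 1, 1, 0, 0],   # u1
    [1, 1, 1, 0, 0],   # u2
], dtype=float)

# Non-structural features: developer (Alice, Bob) and location (p1, p2, p3).
non_structural = np.array([
    [1, 0, 1, 0, 0],   # f1: Alice, p1
    [0, 1, 0, 1, 0],   # f2: Bob,   p2
    [0, 1, 0, 1, 0],   # f3: Bob,   p2
    [1, 0, 0, 0, 1],   # u1: Alice, p3
    [1, 0, 0, 0, 1],   # u2: Alice, p3
], dtype=float)

# B' = structural columns extended with non-structural ones;
# renormalize each row so it is again a distribution p(B'|a).
b_prime = np.hstack([structural, non_structural])
b_prime /= b_prime.sum(axis=1, keepdims=True)
```

The first row of `b_prime` reproduces the f1 row of the slide's B’ matrix: six nonzero features, each with mass 1/6.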
LIMBO: ScaLable InforMation BOttleneck
o AIB has quadratic complexity, since we need to compute an n×n distance matrix
o The LIMBO algorithm:
  Produce summaries of the artifacts
  Apply agglomerative clustering on the summaries
Experimental Evaluation
o Data sets
  TOBEY: 939 files / 250,000 LOC
  LINUX: 955 files / 750,000 LOC
o Clustering algorithms
  ACDC: pattern-based
  BUNCH: adheres to high cohesion and low coupling (NAHC, SAHC)
o Cluster analysis algorithms
  Single linkage (SL)
  Complete linkage (CL)
  Weighted average linkage (WA)
  Unweighted average linkage (UA)
Experimental Evaluation
o Compared the output of different algorithms using MoJo
  MoJo measures the number of Move/Join operations needed to transform one clustering into another
  The smaller the MoJo value of a particular clustering, the more effective the algorithm that produced it
o We compute MoJo with respect to an authoritative decomposition
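MoJo itself computes a minimal Move/Join sequence; a simplified plurality-tag approximation (my own sketch, not the exact algorithm) conveys the idea:

```python
def approx_mojo(clustering, authoritative):
    """Approximate the number of Move/Join operations needed to turn
    `clustering` into `authoritative` (both are lists of sets of artifacts).

    Heuristic: tag each cluster with the authoritative cluster holding the
    plurality of its members; members outside the plurality count as Moves,
    and clusters sharing a tag are merged with Joins.
    """
    moves, tags = 0, {}
    for cluster in clustering:
        overlaps = [(len(cluster & auth), i)
                    for i, auth in enumerate(authoritative)]
        best_overlap, tag = max(overlaps)
        moves += len(cluster) - best_overlap      # members to move out
        tags[tag] = tags.get(tag, 0) + 1
    joins = sum(count - 1 for count in tags.values())
    return moves + joins
```

For example, turning [{1, 2, 3}, {4, 5}] into [{1, 2, 4, 5}, {3}] takes one Move (artifact 3) and one Join, for a value of 2.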
Structural Feature Results

        TOBEY   Linux
LIMBO   311     237
ACDC    320     342
NAHC    382     249
SAHC    482     353
SL      688     402
CL      361     304
WA      351     309
UA      354     316

LIMBO found utility clusters
Non-Structural Feature Results
o We considered all possible combinations of structural and non-structural features
o Non-structural features were available only for Linux:
  Developers (dev)
  Directory (dir)
  Lines of code (loc)
  Time of last update (time)
o For each combination we report the number of clusters k at which the MoJo values for k and k+1 clusters differ by one
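The selection of k described above might be sketched as follows, interpreting "differs by one" as a difference of at most one and assuming MoJo values are available for a consecutive range of k (`pick_k` is a name of my own):

```python
def pick_k(mojo_by_k):
    """Return the first k (in ascending order) at which the MoJo value for
    k clusters and for k+1 clusters differs by at most one; fall back to
    the largest k if no such plateau exists.

    mojo_by_k: dict mapping a consecutive range of k -> MoJo value.
    """
    ks = sorted(mojo_by_k)
    for k in ks[:-1]:
        if abs(mojo_by_k[k] - mojo_by_k[k + 1]) <= 1:
            return k
    return ks[-1]
```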
Non-Structural Feature Results
Combination     Clusters  MoJo
dev+dir         69        178
dev+dir+time    37        189
dir             25        195
dir+loc+time    78        201
dir+time        18        208
dir+loc         74        210
dev+dir+loc     49        212
dev             71        229
structural      56        237
o Eight combinations outperform the purely structural results
o “Dir” information produced better decompositions
o “Dev” information has a positive effect
o “Time” leads to worse clusterings