Software Clustering Based on Information Loss Minimization
Periklis Andritsos, University of Toronto
Vassilios Tzerpos, York University
The 10th Working Conference on Reverse Engineering
November 2003 Vassilios Tzerpos 2
The Software Clustering Problem
• Input:
• A set of software artifacts (files, classes)
• Structural information, i.e. interdependencies between the artifacts (invocations, inheritance)
• Non-structural information (timestamps, ownership)
• Goal: Partition the artifacts into “meaningful” groups in order to help understand the software system at hand
Example
[Figure: an example decomposition. Program files are grouped because they are used by the same program files; utility files are grouped because they have almost the same dependencies]
Open questions
o Validity of clusters discovered based on high cohesion and low coupling
  There is no guarantee that legacy software was developed in such a way
o Discovering utility subsystems
  Utility subsystems are low-cohesion / high-coupling
  They commonly occur in manual decompositions
o Utilizing non-structural information
  What types of information have value?
  LOC, timestamps, ownership, directory structure
Our goals
o Create decompositions that convey as much information as possible about the artifacts they contain
o Discover utility subsystems as well as subsystems based on high-cohesion and low-coupling
o Evaluate the usefulness of any combination of structural and non-structural information
Information Theory Basics
o Entropy H(A): measures the uncertainty in a random variable A
o Conditional entropy H(B|A): measures the uncertainty of a variable B, given a value for variable A
o Mutual information I(A;B): measures the dependence of two random variables A and B
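In standard notation, over discrete distributions p(a) and p(b|a), these three quantities are:

```latex
H(A)   = -\sum_{a} p(a)\,\log p(a)
H(B|A) = -\sum_{a} p(a) \sum_{b} p(b \mid a)\,\log p(b \mid a)
I(A;B) = H(B) - H(B|A) = \sum_{a,b} p(a,b)\,\log \frac{p(a,b)}{p(a)\,p(b)}
```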
Information Bottleneck (IB) Method
o A : a random variable that ranges over the artifacts to be clustered
o B : a random variable that ranges over the artifacts’ features
o I(A;B) : the mutual information of A and B
o Information Bottleneck Method [TPB’99]: compress A into a clustering Ck so that the information preserved about B is maximal (k = number of clusters)
o Optimization criterion: minimize I(A;B) - I(Ck;B), or equivalently minimize H(B|Ck) - H(B|A)
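The two forms of the criterion agree, since I(A;B) = H(B) - H(B|A) and I(Ck;B) = H(B) - H(B|Ck):

```latex
I(A;B) - I(C_k;B) = \bigl(H(B) - H(B|A)\bigr) - \bigl(H(B) - H(B|C_k)\bigr)
                  = H(B|C_k) - H(B|A)
```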
Information Bottleneck Method
[Figure: IB schematic. The artifacts a1 ... an (variable A) are compressed into clusters c1 ... ck (variable C) while preserving information about the features b1 ... bm (variable B): minimize the loss of I(A;C), maximize I(C;B)]
Agglomerative IB
o Conceptualize the graph as an n×m matrix (artifacts by features):

  A\B   f1    f2    f3    u1    u2    p
  f1    0     1/4   1/4   1/4   1/4   1/5
  f2    1/4   0     1/4   1/4   1/4   1/5
  f3    1/4   1/4   0     1/4   1/4   1/5
  u1    1/3   1/3   1/3   0     0     1/5
  u2    1/3   1/3   1/3   0     0     1/5

o Compute an n×n matrix indicating the information loss we would incur if we joined any two artifacts into a cluster:

  A\A   f1    f2    f3    u1    u2
  f1    -     .10   .10   .17   .17
  f2    .10   -     .10   .17   .17
  f3    .10   .10   -     .17   .17
  u1    .17   .17   .17   -     0
  u2    .17   .17   .17   0     -

o Merge the tuples with the minimum information loss (here u1 and u2, with zero loss)
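The merge criterion can be sketched as follows. This is an illustrative implementation (names are my own choosing) of the AIB information-loss computation: the loss of merging two clusters is their combined weight times the Jensen-Shannon divergence of their feature distributions.

```python
import numpy as np

def merge_loss(p_i, p_j, pb_i, pb_j):
    """Information loss incurred by merging clusters i and j.

    p_i, p_j   : cluster priors p(c)
    pb_i, pb_j : conditional feature distributions p(B|c) as numpy arrays

    The loss is (p_i + p_j) times the Jensen-Shannon divergence of the
    two conditionals, with weights proportional to the cluster priors.
    """
    w_i, w_j = p_i / (p_i + p_j), p_j / (p_i + p_j)
    pb_merged = w_i * pb_i + w_j * pb_j          # p(B | merged cluster)

    def kl(p, q):
        # KL divergence in bits; 0 * log(0) terms are dropped
        mask = p > 0
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    js = w_i * kl(pb_i, pb_merged) + w_j * kl(pb_j, pb_merged)
    return (p_i + p_j) * js
```

On the example matrix, merging u1 and u2 (identical rows, prior 1/5 each) costs 0 bits, while merging f1 and f2 costs 0.10 bits, matching the loss-matrix entries.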
Adding Non-Structural Data
o If we have information about the developer and location of the files, we express the artifacts to be clustered using a new matrix
o Instead of B we use B’ to include the non-structural data
o We can compute I(A;B’) and proceed as before

  A\B’  f1    f2    f3    u1    u2    Alice  Bob   p1    p2    p3
  f1    0     1/6   1/6   1/6   1/6   1/6    0     1/6   0     0
  f2    1/6   0     1/6   1/6   1/6   0      1/6   0     1/6   0
  f3    1/6   1/6   0     1/6   1/6   0      1/6   0     1/6   0
  u1    1/5   1/5   1/5   0     0     1/5    0     0     0     1/5
  u2    1/5   1/5   1/5   0     0     1/5    0     0     0     1/5
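As a sketch of this construction (the developer and location assignments below are read off the example matrix), the extended matrix B’ can be built by appending the non-structural indicator columns to the structural dependency columns and renormalizing each row into a distribution:

```python
import numpy as np

# Structural features: which artifacts each file depends on (f1, f2, f3, u1, u2).
structural = np.array([
    [0, 1, 1, 1, 1],   # f1
    [1, 0, 1, 1, 1],   # f2
    [1, 1, 0, 1, 1],   # f3
    [1, 1, 1, 0, 0],   # u1
    [1, 1, 1, 0, 0],   # u2
], dtype=float)

# Non-structural features: developer (Alice, Bob) and location (p1, p2, p3).
non_structural = np.array([
    [1, 0, 1, 0, 0],   # f1: Alice, p1
    [0, 1, 0, 1, 0],   # f2: Bob,   p2
    [0, 1, 0, 1, 0],   # f3: Bob,   p2
    [1, 0, 0, 0, 1],   # u1: Alice, p3
    [1, 0, 0, 0, 1],   # u2: Alice, p3
], dtype=float)

# B' = structural columns extended with non-structural ones;
# renormalize each row so it is again a distribution p(B'|a).
b_prime = np.hstack([structural, non_structural])
b_prime /= b_prime.sum(axis=1, keepdims=True)
```

The first row of `b_prime` reproduces the f1 row of the slide's B’ matrix: six nonzero features, each with mass 1/6.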
LIMBO: ScaLable InforMation BOttleneck
o AIB has quadratic complexity, since we need to compute an n×n distance matrix
o The LIMBO algorithm:
  Produce summaries of the artifacts
  Apply agglomerative clustering on the summaries
Experimental Evaluation
o Data sets
  TOBEY: 939 files / 250,000 LOC
  LINUX: 955 files / 750,000 LOC
o Clustering algorithms
  ACDC: pattern-based
  BUNCH: adheres to high cohesion and low coupling (NAHC, SAHC)
o Cluster analysis algorithms
  Single linkage (SL)
  Complete linkage (CL)
  Weighted average linkage (WA)
  Unweighted average linkage (UA)
Experimental Evaluation
o Compared the output of different algorithms using MoJo
  MoJo measures the number of Move/Join operations needed to transform one clustering into another
  The smaller the MoJo value of a particular clustering, the more effective the algorithm that produced it
o We compute MoJo with respect to an authoritative decomposition
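MoJo itself computes a minimal Move/Join sequence; a simplified plurality-tag approximation (my own sketch, not the exact algorithm) conveys the idea:

```python
def approx_mojo(clustering, authoritative):
    """Approximate the number of Move/Join operations needed to turn
    `clustering` into `authoritative` (both are lists of sets of artifacts).

    Heuristic: tag each cluster with the authoritative cluster holding the
    plurality of its members; members outside the plurality count as Moves,
    and clusters sharing a tag are merged with Joins.
    """
    moves, tags = 0, {}
    for cluster in clustering:
        overlaps = [(len(cluster & auth), i)
                    for i, auth in enumerate(authoritative)]
        best_overlap, tag = max(overlaps)
        moves += len(cluster) - best_overlap      # members to move out
        tags[tag] = tags.get(tag, 0) + 1
    joins = sum(count - 1 for count in tags.values())
    return moves + joins
```

For example, turning [{1, 2, 3}, {4, 5}] into [{1, 2, 4, 5}, {3}] takes one Move (artifact 3) and one Join, for a value of 2.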
Structural Feature Results

        TOBEY   Linux
LIMBO   311     237
ACDC    320     342
NAHC    382     249
SAHC    482     353
SL      688     402
CL      361     304
WA      351     309
UA      354     316

LIMBO found utility clusters
Non-Structural Feature Results
o We considered all possible combinations of structural and non-structural features
o Non-structural features were available only for Linux:
  Developers (dev)
  Directory (dir)
  Lines of code (loc)
  Time of last update (time)
o For each combination we report the number of clusters k at which the MoJo values for k and k+1 clusters differ by one
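The selection of k described above might be sketched as follows, interpreting "differs by one" as a difference of at most one and assuming MoJo values are available for a consecutive range of k (`pick_k` is a name of my own):

```python
def pick_k(mojo_by_k):
    """Return the first k (in ascending order) at which the MoJo value for
    k clusters and for k+1 clusters differs by at most one; fall back to
    the largest k if no such plateau exists.

    mojo_by_k: dict mapping a consecutive range of k -> MoJo value.
    """
    ks = sorted(mojo_by_k)
    for k in ks[:-1]:
        if abs(mojo_by_k[k] - mojo_by_k[k + 1]) <= 1:
            return k
    return ks[-1]
```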
Non-Structural Feature Results
Combination     Clusters  MoJo
dev+dir         69        178
dev+dir+time    37        189
dir             25        195
dir+loc+time    78        201
dir+time        18        208
dir+loc         74        210
dev+dir+loc     49        212
dev             71        229
structural      56        237
o Eight combinations outperform the purely structural results
o “Dir” information produced better decompositions
o “Dev” information has a positive effect
o “Time” leads to worse clusterings