

Improved Similarity Measures For Software Clustering

Rashid Naseem, Onaiza Maqbool, Siraj Muhammad
Dept. of Computer Science, Quaid-i-Azam University, Islamabad
Elixir Technologies Pakistan (PVT) LTD
Email: [email protected], [email protected], [email protected]

Abstract: Software clustering is a useful technique to recover the architecture of a software system. The results of clustering depend upon the choice of entities, features, similarity measures and clustering algorithms. Different similarity measures have been used for determining similarity between entities during the clustering process. In the software architecture recovery domain, the Jaccard and the Unbiased Ellenberg measures have shown better results than other measures for binary and non-binary features respectively. In this paper we analyze the Russell & Rao measure for binary features to show the conditions under which its performance is expected to be better than that of Jaccard. We also show how our proposed Jaccard-NM measure is suitable for software clustering, and propose its counterpart for non-binary features. Experimental results indicate that our proposed Jaccard-NM measure and the Russell & Rao measure perform better than the Jaccard measure for binary features, while for non-binary features, the proposed Unbiased Ellenberg-NM measure produces results which are closer to the decompositions prepared by experts.

Index Terms: Software Clustering, Jaccard-NM Measure, Jaccard Measure, Unbiased Ellenberg-NM Measure, Russell & Rao Measure

I. INTRODUCTION

Software clustering has engaged the interest of researchers in the last two decades, primarily as a technique to facilitate understanding of legacy software systems. When the architectural documentation is not available, or the documentation has not been updated to reflect changes in the software over time, software clustering may be used for software modularization and architecture recovery [1], [2]. Besides clustering, other techniques used for this purpose are association rule mining [3], concept analysis [4] and graphical visualization [5].

The clustering process is used to modularize a software system or to recover sub-systems by grouping together software entities that are similar to each other. Thus entities within a cluster have similar characteristics or features, and are dissimilar from entities in other clusters. To determine similarity based on the features of an entity, a similarity measure is employed. Many different similarity measures are available. The choice of a measure depends on the characteristics of the domain in which it is applied. In the software domain, the most commonly used similarity measure for hierarchical clustering is the Jaccard coefficient for binary features [6], [7], while for non-binary features the Unbiased Ellenberg and Information Loss measures have been shown to produce better results as compared to other measures [1].

In this paper, we describe our proposed Jaccard-NM [8] measure for binary features, and compare it with the Jaccard and Russell & Rao [9] measures. We present different cases to show deficiencies in the Jaccard measure which may deteriorate clustering results, and show how in these cases the Russell & Rao and Jaccard-NM measures are expected to have better performance. For non-binary features we propose the Unbiased Ellenberg-NM measure, and compare its performance with the Unbiased Ellenberg and Information Loss measures. We also analyze cases where these measures produce arbitrary decisions. We call a decision arbitrary when more than two entities have equal similarity values. In this situation, clustering algorithms select two entities to be clustered arbitrarily. Such arbitrary decisions may create problems [1].

Thus the contributions of this paper can be summarized as:

1) Analysis of the Jaccard, Jaccard-NM and Russell & Rao measures for binary features and a comparison of their strengths and weaknesses.

2) Definition of a new similarity measure for non-binary features and its comparison with well known existing measures used for software clustering.

3) Internal and external evaluation of clustering results. Internal assessment is carried out using arbitrary decisions taken by proposed and existing similarity measures. External assessment is carried out by comparing manually prepared software decompositions with automatically produced clustering results using MoJoFM.

This paper is organized as follows. In Section 2 we describe related work. An overview of clustering is presented in Section 3. In Section 4 we present an analysis of similarity measures for binary features and define a new measure for non-binary features. Section 5 describes our experimental setup. In Section 6, we analyze our experimental results. Finally, in Section 7, we present the conclusions and future work.

II. RELATED WORK

To find similarity between entities, various similarity measures have been used. Davey and Burd evaluated different similarity measures including the Jaccard, Sorensen-Dice, Canberra and Correlation coefficients [7]. From experimental results they concluded that the Jaccard and Sorensen-Dice similarity measures perform identically, and they recommended the Jaccard similarity measure for software clustering when features are binary.

2011 15th European Conference on Software Maintenance and Reengineering

1534-5351/11 $26.00 © 2011 IEEE. DOI 10.1109/CSMR.2011.9




Anquetil and Lethbridge compared different similarity measures including Jaccard, Simple Matching, Sorensen-Dice, Correlation, Taxonomic and Canberra [6]. For clustering they used the Complete linkage, Weighted linkage, Unweighted linkage and Single linkage algorithms. They concluded that the Jaccard and Sorensen-Dice similarity measures produce good results because they do not consider absence of a feature (d) as a sign of similarity, while Simple Matching and other similarity measures consider absence of a feature to be a sign of similarity and thus do not produce satisfactory results.

In 2003, Anquetil and Lethbridge evaluated different features, similarity measures and clustering algorithms. From experimental results they once again concluded that the Jaccard similarity measure produces good results [10].

Saeed et al. developed a new linkage algorithm called the Combined algorithm [11]. They compared this algorithm with Complete linkage using different similarity measures including the Jaccard, Sorensen-Dice, Simple Matching and Correlation coefficients. They concluded that the behavior of the Correlation coefficient is similar to the Jaccard similarity measure when the number of absent features is very large as compared to present features.

In 2004, Maqbool and Babri developed the Weighted Combined algorithm, and proposed the Unbiased Ellenberg similarity measure [12]. They evaluated the Complete linkage, Combined and Weighted Combined algorithms using the Jaccard, Euclidean distance, Pearson correlation coefficient, Ellenberg and Unbiased Ellenberg similarity measures. Their results suggested that the Weighted Combined algorithm produces better results than the Complete and Combined algorithms, especially with the Unbiased Ellenberg measure.

Andritsos and Tzerpos developed an algorithm called LIMBO (scaLable InforMation BOttleneck algorithm) in 2005 [2]. They applied LIMBO to three different data sets and compared the results with the ACDC, NAHC-lib, SAHC, SAHC-lib, Single Linkage, Complete Linkage, Weighted Average Linkage and Unweighted Average Linkage algorithms. They concluded that, on average, LIMBO performed better than the other algorithms.

In 2006, Mitchell and Mancoridis described their Bunch clustering tool, which uses search techniques (hill-climbing and genetic algorithms) to find optimal solutions [13]. The Bunch tool was developed in 1998 [14] and over time modified to include new features (e.g. omnipresent module detection and deletion) [15], [16]. Bunch uses a Module Dependency Graph (MDG), where modules are entities and edges are static relationships among entities. Bunch makes partitions of the MDG and uses a fitness function, Modularization Quality (MQ), to calculate the quality of graph partitions.

Harman et al. investigated the effect of noise in the input information available for software module clustering [17]. To guide the search they examined two fitness functions: Modularization Quality (MQ) and the Evaluation Metric function (EVM). For evaluation they used six real software systems, three perfect module dependency graphs and three random module dependency graphs, and concluded that in the presence of noise, EVM performs better than MQ for real and perfect MDGs. Results also show that EVM is more robust than MQ for smaller software systems.

In 2010, Naseem et al. proposed a new similarity measure called Jaccard-NM [8] for binary features. They evaluated this measure using Complete linkage, Weighted average and Unweighted average. From the experimental results they concluded that, in general, Jaccard-NM produces better results than the Jaccard similarity measure for binary features.

Besides the software domain, a comparison of similarity measures has also been carried out in other domains. In these domains, the Jaccard measure does not necessarily perform better than other measures as in the case of software, due to different domain characteristics. For example, Willett used thirteen similarity measures including Tanimoto, Russell & Rao, and Simple Matching to find the similarity between molecular fingerprints for virtual screening [18]. He concluded that the Tanimoto, Baroni-Urbani/Buser, Kulczynski(2), Fossum and Ochiai/Cosine coefficients perform reasonably well across the range of molecular sizes [18]. Dalirsefat et al. compared three similarity measures including Jaccard, Sorensen-Dice and Simple Matching to find the similarity between biological organisms. They concluded from their experimental results that when the organisms are closely related, Jaccard or Sorensen-Dice give satisfactory results [19]. Moreover, Jaccard and Sorensen-Dice produce closely similar results because these two measures exclude negative co-occurrences. These results are similar to those obtained for software.

III. OVERVIEW OF CLUSTERING

In the clustering process, entities are grouped together based on their features. In this section, we provide an overview of the steps in clustering.

    A. Selection of Entities and Features

Selection of entities and features depends on the type of software system, and also on the required architectural view. For modularization of structured software systems, researchers have selected different entities, e.g. files, processes and functions. Features may be global variables or user defined types used by an entity [6]. For object oriented software systems, entities may be classes [20] and features are typically defined by the relationships between classes, e.g. inheritance or containment. In the software domain, features are usually binary, i.e. they indicate the presence or absence of a characteristic or relationship.

Before applying a clustering algorithm, a software system must be parsed to extract entities and features. The result is an NxP matrix, where N is the number of entities and P is the number of features. Table I presents an NxP matrix of a small software system containing 4 entities and 6 binary features.
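The extracted NxP matrix can be held in any tabular structure; the following is a minimal Python sketch of the small system of Table I (the entity and feature names are those of the table, the dictionary layout is our own illustration):

```python
# N x P binary feature matrix for the small system of Table I:
# rows are entities E1..E4, columns are features f1..f6.
feature_matrix = {
    "E1": [0, 1, 1, 0, 1, 0],
    "E2": [1, 1, 1, 0, 0, 1],
    "E3": [1, 0, 0, 1, 0, 1],
    "E4": [1, 0, 0, 1, 1, 0],
}

N = len(feature_matrix)            # number of entities (rows)
P = len(feature_matrix["E1"])      # number of features (columns)
assert (N, P) == (4, 6)
```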

    B. Selection of Similarity Metrics

In the second step, a similarity measure is applied to compute similarity between every pair of entities, resulting in a similarity matrix. Selection of a similarity measure should be done carefully, because selecting an appropriate similarity measure may influence clustering results more than the selection of a clustering algorithm [21]. Table II lists some well known similarity measures for binary features.

TABLE I
(N X P) FEATURE MATRIX FOR A SMALL SYSTEM

          f1  f2  f3  f4  f5  f6
    E1     0   1   1   0   1   0
    E2     1   1   1   0   0   1
    E3     1   0   0   1   0   1
    E4     1   0   0   1   1   0

TABLE II
SIMILARITY MEASURES FOR BINARY FEATURES

    S.No  Name             Mathematical representation
    1     Jaccard          a/(a + b + c)
    2     Russell & Rao    a/(a + b + c + d)
    3     Simple Matching  (a + d)/(a + b + c + d)
    4     Sokal-Sneath     a/(a + 2(b + c))
    5     Rogers-Tanimoto  (a + d)/(a + 2(b + c) + d)
    6     Gower-Legendre   (a + d)/(a + 0.5(b + c) + d)

The Jaccard-NM measure for binary features proposed by us in [8] is given by:

    Jaccard-NM = a / (2(a + b + c) + d)    (1)

In Table II and Equation 1, a, b, c and d can be determined using Table III. For two entities X and Y, a is the number of features that are present (1) in both entities X and Y, b represents features that are present in X but absent in Y, c represents features that are not present in X and present in Y, and d represents the number of features that are absent (0) in both entities. n = a + b + c + d is the total number of features.

TABLE III
CONTINGENCY TABLE

                            Y
                     1 (Presence)  0 (Absence)  Sum
    X  1 (Presence)  a             b            a + b
       0 (Absence)   c             d            c + d
       Sum           a + c         b + d        n = a + b + c + d

Table IV lists some well known similarity measures for non-binary features. In Table IV, since the features are non-binary, Ma represents the sum of features that are present in both entities X and Y, Mb represents the sum of features that are present in X but absent in Y, and Mc represents the sum of features that are not present in X and are present in Y.

In the software domain, it has been shown that the Jaccard measure produces better results than other measures for binary features [6], [7]. One reason for this is that it does not consider d (absence of a feature/negative match) [11], [22]. It has been observed that in software clustering, the features are asymmetric, i.e. the presence of a feature (1) has more weight than its absence (0). The absence of features does not indicate similarity between two entities, e.g. if two classes both do not use a variable, it does not mean that they are similar. For non-binary features, the counterpart of the Jaccard similarity measure, Unbiased Ellenberg, produces better results for software clustering [1], [12].

TABLE IV
SIMILARITY MEASURES FOR NON-BINARY FEATURES

    S.No  Name                Mathematical representation
    1     Ellenberg           0.5Ma/(0.5Ma + Mb + Mc)
    2     Unbiased Ellenberg  0.5Ma/(0.5Ma + b + c)
    3     Gleason measure     Ma/(Ma + Mb + Mc)
    4     Unbiased Gleason    Ma/(Ma + b + c)
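A sketch of the Unbiased Ellenberg measure of Table IV, assuming Ma sums the values (taken from both vectors) of features that are nonzero in both entities, while b and c are the feature counts of Table III; the function name and this reading of Ma are our illustration, not a verbatim implementation from the literature:

```python
def unbiased_ellenberg(x, y):
    """Unbiased Ellenberg (Table IV) for two non-binary feature vectors.

    Assumption: Ma is the summed value, over both vectors, of features
    present (nonzero) in both; b and c count features present in only one.
    """
    Ma = sum(xi + yi for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    b = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi > 0)
    return 0.5 * Ma / (0.5 * Ma + b + c)

# Example: x and y share f1 (values 2 and 1), so Ma = 3, b = 1, c = 1.
print(unbiased_ellenberg([2, 1, 0], [1, 0, 1]))
```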

C. Application of a Clustering Algorithm

The next step is to apply a clustering algorithm, which can be categorized as hierarchical or non-hierarchical. Agglomerative Hierarchical Clustering (AHC) algorithms are based on the bottom-up approach. In this approach, an algorithm considers entities as singleton clusters, and at every step clusters the two most similar entities together. At the end, the algorithm makes one large cluster which contains all entities. Although in the software domain non-hierarchical algorithms have also been used [23], [24], there are some advantages of using AHC algorithms. For example, there is no need for prior information about the number of clusters. Moreover, the hierarchical structure of a software system is naturally represented through hierarchical algorithms. But the disadvantage is that we have to select a cutoff point, which represents the number of steps after which to stop the algorithm.

Widely used agglomerative hierarchical algorithms for software architecture recovery are Complete Linkage (CL), Single Linkage (SL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL). When two entities are merged into a cluster, similarity between the newly formed cluster and other clusters/entities is calculated differently by these algorithms. Suppose we have three entities E1, E2 and E3. Using these algorithms, similarity between E1 and the newly formed cluster E23 is calculated as [22]:

Complete Linkage:
    Similarity(E1, E23) = min(Similarity(E1, E2), Similarity(E1, E3))

Single Linkage:
    Similarity(E1, E23) = max(Similarity(E1, E2), Similarity(E1, E3))

Weighted Average Linkage:
    Similarity(E1, E23) = 1/2 Similarity(E1, E2) + 1/2 Similarity(E1, E3)

Unweighted Average Linkage:
    Similarity(E1, E23) = (Similarity(E1, E2) size(E2) + Similarity(E1, E3) size(E3)) / (size(E2) + size(E3))
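The four update rules above can be sketched directly (a minimal illustration; the similarity values and cluster sizes passed in are hypothetical):

```python
# Similarity between entity E1 and the newly formed cluster E23,
# given s12 = Similarity(E1, E2) and s13 = Similarity(E1, E3).

def complete_linkage(s12, s13):
    return min(s12, s13)            # most conservative: least similar member

def single_linkage(s12, s13):
    return max(s12, s13)            # most liberal: most similar member

def weighted_average(s12, s13):
    return 0.5 * s12 + 0.5 * s13    # equal weight regardless of cluster size

def unweighted_average(s12, s13, size2, size3):
    # weight each member's similarity by its cluster size
    return (s12 * size2 + s13 * size3) / (size2 + size3)
```

With s12 = 0.4 and s13 = 0.6, complete linkage yields 0.4 and single linkage 0.6; the unweighted average with sizes 1 and 3 yields 0.55, pulled toward the larger cluster.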

The Complete linkage algorithm supports formation of small but cohesive clusters, while the Single linkage algorithm makes large non-cohesive but stable clusters. The results of the Weighted and Unweighted Average Linkage algorithms lie between these two.

Two recently proposed hierarchical algorithms for software clustering are the Weighted Combined Algorithm (WCA) [12] and LIMBO [2]. When two entities are merged in a cluster, information about the number of entities accessing a feature is lost [12] when using linkage algorithms. WCA and LIMBO overcome this limitation of linkage algorithms by making a new feature vector for the newly formed cluster. This feature vector contains information about the number of entities accessing a feature. Unlike linkage algorithms, these algorithms update the feature matrix after every step.

Suppose we have two entities E1 and E2 with normalized feature vectors f_i and f_j, respectively. The new feature vector f_ij is calculated for both algorithms as:

    f_ij = (f_i + f_j) / (n_i + n_j),  i.e.  f_ijk = (f_ik + f_jk) / (n_i + n_j),  k = 1, 2, ..., p
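The merge formula above is elementwise; a minimal Python sketch (the function name is our own):

```python
def combine_feature_vectors(f_i, f_j, n_i, n_j):
    """Feature vector of a merged cluster, per the WCA/LIMBO formula above:
    f_ijk = (f_ik + f_jk) / (n_i + n_j), where n_i and n_j are the numbers
    of entities in the two clusters being merged."""
    return [(fik + fjk) / (n_i + n_j) for fik, fjk in zip(f_i, f_j)]

# Merging two singleton entities (n_i = n_j = 1):
print(combine_feature_vectors([1, 0, 1], [0, 1, 1], 1, 1))
```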

The Information Loss (IL) measure is used with LIMBO to calculate the information loss between any two entities/clusters. The entities are chosen for grouping together into a new cluster when their IL is minimum. The IL, represented by I, is briefly described below (for details and examples see [2]).

Information loss is given as:

    I = [p(E_i) + p(E_j)] * D_JS[f_i, f_j]

For each singleton entity, p(E_i) = p(E_j) = 1/n, where n is the total number of entities. D_JS is the Jensen-Shannon divergence, defined as follows:

    D_JS = p(E_i)/p(E_ij) * D_KL[f_i || f_ij] + p(E_j)/p(E_ij) * D_KL[f_j || f_ij]

D_KL is the relative entropy (also called Kullback-Leibler (KL) divergence), which is the difference between two probability distributions, given as:

    D_KL[f_i || f_j] = sum_{k=1}^{p} f_ik log(f_ik / f_jk)
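The IL computation above can be sketched as follows. This is an illustration, not the exact LIMBO implementation: we take the merged distribution f_ij as the probability-weighted mean of f_i and f_j (the standard Jensen-Shannon construction), and apply the usual convention that terms with f_ik = 0 contribute nothing to D_KL:

```python
import math

def kl_divergence(f_i, f_j):
    # D_KL[f_i || f_j] = sum_k f_ik * log(f_ik / f_jk);
    # terms with f_ik = 0 are dropped (0 * log 0 := 0).
    return sum(fik * math.log(fik / fjk)
               for fik, fjk in zip(f_i, f_j) if fik > 0)

def information_loss(f_i, f_j, p_i, p_j):
    """I = [p(E_i) + p(E_j)] * D_JS[f_i, f_j] for two clusters with
    probabilities p_i, p_j and normalized feature distributions f_i, f_j."""
    p_ij = p_i + p_j
    # merged distribution: probability-weighted mean (an assumption, see above)
    f_ij = [(p_i * fik + p_j * fjk) / p_ij for fik, fjk in zip(f_i, f_j)]
    d_js = (p_i / p_ij) * kl_divergence(f_i, f_ij) \
         + (p_j / p_ij) * kl_divergence(f_j, f_ij)
    return p_ij * d_js
```

For two maximally dissimilar singletons, e.g. f_i = [1, 0] and f_j = [0, 1] with p_i = p_j = 1/2, the loss is log 2, the largest possible value; identical distributions lose nothing, which is why LIMBO merges the pair with minimum IL.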

D. Evaluation of Results

In external assessment, the automatically prepared decompositions are compared with the decompositions prepared by human experts. For this purpose different measures may be used. A well known measure is MoJoFM [25], a recent version of MoJo [26]. MoJoFM is an external assessment measure based on the Move and Join operations needed to convert the decomposition produced by a clustering algorithm to an expert decomposition [25]. To compare the result A of our algorithm with the expert decomposition B, we have:

    MoJoFM = (1 - mno(A, B) / max(mno(forall A, B))) * 100%    (2)

where mno(A, B) is the minimum number of Move and Join operations needed to convert A to B, and max(mno(forall A, B)) is the maximum such number over all possible decompositions A. A higher MoJoFM value (100%) denotes greater correspondence between the two decompositions and hence better results, while lower MoJoFM values (0%) indicate that the decompositions are very different.
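Given the two operation counts, Equation 2 is a one-liner. Computing mno itself requires the MoJo distance algorithm [26], which we do not reproduce here; the sketch below only evaluates the formula, with hypothetical counts:

```python
def mojofm(mno_ab, max_mno_b):
    """Equation 2: percentage correspondence between a clustering result A
    and an expert decomposition B, given mno(A, B) and the maximum mno
    over all possible decompositions A."""
    return (1 - mno_ab / max_mno_b) * 100

# Identical decompositions need no operations -> 100%;
# a worst-case decomposition needs the maximum -> 0%.
print(mojofm(0, 10), mojofm(10, 10), mojofm(3, 10))
```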

In internal assessment, some internal characteristic of clusters may be used to evaluate the quality of results. Arbitrary decisions represent an internal quality measure [1]. An arbitrary decision is taken by an algorithm when there is more than one maximum value for similarity between entities (or, for distance and information loss measures, more than one minimum value).

IV. AN ANALYSIS OF SIMILARITY MEASURES AND FEATURE VECTOR CASES

In this section, we analyze similarity measures for binary features and propose a new measure for non-binary features.

    A. Analysis of similarity measures

As described in Section III-B, for software clustering, measures that do not contain d produce better results. This is because features in software are asymmetric, and a 1 and a 0 do not have equal weight. A 0 indicates the absence of a feature, and hence d indicates that features are not being shared between entities. For software, the absence of a feature in two entities does not indicate similarity. For example, if two classes do not access the same global function, it does not mean that the two classes are similar.

To show that the presence of d in a measure does not necessarily deteriorate results, consider Table V, which shows 4 entities, E1-E4. E1 and E2 share two features, so the value of a is 2. Both of them access one feature each that the other entity does not, so b = 1 and c = 1. E3 and E4 share three features, so a = 3. Similar to E1 and E2, both of them access one feature each that the other entity does not, so b = 1 and c = 1, as given in Figure 1.

TABLE V
SOFTWARE SYSTEM A

    Entities  f1  f2  f3  f4  f5  f6  f7
    E1         1   1   0   1   0   0   0
    E2         1   1   0   0   1   0   0
    E3         1   1   1   0   0   1   0
    E4         1   1   1   0   0   0   1

TABLE VI
SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM A

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.5   0
    E3        0.4   0.4   0
    E4        0.4   0.4   0.6   0


    Fig. 1. Relationships between entities in software system A

TABLE VII
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM A

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.18  0
    E3        0.08  0.08  0
    E4        0.08  0.08  0.25  0

TABLE VIII
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM A

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.28  0
    E3        0.28  0.28  0
    E4        0.28  0.28  0.42  0

TABLE IX
SIMILARITY MATRIX USING SIMPLE MATCHING FOR SOFTWARE SYSTEM A

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.71  0
    E3        0.57  0.57  0
    E4        0.57  0.57  0.71  0

The similarity matrix according to the Jaccard measure is given in Table VI. The similarity matrices according to the Jaccard-NM, Russell & Rao and Simple Matching measures (all of which contain d) are given in Table VII - Table IX. It can be seen from Table VI - Table VIII that the Jaccard, Jaccard-NM and Russell & Rao measures find E3 and E4 to be most similar. From Figure 1, it is clear that E3 and E4 should indeed be considered most similar. However, due to the presence of d in the numerator of the Simple Matching coefficient, it finds E1 & E2 and E3 & E4 to be equally similar, resulting in an arbitrary decision where either of these entity pairs may be grouped.
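The Jaccard entries of Table VI can be reproduced directly from Table V, using the definitions of Section III-B (a minimal sketch; function names are ours):

```python
# Feature vectors of software system A (Table V).
system_a = {
    "E1": [1, 1, 0, 1, 0, 0, 0],
    "E2": [1, 1, 0, 0, 1, 0, 0],
    "E3": [1, 1, 1, 0, 0, 1, 0],
    "E4": [1, 1, 1, 0, 0, 0, 1],
}

def counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return a, b, c

def jaccard(x, y):
    a, b, c = counts(x, y)
    return a / (a + b + c)

# E3 and E4: a=3, b=1, c=1 -> 3/5 = 0.6, the largest entry in Table VI.
print(jaccard(system_a["E3"], system_a["E4"]))
```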

From this example, it is clear that the significant factor here is whether d is present in the numerator or denominator of a measure. Its presence in the numerator deteriorates results (as for the Simple Matching coefficient). However, if it is present in the denominator only, it does not indicate similarity, but it is a useful indicator of the proportion of common and total features. Consider the following two cases, which indicate how the presence of d in the Jaccard-NM and Russell & Rao measures may improve performance as compared to Jaccard.

- Case 1: The value of a differs among entities, but the similarity as per Jaccard is the same.

TABLE X
SOFTWARE SYSTEM B

    Entities  f1  f2  f3  f4
    E1         1   1   0   0
    E2         1   1   0   0
    E3         1   1   1   1
    E4         1   1   1   1

    Fig. 2. Relationships between entities in software system B

An example feature matrix with 4 entities (E1-E4) and 4 features (f1-f4) of a software system B for this case is presented in Table X and shown in Figure 2. In this system the value of a is 2 for entities E1 and E2. For entities E3 and E4, the value of a is 4.

The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XI - Table XIII. It can be seen from Table XI that using the Jaccard measure, both E1 and E2, and E3 and E4, are found to be equally similar. It may be better to choose E3 and E4 for clustering rather than E1 and E2, as they share a larger number of features. Both Jaccard-NM and Russell & Rao find E3 and E4 to be more similar, so an arbitrary decision is avoided.

TABLE XI
SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM B

    Entities  E1    E2    E3    E4
    E1        0
    E2        1     0
    E3        0.5   0.5   0
    E4        0.5   0.5   1     0

TABLE XII
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM B

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.3   0
    E3        0.25  0.25  0
    E4        0.25  0.25  0.5   0


TABLE XIII
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM B

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.5   0
    E3        0.5   0.5   0
    E4        0.5   0.5   1     0

- Case 2: The value of a is high among entities, but they are not completely similar.

An example feature matrix with 4 entities (E1-E4) and 6 features (f1-f6) of a software system C for this case is presented in Table XIV and Figure 3. The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XV - Table XVII. It can be seen that entities E1 and E2 are found to be most similar by Jaccard. However, Jaccard-NM and Russell & Rao find E3 and E4 to be most similar, which may be more appropriate.

TABLE XIV
SOFTWARE SYSTEM C

    Entities  f1  f2  f3  f4  f5  f6
    E1         1   1   0   0   0   0
    E2         1   1   0   0   0   0
    E3         1   1   1   1   1   0
    E4         1   1   1   1   0   1

    Fig. 3. Relationships between entities in software system C

TABLE XV
SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM C

    Entities  E1    E2    E3    E4
    E1        0
    E2        1     0
    E3        0.4   0.4   0
    E4        0.4   0.4   0.6   0

Through Case 1 and Case 2, we have shown that both the Jaccard-NM and Russell & Rao measures are expected to provide better results as compared to the Jaccard measure. The question arises as to why we need to define Jaccard-NM when Russell & Rao already exists. To answer this question, consider the following example:

TABLE XVI
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM C

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.25  0
    E3        0.18  0.18  0
    E4        0.18  0.18  0.33  0

TABLE XVII
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM C

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.33  0
    E3        0.33  0.33  0
    E4        0.33  0.33  0.6   0

- Case 3: The value of a is the same, but the values of b and c are not.

Consider Table XVIII and Figure 4, having four entities (E1-E4) and five features (f1-f5). All the entities have the same value of a, equal to three, but entities E1 and E2 have b = 0 and c = 0, while E3 and E4 have b = 1 and c = 1.

TABLE XVIII
SOFTWARE SYSTEM D

    Entities  f1  f2  f3  f4  f5
    E1         1   1   1   0   0
    E2         1   1   1   0   0
    E3         1   1   1   1   0
    E4         1   1   1   0   1

    Fig. 4. Relationships between entities in software system D

The corresponding similarity matrices using the Russell & Rao and Jaccard-NM measures are given in Table XIX and Table XX respectively. It can be seen from Table XIX that Russell & Rao results in arbitrary decisions among all entities. But it can be seen from Table XX that Jaccard-NM reduces arbitrary decisions and gives preference to E1 and E2 to form a cluster in the first step. Hence in certain cases, the results of Jaccard-NM and Russell & Rao are different, with Jaccard-NM reducing the arbitrary decisions which have a negative impact on the clustering results.


TABLE XIX
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM D

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.6   0
    E3        0.6   0.6   0
    E4        0.6   0.6   0.6   0

TABLE XX
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM D

    Entities  E1    E2    E3    E4
    E1        0
    E2        0.37  0
    E3        0.33  0.33  0
    E4        0.33  0.33  0.3   0
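The tie-counting behind this internal assessment can be sketched as follows, using the values of Tables XIX and XX for software system D (a minimal illustration; the helper name is ours):

```python
def tied_pairs_at_max(sim):
    """Return the entity pairs sharing the maximum similarity value;
    more than one such pair means the clustering step is arbitrary."""
    best = max(sim.values())
    return sorted(pair for pair, s in sim.items() if s == best)

# Table XIX: Russell & Rao on system D ties every pair at 0.6.
russell_rao_d = {
    ("E1", "E2"): 0.6, ("E1", "E3"): 0.6, ("E1", "E4"): 0.6,
    ("E2", "E3"): 0.6, ("E2", "E4"): 0.6, ("E3", "E4"): 0.6,
}
# Table XX: Jaccard-NM on system D leaves a unique winner, (E1, E2).
jaccard_nm_d = {
    ("E1", "E2"): 0.37, ("E1", "E3"): 0.33, ("E1", "E4"): 0.33,
    ("E2", "E3"): 0.33, ("E2", "E4"): 0.33, ("E3", "E4"): 0.30,
}

print(len(tied_pairs_at_max(russell_rao_d)))  # all six pairs tie
print(tied_pairs_at_max(jaccard_nm_d))        # a single unambiguous pair
```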

    B. Unbiased Ellenberg-NM - A new similarity measure for non-binary features

Unbiased Ellenberg is a Jaccard-like measure for non-binary features, as given in Equation 3:

    Unbiased Ellenberg = 0.5Ma / (0.5Ma + b + c)    (3)

The cases discussed in Section IV-A can also occur in a non-binary feature matrix. Therefore, to solve these problems, we propose a new measure called Unbiased Ellenberg-NM. Our new measure is defined as follows:

    Unbiased Ellenberg-NM = 0.5Ma / (0.5Ma + b + c + n)                    (4)
                          = 0.5Ma / (0.5Ma + b + c + (a + b + c + d))     (5)
                          = 0.5Ma / (0.5Ma + 2(b + c) + a + d)            (6)
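The new measure mirrors the Jaccard to Jaccard-NM construction: the total feature count n = a + b + c + d is added to the Unbiased Ellenberg denominator. A minimal sketch, with the contingency quantities passed in directly:

```python
def unbiased_ellenberg_nm(Ma, a, b, c, d):
    """Unbiased Ellenberg-NM (Equations 4-6): Unbiased Ellenberg with the
    total number of features n = a + b + c + d added to the denominator."""
    n = a + b + c + d
    return 0.5 * Ma / (0.5 * Ma + b + c + n)

# With Ma = 4, a = 2, b = 1, c = 1, d = 0: 2 / (2 + 2 + 4) = 0.25.
print(unbiased_ellenberg_nm(4, 2, 1, 1, 0))
```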

V. EXPERIMENTAL SETUP

In this section, we describe the test systems and the clustering setup for our experiments.

    A. The Test Systems

To conduct clustering experiments, we selected three object oriented software systems which have been developed in Visual C++ [20]. These are proprietary software systems that run under Windows platforms. Statistical Analysis Visualization Tool (SAVT) is an application which provides functionality related to statistical data and result visualization. Printer Language Converter (PLC) is a part of another system, which provides conversion of intermediate data structures to printer language. Print Language Parser (PLP) is a parser of a well known printer language. It transforms plain text and stores output in intermediate data structures. A brief description is given in Table XXI.

TABLE XXI
BRIEF DESCRIPTION OF DATA SETS

    S.No  Description                                        PLP    SAVT   PLC
    1     Total number of source code lines                  50661  27311  51768
    2     Total number of header (.h) files                  30     70     27
    3     Total number of implementation (.cpp, .cxx) files  28     37     27
    4     Total number of classes                            72     97     69

    B. Entities and Features

Since all systems are object-oriented, we selected the class as an entity. From the different relationships that exist between classes, we selected eleven sibling (indirect) relationships [20], listed in Table XXII, since the similarity measures listed in Table II can only be applied to indirect relationships. We used these relationships because they occur frequently within object-oriented systems.

    C. Similarity Measures

To find the similarity between entities having binary features we selected the Jaccard, Jaccard-NM and Russell & Rao similarity measures. For non-binary features we selected the Unbiased Ellenberg and Information Loss measures and compared their results with our newly proposed measure, Unbiased Ellenberg-NM.

    D. Algorithms

To cluster the most similar entities we selected agglomerative clustering algorithms including Complete linkage, Weighted average and Unweighted average, described in Section III-C. We also selected the Weighted Combined Algorithm [12] and LIMBO [2].

    E. Assessment

We obtained expert decompositions for each test system and compared our automatically produced clustering results with the expert decompositions at each step of hierarchical clustering using MoJoFM [25]. Results are reported by selecting the maximum MoJoFM value obtained during the clustering process.

For internal assessment, the results obtained by the measures were evaluated by the number of arbitrary decisions taken during the clustering process.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

    A. External evaluation of results for binary features

In this section, we present experimental results of the Complete Linkage (CL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL) algorithms using the Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures.

Table XXIII and Figure 5 present the results of the comparison between automatically obtained decompositions and expert decompositions using MoJoFM. From Figure 5 one can see that in all data sets Jaccard-NM and Russell & Rao give results equal to or better than Jaccard for all algorithms. From Table XXIII and Figure 6 it can be seen that on average, Jaccard-NM and Russell & Rao produce significantly better results than the Jaccard similarity measure for all linkage algorithms.

TABLE XXII
INDIRECT RELATIONSHIPS BETWEEN CLASSES THAT WERE USED FOR EXPERIMENTS

    Name                         Description
    Same Inheritance Hierarchy   Two or more classes that are derived from the same class
    Same Class Containment       Classes contain objects of the same class
    Same Class in Methods        Classes contain objects of the same class declared in a method locally or as a parameter
    Same Generic Class           Two classes are used as instantiating parameters to the same generic class
    Same Generic Parameter       The relationship between two generic classes which have the same class as their parameter
    Same File                    The source code of two or more classes is written in the same file
    Same Folder                  Two or more classes reside in the same folder
    Same Global Function Access  Two or more classes access the same global functions
    Same Macro Access            Two or more classes access the same macro
    Same Global Variable Access  Two or more classes access the same global variable

Fig. 5. Experimental results using MoJoFM values for Complete (CL), Unweighted Average (UWAL) and Weighted Average (WAL) using Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures

TABLE XXIII
MOJOFM VALUES OF JACCARD, JACCARD-NM AND RUSSELL & RAO MEASURES FOR ALL DATA SETS AND LINKAGE ALGORITHMS

                 PLP              SAVT             PLC
           J   JNM  RR      J   JNM  RR      J   JNM  RR
CL        51    60  55     54    54  58     61    65  64
UWAL      43    46  46     49    54  55     47    50  52
WAL       46    52  54     48    53  49     56    55  55
Average   47    53  52     50    54  54     55    57  57

    B. External evaluation of results for non-binary features

Fig. 6. Average MoJoFM using Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

Figure 7 and Table XXIV show the results of applying the Weighted Combined algorithm using the Unbiased Ellenberg and Unbiased Ellenberg-NM measures, and Limbo using the Information Loss measure. Figure 7 indicates that Unbiased Ellenberg-NM gives better results as compared to the Unbiased Ellenberg and Information Loss measures. We analyze the reason for the better results of Unbiased Ellenberg-NM in the next section.

Fig. 7. MoJoFM results for Weighted Combined (WC) using Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) measures and the Information Loss (IL) measure


TABLE XXIV
EXPERIMENTAL RESULTS USING MOJOFM VALUES FOR UNBIASED ELLENBERG (UE) AND UNBIASED ELLENBERG-NM (UENM) USING WEIGHTED COMBINED ALGORITHM AND LIMBO USING INFORMATION LOSS MEASURE FOR ALL DATA SETS

       PLP               SAVT              PLC
UE  UENM  IL      UE  UENM  IL      UE  UENM  IL
70    73  74      68    74  68      68    71  67

Fig. 8. Average number of arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

    C. Internal evaluation using arbitrary decisions

Figure 8 presents the arbitrary decisions taken as a result of applying the Jaccard, Jaccard-NM and Russell & Rao measures throughout the clustering process for all test systems.

We can see from Figure 9 that in the first thirteen steps of the clustering process for PLP, in the first quarter for SAVT and in the first half for PLC, the Jaccard similarity measure results in more arbitrary decisions as compared to Jaccard-NM and Russell & Rao. This is due to entities which have a Jaccard similarity value equal to 1 while their values of a differ; these create a large number of arbitrary decisions. This is case 1, which we defined earlier and for which we proposed Jaccard-NM. In this case Russell & Rao also gives better results.
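The tie-breaking effect described above can be illustrated with a small sketch (a minimal example, not the authors' implementation; the Jaccard-NM denominator 2(a+b+c)+d is assumed from the earlier definition in [8], and the feature counts are hypothetical):

```python
# Binary feature counts for a pair of entities:
#   a = features present in both, b/c = features present in only one,
#   d = features absent from both.

def jaccard(a, b, c, d):
    """Jaccard: ignores d entirely."""
    return a / (a + b + c)

def jaccard_nm(a, b, c, d):
    # Assumed definition from [8]: a / (2(a + b + c) + d), so the
    # total feature count enters the denominator.
    return a / (2 * (a + b + c) + d)

def russell_rao(a, b, c, d):
    """Russell & Rao: a over the total number of features."""
    return a / (a + b + c + d)

# case 1: both pairs have b = c = 0, so Jaccard rates them identically
# (value 1), forcing an arbitrary choice; the other two measures prefer
# the pair sharing more features.
pair_x = (2, 0, 0, 8)  # 2 shared features out of 10
pair_y = (5, 0, 0, 5)  # 5 shared features out of 10

print(jaccard(*pair_x), jaccard(*pair_y))          # 1.0 1.0 -> tie
print(jaccard_nm(*pair_x), jaccard_nm(*pair_y))    # pair_y scores higher
print(russell_rao(*pair_x), russell_rao(*pair_y))  # pair_y scores higher
```

Note also that with d = 0, russell_rao coincides with jaccard, while jaccard_nm becomes a fixed multiple of it under the assumed definition, so all three rank entity pairs identically in that case.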

It can be seen from Figure 9 that for PLP, the number of arbitrary decisions by Russell & Rao is higher as compared to Jaccard and Jaccard-NM, while for SAVT and PLC the behavior of Jaccard-NM and Russell & Rao is almost the same. This difference in the PLP data set is due to case 3, defined in Section IV-A.

The average arbitrary decisions for the Unbiased Ellenberg, Unbiased Ellenberg-NM and Information Loss measures are presented in Figure 10. It was expected that the number of arbitrary decisions by Unbiased Ellenberg-NM would be lower than for the other similarity measures, and the experimental results confirm this expectation. We can see that Information Loss results in fewer arbitrary decisions while Unbiased Ellenberg results in more [1]. Moreover, our new measure Unbiased Ellenberg-NM results in fewer arbitrary decisions as compared to Information Loss, thus producing the best clustering results.
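For non-binary features, the tie problem that the NM variant reduces can be sketched with the Unbiased Ellenberg measure itself (assumption: the definition 0.5*Ma / (0.5*Ma + b + c) is taken from [1], where Ma sums the values of features present in both entities and b, c count features present in only one; the vectors are hypothetical):

```python
# Non-binary feature vectors: each entry is the weight (e.g. access
# count) of a feature for an entity; 0 means the feature is absent.

def unbiased_ellenberg(x, y):
    # Assumed definition from [1]: 0.5*Ma / (0.5*Ma + b + c), where Ma
    # sums the values of features present in both vectors and b, c
    # count features present in exactly one of them.
    Ma = sum(xi + yi for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    b = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi > 0)
    return 0.5 * Ma / (0.5 * Ma + b + c)

# Whenever b = c = 0 the measure is 1.0 regardless of how much weight
# the entities share, producing the ties (arbitrary decisions) that
# Unbiased Ellenberg-NM reduces by bringing the total feature count
# into the denominator.
print(unbiased_ellenberg([3, 2, 0], [1, 4, 0]))        # 1.0
print(unbiased_ellenberg([3, 0, 0, 0], [5, 0, 0, 0]))  # 1.0
print(unbiased_ellenberg([3, 2, 0], [1, 0, 2]))        # < 1.0
```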

Fig. 9. Experimental results for arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

Fig. 10. Average number of arbitrary decisions using Weighted Combined (WC) with Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) and Limbo using Information Loss (IL) measure

Thus from our analysis and experimental results we conclude that:

• When the feature vector has d = 0, Jaccard-NM and Russell & Rao become equal to the Jaccard measure.
• Russell & Rao depends on a only.
• Jaccard-NM and Russell & Rao produce better clustering results as compared to Jaccard by reducing arbitrary decisions.
• Unbiased Ellenberg-NM substantially decreases the number of arbitrary decisions as compared to Unbiased Ellenberg and Information Loss for non-binary features, producing significantly better clustering results.

VII. CONCLUSIONS

Various binary and non-binary similarity measures have been used during clustering for software architecture recovery, and each has its own characteristics. Previous research suggests that similarity measures which do not consider the absence of features, d, perform well for software clustering, while those that include d do not. Amongst the measures not containing d, the Jaccard measure produces the best results.

In this paper, we analyzed the performance of the Jaccard measure (which does not contain d), and the Jaccard-NM and Russell & Rao measures (which contain d), using various cases that may arise in the feature matrix of a software system. We identified deficiencies of the Jaccard measure and showed how Jaccard-NM and Russell & Rao give better results than Jaccard. This is because they use d not to determine similarity, but to determine the proportion of common to total features. We also showed how Jaccard-NM is capable of reducing arbitrary decisions, which may be problematic during the clustering process.

We also defined the non-binary counterpart of Jaccard-NM, the Unbiased Ellenberg-NM, and compared its performance with the Unbiased Ellenberg and Information Loss measures. Similar to Jaccard-NM, it reduces arbitrary decisions and results in better clusters.

In the future, it will be interesting to evaluate the performance of the Jaccard-NM, Russell & Rao and Unbiased Ellenberg-NM measures on other systems.

    ACKNOWLEDGMENT

The authors would like to thank Mr. Abdul Qudus Abbasi for providing the software test systems.

REFERENCES

[1] O. Maqbool and H. A. Babri, "Hierarchical clustering for software architecture recovery," IEEE Trans. Software Eng., vol. 33, no. 11, pp. 759-780, November 2007.

[2] P. Andritsos and V. Tzerpos, "Information-theoretic software clustering," IEEE Trans. Software Eng., vol. 31, no. 2, pp. 150-165, February 2005.

[3] C. Tjortjis, L. Sinos, and P. Layzell, "Facilitating program comprehension by mining association rules from source code," Proc. Int'l Workshop Program Comprehension, pp. 125-132, May 2003.

[4] P. Tonella, "Concept analysis for module restructuring," IEEE Trans. Software Eng., vol. 27, pp. 351-363, April 2001.

[5] M. Consens, A. Mendelzon, and A. Ryman, "Visualizing and querying software structures," Proc. of the Int'l Conference on Software Engineering (ICSE), vol. 133, pp. 138-156, May 1992.

[6] N. Anquetil and T. C. Lethbridge, "Experiments with clustering as a software remodularization method," Proc. Working Conference Reverse Engineering (WCRE), pp. 235-255, 1999.

[7] J. Davey and E. Burd, "Evaluating the suitability of data clustering for software remodularization," Proc. Working Conf. Reverse Eng., pp. 268-276, November 2000.

[8] R. Naseem, O. Maqbool, and S. Muhammad, "An improved similarity measure for binary features in software clustering," Proc. of the Int'l Conference on Computational Intelligence, Modelling and Simulation (CIMSim), pp. 111-116, September 2010.

[9] S.-S. Choi, S.-H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43-48, 2010.

[10] N. Anquetil and T. Lethbridge, "Comparative study of clustering algorithms and abstract representations for software remodularisation," IEE Proceedings - Software, vol. 150, no. 3, pp. 185-201, 2003.

[11] M. Saeed, O. Maqbool, H. A. Babri, S. Hassan, and S. Sarwar, "Software clustering techniques and the use of combined algorithm," Proc. Int'l Conf. Software Maintenance and Reeng., pp. 301-306, March 2003.

[12] O. Maqbool and H. A. Babri, "The weighted combined algorithm: a linkage algorithm for software clustering," Proc. Int'l Conf. Software Maintenance and Reeng., pp. 15-24, 2004.

[13] B. S. Mitchell and S. Mancoridis, "On the automatic modularization of software systems using the Bunch tool," IEEE Trans. Software Eng., vol. 32, no. 3, pp. 193-208, March 2006.

[14] S. Mancoridis, B. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner, "Using automatic clustering to produce high-level system organizations of source code," Proc. 6th Int'l Workshop on Program Comprehension, pp. 45-53, 1998.

[15] S. Mancoridis, B. Mitchell, Y. Chen, and E. Gansner, "Bunch: A clustering tool for the recovery and maintenance of software system structures," IEEE Int'l Conference on Software Maintenance, p. 50, 1999.

[16] B. S. Mitchell and S. Mancoridis, "Using heuristic search techniques to extract design abstractions from source code," Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1375-1382, 2002.

[17] M. Harman, S. Swift, and K. Mahdavi, "An empirical study of the robustness of two module clustering fitness functions," Proc. Genetic and Evolutionary Computation Conference, pp. 1029-1036, June 2005.

[18] P. Willett, "Similarity-based approaches to virtual screening," Biochemical Society Transactions, vol. 31, no. 3, pp. 603-606, June 2003.

[19] S. Dalirsefat, A. da Silva Meyer, and S. Mirhoseini, "Comparison of similarity coefficients used for cluster analysis with amplified fragment length polymorphism markers in the silkworm, Bombyx mori," Journal of Insect Science, vol. 71, pp. 1-8, 2009.

[20] A. Q. Abbasi, "Application of appropriate machine learning techniques for automatic modularization of software systems," MPhil thesis, Quaid-e-Azam University Islamabad, 2008.

[21] Z. Wen and V. Tzerpos, "Evaluating similarity measures for software decompositions," Proc. Int'l Conf. Software Maintenance, pp. 368-377, September 2004.

[22] N. Anquetil, C. Fourier, and T. C. Lethbridge, "Experiments with hierarchical clustering algorithms as software remodularization methods," Proc. Working Conf. Reverse Eng., 1999.

[23] Y. Kanellopoulos, P. Antonellis, C. Tjortjis, and C. Makris, "k-attractors: A clustering algorithm for software measurement data analysis," Proc. 19th IEEE Int'l Conference on Tools with Artificial Intelligence, pp. 358-365, 2007.

[24] A. Lakhotia, "A unified framework for expressing software subsystem classification techniques," Journal of Systems and Software, vol. 36, pp. 211-231, 1997.

[25] Z. Wen and V. Tzerpos, "An effectiveness measure for software clustering algorithms," Proc. Int'l Workshop Program Comprehension, pp. 194-203, June 2004.

[26] M. Shtern and V. Tzerpos, "A framework for the comparison of nested software decompositions," Proc. of the 11th IEEE Working Conf. Reverse Engineering, pp. 284-292, 2004.
