mining features from the object-oriented source code of software variants by combining lexical and...

33
Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity R. AL-msie’deen + , A.-D. Seriai + , M. Huchard + , C. Urtado * , S. Vauttier * and Hamzeh Eyal-Salman + + LIRMM / CNRS & Montpellier 2 University, France Al-msiedee, Seriai, huchard, [email protected] * LGI2P / Ecole des Mines d’Al` es, Nˆ ımes, France Christelle.Urtado, [email protected] December 14, 2013 R. AL-msie’deen et al. Mining Features December 14, 2013 1 / 33

Upload: rafat-al-msiedeen

Post on 10-May-2015

56 views

Category:

Software


0 download

DESCRIPTION

Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

TRANSCRIPT

Page 1: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Mining Features from the Object-Oriented Source Codeof Software Variants by Combining Lexical and

Structural Similarity

R. AL-msie’deen+, A.-D. Seriai+, M. Huchard+,C. Urtado∗, S. Vauttier∗ and Hamzeh Eyal-Salman+

+LIRMM / CNRS & Montpellier 2 University, FranceAl-msiedee, Seriai, huchard, [email protected]

∗LGI2P / Ecole des Mines d’Ales, Nımes, FranceChristelle.Urtado, [email protected]

December 14, 2013

R. AL-msie’deen et al. Mining Features December 14, 2013 1 / 33

Page 2: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Outline

1 The context and the issue

2 Our goals and the main hypotheses

3 Our approach: The main ideas

4 Formal Concept Analysis (FCA)

5 Latent Semantic Index (LSI)

6 Structural similarity

7 The combined similarity (i.e., lexical + structural)

8 The Feature Mining Process in a Nutshell

9 Experimentation and results

10 Conclusion

11 Perspectives

12 References

R. AL-msie’deen et al. Mining Features December 14, 2013 2 / 33

Page 3: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The context (1/3): Software Product Variants

1 Are similar software:

Share some features, called common features, and differ in others, calledoptional featuresDeveloped by ad-hoc reuse techniques such as clone-ownUsually, an existing product is copied and later modified to meet incre-mental demands of customersExample: Mobile Media, Health Watcher and Wingsoft Financial Man-agement Systems

R. AL-msie’deen et al. Mining Features December 14, 2013 3 / 33

Page 4: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The context (2/3): Software Product Line

A software product line (SPL) is a family of related program variantsthat share a common code base, such as: ArgoUML-SPL

Software Product Line process consists of: i) domain engineering (coreassets + feature model) and ii) application engineering (product con-figuration)

Figure 1 : Software Product Lines process

R. AL-msie’deen et al. Mining Features December 14, 2013 4 / 33

Page 5: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The context (3/3): Feature and Feature Models

A Feature is a system property relevant to some stakeholder used tocapture commonalities or variations among systems in a family

Feature model (FM) is a tree-like graph of features and relationshipsamong them.

Figure 2 : Text Editor Feature Model

R. AL-msie’deen et al. Mining Features December 14, 2013 5 / 33

Page 6: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The Issue:

1 Software Product Variants:

Difficulties For: reuse, maintenance, comprehension, impact analysis

1 Software Product Line:Design from scratch is a hard task (domain engineering)

What are the common /optional features (i.e., source code) for thesesoftware variants?

R. AL-msie’deen et al. Mining Features December 14, 2013 6 / 33

Page 7: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Our Goals:

1 Reengineering existing software variants into a Software Product Line:Benefits:

1 Software variants will be managed as a product line2 Software Product Line will be engineered started from existing software

variants (not from scratch)

Strategy:1 Feature model mining (reverse engineering step)

(i.e., mining features, group of features and FM constraints)2 Source code Framework generation (reengineering step)

R. AL-msie’deen et al. Mining Features December 14, 2013 7 / 33

Page 8: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The main hypotheses

1 Functional features are implemented using object-oriented building el-ements (OBEs); e.g., packages, classes, attributes, methods, etc.

2 We consider that a feature corresponds to only one set of OBEs3 We consider that feature implementations may overlap (i.e., junction)4 The feature has the same implementation in all product variants where

it is present5 We are not interested in OBEs for the model elements that represent

the platform used to execute the product variant.

R. AL-msie’deen et al. Mining Features December 14, 2013 8 / 33

Page 9: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Our approach: The main ideas

1 Our proposal consists in dividing the OBE set in specific subsets: thecommon feature set also called common block (CB) and several op-tional feature sets (Block of Variation, denoted as BVs)

2 Optional (resp. common) features appear in some but not all (resp. all)variants, they are implemented by OBEs that appear in some but notin all (resp. all) variants

R. AL-msie’deen et al. Mining Features December 14, 2013 9 / 33

Page 10: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Our approach: The main ideas

1 Based on the lexical and structural similarity between the OBEs weidentify atomic blocks of variation (ABV) (i.e., optional feature) amongsteach BV. A BV is thus composed of several ABVs

2 The same process applies for common features: we identify commonatomic blocks (CAB) from CB based on the lexical and structural sim-ilarity between the OBEs. A CB is thus composed of several CABs

R. AL-msie’deen et al. Mining Features December 14, 2013 10 / 33

Page 11: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Our approach: The Mapping Model

R. AL-msie’deen et al. Mining Features December 14, 2013 11 / 33

Page 12: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Formal Concept Analysis

1 The technique used to identify the CB and BVs relies on FCA

2 FCA is a classification technique that takes data sets of objects (soft-ware variants) and their attributes (OBEs), and extracts relations be-tween these objects according to the attributes they share

Figure 3 : The Formal Context (left) and AOC-poset (right) for Text EditorVariants.

R. AL-msie’deen et al. Mining Features December 14, 2013 12 / 33

Page 13: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Latent Semantic Indexing (1/2):

1 LSI assumes that all software artifacts are in textual format. Then, itcomputes the lexical similarity between two software artifacts based onthe cosine similarity matrix

2 If two documents share a large number of terms, those documents areconsidered to be similar

Table 1 : Cosine Similarity Matrix

Doc

umen

tN

ame

Cla

ss(S

tate

)

Cla

ss(t

rig

ger

Lis

t)

Cla

ss(n

oti

fyT

rig

ger

s)

Cla

ss(T

rig

ger

s)

Cla

ss(D

ata

ba

seS

tate

)

Class (State) 1 0.6173 0.0064 -0.1261 0.9997Class (triggerList) 0.6173 1 0.7906 0.7025 0.6350Class (notifyTriggers) 0.0064 0.7906 1 0.9911 0.0292Class (Triggers) -0.1261 0.7025 0.9911 1 -0.1035Class (DatabaseState) 0.9997 0.6350 0.0292 -0.1035 1

R. AL-msie’deen et al. Mining Features December 14, 2013 13 / 33

Page 14: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Latent Semantic Indexing (2/2):

1 The lexical (i.e., textual) similarity between classes of a CB/BVs rep-resented using the Lexical Similarity Matrix (LSM)

2 To transform the (numerical) Cosine similarity matrices into (binary)formal contexts, we use a threshold 0.70

3 In the LSM, only pairs of OBEs having a calculated similarity greaterthan or equal to 0.70 are considered similar

Table 2 : Lexical Similarity Matrix

# Class name 1 2 3 4 5 6 71 Class (ArcSettings Drawing.Shapes.Arc) × × ×2 Class (Arc Drawing.Shapes.Arc) × × ×3 Class (ArcAngle Drawing.Shapes.Arc) × × ×4 Class (MyTextShape Drawing.Shapes.Text) × × × ×5 Class (Text Drawing.Shapes.Text) × × × ×6 Class (TextInfo Drawing.Shapes.Text) × × × ×7 Class (MyText Drawing.Shapes.Text) × × × ×

R. AL-msie’deen et al. Mining Features December 14, 2013 14 / 33

Page 15: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Structural similarity (1/3)

I In our work, we consider the following main dependencies between classes(i.e., coupling between classes):

1 Inheritance coupling

2 Method invocation coupling

3 Composition coupling

4 Attribute access coupling

5 Combined coupling

R. AL-msie’deen et al. Mining Features December 14, 2013 15 / 33

Page 16: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Structural similarity (2/3)

I For two OBEs the structural similarity is based on coupling

I Two OBEs (i.e., classes) are structurally similar depends on the degreeof coupling between these OBEs

I Thus structural similarity between each pair of OBEs is computed basedon the coupling measures

I Structural similarity is used to capture and represent the dependenciesbetween classes of a block (i.e., CB/BVs) based on the coupling measures.

R. AL-msie’deen et al. Mining Features December 14, 2013 16 / 33

Page 17: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Structural similarity (3/3)

I The structural similarity between classes of a CB/BVs represented usingthe Dependency Structure Matrix (DSM)

Table 3 : Dependency Structure Matrix (DSM)

# Class name 1 2 3 4 5 6 71 Class (ArcSettings Drawing.Shapes.Arc) × ×2 Class (Arc Drawing.Shapes.Arc) × ×3 Class (ArcAngle Drawing.Shapes.Arc) × ×4 Class (MyTextShape Drawing.Shapes.Text) ×5 Class (Text Drawing.Shapes.Text) × × ×6 Class (TextInfo Drawing.Shapes.Text) ×7 Class (MyText Drawing.Shapes.Text) ×

I The OBE ”Class (ArcSettings Drawing.Shapes.Arc)” is linked to the OBE”Class (ArcAngle Drawing.Shapes.Arc)” because there is a structural linkbetween these two classes (i.e., inheritance coupling).

R. AL-msie’deen et al. Mining Features December 14, 2013 17 / 33

Page 18: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The combined similarity (i.e., lexical + structural):

1 To combine both lexical and structural similarity between OBEs inthe common block or blocks of variation we introduce what we call aCombined Matrix (CM)

2 The combined matrix represents both DSM and LSM between a set ofclasses from a CB/BVs

3 The combined matrix also constitutes the formal context which is usedas input for applying FCA to extract the atomic block (i.e., feature)

Table 4 : The Combined Matrix

# Class name 1 2 3 4 5 6 71 Class (ArcSettings Drawing.Shapes.Arc) × × ×2 Class (Arc Drawing.Shapes.Arc) × × ×3 Class (ArcAngle Drawing.Shapes.Arc) × × ×4 Class (MyTextShape Drawing.Shapes.Text) × × × ×5 Class (Text Drawing.Shapes.Text) × × × ×6 Class (TextInfo Drawing.Shapes.Text) × × × ×7 Class (MyText Drawing.Shapes.Text) × × × ×

R. AL-msie’deen et al. Mining Features December 14, 2013 18 / 33

Page 19: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The Feature Mining Process in a Nutshell (1/2)

Figure 4 : The Feature Mining Process

R. AL-msie’deen et al. Mining Features December 14, 2013 19 / 33

Page 20: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

The Feature Mining Process in a Nutshell (2/2)

1 Identifying OBEs via Abstract Syntax Tree (AST) parser2 Identifying the CB and BVs via FCA3 Identifying the CAB and the ABV (i.e., features) from CB/BVs via:

Measuring the lexical similarity between OBEs (i.e., LSI)Measuring the structural similarity between OBEsCombining lexical and structural similarity between OBEs

4 Identifying CAB and ABV from CB/BVs via FCA

Figure 5 : Atomic blocks (i.e., two features) mined from single BV

R. AL-msie’deen et al. Mining Features December 14, 2013 20 / 33

Page 21: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Experimentation and results (1/3): Software Variants

R. AL-msie’deen et al. Mining Features December 14, 2013 21 / 33

Page 22: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Experimentation and results (2/3): Feature mining result

R. AL-msie’deen et al. Mining Features December 14, 2013 22 / 33

Page 23: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Experimentation and results (3/3): Evaluation Measures

1 In order to evaluate our approach and based on our knowledge aboutsoftware variants and their features (i.e., OBEs for each feature) wehave used three measures: precision, recall and F-Measure

2 Recall is the percentage of correctly retrieved OBEs to the total numberof relevant OBEs

3 Precision is the percentage of correctly retrieved OBEs to the totalnumber of retrieved OBEs

4 F-Measure defines a tradeoff between precision and recall so that itgives a high value only in case where both recall and precision are high

5 All measures have values in [0, 1]

R. AL-msie’deen et al. Mining Features December 14, 2013 23 / 33

Page 24: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Reverse engineering FMs from program configurations

Table 5 : Feature Sets of ArgoUML SPL

Class Diagrams Use case Collabora. Cognitive S. Activity Deployment Sequence StateP1

√ √ √ √ √ √ √ √ √

P2√

P3√ √ √ √ √ √ √ √ √

P4√ √ √ √ √ √ √ √

P5√ √ √ √ √ √ √ √

P6√ √ √ √ √ √ √ √

P7√ √ √ √ √ √ √ √

P8√ √ √ √ √ √ √ √

P9√ √ √ √ √ √ √ √

P10√ √ √ √ √ √ √ √

Figure 6 : ArgoUML FM

R. AL-msie’deen et al. Mining Features December 14, 2013 24 / 33

Page 25: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Lexical vs. the combined approach

♣ When we combine both lexical and structural similarity we find that allevaluation measures now have values quite higher than lexical similarityalone (especially true for recall measure)

Table 6 : Comparing the two approaches (i.e., lexical + combined approach)

Lexical similarity Lexical and structural similarityMeasure / Average Mobile Media ArgoUML Mobile Media ArgoUMLPrecision 87% 97% 96% 97%Recall 66% 67% 100% 100%F-Measure 75% 79% 98% 98%

R. AL-msie’deen et al. Mining Features December 14, 2013 25 / 33

Page 26: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Related work (1/2):

Rubin et al. [Rubin and Chechik, 2012] present an approach to locateoptional features from two product variants’ source code. They do notconsider common features and limit their proposal to two variants.

The approach proposed by Ziadi et al. [Ziadi et al., 2012] is the closestone. They identify all common features as a single mandatory feature.However, they do not distinguish between optional features that appeartogether in a set of variants.

R. AL-msie’deen et al. Mining Features December 14, 2013 26 / 33

Page 27: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Related work (2/2):

McMillan et al. [McMillan et al., 2009] propose an approach to identifytraceability links between source code and documentation by combiningboth textual and structural analysis.

Duszynski et al. [Duszynski et al., 2011] describe a framework for theanalysis and visualization of similarities (i.e., commonalities) and varia-tions (i.e., variabilities) across related software variants based on clonedetection.

R. AL-msie’deen et al. Mining Features December 14, 2013 27 / 33

Page 28: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Conclusion (1/3):

In our first work1; we present an approach for Feature location ina collection of software product variants based on FCA. We distin-guish between the common block (i.e., CB) and the blocks of variation(i.e., BVs).

1Feature location in a collection of software product variants using formal conceptanalysis, Proceedings of the 13th International Conference on Software Reuse - ”Safeand Secure Reuse”, pages 302 307, Springer, 2013.

R. AL-msie’deen et al. Mining Features December 14, 2013 28 / 33

Page 29: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Conclusion (2/3):

In our second work2, we extended our previous work by splitting blocksof source code elements (i.e., CB/BVs) based on the lexical similaritybetween OBEs.

2Mining Features from the Object-Oriented Source Code of a Collection of SoftwareVariants Using Formal Concept Analysis and Latent Semantic Indexing, Proceedings ofthe 25th International Conference on Software Engineering and Knowledge Engineering,USA, 2013.

R. AL-msie’deen et al. Mining Features December 14, 2013 29 / 33

Page 30: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Conclusion (3/3):

Finally, we extended our previous works by splitting blocks of sourcecode elements (i.e., CB/BVs) based on the lexical and structuralsimilarity between OBEs.

R. AL-msie’deen et al. Mining Features December 14, 2013 30 / 33

Page 31: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

Perspectives:

Enhance the quality of the mining process

1 Identify junctions between features

2 More reducing of the search space

3 Automatically propose feature names for the atomic blocks

4 etc.

Feature model mining

1 Mining Features

2 Mining feature model constraints

3 Mining feature model structure (i.e., group of features/FM hierarchy)

R. AL-msie’deen et al. Mining Features December 14, 2013 31 / 33

Page 32: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

References I

Duszynski, S., Knodel, J., and Becker, M. (2011).Analyzing the Source Code of Multiple Software Variants for ReusePotential.In 18th WCRE, pages 303–307. IEEE.

McMillan, C., Poshyvanyk, D., and Revelle, M. (2009).Combining textual and structural analysis of software artifacts fortraceability link recovery.In TEFSE ’09 Workshop, pages 41–48. IEEE.

Rubin, J. and Chechik, M. (2012).Locating distinguishing features using diff sets.In 27th ASE Conference, ASE 2012, pages 242–245. ACM.

Ziadi, T., Frias, L., da Silva, M. A. A., and Ziane, M. (2012).Feature identification from the source code of product variants.In CSMR’2012, pages 417–422.

R. AL-msie’deen et al. Mining Features December 14, 2013 32 / 33

Page 33: Mining Features from the Object-Oriented Source Code of Software Variants by Combining Lexical and Structural Similarity

J Thank You For Your Attention I

Mining Features from the Object-Oriented Source Codeof Software Variants by Combining Lexical and

Structural Similarity

R. AL-msie’deen+, A.-D. Seriai+, M. Huchard+,C. Urtado∗, S. Vauttier∗ and Hamzeh Eyal-Salman+

+LIRMM / CNRS & Montpellier 2 University, FranceAl-msiedee, Seriai, huchard, [email protected]

∗LGI2P / Ecole des Mines d’Ales, Nımes, FranceChristelle.Urtado, [email protected]

December 14, 2013

R. AL-msie’deen et al. Mining Features December 14, 2013 33 / 33