SEMANTIC BASED INFORMATION MINING TO IMPROVE THE QUALITY
OF IDENTIFIERS IN UML MODEL
SYNOPSIS
1. INTRODUCTION
To develop quality software, the software must be designed according to its
requirements. UML diagrams [4] are an ideal choice for software developers who
need to demonstrate and deduce the relationships, actions, and connections of a software
application using the Unified Modeling Language (UML) notation. The software
designer must go through the software requirements specification (SRS) and extract
data for the UML models. This can be achieved efficiently by employing the N-gram
Algorithm [21], [33] and the Statistical Substring Algorithm [34]. However, the N-gram
Algorithm uses Comb Sort [30] for sorting the word n-grams, and the Statistical Substring
Algorithm uses Radix Sort [15] for sorting the set of strings. It is found that the
efficiency of both algorithms can be improved by utilizing Yaroslavskiy's Dual-Pivot
Quicksort Algorithm [32], which has recently been adopted as the standard sorting
method for Oracle's Java 7 Runtime Library [27], [28].
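As a rough illustration of the dual-pivot idea, the following sketch partitions around two pivots per recursive step. This is a simplified teaching version, not Oracle's tuned Java 7 implementation (which adds pivot sampling and insertion-sort cutoffs).

```python
def dual_pivot_quicksort(a, lo=0, hi=None):
    """Sort list `a` in place using two pivots per partitioning step."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    # Use the endpoints as pivots, ordered so that p1 <= p2.
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p1, p2 = a[lo], a[hi]
    lt, gt, i = lo + 1, hi - 1, lo + 1
    while i <= gt:
        if a[i] < p1:                      # belongs to the left part
            a[i], a[lt] = a[lt], a[i]
            lt += 1
        elif a[i] > p2:                    # belongs to the right part
            while a[gt] > p2 and i < gt:
                gt -= 1
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
            if a[i] < p1:
                a[i], a[lt] = a[lt], a[i]
                lt += 1
        i += 1
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]            # pivots into their final slots
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)    # three-way recursion
    dual_pivot_quicksort(a, lt + 1, gt - 1)
    dual_pivot_quicksort(a, gt + 1, hi)
    return a
```

Two pivots split each range into three parts, which on typical inputs reduces the number of comparisons and memory transfers relative to classic single-pivot quicksort.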
From a different perspective, it has been highlighted that UML model textual
properties, in particular the usage of proper identifiers [17], [18], [19], [31], are also an
important indicator of software quality. Early actions for quality improvement on
UML Models are less resource intensive, and, hence, less cost intensive than later
actions [6]. Marcus et al. [23], [24] propose a new cohesion metric (conceptual
cohesion), which is complementary to structural cohesion, that exploits Latent
Semantic Indexing (LSI) [12] to compute the overlap of semantic information in a
class expressed in terms of textual similarity among methods.
Another scenario in which the quality of identifiers and their consistency with
the lexicon of high-level artifacts play an important role is Information Retrieval
(IR)-based traceability recovery [1], [2], [5], [7], [11], [14], [20], [22], [29]. Such
approaches work under the assumption that, if a UML model artifact (e.g., a class
diagram) is textually similar to a high-level artifact (e.g., a requirement), then it is very
likely that there exists a traceability link between them. Also, well-known books (e.g.,
[13]) advocate the usefulness of a shared lexicon across software artifacts.
Clearly, when the system architect does not use identifiers consistently with high-
level artifacts, the aforementioned traceability recovery approaches fail. The presence
of meaningless identifiers in system models also implies that human tasks aimed at
understanding the model or at recovering traceability links in the context of a
maintenance task become more difficult and error-prone [9], [10], [19], [31].
According to previous studies [19], [31], producing system models (UML) with
more meaningful identifiers would improve system comprehensibility and
maintainability. Moreover, the use of IR techniques in traceability recovery to measure
the similarity between the text contained in the UML models and the domain terms
contained in high-level software artifacts suggests that these techniques can also be
used to improve identifiers during software development and thus increase such similarity.
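As a minimal illustration of the idea, the sketch below scores a model's lexicon against a requirement using a plain term-frequency cosine. The artifact texts are invented for illustration, and the thesis itself uses LSI (described later) rather than this raw-vector measure.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine between the raw term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical artifacts: one requirement and two candidate identifier sets.
requirement = "the user enters first name and last name to register"
good_model  = "user firstName lastName register"
poor_model  = "obj attr1 attr2 doIt"
```

A model whose identifiers reuse the requirement's vocabulary scores higher than one built from opaque names, which is exactly the signal a traceability recovery tool exploits.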
In this work, firstly, the researcher proposes an IR-based approach aimed at
showing the textual similarity between the UML model under development and related
high-level artifacts. The researcher's conjecture is that developers are induced to
improve the UML model lexicon, i.e., the terms used in diagrams, if the software
development environment provides information about the textual similarity between
the model under development and the related high-level artifacts.
The suggestion provided to the analyzer/designer might induce them to take
different actions, such as making the model identifiers more consistent with domain
terms. To give further support to the analyzers/designers, the proposed approach also
recommends candidate identifiers and semantic identifiers built from high-level
artifacts related to the model. It also recommends a list of nouns, adjectives, verbs, etc.,
available in the high-level artifacts, which are used for building the models.
Secondly, for the similarity evaluation process and for suggesting candidate
identifiers for the UML model, the data has been extracted from a raw corpus. To do
that, the researcher proposes two algorithms, namely the Improved N-gram Extraction
(INGE) Algorithm and the Improved Substring Removal (ISR) Algorithm. In both
proposed algorithms, efficiency is improved by using the Dual-Pivot Quicksort
Algorithm and by eliminating substrings with equal frequency. Therefore, the proposed
algorithms have a time complexity of O(n log n), where n is the number of words in
the input file. Using these algorithms, it is evident that the automatic extraction or
determination of words, compound words and collocations are useful for designing
software models.
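The substring-elimination idea can be illustrated as follows: an n-gram is discarded when some longer n-gram contains it with the same frequency, since it then carries no independent information. This is a naive quadratic sketch of the principle only; the proposed ISR algorithm achieves O(n log n) by sorting (e.g., with dual-pivot quicksort) rather than by pairwise comparison.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous n-word sequences, joined with single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def reduce_substrings(text, max_n=3):
    """Keep an n-gram only if no longer n-gram contains it with equal frequency."""
    words = text.lower().split()
    freq = Counter(g for n in range(1, max_n + 1) for g in ngrams(words, n))
    kept = {}
    for gram, count in freq.items():
        redundant = any(
            gram != longer
            and f" {gram} " in f" {longer} "   # word-boundary containment
            and freq[longer] == count
            for longer in freq
        )
        if not redundant:
            kept[gram] = count
    return kept
```

For example, in the text "use case use case diagram", the unigram "use" always occurs inside "use case" with the same frequency, so only the longer collocation survives as a candidate term.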
Finally, the researcher has developed an Eclipse plug-in, named UMLHelper,
which implements the proposed approach. Its evaluation has been carried out through
two controlled experiments involving master's and bachelor's students, where the students
were asked to perform system architect tasks with and without the availability of UMLHelper
features. Then, the researcher evaluated the quality of the produced UML
models in terms of similarity with high-level artifacts and also through a peer review
process involving multiple inspectors.
The analysis of the achieved results confirms the conjecture that providing
the analysts/designers with the similarity between the model and high-level artifacts
helps to significantly improve the quality of system model identifiers; the quality
further increases when the architect receives suggestions about candidate identifiers.
2. SCOPE OF THE RESEARCH
The approach presented in this thesis relates to approaches aimed at applying IR
techniques for traceability recovery and for quality improvement or assessment. The
work is also related to approaches/tools aimed at analyzing or improving the quality of
system model (UML Model) identifiers.
The literature survey covers concepts such as inconsistency in
UML models, IR-based traceability recovery, IR-based artifact quality improvement,
N-gram extraction versus statistical substring reduction, and source code quality
assessment. The summary of the survey shows the importance of the system models
and their elements created in the object-oriented software engineering process. It also
highlights the inconsistency present in system models and specifies that system model
elements need to be created consistently with high-level artifacts. To that end, the
survey indicates the need for a better IR-based traceability recovery method and for
IR-based artifact quality improvements, as well as for better N-gram extraction and
Statistical Substring Reduction algorithms, which are used in the similarity evaluation
process between system models and high-level artifacts. Thus, the researcher
concentrates on the similarity between system models and high-level artifacts, on
natural language processing to extract valid information for constructing an optimized
system model, and on ways to make the system model consistent with high-level
artifacts and to improve its quality.
3. THE PROBLEM FORMULATION
The main objectives of the proposed work are:
1. An approach that helps the system architect keep system model text terms
consistent with high-level artifacts. Specifically, the approach computes and
shows the textual similarity between system model identifiers and related high-
level artifacts.
2. The proposed approach also recommends candidate identifiers built from high-
level artifacts related to the system model under development. To do that, the
researcher has proposed the Improved N-gram Extraction (INGE) Algorithm and
the Improved Substring Removal (ISR) Algorithm for extracting data from a raw
corpus. In both proposed algorithms, efficiency is improved by using the
Dual-Pivot Quicksort Algorithm and by eliminating substrings with equal
frequency.
3. It also recommends a list of nouns, adjectives, verbs, etc., available in the high-
level artifacts as tagged identifiers, which are used for building the models.
4. It also recommends semantic identifiers, which are also helpful for building the
models.
5. The proposed approach has been implemented as an Eclipse plug-in.
6. The work also reports on two controlled experiments performed with master's
and bachelor's students. The goal of the experiments is to evaluate the quality
of identifiers (in terms of their consistency with high-level artifacts) in the
models produced when using or not using the developed plug-in.
7. A questionnaire has been collected from both categories of students after
completing the experiments. The achieved results confirm the conjecture that
providing the analyzers/designers with the similarity feature and the identifier
suggestion feature helps to improve the quality of the system model lexicon. This
indicates the potential usefulness of the developed plug-in as a feature for software
development environments.
4. PROPOSED METHODOLOGY
This section describes the proposed approach for improving the quality of the text
artifacts used in various UML models during software development. The proposed
approach is based on the assumption that system analyzers/designers are induced to
make the system models and its identifiers more consistent with domain terms if the
software development environment provides information about the textual similarity
between the system model being drawn and the related high-level artifacts. Clearly, the
proposed approach is based on the assumption that high-level documentation like
System Requirement Specification (SRS) and module specification is available during
the development process. Figure 1 shows the flow of information between a software
architect and the Integrated Development Environment (IDE) in the proposed
approach.
[Figure 1 shows the following components and their information flow: System
Requirements Specification documents; term extraction; term filtering and
transformation; indexing; natural language processing; UML models; identifier
composing; textual comparison; the Ontology Inference Service; and the software
architect.]

Fig.1 Software model lexicon improvement through similarity information and
identifier suggestion

4.1 Similarity between System Model and High-Level Artifacts
When system architects model a system using UML, they can be continuously
informed about the quality of the model identifiers in terms of their similarity with the
text contained in the related higher-level software artifacts. To this end, the system
architect selects the high-level artifacts onto which the system model should be traced,
and the IDE shows the similarity between the model under development and the
selected high-level artifacts.
The textual similarity between the system model and the related high-level
artifacts is computed by using an IR-based approach. In general, an IR method [3]
compares a given query against all the documents in a collection by computing the
textual similarity between these documents and the query. In this case, the query is the
text contained in the system model being written, while the documents are the related
high-level software artifacts, for example, requirements or module specification. To
compute the similarity, both the model and the high-level artifacts are indexed. The
indexing process is preceded by a term extraction phase and a term filtering and
transformation phase (see Fig. 1).
In particular, the latter phase aims at:
1. Removing non-textual items, e.g., UML notations, Numbers and punctuation;
2. Removing stop words using a stop word removal function which removes words
having a length less than a fixed threshold (we fixed this threshold to 3, as suggested in
[3]), and also removing words belonging to a stop word list (i.e., articles, adverbs, etc.)
[3].
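A minimal sketch of this filtering phase follows; the stop-word list below is a small illustrative sample, and the length threshold of 3 matches the choice stated above.

```python
import re

STOP_WORDS = {"the", "and", "for", "are", "that", "with"}  # illustrative list only

def filter_terms(raw_text, min_length=3):
    """Drop non-textual tokens, words shorter than the threshold,
    and words on the stop list."""
    tokens = re.findall(r"[A-Za-z]+", raw_text)     # strips numbers/punctuation
    return [t.lower() for t in tokens
            if len(t) >= min_length and t.lower() not in STOP_WORDS]
```

Running it on a short artifact fragment such as "The user-ID is 42, and the name!" keeps only the content-bearing terms.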
It would be easy to integrate into the approach a stemming phase [26] aimed at
extracting stems from words, e.g., removing plurals, bringing verb forms to infinitive,
etc. However, the IR method used in the proposed approach, namely LSI [12], has previously
proven to work well without stemming [22]; therefore, in the current
implementation, the researcher does not use stemming. The indexing process and the
term comparison phase depend on the particular IR method adopted. In this case, the
extracted information is stored in an m × n matrix (called term-by-document matrix),
where m is the size of the union of terms used by the artifacts (i.e., the vocabulary
size) and n is the number of artifacts in the repository.
Once the term-by-document matrix has been built, the researcher uses LSI [12]
to compute the textual similarity between the model and the related high-level
documentation. Such a technique applies Singular Value Decomposition (SVD) [8] to
derive a set of uncorrelated indexing factors (concepts) from the term-by-document
matrix. In other words, the analysis is moved from the term-by-document space to the
concept-by-document space. In this new space, the similarity between a query and a
document is computed using the vector space cosine similarity measure [3]. The
researcher has decided to use LSI to limit problems related to 1) dependency between
terms, 2) homonymy, and 3) polysemy [12].
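A compact sketch of the LSI step on a toy term-by-document matrix follows; the terms, documents, and value of k are invented for illustration (a real artifact space would be far larger), and this assumes NumPy is available.

```python
import numpy as np

def lsi_similarity(term_doc, query_vec, k=2):
    """Project the documents and a query into a k-concept space via a
    truncated SVD, then return the query's cosine similarity to each
    document in that space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    docs_k = (np.diag(sk) @ Vtk).T               # document coordinates
    q_k = query_vec @ Uk @ np.diag(1.0 / sk)     # fold the query in
    sims = []
    for d in docs_k:
        denom = np.linalg.norm(q_k) * np.linalg.norm(d)
        sims.append(float(q_k @ d) / denom if denom else 0.0)
    return sims

# Toy space: rows = terms (user, name, account, balance), cols = 2 artifacts.
term_doc = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 1.0]])
model_vec = np.array([1.0, 1.0, 0.0, 0.0])       # model mentioning user, name
```

Here the model's terms match the first artifact exactly, so its similarity to that artifact is 1 and to the other is 0; with noisier corpora the SVD truncation is what smooths over synonymy and polysemy.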
The textual similarity between the system model and related high-level artifacts
provides the developers with an indication about the consistency between the system
model lexicon and the related high-level artifacts. In particular, if the similarity is high,
it is likely that the model is properly traced to the related artifacts, i.e., the
analyzer/designer has selected meaningful identifiers and/or the model is properly
described. On the other hand, if the similarity is low, the analyzer/designer can
make the software model identifiers more consistent with the terms contained in the
high-level artifacts, which increases the similarity between the system model and the
related high-level artifacts. It is worth noting that increasing the quality of identifiers
would make the model easier to understand [19], [31].
4.2 Suggestion of Candidate Identifiers
To further support the analyzer/designer in the choice of meaningful identifiers, the
researcher proposes suggesting candidate identifiers to the analyzer/designer by
extracting n-grams from the text contained in high-level artifacts associated to the
system model artifact under development (see Figure 1). An n-gram is a string
composed of n subsequent words extracted from high-level artifacts after pruning out
stop words. In particular, given the sentence "A user has a first name and a last name",
the list of 2-grams (also called "bigrams") is ["userFirst", "firstName", "nameLast",
"lastName"]. As with the computation of textual similarity, the extraction of n-grams
is preceded by text normalization and the composition of multi-word identifiers.
The n-gram extraction is performed using the proposed INGE algorithm.
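The bigram example above can be reproduced with a small sketch (a simplification of the proposed INGE algorithm; the stop-word list is a minimal illustrative sample):

```python
STOP_WORDS = {"a", "an", "the", "has", "and", "of", "to", "is"}  # illustrative

def camel(words):
    """Join words into a camelCase identifier: ['first','name'] -> 'firstName'."""
    return words[0] + "".join(w.capitalize() for w in words[1:])

def candidate_identifiers(sentence, n=2):
    """Build n-gram identifiers from a sentence after pruning stop words."""
    words = [w.lower() for w in sentence.split() if w.lower() not in STOP_WORDS]
    return [camel(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Applied to the sentence from the text, it yields exactly the four bigram identifiers listed above.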
4.3 Suggestion of Semantic Identifiers
In addition, the researcher proposes suggesting semantic identifiers to the
analyzer/designer using an Ontology Inference Service (OIS). The Ontology Inference
Service uses the Java API for WordNet Searching (JAWS), which provides Java applications
with the ability to retrieve data from the WordNet database. WordNet is a semantic
lexicon for the English language. It groups English words into sets of synonyms called
synsets, provides short, general definitions, and records the various semantic relations
between these synonym sets. This helps the analyzer/designer find more accurate and
meaningful identifiers relevant to the problem domain.
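The semantic suggestion can be illustrated with a tiny in-memory synset table standing in for the WordNet database; in the actual tool, JAWS queries WordNet itself, and the table below is entirely invented.

```python
# Miniature stand-in for WordNet synsets: each entry maps a term to synonyms.
SYNSETS = {
    "member":   {"fellow", "associate", "affiliate"},
    "client":   {"customer", "user", "patron"},
    "register": {"enroll", "record", "sign_up"},
}

def semantic_identifiers(term):
    """Return candidate semantic identifiers (synonyms) for a model term."""
    return sorted(SYNSETS.get(term.lower(), set()))
```

Given a vague or overly short identifier, the architect can pick a synonym from the returned synset that better matches the problem domain.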
4.4 Suggestion of Tagged Identifiers
To further support the analyzer/designer in the choice of tagged identifiers, the
researcher proposes suggesting tagged identifiers to the analyzer/designer using a Part-
Of-Speech Tagger [16]. A Part-Of-Speech Tagger (POS Tagger) is a piece of software
that reads text in some language and assigns a part of speech, such as noun, verb, or
adjective, to each word (and other tokens). Computational applications generally use
more fine-grained POS tags such as 'noun-plural'. The tagger used is a Java implementation
of the log-linear part-of-speech (POS) tagger. Most of its features can only be
accessed via the command line, so the researcher has created a GUI-based POS tagger
for accessing identifiers with their tags. These tagged identifiers are very useful for
modeling the system; for example, identifying the class names, attributes, and methods
for a class diagram requires a set of noun tags.
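The tagging step can be sketched with a minimal lexicon-and-suffix tagger; the real tool wraps a Java log-linear tagger, so the lexicon and fallback rules here are invented stand-ins.

```python
LEXICON = {"user": "NN", "name": "NN", "enters": "VBZ", "first": "JJ",
           "last": "JJ", "the": "DT", "a": "DT"}  # illustrative entries only

def pos_tag(sentence):
    """Tag each token, falling back on crude suffix heuristics."""
    tags = []
    for tok in sentence.lower().split():
        if tok in LEXICON:
            tags.append((tok, LEXICON[tok]))
        elif tok.endswith("ing"):
            tags.append((tok, "VBG"))      # likely gerund/participle
        elif tok.endswith("s"):
            tags.append((tok, "NNS"))      # likely plural noun
        else:
            tags.append((tok, "NN"))       # default to noun
    return tags

def nouns(sentence):
    """Candidate class/attribute names: tokens tagged as nouns."""
    return [t for t, tag in pos_tag(sentence) if tag.startswith("NN")]
```

Filtering the tagged stream for noun tags yields exactly the candidate class and attribute names mentioned above.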
In summary, the proposed approach 1) provides information about the
similarity between the software model under development and the related high-level
artifacts, 2) suggests identifiers obtained from terms belonging to high-level artifacts,
3) suggests tagged identifiers based on natural language processing, and 4) suggests
semantic identifiers using the ontology inference service.
5. INTEGRATING THE APPROACH INTO ECLIPSE: UMLHELPER
The proposed approach has been implemented as UMLHelper, a plug-in for the
Eclipse IDE that works with the Java Development Tools (JDT), although it can be easily
adapted for other Eclipse tools, e.g., UML modelers or development environments for
other programming languages. The plug-in contributes a new view to the Eclipse
workbench and is organized in four different tabs, namely, Similarity, Identifiers,
Semanticist, and Tagger. The Similarity tab provides information about the similarity
between the system model under development and related artifacts. The Identifiers tab
suggests appropriate (composed) identifiers to be used in the system model under
development. The Semanticist tab provides related semantic identifiers that are more
relevant to the problem domain, while the Tagger tab suggests the appropriate
tagged identifiers to be used in the specific model under development.
5.1 Similarity between System Model and High-Level Artifacts using UMLHelper
The Similarity tab shows a sorted list of all the indexed (high-level) artifacts as a table
(see Figure 2). The first column of the table contains a check box that indicates
whether the artifact has to be selected and traced onto the system model under
development. The second column contains the description of the high-level artifacts,
and the third column shows the similarity between the artifact and the system model
under development. The high level artifacts being compared to the model under
development are requirements, use cases, and, in general, any software artifact that can
be represented in a textual file.
In the Similarity Preferences window, the user can create a new artifact space, i.e.,
a list of high-level artifacts related to the project being developed. Figures 2 and 3
show a scenario where the analyzer/designer is modeling the class diagram. In the
first stage, the architect is using not very descriptive identifiers. During
development, he/she decides to use UMLHelper to visualize the similarity between the
class diagram under development and the related high-level artifacts. Thus, he/she
selects the artifacts related to the class member, namely the use case OTV.txt, and
clicks on the button at the top of the plug-in view. As shown in Fig. 2, the similarity
between the class and the related use case is very low (i.e., about 3.7 percent). This
means that the system model identifiers are not consistent with the related high-level
artifacts.
Based on the information provided by UMLHelper, the architect tries to
improve the similarity between the model under development and the related high-
level artifacts. In particular, he/she changes the identifiers, making them more
consistent with the application domain lexicon used in the high-level artifacts. Then,
he/she re-computes the similarity between the class diagram and the use case. As
shown in Figure 3, the similarity between the software model and the related use case
improves.
Fig.2 Effect when similarity is low
Fig.3 Effect when similarity is improved
5.2 Suggestion of Candidate Identifiers in UMLHelper
The Identifiers tab in the UMLHelper shows a list of candidate identifiers extracted
from the related high-level artifacts that are traced to the software model under
development. When the architect starts to type the first characters of an identifier, the
tab shows all possible identifiers, created from words extracted from high-level
artifacts and starting with the substring being typed (see Figure 4). The suggestion can
be customized by specifying the number of words to consider in multi-word
identifiers. Figure 4 shows a scenario where the architect is modeling the class
diagram and uses UMLHelper to identify an appropriate name for a class, its
properties, and methods. In particular, he/she starts writing the class name (see Figure
4), and then selects the menu item Get suggestions from the pop-up menu activated on
the selected substring "members". Alternatively, it is possible to get suggestions by
writing the substring in an appropriate field of the Identifiers tab, and then clicking the
button Suggest. As shown in Figure 4, UMLHelper proposes different identifiers
containing the selected substring. The developer can then select the most appropriate
one by double clicking on it.
5.3 Suggestion of Semantic Identifiers in UMLHelper
The Semanticist tab of the UMLHelper shows a list of semantic identifiers for the
given text that can be used in the software model. When constructing a software
model, the system architect has to select appropriate identifiers from the identifier list. If
the chosen identifier does not give proper meaning in the context, or is too short to be
descriptive, a relevant semantic identifier can be selected instead (see Figure 5). The
Semantic Preferences window is used for selecting the dictionary directory used for
semantic identifiers.
Fig.4 Suggesting Candidate Identifiers
Fig.5 Suggesting Semantic Identifiers
5.4 Suggestion of Tagged Identifiers in UMLHelper
The Tagger tab in the UMLHelper shows a list of identifiers, with their POS tags,
extracted from the related high-level artifacts used for the software model under
development. When the architect constructs a software model, he/she has to choose
nouns for class names, attributes, methods, etc. Similarly, other parts of speech
are used for constructing software models (see Figure 6). The Tagger
preferences window is used for selecting the appropriate tagger mode.
6. EMPIRICAL RESEARCH ASSESSMENT – PERFORMANCE ANALYSIS
Empirical studies are essential for developing and validating our knowledge of
software engineering in general and in particular of the quality of UML modeling. The
UML has been around for ten years now, but the number of empirical studies
addressing its use and quality is still relatively small compared to its popularity in
practice and the number of suggested changes and improvements for the UML.
The goal of the experiments is to analyze the use of similarity information
between system model and related documentation and identifier suggestion provided
by UMLHelper, with the purpose of evaluating their usefulness during system
analysis, design and maintenance tasks. The quality focus is to improve the quality of
system model identifiers. Such an improvement possibly increases the model quality
and its comprehensibility.
The perspective of this study is that of both 1) researchers who want to evaluate
how suggestions based on traceability information help system architects use
meaningful identifiers, and 2) project managers who want to evaluate the possibility of
adopting UMLHelper within their own organization.
Fig.6 Suggesting Tagged Identifiers
6.1 Experiment Context: Subjects
The study was executed twice at Panimalar Engineering College, affiliated to Anna
University, Chennai, India, with different subjects. Experiment I was carried out with 20
first-year master's students attending the Object Oriented Analysis and Design course.
The students were grouped into ten pairs. Experiment II was carried out with 20 third-
year bachelor's students attending the Object Oriented Software Development course,
also grouped into ten pairs. Within each experiment, all students were
from the same class, with a comparable background but different abilities. All students
had knowledge of constructing system models, as well as of software artifact
traceability. Moreover, the students involved in Experiment I (i.e., the master's students) had
participated in real software projects during their internships. A quantitative assessment
of the ability level was obtained by considering the average grades obtained in
previous university exams. In particular, students with average grades below a fixed
threshold, i.e., a 7.0 GPA, were classified as Low Ability, while the remaining ones were
High Ability. This threshold was selected because it represents the median of the
possible grades for any exam to be passed by a student at Anna University. Pairs
were formed by grouping subjects having both High and Low Ability. In Experiment I,
there were five Low Ability pairs and five High Ability pairs, while in Experiment II,
there were six Low Ability pairs and four High Ability pairs.
6.2 Experiment Material
To perform the experimental tasks, each student was provided with the following
material:
1. UML Helper plug-in user manual,
2. Requirement documents and/or use case descriptions for the tasks to be
performed,
3. The use case diagram, class diagram etc., and the documentation of the
system to be maintained,
4. The Eclipse-JDT environment in three possible configurations, depending on
the treatment a) INOUMLHP: without the UML Helper plug-in, b) IUMLHP:
with the UML Helper plug-in, however, without the identifier suggestion
feature, and c) IFUMLHP: with the fully featured UML Helper plug-in,
5. A survey questionnaire to be filled in after each lab.
6.3. Experiment Details
Four experiments were conducted:
1. Banking Information System (BIS),
2. Library Management System (LMS),
3. Inventory Management System (IMS) and
4. Student Management System (SMS)
6.4 Results of the Empirical Assessment
After the experiments were executed, the artifacts produced/maintained by the subjects were
collected, and the similarity between the system model and high-level artifacts was computed
using Latent Semantic Indexing. To address the research hypotheses, the researcher
computed this similarity by considering the system models built with and without the
UML Helper plug-in. Table 1 reports descriptive statistics of the obtained similarity
for the Banking Information System (BIS), grouped by experiment, for the three
treatments (INOUMLHP, IUMLHP, and IFUMLHP) in Experiment I and Experiment II,
respectively.
Accuracy is an important property of any computer-generated data set.
In order to obtain an accurate result when comparing the data sets, the
researcher used a neural network algorithm: the data sets were given as input
to RStudio, and the Mean Absolute Percentage Error (MAPE) was calculated using the
neural network algorithm.
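The MAPE values reported below follow the standard definition, sketched here for reference; the neural-network fitting itself is done in RStudio and is not reproduced.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two equal-length, non-empty series")
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)
```

For example, predictions of 90 and 220 against actual values of 100 and 200 are off by 10% each, giving a MAPE of 10%.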
In Experiment I (BIS-1), the Mean Absolute Percentage Error (MAPE) for the
similarity measure without the UML Helper plug-in is 14.35%; it is reduced to 5.98%
when the UML Helper plug-in with the identifier feature alone is used, and further
reduced to 3.48% when the full-featured UML Helper plug-in (i.e., identifier,
semanticist, and tagger features) is used. In Experiment II (BIS-2), without the use of
any automated tool the MAPE is 17.95%; with the UML plug-in, the MAPE values are
11.79% and 2.75% for the partial and full-featured configurations, respectively, as
shown in figure 7.

BANKING INFORMATION SYSTEM
Type of Service    BIS 1 MAPE    BIS 2 MAPE
INOUMLHP           14.35 %       17.95 %
IUMLHP              5.98 %       11.79 %
IFUMLHP             3.48 %        2.75 %
Table.1 Statistics of Similarity Values between BIS-1 and BIS-2
Experiments I and II ended with MAPE values of 3.48% and 2.75%, respectively.
For BIS, the Experiment II results are more accurate than those of Experiment I, since
the MAPE value is lower, as shown in figure 8; however, this does not hold in all cases.
In the case of the Student Management System, Experiment I yields MAPE
values for INOUMLHP, IUMLHP, and IFUMLHP of 6.27%, 4.15%, and 2.72%,
respectively. Similarly, Experiment II yields MAPE values of 5.35%, 4.05%, and
3.56% for INOUMLHP, IUMLHP, and IFUMLHP, respectively.
Fig.7 MAPE values of BIS-1 and BIS-2
Fig.8 MAPE Comparison between BIS-1 and BIS-2
In contrast to BIS, in SMS Experiment I yielded a more accurate result than
Experiment II, since the MAPE value is lower in Experiment I, as shown in figure
9. Initially the SMS-1 MAPE is higher than the SMS-2 value; after that, Experiment I
(SMS-1) yields a better result than its counterpart, as shown in figure 10.
In the case of the Library Management System project, Experiment I (LMS-1)
and Experiment II (LMS-2) produce a different pattern than the above two projects. LMS-
1 has 9.86%, 8.84%, and 2.69% as the MAPE values for the three types of services, while
LMS-2 has 7.39%, 10.81%, and 3.06%. Here the error ratio worsens after using the
partial UML Helper option (IUMLHP), as shown in figure 11. This shows the importance
of the full-featured UML Helper service (IFUMLHP).
Fig.9 MAPE values of SMS-1 and SMS-2
Fig.10 MAPE Comparisons between SMS-1 and SMS-2
The comparison chart shown in figure 12 differentiates the accuracy ratio
between LMS-1 and LMS-2. Initially the LMS-2 MAPE value is lower than that of LMS-1,
but in the end LMS-1 achieves the better result by a marginal difference. This indicates that
the accuracy of the result produced by the tool does not depend on the initial values.
In the Inventory Management System project, the MAPE values for INOUMLHP,
IUMLHP, and IFUMLHP are 19.39%, 12.22%, and 1.79% in IMS-1, and 11.67%,
10.96%, and 4.28% in IMS-2, respectively. The trends of both experiments are shown in figure 13.
Fig.11 MAPE values of LMS-1 and LMS-2
Fig.12 MAPE Comparisons between LMS-1 and LMS-2
For the Inventory Management System, it can be concluded that a slow start does
not always mean failure: the experiment that started worse ultimately yields a better
result than the one that performed better in the initial stage, as shown in figure 14.
Each project has its own initial value, which depends on the
complexity of the project. As shown in the graph in figure 15, the Student Management
System in both Experiment I and Experiment II (SMS-1 and SMS-2) has a high accuracy
rate even in the initial stage, since its functionalities are easy to understand compared to
the other projects. Therefore, the error rate is influenced by the type of project developed.
Fig.13 MAPE values of IMS-1 and IMS-2
Fig.14 MAPE Comparisons between IMS-1 and IMS-2
Experiments I and II all started at different accuracy levels, but all of
them converge to a Mean Absolute Percentage Error in the range between 0 and 4.
In particular, the Inventory Management System produces a MAPE of 1.79 in IMS-
1, as shown in figure 16.
6.5 Comparing the results of the two experiments using Two-way ANOVA
The experiments were conducted with P.G. and U.G. students, namely Experiment I and
Experiment II. These two categories of students performed the same experiments
separately, and their results differ. In order to test whether their results are
significantly different or the same, a statistical analysis method called Analysis of
Variance (ANOVA) has been used. Since the analysis involves two factors, a two-way
ANOVA has been used.

Fig.15 MAPE Comparisons among Experiment I and II
Fig.16 MAPE of both Experiment I and II
A two-way ANOVA compares the mean differences between groups that
have been split on two independent variables (called factors). The primary purpose of
a two-way ANOVA is to understand whether there is an interaction between the two
independent variables on the dependent variable. The similarity results of Experiment I
and Experiment II on the four projects are compared. Here the experiment is an
independent variable and the similarity value is the dependent variable.
Project   Source of Variation      F      P-value   F crit
BIS       Sample                 3.536     0.065     4.020
          Interaction            3.037     0.056     3.168
LMS       Sample                 0.061     0.807     4.020
          Interaction            0.040     0.960     3.168
IMS       Sample                 0.001     0.974     4.020
          Interaction            0.153     0.859     3.168
SMAS      Sample                 0.038     0.759     4.325
          Interaction            0.236     0.675     3.453
One can now draw some conclusions from the ANOVA results in table 2. In
BIS, since the p-value (sample) = 0.065 > 0.05 = α, one cannot reject the null
hypothesis, and so it is concluded (with 95% confidence) that there are no significant
differences between the U.G. and P.G. students' similarity values. One can also see
that the p-value (interaction) = 0.056 > 0.05 = α, and so it is concluded that there are
no significant differences in the interaction between experiment and similarity in the
Bank Information System (BIS). Similarly, in LMS, IMS, and SMAS both the sample
and interaction p-values are greater than α. This clearly shows that there is no
significant difference between the P.G. and U.G. students' project implementation
similarity values obtained from the UMLHelper tool.
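The decision rule applied above (fail to reject the null hypothesis when p ≥ α, equivalently when F < F crit) can be sketched as a small helper, using the BIS values from table 2:

```python
def significant(p, alpha=0.05):
    """H0 is rejected only when the p-value falls below alpha."""
    return p < alpha

def significant_f(f, f_crit):
    """Equivalent decision from the F statistic and its critical value."""
    return f > f_crit

# BIS sample effect from the two-way ANOVA table: p = 0.065, F = 3.536, F crit = 4.020.
print(significant(0.065))            # False -> no significant difference
print(significant_f(3.536, 4.020))   # False -> agrees with the p-value test
```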
6.6 Comparison of Experiment Results through Standard Error
Standard error can be used to check whether the difference between two means is
statistically significant (P > 0.05). Overlapping standard error bars denote that there
is no significant difference between the two means.
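As a sketch of how these checks work: the standard error of a mean is sqrt(variance / n), and two error bars overlap when the intervals mean ± SE intersect. The sample size n = 15 and the means below are illustrative assumptions, since the synopsis reports only variances and standard errors:

```python
import math

def standard_error(variance, n):
    """Standard error of the mean from a sample variance and sample size n."""
    return math.sqrt(variance / n)

def bars_overlap(mean1, se1, mean2, se2):
    """True when the mean +/- SE bars of two groups intersect."""
    return (mean1 - se1) <= (mean2 + se2) and (mean2 - se2) <= (mean1 + se1)

# Variance 2.626 is the BIS Exp. I INOUMLHP value; n = 15 is an assumed sample size.
print(round(standard_error(2.626, 15), 3))  # 0.418
# Illustrative means and standard errors.
print(bars_overlap(5.0, 1.0, 6.0, 0.8))     # True:  [4, 6] and [5.2, 6.8] meet
print(bars_overlap(5.0, 0.3, 6.0, 0.3))     # False: [4.7, 5.3] and [5.7, 6.3] do not
```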
Table.2 Two-way ANOVA table for the Similarity Values
The variances and standard errors of BIS Experiments I and II are given in
table 3, and the associated graph with standard error bars is plotted in figure 17.
Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I            2.626    21.931    41.477
Variance for Exp. II           2.219    28.953    42.302
Standard Error of Exp. I       0.420     1.663     2.352
Standard Error of Exp. II      0.223     1.915     1.862
The standard error bars for BIS overlap at INOUMLHP and IFUMLHP,
which shows that there is no significant difference between the two experiments at
these two levels. The standard error bars at IUMLHP, however, do not coincide,
which shows that Experiments I and II differ significantly there. Since IUMLHP
represents only partial usage of the UML Helper, this difference is considered
negligible.
The variances and standard errors of the LMS project in Experiments I and II
are given in table 4. The graph with standard error bars is shown in figure 18.
Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I            6.421    24.346    37.220
Variance for Exp. II           7.330    24.350    37.347
Standard Error of Exp. I       0.728     1.352     1.818
Standard Error of Exp. II      0.464     2.059     2.781
Table.3 Variance and Standard Error of BIS
Fig.17 Two-way ANOVA Results of BIS
Table.4 Variance and Standard Error of LMS
The standard error bars for LMS overlap heavily at IUMLHP and IFUMLHP,
which shows that there is no significant difference between the two experiments at
these two levels. The standard error bars at INOUMLHP do not exactly coincide,
which shows that Experiments I and II differ significantly when the UML Helper tool
is not used; this difference is considered negligible.
The variances and standard errors of IMS Experiments I and II are given in
table 5, and the associated graph with standard error bars is plotted in figure 19.
Similarly, the standard error bars for IMS in both cases with the UML Helper
tool overlap strongly, which shows that there is no significant difference between the
two experiments. The standard error bars at INOUMLHP do not overlap, which
shows that Experiments I and II differ significantly there, but this difference is
unimportant.
Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I            3.835    22.399    33.403
Variance for Exp. II           4.667    22.032    32.829
Standard Error of Exp. I       0.355     1.674     1.769
Standard Error of Exp. II      0.589     1.148     1.893
Fig.18 Two-way ANOVA Results of LMS
Table.5 Variance and Standard Error of IMS
The variances and standard errors of both experiments of the SMS project are
shown in table 6. The graph with standard error bars is shown in figure 20.
Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I            4.130    30.173    42.668
Variance for Exp. II           4.520    30.276    41.217
Standard Error of Exp. I       0.218     0.946     1.108
Standard Error of Exp. II      0.464     2.059     2.781
Fig.19 Two-way ANOVA Results of IMS
Fig.20 Two-way ANOVA Results of SMS
Table.6 Variance and Standard Error of SMS
The SMS standard error bars of Experiment I completely overlap those of
Experiment II in all cases, namely INOUMLHP, IUMLHP, and IFUMLHP. This
shows that there is no significant difference between Experiments I and II, with or
without the UML Helper tool.
The above results clearly show that the U.G. and P.G. students used the UML
Helper tool equally well; there is no significant difference in their work.
6.7 POST-EXPERIMENT QUESTIONNAIRE
The following questions were asked at the end of each lab session to assess whether
the laboratory tasks were clear, whether subjects had enough time to perform the
tasks, and other related issues.
1. I had enough time to perform the lab task.
2. The objectives of the lab were perfectly clear to me.
3. The task I had to perform was perfectly clear to me.
4. The requirement given to me provided enough information to perform the
required task.
5. I was able to locate the classes, attributes, and methods I had to maintain.
6. The use of the similarity feature was clear to me.
7. I found the similarity feature useful.
8. The use of the identifier suggestion feature was clear to me.
9. I found the identifier suggestion feature useful.
10. For how many identifiers (in percentage) did you rely on the suggestions given
by the tool?
A. < 25%  B. >= 25% and < 50%  C. >= 50% and < 75%  D. >= 75%
Possible answers to questions 1-9 are: 1. Strongly agree, 2. Weakly agree, 3.
Undecided, 4. Weakly disagree, and 5. Strongly disagree.
6.7.1 Experiment Execution
For each lab, subjects had two hours available to perform the required task. After the
task was completed, all models were collected from each pair of subjects, and all
pairs of subjects returned the completed survey questionnaire. The questionnaire is
composed of questions expecting closed answers on a Likert scale [25]:
1. Strongly agree,
2. Weakly agree,
3. Undecided,
4. Weakly disagree, and
5. Strongly disagree.
The purpose of the questionnaire is to assess whether the laboratory tasks are
clear, whether subjects have enough time to perform the tasks, and other related
questions. In addition, for subjects using the UML Helper plug-in, the survey has
investigated the usefulness of the plug-in and the clarity of its usage, and has asked
how much time subjects have spent using it.
It is noted that, differently from the inspection questionnaires filled in by the
three reviewers, the survey questionnaires also have an "undecided" level. This is
because, while in code inspection the researcher wanted to favour convergence
during inspection meetings (and thus avoided neutral grades), in this case the
researcher is also interested in understanding whether subjects found, for example,
the usefulness of a given tool feature unclear.
6.7.2 Questionnaire Results
To analyse the questionnaire, correlation, regression, and two-way ANOVA analysis
techniques are used. Their results clearly show the importance of the proposed tool.
6.7.2.1 Correlation Analysis:
Correlation analysis is a statistical method used to study the strength of a relationship
between two numerically measured, continuous variables. If a correlation is found
between two variables, a systematic change in one variable is accompanied by a
systematic change in the other. Table 7 shows the correlation analysis for the
questionnaire.
Q1-Q9 denote questions 1-9 listed in section 6.7. Values are correlation
coefficients, rounded to three decimals.

       Q1      Q2      Q3      Q4      Q5      Q6      Q7      Q8      Q9
Q1    1
Q2   -0.102   1
Q3   -0.080  -0.091   1
Q4   -0.128  -0.056  -0.115   1
Q5   -0.153   0.125   0.124   0.361   1
Q6   -0.122  -0.140   0.090   0.614   0.409   1
Q7   -0.098   0.010  -0.088   0.453   0.328   0.400   1
Q8   -0.068  -0.078  -0.061   0.509   0.480   0.590   0.656   1
Q9   -0.089  -0.153  -0.120   0.612   0.183   0.683   0.548   0.423   1

Table.7 Results of the Correlation Analysis
The independent variables in the questionnaire that have a significant relationship
with the dependent variable "I found the identifier suggestion feature useful" are:
The requirement given to me provided enough information to perform the
required task.
The use of the similarity feature was clear to me.
I found the similarity feature useful.
The use of the identifier suggestion feature was clear to me.
These variables have correlations with an absolute value of 0.25 or above. These
correlations are significant, meaning that there is at least a 95% chance of a true
relationship between these variables.
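The coefficients in table 7 can be computed as Pearson product-moment correlations (assuming Pearson's r, the common default; the synopsis does not name the variant). A minimal pure-Python sketch, with illustrative Likert responses rather than the real questionnaire data:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative Likert responses for two questionnaire items (not experiment data).
sfc = [5, 4, 5, 3, 4, 5, 2, 4]
isfu = [5, 4, 4, 3, 5, 5, 2, 4]
print(round(pearson(sfc, isfu), 3))  # 0.875
```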
6.7.2.2 Multiple Regression Analysis:
Regression analysis is a predictive modelling technique that investigates the
relationship between a dependent (target) variable and one or more independent
(predictor) variables, and indicates the strength of the impact of multiple independent
variables on a dependent variable. Regression analysis provides an equation that can
make predictions about the data.
To find which predictors are significant in the regression, the researcher uses
the questionnaire items that have a significant correlation with the dependent variable
(The requirement given to me provided enough information to perform the required
task; The use of the similarity feature was clear to me; I found the similarity feature
useful; The use of the identifier suggestion feature was clear to me) as the predictors
(X variables), and the dependent variable (I found the identifier suggestion feature
useful) as the outcome (Y variable). The regression analysis results are shown in
table 8.
One looks for predictors with a significance value (P-value) less than 0.05,
meaning that there is at least a 95% chance of a true relationship between these
variables in the population. The exact percentage chance of a true relationship in the
population can be calculated as (1 - P-value) * 100.
Among the independent variables, "The use of the similarity feature was clear
to me (SFC)" and "I found the similarity feature useful (SFU)" have P-values less
than 0.05. The independent variables SFC and SFU have a 99.93% and 99.41%
chance, respectively, of a true relationship with the dependent variable ("I found the
identifier suggestion feature useful", ISFU).

Table.8 Results of the Regression Analysis
Independent Variables                                             P-value   % of True Relationship
The requirement given to me provided enough information
to perform the required task                                       0.090          90.98
The use of the similarity feature was clear to me                  0.001          99.93
I found the similarity feature useful                              0.006          99.41
The use of the identifier suggestion feature was clear to me       0.073          92.69

The general form of the regression equation for these significant predictors is
Predictor = (coefficient1 x X1 + coefficient2 x X2) + Intercept. The calculated
coefficients for the independent variables are shown in table 9.

Table.9 Calculated Coefficients for Independent Variables
Independent Variables                                Coefficients
Intercept                                              -0.853
The use of the similarity feature was clear to me       0.779
I found the similarity feature useful                   0.369

Therefore, the predictor equation is defined as:

Predictor (ISFU) = (0.779 x SFC + 0.369 x SFU) - 0.853

Example: Suppose a person gave SFC and SFU values of 5 and 4, respectively;
the predicted value of ISFU is then calculated from the regression equation as
follows:

Predictor (ISFU) = (0.779 x 5 + 0.369 x 4) - 0.853 = 4.518 ≈ 5

The result shows that the predicted value for the given independent variables
is 5. That is, a user who strongly agrees that the "similarity feature was clear" and
weakly agrees that the "similarity feature is useful" is likely to strongly agree with
the "I found the identifier suggestion feature useful" statement.
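The predictor equation above translates directly into code. A small sketch using the coefficients reported in table 9:

```python
def predict_isfu(sfc, sfu):
    """Predicted Likert rating for 'I found the identifier suggestion feature
    useful' from the two significant predictors (coefficients from table 9)."""
    return 0.779 * sfc + 0.369 * sfu - 0.853

# Worked example from the text: SFC = 5, SFU = 4.
value = predict_isfu(5, 4)
print(round(value, 3))  # 4.518
print(round(value))     # 5, i.e. the nearest point on the Likert scale
```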
6.7.2.3 Analysis based on the Questionnaire Results:
Each student who took part in the experiment was asked to answer all ten questions
about the usefulness of the proposed tool. The results are discussed as follows:
The cumulative responses of all the projects for question 1, "I had enough
time to perform the lab task", are depicted in percentage as a bar graph in figure 21.
The graph clearly shows that more than 90% of students agree that the given time
was sufficient to complete the lab task.
The cumulative responses of all four projects for question 2, "The objectives
of the lab were perfectly clear to me", are depicted in percentage as a bar graph in
figure 22. The graph clearly shows that more than 92% of students agree that the
objectives of the lab were perfectly clear to them.
Fig.21 Cumulative Responses of Question1
Fig.22 Cumulative Responses of Question 2
Table 10 shows the response counts for question 9, "I found the identifier
suggestion feature useful". The responses are similar across all projects: most
strongly agree with the statement, some weakly agree, a few are undecided, and very
few weakly or strongly disagree. The results are also depicted as a graph in
figure 23.
9. I found the identifier suggestion feature useful.
                    BMS   LMS   IMS   SMAS   Total   Percentage
Strongly Agree       29    31    30    37     127      79.38
Weakly Agree          5     5     5     2      17      10.63
Undecided             3     1     3     1       8       5.00
Weakly Disagree       1     2     2     0       5       3.13
Strongly Disagree     2     1     0     0       3       1.88
The graph in figure 24 depicts the cumulative responses for all the projects in
percentage. It shows that more than 90% of subjects agree that the identifier
suggestion feature was useful to them.
Table 11 shows the response counts for question 10, "For how many
identifiers (in percentage) did you rely on the suggestions given by the tool?". In
total, 46 members relied on the tool for 75% or more of the identifiers, and 76
members for between 50% and 75%. Twenty-eight members identified between 25%
and 50% of the identifiers through the tool, and 10 members fewer than 25%.
Table.10 Question 9 Responses
Fig.23 Graph for the Question 9 Response
10. For how many identifiers (in percentage) did you rely on the suggestions
given by the tool?
                   BMS   LMS   IMS   SMAS   Total   Percentage
>= 75%              11     9     2    24      46      28.75
>= 50% and < 75%    23    23    21     9      76      47.50
>= 25% and < 50%     4     7    12     5      28      17.50
< 25%                2     1     5     2      10       6.25
The results are also depicted as a graph in figure 25.
Table.11 Question 10 Responses
Fig.24 Cumulative Responses of Question 9
Fig.25 Graph for the Question 10 Response
The cumulative responses for all the projects are depicted in percentage as a
bar graph in figure 26. The graph clearly shows that 28.75% of the users relied on the
tool for 75% or more of the identifiers, and 47.5% of the users identified more than
50% of the identifiers through it. Nearly 17.5% of the users identified between 25%
and 50% of the identifiers using the tool, and only 6.25% identified fewer than 25%
of the identifiers with it.
6.7.3 Comparing the Questionnaire Results of the Two Experiments
The questionnaire results of the two categories of students are compared, to
determine whether they are significantly different, using a statistical analysis method
called Analysis of Variance (ANOVA). Since the analysis involves two factors, two-
way ANOVA has been used. Its primary purpose is to determine whether there is an
interaction between the two independent variables with respect to the dependent
variable. The questionnaire results of Experiments I and II on the four projects are
compared; here the experiment is an independent variable and the questionnaire
result is the dependent variable. The graph for the two-way ANOVA results with
standard error bars is plotted in figure 27.
For question 1, "I had enough time to perform the lab task", question 3, "The
task I had to perform was perfectly clear to me", question 5, "I was able to locate the
classes, attributes, and methods I had to maintain", question 6, "The use of the
similarity feature was clear to me", question 8, "The use of the identifier suggestion
feature was clear to me", and question 9, "I found the identifier suggestion feature
useful", the standard error bars of Experiment I completely overlap those of
Experiment II. This shows that there is no significant difference between the
questionnaire results of Experiments I and II.
Fig.26 Cumulative Responses of Question 10
For question 2, "The objectives of the lab were perfectly clear to me",
question 4, "The requirement given to me provided enough information to perform
the required task", and question 7, "I found the similarity feature useful", the standard
error bars of Experiment I slightly overlap those of Experiment II. These results
clearly show that the questionnaire results on UML Helper tool usage of the U.G. and
P.G. students are equal.
7. CONCLUSION
The proposed approach helps system analysts and designers to improve the system
model lexicon, i.e., the terms used as identifiers in system models. In particular, the
approach 1) computes and shows to developers the textual similarity between the
system model and related high-level artifacts, and 2) recommends candidate
identifiers built from high-level artifacts related to the system model under
development. To support this, two algorithms, the Improved N-gram Extraction
(INGE) Algorithm and the Improved Substring Removal (ISR) Algorithm, have been
proposed for extracting data from a raw corpus.
A plug-in, called UMLHelper, has been implemented to provide the proposed
approach in the Eclipse IDE, and its usefulness has been evaluated through two
controlled experiments. In general, the following pieces of evidence can be
summarized from the obtained results:
Fig.27 Two-way ANOVA Result for Questionnaire
The use of UML Helper makes the system model and the text contents
"textually similar" to the related high-level artifacts. Both experiments indicate
that the usage of UML Helper significantly increases the similarity between
system model and related artifacts. Developers tend to provide more
meaningful names to identifiers and to better choose system model texts to
achieve a higher similarity.
The use of UML Helper improves the quality of the system model lexicon.
UML Helper makes the system model text descriptions "textually similar" to
the related high-level artifacts.
The use of the UMLHelper identifier suggestion feature improves the similarity
between the system model and high-level artifacts. The experimental results
indicate that the identifier suggestion further improves the similarity compared
with the availability of the similarity feature only. Results also suggest that,
beyond highlighting the similarity between high-level artifacts and the system
model, better consistency of identifiers can be achieved when they are
suggested by extracting n-grams from high-level artifacts.
Indeed, as always happens with empirical studies, replication in different
contexts, with different subjects and objects, is the only way to corroborate the
findings. Replicating this study with students or professionals having a different
background would be extremely important to understand how the plug-in influences
the similarity between the system model and the related high-level documentation for
these different subpopulations.
REFERENCES
[1] Abadi.A, Nisenson.M, and Simionovici.Y, ―A Traceability Technique for
Specifications,‖ Proc. 16th IEEE Int‘l Conf. Program Comprehension, pp.
103-112, 2008.
[2] Antoniol.G, Canfora.G, Casazza.G, De Lucia.A, and Merlo.E, ―Recovering
Traceability Links between Code and Documentation,‖ IEEE Trans. Software
Eng., vol. 28, no. 10, pp. 970-983, Oct. 2002.
[3] Baeza-Yates. R and Ribeiro-Neto. B, Modern Information Retrieval. Addison-
Wesley, 1999.
[4] Bogdan Czejdo.D, Rudolph Mappus IV.L, Kenneth Messa,‖The Impact of
UML Class Diagrams on Knowledge Modeling, Discovery and Presentations‖,
Journal of Information Technology Impact, 2003, Vol. 3, No. 1, p. 25-44.
[5] Capobianco.G, De Lucia.A, Oliveto.R, Panichella.A, and Panichella.S,
―Traceability Recovery Using Numerical Analysis,‖ Proc. 16th Working
Conf. Reverse Eng., 2009.
[6] Christian Lange.F.J, Michel R.V. Chaudron.R.V, "Managing Model Quality in
UML-Based Software Development," step, pp.7-16, 13th IEEE International
Workshop on Software Technology and Engineering Practice (STEP'05),
2005.
[7] Cleland-Huang.J, Settimi.R, Duan.C, and Zou.X, ―Utilizing Supporting
Evidence to Improve Dynamic Requirements Trace- ability,‖ Proc. 13th IEEE
Int‘l Requirements Eng. Conf., pp. 135-144, 2005.
[8] Cullum.J.K and Willoughby.R.A, ―Real Rectangular Matrices,‖ Lanczos
Algorithms for Large Symmetric Eigenvalue Computations, vol. 1,
Birkhauser, 1998.
[9] De Lucia.A, Oliveto.R, and Tortora.G, ―Assessing IR-Based Traceability
Recovery Tools through Controlled Experiments,‖ Empirical Software
Eng., vol. 14, no. 1, pp. 57-93, 2009.
[10] De Lucia.A, Fasano.F, Oliveto.R, and Tortora.G, ―Recovering Traceability
Links in Software Artefact Management Systems Using Information Retrieval
Methods,‖ ACM Trans. Software Eng. and Methodology, vol. 16, no. 4, 2007.
[11] De Lucia.A, Oliveto.R, and Sgueglia.P, ―Incremental Approach and User
Feedbacks: A Silver Bullet for Traceability Recovery,‖ Proc. 22nd IEEE
Int‘l Conf. Software Maintenance, pp. 299-309, 2006.
[12] Deerwester.S, Dumais. S.T, Furnas. G.W, Landauer. T.K and Harshman. R,
―Indexing by Latent Semantic Analysis,‖ J. Am. Soc. for Information Science,
vol. 41, no. 6, pp. 391-407, 1990.
[13] Evans.E, Domain Driven Design: Tackling Complexity in the Heart of
Software. Addison-Wesley Professional, 2003.
[14] Hayes.J.H, Dekhtyar.A, and Sundaram.S.K, ―Advancing Candidate Link
Generation for Requirements Tracing: The Study of Methods,‖ IEEE Trans.
Software Eng., vol. 32, no. 1, pp. 4-19, Jan. 2006.
[15] Juha Kärkkäinen, Tommi Rantala, ―Engineering Radix Sort for Strings‖, 15th
International Symposium, SPIRE 2008, Melbourne, Australia, November 10-
12, 2008, p 3-14.
[16] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram
Singer, ―Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency
Network,‖ In Proceedings of HLT-NAACL 2003, pp. 252-259.
[17] Lawrie.D, Field.H, and Binkley.D, ―Quantifying Identifier Quality: An
Analysis of Trends,‖ Empirical Software Eng., vol. 12, no. 4, pp. 359-388,
2007.
[18] Lawrie.D, Morrell.C, Field.H, and Binkley.D, ―Effective Identifier Names for
Comprehension and Memory,‖ Innovations in Systems and Software Eng.,
vol. 3, no. 4, pp. 303-318, 2007.
[19] Lawrie.D, Morrell.C, Field.H, and Binkley.D, ―What‘s in a Name? A Study of
Identifiers,‖ Proc. 14th IEEE Int‘l Conf. Program Comprehension, pp. 3-12,
2006.
[20] Lormans.M, Deursen.A, and Gross.H.G, ―An Industrial Case Study in
Reconstructing Requirements Views,‖ Empirical Software Eng., vol. 13, no. 6,
pp. 727-760, 2008.
[21] Makoto Nagao, Shinsuke Mori, ―A New Method of N-gram Statistics for
Large Number of n and Automatic Extraction of Words and Phrases from
Large Text Data of Japanese‖, International Conference on Computational
Linguistics, In COLING-94, 1994, p. 611—615.
[22] Marcus. A and Maletic. J.I, ―Recovering Documentation-to-Source-Code
Traceability Links Using Latent Semantic Indexing,‖ Proc. 25th Int‘l Conf.
Software Eng., pp. 125-135, 2003.
[23] Marcus. A and Poshyvanyk. D, ―The Conceptual Cohesion of Classes,‖ Proc.
21st IEEE Int‘l Conf. Software Maintenance, pp. 133- 142, 2005.
[24] Marcus.A, Poshyvanyk.D, and Ferenc.R, ―Using the Conceptual Cohesion of
Classes for Fault Prediction in Object-Oriented Systems,‖ IEEE Trans.
Software Eng., vol. 34, no. 2, pp. 287-300, Mar./Apr. 2008.
[25] Oppenheim. A.N, Questionnaire Design, Interviewing and Attitude
Measurement. Pinter Publishers, 1992.
[26] Porter. M.F, ―An Algorithm for Suffix Stripping,‖ Program, vol. 14, no. 3, pp.
130-137, 1980.
[27] Sebastian Wild and Markus E. Nebel, ―Average Case Analysis of Java 7‘s
Dual Pivot Quicksort‖, Algorithms – ESA 2012, 20th Annual European
Symposium, Ljubljana, Slovenia, September 10-12, 2012, p. 825-836.
[28] Sebastian Wild, Markus Nebel, Raphael Reitzig, Ulrich Laube, ―Engineering
Java 7‘s Dual Pivot Quicksort Using MaLiJAn‖, Society for Industrial and
Applied Mathematics(SIAM), 2013, p-55-69.
[29] Settimi.R, Cleland-Huang.J, Ben Khadra.O, Mody.J, Lukasik.W, and De
Palma.C, ―Supporting Software Evolution through Dynamically Retrieving
Traces to UML Artifacts,‖ Proc. Seventh IEEE Int‘l Workshop Principles
of Software Evolution, pp. 49-54, 2004.
[30] Stephen Lacey, Richard Box: Nikkei BYTE, November 1991, p.305-312.
[31] Takang.A, Grubb.P, and Macredie.R, The effects of comments and identifier
names on program comprehensibility: an experiential study. Journal of
Program Languages, 4(3), 1996.
[32] Vladimir Yaroslavskiy, Jon Bentley, and Joshua Bloch, ―Dual-Pivot
Quicksort‖, February 16, 2009.
[33] William Cavnar.B and John Trenkle.M, ―N-Gram-Based Text
Categorization‖, Proceedings of the Third Symposium on Document Analysis
and Information Retrieval, 1994.
[34] Xueqiang LÜ, Le Zhang, and Junfeng Hu. Statistical Substring Reduction in
Linear Time. In Proceeding of the 1st International Joint Conference on
Natural Language Processing (IJCNLP-04), Sanya, Hainan Island, China,
March 2004.