user-perceived source code quality estimation based on static analysis metrics
TRANSCRIPT
1
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki
Intelligent Systems & Software Engineering Labgroup, Information Processing Laboratory
Thessaloniki, Greece
Email: {mpapamic, thdiaman}@issel.ee.auth.gr, [email protected]
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
2 Outline
- The concept of user-perceived quality.
- Research objectives.
- Key implementation points.
- The designed system.
- Evaluation.
- Conclusion and future work.
3 Why evaluate code quality?
Various open source software projects. Numerous online software repositories.
Source Code Quality Evaluation
Code Reuse
Is a software component suitable for reuse?
4 User-Perceived source code quality
Idea: Use of software component popularity as a quality indicator.
But: Popularity cannot be used as the sole quality criterion.
- It is based on current trends.
- It depends on the programming language.
Popularity + Static Analysis Metrics + Recommended Coding Practices → Measure of quality
5 The idea: User-Perceived Quality Estimation
Idea: Use of software component popularity as ground truth (GitHub number of stars).
Tools Used: Static analysis metrics and violations of “good” coding practices.
Proposed System: Apply machine learning techniques for estimating user-perceived source code quality.
Static Analysis → Quality Evaluation Models → Quality Score
6 Key implementation points
- Qualitative evaluation of the selected repositories.
- Training set formation.
- Target set formation.
- Quality estimation models.
7 Training dataset
Diagram: GitHub → Top 100 Repositories (24930 files) → Metrics Report and Violations Report → Training Dataset and Qualitative Evaluation
8 Selected repositories qualitative evaluation
Percentage (%) of files containing severe violations, per PMD ruleset

PMD Ruleset         Priority 1    Priority 2
Unused Code         0.0%          0.0%
Basic               0.015%        0.337%
Braces              0.0%          0.0%
Comments            0.0%          0.0%
Naming              14.11%        0.45%
Clone               0.0%          0.0%
CodeSize            0.0%          0.0%
Controversial       1.75%         1.58%
Coupling            0.0%          0.0%
Design              3.37%         3.9%
Empty               0.0%          0.0%
Finalizers          0.0%          0.0%
Optimizations       0.0%          0.0%
Strict Exception    4.99%         0.0%
Strings             0.0%          0.06%
Unnecessary         0.0%          0.0%

A very small percentage of files contain severe violations.
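The percentages in this table can be read as "share of files with at least one violation of a given ruleset and priority". A minimal sketch of that computation follows, assuming the violations report has already been parsed into (file, ruleset, priority) triples. The function name, the triple layout, and the toy data are hypothetical and not taken from the slides.

```python
def severe_violation_rates(violations, total_files, priorities=(1, 2)):
    """Return {ruleset: {priority: % of files with at least one such violation}}."""
    files_hit = {}
    for file_path, ruleset, priority in violations:
        if priority in priorities:
            # A file counts once per (ruleset, priority), however many hits it has.
            files_hit.setdefault((ruleset, priority), set()).add(file_path)
    rates = {}
    for (ruleset, priority), files in files_hit.items():
        rates.setdefault(ruleset, {})[priority] = 100.0 * len(files) / total_files
    return rates


# Toy input, invented for illustration only.
sample = [
    ("A.java", "Naming", 1),
    ("A.java", "Naming", 1),   # same file again: still one file
    ("B.java", "Design", 2),
    ("C.java", "Design", 3),   # priority 3 is not counted as severe here
]
rates = severe_violation_rates(sample, total_files=4)
```

Rulesets with no severe violations simply do not appear in the result, which corresponds to the 0.0% rows of the table.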
9 Target set formation
Use of GitHub stars as ground truth.
But:
- GitHub stars are given per repository (NOT per file).
- Every source code file is of different importance.
- There are big differences in the number of files between repositories.
Diagram: the stars of a repository (10000) are distributed to its files (x stars, y stars, z stars) through Dependency Analysis.
10 Target set formation
For the i-th file of the j-th repository, the target is formulated from the following terms:
- A smoothing factor.
- A base score given to all files in the same repository.
- An added value according to the significance of the source code file.
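The exact target formula is not preserved in this transcript, so the following is only an illustrative sketch of how the three terms could be combined. It assumes a logarithm as the smoothing step, stars divided by file count as the base score, and a significance weight in [0, 1] (for example, derived from dependency analysis) for the added value. All of these choices, and every name in the code, are assumptions and not the authors' formula.

```python
import math


def file_target(repo_stars, n_files, significance):
    """Hypothetical per-file target built from the three terms on the slide."""
    base = repo_stars / n_files          # base score shared by all files of the repository
    added = repo_stars * significance    # added value for more significant files (assumed weight)
    return math.log(base + added)        # log used here as an assumed smoothing step


# Two files of the same 10000-star, 100-file repository.
peripheral = file_target(10000, 100, significance=0.0)
central = file_target(10000, 100, significance=0.05)
```

Under these assumptions, files of the same repository share a common floor, and files that other files depend on receive a higher target.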
11 Quality Evaluation Models
ANNs Model
- Input: The values of 73 static analysis metrics.
- Output: User-perceived source code quality estimation.
- Applicable only for source code files that exceed a minimum quality threshold.
SVMs - One Class Classifier
- Used to rule out low quality code.
Static Analysis → One Class Classifier → Accepted → ANNs Model → Quality Estimation
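The two-stage flow in this diagram can be sketched as a gate followed by an estimator. The sketch below uses plain callables as stand-ins for the trained one-class classifier and the neural network; the function names and the toy gate and scorer are hypothetical and only illustrate the control flow.

```python
from typing import Callable, Optional, Sequence

Metrics = Sequence[float]


def estimate_quality(
    metrics: Metrics,
    accepts: Callable[[Metrics], bool],
    score: Callable[[Metrics], float],
) -> Optional[float]:
    """Score a file only if the one-class gate accepts it; otherwise return None."""
    if not accepts(metrics):
        return None            # ruled out as low quality, no score is produced
    return score(metrics)


# Toy stand-ins, invented for illustration (not trained models).
def toy_gate(m: Metrics) -> bool:
    return m[0] <= 10.0        # e.g. reject files whose first metric is too high


def toy_score(m: Metrics) -> float:
    return 1.0 / (1.0 + m[0])


accepted = estimate_quality([1.0], toy_gate, toy_score)
rejected = estimate_quality([50.0], toy_gate, toy_score)
```

Keeping the gate separate means the estimator never has to extrapolate to files far outside the region it was trained on.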
12 ANNs Model
- Two-layer feedforward network.
- Levenberg-Marquardt algorithm (LMA) for adjusting the weights and the biases.
- (Training, Validation, Test) = (70%, 15%, 15%).
- Applicable only for source code files that exceed a minimum quality threshold.
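The 70% / 15% / 15% partition above can be illustrated with a short sketch. It assumes a random shuffle before splitting, which the slides do not state; the function name and the seed are hypothetical.

```python
import random


def split_dataset(samples, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle the samples and cut them into training, validation and test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)       # fixed seed keeps the split reproducible
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # the remainder goes to the test part
    return train, val, test


train, val, test = split_dataset(range(100))
```

The validation part is typically used to decide when to stop training, and the test part is held back for the final error estimate.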
13 SVMs One Class Classifier
- Used to rule out low quality code.
- Gaussian radial basis kernel function.
- Training involved the use of 7 metrics: Average Block Depth, Average Cyclomatic Complexity, Average Depth of Inheritance Hierarchy, Average Lines of Code Per Method, Comments Ratio, Distance, Lines of Code.
- (nu, gamma, tolerance) = (0.1, 0.01, 0.01)
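For reference, the Gaussian radial basis kernel has the standard form k(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it with the gamma value from this slide; the function name and the sample vectors are illustrative, and the SVM implementation used by the authors is not stated in the transcript.

```python
import math


def rbf_kernel(x, y, gamma=0.01):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical vectors give similarity 1.0
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # squared distance 25, so exp(-0.25)
```

A small gamma such as 0.01 makes the kernel decay slowly with distance, so the accepted region around the training files is relatively smooth. In a one-class SVM, nu is an upper bound on the fraction of training samples treated as outliers.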
1124 false positives
14 System Evaluation
Results validation:
- Quantitative: Using PMD.
- Qualitative: Examination of a representative sample of files and their metrics.
Evaluation on three main axes:
1. The system's ability to distinguish high quality source code files.
2. The effectiveness of the model for estimating the quality of files exceeding a quality threshold.
3. The accuracy of predicting the popularity of Java repositories given their source code files.
15 System Evaluation
Repositories selected:
- 8 random typical GitHub projects chosen independently.
- Lines-of-code-per-file ratio around 100, including also several extreme cases.
- Both human and auto-generated code.
The auto-generated projects are expected to be of high quality:
- They follow all coding conventions.
- They are architecturally and functionally complete.
16 Evaluation – One Class Classifier
The percentage of the rejected files that contained severe violations is very high.
17 Evaluation – ANNs Model
The quality score reflects the characteristics of the repositories.
18 Evaluation – Popularity Prediction
19 Conclusions and future work
Conclusions:
- Reliable determination of the area of high quality source code based on static analysis metrics.
- Effective user-perceived source code quality estimation.
Future Work:
- Further investigation of the response of our model in different scenarios.
- Expansion of the ground truth coverage by using more metrics.
- Application of feature selection techniques in order to drop overlapping metrics.
20
Thank you!