statistical distribution of metrics


Upload: israel-herraiz

Post on 22-Nov-2014


DESCRIPTION

Presentation for the Seminar on Open Source Evolution 2013 http://informatique.umons.ac.be/genlog/SOS-Evol/SOS-Evol2013.html

TRANSCRIPT

Page 1: Statistical Distribution of Metrics

Statistical distributions of software metrics: do they matter?

Israel Herraiz

Technical University of Madrid

[email protected]

Grab these slides from

http://slideshare.net/herraiz/statistical-distributions-of-metrics

Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17

Page 2: Statistical Distribution of Metrics

Outline

1 Some background

2 Statistical properties of software metrics

3 Evidence of impact on quality

4 Summary of findings and further work



Page 4: Statistical Distribution of Metrics

A (not so) long time ago...

Statistical distribution of software metrics

Software size follows a double Pareto distribution

"Towards a theoretical model for software growth", MSR 2007

More recently

Not only size, but some OO metrics too (and some complexity metrics)

"On the Statistical Distribution of Object-Oriented System Properties", WETSoM 2012


Page 5: Statistical Distribution of Metrics

OK, but what is that double Pareto thing?

[Figure: P[X > x] against SLOC on logarithmic axes. Legend: Data, Double Pareto, Lognormal]
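As a reading aid for the plot, here is a minimal sketch of how the plotted quantity P[X > x] can be computed from a list of file sizes, next to the two reference shapes. It does not fit a double Pareto distribution. It only illustrates that, on log-log axes, a power-law tail is a straight line while a lognormal curve bends downward. The function names and parameters (`mu`, `sigma`, `xmin`, `alpha`) are illustrative and do not come from the slides.

```python
import math


def empirical_ccdf(values):
    """Return (x, P[X > x]) pairs for each distinct value in the sample."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if i == n - 1 or ordered[i + 1] != v:
            points.append((v, (n - i - 1) / n))
    return points


def lognormal_ccdf(x, mu, sigma):
    """P[X > x] for a lognormal whose logarithm has mean mu and std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def pareto_ccdf(x, xmin, alpha):
    """P[X > x] for a pure power-law tail starting at xmin.

    Here alpha is the exponent of the CCDF itself (an assumed convention).
    """
    return 1.0 if x < xmin else (x / xmin) ** (-alpha)
```

Plotting the pairs from `empirical_ccdf` with both axes in log scale would give a curve of the kind shown on the slide.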

Page 6: Statistical Distribution of Metrics

But does it matter?

Most of the files are on the lognormal side

[Figure: bar chart of % Files for C, C++, Java, Python and Lisp]

Page 7: Statistical Distribution of Metrics

But does it matter?


But the power law minority matters a lot

[Figure: bar chart of % SLOC for C, C++, Java, Python and Lisp]

Page 8: Statistical Distribution of Metrics

Large files have a large impact

Size estimation models

Some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.

[Figure: four panels (C, C++, Java, Python) plotting RE against SLOC]

"On the distribution of source code file sizes", ICSOFT 2011
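The underestimation can be illustrated with a toy calculation. The sketch below fits a lognormal to the logarithms of the file sizes, predicts the total size from the fitted mean, and reports the relative error in percent. It assumes that RE on the plots stands for relative error. The fitting method, the function names and the sample data are all hypothetical and are not the estimation models evaluated in the ICSOFT 2011 paper.

```python
import math
import statistics


def fit_lognormal(sizes):
    """Estimate (mu, sigma) from the mean and std of the log sizes."""
    logs = [math.log(s) for s in sizes]
    return statistics.mean(logs), statistics.pstdev(logs)


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution with parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2.0)


def relative_error_pct(estimated, actual):
    """Signed relative error in percent; negative means underestimation."""
    return 100.0 * (estimated - actual) / actual


def total_size_error(sizes):
    """Relative error of the total size predicted by a fitted lognormal."""
    mu, sigma = fit_lognormal(sizes)
    estimated_total = len(sizes) * lognormal_mean(mu, sigma)
    return relative_error_pct(estimated_total, sum(sizes))


# Made-up sample: 99 small files plus one very large file in the tail.
sizes = [10] * 99 + [100_000]
error = total_size_error(sizes)  # negative: the fit underestimates the total
```

In this made-up sample the single large file dominates the real total, but it barely moves the fitted lognormal parameters, so the predicted total falls far short.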


Page 10: Statistical Distribution of Metrics

Parameters of the statistical distribution

Power law parameters: λ and xmin

Transition from lognormal to power law

[Figure: P[X > x] against SLOC on logarithmic axes. Legend: Data, Double Pareto, Lognormal]
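The slides do not say how the power law parameters are estimated. One standard option, shown here only as a sketch, is the continuous maximum-likelihood estimator for the exponent of the density p(x) ∝ x^(-alpha), applied to the values at or above a given xmin. Whether the slides' λ follows this convention or the exponent of P[X > x] (which is smaller by one) is not stated, and choosing xmin itself is outside this sketch. The function name is hypothetical.

```python
import math


def fit_power_law_exponent(values, xmin):
    """Continuous MLE of the density exponent for the tail x >= xmin.

    Returns (alpha, n_tail), where alpha = 1 + n / sum(ln(x_i / xmin)).
    """
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal xmin")
    return 1.0 + len(tail) / log_sum, len(tail)
```

Only the files at or above xmin enter the estimate, which is why the transition point matters as much as the exponent.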


Page 12: Statistical Distribution of Metrics

Probability of finding defects


We have seen that files above xmin account for 40% of total size while being only about 1% of the files.

What about defects? Probability of finding defects in three software projects (using CYCLO as metric)

Project       Below xmin   Above xmin
Apache        .4178        .7708
OpenIntents   .2500        .7500
Zxing         .2143        .4161

* Data extracted from "ReLink: Recovering Links between Bugs and Changes", FSE 2011.
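Probabilities of this kind can be computed as sketched below, assuming each file is described by its metric value and a flag saying whether a defect was linked to it. The record layout, the function name and the sample data are hypothetical. The slides do not describe how the ReLink data was processed.

```python
def defect_probability_by_threshold(files, xmin):
    """Fraction of defective files below and at-or-above xmin.

    files: iterable of (metric_value, has_defect) pairs.
    Returns (p_below, p_above); a side with no files yields NaN.
    """
    below = [has_defect for metric, has_defect in files if metric < xmin]
    above = [has_defect for metric, has_defect in files if metric >= xmin]

    def share(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return share(below), share(above)


# Made-up example: (CYCLO value, defect found?) for nine files.
sample = [
    (2, False), (3, False), (5, True), (7, False), (9, False),
    (40, True), (55, True), (70, False), (120, True),
]
p_below, p_above = defect_probability_by_threshold(sample, xmin=30)
```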

Page 13: Statistical Distribution of Metrics

Probability of finding defects

Probability of finding defects (normalized metrics)

Using CYCLO / WMC as metric (cyclomatic complexity per LOC)

Project       Below xmin   Above xmin
Apache        .4159        .6296
OpenIntents   .2813        .5417
Zxing         .3181        .2389
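To compare the two tables (CYCLO and CYCLO / WMC), it helps to look at the ratio between the probability above xmin and the probability below it. The sketch below does that arithmetic on the figures reported on the slides. The ratio is an illustrative summary added here, not a measure used by the author.

```python
# (below xmin, above xmin) probabilities as reported on the slides.
raw_cyclo = {
    "Apache": (0.4178, 0.7708),
    "OpenIntents": (0.2500, 0.7500),
    "Zxing": (0.2143, 0.4161),
}
normalized_cyclo = {
    "Apache": (0.4159, 0.6296),
    "OpenIntents": (0.2813, 0.5417),
    "Zxing": (0.3181, 0.2389),
}


def above_to_below_ratio(table):
    """Ratio of the defect probability above xmin to the one below it."""
    return {project: above / below for project, (below, above) in table.items()}


raw_ratios = above_to_below_ratio(raw_cyclo)
normalized_ratios = above_to_below_ratio(normalized_cyclo)
```

With the raw metric every project has a ratio above 1. With the normalized metric Zxing's ratio falls below 1, since .2389 is smaller than .3181.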

Page 14: Statistical Distribution of Metrics

Probability of finding defects

Defects density (only pre-release defects)

Using Number of Methods and number of pre-release defects per LOC

[Figure: two histograms, panels "Below xmin" and "Above xmin"]

Below xmin: Avg. Dens. = .2685
Above xmin: Avg. Dens. = .4565

* Data obtained from "Predicting Defects for Eclipse", PROMISE 2007
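A minimal sketch of how an average defect density on each side of xmin could be computed follows. It assumes that density means defects divided by LOC for each file and that the reported figure is the plain mean of the per-file densities. The slides do not confirm either choice, and the record layout and sample values are made up.

```python
def average_defect_density(files, xmin):
    """Mean per-file defect density below and at-or-above xmin.

    files: iterable of (metric_value, defects, loc) tuples.
    Files with non-positive LOC are skipped. Returns (avg_below, avg_above).
    """
    below, above = [], []
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # density is undefined for empty files
        side = above if metric >= xmin else below
        side.append(defects / loc)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(below), mean(above)


# Made-up example: (number of methods, defects, LOC) for four files.
sample = [(3, 1, 100), (5, 0, 50), (40, 2, 100), (60, 6, 200)]
avg_below, avg_above = average_defect_density(sample, xmin=30)
```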

Page 15: Statistical Distribution of Metrics

Probability of finding defects

Defects density (only post-release defects)

Using Number of Methods and number of post-release defects per LOC

[Figure: two histograms, panels "Below xmin" and "Above xmin"]

Below xmin: Avg. Dens. = .1437
Above xmin: Avg. Dens. = .2690

Page 16: Statistical Distribution of Metrics

Probability of finding defects

Defects density (pre + post-release defects)

Using CYCLO/SLOC and number of total defects per LOC

[Figure: log-log plots with Pr(X ≥ x) on the vertical axis and x on the horizontal axis, panels "Below xmin" and "Above xmin"]

Below xmin: Avg. Dens. = .3335 (>9000 files)
Above xmin: Avg. Dens. = .7747 (364 files)


Page 18: Statistical Distribution of Metrics

Summary and further work

Summary of preliminary findings

Some metrics have a transition from lognormal to power law

Clear relation between normalized metrics and defects density

Although the threshold might not be perfect (e.g., you might find a high defects density in a file on the lower side), it greatly reduces the search space for potentially problematic files
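The size of that reduction can be estimated from the figures on the pre + post-release slide (364 files above xmin, more than 9000 below). The sketch below is plain arithmetic on those numbers. Because the slide gives only a lower bound for the files below xmin, the computed share is an upper bound. The variable names are illustrative.

```python
# Figures reported on the "pre + post-release defects" slide.
files_above = 364
files_below_at_least = 9000  # the slide says ">9000 files"
avg_density_below = 0.3335
avg_density_above = 0.7747

# Upper bound on the share of files that lie above xmin.
share_above_upper_bound = files_above / (files_above + files_below_at_least)

# How much denser in defects the files above xmin are, on average.
density_ratio = avg_density_above / avg_density_below
```

Under these numbers, fewer than 4% of the files lie above xmin, and their average defect density is a little over twice that of the rest.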

Further work

Verify in more projects

Do you have defects data at the file level?

Find explanation for the transition and its influence on quality

How do the statistical parameters change over time? Do defects evolve accordingly?
