Software size distribution - Why we always underestimate software cost

On the distribution of source code file sizes. Israel Herraiz (Universidad Politécnica de Madrid, Spain), Daniel German (University of Victoria, Canada), Ahmed E. Hassan (Queen's University, Canada). ICSOFT 2011, Sevilla, July 19th 2011. Preprint available at http://oa.upm.es/6791/. This presentation is available at http://slideshare.net/herraiz/on-the-distribution-of-source-code-file-sizes

DESCRIPTION

Why we always underestimate software cost. Presentation of the paper "On the distribution of source code file sizes", accepted for ICSOFT 2011 (http://www.icsoft.org). Preprint available at http://oa.upm.es/6791/

TRANSCRIPT

Page 1: Software size distribution - Why we always underestimate software cost

On the distribution of source code file sizes

Israel Herraiz – Universidad Politécnica de Madrid, Spain

Daniel German – University of Victoria, Canada

Ahmed E. Hassan – Queen's University, Canada

ICSOFT 2011

Sevilla, July 19th 2011

Preprint available at http://oa.upm.es/6791/

This presentation available at http://slideshare.net/herraiz/on-the-distribution-of-source-code-file-sizes

Page 2: Software size distribution - Why we always underestimate software cost

Software size

● Important metric

  ● Estimation of effort and cost

● Examples

  ● COCOMO

    ● Effort = a × KLOC^b

    ● Time = c × Effort^d

    ● People = Effort / Time

  ● Function points

● Guess size and you will find out software cost
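
The COCOMO formulas above can be sketched in a few lines. The default coefficients are the textbook "organic mode" values of basic COCOMO; they are an assumption here, since the slides do not give values for a, b, c, or d.

```python
def cocomo_basic(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Return (effort, time, people) for a size given in KLOC.

    The default coefficients are the textbook organic-mode values of
    basic COCOMO. They are an assumption, not taken from the slides.
    """
    effort = a * kloc ** b   # Effort = a * KLOC^b (person-months)
    time = c * effort ** d   # Time = c * Effort^d (months)
    people = effort / time   # People = Effort / Time
    return effort, time, people


if __name__ == "__main__":
    for kloc in (10, 20, 40):
        effort, time, people = cocomo_basic(kloc)
        print(f"{kloc:>3} KLOC: effort={effort:.1f}, time={time:.1f}, people={people:.1f}")
```

Because b is greater than 1, effort grows faster than size, so any underestimate of size is amplified in the cost estimate.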

Page 3: Software size distribution - Why we always underestimate software cost

Goal of this paper

● Find out the statistical distribution of software size

● Why is the distribution important?

  ● Estimate overall size

  ● Estimate number of modules within a given size range

Image extracted from http://en.wikipedia.org/wiki/Normal_distribution
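
As a minimal sketch of why the distribution matters, the two estimates listed above can be computed directly once a model is assumed. The example assumes a lognormal model with log-mean `mu` and log-standard-deviation `sigma`; the function names are illustrative and not from the paper.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """P(size <= x) for a lognormal with log-mean mu and log-std sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def expected_files_in_range(n_files, lo, hi, mu, sigma):
    """Expected number of files whose size falls in (lo, hi]."""
    return n_files * (lognormal_cdf(hi, mu, sigma) - lognormal_cdf(lo, mu, sigma))


def expected_total_size(n_files, mu, sigma):
    """Expected overall size: n_files times the lognormal mean exp(mu + sigma^2 / 2)."""
    return n_files * math.exp(mu + sigma ** 2 / 2.0)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    mu, sigma, n_files = 4.0, 1.2, 10000
    print(expected_files_in_range(n_files, 100, 1000, mu, sigma))
    print(expected_total_size(n_files, mu, sigma))
```

Both estimates depend entirely on the assumed model, so a wrong distribution gives wrong totals.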

Page 4: Software size distribution - Why we always underestimate software cost

The lognormal distribution

● Software size is believed to follow a lognormal distribution

  ● “The Distribution of Program Sizes and Its Implications: An Eclipse Case Study”. Hongyu Zhang, Hee Beng Kuan Tan, Michele Marchesi

  ● http://arxiv.org/abs/0905.2288

Images extracted from http://en.wikipedia.org/wiki/Lognormal_distribution
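
A short sketch of what "lognormal" means in practice: the logarithm of the size is normally distributed. The sample below is synthetic and the parameters are hypothetical; no real file sizes are used.

```python
import math
import random
import statistics


def sample_file_sizes(n, mu, sigma, seed=0):
    """Draw n synthetic sizes from a lognormal distribution (illustration only)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    sizes = sample_file_sizes(20000, mu=4.0, sigma=1.0)
    logs = [math.log(s) for s in sizes]
    # The logs should have mean close to mu and standard deviation close to sigma.
    print(statistics.mean(logs), statistics.stdev(logs))
    # The sizes themselves are right-skewed: the mean exceeds the median.
    print(statistics.mean(sizes), statistics.median(sizes))
```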

Page 5: Software size distribution - Why we always underestimate software cost

Lognormal vs. Double Pareto

● Contrary to previous results, we have found that software size follows a double Pareto distribution

● Large files are found more often than predicted by the lognormal distribution

Image extracted from http://en.wikipedia.org/wiki/Lognormal_distribution

Page 6: Software size distribution - Why we always underestimate software cost

How did we find out?

● Measuring Debian 5.0.2, about 1.4M source code files

● Measured SLOC for different programming languages
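
The slides do not say which tool was used to measure SLOC. As a rough illustration only, a naive counter might skip blank lines and whole-line comments. The comment prefixes below are a simplification (block comments and trailing comments are not handled), and the function name is hypothetical.

```python
# Whole-line comment prefixes per language (a deliberate simplification).
LINE_COMMENT = {
    "python": "#", "shell": "#", "perl": "#",
    "lisp": ";", "c": "//", "cpp": "//", "java": "//",
}


def count_sloc(source_text, language):
    """Count lines that are neither blank nor whole-line comments."""
    prefix = LINE_COMMENT[language]
    count = 0
    for line in source_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(prefix):
            count += 1
    return count


if __name__ == "__main__":
    example = "# a comment\n\nx = 1\ny = 2  # trailing comment\n"
    print(count_sloc(example, "python"))
```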

Page 7: Software size distribution - Why we always underestimate software cost

The numbers

Page 8: Software size distribution - Why we always underestimate software cost

Estimation of the density function

● Looks lognormally distributed

● The figure uses a logarithmic scale

Page 9: Software size distribution - Why we always underestimate software cost

But is it normal?

● Graphical normality test

● Compare the quantiles of the sample with the ideal quantiles of a normal distribution

● Quite lognormal, except for the tails
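
The graphical test above is a quantile-quantile comparison. A minimal sketch of how the points could be computed, assuming all sizes are positive and using the common (i - 0.5) / n plotting positions (an assumption; the slides do not state which positions were used):

```python
import math
from statistics import NormalDist, mean, stdev


def normal_qq_points(sizes):
    """Return (theoretical, observed) quantile pairs for log(size)."""
    logs = sorted(math.log(s) for s in sizes)
    n = len(logs)
    m, s = mean(logs), stdev(logs)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    points = []
    for i, observed in enumerate(logs, start=1):
        p = (i - 0.5) / n  # plotting position inside (0, 1)
        theoretical = m + s * std_normal.inv_cdf(p)
        points.append((theoretical, observed))
    return points


if __name__ == "__main__":
    for theoretical, observed in normal_qq_points([3, 12, 40, 95, 300, 2500]):
        print(f"{theoretical:8.3f} {observed:8.3f}")
```

If the sizes were exactly lognormal, the points would fall on the diagonal; systematic departures at either end correspond to the tails.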

Page 10: Software size distribution - Why we always underestimate software cost

The density function from another point of view

● Complementary cumulative distribution function

● Makes it easier to spot the shapes of known statistical distributions
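
A minimal sketch of the empirical complementary cumulative distribution function (CCDF), here defined as P(X >= x). The function name is illustrative.

```python
def empirical_ccdf(sizes):
    """Return sorted (x, P(X >= x)) pairs, one per distinct size."""
    data = sorted(sizes)
    n = len(data)
    points = []
    for i, x in enumerate(data):
        # At the first occurrence of x, exactly n - i values are >= x.
        if i == 0 or x != data[i - 1]:
            points.append((x, (n - i) / n))
    return points


if __name__ == "__main__":
    print(empirical_ccdf([1, 2, 2, 4]))
```

Plotted on log-log axes, a power law appears as a straight line, which is why this view makes tail behaviour easier to recognize.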

Page 11: Software size distribution - Why we always underestimate software cost

Conclusions so far

Page 12: Software size distribution - Why we always underestimate software cost

Conclusions so far

● The shape of the actual distribution is very close to lognormal

● However, the evidence is not conclusive

● The tails deviate from lognormality, and do not show a clear shape in the CCDF plot

Page 13: Software size distribution - Why we always underestimate software cost

But are the tails important?

● The tails are only a minority of files

● We will come back to this plot later

Page 14: Software size distribution - Why we always underestimate software cost

Impact of the minority

● But the tails are an immense minority

● Impact of large files in the tails on the overall size of the system

Page 15: Software size distribution - Why we always underestimate software cost

Model fitting

● Two-part model fitting

  ● Lognormal

    ● Straightforward procedure

  ● The tails

    ● Probably power laws, a less straightforward procedure

● How do we decide where the lognormal body ends and the tails begin?
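
For the lognormal part, the straightforward procedure is presumably the usual maximum likelihood fit: take logarithms and compute their mean and standard deviation. This is a sketch under that assumption, with an illustrative function name; it expects positive sizes.

```python
import math


def fit_lognormal(sizes):
    """Maximum likelihood estimates (mu, sigma) of a lognormal model."""
    logs = [math.log(s) for s in sizes]
    n = len(logs)
    mu = sum(logs) / n
    # The maximum likelihood estimator divides by n, not n - 1.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma


if __name__ == "__main__":
    print(fit_lognormal([12, 40, 95, 300, 2500]))
```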

Page 16: Software size distribution - Why we always underestimate software cost

Maximum likelihood power law fitting

● Fitting power laws to empirical data

  ● Clauset et al. “Power-law distributions in empirical data”.

  ● http://www.santafe.edu/~aaronc/powerlaws/

● Estimate the parameters that minimize the Kolmogorov-Smirnov distance

  ● Maximum vertical distance in the CCDF between model and data

● Calculates a threshold value for data that deviate from the power law model
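
The procedure above can be sketched as follows. This is a simplified illustration in the spirit of Clauset et al., not the authors' code: it uses the continuous maximum likelihood estimator for the exponent, tries every distinct value as a candidate threshold, and keeps the one with the smallest Kolmogorov-Smirnov distance. The `min_tail` parameter and all function names are assumptions made for this example.

```python
import math
from bisect import bisect_left, bisect_right


def fit_alpha(tail, xmin):
    """Continuous maximum likelihood estimate of the exponent for x >= xmin."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(tail, xmin, alpha):
    """Maximum vertical gap between the empirical and model CCDFs."""
    data = sorted(tail)
    n = len(data)
    worst = 0.0
    for x in set(data):
        model = (x / xmin) ** (1.0 - alpha)         # model P(X >= x)
        at_least = (n - bisect_left(data, x)) / n   # empirical P(X >= x)
        above = (n - bisect_right(data, x)) / n     # empirical P(X > x)
        worst = max(worst, abs(at_least - model), abs(above - model))
    return worst


def fit_power_law_tail(sizes, min_tail=50):
    """Return (xmin, alpha, ks) for the threshold with the smallest KS distance.

    Quadratic in the number of distinct values: fine for a sketch, slow
    for very large data sets. Returns None if no candidate qualifies.
    """
    data = sorted(sizes)
    best = None
    for xmin in sorted(set(data)):
        tail = data[bisect_left(data, xmin):]
        if len(tail) < min_tail:
            break                  # too few points left to fit a tail
        if tail[-1] == xmin:
            continue               # all values equal: exponent undefined
        alpha = fit_alpha(tail, xmin)
        distance = ks_distance(tail, xmin, alpha)
        if best is None or distance < best[2]:
            best = (xmin, alpha, distance)
    return best
```

The returned `xmin` plays the role of the threshold value: sizes below it are treated as part of the body rather than the power law tail.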

Page 17: Software size distribution - Why we always underestimate software cost

Example of model fitting

● The data and two models in the CCDF plot

● Showing only Lisp source code files

Page 18: Software size distribution - Why we always underestimate software cost

Results for all the languages

● Two languages do not have power law tails

  ● Shell and Perl

Page 19: Software size distribution - Why we always underestimate software cost

What about the lognormal body?

● Shell and Perl do not fit the lognormal model well either

Page 20: Software size distribution - Why we always underestimate software cost

Timeout!

Conclusions so far

Page 21: Software size distribution - Why we always underestimate software cost

Conclusions so far

● Lognormal body + power law tail

  ● C, C++, Java, Python and Lisp

● Unknown distribution

  ● Shell and Perl

● Large files are more frequent than predicted by a lognormal model

Page 22: Software size distribution - Why we always underestimate software cost

Using the threshold value to show the impact of large files

● Even though large files are very scarce, they account for a large part of the overall size
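
A minimal sketch of how the impact of large files could be quantified once a threshold is known. The function name and the sample numbers are illustrative, not data from the paper.

```python
def tail_share(sizes, threshold):
    """Return (fraction of files, fraction of total size) at or above threshold."""
    large = [s for s in sizes if s >= threshold]
    return len(large) / len(sizes), sum(large) / sum(sizes)


if __name__ == "__main__":
    # Synthetic example: 99 small files and one very large file.
    sizes = [10] * 99 + [1010]
    file_fraction, size_fraction = tail_share(sizes, threshold=1000)
    print(f"{file_fraction:.1%} of the files hold {size_fraction:.1%} of the size")
```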

Page 23: Software size distribution - Why we always underestimate software cost

Estimation errors using double Pareto and lognormal models

● This impact causes a large error in the predictions of the lognormal model

● Showing relative error for Lisp

Page 24: Software size distribution - Why we always underestimate software cost

So what?

● Estimation techniques based on lognormal size models will always underestimate the size of software

  ● Because they underestimate the number of large files

  ● And large files account for more than 30% of the overall size

Page 25: Software size distribution - Why we always underestimate software cost

Any more juice extracted from these oranges?

● More fuel for the programming languages holy war

  ● The power law parameters could be related to the properties of the different programming languages

Page 26: Software size distribution - Why we always underestimate software cost

And what about Shell and Perl?

● These languages are used to a great extent for package maintenance activities in Debian

● So they are of a different nature

● Does it mean that double Pareto is the signature of the programming process?

Page 27: Software size distribution - Why we always underestimate software cost

Further work

● Analysis over time

  ● How do files reach the threshold value?

  ● What happens when files get large? Do they split? Are they abandoned?

● Domains of application

  ● How do the power law parameters change with domain of application?

● Can we find more “non-double Pareto” languages?

Page 28: Software size distribution - Why we always underestimate software cost

Take away

Software size is an important metric (effort, cost)

Size is not lognormal, it is double Pareto

Lognormal models underestimate software size by design

Double Pareto as the signature of programming?

Preprint available at http://oa.upm.es/6791/