Politecnico di Torino
Andrea Capiluppi
Characterizing the Open Source Software Process:
a Horizontal Study
A. Capiluppi, P. Lago, M. Morisio

Outline
Rationale behind the current study
Methodology
Conclusions
Current and future work

Rationale
Most Open Source analyses focus on a single flagship project (Linux, Apache, GNOME)
Limitation: their conclusions are based on a ‘vertical’ study
There is a lack of ‘horizontal’ studies, which cover:
  a pool of projects
  a wider area of interest

Methodology
Choice of projects
Attribute definition
Coding
Analysis

Choice of projects: repository
We selected the FreshMeat repository
FreshMeat (http://freshmeat.net) has focused on Open Source development since 1996
It gathers thousands of projects, either mirrored on the pages of SourceForge (http://sourceforge.net) or hosted on FreshMeat only
FreshMeat lists more than 24000 projects (many inactive)

Choice of projects: sampling I
From 24000 to 406 - how?
FreshMeat organizes projects by filters and categories
  Filter = “Topic” (i.e. application domain)
  Categories = {“Internet”, “Database”, “Multimedia”,…}
Other filters: Programming language, Status of evolution, etc.

Choice of projects: sampling II
We randomly picked a number of projects through the “Status” filter
Rationale: this filter has a limited number of categories: {“Planning”, “PreAlpha”, “Alpha”, “Beta”, “Stable”, “Mature”}
The overall sample is 406 projects
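The sampling step can be illustrated with a minimal Python sketch. The slides do not say how many projects were drawn per “Status” category, so the `per_status` quota, the `seed`, and the `(name, status)` input format are assumptions made only for illustration.

```python
import random

# Status categories as listed on the slide.
STATUSES = ["Planning", "PreAlpha", "Alpha", "Beta", "Stable", "Mature"]

def sample_by_status(projects, per_status, seed=0):
    """Pick up to `per_status` random projects from each status category.

    `projects` is a list of (name, status) pairs. The quota and the seed
    are illustrative assumptions, not values taken from the study.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for status in STATUSES:
        pool = [name for name, s in projects if s == status]
        k = min(per_status, len(pool))  # a category may hold fewer projects
        sample.extend((name, status) for name in rng.sample(pool, k))
    return sample
```

Calling `sample_by_status(projects, 70)` on a full repository listing would give a sample of at most 420 projects; the actual quota behind the 406 projects is not stated in the slides.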

Attribute definition
Age
Application domain
Programming language
Size [KB]
Number of developers
Stable and transient developers
Number of users
Modularity level
Documentation level
Popularity
Status
Success of project
Vitality
Colour legend on the slide: red = defined by FreshMeat, black = defined by us

Coding
Each attribute was coded twice, to capture evolutionary trends
First observation: January 2002
Second observation: July 2002

Analysis
Here we discuss:
  Application domain issues
  Developers [stable & transient] issues
  Subscribers (as users) issues
  Code size issues

Application domain distribution
[Figure: bar chart over application domains, y-axis from 0 to 160. Domains: Internet, System, Sw Devel, Communications, Multimedia, Desktop, Database, Games, Security, Utilities, Scient/Eng, Text Editors, Office/Business, Text Processing, Printing, Terminals, other. Bar labels in extraction order: 96, 89, 86, 70, 27, 23, 18, 16, 16, 10, 9, 5, 10, 34, 14, 14, 113.]

Attributes: project’s developers
We evaluate how many people write code for an application
External contributions are always credited in special-purpose files or in the ChangeLog
We distinguish between:
  Stable developers
  Transient developers
Core team: more than one stable developer
Data gathered through manual inspection and pattern-recognition scripts
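As an illustration of what a pattern-recognition script over a ChangeLog might look like, here is a minimal Python sketch. It assumes GNU-style entry headers (`date  name  <email>`); the study's actual scripts and the file formats they handled are not shown in the slides, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed GNU-style ChangeLog entry header, for example:
# "2002-01-15  Jane Doe  <jane@example.org>"
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^<>\s]+)>\s*$")

def count_changelog_authors(text):
    """Return a Counter mapping each author name to their number of entries."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:  # indented change descriptions do not match the header
            counts[match.group(2)] += 1
    return counts
```

The per-author entry counts could then feed a stable/transient classification; the criterion separating the two groups is not given in the slides, so none is encoded here.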

Developers over projects
We observe:
  72% of projects have a single stable developer
  80% of projects have at most 10 developers

Developers distribution over projects
[Figure: histogram of Frequency [%] (y-axis from 0% to 80%) against number of developers, with bins 0, 1, 2, up to 10, up to 20, up to 50, up to 100, more than 100. Series: Developers, Stable dev, Transient dev.]

Definition: clusters of developers
Cluster 1: 1 to 3 developers (64.5%)
Cluster 2: 4 to 10 developers (20%)
Cluster 3: 11 to 20 developers (9.5%)
  “Average” nr. of stable dev: 2
  “Average” nr. of transient dev: 3
Cluster 4: more than 20 developers (6%)
  “Average” nr. of stable dev: 6
  “Average” nr. of transient dev: 19
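The cluster boundaries above translate directly into a small lookup. The following Python sketch uses the boundaries from the slide; the function names and the list-of-counts input are assumptions for illustration.

```python
from collections import Counter

def developer_cluster(n_developers):
    """Map a project's developer count to cluster 1-4 as defined on the slide."""
    if n_developers < 1:
        raise ValueError("clusters are defined for 1 or more developers")
    if n_developers <= 3:
        return 1
    if n_developers <= 10:
        return 2
    if n_developers <= 20:
        return 3
    return 4

def cluster_shares(developer_counts):
    """Return the percentage of projects falling in each cluster."""
    counts = Counter(developer_cluster(n) for n in developer_counts)
    total = len(developer_counts)
    if total == 0:
        return {c: 0.0 for c in (1, 2, 3, 4)}
    return {c: 100.0 * counts[c] / total for c in (1, 2, 3, 4)}
```

For example, `cluster_shares([1, 2, 5, 30])` puts half of the projects in cluster 1 and a quarter each in clusters 2 and 4.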

Productivity vs. ‘global’ developers
[Figure: bar chart of Code Size [kB] (y-axis from 0 to 800) per cluster of global developers (Clust1, Clust2, Clust3, Clust4). Bar labels in extraction order: 605, 733, 621, 656.]

Productivity vs. ‘stable’ developers
[Figure: bar chart of Code size [kB] (y-axis from 0 to 3500) per number of stable developers (1 to 3, 4 to 10, 11 to 20, more than 20). Bar labels in extraction order: 1867, 2543, 3223, 438.]

Code variation over clusters
[Figure: bar chart of Code Variation [%] (y-axis from 0% to 25%) per cluster (Clust1, Clust2, Clust3, Clust4). Bar labels in extraction order: 10.94%, 19.58%, 10.40%, 15.83%.]
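The slides do not spell out how code variation is computed. A plausible reading, assumed here, is the relative size change between the two observations, averaged over the projects of each cluster. The Python sketch below follows that assumption; the function names and the tuple layout are hypothetical.

```python
def code_variation_pct(size_first_kb, size_second_kb):
    """Relative size change between two observations, in percent.

    Assumed definition: (second - first) / first * 100.
    """
    if size_first_kb <= 0:
        raise ValueError("first observation must have a positive size")
    return (size_second_kb - size_first_kb) / size_first_kb * 100.0

def mean_variation_by_cluster(projects):
    """Average the per-project variation within each developer cluster.

    `projects` is an iterable of (cluster, size_first_kb, size_second_kb).
    """
    per_cluster = {}
    for cluster, first, second in projects:
        per_cluster.setdefault(cluster, []).append(
            code_variation_pct(first, second))
    return {c: sum(v) / len(v) for c, v in per_cluster.items()}
```

An unweighted mean treats every project equally; weighting by project size would be another reasonable choice and would give different figures.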

Attributes: subscribers
We use publicly available data as a proxy for the number of users
Users ~ mailing list subscribers (public datum)
It is not a monotonic measure: subscribers can leave as well as join
We have a measure of users at two different observations

Distribution of subscribers over projects
[Figure: histogram of Frequency [%] (y-axis from 0 to 45) against number of subscribers, with bins 1, 5, 10, 50, 100, More. Series: all projects, older than one year.]
Around 42% of projects have at most 1 subscriber-user

Users evolution
[Figure: bar chart of Frequency [%] (y-axis from 0% to 40%) for users evolution, grouped as “Between 1 and 10” and “More than 10”. Series: No Gain, Projects losing users, Projects gaining users. Bar labels in extraction order: 30.3%, 12.1%, 9.1%, 32.3%, 16.3%.]
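A classification of this kind can be sketched in a few lines of Python. The sketch assumes that “Between 1 and 10” and “More than 10” refer to the absolute change in subscribers between the two observations; the slides do not confirm this reading, and the function name is hypothetical.

```python
def user_evolution(subscribers_first, subscribers_second):
    """Classify the subscriber change between two observations.

    Returns (trend, magnitude): trend is "gaining", "losing" or "no gain";
    magnitude is None, "between 1 and 10" or "more than 10".
    The magnitude buckets are an assumed reading of the chart labels.
    """
    delta = subscribers_second - subscribers_first
    if delta == 0:
        return ("no gain", None)
    trend = "gaining" if delta > 0 else "losing"
    magnitude = "between 1 and 10" if abs(delta) <= 10 else "more than 10"
    return (trend, magnitude)
```

Because the measure is not monotonic, the sign of the change matters as much as its size, which is why the function returns both.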

Attributes: project’s size
We evaluate the code of each project twice
The code evaluated is the code contained in the packages. We exclude from the count:
  Auxiliary files: documentation, configuration files, GIF files, etc.
  Legacy code: inherited libraries (e.g. Gnome macros), internationalization code
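A size measurement with exclusions of this kind could be sketched as follows in Python, for an unpacked package directory. The extension and directory lists are illustrative assumptions, not the study's actual exclusion rules, and `code_size_kb` is a hypothetical name.

```python
import os

# Illustrative exclusion lists (assumptions, not the study's actual rules).
AUXILIARY_EXTENSIONS = {".txt", ".html", ".gif", ".png", ".conf", ".cfg"}
EXCLUDED_DIRS = {"doc", "docs", "po", "intl", "macros"}

def code_size_kb(root):
    """Sum the size in KB of files under `root`, skipping excluded content."""
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d.lower() not in EXCLUDED_DIRS]
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in AUXILIARY_EXTENSIONS:
                continue  # auxiliary file: not counted as code
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes / 1024.0
```

Filtering by name and extension is only a first approximation; inherited libraries in particular would likely need per-project inspection.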

Distribution of code size over projects
[Figure: chart of Frequency [%] per size cluster [KB], with clusters [0-10], (10-100], (100-1000], (1000-10000], >10000. Labels in extraction order: 39.25%, 7%, 17%, 1%, 35.75%.]
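The size clusters are right-closed intervals, so a binary search over the upper edges assigns each project to its cluster. A minimal Python sketch, with hypothetical function names:

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the right-closed size clusters, in KB, as on the slide.
SIZE_EDGES_KB = [10, 100, 1000, 10000]
SIZE_LABELS = ["[0-10]", "(10-100]", "(100-1000]", "(1000-10000]", ">10000"]

def size_cluster(size_kb):
    """Return the label of the size cluster that contains `size_kb`."""
    if size_kb < 0:
        raise ValueError("size must be non-negative")
    # bisect_left keeps a value equal to an edge in the lower cluster.
    return SIZE_LABELS[bisect_left(SIZE_EDGES_KB, size_kb)]

def size_distribution(sizes_kb):
    """Return the percentage of projects in each size cluster."""
    counts = Counter(size_cluster(s) for s in sizes_kb)
    total = len(sizes_kb)
    if total == 0:
        return {label: 0.0 for label in SIZE_LABELS}
    return {label: 100.0 * counts[label] / total for label in SIZE_LABELS}
```

Using `bisect_left` rather than `bisect_right` is what makes a 10 KB project fall in `[0-10]` and not in `(10-100]`.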

Evolutionary observations of size changes
[Figure: chart of Frequency [%] per range of size variation, with ranges 0%, (0%-10%], (10%-50%], >50%. Labels in extraction order: 59%, 22%, 15%, 5%.]

Conclusions I
The vast majority of projects are developed by only one developer
Adding people to a project has little effect on productivity (i.e. code per developer)
Open Source software is made by experts for experts (72% of horizontal projects have more than 10 developers)
58% of projects didn’t change their size
63% of projects had a change within 1%

Conclusions II
Java is relevant for 8% of the projects, C/C++ for 56%, Perl and Python together for 20%
Observations from flagship projects (Apache, Linux, GNOME) are not confirmed for an average Open Source project
Several projects are white noise, to be filtered out
There is a huge amount of data in public repositories: empirical researchers have an invaluable resource of software data

Current and future work
Eliminating white noise: only projects in clusters 3 and 4 have been selected
Deeper analysis: the whole history of each project is being studied
What can we say with respect to the conclusions drawn on bigger OS projects?
What can be said about OSS evolution compared with traditional software evolution?