mod2014-mens-lecture2

Evolving Software Ecosystems
Marktoberdorf Summer School 2014

Lecture 2

Tom Mens
Software Engineering Lab

University of Mons

http://informatique.umons.ac.be/genlog/

Software Evolution

Software Evolution: Lehman’s Laws

• Manny Lehman (1925-2010)
  – Studied 30-year evolution of IBM OS/360 mainframe
  – Proposed “laws” that reflect established observations based on empirical evidence
  – EPSRC-funded FEAST project
    • Additional evidence on more industrial software projects

Lehman and Belady (1985). Software Evolution – Processes of Software Change. Academic Press.

Lehman (1997). Laws of Software Evolution Revisited. Springer LNCS 1149, pp. 108-124

Software Evolution: Lehman’s Laws

• Continuing change
  – A […] program that is used in a real-world environment must be continually adapted, else it becomes progressively less satisfactory.
• Increasing complexity
  – As a program is evolved its complexity increases unless work is done to maintain or reduce it.
• Continuing growth
  – Functional content of a program must be continually increased to maintain user satisfaction over its lifetime.
• Declining quality
  – […] programs will be perceived as of declining quality unless rigorously maintained and adapted to a changing operational environment.
• Feedback system
  – […] programming processes constitute multi-loop, multi-level feedback systems and must be treated as such to be successfully modified or improved.

July-August 2014 - NATO Marktoberdorf Summer School - Dependable Software Systems Engineering
February 2014 - CSMR-WCRE Software Evolution Week, Antwerp, Belgium

Software Evolution: Relevant Books

2006
- Consider software evolution process as a multi-loop, multi-level feedback system
- Reports on results from the EPSRC-funded FEAST project
- Supporting empirical evidence for Lehman’s laws of software evolution

2008
Relevant chapters
- Analyzing Software Repositories to Understand Software Evolution (D’Ambros et al.)
- Predicting Bugs From History (Zimmermann et al.)
- Empirical Studies of Open Source Evolution (Fernandez-Ramil et al.)

Mens, Tom; Serebrenik, Alexander; Cleve, Anthony (Eds.) 2014, XXIII, 404 p.
Springer, ISBN 978-3-642-45398-4

Chapter 10
Studying Evolving Software Ecosystems based on Ecological Models

Tom Mens, Maelick Claes, Philippe Grosjean and Alexander Serebrenik

Research on software evolution is very active, but evolutionary principles, models and theories that properly explain why and how software systems evolve over time are still lacking. Similarly, more empirical research is needed to understand how different software projects co-exist and co-evolve, and how contributors collaborate within their encompassing software ecosystem.

In this chapter, we explore the differences and analogies between natural ecosystems and biological evolution on the one hand, and software ecosystems and software evolution on the other hand. The aim is to learn from research in ecology to advance the understanding of evolving software ecosystems. Ultimately, we wish to use such knowledge to derive diagnostic tools aiming to analyse and optimise the fitness of software projects in their environment, and to help software project communities in managing their projects better.

Tom Mens and Maelick Claes and Philippe Grosjean
COMPLEXYS Research Institute, University of Mons, Belgium
e-mail: tom.mens,maelick.claes,[email protected]

Alexander Serebrenik
Eindhoven University of Technology, The Netherlands
e-mail: [email protected]

This work has been partially supported by F.R.S-F.N.R.S. research grant BSS-2012/V 6/5/015 (an author’s stay at the Université de Mons) and ARC research project AUWB-12/17-UMONS-3, “Ecological Studies of Open Source Software Ecosystems”, financed by the Ministère de la Communauté française - Direction générale de l’Enseignement non obligatoire et de la Recherche scientifique, Belgium.

Software Ecosystems

Definitions

Software Ecosystems: Relevant Books

MIT Press, 2005
2013

Software Ecosystems: Relevant PhD Dissertations

Reverse Engineering Software Ecosystems

Doctoral Dissertation submitted to the

Faculty of Informatics of the University of Lugano

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

presented by

Mircea F. Lungu

under the supervision of

Michele Lanza

September 2009

Social Aspects of Collaboration in Online Software Communities

Bogdan Vasilescu Eindhoven University of Technology

2014


Software Ecosystems: Definitions

• Messerschmitt & Szyperski, 2003 [book]
  – “a collection of software products that have some given degree of symbiotic relationships.”
• Lungu, 2008 [dissertation]
  – “a collection of software projects that are developed and evolve together in the same environment.”
• Jansen et al., 2013 [book]
  – “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.”


Business-oriented view
• “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.”

Examples
• Eclipse
• Android and iOS app store


Development-centric view
• “a collection of software products that have some given degree of symbiotic relationships.”
• “a collection of software projects that are developed and evolve together in the same environment.”

Examples
• Gnome, KDE
• Debian, Ubuntu
• R’s CRAN
• Apache

[Figure: Project 1, Project 2, Project 3]

Socio-technical view
• a community of persons (end-users, developers, debuggers, …) contributing to a collection of projects

Ecosystem <> System of systems (cf. John McDermid)

An ecosystem is a set of systems that is “designed as a whole”.
These systems
• cannot function in isolation (symbiotic relationships)
• are usually very diverse
• function together as a unit
• are evolved together towards a common (but evolving) goal

Software Ecosystems: Challenges

Software Ecosystem Analysis: Challenges

Empirically analysing software ecosystems involves many challenges
• Technical challenges
• Scientific challenges
• Practical challenges
• Ethical challenges
• …

Technical challenges

• Extracting and combining data from different sources
• Identity merging
• Dealing with inconsistent and incomplete data
• Big data analytics
  – special skills and tools needed to store, process and analyse huge amounts of data

Scientific challenges

• Accessibility of data
  – E.g. many apps in Google Play are proprietary and historical information is not accessible
  – Focus on open source software
• Reproducibility of results
• Generalisability of results
• Which research methodology, which metrics, which statistical tools, …

Practical challenges

• How can we share our big data with other researchers?
  – Different formats, different tools, storage problems, …
• How can we make our research results useful to practitioners and development communities?
• How can we build tools and dashboards that integrate our findings?

Ethical challenges

• Privacy issues
  – Can we use and combine information about actual developers?
  – Can we make these results freely available?
  – How to reconcile privacy with reproducibility?

[Figure: Privacy versus Reproducibility]


Technical Challenges: Extracting data from different sources

• Source code and other commits stored in version control repositories
  E.g., Subversion, Git
• Developer mailing lists and user mailing lists
• Bug reports and change requests stored in issue tracking systems
  E.g., Bugzilla, JIRA
• Question and Answer websites
  E.g., StackOverflow


Technical Challenges: Extracting data from different sources

Using the open source MetricsGrimoire tool suite (https://github.com/MetricsGrimoire)

CVSAnalY
• extracts information from SVN or Git source code repository logs and stores it into a relational database

MailingListStats
• extracts mailing list information from mbox format

Bicho
• extracts information from issue tracking systems such as Bugzilla and JIRA
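To make the idea of "repository logs into a relational database" concrete, here is a minimal sketch in Python. It is not how CVSAnalY works internally; the table layout, the function name `load_commits` and the sample data are assumptions made for illustration. The input is assumed to be the text produced by `git log --pretty=format:"%H|%an|%ae|%aI"`.

```python
import sqlite3

# Hypothetical sample of `git log --pretty=format:"%H|%an|%ae|%aI"` output.
SAMPLE_LOG = """\
a1b2c3|John Doe|jdoe@example.org|2013-03-06T16:06:00+01:00
d4e5f6|Doe, John|john.doe@example.com|2013-03-07T09:30:00+01:00
0718aa|Jane Roe|jane@example.org|2013-03-08T11:15:00+01:00
"""

def load_commits(log_text, conn):
    """Parse pipe-separated log lines and store them in a `commits` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits ("
        "sha TEXT PRIMARY KEY, author_name TEXT, "
        "author_email TEXT, authored_at TEXT)"
    )
    # Assumption: author names never contain the '|' separator.
    rows = [tuple(line.split("|", 3))
            for line in log_text.splitlines() if line.strip()]
    conn.executemany("INSERT OR IGNORE INTO commits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
load_commits(SAMPLE_LOG, conn)

# The distinct (name, e-mail) pairs are the aliases that identity
# merging has to reconcile later on.
aliases = conn.execute(
    "SELECT DISTINCT author_name, author_email "
    "FROM commits ORDER BY author_email"
).fetchall()
for name, email in aliases:
    print(name, email)
```

Once the data sits in a database, questions such as "which aliases exist" or "how many commits per alias" become plain SQL queries.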


Technical Challenges: Identity merging

The same contributor may use different aliases

Euphegenia Doubtfire, [email protected]

Robin Williams, [email protected]



[Figure: Repositories and contributors. Contributors appear under different aliases in the source code repository, the mailing list and the bug tracker. Aliases shown include john, johnny, John Smith, “John, Doe”, “Doe, John”, John W. Doe, Jane, john <[email protected]> and several other e-mail addresses. Figure dated 6-3-2013.]

Types of differences between aliases of the same contributor

• Ordering
  – Rajesh Sola / Sola Rajesh
• Spelling: misspelling, diacritics, punctuation
  – Rene Engelhard / Fene Engelhard
  – Démurget / Demurget
  – J. A. M. Carneiro / J A M Carneiro
• Middle initials, patronyms, nicknames, additional surnames, incomplete names
  – Daniel M. Mueth / Daniel Mueth
  – Alexander Alexandrov Shopov / Alexander Shopov
  – Carlos Garnacho Parro / Carlos Garnacho
  – Jacob “Ulysses” Berkman / Jacob Berkman
  – A S Alam / Amanpreet Singh Alam
• Name variants: transliteration, diminutives
  – Γιωργοσ / Georgios
  – Mike Gratton / Michael Gratton
• Software-specific: usernames, projects, tooling artefacts
  – mrhappypants / Aaron Brown
  – Arturo Tena/libole2 / Arturo Tena
  – (16:06) Alex Roberts / Alex Roberts
• Mix
  – Any combination of those

id = 17 { John Doe, Doe John, [email protected], [email protected], [email protected] }

Semi-automatic approach:
• eliminate specific quirks observed during extraction
  Example: “(16:06) Alex Roberts”
• compute similarity between each pair of aliases (based on Levenshtein distance)
• cluster together aliases with high similarity
• post-process manually
  – rely on external information (websites)
  – precise but labor-intensive
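The automatic part of these steps can be sketched in a few lines of Python. This is an illustrative sketch, not the tooling used in the lecture: the normalisation rules, the similarity formula (1 minus distance divided by the longer length), the threshold of 0.8 and all function names are assumptions. The manual post-processing step is not covered.

```python
import re
from itertools import combinations

def levenshtein(a, b):
    """Minimal number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # addition
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

def normalise(alias):
    """Step 1: remove extraction quirks, case, punctuation and word order."""
    alias = re.sub(r"^\(\d{1,2}:\d{2}\)\s*", "", alias)   # e.g. "(16:06) "
    alias = re.sub(r"[^\w\s]", " ", alias.lower())
    return " ".join(sorted(alias.split()))

def similarity(a, b):
    """Step 2: Levenshtein-based similarity in [0, 1]."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest

def cluster(aliases, threshold=0.8):
    """Step 3: group aliases whose similarity reaches the threshold."""
    parent = {a: a for a in aliases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(aliases, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for a in aliases:
        groups.setdefault(find(a), set()).add(a)
    return list(groups.values())

names = ["John Doe", "Doe, John", "(16:06) Alex Roberts", "Alex Roberts",
         "J. A. M. Carneiro", "J A M Carneiro", "Mike Gratton"]
for group in cluster(names):
    print(sorted(group))
```

Sorting the name parts handles the ordering differences cheaply. Nicknames and transliterations still need the manual post-processing step.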



Levenshtein distance (1965):
• Computes the minimal distance between 2 strings in terms of single character edits (deletion, addition or replacement)
• Example: lev(“Mike”, “Michael”) = 4
  – “Mike” => “Mice” => “Miche” => “Michae” => “Michael”

• Side note
  – Damerau-Levenshtein distance also considers transposition of adjacent characters
  – Applied in biology for DNA sequence alignment
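The definition can be checked with a short dynamic-programming sketch in Python. The function name and the `transpositions` flag are choices made for this illustration. With the flag set, the code computes the restricted Damerau-Levenshtein variant (often called optimal string alignment), which is an assumption about which variant the side note refers to.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance between a and b.

    With transpositions=True, swapping two adjacent characters also counts
    as one edit (restricted Damerau-Levenshtein / optimal string alignment).
    """
    # d[i][j] = distance between the first i chars of a and first j of b
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # replacement (or match)
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("Mike", "Michael"))                # 4
print(edit_distance("ab", "ba"))                       # 2
print(edit_distance("ab", "ba", transpositions=True))  # 1
```

The table has (len(a)+1) × (len(b)+1) cells, so comparing every pair of aliases in a large ecosystem can become expensive.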



• several merge algorithms exist
• the “noisier” the data, the worse they perform!
• simple algorithms have higher precision and recall than more complex ones

A Comparison of Identity Merge Algorithms for Software Repositories

Mathieu Goeminne*, Tom Mens*

Institut d’Informatique, Faculté des Sciences, Université de Mons

Abstract

Software repository mining research extracts and analyses data originating from multiple software repositories to understand the historical development of software systems, and to propose better ways to evolve such systems in the future. Of particular interest is the study of the activities and interactions between the persons involved in the software development process. The main challenge with such studies lies in the ability to determine the identities (e.g., logins or e-mail accounts) in software repositories that represent the same physical person. To achieve this, different identity merge algorithms have been proposed in the past. This article provides an objective comparison of identity merge algorithms, including some improvements over existing algorithms. The results are validated on a selection of large ongoing open source software projects.

Keywords: software repository mining, empirical software engineering, identity merging, open source, software evolution, comparison

1. Introduction

Empirical software engineering research focuses on the use of empirical studies, experiments and statistical analysis in order to gain a better understanding of software products and processes [1]. An important branch of empirical research studies how software evolves over time and which processes are used to support this evolution. To achieve this, the principal data sources are software repositories of different kinds, such as source code repositories, bug tracking systems, and archived communications of the developer community (e.g., mailing lists, online forums and discussion boards). The research domain of software repository mining [2] uses these data sources to understand the historical development of software systems, and to build and empirically validate theories, models, processes and tools for these evolving systems. Many of these empir- […]

*Corr. author: Place du Parc 20, 7000 Mons, [email protected], +32 65 373453

Preprint submitted to Elsevier, November 28, 2011

Science of Computer Programming 78(8), August 2013



Alternative automated approach
• Use of Latent Semantic Analysis (LSA)
  – equally good as other algorithms in average case
  – better performance in worst case

[…] oracle contains 4989 unique identities, i.e., on average each GNOME contributor uses approximately 1.73 aliases.

We treat two cases: an average-case, containing random samples of the set of 8618 GNOME aliases, and a worst-case, consisting of a subset of 673 “noisy” GNOME aliases, expected to cause false negatives in the simple algorithm. We have obtained this dataset by removing contributors with only one alias, as well as contributors with intersecting ⟨name, prefix⟩ sets. It is a priori not clear how the algorithm by Bird et al. will behave on the worst-case dataset.

For each algorithm/scenario we performed training/testing steps and repeated the process ten times. Training determines optimal parameter values: for the simple algorithm we varied minLen (1, . . . , 10); for the algorithm by Bird et al. we varied the Levenshtein similarity threshold t (0.05, . . . , 1); for LSA, to avoid training on all combinations of the 4 parameters, we first performed a sensitivity analysis by fixing 3 and varying the remaining. After the sensitivity analysis we restricted the range of minLen to {2, 3, 4}, levThr to {0.5, 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to half of the number of terms. In the average case, for each of the ten repetitions, training was performed on one tenth of the GNOME aliases (≈ 860), and testing on ten random subsets with the same size from the remaining aliases. Samples were chosen instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms as well as the data, can be made available upon request.

Figure 1. The f-measures for the competing approaches. The f-measure ranges between 0 and 1 (the higher the value, the better). LSA performs as well as the simple algorithm in the average case, and significantly better in the worst case. Note that both y-axes start at 0.75.
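For readers unfamiliar with the f-measure, the following Python sketch shows one common way to score an identity merge against an oracle. The excerpt does not say how precision and recall were computed in the paper, so the pairwise definition used here, the function names and the toy data are all assumptions.

```python
from itertools import combinations

def merged_pairs(clusters):
    """All unordered pairs of aliases that a clustering puts together."""
    return {frozenset(pair)
            for group in clusters
            for pair in combinations(sorted(group), 2)}

def f_measure(predicted, oracle):
    """Pairwise precision, recall and their harmonic mean (f-measure)."""
    pred, gold = merged_pairs(predicted), merged_pairs(oracle)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy data: the oracle says a1, a2, a3 are one person and b1 another.
oracle = [{"a1", "a2", "a3"}, {"b1"}]
predicted = [{"a1", "a2"}, {"a3", "b1"}]
precision, recall, f = f_measure(predicted, oracle)
print(precision, recall, f)
```

In this toy case one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is found (recall 1/3), which gives an f-measure of 0.4.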

Figure 1 displays the results of the cross-validation. In the average case (left) we observe that LSA performs as well as the simple algorithm (Kruskal-Wallis test followed by pairwise Wilcoxon tests with Bonferroni correction did not reveal enough reasons to assume that the two produce essentially different results at 0.05 significance level), followed by the algorithm of Bird et al. Concurrent results have been obtained in [5]: simple is better than Bird, and is the best of all algorithms tested. LSA and the simple algorithm do, however, behave differently. For example, the simple algorithm does not merge ⟨Christophe Michael Saout, csaout@domainA⟩ with ⟨Christophe Saout, christophe@domainB⟩ because the two aliases are disjoint, while LSA does. However, the simple algorithm correctly merges ⟨Gareth Owen, gowen@domainA⟩ with ⟨gowen, gowen@domainB⟩, while LSA does not (the cosine similarity between the documents corresponding to the two is 0.69 and falls just outside the threshold, in this case 0.70). This observation suggests that further improvements of the LSA algorithm, e.g., by using the simple algorithm in a pre-processing step, might be possible, and are considered as future work. On the other hand, the results in the worst case (Figure 1 right) show a clear improvement of LSA (median=0.935) over Bird et al’s (median=0.893) and the simple algorithms (median=0.778), confirmed by the statistical analysis described above.
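The behaviour of the simple algorithm on these two examples can be reproduced with a much reduced sketch in Python. This is only one possible reading of the description above, not the published algorithm. It is assumed here that each alias yields a set containing its full name and its e-mail prefix, that terms shorter than `min_len` (standing in for the minLen parameter) are ignored, and that aliases with intersecting sets are merged.

```python
def alias_terms(name, email, min_len=3):
    """Terms describing an alias: its full name and its e-mail prefix."""
    prefix = email.split("@", 1)[0]
    terms = {name.strip().lower(), prefix.strip().lower()}
    return {t for t in terms if len(t) >= min_len}

def simple_merge(aliases, min_len=3):
    """Merge (name, email) aliases whose term sets intersect."""
    clusters = []  # list of (terms, members) with pairwise disjoint terms
    for name, email in aliases:
        terms = alias_terms(name, email, min_len)
        members = [(name, email)]
        untouched = []
        for c_terms, c_members in clusters:
            if c_terms & terms:      # shared name or prefix: same identity
                terms |= c_terms
                members += c_members
            else:
                untouched.append((c_terms, c_members))
        clusters = untouched + [(terms, members)]
    return [set(members) for _, members in clusters]

aliases = [
    ("Gareth Owen", "gowen@domainA"),
    ("gowen", "gowen@domainB"),
    ("Christophe Michael Saout", "csaout@domainA"),
    ("Christophe Saout", "christophe@domainB"),
]
for identity in simple_merge(aliases):
    print(sorted(identity))
```

Under these assumptions the two gowen aliases share the term "gowen" and are merged, while the two Christophe aliases have disjoint term sets and stay apart, matching the behaviour described in the paragraph above.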

VI. CONCLUSIONS

Our main contribution is a generic new identity merging algorithm based on LSA, robust against many types of discrepancies in VCS aliases. Empirical evaluation on GNOME Git repositories has shown equally-good performance of our algorithm as the state of the art in the average case, and better performance in the worst case.

REFERENCES

[1] C. Bird et al. “Mining email social networks”. In: MSR. ACM, 2006, pp. 137–143.

[2] A. Capiluppi, A. Serebrenik, and A. Youssef. “Developing an H-Index for OSS Developers”. In: MSR. IEEE, 2012, pp. 251–254.

[3] P. Christen. “A comparison of personal name matching: Techniques and practical issues”. In: ICDM. IEEE, 2006, pp. 290–294.

[4] D.M. German. “The GNOME project: a case study of open source, global software development”. In: Software Process 8.4 (2003), pp. 201–215.

[5] M. Goeminne and T. Mens. “A comparison of identity merge algorithms for software repositories”. In: Science of Computer Programming (2011). accepted.

[6] T.K. Landauer and S.T. Dumais. “A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.” In: Psychological Review 104.2 (1997), p. 211.

[7] A. Marcus and J.I. Maletic. “Recovering documentation to source code traceability links using latent semantic indexing”. In: ICSE. IEEE, 2003, pp. 125–137.

[8] W. Poncin, A. Serebrenik, and M.G.J. van den Brand. “Process Mining Software Repositories”. In: CSMR. IEEE, 2011, pp. 5–14.

[9] G. Robles and J.M. Gonzalez-Barahona. “Developer identification methods for integrated data from various sources”. In: MSR. ACM, 2005, pp. 1–5.

Who’s who in GNOME: using LSA to merge software repository identities

Erik Kouters, Bogdan Vasilescu*, Alexander Serebrenik, Mark G. J. van den Brand
Technische Universiteit Eindhoven,
Den Dolech 2, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

[email protected], {b.n.vasilescu, a.serebrenik, m.g.j.v.d.brand}@tue.nl

Abstract: Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases.

It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences.

We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.

Keywords: identity merging; Gnome; latent semantic analysis

I. INTRODUCTION

One of the challenges when mining software repositories is identity merging [5]. To study contributors to software projects or software ecosystems, one often tries to integrate information about their contributions in different software repositories, such as version control systems, bug trackers, or mailing lists. However, developers may use different aliases in different software repositories (e.g., Bryan Clark authors Evince changes as Bryan Clark with the email address clarkbw@domainA¹, but participates in Evince mailing lists using bclark@domainB), and even different aliases in the same software repository (one of the Empathy developers sometimes uses the nickname mrhappypants). Correctly identifying who’s who in open source projects is an essential preprocessing step in many empirical analyses: for example, activity of open source developers could be used externally as a measure of their recognition and experience [2].

* Supported by the Dutch Science Foundation project “Multi-Language Systems: Analysis and Visualization of Evolution - Analysis” (612.001.020).

¹ Domain names obscured for privacy reasons.

To integrate information about individual contributions, we therefore need a unique identity representing the same contributor across different repositories and different projects. To this end, we need to use an identity merging algorithm [1, 3, 5, 8, 9]. However, performance of existing approaches degrades sharply in presence of “noisy” data, i.e., data containing large discrepancies between the aliases used by the same individual: “the more noisy and complex the project data is, the worse the merge algorithms behave” [5].

In this paper we concentrate on aliases used by developers in version control systems (VCS); here the term “alias” refers to a ⟨name, email⟩ tuple, typically available in VCS logs. Even for a single repository type such as VCS, the same contributor may use different aliases at different times, or in different projects within the ecosystem. Our goal is to design an identity merge algorithm with improved robustness with respect to noisy data, common in ecosystems maintained by large developer communities. We start by extracting commit authorship information from all GNOME Git repositories, and discuss differences in the aliases used by GNOME developers in Section II. Next, we evaluate robustness of two state of the art identity merging algorithms with respect to types of differences in aliases in Section III. Based on lessons learned from existing approaches, we propose a new identity merging algorithm using Latent Semantic Analysis (LSA) [6] in Section IV, and evaluate it empirically by means of cross-validation in Section V. Our results show equally-good performance as the state of the art in the average case, and a clear improvement over existing approaches on noisy input data.

II. TYPES OF DIFFERENCES IN GNOME ALIASES

As case study we select GNOME, a popular free and open source desktop environment for GNU/Linux. GNOME has a long development history (some projects, e.g., gnome-disk-utility, have started in 1997 and are still evolving today), is maintained by a large community of developers (we found 8618 different aliases² across 1316 different GNOME projects³), and is well-known to researchers [4]. Analysis of […]

² We consider data from the author name/email fields in the Git logs.

³ Values computed on October 28, 2011, based on the entire lifetime of the projects available at http://git.gnome.org/browse/.

978-1-4673-2312-3/12/$31.00 © 2012 IEEE

ICSM 2012 ERA track


Research challenges: Accessibility

Focus on open-source software
• Free access to source code, defect data, developer and user communication
• Historical data available in open repositories
  – Observable communities
  – Observable activities
• Increasing popularity for personal and commercial use
• A huge range of community and software sizes
