mod2014-mens-lecture2
DESCRIPTION
This is my second in a series of 4 lectures on the topic of Evolving Software Ecosystems, presented during the NATO Marktoberdorf 2014 Summer School on Dependable Software System Engineering in Germany, August 2014.
TRANSCRIPT
Evolving Software Ecosystems, Marktoberdorf Summer School 2014
Lecture 2
Tom Mens, Software Engineering Lab
University of Mons, informatique.umons.ac.be/genlog
Software Evolution
Software Evolution: Lehman’s Laws
• Manny Lehman (1925-2010)
  – Studied 30-year evolution of IBM OS/360 mainframe
  – Proposed “laws” that reflect established observations based on empirical evidence
  – EPSRC-funded FEAST project
• Additional evidence on more industrial software projects
Lehman and Belady (1985). Software Evolution – Processes of Software Change. Academic Press.
Lehman (1997). Laws of Software Evolution Revisited. Springer LNCS 1149, pp. 108-124
Software Evolution: Lehman’s Laws
• Continuing change
  – A […] program that is used in a real-world environment must be continually adapted, else it becomes progressively less satisfactory.
• Increasing complexity
  – As a program is evolved its complexity increases unless work is done to maintain or reduce it.
• Continuing growth
  – Functional content of a program must be continually increased to maintain user satisfaction over its lifetime.
• Declining quality
  – […] programs will be perceived as of declining quality unless rigorously maintained and adapted to a changing operational environment.
• Feedback system
  – […] programming processes constitute multi-loop, multi-level feedback systems and must be treated as such to be successfully modified or improved.
July-August 2014, NATO Marktoberdorf Summer School, Dependable Software Systems Engineering
February 2014, CSMR-WCRE Software Evolution Week, Antwerp, Belgium
Software Evolution: Relevant Books
2006
Consider software evolution process as a multi-loop multi-level feedback system
- Reports on results from the EPSRC-funded FEAST project
- Supporting empirical evidence for Lehman’s laws of software evolution
Software Evolution: Relevant Books
2008
Relevant chapters
- Analyzing Software Repositories to Understand Software Evolution - D’Ambros et al.
- Predicting Bugs From History - Zimmermann et al.
- Empirical Studies of Open Source Evolution - Fernandez-Ramil et al.
Software Evolution: Relevant Books
Mens, Tom; Serebrenik, Alexander; Cleve, Anthony (Eds.) 2014, XXIII, 404 p.
Springer, ISBN 978-3-642-45398-4
Chapter 10
Studying Evolving Software Ecosystems based on Ecological Models
Tom Mens, Maelick Claes, Philippe Grosjean and Alexander Serebrenik
Research on software evolution is very active, but evolutionary principles, models and theories that properly explain why and how software systems evolve over time are still lacking. Similarly, more empirical research is needed to understand how different software projects co-exist and co-evolve, and how contributors collaborate within their encompassing software ecosystem.
In this chapter, we explore the differences and analogies between natural ecosystems and biological evolution on the one hand, and software ecosystems and software evolution on the other hand. The aim is to learn from research in ecology to advance the understanding of evolving software ecosystems. Ultimately, we wish to use such knowledge to derive diagnostic tools aiming to analyse and optimise the fitness of software projects in their environment, and to help software project communities in managing their projects better.
Tom Mens, Maelick Claes and Philippe Grosjean
COMPLEXYS Research Institute, University of Mons, Belgium
e-mail: tom.mens,maelick.claes,[email protected]
Alexander Serebrenik
Eindhoven University of Technology, The Netherlands
e-mail: [email protected]
This work has been partially supported by F.R.S-F.N.R.S. research grant BSS-2012/V 6/5/015 (supporting an author’s stay at the Université de Mons) and ARC research project AUWB-12/17-UMONS-3, “Ecological Studies of Open Source Software Ecosystems” financed by the Ministère de la Communauté française - Direction générale de l’Enseignement non obligatoire et de la Recherche scientifique, Belgium.
Software Ecosystems
Definitions
Software Ecosystems: Relevant Books
MIT Press, 2005
2013
Software Ecosystems: Relevant PhD Dissertations
Reverse Engineering Software Ecosystems
Doctoral Dissertation submitted to the
Faculty of Informatics of the University of Lugano
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
presented by
Mircea F. Lungu
under the supervision of
Michele Lanza
September 2009
Social Aspects of Collaboration in Online Software Communities
Bogdan Vasilescu Eindhoven University of Technology
2014
Software Ecosystems: Definitions
• Messerschmitt & Szyperski, 2003 [book]
  • “a collection of software products that have some given degree of symbiotic relationships.”
• Lungu, 2008 [dissertation]
  • “a collection of software projects that are developed and evolve together in the same environment.”
• Jansen et al., 2013 [book]
  • “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.”
Software Ecosystems: Definitions
Business-oriented view
• “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.”
Examples
• Eclipse
• Android and iOS app store
Software Ecosystems: Definitions
Development-centric view
• “a collection of software products that have some given degree of symbiotic relationships.”
• “a collection of software projects that are developed and evolve together in the same environment.”
Examples
• Gnome, KDE
• Debian, Ubuntu
• R’s CRAN
• Apache
Software Ecosystems: Definitions
Socio-technical view
• a community of persons (end-users, developers, debuggers, …) contributing to a collection of projects
[Figure labels: Project 1, Project 2, Project 3]
Software Ecosystems: Definitions
Ecosystem <> System of systems (cf. John McDermid)
An ecosystem is a set of systems that is “designed as a whole”.
These systems
• cannot function in isolation (symbiotic relationships)
• are usually very diverse
• function together as a unit
• are evolved together towards a common (but evolving) goal
Software Ecosystems
Challenges
Software Ecosystem Analysis: Challenges
Empirically analysing software ecosystems involves many challenges
• Technical challenges
• Scientific challenges
• Practical challenges
• Ethical challenges
• …
Software Ecosystem Analysis: Challenges
Technical challenges
• Extracting and combining data from different sources
• Identity merging
• Dealing with inconsistent and incomplete data
• Big data analytics
  • special skills and tools needed to store, process and analyse huge amounts of data
Software Ecosystem Analysis: Challenges
Scientific challenges
• Accessibility of data
  • E.g. many apps in Google Play are proprietary and historical information is not accessible
  • Focus on open source software
• Reproducibility of results
• Generalisability of results
• Which research methodology, which metrics, which statistical tools, …
Software Ecosystem Analysis: Challenges
Practical challenges
• How can we share our big data with other researchers?
  • Different formats, different tools, storage problems, …
• How can we make our research results useful to practitioners and development communities?
• How can we build tools and dashboards that integrate our findings?
Software Ecosystem Analysis: Challenges
Ethical challenges
• Privacy issues
  • Can we use and combine information about actual developers?
  • Can we make these results freely available?
  • How to reconcile privacy with reproducibility?
[Figure labels: Privacy, Reproducibility]
Technical Challenges: Extracting data from different sources
• Source code and other commits stored in version control repositories
  E.g., Subversion, Git
• Developer mailing lists and user mailing lists
• Bug reports and change requests stored in issue tracking systems
  E.g., Bugzilla, JIRA
• Question and Answer websites
  E.g. StackOverflow
Technical Challenges: Extracting data from different sources
Using open source MetricsGrimoire tool suite (https://github.com/MetricsGrimoire)
CVSAnalY
• extracts information from SVN or Git source code repository logs and stores it into relational database
MailingListStats
• extracts mailing list information from mbox format
Bicho
• extracts information from issue tracking systems such as Bugzilla and JIRA
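To make the extraction step concrete, here is a minimal sketch of collecting author aliases from a Git log. It is not how CVSAnalY works internally; the function names `parse_author_lines` and `read_git_authors` and the sample data are hypothetical, and the sketch only assumes that `git` is available on the PATH when `read_git_authors` is called.

```python
import subprocess
from collections import Counter


def parse_author_lines(log_text):
    """Count (name, email) aliases from lines of the form 'name<TAB>email'."""
    aliases = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, email = line.partition("\t")
        # E-mail addresses are lowercased so that case variants count as one alias.
        aliases[(name.strip(), email.strip().lower())] += 1
    return aliases


def read_git_authors(repo_path):
    """Run `git log` in repo_path and return its author aliases (hypothetical helper)."""
    out = subprocess.run(
        # %an = author name, %x09 = tab character, %ae = author e-mail
        ["git", "-C", repo_path, "log", "--format=%an%x09%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_author_lines(out)


# Made-up log excerpt, used instead of a real repository.
sample = (
    "John Doe\tjohn@example.org\n"
    "johnny\tJohn@Example.org\n"
    "John Doe\tjohn@example.org\n"
)
aliases = parse_author_lines(sample)
```

The resulting counter maps each distinct alias to its number of commits, which is the raw input an identity merging step would need.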
Technical Challenges: Identity merging
The same contributor may use different aliases
Euphegenia Doubtfire, [email protected]
Robin Williams, [email protected]
Technical Challenges: Identity merging
[Figure: contributors and the repositories they use (source code repository, mailing list, bug tracker). Aliases shown: john; John Smith; john <[email protected]>; johnny; john; “John, Doe”; “Doe, John”; John W. Doe; Jane]
• Ordering
  – Rajesh Sola / Sola Rajesh
• Spelling: misspelling, diacritics, punctuation
  – Rene Engelhard / Fene Engelhard
  – Démurget / Demurget
  – J. A. M. Carneiro / J A M Carneiro
• Middle initials, patronyms, nicknames, additional surnames, incomplete names
  – Daniel M. Mueth / Daniel Mueth
  – Alexander Alexandrov Shopov / Alexander Shopov
  – Carlos Garnacho Parro / Carlos Garnacho
  – Jacob “Ulysses” Berkman / Jacob Berkman
  – A S Alam / Amanpreet Singh Alam
• Name variants: transliteration, diminutives
  – Γιωργοσ / Georgios
  – Mike Gratton / Michael Gratton
• Software-specific: usernames, projects, tooling artefacts
  – mrhappypants / Aaron Brown
  – Arturo Tena/libole2 / Arturo Tena
  – (16:06) Alex Roberts / Alex Roberts
• Mix
  – Any combination of those
Technical Challenges: Identity merging
id = 17 { John Doe, Doe John, [email protected], [email protected], [email protected] }
Semi-automatic approach:
• eliminate specific quirks observed during extraction
  Example: “(16:06) Alex Roberts”
• compute similarity between each pair of aliases (based on Levenshtein distance)
• cluster together aliases with high similarity
• post-process manually
  • rely on external information (websites)
  • precise but labor-intensive
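The automatic part of the approach above can be sketched as follows. This is an illustration under stated assumptions, not the tool used in the lecture: the timestamp-stripping rule, the normalised similarity `1 - distance / max_length`, the threshold of 0.75 and the single-link clustering are all choices made for this example, and the helper names are hypothetical.

```python
import re
from itertools import combinations


def normalise(alias):
    """Remove an extraction quirk such as a leading '(16:06) ', then lowercase."""
    alias = re.sub(r"^\(\d{1,2}:\d{2}\)\s*", "", alias)
    return " ".join(alias.lower().split())


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(aliases, threshold=0.75):
    """Single-link clustering: any pair at or above the threshold is merged."""
    parent = {a: a for a in aliases}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(aliases, 2):
        if similarity(normalise(a), normalise(b)) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for a in aliases:
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())


names = ["Alex Roberts", "(16:06) Alex Roberts", "Daniel M. Mueth", "Daniel Mueth", "Jane"]
clusters = cluster(names)
```

On these five aliases the sketch groups the two "Alex Roberts" variants and the two "Mueth" variants and leaves "Jane" alone. Comparing every pair is quadratic in the number of aliases, and the clusters would still need the manual post-processing step listed above.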
Technical Challenges: Identity merging
Levenshtein distance (1965):
• Computes the minimal distance between 2 strings in terms of single character edits (deletion, addition or replacement)
• Example: lev(“Mike”, “Michael”) = 4
  • “Mike” => “Mice” => “Miche” => “Michae” => “Michael”
• Side note
  • Damerau-Levenshtein distance also considers transposition of adjacent characters
  • Applied in biology for DNA sequence alignment
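The distance described above can be computed with the classic dynamic-programming table. The sketch below is an illustration only: the function name `edit_distance` and the `transpositions` flag are hypothetical, and the transposition variant implemented is the restricted form of Damerau-Levenshtein (often called optimal string alignment), not the unrestricted one.

```python
def edit_distance(a, b, transpositions=False):
    """Minimal number of single-character edits turning a into b.

    With transpositions=True, swapping two adjacent characters also counts
    as one edit (restricted Damerau-Levenshtein / optimal string alignment).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i characters of a and the first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


mike = edit_distance("Mike", "Michael")                      # slide example
typo_plain = edit_distance("Jonh", "John")                   # two replacements
typo_swap = edit_distance("Jonh", "John", transpositions=True)  # one swap
```

The first call reproduces the slide's value of 4 (one replacement and three additions). The "Jonh"/"John" pair shows why the transposition variant suits typing errors: it costs 2 edits without the flag and 1 with it.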
Technical Challenges: Identity merging
• several merge algorithms exist
• the “noisier” the data, the worse they perform!
• simple algorithms have higher precision and recall than more complex ones
A Comparison of Identity Merge Algorithms for Software Repositories
Mathieu Goeminne*, Tom Mens*
Institut d’Informatique, Faculté des Sciences, Université de Mons
Abstract
Software repository mining research extracts and analyses data originating from multiple software repositories to understand the historical development of software systems, and to propose better ways to evolve such systems in the future. Of particular interest is the study of the activities and interactions between the persons involved in the software development process. The main challenge with such studies lies in the ability to determine the identities (e.g., logins or e-mail accounts) in software repositories that represent the same physical person. To achieve this, different identity merge algorithms have been proposed in the past. This article provides an objective comparison of identity merge algorithms, including some improvements over existing algorithms. The results are validated on a selection of large ongoing open source software projects.
Keywords: software repository mining, empirical software engineering, identity merging, open source, software evolution, comparison
1. Introduction
Empirical software engineering research focuses on the use of empirical studies, experiments and statistical analysis in order to gain a better understanding of software products and processes [1]. An important branch of empirical research studies how software evolves over time and which processes are used to support this evolution. To achieve this, the principal data sources are software repositories of different kinds, such as source code repositories, bug tracking systems, and archived communications of the developer community (e.g., mailing lists, online forums and discussion boards). The research domain of software repository mining [2] uses these data sources to understand the historical development of software systems, and to build and empirically validate theories, models, processes and tools for these evolving systems. Many of these empir-
*Corr. author: Place du Parc 20, 7000 Mons, [email protected], +32 65 373453
Preprint submitted to Elsevier November 28, 2011
Science of Computer Programming 78(8), August 2013
Technical Challenges: Identity merging
Alternative automated approach
• Use of Latent Semantic Analysis (LSA)
  • equally good as other algorithms in average case
  • better performance in worst case
oracle contains 4989 unique identities, i.e., on average each GNOME contributor uses approximately 1.73 aliases.
We treat two cases: an average-case, containing random samples of the set of 8618 GNOME aliases, and a worst-case, consisting of a subset of 673 “noisy” GNOME aliases, expected to cause false negatives in the simple algorithm. We have obtained this dataset by removing contributors with only one alias, as well as contributors with intersecting {name, prefix} sets. It is a priori not clear how the algorithm by Bird et al. will behave on the worst-case dataset.
For each algorithm/scenario we performed training/testing steps and repeated the process ten times. Training determines optimal parameter values: for the simple algorithm we varied minLen (1, …, 10); for the algorithm by Bird et al. we varied the Levenshtein similarity threshold t (0.05, …, 1); for LSA, to avoid training on all combinations of the 4 parameters, we first performed a sensitivity analysis by fixing 3 and varying the remaining. After the sensitivity analysis we restricted the range of minLen to {2, 3, 4}, levThr to {0.5, 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to half of the number of terms. In the average case, for each of the ten repetitions, training was performed on one tenth of the GNOME aliases (≈ 860), and testing on ten random subsets with the same size from the remaining aliases. Samples were chosen instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms as well as the data, can be made available upon request.
Figure 1. The f-measures for the competing approaches. The f-measure ranges between 0 and 1 (the higher the value, the better). LSA performs as well as the simple algorithm in the average case, and significantly better in the worst case. Note that both y-axes start at 0.75.
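For readers unfamiliar with the f-measure, the sketch below shows one way to score a merge result against an oracle. The excerpt does not say how precision and recall were computed in the paper, so the pairwise definition used here (a pair of aliases counts as correct when both the algorithm and the oracle place them in the same identity) is an assumption, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def merged_pairs(clusters):
    """All unordered alias pairs that a clustering places in the same identity."""
    pairs = set()
    for group in clusters:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


def f_measure(predicted, oracle):
    """Return (precision, recall, f) of the predicted merges against the oracle."""
    pred, gold = merged_pairs(predicted), merged_pairs(oracle)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: the oracle says three aliases belong to one person, "Jane" to another.
oracle = [["john", "John Doe", "Doe, John"], ["Jane"]]
predicted = [["john", "John Doe"], ["Doe, John", "Jane"]]
p, r, f = f_measure(predicted, oracle)
```

Here one of the two predicted merges is correct (precision 0.5) and one of the three oracle pairs is found (recall 1/3), giving an f-measure of 0.4. A wrong merge lowers precision, while a missed merge (a false negative) lowers recall.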
Figure 1 displays the results of the cross-validation. In the average case (left) we observe that LSA performs as well as the simple algorithm (Kruskal-Wallis test followed by pairwise Wilcoxon tests with Bonferroni correction did not reveal enough reasons to assume that the two produce essentially different results at 0.05 significance level), followed by the algorithm of Bird et al. Concurrent results have been obtained in [5]: simple is better than Bird, and is the best of all algorithms tested. LSA and the simple algorithm do, however, behave differently. For example, the simple algorithm does not merge ⟨Christophe Michael Saout, csaout@domainA⟩ with ⟨Christophe Saout, christophe@domainB⟩ because the two aliases are disjoint, while LSA does. However, the simple algorithm correctly merges ⟨Gareth Owen, gowen@domainA⟩ with ⟨gowen, gowen@domainB⟩, while LSA does not (the cosine similarity between the documents corresponding to the two is 0.69 and falls just outside the threshold, in this case 0.70). This observation suggests that further improvements of the LSA algorithm, e.g., by using the simple algorithm in a pre-processing step, might be possible, and are considered as future work. On the other hand, the results in the worst case (Figure 1 right) show a clear improvement of LSA (median=0.935) over Bird et al’s (median=0.893) and the simple algorithms (median=0.778), confirmed by the statistical analysis described above.
VI. CONCLUSIONS
Our main contribution is a generic new identity merging algorithm based on LSA, robust against many types of discrepancies in VCS aliases. Empirical evaluation on GNOME Git repositories has shown equally-good performance of our algorithm as the state of the art in the average case, and better performance in the worst case.
REFERENCES
[1] C. Bird et al. “Mining email social networks”. In: MSR. ACM, 2006, pp. 137–143.
[2] A. Capiluppi, A. Serebrenik, and A. Youssef. “Developing an H-Index for OSS Developers”. In: MSR. IEEE, 2012, pp. 251–254.
[3] P. Christen. “A comparison of personal name matching: Techniques and practical issues”. In: ICDM. IEEE, 2006, pp. 290–294.
[4] D.M. German. “The GNOME project: a case study of open source, global software development”. In: Software Process 8.4 (2003), pp. 201–215.
[5] M. Goeminne and T. Mens. “A comparison of identity merge algorithms for software repositories”. In: Science of Computer Programming (2011). accepted.
[6] T.K. Landauer and S.T. Dumais. “A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.” In: Psychological Review 104.2 (1997), p. 211.
[7] A. Marcus and J.I. Maletic. “Recovering documentation to source code traceability links using latent semantic indexing”. In: ICSE. IEEE, 2003, pp. 125–137.
[8] W. Poncin, A. Serebrenik, and M.G.J. van den Brand. “Process Mining Software Repositories”. In: CSMR. IEEE, 2011, pp. 5–14.
[9] G. Robles and J.M. Gonzalez-Barahona. “Developer identification methods for integrated data from various sources”. In: MSR. ACM, 2005, pp. 1–5.
Who’s who in GNOME: using LSA to merge software repository identities
Erik Kouters, Bogdan Vasilescu*, Alexander Serebrenik, Mark G. J. van den Brand
Technische Universiteit Eindhoven, Den Dolech 2, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
[email protected], {b.n.vasilescu, a.serebrenik, m.g.j.v.d.brand}@tue.nl
Abstract—Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases.
It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences.
We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.
Keywords-identity merging; Gnome; latent semantic analysis
I. INTRODUCTION
One of the challenges when mining software repositories is identity merging [5]. To study contributors to software projects or software ecosystems, one often tries to integrate information about their contributions in different software repositories, such as version control systems, bug trackers, or mailing lists. However, developers may use different aliases in different software repositories (e.g., Bryan Clark authors Evince changes as Bryan Clark with the email address clarkbw@domainA¹, but participates in Evince mailing lists using bclark@domainB), and even different aliases in the same software repository (one of the Empathy developers sometimes uses the nickname mrhappypants). Correctly identifying who’s who in open source projects is an essential preprocessing step in many empirical analyses: for example, activity of open source developers could be used externally as a measure of their recognition and experience [2].
*Supported by the Dutch Science Foundation project “Multi-Language Systems: Analysis and Visualization of Evolution—Analysis” (612.001.020).
¹ Domain names obscured for privacy reasons.
To integrate information about individual contributions, we therefore need a unique identity representing the same contributor across different repositories and different projects. To this end, we need to use an identity merging algorithm [1, 3, 5, 8, 9]. However, performance of existing approaches degrades sharply in presence of “noisy” data, i.e., data containing large discrepancies between the aliases used by the same individual: “the more noisy and complex the project data is, the worse the merge algorithms behave” [5].
In this paper we concentrate on aliases used by developers in version control systems (VCS); here the term “alias” refers to a ⟨name, email⟩ tuple, typically available in VCS logs. Even for a single repository type such as VCS, the same contributor may use different aliases at different times, or in different projects within the ecosystem. Our goal is to design an identity merge algorithm with improved robustness with respect to noisy data, common in ecosystems maintained by large developer communities. We start by extracting commit authorship information from all GNOME Git repositories, and discuss differences in the aliases used by GNOME developers in Section II. Next, we evaluate robustness of two state of the art identity merging algorithms with respect to types of differences in aliases in Section III. Based on lessons learned from existing approaches, we propose a new identity merging algorithm using Latent Semantic Analysis (LSA) [6] in Section IV, and evaluate it empirically by means of cross-validation in Section V. Our results show equally-good performance as the state of the art in the average case, and a clear improvement over existing approaches on noisy input data.
II. TYPES OF DIFFERENCES IN GNOME ALIASES
As case study we select GNOME, a popular free and open source desktop environment for GNU/Linux. GNOME has a long development history (some projects, e.g., gnome-disk-utility, have started in 1997 and are still evolving today), is maintained by a large community of developers (we found 8618 different aliases² across 1316 different GNOME projects³), and is well-known to researchers [4]. Analysis of
² We consider data from the author name/email fields in the Git logs.
³ Values computed on October 28, 2011, based on the entire lifetime of the projects available at http://git.gnome.org/browse/.
978-1-4673-2312-3/12/$31.00 © 2012 IEEE
ICSM 2012 ERA track
Research challenges: Accessibility
Focus on open-source software
• Free access to source code, defect data, developer and user communication
• Historical data available in open repositories
  – Observable communities
  – Observable activities
• Increasing popularity for personal and commercial use
• A huge range of community and software sizes