Evolving Software Ecosystems, Marktoberdorf Summer School 2014, Lecture 1. Tom Mens, Software Engineering Lab, University of Mons. informatique.umons.ac.be/genlog

Upload: tom-mens

Post on 25-Dec-2014


DESCRIPTION

This is my first in a series of 4 lectures on the topic of Evolving Software Ecosystems, presented during the NATO Marktoberdorf 2014 Summer School on Dependable Software System Engineering in Germany, August 2014.

TRANSCRIPT

Page 1: MOD2014-Mens-Lecture1

Evolving Software Ecosystems
Marktoberdorf Summer School 2014

Lecture 1
Tom Mens

Software Engineering Lab, University of Mons

informatique.umons.ac.be/genlog

Page 2: MOD2014-Mens-Lecture1

July-August 2014, NATO Marktoberdorf Summer School, Dependable Software Systems Engineering

About me

• PhD obtained in 1999 @ VUB, Brussels
  – « A Formal Foundation for Object-Oriented Software Evolution »
• FWO post-doc until 2003 (@ VUB)
• Lecturer/Associate professor @ UMONS, Mons since 2003
  – Head of the software engineering lab
  – Research in software evolution
  – Teaching in software engineering

Page 3: MOD2014-Mens-Lecture1

Current Research Interests

• software evolution
• software quality
• model-driven software engineering
• empirical software engineering
• multimodal human-machine interaction
• Use of formal techniques to support the above
  – graph transformation
  – logic-based formalisms
  – model checking
  – statistical analysis
• Develop automated tool support

Page 4: MOD2014-Mens-Lecture1

Ongoing Research Projects

ARC Project « Ecological Studies of Open Source Software Ecosystems »
- 2012-2017
- Interdisciplinary research
- Use ideas from biological ecology to understand and improve evolution/maintenance of software ecosystems

FRFC Project « Data-Intensive Software System Evolution »
- 2013-2017, in collaboration with A. Cleve (University of Namur)
- Empirical study of evolution of data-intensive software systems
- Co-evolution between database and code

FRIA PhD Scholarship « Executable Modeling of Gestural Interaction Applications »
- 2011-2015
- Using domain-specific modeling languages
- Based on high-level Petri nets and graph transformation

Page 5: MOD2014-Mens-Lecture1

Human-machine interaction

Executable Modeling of Gestural Interaction Applications

Page 6: MOD2014-Mens-Lecture1

Evolving Software Ecosystems
Research Context

Tom Mens
Software Engineering Lab, University of Mons
informatique.umons.ac.be/genlog

Page 7: MOD2014-Mens-Lecture1

Research Context

• Study of macro-level software evolution
  – evolution of large collections or distributions of software projects or packages
  – E.g. forges like GitHub, SourceForge, Savannah, Google Code

J.M. Gonzalez-Barahona et al. Macro-level software evolution: a case study of a large software compilation. Empirical Software Engineering 14(3): 262-285 (2009)

M. Caneill, S. Zacchiroli. Debsources: Live and historical views on macro-level software evolution. Int. Symp. ESEM 2014

Page 8: MOD2014-Mens-Lecture1

Research Context

• Study of macro-level software evolution
  – evolution of large collections or distributions of software projects or packages
  – E.g. forges like GitHub, SourceForge, Savannah, Google Code
• Focus on coherent collections of projects
  – a.k.a. software ecosystems
  – E.g. Debian, Ubuntu, GNOME, KDE, CRAN, Eclipse, …
• Study social/community aspects of these collections

Page 9: MOD2014-Mens-Lecture1

Research Context

Focus on open source software
• Free access to source code, defect data, developer and user communication
• Historical data available in open repositories
  – Observable communities
  – Observable activities
• Increasing popularity for personal and commercial use
• A huge range of community and software sizes

Page 10: MOD2014-Mens-Lecture1

Long-term goals

• Determine the main factors that drive the success or failure of OSS projects within their ecosystem
• Investigate new techniques and mechanisms to predict and improve quality and survival of OSS projects
  – Inspired by research in biological ecology

informatique.umons.ac.be/genlog/projects/ecos

Page 11: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• Specific questions depend on the software ecosystem under study
  – CRAN
  – Debian
  – GNOME
  – Eclipse

Page 12: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• Specific questions for Eclipse
  – Software development environment
  – Plugin framework architecture
• Which plugins pose fewer problems or survive longer?
  – e.g. based on their API usage
  – stable and supported APIs versus unstable, discouraged and unsupported APIs

Businge et al. “Survival of Eclipse third-party plug-ins”; “Co-evolution of the Eclipse SDK Framework and Its Third-Party Plug-Ins”; “Analyzing the Eclipse API Usage: Putting the Developer in the Loop”

Mens et al. “The Evolution of Eclipse”, ICSM 2008

Page 13: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• Specific questions for CRAN
  – Comprehensive R Archive Network (R package archive)
• Which packages are more likely to cause, upon update, problems in dependent packages?

Claes et al. “On the Maintainability of CRAN Packages”, CSMR-WCRE 2014

German, Adams et al. “The Evolution of the R Software Ecosystem”, CSMR 2013


Page 16: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• CRAN (R package archive)
• R package description file format:

    Package: pkgname
    Version: 0.5-1
    Date: 2004-01-01
    Title: My First Collection of Functions
    Authors@R: c(person("Joe", "Developer", role = c("aut", "cre"),
                        email = "[email protected]"),
                 person("Pat", "Developer", role = "aut"),
                 person("A.", "User", role = "ctb",
                        email = "[email protected]"))
    Author: Joe Developer and Pat Developer, with contributions from A. User
    Maintainer: Joe Developer <[email protected]>
    Depends: R (>= 1.8.0), nlme
    Suggests: MASS
    Description: A short (one paragraph) description of what
      the package does and why it may be useful.
    License: GPL (>= 2)
    URL: http://www.r-project.org, http://www.another.url
    BugReports: http://pkgname.bugtracker.url
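To study dependencies at ecosystem scale, the fields of such a description file first have to be read mechanically. The following is a minimal Python sketch of that step, not the tooling used in the lecture: the function names `parse_description` and `parse_dependencies` are hypothetical, and the parser assumes the simple "Field: value" layout shown above, where indented lines continue the previous field.

```python
import re

def parse_description(text):
    """Parse 'Field: value' records; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            # Continuation line: append to the field seen last.
            fields[current] += " " + line.strip()
        else:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields

# One entry such as "R (>= 1.8.0)" or "nlme" (assumed entry syntax).
DEP_ITEM = re.compile(
    r"^\s*([A-Za-z][A-Za-z0-9.]*)\s*(?:\(\s*([<>=]+)\s*([^)\s]+)\s*\))?\s*$"
)

def parse_dependencies(value):
    """Split a dependency field into (package, operator, version) triples."""
    deps = []
    for item in value.split(","):
        if not item.strip():
            continue
        match = DEP_ITEM.match(item)
        if match is None:
            raise ValueError(f"unrecognised dependency entry: {item!r}")
        deps.append((match.group(1), match.group(2), match.group(3)))
    return deps

EXAMPLE = """\
Package: pkgname
Version: 0.5-1
Depends: R (>= 1.8.0), nlme
Suggests: MASS
Description: A short (one paragraph) description of what
  the package does and why it may be useful.
License: GPL (>= 2)
"""

fields = parse_description(EXAMPLE)
depends = parse_dependencies(fields["Depends"])
suggests = parse_dependencies(fields.get("Suggests", ""))
```

Entries without a version constraint come back with `None` for operator and version, so later analyses can treat "any version" uniformly.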

Page 17: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• CRAN (R package archive)
• Types of R package dependencies:
  – Strong dependencies (required to install and load a package)
    • Depends: all packages the package directly depends upon
    • Imports: all packages whose namespaces are imported from
  – Weak dependencies
    • Suggests: packages that are not necessarily needed. E.g. they may be required for running some tests or examples, but not for installing/loading
    • Enhances: packages that enhance other packages, e.g., by providing methods for classes from these packages
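The distinction between strong and weak dependencies matters when asking which packages may be affected by an update. The sketch below is an illustration only: the toy metadata in `PACKAGES` and the function names are hypothetical, and it assumes that only Depends and Imports propagate breakage.

```python
from collections import defaultdict, deque

STRONG = ("Depends", "Imports")    # required to install and load
WEAK = ("Suggests", "Enhances")    # optional, ignored below

# Hypothetical toy metadata: package -> {field: [package names]}
PACKAGES = {
    "A": {"Depends": ["B"], "Suggests": ["D"]},
    "B": {"Imports": ["C"]},
    "C": {},
    "D": {"Depends": ["C"], "Enhances": ["A"]},
}

def strong_dependencies(pkg, packages):
    """Direct strong dependencies of one package, sorted by name."""
    fields = packages.get(pkg, {})
    return sorted({dep for f in STRONG for dep in fields.get(f, [])})

def reverse_strong_index(packages):
    """Map each package to the set of packages that strongly depend on it."""
    rev = defaultdict(set)
    for pkg, fields in packages.items():
        for f in STRONG:
            for dep in fields.get(f, []):
                rev[dep].add(pkg)
    return rev

def affected_by_update(pkg, packages):
    """All packages that strongly depend on pkg, directly or transitively."""
    rev = reverse_strong_index(packages)
    seen, queue = set(), deque([pkg])
    while queue:
        current = queue.popleft()
        for parent in rev.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)
```

In the toy data, an update of C can reach A, B and D through strong dependencies, whereas an update of D reaches nothing, because A only suggests D.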

Page 18: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• CRAN package dependencies
  – Generated with miniCRAN
  – Tool to create and visualise an internally consistent, mini version of CRAN with selected packages only.

Page 19: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• Specific questions for Debian
  – Open source Linux distribution
• Which packages are more likely to cause future co-installation conflicts with other packages?
• Can I upgrade a set of installed Debian packages without “breaking” my installation?
  – Based on a formalisation and SAT solving

Vouillon and Di Cosmo, “Broken Sets in Software Repository Evolution”, ICSE 2013
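The co-installability question can be made concrete with a small sketch. The cited work relies on a formalisation and SAT solving; the Python code below deliberately swaps in a brute-force search over subsets, which is only feasible for toy inputs. The repository `REPO`, its package names, and the functions `is_healthy` and `co_installable` are all hypothetical. Each dependency is modelled as a list of alternatives, one of which must be installed.

```python
from itertools import combinations

# Hypothetical toy repository. "depends" is a list of clauses; each clause
# is a list of alternatives, at least one of which must be installed.
REPO = {
    "editor": {"depends": [["libui1", "libui2"]], "conflicts": []},
    "player": {"depends": [["libui2"]], "conflicts": []},
    "legacy": {"depends": [["libui1"]], "conflicts": []},
    "libui1": {"depends": [], "conflicts": ["libui2"]},
    "libui2": {"depends": [], "conflicts": ["libui1"]},
}

def is_healthy(installed, repo):
    """True if every dependency clause is satisfied and no conflict is present."""
    for pkg in installed:
        meta = repo[pkg]
        if any(other in installed for other in meta["conflicts"]):
            return False
        for alternatives in meta["depends"]:
            if not any(alt in installed for alt in alternatives):
                return False
    return True

def co_installable(targets, repo):
    """Return a smallest healthy installation containing targets, or None.

    Brute force over subsets (exponential); a SAT solver would replace
    this loop in any realistic setting.
    """
    targets = set(targets)
    others = sorted(set(repo) - targets)
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            candidate = targets | set(extra)
            if is_healthy(candidate, repo):
                return sorted(candidate)
    return None
```

Here `editor` and `player` can be installed together via `libui2`, while `player` and `legacy` cannot, because they need two libraries that conflict with each other.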

Page 20: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• Debian (open source Linux distribution)
  – Historical evolution of package co-installation conflicts

[Figure: two time-series charts with quarterly ticks from 03-05 to 06-14. One plots the number of packages and the number of conflicting packages (y-axis 0 to 45000); the other plots the ratio of conflicting packages (y-axis 0.05 to 0.55).]

Page 21: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

• Specific questions for GNOME
  – Linux desktop environment
• Which projects have a higher chance of survival?
• How is workload distributed over different projects/contributors?
• What is the “bus factor” risk? Who are the top contributors (for a specific activity type)?

Vasilescu et al. “On the variation and specialisation of workload: A case study of the GNOME ecosystem community”, Emp. Softw. Eng. journal 2014.
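The workload and bus-factor questions can be illustrated with a short Python sketch. It rests on one simplifying assumption that is not taken from the cited study: the bus factor is taken to be the smallest number of top contributors who together account for more than a given share (by default half) of all commits. The input format, a list of (author, project) pairs, and the function names are hypothetical.

```python
from collections import Counter

def top_contributors(commits, n=3):
    """Return the n authors with the most commits as (author, count) pairs."""
    counts = Counter(author for author, _project in commits)
    return counts.most_common(n)

def bus_factor(commits, threshold=0.5):
    """Smallest number of top authors whose commits exceed the threshold share."""
    counts = Counter(author for author, _project in commits)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_author, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total > threshold:
            return rank
    return len(counts)

# Hypothetical commit log: (author, project) pairs.
COMMITS = (
    [("alice", "nautilus")] * 6
    + [("bob", "nautilus")] * 3
    + [("carol", "gimp")] * 1
)
```

With this toy log a single contributor already covers more than half of the commits, which is the kind of concentration the bus-factor question is meant to expose. Restricting `commits` to one activity type or one project gives the per-activity or per-project view.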

Page 22: MOD2014-Mens-Lecture1

Long-term goals: Research Questions

GNOME Collaborator Network

Page 23: MOD2014-Mens-Lecture1

Long-term goals: Tooling

• Develop automated tools that help
  – developer (communities) improve upon their practices
  – companies and users to compare and adopt OSS projects

Page 24: MOD2014-Mens-Lecture1

Fig. 2. Project Detail Page (here: activity per day over the entire lifetime of the Nautilus project, with timeline markers Birth, First Release, Maintenance and Major Release: a. number of commits, b. number of contributors, c. number of lines added (green) vs. number of lines removed (red), d. difference between number of lines added and removed, e. number of files changed)

III. COMPLICITY

Complicity (http://complicity.inf.usi.ch) is a web-based visualization tool that allows the user to interactively explore, analyze and understand the evolution of super-repositories at two different abstraction levels: entity (single project or contributor), and ecosystem (a group of projects or contributors).

User Interface. We kept the user interface simple to avoid distracting the user from the analysis tasks, whereas the shapes in the visualizations are colored to attract their attention.

The main page (see Figure 1) is divided into three main parts: (1) the control panel on the left, (2) the main graphs of the ecosystem in the center of the page, (3) a quick view panel of a project’s or contributor’s details on the right, which appears by clicking on a shape in the graph.

Control Panel. It gives the user the possibility to (a) analyze different super-repositories, (b) choose between two entity types (project and committer) for the visualization at ecosystem level, (c) navigate through predefined viewpoints or change their settings, and (d) search for projects or contributors of interest by project type, project name or contributor name.

Graphical View. It visualizes the available data for the selected super-repository and the selected entity type as scatter plots. Every box represents either a single project or contributor, depending on the entity type selected, and reflects up to five different metrics: position on x and y axis, width, height and color. By clicking on a shape a detail panel on the right appears.

Entity Details Panel. It provides general information about the selected entity, which goes from name and date of the first and last commit, up to number of commits, number of contributors or projects, etc. From this panel the user has the possibility to get more details and further analyze the selected project or contributor in a new window.

The detail page of a project or contributor (see Figure 2) has a similar layout as the main page, with the exception that the control panel on the left is replaced with an overview of the selected project’s or contributor’s details.

In the center of the page, the user can choose between two views: (a) activity diagrams and (b) projects/contributors involvement distribution. The activity diagrams view allows the user to compare the activity of the selected entity in terms of number of commits, number of contributors/projects, number of lines added versus number of lines removed, and number of files changed. In the involvement distribution view, the analyst gets an idea about how many contributors have been involved, when, and for how long, or, in case of a selected contributor, how many projects he worked on and what his speciality is.

Architecture. Figure 3 shows the architecture and the backend of Complicity. To fetch the relevant data from the Git web interface into our database, we developed a number of Java programs. The crawler stores a copy of the web pages from the web interface of the Git super-repositories locally before the parser extracts the data, and stores it in a database. The metrics calculator takes the extracted data, prepares and stores it in such a way that Complicity can easily retrieve the data necessary for the visualization without having to calculate the metrics on the fly.

Long-term goals: Tooling

• Develop automated tools
  – E.g. Complicity (Neu et al., University of Lugano)

Fig. 1. Main View of Complicity visualizing the GNOME projects at ecosystem level, and the details of the Gimp project

Recent work has analyzed social aspects of a project’s evolution (e.g., activity, communication structure, and knowledge flow), which influence the changing process and is crucial to understand the evolution of a software system.

The Ownership Map [13] identifies the owner of every single file within a software system. A file is represented with a line, a disc defines a file change, and the color of the line and the disc defines the owner and the committer, respectively. In Maispion [17], the authors analyzed the activity in the mailing list and version control systems (VCS) of a single project to reveal communication behavior within it. They answered questions such as “Is there a main driver?” and “When are the developers most active?”. Oezbek et al. [18] checked whether the onion communication model is applicable to open-source systems using mailing lists as data source. They found that the core developers are highly interactive and tightly interconnected.

These approaches use visualization techniques to support the comprehension of a single software system. Complicity uses some of these visualization techniques (e.g., two dimensional boxes to encode different metrics and visualizes contributors’ activity) to support the understanding of software ecosystems.

Software ecosystems analysis is an under-researched area. FLOSSMole [19] aims at mining free, libre, and open-source software (FLOSS) super-repositories (e.g., sourceforge) and making general information on these projects publicly available. The data from the VCS or any other data source is not mined. Lopez-Fernandez et al. apply social network analysis to FLOSS projects, such as KDE, and Apache [12]. They found out that committers of the GNOME and KDE are more tightly connected than the ones of the Apache, because of the GNOME’s and KDE’s projects technical proximity. Ohloh (http://www.ohloh.net/) is an online directory of FLOSS projects and its developers. It retrieves data from different VCS and uses different metrics (e.g., number of commits, number of lines of code) to provide some visualizations that show various aspects of the projects’ evolution. Lungu focuses his work on reverse engineering software ecosystems [10]. He created SPO, an interactive tool that can be used to analyze the evolution of software ecosystems. SPO differentiates between two aspects, project and developer, for which the ecosystem plays one of two different roles: focus, to better understand the ecosystem; or context, to understand a single entity of the ecosystem. Seichter et al. introduced an approach of knowledge management using a social network of software artifacts in which the knowledge is attached to an artifact rather than a contributor [20]. The advantage is that if a developer leaves a software project, the knowledge remains within the project. Goeminne and Mens provide a framework to mine VCS, mailing lists and bug tracking databases, to analyze and visualize mainly the mailing, and commit activity of FLOSS ecosystems [21]. They define an ecosystem as “the source code together with the user and developer communities surrounding the software”.

In our work we combine software visualization with software metrics, by focusing not only on a single entity or the ecosystem level, but enabling the user to switch between the ecosystem and entity level, and between project and contributor views.

Page 25: MOD2014-Mens-Lecture1

Long-term goals: Tooling

• Develop automated tools
  – E.g. SECONDA for GNOME

Fig. 3. Radar chart visualising the comparison of 5 coarse-grained metrics for a user-defined selection of 3 projects.

histograms depicting the distribution of file size for a selected project; boxplot charts displaying the distribution of the set of available metrics for a selected project. A slider allows to easily navigate from commit to commit in order to show the evolution of metrics along time in both charts.

B. Community and developer analysis

From the point of view of SECONDA, there is a duality between projects and developers on the one hand, and between the developer community and the software ecosystem on the other hand.

We already explained before that SECONDA can visualise and analyse data at the project level, by considering all commit activities carried out by all developers for this particular project. The dual of this is the visualisation of data at developer level: SECONDA can visualise all activities carried out by a given developer for all projects this developer has been involved in.

Similarly, SECONDA can either be used to compare a selection of projects against one another (see, e.g., Figure 3) or, alternatively, compare the work carried out by a selection of developers against one another. Finally, SECONDA can perform global analyses of the entire ecosystem (see, e.g., Figure 2 where each point represents a project) or, alternatively, perform a global analysis of the entire developer community (each point in the visualisation represents an individual developer).

VI. WORK IN PROGRESS

The current version of the tool is under active development and changing every week. Currently, we are working on the following issues that will be integrated in future releases of SECONDA: integrate the identity matching algorithm (already implemented) as a postprocessing phase of the data extraction module; implement an incremental version of the data extraction and metrics computation to accommodate new projects’ revisions without needing to recompute everything every time, thereby saving bandwidth, time and memory; integrate the community and developer analysis; integrate the statistical analysis module; implement a reporting module; analyse other ecosystems than GNOME; support other types of version repositories, and integrate information from bug trackers, mailing lists and development fora.

VII. CONCLUSION

In this article we presented the essential characteristics of SECONDA, an extensible modular framework for the analysis and visualisation of open source software ecosystems. The tool is under active development at our lab, and is useful for researchers and practitioners that wish to study how the evolution of open source projects is influenced by their surrounding ecosystem and developer community. It offers a dashboard for rapid visualisation of global (ecosystem-level) and local (project-level) metrics that can be extracted from information stored in the version repositories. Currently, the tool is used for analysing the GNOME ecosystem, but other ecosystems will be analysed in future releases. We will also continue to extend the dashboard with new functionalities, such as person and community visualisation, statistical analysis and reporting.

ACKNOWLEDGMENTS

This research has been co-funded by the European Regional Development Fund (ERDF) and Wallonia, Belgium. The research is also partially supported by the F.R.S.-FNRS FRFC project 2.4515.09.


IV. DATA EXTRACTION AND METRICS COMPUTATION

SECONDA provides two types of manipulation of the data extracted from GIT repositories: global analysis, which clones the GIT repository and performs a global coarse-grained analysis; and local analysis, which analyses the software projects in the local repository clone at a fine-grained level. Although both analyses can be used independently, it is advisable to first carry out the global analysis, and then request a local analysis for those projects that deserve more attention.

A. Global analysis

The data extraction module downloads and maintains a local cached copy of the GNOME repository on which the metrics module runs a coarse-grained analysis, using SLOCCOUNT for obtaining the projects’ size metrics, for the latest revision of each project, and with GIT for obtaining the commit history. For each project we extract and store the list of authors and committers, the GIT commit log and the results of running SLOCCOUNT for the whole project. The latter counts the lines of code for a variety of programming languages (including Ada, C, C++, Cobol, Fortran, Haskell, Java, Pascal, LISP, XML, Perl, PHP and many more). The raw repository data is also summarized into a CSV data file.

The analysis is run over the local repository cache, unless a project hasn’t been downloaded yet or an update of the local copy is needed. In such cases, the latest revision of the project is pulled and stored in the local repository. The first time the cache is created the extraction process can take several hours. This is reduced to minutes for the successive executions of the tool.

B. Local analysis

Given that the large majority of code files for GNOME are written in C, local analysis relies on CMETRICS, an open source metrics tool for C code to compute, among others, size metrics (SLOC), Halstead metrics (H.LEN, H.VOL, H.LEVEL, etc.) and McCabe’s cyclomatic complexity (CYCLO).
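For readers unfamiliar with the metrics named above, the following Python sketch spells out their textbook definitions. It does not reproduce how CMETRICS computes them: the function names are hypothetical, and the inputs (operator/operand counts, control-flow graph sizes) are assumed to be already available.

```python
import math

def halstead(n1, n2, N1, N2):
    """Textbook Halstead metrics.

    n1, n2: number of distinct operators and distinct operands
    N1, N2: total number of operator and operand occurrences
    """
    length = N1 + N2                        # program length
    vocabulary = n1 + n2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    level = 1 / difficulty                  # program level, inverse of difficulty
    return {
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "level": level,
    }

def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's cyclomatic complexity of a control-flow graph: E - N + 2P."""
    return edges - nodes + 2 * components

# Hypothetical counts for a small C function.
metrics = halstead(n1=4, n2=4, N1=10, N2=6)
cyclo = cyclomatic_complexity(edges=9, nodes=8)
```

Under these definitions the example function has length 16, volume 48, difficulty 3 and cyclomatic complexity 3.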

The data extraction module creates a MySQL database for each project. This database is filled by the metrics module with the information computed by CMETRICS for each revision of the project stored in the local repository. CMETRICS collects data related to C files and to the functions contained in them as well. All this information, together with the links between revisions, files and functions is stored in each database. This makes it possible, for example, to know in which file a certain function is contained and, by extension, what revision this file came from.

Fig. 2. Scatter plot visualising the correlation, at ecosystem level, between two metrics for each project: their total number of lines of code TLOC and their total number of files.

Once all the data of all the projects’ commits is extracted and stored, the database can be queried to perform detailed analyses. It is also possible to compare data across projects by performing searches over the databases of each project.

V. VISUALISATION

The visualisation module allows to display global and local analyses and therefore, to gain understanding of individual projects or to compare metrics across different projects and developers.

A. Ecosystem and project analysis

GNOME projects can be jointly analysed by combining their separate metrics. The visualisation module uses the main metrics previously computed and displays them using four different types of charts: scatter plots that allow to confront two metrics in order to visualise and find out their possible correlation (see Figure 2); programming language boxplots that display the usage distribution of different programming languages, including main descriptive statistics such as mean, median, and quartiles; ecosystem boxplots that display the distribution of number of commits, committers, authors, and files over all projects; spider web metrics that display and compare a set of metrics for a set of different projects selected by the user (see Figure 3).

The fine-grained analyses for single projects allow to visualise and understand the evolution of each project over time. It comprises two different types of charts:

Page 26: MOD2014-Mens-Lecture1

Long-term goals: Tooling

• Develop automated tools
  – Example: maintaineR (for CRAN packages)

Page 27: MOD2014-Mens-Lecture1

Long-term goals: Tooling

• Develop automated tools
  – E.g. coinst and other tools for visualising and resolving co-installation conflicts
  – coinst.irill.org

Page 28: MOD2014-Mens-Lecture1

Relevant Scientific Venues

• ICSME
  – Int. Conf. Software Maintenance and Evolution
• SANER (merger of CSMR and WCRE)
  – Int. Conf. Software Analysis, Evolution and Reengineering
• MSR
  – Working Conference on Mining Software Repositories
• IWSECO
  – International Workshop on Software Ecosystems