
Analyzing HPC software Git repositories to identify and compute software productivity metrics

Kanika Sood, University of Oregon

August 18, 2017

Supervised by Anshu Dubey, Boyana Norris, Rinku Gupta and Lois McInnes


IDEAS

• Move scientific software development toward an approach of building new applications as reusable and scalable software components and libraries using the best available practices.

• Develop and demonstrate new approaches for producing, using and supporting scientific software.

• Establish methodologies that facilitate delivery of software as reusable, interoperable components.

[email protected] SASSy 2017 August 18, 2017

Slide courtesy https://ideas-productivity.org/


IDEAS mission

• Software Challenges: Exploit massive on-node concurrency and handle disruptive architectural changes while working toward predictive simulations that couple physics, scales, analytics, and more.

• Approach: Collaborate to curate, create, and disseminate software methodologies, processes, and tools that lead to improved scientific software.

Methodologies to improve software quality and achieve science goals.


www.ideas-productivity.org

Improve software productivity and sustainability for computational science


Motivation

• Software productivity, reusability → important

• Traditional metrics → give limited insight

• Higher project complexity → difficult to estimate team productivity and project maturity

• Gain
– Opportunity to reduce cost and increase scientific output
– Support for future project planning and funding projections


Introduction

• Scientific software is rapidly growing in capabilities, accuracy, and performance.

• Software productivity has received insufficient attention.

• We analyze the correlations between project issues and characteristics using traditional metrics.

• We propose new time-dependent metrics that can help quantify productivity.

• These metrics can be used to better understand the trends of software development workflows and provide objective measurements of productivity.

• We demonstrate our approach on ACME, PETSc, MOOSE, YT and SPACK.


Software productivity: the effort, time, and cost for software development, maintenance, and support.

Numerical Libraries analyzed

• PETSc: The Portable, Extensible Toolkit for Scientific Computation is a numerical software library [1] that offers a collection of linear and nonlinear solvers and preconditioners for the scalable solution of scientific applications. It is one of the most widely used parallel numerical libraries.

• SPACK: A package management tool [2] designed to support multiple versions and configurations of software on a wide variety of platforms and environments.

• MOOSE: The Multiphysics Object-Oriented Simulation Environment is a finite-element, multiphysics framework [3]. It aims to make predictive modeling accessible and scalable for nuclear engineering problems.

• ACME: The Accelerated Climate Modeling for Energy project [4] applies advanced climate and Earth system models to solve the most challenging research problems related to climate change. The goal is to build a modeling system that can be used efficiently on the next generation of computing systems.

• YT: A multi-code analysis toolkit for astrophysical simulation data [5]. It is a community-developed analysis and visualization toolkit for volumetric data.


Methodology

1. Fetch Github/Bitbucket data

2. Parse JSON

3. Categorize issues¹

4. Analyze issues with standard metrics

5. Create metrics and correlate them

6. Test on scientific software projects

¹GitHub/Bitbucket issues: a way to keep track of project tasks, enhancements, and bugs. Examples: bug reporting, new feature suggestions.
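Steps 1–3 of the methodology above can be sketched roughly as follows. The JSON payload here is a hypothetical, heavily trimmed sample in the shape returned by the GitHub Issues API [6], and `count_by_label` is an illustrative helper, not part of any real tool.

```python
import json

# Hypothetical, trimmed payload in the shape of the GitHub Issues API (v3).
# A real run (step 1) would fetch this over HTTP from the repository host.
payload = """
[
  {"number": 12, "state": "open",
   "labels": [{"name": "bug"}], "created_at": "2017-01-05T10:00:00Z"},
  {"number": 13, "state": "closed",
   "labels": [{"name": "enhancement"}], "created_at": "2017-02-01T09:30:00Z"}
]
"""

def count_by_label(issues):
    """Categorize issues by their label names (step 3)."""
    counts = {}
    for issue in issues:
        for label in issue["labels"]:
            counts[label["name"]] = counts.get(label["name"], 0) + 1
    return counts

issues = json.loads(payload)   # step 2: parse JSON
print(count_by_label(issues))  # step 3: {'bug': 1, 'enhancement': 1}
```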


Why is quantifying productivity hard?


Formatting and word replacement can change lines of code (LOC) without reflecting real development effort. Current metrics like LOC are therefore not very useful for quantifying software productivity.

[Example diffs shown for word replacement and formatting changes]
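The LOC problem can be illustrated with a quick stdlib sketch; the two code snippets are invented and behaviorally identical, differing only in formatting and a variable rename.

```python
import difflib

# Two behaviorally identical versions of the same function:
# only formatting and a variable rename differ.
before = [
    "def area(w,h):",
    "    return w*h",
]
after = [
    "def area(width, height):",
    "    return width * height",
]

diff = list(difflib.unified_diff(before, after, lineterm=""))
# Count the lines a LOC-style metric would report as changed.
changed = sum(1 for line in diff
              if line.startswith(("+", "-"))
              and not line.startswith(("+++", "---")))
print(changed)  # 4 "changed" lines, yet the program's behavior is unchanged
```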


Metrics we use


• Traditional metrics
– Weekly commits and additions for the project lifetime
– Number of issues reported
– Category of issues

• New metrics (so far…)
– Monthly bug fix rate
– Monthly feature request rate
– Correlation of the number of issues with project age
– Total commits and additions
– Number of followers and watchers
– Cumulative bugs and cumulative fixes
– Number of open and closed issues
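As an illustration of one of the new metrics, a monthly bug fix rate could be computed from issue close timestamps along these lines; the data and field layout are hypothetical stand-ins for mined repository data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical closed bug reports: (issue id, closed_at timestamp).
closed_bugs = [
    (101, "2017-03-04T12:00:00Z"),
    (102, "2017-03-20T08:15:00Z"),
    (103, "2017-04-02T17:45:00Z"),
]

def monthly_fix_rate(bugs):
    """Count bug fixes per (year, month) of the closing timestamp."""
    rate = Counter()
    for _, closed_at in bugs:
        ts = datetime.strptime(closed_at, "%Y-%m-%dT%H:%M:%SZ")
        rate[(ts.year, ts.month)] += 1
    return dict(rate)

print(monthly_fix_rate(closed_bugs))  # {(2017, 3): 2, (2017, 4): 1}
```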


Traditional metrics - I


• How actively are developers participating in development?

[Plot: MOOSE]


Traditional metrics - II


• How often is the issue tracker employed by the users?

[Plots: PETSc, MOOSE, SPACK, ACME]

Traditional metrics - III


• What are the most common tags associated with issues?

• Are bugs and issues among the top 10 tags?

Top 10 tags used in ACME:
• Bugs
• Enhancement
and others…

[Plots: ACME, YT]


New metrics - I


• How interested is the community and what is the potential for expanding capability?

• Is a project under-resourced for user support and/or development?

*PETSc mainly uses emails for open and closed issues (next talk)

[Plots: MOOSE, SPACK, PETSc, YT]


New metrics - II


• Do critical issues get resolved quicker than non-critical issues?

[Plot: ACME]

New metrics - III


• How many bugs and features have been requested over time?

[Plots: MOOSE, ACME, PETSc]
*PETSc mainly uses emails

New metrics - IV


• How much effort is spent in resolving issues?

[Plots: MOOSE, ACME, SPACK]

New metrics - V


• Does the number of issue requests correlate with the different ways in which a project can be used?

*PETSc is excluded because it has issue tracking in the form of emails as well.

Circle diameter: number of forks. Fork: a copy of a repository. Watchers: active developers or users.
Bigger circle: more customized use of the project. Smaller circle: more ‘as is’ use of the project.
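The forks-versus-watchers view described above could be derived from repository metadata roughly like this; the numbers are invented for illustration, and `customization_ratio` is a hypothetical helper (more forks per watcher suggesting more customized, rather than ‘as is’, use).

```python
# Hypothetical repository metadata: project -> (forks, watchers).
repos = {
    "MOOSE": (180, 90),
    "SPACK": (400, 160),
    "YT":    (120, 150),
}

def customization_ratio(forks, watchers):
    """Forks per watcher: a rough proxy for customized vs. 'as is' use."""
    return forks / watchers if watchers else 0.0

ratios = {name: round(customization_ratio(f, w), 2)
          for name, (f, w) in repos.items()}
print(ratios)  # {'MOOSE': 2.0, 'SPACK': 2.5, 'YT': 0.8}
```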

New metrics - VI


• What topics cause the most changes in code?

New metrics - VII

Summary

• Analyze the correlations between project issues and characteristics using standard metrics.

• Propose new time-dependent metrics that can help quantify productivity.

• Demonstrate our approach on ACME, PETSc, MOOSE, YT and SPACK.

Future work

• Explore derived (more complex) productivity metrics.

• Integrate information from developers with these analyses.

• Better understand the trends of software development workflows and provide objective productivity measurements.

• Combine Git analysis with email analysis.

Software Productivity Metrics, Measurements and Implications
Kanika Sood1, Shweta Gupta1, Boyana Norris1, Anshu Dubey2, Rinku Gupta2, Lois C. McInnes2

University of Oregon1, Argonne National Laboratory2

PROBLEM

The IDEAS approach aims to understand patterns in software development and capture productivity metrics.

Traditional
• Software productivity and reusability have traditionally been largely ignored components of the scientific software development life cycle
• Standard metrics give limited insight
• Higher project complexity → difficult to estimate individual/team productivity and project maturity/acceptance

Outcome
• No software productivity metrics → lost opportunity to reduce cost and increase scientific output
• No project maturity and audience acceptance metrics → limited future project planning and funding projections

The IDEAS APPROACH

Step 1

• Understand and analyze diverse attributes that impact productivity and that can be captured across a broad range of projects of interest to DOE for exascale.

Step 2

• Identify which top characteristics form dominant attributes across the various projects and use them to produce metrics that assess productivity improvement or degradation.

Step 3

• Test the metrics on chosen projects and quantify project productivity, team productivity, project success rate and project maturity.
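One of the proposed metrics, the correlation of issue counts with project age, boils down to a plain Pearson correlation; the monthly counts below are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: project age in months vs. issues opened that month.
age_months = [1, 2, 3, 4, 5, 6]
issues     = [3, 4, 6, 7, 9, 11]

r = pearson(age_months, issues)
print(round(r, 3))  # close to 1.0: issue volume grows with project age
```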

Conclusions & Future Work

Our tools can mine information from both repositories and email lists.
• Preliminary analysis of data mined with these tools can:
– Indicate whether a project is appropriately resourced
– Indicate if the project should grow
– Identify the most challenging tasks
Work in the near future will include (in the final poster):
• Whether the most difficult issues are HPC specific
• Whether we can quantify technical debt

ACKNOWLEDGEMENTS

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

Observations And Insights

Projects and Methodologies


Projects

• Repository analysis using built-in tags
• Email analysis using Natural Language Processing

Email Analysis

1. Collect data
2. Parse HTML and decode
3. Tokenize, stem & remove stopwords
4. Create bag-of-words and train LDA model
5. Assign a label to each document (example output topic label: BUG)
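The early stages of this pipeline might look like the sketch below; the stopword list and the keyword-based labeling are deliberately simplified stand-ins for the real stemming and LDA steps, and the email text is invented.

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "in", "of", "is", "and", "to"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords (steps 2-3, simplified)."""
    return [w for w in re.split(r"[^a-z]+", text.lower())
            if w and w not in STOPWORDS]

def bag_of_words(tokens):
    """Step 4, simplified: raw word counts instead of a trained LDA model."""
    return Counter(tokens)

def label(bow):
    """Step 5, simplified: keyword lookup in place of an LDA topic."""
    if bow["bug"] or bow["error"]:
        return "BUG"
    return "OTHER"

email = "Bug in the hydro module: configure error in flash2.3"
bow = bag_of_words(tokenize(email))
print(label(bow))  # BUG
```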

Developer
• How does the behavior of issues change?
• Which pieces of text are relevant?

Project Lead
• What are the kinds of issues?
• Can we figure out how many resources are available/allocated?

User
• How active are the developers in resolving issues?
• How interested is the community in the project?

FLASH – Multi-Domain Multiphysics Software
PETSc – Numerical Library
MOOSE – Multiphysics Framework
YT – Scientific Data Analysis and Visualization Tool
SPACK – Package Management Tool
ACME – Climate Modeling Multiphysics Software

Example FLASH email subjects:
• "Bug in quadratic_cartesian interpolation scheme?"
• "bug in hydro flash2.3"

Example PETSc email subjects:
• "strangness in Chebyshev estimate of eigenvalues"
• "petsc-dev on bitbucket"
• "configure error"
• "PetscOptionsGetString"

Email Analysis
Diameter of the circle => time span
Height of the circle => number of emails

While there is a range of complexity among topics, configuration issues appear to dominate.

Worker efficiency can be a good indicator of project progress and productivity

Not many topics appear to take time. However, this figure gives incomplete information because the practice in the project is to take discussions of topics specific to one user offline.

Methodologies

Repository Analysis
Plotting cumulative counts of bugs and cumulative counts of bug fixes can indicate if a project is under-resourced for management.

Plotting open and closed issues in the same plot can be informative regarding community interest and the scope for expanding the capability and reach of the software.
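The cumulative-count plots described here boil down to running totals over time; a minimal sketch with invented monthly data:

```python
from itertools import accumulate

# Hypothetical monthly counts of bugs reported and bugs fixed.
bugs_per_month  = [5, 7, 6, 9, 8]
fixes_per_month = [4, 5, 6, 7, 8]

cum_bugs  = list(accumulate(bugs_per_month))   # [5, 12, 18, 27, 35]
cum_fixes = list(accumulate(fixes_per_month))  # [4, 9, 15, 22, 30]

# A widening gap between the two curves can flag an under-resourced project.
gap = [b - f for b, f in zip(cum_bugs, cum_fixes)]
print(gap)  # [1, 3, 3, 5, 5]
```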

With the ability to mine the data we can engage with the scientific community to determine what questions to ask.

Questions can also come from other stakeholders such as funding agencies.

Our eventual objective is to understand the issues that arise during various stages of development in scientific software projects and develop metrics that are informative to all stakeholders.

While the bugs increase with the progression of time, the number of fixes appears to grow at a comparable rate. This indicates the project currently has sufficient resources for user support.

The number of bugs increases with the advancement of time, but the number of fixes did not increase proportionately until recently. This may indicate that the project paid attention to user support and successfully reduced the widening gap between the bugs raised and the fixes.

The bugs increase with the progression of time and although the number of fixes also increase, the gap between the bugs and fixes tends to grow.

Open issues have increased by 50% since the project started, which can indicate rising community interest and a wider reach of the software. The gap between closed and open issues has almost doubled since the start of the project, which can indicate an additional resource requirement for project maintenance.

Open issues have increased drastically since the project started, which indicates rising community interest and software reach. The gap between open and closed issues appears to grow over a short time period (< 2 years), which can suggest the need for additional resources.

The number of open and closed issues at any point of time has not exceeded 80 in 5 years. This may be because of other means of issue tracking, such as the email lists PETSc has maintained since the project started.

Repository Analysis

1. Collect data
2. Parse JSON
3. Categorize issues
4. Analyze issues
5. Create metrics and correlate them
6. Test on ECP projects

References

[1] PETSc, https://www.mcs.anl.gov/petsc, 2017.

[2] SPACK, https://computation.llnl.gov/projects/spack-hpc-package-manager, 2017.

[3] MOOSE, http://mooseframework.org, 2017.

[4] ACME, https://pegasus.isi.edu/portfolio/acme, 2017.

[5] YT, http://yt-project.org/doc/index.html, 2017.

[6] GitHub API v3, https://developer.github.com/v3, 2017.


Questions?

Thank you