XSEDE14 Reproducibility Workshop: Reproducibility in Large Scale Computing – Where do we stand

Mark R. Fahey, NICS
Robert McLay, TACC

Post on 29-Dec-2015

Reproducibility – what it means to me

• Full documentation of how an experiment (simulation) was conducted
  – Source code (unique versioning)
  – Input data
  – Computing environment
    • Hardware
    • Software (probably the most lacking component)
      – How often are the OS, compilers, and MPI versions fully documented so that one knows how to reproduce the build environment?
      – Ever seen a list of all the libraries linked into a code and the version of each library?
  – Published results
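
As a rough illustration of what documenting the software side could look like, the sketch below bundles a few host facts with the shared libraries a code is linked against. It assumes a Linux system where `ldd` output is available as text. The function names, the record layout, and the sample library paths are hypothetical and do not come from the slides. Note that `ldd` reports resolved paths only; a library's version has to be inferred from the path or soname, or recorded separately.

```python
import platform
import re

# Matches typical `ldd` lines such as:
#   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f...)
# Lines without a resolved absolute path (vdso, "not found") are skipped.
_LDD_LINE = re.compile(r"^\s*(?P<name>\S+)\s+=>\s+(?P<path>/\S+)")


def parse_ldd_output(text):
    """Return {library name: resolved path} from `ldd`-style output."""
    libs = {}
    for line in text.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            libs[match.group("name")] = match.group("path")
    return libs


def describe_environment(ldd_text=""):
    """Bundle basic host facts with the libraries linked into a code."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "linked_libraries": parse_ldd_output(ldd_text),
    }


# Hypothetical sample; real text would come from running `ldd ./my_code`.
SAMPLE_LDD = """\
    linux-vdso.so.1 (0x00007ffd2b5f2000)
    libmpi.so.40 => /opt/openmpi/4.1.5/lib/libmpi.so.40 (0x00007f1c2a000000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c29e00000)
    libfoo.so => not found
"""

record = describe_environment(SAMPLE_LDD)
```

A record like this covers only part of the list above: compiler and MPI versions would still need to be captured at build time, which is hard for a researcher to do consistently by hand.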

Computing Center responsibilities

• Yale Report makes no mention of the role of computing centers
• I believe computing centers have an obligation to help solve some of the reproducibility issues
  – Namely documentation of the software environment
• Expecting a researcher to document all the system software in a complete way is asking too much
  – A researcher may not know what should be documented

What can/should be done

• Need an automatic way to collect the information on the software (and versions) used by the researcher. This is what the centers (national and campus level) should be providing
  – A couple of prototypes exist that do this
    • For example, NICS and TACC provide two similar but slightly different prototypes (ALTD and Lariat, respectively) that capture the libraries and their versions for each code built and run
    • Solves part of the documentation problem; in fact, NERSC uses ALTD so that users can look up provenance data from old builds and rebuild their codes exactly as they did months or years before
    • A new effort (called XALT) is under development to combine and extend these prototypes from NICS and TACC to capture even more information – everything mentioned above
  – Every center should be doing this for a variety of reasons
    • Better user support; efficient use of staff resources
    • Provenance data collection
    • Security-related concerns
    • And of course documentation for reproducibility
  – Collecting this information is very doable (as proven by the prototypes) and has proven to be very useful. It would greatly help researchers provide the information the Yale report recommends
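
To make the idea of automatically collected provenance concrete, here is a minimal sketch of a per-build record store that can later answer "which libraries and versions was this code linked against?". It uses an in-memory SQLite database. The schema, function names, user, path, and library versions are all hypothetical; this is not the actual ALTD, Lariat, or XALT design, which the slides do not describe in detail.

```python
import sqlite3

# Hypothetical schema for illustration; not the ALTD, Lariat, or XALT layout.
SCHEMA = """
CREATE TABLE builds (
    build_id   INTEGER PRIMARY KEY,
    username   TEXT NOT NULL,
    executable TEXT NOT NULL,
    built_on   TEXT NOT NULL          -- ISO 8601 date
);
CREATE TABLE linked_libraries (
    build_id INTEGER NOT NULL REFERENCES builds(build_id),
    library  TEXT NOT NULL,
    version  TEXT NOT NULL
);
"""


def record_build(conn, username, executable, built_on, libraries):
    """Store one link step: who built what, when, and against which libraries."""
    cur = conn.execute(
        "INSERT INTO builds (username, executable, built_on) VALUES (?, ?, ?)",
        (username, executable, built_on),
    )
    build_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO linked_libraries (build_id, library, version) VALUES (?, ?, ?)",
        [(build_id, lib, ver) for lib, ver in libraries],
    )
    conn.commit()
    return build_id


def libraries_for(conn, executable, built_on):
    """Return the (library, version) pairs recorded for a past build."""
    rows = conn.execute(
        """
        SELECT l.library, l.version
        FROM linked_libraries AS l
        JOIN builds AS b ON b.build_id = l.build_id
        WHERE b.executable = ? AND b.built_on = ?
        ORDER BY l.library
        """,
        (executable, built_on),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_build(conn, "alice", "/home/alice/bin/sim", "2014-03-02",
             [("netcdf", "4.3.0"), ("fftw", "3.3.3")])
provenance = libraries_for(conn, "/home/alice/bin/sim", "2014-03-02")
```

In a real deployment the call to `record_build` would be triggered by the center's tooling at link time rather than by the researcher, which is the point of making collection automatic.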

What can/should be done (2)

• Computing centers (university and national level) can also somewhat address repository and software versioning issues for researchers by providing snapshots of the OS and libraries and providing views into databases
  – Centers could document each and every version of all of their software and the duration it was the default on the machine
  – There are already efforts to capture most of this information at some centers
    • For example, at NICS, the programming environment software versioning is documented
      – Provide users a list of all the system defaults at any time from the past with “all-in-one” modules
  – Centers could make RPM bundles of the system software and provide a test bed cluster with which one could “revert” to past system software installations to confirm reproducibility
    • Only for the life of the technology/award
    • Test bed clusters are sometimes not part of HPC deployments
    • Test bed clusters would likely be only a few nodes, unable to reproduce large simulations
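
The suggestion to document every software version and how long it was the default can be sketched as a small lookup: given a history of default periods, return the defaults in effect on a past date. The data structure, the function name, and the package history below are hypothetical illustrations, not the NICS “all-in-one” module mechanism.

```python
from datetime import date
from typing import NamedTuple, Optional


class DefaultPeriod(NamedTuple):
    package: str
    version: str
    start: date            # first day this version was the default
    end: Optional[date]    # last day as default; None means still the default


# Hypothetical history for illustration only.
HISTORY = [
    DefaultPeriod("gcc", "4.7.2", date(2013, 1, 15), date(2013, 9, 30)),
    DefaultPeriod("gcc", "4.8.1", date(2013, 10, 1), None),
    DefaultPeriod("fftw", "3.3.3", date(2013, 3, 1), None),
]


def defaults_on(history, day):
    """Return {package: version} for every default in effect on `day`."""
    result = {}
    for period in history:
        started = period.start <= day
        not_ended = period.end is None or day <= period.end
        if started and not_ended:
            result[period.package] = period.version
    return result


snapshot = defaults_on(HISTORY, date(2013, 6, 1))
```

A table like `HISTORY` is cheap to maintain when software is installed, and it lets a user recover the default environment for the date of an old run even when no test bed exists to reinstall it.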