
Reproducibility: 10 Simple Rules

And more!

Sandve, Geir Kjetil, et al. "Ten simple rules for reproducible computational research." PLoS computational biology 9.10 (2013): e1003285.

Rule 1: For Every Result, Keep Track of How It Was Produced

http://xkcd.com/
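A minimal sketch of what "keeping track" can mean in practice: have every analysis script append the exact command line, timestamp, and code version to a provenance log. (The file name and JSON layout here are assumptions, not from the talk.)

```python
# provenance.py -- sketch: record how a result was produced
import json
import subprocess
import sys
from datetime import datetime, timezone

def log_provenance(logfile="provenance.json"):
    """Append the command line, timestamp, and git commit used for this run."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not inside a git repository
    record = {
        "command": sys.argv,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": commit,
    }
    with open(logfile, "a") as fh:
        fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    log_provenance()
```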

Rule 2: Avoid Manual Data Manipulation Steps

• “Stop clicking, start typing” – Matt Frost, Charlottesville, VA

• Use scripts for even small changes
• Split commonly used code off into functions/classes, and put these into libraries
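A minimal sketch of the idea, with hypothetical file and column names: even a trivial header cleanup is scripted rather than done by hand, and the reusable piece is split off into a function that could live in a shared library.

```python
# clean_data.py -- sketch: script a "small" change instead of editing by hand
import csv

def normalize_header(name):
    """Reusable helper: lowercase a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

def clean(infile, outfile):
    with open(infile, newline="") as src, open(outfile, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow([normalize_header(col) for col in header])
        writer.writerows(reader)

if __name__ == "__main__":
    clean("data/raw/measurements.csv", "data/processed/measurements.csv")
```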

Rule 3: Archive the Exact Versions of All External Programs Used

Level 0: Note names and versions of all packages

Level 1: Use a package management system (packrat, anaconda/conda)

Boss Level: Save an image of the entire system
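A minimal sketch of the Level 0 approach in Python: dump the name and version of every installed package at run time (the output file name is an assumption). Level 1 would instead rely on packrat or conda environment files.

```python
# versions.py -- Level 0: record names and versions of all installed packages
from importlib import metadata

with open("package_versions.txt", "w") as fh:
    for dist in sorted(metadata.distributions(),
                       key=lambda d: (d.metadata["Name"] or "").lower()):
        fh.write(f"{dist.metadata['Name']}=={dist.version}\n")
```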

Rule 4: Version Control All Custom Scripts

http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics

• Also, version control workflows (what are good workflow management systems, guys?)

• Use the commit message to write something useful to your future self (“pwew pwew pwew” is not useful)

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

• “Explicit is better than implicit” – Tim Peters, The Zen of Python
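For example (hypothetical values and paths), an intermediate filtering step can be written out explicitly as JSON rather than kept only in memory or dumped in an ad-hoc binary format:

```python
# Sketch: write an intermediate result explicitly, in a standardized format
import json
import os

intermediate = {"n_samples": 128, "filter": "quality >= 30", "kept": 117}

# JSON (or CSV/TSV) is readable by any tool, and by your future self
os.makedirs("results/intermediate", exist_ok=True)
with open("results/intermediate/filter_step.json", "w") as fh:
    json.dump(intermediate, fh, indent=2)
```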

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

• This goes for all parameters that may change
• Separate code from configuration, e.g. use config files (another gift to your future self!)
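A minimal sketch, assuming a hypothetical config.json: the seed and other changeable parameters live in the config file, and the script merely reads them.

```python
# run_analysis.py -- sketch: keep the random seed (and other parameters)
# in a config file rather than hard-coding them
import json
import random

with open("config.json") as fh:
    config = json.load(fh)          # e.g. {"seed": 42, "n_iterations": 1000}

random.seed(config["seed"])          # the seed is recorded alongside the results
print(f"Running with seed={config['seed']}, "
      f"n_iterations={config['n_iterations']}")
```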

Rule 7: Always Store Raw Data behind Plots

• (and the plot-generating code, too)
• Make raw data read-only
• Separate folders for raw and pre-processed data

https://inspguilfoyle.wordpress.com/2014/02/19/straight-lines/
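A minimal sketch, assuming matplotlib is available (file paths are hypothetical): the same script writes both the figure and the raw numbers behind it, and the script itself is the plot-generating code to keep under version control.

```python
# make_figure.py -- sketch: save the raw data behind a plot next to the figure
import csv
import os
import matplotlib.pyplot as plt

x = list(range(10))
y = [v ** 2 for v in x]

os.makedirs("figures", exist_ok=True)

# 1. the data behind the plot, in a plain-text format
with open("figures/figure1_data.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["x", "y"])
    writer.writerows(zip(x, y))

# 2. the plot itself
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("figures/figure1.png")
```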

Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
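One way to read this rule (directory names are hypothetical): a top-level summary file holds the headline numbers, with per-sample detail kept one layer down for anyone who wants to dig.

```python
# Sketch: hierarchical output -- a summary on top, full detail underneath
import json
import os

samples = {"sample_A": {"reads": 1200, "mapped": 1100},
           "sample_B": {"reads": 900, "mapped": 850}}

for name, detail in samples.items():
    os.makedirs(f"results/details/{name}", exist_ok=True)
    with open(f"results/details/{name}/stats.json", "w") as fh:
        json.dump(detail, fh, indent=2)   # full detail, one layer down

summary = {name: d["mapped"] / d["reads"] for name, d in samples.items()}
with open("results/summary.json", "w") as fh:
    json.dump(summary, fh, indent=2)      # top layer: just the headline numbers
```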

Rule 9: Connect Textual Statements to Underlying Results
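A hedged sketch of one way to do this, assuming a LaTeX manuscript (all names are hypothetical): the analysis writes the quoted numbers to a small file that the text \input{}s, so a statement can never silently drift from the result it cites.

```python
# Sketch: let the analysis write the numbers quoted in the text
import os

mean_accuracy = 0.87   # produced by the analysis, not typed by hand

os.makedirs("manuscript", exist_ok=True)
with open("manuscript/generated_values.tex", "w") as fh:
    fh.write(f"\\newcommand{{\\meanaccuracy}}{{{mean_accuracy:.2f}}}\n")

# In the .tex file: "the classifier reached a mean accuracy of \meanaccuracy"
```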

Rule 10: Provide Public Access to Scripts, Runs, and Results

• GitHub
• Synapse
• Open Science Framework
• ReadTheDocs
• RunMyCode
• ???

Documentation

• Is it clear where to begin? (e.g., can someone picking a project up see where to start running it?)
• Can you determine which file(s) was/were used as input in a process that produced a derived file?
• Who do I cite? (code, data, etc.)
• Is there documentation about every result?
• Have you noted the exact version of every external application used in the process?
• For analyses that include randomness, have you noted the underlying random seed(s)?
• Have you specified the license under which you're distributing your content, data, and code?
• Have you noted the license(s) for other people's content, data, and code used in your analysis?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Organization

• Which is the most recent data file/code?
• Which folders can I safely delete?
• Do you keep older files/code or delete them?
• Can you find a file for a particular replicate of your research project?
• Have you stored the raw data behind each plot?
• Is your analysis output done hierarchically? (allowing others to find more detailed output underneath a summary)
• Do you run backups on all files associated with your analysis?
• How many times has a particular file been generated in the past? Why was the same file generated multiple times?
• Where did a file that I didn't generate come from?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Automation

• Are there lots of manual data manipulation steps?
• Are all custom scripts under version control?
• Is your writing (content) under version control?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Publication

• Have you archived the exact version of every external application used in your process(es)?
• Did you include a reproducibility statement or declaration at the end of your paper(s)?
• Are textual statements connected/linked to the supporting results or data?
• Did you archive preprints of resulting papers in a public repository?
• Did you release the underlying code at the time of publishing a paper?
• Are you providing public access to your scripts, runs, and results?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Best Practices for Scientific Computing

• Write programs for people, not computers.
• Let the computer do the work.
• Make incremental changes.
• DRY: Don’t repeat yourself (or others).
• Plan for mistakes. (“Defensive Programming”)
• Use pair programming.

• Document design and purpose, not mechanics.

Wilson, Greg, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745.
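A small sketch of "plan for mistakes" (defensive programming), with a hypothetical function: check the assumptions early and loudly instead of letting bad input propagate silently.

```python
# Sketch: defensive programming -- validate inputs up front
def mean_coverage(depths):
    """Return mean sequencing depth; the docstring documents purpose, not mechanics."""
    assert len(depths) > 0, "empty depth list -- did upstream filtering fail?"
    assert all(d >= 0 for d in depths), "negative depth values are impossible"
    return sum(depths) / len(depths)

print(mean_coverage([10, 12, 9, 14]))
```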

Suggested Training Topics

• version control and use of online repositories
• modern programming practice, including unit testing and regression testing
• maintaining “notebooks” or “research compendia”
• recording the provenance of final results relative to code and/or data
• numerical / floating point reproducibility and nondeterminism
• reproducibility on parallel systems
• dealing with large datasets
• dealing with complicated software stacks and use of virtual machines
• documentation and literate programming
• IP and licensing issues, proper citation and attribution

http://icerm.brown.edu/tw12-5-rcem/

Resources

• http://projecttemplate.net/ - Project automation (R)
• http://www.nature.com/news/2010/101013/full/467753a.html - Publish your computer code: it is good enough
• http://www.carlboettiger.info/ - Open lab notebook
• http://wiki.stodden.net/ICERM_Reproducibility_in_Computational_and_Experimental_Mathematics:_Readings_and_References
• http://rrcns.readthedocs.org/ - Best practices tutorial
• http://www.bioinformaticszen.com/
