
Page 1: Reproducibility: 10 Simple Rules

Reproducibility: 10 Simple Rules

And more!

Sandve, Geir Kjetil, et al. "Ten simple rules for reproducible computational research." PLoS computational biology 9.10 (2013): e1003285.

Page 2: Reproducibility: 10 Simple Rules

Rule 1: For Every Result, Keep Track of How It Was Produced

http://xkcd.com/
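One way to follow this rule is to write a small provenance record next to every result file. This is a minimal sketch, not from the slides; the result path and the use of git are assumptions.

```python
# Sketch: record the exact command, time, and code version that produced a
# result, next to the result itself.
import json
import subprocess
import sys
from datetime import datetime, timezone

def write_provenance(result_path):
    """Write <result>.provenance.json describing how the result was produced."""
    record = {
        "command": sys.argv,                                  # the exact invocation
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when it was run
        # Commit of the analysis code; assumes the script lives in a git repository.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
    }
    with open(result_path + ".provenance.json", "w") as fh:
        json.dump(record, fh, indent=2)

write_provenance("results/figure1.png")  # hypothetical result file
```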

Page 3: Reproducibility: 10 Simple Rules

Rule 2: Avoid Manual Data Manipulation Steps

• “Stop clicking, start typing” – Matt Frost, Charlottesville, VA

• Use scripts for even small changes
• Split commonly used code off into functions/classes, and put these into libraries (see the sketch below)
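A minimal sketch of what that looks like in practice; the file paths and the pandas dependency are assumptions, not from the slides.

```python
# Sketch: even a small, one-off clean-up is scripted instead of done by hand,
# and the reusable part lives in a function that can move into a shared library.
import pandas as pd  # assumes pandas is installed

def clean(df):
    """Drop fully empty rows and normalise column names."""
    df = df.dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

raw = pd.read_csv("data/raw/measurements.csv")  # hypothetical input file
clean(raw).to_csv("data/processed/measurements.csv", index=False)
```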

Page 4: Reproducibility: 10 Simple Rules

Rule 3: Archive the Exact Versions of All External Programs Used

Level 0: Note the names and versions of all packages (a sketch of automating this follows)

Level 1: Use a package management system (packrat, anaconda/conda)

Boss Level: Save an image of the entire system
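Level 0 can be automated with a few lines of standard-library Python; this is a sketch with a hypothetical output file name. Package managers such as conda and packrat cover Level 1 by exporting the full environment.

```python
# Sketch: dump the Python version and every installed package version to a file
# that is archived alongside the analysis results (standard library only).
import sys
from importlib import metadata

def log_environment(path="environment_versions.txt"):  # hypothetical file name
    with open(path, "w") as fh:
        fh.write(f"python {sys.version}\n")
        for dist in sorted(metadata.distributions(),
                           key=lambda d: d.metadata["Name"].lower()):
            fh.write(f"{dist.metadata['Name']}=={dist.version}\n")

log_environment()
```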

Page 5: Reproducibility: 10 Simple Rules

Rule 4: Version Control All Custom Scripts

http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics

• Also, version control workflows (what are good workflow management systems, guys?)

• Use the commit message to write something useful to your future self (“pwew pwew pwew” is not useful)

Page 6: Reproducibility: 10 Simple Rules

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

• “Explicit is better than implicit” – Tim Peters, The Zen of Python
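A sketch of the idea, with a hypothetical file name and placeholder values: write intermediate results in an open, explicit format such as JSON or CSV rather than a language-specific binary one.

```python
# Sketch: an intermediate result written as JSON, so any tool (or person) can
# inspect it later without rerunning the pipeline.
import json

# Hypothetical placeholder values, for illustration only.
filtering_stats = {"n_records_in": 120000, "n_records_kept": 98000, "threshold": 0.05}

with open("intermediate/filtering_stats.json", "w") as fh:
    json.dump(filtering_stats, fh, indent=2)
```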

Page 7: Reproducibility: 10 Simple Rules

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

• This goes for all parameters that may change
• Separate code from configuration, e.g. use config files (another gift to your future self; a sketch follows)
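A minimal sketch of the code/configuration split; the config file name, its keys, and the numpy dependency are assumptions.

```python
# Sketch: the seed and other tunable parameters live in a config file,
# not in the analysis script itself.
import json
import random

import numpy as np  # assumes numpy is installed

with open("config.json") as fh:        # e.g. {"seed": 42, "n_samples": 1000}
    config = json.load(fh)

random.seed(config["seed"])
np.random.seed(config["seed"])

samples = np.random.normal(size=config["n_samples"])
```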

Page 8: Reproducibility: 10 Simple Rules

Rule 7: Always Store Raw Data behind Plots

• (and the plot generating code, too)
• Make raw data read only
• Separate folders for raw and pre-processed data (a sketch follows)

https://inspguilfoyle.wordpress.com/2014/02/19/straight-lines/
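A sketch covering all three points; the folder layout and the numpy/matplotlib dependencies are assumptions.

```python
# Sketch: save the exact values behind a figure next to the figure, keep raw and
# processed data in separate folders, and make the raw file read-only.
import os
import stat

import matplotlib.pyplot as plt  # assumes matplotlib is installed
import numpy as np

x = np.linspace(0, 10, 200)
y = np.sin(x)

np.savetxt("figures/figure1_data.csv", np.column_stack([x, y]),
           delimiter=",", header="x,y", comments="")
plt.plot(x, y)
plt.savefig("figures/figure1.png")

# Raw inputs live under data/raw/ and are made read-only so they cannot be
# edited by accident; processed copies go to data/processed/.
os.chmod("data/raw/measurements.csv", stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```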

Page 9: Reproducibility: 10 Simple Rules

Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
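One way to read this rule, sketched with a hypothetical directory layout and placeholder values: a small summary file at the top level, with per-item detail files underneath it.

```python
# Sketch: a top-level summary.csv plus one detail file per sample, so a reader
# can drill down from the summary to the full output.
import csv
import json
from pathlib import Path

results = {  # hypothetical per-sample results, for illustration only
    "sample_A": {"mean": 1.2, "values": [1.0, 1.3, 1.3]},
    "sample_B": {"mean": 2.1, "values": [2.0, 2.2, 2.1]},
}

out = Path("output")
out.mkdir(exist_ok=True)
with open(out / "summary.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample", "mean"])
    for name, res in results.items():
        writer.writerow([name, res["mean"]])
        detail_dir = out / "details" / name
        detail_dir.mkdir(parents=True, exist_ok=True)
        with open(detail_dir / "values.json", "w") as detail_fh:
            json.dump(res, detail_fh, indent=2)
```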

Page 10: Reproducibility: 10 Simple Rules

Rule 9: Connect Textual Statements to Underlying Results

Page 11: Reproducibility: 10 Simple Rules

Rule 10: Provide Public Access to Scripts, Runs, and Results

• GitHub
• Synapse
• Open Science Framework
• ReadTheDocs
• RunMyCode
• ???

Page 12: Reproducibility: 10 Simple Rules

Documentation

• Is it clear where to begin? (e.g., can someone picking up the project see where to start running it?)
• Can you determine which file(s) was/were used as input in a process that produced a derived file?
• Who do I cite? (code, data, etc.)
• Is there documentation about every result?
• Have you noted the exact version of every external application used in the process?
• For analyses that include randomness, have you noted the underlying random seed(s)?
• Have you specified the license under which you're distributing your content, data, and code?
• Have you noted the license(s) for other people's content, data, and code used in your analysis?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Page 13: Reproducibility: 10 Simple Rules

Organization

• Which is the most recent data file/code?
• Which folders can I safely delete?
• Do you keep older files/code or delete them?
• Can you find a file for a particular replicate of your research project?
• Have you stored the raw data behind each plot?
• Is your analysis output organized hierarchically? (allowing others to find more detailed output underneath a summary)
• Do you run backups on all files associated with your analysis?
• How many times has a particular file been generated in the past? Why was the same file generated multiple times?
• Where did a file that I didn't generate come from?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Page 14: Reproducibility: 10 Simple Rules

Automation

• Are there lots of manual data manipulation steps?
• Are all custom scripts under version control?
• Is your writing (content) under version control?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Page 15: Reproducibility: 10 Simple Rules

Publication

• Have you archived the exact version of every external application used in your process(es)?
• Did you include a reproducibility statement or declaration at the end of your paper(s)?
• Are textual statements connected/linked to the supporting results or data?
• Did you archive preprints of the resulting papers in a public repository?
• Did you release the underlying code at the time of publishing a paper?
• Are you providing public access to your scripts, runs, and results?

http://ropensci.github.io/reproducibility-guide/sections/checklist/

Page 16: Reproducibility: 10 Simple Rules

Best Practices for Scientific Computing

• Write programs for people, not computers.
• Let the computer do the work.
• Make incremental changes.
• DRY: Don't repeat yourself (or others).
• Plan for mistakes ("defensive programming"; see the sketch below).
• Use pair programming.

Wilson, Greg, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745.
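A small sketch of the "plan for mistakes" point; the function is hypothetical. Validate inputs early and fail loudly rather than silently producing a wrong result.

```python
# Sketch of defensive programming: check assumptions up front and fail loudly.
def normalise(values):
    if not values:
        raise ValueError("normalise() received an empty list")
    total = sum(values)
    if total == 0:
        raise ValueError("normalise() received values summing to zero")
    return [v / total for v in values]

assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]
```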

Page 17: Reproducibility: 10 Simple Rules

Wilson, Greg, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745.

Document design and purpose, not mechanics.

Page 18: Reproducibility: 10 Simple Rules

Suggested Training Topics

• version control and use of online repositories
• modern programming practice, including unit testing and regression testing
• maintaining “notebooks” or “research compendia”
• recording the provenance of final results relative to code and/or data
• numerical / floating-point reproducibility and nondeterminism
• reproducibility on parallel systems
• dealing with large datasets
• dealing with complicated software stacks and use of virtual machines
• documentation and literate programming
• IP and licensing issues, proper citation and attribution

http://icerm.brown.edu/tw12-5-rcem/

Page 19: Reproducibility: 10 Simple Rules

Resources

• http://projecttemplate.net/ - Project automation (R)
• http://www.nature.com/news/2010/101013/full/467753a.html - Publish your computer code: it is good enough
• http://www.carlboettiger.info/ - Open lab notebook
• http://wiki.stodden.net/ICERM_Reproducibility_in_Computational_and_Experimental_Mathematics:_Readings_and_References
• http://rrcns.readthedocs.org/ - Best practices tutorial
• http://www.bioinformaticszen.com/