a guide for reproducible research
Reproducibility in research is the ability to replicate the final product of academic research so that others can reproduce the results and build on them. The main entities of academic research are data, scripts/software for processing and analysis, the workflow of the research process, and the research output (Figure 1). Documenting workflow, data, and code during the active phase of scientific research is important for communicating the scholarship and replicating the results. When researchers submit scientific papers or build on their own work, they have to remember every detail of that work unless they have documented it well. To sustain the integrity of reproducibility in scientific research and to advance the research process, this poster presents guidelines that help researchers manage the research entities during the active phase of the research process.
A Guide for Reproducible Research
Yasmin AlNoamany
University of California, Berkeley
Introduction
The main entities of the scientific research
Research Software
Research software – source code or executables that researchers generate or integrate into the workflow of the scientific research.
What to document:
• The experimental environment (e.g., hardware, operating system)
• The computing platform and prerequisites
• Scripts and libraries
• Input and output parameters
• The functionality of each script
• Dependencies of the software, indicating versions
• The structure of the code/software and details about individual components
Good practices in managing your software:
• Write custom scripts to automate research analysis.
• Attach examples of how the code works.
• Generate a list of all scripts, how to run them, and in what order.
• Use tools that capture the experimental environment, such as Docker and ReproZip.
• Use metadata standards for each generated module. Each module should have at least the following:
Ø Name of the module
Ø Name of the project
Ø Name of the author
Ø Input and output
Ø Purpose of the module
Ø A brief description
Naming files should be descriptive and consistent!
Tools
• Docker
• Apache Ivy
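As an illustration of the module metadata and environment documentation recommended for research software, the following is a minimal sketch in Python. The module name, project name, and author in the header are hypothetical placeholders, and the header layout is only one possible convention, not a formal metadata standard. The functions record the operating system, hardware architecture, Python version, and installed package versions using only the standard library.

```python
"""
Module:      capture_environment.py (hypothetical name)
Project:     example-project (hypothetical name)
Author:      Jane Researcher (hypothetical)
Input:       none
Output:      a dict describing the environment, optionally saved as JSON
Purpose:     Record the experimental environment next to the results.
Description: Collects the operating system, hardware architecture, Python
             version, and the versions of the installed packages.
"""
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return a JSON-serializable description of the current environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "python_version": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


def save_environment(path):
    """Write the environment description to `path` as indented JSON."""
    env = capture_environment()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(env, handle, indent=2)
    return env
```

Calling save_environment("environment.json") at the start of an analysis would leave a record of the dependencies and their versions alongside the results. Tools such as Docker or ReproZip capture far more of the environment; this sketch only covers what Python itself can report.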
Research Output
Scientific paper(s) along with graphs/tables – document(s) that contain the results of the scientific research as well as all the assorted graphs and tables. This could be:
• Compiled files (e.g., PDF)
• Source files (e.g., .tex files, figures, .bib file)
• Packages/libraries/styles installed (e.g., graphics)
• Graphs and tables
Good practices in managing output files:
• Document the environment and the file structure.
• Track versions of produced papers, graphs, etc.
• Document any problem you encounter with the computing environment.
• Back up your files regularly.
• Save your files on Dropbox or another cloud storage service to keep track of your versions.
• For writing your manuscript, use LaTeX and BibTeX for these reasons:
Ø LaTeX is free and open source.
Ø A .tex file can be edited in any text editor.
Ø The content is separated from the style.
Ø With a couple of lines and style files, you can change how your PDF looks.
Ø LaTeX lets you preserve your files for a longer time.
Ø The output document looks better.
Naming files should be descriptive and consistent!
Tools
• LaTeX
• BibTeX
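The advice on descriptive, consistent file names and on tracking versions of papers and graphs could be put into practice with a small helper like the one below. This is only a sketch: the naming pattern (project, description, date, version) and the function names are assumptions chosen for illustration, not a convention prescribed by the poster.

```python
import re
from datetime import date
from pathlib import Path


def slugify(text):
    """Lowercase `text` and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def versioned_name(project, description, version, extension, day=None):
    """Build a file name of the form project_description_date_vNN.ext."""
    day = day or date.today()
    return (f"{slugify(project)}_{slugify(description)}_"
            f"{day.isoformat()}_v{version:02d}.{extension.lstrip('.')}")


def next_version(directory, project, description, extension):
    """Return one more than the highest version found in `directory`."""
    pattern = re.compile(
        rf"^{re.escape(slugify(project))}_{re.escape(slugify(description))}"
        rf"_\d{{4}}-\d{{2}}-\d{{2}}_v(\d+)\.{re.escape(extension.lstrip('.'))}$"
    )
    versions = [int(match.group(1))
                for path in Path(directory).iterdir()
                if (match := pattern.match(path.name))]
    return max(versions, default=0) + 1
```

For example, versioned_name("My Project", "Results Table", 2, ".pdf", date(2017, 3, 1)) returns "my-project_results-table_2017-03-01_v02.pdf". A version control system such as Git tracks versions more thoroughly; a scheme like this is mainly useful for compiled outputs that are shared outside the repository.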
Data
Data – files that were used or produced during the scientific research process. These files can be raw data or different versions of processed data.
Good practices in managing data:
• Include a README file in the directory that holds the data.
• Write a data management plan, which funding agencies now require.
• Provide a detailed description of the data, the data source(s), and how the data will be used.
• Describe the process of capturing the data.
• Describe all the steps of data preprocessing.
• Provide a description of, and information about, each new version of the data.
• Provide details about the software/code used for preprocessing the data.
• Adopt metadata standards for describing the data.
• Back up your files regularly.
Naming files should be descriptive and consistent!
Tools
• DMPTool
• DASH
• Figshare
• EZID
• Box and Drive
• Merritt repository
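The README recommended for each data directory can be generated rather than written by hand. The sketch below assumes a plain-text README.txt with a description, a source, the preprocessing steps, and a checksum for every file; the layout and the function names are illustrative assumptions, not a metadata standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_readme(data_dir, description, source, preprocessing_steps):
    """Write README.txt in `data_dir` describing the data and its files."""
    data_dir = Path(data_dir)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [
        "README",
        f"Generated: {stamp}",
        f"Description: {description}",
        f"Source: {source}",
        "Preprocessing steps:",
    ]
    lines += [f"  {i}. {step}"
              for i, step in enumerate(preprocessing_steps, 1)]
    lines.append("Files (name, size in bytes, SHA-256):")
    for path in sorted(data_dir.iterdir()):
        # List regular files only, and skip the README itself.
        if path.is_file() and path.name != "README.txt":
            size = path.stat().st_size
            lines.append(f"  {path.name}  {size}  {sha256_of(path)}")
    readme = data_dir / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Regenerating the README whenever a new version of the data is produced keeps the description and the checksums in step with the files, so a later reader can verify that they hold the same data.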
Source: http://data-archive.ac.uk/create-manage/life-cycle
References
1. AlNoamany, Yasmin. "How to make your research reproducible." http://guides.lib.berkeley.edu/reproducibility-guide (2017).
2. Stodden, Victoria. "Enabling reproducible research: Open licensing for scientific innovation." (2009).
3. Bailey, David H., Jonathan M. Borwein, and Victoria Stodden. "Facilitating reproducibility in scientific computing: Principles and practice." Reproducibility: Principles, Problems, Practices, and Prospects (2014): 205-232.
4. Stodden, Victoria, et al. "Enhancing reproducibility for computational methods." Science 354.6317 (2016): 1240-1241.
Workflow
Workflow documentation – detailed steps of the workflow that capture the process of the scientific research.
• Weekly/daily notes on the project's stages
• Documentation for the steps of the workflow
For managing the research workflow, document:
• The steps of the research, from the design through fetching the data to producing the graphs and tables in the scientific output.
• All adopted libraries and integrated algorithms.
• All citations and information for the code and data used.
• The input and the output of each step.
Electronic notebooks, such as Jupyter, help document the workflow!
Tools
• Jupyter
• knitr
• Overleaf
• ShareLaTeX
• GitHub
• Zenodo
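Recording the input and the output of each workflow step can be partly automated. The following is a minimal sketch, assuming the workflow is written as Python functions; the decorator name, the log structure, and the two example steps (clean and mean) are hypothetical and stand in for real analysis steps.

```python
import functools
import json
from datetime import datetime, timezone

WORKFLOW_LOG = []  # in-memory record of the steps executed so far


def workflow_step(func):
    """Record the name, inputs, and output of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        WORKFLOW_LOG.append({
            "step": func.__name__,
            "time": stamp,
            "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
            "output": repr(result),
        })
        return result
    return wrapper


@workflow_step
def clean(values):
    """Hypothetical step: drop missing values."""
    return [v for v in values if v is not None]


@workflow_step
def mean(values):
    """Hypothetical step: average the cleaned values."""
    return sum(values) / len(values)


def save_log(path):
    """Write the recorded steps to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(WORKFLOW_LOG, handle, indent=2)
```

Running mean(clean([1, None, 3])) appends two entries to WORKFLOW_LOG, in execution order, and save_log("workflow_log.json") stores them next to the results. For large inputs, logging a file path or a checksum instead of the full repr would keep the log readable.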
Sponsored in part through grants from the Alfred P. Sloan Foundation #G-2014-13746 and from the National Science Foundation NSF ACI #1349002