A Guide for Reproducible Research


Upload: yasmin-alnoamany-phd

Post on 22-Jan-2018


Yasmin AlNoamany
University of California, Berkeley [email protected]

Introduction

Reproducibility in research is the ability to replicate the final product of academic research so that others can reproduce the results and build on the work. The main entities of academic research are data, scripts/software for processing and analysis, the workflow of the research process, and the research output (Figure 1). Documenting the workflow, data, and code during the active phase of the research is important for communicating the scholarship and replicating the results. When researchers submit scientific papers or build on their earlier work, they must recall every detail of that work unless they documented it well. To sustain the integrity of reproducibility in scientific research and to advance the research process, this poster presents guidelines that help researchers manage the research entities during the active phase of the research process.

Figure 1: The main entities of the scientific research

Research Software

Research Software: source code or executables that researchers generate or integrate into the workflow of the scientific research.

What to document:
• The experimental environment (e.g., hardware, operating system)
• The computing platform and prerequisites
• Scripts and libraries
• Input and output parameters
• The functionality of each script
• Dependencies of the software, indicating versions
• The structure of the code/software and details about individual components

Good practices in managing your software:
• Write custom scripts to automate research analysis.
• Attach examples of how the code works.
• Generate a list of all scripts, how to run them, and in what order.
• Use tools that capture the experimental environment, such as Docker and ReproZip.
• Use metadata standards for each generated module. Each module should have at least the following:
Ø Name of the module
Ø Name of the project
Ø Name of the author
Ø Input and output
Ø Purpose of the module
Ø A brief description

Naming files should be descriptive and consistent!

Tools
• Docker
• Apache Ivy
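As an illustration of the module metadata and environment items listed for research software, the following is a minimal sketch rather than a prescribed tool. It shows a Python module whose header carries the six recommended metadata fields and whose functions record the operating system, hardware architecture, Python version, and package versions. The module name, project name, author, function names, and output file name are all hypothetical.

```python
"""
Name of the module: capture_environment (hypothetical)
Name of the project: example-project (hypothetical)
Name of the author: A. Researcher (hypothetical)
Input: an iterable of package names
Output: a dictionary describing the environment, optionally saved as JSON
Purpose of the module: record the experimental environment of an analysis run
Description: collects the operating system, hardware architecture, Python
    version, and the installed versions of the named packages.
"""
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=()):
    """Return a dictionary describing the current computing environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # the package is not installed
    return {
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        "packages": versions,
    }


def save_environment(path, packages=()):
    """Write the environment description to a JSON file and return it."""
    env = capture_environment(packages)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(env, handle, indent=2, sort_keys=True)
    return env
```

A call such as `save_environment("environment.json", ["numpy", "pandas"])` at the start of an analysis would store the record next to the results; the package names here are only examples. A sketch like this complements container tools such as Docker but does not replace them.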

Research Output

Scientific paper(s) along with graphs/tables: document(s) that contain the results of the scientific research as well as all the associated graphs and tables. This could be:
• Compiled files (e.g., PDF)
• Source files (e.g., .tex files, figures, .bib file)
• Packages/libraries/styles installed (e.g., graphics)
• Graphs and tables

Good practices in managing output files:
• Document the environment and the file structure.
• Track versions of produced papers, graphs, etc.
• Document any problems you encounter with the computing environment.
• Back up your files regularly.
• Save your files on Dropbox or any other cloud storage to keep track of your versions.
• For writing your manuscript, use LaTeX and BibTeX for these reasons:
Ø LaTeX is free and open source.
Ø A .tex file can be edited in any text editor.
Ø The content is separated from the style.
Ø With a couple of lines and style files, you can change how your PDF looks.
Ø LaTeX allows your files to be preserved for a longer time.
Ø The output document looks better.

Naming files should be descriptive and consistent!

Tools
• LaTeX
• BibTeX

Data

Data: files that were used or produced during the scientific research process. These files can be raw data or different versions of processed data.

Good practices in managing data:
• Include a README file in the directory that holds the data.
• Write a data management plan, which has become a requirement of funding agencies.
• Provide a detailed description of the data, the data source(s), and how the data will be used.
• Describe the process of capturing the data.
• Describe all the steps of data preprocessing.
• Describe each new version of the data.
• Provide details about the software/code used for preprocessing the data.
• Adopt metadata standards for describing the data.
• Back up your files regularly.

Naming files should be descriptive and consistent!

Tools
• DMPTool
• DASH
• Figshare
• EZID
• Box and Drive
• Merritt repository
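One possible way to describe each version of the data is a small manifest stored next to the README. The sketch below is an assumption about how that could look, not a metadata standard: it lists every file in a data directory with its size and SHA-256 checksum, so a later reader can tell which version of the data an analysis used. The function names, the `MANIFEST.json` file name, and the manifest fields are hypothetical.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

MANIFEST_NAME = "MANIFEST.json"  # hypothetical file name


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, version, description):
    """Describe every file under data_dir, skipping the manifest itself."""
    data_dir = Path(data_dir)
    files = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path.name != MANIFEST_NAME:
            files.append({
                "file": path.relative_to(data_dir).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return {
        "version": version,
        "description": description,
        "created": date.today().isoformat(),
        "files": files,
    }


def write_manifest(data_dir, version, description):
    """Save the manifest inside data_dir and return it."""
    manifest = build_manifest(data_dir, version, description)
    out = Path(data_dir) / MANIFEST_NAME
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Calling `write_manifest("data/processed", "v2", "Outliers removed")` after each preprocessing run would leave a record of what changed; the directory, version label, and description are placeholders.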

Source: http://data-archive.ac.uk/create-manage/life-cycle

References
1. AlNoamany, Yasmin. "How to make your research reproducible." http://guides.lib.berkeley.edu/reproducibility-guide (2017).
2. Stodden, Victoria. "Enabling reproducible research: Open licensing for scientific innovation." (2009).
3. Bailey, David H., Jonathan M. Borwein, and Victoria Stodden. "Facilitating reproducibility in scientific computing: Principles and practice." Reproducibility: Principles, Problems, Practices, and Prospects (2014): 205-232.
4. Stodden, Victoria, et al. "Enhancing reproducibility for computational methods." Science 354.6317 (2016): 1240-1241.

Workflow

Workflow documentation: detailed steps of the workflow that capture the process of the scientific research.
• Weekly/daily notes on the project's stages
• Documentation for the steps of the workflow

For managing the research workflow, document:
• The steps of the research, from the design through fetching the data to producing the graphs and tables in the scientific output.
• All adopted libraries and integrated algorithms.
• All citations and information for the code and data used.
• The input and the output of each step.

Electronic notebooks, such as Jupyter, help document the workflow!

Tools
• Jupyter
• knitr
• Overleaf
• ShareLaTeX
• GitHub
• Zenodo
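Outside a notebook, the input and output of each workflow step can also be recorded with a few lines of code. The following is a minimal sketch under assumed conventions, not part of any of the tools listed: each step appends one JSON line to a log file with a timestamp, the step name, its inputs, its outputs, and free-text notes. The function names, field names, and log format are hypothetical.

```python
import json
from datetime import datetime, timezone


def record_step(log_path, step, inputs, outputs, notes=""):
    """Append one workflow step to a JSON-lines log and return the entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "notes": notes,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path):
    """Return all recorded steps in the order they were written."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `record_step("workflow.log", "preprocess", ["raw.csv"], ["clean.csv"], "dropped empty rows")` documents one step, and `read_log("workflow.log")` returns the full sequence for a methods section or a README. The file names are placeholders.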

Sponsored in part through grants from the Alfred P. Sloan Foundation #G-2014-13746 and from the National Science Foundation NSF ACI #1349002