Reproducibility with Revolution R Open David Smith @revodavid Chicago R User Group January 2015

Upload: revolution-analytics

Post on 15-Jul-2015


Page 1: Reproducibility with Revolution R Open

Reproducibility with Revolution R Open

David Smith @revodavid

Chicago R User Group, January 2015

Page 2: Reproducibility with Revolution R Open

But first: Breaking News!

• Microsoft will acquire Revolution Analytics

• Revolution R and OSS projects will continue

• Continued commitment to R community

• More at: blog.revolutionanalytics.com/2015/01/revolution-acquired.html

Page 3: Reproducibility with Revolution R Open

Revolution R Open is:

• Downstream open source R distribution

– GPLv2 license

• Compatible with all R-related software

– Packages, RStudio, …

• Multi-threaded for performance

• Focus on reproducibility

• Binaries available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE

• Download RRO free from mran.revolutionanalytics.com

• Paid support available from Revolution Analytics with Revolution R Plus


Page 4: Reproducibility with Revolution R Open


MRAN: The Managed R Archive Network

• Download Revolution R Open

• Learn about R and RRO

• Explore packages and their dependencies

• Explore Task Views

• R tips and applications

• Daily CRAN snapshots

mran.revolutionanalytics.com

Page 5: Reproducibility with Revolution R Open

What is Reproducibility?

“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.”

(CRAN Task View on Reproducible Research, Kuhn)

• Method + Environment -> Results

• A process for:

– Sharing the method

– Describing the environment

– Recreating the results

xkcd.com/242/

Page 6: Reproducibility with Revolution R Open

Why care about reproducibility?

Academic / Research

• Verify results

• Advance Research

Business

• Production code

• Reliability

• Reusability

• Regulation

www.nytimes.com/2011/07/08/health/research/08genes.html

http://arxiv.org/pdf/1010.1092.pdf

Page 7: Reproducibility with Revolution R Open

Observations

• R versions are pretty manageable

– Major versions just once a year

– Patches rarely introduce incompatible changes

• Good solutions for literate programming

– RStudio / knitr / R Markdown

• OS/Hardware not the major cause of problems

• The big problem is with packages

– CRAN is in a state of continual flux

Page 8: Reproducibility with Revolution R Open

The problem with packages

http://xkcd.com/234/

Page 9: Reproducibility with Revolution R Open

Reproducible R Toolkit

• Static CRAN mirror in RRO

– CRAN packages fixed with each Revolution R Open update

• Daily CRAN snapshots

– Storing every package version since September 2014

– Hosted at mran.revolutionanalytics.com/snapshot

• Write and share scripts synced to a specific snapshot date

– checkpoint package installed with RRO

projects.revolutionanalytics.com/rrt/

Page 10: Reproducibility with Revolution R Open

Using checkpoint

• Add 2 lines to the top of your script:

library(checkpoint)

checkpoint("2015-01-28")

– Or whichever date you want

• Err, that’s it.

Page 11: Reproducibility with Revolution R Open

Checkpoint tips for script authors

• Work within a project

– Dedicated folder with scripts, data and output

– e.g. /Users/david/R/weather

• Create a master .R script file beginning with:

library(checkpoint)

checkpoint("DATE")

– Package versions used will be as of this date

• Don’t use install.packages directly

– Use library() and checkpoint does the rest

– You can have different package versions installed for different projects at the same time!

• Works only with CRAN packages

– Working on options for GitHub, etc.
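The project layout in these tips can be sketched in base R. This is only an illustration: the helper name new_checkpoint_project, the data and output subfolders, and the file name main.R are assumptions, not part of checkpoint or RRO.

```r
# Hypothetical helper: scaffold a project folder whose master script
# starts with the two checkpoint lines. Names and layout are illustrative.
new_checkpoint_project <- function(path, date = format(Sys.Date())) {
  # Fail early on a malformed snapshot date.
  if (is.na(as.Date(date, format = "%Y-%m-%d"))) {
    stop("date must be in YYYY-MM-DD format: ", date)
  }
  # Dedicated folder with places for data and output.
  for (d in c(path, file.path(path, "data"), file.path(path, "output"))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # Master script: checkpoint lines first, analysis code afterwards.
  master <- file.path(path, "main.R")
  writeLines(c(
    "library(checkpoint)",
    sprintf("checkpoint(\"%s\")", date),
    "",
    "# Load packages with library(); do not call install.packages()."
  ), master)
  invisible(master)
}

# Example: create a throwaway project in the session's temporary directory.
script <- new_checkpoint_project(file.path(tempdir(), "weather"), "2015-01-28")
cat(readLines(script), sep = "\n")
```

Running the generated main.R would then need the checkpoint package and a connection to MRAN, as described in the slides.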

Page 12: Reproducibility with Revolution R Open

DEMO: WEATHER


Page 13: Reproducibility with Revolution R Open

Package dependency explosion

• R script file using 6 most popular packages

Any updated package = potential reproducibility error!

http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html

Page 14: Reproducibility with Revolution R Open

Sharing projects with checkpoint

• Just share your script or project folder!

• Recipient only needs:

– Compatible R version

– checkpoint package (installed with RRO)

– Internet connection to MRAN (at least first time)

• Checkpoint takes care of:

– Installing CRAN packages

    • Binaries (ease of installation)

    • Correct versions (reproducibility)

    • Dependencies (ease of installation)

– Eliminating conflicts with other installed packages

Page 15: Reproducibility with Revolution R Open

The checkpoint magic

The checkpoint() call does all this:

• Scans project for required packages

• Installs required packages and dependencies

– Packages installed specific to project

– Versions specific to checkpoint date

    • Installed in ~/.checkpoint/DATE

• Skips packages if already installed (2nd run through)

• Reconfigures package search path

– Points only to project-specific library
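The steps listed above can be approximated in base R. This is a rough sketch, not the checkpoint package's actual implementation: the function names are hypothetical, the regular expression only catches simple library() and require() calls, and the install step is left out because it needs network access.

```r
# Sketch only: NOT the checkpoint package's actual implementation.

# Step 1: find package names used via library(pkg) or require(pkg).
scan_project_packages <- function(project_dir) {
  files <- list.files(project_dir, pattern = "\\.[Rr]$",
                      recursive = TRUE, full.names = TRUE)
  lines <- unlist(lapply(files, readLines, warn = FALSE))
  if (length(lines) == 0) return(character(0))
  pattern <- "(library|require)\\([[:space:]]*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(lines, regexec(pattern, lines))
  # Each hit is c(full match, function name, package name) or empty.
  pkgs <- vapply(hits,
                 function(m) if (length(m) == 3) m[3] else NA_character_,
                 character(1))
  sort(unique(pkgs[!is.na(pkgs)]))
}

# Step 2: one library folder per snapshot date, as in ~/.checkpoint/DATE.
snapshot_library <- function(date, root = file.path("~", ".checkpoint")) {
  file.path(root, date)
}

# Step 3: put that folder first on the package search path.
# R still appends its own system library (and any site library).
use_snapshot_library <- function(lib) {
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(lib)
  invisible(.libPaths())
}

# Example: scan a throwaway project in the temporary directory.
proj <- file.path(tempdir(), "demo-project")
dir.create(proj, showWarnings = FALSE)
writeLines(c("library(checkpoint)", "library(ggplot2)", "require(\"plyr\")"),
           file.path(proj, "analysis.R"))
print(scan_project_packages(proj))
print(snapshot_library("2015-01-28"))
```

Keeping a separate library folder per date is what lets different projects use different package versions side by side.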

Page 16: Reproducibility with Revolution R Open

MRAN checkpoint server

Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.

[Diagram: CRAN -> CRAN mirror at cran.revolutionanalytics.com (Windows binaries virus-scanned) -> daily snapshots at midnight UTC on the checkpoint server, mran.revolutionanalytics.com/snapshot/ -> checkpoint package]

library(checkpoint)

checkpoint("2015-01-28")

Page 17: Reproducibility with Revolution R Open

checkpoint server - implementation

Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.

• rsync to mirror CRAN daily

– Only downloads changed packages

• zfs to store incremental snapshots

– Storage only required for new packages

• Organizes snapshots into a labelled hierarchy

– mran.revolutionanalytics.com/snapshot/YYYY-MM-DD

• MRAN hosted by high-performance cloud provider

– Provisioned for availability and latency

https://github.com/RevolutionAnalytics/checkpoint-server
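Because each snapshot lives at a dated URL, a snapshot can in principle be used like any other CRAN repository. A minimal sketch, assuming the snapshot is served over https at the address shown on the slide; snapshot_url is a hypothetical helper, not a checkpoint function.

```r
# Build the URL of a dated snapshot following the /snapshot/YYYY-MM-DD layout.
# The https scheme is an assumption; the slides give the host and path only.
snapshot_url <- function(date,
                         base = "https://mran.revolutionanalytics.com/snapshot") {
  d <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(d)) stop("date must be in YYYY-MM-DD format: ", date)
  paste(base, format(d, "%Y-%m-%d"), sep = "/")
}

url <- snapshot_url("2015-01-28")

# Point this session's CRAN repository at the snapshot; later calls to
# install.packages() would then resolve packages as of that date.
old <- options(repos = c(CRAN = url))
print(getOption("repos"))

# Restore the previous repository setting.
options(old)
```

This only changes where packages are fetched from. It does not give the per-project library isolation that checkpoint() sets up.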

Page 18: Reproducibility with Revolution R Open

CRAN mirrors and RRO

• Revolution R Open ships with a fixed default CRAN mirror

– Currently, 1 Dec 2014 snapshot (v 8.0.1)

• All users of same version get same CRAN package versions by default

– regardless of when “install.packages” is run

• Use checkpoint to access newer package versions

Page 19: Reproducibility with Revolution R Open

Future work

• Checks on R versions

checkpoint("2015-02-17", R="3.1.2")

• Support for other repos

– Bioconductor, GitHub, user

• Institution-level package duplication

– CRAN “behind the firewall”

– miniCRAN package

• Suggestions welcome!

github.com/RevolutionAnalytics/checkpoint
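Until a version check like the one proposed above is available, a script can guard itself with base R. A minimal sketch; check_r_version is a hypothetical helper and says nothing about how checkpoint would implement the R argument.

```r
# Hypothetical guard: stop if the running R version differs from the one
# the script was written for.
check_r_version <- function(required) {
  current <- getRversion()
  if (current != required) {
    stop(sprintf("This script expects R %s but is running under R %s",
                 required, as.character(current)))
  }
  invisible(TRUE)
}

# Passes: compares the running version with itself.
check_r_version(as.character(getRversion()))

# Reports a mismatch unless the session really is R 3.1.2.
outcome <- tryCatch(check_r_version("3.1.2"),
                    error = function(e) conditionMessage(e))
print(outcome)
```

An exact match is strict. Whether patch-level differences should be tolerated is a design choice the slide leaves open.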

Page 20: Reproducibility with Revolution R Open

Package Problem: The Author

http://xkcd.com/970/

Time to update my package on CRAN!

>> Dependent packages that now fail to build: 67

>> Resubmit your package and try again

Crap.

Page 21: Reproducibility with Revolution R Open

Package Problem : The Update

http://xkcd.com/664/

3 days later… Woot! A new version of R is out! I have 10 minutes now, time to download and install!

… package not found …

… can’t install package …

… error …

Page 22: Reproducibility with Revolution R Open

What about packrat?

• Packrat is flexible and powerful

– Supports non-CRAN packages (e.g. GitHub)

– Allows mix-and-matching package versions

– Requires shipping all package source

– Requires recipients to build packages from source

• Checkpoint is simple

– Reproducibility from one script

– Simple for recipients to reproduce results

– Only allows use of CRAN package versions that have been tested together

– Relies on availability of MRAN

rstudio.github.io/packrat/

Page 23: Reproducibility with Revolution R Open

Get started with checkpoint

• Install RRO: mran.revolutionanalytics.com

– checkpoint comes pre-installed

• Or, install “checkpoint” from CRAN

– Latest version available on GitHub:

library(devtools)

install_github("RevolutionAnalytics/checkpoint")

• Contributions welcome!

projects.revolutionanalytics.com/rrt/


Page 24: Reproducibility with Revolution R Open

Check out our other OSS projects

• DeployR Open

– simple, secure R integration for application developers

• RHadoop

– connect R to Hadoop and run R on Hadoop nodes

• ParallelR

– R packages for parallel and distributed programming

projects.revolutionanalytics.com

Page 25: Reproducibility with Revolution R Open

Thank You!

David Smith @revodavid

[email protected]