All about Versioning?
Dr. Moritz Neun
Replicability
* Sumatra: a toolkit for reproducible research - Open Science Framework (https://osf.io/rc5jf/?action=download&version=1)
Replicability
• manifold and changing tools • manual work / tweaking • environment and dependencies
Building Blocks for Replicability
• documentation and record keeping • versioning
Versioning?
Version control is the means by which different versions and drafts of a document (or file or record or
dataset) are managed.
http://www2.le.ac.uk/services/research-data/organise-data/version-control
Applications
• Documents • Code • Data
➡ track evolution of work ➡ backup
Version Control
https://webinerds.com/version-control-systems-keep-your-code-in-order/
Versioning Changes
Version Control Systems
Timeline, 1972 to 2015: SCCS, RCS, CVS, Perforce, SVN, BitKeeper, TFS, Git, Hg, VSTS
• local to central • central to distributed • everything is a branch
Source Code Hosting
• SourceForge (users 2016: 3.7M) • Bitbucket • GitHub (users 2016: 15M) • Google Code (discontinued 2016)
Version Control Variants
• file name versions or Gmail ➡ snapshots vs. diffs
• local / single user version control • distributed / shared version control
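The "snapshots vs. diffs" distinction can be illustrated with a minimal sketch using Python's standard `difflib`. The two versions `v1` and `v2` are made-up sample data, and real version control systems use more compact delta formats than the human-readable one shown here.

```python
import difflib

# Two hypothetical versions of the same small file, as lists of lines.
v1 = ["alpha\n", "beta\n", "gamma\n"]
v2 = ["alpha\n", "beta (revised)\n", "gamma\n", "delta\n"]

# Snapshot approach: keep every version in full.
snapshots = {"v1": list(v1), "v2": list(v2)}

# Diff approach: keep a line-by-line delta between the two versions.
delta = list(difflib.ndiff(v1, v2))

# Either version can be rebuilt from the delta alone.
restored_v1 = list(difflib.restore(delta, 1))
restored_v2 = list(difflib.restore(delta, 2))

if __name__ == "__main__":
    print("".join(delta), end="")
```

Snapshots are simple to restore but grow with every version; diffs record only what changed, at the cost of having to replay them.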
Version Control ➡ workhorse for record keeping
Reproducible Workflows
• notebook (journal, log, lab book)
• keeping track of everything (incl. manual work, bash history, …)
• avoid manual work whenever possible
➡ documentation and knowledge transfer
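One possible way to keep track of everything without manual note taking is to wrap each command in a small helper that appends a record to a log file. The sketch below is only an illustration: the function name `run_and_record`, the default file name `runs.jsonl`, and the recorded fields are assumptions, not part of the talk.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def run_and_record(command, log_path="runs.jsonl"):
    """Run a command and append a record of the run to a JSON Lines log."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "started_utc": started,
        "command": command,
        "returncode": result.returncode,
        "python": sys.version.split()[0],   # interpreter version in use
        "platform": platform.platform(),    # operating system description
    }
    # Append, so the log file becomes a running lab book.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    print(run_and_record([sys.executable, "-c", "print('hello')"]))
```

The resulting log is plain text, so it can itself be kept under version control alongside the code.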
Record Keeping, Versioning & Sharing
Adequate documentation & knowledge transfer:
• Shared docs for small projects
• Wiki
• documentation in code repository (README)
• notebook applications (Jupyter) or toolsets (e.g. RStudio & Knitr & Make & LaTeX & Git)
➡ records also need version control!
Versioning Data
Depends on the size and update behaviour
• Finalised data (keep in VCS or store on web object stores if allowed)
• Datasets with discrete updates (usually snapshots with DB tools or also VCS work well)
• Continuously updated/appended data (i.e. timeline data)
• DB versioning or full snapshots
• make sure to annotate events and changes to the pipeline or other tools
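For datasets with discrete updates, a lightweight alternative to storing full copies is an annotated manifest of content hashes per snapshot. The following is a minimal sketch under that assumption; `snapshot_manifest`, `changed_files`, and the manifest layout are hypothetical names chosen for illustration.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_manifest(data_dir, note):
    """Describe the current state of a data directory, with an annotation."""
    data_dir = Path(data_dir)
    files = {
        p.relative_to(data_dir).as_posix(): file_sha256(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "note": note,  # annotate events and pipeline changes here
        "files": files,
    }


def changed_files(old, new):
    """List files added, removed, or modified between two manifests."""
    names = set(old["files"]) | set(new["files"])
    return sorted(n for n in names if old["files"].get(n) != new["files"].get(n))
```

A manifest is small enough to commit to a VCS even when the data itself lives elsewhere, and the `note` field gives a place to record why a snapshot was taken.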
Sharing Code/Tools
How to share and make code/tools reusable
• Source Code, e.g. on GitHub, SourceForge, University sites, web spaces (Dropbox, S3)
• Executables, Bytecode
• virtualisation and containers (e.g. VM, Docker)
• Web Services (e.g. shared models)
Sharing Data
Many open questions: • long term persistence? • cost / who pays? • privacy / copyright?
be pragmatic but aware
Thank you