Data as code - Research Software Engineers Association
Data as code: Data management for reproducible research
08/09/2017
Martin O’Reilly, Principal Research Software Engineer
The Alan Turing Institute
@martinoreilly | @turinginst
The Alan Turing Institute is the national centre for data science, headquartered at the British Library.
Turing Research Engineering
• Radka Jersakova
• May Yong
• Tim Hobson
• James Geddes
• James Hetherington
Turing Research Fellows
• Kirstie Whitaker
• Tomas Petricek
Data management for reproducible research
FAIR Data Principles
Source: FORCE11 website. https://www.force11.org/group/fairgroup/fairprinciples. Accessed on 07 Sep 2017
• Findable
• Accessible
• Interoperable
• Re-usable
Code management for reproducible research
• How do I get your code?
• Online repositories and persistent archives with versioning support
• How do I use your code?
• Documentation, examples, packages, virtual machines, containers
• How do I trust your code?
• Tests, examples, readable code
• How do I build on your code?
• Documentation, readable code, tests
• What am I allowed to do with your code?
• Licence
Data management for reproducible research
• How do I get your data?
• Online repositories with versioning and APIs for data access
• How do I use your data?
• Documentation, metadata, common data formats, data packages
• How do I trust your data?
• Record of provenance and processing, versioning
• How do I build on your data?
• Record of provenance and processing, compatible content, linkable to other data
• What am I allowed to do with your data?
• Licences, terms of use, data access agreements, ethics
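As a rough illustration of how documentation, a licence and a record of processing could travel with a data file, here is a minimal sketch of a machine-readable descriptor. The function name `describe_dataset`, the field names and the toy CSV content are all assumptions made for this example; they are not taken from the talk or from any particular standard.

```python
import hashlib
import json


def describe_dataset(name, content, licence, steps):
    """Build a small machine-readable descriptor for one data file.

    The field names are illustrative only, not part of any standard.
    """
    return {
        "name": name,
        # A checksum lets a user verify that they hold the same bytes.
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
        "licence": licence,
        # Ordered record of how the data was produced.
        "processing": list(steps),
    }


# Toy content, invented for this example.
raw = b"country,year,value\nGB,2016,1\n"
descriptor = describe_dataset(
    name="example.csv",
    content=raw,
    licence="CC-BY-4.0",
    steps=["downloaded from source", "dropped rows with missing values"],
)
print(json.dumps(descriptor, indent=2))
```

Keeping the descriptor next to the data means the questions about trust and permitted use can be answered by a program as well as by a person.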
Good examples
UN Comtrade database
Web API for programmatic access
Can apply current and historical classification codes to entire dataset
Can select subset of data to retrieve along multiple dimensions
Source: Screenshot of UN Comtrade database website. https://comtrade.un.org/data. Accessed on 06 Sep 2017
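To make "select subset of data to retrieve along multiple dimensions" concrete, the sketch below builds a query URL for a web API. The endpoint `https://example.org/api/trade` and every parameter name are hypothetical placeholders, not the real UN Comtrade API; the service's own documentation would give the actual ones.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://example.org/api/trade"


def build_query(reporter, partner, period, commodity, classification="current"):
    """Return a URL selecting one slice of the data along several dimensions."""
    params = {
        "reporter": reporter,
        "partner": partner,
        "period": period,
        "commodity": commodity,
        "classification": classification,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_query(reporter="GB", partner="FR", period=2016, commodity="TOTAL")
print(url)
# Parse the query string back to check what would be requested.
print(parse_qs(urlsplit(url).query))
```

Because the selection is expressed as parameters, the same script can be rerun later to retrieve exactly the same slice.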
UN Comtrade database
Third-party R package available for querying web API
Source: Screenshot from Comtradr R package GitHub README.md. https://github.com/ChrisMuir/comtradr. Accessed on 06 Sep 2017
ConnectomeDB
Source: Screenshot of ConnectomeDB login page. https://db.humanconnectome.org. Accessed on 06 Sep 2017
Website requires registration and login
ConnectomeDB
One-time click for acceptance of terms
Generate dedicated Amazon AWS access credentials
Source: Screenshot of ConnectomeDB main page. https://db.humanconnectome.org. Accessed on 06 Sep 2017
The Gamma
Dot-driven development
• Intellisense autocomplete for data exploration
• Interactive dynamic data preview
• Uses F# type providers
• For more details, see http://tomasp.net/academic/papers/pivot/
Source: The Gamma homepage. https://thegamma.net/. Accessed on 06 Sep 2017
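The Gamma itself relies on F# type providers. The sketch below is only a loose Python analogue of the dot-driven idea, assuming a nested dictionary as the data source: each level of the data becomes an object whose members can be reached with a dot and listed for autocompletion. The class name `DataNode` and the toy data are invented for this example.

```python
class DataNode:
    """Wrap a nested dict so that its members are reachable with dot access."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            # Avoid infinite recursion if _data itself is missing.
            raise AttributeError(name)
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name) from None
        return DataNode(value) if isinstance(value, dict) else value

    def __dir__(self):
        # Interactive shells call dir() to offer completions after the dot.
        return sorted(self._data)


# Toy data, invented for this example.
spending = DataNode(
    {"by_service": {"health": 10, "defence": 5}, "by_year": {"y2016": 15}}
)
print(dir(spending))
print(spending.by_service.health)
```

Unlike a type provider, this checks nothing until the code runs; it only shows how exploring data can feel like exploring an object.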
The Gamma
Source: UK National Statistics Public Expenditure Statistical Analyses 2016. Chapter 5 table 5.2. https://www.gov.uk/government/statistics/public-expenditure-statistical-analyses-2016/. Accessed on 06 Sep 2017
Subtotals indicated by background colour
Sub-sub categories indicated by text formatting
Sub categories indicated by initial numerals
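Structure that is carried only by formatting has to be recovered before the table can be used programmatically. As a sketch of one such step, the code below turns leading numerals into an explicit code and depth. The label format ("1." for a category, "1.1" for a sub category) and the row labels are assumptions invented for this example, not copied from the published table.

```python
import re

# Invented row labels that mimic numbering such as "1." and "1.1".
rows = ["1. Category A", "1.1 Subcategory A1", "1.2 Subcategory A2", "2. Category B"]

# Leading dotted numerals, an optional trailing dot, then the name.
PATTERN = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)$")


def parse_label(label):
    """Split a label into (code, depth, name) using its leading numerals."""
    match = PATTERN.match(label)
    if match is None:
        return None, 0, label  # no numeral prefix: leave the label untouched
    code, name = match.groups()
    return code, code.count(".") + 1, name


for row in rows:
    print(parse_label(row))
```

Background colour and text formatting cannot be recovered this way from plain text, which is part of why such tables are hard to reuse.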
The Gamma
Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017
The Gamma
Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017
Dream data
My wish list
• Repository supporting versioning and content-aware sub-setting
• Data includes raw and processed data, with code to replicate processing
• Content-aware, on-demand differential download
• Automatable access to data requiring an access agreement / authentication
• Data accessible as native code objects
• Documentation accessible in context of data presentation
• Standard, machine-readable licences
• Repository tracks download / usage stats
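One possible reading of "on-demand differential download" is sketched below, assuming the repository publishes a manifest that maps file names to checksums. This version works at the level of whole files rather than their content, and the function names and manifests are hypothetical.

```python
import hashlib


def checksum(content):
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(content).hexdigest()


def files_to_fetch(local_manifest, remote_manifest):
    """Return names that are new or changed remotely, sorted for stable output."""
    return sorted(
        name
        for name, digest in remote_manifest.items()
        if local_manifest.get(name) != digest
    )


# Toy manifests mapping file name -> checksum; contents invented.
local = {"a.csv": checksum(b"1,2\n"), "b.csv": checksum(b"3,4\n")}
remote = {
    "a.csv": checksum(b"1,2\n"),
    "b.csv": checksum(b"3,5\n"),
    "c.csv": checksum(b"6\n"),
}
print(files_to_fetch(local, remote))
```

A content-aware version would compare records within each file, so that a small change to a large dataset needs only a small download.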
Interesting tools
Repositories
• Figshare, Zenodo, Dataverse, DataONE, Dryad
Data access
• Repository APIs, rOpenSci, SPARQL
Data formats
• RDF, OWL, Research object bundles, BagIt, Frictionless data
Differencing data
• Daff (tables), data-diff (JSON), data-diff (Python)
Provenance / processing record
• Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)
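To give a feel for what differencing tabular data involves, here is a minimal sketch that compares two versions of a table by matching rows on a key column. It is not how Daff or any other listed tool works internally; the function `diff_tables` and the sample rows are invented for illustration.

```python
def diff_tables(old_rows, new_rows, key):
    """Compare two lists of dict rows, matching rows on the `key` column."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": [new[k] for k in new if k not in old],
        "removed": [old[k] for k in old if k not in new],
        # Pairs of (old row, new row) for keys present in both versions.
        "changed": [(old[k], new[k]) for k in old if k in new and old[k] != new[k]],
    }


# Toy table versions, invented for this example.
before = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
after = [{"id": 1, "value": "a"}, {"id": 2, "value": "c"}, {"id": 3, "value": "d"}]
result = diff_tables(before, after, key="id")
print(result)
```

Matching on a key rather than on line position is what makes the comparison robust to rows being reordered.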
turing.ac.uk | @turinginst
[email protected] | @martinoreilly