The lifecycle of reproducible science data and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by: Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and the DataONE CyberInfrastructure group; Rawaa Qasha at Newcastle University; Carole Goble at the University of Manchester
P. Missier, ATI Symposium on Reproducibility, Oxford, April 6th, 2016
(Yet another) Data Lifecycle picture
[Figure: data lifecycle diagram spanning a Research Environment and a Publication Environment. Labels: compute in Env over <D, P, dep, spec(P), prov(D)>; Reproducibility: working, reporting; submit article and move on…; Peer Review; publish article; package, publish; search, discover; D → D1, P → P’, dep → dep’; spec(P’); Deploy P’ in Env(dep’); D’ (?); prov(D’); Compare(P, P’, D, D’); spec(P); prov(D).]
Re-what?
Re-*
ReRun: vary experiment and setup, same lab
P → P’, D → D’, dep → dep’
Repeat: same experiment, setup, lab
P, D, dep, env(dep)
Replicate: same experiment, setup, different lab
P, D, dep, env’(dep)
Reproduce: vary experiment and setup, different lab
P → P’, D → D’, dep → dep’, env(dep) → env’(dep’)
Reuse: different experiment
D, P Q
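The first four Re-* terms differ only in whether the experiment and setup vary and whether the lab changes, so they can be sketched as a small decision function. This is an illustrative reading of the slide, not part of the talk: the `Run` record and its field names are assumptions, `lab` stands in for env(dep), and Reuse (a different experiment altogether) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one execution (field names are assumptions)."""
    program: str  # P
    data: str     # D
    deps: str     # dep
    lab: str      # stands in for env(dep)


def classify(original: Run, new: Run) -> str:
    """Name the Re-* activity that relates a new run to the original one."""
    varied = (original.program, original.data, original.deps) != (
        new.program, new.data, new.deps)
    same_lab = original.lab == new.lab
    if not varied:
        return "Repeat" if same_lab else "Replicate"
    return "ReRun" if same_lab else "Reproduce"


base = Run(program="P", data="D", deps="dep", lab="Newcastle")
print(classify(base, Run("P", "D", "dep", "Oxford")))      # same setup, different lab
print(classify(base, Run("P2", "D", "dep", "Newcastle")))  # changed program, same lab
```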
Mapping the reproducibility space
Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results
Approach: compare provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.
Workflow evolution
Each of the elements in an execution may evolve (semi-)independently from the others.
Repeatability: can tr_t be computed again at some time t’ > t?
Requires saving ED_t, which may be impractical (e.g. large DB state)
Reproducibility
Can a new version tr_t’ of tr_t be computed at some later time t’ > t, after one or more of the elements has changed?
Potential issues:
• W_i may not run with the new ED_j’
• W_i may not run with wfms_k’
• W_i’ may not run with d_h’
• ...
(Yet another) Data Lifecycle picture
[Figure: the same data lifecycle diagram (search, discover; package, publish; spec(P), prov(D); D → D1, P → P’, dep → dep’; spec(P’); Deploy P’ in Env; compute in Env; D’ (?); prov(D’); Compare(P, P’, D, D’)), annotated with supporting tools: Research Objects; DataONE Federated Research Data Repositories; TOSCA-based virtualisation; Pdiff: differencing provenance; YesWorkflow; Workflow Provenance; NoWorkflow; Matlab provenance recorder (DataONE); ReproZip.]
You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences
Computational Workflow Runs
workflowrun.prov.ttl (RDF)
outputA.txt
outputC.jpg
outputB/
intermediates/
1.txt
2.txt
3.txt
de/def2e58b-50e2-4949-9980-fd310166621a.txt
inputA.txt
workflow attribution
execution environment
Aggregating in Research Object
ZIP folder structure (RO Bundle)
mimetype: application/vnd.wf4ever.robundle+zip
.ro/manifest.json
URI references
Exchange
Reproducibility: same data, same code
Systematic and extensible meta-data collection
Workflow Annotation Profile
Wf4Ever Project
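The ZIP layout named on the slide (a `mimetype` entry plus `.ro/manifest.json` next to the aggregated files) can be sketched with the Python standard library. This is a minimal sketch, not the Wf4Ever tooling: the function name is made up, and the manifest keys (`id`, `aggregates`, `uri`) follow my reading of the RO Bundle specification and should be checked against it before use.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_ro_bundle(files: dict[str, bytes]) -> bytes:
    """Return ZIP bytes laid out like the RO Bundle on the slide (sketch)."""
    # Assumed manifest shape: one aggregate entry per bundled file.
    manifest = {
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Write the mimetype entry first and uncompressed, so that the
        # bundle type can be read from the start of the archive.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for name, content in files.items():
            bundle.writestr(name, content)
    return buffer.getvalue()


bundle_bytes = build_ro_bundle({"inputA.txt": b"in", "outputA.txt": b"out"})
```

The manifest here only lists what is aggregated; annotations and provenance (for example `workflowrun.prov.ttl`) would be added as further entries.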
Manifests and Containers
Container
• Packaging: Zip files, Docker images, BagIt, …
• Catalogues & Commons Platforms: FAIRDOM SEEK, Farr Commons CKAN, STELAR eLab, myExperiment
Manifest
• Metadata: describes the aggregated resources, their annotations and their provenance
Manifest Metadata
Manifest Construction
• Identification – id, title, creator, status…
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest Description
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
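The construction items (identification, aggregates, annotations) and the checklist idea ("what should be there") can be illustrated with a small validation function. Everything below is a sketch under assumptions: the field names, the flat manifest shape and the `check_manifest` helper are hypothetical and are not taken from the Research Object specifications.

```python
# Hypothetical minimal field list, loosely following the slide.
REQUIRED_FIELDS = ("id", "title", "creator", "aggregates", "annotations")


def check_manifest(manifest: dict, expected_resources: set[str]) -> list[str]:
    """Return a list of checklist violations (empty when all checks pass)."""
    problems = [f"missing field: {field}"
                for field in REQUIRED_FIELDS if field not in manifest]
    aggregated = set(manifest.get("aggregates", []))
    # Checklist: every resource we expect must be aggregated.
    problems += [f"missing resource: {name}"
                 for name in sorted(expected_resources - aggregated)]
    # Annotations must talk about resources that are actually aggregated.
    annotated = {note["about"] for note in manifest.get("annotations", [])}
    problems += [f"annotation about unknown resource: {name}"
                 for name in sorted(annotated - aggregated)]
    return problems


example = {
    "id": "ro-1",
    "title": "Demo research object",
    "creator": "A. Researcher",
    "aggregates": ["inputA.txt", "outputA.txt"],
    "annotations": [{"about": "outputA.txt", "content": "final result"}],
}
print(check_manifest(example, {"inputA.txt", "outputA.txt"}))
```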
You are here: DataONE: Data packaging, publication, search and discovery, hosting
Components for a flexible, scalable, sustainable network
Cyberinfrastructure Component 2: Member Nodes
www.dataone.org/member-nodes
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Cyberinfrastructure
[Figure: DataONE cyberinfrastructure. Labels: Data Services (extraction, sub-setting, etc.); Provenance; Semantics-enabled Discovery (ontology, annotation); Search API; Metadata Index; Replicate; Data Holdings (System Metadata, Science Metadata, Science Data, Provenance).]
What input data went into this study?
What methods were used?
… with what parameter settings, calibrations, …?
Can we trust the data and methods?
Provenance (lineage): track the origin and processing history of data, to support trust and data quality; it also acts as an audit trail for attribution and credit
Discovery of data, methodologies, experiments
Use Provenance for Transparency, Reproducibility
W3C has published the ‘PROV’ standard
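W3C PROV model (see w3.org/TR/prov-o/)
[Figure: the three core PROV types, Entity, Activity and Agent, linked by four relations: Activity used Entity; Entity wasGeneratedBy Activity; Entity wasAttributedTo Agent; Activity wasAssociatedWith Agent.]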
Using a common model
Example: Scientific workflow
• “map image” wasGeneratedBy “R script Execution”
• “map image” wasAttributedTo “Scientist”
• “R script Execution” wasAssociatedWith “Scientist”
Using a common model
Example: Scientific workflow
• “map image” wasGeneratedBy “R script Execution”
• “map image” wasAttributedTo “Scientist”
• “R script Execution” wasAssociatedWith “Scientist”
• “R script Execution” used “CSV data”
• “map image” wasDerivedFrom “CSV data”
< “map image” wasDerivedFrom “CSV data” >
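The scientific workflow example can be written down as plain subject-relation-object statements, which makes the step from used and wasGeneratedBy to a derivation explicit. This is a sketch using bare tuples rather than any PROV library. Note the assumption: PROV itself does not license inferring wasDerivedFrom from every used/wasGeneratedBy pair, so the function only proposes candidate derivations.

```python
# The statements from the example, as (subject, relation, object) tuples.
statements = {
    ("map image", "wasGeneratedBy", "R script Execution"),
    ("map image", "wasAttributedTo", "Scientist"),
    ("R script Execution", "wasAssociatedWith", "Scientist"),
    ("R script Execution", "used", "CSV data"),
}


def candidate_derivations(stmts):
    """Propose wasDerivedFrom statements: output of an activity <- its inputs."""
    generated = [(entity, activity)
                 for entity, relation, activity in stmts
                 if relation == "wasGeneratedBy"]
    used = [(activity, entity)
            for activity, relation, entity in stmts
            if relation == "used"]
    return {
        (output, "wasDerivedFrom", source)
        for output, producer in generated
        for consumer, source in used
        if producer == consumer
    }


print(candidate_derivations(statements))
```

For the example, the only candidate is the derivation of “map image” from “CSV data”, matching the statement shown on the slide.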
ProvONE Motivation: Different Kinds of Provenance
• Prospective Provenance: method/workflow description (“workflow-land”)
• Retrospective Provenance: runtime provenance tracking (“trace-land”)
Better together!
ProvONE extends PROV for science!
[Figure labels: “Trace-Land”, “Workflow-Land”, “Data-Land”]
http://purl.dataone.org/provone-v1-dev
DataONE data packages: Provenance inside!
[Figure: a DataONE data package. Labels: resource map (OAI-ORE with ProvONE trace); science metadata; science data; figures; software; each item has its own system metadata.]
Provenance… of Figures
Provenance… of Data
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
YesWorkflow (YW): Scripts as prospective provenance
MATLAB, R, Python … scripts
Script + @YW annotation: workflow-land & trace-land
Combine provenance:
• Prospective (workflow)
• Retrospective (runtime trace)
• Reconstructed (logs, files, …)
User can query own data & provenance prior to sharing
Incentive: accelerate work!
“Provenance for Self”
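The @begin / @in / @out / @end comments in the annotated script shown earlier carry enough structure to recover a prospective description of the step. The following is a hand-rolled sketch of that idea, not the YesWorkflow tool or its API: the regular expression and the `parse_yw` helper are assumptions, and only the four tags that appear in the example are handled.

```python
import re

# Matches comment annotations such as "# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv".
ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)(?:\s+@as\s+(\S+))?")

SCRIPT = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
"""


def parse_yw(script: str) -> dict:
    """Collect the block name and its named inputs and outputs."""
    block = {"name": None, "in": {}, "out": {}}
    for line in script.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue  # ordinary code or comment line
        tag, name, alias = match.groups()
        if tag == "begin":
            block["name"] = name
        elif tag in ("in", "out"):
            # Fall back to the variable name when no @as alias is given.
            block[tag][name] = alias or name
    return block


workflow = parse_yw(SCRIPT)
```

Here `workflow["in"]` maps the script variables `hcdb` and `world` to their aliases, and `workflow["out"]` does the same for the two map images.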
When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack
Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
Transitive Credit
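Crediting "down the attribution stack" can be sketched as a recursive walk over per-product credit shares. This is a minimal sketch under stated assumptions: each product carries a map of fractional credit to people or to other products, the shares of a product sum to 1, and the dependency graph has no cycles. The product names, people and weights are invented for illustration and are not taken from the cited papers.

```python
def transitive_credit(product: str, credit_maps: dict) -> dict:
    """Push each product's credit shares down to the people behind it."""
    totals: dict[str, float] = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: distribute its share further.
            nested = transitive_credit(contributor, credit_maps)
            for person, share in nested.items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight
    return totals


# Hypothetical credit maps: a paper that builds on a dataset and a software package.
credit_maps = {
    "paper": {"alice": 0.5, "dataset": 0.25, "software": 0.25},
    "dataset": {"bob": 1.0},
    "software": {"carol": 0.5, "alice": 0.5},
}
print(transitive_credit("paper", credit_maps))
```

In this example Alice receives her direct share of the paper plus half of the software's share, while Bob and Carol are credited only through the products the paper depends on.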
Provenance today: Important but hard
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
data and “code” / method linked
alt formats
Yaxing’s script with inputs & output products
YesWorkflow model
Christopher using Yaxing’s outputs as inputs for his script
Christopher’s results can be traced back all the way to Yaxing’s input
Provenance in action
You are here: Process Virtualisation using TOSCA
TOSCA
• Topology and Orchestration Specification of Cloud Applications
Use Case: e-Science Central Workflow
http://www.esciencecentral.co.uk
TOSCA-based mapping of an e-SC Workflow
• Workflow components as Node Types
• Block dependencies as Relationship Types
e-SC Workflow Service Template
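The mapping above (workflow components as node types, block dependencies as relationship types) can be illustrated by building a service-template-like structure and deriving a deployment order from it. This is a sketch only, not the e-Science Central mapping itself: the block names, the `example.nodes.*` and `example.relationships.BlockLink` type names and both helper functions are hypothetical, and the dictionary merely mimics the `topology_template` / `node_templates` / `requirements` layout of a TOSCA YAML template.

```python
from graphlib import TopologicalSorter


def to_service_template(blocks: dict[str, str], links: list[tuple[str, str]]) -> dict:
    """Map workflow blocks to node templates and links to dependency requirements."""
    node_templates = {
        name: {"type": f"example.nodes.{kind}", "requirements": []}
        for name, kind in blocks.items()
    }
    for source, target in links:
        # The target block depends on the block that feeds it.
        node_templates[target]["requirements"].append({
            "dependency": {
                "node": source,
                "relationship": "example.relationships.BlockLink",
            }
        })
    return {"topology_template": {"node_templates": node_templates}}


def deployment_order(template: dict) -> list[str]:
    """Order node templates so that every block follows its dependencies."""
    nodes = template["topology_template"]["node_templates"]
    graph = {
        name: {req["dependency"]["node"] for req in spec["requirements"]}
        for name, spec in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


template = to_service_template(
    blocks={"import": "ImportFile", "filter": "FilterRows", "export": "ExportFile"},
    links=[("import", "filter"), ("filter", "export")],
)
print(deployment_order(template))
```

Keeping the dependencies inside the template means that an orchestrator, here stood in for by `deployment_order`, can work out the order in which components must be provisioned.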
You are here: Pdiff: provenance differencing for understanding workflow differences
Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS
Assumption: workflow WFj (new version) runs to completion, thus it produces a new provenance trace; however, it may be dysfunctional relative to WFi (the original)
Example: only input data changes: d != d’, WFj == WFi
Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
Provenance traces for two runs
[Figure: the provenance traces of the two runs, with “used” and “genBy” edges.]
Delta graphs
A delta graph is obtained as the result of a “diff” of two traces. It can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
This is the simplest possible delta “graph”!
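For the simple case where only the input data changes and the workflow stays the same, a delta graph can be sketched by walking both traces backwards from the diverging outputs and keeping only the pairs of data items whose values differ. This is a simplified illustration, not the PDIFF algorithm from the paper: the trace encoding (one record per data item with `value`, `activity` and `inputs`) and the service names are assumptions, and inputs are paired by position.

```python
def delta_graph(trace_a: dict, trace_b: dict, out_a: str, out_b: str) -> list:
    """Return delta edges ((data_a, data_b) -> (input_a, input_b)) for diverging data."""
    edges = []
    stack = [(out_a, out_b)]
    seen = set()
    while stack:
        a, b = stack.pop()
        if (a, b) in seen:
            continue
        seen.add((a, b))
        node_a, node_b = trace_a[a], trace_b[b]
        if node_a["value"] == node_b["value"]:
            continue  # identical data: nothing upstream needs explaining
        if node_a["activity"] != node_b["activity"]:
            continue  # generated by different services: this sketch stops here
        for in_a, in_b in zip(node_a["inputs"], node_b["inputs"]):
            if trace_a[in_a]["value"] != trace_b[in_b]["value"]:
                edges.append(((a, b), (in_a, in_b)))
                stack.append((in_a, in_b))
    return edges


# Two runs of the same two-step workflow (hypothetical services S1, S2) on d and d'.
run_1 = {
    "d": {"value": 1, "activity": None, "inputs": []},
    "x": {"value": 10, "activity": "S1", "inputs": ["d"]},
    "y": {"value": 100, "activity": "S2", "inputs": ["x"]},
}
run_2 = {
    "d'": {"value": 2, "activity": None, "inputs": []},
    "x'": {"value": 20, "activity": "S1", "inputs": ["d'"]},
    "y'": {"value": 200, "activity": "S2", "inputs": ["x'"]},
}
print(delta_graph(run_1, run_2, "y", "y'"))
```

The result is a chain from the output pair back to the input pair, which reads as: the outputs differ because the intermediate data differ, which in turn differ because the inputs d and d’ differ.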
More involved workflow differences
[Figure: two workflow versions, WA and WB; label: sv2.]
The corresponding traces
Delta graph computed by PDIFF
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014.
Pdiff: provenance differencing for understanding workflow differences
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.