the lifecycle of reproducible science data and what provenance has got to do with it

The lifecycle of reproducible science data and what provenance has got to do with it Paolo Missier School of Computing Science Newcastle University, UK Alan Turing Institute Symposium On Reproducibility for Data-Intensive Research Oxford, April 6, 2016 With material contributed by: Yang Cao, Bertram Ludascher, Tim McPhillips, Dave Vieglais, Matt Jones and the DataONE CyberInfrastructure group Rawaa Qasha at Newcastle University Carole Goble at the University of Manchester


Page 1: The lifecycle of reproducible science data and what provenance has got to do with it

The lifecycle of reproducible science data and what provenance has got to do with it

Paolo Missier
School of Computing Science
Newcastle University, UK

Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016

With material contributed by:
Yang Cao, Bertram Ludascher, Tim McPhillips, Dave Vieglais, Matt Jones and the DataONE CyberInfrastructure group
Rawaa Qasha at Newcastle University
Carole Goble at the University of Manchester

Page 2: The lifecycle of reproducible science data and what provenance has got to do with it

P. Missier, ATI Symposium on Reproducibility, Oxford, April 6th, 2016

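(Yet another) Data Lifecycle picture

[Figure: data lifecycle diagram. Stages: search / discover, package / publish, deploy, compute, compare. Labels: published package <D, P, dep, spec(P), prov(D)>; D, D1, D’; P, P’; dep, dep’; spec(P), spec(P’); Deploy P’; Env, Env(dep’); prov(D), prov(D’); Compare(P, P’, D, D’).]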

Page 3: The lifecycle of reproducible science data and what provenance has got to do with it

Reproducibility: working. reporting

submit article and move on…

publish article

Research Environment

Publication Environment

Peer Review

Page 4: The lifecycle of reproducible science data and what provenance has got to do with it


Re-what? Re-*

• ReRun: vary experiment and setup, same lab (P → P’, D → D’, dep → dep’)
• Repeat: same experiment, setup, lab (P, D, dep, env(dep))
• Replicate: same experiment, setup, different lab (P, D, dep, env’(dep))
• Reproduce: vary experiment and setup, different lab (P → P’, D → D’, dep → dep’, env(dep) → env’(dep’))
• Reuse: different experiment (D, P → Q)
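The Re-* taxonomy above can be read as a decision over what changed between two runs. The following is a minimal sketch under that reading, not something taken from the talk: the `Run` record and the `classify` function are hypothetical names introduced for illustration, and Reuse is left out because it concerns a different experiment altogether.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One execution, described by the four elements used on the slide."""
    process: str  # P: the experiment / program
    data: str     # D: the input data
    deps: str     # dep: the dependencies
    lab: str      # env: the lab / environment where the run happened


def classify(original: Run, new: Run) -> str:
    """Name the Re-* relationship between two runs (Reuse is not covered)."""
    same_setup = (original.process, original.data, original.deps) == (
        new.process, new.data, new.deps)
    same_lab = original.lab == new.lab
    if same_setup:
        return "Repeat" if same_lab else "Replicate"
    return "ReRun" if same_lab else "Reproduce"


base = Run("P", "D", "dep", "lab-A")
for other in (
    Run("P", "D", "dep", "lab-A"),
    Run("P", "D", "dep", "lab-B"),
    Run("P'", "D'", "dep'", "lab-A"),
    Run("P'", "D'", "dep'", "lab-B"),
):
    print(other, "->", classify(base, other))
```

The sketch treats any change to P, D or dep as "vary experiment and setup"; a finer-grained version could report which of the three changed.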

Page 5: The lifecycle of reproducible science data and what provenance has got to do with it


Mapping the reproducibility space

Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results.

Approach: compare provenance traces generated during the runs: PDIFF

P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.

Page 6: The lifecycle of reproducible science data and what provenance has got to do with it


Workflow evolution

Each of the elements in an execution may evolve (semi-)independently of the others.

Repeatability: can tr_t be computed again at some time t’ > t? This requires saving ED_t, which may be impractical (e.g. a large DB state).

Page 7: The lifecycle of reproducible science data and what provenance has got to do with it


Reproducibility

Can a new version tr_{t’} of tr_t be computed at some later time t’ > t, after one or more of the elements has changed?

Potential issues:
• W_i may not run with new ED_j’
• W_i may not run with wfms_k’
• W_i’ may not run with d_h’
• ...

Page 8: The lifecycle of reproducible science data and what provenance has got to do with it


(Yet another) Data Lifecycle picture

[Figure: the data lifecycle diagram from Page 2, annotated with the tools and technologies that support its stages:]
• Research Objects
• DataONE federated research data repositories
• Matlab provenance recorder (DataONE)
• TOSCA-based virtualisation
• Pdiff: differencing provenance
• YesWorkflow
• Workflow provenance
• NoWorkflow
• ReproZip

Page 9: The lifecycle of reproducible science data and what provenance has got to do with it


You are here

• Data packaging: Research Objects
• DataONE: data packaging, publication, search and discovery, hosting
  • R provenance recorder
  • Process-as-a-dataflow: YesWorkflow
• Process virtualisation using TOSCA
• Provenance recorders
  • Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails…
  • NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences

Page 10: The lifecycle of reproducible science data and what provenance has got to do with it

Computational Workflow Runs: Aggregating in Research Object

ZIP folder structure (RO Bundle):
• mimetype: application/vnd.wf4ever.robundle+zip
• .ro/manifest.json
• workflowrun.prov.ttl (RDF)
• inputA.txt
• outputA.txt
• outputB/
• outputC.jpg
• intermediates/
• 1.txt, 2.txt, 3.txt
• de/def2e58b-50e2-4949-9980-fd310166621a.txt
• URI references

Also aggregated: workflow, attribution, execution environment.

• Exchange
• Reproducibility: same data, same code
• Systematic and extensible metadata collection
• Workflow Annotation Profile

Wf4Ever Project

Page 11: The lifecycle of reproducible science data and what provenance has got to do with it


Manifests and Containers

Container
• Packaging: Zip files, Docker images, BagIt, …
• Catalogues & Commons Platforms: FAIRDOM SEEK, Farr Commons CKAN, STELAR eLab, myExperiment

Manifest
• Metadata: describes the aggregated resources, their annotations and their provenance

Page 12: The lifecycle of reproducible science data and what provenance has got to do with it


Manifest Metadata

Manifest Construction
• Identification – id, title, creator, status…
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources

Manifest Description
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
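To make the container-plus-manifest idea concrete, here is a minimal sketch that writes a ZIP laid out like the RO Bundle described on the previous slides: a `mimetype` entry, a `.ro/manifest.json`, and the aggregated files. The function name `build_bundle` is hypothetical, and the manifest field names simply follow the wording of the slide (id, title, creator, aggregates, annotations); they have not been checked against the RO Bundle specification, so treat the output as illustrative rather than conformant.

```python
import io
import json
import zipfile

RO_MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(files, title, creator):
    """Return the bytes of a ZIP laid out like an RO Bundle.

    `files` maps archive paths to byte content. Field names in the
    manifest are assumptions taken from the slide text.
    """
    manifest = {
        "id": "/",
        "title": title,
        "creator": creator,
        "aggregates": [{"uri": "/" + path} for path in sorted(files)],
        "annotations": [],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Assumed convention: the mimetype entry comes first, uncompressed.
        bundle.writestr("mimetype", RO_MIMETYPE,
                        compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path in sorted(files):
            bundle.writestr(path, files[path])
    return buffer.getvalue()


data = build_bundle(
    {"inputA.txt": b"input", "outputA.txt": b"output"},
    title="Example workflow run",   # hypothetical values
    creator="A. Researcher",
)
with zipfile.ZipFile(io.BytesIO(data)) as bundle:
    print(bundle.namelist())
```

Keeping the manifest as a plain dictionary makes it easy to extend with the description items listed above (checklists, provenance, versioning, dependencies) as further annotations.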

Page 13: The lifecycle of reproducible science data and what provenance has got to do with it


You are here

• Data packaging: Research Objects
• DataONE: data packaging, publication, search and discovery, hosting
  • R provenance recorder
  • Process-as-a-dataflow: YesWorkflow
• Process virtualisation using TOSCA
• Provenance recorders
  • Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails…
  • NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences

Page 14: The lifecycle of reproducible science data and what provenance has got to do with it

Components for a flexible, scalable, sustainable network

Cyberinfrastructure Component 2
Member Nodes

www.dataone.org/member-nodes

Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services

Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data

Page 15: The lifecycle of reproducible science data and what provenance has got to do with it

Cyberinfrastructure

[Figure: DataONE cyberinfrastructure. Labels: Data Services (extraction, sub-setting, etc.); Provenance; Semantics-enabled Discovery (ontology, annotation); System Metadata; Science Data; Science Metadata; Search API; Replicate; Metadata Index.]

Page 16: The lifecycle of reproducible science data and what provenance has got to do with it

Data Holdings


Page 17: The lifecycle of reproducible science data and what provenance has got to do with it

Use Provenance for Transparency, Reproducibility

• What input data went into this study?
• What methods were used?
• … with what parameter settings, calibrations, …?
• Can we trust the data and methods?

Provenance (lineage): track the origin and processing history of data. It supports:
• trust and data quality
• an audit trail for attribution and credit
• discovery of data, methodologies, experiments

Page 18: The lifecycle of reproducible science data and what provenance has got to do with it

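W3C PROV model

W3C has published the ‘PROV’ standard. See w3.org/TR/prov-o/

[Figure: PROV core model. Nodes: Entity, Activity, Agent. Relations: Activity used Entity; Entity wasGeneratedBy Activity; Activity wasAssociatedWith Agent; Entity wasAttributedTo Agent.]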

Page 19: The lifecycle of reproducible science data and what provenance has got to do with it

Using a common model

Example: Scientific workflow

[Figure: PROV graph. “map image” wasGeneratedBy “R script Execution”; “R script Execution” wasAssociatedWith “Scientist”; “map image” wasAttributedTo “Scientist”.]

Page 20: The lifecycle of reproducible science data and what provenance has got to do with it

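Using a common model

Example: Scientific workflow

[Figure: PROV graph. “map image” wasGeneratedBy “R script Execution”; “R script Execution” used “CSV data”; “R script Execution” wasAssociatedWith “Scientist”; “map image” wasAttributedTo “Scientist”; “map image” wasDerivedFrom “CSV data”.]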

Page 21: The lifecycle of reproducible science data and what provenance has got to do with it

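Using a common model

Example: Scientific workflow

[Figure: PROV graph. “map image” wasGeneratedBy “R script Execution”; “R script Execution” used “CSV data”; “R script Execution” wasAssociatedWith “Scientist”; “map image” wasAttributedTo “Scientist”; “map image” wasDerivedFrom “CSV data”.]

< “map image” wasDerivedFrom “CSV data” >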
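The example graph can be written down as a handful of PROV-style statements. The sketch below stores them as plain (subject, relation, object) triples and looks up which activity connects a derived entity to its source. It is only an illustration: the helper `supporting_activities` is a hypothetical name, the triples are a simplification of the PROV data model, and the lookup is a consistency check on asserted statements rather than an inference rule defined by PROV.

```python
# Each statement from the example as a (subject, relation, object) triple.
statements = {
    ("map image", "wasGeneratedBy", "R script Execution"),
    ("R script Execution", "used", "CSV data"),
    ("R script Execution", "wasAssociatedWith", "Scientist"),
    ("map image", "wasAttributedTo", "Scientist"),
    ("map image", "wasDerivedFrom", "CSV data"),
}


def supporting_activities(derived, source, stmts):
    """Activities that both generated `derived` and used `source`."""
    generators = {o for s, r, o in stmts
                  if s == derived and r == "wasGeneratedBy"}
    users = {s for s, r, o in stmts if r == "used" and o == source}
    return generators & users


# Which activity backs the asserted derivation of the map from the CSV?
print(supporting_activities("map image", "CSV data", statements))
```

A real deployment would more likely serialise these statements in one of the PROV formats and query them with an RDF or PROV library; plain tuples are used here only to keep the sketch self-contained.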

Page 22: The lifecycle of reproducible science data and what provenance has got to do with it

ProvONE Motivation: Different Kinds of Provenance

• Prospective provenance: method/workflow description (“workflow-land”)
• Retrospective provenance: runtime provenance tracking (“trace-land”)

Better together!

Page 23: The lifecycle of reproducible science data and what provenance has got to do with it

ProvONE extends PROV for science!

“Trace-Land”

“Workflow-Land”

“Data-Land”

http://purl.dataone.org/provone-v1-dev

Page 24: The lifecycle of reproducible science data and what provenance has got to do with it

DataONE data packages: Provenance inside!

[Figure: a DataONE data package. A resource map (OAI-ORE with ProvONE trace) aggregates science metadata, science data, figures and software, each accompanied by its own system metadata.]

Page 25: The lifecycle of reproducible science data and what provenance has got to do with it


Provenance… of Figures

Page 26: The lifecycle of reproducible science data and what provenance has got to do with it


Provenance… of Data

Page 27: The lifecycle of reproducible science data and what provenance has got to do with it

YesWorkflow (YW): Scripts as prospective provenance

# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
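The annotations are ordinary comments, so a script's declared inputs and outputs can be recovered without running it. The sketch below is a toy parser, not the YesWorkflow tool: it handles only the four tags shown on the slide (`@begin`, `@end`, `@in`, `@out`, with an optional `@as` alias), and the names `TAG` and `parse_block` are hypothetical.

```python
import re

SCRIPT = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# ... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
"""

# One tag per comment line: "# @tag name" optionally followed by "@as alias".
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)(?:\s+@as\s+(\S+))?")


def parse_block(script):
    """Collect the block name and its declared inputs and outputs."""
    block = {"name": None, "in": {}, "out": {}}
    for line in script.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue  # ordinary code or comment, not an annotation
        tag, name, alias = match.groups()
        if tag == "begin":
            block["name"] = name
        elif tag in ("in", "out"):
            block[tag][name] = alias or name
    return block


block = parse_block(SCRIPT)
print(block["name"])
print("inputs: ", block["in"])
print("outputs:", block["out"])
```

The resulting dictionary is a small piece of prospective provenance: it says what the script is declared to consume and produce, independently of any particular run.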

Page 28: The lifecycle of reproducible science data and what provenance has got to do with it

YesWorkflow (YW): Scripts as prospective provenance

• MATLAB, R, Python … scripts
• Script + @YW-annotation → workflow-land & trace-land
• Combine provenance: prospective (workflow), retrospective (runtime trace), reconstructed (logs, files, …)
• User can query own data & provenance prior to sharing
• Incentive: accelerate work! (“Provenance for Self”)

Page 29: The lifecycle of reproducible science data and what provenance has got to do with it

Transitive Credit

When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack

Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117

Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
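One way to picture "credit down the attribution stack" is to give every product a credit map whose shares sum to 1 and to multiply shares along the chain from a cited publication down to people. The sketch below follows that general idea; the function name, the data layout and all the numbers are made up for illustration, and it assumes the credit maps contain no cycles.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate each person's share of `product`, following products
    that are themselves listed as contributors.

    `credit_maps` maps a product to {contributor: share}. A contributor
    that is also a key of `credit_maps` is treated as a product.
    """
    totals = {} if totals is None else totals
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # Another product: pass the weighted share further down.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            # A person: add the weighted share to their total.
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps for a paper, the dataset and the software it used.
credit_maps = {
    "paper": {"author": 0.5, "dataset": 0.25, "software": 0.25},
    "dataset": {"curator": 1.0},
    "software": {"developer": 0.5, "author": 0.5},
}

print(transitive_credit("paper", credit_maps))
```

With these numbers the author receives credit both directly and through the software, which is the kind of indirect attribution a citation alone would not reveal.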

Page 30: The lifecycle of reproducible science data and what provenance has got to do with it

Provenance today: Important but hard

“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”


Page 31: The lifecycle of reproducible science data and what provenance has got to do with it

Provenance today: Important but hard


data and “code” / method linked

alt formats

Page 32: The lifecycle of reproducible science data and what provenance has got to do with it

Provenance in action

• Yaxing’s script with inputs & output products (YesWorkflow model)
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
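Tracing a result back across two people's scripts amounts to following derivation links upstream until no more are found. A minimal sketch, assuming the links are available as a dictionary from each product to the items it was derived from; the item names and the `lineage` function are hypothetical.

```python
# Derivation links: product -> the items it was derived from.
derived_from = {
    "christopher_result": ["yaxing_output"],
    "yaxing_output": ["yaxing_input"],
}


def lineage(item, edges):
    """All upstream items reachable from `item`, nearest first."""
    seen = []
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


print(lineage("christopher_result", derived_from))
```

The `seen` list also guards against revisiting an item when two derivation paths share an ancestor.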

Page 33: The lifecycle of reproducible science data and what provenance has got to do with it


You are here

• Data packaging: Research Objects
• DataONE: data packaging, publication, search and discovery, hosting
  • R provenance recorder
  • Process-as-a-dataflow: YesWorkflow
• Process virtualisation using TOSCA
• Provenance recorders
  • Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails…
  • NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences

Page 34: The lifecycle of reproducible science data and what provenance has got to do with it


TOSCA

• Topology and Orchestration Specification of Cloud Applications

Page 35: The lifecycle of reproducible science data and what provenance has got to do with it

Use Case: e-Science Central Workflow


http://www.esciencecentral.co.uk

Page 36: The lifecycle of reproducible science data and what provenance has got to do with it

TOSCA-based mapping of an e-SC Workflow


• Workflow components as Node Types

• Block dependencies as Relationship Types
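Once workflow blocks are treated as nodes and block dependencies as relationships, an order in which the blocks can be deployed follows from a topological sort. The sketch below shows only that ordering step using Python's standard `graphlib` module; it does not produce or read TOSCA documents, and the block names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow blocks (node types), each mapped to the blocks
# it depends on (relationship types).
depends_on = {
    "ImportFile": [],
    "FilterRows": ["ImportFile"],
    "ExportFile": ["FilterRows"],
}

# static_order() yields every block after all of its dependencies.
deployment_order = list(TopologicalSorter(depends_on).static_order())
print(deployment_order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies are circular, which is a useful early check before any deployment is attempted.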

Page 37: The lifecycle of reproducible science data and what provenance has got to do with it

e-SC Workflow Service Template


Page 38: The lifecycle of reproducible science data and what provenance has got to do with it


You are here

• Data packaging: Research Objects
• DataONE: data packaging, publication, search and discovery, hosting
  • R provenance recorder
  • Process-as-a-dataflow: YesWorkflow
• Process virtualisation using TOSCA
• Provenance recorders
  • Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails…
  • NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences

Page 39: The lifecycle of reproducible science data and what provenance has got to do with it


Data divergence analysis using provenance

• All work done with reference to the e-Science Central WFMS
• Assumption: workflow WFj (new version) runs to completion
  • thus it produces a new provenance trace
  • however, it may be dysfunctional relative to WFi (the original)

Example: only input data changes: d != d’, WFj == WFi

Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.

Page 40: The lifecycle of reproducible science data and what provenance has got to do with it


Provenance traces for two runs

[Figure: provenance traces for the two runs, with edges labelled “used” and “genBy”.]

Page 41: The lifecycle of reproducible science data and what provenance has got to do with it


Delta graphs

A delta graph is obtained as the result of a “diff” of two traces. It can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.

This is the simplest possible delta “graph”!
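For the simplest case, where the two runs share the same workflow structure and only data values differ, the idea of a delta can be sketched as a walk upstream from a differing output that stops wherever the two runs agree. This is an illustration of the idea only and not the PDIFF algorithm from the cited paper; the trace layout, the values and the `delta` function are assumptions.

```python
def delta(node, trace, values_a, values_b, found=None):
    """Items at or upstream of `node` whose values differ between runs.

    `trace` maps each item to the items it was computed from; it is
    shared by both runs because the workflow itself is unchanged.
    """
    found = [] if found is None else found
    if node in found or values_a[node] == values_b[node]:
        return found  # already visited, or identical in both runs
    found.append(node)
    for parent in trace.get(node, []):
        delta(parent, trace, values_a, values_b, found)
    return found


# Hypothetical trace: "out" is computed from "mid", "mid" from "d" and "ref".
trace = {"out": ["mid"], "mid": ["d", "ref"]}
run_a = {"d": 1, "ref": 7, "mid": 8, "out": 16}
run_b = {"d": 2, "ref": 7, "mid": 9, "out": 18}

print(delta("out", trace, run_a, run_b))
```

The walk reports the chain from the output back to the changed input and leaves out `ref`, whose value is the same in both runs, which is the kind of explanation a delta graph is meant to give.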

Page 42: The lifecycle of reproducible science data and what provenance has got to do with it


More involved workflow differences

[Figure: two workflow versions, WA and WB; label: sv2.]

Page 43: The lifecycle of reproducible science data and what provenance has got to do with it


The corresponding traces


Page 44: The lifecycle of reproducible science data and what provenance has got to do with it


Delta graph computed by PDIFF


Page 45: The lifecycle of reproducible science data and what provenance has got to do with it


References

Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.

DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332.

Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146.

NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014.

Pdiff: provenance differencing for understanding workflow differences
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.