Data Trajectories: tracking the reuse of published data for transitive credit attribution

P. Missier IDCC ‘16 – Feb. 2016 Data Trajectories: tracking reuse of published data for transitive credit attribution Paolo Missier [email protected] School of Computing Science Newcastle University, UK IDCC’16 Amsterdam, Feb 24, 2016


Page 1

Data Trajectories: tracking reuse of published data for transitive credit attribution

Paolo Missier

School of Computing Science, Newcastle University, UK

IDCC’16, Amsterdam, Feb 24, 2016

Page 2

A crowded space in Open Research Data (Repositories)

Page 3

Data publication and reuse: a potential virtuous cycle

[Figure: cycle of Publication, Reuse, Tracking, and Partial credit]

Article “reuse” == Article citation
• Easy, but limited semantics

Data reuse is more interesting / complicated:

• Data derivation can take many forms
• Multiple programs, information systems
• Multiple generations

1. What happens to published datasets after their publication?
2. Can we follow their trajectory through transformations?
3. Can we use this knowledge to quantify credit to data contributors?

Measuring data impact (see e.g. [1])

[1] Alex Ball, Monica Duke (2015). ‘How to Track the Impact of Research Data with Metrics’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides

Page 4

Data publication & reuse: a hypothetical scenario

Who gets credit for what?
How much credit should Alice, Bob, Charlie receive?

RO = “Research Object”

[Figure: reuse scenario involving Alice (1), Bob (2), and Charlie (3), research objects RO1 to RO5, and elements labelled DR1 to DR3, P1, and P2]

Page 5

Recording reuse chains

Sequence of derivations viewed as a provenance graph
• W3C PROV compliant

Page 6

Assignment and transitive propagation of credit

Assuming this graph can be constructed:

Inductive definition of credit:

1. External credit:
• Can be assigned to any ROx in the graph at any time
• How? Don’t care: any (community-based) mechanism is ok

2. Transitively propagated partial credit:
• If ROy is reachable from ROx in the graph, then ROy should receive a portion of the credit given to ROx

Page 7

Data trajectories

The trajectory DT(RO) of RO contains all RO’ on which RO has had an impact

For each RO, its credit is defined by induction on its trajectory graph:

External credit

Transitive credit
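
The defining formula on this slide was an image and is missing from the transcript. As a minimal sketch of the idea, assume derivations are stored as (derived, source) pairs, the graph is acyclic, and each direct reuse passes a fixed fraction of the derived object's credit back to its source. The fraction of 0.5, the example external credit values, and all function names are illustrative assumptions, not the author's transfer function.

```python
from collections import defaultdict

def trajectory(ro, derived_from):
    """Return DT(ro): every RO' derived, directly or transitively, from ro."""
    children = defaultdict(set)
    for derived, source in derived_from:
        children[source].add(derived)
    seen, stack = set(), [ro]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def credit(ro, derived_from, external, fraction=0.5):
    """cr(ro) = external credit + fraction * cr(RO') for each direct derivation RO'."""
    direct = [d for d, s in derived_from if s == ro]
    return external.get(ro, 0.0) + sum(
        fraction * credit(d, derived_from, external, fraction) for d in direct
    )

# Edges (derived, source) from the Alice/Bob/Charlie scenario:
# RO2 and RO3 derive from RO1; RO4 derives from RO1 and RO3.
edges = [("RO2", "RO1"), ("RO3", "RO1"), ("RO4", "RO1"), ("RO4", "RO3")]
external = {"RO1": 1.0, "RO2": 1.0, "RO3": 1.0, "RO4": 2.0}

print(sorted(trajectory("RO1", edges)))
print(credit("RO1", edges, external))
```

In this sketch RO4's credit reaches RO1 twice, once directly and once through RO3. Whether that double path is desirable is exactly the kind of choice a real transfer function would have to settle.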

Page 8

Next steps

1. Define a suitable credit transfer function f

2. Build the provenance graph in practice
• Provlets and their composition

Page 9

Credit propagation patterns - 1

Most general case:

RO has been reused r times, by activities a1 … ar:

Then, we consider patterns that involve a single activity a

Page 10

Credit propagation patterns - 2

1. Single-input, single-output activity

We want RO to receive a fraction of the credit of RO’.

Credit transfer parameter through a:

α models the value of the transformation a relative to its input data RO

High-value transformation: low value of α → low credit to RO
Simple transformation: high value of α → high credit to RO
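
The formula for this pattern is not in the transcript. A minimal sketch, assuming the credit passed back to RO is simply α times the credit of the derived object RO’, with α in [0, 1] (the function name and the numeric values are hypothetical):

```python
def transfer_single(cr_derived: float, alpha: float) -> float:
    """Credit passed back to RO from its single derived output RO'.

    Assumed form: alpha * cr(RO').
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * cr_derived

# A high-value transformation keeps most of the credit (low alpha);
# a simple one, such as a format conversion, passes most of it back (high alpha).
print(transfer_single(10.0, alpha=0.25))
print(transfer_single(10.0, alpha=0.75))
```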

Page 11

Credit propagation patterns - 3

2. Multi-input activity: RO is only one of n>1 inputs to a

We account for the relative importance of each of a’s inputs RO1 … ROn

Modelled using n new factors:
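
The n factors themselves are missing from the transcript. Since a later slide names the parameters α, β, γ, one plausible reading is that each input ROi carries a weight βi and receives α · βi · cr(RO’). The sketch below assumes that form and also assumes the weights sum to 1; both are assumptions, as are the numbers.

```python
def transfer_multi_input(cr_derived, alpha, betas):
    """Split alpha * cr(RO') among the inputs according to their weights beta_i."""
    if abs(sum(betas.values()) - 1.0) > 1e-9:
        raise ValueError("beta weights are assumed to sum to 1")
    return {ro: alpha * beta * cr_derived for ro, beta in betas.items()}

# Charlie's activity P2 uses RO1 and RO3 to generate RO4.
shares = transfer_multi_input(8.0, alpha=0.5, betas={"RO1": 0.75, "RO3": 0.25})
print(shares)
```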

Page 12

Credit propagation patterns - 4

3. Multi-input, multi-output activity: a generates m>1 outputs

Relative importance of derived data products RO’1 … RO’m:

RO receives credit from each output RO’
These are all part of DT(RO)
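
The weighting formula is again missing. A sketch under the assumption that each output RO’j has a weight γj and that an input with weight β receives α · β · Σj γj · cr(RO’j); the function name and the numbers are hypothetical:

```python
def transfer_multi_output(cr_outputs, alpha, beta, gammas):
    """Credit to one input RO from the outputs RO'_1 .. RO'_m of the same activity."""
    return alpha * beta * sum(gammas[out] * cr for out, cr in cr_outputs.items())

# Bob's activity turns RO1 into RO2 and RO3 (a single input, so beta = 1).
cr_ro1 = transfer_multi_output(
    {"RO2": 4.0, "RO3": 8.0},
    alpha=0.5,
    beta=1.0,
    gammas={"RO2": 0.25, "RO3": 0.75},
)
print(cr_ro1)
```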

Page 13

Credit propagation patterns - unknown activity

When activity a is unknown, none of the parameters α, β, γ can be used

PROV-CONSTRAINTS (*): there exists some activity a such that:

Modelled using a derivation transfer parameter:

For n known derivations of RO:

(*) https://www.w3.org/TR/prov-constraints/#derivations
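
The formulas on this slide are missing from the transcript. One way to read the text is that a single derivation transfer parameter is applied uniformly to each of the n known derivations of RO. The sketch below assumes that form; the name delta and its range are hypothetical.

```python
def transfer_unknown_activity(cr_derivations, delta):
    """Credit to RO from its n known derivations when the activity is unknown.

    Assumed form: delta * cr(RO'_i), summed over the derivations.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta is assumed to lie in [0, 1]")
    return sum(delta * cr for cr in cr_derivations)

# RO5 was derived from RO2 through an unknown activity.
print(transfer_unknown_activity([6.0], delta=0.5))
```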

Page 14

Credit from data to Agents

Agents are the actual people to whom the ROs are attributed

Each agent may be responsible for a set R of ROs.

The credit to this agent is simply:
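
The formula is missing from the transcript; the natural reading is a sum of cr(RO) over the agent's set R. A sketch under that assumption, with illustrative credit values and hypothetical names:

```python
def agent_credit(attributed_to, cr):
    """Sum cr(RO) over the ROs attributed to each agent."""
    totals = {}
    for ro, agent in attributed_to.items():
        totals[agent] = totals.get(agent, 0.0) + cr[ro]
    return totals

attributed_to = {"RO1": "Alice", "RO2": "Bob", "RO3": "Bob", "RO4": "Charlie"}
cr = {"RO1": 3.5, "RO2": 1.0, "RO3": 2.0, "RO4": 2.0}  # illustrative values
print(agent_credit(attributed_to, cr))
```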

Page 15

Summary of credit model

RO reuse events → provenance statements about RO → complete provenance graph DT(RO) → cr(RO)

Three elements to cr(RO):

1. External credit that is independent of reuse
- May follow any community-based scoring scheme of data relevance

2. Credit propagation rules computed inductively from DT(RO)
- These formalise the notion of transitive credit

3. A collection of credit transfer parameters
- These account for the nature of the activities involved in DT(RO)

Page 16

How it might work: a data reuse simulator

Events:
- Data re-use through an activity
- Adjustments to external credit

Page 17

Next steps

1. Define a suitable credit transfer function f
• Credit transfer parameters

2. Build the provenance graph in practice
• Provlets and their composition

Issues in building a graph of reuse events:

1. Modelling reuse events using PROV [easy]

2. Detecting and reporting reuse events in practice [hard!!]

Page 18

Modelling reuse using PROV

Observable events:

Alice generates RO1

Bob reuses RO1, generating RO2, RO3

Charlie reuses RO1 and RO3, generating RO4 through P2

Unknown Agent reuses RO2 and RO3, generating RO5 through an unknown activity

Provlets are PROV document fragments generated by multiple, independent, autonomous Information Systems
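
As a rough sketch of how such fragments might be composed, each provlet below is a set of PROV-style statements written as plain tuples, and composition is set union. This does not use any PROV library. The activity name P1 for Bob's step and the rule that generation and usage through the same activity imply a derivation are simplifying assumptions (PROV itself does not make that inference).

```python
# Each provlet is a set of PROV-style statements reported by one system.
provlet_alice = {("wasAttributedTo", "RO1", "Alice")}
provlet_bob = {
    ("used", "P1", "RO1"),
    ("wasGeneratedBy", "RO2", "P1"),
    ("wasGeneratedBy", "RO3", "P1"),
    ("wasAttributedTo", "RO2", "Bob"),
    ("wasAttributedTo", "RO3", "Bob"),
}
provlet_charlie = {
    ("used", "P2", "RO1"),
    ("used", "P2", "RO3"),
    ("wasGeneratedBy", "RO4", "P2"),
    ("wasAttributedTo", "RO4", "Charlie"),
}
# Unknown agent, unknown activity: only the derivations are known.
provlet_unknown = {("wasDerivedFrom", "RO5", "RO2"), ("wasDerivedFrom", "RO5", "RO3")}

def compose(*provlets):
    """Compose provlets by taking the union of their statements."""
    return set().union(*provlets)

def derivations(graph):
    """Return (derived, source) pairs: explicit ones plus those implied via a shared activity."""
    pairs = {(d, s) for rel, d, s in graph if rel == "wasDerivedFrom"}
    for rel_g, out, act in graph:
        if rel_g != "wasGeneratedBy":
            continue
        for rel_u, used_by, src in graph:
            if rel_u == "used" and used_by == act:
                pairs.add((out, src))
    return pairs

graph = compose(provlet_alice, provlet_bob, provlet_charlie, provlet_unknown)
print(sorted(derivations(graph)))
```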

Page 19

Provlets - I

Page 20

Provlets - II

Page 21

Provlets - III

Page 22

Provlets - IV

Page 23

Provlets generation and composition

Page 24

Is this really practical?

Provlets are generated by multiple, independent, autonomous systems
• Not necessarily cooperative
• Especially in the long tail of science

No guarantee of
• Completeness
• Consistency, e.g. of RO PID usage

Alice misses out on the credit due from the dependencies RO2 → RO1, RO3 → RO1

Provenance and trajectories can be incomplete, partially disconnected
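
To make the effect of missing derivations concrete, the sketch below compares Alice's credit for RO1 on a complete and an incomplete graph. It assumes a toy propagation rule in which each direct derivation passes back a fixed fraction (0.5) of its credit; the rule, the external credit values, and the names are illustrative assumptions only.

```python
def credit(ro, edges, external, fraction=0.5):
    """cr(ro) = external credit + fraction * cr(RO') for each direct derivation RO'."""
    direct = [d for d, s in edges if s == ro]
    return external.get(ro, 0.0) + sum(
        fraction * credit(d, edges, external, fraction) for d in direct
    )

external = {"RO1": 1.0, "RO2": 1.0, "RO3": 1.0, "RO4": 2.0}
complete = [("RO2", "RO1"), ("RO3", "RO1"), ("RO4", "RO1"), ("RO4", "RO3")]

# Bob's system never reports that RO2 and RO3 were derived from RO1.
unreported = {("RO2", "RO1"), ("RO3", "RO1")}
incomplete = [e for e in complete if e not in unreported]

full = credit("RO1", complete, external)
partial = credit("RO1", incomplete, external)
print(full, partial, full - partial)
```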

Page 25

Challenges: A research agenda

Vision: tracking data re-use in the wild

1. Community efforts
• Incrementally instrument key systems to be provenance-friendly and cooperative
  • Python (noWorkflow)
  • R
  • Workflows (Kepler, Taverna, Pegasus, VisTrails, …)
• Facilitate consistent use of PIDs
• Incentivise proactive reporting of re-use instances

2. Research into probabilistic provenance
• Can we estimate the likelihood of some of the missing derivations?
• Uncertain graph management offers a rich foundation
• Can we design robust credit models that incorporate uncertainty of derivation?
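
As one naive illustration of the last question, and not a proposal from the talk, each derivation could carry a probability and the credit could be computed as an expectation. The sketch assumes independent derivations in an acyclic graph and a fixed transfer fraction of 0.5; all names and numbers are hypothetical.

```python
def expected_credit(ro, edges, external, fraction=0.5):
    """Expected cr(ro) when each derivation (derived, source, p) holds with probability p."""
    return external.get(ro, 0.0) + sum(
        p * fraction * expected_credit(d, edges, external, fraction)
        for d, s, p in edges
        if s == ro
    )

external = {"RO1": 1.0, "RO2": 1.0, "RO3": 1.0}
# The RO2 derivation was reported; the RO3 derivation is only inferred,
# with an assumed likelihood of 0.5.
edges = [("RO2", "RO1", 1.0), ("RO3", "RO1", 0.5)]
print(expected_credit("RO1", edges, external))
```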

Page 26

A crowded space in Open Research Data (Repositories)

Page 27

Selected references

• Bechhofer, S., De Roure, D., Gamble, M., Goble, C. & Buchan, I. (2010). Research Objects: Towards Exchange and Reuse of Digital Knowledge. Nature Precedings.

• Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., … Wright, D. (2012, May). Making Data a First Class Scientific Output: Data Citation and Publication by NERC’s Environmental Data Centres (Vol. 7) (No. 1).

• Katz, D. S. (2014). Transitive credit as a means to address social and technological concerns stemming from citation and attribution of digital products. Journal of Open Research Software, 2(1), e20.

• Moreau, L. & Groth, P. (2013, Sep). Provenance: An Introduction to PROV. Synthesis Lectures on the Semantic Web: Theory and Technology, 3(4), 1–129.

• Wallis, J. C., Rolando, E. & Borgman, C. L. (2013, Jul). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332.