Invited talk: Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008

Post on 11-May-2015

Granular workflow provenance in Taverna

Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK

Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008

Outline

• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe granular data derivations
• Provenance service architecture - brief description

IPAW'08 – Salt Lake City, Utah, June 2008

Example (Taverna) dataflow

QTL -> genes -> Kegg pathways

Collections example: from genes to SNPs

Workflow steps:
gene -> genomic region
extend region
retrieve SNPs in the region
rearrange SNP details

• See myexperiment.org: http://www.myexperiment.org/workflows/166

Input: [ ENSG00000139618 , ENSG00000083093 ]

Output: [[<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>,...],[<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>,...] ]

Computational model for collections

Depth mismatch between declared / offered type:

type(P4:X1) = s but type(a) = list(s)

type(P4:X2) = type(c) = list(s)

type(P4:X3) = s but type(c) = list(s)

Execution at P4:

Y = (map P1 <(a ⊗ b) , c>) // cross product

Y = [ (P1 <a1,b1,c>) ... (P1 <an,bm,c>) ]
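The implicit iteration on this slide can be sketched in Python. This is a minimal illustration, not Taverna code: `cross_product_map` and the stand-in processor `p` are hypothetical names. It assumes, following the execution line, that `a` and `b` arrive one level deeper than declared (so they are iterated) while `c` matches its declared list type and is passed whole. The result is rendered as a nested list, which is also an assumption; the slide lists the invocations flat.

```python
from typing import Callable, List


def cross_product_map(p: Callable[[str, str, List[str]], str],
                      a: List[str], b: List[str], c: List[str]) -> List[List[str]]:
    """Invoke p once per (a_i, b_j) pair; c is passed whole to every call."""
    return [[p(ai, bj, c) for bj in b] for ai in a]


def p(x1: str, x2: str, x3: List[str]) -> str:
    # Hypothetical stand-in processor: it only records its arguments.
    return f"P<{x1},{x2},{x3}>"


y = cross_product_map(p, ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"])
# y has one row per element of a and one column per element of b.
```

In this rendering the element `y[i][j]` comes from exactly one invocation, which is what makes per-element lineage possible later in the talk.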

Collections and iterations

Processor signatures:
l(s) → l(s)
l(s) → l(s)
s → s
s → l(s)
s → s

Values shown in the workflow figure (figure not reproduced):
[139618, 83093]
[16,13] [23520984, 31786617]
[16,13] [23560179, 31871809] (combined by dot product)
<16, 23560179,..> <13, 31871809,...>
[ <1,23553692,16,rs152451>,...]
[<1,31840948,13,rs169546>,...]
139618 83093
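A dot-product iteration pairs list elements position by position, so each downstream invocation sees one tuple. The sketch below uses the values from the slide. The names `chromosomes` and `region_ends` are guesses at what the two lists represent, and `dot_product` is a hypothetical helper rather than a Taverna API.

```python
from typing import List, Tuple


def dot_product(xs: List[int], ys: List[int]) -> List[Tuple[int, int]]:
    """Pair elements position by position; both lists must have equal length."""
    if len(xs) != len(ys):
        raise ValueError("dot product needs lists of equal length")
    return list(zip(xs, ys))


chromosomes = [16, 13]                 # assumed meaning of [16,13]
region_ends = [23560179, 31871809]     # assumed meaning of the second list
pairs = dot_product(chromosomes, region_ends)
# Each pair feeds one invocation of the downstream s -> l(s) processor,
# so the i-th output list traces back to the i-th pair only.
```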

Tracing granular lineage

• Provenance traces are most useful when they are granular
  – trace individual items in a collection
  – “which geneID is responsible for the presence of SNP rs169546 in the output?”
• Curse of black box processors:
  – M-M (many-many) and M-1 (many-one) processors destroy granularity

Granular lineage I: no loss of precision

Dataflow (figure not reproduced): P0 (inputs X1, X2) outputs Y1: l(s) and Y2: l(s); P1 (X: s → Y: s) consumes P0:Y1; P2 (X: s → Y: s) consumes P0:Y2; P3 (X1: s, X2: s → Y) combines P1:Y and P2:Y by cross product.

$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, 2X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$

Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.

Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = [2b_1 \ldots 2b_m]$, $P_3{:}Y = [a_1^2 + 2b_1 \ldots a_n^2 + 2b_m]$

and $\mathrm{lineage}(P_3{:}Y[i], \{P_0\}) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2[j] \}$
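The case on this slide can be replayed in a few lines of Python. This is a sketch under assumptions: the function name `run_with_lineage` is hypothetical, indices are 0-based, and the cross product is rendered as a nested list addressed by a pair (i, j), whereas the slide abbreviates the output index as Y[i].

```python
from typing import Dict, List, Set, Tuple


def run_with_lineage(a: List[int], b: List[int]):
    p1 = [x ** 2 for x in a]   # P1 is element-wise: P1:Y[i] depends on Y1[i]
    p2 = [2 * x for x in b]    # P2 is element-wise: P2:Y[j] depends on Y2[j]
    y: List[List[int]] = []
    lineage: Dict[Tuple[int, int], Set[str]] = {}
    for i, s in enumerate(p1):
        row = []
        for j, t in enumerate(p2):
            row.append(s + t)  # one invocation of P3 per (i, j) pair
            lineage[(i, j)] = {f"P0:Y1[{i}]", f"P0:Y2[{j}]"}
        y.append(row)
    return y, lineage


y, lineage = run_with_lineage([1, 2, 3], [10, 20])
```

Because every processor acts on single elements, each output element maps back to exactly one element of each P0 output, with no loss of precision.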

Granular lineage II: loss of precision

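Dataflow (figure not reproduced): P0 (inputs X1, X2) outputs Y1 and Y2; P1 (X: s → Y: s) consumes P0:Y1; P2 (X: l(s) → Y: s) consumes P0:Y2; P3 (X1: s, X2: s → Y) combines P1:Y and P2:Y.

$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, \min X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$

Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.

Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = c = \min \{b_1 \ldots b_m\}$, $P_3{:}Y = [a_1^2 + c \ldots a_n^2 + c]$

and $\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2 \}$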
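The loss of precision can be seen by replaying the example with a many-to-one P2. The sketch below is illustrative only: `run_with_min` is a hypothetical name, indices are 0-based, and the lineage strings mimic the slide notation. P2 is treated as a black box, so the whole list P0:Y2 is reported even though only one element determined the minimum.

```python
from typing import Dict, List, Set, Tuple


def run_with_min(a: List[int], b: List[int]) -> Tuple[List[int], Dict[int, Set[str]]]:
    p1 = [x ** 2 for x in a]   # P1: element-wise, granularity preserved
    c = min(b)                 # P2: many-to-one, consumes the whole list
    y = [s + c for s in p1]    # P3: one invocation per element of P1:Y
    # Black-box view: every output depends on all of P0:Y2.
    lineage = {i: {f"P0:Y1[{i}]", "P0:Y2"} for i in range(len(a))}
    return y, lineage


y, lineage = run_with_min([1, 2, 3], [10, 5, 20])
```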

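Granular lineage III: recoverable loss of precision

Dataflow (figure not reproduced): P0 (inputs X1, X2) outputs Y1 and Y2; P1 (X: s → Y: s) consumes P0:Y1; P2 (X: l(s) → Y: l(s)) consumes P0:Y2; P3 (X1: s, X2: s → Y) combines P1:Y and P2:Y.

$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, f\, X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$

Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.

Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = c = [c_1 \ldots c_m]$, $P_3{:}Y = [a_1^2 + c_1 \ldots a_i^2 + c_i \ldots a_m^2 + c_m]$

Without further knowledge about f:
$\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2 \}$

With the annotation “f is index-preserving”:
$\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2[i] \}$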
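The effect of the annotation can be expressed as a single switch in the lineage rule. The function below is a hypothetical sketch, not the Taverna lineage algorithm; it only reproduces the two results shown on the slide for a given index i.

```python
from typing import Set


def lineage_of(i: int, index_preserving: bool) -> Set[str]:
    """Lineage of P3:Y[i] with respect to P0, with or without the annotation on f."""
    # Without the annotation, P2 is a black box over the whole list P0:Y2.
    y2 = f"P0:Y2[{i}]" if index_preserving else "P0:Y2"
    return {f"P0:Y1[{i}]", y2}


coarse = lineage_of(2, index_preserving=False)
fine = lineage_of(2, index_preserving=True)
```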

Multi-level nesting and lineage precision

Adding annotations to the original workflow

Processor signatures:
l(s) → l(s)
l(s) → l(s)
s → s
s → l(s)
s → s

geneIdList: [139618, 83093]

Values shown in the workflow figure (figure not reproduced):
[16,13] [23520984, 31786617]
[16,13] [23560179, 31871809]
CR:result[0,i]: [ <1,23553692,16,rs152451>,...]
CR:result[1,j]: [<1,31840948,13,rs169546>,...]

Without annotations:
lineage(CR:result[0,i]) = { geneIdList }
lineage(CR:result[1,j]) = { geneIdList }

With two processors annotated “f is index-preserving”:
lineage(CR:result[0,i]) = { geneIdList[0] }
lineage(CR:result[1,j]) = { geneIdList[1] }

Granular lineage: recap

• Lineage query model accounts for granular traces over nested collections
• arbitrary nesting levels:
  – values are trees in general
  – lineage query identifies the correct sub-trees
• Lineage queries are efficient
  – recursion problem “compiled away” by query rewriting
  – (shameless claim - details omitted)
• But:
  – One single M-* processor can destroy granularity
  – in some cases annotations are a remedy

Towards provenance-friendly workflows

1. Define metrics for workflow provenance precision
  – how well is granularity preserved over a lineage trace?
  – what is the impact of M-* processors?
  – use to prioritize remedial actions

2. Make workflows more provenance friendly:
  – Add knowledge (static):
    • “lightweight annotations” [MBZ+08] -- see IPAW08
  – Add knowledge (dynamic):
    • provenance-active workflow processors
  – Redesign processors / workflow
    • general guidelines, provenance friendly patterns

[MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble. Data lineage model for Taverna workflows with lightweight annotation requirements. Proc. International Provenance and Annotation Workshop (IPAW 2008).

Lineage precision: example

Values in the example dataflow (figure not reproduced):
a = [a1, a2]
b = [b1, b2]
c = [c1, c2, c3]
d = [d1, d2]
e = [e1, e2]
f

lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }

precision = (1 + .5 + .5) / 3 = 2/3
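The arithmetic on this slide is an unweighted average of one score per lineage entry. The snippet below simply replays it with exact fractions. The three scores (1, .5, .5) are taken as given from the slide, since the transcript does not show how each one is derived, and the dictionary keys are only labels.

```python
from fractions import Fraction

# Per-entry granularity scores as read off the slide (taken as given).
scores = {
    "P0:Y[1]": Fraction(1),     # a single element a1 was identified
    "P2:X": Fraction(1, 2),     # the list c is returned
    "P3:X": Fraction(1, 2),     # the list e is returned
}

# Unweighted mean over the three lineage entries.
precision = sum(scores.values()) / len(scores)
```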

Precision relative to a sub-graph

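• Refining the previous idea:
  – precision relative to a set O of output variables and a set I of input variables
    • because not all variables are equally interesting...
    • weights WI, WO account for relative importance of variables

Figure (not reproduced): dataflow sub-graph with inputs I1, I2 and outputs O1, O2, O3.

$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$

$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j:1 \ldots |O|} \left( W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)} \right)$$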
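The weighted precision measure can be sketched as a short Python function. This rests on assumptions not confirmed by the transcript: len(p) is read as the length of the path returned for an input variable, nl(X) as the nesting level of that variable, and the two appear as a ratio. The variable names and the sample data are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# lin[o] lists (input variable, len(p), nl(X)) for each lineage entry of output o.
Lineage = Dict[str, List[Tuple[str, int, int]]]


def prec(lin: Lineage, w_in: Dict[str, Fraction], w_out: Dict[str, Fraction]) -> Fraction:
    """Weighted precision: outer sum over outputs, inner sum over lineage entries."""
    total = Fraction(0)
    for o, entries in lin.items():
        inner = sum((w_in[x] * Fraction(length, nl) for x, length, nl in entries),
                    Fraction(0))
        total += w_out[o] * inner
    return total


# Hypothetical data: one output, two inputs with equal weights.
lin = {"O1": [("I1", 1, 1), ("I2", 1, 2)]}
w_in = {"I1": Fraction(1, 2), "I2": Fraction(1, 2)}
w_out = {"O1": Fraction(1)}
result = prec(lin, w_in, w_out)
```

In this example I1 is traced to a single element of a flat list (ratio 1), while I2 is traced only to the first level of a two-level list (ratio 1/2), giving 3/4 overall.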

Impact of M-* processors on precision

Figure (not reproduced): dataflow sub-graph with inputs I1, I2, outputs O1, O2, O3 and a processor P.

Count the number of variables in O that can be reached from P
• weighted sum

$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$
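The impact measure amounts to a graph reachability test followed by a weighted sum. The sketch below assumes the dataflow is given as an adjacency list; the graph, the processor names and the weights are hypothetical and do not reproduce the figure.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """All nodes reachable from start by following dataflow arcs."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def impact(graph: Dict[str, List[str]], p: str, weights: Dict[str, float]) -> float:
    """Sum of W(o) over the output variables o reachable from processor p."""
    reached = reachable(graph, p)
    return sum(w for o, w in weights.items() if o in reached)


# Hypothetical dataflow: P feeds O1 and O2, Q feeds O3.
graph = {"I1": ["P"], "I2": ["Q"], "P": ["O1", "O2"], "Q": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}
impact_p = impact(graph, "P", weights)
impact_q = impact(graph, "Q", weights)
```

A processor with a higher impact affects more (or more important) outputs, so it is a better candidate for annotation or refactoring.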

Improving provenance precision

• Impact used to prioritize user actions on processors
• Precision used to assess improvement
• add index-preserving annotations
  ✓ illustrated earlier
• refactor M-* processors
• make processors provenance-active

Refactoring M-* → 1-1

Processor signatures (original workflow):
l(s) → l(s)
l(s) → l(s)
s → s
s → l(s)
s → s

Refactored processor signature: s → s

Values shown in the workflow figure (figure not reproduced):
[139618, 83093]
139618 83093
<16, 23520984> <13, 31786617>
[16,13] [23520984, 31786617]
<16, 23560179> <13, 31871809>
[16,13] [23560179, 31871809] (combined by dot product)
[ <1,23553692,16,rs152451>,...]
[<1,31840948,13,rs169546>,...]
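The idea behind the refactoring can be sketched as follows: a processor that takes one element at a time is invoked once per list item, so lineage can be recorded per invocation. `lookup_region` is a hypothetical stand-in for the refactored s → s service, filled with the values shown on the slide in the order the slide shows them, and `iterate` is an illustrative helper, not a Taverna API.

```python
from typing import Callable, Dict, List, Set, Tuple


def lookup_region(gene_id: int) -> Tuple[int, int]:
    # Hypothetical s -> s service; the table holds the slide's values only.
    table = {139618: (16, 23520984), 83093: (13, 31786617)}
    return table[gene_id]


def iterate(proc: Callable[[int], Tuple[int, int]], items: List[int]):
    """Implicit iteration: one invocation per element, lineage kept per index."""
    outputs: List[Tuple[int, int]] = []
    lineage: Dict[int, Set[str]] = {}
    for i, item in enumerate(items):
        outputs.append(proc(item))
        lineage[i] = {f"geneIdList[{i}]"}
    return outputs, lineage


regions, lineage = iterate(lookup_region, [139618, 83093])
```

With the original l(s) → l(s) signature the same call would be a single black-box invocation, and every output would depend on the whole of geneIdList.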

Provenance-active processors

– Passive processors do not contribute explicit provenance info
– Provenance-active processors actively feed metadata to the lineage service

Example 1: P with X: l(s) = [a1, a2, a3] and Y: s = b
  Static annotations: aggregation f()
  Dynamic annotations: b = X[i]; b = f(X[1]...X[k])

Example 2: P with X: l(s) = [a1, a2, a3] and Y: l(s) = [b1, b2]
  Static annotations: P is index-preserving
  Dynamic annotations: sorting: Y = Π(X)
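A provenance-active processor returns its result together with a statement of which inputs it actually used. The two functions below are hypothetical sketches of the dynamic annotations named on the slide (selection b = X[i] and sorting Y = Π(X)); the metadata dictionary keys are invented for illustration and are not part of any Taverna interface.

```python
from typing import Dict, List, Tuple


def select_min(xs: List[int]) -> Tuple[int, Dict[str, List[int]]]:
    """Aggregation that also reports which element produced the result (b = X[i])."""
    i = min(range(len(xs)), key=lambda k: xs[k])
    return xs[i], {"depends_on": [i]}


def sort_active(xs: List[int]) -> Tuple[List[int], Dict[str, List[int]]]:
    """Sorting that reports the permutation it applied (Y = Π(X))."""
    perm = sorted(range(len(xs)), key=lambda k: xs[k])
    return [xs[k] for k in perm], {"permutation": perm}


value, meta = select_min([7, 3, 9])
ordered, perm_meta = sort_active([7, 3, 9])
```

Here `perm_meta["permutation"][j]` gives the input index that ended up at output position j, which is enough to restore element-level lineage across the sort.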

Open Provenance Model

• A graph notation to represent process provenance
  – independent of the provenance producers
  – suitable for exchanging provenance across different workflow systems
• State: draft 1.01 (July 2008)

Mapping to OPM - granularity issue

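Dataflow (figure not reproduced): P0 (inputs X1, X2; outputs Y1, Y2), P1 (X: s → Y: s) and P2 (X: s → Y: s), with values a, b, c, d, e, f on the arcs.

OPM graph (figure not reproduced): processes P0, P1, P2 and artifacts a, b, c, d, connected by used, wgb (wasGeneratedBy) and wasDerivedFrom edges, including a granular wasDerivedFrom edge between the elements b[p] and d[p’].

How can this granular dependency be described for all arbitrary paths p?

Currently cannot be expressed using OPM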

Path mapping rules

OPM graph (figure not reproduced): processes P1, P2, P3 and artifacts a, b, c, d, connected by used, wgb and wasDerivedFrom edges; the actual lineage relates b[p] and d[p’].

Static graph structure sufficient to provide this (in Taverna)

But this is only known at query time

(extensional enumeration not an option)

Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage

Hint: granularity is only determined by depth of the path

At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
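One plausible form of a local path mapping rule is sketched below. The slide does not give the actual rule, so this is an assumption: each input port is described by its depth mismatch (offered depth minus declared depth), a cross product consumes consecutive segments of the output path, and a dot product shares one prefix across all ports. The function names are hypothetical.

```python
from typing import List, Sequence


def map_path_cross(p: Sequence[int], mismatches: List[int]) -> List[List[int]]:
    """Cross product: split output path p into one input path per port."""
    paths: List[List[int]] = []
    pos = 0
    for d in mismatches:
        paths.append(list(p[pos:pos + d]))  # an empty path means the whole value
        pos += d
    return paths


def map_path_dot(p: Sequence[int], mismatches: List[int]) -> List[List[int]]:
    """Dot product: every iterated port is addressed by the same path prefix."""
    return [list(p[:d]) for d in mismatches]


cross = map_path_cross((1, 2, 2), [1, 1])
whole = map_path_cross((0, 1), [1, 0])
dot = map_path_dot((1, 2), [1, 1])
```

Any indices left over after the mapped segments address positions inside the value produced by a single invocation, so they carry no further information about the inputs. Only the depths matter here, which matches the hint that granularity is determined by the depth of the path.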

Architecture provenance-active processors

Components (figure not reproduced): Taverna workflow engine (inputs, outputs), external services, p-active API, provenance events, provenance manager, provenance information repository, lineage query interface.

Lineage query: lin( P:Y, , Psel, E(D))

Provenance events:

1. Common content:
  – processor execution details
  – binding of input/output variables to values
  – completion status

2. Optional content for provenance-active processors:
  – explicit output → input dependency assertions:
    let I, O be the input, resp. output variables set
    depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
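The two kinds of event content can be pictured as a small record type. The sketch below is hypothetical: the class and field names are invented for illustration and do not describe the actual p-active API. It only mirrors the slide's depends(Y, X[p], <depType>) assertion and its constraint X ∈ I, Y ∈ O.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Depends:
    output_var: str               # Y
    input_var: str                # X
    input_path: Tuple[int, ...]   # p, the element of X that Y depends on
    dep_type: str                 # <depType>; the values used here are illustrative


@dataclass
class ProvenanceEvent:
    processor: str                # common content: execution details
    bindings: Dict[str, object]   # common content: variable -> value
    status: str                   # common content: completion status
    depends: List[Depends] = field(default_factory=list)  # optional content

    def validate(self, inputs: Set[str], outputs: Set[str]) -> bool:
        """Check X in I and Y in O for every dependency assertion."""
        return all(d.input_var in inputs and d.output_var in outputs
                   for d in self.depends)


event = ProvenanceEvent(
    processor="P",
    bindings={"X": [7, 3, 9], "Y": 3},
    status="completed",
    depends=[Depends("Y", "X", (1,), "selection")],
)
```

A passive processor would emit the same event with an empty `depends` list.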

Ongoing work

• Experimental evaluation:
  – to what extent is granularity a real practical problem?
  – Quantify provenance friendliness by analysing a large collection of workflows from myExperiment
  – Quantify available improvements (i.e. by refactoring)
• Compare collection management in Taverna with other workflow models
  – can we successfully exchange provenance graphs?
• Integration of the provenance service with the new version of Taverna
  – to be released before end of year