![Page 1: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/1.jpg)
Granular workflow provenance in Taverna
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
Symposium on Provenance in Scientific Workflows
Salt Lake City, Oct. 2008
![Page 2: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/2.jpg)
Outline
• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe granular data derivations
• Provenance service architecture - brief description
![Page 3: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/3.jpg)
IPAW'08 – Salt Lake City, Utah, June 2008
Example (Taverna) dataflow
QTL -> genes -> KEGG pathways
![Page 10: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/10.jpg)
Collections example: from genes to SNPs
Workflow steps:
• gene -> genomic region
• extend region
• retrieve SNPs in the region
• rearrange SNP details
• See myexperiment.org: http://www.myexperiment.org/workflows/166
Input (gene IDs): [ ENSG00000139618 , ENSG00000083093 ]
Output (one list of SNPs per gene):
[ [<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>,...],
  [<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>,...] ]
![Page 11: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/11.jpg)
Computational model for collections
Depth mismatch between declared / offered type:
type(P4:X1) = s but type(a) = list(s)
type(P4:X2) = s but type(b) = list(s)
type(P4:X3) = type(c) = list(s)
Execution at P4:
Y = (map P4 <(a ⊗ b) , c>) // cross product
Y = [ (P4 <a1,b1,c>) ... (P4 <an,bm,c>) ]
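The implicit iteration in the execution rule above can be sketched in Python. This is a minimal illustration, not Taverna's implementation: `cross_product_map`, the processor function `p` and the sample values are hypothetical names chosen for the example. The lists a and b are combined by cross product, so the processor runs once per pair, while the list c is passed whole to every invocation.

```python
from itertools import product


def cross_product_map(p, a, b, c):
    """Apply p once per (a_i, b_j) pair, passing the list c unchanged.

    Mirrors Y = (map P <(a (x) b), c>): a and b are iterated over,
    c is handed to every invocation as a whole list.
    The result is flat, ordered (a1,b1), (a1,b2), ..., (an,bm).
    """
    return [p(ai, bj, c) for ai, bj in product(a, b)]


def p(x1, x2, x3):
    # Hypothetical processor: two single items plus one whole list.
    return f"{x1}{x2}/{len(x3)}"


y = cross_product_map(p, ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"])
print(y)
```

With n = 2 and m = 3 the result holds n × m = 6 elements, one per invocation.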
![Page 23: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/23.jpg)
Collections and iterations
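[Workflow diagram: genes to SNPs, annotated with processor signatures and sample values]
Processor signatures, as labelled on the diagram: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s
Sample values along the dataflow:
• [139618, 83093] (gene ID list, shown twice), and the individual items 139618, 83093
• [23520984, 31786617] and [16,13]
• [16,13] and [23560179, 31871809], combined by dot product into <16, 23560179,..> and <13, 31871809,...>
• [ <1,23553692,16,rs152451>,...] and [<1,31840948,13,rs169546>,...]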
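The dot-product step pairs the i-th chromosome with the i-th end position. A minimal sketch using the sample values from the slide; `dot_product_map` and `make_region` are hypothetical helper names, not part of any Taverna API:

```python
def dot_product_map(p, *lists):
    """Apply p to the i-th elements of each list (dot-product iteration)."""
    lengths = {len(lst) for lst in lists}
    if len(lengths) > 1:
        raise ValueError("dot product needs lists of equal length")
    return [p(*items) for items in zip(*lists)]


def make_region(chromosome, end_position):
    # Hypothetical (s, s) -> s processor building one region tuple.
    return (chromosome, end_position)


chromosomes = [16, 13]
end_positions = [23560179, 31871809]
regions = dot_product_map(make_region, chromosomes, end_positions)
print(regions)  # [(16, 23560179), (13, 31871809)]
```

Unlike the cross product, the output has the same length as each input, so position i of the output corresponds to position i of every input list.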
![Page 24: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/24.jpg)
Tracing granular lineage
• Provenance traces are most useful when they are granular
  – trace individual items in a collection
  – “which geneID is responsible for the presence of SNP rs169546 in the output?”
• Curse of black box processors:
  – M-M (many-many) and M-1 (many-one) processors destroy granularity
![Page 26: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/26.jpg)
Granular lineage I: no loss of precision
[Dataflow diagram: P0 (inputs X1, X2; outputs Y1:l(s), Y2:l(s)) feeds P1 (X:s → Y:s) and P2 (X:s → Y:s), whose outputs feed P3 (X1:s, X2:s → Y) through a cross product]
$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, 2X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$
Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.
Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = [2b_1 \ldots 2b_m]$, $P_3{:}Y = [a_1^2 + 2b_1 \ldots a_n^2 + 2b_m]$,
and $\mathrm{lineage}(P_3{:}Y[i], \{P_0\}) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2[j] \}$
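One way to see why 1-1 processors keep lineage precise is to carry, next to every value, the set of P0 output items it depends on. The sketch below does this for P1, P2 and P3. The `Item` class and the numeric sample values are assumptions made for illustration; indices are 0-based, and the cross product is kept as a nested list so that each result is addressed by a pair (i, j).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    value: float
    lineage: frozenset  # set of (port, index) pairs at P0


def p1(x):
    return Item(x.value ** 2, x.lineage)  # P1: square


def p2(x):
    return Item(2 * x.value, x.lineage)  # P2: double


def p3(x1, x2):
    # P3: sum; the result depends on everything both inputs depend on.
    return Item(x1.value + x2.value, x1.lineage | x2.lineage)


y1 = [Item(v, frozenset({("P0:Y1", i)})) for i, v in enumerate([1, 2, 3])]
y2 = [Item(v, frozenset({("P0:Y2", j)})) for j, v in enumerate([10, 20])]

p1_out = [p1(a) for a in y1]
p2_out = [p2(b) for b in y2]
p3_out = [[p3(a, b) for b in p2_out] for a in p1_out]  # cross product

r = p3_out[2][1]
print(r.value, sorted(r.lineage))
```

Each element of the result depends on exactly one item of each P0 output, which is the "no loss of precision" case.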
![Page 28: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/28.jpg)
Granular lineage II: loss of precision
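[Dataflow diagram: P0 (inputs X1, X2; outputs Y1, Y2) feeds P1 (X:s → Y:s) and P2 (X: l(s) → Y:s), whose outputs feed P3 (X1:s, X2:s → Y)]
$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, \min X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$
Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.
Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = c = \min \{b_1 \ldots b_m\}$, $P_3{:}Y = [a_1^2 + c \ldots a_n^2 + c]$,
and $\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2 \}$: the whole list $P_0{:}Y_2$ appears, because the many-one processor $P_2$ hides which $b_j$ produced $c$.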
![Page 32: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/32.jpg)
III: recoverable loss of precision
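[Dataflow diagram: P0 (inputs X1, X2; outputs Y1, Y2) feeds P1 (X:s → Y:s) and P2 (X: l(s) → Y:l(s)), whose outputs feed P3 (X1:s, X2:s → Y)]
$P_1 \equiv \lambda X.\, X^2$
$P_2 \equiv \lambda X.\, f\, X$
$P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$
Let $P_0{:}Y_1 = [a_1 \ldots a_n]$, $P_0{:}Y_2 = [b_1 \ldots b_m]$.
Then $P_1{:}Y = [a_1^2 \ldots a_n^2]$, $P_2{:}Y = c = [c_1 \ldots c_m]$, $P_3{:}Y = [a_1^2 + c_1 \ldots a_i^2 + c_i \ldots a_m^2 + c_m]$,
and, with no further knowledge about $f$, $\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2 \}$
With the annotation “f is index-preserving”:
$\mathrm{lineage}(P_3{:}Y[i]) = \{ P_0{:}Y_1[i],\ P_0{:}Y_2[i] \}$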
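The effect of the annotation can be sketched as a lookup rule. Without any knowledge about f, an element of P2's output depends on the whole input list; if P2 is declared index-preserving, element i depends only on input element i. The function names and the string encoding of ports below are hypothetical.

```python
def lineage_through(port, index, index_preserving=False):
    """Return the upstream items that output element `index` depends on.

    port: name of the upstream list, e.g. "P0:Y2".
    Without an annotation the whole list is returned (coarse lineage);
    with an index-preserving annotation only element `index` is.
    """
    if index_preserving:
        return {f"{port}[{index}]"}
    return {port}


def lineage_p3(i, p2_index_preserving):
    # P1 is s -> s and iterated, so it is always granular on P0:Y1.
    return {f"P0:Y1[{i}]"} | lineage_through("P0:Y2", i, p2_index_preserving)


print(lineage_p3(3, False))
print(lineage_p3(3, True))
```

The annotation is static: it is attached to the processor once and applies to every run, which is why the loss of precision is called recoverable.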
![Page 33: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/33.jpg)
Multi-level nesting and lineage precision
![Page 39: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/39.jpg)
Adding annotations to the original workflow
[Workflow diagram: genes to SNPs, annotated with processor signatures and sample values]
Processor signatures, as labelled on the diagram: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s
Sample values along the dataflow:
• geneIdList: [139618, 83093]
• [23520984, 31786617] and [16,13]
• [16,13] and [23560179, 31871809]
• [ <1,23553692,16,rs152451>,...] and [<1,31840948,13,rs169546>,...]
Output elements: CR:result[0,i], CR:result[1,j]
Without annotations:
lineage(CR:result[0,i]) = { geneIdList }
lineage(CR:result[1,j]) = { geneIdList }
With the annotation “f is index-preserving” on two processors:
lineage(CR:result[0,i]) = { geneIdList[0] }
lineage(CR:result[1,j]) = { geneIdList[1] }
![Page 40: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/40.jpg)
Granular lineage: recap
• Lineage query model accounts for granular traces over nested collections
• arbitrary nesting levels:
  – values are trees in general
  – lineage query identifies the correct sub-trees
• Lineage queries are efficient
  – recursion problem “compiled away” by query rewriting
  – (shameless claim - details omitted)
• But:
  – One single M-* processor can destroy granularity
  – in some cases annotations are a remedy
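The statement that values are trees and that a lineage query identifies sub-trees can be illustrated with nested Python lists addressed by index paths. This is only a sketch: `subtree` is a hypothetical helper, paths are 0-based, and the SNP identifiers are taken from the sample output shown earlier in the talk.

```python
def subtree(value, path):
    """Follow a list of 0-based indices into a nested list."""
    for index in path:
        value = value[index]
    return value


snps = [
    ["rs45585833", "rs45594034"],          # SNPs for the first gene
    ["ENSSNP10730823", "ENSSNP10730820"],  # SNPs for the second gene
]

print(subtree(snps, []))      # the whole tree
print(subtree(snps, [1]))     # sub-list for the second gene
print(subtree(snps, [1, 0]))  # a single leaf: ENSSNP10730823
```

A shorter path selects a larger sub-tree, so a lineage answer with a short path is a coarser answer.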
![Page 43: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/43.jpg)
Towards provenance-friendly workflows
1. Define metrics for workflow provenance precision
  – how well is granularity preserved over a lineage trace?
  – what is the impact of M-* processors?
  – use to prioritize remedial actions
2. Make workflows more provenance friendly:
  – Add knowledge (static):
    • “lightweight annotations” [MBZ+08] -- see IPAW08
  – Add knowledge (dynamic):
    • provenance-active workflow processors
  – Redesign processors / workflow
    • general guidelines, provenance friendly patterns
[MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble, Data lineage model for Taverna workflows with lightweight annotation requirements, Procs. International Provenance and Annotation Workshop (IPAW 2008)
![Page 50: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/50.jpg)
Lineage precision: example
[Workflow diagram with processors P0 ... P4 and a label f; values on the arcs:]
a = [a1, a2]
b = [b1, b2]
c = [c1, c2, c3]
d = [d1, d2]
e = [e1, e2]
lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }
precision = (1 + .5 + .5) / 3 = 2/3
![Page 52: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/52.jpg)
Precision relative to a sub-graph
• Refining the previous idea:
  – precision relative to a set O of output variables and a set I of input variables
    • because not all variables are equally interesting...
    • weights WI, WO account for relative importance of variables
$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$
$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j=1}^{|O|} \left( W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)} \right)$$
[Diagram: sub-graph with input variables I1, I2 and output variables O1, O2, O3]
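The prec definition can be evaluated mechanically once the lineage answers are known. A minimal sketch follows, reading the inner term as the ratio len(p_i) / nl(X_i). The function signature, the dictionary layout, and all sample weights, path lengths and nesting levels are invented for illustration and do not come from the slides.

```python
from fractions import Fraction


def prec(lin, w_in, w_out, nl):
    """Weighted lineage precision.

    lin:   {output: [(input_var, path_length), ...]} lineage per output
    w_in:  {input_var: weight}, weights sum to 1
    w_out: {output: weight}, weights sum to 1
    nl:    {input_var: nesting level (> 0)}
    """
    total = Fraction(0)
    for out, answers in lin.items():
        inner = sum(
            (w_in[x] * Fraction(path_len, nl[x]) for x, path_len in answers),
            Fraction(0),
        )
        total += w_out[out] * inner
    return total


w_in = {"I1": Fraction(1, 2), "I2": Fraction(1, 2)}
w_out = {"O1": Fraction(1, 2), "O2": Fraction(1, 4), "O3": Fraction(1, 4)}
nl = {"I1": 2, "I2": 1}
lin = {
    "O1": [("I1", 2), ("I2", 1)],  # fully granular on both inputs
    "O2": [("I1", 1), ("I2", 0)],  # partial on I1, whole list for I2
    "O3": [("I1", 0)],             # no granularity at all
}
print(prec(lin, w_in, w_out, nl))  # 9/16
```

Exact fractions are used so that the weighted sum can be compared without rounding.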
![Page 53: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/53.jpg)
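Impact of M-* processors on precision
Count the number of variables in O that can be reached from P
• weighted sum
$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$
$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$
[Diagram: processor P inside the sub-graph with input variables I1, I2 and output variables O1, O2, O3]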
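The impact measure reduces to graph reachability followed by a weighted sum. The sketch below assumes the dataflow is available as an adjacency dictionary; the graph, the processor names Q and R, and the weights are hypothetical.

```python
def reachable(graph, start):
    """Return all nodes reachable from `start` in a directed graph."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def impact(graph, processor, weights):
    """Sum W(o) over the output variables o reachable from `processor`."""
    reached = reachable(graph, processor)
    return sum(w for o, w in weights.items() if o in reached)


# Hypothetical dataflow: P feeds O1 directly and O2 via Q; O3 is independent.
graph = {"I1": ["P"], "I2": ["R"], "P": ["O1", "Q"], "Q": ["O2"], "R": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}
print(impact(graph, "P", weights))  # 0.75
```

A processor with high impact sits upstream of many important outputs, so fixing it first gives the largest gain in precision.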
![Page 54: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/54.jpg)
Improving provenance precision
• Impact used to prioritize user actions on processors
• Precision used to assess improvement
• add index-preserving annotations
  ✓ illustrated earlier
• refactor M-* processors
• make processors provenance-active
![Page 62: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/62.jpg)
Refactoring M-* → 1-1
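[Workflow diagram after refactoring, annotated with processor signatures and sample values]
Processor signatures, as labelled on the diagram: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s; s → s
Sample values along the dataflow:
• [139618, 83093] (gene ID list, shown twice), and the individual items 139618, 83093
• <16, 23520984> and <13, 31786617>
• [23520984, 31786617] and [16,13]
• [16,13] and [23560179, 31871809], combined by dot product into <16, 23560179> and <13, 31871809>
• [ <1,23553692,16,rs152451>,...] and [<1,31840948,13,rs169546>,...]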
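The refactoring can be sketched by comparing what a lineage recorder sees in the two designs. Here `lookup_region` is a hypothetical stand-in for the gene-to-region service, backed by a table built from the sample values on the slide, and the `events` list is an invented trace format.

```python
# Sample values from the slide: gene ID -> (chromosome, start position).
REGIONS = {139618: (16, 23520984), 83093: (13, 31786617)}


def lookup_region(gene_id):
    # Hypothetical s -> s processor.
    return REGIONS[gene_id]


def run_list_processor(gene_ids):
    """l(s) -> l(s): one invocation, so one coarse lineage event."""
    out = [lookup_region(g) for g in gene_ids]
    events = [("out", "in")]  # whole output depends on whole input
    return out, events


def run_iterated_processor(gene_ids):
    """s -> s under implicit iteration: one event per list element."""
    out, events = [], []
    for i, g in enumerate(gene_ids):
        out.append(lookup_region(g))
        events.append((f"out[{i}]", f"in[{i}]"))
    return out, events


genes = [139618, 83093]
print(run_list_processor(genes))
print(run_iterated_processor(genes))
```

Both designs compute the same values; only the iterated version lets the workflow engine observe which output element came from which input element.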
![Page 63: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/63.jpg)
Provenance-active processors
– Passive processors do not contribute explicit provenance info
– provenance-active processors actively feed metadata to the lineage service
Dynamic annotations:
[Diagram: processor P with input X: l(s) = [a1, a2, a3] and output Y: s = b]
• aggregation f(): b = f(X[1]...X[k])
• b = X[i]
Static annotations:
[Diagram: processor P with input X: l(s) = [a1, a2, a3] and output Y: l(s) = [b1, b2]]
• P is index-preserving
• sorting: Y = Π(X)
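A provenance-active processor of the b = X[i] kind can be sketched as a selection function that tells the lineage service which element it picked. The `LineageService` class and its `report` method are hypothetical stand-ins, not the interface of the Taverna provenance service.

```python
class LineageService:
    """Hypothetical in-memory stand-in for a lineage service."""

    def __init__(self):
        self.records = []

    def report(self, processor, output, inputs):
        self.records.append((processor, output, tuple(inputs)))


def active_min(xs, service, name="P"):
    """l(s) -> s processor that reports which input element it selected.

    A passive version would only return min(xs), and the lineage of the
    result would be the whole list X. Here the processor tells the
    service that b = X[i] for the specific i it picked (0-based).
    """
    i = min(range(len(xs)), key=xs.__getitem__)
    service.report(name, "Y", [f"X[{i}]"])
    return xs[i]


service = LineageService()
b = active_min([7, 3, 9], service)
print(b, service.records)  # 3 [('P', 'Y', ('X[1]',))]
```

The annotation is dynamic because the selected index depends on the data of each run, so it cannot be declared once on the workflow definition.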
![Page 64: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/64.jpg)
Open Provenance Model
• A graph notation to represent process provenance
  – independent of the provenance producers
  – suitable for exchanging provenance across different workflow systems
• State: draft 1.01 (July 2008)
![Page 69: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/69.jpg)
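Mapping to OPM - granularity issue
[Figure: a Taverna dataflow (processors P0, P1, P2; labels X1, X2, Y1, Y2, X:s, Y:s; values a to f) next to its OPM graph (processes P0, P1, P2; artifacts a, b, c, d; edges: used, wgb (wasGeneratedBy), wasDerivedFrom), with an additional wasDerivedFrom edge between the collection elements b[p] and d[p’]]
How can this granular dependency be described for arbitrary paths p?
Currently cannot be expressed using OPM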
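The granularity gap can be illustrated with a minimal sketch. It assumes a toy OPM-style edge list in which every artifact is a whole collection; the `Edge` type, the names `xs`, `ys` and `P`, and the one-to-one element mapping are hypothetical and are not read off the figure. Edges point from effect to cause, as in OPM.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    kind: str    # "used", "wasGeneratedBy" or "wasDerivedFrom"
    effect: str  # node the edge starts from
    cause: str   # node the edge points to


# Hypothetical coarse-grained graph: xs and ys are whole collections.
coarse_edges = [
    Edge("used", "P", "xs"),
    Edge("wasGeneratedBy", "ys", "P"),
    Edge("wasDerivedFrom", "ys", "xs"),
]


def derived_from(artifact: str, edges: List[Edge]) -> List[str]:
    """Return the artifacts that `artifact` was directly derived from."""
    return [e.cause for e in edges
            if e.kind == "wasDerivedFrom" and e.effect == artifact]


def granular_edges(effect: str, cause: str, size: int) -> List[Edge]:
    """Enumerate one element-level edge per index, assuming (for this
    sketch only) that element i of `effect` derives from element i of `cause`."""
    return [Edge("wasDerivedFrom", f"{effect}[{i}]", f"{cause}[{i}]")
            for i in range(size)]


# The coarse graph only relates whole collections.
print(derived_from("ys", coarse_edges))     # ['xs']
# It has no node for an individual element such as ys[0].
print(derived_from("ys[0]", coarse_edges))  # []
# Making element-level lineage explicit needs one edge per element.
print(len(granular_edges("ys", "xs", 1000)))  # 1000
```

In this sketch the only way to record element-level lineage is to add one node and one edge per element, which grows with the size of the collections.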
![Page 70: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/70.jpg)
![Page 71: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/71.jpg)
![Page 72: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/72.jpg)
Path mapping rules
[Figure: OPM graph (processes P1, P2, P3; artifacts a, b, c, d; edges: used, wgb (wasGeneratedBy), wasDerivedFrom); the actual lineage relates the collection elements b[p] and d[p’]]
Static graph structure sufficient to provide this (in Taverna)
But this is only known at query time (extensional enumeration not an option)
Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage
Hint: granularity is only determined by depth of the path
At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
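As an illustration of what a local path mapping rule might look like, the sketch below assumes a single processor that iterates over its inputs with a cross product, so that each input port contributes a fixed number of index levels to the path of an output element. The function name `map_output_path` and the `extra_depths` parameter are hypothetical; this is not the actual Taverna lineage query algorithm, only a sketch of a rule that depends on path depth alone.

```python
from typing import Dict, List, Sequence, Tuple


def map_output_path(
    p: Sequence[int],
    extra_depths: Sequence[Tuple[str, int]],
) -> Dict[str, List[int]]:
    """Split the index path p of an output element into one index path
    per input port.

    extra_depths lists (port, n) pairs in cross-product order, where n is
    the number of iteration levels that the port contributes (assumed to
    be known from the static dataflow graph). An empty path stands for
    the whole value bound to that port.
    """
    result: Dict[str, List[int]] = {}
    pos = 0
    for port, n in extra_depths:
        # Slicing past the end of p yields a shorter (coarser) path.
        result[port] = list(p[pos:pos + n])
        pos += n
    return result


# Output element [1, 2] of a cross product over X and Y (one level each)
# maps to X[1] and Y[2].
print(map_output_path([1, 2], [("X", 1), ("Y", 1)]))  # {'X': [1], 'Y': [2]}
# A coarser query path [1] maps to X[1] and to the whole of Y.
print(map_output_path([1], [("X", 1), ("Y", 1)]))     # {'X': [1], 'Y': []}
```

Because the rule only slices the path by depth, nothing has to be enumerated or stored per element; the mapping is computed when the query supplies p.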
![Page 73: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/73.jpg)
![Page 74: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/74.jpg)
![Page 75: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/75.jpg)
Architecture: provenance-active processors
[Figure: architecture diagram with components Taverna workflow engine (inputs, outputs), external services, provenance manager, provenance information repository and lineage query interface; labels: provenance events, p-active API, lin( P:Y, , Psel, E(D))]
1. Common content:
  – processor execution details
  – binding of input/output variables to values
  – completion status
2. Optional content for provenance-active processors:
  – explicit output → input dependency assertions:
    let I and O be the sets of input and output variables, respectively
    depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
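The two kinds of content listed above could be represented along the following lines. This is a sketch under assumptions: the class names `Depends` and `ProvenanceEvent`, their fields, and the `is_valid` check are hypothetical and do not reproduce the p-active API; only the shape of `depends(Y, X[p], <depType>)` with X ∈ I and Y ∈ O comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Depends:
    """depends(Y, X[p], <depType>): output Y depends on element p of input X."""
    output_var: str              # Y
    input_var: str               # X
    input_path: Tuple[int, ...]  # p, an index path into the value of X
    dep_type: str                # <depType>, left as a free-form label here


@dataclass
class ProvenanceEvent:
    # 1. Common content
    processor: str
    bindings: Dict[str, object]  # variable name -> bound value
    status: str                  # completion status
    # 2. Optional content for provenance-active processors
    assertions: List[Depends] = field(default_factory=list)


def is_valid(assertion: Depends, inputs: Set[str], outputs: Set[str]) -> bool:
    """Check the side condition X in I and Y in O."""
    return assertion.input_var in inputs and assertion.output_var in outputs


event = ProvenanceEvent(
    processor="P1",
    bindings={"X": ["a", "b"], "Y": "ab"},
    status="completed",
    assertions=[Depends("Y", "X", (0,), "copy")],  # "copy" is a made-up depType
)
print(all(is_valid(a, {"X"}, {"Y"}) for a in event.assertions))  # True
```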
![Page 76: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008](https://reader034.vdocument.in/reader034/viewer/2022051314/555065adb4c905ae3f8b55f8/html5/thumbnails/76.jpg)
Ongoing work
• Experimental evaluation:
  – to what extent is granularity a real practical problem?
  – Quantify provenance friendliness by analysing a large collection of workflows from myExperiment
  – Quantify available improvements (i.e. by refactoring)
• Compare collection management in Taverna with other workflow models
  – can we successfully exchange provenance graphs?
• Integration of the provenance service with the new version of Taverna
  – to be released before end of year