Optimizing User Views for Workflows
Sudeepa Roy (with Olivier Biton, Susan Davidson and Sanjeev Khanna)
ZOOM Project, Database Research Group
University of Pennsylvania
Workflow
[Figure: example workflow from Start (s) to end (t), with modules Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, and Construct Trees]
Graphical representation of a sequence of actions that perform a task (e.g., a biological experiment)
Vertex ≡ Module (program)
  Takes a set of data items as input
  Produces a set of data items as output
Edge ≡ Control (and data) flow
  Data is typically a file
Has a start (s) and an end (t) module
Run: an execution of the workflow
  Actual data appears on the edges
  A module can be executed once the data on each of its incoming edges has been computed
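The execution rule for a run (a module fires once the data on all of its incoming edges is available) amounts to running the modules in a topological order of the graph. A minimal sketch in Python, assuming a hypothetical wiring of the example modules and a placeholder `run_module` function; neither the edge list nor the function comes from the talk:

```python
from graphlib import TopologicalSorter

# Hypothetical wiring of the example workflow: module -> modules it reads from.
# The edge structure is an assumption made for illustration only.
PREDECESSORS = {
    "s": [],
    "Split Entries": ["s"],
    "Align Sequences": ["Split Entries"],
    "Functional Data": ["Split Entries"],
    "Format-1": ["Align Sequences"],
    "Curate Annotations": ["Functional Data"],
    "Format-2": ["Curate Annotations"],
    "Construct Trees": ["Format-1", "Format-2"],
    "t": ["Construct Trees"],
}


def run_module(name, inputs):
    """Placeholder module: returns a string recording the inputs it consumed."""
    return f"{name}({', '.join(inputs)})"


def run_workflow(predecessors):
    """Execute every module once all of its incoming data has been computed."""
    outputs = {}
    order = []
    for module in TopologicalSorter(predecessors).static_order():
        # Every predecessor appears earlier in the order, so its output exists.
        inputs = [outputs[p] for p in predecessors[module]]
        outputs[module] = run_module(module, inputs)
        order.append(module)
    return order, outputs


order, outputs = run_workflow(PREDECESSORS)
```

The `outputs` dictionary plays the role of the data on the edges of a run; recording which outputs fed which module is what a provenance system would store.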
Data Provenance in Scientific Workflows
High-throughput technologies generate huge amounts of data, which must be analyzed in “computational experiments”
  The analysis may be complex and multi-step
Scientific workflow systems are frequently used to help conceptualize and manage the analysis process as well as the intermediate and final data products
Increasing need to record the provenance (i.e., the “origin” or “history”) of data products
  Defined as a “depends-on” relationship between module execution and other data products
  Many scientific workflow systems (e.g., VisTrails, Kepler, Taverna) now support provenance
Need for Provenance
[Figure: a biologist’s workspace of sequence data and bioinformatics protocols; labels include Alignments, ClustalW, PAUPSPhillips, …Bootstrap]
Which sequences have been used to produce this tree?
How has this tree been generated?
Can I throw away some of these data? Which ones are really important to keep?
Provenance Overloads
[Figure: Workflow Specification (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees, t) next to a Workflow run with data items d1 … d460 on its edges, marking the immediate provenance of Construct Trees and its “deep” provenance through Curate Annotations, Format-3, Format-2, Functional Data, Format-1, Align Sequences, Split Entries, and s]
Can we reduce the amount of provenance shown to the user?
Relevant Modules and Composition
[BCD+08] shows how to focus user attention on the relevant portion of the provenance information
User specifies relevant modules
System creates composite modules (clusters)
The result is called a user-view
[Figure: the workflow specification and a user-view over it with nodes s, Align Sequences, Construct Trees, and t]
User-view Reduces Provenance Information
[Figure: provenance under the user-view, showing composite modules M1, M2, M3 and data items d201…d301, d456, d458, d459, d460]
What properties should a good user-view have?
Problem: Can the number of clusters be minimized in a good user-view?
Outline
Model and Definitions
  Workflow Specification
  User-View
  “Good” user-view
  Series-Parallel Graphs
Results for Series-Parallel graphs
  Algorithm SP-View
  Correctness
  Upper bound
  Lower bound
  Optimality
Results for General graphs
Workflow Specification
Workflow Specification: (G, s, t, R)
A directed graph G(V, E)
Unique start module (source) s and unique finish module (sink) t
R: set of “relevant” modules
NR = V – R: set of “non-relevant” modules
s, t ∈ R
|V| = n, |E| = m, |R| = k
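As a rough illustration, the tuple (G, s, t, R) can be held in a small Python record that checks the stated constraints. The class name, the field names, and the reading of "unique source/sink" as "the only node without incoming/outgoing edges" are assumptions made for this sketch, not definitions taken from the talk:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSpec:
    """Workflow specification (G, s, t, R); all names here are illustrative."""
    nodes: frozenset      # V
    edges: frozenset      # E, a set of (u, v) pairs
    source: str           # s
    sink: str             # t
    relevant: frozenset   # R

    def __post_init__(self):
        has_incoming = {v for _, v in self.edges}
        has_outgoing = {u for u, _ in self.edges}
        if self.nodes - has_incoming != {self.source}:
            raise ValueError("s must be the unique node without incoming edges")
        if self.nodes - has_outgoing != {self.sink}:
            raise ValueError("t must be the unique node without outgoing edges")
        if not {self.source, self.sink} <= self.relevant <= self.nodes:
            raise ValueError("R must contain s and t and be a subset of V")

    @property
    def non_relevant(self):
        """NR = V - R."""
        return self.nodes - self.relevant

    @property
    def sizes(self):
        """(n, m, k) = (|V|, |E|, |R|)."""
        return len(self.nodes), len(self.edges), len(self.relevant)


# Toy specification: s -> a -> b -> t with R = {s, a, t}.
spec = WorkflowSpec(
    nodes=frozenset({"s", "a", "b", "t"}),
    edges=frozenset({("s", "a"), ("a", "b"), ("b", "t")}),
    source="s",
    sink="t",
    relevant=frozenset({"s", "a", "t"}),
)
```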
[Figure: example specification with source s and sink t; the legend distinguishes R-nodes from NR-nodes]
User View
H: User-View of (G, s, t, R)
A directed graph H whose nodes are clusters (composite modules) of nodes in G
The nodes of H form a partition of the nodes in G
An edge e = (u, v) in G survives in H as e’ if its endpoints u and v belong to different clusters in H
  The edge e in G induces the edge e’ in H; equivalently, e is an origin of e’
R-cluster: contains at least one R-node
NR-cluster: contains only NR-nodes
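The definitions above can be sketched in a few lines of Python: given a partition of the nodes of G, collect the surviving edges of H together with their origins, and label each cluster as an R-cluster or an NR-cluster. The function name, the argument layout, and the toy graph are assumptions chosen for illustration:

```python
from collections import defaultdict


def induce_user_view(edges, cluster_of, relevant):
    """Build the user-view H induced by a partition of the nodes of G.

    edges:      iterable of (u, v) edges of G
    cluster_of: dict mapping every node of G to its cluster label
    relevant:   set of R-nodes
    Returns (origins, kinds). origins maps each surviving edge e' of H to the
    list of edges of G that induce it; kinds maps each cluster to "R" or "NR".
    """
    origins = defaultdict(list)
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:                      # an edge survives only across clusters
            origins[(cu, cv)].append((u, v))
    kinds = {}
    for node, cluster in cluster_of.items():
        if node in relevant:
            kinds[cluster] = "R"          # contains at least one R-node
        else:
            kinds.setdefault(cluster, "NR")
    return dict(origins), kinds


# Toy example: a and b are NR-nodes grouped into one composite module M.
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
cluster_of = {"s": "S", "a": "M", "b": "M", "c": "C", "t": "T"}
origins, kinds = induce_user_view(edges, cluster_of, relevant={"s", "c", "t"})
```

In this toy case the edge (S, M) of H has two origins, (s, a) and (s, b), which is why the definition speaks of "an origin" rather than "the origin".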
[Figure: example user-view from s to t with R-clusters and NR-clusters]
Good and Bad User Views
Direct dependencies between relevant clusters should be preserved, defined in terms of
  elementary path: a path where all the intermediate nodes are NR-nodes
At most one R-node in each cluster: the R-cluster assumes the ‘meaning’ of its R-node
[Figure: a Specification with R-nodes r1, r2, r3, r4, shown next to Bad view-1, Bad view-2, Good view-1, and Good view-2]
Three Properties of a Good User-view
Property 1 (well-formed): each cluster in H should contain at most one R-node from G
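Property 1 is a purely local condition and is easy to check. A small sketch, assuming the partition is given as a hypothetical `cluster_of` dictionary from nodes to cluster labels:

```python
from collections import Counter


def is_well_formed(cluster_of, relevant):
    """Property 1: no cluster contains more than one R-node."""
    r_nodes_per_cluster = Counter(cluster_of[r] for r in relevant)
    return all(count <= 1 for count in r_nodes_per_cluster.values())


relevant = {"r1", "r2"}
good = {"r1": "A", "x": "A", "r2": "B"}   # x is an NR-node grouped with r1
bad = {"r1": "A", "x": "A", "r2": "A"}    # r1 and r2 share cluster A
ok_good = is_well_formed(good, relevant)
ok_bad = is_well_formed(bad, relevant)
```

Properties 2 and 3 are global conditions over elementary paths and are not covered by this check.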
[Figure: G: Specification and H: User-view, each showing R-nodes r1, r2, r3, r4]
Property 2 (soundness): every edge on an elementary path between two R-clusters in H should have all of its origins on an elementary path between the corresponding R-nodes in G
[Figure: G: Specification and H: User-view over R-nodes r1, r2, r3 and a data item d]
Not sound! r2 did not depend on d in G, but depends on it in H
Property 3 (completeness): every edge on an elementary path between two R-nodes in G should induce an edge on an elementary path between the corresponding R-clusters in H
[Figure: Specification and User view over R-nodes r1, r2, r3 and a data item d]
Not complete! d, produced by r1, was directly consumed by r3 in G, but is processed by r2 in H
Optimization Problem
Given a directed graph G(V, E), a source s, a sink t, and a set R of R-nodes (s, t ∈ R) with |R| = k,
find, in polynomial time, a good user-view H that minimizes the total number of clusters (an optimum user-view).
Questions
Can we find an optimum user-view in general directed graphs? Is this problem NP-complete?
  Unknown. [BCD+08] gives a poly-time algorithm to find a ‘minimal’ good user-view, which may not be of minimum size
What about special directed graphs that capture many common workflows?
  Optimum clustering for series-parallel graphs
Can we find matching upper and lower bounds on the #clusters in terms of k (= |R|) and not n (= |V|)? In general graphs? In some special graphs?
  Tight bounds for general and series-parallel graphs
Series-Parallel Graphs
An edge (base case)
Series composition of G1 and G2
Parallel composition of G1 and G2
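The recursive definition can be sketched directly in Python. This is an illustrative construction only: the `SPGraph` record, the integer node ids, and the helper names are assumptions, and the two operands of a composition are expected to be node-disjoint (built from separate `edge()` calls).

```python
from dataclasses import dataclass
from itertools import count

_fresh = count()  # supplies unique node ids


@dataclass(frozen=True)
class SPGraph:
    """Two-terminal graph: a tuple of directed edges plus source and sink."""
    edges: tuple
    source: int
    sink: int


def edge():
    """Base case: a single edge s -> t on two fresh nodes."""
    s, t = next(_fresh), next(_fresh)
    return SPGraph(((s, t),), s, t)


def _rename(edges, old, new):
    """Replace node `old` by node `new` in every edge."""
    return tuple((new if u == old else u, new if v == old else v) for u, v in edges)


def series(g1, g2):
    """Series composition: identify the sink of g1 with the source of g2."""
    merged = _rename(g2.edges, g2.source, g1.sink)
    return SPGraph(g1.edges + merged, g1.source, g2.sink)


def parallel(g1, g2):
    """Parallel composition: identify the two sources and the two sinks."""
    merged = _rename(_rename(g2.edges, g2.source, g1.source), g2.sink, g1.sink)
    return SPGraph(g1.edges + merged, g1.source, g1.sink)


# A split-and-join workflow shape: s -> x, two 2-edge branches x -> y, y -> t.
branch_a = series(edge(), edge())
branch_b = series(edge(), edge())
g = series(edge(), series(parallel(branch_a, branch_b), edge()))
```

Every graph built only from `edge`, `series`, and `parallel` is series-parallel by construction, which is the sense in which such workflows resemble structured programs.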
Examples: (Non)Series-Parallel Graphs
Characterization of two-terminal SP graphs [VTL79]
A two-terminal DAG is an SP graph if and only if it does not contain a subgraph homeomorphic to this forbidden subgraph
[Figure: the forbidden subgraph, with examples of SP graphs and non-SP graphs]
Series-Parallel Graphs (SP-graphs)
[Figure: the example workflow (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees, t), labeled “SP graph!”]
SP graphs are the workflow equivalent of structured programming (without iteration)
Many workflows encountered in practice are SP graphs and do not allow looping
Contributions
                 Optimum Clustering             Upper Bound on #clusters   Lower Bound on #clusters
SP Graphs        YES (by an O(n) time           2k - 3                     2k - 3
                 algorithm)
General Graphs   ?                              (2^(k-1) – k)^2 + k        (2^(k-1) – k)^2 + k
                                                (analyze the #clusters
                                                output by [BCD+08])
Moreover, for general graphs we express the global conditions for a good user-view in terms of local conditions for each cluster
useful when k << n
Algorithm SP-View
Forward-pass
  Process the vertices in a topological order
  If an R-node
    do nothing
  If an NR-node
    if single R-predecessor
      ‘merge’
    if >= 1 NR-predecessor
      ‘merge’ with ‘last’ predecessor
    else
      do nothing
  Produce an intermediate clustering
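The control flow of the forward pass (scan in topological order, skip R-nodes, possibly merge each NR-node into an earlier cluster) can be sketched as below. This is only a skeleton: the slide abbreviates the actual merge conditions, so the rule used here (merge an NR-node when all of its predecessors already lie in one cluster) is a stand-in chosen for illustration and is not the SP-View rule. It keeps Property 1, because only NR-nodes are ever merged, but it makes no claim about soundness or completeness.

```python
from graphlib import TopologicalSorter


def forward_pass(predecessors, relevant):
    """Scan nodes in topological order and cluster NR-nodes (illustrative rule).

    predecessors: dict mapping each node to the list of its predecessors
    relevant:     set of R-nodes
    Returns a dict mapping each node to the representative of its cluster.
    """
    cluster_of = {}
    for node in TopologicalSorter(predecessors).static_order():
        cluster_of[node] = node                  # start as a singleton cluster
        if node in relevant:
            continue                             # R-node: do nothing
        pred_clusters = {cluster_of[p] for p in predecessors[node]}
        if len(pred_clusters) == 1:              # stand-in merge condition
            cluster_of[node] = pred_clusters.pop()
    return cluster_of


# Toy graph: a follows r1 only; x joins the branches coming from r1 and r2.
preds = {"s": [], "r1": ["s"], "a": ["r1"], "r2": ["s"], "x": ["a", "r2"], "t": ["x"]}
clusters = forward_pass(preds, relevant={"s", "r1", "r2", "t"})
```

Here `a` joins the cluster of `r1`, while `x` stays a separate NR-cluster because its predecessors lie in two different clusters.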
[Figure: example workflow from s to t with numbered modules]
Reverse-pass
  Take the intermediate clustering produced by the forward pass as input
  Produce a reverse topological order on the clusters
  Perform the procedure symmetric to the forward pass on the clusters
[Figure: the example workflow from s to t with cluster labels C1 … C13]
Reduces 16 modules to 10 clusters
Cannot do better than 10 (k = 9)!
O(m+n) = O(n) time
Correctness
Proved by induction on each intermediate step for cluster formation
Any workflow specification is a good user-view
In each step,
we preserve the SP-property
we have a good user-view
use equivalent local conditions for clusters
use forbidden subgraph characterization of two-terminal SP graphs [VTL79]
Upper Bound
#clusters ≤ 2k-3
Here we show a weaker bound: 2k-1
Each surviving NR-cluster has at least one unique R-predecessor as a witness
t is no one’s predecessor, so at most k-1 R-nodes can serve as witnesses and there are at most k-1 NR-clusters
#clusters ≤ k + (k-1) = 2k-1
Lower Bound
[Figure: lower-bound graph with R-nodes s = r0, r1, r2, …, r(k-3), r(k-2), t = r(k-1) and nodes p1, p2, …, p(k-4), p(k-3)]
#nodes = k + (k-3) = 2k-3
No two nodes can be merged in any good user-view
Optimum #clusters = 2k-3
Optimality
Outline of the steps …
Suppose SP-View outputs N1 R-clusters and N2 NR-clusters
  total #clusters = N1 + N2
N1 = k, which cannot be reduced
Each NR-cluster contains one essential NR-node that cannot be included in any R-cluster
If two essential NR-nodes are put in different clusters by SP-View, no good user-view can put them in the same cluster
Any good user view has at least N2 NR-clusters.
Other Results (General Graphs)
Upper bound on the number of clusters
  We show that the algorithm in [BCD+08] produces ≤ (2^(k-1) – k)^2 + k clusters
  This is independent of the total number of nodes n
Tight lower bound
  We show that there exists a graph that needs (2^(k-1) – k)^2 + k clusters in any good user-view
Open Problems
Can we solve the optimization problem on general directed graphs?
  Is it NP-complete?
  Can we get a constant-factor approximation to the optimum solution?
Can we extend our algorithm to handle a larger class of directed graphs?
Thank You