Optimizing User Views for Workflows

Sudeepa Roy (with Olivier Biton, Susan Davidson and Sanjeev Khanna)

ZOOM Project, Database Research Group

University of Pennsylvania


Workflow

A workflow is a graphical representation of a sequence of actions performed to carry out a task (e.g. a biological experiment).

Vertex ≡ Module (program)
    Takes a set of data items as input
    Produces a set of data items as output

Edge ≡ Control (and data) flow
    Data is typically a file

The workflow has a start module (s) and an end module (t).

Run: an execution of the workflow
    Actual data appears on the edges
    A module can be executed once the data on every incoming edge has been computed

[Figure: example workflow with modules Start (s), Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, end (t); DNA sequence fragments such as TGCCGTGTGGCTAAATG… label the edges]
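The execution rule for a run (a module fires once every incoming edge carries data) can be sketched as a small scheduler. This is an illustration only: the name `run_order` and the five-module graph are made up for the example and are not the workflow shown in the figure.

```python
from collections import deque

def run_order(edges, start):
    """Return one valid execution order for the modules of a workflow DAG."""
    successors = {}
    waiting_on = {}          # module -> number of incoming edges still without data
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        waiting_on[v] = waiting_on.get(v, 0) + 1
        waiting_on.setdefault(u, 0)

    ready = deque([start])   # assumption: start is the only module with no inputs
    order = []
    while ready:
        module = ready.popleft()
        order.append(module)                     # "execute" the module
        for nxt in successors.get(module, []):
            waiting_on[nxt] -= 1                 # data now sits on edge (module, nxt)
            if waiting_on[nxt] == 0:             # every input of nxt is available
                ready.append(nxt)
    return order

# Hypothetical workflow: s feeds a and b, both feed c, c feeds t.
example_edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
print(run_order(example_edges, "s"))
```

Module c is scheduled only after both a and b have produced their outputs, which is the condition stated for a run.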

Data Provenance in Scientific Workflows

High-throughput technologies generate huge amounts of data, which must be analyzed in “computational experiments”
    The analysis may be complex and multi-step

Scientific workflow systems are frequently used to help conceptualize and manage the analysis process, as well as the intermediate and final data products

There is an increasing need to record the provenance (i.e. the “origin” or “history”) of data products
    Provenance is defined as a “depends-on” relationship between module executions and other data products
    Many scientific workflow systems (e.g. Vistrails, Kepler, Taverna) now support provenance

Need for Provenance

Which sequences have been used to produce this tree?

How has this tree been generated?

Can I throw away some of these data? Which ones are really important to keep?

[Figure: a biologist’s workspace holding sequence files (e.g. TGCCGTGTGGCTAAATGTCTGTGC…) and bioinformatics protocols (labels: Alignments, ClustalW, PAUPSPhillips, …Bootstrap), next to the workflow s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Format, Format, Construct Trees, t]

Provenance Overloads

[Figure: the workflow specification (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, t) beside a workflow run whose edges carry the data items d1…d100, d201…d301, d302…d402, d403, d404…d454, d455, d456, d457, d458, d459 and d460. Construct Trees is marked as the immediate provenance; the “deep” provenance covers Curate Annotations, Format-3, Format-2, Functional Data, Format-1, Align Sequences, Split Entries and s.]

Can we reduce the amount of provenance shown to the user?
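The difference between immediate and “deep” provenance can be made concrete with a short sketch. It assumes provenance is stored as a plain “depends-on” map from each item to the items it was directly derived from; the function names and the three-item example are hypothetical and do not reproduce the run in the figure.

```python
def immediate_provenance(depends_on, item):
    """Items that `item` was directly derived from."""
    return set(depends_on.get(item, ()))

def deep_provenance(depends_on, item):
    """Everything `item` depends on, directly or through other items."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Hypothetical run: a tree built from an alignment and annotations,
# both of which were computed from the same input sequences.
depends_on = {
    "tree": ["alignment", "annotations"],
    "alignment": ["sequences"],
    "annotations": ["sequences"],
}
print(immediate_provenance(depends_on, "tree"))
print(deep_provenance(depends_on, "tree"))
```

Deep provenance grows with every upstream step, which is why showing all of it to the user quickly becomes overwhelming.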

Relevant Modules and Composition

[BCD+08] shows how to focus the user’s attention on the relevant portion of the provenance information
    The user specifies the relevant modules
    The system creates composite modules (clusters)
    The result is called a user-view

[Figure: the workflow specification (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, t) and a user-view of it showing s, Align Sequences, Construct Trees and t]

User-view Reduces Provenance Information

What properties should a good user-view have?

Problem: Can the number of clusters be minimized in a good user-view?

[Figure: the workflow specification beside its user-view (s, Align Sequences, Construct Trees) with composite modules M1, M2 and M3 and the data items d201…d301, d456, d458, d459 and d460 on its edges]

Outline

Model and Definitions
    Workflow Specification
    User-View
    “Good” user-view
    Series-Parallel Graphs

Results for Series-Parallel graphs
    Algorithm SP-View
    Correctness
    Upper bound
    Lower bound
    Optimality

Results for General graphs


Workflow Specification

Workflow Specification: (G, s, t, R)

A directed graph G(V, E)

Unique start module (source) s and unique finish module (sink) t

R: set of “relevant” modules
NR: V – R, the “non-relevant” modules

s, t ∈ R
|V| = n, |E| = m, |R| = k

[Figure legend: R-node, NR-node; example graph with source s and sink t]
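One way to hold a specification (G, s, t, R) in code is a small record that checks the stated constraints on construction. This is a sketch under assumptions: the class name `WorkflowSpec`, its field names and the four-module example are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkflowSpec:
    nodes: frozenset       # V
    edges: frozenset       # E, a set of (u, v) pairs
    s: str                 # start module (source)
    t: str                 # finish module (sink)
    relevant: frozenset    # R

    def __post_init__(self):
        # s and t must be relevant, and R must be a subset of V.
        if not {self.s, self.t} <= self.relevant <= self.nodes:
            raise ValueError("need s, t in R and R a subset of V")
        # s has no incoming edge and t has no outgoing edge.
        if any(v == self.s or u == self.t for u, v in self.edges):
            raise ValueError("s must be a source and t a sink")

    @property
    def non_relevant(self):
        """NR = V - R."""
        return self.nodes - self.relevant

    @property
    def sizes(self):
        """(n, m, k) = (|V|, |E|, |R|)."""
        return len(self.nodes), len(self.edges), len(self.relevant)

# Hypothetical chain s -> a -> b -> t in which b is the only NR-node.
spec = WorkflowSpec(
    nodes=frozenset({"s", "a", "b", "t"}),
    edges=frozenset({("s", "a"), ("a", "b"), ("b", "t")}),
    s="s", t="t",
    relevant=frozenset({"s", "a", "t"}),
)
print(spec.non_relevant, spec.sizes)
```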

User View

H: user-view of (G, s, t, R)

A directed graph, H, whose nodes are clusters (composite modules) of nodes in G.

The nodes of H form a partition of the nodes in G.

An edge e = (u, v) in G survives in H as e’ if its endpoints u and v belong to different clusters in H
    We say that the edge e in G induces the edge e’ in H, or that e is an origin of e’

R-cluster: contains at least one R-node
NR-cluster: contains only NR-nodes

[Figure legend: R-cluster, NR-cluster; example user-view with source s and sink t]
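The construction of H from a partition can be sketched as follows. The function name `build_user_view` and the example clusters are hypothetical; the sketch keeps, for every surviving edge of H, the list of its origins in G.

```python
def build_user_view(edges, clusters):
    """Map each edge of H to the list of G-edges that induce it (its origins).

    edges    : iterable of (u, v) pairs of G
    clusters : dict cluster_name -> set of G-nodes (must partition the nodes)
    """
    cluster_of = {}
    for name, members in clusters.items():
        for node in members:
            if node in cluster_of:
                raise ValueError(f"{node} appears in two clusters")
            cluster_of[node] = name

    origins = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:                         # the edge survives in H
            origins.setdefault((cu, cv), []).append((u, v))
        # edges with both endpoints in one cluster are hidden by the view
    return origins

# Hypothetical diamond s -> {a, b} -> t, with a and b merged into cluster M.
g_edges = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
view = build_user_view(g_edges, {"S": {"s"}, "M": {"a", "b"}, "T": {"t"}})
print(view)
```

Here each of the two edges of H has two origins, one through a and one through b.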

Good and Bad User Views

Direct dependencies between relevant clusters should be preserved
    These are defined in terms of elementary paths: an elementary path is a path in which all the intermediate nodes are NR-nodes

At most one R-node in each cluster: the R-cluster assumes the ‘meaning’ of its R-node

[Figure: a specification with R-nodes r1, r2, r3 and r4, and four user-views of it; panels: Specification, Bad view-1, Bad view-2, Good view-1, Good view-2]
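The notion of a direct dependency can be illustrated by listing every pair of R-nodes joined by an elementary path. This is a minimal sketch; `direct_dependencies` and the example graph are made-up names and data, not taken from the figure.

```python
def direct_dependencies(edges, relevant):
    """Pairs (r_i, r_j) of R-nodes joined by a path whose inner nodes are all NR."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    deps = set()
    for r in relevant:
        stack = list(successors.get(r, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in relevant:
                deps.add((r, node))          # an elementary path stops at the first R-node
            else:
                stack.extend(successors.get(node, ()))
    return deps

# Hypothetical chain r1 -> a -> r2 -> b -> r3 with NR-nodes a and b.
chain = [("r1", "a"), ("a", "r2"), ("r2", "b"), ("b", "r3")]
print(direct_dependencies(chain, {"r1", "r2", "r3"}))
```

In this chain r3 depends directly on r2 but not on r1, because every path from r1 to r3 passes through the R-node r2.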

Three Properties of a Good User-view

Property 1 (well-formed): each cluster in H should contain at most one R-node from G

[Figure: G: Specification and H: User-view, each with R-nodes r1, r2, r3 and r4]

Three Properties of a Good User-view

Property 2 (soundness): every edge on an elementary path between two R-clusters in H should have all its origins on an elementary path between the corresponding R-nodes in G

[Figure: G: Specification and H: User-view, with R-nodes r1, r2 and r3 and a data item d]

Not sound!
    r2 was not dependent on d in G, but is dependent on it in H

Three Properties of a Good User-view

Property 3 (completeness): every edge on an elementary path between two R-nodes in G should induce an edge on an elementary path between the corresponding R-clusters in H

[Figure: Specification and User view, with R-nodes r1, r2 and r3 and a data item d]

Not complete!
    d, produced by r1, was directly consumed by r3 in G, but is processed by r2 in H
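The three properties can be checked mechanically on small examples. The sketch below is one possible reading, not the authors' procedure: it assumes G is a DAG, treats edges hidden inside a cluster as unconstrained, and compares, for each surviving edge, the set of R-pairs whose elementary paths use it in G with the same set in H. All names (`is_good_view`, `_endpoint_labels`) are hypothetical.

```python
from itertools import product

def _endpoint_labels(edges, label):
    """label maps each R-node (or R-cluster) to an R-node name; other nodes are NR.
    Returns (src, dst): src[x] holds the R-names that reach x through NR nodes only
    (or x's own name), dst[x] the R-names that x reaches in the same way."""
    nodes = {n for e in edges for n in e}
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    def spread(neighbours):
        out = {n: set() for n in nodes}
        for r in nodes:
            if r not in label:
                continue
            out[r].add(label[r])
            stack, seen = list(neighbours[r]), set()
            while stack:
                x = stack.pop()
                if x in seen or x in label:   # stop at the next R-node / R-cluster
                    continue
                seen.add(x)
                out[x].add(label[r])
                stack.extend(neighbours[x])
        return out

    return spread(succ), spread(pred)

def is_good_view(edges, relevant, cluster_of):
    # Property 1 (well-formed): at most one R-node per cluster.
    rep = {}
    for r in relevant:
        if cluster_of[r] in rep:
            return False
        rep[cluster_of[r]] = r

    g_src, g_dst = _endpoint_labels(edges, {r: r for r in relevant})
    h_edges = {(cluster_of[u], cluster_of[v]) for u, v in edges
               if cluster_of[u] != cluster_of[v]}
    h_src, h_dst = _endpoint_labels(h_edges, rep)

    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue        # assumption: edges inside a cluster are not constrained
        in_g = set(product(g_src[u], g_dst[v]))
        in_h = set(product(h_src[cu], h_dst[cv]))
        # A pair only in in_h breaks soundness; a pair only in in_g breaks completeness.
        if in_g != in_h:
            return False
    return True

# Hypothetical fork: r1 -> a -> r2 and r1 -> b -> r3. Merging a and b is not sound.
fork = [("r1", "a"), ("a", "r2"), ("r1", "b"), ("b", "r3")]
bad = {"r1": "R1", "a": "M", "b": "M", "r2": "R2", "r3": "R3"}
print(is_good_view(fork, {"r1", "r2", "r3"}, bad))
```

In the merged view the edge from r1's cluster to M lies on an elementary path to r3's cluster, although its origin (r1, a) never led to r3 in G, so the check rejects the view.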

Optimization Problem

Given a directed graph G(V, E), a source s, a sink t, and a set R of R-nodes (s, t ∈ R), |R| = k,

find, in polynomial time, a good user-view H that minimizes the total number of clusters (an optimum user-view).
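To make the search space concrete, here is an exponential brute-force baseline that tries every partition of the nodes and keeps the smallest one accepted by a goodness test. It is not the method of the talk and is only usable on tiny graphs; `set_partitions`, `optimum_view` and the `is_good` callback are hypothetical names, and the demo predicate checks Property 1 only.

```python
def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part                       # first in a block of its own
        for i in range(len(part)):                   # or added to an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def optimum_view(nodes, is_good):
    """Smallest partition accepted by is_good, found by exhaustive search."""
    best = None
    for part in set_partitions(list(nodes)):
        if (best is None or len(part) < len(best)) and is_good(part):
            best = part
    return best

# Demo predicate (assumption): only the well-formed property is enforced.
demo_relevant = {"s", "t"}

def well_formed(part):
    return all(len(demo_relevant & set(block)) <= 1 for block in part)

print(optimum_view(["s", "a", "b", "t"], well_formed))
```

The number of partitions grows as the Bell numbers, which is why a polynomial-time algorithm is the real goal.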

Questions

Can we find an optimum user-view in general directed graphs? Is this problem NP-complete?
    Unknown. [BCD+08] gives a polynomial-time algorithm to find a ‘minimal’ good user-view, which may not be of minimum size

What about special directed graphs that capture many common workflows?
    Optimum clustering for series-parallel graphs

Can we find matching upper and lower bounds on the number of clusters in terms of k (= |R|) and not n (= |V|)? In general graphs? In some special graphs?
    Tight bounds for general and series-parallel graphs

Series-Parallel Graphs

An edge (base case)

Series composition of G1 and G2

Parallel composition
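The recursive definition can be sketched in a few lines. The sketch assumes the standard two-terminal conventions (series composition identifies the sink of G1 with the source of G2, parallel composition identifies the two sources and the two sinks), which the slide shows only as a picture; the function names and the dictionary layout are invented here.

```python
from itertools import count

_fresh = count()     # supplies node ids that have not been used yet

def edge():
    """Base case: a single edge between two new nodes."""
    s, t = next(_fresh), next(_fresh)
    return {"s": s, "t": t, "edges": [(s, t)]}

def _rename(edges, old, new):
    return [(new if u == old else u, new if v == old else v) for u, v in edges]

def series(g1, g2):
    """Identify the sink of g1 with the source of g2 (node sets must be disjoint)."""
    edges = g1["edges"] + _rename(g2["edges"], g2["s"], g1["t"])
    return {"s": g1["s"], "t": g2["t"], "edges": edges}

def parallel(g1, g2):
    """Identify the two sources and the two sinks (node sets must be disjoint)."""
    e2 = _rename(_rename(g2["edges"], g2["s"], g1["s"]), g2["t"], g1["t"])
    return {"s": g1["s"], "t": g1["t"], "edges": g1["edges"] + e2}

# A diamond: two two-edge paths composed in parallel.
diamond = parallel(series(edge(), edge()), series(edge(), edge()))
print(diamond)
```

Each call to `edge()` creates fresh nodes, so the operands never share nodes by accident; composing one graph object with itself is not supported by this sketch.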

Examples: (Non)Series-Parallel Graphs

Characterization of two-terminal SP graphs [VTL79]

A two-terminal DAG is an SP graph if and only if it does not contain a subgraph homeomorphic to this forbidden subgraph

[Figure: the forbidden subgraph; panels: SP graphs, Non-SP graphs]
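Membership in the class can also be tested in code. The sketch below does not use the forbidden-subgraph characterization from the slide; it swaps in the equivalent and well-known reduction test, which repeatedly merges parallel edges and bypasses inner nodes with one incoming and one outgoing edge until a single edge from s to t remains. The function name is hypothetical and the loop is written for clarity, not for linear running time.

```python
def is_series_parallel(edges, s, t):
    """True if the two-terminal DAG reduces to the single edge (s, t)."""
    es = set(edges)              # a set merges parallel edges (parallel reduction)
    while True:
        preds, succs = {}, {}
        for u, v in es:
            succs.setdefault(u, []).append(v)
            preds.setdefault(v, []).append(u)
        inner = [n for n in preds
                 if n in succs and n not in (s, t)
                 and len(preds[n]) == 1 and len(succs[n]) == 1]
        if not inner:            # nothing left to reduce
            return es == {(s, t)}
        n = inner[0]
        u, w = preds[n][0], succs[n][0]
        es -= {(u, n), (n, w)}   # series reduction: replace u -> n -> w by u -> w
        es.add((u, w))

diamond_edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
# Two-path graph with a cross edge a -> b, the classic non-SP shape.
cross_edges = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
print(is_series_parallel(diamond_edges, "s", "t"))
print(is_series_parallel(cross_edges, "s", "t"))
```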

Series-Parallel Graphs (SP-graphs)

SP graphs are the workflow equivalent of structured programming (without iteration)

Many workflows encountered in practice are SP graphs and do not allow looping

[Figure: the example workflow (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Format, Format, Construct Trees, t), annotated “SP graph!”]

Contributions

SP graphs
    Optimum clustering: YES (by an O(n)-time algorithm)
    Upper bound on #clusters: 2k – 3
    Lower bound on #clusters: 2k – 3

General graphs
    Optimum clustering: ?
    Upper bound on #clusters: (2^(k-1) – k)^2 + k (by analyzing the #clusters output by [BCD+08])
    Lower bound on #clusters: (2^(k-1) – k)^2 + k

Moreover, we express the global conditions for a good user-view in terms of local conditions for each cluster, for general graphs…

Useful when k << n
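A few values show how far apart the two bounds are. The sketch reads the flattened general-graph expression as (2^(k-1) – k)^2 + k, which is an assumption about superscripts lost in extraction; the function names are made up.

```python
def sp_bound(k):
    """Tight bound on #clusters for series-parallel graphs: 2k - 3 (for k >= 3)."""
    return 2 * k - 3

def general_bound(k):
    """Assumed reading of the general-graph bound: (2^(k-1) - k)^2 + k."""
    return (2 ** (k - 1) - k) ** 2 + k

for k in range(3, 8):
    print(k, sp_bound(k), general_bound(k))
```

Both expressions depend on k alone, so they stay fixed however many NR-nodes the workflow contains.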


Algorithm SP-View

Forward-pass

Process the vertices in a topological order
    If an R-node: do nothing
    If an NR-node:
        if single R-predecessor: ‘merge’
        if >= 1 NR-predecessor: ‘merge’ with ‘last’ predecessor
        else: do nothing

Produce an intermediate clustering

[Figure: example SP workflow from s to t with numbered modules (labels up to 16)]

Algorithm SP-View

Reverse-pass

Take the intermediate clustering produced by the Forward-pass as input

Produce a reverse topological order on the clusters

Perform on the clusters a procedure symmetric to the one used in the Forward-pass

Reduces 16 modules to 10 clusters

Cannot do better than 10 (k = 9)!

O(m+n) = O(n) time

[Figure: the clustering of the example between s and t, with clusters labelled C1 to C13]


Correctness


Proved by induction on each intermediate step for cluster formation

Any workflow specification is a good user-view

In each step,

we preserve the SP-property

we have a good user-view

use equivalent local conditions for clusters

use forbidden subgraph characterization of two-terminal SP graphs [VTL79]

Upper Bound

#clusters ≤ 2k-3

Here we show a weaker bound: 2k-1

Each surviving NR-cluster has at least one unique R-predecessor as a witness

t is no one’s predecessor!

#clusters ≤ k + (k-1) = 2k-1
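The counting behind the weaker bound can be written out step by step. The symbols N_1 and N_2 (numbers of R-clusters and NR-clusters) follow the Optimality slide; the witness map w is a name introduced here for illustration.

```latex
\begin{align*}
N_1 &= k
  && \text{(one R-cluster per R-node)} \\
N_2 &\le |R \setminus \{t\}| = k - 1
  && \text{(each NR-cluster } C \text{ has its own witness } w(C) \in R \setminus \{t\}\text{)} \\
\#\text{clusters} &= N_1 + N_2 \le k + (k - 1) = 2k - 1
\end{align*}
```

The second line uses the fact that distinct NR-clusters have distinct witnesses and that t, having no successors, can never be a witness.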

Lower Bound

[Figure: an SP graph with R-nodes s = r_0, r_1, r_2, …, r_{k-3}, r_{k-2}, t = r_{k-1} and NR-nodes p_1, p_2, …, p_{k-4}, p_{k-3}]

#nodes = k + (k-3) = 2k-3

No two nodes can be merged in any good user-view

Optimum #clusters = 2k-3

Optimality

Outline of the steps …

Suppose SP-View outputs N1 R-clusters and N2 NR-clusters
    total #clusters = N1 + N2

N1 = k, which cannot be reduced

Each NR-cluster contains one essential NR-node that cannot be included in any R-cluster

If two essential NR-nodes are put in different clusters by SP-View, no good user-view can put them in the same cluster

Any good user view has at least N2 NR-clusters.


Other Results (General Graphs)

Upper bound on the number of clusters

We show that the algorithm in [BCD+08] produces ≤ (2^(k-1) – k)^2 + k clusters
    This is independent of the total number of nodes n

Tight lower bound

We show that there exists a graph that needs (2^(k-1) – k)^2 + k clusters in any good user-view.

Open Problems

Can we solve the optimization problem on general directed graphs?
    Is it NP-complete?
    Can we get a constant-factor approximation to the optimum solution?

Can we extend our algorithm to handle a larger class of directed graphs?


Thank You
