FragFlow: Automated Fragment Detection in Scientific Workflows

Date: 24/10/2014. FragFlow: Automatic Fragment Detection in Scientific Workflows. Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ. * Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute, ⱡ USC Laboratory of Neuroimaging

Upload: dgarijo

Post on 23-Jun-2015


DESCRIPTION

eScience 2014, Guarujá (Brasil). Abstract—Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse

TRANSCRIPT

Page 1: Frag Flow: Automated Fragment Detection in Scientific Workflows

Date: 24/10/2014

FragFlow: Automatic Fragment Detection in Scientific Workflows

Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ

* Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute, ⱡ USC Laboratory of Neuroimaging

Page 2: Frag Flow: Automated Fragment Detection in Scientific Workflows

Overview

• Detecting common groups of tasks in a corpus of scientific workflows
• Application of exact and inexact graph matching techniques
• Filtering and linking results to the input corpus
• Benefits: discoverability, understandability, reuse, design, modularization, visualization

[Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]

IEEE eScience 2014. Guarujá, Brasil

Page 3: Frag Flow: Automated Fragment Detection in Scientific Workflows

Background

• Workflows are software artifacts that capture computational experiments
  • Addition to paper publication
  • Provenance of results
  • Reuse

• Existing repositories of workflows (Galaxy, myExperiment, the LONI Pipeline, CrowdLabs, etc.)
  • Sharing workflows
  • Exploring existing workflows

• Problems to address:
  • Workflows have many detailed steps and may be difficult to understand
  • The general method may not be apparent
  • How are different workflows related?
  • What steps do they have in common?

Page 4: Frag Flow: Automated Fragment Detection in Scientific Workflows

Workflow Fragments and Groupings

• Workflow Fragment: set of connected steps that are part of a workflow
• Common Workflow Fragment: fragment that occurs more than once in a corpus of workflows
• Grouping: workflow fragment manually annotated by a user
• Sub-Grouping: grouping included as part of another grouping

[Figure: Workflow 1, Workflow 2 and Workflow 3, built from steps labeled A to H, with their common workflow fragments.]

Page 5: Frag Flow: Automated Fragment Detection in Scientific Workflows

Our Goals

Our goal is to automatically detect useful workflow fragments to be reused by scientists. In this work, given a workflow corpus…

• Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?

• Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?

• Goal 3: How are workflows and groupings reused?


Page 6: Frag Flow: Automated Fragment Detection in Scientific Workflows

The LONI Pipeline


• Workflow system for neuroimaging analysis

• Active community of users creating workflows

• Enables users to define groupings in workflows

• Has a corpus of published workflows

• Has a library of (uniquely identified) components with well-defined functionality

http://pipeline.loni.usc.edu/explore/library-navigator/

Page 7: Frag Flow: Automated Fragment Detection in Scientific Workflows

Workflow Mining in FragFlow

[Figure: the FragFlow workflow mining process, shown as four numbered steps starting from the input Corpus.]

Page 8: Frag Flow: Automated Fragment Detection in Scientific Workflows

Corpus Preparation

Workflows are converted to Labeled Directed Acyclic Graphs (LDAG)
• The label of a node in the graph corresponds to the type of the step in the workflow
• Edges capture the dependencies between different steps
• Duplicated workflows are removed
• Single-step workflows are removed
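The preparation steps above can be sketched in Python as follows. This is a minimal illustration, not FragFlow's implementation: the input format (a dict of step ids to step types plus a list of dependency pairs), the function names `to_ldag` and `prepare_corpus`, and the label-based signature used to spot duplicates are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Steps = Dict[str, str]                 # step id -> step type (the node label)
Dependencies = List[Tuple[str, str]]   # (source step id, target step id)
LDAG = Tuple[Tuple[str, ...], Tuple[Tuple[str, str], ...]]


def to_ldag(steps: Steps, deps: Dependencies) -> LDAG:
    """Return a label-based view of a workflow: sorted node labels and labeled edges."""
    labels = tuple(sorted(steps.values()))
    edges = tuple(sorted((steps[src], steps[dst]) for src, dst in deps))
    return labels, edges


def prepare_corpus(workflows: List[Tuple[Steps, Dependencies]]) -> List[LDAG]:
    """Drop single-step workflows and duplicates (assumed: same labels and labeled edges)."""
    seen = set()
    prepared = []
    for steps, deps in workflows:
        ldag = to_ldag(steps, deps)
        if len(ldag[0]) < 2:  # single-step workflow
            continue
        if ldag in seen:      # duplicate of a workflow already kept
            continue
        seen.add(ldag)
        prepared.append(ldag)
    return prepared


# Hypothetical toy corpus: four workflows, one duplicate and one single-step.
corpus = prepare_corpus([
    ({"s1": "A", "s2": "B"}, [("s1", "s2")]),
    ({"x": "A", "y": "B"}, [("x", "y")]),    # same structure, different step ids
    ({"only": "A"}, []),                     # single step
    ({"s1": "A", "s2": "C"}, [("s1", "s2")]),
])
print(len(corpus))  # 2
```

Comparing sorted labels and edges is a shortcut: two different graphs with repeated step types could share the same signature, so a real deduplication step would need a proper graph comparison.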


Page 9: Frag Flow: Automated Fragment Detection in Scientific Workflows

Graph Mining

We use popular graph mining techniques:

• Inexact FGM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete
  • SUBDUE
    • 2 heuristics: Minimum Description Length (MDL) and Size
    • Frequency based

• Exact FGM: delivers all the possible fragments to be found in the dataset
  • gSpan
    • Depth-first search strategy
    • Support based
  • FSG
    • Breadth-first search strategy
    • Support based
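Both exact algorithms are support based: a fragment is kept only if it appears in at least a given fraction of the workflows. The sketch below illustrates that idea for the smallest multistep fragments only (a single labeled edge between two steps). It is not gSpan or FSG, which go on to grow larger fragments; the function name `frequent_edges` and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source step type, target step type)


def frequent_edges(corpus: List[List[Edge]], min_support: float) -> Dict[Edge, float]:
    """Return each labeled edge whose support reaches min_support.

    Support is taken here as the fraction of workflows that contain the edge
    at least once (an assumption for this sketch).
    """
    counts: Counter = Counter()
    for edges in corpus:
        counts.update(set(edges))  # count each workflow once per edge
    total = len(corpus)
    return {
        edge: count / total
        for edge, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical corpus of three workflows, each given as a list of labeled edges.
toy_corpus = [
    [("A", "B"), ("B", "C")],
    [("A", "B"), ("B", "C"), ("C", "D")],
    [("A", "B"), ("B", "F")],
]
supported = frequent_edges(toy_corpus, min_support=0.6)
for edge, support in sorted(supported.items()):
    print(edge, round(support, 2))
```

Raising `min_support` shrinks the result quickly, which mirrors how the number of fragments drops as the support threshold grows in the evaluation.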


Page 10: Frag Flow: Automated Fragment Detection in Scientific Workflows

Filtering Relevant Fragments

The number of resulting fragments can be very large. We distinguish:

• Multistep fragments:
  • More than one step

• Filtered multistep fragments:
  • Multistep fragments
  • Contain all smaller fragments with the same number of occurrences
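One way to read the two categories above is sketched below, under stated assumptions: a fragment is dropped from the filtered set when a larger fragment with the same number of occurrences contains it. Containment is approximated by a subset test on labeled edges rather than a real subgraph check, and the `Fragment` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]


@dataclass(frozen=True)
class Fragment:
    steps: FrozenSet[str]   # step types in the fragment
    edges: FrozenSet[Edge]  # labeled dependencies between those steps
    occurrences: int        # how often the fragment was found in the corpus


def multistep(fragments: List[Fragment]) -> List[Fragment]:
    """Keep fragments with more than one step."""
    return [f for f in fragments if len(f.steps) > 1]


def filtered_multistep(fragments: List[Fragment]) -> List[Fragment]:
    """Drop multistep fragments contained in a larger one with equal occurrences."""
    candidates = multistep(fragments)
    kept = []
    for small in candidates:
        subsumed = any(
            large.occurrences == small.occurrences and small.edges < large.edges
            for large in candidates
        )
        if not subsumed:
            kept.append(small)
    return kept


# Hypothetical mining output.
found = [
    Fragment(frozenset({"A"}), frozenset(), 5),
    Fragment(frozenset({"A", "B"}), frozenset({("A", "B")}), 3),
    Fragment(frozenset({"A", "B", "C"}), frozenset({("A", "B"), ("B", "C")}), 3),
    Fragment(frozenset({"B", "C"}), frozenset({("B", "C")}), 4),
]
kept = filtered_multistep(found)
print(len(multistep(found)), len(kept))  # 3 2
```

In this toy run the fragment A→B disappears because A→B→C occurs equally often and contains it, while B→C stays because its occurrence count differs.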


Page 11: Frag Flow: Automated Fragment Detection in Scientific Workflows

Linking to the Corpus: Wf-fd


Page 12: Frag Flow: Automated Fragment Detection in Scientific Workflows

Linking to the Corpora: Example

[Figure: example showing a Fragment linked to the Corpus.]
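The linking step can be pictured as building an index from each detected fragment to the workflows in which it was found. The sketch below is only an illustration of that idea: it does not use the Wf-fd vocabulary, whose details are not in this transcript, and the function name `link_fragments`, the identifiers and the edge-subset containment test are assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def link_fragments(
    fragments: Dict[str, List[Edge]],
    corpus: Dict[str, List[Edge]],
) -> Dict[str, List[str]]:
    """Map each fragment id to the ids of the workflows that contain all its edges."""
    links: Dict[str, List[str]] = {}
    for fragment_id, fragment_edges in fragments.items():
        links[fragment_id] = sorted(
            workflow_id
            for workflow_id, workflow_edges in corpus.items()
            if set(fragment_edges) <= set(workflow_edges)
        )
    return links


# Hypothetical corpus and fragment ids.
workflows = {
    "wf1": [("A", "B"), ("B", "C")],
    "wf2": [("A", "B"), ("B", "C"), ("C", "D")],
    "wf3": [("A", "B"), ("B", "F")],
}
detected = {"frag1": [("A", "B"), ("B", "C")], "frag2": [("B", "F")]}
links = link_fragments(detected, workflows)
print(links)
```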

Page 13: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation

Three workflow corpora:

User Corpus 1 (WC1)
• Designed mostly by a single user
• General medical imaging
• 790 workflows (475 after data preparation)

User Corpus 2 (WC2)
• Created by a user, with collaborations of others
• Well-documented workflows, meant for reuse
• 113 workflows (96 after data preparation)

Multi User Corpus 3 (WC3)
• Workflows submitted by 62 users during the month of Jan 2014
• Several executions of the same workflows
• 5859 workflows (357 after data preparation)
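The effect of data preparation on each corpus is easier to compare as a share of workflows kept. The short calculation below uses only the counts listed above; the variable names are illustrative.

```python
# (workflows before preparation, workflows after preparation), from the slide.
corpora = {"WC1": (790, 475), "WC2": (113, 96), "WC3": (5859, 357)}

retained = {name: after / before for name, (before, after) in corpora.items()}
for name, share in retained.items():
    print(f"{name}: {share:.1%} of workflows kept")
```

WC3 keeps by far the smallest share, which fits the note that it contains several executions of the same workflows.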


Page 14: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Metrics

Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?

Goal 2: Do users find useful the fragments that were NOT similar to their defined groupings?

Page 15: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Inexact FGM techniques

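Corpus | Workflows (w) + groupings (g) | Inexact FGM heuristic | Frequency | Multistep fragments | Exact: fragments | Exact: precision | Exact: recall | Overlap (>80%): fragments | Overlap (>80%): precision | Overlap (>80%): recall
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
WC1 | 475 (w) + 209 (g) | MDL | min | 264 | 76 | 29% | 11% | 113 | 42% | 16%
WC1 | 475 (w) + 209 (g) | MDL | 2% | 64 | 21 | 32% | 3% | 27 | 42% | 3%
WC1 | 475 (w) + 209 (g) | MDL | 5% | 26 | 9 | 34% | 1% | 11 | 42% | 1%
WC1 | 475 (w) + 209 (g) | MDL | 10% | 19 | 8 | 42% | 1% | 10 | 52% | 1%
WC1 | 475 (w) + 209 (g) | Size | min | 381 | 136 | 35% | 19% | 223 | 58% | 32%
WC1 | 475 (w) + 209 (g) | Size | 2% | 52 | 20 | 38% | 2% | 32 | 61% | 4%
WC1 | 475 (w) + 209 (g) | Size | 5% | 22 | 8 | 36% | 1% | 14 | 63% | 3%
WC1 | 475 (w) + 209 (g) | Size | 10% | 10 | 3 | 30% | 0.4% | 8 | 80% | 1%
WC2 | 96 (w) + 108 (g) | MDL | min | 95 | 15 | 15% | 7% | 21 | 22% | 10%
WC2 | 96 (w) + 108 (g) | MDL | 2% | 95 | 15 | 15% | 7% | 21 | 22% | 10%
WC2 | 96 (w) + 108 (g) | MDL | 5% | 12 | 3 | 25% | 1% | 3 | 25% | 1%
WC2 | 96 (w) + 108 (g) | MDL | 10% | 5 | 2 | 40% | 1% | 2 | 40% | 1%
WC2 | 96 (w) + 108 (g) | Size | min | 88 | 17 | 19% | 8% | 34 | 38% | 16%
WC2 | 96 (w) + 108 (g) | Size | 2% | 88 | 17 | 19% | 8% | 34 | 38% | 16%
WC2 | 96 (w) + 108 (g) | Size | 5% | 14 | 4 | 28% | 2% | 9 | 64% | 4%
WC2 | 96 (w) + 108 (g) | Size | 10% | 4 | 3 | 75% | 1% | 3 | 75% | 1%
WC3 | 375 (w) + 175 (g) | MDL | min | 186 | 100 | 50% | 18% | 117 | 62% | 21%
WC3 | 375 (w) + 175 (g) | MDL | 2% | 23 | 7 | 30% | 1% | 11 | 47% | 2%
WC3 | 375 (w) + 175 (g) | MDL | 5% | 4 | 1 | 25% | 0.1% | 2 | 50% | 0.3%
WC3 | 375 (w) + 175 (g) | MDL | 10% | 0 | 0 | 0% | 0% | 0 | 0% | 0%
WC3 | 375 (w) + 175 (g) | Size | min | 178 | 101 | 56% | 18% | 119 | 66% | 22%
WC3 | 375 (w) + 175 (g) | Size | 2% | 22 | 12 | 54% | 2% | 16 | 72% | 3%
WC3 | 375 (w) + 175 (g) | Size | 5% | 8 | 3 | 37% | 0.5% | 4 | 50% | 0.7%
WC3 | 375 (w) + 175 (g) | Size | 10% | 0 | 0 | 0% | 0% | 0 | 0% | 0%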
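The metric definitions did not survive in this transcript, but the reported percentages appear consistent with precision computed as matching fragments divided by multistep fragments, and recall as matching fragments divided by the number of workflows plus groupings. That reading is inferred from the numbers in the table, not stated by the authors, so the sketch below should be taken as an assumption; it checks the WC1 row for the MDL heuristic at minimum frequency.

```python
def precision(matched: int, detected: int) -> float:
    """Share of detected multistep fragments that match a grouping (assumed definition)."""
    return matched / detected if detected else 0.0


def recall(matched: int, candidates: int) -> float:
    """Share of candidate workflows and groupings that were matched (assumed definition)."""
    return matched / candidates if candidates else 0.0


# WC1, MDL heuristic, minimum frequency (values taken from the table).
workflows, groupings = 475, 209
detected, exact, overlap = 264, 76, 113

exact_precision = precision(exact, detected)
exact_recall = recall(exact, workflows + groupings)
overlap_precision = precision(overlap, detected)
overlap_recall = recall(overlap, workflows + groupings)

print(f"exact:   precision {exact_precision:.1%}, recall {exact_recall:.1%}")
print(f"overlap: precision {overlap_precision:.1%}, recall {overlap_recall:.1%}")
```

The computed values land within one percentage point of the reported 29% and 11% (exact) and 42% and 16% (overlap); the small differences look like rounding.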


Page 16: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Inexact FGM techniques


Frequent fragments overlap with groupings in the single-user corpora (WC1 and WC2): at 10% frequency, precision is 30% to 75% for exact matches and 40% to 80% with >80% overlap.

Page 17: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Inexact FGM techniques


Precision decreases in the multi-user corpus (WC3). The best results are 50% to 56% for exact matches at minimum frequency.


Page 19: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Exact FGM techniques

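Corpus | Workflows (w) + groupings (g) | Support | Multistep fragments | Filtered multistep fragments | Exact: fragments | Exact: precision | Exact: recall | Overlap (>80%): fragments | Overlap (>80%): precision | Overlap (>80%): recall
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
WC1 | 475 (w) + 209 (g) | 5% | Out of memory | - | - | - | - | - | - | -
WC1 | 475 (w) + 209 (g) | 10% | 51613 | 16 | 1 | 6.2% | 0.1% | 11 | 69% | 1%
WC1 | 475 (w) + 209 (g) | 15% | 2264 | 8 | 6 | 75% | 0.8% | 6 | 75% | 0.8%
WC1 | 475 (w) + 209 (g) | 20% | 3 | 1 | 0 | 0% | 0% | 0 | 0% | 0%
WC2 | 96 (w) + 108 (g) | 5% | Out of memory | - | - | - | - | - | - | -
WC2 | 96 (w) + 108 (g) | 10% | 33236 | 4 | 0 | 0% | 0% | 1 | 25% | 0.4%
WC2 | 96 (w) + 108 (g) | 15% | 25 | 2 | 0 | 0% | 0% | 0 | 0% | 0%
WC2 | 96 (w) + 108 (g) | 20% | 0 | 0 | 0 | - | - | 0 | - | -
WC3 | 375 (w) + 175 (g) | 5% | 5701 | 3 | 1 | 33% | 0.1% | 1 | 33% | 0.1%
WC3 | 375 (w) + 175 (g) | 10% | 1074 | 1 | 1 | 100% | 0.1% | 1 | 100% | 0.1%
WC3 | 375 (w) + 175 (g) | 15% | 1 | 1 | 0 | 0% | 0% | 0 | 0% | 0%
WC3 | 375 (w) + 175 (g) | 20% | 0 | 0 | 0 | - | - | 0 | - | -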


Page 20: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Exact FGM techniques


Fewer results than with inexact FGM, even when high numbers of fragments are found

Page 21: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Exact FGM techniques


How users define fragments affects the results

Page 22: Frag Flow: Automated Fragment Detection in Scientific Workflows

Preliminary Evaluation: User based evaluation

• Manual evaluation: each user is given 16-18 common workflow fragments detected by FragFlow
• 66% and 100% accuracy respectively
• Some of the reasons to not use fragments depended on the user preferences
• Currently evaluating additional users

User | Use as proposed | Use with minor changes | Use with major changes | Use (total)
--- | --- | --- | --- | ---
User 1 (WC1) | 11% | 16.6% | 38% | 66.6%
User 2 (WC2) | 44% | 6% | 50% | 100%

Page 23: Frag Flow: Automated Fragment Detection in Scientific Workflows

Evaluation: Grouping analysis

• Workflows with groupings are more common in single user corpora (WC1 and WC2)

• Groupings are reused
  • 1463 groupings versus 209 unique groupings in WC1
  • 302 groupings versus 108 unique groupings in WC2
  • 456 groupings versus 175 unique groupings in WC3

• Grouping size ranges from 60 to 0
  • Facilitate copy paste by users (large grouping size)
  • Reducing unnecessary inputs (groupings with no steps)

Corpus | Total group. | Unique multistep group. | Wf with group. | Avg. group. per wf | Max nº of steps in group. | Min nº of steps in group.
--- | --- | --- | --- | --- | --- | ---
WC1 | 1463 | 209 | 327 | 4 | 56 | 1
WC2 | 302 | 108 | 42 | 7 | 39 | 0
WC3 | 456 | 175 | 89 | 5 | 60 | 1
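The degree of grouping reuse can be read off the table by dividing the total number of groupings by the number of unique groupings in each corpus. The calculation below uses only those two columns; the variable names are illustrative.

```python
# (total groupings, unique groupings) per corpus, from the table above.
grouping_counts = {"WC1": (1463, 209), "WC2": (302, 108), "WC3": (456, 175)}

reuse_ratio = {
    name: total / unique for name, (total, unique) in grouping_counts.items()
}
for name, ratio in reuse_ratio.items():
    print(f"{name}: each unique grouping appears about {ratio:.1f} times")
```

WC1 shows the strongest reuse, with each unique grouping appearing about seven times on average.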

Page 24: Frag Flow: Automated Fragment Detection in Scientific Workflows

Findings

With respect to our goals…

• Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
  • (With 10% frequency, single user, inexact FGM) 30% to 75% of the FragFlow fragments found correspond directly to user-defined groupings
  • (Multi user) The best results are 50% to 56% with inexact FGM at minimum frequency. If we consider an overlap of 80% of the steps, the precision is 62% to 66%

• Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
  • For one user 66% of the proposed fragments were useful; for another, 100% were useful
  • Further evaluation is needed

• Goal 3: How are workflows and groupings reused?
  • Workflows with groupings have at least 4 groupings on average
  • Reuse of groupings (the total number of groupings is up to 7 times the number of unique groupings in the corpora)

Page 25: Frag Flow: Automated Fragment Detection in Scientific Workflows

Limitations

• Graph mining is computationally hard (subgraph isomorphism is NP-complete)
  • Big fragments can take time to be recognized
  • Errors derived from heap memory issues

• Detection of groupings may depend on user preferences on size and frequency

Page 26: Frag Flow: Automated Fragment Detection in Scientific Workflows

Conclusions and Future Work

• FragFlow: approach to find the most common fragments in a corpus of workflows
  • Several integrated graph mining techniques

• FragFlow can be used with different settings
  • Minimum or maximum frequency and support
  • Size
  • Type of the graph mining algorithm to be applied

• Evaluation of the results using corpora belonging to the LONI Pipeline system

• New algorithms are being integrated!
  • Sigma (inexact FGM), Gaston (exact FGM)

• Future work
  • Test FragFlow with other workflow systems and domains, and perform further user evaluations
  • Evaluate how workflow quality improves when automatically mined workflow fragments are proposed to users

Evaluation and resources available here: http://purl.org/net/escience2014

Page 27: Frag Flow: Automated Fragment Detection in Scientific Workflows

Who are we?

• Daniel Garijo, Oscar Corcho: Ontology Engineering Group, UPM
• Yolanda Gil: Information Sciences Institute, USC
• Boris A. Gutman, Ivo D. Dinov, Paul Thompson, Arthur W. Toga: USC Laboratory of Neuro Imaging

Page 28: Frag Flow: Automated Fragment Detection in Scientific Workflows

Want to collaborate? Contact me at [email protected]


Questions?

