FragFlow: Automated Fragment Detection in Scientific Workflows
DESCRIPTION
eScience 2014, Guarujá (Brasil). Abstract: Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse.
Date: 24/10/2014
FragFlow:
Automatic Fragment Detection in
Scientific Workflows
Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ
* Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute,
ⱡ USC Laboratory of Neuroimaging
Overview
• Detecting common groups of tasks in a corpus of scientific workflows
• Application of exact and inexact graph matching techniques
• Filtering and linking results to the input corpus
• Benefits: discoverability, understandability, reuse, design, modularization, visualization
[Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]
IEEE eScience 2014. Guarujá, Brasil
Background
• Workflows are software artifacts that capture computational experiments
  • Addition to paper publication
  • Provenance of results
  • Reuse
• Existing repositories of workflows (Galaxy, myExperiment, the LONI Pipeline, CrowdLabs, etc.)
  • Sharing workflows
  • Exploring existing workflows
• Problems to address:
  • Workflows have many detailed steps and may be difficult to understand
  • The general method may not be apparent
  • How are different workflows related?
  • What steps do they have in common?
Workflow Fragments and Groupings
• Workflow Fragment: set of connected steps that are part of a workflow
• Common Workflow Fragment: fragment that occurs more than once in a corpus of workflows
• Grouping: workflow fragment manually annotated by a user
• Sub-Grouping: grouping included as part of another grouping
[Figure: Workflow 1, Workflow 2 and Workflow 3, composed of steps labeled A to H, with their common workflow fragments]
Our Goals
Our goal is to automatically detect useful workflow fragments to be reused by scientists. In this work, given a workflow corpus…
• Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
• Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
• Goal 3: How are workflows and groupings reused?
The LONI Pipeline
• Workflow system for neuroimaging analysis
• Active community of users creating workflows
• Enables users to define groupings in workflows
• Has a corpus of published workflows
• Has a library of (uniquely identified) components with well-defined functionality
http://pipeline.loni.usc.edu/explore/library-navigator/
Workflow Mining in FragFlow
[Figure: workflow mining in FragFlow, shown as four numbered steps starting from the corpus]
Corpus Preparation
• Workflows are converted to Labeled Directed Acyclic Graphs (LDAG)
  • The label of a node in the graph corresponds to the type of the step in the workflow
  • Edges capture the dependencies between different steps
• Duplicated workflows are removed
• Single-step workflows are removed
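The preparation steps above can be sketched as follows. This is only an illustration under stated assumptions: the input format (a dictionary of step ids to step types plus a list of links), the function names and the step types `Align`, `Segment` and `Smooth` are hypothetical, and duplicates are detected with a simple label-based fingerprint rather than a full graph isomorphism test.

```python
from collections import Counter


def to_ldag(steps, links):
    """Build a labeled graph from a workflow.

    steps: dict mapping a step id to its step type (the node label).
    links: iterable of (source_id, target_id) dependencies.
    Acyclicity is assumed, not checked.
    """
    return {"labels": dict(steps), "edges": sorted(set(links))}


def signature(ldag):
    """Order-independent fingerprint built from node and edge labels.

    Heuristic only: two different graphs could share a fingerprint.
    """
    labels = ldag["labels"]
    node_counts = Counter(labels.values())
    edge_counts = Counter((labels[s], labels[t]) for s, t in ldag["edges"])
    return (tuple(sorted(node_counts.items())),
            tuple(sorted(edge_counts.items())))


def prepare_corpus(workflows):
    """Convert workflows to LDAGs, dropping single-step and duplicated ones."""
    seen, kept = set(), []
    for steps, links in workflows:
        if len(steps) < 2:          # single-step workflows are removed
            continue
        graph = to_ldag(steps, links)
        fingerprint = signature(graph)
        if fingerprint in seen:     # duplicated workflows are removed
            continue
        seen.add(fingerprint)
        kept.append(graph)
    return kept


corpus = [
    ({"s1": "Align", "s2": "Segment"}, [("s1", "s2")]),
    ({"a": "Align", "b": "Segment"}, [("a", "b")]),   # duplicate of the first
    ({"x": "Align"}, []),                              # single step
    ({"p": "Align", "q": "Smooth"}, [("p", "q")]),
]
prepared = prepare_corpus(corpus)
print(len(prepared))
```

Because node labels are step types rather than step ids, the second workflow is treated as a copy of the first even though its ids differ.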
Graph Mining
We use popular graph mining techniques:
• Inexact FGM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete
  • SUBDUE
    • 2 heuristics: Minimum Description Length (MDL) and Size
    • Frequency based
• Exact FGM: delivers all the fragments that can be found in the dataset
  • gSpan
    • Depth-first search strategy
    • Support based
  • FSG
    • Breadth-first search strategy
    • Support based
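To make the difference between "frequency based" and "support based" concrete, here is a minimal sketch that counts both for the smallest possible fragments (two connected steps). It assumes that support means the number of workflows containing a fragment and frequency means the total number of occurrences in the corpus; the data layout and function names are hypothetical. It is not an implementation of SUBDUE, gSpan or FSG, only the counting step that such miners build on.

```python
from collections import Counter


def edge_fragments(workflow):
    """Return the two-step fragments of a workflow as (type, type) pairs."""
    labels, edges = workflow
    return [(labels[src], labels[dst]) for src, dst in edges]


def count_support_and_frequency(corpus):
    """Support counts workflows; frequency counts every occurrence."""
    support, frequency = Counter(), Counter()
    for workflow in corpus:
        fragments = edge_fragments(workflow)
        frequency.update(fragments)        # every occurrence counts
        support.update(set(fragments))     # at most once per workflow
    return support, frequency


def frequent_fragments(corpus, min_support):
    """Fragments present in at least min_support (a fraction) of the corpus."""
    support, _ = count_support_and_frequency(corpus)
    threshold = min_support * len(corpus)
    return {frag for frag, count in support.items() if count >= threshold}


corpus = [
    ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    ({1: "A", 2: "B", 3: "A", 4: "B"}, [(1, 2), (3, 4)]),
    ({1: "A", 2: "C"}, [(1, 2)]),
]
support, frequency = count_support_and_frequency(corpus)
print(support[("A", "B")], frequency[("A", "B")])
print(frequent_fragments(corpus, 0.5))
```

In this toy corpus the fragment A→B appears in two workflows but three times overall, which is why the two measures can rank fragments differently.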
Filtering Relevant Fragments
The number of resulting fragments can be very large. We distinguish:
• Multistep fragments: fragments with more than one step
• Filtered multistep fragments: multistep fragments that contain all smaller fragments with the same number of occurrences (a smaller fragment is dropped when a larger fragment that includes it occurs equally often)
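The filtering rule can be sketched as below. The `Fragment` class and its fields are hypothetical, and containment is approximated by comparing sets of step types and labeled edges, which assumes each step type appears at most once per fragment; a real implementation would need a sub-graph test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    steps: frozenset        # step types in the fragment
    edges: frozenset        # (source type, target type) pairs
    occurrences: int        # how often the miner found the fragment


def is_multistep(fragment):
    return len(fragment.steps) > 1


def contains(big, small):
    """Approximate check that `small` is a proper part of `big`."""
    return (small != big
            and small.steps <= big.steps
            and small.edges <= big.edges)


def filter_fragments(fragments):
    """Keep multistep fragments not covered by a larger, equally frequent one."""
    multistep = [f for f in fragments if is_multistep(f)]
    return [
        f for f in multistep
        if not any(contains(g, f) and g.occurrences == f.occurrences
                   for g in multistep)
    ]


ab = Fragment(frozenset("AB"), frozenset({("A", "B")}), 5)
abc = Fragment(frozenset("ABC"), frozenset({("A", "B"), ("B", "C")}), 5)
bc = Fragment(frozenset("BC"), frozenset({("B", "C")}), 8)
a = Fragment(frozenset("A"), frozenset(), 10)

print(filter_fragments([ab, abc, bc, a]))
```

Here `ab` is dropped because `abc` includes it and occurs the same number of times, `bc` survives because it occurs more often than `abc`, and `a` is dropped for having a single step.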
Linking to the Corpus: Wf-fd
Linking to the Corpora: Example
[Figure labels: Corpus, Fragment]
Evaluation
Three workflow corpora:
• User Corpus 1 (WC1)
  • Designed mostly by a single user
  • General medical imaging
  • 790 workflows (475 after data preparation)
• User Corpus 2 (WC2)
  • Created by a user, with collaborations of others
  • Well documented workflows, meant for reuse
  • 113 workflows (96 after data preparation)
• Multi User Corpus 3 (WC3)
  • Workflows submitted by 62 users during January 2014
  • Several executions of the same workflows
  • 5859 workflows (357 after data preparation)
Evaluation: Metrics
Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
Goal 2: Do users find the fragments that were NOT similar to their defined groupings useful?
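A minimal sketch of how precision and recall could be computed for Goal 1, with both an exact match and an overlap above 80%. Several choices here are assumptions rather than the definitions used in the evaluation: fragments and groupings are reduced to sets of step types, overlap is measured as shared steps divided by the size of the larger set, and recall uses the textbook definition (groupings matched by at least one fragment, over all groupings).

```python
def overlap(fragment, grouping):
    """Shared steps divided by the size of the larger set (assumed measure)."""
    if not fragment or not grouping:
        return 0.0
    return len(fragment & grouping) / max(len(fragment), len(grouping))


def matches(candidate, references, threshold=None):
    """Exact match when threshold is None, otherwise overlap above threshold."""
    if threshold is None:
        return any(candidate == ref for ref in references)
    return any(overlap(candidate, ref) > threshold for ref in references)


def precision(fragments, groupings, threshold=None):
    """Fraction of detected fragments that match some grouping."""
    if not fragments:
        return 0.0
    hits = sum(1 for f in fragments if matches(f, groupings, threshold))
    return hits / len(fragments)


def recall(fragments, groupings, threshold=None):
    """Fraction of groupings matched by some detected fragment."""
    if not groupings:
        return 0.0
    found = sum(1 for g in groupings if matches(g, fragments, threshold))
    return found / len(groupings)


fragments = [frozenset("AB"), frozenset("ABCDE"), frozenset("XY")]
groupings = [frozenset("AB"), frozenset("ABCDEF"), frozenset("PQ")]

print(precision(fragments, groupings))                  # exact match
print(precision(fragments, groupings, threshold=0.8))   # overlap above 80%
```

With the toy data, only `AB` matches exactly, while the five-step fragment also counts under the overlap criterion because it shares five of the six steps of the largest grouping.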
Evaluation: Inexact FGM techniques
Corpus  Workflows (w) +     Inexact  Frequency  MultiStep  Exact                        Overlap (>80%)
        groupings (g)       FGM                 Frag.      Fragment  Precision  Recall  Fragment  Precision  Recall
WC1     475 (w) + 209 (g)   MDL      min        264        76        29%        11%     113       42%        16%
                                     2%         64         21        32%        3%      27        42%        3%
                                     5%         26         9         34%        1%      11        42%        1%
                                     10%        19         8         42%        1%      10        52%        1%
                            Size     min        381        136       35%        19%     223       58%        32%
                                     2%         52         20        38%        2%      32        61%        4%
                                     5%         22         8         36%        1%      14        63%        3%
                                     10%        10         3         30%        0.4%    8         80%        1%
WC2     96 (w) + 108 (g)    MDL      min        95         15        15%        7%      21        22%        10%
                                     2%         95         15        15%        7%      21        22%        10%
                                     5%         12         3         25%        1%      3         25%        1%
                                     10%        5          2         40%        1%      2         40%        1%
                            Size     min        88         17        19%        8%      34        38%        16%
                                     2%         88         17        19%        8%      34        38%        16%
                                     5%         14         4         28%        2%      9         64%        4%
                                     10%        4          3         75%        1%      3         75%        1%
WC3     375 (w) + 175 (g)   MDL      min        186        100       50%        18%     117       62%        21%
                                     2%         23         7         30%        1%      11        47%        2%
                                     5%         4          1         25%        0.1%    2         50%        0.3%
                                     10%        0          0         0%         0%      0         0%         0%
                            Size     min        178        101       56%        18%     119       66%        22%
                                     2%         22         12        54%        2%      16        72%        3%
                                     5%         8          3         37%        0.5%    4         50%        0.7%
                                     10%        0          0         0%         0%      0         0%         0%
Frequent fragments overlap with groupings in single user corpora: with 10% frequency, precision is 30% to 75% for exact matches and 40% to 80% when considering overlap.
Precision decreases in the Multi user corpus. Best results are 50% to 56% with minimum frequency.
Evaluation: Exact FGM techniques
Corpus  Wf (w) +            Support  MultiStep      MultiStep Filtered  Exact                         Overlap (>80%)
        groups. (g)                  Fragments      Fragments           Fragments  Precision  Recall  Fragments  Precision  Recall
WC1     475 (w) + 209 (g)   5%       Out of memory  -                   -          -          -       -          -          -
                            10%      51613          16                  1          6.2%       0.1%    11         69%        1%
                            15%      2264           8                   6          75%        0.8%    6          75%        0.8%
                            20%      3              1                   0          0%         0%      0          0%         0%
WC2     96 (w) + 108 (g)    5%       Out of memory  -                   -          -          -       -          -          -
                            10%      33236          4                   0          0%         0%      1          25%        0.4%
                            15%      25             2                   0          0%         0%      0          0%         0%
                            20%      0              0                   0          -          -       0          -          -
WC3     375 (w) + 175 (g)   5%       5701           3                   1          33%        0.1%    1          33%        0.1%
                            10%      1074           1                   1          100%       0.1%    1          100%       0.1%
                            15%      1              1                   0          0%         0%      0          0%         0%
                            20%      0              0                   0          -          -       0          -          -
Fewer results than inexact FGM, even when high numbers of fragments are found
How users define fragments affects the results
Preliminary Evaluation: User based evaluation
• Manual evaluation: each user is given 16-18 common workflow fragments detected by FragFlow
• 66% and 100% accuracy respectively
• Some of the reasons to not use fragments depended on the user preferences
• Currently evaluating additional users

User           Use as proposed  Use with minor changes  Use with major changes  Use
User 1 (WC1)   11%              16.6%                   38%                     66.6%
User 2 (WC2)   44%              6%                      50%                     100%
Evaluation: Grouping analysis
• Workflows with groupings are more common in single user corpora (WC1 and WC2)
• Groupings are reused
  • 1463 groupings versus 209 unique groupings in WC1
  • 302 groupings versus 108 unique groupings in WC2
  • 456 groupings versus 175 unique groupings in WC3
• Grouping size ranges from 0 to 60 steps
  • Large groupings facilitate copy and paste by users
  • Groupings with no steps reduce unnecessary inputs

Corpus  Total   Unique multistep  Wf with  Avg. group.  Max nº of steps  Min nº of steps
        group.  group.            group.   per wf       in group.        in group.
WC1     1463    209               327      4            56               1
WC2     302     108               42       7            39               0
WC3     456     175               89       5            60               1
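The reuse figures can be checked with a few lines of arithmetic. The sketch below uses the counts from the table; the dictionary layout and function names are only illustrative.

```python
# Counts taken from the grouping analysis table.
grouping_stats = {
    "WC1": {"total": 1463, "unique": 209, "workflows_with_groupings": 327},
    "WC2": {"total": 302, "unique": 108, "workflows_with_groupings": 42},
    "WC3": {"total": 456, "unique": 175, "workflows_with_groupings": 89},
}


def reuse_ratio(stats):
    """How many times, on average, each unique grouping appears."""
    return stats["total"] / stats["unique"]


def groupings_per_workflow(stats):
    """Average number of groupings in workflows that have groupings."""
    return stats["total"] / stats["workflows_with_groupings"]


for corpus, stats in grouping_stats.items():
    print(corpus,
          round(reuse_ratio(stats), 1),
          round(groupings_per_workflow(stats), 1))
```

The ratio of total to unique groupings is exactly 7 for WC1 and roughly 2.8 and 2.6 for WC2 and WC3, and the averages per workflow (about 4.5, 7.2 and 5.1) agree with the whole numbers shown in the table.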
Findings
With respect to our goals…
• Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
  • (with freq 10%, single user, inexact FGM) 30% to 75% of the total FragFlow fragments found correspond directly to user-defined groupings
  • (multi user) Best results are 50% to 56% with inexact FGM and minimum frequency. If we consider an overlap of 80% of the steps, the precision is 62% to 66%
• Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
  • For one user 66% of the proposed fragments were useful, for another 100% were useful
  • Further evaluation is needed
• Goal 3: How are workflows and groupings reused?
  • Workflows with groupings have, on average, at least 4 groupings
  • Reuse of groupings (the total number of groupings is up to 7 times the number of unique groupings in the corpora)
Limitations
• Graph mining is an NP-complete problem
  • Big fragments can take time to be recognized
  • Errors derived from memory heap issues
• Detection of groupings may depend on user preferences on size and frequency
Conclusions and Future Work
• FragFlow: approach to find the most common fragments in a corpus of workflows
  • Several integrated graph mining techniques
  • FragFlow can be used with different settings
    • Minimum or maximum frequency and support
    • Size
    • Type of graph mining algorithm to be applied
• Evaluation of the results using corpora belonging to the LONI Pipeline system
• New algorithms are being integrated!
  • Sigma (inexact FGM), Gaston (exact FGM)
• Future work
  • Test FragFlow with other workflow systems and domains, and perform further user evaluations
  • Evaluate how workflow quality improves when automatically mined workflow fragments are proposed to users
Evaluation and resources available here: http://purl.org/net/escience2014
Who are we?
• Daniel Garijo, Oscar Corcho: Ontology Engineering Group, UPM
• Yolanda Gil: Information Sciences Institute, USC
• Boris A. Gutman, Ivo D. Dinov, Paul Thompson, Arthur W. Toga: USC Laboratory of Neuro Imaging
Want to collaborate? Contact me at [email protected]
Questions?