PhD Thesis: Mining Abstractions in Scientific Workflows
Date: 03/12/2015
Mining Abstractions in Scientific Workflows
Daniel Garijo *
Supervisors: Oscar Corcho *, Yolanda Gil Ŧ
* Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute
Introduction
Lab book
Digital Log
Laboratory Protocol (recipe)
Scientific Workflow
Experiment
In silico experiment
Benefits of workflows
Time savings
• Copy & paste fragments of workflows
Teaching
• Reduce the learning curve of new students
Visualization
• Simplify workflows
Design for modularity
• Highlight the most relevant steps on a workflow
Design for standardization
Debugging
• Provenance exploration
Reproducibility and inspectability
Motivation of this work
Workflow Repositories
Workflow Systems
Let’s share!
I want to reuse…?
I want to understand…?
I want to repurpose…?
Open research challenges
•Workflow representation heterogeneity
Workflow Repositories
How can we represent a description of workflows and their metadata?
How can we facilitate the homogeneous consumption of workflows and their resources?
•Inadequate level of workflow abstraction
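What are the most relevant parts of a workflow?
[Figure: two concrete workflows, Dataset → PorterStemmer → Result → IDF → FinalResult and Dataset → LovinsStemmer → Result → ResidualIDF → FinalResult, shown next to the abstract workflow Dataset → Stemmer → Result → Term Weighting → FinalResult]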
Are two seemingly disparate workflows related at a higher level of abstraction?
•Difficulties for workflow reuse
How is a workflow related to other workflows?
Which workflow (parts) are potentially useful for reuse?
•Lack of support for workflow annotation
How can we facilitate the annotation process?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Hypothesis
Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for workflow developers aiming to reuse existing workflows.
• H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps.
• H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
• H.3: Commonly occurring patterns are potentially useful for users designing workflows.
Workflow abstraction
Workflow representation
Workflow reuse
Workflow annotation
Workflow reuse
Contributions
Workflow representation and publication
• Model for representing workflow templates and executions
• Methodology to publish workflows in the web
Workflow abstraction
• A catalog of common domain independent workflow patterns based on the functionality of workflow steps
• A method to extract generic commonly occurring workflow fragments automatically
Workflow annotation
• A model and means for annotating semi-automatically the abstractions in workflows
Workflow reuse
• Metrics for assessing the usefulness of a fragment for reuse
• A model to describe and annotate workflow fragments
OPMW
Linked Data
Wf-motifs, Wf-fd
Workflow motifs
Graph mining
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
   a) Requirements
   b) The OPMW model
   c) Publishing workflows as Linked Data
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Workflow representation: Structures interchanged in the workflow lifecycle
[Figure: the three structures of the workflow lifecycle]
• Design produces a Workflow Template, e.g. Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult
• Instantiation produces Workflow Instances, e.g. File: Dataset123 → LovinsStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2, and File: Dataset124 → PorterStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2
• Execution produces Workflow Execution Traces, e.g. File: Dataset123 → LovinsStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2 (one trace per run of each instance)
Requirements
Workflow template description
• Plan: P-Plan [Garijo et al 2012] http://purl.org/net/p-plan
Workflow execution trace description
• Provenance: PROV (W3C) [Lebo et al 2013] http://www.w3.org/ns/prov#
Workflow attribution
• Dublin Core, PROV (W3C)
Workflow metadata
Link between templates and executions
Related models: Scufl, DAX, AGWL, Dispel, IWIR, OPM, OBI, EXPO, ISA, PAV, RO, D-PROV
[Ciccarese et al 2013] [Moreau et al 2011] [Brinkman et al 2010] [Soldatova and King 2006] [Rocca et al 2008] [Belhajjame et al 2012] [Missier et al 2013] [Oinn et al 2004] [Fahringer et al 2005] [Atkinson et al 2013] [Plankensteiner et al 2005]
OPMW: Extending provenance standards and plan models
http://www.opmw.org/ontology/
[Figure: the OPMW model, with a legend for classes, object properties, instances, "instance of" and "subclass of"]
P-Plan extension:
• opmw:WorkflowTemplate (subclass of p-plan:Plan)
• opmw:WorkflowTemplateProcess (subclass of p-plan:Step)
• opmw:WorkflowTemplateArtifact (subclass of p-plan:Variable)
PROV, OPM extension:
• opmw:WorkflowExecutionProcess (subclass of prov:Activity and opmv:Process)
• opmw:WorkflowExecutionArtifact (subclass of prov:Entity and opmv:Artifact)
• opmw:WorkflowExecutionAccount (subclass of prov:Bundle and opmo:Account)
Template example: template1 with Input Dataset, Term Weighting and Topics, related through opmw:isStepOfTemplate, opmw:isVariableOfTemplate, p-plan:hasInputVar and p-plan:isOutputVarOf
Execution example: execution1 with File: Dataset123, IDF(java) and File: FResultaa2, related through prov:used, prov:wasGeneratedBy and opmo:account
Execution to template links: opmw:correspondsToTemplate, opmw:correspondsToTemplateProcess, opmw:correspondsToTemplateArtifact
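To make the template/execution link concrete, the following is a minimal sketch that writes the slide's example as plain subject-predicate-object tuples. The property names are the ones shown on the slide; every `ex:` resource name, and the two helper functions, are hypothetical and only serve the illustration. No RDF library is assumed.

```python
# Illustrative triples only: the "ex:" resource names are hypothetical,
# while the property names are the ones shown on the slide.
TRIPLES = [
    # Template level (P-Plan extension)
    ("ex:TermWeighting", "opmw:isStepOfTemplate", "ex:template1"),
    ("ex:InputDataset", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:Topics", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:TermWeighting", "p-plan:hasInputVar", "ex:InputDataset"),
    ("ex:Topics", "p-plan:isOutputVarOf", "ex:TermWeighting"),
    # Execution level (PROV / OPM extension)
    ("ex:IDF_java", "prov:used", "ex:Dataset123"),
    ("ex:FResultaa2", "prov:wasGeneratedBy", "ex:IDF_java"),
    # Links from the execution back to the template
    ("ex:IDF_java", "opmw:correspondsToTemplateProcess", "ex:TermWeighting"),
    ("ex:Dataset123", "opmw:correspondsToTemplateArtifact", "ex:InputDataset"),
    ("ex:FResultaa2", "opmw:correspondsToTemplateArtifact", "ex:Topics"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def template_inputs_of_execution(process):
    """Follow execution process -> template step -> declared input variables."""
    inputs = []
    for step in objects(process, "opmw:correspondsToTemplateProcess"):
        inputs.extend(objects(step, "p-plan:hasInputVar"))
    return inputs
```

With these triples, asking which template variables the `ex:IDF_java` execution was planned to consume walks from the provenance trace to the plan, which is the kind of query the template/execution link is meant to support.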
Outline
1. Introduction and motivation
2. Hypothesis and work methodology
3. Workflow representation: OPMW
   a) Requirements
   b) The OPMW model
   c) Publishing workflows as Linked Data
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Publishing workflows as Linked Data
Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner
Adapted methodology from [Villazón-Terrazas et al 2011]
Tested it for the Wings workflow system
1. Specification
Base URI = http://www.opmw.org/
Ontology URI = http://www.opmw.org/ontology/
Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName
Examples:
http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING
http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
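The URI pattern above can be sketched as a small helper. This is only an illustration of the naming scheme stated on the slide; the function name `assertion_uri` is hypothetical, and percent-encoding the two path segments is an added assumption rather than something the slide specifies.

```python
from urllib.parse import quote

BASE_URI = "http://www.opmw.org/"


def assertion_uri(class_name, instance_name):
    """Build Base URI + export/resource/ClassName/instanceName.

    Percent-encoding of each segment is an assumption made for safety;
    the slide only gives the overall pattern.
    """
    return (
        f"{BASE_URI}export/resource/"
        f"{quote(class_name, safe='')}/{quote(instance_name, safe='')}"
    )
```

Calling `assertion_uri("WorkflowTemplate", "ABSTRACTSUBWFDOCKING")` reproduces the first example URI listed on the slide.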
2. Modeling
OPMW, P-Plan, OPM, DC, PROV
3. Generation
[Figure: Workflow system (Workflow Template, Workflow execution) → OPMW export → OPMW RDF]
4. Publication
[Figure: OPMW RDF → RDF Upload Interface → RDF triple store and permanent web-accessible file store → SPARQL Endpoint]
5. Exploitation
[Figure: the SPARQL endpoint accessed through Curl, a Linked Data Browser and the Workflow Explorer]
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
   a) A catalog of common workflow abstractions
   b) Workflow reuse analysis
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
A catalog of common workflow abstractions
Generalization of workflow steps based on functionality.
Workflow motif: Domain independent conceptual abstraction on the workflow steps.
1. Data-oriented motifs: What kind of manipulations does the workflow have?
• E.g.: Data retrieval, Data preparation, Data curation, Data visualization, etc.
2. Workflow-oriented motifs: How does the workflow perform its operations?
• E.g.: Stateful steps, Stateless steps, Human interactions, etc.
Methodology for finding workflow motifs
Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence
1. Collect workflows: 89 + 125 + 26 + 20 = 260 workflows
2. Preliminary workflow analysis (Researcher 1, Researcher 2, Researcher 3)
3. Agreement and cross validation
Result Summary
• Over 60% of the motifs are data preparation motifs
• Some differences are motivated by the workflow systems in the analysis
• Around 40% of workflows contain motifs related to workflow reuse (composite workflows, internal macros)
But how do users perceive workflow reuse?
What about fragments of workflows?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
   a) A catalog of common workflow abstractions
   b) Workflow reuse survey
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Use case: The LONI Pipeline
Workflow system for neuroimaging analysis
http://pipeline.loni.usc.edu/explore/library-navigator/
Discussions with scientists
User survey
Collect responses from users (21 responses)
Discuss results
Summary results
• The majority of users agree that reusing and sharing workflows is useful
• Unlike with workflows, respondents find reusing groupings from their own work more useful than reusing groupings from others
• Most respondents agreed that groupings help simplify workflows
• Groupings also make workflows more understandable by others
Can we detect groupings automatically?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
   a) Corpus preparation
   b) Graph mining
   c) Fragment filtering
   d) Fragment linking
6. Evaluation
7. Conclusions and future work
Workflow mining approaches
• Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson 2014]
  [Figure: a workflow corpus partitioned into Cluster 1, Cluster 2 and Cluster 3]
• Topic modeling [Stoyanovich et al 2010]
  [Figure: workflows described by topic probabilities, e.g. P(Topic1) = 0.7, P(Topic2) = 0.3; P(Topic1) = 0.5, P(Topic2) = 0.5; P(Topic1) = 0.2, P(Topic2) = 0.8]
• Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]
• Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho 2008]
  [Figure: workflow corpus, PSM]
• Graph mining [Diamantini et al 2012]
Workflow Mining in FragFlow
1. Corpus preparation
2. Graph mining
3. Fragment filtering
4. Fragment linking
Corpus Preparation
Workflows converted to Labeled Directed Acyclic Graphs (LDAG)
• The label of a node in the graph corresponds to the type of the step in the workflow
• Edges capture the dependencies between different steps
[Figure: Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult is reduced to Stemmer algorithm → Term weighting algorithm]
• Duplicated workflows are removed
• Single-step workflows are removed
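The corpus preparation step can be sketched as follows. This is a simplified illustration, not FragFlow's code: the input format (a dict of step ids to type labels plus a list of dependency pairs) and both function names are assumptions, and duplicates are detected with a label-level signature, which is cheaper than a full graph isomorphism test and can conflate distinct graphs when a workflow repeats the same step type.

```python
def to_ldag(step_types, dependencies):
    """Build a label-level signature of a workflow.

    step_types: dict mapping a step id to its type label.
    dependencies: iterable of (source_step_id, target_step_id) pairs.
    Data nodes are assumed to have been dropped already, so edges go
    directly from step to step.
    """
    labels = tuple(sorted(step_types.values()))
    edges = tuple(sorted((step_types[a], step_types[b]) for a, b in dependencies))
    return labels, edges


def prepare_corpus(workflows):
    """Drop single-step workflows and duplicated workflows."""
    seen = set()
    kept = []
    for step_types, dependencies in workflows:
        if len(step_types) < 2:
            continue  # single-step workflows are removed
        signature = to_ldag(step_types, dependencies)
        if signature in seen:
            continue  # duplicated workflows are removed
        seen.add(signature)
        kept.append(signature)
    return kept
```

Two workflows that differ only in their step identifiers collapse to the same signature, so only one of them survives the preparation.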
Graph Mining
We use popular graph mining techniques:
Inexact FSM (frequent subgraph mining): usage of heuristics to calculate similarity between two graphs. The solution might not be complete.
• SUBDUE: 2 heuristics, Minimum Description Length (MDL) and Size
Exact FSM: deliver all the possible fragments to be found in the dataset.
• gSpan: depth first search strategy
• FSG: breadth first search strategy
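The notion of support that exact FSM relies on can be illustrated with a deliberately tiny sketch. It is not gSpan or FSG: it only counts two-step fragments (single labeled edges), whereas the real algorithms grow candidate subgraphs of any size. The corpus format and the function name are assumptions.

```python
from collections import Counter


def frequent_two_step_fragments(corpus, min_support):
    """Count, for every labeled edge, in how many workflows it appears.

    corpus: list of workflows, each a list of (source_label, target_label)
    edges. An edge repeated inside one workflow counts once, because
    support is measured per workflow.
    """
    support = Counter()
    for edges in corpus:
        for edge in set(edges):
            support[edge] += 1
    return {edge: n for edge, n in support.items() if n >= min_support}
```

Any fragment whose support falls below `min_support` is discarded, which is what keeps the search space of the exact techniques tractable.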
Filtering Relevant Fragments
The number of resulting fragments can be very large. We distinguish:
• Multistep fragments: fragments with more than one step
• Filtered multistep fragments: multistep fragments that contain all smaller fragments with the same number of occurrences
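One possible reading of the two definitions above is sketched below: a multistep fragment is dropped when a larger fragment contains it and occurs the same number of times. The fragment representation (sets of step labels and labeled edges plus a count), the fragment names and the occurrence counts in the usage are all hypothetical, and containment is approximated by edge-set inclusion instead of subgraph isomorphism.

```python
def multistep(fragments):
    """Keep only the fragments that have more than one step."""
    return {name: f for name, f in fragments.items() if len(f["steps"]) > 1}


def filtered_multistep(fragments):
    """Drop multistep fragments subsumed by a larger one with equal count."""
    candidates = multistep(fragments)
    kept = {}
    for name, frag in candidates.items():
        subsumed = any(
            frag["edges"] < other["edges"]  # proper subset of the edges
            and frag["count"] == other["count"]
            for other in candidates.values()
        )
        if not subsumed:
            kept[name] = frag
    return kept
```

For example, if Filter → Sort and Filter → Sort → Query are both found 4 times, only the longer fragment is reported, since the shorter one adds no information.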
[Figure: example fragments F1 to F4, built from the steps Stemmer, Term Weighting, Filter, Sort and Query, each annotated with its number of occurrences (4, 4, 10 and 3 in the example)]
Linking to the Corpus: Example
Workflow fragment description vocabulary: http://purl.org/net/wf-fd (extends P-Plan)
[Figure: Workflow 1 (steps Stemmer, Term Weighting, Stemmer, Term Weighting, Merge) and Fragment1 (Stemmer, Term Weighting). Fragment1 is a wffd:DetectedResultWorkflowFragment, linked with wffd:foundAs to "Fragment1 in Wf1 (1)" and "Fragment1 in Wf1 (2)", which are wffd:TiedWorkflowFragment instances linked with wffd:foundIn to Workflow 1. Steps are p-plan:Step instances connected with p-plan:isPrecededBy and p-plan:isStepOfPlan.]
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
   a) Finding generic motifs in workflows
   b) Workflow fragment assessment
7. Conclusions and future work
Finding generic motifs in workflows
Research question: Can we find commonly occurring abstractions (composite workflows, internal macros)?
Metrics used: precision and recall, computed between the fragments found (F) and the annotated motifs (M)
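A minimal sketch of the two metrics, assuming the standard set-based definitions: precision is the share of found fragments that match an annotated motif, and recall is the share of annotated motifs that were found. Treating fragments and motifs as comparable identifiers is a simplification, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(fragments, motifs):
    """Return (precision, recall) of found fragments F against motifs M."""
    found = set(fragments)
    relevant = set(motifs)
    hits = found & relevant
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, finding 2 of 3 annotated motifs gives a recall of about 0.67, regardless of how many other fragments were proposed; those extra fragments only lower the precision.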
Corpus: 22 templates from the same domain, annotated manually (Wings workflow corpus + domain knowledge)
[Figure: Dataset → PorterStemmer → Result → IDF → FinalResult and Dataset → LovinsStemmer → Result → ResidualIDF → FinalResult are generalized to Dataset → Stemmer → Result → Term Weighting → FinalResult]
Component taxonomy
• Stemmer
  - Porter Stemmer
  - Lovins Stemmer
• Term Weighting
  - Inverse Document Frequency (IDF)
  - Residual IDF
  - Query Term Weighting
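Generalization with a component taxonomy can be sketched as a relabeling of each step with its parent type before mining. The dictionary below encodes the taxonomy shown on the slide; the identifier spellings, the one-level lookup and the function name are assumptions made for the illustration.

```python
# Child component -> parent type, following the taxonomy on the slide.
TAXONOMY = {
    "PorterStemmer": "Stemmer",
    "LovinsStemmer": "Stemmer",
    "IDF": "TermWeighting",
    "ResidualIDF": "TermWeighting",
    "QueryTermWeighting": "TermWeighting",
}


def generalize(edges, taxonomy=TAXONOMY):
    """Replace every step label by its parent type, if it has one."""
    return sorted({(taxonomy.get(a, a), taxonomy.get(b, b)) for a, b in edges})
```

After generalization, PorterStemmer → IDF and LovinsStemmer → ResidualIDF become the same Stemmer → TermWeighting fragment, so a pattern that was too rare at the concrete level can become frequent at the abstract level.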
Results of the evaluation
Can we find commonly occurring abstractions?
• Internal macros: Inexact FSM finds 2 out of 3 (r = 0.67); 4 out of 5 (r = 0.8) when applying generalization
• Composite workflows: Exact FSM finds all motifs, although the precision is low (p = 0.18)
H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
   a) Finding generic motifs in workflows
   b) Workflow fragment assessment
7. Conclusions and future work
Workflow fragment assessment
Research question: Are our proposed workflow fragments useful?
• A fragment is useful if it has been designed and (re)used by a user.
• Comparison between proposed fragments and user designed groupings and workflows
Metrics: precision and recall, comparing the fragments (F) against workflows (W) and groupings (G)
Workflow corpora
User Corpus 1 (WC1)
• Designed mostly by a single user
• 790 workflows (475 after data preparation)
User Corpus 2 (WC2)
• Created by a user, with collaborations of others
• 113 workflows (96 after data preparation)
Multi User Corpus 3 (WC3)
• Workflows submitted by 62 users during the month of Jan 2014
• 5859 workflows (357 after data preparation)
User Corpus 4 (WC4)
• Designed mostly by a single user
• 53 workflows (50 after data preparation)
Result assessment
• 30%-60% of proposed fragments are equal to user defined groupings or workflows
• 40%-80% of proposed fragments are equal or similar to user defined groupings or workflows
H.3: Commonly occurring patterns are potentially useful for users designing workflows
What about the rest of the fragments? Are those useful?
User feedback: user survey
Q1: Would you consider the proposed fragment a valuable grouping?
• I would not select it as a grouping (0)
• I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1)
• I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2)
• I would use it as a grouping as it is (3)
Q2: What do you think about the complexity of the fragment?
• The fragment is too simple (0)
• The fragment is fine as it is (1)
• The fragment has too many steps (2)
Not enough evidence to state that all proposed workflow fragments are useful
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Conclusions: Results
H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps.
• Model for representing workflows (OPMW) and publishing them as Linked Data
• Catalog of workflow motifs + workflow annotation
H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
• Graph mining approach + workflow generalization
Publications:
• Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and Linked Data. (WORKS'11)
• Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. 8th IEEE International Conference on e-Science (eScience 2012)
• Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis (extended version). Future Generation Computer Systems. 2013.
H.3: Commonly occurring patterns are potentially useful for users designing workflows.
• Graph mining approach + reusability metrics for assessment + workflow annotation
• Reuse survey
Publications:
• Daniel Garijo, Oscar Corcho and Yolanda Gil. Detecting common scientific workflow fragments using templates and execution provenance. Proceedings of the seventh international conference on Knowledge capture (K-CAP 2013).
• Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris Gutman, Ivo D. Dinov, Paul Thompson and Arthur W. Toga. FragFlow: Automated fragment detection in scientific workflows. 10th IEEE Conference on e-Science (eScience 2014)
• Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Dereck Hibar, Xie Hua, Neda Jahanshad, Paul Thompson and Arthur W. Toga. Workflow reuse in practice: A study of neuroimaging pipeline users. 10th IEEE Conference on e-Science (eScience 2014)
Conclusions: Impact and future work
Impact:
OPMW
• Workflow annotation [García-Jiménez and Wilkinson 2014b]
Motif catalog
• Expansion for distributed environments [Olabarriaga et al 2013]
• Workflow summarization [Alper et al 2013]
Future work:
• Towards workflow ecosystems [Garijo et al 2014] (WORKS’14)
• Automatic detection of workflow abstractions
• Improvement of workflow reuse
Custom fragments
Ranking fragments
Suggestions of workflows
All materials are available as Research Objects (with pointers to Figshare)
http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs
Supporting material
Methodology
Columns: Problem, Approach, Evaluation
Workflow representation and publication
• Model (Provenance, Plan)
• Publication: Methodology for publication
• Extension of existing standards and web technologies
• Competency question validation
Workflow abstraction and reuse
• Workflow abstraction analysis for reuse
• Empirical analysis of workflow corpora
• Agreement on a catalog of common abstractions
• Requirement validation and user feedback
• Automatic detection and annotation of workflow abstractions
• Graph mining techniques, generalization
• Precision, recall and user feedback
Provenance Models
“A record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing”-PROV-DM: The PROV Data Model (W3C)
[Figure: the P-Plan model]
Classes: p-plan:Plan, p-plan:Step, p-plan:Variable, p-plan:Activity, p-plan:Entity, p-plan:Bundle
PROV extended classes: prov:Plan, prov:Activity, prov:Entity, prov:Bundle
Object properties: p-plan:isStepOfPlan, p-plan:isVariableOfPlan, p-plan:hasInputVar, p-plan:isOutputVarOf, p-plan:isPrecededBy, p-plan:correspondsToStep, p-plan:correspondsToVariable, prov:used, prov:wasGeneratedBy
Statements contained in a p-plan:Bundle
Legend: Class, Object property, Subclass of
Assumptions and restrictions
Restriction:
• Workflows are represented as directed acyclic graphs
Assumptions:
• Available workflow repositories exist for exploiting definitions of workflows and workflow executions.
• All the workflow steps can be assigned a label with their type.
• Two steps of a workflow with the same function have the same type.
• Researchers aim to reuse workflows and workflow fragments if they find them useful.
Other models for representing workflow instances, templates and executions
Publishing as LD
Data-Oriented Motifs
• Data Retrieval
• Data Preparation
  - Format Transformation
  - Input Augmentation and Output Splitting
  - Data Organisation
• Data Analysis
• Data Curation/Cleaning
• Data Movement
• Data Visualisation
Workflow-Oriented Motifs
• Intra-Workflow Motifs
  - Stateful (Asynchronous) Invocations
  - Stateless (Synchronous) Invocations
  - Internal Macros
  - Human Interactions
• Inter-Workflow Motifs
  - Atomic Workflows
  - Composite Workflows
  - Workflow Overloading
Result Summary: Data Oriented Motifs
•Over 60% of the motifs are data preparation motifs
•Some differences are motivated by the workflow systems in the analysis
•Data analysis is often the main functionality of the workflow
Result Summary: Workflow Oriented Motifs
• Around 40% composite workflows and internal macros
But how do users perceive workflow reuse?
• What about fragments of workflows?
Differences and commonalities of the workflow systems
• Data moving/retrieval, stateful interactions and human interaction steps are not present in Wings
  - Web services (Taverna) versus software components (Wings)
  - Wings has layered execution through Pegasus
• Data preparation steps are common in both systems
• Use of sub workflows is high
Reusing workflows…
According to the respondents, the major benefits of workflows include:
• Time savings
• Organizing and storing code
• Having a visualization of the overall analysis
• Facilitating reproducibility
Reusing groupings…
• Reuse is not the only reason why groupings are created. Unlike with workflows, respondents find reusing groupings from their own work more useful than reusing groupings from others
• Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others
Graph Mining
We use popular graph mining techniques:
Inexact FSM: usage of heuristics to calculate similarity between two graphs. The solution might not be complete.
SUBDUE
• 2 heuristics: Minimum Description Length (MDL) and Size
• Frequency based
Exact FSM: deliver all the possible fragments to be found in the dataset.
gSpan
• Depth first search strategy
• Support based
FSG
• Breadth first search strategy
• Support based
Linking to the Corpus: Workflow fragment description vocabulary
Workflow fragment assessment: Summary of results
Conclusions: Limitations
L1: OPMW has been designed for data-intensive workflows (without loops or conditionals)
L2: When publishing as Linked Data, it is assumed that all resources will be made public (no privacy issues)
L3: Motif catalog may be expanded with additional motifs
L4: Size and time needed to calculate some workflow fragments
L5: A taxonomy of components is needed when generalizing workflows. This taxonomy is provided by domain experts modeling the domain.