Download - Mining Usage Patterns from a Repository of Scientific Workflow

Page 1: Mining Usage Patterns from a Repository of Scientific Workflow

Mining Usage Patterns from a Repositoryof Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Universita Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21

Page 2: Mining Usage Patterns from a Repository of Scientific Workflow

Introduction

Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:

audit trails of a workflow management systemtransaction logs of an enterprise resource planning system

Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?

Page 3: Mining Usage Patterns from a Repository of Scientific Workflow

Introduction

Page 4: Mining Usage Patterns from a Repository of Scientific Workflow

Introduction

Page 5: Mining Usage Patterns from a Repository of Scientific Workflow

Motivation

Little work about how to extract knowledge from a set of models (i.e.:process schemas)

understanding of common patterns of usage (best/worstpractices)support in process designprocess integration (from multiple sources)process optimization/re-engineeringindexing of processes and their retrieval

Page 6: Mining Usage Patterns from a Repository of Scientific Workflow

Methodology

Processes schemas have an inherent graph structure: manygraph-mining tools available

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)

Page 7: Mining Usage Patterns from a Repository of Scientific Workflow

Methodology

Page 8: Mining Usage Patterns from a Repository of Scientific Workflow

Methodology

Page 9: Mining Usage Patterns from a Repository of Scientific Workflow

Methodology

Page 10: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE

SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL

Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum

Page 11: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE

Comments:to compress G with a SUB: to replace every occurrence of SUBin G with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum

Page 12: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE

Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent Gafter the compression is minimum

Page 13: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Page 14: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 15: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 16: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 17: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 18: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 19: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 20: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchic domains

quality = interClusterDistanceintraClusterDistance

Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)

New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application

Page 21: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 22: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 23: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Page 24: Mining Usage Patterns from a Repository of Scientific Workflow

Representation

Common operators:sequencesplit-AND (parallel split), split-XOR (exclusive choice)join-AND (synchronization), join-XOR (simple merge)

How to map the process to the graph? Several choicesEmanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 8 / 21

Page 25: Mining Usage Patterns from a Repository of Scientific Workflow

Representation models: A rule

lowest level of compactnessevery operator is represented by a nodecalled operator, linked to another nodespecifying its type (AND/OR/SEQ) and toits operandsjoin/split are distinguishable by thenumber of in/out-going arcs

Page 26: Mining Usage Patterns from a Repository of Scientific Workflow

Representation models: B rule

medium level of compactnessthe closest to the original representationjoin/split are replaced by different nodes,one for each kind of operator

Page 27: Mining Usage Patterns from a Repository of Scientific Workflow

Representation models: C rule

highest level of compactnessimplicit representation of operatorsmultiple alternative graphs in case ofSPLIT-XOR or JOIN-XORarcs are labeled to preserve informationabout domain/range nodes

Page 28: Mining Usage Patterns from a Repository of Scientific Workflow

Evaluation

Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):

attribute name → edgevalue → a node

Need of experimentations to assess which one performs best w.r.t.the quality of clustering.

Page 29: Mining Usage Patterns from a Repository of Scientific Workflow

Evaluation

Page 30: Mining Usage Patterns from a Repository of Scientific Workflow

Evaluation

Page 31: Mining Usage Patterns from a Repository of Scientific Workflow

myExperiment

Dataset:258 processes (XScufl language for Taverna framework)e-Science WF for distributed computation over scientific dataapplications: local tools, Web Services, scripts

Setting:WF extraction/parsingpreprocessingtranslation into graphs (A,B,Crules)SUBDUE execution

Page 32: Mining Usage Patterns from a Repository of Scientific Workflow

Experimental results

Output lattice (fragment)

Red boxes = top level SUBs

Page 33: Mining Usage Patterns from a Repository of Scientific Workflow

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Page 34: Mining Usage Patterns from a Repository of Scientific Workflow

Page 35: Mining Usage Patterns from a Repository of Scientific Workflow

Page 36: Mining Usage Patterns from a Repository of Scientific Workflow

Page 37: Mining Usage Patterns from a Repository of Scientific Workflow

Page 38: Mining Usage Patterns from a Repository of Scientific Workflow

Page 39: Mining Usage Patterns from a Repository of Scientific Workflow

Page 40: Mining Usage Patterns from a Repository of Scientific Workflow

Page 41: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Page 42: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Page 43: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Page 44: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Page 45: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Page 46: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Page 47: Mining Usage Patterns from a Repository of Scientific Workflow

Use case (cont’d)

Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern

Page 48: Mining Usage Patterns from a Repository of Scientific Workflow

Use case (cont’d)

Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern

Page 49: Mining Usage Patterns from a Repository of Scientific Workflow

Discussion

A graph-based clustering technique (SUBDUE) to recognize the mostfrequent/common subprocesses among a set of workflows/processschemas.

Commentsseveral representation alternatives for application of SUBDUEevaluations on specialized e-Science domainuseful to find typical patterns and schemas of usage for tools/tasks

Page 50: Mining Usage Patterns from a Repository of Scientific Workflow

Conclusion

Applications:recognition of reference processes (common/best/worstpractices)organize a process repository (by indexing processes throughsubstructures)enterprise integration: find similarities (differences, overlapping,complementariness) in BP in different companies

Future work:extending experimentations, especially with (real) BusinessProcessesmanagement of heterogeneity (syntactic/semantic, level ofgranularity)

Page 51: Mining Usage Patterns from a Repository of Scientific Workflow

Mining Usage Patterns from a Repositoryof Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Universita Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

Page 52: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints