Mining Usage Patterns from a Repositoryof Scientific Workflows
Claudia Diamantini, Domenico Potena, Emanuele Storti
DII, Universita Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21
Introduction
Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:
audit trails of a workflow management systemtransaction logs of an enterprise resource planning system
Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21
Introduction
Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:
audit trails of a workflow management systemtransaction logs of an enterprise resource planning system
Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21
Introduction
Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:
audit trails of a workflow management systemtransaction logs of an enterprise resource planning system
Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21
Motivation
Little work about how to extract knowledge from a set of models (i.e.:process schemas)
understanding of common patterns of usage (best/worstpractices)support in process designprocess integration (from multiple sources)process optimization/re-engineeringindexing of processes and their retrieval
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 3 / 21
Methodology
Processes schemas have an inherent graph structure: manygraph-mining tools available
Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs
Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
Methodology
Processes schemas have an inherent graph structure: manygraph-mining tools available
Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs
Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
Methodology
Processes schemas have an inherent graph structure: manygraph-mining tools available
Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs
Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
Methodology
Processes schemas have an inherent graph structure: manygraph-mining tools available
Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs
Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
SUBDUE
SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL
Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21
SUBDUE
SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL
Comments:to compress G with a SUB: to replace every occurrence of SUBin G with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21
SUBDUE
SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL
Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent Gafter the compression is minimum
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
quality = interClusterDistanceintraClusterDistance
Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)
New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
quality = interClusterDistanceintraClusterDistance
Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)
New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
quality = interClusterDistanceintraClusterDistance
Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)
New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
quality = interClusterDistanceintraClusterDistance
Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)
New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
Representation
Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs
Common operators:sequencesplit-AND (parallel split), split-XOR (exclusive choice)join-AND (synchronization), join-XOR (simple merge)
How to map the process to the graph? Several choicesEmanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 8 / 21
Representation models: A rule
lowest level of compactnessevery operator is represented by a nodecalled operator, linked to another nodespecifying its type (AND/OR/SEQ) and toits operandsjoin/split are distinguishable by thenumber of in/out-going arcs
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 9 / 21
Representation models: B rule
medium level of compactnessthe closest to the original representationjoin/split are replaced by different nodes,one for each kind of operator
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 10 / 21
Representation models: C rule
highest level of compactnessimplicit representation of operatorsmultiple alternative graphs in case ofSPLIT-XOR or JOIN-XORarcs are labeled to preserve informationabout domain/range nodes
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 11 / 21
Evaluation
Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):
attribute name → edgevalue → a node
Need of experimentations to assess which one performs best w.r.t.the quality of clustering.
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21
Evaluation
Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):
attribute name → edgevalue → a node
Need of experimentations to assess which one performs best w.r.t.the quality of clustering.
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21
Evaluation
Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):
attribute name → edgevalue → a node
Need of experimentations to assess which one performs best w.r.t.the quality of clustering.
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21
myExperiment
Dataset:258 processes (XScufl language for Taverna framework)e-Science WF for distributed computation over scientific dataapplications: local tools, Web Services, scripts
Setting:WF extraction/parsingpreprocessingtranslation into graphs (A,B,Crules)SUBDUE execution
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 13 / 21
Experimental results
Output lattice (fragment)
Red boxes = top level SUBs
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 14 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Experimental results
A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03
Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:
B model: higher representativeness → indexingC model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
Use case (cont’d)
Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 17 / 21
Use case (cont’d)
Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 17 / 21
Discussion
A graph-based clustering technique (SUBDUE) to recognize the mostfrequent/common subprocesses among a set of workflows/processschemas.
Commentsseveral representation alternatives for application of SUBDUEevaluations on specialized e-Science domainuseful to find typical patterns and schemas of usage for tools/tasks
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 18 / 21
Conclusion
Applications:recognition of reference processes (common/best/worstpractices)organize a process repository (by indexing processes throughsubstructures)enterprise integration: find similarities (differences, overlapping,complementariness) in BP in different companies
Future work:extending experimentations, especially with (real) BusinessProcessesmanagement of heterogeneity (syntactic/semantic, level ofgranularity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 19 / 21
Mining Usage Patterns from a Repositoryof Scientific Workflows
Claudia Diamantini, Domenico Potena, Emanuele Storti
DII, Universita Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 20 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
SUBDUE
Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph
1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Subsequent SUBs may be defined in terms of previously defined SUBs
Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21