RM/ProM Manual
1. INTRODUCTION
RapidMiner is a scientific workflow management system. It contains more than 500 operators for all tasks of professional data analysis, i.e. operators for input and output as well as data processing, modelling, and other aspects of data mining. Within the process mining domain, there was no support for constructing and executing a workflow that describes all analysis steps and their order. For this reason, the process mining framework ProM 6 has been integrated within RapidMiner. The goal of the integration is to apply knowledge from the scientific workflow domain to the design and execution of workflows within the process mining domain.
2. INSTALLATION, PERSPECTIVES AND VIEWS
Before you start, you of course need to download and install the software first. You will find it on the http://www.win.tue.nl/~rmans/RapidMiner/doku.php?id=wiki:installation website. Download the appropriate installation package for your operating system and install RapidMiner according to the instructions given. As a next step, the ProM 6 extension needs to be installed. To do so, please go to the Help/Updates and Extensions (MarketPlace) menu. As a result, you enter the RapidMiner Marketplace (see Figure 1), in which you can install any extension you like. Via the ‘Search’ tab, please search for the ProM 6 extension by searching for ‘prom’. As a result, the ‘ProM Framework Extension’ is shown. Please install the package and follow the instructions.
Figure 1: The RapidMiner Marketplace showing the ProM Framework Extension.
After RapidMiner has restarted, you will see a dialog asking whether the ProM packages for the RapidMiner ProM 6‐extension have been installed. Please answer ‘No’. After that, a new dialog pops up asking for the folder in which the ProM packages need to be installed. These packages are needed in order to use ProM 6 algorithms within RapidMiner. Please select an appropriate folder and click ‘OK’. Next, a dialog shows up asking for which operating system the extension is installed (Windows 32 bit, Windows 64 bit, Linux 32 bit, Linux 64 bit, OSX). Please select the operating system that applies to your situation and click ‘OK’. Finally, a progress screen is shown. After this screen has disappeared, you are done installing the ProM 6 extension! Once installed, you will be welcomed by the so‐called Welcome Perspective (Figure 2.1).
Figure 2.1: Welcome Perspective of RapidMiner.
The lower section shows current news about RapidMiner, if you have an Internet connection. The list in the
centre shows the analysis processes recently worked upon. This is practical if you wish to continue working
on or execute one of these processes. The upper section shows typical actions:
1. New: Start a new analysis process.
2. Open Recent: Opens the process which is selected in the list below the actions.
3. Open: Opens the repository browser and allows you to select a process to be opened within the
Design Perspective.
4. Open Template: Shows a selection of different pre‐defined analysis processes,
which can be configured in a few clicks.
5. Online Tutorial: Starts a tutorial which can be used directly within RapidMiner.
At the right‐hand side of the toolbar you will find three icons which switch between the individual
RapidMiner perspectives. The icons shown here take you to the following perspectives:
1. Design Perspective: This is the central RapidMiner perspective where all analysis processes are
created and managed.
2. Result Perspective: If a process supplies results in the form of data, models or the like, then
RapidMiner takes you to this Result Perspective, where you can look at several results at the same
time.
3. Welcome Perspective: The Welcome Perspective welcomes you after starting the program.
You can switch to the desired perspective by clicking the corresponding icon in the toolbar. Now switch to the Design Perspective. You will see the following screen:
Figure 2.2: Design Perspective of RapidMiner.
As you can see in Figure 2.2, there are two very central views in this area, which are described in the following. In the Process View you can see the process you are creating. All work steps (operators) available in RapidMiner are presented in groups in the Operators View and can be included in the current process from there. If RapidMiner has been extended with one of the available extensions, then the additional operators can also be found here. You can select operators and add them in the desired place in the process by simply dragging and dropping, or with a double click on an operator. The connections can be made manually by the user: just click on an output port and you can draw an orange strand. Click on an input port in order to connect the selected output port with this input port (Figure 2.3).
Figure 2.3: Click on an output port in order to connect, right click to cancel.
Numerous operators require one or several parameters to be set. For example, operators that read data from file (e.g. Read Log) require the file path to be indicated. After an operator offering parameters has been selected in the Process View, its parameters are shown in the Parameter View (see Figure 2.4). Like the other views, this view also has its own toolbar, which is described in the following. Under the toolbar you will find the icon and name of the currently selected operator, followed by the actual parameters. Bold font means that the parameter must absolutely be defined by the analyst and has no default value; for example, as you can see in the following figure, the first parameter ‘logverbosity’ and the second ‘logfile’. Italic font means that the parameter is classified as an expert parameter and should not necessarily be changed by beginners to data analysis, for example the parameters ‘resultfile’, ‘random seed’, ‘send email’ and ‘encoding’. The Help View is also very important: each time you select an operator, it shows a description of this operator.
Figure 2.4: Parameters of the currently selected operator are set in the parameter view.
3. INTEGRATION WITH ProM
If you are reading this tutorial, you probably already have some experience with ProM, and you may be interested in using its plugins within RapidMiner. As such, it has been decided to integrate various plugins of the process mining framework ProM 6 within RapidMiner. Therefore, as shown in Figure 3.1, the folder named “ProM 6” contains operators which allow for running ProM 6 plugins.
Figure 3.1: Design Perspective with the ProM 6 folder. This folder contains operators for running ProM 6 plugins.
With a double click on the folder ProM 6 we can see all the operators, grouped into the following subfolders:
Context: contains the ProM Context operator. This operator is needed for starting an instance of ProM 6. Several operators require a context in order to run.
Import: contains operators for reading ProM objects from disk (e.g. a log or a Petri Net).
Mining: contains ProM mining plugins (e.g. Alpha Miner, ILP Miner).
Analysis: contains algorithms that analyse one or more ProM objects (e.g. calculating the fitness of a log and a Petri Net).
Export: contains operators for exporting logs or PNML files.
Filtering: contains operators for modifying logs.
Conversion: contains operators for converting a ProM object into another ProM object.
Parallel: contains a subprocess which allows for running ProM 6 operators in parallel.
3.1 Creating a New Process
To better understand how the functionality of ProM 6 can be used within RapidMiner, we describe several use cases. Let’s start with the simplest example: the discovery of a process model. Process discovery, based on an event log, aims at building a process model capturing the behaviour seen in the log. The first step is to drag the following operators into the Design Perspective screen:
‐ ProM Context (in the folder ‘Context’)
‐ Read Log (in the folder ‘Import’)
‐ Alpha Miner (in the folder ‘Mining’)
After dragging the operators you will see the following figure:
Figure 3.1.1: Process View after dragging the ‘ProM Context’, ‘Read Log’ and ‘Alpha Miner’ operators.
The Read Log File and Alpha Miner operators need the ProM Context operator in order to work: this operator provides a running instance of ProM 6. It is therefore necessary to connect it with both. When trying to do this, you can see that a little square appears on the Read Log File operator (see Figure 3.1.2); clicking on this will insert a new operator, ‘Multiply’ (see Figure 3.1.3).
Figure 3.1.2: Process View, with one click on the little square a new operator ‘Multiply’ appears.
Figure 3.1.3: Process View with all the operators needed for the first use case
After that, the structure of the process is complete. As you can see in the above figures, each operator has a coloured circle, which can be red, yellow or green. This circle is called the status light, and it indicates whether there is a problem, like parameters that have not yet been set or unconnected input ports (in that case, as shown in Figure 3.1.2, the status light of the Alpha Miner operator is red, since it is not yet connected with the ProM Context). It also shows whether the configuration is basically complete but the operator has not yet been executed; in this case the status light is yellow (see Figure 3.1.3). Finally, it shows whether everything is OK and the operator has executed successfully; in this case the status light is green (see Figure 3.1.8). The status light is not the only icon for an operator, as you can see in Figure 3.1.4, where all icons for an operator are shown.
Figure 3.1.4: Icons of each operator.
Starting from the left, the meaning of the icons in the panel is as follows:
‐ Warning triangle: Indicates when there are status messages for this operator.
‐ Breakpoint: Indicates whether process execution is to be stopped before or after this operator in order to give the analyst the opportunity to examine intermediate results.
‐ Comment: If a comment has been entered for this operator, then this is indicated by this icon.
‐ Subprocess: This is a very important indicator, since some operators have one or more subprocesses. This indicator shows whether there is such a subprocess. You can double click on the operator concerned to go down into the subprocess.
Before the execution can start, you need to set the parameter of ‘Read Log File’ which indicates where the log file can be found. The parameter can be seen in Figure 3.1.5. By clicking on the folder icon you can set the path location of the log. For instance, you can find some logs on the website http://www.processmining.org/book/start. On this webpage, it is possible to download a zip file containing the log that is used for almost all use cases in this document: running‐example.xes. The log can be found in the ‘chapter 1’ folder of the zip file. This log corresponds to the handling of requests for compensation.
Figure 3.1.5: Path location of the log.
After selecting the log we are ready for the execution of the process. Therefore, just click on the play
button in the toolbar:
Figure 3.1.6: Piece of the toolbar with button of play, pause and stop.
The play button starts the process, the pause button pauses it in between, and the stop button aborts the process completely. While a process is running, the status indicator of the operator currently being executed transforms into a small green play icon. In this way you can see what point the process is currently at. After an operator has been successfully executed, the status indicator changes and stays green ‐ until, for example, you change a parameter for this operator: then the status indicator turns yellow. The same applies to all operators that follow. This means you can see very quickly which operators a change could have an effect on. The process defined above only has a short runtime, so you will hardly have the opportunity to pause the running process. After the process is finished, RapidMiner should open a screen in the Result Perspective like in Figure 3.1.7. If this is not the case, then you probably did not connect the output port of the last operator (Alpha Miner) with one of the result ports of the process on the right‐hand side. Check this and also check for other possible errors, taking the notes in the Problems View into consideration.
Figure 3.1.7: The above figure shows the discovered Petri Net corresponding to the earlier chosen log.
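The core idea behind the Alpha Miner (deriving ordering relations from the log before constructing the places of the Petri Net) can be illustrated outside RapidMiner. The following Python sketch computes the directly‐follows, causality and parallelism relations on two toy traces; the activity names are invented for illustration, and this is not ProM’s actual implementation:

```python
# Toy event log: each trace is a sequence of activity names (hypothetical data).
log = [
    ["register", "check", "decide", "pay"],
    ["register", "decide", "check", "pay"],
]

# Directly-follows relation: a > b iff b immediately follows a in some trace.
follows = {(a, b) for trace in log for a, b in zip(trace, trace[1:])}

# Causality: a -> b iff a > b holds but b > a does not.
causal = {(a, b) for (a, b) in follows if (b, a) not in follows}

# Parallelism: a || b iff both a > b and b > a hold.
parallel = {(a, b) for (a, b) in follows if (b, a) in follows}

print(sorted(causal))
# → [('check', 'pay'), ('decide', 'pay'), ('register', 'check'), ('register', 'decide')]
```

From relations like these the Alpha algorithm derives the places of the Petri Net; here ‘check’ and ‘decide’ come out as parallel, matching their two interleaved orderings in the toy log.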
It is also possible to see the outcomes of a workflow by right clicking on an outgoing port of an operator. For example, the imported log can be inspected by right clicking on the outgoing port of the ‘Read Log File’ operator (see Figure 3.1.8) and selecting ‘Show XLogIOObject Result’ (see Figure 3.1.9).
Figure 3.1.8: With a right click on the outgoing port of an operator, it is possible to see the outcomes of a process.
For example, when selecting ‘Show XLogIOObject Result’ for the resultant log of the ‘Read Log File’ operator, as shown in Figure 3.1.8, the Result View opens the screen shown in Figure 3.1.9:
Figure 3.1.9: The imported log can be inspected by selecting the ‘Show XLogIOObject Result’ submenu with a right click on the outgoing port of the ‘Read Log File’ operator.
3.2 COMPLEX USE CASES
Next, we discuss some more complex use cases. Use Case 1: discovery of a process model and calculation of its fitness. Use Case 2: discovery of similar traces in a log and of the associated process model for each set of similar traces. Use Case 3: decomposed discovery.
USE CASE 1
The process of discovery, based on an event log, aims at building a process model capturing the behaviour seen in the log. We need the following operators, which can be dragged into the Main Process screen:
‐ ProM Context (in the folder ‘Context’)
‐ Read Log (in the folder ‘Import’)
‐ ILP Miner (in the folder ‘Mining’)
‐ Replay a Log on Petri Net for Conformance Analysis (in the folder ‘Analysis’)
The corresponding process can be seen in Figure 3.2.1. Note that the process has finished executing as all the status lights are green.
Figure 3.2.1: Process for discovering a process model and calculating the fitness.
With a right click on the output port of Read Log we can see a log (running‐example.xes) corresponding to the handling of requests for compensation. A snapshot of the log can be seen in Figure 3.2.2. Each line represents one event. Note that events are already grouped per case. The case identifier is shown in column ‘T: concept name’. Case 2 has five associated events, the first of which is the register request by Mike on December 30th, 2010. In the table, each event is associated with a resource and a cost.
Figure 3.2.2: Snapshot of the log corresponding to the handling of requests for compensation.
Process mining algorithms for process discovery can discover a process model based on the information in an event log. The ILP‐Miner plug‐in generates a marked Petri net given the event log as input. The plug‐in does so by using Integer Linear Programming to find places according to the theory of regions [1]. The Petri Net obtained (see Figure 3.2.3) can be seen with a right click on the operator ILP Miner:
Figure 3.2.3: Petri Net obtained by using the ‘ILP Miner’ operator.
The 'Replay a Log on Petri Net for Conformance Analysis' operator provides an example set with the trace
fitness, move‐log fitness, move model fitness, raw fitness costs, number of states, and the number of
queued states. By right clicking on the output port of this operator you can see the result shown in Figure
3.2.4.
Figure 3.2.4: Result of the operator ‘Replay a Log on Petri Net for Conformance Analysis’.
Here the ‘trace fitness’ value represents the fitness of the Petri Net with respect to the log. This value indicates how well the event log can be replayed on the discovered Petri Net. A fitness value of ‘1’ means that the log can be replayed completely, whereas a value of ‘0’ means that this is not at all the case. In Figure 3.2.4 it can be seen that the model has a perfect fitness, so it is able to replay all the traces in the log.
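The intuition behind a fitness value between 0 and 1 can be sketched with a toy example. Note that the model, log and scoring below are invented for illustration: the actual operator computes fitness by replaying each trace on the Petri Net (counting missing and remaining tokens or alignment moves), not by exact trace matching:

```python
# Toy "model": the set of complete traces the model can replay
# (hypothetical; a real Petri Net defines this language implicitly).
model_language = {
    ("register", "check", "decide", "pay"),
    ("register", "decide", "check", "pay"),
}

log = [
    ("register", "check", "decide", "pay"),
    ("register", "decide", "check", "pay"),
    ("register", "pay"),  # deviating trace the model cannot replay
]

# Fraction of traces the model can replay exactly; 1.0 would be perfect fitness.
fitness = sum(trace in model_language for trace in log) / len(log)
print(fitness)  # 2 of the 3 traces fit
```

With the deviating third trace removed from the toy log, this score would become 1.0, mirroring the perfect fitness seen in Figure 3.2.4.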
USE CASE 2
In this use case we describe the discovery of similar traces in a log and the associated process model. By grouping similar cases together, it may be possible to construct partial process models that are easier to understand.
Figure 3.2.5: The process of discovering similar traces in a log and the associated process model.
The log set as parameter in the ‘Read Log file’ operator is ‘reviewingStartEnd.mxml’. In the Process View a new operator, ‘Remember’, can be seen; it stores the given object in the object store of the process. The stored object can be retrieved from the store by using the ‘Recall’ operator, inserted in the subprocess ‘Loop Collection’. The problem is that ‘Loop Collection’ has only one incoming port, which in this case is used for having a log available on the subpage of the ‘Loop Collection’ operator. As the ‘ILP Miner’ also needs a ProM Context, we need to make it available on the subpage by using the ‘Remember’ and ‘Recall’ operators. For both ‘Remember’ and ‘Recall’ the same parameters are inserted, as shown in the following figure:
Figure 3.2.6: Parameter View of the operators ‘Remember’ and ‘Recall’.
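The ‘Remember’/‘Recall’ pair behaves like a small key‐value object store scoped to the process run: ‘Remember’ files an object under a name, and ‘Recall’ retrieves it later, even on a subpage with no free input port. A minimal sketch of this mechanism (the function names and stored object are invented; this is not RapidMiner’s internal API):

```python
# Process-wide object store, keyed by the name set in the Parameter View.
store = {}

def remember(name, obj):
    """Store an object so that a later operator can retrieve it by name."""
    store[name] = obj

def recall(name):
    """Retrieve a previously stored object, e.g. inside a subprocess."""
    return store[name]

remember("prom_context", {"kind": "ProM 6 context"})
print(recall("prom_context"))  # → {'kind': 'ProM 6 context'}
```

This is why ‘Remember’ and ‘Recall’ must be configured with the same parameters: the name is the key under which the object is filed and later looked up.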
The discovery of similar traces is done by the Guide Tree Miner algorithm, which is based on the concept of clustering. Clustering is concerned with grouping instances into clusters. Instances in one cluster should be similar to each other and dissimilar to instances in other clusters. The Guide Tree Miner uses the Agglomerative Hierarchical Clustering (AHC) algorithm. As a result of the plugin, a variable number of clusters is generated. Then the operator ‘Convert Event Log Array Into Collection of Logs’ converts the resulting sublogs of the Guide Tree Miner into a collection of logs. The collection enters the subprocess ‘Loop Collection’; with a double click on it you can open its subprocess. The loop ends when, for each item of the collection, the operators on the subpage have finished executing.
Figure 3.2.7: The Subprocess of the ‘Loop Collection’ operator.
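The agglomerative idea behind the Guide Tree Miner can be sketched in plain Python: repeatedly merge the two closest clusters of traces until k clusters remain. The distance measure below (Jaccard distance on activity sets) and the traces are invented for illustration; ProM uses its own trace similarity measures and builds a full guide tree rather than flat clusters:

```python
def jaccard(a, b):
    """Jaccard distance between the activity sets of two traces (toy measure)."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

def ahc(traces, k):
    """Single-linkage agglomerative clustering down to k clusters."""
    clusters = [[t] for t in traces]
    while len(clusters) > k:
        # find the pair of clusters with the smallest inter-trace distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: min(jaccard(x, y) for x in clusters[p[0]] for y in clusters[p[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

log = [
    ("register", "check", "pay"),
    ("register", "check", "decide", "pay"),
    ("submit", "review", "reject"),
    ("submit", "review", "accept"),
]
print(ahc(log, 2))  # the 'register' traces form one cluster, the 'submit' traces the other
```

The parameter k plays the same role as ‘SetNumberOfClusters’ in the Guide Tree Miner configuration.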
After execution, the results are shown in the Result View. These results are shown in Figures 3.2.8 and 3.2.9. The first is the Guide Tree, where the pink nodes correspond to the ‘k’ cluster nodes (where k is the number of clusters chosen by the user during the configuration; in this case k=4). Note that by clicking on the ‘Guide Tree Miner’ operator, the panel on the top right side shows the parameters and their values that can be set (by changing the value of the ‘SetNumberOfClusters’ parameter, the value for ‘k’ can be set). Leaves correspond to the individual traces. All leaves under the subtree of a pink node belong to that cluster. Blue coloured nodes are non‐leaf nodes that are expanded up to two levels deep.
Figure 3.2.8: Guide Tree.
Figure 3.2.9 shows the collection of event logs and the corresponding Petri Nets. With a click on each item of the collection you can see the corresponding Petri Net. As can be seen in the figure, the second item of the collection has been selected.
Figure 3.2.9: A Petri Net for each log in the collection. In the figure, the second item of the collection has been selected.
USE CASE 3
This case is the most complicated: it concerns decomposed discovery. This use case aims to ease process model discovery by splitting the log either into sets of cases or into sets of activities.
Figure 3.2.10: The process of decomposed discovery.
This use case starts with a subprocess (named ‘Vertical/horizontal splitting’). A subprocess introduces a process within a process; it is similar to a switch, where numerous options exist but only one option is selected at a time. The subprocess can be created by dragging the ‘Select Subprocess’ operator into the Process View. As a result, the screen shown in Figure 3.2.11 appears.
Figure 3.2.11: The Subprocess Operator.
After dragging the next operators, as shown in Figure 3.2.12, it is possible to see that, for the current use case, there are two different kinds of splitting:
‐ Vertical splitting: splits the log into sets of cases.
‐ Horizontal splitting: splits the log into sets of activities.
We can select one of these subprocesses by changing the ‘select which’ parameter; in this case the first subprocess is selected (Figure 3.2.11). Both consist of three further subprocesses:
Figure 3.2.12: Two subprocesses in the ‘Vertical/Horizontal splitting’ operator. In the ‘Selection 1’ process, the log is split into sets of cases; afterwards, for each sublog the process is discovered and finally the process models are merged. In the ‘Selection 2’ process, the log is split into sets of activities; afterwards, for each sublog the process is discovered and finally the process models are merged.
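The switch behaviour of ‘Select Subprocess’ can be mimicked in a few lines: several alternative sub‐pipelines exist, but a single parameter decides which one actually runs. The function names and return values below are invented stand‐ins for the two selections:

```python
# Two alternative sub-pipelines; only one is executed per run.
def split_vertically(log):
    return f"vertical split of {log}"    # stands in for 'Selection 1'

def split_horizontally(log):
    return f"horizontal split of {log}"  # stands in for 'Selection 2'

subprocesses = {1: split_vertically, 2: split_horizontally}
select_which = 1  # the parameter set in the Parameter View
result = subprocesses[select_which]("running-example.xes")
print(result)  # → vertical split of running-example.xes
```

Changing the parameter value to 2 would route the same input through the horizontal splitting instead, with no other change to the process.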
With a double click on ‘Split vertically’, the following subprocess is shown (Figure 3.2.13):
Figure 3.2.13: The Subprocess ‘Split Vertically’.
The log is split into sublogs (in this case ten) by the operator ‘Split Xlog’. The next operator converts the resulting event log array into a collection of logs. Within the second subprocess, ‘Mine Control Flow’, the input collection of logs goes to the subprocess ‘Collection Iteration Parallel2’ (see Figure 3.2.14).
Figure 3.2.14: The Subprocess ‘Mine Control Flow’.
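Vertical splitting itself is conceptually simple: distribute the cases of the log over a fixed number of sublogs. A toy sketch follows; round‐robin assignment is just one possible policy, and how ‘Split Xlog’ actually distributes cases is not specified here:

```python
def split_vertically(log, n):
    """Distribute cases over n sublogs (round-robin, one possible policy)."""
    sublogs = [[] for _ in range(n)]
    for i, case in enumerate(log):
        sublogs[i % n].append(case)
    return sublogs

# Ten toy case identifiers standing in for full traces.
log = [f"case_{i}" for i in range(10)]
sublogs = split_vertically(log, 3)
print([len(s) for s in sublogs])  # → [4, 3, 3]
```

Because every case lands in exactly one sublog, each sublog can be mined independently, which is what makes the parallel iteration in the next subprocess possible.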
With a double click on ‘Collection Iteration Parallel2’, the following subprocess is shown:
Figure 3.2.15: The Subprocess ‘Collection Iteration Parallel’.
The conversion from Log to Event Log Array is performed by the operator ‘Convert Log into Event Log Array’; then the Event Log Array goes as input to the operator ‘Discover Accepting Petri Net Array from Event Log Array’, which performs the discovery and gives an Accepting Petri Net Array as output. Finally, as a last conversion, an Accepting Petri Net needs to be obtained from the Accepting Petri Net Array. This is achieved by the operator ‘Convert Accepting Petri Net Array to Accepting Petri Net’. Figure 3.2.15 also shows the ‘Recall’ operator, because the ‘Discover Accepting Petri Net Array from Event Log Array’ operator requires the ProM Context in order to work, and the subprocess ‘Collection Iteration Parallel2’ has only one input (see Figure 3.2.16).
Figure 3.2.16: The parameters of the operator ‘Recall’.
In the third subprocess, ‘Merge Vertically’, the Petri Net is converted to a Petri Net array. Next, with the operator ‘Merge Accepting Petri Net Array into an Accepting Petri Net’, the Accepting Petri Net array is merged into an Accepting Petri Net. Finally, the resulting Petri Net is reduced using the ‘Reduce Silent Transitions Algorithm’ (see Figure 3.2.17).
Figure 3.2.17: The Subprocess ‘Merge Vertically’.
The sequence of subprocesses for the horizontal splitting is almost the same as the one shown in Figure 3.2.12. With a double click on the first subprocess, ‘Split Horizontal’, it is possible to see that the operator ‘Mine a causal activity matrix from an event log’ does the horizontal splitting of the log (see Figure 3.2.18).
Figure 3.2.18: The Subprocess ‘Split Horizontal’.
With a click on ‘Mine a causal activity matrix from an event log’ you can see the value of the ‘get miner’ parameter. In this case, the value is set to ‘Heuristics miner’ (see Figure 3.2.19).
Figure 3.2.19: The Parameter View of ‘Mine a causal activity matrix from an event log’ operator.
After this plugin, the first conversion is made by the operator ‘Convert Causal Activity Matrix to a Causal Activity Graph’. This operator converts a causal activity matrix to a causal activity graph; as input, a ProM Context and a Causal Activity Matrix are needed. Then the operator ‘Convert Causal Activity Graph to an Activity Cluster Array’ is used to convert the causal activity graph to an activity cluster array. Also in this case the ProM Context is needed. The ‘Decompose Event Log using an Activity Cluster Array’ operator decomposes the event log using the activity cluster array. The resulting log array goes as input to the operator ‘Convert Event Log Array into Collection of Logs’, which converts an event log array into a collection of logs. The collection, the ProM Context and the log are the input of the next subprocess, ‘Mine Control Flow Horizontally’.
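Horizontal splitting projects every trace onto a cluster of activities, so each sublog covers only part of the behaviour. A toy sketch follows; the activity clusters here are invented, whereas in the process above they are derived from the causal activity graph:

```python
# Hypothetical activity clusters (in ProM these come from the activity cluster array).
clusters = [{"register", "check"}, {"decide", "pay"}]

log = [
    ("register", "check", "decide", "pay"),
    ("register", "decide", "check", "pay"),
]

# Project each trace onto each cluster, yielding one sublog per cluster.
sublogs = [
    [tuple(a for a in trace if a in cluster) for trace in log]
    for cluster in clusters
]
print(sublogs[0])  # → [('register', 'check'), ('register', 'check')]
```

Unlike vertical splitting, every case appears in every sublog, but each sublog only sees the activities of its cluster; the partial models discovered per cluster are merged again afterwards.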
Figure 3.2.20: The ‘Mine Control Flow Horizontally’ subprocess.
The subpage of the ‘Mine Control Flow Horizontally’ subprocess can be seen in the figure above. Here, ‘Collection Iteration Parallel’ is the only important operator, as it iterates over a collection of objects. It is a nested operator, and its subprocess executes once for each object of the given collection. Moreover, each iteration is executed in parallel. In our case, the operator iterates over a collection of logs. In the figure below, the subpage of the operator can be seen.
Figure 3.2.21: The subpage of the ‘Collection Iteration Parallel’ operator.
The event log is the only input of the operator ‘Convert Log into Event Log Array’, and it will be used by several subsequent operators (‘Multiply (16)’, ‘DALA2’, and ‘Convert Event Log Array into Collection of Logs’). Furthermore, via the ‘Recall (2)’ operator the ProM Context is recalled so that it can be used by the ‘Convert Accepting Petri Net Array into Accepting Petri Net’ and ‘DALA2’ operators. The ‘DALA2’ operator discovers an Accepting Petri Net Array out of an Event Log Array. The result of the ‘DALA2’ operator is used by the ‘Convert Accepting Petri Net Array into Accepting Petri Net’ operator. Finally, the last subprocess, ‘Merge Horizontally’, is the same as the subprocess ‘Merge Vertically’ (see Figure 3.2.25).
Figure 3.2.25: The Subprocess ‘Merge Horizontally’.