Deep Learning Transformer Architecture for Predictive Business Processes Monitoring and Anomaly Detection

A Thesis presented to

Systems and Computing Engineering Department

Universidad de los Andes, Bogotá, Colombia

By

Mauricio Díaz Torres

[email protected]

Oscar González Rojas (Advisor)

o-gonza1@uniandes.edu.co

Manuel Camargo (Research Collaborator)

[email protected]

December, 2020

Table of Contents

1 Introduction
2 Background
   2.1 Definitions
   2.2 Deep Learning Models
   2.3 Characterization of Architectures
   2.4 Classification of Anomalies
   2.5 A baseline architecture for Predictive Monitoring
3 Related Work
4 Contributions
   4.1 Main Goal
   4.2 Specific Goals
5 Approach
   5.1 Predicting the next event of a process using a Transformer architecture
      TokenAndPositionEmbedding
      Attention (Scaled Dot-Product Attention)
      MultiHeadSelfAttention
      Transformer Block
      Final Layers
   5.2 Detecting potential anomalies based on predictive monitoring
      Global Threshold
      Specific Thresholds
      Generation of Anomalies
6 Evaluation
   6.1 Data sets
   6.2 Experiment 1: Accuracy and performance of Predictive Monitoring
      Setup
      Results
   6.3 Experiment 2: Accuracy for Anomaly Detection
      Setup
      Results
   6.4 Threats to validity
7 Conclusion and Future Work
8 References

Tables

Table 1: First part of the characterization of the different architectures
Table 2: Second part of the characterization of the different architectures
Table 3: Approaches for Predictive Monitoring that use Deep Learning
Table 4: Approaches for Anomaly Detection. Taken from [12]
Table 5: Event logs description
Table 6: Individual Accuracy measures in % for the next activity prediction
Table 7: Comparison of accuracy measures in % for the next activity prediction. Taken from [2] vs Our Approach
Table 8: Anomaly Detection performance using Accuracy Metric
Table 9: F1 scores over BPI12 and BPI13 by Method. Taken from [12] vs Our approach

Figures

Figure 1: LSTM Architecture
Figure 2: Proposed Model Architecture
Figure 3: Token and Position Embedding Layer
Figure 4: Three Ways of Attention
Figure 5: Inputs and Outputs of event sequences
Figure 6: Scaled Dot-Product Attention
Figure 7: MultiHeadSelfAttention
Figure 8: Transformer Block
Figure 9: Global Threshold Calculation
Figure 10: Specific Thresholds Calculation
Figure 11: Generation of Anomalies
Figure 12: Example of Anomaly in "BPI 2012"

Equations

Equation 1: Attention
Equation 2: Accuracy
Equation 3: Precision
Equation 4: Recall
Equation 5: F1

1 Introduction
Over the years, deep learning and neural networks have been applied in many fields such as Natural Language Processing (NLP), healthcare, computer vision, and speech recognition. Although NLP is the main application focus of neural networks, considerable research has also been carried out in Predictive Process Monitoring [1,2,6,7], an area that aims to anticipate the characteristics of the events of a running process. Different methods and architectures have been used to deal with this problem and to obtain the best prediction accuracy for different features of a case; these architectures are addressed in the next section. The objective of this work is to implement a predictive monitoring method for business processes using the Transformer deep learning architecture, in order to assess its performance in predicting anomalous next events. To decide whether the next event is anomalous, only the prediction results and statistical methods are used, rather than a dedicated anomaly detection method such as DeepAlign [8] or BINet [12]. Moreover, we treat an anomalous event as an event that is out of the ordinary, not necessarily an undesirable one. For this reason, we first review the different techniques for next event prediction used in related work, then explain the selected architecture and implement it for next event prediction only, and finally validate the technique on several data sets and compare the results with other works.

2 Background
In this section we address different deep learning models and their applications. Next, we review work done in the field of Predictive Process Monitoring.

2.1 Definitions
Before describing each type of network, it is necessary to clarify certain concepts such as business processes, process mining and event logs, as well as two properties of a network: whether it is discriminative or generative, and whether it is supervised or unsupervised. A network can be discriminative or generative depending on its purpose, and some types of networks can be both.

Definition 1 (Business process). A business process is a set of repeatable steps or activities taken by an organization to achieve a goal, e.g., deliveries or assembling products.

Definition 2 (Process mining). Process mining is a discipline whose objective is to monitor, discover and improve business processes through the analysis of the event logs of a process. These event logs are stored in the information systems of the organization.

Definition 3 (Event log). An event log is a collection of cases; each case represents an execution of a business process and can be seen as a trace/sequence of events. Events contain attributes such as the case id, the timestamp and the role of the event.

Definition 4 (Discriminative). The purpose of a discriminative network is to classify; for example, is the image we are seeing a dog or a cat? The input of this type of network is therefore the object to classify.

Definition 5 (Generative). Generative networks, as their name indicates, generate data. Instead of giving the network an image, we give it the label "dog", and the network generates images of a dog (or at least of what it thinks a dog looks like).

The other property of a network is whether it is supervised or unsupervised.

Definition 6 (Supervised). A supervised model needs human interaction while an unsupervised model does not. In a supervised model the network learns under supervision, on a labeled dataset, with someone judging whether it is doing the right job or not.

Definition 7 (Unsupervised). On the contrary, an unsupervised model learns the inherent structure of unlabeled data automatically and does not need human interaction.

2.2 Deep Learning Models
The first architecture is the Multilayer Perceptron (MLP), also known as a feedforward network. It consists of at least three layers of nodes: an input, a hidden and an output layer. It is the most typical neural network model; its purpose is to approximate a function f(). It requires a fixed-size input, which makes it very limited. It is clearly a supervised model and a classifier (discriminative network). The main applications of this kind of network are simple logistic and linear regression problems. It is nevertheless important to consider this network because it is used within other models discussed later.

Convolutional Neural Networks (CNNs) are networks that perform one or more convolution (cross-correlation) operations. CNNs are mostly used for multi-dimensional data, i.e., images. They have learnable weights and biases and are capable of considering both the local and the global characteristics of the input data. They are specialized in processing images, although in some cases they have been used for sequential data. CNNs can be generative or discriminative and are supervised models, but when these networks are too deep they can suffer from vanishing gradients (ResNet [4] and DenseNet [5] are a solution for this).

Recurrent Neural Networks (RNNs) are networks that have one or more recurrent (cyclic) connections, as opposed to only feed-forward connections. They are a generalization of the feedforward neural network with an internal memory. RNNs are suitable for learning representations of sequential data: the internal design allows the network to discover dependencies in the history of the data that are useful for prediction. They can also be discriminative or generative, and are supervised models. The main disadvantage of simple RNNs is that, in practice, they cannot learn long-term dependencies; LSTM and GRU address this problem.

Long Short-Term Memory (LSTM) networks address the problem of long-term dependencies, i.e., remembering past information that is relevant to the present output. This architecture is a variation of the RNN in which a memory unit is introduced. It can process not only single data points (such as images) but also entire sequences of data (such as speech or video). Other applications of LSTMs are unconstrained handwriting recognition and speech recognition, among others. These are the most used networks in Predictive Process Monitoring, which makes them especially relevant for our work.

Generative Adversarial Networks (GANs) are an unsupervised model in which two models are trained simultaneously, a generator and a discriminator. Their purpose is to arrive at a model that approximates the input distribution. The role of the generator is to keep figuring out how to generate fake data that can fool the discriminator, while the discriminator is trained to distinguish between fake and real signals. As training progresses, the discriminator can no longer tell the difference between the generated data and the real data. From there, the discriminator can be discarded, and the generator can be used to create new, realistic signals that have never been observed before. GANs can be hard to train (they are very sensitive to changes, which causes network instability), but the results can be very promising.

Finally, there is the attention mechanism and, specifically, the Transformer. This model is used primarily in NLP. Transformers are designed to handle sequential data, are very quick to train, and have improved the performance of neural machine translation applications. They outperform both recurrent and convolutional models on NLP tasks, require less computation to train, and are a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude. Because Transformers do not rely on sequential processing and lend themselves very easily to parallelization, they can be trained more efficiently on larger datasets. The model is structured as an encoder block and a decoder block (inside which attention is used; see [3]), followed by a linear (fully connected feed-forward) layer and a SoftMax layer. Transformers can therefore be an opportunity for improvement in next event prediction.

2.3 Characterization of Architectures

Multilayer Perceptrons (MLPs)
- Type of network: Feedforward neural network
- Type of input (for best use): Fixed-size input
- Characteristics: Not parameter efficient. Not optimal for processing sequential and multi-dimensional data patterns. Struggles to remember patterns in sequential data.

Recurrent Neural Networks (RNNs)
- Type of network: RNN
- Type of input (for best use): Sequential data
- Characteristics: In practice, RNNs are not able to learn "long-term dependencies". SimpleRNN has the lowest accuracy among MLP, RNN and CNN. Requires a smaller number of parameters than MLP.

Convolutional Neural Network (CNN)
- Type of network: CNN
- Type of input (for best use): Multi-dimensional data; sequential data in some cases (in the form of a 1D convolution)
- Characteristics: Requires a smaller number of parameters than MLP. Deep CNNs suffer from vanishing gradients (ResNet and DenseNet are a solution for this). Higher accuracy than MLPs (train accuracy and test accuracy).

Long Short-Term Memory (LSTM)
- Type of network: RNN
- Type of input (for best use): Sequential data
- Characteristics: Not hardware friendly; it takes a lot of resources, which we do not have, to train these networks fast. The LSTM model displays much greater volatility throughout its gradient descent compared to the GRU model. Slow to train.

Gated Recurrent Unit (GRU)
- Type of network: RNN
- Type of input (for best use): Sequential data
- Characteristics: GRUs train a bit faster than LSTMs. The GRU controls the flow of information like the LSTM unit, but without having to use a memory unit; it just exposes the full hidden content without any control. GRUs are simpler and thus easier to modify, and computationally more efficient than LSTMs.

Densely Connected Convolutional Networks (DenseNet)
- Type of network: Deep CNN
- Type of input (for best use): Multi-dimensional data
- Characteristics: Better accuracy than CNN. Takes a substantial amount of time to train. Improved the ResNet technique further by allowing every convolution to have direct access to the inputs and to lower-layer feature maps. Also manages to keep the number of parameters low in deep networks by utilizing both Bottleneck and Transition layers.

Autoencoders
- Type of network: Autoencoder
- Type of input (for best use): Sequential data; multi-dimensional data
- Characteristics: Use a transposed CNN to decode; the transposed CNN produces an image given feature maps. Autoencoders have practical applications both in their original form and as part of more complex neural networks.

Generative Adversarial Networks (GANs)
- Type of network: CNN/RNN
- Type of input (for best use): Sequential data; multi-dimensional data
- Characteristics: Made up of two networks, a generator and a discriminator. Unlike autoencoders, generative models are able to create new and meaningful outputs given arbitrary encodings. Hard to train (very sensitive to changes, which causes network instability).

Variational Autoencoders (VAEs)
- Type of network: Autoencoder
- Type of input (for best use): Sequential data; multi-dimensional data
- Characteristics: Unlike autoencoders, the latent space of VAEs is continuous, and the decoder itself is used as a generative model. VAEs have an objective similar to GANs, learning how to generate new data, by focusing on learning the latent vector modeled as a Gaussian distribution. They resemble GANs in that both attempt to create synthetic outputs from a latent space; however, VAE networks are much simpler and easier to train than GANs. Samples from image VAEs tend to be blurry.

Attention Mechanism (Transformers)
- Type of network: Self-attention
- Type of input (for best use): Sequential data
- Characteristics: Parallelization for sequential data. Because Transformers do not rely on sequential processing and lend themselves very easily to parallelization, they can be trained more efficiently on larger datasets. Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs, and contain residual connections and layer normalization steps. Outperform both recurrent and convolutional models on NLP. Require less computation to train and are a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.

Table 1: First part of the characterization of the different architectures

Multilayer Perceptrons (MLPs)
- Applications in general: Simple logistic and linear regression problems. Supervised learning.
- Applications in business processes: (none listed)
- Discriminative / Generative: Discriminative
- Supervised / Unsupervised: Supervised

Recurrent Neural Networks (RNNs)
- Applications in general: Text generation, text summarization, report generation, conversational UI.
- Applications in business processes: (none listed)
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Supervised

Convolutional Neural Network (CNN)
- Applications in general: Excels in extracting feature maps for classification, segmentation and generation.
- Applications in business processes: Predictive process monitoring [7]
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Supervised

Long Short-Term Memory (LSTM)
- Applications in general: Unconstrained handwriting recognition, speech recognition, handwriting generation, machine translation, image captioning, and parsing. Detection of anomalies in time series.
- Applications in business processes: Prevent and mitigate upcoming problems during process execution [9]. Predictive process monitoring [1]
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Supervised

Gated Recurrent Unit (GRU)
- Applications in general: NLP, machine comprehension, question answering and neural retrieval ranking.
- Applications in business processes: Real-time multivariate anomaly detection [8]
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Supervised

Densely Connected Convolutional Networks (DenseNet)
- Applications in general: Classification and regression tasks. Image recognition/classification. Segmentation, detection, tracking, generation, and visual/semantic understanding.
- Applications in business processes: (none listed)
- Discriminative / Generative: Discriminative
- Supervised / Unsupervised: Supervised

Autoencoders
- Applications in general: Denoising, colorization, feature-level arithmetic, detection, tracking, and segmentation.
- Applications in business processes: Anomaly detection [10]
- Discriminative / Generative: Discriminative
- Supervised / Unsupervised: Unsupervised

Generative Adversarial Networks (GANs)
- Applications in general: Ability to synthesize data or signals that look real. Image generation, computer vision, computer graphics, and image processing/translation.
- Applications in business processes: Predictive process monitoring [11]
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Unsupervised

Variational Autoencoders (VAEs)
- Applications in general: Ability to synthesize data or signals that look real. Image generation, computer vision, computer graphics, and image processing/translation.
- Applications in business processes: (none listed)
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Unsupervised

Attention Mechanism (Transformers)
- Applications in general: NLP, healthcare, speech recognition, graph attention networks, recommender systems, self-driving cars.
- Applications in business processes: Predictive process monitoring [19]
- Discriminative / Generative: Discriminative/Generative
- Supervised / Unsupervised: Semi-supervised

Table 2: Second part of the characterization of the different architectures

2.4 Classification of Anomalies

There are several types of anomalies, as shown in [8].

- Skip: a necessary event in a trace has been skipped.
- Insert: a random activity has been inserted in the trace.
- Rework: an event that has been executed a second time.
- Early: an event that has been executed too early, and hence is skipped later in the case.
- Late: an event that has been executed too late, and hence is skipped earlier in the case.
- Attribute: an incorrect attribute has been set in an event.

For the sake of simplicity, the only type of anomaly within the scope of this work is the Insert anomaly, where sequences of events are modified by inserting anomalous activities.

2.5 A baseline architecture for Predictive Monitoring
In [1], the authors implemented a composition of LSTMs to predict the next activity and its timestamp, as well as the remaining cycle time and the suffix of a running case. They made the following contributions: prediction architectures that support large numbers of event types by using an embedded dimension and the idea of interleaving shared and specialized layers; extraction of n-grams from event logs; prediction of the next event and its timestamp in a sequence (numerical variables); and random sampling for the category selection of the next predicted event. Their work is divided into three main phases: pre-processing, model structure definition, and post-processing. The key elements of the pre-processing phase are embedded dimensions, attribute scaling and n-grams. An embedding is a mapping of a discrete variable to a vector of continuous numbers; embedded dimensions help to control exponential attribute growth. In their experiments they used embeddings to encode the log and map the categories into an n-dimensional space, where the distance between categories represents how close an activity performed by one role is to the same activity performed by another role. Continuous attributes are handled differently: to scale attribute values, they evaluated two techniques for scaling relative times, maximum value and log-normalization. Finally, the extraction of n-grams lets the model see patterns of sub-sequences describing the execution order of activities, roles or relative times; they used different n-gram sizes to compare the results.

In the model structure definition phase, the basic architecture of the LSTM network consists of an input layer for each attribute (activity, role and relative time). They tested three variants of the architecture: specialized (three independent models), shared categorical (concatenates the inputs of activities and roles), and full shared (concatenates all of the inputs and shares the first LSTM layer).

Lastly, in the post-processing phase, they compare two different techniques for category selection to generate complete traces of business processes: arg-max and random choice. Arg-max consists in selecting the category with the highest predicted probability (the most common option). Random choice, on the contrary, has to be used if the model is employed in a generative way; it consists in randomly selecting a new category following the predicted probability distribution.
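The two selection strategies can be illustrated with a short sketch (this is not the authors' code; the probability vector below is a made-up example):

```python
# Illustrative sketch of arg-max vs. random-choice category selection,
# assuming `probs` is a model-predicted probability vector over event types.
import numpy as np

probs = np.array([0.05, 0.70, 0.15, 0.10])  # hypothetical next-event distribution

# Arg-max: always pick the most probable category (deterministic).
next_event_argmax = int(np.argmax(probs))

# Random choice: sample a category following the predicted distribution,
# which is what a generative use of the model requires.
next_event_random = int(np.random.choice(len(probs), p=probs))

print(next_event_argmax, next_event_random)
```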

Figure 1: LSTM Architecture. Each Block is a cell. H: processed output. X: new input. σ: sigmoid function. tanh: tanh function. x: pointwise multiplication operation. +: pointwise adding function

3 Related Work
The authors in [2] review different approaches that address Predictive Business Process Monitoring tasks such as next event prediction. For this, they evaluated 10 different approaches over 12 publicly available process logs, including the BPI and Helpdesk logs that are also evaluated in our approach. Among the models evaluated for next event prediction in this benchmark are the baseline architecture [1], an LSTM network, and other types of networks such as CNN and GRU; see Table 3. For the next event prediction evaluation, we will compare these works with our approach on the BPI and Helpdesk data sets.

Author, Year                     Reference   Network type
Pasquadibisceglie et al.         [7]         CNN
Tax et al.                       [13]        LSTM
Camargo et al.                   [1]         LSTM
Hinkka et al.                    [14]        GRU
Khan et al.                      [15]        DNC
Evermann et al.                  [16]        LSTM
Mauro et al.                     [17]        CNN
Theis et al. (w/o attributes)    [18]        DFNN
Theis et al. (w/ attributes)     [18]        DFNN

Table 3: Approaches for Predictive Monitoring that use Deep Learning. Variants: without attributes (w/o attributes), with attributes (w/ attributes). Network types: Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), Deep Feedforward Network (DFNN), Differentiable Neural Computer (DNC). Taken from [2]

The authors in [8] propose DeepAlign, an architecture based on RNNs and bidirectional beam search, for anomaly correction by combining predictive techniques. Two RNNs are trained to predict the next event, one reading cases from left to right and the other from right to left (backwards). A bidirectional beam search is then used to transform the input case into the most probable case, and finally the alignment is calculated based on the search history of the algorithm. They evaluate their model using synthetic datasets, and the results indicate that RNNs are capable of modeling the behavior of a process based solely on an event log, even if it contains anomalous behavior. Their results are reported in terms of correction accuracy, average error for incorrect alignments and alignment optimality for correct alignments, and the best results of their implementation were close to 97% accuracy. This work is a good example of a very complex model for anomaly detection: the problem they address is not exactly the same as ours and they do not use real datasets, but it is a good reference in terms of the results they present. BINet [12] is a recurrent neural network (GRU) trained to predict the next event and its attributes, and it can be used for real-time anomaly detection since it does not require a completed case. The internal architecture of BINet is composed of two parts, control flow and data flow, where CFNet is responsible for predicting the next event activity and DataNet is responsible for predicting the rest of the attributes. They follow the same idea of detecting anomalies in the next event, based on the assumption that an anomalous attribute will be assigned a lower probability than a normal attribute. They define a scoring function whose result is the difference between the probability of the most likely attribute (according to their model) and the probability of the actual attribute value; whenever an anomaly score is greater than a threshold t, the attribute is flagged as anomalous. They evaluated their method on synthetic datasets and also on real datasets such as BPIC12, although the results on the real logs were not as good, below 0.6 in most cases. They compared their results with 6 other methods (see Table 4), which we will also use to evaluate our approach.

Method       Reference   Description
OC-SVM       [20]        One-Class Support Vector Machine
Naive        [21]        Naive Algorithm
Sampling     [21]        Sampling Algorithm
t-STIDE+     [22]        Sliding Window
Likelihood   [23]        Extended Likelihood Graph
DAE          [24]        Denoising Autoencoder
BINet        [12]        Multivariate Business Process Anomaly Detection

Table 4: Approaches for Anomaly Detection. Taken from [12]

The problem with these anomaly detection works is that they only achieved good results with artificial datasets rather than real ones: BINet did not obtain good results with real datasets, and DeepAlign only used artificial datasets, which makes DeepAlign less comparable with our work. Another problem is that they use RNNs, which are very slow to train and not hardware friendly; they take more resources than we have available to train quickly. The purpose of our work is not to surpass the results of DeepAlign or BINet, because our model is much simpler than theirs; the objective is to introduce the Transformer architecture to this type of problem and to draw conclusions about its results and behavior in this field. Nevertheless, at the end we compare our results with BINet.

4 Contributions

4.1 Main Goal
Demonstrate that Transformer models can be used to predict the next event and to detect anomalous next events, obtaining results comparable to the state of the art on real datasets such as the BPI logs, and thus show that complex models are not needed to achieve good results: from simple things you can achieve great things.

4.2 Specific Goals
- Outperform related works in Predictive Business Process Monitoring in the accuracy of next event prediction.
- Implement a simple and new way to detect anomalies in the next events of business processes.

5 Approach

5.1 Predicting the next event of a process using a Transformer architecture
The Transformer architecture was chosen because there has not been much research on this architecture in Predictive Business Process Monitoring and Anomaly Detection. In NLP, Transformers have replaced all types of RNN; the simplicity of the architecture makes it much more efficient, and its results have been better than those of other types of models. Hence, we propose a model based on the Transformer [3] to detect whether the next event of a sequence is anomalous or not. Following the same pre-processing as the baseline architecture [1], our approach uses the same embedding technique to encode the log and map the categories into an n-dimensional space in the "TokenAndPositionEmbedding" layer. In the original paper [3], the Transformer is based on an encoder and a decoder block, each with different types of attention mechanisms and other layers, together with a token and positional embedding. The encoder and decoder are transformer blocks. Our approach consists of only one transformer block and the embedding layers. In this section we explain each block and layer of this architecture.

Figure 2: Proposed Model Architecture

TokenAndPositionEmbedding
The first layer of the proposed architecture is the TokenAndPositionEmbedding layer, which is composed of two embedding layers. The input embedding clusters elements based on content: in NLP, the words of a sentence (the prefix in our problem) are mapped based on their meaning, so words with a similar meaning end up close to each other. In our case, the words are the activities and the sentence is the prefix. Since this model contains no recurrence or convolution, a positional encoding is added to give the model information about the relative position of each event in the sequence. So, just as the input embedding clusters elements based on content, the positional encoding clusters elements based on their position (time of execution). In other words, seen as an NLP problem, the positional encoding vector is added to the embedding vector: embeddings represent a token in a d-dimensional space where tokens with a similar meaning are closer together, but they do not encode the relative position of words in a sequence of elements. After adding the positional encoding, words are close to each other in the d-dimensional space based both on the similarity of their meaning and on their position in the sentence. In the end, there is one embedded vector for each element of the prefix, independent from the others, which allows the parallelization of sequential data. See Figure 3.
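As a reference, the following is a minimal sketch of this layer in Keras, close to the standard public implementation; maxlen, vocab_size and embed_dim are hyperparameters left unspecified here.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Content (token) embedding plus position embedding, added together."""
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        # Embedding of the activities (tokens) of the prefix.
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        # Embedding of the position of each activity inside the prefix.
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        # The positional vector is added to the content embedding vector.
        return x + positions
```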

Figure 3: Token and Position Embedding Layer

Attention (Scaled Dot-Product Attention)
Attention answers the question: what part of the input should we focus on? Attention is content-based querying. There are three ways of using attention in the original architecture [3]: encoder self-attention, masked decoder self-attention and encoder-decoder attention. See Figure 4.

Figure 4: Three Ways of Attention

The encoder-decoder attention is used in the original architecture to translate from an input to an output, so this attention attends from the output to the input. The model encodes and decodes a word but in a different representation (a different language in this case); what the attention does is to look, according to the weights of the vector that comes out of the attention block (we will see how that layer works in detail), for the most similar element in the other representation, and that becomes the translated word. Then we have self-attention, that is, attention with respect to oneself. In the translation task, we want to know the relevance of the i-th word of a sentence in relation to the other words of the same sentence. For every word, an attention vector is generated which captures the contextual relationships between the words of the sentence. This attention block is located in the encoder of the original architecture; see [3]. But when translating, we are in a sense predicting the next translated word, so if we only use self-attention the model simply attends to the next word, looks at it and copies it; there would be no way to decode it, and we would have a model that trains immediately but has no way of doing inference.

So where does the prediction take place? The masked decoder solves this issue: in this attention layer we only attend to the elements before the current one. Everything that is ahead of the current element is masked, so it becomes 0. It is the same as self-attention but with the masking; the masked decoder self-attention is located in the decoder block of the original architecture [3]. Knowing what attention is and its different types, it could be inferred that our approach needs a masked decoder self-attention layer to predict the next event. However, the pre-processing already performs this "masking" on the prefixes: the input is not the whole sequence of events, but only the event at time t together with its previous events in the sequence. See Figure 5.
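A small sketch of this pre-processing step, under our assumptions about padding, is shown below; build_prefixes is a hypothetical helper, not part of any library.

```python
def build_prefixes(trace, maxlen, pad_value=0):
    """Turn a trace [a1, ..., an] of activity ids into (prefix, next event) pairs."""
    pairs = []
    for t in range(1, len(trace)):
        prefix = trace[:t]
        # Left-pad (and truncate) the prefix so every input has the same length.
        padded = [pad_value] * (maxlen - len(prefix)) + prefix[-maxlen:]
        pairs.append((padded, trace[t]))
    return pairs

# Example with a short trace of encoded activities:
print(build_prefixes([5, 7, 10, 1], maxlen=4))
# [([0, 0, 0, 5], 7), ([0, 0, 5, 7], 10), ([0, 5, 7, 10], 1)]
```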

Figure 5: Inputs and Outputs of event sequences

How does attention work? Scaled dot-product attention basically consists of matrix multiplications. In attention we have three matrices of vectors: one for queries Q, one for keys K, which have dimension d_k, and one for values V. We can think of the query as the current word (or event, in our case), of the values as all the past events generated before, and of the keys as indexes of the values. What attention does is take the query, find the most similar key and then retrieve the values that correspond to that key. It is a multiplication of matrices: we take Q, multiply it by K transposed, normalize the result by dividing by the square root of d_k, and then apply a SoftMax, which pushes the keys that are similar to the query close to 1; when we multiply by the values, only the ones we are interested in are kept and the others become 0.

Figure 6: Scaled Dot-Product Attention

A(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V    (1)
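Equation 1 translates directly into code; the following is a minimal TensorFlow sketch of scaled dot-product attention (without masking), not the exact implementation used in our experiments.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (..., seq_len, depth). Returns output and weights."""
    scores = tf.matmul(q, k, transpose_b=True)       # Q K^T
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = scores / tf.math.sqrt(d_k)              # Q K^T / sqrt(d_k)
    weights = tf.nn.softmax(scaled, axis=-1)         # softmax over the keys
    return tf.matmul(weights, v), weights            # weighted sum of the values
```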

MultiHeadSelfAttention
Multi-head attention allows the model to jointly attend to information from different representation sub-spaces at different positions. Each head is an attention layer: in this layer the attention function is performed in parallel in every head, generating output values. These outputs are then concatenated and projected. The final output of this layer is a set of weighted vectors, one for each event of the sequence (in NLP, one for each word of the sentence); the weights of each vector sum to 1, of course.
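Multi-head self-attention can be written by hand by splitting Q, K and V into one sub-space per head, applying Equation 1 in each head and concatenating the results; the minimal sketch below, using the built-in Keras layer, conveys the same idea (embed_dim and num_heads are assumed values).

```python
import tensorflow as tf
from tensorflow.keras import layers

embed_dim, num_heads = 36, 4
self_attention = layers.MultiHeadAttention(num_heads=num_heads,
                                           key_dim=embed_dim // num_heads)

# Self-attention: the same sequence of embedded events is query, key and value.
x = tf.random.normal((1, 5, embed_dim))              # (batch, prefix_length, embed_dim)
attended = self_attention(query=x, value=x, key=x)   # output keeps the same shape
```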

Figure 7: MultiHeadSelfAttention

Transformer Block
The transformer block is composed of the MultiHeadSelfAttention layer and a position-wise fully connected feed-forward network (FFN). The multi-head attention output vector is added to the original positional input embedding (residual connection). The output of the residual connection then goes through a normalization layer, the normalized residual output goes through the FFN, and the output is normalized again. Residual connections help train the network by allowing gradients to flow directly through it, the normalizations stabilize the network, which substantially reduces training time, and the FFN projects the attention outputs, potentially giving a richer representation.
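A minimal Keras sketch of such a block, in the spirit of the standard example and with assumed dropout rates, could look as follows.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """Multi-head self-attention + residuals + layer normalization + position-wise FFN."""
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)                     # multi-head self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)               # residual + normalization
        ffn_output = self.ffn(out1)                                # position-wise FFN
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)                  # residual + normalization
```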

Figure 8: Transformer Block

Final Layers
Finally, the output of the transformer block is projected through the last layers: GlobalAveragePooling1D, a ReLU layer, and a SoftMax layer. The purpose of these layers is simply to prepare the output: the GAP layer downsamples the input representation by taking the average value over the time dimension (removing one dimension), the ReLU layer converts all negative values to zero (we do not care about negative values), and the SoftMax layer converts the output into the predicted next-token probabilities (the probability distribution of the next event).

So, in the end, we obtain a vector of probabilities in which the element with the highest probability is the predicted next event. For example, if the element in the second position of the vector is the most probable, the event identified as a2 will be the next one.
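Putting the pieces together, the whole next-event model of Figure 2 can be sketched as below, assuming the TokenAndPositionEmbedding and TransformerBlock classes sketched earlier are defined and using hyperparameter values chosen only for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

maxlen, vocab_size, embed_dim, num_heads, ff_dim = 15, 20, 36, 4, 64

inputs = layers.Input(shape=(maxlen,))
x = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)
x = layers.GlobalAveragePooling1D()(x)              # average over the time dimension
x = layers.Dense(20, activation="relu")(x)          # keep only non-negative features
outputs = layers.Dense(vocab_size, activation="softmax")(x)  # next-event distribution

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```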

5.2 Detecting potential anomalies based on predictive monitoring
In a generalized way, an anomaly can be seen as an event that should not have happened. Accordingly, we can determine anomalies from the output of our predictions: we only have to look, in the output vector, for the activities that have extremely low weights; those activities are likely to be anomalous, since activities with a high probability are normal for the model. But how do we determine the threshold below which every weight is considered an anomaly? For this we need two types of thresholds, global and specific.

Figure 9: Global Threshold Calculation

Global Threshold
For the calculation of the global threshold, we compute the average and the standard deviation of the weight of the events that the model should have predicted but predicted incorrectly. From these we extract a confidence level by subtracting the standard deviation from the average. In theory, any event whose weight is below this threshold is anomalous.

Specific Thresholds
Why do we need specific thresholds? Because the model is not perfect and does not have perfect accuracy when predicting the next event, and hence when flagging an anomaly, so we cannot use the same threshold for all events. That is why we implement one threshold per type of prefix (each one clearly has a different error rate): even if the model has a significant overall error, some types of prefixes have good prediction accuracy and therefore should have their own threshold. The calculation is the same as for the global threshold, but done separately for each prefix.
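An illustrative sketch of both calculations is shown below, under our assumption that the "weight" of an event is the probability the model assigned to the true next event on incorrectly predicted instances.

```python
import numpy as np

def global_threshold(true_event_probs_on_errors):
    """Mean minus standard deviation of the probability given to the actual
    next event, over the validation instances the model predicted incorrectly."""
    probs = np.asarray(true_event_probs_on_errors)
    return probs.mean() - probs.std()

def specific_thresholds(errors_by_prefix):
    """Same calculation, applied separately to the wrong predictions of each
    prefix type; `errors_by_prefix` maps a prefix (tuple) to such a list."""
    return {prefix: global_threshold(probs)
            for prefix, probs in errors_by_prefix.items()}
```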

Figure 10: Specific Thresholds Calculation

Generation of Anomalies
For the generation of anomalies, we used a dictionary of all types of events (activities), where the value of each key is the proportion of that activity in the training and validation sets; e.g., if activity x appeared 10 times in a log of 100 events, the value of that activity would be 0.1. To generate the anomalies we are interested in the abnormal activities of our dictionary, that is, those whose weight is below the global threshold. We then modified sequences of the testing data, 3 out of 10 (a probability of 30%), by inserting one random abnormal event into the sequence, and marked the modified sequences for the evaluation.

Figure 11: Generation of Anomalies
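A sketch of this generation step, with assumed function and variable names (including the choice of insertion position, which is not fixed by the description above), could look as follows.

```python
import random

def generate_anomalies(test_sequences, activity_freqs, glob_threshold, rate=0.3, seed=42):
    """Insert one random 'abnormal' activity into ~30% of the test sequences.

    activity_freqs: dict {activity: proportion in training + validation data}.
    Activities whose proportion is below the global threshold are considered abnormal.
    Returns (sequence, is_artificially_anomalous) pairs for the evaluation."""
    rng = random.Random(seed)
    abnormal = [a for a, freq in activity_freqs.items() if freq < glob_threshold]
    labeled = []
    for seq in test_sequences:
        if abnormal and rng.random() < rate:
            pos = rng.randrange(len(seq) + 1)
            seq = list(seq[:pos]) + [rng.choice(abnormal)] + list(seq[pos:])
            labeled.append((seq, True))      # modified, marked as anomalous
        else:
            labeled.append((list(seq), False))
    return labeled
```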

But, apart from generating anomalies, what else is the global threshold good for? In the test data there are prefixes that were not in the training and validation sets and therefore have no specific threshold. This is when the global threshold comes into play: for a prefix that has never been seen before, if the weight of its next event is below the global threshold, it is counted as an anomaly. Now that the anomalies are generated and the thresholds calculated, we can proceed with the detection of anomalies in the testing data.

6 Evaluation

6.1 Data sets
For this experiment we used nine real-life datasets, each one with different characteristics:

- Helpdesk: event log from the ticketing management process of the helpdesk of an Italian company.
- BPI 2012 and BPI 2012 W: loan application process of a Dutch financial institution.
- BPI 2013 CP: Volvo IT incident and problem management (closed problems).
- BPI 2015: five event logs containing data on building permit applications provided by five Dutch municipalities over a period of four years.

Event log     Num. Traces   Num. Events   Num. Activities   Avg. Activities per trace   Max. Activities per trace
Helpdesk      4580          21348         14                4.6                         15
BPI 2012      13087         262200        36                20                          175
BPI 2012 W    9658          170107        7                 17.6                        156
BPI 2013 CP   1487          6660          7                 4.47                        35
BPI 2015-1    1199          27409         38                22.8                        61
BPI 2015-2    832           25344         44                30.4                        78
BPI 2015-3    1409          31574         40                22.4                        69
BPI 2015-4    1053          27679         43                26.2                        83
BPI 2015-5    1156          36234         41                31.3                        109

Table 5: Event logs description

6.2 Experiment 1: Accuracy and performance of Predictive Monitoring

Setup
We ran the experiment on all the data sets and compared the results with [2], except for the BPI 2015 logs, since the comparison in [2] covers only the other event logs. We split each dataset into three parts, training, validation and testing, with a ratio of 60%, 20% and 20%, respectively. We use the same metric for next activity prediction as [2], accuracy, which measures the proportion of correct classifications with respect to the number of predictions made.

Results
The individual results indicate a relationship between the amount of data and the complexity of the event log; see Table 6. The best accuracy was obtained with "Helpdesk", a not very complex log but with a significant number of events and traces. On the other hand, in much more complex event logs such as "BPI 2012" and "BPI 2012 W", the accuracy decreases, and a high trace variability with respect to the number of events can also affect accuracy, as can be seen with the BPI 12 logs. However, why does a simple log like BPI 2013 CP give such low results? We concluded that this log is not only very small but also has a high number of traces per event, which caused the training of the network to have a high error and therefore a low precision. Lastly, the BPI 2015 logs are not taken into account for next event prediction because they are not structured by activities; the original log is structured by process tasks and requires a pre-processing step before it can be used for this kind of problem. This is also why these event logs are not used in related works.

Event log     Accuracy
Helpdesk      84.8
BPI 2012      66.88
BPI 2012 W    54.09
BPI 2013 CP   37.76
BPI 2015-1    25.79
BPI 2015-2    32.38
BPI 2015-3    24.02
BPI 2015-4    19.08
BPI 2015-5    18.08

Table 6: Individual Accuracy measures in % for the next activity prediction

Table 7 shows the accuracy scores obtained by the different approaches presented in [2]. As can be seen, our work obtains the best score in only 1 of the 4 datasets, which is not bad for the proposed objective: we are facing quite complex networks with an architecture that is new for this type of problem, and the proposed model is only one transformer block, something not as complex as the other approaches. It should also be noted that LSTM/GRU networks have been applied to this problem for quite some time, which means that the models using these architectures have been improved over the last years, while, to the best of our knowledge, the first time the Transformer architecture was used for Predictive Monitoring was in February 2020 [19].

Author, Year                     Helpdesk   BPI 2012   BPI 2012 W   BPI 2013 CP
Pasquadibisceglie et al.         65.84      82.59      81.59        24.35
Tax et al.                       75.06      85.20      84.90        65.57
Camargo et al.                   76.51      83.41      83.29        60.62
Hinkka et al.                    77.90      86.05      83.52        61.14
Khan et al.                      69.13      82.93      86.69        55.57
Evermann et al.                  70.07      60.38      75.22        55.66
Mauro et al.                     74.77      84.56      85.11        56.97
Theis et al. (w/o attributes)    67.80      77.64      85.77        52.31
Theis et al. (w/ attributes)     66.25      64.23      76.16        47.69
Our approach                     84.8       66.88      54.09        37.76

Table 7: Comparison of accuracy measures in % for the next activity prediction. Taken from [2] vs Our Approach

6.3 Experiment 2: Accuracy for Anomaly Detection

Setup
For this experiment we use the same predictions as in the first experiment, so that part of the setup is identical. After the model is trained or loaded, the thresholds are calculated and the anomalies are generated; then the prediction is made for the test set and the classification of anomalies is performed. From this classification we obtain the true positives (artificial anomalies classified as anomalous), true negatives (non-anomalous events classified as non-anomalous), false positives (non-anomalous events classified as anomalous) and false negatives (anomalous events classified as non-anomalous). With all the sequences classified, we can calculate the accuracy by adding the true positives and true negatives and dividing the result by the total number of sequences. We compare against the results of [12] on the BPI 2012 and BPI 2013 event logs. To compare our work with [12] we use the F1 score, which is defined as the harmonic mean of precision and recall. Precision and recall evaluate the quality of a classifier's output and also use T_p, F_p and F_n: high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

\mathrm{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}    (2)

\mathrm{Precision} = \frac{T_p}{T_p + F_p}    (3)

\mathrm{Recall} = \frac{T_p}{T_p + F_n}    (4)

F_1 = \frac{2 \cdot P \cdot R}{P + R}    (5)
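These metrics follow directly from the confusion-matrix counts; as a small check, the Helpdesk counts of Table 8 reproduce its reported detection accuracy.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 as in Equations 2-5."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Helpdesk counts from Table 8: accuracy comes out to ~0.9198.
print(metrics(tp=480, tn=2708, fp=187, fn=91))
```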

Results
The individual results show that, despite the next event prediction accuracy obtained in the first experiment, the anomaly detection method gives good results; see Table 8. This is because the threshold calculation is based on the wrong predictions of the model: if the prediction accuracy is low, there is more data with which to calculate the thresholds, and, clearly, if the prediction accuracy is good, the anomaly detection accuracy is better. So, either way, this method does not produce bad results.

Event log     Next Event Prediction Accuracy   TP     TN      FP     FN    Anomalies Introduced   Anomaly Detection Accuracy
Helpdesk      0.848                            480    2708    187    91    571                    0.9198
BPI 2012      0.6688                           3626   14640   3225   670   4296                   0.8242
BPI 2012 W    0.5409                           940    7682    628    951   1891                   0.8452
BPI 2013      0.3776                           114    665     5      37    151                    0.9488
BPI 2015-1    0.2579                           806    2002    912    45    851                    0.7458
BPI 2015-2    0.3238                           632    2020    653    74    706                    0.7848
BPI 2015-3    0.2402                           919    2378    1185   56    975                    0.7265
BPI 2015-4    0.1908                           707    1857    916    74    781                    0.7214
BPI 2015-5    0.1808                           999    2373    1478   134   1133                   0.6766

Table 8: Anomaly Detection performance using Accuracy Metric

Table 9 shows the F1 scores obtained by the different methods in [12]. Our method outperforms all the approaches on BPI 2013 and is among the best on BPI 2012, which shows that the methodology of calculating global and specific thresholds can compete with, and even outperform, complex methods such as BINet; to outperform the other methods on the remaining datasets as well, the model needs a better next event prediction accuracy.

Method         BPI 2012   BPI 2013
OC-SVM         0.42       0.52
Naive          0.58       0.47
Sampling       0.45       0.26
t-STIDE+       0.81       0.57
Likelihood     0.65       0.29
DAE            0.76       0.52
BINet          0.58       0.58
Our approach   0.65       0.84

Table 9: F1 scores over BPI12 and BPI13 by Method. Taken from [12] vs Our approach

In "BPI 2012", 3626 anomalies were detected out of the 4296 that were artificially introduced. To show the importance of detecting these anomalies in a business, we briefly analyze an anomaly detected in the "BPI 2012" dataset; see Figure 12. Because the model works on encoded activities, we have to "decode" each activity to understand what is happening in the sequence: 5 is "A_SUBMITTED", 7 is "A_PARTLYSUBMITTED", 10 is "A_DECLINED", and 1 is "A_ACCEPTED". Recall that this is an application process for a personal loan or overdraft within a global financing organization. As can be seen, there is something wrong with the sequence of events: how can a loan be accepted immediately after it was declined? That is an anomaly and, fortunately, it was detected in time by our approach.

Figure 12: Example of Anomaly in “BPI 2012”

6.4 Threats to validity
The results obtained in the previous section are good and comparable with other works in the state of the art; however, it must be taken into account that the input data may differ, because in the results taken as a baseline the anomalies are introduced in a different way than in our work. Since we rely on the global values reported in [12], the results may show variations. Likewise, what we perform is a validation against reference data; their experiment is not being exactly replicated.

7 Conclusion and Future Work

The Transformer architecture for Predictive Business Process Monitoring and Anomaly Detection has a lot of potential. Work on this type of problem with this architecture started no more than 10 months ago, and results comparable to other works in the field are already being obtained, even though the model is not as complex a network as an LSTM or a GRU. Although the main goal of our approach was not to outperform related work, in some event logs it surprisingly did, demonstrating that Transformer models can be used to predict and detect anomalies in the next event with results comparable to the state of the art, outperforming the other works in one of the 4 event logs tested for next event prediction. Likewise, for the detection of anomalies, we implemented a simple and promising way to detect anomalies in the next events of business processes. Although anomaly detection depends on prediction accuracy, this implementation achieves very good detection accuracy because it is based on the model's erroneous predictions. Thanks to this method, we outperformed all of the approaches on BPI 2013 and were among the best on BPI 2012. A possible extension of the model is to carry out the analysis of anomalies with respect to the entire trace, so that more types of anomalies can be detected, such as Skip, Late, Rework, Early and Attribute anomalies. Also, for future work, it would be interesting to analyze the anomalies taking into account their meaning for the business, that is, to link the anomalies with the outcome of the process. You do not need complex models to achieve good results: from simple things you can achieve great things.

8 References
1. M. Camargo, M. Dumas, and O. González Rojas. Learning Accurate LSTM Models of Business Processes. In Proc. of BPM, LNCS. Springer, 2019.
2. Rama-Maneiro, E., Vidal, J. C., and Lama, M., "Deep Learning for Predictive Business Process Monitoring: Review and Benchmark", 2020. https://arxiv.org/abs/2009.13251
3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
4. K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. 2015.
5. G. Huang, Z. Liu, L. van der Maaten, K. Weinberger. Densely Connected Convolutional Networks. 2016.
6. F. Taymouri, M. La Rosa, S. Erfani, Z. Bozorgi, I. Verenich. Predictive Business Process Monitoring via Generative Adversarial Nets: The Case of Next Event Prediction. 2020.
7. V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba. Using Convolutional Neural Networks for Predictive Process Analytics. 2019.
8. Nolle, T., Seeliger, A., Thoma, N., Mühlhäuser, M.: DeepAlign: alignment-based process anomaly correction using recurrent neural networks. In Advanced Information Systems Engineering, 2020, pp. 319–333.
9. Metzger, A., Neubauer, A., Bohn, P., Pohl, K. (2019) Proactive Process Adaptation Using Deep Learning Ensembles. In: Giorgini, P., Weber, B. (eds) Advanced Information Systems Engineering. CAiSE 2019. Lecture Notes in Computer Science, vol 11483. Springer, Cham. https://doi.org/10.1007/978-3-030-21290-2_34
10. Nolle, T., Luettgen, S., Seeliger, A. et al. Analyzing business process anomalies using autoencoders. Mach Learn 107, 1875–1893 (2018). https://doi.org/10.1007/s10994-018-5702-8
11. Taymouri, Farbod et al. "Predictive Business Process Monitoring via Generative Adversarial Nets: The Case of Next Event Prediction." Business Process Management (2020): 237–256.
12. Nolle, T., Seeliger, A., Mühlhäuser, M. (2018) BINet: Multivariate Business Process Anomaly Detection Using Deep Learning. In: Weske, M., Montali, M., Weber, I., vom Brocke, J. (eds) Business Process Management. BPM 2018. Lecture Notes in Computer Science, vol 11080. Springer, Cham. https://doi.org/10.1007/978-3-319-98648-7_16

13. N. Tax, I. Verenich, M. L. Rosa, and M. Dumas, “Predictive business process monitoring with LSTM neural networks,” in Proceedings of the 29th International Conference on Advanced Information Systems Engineering (CAISE 2017), ser. Lecture Notes in Computer Science, vol. 10253. Springer, 2017, pp. 477–492.

14. M. Hinkka, T. Lehto, and K. Heljanko, “Exploiting event log event attributes in RNN based prediction,” in Proceedings of the 9th International Symposium on Data-Driven Process Discovery and Analysis (SIMPDA 2019), ser. Lecture Notes in Business Information Processing, vol. 379. Springer, 2019, pp. 67–85.

15. A. Khan, H. Le, K. Do, T. Tran, A. Ghose, H. Dam, and R. Sindhgatta, “Memory-augmented neural networks for predictive process analytics.”

16. J. Evermann, J.-R. Rehse, and P. Fettke, “Predicting process behavior using deep learning,” Decision Support Systems, vol. 100, pp. 129–140, 2017.

17. N. D. Mauro, A. Appice, and T. M. A. Basile, “Activity prediction of business process instances with inception CNN models,” in Proceedings of the 18th International Conference of the Italian Association for Artificial Intelligence (AIIA 2019), ser. Lecture Notes in Computer Science, vol. 11946. Springer, 2019, pp. 348–361.

18. J. Theis and H. Darabi, “Decay replay mining to predict next process events,” IEEE Access, vol. 7, pp. 119 787–119 803, 2019.

19. P. Philipp, R. Jacob, S. Robert and J. Beyerer, "Predictive Analysis of Business Processes Using Neural Networks with Attention Mechanism," 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 2020, pp. 225-230, doi: 10.1109/ICAIIC48513.2020.9065057.

20. Schölkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.C., et al.: Support vector method for novelty detection. In: NIPS. vol. 12, pp. 582–588 (1999).

21. Bezerra, F., Wainer, J.: Algorithms for anomaly detection of traces in logs of process aware information systems. Information Systems 38(1), 33–44 (2013).

22. Warrender, C., Forrest, S., Pearlmutter, B.: Detecting intrusions using system calls: Alternative data models. In: Proceedings of the 1999 IEEE Symposium on Security and Privacy. pp. 133–145. IEEE (1999).

23. Böhmer, K., Rinderle-Ma, S.: Multi-perspective anomaly detection in business process execution events. In: OTM Confederated International Conferences "On the Move to Meaningful Internet Systems". pp. 80–98. Springer (2016).

24. Nolle, T., Luettgen, S., Seeliger, A., Mühlhäuser, M.: Analyzing business process anomalies using autoencoders. arXiv preprint arXiv:1803.01092 (2018).