
DEGREE PROJECT, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Context-based Multimodal Machine Learning on Game Oriented Data for Affective State Recognition

ILIAN CORNELIUSSEN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Context-based Multimodal Machine Learning on Game Oriented Data for Affective State Recognition

ILIAN CORNELIUSSEN

Master in Systems, Control and Robotics
Date: August 11, 2021
Supervisor: André Pereira
Examiner: Joakim Gustafsson
School of Electrical Engineering and Computer Science
Swedish title: Kontextbaserad multimodal maskininlärning på spelorienterad data för affektivt tillståndsigenkänning


Abstract

Affective computing is an essential part of Human-Robot Interaction, where knowing the human's emotional state is crucial to creating an interactive and adaptive social robot. Previous work has mainly focused on using unimodal or multimodal sequential models for Affective State Recognition. However, few have included context-based information in their models to boost performance. In this paper, context-based features are tested on a multimodal Gated Recurrent Unit model with late fusion on game oriented data. The results show that using context-based features such as the game state can significantly increase the performance of sequential multimodal models on game oriented data.

Keywords

Telepresence, Affective Recognition, Multimodal Machine Learning, Human-Robot Interaction


Sammanfattning

Affektiv beräkning är en viktig del av interaktion mellan människa och robot, där kunskap om människans emotionella tillstånd är avgörande för att skapa en interaktiv och anpassningsbar social robot. Tidigare arbete har främst fokuserat på att använda unimodala eller multimodala sekventiella modeller för affektiv tillståndsigenkänning. Men få har inkluderat kontextbaserad information i sin inställning för att öka prestanda. I denna uppsats testas kontextbaserade funktioner på en multimodal s.k. Gated Recurrent Unit-modell med sen fusion på spelorienterad data. Det visar att användning av kontextbaserad information som tillståndet i spelet avsevärt kan öka prestandan hos sekventiella multimodala modeller på spelorienterad data.

Nyckelord

Telepresence, Affektiv Igenkänning, Multimodal Maskininlärning, Robot och Människa Interaktion


Acknowledgments

First, I would like to thank my supervisor André Pereira for his help and guidance throughout this project. Secondly, I would like to thank my examiner Joakim Gustafsson for the opportunity. Finally, I would like to thank Andrej Wilczek and Kildo Alias for their outstanding collaboration on the case study.

Stockholm, June 2021
Ilian Corneliussen


Contents

1 Introduction
   1.1 Research Question
   1.2 Hypothesis
   1.3 Contribution
   1.4 Ethical and Societal Aspects
   1.5 Structure

2 Related Work

3 Theory
   3.1 Support-vector Machine
   3.2 Deep Learning
      3.2.1 Backpropagation
      3.2.2 Regularization
   3.3 Recurrent Neural Networks
      3.3.1 Gated Recurrent Units
   3.4 Multimodal Machine Learning
      3.4.1 Multimodal Fusion

4 Case Study
   4.1 Scenario Design
   4.2 Experiment Setup
   4.3 Data Collection
   4.4 Questionnaires
      4.4.1 Pre-study Questionnaire
      4.4.2 Post-study Questionnaire
   4.5 Covid-19

5 Data Set
   5.1 Annotations
   5.2 Feature Extraction
      5.2.1 Facial Features
      5.2.2 Posture Features
      5.2.3 Audio Features
      5.2.4 Context-based Features
   5.3 Post-study Questionnaire

6 Method
   6.1 Model Architecture
      6.1.1 Unimodal Models
      6.1.2 Multimodal Models
   6.2 Baseline
      6.2.1 Feature Importance
   6.3 Model Setup

7 Results

8 Discussion

9 Conclusion & Future Work
   9.1 Future Work

Bibliography


List of Figures

2.1 Temporal Selective Attention Model, with the Attention Module, Encoding Module, and finally the pooling and prediction layers. Image source [18].
3.1 Linear kernel Support-Vector Machine (SVM) on linearly separable data, image source [30].
3.2 A Feed-forward Neural Network with one input layer, two hidden layers, and an output layer, image source [32].
3.3 Underfitting, good fit, and overfitting on the training data, image source [35].
3.4 Unfolded Recurrent Neural Network for an input sequence of length three. Further, the flow of the network, from input to output, is displayed, and how sequential information is passed between time steps. Image source [37].
3.5 Schematic illustration of a Gated Recurrent Unit with the memory gates: r reset gate and z update gate. Image source [38].
4.1 Snakes & Ladders board game.
4.2 Experimental setup, including Furhat, touch table, microphone, speaker, and participant.
5.1 Valence and arousal 2D space [7]; only handpicked affective states are displayed, others occur in the discretized intervals.
5.2 Free annotation and segmentation software for video and audio data.
5.3 2D and 3D landmarks from OpenFace and OpenPose.
6.1 Unimodal Gated Recurrent Unit network.
6.2 Different Recurrent Neural Network (RNN) setups, image source [37].
6.3 Multimodal Gated Recurrent Unit network with late fusion and game states.
7.1 Normalized Confusion Matrix for the baseline SVM with game context.
7.2 Normalized Confusion Matrix for the LF-GRU without game context.
7.3 Normalized Confusion Matrix for the LF-GRU with game context.


List of Tables

2.1 Related work approaches, with the features for each modality and their corresponding timing (time window [s]: time discretization [fps]) summarized.
5.1 AU description and example, image and description source [42].
5.2 Summary of the post-study questionnaire responses rated on a discrete scale from 1-5.
6.1 Feature importance, where the top 10 most important features are marked in bold.
7.1 Unimodal and multimodal performance, where the best score for each metric is marked in bold.


List of acronyms and abbreviations

AI Artificial Intelligence

AU Action Units

BERT Bidirectional Encoder Representation from Transformer

Bi-LSTM Bidirectional Long Short-Term Memory

DNN Deep Neural Network

FACS Facial Action Coding System

FC Fully Connected

FNN Feed-forward Neural Network

GRU Gated Recurrent Unit

GTAP Global Temporal Average Pooling

HRI Human-Robot Interaction

Leaky-ReLU Leaky Rectified Linear Unit

LF-GRU Late Fusion GRU

LSTM Long Short-Term Memory

MML Multimodal Machine Learning

NLP Natural Language Processing

RBF Radial Basis Function

RNN Recurrent Neural Network

SGD Stochastic Gradient Descent

SVM Support-Vector Machine

TCN Temporal Convolutional Network


TEMMA Transform Encoder with Multimodal Multihead Attention

TSAM Temporal Selective Attention Model


Chapter 1

Introduction

In Human-Robot Interaction (HRI), knowing the human's emotional and sentimental state is a crucial part of creating smart and interactive robots that can respond to changing and adaptive social environments. In the last decade, the research field of Affective Computing has gained attention due to technical advances and to the increasing daily interaction between robots and humans. Affective Computing is an interdisciplinary research field including Artificial Intelligence (AI), Natural Language Processing (NLP), and Cognitive and Social Science [1].

Affective State Recognition plays an important part in enhancing HRI, where the ability to recognize the affective state is a key element for an AI to make intelligent and interactive decisions. Hence, one of the next relevant steps in progressing the state of the art in AI could be to equip social robots with capable emotional intelligence [2]. In task-oriented HRI, it is essential to establish a common ground to have a successful interaction, where a crucial part is to recognize the affective state of the human [3].

In human psychiatry, only six universal basic emotions have been established: happiness, sadness, anger, fear, surprise, and disgust [4], and the sentimental state of humans can be divided into three categories: neutral, positive, or negative [5]. Beyond these expressions of the human emotional spectrum, the affective state of a human can be seen as a complex structure of multimodal expressions [6]. An emotionally intelligent robot will need to recognize more subtle emotions such as frustration, dissatisfaction, disengagement, and uncertainty. These emotions are less trivial to detect and are often expressed through multimodal channels such as gestures, facial expressions, vocal cues, and speech. A common choice for describing such emotions is to use valence and arousal as a continuous space representation [7]. Valence describes how positive or negative the affective state is, and arousal can be seen as how active the affective state is. Valence and arousal are well known and widely used for Affective State Recognition tasks (see Table 2.1).

According to [5] and [2], there is potential for multimodal models to outperform unimodal models when it comes to Affective State Recognition. This assumption is based on the fact that Human-Human Interaction is, as already mentioned, a multi-sensory interaction including gestures, vocal cues, and facial expressions. Lately, there have been attempts with promising results, where multimodal models have outperformed unimodal models [8], [9]. A fundamental challenge for multimodal models is to extract and summarize multimodal data to gather a redundant and complementary data set [10]. Furthermore, a common approach for feature extraction is to use powerful open-source software such as OpenPose [11], OpenFace [12], and OpenSmile [13], see Table 2.1.

According to [14], context in HRI is crucial information for the robot to interact in a way that the human expects. The context carries important information for determining the human's emotional state and is crucial for the robot to interact appropriately according to the situation. Furthermore, [15] showed that it was possible to increase the accuracy of recognizing emotions such as satisfaction in HRI. In their setup, they used an SVM with game state and facial features, which outperformed an SVM model using only facial features.

1.1 Research Question

What is the impact of incorporating context-based features in Affective State Recognition for Multimodal Machine Learning methods?

1.2 Hypothesis

Context-based features, such as game states, can increase the performance of Affective State Recognition with Multimodal Machine Learning methods.


1.3 Contribution

This paper investigates the possibility of recognizing the affective state of a human playing against a teleoperated robot in a dull and frustrating game, Snakes & Ladders. The game was chosen to trigger the following affective states among the players: frustration, dissatisfaction, excitement, and engagement. The Multimodal Machine Learning (MML) model estimates the affective states in terms of valence and arousal and considers the following modalities: gestures, facial expressions, verbal communication from the player, audio features, and game states.

The significant contributions of this paper include:

• Evaluating the performance of a multimodal Affective State Recognition method on game oriented data.

• Providing a framework for incorporating context-based information in a Multimodal Machine Learning Affective State Recognition model.

• Evaluating modality importance in Affective State Recognition.

1.4 Ethical and Societal Aspects

Research and science build on the process of gathering, creating, and passing on knowledge. Even if the research per se is not harmful, it is hard to predict what eventual new findings and research it will lead to. Therefore, it is of utmost importance to take all possible precautions to prevent it from being used in a way that is harmful to a group of people or society as a whole. In recent years, a hot topic has been the eventually harmful ways computer vision can be used, for example, to recognize people in the population of a dictatorship, where people are exposed to harmful surveillance. The fact that algorithms for identifying cats and dogs in images could be altered and used in such a manner could be hard to anticipate when developing the algorithm. This lays an enormous responsibility on researchers to be aware of what they produce as general knowledge.

This paper focuses on how to recognize affective states in a board game setup. There exists a risk that future work based on this paper could be used in harmful applications. For example, if applied in poker tournaments, recognizing the emotional state of an opponent would be considered cheating and would be dishonest and illegal. Another harmful example would be to use it in political situations, where it could be used to find people opposed to the regime of a country. The results from this paper are not directly applicable nor meant to be used in any such way. This paper is believed to provide beneficial opportunities where it could enhance future AI to create better HRI. It could benefit applications such as elder care, hospitals, or any other application requiring HRI.

1.5 Structure

The structure of this paper is as follows: first, an overview of the latest advances in Affective State Recognition research is summarized in Chapter 2, and the necessary theory is presented in Chapter 3. The case study is explained in detail in Chapter 4, and the resulting data set is explained in Chapter 5. In Chapter 6, the MML approaches, frameworks, and setup are explained in detail, followed by the results in Chapter 7. The results are summarized and discussed in Chapter 8, and conclusions are made in Chapter 9.


Chapter 2

Related Work

In HRI, there is a significant amount of information encoded in non-explicit communication, such as facial expressions, gestures, and the way humans speak. Recent research has shown that MML achieves significantly higher accuracy in Affective State Recognition than unimodal state-of-the-art methods [1]. Furthermore, [16] describes the importance of using MML models to gain access to crucial information encoded in signals other than facial expression, for example, pose and body gestures. In their paper, they evaluated the performance of three different Machine Learning implementations: a Deep Neural Network (DNN) with a Global Temporal Average Pooling (GTAP) layer, a Long Short-Term Memory (LSTM) RNN, and a Temporal Convolutional Network (TCN).

LSTM is an adaptation of the standard RNN unit, including three memory gates to store temporal information. GTAP uses a temporal average pooling layer to extract temporal information by averaging multidimensional sequential data. Furthermore, TCN works similarly to GTAP but uses a convolutional layer instead of an average pooling layer. They evaluated their methods on a touchscreen-based game, Express the Feeling, with Furhat and Zeno as interactive robots. The DNN with GTAP layers outperformed both the LSTM and the TCN in their experiments. The authors argue that one of the reasons that GTAP outperformed the LSTM was likely a small data set that was not enough to train the LSTM adequately.
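As a minimal illustration of the GTAP idea (an assumed sketch, not code from [16]), average pooling over the time dimension turns a variable-length sequence into one fixed-size feature vector:

```python
import torch

# Global Temporal Average Pooling: collapse the time axis of a
# (batch, time, features) tensor by averaging over time.
x = torch.randn(8, 60, 64)   # 8 sequences, 60 time steps, 64 features per step
pooled = x.mean(dim=1)       # shape: (8, 64), independent of the sequence length
print(pooled.shape)
```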

Another paper [19] had promising results with a Bidirectional Long Short-Term Memory (Bi-LSTM) network for multimodal affective dimension prediction, more precisely on audio and video data from the Audio-Visual Emotion Challenge [22].


Table 2.1: Related work approaches, with the features for each modality and their corresponding timing (time window [s]: time discretization [fps]) summarized.

Paper  ML App.   Target      Audio               Facial                      Pose                Speech Rec./text
[17]   TEMMA     Arousal     Vggish (30: 10)     Resnet50 (30: 10)           -                   -
[17]   TEMMA     Valence     Vggish (30: 10)     Resnet50 (30: 10)           -                   -
[18]   TSAM      Sentiments  COVAREP (30: 30)    OpenFace (30: 30)           -                   GloVe (30: 30)
[18]   Bi-LSTM   Sentiments  COVAREP (30: 30)    OpenFace (30: 30)           -                   GloVe (30: 30)
[18]   Bi-GRU    Sentiments  COVAREP (30: 30)    OpenFace (30: 30)           -                   GloVe (30: 30)
[16]   GTAP      Emotions    -                   -                           OpenPose (2.4: 30)  -
[16]   TCN       Emotions    -                   -                           OpenPose (2.4: 30)  -
[16]   LSTM      Emotions    -                   -                           OpenPose (2.4: 30)  -
[16]   GTAP      Emotions    -                   OpenFace (2.4: 30)          OpenPose (2.4: 30)  -
[19]   Bi-LSTM   Arousal     OpenSmile (3: 25)   LGBP-TOP + LPQ-TOP (3: 25)  -                   -
[19]   Bi-LSTM   Valence     OpenSmile (3: 25)   LGBP-TOP + LPQ-TOP (3: 25)  -                   -
[20]   LSTM      Arousal     OpenSmile (6: 10)   DenseFace (6: 10)           -                   Custom Word Vector (6: 10)
[20]   LSTM      Valence     OpenSmile (6: 10)   DenseFace (6: 10)           -                   Custom Word Vector (6: 10)
[21]   GRU       Sentiments  OpenSmile (30: NaN) ResNet (30: NaN)            -                   BERT (30: NaN)

The bidirectional part of Bi-LSTM means that the network passes temporal information both forward and backward in time, which can help the model's performance.

Recently, there have been attempts to improve Affective State Recognition performance by introducing the Temporal Selective Attention Model (TSAM) to extract temporally relevant information by evaluating the attention each data point should be assigned. The TSAM model is a sequential model with an attention layer [18], using an attention module to score how much attention each data point in the sequence should be assigned. In parallel, the input is fed to an LSTM and then concatenated with the attention from the attention module, creating an encoding module. The encoding module output, with the attention from the attention module, is then further passed to a pooling layer before finally being fed to the final prediction layer, see Figure 2.1.

In the last couple of years, works have been studying Attention Encoders, where models such as Bidirectional Encoder Representation from Transformer (BERT) have become well established within the NLP research field.


Figure 2.1: Temporal Selective Attention Model, with the Attention Module, Encoding Module, and finally the pooling and prediction layers. Image source [18].

Furthermore, Transform Encoder with Multimodal Multihead Attention (TEMMA) has been constructed and tested on multimodal visual and audio data for Continuous Affective State Recognition [17]. TEMMA competed with the current state-of-the-art methods, i.e., variations of RNNs.

Another paper [23] used a Gated Recurrent Unit (GRU) network in a multimodal setup to predict empathic responses on a data set of interviews in a psychiatry setup, to help with detecting depression and post-traumatic stress disorder. A GRU is similar to an LSTM with memory gates but is a simpler unit with only two gates. Their results show that it was possible to detect the sentimental state of the patients. Further, it was shown that using context-based information such as speech improved the performance significantly, where the speech was encoded with BERT and passed to a late fusion layer, fusing it with the temporal modalities that used a many-to-one setup.

Related papers use different approaches for gathering features from the different modalities, time windows, and time discretizations (see Table 2.1). The most common approaches for extracting audio features were Vggish [24], COVAREP [25], and OpenSmile. For facial expression, OpenFace, Resnet50 [26], and DenseFace [20] were used. Furthermore, the most common approach for extracting pose features was OpenPose, and for text, GloVe [27] and BERT [28]. All of the papers used the same time window for all of their modalities and used the highest time discretization allowed by their setup, except [21], which did not state its time discretization settings.


Chapter 3

Theory

In this chapter, the necessary theory behind the Machine Learning methods is explained. First, a classical Machine Learning method for classification, the SVM, is explained, followed by the fundamentals of Deep Learning and how Neural Networks work. Further, the idea behind recurrent models is described, and the frameworks applicable to this paper are presented and explained. Finally, the challenges of MML and how to fuse multiple modalities are presented.

3.1 Support-vector Machine

One of the most common classical Machine Learning approaches for classification is the SVM. SVMs are trained on supervised data, which they use to optimize a high-dimensional hyperplane that separates data points from each other, see Figure 3.1. The shape of the hyperplane is modified by using different kernels, such as a linear, polynomial, or Radial Basis Function (RBF) kernel. The choice of kernel is often task-dependent. If the n-dimensional data points are linearly separable, the most natural choice is the linear kernel. Moreover, higher-dimensional kernels such as polynomial or RBF kernels are often used for non-linearly separable data. However, the linear kernel has the advantage of being usable for evaluating feature importance [29]. Neither polynomial nor RBF kernels can be used for feature importance without iterating through the features with the leave-one-out approach. The leave-one-out approach implies leaving out one of the features and training the model to see the impact on performance of that feature. With a linear kernel, the feature importance can instead be calculated by using the norm of each weight in the weight vector.


A feature importance analysis is performed in this thesis to estimate the importance of the features, see Chapter 6.

Figure 3.1: Linear kernel SVM on linearly separable data, image source [30].
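As a minimal sketch of this idea (an assumed scikit-learn example, not the thesis's code), a linear-kernel SVM exposes one weight per feature, and the magnitude of each weight can be read as that feature's importance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy supervised data standing in for the extracted multimodal features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# One weight per feature; rank the features by mean absolute weight.
importance = np.abs(svm.coef_).mean(axis=0)
for idx in np.argsort(importance)[::-1]:
    print(f"feature {idx}: {importance[idx]:.3f}")
```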

3.2 Deep Learning

Deep Learning was first introduced in the 20th century [31] and was inspired by how the human brain works. The building blocks of Neural Networks are neurons, each connected to an arbitrary number of other neurons. The connected neurons generate bonds and patterns between each other to process data. The strength of the connection between neurons is controlled by so-called weights. The signal sent from each neuron is restricted by an activation function. An activation function is a non-linear function that scales and reshapes the signal to have the appropriate functionality for the network. The choice of activation function can be of critical importance for the network to learn adequately and function correctly for the task.

A Feed-forward Neural Network (FNN) is the simplest form of network architecture, only passing information in one direction, i.e., forward. An FNN is constructed with an input layer, followed by an arbitrary number of hidden layers, and finalized with an output layer. The layers consist of neurons that are connected to the next layer, see Figure 3.2.


Figure 3.2: A Feed-forward Neural Network with one input layer, two hidden layers, and an output layer, image source [32].


3.2.1 Backpropagation

In supervised learning, the network needs to be trained and fit to the data. This is done with so-called backpropagation, which implies modifying the network's weights to minimize a loss function calculated on the error between the predicted and actual labels of the output. The choice of loss function is task- and data-dependent and should be suitable for the task that the network is trying to solve. Backpropagation is the procedure of calculating the gradient of the loss function. The way of calculating the gradient can vary; one of the most typical is to use Stochastic Gradient Descent (SGD), and another common approach is Adam [33], which is an adaptation of SGD. This thesis uses weighted cross-entropy, a loss function typically used for unevenly distributed multi-class classification problems, and uses Adam as the optimization technique for the gradient calculation. The weights in the weighted cross-entropy are calculated as the inverse of the number of samples per class [34]. A well-known problem in training a Neural Network is the vanishing or exploding gradient. A vanishing gradient occurs when the gradients tend to zero the further back in the network the gradient is calculated, typically when using a poorly chosen activation function. The Leaky Rectified Linear Unit (Leaky-ReLU) is known to handle vanishing gradients well and has similar benefits as its ancestor, the Rectified Linear Unit, of having sparse activation and efficient computation.
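A minimal sketch of this setup (assumed PyTorch code, not the thesis's implementation; the class counts are the annotation counts reported in Chapter 5, and the linear model is only a placeholder for a real network):

```python
import torch
import torch.nn as nn

# Class weights as the inverse of the number of samples per class [34].
class_counts = torch.tensor([969., 360., 656., 467., 275., 282.])
class_weights = 1.0 / class_counts

model = nn.Linear(64, 6)                                 # placeholder network
criterion = nn.CrossEntropyLoss(weight=class_weights)    # weighted cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 64), torch.randint(0, 6, (32,))   # dummy batch
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()                                          # backpropagation
optimizer.step()
```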

3.2.2 Regularization

In the training process of Neural Networks, there is a high risk of the network overfitting to the training data, see Figure 3.3. It occurs when training the model for too long and is often monitored using a validation data set as a reference point. While the loss on the validation data is decreasing, the training still generalizes, i.e., the model is not overfitting to the training data. However, when the validation loss is increasing while the training loss is decreasing, it is a clear sign of overfitting. The model is often forced to stop training before this occurs, which is referred to as early stopping. Early stopping terminates the training process when the validation loss is no longer decreasing. Another alternative to cope with overfitting is to use regularization techniques. This paper uses L2-regularization, also known as Ridge Regression, which introduces a penalty term in the loss function to penalize large weights.

Figure 3.3: Underfitting, good fit, and overfitting on the training data, image source [35].
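A minimal early-stopping sketch (an assumed illustration around a PyTorch-style model; `train_one_epoch` and `validation_loss` are hypothetical helpers): training stops when the validation loss has not improved for `patience` epochs, and the best weights are restored:

```python
import copy

def fit(model, train_one_epoch, validation_loss, max_epochs=200, patience=10):
    best_loss, best_state, stale_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)             # one pass over the training data
        val_loss = validation_loss(model)  # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:   # validation loss stopped improving
                break
    model.load_state_dict(best_state)      # keep the best-scoring weights
    return model
```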

3.3 Recurrent Neural Networks

One restriction of FNNs is their limited capability of handling sequential data. A workaround was introduced by [36] in 1986, proposing an architecture where the neurons pass information from their previous state to the current state, see Figure 3.4. These architectures are categorized as RNNs. RNNs use an internal state, referred to as the hidden state, to capture temporal information from sequential data.


Figure 3.4: Unfolded Recurrent Neural Network for an input sequence of length three. Further, the flow of the network, from input to output, is displayed, and how sequential information is passed between time steps. Image source [37].

3.3.1 Gated Recurrent Units

Although vanilla RNNs, in general, perform better on sequential data than FNNs, they struggle with capturing and processing extended sequential information. This is due to the problem of vanishing gradients when backpropagating through time. To cope with longer sequences, the GRU [23] was constructed, using so-called memory gates to capture temporal information over longer sequences. The gates control when to update the hidden state of the GRU cell, see Figure 3.5.

Figure 3.5: Schematic illustration of a Gated Recurrent Unit with the memory gates: r reset gate and z update gate. Image source [38].
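For reference, one common formulation of the GRU update is shown below (assumed standard notation with reset gate r_t, update gate z_t, and candidate state h̃_t; the roles of z_t and 1 − z_t are swapped in some implementations):

```latex
\begin{aligned}
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)\\
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)\\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right)\\
h_t &= z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t
\end{aligned}
```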


3.4 Multimodal Machine Learning

Related works show that MML can often yield higher performance than unimodal methods. However, MML approaches have some unique challenges to cope with [10]. The issues often depend on what kind of fusion technique is used. The most profound challenges include representing the modalities, i.e., how each modality is presented to the network. Furthermore, the alignment of the modalities is another problem. These issues are usually associated with early fusion. In this paper, the modalities are passed to powerful open-source software that extracts high-level features fed to the network, and late fusion is used to avoid the alignment challenge.

3.4.1 Multimodal Fusion

Multimodal fusion implies combining the information and data streams from multiple modalities before making the final prediction. Using multimodal fusion can generate a more robust model due to the exposure to complementary information from multiple modalities, and it can often remain functional even if one or more modalities are missing [10]. Historically, different approaches have been used, where the most frequent are early [20], late [16], or hybrid fusion [39].

According to [10], early fusion is the simplest form of fusion and implies concatenating the modalities before they enter the network. This approach gives intermediate interaction between the different modalities but needs the individual modality data streams to be of the same shape, often referred to as the alignment challenge. Early fusion often needs a more extensive data set to learn the more complex data representation from the multiple data streams.

Contrarily, late fusion is not as reliant on an extensive data set, because each modality is embedded by an individual subnetwork before being concatenated in the final layers. The final layers are then typically fine-tuned with all of the modality subnetworks as input. This approach allows for transfer learning to a larger extent, where pre-trained, powerful unimodal models can be used jointly and only the final fusion and prediction layers need to be trained. Late fusion does not suffer from the same alignment problem as early fusion often does, since it can handle any modality shape or sequence length. Only the final layer output shape from each modality must be of an appropriate shape for the fusion layer.


Hybrid fusion is an attempt to take the best parts from the two aforementioned methods by exploiting intermediate interaction between the different modalities while not being cursed by the alignment problem. Unlike late fusion, hybrid fusion fuses the modalities at different stages and not solely at the end.


Chapter 4

Case Study

This paper is based on a case study executed at KTH Royal Institute of Technology in Stockholm. The goal was to collect data from 13 participants as they interacted with a social robot, Furhat [40]. Furhat is a commonly used robot for HRI research and is customizable and programmable through the Furhat SDK. The Furhat SDK can be customized to show predefined emotional expressions. In this study, Furhat was connected to Unity, a well-established game engine, and controlled with a Virtual Reality (VR) headset.

4.1 Scenario Design

The case study scenario was to collect data from participants interacting with a social robot in the context of gaming. Each participant played a 20-minute-long game against Furhat, which was remotely controlled in a wizard-of-oz setup. A wizard-of-oz setup implies that the players thought they were interacting with a robot and were not aware that a human was controlling it. The speech, head movement, and gaze of the wizard were translated to the Furhat robot to simulate a complete HRI experience. The game Snakes & Ladders was selected based on the criteria of being simple, interactive, and emotionally triggering, see Figure 4.1.

Snakes & Ladders is a game of luck developed for a younger audience. The game is a turn-based board game requiring a die. At each turn, the turn taker's only available action is to roll the die, where the outcome decides the distance that their character moves. The goal of the game is to reach the end of the board.


Figure 4.1: Snakes & Ladders board game.

The game mechanics are as follows:

• If the character ends up on a ladder, the character will proceed to the top of the ladder (advantage).

• If the character ends up on a snake, the character will travel backward to the snake's tail (disadvantage).

The game is designed to have more snakes than ladders, and the snakes are also placed at the top of the board to make the game as frustrating and rewarding as possible, see Figure 4.1. The disadvantage of ending up on a snake is expected to make the participant frustrated about not advancing in the game. Further, ending up on a ladder is expected to give the participant a sensation of satisfaction and excitement, due to the rarity of ladders and it being the only opportunity to advance quickly in the game.

4.2 Experiment Setup

The game was made in Unity and ran on a 30" touch table (see Figure 4.2) to give the wizard good visual input. The game was connected to Google Firebase to sync the game state logs with the video and to give the wizard the ability to control Furhat's turn remotely. For controlling Furhat, an Oculus VR headset was used.


Figure 4.2: Experimental setup, including Furhat, touch table, microphone, speaker, and participant.

The Oculus headset sent the head movement, gaze, and audio to the Furhat robot. A microphone was placed next to the participant to send audio to the wizard. The hand controller was used to roll the die on Furhat's turn.

The wizard and the participant were separated into two different rooms. The participant was told that they were playing against a robot and was not told that a human controlled it. Before starting the experiment, the participant had to fill in a consent form and a questionnaire covering their emotional state and their experience of HRI. This questionnaire was later used when annotating the data. During the game, the wizard gave instructions, engaged, asked the participant questions, and answered any questions asked back, portraying an engaging and curious persona. After the experiment, the participant had to fill in another questionnaire covering how they perceived the game experience, their emotional experience, and how they perceived Furhat's social expressions.

4.3 Data Collection

Each game log was synchronized with the video recordings covering the sound, posture, and the participant's face. The video was recorded at 25 fps, from a slightly downward and angled position. Each participant was asked to fill in a pre- and post-study questionnaire covering their emotional state and HRI experience.

4.4 Questionnaires

Two questionnaires were collected for each participant: one pre-study and one post-study questionnaire. The pre-study questionnaire was used to help with the annotation of the participant, giving a sense of the participant's general mood before annotating, see Chapter 5. The post-study questionnaire was used to evaluate whether the annotations seemed reasonable.

4.4.1 Pre-study Questionnaire

The pre-study questionnaire started with three questions regarding whether the participant had previous experience with social robots, whether they had played Snakes & Ladders, and whether they had seen or met Furhat before. Two of the participants had experience with social robots, and three had seen Furhat before. None of the participants had played Snakes & Ladders before. These three questions were then followed by the following four questions (rated on a discrete scale from 1-5):

• Rate your general mood.

• Rate your day.

• Rate your expectations on the experiment.

• Rate your excitement for the experiment.

4.4.2 Post-study Questionnaire

After the participant had played the game for 20 minutes, they were asked to fill in a post-study questionnaire, starting with three yes-or-no questions on whether they felt frustrated, felt engaged, and enjoyed the game experience. The participants were then asked to rate the following five questions on a discrete scale from 1-5:

• Rate the highest level of frustration with the game.

• Rate how frustrating it was to end up on a snake.

• Rate the highest level of enjoyment felt during the game.


• Rate how enjoyable it was to end up on a ladder.

• Rate how engaged you felt during the game.

4.5 Covid-19

Due to Covid-19, the 13 participants were selected to either have a close connection to KTH or be part of the researcher's restricted social network, which implicitly made the average participant more tech-savvy and more exposed to HRI than the typical person. All researchers wore face masks during the experiment, and the touch-screen surface was disinfected between each participant. Further, the participants were given individual time slots to minimize contact between participants. The local restrictions and recommendations were obeyed during the study.


Chapter 5

Data Set

The data recorded in the case study described in Chapter 4 was used to construct the data set for this paper. The data set consists of 11 of the 13 participants, who each played Snakes & Ladders for 20 minutes. It includes face, posture, audio, language, and game state features and was annotated on a discretized valence and arousal scale. The data set consists of 3009 sequential data points with a sequence length of 50 for the face and posture features, 46 for audio, and the entire game history up to each entry for the game state, with the longest sequence being 594 game states.

5.1 Annotations

The video was segmented systematically, where each game event was discretized into zero, one, or two 2-second-long segments, depending on the game event duration (d), in the following manner:

• if d < 2s → zero segments,

• if 2s ≤ d < 4s → one segment,

• if d ≥ 4s → two segments were extracted.

A game event was defined as either a new dice roll or the moment Furhat or the participant ended up on a snake or a ladder.
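A minimal sketch of this segmentation rule (not the thesis's released code):

```python
def num_segments(duration_s: float) -> int:
    """Number of 2-second segments extracted for a game event of the given duration."""
    if duration_s < 2.0:
        return 0
    if duration_s < 4.0:
        return 1
    return 2

assert num_segments(1.5) == 0
assert num_segments(3.0) == 1
assert num_segments(7.2) == 2
```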

Secondly, each segment was annotated on a discretized valence and arousal scale. Valence was either negative or positive, and arousal was low, medium, or high, giving six possible classes: negative-low, negative-medium, negative-high, positive-low, positive-medium, and positive-high, see Figure 5.1.


Figure 5.1: Valence and arousal 2D space [7]; only handpicked affective states are displayed, other states also occur in the discretized intervals.

The annotation procedure was inspired by and largely follows what is proposed in [41], which discretizes the procedure of annotating arousal and valence into three steps:

1. First, annotate valence (positive/negative).

2. Then, annotate arousal (low/medium/high).

3. Finally, concatenate the arousal low and medium annotations.

Step (3), the last step, was excluded due to the loss of information caused by merging already annotated classes, with the risk of these annotations becoming less accurate. Before starting the annotation procedure, the pre-study questionnaire was analyzed to grasp the general affective state of the participant. Two annotators worked simultaneously and discussed each annotation.


When there was a disagreement, a third annotator was called in to make the casting judgment. The annotation was done in ELAN, see Figure 5.2. The six classes have the following uneven distribution: positive-low 969 (32%), negative-low 360 (12%), positive-medium 656 (22%), negative-medium 467 (16%), positive-high 275 (9%), and negative-high 282 (9%).

Figure 5.2: Free annotation and segmentation software for video and audio data.

5.2 Feature Extraction

The video recordings were passed to powerful open-source software for each modality, providing face, posture, and audio features. Further, context-based features such as language and game state were also extracted.

5.2.1 Facial Features

The facial features of each participant were extracted with OpenFace [12], a state-of-the-art facial behavior analysis tool (see Figure 5.3a). OpenFace was applied to the video recording of each participant frame by frame, extracting 18 Action Units (AU) based on the Facial Action Coding System (FACS), six head pose values (3D rotation, 2D translation, and scale), and 20 2D landmarks of the mouth's position. The 18 AUs are 1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 28, and 45 (see Table 5.1 for examples).


Figure 5.3: 2D and 3D landmarks from (a) OpenFace and (b) OpenPose.

All features for each annotation of 50 frames were combined into one data point, resulting in a 2D tensor of dimension 50×64. If more than 50% of the frames from OpenFace had a certainty lower than 75%, the data point was discarded. Further, two participants had to be excluded due to very low certainty on most data points, caused by a bad camera angle and extensive facial hair, resulting in data from only 11 participants being used.
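A minimal sketch of this filtering rule (an assumed example, not the thesis's code; the column name "confidence" and the exact feature columns follow common OpenFace CSV output and should be checked against the actual files):

```python
import numpy as np
import pandas as pd

def build_face_datapoint(df: pd.DataFrame, start: int, feature_cols: list) -> np.ndarray:
    """Return a (50, n_features) array, or None if the window is too uncertain."""
    window = df.iloc[start:start + 50]                 # 50 consecutive frames
    low_confidence = (window["confidence"] < 0.75).mean()
    if low_confidence > 0.5:                           # more than 50% uncertain frames
        return None                                    # discard this data point
    return window[feature_cols].to_numpy()             # shape: (50, len(feature_cols))
```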

5.2.2 Posture Features

The upper-body posture of each participant was extracted with OpenPose [11], a state-of-the-art real-time body posture estimation tool. Each frame of the video was fed to OpenPose to extract the 3D positions of the ears, eyes, nose, neck, shoulders, elbows, and wrists (see Figure 5.3b). As with the facial features, each annotation of 50 frames was combined to form a 2D tensor of shape 50×36 as a sequential data point.

5.2.3 Audio Features

Audio features were extracted from the audio recordings with OpenSmile, an open-source audio feature extraction tool commonly used in Affective Computing. Three prosodic features were extracted: pitch, loudness, and the voicing probability of the final fundamental frequency, i.e., Voicing Final Unclipped (VFU) in OpenSmile.


Table 5.1: AU description and example, image and description source [42].

Action Unit   Description
AU 1          Inner Brow Raiser
AU 2          Outer Brow Raiser
AU 4          Brow Lowerer
AU 5          Upper Lid Raiser
AU 6          Cheek Raiser
AU 7          Lid Tightener
AU 9          Nose Wrinkler
AU 10         Upper Lip Raiser
AU 12         Lip Corner Puller
AU 14         Dimpler
AU 15         Lip Corner Depressor
AU 17         Chin Raiser
AU 20         Lip Stretcher
AU 23         Lip Tightener
AU 25         Lips Part
AU 26         Jaw Drop
AU 28         Lip Suck
AU 45         Blink

Various window sizes (2s, 1s, 0.4s, 0.2s, and 0.04s) were evaluated, and 0.2s produced the best scores. The step size for the sliding window was selected as 0.04s, as the video recording frame rate restricted it. The three features were extracted for each annotation of 50 frames and grouped into one data point with dimension 46×3.
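A small sketch of where the 46×3 shape comes from (assumed arithmetic, not the thesis's code): a 2-second annotated segment, framed with a 0.2 s window and a 0.04 s step (one video frame at 25 fps), yields 46 windows, each described by the three prosodic features:

```python
segment_length = 2.0   # seconds per annotated segment
window = 0.2           # analysis window size in seconds
hop = 0.04             # step size in seconds (one frame at 25 fps)

n_windows = round((segment_length - window) / hop) + 1
print(n_windows)       # 46 -> one data point of shape (46, 3)
```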


5.2.4 Context-based Features

Two different context-based feature sets were extracted:

1. Game state features: whose turn it is, turn time, the position of both players, and whether they ended up on a snake or ladder, complemented by its length. To cope with the variation in game length, each data point was zero-padded, resulting in a 594×7 shaped tensor for each annotation (see the padding sketch after this list).

2. Verbal communication features: what had been said since the last turn, encoded with the state-of-the-art NLP language model BERT. Only the participant's speech content was used to avoid any bias or influence mistakenly introduced by the wizard. BERT was used to encode the last said sentences into a 768-dimensional vector.
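A minimal sketch of the zero-padding mentioned in item 1 (not the thesis's released code):

```python
import numpy as np

MAX_STATES, N_FEATURES = 594, 7    # longest game history, game-state features per turn

def pad_game_history(history: np.ndarray) -> np.ndarray:
    """history: (n_states, 7) game states so far; returns a fixed (594, 7) tensor."""
    padded = np.zeros((MAX_STATES, N_FEATURES), dtype=np.float32)
    padded[:len(history)] = history
    return padded

print(pad_game_history(np.ones((12, 7))).shape)   # (594, 7)
```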

The unimodal model using BERT as an encoder did not perform better than a random classifier, strongly implying that this modality in this context did not provide any useful information. The language modality was therefore quickly discarded, and its performance was not measured nor included in this paper.

Table 5.2: Summary of the post-study questionnaire responses rated on a discrete scale from 1-5.

Emotion      Question                                              Average
Frustrated   Highest level of frustration with the game.           2.9
Frustrated   How frustrating it was to end up on a snake.          3.0
Enjoyment    Highest level of enjoyment felt during the game.      3.6
Enjoyment    How enjoyable it was to end up on a ladder.           3.8
Engaged      How engaged you felt during the game.                 3.6

5.3 Post-study Questionnaire

Frustration can be seen as a negative-valence emotion and enjoyment as a positive one. From Table 5.2, one can notice that the average participant enjoyed the game more than they were frustrated by it. Comparing this with the annotations in Section 5.1, which resulted in 64% positive and 36% negative valence labels, seems to indicate the same trend. Further, engagement had an average of 3.6, which indicates that the game was not highly engaging. This trend can also be seen in the annotations, where 82% of the annotations were labeled with either low or medium arousal.


Chapter 6

Method

In recent years, a common approach for multimodal perception has been to use MML, but few works have integrated context-based features. There have been some earlier attempts using classical ML approaches such as SVM [15]. This paper is based on a Deep Learning framework approach and compares it against an SVM as a baseline. The models and the data set used in this thesis can be found as open source at [43].

6.1 Model Architecture

This paper applies a GRU Neural Network [23] framework in both unimodal and multimodal setups. The GRU was selected over the LSTM framework because the GRU performs similarly on shorter sequences and requires less training data. Further, the data set is relatively small compared to other Deep Learning data sets, and therefore the GRUs were kept shallow with a small number of layers. The architecture of the GRU and the modality fusion was inspired by [21], where a similar framework was used with a Late Fusion GRU (LF-GRU) for three-class sentiment classification.

6.1.1 Unimodal Models

Each modality (facial, posture, audio, and game context) was encoded with a two-layer deep GRU model. The last GRU output was used as a 128-long vector embedding by using a many-to-one setup, see Figure 6.2. The embedding was further passed through a Leaky-ReLU activation function with a slope of 0.2 before being passed to a Fully Connected (FC) dense layer for prediction. Each modality was trained in an end-to-end fashion, see Figure 6.1.


Figure 6.1: Unimodal Gated Recurrent Unit network.

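A minimal PyTorch-style sketch of such a unimodal encoder (an illustration consistent with the description above, not the released code at [43]; anything beyond the two GRU layers, the 128-dimensional embedding, the 0.2 Leaky-ReLU slope, and the six output classes is an assumption):

```python
import torch
import torch.nn as nn

class UnimodalGRU(nn.Module):
    """Two-layer GRU encoder; the last GRU output is the 128-dim embedding."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, n_classes: int = 6):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers=2, batch_first=True)
        self.act = nn.LeakyReLU(0.2)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def embed(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim); many-to-one: keep only the final time step.
        out, _ = self.gru(x)
        return out[:, -1, :]                  # (batch, 128)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.act(self.embed(x)))

# Example: a batch of 8 facial-feature sequences of shape (50, 64).
model = UnimodalGRU(input_dim=64)
logits = model(torch.randn(8, 50, 64))        # (8, 6)
```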

Figure 6.2: Different RNN setups, image source [37].

6.1.2 Multimodal Models

The GRU embedding layers from the pre-trained unimodal models were transferred into a multimodal framework and individually passed through a Leaky-ReLU activation function with a slope of 0.2. The concatenated embeddings were then passed to an FC fusion layer, reducing the dimension from 128·M to a vector of 128, where M is the number of modalities. The vector was passed through a Leaky-ReLU activation function with a slope of 0.2 and then to the final FC layer to predict the six classes, see Figure 6.3. The GRU embedding layers were frozen during training, and only the two FC layers were fine-tuned. The multimodal model will be referred to as LF-GRU.
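A matching late-fusion sketch under the same assumptions (it reuses the hypothetical UnimodalGRU encoders from the previous snippet and freezes them, as described above):

```python
import torch
import torch.nn as nn

class LateFusionGRU(nn.Module):
    """Late fusion of frozen per-modality GRU encoders."""

    def __init__(self, encoders: dict, hidden_dim: int = 128, n_classes: int = 6):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        for enc in self.encoders.values():            # freeze the pre-trained GRUs
            for p in enc.parameters():
                p.requires_grad = False
        self.act = nn.LeakyReLU(0.2)
        self.fusion = nn.Linear(hidden_dim * len(encoders), hidden_dim)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        # One embedding per modality, concatenated and fused before prediction.
        embeddings = [self.act(self.encoders[name].embed(inputs[name]))
                      for name in self.encoders]
        fused = self.act(self.fusion(torch.cat(embeddings, dim=-1)))
        return self.fc(fused)

# Example with face (50x64), body (50x36), audio (46x3), and game (594x7) inputs.
encoders = {"face": UnimodalGRU(64), "body": UnimodalGRU(36),
            "audio": UnimodalGRU(3), "game": UnimodalGRU(7)}
model = LateFusionGRU(encoders)
batch = {"face": torch.randn(8, 50, 64), "body": torch.randn(8, 50, 36),
         "audio": torch.randn(8, 46, 3), "game": torch.randn(8, 594, 7)}
logits = model(batch)                                 # (8, 6)
```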

6.2 Baseline

Two factors influenced the choice of baseline in this paper:


Figure 6.3: Multimodal Gated Recurrent Unit network with late fusion and game states.

• The only related work found that includes the game state as a context-based feature used an SVM [15].

• Linear kernel SVM can be used to evaluate feature importance.

The SVM had a linear kernel and was evaluated with and without the game state as an input feature. For each sequence, the facial and posture features were represented to the SVM by their min, max, start, end, average, and exponential moving average values. The audio feature was the OpenSmile output from the sliding window, but flattened so that one annotation was represented by only one data point. Further, the game state was only represented by the latest game state and not by the entire game history. This is because the SVM is not a sequential model, and using multiple game states as input would lead to a vast input space.
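A minimal sketch of this sequence summarization (assumed code, not the thesis's; the EMA span is an arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd

def summarize_sequence(seq: np.ndarray, ema_span: int = 10) -> np.ndarray:
    """Summarize a (seq_len, n_features) sequence into a fixed-length SVM input."""
    ema = pd.DataFrame(seq).ewm(span=ema_span).mean().iloc[-1].to_numpy()
    stats = [seq.min(axis=0), seq.max(axis=0), seq[0], seq[-1], seq.mean(axis=0), ema]
    return np.concatenate(stats)          # shape: (6 * n_features,)

features = summarize_sequence(np.random.rand(50, 64))   # e.g. a facial-feature sequence
print(features.shape)                                    # (384,)
```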

6.2.1 Feature Importance

An SVM with a linear kernel can be used to evaluate the importance of each feature, i.e., how much it affected the SVM's performance, see Table 6.1. According to [29], using the absolute value of the weights of the feature vector is a good way of calculating feature importance, and it was therefore used in this paper. The ten most important features included the player's and Furhat's positions, being on a ladder, and AUs 1, 6, 9, 14, and 45. Further, the less important features were the head's position and the mouth's landmarks. However, removing the less important features did not improve the validation F1-score for either of the baseline models, and they were therefore kept throughout this paper.

6.3 Model Setup

In this work, the data set described in Chapter 5 was used. The data set was split, stratified, into three subsets: a training, a validation, and a test set. The split ratio was 70% training, 20% validation, and 10% test of the original 3009 sequential data points, resulting in training, validation, and test sets consisting of 2106, 602, and 301 sequential data points. Min-max normalization was calculated on the training set and later applied to the validation and test sets with the same parameters. The metric used for picking the best-scoring model was a weighted F1-score, to favor minority classes and penalize the model for overfitting to the majority classes. Batch training was used for both the unimodal and the multimodal models, as it achieved the best results on the validation data set.
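A minimal sketch of this data handling (assumed scikit-learn code, not the thesis's implementation; the features are random placeholders, and scikit-learn's class-weighted F1 is used here as one possible interpretation of the weighted F1-score):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score

X = np.random.rand(3009, 64)                    # placeholder flattened features
y = np.random.randint(0, 6, size=3009)          # six affective-state classes

# Stratified 70/20/10 split: first split off 30%, then split that into 20%/10%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1/3, stratify=y_rest, random_state=0)

scaler = MinMaxScaler().fit(X_train)            # fitted on the training data only
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))

y_pred = np.random.randint(0, 6, size=len(y_val))   # stand-in for model predictions
print(f1_score(y_val, y_pred, average="weighted"))
```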

A two-layer deep GRU architecture with 128 neurons was used in the unimodal setup, followed by an FC layer for prediction. Each unimodal model was optimized with Adam as the optimizer and a weighted cross-entropy loss function suitable for an uneven multi-class data set. During training, L2-regularization was applied to prevent the model from overfitting to the training data. The models were saved at the lowest validation loss to prevent overfitting. During training, the learning rates were 5e−4, 5e−3, 1e−4, and 1e−3 for face, body, audio, and game respectively, and the weight decays were 1e−5, 5e−4, 1e−5, and 1e−5 for face, body, audio, and game respectively, after these values yielded the highest scores on the validation data set in a Bayesian hyperparameter search.

For the multimodal setup, the two-layer deep GRU layers from the unimodal models were transferred to the multimodal model before being combined with an FC late fusion layer and, finally, an FC prediction layer. The last two FC layers were then fine-tuned with Adam as the optimizer and a weighted cross-entropy loss function. As with the unimodal models, L2-regularization and early stopping were applied, as explained above. During training, the learning rate and weight decay were selected as 1e−3 and 4e−3, respectively, without game states, and 2e−3 and 5e−4 with game states, these being the values that resulted in the highest score on the validation data set in a Bayesian hyperparameter search.


Table 6.1: Feature importance, where the top 10 most important features are marked in bold.

Feature                 Importance

OpenFace
  AU 1                  0.37
  AU 2                  0.32
  AU 4                  0.14
  AU 5                  0.25
  AU 6                  0.36
  AU 7                  0.15
  AU 9                  0.36
  AU 10                 0.17
  AU 12                 0.29
  AU 14                 0.35
  AU 15                 0.25
  AU 17                 0.12
  AU 20                 0.14
  AU 23                 0.13
  AU 25                 0.25
  AU 26                 0.31
  AU 28                 0.34
  AU 45                 0.36
  Mouth landmarks       0.04
  Head translation      0.03
  Head rotation         0.31
  Head scale            0.17

OpenPose
  Ears                  0.26
  Eyes                  0.21
  Nose                  0.19
  Neck                  0.26
  Shoulders             0.26
  Elbows                0.25
  Wrists                0.23

OpenSmile
  Pitch                 0.18
  Loudness              0.16
  VFU                   0.22

Game
  Furhat                0.35
  Human                 0.80
  Ladder                0.47
  Snake                 0.25
  Dice value            0.28
  Whose turn            0.09
  Turn time             0.21


Chapter 7

Results

The unimodal, multimodal, and baseline models were all trained and optimized on the training and validation data sets. The hyperparameters resulting in the highest weighted F1-score on the validation data set were then evaluated on the test set, resulting in Table 7.1. Even the baseline without game state features significantly outperformed a random classifier, which scored an F1 of 0.17.

Table 7.1: Unimodal and multimodal performance, where the best score for each metric is marked in bold.

Method      Features     F1-score  Valence  Arousal

Unimodal
  GRU       Face (F)        0.32     0.61     0.48
  GRU       Body (B)        0.26     0.61     0.38
  GRU       Audio (A)       0.18     0.56     0.41
  GRU       Game (G)        0.35     0.65     0.50

Multimodal
  LF-GRU    F, B, A         0.36     0.65     0.49
  LF-GRU    F, B, A, G      0.46     0.69     0.59

Baseline
  Random    -               0.17     0.51     0.36
  SVM       F, B, A         0.30     0.60     0.47
  SVM       F, B, A, G      0.35     0.65     0.50

Among the unimodal models, facial expression was the most important modality, with the highest F1-score, valence accuracy, and arousal accuracy of the three perceptual modalities. Further, it is also clear that the game states were the most important standalone feature set, outperforming all of the modalities and scoring similarly to the multimodal model using facial, posture, and audio features. When incorporating the game state into the multimodal model, the performance increased drastically to 0.46 F1, 69% valence accuracy, and 59% arousal accuracy. Both of the multimodal LF-GRUs perform better than the SVM baseline without game state. The game context increased the performance significantly over all of the metrics for both the baselines and the LF-GRUs.
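
Because each class label combines a valence and an arousal level, the valence and arousal accuracies reported above can be computed by collapsing each predicted class to the corresponding dimension; a minimal sketch, assuming class names of the form "positive-low" (the exact label encoding is an assumption):

```python
def axis_accuracy(y_true, y_pred, axis):
    """Accuracy on one affect dimension of combined 'valence-arousal' class names.

    axis=0 compares the valence part, axis=1 the arousal part.
    """
    hits = sum(t.split("-")[axis] == p.split("-")[axis] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Illustrative usage with made-up predictions:
y_true = ["positive-low", "negative-high", "positive-high"]
y_pred = ["positive-low", "positive-high", "positive-low"]
print(axis_accuracy(y_true, y_pred, axis=0))  # valence accuracy, here 2/3
print(axis_accuracy(y_true, y_pred, axis=1))  # arousal accuracy, here 2/3
```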

Figure 7.1: Normalized Confusion Matrix for the baseline SVM with game context.

By inspecting Figure 7.1, one can see that the baseline has the majority of its high values along the diagonal of the confusion matrix, implying that the model has learned to classify the correct label better than chance. The matrix also shows that the second off-diagonals are relatively strong for cells with a lower arousal but the right valence, which corresponds well with the results from Table 7.1, where the valence accuracy was 65%. For arousal, the same trend can be noticed for high and low arousal. However, the model has a tendency to predict a lower valence and arousal label than the true value, which is indicated by the lower triangle of the matrix having higher values than the upper triangle.
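
The row-normalized confusion matrices shown in Figures 7.1-7.3 can be reproduced with standard tooling; a sketch assuming scikit-learn and matplotlib, with illustrative label names and predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Illustrative labels and predictions; in practice these come from the test set.
labels = ["negative-high", "negative-low", "positive-low", "positive-high"]
y_true = ["positive-low", "negative-high", "positive-high", "positive-low"]
y_pred = ["positive-low", "positive-high", "positive-high", "negative-low"]

# normalize="true" makes every row sum to one over the true class.
ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, labels=labels, normalize="true", cmap="Blues")
plt.title("Normalized Confusion Matrix")
plt.tight_layout()
plt.show()
```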

Figure 7.2: Normalized Confusion Matrix for the LF-GRU without game context.

In Figure 7.2, the diagonal is in general stronger than for the baseline. One can also see that the model tends to over-predict the majority class, positive-low, which is also the class with the most correct classifications. Overall, however, the model does a decent job of predicting the correct valence and arousal label and performs better than the baseline without the game state. When compared with the baseline that uses the game state, the LF-GRU performs similarly but slightly worse.

Figure 7.3: Normalized Confusion Matrix for the LF-GRU with game context.

When looking at Figure 7.3, the confusion matrix has a relatively strong diagonal, implying that the LF-GRU with the game state does a decent job of predicting the correct classes. Furthermore, it does an excellent job of predicting the correct valence. However, it struggles more with the correct arousal level, except for the negative-high class, which is often confused with positive-high, implying that the model does find the right arousal level for that class. When comparing the confusion matrix for the LF-GRU with the game state with the other two matrices, it is clear that the model outperforms the other two approaches, with a diagonal that is significantly stronger.

Chapter 8

Discussion

The results in Chapter 7 show that including context-based information can increase the performance in Affective State Recognition, since it improved the performance of both the multimodal model and the baseline, see Table 7.1. Comparing Table 7.1 in conjunction with Table 6.1, the context-based information and the facial expressions were the features providing the most emotional information useful for the ML methods. Further, the less important modalities from OpenPose and OpenSmile had a lower performance in the unimodal setups. Nevertheless, when combined with the facial features, the performance increased, implying that they provided complementary information useful in a multimodal setup. The unimodal audio model had the lowest performance and performed only slightly above chance on the test data. Multiple factors could have caused this; one could be noise in the audio data from background sounds, which is hard to remove post-study.

In general, the LF-GRU with game context does a decent job of predicting the correct class. When inspecting the misclassified data points, the model had difficulty picking up short and subtle emotional cues, performing better on longer emotional expressions. It is possible that short emotions were diluted by other expressions in the 2 s time window, i.e., by the neutral states surrounding them. Further, it had a hard time classifying the correct valence for high-arousal expressions, i.e., whether they were negative or positive. This could be explained by the subtle differences between the high-arousal emotions, which the model is not sophisticated enough to differentiate.

This paper is based on the data set extracted from the case study, see Chapters 4 and 5. In the case study, the camera had a slightly downward and angled position, which meant that some of the data points had to be discarded. Further, when inspecting the feature importance for facial expressions, AUs 4 and 17 stand out as among the least important AUs. This could have been caused by the camera positioning, with the camera not capturing the whole facial expression. Furthermore, the participants tended to lean over the 30" game table, resulting in an increased false-positive certainty for some of the AUs, such as AU 4.

The annotators who labeled the data set had no previous experience of labeling valence and arousal, which could have introduced significant inconsistencies in the annotations. Furthermore, every participant had their own personal way of expressing their affective state, making it hard to standardize the annotation procedure. Participants could also use different modalities to express their emotions; for example, one could rely more on body language than on facial expressions and vice versa, making it crucial for the annotator to consider all modalities when annotating. There were also cases where the different modalities gave conflicting information, for example when the facial expression gave the impression that the person was happy by smiling while the body indicated that the person was frustrated, making it problematic to merge the multiple signals into one label.

Each participant had their personal way of expressing their affective state, and the model was trained on all of the participants in this study. This implies that the model is not generalized, and it has not been tested whether it would successfully predict the affective state of a new subject. To obtain a generalized model, the data set would have to be larger and include more subjects, covering the distribution of the emotional spectrum in the general population. A setup where the model was exposed to a new subject would probably not achieve the performance reported in this paper. Nevertheless, the results imply that the LF-GRU can recognize the affective state of a player and that game context does increase its performance.

Chapter 9

Conclusion & Future Work

This paper includes a case study with a task-oriented setup based on Snakes & Ladders, where 13 participants played a 20-minute game against a teleoperated robot. The recordings were turned into a data set consisting of facial, postural, audio, and context-based features, which was manually annotated on discretized arousal and valence scales.

The data set was used to investigate whether context-based features can improve the performance of MML for Affective State Recognition. First, a baseline was constructed with a classic Machine Learning approach, an SVM, with and without game states, for comparison with the MML approaches. Further, the feature importance was evaluated, where the audio features had the lowest importance and the game state and facial features had the highest importance. The MML LF-GRU was tested with and without the game state, and it was clear that for both the baseline and the LF-GRU, the game state was indeed an important feature. Furthermore, both MML approaches outperformed the baseline.

The results in Chapter 7 show that it is possible to increase the performance of an Affective State Recognition MML approach by incorporating context-based information, raising the F1-score from 0.36 to 0.46 and significantly outperforming the baseline. Further, this paper shows that facial expression and game state are the most important modalities and that audio and speech might be less important for Affective State Recognition on game oriented data.

9.1 Future Work

Future work should investigate whether the results hold on a more extensive data set and whether the results and model can be generalized. Further, alternative MML frameworks should be applied and tested to see if the same gains from context-based features can be achieved. Another direction is to examine whether similar results can be obtained in other task-oriented HRI settings, such as other games, tutoring, and service applications, to mention a few. The model presented in this thesis uses OpenFace, OpenPose, and OpenSmile as feature extraction tools. These three tools combined with the LF-GRU are too slow to run in real time, which is a crucial area to investigate since HRI implementations will have to run in real time.


TRITA-EECS-EX-2021:628