
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Deciphering conversation participants' uncertainty
A study using non-verbal cues

NILS HEMMINGSSON
OLIVER ÅSTRAND

KTH, SCHOOL OF ENGINEERING SCIENCES

Deciphering conversation participants’ uncertainty

A study using non-verbal cues

Nils Hemmingsson, [email protected]

Oliver Åstrand, [email protected]

Supervisors: Olov Engwall & Jose David Lopes at CSC

2017-05-10

Abstract

The field of human emotion expression has become increasingly important for engineers due to the potential of developing computers and robots that can interact naturally with humans. This study investigates how uncertainty is manifested in facial expressions. We formulated the binary definition ”A conversation participant is uncertain when they feel they do not understand what the counterpart is trying to communicate or that they do not know what to say”. With regard to this definition, a corpus consisting of videos where people played a spot-the-difference game and could only communicate verbally was marked with uncertainty labels. Facial features were extracted using the computer vision software OpenFace. These features, along with the uncertainty markings, were then used to train a neural network. The trained model was then used to predict whether a participant was uncertain or not at a given moment. To evaluate the model, it was tested on previously unseen data, both with an even distribution of certain and uncertain data points and with the actual distribution. In both cases, the best prediction model performed only slightly better than majority class prediction.


Contents

1 Introduction
  1.1 Objective & Research Question
2 Background
  2.1 Defining Uncertainty
  2.2 Previous Research
    2.2.1 Dialog interaction
    2.2.2 Facial feature extraction
  2.3 Hypotheses
3 Methodology
  3.1 Corpus
  3.2 Facial Features
  3.3 Observer Annotation
  3.4 Introduced variables
  3.5 Algorithms
    3.5.1 Data preparation
    3.5.2 Analysis
4 Empirical Findings
5 Discussion
6 Conclusion


1 Introduction

The ability to predict and foresee human behaviour is a key area within psychology and other social sciences. Due to the development of computer dialogue systems and robots, it has become an increasingly relevant topic for engineers as well. In order to develop a robot that can converse efficiently with people, one has to be able to understand human behaviour.

There are multiple factors that are believed to have predictive power with respect to human behaviour in conversations; the relation between the participants along with verbal and non-verbal cues, to name a few [1][2][3]. As computational power has increased, research on these subjects has been greatly facilitated. Furthermore, the development of facial landmark detection programs has made the study of facial expressions and patterns far more accessible. These technical advancements have created an opportunity for data-driven research focused on conversational human behaviour.

One conversation-related emotion of particular interest is uncertainty. For instance, if one aims to develop a robot that teaches human subjects, one has to be able to detect whether the person understands the matter being taught. If the pupil is expressing signs of uncertainty, it may be advantageous for the robot to try to explain the matter in a different way, or to ask what it is that the pupil does not understand. The same holds for screen-based teaching programs with access to the front camera of a computer or cellphone; such a solution would also be significantly cheaper, and thus potentially more relevant. Moreover, for a robot intended to facilitate conversations between two or more people, it is crucial to be able to detect uncertainty, since those are the moments when it is particularly important for the robot to intervene and attempt to help or alleviate pressure from the uncertain person.

1.1 Objective & Research Question

The objective of this paper is to examine, through the use of statistical and machine learning algorithms, the possibility of determining whether a conversation participant is uncertain based on their facial expressions. This is useful since it can provide powerful decision tools for human-robot interaction. Our research question is thus:

Can machine-driven algorithms use facial expressions to determine whether a conversation participant is uncertain or not?


2 Background

This section begins by defining uncertainty as it is used in this paper. It then provides information about previous research in the area. In light of this previous research, the section concludes with our hypotheses regarding the stated research question.

2.1 Defining Uncertainty

In performing the research we have to define uncertainty in the context of this paper. We do this as follows:

A conversation participant is uncertain when they feel they donot understand what the counterpart is trying to communicateor that they do not know what to say.

Throughout the paper, when mentioning uncertainty, we refer to the above definition. Moreover, to facilitate annotation and analysis, the state of being uncertain is defined to be binary, i.e. one is uncertain or one is not, and when one is not uncertain, one is regarded as being certain.

2.2 Previous Research

2.2.1 Dialog interaction

A dialog is a conversation between two persons, and a conversation can be described as a joint activity of the conversation participants. The smoothness of a dialog depends on the coordination between these two persons [4]. Joint attention occurs when the conversation participants' attention coincides and they are both aware of it [5][6]. This phenomenon enables more efficient language (e.g. when both persons are aware that ”it” refers to a chair that has previously been described). Thus, both persons participating in the dialog need to keep track of the current subject spoken of [7]. When speaking of a certain object, the speaker naturally looks or glances at it if possible [8]. The same holds for a listener [9]. Thus, the gaze of the dialog participants can serve as an indication of their attention. When conversing face-to-face, by studying the other's gaze one can get a hint of where the other person's attention is, and thus increase the likelihood of joint attention. This is not true for telephone calls or similar conversation settings.

When taking part in conversations, people aim to follow Grice's maxim of quality: ”Try to make your contribution one that is true” [10]. A submaxim of it is ”Do not say that for which you lack adequate evidence.” Thus, when speaking and feeling uncertain about one's statement, one should express it, or else one appears certain. Conversation participants are further encouraged to express their uncertainty since they do not want to appear foolish if their statement turns out to be false [11][12].

The understanding of uncertainty and other attitudes expressed in a conversation is crucial for the development of efficient dialog models [13][14]. Several studies have been performed on how uncertainty affects how we speak, e.g. [15]. However, to our knowledge, none has been done on how it is expressed in our facial language.

There are multiple ways of expressing uncertainty. Certainly, one can say it directly out loud. Furthermore, when expressing a statement one is uncertain of, one can communicate it by emphasizing certain words of the statement. For instance, certain intonation or emphasis of the word right can signal both certainty and uncertainty [16]. The intonation of uncertainty expression differs across cultures and languages [17]. Furthermore, when conversation participants are uncertain, they more frequently use filled pauses: ”um”, ”uh” and other similar expressions. Moreover, when uncertain, word frequency is lower [18]. Increased usage of filled pauses holds for other emotions as well, for instance disagreement, and the filled pauses used when disagreeing can be hard to distinguish from the filled pauses used when uncertain [19]. However, phonemic prior biases can help differentiate the two [20]. Continuing, the usage of ”ah” and ”mm” indicates uncertainty whereas ”yes” does not [21]. Furthermore, uncertain acknowledgements (short responses such as ”okay” or ”mhm”) have been found to have lower intensity, longer duration and a more monotone pitch [21].

2.2.2 Facial feature extraction

Machine learning analysis of facial expressions has attracted increasing amounts of attention [22][23][24]. The Facial Action Coding System (FACS) is a system that anatomically categorizes human facial movements by their appearance [25]. It has been adopted by psychologists and computer scientists studying human behaviour [26][27]. The taxonomy of facial movements is done through action units (AU), each of which focuses on certain characteristics of the face; examples are Nose Wrinkler (AU #9) and Lip Tightener (AU #23). This system has facilitated the development of facial landmark detection programs, since they can base their output data on FACS [28][29].

Many facial landmark detection systems, including FACS, attempt to recognize a small set of prototypical emotional expressions of the six basic emotions proposed by Darwin: fear, anger, disgust, surprise, happiness and sadness [28][29][30]. Automatic machine-driven facial analysis aiming to detect these basic emotions has largely been solved [31]. Moreover, research on more complex emotions such as frustration [32] and fatigue [33] has been conducted. However, recognition of such more complex psychological states has proven to be more difficult [31].

2.3 Hypotheses

In light of the above section, we hypothesize that facial expressions can, to an extent, determine uncertainty. Moreover, since studying each other's gaze will not be possible (see section 3.1), we believe gaze patterns to have significant explanatory value on their own.

3 Methodology

In the following section, information about the corpus and the data used will be presented, and the annotation process will be described. Following that, we define the necessary variables and lastly present the algorithms we employ.

3.1 Corpus

In order to examine the posed research question, we chose a data set consisting of 50 two-party conversations with video and audio, where the conversation participants played a spot-the-difference game. Each conversation was recorded with two video cameras, each facing one of the conversation participants. The videos were recorded with a GoPro Hero3+ at a resolution of 1920x1080 and a frame rate of 29 frames per second. The audio was recorded over separate microphones.

The setup for the conversations was as follows: the participants were put in different rooms and allowed to communicate only verbally over a set of microphones. They were then shown similar pictures, which differed in 3 to 10 places. The participants were given roles: one instruction giver and one instruction follower. The former led the conversation, taking the initiative in describing what could be seen in the picture, and together they should thereby find the differences. The game ended when the participants thought they had found all the differences or when the time limit of 200 seconds ran out. Then the solution with the differences highlighted was shown, and when the participants were ready they could move on to the next pair of pictures. This was repeated 3 times for each pair of participants. One example pair of pictures is shown in figure 1.

Figure 1: A pair of pictures that were used in the game.

3.2 Facial Features

In order to extract facial features from the video, the open source software OpenFace [35] (OF) was used. This software uses deep neural networks capable of facial landmark detection, head pose estimation, facial action unit recognition and eye-gaze estimation. The facial action units detected by the program are described in table 1. The gaze was given as a three-dimensional vector for each eye, and the head pose as one 3D vector representing the tilt of the head. OpenFace also provided a measure of how certain it was of its estimation of the facial landmarks for each frame, in increments of ten percentage points [35].
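For illustration, a minimal sketch of how such per-frame output could be loaded in Python. The file name is hypothetical, and the column names (confidence, gaze_0_x, pose_Rx, AU01_r and so on) follow OpenFace's usual CSV conventions but are assumptions here rather than details taken from the thesis.

```python
# Sketch: loading per-frame OpenFace output with pandas (column names assumed
# to follow OpenFace's CSV conventions; adjust to the actual files).
import pandas as pd

df = pd.read_csv("participant_01.csv", skipinitialspace=True)  # hypothetical file

# Per-frame landmark confidence reported by OpenFace (0.0-1.0 in steps of 0.1).
confidence = df["confidence"]

# Gaze direction: one 3D vector per eye, six components in total per frame.
gaze = df[[f"gaze_{eye}_{axis}" for eye in (0, 1) for axis in ("x", "y", "z")]]

# Head pose rotation (tilt) and action unit intensities.
pose = df[["pose_Rx", "pose_Ry", "pose_Rz"]]
action_units = df[[c for c in df.columns if c.startswith("AU") and c.endswith("_r")]]
```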


AU Number   Name
1           Inner Brow Raiser
2           Outer Brow Raiser
4           Brow Lowerer
5           Upper Lid Raiser
6           Cheek Raiser
7           Lid Tightener
9           Nose Wrinkler
10          Upper Lip Raiser
12          Lip Corner Puller
14          Dimpler
15          Lip Corner Depressor
17          Chin Raiser
20          Lip Stretcher
23          Lip Tightener
25          Lips Part
26          Jaw Drop
28          Lip Suck
45          Blink

Table 1: A table describing the facial action units generated by OpenFace

3.3 Observer Annotation

In order to perform the analysis, the data needed uncertainty labels. Annotators were instructed to mark the videos whenever the definition of uncertainty was fulfilled. It was of particular importance not to look only at the facial expressions of the conversation participants. Rather, annotators should consider the entirety of the means of communication: body language and verbal and facial expression, to name a few. The main reason for this is that we wanted to capture uncertainty in any form, and not only uncertainty whose detection relies on the annotator's ability to read facial expressions. All videos in the corpus were annotated once.

Although we certainly did not in any way attempt to annotate in a way that would favour our posed hypotheses, it should be noted that it was the authors of this work who performed the annotations. Figure 2 shows the annotation interface ELAN that was used. The software allowed us to mark time intervals. The moments in the data where the participant was seen as uncertain were marked, and the time in between was regarded as certain.


Figure 2: The annotation interface ELAN. The blue portion of the bottom bar shows an annotation.

Parts of the videos were recorded while the participants were not playing the game. However, since we only wanted to investigate the facial expressions during the game, the videos were only annotated while the game was being played. The data outside the first and the last uncertainty annotation was then discarded. Having done so, we found that the conversation participants were uncertain 7% and certain 93% of the time.

3.4 Introduced variables

Most of the variables extracted from OpenFace were incorporated in our algorithms as they were delivered from the program, without being redefined in any way. These were the facial action units mentioned in section 3.2, as defined by Paul Ekman [26]. However, for the gaze and head pose estimates we needed to define a useful measurement. For the gaze, we argued that the movement of the eyes is what is relevant, since both staring and very rapid eye movements could be signs of uncertainty. We therefore introduced the variable σ²_gaze, defined as the trace of the covariance matrix of the gaze vectors, taken over the 15 frames before and after the given point in time. The head pose was defined such that the total rotation of the head was given by the product of the pose vector's elements [35]. This gave us a total of 20 variables to be analyzed.
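As a sketch of these two derived measures (numpy; the array shapes, function names and edge handling are assumptions for illustration, not the thesis' exact implementation):

```python
import numpy as np

def gaze_variance(gaze, t, half_window=15):
    """sigma^2_gaze at frame t: the trace of the covariance matrix of the six
    gaze components, taken over the 15 frames before and after t.
    gaze is assumed to be an array of shape (n_frames, 6)."""
    if t < half_window or t + half_window >= len(gaze):
        return np.nan  # cannot be computed near the edges; such frames are dropped
    window = gaze[t - half_window : t + half_window + 1]
    return np.trace(np.cov(window, rowvar=False))

def head_rotation(pose):
    """Total head rotation as the product of the head pose vector's elements."""
    return np.prod(pose)
```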


3.5 Algorithms

3.5.1 Data preparation

In order to analyze the data, the labels needed to be paired up with the variables from OpenFace. This was done in two different ways that were later contrasted with each other. In the first method, each video was split into sections that were marked as certain or uncertain. Each such section was then split into ten evenly spaced frames. The data from OpenFace associated with each frame was then paired up with the given label. However, if the confidence of OpenFace was less than 80%, that frame was discarded. Since every section contributed ten frames, this resulted in the data being almost balanced between certain and uncertain. We refer to this division of the data set as balanced.

The second method was performed in a manner such that the proportion of certain and uncertain data reflected the true distribution in the markings. The data was traversed with a step of 50 ms, and each frame was paired up with the corresponding marking. Frames were discarded if the confidence of OpenFace was below 80%. In both methods, frames were also thrown away whenever the variance of the gaze could not be computed.
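A hedged sketch of the balanced pairing described above, assuming each annotated section is available as a (start_frame, end_frame, label) tuple; the function and argument names are illustrative:

```python
import numpy as np

def balanced_samples(features, confidence, sections, n_per_section=10, min_conf=0.8):
    """Balanced pairing: ten evenly spaced frames per annotated section, each
    paired with that section's label. Frames where OpenFace reports less than
    80% confidence are discarded.

    sections is assumed to be a list of (start_frame, end_frame, label) tuples;
    features and confidence are per-frame arrays.
    """
    X, y = [], []
    for start, end, label in sections:
        for f in np.linspace(start, end, n_per_section, dtype=int):
            if confidence[f] >= min_conf:
                X.append(features[f])
                y.append(label)
    return np.array(X), np.array(y)
```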

3.5.2 Analysis

The data was now ready to be analyzed using supervised learning. The methods employed were Artificial Neural Networks (ANN) and Support Vector Machines (SVM).

An ANN is a method that loosely mimics the human brain. It consists of a large number of units called neurons, which are connected in layers, as shown in figure 3. In a layer, each neuron receives weighted inputs from the neurons of the previous layer; the input signal is the sum of the weighted signals from that layer. A neuron is activated, and passes on the signal, if the input is larger than a certain threshold. The first layer takes the defined variables as inputs and is called the input layer. At the final layer, called the output layer, a prediction is produced. The whole process of generating an estimate from input is called forward propagation (FP).

First the network is initialized with random weights. Then the data is sent through the network using FP, and the predictions are compared to the actual labels. The residual errors are then squared and summed to produce an error function. This function is minimized by changing the weights in the network. The method by which the weights are changed is called back propagation (BP) and is described in detail by Michael Nielsen [36]. The outline is that one differentiates the error function with respect to the weights and then changes the weights in the opposite direction of the gradient. The process is repeated until an optimum is reached.

Figure 3: An Artificial Neural Network [36]
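The following is a minimal numpy sketch of one such training step for a network with one hidden layer; the sigmoid activation, learning rate and array shapes are assumptions for illustration rather than the exact setup used:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, W1, b1, W2, b2, lr=0.1):
    """One forward/backward pass with a summed squared-error loss.
    X: (n, d_in), y: (n, 1); W1, b1, W2, b2 are the layer weights and biases."""
    # Forward propagation
    h = sigmoid(X @ W1 + b1)    # hidden layer activations
    out = sigmoid(h @ W2 + b2)  # output layer prediction
    # error = np.sum((out - y) ** 2)  # summed squared residuals

    # Back propagation: differentiate the error w.r.t. the weights ...
    d_out = 2 * (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # ... and move the weights in the opposite direction of the gradient.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
    return W1, b1, W2, b2
```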

The network we designed had two layers, excluding the input layer. In almost all cases two layers are enough to approximate any decision boundary [39]. Since the input layer has as many nodes as the dimensionality of the data and the output layer has only one, our primary meta-parameter was the number of nodes in the middle (hidden) layer. This was determined by splitting the data into training, validation and testing data.

The second method employed was Support Vector Machines. This technique utilizes the fact that most data is linearly separable in very high dimensions. Mathematically, the data is mapped to a higher-dimensional domain, separated by a hyperplane and then projected back onto the original space. This can, however, be done without the high-dimensional data vectors ever being calculated, using the kernel trick [38].

To implement the algorithms, the Python package scikit-learn was used. It has a built-in class MLPClassifier for neural networks, as well as the class SVC for support vector machines [37]. The MLPClassifier class allows creation of networks with a specific number of hidden units. The network can then be trained and used to predict new instances of data.

We partitioned the data into two parts: one training set containing 80% of the data and one test set containing 20%. Neural networks with between 10 and 140 hidden units, in increments of 10, were then created and trained on the training data. Every network's performance was then evaluated on the test set. A support vector machine with an RBF kernel was trained in the same way and compared to the neural networks.
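A sketch of this model selection with scikit-learn; the placeholder data and variable names are illustrative, not the thesis' actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder data standing in for the 20 prepared variables and binary labels.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

results = {}
for n_hidden in range(10, 150, 10):  # networks with 10, 20, ..., 140 hidden units
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500)
    net.fit(X_train, y_train)
    results[f"ANN ({n_hidden} hidden units)"] = net.score(X_test, y_test)

svm = SVC(kernel="rbf")  # RBF-kernel support vector machine for comparison
svm.fit(X_train, y_train)
results["SVM (RBF)"] = svm.score(X_test, y_test)
```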


In order to get a better grasp of the effects of the variables in the data, the Pearson correlation coefficient of each variable against the labels was computed. The Pearson coefficient is defined as the covariance of two random variables divided by the product of their standard deviations. The package used for this task was SciPy. While such a coefficient cannot capture all types of relations between variables, it can give an indication of which variables are worth investigating further.
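For illustration, correlations of this kind can be computed with scipy.stats.pearsonr; the data and variable names below are placeholders:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders standing in for the prepared feature matrix, labels and the
# names of the 20 analyzed variables.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
feature_names = [f"var_{i}" for i in range(X.shape[1])]

# Pearson correlation of each variable against the binary uncertainty labels.
for i, name in enumerate(feature_names):
    rho, p = pearsonr(X[:, i], y)
    print(f"{name}: rho = {rho:.3f}, p = {p:.1e}")
```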

The data was also plotted in two dimensions at a time in order to get a better understanding of the different variables. The plots were made using the Python package Matplotlib. For the larger data sets this was not done, since the amount of data was too large for Matplotlib to function properly.

In order to evaluate the models more thoroughly, three more metrics of model success were used: precision, recall and F1. Precision is the amount of data correctly identified as uncertain over the total amount of data identified as uncertain. Recall is the amount of data correctly identified as uncertain over the total amount of actually uncertain data. The F1-score is an aggregate of these two using the formula:

F1 = 2 · (precision · recall) / (precision + recall)
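As a small illustration of these metrics (scikit-learn; the label arrays are made-up placeholders, not results from the study):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up label arrays (1 = uncertain): y_true stands for the annotations and
# y_pred for a model's predictions on the test set.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

precision = precision_score(y_true, y_pred)  # correct uncertain / all predicted uncertain
recall = recall_score(y_true, y_pred)        # correct uncertain / all actually uncertain
f1 = 2 * precision * recall / (precision + recall)
assert np.isclose(f1, f1_score(y_true, y_pred))  # the formula matches sklearn's F1
```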

4 Empirical Findings

This section provides the results from the algorithms presented above. The accuracy of the algorithms is presented in table 2; this is the number of predictions that were correct over the total number of predictions. The table also shows which algorithm obtained the highest accuracy and compares it to majority class prediction.

Data Set           Number of Points   Majority Class Voting   Best Prediction Accuracy   Best Performing Model
W. Gaze Balanced   7759               53.4%                   58.2%                      ANN (100 Neurons)
W. Gaze Linear     816560             92.4%                   92.7%                      ANN (70 Neurons)
No Gaze Balanced   31056              55.0%                   61.3%                      ANN (80 Neurons)
No Gaze Linear     839064             91.3%                   91.5%                      ANN (40 Neurons)

Table 2: Number of data points, the result obtained when performing majority class prediction, the best results obtained by our different algorithms, and the method that obtained them. The results are shown for the balanced and unbalanced data, with and without gaze incorporated.


We also generated F1 as well as precision and recall scores for the above classifiers. These are shown in table 3.

Data Set             Precision   Recall   F1
W. Gaze Balanced     56.3%       34.9%    40.7
W. Gaze Unbalanced   31.6%       0.04%    0.08
No Gaze Balanced     66.0%       51.7%    54.6
No Gaze Unbalanced   35.9%       2.6%     5.1

Table 3: Precision, recall and F1 for the classifier models. The results are shown for the balanced and unbalanced data, with and without gaze incorporated.

In order to get a sense of the distribution of uncertainty in different dimensions, we plotted the data. One such plot is shown below.

Figure 4: Action Unit 7, Lid Tightener, scattered against Action Unit 4, Brow Lowerer. The blue dots mark uncertain data, and red marks certain data.

Above, the plot of AU04 against AU07 is presented. Both were found to have positive correlations with uncertainty: the former had ρ = 0.045 (p = 1·10⁻¹⁵) and the latter ρ = 0.048 (p = 1·10⁻¹⁷). These are the action units with the smallest p-values among the analyzed variables. The correlations are nonetheless very low, so no AU can be said to correlate meaningfully with uncertainty.

5 Discussion

In contrast to what we hypothesized, we were not able to find a reliable method to extract uncertainty from facial expressions. We do not think that facial expression on its own can completely predict uncertainty; for instance, verbal expression has also been found to relate to uncertainty [21]. However, we still believe that explanatory value resides in facial behaviour, but that the methods we employed were too robust and not perfectly suited for this research. Certainly, there are multiple other machine learning algorithms that could be applied to the data. Since we only looked at snapshots, without regarding the frames around each particular snapshot, significant information might have been omitted. Future research is therefore suggested to examine the possibilities of more refined machine learning algorithms, for instance Markov models through HTK [34]. As a motivation, it is possible that looking at the same spot and holding one's face still for 2 seconds means nothing in particular, whereas doing so for 10 seconds or more might be an indication of uncertainty. Furthermore, it could be advantageous to treat each individual differently, for instance by looking at deviations from that person's mean facial expression rather than the absolute facial expression. This could potentially yield better results. However, it is possible that, regardless of the machine learning algorithms used, there is no good method for deciphering uncertainty from facial expressions.

Whether or not this is the case, a potential path for further research is to examine how verbal communication alone can determine uncertainty. Moreover, combining verbal and non-verbal communication into a more complete set of explanatory variables could be feasible. Another suggestion for future research is to let several annotators label the same data. One could then, for instance, mark as uncertain only the parts of the videos where the annotators agreed that uncertainty occurred, mark as certain the parts where they agreed that uncertainty did not occur, and omit the rest of the data. Thereby one would get a clearer distinction between uncertainty and certainty, and thus more reliable predictions. An alternative is to give the agreed uncertain data uncertainty value 2, the disagreed data uncertainty value 1 and the agreed certain data uncertainty value 0, which could yield a more accurate representation of the level of uncertainty of the conversation participants. Furthermore, once the means of expressing uncertainty are better understood, it may be interesting to look at the data points where the annotators disagree, and what distinguishes them. Moreover, it could be feasible to let the annotators be people who do not perform the research itself. However, this might be difficult due to the sheer amount of time it took to annotate the videos, and the annotators would also have to become acquainted with the software used, which is a potential threshold as well.

Furthermore, the extracted data on eye gaze was very unreliable. Thus, we could not determine whether eye gaze on its own can detect uncertainty, as we had hypothesized. In order to perform accurate analysis, one would need data that can be relied on. If one is creating a data set for research similar to this, it would therefore be advantageous to use eye-trackers on the conversation participants, yielding more reliable gaze data.

6 Conclusion

Our analysis has not found that machine-driven analysis can predict the uncertainty of conversation participants. However, this is not to say that it is not possible to do so. It could be that uncertainty cannot be extracted using only facial expression data, but we do not believe this is the case. Rather, we hypothesize that the machine learning methods used were not sufficient to extract the relevant information. The most relevant suggestion for future research on this subject is therefore to use more refined machine learning algorithms, for instance ones that consider not only snapshots but also the time frames surrounding the particular moment. Moreover, it could be worthwhile for the machine learning method to evaluate different persons individually.


References

[1] Robin I. M. Dunbar, Anna Marriott, Neil D. C. Duncan, Human Conversational Behaviour, Human Nature, Volume 8, Issue 3, pp. 231–246, 1997.

[2] Elizabeth Aries, Interaction Patterns and Themes of Male, Female and Mixed Groups, Small Group Research, Volume 7, Issue 1, 1976.

[3] Geoffrey Beattie, Talk: An Analysis of Speech and Non-Verbal Behaviour in Conversation, Milton Keynes, U.K.: Open University Press, 1983.

[4] Herbert H. Clark, Using Language, Cambridge University Press, Cambridge, UK, 1996.

[5] Herbert H. Clark, Catherine R. Marshall, Definite reference and mutual knowledge, In: Joshi, A.K., Webber, B.L., Sag, I.A. (Eds.), Elements of Discourse Understanding, Cambridge University Press, pp. 10–63, 1981.

[6] Simon Baron-Cohen, The eye direction detector (EDD) and the shared attention mechanism (SAM): two cases for evolutionary psychology, In: Moore, C., Dunham, P.J. (Eds.), Joint Attention: Its Origins and Role in Development, Lawrence Erlbaum Associates, Inc., pp. 41–60, 1995.

[7] Barbara J. Grosz, Candace L. Sidner, Attention, intentions, and the structure of discourse, Computational Linguistics, Volume 12, Number 3, pp. 175–204, 1986.

[8] Herbert H. Clark, Meredith A. Krych, Speaking while monitoring addressees for understanding, Journal of Memory and Language, Volume 50, pp. 62–81.

[9] Paul D. Allopenna, James S. Magnuson, Michael K. Tanenhaus, Tracking the time course of spoken word recognition using eye movements: evidence for continuous mapping models, Journal of Memory and Language, Volume 38, Issue 4, pp. 419–439, 1998.

[10] H. Paul Grice, Logic and Conversation, In: Speech Acts, New York: Academic Press, 1975.

[11] Erving Goffman, Interaction Ritual: Essays in Face-to-Face Behavior, Garden City, N.Y.: Doubleday, 1967.

[12] Erving Goffman, Relations in Public, New York: Harper and Row, 1971.


[13] Marie Nilsenova, Rises and Falls: Studies in the Semantics and Pragmatics of Intonation, Ph.D. dissertation, University of Amsterdam, 2006.

[14] Jackson Liscombe, Julia Hirschberg & Jennifer J. Venditti, Detecting certainness in spoken tutorial dialogues, In: Ninth European Conference on Speech Communication and Technology, 2005.

[15] Heather Pon-Barry, Prosodic manifestations of confidence and uncertainty in spoken language, In: Proceedings of Interspeech, Brisbane, Australia, pp. 74–77, 2008.

[16] Catherine Lai, What do you mean, you're uncertain?: The interpretation of cue words and rising intonation in dialogue, In: Proceedings of Interspeech, Makuhari, Japan, 2010.

[17] Dwight Bolinger, Intonation and Its Uses, Stanford, California: Stanford University Press, 1989.

[18] Vicki L. Smith & Herbert H. Clark, On the Course of Answering Questions, Journal of Memory and Language, Volume 32, Issue 1, pp. 25–38, 1993.

[19] Daniel Neiberg & Joakim Gustafson, Towards letting machines humming in the right way - prosodic analysis of six functions of short feedback tokens in English, FONETIK 2012, Gothenburg, Sweden, 2012.

[20] Daniel Neiberg & Joakim Gustafson, Cues to perceived functions of acted and spontaneous feedback expressions, KTH, Stockholm, Sweden, 2012.

[21] Catharine Oertel, Modelling Engagement in Multiparty Conversations, KTH, Stockholm, Sweden, 2017.

[22] Maja Pantic & Marian S. Bartlett, Machine analysis of facial expressions, In: Face Recognition (Delac, K., Grgic, M., Eds.), pp. 377–416, Vienna, Austria: I-Tech Education and Publishing, 2007.

[23] Zhihong Zeng, Maja Pantic, Glenn I. Roisman, Thomas S. Huang, A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, Issue 1, pp. 39–58, 2009.

[24] Irfan Essa & Alex Pentland, Coding, Analysis, Interpretation, and Recognition of Facial Expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 19, Issue 7, pp. 757–763, 1997.


[25] Carl-Herman Hjortsjö, Man's Face and Mimic Language, Lund: Studentlitteratur, 1970.

[26] Paul Ekman, Wallace Friesen, The Facial Action Coding System: A Technique for the Measurement of Facial Movement, Consulting Psychologists Press, San Francisco, 1978.

[27] Paul Ekman, Facial Expression and Emotion, American Psychologist, Volume 48, Issue 4, pp. 384–392, 1993.

[28] Marian S. Bartlett, Paul Ekman, Joseph C. Hager & Terry J. Sejnowski, Measuring Facial Expressions by Computer Image Analysis, Psychophysiology, Volume 36, Issue 2, pp. 253–264, 1999.

[29] Gianluca Donato, Marian S. Bartlett, Joseph C. Hager, Paul Ekman & Terry J. Sejnowski, Classifying Facial Actions, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 21, Issue 10, pp. 974–989, 1999.

[30] Charles R. Darwin, The Expression of the Emotions in Man and Animals, London: John Murray, 1872.

[31] Maja Pantic, Machine analysis of facial behaviour: naturalistic and dynamic behaviour, Philosophical Transactions of the Royal Society B: Biological Sciences, Volume 364, Issue 1535, 2009.

[32] Ashish Kapoor, Yuan Qi, Rosalind W. Picard, Fully automatic upper facial action recognition, In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2003.

[33] Qiang Ji, Peilin Lan & Carl Looney, A probabilistic framework for modeling and real-time monitoring human fatigue, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, Volume 36, Issue 5, 2006.

[34] Hidden Markov Model Toolkit, http://htk.eng.cam.ac.uk/, Cambridge University Engineering Department, 2016.

[35] Tadas Baltrusaitis, Peter Robinson & Louis-Philippe Morency, OpenFace: an open source facial behavior analysis toolkit, IEEE Winter Conference on Applications of Computer Vision, 2016.

[36] Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.


[37] scikit-learn Documentation, http://scikit-learn.org/stable/documentation.html

[38] Martin Hofmann, Support Vector Machines - Kernels and the Kernel Trick, http://www.cogsys.wiai.uni-bamberg.de/teaching/ss06/hs_svm/slides/SVM_Seminarbericht_Hofmann.pdf, 2006.

[39] Kurt Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks, Volume 4, Issue 2, 1991.
