
3D corpus of spontaneous complex mental states

Marwa Mahmoud¹, Tadas Baltrušaitis¹, Peter Robinson¹ and Laurel Riek²

¹ University of Cambridge, United Kingdom
² University of Notre Dame, United States

Abstract. Hand-over-face gestures, a subset of emotional body language, are overlooked by automatic affect inference systems. We propose the use of hand-over-face gestures as a novel affect cue for automatic inference of cognitive mental states. Moreover, affect recognition systems rely on the existence of publicly available datasets; an approach is often only as good as its data. We present the collection and annotation methodology of a 3D multimodal corpus of 108 audio/video segments of natural complex mental states. The corpus includes spontaneous facial expressions and hand gestures labelled using crowd-sourcing, and is publicly available.

1 Introduction

Human-computer interaction could greatly benefit from automatic detection of affect from non-verbal cues such as facial expressions, non-verbal speech, head and hand gestures, and body posture. Unfortunately, algorithms trained on currently available datasets might not generalise well to the real-world situations in which such systems would ultimately be used. Our work aims to fill this gap with a 3D multimodal corpus of elicited complex mental states. In addition, we propose hand-over-face gestures as a novel cue for affect recognition.

1.1 Motivation

There is now a move away from the automatic inference of the basic emotions proposed by Ekman [8] towards the inference of complex mental states such as attitudes, cognitive states, and intentions. The real world is dominated by neutral expressions [1] and complex mental states, with expressions of confusion, amusement, happiness, surprise, thinking, concentration, anger, worry, excitement, etc. being the most common ones [19]. This shift to incorporate complex mental states alongside basic emotions is necessary if one expects to build affect-sensitive systems as part of a ubiquitous computing environment.

There is also a move towards analysing naturalistic rather than posed expressions, as there is evidence of differences between them [4]. In addition, even Action Unit [7] amplitudes and timings differ between spontaneous and acted expressions [5]. These differences imply that recognition results reported on systems trained and tested on acted expressions might not generalise to spontaneous ones.

Fig. 1: Point cloud visualisation from two angles (c) & (d), combining the colour image (a) and the disparity (inverse depth) map (b) of an image captured using Kinect.

Furthermore, this means that systems trained on current posed datasets would not be able to perform in tasks requiring recognition of spontaneous affect.

Hand-over-face gestures, a subset of emotional body language, are overlooked by automatic affect inference systems. Many facial analysis systems are based on geometric or appearance-based facial feature extraction or tracking. As the face becomes occluded, facial features are either lost, corrupted, or erroneously detected, resulting in an incorrect analysis of the person's facial expression. Only a few systems recognise facial expressions in the presence of partial face occlusion, either by estimating lost facial points [3, 22] or by excluding the occluded face area from the classification process [6]. In all these systems, face occlusions are a nuisance and are treated as noise, even though they carry useful information.

Moreover, the current availability of affordable depth sensors (such as the Microsoft Kinect) gives easy access to 3D data, which can be used to improve the results of expression and gesture tracking and analysis. An example of such data captured using Kinect can be seen in Figure 1.

These developments are hindered by the lack of publicly available corpora, making it difficult to compare or reproduce results. Researchers cannot easily evaluate their approaches without an appropriate benchmark dataset.

1.2 Contributions

In order to address the issues outlined, we have collected and annotated a corpus of naturalistic complex mental states. Our dataset consists of 108 videos of 12 mental states and is being made freely available to the research community. The annotations are based on emotion groups from the Baron-Cohen taxonomy [2] and on crowd-sourced labels.

Moreover, we have analysed hand-over-face gestures and their possible meaning in spontaneous expressions. By studying the videos in our corpus, we argue that these gestures are not only prevalent, but can also serve as affective cues.

1.3 Related Work

In Table 1 we list several publicly available databases for easy comparison with our corpus, Cam3D. This list is not exhaustive; for a more detailed one, see Zeng et al. [25].

Table 1: Overview of similar databases.

Properties             Cam3D  MMI [16]  CK+ [14]  SAL [15]  BU-4DFE [21]  FABO [11]
3D                     Y      N         N         N         Y             N
Modalities             S/F/U  F         F         S/F       F             F/U
Spontaneity            S      P/S       P/S       S         P             P
Number of videos       108    2894      700       10h       606           210
Number of subjects     7      79        210       24        100           24
Number of states       12     6         6         N/A       6             6
Emotional description  B/C    B         B         D/C       B             B/C

Modalities: S: speech, F: face, U: upper body. Spontaneity: S: spontaneous, P: posed.
Emotional description: B: basic, C: complex, D: dimensional.

When compared in terms of modality and spontaneity, all available datasets concentrate on some of the factors we are trying to address while ignoring the others. The MMI and CK+ corpora do not include upper body or hand gestures, the SAL corpus consists of emotionally coloured interaction but lacks segments of specific mental states, and the FABO dataset contains only posed data.

Several 3D datasets of still images [24] and videos [21] of posed basic emotions already exist. The resolution of the data acquired by their 3D sensors is much higher than that available from the Microsoft Kinect, but it is unlikely that such high-quality imaging will be available for everyday applications soon.

2 Corpus

Care must be taken when collecting a video corpus of naturally evoked mental states to ensure the validity and usability of the data. In the following sections, we discuss our elicitation methodology, data segmentation, and annotation.

2.1 Methodology

Elicitation of affective states Data collection was divided into two sessions: interaction with a computer program and interaction with another person. Most available corpora consist of emotions collected from single individuals or human-computer interaction tasks. However, dyadic interactions between people elicit a wide range of spontaneous emotions in social contexts under fairly controlled conditions [18]. Eliciting the same mental states during both sessions provides a comparison between the non-verbal expressions and gestures in the two scenarios, especially when the same participants, stimuli, and experimental environment are used, and also enriches the data with different versions of expressions for the same affective state.

Our desire to collect multi-modal data presented a further challenge. We were interested in upper body posture and gesture as well as facial expressions, in order to investigate the significance of hand and body gestures as cues in non-verbal communication. Recent experiments have shown the importance of body language, especially in conditions where it conflicts with facial expressions [10]. Participants were not asked to use any computer peripherals during data collection, so that their hands were always free to express body language.

Our elicitation methodology operated in four steps:

1. Choose an initial group of mental states.
2. Design an experimental task to induce them and conduct a pilot study.
3. Revise the list of the mental states induced according to the pilot results.
4. Validate the elicitation methodology after collecting and labelling the data.

The first group of induced mental states was cognitive: thinking, concentrating, unsure, confused, and triumphant. For elicitation, a set of riddles was displayed to participants on a computer screen. Participants answered the riddles verbally, with the option to give up if they did not know the answer. A second computer-based exercise was a voice-controlled computer maze, which participants were asked to traverse via voice commands. In the dyadic interaction task, both participants were asked to listen to a set of riddles. They discussed their answers together and either responded or gave up and heard the answer from the speakers. In both tasks, the order of the riddles was randomised to counter-balance any effect of the type of riddle on the results.

The second group of affective states comprised frustrated and angry. It was ethically difficult to elicit strong negative feelings in a dyadic interaction, so these were only elicited in the computer-based session: during one attempt at the voice-controlled computer maze, the computer responded incorrectly to the participant's voice commands.

The third group included bored and neutral. It was also hard to elicit boredom intentionally in a dyadic interaction, so this was only attempted in the computer-based session by adding a 'voice calibration' task, in which the participant was asked to repeat the words left, right, up, and down a large number of times according to instructions on the screen. Participants were also left alone for about three minutes after the whole computer task had finished.

The last group included only surprised. In the computer-based session, the computer screen flickered suddenly in the middle of the 'voice calibration' task. In the dyadic interaction session, surprise was induced by suddenly flickering the lights of the room at the end of the session.

Experimental procedure Data was collected in a standard experimental observation suite. A double mirror allowed experimenters to watch without disturbing the participants. A wizard-of-oz method was used for both the computer-based and dyadic interaction sessions. Participants knew at the beginning of the experiment that their video and audio were being recorded, but they did not know the actual purpose of the study; they were told that the experiment was for voice recognition and calibration. Not explaining the real objective of the experiment to participants at first was essential for the data collection, to avoid having participants exaggerate or mask their expressions if they knew we were interested in their non-verbal behaviour.

Fig. 2: The layouts of the two parts of the data collection: (a) dyadic task, (b) computer task, (c) examples of frames from the Kinect and camera C2. P1 and P2 are the participants; C1 and C2 are the HD cameras.

Participants 16 participants (4 pilot and 12 non-pilot) were recruited through the university mailing lists and local message boards. The 12 non-pilot participants were 6 males and 6 females, with ages ranging from 24 to 50 years (µ=27, σ=8). They were from diverse ethnic backgrounds, including Caucasian, Asian, and Middle Eastern, and had varied fields of work and study. All participants completed the two sessions: half started with the computer-based task, while the other half started with the dyadic interaction task. Dyads were chosen randomly. Since cross-sex interactions elicit more non-verbal warmth and sexual interest than same-sex interactions [23], we chose same-sex dyads to avoid this effect on the non-verbal expressions in the dyads. Half the participants (3 males and 4 females) gave public consent for data distribution.

2.2 Data acquisition

We used three different sensors for data collection: Microsoft Kinect sensors, HD cameras, and the microphones in the HD cameras.

Figure 2a shows the layout for recording dyadic interactions. Two Kinect sensors and two HD cameras were used; each HD camera pointed at one participant and was used to record the voice of the other. The camera layout for the computer interaction task is presented in Figure 2b. A Kinect sensor and an HD camera faced the participant, while another HD camera was positioned next to the participant, facing away; this camera was used to record the participant's voice.

Several computers were used to record the data from the different sensors, which required subsequent manual synchronisation.

The HD cameras provided 720 x 576 px colour images at 25 frames per second. The recorded videos were later converted to 30 frames per second to simplify synchronisation with the Kinect videos.
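
As an illustration of that conversion step, the following is a minimal sketch (not taken from the paper) of mapping 25 fps frame indices onto a 30 fps timeline by nearest-timestamp duplication; the function name and the decision to duplicate rather than interpolate frames are our own assumptions.

```python
# Sketch: map frames recorded at 25 fps onto a 30 fps timeline by nearest
# timestamp, effectively duplicating every fifth frame, so that HD frames can
# be paired with Kinect frames index-for-index.

def resample_frame_indices(n_src_frames: int, src_fps: float = 25.0,
                           dst_fps: float = 30.0) -> list:
    """For each slot on the destination timeline, return the index of the
    source frame whose timestamp is closest."""
    duration = n_src_frames / src_fps
    n_dst_frames = round(duration * dst_fps)
    indices = []
    for j in range(n_dst_frames):
        t = j / dst_fps                                   # destination timestamp
        i = min(int(round(t * src_fps)), n_src_frames - 1)
        indices.append(i)
    return indices

# Example: 25 source frames (1 s of video) map onto 30 destination slots,
# with 5 of the source frames appearing twice.
print(resample_frame_indices(25))
```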

The Kinect sensor provides a colour image and a disparity map (the inverse of depth) at 30 frames per second. The sensor uses structured infrared light and an infrared camera to calculate a 640 x 480 px 11-bit disparity map. An additional camera provides a 640 x 480 px colour image.
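
To make the colour-plus-disparity representation shown in Figure 1 concrete, here is a hedged sketch of back-projecting such a frame into a coloured point cloud. The pinhole intrinsics (fx, fy, cx, cy) and the depth = k / disparity conversion are illustrative placeholders, not the Kinect's actual calibration.

```python
# Sketch, assuming a pinhole camera model with illustrative intrinsics and an
# illustrative disparity-to-depth conversion; the paper only states that the
# sensor returns an 11-bit disparity map that is inversely related to depth.
import numpy as np

def disparity_to_point_cloud(disparity: np.ndarray, colour: np.ndarray,
                             fx: float = 580.0, fy: float = 580.0,
                             cx: float = 320.0, cy: float = 240.0,
                             k: float = 350000.0) -> np.ndarray:
    """Back-project a 640x480 disparity map into an Nx6 array of
    (X, Y, Z, R, G, B) points; depth = k / disparity is a stand-in for the
    sensor's real calibration."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > 0                       # zero marks missing measurements
    z = np.zeros_like(disparity, dtype=np.float64)
    z[valid] = k / disparity[valid]             # inverse relation to depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)[valid]
    colours = colour[valid]
    return np.hstack([points, colours])
```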

2.3 Segmentation and Annotation

After the initial data collection, the videos were segmented. Each segment showed a single event, such as a change in facial expression, a head or body posture movement, or a hand gesture. This increases the value of the annotation compared with cutting the whole video into equal-length segments [1].

Video segments were chosen and annotated using ELAN [13]. From the videos with public consent, a total of 451 segments were collected, with a mean duration of 6 seconds (σ=1.2). For subsequent analysis, each video segment was annotated with the type of task and interaction. In addition, we encoded hand-over-face gestures (if any) in terms of hand shape, action, and the facial region occluded. From the non-public videos, only hand-over-face gestures (120 segments) were segmented and encoded, to be included in the subsequent hand gesture analysis.
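
As a sketch of what one such annotation record might look like in code, the dataclass below uses the fields named above (task, interaction, hand shape, action, facial region occluded); the example values and suggested enumerations are illustrative, not the corpus's exact coding scheme.

```python
# Illustrative annotation record for a single segment; field values are
# hypothetical examples, not taken from the released corpus.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentAnnotation:
    segment_id: str
    task: str                    # e.g. "riddles", "maze", "voice calibration"
    interaction: str             # "computer" or "dyadic"
    hand_shape: Optional[str]    # e.g. "open hand", "closed hand", "index finger"
    hand_action: Optional[str]   # e.g. "leaning", "stroking", "tapping", "touching"
    face_region: Optional[str]   # e.g. "chin", "mouth", "lower cheek", "upper face"
    crowd_label: Optional[str] = None   # majority label added after crowd-sourcing

example = SegmentAnnotation("P03_riddles_07", "riddles", "computer",
                            "index finger", "touching", "lower cheek", "thinking")
```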

Labelling was based on context-free observer judgment. Public segments were labelled by community crowd-sourcing, which is fast, cheap, and can be as good as expert labelling [20]. The sound was low-pass filtered to remove the verbal content of speech. Video segments were displayed in random order through a web interface, and participants were asked to give a 'word' describing the emotional state of the person in the video. Free-form input was used rather than menus in order not to influence the choice of label. In addition, an auto-complete list of mental states was displayed to avoid misspellings. The list was based on the Baron-Cohen taxonomy of emotions [2], as it is an exhaustive list of emotional states and synonyms (1150 words, divided into 24 emotion groups and 412 emotion concepts with 738 synonyms for the concepts). In total, 2916 labels from 77 labellers were collected (µ=39). Non-public video segments were labelled by four experts in the field, and segments with less than 75% agreement were excluded.
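
A minimal sketch of the audio preparation step is shown below, using a standard Butterworth low-pass filter from SciPy; the paper does not specify the filter design or cutoff frequency, so the 400 Hz value and fourth-order filter are assumptions chosen only to illustrate keeping prosody while removing intelligible speech.

```python
# Sketch of low-pass filtering a recording so the words become unintelligible
# while prosodic cues remain audible; cutoff and order are illustrative.
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def remove_verbal_content(in_path: str, out_path: str, cutoff_hz: float = 400.0):
    rate, audio = wavfile.read(in_path)
    sos = butter(4, cutoff_hz, btype="lowpass", fs=rate, output="sos")
    filtered = sosfiltfilt(sos, audio, axis=0)        # zero-phase filtering
    wavfile.write(out_path, rate, filtered.astype(audio.dtype))
```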

We decided to use categorical labels rather than continuous ones, such as PAD (pleasure, arousal, dominance) scales, because we used naive labellers. Dimensional representation is not intuitive and usually requires special training [25], which would be difficult when crowd-sourcing.

3 Data Analysis

3.1 Validation

Out of the 451 segmented videos, we wanted to extract those that could reliably be described as belonging to one of the 24 emotion groups from the Baron-Cohen taxonomy. Of the 2916 labels collected, 122 did not appear in the taxonomy and so were not considered in the analysis. The remaining 2794 labels were grouped as belonging to one of the 24 groups plus agreement, disagreement, and neutral.
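
A hedged sketch of this grouping step is given below: free-form labels are mapped to taxonomy groups through a synonym table, and anything outside the taxonomy is discarded. The dictionary is a tiny illustrative stand-in, not the actual 1150-word taxonomy.

```python
# Illustrative excerpt of a label-to-group lookup; the real mapping would be
# built from the Baron-Cohen taxonomy plus agreement/disagreement/neutral.
GROUP_OF = {
    "thinking": "thinking", "pondering": "thinking",
    "unsure": "unsure", "doubtful": "unsure",
    "happy": "happy", "amused": "happy",
    "yes": "agreement", "no": "disagreement", "neutral": "neutral",
}

def group_labels(raw_labels):
    """Map each free-form label to its group, dropping out-of-taxonomy labels."""
    return [GROUP_OF[lbl.lower()] for lbl in raw_labels if lbl.lower() in GROUP_OF]

print(group_labels(["Pondering", "confused??", "Amused"]))  # ['thinking', 'happy']
```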

Because raters saw a random subset of videos, not all videos received an equal number of labels. We did not consider the 16 segments that had fewer than 5 labels. To filter out non-emotional segments, we chose only the videos that 60% or more of the raters agreed on, which resulted in 108 segments in total. As the average number of labels per video was 6, the chance of reaching 60% agreement by chance is less than 0.1%. The most common label given to a video segment was considered the ground truth.
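
The "less than 0.1%" figure can be sanity-checked with a short calculation. The sketch below assumes labellers choose uniformly at random among the 27 groups (the 24 taxonomy groups plus agreement, disagreement, and neutral) and exactly 6 labels per video, and applies a union bound over categories; these modelling choices are ours, not the paper's.

```python
# Rough upper bound on the probability that >= 60% of 6 random labels
# land on the same one of 27 categories.
from math import comb

n_labels, n_categories = 6, 27
need = 4                                   # 60% of 6 labels, rounded up
p = 1.0 / n_categories

# P(one fixed category receives at least 4 of the 6 labels)
p_one = sum(comb(n_labels, k) * p**k * (1 - p)**(n_labels - k)
            for k in range(need, n_labels + 1))

print(f"Union bound over categories: {n_categories * p_one:.5f}")  # ~0.0007
```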

Fig. 3: Examples of still images from the dataset.

Examples of still images from the labelled videos can be seen in Figure 3.

We validated the labelling of the selected videos using Fleiss's kappa (κ) [9], a measure of inter-rater reliability. The resulting κ = 0.45 indicates moderate agreement. This allows us to dismiss agreement by chance and to be confident in annotating the 108 segments with the emotional group chosen by the annotators.
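
For reference, a minimal sketch of the Fleiss's kappa computation over a segment-by-category count matrix is given below; it assumes an equal number of labels per segment, whereas the corpus has varying counts, so it illustrates the statistic rather than reproducing the paper's exact computation.

```python
# Fleiss's kappa for a matrix counts[i, j] = number of raters assigning
# segment i to category j, assuming a fixed number of ratings per segment.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    n = counts.sum(axis=1)[0]                                 # ratings per segment
    p_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))     # per-segment agreement
    p_bar = p_i.mean()                                        # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()                   # category proportions
    p_e = np.sum(p_j**2)                                      # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```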

Alternatively, if we were to choose a higher cutoff of 70% (56 videos) or 80% (40 videos) instead of 60%, we would get κ = 0.59 and κ = 0.67 respectively, reaching substantial agreement. Although this would lead to fewer videos in our corpus, those videos might be seen as better representations of the mental states. We reflect this in our corpus by reporting the level of agreement per segment. Probabilistic systems can benefit from knowing the uncertainty in the ground truth and exploit it in classification.

Furthermore, we wanted to estimate inter-rater agreement for specific mental states in the resulting 108 segments. Expressions of the basic emotions happy (κ = 0.64) and surprised (κ = 0.7) had higher levels of agreement than the complex mental states interested (κ = 0.32), unsure (κ = 0.52), and thinking (κ = 0.48). For this analysis we only considered expressions with no fewer than 5 representative videos.

We also wanted to see how successful the different elicitation methods were at generating naturalistic expressions of affect. Most of the affective displays came from the riddles, in both the computer and dyadic tasks. They were successful at eliciting thinking (26) and unsure (22), with some happy (14), surprised (3), agreeing (5), and interested (2). The longer and more complicated maze was successful at eliciting interest (4) and thinking (1). The third maze managed to elicit a broader range of expressions, including surprised (1), sure (1), unsure (1), and happy (3). This was somewhat surprising, as we did not expect to elicit happiness in this task. There is some evidence [12] of people smiling during frustration, which might explain the perception of happiness by the labellers.

Fig. 4: Different hand shape, action, and face region occluded are affective cues in interpreting different mental states.

3.2 Analysis of Hand-over-face gestures

In The Definitive Book of Body Language, Pease and Pease [17] attempt to identify the meaning conveyed by different hand-over-face gestures. Although they suggest that different positions and actions of the hand occluding the face can imply different affective states, no quantitative analysis has been carried out. Using the collected video segments, we have analysed the hand-over-face gestures in the videos in terms of the hand shape and its action relative to face regions. In the 451 initial public segments, hand-over-face gestures appeared in 20.8% of the segments (94 segments): 16% in the computer-based session and 25% in the dyadic interaction session. Participants varied in how much they gestured; some gestured a lot, while others gestured only a little.

Looking at the position of the hand on the face in this subset of 94 hand-over-face segments, the hand covered upper face regions in 13% of the segments and lower face regions in 89% of them, with some videos having the hand overlap both upper and lower face regions. This indicates that in naturalistic interactions hand-over-face gestures are very common, and that hands usually cover lower face regions, especially the chin, mouth, and lower cheeks, more often than upper face regions.

We analysed the annotated corpus of 108 video segments in addition to the expert-labelled private segments, giving a total of 82 hand-over-face segments. Figure 4 presents examples of labelled segments of hand-over-face gestures.

In the publicly labelled set, hand-over-face gestures appeared in 21% of the segments. Figure 5 shows the distribution of mental states in each category of the encoded hand-over-face gestures. For example, an index finger touching the face appeared in 12 thinking segments and 2 unsure segments, out of a total of 15 segments in this category. The distribution of mental states indicates that passive hand-over-face gestures, like leaning on the closed or open hand, appear across different mental states, but not in cognitive mental states. On the other hand, actions like stroking, tapping, and touching facial regions, especially with the index finger, are all associated with cognitive mental states, namely thinking and unsure. Thus, hand shape and action on different face regions can be used as a novel cue in interpreting cognitive mental states.

4 Discussion

We have described the collection and annotation of a 3D multi-modal corpus of naturalistic complex mental states, consisting of 108 videos of 12 mental states.

Fig. 5: Encoding of hand-over-face shape and action in different mental states. Note the significance of index finger actions in cognitive mental states.

The annotations are based on crowd-sourced labels. Over six hours of data were collected, but this generated only 108 segments of meaningful affective states, which highlights the challenge of collecting large naturalistic datasets. Analysing our corpus, we noticed the potential of hand-over-face gestures as a novel modality in facial affect recognition. Our mental state elicitation methodology was successful; therefore, future work will include adding more data to our corpus. This will allow further exploration of spontaneous gestures and hand-over-face cues. Furthermore, we are exploring the use of depth in the automatic analysis of facial expressions, hand gestures, and body postures.

5 Acknowledgment

We acknowledge funding support from the Yousef Jameel Scholarships, EPSRC, and Thales Research and Technology (UK). We would like to thank Lech Swirski and Andra Adams for their help with video recording and labelling.

References

1. Afzal, S., Robinson, P.: Natural affect data - collection & annotation in a learning context. In: ACII. pp. 1-7. IEEE (2009)

2. Baron-Cohen, S., Golan, O., Wheelwright, S., Hill, J.: Mind Reading: The Interactive Guide to Emotions (2004)

3. Bourel, F., Chibelushi, C., Low, A.: Robust facial expression recognition using a state-based model of spatially-localised facial dynamics. In: IEEE AFGR (2002)

4. Cowie, R.: Building the databases needed to understand rich, spontaneous human behaviour. In: AFGR. pp. 1-6. IEEE (2008)

5. Duchenne, G., Cuthbertson, R.: The Mechanism of Human Facial Expression. Cambridge Univ Press (1990)

6. Ekenel, H., Stiefelhagen, R.: Block selection in the local appearance-based face recognition scheme. In: CVPRW. pp. 43-43. IEEE (2006)

7. Ekman, P., Friesen, W.: Manual for the Facial Action Coding System. Palo Alto: Consulting Psychologists Press (1977)

8. Ekman, P., Friesen, W.V., Ellsworth, P.: Emotion in the Human Face. Cambridge University Press, second edn. (1982)

9. Fleiss, J., Levin, B., Paik, M.: Statistical Methods for Rates and Proportions. Wiley (2003)

10. de Gelder, B.: Why bodies? Twelve reasons for including bodily expressions in affective neuroscience. Phil. Trans. of the Royal Society B 364(1535), 3475 (2009)

11. Gunes, H., Piccardi, M.: A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In: ICPR. vol. 1, pp. 1148-1153. IEEE (2006)

12. Hoque, M.E., Picard, R.W.: Acted vs. natural frustration and delight: Many people smile in natural frustration. In: IEEE AFGR (2011)

13. Lausberg, H., Sloetjes, H.: Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods (2009), http://www.lat-mpi.eu/tools/elan/

14. Lucey, P., Cohn, J., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: CVPRW. pp. 94-101. IEEE (2010)

15. McKeown, G., Valstar, M., Cowie, R., Pantic, M.: The SEMAINE corpus of emotionally coloured character interactions. In: ICME. pp. 1079-1084. IEEE (2010)

16. Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: IEEE Conf. Multimedia and Expo. p. 5. IEEE (2005)

17. Pease, A., Pease, B.: The Definitive Book of Body Language. Bantam (2006)

18. Roberts, N., Tsai, J., Coan, J.: Emotion elicitation using dyadic interaction tasks. Handbook of Emotion Elicitation and Assessment, pp. 106-123 (2007)

19. Rozin, P., Cohen, A.B.: High frequency of facial expressions corresponding to confusion, concentration, and worry in an analysis of naturally occurring facial expressions of Americans. Emotion 3(1), 68 (2003)

20. Snow, R., O'Connor, B., Jurafsky, D., Ng, A.: Cheap and fast - but is it good?: Evaluating non-expert annotations for natural language tasks. In: Proc. of the Conf. on Empirical Methods in Natural Language Processing. pp. 254-263. Association for Computational Linguistics (2008)

21. Sun, Y., Yin, L.: Facial expression recognition based on 3D dynamic range model sequences. In: ECCV. pp. 58-71. Springer-Verlag (2008)

22. Tong, Y., Liao, W., Ji, Q.: Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE PAMI pp. 1683-1699 (2007)

23. Weitz, S.: Sex differences in nonverbal communication. Sex Roles 2, 175-184 (1976)

24. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.J.: A 3D facial expression database for facial behavior research. AFGR pp. 211-216 (2006)

25. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. TPAMI 31(1), 39-58 (2009)