Automatic 3D facial expression recognition

Rafael Monteiro

September 16, 2013

Date Performed: September 10, 2013
Instructors: Claudio Esperança

Ricardo Marroquim

1 Introduction

Facial expressions are an important aspect of human emotional communication. They indicate the emotional state of a subject, aspects of personality, and other features. According to Bettadapura [1], their study began with clinical and psychological purposes, but with recent advances in computer vision, computer science researchers have become interested in developing systems to automatically detect those expressions.

Automatic facial expression recognition has several applications, for instance in HCI (Human-Computer Interaction), where interfaces could be developed to respond to certain user expressions, as in games, communication tools, etc. Although humans can easily recognize a specific facial expression, its identification by computer systems is not that easy. There are several challenges involved, such as illumination changes, occlusion, beards, glasses, etc. [2].

In the 1970s, one of the first problems faced by researchers was how to accurately describe an expression. Paul Ekman, in his research, defined six basic expressions, which he considered universal because they can be identified in any culture: joy, sadness, fear, surprise, disgust and anger [3]. Examples are shown in Figure 1. In 1971, Ekman published a study claiming facial expressions were universal across different cultures [5]. Later, in 2001, Parrott identified 136 emotional states and categorized them into three levels: primary, secondary and tertiary emotions [4]. Primary emotions are Ekman's six basic emotions, and the other two levels form a hierarchy beneath them.

Figure 1: Universal expressions: joy, sadness, fear, surprise, disgust and anger


In 1977, Ekman and Friesen developed a methodology to measure expressions more precisely by creating FACS (Facial Action Coding System) [6]. FACS defines basic expression components called Action Units (AUs). They describe small facial movements, such as raising the inner brows (AU1), wrinkling the nose (AU9), and so on. These action units can be combined to form facial expressions, as the sketch below illustrates.
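A minimal sketch of the combination idea, representing each basic expression as a set of AUs. The AU sets below are commonly cited prototypes (e.g. joy as AU6 + AU12); exact combinations vary between studies and are listed here only for illustration.

    # Representing expressions as sets of FACS Action Units.
    # The combinations are illustrative prototypes, not a definitive coding.
    PROTOTYPES = {
        "joy":      {6, 12},           # cheek raiser + lip corner puller
        "sadness":  {1, 4, 15},        # inner brow raiser + brow lowerer + lip corner depressor
        "surprise": {1, 2, 5, 26},     # brow raisers + upper lid raiser + jaw drop
        "fear":     {1, 2, 4, 5, 20, 26},
        "disgust":  {9, 15, 16},       # nose wrinkler + lip corner depressor + lower lip depressor
        "anger":    {4, 5, 7, 23},     # brow lowerer + lid tightener + lip tightener
    }

    def match_expression(active_aus: set[int]) -> str:
        """Return the prototype whose AU set best overlaps the detected AUs."""
        score = lambda name: len(PROTOTYPES[name] & active_aus) / len(PROTOTYPES[name])
        return max(PROTOTYPES, key=score)

    print(match_expression({6, 12, 25}))  # -> "joy"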

A discussion about the universality of human expressions arose in 1994, when Russell questioned Ekman's position and raised several points indicating that human expressions are not universal across different cultures [7]. In the same year, Ekman wrote a paper refuting Russell's arguments one by one [8]. Since then, Ekman's position has been widely accepted, and the claim that human expressions are universal across different cultures has been sustained.

Facial expression recognition research has many subfields. One of them is 3D facial expression recognition. These systems are based on facial surface information obtained by creating a 3D model of the subject's face, in which they try to identify the expression. This report discusses some approaches used in this field. There is a major division between static and dynamic studies. Static studies are performed on a single picture of a subject, in which the expression is identified, while dynamic studies consider the temporal behavior of expressions (see Figure 2). A good example of dynamic studies is micro-expression analysis. A micro-expression is an expression that lasts a very short time, generally between 1/25th and 1/15th of a second. Micro-expressions generally occur when a subject tries to conceal an expression but fails, so it appears for a brief moment on the face.
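A quick back-of-the-envelope check of what that duration implies for capture, using only the 1/25 s to 1/15 s range quoted above:

    # How many frames does a micro-expression span at common frame rates?
    for fps in (25, 30, 60, 100):
        shortest, longest = fps / 25.0, fps / 15.0
        print(f"{fps:3d} fps: {shortest:.1f} to {longest:.1f} frames")
    # At 25-30 fps a micro-expression may occupy a single frame, which is
    # one reason dynamic analysis often calls for high-speed capture.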

Figure 2: Example of a dynamic facial expression system

One major problem in facial expression studies is capturing spontaneous expressions. Most facial expression databases are composed of simulated expressions, such as the ones displayed in Figure 3. It is easier to ask subjects to display these expressions than to capture expressions generated spontaneously in emotional reaction to real-world stimuli. An interesting development occurred when Sebe et al. proposed a solution to this problem using a kiosk with a camera [9]. People would stop by and watch videos while displaying genuine emotions, and their faces were captured by the camera. At the end of the study, subjects were asked whether they would allow their images to be used for academic purposes.

Figure 3: Examples of clearly non-spontaneous facial expressions

2 Facial expression systems

There are many approaches used by facial expression systems. In a recent survey, Sandbach et al. reviewed the state of the art and observed that most systems are organized in three steps: face acquisition, face tracking and alignment, and expression recognition [10].
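The skeleton below mirrors that three-step organization. All function names are hypothetical placeholders for this report, not a real API; the sections that follow fill in concrete techniques for each stage.

    # Illustrative skeleton of the three-step pipeline from Sandbach et al.
    def acquire_face(frame):
        """Step 1: build a 3D mesh of the face (e.g. structured light, stereo)."""
        ...

    def track_and_align(mesh, reference):
        """Step 2: register the mesh against a reference (e.g. ICP)."""
        ...

    def recognize_expression(aligned_mesh):
        """Step 3: extract features and classify (e.g. distances + SVM)."""
        ...

    def process(frame, reference):
        mesh = acquire_face(frame)
        aligned = track_and_align(mesh, reference)
        return recognize_expression(aligned)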

2.1 Face acquisition

Face acquisition is the step that generates a 3D model of the subject's face. Common approaches include single-image reconstruction, structured light, photometric stereo, and multi-view stereo acquisition.

Single-image reconstruction methods are an emerging research topic because of their simplicity: only a single image is required, taken with an ordinary camera in an unrestricted environment. Blanz and Vetter developed a method called 3D Morphable Models (3DMM), which statistically builds a model combining 3D shape and 2D texture information [11]. The method can generate linear combinations of different expressions and use them to synthesize expressions and detect them on facial models. The main disadvantages are that some initialization is required and that the method is not robust to partial occlusions.
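A minimal sketch of the linear-combination idea behind 3DMM: a face is the mean shape plus a weighted sum of basis shapes. The basis here is random for illustration, whereas the actual method learns it from registered scans; real 3DMM fitting also handles pose, texture and priors.

    import numpy as np

    # Shapes are flattened (3 * n_vertices,) vectors; columns of `basis` are modes.
    n_vertices, n_components = 5000, 50
    rng = np.random.default_rng(0)
    mean_shape = rng.normal(size=3 * n_vertices)
    basis = rng.normal(size=(3 * n_vertices, n_components))

    def synthesize(coeffs: np.ndarray) -> np.ndarray:
        """Generate a face as mean + linear combination of basis shapes."""
        return mean_shape + basis @ coeffs

    def fit_coeffs(target: np.ndarray) -> np.ndarray:
        """Recover the coefficients that best reproduce a target shape (least squares)."""
        coeffs, *_ = np.linalg.lstsq(basis, target - mean_shape, rcond=None)
        return coeffs

    coeffs = rng.normal(size=n_components)
    face = synthesize(coeffs)
    print(np.allclose(fit_coeffs(face), coeffs))  # recovers the coefficients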

Structured light techniques are based on projecting a light pattern onto the subject's face, analyzing the pattern deformations, and recovering 3D shape information. Figure 4 shows an example of such a system. Hall-Holt and Rusinkiewicz developed a system using multiple patterns, which are alternately projected onto the face [12]. An image without the pattern can also be captured in order to incorporate 2D texture information into the 3D model.
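The geometric core of such scanners can be sketched simply: once a camera pixel has been matched to a projector stripe, depth follows from intersecting the pixel's viewing ray with the stripe's light plane. The calibration values below are made up for illustration.

    import numpy as np

    def ray_plane_depth(pixel_dir: np.ndarray, plane_n: np.ndarray, plane_d: float) -> float:
        """Distance t along the camera ray t * pixel_dir to the plane n.x = d."""
        denom = plane_n @ pixel_dir
        if abs(denom) < 1e-9:
            raise ValueError("ray is parallel to the stripe plane")
        return plane_d / denom

    # Camera at the origin looking down +z; one pixel's ray direction:
    ray = np.array([0.1, -0.05, 1.0])
    # Stripe plane (from projector calibration), in camera coordinates:
    n, d = np.array([0.8, 0.0, 0.6]), 0.5
    t = ray_plane_depth(ray, n, d)
    print("3D point:", t * ray)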


Figure 4: Illustration of a structured light system

Photometric stereo is a variation of structured light techniques that uses more than one light source, each of which can emit a different color, as shown in Figure 5. Such systems can retrieve surface normals, which can be integrated to recover 3D shape information. Jones et al. developed a system which uses three lights switching on and off in a cycle around the camera [13]. The system performs well under either visible or infrared light.
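A minimal per-pixel sketch under the Lambertian assumption: with three or more known light directions L and observed intensities I, intensity = albedo * (L @ normal), so albedo and normal follow from a least-squares solve. The data here is synthetic and noise-free.

    import numpy as np

    L = np.array([[0.0, 0.0, 1.0],        # three calibrated light directions
                  [0.7, 0.0, 0.714],
                  [0.0, 0.7, 0.714]])

    true_normal = np.array([0.2, -0.1, 1.0])
    true_normal /= np.linalg.norm(true_normal)
    albedo = 0.8
    I = albedo * (L @ true_normal)        # one pixel's intensities

    g, *_ = np.linalg.lstsq(L, I, rcond=None)   # g = albedo * normal
    print("albedo:", np.linalg.norm(g))
    print("normal:", g / np.linalg.norm(g))
    # Integrating the recovered normal field then yields the 3D surface.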

Figure 5: Illustration of a photometric stereo system

Multi-view stereo acquisition systems use more than one camera to simultaneously capture images from different angles and combine these images to reconstruct the scene. Beeler et al. developed a system which uses high-end cameras and standard illumination, achieving excellent results with sub-millimeter accuracy [14].
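At the heart of multi-view reconstruction is triangulation: given the projection matrices of two calibrated cameras and a matched pixel pair, the 3D point can be recovered by the standard linear (DLT) method, sketched below. The camera setup is synthetic; real systems also need dense matching across views.

    import numpy as np

    def triangulate(P1, P2, x1, x2):
        """Linear triangulation: each pixel (u, v) contributes two rows of A
        via u * P[2] - P[0] and v * P[2] - P[1]; solve A X = 0 by SVD."""
        A = np.stack([x1[0] * P1[2] - P1[0],
                      x1[1] * P1[2] - P1[1],
                      x2[0] * P2[2] - P2[0],
                      x2[1] * P2[2] - P2[1]])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]               # de-homogenize

    K = np.diag([800.0, 800.0, 1.0])       # shared intrinsics (focal 800 px)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])              # camera at origin
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0], [0]])])  # 10 cm baseline

    X_true = np.array([0.05, -0.02, 0.6, 1.0])                     # point 60 cm away
    project = lambda P, X: (P @ X)[:2] / (P @ X)[2]
    print(triangulate(P1, P2, project(P1, X_true), project(P2, X_true)))
    # -> approximately [0.05, -0.02, 0.6]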


2.2 Face tracking and alignment

The second step performed in most facial expression systems is face tracking and alignment. Given two meshes, the problem is to align them in 3D space so that the face can be tracked over time. There are two kinds of alignment: rigid approaches, which assume similar meshes without large deformations, and non-rigid approaches, which deal with large deformations. Most rigid approaches rely on the traditional ICP (Iterative Closest Point) algorithm [15]. For non-rigid alignment, there are several different techniques.
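A compact rigid ICP sketch: alternate between nearest-neighbour matching and the best rigid transform (the Kabsch/SVD solution) until the motion is negligible. Point clouds are (n, 3) arrays; real pipelines add outlier rejection and robust weighting.

    import numpy as np
    from scipy.spatial import cKDTree

    def best_rigid(src, dst):
        """Least-squares rotation R and translation t mapping src onto dst."""
        cs, cd = src.mean(0), dst.mean(0)
        U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        return R, cd - R @ cs

    def icp(src, dst, iters=50, tol=1e-8):
        tree = cKDTree(dst)
        cur = src.copy()
        for _ in range(iters):
            _, idx = tree.query(cur)              # closest-point matches
            R, t = best_rigid(cur, dst[idx])
            nxt = cur @ R.T + t
            if np.abs(nxt - cur).max() < tol:
                break
            cur = nxt
        return cur

    # Toy check: recover a small rotation of a random cloud.
    rng = np.random.default_rng(2)
    pts = rng.normal(size=(500, 3))
    a = np.deg2rad(10)
    Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
    print(np.abs(icp(pts @ Rz.T, pts) - pts).max())  # residual should be small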

Amberg et al. created a variant of ICP which adds a stiffness variable to control the rigidity of the transformation at each iteration [16]. The stiffness starts at a high value and is reduced at each iteration, so that the matching gradually allows a non-rigid transformation, as sketched below. Rueckert et al. used an FFD (Free-Form Deformation) model which performs deformations using control points [17]. By reducing the number of control points, computation time can be reduced as well. See Figure 6 for an example of an FFD model.
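The fragment below sketches only the stiffness-annealing loop in the spirit of Amberg et al.; `nonrigid_step` is a hypothetical placeholder for the paper's regularized least-squares solve, not its actual API.

    # Start almost rigid, then progressively relax the regularization.
    def nonrigid_icp(src, dst, nonrigid_step, stiffness=(100, 50, 20, 5, 2, 1)):
        cur = src
        for alpha in stiffness:           # high alpha ~ rigid, low alpha ~ free
            cur = nonrigid_step(cur, dst, alpha)
        return cur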

Figure 6: Free-Form Deformation model

Wang et al. used harmonic maps to perform the alignment [18]. The face is mapped from 3D space to 2D space by projecting the mesh onto a disc, as shown in Figure 7, thus reducing one dimension. Different discs can then be compared in order to perform alignment. Sun et al. used a similar technique called conformal mapping, which maps the mesh into a 2D space while preserving the angles between edges [19]. Tsalakanidou and Malassiotis modified ASMs (Active Shape Models) [20] to work in 3D, using a face model with the most prominent features, such as the eyes, nose, etc. [21]. Figure 8 shows examples of ASMs plotted on faces.

2.3 Expression recognition

The third and last step of a facial expression system is to recognize the expression. In this step, descriptors are extracted, selected and classified using artificial intelligence techniques. Features can be static or dynamic. Static features are mostly used on a single image, whereas dynamic features are stable across time and can be tracked through successive frames of a video sequence.


Figure 7: Harmonic maps

Figure 8: Active Shape Models

Temporal modeling can be performed to analyze the dynamics of an expression through time; most systems use HMMs (Hidden Markov Models) [22] for this task, as the sketch below illustrates. Common static features are distance-based features, patch-based features, morphable models and 2D representations.
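A self-contained sketch of HMM-based temporal scoring: the forward algorithm computes the likelihood of an observed feature sequence under one expression's HMM, and classification picks the model with the highest score. The matrices below are made up (a three-state onset/apex/offset model over four discrete symbols); real systems learn them per expression.

    import numpy as np

    def forward_log_likelihood(pi, A, B, obs):
        """pi: (S,) initial probs, A: (S, S) transitions,
        B: (S, K) emission probs over K symbols, obs: list of symbol indices."""
        alpha = pi * B[:, obs[0]]
        log_lik = np.log(alpha.sum())
        alpha /= alpha.sum()              # rescale to avoid underflow
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            log_lik += np.log(alpha.sum())
            alpha /= alpha.sum()
        return log_lik

    pi = np.array([1.0, 0.0, 0.0])
    A = np.array([[0.7, 0.3, 0.0],        # left-to-right: onset -> apex -> offset
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    B = np.array([[0.7, 0.2, 0.05, 0.05],
                  [0.05, 0.2, 0.7, 0.05],
                  [0.3, 0.4, 0.2, 0.1]])
    print(forward_log_likelihood(pi, A, B, [0, 0, 1, 2, 2, 3, 1]))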

Distance-based features rely on distances between facial attributes, such as the distance between the corners of the mouth, or between the mouth and the eyes, and so on. Soyel and Demirel used 3D distances to recognize expressions [23]. Maalej et al. used patch-based features, where patches are small regions of the mesh represented as surface curves [24], as shown in Figure 9. Patches are compared against templates by computing the geodesic distance between them. Ramanathan et al. used a MEM (Morphable Expression Model), where base expressions are defined and any expression can be modeled as a linear combination of these base expressions through morphing parameters [25]. These parameters define a parameter space in which similar expressions form clusters. A new expression is identified by finding the parameters that generate the closest expression and passing those parameters to a classifier.
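A small sketch of distance-based features in the spirit of Soyel and Demirel: pairwise Euclidean distances between a few 3D landmarks form the feature vector fed to a classifier. The landmark names are illustrative.

    import numpy as np

    LANDMARKS = ["mouth_left", "mouth_right", "upper_lip", "lower_lip",
                 "left_eye", "right_eye", "left_brow", "right_brow"]

    def distance_features(points: np.ndarray) -> np.ndarray:
        """points: (n_landmarks, 3). Returns all pairwise Euclidean distances."""
        diff = points[:, None, :] - points[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        iu = np.triu_indices(len(points), k=1)
        return dist[iu]                   # upper triangle, one value per pair

    face = np.random.default_rng(3).normal(size=(len(LANDMARKS), 3))
    print(distance_features(face).shape)  # (28,) = 8 choose 2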

Berretti et al. used 2D representations, where a depth map of the face is computed, generating a 2D image [26]. Classification is done using SIFT (Scale Invariant Feature Transform) descriptors [27] and SVMs (Support Vector Machines) [28].
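A sketch of the classification stage of such a pipeline, using scikit-learn. Each face is assumed to already be represented by a fixed-length vector (e.g. aggregated SIFT descriptors of depth-map keypoints); the random data below merely stands in for real descriptors, so accuracy hovers around chance.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 128))            # 300 faces, 128-d descriptor each
    y = rng.integers(0, 6, size=300)           # labels: the six basic expressions

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
    print("accuracy on random data, ~ chance:", clf.score(X_te, y_te))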

Figure 9: Patch-based descriptors

As for dynamic features, there are a few approaches. Le et al. used facial level curves, whose variation through time can be tracked and measured using Chamfer distances [29]. Figure 10 shows an example of such curves. Sandbach et al. used FFDs to model the lattice deformation over time, with HMMs performing the temporal analysis [30].
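A minimal sketch of the symmetric Chamfer distance used to compare such curves: average each point's distance to its nearest neighbour on the other curve, in both directions. Curves are (n, 3) point arrays; two concentric rings 0.1 apart give a distance of about 0.2.

    import numpy as np
    from scipy.spatial import cKDTree

    def chamfer(a: np.ndarray, b: np.ndarray) -> float:
        d_ab, _ = cKDTree(b).query(a)     # each point of a to its nearest in b
        d_ba, _ = cKDTree(a).query(b)
        return d_ab.mean() + d_ba.mean()

    t = np.linspace(0, 2 * np.pi, 200)
    circle = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
    print(chamfer(circle, 1.1 * circle))  # ~0.2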

Figure 10: Facial level curves

Feature classification is generally performed using well-known classifiers, such as AdaBoost and its variations [31], k-NN (k-Nearest Neighbors) [32], neural networks [33], SVMs [28], etc.


3 Future challenges

Research on 3D facial expression recognition is evolving, but some challenges remain. One is the construction of databases of spontaneous expressions, since most existing databases were built using artificial expressions. Furthermore, systems capable of distinguishing a spontaneous expression from an artificial one are also desirable. Recognition of expressions other than Ekman's six universal expressions is important, since most systems focus only on these six. Temporal analysis is still in its infancy; more focus on this area is required, especially on the analysis of micro-expressions, which are very hard to detect. Improving algorithm performance is also crucial: ideally, all systems should work in real time.

References

[1] V. Bettadapura. Face expression recognition and analysis: The state of the art. CoRR, abs/1203.6722, 2012.

[2] M. Pantic and L. J. M. Rothkrantz. Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1424–1445, 2000.

[3] P. Ekman. Universals and Cultural Differences in Facial Expressions of Emotion. University of Nebraska Press, 1971.

[4] W. G. Parrott. Emotions in Social Psychology: Essential Readings. Key readings in social psychology. Psychology Press, 2001.

[5] P. Ekman and W. V. Friesen. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124–129, 1971.

[6] P. Ekman and W. V. Friesen. Manual for the Facial Action Coding System. Consulting Psychologists Press, 1977.

[7] J. A. Russell. Is there universal recognition of emotion from facial expressions? A review of the cross-cultural studies. Psychological Bulletin, 115(1):102–141, 1994.

[8] P. Ekman. Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique. Psychological Bulletin, 115(2):268–287, 1994.

[9] N. Sebe, M. S. Lew, I. Cohen, Y. Sun, T. Gevers, and T. S. Huang. Authentic facial expression analysis. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 517–522, 2004.

[10] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin. Static and dynamic 3D facial expression recognition: A comprehensive survey. Image and Vision Computing, 30(10):683–697, 2012. 3D Facial Behaviour Analysis and Understanding.


[11] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '99, pages 187–194, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.

[12] O. Hall-Holt and S. Rusinkiewicz. Stripe boundary codes for real-time structured-light range scanning of moving objects. In Eighth IEEE International Conference on Computer Vision, pages 359–366, 2001.

[13] A. Jones, G. Fyffe, X. Yu, W.-C. Ma, J. Busch, R. Ichikari, M. Bolas, and P. Debevec. Head-mounted photometric stereo for performance capture. In Visual Media Production (CVMP), 2011 Conference for, pages 158–164, 2011.

[14] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, and M. Gross. High-quality single-shot capture of facial geometry. In ACM SIGGRAPH 2010 Papers, SIGGRAPH '10, pages 40:1–40:9, New York, NY, USA, 2010. ACM.

[15] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 14(2):239–256, 1992.

[16] B. Amberg, S. Romdhani, and T. Vetter. Optimal step nonrigid ICP algorithms for surface registration. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8, 2007.

[17] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transactions on Medical Imaging, 18:712–721, 1999.

[18] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, and P. Huang. High resolution tracking of non-rigid motion of densely sampled 3D data using harmonic maps. Int. J. Comput. Vision, 76(3):283–300, March 2008.

[19] Y. Sun, X. Chen, M. Rosato, and L. Yin. Tracking vertex flow and model adaptation for three-dimensional spatiotemporal face analysis. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 40(3):461–474, 2010.

[20] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[21] F. Tsalakanidou and S. Malassiotis. Real-time facial feature tracking from 2D-3D video streams. In 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2010, pages 1–4, 2010.

[22] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.


[23] H. Soyel and H. Demirel. Facial expression recognition using 3D facial feature distances. In M. Kamel and A. Campilho, editors, Image Analysis and Recognition, volume 4633 of Lecture Notes in Computer Science, pages 831–838. Springer Berlin Heidelberg, 2007.

[24] A. Maalej, B. Ben Amor, M. Daoudi, A. Srivastava, and S. Berretti. Local 3D shape analysis for facial expression recognition. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 4129–4132, 2010.

[25] S. Ramanathan, A. Kassim, Y. V. Venkatesh, and W. S. Wah. Human facial expression recognition using a 3D morphable model. In Image Processing, 2006 IEEE International Conference on, pages 661–664, 2006.

[26] S. Berretti, B. Ben Amor, M. Daoudi, and A. del Bimbo. 3D facial expression recognition using SIFT descriptors of automatically detected keypoints. The Visual Computer, 27(11):1021–1036, 2011.

[27] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157, 1999.

[28] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[29] V. Le, H. Tang, and T. S. Huang. Expression recognition from 3D dynamic faces using robust spatio-temporal shape features. In Automatic Face and Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 414–421, 2011.

[30] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert. Recognition of 3D facial expression dynamics. Image and Vision Computing, 30(10):762–773, 2012. 3D Facial Behaviour Analysis and Understanding.

[31] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting, 1995.

[32] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, and G. Toussaint. Output-sensitive algorithms for computing nearest-neighbour decision boundaries. In F. Dehne, J. Sack, and M. Smid, editors, Algorithms and Data Structures, volume 2748 of Lecture Notes in Computer Science, pages 451–461. Springer Berlin Heidelberg, 2003.

[33] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.
