


BILINEAR DECOMPOSITION FOR BLENDED EXPRESSIONS REPRESENTATION

Anonymous VCIP submission

Paper ID: 437

ABSTRACT

This paper proposes a new method for the analysis of blended expressions with varying intensity. The method is based on an asymmetric bilinear model learned on a small number of expressions. In the resulting expression space, a blended unknown expression has a signature that can be interpreted as a mixture of the basic expressions used in the creation of the space. Three methods are compared: a traditional method based on active appearance vectors, the asymmetric bilinear model on person-independent appearance vectors and the asymmetric bilinear model on person-specific appearance vectors. Experimental results on the recognition of 14 blended unknown expressions show the relevance of the bilinear models compared to appearance-based methods and the robustness of the person-specific models according to the types of parameters (shape and/or texture).

Index Terms— Facial expression space, Bilinear decomposition, Blended expressions, Person-specific models

1. INTRODUCTION

Facial expression analysis has grown significantly in the last decades, especially for the recognition of emotions. Indeed, the work of Ekman [1] showed that the six basic emotions (joy, anger, surprise, fear, sadness and disgust) are universally expressed by six corresponding facial expressions. The recognition of these 6 prototypic expressions on known subjects is now taken for granted, as the FERA 2011 challenge shows [2]: the best person-specific performance was 100%. Unfortunately, those prototypic expressions are rarely displayed in everyday life, but blended expressions with variable intensity are (as displayed in figure 1). This is also the case for emotions, which are often more subtle than those 6 categories. That is why systems evolved towards a dimensional representation of emotions (activation, valence, evaluation). The AVEC 2012 challenge [3] aimed at detecting the variations of four affect dimensions (valence, power, expectancy and arousal) of subjects speaking to an emotional agent. In AVEC 2012, the winners' average correlation between ground truth and prediction only reached 0.45, which shows that real-life emotion prediction is still a challenging problem. With such representations, we no longer have a direct link between expression and emotion. A two-step process (facial feature extraction and emotion detection) is no longer adequate. An intermediate step has to be added: expression space representation.

In this paper, we focus on this key step. The expression space is large and continuous, thus it is not possible to learn all expressions. For this reason, we propose a system that has a restricted learning database but yet deals with the expressions not included in any training database (called blended expressions). In our method, we assume that 8 expressions of each subject are known. As we need to deal with subtle differences between expressions, we use appearance features based on Active Appearance Models (AAMs) [4]. AAMs are known to provide important spatial information about key facial landmarks. In this paper, we use manual annotations of the faces to avoid problems due to AAM tracking. The last step of the system (emotion prediction) is not addressed in this paper, but a basic recognition algorithm is implemented instead to measure the relevancy of the expression space representation.

It is commonly admitted that expressions are similar across people and that identity is person-specific. The most popular invariant representation to uniquely characterize an expression is the FACS system [5]. In FACS, one expression is characterized by a combination of action units (AUs). In systems using AUs as the intermediate step for expression representation, the question of non-prototypic expressions is rarely addressed, taking the FACS combinations for granted.
Moreover, AU detection is still a challenging problem, as the FERA 2011 challenge illustrates (best recognition rate of 62%) [2]. Liu and Wu [6] showed that, with their method, training AU6 and AU12 simultaneously performed better for smile deceit detection than training AU6 and AU12 separately, which suggests that the AU description is not well adapted to computer vision.

To overcome morphological differences between subjects in a more unified manner, neutral face subtraction on facial features is often performed. Cheon & Kim [7] aligned different subjects' expressions by using Diff-AAM before manifold learning and recognition tasks. Diff-AAM features are the difference of active appearance model (AAM) features between the input face image and the reference face image (neutral face). They represent the main distortions of the face. These methods assume that all people have similar patterns of facial expression changing from the neutral expression to a specific expression.


Manifold learning methods have also been used for facial expression representation. The facial features are projected on a low-dimensional subspace which is optimal for low-cost and accurate classification. Such manifolds take into account the mixture of expressions and the notion of intensity. Stoiber et al. [8] embedded AAM features into a disc. The resulting space leads to promising results for animation purposes but is manually labeled and dedicated to a single subject. [9] addressed the question of blended expressions that are not included in the training database. They created a person-specific manifold for each subject by Lipschitz embedding applied on video sequences displaying transitions from the neutral face to one of the 6 basic expressions. The transitions between the neutral face and the six basic expressions represented 6 paths on the manifold. Blended expressions with varying intensities lay between those paths. Their experiments were limited to only five subjects and they did not measure the adequacy of the representation of the blended expressions across subjects. Moreover, embedding methods need a high density of training data to compute a manifold that properly approximates the organization of the expressions.

Finally, another way of extracting expression features from the facial features is to separate expression and identity via bilinear models. The advantage of bilinear methods is that they need few training images. Wang and Ahuja [10] used HOSVD (Higher-Order Singular Value Decomposition) to decompose appearance features similar to AAM features into a person subspace and an expression subspace. They used the resulting model to synthesize facial expressions of a new subject and to recognize face and expression simultaneously. Abboud and Davoine [11] used symmetric and asymmetric bilinear models on AAM features to perform both facial expression recognition and synthesis. Mpiperis et al. [12] decoupled face and facial expression by bilinear models applied on an elastically deformable model. They used the resulting model to perform face and expression recognition simultaneously. For each of these methods, the recognition was performed on the 6 basic expressions and the neutral face, but it was never mentioned how the system performs with blended expressions.

AU representations, widely used by psychologists, do not seem well suited to computer vision, since AU recognition still leads to poor results and mixed AUs remain a matter of concern. Representations of the expressions in a unified manner seem to perform better. In this paper, we propose a representation that uses the distortions of the whole face. Separating expression and identity is then the key problem. Manifold learning methods need a high density of training data to cover the whole expression space. That is why we preferred a method based on bilinear models. Even if bilinear models have already been used for expression analysis, tests were limited to the 6 basic expressions. Here we propose to analyse how the system behaves with expressions not included in the training database, and to improve bilinear models by using person-specific appearance vectors. Finally, contrary to the other methods that deal with blended expressions, we measure the relevancy of the representation we propose by computing the recognition rate of blended unknown expressions.

The main contribution is the application of asymmetric bilinear decomposition on person-specific appearance parameters to uniquely characterize non-prototypic expressions. The originality of the approach is to determine a unique signature of a blended expression via a bilinear model learned on a limited number of expressions and to give meaning to the components of this signature. Another important contribution of this work is the discussion of the impact of texture on expression characterization. The particularity of the approach is the use of person-specific appearance models, which limits this impact.

The rest of the paper is organized as follows. In Section 2, we briefly justify the importance of moving from a general appearance space to an expression space. Section 3 presents the learning phase of the asymmetric bilinear models. Section 4 describes the signature of a blended expression. Section 5 demonstrates the relevance of the system and Section 6 concludes the paper.

2. APPEARANCE BASED METHODS (ABM AND DABM)

This section justifies the importance of moving from an appearance space to an expression space by analyzing the results of blended expression identification with two traditional methods. The first one (Appearance Based Method, ABM) is based directly on the active appearance vectors. The second one (Differential Appearance Based Method, DABM [7]) is based on differential active appearance vectors, that is to say, the appearance vectors of the subjects are aligned by subtracting the neutral expression vector of the subject.

To obtain comparable appearance vectors for the expressions of different subjects, we use a generic active appearance model (AAM) [4], learned on a few expressions displayed by the subjects of the test set. The testing phase is performed on the appearance vectors of blended expressions, different from the ones used for the learning phase of the model (see process in figure 2).

To analyze the relevance of these vectors, the recognition task is performed by a basic algorithm:

1. we compute, for each expression of a given subject, the closest expression (nearest neighbor) of each of the other subjects (leave-one-subject-out method);

2. we determine the recognized blended expression by a voting algorithm;

3. we then compute the recognition rate.

Fig. 1. Similar expressions displayed by different subjects (learning-phase and test-phase panels).

As we will see in section 5, the ABM method gives a low recognition rate, showing that the appearance vectors are not directly comparable across different subjects. Indeed, they carry both information about the expression and about the subject's morphology. The DABM method gives better results. Its purpose is to remove the influence of a face's morphology from its appearance vectors, thus making the appearance space a purely expression description space. As intuition suggests, removing non-expressional components considerably improves recognition results. In the following, we propose a new parameterization that further improves the separation of identity and expression.
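The three-step recognition procedure can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the array layout (one vector per subject and expression), the Euclidean distance and the majority-vote tie-breaking are our assumptions.

```python
import numpy as np

def recognize_expressions(signatures):
    """Leave-one-subject-out nearest-neighbor recognition with voting.

    signatures: array of shape (P, E, D) -- P subjects, E test
    expressions, each represented by a D-dimensional feature vector
    (hypothetical layout). Returns the mean recognition rate.
    """
    P, E, _ = signatures.shape
    correct = 0
    for p in range(P):                      # subject under test
        for e in range(E):                  # its expression e
            votes = []
            for q in range(P):              # every other subject
                if q == p:
                    continue
                # step 1: nearest expression of subject q (Euclidean)
                d = np.linalg.norm(signatures[q] - signatures[p, e], axis=1)
                votes.append(int(np.argmin(d)))
            # step 2: majority vote over the other subjects
            predicted = np.bincount(votes).argmax()
            # step 3: accumulate the recognition rate
            correct += (predicted == e)
    return correct / (P * E)
```

The same routine can then be run on raw appearance vectors (ABM), differential vectors (DABM) or bilinear signatures, which is what makes the comparison in section 5 possible.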

3. ASYMMETRIC BILINEAR DECOMPOSITION

To separate the expression and the morphology, we propose to use an asymmetric bilinear decomposition on the appearance parameters [13]. Given a corpus of E known facial expressions of P subjects, we want to decompose the appearance parameters y_pe of dimension K into a signature of the expression b_e of dimension E that is common to all the subjects and a linear mapping specific to the subject W_p of dimension K×E:

    y_pe = W_p . b_e    (1)

or in matrix form:

    Y = W . B    (2)

where Y is of the form

    Y = [ y_11  y_12  ...  y_1E
          y_21  y_22  ...  y_2E
           ...   ...  ...   ...
          y_P1  y_P2  ...  y_PE ]    (3)

and W and B:

    W . B = [ W_1
              W_2
              ...
              W_P ] . ( b_1  b_2  ...  b_E )    (4)

The singular value decomposition (SVD) of the matrix Y can solve this problem. By SVD:

    Y = U . S . V^t    (5)

W is then given by U.S and B by V^t. Each vector b_e, e = 1..E, of B is the signature of the expression y_pe. It is the same for all subjects in the training set (p = 1..P).
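Under the notation above, the learning phase amounts to a single SVD of the stacked matrix Y. The following NumPy sketch illustrates this; the data layout and function name are ours, not the paper's:

```python
import numpy as np

def learn_bilinear_model(Y, P, K, E):
    """Learn the asymmetric bilinear model Y = W.B by SVD (eqs. 2-5).

    Y: (P*K, E) matrix whose p-th block of K rows stacks the appearance
    vectors y_pe of subject p for the E training expressions.
    Returns the per-subject mappings W_p (each K x E) and the signature
    matrix B (E x E), whose e-th column b_e is shared by all subjects.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    W = U * s                                   # W = U.S, shape (P*K, E)
    B = Vt                                      # B = V^t, shape (E, E)
    W_blocks = [W[p * K:(p + 1) * K] for p in range(P)]
    return W_blocks, B
```

By construction the stacked W times B reconstructs Y exactly on the training set, and B has orthonormal rows, which is the property exploited by equation (7) below.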

4. BLENDED UNKNOWN EXPRESSION ANALYSISBY BILINEAR DECOMPOSITION (BDM AND SBDM)

This section describes the process used to compute the signature of an unknown blended expression of a known subject.

Learning phase
The learning phase is achieved by the asymmetric bilinear decomposition presented in section 3 on E known similar facial expressions of P subjects.

Analysis phase
For one subject p_i of the learning base, we want to analyze an unknown blended expression y_pie (different from the E expressions of the learning phase). The signature b_e of the expression y_pie is estimated with the matrix W_pi of the learning model.

    b_e = (W_pi^t . W_pi)^-1 . W_pi^t . y_pie    (6)
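Equation (6) is the ordinary least-squares estimate of the signature; in code it can be obtained with a least-squares solver rather than by forming the normal equations explicitly. A hedged sketch (function name is ours):

```python
import numpy as np

def expression_signature(W_p, y):
    """Estimate the signature b_e of an unseen expression y of a known
    subject from its learned mapping W_p (eq. 6).

    np.linalg.lstsq returns the same solution as
    (W_p^t . W_p)^-1 . W_p^t . y, but in a numerically stabler way.
    """
    b, *_ = np.linalg.lstsq(W_p, y, rcond=None)
    return b
```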

The bilinear methods
When the computation of the learning and analysis phases is made on the appearance parameters of a generic AAM, we call the method BDM (Bilinear Decomposition Method). When it is performed on the Diff-AAM parameters, we call the method BDDM (Bilinear Decomposition Method on Differential parameters). As we only use the linearity property of the expressions, we can apply the asymmetric bilinear decomposition on person-specific appearance parameters. For each subject, we then compute a specific appearance model (AAM) learned on some expressions of the subject (the same expressions as for the other methods). When the computation is made on the appearance parameters of a person-specific AAM, we call the method SBDM (Bilinear Decomposition on person-Specific parameters). When it is performed on the Diff-AAM parameters, we call the method SBDDM (Bilinear



Fig. 2. Overview of the methods compared in this paper (ABM: appearance based method, DABM: differential appearance based method, BDM: bilinear decomposition, BDDM: bilinear decomposition on differential parameters, SBDM: bilinear decomposition on person-specific parameters, SBDDM: bilinear decomposition on person-specific differential parameters).

Method        ABM   DABM  BDM   BDDM  SBDM  SBDDM
Shape         36.2  54.1  57.9  58.7  57.7  59.1
Shape+Text.   25.4  34.0  32.1  32.5  59.9  57.3

Table 1. Mean recognition rate (%) of 14 unknown expressions on known subjects according to the type of appearance parameters (shape or shape+texture).

Decomposition on person-Specific Differential parameters). Figure 2 shows the link between the methods.

A meaningful signature
The matrix B is by construction a matrix with orthonormal vectors and can therefore be interpreted as an orthonormal basis. Each signature can then be characterized in this basis in the form of a relative signature b' with respect to the E expressions used in the learning phase. By rotation, we have:

    b' = B^t . b    (7)

Thus, the relative parameters b' can be interpreted as a mixture of several known expressions.
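As a toy illustration of equation (7): if an unseen expression is a linear blend of two training expressions, the relative signature recovers the blend weights. The basis and weights below are made up for illustration only:

```python
import numpy as np

# Hypothetical orthonormal signature basis B (E = 4 training expressions);
# any orthogonal matrix will do for the illustration.
B = np.linalg.qr(np.random.default_rng(1).standard_normal((4, 4)))[0]

# An unseen expression blending training expressions 1 and 3 (70% / 30%):
b = 0.7 * B[:, 0] + 0.3 * B[:, 2]

# Relative signature b' = B^t . b exposes the blend weights directly.
b_rel = B.T @ b                    # -> [0.7, 0.0, 0.3, 0.0]
```

This is exactly the "highest components" reading used to interpret the signatures in figure 4.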

5. EXPERIMENTAL RESULTS

This section compares the signatures of 14 blended expressions with the appearance based methods ABM and DABM (section 2) and with the bilinear decomposition methods BDM, BDDM, SBDM and SBDDM (section 4).

Database
The experiments were performed on a public database1 containing 22 blended similar expressions plus neutral on 17 subjects aged between 20 and 55 (see figure 1). About 30% of the subjects are female. Most subjects are Caucasian. Some subjects wear glasses and others have beards. They

1 Database available at http://www.rennes.supelec.fr/immemo/

were shown 22 pictures of expressive faces and were asked to reproduce the displayed expressions as closely as possible. Having 22 expressions gives enough expressions both to build the expression model and to test it. The expressions have also been manually annotated with one of the 6 basic emotions.

Learning and recognition phase
The models were trained on 8 expressions plus the neutral face of the 17 subjects. The other 14 remaining blended expressions were used to test the models (see figure 1). To measure and compare the relevance of the signatures, the same basic recognition algorithm (described in section 2) is used for all the methods.

Comparison of the methods
Table 1 shows the best recognition rates for each method on shape vectors. As expected, the ABM method gives the worst results. Indeed, the appearance parameters include expression information but also the subjects' morphology. Bilinear decomposition methods give better results than DABM. This can be interpreted by the fact that the differential appearance vectors still contain a significant amount of information about identity, such as how the subject smiles (crescent smile or flat smile). In contrast, bilinear models learn, from the expressions used in the learning phase, how each subject achieves the expression. This type of deformation is then specific to the subject and is used for blended expression recognition. We note that the bilinear decomposition applied to the differential appearance vectors gives slightly better results, but this difference is not significant enough to be relevant.

Shape vs. texture
The tests were performed on shape vectors only and on shape and texture vectors. Figure 3 shows, for each method, the recognition rate depending on the size of the vectors and table 1 the best recognition rate. We note that for all methods performed on vectors from person-independent appearance models, adding texture information significantly reduces the


[Figure 3: three panels (ABM vs. DABM, BDM vs. BDDM, SBDM vs. SBDDM) plotting the mean recognition rate (%) against the number of appearance parameters, for shape and shape+texture features.]

Fig. 3. Comparison of ABM, DABM, BDM, BDDM, SBDM and SBDDM with or without texture information. Facial expression recognition rate on 14 unknown expressions of known subjects according to the size of the appearance vectors. In dotted line, methods on differential appearance vectors; in black, on shape; in red, on shape and texture.

[Figure 4: six bar plots of signature components (expressions 1 to 8 on the x-axis, component values between −1 and 1).]

Fig. 4. Meaning of the signature of an expression. On the left, two known expressions (smile, expression 4, and surprise, expression 5) plus an unknown expression (a mixture of smile and surprise, expression 12). On the right, two known expressions (anger with low intensity, expression 1, and with high intensity, expression 2) plus an unknown expression (anger with middle intensity, expression 9).

[Figure 5: confusion matrix, Truth vs. Prediction, over Anger, Disgust, Joy, Fear, Surprise and Sadness.]

Fig. 5. Confusion matrix of the best recognition of 14 blended expressions displayed by 17 subjects (DABM, shape+texture features, 6 parameters). The darker a box is, the higher the recognition rate.

recognition rates. DABM and ABM results can be compared to the work of Cheon & Kim [7]. Contrary to our results, they found an improvement in recognition rate when the texture information was added. This can be explained by the fact that the database we used contains a greater variety of subjects than theirs (which contained only Korean people), and texture vectors thus contain too much identity information to be relevant and comparable between subjects. One might expect that the bilinear methods would eliminate this difference and would give similar results whether applied to shape vectors or to shape and texture vectors. This is not what is observed for bilinear methods on person-independent appearance vectors. In contrast, the use of bilinear models on person-specific appearance vectors gives similarly good results. Indeed, the texture variations are mainly due to identity variations in person-independent models, whereas they are due to expression variations in person-specific models.

Expression space dimension
Methods based on bilinear decomposition determine in advance the size of the expression signature (here 8 or 9), contrary to the appearance based methods, whose results often require a large number of parameters. Although in our tests the best recognition rate with the DABM method is obtained for a parameter size of 10, it is more common to need up to 20 or 30 parameters to get satisfying results [11]; this value depends greatly on the learning database.

Interpretation of the signature of an expression
Figure 4 shows the relative signatures of different expressions obtained by equation 7. In each case, the first two lines show the signatures of expressions of the learning phase; the last line shows the signature of a blended expression of the test phase (either a mixture of different expressions, or an expression with a different intensity). We can notice that the observed mixture in the expression is reflected in the parameters of the signature (highest components).

Confusion
Figure 5 shows the confusion matrix on the signatures with DABM. We can notice that the confusion mostly appears


when the expression corresponds to the same kind of emotion. For some expressions, we experienced the classical confusion between Anger & Disgust and between Fear & Surprise.

6. CONCLUSION

This paper has presented methods for blended expression analysis based on an asymmetric bilinear decomposition that efficiently separates expression and identity. For each blended unknown expression, the method computes a signature that uniquely characterizes the expression. This signature can thus be used to recognize blended expressions with varying intensity. Contrary to appearance based methods, the dimension of the signature is small (8 parameters in our tests) and could thus be used for real-time applications. Moreover, bilinear models compute a signature that can be interpreted: the values of the components of a blended expression indicate the expressions of the learning phase that are mixed. Finally, the bilinear method applied on person-specific appearance models is robust to the type of features (shape and/or texture), because person-specific models reduce the variations of texture due to the subjects' morphology. The recognition rate of 59% may seem low and of no practical use. Yet, in a complete system, the expression is computed for each image of a video sequence and is integrated in time over a minimum period of 1 second (that is, around 30 images). The information is then accurate enough to detect the emotion of the subject. In the future, we will extend the method to unknown subjects.

7. REFERENCES

[1] P. Ekman, W. V. Friesen, and P. Ellsworth, Emotion in the Human Face, Cambridge University Press, New York, 1982.

[2] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer, "The first facial expression recognition and analysis challenge," in Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on. IEEE, 2011, pp. 921–926.

[3] B. Schuller, M. Valster, F. Eyben, R. Cowie, and M. Pantic, "AVEC 2012: the continuous audio/visual emotion challenge," in Proceedings of the 14th ACM International Conference on Multimodal Interaction. ACM, 2012, pp. 449–456.

[4] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, no. 6, pp. 681–685, 2001.

[5] P. Ekman and W. V. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement, Consulting Psychologists Press, Palo Alto, CA, 1978.

[6] H. Liu and P. Wu, "Comparison of methods for smile deceit detection by training AU6 and AU12 simultaneously," in Image Processing (ICIP), 2012 19th IEEE International Conference on. IEEE, 2012, pp. 1805–1808.

[7] Y. Cheon and D. Kim, "Natural facial expression recognition using differential-AAM and manifold learning," Pattern Recognition, vol. 42, no. 7, pp. 1340–1350, 2009.

[8] N. Stoiber, R. Seguier, and G. Breton, "Automatic design of a control interface for a synthetic face," in Proceedings of the 13th International Conference on Intelligent User Interfaces, 2009, pp. 207–216.

[9] Y. Chang, C. Hu, and M. Turk, "Probabilistic expression analysis on manifolds," in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. IEEE, 2004, vol. 2, pp. II-520.

[10] H. Wang and N. Ahuja, "Facial expression decomposition," in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 958–965.

[11] B. Abboud and F. Davoine, "Bilinear factorisation for facial expression analysis and synthesis," in Vision, Image and Signal Processing, IEE Proceedings, 2005, vol. 152, pp. 327–333.

[12] I. Mpiperis, S. Malassiotis, and M. G. Strintzis, "Bilinear elastically deformable models with application to 3D face and facial expression recognition," in Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, 2008.

[13] E. S. Chuang, F. Deshpande, and C. Bregler, "Facial expression space learning," in Computer Graphics and Applications, 2002. Proceedings. 10th Pacific Conference on, 2002, pp. 68–76.