
Multi-Instance Hidden Markov Model For Facial Expression Recognition

Chongliang Wu 1, Shangfei Wang* 1 and Qiang Ji 2
1 School of Computer Science and Technology, University of Science and Technology of China
2 ECSE Department, Rensselaer Polytechnic Institute
Email: {[email protected], [email protected], [email protected]}

Abstract— This paper presents a novel method for facial expression recognition that uses only sequence-level labels. Given facial image sequences containing multiple expression peaks, our method labels these sequences and identifies the expression peaks automatically. We formulate this weakly labeled expression recognition task as a multi-instance learning (MIL) problem. First, image sequences are clustered into multiple segments. After segmentation, the image sequences are regarded as bags in MIL and the segments in the bags are viewed as instances. Second, the bag data are used to train a discriminative classifier that combines multi-instance learning with discriminative Hidden Markov Model (HMM) learning. In our method, the HMM models the temporal variation within segments. We conducted experiments on the CK+ database and the UNBC-McMaster Shoulder Pain database. Experimental results on both databases show that our method can not only label the sequences effectively, but also locate the apex frames of multi-peak sequences. Moreover, the experiments demonstrate that our method outperforms the state of the art.

I. INTRODUCTION

Facial expression recognition has attracted increasing attention due to its wide applications in human-computer interaction. Current research on facial expression recognition can be divided into two categories: frame-based and sequence-based. The former usually recognizes expressions from the apex expression frame or from every frame. The latter analyzes expressions from facial image sequences. Most frame-based expression recognition research requires pre-identified apex facial images and their corresponding expression labels, or labels for each frame, while sequence-based work needs pre-segmented image sequences, usually from neutral images to apex images, and sequence-level labels. Since spontaneous facial expression videos can feature multi-peak expressions, manually identifying the apex frame and segmenting the expression sequence from the onset frame to the apex frame are very time consuming and error prone. Compared with expression annotation using pre-identified apex images or pre-segmented expression sequences, weakly labeled data in the form of sequence-level ground truth, without the onset, duration, and frequency of the facial expressions within the video, is much easier to annotate, and can be labeled with great accuracy. Thus, in this paper, we formulate sequence-based expression recognition as a multi-instance learning (MIL) problem, in which image sequences

* Shangfei Wang is the corresponding author.
This paper was supported by the National Natural Science Foundation of China (Grant Nos. 61175037, 61228304, 61473270).

are regarded as bags, and image segments of the sequences are viewed as instances in the bags. Multi-instance learning assumes that every positive bag contains at least one positive instance, and every negative bag contains only negative instances. Instead of an exact label for every instance, multi-instance learning algorithms only require bag-level annotations. Through multi-instance learning, we can obtain the temporal segments and their corresponding expression labels simultaneously with only sequence-level ground truth.
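The multi-instance assumption above fits in a few lines; the helper below is a hypothetical illustration of the bag-label rule, not code from the paper:

```python
# Sketch of the standard MIL assumption: a bag is positive iff it
# contains at least one positive instance; only this bag-level label
# is observed during training. `bag_label` is a hypothetical helper.

def bag_label(instance_labels):
    """Bag label under the MIL assumption: the max over instance labels."""
    return int(any(instance_labels))

# A sequence (bag) whose segments (instances) contain one expression peak:
assert bag_label([0, 0, 1, 0]) == 1   # positive bag: one positive segment
assert bag_label([0, 0, 0]) == 0      # negative bag: all segments negative
```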

To the best of our knowledge, there is only one prior work that recognizes expressions using a multi-instance method. Sikka et al. [1] proposed a multiple-segment multi-instance learning method for pain recognition and localization. First, facial expression sequences are clustered into several segments. Second, features are extracted from every frame within a segment and fused into one single feature vector. Third, the MilBoost framework is adopted for multi-instance learning. Experiments were conducted on the UNBC-McMaster Shoulder Pain database, demonstrating the advantages of their proposed approach. Although the fused feature vector in [1] captures temporal information to a certain extent, the classifier they used is static, and it hence fails to capture the temporal variation within a segment, which is crucial for expression analysis. Therefore, in this paper, we propose to use an HMM to model the temporal variation within a segment and develop a novel training process that combines discriminative HMM training with MIL.

Although much effort has been made to develop multi-instance learning algorithms, such as diverse density, Citation-kNN, ID3-MI, RIPPER-MI, and BP-MIP [2], only a few works consider temporal modeling for multi-instance learning. Yuksel [3] proposed a multiple instance hidden Markov model (MI-HMM) for landmine detection, learned via Metropolis-Hastings sampling. Although they introduced an HMM into the multi-instance framework to model the temporal information in time-sequence data, Metropolis-Hastings sampling, a Bayesian inference method, does not guarantee a correct posterior estimate if the class of distributions considered does not contain the true generator of the observed data. In addition, learning HMM parameters numerically via sampling may not converge, depending on the number of samples collected. Different from [3], we propose to learn the HMM analytically through gradient descent, which guarantees convergence. Furthermore, discriminative

978-1-4799-6026-2/15/$31.00 ©2015 IEEE


learning typically produces better classification results because the learning and testing criteria are the same.

Thus, in this paper, we formulate expression recognition as a multi-instance learning problem and propose a discriminative multi-instance HMM learning algorithm, in which HMMs are adopted to capture the dynamics within instances. Experiments on the Extended Cohn-Kanade (CK+) database [4] and the UNBC-McMaster Shoulder Pain Expression Archive database [5] demonstrate the effectiveness of our proposed approach.

II. MIL-HMM METHOD FOR EXPRESSION RECOGNITION

A. Method Framework

The framework of our approach is shown in Fig.1; it consists of image sequence segmentation and multi-instance HMM models. First, an image sequence is clustered into several segments, and the displacements of the facial feature points between the current frame and the previous frame are extracted as features. After that, a discriminative multi-instance HMM is proposed to capture the temporal information in the facial segments. We further propose to learn the HMM analytically through gradient descent. During testing, we first estimate the labels of the image segments using the learned HMMs, and then infer the label of the image sequence through multi-instance inference. Simultaneously, we obtain the expression peaks contained in the sequence. The details are described as follows.

B. Facial image sequence segmentation

Each facial image sequence is cut into several segments, which are regarded as instances in a bag. The normalized cuts (Ncut) algorithm [6] is adopted to cluster an image sequence into several segments. The whole sequence is represented as a weighted undirected graph G = (V, E), where V contains a node for every frame, and E consists of the links between every pair of nodes. The edge weight, W, is a function of the similarity between nodes r and s. The objective of the Ncut algorithm is to minimize the following function by removing edges between nodes:

Ncut(V_1, V_2) = cut(V_1, V_2) / assoc(V_1, V) + cut(V_1, V_2) / assoc(V_2, V)

where V_1, V_2 are two disjoint node sets of G satisfying V_1 ∪ V_2 = V and V_1 ∩ V_2 = ∅; cut(V_1, V_2) is the degree of dissimilarity between V_1 and V_2, computed as the total weight of the removed edges; assoc(V_1, V) is the total connection from nodes in V_1 to all nodes in the graph; similarly, assoc(V_2, V) denotes the total connection from nodes in V_2 to all nodes in the graph.

cut(V_1, V_2) = Σ_{u∈V_1, v∈V_2} W(u, v)

assoc(V_1, V) = Σ_{u∈V_1, t∈V} W(u, t)

Since segments consist of continuous frames, the weight between frame f^s_i and frame f^r_i of sequence i, i.e. W_i(r, s), should capture feature similarity as well as the time indices of the two frames, as shown in Eq.1:

W_i(r, s) = exp(-|dist(F(f^r_i), F(f^s_i)) / δ_f|^2) + exp(-|(t_r - t_s) / δ_t|^2)    (1)

where F(f) denotes the feature vector of frame f, and dist denotes the distance between vectors; here, we use the Euclidean distance. t is the relative time of a frame in the sequence. δ_f and δ_t are parameters that regularize the weight matrix.
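For illustration, the weight matrix of Eq.1 could be computed as follows. This is a sketch, not the authors' implementation; `ncut_weights` is a hypothetical name, `features` holds the per-frame feature vectors, and `times` the relative frame times:

```python
import numpy as np

def ncut_weights(features, times, delta_f, delta_t):
    """Pairwise weight matrix of Eq.1: feature similarity plus temporal
    proximity. `features` is a (T, d) array of per-frame feature vectors,
    `times` the relative time of each frame; delta_f, delta_t are the
    regularization parameters (assumed tuned)."""
    T = len(times)
    W = np.zeros((T, T))
    for r in range(T):
        for s in range(T):
            d_feat = np.linalg.norm(features[r] - features[s])
            W[r, s] = (np.exp(-(d_feat / delta_f) ** 2)
                       + np.exp(-((times[r] - times[s]) / delta_t) ** 2))
    return W
```

The resulting symmetric matrix W would then be handed to a normalized-cut solver (e.g. spectral clustering) to produce the segments.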

For the features, we use the facial landmarks provided by the databases. First, the face images are normalized according to the binocular positions. Then the displacements of the feature points between the current frame and the previous frame are used as the features.

C. Multi-instance HMM for expression recognition

Let D = {(S, Y)} denote facial image sequence data with only sequence-level expression labels, where S = {S_1, S_2, ..., S_N} is the set of image sequences, and Y = {Y_1, Y_2, ..., Y_N} is the corresponding set of sequence-level expression labels. Each image sequence (i.e., a bag) S_i may consist of several segments X_i = {X_i1, X_i2, ..., X_in}, which are regarded as instances in the bag; y_i = {y_i1, y_i2, ..., y_in} denotes the corresponding instance labels. During training, we only have the labels for the whole image sequences. The purpose of multi-instance learning is to maximize the conditional log-likelihood (CLL), as shown in Eq.2:

CLL(Θ) = Σ_{S_i, Y_i=1} log P(Y_i = 1 | S_i, Θ) + Σ_{S_i, Y_i=0} log P(Y_i = 0 | S_i, Θ)    (2)

where,

P(Y_i = 1 | S_i) = max_{j=1..n} { P(y_ij = 1 | X_ij) }    (3)

and, accordingly, P(Y_i = 0 | S_i) = 1 - P(Y_i = 1 | S_i). Here, P(y_ij = 1 | X_ij) denotes the posterior probability of an instance X_ij in S_i. In this way, the probability that a bag is positive depends on the maximum probability that an instance in it is positive.

Since the max function is not differentiable, the Noisy-Or (NOR) model is adopted as an approximation, as shown in Eq.4:

P(Y_i = 1 | S_i) = 1 - Π_{j=1..n} [1 - P(y_ij = 1 | X_ij)]    (4)


Fig. 1. The framework of our approach. First, the raw image sequences are divided into multiple segments. Then, with the facial point data, two HMMs are trained discriminatively under the MIL framework: HMM0 denotes the Hidden Markov Model trained for negative instances and HMM1 the HMM for positive instances. With the two trained HMMs, the label of every segment can be estimated, and the labels of the image sequences can be inferred under the multi-instance assumption.

Here, an HMM is adopted as the instance classifier P(y_ij | X_ij). As a generative model, the HMM provides the joint probability of all observed and hidden variables. Thus, the probability of the observed variables is shown in Eq.5:

P(X_ij | Θ) = Σ_{π_ij} P(π_ij) · P(X_ij | π_ij)
            = Σ_{π_ij} P(π_{1,ij}) · Π_{t=1..T-1} P(π_{t+1,ij} | π_{t,ij}) · Π_{t=1..T} P(X_{t,ij} | π_{t,ij})    (5)

where X_ij = {X_{1,ij}, X_{2,ij}, ..., X_{T,ij}} are the observed variables of the jth instance in the ith bag, π_ij = {π_{1,ij}, π_{2,ij}, ..., π_{T,ij}} are the hidden variables, and Θ are the parameters.

For the HMM, the local conditional probability distribution is parameterized with a conditional Gaussian, P(X_{t,ij} | π_{t,ij} = k) ~ N(μ_ijk, Σ_ijk). To ensure that the covariance matrix is positive semidefinite, Σ_ijk is parameterized as A_ijk = Σ_ijk^{1/2} [7]. For the transition probability θ_{a|b} = P(π_{t+1,ij} = a | π_{t,ij} = b), we introduce a transformed parameter β_{a|b} to incorporate the constraints θ_{a|b} ≥ 0 and Σ_a θ_{a|b} = 1, where

θ_{a|b} = e^{β_{a|b}} / Σ_{a'} e^{β_{a'|b}}.

In order to maximize the conditional log-likelihood in Eq.2, we transform the log-likelihood of the HMM into a conditional log-likelihood according to Bayes' theorem:

P(y_ij = c | X_ij) = P(X_ij | y_ij = c) P(y_ij = c) / Σ_{c'} P(X_ij | y_ij = c') P(y_ij = c')    (6)

where c ∈ {0, 1} and c' ∈ {0, 1}.

Thus, we train the HMM discriminatively for multi-instance learning using gradient descent, as shown in Eq.7:

∂CLL(Θ)/∂Θ ≈ Σ_{S_i, Y_i=1} Σ_{j=1..n} [P(Y_i = 0 | S_i) / P(Y_i = 1 | S_i)] · [P(y_ij = 1 | X_ij) / P(y_ij = 0 | X_ij)] · ∂ log P(y_ij = 1 | X_ij)/∂Θ
           + Σ_{S_i, Y_i=0} Σ_{j=1..n} ∂ log P(y_ij = 0 | X_ij)/∂Θ    (7)

The derivatives on the right-hand side of Eq.7, ∂ log P(y_ij = 1 | X_ij)/∂Θ and ∂ log P(y_ij = 0 | X_ij)/∂Θ, can be translated into ∂ log P(X_ij | y_ij = 1)/∂Θ and ∂ log P(X_ij | y_ij = 0)/∂Θ by applying Bayes' theorem as in Eq.6 [7], and can be expressed as follows:

∂ log P(y_ij = c | X_ij)/∂Θ_{c'} =
    [1 - P(y_ij = c | X_ij)] · ∂ log P(X_ij | y_ij = c)/∂Θ_c,        c' = c
    -P(y_ij = c' | X_ij) · ∂ log P(X_ij | y_ij = c')/∂Θ_{c'},        c' ≠ c
    (8)

where c ∈ {0, 1} and c' ∈ {0, 1}.

Therefore, we obtain the derivatives of log P(X_ij) with respect to the parameters μ_ijk, A_ijk and the transition parameters β, as shown in Eq.9, Eq.10 and Eq.11, respectively.

∂ log P(X_ij)/∂μ_ijk = Σ_{t=1..T} P(π_{t,ij} = k | X_ij) A_ijk^2 (X_{t,ij} - μ_ijk)    (9)

∂ log P(X_ij)/∂A_ijk = Σ_{t=1..T} P(π_{t,ij} = k | X_ij) · [A_ijk^{-1} - A_ijk (X_{t,ij} - μ_ijk)(X_{t,ij} - μ_ijk)^T]    (10)

where P(π_{t,ij} = k | X_ij) can be computed using the forward algorithm for the HMM.
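For illustration, the smoothed state posteriors P(π_t = k | X) can be obtained with the scaled forward-backward recursions. This is a generic HMM sketch under assumed conventions (theta[a, b] = P(π_{t+1} = a | π_t = b)), not the authors' implementation:

```python
import numpy as np

def state_posteriors(log_b, theta, pi0):
    """Smoothed posteriors P(pi_t = k | X) via scaled forward-backward.
    log_b[t, k] = log P(X_t | pi_t = k); theta[a, b] = P(pi_{t+1}=a | pi_t=b);
    pi0[k] = P(pi_1 = k). Per-step rescaling cancels in the final posterior."""
    T, K = log_b.shape
    b = np.exp(log_b - log_b.max(axis=1, keepdims=True))  # rescaled emissions
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi0 * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                        # forward pass
        alpha[t] = b[t] * (theta @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = theta.T @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

The transition posteriors P(π_t = b, π_{t+1} = a | X) needed for Eq.11 come from the same α/β quantities.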


Algorithm 1 Parameter Learning
Input: training bag set S with corresponding label set Y; every bag contains multiple instances.
Output: optimal HMM parameters Θ̂.
1. Parameter initialization: randomly assign a positive or negative label to every instance, and then learn the initial HMM parameters through the EM algorithm;
2. Compute the parameter derivatives according to Eq.7, Eq.9, Eq.10 and Eq.11, and adopt BFGS [8] for optimization;
3. Use the parameters returned by BFGS as the output.

∂ log P(X_ij)/∂β_{a'|b} =
    Σ_{t=1..T-1} P(π_{t,ij} = b, π_{t+1,ij} = a | X_ij) · [1 - P(π_{t+1,ij} = a | π_{t,ij} = b)],    a' = a
    Σ_{t=1..T-1} P(π_{t,ij} = b, π_{t+1,ij} = a | X_ij) · [-P(π_{t+1,ij} = a' | π_{t,ij} = b)],     a' ≠ a
    (11)

The detailed parameter learning procedure is listed in Algorithm 1.
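Steps 2-3 of Algorithm 1 amount to handing the CLL objective and its gradients to a BFGS routine. A minimal sketch using SciPy's optimizer, with hypothetical `neg_cll`/`neg_cll_grad` helpers standing in for Eq.2 and the gradients of Eqs.7-11:

```python
import numpy as np
from scipy.optimize import minimize

def train_mil_hmm(bags, labels, theta0, neg_cll, neg_cll_grad):
    """Steps 2-3 of Algorithm 1: minimize the negative conditional
    log-likelihood with BFGS, starting from the EM-initialized packed
    parameter vector theta0 of step 1. `neg_cll` and `neg_cll_grad`
    are hypothetical helpers, not defined in the paper."""
    res = minimize(neg_cll, theta0, args=(bags, labels),
                   jac=neg_cll_grad, method='BFGS')
    return res.x  # step 3: parameters returned by BFGS

# Smoke test with a stand-in quadratic objective (minimum at all-ones):
f = lambda th, bags, labels: float(np.sum((th - 1.0) ** 2))
g = lambda th, bags, labels: 2.0 * (th - 1.0)
theta_hat = train_mil_hmm([], [], np.zeros(3), f, g)
```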

D. Testing

During testing, an image sequence S_i is first clustered into several segments (X_i1, ..., X_in) using the Ncut algorithm described in Section II-B. After that, the likelihood P(X_ij | y_ij) of every segment is obtained using the learned HMMs according to Eq.5. Thus, we obtain the expression category of every segment according to Eq.12:

y_ij = 1 if P(X_ij | y_ij = 1) > P(X_ij | y_ij = 0), and y_ij = 0 otherwise    (12)

Then, the label of the image sequence is inferred using Eq.3 or Eq.4.
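The full test-time rule (Eq.12 per segment, then the MIL rule for the sequence) might look like the following sketch, with hypothetical names and toy log-likelihoods:

```python
import numpy as np

def classify_bag(loglik_pos, loglik_neg):
    """Eq.12 per segment, then the MIL rule for the sequence label.
    loglik_pos[j] and loglik_neg[j] are log P(X_ij | y_ij = 1) and
    log P(X_ij | y_ij = 0) from the two trained HMMs (Eq.5)."""
    inst = (np.asarray(loglik_pos) > np.asarray(loglik_neg)).astype(int)
    return int(inst.max()), inst   # bag label, per-segment labels

# One segment scores higher under the positive HMM -> positive sequence:
bag_y, seg_y = classify_bag([-10.0, -3.0, -9.0], [-5.0, -8.0, -4.0])
assert bag_y == 1 and list(seg_y) == [0, 1, 0]
```

The per-segment labels are what provide the localization of the expression peaks within the sequence.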

III. EXPERIMENTS AND RESULTS

A. Experimental conditions

We conducted multi-instance expression recognition experiments on two benchmark databases, i.e. the Extended Cohn-Kanade (CK+) database [4] and the UNBC-McMaster Shoulder Pain Expression Archive database [5].

The CK+ database is a posed expression database consisting of 327 facial expression image sequences, each from a neutral frame to an apex frame. The sequence lengths range from 5 to 70 frames, and each expression sequence belongs to one of seven expression categories (i.e. angry, contempt, disgust, fear, happiness, sadness, surprise). Table I shows the distribution of image sequences over the seven expressions. For this database, a whole sequence includes only one instance. Therefore, we extend the image sequences to simulate multiple instances. For sequence extension, the facial expression should change as naturally as possible. Specifically, we add several neutral frames before the neutral frame in the sequence. Then, we let the expression after the peak (the last frame) vanish naturally by copying the frames before the peak in reverse order. Thus, we obtain positive bags for each expression. For the negative bags, we generate negative sequences using neutral frames from

TABLE I
SEQUENCE DISTRIBUTION OVER EXPRESSIONS IN THE CK+ DATABASE

Expression   Number
angry        45
contempt     18
disgust      59
fear         25
happiness    69
sadness      28
surprise     83
Total        327

TABLE II
EXPRESSION RECOGNITION RESULTS ON THE CK+ DATABASE USING OUR METHOD

Expression   Accuracy   F1-score
angry        100%       1
contempt     98.15%     0.9714
disgust      97.18%     0.9558
fear         97.33%     0.9583
happiness    99.52%     0.9927
sadness      100%       1
surprise     97.59%     0.9652
average      98.54%     0.9776

the corresponding positive sequences by simple duplication; negative sequences contain only neutral frames.

The UNBC-McMaster Shoulder Pain database is a spontaneous pain expression database, including 200 image sequences from 25 subjects. The lengths of the sequences vary from 60 to 600 frames. An Observed Pain Intensity (OPI) rating, ranging from 0 (no pain observed) to 5 (extreme pain observed), is assigned by an independent observer. In our experiments, the sequences with OPI ≥ 3 are labeled as positive ('pain') bags, and the sequences with OPI = 0 are labeled as negative ('no pain') bags [1]. Thus, we obtain 57 positive bags and 92 negative bags.
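The OPI-to-bag-label mapping described above is straightforward; the following is a hypothetical preprocessing sketch (sequences with intermediate ratings are discarded, following [1]):

```python
# Hypothetical preprocessing helper: map sequence-level OPI ratings to
# MIL bag labels (OPI >= 3 -> 'pain' = 1, OPI == 0 -> 'no pain' = 0,
# intermediate ratings discarded).
def opi_to_bags(sequences, opi):
    bags = [(s, 1) for s, r in zip(sequences, opi) if r >= 3]
    bags += [(s, 0) for s, r in zip(sequences, opi) if r == 0]
    return bags

bags = opi_to_bags(['a', 'b', 'c', 'd'], [4, 1, 0, 3])
assert [y for _, y in bags] == [1, 1, 0]   # 'b' (OPI 1) is discarded
```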

Both databases provide feature points for each frame, which are used as the features in our experiments. The leave-one-subject-out cross-validation protocol is adopted.
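Leave-one-subject-out cross-validation can be sketched without any library support; `loso_splits` below is a minimal stand-in, assuming `subjects` gives the subject id of each bag:

```python
import numpy as np

def loso_splits(subjects):
    """Yield (train_idx, test_idx) index pairs, holding out all bags of
    one subject per fold, so no subject appears in both train and test."""
    subjects = np.asarray(subjects)
    for s in np.unique(subjects):
        yield np.where(subjects != s)[0], np.where(subjects == s)[0]

# Six bags from three subjects -> three folds, one held-out subject each:
folds = list(loso_splits([0, 0, 1, 1, 2, 2]))
assert len(folds) == 3
```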

B. Experimental Result and Analysis On CK+ Database

The expression recognition results on the CK+ database are shown in Table II, from which we can see that the average accuracy is 98.54%. Furthermore, the recognition performance for anger and sadness reaches 100%. This demonstrates the effectiveness of our proposed approach.

TABLE III
EXPERIMENTAL RESULTS ON THE UNBC-MCMASTER SHOULDER PAIN DATABASE

                          Classified as
                     positive    negative
Confusion  positive     39          18
Matrix     negative      4          88

Accuracy   85.23%
F1-score   0.78

(a) Fear expression
(b) Happy expression
Fig. 2. Expression localization examples on the CK+ database

TABLE IV
COMPARISON BETWEEN OUR METHOD AND OTHER PAIN DETECTION ALGORITHMS

Algorithm                      Accuracy   # subjects   # sequences
MIL-HMM                        85.23%     25           149
MS-MIL (K. Sikka et al. [1])   83.7%      23           147
P. Lucey et al. [9]            80.99%     20           142
A.B. Ashraf et al. [10]        68.31%     20           142
A.B. Ashraf et al. [10]        81.21%     21           84

In addition to expression recognition, our method can

locate the expression segments contained in the positive image sequences. The posterior probabilities of the segments indicate the scores of the segments being positive. Since the segmentation process clusters similar frames together, the score of each frame in a segment is the same as that of the segment. Fig.2 shows two samples obtained by our multi-instance HMM. According to the criterion we used for image sequence extension on the CK+ database, each positive sequence has only one positive instance, located in the middle. From these two samples, we can see that the inferred scores are consistent with the ground truth.

C. Experimental Result and Analysis On UNBC-McMaster Shoulder Pain Database

Expression recognition results on the UNBC-McMaster Shoulder Pain database are shown in Table III. From the table, we can see that our approach achieves an accuracy of 85.23% and an F1-score of 0.78, demonstrating its effectiveness. From the confusion matrix, we can see that negative sequences are easier to recognize than positive ones, which is probably due to the imbalanced data.

Similar to the CK+ experiments, two examples are shown in Fig.3 to demonstrate the expression localization capability of our method. The green line is the normalized PSPI value, which is a frame-level ground truth.

From Fig.3(a), we can see that our method locates multi-peak pain segments, such as the frames around 80 and 280. From Fig.3(b), we can see that the predicted posterior probabilities locate some ambiguous segments correctly, such as the frames around 175 and 450.

D. Comparison with related work

To date, there is only one work that recognizes expressions using multi-instance learning and reports experiments on the UNBC-McMaster Shoulder Pain database. Therefore, we compare our work with related work only on this database, as shown in Table IV.

From the table, we can see that both our approach and Sikka et al.'s outperform [9] and [10]. Those two works formulated pain recognition as a traditional machine learning problem and used frame-based expression recognition, with support vector machines (SVMs) adopted as classifiers to assign labels to each frame. This demonstrates the superiority of multi-instance learning for facial expression recognition with weakly labeled data. Compared with Sikka et al.'s approach, our method achieves better performance. Although Sikka et al. also formulate expression recognition as a multi-instance learning problem, the classifier they used is static and cannot fully capture the dynamics embedded in expression sequences, whereas our approach adopts an HMM to model the temporal variation within segments.

IV. CONCLUSION

In this paper, a method for expression recognition using only weakly labeled image sequence data, combining the MIL framework with a discriminative HMM learning approach, is proposed. We first demonstrate


(a)

(b)

Fig. 3. Expression localization examples on UNBC-McMaster Shoulder pain database

how the multi-instance bags are created: we divide every facial image sequence into several segments using the Ncut clustering algorithm, so that image sequences can be regarded as bags and the segments in the bags are viewed as instances. Then, the details of the MIL-HMM method for expression recognition are presented. We trained a classifier under the MIL framework for expression recognition, applying an HMM to better capture the temporal variation contained in the segments. During training, we learn the parameters by combining multi-instance learning with a discriminative HMM training algorithm. We conducted experiments on the CK+ database and the UNBC-McMaster Shoulder Pain database. Experimental results on both databases show that our method can not only label the sequences effectively, but also locate the apex frames of multi-peak sequences. Furthermore, the results demonstrate that our method outperforms both traditional machine learning methods and other multi-instance learning based methods.

REFERENCES

[1] Sikka, K., Dhall, A., Bartlett, M.: Weakly supervised pain localization using multiple instance learning. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, IEEE (2013) 1-8

[2] Babenko, B.: Multiple instance learning: algorithms and applications (2008)

[3] Yuksel, S.E., Bolton, J., Gader, P.D.: Landmine detection with multiple instance hidden Markov models. In: Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, IEEE (2012) 1-6

[4] Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, IEEE (2010) 94-101

[5] Lucey, P., Cohn, J.F., Prkachin, K.M., Solomon, P.E., Matthews, I.: Painful data: The UNBC-McMaster shoulder pain expression archive database. In: Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, IEEE (2011) 57-64

[6] Shi, J., Malik, J.: Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22 (2000) 888-905

[7] Wang, X., Ji, Q.: Learning dynamic Bayesian network discriminatively for human activity recognition. In: Pattern Recognition (ICPR), 2012 21st International Conference on, IEEE (2012) 3553-3556

[8] Avriel, M.: Nonlinear programming: analysis and methods. Courier Dover Publications (2003)

[9] Lucey, P., Howlett, J., Cohn, J.F., Lucey, S., Sridharan, S., Ambadar, Z.: Improving pain recognition through better utilisation of temporal information. In: AVSP (2008) 167-172

[10] Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K.M., Solomon, P.E.: The painful face - pain expression recognition using active appearance models. Image and Vision Computing 27 (2009) 1788-1796