
Emotion Recognition from Human Eye Expression

S. R. Vinotha (1,a), R. Arun (2,b) and T. Arun (3,c)

1 Assistant Professor, Dhanalakshmi College of Engineering
2 Student, Dhanalakshmi College of Engineering
3 Student, Dhanalakshmi College of Engineering

a [email protected], b [email protected], c [email protected]

Abstract – Facial expressions play an essential role in social interactions with other human beings, delivering rich information about their emotions. The most crucial feature of human interaction that grants naturalism to the process is our ability to infer the emotional states of others. Our goal is to categorize the various human emotions from their eye expressions. The proposed system presents a human emotion recognition system that analyzes the human eye region in video sequences. From the frames of the video stream, the human eyes are extracted using the well-known Canny edge operator and classified using a non-linear Support Vector Machine (SVM) classifier. Finally, we use a standard learning tool, the Hidden Markov Model (HMM), to recognize the emotions from the human eye expressions.

Keywords – Human emotions, Canny edge operator, Support Vector Machine (SVM), Hidden Markov Model (HMM)

I. INTRODUCTION

Human emotion recognition is an important component of efficient human-computer interaction. It plays a critical role in communication, allowing people to express themselves beyond the verbal domain. Analysis of emotions from human eye expressions involves the detection and categorization of various human emotions or states of mind. For example, in security and surveillance, analysts can predict an offender's or criminal's behaviour by analysing images of their faces from the frames of a video sequence.

The analysis of human emotions can be applied in a variety of application domains, such as video surveillance and human-computer interaction systems. In some cases, the results of such analysis can be applied to identify and categorize the various human emotions automatically from videos.

The six primary emotions are fear, joy, love, sadness, surprise, and anger. Our method combines a feature extraction technique to extract the eyes, a support vector machine (SVM) classifier, and an HMM to build a human emotion recognition system.

The remainder of the paper is organized as follows: Section II reviews related work, Section III presents the proposed methodology, Section IV describes data collection, Section V reports experimental results, and Section VI concludes the work.

II. RELATED WORK

In the last two decades, many approaches to human emotion recognition have been proposed. Kumar et al. [1] developed a frequency-domain feature of face images for recognition, proposing a cross-correlation method based on the fast Fourier transform. Savvides et al. [2] further extended correlation filters and developed a hybrid PCA-correlation filter called "Corefaces" that performs robust illumination-tolerant face recognition. Picard et al. [3] stressed the significance of human emotions and the analysis of their affective psychological states.

Ekman et al. [4], [5] analyzed six fundamental facial expressions and encoded them into the so-called facial action coding system (FACS). FACS enumerates the action units (AUs) of a face that cause facial movements. Kozma et al. [6] used 17 features; however, 5 of them are specific to their scenario, since they are calculated from eye movements over circles formed of images.

This survey discloses that many earlier approaches use more than two features to identify facial expressions or human emotions. Hence, in this paper we propose an efficient way of recognizing emotions using only the human eye expressions.

III. METHODOLOGY

The detailed components of the system are shown in Fig. 1.

Fig.1. Functional flow diagram of the system

An image representing a set of frames is preprocessed to obtain a noise-free image. The noise-free image is edge-detected using the Canny edge operator. Using the feature extraction process, the eye regions are extracted from the resulting edge-detected image. The extracted eye regions are classified using the SVM classifier. Finally, the corresponding emotions are recognized and categorized using the HMM technique.

A. Pre-processing

The initial stage of the human emotion recognition system is the pre-processing of the image representing the set of frames.

Fig.2. Before and after pre-processing

Pre-processing images commonly involves removing low-frequency background noise, normalizing the intensity of individual particle images, removing reflections, and masking portions of images. Image pre-processing is the technique of enhancing data images prior to computational processing. Examples include normalization, edge filters, soft focus, selective focus, user-specific filters, static/dynamic binarisation, image plane separation, and binning.

The pre-processing techniques used here are as follows:

Median filter: The median filter is a nonlinear digital filtering technique, often used to remove noise. Such noise reduction is a typical pre-processing step to improve the results of later processing. The main idea of the median filter is to run through the signal entry by entry, replacing each entry with the median of neighbouring entries. The pattern of neighbours is called the "window", which slides, entry by entry, over the entire signal.

Smoothing: To smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures and rapid phenomena. In smoothing, the data points of a signal are modified so that individual points higher than the adjacent points (presumably because of noise) are reduced, and points that are lower than the adjacent points are increased, leading to a smoother signal.
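The paper does not include code; as a minimal sketch only, the following Python/OpenCV snippet applies the two pre-processing steps described above, a median filter followed by Gaussian smoothing, to one frame. The file name and kernel sizes are assumptions.

```python
import cv2

# Load one frame from the video stream (path is a placeholder).
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Median filter: replace each pixel with the median of its 5x5 window.
denoised = cv2.medianBlur(frame, 5)

# Smoothing: a Gaussian blur damps remaining fine-scale noise.
smoothed = cv2.GaussianBlur(denoised, (5, 5), sigmaX=1.0)

cv2.imwrite("preprocessed.png", smoothed)
```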


B. Edge Detection

The edge detection method finds the edges in the given image and returns a binary matrix in which edge pixels have value 1 and all other pixels are 0. The input image is a non-sparse numeric array, while the output image is of class logical, i.e., a matrix of 0s and 1s. The output of edge detection is an edge image or edge map, in which the value of each pixel reflects how strongly the corresponding pixel in the original image meets the requirements of being an edge pixel.

Fig.3. Edge detection

Here the face boundary is extracted using the well-known Canny edge operator. The Canny edge detection algorithm is known to many as the optimal edge detector. It is a multi-stage algorithm that detects a wide range of edges in images, as in Fig. 3. The stages of the Canny algorithm are:

Noise reduction: Because the Canny edge detector is susceptible to noise present in raw, unprocessed image data, it uses a filter based on a Gaussian (bell curve): the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original that is not affected by a single noisy pixel to any significant degree.

Finding the intensity gradient of the image: An edge in an image may point in a variety of directions, so the Canny algorithm uses four filters to detect horizontal, vertical, and diagonal edges in the blurred image. An edge detection operator (Roberts, Prewitt, or Sobel, for example) returns a value for the first derivative in the horizontal direction (Gx) and the vertical direction (Gy). From these, the edge gradient magnitude and direction can be determined as

G = √(Gx² + Gy²), θ = arctan(Gy / Gx).

Non-maximum suppression: Given estimates of the image gradients, a search is carried out to determine whether the gradient magnitude assumes a local maximum in the gradient direction.

Tracing edges through the image with hysteresis thresholding: Thresholding with hysteresis requires two thresholds, high and low. We begin by applying the high threshold, which marks out the edges that are fairly certain to be genuine. Starting from these, and using the directional information derived earlier, edges are traced through the image. While tracing an edge, we apply the lower threshold, allowing us to trace faint sections of edges as long as we have a starting point.

Once this process is complete, we have a binary image in which each pixel is marked as either an edge pixel or a non-edge pixel. As complementary output from the edge-tracing step, the binary edge map obtained in this way can also be treated as a set of edge curves, which after further processing can be represented as polygons in the image domain.

Differential geometric formulation of the Canny edge detector: A more refined approach to obtaining edges with sub-pixel accuracy is differential edge detection, where the requirement of non-maximum suppression is formulated in terms of second- and third-order derivatives computed from a scale-space representation.

Variational-geometric formulation of the Haralick-Canny edge detector: A variational explanation of the main ingredient of the Canny edge detector, namely finding the zero crossings of the second derivative along the gradient direction, was shown to be the result of minimizing a Kronrod-Minkowski functional while maximizing the integral over the alignment of the edge with the gradient field.
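The experiments in Section V were run in MATLAB; purely as an illustration, the following Python/OpenCV sketch applies the staged Canny detector described above to a pre-processed frame. The two hysteresis thresholds are assumed values that would need tuning.

```python
import cv2

# Pre-processed (noise-reduced) grayscale frame from the previous stage.
img = cv2.imread("preprocessed.png", cv2.IMREAD_GRAYSCALE)

# cv2.Canny internally computes Sobel gradients, applies non-maximum
# suppression, and traces edges with hysteresis between the two thresholds.
edges = cv2.Canny(img, threshold1=50, threshold2=150)

# The result is a binary edge map: 255 on edge pixels, 0 elsewhere.
cv2.imwrite("edges.png", edges)
```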

C. Feature extraction

In a recent paper, Moriyama et al. [7] demonstrated that precise and detailed detection and feature extraction from the eye region is already possible.

Extraction of eyes: The eyes, the most dominant and reliable features of the face, provide a constant channel of communication. Once the rough face region is detected, as described above, an efficient feature-based method is applied sequentially to locate the rough regions of both eyes. Fig. 4 shows the processes of the proposed method.

Fig. 4. Detection of eye regions

The first step is to calculate the gradient image (b) of the rough face-region image (a). We then apply a horizontal projection to this gradient image. Since the eyes lie in the upper part of the face and the pixels near the eyes vary more in value than those in other parts of the face, the peak of this horizontal projection in the upper part gives the horizontal position of the eyes.

From this horizontal position and the total height of the face, we can easily mark out a horizontal strip (c) in which the eyes lie.

We then apply a vertical projection to all pixels in this horizontal strip (c); a peak of this projection is found near the vertical centre of the face image. The position of this vertical peak can be treated as the vertical centre of the face (d), because the area between the eyes is the brightest within the strip.

Fig. 5. Eye extraction by the feature-based method: intermediate results (a)-(g)


At the same time, a vertical projection is applied to the gradient image (b). There are two projection peaks near the right and left boundaries of the face image, which correspond to the right and left limits of the face (e). From these two vertical limit lines, the width of the face can easily be estimated.

Combining the results of (c), (d), and (e), we obtain an image segmented as in (f). Finally, based on the result of (f) and the estimated face width, the regions of both eyes can be marked out (g).
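The projections themselves are straightforward to express in code. The following is a hypothetical NumPy/OpenCV sketch of steps (a)-(d), not the authors' implementation; the function name and the band fraction are assumptions.

```python
import numpy as np
import cv2

def locate_eye_band(face: np.ndarray):
    """Estimate the eyes' horizontal strip and the face's vertical centre
    from projections of the gradient image, as described above."""
    # Gradient image (b): magnitude of Sobel derivatives.
    gx = cv2.Sobel(face, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(face, cv2.CV_32F, 0, 1)
    grad = cv2.magnitude(gx, gy)

    h, w = grad.shape
    # Horizontal projection: row sums; search the upper half of the face,
    # where the eyes lie, for the peak row.
    row_proj = grad.sum(axis=1)
    eye_row = int(np.argmax(row_proj[: h // 2]))

    # Horizontal strip (c) around the eye row; the band height is an
    # assumed fraction of the face height.
    band = int(0.12 * h)
    strip = face[max(0, eye_row - band): eye_row + band, :]

    # Vertical projection of the strip: the brightest column is taken as
    # the vertical centre of the face (d), between the two eyes.
    col_proj = strip.sum(axis=0)
    face_centre = int(np.argmax(col_proj))
    return eye_row, face_centre
```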

D. Classification (SVM classifier)

A Support Vector Machine (SVM) is a maximal-margin hyperplane in a feature space, built by using a kernel function on the input space. SVMs are a state-of-the-art classification technique for patterns that can be described by a finite set of characteristic features [8]. They have large application fields in text classification, face recognition, genomic classification, etc.

As a non-linear classifier, the SVM handles overlapping classes effectively. It has also achieved very good performance on many real-world classification problems. Finally, the SVM can deal with very high-dimensional feature vectors, which means that the feature vectors can be chosen without restrictive dimension limits.

Multi-class SVMs are usually implemented by combining several two-class SVMs. In each binary SVM, one class is labelled "1" and the others are labelled "-1". If there are M classes, the SVM constructs M binary classifiers by learning. During testing, each classifier produces a confidence coefficient, and the class with the maximum confidence coefficient is assigned to the test sample.
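Purely as an illustration (the paper names no library), the sketch below mirrors this one-vs-rest construction with scikit-learn: one binary SVM per class, with the maximum confidence coefficient deciding the label. The data here are random placeholders for eye-region feature vectors.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder data: 6 emotion classes labelled 1..6.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 64))
y = rng.integers(1, 7, size=120)

# One binary SVM per class: each treats its own class as "1" and the
# rest as "-1"; prediction takes the class with maximum confidence.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=10.0))
ovr.fit(X, y)

scores = ovr.decision_function(X[:3])   # per-class confidence coefficients
print(scores.argmax(axis=1))            # index of the winning class
```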

The SVM maps input vectors to a higher-dimensional space in which a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes. The assumption is that the larger the margin, or distance, between these parallel hyperplanes, the better the generalization error of the classifier will be [11]. We consider data points of the form

{(x1, y1), (x2, y2), (x3, y3), ..., (xn, yn)}

where yi = 1 or -1 is a constant denoting the class to which the point xi belongs, n is the number of samples, and each xi is a p-dimensional real vector. Scaling is important to guard against variables (attributes) with larger variance. We can view the training data by means of the dividing (or separating) hyperplane, which takes the form

w · x + b = 0

where b is a scalar and w is a p-dimensional vector. The vector w is perpendicular to the separating hyperplane. Adding the offset parameter b allows us to increase the margin; without b, the hyperplane is forced to pass through the origin, restricting the solution. As we are interested in the maximum margin, we consider the SVM together with the parallel hyperplanes, which can be described by the equations

w · x + b = 1

w · x + b = -1

Fig. 6. Maximum-margin hyperplanes for an SVM trained with samples from two classes

If the training data are linearly separable, we can select these hyperplanes so that there are no points between them, and then try to maximize their distance. By geometry, the distance between the hyperplanes is 2/‖w‖, so we want to minimize ‖w‖. To keep the data points out of the margin, we need to ensure that, for all i, either

w · xi + b ≥ 1 or w · xi + b ≤ -1

This can be written as

yi (w · xi + b) ≥ 1, 1 ≤ i ≤ n

Kernel selection for the SVM: Training vectors xi are mapped into a higher (possibly infinite) dimensional space by a function Φ. The SVM then finds a linear separating hyperplane with the maximal margin in this higher-dimensional space. C > 0 is the penalty parameter of the error term. Furthermore,

K(xi, xj) ≡ Φ(xi)ᵀ Φ(xj)

is called the kernel function.
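As a hedged sketch of this idea, the snippet below evaluates an RBF kernel matrix with scikit-learn, computing the inner products Φ(xi)ᵀΦ(xj) in the higher-dimensional space implicitly, without ever forming Φ. The gamma value is an arbitrary assumption.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# K(xi, xj) = exp(-gamma * ||xi - xj||^2): the kernel evaluates
# Φ(xi)ᵀΦ(xj) without an explicit mapping to the higher-dimensional space.
X = np.random.default_rng(2).normal(size=(5, 3))
K = rbf_kernel(X, X, gamma=0.5)
print(K.shape)  # (5, 5) Gram matrix as used by the SVM
```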

E. Recognition (Hidden Markov Model)

Hidden Markov models have been widely used for many classification and modelling problems. One of the main advantages of HMMs is their ability to model non-stationary signals or events. Dynamic programming methods allow one to align the signals so as to account for the non-stationarity. The HMM finds an implicit time warping in a probabilistic, parametric fashion. It uses the transition probabilities between the hidden states and learns the conditional probabilities of the observations given the state of the model. In the case of emotion expression, the signal consists of the measurements of the eye expression.

An HMM is given by the following set of parameters:

λ = (A, B, π)

where A is the state transition probability matrix, B is the observation probability distribution, and π is the initial state distribution. The number of states of the HMM is given by N. In the discrete case, B is a matrix of probability entries (a conditional probability table); in the continuous case, B is given by the parameters of the probability distribution function of the observations (normally chosen to be a Gaussian distribution or a mixture of Gaussians).

Emotion-specific HMMs: Since the display of a certain emotion in video is represented by a temporal sequence of facial motions, it is natural to model each eye expression using an HMM trained for that particular type of emotion. There are six such HMMs, one for each emotion: happy (1), angry (2), surprise (3), disgust (4), fear (5), and sad (6). Several model structures can be used; the two main ones are the left-to-right model and the ergodic model.

In [9], Otsuka and Ohya used left-to-right models with three states to model each type of facial expression. The advantage of using this model lies in the fact that it seems natural to model a sequential event with a model that also starts from a fixed starting state and always reaches an end state. It also involves fewer parameters, and therefore is easier to train. However, it reduces the degrees of freedom the model has to try to account for the observation sequence. On the other hand, using the ergodic HMM allows more freedom for the model to account for the observation sequences, and in fact, for an infinite amount of training data it can be shown that the ergodic model will reduce to the left-to-right model, if that is indeed the true model.
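As a hedged numeric illustration of the two structures, the transition parameters below encode a three-state left-to-right model (as in [9]) and an ergodic model; the particular probabilities are arbitrary assumptions.

```python
import numpy as np

N = 3  # number of hidden states

# Left-to-right model: transitions only stay in place or move forward,
# with a fixed start state; fewer free parameters, as noted above.
A_ltr = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
pi_ltr = np.array([1.0, 0.0, 0.0])

# Ergodic model: every state can reach every other state.
A_erg = np.full((N, N), 1.0 / N)
pi_erg = np.full(N, 1.0 / N)
```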

The observation vector Ot for the HMM represents the continuous motion of the facial action units. Therefore, B is represented by the probability density functions (pdfs) of the observation vector at time t given the state of the model. The Gaussian distribution is chosen to represent these pdfs, i.e.,

B = { bj(Ot) } ~ N(μj, Σj), 1 ≤ j ≤ N

where μj and Σj are the mean vector and full covariance matrix, respectively.

The parameters of each emotion-specific HMM are learned using the well-known Baum-Welch re-estimation formulas. For learning, hand-labelled sequences of each of the facial expressions are used as ground-truth sequences, and the Baum-Welch algorithm is used to derive the maximum likelihood (ML) estimate of the model parameters (λ).

Parameter learning is followed by the construction of an ML classifier. Given an observation sequence O = {Ot}, t ∈ [1, T], the probability of the observation under each of the six models, P(O | λj), is computed using the forward-backward procedure [10]. The sequence is classified as the emotion corresponding to the model that yields the highest probability, i.e.,

c* = arg max P(O | λc), 1 ≤ c ≤ 6
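The paper names no toolkit; as an illustrative sketch only, the following code uses the hmmlearn library to fit one Gaussian HMM per emotion with Baum-Welch (fit) and to classify a sequence by the maximum-likelihood rule above (score returns log P(O | λ) via the forward procedure). The data are random placeholders for eye-expression measurement vectors.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

EMOTIONS = ["happy", "angry", "surprise", "disgust", "fear", "sad"]

def train_emotion_hmms(sequences_by_emotion, n_states=3):
    """Fit one Gaussian HMM per emotion; fit() runs Baum-Welch."""
    models = {}
    for emotion, seqs in sequences_by_emotion.items():
        X = np.vstack(seqs)               # stack observation vectors
        lengths = [len(s) for s in seqs]  # per-sequence lengths
        m = GaussianHMM(n_components=n_states, covariance_type="full")
        m.fit(X, lengths)
        models[emotion] = m
    return models

def classify(models, obs_seq):
    """c* = argmax_c log P(O | lambda_c), one score per emotion model."""
    return max(models, key=lambda e: models[e].score(obs_seq))

# Toy usage with random data standing in for eye-expression measurements.
rng = np.random.default_rng(0)
data = {e: [rng.normal(size=(20, 4)) for _ in range(5)] for e in EMOTIONS}
models = train_emotion_hmms(data)
print(classify(models, rng.normal(size=(20, 4))))
```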

IV. DATA COLLECTION

To test the algorithms described in the previous sections, we collected data from people who were instructed to display facial expressions corresponding to the six types of emotions. Video was used as the input, and the sampling rate was 30 Hz.

The data was collected in an open recording scenario, where the person was asked to display the expression corresponding to the emotion being induced. This is, of course, not the ideal way of collecting emotion data. The ideal way would be a hidden recording, inducing the emotion through events in the normal environment of the subject, not in a studio.

V. EXPERIMENTAL RESULTS

The above algorithms were applied to various face images containing the frontal view of the human face, using MATLAB 7.0. The images were obtained from a fixed CCTV camera.

Fig. 7. Sample eye expressions: surprise, happy, angry, fear, disgust, sad

For emotion recognition, the set of frames representing a video is given as input. First, the various pre-processing techniques are used to remove the noise present in each image. The noise-free image is passed through the Canny edge detector, yielding an edge-detected image. From the edge-detected image, the eye regions are extracted using the feature-based method. The extracted regions are then classified using the SVM classifier. Finally, the HMM stage votes for the model with the maximum probability, and that emotion label is produced as output, as in Fig. 7.

VI. CONCLUSION

Recent research shows that the understanding and recognition of emotional expressions plays an important role in the development and maintenance of social relationships. In this work, a novel and efficient framework for a human emotion recognition system is proposed. Compared with existing methods, the SVM is a robust and efficient classifier for labelling the emotions, and the HMM serves as a dynamic classifier that recognizes the human emotions with good accuracy. This approach is useful for real-world problems such as human-computer interaction, security surveillance, and tutoring systems. In future, this work may be extended to identify the emotions of people wearing spectacles using only eye expressions.

REFERENCES

[1] B. V. Kumar, M. Savvides, K. Venkataramani, and X. Xie, "Spatial frequency domain image processing for biometric recognition," in Proc. IEEE Int. Conf. Image Process., 2002, vol. 1, pp. 53-56.

[2] M. Savvides, B. Kumar, and P. Khosla, "Corefaces—Robust shift-invariant PCA-based correlation filter for illumination-tolerant face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2004, vol. 2, pp. 834-841.

[3] R. W. Picard, E. Vyzas, and J. Healey, “Toward machine emotional intelligence: Analysis of affective psychological states,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1175–1191, Oct. 2001.

[4] P. Ekman and W. V. Friesen, Unmasking the Face. Englewood Cliffs, NJ: Prentice-Hall, 1975.

[5] P. Ekman and W. V. Friesen, The Facial Action Coding System. San Francisco, CA: Consulting Psychologist, 1978.

[6] L. Kozma, A. Klami, and S. Kaski, "GaZIR: Gaze-based zooming interface for image retrieval," in Proc. 2009 Int. Conf. Multimodal Interfaces (ICMI-MLMI '09), New York, 2009, pp. 305-312.

[7] T. Moriyama, T. Kanade, J. Xiao, and J. F. Cohn, "Meticulously Detailed Eye Region Model and Its Application to Analysis of Facial Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 738-752, May 2006.

[13] K. D. Atwood, "Recognition of Facial Expressions of Six Emotions by Children with Specific Language Impairment," Brigham Young University, Aug. 2006.

[8] Weiming Hu and Tieniu Tan “A Survey on Visual Surveillance of Object Motion and Behaviors”, IEEE Transactions on SMC, Vol.34, No.3, August 2004.

[9] T. Otsuka and J. Ohya, "Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences," in Proc. Int. Conf. Image Processing (ICIP-97), Santa Barbara, CA, USA, Oct. 26-29, 1997, pp. 546-549.

[10] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[11] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
