8/4/2019 A Supervised Approach to Support the Analysis and the Classification of Non Verbal Humans Communications
http://slidepdf.com/reader/full/a-supervised-approach-to-support-the-analysis-and-the-classification-of-non 1/6
A supervised approach to support the analysis and the
classification of non verbal humans communications
Vitoantonio Bevilacqua12*, Marco Suma1, Dario D'Ambruoso1,
Giovanni Mandolino1, Michele Caccia1, Simone Tucci1,
Emanuela De Tommaso1, Giuseppe Mastronardi12
1Dipartimento di Elettrotecnica ed Elettronica, Polytechnic of Bari, Italy,2e.B.I.S. s.r.l. (electronic Business in Security), Spin-Off of Polytechnic of Bari, Italy
*corresponding author: [email protected]
Abstract. Background: It is well known that non-verbal communication is
sometimes more useful and robust than verbal communication in revealing sincere
emotions, by means of the analysis of spontaneous body gestures and facial
expressions acquired from video sequences. At the same time, the automatic or
semi-automatic segmentation of a human from a video stream, followed by the
extraction of features suitable for robust supervised classification, remains a
relevant field of interest in computer vision and intelligent data analysis.
Materials and Methods: We obtained data from four datasets: the first contains
100 images of human silhouettes (or templates) acquired from a video sequence
dataset; the second contains 543 images of gestures from a pre-recorded video of
MotoGP rider Jorge Lorenzo; the third contains 200 images of mouths; and the
fourth contains 100 images of noses. The third and fourth datasets contain images
acquired with a tool implemented by the authors, as well as samples available in
public databases in the literature. We used supervised methods to train the
proposed classifiers: in particular, three different EBP neural-network
architectures for human templates, mouths and noses, and the J48 algorithm for
gestures.
Results: We obtained on average an 80% correct classification rate for the binary
classifier of human templates (no false positives), 90% for the happy/non-happy
emotion, 85% for the binary disgust/non-disgust emotion, and 80% for the 4
different gestures.
Keywords: Neural Network, Emotion Recognition, Human Silhouettes,
Gesture Recognition, Facial Expression Recognition, Human Detection,
Hands, Action Units, Centre of Gravity, Pose Estimation.
1 Introduction
Good communication is the foundation of successful relationships, both personally
and professionally. But we communicate with much more than words; in fact, many
studies show that the majority of our communication is nonverbal. Nonverbal
communication, or body language, includes facial expressions, gestures, eye contact,
posture and even the tone of our voice. Although the details of his theory have
evolved substantially since the 1960s, Ekman remains the most vocal proponent of
the idea that emotions are discrete entities [1]. Unlike some forms of nonverbal
communication, facial expressions are universal. Regarding gesture recognition, we
consider how the way we move communicates a wealth of information to the world.
This type of nonverbal communication includes our posture, stance, and subtle
movements. Gestures are omnipresent in our daily lives. However, the meaning of
gestures can differ widely across cultures and regions, so it is important to be
careful to avoid misinterpretation. Building on these ideas, we want to provide an
automatic system able to evaluate emotions in particular situations
(videoconferences, meetings, neurological examinations, investigations).
2 Materials
Materials for all four datasets were collected with the goal of increasing the
variance of their samples, and thus the amount of information in the training
examples available to the proposed supervised classifiers.
2.1 Humans silhouettes
The human silhouettes used in this paper come from people walking in a video
stream dataset; the training examples consist of only 20 different binary
silhouette images obtained after a pre-processing phase of background subtraction.
With this method, each training example is one of the several different human
silhouettes extracted from each frame.
Fig. 1. a) and b) sample frames and c) 4 different examples of human silhouettes with their
varying dimensions and behaviours.
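The background-subtraction step described above can be sketched as simple frame differencing; the threshold value below is an illustrative assumption, not the paper's actual parameter.

```python
import numpy as np

def extract_silhouette(frame, background, thresh=30):
    # Mark as foreground the pixels that differ from the background
    # model by more than `thresh` gray levels (a minimal sketch;
    # the threshold of 30 is an illustrative assumption).
    diff = np.abs(frame.astype(int) - background.astype(int))
    return (diff > thresh).astype(np.uint8)
```

The resulting binary mask would then be cropped per connected region and resized to the fixed template size before classification.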
2.2 Facial Expressions
First of all, we explain the concept of Action Units (AUs): minimal, non-separable
facial actions that are the building blocks of facial expressions. Combinations of
these, with different intensities, generate facial expressions. According to our
previous work [2] we can assert that, generally, regardless of other AUs, the
presence of AU-10 unequivocally discriminates the disgust emotion, while the
presence of AU-12 or AU-13 unequivocally discriminates the happy emotion. For this
reason we are able to recognize two of the six primary emotions identified by Paul
Ekman: happiness and disgust. To extract the middle and lower parts of the face we
have used our own tool;
moreover, we have used public databases of faces [3], from which we have then
taken our regions of interest.
2.3 Gestures
Each frame of the video has a resolution of 640x480 pixels. The automatic
classification of gestures is based on the studies of psychologist David
McNeill [4], who divides gestures into four main categories:
- deictic gestures: typical pointing movements, usually emphasized by the
movement of fingers or by other parts of the body that can be used for this
purpose;
- iconic gestures: gestures that bear a formal relation to the semantic
content of the discourse. They mainly occur in the area occupied by the torso of
the subject being filmed;
- metaphoric gestures: like iconic gestures they are pictorial, but they refer
to abstract concepts, such as moods or language. The density of such gestures is
concentrated in the lower part of the torso;
- beat gestures: these may be recognized only by focusing attention on the
characteristics of their movements.
It was decided to monitor the movement of the centre of gravity (CG) of the
hands in each frame, so as to be able to calculate various evaluation parameters,
such as the velocity with which gestures are made.
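The CG of a hand region and the frame-to-frame velocity derived from it can be sketched as follows; the frame rate of 25 fps is an assumption for illustration, not a value stated in the paper.

```python
import numpy as np

def centre_of_gravity(mask):
    # CG of a binary hand mask: mean of the foreground pixel coordinates
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def cg_velocity(cg_prev, cg_curr, fps=25.0):
    # Speed in pixels per second between two consecutive frames;
    # the 25 fps frame rate is an illustrative assumption.
    dx = cg_curr[0] - cg_prev[0]
    dy = cg_curr[1] - cg_prev[1]
    return float(np.hypot(dx, dy)) * fps
```

Tracking these two quantities per hand over the video yields the velocity feature used by the gesture classifier.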
3 Methods
The application of supervised neural networks trained with the Error
Back-Propagation (EBP) algorithm provides a tractable solution to complex problems
such as the correct classification of silhouette shapes, facial expressions and
gestures. Advantages of neural networks include their high tolerance to noise as
well as their ability to classify patterns not used for training. In particular,
we implemented supervised neural-network classifiers for silhouettes and for
mouth and nose emotion features, and the J48 classifier for gestures.
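J48 (an implementation of the C4.5 rule-induction algorithm) grows its tree by choosing, at each node, the split that scores best under an entropy-based gain criterion. A minimal sketch of the underlying information-gain computation, assuming labels are given as plain Python lists:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label list, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_groups):
    # Gain of a candidate split: parent entropy minus the
    # size-weighted entropy of the child groups
    n = len(labels)
    child = sum(len(g) / n * entropy(g) for g in split_groups)
    return entropy(labels) - child
```

C4.5 additionally normalizes this gain (gain ratio) and prunes the resulting tree, which is what keeps the number of levels and nodes small.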
3.1 Silhouettes classification
The neural network classifier is a two-layer feed-forward network with 396 inputs
(corresponding to the 33x12 dimensions of the smallest figure, previously resized
to contain the smallest human silhouette), with 6 logistic neurons in the first
layer and 1 output neuron. The images passed to the neural network have the
following characteristics: height greater than width, a height/width ratio
ranging between 1.9 and 4, height greater than 33 pixels and width greater than
12 pixels. Each image is split into sub-images so that each contains a single
human silhouette, always resized to 33x12 pixels in order to have the same number of
inputs for each neural network classification sample. This procedure guarantees a
constant number of neural network inputs. In any case, to achieve good
generalization performance, the training set is selected with large variability:
the positive poses and movements include people not facing the camera
(non-frontal images), people with their arms far from or close to the body, and
people not well identified owing to the presence of just one arm, while the
negative examples are objects similar to people, used as counter-examples.
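A minimal sketch of the 396-6-1 feed-forward architecture described above (forward pass only; the EBP training loop is omitted, and the random weight initialization is illustrative, not trained values):

```python
import numpy as np

def sigmoid(z):
    # logistic activation, as used by the paper's neurons
    return 1.0 / (1.0 + np.exp(-z))

class SilhouetteNet:
    # 396 inputs -> 6 logistic hidden neurons -> 1 output neuron
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (6, 396))
        self.b1 = np.zeros(6)
        self.W2 = rng.normal(0.0, 0.1, (1, 6))
        self.b2 = np.zeros(1)

    def forward(self, x):
        # x: flattened 33x12 binary silhouette (396 values)
        h = sigmoid(self.W1 @ x + self.b1)
        return float(sigmoid(self.W2 @ h + self.b2)[0])
```

In use, the output would be thresholded (e.g. at 0.5) to produce the binary human/non-human decision.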
3.2 Facial Expression classification
We have realized two NNs that work in parallel. The first one receives the shape
of the mouth: in happy expressions the mouth should be open, the teeth should be
visible and the mouth shape is curved (AU-12, AU-13). The second one receives the
nose: in disgust expressions the nasolabial furrows are visible (AU-10).
Fig. 2. Segmentation and vectorization of the face.
Each bitmap gray-scale image is a 40x80-pixel band containing, respectively, the
lower or the middle part of the face; to use it as input for the neural network,
it is arranged in an array and then normalized, obtaining a 1x50 vector (a
function computes a mean value for each 8x8 pixel block). In case of non-happy or
non-disgust expressions, the network returns 0 (zero); otherwise, the network
returns 1. To train the NN for the mouth, we used a training set of 200 photos,
composed of 100 negative and 100 positive examples, over 20000 epochs. This NN
has a structure of 300 neurons in the first layer, 200 in the second, 10 in the
third and 1 output neuron (300x200x10x1). To train the NN for the nose, we used a
training set of 100 examples, composed of 50 negative and 50 positive examples,
over 20000 epochs. This NN has a structure of 400 neurons in the first layer, 80
in the second, 10 in the third and 1 output neuron (400x80x10x1).
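The vectorization step above (a mean over each 8x8 block of the 40x80 band, giving 5x10 = 50 features) can be sketched as:

```python
import numpy as np

def vectorize_face_band(img):
    # img: 40x80 gray-scale band (mouth or nose region)
    assert img.shape == (40, 80)
    # group rows into 5 blocks of 8 and columns into 10 blocks of 8,
    # then average each 8x8 block -> 5x10 block means
    blocks = img.astype(float).reshape(5, 8, 10, 8).mean(axis=(1, 3))
    # flatten to the 1x50 feature vector and normalize to [0, 1]
    return blocks.reshape(50) / 255.0
```

The normalization to [0, 1] is an illustrative choice; the paper only states that the array is normalized, without giving the exact range.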
Fig. 3. Mouths and noses from our tool (the first four images) and from public databases.
3.3 Gestures
For gesture analysis the supervised classifier is implemented by means of the J48
algorithm instead of an EBP NN classifier. Rule induction systems are currently
employed in several different environments, ranging from loan request evaluation
to fraud detection, bioinformatics and medicine [5]. The main goal of this scheme
is to minimize the number of tree levels and tree nodes, thereby maximizing data
generalization. The input is a 10-element array whose features are: the x and y
coordinates of the right hand CG; the x and y coordinates of the left hand CG;
the position of the right hand (with respect to the torso of the subject being
shot); the position of the left hand; the right and left hand slants (measured in
radians); and the velocities of the right and left hand movements. To find the
CGs, frames are processed according to the following workflow: skin detection by
colour-space conversion from RGB to HSV; background subtraction to isolate the
hands region; image smoothing and binarization; tracing of rectangles that
contain the hands; CG identification; edge and feature detection; template
matching to detect the resting position of the hands; gesture classification; and
storing of the data in a .csv file.
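The first step of the workflow, skin detection in HSV space, can be sketched as a per-pixel threshold; the threshold values below are illustrative assumptions, not the paper's actual ones.

```python
import colorsys

def is_skin(r, g, b):
    # Convert an RGB pixel (0-255 per channel) to HSV and apply
    # hedged example thresholds: low hue (reddish tones), moderate
    # saturation, sufficient brightness. These bounds are illustrative.
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h < 50 / 360.0 and 0.2 <= s <= 0.7 and v >= 0.35
```

Applying such a predicate to every pixel yields the binary skin mask that the subsequent smoothing, binarization and CG steps operate on.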
4 Experimental results
In this paper we have presented a system that separately recognizes shapes and
two of the six primary emotions, and analyzes information derived from gestures.
The complete project aims to recognize all primary emotions. In the following we
show, separately, the results related to facial expressions, gestures and
silhouettes. Regarding facial processing, using about 150 test images, the NNs
achieved a success rate of about 90% for the happy/non-happy emotion and 85% for
the disgust/non-disgust emotion. We can assert that the results are reliable,
also because in some particular cases not even human beings can distinguish
emotions exactly. Regarding gestures, the confusion matrix is shown in Table 1.
The J48 classifier correctly classified approximately 80% of gestures. It was
specifically able to label "metaphoric gestures" precisely. Performance is not
optimal for the recognition of "deictic gestures" and "beat gestures". "Iconic
gestures" are not present in the pre-recorded video.
Table 1. Confusion matrix of the gesture data set. Deictic (A); Spontaneous (B); Beat (C); not
recognized (D); Metaphoric (E)

      A    B    C    D    E
A    53   71    3    0    0
B     6  280   10    3    1
C     1    9   54    1    0
D     1    2    1   20    0
E     0    1    1    0   25
The neural network shows, on average, good results in terms of false positives;
the following figure reports the number of detected vs. total humans for each
tested frame.
Fig. 4. Detected vs. total humans per tested frame (left to right, top to bottom:
4 of 5, 4 of 6, 4 of 6, 3 of 4, 2 of 5, 2 of 5).
Conclusions
The goal of this paper is to investigate emotion-related patterns and to realize
a system that separately recognizes emotional patterns of the body and face using
neural networks. The research aims at developing an intelligent system that can
interpret conversation between human beings. When we interact with others, we
continuously give and receive countless wordless signals. The nonverbal signals
we send either produce a sense of interest, trust, and desire for connection, or
they generate disinterest, distrust, and confusion. The analyzed gestures and
facial emotions represent nonverbal communication; they provide the listener with
cues to what the speaker is saying, thus helping to interpret the meaning of the
words. Future work foresees the design of a new multimodal system performing, at
the same time, emotion recognition by means of several other facial bands
(eyebrow and eye bands), gesture recognition and human silhouette detection.
References
[1] Paul Ekman, FACS: Facial Action Coding System, Research Nexus division of Network
Information Research Corporation, Salt Lake City, UT 84107, (2002)
[2] V. Bevilacqua, D. D'Ambruoso, G. Mandolino, M. Suma, A new tool to support diagnosis
of neurological disorders by means of facial expressions, IEEE Proc. of MeMeA, pp. 544-549
[3] http://www.emotional-face.org
[4] http://mcneilllab.uchicago.edu
[5] F. Menolascina, V. Bevilacqua et al., Novel Data Mining Techniques in aCGH based Breast
Cancer Subtypes Profiling: the Biological Perspective, Proc. of IEEE Symp. on Comp.
Intelligence in Bioinformatics and Comp. Biology (CIBCB 2007), pp. 9-16