Facial Expression Recognition with a Convolutional Neural Network
Yekun Yang
What is Facial Expression?
● One or more motions or
positions of the muscles
beneath the skin of the face
● Babies can already tell the
difference between happy and
sad at just 14 months old
● Even some animals, such as
dogs, can sense human facial
expressions, but reading them
requires the animal to have
prior experience with humans
7 Basic Emotions
● Happy
● Sad
● Surprise
● Fear
● Anger
● Disgust
● Neutral
Why is facial emotion recognition important?
● Retailers may use these metrics to evaluate customers’ interest.
● Healthcare providers can provide better service by using additional information
about patients’ emotional state during treatment.
● Entertainment producers can monitor audience engagement in events to
consistently create desired content.
Project Model
● Data
● CNN model
● Analysis
Data
● Kaggle Facial Expression
Recognition Challenge
(FER2013)
● 35887 pre-cropped, 48-by-48-
pixel gray-scale images
Data
● 28709 labeled faces for training
● The remaining 7178 images form
two test sets of 3589 images each
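The data split above can be sketched in Python. The row layout (emotion label, 2304 space-separated pixel values, and a usage column) follows the FER2013 CSV format, but the two rows below are synthetic stand-ins, not real data:

```python
import numpy as np

# Synthetic stand-ins for FER2013 rows: (emotion label, 2304 space-separated
# pixel values, usage split). The real file has 35887 such rows.
rows = [
    (3, " ".join(["128"] * (48 * 48)), "Training"),
    (5, " ".join(["64"] * (48 * 48)), "PublicTest"),
]

def parse_row(emotion, pixels, usage):
    # Reshape the flat pixel string into a 48x48 grayscale image
    img = np.array(pixels.split(), dtype=np.uint8).reshape(48, 48)
    return emotion, img, usage

train = [parse_row(*r) for r in rows if r[2] == "Training"]
test = [parse_row(*r) for r in rows if r[2] != "Training"]
print(len(train), len(test), train[0][1].shape)  # 1 1 (48, 48)
```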
CNN Structure:Input Layer
The input layer has pre-determined,
fixed dimensions, so the image must
be pre-processed before it can be fed
into the layer.
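A minimal NumPy sketch of that preprocessing: the 48x48x1 target shape matches the FER2013 data, while the grayscale conversion by channel averaging and the random input image are illustrative simplifications:

```python
import numpy as np

# The input layer expects a fixed shape, e.g. (48, 48, 1), so every image
# must be cropped/resized to 48x48, converted to grayscale, and scaled
# to [0, 1] before being fed in. A random RGB image stands in here.
rgb = np.random.randint(0, 256, size=(48, 48, 3), dtype=np.uint8)

gray = rgb.mean(axis=-1)                 # collapse RGB channels to one
scaled = gray.astype("float32") / 255.0  # scale pixel values to [0, 1]
batch = scaled.reshape(1, 48, 48, 1)     # add batch and channel dimensions

print(batch.shape)  # (1, 48, 48, 1)
```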
Convolutional Layer
● Consists of a set of learnable
filters (or kernels)
● Each filter has a small receptive
field but extends through the full
depth of the input volume
● Generates feature maps that
record the filter’s response at
each spatial position
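A toy NumPy convolution (valid padding, stride 1) illustrates how one filter slides over the input to build a feature map; the edge-detecting kernel and the test image here are made-up examples:

```python
import numpy as np

# Toy 2D convolution (valid padding, stride 1): the kernel slides over the
# input, and each output entry is the sum of the elementwise products.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter applied to an image with a bright right half:
# the feature map responds most strongly where the edge sits.
image = np.zeros((5, 5))
image[:, 3:] = 1.0
edge_filter = np.array([[-1.0, 1.0], [-1.0, 1.0]])

fmap = conv2d(image, edge_filter)
print(fmap.shape)  # (4, 4)
```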
[Slide figures: convolution examples, Cases 1–5; Case 4 uses weight values of (1, 0.3) and (0.1, 5)]
Pooling Layer, Padding & Stride
● Pooling is a dimension-
reduction technique, usually
applied after one or several
convolutional layers.
● Max pooling keeps the largest
value in each window.
● Padding adds zeros around the
edge of the image to preserve
a certain output size.
● Higher stride values move the
filter more pixels at a time and
hence produce smaller output
volumes.
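The max-pooling step can be sketched in NumPy. The 2x2 window and stride 2 are the common defaults (they match the Keras sample later in the deck), not anything FER2013-specific:

```python
import numpy as np

# Max pooling sketch: non-overlapping 2x2 windows with stride 2 halve each
# spatial dimension, keeping only the strongest activation per window.
def max_pool(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = win.max()
    return out

x = np.arange(16).reshape(4, 4).astype(float)
print(max_pool(x))  # [[ 5.  7.] [13. 15.]]
```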
Dense Layer
● Fully connected network
● The more layers and nodes are
added to the network, the more
complex signals it can pick up.
● On the other hand, the model
also becomes increasingly
prone to overfitting the training
data.
● Dropout is needed to
counteract this.
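A NumPy sketch of (inverted) dropout shows the mechanism at training time; the 0.5 rate and the all-ones activations are illustrative:

```python
import numpy as np

# Inverted dropout: randomly zero a fraction `rate` of activations and
# rescale the survivors so the expected activation magnitude is unchanged.
def dropout(activations, rate, rng):
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, rate=0.5, rng=rng)

print(round(float(out.mean()), 1))  # close to 1.0: expectation preserved
print(round(float((out == 0).mean()), 1))  # roughly half the units dropped
```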
Output Layer
● Softmax
● How many layers are good for
the model?
● The deeper the better
Keras Sample Code
import keras
from keras.models import Sequential

input_shape = (48, 48, 1)         # 48x48 grayscale input
filters, filtersize = 32, (3, 3)  # example filter settings

model = Sequential()
model.add(keras.layers.InputLayer(input_shape=input_shape))
model.add(keras.layers.Conv2D(filters, filtersize, strides=(1, 1), padding='same', activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(units=7, activation='softmax'))  # one unit per emotion class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# images / labels: preprocessed image array and one-hot emotion labels
model.fit(images, labels, epochs=epochs, batch_size=batchsize, validation_split=0.3)
Related Work
● AlexNet (2012)
● ZF Net (2013)
● VGG Net (2014)
● GoogLeNet (2015)
● Microsoft ResNet (2015)
AlexNet (2012)
● Top 5 test error rate of 15.4%
(Top 5 error is the rate at which,
given an image, the model does
not output the correct label with
its top 5 predictions)
● 15 million annotated images
from a total of over 22,000
categories
● Used ReLU for the nonlinearity
functions, and trained the
model using batch stochastic
gradient descent.
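The top-5 error definition above can be written out directly; the score vectors below are made up for illustration:

```python
import numpy as np

# Top-5 error: a prediction counts as correct if the true label appears
# anywhere among the model's five highest-scoring classes.
def top5_error(scores, labels):
    top5 = np.argsort(scores, axis=1)[:, -5:]  # indices of the 5 best classes
    hits = [label in row for row, label in zip(top5, labels)]
    return 1.0 - float(np.mean(hits))

scores = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05],   # true class 2: in top 5
    [0.9, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5],  # true class 1: not in top 5
])
labels = [2, 1]
print(top5_error(scores, labels))  # 0.5
```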
ZF Net (2013)
● 11.2% error rate
● Trained on only 1.3 million
images
● Instead of using 11x11 sized
filters in the first layer (which is
what AlexNet implemented), ZF
Net used filters of size 7x7 and
a decreased stride value.
● Used ReLUs for their activation
functions, cross-entropy loss
for the error function, and
trained using batch stochastic
gradient descent.
VGG Net (2014)
● 7.3% error rate
● The use of only 3x3 sized filters
is quite different from AlexNet’s
11x11 filters in the first layer
and ZF Net’s 7x7 filters. The
authors’ reasoning is that the
combination of two 3x3 conv
layers has an effective
receptive field of 5x5.
● Used ReLU layers after each
conv layer and trained with
batch gradient descent.
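The receptive-field arithmetic behind that reasoning: for stride-1 convolutions, each extra k-by-k layer grows the effective field by k - 1. A quick sketch:

```python
# Effective receptive field of stacked stride-1 convolutions: starting
# from a single pixel, each k x k layer adds (k - 1) to the field, so two
# 3x3 layers see 5x5 and three see 7x7 -- VGG's argument for small filters.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3, 3]))     # 5 (two 3x3 convs ~ one 5x5)
print(receptive_field([3, 3, 3]))  # 7 (three 3x3 convs ~ one 7x7)
```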
GoogLeNet (2015)
● Top 5 error rate of 6.7%
● Used 9 Inception modules in
the whole architecture, with
over 100 layers in total.
● No use of fully connected
layers! They use an average
pool instead, to go from a
7x7x1024 volume to a
1x1x1024 volume. This saves a
huge number of parameters.
● Utilized concepts from R-CNN
for detection model.
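The average-pooling trick can be sketched in NumPy; the 7x7x1024 shape is the one quoted on the slide, and the random volume is a stand-in for real activations:

```python
import numpy as np

# Global average pooling: averaging each 7x7 feature map down to a single
# value turns a 7x7x1024 volume into 1024 numbers with zero learned
# parameters, whereas a fully connected layer to 1024 units would need
# 7*7*1024*1024 weights.
volume = np.random.rand(7, 7, 1024)
pooled = volume.mean(axis=(0, 1))  # average over the spatial dimensions

print(pooled.shape)        # (1024,)
print(7 * 7 * 1024 * 1024)  # dense-layer weights this avoids
```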
Microsoft ResNet (2015)
● 3.6% error rate!
● The idea behind a residual
block is that the input x goes
through a conv-relu-conv
series, producing some F(x);
that result is then added back
to the original input x.
● 152 layers...
● The group tried a 1202-layer
network, but got a lower test
accuracy, presumably due to
overfitting.
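A minimal sketch of the residual idea, with small dense transforms standing in for the conv-relu-conv series (a simplification, not the actual ResNet block):

```python
import numpy as np

# Residual block sketch: the input x passes through a transform producing
# F(x), which is added back to x via the skip connection, so the layers
# only need to learn the residual on top of the identity.
def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    fx = w2 @ relu(w1 @ x)  # stand-in for the conv-relu-conv series
    return fx + x           # the skip connection: F(x) + x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1

out = residual_block(x, w1, w2)
print(out.shape)  # (8,)
```

Note that with all-zero weights the block reduces to the identity, which is what makes very deep residual networks trainable.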
Larsson, Gustav, Michael Maire, and
Gregory Shakhnarovich. “FractalNet:
Ultra-Deep Neural Networks without
Residuals”
He, Kaiming, Xiangyu Zhang, Shaoqing
Ren, and Jian Sun. “Deep Residual
Learning for Image Recognition”
Result Analysis
● The final CNN had a validation
accuracy around 55%.
● The model performs pretty well
on classifying positive
emotions, but weaker across
negative emotions on average.
● Happy: 75%
● Surprise: 70%
● Sad: 40%
● Angry, fear, and neutral are
frequently misclassified as sad
Demo from HappyNet
Reference
● Jostine Ho, https://github.com/JostineHo/mememoji
● Dan Duncan, https://github.com/danduncan/HappyNet
● Adit Deshpande, https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
● http://cs231n.stanford.edu/