Facial Expression Recognition with a Convolutional Neural Network
Yekun Yang
What is Facial Expression?
● One or more motions or
positions of the muscles
beneath the skin of the face
● Babies can already tell the
difference between happy and
sad at just 14 months old
● Even some animals, such as
dogs, can sense human facial
expressions, but reading them
requires the animal to have
prior experience with humans
7 Basic Emotions
● Happy
● Sad
● Surprise
● Fear
● Anger
● Disgust
● Neutral
Why is facial emotion recognition important?
● Retailers may use these metrics to evaluate customers’ interest.
● Healthcare providers can provide better service by using additional information
about patients’ emotional state during treatment.
● Entertainment producers can monitor audience engagement in events to
consistently create desired content.
Project Model
● Data
● CNN model
● Analysis
Data
● Kaggle Facial Expression
Recognition Challenge
(FER2013)
● 35887 pre-cropped, 48-by-48-
pixel gray-scale images
Data
● 28709 labeled faces for training
● The remaining 7178 images form
two test sets of 3589 images each
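The data split above can be sketched in Python. The row layout (emotion label, 2304 space-separated pixel values, and a usage column) follows the FER2013 CSV format, but the two rows below are synthetic stand-ins, not real data:

```python
import numpy as np

# Synthetic stand-ins for FER2013 rows: (emotion label, 2304 space-separated
# pixel values, usage split). The real file has 35887 such rows.
rows = [
    (3, " ".join(["128"] * (48 * 48)), "Training"),
    (5, " ".join(["64"] * (48 * 48)), "PublicTest"),
]

def parse_row(emotion, pixels, usage):
    # Reshape the flat pixel string into a 48x48 grayscale image
    img = np.array(pixels.split(), dtype=np.uint8).reshape(48, 48)
    return emotion, img, usage

train = [parse_row(*r) for r in rows if r[2] == "Training"]
test = [parse_row(*r) for r in rows if r[2] != "Training"]
print(len(train), len(test), train[0][1].shape)  # 1 1 (48, 48)
```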
CNN Structure:Input Layer
The input layer has pre-determined,
fixed dimensions, so the image must
be pre-processed before it can be fed
into the layer.
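A minimal NumPy sketch of that preprocessing: the 48x48x1 target shape matches the FER2013 data, while the grayscale conversion by channel averaging and the random input image are illustrative simplifications:

```python
import numpy as np

# The input layer expects a fixed shape, e.g. (48, 48, 1), so every image
# must be cropped/resized to 48x48, converted to grayscale, and scaled
# to [0, 1] before being fed in. A random RGB image stands in here.
rgb = np.random.randint(0, 256, size=(48, 48, 3), dtype=np.uint8)

gray = rgb.mean(axis=-1)                 # collapse RGB channels to one
scaled = gray.astype("float32") / 255.0  # scale pixel values to [0, 1]
batch = scaled.reshape(1, 48, 48, 1)     # add batch and channel dimensions

print(batch.shape)  # (1, 48, 48, 1)
```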
Convolutional Layer
● Consists of a set of learnable
filters (or kernels)
● Each filter has a small receptive
field but extends through the full
depth of the input volume
● Generates feature maps that
record the filter’s response at
each spatial position
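A toy NumPy convolution (valid padding, stride 1) illustrates how one filter slides over the input to build a feature map; the edge-detecting kernel and the test image here are made-up examples:

```python
import numpy as np

# Toy 2D convolution (valid padding, stride 1): the kernel slides over the
# input, and each output entry is the sum of the elementwise products.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter applied to an image with a bright right half:
# the feature map responds most strongly where the edge sits.
image = np.zeros((5, 5))
image[:, 3:] = 1.0
edge_filter = np.array([[-1.0, 1.0], [-1.0, 1.0]])

fmap = conv2d(image, edge_filter)
print(fmap.shape)  # (4, 4)
```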
[Slide figures: convolution examples, Cases 1–5; Case 4 uses weight values of (1, 0.3) and (0.1, 5)]
Pooling Layer, Padding & Stride
● Pooling is a dimension-
reduction technique, usually
applied after one or several
convolutional layers.
● Max pooling keeps the largest
value in each window.
● Padding adds zeros around the
edge of the image to preserve
a certain output size.
● Higher stride values move the
filter more pixels at a time and
hence produce smaller output
volumes.
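The max-pooling step can be sketched in NumPy. The 2x2 window and stride 2 are the common defaults (they match the Keras sample later in the deck), not anything FER2013-specific:

```python
import numpy as np

# Max pooling sketch: non-overlapping 2x2 windows with stride 2 halve each
# spatial dimension, keeping only the strongest activation per window.
def max_pool(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = win.max()
    return out

x = np.arange(16).reshape(4, 4).astype(float)
print(max_pool(x))  # [[ 5.  7.] [13. 15.]]
```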
Dense Layer
● Fully connected network
● The more layers and nodes are
added to the network, the more
complex signals it can pick up.
● On the other hand, the model
also becomes increasingly
prone to overfitting the training
data.
● Dropout is needed to
counteract this.
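A NumPy sketch of (inverted) dropout shows the mechanism at training time; the 0.5 rate and the all-ones activations are illustrative:

```python
import numpy as np

# Inverted dropout: randomly zero a fraction `rate` of activations and
# rescale the survivors so the expected activation magnitude is unchanged.
def dropout(activations, rate, rng):
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, rate=0.5, rng=rng)

print(round(float(out.mean()), 1))  # close to 1.0: expectation preserved
print(round(float((out == 0).mean()), 1))  # roughly half the units dropped
```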
Output Layer
● Softmax
● How many layers are good for
the model?
● The deeper the better
Keras Sample Code
import keras
from keras.models import Sequential

input_shape = (48, 48, 1)         # 48x48 grayscale input
filters, filtersize = 32, (3, 3)  # example filter settings

model = Sequential()
model.add(keras.layers.InputLayer(input_shape=input_shape))
model.add(keras.layers.Conv2D(filters, filtersize, strides=(1, 1), padding='same', activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(units=7, activation='softmax'))  # one unit per emotion class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# images / labels: preprocessed image array and one-hot emotion labels
model.fit(images, labels, epochs=epochs, batch_size=batchsize, validation_split=0.3)
Related Work
● AlexNet (2012)
● ZF Net (2013)
● VGG Net (2014)
● GoogLeNet (2015)
● Microsoft ResNet (2015)
AlexNet (2012)
● Top 5 test error rate of 15.4%
(Top 5 error is the rate at which,
given an image, the model does
not output the correct label with
its top 5 predictions)
● 15 million annotated images
from a total of over 22,000
categories
● Used ReLU for the nonlinearity
functions, and trained the
model using batch stochastic
gradient descent.
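The top-5 error definition above can be written out directly; the score vectors below are made up for illustration:

```python
import numpy as np

# Top-5 error: a prediction counts as correct if the true label appears
# anywhere among the model's five highest-scoring classes.
def top5_error(scores, labels):
    top5 = np.argsort(scores, axis=1)[:, -5:]  # indices of the 5 best classes
    hits = [label in row for row, label in zip(top5, labels)]
    return 1.0 - float(np.mean(hits))

scores = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05],   # true class 2: in top 5
    [0.9, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5],  # true class 1: not in top 5
])
labels = [2, 1]
print(top5_error(scores, labels))  # 0.5
```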
ZF Net (2013)
● 11.2% error rate
● Trained on only 1.3 million
images
● Instead of using 11x11 sized
filters in the first layer (which is
what AlexNet implemented), ZF
Net used filters of size 7x7 and
a decreased stride value.
● Used ReLUs for their activation
functions, cross-entropy loss
for the error function, and
trained using batch stochastic
gradient descent.
VGG Net (2014)
● 7.3% error rate
● The use of only 3x3 sized filters
is quite different from AlexNet’s
11x11 filters in the first layer
and ZF Net’s 7x7 filters. The
authors’ reasoning is that the
combination of two 3x3 conv
layers has an effective
receptive field of 5x5.
● Used ReLU layers after each
conv layer and trained with
batch gradient descent.
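The receptive-field arithmetic behind that reasoning: for stride-1 convolutions, each extra k-by-k layer grows the effective field by k - 1. A quick sketch:

```python
# Effective receptive field of stacked stride-1 convolutions: starting
# from a single pixel, each k x k layer adds (k - 1) to the field, so two
# 3x3 layers see 5x5 and three see 7x7 -- VGG's argument for small filters.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3, 3]))     # 5 (two 3x3 convs ~ one 5x5)
print(receptive_field([3, 3, 3]))  # 7 (three 3x3 convs ~ one 7x7)
```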
GoogLeNet (2015)
● Top 5 error rate of 6.7%
● Used 9 Inception modules in
the whole architecture, with
over 100 layers in total.
● No use of fully connected
layers! They use an average
pool instead, to go from a
7x7x1024 volume to a
1x1x1024 volume. This saves a
huge number of parameters.
● Utilized concepts from R-CNN
for detection model.
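The average-pooling trick can be sketched in NumPy; the 7x7x1024 shape is the one quoted on the slide, and the random volume is a stand-in for real activations:

```python
import numpy as np

# Global average pooling: averaging each 7x7 feature map down to a single
# value turns a 7x7x1024 volume into 1024 numbers with zero learned
# parameters, whereas a fully connected layer to 1024 units would need
# 7*7*1024*1024 weights.
volume = np.random.rand(7, 7, 1024)
pooled = volume.mean(axis=(0, 1))  # average over the spatial dimensions

print(pooled.shape)        # (1024,)
print(7 * 7 * 1024 * 1024)  # dense-layer weights this avoids
```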
Microsoft ResNet (2015)
● 3.6% error rate!
● The idea behind a residual
block is that the input x goes
through a conv-relu-conv
series, producing some F(x);
that result is then added back
to the original input x.
● 152 layers...
● The group tried a 1202-layer
network, but got a lower test
accuracy, presumably due to
overfitting.
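A minimal sketch of the residual idea, with small dense transforms standing in for the conv-relu-conv series (a simplification, not the actual ResNet block):

```python
import numpy as np

# Residual block sketch: the input x passes through a transform producing
# F(x), which is added back to x via the skip connection, so the layers
# only need to learn the residual on top of the identity.
def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    fx = w2 @ relu(w1 @ x)  # stand-in for the conv-relu-conv series
    return fx + x           # the skip connection: F(x) + x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1

out = residual_block(x, w1, w2)
print(out.shape)  # (8,)
```

Note that with all-zero weights the block reduces to the identity, which is what makes very deep residual networks trainable.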
Larsson, Gustav, Michael Maire, and
Gregory Shakhnarovich. “FractalNet:
Ultra-Deep Neural Networks without
Residuals”
He, Kaiming, Xiangyu Zhang, Shaoqing
Ren, and Jian Sun. “Deep Residual
Learning for Image Recognition”
Result Analysis
● The final CNN had a validation
accuracy around 55%.
● The model performs pretty well
on classifying positive
emotions, but weaker across
negative emotions on average.
● Happy: 75%
● Surprise: 70%
● Sad: 40%
● Angry, fear, and neutral are
frequently misclassified as sad
Demo from HappyNet
Reference
● Jostine Ho, https://github.com/JostineHo/mememoji
● Dan Duncan, https://github.com/danduncan/HappyNet
● Adit Deshpande, https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
● http://cs231n.stanford.edu/