Facial Expression Recognition and Generation
Deepali Aneja Ph.D. student
Computer Science and Engineering University of Washington
Motivation
• Accurate facial expression depiction is critical for storytelling.
• And difficult!
[Bar charts: Mechanical Turk recognition rates (0-63%) across Joy, Sadness, Anger, Surprise, Fear, Disgust, and Neutral for three animated "surprise" expressions]
We asked three professional animators to make the character appear as surprised as possible. None of the expressions achieved above 50% recognition in Mechanical Turk testing.
Use human anatomy (FACS) to generate expressions
[Example renderings: MPEG-4 (Anger), HapFACS (Anger), HapFACS (Fear), FACSGen (Fear)]
Adobe Character Animator (Geometry + Audio input)
Problem Statement
Given that simple geometric mappings are not sufficient:
• How can we transfer human expressions to stylized characters without losing perceptual information?
• How can we use human expressions to quickly and automatically create expressions for a wide range of characters?
Generate characters from human expressions
Our Approach
• Use deep learning to learn mappings between:
  • human expressions and human expressions
  • character expressions and character expressions
  • human expressions and character expressions
• Seven classes of expressions: Joy, Sadness, Anger, Disgust, Surprise, Fear, and Neutral
• This isn't just geometry mapping; it is perceptual modelling of expressions
Step 1: Use deep learning to create a perceptual model of human expressions (human feature space, f( ))
Step 2: Learn an analogous character model (character feature space, f'( ))
Step 3: Learn a mapping between f'( ) and f( )
Step 4: Retrieve characters using the perceptual model and geometry
Part 1: Expression Retrieval
Steps:
1. Data Collection
2. Data Pre-processing
3. Network Training using Deep Learning
4. Transfer Expressions
Data Collection - Human Database
• CK+: The Extended Cohn-Kanade dataset [REF] - 309 images
• DISFA: Denver Intensity of Spontaneous Facial Actions [REF] - 60,000 images
• KDEF: The Karolinska Directed Emotional Faces [REF] - 4,900 images
• MMI: 10,000 images
• Total of roughly 75K images. We balanced the final number of samples for training our network to avoid bias towards any particular expression.
Data Collection - Character Database
• Eight stylized characters
• An animator created the key poses for each expression
• Key poses were labeled via Mechanical Turk (MT) to populate the database initially
• We only used the expression key poses with at least 70% agreement among 50 Turkers for the same pose
• Interpolating between the key poses resulted in 60,000 images (around 8,000 images per character)
Data Pre-processing
1. Extract the face and 49 landmarks (IntraFace)
2. Register faces to an average frontal face via an affine transformation
3. Select the face bounding box
4. Resize to 256x256 pixels for analysis
[Registered faces: Disgust (CK+), Joy (DISFA), Anger (KDEF), Surprise (MMI)]
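A minimal sketch of the pre-processing pipeline above, assuming OpenCV and a detector that returns the 49 landmarks as a (49, 2) array (IntraFace in the talk); `register_face` and `mean_landmarks` are illustrative names, not the authors' code.

```python
import cv2
import numpy as np

def register_face(img: np.ndarray, landmarks: np.ndarray,
                  mean_landmarks: np.ndarray, size: int = 256) -> np.ndarray:
    """Register a face to the average frontal face, crop, and resize."""
    # Affine transform mapping this face's landmarks onto the average face.
    A, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32),
                                       mean_landmarks.astype(np.float32))
    warped = cv2.warpAffine(img, A, (img.shape[1], img.shape[0]))
    # Face bounding box around the template landmarks, then resize.
    x, y, w, h = cv2.boundingRect(mean_landmarks.astype(np.float32))
    return cv2.resize(warped[y:y + h, x:x + w], (size, size))
```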
Training networks
[Diagram: a Human Neural Network and a Stylized Character Neural Network each score the seven expression classes (Anger, Disgust, Fear, Joy, Neutral, Sadness, Surprise); a mapping finds the correlation between the corresponding expressions]
Network Training using Deep Learning
Data Augmentation
• 5 crops of 227x227: four corner crops + center crop
• Horizontal flip of each crop
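A small sketch of this ten-crop augmentation, assuming 256x256 NumPy images in (H, W, C) layout; the helper name is illustrative.

```python
import numpy as np

def ten_crop(img: np.ndarray, size: int = 227) -> list:
    """Four corner crops + center crop, each with its horizontal flip."""
    h, w = img.shape[:2]
    tops = [0, 0, h - size, h - size, (h - size) // 2]
    lefts = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [img[t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips
    return crops  # 10 augmented views per input image
```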
Training the human model
• 4 CONV layers, 4 POOL layers, 2 fully connected layers (see the sketch below)
Training the character model
• 3 CONV layers, 3 POOL layers, 2 fully connected layers
Fine-tuning the character model
• (N-1)-layer features (penultimate layer)
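A hedged PyTorch sketch of the human model described above (4 CONV, 4 POOL, 2 fully connected layers, 7 expression classes). Filter counts and kernel sizes are not given in the talk; the values below are assumptions chosen only to make the sketch run.

```python
import torch.nn as nn

class HumanCNN(nn.Module):
    """Illustrative human-expression CNN: 4 conv + 4 pool + 2 FC layers."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),    # 227 -> 113
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 113 -> 56
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 56 -> 28
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 28 -> 14
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 14 * 14, 1024), nn.ReLU(),  # penultimate (N-1) features
            nn.Linear(1024, num_classes),               # scores over 7 expressions
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```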
Network Architecture
• Human CNN (HCNN)
• Character CNN (CCNN)
• Shared CNN (SCNN)
When and How to Fine-tune?
• New dataset is small and similar to the original dataset:
  • Not a good idea to fine-tune the ConvNet (overfitting)
  • Train a linear classifier on the CNN codes
• New dataset is medium/large and similar to the original dataset:
  • Fine-tune through the full network (our shared CNN; see the sketch after this list)
• New dataset is small but very different from the original dataset:
  • Train an SVM classifier on activations from somewhere earlier in the network
• New dataset is large and very different from the original dataset:
  • Train from scratch, or initialize with weights from a pre-trained model
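A minimal PyTorch sketch of one of these strategies: freeze a pretrained feature extractor and retrain only the fully connected head. `HumanCNN` is the hypothetical class from the earlier sketch, standing in for a pretrained network.

```python
import torch

model = HumanCNN(num_classes=7)        # assume weights pretrained on human faces
for p in model.features.parameters():
    p.requires_grad = False            # keep the learned conv/pool features fixed

# Optimize only the parameters that still require gradients (the FC head).
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9,
)
```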
Transfer Learning
• FC6 features extracted from HCNN
• FC6 features extracted from SCNN
• Shared human-character feature space
Distance Metrics
• Extract features from the last fully connected layer of both models (the human expression model and the fine-tuned character expression model) and normalize the feature vectors
• To retrieve the stylized character expression closest to the human expression, combine:
  • Jensen-Shannon divergence distance on the (N-1)-layer expression feature vectors, for expression clarity
  • Geometric feature distance on the geometry feature vectors, for expression refinement
Jensen-Shannon divergence
• JS divergence is symmetrical and gives a finite value:

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)$$

• Kullback-Leibler divergence is given as

$$\mathrm{KL}(X \,\|\, M) = \sum_{i} X(i) \log \frac{X(i)}{M(i)}$$

where X and M are discrete probability distributions.
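A small NumPy sketch of this distance, applied to two hypothetical 7-class expression probability vectors.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Illustrative softmax outputs for a human query and a character candidate.
human = np.array([0.05, 0.02, 0.03, 0.70, 0.10, 0.05, 0.05])
char = np.array([0.04, 0.03, 0.03, 0.65, 0.15, 0.05, 0.05])
print(js_divergence(human, char))  # smaller value = closer expression match
```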
[Figure: multiple correct label results]
Geometric distance refinement
• Since expressions are mainly controlled by the muscles around the mouth, eyes, and eyebrows, we focus on features that characterize the shape and location of these parts of the face.
• We use the facial landmarks to extract geometric features, including the following measurements:
  • left/right eyebrow height
  • left/right eyelid height
  • nose width
  • left mouth corner to mouth center distance
  • right mouth corner to mouth center distance
• We normalize these feature vectors and compute the L2 distance between the human geometry vector and the character geometry vectors with the correct expression label. Finally, we re-order the retrieved images within the matched label based on matched geometry (see the sketch below).
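A hedged sketch of this refinement, assuming the 49 IntraFace landmarks arrive as a (49, 2) array; the landmark indices are placeholders, not the authors' exact choices, and only a few of the measurements are shown.

```python
import numpy as np

# Placeholder indices into a 49-point landmark layout (illustrative only).
L_BROW, L_EYE_TOP, L_EYE_BOT = 19, 23, 27
MOUTH_L, MOUTH_R, MOUTH_C = 31, 37, 34

def geometry_features(lm: np.ndarray) -> np.ndarray:
    """Distances characterizing brow, eyelid, and mouth shape."""
    feats = np.array([
        np.linalg.norm(lm[L_BROW] - lm[L_EYE_TOP]),     # eyebrow height
        np.linalg.norm(lm[L_EYE_TOP] - lm[L_EYE_BOT]),  # eyelid opening
        np.linalg.norm(lm[MOUTH_L] - lm[MOUTH_C]),      # left mouth corner to center
        np.linalg.norm(lm[MOUTH_R] - lm[MOUTH_C]),      # right mouth corner to center
    ])
    return feats / np.linalg.norm(feats)  # normalized feature vector

def geometry_distance(human_lm: np.ndarray, char_lm: np.ndarray) -> float:
    """L2 distance between normalized geometric feature vectors."""
    return float(np.linalg.norm(geometry_features(human_lm) -
                                geometry_features(char_lm)))
```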
Layers Visualization
[Figure: input image, conv1 filters, and conv1/conv2/conv3 feature maps. Prediction label: Surprise]
[Figure: top match results (Surprise and Joy) - query and character retrievals]
Expression-based Retrieval
[Figure: retrievals using CCNN vs. using HCNN]
Evaluation
How close is the retrieved character expression label to the human query expression label?
• Retrieval score
• Spearman rank correlation coefficient
• Kendall τ test
• Expert comparison
Retrieval Score
• We measured the retrieval performance of our method by calculating the average normalized rank of relevant results (0 is the best score)
• The evaluation score for a query human expression image q was calculated as follows:

$$\mathrm{score}(q) = \frac{1}{N \cdot N_{\mathrm{rel}}}\left(\sum_{k=1}^{N_{\mathrm{rel}}} R_k - \frac{N_{\mathrm{rel}}(N_{\mathrm{rel}}+1)}{2}\right)$$

where N is the number of images in the database, N_rel is the number of images whose expression label is relevant to q, and R_k is the rank assigned to the k-th relevant image.
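A tiny sketch of this score; `ranks` holds the 1-based positions of the relevant images in the returned ordering (names are illustrative).

```python
def retrieval_score(ranks: list, n_images: int) -> float:
    """Average normalized rank of relevant results; 0 is a perfect score."""
    n_rel = len(ranks)
    return (sum(ranks) - n_rel * (n_rel + 1) / 2) / (n_images * n_rel)

# Relevant images ranked first out of 1,000 images -> perfect score of 0.0.
print(retrieval_score([1, 2, 3], n_images=1000))
```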
[Chart: average retrieval score for each expression across all characters]
Sample expert comparison
[Figure: five test queries with retrieved characters at ranks 1-5, ordered by Expression only, by the Expert, and by Expression + Geometry]
Rank correlation coefficient
• Spearman correlation coefficient (the Pearson correlation applied to rank values)
  • The closer the value is to 1, the better the two rankings are correlated.
  • The average Spearman correlation coefficient for the 30 validation rank orderings is 0.773 ± 0.336.
  • Rank 1 correlation is 0.934 - the most relevant match!
• Kendall τ test
  • A pairwise measure of how many pairs are ranked discordantly; perfectly matching ranks get a τ value of 1 (see the sketch below).
  • The average Kendall correlation coefficient for the 30 validation rank orderings is 0.706 ± 0.355.
  • Rank 1 correlation is 0.910 - the most relevant match!
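A quick sketch of both rank-correlation checks using SciPy, comparing a retrieved ordering against an expert's ordering (the rank values are illustrative).

```python
from scipy.stats import kendalltau, spearmanr

ours = [1, 2, 3, 4, 5]    # our retrieval ranks for five results
expert = [1, 3, 2, 4, 5]  # the expert's ranks for the same results

rho, _ = spearmanr(ours, expert)
tau, _ = kendalltau(ours, expert)
print(rho, tau)  # each equals 1.0 only when the orderings agree perfectly
```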
Correlation metrics with expert
[Plot: Spearman and Kendall correlation coefficients for validation sets 1-30]
Part 2: Generating Character Expressions
[Diagram: a convolutional neural network (convolutional, max pooling, and fully connected layers with a softmax output) classifies the expression, e.g. Surprise]
The (N-1) feature vector from the penultimate fully connected layer is mapped to Maya parameters to learn the character model parameters (a sketch follows below).
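A hedged PyTorch sketch of this last step: a small regression head mapping the penultimate (N-1) feature vector to character rig (Maya) parameters. Both dimensions are assumptions; the talk does not give the rig parameter count.

```python
import torch.nn as nn

N_FEATURES = 1024   # penultimate feature size (assumed, matching the earlier sketch)
N_RIG_PARAMS = 30   # number of Maya rig controls (hypothetical)

# Regress rig parameters from the expression feature vector.
param_head = nn.Sequential(
    nn.Linear(N_FEATURES, 256), nn.ReLU(),
    nn.Linear(256, N_RIG_PARAMS),  # one output per rig control
)
```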
Preliminary Result
[Figure: Disgust expression query; Disgust expression parameter rendering]
Applications
• Improve visual storytelling applications: animated films, gaming, online marketing, VR/AR experiences, robotics
• Medically motivated application: teaching children with autism spectrum disorder (ASD) to both recognize and convey expressions using cartoon characters in an interactive environment
Expression retrieval work to be presented at Asian Conference on Computer Vision (Nov 2016).
Project webpage http://grail.cs.washington.edu/projects/deepexpr/
Questions?