@affectiva
Facial expression and emotion detection for mobile
Jay Turcot
Director of Applied AI
Affectiva
What if technology could identify emotions as humans can?
Our vision is to humanize technology with Emotion Intelligence by enabling machines to be emotion-aware and by allowing businesses to get emotion analytics.
How it works
• Face detection and tracking
• Facial action & attribute classification
• Facial expression interpretation
Task: Facial expression recognition
• Multi-attribute classification (~20+ classes)
• Upright, fixed-size, grayscale
• Fast enough to run on-device!
Example attributes: Brow furrow, Brow raise, Smile
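Multi-attribute classification means several attributes can be present in the same face at once, so each output is usually scored independently rather than competing in a single softmax. The sketch below illustrates that idea only; the attribute names, the sigmoid outputs, and the 0.5 threshold are assumptions for illustration, not details taken from the slides.

```python
import math

# Hypothetical attribute names; the slides mention 20+ classes but name only a few.
ATTRIBUTES = ["brow_furrow", "brow_raise", "smile"]


def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_attributes(logits, threshold=0.5):
    """Turn one raw score per attribute into independent present/absent decisions."""
    if len(logits) != len(ATTRIBUTES):
        raise ValueError("expected one logit per attribute")
    probs = {name: sigmoid(z) for name, z in zip(ATTRIBUTES, logits)}
    # Each attribute is thresholded on its own, so any subset can be active.
    active = [name for name, p in probs.items() if p >= threshold]
    return probs, active


probs, active = predict_attributes([-2.0, 1.5, 3.0])
print(active)
```

Because the decisions are independent, a raised brow and a smile can both be reported for the same frame.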
Emotion AI platform built on deep learning
Sadness, Joy, Anger, Surprise, Contempt, Disgust, Fear
Age, Ethnicity, Gender
• Input: Labeled and unlabeled video (+voice) data. Metadata. Latest training used 1M+ images.
• Convolutional Neural Networks
• Output: 11 facial expressions, gender
Training Setup
FAST
• NVIDIA Titan X (Pascal)
  • 12 GB, 3584 CUDA cores
• NVIDIA CUDA® Deep Neural Network
  • CUDA 8.0, cuDNN 5
SIMPLE
• Keras + TensorFlow, Docker
  • TensorFlow 1.0, nvidia-docker
A few tips on training
Use all annotated data available!
Use every frame in a video as well as partially annotated data
[Figure: Training Loss, Testing Loss, and Accuracy (overall) for three training sets: sampled frames, all frames, and all frames plus partially annotated data]
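One common way to train on partially annotated data is to compute the loss only over the attributes that actually have a label and skip the rest. The slides do not say how this was done, so the following is a minimal sketch under that assumption; the `None` marker for a missing annotation and the function name `masked_bce` are hypothetical.

```python
import math


def masked_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over annotated attributes only.

    labels holds 1 or 0 for present/absent and None where no annotation
    exists (the None convention is an assumption of this sketch).
    """
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:
            continue  # unannotated attribute: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from zero
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


fully_labeled = masked_bce([0.9, 0.2, 0.7], [1, 0, 1])
partly_labeled = masked_bce([0.9, 0.2, 0.7], [1, None, None])
print(fully_labeled, partly_labeled)
```

With this kind of masking, a frame that is labeled for only one attribute can still contribute a training signal for that attribute.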
Balancing data isn’t strictly required
Classes with ~3 times more data
[Figure: accuracy for Gender, Smile, Brow raise, and Brow furrow under balanced sampling and unique sampling]
• Balanced: 90.5%
• Natural (unbalanced): 90.8%
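The difference between natural and balanced sampling can be sketched as follows. This is an illustration only: the toy 3:1 label list, the function names, and the class-first sampling scheme are assumptions, not the procedure used for the results above.

```python
import random
from collections import Counter


def natural_sample(labels, n, rng):
    """Draw n indices uniformly at random, so class frequencies follow the data."""
    return [rng.randrange(len(labels)) for _ in range(n)]


def balanced_sample(labels, n, rng):
    """Draw n indices by choosing a class uniformly, then an example of that class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(n)]


rng = random.Random(0)
labels = ["no_smile"] * 300 + ["smile"] * 100  # toy data with a 3:1 imbalance
natural = Counter(labels[i] for i in natural_sample(labels, 4000, rng))
balanced = Counter(labels[i] for i in balanced_sample(labels, 4000, rng))
print(natural, balanced)
```

Natural sampling keeps roughly the 3:1 ratio of the data, while balanced sampling draws each class about equally often, at the cost of repeating minority-class examples.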
Building fast models
Speeding up deep learning models
Several approaches are used for speeding up models:
• Model pruning
• Model compression
• Model quantization
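As an illustration of the first approach, magnitude-based pruning removes the weights with the smallest absolute values. The slides do not specify a pruning criterion, so this is a sketch of one widely used option; `prune_by_magnitude` and the example weights are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices of the k smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(weights, 0.5)
print(pruned)
```

In practice a pruned network is usually fine-tuned afterwards to recover accuracy, and the speedup depends on whether the runtime can exploit the zeros.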
Match architecture to the problem
Avoid network architectures that are larger than needed
Problem: Object detection (& classification)
• Details: 1000 classes; ~224x224 pixels, color; objects with arbitrary scales / positions / orientations
• Architectures: VGG-16 [1], 16 layers (~30.9 GOP/image); ResNet [2], 152 layers (~22.6 GOP/image); others: Inception v4, E-Net
Problem: Facial action & attribute classification
• Details: 20+ classes; ~100x100 pixels, grayscale; faces only, upright & registered
• Architectures: ?
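Part of the gap between the two problems comes from input size alone. The sketch below estimates the cost of a single convolution layer on each input, counting two operations per multiply-accumulate. The layer itself (32 filters of size 3x3, stride 1, same padding) is a hypothetical example and not a layer from any of the networks named above.

```python
def conv_ops(height, width, in_ch, out_ch, kernel, stride=1):
    """Approximate operations for one same-padded conv layer (2 per multiply-accumulate)."""
    out_h = -(-height // stride)  # ceiling division
    out_w = -(-width // stride)
    macs = out_h * out_w * out_ch * in_ch * kernel * kernel
    return 2 * macs


# The same hypothetical first layer applied to the two input types.
color_224 = conv_ops(224, 224, in_ch=3, out_ch=32, kernel=3)
gray_100 = conv_ops(100, 100, in_ch=1, out_ch=32, kernel=3)
print(color_224, gray_100, color_224 / gray_100)
```

For this layer the 224x224 color input costs about 15 times more than the 100x100 grayscale input, before any difference in depth or width is considered.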
Lots of big filters are expensive!
Use smaller filters to condense information
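The slide does not say which small-filter techniques are meant, so the arithmetic below covers two common readings: replacing one large filter with a stack of 3x3 filters, and using 1x1 filters to shrink the channel count before a 3x3 convolution. The channel counts (64 and 16) are assumptions chosen for illustration.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Weights in one conv layer, biases ignored.

    This is also the number of multiply-accumulates per output pixel.
    """
    return kernel * kernel * in_ch * out_ch


channels = 64  # hypothetical layer width

# Reading 1: two stacked 3x3 layers see the same 5x5 region as one 5x5 layer.
one_5x5 = conv_weights(5, channels, channels)
two_3x3 = 2 * conv_weights(3, channels, channels)

# Reading 2: 1x1 layers condense 64 channels to 16 around a 3x3 layer.
plain_3x3 = conv_weights(3, channels, channels)
bottleneck = (
    conv_weights(1, channels, 16)
    + conv_weights(3, 16, 16)
    + conv_weights(1, 16, channels)
)
print(one_5x5, two_3x3, plain_3x3, bottleneck)
```

Under these assumptions the stacked 3x3 layers need 72% of the weights of the 5x5 layer, and the bottleneck needs roughly 12% of the weights of the plain 3x3 layer.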
Look for redundancy in your layers
Small filters are faster… but can be highly correlated
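One simple way to look for this kind of redundancy is to flatten each filter in a layer and measure the pairwise Pearson correlation. The slides do not describe the method used, so this is only a sketch; the function names, the 0.9 threshold, and the toy 3x3 filters are assumptions.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def correlated_filter_pairs(filters, threshold=0.9):
    """Return (i, j, r) for filter pairs whose absolute correlation exceeds threshold."""
    flat = [[w for row in f for w in row] for f in filters]
    pairs = []
    for i, j in combinations(range(len(flat)), 2):
        r = pearson(flat[i], flat[j])
        if abs(r) > threshold:
            pairs.append((i, j, r))
    return pairs


# Toy filters: a horizontal edge detector, a rescaled copy of it, and a vertical one.
edge_x = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
edge_x_copy = [[0.5 * w + 0.1 for w in row] for row in edge_x]
edge_y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
pairs = correlated_filter_pairs([edge_x, edge_x_copy, edge_y])
print(pairs)
```

A pair flagged this way is a candidate for merging or removal, since the two filters respond to nearly the same pattern.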
Small networks still work very well… for simpler problems
• CNN (small): 7.14 MFLOPs, 92.99% accuracy
• CNN (medium): 21.97 MFLOPs, 93.09% accuracy
• ENet: 15.96 MFLOPs, 92.97% accuracy
Result
• Smaller models (<10 MFLOPs) still outperform traditional methods
• Don’t just copy architectures like VGG (30+ GOPs)
• Explore network architectures that prioritize efficiency (E-Net)
• Other methods still apply to improve runtime performance:
  • Quantize models to 8-bit fixed point or binary
  • Prune models
• Models deployed in our on-device SDK: http://developer.affectiva.com/
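The 8-bit quantization mentioned above can be illustrated with symmetric linear quantization, which maps each float weight to an integer in [-127, 127] using a single scale factor. This is one common scheme and an assumption of this sketch; the slides do not state which fixed-point format is used, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the 8-bit range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each restored weight differs from the original by at most half a quantization step, which is why 8-bit models often lose little accuracy while using a quarter of the storage of 32-bit floats.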
Questions