-
Looking at / Sensing people
Ioannis Patras
www.eecs.qmul.ac.uk/~ioannisp
Centre for Intelligent Sensing
Queen Mary University of London
-
Related research
• Scene analysis
Object recognition / Semantic segmentation
• Motion Analysis
Motion estimation / segmentation
Object Tracking
• Facial (Expression) Analysis
Head tracking/Facial Feature Tracking
Facial expression recognition
• Action / Gesture Recognition
Spatio-temporal representations for action recognition
Pose estimation
• Brain Computer Interfaces
[Diagram: the research areas arranged along an axis from static analysis to dynamic vision, centred on looking at / sensing people]
-
Looking at/sensing people
• Facial (Expression) Analysis
Head tracking/Facial Feature Tracking
Facial expression recognition
• Action / Gesture Recognition
Action recognition and localisation
Pose estimation
• Brain Computer Interfaces
-
Introduction
Motivation
Vision-based analysis and understanding of human activities is becoming of paramount importance in a world centred on humans and overwhelmed with visual data.
Challenges
Detection, tracking, understanding
Applications
Visual Surveillance, Human Machine/Robot Interaction, Intelligent Systems, Multimedia Analysis, Ambient Intelligence
Related expertise
Computer Vision, Pattern Recognition, AI
-
Recognition and Localisation of Actions
Goal: recognize categories of actions and localize them in terms of their bounding box (space + time).
Challenges: occlusions, clutter, variations, …
Hypothesis: analysis can be restricted to a set of spatiotemporally 'interesting'/salient events.
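A minimal sketch of the salient-events idea, not the detector used in this work: candidate events are taken as the strongest responses of a simple temporal-difference energy. The video volume, the energy measure, and the `keep` count are illustrative choices.

```python
import numpy as np

def salient_events(video, keep=100):
    """video: (T, H, W) grayscale volume -> (t, y, x) of strongest responses."""
    # Temporal-difference energy: high where appearance changes over time.
    dt = np.diff(video.astype(np.float32), axis=0)
    energy = dt ** 2
    # Keep the `keep` strongest responses as candidate salient events.
    idx = np.argsort(energy, axis=None)[-keep:]
    return np.column_stack(np.unravel_index(idx, energy.shape))

rng = np.random.default_rng(0)
video = rng.random((16, 64, 64))      # stand-in for a real clip
print(salient_events(video, keep=5))  # (t, y, x) triples
```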
-
Implicit Shape Model (ISM)
Input training patches are clustered in appearance space; each cluster centre is a codeword (D1, D2, D3). Each codeword is associated with a vote map that gives the possible location of the hypothesis centre.
[Figure: codewords in appearance space, with the offsets stored at each codeword pointing from the patch to the hypothesis centre]
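A minimal sketch of this training stage, assuming patch descriptors and their offsets to the hypothesis centre have already been extracted; all names here are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_ism(descriptors, offsets, n_codewords=64):
    """Cluster training patches in appearance space; attach to each
    codeword the offsets (its vote map) of the patches assigned to it."""
    km = KMeans(n_clusters=n_codewords, n_init=10).fit(descriptors)
    vote_maps = [offsets[km.labels_ == c] for c in range(n_codewords)]
    return km, vote_maps

rng = np.random.default_rng(0)
descs = rng.random((500, 32))             # patch descriptors (appearance space)
offs = rng.integers(-20, 20, (500, 2))    # patch -> hypothesis-centre offsets
codebook, vote_maps = train_ism(descs, offs, n_codewords=8)
```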
-
Implicit Shape Model (ISM)
[Figure: test features $x_i$ are matched to codewords (D1–D3) in appearance space; the associated voting maps (S1 … SN) cast votes into the output Hough space]
-
Implicit Shape Model (ISM)
[Figure: the votes cast by all matched features are accumulated in the voting space of the output Hough space]
-
Implicit Shape Model (ISM)
[Figure: the hypothesis centre emerges as a peak of the accumulated votes in the output Hough space]
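At test time the voting can be summarised with a short sketch that continues the training sketch above (again with illustrative names): each test patch is assigned to its nearest codeword and casts that codeword's stored offsets into a Hough accumulator.

```python
import numpy as np

def hough_votes(descs, positions, codebook, vote_maps, out_shape):
    """Accumulate centre votes; peaks are object-centre hypotheses."""
    acc = np.zeros(out_shape)
    labels = codebook.predict(descs)          # nearest codeword per patch
    for pos, c in zip(positions, labels):
        for off in vote_maps[c]:
            y, x = pos + off                  # hypothesised centre
            if 0 <= y < out_shape[0] and 0 <= x < out_shape[1]:
                acc[y, x] += 1.0 / max(len(vote_maps[c]), 1)
    return acc
```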
-
Discriminative learning
• Higher weights for pdfs with higher localisation accuracy
• The class dictionary comprises discriminative codewords
• Adaboost on the codeword similarities
$w_{d_i} = \exp\left(\left|\log \frac{p(c \mid d_i)}{p(\bar c \mid d_i)}\right|\right) p(c \mid d_i)$
-
Discriminative Voting Score
$Y_c$: an area around the hypothesis centre of the training image.
Let $S(y)$ denote the probabilistic Hough score at location $y$.
The discriminative voting score:
$\sum_{y \in Y_c} S(y) - \sum_{y \notin Y_c} S(y)$
Objective: maximize the discriminative voting score over the training set.
[Figure: local features voting into the output Hough space, with the region $Y_c$ marked around the hypothesis centre]
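A one-function numpy sketch of this score; the circular region of radius `radius` standing in for $Y_c$ is our choice.

```python
import numpy as np

def discriminative_score(hough, centre, radius=5):
    """Hough mass inside Y_c (around the true centre) minus mass outside."""
    yy, xx = np.ogrid[:hough.shape[0], :hough.shape[1]]
    in_Yc = (yy - centre[0]) ** 2 + (xx - centre[1]) ** 2 <= radius ** 2
    return hough[in_Yc].sum() - hough[~in_Yc].sum()
```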
-
Goal: learn a task-dependent dictionary for the localization of actions.
Given an ISM built from features $\{x_i, l_i\}_{i=1}^{N}$ with Hough score $S(y)$, choose the dictionary that maximizes $\sum_{y \in Y_c} S(y) - \sum_{y \notin Y_c} S(y)$ around each training hypothesis centre $y_c$.
-
Action recognition
• KTH dataset – average: 88%
• HoHA dataset – average: 37%
-
Artificial occlusions and clutter
-
Detection Results
-
Regression Forests for Facial Analysis
[H. Yang, I. Patras, ACCV 2012]
[H. Yang, I. Patras, IEEE FG 2013]
[Figure: a regression forest; input training data and a test point are routed through the trees by the split functions at the nodes]
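To make the node mechanics concrete, here is a minimal sketch, with a feature/threshold scheme of our own, of how a split function can be chosen during training: candidate splits are scored by how much they reduce the variance of the target offsets.

```python
import numpy as np

def best_split(features, offsets, n_candidates=50, seed=0):
    """Pick the (feature, threshold) pair that minimises the
    size-weighted offset variance of the two children."""
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_candidates):
        f = rng.integers(features.shape[1])   # random feature dimension
        t = rng.choice(features[:, f])        # random threshold
        left = features[:, f] < t
        if left.all() or not left.any():
            continue                          # degenerate split
        cost = (offsets[left].var(axis=0).sum() * left.sum()
                + offsets[~left].var(axis=0).sum() * (~left).sum())
        if cost < best_cost:
            best, best_cost = (f, t), cost
    return best
```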
-
Regression Forests: Review
In previous methods for multi-part regression, each part is regressed separately, ignoring the interdependency between parts.
Notation: $Y = \{y_1, \dots, y_i, \dots, y_n\}$ are the target points; $X$ is the set of all image patches; $X_i$ is the set of image patches that are able to vote for point $i$.
Independence assumption in previous methods:
$p(y_i \mid X) = \sum_{x \in X_i} p(y_i \mid x), \quad p(y_j \mid X) = \sum_{x \in X_j} p(y_j \mid x)$
[Figure: some non-plausible results from Dantone et al., CVPR 2012]
-
SORF
Model learning: regression models for the base point and its neighbours. At each leaf $l$, the forest stores $p(y_i \mid l)$ for the base point and $p(y_j \mid y_i, l)$ for each neighbour $j \in Ne(i)$.
[Figure: (a) a predefined graph model over 20 facial landmarks; (b) the local graph related to point 17, with $p(y_{17} \mid l)$ and $p(y_j \mid y_{17}, l)$, $j \in Ne(17)$; (c) traditional independent voting; (d) structured-output (SO) voting]
-
SORF
Training. Each leaf $l$ stores:
• a regression model for the base point, $p(y_i \mid l)$, as a set of relative offset vectors and weights $\{\Delta_l^{ik}, \omega_l^{ik}\}$ (aggregated with a mean-shift method, as in Sun et al., CVPR 2012);
• a Gaussian shape model for the neighbours: $p(y_i \mid y_j, l) = N(d_l^i - d_l^j \mid \Delta_{jl}^i, \Lambda_{jl}^i)$.
Inference. When the feature patch $x_i$ extracted at position $\hat y_i^0$ reaches leaf $l$, the absolute vote position for $y_i$ is $\hat y_i = \hat y_i^0 + \Delta_l^{ik}$. The votes are aggregated as:
$p(y_i \mid x_i) = \sum_k \omega_l^{ik} \exp\left(-\frac{\lVert y_i - \hat y_i \rVert^2}{h_i^2}\right), \quad p(y_j \mid y_i) = K\left(\frac{y_j - (\hat y_i + \Delta_{il}^j)}{h_{ij}}\right)$
-
SORF
Combining the accumulated independent votes with the structured-output shape term:
$p(y_i \mid X) = \Big(\sum_{x \in X_i} p(y_i \mid x)\Big) \ast p\big(y_i \mid \{y_j\}_{j \in Ne(i)}\big)$
[Figure: patches $x_i$, $x_j$ inside the face bounding box $X$ voting for the landmarks $y_i$, $y_j$]
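A minimal sketch of this combination for one landmark, with Gaussian kernels and bandwidths of our own choosing: a kernel density over the unary votes is multiplied by a pairwise term that keeps $y_i$ near a neighbour's estimate plus the learned offset.

```python
import numpy as np

def so_score(grid, votes, weights, h_i, neighbour_est, delta, h_ij):
    """grid: (G,2) candidate positions for landmark i;
    votes/weights: (V,2)/(V,) absolute vote positions and weights;
    neighbour_est/delta: a neighbour's estimate and learned offset."""
    # Unary term: weighted Gaussian kernel density over the votes.
    d2 = ((grid[:, None, :] - votes[None, :, :]) ** 2).sum(-1)
    unary = (weights * np.exp(-d2 / h_i ** 2)).sum(1)
    # Pairwise term: y_i should lie near (neighbour estimate + offset).
    d2n = ((grid - (neighbour_est + delta)) ** 2).sum(-1)
    pairwise = np.exp(-d2n / h_ij ** 2)
    return unary * pairwise   # argmax over grid = structured estimate
```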
-
SORF vs. RF on BioID dataset
-
Slide 22
Privileged Information CRF
Training patches are triplets $(x, y^*, y)$, where $y^*$ is the privileged information and $y$ the $i$-th point offsets.
[Figure: split selection in a standard RF, $\phi^* = \arg\max_\phi IG_y(\phi)$, versus a PI-RF, $\phi^* = \arg\max_\phi IG_{y^*}(\phi)$]
Three models at each leaf node:
1. $p(y_k^* \mid l) = n_k / n$
2. $p(y_i \mid y_k^*, l) = \{\Delta_{il}^k, \omega_{il}^k\}$
3. $p(y_j - y_i \mid y_k^*, l) = N(d_j - d_i \mid \mu_{ij}, \Sigma_{ij})$
Privileged information is only available during training; it is used for:
1. tree growing
2. learning the conditional models (similar to the CRF in [Sun et al., CVPR 2012])
-
Slide 23
PI-CRF: Inference
Privileged information is only available during training, so at test time it is estimated first:
$p(y_k^* \mid X) = \sum_{x \in X} \sum_{l \in L_x} p(y_k^* \mid l), \quad \hat y_k^* = \arg\max_{y_k^* \in Y_k^*} p(y_k^* \mid X)$
where $L_x$ is the set of leaf nodes at which the patch $x$ arrived.
Then $\hat y_k^*$ is used in the subsequent steps to select the regression and shape models
$p(y_i \mid y_k^*, l) = \{\Delta_{il}^k, \omega_{il}^k\}, \quad p(y_j - y_i \mid y_k^*, l) = N(d_j - d_i \mid \mu_{ij}, \Sigma_{ij})$
for which $y_k^* = \hat y_k^*$.
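The two-stage logic reads naturally as code; here is a minimal sketch with toy data structures of our own, each leaf holding `p_state` (the distribution $p(y_k^* \mid l)$) and `cond_offsets` (one offset model per privileged state):

```python
import numpy as np

def pi_crf_infer(leaves_reached, n_states):
    # Stage 1: estimate the privileged state (e.g. a pose bin) by
    # summing p(y_k*|l) over all leaves reached by the image patches.
    p_state = np.zeros(n_states)
    for leaf in leaves_reached:
        p_state += leaf["p_state"]                # shape (n_states,)
    k_hat = int(np.argmax(p_state))
    # Stage 2: vote with the conditional models selected by k_hat;
    # the returned offsets are then aggregated as in SORF.
    votes = [leaf["cond_offsets"][k_hat] for leaf in leaves_reached]
    return k_hat, votes
```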
-
Slide 24
PI-CRF: Experiments
Experiments with three types of privileged information were conducted on the LFW dataset.
Head-pose yaw and roll show promising improvements, while gender information does not.
[LFW database: http://vis-www.cs.umass.edu/lfw/]
-
Slide 25
PI-CRF: Experiments
The first row shows the detection results of [Dantone et al., CVPR 2012];
the second row shows our detection results [Yang and Patras, IEEE FG 2013].
-
Slide 26
PI-CRF: Experiments
The advantages of our method:
1. Shared tree structure, so no additional forest needs to be trained for head pose estimation
2. Fusion of different types of useful privileged information
3. Taking structure into account
4. Better performance
[Figure: overall performance on the LFW dataset]
-
References
J. Shotton et al. Efficient regression of general-activity human poses from depth images. ICCV 2011.
J. Shotton et al. Real-time human pose recognition in parts from single depth images. CVPR 2011.
A. Criminisi et al. Regression forests for efficient anatomy detection and localization in CT studies. Medical Computer Vision: Recognition Techniques and Applications in Medical Imaging, 2011.
M. Sun et al. Conditional regression forests for human pose estimation. CVPR 2012.
M. Dantone et al. Real-time facial feature detection using conditional regression forests. CVPR 2012.
H. Yang, I. Patras. Face parts localization using structured-output regression forests. ACCV 2012.
H. Yang, I. Patras. Privileged information-based conditional regression forests for facial feature detection. IEEE FG 2013.
-
Pose-invariant Facial Expression Recognition
Rudovic, Patras, Pantic, ECCV 2010
Rudovic, Patras, Pantic, TPAMI (to appear)
[Figure: example expressions: Anger, Surprise, Sadness, Disgust, Fear, Happiness]
-
Pose-invariant FER: Our Approach
[Figure: pipeline: head pose estimation → pose normalization → facial expression classification; output emotion: SURPRISE]
-
Pose-invariant FER: Pose Normalization
[Figure: the same pipeline, highlighting the pose-normalization stage]
-
Pose-invariant FER: Experiments
Experiments were conducted on the BU-3DFE and Multi-PIE databases.
Input to the system: the positions of 39 facial landmarks.
Overview of the conducted experiments:
1. BU-3DFE: evaluation of the CGPR model in terms of (a) head-pose-normalization accuracy, (b) robustness to noise, and (c) facial expression classification (balanced dataset).
2. Multi-PIE: evaluation of the CGPR model in terms of (a) head-pose-normalization accuracy and (b) facial expression classification (unbalanced dataset).
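The CGPR model itself is beyond a slide, but the shape of the pose-normalization step can be sketched with plain Gaussian-process regression standing in for it: one model per non-frontal pose, mapping observed landmark positions to their frontal counterparts. All data here are synthetic stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
frontal = rng.random((200, 39 * 2))       # 39 (x, y) landmarks, frontal pose
# Stand-in for the same faces observed at a non-frontal pose.
posed = frontal + 0.1 * (frontal @ rng.random((78, 78)))

gp = GaussianProcessRegressor().fit(posed, frontal)  # one model per pose
normalized = gp.predict(posed[:5])        # landmarks mapped back to frontal
print(np.abs(normalized - frontal[:5]).mean())
```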
-
Pose-invariant FER: Experiments
- tp: training poses; ntp: non-training poses
- 7 facial expressions: Surprise, Anger, Happiness, Neutral, Disgust, Fear, Sadness
- 247 poses (35 used for training)
- 5-fold person-independent cross-validation
- 50 subjects (54% female)
[Figure: pose range from (-45, -30) to (+45, +30) degrees]
-
Pose-invariant FER: Experiments
-
Pose-invariant FER: Experiments
-
Implicit tagging via EEG and Face Analysis
Recognition of Affective States: Arousal, Valence, Control
-
Implicit tagging via EEG and Face Analysis
-
Implicit tagging via EEG and Face Analysis
-
Implicit tagging via EEG and Face Analysis
-
Possible collaborations
• Facial (Expression) Analysis
• Body gesture analysis (pose estimation, tracking, action recognition)
• Multimodal analysis for affect recognition