deep learning for articulated human pose estimation
TRANSCRIPT
End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks
for Human Pose Estimation
Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang
Department of Electronic Engineering, The Chinese University of Hong Kong
Outline
Introduction
Methodology
Experiments
2
Human Pose Estimation What is Human Pose Estimation?
3Results are generated by the proposed methods (without temporal constraints).Video Credit: Peter Jasko solo - M-idzomer 2013
Human Pose Estimation is to recover the joint positions of articulated limbs of human body given an image or a video.
4Results are generated by the proposed methods (without temporal constraints).Video Credit: Peter Jasko solo - M-idzomer 2013
Activity Recognition Game and Animation Clothing Parsing
Applications
5
[Yamaguchi et al. CVPR’14]
Challenges Articulation
Forshortening
Clothing
6
Dantone et al. CVPR 2013
Structure Modeling
7
Figure Drawing
• Cylinders for each body parts
• Join up the cylinders
Pictorial Structures
• Unary templates
• Pairwise springs
Unable to handle large variations
Image credit:artintegrity.wordpress.com
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Springs
Unary template
Structure Modeling
8
Figure Drawing
• Cylinders for each body parts
• Join up the cylinders
Pictorial Structures
• Unary templates
• Pairwise springs
Unable to handle large variations
Mixtures of each part
• Unary template for each mixture type
• Clustering on (∆𝑥,∆𝑦)
Yang & Ramanan 2011Image credit:artintegrity.wordpress.com
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Graph Structures
• Trees• Latent trees• Loopy graphs• Fully connected
graphs
Deep Learning Based Methods
9
Benefits from better neural network architectures VGG
GoogLeNet,
ResNet
Structures are only used for post processing
Fully Convolutional Networks
Heat map prediction
Gap Between Deep Models and Structure Modeling
10
Gap
End-to-End Learning
Gap Between Deep Models and Structure Modeling
11
CNN
Local appearance is ambiguous
Forward
Backward
Motivation: Geometric Constraints Among Body Parts Helps in Learning Better Representation
12
Forward
Backward
Global consistency helps training
Outline
Introduction
Methodology
Experiments
13
Graph Models
𝐺 = (𝑉, 𝐸)
Vertices
Locations and mixture types of
body parts
Modeled by a front-end CNN
Edges
Pairwise spatial relationships
between body parts
Modeled by message passing
layers
14
message passing
CNN
Log
ari
thm
So
ftm
ax
Framework
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
15
Message Passing Layers
CNN
Log
ari
thm
So
ftm
ax
Framework
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =
𝑖∈𝑉
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)
+
𝑖,𝑗 ∈𝐸
𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗
16
Message Passing Layers
CNN
Log
ari
thm
So
ftm
ax
Framework
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =
𝑖∈𝑉
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)
17
𝒍 = 𝒍𝑖 = {(𝑥𝑖 , 𝑦𝑖)}: location of part 𝑖
Message Passing Layers
+
𝑖,𝑗 ∈𝐸
𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗
CNN
Log
ari
thm
So
ftm
ax
Framework
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =
𝑖∈𝑉
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)
18
𝒍 = 𝒍𝑖 = {(𝑥𝑖 , 𝑦𝑖)}: location of part 𝑖
𝒕 = 𝒕𝑖 : mixture type of part 𝑖
Message Passing Layers
+
𝑖,𝑗 ∈𝐸
𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗
CNN
Log
ari
thm
So
ftm
ax
Front-End CNNPart Appearance Terms
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =
𝑖∈𝑉
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)
19
+
𝑖,𝑗 ∈𝐸
𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗
Message Passing LayersSpatial Relationship Terms
CNN
Log
ari
thm
So
ftm
ax
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =
𝑖∈𝑉
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)
20
+
𝑖,𝑗 ∈𝐸
𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗
conv+
norm+
pool
conv+
norm+
pool
con
v
con
v
con
v
1x1 conv
+dropout
1x1
co
nv
Dro
po
ut
Dro
po
ut
3
3 3 3 3 3
1x1
co
nv
4096
4096
PxK
…
CNN
Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :
21
Front end CNN
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )
conv+
norm+
pool
conv+
norm+
pool
con
v
con
v
con
v
1x1 conv
+dropout
1x1
co
nv
Dro
po
ut
Dro
po
ut
3
3 3 3 3 3
1x1
co
nv
4096
4096
PxK
…
CNN
Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :
22
Front end CNN
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )
output of the front-end CNN
conv+
norm+
pool
conv+
norm+
pool
con
v
con
v
con
v
1x1 conv
+dropout
1x1
co
nv
Dro
po
ut
Dro
po
ut
3
3 3 3 3 3
1x1
co
nv
4096
4096
PxK
…
CNN
Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :
23
Front end CNN
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )
conv+
norm+
pool
conv+
norm+
pool
con
v
con
v
con
v
1x1 conv
+dropout
1x1
co
nv
Dro
po
ut
Dro
po
ut
3
3 3 3 3 3
1x1
co
nv
4096
4096
PxK
…
CNN
Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :
24
Front end CNN
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )
Softmax function
conv+
norm+
pool
conv+
norm+
pool
con
v
con
v
con
v
1x1 conv
+dropout
1x1
co
nv
Dro
po
ut
Dro
po
ut
3
3 3 3 3 3
1x1
co
nv
4096
4096
PxK
…
CNN
Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :
25
Front end CNN
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )
Logarithm
CNN
Log
ari
thm
So
ftm
ax
… … …
l.shou
neck
r.shou
r.knee
r.ankle
head
…
max
max
max
max
max
max
𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =
𝑖∈𝑉
𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)
26
+
𝑖,𝑗 ∈𝐸
𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗
Message Passing LayersSpatial Relationship Terms
𝜓 𝒍𝑖 , 𝒍𝑗, 𝑡𝑖 , 𝑡𝑗 𝐼;𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗 = 𝝎𝑖,𝑗
𝑡𝑖,𝑡𝑗 , 𝑑(𝒍𝑖 , 𝒍𝑗)
Spatial Relationship Terms
Quadratic deformation constraints
𝑑 𝒍𝑖 , 𝒍𝑗 = ∆𝑥, ∆𝑥2, ∆𝑦, ∆𝑦2
∆𝑥 = 𝑥𝑖 − 𝑥𝑗 , ∆𝑦 = 𝑦𝑖 − 𝑦𝑗
27
Yang Y, Ramanan D. Articulated pose estimation with flexible mixtures-of-parts. CVPR, 2011.
Message Passing: Max-Sum Algorithm
28
𝑚𝑖𝑗 𝒍𝑗 , 𝑡𝑗 ← 𝛼𝑚max𝒍𝑖,𝑡𝑖
𝑢𝑖 𝒍𝑖 , 𝑡𝑖 + 𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗The message passed from part 𝑖 to 𝑗
1
2
𝑚21
Message Passing: Max-Sum Algorithm
29
𝑚𝑖𝑗 𝒍𝑗 , 𝑡𝑗 ← 𝛼𝑚max𝒍𝑖,𝑡𝑖
𝑢𝑖 𝒍𝑖 , 𝑡𝑖 + 𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗2
𝑚21
1
𝑢𝑖 𝒍𝑖 , 𝑡𝑖 ← 𝛼𝑢 𝜙 𝒍𝑖 , 𝑡𝑖 + 𝑘∈𝑁(𝑖)
𝑚𝑘𝑖(𝒍𝑖 , 𝑡𝑖)The belief of part 𝑖
The message passed from part 𝑖 to 𝑗
𝑢𝑖 𝒍𝑖 , 𝑡𝑖 ← 𝛼𝑢 𝜙 𝒍𝑖 , 𝑡𝑖 + 𝑘∈𝑁(𝑖)
𝑚𝑘𝑖(𝒍𝑖 , 𝑡𝑖)
Message Passing: Max-Sum Algorithm
30
𝑚𝑖𝑗 𝒍𝑗 , 𝑡𝑗 ← 𝛼𝑚max𝒍𝑖,𝑡𝑖
𝑢𝑖 𝒍𝑖 , 𝑡𝑖 + 𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗
𝒍𝑖∗, 𝑡𝑖
∗ = argmax 𝑢𝑖∗ 𝒍𝑖 , 𝑡𝑖
𝒍𝑖,𝑡𝑖
Diagnosis on Message Passing Layers
31
1st
Diagnosis on Message Passing Layers
32
2nd
Diagnosis on Message Passing Layers
33
3rd
Outline
Introduction
Methodology
Experiments
34
Datasets
LSP
35
FLIC
Sports1000 training1000 testing
Films3987 training1016 testing
36
Qualitative Results on the LSP Dataset
37
38
PCP on the LSP dataset[Percentage of Correct Parts]
39
30
40
50
60
70
80
90
100
TORSO HEAD UPPER ARMS LOWER ARMS UPPER LEGS LOWER LEGS MEAN
STRICT PCP ON THE LSP DATASET
Yang&Ramanan, CVPR'11 Pishchulin et al., CVPR'13
Eichner&Ferrari, ACCV'13 Kiefel&Gehler, ECCV'14
Pose Machines, ECCV'14 Ouyang et al., CVPR'14
Pishchulin et al., ICCV'13 DeepPose, CVPR'14
Chen&Yuille, NIPS'14 Ours
Percentage of Detected Joints (PDJ)
40LSP FLIC
Our method
Sports1000 training1000 testing
Films3987 training1016 testing
Datasets
LSP
41
Image ParseFLIC
Activities100 training205 testing
Cross-dataset validation
42
GeneralizationImage Parse Dataset
30 50 70 90
TORSO
HEAD
UPPER ARMS
LOWER ARMS
UPPER LEGS
LOWER LEGS
MEAN
Ours
Ouyang et al., CVPR'14
Yang&Ramanan, TPAMI'13
Pishchulin et al., ICCV'13
Pishchulin et al., CVPR'13
Pishchulin et al., CVPR'12
Johnson&Everingham, CVPR'11
Yang&Ramanan, CVPR'11
Component Analysis
43
Unary Term vs. Full Model83.4
69
53.5
34.9
72.2
63.5
60.1
93
82.1
70.6
55.4
82.1
75.3
74.2
96.5
83.1
78.8
66.7
88.7
81.7
81.1
30
40
50
60
70
80
90
100
TORSO HEAD UPPER ARMS LOWER ARMS UPPER LEGS LOWER LEGS MEAN
STRICT PCP ON THE LSP DATASET (VGG-LG)
Unary term only
Full model (seperately trained)
Full model (jointly trained)
44
Tree-Structured Model vs. Loopy Model
45
96.2
83.4
78.7
65.8
87.9
81.1
80.7
96.5
83.1
78.8
66.7
88.7
81.7
81.1
62 72 82 92
Torso
Head
Upper Arms
Lower Arms
Upper Legs
Lower Legs
Mean
Loopy Model Tree Model
Human Pose Estimation in This CVPR
46
Chu et al. Structured Feature Learning for Pose Estimation.
Wei et al. Convolutional Pose Machines.
Carreira et al. Human Pose Estimation With Iterative Error Feedback.
CNNStructure
Summary
End-to-End Learning
Bridge the gap between structure modeling and deep learning
A new message passing layer, which is flexible to tree-structured/loopy relational graphs.
Scan for more details
CNNStructure
Thank you.Welcome to our poster session #5