deep learning for articulated human pose estimation

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang

Department of Electronic Engineering, The Chinese University of Hong Kong

Outline

Introduction

Methodology

Experiments

Human Pose Estimation

What is Human Pose Estimation?

Human pose estimation recovers the joint positions of the articulated limbs of the human body from an image or a video.

Results are generated by the proposed method (without temporal constraints). Video credit: Peter Jasko solo - M-idzomer 2013

Applications

• Activity Recognition

• Game and Animation

• Clothing Parsing

[Yamaguchi et al. CVPR’14]

Challenges

• Articulation

• Foreshortening

• Clothing

Dantone et al. CVPR 2013

Structure Modeling

Figure Drawing

• Cylinders for each body part

• Join up the cylinders

Pictorial Structures (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)

• Unary templates

• Pairwise springs

Unable to handle large variations

Mixtures of each part (Yang & Ramanan 2011)

• Unary template for each mixture type

• Clustering on (∆𝑥,∆𝑦)

Image credit: artintegrity.wordpress.com
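The "Clustering on (∆𝑥,∆𝑦)" step can be illustrated with a minimal sketch. It assumes plain k-means on the relative offset between a part and its neighbour, so that each cluster index serves as a mixture type. The function name `kmeans_offsets`, the deterministic initialisation, and the six offsets are hypothetical and chosen only for illustration.

```python
import math

def kmeans_offsets(offsets, k, iters=20):
    """Plain k-means on 2-D offsets (dx, dy); returns (centers, labels)."""
    centers = [offsets[i] for i in range(k)]  # deterministic init: first k points
    labels = [0] * len(offsets)
    for _ in range(iters):
        # Assignment step: index of the nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in offsets
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(offsets, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels

# Hypothetical part-minus-neighbour offsets (pixels) from six training poses.
offsets = [(0.0, 20.0), (18.0, 1.0), (1.0, 22.0),
           (21.0, -2.0), (-1.0, 19.0), (19.0, 0.0)]
centers, mixture_types = kmeans_offsets(offsets, k=2)
print(mixture_types)  # [0, 1, 0, 1, 0, 1]
```

Here the two clusters separate "neighbour below" from "neighbour to the side", which is the kind of orientation information a mixture type is meant to capture.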

Graph Structures

• Trees

• Latent trees

• Loopy graphs

• Fully connected graphs

Deep Learning Based Methods

Benefit from better neural network architectures: VGG, GoogLeNet, ResNet

Structures are only used for post-processing

Fully Convolutional Networks

Heat map prediction
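As a rough illustration of heat map prediction, the sketch below reads a joint location off a score map by taking its highest-scoring cell. The helper `heatmap_argmax` and the 3x4 map are hypothetical; a real network would output one such map per joint at a much higher resolution.

```python
def heatmap_argmax(heatmap):
    """Return (x, y, score) of the highest-scoring cell of a 2-D heat map."""
    best = None
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if best is None or score > best[2]:
                best = (x, y, score)
    return best

# Hypothetical heat map for a single joint (rows are y, columns are x).
heatmap = [
    [0.01, 0.02, 0.01, 0.00],
    [0.03, 0.10, 0.70, 0.05],
    [0.01, 0.04, 0.02, 0.00],
]
x, y, score = heatmap_argmax(heatmap)
print(x, y, score)  # 2 1 0.7
```

Taking the argmax of each map independently is exactly the setting in which structure is absent or only applied afterwards.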

Gap Between Deep Models and Structure Modeling

End-to-End Learning

Local appearance is ambiguous

Motivation: Geometric Constraints Among Body Parts Help in Learning Better Representations

[Figure labels: Forward, Backward]

Global consistency helps training

Methodology

Graph Models

𝐺 = (𝑉, 𝐸)

Vertices 𝑉

• Locations and mixture types of body parts

• Modeled by a front-end CNN

Edges 𝐸

• Pairwise spatial relationships between body parts

• Modeled by message passing layers

Framework

[Figure: framework overview with part labels l.shou, r.shou, …, r.knee, r.ankle, and the message passing layers]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \boldsymbol{\theta}) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$

$\mathbf{l} = \{\mathbf{l}_i\} = \{(x_i, y_i)\}$: location of part $i$

$\mathbf{t} = \{t_i\}$: mixture type of part $i$
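The full score of one pose configuration is the sum of the part appearance terms over the vertices and the spatial relationship terms over the edges. The sketch below evaluates that sum for a two-part toy graph. `full_score`, `toy_phi`, and `toy_psi` are hypothetical stand-ins, not the learned CNN and message passing terms.

```python
def full_score(locations, types, edges, phi, psi):
    """F(l, t) = sum over parts of phi + sum over edges of psi."""
    unary = sum(phi(i, locations[i], types[i]) for i in locations)
    pairwise = sum(
        psi(i, j, locations[i], locations[j], types[i], types[j])
        for i, j in edges
    )
    return unary + pairwise

# Hypothetical stand-ins for the learned terms.
def toy_phi(i, l, t):
    return -0.5 if t == 0 else -1.0

def toy_psi(i, j, li, lj, ti, tj):
    # Prefer part i to sit 10 pixels below part j (image y grows downwards).
    dx = li[0] - lj[0]
    dy = li[1] - lj[1] - 10
    return -0.01 * (dx * dx + dy * dy)

locations = {"r.knee": (50, 40), "r.ankle": (52, 52)}
types = {"r.knee": 0, "r.ankle": 1}
edges = [("r.ankle", "r.knee")]
score = full_score(locations, types, edges, toy_phi, toy_psi)
print(round(score, 2))  # -1.58
```

Pose estimation then amounts to searching for the locations and mixture types that maximise this score.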

Front-End CNN: Part Appearance Terms

[Figure: front-end CNN architecture, including 1x1 conv + dropout layers]

Local confidence of the appearance of part $i$ with mixture type $t_i$:

$$\phi(\mathbf{l}_i, t_i \mid I; \boldsymbol{\theta}) = \log p(\mathbf{l}_i, t_i \mid I; \boldsymbol{\theta}) = \log \sigma(f(\mathbf{l}_i, t_i \mid I; \boldsymbol{\theta}))$$

• $f(\cdot)$: output of the front-end CNN

• $\sigma(\cdot)$: softmax function

• $\log$: logarithm
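The appearance term is a log-softmax of the network output. A minimal sketch follows, assuming the softmax is taken over the candidate classes at a single image location. That normalisation axis and the four raw scores are assumptions made for illustration.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a flat list of raw scores."""
    m = max(scores)  # subtract the maximum before exponentiating
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

# Hypothetical raw outputs f at one location, one value per candidate class.
f = [2.0, 0.5, -1.0, 0.0]
phi = log_softmax(f)

# The exponentiated terms form a probability distribution.
print(round(sum(math.exp(p) for p in phi), 6))  # 1.0
```

Because the logarithm is monotonic, the class with the highest raw score also has the highest appearance term.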

Message Passing Layers: Spatial Relationship Terms

$$\psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j}) = \langle \boldsymbol{\omega}_{i,j}^{t_i,t_j}, \mathbf{d}(\mathbf{l}_i, \mathbf{l}_j) \rangle$$

Quadratic deformation constraints

$$\mathbf{d}(\mathbf{l}_i, \mathbf{l}_j) = [\Delta x, \Delta x^2, \Delta y, \Delta y^2]^{\top}, \quad \Delta x = x_i - x_j, \quad \Delta y = y_i - y_j$$

Yang Y, Ramanan D. Articulated pose estimation with flexible mixtures-of-parts. CVPR, 2011.
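The spatial relationship term is an inner product between a weight vector and the quadratic deformation feature. The sketch below computes both. It also shows how a weight vector can encode a spring with a preferred offset. `spring_weights` and its parameters are hypothetical, and the weights in the actual model are learned per pair of mixture types.

```python
def deformation(li, lj):
    """d(l_i, l_j) = (dx, dx^2, dy, dy^2) with dx = x_i - x_j, dy = y_i - y_j."""
    dx = li[0] - lj[0]
    dy = li[1] - lj[1]
    return (dx, dx * dx, dy, dy * dy)

def psi(li, lj, w):
    """Inner product <w, d(l_i, l_j)>."""
    return sum(wk * dk for wk, dk in zip(w, deformation(li, lj)))

def spring_weights(dx0, dy0, stiffness):
    """Weights for -stiffness * ((dx - dx0)^2 + (dy - dy0)^2), up to a constant."""
    return (2 * stiffness * dx0, -stiffness, 2 * stiffness * dy0, -stiffness)

w = spring_weights(0, 10, 0.01)  # hypothetical rest offset (0, 10)
at_rest = psi((50, 60), (50, 50), w)      # offset (0, 10)
stretched = psi((50, 64), (50, 50), w)    # offset (0, 14)
shifted = psi((53, 60), (50, 50), w)      # offset (3, 10)
print(at_rest > stretched, at_rest > shifted)  # True True
```

The score peaks at the rest offset and falls off quadratically, which is the "pairwise spring" of pictorial structures expressed as a linear function of the deformation feature.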

Message Passing: Max-Sum Algorithm

The message passed from part $i$ to $j$:

$$m_{ij}(\mathbf{l}_j, t_j) \leftarrow \alpha_m \max_{\mathbf{l}_i, t_i} \left( u_i(\mathbf{l}_i, t_i) + \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j) \right)$$

The belief of part $i$:

$$u_i(\mathbf{l}_i, t_i) \leftarrow \alpha_u \left( \phi(\mathbf{l}_i, t_i) + \sum_{k \in N(i)} m_{ki}(\mathbf{l}_i, t_i) \right)$$

Prediction:

$$(\mathbf{l}_i^*, t_i^*) = \arg\max_{\mathbf{l}_i, t_i} u_i^*(\mathbf{l}_i, t_i)$$
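A small sketch of max-sum on a three-part chain follows. It uses the textbook variant in which the message from part i to part j leaves out what j previously sent to i, and it sets both normalisation factors to 1. Both choices are assumptions that differ from the update rules as written on the slide. The part names, scores, and rest offsets are hypothetical toy values.

```python
EDGES = [("hip", "knee"), ("knee", "ankle")]  # (parent, child)
REST = {0: (0, 1), 1: (1, 0)}  # preferred child-minus-parent offset per child type
LOC_SCORE = {
    "hip": {(0, 0): -0.2, (1, 0): -1.5},
    "knee": {(0, 1): -0.9, (1, 1): -0.7},
    "ankle": {(0, 2): -0.4, (2, 1): -0.6},
}
TYPE_BIAS = {0: 0.0, 1: -0.1}

# A state is a (location, mixture type) pair.
states = {i: [(l, t) for l in LOC_SCORE[i] for t in TYPE_BIAS] for i in LOC_SCORE}
phi = {i: {(l, t): LOC_SCORE[i][l] + TYPE_BIAS[t] for (l, t) in states[i]}
       for i in states}

def psi(i, j, si, sj):
    """Quadratic spring on an edge, evaluated with the parent state first."""
    (lp, _), (lc, tc) = (si, sj) if (i, j) in EDGES else (sj, si)
    rx, ry = REST[tc]
    dx, dy = lc[0] - lp[0], lc[1] - lp[1]
    return -0.5 * ((dx - rx) ** 2 + (dy - ry) ** 2)

def max_sum(states, phi, psi, edges, iters):
    """Flooding-schedule max-sum; returns beliefs u[i][s] for every part i."""
    nbrs = {i: [] for i in states}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # msgs[(i, j)][sj] is the message from part i to part j, initially zero.
    msgs = {(i, j): {s: 0.0 for s in states[j]} for i in states for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            new[(i, j)] = {
                sj: max(
                    phi[i][si]
                    + sum(msgs[(k, i)][si] for k in nbrs[i] if k != j)
                    + psi(i, j, si, sj)
                    for si in states[i]
                )
                for sj in states[j]
            }
        msgs = new
    # Belief: appearance term plus all incoming messages.
    return {
        i: {s: phi[i][s] + sum(msgs[(k, i)][s] for k in nbrs[i]) for s in states[i]}
        for i in states
    }

beliefs = max_sum(states, phi, psi, EDGES, iters=3)
best = {i: max(beliefs[i], key=beliefs[i].get) for i in beliefs}
print(best)
```

On a tree the beliefs become exact max-marginals once the number of iterations reaches the graph diameter, so the per-part argmax recovers the jointly best pose. On loopy graphs the same updates can still be run, but the result is only approximate.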

Diagnosis on Message Passing Layers

Experiments

Datasets

• LSP: sports, 1000 training, 1000 testing

• FLIC: films, 3987 training, 1016 testing

Qualitative Results on the LSP Dataset

PCP (Percentage of Correct Parts) on the LSP Dataset

[Chart: strict PCP on the LSP dataset for torso, head, upper arms, lower arms, upper legs, lower legs, and mean. Methods compared: Yang & Ramanan, CVPR'11; Pishchulin et al., CVPR'13; Eichner & Ferrari, ACCV'13; Kiefel & Gehler, ECCV'14; Pose Machines, ECCV'14; Ouyang et al., CVPR'14; Pishchulin et al., ICCV'13; DeepPose, CVPR'14; Chen & Yuille, NIPS'14; Ours]
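For reference, strict PCP is commonly defined as follows: a limb counts as correct only if both of its predicted endpoints lie within half the ground-truth limb length of their true positions. The sketch below follows that common definition. The function name, the 0.5 default, and the two limbs are illustrative assumptions, not the evaluation code behind the chart.

```python
import math

def strict_pcp(pred_limbs, gt_limbs, thresh=0.5):
    """Fraction of limbs with both endpoints within thresh * limb length."""
    correct = 0
    for (p1, p2), (g1, g2) in zip(pred_limbs, gt_limbs):
        limit = thresh * math.dist(g1, g2)
        if math.dist(p1, g1) <= limit and math.dist(p2, g2) <= limit:
            correct += 1
    return correct / len(gt_limbs)

# Hypothetical ground truth and prediction for two limbs of length 10.
gt_limbs = [((0, 0), (0, 10)), ((0, 10), (0, 20))]
pred_limbs = [((1, 0), (0, 12)), ((0, 12), (0, 27))]
print(strict_pcp(pred_limbs, gt_limbs))  # 0.5
```

The second limb fails because one endpoint is 7 pixels off, which exceeds the 5-pixel limit even though the other endpoint is accurate.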

Percentage of Detected Joints (PDJ)

[Plots: PDJ on LSP and FLIC, with the curve for "Our method" labeled]
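PDJ is usually reported as a curve: a joint counts as detected when its error is within a given fraction of the torso diameter, and the fraction is swept over a range. The sketch below assumes that common definition, with the torso diameter taken as the distance between the left shoulder and the right hip. The joint names and coordinates are hypothetical.

```python
import math

def pdj(pred, gt, torso_diameter, fraction):
    """Fraction of joints whose error is within fraction * torso_diameter."""
    hits = sum(
        1 for name in gt
        if math.dist(pred[name], gt[name]) <= fraction * torso_diameter
    )
    return hits / len(gt)

# Hypothetical ground truth and prediction for three joints.
gt = {"l_sho": (10, 10), "r_hip": (40, 50), "r_wri": (60, 30)}
pred = {"l_sho": (13, 13), "r_hip": (40, 58), "r_wri": (72, 46)}
torso = math.dist(gt["l_sho"], gt["r_hip"])  # 50.0

# Sweeping the fraction traces out a PDJ curve.
curve = [pdj(pred, gt, torso, f) for f in (0.1, 0.2, 0.5)]
print(curve)
```

Normalising by torso size makes the metric comparable across people who appear at different scales in the image.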

Datasets

• LSP: sports, 1000 training, 1000 testing

• FLIC: films, 3987 training, 1016 testing

• Image Parse: activities, 100 training, 205 testing

Cross-dataset validation

Generalization: Image Parse Dataset

[Chart: results for upper arms, lower arms, upper legs, and lower legs (axis range 30 to 90). Methods listed: Ouyang et al., CVPR'14; Yang & Ramanan, TPAMI'13; Pishchulin et al., ICCV'13; Pishchulin et al., CVPR'13; Pishchulin et al., CVPR'12; Johnson & Everingham, CVPR'11; Yang & Ramanan, CVPR'11]

Component Analysis

Unary Term vs. Full Model

[Chart: strict PCP on the LSP dataset (VGG-LG) for torso, head, upper arms, lower arms, upper legs, lower legs, and mean; one value label reads 83.4]

• Unary term only

• Full model (separately trained)

• Full model (jointly trained)

Tree-Structured Model vs. Loopy Model

[Chart: loopy model vs. tree model for upper arms, lower arms, upper legs, and lower legs (axis range 62 to 92)]

Human Pose Estimation in This CVPR

Chu et al. Structured Feature Learning for Pose Estimation.

Wei et al. Convolutional Pose Machines.

Carreira et al. Human Pose Estimation With Iterative Error Feedback.


Summary

End-to-End Learning

Bridge the gap between structure modeling and deep learning

A new message passing layer that is flexible enough to handle both tree-structured and loopy relational graphs.

Thank you. Welcome to our poster session #5.
