deep learning for articulated human pose estimation

Post on 14-Feb-2017

244 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks

for Human Pose Estimation

Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang

Department of Electronic Engineering, The Chinese University of Hong Kong

Outline

Introduction

Methodology

Experiments

2

Human Pose Estimation What is Human Pose Estimation?

3Results are generated by the proposed methods (without temporal constraints).Video Credit: Peter Jasko solo - M-idzomer 2013

Human Pose Estimation is to recover the joint positions of articulated limbs of human body given an image or a video.

4Results are generated by the proposed methods (without temporal constraints).Video Credit: Peter Jasko solo - M-idzomer 2013

Activity Recognition Game and Animation Clothing Parsing

Applications

5

[Yamaguchi et al. CVPR’14]

Challenges Articulation

Forshortening

Clothing

6

Dantone et al. CVPR 2013

Structure Modeling

7

Figure Drawing

• Cylinders for each body parts

• Join up the cylinders

Pictorial Structures

• Unary templates

• Pairwise springs

Unable to handle large variations

Image credit:artintegrity.wordpress.com

Fischler & Elschlager 1973

Felzenszwalb & Huttenlocher 2005

Springs

Unary template

Structure Modeling

8

Figure Drawing

• Cylinders for each body parts

• Join up the cylinders

Pictorial Structures

• Unary templates

• Pairwise springs

Unable to handle large variations

Mixtures of each part

• Unary template for each mixture type

• Clustering on (∆𝑥,∆𝑦)

Yang & Ramanan 2011Image credit:artintegrity.wordpress.com

Fischler & Elschlager 1973

Felzenszwalb & Huttenlocher 2005

Graph Structures

• Trees• Latent trees• Loopy graphs• Fully connected

graphs

Deep Learning Based Methods

9

Benefits from better neural network architectures VGG

GoogLeNet,

ResNet

Structures are only used for post processing

Fully Convolutional Networks

Heat map prediction

Gap Between Deep Models and Structure Modeling

10

Gap

End-to-End Learning

Gap Between Deep Models and Structure Modeling

11

CNN

Local appearance is ambiguous

Forward

Backward

Motivation: Geometric Constraints Among Body Parts Helps in Learning Better Representation

12

Forward

Backward

Global consistency helps training

Outline

Introduction

Methodology

Experiments

13

Graph Models

𝐺 = (𝑉, 𝐸)

Vertices

Locations and mixture types of

body parts

Modeled by a front-end CNN

Edges

Pairwise spatial relationships

between body parts

Modeled by message passing

layers

14

message passing

CNN

Log

ari

thm

So

ftm

ax

Framework

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

15

Message Passing Layers

CNN

Log

ari

thm

So

ftm

ax

Framework

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =

𝑖∈𝑉

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)

+

𝑖,𝑗 ∈𝐸

𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗

16

Message Passing Layers

CNN

Log

ari

thm

So

ftm

ax

Framework

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =

𝑖∈𝑉

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)

17

𝒍 = 𝒍𝑖 = {(𝑥𝑖 , 𝑦𝑖)}: location of part 𝑖

Message Passing Layers

+

𝑖,𝑗 ∈𝐸

𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗

CNN

Log

ari

thm

So

ftm

ax

Framework

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =

𝑖∈𝑉

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)

18

𝒍 = 𝒍𝑖 = {(𝑥𝑖 , 𝑦𝑖)}: location of part 𝑖

𝒕 = 𝒕𝑖 : mixture type of part 𝑖

Message Passing Layers

+

𝑖,𝑗 ∈𝐸

𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗

CNN

Log

ari

thm

So

ftm

ax

Front-End CNNPart Appearance Terms

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =

𝑖∈𝑉

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)

19

+

𝑖,𝑗 ∈𝐸

𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗

Message Passing LayersSpatial Relationship Terms

CNN

Log

ari

thm

So

ftm

ax

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =

𝑖∈𝑉

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)

20

+

𝑖,𝑗 ∈𝐸

𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗

conv+

norm+

pool

conv+

norm+

pool

con

v

con

v

con

v

1x1 conv

+dropout

1x1

co

nv

Dro

po

ut

Dro

po

ut

3

3 3 3 3 3

1x1

co

nv

4096

4096

PxK

CNN

Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :

21

Front end CNN

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )

conv+

norm+

pool

conv+

norm+

pool

con

v

con

v

con

v

1x1 conv

+dropout

1x1

co

nv

Dro

po

ut

Dro

po

ut

3

3 3 3 3 3

1x1

co

nv

4096

4096

PxK

CNN

Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :

22

Front end CNN

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )

output of the front-end CNN

conv+

norm+

pool

conv+

norm+

pool

con

v

con

v

con

v

1x1 conv

+dropout

1x1

co

nv

Dro

po

ut

Dro

po

ut

3

3 3 3 3 3

1x1

co

nv

4096

4096

PxK

CNN

Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :

23

Front end CNN

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )

conv+

norm+

pool

conv+

norm+

pool

con

v

con

v

con

v

1x1 conv

+dropout

1x1

co

nv

Dro

po

ut

Dro

po

ut

3

3 3 3 3 3

1x1

co

nv

4096

4096

PxK

CNN

Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :

24

Front end CNN

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )

Softmax function

conv+

norm+

pool

conv+

norm+

pool

con

v

con

v

con

v

1x1 conv

+dropout

1x1

co

nv

Dro

po

ut

Dro

po

ut

3

3 3 3 3 3

1x1

co

nv

4096

4096

PxK

CNN

Local confidence of the appearance of part 𝑖 with mixture type 𝑡𝑖 :

25

Front end CNN

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃) = log 𝑝 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 = log𝜎(𝑓 𝒍𝑖 , 𝑡𝑖 𝐼; 𝜃 )

Logarithm

CNN

Log

ari

thm

So

ftm

ax

… … …

l.shou

neck

r.shou

r.knee

r.ankle

head

max

max

max

max

max

max

𝐹 𝒍, 𝒕 𝐼; 𝜽,𝝎 =

𝑖∈𝑉

𝜙(𝒍𝑖 , 𝑡𝑖|𝐼; 𝜃)

26

+

𝑖,𝑗 ∈𝐸

𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗 𝐼; 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗

Message Passing LayersSpatial Relationship Terms

𝜓 𝒍𝑖 , 𝒍𝑗, 𝑡𝑖 , 𝑡𝑗 𝐼;𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗 = 𝝎𝑖,𝑗

𝑡𝑖,𝑡𝑗 , 𝑑(𝒍𝑖 , 𝒍𝑗)

Spatial Relationship Terms

Quadratic deformation constraints

𝑑 𝒍𝑖 , 𝒍𝑗 = ∆𝑥, ∆𝑥2, ∆𝑦, ∆𝑦2

∆𝑥 = 𝑥𝑖 − 𝑥𝑗 , ∆𝑦 = 𝑦𝑖 − 𝑦𝑗

27

Yang Y, Ramanan D. Articulated pose estimation with flexible mixtures-of-parts. CVPR, 2011.

Message Passing: Max-Sum Algorithm

28

𝑚𝑖𝑗 𝒍𝑗 , 𝑡𝑗 ← 𝛼𝑚max𝒍𝑖,𝑡𝑖

𝑢𝑖 𝒍𝑖 , 𝑡𝑖 + 𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗The message passed from part 𝑖 to 𝑗

1

2

𝑚21

Message Passing: Max-Sum Algorithm

29

𝑚𝑖𝑗 𝒍𝑗 , 𝑡𝑗 ← 𝛼𝑚max𝒍𝑖,𝑡𝑖

𝑢𝑖 𝒍𝑖 , 𝑡𝑖 + 𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗2

𝑚21

1

𝑢𝑖 𝒍𝑖 , 𝑡𝑖 ← 𝛼𝑢 𝜙 𝒍𝑖 , 𝑡𝑖 + 𝑘∈𝑁(𝑖)

𝑚𝑘𝑖(𝒍𝑖 , 𝑡𝑖)The belief of part 𝑖

The message passed from part 𝑖 to 𝑗

𝑢𝑖 𝒍𝑖 , 𝑡𝑖 ← 𝛼𝑢 𝜙 𝒍𝑖 , 𝑡𝑖 + 𝑘∈𝑁(𝑖)

𝑚𝑘𝑖(𝒍𝑖 , 𝑡𝑖)

Message Passing: Max-Sum Algorithm

30

𝑚𝑖𝑗 𝒍𝑗 , 𝑡𝑗 ← 𝛼𝑚max𝒍𝑖,𝑡𝑖

𝑢𝑖 𝒍𝑖 , 𝑡𝑖 + 𝜓 𝒍𝑖 , 𝒍𝑗 , 𝑡𝑖 , 𝑡𝑗

𝒍𝑖∗, 𝑡𝑖

∗ = argmax 𝑢𝑖∗ 𝒍𝑖 , 𝑡𝑖

𝒍𝑖,𝑡𝑖

Diagnosis on Message Passing Layers

31

1st

Diagnosis on Message Passing Layers

32

2nd

Diagnosis on Message Passing Layers

33

3rd

Outline

Introduction

Methodology

Experiments

34

Datasets

LSP

35

FLIC

Sports1000 training1000 testing

Films3987 training1016 testing

36

Qualitative Results on the LSP Dataset

37

38

PCP on the LSP dataset[Percentage of Correct Parts]

39

30

40

50

60

70

80

90

100

TORSO HEAD UPPER ARMS LOWER ARMS UPPER LEGS LOWER LEGS MEAN

STRICT PCP ON THE LSP DATASET

Yang&Ramanan, CVPR'11 Pishchulin et al., CVPR'13

Eichner&Ferrari, ACCV'13 Kiefel&Gehler, ECCV'14

Pose Machines, ECCV'14 Ouyang et al., CVPR'14

Pishchulin et al., ICCV'13 DeepPose, CVPR'14

Chen&Yuille, NIPS'14 Ours

Percentage of Detected Joints (PDJ)

40LSP FLIC

Our method

Sports1000 training1000 testing

Films3987 training1016 testing

Datasets

LSP

41

Image ParseFLIC

Activities100 training205 testing

Cross-dataset validation

42

GeneralizationImage Parse Dataset

30 50 70 90

TORSO

HEAD

UPPER ARMS

LOWER ARMS

UPPER LEGS

LOWER LEGS

MEAN

Ours

Ouyang et al., CVPR'14

Yang&Ramanan, TPAMI'13

Pishchulin et al., ICCV'13

Pishchulin et al., CVPR'13

Pishchulin et al., CVPR'12

Johnson&Everingham, CVPR'11

Yang&Ramanan, CVPR'11

Component Analysis

43

Unary Term vs. Full Model83.4

69

53.5

34.9

72.2

63.5

60.1

93

82.1

70.6

55.4

82.1

75.3

74.2

96.5

83.1

78.8

66.7

88.7

81.7

81.1

30

40

50

60

70

80

90

100

TORSO HEAD UPPER ARMS LOWER ARMS UPPER LEGS LOWER LEGS MEAN

STRICT PCP ON THE LSP DATASET (VGG-LG)

Unary term only

Full model (seperately trained)

Full model (jointly trained)

44

Tree-Structured Model vs. Loopy Model

45

96.2

83.4

78.7

65.8

87.9

81.1

80.7

96.5

83.1

78.8

66.7

88.7

81.7

81.1

62 72 82 92

Torso

Head

Upper Arms

Lower Arms

Upper Legs

Lower Legs

Mean

Loopy Model Tree Model

Human Pose Estimation in This CVPR

46

Chu et al. Structured Feature Learning for Pose Estimation.

Wei et al. Convolutional Pose Machines.

Carreira et al. Human Pose Estimation With Iterative Error Feedback.

CNNStructure

Summary

End-to-End Learning

Bridge the gap between structure modeling and deep learning

A new message passing layer, which is flexible to tree-structured/loopy relational graphs.

Scan for more details

CNNStructure

Thank you.Welcome to our poster session #5

top related