RECOGNIZING HUMAN-OBJECT INTERACTIONS IN STILL IMAGES BY MODELING THE MUTUAL CONTEXT OF OBJECTS AND HUMAN POSES
Date: 2013/05/27
Instructor: Prof. Wang, Sheng-Jyh
Student: Hung, Fei-Fan
Yao, B., and Fei-Fei, L., IEEE Transactions on PAMI (2012)
2
Outline
• Introduction
  • Intuition and goal
• Model Representation
• Model Learning
  • Obtaining Atomic Poses
  • Training Detectors and Classifiers
  • Estimating Model Parameters
• Model Inference
• Experimental Results
• Conclusion
4
Why use context in computer vision?
• Simple images vs. human activities
[Figure: results with context vs. without context, annotated "~3-4%"]
[Figure: example results with mutual context vs. without context]
5
Challenges in Human Pose Estimation
• Human pose estimation is challenging
  • Difficult part appearance
  • Self-occlusion
  • Image regions that look like a body part
• Object detection facilitates human pose estimation
6
Challenges in Object Detection
• Object detection is challenging
  • Objects are small, low-resolution, or partially occluded
  • Image regions that look similar to the detection target
• Human pose estimation facilitates object detection
7
The Goal
• To build a mutual context model for Human-Object Interaction (HOI) activities
9
Model representation
• Modeling the mutual context of objects and human poses
• A: activity, e.g. croquet shot, volleyball smash, tennis forehand
  • More than one atomic pose H can occur in A
• O: objects, e.g. tennis ball, croquet mallet, volleyball, tennis racket
  • M: number of object bounding boxes
• H: atomic pose of the human
• P: body parts
10
Model representation
• 𝝓1: co-occurrence compatibility between A, O, H
• 𝝓2: spatial relationship between O, H
• 𝝓3, 𝝓4, 𝝓5: modeling the image evidence with detectors or classifiers
[Graphical model: activity A; human pose H with body parts P1, P2, ..., PL; objects O1, O2]
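A minimal sketch of how the five terms could be combined into one score for a candidate labeling. It assumes the overall score is a plain sum of 𝝓1 to 𝝓5; the names `total_score` and `toy_potentials`, and all numeric values, are hypothetical placeholders rather than anything taken from the slides.

```python
from typing import Callable, Dict

# Hypothetical types: a labeling holds A, H and O; a potential maps it to a score.
Labeling = Dict[str, object]
Potential = Callable[[Labeling], float]


def total_score(labeling: Labeling, potentials: Dict[str, Potential]) -> float:
    """Sum the potentials phi1..phi5 for one candidate labeling (assumed additive)."""
    return sum(phi(labeling) for phi in potentials.values())


# Toy potentials with made-up values, only to show the call pattern.
toy_potentials = {
    "phi1_cooccurrence": lambda lab: 1.0 if (lab["A"], lab["H"]) == ("tennis forehand", 3) else 0.0,
    "phi2_spatial": lambda lab: 0.5,
    "phi3_objects": lambda lab: 0.2 * len(lab["O"]),
    "phi4_pose": lambda lab: -0.1,
    "phi5_activity": lambda lab: 0.3,
}

labeling = {"A": "tennis forehand", "H": 3, "O": ["tennis racket", "tennis ball"]}
score = total_score(labeling, toy_potentials)
```

Keeping each potential as a separate callable mirrors the slide structure: a term can be swapped or dropped without touching the others.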
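11
𝝓1: Co-occurrence context
• Co-occurrence between all A, O, H
• Each weight gives the strength of the co-occurrence interaction between an atomic pose, an object, and an activity class
• Indicator functions select the weight that matches the current labels; the sums run over the total number of atomic poses, the total number of objects, and the total number of activity classes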
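The co-occurrence term can be sketched as a table lookup. This assumes the weights are stored in a dictionary keyed by (atomic pose index, object class, activity class); the function name, the key layout, and the weight values are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical weight table: (atomic pose index, object class, activity class) -> strength.
CooccurrenceWeights = Dict[Tuple[int, str, str], float]


def cooccurrence_potential(activity: str, pose: int, objects: List[str],
                           weights: CooccurrenceWeights) -> float:
    """Sum the weights selected by the indicator functions.

    Looking up the key (pose, object, activity) plays the role of the product of
    indicator functions: only the entry matching the current labels contributes.
    """
    return sum(weights.get((pose, obj, activity), 0.0) for obj in objects)


# Made-up weights for illustration.
weights = {
    (0, "tennis racket", "tennis forehand"): 1.2,
    (0, "tennis ball", "tennis forehand"): 0.8,
    (0, "croquet mallet", "tennis forehand"): -1.5,
}
value = cooccurrence_potential("tennis forehand", 0, ["tennis racket", "tennis ball"], weights)
```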
12
𝝓2: Spatial context
• Spatial relationship between every object O and the different atomic poses H
• A weight vector is paired with a sparse binary vector
• The sparse binary vector shows the relative location of an object w.r.t. a human body part
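One way to picture the sparse binary vector is as a one-hot encoding of a location bin. The sketch below assumes a layout of 8 direction bins times 2 distance rings around a reference point such as a body part; the bin layout, `near_radius`, and the weights are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def relative_location_vector(obj_xy: Tuple[float, float], ref_xy: Tuple[float, float],
                             n_angle_bins: int = 8, near_radius: float = 50.0) -> List[int]:
    """One-hot (sparse binary) encoding of where obj_xy lies relative to ref_xy.

    Bin layout is an assumption: n_angle_bins directions x 2 distance rings (near, far).
    """
    dx, dy = obj_xy[0] - ref_xy[0], obj_xy[1] - ref_xy[1]
    angle = math.atan2(dy, dx) % (2 * math.pi)            # in [0, 2*pi)
    angle_bin = int(angle / (2 * math.pi) * n_angle_bins) % n_angle_bins
    ring = 0 if math.hypot(dx, dy) < near_radius else 1
    vec = [0] * (n_angle_bins * 2)
    vec[ring * n_angle_bins + angle_bin] = 1
    return vec


def spatial_potential(binary_vec: List[int], weight_vec: List[float]) -> float:
    """Dot product of the weight vector with the sparse binary vector."""
    return sum(w * b for w, b in zip(weight_vec, binary_vec))


vec = relative_location_vector(obj_xy=(160.0, 180.0), ref_xy=(100.0, 100.0))
weights = [0.1 * i for i in range(16)]   # made-up weights
value = spatial_potential(vec, weights)
```

Because exactly one entry of the vector is 1, the dot product simply picks out the weight of the active location bin.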
13
𝝓3: Modeling objects
• Model O in the image I using object detection scores
• For every object O
  • A vector of detection scores for the object, with a weight vector
• Between Om and Om′
  • A binary feature vector relating the two bounding boxes, with a weight vector
14
𝝓4: Modeling human pose
• Model the atomic pose that H belongs to and its likelihood
• A Gaussian likelihood function
• A vector of scores from detecting each body part
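A small sketch of a Gaussian likelihood for one body part. It assumes an isotropic 2D Gaussian centered on the part location expected under an atomic pose; the isotropic form, `sigma`, and the coordinates are assumptions, since the slide does not show the exact formula.

```python
import math
from typing import Sequence


def gaussian_log_likelihood(x: Sequence[float], mean: Sequence[float], sigma: float) -> float:
    """Log-density of an isotropic Gaussian N(mean, sigma^2 I) evaluated at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq_dist / sigma ** 2 - d * math.log(sigma * math.sqrt(2 * math.pi))


# Expected location of a part under some atomic pose (made-up normalized coordinates).
expected_part = (0.40, 0.25)
close_fit = gaussian_log_likelihood((0.42, 0.27), expected_part, sigma=0.1)
poor_fit = gaussian_log_likelihood((0.80, 0.90), expected_part, sigma=0.1)
```

A detected part close to the expected location gets a higher log-likelihood than one far away, which is what lets the atomic pose act as a prior on part positions.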
15
𝝓5: Modeling activity
• Model the HOI activity by training an activity classifier
• Classifier output: the output vector of a one-versus-all (OVA) discriminative classifier that takes the image as features
• A feature weight is applied to the classifier output
17
Model Properties
• Spatial context between O and H
  • Object detection and human pose estimation facilitate each other
  • Objects and body parts that are unreliable are ignored
• Flexible to extend to large-scale datasets and other activities
  • The joint model can share all objects and atomic poses
19
Model Learning
• Assign each human pose to an atomic pose
• Train detectors and classifiers
• Estimate parameters by maximum likelihood
20
Obtaining Atomic Poses
• Use clustering to obtain atomic poses
• Normalize the annotations
• Fill in missing parts
  • Use the nearest visible neighbor
• Obtain a set of atomic poses
  • Hierarchical clustering with the maximum linkage measure
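Hierarchical clustering with maximum linkage (also called complete linkage) merges the two clusters whose farthest pair of members is closest. The sketch below is a plain-Python version under stated assumptions: each pose is a flattened list of normalized part coordinates, the distance is Euclidean, and merging stops at a chosen number of clusters. The function names and toy data are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_clusters(poses: List[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering with maximum (complete) linkage.

    Starts with one cluster per pose and repeatedly merges the two clusters whose
    farthest pair of members is closest, until n_clusters remain.
    Returns clusters as lists of indices into poses.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(poses))]
    while len(clusters) > n_clusters:
        best = None  # (linkage distance, cluster index a, cluster index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(euclidean(poses[i], poses[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy "poses" with made-up values; real poses would hold all annotated parts.
poses = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9], [0.05, 0.1]]
atomic_pose_clusters = complete_linkage_clusters(poses, n_clusters=2)
```

Each resulting cluster would correspond to one atomic pose, and every training pose is assigned to the cluster it falls in.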
21
Training Detectors and Classifiers
• Object detector in 𝝓3: deformable part model
• Human body part detector in 𝝓4: deformable part model
• Overall activity classifier in 𝝓5: spatial pyramid matching (SPM), SIFT + 3-level image pyramid
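The 3-level pyramid can be illustrated by how the image descriptor is assembled. This sketch assumes local features have already been quantized into visual words and carry normalized (x, y) positions; it builds histograms over 1x1, 2x2, and 4x4 grids and concatenates them. The per-level weights of SPM and the SIFT extraction itself are omitted, and all names are hypothetical.

```python
from typing import List, Tuple


def spatial_pyramid_histogram(features: List[Tuple[float, float, int]],
                              vocab_size: int, levels: int = 3) -> List[int]:
    """Concatenate visual-word histograms over 1x1, 2x2, 4x4, ... grids.

    features holds (x, y, word) with x, y normalized to [0, 1) and word in
    range(vocab_size). Per-level weighting is omitted in this sketch.
    """
    hist: List[int] = []
    for level in range(levels):
        cells = 2 ** level                      # the grid is cells x cells
        level_hist = [0] * (cells * cells * vocab_size)
        for x, y, word in features:
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            level_hist[(cy * cells + cx) * vocab_size + word] += 1
        hist.extend(level_hist)
    return hist


# Three made-up quantized features with a vocabulary of two visual words.
features = [(0.1, 0.1, 0), (0.9, 0.1, 1), (0.6, 0.7, 0)]
descriptor = spatial_pyramid_histogram(features, vocab_size=2)
```

With 3 levels the descriptor has (1 + 4 + 16) cells, each holding one histogram over the vocabulary.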
24
Estimating Model Parameters
• Estimate the model parameters using a maximum likelihood (ML) approach with a zero-mean Gaussian prior
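Maximum likelihood with a zero-mean Gaussian prior amounts to a log-likelihood objective with a quadratic penalty on the parameters. The form below is a sketch under assumptions: θ stands for all model weights, σ for the prior standard deviation, N for the number of training images, and the likelihood is taken to be conditional on the image. These symbols are introduced here for illustration and are not the slide's notation.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \;
  \sum_{n=1}^{N} \log p\left(A_n, O_n, H_n \mid I_n; \theta\right)
  \;-\; \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}}
```

The second term is the log of the Gaussian prior N(0, σ²I) up to a constant, so a smaller σ pulls the weights more strongly toward zero.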
25
Learning result
27
Model Inference
• Input: a new image
• Initialize with learned results
• Update human body parts
• Update object detection results
• Update A and H labels
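The inference flow can be sketched as a loop over the three update steps. The sketch assumes the steps are repeated a fixed number of times after initialization; the schedule, `n_iters`, and the function names are assumptions, and the toy steps only record the order in which they run.

```python
from typing import Callable, Dict

State = Dict[str, object]
Step = Callable[[State], State]


def run_inference(initial: State, update_parts: Step, update_objects: Step,
                  update_labels: Step, n_iters: int = 3) -> State:
    """Apply the three update steps in order, repeated n_iters times (assumed schedule)."""
    state = dict(initial)
    for _ in range(n_iters):
        state = update_parts(state)      # refine human body parts
        state = update_objects(state)    # refine object detection results
        state = update_labels(state)     # re-estimate A and H labels
    return state


log = []


def make_step(name: str) -> Step:
    """Build a toy step that records its name and counts how often it ran."""
    def step(state: State) -> State:
        log.append(name)
        return {**state, "updates": state["updates"] + 1}
    return step


result = run_inference({"updates": 0}, make_step("parts"), make_step("objects"),
                       make_step("labels"), n_iters=2)
```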
28
Initialization
• Given a new image, initialize with the learned results
  • A: activity classification with SPM
  • O: object detection
  • H: human pose estimation with the pictorial structure model
29
Update model inference: update human body parts
• Compute the marginal distribution of the human pose
• Use a mixture of Gaussians to refine the prior of the body parts
30
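Update model inference: update object detection results
• Greedy forward search method:
  • Initialize with no object labeled in any bounding box
  • Select the best pair of bounding box and object label
  • Label the selected box as the selected object
  • Update the score
  • Stop when the value of the best selection is < 0
• Terms involved: the potentials over (O, H), (O, A, H), and (O, I)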
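A minimal sketch of a greedy forward search over box labels. It assumes a caller-supplied `gain` function that returns the change in score from adding one (box, class) label to the current labeling; that function, the toy scores, and the reuse penalty are hypothetical stand-ins for the model terms.

```python
from typing import Callable, Dict, List, Optional, Tuple

Labels = Dict[int, str]                       # box index -> object class
GainFn = Callable[[Labels, int, str], float]  # hypothetical: score change from one new label


def greedy_forward_search(n_boxes: int, classes: List[str], gain: GainFn) -> Labels:
    """Start with no box labeled; repeatedly add the single best (box, class) label.

    Stops as soon as the best available gain is negative.
    """
    labels: Labels = {}
    while len(labels) < n_boxes:
        best: Optional[Tuple[float, int, str]] = None
        for box in range(n_boxes):
            if box in labels:
                continue
            for cls in classes:
                g = gain(labels, box, cls)
                if best is None or g > best[0]:
                    best = (g, box, cls)
        if best is None or best[0] < 0:
            break
        labels[best[1]] = best[2]
    return labels


# Made-up detection scores per (box, class), with a penalty for reusing a class.
scores = {
    (0, "tennis racket"): 1.5, (0, "tennis ball"): -0.2,
    (1, "tennis racket"): 0.4, (1, "tennis ball"): 0.9,
    (2, "tennis racket"): -0.3, (2, "tennis ball"): -0.5,
}


def toy_gain(labels: Labels, box: int, cls: str) -> float:
    penalty = 1.0 if cls in labels.values() else 0.0
    return scores[(box, cls)] - penalty


found = greedy_forward_search(n_boxes=3, classes=["tennis racket", "tennis ball"], gain=toy_gain)
```

The third box stays unlabeled because every remaining choice would lower the score, which is the stopping rule on the slide.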
31
Update model inference: update A and H labels
• Enumerate the possible A and H labels
• Optimize the model over the enumerated labels
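Enumerating the A and H labels can be sketched as an exhaustive search over all pairs. The sketch assumes a `score` callable that evaluates one (activity, atomic pose) pair with everything else held fixed; the callable and the toy score table are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def best_activity_and_pose(activities: Iterable[str], poses: Iterable[int],
                           score: Callable[[str, int], float]) -> Tuple[str, int]:
    """Try every (activity, atomic pose) pair and keep the highest-scoring one."""
    return max(product(activities, poses), key=lambda ah: score(ah[0], ah[1]))


# Made-up scores for two activities and two atomic poses.
toy_scores = {
    ("tennis forehand", 0): 0.2, ("tennis forehand", 1): 1.4,
    ("croquet shot", 0): 0.9, ("croquet shot", 1): -0.3,
}
best = best_activity_and_pose(["tennis forehand", "croquet shot"], [0, 1],
                              lambda a, h: toy_scores[(a, h)])
```

Exhaustive enumeration is practical here because the number of activity classes and atomic poses is small compared with the space of part and object locations.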
33
Experimental Results (Sports Dataset)
34
Experimental Results (Sports Dataset)
35
Experimental Results (Sports Dataset)
• Activity classification
36
37
Experimental results (PPMI Dataset)
38
Experimental results (PPMI Dataset)
39
41
Conclusion
• Mutual context can significantly improve performance in difficult visual recognition problems
• The joint model can share all the information
• Training requires annotating all the human body parts and objects in the training images
42
Reference
• B. Yao and L. Fei-Fei, “Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.
• B. Yao and L. Fei-Fei, “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
• B. Sapp, A. Toshev, and B. Taskar, “Cascaded Models for Articulated Pose Estimation,” Proc. European Conf. Computer Vision, 2010.
• S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
• http://en.wikipedia.org/wiki/Hierarchical_clustering