action recognition in cluttered dynamic scenes using pose...
TRANSCRIPT
Action Recognition in Cluttered Dynamic Scenes using Pose-Specific Part Models
Vivek Kumar SinghUniversity of Southern California
Los Angeles, CA, [email protected]
Ram NevatiaUniversity of Southern California
Los Angeles, CA, [email protected]
Abstract
We present an approach to recognizing single actor hu-man actions in complex backgrounds. We adopt a JointTracking and Recognition approach, which track the ac-tor pose by sampling from 3D action models. Most ex-isting such approaches require large training data or Mo-CAP to handle multiple viewpoints, and often rely on cleanactor silhouettes. The action models in our approach areobtained by annotating keyposes in 2D, lifting them to 3Dstick figures and then computing the transformation matri-ces between the 3D keypose figures. Poses sampled fromcoarse action models may not fit the observations well; toovercome this difficulty, we propose an approach for effi-ciently localizing a pose by generating a Pose-Specific PartModel (PSPM) which captures appropriate kinematic andocclusion constraints in a tree-structure. In addition, ourapproach also does not require pose silhouettes. We showimprovements to previous results on two publicly availabledatasets as well as on a novel, augmented dataset with dy-namic backgrounds.
1. IntroductionThe objective of this work is to recognize single actor hu-
man actions in videos captured from a single camera. This
has been a popular research topic over past few years, as ef-
fective solutions to this problem find applications in surveil-
lance, HCI, video retrieval, among others. Existing action
recognition methods work well with variations in actor ap-
pearances, however, handling viewpoint variations with low
training requirements, and dealing with cluttered dynamic
backgrounds is still a challenge. While view-invariant ap-
proaches using 3D models have been proposed, they ei-
ther require 3D MoCAP for learning models and/or require
videos from multiple viewpoints.
We present a simultaneous tracking and recognition ap-
proach which tracks the actor pose by sampling from 3D
action models and localizing each pose sample; this allows
view-invariant action recognition. To deal with cluttered
dynamic background, we accurately localize each pose us-
ing a 2D part model. We model an action as a sequence
of transformations between keyposes. These action models
can be obtained by annotating keyposes in 2D, lifting them
to 3D stick figures and then computing the transformation
matrices between 3D keyposes [18]; this avoids large train-
ing data and MoCAP. However poses sampled from such
coarse models do not match observations well. Thus during
inference, errors due to pose approximation and observation
noise would accumulate over time, and result in tracking
failures and lower recognition rates, especially in cluttered
dynamic scenes.
We address these issues by a more accurate localization
of the human pose using a 2D part model with kinematic
constraints. Such models have been successfully applied
to localize human pose in cluttered images, under the as-
sumption that the parts are not occluded [4, 20, 1]. How-
ever, poses often have multiple occluded parts, and hence
modeling inter-part occlusions is useful for accurately lo-
calizing such poses. Existing methods such as [9, 27] that
model such constraints are too inefficient for tracking and
recognition where multiple poses may need to be localized
every few frames. We propose a novel framework to select
a tree-structured model that captures appropriate kinematic
and inter-part occlusion constraints for a particular pose in
order to accurately localize that pose; we call this model
Pose-Specific Part Model (PSPM). To determine the PSPM
for a given pose, we search over many possible tree models
and select the model with highest localizability score.
We demonstrate our approach on 2 publicly available
datasets: Full body Gestures [17] with 6 actions captured
from multiple viewpoints with cluttered, dynamic back-
grounds, and Hand Gestures [18] with 12 actions with sub-
tle pose variations in a static background. To further demon-
strate the robustness to background changes, we evaluate
our method on an augmented Hand Gestures Set with 25real sequences with camera shakes and background object
motion, and 215 sequences with embedded dynamic back-
grounds. We also evaluate localization using PSPM on an
image dataset with different poses and backgrounds, and
show improvements over the standard Pictorial Structures
[4] and other pose localization methods.
In the rest of the paper, we first review the related work
in Section 2. We then present the action representation and
inference in Section 3. Next we describe pose localization
from 3D priors using Pose-Specific Part Model (PSPM) in
Section 4, followed by the results.
2. Related WorkA natural approach to recognizing actions is to first
estimate the body pose and then infer the action based
on the pose dynamics [8, 5]. However effectiveness of
such approaches depends on reliable human pose tracking
methods. A popular approach is to avoid pose tracking and
directly match image descriptors to the action models by
learning action classifiers, using SVMs [13], or graphical
models such as CRFs [16], LDA [15]; however it is
difficult to capture temporal relationships in such models.
Furthermore, these methods typically require large amount
of training data from multiple viewpoints.
Another approach is to simultaneously track the pose
and recognize the action; we refer to these as Joint-Tracking-and-Recognition methods. These methods learn
action models that capture the evolution of the actor pose
in 3D, and during inference, use the action priors for
tracking pose and the estimated pose to recognize actions.
While these methods work well across viewpoints, most
of them require 3D MoCAP data for learning accurate
models [28, 22, 19, 17] and/or rely on person silhouettes for
localization and matching [21, 24, 23, 31, 19, 6, 18] which
assumes a static background. Recently, [18] proposed a
multi-view approach without using MoCAP, by learning
3D action models from 2D keypose annotations and recog-
nizing actions by matching poses sampled from the action
models to actor silhouettes. However poses sampled from
such coarse models result in large matching error which
accumulate over time and significantly affect recognition,
especially in cluttered scenes. In our work, we address this
issue by using accurate pose localization using part models.
An important aspect of the Joint-Tracking-And-Recognition methods is to reliably localize/match the
pose to the video. Recently, part-based graphical models
(pictorial structures [4]) have been shown to accurately
localize 2D poses in complex backgrounds [1, 20] but they
do not model inter-part occlusion. Localization of poses
with inter-part occlusions require simultaneous modeling
of body kinematics and inter-part occlusion which makes
inference hard. Existing approaches model such constraints
using common-factor models [12], multiple trees [29],
or represent them in a kinematic graph (with cycles) and
infer pose using non-parametric message passing [23] or
branch-and-bound [9, 27]. However, these methods either
use person silhouettes [23], requires training data from all
viewpoints [12], or are too inefficient [9, 27] for tracking.
Recently, [2] trained multiple view-specific models for
estimating pose in walking action; however for multiple
actions, a large number of models would need to be trained.
3. Action RecognitionIn this work, we develop on the Joint-Tracking-and-
Recognition approach of combining Tracking-by-Priors and
Recognition-by-Tracking. For each action, we obtain an ap-
proximate model of the human pose dynamics in a scale and
pan-normalized 3D space; this allows scale and viewpoint-
invariant representation. This is done by scaling the poses to
a fixed known height. For inference, we match image obser-
vations to the action models by tracking using a 3D human
model in the action-restricted pose space, and find the ac-
tion with highest matching score [31, 14, 18]. Here, we first
present the action representation and model learning, fol-
lowed by action and pose inference. The pose localization
is described later in Section 4.
3.1. Representation and Learning
We learn a separate model for each action that captures
the dynamics of the human pose. Our models are based on
the concept that a single actor human action can be repre-
sented as a sequence of linear transformations between a
few, representative keyposes. Our action model is inspired
by [18], which refers to the linear transformation between a
keypose pair as a primitive. For example, the walking action
can be represented with four primitives - left leg forward →right leg crosses left leg → right leg forward → left leg
crosses right leg. Note that each primitive is a conjunction
of rotation of body parts, e.g. during walking, rotation of
upper leg about the hip and rotation of lower leg about the
knee, and thus can be represented as a linear transformation
in joint-angle space. This is illustrated in figure 1.
Scale Normalized JointAngle Space
Keypose
Primitive
Keypose
Figure 1. Geometric Interpretation of the Action Action for Walk-
ing; dotted Red Curves denote different instances of walking ac-
tion; piecewise linear curve (in gray) denotes the learnt action
model with keyposes marked with circles (in black)
To capture the variations in keyposes across different in-
stances of the same action, we model each keypose by a set
of Gaussian distributions, one for every 3D joint position.
For speed variations, we model the length of each primitive
as a truncated sigmoid function. We normalize each primi-
tive to unit length and learn a Gaussian over the fraction of
primitive that gets covered at each time step. Thus, an ac-
tion with Nk keyposes is modeled by a set of Nk×(Nj+1)Gaussians, where Nj is the number of 3D joints (= 15).
We learn action models by annotating 2D poses and the
primitive action boundaries in the training videos. For each
action model, we first manually select the set of keyposes
for each action; intuitively, we select a keypose whenever
there is a “big” change in pose dynamics; alternatively if
3D MoCAP is available, keyposes can be automatically ob-
tained as discontinuities in the pose energy [14]. We then
learn the 3D model for each keypose from 2D annotations
by lifting (using our implementation of [30] for more de-
tails). For each primitive, we obtain the expected change
in the duration by collecting primitive lengths from action
boundary annotations and fitting a Gaussian.
3.2. Conditional Action Network
Given the action models, we embed them into a Dynamic
Conditional Random Field [26], which we refer to as Con-
ditional Action Network illustrated in Figure 2.
�������� �������������� �����������
��� ����������
� ���
Figure 2. Conditional Action Network
We define the state st of CAN at time t by a tuple of
action and pose variables 〈sactt , sposet 〉; action set sactt =〈at, pt, ft〉 include the action label at, current primitive ptand the fraction of primitive elapsed ft and pose sposet =〈xt〉 include the current pose xt. To infer the action from an
observation sequence of length T , we estimate the optimal
state sequence over all actions by maximizing the log-linear
likelihood which takes the following form,
sbest[1:T ] = arg max∀s[1:T ]
T∑t=1
⎛⎝ no∑
f=1
wfφf (st, st−1, It)
⎞⎠
where, φ(st, st−1, It) are observation and transition poten-
tials and w = {wi} is the weight vector, one for each po-
tential function.
[Transition Potentials] Action transition potential
φ(at, ft, at−1, ft−1) is modeled as a truncated sigmoid
function over the fraction of primitive elapsed ft, such that
the probability of staying in the same primitive pt decreases
as ft approaches 1 and the probability of transition to
a new primitive increases. The pose transition poten-
tial φ(xt, xt−1) is modeled using a Normal distribution
N (0, σ) of displacement in neck position and height ht.
[Observation Potentials] We compute the observation
likelihood of a pose sample xt, sampled from action-pose
potential φ(at, ft, xt), by combining shape and motion
likelihoods. We first localize the pose using a part based
model which is generated from the spatial prior available
from the action model and handles constraints due to
occlusion. We then compute shape likelihood, as the
normalized log likelihood of the parts used in the model.
The details of this step are described in Sec 4.
φshape(x) =1
|P |∑i∈P
φi(xi, It)
where, P is the set of the parts in the pose model, xi is the
ith part in pose x.
The motion likelihood is computed by matching the ob-
served optical flow with the direction of motion of each part,
using the cosine distance. We used the Lucas-Kanade al-
gorithm (in OpenCV 1.0) for computing optical flow and
quantize the flow into 8 orientation bins.
[Weight Learning] We assume uniform transition weights
across different actions/primitives, and hence weight learn-
ing only involves learning 3 weight values, one for each po-
tential. In this work, we use the Voted Perceptron algorithm[3] due to its efficiency and ease of implementation. The
ground truth pose estimates for all frames were obtained us-
ing our inference with known action label for the sequence.
3.3. Tracking and Recognition
Since our action models are continuous and our graphi-
cal model has cycles, exact inference is infeasible. Thus, we
use a particle filtering approach [22, 25] by sampling poses
from the action models and matching each pose to the scene
observations.
During tracking, we first find the person by applying a
full-body and a head-shoulder pedestrian detector [7]; mul-
tiple detectors help reliable detection especially in com-
plex scenes. We then uniformly sample poses from action
models and localize the poses to fit the observations using
the approximate position (neck) and scale (person standing
height) available from the detection responses. The details
of the localization method are described in detail in Sec-
tion 3. For viewpoint invariance, poses are matched to the
observations at various pan angles.
To propagate each sample st over time, we increment the
ft (fraction of primitive elapsed) to obtain the next action
state sactt+1; note that if ft is toward the end of a primitive,
next state may transition to the next primitive or action. We
then perturb the position and scale of the person, and obtain
the next pose by localizing the pose to the observations; note
that localization step takes into account the spatial prior on
the pose from the action model 〈at+1, pt+1, ft+1〉. During
actions that are performed while standing at the same loca-
tion such as sitting on ground, we imposed a constraint that
the feet of the person remain on ground at roughly the same
location (using a penalty function modeled as a zero-mean
Gaussian). This constraint makes our tracker more robust to
drifting. The best state sequence from the state distribution
over all frames is then obtained using Viterbi algorithm.
4. Accurate Pose Localization from 3D Priors
In this section, we present our approach to accurately lo-
calize a hypothesized pose (from the action model) to the
image observations. Given prior information such as scale
and position, localization involves searching through the
pose space to infer the pose that best describes the image
evidence. In our setting, where the pose is being tracked us-
ing approximate action models, prior on the pose includes
coarse 2D position and scale information and the pose sub-
space which is likely to include the true pose. It is natural to
assume that in cluttered environments, the 2D position and
scale priors may be quite noisy. Furthermore, the pose sub-
space induced by the action model can be large especially
for fast moving parts, for e.g. hands during waving.
For efficient localization, we first project the 3D pose
search space on the 2D image to obtain spatial prior on the
2D pose, then localize the 2D pose using image observa-
tions and then estimate the 3D pose from the aligned 2D
pose. For 2D pose localization, we use a part-based graphi-
cal model approach (similar to pictorial structure [4, 20, 1])
which represents the human body by its constituent parts
(see figure 3(a)) and impose pairwise constraints over the
parts during inference. These pairwise constraints model
the kinematic and/or the inter-part occlusion relationships
between the parts; however when all such constraints are
imposed, the graphical model has loops (see figure 3(b)).
Even though attempts have been made to infer pose using
models with loops but they either tend to be computation-
ally expensive [9, 27]. Thus, for efficient and exact infer-
ence tree-structured models are preferred.
We develop an approach to automatically select a tree
structured model that is most likely to give an accurate lo-
calization for a given pose, by leveraging the fact that under
occlusion, some kinematic constraints may be relaxed in or-
der to model constraints that would be more effective for
localization; we call this model Pose-Specific Part Model(PSPM).
Next, we first present 2D pose localization using tree-
structured part model. We then describe the PSPM selection
and learning, followed by the 3D pose localization using
PSPM.
(b)
t
hx
xlua
llax
xlul rulx
xlll rllx
ruax
rlax
xlll
rulx
rllx
ruax
rlaxllax
xlua
hx
xlul
xt
(a)
x
Figure 3. Graphical Models for 2D pose (a) Kinematic Tree model
[4] (b) Graph with edges to model kinematic and inter-part occlu-
sion constraints; observation nodes are not shown for clarity
4.1. Localizing 2D Pose using Part Model
In a 2D pose model, each part is represented as a node
and the edges represent pairwise constraints between the
parts. During inference, detectors for all parts are indepen-
dently applied on the image, and then best pose x is ob-
tained by maximizing the joint likelihood given by
p(x, I|Θ) =∏i∈P
p(I|xi,Θsi )
∏ij∈E
p(xi|xj ,Θpij) (1)
where xi denote the part i, (P,E) is the graphical model
over the parts P ; p(I|xi,Θsi ) represent the likelihood of
part hypothesis xi obtained by applying the part detec-
tor; p(xi|xj ,Θpij) represent the pairwise constraints; Θ =
(Θs,Θp) are model priors for unary and pairwise potentials.
A commonly used 2D pose model [4, 20, 1] assumes a tree-
structure, as efficient and exact inference can be performed
[4].
4.1.1 Part Detection
Recently, [1] reported that better part detectors can signifi-
cantly improve localization results; however, better part de-
tectors are also computationally expensive. Thus, in this
work, we experiment with 2 different types of detectors that
can be applied efficiently and have been previously used
for localizing 2D body parts - geometric templates [10] and
boundary and region templates [20]. We briefly describe the
part detectors here,
[Geometric Templates] Each part is modeled with a simple
geometric object - head with an ellipse, torso with an ori-
ented rectangle and each arm with a pair of line segments.
The log likelihood score of a part is obtained by accumulat-
ing the edge strength and orientation match on the boundary
points.
[Boundary and Region Templates] Each template is a
weighted sum of the oriented bar filters where the weights
are obtained by maximizing the conditional joint likelihood
[20]. We use the detectors provided by the authors.
4.1.2 Pairwise Constraints
The pairwise kinematic potential between parts is defined
using a Gaussian distribution, similar to [4, 1]. To avoid the
overlapping parts from occupying exactly the same place,
we add additional repulsion constraint that reduces the like-
lihood of the occluded part to overlap with the occluder. For
parts xi and xj such that xi is occluding xj , we define the
pairwise potential as
p(xi|xj ,Θij) = N (li − lj ;μij , σij)× Λ(li, lj)
where, li denote the position and orientation of xi, and
Θij = (μij , σij) is Gaussian prior over the relative part
position and orientation, Λ(li, lj) is the repulsive prior be-
tween the overlapping parts [5].
4.2. Pose-Specific Part Models for Localization
Given spatial priors on a 3D pose, the Pose-Specific Part
Model (PSPM) is a tree-structured graph, and is tuned to ac-
curately localize the specified pose. Obtaining PSPM for a
pose involves selecting the model (set of parts P and the
structure E) and estimating the model prior Θ which is
likely to maximize the joint likelihood. Accurate localiza-
tion can be obtained by maximizing Eqn 1.
[Part Selection]For accurate localization, we select the parts that are at least
partially visible, since the part detectors do not work well
for heavily occluded parts. To achieve this, we project the
3D pose to obtain the approximate position and orientation
of each part. This information, together with relative depth
ordering of parts, is used to estimate visibility of each part.
The visibility v(pi) is computed as the fraction of part pithat is unoccluded i.e.
v(pi) = 1− ovlp(pi,⋃∀j �=i
pj) (2)
where ovlp(pi, pj) show the fraction of part pi occluded by
pj . For model selection, we only consider the parts with
visibility greater than 0.5.
[Structure Selection]This step involves selecting a tree from all possible trees,
that captures appropriate constraints for localizing the given
pose. For localizing poses with partially or fully occluded
parts, we can relax some kinematic constraints in the stan-
dard tree model 3(a), and add an approximate neighborhood
cum non-overlap constraint such that the resulting model is
still a tree. For example, consider the pose in figure 4(a).
An alternate model to the standard kinematic model con-
nects the left lower leg to the right lower leg, and results in
a better pose estimate that using the standard kinematic tree.
Since upper and lower parts of the body are rarely coupled
(i.e. kinematically connected or occlude each other), we ig-
nore the edges between an arm and a leg. Figure 3(b) shows
the edges considered for structure selection.
(b)
t
hx
xlua
llax
xlul rulx
rllx
ruax
rlax
xlll
x
(c)(a)
Figure 4. Pose localization using Pose-specific Part Model; (a) Im-
age of a person sitting down (b) Selected Pose-specific Part Model
(occluded parts are marked with dotted lines) (c) Localized 2D
parts obtained using the selected PSPM
A standard approach for structure selection is to find the
tree-structure that maximizes the joint likelihood over la-
beled data [11]. This involves estimating the prior parame-
ters (mean and variance) for all pairs of parts that are con-
nected, and then finding a tree-structure which has the low-
est score (sum of variance over all edges). Since the tree
structure that maximizes the joint likelihood may be dif-
ferent for different poses, the standard learning approach
would require labeled data for all poses in the action model,
from various viewpoints; which is prohibitively large. In
this work, we propose a measure for the model score based
on the geometry of the pose.
To come up with an appropriate measure, we annotated
2D and 3D poses for 200 images and estimated the tree
model with highest localization score by performing an ex-
haustive search over all tree-structured models from the
graph shown figure 3(b). Note that the number of all possi-
ble tree models is quite large. To reduce the search space,
we consider only those trees which include the kinematic
edges and those non-kinematic edges where the connected
pair of parts overlap.
From our experiments, we observed that for poses with
unoccluded parts, the best tree had mostly kinematic edges
in it. However non-kinematic edges were preferred when
the parts occluded each other. Based on this observation,
we propose a score, Localization effect of an edge L(eij),which captures the “usefulness” of that edge toward local-
izing the given pose. We define the localization effect of an
edge as the product of the detection accuracy of part detec-
tors and the degree of occlusion of the connected parts. We
define L(eij) for an edge eij as:
L(eij) ={ D(pi)D(pj)min{v(pi), v(pj)}, eij ∈ K
D(pi)D(pj)max{ovlp(pi, pj), ovlp(pj , pi)}, eij /∈ K
where, K is the set of kinematic edges; D(pi) is the de-
tection accuracy of detector for part pi; the min/max term
captures the degree of occlusion.
The tree selection for accurate localization can be for-
mulated as a search over the set of edges that maximizes
the total localization effect. Since the localization effect of
an edge is independent of others, the optimal tree structure
E∗ can be estimated as:
E∗ = maxE∈G
∑ij∈E
L(eij) s.t. E is a tree (3)
where G is the graph with all pairwise constraints. Note
that Equation 3 can be solved efficiently by finding the
maximum spanning tree in the graph G, with L(eij) as the
weight of eij .
[Estimating Model Prior Θ]We define the pairwise potential using a Gaussian (in Sec-
tion 4.1.2). Previous methods work with uninformed prior
and hence, learn the parameters of the Gaussian from a la-
beled data [4]. But in our case, where prior knowledge of
pose is available, learning pose-specific parameters would
be more meaningful. However, learning pose-specific pa-
rameters would require a prohibitively large number of pose
samples (for all poses from various viewpoints). We esti-
mate these parameters using the prior on the 3D pose.
The model parameters, mean and variance at each joint,
are estimated by projecting the 3D pose prior, modeled as
Gaussian distributions, to 2D. For e.g, the mean relative po-
sition μij of part i w.r.t. part j is simply the difference of
the mid-point of the end-joints of part pi and that of part pj .
4.3. Localizing Pose from 3D Action Priors
The action prior include the 3D prior on the pose repre-
sented with Gaussian distributions (one for each joint) and
approximate position and scale of the person available from
the tracker. Given this prior, we obtain accurate 2D local-
ization of that pose using PSPM (as described earlier). Note
that during inference, we only apply the part detector in the
neighborhood of the projected 2D position, orientation and
scale for each part.
After localizing the pose in 2D, we then estimate the 3D
pose from the 2D joints positions. While estimating 3D
pose from 2D joints is ambiguous; in our case the spatial
priors on pose available from action model and the tracking
information help remove such ambiguities. For accurate 3D
pose estimation from 2D pose with known depth ordering
of parts, one can estimate the 3D joints using non-linear
least squares to fit the 2D estimates while constraining the
joints to stay within the pose search space (similar to [30]);
in this work, we simply update each joint position, starting
from the neck, assuming the 3D length of the parts does not
change. An initial estimate of 3D part lengths are obtained
by scaling a canonical 3D model in standing pose, such that
height of the model is same as the observed height of the
actor (available from tracking).
5. ExperimentsWe first demonstrate our pose localization approach using
PSPMs on an image dataset with pose annotations. We then
evaluate our action recognition algorithm that uses PSPMs
for localization on 2 publicly available datasets: Full body
Gestures [17] and Hand Gestures [18]. Compared to KTH
[13], Weizmann, HumanEva[23] and Hand Gestures [18]
datasets which have a clean background and/or few view-
point variations, Full body Gestures set includes videos with
cluttered dynamic backgrounds, captured at various view-
points. We also report results on hand gestures in dynamic
scenes.
5.1. Pose Localization
We selected frames from existing action recognition
datasets [17, 18] and created a collection of 195 images with
variety of poses. For each image, we annotated the 3D pose
of the actor by marking the 2D joint positions and their rel-
ative depths, followed by lifting to 3D (similar to keypose
annotations). To quantitatively evaluate pose localization,
we computed the average localization score over the visi-
ble parts: a part is considered to be correctly localized if it
overlaps more than 50% with the ground truth part.
Recall that the pose prior include approximate 2D scale
and position information from the tracker, and the approxi-
mate 3D pose (represented as a set of Gaussian distributions
over the 3D joint positions). To simulate the noisy prior ob-
tained from the action models, we set the variance of each
3D joint to be 5% of the part length. This prior was then
used as an input to various localization methods.
We first apply our implementation of Pictorial Structure
(PS) [4], which is a tree-structure model with kinematic
edges and uses an uninformed prior. Using Boundary Tem-
plates (BT), PS gives a localization accuracy of 44.53%.
Then we modify PS by applying part detectors only in the
search region provided by the prior and enforce kinematics
using parameters estimated from the prior; we refer to this
as CPS (Constrained Pictorial Structures). Applying CPS
using Boundary Templates, gave a localization accuracy of
63.74%, which when compared to PS, clearly shows the
importance of incorporating pose prior. We then apply the
Pose-Specific Part Model [20] and achieved a much higher
localization accuracy 71%, which demonstrates the advan-
tage of modeling occlusion based constraints. We also com-
pare with [17], which uses Hausdorff distance between the
pose boundary and canny edges as a shape likelihood mea-
sure to localize the pose. This approach achieved a lower
accuracy of 62.71%.
We test the robustness of our approach to uncertainty in
position and scale of the pose (which is likely to occur dur-
ing tracking). Figure 5 shows the accuracy plots for various
localization methods against the degree of uncertainty. No-
tice that localization using PSPM and CPS with Boundary
Templates is quite robust to position uncertainty compared
to Hausdorff method. Using CPS with Geometric Tem-
plates and Boundary Templates gave comparable accuracy
scores at low uncertainty, but deteriorates as the uncertainty
increases; this indicates that Boundary Templates are more
robust to noise. Also notice in Figure 5.b, that the PSPM us-
ing Boundary Templates tolerates small errors in the height
estimate ( 10%). However PSPM based localization is about
10− 15 times slower than using Hausdorff distance.
0
10
20
30
40
50
60
70
80
0 0.05 0.1 0.2 0.5
Loca
lizat
ion
Acc
urac
y
Uncertainty in position
HausdorffPS-BTCPS-BTCPS-BT-IIPCPS-GTPSPM-BT
0
10
20
30
40
50
60
70
80
90
100
0.9 1 1.1
Loca
lizat
ion
Acc
urac
y
Uncertainty in Height
HausdorffPSPM-BT
Figure 5. Plots showing Localization accuracy of different ap-
proaches (a) with uncertainty in position (shown in ratio of posi-
tion error to person height) (b) with uncertainty in height estimate
(scale);
5.2. Action Recognition
From pose localization experiments, we observe that
Hausdorff distance based method localizes well when pre-
dicted pose is not far from the true pose. Thus for efficiency,
we apply PSPMs every 5th frame and use Hausdorff dis-
tance based method for intermediate frames. In addition,
for efficient localization using PSPM, we scale down the
image so the actor is ≈ 100 pixels high. Our entire system
runs at ≈ 1 frame per second on a 3GHz Xeon CPU run-
ning Windows/C++ programs. We now present our results
on three datasets.
Hand Gestures Dataset [18]: This dataset has 5-6 in-
stances of 12 actions from 8 different actors in an indoor
lab setting; a total of 495 action sequences across all ac-
tions. Even though the background is not cluttered, recog-
nition task is still challenging due to the large number of
actions with small pose difference. For evaluation, we train
the models on a subset of actors and test on the rest. We
compare our approach to [18], that uses a similar joint track-
ing and recognition approach but uses discrete action dura-
tion models and foreground based features for localization
and matching. [18] reports recognition rate of 78% and 90%with 1 : 8 and 3 : 5 train:test respectively. Our algorithm
achieves 92% recognition accuracy at 1 : 8 train:test. If we
replace the PSPM based localization with Hausdorff dis-
tance based method, recognition rate drops to 84%. This
illustrates that even in clean backgrounds, use of PSPMs
improves action recognition.
Augmented Hand Gestures Dataset: To demonstrate ro-
bustness to cluttered dynamic backgrounds, we generated
a dataset by embedding 45 action instances from the orig-
inal dataset [18] into videos with complex dynamic back-
grounds (see figure 6(f-k) for sample images). The dataset
Dataset Method Train:Test Recognition %
Hand Gestures
Natarajan et al [18] 1:8 78%Natarajan et al [18] 3:5 90.18%CAN (Hausdorff) 1:8 84.2%CAN (PSPM) 1:8 92%
USC GesturesSFD-CRF [17] MoCAP 77.45%CAN (PSPM) 1:6 89.5%
Table 1. Evaluation Results on the Hand Gestures Dataset.
has 215 videos including 3 different actors performing hand
gestures in 5 different scenes. Our algorithm achieves 91%recognition accuracy. Note that the recognition accuracy on
the original 45 videos from [18], that were used for em-
bedding, was about 95%. To process these videos, we used
the parameters trained on the original hand gestures dataset
[18].
In addition, we also collected 25 videos including 4 hand
gestures but performed in dynamic scenes, with camera
shakes and/or objects moving in the background. Our algo-
rithm, trained on the original dataset, correctly recognized
20 action instances (≈ 80% accuracy).
USC Gestures Dataset [17]: This dataset has videos of 6
full body actions, captured at various pan and tilt angles;
actions include - sit-on-ground, standup-from-ground, sit-on-chair, stand-from-chair, pickup-from-ground and point-forward. We evaluated our approach on a part of this dataset
that was captured at 0◦ tilt in 6 varying backgrounds includ-
ing cluttered indoor scenes and outdoors in front of moving
vehicles; rest of the dataset was captured at other tilt an-
gles in a relatively clean, static background. The selected
set include actions captured at 5 different camera pan an-
gles w.r.t. to the actor - 0◦, 45◦, 90◦, 270◦, 315◦, for a total
240 action instances, each performed either by a different
actor, at a different pan or in a different background. For
our experiments, we trained our models using 2 actions in-
stances from one actor and evaluated on the rest. Note that
models were trained only on 2 viewpoints, and tested on 5
different viewpoints. On segmented action instances, our
approach achieved an accuracy of 75.91%. Figure 6(n-s)
show sample results. [17] reports an accuracy of 77.35%;
however they assume that the sit-on-chair and sit-on-ground
actions are followed by stand-from-chair and stand-from-
ground respectively. When we incorporate this information,
our action recognition accuracy improves to 89.5% which
shows a 12% improvement over [17].
6. ConclusionWe have presented an approach for joint pose tracking
and action recognition in cluttered dynamic environments,
which has low training requirements and doesn’t require 3D
MoCAP data. We achieve this by proposing an accurate
and efficient pose localization approach using Pose-Specific
Part Models (PSPMs). We have demonstrated that the our
localization approach is robust to noise and works well in
(m)
(n) (o) (p) (s)(q) (r)
(a) (b) (c) (d) (e) (f)
(g) (h) (i) (j) (k) (l)
Figure 6. Results obtained on Gesture datasets (a-e) Hand Gestures [18], (f-m) Augmented Hand Gestures, (n-s) USC Gesures [17]. The
estimated pose is overlaid on each image (in red), and the corresponding part distribution obtained by applying PSPM is shown next to it
cluttered environments. Further, we have also demonstrated
our approach for action recognition on hand gestures as well
as on USC Gestures dataset with full body gestures in clut-
tered and dynamic environments.
Acknowledgements. This research was supported, in part,
by the Office of Naval Research under grants #N00014-06-
1-0470 and #N00014-10-1-0517.
References[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited:
People detection and articulated pose estimation. In CVPR, pages
1014–1021, 2009. 1, 2, 4, 5
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estima-
tion and tracking by detection. In CVPR, pages 623 –630, 2010. 2
[3] M. Collins. Discriminative training methods for hidden markov mod-
els: Theory and experiments with perceptron algorithms. In EMNLP,
2002. 3
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for
object recognition. IJCV, 61(1):55–79, 2005. 1, 2, 4, 5, 6
[5] V. Ferrari, M. J. Marın-Jimenez, and A. Zisserman. Pose search:
Retrieving people using their pose. In CVPR, 2009. 2, 5
[6] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. Huang. Action detection
in complex scenes with spatial and temporal ambiguities. In ICCV,
pages 128 –135, 2009. 2
[7] C. Huang and R. Nevatia. High performance object detection by
collaborative learning of joint ranking of granules features. In CVPR,
pages 41–48, 2010. 3
[8] N. Ikizler and D. A. Forsyth. Searching video for complex activities
with finite state models. In CVPR, 2007. 2
[9] H. Jiang and D. Martin. Global pose estimation using non-tree mod-
els. In CVPR, 2008. 1, 2, 4
[10] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parame-
terized model of articulated image motion. In FG, 1996. 4
[11] D. Koller and N. Friedman. Probabilistic Graphical Models - Prin-ciples and Techniques. MIT Press, 2009. 5
[12] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor mod-
els for 2d human pose recovery. ICCV, 2005. 2
[13] I. Laptev. On space-time interest points. International Journal onComputer Vision (IJCV), 64(2-3):107–123, 2005. 2, 6
[14] F. Lv and R. Nevatia. Single view human action recognition using
key pose matching and viterbi path searching. In CVPR, 2007. 2, 3
[15] R. Messing, C. Pal, and H. Kautz. Activity recognition using the
velocity histories of tracked keypoints. In ICCV, 2009. 2
[16] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discrim-
inative models for continuous gesture recognition. In CVPR, 2007.
2
[17] P. Natarajan and R. Nevatia. View and scale invariant action recog-
nition using multiview shape-flow models. In CVPR, 2008. 1, 2, 6,
7, 8
[18] P. Natarajan, V. K. Singh, and R. Nevatia. Learning 3d action models
from a few 2d videos for view invariant action recognition. In CVPR,
2010. 1, 2, 6, 7, 8
[19] H. Ning, W. Xu, Y. Gong, and T. S. Huang. Latent pose estimator for
continuous action recognition. In ECCV (2), 2008. 2
[20] D. Ramanan. Learning to parse images of articulated bodies. In
NIPS, pages 1129–1136, 2007. 1, 2, 4, 6
[21] R. Rosales and S. Sclaroff. Inferring body pose without tracking
body parts. In CVPR, 2000. 2
[22] L. Sigal, A. O. Balan, and M. J. Black. Combined discriminative and
generative articulated pose and non-rigid shape estimation. In NIPS,
2007. 2, 3
[23] L. Sigal and M. J. Black. Measure locally, reason globally:
Occlusion-sensitive articulated pose estimation. In CVPR, pages
2041–2048, 2006. 2, 6
[24] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Conditional
random fields for contextual human motion recognition. In ICCV,
pages 1808–1815, 2005. 2
[25] J. Sullivan and S. Carlsson. Recognizing and tracking human action.
In ECCV (1), pages 629–644, 2002. 3
[26] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic condi-
tional random fields: factorized probabilistic models for labeling and
segmenting sequence data. In ICML, page 99, 2004. 3
[27] T.-P. Tian and S. Sclaroff. Fast globally optimal 2d human detection
with loopy graph models. In CVPR, 2010. 1, 2, 4
[28] R. Urtasun, D. J. Fleet, and P. Fua. 3d people tracking with gaussian
process dynamical models. In CVPR, 2006. 2
[29] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial
constraints in human pose estimation. In ECCV, 2008. 2
[30] X. K. Wei and J. Chai. Modeling 3d human poses from uncalibrated
monocular images. In ICCV, pages 1873–1880, 2009. 3, 6
[31] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from
arbitrary views using 3d exemplars. In ICCV, 2007. 2