
Action Recognition in Cluttered Dynamic Scenes using Pose-Specific Part Models

Vivek Kumar Singh, University of Southern California, Los Angeles, CA, USA, [email protected]

Ram Nevatia, University of Southern California, Los Angeles, CA, USA, [email protected]

Abstract

We present an approach to recognizing single actor human actions in complex backgrounds. We adopt a Joint Tracking and Recognition approach, which tracks the actor pose by sampling from 3D action models. Most existing such approaches require large training data or MoCAP to handle multiple viewpoints, and often rely on clean actor silhouettes. The action models in our approach are obtained by annotating keyposes in 2D, lifting them to 3D stick figures and then computing the transformation matrices between the 3D keypose figures. Poses sampled from coarse action models may not fit the observations well; to overcome this difficulty, we propose an approach for efficiently localizing a pose by generating a Pose-Specific Part Model (PSPM) which captures appropriate kinematic and occlusion constraints in a tree structure. In addition, our approach does not require pose silhouettes. We show improvements over previous results on two publicly available datasets as well as on a novel, augmented dataset with dynamic backgrounds.

1. Introduction

The objective of this work is to recognize single actor human actions in videos captured from a single camera. This has been a popular research topic over the past few years, as effective solutions to this problem find applications in surveillance, HCI, and video retrieval, among others. Existing action recognition methods work well with variations in actor appearance; however, handling viewpoint variations with low training requirements and dealing with cluttered, dynamic backgrounds remain challenging. While view-invariant approaches using 3D models have been proposed, they either require 3D MoCAP for learning models and/or require videos from multiple viewpoints.

We present a simultaneous tracking and recognition approach which tracks the actor pose by sampling from 3D action models and localizing each pose sample; this allows view-invariant action recognition. To deal with cluttered, dynamic backgrounds, we accurately localize each pose using a 2D part model. We model an action as a sequence of transformations between keyposes. These action models can be obtained by annotating keyposes in 2D, lifting them to 3D stick figures and then computing the transformation matrices between 3D keyposes [18]; this avoids large training data and MoCAP. However, poses sampled from such coarse models do not match observations well. Thus, during inference, errors due to pose approximation and observation noise accumulate over time and result in tracking failures and lower recognition rates, especially in cluttered dynamic scenes.

We address these issues through more accurate localization of the human pose using a 2D part model with kinematic constraints. Such models have been successfully applied to localize human pose in cluttered images, under the assumption that the parts are not occluded [4, 20, 1]. However, poses often have multiple occluded parts, and hence modeling inter-part occlusions is useful for accurately localizing such poses. Existing methods such as [9, 27] that model such constraints are too inefficient for tracking and recognition, where multiple poses may need to be localized every few frames. We propose a novel framework to select a tree-structured model that captures appropriate kinematic and inter-part occlusion constraints for a particular pose in order to accurately localize that pose; we call this model the Pose-Specific Part Model (PSPM). To determine the PSPM for a given pose, we search over many possible tree models and select the model with the highest localizability score.

We demonstrate our approach on two publicly available datasets: Full Body Gestures [17], with 6 actions captured from multiple viewpoints against cluttered, dynamic backgrounds, and Hand Gestures [18], with 12 actions with subtle pose variations against a static background. To further demonstrate robustness to background changes, we evaluate our method on an augmented Hand Gestures set with 25 real sequences with camera shake and background object motion, and 215 sequences with embedded dynamic backgrounds. We also evaluate localization using PSPM on an image dataset with different poses and backgrounds, and show improvements over standard Pictorial Structures [4] and other pose localization methods.

In the rest of the paper, we first review related work in Section 2. We then present the action representation and inference in Section 3. Next, we describe pose localization from 3D priors using the Pose-Specific Part Model (PSPM) in Section 4, followed by the results.

2. Related Work

A natural approach to recognizing actions is to first estimate the body pose and then infer the action based on the pose dynamics [8, 5]. However, the effectiveness of such approaches depends on reliable human pose tracking methods. A popular alternative is to avoid pose tracking and directly match image descriptors to the action models by learning action classifiers, using SVMs [13] or graphical models such as CRFs [16] and LDA [15]; however, it is difficult to capture temporal relationships in such models. Furthermore, these methods typically require large amounts of training data from multiple viewpoints.

Another approach is to simultaneously track the pose and recognize the action; we refer to these as Joint-Tracking-and-Recognition methods. These methods learn action models that capture the evolution of the actor pose in 3D and, during inference, use the action priors for tracking the pose and the estimated pose for recognizing actions. While these methods work well across viewpoints, most of them require 3D MoCAP data for learning accurate models [28, 22, 19, 17] and/or rely on person silhouettes for localization and matching [21, 24, 23, 31, 19, 6, 18], which assumes a static background. Recently, [18] proposed a multi-view approach without using MoCAP, by learning 3D action models from 2D keypose annotations and recognizing actions by matching poses sampled from the action models to actor silhouettes. However, poses sampled from such coarse models result in large matching errors which accumulate over time and significantly affect recognition, especially in cluttered scenes. In our work, we address this issue by accurate pose localization using part models.

An important aspect of Joint-Tracking-and-Recognition methods is reliably localizing/matching the pose to the video. Recently, part-based graphical models (pictorial structures [4]) have been shown to accurately localize 2D poses in complex backgrounds [1, 20], but they do not model inter-part occlusion. Localizing poses with inter-part occlusions requires simultaneous modeling of body kinematics and inter-part occlusion, which makes inference hard. Existing approaches model such constraints using common-factor models [12] or multiple trees [29], or represent them in a kinematic graph (with cycles) and infer the pose using non-parametric message passing [23] or branch-and-bound [9, 27]. However, these methods either use person silhouettes [23], require training data from all viewpoints [12], or are too inefficient [9, 27] for tracking. Recently, [2] trained multiple view-specific models for estimating pose in the walking action; however, for multiple actions, a large number of models would need to be trained.

3. Action Recognition

In this work, we build on the Joint-Tracking-and-Recognition approach of combining Tracking-by-Priors and Recognition-by-Tracking. For each action, we obtain an approximate model of the human pose dynamics in a scale- and pan-normalized 3D space; this allows a scale- and viewpoint-invariant representation. This is done by scaling the poses to a fixed known height. For inference, we match image observations to the action models by tracking with a 3D human model in the action-restricted pose space, and find the action with the highest matching score [31, 14, 18]. Here, we first present the action representation and model learning, followed by action and pose inference. The pose localization is described later in Section 4.

3.1. Representation and Learning

We learn a separate model for each action that captures the dynamics of the human pose. Our models are based on the idea that a single-actor human action can be represented as a sequence of linear transformations between a few representative keyposes. Our action model is inspired by [18], which refers to the linear transformation between a keypose pair as a primitive. For example, the walking action can be represented with four primitives: left leg forward → right leg crosses left leg → right leg forward → left leg crosses right leg. Note that each primitive is a conjunction of rotations of body parts, e.g., during walking, rotation of the upper leg about the hip and rotation of the lower leg about the knee, and thus can be represented as a linear transformation in joint-angle space. This is illustrated in Figure 1.

Figure 1. Geometric interpretation of the action model for walking in the scale-normalized joint-angle space; dotted red curves denote different instances of the walking action; the piecewise linear curve (in gray) denotes the learnt action model, with keyposes marked with circles (in black).

To capture the variations in keyposes across different instances of the same action, we model each keypose by a set of Gaussian distributions, one for every 3D joint position. For speed variations, we model the length of each primitive as a truncated sigmoid function. We normalize each primitive to unit length and learn a Gaussian over the fraction of the primitive that is covered at each time step. Thus, an action with $N_k$ keyposes is modeled by a set of $N_k \times (N_j + 1)$ Gaussians, where $N_j$ is the number of 3D joints (= 15).
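As an illustration of this representation, the following Python sketch (not the authors' implementation) samples a 3D pose along a primitive: each keypose is stored as hypothetical per-joint Gaussian mean and standard-deviation arrays, and the elapsed fraction of the primitive linearly interpolates between the two keyposes in the scale-normalized joint space.

```python
import numpy as np

N_JOINTS = 15  # number of 3D joints in the model

def sample_pose(keypose_a, keypose_b, fraction, rng=None):
    """Sample a 3D pose along a primitive (keypose_a -> keypose_b).

    keypose_a, keypose_b: dicts with 'mean' (N_JOINTS x 3) and 'std'
    (N_JOINTS x 3) arrays, the per-joint Gaussians of the two keyposes.
    fraction: how much of the primitive has elapsed, in [0, 1].
    """
    rng = rng or np.random.default_rng()
    # Linearly interpolate the per-joint Gaussian parameters.
    mean = (1.0 - fraction) * keypose_a['mean'] + fraction * keypose_b['mean']
    std = (1.0 - fraction) * keypose_a['std'] + fraction * keypose_b['std']
    # Draw one pose sample; each joint is perturbed independently.
    return rng.normal(mean, std)

# Example with two synthetic keyposes (random numbers stand in for learnt means).
kp1 = {'mean': np.zeros((N_JOINTS, 3)), 'std': 0.05 * np.ones((N_JOINTS, 3))}
kp2 = {'mean': np.ones((N_JOINTS, 3)),  'std': 0.05 * np.ones((N_JOINTS, 3))}
pose = sample_pose(kp1, kp2, fraction=0.3)
print(pose.shape)  # (15, 3)
```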

We learn action models by annotating 2D poses and the primitive and action boundaries in the training videos. For each action, we first manually select the set of keyposes; intuitively, we select a keypose whenever there is a "big" change in the pose dynamics. Alternatively, if 3D MoCAP is available, keyposes can be obtained automatically as discontinuities in the pose energy [14]. We then learn the 3D model for each keypose from the 2D annotations by lifting them to 3D (using our implementation of [30]). For each primitive, we obtain the expected change in duration by collecting primitive lengths from the action boundary annotations and fitting a Gaussian.

3.2. Conditional Action Network

Given the action models, we embed them into a Dynamic Conditional Random Field [26], which we refer to as a Conditional Action Network (CAN), illustrated in Figure 2.

Figure 2. Conditional Action Network

We define the state $s_t$ of the CAN at time $t$ by a tuple of action and pose variables $\langle s^{act}_t, s^{pose}_t \rangle$; the action state $s^{act}_t = \langle a_t, p_t, f_t \rangle$ includes the action label $a_t$, the current primitive $p_t$ and the fraction of the primitive elapsed $f_t$, and the pose state $s^{pose}_t = \langle x_t \rangle$ includes the current pose $x_t$. To infer the action from an observation sequence of length $T$, we estimate the optimal state sequence over all actions by maximizing the log-linear likelihood, which takes the following form:

$$s^{best}_{[1:T]} = \arg\max_{s_{[1:T]}} \sum_{t=1}^{T} \left( \sum_{f=1}^{n_o} w_f\, \phi_f(s_t, s_{t-1}, I_t) \right)$$

where $\phi_f(s_t, s_{t-1}, I_t)$ are the observation and transition potentials and $w = \{w_f\}$ is the weight vector, with one weight per potential function.

[Transition Potentials] The action transition potential $\phi(a_t, f_t, a_{t-1}, f_{t-1})$ is modeled as a truncated sigmoid function over the fraction of the primitive elapsed $f_t$, such that the probability of staying in the same primitive $p_t$ decreases as $f_t$ approaches 1 and the probability of transitioning to a new primitive increases. The pose transition potential $\phi(x_t, x_{t-1})$ is modeled using a Normal distribution $\mathcal{N}(0, \sigma)$ over the displacement of the neck position and the height $h_t$.
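The exact parameterization of the truncated sigmoid is not given in the text; the sketch below shows one plausible form, with the midpoint and steepness as free (hypothetical) parameters, so that the probability of staying in the current primitive falls off as $f_t$ approaches 1.

```python
import math

def stay_probability(f, midpoint=0.8, steepness=12.0):
    """Probability of staying in the current primitive given the elapsed
    fraction f; one possible truncated-sigmoid form (parameters are assumed)."""
    f = min(max(f, 0.0), 1.0)                      # truncate to [0, 1]
    return 1.0 / (1.0 + math.exp(steepness * (f - midpoint)))

def transition_probability(f, **kwargs):
    """Probability of moving on to the next primitive (or action)."""
    return 1.0 - stay_probability(f, **kwargs)

for f in (0.2, 0.6, 0.9, 1.0):
    print(f, round(stay_probability(f), 3), round(transition_probability(f), 3))
```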

[Observation Potentials] We compute the observation likelihood of a pose sample $x_t$, sampled from the action-pose potential $\phi(a_t, f_t, x_t)$, by combining shape and motion likelihoods. We first localize the pose using a part-based model, which is generated from the spatial prior available from the action model and handles constraints due to occlusion. We then compute the shape likelihood as the normalized log-likelihood of the parts used in the model. The details of this step are described in Section 4.

$$\phi_{shape}(x) = \frac{1}{|P|} \sum_{i \in P} \phi_i(x_i, I_t)$$

where $P$ is the set of parts in the pose model and $x_i$ is the $i$-th part of pose $x$.

The motion likelihood is computed by matching the observed optical flow to the direction of motion of each part, using the cosine distance. We use the Lucas-Kanade algorithm (in OpenCV 1.0) to compute the optical flow and quantize the flow into 8 orientation bins.
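The following is a minimal sketch of how such a motion score could be computed for one part, assuming dense flow fields and a binary part mask; the region shape and the unweighted averaging are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def quantize_orientation(angles, n_bins=8):
    """Quantize flow angles (radians) into n_bins orientation bins and
    return the bin-centre angle for each pixel."""
    bin_width = 2.0 * np.pi / n_bins
    bins = np.round(angles / bin_width) % n_bins
    return bins * bin_width

def part_motion_likelihood(flow_u, flow_v, part_mask, part_direction):
    """Cosine-similarity based motion score for one part.

    flow_u, flow_v : HxW optical-flow components (e.g. from Lucas-Kanade).
    part_mask      : HxW boolean mask of pixels covered by the part.
    part_direction : predicted motion direction of the part, in radians.
    """
    angles = np.arctan2(flow_v[part_mask], flow_u[part_mask])
    angles = quantize_orientation(angles)              # 8 orientation bins
    # Cosine similarity between observed and predicted motion directions.
    cos_sim = np.cos(angles - part_direction)
    return float(cos_sim.mean()) if cos_sim.size else 0.0
```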

[Weight Learning] We assume uniform transition weights across different actions/primitives, so weight learning only involves learning three weight values, one for each potential. In this work, we use the Voted Perceptron algorithm [3] for its efficiency and ease of implementation. The ground-truth pose estimates for all frames were obtained by running our inference with the known action label for each sequence.
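A minimal sketch of averaged (voted) perceptron training over the three potential weights is shown below; `feature_vector` and `decode` are hypothetical helpers standing in for the CAN-specific potential sums and inference, so this illustrates the update rule rather than the authors' exact procedure.

```python
import numpy as np

def averaged_perceptron(sequences, gold_labels, feature_vector, decode,
                        n_epochs=5, n_weights=3):
    """Averaged (voted) perceptron over the potential weights.

    feature_vector(seq, labels) -> np.ndarray of summed potential values
    decode(seq, w)              -> highest-scoring label sequence under w
    Both helpers are placeholders for the CAN-specific computations.
    """
    w = np.zeros(n_weights)
    w_sum = np.zeros(n_weights)
    n_updates = 0
    for _ in range(n_epochs):
        for seq, gold in zip(sequences, gold_labels):
            pred = decode(seq, w)
            if pred != gold:
                # Standard structured-perceptron update toward the gold labeling.
                w += feature_vector(seq, gold) - feature_vector(seq, pred)
            w_sum += w
            n_updates += 1
    return w_sum / max(n_updates, 1)   # averaging approximates voting
```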

3.3. Tracking and Recognition

Since our action models are continuous and our graphical model has cycles, exact inference is infeasible. Thus, we use a particle filtering approach [22, 25], sampling poses from the action models and matching each pose to the scene observations.

During tracking, we first find the person by applying a full-body and a head-shoulder pedestrian detector [7]; using multiple detectors helps reliable detection, especially in complex scenes. We then uniformly sample poses from the action models and localize the poses to fit the observations, using the approximate position (neck) and scale (person standing height) available from the detection responses. The details of the localization method are described in Section 4. For viewpoint invariance, poses are matched to the observations at various pan angles.

To propagate each sample $s_t$ over time, we increment $f_t$ (the fraction of the primitive elapsed) to obtain the next action state $s^{act}_{t+1}$; note that if $f_t$ is toward the end of a primitive, the next state may transition to the next primitive or action. We then perturb the position and scale of the person, and obtain the next pose by localizing the pose to the observations; note that the localization step takes into account the spatial prior on the pose from the action model $\langle a_{t+1}, p_{t+1}, f_{t+1} \rangle$. For actions that are performed while standing at the same location, such as sitting on the ground, we impose a constraint that the feet of the person remain on the ground at roughly the same location (using a penalty function modeled as a zero-mean Gaussian). This constraint makes our tracker more robust to drifting. The best state sequence over all frames is then obtained from the state distribution using the Viterbi algorithm.
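The following sketch illustrates how one particle's action state might be propagated, assuming a hypothetical `actions` structure holding each action's primitives and their expected durations; the transition rule and perturbation magnitudes are illustrative, not the exact implementation.

```python
import random

def propagate_sample(state, actions, pos_sigma=2.0, scale_sigma=0.02):
    """Advance one particle <action, primitive, fraction, position, scale>.

    `actions` maps an action label to {'primitives': [{'expected_frames': ...}, ...]};
    this structure is illustrative.
    """
    a, p, f = state['action'], state['primitive'], state['fraction']
    pos, scale = state['position'], state['scale']
    duration = actions[a]['primitives'][p]['expected_frames']
    f += 1.0 / duration                      # increment fraction elapsed
    if f >= 1.0:                             # near the end: move on
        f = 0.0
        if p + 1 < len(actions[a]['primitives']):
            p += 1                           # next primitive of the same action
        else:
            a = random.choice(list(actions)) # or transition to a new action
            p = 0
    # Perturb position and scale before re-localizing the pose (Sec. 4).
    pos = (pos[0] + random.gauss(0, pos_sigma), pos[1] + random.gauss(0, pos_sigma))
    scale = scale * (1.0 + random.gauss(0, scale_sigma))
    return {'action': a, 'primitive': p, 'fraction': f,
            'position': pos, 'scale': scale}
```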

4. Accurate Pose Localization from 3D Priors

In this section, we present our approach to accurately localizing a hypothesized pose (from the action model) in the image observations. Given prior information such as scale and position, localization involves searching through the pose space to infer the pose that best explains the image evidence. In our setting, where the pose is tracked using approximate action models, the prior on the pose includes coarse 2D position and scale information and the pose subspace that is likely to contain the true pose. It is natural to assume that in cluttered environments, the 2D position and scale priors may be quite noisy. Furthermore, the pose subspace induced by the action model can be large, especially for fast-moving parts, e.g., the hands during waving.

For efficient localization, we first project the 3D pose search space onto the 2D image to obtain a spatial prior on the 2D pose, then localize the 2D pose using image observations, and finally estimate the 3D pose from the aligned 2D pose. For 2D pose localization, we use a part-based graphical model approach (similar to pictorial structures [4, 20, 1]) which represents the human body by its constituent parts (see Figure 3(a)) and imposes pairwise constraints over the parts during inference. These pairwise constraints model the kinematic and/or inter-part occlusion relationships between the parts; however, when all such constraints are imposed, the graphical model has loops (see Figure 3(b)). Although attempts have been made to infer pose using models with loops, they tend to be computationally expensive [9, 27]. Thus, for efficient and exact inference, tree-structured models are preferred.

We develop an approach to automatically select a tree-structured model that is most likely to give an accurate localization for a given pose, by leveraging the fact that under occlusion, some kinematic constraints may be relaxed in order to model constraints that are more effective for localization; we call this model the Pose-Specific Part Model (PSPM).

In the following, we first present 2D pose localization using a tree-structured part model. We then describe PSPM selection and learning, followed by 3D pose localization using the PSPM.

Figure 3. Graphical models for 2D pose: (a) kinematic tree model [4]; (b) graph with edges to model kinematic and inter-part occlusion constraints; observation nodes are not shown for clarity.

4.1. Localizing 2D Pose using Part Model

In a 2D pose model, each part is represented as a node and the edges represent pairwise constraints between the parts. During inference, detectors for all parts are independently applied to the image, and the best pose $x$ is then obtained by maximizing the joint likelihood

$$p(x, I \mid \Theta) = \prod_{i \in P} p(I \mid x_i, \Theta^s_i) \prod_{ij \in E} p(x_i \mid x_j, \Theta^p_{ij}) \quad (1)$$

where $x_i$ denotes part $i$; $(P, E)$ is the graphical model over the parts $P$; $p(I \mid x_i, \Theta^s_i)$ is the likelihood of part hypothesis $x_i$ obtained by applying the part detector; $p(x_i \mid x_j, \Theta^p_{ij})$ represents the pairwise constraints; and $\Theta = (\Theta^s, \Theta^p)$ are the model priors for the unary and pairwise potentials. Commonly used 2D pose models [4, 20, 1] assume a tree structure, for which efficient and exact inference can be performed [4].
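For reference, exact MAP inference over a tree-structured part model can be carried out with a max-product dynamic program; the sketch below assumes discrete candidate locations per part and precomputed log-potentials (efficient pictorial-structures implementations additionally use distance transforms, which this sketch omits).

```python
import numpy as np

def best_pose_tree(tree_children, root, unary, pairwise):
    """Exact MAP inference over a tree-structured part model (log domain).

    tree_children[i] -> list of children of part i
    unary[i]         -> (K_i,) log detector scores for candidates of part i
    pairwise[(i, j)] -> (K_i, K_j) log pairwise potentials for parent i, child j
    Returns a dict mapping each part to its selected candidate index.
    """
    messages = {}

    def collect(i):
        # Best score of the subtree rooted at i, for each candidate of i.
        score = unary[i].copy()
        for c in tree_children.get(i, []):
            collect(c)
            # max over child candidates of (pairwise + child's subtree score)
            score += (pairwise[(i, c)] + messages[c][None, :]).max(axis=1)
        messages[i] = score

    collect(root)
    # Backtrack from the root to recover the argmax configuration.
    assignment = {root: int(np.argmax(messages[root]))}

    def distribute(i):
        for c in tree_children.get(i, []):
            scores = pairwise[(i, c)][assignment[i]] + messages[c]
            assignment[c] = int(np.argmax(scores))
            distribute(c)

    distribute(root)
    return assignment
```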

4.1.1 Part Detection

Recently, [1] reported that better part detectors can significantly improve localization results; however, better part detectors are also computationally expensive. Thus, in this work, we experiment with two types of detectors that can be applied efficiently and have previously been used for localizing 2D body parts: geometric templates [10] and boundary and region templates [20]. We briefly describe these part detectors below.

[Geometric Templates] Each part is modeled with a simple geometric object: the head with an ellipse, the torso with an oriented rectangle, and each arm with a pair of line segments. The log-likelihood score of a part is obtained by accumulating the edge strength and orientation match along the boundary points.

[Boundary and Region Templates] Each template is a weighted sum of oriented bar filters, where the weights are obtained by maximizing the conditional joint likelihood [20]. We use the detectors provided by the authors.

4.1.2 Pairwise Constraints

The pairwise kinematic potential between parts is defined using a Gaussian distribution, similar to [4, 1]. To prevent overlapping parts from occupying exactly the same place, we add an additional repulsion constraint that reduces the likelihood of an occluded part overlapping with its occluder. For parts $x_i$ and $x_j$ such that $x_i$ occludes $x_j$, we define the pairwise potential as

$$p(x_i \mid x_j, \Theta_{ij}) = \mathcal{N}(l_i - l_j;\, \mu_{ij}, \sigma_{ij}) \times \Lambda(l_i, l_j)$$

where $l_i$ denotes the position and orientation of $x_i$, $\Theta_{ij} = (\mu_{ij}, \sigma_{ij})$ is the Gaussian prior over the relative part position and orientation, and $\Lambda(l_i, l_j)$ is the repulsive prior between the overlapping parts [5].
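A minimal sketch of this pairwise potential is given below; the Gaussian term follows the definition above, while the repulsion term $\Lambda$ is approximated by an exponential penalty on the overlap of the two part boxes (the exact form from [5] is not reproduced here), and the box representation of parts is an assumption.

```python
import numpy as np

def gaussian_log_prob(x, mu, sigma):
    """Independent Gaussian log-density over relative position/orientation."""
    x, mu, sigma = map(np.asarray, (x, mu, sigma))
    return float(-0.5 * np.sum(((x - mu) / sigma) ** 2
                               + np.log(2.0 * np.pi * sigma ** 2)))

def overlap_fraction(box_a, box_b):
    """Fraction of box_a covered by box_b; boxes are (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return inter / area_a if area_a > 0 else 0.0

def pairwise_log_potential(li, lj, mu_ij, sigma_ij, box_i, box_j, alpha=5.0):
    """log N(l_i - l_j; mu_ij, sigma_ij) + log Lambda(l_i, l_j).

    li, lj: (x, y, theta) of the occluder part i and the occluded part j.
    The repulsion term penalizes the occluded part for hiding under the occluder;
    alpha is a hypothetical strength parameter.
    """
    kin = gaussian_log_prob(np.subtract(li, lj), mu_ij, sigma_ij)
    repulsion = -alpha * overlap_fraction(box_j, box_i)   # assumed form of Lambda
    return kin + repulsion
```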

4.2. Pose-Specific Part Models for Localization

Given spatial priors on a 3D pose, the Pose-Specific Part Model (PSPM) is a tree-structured graph tuned to accurately localize the specified pose. Obtaining the PSPM for a pose involves selecting the model (the set of parts $P$ and the structure $E$) and estimating the model prior $\Theta$ that are likely to maximize the joint likelihood. Accurate localization is then obtained by maximizing Eqn. 1.

[Part Selection] For accurate localization, we select the parts that are at least partially visible, since the part detectors do not work well for heavily occluded parts. To achieve this, we project the 3D pose to obtain the approximate position and orientation of each part. This information, together with the relative depth ordering of the parts, is used to estimate the visibility of each part. The visibility $v(p_i)$ is computed as the fraction of part $p_i$ that is unoccluded, i.e.

$$v(p_i) = 1 - ovlp\Big(p_i, \bigcup_{j \neq i} p_j\Big) \quad (2)$$

where $ovlp(p_i, p_j)$ denotes the fraction of part $p_i$ occluded by $p_j$. For model selection, we only consider parts with visibility greater than 0.5.
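The sketch below illustrates one way to compute Eq. 2, approximating each projected part by an axis-aligned box rasterized onto a coarse grid and taking the union over parts that lie in front of $p_i$ according to the depth ordering; the box approximation and grid size are assumptions.

```python
import numpy as np

def rasterize(box, grid_shape):
    """Boolean mask of an axis-aligned box (x0, y0, x1, y1) on a pixel grid."""
    mask = np.zeros(grid_shape, dtype=bool)
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    mask[max(y0, 0):max(y1, 0), max(x0, 0):max(x1, 0)] = True
    return mask

def visibility(part_boxes, depths, grid_shape=(240, 320)):
    """v(p_i) = 1 - ovlp(p_i, union of parts in front of p_i).

    part_boxes: list of projected 2D boxes, one per part.
    depths:     list of relative depths (smaller = closer to the camera).
    """
    masks = [rasterize(b, grid_shape) for b in part_boxes]
    vis = []
    for i, m_i in enumerate(masks):
        occluders = np.zeros(grid_shape, dtype=bool)
        for j, m_j in enumerate(masks):
            if j != i and depths[j] < depths[i]:   # part j is in front of part i
                occluders |= m_j
        area = m_i.sum()
        occluded = (m_i & occluders).sum()
        vis.append(1.0 - occluded / area if area else 0.0)
    return vis

# Parts with visibility below 0.5 would be dropped from the PSPM.
```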

[Structure Selection] This step involves selecting, from all possible trees, a tree that captures appropriate constraints for localizing the given pose. For localizing poses with partially or fully occluded parts, we can relax some kinematic constraints in the standard tree model of Figure 3(a) and add an approximate neighborhood-cum-non-overlap constraint, such that the resulting model is still a tree. For example, consider the pose in Figure 4(a). An alternative to the standard kinematic model connects the left lower leg to the right lower leg, and results in a better pose estimate than the standard kinematic tree. Since the upper and lower parts of the body are rarely coupled (i.e., kinematically connected or occluding each other), we ignore edges between an arm and a leg. Figure 3(b) shows the edges considered for structure selection.

Figure 4. Pose localization using the Pose-Specific Part Model: (a) image of a person sitting down; (b) selected Pose-Specific Part Model (occluded parts are marked with dotted lines); (c) localized 2D parts obtained using the selected PSPM.

A standard approach to structure selection is to find the tree structure that maximizes the joint likelihood over labeled data [11]. This involves estimating the prior parameters (mean and variance) for all pairs of connected parts, and then finding the tree structure with the lowest score (sum of variances over all edges). Since the tree structure that maximizes the joint likelihood may differ across poses, this standard learning approach would require labeled data for all poses in the action model, from various viewpoints, which is prohibitively large. In this work, we instead propose a measure of model score based on the geometry of the pose.

To arrive at an appropriate measure, we annotated 2D and 3D poses for 200 images and estimated the tree model with the highest localization score by performing an exhaustive search over all tree-structured models from the graph shown in Figure 3(b). Note that the number of possible tree models is quite large. To reduce the search space, we consider only those trees which include the kinematic edges and those non-kinematic edges whose connected pair of parts overlap.

From our experiments, we observed that for poses with unoccluded parts, the best tree consisted mostly of kinematic edges. However, non-kinematic edges were preferred when parts occluded each other. Based on this observation, we propose a score, the localization effect of an edge $L(e_{ij})$, which captures the "usefulness" of that edge toward localizing the given pose. We define the localization effect of an edge as the product of the detection accuracy of the part detectors and the degree of occlusion of the connected parts:

$$L(e_{ij}) = \begin{cases} D(p_i)\,D(p_j)\,\min\{v(p_i), v(p_j)\}, & e_{ij} \in K \\ D(p_i)\,D(p_j)\,\max\{ovlp(p_i, p_j), ovlp(p_j, p_i)\}, & e_{ij} \notin K \end{cases}$$

where $K$ is the set of kinematic edges, $D(p_i)$ is the detection accuracy of the detector for part $p_i$, and the min/max term captures the degree of occlusion.

Tree selection for accurate localization can then be formulated as a search over the set of edges that maximizes the total localization effect. Since the localization effect of an edge is independent of the others, the optimal tree structure $E^*$ can be estimated as

$$E^* = \arg\max_{E \subseteq G} \sum_{ij \in E} L(e_{ij}) \quad \text{s.t. } E \text{ is a tree} \quad (3)$$

where $G$ is the graph with all pairwise constraints. Note that Equation 3 can be solved efficiently by finding the maximum spanning tree in the graph $G$, with $L(e_{ij})$ as the weight of edge $e_{ij}$.
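Since Equation 3 reduces to a maximum spanning tree, a minimal Kruskal-style sketch is shown below; the detector accuracies `D`, visibilities `vis` and overlap fractions `ovlp` are assumed to be precomputed lookups (with `ovlp` keyed on both edge orderings), not outputs the paper specifies.

```python
def localization_effect(edge, kinematic_edges, D, vis, ovlp):
    """L(e_ij) as defined above; D, vis and ovlp are precomputed lookups."""
    i, j = edge
    if edge in kinematic_edges or (j, i) in kinematic_edges:
        return D[i] * D[j] * min(vis[i], vis[j])
    return D[i] * D[j] * max(ovlp[(i, j)], ovlp[(j, i)])

def select_pspm_structure(parts, candidate_edges, kinematic_edges, D, vis, ovlp):
    """Maximum spanning tree over the candidate graph (Kruskal + union-find)."""
    parent = {p: p for p in parts}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path compression
            p = parent[p]
        return p

    weighted = sorted(candidate_edges,
                      key=lambda e: localization_effect(e, kinematic_edges, D, vis, ovlp),
                      reverse=True)                       # heaviest edges first
    tree = []
    for (i, j) in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:                                      # adding the edge keeps a forest
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == len(parts) - 1:
            break
    return tree
```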

[Estimating Model Prior $\Theta$] We define the pairwise potential using a Gaussian (Section 4.1.2). Previous methods work with an uninformed prior and hence learn the parameters of the Gaussian from labeled data [4]. In our case, where prior knowledge of the pose is available, learning pose-specific parameters is more meaningful. However, learning pose-specific parameters would require a prohibitively large number of pose samples (for all poses from various viewpoints). We therefore estimate these parameters from the prior on the 3D pose. The model parameters, the mean and variance at each joint, are estimated by projecting the 3D pose prior, modeled as Gaussian distributions, to 2D. For example, the mean relative position $\mu_{ij}$ of part $i$ w.r.t. part $j$ is simply the difference between the mid-point of the end-joints of part $p_i$ and that of part $p_j$.

4.3. Localizing Pose from 3D Action Priors

The action prior includes the 3D prior on the pose, represented with Gaussian distributions (one for each joint), and the approximate position and scale of the person available from the tracker. Given this prior, we obtain an accurate 2D localization of the pose using the PSPM (as described above). Note that during inference, we only apply each part detector in the neighborhood of the projected 2D position, orientation and scale of that part.

After localizing the pose in 2D, we estimate the 3D pose from the 2D joint positions. While estimating 3D pose from 2D joints is ambiguous, in our case the spatial priors on the pose available from the action model and the tracking information help remove such ambiguities. For accurate 3D pose estimation from a 2D pose with known depth ordering of parts, one could estimate the 3D joints using non-linear least squares to fit the 2D estimates while constraining the joints to stay within the pose search space (similar to [30]); in this work, we simply update each joint position, starting from the neck, assuming that the 3D lengths of the parts do not change. An initial estimate of the 3D part lengths is obtained by scaling a canonical 3D model in the standing pose, such that the height of the model matches the observed height of the actor (available from tracking).
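A minimal sketch of this joint update is shown below, assuming a scaled-orthographic projection so that the out-of-plane offset of a child joint follows from the known 3D limb length and the observed 2D length, with the sign taken from the depth ordering; the projection model is an assumption, not something the paper states.

```python
import math

def lift_joint(parent_3d, parent_2d, child_2d, limb_length, scale, sign=+1):
    """Recover a child joint in 3D from its 2D estimate.

    parent_3d   : (X, Y, Z) of the already-lifted parent joint.
    parent_2d, child_2d : (x, y) image positions.
    limb_length : known 3D length of the limb (from the scaled canonical model).
    scale       : pixels per 3D unit (scaled-orthographic assumption).
    sign        : +1 / -1 from the relative depth ordering of the parts.
    """
    dx = (child_2d[0] - parent_2d[0]) / scale
    dy = (child_2d[1] - parent_2d[1]) / scale
    planar_sq = dx * dx + dy * dy
    # Clamp: if the observed 2D limb is longer than the 3D limb, keep it in-plane.
    dz_sq = max(limb_length ** 2 - planar_sq, 0.0)
    dz = sign * math.sqrt(dz_sq)
    return (parent_3d[0] + dx, parent_3d[1] + dy, parent_3d[2] + dz)

# Starting from the neck, each joint is lifted in turn down the kinematic tree.
```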

5. Experiments

We first demonstrate our pose localization approach using PSPMs on an image dataset with pose annotations. We then evaluate our action recognition algorithm, which uses PSPMs for localization, on two publicly available datasets: Full Body Gestures [17] and Hand Gestures [18]. Compared to the KTH [13], Weizmann, HumanEva [23] and Hand Gestures [18] datasets, which have clean backgrounds and/or few viewpoint variations, the Full Body Gestures set includes videos with cluttered, dynamic backgrounds captured at various viewpoints. We also report results on hand gestures in dynamic scenes.

5.1. Pose Localization

We selected frames from existing action recognition datasets [17, 18] and created a collection of 195 images with a variety of poses. For each image, we annotated the 3D pose of the actor by marking the 2D joint positions and their relative depths, followed by lifting to 3D (similar to the keypose annotations). To quantitatively evaluate pose localization, we compute the average localization score over the visible parts: a part is considered correctly localized if it overlaps more than 50% with the ground-truth part.
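The sketch below illustrates this evaluation criterion with parts approximated by boxes; normalizing the overlap by the ground-truth area is an assumption, since the paper does not specify which normalization it uses.

```python
def box_overlap_fraction(est, gt):
    """Overlap of the estimated box with the ground-truth box,
    normalized by the ground-truth area; boxes are (x0, y0, x1, y1)."""
    ix0, iy0 = max(est[0], gt[0]), max(est[1], gt[1])
    ix1, iy1 = min(est[2], gt[2]), min(est[3], gt[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0

def localization_accuracy(estimated, ground_truth, visible):
    """Fraction of visible parts localized with more than 50% overlap."""
    correct = sum(1 for i in visible
                  if box_overlap_fraction(estimated[i], ground_truth[i]) > 0.5)
    return correct / len(visible) if visible else 0.0
```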

Recall that the pose prior includes approximate 2D scale and position information from the tracker, as well as the approximate 3D pose (represented as a set of Gaussian distributions over the 3D joint positions). To simulate the noisy prior obtained from the action models, we set the variance of each 3D joint to 5% of the part length. This prior was then used as input to the various localization methods.

We first apply our implementation of Pictorial Structures (PS) [4], which is a tree-structured model with kinematic edges and an uninformed prior. Using Boundary Templates (BT), PS gives a localization accuracy of 44.53%. We then modify PS by applying the part detectors only in the search region provided by the prior and enforcing kinematics using parameters estimated from the prior; we refer to this as CPS (Constrained Pictorial Structures). Applying CPS with Boundary Templates gives a localization accuracy of 63.74%, which, compared to PS, clearly shows the importance of incorporating the pose prior. We then apply the Pose-Specific Part Model [20] and achieve a much higher localization accuracy of 71%, which demonstrates the advantage of modeling occlusion-based constraints. We also compare with [17], which uses the Hausdorff distance between the pose boundary and Canny edges as a shape likelihood measure to localize the pose; this approach achieves a lower accuracy of 62.71%.

We also test the robustness of our approach to uncertainty in the position and scale of the pose (which is likely to occur during tracking). Figure 5 shows the accuracy plots for the various localization methods against the degree of uncertainty. Notice that localization using PSPM and CPS with Boundary Templates is quite robust to position uncertainty compared to the Hausdorff method. CPS with Geometric Templates and with Boundary Templates gives comparable accuracy at low uncertainty, but the former deteriorates as the uncertainty increases; this indicates that Boundary Templates are more robust to noise. Also notice in Figure 5(b) that PSPM with Boundary Templates tolerates small errors in the height estimate (about 10%). However, PSPM-based localization is about 10-15 times slower than using the Hausdorff distance.

Figure 5. Localization accuracy of the different approaches (Hausdorff, PS-BT, CPS-BT, CPS-BT-IIP, CPS-GT, PSPM-BT): (a) with uncertainty in position (shown as the ratio of position error to person height); (b) with uncertainty in the height estimate (scale).

5.2. Action Recognition

From the pose localization experiments, we observe that the Hausdorff-distance-based method localizes well when the predicted pose is not far from the true pose. Thus, for efficiency, we apply PSPMs every 5th frame and use the Hausdorff-distance-based method for the intermediate frames. In addition, for efficient localization using PSPM, we scale down the image so the actor is approximately 100 pixels high. Our entire system runs at about 1 frame per second on a 3 GHz Xeon CPU running Windows/C++ programs. We now present our results on three datasets.

Hand Gestures Dataset [18]: This dataset has 5-6 instances of 12 actions from 8 different actors in an indoor lab setting, for a total of 495 action sequences across all actions. Even though the background is not cluttered, the recognition task is still challenging due to the large number of actions with small pose differences. For evaluation, we train the models on a subset of actors and test on the rest. We compare our approach to [18], which uses a similar joint tracking and recognition approach but uses discrete action duration models and foreground-based features for localization and matching. [18] reports recognition rates of 78% and 90% with 1:8 and 3:5 train:test splits, respectively. Our algorithm achieves 92% recognition accuracy with the 1:8 train:test split. If we replace the PSPM-based localization with the Hausdorff-distance-based method, the recognition rate drops to 84%. This illustrates that even in clean backgrounds, the use of PSPMs improves action recognition.

Augmented Hand Gestures Dataset: To demonstrate robustness to cluttered dynamic backgrounds, we generated a dataset by embedding 45 action instances from the original dataset [18] into videos with complex dynamic backgrounds (see Figure 6(f-k) for sample images). The dataset has 215 videos, covering 3 different actors performing hand gestures in 5 different scenes. Our algorithm achieves 91% recognition accuracy. Note that the recognition accuracy on the original 45 videos from [18] that were used for embedding was about 95%. To process these videos, we used the parameters trained on the original hand gestures dataset [18].

Table 1. Evaluation results on the Hand Gestures and USC Gestures datasets.

  Dataset         Method                  Train:Test   Recognition
  Hand Gestures   Natarajan et al. [18]   1:8          78%
  Hand Gestures   Natarajan et al. [18]   3:5          90.18%
  Hand Gestures   CAN (Hausdorff)         1:8          84.2%
  Hand Gestures   CAN (PSPM)              1:8          92%
  USC Gestures    SFD-CRF [17]            MoCAP        77.45%
  USC Gestures    CAN (PSPM)              1:6          89.5%

In addition, we collected 25 videos of 4 hand gestures performed in dynamic scenes, with camera shake and/or objects moving in the background. Our algorithm, trained on the original dataset, correctly recognized 20 action instances (about 80% accuracy).

USC Gestures Dataset [17]: This dataset has videos of 6 full-body actions captured at various pan and tilt angles; the actions are sit-on-ground, standup-from-ground, sit-on-chair, stand-from-chair, pickup-from-ground and point-forward. We evaluated our approach on the part of this dataset that was captured at 0° tilt in 6 varying backgrounds, including cluttered indoor scenes and outdoor scenes in front of moving vehicles; the rest of the dataset was captured at other tilt angles against a relatively clean, static background. The selected set includes actions captured at 5 different camera pan angles w.r.t. the actor (0°, 45°, 90°, 270°, 315°), for a total of 240 action instances, each performed either by a different actor, at a different pan angle, or in a different background. For our experiments, we trained our models using 2 action instances from one actor and evaluated on the rest. Note that the models were trained on only 2 viewpoints and tested on 5 different viewpoints. On segmented action instances, our approach achieved an accuracy of 75.91%. Figure 6(n-s) shows sample results. [17] reports an accuracy of 77.35%; however, they assume that the sit-on-chair and sit-on-ground actions are followed by stand-from-chair and stand-from-ground respectively. When we incorporate this information, our action recognition accuracy improves to 89.5%, a 12% improvement over [17].

6. Conclusion

We have presented an approach for joint pose tracking and action recognition in cluttered dynamic environments which has low training requirements and does not require 3D MoCAP data. We achieve this by proposing an accurate and efficient pose localization approach using Pose-Specific Part Models (PSPMs). We have demonstrated that our localization approach is robust to noise and works well in cluttered environments. Further, we have demonstrated our approach for action recognition on hand gestures as well as on the USC Gestures dataset, with full-body gestures in cluttered and dynamic environments.

Figure 6. Results obtained on the gesture datasets: (a-e) Hand Gestures [18], (f-m) Augmented Hand Gestures, (n-s) USC Gestures [17]. The estimated pose is overlaid on each image (in red), and the corresponding part distribution obtained by applying PSPM is shown next to it.

Acknowledgements. This research was supported, in part, by the Office of Naval Research under grants #N00014-06-1-0470 and #N00014-10-1-0517.

References

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, pages 1014–1021, 2009.
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, pages 623–630, 2010.
[3] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] V. Ferrari, M. J. Marín-Jiménez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, 2009.
[6] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. Huang. Action detection in complex scenes with spatial and temporal ambiguities. In ICCV, pages 128–135, 2009.
[7] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In CVPR, pages 41–48, 2010.
[8] N. Ikizler and D. A. Forsyth. Searching video for complex activities with finite state models. In CVPR, 2007.
[9] H. Jiang and D. Martin. Global pose estimation using non-tree models. In CVPR, 2008.
[10] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In FG, 1996.
[11] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[12] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2D human pose recovery. In ICCV, 2005.
[13] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[14] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In CVPR, 2007.
[15] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, 2009.
[16] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In CVPR, 2007.
[17] P. Natarajan and R. Nevatia. View and scale invariant action recognition using multiview shape-flow models. In CVPR, 2008.
[18] P. Natarajan, V. K. Singh, and R. Nevatia. Learning 3D action models from a few 2D videos for view invariant action recognition. In CVPR, 2010.
[19] H. Ning, W. Xu, Y. Gong, and T. S. Huang. Latent pose estimator for continuous action recognition. In ECCV (2), 2008.
[20] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, pages 1129–1136, 2007.
[21] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. In CVPR, 2000.
[22] L. Sigal, A. O. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2007.
[23] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In CVPR, pages 2041–2048, 2006.
[24] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Conditional random fields for contextual human motion recognition. In ICCV, pages 1808–1815, 2005.
[25] J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV (1), pages 629–644, 2002.
[26] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In ICML, page 99, 2004.
[27] T.-P. Tian and S. Sclaroff. Fast globally optimal 2D human detection with loopy graph models. In CVPR, 2010.
[28] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. In CVPR, 2006.
[29] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In ECCV, 2008.
[30] X. K. Wei and J. Chai. Modeling 3D human poses from uncalibrated monocular images. In ICCV, pages 1873–1880, 2009.
[31] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In ICCV, 2007.