
Graphical Models for Human Motion Modeling

Kooksang Moon and Vladimir Pavlovic

Rutgers University, Department of Computer Science, Piscataway, NJ 08854, USA

Abstract. The human figure exhibits complex and rich dynamic behavior that is both nonlinear and time-varying. To automate the process of motion modeling we consider a class of learned dynamic models cast in the framework of dynamic Bayesian networks (DBNs) applied to analysis and tracking of the human figure. While direct learning of DBN parameters is possible, the Bayesian learning formalism suggests that a hyperparametric model description that considers all possible model dynamics may be preferred. Such integration over all possible models results in a subspace embedding of the original motion measurements. To this end, we propose a new family of Marginal Auto-Regressive (MAR) graphical models that describe the space of all stable auto-regressive sequences, regardless of their specific dynamics. We show that the use of dynamics and MAR models may lead to better estimates of sequence subspaces than the ones obtained by traditional non-sequential methods. We then propose a learning method for estimating general nonlinear dynamic system models that utilizes the new MAR models. The utility of the proposed methods is tested on the task of tracking 3D articulated figures in monocular image sequences. We demonstrate that the use of MAR can result in efficient and accurate tracking of the human figure from ambiguous visual inputs.

1 Introduction

Modeling the dynamics of human figure motion is essential to many applications such as realistic motion synthesis in animation and human activity classification. Because human motion is a sequence of human poses, many statistical techniques for sequence modeling have been applied to this problem. A dynamic Bayesian network (DBN) is one method for motion modeling that facilitates easy interpretation and learning. We are interested in learning dynamic models from motion capture data using DBN formalisms and dimensionality reduction methods.

Dimensionality reduction / subspace embedding methods such as Principal Component Analysis (PCA), Multidimensional Scaling (MDS) [11], Gaussian Process Latent Variable Models (GPLVM) [9] and others play an important role in many data modeling tasks by selecting and inferring those features that lead to an intrinsic representation of the data. As such, they have attracted significant attention in computer vision, where they have been used to represent intrinsic spaces of shape, appearance, and motion. However, it is common that subspace projection methods applied in different contexts do not leverage inherent properties of those contexts. For instance, subspace projection methods used in human figure tracking [5, 16, 20, 22] often do not fully exploit the dynamic nature of the data. As a result, the selected subspaces sometimes do not exhibit the temporal smoothness or periodic characteristics of the motion they model. Even if the dynamics are used, the methods employed are sometimes not theoretically sound and are disjoint from the subspace selection phase.

In this paper we present a graphical model formalism for human motion modeling using dimensionality reduction. A new approach to subspace embedding of sequential data is proposed, which explicitly accounts for their dynamic nature. We first model the space of sequences using a novel Marginal Auto-Regressive (MAR) formalism. A MAR model describes the space of sequences generated from all possible AR models; in the limit case, MAR describes all stable AR models. As such, the MAR model is weakly parametric and can be used as a prior for an arbitrary sequence, without knowing the typical AR parameters such as the state transition matrix. The embedding model is then defined using a probabilistic Gaussian Process Latent Variable Model (GPLVM) framework [9] with MAR as its prior. A GPLVM framework is particularly well suited for this task because of its probabilistic generative interpretation. The new hybrid GPLVM and MAR framework results in a general model of the space of all nonlinear dynamic systems (NDS). Because of this it has the potential to model nonlinear embeddings of a large family of sequences in a theoretically sound way.

The paper is organized as follows. We first justify modeling human motion with subspace embedding. Next we define the family of MAR models and study some properties of the space of sequences modeled by MAR. We also show that MAR and GPLVM result in a model of the space of all NDS sequences and discuss its properties. The utility of the new framework is examined through a set of experiments with synthetic and real data. In particular, we apply the new framework to modeling and tracking of the 3D human figure motion from a sequence of monocular images.

2 Related Work

Manifold learning approaches to motion modeling have attracted significant interest in the last several years. Brand proposed nonlinear manifold learning that maps input sequences to paths on the learned manifold [4]. Rosales and Sclaroff [13] proposed the Specialized Mapping Architecture (SMA) that utilizes forward mapping for the pose estimation task. Agarwal and Triggs [1] directly learned a mapping from image measurements to 3D pose using a Relevance Vector Machine (RVM).

However, with high-dimensional data it is often advantageous to consider a subspace of, e.g., the joint angle space that contains a compact representation of the actual figure motion. Principal Component Analysis (PCA) [8] is the most well-known linear dimensionality reduction technique. Although PCA has been applied to human tracking and other vision applications [15, 12, 21], it is insufficient to handle the nonlinear behavior inherent to human motion. Nonlinear manifold embeddings of the training data in low-dimensional spaces using isometric feature mapping (Isomap), locally linear embedding (LLE) and spectral embedding [19, 14, 3, 24] have shown success in recent approaches [5, 16]. While these techniques provide point-based embeddings that implicitly model the nonlinear manifold through exemplars, they lack a fully probabilistic interpretation of the embedding process.

The GPLVM, a Gaussian Process model [25], produces a continuous mapping between the latent space and the high-dimensional data in a probabilistic manner [9]. Grochow et al. [7] use a Scaled GPLVM (SGPLVM) to model inverse kinematics for interactive computer animation. Tian et al. [20] use a GPLVM to estimate 2D upper body pose from 2D silhouette features. More recently, Urtasun et al. [22] exploit the SGPLVM for 3D people tracking. However, these approaches utilize simple temporal constraints in the pose space that often introduce the curse of dimensionality into nonlinear tracking methods such as particle filters. Moreover, such methods fail to explicitly consider motion dynamics during the embedding process. Our work addresses both of these issues through the use of a novel marginal NDS model. Wang et al. [23] introduced Gaussian Process Dynamical Models (GPDM) that utilize dynamic priors for embedding. Our work extends the idea to tracking and investigates the impact of dynamics in the embedded space on tracking in real sequences.

3 Dynamics Modeling and Subspace Embedding

Because the human pose is typically represented by more than 30 parameters (e.g., 59 joint angles in the marker-based motion capture system), modeling human motion is a complex task. Suppose $y_t$ is an $M$-dimensional vector consisting of joint angles at time $t$. Modeling human motion can be formulated as learning a dynamic system:

$$ y_t = h(y_0, y_1, \ldots, y_{t-1}) + u_t, \qquad (1) $$

where $u_t$ is a (Gaussian) noise process. A common approach to modeling linear motion dynamics would be to assume a $T$-th order linear autoregressive (AR) model:

$$ y_t = \sum_{i=1}^{T} a_i y_{t-i} + \epsilon_t, \qquad (2) $$

where the $a_i$ are the autoregression coefficients and $\epsilon_t$ is a noise process. For instance, 2nd-order AR models are sufficient for modeling periodic motion, while higher-order models lead to more complex motion dynamics. However, as the order of the model increases, the number of parameters grows as $M^2 \cdot T + M^2$ (transition and covariance parameters). Learning this set of parameters may require large training sets and can be prone to overfitting.

Armed with the intuition that correlation between limbs such as arms and legs always exists for a certain motion, many researchers have exploited the dynamics in a lower-dimensional projection space rather than learning the dynamics in the high-dimensional pose space for human motion modeling. By introducing a hidden state $x_t$ of dimension $N$ ($M \gg N$) satisfying the first-order Markovian condition, modeling human motion is cast in the framework of dynamic Bayesian networks (DBNs) depicted in Figure 1:

$$ x_t = f(x_{t-1}) + w_t, \qquad (3) $$
$$ y_t = g(x_t) + v_t, \qquad (4) $$

where $f(\cdot)$ is a transition function, $g(\cdot)$ represents a dimensionality reduction operation, and $w_t$ and $v_t$ are (Gaussian) noise processes.

Fig. 1. A graphical model for human motion modeling with the subspace embedding: hidden states $x_0, x_1, \ldots, x_t$ and observations $y_0, y_1, \ldots, y_t$.

The above DBN formalism implies that predicting the future observation $y_{t+1}$ based on the past observation data $Y_0^t = \{y_0, \ldots, y_t\}$ can be stated as the following inference problem:

$$ P(y_{t+1}|Y_0^t) = \frac{P(Y_0^{t+1})}{P(Y_0^t)} = \frac{\sum_{x_{t+1}} \cdots \sum_{x_1} P(x_1) \prod_{i=1}^{t} P(x_{i+1}|x_i) \prod_{i=1}^{t+1} P(y_i|x_i)}{\sum_{x_t} \cdots \sum_{x_1} P(x_1) \prod_{i=1}^{t-1} P(x_{i+1}|x_i) \prod_{i=1}^{t} P(y_i|x_i)}. \qquad (5) $$

This suggests that the dynamics of the observation (pose) sequence $Y$ possess a more complicated form. Namely, the pose $y_t$ at time $t$ becomes dependent on all previous poses $y_{t-1}, y_{t-2}, \ldots$, effectively resulting in an infinite-order AR model. However, such a model can use a smaller set of parameters than the AR model of Equation (2) in the pose space. Assuming a 1st-order linear dynamic system (LDS) $x_t = F x_{t-1} + w$ and the linear dimensionality reduction process $y_t = G x_t + v$, where $F$ is the transition matrix and $G$ is the inverse of the dimensionality reduction matrix, the number of parameters to be learned is $N^2 + N^2 + N \cdot M + M^2 = 2N^2 + M \cdot (N + M)$ ($N^2$ in $F$, $NM$ in $G$, and $N^2 + M^2$ in the two noise covariance matrices for $w$ and $v$). When $N \ll M$ the number of parameters of the LDS representation becomes significantly smaller than that of the "equivalent" AR model. That is, by learning both the dynamics in the embedded space and the subspace embedding model, we can effectively estimate $y_t$ given all $Y_0^{t-1}$ at any time $t$ using a small set of parameters.

To illustrate the benefit of using the dynamics in the embedded space for human motion modeling, we take 12 walking sequences of one subject from the CMU Graphics Lab Motion Capture Database [27], where the pose is represented by 59 joint angles. The poses are projected into a 3D subspace. Assume that the dynamics in the pose space and in the embedded space are modeled using 2nd-order linear dynamics. We perform leave-one-out cross-validation for these 12 sequences: 11 sequences are selected as a training set and the 1 remaining sequence is reserved for testing. Let $M_{pose}$ be the AR model in the pose space learned from this training set and $M_{embed}$ be the LDS model in the latent space. Figure 2 shows the summary statistics of the negative log-likelihoods of $P(Y_n|M_{pose})$ and $P(Y_n|M_{embed})$, where $Y_n$ is the sequence reserved for testing.

Fig. 2. Comparison of generalization abilities of the AR ("pose") and LDS ("embed") models. Shown are the medians and upper and lower quartiles (boxes) of the negative log-likelihoods (in log space) under the two models. The whiskers depict the total range of the values. Note that lower values suggest better generalization properties (fit to the test data) of a model.

The experiment indicates that, given the same training data, the learned dynamics in the embedded space model the unseen sequences better than the dynamic model in the pose space. The large variance of $P(Y_n|M_{pose})$ across different training sets also indicates the overfitting problem that is generally observed in a statistical model that has too many parameters.


As shown in Figure 1, there are two processes in modeling human motion using a subspace embedding. One is learning the embedding model $P(y_t|x_t)$ and the other is learning the dynamic model $P(x_{t+1}|x_t)$. The problem with previous approaches using dimensionality reduction in human motion modeling is that these two processes are decoupled into two separate learning stages. However, coupling the two learning processes results in a better embedded space that preserves the dynamic nature of the original pose space. For example, if the prediction by the dynamics suggests that the next state will be near a certain point, we can learn a projection that retains the temporal information better than a naive projection disregarding this prior knowledge. Our proposed framework formulates this coupling of the two learning processes in a probabilistic manner.

4 Marginal Auto-Regressive Model

We develop a framework that incorporates dynamics into the process of learning low-dimensional representations of sequences. In this section, a novel marginal dynamic model describing the space of all stable auto-regressive sequences is proposed to model the dynamics of the unknown subspace.

4.1 Definition

Consider a sequence $X$ of length $T$ of $N$-dimensional real-valued vectors $x_t = [x_{t,0}\, x_{t,1} \ldots x_{t,N-1}] \in \mathbb{R}^{1 \times N}$. Suppose sequence $X$ is generated by the 1st-order AR model $AR(A)$:

$$ x_t = x_{t-1} A + w_t, \qquad t = 0, \ldots, T-1, \qquad (6) $$

where $A$ is a specific $N \times N$ state transition matrix and $w_t$ is white iid Gaussian noise with precision $\alpha$: $w_t \sim \mathcal{N}(0, \alpha^{-1} I)$. Assume, without loss of generality, that the initial condition $x_{-1}$ has a multivariate normal distribution with zero mean and unit precision: $x_{-1} \sim \mathcal{N}(0, I)$.

We adopt a convenient representation of sequence $X$ as a $T \times N$ matrix $X = [x_0' \, x_1' \ldots x_{T-1}']'$ whose rows are the vector samples from the sequence. Using this notation, Equation (6) can be written as

$$ X = X_\Delta A + W, \qquad (7) $$

where $W = [w_0' \, w_1' \ldots w_{T-1}']'$ and $X_\Delta$ is a shifted/delayed version of $X$, $X_\Delta = [x_{-1}' \, x_0' \ldots x_{T-2}']'$. Given the state transition matrix $A$ and the initial condition, the AR sequence samples have the joint density function

$$ P(X|A, x_{-1}) = (2\pi)^{-\frac{NT}{2}} \exp\left\{ -\frac{1}{2} \operatorname{tr}\left\{ (X - X_\Delta A)(X - X_\Delta A)' \right\} \right\}. \qquad (8) $$

The density in Equation (8) describes the distribution of samples in a $T$-long sequence for a particular instance of the state transition matrix $A$. However, we are interested in the distribution of all AR sequences, regardless of the value of $A$. In other words, we are interested in the marginal distribution of AR sequences over all possible parameters $A$.

Assume that all elements $a_{ij}$ of $A$ are iid Gaussian with zero mean and unit precision, $a_{ij} \sim \mathcal{N}(0, 1)$. Under this assumption, it can be shown [10] that the marginal distribution of the AR model becomes

$$ P(X|x_{-1}, \alpha) = \int_A P(X|A, x_{-1}) P(A|\alpha) \, dA = (2\pi)^{-\frac{NT}{2}} |K_{xx}(X, X)|^{-\frac{N}{2}} \exp\left\{ -\frac{1}{2} \operatorname{tr}\left\{ K_{xx}(X, X)^{-1} X X' \right\} \right\}, \qquad (9) $$

where

$$ K_{xx}(X, X) = X_\Delta X_\Delta' + \alpha^{-1} I. \qquad (10) $$

We call this density the Marginal AR or MAR density; $\alpha$ is the hyperparameter of this class of models, $MAR(\alpha)$. Intuitively, Equation (9) favors those samples in $X$ that do not change significantly from $t$ to $t+1$ and $t-1$. The graphical representation of the MAR model is depicted in Figure 3, where different treatments of the nodes are represented by different shades.

Fig. 3. Graphical representation of the MAR model: latent states $x_{-1}, x_0, x_1, \ldots, x_{T-1}$ and the transition matrix $A$. White shaded nodes are optimized while the grey shaded node is marginalized.
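To make Equations (9)-(10) concrete, the following minimal NumPy sketch (the helper name and the zero initial condition $x_{-1} = 0$ are our assumptions, not part of the paper) evaluates the MAR log-density of a $T \times N$ latent sequence:

```python
import numpy as np

def mar_log_density(X, alpha=1.0, x_init=None):
    """Log of the MAR density of Eq. (9) for a T x N latent sequence X."""
    T, N = X.shape
    if x_init is None:
        x_init = np.zeros(N)                       # assumed initial condition x_{-1}
    X_delta = np.vstack([x_init, X[:-1]])          # delayed sequence [x_{-1}; x_0; ...; x_{T-2}]
    Kxx = X_delta @ X_delta.T + np.eye(T) / alpha  # Eq. (10)
    _, logdet = np.linalg.slogdet(Kxx)
    quad = np.trace(np.linalg.solve(Kxx, X @ X.T))
    return -0.5 * N * T * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad
```

Evaluating this on a temporally smooth sequence and on a randomly permuted copy of it should typically show the preference of Equation (9) for samples that change little between neighboring time steps.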

The MAR density models the distribution of all (AR) sequences of length $T$ in the space $\mathcal{X} = \mathbb{R}^{T \times N}$. Note that while the error process of an AR model has a Gaussian distribution, the MAR density is not Gaussian. We illustrate this in Figure 4. The figure shows joint pdf values for four different densities: MAR, periodic MAR (see Section 4.2), AR(2), and a circular Gaussian, in the space of length-2 scalar-valued sequences $[x_0 \, x_1]'$. In all four cases we assume a zero-mean, unit-precision Gaussian distribution of the initial condition. All models have their mode at $(0, 0)$. The distribution of the AR model is multivariate Gaussian with the principal variance direction determined by the state transition matrix $A$. However, the MAR models define non-Gaussian distributions with no circular symmetry and with directional bias. This property of MAR densities is important when viewed in the context of sequence subspace embeddings, which we discuss in Section 5.


Fig. 4. Distribution of length-2 sequences of 1D samples under MAR, periodic MAR, AR, and independent Gaussian models.

4.2 Higher-Order Dynamics

The above definition of MAR models can be easily extended to families of arbitrary $D$-th order AR sequences. In that case the state transition matrix $A$ is replaced by an $ND \times N$ matrix $A = [A_1' \, A_2' \ldots A_D']'$ and $X_\Delta$ by $[X_\Delta \, X_{1\Delta} \ldots X_{D\Delta}]$. Hence, a $MAR(\alpha, D)$ model describes the general space of all $D$-th order AR sequences. Using this formulation one can also model specific classes of dynamic models. For instance, the class of all periodic models can be formed by setting $A = [A_1' \, {-I}]'$, where $I$ is an identity matrix.
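A possible realization of this construction in code, under the same zero-padded initial-condition assumption as the earlier MAR sketch (the function name is ours), stacks the delayed copies of $X$ before forming the kernel:

```python
import numpy as np

def mar_kernel_higher_order(X, D=2, alpha=1.0):
    """Kxx for a MAR(alpha, D) model: stack D delayed copies of X (Section 4.2)."""
    T, N = X.shape
    delayed = []
    for d in range(1, D + 1):
        Xd = np.vstack([np.zeros((d, N)), X[:-d]])   # x_{t-d}, with zero-padded initial rows
        delayed.append(Xd)
    X_delta = np.hstack(delayed)                     # [X_Delta  X_{1 Delta}  ...  X_{D Delta}]
    return X_delta @ X_delta.T + np.eye(T) / alpha
```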

4.3 Nonlinear Dynamics

In Equation (6) and Equation (9) we assumed linear families of dynamic systems. One can generalize this approach to nonlinear dynamics of the form $x_t = g(x_{t-1}|\zeta) A$, where $g(\cdot|\zeta)$ is a nonlinear mapping to an $L$-dimensional subspace and $A$ is an $L \times N$ linear mapping. In that case $K_{xx}$ becomes a nonlinear kernel, using a justification similar to, e.g., [9]. While nonlinear kernels often have potential benefits, such as robustness, they also preclude the closed-form solutions of linear models. In our preliminary experiments we have not observed significant differences between MAR and nonlinear MAR.

4.4 Justification of MAR Models

The choice of the prior distribution of the AR model's state transition matrix leads to the MAR density in Equation (9). One may wonder, however, whether the choice of iid $\mathcal{N}(0, 1)$ entries results in a physically meaningful space of sequences. We suggest that, indeed, such a choice may be justified.

Namely, Girko's circular law [6] states that if $A$ is a random $N \times N$ matrix with iid $\mathcal{N}(0, 1)$ entries, then in the limit case of large $N$ ($> 20$) all real and complex eigenvalues of $\frac{1}{\sqrt{N}} A$ are uniformly distributed on the unit disk. For small $N$, the distribution shows a concentration along the real line. Consequently, the resulting space of sequences described by the MAR model is that of all stable AR systems.
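A quick numerical check of this property (a standalone sketch, not one of the paper's experiments) samples random transition matrices and reports the spectral radius of the scaled matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (5, 20, 100):
    A = rng.standard_normal((N, N))          # iid N(0, 1) entries
    radius = np.abs(np.linalg.eigvals(A / np.sqrt(N))).max()
    print(f"N = {N:3d}: spectral radius of A / sqrt(N) = {radius:.3f}")
```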


5 Nonlinear Dynamic System Models

In this section we develop a Nonlinear Dynamic System view of the sequence subspace reconstruction problem that relies on the MAR representation of the previous section. In particular, we use the MAR model to describe the structure of the subspace of sequences to which the extrinsic representation will be mapped using the GPLVM framework of [9].

5.1 Definition

Let $Y$ be an extrinsic or measurement sequence of duration $T$ of $M$-dimensional samples. Define $Y$ as the $T \times M$ matrix representation of this sequence, similar to the definition in Section 4: $Y = [y_0' \, y_1' \ldots y_{T-1}']'$. We assume that $Y$ is the result of a process $X$ in a lower-dimensional MAR subspace $\mathcal{X}$, defined by a nonlinear generative or forward mapping

$$ Y = f(X|\theta) C + V. \qquad (11) $$

Here $f(\cdot)$ is a nonlinear $\mathbb{R}^N \to \mathbb{R}^L$ mapping, $C$ is a linear $L \times M$ mapping, and $V$ is Gaussian noise with zero mean and precision $\beta$.

To recover the intrinsic sequence $X$ in the embedded space from sequence $Y$ it is convenient not to focus, at first, on the recovery of the specific mapping $C$. Hence, we consider the family of mappings where $C$ is a stochastic matrix whose elements are iid $c_{ij} \sim \mathcal{N}(0, 1)$. Marginalizing over all possible mappings $C$ yields a marginal Gaussian Process [25] mapping:

$$ P(Y|X, \beta, \theta) = \int_C P(Y|X, C, \theta) P(C|\beta) \, dC = (2\pi)^{-\frac{MT}{2}} |K_{yx}(X, X)|^{-\frac{M}{2}} \exp\left\{ -\frac{1}{2} \operatorname{tr}\left\{ K_{yx}(X, X)^{-1} Y Y' \right\} \right\}, \qquad (12) $$

where

$$ K_{yx}(X, X) = f(X|\theta) f(X|\theta)' + \beta^{-1} I. \qquad (13) $$

Notice that in this formulation the $X \to Y$ mapping depends on the inner product $\langle f(X), f(X) \rangle$. Knowledge of the actual mapping $f$ is not necessary; a mapping is uniquely defined by specifying a positive-definite kernel $K_{yx}(X, X|\theta)$ with entries $K_{yx}(i, j) = k(x_i, x_j)$ parameterized by the hyperparameter $\theta$. A variety of linear and nonlinear kernels (RBF, squared exponential, various robust kernels) can be used as $K_{yx}$. Hence, our likelihood model is a nonlinear Gaussian process model, as suggested by [9]. Figure 5 shows the graphical model of the NDS.
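For reference, a minimal NumPy sketch of the marginal likelihood in Equations (12)-(13), with an RBF kernel standing in for $\langle f(X), f(X)\rangle$ (the kernel choice, default hyperparameters, and helper names are our assumptions rather than the authors' code):

```python
import numpy as np

def rbf_kernel(X, Xp=None, theta=(1.0, 1.0)):
    """Squared-exponential kernel k(x_i, x_j) = s * exp(-0.5 * ||x_i - x_j||^2 / l^2)."""
    s, l = theta
    Xp = X if Xp is None else Xp
    d2 = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return s * np.exp(-0.5 * d2 / l**2)

def gp_log_likelihood(Y, X, beta=100.0, theta=(1.0, 1.0)):
    """log P(Y | X, beta, theta) of Eq. (12), with Kyx = K(X, X) + beta^{-1} I (Eq. 13)."""
    T, M = Y.shape
    Kyx = rbf_kernel(X, theta=theta) + np.eye(T) / beta
    _, logdet = np.linalg.slogdet(Kyx)
    quad = np.trace(np.linalg.solve(Kyx, Y @ Y.T))
    return -0.5 * M * T * np.log(2 * np.pi) - 0.5 * M * logdet - 0.5 * quad
```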

By joining the MAR model and the NDS model, we have constructed a Marginal Nonlinear Dynamic System (MNDS) model that describes the joint distribution of all measurement and all intrinsic sequences in the $\mathcal{Y} \times \mathcal{X}$ space:

$$ P(X, Y|\alpha, \beta, \theta) = P(X|\alpha) P(Y|X, \beta, \theta). \qquad (14) $$


Fig. 5. Graphical model of the NDS: latent states $x_0, x_1, \ldots, x_{T-1}$, observations $y_0, y_1, \ldots, y_{T-1}$, and the mapping $C$. White shaded nodes are optimized, the grey shaded node is marginalized, and black shaded nodes are observed variables.

The MNDS model has a MAR prior $P(X|\alpha)$ and a Gaussian process likelihood $P(Y|X, \beta, \theta)$. Thus it places the intrinsic sequences $X$ in the space of all AR sequences. Given an intrinsic sequence $X$, the measurement sequence $Y$ is zero-mean normally distributed with the variance determined by the nonlinear kernel $K_{yx}$ and $X$.

5.2 Inference

Given a sequence of measurements $Y$, one would like to infer its subspace representation $X$ in the MAR space without needing to first determine a particular family of AR models $AR(A)$, nor the mapping $C$. Equation (14) shows that this task can, in principle, be achieved using the Bayes rule $P(X|Y, \alpha, \beta, \theta) \propto P(X|\alpha) P(Y|X, \beta, \theta)$.

However, this posterior is non-Gaussian because of the nonlinear mapping $f$ and the MAR prior. One can instead attempt to estimate the mode $X^*$,

$$ X^* = \arg\max_X \left\{ \log P(X|\alpha) + \log P(Y|X, \beta, \theta) \right\}, \qquad (15) $$

using nonlinear optimization such as the Scaled Conjugate Gradient in [9]. To effectively use a gradient-based approach, one needs to obtain expressions for the gradients of the log-likelihood and the log-MAR prior. Note that the expressions for the MAR gradients are more complex than those of, e.g., a GP due to the linear dependency between $X$ and $X_\Delta$ (see Section 9).
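A minimal sketch of this optimization, reusing the mar_log_density and gp_log_likelihood helpers sketched earlier; the paper uses scaled conjugate gradients with the analytic MAR gradient of Section 9, whereas this sketch falls back on SciPy with numerical gradients and a PCA-like initialization:

```python
import numpy as np
from scipy.optimize import minimize

def infer_latent(Y, N=3, alpha=1.0, beta=100.0, theta=(1.0, 1.0)):
    """MAP estimate of the latent sequence X (Eq. 15)."""
    T = Y.shape[0]
    # PCA-like initialization of X from the centered measurements
    U, s, _ = np.linalg.svd(Y - Y.mean(0), full_matrices=False)
    x0 = U[:, :N] * s[:N]

    def neg_log_posterior(x_flat):
        X = x_flat.reshape(T, N)
        return -(mar_log_density(X, alpha) + gp_log_likelihood(Y, X, beta, theta))

    res = minimize(neg_log_posterior, x0.ravel(), method="L-BFGS-B")
    return res.x.reshape(T, N)
```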

5.3 Learning

The MNDS space of sequences is parameterized using a set of hyperparameters $(\alpha, \beta, \theta)$ and the choice of the nonlinear kernel $K_{yx}$. Given a set of sequences $\{Y^{(i)}\}$, $i = 1, \ldots, S$, the learning task can be formulated as the ML/MAP estimation problem

$$ (\alpha^*, \beta^*, \theta^*)\big|_{K_{yx}} = \arg\max_{\alpha, \beta, \theta} \prod_{i=1}^{S} P(Y^{(i)}|\alpha, \beta, \theta). \qquad (16) $$


One can use a generalized EM algorithm to obtain the ML parameter estimates recursively from two fixed-point equations:

E-step: $X^{(i)*} = \arg\max_X P(Y^{(i)}, X^{(i)}|\alpha^*, \beta^*, \theta^*)$

M-step: $(\alpha^*, \beta^*, \theta^*) = \arg\max_{(\alpha, \beta, \theta)} \prod_{i=1}^{S} P(Y^{(i)}, X^{(i)*}|\alpha, \beta, \theta)$
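A coarse sketch of this EM loop for a single training sequence, reusing the infer_latent, mar_log_density, and gp_log_likelihood helpers above; the grid-search M-step is a simplification we assume for brevity, where a gradient-based hyperparameter update would normally be used:

```python
import itertools

def learn_mnds(Y, N=3, n_iter=5,
               alphas=(0.1, 1.0, 10.0), betas=(10.0, 100.0), lengths=(0.5, 1.0, 2.0)):
    """Generalized EM of Section 5.3 for one sequence Y (sketch)."""
    alpha, beta, theta = 1.0, 100.0, (1.0, 1.0)
    for _ in range(n_iter):
        # E-step: embed Y under the current hyperparameters
        X = infer_latent(Y, N=N, alpha=alpha, beta=beta, theta=theta)
        # M-step: pick the hyperparameters maximizing the joint P(Y, X | alpha, beta, theta)
        alpha, beta, length = max(
            itertools.product(alphas, betas, lengths),
            key=lambda p: mar_log_density(X, p[0]) + gp_log_likelihood(Y, X, p[1], (1.0, p[2])))
        theta = (1.0, length)
    return X, (alpha, beta, theta)
```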

5.4 Learning of Explicit NDS Model

Inference and learning in MNDS models result in the embedding of the measurement sequence $Y$ into the space of all NDS/AR models. Given $Y$, the embedded sequence $X$ estimated in Section 5.3, and the MNDS parameters $\alpha, \beta, \theta$, the explicit AR model can be easily reconstructed using ML estimation on sequence $X$, e.g.:

$$ A^* = (X_\Delta' X_\Delta)^{-1} X_\Delta' X. \qquad (17) $$

Because the embedding was defined as a GP, the likelihood function $P(y_t|x_t, \beta, \theta)$ follows a well-known result from GP theory: $y_t|x_t \sim \mathcal{N}(\mu, \sigma^2 I)$ with

$$ \mu = Y' K_{yx}(X, X)^{-1} K_{yx}(X, x_t), \qquad (18) $$
$$ \sigma^2 = K_{yx}(x_t, x_t) - K_{yx}(X, x_t)' K_{yx}(X, X)^{-1} K_{yx}(X, x_t). \qquad (19) $$

The two components fully define the explicit NDS. In summary, a complete sequence modeling algorithm consists of the following set of steps.

Input: Measurement sequence $Y$ and kernel family $K_{yx}$.
Output: $NDS(A, \beta, \theta)$.
1) Learn the subspace embedding $MNDS(\alpha, \beta, \theta)$ model of the training sequences $Y$ (Section 5.3).
2) Learn the explicit subspace and projection model $NDS(A, \beta, \theta)$ of $Y$ (Section 5.4).

Algorithm 1: NDS learning.
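For step 2, the sketch below estimates the transition matrix of Equation (17) and the GP predictive moments of Equations (18)-(19), reusing the rbf_kernel helper from the earlier sketch; the zero initial latent state is again our assumption:

```python
import numpy as np

def fit_transition(X):
    """ML estimate A* = (X_D' X_D)^{-1} X_D' X of Eq. (17), via least squares."""
    X_delta = np.vstack([np.zeros(X.shape[1]), X[:-1]])  # delayed latent sequence, x_{-1} = 0 assumed
    return np.linalg.lstsq(X_delta, X, rcond=None)[0]

def gp_predict_pose(x_t, X, Y, beta=100.0, theta=(1.0, 1.0)):
    """Predictive mean (Eq. 18) and variance (Eq. 19) of y_t given x_t."""
    T = X.shape[0]
    Kyx = rbf_kernel(X, theta=theta) + np.eye(T) / beta
    k_star = rbf_kernel(X, x_t[None, :], theta=theta)[:, 0]   # K(X, x_t)
    Kinv_k = np.linalg.solve(Kyx, k_star)
    mu = Y.T @ Kinv_k                                          # Eq. (18)
    var = theta[0] - k_star @ Kinv_k                           # Eq. (19); K(x_t, x_t) = s for the RBF
    return mu, var
```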

5.5 Inference in Explicit NDS Model

The choice of the nonlinear kernel $K_{yx}$ results in a nonlinear dynamic system model of the training sequences $Y$. The learned model can then be used to infer subspace projections of a new sequence from the same family. Because of the nonlinearity of the embedding, one cannot apply the linear forward-backward or Kalman filtering/smoothing inference. Rather, it is necessary to use nonlinear inference methods such as the (I)EKF or particle filtering/smoothing.

It is interesting to note that one can often use a relatively simple sequential nonlinear optimization in place of the above two inference methods:

$$ x_t^* = \arg\max_{x_t} P(y_t|x_t, \beta^*, \theta^*) P(x_t|x_{t-1}^*, A^*). \qquad (20) $$


Such sequential optimization yields local modes of the true posterior $P(X|Y)$. While one would expect such an approximation to be valid only in situations with few ambiguities in the measurement space and models learned from representative training data, our experiments show the method to be robust across a range of situations. However, dynamics seem to play a crucial role in the inference process.
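A simple frame-by-frame tracker in this spirit, reusing gp_predict_pose and a transition matrix such as the one returned by fit_transition above, is sketched below; it assumes Gaussian noise, drops constant normalization terms, and is not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def track_sequence(Y_test, X_train, Y_train, A, alpha=1.0, beta=100.0, theta=(1.0, 1.0)):
    """Sequential MAP tracking of Eq. (20), one frame at a time."""
    x_prev = X_train[-1]
    X_hat = []
    for y_t in Y_test:
        def neg_log_post(x_t):
            mu, var = gp_predict_pose(x_t, X_train, Y_train, beta, theta)
            lik = 0.5 * np.sum((y_t - mu) ** 2) / max(var, 1e-8)   # ~ -log P(y_t | x_t)
            dyn = 0.5 * alpha * np.sum((x_t - x_prev @ A) ** 2)    # ~ -log P(x_t | x_{t-1}, A)
            return lik + dyn
        x_prev = minimize(neg_log_post, x_prev @ A, method="Nelder-Mead").x
        X_hat.append(x_prev)
    return np.array(X_hat)
```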

5.6 Example

We illustrate the concept of MNDS on a simple synthetic example. Consider the AR model $AR(2)$ from Section 4. A sequence $X$ generated by the model is projected to the space $\mathcal{Y} = \mathbb{R}^{2 \times 3}$ using a linear conditional Gaussian model $\mathcal{N}(XC, I)$. Figure 6 shows the negative likelihood over the space $\mathcal{X}$ for the MNDS, a marginal model (GP) with independent Gaussian priors, a GP with the exact AR(2) prior, and a full LDS with exact parameters. All likelihoods are computed for a fixed $Y$. Note that the GP with Gaussian prior assumes no temporal structure in the data.

Fig. 6. Negative log-likelihood of length-2 sequences of 1D samples under MNDS, GP with independent Gaussian priors, GP with exact AR prior, and LDS with the true process parameters. The "o" mark represents the optimal estimate $X^*$ inferred from the true LDS model. "+" shows the optimal estimates derived using the three marginal models.

This example shows that, as expected, the maximum likelihood subspace estimates of the MNDS model fall closer to the "true" LDS estimates than those of the non-sequential model. This property holds in general. Figure 7 shows the distribution of optimal negative log-likelihood scores, computed at the corresponding $X^*$, of the four models over a sample of 10000 $Y$ sequences generated from the true LDS model. Again, one notices that MNDS has a lower mean and mode than the non-sequential model, GP+Gauss, indicating MNDS's better fit to the data. This suggests that MNDS may result in better subspace embeddings than the traditional GP model with independent Gaussian priors.

6 Human Motion Modeling using MNDS

In this section we consider an application of MNDS to modeling of human motion from sequences of video images. Specifically, we assume that one wants to recover two important aspects of human motion: (1) the 3D posture of the human figure in each image and (2) an intrinsic representation of the motion.


Fig. 7. Normalized histogram of optimal negative log-likelihood scores for MNDS, a GP model with a Gaussian prior, a GP model with an exact AR prior, and LDS with the true parameters.

We propose two models in this context. The first model (Model 1) is a fully generative model that considers the natural way of image formation. Given the 3D pose space represented by joint angles $y_t$, the mapping into a sequence of features $z_t$ computed from monocular images (such as silhouette-based Alt moments, orientation histograms, etc.) is given by a Gaussian process model $P(Z|Y, \theta_{zy})$. An NDS is used to model the space $\mathcal{Y} \times \mathcal{X}$ of poses and intrinsic motions, $P(X, Y|A, \beta, \theta_{yx})$. The second model (Model 2) relies on the premise that the correlation between the pose and the image features can be modeled using a latent-variable model. In this model, the mapping from the intrinsic space into the image feature space is given by a Gaussian process model $P(Z|X, \theta_{zx})$. The model for the space of poses and intrinsic motions is defined in the same way as in the first model. Similar models are often used in other domains, such as computational language modeling, to correlate complex processes. The graphical models of these two approaches are shown in Figure 8.

Fig. 8. Graphical models for human motion modeling using MNDS. Left: Model 1. Right: Model 2.


6.1 Model 1

When the dimension of the image feature vector $z_t$ is much smaller than the dimension of the pose vector $y_t$ (e.g., a 10-dimensional vector of Alt moments vs. a 59-dimensional joint angle vector of motion capture data), estimating the pose given the feature becomes the problem of predicting a higher-dimensional projection in the model $P(Z|Y, \theta_{zy})$: an underdetermined problem. In this case, we can utilize a practical approximation by modeling $P(Y|Z)$ rather than $P(Z|Y)$; it yielded better results and still allowed a fully GP-based framework. That is to say, the mapping from the feature space into the 3D pose space is given by a Gaussian process model $P(Y|Z, \theta_{yz})$ with a parametric kernel $K_{yz}(z_t, z_t|\theta_{yz})$.

As a result, the joint conditional model of the pose sequence $Y$ and the intrinsic motion $X$, given the sequence of image features $Z$, is approximated by

$$ P(X, Y|Z, A, \beta, \theta_{yz}, \theta_{yx}) \approx P(Y|Z, \theta_{yz}) P(X|A) P(Y|X, \beta, \theta_{yx}). \qquad (21) $$

Learning. In the training phase, both the image features $Z$ and the corresponding poses $Y$ are known. Hence, the learning of the GP and NDS models becomes decoupled and can be accomplished using the NDS learning formalism presented in the previous section and a standard GP learning approach [25].

Input: Image sequence $Z$ and joint angle sequence $Y$.
Output: Human motion model.
1) Learn the Gaussian Process model $P(Y|Z, \theta_{yz})$ using, e.g., [25].
2) Learn the NDS model $P(X, Y|A, \beta, \theta_{yx})$ as described in Section 5.

Algorithm 2: Human motion model learning.

Inference and Tracking. Once the models are learned they can be used for tracking of the human figure in video. Because both the NDS and the GP are nonlinear mappings, estimating the current pose $y_t$ given the previous pose and intrinsic motion space estimates $P(x_{t-1}, y_{t-1}|Z_{0..t})$ will involve nonlinear optimization or linearization, as suggested in Section 5.5. In particular, the optimal point estimates $x_t^*$ and $y_t^*$ are the result of the following nonlinear optimization problem:

$$ (x_t^*, y_t^*) = \arg\max_{x_t, y_t} P(x_t|x_{t-1}, A) P(y_t|x_t, \beta, \theta_{yx}) P(y_t|z_t, \theta_{yz}). \qquad (22) $$

The point estimation approach is particularly suited to a particle-based tracker. Unlike some traditional approaches that only consider the pose space representation, tracking in the low-dimensional intrinsic space has the potential to avoid problems associated with sampling in high-dimensional spaces.

A sketch of the human motion tracking algorithm using a particle filter with $N_P$ particles and weights $(w^{(i)}, i = 1, \ldots, N_P)$ is shown below. We apply this algorithm to the set of tracking problems described in Section 7.2.


Input: Image $z_t$, human motion model (GP+NDS), and prior point estimates $(w^{(i)}_{t-1}, x^{(i)}_{t-1}, y^{(i)}_{t-1})|Z_{0..t-1}$, $i = 1, \ldots, N_P$.
Output: Current pose/intrinsic state estimates $(w^{(i)}_t, x^{(i)}_t, y^{(i)}_t)|Z_{0..t}$, $i = 1, \ldots, N_P$.
1) Draw the initial estimates $x^{(i)}_t \sim p(x_t|x^{(i)}_{t-1}, A)$.
2) Compute the initial poses $y^{(i)}_t$ from the initial $x^{(i)}_t$ and the NDS model.
3) Find the optimal estimates $(x^{(i)}_t, y^{(i)}_t)$ using the nonlinear optimization in Equation (22).
4) Find the point weights $w^{(i)}_t \sim P(x^{(i)}_t|x_{t-1}, A) P(y^{(i)}_t|x^{(i)}_t, \beta, \theta_{yx}) P(y^{(i)}_t|z_t, \theta_{yz})$.

Algorithm 3: Particle filter in human motion tracking.

6.2 Model 2

Given the sequence of image features $Z$, the joint conditional model of the pose sequence $Y$ and the corresponding embedded sequence $X$ is expressed as

$$ P(X, Y|Z, A, \beta, \theta_{zx}, \theta_{yx}) = P(Z|X, \theta_{zx}) P(X|A) P(Y|X, \beta, \theta_{yx}). \qquad (23) $$

Compared with Model 1, this model needs no approximation related to the dimension of the image features.

Learning. Given both the sequence of poses and the corresponding image features, a generalized EM algorithm as in Section 5.3 can be used to learn the set of model parameters $(A, \beta, \theta_{zx}, \theta_{yx})$.

Input: Image sequence $Z$ and joint angle sequence $Y$.
Output: Human motion model.
E-step: $X^* = \arg\max_X P(X, Y|Z, A^*, \beta^*, \theta_{zx}^*, \theta_{yx}^*)$
M-step: $(A^*, \beta^*, \theta_{zx}^*, \theta_{yx}^*) = \arg\max_{(A, \beta, \theta_{zx}, \theta_{yx})} P(X^*, Y|Z, A, \beta, \theta_{zx}, \theta_{yx})$

Algorithm 4: Human motion model learning with EM.

Inference and Tracking. Once the model is learned, the joint probability of the pose $Y$ and the image features $Z$ can be approximated by

$$ P(Y, Z) \approx P(Y, Z, X^*) = P(Y|X^*) P(Z|X^*) P(X^*), \qquad (24) $$

where $X^* = \arg\max_X P(X|Y, Z)$.

Because we have two conditionally independent GPs, estimating the current pose (distribution) $y_t$ and estimating the current point $x_t$ in the embedded space can be separated. First, the optimal point estimate $x_t^*$ is the result of the following optimization problem:

$$ x_t^* = \arg\max_{x_t} P(x_t|x_{t-1}, A) P(z_t|x_t, \beta, \theta_{zx}). \qquad (25) $$

A particle-based tracker is utilized to estimate the optimal $x_t$, as in tracking with the previous Model 1. A sketch of this procedure using a particle filter with $N_P$ particles and weights $(w^{(i)}, i = 1, \ldots, N_P)$ is shown below.

Input: Image $z_t$, human motion model (NDS), and prior point estimates $(w^{(i)}_{t-1}, x^{(i)}_{t-1}, y^{(i)}_{t-1})|Z_{0..t-1}$, $i = 1, \ldots, N_P$.
Output: Current pose/intrinsic state estimates $(w^{(i)}_t, x^{(i)}_t)|Z_{0..t}$, $i = 1, \ldots, N_P$.
1) Draw the initial estimates $x^{(i)}_t \sim p(x_t|x^{(i)}_{t-1}, A)$.
2) Find the optimal estimates $x^{(i)}_t$ using the nonlinear optimization in Equation (25).
3) Find the point weights $w^{(i)}_t \sim P(x^{(i)}_t|x_{t-1}, A) P(z^{(i)}_t|x^{(i)}_t, \beta, \theta_{zx})$.

Algorithm 5: Particle filter in human motion tracking.
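A compact sketch of one step of such a tracker is given below; the observation log-likelihood is passed in as a callable standing in for $\log P(z_t|x_t, \beta, \theta_{zx})$, the per-particle refinement of Equation (25) is omitted for brevity, and the resample-propagate-reweight structure is a conventional choice we assume rather than the authors' exact procedure:

```python
import numpy as np

def particle_filter_step(particles, weights, A, obs_loglik, alpha=1.0, rng=None):
    """One step of the point-based tracker in the spirit of Algorithm 5 (sketch)."""
    rng = rng or np.random.default_rng()
    NP, N = particles.shape
    # 1) resample by the previous weights and propagate through P(x_t | x_{t-1}, A)
    idx = rng.choice(NP, size=NP, p=weights)
    particles = particles[idx] @ A + rng.standard_normal((NP, N)) / np.sqrt(alpha)
    # 2) reweight each particle by the image-feature likelihood (obs_loglik encodes z_t)
    logw = np.array([obs_loglik(x) for x in particles])
    weights = np.exp(logw - logw.max())
    return particles, weights / weights.sum()
```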

After estimating the optimal $x_t^*$, we can easily compute the distribution of the pose $y_t$ by using the same result from GP theory as in Section 5.4. Note that Model 2 explicitly provides estimates of the pose variance, which Model 1 does not. The mode can be selected as the final pose estimate:

$$ y_t^* = Y' K_{yx}(X, X)^{-1} K_{yx}(X, x_t^*). \qquad (26) $$

7 Experiments

7.1 Synthetic Data

In our first experiment we examine the utility of MAR priors in a subspace selection problem. A 2nd-order AR model is used to generate sequences in an $\mathbb{R}^{T \times 2}$ space; the sequences are then mapped to a higher-dimensional nonlinear measurement space. An example of the measurement sequence, a periodic curve on the Swiss-roll surface, is depicted in Figure 9.

We apply two different methods to recover the intrinsic sequence subspace: MNDS with an RBF kernel and a GPLVM with the same kernel and independent Gaussian priors. The estimated embedded sequences are shown in Figure 10. The intrinsic motion sequence inferred by the MNDS model more closely resembles the "true" sequence in Figure 9. Note that one dimension (blue/dark) is reflected about the horizontal axis, because the embeddings are unique only up to an arbitrary rotation. These results confirm that proper dynamic priors may play a crucial role in learning embedded sequence subspaces. We study the role of dynamics in tracking in the following section.


Fig. 9. A periodic sequence in the intrinsic subspace and the measured sequence on the Swiss-roll surface.

Fig. 10. Recovered embedded sequences. Left: MNDS. Right: GPLVM with iid Gaussian priors.


7.2 Human Motion Data

We conducted experiments using a database of motion capture data for a 59-d.o.f. body model from the CMU Graphics Lab Motion Capture Database [27]. Figure 11 shows the latent spaces resulting from the original GPLVM and our MNDS model. Note that there are breaks in the intrinsic sequence of the original GPLVM. On the other hand, the trajectory in the embedded space of the MNDS model is smoother, without sudden breaks. Note that the precisions for the points corresponding to the training poses are also higher in our MNDS model.

Fig. 11. Latent space with a grayscale map of log precision. Left: pure GPLVM. Right: MNDS.

For the experiments on human motion tracking, we utilize synthetic images as our training data, similar to [1, 20]. Our database consists of seven walking sequences of around 2000 frames total. The data was generated using the software (3D human model and Maya binaries) generously provided by the authors of [18, 17]. We train our GP and NDS models on one sequence of 250 frames and test on the remaining sequences. In our experiments, we exclude 15 joint angles that exhibit small movement during walking (e.g., the clavicle and finger joints) and use the remaining 44 joints. Our choice of image features is the silhouette-based Alt moments used in [20, 13]. The scale and translational invariance of Alt moments makes them suitable for a motion modeling task with little or no image-plane rotation.

In the model learning phase we utilize the approach proposed in Section 5. Once the model is learned, we apply the two tracking/inference approaches from Section 6 to infer motion states and poses from sequences of silhouette images. The pose estimation results with the two different models show little difference. The big difference between the two models is their speed, which we discuss in Section 7.3.

Figure 12 depicts a sequence of estimated poses. The initial estimates for the gradient search are determined by nearest-neighbor matching in the Alt moments space alone. To evaluate our NDS model, we estimate the same input sequence with the original GPLVM tracking in [20]. Although the silhouette features are informative for human pose estimation, they are also prone to ambiguities such as left/right side changes. Without proper dynamics modeling, the original GPLVM fails to estimate the correct poses because of this ambiguity.

Fig. 12. First row: input image silhouettes. Remaining rows show reconstructed poses. Second row: GPLVM model. Third row: NDS model.

The accuracy of our tracking method is evaluated using the mean RMS error between the true and the estimated joint angles [1], $D(y, y') = \frac{1}{44} \sum_{i=1}^{44} |(y_i - y_i') \bmod \pm 180^\circ|$. The first column of Figure 13 displays the mean RMS errors over the 44 joint angles, estimated using three different models. The testing sequence consists of 320 frames. The mean error for the NDS model is in the range $3^\circ \sim 6^\circ$. The inversion of the right and left legs causes significant errors in the original GPLVM model. Introducing simple dynamics in the pose space, similar to [22], was not sufficient to rectify the "static" GPLVM problem. The second column of Figure 13 shows examples of trajectories in the embedded space corresponding to the pose estimates with the three different models. The points inferred from our NDS model follow the path defined by the MAR model, making them temporally consistent. The other two methods produced less-than-smooth embeddings.
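A small sketch of this error measure (angles in degrees; the wrapping convention is our reading of the $\bmod \pm 180^\circ$ notation):

```python
import numpy as np

def mean_angular_error(y_true, y_est):
    """Mean absolute joint-angle difference, wrapped to [-180, 180) degrees."""
    d = (np.asarray(y_true) - np.asarray(y_est) + 180.0) % 360.0 - 180.0
    return np.mean(np.abs(d))
```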

We applied the algorithm to the tracking of various real monocular image sequences. The data used in these experiments was the side-view sequence in the CMU MoBo database, made publicly available under the HumanID project [26]. Figure 14 shows one example of our tracking results.


Fig. 13. Mean angular pose RMS errors and 2D latent space trajectories. First row: tracking using our NDS model. Second row: original GPLVM tracking. Third row: tracking using simple dynamics in the pose space.


This testing sequence consists of 340 frames. Because of a slight mismatch in motion dynamics between the training and the test sequences, the reconstructed poses are not geometrically perfect. However, the overall result sequence depicts a plausible walking motion that agrees with the observed images.

Fig. 14. First row: input real walking images. Second row: image silhouettes. Third row: images of the reconstructed 3D pose.

It is also interesting to note that in a number of tracking experiments it was sufficient to carry a very small number of particles ($\sim 1$) in the point-based tracker of Algorithm 3 and Algorithm 5. In most cases all particles clustered in a small portion of the motion subspace $\mathcal{X}$, even in ambiguous situations induced by the silhouette-based features. This indicates that the presence of dynamics played an important role in disambiguating statically similar poses.

7.3 Complexities

Our experimental results show that Model 1 and Model 2 exhibit similar pose estimation accuracies. The distinctive difference between the two models is their complexity in the learning and inference stages. The complexity of computing the objective function in the GPLVM is proportional to the dimension of the observation (pose) space. For the approximation of Model 1, a 44-dimensional pose is the observation for the two GP models, $P(Y|X)$ and $P(Y|Z)$. However, for Model 2 the pose is the observation for only one GP model, $P(Y|X)$, and the observation of the other GP model, $P(Z|X)$ (Alt moments), has only 10 dimensions. This makes learning Model 2 less complex than learning Model 1. In addition, during inference only the latent variable (e.g., 3 dimensions) is optimized in Model 2, while the optimization in Model 1 deals with both the latent variable and the pose (3 dimensions + 44 dimensions in our experiments).


As a result, Model 2 requires significantly fewer iterations of the nonlinear optimization search, making it potentially more suitable for real-time tracking.

8 Conclusions

We proposed a novel method for embedding sequences into subspaces of dynamic models. In particular, we propose a family of marginal AR (MAR) subspaces that describe all stable AR models. We show that a generative nonlinear dynamic system (NDS) can then be learned from a hybrid of Gaussian (latent) process models and MAR priors, a marginal NDS (MNDS). As a consequence, learning of NDS models and state estimation/tracking can be formulated in this new context. Several synthetic examples demonstrate the potential utility of the NDS framework and display its advantages over traditional static methods in dynamic domains. We also test the proposed approach on the problem of 3D human figure tracking in sequences of monocular images. Our preliminary results indicate that dynamically constructed embeddings using NDS can resolve ambiguities during tracking that may plague static as well as less principled dynamic approaches.

In our future work we plan to extend the set of evaluations and gather more insight into the theoretical and computational properties of MNDS with linear and nonlinear MARs. In particular, our ongoing experiments address posterior multimodality in the embedded spaces, an issue relevant to point-based trackers.

As we noted in our experiments on 3D human figure tracking, the style problem should be resolved for fully automatic human tracking. Tensor decomposition has been introduced as a way to extract the motion signature capturing the distinctive pattern of movement of any particular individual [2]. It would be a promising direction to extend this linear decomposition technique to a novel nonlinear decomposition.

The computational cost of learning the MNDS model is another issue to be considered. The fact that the cost is proportional to the number of data points indicates that we can utilize a sparse prior over the data for more efficient learning. We also plan to extend the NDS formalism to collections of dynamic models using switching dynamics approaches as a way of modeling a general and diverse family of temporal processes.

9 Appendix: MAR Gradient

The negative log-likelihood of the MAR model is, using Equation (9) and leaving out the constant term,

$$ L = \frac{N}{2} \log |K_{xx}| + \frac{1}{2} \operatorname{tr}\left\{ K_{xx}^{-1} X X' \right\}, \qquad (27) $$

with $K_{xx} = K_{xx}(X, X)$ defined in Equation (10). The gradient of $L$ with respect to $X$ is

$$ \frac{\partial L}{\partial X} = \frac{\partial X_\Delta}{\partial X} \frac{\partial L}{\partial K_{xx}} \frac{\partial K_{xx}}{\partial X_\Delta} + \left. \frac{\partial L}{\partial X} \right|_{X_\Delta}. \qquad (28) $$


$X_\Delta$ can be written as a linear operator on $X$,

$$ X_\Delta = \Delta \cdot X, \qquad \Delta = \begin{bmatrix} 0_{1 \times (T-1)} & 0 \\ I_{(T-1) \times (T-1)} & 0_{(T-1) \times 1} \end{bmatrix}, \qquad (29) $$

where $0$ and $I$ denote zero vectors and identity matrices of the sizes specified in the subscripts. It now easily follows that

$$ \frac{\partial L}{\partial X} = \Delta' \left( N K_{xx}^{-1} - K_{xx}^{-1} X X' K_{xx}^{-1} \right) \Delta \cdot X + K_{xx}^{-1} X. \qquad (30) $$
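A direct NumPy transcription of Equations (29)-(30), useful for checking the gradient numerically (the function name and the zero initial condition are our assumptions):

```python
import numpy as np

def mar_grad(X, alpha=1.0):
    """Gradient of the negative log-likelihood L of Eq. (27) w.r.t. X, per Eq. (30)."""
    T, N = X.shape
    Delta = np.zeros((T, T))
    Delta[1:, :-1] = np.eye(T - 1)                 # delay operator of Eq. (29): (Delta X)_t = x_{t-1}
    X_delta = Delta @ X
    Kxx = X_delta @ X_delta.T + np.eye(T) / alpha  # Eq. (10)
    Kinv = np.linalg.inv(Kxx)
    inner = N * Kinv - Kinv @ X @ X.T @ Kinv
    return Delta.T @ inner @ Delta @ X + Kinv @ X
```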

References

1. A. Agarwal and B. Triggs: 3D Human Pose from Silhouettes by Relevance Vector Regression, CVPR, Vol. 2, (2004) 882–888
2. M. A. O. Vasilescu: Human Motion Signatures: Analysis, Synthesis, Recognition, ICPR, Vol. 3, (2002) 30456
3. M. Belkin and P. Niyogi: Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, Vol. 15(6), (2003) 1373–1396
4. M. Brand: Shadow Puppetry, CVPR, Vol. 2, (1999) 1237–1244
5. A. Elgammal and C. Lee: Inferring 3D Body Pose from Silhouettes using Activity Manifold Learning, CVPR, Vol. 2, (2004) 681–688
6. V. L. Girko: Circular Law, Theory of Probability and its Applications, Vol. 29, (1984) 694–706
7. K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popovic: Style-based Inverse Kinematics, SIGGRAPH, (2004) 522–531
8. I. T. Jolliffe: Principal Component Analysis, Springer Verlag, (1986)
9. N. D. Lawrence: Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data, NIPS, (2004)
10. N. D. Lawrence: Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models, Journal of Machine Learning Research, Vol. 6, (2005) 1783–1816
11. K. V. Mardia, J. Kent, and M. Bibby: Multivariate Analysis, Academic Press, (1979)
12. D. Ormoneit, H. Sidenbladh, M. J. Black, and T. Hastie: Learning and Tracking Cyclic Human Motion, NIPS, (2001) 894–900
13. R. Rosales and S. Sclaroff: Specialized Mappings and the Estimation of Human Body Pose from a Single Image, Workshop on Human Motion, (2000) 19–24
14. S. T. Roweis and L. K. Saul: Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science, Vol. 290, (2000) 2323–2326
15. H. Sidenbladh, M. J. Black, and D. J. Fleet: Stochastic Tracking of 3D Human Figures Using 2D Image Motion, ECCV, (2000) 702–718
16. C. Sminchisescu and A. Jepson: Generative Modeling for Continuous Non-linearly Embedded Visual Inference, ICML, ACM Press, (2004)
17. C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas: Conditional Visual Tracking in Kernel Space, NIPS, (2005)
18. C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas: Discriminative Density Propagation for 3D Human Motion Estimation, CVPR, Vol. 1, (2005) 390–397
19. J. B. Tenenbaum, V. de Silva, and J. C. Langford: A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science, Vol. 290, (2000) 2319–2323
20. T. P. Tian, R. Li, and S. Sclaroff: Articulated Pose Estimation in a Learned Smooth Space of Feasible Solutions, Workshop on Learning in CVPR, (2005) 50
21. R. Urtasun, D. J. Fleet, and P. Fua: Monocular 3D Tracking of the Golf Swing, CVPR, Vol. 2, (2005) 1199
22. R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua: Priors for People Tracking from Small Training Sets, ICCV, Vol. 1, (2005) 403–410
23. J. M. Wang, D. J. Fleet, and A. Hertzmann: Gaussian Process Dynamical Models, NIPS, (2005)
24. Q. Wang, G. Xu, and H. Ai: Learning Object Intrinsic Structure for Robust Visual Tracking, CVPR, Vol. 2, (2003) 227–233
25. C. K. I. Williams and D. Barber: Bayesian Classification with Gaussian Processes, PAMI, Vol. 20(12), (1998) 1342–1351
26. http://www.hid.ri.cmu.edu/Hid/databases.html
27. http://mocap.cs.cmu.edu/