
Exploring the Space of an Action for Human Action Recognition

Paper 614

Abstract

One of the fundamental challenges of recognizing actions is accounting for the variability that arises when arbitrary cameras capture humans performing actions. In this paper, we explicitly identify three important sources of variability: (1) viewpoint, (2) execution rate, and (3) anthropometry of actors, and propose a model of human actions that allows us to address all three. Our hypothesis is that the variability associated with the execution of an action can be closely approximated by a linear combination of action bases in joint spatio-temporal space. We demonstrate that such a model bounds the rank of a matrix of image measurements and that this bound can be used to achieve recognition of actions based only on imaged data. A test employing principal angles between subspaces that is robust to statistical fluctuations in measurement data is presented to find the membership of an instance of an action. The algorithm is applied to recognize several actions, and promising results have been obtained.

1. Introduction

Developing algorithms to recognize human actions has proven to be an immense challenge, since it is a problem that combines the uncertainty associated with computational vision with the added whimsy of human behavior. Even without these two sources of variability, the human body has no less than 244 degrees of freedom ([19]), and modeling the dynamics of an object with such non-rigidity is no mean feat. Further compounding the problem, recent research in anthropology has revealed that body dynamics are far more complicated than was earlier thought, affected by age, ethnicity, class, family tradition, gender, sexual orientation, skill, circumstance and choice, [4]. Human actions are not merely functions of joint angles and anatomical landmark positions, but bring with them traces of the psychology, the society and culture of the actor. Thus, the sheer range and complexity of human actions makes developing action recognition algorithms a daunting task. So how does one appropriately model the non-rigidity of human motion? How do we account for personal styles (or motion signatures, [17]) while recognizing actions? How do we account for the diverse shapes and sizes of different people? In this paper, we consider some of these questions while developing a model of human actions that approaches these issues.

To begin with, it is important to identify properties that are expected to vary with each observation of an action, but which should not affect recognition:

Viewpoint The relationship of action recognition to object recognition was observed by Rao and Shah in [13], and developed further by Parameswaran and Chellappa in [9], [10] and Gritai et al. in [6]. In these papers, the importance of view-invariant recognition has been stressed, highlighting the fact that, as in object recognition, the vantage point of the camera should not affect recognition. The projective and affine geometry of multiple views is well understood, see [7], and various invariants have been proposed.

Anthropometry In general, an action can be executed irrespective of the size or gender of the actor. It is therefore important that action recognition be unaffected by so-called “anthropometric transformations”. Unfortunately, since anthropometric transformations do not obey any known laws, formally characterizing invariants is impossible. However, empirical studies have shown that these transformations are not arbitrary (see [3]). This issue has previously been addressed by Gritai et al. in [6].

Execution Rate With rare exceptions such as synchronized dancing or army drills, actions are rarely executed at a precise rate. It is desirable, therefore, that action recognition algorithms remain unaffected by some set of temporal transformations. The cause of temporal variability can be twofold, arising either from the actor or from differing camera frame-rates. Dynamic time warping has been a popular approach to account for highly non-linear transformations, [13].

Recognition presumes some manner of grouping, and the question of what constitutes an action is a matter of perceptual grouping. It is difficult to quantify exactly, for instance, whether “walking quickly” should be grouped together with “walking” or with “running”, or for that matter whether walking and running should be defined as a single action or not. Thus grouping can be done at different levels of abstraction and, more often than not, depends on circumstance and context. In this paper, rather than arbitrarily defining some measure of similarity between actions, we allow membership to be defined through exemplars of a group. Our hypothesis is that the variability associated with the execution of an action can be closely approximated by a linear combination of action bases in joint spatio-temporal space.


We demonstrate that such a model bounds the rank of a matrix of image measurements and that this bound can be used to achieve recognition of actions based only on imaged data. A test employing principal angles between subspaces that is robust to statistical fluctuations in measurement data is presented to find the membership of an instance of an action. The algorithm is applied to recognize several actions, and promising results have been obtained. As in [9] and [6], we do not address lower-level processing tasks such as shot segmentation, object detection, and body-joint detection. Instead, we assume the image positions of anatomical landmarks on the body are provided, and concentrate on how best to model and use this data to recognize actions. Johansson demonstrated that point-based representations of human actions were sufficient for the recognition of actions, [8]. In our work, the input is the 2D motion of a set of 13 anatomical landmarks, L = {1, 2, ..., 13}, as viewed from a camera.

The rest of the paper is organized as follows. We situate our work in the context of previous research in Section 2. In Section 3, we present our model of human actions and discuss some properties of the proposed framework, followed by the development of a matching algorithm in Section 4. Results are presented in Section 5, followed by conclusions in Section 6.

2 Previous Work

Action recognition has been an active area of research in the vision community since the early 90s. A survey of action recognition research by Gavrila, in [5], classifies different approaches into three categories: (1) 2D approaches without shape models, (2) 2D approaches with shape models and (3) 3D approaches. Since the publication of this survey, a newer approach to action recognition has emerged: 2D approaches based on 3D constraints, which maintain invariance to viewpoint while avoiding the difficulties of 3D reconstruction of non-rigid motion. The first approach to use 3D constraints on 2D measurements was proposed by Seitz and Dyer in [14], where sufficient conditions were given for determining whether measurements were 2D affine projections of 3D cyclic motion. Rao and Shah extended this idea in [13] to recognize non-cyclic actions as well, proposing a representation of actions using dynamic instances and intervals and a view-invariant measure of similarity. Syeda-Mahmood and Vasilescu proposed a view-invariant method of concurrently estimating the fundamental matrix and recognizing actions in [15]. In [9], Parameswaran and Chellappa use 2D measurements to match against candidate action volumes, utilizing 3D model-based invariants. In addition to view invariance, Gritai et al. proposed a method that was invariant to changes in the anthropometric proportions of actors. As in [13] and [9], they ignored time and treated each action as an object in 3D.

Figure 1: Representation of an action in 4-space. (a) Action in XYZ space; (b) action in XYT space. The actor is shown at frames 1, 50 and 126.

3 The Space of an Action

By marginalizing time, several papers have represented actions essentially as objects in 3D ([13], [6] and [9]). While some success has been achieved, ignoring temporal information in this way and focusing only on order makes modeling of temporal transformations impossible. Instead, since an action is a function of time, in this work an instance of an action is modeled as a spatio-temporal construct, a set of points, A = [X_1, X_2, ..., X_p], where X_i = (X^j_{T_i}, Y^j_{T_i}, Z^j_{T_i}, T_i)⊺ and j ∈ L (see Figure 1), and p = 13n for n recorded postures of that action^1. An instance of an action is defined as a linear combination of a set of action bases A_1, A_2, ..., A_k. Consequently, any instance of an action can be expressed as

A′ = Σ_{i=1}^{k} a_i A_i,        (1)

where a_i ∈ R is the coefficient associated with the action basis A_i ∈ R^{4×p}. The space of an action, A, is the span of all its action bases. By allowing actions to be defined in this way by action bases, we do not impose arbitrary definitions on what an action is. The action is defined entirely by the constituents of its action bases. The variance captured by the action bases can include different (possibly non-linearly transformed) execution rates of the same action, different individual styles of performance, as well as the anthropometric transformations of different actors (see Figure 2). In general, the number of samples per execution of an action will not necessarily be the same. In order to construct the bases, the entire duration of each action is sampled the same number of times.

^1 The construction of A must respect the ordering of L.
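To make the action-space model concrete, the following is a minimal NumPy sketch (ours, not part of the original paper) of Equation (1): an action instance synthesized as a linear combination of hypothetical action bases, each a 4×p matrix of (X, Y, Z, T) columns. The basis matrices and coefficients are random stand-ins rather than bases learned from data.

```python
import numpy as np

# Hypothetical dimensions: 13 landmarks, n = 20 resampled postures.
num_landmarks, num_postures = 13, 20
p = num_landmarks * num_postures           # p = 13n points per action instance
k = 5                                      # assumed number of action bases

# Each action basis A_i is a 4 x p matrix whose columns are (X, Y, Z, T) points.
rng = np.random.default_rng(0)
action_bases = [rng.standard_normal((4, p)) for _ in range(k)]

# Equation (1): an action instance is a linear combination of the bases.
coeffs = rng.standard_normal(k)            # coefficients a_i
A_prime = sum(a * A for a, A in zip(coeffs, action_bases))

print(A_prime.shape)                       # (4, p): a point set in joint (X, Y, Z, T) space
```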


3.1 Action Projection

When the change in depth of a scene is small compared to the distance between the camera and the scene, affine projection models can be used to approximate the process of imaging. In this paper, we assume a special case of affine projection - weak-perspective projection. Weak-perspective projection is effectively composed of two steps. The world point is first projected under orthography, followed by a scaling of the image coordinates. This can be written as,

(x, y)⊺ = [ α_x r_1⊺ ; α_y r_2⊺ ] (X, Y, Z)⊺ + D,        (2)

where r_i⊺ is the i-th row of the rotation matrix R, α_i is a constant scaling factor and D is the displacement vector. Here, a fixed camera is observing the execution of an action across time. For our purposes we find it convenient to define a canonical world time coordinate T, where the imaged time coordinate t is related to the world time coordinate by a linear relationship, t = α_t T + d_t, where α_t is a temporal scaling factor and d_t is a temporal displacement. This transformation in time can occur because of varying frame rates, because the world action and the imaged action are separated in time, or because of linear changes in the speed of execution of an action. If two actions differ only by a linear transformation in execution rate, we consider them equivalent. We can define a 3×4 space-time projection matrix R that projects a point (X, Y, Z, T)⊺ to its image (x, y, t),

(x, y, t)⊺ = [ α_x r_1⊺  0 ; α_y r_2⊺  0 ; 0⊺  α_t ] (X, Y, Z, T)⊺ + [ D ; d_t ],

or x = R X + D.

As in [2] and [16], we can eliminate D by subtracting the mean of all imaged points. Thus, in our setup, where each instance of an action, A_i, is being observed by a stationary camera, we have a_i = R A_i. Available data is usually in terms of the imaged positions of the landmarks across time. A matrix W can be constructed from these measurements as,

W = [ x_{1,1}  x_{1,2}  ···  x_{1,n} ;
      y_{1,1}  y_{1,2}  ···  y_{1,n} ;
      t_{1,1}  t_{1,2}  ···  t_{1,n} ].        (3)
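As an illustration of the projection model and of Equation (3), the sketch below (ours; the rotation, scales and displacements are arbitrary placeholder values, not calibrated quantities) builds a 3×4 space-time projection matrix, applies it to a hypothetical 4×p action instance, and obtains the 3×p measurement matrix W of imaged (x, y, t) points.

```python
import numpy as np

def space_time_projection(alpha_x, alpha_y, alpha_t, R3x3, D, d_t):
    """Build the 3x4 space-time projection matrix and displacement described
    above: weak perspective in space, linear scaling/shift in time."""
    r1, r2 = R3x3[0], R3x3[1]            # first two rows of the rotation matrix
    P = np.zeros((3, 4))
    P[0, :3] = alpha_x * r1
    P[1, :3] = alpha_y * r2
    P[2, 3] = alpha_t
    disp = np.array([D[0], D[1], d_t])
    return P, disp

# A hypothetical action instance A: 4 x p matrix of (X, Y, Z, T) columns.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 50))

# Arbitrary rotation (identity here), scales and displacements.
P, disp = space_time_projection(1.0, 1.0, 1.0, np.eye(3), D=(2.0, -1.0), d_t=0.0)

# x = R A + D: the imaged (x, y, t) measurements, i.e. the matrix W of Eq. (3).
W = P @ A + disp[:, None]
print(W.shape)                            # (3, p)
```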

We now show that, simply given sufficient imaged exemplars of an action in the form of W, a new action W′ can be recognized.

Figure 2: Different postures associated with sitting. Our hypothesis is that there exists a set of action bases that can compactly describe different styles and rates of execution of an action.

Proposition 1 If W is constructed of images of several instances of an action that span the space of that action, and W′ is another instance of that action, then

rank(W) = rank([W W′]).

Under affine projection, we have

W = R A = R Σ_{i=1}^{k} a_i A_i = [a_1 R ··· a_k R] [A_1 ; A_2 ; ··· ; A_k].        (4)

When several observations are available,

W = [W_1 ; W_2 ; ··· ; W_n] = [ a_{1,1} R_1 ··· a_{k,1} R_1 ; ··· ; a_{1,n} R_n ··· a_{k,n} R_n ] [A_1 ; A_2 ; ··· ; A_k].        (5)

Since the columns of W are spanned by A, the rank of W is at most 4k (the block [a_1 R ··· a_k R] in Equation (4) has 4k columns). Now if an observed action A′ is an instance of the same action, then it too should be expressible as a linear combination (Equation 1) of the A_i, and therefore the rank of [W W′] should remain 4k. If it is not the same action, i.e., it is not expressible as such a linear combination, then the rank should increase.

An important consequence of Proposition 1 is that we do not need to explicitly compute either the action bases or the space-time projection matrix. In the next section, we show how membership can be tested using only imaged measurements.
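The following sketch (ours, using noise-free synthetic data) illustrates the rank test of Proposition 1: stacking imaged exemplars generated from the same hypothetical action bases never pushes the rank beyond 4k, whereas appending measurements of an unrelated action increases it.

```python
import numpy as np

rng = np.random.default_rng(2)
k, p = 3, 60                                   # assumed number of bases and points

# Hypothetical action bases; each imaged instance uses its own projection.
bases = rng.standard_normal((k, 4, p))

def imaged_instance(bases, rng):
    P = rng.standard_normal((3, 4))            # arbitrary space-time projection
    coeffs = rng.standard_normal(len(bases))
    A = np.tensordot(coeffs, bases, axes=1)    # Eq. (1): sum_i a_i A_i
    return P @ A                               # 3 x p imaged measurements

W = np.vstack([imaged_instance(bases, rng) for _ in range(6)])   # exemplars
W_same = imaged_instance(bases, rng)                              # same action
W_other = rng.standard_normal((3, p))                             # unrelated data

print(np.linalg.matrix_rank(W))                        # at most 4k = 12
print(np.linalg.matrix_rank(np.vstack([W, W_same])))   # unchanged for the same action
print(np.linalg.matrix_rank(np.vstack([W, W_other])))  # increases for a different one
```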

4 Recognizing an Action Using the Angle between Subspaces

Given a matrix W′ containing measurements of the imaged positions of the anatomical landmarks of an actor, we wish to find which of c possible actions was performed.


Figure 3: Change in subspace angles as the number of action exemplars is increased. We incrementally added positive examples to the training set, and as the span of the action bases increased, the angle associated with the positive examples decreased much more than that of the negative examples.

The measured image positions are described in terms of the true positions W′ with independent, normally distributed measurement noise of mean µ = 0 and variance σ², that is,

W̃′ = W′ + ε,  ε ∼ N(0, σ²).        (6)

Our objective is to find the Maximum Likelihood estimate j* such that

j* = arg max_{j ∈ {1,...,c}} p(A_j | W′).        (7)

Because of Proposition 1, we do not need to have the actual action bases to evaluate p(A_j | W′). Instead, each action is defined by a set of imaged exemplars that describe the possible variation in the execution of that action. This variation may arise from any one of the many reasons discussed in the introduction. Thus for each action A_j we have a set of exemplars of that action, W_j = [W_{j,1}, W_{j,2}, ···, W_{j,n}], where n ≥ k and dim(A_j) = k.

Now, W and W′ are matrices defining two subspaces, and without loss of generality assume that dim(W) ≥ dim(W′) ≥ 1. The smallest angle θ_1 ∈ [0, π/2] between W and W′ is

cos θ_1 = max_{u ∈ W} max_{v ∈ W′} u⊺v,        (8)

where ‖u‖_2 = 1 and ‖v‖_2 = 1. It was shown by Wedin in [11] that the angle between W and W′ gives an estimate of the amount of new information afforded by the second matrix that is not associated with measurement noise. For two matrices A and B, where B is a perturbation of A, i.e. A = B + ε, the subspace angle between range(B) and range(A) is bounded as

sin(θ) ≤ ‖ε‖_2 / σ_r(A),

where σ_r(A) is the r-th singular value of A. This stability to measurement error makes the angle between subspaces an attractive alternative to standard re-projection errors. Thus,

p(A_j | W′) = L(W′ | W_j) ∝ cos θ_1,        (9)

where L is the likelihood function.
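For illustration, the smallest principal angle of Equation (8) can be computed from orthonormal bases of the two measurement matrices, since cos θ_1 equals the largest singular value of the product of those bases. The sketch below is ours (it is not the iterative procedure of [1] used in the paper), and it takes the subspaces to be the column spaces of the input matrices.

```python
import numpy as np

def smallest_principal_angle(W, W_prime):
    """cos(theta_1) = max_u max_v u'v over unit vectors in the two column spaces,
    computed from orthonormal bases via the SVD."""
    Qw, _ = np.linalg.qr(W)            # orthonormal basis of span(W)
    Qp, _ = np.linalg.qr(W_prime)      # orthonormal basis of span(W')
    s = np.linalg.svd(Qw.T @ Qp, compute_uv=False)
    cos_theta1 = min(1.0, s.max())     # clip numerical overshoot
    return np.arccos(cos_theta1)

rng = np.random.default_rng(3)
W = rng.standard_normal((100, 8))          # exemplar matrix (columns span the subspace)
W_prime = W @ rng.standard_normal((8, 3))  # lies inside span(W): angle ~ 0
print(smallest_principal_angle(W, W_prime))
print(smallest_principal_angle(W, rng.standard_normal((100, 3))))  # larger angle
```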

Figure 6: Reconstructing an action in XYZT. We do not require such reconstruction in our recognition algorithm; in this figure, we are simply demonstrating the validity of our hypothesis. (a) Accurate reconstruction. The last frame of an instance of “Sitting” is shown. The blue-markered skeleton represents the original measurements, and the red-markered skeleton represents the reconstruction after projection onto the “Sitting” action basis. (b) The 32nd frame of an instance of “Running” is shown. The blue-markered skeleton represents the original measurements, and the red-markered skeleton represents the reconstruction after projection onto the “Walking” action basis.

The recognition algorithm, along with the algorithm described by Björck and Golub in [1] for the numerically stable computation of the angle between subspaces, is given in Figure 5.

5 Results

During experimentation our goal was to verify the claims made in this paper, namely that the proposed algorithm can recognize actions despite changes in viewpoint, anthropometry and execution rate. Furthermore, through experimentation, we validate our conjecture that an action can be described by a set of exemplars of that action.


Figure 4: ROC curves for four actions. The dotted diagonal line shows random prediction. (a) ROC curve for Standing; the area under the curve is 0.9842. (b) ROC curve for Running; the area under the curve is 0.9231. (c) ROC curve for Walking; the area under the curve is 0.9573. (d) ROC curve for Sitting; the area under the curve is 0.9660.

Objective: Given a matrix W′ corresponding to the projection of an action instance, and matrices W_1, W_2, ···, W_N each modeling one of the N different actions, find which action was most likely executed.

Algorithm: For each action A_i, i ∈ {1, 2, ···, N} do:

1. Normalization: Compute a similarity transform, transforming the mean of the points to the origin and making the average distance of the points from the origin equal to √2. This should be done separately for each action instance.

2. Compute the subspace angle between W′ and W_i:
   • Compute orthonormal bases: Use the SVD to reliably compute orthonormal bases W̃′ and W̃_i of W′ and W_i.
   • Compute projection: Using the iterative procedure described in [1], for j ∈ {1, ···, p}, W′_{j+1} = W′_j − W̃_i W̃_i⊺ W′_j.
   • Find angle: Compute θ = arcsin(min(1, ‖W′_p‖_2)).

Select i* = arg max_{i ∈ {1,···,N}} cos(θ_1).

Figure 5: Algorithm for Action Recognition
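A compact sketch of the recognition loop of Figure 5 is given below, under simplifying assumptions (ours): the similarity normalization is applied to each measurement matrix independently, the subspaces are taken to be the row spaces of the stacked measurements, and the subspace angle is obtained directly from an SVD rather than through the iterative procedure of [1]. Function and variable names are hypothetical.

```python
import numpy as np

def normalize(W):
    """Similarity-normalize measurements: zero mean, average point distance sqrt(2)."""
    Wc = W - W.mean(axis=1, keepdims=True)
    avg = np.linalg.norm(Wc, axis=0).mean()
    return Wc * (np.sqrt(2) / avg)

def cos_smallest_angle(W, W_prime):
    """Cosine of the smallest principal angle between the row spaces of W and W'."""
    Qa, _ = np.linalg.qr(W.T)            # orthonormal basis of the row space of W
    Qb, _ = np.linalg.qr(W_prime.T)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return min(1.0, s.max())

def recognize(W_prime, exemplar_sets):
    """exemplar_sets[i] is the stacked, normalized exemplar matrix W_i of action i.
    Returns the index i* maximizing cos(theta_1), as in Figure 5."""
    Wp = normalize(W_prime)
    scores = [cos_smallest_angle(Wi, Wp) for Wi in exemplar_sets]
    return int(np.argmax(scores)), scores
```

In practice, each exemplar_sets[i] would be built by normalizing and vertically stacking the imaged exemplars of action i, in the spirit of Equation (5).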

In our experiments, we used a number of complex sequences which were a mix of real motion capture data^2 and direct video measurements, performed by different actors in many different ways. The test data included the following actions: Sitting, Standing, Falling, Walking, Dancing and Running. In addition to differences in anthropometry and execution rates, we also generated different views by changing projection matrices for the motion-captured data. For the imaged data, too, we captured the sequences from several different views. Figure 8(a) shows some examples of “Sitting” while Figure 8(b) shows some examples of “Walking”.

^2 We did not use any Z information while testing the recognition.

5.1 Action Recognition Results

The set of action exemplars for each action is composed by adding exemplars iteratively until they sufficiently span the action space. To achieve this using the minimum number of training samples, we iteratively pick the action sequence from the corpus of that action which has the maximum angle between it and the action subspace (at that point), and continue until the rank of the action subspace is unaffected by the addition of further exemplars from the corpus. This greedy method allows us to minimize the number of training samples required to span the action space, instead of just selecting an arbitrary number of exemplars; a sketch of this selection loop is given below. The effect of increasing the number of exemplars in this way is shown in Figure 3. Clearly, the angles of the positive exemplars are ultimately significantly lower than those of the negative exemplars.
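The sketch below (ours) illustrates this greedy selection, assuming all instances have been resampled to the same number of postures so that their measurement matrices can be stacked. The helper angle_fn is assumed to return the smallest subspace angle between the current exemplar stack and a candidate (for example, the arccosine of cos_smallest_angle from the earlier sketch), and the first instance in the corpus is used as the initial exemplar.

```python
import numpy as np

def greedy_select(corpus, angle_fn):
    """corpus: list of measurement matrices, one per recorded instance of an action.
    Repeatedly add the instance with the largest angle to the current action
    subspace; stop when the rank of the stacked exemplars stops growing."""
    exemplars = [corpus[0]]
    remaining = list(corpus[1:])
    rank = np.linalg.matrix_rank(np.vstack(exemplars))
    while remaining:
        W = np.vstack(exemplars)
        # Pick the candidate least explained by the current subspace.
        idx = max(range(len(remaining)), key=lambda i: angle_fn(W, remaining[i]))
        candidate = remaining.pop(idx)
        new_rank = np.linalg.matrix_rank(np.vstack(exemplars + [candidate]))
        if new_rank == rank:          # span no longer grows: stop
            break
        exemplars.append(candidate)
        rank = new_rank
    return exemplars
```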


Figure 8: Video sequences used in the experiments. Actors of both genders, and of different body proportions, were observed performing actions in different styles. (a) Sitting sequences. Clearly, each actor sits in a unique way, with legs apart, or one leg crossed over the other. (b) Walking sequences. Sources of variability include arm swing, speed, and stride length.

Figure 7: Variations in Walking. Ten evenly sampled postures were taken from the duration of the action.

To test our approach, for each action we take all the instances of that action in the corpus as positive sequences and the sequences for all the other actions as negative sequences. Table 1 shows the number of training and testing sequences that were eventually used to obtain the final results. For each action in our testing set, we computed the angle between the action space and the subspace spanned by that action instance. This result is thresholded to give the final binary classification. The ROC curves based on this classification for “Walking”, “Sitting”, “Standing” and “Running” are shown in Figure 4. As can be seen from the area under these ROC curves, using our approach we have been able to correctly classify most of the actions.
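For completeness, the sketch below (ours, on synthetic angle scores rather than the paper's data) shows how thresholding the subspace angle yields an ROC curve such as those in Figure 4: sweep a threshold over the angles of positive and negative test instances and record the true and false positive fractions.

```python
import numpy as np

def roc_from_angles(pos_angles, neg_angles):
    """Smaller angle => classified as the action. Sweep thresholds and return
    (false positive fraction, true positive fraction) pairs."""
    thresholds = np.sort(np.concatenate([pos_angles, neg_angles]))
    tpr = np.array([(pos_angles <= t).mean() for t in thresholds])
    fpr = np.array([(neg_angles <= t).mean() for t in thresholds])
    return fpr, tpr

# Synthetic angle scores standing in for real test instances.
rng = np.random.default_rng(4)
pos = rng.normal(0.3, 0.1, 200)    # positives: small angles
neg = rng.normal(0.9, 0.2, 150)    # negatives: larger angles
fpr, tpr = roc_from_angles(pos, neg)
auc = np.trapz(tpr, fpr)           # area under the curve
print(round(auc, 3))
```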

Action      Exemplars   # of Positive   # of Negative
Sitting         11            230             127
Standing        12            120             138
Running         17            290             121
Walking         11            450             105

Table 1: Number of training and testing samples for each of the four actions recognized during the experiments.

These figures also show that “Standing” and “Sitting” are better classified than “Walking” and “Running”. As discussed earlier in Section 1, this is because “Walking” and “Running” are similar actions and it is comparatively difficult to distinguish between them, although our method is still able to distinguish between these to a large extent. Reconstruction of an imaged action after projection onto the action basis for “Sitting”, along with the coefficients for each of its 11 bases, is shown in Figure 9.

6 Summary and Conclusions

In this paper we have developed a framework for learning the variability in the execution of human actions that is unaffected by changes in viewpoint, anthropometry and execution rate. Our hypothesis is that any instance of an action can be expressed as a linear combination of spatio-temporal action bases, capturing different personal styles of execution of an action, different sizes and shapes of people, and different rates of execution. We demonstrate that, using sufficient imaged exemplars of actions, an action as viewed from a camera can be recognized using the angle between the subspace of the exemplars and the subspace of the inspection instance.


Figure 9: Action reprojection of “Sitting” in xyt space. The red points show the original action in xyt and the blue points show a close reconstruction after projection onto the action bases of sitting. Note this is based on imaged exemplars only.

This concept is related to earlier factorization approaches proposed by (among others) Tomasi and Kanade in [16], and Bregler et al. in [2]. In particular, in [2], non-rigid motion viewed by a single camera over time was modeled as a linear combination of 3D shape bases. However, rather than factorizing measurement matrices constructed from a single camera, as in the case of objects, we factorize measurement matrices captured across multiple cameras. In this work, we are not interested in explicitly recovering the actual three (or four) dimensional actions or action bases, but instead use the constraints they provide to perform recognition. Future directions could involve recovering the 4D structure of an action explicitly, aided by action bases.

References

[1] Å. Björck and G. Golub, “Numerical Methods for Computing Angles between Linear Subspaces,” Mathematics of Computation, 1973.

[2] C. Bregler, A. Hertzmann and H. Biermann, “Recovering Non-Rigid 3D Shape from Image Streams,” IEEE Conference on Computer Vision and Pattern Recognition, 2000.

[3] R. Easterby, K. Kroemer and D. Chaffin, “Anthropometry and Biomechanics - Theory and Application,” Plenum Press, 1982.

[4] B. Farnell, “Moving Bodies, Acting Selves,” Annual Review of Anthropology, Vol. 28, 1999.

[5] D. Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision and Image Understanding, 1999.

[6] A. Gritai, Y. Sheikh and M. Shah, “On the Use of Anthropometry in the Invariant Analysis of Human Actions,” International Conference on Pattern Recognition, 2004.

[7] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.

[8] G. Johansson, “Visual Perception of Biological Motion and a Model for Its Analysis,” Perception and Psychophysics, 1973.

[9] V. Parameswaran and R. Chellappa, “View Invariants for Human Action Recognition,” IEEE Conference on Computer Vision and Pattern Recognition, 2003.

[10] V. Parameswaran and R. Chellappa, “Quasi-Invariants for Human Action Representation and Recognition,” IEEE International Conference on Pattern Recognition, 2002.

[11] P.-Å. Wedin, “On Angles between Subspaces of a Finite Dimensional Inner Product Space,” Matrix Pencils, Lecture Notes in Mathematics, Kågström and Ruhe (Eds.), 1983.

[12] C. Rao, A. Gritai and M. Shah, “View-invariant Alignment and Matching of Video Sequences,” IEEE International Conference on Computer Vision, 2003.

[13] C. Rao and M. Shah, “View-Invariance in Action Recognition,” IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[14] S. Seitz and C. Dyer, “View-Invariant Analysis of Cyclic Motion,” International Journal of Computer Vision, 1997.

[15] T. Syeda-Mahmood and A. Vasilescu, “Recognizing Action Events from Multiple Viewpoints,” IEEE Workshop on Detection and Recognition of Events in Video, 2001.

[16] C. Tomasi and T. Kanade, “Shape and Motion from Image Streams under Orthography,” International Journal of Computer Vision, 1992.

[17] M. Vasilescu, “Human Motion Signatures: Analysis, Synthesis, Recognition,” International Conference on Pattern Recognition, 2002.

[18] A. Veeraraghavan, A. Roy Chowdhury and R. Chellappa, “Role of Shape and Kinematics in Human Movement Analysis,” IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[19] V. Zatsiorsky, Kinematics of Human Motion, Human Kinetics, 2002.
