2011 IEEE International Workshop on Haptic Audio Visual Environments and Games (HAVE 2011)
Human Action Recognition from Local Part Model
Feng Shi, Emil M. Petriu, and Albino Cordeiro
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa
Ottawa, Canada
Abstract—In this paper, we present a part model for human action recognition from video. We use the 3D HOG descriptor and a bag-of-feature representation for video. To overcome the loss of event ordering in the bag-of-feature approach, we propose a novel multiscale local part model that preserves temporal context. Our method builds upon several recent ideas, including dense sampling, local spatio-temporal (ST) features, the 3D HOG descriptor, the BOF representation and non-linear SVMs. Preliminary results on the KTH action dataset show a higher recognition rate than recent studies.
Keywords—action recognition; 3D HOG descriptor; bag-of-feature (BOF); local spatio-temporal (ST) features; part model
I. INTRODUCTION
Recognizing human action from image sequences is one of the most challenging problems in computer vision, with many important applications such as intelligent video surveillance, content-based video retrieval, human-robot interaction, and smart homes. The task is difficult not only due to intra-class variations, camera movements, background clutter and partial occlusion, but also due to inter-class overlaps and similarities, such as running vs. jogging or walking.
There has been great progress recently in solving many of the problems described above in the field of human action recognition. According to a recent survey [1], the action recognition methods in the literature can be divided into two categories: global representations and local representations. The former treats a person in the image as a whole and adopts explicit shape models. It often includes the following steps: (a) detecting a region of interest around the person, either by background subtraction [2, 3] or tracking [4, 5]; (b) computing image features of the detected human, such as silhouettes [2, 3], edges [6], contours [3] or optical flow [4]; (c) comparing the computed features with pre-defined parametric [3, 4] or template [5, 6] models to perform the classification.
Earlier works on human action recognition in video often used global representations. These approaches can achieve good results in simplified and controlled settings, such as static backgrounds, a single subject and fully visible humans. However, they depend heavily on background subtraction or object tracking, and lack discriminative power for action classification on realistic, noisy video data.
Local representation methods usually extract smaller local space-time interest regions from the video. These local patches are either sampled densely or chosen by spatio-temporal
interest points. In [7], Schüldt et al. extended the Harris corner detector and automatic scale selection to 3D to detect salient sparse spatio-temporal features. To address the issue of relatively sparse detected salient features, Dollár et al. [8] used a pair of 1D Gabor filters both on spatial and temporal dimension, which produced denser interest points. Built on 2D Hessian detector, Hessian3D [9] was proposed by Willems et al. to detect dense, scale-invariant, and computationally efficient salient spatial-temporal points. Motivated by excellent object recognition results [10, 11] obtained by dense sampling, Wang et al. [12] evaluated recent local spatio-temporal features [7, 8, 9], and found: “dense sampling consistently outperforms all tested interest point detectors in realistic video setting”.
Recent studies [8, 9, 13, 14] have shown that local features can achieve remarkable performance when represented by the popular “bag-of-feature (BOF)” method. The BOF approach was originally applied to document analysis. It gained great popularity in object classification from images and action recognition from video data due to its discriminative power. However, BOF only contains statistics of unordered “features” from the image sequences, and any information about temporal relations or spatial structure is ignored. In [15], Hamid et al. argued: “Generally activities are not fully defined by their event-content alone; however, there are preferred or typical event-orderings”.
To preserve the “ordering of events”, many methods have been proposed to add geometric and temporal information. Laptev et al. [14] added structural information by dividing a local spatio-temporal feature into different overlaid grids, and used a greedy approach to find the best set of grids. At the level of more primitive motions, Thurau et al. [16] introduced n-Grams of primitive-level motion features for action recognition. Hamid et al. [15] proposed an unsupervised method for detecting anomalous activities using bags of event n-grams, in which human activities are represented as overlapping n-Grams of actions. While overlapping n-Grams can preserve the temporal ordering of events, they cause the dimensionality of the space to grow exponentially as n increases.
We aim to maintain both global structural information and the ordering of local events for action recognition. Our method should incorporate both spatial structural information as in [14] and the ordering of events as in [15, 16], while avoiding the increased dimensionality of the n-Grams method. Inspired by the work on multiscale deformable part models [17], we propose a novel 3D multiscale part model for video classification. Our model includes both a coarse primitive-level ST feature word covering event-content statistics and higher resolution overlapping parts incorporating temporal relations. We extract the local 3D multiscale root-part features by dense sampling [12], and adopt the HOG3D [18] descriptor to represent the features. The k-means method is used to cluster the visual words. Finally, we apply the BOF method and a non-linear SVM for action classification.

This paper was funded by ORF-RE as a part of MUSES_SECRET project.
978-1-4577-0499-4/11/$26.00 ©2011 IEEE
The paper is organized as follows: the next section introduces our ST feature model. Section 3 describes the classification method. In Section 4, we present experimental results and analysis. The paper ends with a brief conclusion.
II. LOCAL SPATIO-TEMPORAL FEATURES
A. 3D multiscale part model
A model of a local spatio-temporal (ST) volume feature consists of a coarse global “root” model and several fine “part” models. The underlying building blocks for our models are densely sampled local ST patches represented by the HOG3D descriptor [18]. As shown in Fig. 1, for a video V_p with size 2W x 2H x 2T, we create a new video V_r with size W x H x 2T by down-sampling. We use a multiscale scanning-window approach to extract 3D local ST patches from V_r as coarse features for the “root” model. For every “root” model, a group of fine “part” models is extracted from the video V_p with respect to the location where the coarse patch serves as a reference position.
[Figure 1 here: the video pyramid and HOG3D feature pyramid, showing the root model extracted from the down-sampled video (W x H x 2T) and the overlapping spatial and temporal grids of the part model extracted from the full-resolution video (2W x 2H x 2T).]

Figure 1. Example of a feature defined with root model and overlapping grids of part model
B. Space-time features by dense sampling
We use dense sampling to extract local ST patches from the video at different scales and locations. Our method is similar to the approaches in [12, 18]. A sampling point is determined by 5 dimensions (x, y, t, σ, τ), where σ and τ are the spatial and temporal scale, and (x, y, t) is its space-time location in the video. For a 3D point s = (x_s, y_s, t_s, σ_s, τ_s), a feature can be computed for a local ST region with width w_s, height h_s and length l_s given by

    w_s = h_s = σ_0 · δ_s  and  l_s = τ_0 · δ_τ,

where σ_0 and τ_0 are the initial spatial and temporal scales, respectively, and δ_s and δ_τ are the spatial and temporal step factors for consecutive scales.
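As a concrete illustration, the patch-size rule above can be sketched in Python. This is a hypothetical helper, not the authors' code; the function and argument names are our own, and the √2 step factor is the value used later in the experiments:

```python
import math

def patch_size(sigma0, tau0, i, j, step=math.sqrt(2)):
    """Width/height and temporal length of a densely sampled ST patch
    at spatial scale step i and temporal scale step j:
    w = h = sigma0 * step**i,  l = tau0 * step**j."""
    w = h = sigma0 * step ** i
    l = tau0 * step ** j
    return w, h, l

# At the initial scale (i = j = 0) the patch is exactly sigma0 x sigma0 x tau0.
print(patch_size(12, 6, 0, 0))  # (12.0, 12.0, 6.0)
```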
For the “root” model, we use a similar dense sampling method as the approach in [12]. As stated above, we extract coarse features from the video V_r. In our experiments, we set the initial scales σ_0 and τ_0, and let δ_s = δ_τ = (√2)^i with i = 0, 1, 2, ..., K as the i-th step of the scales.

Given a “root” patch of size w_s x h_s x l_s at the location (x_s, y_s, t_s) of video V_r, we first extract a ST patch of size 2w_s x 2h_s x l_s at the location (2x_s, 2y_s, t_s) of the high-resolution video V_p. To construct a group of fine “part” models, this patch is then divided into a set of overlapping M_s x M_s x N_s sub-patches. In our experiments, neighboring sub-patches have a 50% overlapping area (see Fig. 1).
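The root-to-part mapping described above can be sketched as follows. This is an illustrative sketch, not the authors' code; in particular, the sub-patch size that yields a 50% overlap is our assumption that the stride equals half the sub-patch size, so that the grid tiles the root patch exactly:

```python
def part_subpatches(x, y, t, w, h, l, m=2, n=2):
    """Divide a root patch (doubled in the high-resolution video) into an
    m x m x n grid of overlapping sub-patches. With stride = half the
    sub-patch size, neighbouring sub-patches overlap by 50%."""
    # Root patch of size w x h x l at (x, y, t) in V_r maps to a patch of
    # size 2w x 2h x l at (2x, 2y, t) in the original video V_p.
    X, Y, W, H, L = 2 * x, 2 * y, 2 * w, 2 * h, l
    # Sub-patch size chosen so m half-overlapping patches span the extent:
    # size * (count + 1) / 2 = extent.
    sw, sh, sl = 2 * W / (m + 1), 2 * H / (m + 1), 2 * L / (n + 1)
    parts = []
    for i in range(m):
        for j in range(m):
            for k in range(n):
                parts.append((X + i * sw / 2, Y + j * sh / 2, t + k * sl / 2,
                              sw, sh, sl))
    return parts
```

With the paper's 2x2x2 configuration, a 12x12x6 root patch in V_r produces eight 16x16x4 sub-patches in V_p, each overlapping its neighbour by half.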
C. HOG3D descriptor
We use the HOG3D [18] descriptor to describe the local ST features. HOG3D is built upon 3D oriented gradients, and can be seen as an extension of the 2D SIFT [19] descriptor to video data. It is computed as follows:

- Spatio-temporal gradients are computed for each pixel of the video. The gradients are computed efficiently with the integral video method.
- A 3D patch is divided into a grid of M_c x M_c x N_c cells. Each cell is then divided into M_b x M_b x N_b blocks, and a mean 3D gradient is computed for each block.
- Each mean gradient is quantized using a regular polyhedron.
- For every cell, a 3D histogram of oriented gradients is obtained by summing the quantized mean gradients of all its blocks.
- The 3D histogram of oriented gradients for the 3D patch is formed by concatenating the gradient histograms of all cells.
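The steps above can be sketched in simplified form. This is not the actual HOG3D implementation: the real descriptor quantizes gradients against the face normals of a regular icosahedron or dodecahedron, while the sketch below uses the six axis directions and a hard nearest-direction assignment purely for illustration:

```python
import itertools, math

# Illustrative orientation set: the 6 axis directions stand in for the
# face normals of the regular polyhedron used by HOG3D.
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def quantize(grad, axes=AXES):
    """Assign a mean 3D gradient to the orientation it is closest to
    (largest dot product), weighted by the gradient magnitude."""
    mag = math.sqrt(sum(g * g for g in grad))
    if mag == 0:
        return None, 0.0
    best = max(range(len(axes)),
               key=lambda i: sum(a * g for a, g in zip(axes[i], grad)))
    return best, mag

def cell_histogram(block_mean_gradients, axes=AXES):
    """Histogram of oriented gradients for one cell: sum the quantized
    mean gradients of all its blocks."""
    hist = [0.0] * len(axes)
    for grad in block_mean_gradients:
        idx, mag = quantize(grad, axes)
        if idx is not None:
            hist[idx] += mag
    return hist

def patch_descriptor(cells):
    """Concatenate the gradient histograms of all cells of a 3D patch."""
    return list(itertools.chain.from_iterable(cell_histogram(c) for c in cells))
```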
Our model consists of a coarse root model and a group of higher resolution part models. The histograms of the root model and all part models are concatenated to create the histogram representation of a local ST feature. Both the coarse root model and the higher resolution part models are described by HOG3D descriptors, and thus retain the classification power of the HOG3D approaches [12, 18]. In addition, our local part model incorporates temporal ordering information by including local overlapping “events”, which provides more discriminative power for action recognition.
III. CLASSIFICATION
A. Bag-of-feature representation
We use the bag-of-feature method to represent a video sequence. Given a video sequence, a set of local spatio-temporal features is extracted and quantized into visual words. The classification is performed by measuring the frequency of visual-word occurrences in the video. This method requires a visual vocabulary (or codebook). To construct the vocabulary, local ST features are densely sampled from the training videos and described with HOG3D descriptors. The k-means [20] method is applied to cluster the features into k centers, and the vocabulary of k visual words is created from these centers. Each feature from a video sequence is assigned to the closest (Euclidean distance) word in the vocabulary, and a video sequence is represented as the histogram of visual-word occurrences.
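The assignment and histogram steps can be sketched as follows (vocabulary construction with k-means is omitted; the function names are illustrative, not from the authors' implementation):

```python
import math

def nearest_word(feature, vocabulary):
    """Index of the closest (Euclidean distance) visual word."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(feature, vocabulary[i]))

def bof_histogram(features, vocabulary):
    """Represent a video as a histogram of visual-word occurrences."""
    hist = [0] * len(vocabulary)
    for f in features:
        hist[nearest_word(f, vocabulary)] += 1
    return hist

# Toy 2-word vocabulary and three 2D "descriptors".
print(bof_histogram([(1, 1), (9, 8), (0, 2)], [(0, 0), (10, 10)]))  # [2, 1]
```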
B. SVM classifier
For classification, we use a non-linear support vector machine with an RBF (radial basis function) kernel:

    K(x_i, x_j) = exp(−γ ||x_i − x_j||²),  γ > 0.

We use the LIBSVM [21] library. In our experiments, the vector x_i is computed as the histogram of visual-word occurrences. Data scaling is performed on all training and testing samples. The best parameters for the SVM and kernel are found through a 10-fold cross-validation procedure on the training data. For multi-class classification, we use the one-against-rest approach and select the class with the highest score.
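A minimal sketch of the kernel and the one-against-rest decision (illustrative only; in practice LIBSVM evaluates the kernel internally, and the class names below are hypothetical):

```python
import math

def rbf_kernel(xi, xj, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq)

def one_against_rest(scores):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores, key=scores.get)

print(rbf_kernel([1, 0], [1, 0]))  # identical histograms -> 1.0
```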
IV. EXPERIMENTAL RESULTS
In order to evaluate our method with comparable results, we closely follow the dense sampling experiments of Wang et al. [12]. In summary, we use the proposed local part model to extract local ST features by dense sampling at multiple spatial and temporal scales. The ST features are described with the HOG3D descriptor and quantized into visual words using visual vocabulary built with k-means from training data. The bag-of-feature SVM approach is applied for the classification.
A. KTH action dataset
The KTH action dataset [7] is one of the most widely used datasets in evaluations of action recognition. It contains six classes of human actions: walking, jogging, running, boxing, hand waving and hand clapping, with 2391 sequences in total. The sequences have a spatial resolution of 160 x 120 pixels and a length of four seconds on average. They were recorded with 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. The background is homogeneous and static in most cases. We follow the experimental setup of [7, 12] by dividing the sequences into a training set (16 persons) and a test set (9 persons).
B. Parameter settings
We evaluate the overall recognition performance of different experiments based on various sampling and HOG3D parameters. These parameters are:
1) Dense sampling parameters: we first down-sample the original video by 2 into 80 x 60, with the temporal resolution unchanged. The “root” model is extracted from this video by dense sampling. The minimal (initial) spatial size of sampling is set to 12, and further scales are sampled with a scale factor of δ_s = √2, with up to 8 scales. For the temporal length, minimal sizes of 6 and 10 frames are evaluated, each combined with 2 and 3 sampling scales with a scale factor of δ_τ = √2, respectively. The overlapping rate for dense ST patches is 50%. As for the “part” models, their location and sampling parameters are determined by the “root” model, given that the sampling is performed in the original high-resolution video.
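Assuming the scale factor δ_s = √2 (consistent with [12]), the eight spatial scales starting from the minimal size of 12 pixels can be enumerated with a small helper (illustrative, not the authors' code):

```python
import math

def spatial_scales(sigma0=12, factor=math.sqrt(2), n_scales=8):
    """Spatial patch sizes for dense sampling: sigma0 * factor**i."""
    return [sigma0 * factor ** i for i in range(n_scales)]

# 8 scales from 12 up to 12 * (sqrt 2)^7 ≈ 135.8 pixels.
print([round(s, 1) for s in spatial_scales()])
```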
2) HOG3D parameters: different combinations of the following parameters are examined: number of histogram cells, number of blocks, and polyhedron type (icosahedron or dodecahedron). Other HOG3D parameters are set based on the optimization in [12, 18]: the orientation type as full orientation, the number of supporting mean gradients S = 3, and the cut-off value c = 0.25.
3) BOF visual vocabulary: the codebook size is evaluated with values of 2000, 3000 and 4000 words.
C. The results and comparison
In our experiments, we choose parameter settings that keep the computation tractable, mainly by limiting the vector size of visual words. The optimal parameter settings are: codebook size V = 4000; minimal patch size σ_0 = 12, τ_0 = 6; total sampling scales 8x8x3; number of histogram cells 2x2x2; polyhedron type dodecahedron (12); and number of parts per “root” model 2x2x2. The dimension for the “root” model is 2x2x2x12 = 96. The vector size of a feature is 96 x (1 (root) + 8 (parts)) = 864. We obtain an average accuracy of 91.77%.

We also observe a slightly decreased performance of 91.19% for a feature of dimensionality 540 with parameters: codebook size V = 4000; minimal patch size σ_0 = 12, τ_0 = 10; total sampling scales 8x8x2; number of histogram cells 1x1x3; polyhedron type icosahedron (20); and number of parts per “root” model 2x2x2. A performance of 90.26% is achieved when the dimension is reduced to 360 by changing the above parameters to: minimal patch size σ_0 = 12 and τ_0 = 6, total sampling scales 8x8x3, and number of histogram cells 1x1x2.

TABLE I. AVERAGE ACCURACY ON KTH DATASET
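The quoted vector sizes follow from the number of histogram cells, the number of polyhedron faces (orientation bins), and the nine models per feature (one root plus eight parts). A quick arithmetic check (illustrative helper, assuming 12- and 20-face quantization respectively):

```python
def descriptor_dim(cells, orientations, n_parts=8):
    """Vector size of one root-part feature: per-cell orientation bins
    times cells, for the root model plus all part models."""
    per_model = cells[0] * cells[1] * cells[2] * orientations
    return per_model * (1 + n_parts)

print(descriptor_dim((2, 2, 2), 12))  # 2*2*2*12 = 96 per model, *9 = 864
print(descriptor_dim((1, 1, 3), 20))  # 540
print(descriptor_dim((1, 1, 2), 20))  # 360
```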
We compare our method with the state of the art on the KTH dataset. Table 1 shows the comparison between the average class accuracy of our results and those reported in the evaluation framework of [12]. Compared to the other approaches that adopt dense sampling for feature extraction, our method achieves significantly better performance.
As for computational complexity, since the dense sampling for the “root” model is performed on the video at half spatial resolution, the total number of features is far smaller than in [12], where full spatial resolution is used. Although we still need to extract the part features in the original video, the total number of features is determined solely by the “root” model, and the computation of the part feature descriptors is handled efficiently with integral video.
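The integral video mentioned above is the 3D analogue of the integral image: after one cumulative-sum pass, the sum over any sub-volume is obtained in constant time by inclusion-exclusion. A sketch, assuming the standard formulation (illustrative, not the authors' implementation):

```python
def integral_video(vol):
    """3D cumulative sum with one extra zero plane per dimension, so that
    box sums need no boundary checks. vol is indexed as vol[t][y][x]."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    I = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                # 3D inclusion-exclusion update.
                I[t+1][y+1][x+1] = (vol[t][y][x]
                                    + I[t][y+1][x+1] + I[t+1][y][x+1]
                                    + I[t+1][y+1][x] - I[t][y][x+1]
                                    - I[t][y+1][x] - I[t+1][y][x]
                                    + I[t][y][x])
    return I

def box_sum(I, t0, y0, x0, t1, y1, x1):
    """Sum of the sub-volume [t0..t1) x [y0..y1) x [x0..x1) in O(1)."""
    return (I[t1][y1][x1] - I[t0][y1][x1] - I[t1][y0][x1] - I[t1][y1][x0]
            + I[t0][y0][x1] + I[t0][y1][x0] + I[t1][y0][x0] - I[t0][y0][x0])
```

Building the structure costs one pass over the video; every box sum thereafter (e.g. one block of a HOG3D gradient histogram) costs eight lookups regardless of the patch size.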
V. CONCLUSION
This paper has introduced a novel local part model for the task of action recognition. We aim at solving the order-less nature of the bag-of-feature approach by adding higher resolution overlapping parts that incorporate the ordering of events. The experimental results show that our method outperforms the state of the art on the KTH action dataset. Future work includes experiments on other popular action datasets. Furthermore, instead of limiting the vector size of visual words, we also plan to perform experiments on a larger variety of settings to find optimal parameters.
REFERENCES
[1] R. Poppe, “A Survey on vision-based human action recognition,” Image
and Vision Computing, pp.976-990, 2010
[2] A. Bobick and J. Davis, “The recognition of human movement using
temporal templates,” IEEE Trans. Pattern Recognit. Machine Intell., vol. 23, pp. 257–267, Mar. 2001.
[3] L. Wang and D. Suter, “Informative shape representations for human
action recognition,” International Conference on Pattern Recognition (ICPR’06), vol. 2, pp.1266–1269, 2006.
[4] A. Efros, A. Berg, G. Mori, and J. Malik, “Recognizing action at a distance,” ICCV, pp. 726–733, 2003.
[5] W. L. Lu and J. Little, “Simultaneous tracking and action recognition using the pca-hog descriptor,” CRV’06, 2006.
[6] S. Carlsson and J. Sullivan, “Action recognition by shape matching to
key frame,” Workshop on Models versus Exemplars in Computer Vision, 2001.
[7] C. Schüldt, I. Laptev and B. Caputo, “Recognizing human actions: a local SVM approach,” ICPR'04, pp.32-36, 2004
[8] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatiotemporal features,” VS-PETS’05, 2005.
[9] G. Willems, T. Tuytelaars, and L. Van Gool, “An efficient dense and scale-invariant spatiotemporal interest point detector,” ECCV’08, 2008.
[10] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” Computer Vision and Pattern Recognition (CVPR’05), pp. I: 886–893, 2005.
[11] F. Jurie and B. Triggs, “Creating efficient codebooks for visual recognition,” ICCV’05, 2005.
[12] H. Wang, M. Ullah, A.Kläser, I. Laptev and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” BMVC’09, 2009
[13] J. C. Niebles and L. Fei-Fei, “A hierarchical model of shape and
appearance for human action classification,” Computer Vision and Pattern Recognition (CVPR’07), pp. 1 – 8, 2007.
[14] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” CVPR’08, 2008.
[15] R. Hamid, A. Johnson, S. Batta, A. Bobick, C. Isbell and G. Coleman:
“Detection and explanation of anomalous activities: Representing
activities as bags of event n-grams,” Computer Vision and Pattern Recognition (CVPR’05) Vol. 1, pp. 1031–1038, 2005.
[16] C. Thurau and V. Hlavac: “Pose primitive based human action
recognition in videos or still images,” Computer Vision and Pattern Recognition (CVPR’08), pp. 1–8, 2008.
[17] P. Felzenszwalb, D.McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” CVPR’08, 2008.
[18] A. Kläser, M. Marszałek and C. Schmid, “A spatio-temporal descriptor
based on 3d-gradients,” British Machine Vision Conference (BMVC’08), pp.995–1004, 2008.
[19] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, pp. 91–110, 2004.
[20] D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful
seeding,” The eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035, 2007.
[21] C. Chang and C. Lin, “LIBSVM : a library for support vector machines,”
ACM Transactions on Intelligent Systems and Technology, pp. 1–27, 2011.
Method          Average accuracy
HOG/HOF [3]     86.1%
HOG [3]         79.0%
HOF [3]         88.0%
HOG3D [18]      85.3%
ours            91.77%