2011 IEEE International Workshop on Haptic Audio Visual Environments and Games (HAVE 2011)
Human Action Recognition from Local Part Model
Feng Shi, Emil M. Petriu, and Albino Cordeiro
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa
Ottawa, Canada
Abstract—In this paper, we present a part model for human action recognition from video. We use the 3D HOG descriptor and a bag-of-feature representation for video. To overcome the loss of event ordering in the bag-of-feature approach, we propose a novel multiscale local part model that preserves temporal context. Our method builds upon several recent ideas, including dense sampling, local spatio-temporal (ST) features, the 3D HOG descriptor, the BOF representation and non-linear SVMs. Preliminary results on the KTH action dataset show a higher recognition rate than recent studies.
Keywords—action recognition; 3D HOG descriptor; bag-of-feature (BOF); local spatio-temporal (ST) features; part model
I. INTRODUCTION
Recognizing human action from image sequences is one of the most challenging problems in computer vision, with many important applications such as intelligent video surveillance, content-based video retrieval, human-robot interaction, and smart homes. The task is difficult not only due to intra-class variations, camera movements, background clutter and partial occlusion, but also due to inter-class overlaps and similarities, such as running vs. jogging or walking.
There has been great progress recently in solving many of the problems described above in the field of human action recognition. According to a recent survey [1], the action recognition methods in the literature can be divided into two categories: global representations and local representations. The former treats a person in the image as a whole and adopts explicit shape models. It often includes the following steps: (a) detecting a region of interest around the person, either by background subtraction [2, 3] or tracking [4, 5]; (b) computing image features of the detected human, such as silhouettes [2, 3], edges [6], contours [3] or optical flow [4]; (c) comparing the computed features with pre-defined parametric [3, 4] or template [5, 6] models to perform the classification.
Earlier works on human action recognition in video often used global representations. These approaches can achieve good results in simplified and controlled settings, such as static backgrounds, a single subject and fully visible humans. However, they depend heavily on background subtraction or object tracking, and lack discriminative power for action classification on realistic, noisy video data.
Local representation methods usually extract smaller local space-time interest regions from the video. These local patches are either sampled densely or chosen by spatio-temporal
interest points. In [7], Schüldt et al. extended the Harris corner detector and automatic scale selection to 3D to detect salient sparse spatio-temporal features. To address the issue of relatively sparse detected salient features, Dollár et al. [8] used a pair of 1D Gabor filters both on spatial and temporal dimension, which produced denser interest points. Built on 2D Hessian detector, Hessian3D [9] was proposed by Willems et al. to detect dense, scale-invariant, and computationally efficient salient spatial-temporal points. Motivated by excellent object recognition results [10, 11] obtained by dense sampling, Wang et al. [12] evaluated recent local spatio-temporal features [7, 8, 9], and found: “dense sampling consistently outperforms all tested interest point detectors in realistic video setting”.
Recent studies [8, 9, 13, 14] have shown that local features can achieve remarkable performance when represented by the popular “bag-of-feature (BOF)” method. The BOF approach was originally applied to document analysis. It gained great popularity in object classification from images and action recognition from video data due to its discriminative power. However, BOF only contains statistics of unordered “features” from the image sequences, and any information about temporal relations or spatial structure is ignored. In [15], Hamid et al. argued: “Generally activities are not fully defined by their event-content alone; however, there are preferred or typical event-orderings”.
To preserve the “ordering of events”, many methods have been proposed to add geometric and temporal information. Laptev et al. [14] added structural information by dividing a local spatio-temporal feature into different overlaid grids, and used a greedy approach to find the best set of grids. At the level of more primitive motions, Thurau et al. [16] introduced n-Grams of primitive-level motion features for action recognition. Hamid et al. [15] proposed an unsupervised method for detecting anomalous activities using bags of event n-grams, in which human activities are represented as overlapping n-Grams of actions. While overlapping n-Grams can preserve the temporal ordering of events, they cause the dimensionality of the space to grow exponentially as n increases.
We aim to maintain both global structural information and the ordering of local events for action recognition. Our method should incorporate both spatial structural information as in [14] and the ordering of events as in [15, 16], while avoiding the increased dimensionality of the n-Grams method. Inspired by the work on multiscale deformable part models [17], we propose a novel 3D multiscale part model for video classification. Our model includes both a coarse primitive-level ST feature word covering event-content statistics and higher resolution overlapping parts incorporating temporal relations. We extract the local 3D multiscale root-part features by dense sampling [12], and adopt the HOG3D [18] descriptor to represent the features. The k-means method is used to cluster the visual words. Finally, we apply the BOF method and a non-linear SVM for action classification.

This paper was funded by ORF-RE as a part of MUSES_SECRET project.
978-1-4577-0499-4/11/$26.00 ©2011 IEEE
The paper is organized as follows: the next section introduces our ST feature model. Section 3 describes the classification method. In Section 4, we present experimental results and analysis. The paper ends with a brief conclusion.
II. LOCAL SPATIO-TEMPORAL FEATURES
A. 3D multiscale part model
A model of a local spatio-temporal (ST) volume feature consists of a coarse global “root” model and several fine “part” models. The underlying building blocks for our models are densely sampled local ST patches represented by the HOG3D descriptor [18]. As shown in Fig. 1, for a video V_p with size 2W x 2H x 2T, we create a new video V_r with size W x H x 2T by down-sampling. We use a multiscale scanning-window approach to extract 3D local ST patches from V_r as coarse features for the “root” model. For every “root” model, a group of fine “part” models is extracted from the video V_p with respect to the location where the coarse patch serves as a reference position.
[Figure 1 here: the video pyramid and HOG3D feature pyramid, showing the root model extracted from the down-sampled video (W x H x 2T) and the overlapping spatial and temporal grids of the part model extracted from the full-resolution video (2W x 2H x 2T).]

Figure 1. Example of a feature defined with root model and overlapping grids of part model
B. Space-time features by dense sampling
We use dense sampling to extract local ST patches from the video at different scales and locations. Our method is similar to the approaches in [12, 18]. A sampling point is determined by 5 dimensions (x, y, t, σ, τ), where σ and τ are the spatial and temporal scale, and (x, y, t) is its space-time location in the video. For a 3D point s = (x_s, y_s, t_s, σ_s, τ_s), a feature can be computed for a local ST region with width w_s, height h_s and length l_s given by

    w_s = h_s = σ_0 · δ_s  and  l_s = τ_0 · δ_τ,

where σ_0 and τ_0 are the initial spatial and temporal scales, respectively, and δ_s and δ_τ are the spatial and temporal step factors for consecutive scales.
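As a concrete illustration, the patch-size rule above can be sketched in Python. This is a hypothetical helper, not the authors' code; the function and argument names are our own, and the √2 step factor is the value used later in the experiments:

```python
import math

def patch_size(sigma0, tau0, i, j, step=math.sqrt(2)):
    """Width/height and temporal length of a densely sampled ST patch
    at spatial scale step i and temporal scale step j:
    w = h = sigma0 * step**i,  l = tau0 * step**j."""
    w = h = sigma0 * step ** i
    l = tau0 * step ** j
    return w, h, l

# At the initial scale (i = j = 0) the patch is exactly sigma0 x sigma0 x tau0.
print(patch_size(12, 6, 0, 0))  # (12.0, 12.0, 6.0)
```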
For the “root” model, we use a similar dense sampling method as the approach in [12]. As stated above, we extract coarse features from the video V_r. In our experiments, we set the initial scales σ_0 and τ_0, and let δ_s = δ_τ = (√2)^i with i = 0, 1, 2, ..., K as the i-th step of the scales.

Given a “root” patch of size w_s x h_s x l_s at the location (x_s, y_s, t_s) of video V_r, we first extract a ST patch of size 2w_s x 2h_s x l_s at the location (2x_s, 2y_s, t_s) of the high-resolution video V_p. To construct a group of fine “part” models, this patch is then divided into a set of overlapping M_s x M_s x N_s sub-patches. In our experiments, neighboring sub-patches have a 50% overlapping area (see Fig. 1).
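The root-to-part mapping described above can be sketched as follows. This is an illustrative sketch, not the authors' code; in particular, the sub-patch size that yields a 50% overlap is our assumption that the stride equals half the sub-patch size, so that the grid tiles the root patch exactly:

```python
def part_subpatches(x, y, t, w, h, l, m=2, n=2):
    """Divide a root patch (doubled in the high-resolution video) into an
    m x m x n grid of overlapping sub-patches. With stride = half the
    sub-patch size, neighbouring sub-patches overlap by 50%."""
    # Root patch of size w x h x l at (x, y, t) in V_r maps to a patch of
    # size 2w x 2h x l at (2x, 2y, t) in the original video V_p.
    X, Y, W, H, L = 2 * x, 2 * y, 2 * w, 2 * h, l
    # Sub-patch size chosen so m half-overlapping patches span the extent:
    # size * (count + 1) / 2 = extent.
    sw, sh, sl = 2 * W / (m + 1), 2 * H / (m + 1), 2 * L / (n + 1)
    parts = []
    for i in range(m):
        for j in range(m):
            for k in range(n):
                parts.append((X + i * sw / 2, Y + j * sh / 2, t + k * sl / 2,
                              sw, sh, sl))
    return parts
```

With the paper's 2x2x2 configuration, a 12x12x6 root patch in V_r produces eight 16x16x4 sub-patches in V_p, each overlapping its neighbour by half.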
C. HOG3D descriptor
We use the HOG3D [18] descriptor to describe the local ST features. HOG3D is built upon 3D oriented gradients, and can be seen as an extension of the 2D SIFT [19] descriptor to video data. It is computed as follows:

- Spatio-temporal gradients are computed for each pixel of the video. The gradients are computed efficiently with the integral video method.
- A 3D patch is divided into a grid of M_c x M_c x N_c cells. Each cell is then divided into M_b x M_b x N_b blocks, and a mean 3D gradient is computed for each block.
- Each mean gradient is quantized using a regular polyhedron.
- For every cell, a 3D histogram of oriented gradients is obtained by summing the quantized mean gradients of all its blocks.
- The 3D histogram of oriented gradients for the 3D patch is formed by concatenating the gradient histograms of all cells.
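The steps above can be sketched in simplified form. This is not the actual HOG3D implementation: the real descriptor quantizes gradients against the face normals of a regular icosahedron or dodecahedron, while the sketch below uses the six axis directions and a hard nearest-direction assignment purely for illustration:

```python
import itertools, math

# Illustrative orientation set: the 6 axis directions stand in for the
# face normals of the regular polyhedron used by HOG3D.
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def quantize(grad, axes=AXES):
    """Assign a mean 3D gradient to the orientation it is closest to
    (largest dot product), weighted by the gradient magnitude."""
    mag = math.sqrt(sum(g * g for g in grad))
    if mag == 0:
        return None, 0.0
    best = max(range(len(axes)),
               key=lambda i: sum(a * g for a, g in zip(axes[i], grad)))
    return best, mag

def cell_histogram(block_mean_gradients, axes=AXES):
    """Histogram of oriented gradients for one cell: sum the quantized
    mean gradients of all its blocks."""
    hist = [0.0] * len(axes)
    for grad in block_mean_gradients:
        idx, mag = quantize(grad, axes)
        if idx is not None:
            hist[idx] += mag
    return hist

def patch_descriptor(cells):
    """Concatenate the gradient histograms of all cells of a 3D patch."""
    return list(itertools.chain.from_iterable(cell_histogram(c) for c in cells))
```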
Our model consists of a coarse root model and a group of higher resolution part models. The histograms of the root model and all part models are concatenated to create the histogram representation of a local ST feature. Both the coarse root model and the higher resolution part models are described by HOG3D descriptors, and thus retain the classification power of the HOG3D approaches [12, 18]. In addition, our local part model incorporates temporal ordering information by including local overlapping “events”, which provides more discriminative power for action recognition.
III. CLASSIFICATION
A. Bag-of-feature representation
We use the bag-of-feature method to represent a video sequence. Given a video sequence, a set of local spatio-temporal features is extracted and quantized into visual words. The classification is performed by measuring the frequency of visual-word occurrences in the video. This method requires a visual vocabulary (or codebook). To construct the vocabulary, local ST features are densely sampled from the training videos and described with HOG3D descriptors. The k-means [20] method is applied to cluster the features into k centers, and the vocabulary of k visual words is created from these centers. Each feature from a video sequence is assigned to the closest (Euclidean distance) word in the vocabulary, and a video sequence is represented as the histogram of visual-word occurrences.
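The assignment and histogram steps can be sketched as follows (vocabulary construction with k-means is omitted; the function names are illustrative, not from the authors' implementation):

```python
import math

def nearest_word(feature, vocabulary):
    """Index of the closest (Euclidean distance) visual word."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(feature, vocabulary[i]))

def bof_histogram(features, vocabulary):
    """Represent a video as a histogram of visual-word occurrences."""
    hist = [0] * len(vocabulary)
    for f in features:
        hist[nearest_word(f, vocabulary)] += 1
    return hist

# Toy 2-word vocabulary and three 2D "descriptors".
print(bof_histogram([(1, 1), (9, 8), (0, 2)], [(0, 0), (10, 10)]))  # [2, 1]
```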
B. SVM classifier
For classification, we use a non-linear support vector machine with an RBF (radial basis function) kernel:

    K(x_i, x_j) = exp(−γ ||x_i − x_j||²),  γ > 0.

We use the LIBSVM [21] library. In our experiments, the vector x_i is computed as the histogram of visual-word occurrences. Data scaling is performed on all training and testing samples. The best parameters for the SVM and kernel are found through a 10-fold cross-validation procedure on the training data. For multi-class classification, we use the one-against-rest approach and select the class with the highest score.
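A minimal sketch of the kernel and the one-against-rest decision (illustrative only; in practice LIBSVM evaluates the kernel internally, and the class names below are hypothetical):

```python
import math

def rbf_kernel(xi, xj, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq)

def one_against_rest(scores):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores, key=scores.get)

print(rbf_kernel([1, 0], [1, 0]))  # identical histograms -> 1.0
```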
IV. EXPERIMENTAL RESULTS
In order to evaluate our method with comparable results, we closely follow the dense sampling experiments of Wang et al. [12]. In summary, we use the proposed local part model to extract local ST features by dense sampling at multiple spatial and temporal scales. The ST features are described with the HOG3D descriptor and quantized into visual words using visual vocabulary built with k-means from training data. The bag-of-feature SVM approach is applied for the classification.
A. KTH action dataset
The KTH action dataset [7] is one of the most widely used datasets in evaluations of action recognition. It contains six classes of human actions: walking, jogging, running, boxing, hand waving and hand clapping, with 2391 sequences in total. The sequences have a spatial resolution of 160 x 120 pixels and a length of four seconds on average. They were recorded with 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. The background is homogeneous and static in most cases. We follow the experimental setup of [7, 12] by dividing the sequences into a training set (16 persons) and a test set (9 persons).
B. Parameter settings
We evaluate the overall recognition performance of different experiments based on various sampling and HOG3D parameters. These parameters are:
1) Dense sampling parameters: we first down-sample the original video by 2 into 80 x 60, with the temporal resolution unchanged. The “root” model is extracted from this video by dense sampling. The minimal (initial) spatial size of sampling is set to 12, and further scales are sampled with a scale factor of δ_s = √2, with up to 8 scales. For the temporal length, minimal sizes of 6 and 10 frames are evaluated, each combined with 2 and 3 sampling scales with a scale factor of δ_τ = √2, respectively. The overlapping rate for dense ST patches is 50%. As for the “part” models, their location and sampling parameters are determined by the “root” model, given that the sampling is performed in the original high-resolution video.
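Assuming the scale factor δ_s = √2 (consistent with [12]), the eight spatial scales starting from the minimal size of 12 pixels can be enumerated with a small helper (illustrative, not the authors' code):

```python
import math

def spatial_scales(sigma0=12, factor=math.sqrt(2), n_scales=8):
    """Spatial patch sizes for dense sampling: sigma0 * factor**i."""
    return [sigma0 * factor ** i for i in range(n_scales)]

# 8 scales from 12 up to 12 * (sqrt 2)^7 ≈ 135.8 pixels.
print([round(s, 1) for s in spatial_scales()])
```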
2) HOG3D parameters: different combinations of the following parameters are examined: number of histogram cells, number of blocks, and polyhedron type (icosahedron or dodecahedron). Other HOG3D parameters are set based on the optimization in [12, 18]: the orientation type as full orientation, the number of supporting mean gradients S = 3, and the cut-off value c = 0.25.
3) BOF visual vocabulary: the codebook size is evaluated with values of 2000, 3000 and 4000 words.
C. The results and comparison
In our experiments, we choose parameter settings that keep the computation tractable, mainly by limiting the vector size of visual words. The optimal parameter settings are: codebook size V = 4000; minimal patch size σ_0 = 12, τ_0 = 6; total sampling scales 8x8x3; number of histogram cells 2x2x2; polyhedron type dodecahedron (12); and number of parts per “root” model 2x2x2. The dimension for the “root” model is 2x2x2x12 = 96. The vector size of a feature is 96 x (1 (root) + 8 (parts)) = 864. We obtain an average accuracy of 91.77%.

We also observe a slightly decreased performance of 91.19% for a feature of dimensionality 540 with parameters: codebook size V = 4000; minimal patch size σ_0 = 12, τ_0 = 10; total sampling scales 8x8x2; number of histogram cells 1x1x3; polyhedron type icosahedron (20); and number of parts per “root” model 2x2x2. A performance of 90.26% is achieved when the dimension is reduced to 360 by changing the above parameters to: minimal patch size σ_0 = 12 and τ_0 = 6, total sampling scales 8x8x3, and number of histogram cells 1x1x2.

TABLE I. AVERAGE ACCURACY ON KTH DATASET
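The quoted vector sizes follow from the number of histogram cells, the number of polyhedron faces (orientation bins), and the nine models per feature (one root plus eight parts). A quick arithmetic check (illustrative helper, assuming 12- and 20-face quantization respectively):

```python
def descriptor_dim(cells, orientations, n_parts=8):
    """Vector size of one root-part feature: per-cell orientation bins
    times cells, for the root model plus all part models."""
    per_model = cells[0] * cells[1] * cells[2] * orientations
    return per_model * (1 + n_parts)

print(descriptor_dim((2, 2, 2), 12))  # 2*2*2*12 = 96 per model, *9 = 864
print(descriptor_dim((1, 1, 3), 20))  # 540
print(descriptor_dim((1, 1, 2), 20))  # 360
```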
We compare our method with the state of the art on the KTH dataset. Table 1 shows the comparison between the average class accuracy of our results and those reported in the evaluation framework of [12]. Compared to the other approaches that adopt dense sampling for feature extraction, our method achieves significantly better performance.
As for computational complexity, since the dense sampling for the “root” model is performed on the video at half spatial resolution, the total number of features is far smaller than in [12], where full spatial resolution is used. Although we still need to extract the part features in the original video, the total number of features is determined solely by the “root” model, and the computation of the part feature descriptors is handled efficiently with integral video.
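The integral video mentioned above is the 3D analogue of the integral image: after one cumulative-sum pass, the sum over any sub-volume is obtained in constant time by inclusion-exclusion. A sketch, assuming the standard formulation (illustrative, not the authors' implementation):

```python
def integral_video(vol):
    """3D cumulative sum with one extra zero plane per dimension, so that
    box sums need no boundary checks. vol is indexed as vol[t][y][x]."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    I = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                # 3D inclusion-exclusion update.
                I[t+1][y+1][x+1] = (vol[t][y][x]
                                    + I[t][y+1][x+1] + I[t+1][y][x+1]
                                    + I[t+1][y+1][x] - I[t][y][x+1]
                                    - I[t][y+1][x] - I[t+1][y][x]
                                    + I[t][y][x])
    return I

def box_sum(I, t0, y0, x0, t1, y1, x1):
    """Sum of the sub-volume [t0..t1) x [y0..y1) x [x0..x1) in O(1)."""
    return (I[t1][y1][x1] - I[t0][y1][x1] - I[t1][y0][x1] - I[t1][y1][x0]
            + I[t0][y0][x1] + I[t0][y1][x0] + I[t1][y0][x0] - I[t0][y0][x0])
```

Building the structure costs one pass over the video; every box sum thereafter (e.g. one block of a HOG3D gradient histogram) costs eight lookups regardless of the patch size.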
V. CONCLUSION
This paper has introduced a novel local part model for the task of action recognition. We aim at solving the order-less nature of the bag-of-feature approach by adding higher resolution overlapping parts that incorporate the ordering of events. The experimental results show that our method outperforms the state of the art on the KTH action dataset. Future work includes experiments on other popular action datasets. Furthermore, instead of limiting the vector size of visual words, we also plan to perform experiments on a larger variety of settings to find optimal parameters.
REFERENCES
[1] R. Poppe, “A Survey on vision-based human action recognition,” Image
and Vision Computing, pp.976-990, 2010
[2] A. Bobick and J. Davis, “The recognition of human movement using
temporal templates,” IEEE Trans. Pattern Recognit. Machine Intell., vol. 23, pp. 257–267, Mar. 2001.
[3] L. Wang and D. Suter, “Informative shape representations for human
action recognition,” International Conference on Pattern Recognition (ICPR’06), vol. 2, pp.1266–1269, 2006.
[4] A. Efros, A. Berg, G. Mori, and J. Malik, “Recognizing action at a distance,” ICCV, pp. 726–733, 2003.
[5] W. L. Lu and J. Little, “Simultaneous tracking and action recognition using the pca-hog descriptor,” CRV’06, 2006.
[6] S. Carlsson and J. Sullivan, “Action recognition by shape matching to
key frame,” Workshop on Models versus Exemplars in Computer Vision, 2001.
[7] C. Schüldt, I. Laptev and B. Caputo, “Recognizing human actions: a local SVM approach,” ICPR'04, pp.32-36, 2004
[8] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatiotemporal features,” VS-PETS’05, 2005.
[9] G. Willems, T. Tuytelaars, and L. Van Gool, “An efficient dense and scale-invariant spatiotemporal interest point detector,” ECCV’08, 2008.
[10] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” Computer Vision and Pattern Recognition (CVPR’05), pp. I: 886–893, 2005.
[11] F. Jurie and B. Triggs, “Creating efficient codebooks for visual recognition,” ICCV’05, 2005.
[12] H. Wang, M. Ullah, A.Kläser, I. Laptev and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” BMVC’09, 2009
[13] J. C. Niebles and L. Fei-Fei, “A hierarchical model of shape and
appearance for human action classification,” Computer Vision and Pattern Recognition (CVPR’07), pp. 1 – 8, 2007.
[14] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” CVPR’08, 2008.
[15] R. Hamid, A. Johnson, S. Batta, A. Bobick, C. Isbell and G. Coleman:
“Detection and explanation of anomalous activities: Representing
activities as bags of event n-grams,” Computer Vision and Pattern Recognition (CVPR’05) Vol. 1, pp. 1031–1038, 2005.
[16] C. Thurau and V. Hlavac: “Pose primitive based human action
recognition in videos or still images,” Computer Vision and Pattern Recognition (CVPR’08), pp. 1–8, 2008.
[17] P. Felzenszwalb, D.McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” CVPR’08, 2008.
[18] A. Kläser, M. Marszałek and C. Schmid, “A spatio-temporal descriptor
based on 3d-gradients,” British Machine Vision Conference (BMVC’08), pp.995–1004, 2008.
[19] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, pp. 91–110, 2004.
[20] D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful
seeding,” The eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035, 2007.
[21] C. Chang and C. Lin, “LIBSVM : a library for support vector machines,”
ACM Transactions on Intelligent Systems and Technology, pp. 1–27, 2011.
Method          Average accuracy
HOG/HOF [3]     86.1%
HOG [3]         79.0%
HOF [3]         88.0%
HOG3D [18]      85.3%
ours            91.77%