
Real Time Action Recognition Using Histograms of Depth Gradients and Random Decision Forests

Hossein Rahmani, Arif Mahmood, Ajmal Mian, Du Huynh
The School of Computer Science and Software Engineering, The University of Western Australia, Crawley, WA, Australia
[email protected], {arif.mahmood, ajmal.mian, du.huynh}@uwa.edu.au

Abstract

We propose an algorithm which combines the discriminative information from depth images as well as from 3D joint positions to achieve high action recognition accuracy. To avoid the suppression of subtle discriminative information and also to handle local occlusions, we compute a vector of many independent local features. Each feature encodes spatiotemporal variations of depth and depth gradients at a specific space-time location in the action volume. Moreover, we encode the dominant skeleton movements by computing a local 3D joint position difference histogram. For each joint, we compute a 3D space-time motion volume which we use as an importance indicator and incorporate in the feature vector for improved action discrimination. To retain only the discriminant features, we train a Random Decision Forest (RDF). The proposed algorithm is evaluated on three standard datasets and compared with nine state of the art algorithms. Experimental results show that, on average, the proposed algorithm outperforms all other algorithms in accuracy and has a processing speed of over 112 frames/second.

1. Introduction

Automatic human action recognition in videos is a significant research problem with wide applications in various fields such as smart surveillance and monitoring, health and medicine, sports and recreation. Human action recognition is a challenging problem mainly because significant intra-action variations exist due to the differences in viewpoints, visual appearance such as color, texture and shape of clothing, scale variations including the performer's body size and distance from the camera, and variations in the speed of action performance. Some of these challenges have recently been simplified by the availability of real time depth cameras, such as the Microsoft Kinect depth sensor, which offer many advantages over conventional RGB cameras. Depth sensors are color and texture invariant and have eased the task of object segmentation and background subtraction. Moreover, depth images are not sensitive to illumination

Figure 1. (a) 4D spatiotemporal depth and depth-gradient histograms for each space-time subvolume are mapped to the line of real numbers using a one-to-one mapping function. (b) 3D skeleton displacement vectors are extracted from each frame. (c) Features are extracted from the joint movement volume of each joint.

and can be captured in complete darkness using active sensors like Kinect.

In RGB images, pixels indicate the three color reflectance intensities of a scene, whereas in depth images, pixels represent the calibrated depth values. Although depth images help overcome the problems induced by clothing color and texture variation, many challenges still remain, such as variations in scale due to body size and distance from the sensor, the non-rigid shape of the clothing, and variations in the style and speed of actions. To handle scale and orientation variations, different types of normalizations have been proposed in the literature [6, 5, 3, 12, 15, 7]. Intra-action variations are catered for by information integration at coarser levels, for example, by computing 2D silhouettes [6, 3, 5] or depth motion maps [15]. However, coarse information integration may result in a loss of classification accuracy due to the suppression of discriminative information. Moreover, these algorithms must handle large amounts of data and, therefore, have high computational



Figure 2. Skeleton plots for the first 15 actions in the MSR Action 3D dataset [6, 13]. For some actions, skeletons are discriminative, while some skeletons are subsets of others.

complexity. Some other techniques have tried to overcome these challenges by mainly depending on the 3D joint positions [14, 13]. However, joint position estimation may be inaccurate in the presence of occlusions [13]. Furthermore, in some applications, such as hand gesture recognition, where joint positions are not available, these approaches cannot be adopted.

In this paper, we propose a new action classification algorithm by integrating information from both the depth images and the 3D joint positions. We normalize the inter-subject temporal and spatial variations by dividing the depth sequence into a fixed number of small subvolumes. In each subvolume we encode the depth and the depth gradient variations by making histograms. To preserve the spatial and temporal locality of each bin, we map it to a fixed position on the real line by using an invertible mapping function. Thus we obtain local information integration in a spatiotemporal feature descriptor.

Temporal variations of 3D joint positions (Fig. 2) also contain discriminative information of actions; however, they may be noisy and sometimes incorrect or unavailable. We handle these problems by capturing the dominant skeleton movements by computing local joint position difference histograms. As the joint position estimation errors scatter over many histogram bins, their effect is suppressed. Also, the joints that are more relevant to a certain action exhibit larger motions and sweep larger volumes compared to non-relevant joints (Fig. 2). For each joint, we compute a 3D space-time motion volume (Fig. 1c) which is used as an importance indicator and incorporated in the feature vector for improved action discrimination. To give more importance to the joint motion volume, we divide it into a fixed number of cells and encode the local depth variations for each cell. Note that these features are not strictly based on the joint trajectories and only require the joint extreme positions, which are robustly estimated. Joint position inaccuracies due to occlusions between these extreme positions do not affect the quality of the proposed feature.

We evaluate the proposed algorithm on three standard datasets [12, 5, 6, 13] and compare it with nine state of the art methods [12, 13, 7, 6, 14, 11, 2, 4, 5]. The proposed algorithm outperforms all existing methods while achieving a processing speed of over 112 frames/sec.

1.1. Contributions

1. The proposed algorithm combines the discriminative information from depth images as well as from 3D joint positions to achieve high action recognition accuracy. However, unlike [13, 14], the proposed algorithm is also applicable when joint positions are not available.

2. To avoid the suppression of subtle discriminative information, we perform local information integration and normalization, in contrast to previous techniques which perform global integration and normalization. Our feature is a collection of many independent local features. Consequently, it has a better chance of success in the presence of occlusions.

3. Encoding joint importance by using the joint motion volume is an innovative idea and, to the best of our knowledge, has not been done before.

4. To retain only the discriminant features, we train a random decision forest (RDF). Smaller feature dimensionality leads to fast feature extraction and subsequent classification. As a result, we obtain a significant speedup.

2. Related Work

Action recognition using depth images has become an active research area after the release of the Microsoft Kinect depth sensor. Real time computation of 3D joint positions [9] from the depth images has facilitated the application of skeleton based action recognition techniques. In this context, depth based action recognition research can be broadly divided into two categories, namely, algorithms mainly using depth images [6, 5, 3, 12, 15, 7] and others mainly using joint positions [14, 13].

Some algorithms in the first category have exploited silhouette and edge pixels as discriminative features. For example, Li et al. [6] sampled boundary pixels from 2D silhouettes as a bag of features. Yang et al. [15] added temporal derivatives of 2D projections to get Depth Motion Maps (DMM). Vieira et al. [11] computed silhouettes in 3D by using space-time occupancy patterns. The spatio-temporal depth volume was divided into cells. A filled cell was represented by 1, an empty cell by 0 and a partially-filled cell by a fraction. An ad hoc parameter was used to differentiate between fully and partially filled cells. No automatic mechanism was proposed for the optimal selection of this parameter. Instead of these very simple occupancy features, Wang et al. [12] computed a vector of 8 Haar features on a uniform grid in the 4D volume. LDA was used to detect the discriminative feature positions and an SVM classifier was used for action classification. Both of these techniques [11, 12] have high computational complexity.


Figure 3. Overview of the proposed algorithm: space-time volume features, joint displacement histograms, and joint motion volume features are combined into a single feature vector; a randomized decision forest prunes the features and a second randomized decision forest classifies the discriminative features into action classes. If 3D joint positions are not available, then only depth and depth gradient features are used. Two different types of decision forests are trained. The first type is used only for feature pruning and is discarded after the training phase. The second type of forest is used for classification.

Tang et al. [10] proposed histograms of normal vectors for object recognition in depth images. Given a depth image, spatial derivatives were computed and transformed to polar coordinates (θ, φ, r). 2D histograms of (θ, φ) were used as object descriptors. Oreifej and Liu [7] extended it to the temporal dimension by adding the time derivative. The gradient vector was normalized to unit magnitude and projected on a fixed basis to make histograms. The last component of the normalized gradient vector was the inverse of the gradient magnitude. As a result, information from very strong derivative locations, such as edges and silhouettes, may get suppressed.

In the second category, Yang and Tian [14] proposed pairwise 3D joint position differences in each frame and temporal differences across frames to represent an action. Since 3D joints cannot capture all discriminative information, the action recognition accuracy is reduced. Wang et al. [13] extended this approach by adding depth histogram based features computed from a fixed region around each joint in each frame. In the temporal dimension, low frequency Fourier components were used as features. A discriminative set of joints was found using an SVM. Although their algorithm achieved high performance, it required the 3D joint positions to be known beforehand.

3. Proposed Algorithm

We consider an action as a function operating on a four dimensional space with (x, y, t) being the independent variables and the depth d being the dependent variable: d = h(x, y, t). The discrimination of a particular action can be characterized by the variations of the depth values with the variations of the spatial and temporal dimensions. We capture these variations by 4D histograms of depth (Fig. 1(a)). The complete algorithm is shown in Fig. 3.

3.1. Histograms of Spatiotemporal Depth Function

In each depth frame f(t), where t is the time (or frame number), we segment the foreground and the background using the depth values and compute a 2D bounding rectangle containing the foreground only. For all the depth frames in one action, we compute a 3D action volume V = P × Q × T containing all the 2D rectangles. The spatial dimensions P × Q mainly depend upon the size of the subject and his relative distance from the sensor, while the temporal dimension T depends on the speed of execution of the action.

The size of V often varies during the performance of the same action, whether it is inter-subject or intra-subject execution. Therefore, some spatial and temporal normalization on V needs to be enforced. Instead of scale normalization, we divide V into n_v smaller volumes ∆V, such that n_v = p × q × t, where p = P/∆P and q = Q/∆Q are the spatial divisions, t = T/∆T is the temporal division, (∆P, ∆Q) are the horizontal and vertical numbers of pixels, and ∆T is the number of frames included in the subvolume ∆V, such that ∆V = ∆P ∆Q ∆T. For each ∆V, we assign a unique identifier using an invertible mapping function to preserve the spatial and temporal position of ∆V. As the size of V changes, the number of subvolumes n_v remains fixed while the size of each subvolume ∆V changes.
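As an illustration, the following Python sketch shows one possible invertible subvolume indexing under these definitions; the linear-index scheme and the helper name subvolume_ids are our own assumptions, not the paper's implementation.

```python
import numpy as np

def subvolume_ids(P, Q, T, p=10, q=10, t=5):
    """Assign every (x, y, frame) position of an action volume V = P x Q x T
    the unique identifier of the subvolume it falls in (illustrative scheme)."""
    # Subvolume extents; nv = p * q * t stays fixed while dP, dQ, dT adapt to V.
    dP = int(np.ceil(P / p))
    dQ = int(np.ceil(Q / q))
    dT = int(np.ceil(T / t))
    x, y, f = np.meshgrid(np.arange(P), np.arange(Q), np.arange(T), indexing='ij')
    ix, iy, it = x // dP, y // dQ, f // dT
    # Invertible linear index: the spatial and temporal position is preserved.
    return ix * (q * t) + iy * t + it   # shape (P, Q, T), values in [0, p*q*t)
```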

First, the minimum non-zero depth d_min > 0 and maximum depth d_max in the action volume V are found. The depth range d_max − d_min is then divided into n_d = (d_max − d_min)/δ_d equal steps, where δ_d is the step size. In each subvolume ∆V, the frequency of pixels with respect to the depth range is calculated and a depth histogram vector h_d ∈ R^{n_d} is computed using the quantized depth values:

\[ d_{x,y,t} = \left\lceil \frac{d(x,y,t)}{\delta_d} \right\rceil, \quad (1) \]

\[ h_d[d_{x,y,t}] = h_d[d_{x,y,t}] + 1, \quad \forall (x,y,t) \in \Delta V. \quad (2) \]


Figure 4. Histograms of the depth gradients (X-derivative, Y-derivative, time-derivative) and the gradient magnitude for the MSR Gesture 3D dataset. Strong depth gradients are obtained at the foreground edges, while gradients inside the object are weak and noisy.

Since the subvolume size may vary across different action videos, we normalize the histogram by the subvolume size:

\[ h_d[i] = \frac{h_d[i]}{\Delta P\, \Delta Q\, \Delta T}, \quad \text{for } 1 \le i \le n_d. \quad (3) \]

Note that we represent the background by a depth value of zero and ignore the count of 0 values in ∆V. A feature vector f_d ∈ R^{n_d n_v} for the action volume V is obtained by concatenating the depth histogram vectors of all ∆V, i.e.,

\[ f_d = \left[\, h_{d_1}^\top \; h_{d_2}^\top \; \cdots \; h_{d_{n_v}}^\top \right]^\top. \quad (4) \]

The position of each histogram in this global feature vector is determined by the unique identifier of the corresponding ∆V.
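A minimal sketch of Eqs. (1)-(4) is given below, assuming a depth sequence stored as a (P, Q, T) NumPy array and the subvolume identifiers from the earlier sketch; the bin placement relative to d_min and the handling of empty subvolumes are our own simplifications.

```python
import numpy as np

def depth_histogram_feature(depth_seq, ids, nv, n_d=10):
    """Per-subvolume depth histograms of Eqs. (1)-(4) (sketch).

    depth_seq : (P, Q, T) array of depth values; background pixels are 0.
    ids       : subvolume identifier of every pixel (e.g. from subvolume_ids).
    Returns the concatenated feature vector f_d of length n_d * nv."""
    fg = depth_seq > 0                                    # ignore background zeros
    d_min, d_max = depth_seq[fg].min(), depth_seq[fg].max()
    delta_d = (d_max - d_min) / n_d or 1.0                # depth step size, Eq. (1)
    # Quantize foreground depths into n_d bins (offset by d_min is our assumption).
    bins = np.clip(np.ceil((depth_seq - d_min) / delta_d), 1, n_d).astype(int)
    f_d = np.zeros((nv, n_d))
    for sv in range(nv):
        in_sv = (ids == sv)
        hist = np.bincount(bins[in_sv & fg] - 1, minlength=n_d)   # Eq. (2)
        f_d[sv] = hist / max(in_sv.sum(), 1)                      # Eq. (3): size normalization
    return f_d.ravel()                                            # Eq. (4): concatenation
```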

3.2. Histograms of Depth Derivatives

The depth function variations have also been characterized with the depth gradient:

\[ \nabla d(x,y,t) = \frac{\partial d}{\partial x}\,\mathbf{i} + \frac{\partial d}{\partial y}\,\mathbf{j} + \frac{\partial d}{\partial t}\,\mathbf{k}, \quad (5) \]

where the derivatives are given by ∂d/∂x = (d(x, y, t) − d(x + δx, y, t))/δx, ∂d/∂y = (d(x, y, t) − d(x, y + δy, t))/δy, and ∂d/∂t = (d(x, y, t) − d(x, y, t + δt))/δt. Fig. 4 shows the histograms of the derivatives and of the gradient magnitude ||∇d(x, y, t)||_2 over the first video of the hand gesture dataset [5]. Each of the derivative histograms has three peaks. The peak centered at 0 corresponds to depth surface variations, while the peaks at ±Λ, where Λ = |d_fg − d_bg|, correspond to silhouette pixels, d_fg being the foreground depth and d_bg the background depth. In contrast to [7], we do not

normalize the gradient magnitude to 1.0, because normalization destroys this structure by mixing the surface gradients and the silhouette gradients. The un-normalized gradient magnitude histogram has four distinct peaks. The peak at the small gradient magnitudes corresponds to the surface variations, while the other three peaks correspond to the silhouettes and have mean values given by Λ, √2 Λ, and √3 Λ. By making gradient histograms at the subvolume level, we encode the spatiotemporal position of a specific type of local depth variation, and the position and shape of the silhouette, into the feature vector.

For each depth derivative image, we find the derivative range ω_dx = max(∂d/∂x) − min(∂d/∂x) and divide it into n_dx uniform steps of size ∆ω_dx = ω_dx / n_dx. The derivative values are quantized using ∆ω_dx as follows:

\[ d_x = \left\lceil \frac{\partial d / \partial x}{\Delta \omega_{dx}} \right\rceil. \quad (6) \]

From these quantized derivative values, the x-derivative histogram h_{∂d/∂x} is computed for each subvolume ∆V. All the x-derivative histograms are concatenated according to the unique identifiers of the ∆V to make a global x-derivative feature vector f_{∂d/∂x}. Three more feature vectors, f_{∂d/∂y}, f_{∂d/∂t} and f_{∇d}, are computed in the same way for ∂d/∂y, ∂d/∂t and the gradient magnitude ||∇d(x, y, t)||_2. The five global feature vectors are concatenated to get the global feature vector f_g ∈ R^{(n_d + n_dx + n_dy + n_dt + n_∇) n_v} for the full action volume:

\[ f_g = \left[\, f_d^\top \;\; f_{\partial d/\partial x}^\top \;\; f_{\partial d/\partial y}^\top \;\; f_{\partial d/\partial t}^\top \;\; f_{\nabla d}^\top \right]^\top. \quad (7) \]

The feature vector f_g is computed only from the depth images. In the following subsections we discuss the computation of the proposed 3D joint position features.
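The gradient part of f_g could be computed along the following lines; the finite-difference scheme, bin count, and per-channel handling are illustrative assumptions, and the depth histogram vector f_d from Section 3.1 would be concatenated in front of the result to complete Eq. (7).

```python
import numpy as np

def gradient_histogram_features(depth_seq, ids, nv, n_bins=10):
    """Per-subvolume histograms of the depth derivatives and gradient
    magnitude (Section 3.2, Eqs. 5-7) -- a sketch with assumed bin counts."""
    dx = np.diff(depth_seq, axis=0, append=depth_seq[-1:, :, :])   # x-derivative
    dy = np.diff(depth_seq, axis=1, append=depth_seq[:, -1:, :])   # y-derivative
    dt = np.diff(depth_seq, axis=2, append=depth_seq[:, :, -1:])   # temporal derivative
    mag = np.sqrt(dx**2 + dy**2 + dt**2)    # gradient magnitude, deliberately NOT normalized to 1
    feats = []
    for chan in (dx, dy, dt, mag):
        lo, hi = chan.min(), chan.max()
        step = (hi - lo) / n_bins or 1.0                            # step size as in Eq. (6)
        bins = np.clip(np.ceil((chan - lo) / step), 1, n_bins).astype(int)
        f = np.zeros((nv, n_bins))
        for sv in range(nv):
            in_sv = (ids == sv)
            f[sv] = np.bincount(bins[in_sv] - 1, minlength=n_bins) / max(in_sv.sum(), 1)
        feats.append(f.ravel())
    return np.concatenate(feats)            # gradient part of f_g in Eq. (7)
```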

3.3. Histograms of Joint Position Differences

The Kinect sensor comes with a human skeleton tracking framework (OpenNI) [9] which efficiently estimates the 3D positions of 20 joints of the human skeleton (Fig. 2). Although the joint position information is already present in the depth images in raw form, we observe that the estimated joint positions can be used more effectively for action recognition. We propose two different types of features to be extracted from the 3D joint positions.

The proposed action recognition algorithm is applicable even if the 3D joint positions are not available. However, we observe that the 3D joint movement patterns are discriminative across different actions. Therefore, the classification performance should improve if joint movement patterns are incorporated within the feature vector. Let [x_i, y_i, d_i, t]^⊤ be the 3D position of the i-th joint, where (x_i, y_i) are the spatial coordinates, d_i is the depth and t is the time.


In the current setting, the 20 joint positions are determined in each frame. In each frame, we find the displacement of each joint i from a reference joint c:

\[ (\Delta X_i, \Delta Y_i, \Delta d_i) = (x_i, y_i, d_i) - (x_c, y_c, d_c). \quad (8) \]

The reference joint may be a fixed joint position or the centroid of the skeleton. We use the torso joint as the reference due to its stability. We capture the motion information of the skeleton over each action sequence by making one histogram for each component ∆X_i, ∆Y_i, and ∆d_i. These histograms are concatenated to make a global skeleton feature vector, which is appended to the depth feature vector given by Eq. (7).
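A compact sketch of Eq. (8) and the per-component histograms follows, assuming the 20 joint positions are stacked in a (T, 20, 3) array and using the torso joint as the reference; the bin count, range handling, and normalization are our own choices.

```python
import numpy as np

def joint_displacement_histograms(joints, torso_idx=0, n_bins=10):
    """Joint position difference histograms of Section 3.3, Eq. (8) (sketch).

    joints : array of shape (T, 20, 3) with (x, y, d) per joint per frame.
    torso_idx and n_bins are illustrative assumptions."""
    disp = joints - joints[:, torso_idx:torso_idx + 1, :]   # (dX, dY, dd) per joint, Eq. (8)
    feats = []
    for j in range(joints.shape[1]):
        for c in range(3):                                   # one histogram per component
            vals = disp[:, j, c]
            hist, _ = np.histogram(vals, bins=n_bins,
                                   range=(vals.min(), vals.max() + 1e-6))
            feats.append(hist / len(vals))                   # normalize by sequence length
    return np.concatenate(feats)
```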

3.4. Joint Movement Volume Features

We observe that the joints relevant to an action exhibit larger movements and that there are different sets of active and passive joints for each action. We incorporate this discriminative information in our algorithm by computing the 3D volume occupied by each joint (Fig. 1c). For the j-th joint, we find the extreme positions over the full action sequence by computing the maximum position differences along the x-axis and y-axis. The lengths along the x-axis, y-axis and depth range are given by L_x = max(x_j) − min(x_j), L_y = max(y_j) − min(y_j), L_d = max(d_j) − min(d_j), and the joint volume is given by

\[ V_j = L_x L_y L_d. \quad (9) \]

For each joint, we incorporate the joint volume V_j and L_x, L_y, L_d in the feature vector. We also incorporate the centroid of each joint movement volume and the distance of each centroid from the torso joint into the feature. To capture the local spatiotemporal structure of the joint volume V_j, we divide L_d into n_{L_d} bins and compute the depth step size ∆L_d = L_d / n_{L_d}.

We divide each V_j into a fixed number of cells which have a different size compared to the subvolume ∆V. For each cell, we compute the histogram of local depth variations using ∆L_d as the bin size, starting from min(d_j). The histograms for the 20 joint volumes are concatenated to form a spatiotemporal local joint movement feature vector.
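The volume-based part of this descriptor could look like the sketch below; only the extents, volume, centroid, and torso distance of Eq. (9) are shown, and the per-cell depth histograms would be appended analogously.

```python
import numpy as np

def joint_motion_volume_features(joints, torso_idx=0):
    """Joint movement volume descriptors of Section 3.4, Eq. (9) (sketch).

    joints : array of shape (T, 20, 3) with (x, y, d) per joint per frame.
    The torso index and feature ordering are assumptions for illustration."""
    feats = []
    torso_centroid = joints[:, torso_idx, :].mean(axis=0)
    for j in range(joints.shape[1]):
        lo, hi = joints[:, j, :].min(axis=0), joints[:, j, :].max(axis=0)
        Lx, Ly, Ld = hi - lo                          # extreme-position extents
        Vj = Lx * Ly * Ld                             # joint motion volume, Eq. (9)
        centroid = (lo + hi) / 2.0                    # centroid of the motion volume
        dist = np.linalg.norm(centroid - torso_centroid)
        feats.extend([Vj, Lx, Ly, Ld, *centroid, dist])
    return np.asarray(feats)
```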

3.5. Random Decision Forests

A random decision forest [8, 1] is an ensemble of weak learners which are decision trees. Each tree consists of split nodes and leaf nodes. Each split node performs binary classification based on the value of a particular feature f_g[i]. If f_g[i] ≤ τ_i (a threshold), then the action is assigned to the left partition, else to the right partition. If the classes are linearly separable, after log_2(c) decisions each action class gets separated from the remaining c − 1 classes and reaches a leaf node. For a given feature vector, each tree independently predicts its label and a majority voting scheme is used to predict the final label of the feature vector. It has been shown that random decision forests are fast and effective multi-class classifiers [9, 3]. The human action recognition problem tackled in this paper fits well in the random decision forest framework as the number of actions to be classified is quite large.

3.5.1 Training

In the random decision forest, each decision tree is trained on a randomly selected 2/3 of the training data. The remaining 1/3 of the training data are out-of-bag (OOB) samples and are used for validation. For each split node, from the full feature set of cardinality M, we randomly select a subset ξ of cardinality m = ⌈√M⌉ and search for the best feature f_g[i] ∈ ξ and an associated threshold τ_i such that the number of class labels present across the partition is minimized. In other words, we want the boundary between the left and right partitions to maximally match the actual class boundaries and not to pass through (divide) any action class. This criterion is ensured by maximizing the reduction in the entropy of the training data after partitioning, also known as the information gain. Let H(Q) be the original entropy of the training data and H(Q | f_g[i], τ_i) the entropy of Q after partitioning it into left and right partitions, Q_l and Q_r. The entropy reduction, or information gain, G, is given by

\[ G(Q \mid f_g[i], \tau_i) = H(Q) - H(Q \mid f_g[i], \tau_i), \quad (10) \]

where

\[ H(Q \mid f_g[i], \tau_i) = \frac{|Q_l|}{|Q|} H(Q_l) + \frac{|Q_r|}{|Q|} H(Q_r), \quad (11) \]

and |Q_l| and |Q_r| denote the number of data samples in the left and right partitions. The entropy of Q_l is given by

\[ H(Q_l) = -\sum_{i \in Q_l} p_i \log_2 p_i, \quad (12) \]

where p_i is the number of samples of class i in Q_l divided by |Q_l|. The feature and the associated threshold that maximize the gain are selected as the splitting test for that node:

\[ \{ f_g[i]^\ast, \tau_i^\ast \} = \arg\max_{f_g[i],\, \tau_i} G(Q \mid f_g[i], \tau_i). \quad (13) \]

If a partition contains only a single class, it is considered a leaf node and no further partitioning is required. Partitions that contain multiple classes are further partitioned until they all end up as leaf nodes (contain single classes) or the maximum height of the tree is reached. If a tree reaches its maximum height and some of its partitions contain labels from multiple classes, the majority label is used as the label.
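The split selection of Eqs. (10)-(13) can be sketched as follows; exhaustively scanning the observed feature values as candidate thresholds is our own simplification of the search.

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(Q) of a set of class labels (Eq. 12)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(X, y, feature_subset):
    """Pick the (feature, threshold) pair with maximum information gain
    over a random feature subset (sketch of Eqs. 10-13)."""
    H_Q, n = entropy(y), len(y)
    best = (None, None, -np.inf)
    for i in feature_subset:
        for tau in np.unique(X[:, i]):          # candidate thresholds (an assumption)
            left = X[:, i] <= tau
            if left.all() or not left.any():
                continue
            # Weighted child entropy H(Q | f_g[i], tau), Eq. (11).
            H_cond = (left.sum() / n) * entropy(y[left]) + \
                     ((~left).sum() / n) * entropy(y[~left])
            gain = H_Q - H_cond                 # Eq. (10)
            if gain > best[2]:
                best = (i, tau, gain)
    return best                                 # (feature index, threshold, gain)
```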


Figure 5. Normalized significance score Ω[i] of the 1800 features retained after feature pruning on the MSR Gesture 3D dataset.

3.5.2 Feature Pruning and Classification

After a set of trees has been trained, the significance score of each variable f_g[i], ∀i, in the global feature vector f_g can be assessed as follows [8, 1]. For the k-th tree in the forest:

1. Pass down all the OOB samples and count the number of votes cast for the correct classes, Ω_c^k.

2. To find the significance of f_g[i] in classification, randomly permute the values of f_g[i] in the OOB samples and pass these new samples down the tree, counting the number of correct votes, Ω_p^k[i].

3. Compute the importance as ∆Ω^k[i] = Ω_c^k − Ω_p^k[i].

The average over the K trees in the forest is taken as the significance score of feature f_g[i], and the scores are then normalized to sum to unit magnitude:

\[ \Omega[i] = \frac{1}{K} \sum_{k=1}^{K} \Delta\Omega^k[i], \qquad \Omega[i] \leftarrow \frac{\Omega[i]}{\sum_{j=1}^{M} \Omega[j]}. \]

A random decision forest is first trained using the set of all proposed features, and then all the features whose normalized importance scores Ω[i] are below a specified threshold are deleted. The new compact feature vectors are then used to train a second random decision forest which is used for classification of the test action videos.
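The OOB permutation-importance computation could be sketched as follows; the tree interface (a .predict method) and the per-tree OOB bookkeeping are assumptions, not the authors' implementation.

```python
import numpy as np

def permutation_importance(trees, X_oob_list, y_oob_list, M, rng=np.random):
    """OOB permutation-importance scores of Section 3.5.2 (sketch).

    trees      : list of trained trees exposing .predict(X) (an assumed interface).
    X_oob_list : per-tree OOB feature matrices; y_oob_list : matching label vectors.
    Returns normalized significance scores Omega[i] for all M features."""
    delta = np.zeros(M)
    for tree, X, y in zip(trees, X_oob_list, y_oob_list):
        correct = (tree.predict(X) == y).sum()            # votes for correct classes
        for i in range(M):
            Xp = X.copy()
            Xp[:, i] = rng.permutation(Xp[:, i])          # permute feature i only
            permuted_correct = (tree.predict(Xp) == y).sum()
            delta[i] += correct - permuted_correct        # importance drop for this tree
    omega = delta / len(trees)                            # average over the K trees
    return omega / omega.sum()                            # normalize to unit sum
```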

The significance scores of the selected features for the gesture dataset are shown in Figure 5, and the original index location of each selected feature is preserved. The subvolumes ∆V not contributing any discriminative features are identified and skipped during the feature extraction step for the test video sequences, as sketched below. For the remaining subvolumes, only the required histogram bins are computed to reduce the feature extraction time. The low-dimensional feature vectors are fed to the trained random decision forest for classification.
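Pruning then reduces to an index selection over the significance scores; the threshold value used here is an assumed placeholder.

```python
import numpy as np

def prune_features(omega, threshold=1e-4):
    """Keep only features whose normalized significance score exceeds a
    threshold (sketch). The returned indices preserve each feature's original
    location, so the corresponding subvolumes and bins can be skipped at test time."""
    keep = np.flatnonzero(omega >= threshold)
    return keep                      # e.g. X_pruned = X[:, keep]
```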

4. Experiments and Results

The proposed algorithm is evaluated on three standard datasets: MSR Action 3D [6, 13], MSR Gesture 3D [5, 12], and MSR Daily Activity 3D [13]. The performance of the proposed algorithm is compared with nine state of the art algorithms including Random Occupancy Pattern (ROP) [12], Actionlet Ensemble [13], HON4D [7], Bag of 3D Points [6], Eigen Joints [14], Space Time Occupancy Patterns (STOP) [11], Convolutional Networks [2], the 3D Gradient based descriptor [4], and Cell and Silhouette Occupancy Features [5]. For the Random Decision Forest (RDF), we use the implementation of [1]. Except for HON4D, all accuracies are reported from the original papers or from the implementations of [13] and [7]. For HON4D, the code and feature set were obtained from the original author.

4.1. MSR Gesture 3D Dataset

The proposed algorithm with only depth histograms and depth-gradient histograms is evaluated on the MSR Gesture 3D dataset [12, 5] because this dataset does not contain 3D joint positions (Fig. 6). The dataset contains 12 American Sign Language (ASL) gestures: bathroom, blue, finish, green, hungry, milk, past, pig, store, where, letter J, and letter Z. Each gesture is performed 2 or 3 times by each of the 10 subjects. In total, the dataset contains 333 depth sequences. Two gestures, "store" and "finish", are performed with two hands while the rest are performed with one hand.

We divide each video into n_v = 700 subvolumes with (p, q, t) = (14, 10, 5). For each subvolume, we make a depth histogram with n_d = 10, and one histogram of the same size each for the x, y and time derivatives. The average feature extraction time is 0.0045 sec/frame. The first RDF is trained with 300 trees in 14.1 sec and the second RDF is trained in 4.75 sec with 200 trees on the pruned feature vectors. The pruned feature vector length is 1800, which is 19 times smaller than the length 34176 of HON4D. The test time is 10^{-6} sec/frame. The overall test speed including all overheads is 222 frames/sec, which is 13.7 times faster than HON4D.

For comparison with previous techniques, we use the leave-one-subject-out cross-validation scheme proposed by [12]. Experiments are repeated 10 times, each time with a different test subject. The performance of the proposed method is evaluated in three different settings. In the first setting, one RDF is trained to classify all gesture classes; the accuracy of our algorithm in this setting is 92.76%. In the second setting, we first find the number of connected components in each video and train two different RDFs, one for the two-component case and one for the one-component case. The accuracy of our algorithm in this setting is 93.61%. Note that all subjects performed the gestures "store" and "finish" with two hands, while subject one performed them with one hand, making six videos invalid. In the third setting, we exclude these six invalid videos from the experiment and the average accuracy of our algorithm on the remaining dataset becomes 95.29%.

In all three settings, our algorithm outperformed existing state of the art algorithms (Table 1). Note that [13] cannot be applied to this dataset because of the absence of 3D joint positions. A detailed speed comparison of HON4D with the proposed algorithm is given in Table 2. To show the recognition accuracy of each gesture in setting 2, the confusion matrix is shown in Fig. 7.


Figure 6. Sample depth images from the MSR Gesture 3D dataset (left) and the MSR Action 3D dataset (right).

Method                                      Accuracy (%)
Convolutional Network [2]                   69.00
3D-Gradients Descriptor [4]                 85.23
Action Graph on Occupancy Features [5]      80.5
Action Graph on Silhouette Features [5]     87.7
Random Occupancy Pattern [12]               88.5
HON4D [7]                                   92.45
Proposed Method (Setting 1)                 92.76
Proposed Method (Setting 2)                 93.61
Proposed Method (Setting 3)                 95.29

Table 1. Comparison of action recognition rates on MSR Gesture 3D.

Table 2. Comparison with HON4D and Actionlet. Mean±STD are computed over 10 folds for the Gesture dataset, 252 folds for the Action dataset and 14 folds for the Daily Activity dataset. 5/5 means subjects 1, 2, 3, 4, 5 were used for training in Action and subjects 1, 3, 5, 7, 9 in Daily Activity.

Algorithm   Dataset    Mean±STD      Max     Min     5/5
HON4D       Gesture    92.4±8.0      100     75      -
Proposed    Gesture    93.6±8.3      100     77.8    -
HON4D       Action     82.1±4.2      90.6    69.5    88.9
Actionlet   Action     -             -       -       88.2
Proposed    Action     82.7±3.3      90.3    70.9    88.8
HON4D       DailyAct   73.20±3.92    80.0    66.25   80.0
Actionlet   DailyAct   -             -       -       85.75
Proposed    DailyAct   74.45±3.97    81.25   67.08   81.25

Figure 7. Confusion matrix of the proposed algorithm on the MSR Gesture 3D dataset.

4.2. MSR Action 3D Dataset

The proposed algorithm with the full feature set is evaluated on the MSR Action 3D dataset, which consists of both depth sequences (Fig. 6) and skeleton data (Fig. 2) of twenty human actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, sideboxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up and throw. Each action is performed by 10 subjects two or three times. The frame rate is 15 frames per second and the resolution is 320 × 240. The dataset consists of 567 action samples. These actions were chosen in the context of interactions with game consoles and cover a variety of movements related to the torso, legs, arms and their combinations.

We divide each action video into n_v = 500 subvolumes with (p, q, t) = (10, 10, 5). For each subvolume, we make one depth histogram with n_d = 5 and three gradient histograms of the same size. For the joint motion features, we divide each joint volume into n_c = 20 cells with (p, q, t) = (2, 2, 5), and for each cell a local depth variation histogram is made with n_d = 5. The average feature extraction time per frame is 0.0071 sec. The RDF for feature pruning is trained in 2.93 sec and the RDF for classification is trained in 1.07 sec. The pruned feature vector length is 1819, which is 9.83 times smaller than the length 17880 of HON4D. Overall, the test speed including all overheads is 140 frames/sec, which is 16 times faster than HON4D.

Experiments are performed using five subjects for training and five subjects for testing. The experiments are repeated exhaustively over 252 folds as proposed by [7]. The average accuracy obtained is 82.7±3.3%, which is higher than the average accuracy of 82.1±4.2% of HON4D (Table 2). Also, the standard deviation of our algorithm is significantly less than that of HON4D. A paired t-test shows that our results are statistically significant: p < 0.003. For the Actionlet method [13], apart from the accuracy of one fold, no parameters are available. A comparison with other techniques, using subjects 1, 2, 3, 4, 5 for training and the rest for testing as in [13, 7], is given in Table 3.

4.3. MSR Daily Activity 3D Dataset

The MSR Daily Activity 3D dataset [13] contains 16 daily activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, and sit down. Each activity is performed two times by each of the 10 subjects in two different poses: sitting on a sofa and standing (Fig. 8). In total, the dataset contains 640 RGB-D sequences at 30 frames per second and 320 × 240 resolution.


Figure 8. Sample depth images from the MSR Daily Activity 3D dataset. This dataset is more challenging because the images contain background objects very close to the subject.

Figure 9. Confusion matrices of the proposed method for (a) the MSR Action 3D dataset and (b) the MSR Daily Activity 3D dataset.

Method                                      Accuracy (%)
Depth Motion Maps (DMM) [15]                85.52
3D-Gradients Descriptor [4]                 81.43
Action Graph on Bag of 3D Points [6]        74.70
EigenJoints [14]                            82.30
STOP Feature [11]                           84.80
Random Occupancy Pattern [12]               86.50
Actionlet Ensemble [13]                     88.20
HON4D [7]                                   88.90
Proposed Method (FAV Features)              81.50
Proposed Method (JMV Features)              81.10
Proposed Method (JDH Features)              78.20
Proposed Method (All Features)              88.82

Table 3. Comparison of the proposed algorithm with current state-of-the-art techniques on the MSR Action 3D dataset. The accuracy of the proposed algorithm is reported for different feature types: Full Action Volume Features (FAV), Joint Displacement Histograms (JDH), and Joint Movement Volume Features (JMV). For a detailed accuracy comparison with HON4D, see Table 2.

We use half of the subjects for training and the rest for testing in 14-fold experiments, each time randomly selecting the training subjects. The average feature extraction time is 0.0089 sec/frame. The feature pruning RDF is trained in 6.96 sec and the classification RDF requires only 0.43 sec. The feature vector length is 1662, which is 90.97 times smaller than the length 151200 of HON4D. The proposed algorithm achieves an average accuracy of 74.45±3.97%, which is higher than the accuracy of HON4D (Table 2). Overall, the testing speed including all overheads is 112 frames/sec, which is 10.4 times faster than HON4D.

5. Conclusion and Future Work

In this paper we proposed a fast and accurate action recognition algorithm using depth videos and 3D joint position estimates. The proposed algorithm is based on 4D depth and depth gradient histograms, local joint displacement histograms and joint movement occupancy volume features. A Random Decision Forest (RDF) is used for feature pruning and classification. The proposed algorithm is tested on three standard datasets and compared with nine state of the art algorithms. On average, the proposed algorithm is found to be more accurate than all other algorithms while achieving a testing speed of more than 112 frames/sec. Execution time can be further improved by a parallel implementation of the proposed algorithm.

References

[1] L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

[2] S. Ji, W. Xu, M. Yang, and K. Yu. 3D Convolutional Neural Networks for Human Action Recognition. PAMI, 35(1):221-231, 2013.

[3] C. Keskin, F. Kirac, Y. Kara, and L. Akarun. Real Time Hand Pose Estimation using Depth Sensors. In ICCVW, 2011.

[4] A. Klaeser, M. Marszalek, and C. Schmid. A Spatio-Temporal Descriptor Based on 3D-Gradients. In BMVC, 2008.

[5] A. Kurakin, Z. Zhang, and Z. Liu. A Real Time System for Dynamic Hand Gesture Recognition with a Depth Sensor. In EUSIPCO, pages 1975-1979, 2012.

[6] W. Li, Z. Zhang, and Z. Liu. Action Recognition Based on a Bag of 3D Points. In CVPRW, 2010.

[7] O. Oreifej and Z. Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences. In CVPR, 2013.

[8] J. R. Quinlan. Induction of Decision Trees. Machine Learning, pages 81-106, 1986.

[9] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time Human Pose Recognition in Parts from Single Depth Images. In CVPR, pages 1297-1304, 2011.

[10] S. Tang, X. Wang, X. Lv, T. Han, J. Keller, Z. He, M. Skubic, and S. Lao. Histogram of Oriented Normal Vectors for Object Recognition with a Depth Sensor. In ACCV, 2012.

[11] A. W. Vieira, E. Nascimento, G. Oliveira, Z. Liu, and M. Campos. STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences. In CIARP, pages 252-259, 2012.

[12] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3D Action Recognition with Random Occupancy Patterns. In ECCV, pages 872-885, 2012.

[13] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In CVPR, pages 1290-1297, 2012.

[14] X. Yang and Y. Tian. EigenJoints-Based Action Recognition Using Naive Bayes Nearest Neighbor. In CVPRW, 2012.

[15] X. Yang, C. Zhang, and Y. Tian. Recognizing Actions Using Depth Motion Maps-Based Histograms of Oriented Gradients. In ACM Multimedia, 2012.