Structured Time Series Analysis for Human Action Segmentation and Recognition
Dian Gong, Gérard Medioni, Fellow, IEEE, and Xuemei Zhao
(D. Gong, G. Medioni and X. Zhao are with the Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA 90089. E-mail: {diangong, medioni, xuemeiz}@usc.edu.)
IEEE Transactions on Pattern Analysis and Machine Intelligence, July 2014. Digital Object Identifier 10.1109/TPAMI.2013.244

Abstract—We address the problem of structure learning of human motion in order to recognize actions from a continuous monocular motion sequence of an arbitrary person from an arbitrary viewpoint. Human motion sequences are represented by multivariate time series in the joint-trajectories space. Under this structured time series framework, we first propose Kernelized Temporal Cut (KTC), an extension of previous works on change-point detection that incorporates Hilbert space embedding of distributions to handle the nonparametric and high-dimensionality issues of human motion. Experimental results demonstrate the effectiveness of our approach, which yields realtime segmentation and produces high action segmentation accuracy. Second, a spatio-temporal manifold framework is proposed to model the latent structure of time series data. An efficient spatio-temporal alignment algorithm, Dynamic Manifold Warping (DMW), is then proposed for multivariate time series to calculate motion similarity between action sequences (segments). Furthermore, by combining the temporal segmentation algorithm and the alignment algorithm, online human action recognition can be performed by associating a few labeled examples from motion capture data. The results on human motion capture data and 3D depth sensor data demonstrate the effectiveness of the proposed approach in automatically segmenting and recognizing motion sequences, and its ability to handle noisy and partially occluded data in the transfer learning module.

Index Terms—Multivariate Time Series, Action Recognition, Online Temporal Segmentation, Spatio-Temporal Alignment, Transfer Learning.

1 INTRODUCTION

Recognizing human action is a key component in many applications, such as human-computer interaction, computer games, surveillance and human pose estimation. Extracting this high-level information from motion capture data or depth sensor data is the problem we propose to address here.
Although significant progress has been made in human action recognition [1], [2], [3], [4], [5], [6], the problem remains inherently challenging due to significant intra-class variations, viewpoint change, partial occlusion and background dynamic variations. A key limitation of many action-recognition approaches is that the models are learned from single 2D-view video features on individual datasets, and are unable to handle arbitrary view change or scale and background variations. Also, since they are not generalizable across different datasets, retraining is needed for every new dataset. Furthermore, many works in human activity recognition focus on simple primitive actions such as walking, running and jumping, whereas daily activity involves complex temporal patterns (walking, sit-down, then stand-up). Thus, recognizing such complex activities relies on accurate temporal structure decomposition [7].
We propose to take as input either a motion capture (Mocap) sequence providing 3D joint positions, or a depth video from a 3D camera, pre-processed to obtain partial, noisy 3D joint positions, or, in future work, a 2D video from an arbitrary viewpoint, pre-processed to provide partial, noisy 2D joint positions.
Our first step is to segment these sequences online into segments corresponding to different activities. This is achieved with no training. Furthermore, these segments are subdivided into action units corresponding to cycles (such as walking). Offline, we learn different activities from one or very few examples of labeled Mocap segments. Then, online, we compare a segment to labeled ones using a novel alignment algorithm in order to perform classification. We show promising results, demonstrating the power of our approach. The proposed approach has the following modules:
(1) Given a labeled Mocap sequence with M markers in 3D, which is a 3M-dimensional time series (3MD+t), the low-dimensional manifold structure (i.e., tangent space, geodesic distance, etc.) is learned by using Tensor Voting. This is an offline process, as shown in Fig. 1.
(2) For other unlabeled motion sequences in 3D, sequential temporal segmentation is performed to automatically segment the input motion sequence into different action units. For a single action unit, after structure learning (1), we calculate the motion similarity score with each labeled motion sequence by the proposed spatio-temporal alignment approach, and perform action recognition. This is an online process.
(3) Our system can recognize actions from a depth sensor. Available human pose estimation methods (Kinect SDK and OpenNI) can provide 3D human pose estimation results, but they are often noisy and have occlusions, while the structure learning algorithm (1) remains the same and our temporal

segmentation and alignment approach (2) can naturally handle noisy input and occlusion.

Fig. 1. Flow chart of the proposed approach.

Our approach has the following advantages:
One or very few examples are required in each action category in the training stage, compared to hundreds for many learning approaches.
Transfer learning: when applying our approach to depth image sequences, there is no training process on these depth images, and the people in these images do not necessarily appear in the labeled Mocap sequences. Thus, our approach can be considered a transfer learning framework, i.e., the knowledge from labeled Mocap data can be adapted to any human motion data.
Online action recognition: the input sequence from unlabeled Mocap sequences or 3D sensors can be temporally segmented in an online fashion (sec. 3), resulting in continuous action recognition (sec. 6.3).
Intra/inter-person variations: a person repeating an action twice with differences, or two people performing an action with differences in both pose style and motion dynamics, can be handled by combining the proposed temporal and spatial alignment methods (sec. 5.2 and 5.3).
View invariance: low-dimensional human motion manifold models are learnt from the 3D Mocap data, and our spatio-temporal alignment algorithms can handle 3D input from an arbitrary viewpoint; these two features make our system robust to action viewpoint.
Noise and occlusion handling: in order to recognize actions from depth image sequences, human poses need to be estimated. Instead of M key points, often only K visible points can be estimated (with noise) during the whole action (K ≤ M), such as for a side-view boxing man. Our system can handle these noisy trajectories, even with occlusion (3KD+t).
An overview of our approach is sketched in Fig. 1. The joint trajectories of M human body key points are used to represent a human motion sequence. Trajectories can be either provided by Mocap (3D) or tracked from depth image sequences by available human pose estimation methods such as Kinect SDK and OpenNI (noisy 3D). The core of our approach is the structured time series representation and two newly proposed machine learning algorithms for time series data, i.e., Kernelized Temporal Cut (KTC) and Dynamic Manifold Warping (DMW). KTC is a temporal extension of Hilbert space embedding of distributions [8], [9] and the kernelized two-sample test [10], [11] for online change-point detection. DMW extends previous works on spatio-temporal alignment by incorporating manifold learning. Empirical results demonstrate the superior performance of these two algorithms compared to other state-of-the-art methods on human action segmentation and recognition. The technical details of the proposed algorithms are given in sections 3, 4, 5, and 6.3.

2 RELATED WORK

Dynamic Manifold Model. Non-linear manifold learning and Latent Variable Modeling (LVM) have been prominent in machine learning research over the past decade [12], [13], [14]. In [15], Tensor Voting [16] is used to analyze the 1D manifold of landmark sequences, and the manifold structure is applied to 3D face tracking and expression inference. In particular, some probabilistic latent variable frameworks, i.e., GP-LVM, GPDM and their variants [17], [18], [19], focus on motion capture data and try to capture the intrinsic structure of human motion, which is further applied to 3D monocular people tracking [20].
Moreover, there are some manifold-related works on human action recognition and motion analysis. [21], [22] apply generative manifold models to several aspects of human motion analysis, including pose recovery, body tracking, gait recognition and facial expression recognition. [23] utilizes manifold learning to perform motion retrieval, and [24] combines ISOMAP and DTW to recognize actions from silhouettes. The focus of our approach differs from these works significantly. The problem we address is to perform online action segmentation and recognition jointly, and to recognize actions from realtime OpenNI input based on labeled Mocap sequences. This online processing of stream input and the transfer learning functionality are not the target of the above-mentioned works.
Besides the differences in temporal segmentation and transfer learning, our alignment step uses manifold learning to infer the latent completion variable, while [21], [22] use it to build a generative model. Their human motion generative models have advantages in pose recovery and tracking.
Temporal Segmentation. This is a multifaceted area, and several related topics in machine learning, statistics, computer vision and graphics are discussed.
-Change-Point Detection. Most of the work in statistics, i.e., offline or quickest (online) change-point detection (CD) [25], is often restricted to univariate series (1D) and parametric distribution assumptions, which do not hold for human motions with complex structure. [26] uses undirected sparse Gaussian graphical models and performs structure estimation and segmentation jointly. Recently, [28] was proposed as a nonparametric extension of Bayesian online change-point detection (BOCD) [27], combining BOCD and Gaussian Processes (GPs) to relax the i.i.d. assumption within a regime. Although GPs improve the ability to


model complex data, they also bring in high computational cost. More relevant to us, kernel methods have been applied to non-parametric change-point detection on multivariate time series [29], [30]. In particular, [29] (KCD) utilizes the one-class SVM as an online training method, and [30] (KCpA) performs sequential segmentation based on the Kernel Fisher Discriminant Ratio. Unlike all the above works, KTC can detect not only action transitions but also cyclic motions.
-Temporal Clustering. Recently, as an extension of clustering [31], [32], some works focus on how to correctly segment time series temporally into different clusters. As an elegant combination of Kernel K-means and spectral clustering, Aligned Cluster Analysis (ACA) was developed for temporal clustering of facial behavior, with a multi-subject correspondence algorithm for matching facial expressions [33]. To estimate the unknown number of clusters, [34] uses the hierarchical Dirichlet process as a prior to improve the switching linear dynamical system (SLDS). Most of these works segment time series offline and provide cluster labels as in clustering. As a complementary approach, KTC performs online temporal segmentation, which is suitable for realtime applications.
-Motion Analysis. In computer vision and graphics, some works focus on grouping human motions. Unusual human activity detection is addressed in [35] using (bipartite) graph spectral clustering. [36] extracts spatio-temporal features to address event clustering on video sequences. [37] proposes a geometric-invariant temporal clustering algorithm to cluster facial expressions. More relevantly, [38] proposes an online algorithm to decompose motion sequences into distinct action segments. Their method is an elegant temporal extension of Probabilistic Principal Component Analysis for change-point detection (PPCA-CD), which is computationally efficient but restricted to (approximately) Gaussian assumptions.
Action Recognition. Inspired by the success in object recognition, low-level features like Space-Time Interest Points (STIPs) plus Histogram of Oriented Gradients (HOG) descriptors are used in many action recognition works [1], [2], [39]. Silhouette-based features are also popular [40], [41], for which good results rely on accurate foreground extraction. Some works also use tracked key points, which are quantized as feature vectors by a pre-learned or manually designed codebook [3], [42], [43]. Action recognition is a multifaceted field; our discussion focuses on view-invariant methods, and readers can refer to a recent review [44] for more details.
A Hidden Markov Model (HMM) is built on 3D joint trajectories (Mocap) to capture the dynamic information of human motion in [45]. The claimed advantage of the 3D HMM model is that the dependence on viewpoint and illumination is removed. However, HMMs require a large amount of training data in a relatively high dimensional space (e.g., 67), and the HMM structure must be adaptively designed for specific application domains. These may be potential factors that make the recognition performance unsatisfactory, and AdaBoost is used to improve the accuracy [45].


View independence is also addressed in [41], [4] by rendering Mocap data of various actions from multiple viewpoints, which is a time- and storage-consuming process. In [40], 3D models are projected onto 2D silhouettes with respect to different viewpoints, and [5] detects 2D features first and then back-projects them to action features based on a 3D visual hull. These methods require a computationally expensive search process over model parameters to find the best match between 2D features and the 3D model. Very recently, in [46], a 3D HOG descriptor was proposed to handle viewpoint change; this approach requires multiple-view camera settings for the training data to achieve view-invariant recognition. Departing from these methods, our recognition process does not require pose rendering or parameter search. Our trajectory features are located at key locations of the body skeleton, with explicit semantic meaning, allowing our system to be directly applied to an arbitrary scene without dataset-dependent training. Recently, there have been a few works [47], [48], [49] focusing on action recognition on depth or RGB plus depth (RGBD) image sequences.
Spatio-Temporal Alignment. Given two human motion sequences, an important question is whether those two sequences represent the same motion, similar motions or distinct motions. This can be viewed as a (spatio-temporal) alignment problem, serving as a foundation for action recognition, clustering, etc. Canonical Correlation Analysis (CCA) [50], proposed for learning the shared subspace between two high dimensional features, has been used as the spatial matching algorithm for activity recognition from video [51] and activity correlation across cameras [52]. Video synchronization is addressed as a temporal alignment problem in [53], [54], which use dynamic time warping (DTW) or its variants [55]. [56] uses optimization methods to maximize a similarity measure of two human action sequences, while the temporal warping is constrained to a 1D affine transformation. The same linear temporal model is also used in [57].
Very recently, as an elegant extension of CCA and DTW, Canonical Time Warping (CTW) was proposed for spatio-temporal alignment of two multivariate time series and applied to align human motion sequences between two subjects [58]. CTW is formulated as an energy minimization framework and solved by an iterative gradient descent procedure. Since spatial and temporal transformations are coupled together, the objective function becomes non-convex and the solution is not guaranteed to be globally optimal. Under the STM model, we propose Dynamic Manifold Warping (DMW), which focuses on time series with intrinsic spatial structure and guarantees a globally optimal solution. By combining KTC and alignment approaches such as [59], [58], we can perform online action recognition for input from a 2.5D depth sensor. Unlike other works on supervised joint segmentation and recognition [60], two significant features of our approach are viewpoint independence and handling an arbitrary person with a few labeled Mocap sequences, in the transfer learning module.


Fig. 2. Online Hierarchical Temporal Segmentation. A 22-second input sequence is temporally cut into two segments: a walking segment (S1), which is further cut into 6 action units, and a jumping segment (S2), which is further cut into 4 action units.

3 ONLINE TEMPORAL SEGMENTATION

3.1 Time Series Representation
To effectively represent human motion sequences, joint position (or joint angle) is used in this paper. In each frame, the joint positions (or joint angles) of several key points on the human skeleton are formulated as a point in a multi-dimensional space. Thus, a human motion sequence is represented as a trajectory, i.e., a structured multivariate time series which implicitly contains the human motion structure. For instance, given a length-$L_x$ human action sequence (e.g., stretching), the joint-position trajectory can be represented as a matrix $X_{1:L_x} = [x_1\ x_2\ ...\ x_{L_x}] \in \mathbb{R}^{D \times L_x}$, where $x_t$ is the vector of joint positions at temporal index $t$. In 3D (Mocap), $x_t = [p^t_1, p^t_2, ..., p^t_M]^T \in \mathbb{R}^{3M \times 1}$ and $p^t_i = (p^t_{i1}, p^t_{i2}, p^t_{i3})$ is the coordinate of the $i$th marker in $\mathbb{R}^3$. Or, in partial 3D (e.g., tracking trajectories from a depth sensor), $x_t = [q^t_1, q^t_2, ..., q^t_K]^T \in \mathbb{R}^{3K \times 1}$ ($K \le M$), where $q^t_i = (q^t_{i1}, q^t_{i2}, q^t_{i3})$ is the location of the $i$th tracked point.
This structured time series representation is used for both temporal segmentation and alignment. This section describes the Kernelized Temporal Cut (KTC), a temporal application of Hilbert space embedding of distributions [8] and the kernelized two-sample test [10], [11], to sequentially estimate temporal cut points in human motion sequences (Fig. 2) [61]. It is notable that, as a kernelized learning algorithm, KTC can be applied to structured sequential data in general, such as multivariate time series and dynamic graphs.
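For illustration only (not from the paper), the following NumPy sketch builds this joint-position trajectory matrix; the number of joints, frame count and array names are illustrative assumptions.

```python
import numpy as np

# Suppose M = 15 skeleton joints are tracked over L = 300 frames,
# each joint given by its 3D coordinates (illustrative shapes).
M, L = 15, 300
joints = np.random.rand(L, M, 3)      # frame t -> M joints in R^3

# Stack each frame into a column x_t in R^(3M), giving the structured
# multivariate time series X_{1:L} = [x_1 ... x_L] in R^(3M x L).
X = joints.reshape(L, 3 * M).T        # shape (45, 300)

x_t = X[:, 0]                         # joint-position vector of frame 1
print(X.shape, x_t.shape)             # (45, 300) (45,)
```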

3.2 Problem Formulation

Given a stream input $X_{1:L_x} = \{x_t\}_{t=1}^{L_x}$ ($x_t \in \mathbb{R}^{D_t}$, where $D_t$ can be fixed or change over time), the goal of temporal segmentation is to predict temporal cut points $c_i$. For instance, if a person walks and then boxes, a temporal cut point must be detected. For depth sensor data, $x_t$ is the vector representation of the tracked joints. More details on $x_t$ are given in sec. 6.1. From a machine learning perspective, the estimated $\{c_i\}_{i=1}^{N_c}$ can be modeled by minimizing the following objective function,
$$ L_X(\{c_i\}_{i=1}^{N_c}, N_c) = \sum_{i=1}^{N_c} I(X_{c_{i-1}:c_i-1}, X_{c_i:c_{i+1}-1}) \quad (1) $$


where $X_{c_i:c_{i+1}-1} \in \mathbb{R}^{D\times(c_{i+1}-c_i)}$ indicates the segment between two cut points $c_i$ and $c_{i+1}$ ($c_1 = 1$, $c_{N_c+1} = L_x + 1$). Here $I(\cdot)$ is the homogeneity function that measures the spatio-temporal consistency between two consecutive segments. It is worth noting that both $\{c_i\}_{i=1}^{N_c}$ and $N_c$ need to be estimated from eq. 1. Next, the main task is to design $I(\cdot)$ and to optimize eq. 1 online. As the counterpart, eq. 1 could be optimized offline by dynamic programming when $N_c$ is given, which is out of the scope of this paper.
3.3 KTC-S
Instead of jointly optimizing eq. 1, the proposed Kernelized Temporal Cut (KTC) sequentially optimizes $c_{i+1}$ based on $c_i$ by minimizing the following loss function,
$$ L_{\{X_{c_i:c_i+T-1}\}}(c_{i+1}) = I(X_{c_i:c_{i+1}-1}, X_{c_{i+1}:c_i+T-1}), \quad i = 1, 2, ..., N_c - 1 \quad (2) $$

where $c_i$ ($c_1 = 1$, $c_{N_c+1} = L_x + 1$) is provided by the previous step and $T$ is a fixed length. We refer to this sequential optimization process for eq. 2 as KTC-S, where S stands for sequential. Sequentially optimizing $L$ is actually a fixed-length sliding-window process, which is also used in [30]. However, setting $T$ is a difficult task, and how to improve this process is described in sec. 3.4. Essentially, eq. 2 is a two-class temporal clustering problem for $X_{c_i:c_i+T-1} \in \mathbb{R}^{D\times T}$. The crucial factor is constructing $I(\cdot)$, which is related to temporal versions of (dis)similarity functions in spectral clustering [31], [32], [37] and information-theoretic clustering [62].
To handle the complex structure of human motion, unlike previous work, KTC utilizes Hilbert space embedding of distributions (HED) to map the distribution of $X_{t_1:t_2}$ into a Reproducing Kernel Hilbert Space (RKHS). [8], [9] are seminal works on combining kernel methods and probability distribution analysis. Without going into details, the idea of using HED for temporal segmentation is straightforward. The change-point is detected by using a well-behaved (smooth) kernel function, whose values are large on samples belonging to the same spatio-temporal pattern and small on samples from different patterns. By doing this, KTC not only handles nonparametric and high-dimensionality problems but also rests on a solid theoretical foundation [8].
HED. Inspired by [9], probability distributions can be embedded in an RKHS. At the center of the Hilbert space embedding of distributions are the mean mapping functions,
$$ \mu(P_x) = E_x[k(x, \cdot)], \qquad \mu(X) = \frac{1}{T}\sum_{t=1}^{T} k(x_t, \cdot) \quad (3) $$
where $\{x_t\}_{t=1}^{T}$ are assumed to be i.i.d. samples from the distribution $P_x$. Under mild conditions, $\mu(P_x)$ (and likewise $\mu(X)$) is an element of the Hilbert space, satisfying
$$ \langle \mu(P_x), f \rangle = E_x[f(x)], \qquad \langle \mu(X), f \rangle = \frac{1}{T}\sum_{t=1}^{T} f(x_t) $$
Mappings $\mu(P_x)$ and $\mu(X)$ are attractive because:


Theorem 1. If the kernel $k$ is universal, then the mean map $\mu: P_x \mapsto \mu(P_x)$ is injective. [9]
This theorem states that distributions of $x \in \mathbb{R}^D$ have a one-to-one correspondence with mappings $\mu(P_x)$. Thus, for two distributions $P_x$ and $P_y$, we can use the function norm $\|\mu(P_x) - \mu(P_y)\|$ to quantitatively measure the difference (denoted as $D(P_x, P_y)$) between these two distributions. Moreover, we do not need access to the actual distributions, but rather finite samples, to calculate $D(P_x, P_y)$ because:
Theorem 2. Assume that $\|f\|_\infty \le C$ for all $f \in H$ with $\|f\|_H \le 1$. Then, with probability at least $1 - \delta$, $\|\mu(P_x) - \mu(X)\| \le 2R_T(H, P_x) + C\sqrt{T^{-1}\log(\delta)}$. [9]
As long as the Rademacher average $R_T$ is well behaved, finite samples yield an error that converges to zero, thus they empirically approximate $\mu(P_x)$. Therefore, $D(P_x, P_y)$ can be precisely approximated by using the finite-sample estimate $\|\mu(X) - \mu(Y)\|$.
Thanks to the above facts, we use HED to construct $I_{KTC}(X_{1:T_1}, Y_{1:T_2})$ to measure the consistency between the distributions of two segments as follows,
$$ I_{KTC} = \frac{2}{T_1 T_2}\sum_{i,j} k(x_i, y_j) - \frac{1}{T_1^2}\sum_{i,j} k(x_i, x_j) - \frac{1}{T_2^2}\sum_{i,j} k(y_i, y_j) \quad (4) $$
Combining eq. 2 and eq. 4, $c_{i+1}$ is estimated by minimizing the following function in matrix formulation:
$$ L_{\{X_{c_i:c_i+T-1}\}}(c_{i+1}) = -(E_T^{\Delta c_i})^H K^{KTC}_{c_i:c_i+T-1} E_T^{\Delta c_i}, \qquad E_T^{\Delta c_i} = \frac{e_T^{1:\Delta c_i}}{\Delta c_i} - \frac{e_T^{\Delta c_i+1:T}}{d_i} \quad (5) $$

where $\Delta c_i$ and $d_i$ are short notations for $c_{i+1} - c_i$ and $c_i + T - c_{i+1}$. $e_T^{t_1:t_2} \in \mathbb{R}^{T\times 1}$ is a binary vector with 1 for positions from $t_1$ to $t_2$ and 0 elsewhere. $K^{KTC}_{c_i:c_i+T-1} \in \mathbb{R}^{T\times T}$ is the kernel matrix based on the kernel function $k_{KTC}(\cdot)$.
Kernel. The success of kernel methods largely depends on the choice of the kernel function [8]. As mentioned before, the difficulty with human motion is that both spatial and temporal structures are important. Thus, we propose a novel spatio-temporal kernel $k_{KTC}(\cdot)$ as follows,
$$ k_{KTC}(x_i, x_j) = k_S(x_i, x_j)\, k_T(x_i, x_j) = k_S(x_i, x_j)\, k_T(\Theta(x_i), \Theta(x_j)) \quad (6) $$
where $k_S(\cdot)$ is the spatial kernel and $k_T(\cdot)$ is the temporal kernel. $\Theta(x)$ is the estimated local tangent space at point $x$. $k_S(\cdot)$ and $k_T(\cdot)$ can be chosen according to domain knowledge, or as universal kernels such as the Gaussian. For instance, the canonical correlation analysis (CCA) kernel [50] is used for joint-position features as,
$$ k_S^{CCA}(x_i, x_j) = \exp(-\lambda_S\, d_{CCA}(x_i, x_j)^2) \quad (7) $$
where $d_{CCA}(\cdot)$ is the CCA metric based on the $M\times 3$ matrix representation of $x \in \mathbb{R}^{3M\times 1}$ ($M$ is the number of 3D joints from Mocap or the depth sensor). Or, in general, we set them


as,
$$ k_S(x_i, x_j) = \exp(-\lambda_S \|x_i - x_j\|^2), \qquad k_T(\Theta(x_i), \Theta(x_j)) = \exp(-\lambda_T\, \angle(\Theta(x_i), \Theta(x_j))^2) \quad (8) $$
where $\lambda_S$ is the kernel parameter for $k_S(\cdot)$ and $\lambda_T$ is the kernel parameter for $k_T(\cdot)$. $\angle(\cdot)$ denotes the principal angle between two subspaces (ranging from 0 to $\pi/2$).
In short, the spatio-temporal kernel $k_{KTC}$ captures both spatial and temporal distributions of the data (a visual example is shown in Fig. 3), which makes it suitable for modeling structured sequential data. As special cases, $k_{KTC}$ degenerates to the spatial kernel if $\lambda_T \to 0$ and to the temporal kernel if $\lambda_S \to 0$.
Optimization. Unlike the NP-hard optimization in spectral clustering [32], eq. 5 can be efficiently solved because the feasible region of $c_{i+1}$ is $[c_i + 1, c_i + T - 1]$, allowing a search over the entire space to minimize $L(c_{i+1})$. For each step, minimizing eq. 5 has complexity at most $O(T^2)$ accesses of $k_{KTC}(\cdot)$.
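To make eqs. 6-8 concrete, here is a small NumPy sketch of one way such a spatio-temporal kernel could be computed; it is not the authors' code. The tangent space Theta(x_t) is approximated by local PCA over a short temporal neighborhood (the paper uses Tensor Voting), and the bandwidths lambda_s, lambda_t and the neighborhood size are illustrative assumptions.

```python
import numpy as np

def local_tangent(X, t, dim=2, half_win=3):
    """Approximate the local tangent space Theta(x_t) by PCA over a short
    temporal neighborhood of frame t (a stand-in for Tensor Voting)."""
    lo, hi = max(0, t - half_win), min(X.shape[1], t + half_win + 1)
    nbrs = X[:, lo:hi] - X[:, lo:hi].mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(nbrs, full_matrices=False)
    return U[:, :dim]                              # D x dim orthonormal basis

def principal_angle(U, V):
    """Largest principal angle between two subspaces (orthonormal bases)."""
    s = np.clip(np.linalg.svd(U.T @ V, compute_uv=False), -1.0, 1.0)
    return np.arccos(s.min())                      # in [0, pi/2]

def ktc_kernel_matrix(X, lambda_s=1e-2, lambda_t=1.0):
    """K^KTC for a window X (D x T): k_S * k_T as in eqs. 6-8."""
    L = X.shape[1]
    tangents = [local_tangent(X, t) for t in range(L)]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    Ks = np.exp(-lambda_s * d2)                    # spatial Gaussian kernel
    Kt = np.empty((L, L))
    for i in range(L):
        for j in range(L):
            ang = principal_angle(tangents[i], tangents[j])
            Kt[i, j] = np.exp(-lambda_t * ang ** 2)
    return Ks * Kt                                 # element-wise product

K = ktc_kernel_matrix(np.random.rand(45, 120))     # toy 45-D joint trajectory
print(K.shape)
```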

3.4 KTC-R
Sequential optimization of eq. 1 is described in sec. 3.3. However, this process may not be suitable for realtime applications. A key feature of human motion is temporal variation, i.e., one action can last a long time or only a few seconds. Thus, it is difficult to use a fixed-length-$T$ sliding window to capture transitions. Small values of $T$ cause over-segmentation and large values of $T$ cause large delays ($T = 300$ for the depth sensor results in a 10-second delay). To overcome this problem, we combine the incremental sliding-window strategy [38] and the two-sample test [10], [11] to design a realtime algorithm for eq. 5, i.e., KTC-R (Fig. 3).
Given $X_{1:L_x} = [x_1, ..., x_{L_x}] \in \mathbb{R}^{D\times L_x}$, KTC-R sequentially processes the varying-length window $X^t = [x_{n_t}, ..., x_{n_t+T_t}]$ at step $t$. This process starts from $n_1 = 1$ and $T_1 = 2T_0$, where $T_0$ is the pre-defined shortest possible action length. At step $t$ (assume the last cut is $c_i$), if no action transition point is captured, the following update is performed,
$$ n_{t+1} = n_t, \qquad T_{t+1} = T_t + \Delta T \quad (9) $$
else, if there is a transition point,
$$ c_{i+1} = n_t + T_t - T_0, \qquad n_{t+1} = c_{i+1}, \qquad T_{t+1} = T_1 \quad (10) $$
where $\Delta T$ is the step length for growing the window. This process ends when $n_t \ge L_x - T_0$. As shown in eq. 9 and eq. 10, $X_{1:L_x}$ is sequentially processed and each cut $c_i$ is estimated when the algorithm receives the $(c_i + T_0 - 1)$th frame (the same holds for non-cut frames). This fact indicates that KTC-R has a fixed-length time delay $T_0$, as shown in Fig. 3.
At each step, deciding on a cut (at frame $n_t + T_t - T_0$) is equivalent to the following hypothesis test,
$$ H_0: \{x_i\}_{i=n_t}^{n'_t-1} \text{ and } \{x_i\}_{i=n'_t}^{n_t+T_t-1} \text{ are the same}, \qquad H_A: \text{not } H_0 \quad (11) $$
where $n'_t$ is the short notation for $n_t + T_t - T_0$. Eq. 11 is



Fig. 3. An illustration of KTC-R. Left: tracked joints ($\in \mathbb{R}^{45\times 190}$) from the depth sensor; right: $K^{KTC}_{1:190}$ for the window $X_{1:190}$ ("Human" stands for the human ground truth). The decision to make no cut between frames 1 and 110 is made before the current window, with a maximum delay of $T_0 = 80$ frames.

re-written by combining eq. 5 as follows,
$$ L_t = -(E_{T_t}^{n'_t - n_t})^H K^{KTC}_{n_t:n_t+T_t-1} E_{T_t}^{n'_t - n_t} $$
$$ H_0: L_t \ge \tau_t : n'_t \text{ is not a cut}, \qquad H_A: L_t < \tau_t : n'_t \text{ is a cut} \quad (12) $$

where $\tau_t$ is the adaptive threshold for the hypothesis test (12). In fact, eq. 12 is directly inspired by [10], which proposes a kernelized two-sample test method. $L_t$ is analogous to the negative square of the empirical estimate of the Maximum Mean Discrepancy (MMD), which has the following formulation,
$$ MMD[F, X_{1:T_1}, Y_{1:T_2}] = \Big( \frac{1}{T_1^2}\sum_{i,j=1}^{T_1} k(x_i, x_j) - \frac{2}{T_1 T_2}\sum_{i,j=1}^{T_1,T_2} k(x_i, y_j) + \frac{1}{T_2^2}\sum_{i,j=1}^{T_2} k(y_i, y_j) \Big)^{\frac{1}{2}} \quad (13) $$
where $F$ is a unit ball in a universal RKHS $H$, and $\{x_i\}_{i=1}^{T_1}$ and $\{y_j\}_{j=1}^{T_2}$ are i.i.d. samples from the distributions $P_x$ and $P_y$. It can be shown that,
$$ \lim_{\Delta T \to 0} L_t = -MMD[F, X_{n_t:n'_t-1}, X_{n'_t:n_t+T_t-1}]^2 \quad (14) $$
if the same kernel as in the MMD is used as the spatial kernel in $k_{KTC}(\cdot)$ (eq. 6) and $k_T(\cdot)$ degenerates to 1 as $\Delta T \to 0$. Based on eq. 14, $\tau_t$ is set as $B_R(t) + \tau$, where $B_R(t)$ is an adaptive threshold calculated from the Rademacher bound [10], and $\tau$ is a fixed global threshold, the only non-trivial parameter in KTC-R (used to control the coarse-to-fine level of segmentation).
Analysis. In summary, both KTC-S and KTC-R are based on eq. 5. The main differences are that KTC-S performs segmentation by sequential optimization in a two-class temporal clustering way, while KTC-R performs segmentation by using an incremental sliding window in a two-sample test way. KTC-R requires more sliding windows than KTC-S, but for each one there is no optimization, and accessing $k_{KTC}(\cdot)$ $O(T_t \Delta T)$ times is enough (linear in $T_t$). Only when a new cut is detected are $O(T_t^2)$ accesses required. Thus, KTC-R is extremely efficient and suitable for realtime applications. It is notable that, even if the fixed-length sliding-window method (sec. 3.3) were improved to decide whether a cut happens or not in $X_{c_i:c_i+T-1}$, a small $T$ is still not reliable for realtime applications. The reason is that a clear temporal cut in human motion requires a large number of observations before and after the cut. Indeed, the required number of frames varies from action to action, even for manual annotation.
3.5 Online Hierarchical Temporal Segmentation
Besides estimating $\{c_i\}$, decomposing an action segment $X_{c_i:c_{i+1}-1}$ into an unknown number of action units (e.g., three walking cycles), if cyclic motions exist, is also needed [63]. This is not only helpful for understanding motion sequences, but also for other applications such as recognition and indexing. Thus, an online cyclic structure segmentation algorithm, i.e., Kernelized Alignment Cut (KAC), is proposed as a generalization of kernel embedding of distributions and temporal alignment [58], [33]. By combining KAC and KTC-R, we get the two-layer segmentation algorithm KTC-H, where H stands for hierarchical. Action unit segmentation is difficult for non-periodic motions (e.g., jumping), which are actions usually performed once locally. However, people can still perform two consecutive non-periodic motions, and these two motions are not identical because of intra-person variations, which brings challenges for KAC.
KAC. As an online algorithm, KAC utilizes the sliding-window strategy. Each window $X_{a_j+n_t-T_m:a_j+n_t-1}$ is sequentially processed, starting from $n_1 = 2T_m$, $a_1 = c_i$, where $a_j$ is the $j$th action unit cut. $T_m$ is a parameter giving the minimal length of one action unit. We empirically find that results are insensitive to $T_m$.
For each window $X_{a_j+n_t-T_m:a_j+n_t-1}$, this process has two branches. Either the last action unit continues: $n_{t+1} = n_t + \Delta T_m$; or there is a new action unit: $a_{j+1} = a_j + n_t - T_m$, $n_{t+1} = 2T_m$. Here $\Delta T_m$ is the step length. This process ends when a new cut point $c_{i+1}$ is received. Deciding whether $X_{a_j+n_t-T_m:a_j+n_t-1}$ is the start of a new unit or not can be formulated as,
$$ S_t = S_{Align}(X_{a_j:a_j+T_m-1}, X_{a_j+n_t-T_m:a_j+n_t-1}), \qquad H_0: S_t \le \tau_t : a_j + n_t - T_m \text{ is a new unit}, \quad H_A: S_t > \tau_t : \text{not } H_0 \quad (15) $$
where $S_{Align}(\cdot)$ is the metric measuring the structural similarity between $X_{a_j:a_j+T_m-1}$ and $X_{a_j+n_t-T_m:a_j+n_t-1}$, to handle intra-person variations. $\tau_t$ is an adaptive threshold (empirically set by cross-validation) and ideally should be


close to zero if the alignment can perfectly leverage the variations. Similar to KTC-R, KAC has delay $T_m$. In particular, KAC uses dynamic time warping (DTW) [58], [33] to design $S_{KAC}(\cdot)$ by minimizing the following loss function based on the kernel from eq. 6,
$$ S_{KAC}(K_{a_j:a_j+T_m-1}^{a_j+n_t-T_m:a_j+n_t-1}; W_1, W_2) \quad (16) $$
where $K$ is the cross-kernel matrix for the two segments, and $W_1$ and $W_2$ are binary temporal warping matrices encoding the temporal alignment path as shown in [58]. Interested readers are referred to [58], [33] for more details about $S(\cdot)$. Eq. 16 can be optimized by dynamic programming with complexity $O(T_m^2)$, and $S_{KAC}(\cdot)$ measures the similarity between the current action unit (a part of it) and the current window. Importantly, alignment methods such as DTW are not suitable for eq. 12. This is because alignment requires two segments to have roughly the same starting and ending points, which does not hold in eq. 12.
KTC-H. By combining KTC-R and KAC, we can sequentially and simultaneously capture action transitions (cuts) and action units in the integrated algorithm KTC-H. Formally, KTC-H uses a two-layer sliding-window strategy, i.e., the outer loop (sec. 3.4) estimates $c_i$ and the inner loop estimates $a_j$ between $c_i$ and the current frame from the outer loop. Since KTC-R (eq. 12) and KAC (eq. 15) both have fixed delays ($T_0$ and $T_m$), KTC-H is suitable for realtime transition and action unit segmentation.
Discussion. We compare with several related algorithms: (1) Spectral clustering [32] can be extended to temporal clustering if only temporal cuts are allowed (TSC) [37]. Similarly, minimizing eq. 5 can be viewed as an instance of TSC motivated by embedding distributions into an RKHS. (2) PPCA-CD is proposed in [38] to model motion segments by Gaussian models, where CD stands for change-point detection. Compared to [38], KTC has a higher computational cost but gains the ability to handle nonparametric distributions. (3) KTC is similar to KCpA [30], which uses the novel kernel Fisher discriminant ratio. Compared to [30], KTC performs change-point detection by using the incremental sliding window. More importantly, KTC detects both change-points and cyclic structures. This is crucial for online recognition, since an action can be recognized after only one unit instead of the whole action. (4) As an elegant extension of Kernel K-means and spectral clustering, ACA is proposed in [33] for offline temporal clustering. KTC can be viewed as an online complementary approach to [33]. (5) Differing from the online two one-class SVM strategy KCD in [29], the hypothesis testing in KTC has null distributions thanks to the embedding of distributions [9].
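The following Python sketch illustrates the flavor of the KTC-R loop (eqs. 9-12): an incrementally growing window is tested with an MMD-style two-sample statistic at the candidate cut $n_t + T_t - T_0$. It is a simplified illustration, not the authors' implementation: it uses a plain Gaussian kernel instead of the full spatio-temporal kernel $k_{KTC}$, and a fixed threshold tau in place of the Rademacher-bound-based adaptive threshold; all parameter values are illustrative assumptions.

```python
import numpy as np

def gauss_gram(A, B, lam=1e-2):
    """Gaussian kernel matrix between the columns of A and B."""
    d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-lam * d2)

def neg_mmd2(X, Y):
    """L_t-style statistic: negative (biased) squared MMD of two segments."""
    Kxx, Kyy, Kxy = gauss_gram(X, X), gauss_gram(Y, Y), gauss_gram(X, Y)
    return 2 * Kxy.mean() - Kxx.mean() - Kyy.mean()

def ktc_r(X, T0=80, dT=10, tau=-0.5):
    """Simplified KTC-R: incremental window, cut when the statistic drops below tau."""
    Lx, cuts = X.shape[1], []
    n, T = 0, 2 * T0                     # n_1 = 1 (0-based here), T_1 = 2*T0
    while n + T <= Lx and n < Lx - T0:
        split = n + T - T0               # candidate cut n_t + T_t - T0
        stat = neg_mmd2(X[:, n:split], X[:, split:n + T])
        if stat < tau:                   # H_A: the two segments differ -> new cut
            cuts.append(split)
            n, T = split, 2 * T0         # eq. 10
        else:
            T += dT                      # eq. 9: grow the window
    return cuts

# Toy example: two regimes with different means; expect one cut near frame 200.
X = np.hstack([np.random.randn(45, 200), 3 + np.random.randn(45, 220)])
print(ktc_r(X))
```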

4 SPATIO-TEMPORAL MANIFOLD MODEL

As described in Sec. 3.1, we use the structured time series $x_t$ to represent human motion sequences. Although $x_t$ lies in a high dimensional space, the natural properties of human pose suggest that $x_t$ has fewer intrinsic degrees of freedom. Suppose there is a $d$-dimensional submanifold $\mathcal{M}$ embedded in an ambient space of dimensionality $D \gg d$.


We use a latent variable model (LVM) to represent $\mathcal{M}$ as a mapping between the intrinsic space and the ambient space: $f: \mathbb{R}^d \to \mathbb{R}^D$ and $x = f(z) + \epsilon$, where $x \in \mathbb{R}^D$ is the observation variable, $z \in \mathbb{R}^d$ is the latent variable and $\epsilon \in \mathbb{R}^D$ is the noise. In computer vision applications, the mapping function $f$ is often highly non-linear, and the ambient space is the spatial (feature) space, so $\mathcal{M}$ is also called the spatial manifold. To incorporate the temporal dimension into the standard LVM to model human motion time series, we propose a novel framework as follows.
Definition: a spatio-temporal manifold (STM) is a directed traversing path $\mathcal{M}_p$ (with boundary or compact) on a spatial manifold $\mathcal{M}$, further embedded in $\mathbb{R}^D$.
A traversing path $\mathcal{M}_p$ can intuitively be thought of as a point walking on $\mathcal{M}$ from a starting point $(z_{start}, x_{start})$ at time $t_1$ to an ending point $(z_{end}, x_{end})$ at time $t_2$. A path is not just a subset of $\mathcal{M}$ which looks like a curve; it also includes a natural parametrization $g: [0, 1] \to \mathcal{M}$, s.t. $g(0) = z_{start}$ and $g(1) = z_{end}$. So, a new latent variable $\gamma \in [0, 1]$ is associated with every point on this path. Furthermore, the relationship between $\gamma$ and the temporal index $t$ can be modeled as a time series $h: [t_1, t_2] \to [0, 1]$, s.t. $h(t_1) = 0$ and $h(t_2) = 1$. Since $\mathcal{M}$ is embedded in $\mathbb{R}^D$ by $f(\cdot)$, the traversing path (with noise) can essentially be described as a non-linear multivariate time series $x(t) = f(g(h(t))) + \epsilon$.
Under this definition, the structured representation $X_{1:L_x}$ of a human motion sequence is just a sequence of sampled observations on an STM. Here, the ambient space is the joint-position space, the manifold $\mathcal{M}$ is the human pose space, and $\mathcal{M}_p$ is a specific type of human action. The newly introduced variable $\gamma$ is assigned a semantic meaning, indicating the completion degree of an action. For an action sequence including only one action unit, we assume the starting point of the action has $\gamma = 0$ and the ending point has $\gamma = 1$ (for a periodic motion, e.g., walking, this defines a motion cycle). Inferring $\gamma$ from $X_{1:L_x}$ is important for temporal alignment in our approach, and is described in Sec. 5.1. It is notable that this 1D representation $\gamma$ is mainly used for temporal alignment in Sec. 5.2, while the multivariate time series is used in the other steps such as spatial matching and temporal segmentation.
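As an illustrative toy example (assumptions: a 1-D latent pose space, a hand-picked nonlinear embedding f and a nonuniform time parametrization h, none of which are from the paper), the sketch below samples a noisy trajectory x(t) = f(g(h(t))) + eps from such a spatio-temporal manifold.

```python
import numpy as np

L, D = 200, 3                              # frames, ambient dimensionality
t = np.linspace(0.0, 1.0, L)               # normalized temporal index

# h: time -> completion variable gamma in [0, 1] (nonuniform action "speed")
gamma = t ** 2 * (3 - 2 * t)               # smoothstep: slow-fast-slow

def f_of_g(gam):
    """g then f: completion variable -> 1-D latent pose -> ambient space R^3."""
    z = np.pi * gam
    return np.stack([np.cos(z), np.sin(z), 0.5 * z], axis=-1)

eps = 0.01 * np.random.randn(L, D)         # observation noise
X = (f_of_g(gamma) + eps).T                # STM samples, shape (D, L)
print(X.shape)
```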

5 SPATIO-TEMPORAL ALIGNMENT

Given two human action segments $X_{1:L_x} \in \mathbb{R}^{D_x\times L_x}$ (on $\mathcal{M}_p^x$) and $Y_{1:L_y} \in \mathbb{R}^{D_y\times L_y}$ (on $\mathcal{M}_p^y$), both one-action-unit sequences after temporal segmentation, we need to calculate the motion distance score $S(X_{1:L_x}, Y_{1:L_y})$ after proper spatial and temporal alignment. The problem is inherently challenging because of the large spatial/temporal scale differences between human actions, the ambiguity between human poses, and the inter/intra subject variability [58]. We model motion sequence matching as a spatio-temporal alignment problem under the STM framework, and incorporate manifold learning, spatial alignment and temporal alignment together, resulting in Dynamic Manifold Warping (DMW) [59].


5.1 Structure Learning

An important module of the proposed spatio-temporal alignment is structure learning. Given $\{x_t\}_{t=1}^{L}$ as $L$ ordered data points sampled from an STM, the goal of structure learning is to recover the latent completion variable $\gamma_t$ from those samples. Note that our goal is different from most latent variable models, which aim to identify $z$ [12], [13] and sometimes $f(\cdot)$ [17], [18], [19].
Estimating $d_{Geo}(\cdot)$. We use Tensor Voting to calculate the minimum traversing distance between $x_s$ and $x_{s+1}$ ($1 \le s \le L-1$) to approximate the geodesic distance $d_{Geo}(\cdot)$. Tensor Voting is a non-parametric framework proposed to estimate the geometric information of manifolds, as well as the intrinsic dimensionality [16]. Let $x_s(0) = x_s$; we have
$$ d_{Geo}(x_s, x_{s+1}; \mathcal{M}_p) \approx \sum_{r=0}^{R} \|x_s(r) - x_s(r+1)\|_{L_2} \quad (17) $$
where $x_s(r+1)$ is updated from the current point $x_s(r)$ by a first-order Taylor expansion,
$$ x_s(r+1) = x_s(r) + \eta\, J(x_s(r)) J(x_s(r))^T (x_{s+1} - x_s(r)) $$
until $x_s(r+1)$ converges to $x_{s+1}$. $\eta$ is a step length, and $J(x_s(r))$ is the tangent space estimated at $x_s(r)$ by Tensor Voting (local PCA [64] can also be used). [15] uses Tensor Voting to estimate the manifold structure for 3D face tracking in a 126D space, while the temporal index is not explicitly considered. Our algorithm is a revised version of [15] under the STM framework.
Learning $\gamma_t$. A two-stage approach is possible: first estimate $\mathcal{M}$ (or $f(\cdot)$) on a collection of time series, and then optimize $\{\gamma_{1:L}\}$. Nevertheless, we propose a solution which performs direct estimation for an individual sequence based on the learnt geodesic distance,
$$ \gamma_t = \frac{\sum_{s=1}^{t-1} d_{Geo}(x_s, x_{s+1}; \mathcal{M}_p)}{\sum_{s=1}^{L-1} d_{Geo}(x_s, x_{s+1}; \mathcal{M}_p)} \quad (18) $$
Since the traversing path is continuous and smooth, the global geodesic distance is approximately decomposed into the sum of local distances, inspired by ISOMAP [12].
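A minimal sketch of eq. 18: the code below estimates the completion variable gamma_t from the cumulative arc length, using plain Euclidean distances between consecutive frames as a stand-in for the Tensor-Voting-based geodesic distances (an assumption, not the paper's estimator).

```python
import numpy as np

def completion_variable(X):
    """Estimate gamma_t (eq. 18) for a D x L trajectory X.

    Local geodesic distances d_Geo(x_s, x_{s+1}) are approximated here by
    Euclidean steps; the paper refines them with Tensor Voting."""
    steps = np.linalg.norm(np.diff(X, axis=1), axis=0)   # L-1 local distances
    cum = np.concatenate([[0.0], np.cumsum(steps)])      # cumulative arc length
    return cum / cum[-1]                                 # gamma_1 = 0, gamma_L = 1

# Example: a motion that accelerates over time yields a convex gamma(t).
L = 100
X = np.vstack([np.linspace(0, 1, L) ** 2, np.zeros(L)])
gamma = completion_variable(X)
print(gamma[0], gamma[L // 2], gamma[-1])                # 0.0, ~0.25, 1.0
```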

Fig. 4. An illustration of the non-linearity of $\gamma(t)$. Top: the action stretching (Mocap), with 6 samples uniformly distributed over 368 frames; bottom: the estimated latent completion variable. The whole action is decomposed into 5 stages.


Fig. 4 illustrates the latent completion variable learned from a stretching sequence within one action unit, i.e., from the action's start to its end. We use the CMU Mocap data [65] in this experiment, and $M = 15$ key points are used to represent the human body, resulting in joint 3D trajectories in 45D. These 15 key points are extracted from the amc and asf files by our joint-angle to joint-position conversion algorithm. We uniformly divide the sequence into 5 stages along the time index. The dynamic variations in stages 2 and 4 are larger than in the others; these two correspond to stretching and folding the arms. Stage 3 has the smallest variation, because it corresponds to the peak state of a stretch, i.e., there is almost no arm movement.
5.2 Temporal Alignment
The temporal alignment part of DMW is called Dynamic Manifold Temporal Warping (DMTW). DMTW is the combination of manifold learning and Dynamic Time Warping (DTW), and can be applied to any temporal data with latent spatial structure.
Formulation. Given two time series $X_{1:L_x} \in \mathbb{R}^{D_x\times L_x}$ and $Y_{1:L_y} \in \mathbb{R}^{D_y\times L_y}$, find the optimal alignment path $Q = [q_1, q_2, ..., q_L] \in \mathbb{R}^{2\times L}$ by minimizing the following loss function ($\|\cdot\|_F$ is the Frobenius norm),
$$ L_{DMTW}(F_x(\cdot), F_y(\cdot), W_x, W_y) = \|F_x(X_{1:L_x}) W_x^T - F_y(Y_{1:L_y}) W_y^T\|_F^2 \quad (19) $$
where $W_x = \{w^x_{t,t_x}\} \in \{0,1\}^{L\times L_x}$ and $W_y = \{w^y_{t,t_y}\} \in \{0,1\}^{L\times L_y}$ are binary selection matrices encoding the temporal alignment path $Q$ [58]. $w^x_{t,t_x} = w^y_{t,t_y} = 1$ is equivalent to $q_t = [t_x\ t_y]^T$, which means that $x_{t_x}$ corresponds to $y_{t_y}$ at step $t$ of the alignment path. $F(\cdot)$ maps $X_{1:L_x}$ and $Y_{1:L_y}$ to a shared subspace of the same dimensionality. Essentially, $F_x(\cdot)$ and $F_y(\cdot)$ are spatial mapping functions and $W_x$ and $W_y$ are temporal warping matrices.
If $F(\cdot)$ is the identity function, then $L_{DMTW}$ reduces to $\|X_{1:L_x} W_x^T - Y_{1:L_y} W_y^T\|_F^2$, which is equivalent to performing standard DTW directly on $X_{1:L_x}$ and $Y_{1:L_y}$. Unlike the alternating iterative algorithm for optimizing $L_{DMTW}$, i.e., optimizing $W$ with fixed $F$ and then optimizing $F$ with fixed $W$, we propose a two-step approach without iterative computation. Instead of optimizing $F_x$, $F_y$ in eq. 19, we directly estimate them under the STM framework.
Step 1. Under the STM model of section 4, we choose $F_x(X_{1:L_x})$ to be $\gamma^x_{1:L_x} \in \mathbb{R}^{1\times L_x}$ and $F_y(Y_{1:L_y})$ to be $\gamma^y_{1:L_y} \in \mathbb{R}^{1\times L_y}$. $\gamma_t$ represents the universal structure for all STMs, making it possible to align two sequences of different actions. If the sequence is training data (i.e., Mocap), then the methods in sec. 5.1 can be used. Otherwise, instead of performing the variable-length path estimation, we can directly estimate $d_{Geo}(\cdot)$ by using a fixed-length (i.e., 1 or 2) traversing path, without re-performing Tensor Voting at each step. After learning $d_{Geo}(\cdot)$ and combining with eq. 18, we obtain the estimated results for $\gamma^x_{1:L_x}$ and $\gamma^y_{1:L_y}$, denoted as $\hat{\gamma}^x \in \mathbb{R}^{1\times L_x}$ and $\hat{\gamma}^y \in \mathbb{R}^{1\times L_y}$.


Step 2. Replacing $F_x(\cdot)$ and $F_y(\cdot)$ with $\hat{\gamma}^x$ and $\hat{\gamma}^y$ in eq. 19, $L_{DMTW}$ reduces to the following formulation,
$$ L_{DMTW}(W_x, W_y) = \|\hat{\gamma}^x W_x^T - \hat{\gamma}^y W_y^T\|_F^2 \quad (20) $$
This is equivalent to performing DTW in the transformed domain, i.e., on $\hat{\gamma}^x$ and $\hat{\gamma}^y$. The temporal alignment matrix $A = \{a_{t_x,t_y}\}$ is defined as $a_{t_x,t_y} = (\hat{\gamma}^x_{t_x} - \hat{\gamma}^y_{t_y})^2$, which is a compact representation of $\hat{\gamma}^x$ and $\hat{\gamma}^y$. Optimizing eq. 20 results in a variable-length path (varying from $\max(L_x, L_y)$ to $L_x + L_y - 1$), which is not proper for a similarity metric. Thus, referenced DTW is proposed to fix the path length by setting one warping matrix to the identity,
$$ \|\hat{\gamma}^x I_{L_x} - \hat{\gamma}^y W_y^T\|_F^2 \quad (21) $$
where $I_{L_x}$ is the identity matrix. $X_{1:L_x}$ is chosen as the reference sequence, and $Y_{1:L_y}$ is aligned to $X_{1:L_x}$ by the warping matrix $W_y \in \mathbb{R}^{L_x\times L_y}$. The path $Q$ in eq. 21 has fixed length $L_x$. Since $\hat{\gamma}^x$ and $\hat{\gamma}^y$ are monotonically increasing sequences, dynamic programming provides an extremely efficient solution ($O(L_x L_y)$) to optimize $Q$ ($W_y$).
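The sketch below illustrates Step 2 (eqs. 20-21) under the assumption that the two completion sequences have already been estimated: a DTW-style dynamic program over the cost a_{tx,ty} = (gamma_x[tx] - gamma_y[ty])^2, constrained so that every reference frame appears exactly once (referenced DTW). It is an illustrative re-implementation, not the authors' code.

```python
import numpy as np

def referenced_dtw(gamma_x, gamma_y):
    """Align gamma_y to the reference gamma_x (eq. 21).

    Returns corr[t] = index of the frame of Y matched to reference frame t,
    a monotonically non-decreasing correspondence of fixed length Lx."""
    Lx, Ly = len(gamma_x), len(gamma_y)
    A = (gamma_x[:, None] - gamma_y[None, :]) ** 2       # costs a_{tx,ty}
    D = np.full((Lx, Ly), np.inf)                        # best cost, frame i -> column j
    P = np.zeros((Lx, Ly), dtype=int)                    # backpointers
    D[0] = A[0]
    for i in range(1, Lx):
        best = np.minimum.accumulate(D[i - 1])           # min over columns <= j
        argbest = np.zeros(Ly, dtype=int)
        for j in range(1, Ly):
            argbest[j] = argbest[j - 1] if best[j - 1] <= D[i - 1, j] else j
        D[i] = A[i] + best
        P[i] = argbest
    corr = np.zeros(Lx, dtype=int)
    corr[-1] = int(np.argmin(D[-1]))
    for i in range(Lx - 1, 0, -1):                       # trace the path back
        corr[i - 1] = P[i, corr[i]]
    return corr

# Example: a short reference and a longer sequence of the same action.
gx = np.linspace(0, 1, 20)
gy = np.linspace(0, 1, 35) ** 1.5                        # same action, different pacing
print(referenced_dtw(gx, gy))
```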


Fig. 5. Temporal Alignment Results. DMTW is compared with DTW and CTW. The reference sequence is shown in the first row, followed by the aligned results. Two red arrows indicate two key states in the reference sequence, i.e., the peaks of the first and second boxing. The aligned sequence also has two red arrows, indicating the peaks of the first and second jump. DMTW is able to align the two peak states in the jumping sequence to the peak states in the boxing sequence very well.
Results. The proposed DMTW algorithm (eq. 19) is compared with other state-of-the-art algorithms. In particular, Dynamic Time Warping (DTW) [53], [55] is chosen as the baseline algorithm and Canonical Time Warping (CTW) [58] is chosen as the alternative method. To make the comparison clearer, the sequences may include more than one action unit. Fig. 5 shows the visual comparison for two motion sequences, one boxing (twice) and the other side jumping (twice). DTW does not consider the spatial transformation, making it difficult to align two motion sequences from two people. CTW significantly outperforms DTW. Our DMTW obtains the best results among the three methods. It is notable that our temporal alignment step does not involve spatial matching (unlike CTW). More visual comparison results are not provided due to lack of space; DMTW achieves similar performance in all experiments.
While the objective function of DMTW (eq. 19) is inspired by CTW [58], key differences exist. CTW uses a linear $F(\cdot)$, and its optimization process may reach a local extremum since the objective function is non-convex. In DMTW, $F(\cdot)$ is chosen as the non-linear mapping onto the latent completion variable (sec. 5.1), which guarantees a global solution. It is notable that CTW does not need the smooth manifold assumption, and thus has more general applications than DMTW, while DMTW focuses on time series with intrinsic manifold structure.
DMTW is also related to Profile Models [66]. Although the ideas of the Profile and $\gamma_t$ seem similar, they differ in many aspects. In particular, Profile Models need multiple training examples, and the size of the discrete Profile space increases exponentially with the precision requirement, which is not only computationally impractical but also causes over-fitting. In contrast, DMTW does not need a training stage, and $\gamma_t$ is continuous in nature.
5.3 Temporally Local Spatial Alignment
After temporal alignment, spatial alignment is performed to leverage the subjects' variability, i.e., body-skeleton scale variations between different people, or viewpoint variations. In particular, we propose Dynamic Manifold Spatial Warping (DMSW), which has the following framework,
$$ D_{DMSW}(X_{t_1:t_2}, Y_{t_1:t_2}) = \|V_x(U(X_{t_1:t_2})) - V_y(U(Y_{t_1:t_2}))\|_F^2 \quad (22) $$

where $X_{t_1:t_2} \in \mathbb{R}^{D_x\times(t_2-t_1+1)}$ are the consecutive frame features $x_{t_1}$ to $x_{t_2}$ in the reference sequence, and $Y_{t_1:t_2} \in \mathbb{R}^{D_y\times(t_2-t_1+1)}$ are the temporally corresponding samples in the aligned sequence. $V_x(\cdot)$ is the spatial alignment function (same for $V_y(\cdot)$) and $U(\cdot)$ is the pre-defined feature extraction function. Spatial alignment is restricted to temporally local segments (from $t_1$ to $t_2$), since global matching on entire sequences is often not accurate due to non-linear variations. How to set $V(\cdot)$ is explained in the following part, and $U(\cdot)$ is discussed around eq. 25.
Denoting the features extracted by $U(\cdot)$ as two zero-mean feature sets, $U_x \in \mathbb{R}^{d_1\times n}$ and $U_y \in \mathbb{R}^{d_2\times n}$, we consider an unsupervised learning approach, i.e., Canonical Correlation Analysis (CCA), in which a pair of linear alignment matrices is optimized in the sense of maximizing the correlation $E(\cdot)$ of the transformed features as follows,
$$ E(V_x, V_y) = \mathrm{Tr}(V_x^T U_x (V_y^T U_y)^T), \quad \text{s.t. } V_x^T U_x U_x^T V_x = V_y^T U_y U_y^T V_y = I_d \quad (23) $$
where $V_x \in \mathbb{R}^{d_1\times d}$ and $V_y \in \mathbb{R}^{d_2\times d}$ are two linear spatial alignment matrices for $U_x$ and $U_y$, and $I_d$ is the identity matrix of size $d\times d$. $\mathrm{Tr}(\cdot)$ is the trace operator. Maximizing this objective function is equivalent to solving a generalized eigenvalue problem [50].
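A compact sketch of eqs. 23-24, assuming the local feature sets are already extracted and zero-mean: CCA via whitening plus SVD yields the alignment matrices V_x, V_y, and the DMSW distance is the Frobenius distance between the projected features. Variable names, the regularization term and the toy data are illustrative assumptions.

```python
import numpy as np

def cca_align(Ux, Uy, d=2, reg=1e-6):
    """Solve eq. 23 for zero-mean feature sets Ux (d1 x n) and Uy (d2 x n)."""
    n = Ux.shape[1]
    Cxx = Ux @ Ux.T / n + reg * np.eye(Ux.shape[0])      # auto-covariances
    Cyy = Uy @ Uy.T / n + reg * np.eye(Uy.shape[0])
    Cxy = Ux @ Uy.T / n                                  # cross-covariance
    iLx = np.linalg.inv(np.linalg.cholesky(Cxx))         # whitening transforms
    iLy = np.linalg.inv(np.linalg.cholesky(Cyy))
    A, _, Bt = np.linalg.svd(iLx @ Cxy @ iLy.T)          # correlations via SVD
    Vx = iLx.T @ A[:, :d]                                # d1 x d, Vx'CxxVx = I
    Vy = iLy.T @ Bt.T[:, :d]                             # d2 x d, Vy'CyyVy = I
    return Vx, Vy

def dmsw_distance(Ux, Uy, d=2):
    """D_DMSW (eq. 24): distance between CCA-aligned local features."""
    Vx, Vy = cca_align(Ux, Uy, d)
    return np.linalg.norm(Vx.T @ Ux - Vy.T @ Uy, "fro") ** 2

# Toy example: two "views" of the same local segment with different dimensions.
Ux = np.random.randn(45, 60)
Uy = 0.8 * Ux[:30] + 0.05 * np.random.randn(30, 60)      # partial, rescaled view
Ux = Ux - Ux.mean(axis=1, keepdims=True)
Uy = Uy - Uy.mean(axis=1, keepdims=True)
print(dmsw_distance(Ux, Uy))
```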



The metric can be induced in the transformed domain as,
$$ D_{DMSW}(X_{t_1:t_2}, Y_{t_1:t_2}) = \|V_x^T U_x - V_y^T U_y\|_F^2 \quad (24) $$
where $V_x \in \mathbb{R}^{d_1\times d}$ and $V_y \in \mathbb{R}^{d_2\times d}$ are the solutions of eq. 23. Eq. 24 can handle two feature sets with different dimensionalities, making alignment between 2D and 3D input possible. Both DMSW and CTW use CCA, but key differences exist. Spatial alignment in DMSW is restricted to temporally local manifolds, since global linear matching on entire sequences (CTW) is often not accurate due to non-linear variations. But this global matching is not necessarily a disadvantage: CTW can provide dimension reduction results, which is useful in some applications.
In short, DMW (DMTW and DMSW) extends previous works in two ways: (i) it combines temporal alignment with manifold learning, and (ii) it performs local CCA on temporally aligned local segments.
Based on the proposed DMTW for temporal alignment and DMSW for spatial alignment, we further propose two types of motion distance functions by choosing two feature extraction functions $U(\cdot)$. In particular, instead of treating $x_t \in \mathbb{R}^{D_x\times 1}$ (or $y_t$) as a multi-dimensional vector, the implicit structure of the joint-position space is considered. In sec. 4, $x_t = [p^t_1, ..., p^t_M]^T \in \mathbb{R}^{3M\times 1}$; the 3D Euclidean space is implicitly embedded in the joint-position space $\mathbb{R}^{3M}$. Thus, we reformulate $x_t$ as,
$$ N_t = \begin{bmatrix} p^t_{11} & \dots & p^t_{M1} \\ p^t_{12} & \dots & p^t_{M2} \\ p^t_{13} & \dots & p^t_{M3} \end{bmatrix} \in \mathbb{R}^{3\times M} \quad (25) $$
which amounts to $M$ samples in $\mathbb{R}^3$ (a similar operation maps $x_t \in \mathbb{R}^{3K}$ to $N_t \in \mathbb{R}^{3\times K}$, $K \le M$, as used in sec. 6.1). This operation is denoted as $T_3: \mathbb{R}^{3M} \to \mathbb{R}^{3\times M}$ (same for $\mathbb{R}^{3K} \to \mathbb{R}^{3\times K}$). It is notable that this operation can also be performed for joints in 2D, as $T_2: \mathbb{R}^{2K} \to \mathbb{R}^{2\times K}$. Thus, we can align 2KD video tracks (with noise) with 3MD Mocap sequences, which is not addressed by previous works.
The first feature extraction function is chosen as $U_1(x_t) = T(x_t)$, which is the static pose feature (joint positions in matrix formulation). The second is $U_2(x_t, x_{t+1}) = T(x_t) - T(x_{t+1})$, which is the motion pose feature between two consecutive frames. Thus, the final similarity score $S_1(X_{1:L_x}, Y_{1:L_y})$ given by the static features is as follows,
$$ S_1 = \sum_{t=1}^{L_x} D_{DMSW}(T(x_t), T(\hat{y}_t)) \quad (26) $$

where $\hat{y}_t$ is the temporally corresponding frame estimated by eq. 19. The similarity score $S_2(X_{1:L_x}, Y_{1:L_y})$ given by the motion features is as follows,
$$ S_2 = \sum_{t=1}^{L_x} D_{DMSW}(T(x_t) - T(x_{t+1}), T(\hat{y}_t) - T(\hat{y}_{t+1})) \quad (27) $$

Fig. 6. Examples of Online Temporal Segmentation. Top (depth sensor): a sequence including 3 segments, walking, boxing and jumping; noisy joint trajectories are tracked by OpenNI. Middle (Mocap): a sequence with 4579 frames and 7 action segments. Bottom (video): a clip including walking and running. For all cases, KTC-R achieves the highest accuracy.

These two scores can be linearly combined,
$$ S_{DMW}(X_{1:L_x}, Y_{1:L_y}) = \alpha\, S_1(X_{1:L_x}, Y_{1:L_y}) + (1-\alpha)\, S_2(X_{1:L_x}, Y_{1:L_y}) \quad (28) $$
where $\alpha \in [0, 1]$ can be either optimized by cross-validation in the supervised setting (i.e., recognition), or chosen manually in the unsupervised setting (i.e., clustering). Eq. 28 summarizes eqs. 18, 19, 23 and the two feature extraction functions. The similarity metric is not symmetric, so we set the testing sequence to be the reference sequence.
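As a small end-to-end illustration of eqs. 25-28 (not the authors' pipeline), the sketch below reshapes each frame into its 3 x M joint matrix, forms static and motion pose features for temporally corresponding frames, scores them with a simple Frobenius distance after mean-centering (a stand-in for the local CCA-based D_DMSW), and blends the two scores with a weight alpha.

```python
import numpy as np

def T3(x):
    """Eq. 25: reshape a 3M joint-position vector into a 3 x M matrix."""
    return x.reshape(-1, 3).T

def local_dist(A, B):
    """Stand-in for D_DMSW: Frobenius distance after removing the mean joint
    (the paper uses temporally local CCA alignment here)."""
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    return np.linalg.norm(A - B, "fro") ** 2

def s_dmw(X, Y_aligned, alpha=0.5):
    """Eqs. 26-28: combined static (S1) and motion (S2) similarity scores.
    X, Y_aligned are 3M x Lx arrays; Y_aligned[:, t] corresponds to X[:, t]."""
    Lx = X.shape[1]
    s1 = sum(local_dist(T3(X[:, t]), T3(Y_aligned[:, t])) for t in range(Lx))
    s2 = sum(local_dist(T3(X[:, t]) - T3(X[:, t + 1]),
                        T3(Y_aligned[:, t]) - T3(Y_aligned[:, t + 1]))
             for t in range(Lx - 1))
    return alpha * s1 + (1 - alpha) * s2

# Toy example: a reference action and a noisy, translated copy of it.
X = np.random.rand(45, 60)
Y = X + 0.02 * np.random.randn(45, 60) + 0.3
print(s_dmw(X, Y))
```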

6 EXPERIMENTAL RESULTS

We quantitatively evaluate the performance of the proposed approach on CMU Motion Capture data (Mocap), HumanEva-2 [67] and depth sensor data. These data are chosen to demonstrate the general capability of our algorithms for human motion analysis, as well as the advantages for continuous action recognition in the transfer learning module. We investigate the performance of realtime action segmentation across different data sets in comparison to alternative methods. We also compare action recognition on Mocap data to other state-of-the-art alignment algorithms. In particular, we can online recognize the actions of an arbitrary person from an arbitrary viewpoint, given realtime continuous depth sensor input.
6.1 Segmentation Results
In this section, a quantitative comparison of online temporal segmentation methods is provided. KTC-R is compared



KTC-R is compared with other state-of-the-art methods, i.e., PPCA-CD [38] and TSC-CD [36], [37], where TSC-CD is a change-point detection algorithm based on temporal spectral clustering in our implementation. PPCA-CD uses the same incremental sliding-window strategy as sec. 3.4, and TSC-CD uses the fixed-length sliding window of sec. 3.3. Thresholds (e.g., the threshold for KTC-R and the thresholds for the other methods) are set by cross-validation on one sequence. Methods like ACA [33] and [34] cannot be directly compared since they are offline. Results are evaluated by three metrics: precision, recall and rand index. The first two are for cut points and the last one is for all frames. The ground-truth for rand index (RI) labels the frames of different segments with consecutive numbers 1, 2, 3, ... Importantly, T0 is set to 80, 250 and 60 for depth sensor, Mocap and video respectively, giving KTC-R a delay of 2.3, 2.1 and 1 seconds. Results are very robust to T0 and T. For instance, we obtained almost identical results when T0 ranges from 60 to 120 on the OpenNI data. Furthermore, KTC-S achieves accuracy similar to KTC-R but with a longer delay, thus KTC-R is preferred.

TABLE 1
Temporal Segmentation Results Comparison. Precision (P), recall (R) and rand index (RI) are reported.

Methods             Depth                       Mocap                       Video
PPCA-CD (online)    0.73(P)/0.78(R)/0.80(RI)    0.85(P)/0.90(R)/0.90(RI)    -
TSC-CD (online)     0.77(P)/0.81(R)/0.81(RI)    0.83(P)/0.86(R)/0.88(RI)    0.78(P)/0.85(R)/0.82(RI)
KTC-S (online)      0.88(P)/0.91(R)/0.89(RI)    0.87(P)/0.90(R)/0.91(RI)    0.85(P)/0.89(R)/0.87(RI)
KTC-R (online)      0.87(P)/0.93(R)/0.88(RI)    0.86(P)/0.91(R)/0.92(RI)    0.85(P)/0.92(R)/0.88(RI)

Depth Sensor. To validate online temporal segmentation on depth sensor data, 10 human motion sequences are captured with the PrimeSense sensor. Each sequence is a combination of 3 to 5 actions (e.g., walking to boxing) with a length of around 700 frames (30Hz). For human pose tracking, we use the available OpenNI tracker to automatically track joints on the human skeleton. $K \in [12, 15]$ key points are tracked, resulting in joint 3D positions $x_t$ in $\mathbb{R}^{36}$ to $\mathbb{R}^{45}$. Although the pose tracking results are often noisy (Fig. 6 and Fig. 8), we can correctly estimate action transitions from these noisy tracking results. In particular, KTC-R (T = 30) significantly improves the accuracy over the other methods (Table 1). The main reason is that the joint positions of noisy tracked joints have complex nonparametric structures, which are handled by the kernel embedding of distributions [10], [11], [9] in KTC.

KTC-H. Besides action transitions, results on detecting both cyclic motions and transitions are reported by performing KTC-H ($T_m$ = 50, $\Delta T_m$ = 1). Since the other methods do not have this module, we report a quantitative comparison of online hierarchical segmentation by using either KTC-H or the other methods combined with our KAC algorithm of sec. 3.5. Results show (Table 3) that KTC-H achieves higher accuracy than the other combinations. It is notable that, because of the nature of RI, the RI metric increases when the number of cuts increases, even for low P/R, which is the case for hierarchical segmentation (which includes two types of cuts).
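As an illustration of the frame-level metric, the sketch below converts estimated cut points into per-frame segment labels and computes the (unadjusted) rand index against the ground-truth labeling described above; this is illustrative code of ours, not the evaluation script behind the reported numbers:

```python
import numpy as np
from itertools import combinations

def cuts_to_labels(cuts, num_frames):
    """Label every frame with a consecutive segment id 1, 2, 3, ...
    given the detected cut frames (segment start indices, excluding frame 0)."""
    labels = np.zeros(num_frames, dtype=int)
    boundaries = [0] + sorted(cuts) + [num_frames]
    for seg_id, (a, b) in enumerate(zip(boundaries[:-1], boundaries[1:]), start=1):
        labels[a:b] = seg_id
    return labels

def rand_index(gt, pred):
    """Fraction of frame pairs on which the two segmentations agree
    (same-segment vs. different-segment)."""
    agree = sum((gt[i] == gt[j]) == (pred[i] == pred[j])
                for i, j in combinations(range(len(gt)), 2))
    return agree / (len(gt) * (len(gt) - 1) / 2)

# Tiny usage example with made-up cut points:
gt   = cuts_to_labels([100, 250], 400)
pred = cuts_to_labels([ 95, 260], 400)
print(round(rand_index(gt, pred), 3))
```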

Mocap. Similar to [38], [33], M = 14 joints are used to represent the human skeleton, resulting in a quaternion representation of the joint angles in 42D. Online temporal segmentation methods are tested on 14 selected Mocap sequences from subject 86 in the CMU Mocap database. Each sequence is a combination of roughly 10 action segments, and in total there are around $10^5$ frames (120Hz). Since our implementation of PPCA-CD differs from [38] (e.g., only a forward pass is allowed in our experiments), results are not the same as in [38]. Table 1 shows that the gain of KTC-R (T = 50) over the other methods on Mocap is reduced compared with the depth sensor data. This is because the Gaussian property is more likely to hold for the quaternion representation of noiseless Mocap data, which is not the case for real data in general.

Video. Furthermore, KTC-R is performed on a number of sequences from HumanEva-2, a benchmark for human motion analysis [67]. Silhouettes are extracted by background subtraction, resulting in a sequence of binary masks (60Hz). $x_t \in \mathbb{R}^{D_t}$ is set as the vector representation of the mask at frame $t$. Notably, $D_t$ (the size of the mask) may differ between frames, so PPCA-CD cannot be applied. This fact supports the advantage of KTC, which is applicable to complex sequential data as long as a (pseudo) kernel can be defined. In particular, we follow [33] and compute the matching distance between silhouettes to set the kernel. Results are shown in Fig. 6 and Table 1. As a reference, the state-of-the-art offline temporal clustering method ACA achieves higher accuracy than KTC-R on Mocap (96% precision). However, offline methods (1) are not suitable for real-time applications, and (2) require the number of clusters (segments) to be set in advance, which is not applicable in many cases.

6.2 Recognition Results

Fig. 7. Action recognition on Mocap. One Mocap example for each action.

We collected 3978 frames from CMU Mocap [65], capturing fifteen people performing 10 natural actions (details in Fig. 7). For action recognition, we use the leave-one-out procedure for each sequence, i.e., each sequence is treated as unlabeled and associated with all other sequences.
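The two rates reported in Table 2 can be read as a sequence-level and a frame-weighted accuracy under this leave-one-out protocol; a small sketch, where `classify` is a hypothetical stand-in for the DMW-based labeling of eqs. 28-29:

```python
def leave_one_out_rates(labels, lengths, classify):
    """labels[i], lengths[i]: ground-truth action and frame count of sequence i.
    classify(i) returns the label predicted for sequence i when every other
    sequence serves as labeled data. Returns (Rate(S), Rate(F)) as in Table 2."""
    preds = [classify(i) for i in range(len(labels))]
    correct = [p == t for p, t in zip(preds, labels)]
    rate_s = sum(correct) / len(labels)
    rate_f = sum(l for l, c in zip(lengths, correct) if c) / sum(lengths)
    return rate_s, rate_f
```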


TABLE 2
Action Recognition Rates on Mocap. Rate is measured by # of sequences (S) or # of frames (F).

Methods             Rate (S)   Rate (F)
DTW+DMSW (3MD)      60%        62%
CTW+DMSW (3MD)      85%        91%
DMW (3MD)           95%        99%
DMW (2KD)           90%        87%

Since each person performs a specific action only once, the recognition process cannot benefit from the fact that the same person repeating the same action yields a very high similarity. $\alpha$ in eq. 28 is set to 0.5, and results (Table 2) show that our approach misclassifies only 5% of the sequences, or 1.2% when weighting by the number of frames.

To investigate how temporal alignment affects recognition, results obtained with DTW [53] and CTW [58] are also provided in Table 2. To make a fair comparison, only the temporal alignment step is changed. Results show that this change reduces accuracy significantly, which supports the effectiveness of DMW not only in temporal alignment (Fig. 5) but also in action recognition (quantitatively). Furthermore, to demonstrate the ability to recognize actions from an arbitrary 2D view, Mocap sequences are linearly projected to joint 2D trajectories in 30D space using a synthetic camera (without occlusion, K = M = 15). We achieve 90% accuracy on this 2D-view recognition. [45] (HMM+Adaboost) also reports a recognition rate for Mocap data, but it requires a large number of training sequences, and 2D-view recognition is not included.
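The 2D-view experiment can be reproduced in spirit with a simple synthetic camera. The sketch below rotates the skeleton about the vertical axis and applies an orthographic projection; the paper only states that the projection is linear, so the camera model and angle here are our assumptions:

```python
import numpy as np

def project_to_2d(joints_3d, yaw_deg=30.0):
    """joints_3d: (T, M, 3) array of M joint positions per frame.
    Returns (T, 2*M) joint 2D trajectories (30D when M = 15), obtained by
    rotating the skeleton about the vertical (y) axis and dropping depth."""
    a = np.deg2rad(yaw_deg)
    R = np.array([[ np.cos(a), 0.0, np.sin(a)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(a), 0.0, np.cos(a)]])
    rotated = joints_3d @ R.T          # (T, M, 3)
    xy = rotated[..., :2]              # orthographic: keep x, y; drop z (depth)
    return xy.reshape(len(joints_3d), -1)

# e.g., 100 frames of a 15-joint skeleton -> (100, 30)
print(project_to_2d(np.random.rand(100, 15, 3)).shape)
```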

6.3 Joint Segmentation and Recognition

The proposed approach in sec. 5 can also be used to recognize actions from 2.5D depth sensor input, without an extra training process, in the transfer learning module. Furthermore, we combine the temporal segmentation of sec. 3 with the action recognition method of sec. 5 to build an online system for sequential action segmentation and recognition.

Following the same approach as in sec. 6.1, we use OpenNI [68] to get the 36D to 45D time series $Y^{1:L_y} \in \mathbb{R}^{3K \times L_y}$ ($K \le M$) from the depth sensor. The online recognition process for $Y^{1:L_y}$ is then performed by the following procedure:

(1) Temporal Segmentation. Use the algorithms in sec. 3 to sequentially cut $Y^{1:L_y}$ into action units $Y^{a_j : a_{j+1}-1} \in \mathbb{R}^{3K \times (a_{j+1} - a_j)}$ ($j = 1, 2, 3, \ldots$, $a_1 = 1$).

(2) Structure Learning. Use the algorithm in sec. 5.1 on the current action unit $Y^{a_j : a_{j+1}-1}$.

(3) Temporal Alignment. Use the proposed DMTW algorithm (eq. 19) to get the temporally corresponding $\tilde{X}^{1:a_{j+1}-a_j} \in \mathbb{R}^{3M \times (a_{j+1}-a_j)}$ from a labeled Mocap sequence $X^i_{mocap}$.

(4) Spatial Alignment. Select the $K$ markers from $\tilde{X}^{1:a_{j+1}-a_j}$ that correspond to the OpenNI joints, resulting in $\tilde{X}_K^{1:a_{j+1}-a_j} \in \mathbb{R}^{3K \times (a_{j+1}-a_j)}$. Only features from $\tilde{X}_K^{1:a_{j+1}-a_j}$ are selected to match $Y^{a_j : a_{j+1}-1}$ by using DMSW (eq. 24), since the information of the remaining $M - K$ markers is missing from the OpenNI tracking results.

(5) Motion Distance. $S_{DMW}(Y^{a_j : a_{j+1}-1}, X^i_{mocap})$ is calculated using eq. 28.

(6) Assume there are $N$ labeled motion sequences $\{X^i_{mocap}\}_{i=1}^N$ associated with action labels $I^i \in \mathcal{I}$, where $\mathcal{I} = \{1, 2, \ldots, C\}$ indicates $C$ action classes. The estimated action label $I^y_j$ for the current action unit is given by

$$I^y_j = \arg\min_{i \in \{1, 2, \ldots, N\}} S_{DMW}\big(Y^{a_j : a_{j+1}-1}, \{X^i_{mocap}, I^i\}\big) \qquad (29)$$
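A minimal sketch of the online loop in steps (1)-(6). Segmentation and similarity are abstracted as callables (`ktc_segment`, `dmw_similarity`), and only the nearest-neighbor label assignment of eq. 29 is made explicit; all names are ours rather than from a released implementation:

```python
def recognize_stream(Y, ktc_segment, dmw_similarity, labeled_mocap):
    """Online joint segmentation and recognition (steps 1-6).
    Y              -- 3K x Ly depth-sensor time series (e.g., a NumPy array of OpenNI tracks)
    ktc_segment    -- yields (a_j, a_{j+1}) action-unit boundaries online (sec. 3)
    dmw_similarity -- S_DMW(action_unit, mocap_seq), eq. 28, after DMTW/DMSW alignment
    labeled_mocap  -- list of (mocap_sequence, action_label) pairs"""
    results = []
    for a_start, a_end in ktc_segment(Y):           # step (1)
        unit = Y[:, a_start:a_end]                   # current action unit
        # steps (2)-(5) happen inside dmw_similarity for each labeled sequence;
        # step (6), eq. 29: pick the labeled Mocap sequence with minimal distance
        best_seq, best_label = min(labeled_mocap,
                                   key=lambda pair: dmw_similarity(unit, pair[0]))
        results.append((a_start, a_end, best_label))
    return results
```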

Results. We collect an additional 5109 frames (N = 30) covering 10 primitive actions from CMU Mocap as the labeled (training) data for recognition. In order to associate labeled Mocap sequences with data from other domains, joint-position trajectories (M = 15) are used in eq. 29 [59]. The testing data are the previously collected depth sensor sequences, and online segmentation and recognition are performed simultaneously by KTC-H and eq. 29. A significant feature of our approach is that there is no extra training process for the depth sensor, i.e., the knowledge from Mocap can be transferred to other motion sequences given proper features. Tracked trajectories from OpenNI within an action unit (segmented by KTC-H) are associated with labeled Mocap sequences from 10 action categories.

Although OpenNI tracking results are often noisy (highlighted by blue circles in Fig. 8), we achieve 85% recognition accuracy (Acc) from these noisy tracking results (Table 3), without any additional training on depth sensor data. This result benefits not only from DMW [59] but also from KTC-H. DMW requires the input to contain a single action unit, so KTC-H performs a critical missing step, accurate online temporal segmentation, in order to enable recognition. As illustrated in Table 3, the accuracy on OpenNI data improves from 0.71 to 0.85, which strongly supports the effectiveness of KTC-H. Furthermore, the complete and accurate 3MD human motion sequences can be inferred by association with the learned manifolds from Mocap.

TABLE 3
Online hierarchical segmentation and recognition on 2.5D depth sensor.

Methods              Depth
PPCA-CD+KAC+CTW      0.72(P)/0.76(R)/0.89(RI)/0.62(Acc)
PPCA-CD+KAC+DMW      0.72(P)/0.76(R)/0.89(RI)/0.71(Acc)
KTC-H+DMW            0.85(P)/0.87(R)/0.94(RI)/0.85(Acc)

Fig. 8. Online action segmentation and recognition on 2.5D depth sensor. Top to bottom: depth image sequences, KTC-H results, and action recognition results. For segmentation, the blue line indicates the cut and different rectangles indicate different action units. The blue circle indicates noisy tracking results. For recognition: distance to labeled Mocap sequences, and inferred 3MD motion sequences.

7 CONCLUSION

In this paper, we first propose an online temporal segmentation method, KTC, a temporal extension of Hilbert space embedding of distributions for change-point detection based on a novel spatio-temporal kernel. Then, a realtime implementation of KTC and a hierarchical extension are designed, which can detect both action transitions and action units. Furthermore, a robust and efficient alignment algorithm, DMW, is designed to calculate the similarity between two multivariate time series.


Finally, temporal segmentation is combined with spatio-temporal alignment, resulting in realtime action recognition on depth sensor input, without the need for training data from the depth sensor. Future work includes the extension to 2D videos and applications beyond human motion analysis. To apply our approach to action recognition in 2D videos, we can use an idea similar to [69] to estimate the 2D key points of the human skeleton from the image. As the proposed framework and the two algorithms KTC and DMW are general methods, they can be used in other domains, such as 3D facial expression analysis, whenever a time series representation is available.

REFERENCES

[1] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, in Proc. CVPR, 2008.
[2] J. Niebles, H. Wang, and L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, IJCV, vol. 79, pp. 299-318, 2008.
[3] P. Matikainen, M. Hebert, and R. Sukthankar, Representing pairwise spatial and temporal relations for action recognition, in Proc. ECCV, 2010, vol. 6311, pp. 508-521.
[4] P. Natarajan and R. Nevatia, View and scale invariant action recognition using multiview shape-flow models, in Proc. CVPR, 2008.
[5] P. Yan, S. M. Khan, and M. Shah, Learning 4D action feature models for arbitrary view action recognition, in Proc. CVPR, 2008.
[6] H. Ning, W. Xu, Y. Gong, and T. Huang, Latent pose estimator for continuous action recognition, in Proc. ECCV, 2008, vol. 5303, pp. 419-433.
[7] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, Modeling temporal structure of decomposable motion segments for activity classification, in Proc. ECCV, 2010.
[8] T. Hofmann, B. Scholkopf, and A. J. Smola, Kernel methods in machine learning, Annals of Statistics, vol. 36, pp. 1171-1220, 2008.
[9] A. Smola, A. Gretton, L. Song, and B. Schoelkopf, A Hilbert space embedding for distributions, in Algorithmic Learning Theory: 18th International Conference, Springer-Verlag, Berlin/Heidelberg, 2007, pp. 13-31.
[10] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola, A kernel method for the two-sample-problem, in Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 513-520.
[11] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur, A fast, consistent kernel two-sample test, in NIPS, 2009, vol. 19, pp. 673-681.
[12] J. B. Tenenbaum, V. de Silva, and J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, vol. 290, pp. 2319-2323, December 2000.
[13] L. K. Saul and S. T. Roweis, Think globally, fit locally: unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research, vol. 4, pp. 119-155, 2003.
[14] O. C. Jenkins and M. J. Mataric, A spatio-temporal extension to isomap nonlinear dimension reduction, in Proc. ICML, 2004.
[15] W. Liao and G. Medioni, 3D face tracking and expression inference from a 2D sequence using manifold learning, in Proc. CVPR, vol. 2, 2008, pp. 416-423.
[16] P. Mordohai and G. Medioni, Dimensionality estimation, manifold learning and function approximation using tensor voting, JMLR, vol. 11, pp. 411-450, 2010.
[17] N. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, JMLR, vol. 6, pp. 1783-1816, November 2005.
[18] J. M. Wang, D. J. Fleet, and A. Hertzmann, Gaussian process dynamical models for human motion, IEEE PAMI, vol. 30, pp. 283-298, 2008.
[19] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. Darrell, and N. D. Lawrence, Topologically-constrained latent variable models, in Proc. ICML, 2008, pp. 1080-1087.
[20] R. Urtasun, D. Fleet, and P. Fua, 3D people tracking with Gaussian process dynamical models, in Proc. CVPR, vol. 1, 2006, pp. 238-245.
[21] C.-S. Lee, Modeling human motion using manifold learning and factorized generative models, PhD Thesis, 2007.
[22] A. Elgammal and C.-S. Lee, The role of manifold learning in human motion analysis, Computational Imaging and Vision, vol. 36, 2008.
[23] N. C. Tang, C.-T. Hsu, T.-Y. Lin, and H.-Y. M. Liao, Example-based human motion extrapolation based on manifold learning, in Proc. ACM MM, 2011.
[24] J. Blackburn and E. Ribeiro, Human motion recognition using isomap and dynamic time warping, in Proc. ICCV workshop, 2007.
[25] J. Chen and A. Gupta, Parametric Statistical Change-point Analysis, Birkhauser, 2000.
[26] X. Xuan and K. Murphy, Modeling changing dependency structure in multivariate time series, in Proc. ICML, 2007.
[27] R. P. Adams and D. J. MacKay, Bayesian online changepoint detection, University of Cambridge Technical Report, 2007.
[28] Y. Saatci, R. Turner, and C. Rasmussen, Gaussian process change point models, in Proc. ICML, 2010.
[29] F. Desobry, M. Davy, and C. Doncarli, An online kernel change detection algorithm, IEEE Transactions on Signal Processing, vol. 53, pp. 2961-2974, 2005.


[30] Z. Harchaoui, F. Bach, and E. Moulines, Kernel change-point analysis, in Advances in Neural Information Processing Systems 21, 2009.
[31] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, in Advances in Neural Information Processing Systems 14, MIT Press, 2002, pp. 849-856.
[32] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing, vol. 17, pp. 395-416, 2007.
[33] F. Zhou, F. D. la Torre, and J. K. Hodgins, Hierarchical aligned cluster analysis for temporal clustering of human motion, accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2012.
[34] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, Nonparametric Bayesian learning of switching linear dynamical systems, in NIPS 21, 2009, pp. 457-464.
[35] H. Zhong, J. Shi, and M. Visontai, Detecting unusual activity in video, in Proc. CVPR, 2004, pp. 816-823.
[36] L. Zelnik-Manor and M. Irani, Statistical analysis of dynamic actions, IEEE Trans. on PAMI, vol. 28, pp. 1530-1535, 2006.
[37] F. D. la Torre, J. Campoy, Z. Ambadar, and J. F. Cohn, Temporal segmentation of facial behavior, in Proc. ICCV, 2007.
[38] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard, Segmenting motion capture data into distinct behaviors, in Proc. Graphics Interface, 2004, pp. 185-194.
[39] I. Junejo, E. Dexter, I. Laptev, and P. Perez, View-independent action recognition from temporal self-similarities, IEEE Transactions on PAMI, vol. 33, pp. 172-185, 2011.
[40] D. Weinland, E. Boyer, and R. Ronfard, Action recognition from arbitrary views using 3D exemplars, in Proc. ICCV, 2007.
[41] F. Lv and R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in Proc. CVPR, 2007, pp. 1-8.
[42] R. Messing, C. Pal, and H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in Proc. ICCV, 2009, pp. 104-111.
[43] J. Sun, X. Wu, S. Yan, L. F. Cheong, T. S. Chua, and J. Li, Hierarchical spatial-temporal context modeling for action recognition, in Proc. CVPR, 2009, pp. 2004-2011.
[44] R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing, vol. 28, pp. 976-990, 2010.
[45] F. Lv and R. Nevatia, Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost, in Proc. ECCV, 2006, vol. 3954, pp. 359-372.
[46] D. Weinland, M. Ozuysal, and P. Fua, Making action recognition robust to occlusions and viewpoint changes, in Proc. ECCV, 2010, vol. 6313, pp. 635-648.
[47] W. Li, Z. Zhang, and Z. Liu, Action recognition based on a bag of 3D points, in Workshop on CVPR for Human Communicative Behavior Analysis, 2010, pp. 9-14.
[48] B. Ni, G. Wang, and P. Moulin, RGBD-HuDaAct: A color-depth video database for human daily activity recognition, in Workshop on Consumer Depth Cameras for Computer Vision, in conjunction with ICCV, 2011.
[49] J. Wang, Z. Liu, Y. Wu, and J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in Proc. CVPR, 2012, pp. 1290-1297.
[50] F. R. Bach and M. I. Jordan, Kernel independent component analysis, JMLR, vol. 3, pp. 1-48, 2003.
[51] T. K. Kim and R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE PAMI, vol. 31, pp. 1415-1428, 2009.
[52] C. C. Loy, T. Xiang, and S. Gong, Multi-camera activity correlation analysis, in Proc. CVPR, 2009, pp. 1988-1995.
[53] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood, View-invariant alignment and matching of video sequences, in Proc. ICCV, 2003, pp. 939-945.
[54] M. Singh, I. Cheng, M. Mandal, and A. Basu, Optimization of symmetric transfer error for sub-frame video synchronization, in Proc. ECCV, 2008, vol. 5303, pp. 554-567.
[55] L. R. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Inc., 1993.
[56] Y. Ukrainitz and M. Irani, Aligning sequences and actions by maximizing space-time correlations, in Proc. ECCV, 2006, vol. 3953, pp. 538-550.
[57] F. Padua, R. Carceroni, G. Santos, and K. Kutulakos, Linear sequence-to-sequence alignment, IEEE PAMI, vol. 32, pp. 304-320, 2010.
[58] F. Zhou and F. D. la Torre, Canonical time warping for alignment of human behavior, in NIPS, 2009, vol. 22, pp. 2286-2294.
[59] D. Gong and G. Medioni, Dynamic manifold warping for view invariant action recognition, in Proc. ICCV, 2011, pp. 571-578.
[60] M. Hoai, Z. Lan, and F. D. la Torre, Joint segmentation and classification of human actions in video, in Proc. CVPR, 2011.
[61] D. Gong, G. Medioni, S. Zhu, and X. Zhao, Kernelized temporal cut for online temporal segmentation and recognition, in Proc. ECCV, 2012.
[62] L. Faivishevsky and J. Goldberger, A nonparametric information theoretic clustering algorithm, in Proc. ICML, 2010, pp. 351-358.
[63] I. Laptev, S. Belongie, P. Perez, and J. Wills, Periodic motion detection and segmentation via approximate sequence alignment, in Proc. ICCV, Springer-Verlag, Berlin/Heidelberg, 2005, pp. 816-823.
[64] Y. W. Teh and S. Roweis, Automatic alignment of local representations, in Advances in Neural Information Processing Systems 15, M. Kearns, S. Solla, and D. Cohn, Eds., Cambridge, MA: MIT Press, 2003, pp. 841-848.
[65] CMU Motion Capture Database, http://mocap.cs.cmu.edu/.


[66] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili, Multiple alignment of continuous time series, in NIPS, 2005, vol. 17.
[67] L. Sigal, A. O. Balan, and M. J. Black, HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion, IJCV, vol. 87, pp. 4-27, 2010.
[68] OpenNI, http://www.openni.org/Downloads/OpenNIModules.aspx.
[69] B. Yao and L. Fei-Fei, Action recognition with exemplar based 2.5D graph matching, in Proc. ECCV, 2012, pp. 173-186.

Dian Gong received the PhD degree, with a major in Electrical Engineering and a minor in Computer Science, from the University of Southern California. He received his BS degree in Electronic Engineering from Tsinghua University. His PhD thesis applies machine learning to mining (large-scale) time series data. He is currently working as a quantitative trading associate at Susquehanna International Group. He also has work experience at Barclays Investment Bank, Sony US Research and Microsoft Research. In the past, he won several mathematics contest awards and was selected as a national team candidate for the International Mathematics Olympiad.

Gerard Medioni received the Diplôme d'Ingénieur from ENST, Paris in 1977, and an M.S. and Ph.D. from the University of Southern California in 1980 and 1983, respectively. He has been at USC since then, and is currently Professor of Computer Science and Electrical Engineering, co-director of the Institute for Robotics and Intelligent Systems (IRIS), and co-director of the USC Games Institute. He served as Chairman of the Computer Science Department from 2001 to 2007. Professor Medioni has made significant contributions to the field of computer vision. His research covers a broad spectrum of the field, such as edge detection, stereo and motion analysis, shape inference and description, and system integration. He has published 4 books, over 75 journal papers and 200 conference articles, and is the recipient of 14 international patents. Prof. Medioni is on the advisory board of the IEEE Transactions on PAMI, associate editor of the International Journal of Computer Vision, associate editor of the Pattern Recognition and Image Analysis Journal, and associate editor of the International Journal of Image and Video Processing. Prof. Medioni served as program co-chair of the 1991 IEEE CVPR Conference in Hawaii and of the 1995 IEEE Symposium on Computer Vision in Miami, general co-chair of the 1997 IEEE CVPR Conference in Puerto Rico, conference co-chair of the 1998 ICPR Conference in Australia, general co-chair of the 2001 IEEE CVPR Conference in Kauai, general co-chair of the 2007 IEEE CVPR Conference in Minneapolis, general co-chair of the 2009 IEEE CVPR Conference in Miami, program co-chair of the 2009 IEEE WACV Conference in Snowbird, Utah, general co-chair of the 2011 IEEE WACV Conference in Kona, Hawaii, and general co-chair of the 2013 IEEE CVPR in Portland. He is a Fellow of IAPR, a Fellow of the IEEE, and a Fellow of AAAI.

Xuemei Zhao received the BS degree in Electronic Engineering from Tsinghua University, Beijing, in 2008. She received the PhD degree in Electrical Engineering from the University of Southern California, Los Angeles, in 2013. She is currently working as a software engineer at Google in the New York office.