
Activity recognition using the velocity histories of tracked keypoints

Anonymous ICCV submission

Paper ID 2117

Abstract

We present an activity recognition feature inspired by human psychophysical performance. This feature is based on the velocity history of tracked keypoints. We present a generative mixture model for video sequences using this feature, and show that it performs comparably to local spatio-temporal features on the KTH activity recognition dataset. In addition, we contribute a new activity recognition dataset, focusing on activities of daily living, with high resolution video sequences of complex actions. We demonstrate the superiority of our velocity history feature on high resolution video sequences of complicated activities. Further, we show how the velocity history feature can be extended, both with a more sophisticated latent velocity model, and by combining the velocity history feature with other useful information, like appearance, position, and high level semantic information. Our approach performs comparably to established and state of the art methods on the KTH dataset, and significantly outperforms all other methods on our challenging new dataset.

1. Introduction

Activity recognition is an important area of active computer vision research. Recent work has focused on bag-of-spatio-temporal-features approaches that have proven effective on established datasets. These features have very strong limits on the amount of space and time that they can describe. Human performance suggests that more global spatial and temporal information could be necessary and sufficient for activity recognition. Some recent work has attempted to get at this global information by modeling the relationships between local space-time features. We instead propose a feature, directly inspired by studies of human performance, with a much less limited spatio-temporal range.

Our work is motivated by an important application of activity recognition: assisted cognition health monitoring systems designed to unobtrusively monitor patients. These systems are intended to ensure the mental and physical health of patients either at home or in an extended care facility. Additionally, they provide critical, otherwise unavailable information to concerned family members or physicians making diagnoses. With an aging population, and without a concurrent increase in healthcare workers, techniques that could both lessen the burden on caregivers and increase quality of life for patients by unobtrusive monitoring are very important. We introduce a high resolution dataset of complicated actions designed to reflect the challenges faced by assisted cognition activity recognition systems, as well as other systems dealing with activities of daily living.

2. Background

Initial work on activity recognition involved extracting a monolithic description of a video sequence. This could have been a table of motion magnitude, frequency, and position within a segmented figure [14] or a table of the presence or recency of motion at each location [5]. Both of these techniques were able to distinguish some range of activities, but because each was itself the full description of a video sequence, rather than a feature extracted from a sequence, it was difficult to use them as the building blocks of a more complicated system.

Another approach to activity recognition is the use of explicit models of the activities to be recognized. Domains where this approach has been applied include face and facial expression recognition [4] and human pose modeling [15]. These techniques can be very effective, but by their nature, they cannot offer general models of the information in video in the way that less domain-specific features can.

Recent work in activity recognition has been largely based on local spatio-temporal features. Many of these features seem to be inspired by the success of statistical models of local features in object recognition. In both domains, features are first detected by some interest point detector running over all locations at multiple scales. Local maxima of the detector are taken to be the center of a local spatial or spatio-temporal patch, which is extracted and summarized by some descriptor. Most of the time, these features are then clustered and assigned to words in a codebook, allowing the use of bag-of-words models from statistical natural language processing.


While local spatio-temporal features were initially introduced in bag-of-features approaches ignoring non-local spatio-temporal relationships between the local features, more recent work has attempted to weaken these spatio-temporal independence assumptions [8, 10, 13, 17]. While these techniques weaken either or both of the spatial and temporal independence assumptions, they are fundamentally limited by the locality of the features underlying their models, and the kinds of complicated motion they can disambiguate. Junejo et al.'s temporal self-similarity features sacrifice all spatial information for extraordinary viewpoint invariance [8]. Laptev et al.'s system captures spatio-temporal relationships between a set of very coarse spatial and temporal grid regions [10]. Niebles and Fei-Fei's system creates a spatial constellation model of feature location within each frame, but does not consider relationships across frames [13]. Savarese et al. add a new type of feature to their bag of words that captures the spatio-temporal correlation of different types of features [17]. This describes broad relationships over space and time, but fails to capture the motion of any individual feature, or even type of feature.

Local features have proven successful with existing activity recognition datasets. However, most datasets contain low resolution video of simple actions. Since the expressiveness of a local feature decreases with the volume of space and time it can describe, as high resolution video becomes more commonplace, local features must either lose information by operating on scaled-down videos, become larger (dramatically expanding the amount of computation they require), or represent a smaller and smaller space-time volume as video resolution increases. Additionally, as actions become more and more complicated, local space-time patches may not be the most effective primitives to describe them.

In proposing a feature as an alternative to local patch methods, we note that human subjects can disambiguate actions using only the motion of tracked points on a body, in the absence of other information. Johansson [7] showed that point-light displays of even very complicated, structured motion were perceived easily by human subjects. Most famously, Johansson showed that human subjects could correctly perceive "point-light walkers", a motion stimulus generated by a person walking in the dark with points of light attached to the walker's body. In computer vision, Madabhushi and Aggarwal [12] were able to disambiguate activities using only the tracked motion of a single feature, the subject's head. While their system had limited success, its ability to use an extremely sparse motion feature to recognize activities shows just how informative the motion of a tracked point can be. Inspired by these results, we propose a new feature, the velocity history of tracked keypoints.

Much of the previous trajectory tracking work has been based on object centroid or bounding box trajectories. In a more sophisticated extension of such approaches, Rosales and Sclaroff [16] used an extended Kalman filter to help keep track of bounding boxes of segmented moving objects, and then used motion history images and motion energy images as a way of summarizing the trajectories of tracked objects. Madabhushi and Aggarwal [12] used the trajectories of heads tracked in video and built models based on the mean and covariance of velocities. Our approach to analyzing feature trajectories differs dramatically from these previous methods in that we use dense clouds of KLT [11] feature tracks. In this way our underlying feature representation is also much closer to Johansson's point lights.

3. Velocity Histories of Tracked Keypoints

We present a system that tracks a set of keypoints in a video sequence, extracts their velocity histories, and uses a generative mixture model to learn a velocity-history language and classify video sequences.

3.1. Feature Detection and Tracking

Given a video sequence, we extract feature trajectories using Birchfield's implementation [3] of the KLT tracker [11]. This system finds interest points where both eigenvalues of the matrix

\[
\nabla I = \begin{pmatrix} g_x^2 & g_x g_y \\ g_x g_y & g_y^2 \end{pmatrix}
\]

are greater than a threshold, and tracks them by calculating frame-to-frame translation with an affine consistency check. In order to maximize the duration of our feature trajectories, and thus capitalize on their descriptive power, we did not use the affine consistency check (which ensures that the features have not deformed). This increased the amount of noise in the feature trajectories, but also the ability to capture non-rigid motion. While the detector finds both moving and nonmoving corners, our mixture model ensures that nonmoving features are described by a small number of mixture components. As described by Baker and Matthews [1], the tracker finds the match that minimizes

\[
\sum_x \left[ T(x) - I(W(x; p)) \right]^2 \tag{1}
\]

where T is the appearance of the feature to be matched in the last frame, I is the current frame, x is the position in the template window, W is the set of transformations considered (in this case, translation) between the last frame and the current one, and p are the parameters for the transformation. To derive the tracking update equations, we first express Equation 1 as an iterative update problem (replacing p with p + Δp), and Taylor expand the iterative update:

\[
\sum_x \left[ T(x) - I(W(x; p + \Delta p)) \right]^2 \approx \sum_x \left[ T(x) - I(W(x; p)) - \nabla I \frac{\partial W}{\partial p} \Delta p \right]^2 \tag{2}
\]


Now we take the derivative of Equation 2 with respect to the update parameters:

\[
\frac{\partial}{\partial p} \sum_x \left[ T(x) - I(W(x; p)) - \nabla I \frac{\partial W}{\partial p} \Delta p \right]^2 = \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ T(x) - I(W(x; p)) - \nabla I \frac{\partial W}{\partial p} \Delta p \right] \tag{3}
\]

Setting the derivative in Equation 3 to zero and solving for the parameter update Δp gives us:

\[
\Delta p = \left[ \sum_x SD^T(x)\, SD(x) \right]^{-1} \sum_x SD^T(x) \left[ T(x) - I(W(x; p)) \right] \tag{4}
\]

where SD(x) = ∇I ∂W/∂p is the steepest descent image [1].

By iteratively updating the parameters according to Equation 4, we can quickly find the best match for a tracked feature. Recent GPU-based implementations of this algorithm can run in real time [19].
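To make Equation 4 concrete for the translation-only warp used here, the sketch below implements a single Gauss-Newton update in NumPy. It is a minimal illustration, not the authors' implementation; the integer-pixel warp (no interpolation) and the function and variable names are our simplifying assumptions.

```python
import numpy as np

def lk_translation_update(T, I, p, corner):
    """One iteration of Eq. 4 for a pure-translation warp W(x; p) = x + p.

    T      -- template patch from the previous frame (h x w array)
    I      -- current frame (grayscale array)
    p      -- current translation estimate (dx, dy)
    corner -- (row, col) of the template window's top-left corner
    """
    p = np.asarray(p, dtype=float)
    h, w = T.shape
    r0, c0 = corner
    r, c = r0 + int(round(p[1])), c0 + int(round(p[0]))
    Iw = I[r:r + h, c:c + w].astype(float)           # I(W(x; p)), integer shift only
    gy, gx = np.gradient(Iw)                         # for translation, dW/dp is the identity,
    sd = np.stack([gx.ravel(), gy.ravel()], axis=1)  # so SD(x) is just the image gradient
    err = (T.astype(float) - Iw).ravel()             # T(x) - I(W(x; p))
    H = sd.T @ sd                                    # sum_x SD(x)^T SD(x)
    dp = np.linalg.solve(H, sd.T @ err)              # Eq. 4
    return p + dp
```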

We tracked 500 features at a time, detecting new features to replace lost tracks. We call each feature's quantized velocity over time its "velocity history", and use this as our basic feature. We emphasize that we are unaware of other work on activity recognition using a motion description with as long a temporal range as the technique presented here. Uniform quantization is done in log-polar coordinates, with 8 bins for direction and 5 for magnitude. The temporal resolution of the input video was maintained. We capped the velocity magnitude at 10 pixels per frame; in all of our datasets, larger velocities correspond almost exclusively to noise. With this cap, an individual feature's frame-to-frame motion above the threshold was not ignored; instead, that feature's motion on that frame was placed in the bin for the greatest motion magnitude in the appropriate direction. We also only considered feature trajectories lasting at least ten frames. This heuristic reduces noise while eliminating little of the extended velocity information we seek to capture. Examples of the extracted trajectories can be seen in Figure 7.
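A minimal sketch of this feature pipeline, substituting OpenCV's pyramidal Lucas-Kanade tracker for Birchfield's KLT implementation. The parameter values mirror those stated above, but the binning formula, detector thresholds, and the omission of lost-track re-detection are our simplifications.

```python
import cv2
import numpy as np

MAX_TRACKS, N_DIR, N_MAG, MAX_SPEED, MIN_LEN = 500, 8, 5, 10.0, 10

def quantize(dx, dy):
    """Log-polar quantization of a per-frame displacement: 8 direction bins x 5 magnitude bins."""
    mag = min(float(np.hypot(dx, dy)), MAX_SPEED)    # cap at 10 px/frame
    d_bin = int((np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * N_DIR) % N_DIR
    m_bin = min(int(np.log1p(mag) / np.log1p(MAX_SPEED) * N_MAG), N_MAG - 1)
    return d_bin * N_MAG + m_bin

def velocity_histories(frames):
    """Track corners across frames and return quantized velocity histories (>= 10 frames long)."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=MAX_TRACKS, qualityLevel=0.01, minDistance=5)
    histories = [[] for _ in range(len(pts))]
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        for i, (ok, p0, p1) in enumerate(zip(status.ravel(), pts, nxt)):
            if ok:                                   # track survived this frame
                dx, dy = (p1 - p0).ravel()
                histories[i].append(quantize(dx, dy))
        pts, prev = nxt, gray
    return [h for h in histories if len(h) >= MIN_LEN]
```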

To give an idea of the long temporal range of our feature flows: in the new dataset we introduce, containing videos between 10 and 60 seconds (300 and 1800 frames) long, the total number of flows extracted per activity sequence varied between about 700 and 2500, with an average of over 1400 flows per sequence. The mean duration of the flows was over 150 frames.

3.2. Activity Recognition

We model activity classes as a weighted mixture of bags of augmented trajectory sequences. Each activity model has a distribution over 100 shared mixture components. These mixture components can be thought of as velocity history "words", with each velocity history feature being generated by one mixture component, and each activity class having a distribution over these mixture components. Each mixture component models a velocity history feature with a Markov chain. So each velocity history feature is explained by a mixture of velocity history Markov chains, and each activity class has a distribution over those mixture components. By analogy to supervised document-topic models, each activity is a document class with a distribution over mixture components (analogous to topics). Each mixture component has a distribution over velocity history features (analogous to words). In this way, each tracked velocity history feature has a different probability of being generated by each mixture component. Each document is modeled using naive Bayes as a "bag" of these soft-assigned mixture-component words. This model assumes that all documents within a class share a single multinomial distribution over mixture components.

Figure 1. Graphical model for our tracked keypoint velocity history model (Dark circles denote observed variables).

3.2.1 Inference

Our model for an action class is shown in Figure 1. Each action class is a mixture model. Each instance of an action (video clip) is treated as a bag of velocity history sequences. A is the action variable, which can take a value for each distinct action the system recognizes. M is the mixture variable, indicating one of the action's mixture components. We use the notation M_f^i to abbreviate M_f = i, or in words, feature f is generated by mixture component i. O_n denotes a feature's discretized velocity at time n. N_f denotes the number of augmented features in the video sequence, and N_v denotes the number of video sequences.

Our joint probability model for an action is:

\[
P(A, O) = \sum_M P(A, M, O) = P(A) \prod_f^{N_f} \sum_i^{N_m} P(M_f^i \mid A)\, P(O_{0,f} \mid M_f^i) \prod_{t=1}^{T_f} P(O_{t,f} \mid O_{t-1,f}, M_f^i) \tag{5}
\]

where N_m is the number of mixture components, N_f is the number of features, and T_f is the number of observations in feature trajectory f.


For a feature tracked over 20 frames, t would take 19 distinct values, since each observation is the difference between the feature's locations in two consecutive frames. Note that all random variables are discrete, and all component distributions in the joint distribution are multinomial.

We train the model using Expectation Maximization (EM) [2], and classify a novel video sequence D by finding the action A that maximizes P(A|D) ∝ P(D|A)P(A). We assume a uniform prior on action, so argmax_A P(A|D) = argmax_A P(D|A).

We also give Dirichlet priors to each of the model parameters in order to smooth over the sparse data. These priors are chosen to be equivalent to having seen each next velocity state (within the same mixture component) once for each velocity state in each mixture component, and each mixture component once in each activity class.
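In practice these priors amount to add-one pseudo-counts on every multinomial table before normalization. A minimal sketch of that smoothing (the helper name and array layout are our assumptions):

```python
import numpy as np

def smoothed_log_table(counts):
    """Add one pseudo-count per cell (the Dirichlet prior described above) and
    normalize along the last axis, e.g. next velocity given previous velocity
    and mixture component, or mixture component given activity class."""
    counts = np.asarray(counts, dtype=float) + 1.0
    return np.log(counts / counts.sum(axis=-1, keepdims=True))
```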

While our velocity history features may seem to be very sparse compared to spatio-temporal patch features, their long duration means that they can contain a lot of information. Additionally, because of their simplicity, they could potentially be implemented very efficiently. Feature extraction for KLT tracks has been implemented in real-time systems using GPU hardware [19], and online evaluation for each feature (in log space) requires m table lookups and additions per frame, where m is the number of mixture components.
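The log-space evaluation of Equation 5 can be sketched as follows. The array names, shapes, and use of SciPy's logsumexp are our assumptions; per feature and per frame the work is one table lookup and addition for each of the m mixture components.

```python
import numpy as np
from scipy.special import logsumexp

def video_log_likelihood(features, log_mix, log_init, log_trans):
    """log P(D | A) under Eq. 5 for one action class A.

    features  -- list of quantized velocity histories (lists of bin indices)
    log_mix   -- log P(M = i | A), shape (m,)
    log_init  -- log P(O_0 = v | M = i), shape (m, n_bins)
    log_trans -- log P(O_t = v | O_{t-1} = u, M = i), shape (m, n_bins, n_bins)
    """
    total = 0.0
    for hist in features:
        per_mix = log_init[:, hist[0]].copy()          # log P(O_0 | M = i) for all i
        for prev, cur in zip(hist[:-1], hist[1:]):
            per_mix += log_trans[:, prev, cur]         # one lookup+add per component per frame
        total += logsumexp(log_mix + per_mix)          # marginalize over mixture components
    return total

def classify(features, class_params):
    """Pick argmax_A P(D | A) under a uniform action prior;
    class_params[A] = (log_mix, log_init, log_trans)."""
    return max(class_params, key=lambda a: video_log_likelihood(features, *class_params[a]))
```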

4. Feature Evaluation

We evaluated the performance of our model on the KTH dataset, a popular dataset introduced by Schuldt et al. [18]. This dataset consists of low-resolution (160 × 120 pixels) monochrome video clips taken by a non-moving camera of 25 subjects performing each of 6 actions (walking, clapping, boxing, waving, jogging and running) in several size and background variation conditions. While our tracked keypoint velocity history features are intended to perform well on high resolution video of complicated actions, it is important to evaluate the baseline performance of our feature on an established dataset. We use one of the popular testing paradigms for this dataset, where a system is trained on 8 randomly selected subjects, and tested on 9 randomly selected remaining subjects.

To provide a basis for comparison, we use Dollar et al.'s spatio-temporal cuboids [6], and Laptev et al.'s space-time interest points [10]. In order to ensure that the comparison is of the features, rather than the model, we use the same codebook size (100 elements), and the same higher level model (naive Bayes), for all systems.

Results are shown in Table 1. Clearly, our velocity history features are competitive with state-of-the-art local features on established datasets. We note that both Dollar et al.'s cuboids and Laptev et al.'s interest points have previously achieved much greater performance on this dataset using larger codebooks and more sophisticated discriminative models (like SVMs). We held the codebook size and feature combination model fixed in order to evaluate the features themselves.

Table 1. Performance results of different features using a train-on-8, test-on-9 testing paradigm on the KTH dataset. All features used 100 codewords and were combined using naive Bayes.

Method                            Percent Correct
Spatio-Temporal Cuboids [6]       66
Velocity Histories (Sec. 3)       74
Space-Time Interest Points [10]   80

5. High Resolution Videos of Complex Activities

Existing activity recognition datasets involve low-resolution videos of relatively simple actions. The KTH dataset [18], probably the single most popular one, is a case in point. This dataset consists of low-resolution (160 × 120 pixels) video clips taken by a non-moving camera at 25 frames per second of 25 subjects performing each of 6 actions (walking, clapping, boxing, waving, jogging and running) in several conditions varying scale, background, and clothing. Subjects repeated the actions several times, in clips ranging from just under 10 seconds to just under a minute. The KTH dataset's relatively low resolution rewards techniques that focus on coarse, rather than fine, spatial information. It does not include object interaction, an important part of many real-world activities. Lastly, its simple activities, while still a challenge to existing systems, may not reflect the complexity of the real activities a system might want to recognize.

We contribute a high resolution video dataset of activities of daily living. These activities were selected to be useful for an assisted cognition task, and to be difficult to separate on the basis of any single source of information. For example, motion information might distinguish eating a banana from peeling a banana, but might have difficulty distinguishing eating a banana from eating snack chips, while appearance information would conversely be more useful in the latter task than the former.

The full list of activities is: answering a phone, dialing a phone, looking up a phone number in a telephone directory, writing a phone number on a whiteboard, drinking a glass of water, eating snack chips, peeling a banana, eating a banana, chopping a banana, and eating food with silverware.

These activities were each performed three times by five different people. These people were all members of a department of computer science, and were naive to the details of our model when the data was collected.


By using people of different shapes, sizes, genders, and ethnicities, we hoped to ensure sufficient appearance variation to make the individual appearance features generated by each person different. We also hoped to ensure that activity models were robust to individual variation.

The scene was shot from about two meters away by a tripod-mounted Canon Powershot TX1 HD digital camera. Video was taken at 1280 × 720 pixel resolution, at 30 frames per second. An example background image is shown in Figure 3.

The total number of features extracted per activity sequence varied between about 700 and 2500, with an average of over 1400 features per sequence. The mean duration of the trajectories was over 150 frames. Video sequences lasted between 10 and 60 seconds, terminating when the activity was completed.

Our evaluation consisted of training on all repetitions of activities by four of the five subjects, and testing on all repetitions of the fifth subject's activities. This leave-one-out testing was averaged over the performance with each left-out subject.
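A sketch of this leave-one-subject-out protocol; the data layout and the train/classify helpers are hypothetical placeholders, not part of the paper.

```python
def leave_one_subject_out(clips_by_subject, train_fn, classify_fn):
    """clips_by_subject maps subject id -> list of (features, label) pairs.
    Train on four subjects, test on every clip of the held-out fifth, average."""
    subjects = sorted(clips_by_subject)
    accuracies = []
    for held_out in subjects:
        train = [clip for s in subjects if s != held_out for clip in clips_by_subject[s]]
        model = train_fn(train)                       # e.g. EM training of the mixture model
        test = clips_by_subject[held_out]
        correct = sum(classify_fn(model, feats) == label for feats, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```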

In addition to our tracked keypoint velocity history features, and the local spatio-temporal patch features of Dollar et al. [6] and Laptev et al. [10], we use Bobick and Davis's temporal templates [5], an approach based on the shape of tables of the presence or recency of motion at each location. Temporal templates, which offer a single-feature description of a video sequence, are not amenable to the same models as feature-based approaches. We used a standard technique with temporal templates: the Mahalanobis distance between the concatenated Hu moment representations of the motion energy and motion history images of a test video and the members of each activity class.

Due to limitations of the distributed implementation of Laptev et al.'s space-time interest point features, we were forced to reduce the video resolution to 640 × 480 pixels. However, we found almost no performance difference running Dollar et al.'s spatio-temporal cuboids at both this reduced resolution and the full video resolution, suggesting that Laptev et al.'s features would also perform similarly on the full resolution video. Additionally, we found almost no performance difference running Laptev et al.'s space-time interest point features at a greatly reduced video resolution (160 × 120 pixels, as in the KTH dataset), suggesting that the extra fine spatial detail provided by high resolution video is unhelpful for space-time interest points.

The performance of all systems on this dataset is shown in Table 2. A confusion matrix illustrating the performance of our velocity history features is shown in Figure 2. These results show several things. First, we see that feature-based methods, even when combined using the simplest models, outperform older whole-video methods. Given that the field has almost exclusively moved to feature-based methods, this is not surprising. Second, we see that while there is a range of performance among the local spatio-temporal features, velocity history features outperform them on this high resolution, high complexity dataset.

Table 2. Performance results of different methods using a leave-one-out testing paradigm on our high resolution activities of daily living dataset.

Method                                    Percent Correct
Temporal Templates [5]                    33
Spatio-Temporal Cuboids [6]               36
Space-Time Interest Points [10]           59
Velocity Histories (Sec. 3)               63
Latent Velocity Histories (Sec. 7)        67
Augmented Velocity Histories (Sec. 6)     89

Figure 2. Confusion matrix using only feature trajectory information, without augmentations. Overall accuracy is 63%. Zeros are omitted for clarity.

6. Augmenting our features with additional information

While the velocity history of tracked keypoints seems to be an effective feature by itself, it lacks any information about absolute position, appearance, color, or the position of important, nonmoving objects. By augmenting each of our velocity history features with this information, we can significantly improve performance.

6.1. Feature Augmentations

Features were augmented by location information indicating their initial and final positions (their "birth" and "death" positions). To obtain a given feature trajectory's birth and death, we use k-means clustering (k = 24) to form a codebook for both birth and death (independently), and assign a feature's birth and death to its nearest neighbor in the codebook. An example of this codebook can be seen in Figure 4, which was generated from activities in the scene shown in Figure 3.

Figure 3. A background image from the activities of daily living dataset.

Figure 4. Codebook-generated pixel map of birth-death locations.

In addition to this absolute position information, features are also augmented by relative position information. We use birth and death positions of the feature relative to the position of an unambiguously detected face, if present. Faces were detected by the OpenCV implementation of the Viola-Jones face detector [20], run independently on each frame of every video. Like birth and death, these positions are clustered by k-means (k = 100) into a codebook, and a feature position relative to the face is assigned to its nearest neighbor in the codebook.
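For illustration, a per-frame face box of the kind used here can be obtained with OpenCV's bundled Viola-Jones cascades; the particular cascade file and the single-detection filter below are our assumptions.

```python
import cv2

# OpenCV ships trained Viola-Jones cascades; this frontal-face model is one common choice.
_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_center(gray_frame):
    """Return the center of an unambiguously detected face, or None otherwise.
    Feature birth/death positions can then be expressed relative to this point."""
    faces = _cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:          # skip frames with zero or multiple detections
        return None
    x, y, w, h = faces[0]
    return (x + w / 2.0, y + h / 2.0)
```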

We augment features with local appearance information. We extract a 21 × 21 pixel patch around the initial position of the feature, normalized to the dominant orientation of the patch, calculate the horizontal and vertical gradients of the oriented patch, and use PCA-SIFT [9] to obtain a 40-dimensional descriptor. These descriptors are clustered by k-means into a codebook (k = 400), and each feature's patch is assigned to its nearest neighbor in the codebook.

Lastly, we augment features with color information. From the same 21 × 21 pixel patch that we use for the appearance augmentation, we compute a color histogram in CIELAB space, with 10 bins in each color dimension. We use PCA to reduce the dimensionality of the histograms to 40 dimensions, and, as with appearance, cluster these histograms using k-means into a 400 element codebook. Each feature flow patch's color histogram is assigned to its nearest neighbor in the codebook.

The codebooks for all augmentations were formed only using samples from the training data.
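All four augmentation codebooks follow the same pattern: fit k-means on training samples only, then assign each new vector to its nearest codeword. A minimal sketch using scikit-learn (our choice of library; the paper does not specify one):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_vectors, k):
    """Fit a k-means codebook on training data only, as in Sec. 6.1
    (k = 24 for birth/death positions, 100 for face-relative positions,
    400 for PCA-SIFT appearance and for color histograms)."""
    return KMeans(n_clusters=k, n_init=10).fit(np.asarray(train_vectors))

def assign_codewords(codebook, vectors):
    """Nearest-neighbor assignment of augmentation vectors to codeword indices."""
    return codebook.predict(np.asarray(vectors))
```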

We note that only our absolute birth and death position augmentations lack a significant degree of view invariance. The other augmentations, and the feature trajectory model itself, should maintain the same view invariance as other feature-based approaches to activity recognition, which usually lack the scale and orientation invariance that characterize successful object recognition features.

We emphasize that our velocity history features are particularly appropriate recipients of these augmentations. By contrast, it is much less straightforward to augment local space-time features with these kinds of information, particularly because notions like "birth" and "death" position are not as well-defined for features based on local spatio-temporal patches.

Figure 5. Graphical model for our augmented tracked keypoint velocity history model (Dark circles denote observed variables).

6.2. Modeling Augmented Features

The graphical model for our augmented velocity history model is shown in Figure 5. All symbols have the same meaning as in Figure 1, with the addition of L_0 and L_T as the birth and death location variables respectively, R_0 and R_T as the positions relative to the face at birth and death, X as the feature's appearance, and C as the feature's color. Augmentation information is combined using a naive Bayes assumption at the mixture component level, so the additional computational cost of each augmentation during inference is just another table lookup per feature.

Our new joint probability model for an action is:

\[
P(A, L_0, L_T, R_0, R_T, X, C, O) = \sum_M P(A, M, L_0, L_T, R_0, R_T, X, C, O) = P(A) \prod_f^{N_f} \sum_i^{N_m} P(M_f^i \mid A)\, P(L_{0,f} \mid M_f^i)\, P(L_{T,f} \mid M_f^i)\, P(R_{0,f} \mid M_f^i)\, P(R_{T,f} \mid M_f^i)\, P(X_f \mid M_f^i)\, P(C_f \mid M_f^i)\, P(O_{0,f} \mid M_f^i) \prod_{t=1}^{T_f} P(O_{t,f} \mid O_{t-1,f}, M_f^i) \tag{6}
\]

As before, all random variables are discrete, and all component distributions in the joint distribution are multinomial. Similarly, EM [2] was again used for training.
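Concretely, under Equation 6 each augmentation adds one log-probability table lookup per mixture component to the per-feature term of the unaugmented model. A sketch (table names and layout are assumptions):

```python
import numpy as np
from scipy.special import logsumexp

def augmented_feature_loglik(hist, aug_words, log_mix, log_init, log_trans, log_aug):
    """Per-feature term of Eq. 6 in log space.

    hist      -- quantized velocity history (list of bin indices)
    aug_words -- dict: augmentation name -> codeword index (birth, death,
                 face-relative birth/death, appearance, color)
    log_aug   -- dict: augmentation name -> array of log P(word | M = i), shape (m, k)
    """
    per_mix = log_init[:, hist[0]].copy()
    for prev, cur in zip(hist[:-1], hist[1:]):
        per_mix += log_trans[:, prev, cur]             # velocity-history Markov chain
    for name, word in aug_words.items():
        per_mix += log_aug[name][:, word]              # one extra lookup per augmentation
    return logsumexp(log_mix + per_mix)                # marginalize over mixture components
```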

6.3. Results

The performance of this augmented model, using experimental conditions identical to those used for the unaugmented model, is shown in Figure 6. Augmentations yield significant performance improvements on our high resolution activities of daily living dataset.


Figure 6. Confusion matrix for the augmented feature trajectory activity recognition system. Overall accuracy is 89%. Zeros are omitted for clarity.

Figure 7. Labeled example flows from "Lookup in Phonebook" and "Eat Banana", with different colors denoting different mixture components. Note that complex semantic features like feature flow position relative to face are used in our framework.

7. Adding uncertainty to the velocity history

While tracked keypoint velocity history has proven a powerful feature for activity recognition, particularly when augmented as in Section 6, it is necessarily dependent on both the discretization of velocity and the presence of sufficient data to learn a rich velocity transition matrix for all mixture components. One way to limit this dependence on the amount of data is to introduce uncertainty in each velocity observation. By modeling observed velocity as the true latent velocity plus fixed Gaussian noise, we can spread the probability mass from each observation over nearby, similar observations. As can be seen in Figure 8, this transforms our mixture of velocity Markov chains into a mixture of velocity hidden Markov models, or a factorial hidden Markov model.

7.1. Modeling Latent Velocity

The graphical model for our latent velocity history model is shown in Figure 8. All symbols have the same meaning as in Figure 1, and we introduce the variable S_t^i to indicate the latent velocity at time t as explained by mixture component i. By breaking up the latent velocities by mixture component, we avoid a cyclic dependence of each latent velocity state on the others, and so can use the forward-backward algorithm to compute latent state marginal distributions.

Figure 8. Graphical model for our latent tracked keypoint velocity history model (Dark circles denote observed variables).

The joint PDF for an action with these new latent velocity states is:

\[
P(A, S, O) = \sum_M P(A, M, S, O) = P(A) \prod_f^{N_f} \sum_i^{N_m} P(M_f^i \mid A)\, P(S_{0,f}^i \mid M_f^i)\, P(O_{0,f} \mid S_{0,f}^i) \prod_{t=1}^{T_f} P(S_{t,f}^i \mid S_{t-1,f}^i, M_f^i)\, P(O_{t,f} \mid S_{t,f}^i) \tag{7}
\]

As in the previous models, all variables are discrete, and all distributions are multinomial, aside from the P(O|S) term, which is the sum of two fixed Gaussian distributions with zero mean. One Gaussian is broad, with a diagonal covariance matrix whose entries are the maximum velocity magnitude of 10 pixels per frame, and the other is narrow, with a diagonal covariance matrix whose entries are a quarter the value of the broad one's entries. As before, we use EM [2] to learn the model parameters.

While this model is more powerful than the mixture of velocity Markov chains, it is much more expensive to perform inference in it. Each velocity transition table lookup in the mixture of velocity Markov chains becomes a matrix multiplication. In order to test this model in a reasonable amount of time, it was necessary to break apart the mixture components during training. Instead of 100 mixture components shared over all 10 action classes, 10 mixture components were assigned to each action class, and trained only on data from that class. Testing used a naive Bayes model sharing mixture components across all action classes, as in the previous versions of the model.
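A forward-algorithm sketch of the per-component term of Equation 7 makes the extra cost explicit: the single table lookup per frame becomes a sum over latent velocity states. Array names and the log-space formulation are our assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def latent_component_loglik(hist, log_init_s, log_trans_s, log_emit):
    """log P(O | M = i) for one mixture component of the latent velocity model.

    log_init_s  -- log P(S_0 = s | M = i), shape (n_states,)
    log_trans_s -- log P(S_t = s' | S_{t-1} = s, M = i), shape (n_states, n_states)
    log_emit    -- log P(O = o | S = s), the fixed two-Gaussian emission
                   evaluated at the discrete velocity bins, shape (n_states, n_bins)
    """
    alpha = log_init_s + log_emit[:, hist[0]]           # forward messages over latent states
    for obs in hist[1:]:
        # Each step is a (log-space) matrix product over latent states -- the
        # per-frame cost noted above, versus one table lookup in the non-latent model.
        alpha = logsumexp(alpha[:, None] + log_trans_s, axis=0) + log_emit[:, obs]
    return logsumexp(alpha)
```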


Figure 9. Confusion matrix for the latent feature trajectory activity recognition system. Overall accuracy is 67%. Zeros are omitted for clarity.

7.2. Results

The performance of the latent velocity model is shown in Figure 9. Clearly, there is some improvement in performance over the model without latent states shown in Figure 2. This suggests that incorporating uncertainty in our velocity states can offer moderately improved results. On the one hand, it is encouraging to see that even without adding more data, a more sophisticated model has the potential to improve performance. On the other hand, it is a modest improvement compared to that offered by the augmentations of Section 6, at a much greater computational cost.

8. Conclusions

The velocity history of tracked keypoints is clearly a useful feature for activity recognition, particularly when augmented with extra information. By incorporating information outside a local spatio-temporal patch at the feature level, it allows a bag-of-features model to capture a significant amount of non-local structure. We anticipate that high resolution video data will become more commonplace, and that practical activity recognition systems will face more complicated activities. As the complexity of activity recognition increases, features capable of capturing non-local structure may prove more valuable than local spatio-temporal methods.

References

[1] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. IJCV, 56(3):221-255, March 2004.
[2] J. Bilmes. A gentle tutorial on the EM algorithm. Technical Report ICSI-TR-97-021, University of Berkeley, 1997.
[3] S. Birchfield. KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker, 1996.
[4] M. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In ICCV, pages 374-381, 1995.
[5] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:257-267, 2001.
[6] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, October 2005.
[7] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14:201-211, 1973.
[8] I. Junejo, E. Dexter, I. Laptev, and P. Perez. Cross-view action recognition from temporal self-similarities. In Proc. Eur. Conf. Computer Vision (ECCV'08), volume 2, pages 293-306, Marseille, France, October 2008.
[9] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In CVPR04, pages II: 506-513, 2004.
[10] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. Int. Conf. Computer Vision and Pattern Recog. (CVPR'08), pages 1-8, Anchorage, Alaska, June 2008.
[11] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674-679, 1981.
[12] A. Madabhushi and J. K. Aggarwal. A Bayesian approach to human activity recognition. In VS '99: Proceedings of the Second IEEE Workshop on Visual Surveillance, page 25, Washington, DC, USA, 1999. IEEE Computer Society.
[13] J. C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. In IEEE CVPR, 2007.
[14] R. Polana and R. C. Nelson. Detecting activities. In IEEE Computer Vision and Pattern Recognition (CVPR), 1993.
[15] D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In Neural Information Processing Systems (NIPS), December 2003.
[16] R. Rosales and S. Sclaroff. Trajectory guided tracking and recognition of actions. Technical report, Boston University, Boston, MA, USA, 1999.
[17] S. Savarese, A. D. Pozo, J. C. Niebles, and L. Fei-Fei. Spatial-temporal correlations for unsupervised action classification. In IEEE Workshop on Motion and Video Computing, 2008.
[18] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, pages 32-36, 2004.
[19] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc. GPU-based video feature tracking and matching. Technical report, Workshop on Edge Computing Using New Commodity Architectures, 2006.
[20] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, May 2004.
