
Arm and body gesture recognition

Tom Forrer
University of Fribourg
Fribourg, Switzerland

[email protected]

ABSTRACT
In this work the typical steps of vision-based gesture recognition, namely tracking, pose estimation and classification, are highlighted with a focus on arm and body gesture recognition. Some common methods and algorithms are presented, as well as some promising work with the potential to tackle the high dimensionality and the exploding sets of examples normally required to satisfy a realistic, uncontrolled gesture recognition environment.

Categories and Subject Descriptors
A.1 [Introductory and survey]; I.5.4 [Pattern Recognition]: Applications—Computer vision

1. INTRODUCTION
Gesture recognition is a complex topic in computer vision that tries to detect and interpret human movement, postures and gestures. Vision-based gesture recognition is normally composed of four steps: model initialization, tracking, pose estimation, and gesture recognition and classification [5] (see figure 1). These steps are further discussed in the following sections.

Figure 1: Four main steps in gesture recognition

This article was written in the context of a gesture recognition seminar at the University of Fribourg under the supervision of Prof. Rolf Ingold and Denis Lalanne. It is not intended to be a published academic paper. Additional information can be found at: http://diuf.unifr.ch/diva/

2. TRACKING TECHNIQUES
The visual tracking techniques in gesture recognition aim at extracting certain features or visual cues for posture estimation and gesture classification. While most of the feature extraction algorithms work principally on images from a single camera [9], many setups nowadays use stereoscopic vision from two or more cameras [3, 12]. The depth information gained from stereoscopic vision helps to overcome the big challenge of tracking body parts that are self-occluding in a monocular setting. However, stereoscopic vision still has the disadvantage of requiring faster processing hardware [9].

Most tracking techniques described in this section can be applied to other computer vision problems and to gesture recognition applications that focus on other body parts such as the hands, face or torso, but they are equally relevant for the two topics of this work, tracking arms and bodies.

In gesture recognition, tracking mainly consists of two processes: figure-ground segmentation and finding temporal correspondences [5]. Moeslund et al. classify figure-ground segmentation further into five categories: background subtraction, motion based, appearance based, shape based and depth based segmentation.

2.1 Figure-ground segmentation
Background segmentation
The simplest way to separate the background from the foreground is to take the difference between an empty reference background image and an arbitrary frame. However, this naive approach is rarely used, because pixels are not independent and time-varying background objects should also be considered part of the background.

In Pfinder, Wren et al. represent each pixel by a Gaussian described by a full covariance matrix, where each pixel is updated recursively with the corresponding statistical properties [15] (see figure 2). This allows for changes in the background scene such as waving curtains or moving objects. The pixels belonging to the figure can then be determined by their Mahalanobis distance. Often, however, the part that needs to be tracked is just a hand or the head, in order to further refine a search area. In this case a color cue can be obtained by the same methods.

Figure 2: Background segmentation in Pfinder [15]
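The following is a minimal sketch of such a per-pixel Gaussian background model. It simplifies Pfinder's full covariance matrix to an independent variance per channel, and all parameter values (update rate, distance threshold, initial variance) are illustrative assumptions rather than values from [15].

```python
import numpy as np

ALPHA = 0.05      # recursive update rate (assumed)
THRESHOLD = 3.0   # Mahalanobis distance threshold in standard deviations

class GaussianBackground:
    def __init__(self, first_frame):
        # Per-pixel mean and variance; a diagonal-covariance simplification
        # of Pfinder's full covariance matrix
        self.mean = first_frame.astype(np.float64)
        self.var = np.full_like(self.mean, 25.0)  # initial variance (assumed)

    def segment(self, frame):
        frame = frame.astype(np.float64)
        # Squared Mahalanobis distance per pixel, summed over color channels
        d2 = np.sum((frame - self.mean) ** 2 / self.var, axis=-1)
        foreground = d2 > THRESHOLD ** 2
        # Recursively update the statistics of background pixels only
        bg = ~foreground
        diff = frame[bg] - self.mean[bg]
        self.mean[bg] += ALPHA * diff
        self.var[bg] += ALPHA * (diff ** 2 - self.var[bg])
        return foreground
```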

Motion based segmentation
Motion based segmentation does not find temporal correspondences but inspects the difference of two consecutive frames, as Sidenbladh did for the figure-ground segmentation of sequences containing a walking figure [10]. Azoz et al. detect time-varying edges as a motion cue, separating edges belonging to the figure from edges in the background [2].
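A minimal sketch of this kind of temporal differencing; the threshold value is an illustrative assumption.

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=20):
    # Absolute difference of two consecutive color frames; pixels whose
    # change exceeds the threshold are treated as part of the moving figure.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff.max(axis=-1) > threshold
```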

Appearance based segmentation
The appearance of the searched figure can vary from person to person, but it definitely differs from the appearance of the background. Classifiers can therefore be trained with sets of images containing the appearance of the figure as positives and sets of background images as negatives, and then be used to detect that type of figure in the scene. Usually this gives no figure-ground segmentation but indicates the location in the scene with a bounding box, as in the sketch below.
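A hedged sketch of this idea, using HOG features and a linear SVM; the patch data here are random placeholders standing in for real training crops, and the feature parameters are common defaults, not values from any of the cited papers.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Placeholder patches: in practice these are cropped grayscale training
# images of the figure (positives) and of the background (negatives).
positive_patches = [rng.random((64, 64)) for _ in range(20)]
negative_patches = [rng.random((64, 64)) for _ in range(20)]

def describe(patch):
    # Histogram of oriented gradients as a simple appearance descriptor
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

X = np.array([describe(p) for p in positive_patches + negative_patches])
y = np.array([1] * len(positive_patches) + [0] * len(negative_patches))
classifier = LinearSVC().fit(X, y)

# At detection time the classifier is evaluated on a window slid over the
# image; the best scoring window yields a bounding box, not a segmentation.
```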

Shape based segmentation
Shape based segmentation often refers to silhouette based segmentation, which relies on a good background subtraction. Agarwal and Triggs use silhouettes because they are easily obtained, and shadowing in the figure, clothing and texture are not encoded in the information [1]. But silhouettes have several disadvantages, like occlusion problems or attached shadows that can distort the shape. Takahashi et al. [11] and Chu and Cohen [4] also extract shapes from multiple cameras to construct a visual hull of the figure. Shape based segmentation is not always silhouette based, though: Mori et al. use shapes obtained by edge detection [7] and, in [6], by a normalized cuts segmentation in order to identify salient limbs (see figure 3), which are later assembled into a posture. Azoz et al. use shape filters to detect clusters of colors [2].
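As a small illustration (not taken from the cited papers), a silhouette can be obtained from a binary foreground mask by keeping the largest outer contour:

```python
import cv2
import numpy as np

def largest_silhouette(mask):
    # `mask` is a binary foreground mask, e.g. from background subtraction
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    # The largest outer contour is taken as the figure's silhouette
    return max(contours, key=cv2.contourArea)
```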

Figure 3: Normalized cuts segmentation [6]

Depth based segmentation
Using two or more cameras, a depth map or a complete three-dimensional reconstruction can be estimated. A depth map (see figure 4) can be computed from a disparity map, which is obtained by a correspondence algorithm. Due to estimation errors, occlusions or lens geometry problems, depth maps do not always reflect the 3D scene accurately. In [3], Chien et al. refine the accuracy of the correspondence algorithm with epipolar geometry, using the pointing direction from several images. In [4, 11, 12] silhouettes of the figure from multiple cameras are used to obtain a three-dimensional visual hull with a voxel reconstruction algorithm.
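A minimal sketch of computing such a disparity map with OpenCV's block matching stereo correspondence; the file names and matcher parameters are assumptions for illustration.

```python
import cv2

# Rectified grayscale images from a calibrated stereo pair (assumed paths)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching correspondence: for each pixel, find the best matching
# block along the same scanline in the other image.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# Depth is inversely proportional to disparity: Z = f * B / d, where f is
# the focal length and B the baseline between the two cameras.
```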

Figure 4: Scene with estimated depth map [8]

2.2 Temporal correspondences
Finding temporal correspondences resolves ambiguities while tracking objects or figures in an image, given their states in past frames. Temporal correspondences can help when multiple tracked objects occlude each other, or simply to correctly distinguish two nearby objects.

Commonly, the task of tracking objects across frames is done with an estimator, which first predicts the location of the object in the next frame based on previous tracking results, and then reconciles the measured location of the object with the prediction. A broader class of estimators, the condensation algorithms on which particle filters are based, can also take occlusions into account by representing multiple possibilities as hypotheses or "particles" to predict future locations of the tracked objects.
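A minimal particle filter sketch for a 2D image location, assuming a random-walk motion model and Gaussian measurement noise; the noise levels and the dummy detections are illustrative, not from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
particles = rng.uniform(0, 640, size=(N, 2))  # hypotheses of the location
weights = np.full(N, 1.0 / N)

def predict(particles, motion_noise=5.0):
    # Diffuse the hypotheses with a simple random-walk motion model
    return particles + rng.normal(0, motion_noise, particles.shape)

def update(particles, weights, measurement, meas_noise=10.0):
    # Reweight each hypothesis by the likelihood of the measured location
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    w = weights * np.exp(-0.5 * d2 / meas_noise ** 2)
    return w / w.sum()

def resample(particles, weights):
    # Draw a new particle set, favouring well supported hypotheses
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

for measurement in [(320, 240), (324, 238), (330, 241)]:  # dummy detections
    particles = predict(particles)
    weights = update(particles, weights, np.array(measurement))
    particles, weights = resample(particles, weights)
    estimate = particles.mean(axis=0)  # tracked location estimate
```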

Schmidt and Fritsch use a kernel based particle filter [9], which has the advantage over standard particle filters of avoiding the huge number of particles otherwise required for tracking the upper body and arms with 14 degrees of freedom.

3. POSE ESTIMATION
Once the searched limbs or figure features are tracked, pose estimation, or model matching, adds a layer of abstraction by assembling the features into a superstructure which can later be classified. The main goal of pose estimation is to recover an underlying skeletal structure. Moeslund et al. propose a separation of pose estimation methods according to their use of a model [5]:

3.1 Model-free
Model-free methods do not use a-priori knowledge of the body configuration. The limbs or parts are either assembled dynamically, or their constellation can be matched to a pose by comparing it to a database of examples.

Agarwal and Triggs presented a method which recovers a 3D pose from monocular images without using manually labelled examples or templates, nor a body model. The pose is estimated by inferring joint angles from silhouettes with Relevance Vector Machine (RVM) regression [1].

One of the first model-free pose estimations was Pfinder [15]. Wren et al. dynamically add blobs to the pose, either by contour matching or by a color splitting process.

Another approach is to dynamically decompose a visual hull into 3D Haarlets (see figure 5) using Linear Discriminant Analysis (LDA) [12], or into shape descriptor atoms with a matching pursuit algorithm and Singular Value Decomposition (SVD), keeping the shape descriptors with the largest eigenvalues [4]. Both methods have a key advantage: the atomic shape descriptors and the 3D Haarlets have an additive property (see figure 6), drastically reducing the number of possible poses against which the measured visual hull has to be matched.
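Purely as an illustration of the general idea of matching in a low-dimensional descriptor space (not the actual algorithms of [4, 12]), the following sketch compresses flattened voxel hulls with an SVD basis and matches a measured hull to the nearest training pose; all data and dimensions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
train_hulls = rng.random((200, 32 * 32 * 32))  # placeholder voxel grids
train_poses = rng.random((200, 10))            # placeholder joint angles

# Keep the components with the largest singular values as shape descriptors
_, _, vt = np.linalg.svd(train_hulls, full_matrices=False)
basis = vt[:16]                                # 16 atoms (assumed)
train_codes = train_hulls @ basis.T

def estimate_pose(measured_hull):
    # Project the measured hull onto the atoms and find the nearest
    # training example in the compressed descriptor space
    code = measured_hull @ basis.T
    nearest = np.argmin(np.linalg.norm(train_codes - code, axis=1))
    return train_poses[nearest]
```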

Figure 5: 3D Haarlet [12]

Figure 6: Visual Hull [4]

3.2 Indirect model use
This category does not map the tracked information directly onto a model; the model rather guides the interpretation. Information like joints, limb length ratios, etc. is labelled in the examples.

Mori et al. use multiple constraints like torso length ratios, adjacency to the torso or joints, a-priori body configuration and clothing properties [6]. In another approach, Mori et al. recover the pose from labelled silhouette examples by deforming the measured shape, and with it the skeletal structure, to match the labelled example [7] (see figure 7).

Figure 7: Deformable shape context matching [7]

3.3 Direct model use
With the direct model approach, the geometric representation of the figure is known beforehand, and the measured data is matched against the model. In newer approaches the model contains human motion constraints to further eliminate ambiguities.

Takahashi et al. implement this fitting process with an articulated cylindrical model of 10 body parts (see figure 8) [11].

Figure 8: Full body model [11]

Schmidt and Fritsch also use an a-priori known model of the upper body with individual joint angle limits (see figure 9) [9].

Figure 9: Upper body model [9]


4. RECOGNITION AND CLASSIFICATION
There are several approaches to classify gestures, depending on what information is to be classified. Because gestures can be viewed as sequences of postures in time, the most widely used classification method is the Hidden Markov Model (HMM). HMMs are well suited for the classification of gestures, as they model the spatio-temporal variability and maximize the likelihood of generating all examples of a gesture class [16].
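A minimal sketch of this classification scheme with discrete HMMs: one model per gesture class, scored on an observed posture sequence with the forward algorithm; the class with the highest likelihood wins. The model parameters and the quantised posture codes below are random placeholders, not trained values.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    # log P(obs | model) for initial probs pi, transitions A, emissions B,
    # with per-step rescaling to avoid numerical underflow
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha /= scale
    return log_lik + np.log(alpha.sum())

rng = np.random.default_rng(0)

def random_model(n_states=4, n_symbols=8):
    # Placeholder parameters; in practice these come from Baum-Welch training
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    return pi, A, B

models = {"raise_hand": random_model(), "jump": random_model()}
sequence = [0, 1, 2, 3, 3, 2]  # quantised posture codes (placeholder)
best = max(models, key=lambda g: forward_log_likelihood(sequence, *models[g]))
```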

Yang et al. trained an HMM for spotting specific body gestures like raising a hand, jumping etc., but also presented the concept of a garbage gesture model (see figure 10) and a key gesture spotting model (see figure 11). The garbage gesture model is an ergodic model and is tied into the key gesture spotting model at the beginning and at the end of a gesture [16].

Figure 10: Garbage gesture model [16]

Figure 11: Key gesture spotter model [16]

Both Wilson et al. and Chu and Cohen used the concept of a multiphasic gesture: in a sequence of postures, there is at least a transition to the key posture and a transition leaving it. Gestures can also have several key postures identifying a gesture. While Wilson et al. did not use hidden states but a simpler Markov chain, Chu and Cohen used a dual-state HMM for a primary and secondary decomposition of gestures, which also yields a smaller set of atoms and an HMM state space of linear complexity [4, 14].

Wang et al. introduce a powerful approach for gesture classification: hidden conditional random fields (HCRF) (see figure 13). HCRFs incorporate hidden state variables in a discriminative multi-class random field. As an extension of spatial Conditional Random Fields, they retain the main advantage of HMMs, the capability of modelling temporal sequences. HCRFs can be used like HMMs (see figure 12), but the hidden states can be shared among different gesture classes [13].

Figure 12: Hidden Markov Model [13]

Figure 13: Hidden Conditional Random Field [13]

5. CONCLUSIONS
In this work, the typical gesture recognition steps of tracking, pose estimation and classification were briefly inspected, with a focus on previous work concentrating on arm and body gesture recognition. Some common tracking methods, like mixtures of Gaussians for background subtraction and particle filters for temporal correspondences, were mentioned, as well as some powerful, relatively new methods like shape deformation and voxel reconstruction for pose estimation.

The field of arm and body gesture recognition still faces many challenges when the goal is to operate in a setting as general and uncontrolled as possible: often the available computational resources are not sufficient, the state space is too high-dimensional, or the amount of tracked data is too big. The small list of examples in the categories of pose estimation and classification indicates, however, that decomposition, hierarchisation and the sharing of common information show great promise.


APPENDIX
A. REFERENCES
[1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58, Jan. 2006.

[2] Y. Azoz, L. Devi, and R. Sharma. Reliable tracking of human arm dynamics by multiple cue integration and constraint fusion. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 905–910, June 1998.

[3] C.-Y. Chien, C.-L. Huang, and C.-M. Fu. A vision-based real-time pointing arm gesture tracking and recognition system. In Proc. IEEE International Conference on Multimedia and Expo, pages 983–986, July 2007.

[4] C.-W. Chu and I. Cohen. Posture and gesture recognition using 3D body shapes decomposition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, page 69, June 2005.

[5] T. Moeslund, A. Hilton, and V. Krueger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2–3):90–126, 2006.

[6] G. Mori, X. Ren, A. Efros, and J. Malik. Recovering human body configurations: Combining segmentation and recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 326–333, 2004.

[7] G. Mori and J. Malik. Recovering 3D human body configurations using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7):1052–1062, July 2006.

[8] K. Nickel and R. Stiefelhagen. Pointing gesture recognition based on 3D-tracking of face, hands and head orientation. In Proceedings of the 5th International Conference on Multimodal Interfaces, page 146. ACM, 2003.

[9] J. Schmidt, J. Fritsch, and B. Kwolek. Kernel particle filter for real-time 3D body tracking in monocular color images. In Proc. 7th International Conference on Automatic Face and Gesture Recognition (FGR 2006), pages 567–572, April 2006.

[10] H. Sidenbladh. Detecting human motion with support vector machines. In Proceedings of the 17th International Conference on Pattern Recognition, volume 2, pages 188–191, 2004.

[11] K. Takahashi, Y. Nagasawa, and M. Hashimoto. Remarks on markerless human motion capture from voxel reconstruction with simple human model. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 755–760, 2008.

[12] M. Van den Bergh, E. Koller-Meier, and L. Van Gool. Fast body posture estimation using volumetric features. In Proc. IEEE Workshop on Motion and Video Computing (WMVC 2008), pages 1–8, Jan. 2008.

[13] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1521–1527, 2006.

[14] A. D. Wilson, A. F. Bobick, and J. Cassell. Recovering the temporal structure of natural gesture. In Proc. Second International Conference on Automatic Face and Gesture Recognition, pages 66–71, Oct. 1996.

[15] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, July 1997.

[16] H.-D. Yang, A.-Y. Park, and S.-W. Lee. Robust spotting of key gestures from whole body motion sequence. In Proceedings of the Seventh International Conference on Automatic Face and Gesture Recognition, pages 231–236, 2006.