Page 1: HUMAN BODY MODELINGAND TRACKING USING …cvrr.ucsd.edu/publications/2008/cuong_ICDSC08.pdf3D features reconstructed from multiple views [1, 2, 3,4, 5, 6, 13, 20], e.g. volumetric (voxel)

HUMAN BODY MODELING AND TRACKING USING VOLUMETRIC REPRESENTATION: SELECTED RECENT STUDIES AND POSSIBILITIES FOR EXTENSIONS

Cuong Tran Mohan M. Trivedi

Computer Vision and Robotics Research Laboratory
University of California, San Diego

ABSTRACT

Articulated human body modeling and tracking from vision data is an attractive research area with many potential applications. A tremendous amount of related research has been done in this area. Therefore, a comprehensive insight into high-quality existing work and an awareness of the research frontier are essential for follow-up studies. With that objective, this paper reviews the subarea of model-based methods for human body modeling and tracking using volumetric (voxel) data. We focus on analyzing and comparing recent techniques, especially those from the past two years, in order to highlight trends in the domain and to point out limitations of the current state of the art. Based on this analysis, we discuss our idea of combining Laplacian Eigenspace (LE) based voxel segmentation [20] with the Kinematically Constrained Gaussian Mixture Model (KC-GMM) method [3] to obtain a more powerful human body pose estimation system, and we discuss other possibilities for future work.

Keywords- Vision based, markerless, human body pose estimation, volumetric reconstruction

1. INTRODUCTION

Vision-based pose estimation and tracking of the articulated human body is the problem of estimating the kinematic parameters of a body model (such as joint positions and joint angles) from a static image or a video sequence as the body's position and configuration change over time. Related research in this area includes body pose estimation, hand pose estimation, and head pose estimation. Among those, the most extensive subfield is body pose estimation, which refers to an articulated body model, normally with a torso, a head, and 4 limbs but without details of the hands, feet, or facial variation. A good body pose estimation system has many potential applications, including advanced Human Computer Interaction (HCI), 3D animation, intelligent environments, and robot control. Compared to previous technologies that use markers or specific devices, markerless vision-based approaches provide more natural, non-contact solutions. This is, however, a very challenging task. One major reason is the very high dimensionality of the pose configuration space; e.g. in [3], 19 DOF (degrees of freedom) are used for the body model and 27 DOF for the hand model. Moreover, we also have to deal with other common issues in computer vision such as self-occlusion and variation in lighting conditions, shadows, and object appearance (e.g. different clothes, hair, ...).

978-1-4244-2665-2/08/$25.00 ©2008 IEEE

Surveys of techniques for human body pose estimation can be found in [14, 15, 17, 23], each with a different focus and taxonomy. Werghi [23] provided a general overview of both 3D human body scanner technologies and approaches dealing with such scanned data, which focus on one or more of the following topics: body landmark detection, segmentation of body scan data, body modeling, and body tracking. Poppe [17] surveyed pose estimation techniques and described two divisions: 2D versus 3D approaches, depending on whether the goal is a 2D or a 3D pose representation, and model-based versus model-free approaches, depending on whether an a priori kinematic body model is employed. That survey split the pose estimation process into a modeling process, which is the construction of the likelihood function, and an estimation process, which is concerned with finding the optimal pose given the likelihood. Moeslund et al. [14] split the pose estimation process into initialization, tracking, pose estimation, and recognition. In [15], they also provided an updated review of advances in human motion capture for the period from 2000 to 2006.

It is not easy to arrive at a unified taxonomy of the broad area of human body modeling and tracking. Figure 1 gives a simple block diagram of a generic human body pose estimation system, in which we first need components to extract useful features from the input vision data and then a procedure to infer body pose from the extracted features. We can loosely categorize related research into monocular [9, 12, 18] and multi-view approaches [1, 2, 3, 4, 5, 6, 10, 11, 13, 16, 20]. Compared to a monocular view, multi-view data can help to reduce the self-occlusion issue and provide more information, which makes the pose estimation task easier and improves accuracy. Among multi-view approaches, some methods use 3D features reconstructed from multiple views [1, 2, 3, 4, 5, 6, 13, 20], e.g. volumetric (voxel) data, while others still use 2D features [10, 11, 16], e.g. color, edges, silhouettes. Because the real body pose is in 3D, using voxel data avoids the repeated projection of the 3D body model onto the image planes to compare against the extracted 2D features. Furthermore, reconstructed voxel data avoid the image scale issue. These advantages allow the design of simple algorithms that make use of our knowledge about the shapes and sizes of body parts. For example, Mikic et al. [13] used specific information about the shape and size of the head and torso in a hierarchical growing procedure (detecting the head first, then the torso, then the limbs) for body model acquisition that can be used effectively even when there are large displacements between frames. Several methods that use voxel data alone indicate that voxel data is a strong cue for body pose estimation. Of course, there is an additional computational cost for voxel reconstruction, but efficient techniques for this task have also been developed [1, 4, 5, 19].

Figure 1. Block diagram of a generic human body pose estimation system. A dashed line means that the underlying kinematic model may or may not be used. Gray boxes show the focus of this paper: model-based methods that use voxel data and aim to extract full 3D posture.

Another input used in many methods is a predefined kinematic model of the human body. These methods are called model-based methods: there is an underlying kinematic model and a procedure to fit that model to the real input data. There are also model-free methods, which assume no underlying kinematic model and contain procedures to learn a direct mapping from the feature space to the pose configuration space. Although information from an underlying kinematic body model can help to improve accuracy and robustness, the advantage of model-free methods is that they do not suffer from the (re)initialization issue and can be used to initialize model-based methods.

Regarding the pose estimation result, two research directions have emerged. One aims only to extract high-level abstract information corresponding to the motion and posture of the body, which can then be applied, for example, to gesture classification. The other aims to recover the real (full) 3D motion and posture of the human body. The latter is more challenging but also worth pursuing, because it provides more general, principled methods that can be adapted to extract different high-level abstract information depending on the application area. Moreover, various interaction styles and applications explicitly rely on full 3D pose information. This paper focuses on model-based methods for real 3D human body pose estimation using reconstructed voxel data.

The remainder of the paper is organized as follows. Section 2 is a review of selected recent model-based methods for human body pose estimation using voxel data. Motivated by this review, Section 3 presents the idea of combining LE based voxel segmentation and the KC-GMM method into a more powerful system for human body pose estimation. Finally, Section 4 gives concluding remarks and discusses directions for future work.

2. A REVIEW OF SELECTED RECENT MODEL-BASED METHODS FOR HUMAN BODY POSE ESTIMATION USING VOXEL DATA

Figure 2. Flowchart of common model-based methods for articulated human body pose estimation using voxel data. Dashed boxes mean that some methods may or may not have all of these steps. The initialization/segmentation step may be called at each frame, or only when the body model needs to be initialized or re-initialized during tracking. A summary of the steps contained in some selected methods is shown at the bottom.

Authorized licensed use limited to: Univ of Calif San Diego. Downloaded on January 21, 2009 at 14:53 from IEEE Xplore. Restrictions apply.


| Authors | Initialization | Model | Estimation procedure | Evaluation | Comment |
|---|---|---|---|---|---|
| Delamarre et al. '01 [6] | A priori known model with manual initial placement | 3D primitive shape model (truncated cones, spheres, parallelepipeds components) | Tracking-based method (Kalman filter). Uses physical forces, a simpler form of Iterative Closest Point (ICP), to make the 3D model and the voxel data intersect | Visual only | Not really a method for estimating pose in voxel space (only using 2 views with epipolar geometry reconstruction) -> limitation due to possible ambiguities. |
| Mikic et al. '03 [13] | Hierarchically grows model over data from head to torso, to limbs | Ellipsoidal, cylindrical components. Described with twists | Tracking-based method. Uses Kalman filter to predict next pose, then updates using growing procedure & Bayesian Networks | Visual only | Fully automated; can track even for large displacement |
| Cheung et al. '03 [5] | CSP alignment & segmentation via motion clustering | Skeletal body model | Uses Colored Surface Point (CSP). Hierarchical segmentation / SFS alignment to recover motion, shape and joint | Synthesized ground-truth | Fully automated |
| Sundaresan et al. '07 [20] | Voxel segmentation in LE and a procedure for body parts registration | 6-chains representation of body. Superquadric components in body model | Segment voxel data in Laplacian Eigenspace (LE). Probabilistically register segmented voxels to body parts, then estimate skeletal and superquadric parameters | Synthesized ground-truth | Fairly general LE based voxel segmentation; fully automated; however, is sensitive to noise in voxel data |
| Cheng et al. '07 [3] | Manual initialization of body parts dimension & initial pose | Ellipsoidal components. Described by Kinematically Constrained Mixture Model (KC-GMM) | Integrate kinematic constraints in KC-GMM model. Derive EM algorithm with KC-GMM for pose estimation (no additional projection step) | Synthesized & marker-based motion capture for ground-truth | Has generality: has been applied for both body and hand; however, requires a manual initialization |
| Caillette et al. '08 [1] | A priori known skeletal model. Initialize with K-mean blob fitting procedure | Skeletal body model & Gaussian blobs | Tracking-based method. Break complex movement into basic motions. Use Variable Length Markov Model (VLMM) to predict candidate pose. Evaluate with blob fitting. Use colored voxels for more robust tracking | Manually annotated ground-truth | Fast (real-time); however, limited to trained movement sequences |

Table 1. Summary of selected model-based methods for body pose estimation using voxel data. The last three rows are recent methods chosen for more detailed discussion.

Figure 2 shows a typical flowchart of common model-based pose estimation methods using voxel data. There are five main steps: camera calibration/data capture, voxel reconstruction, initialization/segmentation (segmenting voxel data into different body parts), modeling/estimation (estimating pose using the current frame only), and tracking (using temporal information from previous frames in estimating the body pose in the current frame). The first two steps are common to all methods of this kind, while different methods cover different combinations of the last three steps. A summary of which steps are contained in some selected methods is shown at the bottom of Figure 2. Compared to previous surveys [14, 15, 17, 23], this review focuses on analyzing model-based methods for body pose estimation using voxel data. In addition to some methods already mentioned in previous surveys (prior to 2007), we emphasize selected methods from 2007 and 2008 that could be considered the current state of the art: a general probabilistic method that can be applied to both body and hand models, Cheng et al. '07 [3]; a method with real-time performance, Caillette et al. '08 [1]; and a new general LE based method for voxel segmentation, Sundaresan et al. '07 [20].

In the data capture and voxel reconstruction steps, several human body scanner technologies are mentioned in [23]; here we are concerned with vision-based techniques that reconstruct voxel data of the body from multiple-perspective cameras. A common approach is shape-from-silhouette (visual hull). First, the images from multiple synchronized cameras are segmented into object silhouettes using background subtraction techniques [7, 8]. Then efficient shape-from-silhouette techniques [4, 5, 19] can be used to retrieve the 3D voxel data. Another approach, shape-from-photo-consistency (photo hull) [21], uses other features from the photos (not just the silhouette) to obtain more accurate geometry in the reconstructed photo hull. For body pose estimation, the more accurate geometry is not necessary, so using the visual hull is more appropriate (it should be faster and more robust to noise).
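The shape-from-silhouette idea can be illustrated with a small sketch. It is not the algorithm of [4, 5, 19]; it assumes, purely for illustration, three orthographic cameras looking along the x, y, and z axes, so that projecting a voxel just means dropping one coordinate. The function name `carve_visual_hull` and the toy scene are hypothetical. A voxel is kept only if it falls inside the silhouette in every view.

```python
import numpy as np

def carve_visual_hull(silhouettes, grid_shape):
    """Keep a voxel only if it projects inside every silhouette.

    Assumption: three orthographic views along the x, y and z axes.
    A real system would use calibrated perspective cameras instead.
    """
    occupied = np.ones(grid_shape, dtype=bool)
    # silhouettes[a] is a 2D boolean mask over the two axes other than a.
    occupied &= silhouettes[0][np.newaxis, :, :]  # view along x: mask over (y, z)
    occupied &= silhouettes[1][:, np.newaxis, :]  # view along y: mask over (x, z)
    occupied &= silhouettes[2][:, :, np.newaxis]  # view along z: mask over (x, y)
    return occupied

# Toy scene: a solid 2x2x2 cube inside a 4x4x4 voxel grid.
truth = np.zeros((4, 4, 4), dtype=bool)
truth[1:3, 1:3, 1:3] = True

# In each view, a pixel is foreground if any voxel along its ray is occupied.
silhouettes = [truth.any(axis=a) for a in range(3)]
hull = carve_visual_hull(silhouettes, truth.shape)
```

The visual hull always contains the true object; for this convex toy cube the two coincide, while concavities that no silhouette reveals would stay filled.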

The modeling and tracking steps can be considered as a mapping from the input space of voxel data Y and the information in the predefined model (e.g. kinematic constraints) C to the body model configuration space Θ:

$$M: (Y, C) \mapsto \Theta$$

The body model configuration contains both static parameters (i.e. shape and size of each body component) and dynamic parameters (i.e. mean and orientation of each body component); the static parameters are estimated in the initialization step. Some methods have an automatic initialization step, like [13, 20], while others require a priori known or manually initialized static parameters. The main differences between methods of this kind lie in the body model that they use and in how they implement the mapping procedure M. Methods that have a modeling step but no tracking step are also called single frame-based methods, while methods with a tracking step are called tracking-based methods. Because the tracker in tracking-based methods can lose track over long sequences, multiple hypotheses at each frame can be used to improve the robustness of tracking. The single frame-based approach is a more difficult problem because it makes no assumptions about time coherence. However, this kind of approach is needed for initialization or reinitialization of tracking-based methods. Regarding the evaluation step, some methods have only visual evaluation, while others have both visual evaluation and quantitative evaluation using ground-truth data obtained from synthesized data, manual annotation, or a marker-based motion capture system.

According to the factors mentioned above, a summary of several recent model-based methods for human body pose estimation using volumetric data is shown in Table 1. In the following subsections, we discuss in more detail some selected state-of-the-art methods, which were published in the past two years, to emphasize the important results and limitations of each one.

2.1. Kinematically Constrained Gaussian Mixture Model (KC-GMM) approach for both body and hand pose estimation [3]

This is one of very few methods that have been applied (with experimental results) to both body models and hand models. Among several methods competing in the Workshop on Evaluation of Articulated Human Motion and Pose Estimation - CVPR EHuM2 2007 (including [3, 10, 16]), this method won the first prize.

The hand model and body model used in this method are shown in Figure 3(a). For the hand, there are 16 components with 27 DOF (degrees of freedom). For the body, there are 11 components with 19 DOF. The pose estimation procedure of this method uses the paradigm of probabilistic clustering. Each body/hand component is described by a Gaussian, and the set of components is kinematically constrained according to the predefined model. The goal is then to estimate the optimal values for the Gaussian Mixture Model (GMM) under those kinematic constraints. The kinematic constraints are represented by 3 equations corresponding to 3 types of constraint: spherical (3 DOF), Hardy-Spicer (2 DOF), and revolute (1 DOF).

$$c_s(\theta) = p_i + R_{\theta_i} a_{ij} - (p_j + R_{\theta_j} a_{ji})$$

$$c_h(\theta) = R_{\theta_i} q_{ij} \cdot R_{\theta_j} q_{ji}$$

$$c_r(\theta) = R_{\theta_i} q_{ij} - R_{\theta_j} q_{ji}$$

where $\theta$ is the embodiment of the kinematic constraints and all configuration parameters; $p_i$, $p_j$ are the means of components i and j; $R_{\theta_i}$, $R_{\theta_j}$ are the rotations of the components relative to the world coordinates; $a_{ij}$, $a_{ji}$ are the joint positions in the component coordinate frames (the origin is at the component center); and $q_{ij}$, $q_{ji}$ are the rotation axes of each component in the respective component coordinate frames. We can interpret these equations as follows: $c_s = 0$ means that the two joints on the two components coincide, which gives a 3 DOF constraint; $c_h = 0$ means that the 2 rotation axes are perpendicular, which combined with $c_s = 0$ gives a 2 DOF constraint; $c_r = 0$ means that the 2 rotation axes are aligned, which combined with $c_s = 0$ gives a 1 DOF constraint.
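To make the three constraint residuals concrete, here is a small numerical sketch. It is an illustration under stated assumptions, not code from [3]: the helper `rot_z`, the function names, and the two-component "arm" values are hypothetical, and the Hardy-Spicer residual is computed as a dot product because the text reads c_h = 0 as the two axes being perpendicular.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the world z axis (illustrative helper)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def c_spherical(p_i, R_i, a_ij, p_j, R_j, a_ji):
    """Zero when the joint points of both components coincide in world space."""
    return p_i + R_i @ a_ij - (p_j + R_j @ a_ji)

def c_hardy_spicer(R_i, q_ij, R_j, q_ji):
    """Zero when the two rotation axes are perpendicular in world space."""
    return float((R_i @ q_ij) @ (R_j @ q_ji))

def c_revolute(R_i, q_ij, R_j, q_ji):
    """Zero when the two rotation axes are aligned in world space."""
    return R_i @ q_ij - R_j @ q_ji

# Component i: centered at the origin, joint at its +x end.
p_i, R_i, a_ij = np.zeros(3), np.eye(3), np.array([1.0, 0.0, 0.0])
# Component j: bent 90 degrees about z, joint at its own -x end.
R_j, a_ji = rot_z(np.pi / 2), np.array([-1.0, 0.0, 0.0])
# Place component j so that the two joint points coincide.
p_j = p_i + R_i @ a_ij - R_j @ a_ji

z_axis = np.array([0.0, 0.0, 1.0])
x_axis = np.array([1.0, 0.0, 0.0])
res_s = c_spherical(p_i, R_i, a_ij, p_j, R_j, a_ji)
res_r = c_revolute(R_i, z_axis, R_j, z_axis)      # hinge about a shared z axis
res_h = c_hardy_spicer(R_i, z_axis, R_j, x_axis)  # perpendicular joint axes
```

In this toy configuration all three residuals vanish, so the pair would satisfy a revolute joint about z, or a Hardy-Spicer joint if the second component's axis were its local x axis.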

Figure 3. Some body/hand models used in the analyzed methods: (a) hand model and body model used in [3]; (b) body model used in [1]; (c) 6-chain, skeletal, and superquadric body models used in [20].

In a previous work by the same authors [2], these constraint equations are satisfied by adding a constraining step (C-step) to the EM algorithm for Gaussian mixture estimation. However, this additional C-step may compete with the M-step and cause instability in the optimization. The primary contribution of [3] is the removal of this C-step by incorporating the kinematic constraints into the probability model in the form of a prior probability, which gives a Kinematically Constrained Gaussian Mixture Model (KC-GMM):

$$P(Y, c \mid \theta) = P(c \mid \theta) \prod_n P(y_n \mid c, \theta)$$

where $Y = \{y_n\}$ represents the distribution of input voxel data. The EM algorithm for this new probability model is then derived to estimate the Gaussian component parameters, which can be interpreted as the body configuration.
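For readers unfamiliar with the underlying clustering step, the sketch below shows one E-step and one mean update of the M-step for an ordinary Gaussian mixture over voxel coordinates. It deliberately omits the kinematic prior P(c | θ), so it is not the KC-GMM of [3]; the function names and the toy two-cluster data are assumptions made for illustration only.

```python
import numpy as np

def e_step(voxels, means, covs, weights):
    """Responsibility resp[n, k] of Gaussian component k for voxel n."""
    n_points, n_comp = len(voxels), len(means)
    log_r = np.empty((n_points, n_comp))
    for k in range(n_comp):
        diff = voxels - means[k]
        inv_cov = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.einsum("ni,ij,nj->n", diff, inv_cov, diff)
        # The (2*pi)^(-3/2) factor is shared by all components and cancels.
        log_r[:, k] = np.log(weights[k]) - 0.5 * (maha + logdet)
    log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(log_r)
    return resp / resp.sum(axis=1, keepdims=True)

def m_step_means(voxels, resp):
    """Responsibility-weighted mean of the voxels for each component."""
    return (resp.T @ voxels) / resp.sum(axis=0)[:, np.newaxis]

# Two well-separated toy "body parts", three voxels each.
voxels = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                   [10.0, 0.0, 0.0], [11.0, 0.0, 0.0], [10.0, 1.0, 0.0]])
means = np.array([[0.5, 0.5, 0.0], [9.5, 0.5, 0.0]])
covs = [np.eye(3), np.eye(3)]
weights = np.array([0.5, 0.5])

resp = e_step(voxels, means, covs, weights)
new_means = m_step_means(voxels, resp)
```

In the constrained model, the component means and orientations would additionally be tied together through the joint constraints rather than updated independently as here.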

This method is quite general and was applied successfully to both HumanEvaII body data and synthesized hand data. Some visual and quantitative results from [3] are shown in Figure 4. However, this method is not fully automated because it requires a careful manual initialization step. This is an obstacle to using the method in real-time applications. Another issue of the KC-GMM method is that, due to the nature of the EM algorithm, it can get stuck in a sub-optimal solution, especially when there is a large displacement between frames.

2.2. Real-time approach using Variable Length Markov Models (VLMM) pose prediction followed by a Gaussian blob fitting procedure for body modeling and tracking [1]

Because the model of the articulated body is complex and high dimensional, the running speed of a pose estimation algorithm is a real issue. Real-time performance is a prominent goal of this method, and it is one of very few methods that have reported run-time performance.

The body model used in this method is shown in Figure 3(b). It consists of a skeletal model and Gaussian blobs attached to the bones of this skeletal model. For real-time performance in voxel reconstruction, the authors proposed not to perform binary segmentation of the input images but instead to compute a measure of the distance to the background model for each 2D sample. The statistics of these distances across the available views are then used to classify voxels. In this method, color information is also kept along with each voxel, which allows more robust tracking.

This is a tracking-based method that exploits temporal dependencies from previous frames. For more accurate and efficient hypothesis propagation (pose prediction), complex human activities such as dancing are broken into elementary movements using a variant of the EM algorithm (which partitions the parameter space into Gaussian clusters). The transition between clusters is predicted using a Variable Length Markov Model (VLMM), which can explain high-level behaviors over a long history. The likelihood is evaluated with a Gaussian blob fitting procedure. This blob-fitting procedure can detect tracking failures, e.g. when the best achieved likelihood is below a threshold. A reinitialization can then be requested by performing blob fitting from all clusters, which, however, might cause a considerable lag.
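The idea of predicting the next movement cluster from a variable-length history can be sketched as follows. This is a simplified stand-in, not the VLMM training of [1]: it keeps raw counts for every context up to `max_order` and backs off to a shorter context whenever the longer one was seen fewer than `min_count` times. The function names, the count threshold, and the toy cluster labels are all assumptions.

```python
from collections import Counter, defaultdict

def train_vlmm(sequence, max_order):
    """Count next-symbol frequencies for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for t in range(len(sequence)):
        for order in range(max_order + 1):
            if t - order < 0:
                break
            context = tuple(sequence[t - order:t])
            counts[context][sequence[t]] += 1
    return counts

def predict_next(counts, history, max_order, min_count=2):
    """Predict with the longest recent context seen at least min_count times."""
    for order in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - order:])
        total = sum(counts[context].values()) if context in counts else 0
        if total >= min_count:
            return counts[context].most_common(1)[0][0]
    return None

# Toy sequence of movement-cluster labels: after "A B" the next cluster
# alternates between C and D, so one or two symbols of history are ambiguous.
training = list("ABCABDABCABDABCABD")
model = train_vlmm(training, max_order=3)

after_cab = predict_next(model, list("CAB"), max_order=3)
after_dab = predict_next(model, list("DAB"), max_order=3)
```

Here a three-symbol context resolves the ambiguity that a first-order Markov chain could not, which is the motivation for letting the memory length vary.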

Figure 5 shows some experimental results of this method, which indicate an improvement compared to some standard particle filter based algorithms. The run-time performance was also reported: the total time for both voxel reconstruction and body pose inference is around 50-110 milliseconds, depending on the configuration parameters. However, the performance of this method depends largely on the correctness of the prediction, which means that it requires a good training phase and works well only with specific types of trained movements. The current implementation does not handle new movements that were not seen in the training data.

[Plot panels not reproduced; surviving labels: "Frame Index", "Spatial Error [cm]", "Angular Error [degrees]", and the panel title "Overall Positional Error: Voxel Resolution 0.5 cm".]

Figure 4. Experimental results in [3]: The first and second rows are visual results with HumanEvaII body data and synthesized hand data. The third and fourth rows are quantitative results (joint position/angle) for synthesized hand data.

Figure 5. Experimental results in [1]: The visual result with a ballet sequence and the quantitative results (joint position/angle), which compare the performance of the method in [1] with some standard particle filter based methods.

Figure 6. Experimental results in [20] with synthesized data. They also experimented with scanned data, real captured data, and the HumanEvaII dataset.

Figure 7. LE-based voxel segmentation in [20] performed successfully in the case of self-contact, which some previous voxel segmentation algorithms do not address.

2.3. Laplacian Eigenspace (LE) based approach for body modeling [20]

This is a kind of skeletonization method that obtains the skeletons of individual articulated body chains. The voxel segmentation technique in this method is quite general and can handle poses where there is self-contact, i.e. when one or more limbs touch other body parts. In this method, the voxel data is first segmented into 6 chains representing the body (torso, head, and 4 limbs). Based on this segmentation, a more detailed skeletal model and a superquadric model representing the body are estimated. These representations of the body (6-chain, skeletal, and superquadric models) are shown in Figure 3(c).

Figure 8. Intended steps of a method combining LE-based voxel segmentation and the KC-GMM method for automatic initialization and tracking of the human body: (1) acquire body voxel data at an initial specific pose; (2) segment the initial voxel data using spline fitting in LE; (3) initialize the body model using the segmented voxel data; (4) continue tracking with the KC-GMM method.

A main contribution of this method is the discovery of the properties of the Laplacian Eigenspace (LE) transformation after a comparison of several manifold techniques (Laplacian, Isomap, MDS, ...). When mapped into a high dimensional (e.g. 6D) LE, the voxel data of body chains such as limbs, whose length is greater than their thickness, form a smooth 1-D curve, which can then be used to segment the voxel data into different body chains. The procedure for LE mapping is as follows. First, we compute the adjacency matrix W of the voxel data, such that $W_{ij} = 1$ only if voxel i is a neighbor of voxel j. Then we compute the matrix D, so that

$$D_{ii} = \sum_{k=1}^{m} W_{ik} \quad \text{and} \quad D_{ij} = 0 \ \text{for} \ i \neq j.$$

The first d eigenvectors of $L = D - W$ with minimum eigenvalues give the d basis vectors of the needed LE. After mapping into LE, there are several 1-D curves corresponding to different body chains. A spline fitting process is used to segment the curves, which results in the segmentation of their respective body chains. The segmented voxel clusters are then registered to their actual body chains using a probabilistic registration procedure. Next, using general knowledge about human stature, a further procedure estimates the skeletal and superquadric models of the body.
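The LE mapping steps above (adjacency matrix W, degree matrix D, eigenvectors of L = D - W) can be sketched in a few lines. This is a minimal illustration, not the implementation of [20]: it assumes integer voxel coordinates and 6-connectivity as the neighbor relation, uses a dense matrix that would not scale to a full body volume, and the function name `laplacian_eigenmap` is hypothetical.

```python
import numpy as np

def laplacian_eigenmap(voxels, d):
    """Return the d smallest eigenvalues of L = D - W and their eigenvectors.

    Assumptions: voxels are integer grid coordinates, and two voxels are
    neighbors when they differ by exactly one step along one axis.
    """
    voxels = np.asarray(voxels)
    # Manhattan distance between every pair of voxels.
    dist = np.abs(voxels[:, None, :] - voxels[None, :, :]).sum(axis=2)
    W = (dist == 1).astype(float)   # adjacency matrix
    D = np.diag(W.sum(axis=1))      # D_ii = sum_k W_ik, zero off the diagonal
    L = D - W
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvals[:d], eigvecs[:, :d]

# Toy "limb": a straight chain of 5 voxels along x.
limb = [[x, 0, 0] for x in range(5)]
vals, vecs = laplacian_eigenmap(limb, d=2)
# For a connected voxel set, the eigenvector of eigenvalue 0 is constant;
# the next one varies monotonically along the chain (up to an overall sign).
second = vecs[:, 1]
steps = np.diff(second)
```

For this elongated toy chain the second coordinate orders the voxels along the limb, which is the 1-D curve structure that the spline fitting step then exploits.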

They experimented with several synthesized, scanned, and real captured body data sets (e.g. Figure 6). Figure 7 shows that the proposed LE-based voxel segmentation performed successfully in the case of self-contact, which was not addressed in some previous voxel segmentation algorithms. Their experimental result with the HumanEvaII dataset, however, was not good (only around 9% of the total frames were successfully segmented and registered), which indicates that the LE based segmentation step is sensitive to voxel noise, and this affects all subsequent steps.

3. IDEA OF COMBINING LE-BASED VOXEL SEGMENTATION AND KC-GMM METHOD INTO A MORE POWERFUL SYSTEM FOR HUMAN BODY POSE ESTIMATION

As discussed above, a desired improvement of the KC-GMM method [3] is an automated initialization step. A possible solution is to use the results in [13], which rely on specific information about the shape and size of the head and torso in a hierarchical growing procedure for body model acquisition: starting by locating the head, then the torso, then the limbs. In doing so, however, we would lose the generality advantage of the KC-GMM method, meaning we could not apply it to other articulated models such as the hand. The voxel segmentation using LE transformation in [20] retains this generality; for example, we should be able to apply it to the hand case because fingers also have a length greater than their thickness (although it should be mentioned that the subsequent steps in [20], probabilistic registration and model estimation, are specific to the body case). Therefore, LE-based voxel segmentation would be a more appropriate choice for extending the KC-GMM method with an automated initialization step. Regarding the LE-based method for body modeling [20], the experiment with the HumanEvaII dataset indicates that the voxel segmentation step is sensitive to noise and that failure in this initial step affects the whole process. This motivates the idea that, instead of performing voxel segmentation at every frame, we use it only for initialization. In subsequent frames, a tracking-based method that exploits temporal information, such as the KC-GMM method, could help to overcome the sensitivity to noise to some extent (the KC-GMM method achieved quite good experimental results on the HumanEvaII dataset). The intended steps of a combined method following this idea are shown in Figure 8:

• The body starts at a specific initial pose (e.g. a stretch pose), which clearly reveals the body's structure.

• LE transformation is applied to segment the body voxels at this initial pose into different parts (e.g. limbs). With a well-chosen initial pose, we can expect a good segmentation result.

• Because we require the initial pose to be specific, it is also possible to develop a simple procedure to initialize the body model from the segmented voxels.

• After the body model is initialized, the KC-GMM method is used for body pose estimation in subsequent frames.
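The control flow of these steps can be sketched as follows. This is only a skeleton under stated assumptions, since the combined method has not yet been implemented: the function `initialize_then_track` and the three callables `segment`, `init_model`, and `track` are hypothetical placeholders for the LE-based segmentation, the model initialization procedure, and a KC-GMM tracking step, and their signatures are our own choice.

```python
def initialize_then_track(frames, segment, init_model, track):
    """Run segmentation once on the first frame, then track every later frame.

    All three callables are hypothetical placeholders:
      segment(voxels)                 -> per-chain voxel clusters (LE-based step)
      init_model(clusters)            -> (body_model, initial_pose)
      track(body_model, pose, voxels) -> pose for the new frame (KC-GMM step)
    """
    frames = iter(frames)
    first = next(frames)                      # frame showing the specific initial pose
    model, pose = init_model(segment(first))  # segmentation is used for initialization only
    poses = [pose]
    for voxels in frames:
        # Assumption: the previous pose seeds the estimate for the next frame,
        # which is how temporal information enters the tracking step.
        pose = track(model, pose, voxels)
        poses.append(pose)
    return poses
```

The point of this structure is that the noise-sensitive segmentation runs exactly once, on a frame chosen to be easy, while every later frame depends only on the tracker.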

4. CONCLUDING REMARKS AND FUTURE WORK

In this paper, we provide a review of the subarea of model-based methods for real human body pose estimation using volumetric data. After a brief overview that puts this subarea in context, we focus on analyzing and comparing several selected methods, especially recent methods from the past two years, to highlight their important results, including increasing generality, real-time performance, and a new general LE-based method for voxel segmentation.



Based on this analysis, we discuss our idea of a method combining LE-based voxel segmentation and the KC-GMM method for automatic human body model initialization and tracking using voxel data. Our immediate follow-up work is to implement this idea.

We can think of several other directions for future work on improving the performance and robustness of current pose estimation methods. First, we can keep trying to combine good characteristics of different methods to obtain a more robust one. For example, we may want to incorporate some kind of prediction information, as done in [1, 13], into the proposed combined method. Second, we can look for ways to use both 3D voxel features and 2D features. In [1, 5], color information is associated with the voxel data. Other 2D features such as edges and appearance models should also be useful. Regarding the major difficulty of the high-dimensional body pose configuration space, we can also exploit the divide-and-conquer principle by breaking the problem into lower-dimensional ones, as in the hierarchical estimation of body pose in [13] (detect the head first, then the torso, and so on) or the decomposition of complex human movement into basic motions in [1].

There are also some related open research areas that should be mentioned. The first is human body pose estimation at multiple levels (e.g. body level, head level, hand level), which was mentioned in [22]. Such a multilevel human body pose estimation system has two benefits: combined information from different levels is more useful (e.g. in an intelligent environment, the combination of body pose, hand pose, and head pose would give a better interpretation of human status and intention), and information from different levels can support each other and help to improve estimation performance. However, typical approaches in the area deal with body pose estimation, hand pose estimation, and head pose estimation separately. Therefore, it is worthwhile to study why typical approaches deal with only one task at a time and to find a way to achieve a full body model (e.g. including body, head, and hand). Another related open research area worth addressing is pose estimation and tracking of multiple objects simultaneously.

ACKNOWLEDGEMENT

We would like to thank our colleagues at the CVRR lab, especially Dr. Shinko Cheng, for useful discussions and assistance.

REFERENCES

[1] F. Caillette, A. Galata, T. Howard, "Real-time 3-D Human Body Tracking Using Learnt Models of Behaviour", Computer Vision and Image Understanding (109), 2008.

[2] S. Cheng, M. Trivedi, "Multimodal Voxelization and Kinematically Constrained Gaussian Mixture Model for Full Hand Pose Estimation: An Integrated Systems Approach", IEEE Int. Conference on Computer Vision Systems, pages 34-42, 2006.

[3] S. Cheng, M. Trivedi, "Articulated Human Body Pose Inference from Voxel Data Using a Kinematically Constrained Gaussian Mixture Model", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.

[4] G. Cheung and T. Kanade, "A Real-time System for Robust 3D Voxel Reconstruction of Human Motions", IEEE Proc. Computer Vision and Pattern Recognition Conference, pages 714-720, 2000.

[5] G. Cheung, S. Baker, and T. Kanade, "Shape-From-Silhouette of Articulated Objects and Its Use for Human Body Kinematic Estimation and Motion Capture", IEEE Computer Vision and Pattern Recognition Conference, 2003.

[6] Q. Delamarre and O. Faugeras, "3D Articulated Models and Multiview Tracking With Physical Forces", Computer Vision and Image Understanding, 81(3):328-357, 2001.

[7] A. Doshi, M. Trivedi, "Hybrid Cone-Cylinder Codebook Model for Foreground Detection with Shadow and Highlight Suppression", IEEE International Conference on Advanced Video and Signal based Surveillance, Nov 2006.

[8] T. Horprasert, D. Harwood, and L. S. Davis, "A Statistical Approach for Real-time Robust Background Subtraction and Shadow Detection", IEEE Proceedings ICCV Frame-Rate Workshop, September 1999.

[9] E. Hunter, "Visual Estimation of Articulated Motion Using Expectation-Constrained Maximization Algorithm", PhD thesis, University of California, San Diego, 1999.

[10] Z. Husz, A. Wallace, "Evaluation of a Hierarchical Partitioned Particle Filter with Action Primitives", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.

[11] D. Knossow, R. Ronfard, R. Horaud, "Human Motion Tracking with A Kinematic Parameterization of Extremal Contours", International Journal of Computer Vision, vol. 79, pages 247-269, 2008.

[12] M.W. Lee, R. Nevatia, "Human Pose Tracking in Monocular Sequence Using Multi-level Structured Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.


[13] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, "Human Body Model Acquisition and Tracking using Voxel Data", International Journal of Computer Vision, pages 199-223, July 2003.

[14] T. Moeslund and E. Granum, "A Survey of Computer Vision-Based Human Motion Capture", Computer Vision and Image Understanding, 81(3):231-268, 2001.

[15] T. Moeslund, A. Hilton, and V. Kruger, "A Survey on Advances In Vision-based Human Motion Capture and Analysis", Computer Vision and Image Understanding, pages 90-126, 2006.

[16] R. Poppe, "Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.

[17] R. Poppe, "Vision-based Human Motion Analysis: An Overview", Computer Vision and Image Understanding, vol. 108, pages 4-18, 2007.

[18] D. Ramanan, D.A. Forsyth, and A. Zisserman, "Tracking People by Learning Their Appearance", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.

[19] G. Slabaugh, B. Culbertson, and T. Malzbender, "A Survey of Methods for Volumetric Scene Reconstruction for Photographs", International Workshop on Volume Graphics, pages 81-100, 2001.

[20] A. Sundaresan, R. Chellappa, "Model Driven Segmentation of Articulating Humans in Laplacian Eigenspace", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.

[21] G. Slabaugh, R. Schafer, M. Hans, "Image Based Photo Hulls", International Symposium on 3D Data Processing Visualization and Transmission, 2002.

[22] M. Trivedi, "Human Movement Capture and Analysis in Intelligent Environments", Machine Vision and Application, vol. 14, pages 215-217, 2003.

[23] N. Werghi, "Segmentation and Modeling of Full Human Body Shape From 3-D Scan Data: A Survey", IEEE Transactions on Systems, Man, and Cybernetics, Part C, 37(6):1122-1136, 2007.
