
Augmenting monocular motion estimation using intermittent 3D models from depth sensors

Rohith MV and Chandra Kambhamettu
Video/Image Modeling and Synthesis (VIMS) Lab,
Dept. of Computer and Information Sciences, University of Delaware, Delaware, USA.
{rohithmv,chandrak}@udel.edu

Abstract

Estimation of human motion has been improved by recent advances in depth sensors such as the Microsoft Kinect. However, such sensors often have a limited range of depths, and a large number of them are necessary to estimate motion in large areas. In this paper, we explore the possibility of estimating motion from monocular data using initial and intermittent 3D models provided by the depth sensor. We use motion segmentation to divide the scene into several rigidly moving components. The orientations of the individual components are estimated, and these reconstructions are synthesized to provide a coherent estimate of the scene. We demonstrate our algorithm on three real video sequences. Quantitative comparison with depth sensor reconstructions shows that the proposed method can accurately estimate motion even with a single 3D initialization.

1 Introduction

Non-rigid structure from motion algorithms [10, 5, 6] are being explored for a variety of applications such as 2D-to-3D conversion of existing video sequences, human motion analysis, and human-computer interaction. Estimation is often achieved by analyzing the trajectories of a sparse set of feature points over the length of the given sequence. The movement of feature points in monocular imagery depends on camera motion as well as deformation of objects in the scene. Depth sensors such as the Microsoft Kinect can be used to obtain dense 3D reconstructions of the scene and subsequently provide pose estimates. Since such sensors often have limited depth resolution (a small range of depth over which reconstructions are accurate), multiple units are needed to cover large areas. Monocular camera networks, however, are nearly ubiquitous, and it would be advantageous if we could combine a few Kinect cameras with a larger network of monocular cameras. In this paper, we explore a method for estimating human motion using initial and intermittent 3D models derived from depth sensors.

We exploit the articulated nature of deformation that is predominant in human motion [2, 3, 4]. Using motion segmentation, we separate out the various rigid components of the object. Given the 3D model of a particular segment at one frame, the problem of finding the orientation of the object in another frame reduces to a pose estimation problem. Once we obtain the pose independently for all the segments, we synthesize a coherent reconstruction based on their spatial relation in the 3D initialization frame.

Comparisons with other motion segmentation methods demonstrate that our method is more generic in terms of the feature point distribution and the complexity of motion. We present reconstruction results on three real sequences. Reconstruction errors indicate that the proposed method can accurately estimate motion even with a single 3D initialization. We describe our approach to segmentation and classification of motion in Section 3. In Section 4, we present the methods we use for structure estimation of segments and for synthesis. We present results on human motion videos in Section 5 and conclude in Section 6.

2 Background

Successfully segmenting regions of a video based on their motion often depends on the nature of the deformations present. Vidal et al. [12] use a GPCA-based framework to identify multiple rigid motions. Del Bue et al. [2] employ the deformability index [7] to segment the scene into rigid and non-rigid parts. The method is based on motion trajectories lying in different subspaces.


Figure 1. Result of our motion segmentation algorithm. Colors represent motion segments; a total of seven segments are detected.

However, these methods cannot be applied to articulated motion, as there is considerable overlap in the subspaces of trajectories that are connected. To segment articulated motions, Yan and Pollefeys [13] propose a method based on spectral clustering of an affinity derived from the principal angles between the trajectory subspaces. They also provide a graph algorithm which constructs a kinematic chain based on shared subspaces. This method cannot be extended to complex objects due to the restriction on the minimum number of samples required (which grows exponentially with the number of subspaces [3]). We use a segmentation approach close to [4], which performs motion segmentation using subspace clustering as in [13], but employs a different affinity function that is insensitive to the distance of a feature point from the axis of articulation. In addition, we analyze the resultant subspaces using the deformability index [7] to determine the nature of the non-rigid motion present in each segment.

As noted earlier, a number of non-rigid structure from motion algorithms have been proposed [10] which perform structure recovery for all the scene points in a uniform manner. In this paper, however, as we concentrate on articulated motion, we restrict our discussion of related work to methods that handle articulated motion and combinations of multiple motion types. Tresadern and Reid [11] use factorization to solve purely articulated motion based on segmentation using RANSAC. Paladini et al. [5] present a method based on iterative factorization. The estimates are projected onto a manifold of metric constraints to preserve the relation between points of the same sub-model (or link in the case of articulated structures). Fayad et al. [3] solve the problem of articulated structure from motion using factorization with constraints on the rotation matrices. This method independently reconstructs each segment in a different frame of reference and hence needs a post-processing step of stitching the segments to provide a coherent reconstruction. They use 3D Euclidean distances between pairs of corresponding points as a measure of alignment.

3 Motion segmentation and classification

Since our scheme aims at reconstructing objects in the scene based on their rigidity, we start by dividing the feature trajectories in a video sequence into different segments. The segmentation is performed using a variant of the algorithm proposed in [4] which is sensitive to the proximity of features. This algorithm is designed to identify articulated motions and disjoint motion segments.

Given the observation matrix W constructed from the coordinates of the feature trajectories, we estimate a preliminary motion and shape estimate using factorization in the form W = CMAS [4]. Though this does not represent the actual motion and structure matrices, the subspace properties shown in [13, 4] ensure that trajectories of feature points that move rigidly with respect to each other lie in the same subspace. Equation 1 shows the affinity measure suggested in [4]. H_ij represents the affinity between features i and j, where S_i is the i-th column of the shape matrix S and + denotes the pseudoinverse.

H_{ij} = \exp\left( -\sum_{1 \le k \le K} \left\| [S_i \; S_j]\left([S_i \; S_j]^{+} S_k\right) - S_k \right\|^2 \right) \qquad (1)

We define a modified affinity measure H'_ij which includes the average Euclidean distance between the feature points (u_ij), as H'_ij = H_ij + α exp(−u_ij). Here α controls the weight of proximity relative to trajectory similarity. We then use the normalized cuts algorithm [8] to partition the trajectories. We will refer to these groups of trajectories as motion segments.
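As a concrete illustration, the sketch below computes the affinity of Equation 1 from the preliminary shape matrix S, adds the proximity term to obtain H', and clusters the trajectories. It uses a generic spectral clustering routine in place of the normalized cuts implementation of [8]; the function names and the illustrative values of alpha and the number of segments are our own assumptions, not from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def motion_segment_affinity(S, mean_dist, alpha=0.1):
    """Affinity between feature trajectories from their shape-matrix columns.

    S         : (d, K) preliminary shape matrix; column S[:, i] belongs to feature i.
    mean_dist : (K, K) average image-plane Euclidean distances u_ij.
    alpha     : weight of the proximity term (illustrative value).
    """
    K = S.shape[1]
    H = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            B = S[:, [i, j]]                  # subspace spanned by S_i and S_j
            P = B @ np.linalg.pinv(B)         # projector onto that subspace
            residual = P @ S - S              # how well every S_k fits the subspace
            H[i, j] = np.exp(-np.sum(residual ** 2))   # Equation 1
    # Modified affinity H'_ij = H_ij + alpha * exp(-u_ij)
    return H + alpha * np.exp(-mean_dist)

def segment_trajectories(S, mean_dist, n_segments, alpha=0.1):
    """Group trajectories into motion segments via spectral clustering
    on the precomputed affinity (a stand-in for normalized cuts)."""
    affinity = motion_segment_affinity(S, mean_dist, alpha)
    return SpectralClustering(n_clusters=n_segments,
                              affinity='precomputed').fit_predict(affinity)
```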

An example of this segmentation scheme is shown in Figure 1. A comparison of the proposed method with [13] using the CMU Mocap data [1] is shown in Tables 2 and 3 of the supplementary material, which can be downloaded from http://vims.cis.udel.edu/ICPR12/SupplementaryMV.zip.

4 Structure estimation

Articulated objects are often characterized by rigid segments which rotate relative to each other about a joint or pivot. The trajectories of all the points that belong to one link or section of the object lie on a linear subspace [13]. This allows the trajectories to be ordered or sorted, enabling us to identify the extreme or pivotal points of each link. Since no prior information is available about the nature of the objects present in the scene, we assume that each joint is a universal joint, allowing the connected link to undergo motion with two degrees of freedom. Intuitively, the degrees of freedom represent rotation within and out of the image plane. Using each pair of points on a given link, we can obtain an estimate of these angles. Due to tracking error and movement of feature points parallel to the axis of rotation (e.g., sliding of a shirt along the arm), some of these estimates may be inaccurate. These are handled using sample consensus to eliminate outliers. Note that the sign of the out-of-plane angle cannot be resolved by this method; however, this is handled in the synthesis section. We reconstruct the articulated segments using the method presented in [4], except that we use the 3D model of the segment as the reference. A 3D model maps each feature point in a single frame to a corresponding 3D point in space. To obtain a 3D model for the feature points of a frame, points from the 3D reconstruction are projected into the video camera's image plane and each feature point is assigned to the closest reprojected 3D point.
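A minimal sketch of this assignment step is given below, assuming a calibrated pinhole video camera described by a 3x4 projection matrix P; the helper name and the nearest-neighbour lookup are illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_segment_3d_model(points_3d, features_2d, P):
    """Map each 2D feature point to the closest reprojected 3D point.

    points_3d   : (M, 3) points from the depth-sensor reconstruction.
    features_2d : (N, 2) tracked feature locations in the video frame.
    P           : (3, 4) projection matrix of the video camera.
    Returns an (N, 3) array: the 3D point assigned to each feature.
    """
    # Reproject the 3D points into the video camera's image plane.
    homog = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    proj = homog @ P.T
    proj = proj[:, :2] / proj[:, 2:3]
    # Assign each feature to the nearest reprojected point.
    nearest = cKDTree(proj).query(features_2d, k=1)[1]
    return points_3d[nearest]
```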

Consider the structure estimates from two rigid motion segments W1 and W2 as S1 and S2, respectively (individual reconstructions containing all frames). Using the 3D initialization frame, we obtain transformations between points that are close to the pivot of a joint. Following the notation of the previous subsection, the combined observation matrix [W1, W2] is reconstructed using the rigid structure estimation methods discussed earlier. We refer to the resultant set of 3D points as S12. This result may be improved by using the camera motion estimate from the SfM results of W1 or W2 as an initialization for the factorization process.
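For concreteness, the rigid structure estimation of the stacked observations can be sketched with a generic Tomasi-Kanade-style rank-3 factorization, as below; this is a stand-in for the rigid reconstruction method referenced above, not necessarily the exact routine of [4].

```python
import numpy as np

def rigid_factorization(W):
    """Rank-3 affine factorization of an observation matrix W (2F x P):
    W is centered per row and decomposed into a motion matrix (2F x 3)
    and a structure matrix (3 x P), up to an affine ambiguity. A generic
    sketch, not the exact rigid-SfM routine used in the paper."""
    t = W.mean(axis=1, keepdims=True)               # image-plane translations
    U, s, Vt = np.linalg.svd(W - t, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])                   # motion matrix
    S = np.sqrt(s[:3])[:, None] * Vt[:3]            # structure matrix
    return M, S, t

# The combined segments would then be reconstructed from the stacked
# observations, e.g. M12, S12, t12 = rigid_factorization(np.hstack([W1, W2]))
```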

The reconstructions of 3D initialization frames provide us with a frame of reference which describes the spatial relation between the points belonging to different motion segments. We are now able to align different motion segments using the subsequence reconstruction as an intermediate reference. This is illustrated in Equation 2, where the transformations T1 and T2 align the points from S1 and S2 to S12, respectively.

S_1 \xrightarrow{\;T_1\;} S_{12} \;\overset{T_2^{-1}}{\underset{T_2}{\rightleftharpoons}}\; S_2 \qquad (2)

The points in S1 can now be transformed into the reference frame of S2 using the transformation T1 T2^{-1}. The reconstructions of uncalibrated algorithms are in general related by a bijective mapping between projective spaces (a collineation), and hence this must be the nature of T1 and T2. These transformations are estimated by minimizing the distance between the two point sets in the projective sense using the least squares method.
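One way to realize this least-squares estimate is a direct linear transform over homogeneous coordinates, sketched below. The paper only states that the transformation is estimated by minimizing distances in the projective sense, so the specific solver and the function name here are assumptions.

```python
import numpy as np

def fit_collineation_3d(X, Y):
    """Fit a 4x4 projective transformation (collineation) H with Y ~ H X,
    given corresponding 3D points X, Y of shape (N, 3), by a linear DLT.
    A sketch of one possible least-squares estimator; at least five
    correspondences in general position are needed."""
    N = X.shape[0]
    Xh = np.hstack([X, np.ones((N, 1))])   # homogeneous coordinates (N, 4)
    Yh = np.hstack([Y, np.ones((N, 1))])
    rows = []
    for x, y in zip(Xh, Yh):
        # Parallelism of y and Hx gives, for each coordinate pair (a, b),
        # y_a * (H_b . x) - y_b * (H_a . x) = 0, linear in the entries of H.
        for a in range(4):
            for b in range(a + 1, 4):
                r = np.zeros(16)
                r[4 * b:4 * b + 4] += y[a] * x
                r[4 * a:4 * a + 4] -= y[b] * x
                rows.append(r)
    A = np.asarray(rows)
    # H is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(4, 4)
    return H / np.linalg.norm(H)
```

With T1 and T2 estimated in this way, points of S1 are carried into the frame of S2 by applying T1 followed by the inverse of T2, as described in the text.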

5 Results

We tested our algorithm on three sequences of human motion. The data was collected using a Microsoft Kinect depth sensor (which provides a 640x480 depth map) and a video camera (providing 1024x768 image sequences). Feature trajectories were generated in the video sequence using dense feature tracking tools [9]. As features are not tracked at every pixel, the reconstruction is sparse. The cameras were calibrated to obtain intrinsic and extrinsic parameters, and the depth images were converted to 3D reconstructions. We will refer to these reconstructions as Kinect 3D frames. Each sequence contains about 100 frames. We conducted experiments by varying the number of 3D initializations used to augment the monocular reconstruction, ranging from a single Kinect 3D frame to 20 Kinect 3D frames spread uniformly in time over the sequence.
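As an illustration of the depth-to-3D conversion, a standard pinhole back-projection is sketched below; fx, fy, cx, cy stand for the depth camera's intrinsic parameters, and the function is our own sketch rather than the calibration pipeline actually used.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, in mm) into a set of 3D points
    using pinhole intrinsics. Zero-depth pixels are discarded. A generic
    sketch of the depth-to-3D conversion, not the authors' code."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)   # (N, 3)
```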

The trajectories were grouped into motion segments using the algorithm proposed in Section 3. For each of these segments, a 3D model was generated by averaging the models obtained from the specified Kinect 3D frames. These average models are then used to obtain the pose of each segment in a given frame of the video sequence. The transformations obtained from the closest (in time) Kinect 3D frame are used to align the individual reconstructions.

The results for some of the frames are shown in Figure 2. These results are obtained using a single Kinect 3D frame as initialization. It can be seen that the proposed method accurately reconstructs the observed motion. To check whether using more Kinect 3D frames would improve the reconstruction, we performed experiments wherein five, ten, fifteen and twenty Kinect 3D frames were used. To evaluate the performance, we calculated the mean error between the reconstructed points and the closest points in the corresponding Kinect 3D frames. These results are presented in Table 1 of the supplementary material. The errors reported are the average 3D Euclidean distance over all points in the reconstruction. We can see that the error decreases only from about 50mm to 39mm with the use of more 3D frames. Cases where the tracker fails on some features lead to wrong depth values being read from the Kinect frame (as in Sequence 3).
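The evaluation metric described above can be computed, for instance, as follows; this is our own illustrative helper, not the authors' evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nearest_neighbor_error(reconstruction, kinect_points):
    """Mean 3D Euclidean distance from each reconstructed point to its
    closest point in the corresponding Kinect 3D frame. Both inputs are
    arrays of shape (N, 3) and (M, 3), in the same units (e.g., mm)."""
    tree = cKDTree(kinect_points)
    distances, _ = tree.query(reconstruction, k=1)
    return float(distances.mean())
```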


Figure 2. Images from the human motion sequence with reconstructions obtained from the proposed algorithm. These results are obtained from a single initialization at the beginning of the sequence. Videos of the reconstruction are available in the supplementary material at http://vims.cis.udel.edu/ICPR12/SupplementaryMV.zip.

6 Conclusion

Obtaining 3D structure from monocular videos of deforming objects is a challenging problem. This problem is readily solved using depth sensors, but their narrow depth range precludes their application in large areas. As monocular video networks are ubiquitous and easy to set up, we explored the possibility of using 3D information from the depth sensor to augment the monocular reconstruction. We proposed an algorithm using motion segmentation and synthesis of rigid structures to estimate human motion in 3D. The method is initialized using one or more depth reconstructions that are available from the depth sensors. We tested our method on real video sequences and found that the method produces accurate reconstructions.

7 Acknowledgements

This work was made possible by National Science Foundation (NSF) Office of Polar Programs grant ANT0636726.

References

[1] CMU Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu/.

[2] A. Del Bue, X. Lladó, and L. Agapito. Non-rigid metric shape and motion recovery from uncalibrated images using priors. In CVPR, volume 1, pages 1191–1198, June 2006.

[3] J. Fayad, C. Russell, and L. Agapito. Automated articulated structure and 3D shape recovery from point correspondences. In ICCV, 2011.

[4] Rohith MV and C. Kambhamettu. Estimation and utilization of articulations in recovering non-rigid structure from motion using motion subspaces. In Joint ACM Workshop on Human Gesture and Behavior Understanding (J-HGBU '11), ACM Multimedia, 2011.

[5] M. Paladini, A. Del Bue, M. Stosic, M. Dodig, J. Xavier, and L. Agapito. Factorization for non-rigid and articulated structure using metric projections. In CVPR, pages 2898–2905, 2009.

[6] D. A. Ross, D. Tarlow, and R. S. Zemel. Learning articulated structure and motion. Int. J. Comput. Vision, 88:214–237, 2010.

[7] A. Roy-Chowdhury. A measure of deformability of shapes, with applications to human motion analysis. In CVPR, volume 1, pages 398–404, 2005.

[8] J. Shi and J. Malik. Normalized cuts and image segmentation. In CVPR, 1997.

[9] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In Proceedings of the 11th European Conference on Computer Vision: Part I, ECCV '10, pages 438–451, Berlin, Heidelberg, 2010. Springer-Verlag.

[10] L. Torresani, A. Hertzmann, and C. Bregler. Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. PAMI, 30:878–892, 2008.

[11] P. Tresadern and I. Reid. Articulated structure from motion by factorization. In CVPR, volume 2, pages 1110–1115, June 2005.

[12] R. Vidal, Y. Ma, and J. Piazzi. A new GPCA algorithm for clustering subspaces by fitting, differentiating and dividing polynomials. In CVPR, volume 1, pages 510–517, 2004.

[13] J. Yan and M. Pollefeys. Automatic kinematic chain building from feature trajectories of articulated objects. In CVPR, pages 712–719, 2006.
