

An active strategy for the simultaneous localization and reconstruction of a 3D object from a humanoid platform

Josafat Delfin1, Oscar Mar2, Jean-Bernard Hayet2, Mario Castelan1, Gustavo Arechavaleta1

Abstract—This paper proposes a novel approach for the simultaneous localization and reconstruction of a 3D object. The reconstruction is performed by a small humanoid platform using monocular vision only. We combine (1) a stochastic control approach that decides how to move the robot so that the object can be both localized with respect to the robot and correctly reconstructed, and (2) a space carving approach for the 3D reconstruction of the object that merges the acquired monocular views into a coherent model. We present a set of very promising results obtained with a NAO platform.

I. INTRODUCTION

Among the recent developments in humanoid robotics, in terms of safe navigation, localization and map building, one striking characteristic is that most sensor-based elementary tasks rely almost exclusively on monocular or stereoscopic vision. This is largely because embedding range finders onboard these robots is not always possible, due to their weight, cost, and the difficulty of using them in a fully 3D context.

In this paper, we focus on one of these fundamental sensor-based problems: the 3D reconstruction of an unknown object from a humanoid platform by means of its monocular vision system. By "unknown object", we mean that neither the real position of the object with respect to the robot nor its size is known by the robot when the reconstruction process starts. This implies that the robot must fulfill several objectives to perform the reconstruction correctly: (1) estimate the object position and size, and (2) move in such a way that, using the images taken along the way, the reconstruction is possible. This setup is much more realistic than, e.g., simply assuming that the position of the object is known initially, and it can be applied in a wider range of situations. Our work defines such a strategy, combining a reconstruction-oriented control of the humanoid robot in the plane (i.e., at the level of footstep planning) with a dense reconstruction algorithm (space carving).

The paper is organized as follows: in Section II, we review works related to our approach. In Section III, we detail the motion strategy for controlling the robot, as well as the method to recover 3D shape from occluding boundaries. In Section IV, we give results of object reconstructions, and we conclude in Section V.

1 Authors are with the Robotics and Advanced Manufacturing Group, Centro de Investigación y de Estudios Avanzados del IPN, Saltillo, Coah., México.
2 Authors are with the Computer Science Group, Centro de Investigación en Matemáticas, Guanajuato, Gto., México.

II. RELATED WORK

The problem of monocular 3D object reconstruction is a classical one in computer vision and, like many others, has recently been ported onboard humanoid platforms. Basically, it consists in using several images of the object taken from different viewpoints to either match local image features across the views and deduce a sparse reconstruction, or (as we do here) perform a dense reconstruction based on photogrammetric principles. In the humanoids literature, the problem is made even more challenging as it also implies choosing poses for the humanoid robot such that the reconstruction can be completed [1], [2].

In [3], the authors perform 3D reconstruction of surrounding objects using stereo vision and a modified Extended Kalman Filter that estimates the position of 3D points in a foot-centered frame. To specifically handle the sensor-to-robot uncertainties of the humanoid system, they switch from a static motion model, when the foot is firmly lying on the ground, to an uncertain dynamic motion model when a step is undergone. Furthermore, they integrate image measurements resulting from point matches in the left and right cameras.

In [2], the authors represent the object to reconstruct as a 3D occupancy grid, and update occupancy values on the grid cells with a stereo system. Meanwhile, the robot stays localized relative to the object by using feature points detected on it. The next-best-view problem is tackled by searching for the largest "unknown" area of the object to reconstruct, which translates into a multi-objective optimization problem, as the camera pose at the next view must correspond to an admissible configuration for the robot head and a stable configuration for the whole body. Both problems are treated as constrained nonlinear optimization problems.

All the aforementioned works handle the reconstruction with stereo vision. Other successful approaches rely on depth sensors such as Time-of-Flight (ToF) cameras, coupled with view-based SLAM techniques for a better handling of uncertainties, due to the noisy nature of ToF sensors [4]. However, for most affordable humanoid platforms, depth sensors are not easily available, and the main sensor to rely on is monocular vision. The problem of active, monocular reconstruction on humanoid platforms, which is the topic of this work, has been addressed, to our knowledge, only through sparse reconstruction by means of visual SLAM strategies [5].

Our contribution in this work lies in the definition of a novel strategy coupling (1) active control of the humanoid robot to perform the reconstruction even though the object is not well localized at the beginning, and (2) a dense monocular 3D reconstruction algorithm.

III. RECONSTRUCTION-ORIENTED CONTROL

To simplify the problem of generating controls for the motion of the humanoid robot, we choose to handle its mobility at the highest level through a differential drive model, as in previous works in the humanoid literature [6].

Let $(x, y)$ be a coordinate frame in the plane in which the robot and the object to reconstruct stand. This frame is linked to our choice of an absolute localization system for the robot (see III-C). Let $\mathbf{X}^o = (x^o, y^o, r^o)^T$ be a state vector formed by the unknown position $(x^o, y^o)$ of the object to reconstruct in this same coordinate frame and its radius $r^o$. Let also $\mathbf{X}^r = (x^r, y^r, \theta^r)^T$ be the position and orientation of the humanoid in the plane. For convenience, and although the position of the robot is considered as deterministic, we will consider the joint state $\mathbf{X} \overset{\text{def}}{=} ((\mathbf{X}^o)^T, (\mathbf{X}^r)^T)^T$. We suppose that, with the help of the visual localization system, $\mathbf{X}^r$ can be determined precisely. This setup is described in Fig. 1. The aim is to determine controls for the robot such that, even though the object of interest is not initially localized, we end up localizing it well and performing its 3D reconstruction by taking images of it from several positions. The estimation of the object size and position is done as a variant of the bearing-only localization problem, by integrating the observations $\phi_t$ (bearing) and $\delta_t$ (apparent object diameter).

First, we suppose that our humanoid robot is controlled only through $(v, \omega)$, i.e. the linear and angular velocities, by means of a walking controller [7]. We also suppose that the linear velocity stays constant, and without loss of generality we take $v = 1$ in this work. Hence, the control of the robot at time $t$ reduces to $\omega_t$. The current belief on the state of the object of interest is represented by the first two moments of its posterior distribution, $(\mathbf{X}^o_t, \Sigma^o_t)$, i.e. its mean and covariance matrix. We will estimate the "optimal" (in a sense to be defined below) sequence $\omega^*_{t+1:t+\Delta}$ of angular velocities for the robot walk, over a time horizon $\Delta$, such that

$$\omega^*_{t+1:t+\Delta} = \arg\min_{\omega_{t+1:t+\Delta}} E\left(c(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta})\right), \quad (1)$$

where $E(V)$ is the expected value of a random variable $V$ and $c$ is a stochastic cost function reflecting the needs to be enforced during the walk within the next $\Delta$ time steps, i.e.,

• get a more precise estimation of $\mathbf{X}^o$, measured through $\Sigma^o_{t+\Delta}$, the covariance matrix at $t+\Delta$;
• keep the target inside the field of view of the sensor;
• obtain a complete reconstruction of the object.

A. Estimation of the object position

We run an extended Kalman filter (EKF) not only to estimate the state $\mathbf{X}_t$ at time $t$ (including the object to reconstruct) but also to compute expected values of covariances on this state at some given horizon $\Delta$. This will be important for quantifying expected precision gains on our knowledge about the location of the object to reconstruct.

Fig. 1. Bird's-eye view of the reconstruction system, showing the robot pose $(x^r_t, y^r_t, \theta^r_t)$, the object position $(x^o_t, y^o_t)$, and the observations $\phi_t$ and $\delta_t$. The location of the object to reconstruct is unknown. Our aim is to control the robot so that the reconstruction of the object (which is initially not well localized) can be done with precision.

Classically, it updates the parameters of a Gaussian representing the posterior distribution over $\mathbf{X}_t$ in a two-step fashion: a prediction step integrating the last control, and a correction step integrating the last observation $\mathbf{Y}_t$. Both steps linearize the motion and observation models to propagate uncertainties to the state and observation spaces. Here, the observation vector $\mathbf{Y}_t$ is made of two angles extracted from the video camera image. Using very simple image processing techniques (see III-D), the object is segmented into a single blob, from which are deduced the bearing $\phi_t$ to the object and its apparent diameter $\delta_t$,

$$\mathbf{Y}_t = (\phi_t, \delta_t)^T.$$

The probabilistic model generating these observations is

$$\mathbf{Y}_t \overset{\text{def}}{=} h(\mathbf{X}_t, \nu^\phi_t, \nu^\delta_t) = \begin{pmatrix} \arctan\left(\dfrac{y^o_t - y^r_t}{x^o_t - x^r_t}\right) - \theta^r_t + \nu^\phi_t \\[3mm] \arctan\left(\dfrac{r^o_t}{\sqrt{(x^o_t - x^r_t)^2 + (y^o_t - y^r_t)^2}}\right) + \nu^\delta_t \end{pmatrix},$$

where $(\nu^\phi_t, \nu^\delta_t)^T$ is a realization of a zero-mean 2D Gaussian noise distribution with constant covariance $\Sigma_M$. As for the motion model, we use the simple unicycle model

$$\begin{pmatrix} x^o_{t+1} \\ y^o_{t+1} \\ r^o_{t+1} \\ x^r_{t+1} \\ y^r_{t+1} \\ \theta^r_{t+1} \end{pmatrix} = \begin{pmatrix} x^o_t \\ y^o_t \\ r^o_t \\ x^r_t \\ y^r_t \\ \theta^r_t \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \\ \cos(\theta^r_t + \omega_t) \\ \sin(\theta^r_t + \omega_t) \\ \omega_t \end{pmatrix} + \begin{pmatrix} \nu^x_t \\ \nu^y_t \\ \nu^r_t \\ 0 \\ 0 \\ 0 \end{pmatrix}.$$

Thus, we suppose that the motion of the robot is deterministically given by $\omega_t$ (remember that $v = 1$), whereas the object stays static but is considered to have zero-mean Gaussian noise $(\nu^x_t, \nu^y_t, \nu^r_t)$ on its position and radius. The $3 \times 3$ covariance matrix of this noise is denoted as $\Sigma_a$. We can denote the overall application of the two-step filter with observation $\mathbf{Y}_t$ by the recursive equation

$$(\mathbf{X}^o_{t+1}, \Sigma^o_{t+1}) = F(\mathbf{X}^o_t, \Sigma^o_t, \omega_t, \mathbf{Y}_t),$$

where $F$ encapsulates the equations of the Kalman filter.
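To fix ideas, here is a minimal sketch of how the observation model $h$ and one application of the filter $F$ could look in Python with NumPy. This is not the authors' implementation: the function names are ours, the Jacobian of $h$ is obtained numerically rather than analytically, and `arctan2` replaces the model's `arctan` for quadrant robustness.

```python
import numpy as np

def h(X):
    """Noise-free observation model: bearing and apparent diameter of
    the object as seen from the robot (cf. the model above)."""
    xo, yo, ro, xr, yr, thr = X
    dx, dy = xo - xr, yo - yr
    phi = np.arctan2(dy, dx) - thr             # bearing (atan2 for quadrants)
    delta = np.arctan(ro / np.hypot(dx, dy))   # apparent diameter
    return np.array([phi, delta])

def ekf_step(X, S, omega, Y, Sigma_a, Sigma_M, eps=1e-6):
    """One application of F: EKF prediction under control omega, then
    correction with observation Y, on the 6D joint state (X, S)."""
    # Prediction: the object stays static, the robot moves (unicycle, v = 1).
    Xp = X.copy()
    Xp[3] += np.cos(X[5] + omega)
    Xp[4] += np.sin(X[5] + omega)
    Xp[5] += omega
    Sp = S.copy()
    Sp[:3, :3] += Sigma_a                      # process noise on the object only
    # Correction: numerical Jacobian of h at the predicted mean.
    H = np.zeros((2, 6))
    for i in range(6):
        d = np.zeros(6); d[i] = eps
        H[:, i] = (h(Xp + d) - h(Xp - d)) / (2 * eps)
    G = Sp @ H.T @ np.linalg.inv(H @ Sp @ H.T + Sigma_M)   # Kalman gain
    Xn = Xp + G @ (Y - h(Xp))
    Sn = (np.eye(6) - G @ H) @ Sp
    return Xn, Sn
```

A call `ekf_step(X, S, omega, Y, Sigma_a, Sigma_M)` then realizes one step of the recursion $(\mathbf{X}^o_{t+1}, \Sigma^o_{t+1}) = F(\mathbf{X}^o_t, \Sigma^o_t, \omega_t, \mathbf{Y}_t)$.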


B. Reconstruction-oriented control

At future horizons, we approximate the expected value in Eq. 1 by synthesizing theoretical noise-free observations at the computed predictions, i.e., at $t+1$,

$$\mathbf{Y}^s_{t+1} = h(\mathbf{X}^-_{t+1}, 0, 0),$$

where $\mathbf{X}^-_{t+1}$ is the predicted mean of the state at $t+1$. With this principle, at horizon $\Delta$, we can define the $\Delta$-fold recursive application of $F$, denoted $F^{(\Delta)}$, as a function of the applied controls $\omega_{t+1:t+\Delta}$, where observations are synthesized at each step based on the prediction:

$$(\mathbf{X}^o_{t+\Delta}, \Sigma^o_{t+\Delta}) = F^{(\Delta)}(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}).$$

Hence, to control the humanoid robot, we use the expected value of $c$ at $t+\Delta$. This cost function is decomposed into several terms:

$$c(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}) = \sum_{k \in \{l,v,r,d\}} \lambda_k\, c_k(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}) + \mu \|\omega_{t+1:t+\Delta}\|^2,$$

where the $\lambda_k$ and $\mu$ are weighting factors, and the last term is a regularization term smoothing the amplitude of the motion. The first term is the volume of the uncertainty on the object position and size, which can be written as

$$c_l(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}) = \det \Sigma^o_{t+\Delta}.$$

The smaller this term, the more confident we are in the object position and size. The second term prevents the object of interest from leaving the sensor field of view by penalizing its horizontal deviation in the sensor frame, i.e.,

$$c_v(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}) = \left(\arctan\left(\frac{y^o_{t+\Delta} - y^r_{t+\Delta}}{x^o_{t+\Delta} - x^r_{t+\Delta}}\right) - \theta^r_{t+\Delta}\right)^2.$$

The following term is defined to ensure a complete coverage of the object for its reconstruction, by discretizing the cylinder surrounding the object of interest into $N_0$ bins,

$$c_r(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}) = -\frac{N_v}{N_0},$$

where $N_v$ is the total number of bins seen by the robot during the execution of the whole trajectory, from $\tau = 0$ to $\tau = t+\Delta$; covering more of the object thus lowers the cost. Last, we also use a term to favor the acquisition of images at some "comfortable" distance $d_{\min}$,

$$c_d(\mathbf{X}^o_t, \Sigma^o_t, \omega_{t+1:t+\Delta}) = \frac{d_{\min}}{\sqrt{(y^o_{t+\Delta} - y^r_{t+\Delta})^2 + (x^o_{t+\Delta} - x^r_{t+\Delta})^2}}.$$

As explained, the expected value of the cost $c$ depends only on the control sequence $\omega_{t+1:t+\Delta}$ to be applied. We use a classical Nelder-Mead simplex method to perform the optimization.
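As an illustration of how this optimization could be set up, the following sketch approximates the expected cost by the synthetic-observation rollout described above and minimizes it with SciPy's Nelder-Mead routine. It reuses the hypothetical `h` and `ekf_step` helpers sketched in III-A; for brevity only the term $c_l$ and the regularizer are evaluated, and the horizon, weights, and initial values are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def predict_mean(X, omega):
    """Noise-free unicycle prediction of the joint state (v = 1)."""
    Xp = X.copy()
    Xp[3] += np.cos(X[5] + omega)
    Xp[4] += np.sin(X[5] + omega)
    Xp[5] += omega
    return Xp

def rollout_cost(omegas, X, S, Sigma_a, Sigma_M, lam_l, mu):
    """Approximate E(c): roll the filter over the horizon, feeding it
    synthetic noise-free observations Y^s = h(X^-) at each step."""
    Xc, Sc = X.copy(), S.copy()
    for w in omegas:
        Ys = h(predict_mean(Xc, w))            # synthesized observation
        Xc, Sc = ekf_step(Xc, Sc, w, Ys, Sigma_a, Sigma_M)
    # c_l (determinant of the object covariance) plus the regularizer.
    return lam_l * np.linalg.det(Sc[:3, :3]) + mu * float(omegas @ omegas)

# Illustrative initial belief and noise levels.
X0 = np.array([1.0, 1.0, 0.2, 0.0, 0.0, 0.0])  # object guess + robot pose
S0 = np.diag([1.0, 1.0, 0.1, 0.0, 0.0, 0.0])
Sigma_a = 1e-4 * np.eye(3)
Sigma_M = np.diag([1e-3, 1e-3])

# Optimize over a horizon of Delta = 5 angular velocities.
res = minimize(rollout_cost, x0=np.zeros(5),
               args=(X0, S0, Sigma_a, Sigma_M, 1.0, 0.1),
               method='Nelder-Mead')
omega_star = res.x                             # the sequence omega*_{t+1:t+Delta}
```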

C. Robot localization and object reconstruction

As explained before, we suppose that a localization system provides the robot position in an absolute frame. We first calculate the camera pose in order to relate it to the robot position. For this purpose, a variety of visual odometry methods may be used [8]. The success of these, however, greatly depends on the robustness and suitability of the applied feature descriptors for solving feature correspondences [9].

Fig. 2. Visual localization. The visual localization system is based on the extraction of colored landmarks in the camera images. Crosses are the centroids of the segmented color blobs. Both the pixel positions of the centroids and their 3D positions in the world feed the localization system.

An accurate camera localization is likewise crucial for the task of 3D modeling from multiple views. In this sense, our specific setup is aimed at estimating correct camera poses while simplifying the correspondence problem. We used a visual localization system based on colored landmarks detected in the robot images by simple thresholding methods, where each landmark is identified by its color (see Fig. 2).

The intrinsic parameters $K$ of the camera were estimated using the Matlab calibration toolbox. Once this information is at hand, a standard method can be applied to minimize the landmark reprojection error and estimate the camera localization precisely. We solved this as a 2D-to-3D estimation problem using the OpenCV function solvePnP, which calculates the camera pose (i.e., the camera extrinsic parameters) for each view of the object. The camera pose is recovered as a rotation vector in Rodrigues representation and a translation vector $t_c$ expressed in camera coordinates. For convenience, the rotation vector is later transformed into a rotation matrix $R_c$, also defined in camera coordinates.
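The pose computation itself reduces to a few OpenCV calls. Below is a minimal sketch of this step; the landmark coordinates, pixel centroids, and intrinsic values are placeholders, not the calibration values used in the paper.

```python
import cv2
import numpy as np

# 3D world coordinates of the colored landmarks (placeholder values, meters).
world_pts = np.array([[0.0, 0.0, 0.0],
                      [0.5, 0.0, 0.0],
                      [0.5, 0.5, 0.0],
                      [0.0, 0.5, 0.3]])
# Pixel centroids of the corresponding color blobs in the current image.
image_pts = np.array([[320.0, 240.0],
                      [410.0, 235.0],
                      [405.0, 150.0],
                      [315.0, 140.0]])

K = np.array([[550.0, 0.0, 320.0],     # intrinsics from offline calibration
              [0.0, 550.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                     # assuming an undistorted image

# Camera pose: rotation vector (Rodrigues) and translation in camera frame.
ok, rvec, tc = cv2.solvePnP(world_pts, image_pts, K, dist)
Rc, _ = cv2.Rodrigues(rvec)            # rotation vector -> matrix R_c
```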

It is important to remember that the simultaneous localization and reconstruction of the 3D object implies localizing both the camera and the robot. On the one hand, the knowledge of the camera pose is required for an accurate 3D reconstruction of the object. On the other hand, the position of the robot is needed for estimating the position of the object and for controlling the trajectory of the robot.

As far as the 3D reconstruction of the object is concerned, we used a method based on shape from occluding boundaries known as space carving [10]. Roughly, this method uses the camera matrix to reproject, towards the world, the area bounded by the silhouette of the observed object in the image. The camera matrix can easily be built as $P = K[R_c|t_c]$, where $K$ is the matrix of intrinsic parameters of the camera. From a set of multiple silhouettes with known camera matrices, a 3D model is finally recovered as the intersection of all reprojected silhouettes. This process, which resembles sculpting (carving), requires several views of the object, usually acquired while moving the camera around the object along a circular path.
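For concreteness, a basic voxel-based variant of this silhouette intersection can be sketched as follows. This is a simplified stand-in for the method of [10], assuming binary silhouette masks and per-view camera matrices $P = K[R_c|t_c]$ are already available; the names and grid handling are ours.

```python
import numpy as np

def carve(voxels, views):
    """Keep only the voxels whose projection falls inside the object
    silhouette in every view. `voxels` is an (N, 3) array of voxel
    centers; `views` is a list of (P, mask) pairs, where P = K[Rc|tc]
    is the 3x4 camera matrix and mask a binary silhouette image."""
    Xh = np.hstack([voxels, np.ones((len(voxels), 1))])  # homogeneous coords
    keep = np.ones(len(voxels), dtype=bool)
    for P, mask in views:
        x = (P @ Xh.T).T                                 # project all voxels
        in_front = x[:, 2] > 0                           # discard points behind
        u = np.zeros(len(voxels), dtype=int)
        v = np.zeros(len(voxels), dtype=int)
        u[in_front] = np.round(x[in_front, 0] / x[in_front, 2]).astype(int)
        v[in_front] = np.round(x[in_front, 1] / x[in_front, 2]).astype(int)
        h, w = mask.shape
        inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > 0
        keep &= hit                                      # carve away the misses
    return voxels[keep]
```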

Fig. 3. From image blobs to the observation vector $\mathbf{Y}_t$. Once a binary blob corresponding to the object has been segmented by applying color thresholding and morphological operations, the three points $p_l$, $p_c$, $p_r$ are extracted to define the observation angles $\phi_t$ and $\delta_t$.

Note that, for the experimental evaluation, assumptions about the segmentation of the object from its background were also simplified in order to facilitate the extraction of the silhouette through color differentiation. This can be observed in Fig. 2, where the scene consists of a predominantly red object against a blue background.

As far as the localization of the robot is concerned, we are only interested in the position of the robot on the $xy$-plane and its orientation angle $\theta$. Thus, we need to find the homogeneous matrix that maps the robot body frame to the world frame, defined as $T^b_w = (T^w_c)^{-1} T^b_c$, where

$$T^w_c = \begin{bmatrix} R_c & t_c \\ 0 & 1 \end{bmatrix} \quad (2)$$

and $w$, $c$ and $b$ respectively stand for world, camera and body. The transformation matrix from body to camera, $T^b_c$, is directly taken from the kinematics of the robot.

The position of the robot in world coordinates can be taken from the first three elements of the fourth column of the homogeneous matrix $T^b_w$. The orientation angle $\theta$ can be found by projecting a vector along the $z$-axis of the camera frame onto the world plane, i.e., $v_w = (T^w_c)^{-1} v_c$, where $v_c = [0\ 0\ 1\ 0]^T$ and $v_w$ is the projection of $v_c$ expressed in world coordinates. The angle $\phi$ between the world $x$-axis and $v_w$ can be calculated as the arctangent of the ratio of the first two components of $v_w$. Finally, the direction of the robot body is given by $\theta = \phi - \alpha$, where $\alpha$ is the robot head yaw angle obtained from the robot odometry.
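Putting the previous transformations together, this pose recovery could be sketched as follows (our naming; `T_b_c` is the body-to-camera transform taken from the robot kinematics, and `Rc`, `tc` are the extrinsics returned by solvePnP).

```python
import numpy as np

def robot_pose(Rc, tc, T_b_c, alpha):
    """Recover (x, y, theta) of the robot body in the world frame from the
    camera extrinsics (Rc, tc), the body-to-camera transform T_b_c given
    by the robot kinematics, and the head yaw angle alpha (odometry)."""
    T_w_c = np.eye(4)                        # world -> camera, Eq. (2)
    T_w_c[:3, :3] = Rc
    T_w_c[:3, 3] = np.asarray(tc).ravel()
    T_b_w = np.linalg.inv(T_w_c) @ T_b_c     # body -> world
    x, y = T_b_w[0, 3], T_b_w[1, 3]          # position: fourth column
    # Project the camera z-axis into the world frame to get phi.
    v_w = np.linalg.inv(T_w_c) @ np.array([0.0, 0.0, 1.0, 0.0])
    phi = np.arctan2(v_w[1], v_w[0])
    return x, y, phi - alpha                 # theta = phi - alpha
```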

D. Extraction of the observation vector data

A fundamental element of the estimation framework of III-A is the extraction of the observation vectors $\mathbf{Y}_t$, containing the bearing to the object and its apparent diameter. To perform this process in a simple way, we suppose that the object can be segmented from its background in the image. As illustrated in Fig. 3, with standard thresholding and morphological operations, we extract a blob corresponding to the object of interest. Then, along a horizontal axis, we project the blob centroid and the blob extremities to get three points: $p_c$, $p_l$ (left) and $p_r$ (right). Denoting the intrinsic parameter matrix as $K$, we generate the observations

$$\mathbf{Y}_t = \left(\pm \arccos\left(e_z^T K^{-T} K^{-1} p_r\right),\ \arccos\left(p_l^T K^{-T} K^{-1} p_r\right)\right)^T,$$

where $e_z = (0, 0, 1)^T$ and the sign of the first angle is easily determined from the horizontal position of the centroid.
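A possible implementation of this extraction is sketched below. Two liberties are taken with respect to the compact formula above: the back-projected rays are explicitly normalized before taking `arccos`, and the bearing is computed from the centroid ray through $p_c$, with its sign taken from the side of the principal point on which the centroid falls.

```python
import numpy as np

def observation_from_blob(mask, K):
    """Build Y_t = (phi_t, delta_t) from a binary object mask and the
    intrinsics K, via the points p_l, p_c, p_r of Fig. 3."""
    vs, us = np.nonzero(mask)
    u_l, u_r, u_c = us.min(), us.max(), us.mean()
    v_c = vs.mean()
    Kinv = np.linalg.inv(K)

    def ray(u, v):
        """Unit ray back-projected through pixel (u, v)."""
        r = Kinv @ np.array([u, v, 1.0])
        return r / np.linalg.norm(r)

    e_z = np.array([0.0, 0.0, 1.0])          # optical axis
    r_l, r_r, r_c = ray(u_l, v_c), ray(u_r, v_c), ray(u_c, v_c)
    phi = np.arccos(np.clip(e_z @ r_c, -1.0, 1.0))
    if u_c < K[0, 2]:                        # sign from the centroid side
        phi = -phi
    delta = np.arccos(np.clip(r_l @ r_r, -1.0, 1.0))
    return np.array([phi, delta])
```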

IV. IMPLEMENTATION AND RESULTS

We first ran a series of simulations to validate our reconstruction approach. Two illustrative results are given in Fig. 4, where we tackled the problem of the simultaneous localization and reconstruction of a single 2D disk in the plane, for different values of the parameter $d_{\min}$ (the "comfortable" distance at which reconstruction can take place, as described in Section III). As expected, one can see at the beginning of the trajectory that two motion behaviors first tend to reduce the uncertainty over the object position and size (term $c_l$): getting closer to the object, and circling around it (which implies triangulation over the position of the object). Then, the terms $c_v$, $c_r$ and $c_d$ dominate in the second phase, which leads the robot to stay at a stabilized distance from the object and to circle around it to get a full panoramic view.

Fig. 4. Simulations. Simultaneous localization and reconstruction of a single disk in a 2D world, for two values of $d_{\min}$ (see Section III). The two plots correspond to the execution of two trajectories with different values of the parameter of the objective term $c_d$, respectively $d_{\min} = 100$ and $d_{\min} = 160$. The robot appears as a disk, and the (noisy) observations as rays coming out of this disk. The object to reconstruct (whose true position is the origin) is depicted as a brown disk, and the uncertainty ellipse over its position in orange.

Results on real-world experiments were obtained by porting our algorithm onboard a NAO humanoid robot and using its top camera. The robot is controlled through the basic walking facilities provided by the manufacturer, which implement the method reported in [7]. In addition to the features mentioned above, we also implemented a simple position-based visual servoing approach to keep the target at the center of the image. Figure 5 depicts active reconstruction results for three different objects: an action figure, a Wii remote control, and a beverage can. For each object, the figure presents five different views as recorded by the camera of the robot, the extracted silhouettes, and a rendered view of the final 3D object reconstruction for the purposes of visual comparison. As can be observed, the reconstructions are of good quality, in particular in the case of the action figure, where filiform parts (arms, legs) are well reconstructed, suggesting a successful coupling of the proposed robot control with the space carving methodology.

Fig. 5. 3D reconstruction results for three objects: (a) an Iron Man action figure, (b) a Wii remote control, and (c) a beverage can. For each object, example views are provided row-wise. The different views correspond to input images (as seen by the camera of the robot), extracted silhouettes (shown in black), and rendered views of the reconstructed 3D model. Only the silhouettes remain proportional in size to the input images; the rendered views have been scaled for visualization purposes.

Figure 6 presents the estimated robot trajectories for the three test objects shown in Fig. 5. The parallelepiped represents the ground-truth position of the object base (which has to be estimated by the robot), and the snapshots taken appear as blue cones. The obtained trajectories are quite similar to the simulation case: the robot first reduces position uncertainties by getting closer to the target (for constant errors on the bearing angle, getting closer makes positioning errors smaller) and by searching for positions favoring triangulation (hence, beginning to round up the object). Then, the aforementioned terms $c_v$, $c_r$, and $c_d$ make the robot complete the reconstruction objectives.

To conclude the experimental section, we present results concerning the repeatability of success across different trials. Figure 7 shows plots for three reconstruction trials of the Wii remote control. In the experiment, two measures of success were registered. First, the positions of the center of the object on the $(x, y)$-plane, as estimated by the active robot control, were recorded along the circular acquisition paths performed by the robot on each trial. This measure was compared against the final position of the center of the object obtained from the 3D reconstruction. The Euclidean distance between these two measures is shown as a function of the robot position in the top plot of the figure. Note how initial instabilities present at the beginning of the walk tend to diminish as the path is completed. The final error, fluctuating around two centimeters, indicates an increasingly accurate estimation of the position of the object as the robot moves around it. The bottom plot depicts a similar experiment considering the width of the object. The error converging towards four millimeters reveals that not only the position of the object but also its dimensions can be estimated precisely as the camera of the robot acquires the relevant images for the 3D reconstruction method.


Fig. 6. Estimated camera poses for (a) the Iron Man action figure, (b) the Wii remote control, and (c) the beverage can. The figures correspond to the trajectories of the robot for the three reconstructions in Fig. 5. The viewpoints associated with the snapshots taken along these trajectories are depicted in blue. The five camera positions corresponding to the five views shown in Fig. 5 have been highlighted in order to exemplify the proximity of the camera to the object.

[Figure 7: two plots, "Object localization error" (top) and "Object width error" (bottom); x-axis: robot position, y-axis: error (meters); three curves per plot: Trial 1, Trial 2, Trial 3.]

Fig. 7. Error plots for object localization and object width. The two plots depict the error behavior for three reconstruction trials of the Wii remote control. The x-axis of the plots corresponds to the position of the robot (along the generated circular path). The plots show how the errors in both object localization and object width diminish as the robot finishes surrounding the object.

V. CONCLUSIONS AND FUTURE WORK

We have presented an active simultaneous reconstruction and localization strategy that allows a humanoid robot to autonomously reconstruct an object about which it has no initial knowledge, i.e., one that needs to be localized by the robot. Based on expectations of future precision gains and on the fitness of future positions with respect to the reconstruction problem, the robot computes a trajectory best suited to perform the reconstruction. Space carving on the images collected along the trajectory is then used to perform the 3D reconstruction. We have shown very promising results in simulation and in real experiments performed on a NAO platform.

Among our ongoing and future work, we aim at adapting the current strategy to use the whole-body redundancy as a supplementary means to complete the 3D reconstruction, i.e., for generating views of the object from above.

REFERENCES

[1] O. Stasse, D. Larlus, B. Lagarde, A. Escande, F. Saidi, A. Kheddar, K. Yokoi, and F. Jurie, "Towards autonomous object reconstruction for visual search by the humanoid robot HRP-2," in Proc. of the IEEE/RAS Int. Conf. on Humanoid Robots, 2007, pp. 151–158.

[2] T. Foissotte, O. Stasse, P.-B. Wieber, A. Escande, and A. Kheddar, "Autonomous 3D modeling by a humanoid robot using an optimization-driven next-best-view formulation," Int. Journal of Humanoid Robotics, pp. 1–23, Jan 2011.

[3] O. Lorch, J. Seara, K. Strobl, U. Hanebeck, and G. Schmidt, "Perception errors in vision guided walking: analysis, modeling, and filtering," in Proc. of the IEEE Int. Conf. on Robotics and Automation, vol. 2, 2002, pp. 2048–2053.

[4] S. Foix, G. Alenyà, J. Andrade-Cetto, and C. Torras, "Object modeling using a ToF camera under an uncertainty reduction approach," in Proc. of the Int. Conf. on Robotics and Automation (ICRA), 2010, pp. 1306–1312.

[5] A. Davison, I. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1052–1067, 2007.

[6] G. Arechavaleta, J.-P. Laumond, H. Hicheur, and A. Berthoz, "An optimality principle governing human walking trajectories," IEEE Transactions on Robotics, vol. 24, pp. 5–14, 2008.

[7] A. Herdt, H. Diedam, P.-B. Wieber, D. Dimitrov, K. Mombaur, and M. Diehl, "Online Walking Motion Generation with Automatic Foot Step Placement," Advanced Robotics, vol. 24, no. 5-6, pp. 719–737, 2010.

[8] D. Scaramuzza and F. Fraundorfer, "Visual odometry: Part I - the first 30 years and fundamentals," IEEE Robotics and Automation Magazine, vol. 18, pp. 80–92, 2011.

[9] A. Schmidt, M. Kraft, and A. Kasinski, "An evaluation of image feature detectors and descriptors for robot navigation," in Proc. of the International Conference on Computer Vision and Graphics: Part II, 2010, pp. 251–259.

[10] K. N. Kutulakos and S. M. Seitz, “A theory of shape by space carving,”International Journal of Computer Vision, vol. 38, pp. 199–218, 2000.