
Computational Visual Media
DOI 10.1007/s41095-016-0072-2

Research Article

Feature-based RGB-D camera pose optimization for real-time 3D reconstruction

Chao Wang1, Xiaohu Guo1 (✉)

1 University of Texas at Dallas, Richardson, Texas, USA. E-mail: C. Wang, [email protected]; X. Guo, [email protected] (✉).

Manuscript received: 2016-09-09; accepted: 2016-12-20

© The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract  In this paper we present a novel feature-based RGB-D camera pose optimization algorithm for real-time 3D reconstruction systems. During camera pose estimation, current methods in online systems suffer from fast-scanned RGB-D data, or generate inaccurate relative transformations between consecutive frames. Our approach improves current methods by utilizing matched features across all frames and is robust for RGB-D data with large shifts in consecutive frames. We directly estimate camera pose for each frame by efficiently solving a quadratic minimization problem to maximize the consistency of 3D points in global space across frames corresponding to matched feature points. We have implemented our method within two state-of-the-art online 3D reconstruction platforms. Experimental results testify that our method is efficient and reliable in estimating camera poses for RGB-D data with large shifts.

Keywords  camera pose optimization; feature matching; real-time 3D reconstruction; feature correspondence

1 Introduction

Real-time 3D scanning and reconstruction techniques have been applied to many areas in recent years with the prevalence of inexpensive depth cameras for consumers. The sale of millions of such devices makes it desirable for users to scan and reconstruct dense models of the surrounding environment by themselves. Online reconstruction techniques have various popular applications, e.g., in augmented reality (AR) to fuse supplemented elements with the real-world environment, in virtual reality (VR) to provide users with reliable environment perception and feedback, and in simultaneous localization and mapping (SLAM) for robots to automatically navigate in complex environments [1–3].

One of the earliest and most notable methods among RGB-D based online 3D reconstruction techniques is KinectFusion [4], which enables a user holding and moving a standard depth camera such as Microsoft Kinect to rapidly create detailed 3D reconstructions of a static scene. However, a major limitation of KinectFusion is that camera pose estimation is performed by frame-to-model registration using an iterative closest point (ICP) algorithm based on geometric data, which is only reliable for RGB-D data with small shifts between consecutive frames acquired by high-frame-rate depth cameras [4, 5].

To solve the aforementioned limitation, a common strategy adopted by most subsequent online reconstruction methods is to introduce photometric data into the ICP-based framework to estimate camera poses by maximizing the consistency of geometric information as well as color information between two adjacent frames [2, 5–11]. However, even though an ICP-based framework can effectively deal with RGB-D data with small shifts, it solves a non-linear minimization problem and always converges to a local minimum near the initial input because of the small angle assumption [4]. This indicates that pose estimation accuracy relies strongly on a good initial guess, which is unlikely to be satisfied if the camera moves rapidly or is shifted suddenly by the user. For the same reason, ICP-based online reconstruction methods always generate results with drift and distortion for scenes with large planar regions such as walls, ceilings, and floors, even if consecutive frames only contain small shifts. Figure 1 illustrates this shortcoming for several current online methods using an ICP-based framework, and also shows the advantage of our method on RGB-D data with large shifts on a planar region.

Another strategy to improve the robustness of camera tracking is to introduce RGB features into camera pose estimation by maximizing the 3D position consistency of corresponding feature points between frames [12–14]. These feature-based methods are better than ICP-based ones in handling RGB-D data with large shifts, since they simply solve a quadratic minimization problem to directly compute the relative transformation between two consecutive frames [13, 14]. However, unlike ICP-based methods using frame-to-model registration, current feature-based methods estimate camera pose only based on pairs of consecutive frames, which usually brings in errors and accumulates drift in reconstruction on RGB-D data with sudden changes. Moreover, current feature-based methods often estimate camera pose inaccurately because of unreliable feature extraction and matching. Practically, the inaccurate camera poses are not utilized directly in reconstruction, but pushed into an offline backend post-process to improve their reliability, such as global pose graph optimization [12, 15] or bundle adjustment [13, 14]. For this reason, most current feature-based reconstruction methods are strictly offline.

In this paper, we combine the advantages of the two above strategies and propose a novel feature-based camera pose optimization algorithm for online 3D reconstruction systems. To address the limitation that the ICP-based framework always converges to a local minimum near the initial input, our approach estimates the global camera poses directly by efficiently solving a quadratic minimization problem to maximize the consistency of matched feature points across frames, without any initial guess. This makes our method robust in dealing with RGB-D data with large shifts. Meanwhile, unlike current feature-based methods which only consider pairs of consecutive frames, our method utilizes matched features from all previous frames to reduce the impact of bad features and accumulated error in camera pose during scanning. This is achieved by keeping track of RGB features' 3D point information from all frames in a structure called the feature correspondence list.

Our algorithm can be directly integrated into current online reconstruction pipelines. We have implemented our method within two state-of-the-art online 3D reconstruction platforms. Experimental results testify that our approach is efficient and improves current methods in estimating camera pose on RGB-D data with large shifts.

2 Related work

Following KinectFusion, many variants and other brand new methods have been proposed to overcome its limitations and achieve more accurate reconstruction results. Here we mainly consider camera pose estimation methods in online and offline reconstruction techniques, and briefly introduce camera pose optimization in some other relevant areas.

Fig. 1 Camera pose estimation comparison between methods. Top: four real input point clouds scanned using different views of a white wall with a painting. Bottom: results of stitching using camera poses provided by the Lucas–Kanade method [6], voxel-hashing [2], ElasticFusion [9], and our method.

2.1 Online RGB-D reconstruction

A typical online 3D reconstruction process takes RGB-D data as input and fuses the dense overlapping depth frames into one reconstructed model using some specific representation, of which the two most important categories are volume-based fusion [2, 4, 5, 10, 16, 17] and point/surfel-based fusion [1, 9]. Volume-based methods are very common since they can directly generate models with connected surfaces, and are also efficient in data retrieval and use of the GPU. While KinectFusion is limited to a small fixed-size scene, several subsequent methods introduce different data processing techniques to extend the original volume structure, such as moving volume [16, 18], octree-based volume [17], patch volume [19], or hierarchical volume [20]. However, these online methods simply inherit the same ICP framework from KinectFusion to estimate camera pose.

In order to handle dense depth data and stitch frames in real time, most online reconstruction methods prefer an ICP-based framework, which is efficient and reliable if the depth data has small shifts. While KinectFusion runs a frame-to-model ICP process with vertex correspondences obtained by projective data association, Peasley and Birchfield [6] improved it by providing ICP with a better initial guess and correspondences based on a warp transformation between consecutive RGB images. However, this warp transformation is only reliable for images with very small shifts, just like the ICP-based framework. Nießner et al. [2] introduced a voxel-hashing technique into volumetric fusion to reconstruct scenes at large scale efficiently and used color-ICP to maintain geometric as well as color consistency of all corresponding vertices. Steinbrucker et al. [21] proposed an octree-based multi-resolution online reconstruction system which estimates relative camera poses between frames by stitching their photometric and geometric data together as closely as possible. Whelan et al.'s method [10] and a variant [5] both utilize a volume-shifting fusion technique to handle large-scale RGB-D data, while Whelan et al.'s ElasticFusion approach [9] extends it to a surfel-based fusion framework. They introduce local loop closure detection to adjust camera poses at any time during reconstruction. Nonetheless, these methods still rely on an ICP-based framework to determine a single joint pose constraint and therefore are still only reliable on RGB-D data with small shifts. Figure 1 gives a comparison between our method and these current methods on a rapidly scanned wall. In Section 4 we compare voxel-hashing [2], ElasticFusion [9], and our method on an RGB-D benchmark [22] and a real scene.

Feature-based online reconstruction methods are much rarer than ICP-based ones, since camera poses estimated only from features are usually unreliable due to noisy RGB-D data, and must be subsequently post-processed. Huang et al. [13] proposed one of the earliest SLAM systems which estimates an initial camera pose in real time for each frame by utilizing FAST feature correspondences between consecutive frames, and sends all poses to a post-process for global bundle adjustment before reconstruction, which makes this method less efficient and not strictly an online reconstruction technique. Endres et al. [12] considered different feature extractors and estimated camera pose by simply computing the transformation between consecutive frames using a RANSAC algorithm based on feature correspondences. Xiao et al. [14] provided an RGB-D database with full 3D space views and used SIFT features to construct the transformation between consecutive frames, followed by bundle adjustment to globally improve pose estimates. In summary, current feature-based methods utilize feature correspondences only between pairs of consecutive frames to estimate the relative transformation between them. Unlike such methods, our method utilizes the feature-matching information from all previous frames by keeping track of this information in a feature correspondence list. Section 4.4 compares our method and current feature-based frameworks utilizing only pairs of consecutive frames.

2.2 Offline RGB-D reconstruction

The typical and most common scheme for offline reconstruction methods is to take advantage of some global optimization technique to determine consistent camera poses for all frames, such as bundle adjustment [13, 14], pose graph optimization [5, 14, 23], and deformation graph optimization with loop closure detection [9]. Some offline works utilize similar strategies to online methods [2, 5, 9] by introducing feature correspondences into an ICP-based framework. They maximize the consistency of both dense geometric data and sparse image features, as in one of the first reconstruction systems, proposed by Henry et al. [7], which uses SIFT features.

Other work introduces various special points of interest into camera pose estimation and RGB-D reconstruction. Zhou and Koltun [24] proposed an impressive offline 3D reconstruction method which focuses on preserving details of points of interest with high density values across RGB-D frames, and runs pose graph optimization to obtain globally consistent pose estimates for these points. Two other works by Zhou et al. [25] and Choi et al. [26] both detect smooth fragments as point-of-interest zones and attempt to maximize the consistency of corresponding points in fragments across frames using global optimization.

2.3 Camera pose optimization in other areas

Camera pose optimization is also very common in many other areas besides RGB-D reconstruction. Zhou and Koltun [3] presented a color mapping optimization algorithm for 3D reconstruction which optimizes camera poses by maximizing the color agreement of 3D points' 2D projections in all RGB images. Huang et al. [13] proposed an autonomous flight control and navigation method utilizing feature correspondences to estimate relative transformations between consecutive frames in real time. Steinbrucker et al. [27] presented a real-time visual odometry method which estimates camera poses by maximizing photo-consistency between consecutive images.

3 Camera pose estimation

Our camera pose optimization method attempts to maximize the consistency of matched features' corresponding 3D points in global space across frames. In this section we start with a brief overview of the algorithmic framework, and then describe the details of each step.

3.1 Overall scheme

The pipeline is illustrated in Fig. 2. For each input RGB-D frame, we extract the RGB features in the first step (see Section 3.2), and then generate a good feature match with a correspondence-check (see Section 3.3). Next, we maintain and update a data structure called the feature correspondence list to store matched features and corresponding 3D points in the camera's local coordinate space across frames (see Section 3.4). Finally, we estimate camera pose by minimizing the difference between matched features' 3D positions in global space (see Section 3.5).

3.2 Feature extraction

2D feature points can be utilized to reduce the amount of data needed to evaluate the similarity between two RGB images while preserving the accuracy of the result. In order to estimate camera pose efficiently in real time while guaranteeing reconstruction reliability, we need to select a feature extraction method with a good balance between feature accuracy and speed. We ignore corner-based feature detectors such as BRIEF and FAST, since the depth data from consumer depth cameras always contains much noise around object contours due to the cameras' working principles [28]. Instead, we simply use a SURF detector to extract and describe RGB features, for two main reasons. Firstly, SURF is robust, stable, and scale and rotation invariant [29], which is important for establishing reliable feature correspondences between images. Secondly, existing methods can efficiently compute SURF in parallel on the GPU [30].
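As an illustration of this step only, a minimal sketch using the CPU SURF implementation from OpenCV's contrib modules is given below (the paper itself relies on a GPU OpenCL implementation [30]); the hessianThreshold value is an arbitrary illustrative choice, and SURF requires an OpenCV build with the non-free modules enabled.

```python
import cv2

def extract_surf_features(rgb_image):
    """Detect SURF keypoints and compute descriptors on a grayscale copy of the RGB frame.

    A minimal sketch: requires opencv-contrib with non-free modules; the threshold
    below is illustrative, not the value used by the paper's implementation.
    """
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_BGR2GRAY)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints, descriptors = surf.detectAndCompute(gray, None)
    return keypoints, descriptors
```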

3.3 Feature matching

Using the feature descriptors, a feature match can be obtained easily, but it usually contains many mismatched pairs. To remove as many outliers as possible, we run a RANSAC-based correspondence-check based on the 2D homography and the relative transformation between pairs of frames.

Fig. 2 Algorithm overview: from the input RGB image we extract keypoints and perform feature matching; together with the input depth we construct the feature correspondence list and estimate the camera pose.

For two consecutive frames i − 1 and i, with RGB images and corresponding 3D points in the camera's local coordinate space, we first obtain an initial feature match between 2D features based on their descriptors. Next, we run a number of iterations, and in each iteration we randomly select 4 feature pairs to estimate the 2D homography H using the direct linear transformation algorithm [31] and the 3D relative transformation T between the corresponding 3D points. The H and T with the lowest re-projection errors amongst all feature pairs are selected as the final ones to determine the outliers. After the iterations, feature pairs with a 2D re-projection error larger than a threshold σ_H or a 3D re-projection error larger than a threshold σ_T are treated as outliers, and are removed from the initial feature match.

During the correspondence-check, we only select feature pairs with valid depth values. Meanwhile, in order to reduce noise in the depth data, we pre-smooth the depth image with a bilateral filter before computing 3D points from 2D features. After the correspondence-check, if the number of valid matched features is too small, the estimated camera pose obtained from them will be unreliable. Therefore, we abandon all subsequent steps after feature matching and use a traditional ICP-based framework if the number of validly matched features is smaller than a threshold σ_F. In our experiments, we empirically choose σ_H = 3, σ_T = 0.05, and σ_F = 10.
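To make the filtering step concrete, a simplified sketch following the same RANSAC pattern is shown below. It is not the paper's exact implementation: it substitutes OpenCV's findHomography for a hand-written DLT on the 4 sampled pairs, ranks hypotheses by mean re-projection error (the exact ranking criterion is not spelled out above), and the default parameters simply echo the thresholds quoted in the text.

```python
import numpy as np
import cv2

def rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) with R @ p + t ≈ q for rows p of src, q of dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    C = (dst - dst_c).T @ (src - src_c)                     # 3x3 covariance
    U, _, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])          # guard against reflections
    R = U @ D @ Vt
    return R, dst_c - R @ src_c

def correspondence_check(pix_a, pix_b, pts_a, pts_b,
                         iters=200, sigma_h=3.0, sigma_t=0.05, sigma_f=10):
    """RANSAC-style outlier removal over an initial descriptor match.

    pix_a, pix_b: (n, 2) pixel positions of matched features in frames i-1 and i.
    pts_a, pts_b: (n, 3) corresponding 3D points in each camera's local space.
    Returns indices of surviving pairs, or None if fewer than sigma_f remain
    (the caller would then fall back to an ICP-based framework).
    """
    pix_a = np.asarray(pix_a, dtype=np.float64)
    pix_b = np.asarray(pix_b, dtype=np.float64)
    n = len(pix_a)
    best = None
    for _ in range(iters):
        idx = np.random.choice(n, 4, replace=False)
        H, _ = cv2.findHomography(pix_a[idx], pix_b[idx])   # DLT-style fit on 4 pairs
        if H is None:
            continue
        R, t = rigid_transform(pts_a[idx], pts_b[idx])
        proj = cv2.perspectiveTransform(pix_a.reshape(-1, 1, 2), H).reshape(-1, 2)
        err2d = np.linalg.norm(proj - pix_b, axis=1)        # 2D re-projection error
        err3d = np.linalg.norm(pts_a @ R.T + t - pts_b, axis=1)  # 3D alignment error
        score = err2d.mean() + err3d.mean()                 # rank hypotheses by mean error
        if best is None or score < best[0]:
            best = (score, err2d, err3d)
    if best is None:
        return None
    _, err2d, err3d = best
    inliers = np.flatnonzero((err2d <= sigma_h) & (err3d <= sigma_t))
    return inliers if len(inliers) >= sigma_f else None
```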

Figure 3 shows a feature matching comparison before and after the correspondence-check for two consecutive images captured by a fast-moving camera. The blue circles are feature points, while the green circles and lines are matched feature pairs. Note that almost all poorly matched correspondence pairs are removed.

Fig. 3 Two original images (top), and feature matching before (middle) and after (bottom) correspondence checking.

3.4 Feature correspondence list construction

In order to estimate the camera pose by maximizing the consistency of the global positions of matched features in all frames, we establish and update a feature correspondence list (FCL) to keep track of matched features in both the spatial and temporal domains. The FCL is composed of 3D point sets, each of which denotes a series of 3D points in the camera's local coordinate space, whose corresponding 2D pixels are matched features across frames. Thus, the FCL in frame i is denoted by L = {S_j | j = 0, ..., m_i − 1}, where each S_j contains 3D points whose corresponding 2D points are matched features, j is the point set index, and m_i is the number of point sets in the FCL of frame i. The FCL can be constructed simply: Fig. 4 illustrates the process used to construct the FCL for two consecutive frames.

By keeping track of all RGB features' 3D positions in each camera's local space, we can estimate camera poses by maximizing the consistency of all these 3D points' global positions. By utilizing feature information from all frames instead of just two consecutive frames, we aim to reduce the impact of possible bad features, such as incorrectly matched features or features from ill-scanned RGB-D frames. Moreover, this also avoids the accumulation of error in camera pose from previous frames.

Fig. 4 Feature correspondence lists for frame 1 (left) and frame 2 (right). To construct the FCL for frame 2, we remove point sets with unmatched features (green), add matched points whose corresponding features are in the previous frame's FCL into the corresponding point sets (red), and add new point sets for matched features whose corresponding features are not in the previous frame's FCL (blue). Finally we re-index all points in the FCL. The number of point sets in the two FCLs is the same; here m_1 = m_2 = 2.
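One possible in-memory representation of the FCL and its per-frame update (cf. Fig. 4) is sketched below; the type, field, and argument names are our own illustrative choices, not taken from the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PointSet:
    """One point set S_j: local-space 3D points of a feature tracked across frames."""
    first_frame: int                              # n_j, frame where the track starts
    points: list = field(default_factory=list)    # one 3D point per frame in the track

def update_fcl(fcl, prev_feature_to_set, matches,
               prev_points3d, curr_points3d, frame_idx):
    """Build the FCL of frame `frame_idx` from the previous frame's FCL.

    fcl:                 list of PointSet from the previous frame
    prev_feature_to_set: dict {previous-frame feature index: point-set index}
    matches:             list of (prev_feature_idx, curr_feature_idx) inlier pairs
    prev_points3d/curr_points3d: dicts {feature index: 3D point in local space}
    Returns the new FCL and the new feature-to-set map.
    """
    new_fcl, new_map = [], {}
    for prev_idx, curr_idx in matches:
        if prev_idx in prev_feature_to_set:
            # Existing track (red in Fig. 4): extend it with the new observation.
            s = fcl[prev_feature_to_set[prev_idx]]
        else:
            # Newly matched feature (blue in Fig. 4): start a track at the previous frame.
            s = PointSet(first_frame=frame_idx - 1,
                         points=[prev_points3d[prev_idx]])
        s.points.append(curr_points3d[curr_idx])
        new_map[curr_idx] = len(new_fcl)
        new_fcl.append(s)
    # Point sets of unmatched features (green in Fig. 4) are dropped by not copying them.
    return new_fcl, new_map
```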

3.5 Camera pose optimization

For the 3D points in each point set of the FCL, their corresponding RGB features can be regarded as 2D projections of one 3D point in the real world onto the RGB images in a continuous series of frames. For these 3D points in the camera coordinate space, we aim to ensure that their corresponding 3D points in the world space are as close as possible.

Given the FCL L = {S_j | j = 0, ..., m_i − 1} in frame i, for each 3D point p_ij ∈ S_j, our objective is to maximize the agreement between p_ij and its target position in the world space with respect to a rigid transformation. Specifically, we seek a rotation R_i and translation vector t_i that minimize the following energy function:

$$E_i(\mathbf{R}_i, \mathbf{t}_i) = \sum_{j=0}^{m_i-1} w_j \left\| \mathbf{R}_i \mathbf{p}_{ij} + \mathbf{t}_i - \mathbf{q}_j \right\|^2 \tag{1}$$

where w_j is a weight to distinguish the importance of points, and q_j is the target position in the world space of p_ij after transformation. In our method we initially set

$$\mathbf{q}_j = \frac{1}{|S_j| - 1} \sum_{k=n_j}^{i-1} \left( \mathbf{R}_k \mathbf{p}_{kj} + \mathbf{t}_k \right) \tag{2}$$

which is the average position in the world frame obtained from all points in S_j except for p_ij itself, where n_j is the frame index of S_j's first point. Intuitively, the more frequently a 3D global point appears in frames, the more reliable this point's measured data will be for the estimation of camera pose. Therefore, we use w_j = |S_j| to balance the importance of points. q_j in Eq. (2) can be easily computed from the stored information in frame i's FCL.

The energy function E_i(R_i, t_i) in Eq. (1) is a quadratic least-squares objective and can be minimized by Arun et al.'s method [32]:

$$\mathbf{R}_i = \mathbf{V}\mathbf{D}\mathbf{U}^{\mathrm{T}} \tag{3}$$
$$\mathbf{t}_i = \bar{\mathbf{q}} - \mathbf{R}_i \bar{\mathbf{p}}_i \tag{4}$$

Here D = diag(1, 1, det(VU^T)) ensures that R_i is a rotation matrix without reflection. U and V are the 3 × 3 matrices from the singular value decomposition (SVD) of the matrix S = UΣV^T, which is constructed as S = XWY^T, where

$$\mathbf{X} = \left[ \mathbf{p}_{i0} - \bar{\mathbf{p}}_i, \ldots, \mathbf{p}_{i(m_i-1)} - \bar{\mathbf{p}}_i \right] \tag{5}$$
$$\mathbf{W} = \mathrm{diag}(w_0, \ldots, w_{m_i-1}) \tag{6}$$
$$\mathbf{Y} = \left[ \mathbf{q}_0 - \bar{\mathbf{q}}, \ldots, \mathbf{q}_{m_i-1} - \bar{\mathbf{q}} \right] \tag{7}$$
$$\bar{\mathbf{p}}_i = \frac{\sum_k w_k \mathbf{p}_{ik}}{\sum_k w_k}, \qquad \bar{\mathbf{q}} = \frac{\sum_k w_k \mathbf{q}_k}{\sum_k w_k} \tag{8}$$

Here X and Y are both 3 × m_i matrices, W is a diagonal matrix of weight values, and p̄_i and q̄ are the mass centers of all p_ij and q_j in frame i respectively. In general, by minimizing the energy function in Eq. (1), we seek a rigid transformation which makes each 3D point's global position in the world space as close as possible to the average position of all its corresponding 3D points from all previous frames.
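For concreteness, a compact numerical sketch of this weighted closed-form solution (Eqs. (3)–(8)) might look as follows; P holds the local-space points p_ij of frame i, Q their current target positions q_j, and w the weights |S_j|.

```python
import numpy as np

def solve_pose(P, Q, w):
    """Weighted least-squares rigid transform minimizing sum_j w_j ||R p_j + t - q_j||^2.

    P, Q: (m, 3) arrays of local-space points and their world-space targets.
    w:    (m,)  array of weights (|S_j| in the paper).
    Returns (R, t) with R a proper rotation (no reflection).
    """
    w = np.asarray(w, dtype=float)
    p_bar = (w[:, None] * P).sum(axis=0) / w.sum()          # weighted centroid of P   (Eq. 8)
    q_bar = (w[:, None] * Q).sum(axis=0) / w.sum()          # weighted centroid of Q   (Eq. 8)
    X = (P - p_bar).T                                       # 3 x m                    (Eq. 5)
    Y = (Q - q_bar).T                                       # 3 x m                    (Eq. 7)
    S = X @ np.diag(w) @ Y.T                                # 3 x 3, S = X W Y^T       (Eq. 6)
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])      # reflection guard         (Eq. 3)
    R = Vt.T @ D @ U.T                                      # R = V D U^T              (Eq. 3)
    t = q_bar - R @ p_bar                                   # t = q_bar - R p_bar      (Eq. 4)
    return R, t
```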

After solving Eq. (1) for the current frame i, each p_ij's target position q_j in Eq. (2) can be updated by

$$\mathbf{q}'_j = \frac{1}{|S_j|} \sum_{k=n_j}^{i} \left( \mathbf{R}_k \mathbf{p}_{kj} + \mathbf{t}_k \right) \tag{9}$$

This is simply done by putting p_ij and the newly obtained transformation R_i and t_i into Eq. (2), and estimating q_j as the average center of all points in S_j. Note that we can utilize the new q′_j in Eq. (9) to further decrease the energy in Eq. (1) and obtain another new transformation, which can be utilized again to update q′_j in turn. Therefore, an iterative optimization process updating q′_j and minimizing the energy E_i can be repeated to optimize the transformation until the energy converges.

Furthermore, the aforementioned iterative process can also be run on previous frames to further maximize the consistency of matched 3D points' global positions between frames. If an online reconstruction system contains techniques to update the previously reconstructed data, then the further optimized poses in previous frames can be used to further improve the reconstruction quality. In fact, we only need to optimize poses between frames r and i, where r is the earliest frame index of all points in frame i's FCL. A common case during online scanning and reconstruction is that the camera stays steady on the same scene for a long time. The correspondence list will then keep too many old redundant matched features from very early frames, which will greatly increase the computational cost of optimization. To avoid this, we check the gap between r and i for every frame i. If i − r is larger than a threshold δ, we only run optimization between frames i − δ and i. In the experiments, we use δ = 50.

In particular, minimizing each energy E_k (r ≤ k ≤ i) is equivalent to minimizing the sum of the energies over these frames:

$$E(E_r, \ldots, E_i) = \sum_{k=r}^{i} E_k = \sum_{k=r}^{i} \sum_{j=0}^{m_k-1} w_j \left\| \mathbf{R}_k \mathbf{p}_{kj} + \mathbf{t}_k - \mathbf{q}_j \right\|^2 \tag{10}$$

According to the solutions in Eqs. (5)–(8), the computation of each transformation R_k and t_k in Eq. (10) is independent of that in other frames. The total energy E is evaluated in each pass of the iterative optimization process to determine whether the convergence condition is satisfied.

Algorithm 1 describes the entire iterative camera pose optimization process in our method. In the experiments we set the energy threshold ε = 0.01. Our optimization method is very efficient in that it only takes O(m_i) multiplications and additions as well as a few SVD processes on 3 × 3 matrices.

Algorithm 1  Camera pose optimization
Input: Feature correspondence list for frame i, earliest frame index r, and energy threshold ε.
Output: Optimized camera poses between frame r and frame i.
 1: Update {q_j} via Eq. (2);
 2: Compute energy E_i in Eq. (1) with {q_j} and obtain R_i and t_i;
 3: Compute total energy E in Eq. (10);
 4: while (true) do
 5:   E′ ⇐ E;
 6:   Update {q′_j} with {R_k, t_k | k = r, ..., i} via Eq. (9);
 7:   for all (r ≤ k ≤ i) do
 8:     Compute energy E_k in Eq. (1) with {q′_j} and obtain R′_k and t′_k;
 9:   end for
10:   Compute new energy E in Eq. (10) with {R′_k, t′_k | k = r, ..., i};
11:   if (|E′ − E| < ε) then
12:     break;
13:   end if
14:   for all (r ≤ k ≤ i) do
15:     R_k ⇐ R′_k, t_k ⇐ t′_k;
16:   end for
17: end while
18: Return {R′_k, t′_k | k = r, ..., i}
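Putting the pieces together, the loop of Algorithm 1 could be organized as in the sketch below. It simplifies the bookkeeping slightly (new poses are committed before the convergence test rather than after it), and the data layout (point sets as frame-indexed dicts) is our own choice, not the paper's; `solve_pose` is the weighted closed-form solver sketched earlier.

```python
import numpy as np

def optimize_poses(fcl, poses, r, i, solve_pose, eps=0.01):
    """A sketch of Algorithm 1: iterative camera pose optimization over frames r..i.

    fcl:        list of tracked point sets; each set is a dict {frame k: p_kj},
                giving the feature's 3D point in frame k's local space.
    poses:      dict {frame k: (R_k, t_k)}; must hold estimates for every frame the
                FCL refers to, with any reasonable initial value for frame i.
    solve_pose: weighted closed-form solver implementing Eqs. (3)-(8).
    """
    w = np.array([len(s) for s in fcl], dtype=float)          # w_j = |S_j|

    def world(k, p):                                          # apply the current pose of frame k
        R, t = poses[k]
        return R @ p + t

    def targets(exclude_i):                                   # Eq. (2) if exclude_i else Eq. (9)
        return [np.mean([world(k, p) for k, p in s.items()
                         if not (exclude_i and k == i)], axis=0) for s in fcl]

    def solve_frame(k, q):                                    # minimize E_k of Eq. (1)
        js = [j for j, s in enumerate(fcl) if k in s]
        P = np.array([fcl[j][k] for j in js])
        Q = np.array([q[j] for j in js])
        return solve_pose(P, Q, w[js])

    def total_energy(q):                                      # Eq. (10), restricted to frames r..i
        return sum(w[j] * np.sum((world(k, p) - q[j]) ** 2)
                   for j, s in enumerate(fcl) for k, p in s.items() if r <= k <= i)

    q = targets(exclude_i=True)                               # line 1
    poses[i] = solve_frame(i, q)                              # line 2
    energy = total_energy(q)                                  # line 3
    while True:                                               # lines 4-17
        q = targets(exclude_i=False)                          # update q'_j via Eq. (9)
        for k in range(r, i + 1):
            poses[k] = solve_frame(k, q)                      # re-solve every frame in the window
        new_energy = total_energy(q)
        if abs(energy - new_energy) < eps:                    # convergence test
            break
        energy = new_energy
    return {k: poses[k] for k in range(r, i + 1)}
```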

4 Experimental results

To assess the capabilities of our camera pose estimation method, we embedded it within two state-of-the-art platforms: a volume-based method based on voxel-hashing [2] and a surfel-based method, ElasticFusion [9]. In the implementation, we first estimate camera poses using our method, and then regard them as good initial guesses for the original ICP-based framework in each platform. The reason is that the reconstruction quality is possibly low if the online system does not run a frame-to-model framework to stitch dense data from the current frame with the previous model during reconstruction [5]. Note that for each frame, even though our method optimizes camera poses for all relevant frames, we only use the optimized pose for the current frame in the frame-to-model framework to update the reconstruction; the optimized poses in previous frames are only utilized to estimate the camera poses of future frames.

4.1 Trajectory estimation

We first compare our method with both voxel-hashing [2] and ElasticFusion [9], evaluating trajectory estimation performance using several datasets from the RGB-D benchmark [22]. In order to compare with ElasticFusion [9], we utilize the same error metric as in their work, the absolute trajectory root-mean-square error (ATE), which measures the root-mean-square of the Euclidean distances between estimated camera poses and ground truth ones associated with timestamps [9, 22].
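As a reminder of what this metric computes, a stripped-down ATE RMSE over timestamp-associated camera positions is sketched below; the benchmark's own evaluation additionally aligns the two trajectories rigidly (e.g., with Horn's method) before taking the error, a step omitted here for brevity.

```python
import numpy as np

def ate_rmse(est_positions, gt_positions):
    """Absolute trajectory error: RMSE of translational differences.

    est_positions, gt_positions: (N, 3) arrays of camera positions already
    associated by timestamp.  A full evaluation would first rigidly align the
    estimated trajectory to the ground truth; that step is omitted here.
    """
    diff = est_positions - gt_positions
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```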

Table 1  Trajectory estimation comparison of methods using the ATE metric

System          fr1/desk        fr1/floor       fr1/room        fr3/ntf
                dif1    dif5    dif1    dif5    dif1    dif5    dif1    dif5
Voxel-hashing   1.10    0.74    1.01    0.70    0.61    1.08    1.32    1.30
Ours            0.32    0.32    0.16    0.19    0.34    0.61    0.18    0.08
ElasticFusion   0.03    0.30    0.41    0.54    0.38    0.48    0.08    0.15
Ours            0.04    0.21    0.17    0.22    0.32    0.35    0.08    0.08

Table 1 shows the results from each method with and without our improvement. We denote the smallest error for each dataset in bold. Here "dif1" and "dif5" denote the frame difference used for each dataset during reconstruction. In other words, for "dif5" we only use the first frame of every 5 consecutive frames in each original dataset, and omit the other 4 intermediate frames, in order to estimate the trajectories on RGB-D data with large shifts, while for "dif1" we just use the original dataset.

Note that our results are different when embedded in the two platforms even for the same dataset. This is because, firstly, the two online platforms utilize different data processing and representation techniques, and different frame-to-model frameworks during reconstruction. Secondly, the voxel-hashing platform does not contain any optimization technique to modify previously constructed models and camera poses, while ElasticFusion utilizes both local and global loop closure detection in conjunction with global optimization techniques to optimize previous data and generate a globally consistent reconstruction [9].

Results in Table 1 show that our method improves upon the other two methods for estimating trajectories, especially on large planar regions such as fr1/floor and fr3/ntf, which both contain textured floor. Furthermore, our method also estimates trajectories better than the other methods when the shifts between the RGB-D frames are large.

4.2 Pose estimation

To evaluate pose estimation performance, we compared our method with the same two methods on the same benchmark using the relative pose error (RPE) [22], which measures the relative pose difference between each estimated camera pose and the corresponding ground truth. Table 2 gives the results, which show that our method can improve camera pose estimation on datasets with large shifts, even though our result is only on a par with the others on the original datasets with small shifts between consecutive frames.

Table 2  Pose estimation comparison of methods using the RPE metric

System          fr1/desk        fr1/floor       fr1/room        fr3/ntf
                dif1    dif5    dif1    dif5    dif1    dif5    dif1    dif5
Voxel-hashing   1.57    1.16    1.25    0.98    1.15    1.49    1.60    1.62
Ours            0.91    0.94    0.80    0.80    0.84    1.14    1.36    1.42
ElasticFusion   0.04    0.41    0.42    0.58    0.51    0.63    0.11    0.11
Ours            0.05    0.29    0.43    0.48    0.54    0.61    0.12    0.11

4.3 Surface reconstruction

In order to compare the influence of computed camera poses on the final reconstructed models for our method and the others, we firstly compute camera poses by each method on its corresponding platform, and then use all the poses on the same voxel-hashing platform to generate reconstructed models. Here our method runs on the voxel-hashing platform. Figure 5 gives the reconstruction results for different methods on the fr1/floor dataset from the same benchmark, with frame difference 5. The figure shows that our method improves the reconstructed surface by producing good camera poses for the RGB-D data with large shifts.

Fig. 5 Reconstruction results for different methods on fr1/floor from the RGB-D benchmark [22] with frame difference 5 (left to right: voxel-hashing, ElasticFusion, ours, ground truth).

To test our method with a fast-moving camera on a real scene, we fixed an Asus XTion depth camera on a tripod with a motor to rotate the camera with controlled speed. With this device, we firstly scanned a room by rotating the camera only around its axis (the y-axis in the camera's local coordinate frame) for several rotations with a fixed speed, and selected the RGB-D data for exactly one rotation for the test. This dataset contains 235 RGB-D frames; most of the RGB images are blurred, since it took the camera only about 5 seconds to finish the rotation. Figure 6 gives an example showing two blurred images from this RGB-D dataset. Note that our feature matching method can still match features very well.

Fig. 6 Two blurred images (top) and feature matching result (bottom) from our scanned RGB-D data of a real scene using a fast-moving camera.

Figure 7 gives the reconstruction results produced by different methods on this dataset. As in Fig. 5, all reconstruction results here are also obtained using the voxel-hashing platform with camera poses pre-computed by different methods on each corresponding platform; again our method ran on the voxel-hashing platform. For the ground truth camera poses, since we scan the scene with a fixed rotation speed, we simply compute the ground truth camera pose for each frame i (0 ≤ i < 235) as R_i = R_y(θ_i) with θ_i = (360(i − 1)/235)° and t_i = 0, where R_y(θ_i) rotates around the y-axis by an angle θ_i. Moreover, note that ElasticFusion [9] utilizes loop closure detection and deformation graph optimization to globally optimize camera poses and global point positions in the final model. To make the comparison more reasonable, we introduce the same loop closure detection as in ElasticFusion [9] into our method, and use a pose graph optimization tool [15] to globally optimize camera poses for all frames efficiently. Figure 7 shows that our optimized camera poses can determine the structure of the reconstructed model very well for the real-scene data captured by a fast-moving camera.
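Under the constant-speed assumption stated above, these ground-truth poses reduce to pure rotations about the camera's y-axis; a small sketch of this construction, following the θ_i definition in the text, is:

```python
import numpy as np

def ground_truth_pose(i, num_frames=235):
    """Ground-truth pose for frame i of the turntable scan: R_i = R_y(theta_i), t_i = 0."""
    theta = np.deg2rad(360.0 * (i - 1) / num_frames)   # theta_i as defined in the text
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[  c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])                    # rotation about the y-axis
    return R, np.zeros(3)
```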

Fig. 7 Reconstruction results for different methods on room data captured by a speed-controlled fast-moving camera (panels: voxel-hashing, ElasticFusion, ours, ground truth).

4.4 Justification of feature correspondence list

In our method we utilize the FCL in order to reduce the impact of bad features on camera pose estimation, and also to avoid accumulating error in camera poses during scanning. Current feature-based methods always estimate the relative transformation between the current frame and the previous one using only the matched features in these two consecutive frames [12–14]; here we call this strategy consecutive-feature estimation.

In our framework, consecutive-feature estimation can be easily implemented by only using steps (1) and (2) (lines 1 and 2) of Algorithm 1 with each q_j = p_(i−1)j, which is p_ij's matched 3D point in the previous frame. Figure 8 gives the ATE and RPE errors for our method utilizing FCLs and for the consecutive-feature method on fr1/floor, for increasing frame differences. Clearly our method with FCLs outperforms the consecutive-feature method in determining camera poses for RGB-D data with large shifts.

Fig. 8 Comparison between our method and the consecutive-feature method on fr1/floor for varying frame difference (ATE and RPE plotted against frame difference, for the 2-adjacent-feature baseline and ours).

4.5 Performance

We have tested our method on the voxel-hashing platform on a laptop running Microsoft Windows 8.1 with an Intel Core i7-4710HQ CPU at 2.5 GHz, 12 GB RAM, and an NVIDIA GeForce GTX 860M GPU with 4 GB memory. We used the OpenSURF library with OpenCL [30] to extract SURF features on each down-sampled 320 × 240 RGB image. For each frame, our camera pose optimization pipeline takes about 10 ms to extract features and finish feature matching, 1–2 ms for FCL construction, and only 5–8 ms for the camera pose optimization step, including the iterative optimization of camera poses for all relevant frames. Therefore, our method is efficient enough to run in real time. We also note that the offline pose graph optimization tool [15] used for the RGB-D data described in Section 4.3 takes only 10 ms for global pose optimization of all frames.

5 Conclusions and future work

This paper has proposed a novel feature-based camera pose optimization algorithm which efficiently and robustly estimates camera poses in online RGB-D reconstruction systems. Our approach utilizes the feature correspondences from all previous frames and optimizes camera poses across frames. We have implemented our method within two state-of-the-art online RGB-D reconstruction platforms. Experimental results verify that our method improves upon current online systems, estimating more accurate camera poses and generating more reliable reconstructions for RGB-D data with large shifts between consecutive frames.

Considering that our camera pose optimization method is only part of the RGB-D reconstruction system pipeline, we aim to develop a new RGB-D reconstruction system that incorporates our camera pose optimization framework. Moreover, we will also explore utilizing our optimized camera poses in previous frames to update the previously reconstructed model in the online system.

References

[1] Keller, M.; Lefloch, D.; Lambers, M.; Izadi, S.; Weyrich, T.; Kolb, A. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In: Proceedings of the International Conference on 3D Vision, 1–8, 2013.

[2] Nießner, M.; Zollhofer, M.; Izadi, S.; Stamminger, M. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics Vol. 32, No. 6, Article No. 169, 2013.


[3] Zhou, Q.-Y.; Koltun, V. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics Vol. 33, No. 4, Article No. 155, 2014.

[4] Newcombe, R. A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A. J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking. In: Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality, 127–136, 2011.

[5] Whelan, T.; Kaess, M.; Johannsson, H.; Fallon, M.; Leonard, J.; McDonald, J. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research Vol. 34, Nos. 4–5, 598–626, 2015.

[6] Peasley, B.; Birchfield, S. Replacing projective data association with Lucas–Kanade for KinectFusion. In: Proceedings of the IEEE International Conference on Robotics and Automation, 638–645, 2013.

[7] Henry, P.; Krainin, M.; Herbst, E.; Ren, X.; Fox, D. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. The International Journal of Robotics Research Vol. 31, No. 5, 647–663, 2012.

[8] Newcombe, R. A.; Lovegrove, S. J.; Davison, A. J. DTAM: Dense tracking and mapping in real-time. In: Proceedings of the IEEE International Conference on Computer Vision, 2320–2327, 2011.

[9] Whelan, T.; Leutenegger, S.; Salas-Moreno, R.; Glocker, B.; Davison, A. ElasticFusion: Dense SLAM without a pose graph. In: Proceedings of Robotics: Science and Systems, 11, 2015.

[10] Whelan, T.; Johannsson, H.; Kaess, M.; Leonard, J. J.; McDonald, J. Robust real-time visual odometry for dense RGB-D mapping. In: Proceedings of the IEEE International Conference on Robotics and Automation, 5724–5731, 2013.

[11] Zhang, K.; Zheng, S.; Yu, W.; Li, X. A depth-incorporated 2D descriptor for robust and efficient 3D environment reconstruction. In: Proceedings of the 10th International Conference on Computer Science & Education, 691–696, 2015.

[12] Endres, F.; Hess, J.; Engelhard, N.; Sturm, J.; Cremers, D.; Burgard, W. An evaluation of the RGB-D SLAM system. In: Proceedings of the IEEE International Conference on Robotics and Automation, 1691–1696, 2012.

[13] Huang, A. S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. In: Robotics Research. Christensen, H. I.; Khatib, O.; Eds. Springer International Publishing, 235–252, 2011.

[14] Xiao, J.; Owens, A.; Torralba, A. SUN3D: A database of big spaces reconstructed using SfM and object labels. In: Proceedings of the IEEE International Conference on Computer Vision, 1625–1632, 2013.

[15] Kummerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. G2o: A general framework for graph optimization. In: Proceedings of the IEEE International Conference on Robotics and Automation, 3607–3613, 2011.

[16] Roth, H.; Vona, M. Moving volume KinectFusion. In: Proceedings of the British Machine Vision Conference, 1–11, 2012.

[17] Zeng, M.; Zhao, F.; Zheng, J.; Liu, X. Octree-based fusion for realtime 3D reconstruction. Graphical Models Vol. 75, No. 3, 126–136, 2013.

[18] Whelan, T.; Johannsson, H.; Kaess, M.; Leonard, J. J.; McDonald, J. Robust tracking for real-time dense RGB-D mapping with Kintinuous. Computer Science and Artificial Intelligence Laboratory Technical Report, MIT-CSAIL-TR-2012-031, 2012.

[19] Henry, P.; Fox, D.; Bhowmik, A.; Mongia, R. Patch volumes: Segmentation-based consistent mapping with RGBD cameras. In: Proceedings of the International Conference on 3D Vision, 398–405, 2013.

[20] Chen, J.; Bautembach, D.; Izadi, S. Scalable real-time volumetric surface reconstruction. ACM Transactions on Graphics Vol. 32, No. 4, Article No. 113, 2013.

[21] Steinbrucker, F.; Kerl, C.; Cremers, D. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In: Proceedings of the IEEE International Conference on Computer Vision, 3264–3271, 2013.

[22] Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 573–580, 2012.

[23] Stuckler, J.; Behnke, S. Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation Vol. 25, No. 1, 137–147, 2014.

[24] Zhou, Q.-Y.; Koltun, V. Dense scene reconstruction with points of interest. ACM Transactions on Graphics Vol. 32, No. 4, Article No. 112, 2013.

[25] Zhou, Q.-Y.; Miller, S.; Koltun, V. Elastic fragments for dense scene reconstruction. In: Proceedings of the IEEE International Conference on Computer Vision, 473–480, 2013.

[26] Choi, S.; Zhou, Q.-Y.; Koltun, V. Robust reconstruction of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5556–5565, 2015.

[27] Steinbrucker, F.; Sturm, J.; Cremers, D. Real-time visual odometry from dense RGB-D images. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, 719–722, 2011.

[28] Hansch, R.; Weber, T.; Hellwich, O. Comparison of 3D interest point detectors and descriptors for point cloud fusion. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences Vol. 2, No. 3, 57, 2014.

[29] Juan, L.; Gwun, O. A comparison of SIFT, PCA-SIFT and SURF. International Journal of Image Processing Vol. 3, No. 4, 143–152, 2009.


[30] Yan, W.; Shi, X.; Yan, X.; Wan, L. Computing OpenSURF on OpenCL and general purpose GPU. International Journal of Advanced Robotic Systems Vol. 10, No. 10, 375, 2013.

[31] Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[32] Arun, K. S.; Huang, T. S.; Blostein, S. D. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. PAMI-9, No. 5, 698–700, 1987.

Chao Wang is currently a Ph.D. candidate in the Department of Computer Science at the University of Texas at Dallas. Before that, he received his M.S. degree in computer science in 2012, and B.S. degree in automation in 2009, both from Tsinghua University. His research interests include geometric modeling, spectral geometric analysis, and 3D reconstruction of indoor environments.

Xiaohu Guo received his Ph.D. degree in computer science from Stony Brook University in 2006. He is currently an associate professor of computer science at the University of Texas at Dallas. His research interests include computer graphics and animation, with an emphasis on geometric modeling and processing, mesh generation, centroidal Voronoi tessellation, spectral geometric analysis, deformable models, 3D and 4D medical image analysis, etc. He received a prestigious National Science Foundation CAREER Award in 2012.

Open Access  The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
