
Estimation of Camera Pose with Respect to Terrestrial LiDAR Data

Wei Guan∗ Suya You† Guan Pang‡

Computer Science Department, University of Southern California, Los Angeles, USA

Abstract

In this paper, we present an algorithm that estimates the position of a hand-held camera with respect to terrestrial LiDAR data. Our input is a set of 3D range scans with intensities and one or more 2D uncalibrated camera images of the scene. The algorithm, which automatically registers range scans and 2D images, is composed of the following steps. In the first step, we project the terrestrial LiDAR data onto 2D images according to several preselected viewpoints. Intensity-based features such as SIFT are extracted from these projected images, and the features are projected back onto the LiDAR data to obtain their 3D positions. In the second step, we estimate the initial pose of the given 2D images from feature correspondences. In the third step, we refine the coarse camera pose obtained from the previous step through iterative matchings and an optimization process. We present results from experiments in several different urban settings.

1. Introduction

This paper deals with the problem of automatic pose estimation of a 2D camera image with respect to 3D LiDAR data of an urban scene, which is an important problem in computer vision. Its applications include urban modeling, robot localization, and augmented reality. One way to solve this problem is to extract features from both types of data and find 2D-to-3D feature correspondences. However, since the structures of these two types of data are so different, the features extracted from one type of data are usually not repeatable in the other (except for very simple features such as lines or corners). Instead of extracting features directly in 3D space, features can be extracted on their 2D projections and a 2D-to-2D-to-3D matching scheme can be used.

As remote sensing technology develops, most recent LiDAR data includes an intensity value for each point in the cloud.

∗e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]

Figure 1. (a) The 3D LiDAR data with color information (sampled by software for fast rendering). (b) The 2D image of the same scene taken at ground level.

Some LiDAR data also contains color information. The intensity information is obtained by measuring the strength of surface reflectance, and the color information is provided by an additional co-located optical sensor that captures visible light. This information is very helpful for matching 3D range scans with 2D camera images: unlike with geometry-only LiDAR data, intensity-based features can be applied in the pose estimation process.

Figure 1 shows the colored LiDAR data and a camera image taken on the ground. As we can observe, the projected LiDAR data looks similar to an image taken by an optical camera. In fact, if the point cloud is dense enough, the projected 2D image can be treated the same way as a normal camera image. However, there are several differences between a projected image and an image taken by a camera. First, there are many holes in the projected image due to missing data, which is usually caused by non-reflecting surfaces and occlusions in the scene. Second, if the point cloud intensity is measured by reflectance strength, the reflectance properties of invisible light are different from those of visible light. Even when visible light is used to obtain LiDAR intensities, the lighting conditions could differ from the lighting of a camera image. In this paper, we propose an algorithm that can handle LiDAR with both types of intensity information.

The intensity information of LiDAR data is useful for camera pose initialization. However, due to intensity differences, occlusions, etc., not many correspondences are available, and a small displacement in any of the matching points causes large errors in the computed camera pose. Moreover, most urban scenes contain many repeated patterns, which cause many features to fail in the matching process. With the initial pose, we can estimate the locations of corresponding features and limit the search range so that repeated patterns do not appear within it. Therefore, we can generate more correspondences and refine the pose. After several iterative matchings, we further refine the camera pose by minimizing the differences between the projected image and the camera image. The estimated camera pose is more stable after these two refinement steps. The contributions of this paper are summarized as follows.

1. We propose a framework for camera pose estimation with respect to 3D terrestrial LiDAR data that contains intensity values. No prior knowledge about the camera position is required.

2. We design a novel algorithm that refines the camera pose in two steps. Both intensity and geometric information are used in the refinement process.

3. We have tested the proposed framework in different urban settings. The results show that the estimated camera pose is accurate and that the framework can be applied in many applications such as mixed reality.

The remainder of this paper presents the proposed algorithm in more detail. We first discuss related work in Section 2. Section 3 describes the camera pose initialization process. Following that, Section 4 discusses the algorithm that refines the camera pose. We show experimental results in Section 5 and conclude the paper in the last section.

2. Related Work

There has been a considerable amount of research on registering images with LiDAR data. The registration methods vary from keypoint-based matching [3, 1] and structure-based matching [20, 13, 14, 21] to mutual-information-based registration [24]. There are also methods that are specially designed for registering aerial LiDAR with aerial images [7, 22, 5, 23, 16].

When the LiDAR data contains intensity values, keypoint-based matching [3, 1], which relies on the similarity between the LiDAR intensity image and the camera intensity image, can be applied. Feature points such as SIFT [15] are extracted from both images, and a matching strategy is used to determine the correspondences and thus the camera parameters. The drawback of intensity-based matching is that it usually generates very few correspondences, so the estimated pose is neither accurate nor stable. Najafi et al. [18] created an environment map that represents an object's appearance and geometry using SIFT features. Vasile et al. [22] used LiDAR data to generate a pseudo-intensity image with shadows, which is matched with aerial imagery; they used GPS as the initial pose and applied an exhaustive search to obtain the translation, scale, and lens distortion. Ding et al. [5] registered oblique aerial images based on 2D and 3D corner features in the 2D images and the 3D LiDAR model, respectively. The correspondences between the extracted corners are generated through a Hough transform and a generalized M-estimator, and the corner correspondences are used to refine the camera parameters. In general, a robust feature extraction and matching scheme is the key to successful registration for this type of approach.

Instead of point-based matching, structural features such as lines and corners have been utilized in many studies. Stamos and Allen [20] used matching of rectangles from building facades for alignment. Liu et al. [13, 14, 21] extracted line segments to form "rectangular parallelepipeds", which are composed of vertical or horizontal 3D rectangular parallelepipeds in the LiDAR data and 2D rectangles in the images. The matching of parallelepipeds, as well as vanishing points, is used to estimate the camera parameters. Yang et al. [25] used feature matching to align ground images, but they worked with a very detailed 3D model. Wang and Neumann [23] proposed an automatic registration method between aerial images and aerial LiDAR based on matching 3CS ("3 Connected Segments"), in which each linear feature contains 3 connected segments. They used a two-level RANSAC algorithm to refine putative matches and estimated the camera pose from the correspondences.

Given a set of 3D-to-2D point or line correspondences, there are many approaches to solving the pose recovery problem [17, 12, 4, 19, 11, 8]. The same problem also appears in pose recovery with respect to a point cloud generated from image sequences [9, 10]. In both cases, a probabilistic RANSAC method [6] has also been introduced to automatically compute matching 3D and 2D points and remove outliers.


Figure 2. The virtual cameras are placed around the LiDAR scene. They are placed uniformly in viewing directions and logarithmically in distance.

In this paper, we apply a keypoint-based method to estimate the initial camera pose, and then use iterative methods with RANSAC, utilizing both intensity and geometric information, to obtain the refined pose.

3. Camera Pose Initialization

3.1. Synthetic Views of 3D LiDAR

To compute the pose for an image taken at an arbitrary viewpoint, we first create "synthetic" views that cover a large range of viewing directions. Z-buffers are used to handle occlusions. Our application is to recover the camera poses of images taken in urban environments, so we can restrict the placement of virtual cameras to eye-level height to simplify the problem. In general, however, the approach is not limited to such images.

We place the cameras around the LiDAR data covering about 180 degrees. The cameras are placed uniformly in viewing angle and logarithmically in distance, as shown in Figure 2. The density of locations depends on the type of feature used for matching. If the feature can handle rotation, scale, and wide-baseline changes, fewer virtual cameras are needed to cover most cases. In contrast, if the feature is neither rotation-invariant nor scale-invariant, we need to select as many viewpoints as possible and rotate the camera at each viewpoint. Furthermore, the viewpoints cannot be too close to the point cloud; otherwise the quality of the projected image is not good enough to generate initial feature correspondences. In our work, we use SIFT [15] features, which are scale and rotation invariant and robust to moderate viewing-angle changes. We select 6 viewing angles uniformly and 3 distances for each viewing angle, spaced logarithmically. The synthetic views are shown in Figure 3.
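To make this sampling concrete, the following Python sketch places virtual cameras uniformly in viewing angle and logarithmically in distance around a scene center. The angular span, base radius, doubling of distances, and eye-level height are illustrative assumptions; the paper only specifies 6 viewing angles and 3 logarithmically spaced distances.

```python
import numpy as np

def virtual_camera_positions(center, base_radius=10.0, height=1.7,
                             n_angles=6, n_distances=3, span_deg=180.0):
    """Place virtual cameras on a half-circle around the scene center:
    uniform in viewing angle, logarithmic in distance, all at eye level,
    each looking toward the scene center."""
    cameras = []
    angles = np.linspace(-np.radians(span_deg) / 2,
                         np.radians(span_deg) / 2, n_angles)
    # e.g. radii = base_radius * {1, 2, 4} for n_distances = 3
    radii = base_radius * np.power(2.0, np.arange(n_distances))
    for theta in angles:
        for r in radii:
            cam = np.array([center[0] + r * np.cos(theta),
                            center[1] + r * np.sin(theta),
                            height])
            look = np.array([center[0], center[1], height]) - cam
            cameras.append((cam, look / np.linalg.norm(look)))
    return cameras

if __name__ == "__main__":
    for cam, d in virtual_camera_positions(np.array([0.0, 0.0, 0.0]))[:3]:
        print(cam, d)
```

Each returned pair (position, viewing direction) would then be used to render one synthetic view of the point cloud with a Z-buffer.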

3.2. Generation of 3D Feature Cloud

Figure 3. The synthetic views of the LiDAR data. 2D features are extracted from each synthetic view.

We extract 2D SIFT features from each synthetic view. Once the features are extracted, we project them back onto the point cloud by finding the intersection with the first plane obtained through plane segmentation using the method of [20]. It is possible that the same feature is reprojected onto different points through different synthetic views. To handle this, we post-process the feature points so that close points with similar descriptors are merged into one feature. Note that we could also obtain the 3D features by triangulation; however, triangulation depends on matching pairs, so it generates far fewer features for potential matching. The obtained positions of the 3D keypoints are not accurate due to projection and reprojection errors, but they are good enough to provide an initial pose. We optimize their positions and the camera pose at a later stage.
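The merging of back-projected features can be done greedily, as in the sketch below. The distance and descriptor thresholds are assumptions for illustration; the paper does not state the values it uses.

```python
import numpy as np

def merge_feature_cloud(points, descriptors, dist_thresh=0.05, desc_thresh=0.4):
    """Greedy merge of back-projected 3D features: points that are close in 3D
    and have similar descriptors are collapsed into a single averaged feature."""
    points = np.asarray(points, dtype=float)
    descriptors = np.asarray(descriptors, dtype=float)
    merged_pts, merged_desc = [], []
    used = np.zeros(len(points), dtype=bool)
    for i in range(len(points)):
        if used[i]:
            continue
        close = np.linalg.norm(points - points[i], axis=1) < dist_thresh
        similar = np.linalg.norm(descriptors - descriptors[i], axis=1) < desc_thresh
        group = close & similar & ~used
        used |= group
        merged_pts.append(points[group].mean(axis=0))        # average 3D position
        merged_desc.append(descriptors[group].mean(axis=0))  # average descriptor
    return np.array(merged_pts), np.array(merged_desc)
```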

The generated 3D feature cloud is shown in Figure 4. Each point is associated with one descriptor. For a given camera image, we extract SIFT features and match them against the feature cloud. A direct 3D-to-2D matching with RANSAC is used to estimate the pose and remove outliers. When applying RANSAC, rather than maximizing the number of inliers consistent with the hypothesized pose, we modify the criterion as follows.

We cluster the inliers according to their normal directions: inliers with close normal directions are grouped into the same cluster. Let N_1 and N_2 be the numbers of inliers in the two largest clusters. Among all the hypothesized poses, we want to maximize the value of N_2, i.e.,


Figure 4. SIFT features in 3D space. The 3D positions are obtained by reprojecting 2D features onto the 3D LiDAR data.

[R|T] = \arg\max_{[R|T]} N_2. \quad (1)

This ensures that not all of the inliers lie within the same plane, in which case the computed pose would be unstable and sensitive to position errors.
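One possible implementation of this modified RANSAC score is sketched below: cluster the inlier normals greedily by angular proximity and return N_2, the size of the second-largest cluster. The greedy clustering strategy and the 20-degree threshold are assumptions; the paper only states that inliers with close normal directions are grouped together.

```python
import numpy as np

def pose_score(inlier_normals, angle_thresh_deg=20.0):
    """Score a hypothesized pose by N2, the inlier count of the second-largest
    cluster of inlier normal directions (Eq. 1)."""
    normals = np.asarray(inlier_normals, dtype=float)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    cluster_sizes = []
    assigned = np.zeros(len(normals), dtype=bool)
    for i in range(len(normals)):
        if assigned[i]:
            continue
        # group all still-unassigned normals within the angular threshold of normal i
        members = (normals @ normals[i] > cos_thresh) & ~assigned
        assigned |= members
        cluster_sizes.append(int(members.sum()))
    cluster_sizes.sort(reverse=True)
    return cluster_sizes[1] if len(cluster_sizes) > 1 else 0  # N2
```

Inside the RANSAC loop, the hypothesis with the largest returned N_2 is kept instead of the hypothesis with the most raw inliers.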

4. Camera Pose Refinement

4.1. Refinement by More Point Correspondences

With the estimated initial pose, we can generate more feature correspondences by limiting the similarity search space. For the first iteration, we still use SIFT features. From the 2nd iteration on, we can use less distinctive features to generate more correspondences. In our work, we use Harris corners as the keypoints. For each corner point, a normalized intensity histogram within an 8×8 patch is computed as the descriptor. Its corresponding point will most likely lie within a neighborhood of H by H pixels. Initially, H is set to 64 pixels. At each iteration, the window size is halved since a more accurate pose has been obtained; we keep the minimum search size at 16 pixels. Figure 5 shows a few iterations and the matching results within the reduced search space.
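The windowed matching step might look like the following sketch, where Harris corners (e.g., detected with OpenCV beforehand) are described by a normalized intensity histogram of an 8×8 patch and matched only within the current H×H window around the location predicted by the current pose. The 16-bin histogram and the nearest-descriptor criterion are assumptions not spelled out in the paper.

```python
import numpy as np

def patch_descriptor(image, x, y, size=8):
    """Normalized intensity histogram of a size-by-size patch around (x, y)."""
    half = size // 2
    patch = image[y - half:y + half, x - half:x + half]
    hist, _ = np.histogram(patch, bins=16, range=(0, 256))
    return hist / max(hist.sum(), 1)

def match_with_window(corners_img, corners_proj, image, proj_image,
                      predicted_px, H=64):
    """One iteration of windowed matching: for the k-th projected 3D keypoint
    (located at corners_proj[k] in the synthetic image, predicted at
    predicted_px[k] in the camera image), search only the Harris corners of the
    camera image that fall inside the current H-by-H window. The caller halves
    H after each iteration, down to a minimum of 16 pixels."""
    matches = []
    for k, (px, py) in enumerate(predicted_px):
        cx, cy = corners_proj[k]
        d_ref = patch_descriptor(proj_image, cx, cy)
        best, best_dist = None, np.inf
        for j, (x, y) in enumerate(corners_img):
            if abs(x - px) > H / 2 or abs(y - py) > H / 2:
                continue  # outside the current search window
            dist = np.linalg.norm(patch_descriptor(image, x, y) - d_ref)
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            matches.append((k, best))
    return matches
```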

4.2. Geometric Structures and Alignment

The purpose of geometric structure extraction is not to form features for generating correspondences. Instead, the extracted structures are used to align 3D structures with 2D structures in the camera image. In our work, line segments are used to align 3D range scans with 2D images. Therefore, we need to define the distance between these line segments.

Figure 5. (a) The initial camera pose. (b) 3D-to-2D matching on the initial pose. (c) Camera pose after the 1st iteration. (d) 3D-to-2D matching based on the refined pose. (e) Camera pose after the 2nd iteration.

There are two types of lines in the 3D LiDAR data. The first type is generated from the geometric structure; these lines can be computed at the intersections between segmented planar regions and at the borders of the segmented planar regions [20]. The other type is formed by intensities. These lines can be detected on the projected synthetic image with the method of [2] and reprojected onto the 3D LiDAR data to obtain their 3D coordinates. For each hypothesized pose, the 3D lines are projected onto the 2D image and we measure the alignment error as follows.

E_{line} = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} K(l_i, L_j) \cdot \max\big(D(l_{i1}, L_j),\, D(l_{i2}, L_j)\big), \quad (2)

where l_i is the i-th 2D line segment with l_{i1} and l_{i2} as its two endpoints, and L_j is the j-th 3D line segment. M and N are the numbers of 2D and 3D segments, respectively. K(l_i, L_j) is a binary function deciding whether the two line segments have similar slopes, and D(l_{i1}, L_j) is a function describing the distance from the endpoint l_{i1} to the projected line segment L_j. The functions K and D are defined as follows:

K(l, L) =
\begin{cases}
0 & \text{for } \angle(l, L) < K_{th} \\
1 & \text{for } \angle(l, L) \geq K_{th}
\end{cases} \quad (3)

D(l_{1,2}, L) =
\begin{cases}
0 & \text{for } d(l_{1,2}, L) \geq D_{th} \\
d(l_{1,2}, L) & \text{for } d(l_{1,2}, L) < D_{th}
\end{cases} \quad (4)

where \angle(l, L) represents the angle difference between the two line segments, and d(l_{1,2}, L) is the distance from endpoint l_1 or l_2 to the projected line segment L. K_{th} and D_{th} are two thresholds deciding whether the two segments are potential matches. In our experiments, we set K_{th} = \pi/6 and D_{th} = W/20, where W is the image width.
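For reference, a direct implementation of Eqs. (2)-(4) as stated could look like the sketch below. Segments are given as endpoint pairs in pixel coordinates, and the 3D segments are assumed to have already been projected into the image by the hypothesized pose.

```python
import numpy as np

def point_to_segment_distance(p, a, b):
    """Distance from point p to the 2D segment with endpoints a and b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def line_alignment_error(lines_2d, lines_3d_proj, image_width,
                         K_th=np.pi / 6, D_th_frac=1.0 / 20):
    """Alignment error of Eq. (2) between 2D segments l_i and projected 3D
    segments L_j. Each segment is ((x1, y1), (x2, y2)). Thresholds follow the
    paper: K_th = pi/6 and D_th = W/20."""
    D_th = image_width * D_th_frac

    def angle(seg):
        (x1, y1), (x2, y2) = seg
        return np.arctan2(y2 - y1, x2 - x1) % np.pi

    def K(l, L):  # Eq. (3): binary slope-difference test
        diff = abs(angle(l) - angle(L))
        diff = min(diff, np.pi - diff)
        return 0.0 if diff < K_th else 1.0

    def D(endpoint, L):  # Eq. (4): truncated endpoint-to-segment distance
        d = point_to_segment_distance(np.array(endpoint, dtype=float),
                                      np.array(L[0], dtype=float),
                                      np.array(L[1], dtype=float))
        return d if d < D_th else 0.0

    N = max(len(lines_3d_proj), 1)
    total = sum(K(l, L) * max(D(l[0], L), D(l[1], L))
                for l in lines_2d for L in lines_3d_proj)
    return total / N
```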

4.3. Refinement by Minimizing an Error Function

Figure 6. (a) The refined pose from iterative matchings. (b) Camera pose after minimizing the error function.

Figure 7. Intensity differences between the projected image and the camera image. (a) Errors after iterative refinements. (b) Errors after optimization.

Once we have obtained the camera pose through iterative refinements, we can further refine it by minimizing the differences between the LiDAR-projected image and the camera image. The differences are represented by an error function, which is composed of two parts: line differences and intensity differences. Line differences were discussed above. The intensity error function is defined as follows,

E_{intensity} = \frac{1}{|\{i\}|} \sum_i \big(s \cdot I_{3D}(i) - I_{2D}(i)\big)^2, \quad (5)

where I_{3D}(i) and I_{2D}(i) are the intensity values of the i-th pixel in the projected image and the camera image, respectively, and |{i}| is the number of projected pixels. s is a scale factor that compensates for reflectance or lighting differences. s takes the value that minimizes the intensity errors, so the above error function is equivalent to

E_{intensity} = \sum_i I_{2D}(i)^2 - \frac{\big(\sum_i I_{3D}(i)\, I_{2D}(i)\big)^2}{\sum_i I_{3D}(i)^2}. \quad (6)
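As a quick check of the step from Eq. (5) to Eq. (6) (this short derivation is an addition, not part of the original text), the optimal scale follows from setting the derivative with respect to s to zero:

\frac{\partial E_{intensity}}{\partial s} = \frac{2}{|\{i\}|} \sum_i \big(s\, I_{3D}(i) - I_{2D}(i)\big)\, I_{3D}(i) = 0 \;\Longrightarrow\; s^{*} = \frac{\sum_i I_{3D}(i)\, I_{2D}(i)}{\sum_i I_{3D}(i)^2}.

Substituting s* back into Eq. (5) eliminates the scale factor and, up to the 1/|{i}| normalization (which does not affect the minimization), gives the s-free form in Eq. (6).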


The overall error function is a weighted combination of the two error functions,

E_{pose} = \alpha E_{line}\big|_{pose} + (1 - \alpha) E_{intensity}\big|_{pose}, \quad (7)

where the pose is determined by the rotation R and translation T, or equivalently by the 3D positions of the keypoints P. We set α = 0.5 in our experiments. Since the intensity errors usually have larger magnitudes, this gives the intensity term a larger effect on the overall error function.

The relative pose is refined via minimization of the above error function:

(R, T, P) = \arg\min_{R,T,P} E_{pose}\big|_{R,T,P}. \quad (8)

The refinement results are shown in Figures 6 and 7.
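One possible way to carry out this minimization is sketched below, under several assumptions: the pose is parameterized by a rotation vector and a translation, the 3D keypoint positions P are held fixed for brevity, and a derivative-free SciPy optimizer stands in for whatever solver the authors used. The callbacks line_error_fn and intensity_error_fn are hypothetical helpers that project the LiDAR lines and intensities with the candidate pose and evaluate Eqs. (2) and (6).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def refine_pose(r_init, t_init, line_error_fn, intensity_error_fn, alpha=0.5):
    """Refine the camera pose by minimizing Eq. (7),
    E_pose = alpha * E_line + (1 - alpha) * E_intensity,
    over a rotation-vector + translation parameterization (cf. Eq. 8)."""
    x0 = np.concatenate([Rotation.from_matrix(r_init).as_rotvec(), t_init])

    def objective(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        return alpha * line_error_fn(R, t) + (1.0 - alpha) * intensity_error_fn(R, t)

    res = minimize(objective, x0, method="Nelder-Mead")  # derivative-free search
    return Rotation.from_rotvec(res.x[:3]).as_matrix(), res.x[3:]
```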

5. Experimental Results

We have tested more than 10 sets of scenes, with camera images taken from different viewpoints. Figure 8 shows an example of pose recovery through iterations and optimization. After a series of refinements through iterative matchings and optimization, we obtain an accurate view for a given camera image. Figure 9 shows an image of the same scene taken from another viewpoint. By blending the two images together, it can easily be observed that the virtual image is well aligned with the real image.

We have also measured the projection errors for each refinement process. The results are shown in Figures 10 and 11. As shown in Figure 10, the errors stay constant after the 3rd refinement. This is because we have usually obtained sufficient correspondences after the 3rd iteration to get a stable camera pose; the remaining errors are caused by errors in calculating the 3D positions of the keypoints. This can be further improved by adjusting the pose to obtain even smaller projection errors, as shown in Figure 11. However, due to moving passengers, occlusions, lighting conditions, etc., there are always some errors between a projected image and a camera image.

Figure 10. The errors after each iterative refinement (errors vs. iteration number).

Figure 11. The errors before and after optimization (errors vs. image number; after iterative refinements vs. after optimization).

6. Conclusion

We have proposed a framework for camera pose estimation with respect to 3D terrestrial LiDAR data that contains intensity information. We first project the LiDAR data onto several pre-selected viewpoints and compute SIFT features. These features are reprojected back onto the LiDAR data to obtain their positions in 3D space, and the resulting 3D features are used to compute the initial camera pose. In the next stage, we iteratively refine the camera pose by generating more correspondences. After that, we further refine the pose by minimizing the proposed objective function, which is composed of two components: errors from intensity differences and errors from geometric structure displacements between the projected LiDAR image and the camera image. We have tested the proposed framework in different urban settings. The results show that the estimated camera pose is stable and that the framework can be applied in many applications such as augmented reality.

Figure 8. (a) The camera image. (b) The initial view calculated from a limited number of matches. (c) The refined view obtained by generating more correspondences. (d) No further refinement (from more matches) occurs after 2 or 3 iterations. (e) The pose refined by minimizing the proposed error function. (f) The virtual building is well aligned with the real image for the calculated view.

Figure 9. The estimated camera pose with respect to the same scene as in Figure 8 but from a different viewpoint. The right figure shows the mixed reality of both the virtual world and the real world.

References

[1] D. G. Aguilera, P. R. Gonzalvez, and J. G. Lahoz. An automatic procedure for co-registration of terrestrial laser scanners and digital cameras. ISPRS Journal of Photogrammetry and Remote Sensing, 64(3):308-316, 2009.

[2] N. Ansari and E. J. Delp. On detecting dominant points. Pattern Recognition, 24(5):441-451, 1991.

[3] S. Becker and N. Haala. Combined feature extraction for facade reconstruction. In ISPRS Workshop on Laser Scanning, 2007.

[4] S. Christy and R. Horaud. Iterative pose computation from line correspondences, 73(1):137-144, 1999.

[5] M. Ding, K. Lyngbaek, and A. Zakhor. Automatic registration of aerial imagery with untextured 3d lidar models. In Computer Vision and Pattern Recognition (CVPR), 2008.

[6] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, 24(6):381-395, 1981.

[7] C. Frueh, R. Sammon, and A. Zakhor. Automated texture mapping of 3d city models with oblique aerial imagery. In Symposium on 3D Data Processing, Visualization and Transmission, pages 396-403, 2004.

[8] W. Guan, L. Wang, M. Jonathan, S. You, and U. Neumann. Robust pose estimation in untextured environments for augmented reality applications. In ISMAR, 2009.

[9] W. Guan, S. You, and U. Neumann. Recognition-driven 3d navigation in large-scale virtual environments. In IEEE Virtual Reality, 2011.

[10] W. Guan, S. You, and U. Neumann. Efficient matchings and mobile augmented reality. In ACM TOMCCAP, 2012.

[11] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision.

[12] R. Horaud, F. Dornaika, B. Lamiroy, and S. Christy. Object pose: The link between weak perspective, paraperspective and full perspective, 22(2), 1997.

[13] L. Liu and I. Stamos. Automatic 3d to 2d registration for the photorealistic rendering of urban scenes. In Computer Vision and Pattern Recognition, pages 137-143, 2005.

[14] L. Liu and I. Stamos. A systematic approach for 2d-image to 3d-range registration in urban environments. In International Conference on Computer Vision, pages 1-8, 2007.

[15] D. Lowe. Object recognition from local scale invariant features. In International Conference on Computer Vision, 1999.

[16] A. Mastin, J. Kepner, and J. Fisher. Automatic registration of lidar and optical images of urban scenes. In Computer Vision and Pattern Recognition (CVPR), pages 2639-2646, 2009.

[17] D. Oberkampf, D. DeMenthon, and L. Davis. Iterative pose estimation using coplanar feature points. In CVGIP, 1996.

[18] H. Najafi, Y. Genc, and N. Navab. Fusion of 3D and appearance models for fast object detection and pose estimation. In ACCV, pages 415-426, 2006.

[19] L. Quan and Z. Lan. Linear n-point camera pose determination. In PAMI, 1999.

[20] I. Stamos and P. K. Allen. Geometry and texture recovery of scenes of large scale, 88(2):94-118, 2002.

[21] I. Stamos, L. Liu, C. Chen, G. Wolberg, G. Yu, and S. Zokai. Integrating automated range registration with multiview geometry for the photorealistic modeling of large-scale scenes. pages 237-260, 2008.

[22] A. Vasile, F. R. Waugh, D. Greisokh, and R. M. Heinrichs. Automatic alignment of color imagery onto 3d laser radar data. In Applied Imagery and Pattern Recognition Workshop, 2006.

[23] L. Wang and U. Neumann. A robust approach for automatic registration of aerial images with untextured aerial lidar data. In Computer Vision and Pattern Recognition (CVPR), pages 2623-2630, 2009.

[24] R. Wang, F. Ferrie, and J. Macfarlane. Automatic registration of mobile lidar and spherical panoramas. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 33-40, 2012.

[25] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respect to a 3d model. In 3D Digital Imaging and Modeling, 2007.