automatic appearance-based loop detection from three...

• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •

Automatic Appearance-BasedLoop Detection from

Three-Dimensional Laser DataUsing the Normal

Distributions Transform

Martin Magnusson and Henrik AndreassonCenter for Applied Autonomous Sensor SystemsOrebro University70182 Orebro, Swedene-mail: [email protected],[email protected]

Andreas NuchterSchool of Engineering and ScienceJacobs University Bremen28759 Bremen, Germanye-mail: [email protected]

Achim J. LilienthalCenter for Applied Autonomous Sensor SystemsOrebro University70182 Orebro, Swedene-mail: [email protected]

Received 5 January 2009; accepted 5 August 2009

We propose a new approach to appearance-based loop detection for mobile robots, usingthree-dimensional (3D) laser scans. Loop detection is an important problem in the simul-taneous localization and mapping (SLAM) domain, and, because it can be seen as theproblem of recognizing previously visited places, it is an example of the data associationproblem. Without a flat-floor assumption, two-dimensional laser-based approaches arebound to fail in many cases. Two of the problems with 3D approaches that we address inthis paper are how to handle the greatly increased amount of data and how to efficientlyobtain invariance to 3D rotations. We present a compact representation of 3D point cloudsthat is still discriminative enough to detect loop closures without false positives (i.e.,detecting loop closure where there is none). A low false-positive rate is very important be-cause wrong data association could have disastrous consequences in a SLAM algorithm.Our approach uses only the appearance of 3D point clouds to detect loops and requires nopose information. We exploit the normal distributions transform surface representationto create feature histograms based on surface orientation and smoothness. The surfaceshape histograms compress the input data by two to three orders of magnitude. Becauseof the high compression rate, the histograms can be matched efficiently to compare theappearance of two scans. Rotation invariance is achieved by aligning scans with respectto dominant surface orientations. We also propose to use expectation maximization to fit

Journal of Field Robotics 26(11–12), 892–914 (2009) C© 2009 Wiley Periodicals, Inc.Published online in Wiley InterScience (www.interscience.wiley.com). • DOI: 10.1002/rob.20314

Magnusson et al.: Automatic Appearance-Based Loop Detection from 3D Laser Data Using NDT • 893

a gamma mixture model to the output similarity measures in order to automatically de-termine the threshold that separates scans at loop closures from nonoverlapping ones. Wediscuss the problem of determining ground truth in the context of loop detection and thedifficulties in comparing the results of the few available methods based on range infor-mation. Furthermore, we present quantitative performance evaluations using three real-world data sets, one of which is highly self-similar, showing that the proposed methodachieves high recall rates (percentage of correctly identified loop closures) at low false-positive rates in environments with different characteristics. C© 2009 Wiley Periodicals, Inc.

1. INTRODUCTION

For autonomously navigating mobile robots, it is es-sential to be able to detect when a loop has beenclosed by recognizing a previously visited place. Oneexample application is when performing simulta-neous localization and mapping (SLAM). A com-mon way to perform SLAM is to let a robot movearound in the environment, sensing its surroundingsas it goes. Typically, discrete two-dimensional (2D) orthree-dimensional (3D) laser scans are registered us-ing a local scan registration algorithm in order to cor-rect the robot’s odometry and improve the estimateof the robot’s pose (that is, its position and orienta-tion) at each point in time. The scans can be stitchedtogether at their estimated poses in order to builda map. However, even though good scan registra-tion algorithms exist [for example, 3D-normal distri-butions transform (NDT) (Magnusson, Lilienthal, &Duckett, 2007)], pose errors will inevitably accumu-late over longer distances, and after covering a longtrajectory the robot’s pose estimate may be far fromthe true pose. When a loop has been closed and therobot is aware that it has returned to a previouslyvisited place, existing algorithms can be used to dis-tribute the accumulated pose error of the pairwiseregistered scans in order to render a consistent map.Some examples are the tree-based relaxation meth-ods of Frese, Larsson, and Duckett (2005) and Freseand Schroder (2006) and the 3D relaxation methodsof Grisetti, Grzonka, Stachniss, Pfaff, and Burgard(2007) and Borrmann, Elseberg, Lingemann, Nuchter,and Hertzberg (2008). However, detecting loop closurewhen faced with large pose errors remains a difficultproblem. As the uncertainty of the estimated pose ofthe robot grows, an independent means of detectingloop closure becomes increasingly important. Giventwo 3D scans, the question we are asking is, “HaveI seen this before?” A good loop detection algorithmaims at maximizing the recall rate; that is, the per-centage of true positives (scans acquired at the same

place that are recognized as such), with a minimum offalse positives (scans that are erroneously consideredto be acquired at the same place). False positives aremuch more costly than false negatives in the contextof SLAM. A single false positive can render the mapunusable if no further measures are taken to recoverfrom false scan correspondences. On the other hand,a relatively low number of true positives is often ac-ceptable, given that several scans are acquired fromeach overlapping section. As long as a few of thesescans are detected, the loop can be closed.

We present a loop detection approach basedon the appearance of scans. Appearance-based ap-proaches often use camera images (Booij, Terwijn,Zivkovic, & Krose, 2007; Cummins & Newman, 2007,2008a, 2008b, 2009; Konolige et al., 2009; Valgren &Lilienthal, 2007). In this work, however, we consideronly data from a 3D laser range scanner. Using theproposed approach, loop detection is achieved bycomparing histograms computed from surface shape.The surface shape histograms can be used to recog-nize scans from the same location without pose in-formation, thereby helping to solve the problem ofglobal localization. Scans at loop closure are sepa-rated from nonoverlapping scans based on a “differ-ence threshold” in appearance space. Pose estimatesfrom odometry or scan registration are not required.(However, if such information is available, it could beused to further increase the performance of the loopdetection by restricting the search space.) Thoughwe have chosen to term the problem “loop detec-tion,” the proposed method solves the same prob-lem that Cummins and Newman (2009) refer to as“appearance-only SLAM.”

Existing 2D loop detection algorithms could po-tentially be used for the same purpose, after extract-ing a single scan plane from the available 3D scans.However, in many areas it may be advantageous touse all of the available information. One example isfor vehicles driving over rough surfaces. Dependingon the local slope of the surface, 2D scans from nearby

Journal of Field Robotics DOI 10.1002/rob

894 • Journal of Field Robotics—2009

positions may look quite different. Therefore the ap-pearance of 2D scans cannot be used to detect loopclosure in such cases. In fact, this is a common prob-lem for current 2D-scanning semi-autonomous min-ing vehicles.1 Places where there are nearly horizon-tal surfaces close to the level of the 2D laser scannerare especially problematic. Another example is placeswhere there are deep wheel tracks, meaning that asmall lateral offset can result in a large difference inthe vehicle’s roll angle. Using 3D data instead of 2Dis of course also important for airborne robots whoseorientation is not restricted to a mainly planar align-ment.

The work described in this paper builds on pre-viously published results (Magnusson, Andreasson,Nuchter, & Lilienthal, 2009), with the main additionsbeing a sound method for estimating the differencethreshold parameter, a more complete performanceevaluation, and more experimental data, as well as adiscussion of the difficulties of determining groundtruth for loop detection.

The paper is laid out as follows. In Section 2,we describe the details of our loop detection ap-proach. The performance is evaluated in Section 3,and a method for automatically selecting the differ-ence threshold is described and evaluated in Sec-tion 3.4. Our approach is compared to related workon loop detection in Section 4. Finally, the paper issummarized in Section 5, which also states our con-clusions, and directions for future work are suggestedin Section 6.

2. SURFACE SHAPE HISTOGRAMS

Our loop detection method is inspired by NDT. NDTis a method for representing a scan surface as apiecewise continuous and twice-differentiable func-tion. It has previously been used for efficient pair-wise 2D and 3D scan registration (Biber & Straßer,2003; Magnusson et al., 2007; Magnusson, Nuchter,Lorken, Lilienthal, & Hertzberg, 2009). However, theNDT surface representation can also be used to de-scribe the appearance of a 3D scan, as will be ex-plained in this section.

2.1. The NDT

The NDT gives a compact surface shape representa-tion, and it therefore lends itself to describing the gen-

1This information is from personal communication with JohanLarsson, Atlas Copco Rock Drills.

eral appearance of a location. The method is outlinedin the following. For more details, refer to previouspublications (Biber & Straßer, 2003; Magnusson et al.,2007).

Given a range scan represented as a point cloud,the space occupied by the scan is subdivided into aregular grid of cells (squares in the 2D case, cubesin the 3D case). Each cell ci stores the mean vectorμi and covariance matrix �i of the positions of thescan points within the cell; in other words, the pa-rameters of a normally distributed probability den-sity function (PDF) describing the local surface shape.Depending on the covariance, the PDF can take ona linear (stretched ellipsoid), planar (squashed ellip-soid), or spherical shape. Our appearance descriptoris created from histograms of these local shape de-scriptions.

To minimize the issues with spatial discretizationwe use overlapping cells, so that if the side length ofeach cell is q, the distance between each cell’s centerpoint is q/2. (The parameter choices will be coveredin Section 2.5.)

2.2. Appearance Descriptor

It is possible to use the shapes of the PDF of NDTcells to describe the appearance of a 3D scan, classi-fying the PDFs based on their orientation and shape.For each cell, the eigenvalues λ1 ≤ λ2 ≤ λ3 and cor-responding eigenvectors e1, e2, e3 of the covariancematrix are computed. We use three main cell classes:spherical, planar, and linear. Distributions are as-signed to a class based on the relations between theireigenvalues with respect to a threshold te ∈ [0, 1] thatquantizes a “much smaller” relation:

• Distributions are linear if λ2/λ3 ≤ te.• Distributions are planar if they are nonlinear

and λ1/λ2 ≤ te.• Distributions are spherical if they are

nonlinear and nonplanar (in other words,if no eigenvalue is 1/te times larger thananother one).

The planar cell classes can be divided intosubclasses, based on surface orientation. As statedin our earlier work (Magnusson, Andreasson, et al.,2009), the same can potentially be done for the lin-ear classes, and spherical subclasses could be definedby surface roughness. However, in our experiments,using one spherical and one linear class has been



Figure 1. Visualization of the planar part p of a histogram vector created from the scan on the right. In this case, p = 9planar directions are used. The thin black lines correspond to the directions P1, . . . , P9. The cones are scaled according tothe values of the corresponding histogram bins. There are two cones for each direction in this illustration: one on eachside of the origin. The dominant directions used to normalize the scan’s orientation are shaded. (The following text willbe explained further in Section 2.3.) Directions that are not in Z or Y are white. Z (dark gray) contains one direction inthis case: the vertical direction, corresponding to the ground plane. Y (light gray) includes two potential secondary peaks(whose magnitudes are more similar to the rest of the binned directions). In this example ta was set to 0.6.

sufficient. It would also be straightforward to usemore classes such as different levels of “almost pla-nar” distributions by using more than one eigenvalueratio threshold. However, for the data presented here,using more than one threshold te did not improve theresult.

For the planar distributions, the eigenvector e1(which corresponds to the smallest eigenvalue) coin-cides with the normal vector of the plane that is ap-proximated by the PDF. We define planar subclassesas follows. Assuming that there is a set P of p ap-proximately evenly distributed lines passing throughthe origin, P = {P1, . . . , Pp}, the index for planarsubclasses is

i = arg minj

δ(e1, Pj ), (1)

where δ(e, P ) is the distance between a point e and aline P . In other words, we choose the index of theline Pj that is closest to e1. The problem of evenlydistributing a number of lines intersecting the ori-gin is analogous to distributing points evenly onthe surface of a sphere. This is an ill-posed prob-lem because it is not possible to find a solution inwhich the distances between all neighboring points

are equal. However, a number of solutions giving ap-proximately even point distributions exist. For exam-ple, using an equal area partitioning (Saff & Kuijlaars,1997) to distribute p points on a half-sphere, P isthe set of lines connecting the origin and one of thepoints. The distribution of lines used in this work isvisualized in Figure 1.

Using p planar subclasses, the basic element ofthe proposed appearance descriptor is the featurevector

f =

⎡⎢⎣f1, . . . , fp︸︷︷︸

planar classes

, fp+1︸︷︷︸spherical

, fp+2︸︷︷︸linear

,

⎤⎥⎦

T

=⎡⎣ p

fp+1fp+2

⎤⎦ , (2)

where fi is the number of NDT cells that belong toclass i.

In addition to surface shape and orientation, thedistance from the scanner location to a particular sur-face is also important information. For this reason,each scan is described by a matrix

F = [f 1 · · · f r

](3)



and a corresponding set of range intervals R ={r1, . . . , rr}. The matrix is a collection of surface shapehistograms, where each column f k is the histogramof all NDT cells within range interval rk (measuredfrom the laser scanner position).

2.3. Rotation Invariance

Because the appearance descriptor (3) explicitly usesthe orientation of surfaces, it is not rotation invariant.For the appearance descriptor to be invariant to rota-tion, the orientation of the scan must first be normal-ized.

Starting from an initial histogram vector f ′, witha single range interval R = {[0,∞)}, we want to findtwo peaks in plane orientations and orient the scanso that the most common plane normal (the primarypeak) is aligned along the z axis and the second mostcommon (the secondary peak) is in the yz plane. Thereason for using plane orientations instead of lineorientations is that planar cells are much more com-mon than linear ones. For an environment with morelinear structures than planar ones, line orientationscould be used instead, although such environmentsare unlikely to be encountered.

Because there is not always an unambiguousmaximum, we use two sets of directions, Z and Y .Given the planar part p′ = [p′

1, . . . , p′p]T of f ′ and an

ambiguity threshold ta ∈ [0, 1] that determines whichhistogram peaks are “similar enough,” the dominantdirections are selected as follows. (This selection isalso illustrated in Figure 1.) First we pick the his-togram bin with the maximum value

i ′ = argmaxi

p′i . (4)

The potential primary peaks are i ′ and any directionsthat are “almost” as common as i ′ with respect to ta :

Z = {i ∈ {1, . . . , p} | p′i ≥ tap

′i ′ }. (5)

The same procedure is repeated to find the secondmost common direction, choosing as a secondarypeak the largest histogram bin that is not already in-cluded in the primary peak set:

i ′′ = argmaxi

p′i | i /∈ Z. (6)

The potential secondary peaks are i ′′ and any direc-tions that are almost as common as i ′′ (but not already

included in the primary peak set):

Y = {i ∈ {1, . . . , p} | i /∈ Z, p′i ≥ tap

′i ′′ }. (7)

This procedure gives two disjoint subsets Z ⊂ P andY ⊂ P .

Now we want to align the primary (most com-mon) peak along the positive z axis. In the follow-ing, we will use axis/angle notation for rotations:R = (v, φ) denotes a rotation with angle φ around theaxis v. To perform the desired alignment, the rotation

Rz =⎛⎝P i ×

⎡⎣0

01

⎤⎦ , −arccos

⎛⎝P i ·

⎡⎣0

01

⎤⎦

⎞⎠

⎞⎠ , (8)

where P i is a unit vector along the line Pi , rotates thescan so that P i is aligned along the positive z axis.The rotation axis P i × [0, 0, 1] is perpendicular to thez axis. A separate rotation is created for each potentialprimary peak i ∈ Z .

Similarly, for each secondary peak i ∈ Y , it ispossible to create a rotation Ry that rotates the scanaround the z axis so that P i lies in the yz plane. To de-termine the angle of Ry , we use the normalized pro-jection of Rz P i onto the xy plane: P ′

i . The angle of Ry

is the angle between the projected vector P ′i and the

yz plane:

Ry =⎛⎝

⎡⎣0

01

⎤⎦ , −arccos

⎛⎝P ′

i ·⎡⎣0

10

⎤⎦

⎞⎠

⎞⎠ . (9)

Given a scan S, the appearance descriptor F iscreated from the rotated scan RyRzS. This alignmentis always possible to do, unless all planes have thesame orientation. If it is not possible to find two maindirections, it is sufficient to use only Rz because inthat case no subsequent rotation around the z axischanges which histogram bins are updated for anyplanar PDF. If linear subclasses of different orienta-tions are used, it is possible to derive Ry from lineardirections if only one planar direction can be found.

In the case of ambiguous peaks (that is, whenZ or Y has more than one member), we generatemultiple histograms. For each combination {i, j | i ∈Z, j ∈ Z ∪ Y, i = j} we apply the rotation RyRz tothe original scan and generate a histogram. The out-come is a set of histograms

F = {F1, . . . , F|Z|(|Z∪Y|−1)}. (10)

The set F is the appearance descriptor of the scan.



For highly symmetrical scans, the approach pre-sented in this section could lead to a very large num-ber of histograms. For example, in the case of a scangenerated at the center of a sphere, where the his-togram bins for all directions have the same value,p2 − p histograms would be created (although a post-processing step to prune all equivalent histogramscould reduce this to just one). In practice, this kindof symmetry effect has not been found to be a prob-lem. The average number of histograms per scan isaround three for the data sets used in this work.

2.4. Difference Measure

To quantify the difference between two surface shapehistograms F and G, we normalize F and G withtheir entrywise 1-norms (which corresponds to thenumber of occupied NDT cells in each scan), com-pute the sum of Euclidean distances between eachof their columns (each column corresponds to onerange interval), and weight the sum by the ratiomax(‖F‖1 , ‖G‖1)/ min(‖F‖1 , ‖G‖1):

σ (F, G) =r∑

i=1

(∥∥∥∥ f i

‖F‖1− gi

‖G‖1

∥∥∥∥2

)max(‖F‖1 , ‖G‖1)min(‖F‖1 , ‖G‖1)

.

(11)

The normalization makes it possible to use a sin-gle threshold for data sets that contain both scans thatcover a large area (with many occupied NDT cells)and scans of more confined spaces (with fewer cells).If the Euclidean distance without normalization wereused instead,

σ (F, G) =r∑

i=1

∥∥ f i − gi

∥∥2 , (12)

scans with many cells would tend to have largerdifference values than scans with few cells. A con-sequence is that in environments with some nar-row passages and some open areas, the open spaceswould be harder to recognize. It would not be possi-ble to use a global, fixed threshold, because the bestthreshold for the wide areas would tend to cause falsepositives in the narrow areas.

The scaling factor max(‖F‖1, ‖G‖1)/ min(‖F‖1 ,

‖G‖1) is used to differentiate large scans (with manycells) from small ones (with few cells).

Given two scans S1 and S2 with histogram sets Fand G, all members of the scans’ sets of histograms

are compared to each other using Eq. (11), and theminimum σ is used as the difference measure for thescan pair:

τ (S1,S2) = mini,j

σ (Fi , Gj ), Fi ∈ F , Gj ∈ G. (13)

If τ (S1,S2) is less than a certain difference thresholdvalue td , the scans S1 and S2 are assumed to be fromthe same location. For evaluation purposes the twoscans S1 and S2 are classified as positive.

2.5. Parameters

Summarizing the preceding text, these are the param-eters of the proposed appearance descriptor alongwith the parameter values selected for the experi-ments:

• NDT cell size q = 0.5 m• range limits R = {[0, 3), [3, 6), [6, 9), [9, 15),

[15,∞)} m• planar class count p = 9• eigenvalue ratio threshold te = 0.10• ambiguity ratio threshold ta = 0.60

The values of these parameters were chosen empir-ically. Some parameters depend on the scale of theenvironment, but a single parameter set worked wellfor all investigated data sets.

The best cell size q and the range limits R de-pend mainly on the scanner configuration. If the cellsize is too small, the PDFs are dominated by scan-ner noise. For example, planes at the farther partsof scans (where scan points are sparse) may showup in the histogram as lines with varying orienta-tions. If the cell size is too large, details are lostbecause the PDFs do not accurately represent thesurfaces. Previous work (Magnusson et al., 2007)has shown that cell sizes between 0.5 and 2 mwork well for registering scans of the scale encoun-tered by a mobile robot equipped with a rotatingSICK LMS 200 laser scanner when using NDT forscan registration. We have used similar experimen-tal platforms for the data examined in this work.For the present experiments, q = 0.5 m and R ={[0, 3), [3, 6), [6, 9), [9, 15), [15,∞)} were used. Usingfewer range intervals decreased the loop detection ac-curacy. If using a scanner with a different max range,R and q should probably be adjusted. The same pa-rameter settings worked well for all the data sets



used here even though the point cloud resolutionvaries, with almost an order of magnitude amongthem.

Using nine planar classes (in addition to onespherical class and one linear class) worked well forall of the data sets. The reason for using only onespherical and linear class is that these classes tendto be less stable than planar ones. Linear distribu-tions with unpredictable directions tend to occur atthe far ends of a scan, where the point density is toosmall. Spherical distributions often occur at cornersand edges, depending on where the boundaries of theNDT cells end up, and may shift from scan to scan.However, using the planar features only decreasedthe obtainable recall rate without false positives witharound one-third for the data sets used in our evalua-tions. The small number of classes means that the sur-face shape histograms provide a very compact rep-resentation of the input data. We use only 55 values(11 shape classes and 5 range intervals) for each his-togram. To achieve rotation invariance we use multi-ple histograms per scan, as described in Section 2.3,but with an average of three histograms per scan, theappearance of a point cloud with several tens of thou-sands of points can still be represented using only 165values.

The eigenvalue ratio threshold te and ambiguityratio threshold ta were also chosen empirically. Bothof these thresholds must be on the interval [0, 1]. Inthe experiments, using te = 0.10 and ta = 0.60 pro-duced good results independent of the data.

In addition to the parameters of the appearancedescriptor, it is necessary to select a difference thresh-old td that determines which scans are similar enoughto be assumed to have been taken at the same lo-cation. The difference threshold td was chosen sepa-rately for each data set, as described in Section 3.3.A method for automatically selecting a differencethreshold is presented in Section 3.4.

3. EXPERIMENTS

3.1. Data Sets

To evaluate the performance of the proposed loopdetection algorithm, three data sets were used: oneoutdoor set from a campus area, one from an indooroffice environment, and one from an undergroundmine. All of the data sets are available online fromthe Osnabruck Robotic 3D Scan Repository (http://kos.informatik.uni-osnabrueck.de/3Dscans/).

The Hannover2 data set, shown in Figure 2, wasrecorded by Oliver Wulf at the campus of LeibnizUniversitat Hannover. It contains 922 3D omniscans(with 360-deg field of view) and covers a trajectory ofabout 1.24 km. Each 3D scan is a point cloud contain-ing approximately 15,000 scan points. Ground truthpose measurements were acquired by registering ev-ery 3D scan against a point cloud made from a given2D map and an aerial LIDAR scan made while fly-ing over the campus area, as described in the SLAMbenchmarking paper by Wulf, Nuchter, Hertzberg,and Wagner (2007).

The AASS-loop data set was recorded around therobot lab and coffee room of the Center for AppliedAutonomous Sensor Systems (AASS) research insti-tute at Orebro University. An overhead view of thisdata set is shown in Figure 3. The total trajectory trav-eled is 111 m. This set is much smaller than the Han-nover2 one. It contains 60 omnidirectional scans witharound 112,000 points per scan. For this data set, pair-wise scan registration using 3D-NDT (given the ini-tial pose estimates from the robot’s odometry) wasexact enough to be used for the ground truth poses.(The accumulated pose error between scan 1 and scan60 was 0.67 m and 1.3 deg after registration.) How-ever, using only the laser scans without odometryinformation, it is not possible to detect loop closurewith NDT scan registration.

A third data set, Kvarntorp, was recorded in theKvarntorp mine outside Orebro, Sweden. The origi-nal data set is divided into four “missions.” For theexperiments presented here, we used “mission 4” fol-lowed by “mission 1.” The reason for choosing thesetwo missions is that they overlap each other and thatthe starting point of mission 1 is close to the end pointof mission 4. This combined sequence has 131 scans,each covering a 180-deg horizontal field of view andcontaining around 70,000 data points. The total tra-jectory is approximately 370 m. See Figure 4 for anoverview of this data set. The Kvarntorp data setis rather challenging for a number of reasons. First,the mine environment is highly self-similar. Withoutknowledge of the robot’s trajectory, it is very diffi-cult to tell different tunnels apart, both from 3D scansand from camera images, as illustrated in Figure 5.The fact that the scans of this data set are not omni-directional also makes loop detection more difficult,because the same location can look very different de-pending on which direction the scanner is pointingtoward. Another challenge is that the distance trav-eled between the scans is longer for this data set. For



Figure 2. The Hannover2 data set, seen from above with parallel projection.

Figure 3. The AASS-loop data set, shown from above with the ceiling removed. The inlay in the right-hand corner showsthe accumulated pose error from pairwise scan registration using 3D-NDT when returning to location B.



Figure 4. The Kvarntorp data set, seen from above withthe ceiling removed.

this reason, scans taken when revisiting a locationtend to be recorded farther apart, making the scanslook more different.

Scan registration alone was not enough to builda consistent 3D map of the Kvarntorp data set, andan aerial reference scan was not available for obviousreasons. Instead, ground truth poses were providedusing a network-based global relaxation method for3D laser scans (Borrmann et al., 2008). A networkwith loop closures was manually created and givenas input to the algorithm in order to generate a refer-ence map. The result was visually inspected for cor-rectness. The relaxation method of Borrman et al. willbe briefly described here. Given a network with n +1 nodes x0, . . . , xn representing the poses v0, . . . , vn

and the directed edges di,j , the algorithm aims at es-timating all poses optimally. The directed edge di,j

represents the change of the pose (x, y, z, θx, θy, θz)that is necessary to transform one pose vi into vj ;that is, vi = vj ⊕ di,j , thus transforming two nodes ofthe graph. For simplicity, the approximation that themeasurement equation is linear is made; that is,

di,j = xi − xj . (14)

A detailed derivation of the linearization is given inthe paper by Borrman et al. (2008). An error function

is formed such that minimization results in improvedpose estimations:

W =∑(i,j )

(di,j − di,j )TC−1i,j (di,j − d i,j ), (15)

where di,j = di,j + �di,j models random Gaussiannoise added to the unknown exact pose di,j . This rep-resentation involves resolving the nonlinearities re-sulting from the additional roll and pitch angles byTaylor expansion. The covariance matrices Ci,j de-scribing the pose relations in the network are com-puted based on the paired closest points. The errorfunction (15) has a quadratic form and is thereforesolved in closed form by sparse Cholesky decompo-sition.

3.2. Experimental Method

We have used two methods to judge the discrimina-tion ability of our surface shape histograms.

3.2.1. Full Evaluation

First, we look at all combinations (Si ,Sj | i = j ) ofscan pairs from each data set, counting the number oftrue positives and false positives with regard to theground truth. In related work (Bosse & Zlot, 2008b;Granstrom, Callmer, Ramos, & Nieto, 2009), the per-formance is reported as the recall rate with a man-ually chosen threshold that gives a 1% false-positiverate. For these tests, we have taken the same approachto evaluate the result.

However, it is not trivial to determine the groundtruth: what should be considered a true or a false pos-itive. In this work we have chosen to use the matrixof the distances between all scan pairs as the groundtruth, after applying a distance threshold tr (in metricspace), so that all pairs of scans that are within, forexample, 3 m are considered to be truly overlapping.It is not always easy to select a distance thresholdvalue that captures the relationships between scansin a satisfactory manner. If the threshold tr is large,some scans with very different appearances (for ex-ample, scans taken at different sides of the corner of abuilding or before and after passing through a door)might still be considered to overlap and will there-fore be regarded as false negatives when their ap-pearances do not match. Another problem is that se-quential scans are often acquired in close proximity



(a) Location F (b) Location H

Figure 5. An example of perceptual aliasing in the Kvarntorp data set. The images show two different places (locations Fand H in Figure 4). It is difficult to tell the two places apart, from both the camera images and the scanned point clouds.The point clouds are viewed from above.

to one another. Therefore, when revisiting a location,there will be several overlapping scans within the dis-tance threshold, according to the “ground truth.” Butwith a discriminative difference threshold td (in ap-pearance space), only one or a few of them may bedetected as positives. Even when the closest scan pairis correctly matched, the rest would then be regardedas false negatives, which may not be the desired re-sult. If, on the other hand, the distance threshold tris too small, the ground truth matrix will miss someloop closures where the robot is not revisiting the ex-act same position.

Another possibility would be to manually labelall scan pairs. However, when evaluating multipledata sets containing several hundreds of scans, it isnot practical to do so; and even then, some arbitrarydecision would have to be made as to whether somescan pairs overlap.

We will discuss how our experimental methodcompares to the evaluations of other authors in Sec-tion 4. The validity of our design decisions and theresults may be judged by inspecting the trajectoriesand ground truth matrices in Figures 6–11.

3.2.2. SLAM Scenario

As a second type of evaluation, we also consider howthe method would fare in a SLAM application. In thiscase, for each scan S we consider only the most sim-ilar corresponding scan S instead of all other scans.The ground truth in this case is a manual labelingof scans as either “overlapping” (meaning that theywere acquired at a place that was visited more thanonce and therefore should be similar to at least oneother scan) or “nonoverlapping” (which is to say thatthey were seen only once). Because the ground truth



Figure 6. SLAM result for the Hannover2 data set. The robot traveled along the sequence A-B-C-D-A-B-E-F-A-D-G-H-I-J-H-K-F-E-L-I-K-A. Note that there are no false positives and that all true positives are matched to nearby scans.

Figure 7. Comparing the ground truth matrix and the output similarity matrix for Hannover2. Scan numbers are on theleft-hand and bottom axes; place labels are on the top and right-hand axes. (Because of the large matrix and the small printsize, the right-hand image has been morphologically dilated by a 3 × 3 element in order to better show the values.)



False negativesTrue negativesTrue positives

Figure 8. Result for the AASS-loop data set. The robot moved along the path A-B-C-D-E-F-G-C-B-A.

labeling is done to the set of individual scans insteadof all combinations of scan pairs, it is feasible to per-form manually.

This second type of evaluation is more simi-lar to how the FAB-MAP method of Cummins andNewman (2007, 2008a, 2008b, 2009) has been evalu-ated. If S has been labeled as overlapping, the most

similar scan S is within 10 m of S and the differ-ence measure of the two scans is below the thresh-old [τ (S, S) < td ], then S is considered a true posi-tive. The 10-m distance threshold is the same as thatused by Valgren and Lilienthal (2007) for establish-ing successful localization. Cummins and Newman(2009) use a 40-m threshold, but that was deemed too

0

10

20

30

40

50

60

0 10 20 30 40 50 60

A

B

C

D

E

F

G

C

B

A

A B C D E F G C B A

(a) Ground truth distance matrix, showing all scan pairs taken

less than1 m apart

0

10

20

30

40

50

60

0 10 20 30 40 50 60

A

B

C

D

E

F

G

C

B

A

A B C D E F G C B A

(b) Thresholded similarity matrix, showing all scan pairs with a

difference value τ < 0.099

Figure 9. Comparing the ground truth matrix and output similarity matrix for AASS-loop.



False negativesTrue negativesTrue positives

Figure 10. Result Kvarntorp. The robot traveled along thesequence A-B-C-D-E-A-B-F-G-A-B-C-H-F-H.

Figure 11. Comparing the distance matrix and the output similarity matrix for Kvarntorp.

large for the data sets used here. Most of the detectedscan pairs are comfortably below the 10-m threshold.For the AASS-loop and Kvarntorp data sets, the max-imum interscan distance at detected loop closure is2.6 m. For Hannover2, 98% of the detected scans arewithin 5 m of each other, and 83% are within 3 m.

For these experiments, we report the result asprecision-recall rates. Precision is the ratio of true pos-itive loop detections to the sum of all loop detec-tions. Recall is the ratio of true positives to the num-ber of ground truth loop closures. A nonoverlappingscan cannot contribute to the true positive rate, butit can generate a false positive, thus affecting preci-sion. Likewise true loop closures that are incorrectlyregarded as negative decrease the recall rate but donot impact the precision rate. It is important to real-ize that a 1% false-positive rate is not the same as 99%precision. If the number of nonoverlapping scans ismuch larger than the number of overlapping ones, asis the case for our data sets, falsely detecting 1% ofthe nonoverlapping ones as positive will decrease theprecision rate with much more than 1%.

In a SLAM application, even a single false pos-itive can make the map unusable if no further



measures are taken to recover from false scan corre-spondences. Therefore the best difference thresholdin this case is the largest possible value with 100%precision.

In this second type of evaluation, we also em-ployed a minimum loop size. Even though we do notuse pose estimates from odometry, we assume thatthe scans are presented as an ordered sequence, suc-cessively acquired by the robot as it moves along itstrajectory. When finding the most similar correspon-dence of S, it is compared only to scans that are morethan 30 steps away in the sequence. The motivationfor this limit is that in the context of a SLAM appli-cation it is not interesting to find small “loops” withonly consecutive scans. We want to detect loop clo-sure only when the robot has left a place and returnedto it later. A side effect of the minimum loop size limitis that some similar scans that are from the same areabut more than 10 m apart and therefore otherwisewould decrease the precision are removed. However,in a SLAM scenario it makes sense to add such a limitif it is known that the robot cannot possibly close a“real” loop in only a few steps.

Table I. Summary of loop detection results for all scanpairs.

Set tr (m) ol nol td Recall (%)

Hannover2 3 9,984 839,178 0.1494 80.6AASS-loop 1 32 3,508 0.0990 62.5Kvarntorp 3 138 16,632 0.1125 27.5

This table shows the maximum achievable recall rate with less than1% false positives and the difference threshold (td ) at which this isattained. The distance threshold applied to the ground truth ma-trix is denoted tr , and the ground truth numbers of overlapping(ol) and nonoverlapping (nol) scan pairs after applying tr are alsoshown.

Table II. Summary of SLAM scenario loop detection results.

Manual threshold Automatic threshold

Set ol nol td Recall (%) Prec. (%) P (fp) (%) td Recall (%) Prec. (%) P (fp) (%)

Hannover2 428 494 0.0737 47.0 100 0.08 0.0843 55.6 94.8 0.5AASS-loop 23 37 0.0990 69.6 100 1.17 0.0906 60.9 100 0.5Kvarntorp 35 95 0.0870 28.6 100 0.65 0.0851 22.9 100 0.5

Precision and recall rates are shown both for manually selected td and for thresholds selected using a gamma mixture model, as describedin Section 3.4. The probability of false positives P (fp) according to the mixture model is shown for both thresholds. The numbers of (groundtruth) overlapping and nonoverlapping scans for each set are denoted ol and nol.

In our previous work on loop detection (Magnus-son, Andreasson, et al., 2009) we used this SLAM-type evaluation but obtained the ground truth label-ing of overlapping and nonoverlapping scans usinga distance threshold. The manual labeling employedhere is a better criterion for judging which scans areoverlapping and not. Again, refer to the figures visu-alizing the results (Figures 6, 8, and 10) to judge thevalidity of the evaluations.

3.3. Results

This section details the results of applying our loopdetection method to the data sets described above.The results are summarized in Tables I and II, wherethe recall rates are shown in boldfaced type.

3.3.1. Hannover2

The Hannover2 data set is the one that is most simi-lar to the kind of outdoor semistructured data investi-gated in many other papers on robotic loop detection(Bosse & Zlot, 2008b; Cummins & Newman, 2008b;Granstrom et al., 2009; Valgren & Lilienthal, 2007).

When evaluating the full similarity matrix, themaximum attainable recall rate with at most 1%false positives is 80.6%, using td = 0.1494. Figure 7(a)shows the ground truth distance matrix of theHannover2 scans, and Figure 7(b) shows the simi-larity matrix obtained with the proposed appearancedescriptor and difference measure. Note that the twomatrices are strikingly similar. Most of the overlap-ping (dark) parts in the ground truth matrix are cap-tured correctly in the similarity matrix. The distancethreshold tr was set to 3 m.

For the SLAM-style experiment, the maximumrecall rate at 100% precision is 47.0%, using td =0.0737. The result is visualized in Figure 6, showing



all detected true positives and the scans that they arematched to, as well as true and false negatives.

If no minimum loop size is used in the SLAMevaluation (thus requiring that the robot should beable to relocalize itself from the previous scan at alltimes), the maximum recall rate at 100% precision is24.6% at td = 0.0579. If the same difference thresholdas above is used (td = 0.0737), the recall rate for thiscase is 45.7% and the precision rate is 98.6%, with sixfalse correspondences (0.65% of the 922 scans). Of thesix errors, four scans (two pairs) are from the parkinglot between locations H and J, which is a place withrepetitive geometric structure. The other two are fromtwo corners of the same building: locations A and B.

At this point it should be noted that even a re-call rate of around 30% often is sufficient to close allloops in a SLAM scenario, as long as the detectedloop closures are uniformly distributed over the tra-jectory, because several scans are usually taken fromeach location. Even if one overlapping scan pair isnot detected (because of noisy scans, discretizationartifacts in the surface shape histograms, or dynamicchanges), one of the next few scans is likely to bedetected instead. [This fact has also been noted byCummins and Newman (2008b) and Bosse and Zlot(2008b).]

As a side note, we would also like to mentionthat using scan registration alone to detect loop clo-sure is not sufficient for this data set, as was describedby Wulf et al. (2007). Because they depend on an ac-curate initial pose estimate (which is necessary evenfor reliable and fast scan registration algorithms), itis necessary to use the robot’s current pose estimateand consider only the closest few scans to detect loopclosure. Therefore the method of Wulf et al. (2007),and indeed all methods using local pairwise registra-tion methods such as ICP or 3D-NDT, cannot detectloops when the accumulated pose error is too large.In contrast, the method proposed in this text requiresno pose information.

3.3.2. AASS-Loop

When evaluating the full similarity matrix for theAASS-loop data set, we used a distance threshold trof 1 m instead of 3 m on the ground truth distancematrix. The reason for the tighter distance thresh-old in this case is the many passages and tight cor-ners of this data set. The appearance of scans oftenchanges drastically from one scan to the next whenrounding a corner into another corridor or passing

through a door, and an appearance-based loop detec-tion method cannot be expected to handle such scenechanges. The 1-m threshold filters out all such scanpairs while keeping the truly overlapping scan pairsthat occur after the robot has returned to location C,as can be seen in Figure 9(a).

For this data set, the maximum recall rate (for thecomplete similarity matrix) with less than 1% falsepositives is 62.5% (td = 0.0990). In the SLAM sce-nario, the recall rate for this data set was 69.6% at100% precision, using td = 0.099.

The part of this data set that contains a loopclosure (between locations A and C) is traversed inthe opposite direction when the robot returns. Thehigh recall rate illustrates that the surface shape his-tograms are robust to changes in rotation.

The trajectory of the AASS-loop data set is shownin Figure 8. The ground truth and similarity matricesare shown in Figure 9.

3.3.3. Kvarntorp

The Kvarntorp data set had to be evaluated slightlydifferently, because an omnidirectional scanner wasnot used to record this data set. An appearance-basedloop detection algorithm cannot be rotation invari-ant if the input scans are not omnidirectional. Whenlooking in opposite directions from the same place,the view is generally very different. Therefore onlyscans taken in similar directions (within 20 deg) werecounted as overlapping when evaluating the algo-rithm for Kvarntorp. The scans that were taken atoverlapping positions but with different orientationswere all (correctly) marked as nonoverlapping by thealgorithm. With the exception of the way of deter-mining which scans are from overlapping sections,the same evaluation and algorithm parameters wereused for this data set as for Hannover2.

Evaluating the full similarity matrix, the recallrate at 1% false positives is 27.5% (td = 0.1134). Forthe SLAM experiment, td = 0.0870 gives the highestrecall rate at full precision: 28.6%.

The challenging properties of the undergroundmine environment are shown in the substantiallylower recall rates for this data set compared toHannover2. Still, a reasonable distribution of theoverlapping scans in the central tunnel is detected inthe SLAM scenario (shown in Figure 10), and thereare no false positives. The ground truth distance ma-trix is shown in Figure 11(a), and the similarity matrixis shown in Figure 11(b). Comparing the two figures,



it can be seen that some scans are recognized from alloverlapping segments: For all off-diagonal stripes inFigure 11(a), there is at least one corresponding scanpair below the difference threshold in Figure 11(b).

3.4. Automatic Threshold Selection

It is important to find a good value for the differ-ence threshold td . Using a too-small value results ina small number of true positives (correctly detectedoverlapping scans). Using a too-large value resultsin false positives (scan pairs considered overlappingeven though they are not). Figures 12 and 13 illus-trate the discriminative ability of the surface shapehistograms for the two different modes of evalua-tion, showing how the numbers of true positives anderrors change with increasing values of the differ-ence threshold, as well as the ROC (receiver operat-ing characteristics) curve.

The results reported thus far used manually cho-sen difference thresholds, selected with the help ofthe available ground truth data. To determine td whenground truth data are unavailable, it is desirable to es-timate the distributions of difference values [Eq. (13)]for overlapping scans versus the values for nonover-lapping scans. Given the set of numbers containingall scans’ smallest difference values, it can be as-sumed that the values are drawn from two distribu-tions—one for the overlapping scans and one for thenonoverlapping ones. If it is possible to fit a proba-bilistic mixture model of the two components to the

set of values, a good value for the difference thresh-old should be such that the estimated probability offalse positives P (fp) is small but the estimated prob-ability of true positives is as large as possible. Fig-ure 14 shows a histogram of the difference values forthe scans in the Hannover2 data set. The histogramwas created using the difference value of most simi-lar scan for each scan in the data set. (In other words,this is the outcome of the algorithm in the SLAMscenario.) The figure also shows histograms for theoverlapping and nonoverlapping subsets of the data(which are not known in advance).

A common way to estimate mixture model pa-rameters is to fit a Gaussian mixture model to the datawith the expectation maximization (EM) algorithm.However, inspecting the histograms of difference val-ues (as in Figure 14), it seems that the underlyingdistributions are not normally distributed but havea significant skew, with the right tail being longerthan the left. As a matter of fact, trying to fit a two-component Gaussian mixture model with EM usu-ally results in distribution estimates with too-largemeans. It is sometimes feasible to use three Gaussiancomponents instead, where one component is usedto model the long tail of the skew data (Magnusson,Andreasson, et al., 2009). However, only a binaryclassification is desired, so there is no theoreticalground for such a model.

Gamma distributed components fit the differ-ence value distributions better than Gaussians. Fig-ure 15(a) shows two gamma distributions fitted in

0

20

40

60

80

100

0 0.1 0.2 0.3 0.4 0.5

% o

f scans

Difference threshold

True positivesFalse positives

(a) Difference threshold vs. true and false positive rates for Hannover2

0

20

40

60

80

100

0 20 40 60 80 100

% tru

e p

ositiv

e

% false postive

KvarntorpAASS-loopHannover2

(b) ROC plots for all data sets

Figure 12. Plots of the appearance descriptor’s discriminative ability, evaluating all possible scan pairs.



Figure 13. Plots of the appearance descriptor’s discriminative ability for the SLAM scenario. In (a), the best threshold(giving the maximum number of true positives at 100% precision) is marked with a bar.

0

5

10

15

20

25

30

0 0.05 0.1 0.15 0.2 0.25 0.3

Num

ber

of

scans

Smallest difference value τ

All scansOverlapping

Non-overlapping

Figure 14. Histograms of the smallest difference values foroverlapping and nonoverlapping scans of the Hannover2data set. (In general, only the histogram for all scans isknown.)

isolation to each of the two underlying distributions.Because the goal is to choose td such that the ex-pected number of false positives is small, a reasonablecriterion is that the cumulative distribution functionof the mixture model component that correspondsto nonoverlapping scans should be small. This isequivalent to saying that P (fp) should be small. Fig-

ure 15(b) shows the cumulative distribution functionsof the mixture model components in Figure 15(a).

For the Kvarntorp data set, EM finds a ratherwell-fitting mixture model. With P (fp) = 0.005 thethreshold value is 0.0851, resulting in a 22.9% recallrate with no false positives. This is a slightly conser-vative threshold, but it has 100% precision.

The AASS-loop data set is more challenging forEM. It contains only 60 scans, which makes it diffi-cult to fit a reliable probability distribution model tothe difference values. Instead, the following approachwas used for evaluating the automatic threshold se-lection. Two maximum likelihood gamma distribu-tions were fitted to the overlapping and nonover-lapping scans separately. Using these distributionsand the relative numbers of overlapping and non-overlapping scans of AASS-loop, 600 gamma dis-tributed random numbers were generated, and EMwas applied to find a maximum likelihood model ofthe simulated combined data. The simulated valuesrepresent the expected output of collecting scans ata much denser rate in the same environment. Usingthe resulting mixture model and P (fp) = 0.005 gavetd = 0.091 and a recall rate of 60.9% using the AASS-loop scans.

Because EM is a local optimization algorithm, itcan be sensitive to the initial estimates given. Whenapplied to the output of the Hannover2 data set, ittends to converge to one of two solutions, shown in



(a) Two maximum likelihood gamma distributions fitted to the

overlapping and nonoverlapping scans in isolation

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3

P

Smallest difference value

P(fp) = 0.005CDF (true positives)

CDF (false positives)

(b) Cumulative distribution functions of the mixture model com-

ponents, showing the threshold such that the estimated probabil-

ity of false positives is 0.5%

0

5

10

15

20

25

30

0 0.05 0.1 0.15 0.2 0.25 0.3

Nu

mb

er

of

sca

ns



Non-overlappingPDF (true positives)

PDF (false positives)

Figure 15. Determining td for the Hannover2 data set using a gamma mixture model.

Figure 16. From visual inspection, the solution of Fig-ure 16(b) looks better than that of Figure 16(a), butthe likelihood function of the solution in Figure 16(a)is higher. Solution 16(a) uses a wider than necessarymodel of the nonoverlapping scans, resulting in aconservative threshold value. With P (fp) = 0.005, so-lution 16(a) gives td = 0.0500 and only a 20.8% re-call rate, although at 100% precision. The numbers

for solution 16(b) are td = 0.0843, 55.6% recall, and94.8% precision. Table II includes the results of solu-tion 16(b).

This approach for determining td involves notraining and is a completely unsupervised learningprocess. However, the difference threshold can onlybe estimated offline: not because of the computa-tional burden (which is very modest) but because a

(a) One solution (b) Another solution

0

5

10

15

20

25

30

0 0.05 0.1 0.15 0.2 0.25 0.3

Num

ber

of scans





0

5

10

15

20

25

30

0 0.05 0.1 0.15 0.2 0.25 0.3

Num

ber

of scans





Figure 16. Histograms of the difference values τ (considering each scan’s most similar correspondence) for overlappingand nonoverlapping scans of the Hannover2 data set. The components of a gamma mixture model fitted with EM for twodifferent initial parameter estimates are also shown. The log-likelihood ratio of 16(a)/16(b) is 1.01.



sufficiently large sample of scans must have been en-countered before EM can be used to estimate a reli-able threshold. As long as there are enough samples,the method described in this section gives a usefulestimate for td . However, because it is not possible toguarantee that the output threshold value producesno false positives, a reliable SLAM implementationshould still have some way of handling spurious falsepositives.

3.5. Execution Time

The experiments were run using a C++ implemen-tation on a laptop computer with a 1,600-MHz IntelCeleron CPU and 2 GB of RAM.

For the AASS-loop data set, average times (mea-sured with the gprof profiling utility) for computingthe surface shape histograms were 0.5 s per call to thehistogram computation function and in total 2.2 s perscan to generate histograms (including transformingthe point cloud, generating f ′ and the histogramsthat make up F). The average number of histogramsrequired for rotation invariance (that is, the size of F)is 2.4. In total, 0.14 s was spent computing similaritymeasures for scan pairs. There are 60 scans in the dataset, 144 histograms were created, and 1442 = 20,736similarity measures were computed, so the averagetime per similarity comparison [Eq. (11)] was around7 μs, and it took less than 0.5 ms to compare twoscans [Eq. (13)]. (Naturally, we could also have com-puted only one-half of the similarity matrix, becausethe matrix is symmetric.) In other words, once thehistograms have been created, if each scan requiresthe generation of 2.4 histograms on average, a newscan can be compared to roughly 25,000 other scansin 1 s when testing for loop closure, using exhaustivesearch. The corresponding numbers for all of the datasets are shown in Table III.

The time for creating the histograms and thenumber of histograms required for rotation invari-

ance depend on the data, but the time requiredfor similarity comparisons is independent of thedata.

The time spent on histogram creation can be sig-nificantly reduced if transformations are applied tothe first computed histogram when creating F , in-stead of computing new histograms from scratch af-ter transforming the original point cloud. With thisoptimization, the total time spent while generatingthe appearance descriptor is 1.0 s per scan instead of2.2 s per scan for the AASS-loop data set. However,the resulting histograms are not identical to the onesthat are achieved by recomputing histograms fromthe transformed point clouds. They are only approxi-mations. For all three data sets, the recognition resultswere marginally worse when using this optimization.

4. RELATED WORK

4.1. Other Loop Detection Approaches

A large part of the related loop detection literature isfocused on data from camera images and 2D rangedata.

Ramos, Nieto, and Durrant-Whyte (2007) used acombination of visual cues and laser readings to as-sociate features based on both position and appear-ance. They demonstrated that their method workswell in outdoor environments with isolated features.The experiments used for validation were performedon data collected in Victoria Park, Sydney, where theavailable features are sparsely planted trees. A limita-tion of the method of Ramos et al. is that the laser fea-tures are found by clustering isolated point segments,which are stored as curve segments. In many othersettings (such as indoor or urban environments), theappearance of scans is quite different from the onesin Victoria Park in that features are not generallysurrounded by empty space. Compared to the laserfeatures used by Ramos et al., the proposed surface

Table III. Summary of resource requirements.

Data set Scans Points/scan Avg. histogram creation time (s) Avg. histograms/scan

Hannover2 922 15,000 0.18 3.2Kvarntorp 130 70,000 0.27 2.8AASS-loop 60 112,000 0.50 2.4

In addition to the number of scans in each data set and the average point count per scan, the table shows the average time to create a singlehistogram (on a 1.6-GHz CPU) and the average number of histograms per scan.



shape histograms have the advantage that they re-quire no clustering of the input data and therefore itis likely that they are more context independent. It iscurrently not clear how the method of Ramos et al.would perform in a more cluttered environment.

Cummins and Newman (2007, 2008a, 2008b,2009) have published several articles on visual loopdetection using their FAB-MAP method. They usea bag-of-words approach in which scenes are rep-resented as a collection of “visual words” (local vi-sual features) drawn from a “dictionary” of avail-able features. Their appearance descriptor is a bi-nary vector indicating the presence or absence of allwords in the dictionary. The appearance descriptor isused within a probabilistic framework together witha generative model that describes how informativeeach visual word is by the common co-occurrences ofwords. In addition to simple matching of appearancedescriptors, as has been done in the present work,they also use pairwise feature statistics and sequencesof views to address the perceptual aliasing prob-lem. Cummins and Newman (2008) have reported re-call rates of 37%–48% at 100% precision, using cam-era images from urban outdoor data sets. RecentlyCummins and Newman (2009) reported on the expe-riences of applying FAB-MAP on a very large scale,showing that the computation time scales well to tra-jectories as long as 1,000 km. The precision, however,is much lower on the large data set, as is to be ex-pected.

A method that is more similar to the approachpresented here is the 2D histogram matching of Bosseand Roberts (2007) and Bosse and Zlot (2008a, 2008b).Although our loop detection method may also be re-ferred to as histogram matching, there are several dif-ferences. For example, Bosse et al. use the normals oforiented points instead of the orientation/shape fea-tures of NDT. Another difference lies in the amountof discretization. Bosse et al. create 2D histogramswith one dimension for the spatial distance to thescan points and one dimension for scan orientations.The angular histogram bins cover all possible rota-tions of a scan in order to achieve rotation invariance.Using 3-deg angular resolution and 1-m range resolu-tion, as in the published papers, results in 120 × 200 =240,000 histogram bins for the 2D case. For uncon-strained 3D motion with angular bins for the x, y,and z axes, a similar discretization would lead tomany millions of bins. In contrast, the 3D histogramspresented here require only a few dozen bins. At afalse-positive rate of 1%, Bosse and Zlot (2008b) have

achieved a recall rate of 51% for large urban data sets,using a manually chosen threshold.

Very recent work by Granstrom et al. (2009)showed good performance of another 2D loop de-tection algorithm. Their method uses AdaBoost(Freund & Schapire, 1997) to create a strong classi-fier composed from 20 weak classifiers, each of whichdescribes a global feature of a 2D laser scan. Thetwo most important weak classifiers are reported tobe the area enclosed by the complete 2D scan andthe area where the scan points with maximum rangehave been removed. With 800 scan pairs manually se-lected from larger urban data sets (400 overlappingpairs and 400 nonoverlapping ones), Granstrom et al.report an 85% recall rate with 1% false positives. Itwould be interesting to see how their method couldbe extended to the 3D case and how it would performin other environments.

Perhaps the most relevant related method forloop detection from 3D range data is the work byJohnson (1997) and Huber (2002). Johnson’s “spin im-ages” are local 3D feature descriptors that give de-tailed descriptions of the local surface shape aroundan oriented point. Huber (2002) has described amethod based on spin images for matching multiple3D scans without initial pose estimates. Such globalregistration is closely related to the loop detectionproblem. The initial step of Huber’s multiview sur-face matching method is to compute a model graphby using pairwise global registration with spin im-ages for all scan pairs. The model graph contains po-tential matches between pairs of scans, some of whichmay be incorrect. Surface consistency constraints onsequences of matches are used to reliably distin-guish correct matches from incorrect ones because itis not possible to distinguish the correct and incor-rect matches at the pairwise level. Huber has usedthis method to automatically build models of vari-ous types of scenes. However, we are not aware of aperformance measurement that is comparable to thework covered in this paper. Our algorithm can beseen as another way of generating the initial modelgraph and evaluating a local quality measure. An im-portant difference between spin images and the sur-face shape histograms proposed here is that spin im-ages are local feature descriptors, more akin to vi-sual words, describing the surface shape around onepoint. In contrast, the surface shape histograms areglobal appearance descriptors, describing the appear-ance of a whole 3D point cloud. Comparing spin im-ages to the local Gaussian features used in this work,



spin images are more descriptive and invariant to ro-tation when reliable point normals are available. Nor-mal distributions are unimodal functions, whereasspin images can capture arbitrary surface shapes ifthe resolution is high enough. However, the process-ing requirements are quite different for the two meth-ods. Using data sets containing 32 scans with 1,000mesh faces each, as done by Huber (2002), the timeto compute the initial model graph using spin-imagematching can be estimated to 1.5 × 322 = 1, 536 s.(The complete time is not explicitly stated, but pair-wise spin-image matching is reported to require 1.5 son average.) With a data set of that size, a rough esti-mate of the execution time of the algorithm proposedin this paper is 32 × 0.8 + (32 × 3)2 × 7 × 10−6 = 26 son similar hardware, based on the execution times inTable III. On a data set of a more realistic size, the dif-ference would be even greater.

4.2. Comparing Results

As discussed in Section 3.2., it is not always obvioushow to determine ground truth in the context of loopdetection. Granstrom et al.(2009) solved this problemby evaluating their algorithm on a selection of 400scan pairs that were manually determined to be over-lapping and 400 nonoverlapping ones. However, wewould like to evaluate the performance on the com-plete data sets. Bosse and Roberts (2007) and Bosseand Zlot (2008a, 2008b) use the connectivity graphbetween submaps created by the Atlas SLAM frame-work (Bosse, Newman, Leonard, & Teller, 2004) as theground truth. In this case, each scan has a single cor-respondence in each local subsequence of scans (al-though there may be other correspondences at sub-sequent revisits to the same location). In our evalua-tions of the full similarity matrix for each data set nosuch preprocessing was performed. Instead, we ap-plied a narrow distance threshold tr to the scan-to-scan distance matrix in order to generate a groundtruth labeling of true and false positives. The fact thatthe approaches used to determine the ground truthvary so much between different authors makes it dif-ficult to compare the results.

Furthermore, because all of the methods dis-cussed above were evaluated on different data sets,it is not possible to make any conclusive statementsabout how the quality of the results compare toone another, both because the appearances of scansmay vary greatly between different data sets andalso because the relative numbers of overlapping and

nonoverlapping scans differ. A false-positive rate of1% (of all nonoverlapping scans) for a data set thathas a large ratio of nonoverlapping scans is not di-rectly comparable to the same result for a set withmore loop closures.

Having said that, we will still compare our resultsto those reported in the related literature in order togive some indication of the relative performance ofour approach. On the Hannover2 data set, which isthe only one with characteristics comparable to thoseused in the related work, we obtained a recall rate of80.5% at 1% false positives when evaluating all scanpairs. This result compares well to the 51% recall rateof Bosse and Zlot (2008b) and the 85% recall rate ofGranstrom et al.(2009).

The SLAM-style experiment on the same dataset is more similar to those of Cummins andNewman (2007, 2008a, 2008b). With no false pos-itives, we achieved a 47.0% recall rate for theHannover2 data set, which is comparable to their re-call rates of 37%–48%.

5. SUMMARY AND CONCLUSIONS

We have described a new approach to appearance-based loop detection from 3D range data by compar-ing surface shape histograms. Compared to 2D laser-based approaches, using 3D data makes it possible toavoid dependence on a flat ground surface. However,3D scans bring new problems in the form of a massiveincrease in the amount of data and more complicatedrotations, which means a much larger pose space inwhich to compare appearances. We have shown thatthe proposed surface shape histograms overcomethese problems by allowing for drastic compressionof the input 3D point clouds while being invariantto rotation. We propose to use EM to fit a gammamixture model to the output similarity measures inorder to automatically determine the threshold thatseparates scans at loop closures from nonoverlappingones, and we have shown that doing so gives thresh-old values that are in the vicinity of the manually se-lected ones that give the best results for all our datasets. With experimental evidence we have shown thatthe presented approach can achieve high recall ratesat low false-positive rates in different and challeng-ing environments. Another contribution of this workis that we have focused on the problem of provid-ing quantifiable performance evaluations in the con-text of loop detection. We have discussed the dif-ficulties of determining unambiguous ground truth



correspondences that can be compared for differentloop detection approaches.

We can conclude that NDT is a suitable rep-resentation for 3D range scans. It allows for verygood scan registration (Magnusson et al., 2007;Magnusson, Nuchter, et al., 2009) and can be used asan intermediate representation for the proposed sur-face shape histograms. We also conclude that the pro-posed NDT-based surface shape histograms performwell in comparison with related loop detection meth-ods based on 2D and 3D range data as well as currentmethods using visual data. The highly compact his-togram representation (which uses 50–200 values onaverage to represent a 3D point cloud with severaltens of thousands of points) makes it possible to com-pare scans very quickly. We can compare a 3D scanto around 25,000 others in 1 s, as compared to 1.5 sper comparison using 3D spin-image descriptors. Thehigh speed makes it possible to detect loop closureseven in large maps by exhaustive search, which is animportant contribution of our work. Even though theinput data are highly compressed, the recall rate isstill 80.5% at a 1% false-positive rate for our outdoorcampus data set.

6. FUTURE WORK

It would be very interesting in future work to com-pare our approach to different methods using thesame data set. The Kvarntorp data set includes 2Dscans and camera images in addition to the 3D scansused here, so that data set would lend itself especiallywell to comparing different approaches.

It would be equally important to improve cur-rent experimental methodology to include a unifiedmethod for selecting true and false positives in thecontext of loop detection. A formal definition of whatconstitutes “a place” in this context would be verywelcome, for the same purpose.

To further improve the performance of our ap-proach, future work could include learning a genera-tive model in order to learn how to disregard com-mon nondiscriminative features (such as floor andceiling orientations), based on the general appear-ance of the current surroundings, as done previouslyin the visual domain (Cummins & Newman, 2008b).The disjoint range intervals that we have used in thepresent work may be a source of error, although theuse of overlapping NDT cells should alleviate alsothese discretization artifacts to some degree. We will

consider overlapping range intervals in future work.It would also be interesting to do a more elaborateanalysis of the similarity matrix than applying a sim-ple threshold, in order to better discriminate betweenoverlapping and nonoverlapping scans and to eval-uate the effects of difference measures other than theone used here. Another potential direction is to re-search whether it is possible to learn more of the pa-rameters from the data. Further future work shouldinclude investigating how the proposed loop detec-tion method is affected by dynamic changes, such asmoving vehicles or people.

REFERENCES

Biber, P., & Straßer, W. (2003, October). The normal distribu-tions transform: A new approach to laser scan match-ing. In Proceedings of the IEEE International Confer-ence on Intelligent Robots and Systems (IROS), LasVegas, NV (pp. 2743–2748).

Booij, O., Terwijn, B., Zivkovic, Z., & Krose, B. (2007,April). Navigation using an appearance based topolog-ical map. In Proceedings of the IEEE International Con-ference on Robotics and Automation (ICRA), Rome,Italy (pp. 3927–3932). IEEE.

Borrmann, D., Elseberg, J., Lingemann, K., Nuchter, A.,& Hertzberg, J. (2008). Globally consistent 3D map-ping with scan matching. Journal of Robotics and Au-tonomous Systems, 56(2), 130–142.

Bosse, M., Newman, P., Leonard, J., & Teller, S. (2004). Si-multaneous localization and map building in large-scale cyclic environments using the Atlas framework.International Journal of Robotics Research, 23(12),1113–1139.

Bosse, M., & Roberts, J. (2007, April). Histogram match-ing and global initialization for laser-only SLAM inlarge unstructured environments. In Proceedings ofthe IEEE International Conference on Robotics and Au-tomation (ICRA), Rome, Italy (pp. 4820–4826).

Bosse, M., & Zlot, R. (2008a, June). Keypoint design andevaluation for global localization in 2D LIDAR maps.In Robotics: Science and Systems, Zurich, Switzerland.

Bosse, M., & Zlot, R. (2008b). Map matching and data as-sociation for large-scale two-dimensional laser scan-based SLAM. International Journal of Robotics Re-search, 27(6), 667–691.

Cummins, M., & Newman, P. (2007, April). Probabilis-tic appearance based navigation and loop closing.In Proceedings of the IEEE International Conferenceon Robotics and Automation (ICRA), Rome, Italy(pp. 2042–2048).

Cummins, M., & Newman, P. (2008a, May). Acceleratedappearance-only SLAM. In Proceedings of the IEEE In-ternational Conference on Robotics and Automation(ICRA), Pasadena, CA (pp. 1828–1833).

Cummins, M., & Newman, P. (2008b). FAB-MAP: Prob-abilistic localization and mapping in the space of



appearance. International Journal of Robotics Re-search, 27(6), 647–665.

Cummins, M., & Newman, P. (2009, June). Highly scalableappearance-only SLAM—FAB-MAP 2.0. In Robotics:Science and Systems, Seattle, WA.

Frese, U., Larsson, P., & Duckett, T. (2005). A multilevel re-laxation algorithm for simultaneous localisation andmapping. IEEE Transactions on Robotics, 21(2), 196–207.

Frese, U., & Schroder, L. (2006, October). Closing a million-landmarks loop. In Proceedings of the IEEE Interna-tional Conference on Intelligent Robots and Systems(IROS), Beijing, China (pp. 5032–5039).

Freund, Y., & Schapire, R. (1997). A decision-theoretic gen-eralization of online learning and an application toboosting. Journal of Computer and System Sciences,55(1), 119–139.

Granstrom, K., Callmer, J., Ramos, F., & Nieto, J. (2009,May). Learning to detect loop closure from range data.In Proceedings of the IEEE International Conference onRobotics and Automation (ICRA), Kobe, Japan (pp. 15–22).

Grisetti, G., Grzonka, S., Stachniss, C., Pfaff, P., & Bur-gard, W. (2007, October). Efficient estimation of accu-rate maximum likelihood maps in 3D. In Proceedingsof the IEEE International Conference on IntelligentRobots and Systems (IROS), San Diego, CA (pp. 3472–3478).

Huber, D. F. (2002). Automatic three-dimensional modelingfrom reality. Ph.D. thesis, Carnegie Mellon University.

Johnson, A. E. (1997). Spin images: A representation for 3-Dsurface matching. Ph.D. thesis, Carnegie Mellon Uni-versity.

Konolige, K., Bowman, J., Chen, J. D., Mihelich, P.,Calonder, M., Lepetit, V., & Fua, P. (2009, June). View-

based maps. In Robotics: Science and Systems, Seattle,WA.

Magnusson, M., Andreasson, H., Nuchter, A., & Lilien-thal, A. J. (2009, May). Appearance-based loop de-tection from 3D laser data using the normal distri-butions transform. In Proceedings of the IEEE In-ternational Conference on Robotics and Automation(ICRA), Kobe, Japan (pp. 23–28).

Magnusson, M., Lilienthal, A. J., & Duckett, T. (2007). Scanregistration for autonomous mining vehicles using 3D-NDT. Journal of Field Robotics, 24(10), 803–827.

Magnusson, M., Nuchter, A., Lorken, C., Lilienthal, A. J.,& Hertzberg, J. (2009, May). Evaluation of 3D registra-tion reliability and speed—A comparison of ICP andNDT. In Proceedings of the IEEE International Confer-ence on Robotics and Automation (ICRA), Kobe, Japan(pp. 3907–3912).

Ramos, F. T., Nieto, J., & Durrant-Whyte, H. F. (2007, April).Recognising and modelling landmarks to close loopsin outdoor SLAM. In Proceedings of the IEEE In-ternational Conference on Robotics and Automation(ICRA), Rome, Italy (pp. 2036–2041).

Saff, E. B., & Kuijlaars, A. B. J. (1997). Distributing manypoints on a sphere. Mathematical Intelligencer, 19(1),5–11.

Valgren, C., & Lilienthal, A. J. (2007, September). SIFT,SURF and seasons: Long-term outdoor localization us-ing local features. In Proceedings of the European Con-ference on Mobile Robots (ECMR), Freiburg, Germany(pp. 253–258).

Wulf, O., Nuchter, A., Hertzberg, J., & Wagner, B. (2007,October). Ground truth evaluation of large urban 6DSLAM. In Proceedings of the IEEE International Con-ference on Intelligent Robots and Systems (IROS), SanDiego, CA (pp. 650–657).


automatic appearance-based loop detection from three...

Documents