
    Moving in Stereo: Fast Structure and Motion from Lines

    Anonymous CVPR submission

    Paper ID 1346

    Abstract

We present a robust and fast approach for estimating structure and motion with a stereo pair, using lines as features. Our primary contribution is an efficient algorithm that performs this estimation using the theoretically minimum number of lines (two), which makes it well-suited for use in a robust hypothesize-and-test framework.

In contrast to some prior literature that uses line segments, our approach avoids the issues arising due to uncertain determination of end-points by using infinite lines. Our cost function stems from a rank condition on planes backprojected from corresponding image lines, which can be solved linearly using three lines. Consideration of orthonormality constraints on the rigid body motion allows us to perform estimation using only two lines, through the solution of an overdetermined system of polynomials. We propose a global optimization approach as well as an efficient implementation for solving this system in a "least squares" sense.

Experiments using synthetic as well as real data validate the speed, accuracy and reliability of our algorithms.

1. Introduction

Simultaneous estimation of the structure of a scene and the motion of a camera rig using only visual input data is a classic problem in computer vision. Interest points, such as corners, are the primary features of interest for sparse reconstruction, since they abound in natural scenes. Consequently, significant research efforts in recent times have been devoted to sparse 3D reconstructions from points [1].

However, it is often the case, especially in man-made scenes like office interiors or urban cityscapes, that not enough point features can be reliably detected, while an abundant number of lines are visible. While structure from motion using lines has, in the past, received considerable attention in its own right, not much of it has been directed towards developing algorithms usable in a robust real-time system. On the other hand, important breakthroughs in solving minimal problems for points, such as [12][14], have led to the emergence of efficient point-based reconstruction methods.

This paper proposes algorithms that can estimate structure and motion using lines with the minimum amount of data, which makes them ideal for use in a hypothesize-and-test framework [5] crucial for designing a robust, real-time system. In deference to our intended application of autonomous navigation, the visual input for our algorithms stems from a calibrated stereo pair, rather than a monocular camera. This has the immediate advantage that only two stereo views (as opposed to three monocular views) are sufficient to recover the structure and motion using lines. The cost functions and representations we use in this paper are adapted to exploit the benefits of stereo input.

Besides availability in man-made scenes, an advantage of using lines is accurate localization due to the multi-pixel support that a straight line edge enjoys. However, several prior works rely on using finite line segments for structure and motion estimation, which has the drawback that the quality of reconstruction is limited by the accuracy of detecting end-points. In the worst case scenario, such as a cluttered environment where lines are often occluded, the end-points might be false corners or unstable T-junctions. Our approach largely alleviates these issues by using infinite lines as features and employing a problem formulation that depends on backprojected planes rather than reprojected end-points.

Commonly used multifocal tensor based approaches use a larger number of line correspondences to produce a linear estimate, which can be arbitrarily far from a rigid-body motion in the presence of noise. In contrast, our minimum data solution uses only two lines and accounts for orthonormality constraints on the recovered motion. Rather than resort to a brute force nonlinear minimization to solve the resulting nonlinear problem, we reduce the problem to a small polynomial system of low degree, which can be solved expeditiously. Our approach also obviates the need for extensive book-keeping to maintain the internal dependencies in the tensor estimates.

We begin with a brief overview of relevant prior work in Section 2. Our algorithms for estimating structure and motion are described in Section 3, while the system details such as line detection and tracking are provided in Section 4. Synthetic and real data experiments are presented in Section 5 and we finish with discussions in Section 6.

2. Related Work

It can be argued that points provide "stronger" constraints on a reconstruction than lines [6]. For instance, an analog of the eight-point algorithm for two-view reconstruction using lines requires three views and thirteen correspondences [11]. Yet, using lines as primary features is beneficial in some scenarios, so there has been tremendous interest in the problem.

Multifocal tensors are a common approach to estimate the structure of a scene from several line correspondences. A linear algorithm for estimating the trifocal tensor from three views and thirteen correspondences was presented in [6], while the properties of the quadrifocal tensor are elucidated in [18]. It is apparent from these works that the book-keeping required to enforce non-linear dependencies within the tensor indices can be substantial [7][10]. Consequently, approaches such as [4], which compute the quadrifocal tensor between two calibrated stereo pairs, are too cumbersome for estimating just a 6-dof motion.

An algebraic construct, called the 3D line motion matrix, is used to align the Plücker coordinates of two line reconstructions in [2]; however, rigid body constraints can be accounted for only through nonlinear minimization. Extended Kalman Filter based techniques for estimating incremental displacements from stereo frames have been proposed [21], but the inherent linearization assumption might lead to inaccuracies.

For various reasons, it is not reliable to use end-points for designing a structure and motion algorithm with lines. The issue of representing 3D lines is addressed in detail in [3], where an orthonormal representation well-suited for bundle adjustment is proposed. The method in [20] minimizes a notion of reprojection error in the image plane which is robust to variable end-points across views. However, the cost function can be optimized only by iterative local minimization, which requires initialization and is not amenable to a real-time implementation. Further, a minimum of six line correspondences in three views are required, which can be burdensome for a hypothesize-and-test framework. A similar procedure for 3D reconstruction of urban environments is presented in [17].

In recent years, there has been significant progress in the development of visual odometry systems [13], which rely on real-time solutions to important minimal problems for point-based structure-from-motion such as 5-point relative orientation [12][19] or 3-point generalized pose [14]. Similar to those works, our focus is to develop a minimum data solution using lines in a stereo framework, but this leads to an over-determined system of polynomials. An over-constrained system, albeit an easier one, is solved in [16] for the pose estimation problem.

    3. Structure and Motion Using Lines

Unless stated otherwise, 3D points are denoted as homogeneous column vectors X ∈ R^4 and 2D image points by homogeneous column vectors x ∈ R^3. The perspective camera matrix is represented by a 3 × 4 matrix P = K[R|t], where K is the 3 × 3 upper triangular matrix that encodes the internal parameters of the camera and (R, t) represents the exterior orientation. We refer the reader to [8] for the relevant details of projective geometry and to [15] for an excellent treatment of 3D line geometry.

A convenient representation of 3D lines is through their Plücker coordinates, which lie on the Klein quadric. But for our purposes, it suffices to interpret lines as the intersection of two planes π1 and π2, so the locus of points on the line is represented by the null-space of the 2 × 4 matrix [π1 π2]^T.

3.1. Geometry of the Problem

We will assume that the stereo pair is calibrated, so the intrinsic parameter matrices, K, can be assumed to be identity. The left and right stereo cameras can be parametrized as P1 = [I|0] and P2 = [R0|t0], where R0 is a known rotation matrix and t0 is a known translation vector. Then, the camera matrices in a stereo pair related by a rigid body motion (R, t) are given by P3 = [R|t] and P4 = [RR0|Rt0 + t].

Note that the only unknown for the problem of recovering the stereo rig's position and orientation is the six degrees of freedom motion (R, t). It can be easily verified by the reader that images of two lines in the four cameras (corresponding to two stereo frames) are sufficient to recover this 6-dof motion.

The back-projected plane through the camera center and containing the 3D line L, imaged as a 2D line l by a camera P, is given by π = P^T l. Suppose l1, l2, l3 and l4 are images of the same 3D line in the four cameras P1, ..., P4 (see Figure 1). Then, the four back-projected planes, stacked row-wise, form a 4 × 4 matrix which is the primary geometric object of interest for our algorithms:

W = \begin{bmatrix} \pi_1^\top \\ \pi_2^\top \\ \pi_3^\top \\ \pi_4^\top \end{bmatrix}
  = \begin{bmatrix} l_1^\top & 0 \\ l_2^\top R_0 & l_2^\top t_0 \\ l_3^\top R & l_3^\top t \\ l_4^\top (R R_0) & l_4^\top (R t_0 + t) \end{bmatrix}    (1)

Since the four backprojected planes in Figure 1 intersect in a straight line, there are two independent points X1 and X2 that satisfy π_i^T X_j = 0, for i = 1, 2, 3, 4 and j = 1, 2. Consequently, the matrix W has a 2D null-space, or has rank 2, when the lines in the 4 images correspond.
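To make the rank condition concrete, the following sketch (in Python/NumPy; our own illustration, not the paper's MATLAB prototype) assembles W for a single line correspondence as in eq. (1) and checks its rank numerically. With noise-free, corresponding lines and the true (R, t), the two smallest singular values of W should be (numerically) zero. The function names are ours.

    import numpy as np

    def backprojected_plane(P, l):
        # plane pi = P^T l through the camera centre that contains the imaged 3D line
        return P.T @ l

    def stack_W(l1, l2, l3, l4, R0, t0, R, t):
        # the 4 x 4 matrix of eq. (1) for one line seen by both stereo pairs
        P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = np.hstack([R0, t0[:, None]])
        P3 = np.hstack([R, t[:, None]])
        P4 = np.hstack([R @ R0, (R @ t0 + t)[:, None]])
        return np.vstack([backprojected_plane(P, l)
                          for P, l in zip([P1, P2, P3, P4], [l1, l2, l3, l4])])

    # sanity check on synthetic, noise-free data:
    # W = stack_W(l1, l2, l3, l4, R0, t0, R, t)
    # print(np.linalg.svd(W, compute_uv=False))   # expect the last two values ~ 0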


[Figure 1: the two stereo pairs P1 = [I|0], P2 = [R0|t0] and P3 = [R|t], P4 = [RR0|Rt0 + t], the image lines l1, ..., l4 and the back-projected planes πi = Pi^T li.]

Figure 1. Geometry of image formation. If the 4 image lines in each of the 4 cameras of the 2 stereo pairs actually correspond, the back-projected planes joining the respective camera centers to these straight lines would intersect in a straight line (the 3D line which was imaged by the 2 stereo pairs).

    3.2. Linear Solution

Note that the only unknowns in the matrix W are R and t, which are not present in the first two rows. Asserting that W has rank 2 is equivalent to demanding that each of its 3 × 3 sub-determinants vanish, which amounts to four independent constraints. One way to extract these constraints is to perform two steps of Gauss-Jordan elimination on the matrix W^T, to get a system of the form

W_{gj} = \begin{bmatrix} \times & 0 & 0 & 0 \\ \times & \times & 0 & 0 \\ \times & \times & f_1(R,t) & f_2(R,t) \\ \times & \times & f_3(R,t) & f_4(R,t) \end{bmatrix}    (2)

where f_i(R, t) are linear in the entries of R and t. Since the rank of a matrix is preserved by elementary operations, the matrix W_{gj} must also be rank 2 and thus its third and fourth rows are linear combinations of the first two. It easily follows that

f_i(R, t) = 0 ,  i = 1, 2, 3, 4    (3)

These four linear constraints will be used throughout the paper.

Note that the algebraic degeneracies of our cost function are apparent here, namely, when the (1, 1) entry is zero or the determinant of the first 2 × 2 sub-matrix vanishes. In practice, it is possible to detect these situations from the input data and perform column reordering to avoid the degeneracies.

It follows from (3) that the correspondence of n lines across the four images of two stereo pairs will result in 4n independent linear equations in the 12 variables of (R, t). Thus, for n ≥ 3, a linear estimate for the motion can then be obtained as the singular vector corresponding to the smallest singular value of the 4n × 12 matrix formed by stacking up the linear constraints f_i.
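As an illustration, the sketch below (Python/NumPy, our own reformulation rather than the paper's procedure) builds the per-line constraints in an equivalent way: since the first two rows of W are fully known, rank 2 is the same as requiring rows 3 and 4 to be orthogonal to the 2D orthogonal complement of the span of rows 1 and 2, which yields four equations spanning the same linear constraints as (2)-(3) up to an invertible recombination. The function names, the SVD-based projection to the nearest rotation and the rescaling at the end are ours, not part of the paper's algorithm.

    import numpy as np

    def constraint_rows(l1, l2, l3, l4, R0, t0):
        # four linear constraints on v = [r1; r2; r3; t] from one line correspondence:
        # rows 3 and 4 of W must be orthogonal to the complement of span{row 1, row 2}
        row1 = np.concatenate([l1, [0.0]])                    # [l1^T, 0]
        row2 = np.concatenate([R0.T @ l2, [l2 @ t0]])         # [l2^T R0, l2^T t0]
        N = np.linalg.svd(np.vstack([row1, row2]))[2][2:].T   # 4 x 2 basis of the complement
        rows = []
        for l, M, m in [(l3, np.eye(3), np.zeros(3)),         # row 3: [l3^T R | l3^T t]
                        (l4, R0, t0)]:                        # row 4: [l4^T R R0 | l4^T (R t0 + t)]
            for k in range(2):
                n3, n4 = N[:3, k], N[3, k]
                c = M @ n3 + n4 * m                           # weights multiplying l^T r1, l^T r2, l^T r3
                rows.append(np.concatenate([c[0] * l, c[1] * l, c[2] * l, n4 * l]))
        return np.asarray(rows)                               # 4 x 12

    def linear_estimate(line_tuples, R0, t0):
        # linear estimate from n >= 3 lines; each tuple is (l1, l2, l3, l4)
        A = np.vstack([constraint_rows(*ls, R0, t0) for ls in line_tuples])   # 4n x 12
        v = np.linalg.svd(A)[2][-1]                       # singular vector of the smallest singular value
        M, t = v[:9].reshape(3, 3, order='F'), v[9:]      # columns r1, r2, r3 and the translation
        if np.linalg.det(M) < 0:                          # resolve the overall sign ambiguity
            M, t = -M, -t
        U, s, Vt = np.linalg.svd(M)
        return U @ Vt, t / s.mean()                       # nearest rotation and consistently rescaled t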

The drawbacks that such a linear procedure suffers from in the presence of noise are similar to those of, say, the DLT algorithm for estimating the fundamental matrix using eight points. In particular, since orthonormality constraints have been ignored, the solution can be arbitrarily far from a rigid body motion. An additional disadvantage is that at least 3 lines are required for estimation, which is more than the minimum number required (2 lines). This can result in a significant additional computational burden in a robust estimation framework like RANSAC.

    3.3. Minimum Data Solution

For a hypothesize-and-test framework, a solution that relies on a minimum amount of data is indispensable to minimize computational complexity. In the present case, the minimum data problem, which requires two lines, is inherently non-linear due to orthonormality constraints. However, rather than use a brute force non-linear minimization approach, it is possible to tackle the problem as an overdetermined system of low degree polynomials in a few variables, for which a principled solution may be obtained efficiently.

Consider the eight linear equations of the form (3) obtained from the correspondence of two lines:

\begin{bmatrix} f_1(R,t) \\ \vdots \\ f_8(R,t) \end{bmatrix} = F \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ t \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}    (4)

where r_i is the i-th column of R and F is an 8 × 12 matrix. Consequently, the solution (R, t) must lie in the 4-dimensional null-space of F. Let v_i ∈ R^12, for i = 1, ..., 4, be the orthonormal vectors that span the null-space of F. Then, the desired solution can be expressed as a linear combination of the null vectors:

\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ t \end{bmatrix} = x\, v_1 + y\, v_2 + z\, v_3 + w\, v_4    (5)

and the problem reduces to determining the coefficients x, y, z and w of the above linear combination.
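Numerically, with noisy data F will not have an exact null space; a practical stand-in (our own sketch, assuming F has been stacked from the eight constraints, e.g. with the constraint_rows() sketch above) takes the right singular vectors associated with the four smallest singular values as the basis v1, ..., v4.

    import numpy as np

    def nullspace_basis(F):
        # orthonormal basis of the (approximate) 4D null space of the 8 x 12 matrix F
        Vt = np.linalg.svd(F)[2]
        return Vt[8:].T          # 12 x 4, columns are v1, ..., v4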

Recall orthonormality constraints on the rotation R:

\|r_1\|^2 = 1 , \quad \|r_2\|^2 = 1 , \quad \|r_3\|^2 = 1    (6)
r_1^\top r_2 = 0 , \quad r_2^\top r_3 = 0 , \quad r_3^\top r_1 = 0.    (7)

Substituting for (R, t) from (5) in (6) and (7), we get six polynomials of degree 2 in the four variables x, y, z and w. This is an over-determined system of polynomial equations, which will have no solution in the general, noisy case. We must instead resort to a "least-squares" approach to extract the solution.

    3.3.1 A Global Optimization Approach

Let the six quadratic polynomials in (6) and (7) be Q_i(x, y, z, w), for i = 1, ..., 6. Then, a straightforward "least squares" approach is to solve

\min_{x,y,z,w} \; \sum_{i=1}^{6} Q_i^2(x, y, z, w)    (8)

where the objective function is a polynomial of degree 4 in four variables. The global optimum for this unconstrained minimization problem can be found using an LMI relaxation based polynomial solver, such as Gloptipoly [9].
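The cost in (8) is easy to write down once the null-space basis is available. The sketch below (our own, not the paper's solver) evaluates it and, purely for illustration, refines a starting guess with a local optimizer from SciPy; the paper instead certifies the global optimum with Gloptipoly or uses the reduction of Section 3.3.2. Here V denotes the 12 × 4 basis [v1 v2 v3 v4].

    import numpy as np
    from scipy.optimize import minimize

    def orthonormality_cost(coeffs, V):
        # sum of squared residuals of (6) and (7) for the candidate v = V @ [x, y, z, w]
        v = V @ coeffs
        R = v[:9].reshape(3, 3, order='F')      # columns r1, r2, r3
        G = R.T @ R                             # Gram matrix; identity for a true rotation
        res = np.concatenate([np.diag(G) - 1.0,               # unit-norm constraints (6)
                              [G[0, 1], G[1, 2], G[2, 0]]])   # orthogonality constraints (7)
        return float(res @ res)

    # illustrative local refinement only (no global guarantee):
    # sol = minimize(orthonormality_cost, x0=np.random.randn(4), args=(V,))
    # x, y, z, w = sol.x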

In general, greater numerical precision and speed can be obtained by reducing the degree of the system and the number of variables. One way to do so is to drop the scale, that is, divide each equation in (5) by w and replace the unit norm constraints in (6) by equal norm constraints:

\|r_1\|^2 - \|r_2\|^2 = 0 , \quad \|r_2\|^2 - \|r_3\|^2 = 0.    (9)

Now, the five polynomials in (9) and (7) have three variables (x, y, z) each, call them q_i(x, y, z), for i = 1, ..., 5. Then, the corresponding reduced least squares problem is to minimize q(x, y, z) = \sum_{i=1}^{5} q_i^2(x, y, z). Once a solution for (x, y, z) is obtained, the scale factor w can be recovered by imposing the unit determinant condition on R as a post-processing step.

    3.3.2 An Efficient Solution

We will illustrate the algorithm by again assuming that the scale w has been divided out. The derivation that follows can be easily extended, with small modifications, for the case of fixed scale.

Each of the five polynomial equations consists of the ten monomials [x^2, y^2, xy, x, y, xz, yz, z^2, z, 1]. Since relatively low degree polynomials in a single variable can be solved very fast using methods like Sturm sequences, we will attempt to solve for a single variable first, say z. Then, each of the polynomials q_i(x, y, z) can be rewritten as

c_1 x^2 + c_2 y^2 + c_3 xy + [1]\,x + [1]\,y + [2] = 0    (10)

where we use the notation [n] to denote some polynomial of degree n in the single variable z. Our system of polynomials, written out in a matrix format, now has the form

\begin{bmatrix}
c_1 & c_2 & c_3 & [1] & [1] & [2] \\
c_1' & c_2' & c_3' & [1] & [1] & [2] \\
c_1'' & c_2'' & c_3'' & [1] & [1] & [2] \\
c_1''' & c_2''' & c_3''' & [1] & [1] & [2] \\
c_1'''' & c_2'''' & c_3'''' & [1] & [1] & [2]
\end{bmatrix}
\begin{bmatrix} x^2 \\ y^2 \\ xy \\ x \\ y \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.    (11)

Let the 5 × 6 matrix above be denoted as G. Then, the i-th component of its null-vector, denoted as u = [x^2, y^2, xy, x, y, 1]^\top, can be obtained as

u_i = (-1)^{i-1} \det(\hat{G}_i)    (12)

where \hat{G}_i stands for the 5 × 5 matrix obtained by deleting the i-th column of G. Thus, the vector u can be obtained, up to scale, with each of its components a polynomial of degree 4 (x^2, y^2, xy), 3 (x, y) or 2 (1) in the single variable z.

Now, we note that not all the components of the vector u are independent. In fact, in the noiseless case, they must satisfy three constraints:

[u_4 u_5 = u_3 u_6] :  (x) \times (y) = (xy) \times (1)
[u_4^2 = u_1 u_6] :  (x) \times (x) = (x^2) \times (1)    (13)
[u_5^2 = u_2 u_6] :  (y) \times (y) = (y^2) \times (1)

Substituting the expressions for u obtained from (12), we obtain three polynomials of degree 6 in the single variable z. Let us call these polynomials p_i(z), i = 1, 2, 3. Then, we have a system of three univariate polynomials that must be solved in a "least-squares" sense. We approach this as an unconstrained optimization problem

\min_{z} \; p_1^2(z) + p_2^2(z) + p_3^2(z).    (14)

At the minimum, the first-order derivative of the above objective function must vanish, so the optimal z is a root of the polynomial

p(z) = p_1(z) p_1'(z) + p_2(z) p_2'(z) + p_3(z) p_3'(z).    (15)

Note that p(z) is a degree 11 univariate polynomial, whose roots can be determined very fast in practice. There can be up to 11 real roots of p(z), all of which must be tested as candidate solutions. Also worth noticing is that the odd degree of p(z) guarantees at least one real solution.
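Given p1, p2, p3 as numpy polynomial objects, eqs. (14)-(15) amount to only a few lines; the sketch below (our own helper) forms p(z), keeps the real roots and returns the candidate with the smallest cost.

    import numpy as np
    from numpy.polynomial import Polynomial

    def best_z(p1, p2, p3):
        # p(z) of eq. (15): derivative of the least-squares cost (14), degree 11
        p = p1 * p1.deriv() + p2 * p2.deriv() + p3 * p3.deriv()
        roots = p.roots()
        real = roots[np.abs(roots.imag) < 1e-6].real         # candidate values of z
        cost = p1(real)**2 + p2(real)**2 + p3(real)**2       # test every candidate, eq. (14)
        return real[np.argmin(cost)]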

Finally, once the candidates for z have been obtained, the corresponding candidates for x and y are obtained by substituting the value of z in the expression for u and reading off the values of x = u_4/u_6 and y = u_5/u_6. The rotation and translation can now be recovered, up to a common scale factor, by substituting in (5). We can fix scale by, for instance, dividing each column of R by its norm and dividing the estimate of t by the average of those three norms. Thus, we have obtained an estimate of (R, t) using the minimum required 2 lines.
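A small sketch of this read-off step (our own helper, assuming the scale-free parametrization where w has been divided out, so a candidate is V @ [x, y, z, 1]); the determinant test is one simple way of imposing the unit determinant condition mentioned in Section 3.3.1.

    import numpy as np

    def recover_motion(V, x, y, z):
        # read off (R, t) from the null-space basis and fix the scale as described above
        v = V @ np.array([x, y, z, 1.0])
        R, t = v[:9].reshape(3, 3, order='F'), v[9:]
        norms = np.linalg.norm(R, axis=0)
        R, t = R / norms, t / norms.mean()      # unit-norm columns; t divided by the average norm
        if np.linalg.det(R) < 0:                # enforce det(R) > 0 (equivalent to flipping the sign of the scale)
            R, t = -R, -t
        return R, t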


4. System Details

4.1. Line Detection

To detect lines in the images, we first construct a Sobel edge image and then apply non-maximal suppression to accurately locate the peak/ridge location of both strong and weak edges. We use the following non-maximal suppression method to build the edge map I:

I_{\mathbf{x}} = \max_{d\mathbf{x}} \; \sigma\!\left( \frac{S_{\mathbf{x}} - (S_{\mathbf{x}+d\mathbf{x}} + S_{\mathbf{x}-d\mathbf{x}})/2}{\max\!\left((S_{\mathbf{x}+d\mathbf{x}} + S_{\mathbf{x}-d\mathbf{x}})/2,\; \theta_{\mathrm{edge}}\right)};\; s, m \right)

where σ(x; s, m) = 1/(1 + exp(−(x − m)/s)) is a sigmoid function, S_x is the Sobel edge response at pixel x = (x, y)^T, θ_edge is a free parameter and dx ∈ {(0, 1)^T, (1, 1)^T, (1, 0)^T, (1, −1)^T} is the testing direction. To detect lines, we note that the complexity of the edge map for indoor scenes makes it difficult to use standard approaches like Hough transforms. Instead, we start by detecting stripes of edge points in the edge map, and then break the stripes into line segments. The fitness to a line of a given set of edge points is determined by the smallest eigenvalue of the covariance matrix of their x- and y-coordinates. We use dynamic programming to determine the optimal break points on each stripe, throw away short segments, and fit a line using the points in each segment. Figure 2 shows examples of the result after each step in line detection and tracking.
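A compact sketch of the non-maximal suppression step above (our own Python/NumPy rendering of the formula; the parameter values are illustrative and image borders are handled by wrap-around for brevity):

    import numpy as np

    def edge_map(S, theta_edge=4.0, s=0.25, m=1.0):
        # sigmoid-weighted non-maximal suppression of a Sobel response map S
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-(x - m) / s))
        I = np.zeros_like(S, dtype=float)
        for dy, dx in [(0, 1), (1, 1), (1, 0), (1, -1)]:      # the four testing directions
            nb = 0.5 * (np.roll(S, (dy, dx), axis=(0, 1)) +
                        np.roll(S, (-dy, -dx), axis=(0, 1)))  # mean of the two opposite neighbours
            I = np.maximum(I, sigmoid((S - nb) / np.maximum(nb, theta_edge)))
        return I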

[Figure 2. Line detection and tracking: (a) Sobel edge image, (b) edge map, (c) line detection, (d) lines tracked 25 frames later. See the text for details.]

4.2. Line Tracking

We assume that the incremental motion, between consecutive frames, of the image of each point on a straight line can be approximated by an affine transformation H = [h_1 h_2]^T. The local optical flow constraint at each point x_i = (x_i, y_i, 1)^T on the line is given by

E_{x_i} v_{x_i} + E_{y_i} v_{y_i} + E_{t_i} = 0

where v_{x_i} = h_1^\top x_i − x_i and v_{y_i} = h_2^\top x_i − y_i. Since all points on the straight line are constrained to lie on the same line in subsequent frames, the affine motion can be obtained by a least squares minimization of the squared optical flow cost aggregated over the image line:

[h_1^\top \; h_2^\top]^\top = (A^\top A)^{-1} A^\top b

where the i-th row of A is given by [E_{x_i} x_i^\top, E_{y_i} x_i^\top] and b_i = E_{x_i} x_i + E_{y_i} y_i − E_{t_i}. In our implementation, we first estimate the global affine motion using all lines, and then find best matches for individual lines.
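The aggregated least-squares step can be written directly (our own sketch; np.linalg.lstsq computes the same solution as the normal equations above, in a numerically safer way). Here pts is an N × 3 array of homogeneous points x_i sampled on the line, and Ex, Ey, Et are the image derivatives at those points.

    import numpy as np

    def track_affine(pts, Ex, Ey, Et):
        # least-squares estimate of H = [h1 h2]^T aggregated over one image line
        A = np.hstack([Ex[:, None] * pts, Ey[:, None] * pts])   # i-th row: [Ex_i x_i^T, Ey_i x_i^T]
        b = Ex * pts[:, 0] + Ey * pts[:, 1] - Et                # b_i = Ex_i x_i + Ey_i y_i - Et_i
        h = np.linalg.lstsq(A, b, rcond=None)[0]
        return h[:3], h[3:]                                      # h1, h2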

    4.3. Efficiently Computing Determinants

A critical step of the algorithm in Section 3.3.2 is to compute the determinants of the 5 × 5 sub-matrices of G. It is important to carefully design the computation of these determinants, since the number of arithmetic operations can explode very quickly, which might adversely affect the numerical behavior of the algorithm.

As an illustration, to compute the polynomial corresponding to u_1 in (12), consider the matrix G_1', which is \hat{G}_1 with the column order reversed. Then, det(G_1'^\top) = det(\hat{G}_1). Each of the lower 4 × 4 sub-determinants of G_1'^\top is a polynomial of degree 2. It now requires relatively little book-keeping to determine the coefficients of z^4, ..., z^0 in u_1 by appropriately convolving the coefficients of degree 2 polynomials in the first row of G_1'^\top with those from its lower 4 × 4 sub-determinants. The symbolic interactions of the coefficients are pre-computed and only need to be evaluated for values of v_1, ..., v_4. Similar steps can be taken to compute the coefficients of various powers of z in u_2, ..., u_6.
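For checking such an implementation, the sub-determinants can also be computed naively with polynomial arithmetic (products of numpy Polynomials are exactly the coefficient convolutions mentioned above). The recursive cofactor expansion below is our own illustrative sketch and makes no attempt at the pre-computed symbolic bookkeeping of the optimized version.

    from numpy.polynomial import Polynomial

    def poly_det(M):
        # determinant of a square matrix given as a list of lists of numpy Polynomials in z
        if len(M) == 1:
            return M[0][0]
        det = Polynomial([0.0])
        for j in range(len(M)):
            minor = [row[:j] + row[j + 1:] for row in M[1:]]
            det = det + (-1) ** j * M[0][j] * poly_det(minor)
        return det

    # e.g. u_i(z) = (-1) ** (i - 1) * poly_det(G_hat_i), where G_hat_i is the 5 x 5 matrix
    # obtained by deleting the i-th column of G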

5. Experiments

We conduct experiments for performance evaluation with synthetic data, across various noise levels, for the linear solution of Section 3.2, the minimum data problem solved to its global minimum (Section 3.3.1) and the efficient implementation of the minimum data solution (Section 3.3.2). The performance of the minimum data solution in a RANSAC framework is evaluated across varying percentages of outliers.

All the results quoted are for an implementation in unoptimized MATLAB. The polynomial solver used for global optimization is Gloptipoly [9], while polynomial roots in Section 3.3.2 are extracted using the MATLAB function roots().

    5.1. Synthetic Data

The first set of experiments is designed to quantify the performance of the various algorithms across various noise levels and for various types of motion. Specifically, we consider three types of motion: random, sideways translation and forward translation. All the values reported for these experiments are RMS errors over 1000 trials.

For each trial, lines are generated as the join of two randomly generated points in the cube [−1, 1]^3. The first pair of stereo cameras, with baseline 0.1 m and a very small relative rotation, is generated in a random orientation. The position and orientation of the second stereo pair is randomly generated according to the type of experiment. Zero mean, Gaussian noise of varying standard deviations is added to the coordinates of the image lines in each view.

The angular error in rotation and the error in translation for random motion are plotted in Figure 3. Two lines are required for the minimum data algorithms, while an additional line is included for the linear algorithm. Also, for the linear algorithm, nonlinear minimization is used with the linear estimate as initialization, while raw errors are reported for the minimum data solvers. As expected, the minimum data algorithms, which account for orthonormality, clearly outperform the linear method. The average solution time for the global optimization algorithm is 300 ms, while it is 2.44 ms for the efficient version.

[Figure 3. Rotation and translation errors with random motions across varying noise levels for the linear algorithm (red, dotted curve), the minimum data algorithm with the global solver (blue, dashed curve) and the efficient minimum data algorithm (black, solid curve). (a) Rotation error. (b) Translation error. All quantities plotted are averages of 1000 trials.]

In some cases, the solver fails to generate a result. This might be due to numerical errors or due to the two lines being in a nearly degenerate position, for instance, both vertical. For the minimum data solver, the number of breakdowns varies from 0.1% for the lowest noise level to 10.1% for the highest one. Note that 1% noise, especially with lines as features, can be considered quite high.

The next subset of experiments creates scenarios where the stereo baseline is small, while the camera undergoes pure translational motion. The rotation and translation errors for purely forward motion are plotted in Figure 4, while those for purely sideways motion aligned with the stereo baseline are plotted in Figure 5. While higher than the errors for the easier subset of experiments with random motion, the errors obtained here are still acceptable for reasonable noise levels.

[Figure 4. Rotation and translation errors with purely forward translation across varying noise levels. (a) Rotation error. (b) Translation error.]

[Figure 5. Rotation and translation errors with purely sideways translation across varying noise levels. (a) Rotation error. (b) Translation error.]

Next, we quantify the behavior of the efficient version of the minimum data solver in a set-up that recreates its intended use, that is, in a RANSAC framework. For the first set of experiments, 100 random lines are generated as described previously, while the wide baseline stereo pair's motion is random. The noise level is kept fixed at 0.1% and the number of outliers in the data varies from 1% to 50%. The adaptive RANSAC strategy, described in [8], is used with a conservative probability p = 0.99 that at least one of the samples is outlier-free. All the figures reported are averages of 500 trials.
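For reference, the standard hypothesis-count bound used by adaptive RANSAC [8] is reproduced below (our own helper); it is what drives the difference between the two-line minimal sampler and the three-line linear sampler reported in Tables 1 and 2.

    import numpy as np

    def ransac_iterations(p, outlier_frac, sample_size):
        # number of hypotheses so that, with confidence p, at least one sample is outlier-free
        w = 1.0 - outlier_frac                    # inlier ratio
        return int(np.ceil(np.log(1.0 - p) / np.log(1.0 - w ** sample_size)))

    # e.g. ransac_iterations(0.99, 0.3, 2) for the 2-line minimal solver,
    #      ransac_iterations(0.99, 0.3, 3) for the 3-line linear solver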

Besides the rotation and translation errors, we also record the number of hypotheses required for termination and the computation time. Comparative estimates of the number of hypotheses for various outlier levels, for the efficient minimum data solver and the linear three-line solver, are tabulated in Table 1. The average angular rotation and translation errors across all the trials are tabulated in the upper rows of Table 3 (the errors are practically the same for various outlier levels).

    Alg.      No. of hypotheses for various outlier %
              0     1     5     10    20    30    40    50
    Minimal   2.3   2.7   3.0   3.7   4.9   6.4   8.0   9.8
    Linear    1.9   2.2   3.0   3.9   5.8   8.3   11.4  15.5

Table 1. Required number of hypotheses across various outlier levels, with 100 lines in the scene, for RANSAC to terminate with 99% confidence of obtaining an outlier-free set. The base noise level is 0.1%. The first row corresponds to the minimum data solution, which has a sample set of two lines, while the second row is for the linear solution with a sample set of three lines. All tabulated quantities are averages of 500 trials.

For the next set of experiments, we create a scenario with only 20 lines and a narrow stereo baseline, which might arise in indoor environments where not many straight edges are present. The noise level is again kept fixed at 0.1% and p = 0.99 for RANSAC. The average rotation and translation errors obtained over all trials are recorded in the lower rows of Table 3. The number of hypotheses for various outlier levels, for the efficient minimum data solver and the linear three-line solver, are tabulated in Table 2. From the corresponding columns of Tables 1 and 2, it is interesting to observe that the required number of hypotheses is quite comparable for the two experiments.

    Alg.      No. of hypotheses for various outlier %
              0     10    20    30    40    50
    Minimal   3.4   4.6   5.6   6.9   8.7   10.8
    Linear    2.7   4.3   6.3   8.8   12.4  17.0

Table 2. Required number of hypotheses across various outlier levels, with only 20 lines in the scene and a narrow stereo baseline, for RANSAC to terminate with 99% confidence of obtaining an outlier-free set.

    Expt       Alg.      Rotation error (deg)   Translation error (m)
    100 lines  Minimal   0.0363                  0.0035
               Linear    0.2195                  0.0183
    20 lines   Minimal   0.0531                  0.0045
               Linear    0.2616                  0.0208

Table 3. The average rotation and translation errors after RANSAC with 99% probability of an outlier-free set. The experiment with 100 lines has a wide baseline stereo configuration, while the following experiment is more challenging, with only 20 lines and a narrow baseline. All tabulated quantities are averaged over 500 trials and various outlier levels.

    5.2. Real Data

The image sequences for our real data experiments are acquired using a stereo pair with a baseline of 7.4 cm and a wide field-of-view of 110° × 80° for each camera.

The first dataset is obtained by imaging a set of boxes rotating on a turntable. This is an interesting dataset because there are very few straight lines for estimation, with some outliers in the form of background lines that do not move with the turntable. Further, low contrast between the faces of the boxes and significant orientation/order changes even due to small motions make line detection and tracking relatively difficult.

[Figure 7. Structure and motion estimation from the computer sequence. (top left) Frame #206 overlaid with lines; red boxes around line numbers indicate inliers of the estimated motion. (right) The reconstructed 3D lines and camera motions after within-frame refinement using inliers obtained by the minimum data solver. About 7% of the motions had rotation larger than 5° or translation larger than 2 cm, and these failure cases were ignored. (bottom left) Zoom-up of the estimated camera motions.]

Yet, as Figure 6 shows, the quality of the structure and motion estimates obtained by the minimum data solver is quite good. From Fig. 6(b), it can be seen that the loop of camera motion is nearly closed and the cameras are pointing inward. Further, the coplanarity of the motion and the accuracy of the structure (from the orthogonality of adjacent edges, for instance) are evident from Fig. 6(c). Note that no inter-frame bundle adjustment has been used for these reconstructions, but a local gradient descent is used after rejecting outliers to refine the RANSAC estimate for a given frame. The output without this local refinement is shown in Fig. 6(d). The algorithm fails to produce a good estimate for about 6% of the frames, which are marked as red circles in the reconstructed camera sequence. We note that bundle adjustment over a few frames can overcome these errors.

The next dataset has larger scene depths, of the order of 1 m, so the stereo baseline is quite narrow. The reconstruction obtained, illustrated in Figure 7, is again good and closely mimics the ground truth motion of the camera.

Additional results have been uploaded as supplemental material.

6. Discussions

In this paper, we have reported developments in the construction of a robust, real-time stereo-based system for performing structure and motion estimation using infinite lines. Our primary contribution is a fast polynomial optimization algorithm that requires a minimum amount of data and is ideal for use in a RANSAC framework.

[Figure 6. Structure and motion estimation from the box sequence. (a) Frames #318 and #361 (the last frame) overlaid with lines; red boxes around line numbers indicate inliers of the estimated motion. (b, c) Top view and side view of the reconstructed 3D lines and camera motions after (within-frame) gradient descent refinement using inliers from RANSAC. (d) Top view of the reconstructed camera motions by the efficient minimum data algorithm.]

As with any structure from motion system, bundle adjustment forms an integral part of the framework, but was not discussed in this paper to better illustrate the properties of the minimum data algorithms. Further, we note that it is rather straightforward to also incorporate information from point features in our polynomial solver, but we leave the details to an extended version of the paper. Also covered in extended works will be a detailed evaluation of the numerical behavior of the algorithm and our steps for fully incorporating stereo depth information into the structure and motion estimation using lines.

All the timings reported in the paper are for an unoptimized MATLAB prototype; we expect the system to achieve real-time rates in our ongoing C implementation.

References

[1] A. Akbarzadeh et al. Towards urban 3D reconstruction from video. In 3DPVT, pages 1–8, 2006.
[2] A. Bartoli and P. Sturm. The 3D line motion matrix and alignment of line reconstructions. IJCV, 57(3):159–178, 2004.
[3] A. Bartoli and P. Sturm. Structure-from-motion using lines: Representation, triangulation and bundle adjustment. CVIU, 100(3):416–441, 2005.
[4] A. Comport, E. Malis, and P. Rives. Accurate quadrifocal tracking for robust 3D visual odometry. In ICRA, pages 40–45, 2007.
[5] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24:381–395, 1981.
[6] R. I. Hartley. Lines and points in three views and the trifocal tensor. IJCV, 22(2):125–140, 1997.
[7] R. I. Hartley. Computation of the trifocal tensor. In ECCV, pages 20–35, 1998.
[8] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[9] D. Henrion and J.-B. Lasserre. Global optimization over polynomials with Matlab and SeDuMi. ACM Math. Soft., 29(2):165–194, 2003.
[10] A. Heyden. Geometry and Algebra of Multiple Projective Transformations. PhD thesis, Lund University, 1995.
[11] Y. Liu and T. S. Huang. A linear algorithm for motion estimation using straight line correspondences. CVGIP, 44(1):33–57, 1988.
[12] D. Nistér. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–770, 2004.
[13] D. Nistér, O. Naroditsky, and J. Bergen. Visual odometry. In CVPR, pages 652–659, 2004.
[14] D. Nistér and H. Stewénius. A minimal solution to the generalized 3-point pose problem. JMIV, 27(1):67–79, 2007.
[15] H. Pottmann and J. Wallner. Computational Line Geometry. Springer, 2001.
[16] L. Quan and Z. Lan. Linear N-point camera pose determination. PAMI, 21(8):774–780, 1999.
[17] G. Schindler, P. Krishnamurthy, and F. Dellaert. Line-based structure from motion for urban environments. In 3DPVT, pages 846–853, 2006.
[18] A. Shashua and L. Wolf. On the structure and properties of the quadrifocal tensor. In ECCV, pages 710–724, 2000.
[19] H. Stewénius, C. Engels, and D. Nistér. Recent developments on direct relative orientation. ISPRS, 60:284–294, 2006.
[20] C. J. Taylor and D. J. Kriegman. Structure and motion using line segments in multiple images. PAMI, 17(11):1021–1032, 1995.
[21] Z. Zhang and O. D. Faugeras. Estimation of displacements from two 3-D frames obtained from stereo. PAMI, 14(12):1141–1156, 1992.
