
    Moving in Stereo: Fast Structure and Motion from Lines

    Anonymous CVPR submission

    Paper ID 1346

    Abstract

We present a robust and fast approach for estimating structure and motion with a stereo pair, using lines as features. Our primary contribution is an efficient algorithm that performs this estimation using the theoretically minimum number of lines (two), which makes it well-suited for use in a robust hypothesize-and-test framework.

In contrast to some prior literature that uses line segments, our approach avoids the issues arising due to uncertain determination of end-points by using infinite lines. Our cost function stems from a rank condition on planes backprojected from corresponding image lines, which can be solved linearly using three lines. Consideration of orthonormality constraints on the rigid body motion allows us to perform estimation using only two lines, through the solution of an overdetermined system of polynomials. We propose a global optimization approach as well as an efficient implementation for solving this system in a "least squares" sense.

Experiments using synthetic as well as real data validate the speed, accuracy and reliability of our algorithms.

1. Introduction

Simultaneous estimation of the structure of a scene and the motion of a camera rig using only visual input data is a classic problem in computer vision. Interest points, such as corners, are the primary features of interest for sparse reconstruction, since they abound in natural scenes. Consequently, significant research efforts in recent times have been devoted to sparse 3D reconstructions from points [1].

However, it is often the case, especially in man-made scenes like office interiors or urban cityscapes, that not enough point features can be reliably detected, while an abundant number of lines are visible. While structure from motion using lines has, in the past, received considerable attention in its own right, not much of it has been directed towards developing algorithms usable in a robust real-time system. On the other hand, important breakthroughs in solving minimal problems for points, such as [12][14], have led to the emergence of efficient point-based reconstruction methods.

This paper proposes algorithms that can estimate structure and motion using lines with the minimum amount of data, which makes them ideal for use in a hypothesize-and-test framework [5] crucial for designing a robust, real-time system. In deference to our intended application of autonomous navigation, the visual input for our algorithms stems from a calibrated stereo pair, rather than a monocular camera. This has the immediate advantage that only two stereo views (as opposed to three monocular views) are sufficient to recover the structure and motion using lines. The cost functions and representations we use in this paper are adapted to exploit the benefits of stereo input.

Besides availability in man-made scenes, an advantage of using lines is accurate localization due to the multi-pixel support that a straight line edge enjoys. However, several prior works rely on using finite line segments for structure and motion estimation, which has the drawback that the quality of reconstruction is limited by the accuracy of detecting end-points. In the worst case scenario, such as a cluttered environment where lines are often occluded, the end-points might be false corners or unstable T-junctions. Our approach largely alleviates these issues by using infinite lines as features and employing a problem formulation that depends on backprojected planes rather than reprojected end-points.

Commonly used multifocal tensor based approaches use a larger number of line correspondences to produce a linear estimate, which can be arbitrarily far from a rigid-body motion in the presence of noise. In contrast, our minimum data solution uses only two lines and accounts for orthonormality constraints on the recovered motion. Rather than resort to a brute force nonlinear minimization to solve the resulting nonlinear problem, we reduce the problem to a small polynomial system of low degree, which can be solved expeditiously. Our approach also obviates the need for extensive book-keeping to maintain the internal dependencies in the tensor estimates.

We begin with a brief overview of relevant prior work in Section 2. Our algorithms for estimating structure and motion are described in Section 3, while the system details such as line detection and tracking are provided in Section 4. Synthetic and real data experiments are presented in Section 5 and we finish with discussions in Section 6.

2. Related Work

It can be argued that points provide "stronger" constraints on a reconstruction than lines [6]. For instance, an analog of the eight-point algorithm for two-view reconstruction using lines requires three views and thirteen correspondences [11]. Yet, using lines as primary features is beneficial in some scenarios, so there has been tremendous interest in the problem.

Multifocal tensors are a common approach to estimate the structure of a scene from several line correspondences. A linear algorithm for estimating the trifocal tensor from three views and thirteen correspondences was presented in [6], while the properties of the quadrifocal tensor are elucidated in [18]. It is apparent from these works that the book-keeping required to enforce non-linear dependencies within the tensor indices can be substantial [7][10]. Consequently, approaches such as [4], which compute the quadrifocal tensor between two calibrated stereo pairs, are too cumbersome for estimating just a 6-dof motion.

An algebraic construct, called the 3D line motion matrix, is used to align the Plücker coordinates of two line reconstructions in [2]; however, rigid body constraints can be accounted for only through nonlinear minimization. Extended Kalman Filter based techniques for estimating incremental displacements from stereo frames have been proposed [21], but the inherent linearization assumption might lead to inaccuracies.

For various reasons, it is not reliable to use end-points for designing a structure and motion algorithm with lines. The issue of representing 3D lines is addressed in detail in [3], where an orthonormal representation well-suited for bundle adjustment is proposed. The method in [20] minimizes a notion of reprojection error in the image plane which is robust to variable end-points across views. However, the cost function can be optimized only by iterative local minimization, which requires initialization and is not amenable to a real-time implementation. Further, a minimum of six line correspondences in three views are required, which can be burdensome for a hypothesize-and-test framework. A similar procedure for 3D reconstruction of urban environments is presented in [17].

In recent years, there has been significant progress in the development of visual odometry systems [13], which rely on real-time solutions to important minimal problems for point-based structure-from-motion such as 5-point relative orientation [12][19] or 3-point generalized pose [14]. Similar to those works, our focus is to develop a minimum data solution using lines in a stereo framework, but this leads to an over-determined system of polynomials. An over-constrained system, albeit an easier one, is solved in [16] for the pose estimation problem.

    3. Structure and Motion Using Lines

Unless stated otherwise, 3D points are denoted as homogeneous column vectors X ∈ R^4 and 2D image points by homogeneous column vectors x ∈ R^3. The perspective camera matrix is represented by a 3 × 4 matrix P = K[R|t], where K is the 3 × 3 upper triangular matrix that encodes the internal parameters of the camera and (R, t) represents the exterior orientation. We refer the reader to [8] for the relevant details of projective geometry and to [15] for an excellent treatment of 3D line geometry.

A convenient representation of 3D lines is through their Plücker coordinates, which lie on the Klein quadric. But for our purposes, it suffices to interpret lines as the intersection of two planes π1 and π2, so the locus of points on the line is represented by the null-space of the 2 × 4 matrix [π1 π2]^T.

3.1. Geometry of the Problem

We will assume that the stereo pair is calibrated, so the intrinsic parameter matrices, K, can be assumed to be identity. The left and right stereo cameras can be parametrized as P1 = [I|0] and P2 = [R0|t0], where R0 is a known rotation matrix and t0 is a known translation vector. Then, the camera matrices in a stereo pair related by a rigid body motion (R, t) are given by P3 = [R|t] and P4 = [RR0|Rt0 + t].

Note that the only unknown for the problem of recovering the stereo rig's position and orientation is the six degrees of freedom motion (R, t). It can be easily verified by the reader that images of two lines in the four cameras (corresponding to two stereo frames) are sufficient to recover this 6-dof motion.

The back-projected plane through the camera center and containing the 3D line L, imaged as a 2D line l by a camera P, is given by π = P^T l. Suppose l1, l2, l3 and l4 are images of the same 3D line in the four cameras P1, ..., P4 (see Figure 1). Then, the four back-projected planes, stacked row-wise, form a 4 × 4 matrix which is the primary geometric object of interest for our algorithms:

W = \begin{bmatrix} \pi_1^\top \\ \pi_2^\top \\ \pi_3^\top \\ \pi_4^\top \end{bmatrix}
  = \begin{bmatrix} l_1^\top & 0 \\ l_2^\top R_0 & l_2^\top t_0 \\ l_3^\top R & l_3^\top t \\ l_4^\top (R R_0) & l_4^\top (R t_0 + t) \end{bmatrix}    (1)

Since the four backprojected planes in Figure 1 intersect in a straight line, there are two independent points X1 and X2 that satisfy π_i^T X_j = 0, for i = 1, 2, 3, 4 and j = 1, 2. Consequently, the matrix W has a 2D null-space, or has rank 2, when the lines in the 4 images correspond.
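To make the rank condition concrete, the following sketch (in Python/NumPy; our own illustration, not the paper's MATLAB prototype) assembles W for a single line correspondence as in eq. (1) and checks its rank numerically. With noise-free, corresponding lines and the true (R, t), the two smallest singular values of W should be (numerically) zero. The function names are ours.

    import numpy as np

    def backprojected_plane(P, l):
        # plane pi = P^T l through the camera centre that contains the imaged 3D line
        return P.T @ l

    def stack_W(l1, l2, l3, l4, R0, t0, R, t):
        # the 4 x 4 matrix of eq. (1) for one line seen by both stereo pairs
        P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = np.hstack([R0, t0[:, None]])
        P3 = np.hstack([R, t[:, None]])
        P4 = np.hstack([R @ R0, (R @ t0 + t)[:, None]])
        return np.vstack([backprojected_plane(P, l)
                          for P, l in zip([P1, P2, P3, P4], [l1, l2, l3, l4])])

    # sanity check on synthetic, noise-free data:
    # W = stack_W(l1, l2, l3, l4, R0, t0, R, t)
    # print(np.linalg.svd(W, compute_uv=False))   # expect the last two values ~ 0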


[Figure 1: the two stereo pairs P1 = [I|0], P2 = [R0|t0] and P3 = [R|t], P4 = [RR0|Rt0 + t], the image lines l1, ..., l4 and the back-projected planes πi = Pi^T li.]

Figure 1. Geometry of image formation. If the 4 image lines in each of the 4 cameras of the 2 stereo pairs actually correspond, the back-projected planes joining the respective camera centers to these straight lines would intersect in a straight line (the 3D line which was imaged by the 2 stereo pairs).

    3.2. Linear Solution

Note that the only unknowns in the matrix W are R and t, which are not present in the first two rows. Asserting that W has rank 2 is equivalent to demanding that each of its 3 × 3 sub-determinants vanish, which amounts to four independent constraints. One way to extract these constraints is to perform two steps of Gauss-Jordan elimination on the matrix W^T, to get a system of the form

W_{gj} = \begin{bmatrix} \times & 0 & 0 & 0 \\ \times & \times & 0 & 0 \\ \times & \times & f_1(R,t) & f_2(R,t) \\ \times & \times & f_3(R,t) & f_4(R,t) \end{bmatrix}    (2)

where f_i(R, t) are linear in the entries of R and t. Since the rank of a matrix is preserved by elementary operations, the matrix W_{gj} must also be rank 2 and thus its third and fourth rows are linear combinations of the first two. It easily follows that

f_i(R, t) = 0 ,  i = 1, 2, 3, 4    (3)

These four linear constraints will be used throughout the paper.

Note that the algebraic degeneracies of our cost function are apparent here, namely, when the (1, 1) entry is zero or the determinant of the first 2 × 2 sub-matrix vanishes. In practice, it is possible to detect these situations from the input data and perform column reordering to avoid the degeneracies.

It follows from (3) that the correspondence of n lines across the four images of two stereo pairs will result in 4n independent linear equations in the 12 variables of (R, t). Thus, for n ≥ 3, a linear estimate for the motion can then be obtained as the singular vector corresponding to the smallest singular value of the 4n × 12 matrix formed by stacking up the linear constraints f_i.
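As an illustration, the sketch below (Python/NumPy, our own reformulation rather than the paper's procedure) builds the per-line constraints in an equivalent way: since the first two rows of W are fully known, rank 2 is the same as requiring rows 3 and 4 to be orthogonal to the 2D orthogonal complement of the span of rows 1 and 2, which yields four equations spanning the same linear constraints as (2)-(3) up to an invertible recombination. The function names, the SVD-based projection to the nearest rotation and the rescaling at the end are ours, not part of the paper's algorithm.

    import numpy as np

    def constraint_rows(l1, l2, l3, l4, R0, t0):
        # four linear constraints on v = [r1; r2; r3; t] from one line correspondence:
        # rows 3 and 4 of W must be orthogonal to the complement of span{row 1, row 2}
        row1 = np.concatenate([l1, [0.0]])                    # [l1^T, 0]
        row2 = np.concatenate([R0.T @ l2, [l2 @ t0]])         # [l2^T R0, l2^T t0]
        N = np.linalg.svd(np.vstack([row1, row2]))[2][2:].T   # 4 x 2 basis of the complement
        rows = []
        for l, M, m in [(l3, np.eye(3), np.zeros(3)),         # row 3: [l3^T R | l3^T t]
                        (l4, R0, t0)]:                        # row 4: [l4^T R R0 | l4^T (R t0 + t)]
            for k in range(2):
                n3, n4 = N[:3, k], N[3, k]
                c = M @ n3 + n4 * m                           # weights multiplying l^T r1, l^T r2, l^T r3
                rows.append(np.concatenate([c[0] * l, c[1] * l, c[2] * l, n4 * l]))
        return np.asarray(rows)                               # 4 x 12

    def linear_estimate(line_tuples, R0, t0):
        # linear estimate from n >= 3 lines; each tuple is (l1, l2, l3, l4)
        A = np.vstack([constraint_rows(*ls, R0, t0) for ls in line_tuples])   # 4n x 12
        v = np.linalg.svd(A)[2][-1]                       # singular vector of the smallest singular value
        M, t = v[:9].reshape(3, 3, order='F'), v[9:]      # columns r1, r2, r3 and the translation
        if np.linalg.det(M) < 0:                          # resolve the overall sign ambiguity
            M, t = -M, -t
        U, s, Vt = np.linalg.svd(M)
        return U @ Vt, t / s.mean()                       # nearest rotation and consistently rescaled t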

The drawbacks that such a linear procedure suffers from in the presence of noise are similar to those of, say, the DLT algorithm for estimating the fundamental matrix using eight points. In particular, since orthonormality constraints have been ignored, the solution can be arbitrarily far from a rigid body motion. An additional disadvantage is that at least 3 lines are required for estimation, which is more than the minimum number required (2 lines). This can result in a significant additional computational burden in a robust estimation framework like RANSAC.

    3.3. Minimum Data Solution

For a hypothesize-and-test framework, a solution that relies on a minimum amount of data is indispensable to minimize computational complexity. In the present case, the minimum data problem, which requires two lines, is inherently non-linear due to orthonormality constraints. However, rather than use a brute force non-linear minimization approach, it is possible to tackle the problem as an overdetermined system of low degree polynomials in a few variables, for which a principled solution may be obtained efficiently.

Consider the eight linear equations of the form (3) obtained from the correspondence of two lines:

\begin{bmatrix} f_1(R,t) \\ \vdots \\ f_8(R,t) \end{bmatrix} = F \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ t \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}    (4)

where r_i is the i-th column of R and F is an 8 × 12 matrix. Consequently, the solution (R, t) must lie in the 4-dimensional null-space of F. Let v_i ∈ R^12, for i = 1, ..., 4, be the orthonormal vectors that span the null-space of F. Then, the desired solution can be expressed as a linear combination of the null vectors:

\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ t \end{bmatrix} = x\, v_1 + y\, v_2 + z\, v_3 + w\, v_4    (5)

and the problem reduces to determining the coefficients x, y, z and w of the above linear combination.
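Numerically, with noisy data F will not have an exact null space; a practical stand-in (our own sketch, assuming F has been stacked from the eight constraints, e.g. with the constraint_rows() sketch above) takes the right singular vectors associated with the four smallest singular values as the basis v1, ..., v4.

    import numpy as np

    def nullspace_basis(F):
        # orthonormal basis of the (approximate) 4D null space of the 8 x 12 matrix F
        Vt = np.linalg.svd(F)[2]
        return Vt[8:].T          # 12 x 4, columns are v1, ..., v4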

Recall orthonormality constraints on the rotation R:

\|r_1\|^2 = 1 , \quad \|r_2\|^2 = 1 , \quad \|r_3\|^2 = 1    (6)
r_1^\top r_2 = 0 , \quad r_2^\top r_3 = 0 , \quad r_3^\top r_1 = 0.    (7)

Substituting for (R, t) from (5) in (6) and (7), we get six polynomials of degree 2 in the four variables x, y, z and w. This is an over-determined system of polynomial equations, which will have no solution in the general, noisy case. We must instead resort to a "least-squares" approach to extract the solution.

    3.3.1 A Global Optimization Approach

Let the six quadratic polynomials in (6) and (7) be Q_i(x, y, z, w), for i = 1, ..., 6. Then, a straightforward "least squares" approach is to solve

\min_{x,y,z,w} \; \sum_{i=1}^{6} Q_i^2(x, y, z, w)    (8)

where the objective function is a polynomial of degree 4 in four variables. The global optimum for this unconstrained minimization problem can be found using an LMI relaxation based polynomial solver, such as Gloptipoly [9].
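The cost in (8) is easy to write down once the null-space basis is available. The sketch below (our own, not the paper's solver) evaluates it and, purely for illustration, refines a starting guess with a local optimizer from SciPy; the paper instead certifies the global optimum with Gloptipoly or uses the reduction of Section 3.3.2. Here V denotes the 12 × 4 basis [v1 v2 v3 v4].

    import numpy as np
    from scipy.optimize import minimize

    def orthonormality_cost(coeffs, V):
        # sum of squared residuals of (6) and (7) for the candidate v = V @ [x, y, z, w]
        v = V @ coeffs
        R = v[:9].reshape(3, 3, order='F')      # columns r1, r2, r3
        G = R.T @ R                             # Gram matrix; identity for a true rotation
        res = np.concatenate([np.diag(G) - 1.0,               # unit-norm constraints (6)
                              [G[0, 1], G[1, 2], G[2, 0]]])   # orthogonality constraints (7)
        return float(res @ res)

    # illustrative local refinement only (no global guarantee):
    # sol = minimize(orthonormality_cost, x0=np.random.randn(4), args=(V,))
    # x, y, z, w = sol.x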

In general, greater numerical precision and speed can be obtained by reducing the degree of the system and the number of variables. One way to do so is to drop the scale, that is, divide each equation in (5) by w and replace the unit norm constraints in (6) by equal norm constraints:

\|r_1\|^2 - \|r_2\|^2 = 0 , \quad \|r_2\|^2 - \|r_3\|^2 = 0.    (9)

Now, the five polynomials in (9) and (7) have three variables (x, y, z) each, call them q_i(x, y, z), for i = 1, ..., 5. Then, the corresponding reduced least squares problem is to minimize q(x, y, z) = \sum_{i=1}^{5} q_i^2(x, y, z). Once a solution for (x, y, z) is obtained, the scale factor w can be recovered by imposing the unit determinant condition on R as a post-processing step.

    3.3.2 An Efficient Solution

We will illustrate the algorithm by again assuming that the scale w has been divided out. The derivation that follows can be easily extended, with small modifications, for the case of fixed scale.

Each of the five polynomial equations consists of the ten monomials [x^2, y^2, xy, x, y, xz, yz, z^2, z, 1]. Since relatively low degree polynomials in a single variable can be solved very fast using methods like Sturm sequences, we will attempt to solve for a single variable first, say z. Then, each of the polynomials q_i(x, y, z) can be rewritten as

c_1 x^2 + c_2 y^2 + c_3 xy + [1]\,x + [1]\,y + [2] = 0    (10)

where we use the notation [n] to denote some polynomial of degree n in the single variable z. Our system of polynomials, written out in a matrix format, now has the form

\begin{bmatrix}
c_1 & c_2 & c_3 & [1] & [1] & [2] \\
c_1' & c_2' & c_3' & [1] & [1] & [2] \\
c_1'' & c_2'' & c_3'' & [1] & [1] & [2] \\
c_1''' & c_2''' & c_3''' & [1] & [1] & [2] \\
c_1'''' & c_2'''' & c_3'''' & [1] & [1] & [2]
\end{bmatrix}
\begin{bmatrix} x^2 \\ y^2 \\ xy \\ x \\ y \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.    (11)

Let the 5 × 6 matrix above be denoted as G. Then, the i-th component of its null-vector, denoted as u = [x^2, y^2, xy, x, y, 1]^\top, can be obtained as

u_i = (-1)^{i-1} \det(\hat{G}_i)    (12)

where \hat{G}_i stands for the 5 × 5 matrix obtained by deleting the i-th column of G. Thus, the vector u can be obtained, up to scale, with each of its components a polynomial of degree 4 (x^2, y^2, xy), 3 (x, y) or 2 (1) in the single variable z.

Now, we note that not all the components of the vector u are independent. In fact, in the noiseless case, they must satisfy three constraints:

[u_4 u_5 = u_3 u_6] :  (x) \times (y) = (xy) \times (1)
[u_4^2 = u_1 u_6] :  (x) \times (x) = (x^2) \times (1)    (13)
[u_5^2 = u_2 u_6] :  (y) \times (y) = (y^2) \times (1)

Substituting the expressions for u obtained from (12), we obtain three polynomials of degree 6 in the single variable z. Let us call these polynomials p_i(z), i = 1, 2, 3. Then, we have a system of three univariate polynomials that must be solved in a "least-squares" sense. We approach this as an unconstrained optimization problem

\min_{z} \; p_1^2(z) + p_2^2(z) + p_3^2(z).    (14)

At the minimum, the first-order derivative of the above objective function must vanish, so the optimal z is a root of the polynomial

p(z) = p_1(z) p_1'(z) + p_2(z) p_2'(z) + p_3(z) p_3'(z).    (15)

Note that p(z) is a degree 11 univariate polynomial, whose roots can be determined very fast in practice. There can be up to 11 real roots of p(z), all of which must be tested as candidate solutions. Also worth noticing is that the odd degree of p(z) guarantees at least one real solution.
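Given p1, p2, p3 as numpy polynomial objects, eqs. (14)-(15) amount to only a few lines; the sketch below (our own helper) forms p(z), keeps the real roots and returns the candidate with the smallest cost.

    import numpy as np
    from numpy.polynomial import Polynomial

    def best_z(p1, p2, p3):
        # p(z) of eq. (15): derivative of the least-squares cost (14), degree 11
        p = p1 * p1.deriv() + p2 * p2.deriv() + p3 * p3.deriv()
        roots = p.roots()
        real = roots[np.abs(roots.imag) < 1e-6].real         # candidate values of z
        cost = p1(real)**2 + p2(real)**2 + p3(real)**2       # test every candidate, eq. (14)
        return real[np.argmin(cost)]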

Finally, once the candidates for z have been obtained, the corresponding candidates for x and y are obtained by substituting the value of z in the expression for u and reading off the values of x = u_4/u_6 and y = u_5/u_6. The rotation and translation can now be recovered, up to a common scale factor, by substituting in (5). We can fix scale by, for instance, dividing each column of R by its norm and dividing the estimate of t by the average of those three norms. Thus, we have obtained an estimate of (R, t) using the minimum required 2 lines.
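A small sketch of this read-off step (our own helper, assuming the scale-free parametrization where w has been divided out, so a candidate is V @ [x, y, z, 1]); the determinant test is one simple way of imposing the unit determinant condition mentioned in Section 3.3.1.

    import numpy as np

    def recover_motion(V, x, y, z):
        # read off (R, t) from the null-space basis and fix the scale as described above
        v = V @ np.array([x, y, z, 1.0])
        R, t = v[:9].reshape(3, 3, order='F'), v[9:]
        norms = np.linalg.norm(R, axis=0)
        R, t = R / norms, t / norms.mean()      # unit-norm columns; t divided by the average norm
        if np.linalg.det(R) < 0:                # enforce det(R) > 0 (equivalent to flipping the sign of the scale)
            R, t = -R, -t
        return R, t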


4. System Details

4.1. Line Detection

To detect lines in the images, we first construct a Sobel edge image and then apply non-maximal suppression to accurately locate the peak/ridge location of both strong and weak edges. We use the following non-maximal suppression method to build the edge map I:

I_{\mathbf{x}} = \max_{d\mathbf{x}} \; \sigma\!\left( \frac{S_{\mathbf{x}} - (S_{\mathbf{x}+d\mathbf{x}} + S_{\mathbf{x}-d\mathbf{x}})/2}{\max\!\left((S_{\mathbf{x}+d\mathbf{x}} + S_{\mathbf{x}-d\mathbf{x}})/2,\; \theta_{\mathrm{edge}}\right)};\; s, m \right)

where σ(x; s, m) = 1/(1 + exp(−(x − m)/s)) is a sigmoid function, S_x is the Sobel edge response at pixel x = (x, y)^T, θ_edge is a free parameter and dx ∈ {(0, 1)^T, (1, 1)^T, (1, 0)^T, (1, −1)^T} is the testing direction. To detect lines, we note that the complexity of the edge map for indoor scenes makes it difficult to use standard approaches like Hough transforms. Instead, we start by detecting stripes of edge points in the edge map, and then break the stripes into line segments. The fitness to a line of a given set of edge points is determined by the smallest eigenvalue of the covariance matrix of their x- and y-coordinates. We use dynamic programming to determine the optimal break points on each stripe, throw away short segments, and fit a line using the points in each segment. Figure 2 shows examples of the result after each step in line detection and tracking.
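A compact sketch of the non-maximal suppression step above (our own Python/NumPy rendering of the formula; the parameter values are illustrative and image borders are handled by wrap-around for brevity):

    import numpy as np

    def edge_map(S, theta_edge=4.0, s=0.25, m=1.0):
        # sigmoid-weighted non-maximal suppression of a Sobel response map S
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-(x - m) / s))
        I = np.zeros_like(S, dtype=float)
        for dy, dx in [(0, 1), (1, 1), (1, 0), (1, -1)]:      # the four testing directions
            nb = 0.5 * (np.roll(S, (dy, dx), axis=(0, 1)) +
                        np.roll(S, (-dy, -dx), axis=(0, 1)))  # mean of the two opposite neighbours
            I = np.maximum(I, sigmoid((S - nb) / np.maximum(nb, theta_edge)))
        return I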

[Figure 2. Line detection and tracking: (a) Sobel edge image, (b) edge map, (c) line detection, (d) lines tracked 25 frames later. See the text for details.]

4.2. Line Tracking

We assume that the incremental motion, between consecutive frames, of the image of each point on a straight line can be approximated by an affine transformation H = [h_1 h_2]^T. The local optical flow constraint at each point x_i = (x_i, y_i, 1)^T on the line is given by

E_{x_i} v_{x_i} + E_{y_i} v_{y_i} + E_{t_i} = 0

where v_{x_i} = h_1^\top x_i − x_i and v_{y_i} = h_2^\top x_i − y_i. Since all points on the straight line are constrained to lie on the same line in subsequent frames, the affine motion can be obtained by a least squares minimization of the squared optical flow cost aggregated over the image line:

[h_1^\top \; h_2^\top]^\top = (A^\top A)^{-1} A^\top b

where the i-th row of A is given by [E_{x_i} x_i^\top, E_{y_i} x_i^\top] and b_i = E_{x_i} x_i + E_{y_i} y_i − E_{t_i}. In our implementation, we first estimate the global affine motion using all lines, and then find best matches for individual lines.
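The aggregated least-squares step can be written directly (our own sketch; np.linalg.lstsq computes the same solution as the normal equations above, in a numerically safer way). Here pts is an N × 3 array of homogeneous points x_i sampled on the line, and Ex, Ey, Et are the image derivatives at those points.

    import numpy as np

    def track_affine(pts, Ex, Ey, Et):
        # least-squares estimate of H = [h1 h2]^T aggregated over one image line
        A = np.hstack([Ex[:, None] * pts, Ey[:, None] * pts])   # i-th row: [Ex_i x_i^T, Ey_i x_i^T]
        b = Ex * pts[:, 0] + Ey * pts[:, 1] - Et                # b_i = Ex_i x_i + Ey_i y_i - Et_i
        h = np.linalg.lstsq(A, b, rcond=None)[0]
        return h[:3], h[3:]                                      # h1, h2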

    4.3. Efficiently Computing Determinants

A critical step of the algorithm in Section 3.3.2 is to compute the determinants of the 5 × 5 sub-matrices of G. It is important to carefully design the computation of these determinants, since the number of arithmetic operations can explode very quickly, which might adversely affect the numerical behavior of the algorithm.

As an illustration, to compute the polynomial corresponding to u_1 in (12), consider the matrix G_1', which is \hat{G}_1 with the column order reversed. Then, det(G_1'^\top) = det(\hat{G}_1). Each of the lower 4 × 4 sub-determinants of G_1'^\top is a polynomial of degree 2. It now requires relatively little book-keeping to determine the coefficients of z^4, ..., z^0 in u_1 by appropriately convolving the coefficients of degree 2 polynomials in the first row of G_1'^\top with those from its lower 4 × 4 sub-determinants. The symbolic interactions of the coefficients are pre-computed and only need to be evaluated for values of v_1, ..., v_4. Similar steps can be taken to compute the coefficients of various powers of z in u_2, ..., u_6.
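For checking such an implementation, the sub-determinants can also be computed naively with polynomial arithmetic (products of numpy Polynomials are exactly the coefficient convolutions mentioned above). The recursive cofactor expansion below is our own illustrative sketch and makes no attempt at the pre-computed symbolic bookkeeping of the optimized version.

    from numpy.polynomial import Polynomial

    def poly_det(M):
        # determinant of a square matrix given as a list of lists of numpy Polynomials in z
        if len(M) == 1:
            return M[0][0]
        det = Polynomial([0.0])
        for j in range(len(M)):
            minor = [row[:j] + row[j + 1:] for row in M[1:]]
            det = det + (-1) ** j * M[0][j] * poly_det(minor)
        return det

    # e.g. u_i(z) = (-1) ** (i - 1) * poly_det(G_hat_i), where G_hat_i is the 5 x 5 matrix
    # obtained by deleting the i-th column of G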

5. Experiments

We conduct experiments for performance evaluation with synthetic data, across various noise levels, for the linear solution of Section 3.2, the minimum data problem solved to its global minimum (Section 3.3.1) and the efficient implementation of the minimum data solution (Section 3.3.2). The performance of the minimum data solution in a RANSAC framework is evaluated across varying percentages of outliers.

All the results quoted are for an implementation in unoptimized MATLAB. The polynomial solver used for global optimization is Gloptipoly [9], while polynomial roots in Section 3.3.2 are extracted using the MATLAB function roots().

    5.1. Synthetic Data

The first set of experiments is designed to quantify the performance of the various algorithms across various noise levels and for various types of motion. Specifically, we consider three types of motion: random, sideways translation and forward translation. All the values reported for these experiments are RMS errors over 1000 trials.

For each trial, lines are generated as the join of two randomly generated points in the cube [−1, 1]^3. The first pair of stereo cameras, with baseline 0.1 m and a very small relative rotation, is generated in a random orientation. The position and orientation of the second stereo pair is randomly generated according to the type of experiment. Zero mean, Gaussian noise of varying standard deviations is added to the coordinates of the image lines in each view.

The angular error in rotation and the error in translation for random motion are plotted in Figure 3. Two lines are required for the minimum data algorithms, while an additional line is included for the linear algorithm. Also, for the linear algorithm, nonlinear minimization is used with the linear estimate as initialization, while raw errors are reported for the minimum data solvers. As expected, the minimum data algorithms, which account for orthonormality, clearly outperform the linear method. The average solution time for the global optimization algorithm is 300 ms, while it is 2.44 ms for the efficient version.

[Figure 3. Rotation and translation errors with random motions across varying noise levels for the linear algorithm (red, dotted curve), the minimum data algorithm with the global solver (blue, dashed curve) and the efficient minimum data algorithm (black, solid curve). (a) Rotation error. (b) Translation error. All quantities plotted are averages of 1000 trials.]

In some cases, the solver fails to generate a result. This might be due to numerical errors or due to the two lines being in a nearly degenerate position, for instance, both vertical. For the minimum data solver, the number of breakdowns varies from 0.1% for the lowest noise level to 10.1% for the highest one. Note that 1% noise, especially with lines as features, can be considered quite high.

The next subset of experiments creates scenarios where the stereo baseline is small, while the camera undergoes pure translational motion. The rotation and translation errors for purely forward motion are plotted in Figure 4, while those for purely sideways motion aligned with the stereo baseline are plotted in Figure 5. While higher than the errors for the easier subset of experiments with random motion, the errors obtained here are still acceptable for reasonable noise levels.

[Figure 4. Rotation and translation errors with purely forward translation across varying noise levels. (a) Rotation error. (b) Translation error.]

[Figure 5. Rotation and translation errors with purely sideways translation across varying noise levels. (a) Rotation error. (b) Translation error.]

Next, we quantify the behavior of the efficient version of the minimum data solver in a set-up that recreates its intended use, that is, in a RANSAC framework. For the first set of experiments, 100 random lines are generated as described previously, while the wide baseline stereo pair's motion is random. The noise level is kept fixed at 0.1% and the number of outliers in the data varies from 1% to 50%. The adaptive RANSAC strategy, described in [8], is used with a conservative probability p = 0.99 that at least one of the samples is outlier-free. All the figures reported are averages of 500 trials.
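For reference, the standard hypothesis-count bound used by adaptive RANSAC [8] is reproduced below (our own helper); it is what drives the difference between the two-line minimal sampler and the three-line linear sampler reported in Tables 1 and 2.

    import numpy as np

    def ransac_iterations(p, outlier_frac, sample_size):
        # number of hypotheses so that, with confidence p, at least one sample is outlier-free
        w = 1.0 - outlier_frac                    # inlier ratio
        return int(np.ceil(np.log(1.0 - p) / np.log(1.0 - w ** sample_size)))

    # e.g. ransac_iterations(0.99, 0.3, 2) for the 2-line minimal solver,
    #      ransac_iterations(0.99, 0.3, 3) for the 3-line linear solver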

Besides the rotation and translation errors, we also record the number of hypotheses required for termination and the computation time. Comparative estimates of the number of hypotheses for various outlier levels, for the efficient minimum data solver and the linear three-line solver, are tabulated in Table 1. The average angular rotation and translation errors across all the trials are tabulated in the upper rows of Table 3 (the errors are practically the same for various outlier levels).

    Alg.      No. of hypotheses for various outlier %
              0     1     5     10    20    30    40    50
    Minimal   2.3   2.7   3.0   3.7   4.9   6.4   8.0   9.8
    Linear    1.9   2.2   3.0   3.9   5.8   8.3   11.4  15.5

Table 1. Required number of hypotheses across various outlier levels, with 100 lines in the scene, for RANSAC to terminate with 99% confidence of obtaining an outlier-free set. The base noise level is 0.1%. The first row corresponds to the minimum data solution, which has a sample set of two lines, while the second row is for the linear solution with a sample set of three lines. All tabulated quantities are averages of 500 trials.

For the next set of experiments, we create a scenario with only 20 lines and a narrow stereo baseline, which might arise in indoor environments where not many straight edges are present. The noise level is again kept fixed at 0.1% and p = 0.99 for RANSAC. The average rotation and translation errors obtained over all trials are recorded in the lower rows of Table 3. The number of hypotheses for various outlier levels, for the efficient minimum data solver and the linear three-line solver, are tabulated in Table 2. From the corresponding columns of Tables 1 and 2, it is interesting to observe that the required number of hypotheses is quite comparable for the two experiments.

    Alg.      No. of hypotheses for various outlier %
              0     10    20    30    40    50
    Minimal   3.4   4.6   5.6   6.9   8.7   10.8
    Linear    2.7   4.3   6.3   8.8   12.4  17.0

Table 2. Required number of hypotheses across various outlier levels, with only 20 lines in the scene and a narrow stereo baseline, for RANSAC to terminate with 99% confidence of obtaining an outlier-free set.

    Expt       Alg.      Rotation error (deg)   Translation error (m)
    100 lines  Minimal   0.0363                  0.0035
               Linear    0.2195                  0.0183
    20 lines   Minimal   0.0531                  0.0045
               Linear    0.2616                  0.0208

Table 3. The average rotation and translation errors after RANSAC with 99% probability of an outlier-free set. The experiment with 100 lines has a wide baseline stereo configuration, while the following experiment is more challenging, with only 20 lines and a narrow baseline. All tabulated quantities are averaged over 500 trials and various outlier levels.

    5.2. Real Data

The image sequences for our real data experiments are acquired using a stereo pair with a baseline of 7.4 cm and a wide field-of-view of 110° × 80° for each camera.

The first dataset is obtained by imaging a set of boxes rotating on a turntable. This is an interesting dataset because there are very few straight lines for estimation, with some outliers in the form of background lines that do not move with the turntable. Further, low contrast between the faces of the boxes and significant orientation/order changes even due to small motions make line detection and tracking relatively difficult.

[Figure 7. Structure and motion estimation from the computer sequence. (top left) Frame #206 overlaid with lines; red boxes around line numbers indicate inliers of the estimated motion. (right) The reconstructed 3D lines and camera motions after within-frame refinement using inliers obtained by the minimum data solver. About 7% of the motions had rotation larger than 5° or translation larger than 2 cm, and these failure cases were ignored. (bottom left) Zoom-up of the estimated camera motions.]

Yet, as Figure 6 shows, the quality of the structure and motion estimates obtained by the minimum data solver is quite good. From Fig. 6(b), it can be seen that the loop of camera motion is nearly closed and the cameras are pointing inward. Further, the coplanarity of the motion and the accuracy of the structure (from the orthogonality of adjacent edges, for instance) are evident from Fig. 6(c). Note that no inter-frame bundle adjustment has been used for these reconstructions, but a local gradient descent is used after rejecting outliers to refine the RANSAC estimate for a given frame. The output without this local refinement is shown in Fig. 6(d). The algorithm fails to produce a good estimate for about 6% of the frames, which are marked as red circles in the reconstructed camera sequence. We note that bundle adjustment over a few frames can overcome these errors.

The next dataset has larger scene depths, of the order of 1 m, so the stereo baseline is quite narrow. The reconstruction obtained, illustrated in Figure 7, is again good and closely mimics the ground truth motion of the camera.

Additional results have been uploaded as supplemental material.

6. Discussions

In this paper, we have reported developments in the construction of a robust, real-time stereo-based system for performing structure and motion estimation using infinite lines. Our primary contribution is a fast polynomial optimization algorithm that requires a minimum amount of data and is ideal for use in a RANSAC framework.

[Figure 6. Structure and motion estimation from the box sequence. (a) Frames #318 and #361 (the last frame) overlaid with lines; red boxes around line numbers indicate inliers of the estimated motion. (b, c) Top view and side view of the reconstructed 3D lines and camera motions after (within-frame) gradient descent refinement using inliers from RANSAC. (d) Top view of the reconstructed camera motions by the efficient minimum data algorithm.]

As with any structure from motion system, bundle adjustment forms an integral part of the framework, but was not discussed in this paper to better illustrate the properties of the minimum data algorithms. Further, we note that it is rather straightforward to also incorporate information from point features in our polynomial solver, but we leave the details to an extended version of the paper. Also covered in extended works will be a detailed evaluation of the numerical behavior of the algorithm and our steps for fully incorporating stereo depth information into the structure and motion estimation using lines.

All the timings reported in the paper are for an unoptimized MATLAB prototype; we expect the system to achieve real-time rates in our ongoing C implementation.

References

[1] A. Akbarzadeh et al. Towards urban 3D reconstruction from video. In 3DPVT, pages 1–8, 2006.
[2] A. Bartoli and P. Sturm. The 3D line motion matrix and alignment of line reconstructions. IJCV, 57(3):159–178, 2004.
[3] A. Bartoli and P. Sturm. Structure-from-motion using lines: Representation, triangulation and bundle adjustment. CVIU, 100(3):416–441, 2005.
[4] A. Comport, E. Malis, and P. Rives. Accurate quadrifocal tracking for robust 3D visual odometry. In ICRA, pages 40–45, 2007.
[5] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24:381–395, 1981.
[6] R. I. Hartley. Lines and points in three views and the trifocal tensor. IJCV, 22(2):125–140, 1997.
[7] R. I. Hartley. Computation of the trifocal tensor. In ECCV, pages 20–35, 1998.
[8] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[9] D. Henrion and J.-B. Lasserre. Global optimization over polynomials with Matlab and SeDuMi. ACM Math. Soft., 29(2):165–194, 2003.
[10] A. Heyden. Geometry and Algebra of Multiple Projective Transformations. PhD thesis, Lund University, 1995.
[11] Y. Liu and T. S. Huang. A linear algorithm for motion estimation using straight line correspondences. CVGIP, 44(1):33–57, 1988.
[12] D. Nistér. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–770, 2004.
[13] D. Nistér, O. Naroditsky, and J. Bergen. Visual odometry. In CVPR, pages 652–659, 2004.
[14] D. Nistér and H. Stewénius. A minimal solution to the generalized 3-point pose problem. JMIV, 27(1):67–79, 2007.
[15] H. Pottmann and J. Wallner. Computational Line Geometry. Springer, 2001.
[16] L. Quan and Z. Lan. Linear N-point camera pose determination. PAMI, 21(8):774–780, 1999.
[17] G. Schindler, P. Krishnamurthy, and F. Dellaert. Line-based structure from motion for urban environments. In 3DPVT, pages 846–853, 2006.
[18] A. Shashua and L. Wolf. On the structure and properties of the quadrifocal tensor. In ECCV, pages 710–724, 2000.
[19] H. Stewénius, C. Engels, and D. Nistér. Recent developments on direct relative orientation. ISPRS, 60:284–294, 2006.
[20] C. J. Taylor and D. J. Kriegman. Structure and motion using line segments in multiple images. PAMI, 17(11):1021–1032, 1995.
[21] Z. Zhang and O. D. Faugeras. Estimation of displacements from two 3-D frames obtained from stereo. PAMI, 14(12):1141–1156, 1992.
