arxiv:2108.12876v1 [cs.cv] 29 aug 2021

9
Noname manuscript No. (will be inserted by the editor) Solving Viewing Graph Optimization for Simultaneous Position and Rotation Registration Seyed-Mahdi Nasiri · Reshad Hosseini · Hadi Moradi Received: date / Accepted: date Abstract A viewing graph is a set of unknown camera poses, as the vertices, and the observed relative mo- tions, as the edges. Solving the viewing graph is an essential step in a Structure-from-Motion procedure, where a set of relative motions is obtained from a col- lection of 2D images. Almost all methods in the liter- ature solve for the rotations separately, through rota- tion averaging process, and use them for solving the positions. Obtaining positions is the challenging part because the translation observations only tell the direc- tion of the motions. It becomes more challenging when the set of edges comprises pairwise translation obser- vations between either near and far cameras. In this paper an iterative method is proposed that overcomes these issues. Also a method is proposed which obtains the rotations and positions simultaneously. Experimen- tal results show the-state-of-the-art performance of the proposed methods. SM. Nasiri E-mail: [email protected] School of ECE, College of Engineering, University of Tehran, Tehran, Iran R. Hosseini E-mail: [email protected] Tel.: +98-21-82089799 School of ECE, College of Engineering, University of Tehran, Tehran, Iran School of Computer Science, Institute of Research in Funda- mental Sciences (IPM), Tehran, Iran H. Moradi E-mail: [email protected] Tel.: +98-21-82084960 School of ECE, College of Engineering, University of Tehran, Tehran, Iran Intelligent Systems Research Institute, SKKU, South Korea Keywords Viewing Graph · Structure-from-Motion · Camera Pose Registration 1 Introduction Structure-from-Motion (SfM) refers to a process in which a set of 3D points are reconstructed from their projec- tions on a given set of images. Almost all SfM tech- niques can be classified into two categories, sequen- tial or incremental techniques [27,1,8,36,9,15,28], and global or batch techniques [16,19,2,7,30]. As the incremental approaches often suffer from large drifting error, it is recommended to use global methods to achieve better accuracy and consistent models. The global methods consist of the following main steps: 1) Pairwise image registration: Estimating relative rota- tions and movement directions between pairs of images using feature points [31,22,17]. The result is a graph, i.e. the viewing graph, whose edges are the relative pose measurements. 2) Solving the viewing graph: Cam- era poses estimation [23,30,16,37,13,6,4], which is the topic of this paper. 3) Triangulation: Reconstructing 3D points by triangulating corresponding points from cal- culated camera points [14,29,5]. 4) Bundle adjustment: Camera poses and 3D points refinement by reprojection error minimization [32, 18]. Since the relative translation observations only con- tains the movement directions and not their scale, solv- ing the viewing graph is the challenging step in the SfM process. To simplify the problem, almost all viewing graph solvers determine the rotations using only the rotation observations and then use them to compute the camera positions [35,11,23,16]. Obtaining a set of rotations from their relative observations is known as arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

Upload: others

Post on 28-Oct-2021

19 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

Noname manuscript No.(will be inserted by the editor)

Solving Viewing Graph Optimization for SimultaneousPosition and Rotation Registration

Seyed-Mahdi Nasiri · Reshad Hosseini · Hadi Moradi

Received: date / Accepted: date

Abstract A viewing graph is a set of unknown camera

poses, as the vertices, and the observed relative mo-

tions, as the edges. Solving the viewing graph is an

essential step in a Structure-from-Motion procedure,

where a set of relative motions is obtained from a col-

lection of 2D images. Almost all methods in the liter-

ature solve for the rotations separately, through rota-

tion averaging process, and use them for solving the

positions. Obtaining positions is the challenging part

because the translation observations only tell the direc-

tion of the motions. It becomes more challenging when

the set of edges comprises pairwise translation obser-

vations between either near and far cameras. In this

paper an iterative method is proposed that overcomes

these issues. Also a method is proposed which obtains

the rotations and positions simultaneously. Experimen-tal results show the-state-of-the-art performance of the

proposed methods.

SM. NasiriE-mail: [email protected] of ECE, College of Engineering, University of Tehran,Tehran, Iran

R. HosseiniE-mail: [email protected].: +98-21-82089799School of ECE, College of Engineering, University of Tehran,Tehran, IranSchool of Computer Science, Institute of Research in Funda-mental Sciences (IPM), Tehran, Iran

H. MoradiE-mail: [email protected].: +98-21-82084960School of ECE, College of Engineering, University of Tehran,Tehran, IranIntelligent Systems Research Institute, SKKU, South Korea

Keywords Viewing Graph · Structure-from-Motion ·Camera Pose Registration

1 Introduction

Structure-from-Motion (SfM) refers to a process in which

a set of 3D points are reconstructed from their projec-

tions on a given set of images. Almost all SfM tech-

niques can be classified into two categories, sequen-

tial or incremental techniques [27,1,8,36,9,15,28], and

global or batch techniques [16,19,2,7,30].

As the incremental approaches often suffer from large

drifting error, it is recommended to use global methods

to achieve better accuracy and consistent models. The

global methods consist of the following main steps: 1)

Pairwise image registration: Estimating relative rota-

tions and movement directions between pairs of images

using feature points [31,22,17]. The result is a graph,

i.e. the viewing graph, whose edges are the relative

pose measurements. 2) Solving the viewing graph: Cam-

era poses estimation [23,30,16,37,13,6,4], which is the

topic of this paper. 3) Triangulation: Reconstructing 3D

points by triangulating corresponding points from cal-

culated camera points [14,29,5]. 4) Bundle adjustment:

Camera poses and 3D points refinement by reprojection

error minimization [32,18].

Since the relative translation observations only con-

tains the movement directions and not their scale, solv-

ing the viewing graph is the challenging step in the SfM

process. To simplify the problem, almost all viewing

graph solvers determine the rotations using only the

rotation observations and then use them to compute

the camera positions [35,11,23,16]. Obtaining a set of

rotations from their relative observations is known as

arX

iv:2

108.

1287

6v1

[cs

.CV

] 2

9 A

ug 2

021

Page 2: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

2 Seyed-Mahdi Nasiri et al.

rotation averaging problem and has been studied in the

computer vision community [21,12,3,13].

The rotations obtained by solving the rotation av-

eraging problem are used to change the representation

of direction observations from local coordinates to a

global coordinate system. The camera location estima-

tion problem is to find camera positions from these rel-

ative directions in the global coordinate frame. A so-

lution is to find camera locations which minimize the

squared sum of direction errors [35]. The difficulty of

solving this non-linear cost function led most of the

available methods in the literature to change the cost

function from the direction error to the displacement er-

ror. To this end, unknown scale factors are added to the

problem which should be estimated besides the camera

poses. Some of these approaches used the cross product

of solution translations with direction observations as

displacement error [11,2]. These approaches suffer from

the property of cross product distance that decreases for

an angle error of more than 90 degrees. The constrained

least-squared [34,33], and least-unsquared chordal dis-

tances [24,23] are other commonly used cost functions.

Using the displacement error instead of direction error

biases the errors in proportion to the length of edges.

In this paper, we show that the constrained least-

squared formulation for solving camera positions has

inherit limitation that decreases its performance. We,

therefore, propose an iterative algorithm to solve the

original displacement error or direction error cost func-

tions starting from the solution of the constrained least-

squared formulation. Experimental results show the ef-

ficacy of the proposed methods. We also propose itera-

tive methods for solving simultaneous position and ro-

tation registration. These methods solves a pose graph

optimization (PGO) problem in each step, that thanks

to recent success of PGO solvers can be solved effi-

ciently and with high accuracy. Experimental results

show that our proposed methods significantly outper-

form most accurate methods proposed in the literature.

2 Preliminaries

2.1 Viewing-Graph

Given n views or images from a scene, a subset of(n2

)pairwise relative motions can be estimated. The camera

poses, as the vertices, and the pairwise relative motions,

as the observed edges, construct the viewing graph.

Each relative motion observation comprises a relative

rotation and a relative direction. Relative directions are

vectors from an endpoint of an edge to another one and

are represented in local coordinate frames of vertices.

A set of vertices, V , and its relevant set of edges, E,

form a viewing graph G = {V,E}. The parameters of

the ith vertex, vi, are the position and the orientation

of the ith camera with respect to a global coordinate

frame, i.e. vi = {pi, Ri}. An edge between the ith and

jth vertices, represented by eij , comprise a noisy mea-

surement of the relative rotation, Rij ← RTj Ri, and a

noisy measurement of the direction of movement in the

ith coordinate frame, i.e. γlij ← RTi (pj −pi)/‖pj −pi‖.Solving a viewing graph means finding unknown pa-

rameters of the vertices, i.e. camera poses, that match

the edges, i.e. relative observations, as much as possi-

ble. Mathematically speaking, the viewing graph opti-

mization problem solves for the edges that minimize the

following mismatch cost function,

min{pi,Ri}

∑eij={i,j}∈E

d2R(Rij , R

Tj Ri

)+ d2d(γij ,

pj − pi‖pj − pi‖

),

(1)

where, dR(.) and dγ(.) are distance metrics and γij =

Riγlij . The weights can be added to the cost function to

handle different confidence level of measurements and

to balance errors of the two parts of the cost function.

2.2 Common Metrics

Rotations distance metrics are widely studied under the

title of rotation averaging [13,6]. The common metric,

that is also used in our proposed methods, is the Frobe-

nius norm of difference of the rotation matrices:

dR (R1, R2) = ‖R1 −R2‖F . (2)

Several distance metrics can also be used for the di-

rection part of the cost function, i.e. dd(.). Some cases

are listed in [35]. The commonly used metrics are the

chordal and the orthogonal distances which are formu-

lated as follows:

Chordal: dd (u,v) = ‖u− v‖ (3)

Orthogonal: dd (u,v) = ‖Pu⊥(v)‖ = ‖u× v‖ (4)

where Pu⊥ is the projector onto the orthogonal com-

plement of the span of u and “×” represents the cross

product.

Common approaches separate the two parts of the

cost function. The first part, i.e. the rotation averaging

problem [13,21], is independent of the positions and can

be solved independently to obtain the vertices orienta-

tions. The rotation averaging problem is defined as:

min{Ri}

∑eij={i,j}∈E

d2R(Rij , R

Tj Ri

). (5)

Page 3: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

Solving Viewing Graph Optimization for Simultaneous Position and Rotation Registration 3

𝛾𝑖𝑗

𝒑𝑗

𝒑𝑖 𝒑𝑗 − 𝒑𝑖

‖𝒑𝑗 − 𝒑𝑖‖

𝑎) ‖𝛾𝑖𝑗 −𝒑𝑗 − 𝒑𝑖

‖𝒑𝑗 − 𝒑𝑖‖‖

𝑏) ‖𝛾𝑖𝑗 ×𝒑𝑗 − 𝒑𝑖

‖𝒑𝑗 − 𝒑𝑖‖‖

𝑑) ‖𝛾𝑖𝑗 × (𝒑𝑗 − 𝒑𝑖)‖

𝑐) ‖(‖𝒑𝑗 − 𝒑𝑖‖𝛾𝑖𝑗) − (𝒑𝑗 − 𝒑𝑖)‖

Fig. 1 The geometric interpretation of different error crite-ria. a) Chordal distance of directions error. b) Orthogonaldistance of directions error. c) Chordal distance of displace-ments error. d) Orthogonal distance of displacements error.

The obtained orientations are used to calculate γijs

from γlijs. Now the second part of the cost function is

a function of positions and known as “camera location

estimation” problem. The objective function is designed

to minimize directions error and is formulated as:

min{pi}

∑eij={i,j}∈E

d2d(γij ,pj − pi‖pj − pi‖

). (6)

There are global scale and translation ambiguities in

the solutions of (6), and if optimization methods can

cope with over-parametrization there is no need to re-

move the ambiguities. Some existing methods however

use some constraints to resolve the ambiguites. The

constraints∑

pi = 0 or fixing the first vertex at the

origin are the common constraints to remove the trans-

lation ambiguity, and the constraints like∑‖pi‖2 = 1

or ‖pi−pj‖ > 1,∀(i, j) ∈ E can be used to remove the

scale ambiguity.

A different approach to find camera locations is to

minimize the displacements error instead of directions

error. In this case the problem is formulated as:

min{pi}

∑eij={i,j}∈E

d2d(‖pj − pi‖γij ,pj − pi). (7)

In this case. it is important to consider a constraint

like∑‖pi‖2 = 1 to prevent the trivial solutions of plac-

ing all camera positions at the same point.

The combinations of two distance metrics, i.e. or-

thogonal and chordal, and the two aforementioned ap-

proaches, i.e. minimizing directions or displacements er-

ror, form a set of four different error criteria which were

used by different methods to solve the camera loca-

tion estimation problem. The geometric interpretation

of these error criteria is shown in Fig. 1.

3 Limitations of Existing Methods

In this section, we first talk about a limitation of an

existing method to solve the cost function of (7). The

method is fast and gives relatively good results and we

use it as the initialization procedure of our proposed

methods. In the next part, we show the limitation of the

orthogonal distance, and therefore the existing methods

in the literature that use this cost function.

3.1 Linear constraint problem

A common approach of solving camera locations is to

solve the following alternative problem:

min{pi,λij}

∑eij={i,j}∈E

‖λijγij − (pj − pi)‖2 , (8)

where the scale factors, λijs, are unknown parameters.

For λij = ‖pj−pi‖, problem (8) is the same as (7) with

the chordal distance, which minimizes the displacement

error, i.e. Euclidean distance between the endpoint of

vectors λijγij and pj − pi. In practice, the constraint

λij = ‖pj − pi‖ is dropped and the linear constraint

λij ≥ 1 is replaced, which also handles the scale am-

biguity. Like [3], we refer to the problem of (8) with

the constraint λij ≥ 1 as the constrained least-squares

(CLS).

The cost function of (8) is quadratically related to

the global scale, i.e. the size of the vertex set. Reducing

the global scale reduces the cost quadratically. Obvi-

ously, setting the global scale to zero, results in zero

cost with the trivial solution of {pi = 0,∀i ∈ V }. How-

ever, the constraint λij ≥ 1, prevents a solver to get

this trivial solution. In fact, λijs represent the distance

between the vertices, and the constraint prevents the

vertices from getting too close to each other.

At a first glance, it seems that the constraint causes

the global scale to be determined according to the min-

imum distance between the vertices. This implies that

the smallest scale factor, i.e. the minimum λij corre-

sponding to the minimum distance of the vertices, is

set to one and the other scale factors are estimations of

the distance of their endpoints divided by the minimum

distance of the vertices. But in practice, a significant

number of λijs are exactly set to one. In fact, λijs for

a subset of edges, with shorter length than the others,

are set to one. This will increase the cost on the subset

of edges, but will reduce the global scale and reduce

the cost on the other edges with longer length. There-

fore, the cost is decreased, but λijs and consequently

the positions deviate from their real values.

Fig. 2 compares the obtained λijs by CLS and one

of our proposed methods for a dataset. In this figure,

Page 4: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

4 Seyed-Mahdi Nasiri et al.

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

0

2

4

6

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

0

2

4

6

Fig. 2 Comparing the obtained scale factors, i.e. λijs, inCLS and Alg. 1C on “Tower of London” dataset. A significantnumber of scale factors, obtained by CLS, are stick to thelower bound and set to 1. That is while the scale factorsobtained by Alg. 1C match the distances between estimatedcamera positions.

there are a large number of vertices with small ‖pj −pi‖, such that λij is set to 1 and therefore does not

match ‖pj − pi‖. Furthermore, it shows that for CLS

the mismatches also occurred in large λijs. In contrast

in our proposed method, λijs and their corresponding

‖pj − pi‖, completely match.

3.2 Orthogonal metric problem

Orthogonal distance is a common distance metric usedin the literature for solving the camera locations. It is

actually an easier metric to be solved and one can have

interesting convergence results. It can be shown that

the following cost function that has multipliers λij can

be solved iteratively and at the minimum is equal to

the orthogonal distance metric (see Appendix for more

details).

min{pi,λij}

∑eij={i,j}∈E

‖γij − λij(pj − pi)‖2 . (9)

The orthogonal distance decreases for an angle error

of more than 90 degrees. This means that the angle be-

tween γij and (pj−pi), and also the Euclidean distance

between them, can be increased while their orthogonal

distance is reduced. This can results in undesirable re-

sults in methods based on this distance metric.

We construct a simple example to demonstrate the

problem of orthogonal distance. Fig. 3.2 shows a graph

with six vertices. Vertices number 5 and 6, placed near

1

2

3

45 6

Fig. 3 The results of applying algorithms CLS (green plus),the cross product minimization of [11] (purple cross), andorthogonal projection minimization of [10] (orange stars). Thegraph vertices and edges are shown as blue dots and greenlines.

the center of the graph, are closer to each other than the

other vertices. The rotation observations assumed to be

exact and the direction observations are perturbed by

5 degrees error. The cross product method of [11] and

ShapeFit method of [10] and the CLS algorithms are

applied to the problem and the results are shown in the

figure. The relative positions of the first four vertices in

all three methods are close to their true relative posi-

tions, and the four vertices are located in the four ver-

tices of a rhombus. The positions of vertices number 5

and 6 estimated by CLS are far from their true relative

positions, but these vertices are still inside the rhom-

bus of the first four vertices. That is while the positions

of vertices number 5 and 6 estimated by the other two

methods is outside of the rhombus in a situation that

their positions relative to one of the vertices 1 and 2 is in

the opposite direction of the real relative direction. Note

that, when the angle between the relative position and

the direction observation is about 180 degrees, i.e. they

are in opposite directions, then the orthogonal distance

between them are very small. Therefore, the cost value

obtained by the methods based on the orthogonal dis-

tance will be minimized, while the relative positions of

some vertices are completely in different position with

respect to their real relative positions to the positions

of other vertices.

4 The Proposed Methods

To overcome the problem of linear constraint weakness,

discussed in 3.1, we propose an iterative solver used in

two different approaches, minimizing sum of squared of

displacements and directions error, to solve the viewing

graph problem. The main idea is to use λijs obtained

by CLS at the first iteration and then replace them

with fixed coefficients according to the obtained posi-

tions in the last iteration. We are not able to prove the

Page 5: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

Solving Viewing Graph Optimization for Simultaneous Position and Rotation Registration 5

Algorithm 1

Input: Edges of a viewing graph (Rij , γlij , ∀{i, j} ∈ E)approach:“C” minimizing the common used objective function

which comprises displacements error.“O” minimizing the original objective function which

comprises directions error.Output: Estimated poses of vertices (Ri, pi, ∀i ∈ V )

1: Ri ← RS(Rij) . Find initial orientations by RS

2: γij ← Riγlij , ∀ek = {i, j} ∈ E . Move directions toglobal coordinate

3: G′ = {E′, V ′} ← 1DSFM(G = {E, V }) . Removeoutliers by 1DSFM

4: {pi} ← CLS(γij) . Solve CLS to obtain positions5: repeat6: if approach=“C” then7: λij ← ‖pj − pj‖, ∀e′k = {i, j} ∈ E′8: {pi} ← LS1(λijγij) . Solve (8) with fixed λijs

to update positions9: else if approach=“O” then

10: λij ← 1/‖pj − pj‖, ∀e′k = {i, j} ∈ E′11: {pi} ← LS2(λij , γij) . Solve (9) with fixed λijs

to update positions12: end if13: until convergence14: return Ri, pi, ∀i ∈ V ′

convergence of the iterative solvers, but we observe ex-

perimentally that all of our iterative solvers converge.

To minimize the sum of squared displacements er-

ror, the formulation of (8) is employed and λijs are set

to obtained distance of the endpoints of the vertices i

and j in the last iteration of the algorithm. Therefore

upon convergence, the algorithm yields λij = ‖pj−pi‖,where pis are the estimated camera positions and we

are solving the actual optimization problem of (7).

The proposed algorithm summarized in Alg. 1 con-

sists of the following steps:

1. Use the RS algorithm [20] to solve (5) and find the

orientation of the vertices. (RS uses the Frobenius

norm for dR(.))

2. Use 1DSFM algorithm [35] to remove the outliers.

3. Solve (8) on the common constraint of λij ≥ 1 to

estimate pis.

4. Replace parameters λij with fixed coefficients λij =

‖pj − pi‖.5. Solve (8) with fixed λijs, which now is a uncon-

strained linear least-squares problem. (We refer to

it as LS1 in Alg. 1)

6. Repeat the last two steps until convergence.

To minimize the directions error instead of displace-

ments error, we can use the formulation of (9). There-

fore the 4th and 5th step of the algorithm is replaced

by:

4. Replace parameters λij with fixed coefficients λij =

1/‖pj − pi‖.

Algorithm 2

Input: Edges of a viewing graph (Rij , γlij , ∀ek = {i, j} ∈E)

approach:“C” minimizing the common used objective function

comprises displacements error.“O” minimizing the original objective function com-

prises directions error.Output: Estimated poses of vertices (Ri, pi, ∀i ∈ V )

1: Ri ← RS(Rij) . Find initial orientations by RS

2: γij ← Riγlij , ∀ek = {i, j} ∈ E . Move directions toglobal coordinate

3: G′ = {E′, V ′} ← 1DSFM(G = {E, V }) . Removeoutliers by 1DSFM

4: {pi} ← CLS(γij) . Solve CLS to obtain positions5: repeat6: γij ← Riγlij , ∀ek = {i, j} ∈ E . Move directions to

global coordinate according to the updated rotations7: if approach=“C” then8: λij ← ‖pj − pj‖, ∀e′k = {i, j} ∈ E′9: wij ← 1, ∀e′k = {i, j} ∈ E′

10: else if approach=“O” then11: λij ← ‖pj − pj‖, ∀e′k = {i, j} ∈ E′12: wij ← 1

‖pj−pj‖2 , ∀e′k = {i, j} ∈ E′

13: end if14: {pi, Ri} ← PS(λijγlij , Rij , wij) . Use PS [20] to

update positions and orientations15: until convergence16: return Ri, pi, ∀i ∈ V ′

5. Solve (9) with fixed λijs, which now is a uncon-

strained linear least-squares problem. (We refer to

it as LS2 in Alg. 1)

Fixing the parameters λijs converts the direction

observations to translation observations. This means

that the given viewing graph is converted to a Pose-

Graph, and the problem is turned to a Pose-Graph op-

timization problem:

min{pi,Ri}

∑eij={i,j}∈E

∥∥Rij , RTj Ri∥∥2F +

wij ‖λijγij − (pj − pi)‖2 .(10)

The literature of the Pose-Graph optimization has

a wide variety of algorithms from which there are algo-

rithms that solve the the orientations and positions of

the vertices simultaneously [20,25]. This was the moti-

vation of our second method which optimizes the po-

sitions and the orientations together. The idea is to

change the 5th step of the algorithm with the state-of-

the-art PS algorithm proposed in [20] to find the ver-

tices’ poses. The proposed algorithm is summarized in

Alg. 2.

In both of our proposed algorithm, we use displace-

ment error in the cost function without any constraint

to avoid trivial solution. Since we use a good initializa-

tion obtained by the CLS algorithm, we do not see any

Page 6: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

6 Seyed-Mahdi Nasiri et al.

problems in the experiments. Because of the scale prob-

lem using displacement error in the full viewing graph

loss, which was solve by turning the problem into a

PGO in our method, seems unjustified. But using this

error in the full viewing graph loss helps getting impor-

tant insight that becomes clear later in the experiments.

5 Experiments

To evaluate the proposed algorithms, the challenging

real datasets of [35] are used. The camera poses esti-

mations, computed by Bundler [27] and presented in

[35] are considered as the ground truth. In this section,

Alg. 1C or Alg. 2C stand for using the displacements er-

ror in the proposed algorithms, and Alg. 1O or Alg. 2O

stand for using the directions error in each iteration of

the algorithms 1.

For the first step, the weakness of the constraint

λij ≥ 1, which is discussed in section 3.1, is shown ex-

perimentally. To this end, the orientation of the cam-

eras are estimated by solving the rotation averaging

problem of (5) using RS [20]. Outliers are removed by

1DSFM algorithm, and then the positions are estimated

by CLS, i.e. solving (8) on the constraint of λij ≥ 1. The

results are compared to the results of Alg. 1C. Fig. 4

compares the histogram of the scale factors ratio to the

corresponding vertices distances obtained by CLS and

Alg. 1C in different datasets. It can be seen that all

scale factors obtained by Alg. 1C, despite CLS, match

their corresponding vertices distances.

In the following, the recovery performance of the

proposed algorithms are evaluated in comparison to themost accurate algorithms proposed in the literature.

The comparing methods are Cross method of [11], CLS

method which is used in [33,34], and LUD method pro-

posed in [23]. Since the problem has natural global ro-

tation, translation, and scale ambiguities, an Euclidean

transformation should be found that aligns the cam-

era positions to the ground truth as much as possible.

Mathematically, we solve

minR,t,s

n∑i=1

∥∥((sRpi + t)− pi)∥∥2 , (11)

where R, t, and s are the rotation matrix, transla-

tion vector, and scale parameters of the scaled Eu-

clidean transformation. The comparison criteria are the

mean (e), the median (e), and the root mean squared (e)

distances error. The root mean squared distances error

1 Efficient implementation of the proposed algorithmsin MATLAB is available via http://visionlab.ut.ac.ir/

resources/vgorls.zip

is given by

e =

√√√√ 1

m

m∑i=1

‖pi − pi‖2, (12)

where m is the number of edges.

The histogram of rotation and direction measure-

ments error, before and after applying 1DSFM for Tower

of London are shown in Fig. 5. The figure shows that

the most of outliers were removed and the output has

a few number of outliers. The similar results were ob-

tained for other datasets.

The results of various methods in different datasets

are listed in Table 1. The results show that in almost all

datasets, the proposed methods outperform the others.

Further, the comparison of Alg. 1 and Alg. 2 demon-

strates that minimizing the cost function over rotations

and translations simultaneously, results in better cam-

era localization. Comparing the results of Alg. 1C to

Alg. 1O and Alg. 2C to Alg. 2O shows that, although

the differences are small, minimizing the displacements

error instead of the directions error results in a bet-

ter camera localization. Because of the scale problem

as discussed in the section of the proposed methods,

using Alg. 2C is not justified. But interestingly, this al-

gorithm shows the best performance which shows that

resulting scales obtained by solving the problem using

our proposed method is fine. This shows the importance

of using correct weight for combining two terms of the

viewing graph optimization problem.

6 Conclusion

In this paper, we proposed iterative methods for solv-

ing the camera location estimation problem. One of our

proposed methods fixed the unknown scale factors, i.e.

λijs, which appear on the common cost function (8), in

each iteration. Hence, the solution is obtained through

solving iterative least-squares problems. The scale fac-

tors in each iteration are set to the distance between

the corresponding vertices distances in the previous it-

eration. We observed experimentally that this helps the

final scale factors to converge to the distances of the cor-

responding vertices, i.e. λij → ‖pj − pi‖. This solved

the problem of CLS algorithm that bunches up a sig-

nificant number of scale factors in the lower bound.

Fixing the parameters λijs converts the direction

observations to translation observations and converts

the given viewing graph to a Pose-Graph. This moti-

vated us to use the state-of-the-art method of PS [20]

to obtain rotations and positions of cameras simulta-

neously and to propose the second algorithm. Obtain-

ing the best results by solving the full cost of viewing

Page 7: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

Solving Viewing Graph Optimization for Simultaneous Position and Rotation Registration 7

-2.5 -2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

105

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

-2.5 -2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

105

-3 -2.5 -2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

105

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

105

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

-2.5 -2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

105

-2 -1.5 -1 -0.5 0 0.5100

101

102

103

104

Fig. 4 Comparing the scale factors obtained by CLS and Alg. 1C on different real datasets. The figures show the histogram

of log10λij

‖pi−pj‖obtained by the methods on different datasets.

Table 1 Performance comparison of various methods in different original datasets. The table depicts the number of edges, m,and the number of vertices, n, after removing the outliers using 1DSFM. The comparison criteria are the root mean squareerror, e, the mean distance error, e, and the median distance error ,e.

Dataset Cross CLS LUD Alg. 1C Alg. 1O Alg. 2C Alg. 2OName m n e e e e e e e e e e e e e e e e e e e e e

Alamo 46353 479 15.94 12.96 12.52 5.36 2.89 1.78 4.89 2.56 1.43 5.46 2.57 1.18 5.39 2.61 1.30 5.14 2.17 0.93 4.83 2.11 0.95Ellis Island 4512 193 30.85 26.18 23.26 11.19 6.87 4.22 10.92 6.17 3.57 10.00 5.71 3.07 10.86 6.17 3.28 10.00 5.37 2.65 10.29 5.50 2.68Madrid Metropolis 4150 253 51.45 35.65 29.23 21.66 14.65 11.36 20.63 13.47 10.16 18.55 12.23 8.51 20.33 13.20 10.01 16.90 10.95 7.05 17.28 10.98 6.79Montreal Notre Dame 26330 385 12.13 11.14 11.99 2.70 1.82 1.13 2.62 1.76 1.10 2.54 1.66 1.01 2.87 1.84 1.05 2.45 1.59 0.88 2.60 1.67 0.96Notre Dame 45048 501 13.32 9.81 7.83 6.26 4.00 2.97 6.27 3.70 2.48 5.71 3.18 2.05 5.84 3.27 2.20 5.28 2.71 1.60 5.39 2.78 1.68NYC Library 5614 260 15.29 13.80 14.05 4.31 3.14 2.28 4.04 2.82 1.88 5.06 3.55 2.41 4.60 3.15 2.05 4.85 3.03 1.92 4.87 2.96 1.82Piazza del Popolo 12027 234 16.43 14.29 12.56 3.30 2.24 1.48 3.10 2.03 1.30 3.04 1.95 1.17 3.31 2.11 1.18 2.35 1.47 0.94 2.52 1.55 0.91Tower of London 7539 371 76.41 66.35 60.20 39.01 26.75 19.02 36.52 24.32 15.37 34.16 20.49 11.86 38.13 25.15 16.79 32.99 18.19 9.26 35.24 20.69 10.96Union Square 5919 468 18.73 14.84 10.81 12.91 9.02 6.61 12.67 8.68 6.27 13.17 9.07 6.74 12.89 9.02 6.89 12.80 8.43 5.75 12.90 8.71 6.19Vienna Cathedral 36729 624 34.58 27.46 22.41 24.27 18.81 14.12 22.84 17.62 13.17 20.90 15.59 10.85 25.45 19.20 13.15 19.56 14.42 10.06 23.32 17.29 11.99Yorkminster 10249 317 27.97 13.74 5.94 15.99 8.68 5.77 17.09 7.85 4.99 15.00 7.64 4.74 14.89 7.45 4.81 14.39 6.88 4.05 14.27 6.70 3.81

graph optimization problem is an important result. It

is an step toward to solve challenging viewing graph

optimization problem and calls other researches to put

emphasis on developing the algorithms considering both

terms together in the viewing graph optimization prob-

lem.

Both algorithms can solve the original cost function

in which the chordal distance of directions are mini-

mized. But the experiments show that minimizing the

common used displacement distance metric (weighting

the directions error by the edges length), results in a

better performance in real datasets. This is an interest-

ing result, because using displacement distance metric

has scale problem which can be more important in the

case of the full viewing graph optimization solved us-

ing our Alg. 2C. Having the best performance of Alg.

2C shows the importance of having good weights for

combining the two term of the viewing graph optimiza-

tion problem. An interesting future direction of work

is obtaining the covariance of noise for using accurate

weights for combining two terms in the loss function of

viewing graph optimization.

We observed experimentally that all our proposed

algorithms converge. Another important line of future

Page 8: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

8 Seyed-Mahdi Nasiri et al.

0 50 100 150 200100

105

0 50 100 150 200100

105

0 50 100 150 200100

102

104

0 50 100 150 200100

102

104

Fig. 5 Histograms of rotation and direction measurementserror, before and after applying the 1DSFM algorithm for“Tower of London” dataset.

research that we are undertaking is developing theoret-

ical convergence results for our proposed algorithms.

Appendix

The author in [11] proposed an iterative reweighted al-

gorithm for solving orthogonal distance of directions

error, i.e. the solution of (6) with orthogonal distance.

The strategy is to minimize orthogonal distance of dis-

placements error which can be solved in the closed-form

and then reweigthing errors by the squared distance of

the edges. Although upon convergence, the algorithm

yields that the weights equal to the edges lengths, there

is no guarantee that the algorithm converges.

We use a novel formulation in which the solution is

directly the minimizer of (6) where the distance metric

is orthogonal distance. We use the cost function of (9)

that uses λij multipliers before the second term. The

problem (9) does not need any constraint and the mini-

mizer of the cost function is directly the solution of (8)

with the orthogonal distance metric.

Suppose that the set of {λ∗ij , p∗i } is the optimal so-

lution of (9). So we have,

∀eij = {i, j} ∈ E,∀λij ∈ R,∥∥γij − λ∗ij(p∗j − p∗i )∥∥2 < ∥∥γij − λij(p∗j − p∗i )

∥∥2 . (13)

Since λij(p∗j − p∗i ) comprises all points on the line

which connects pi to pj , it can be conclude from (13)

that λ∗ij(p∗j−p∗i ) is the nearest point of pipj-line to the

endpoint of γij . Obviously the nearest point obtained

by projecting γij onto pipj-line, and its Euclidean dis-

tance is equal to cross product of γij andp∗j−p

∗i

‖p∗j−p∗i ‖.

Problem (9) can be solved by a two block coordinate

descent approach. The algorithm repeats the following

two steps. In one step, the cost is optimized over pis

while λijs are fixed, and in the next step the cost is

optimized over λijs while pis are fixed vectors. Both

steps are linear least-squares problem and can be solved

in a closed form. It is easy to show that the proposed

coordinate descent algorithm satisfies the convergence

conditions of theorem 1 of [26]. Therefore, the algorithm

finds the minimizer of the sum of squared orthogonal

distance of directions error. Although we were able to

prove a convergence result for minimizing orthogonal

distance, we obtained bad results using this method in

experiments.

References

1. Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Cur-less, B., Seitz, S.M., Szeliski, R.: Building rome in a day.Communications of the ACM 54(10), 105–112 (2011)

2. Arie-Nachimson, M., Kovalsky, S.Z., Kemelmacher-Shlizerman, I., Singer, A., Basri, R.: Global motion esti-mation from point matches. In: International Conferenceon 3D Imaging, Modeling, Processing, Visualization &Transmission, pp. 81–88. IEEE (2012)

3. Arrigoni, F., Magri, L., Rossi, B., Fragneto, P., Fusiello,A.: Robust absolute rotation estimation via low-rank andsparse matrix decomposition. In: International Confer-ence on 3D Vision, pp. 491–498. IEEE (2014)

4. Arrigoni, F., Rossi, B., Fragneto, P., Fusiello, A.: Ro-bust synchronization in SO(3) and SE(3) via low-rankand sparse matrix decomposition. Computer Vision andImage Understanding 174, 95–113 (2018)

5. Byrod, M., Josephson, K., Astrom, K.: Improving nu-merical accuracy of Grobner basis polynomial equationsolvers. In: International Conference on Computer Vi-sion, pp. 449–456. IEEE (2007)

6. Chatterjee, A., Govindu, V.M.: Robust relative rotationaveraging. IEEE Transactions on Pattern Analysis andMachine Intelligence 40(4), 958–972 (2017)

7. Crandall, D., Owens, A., Snavely, N., Huttenlocher, D.:Discrete-continuous optimization for large-scale structurefrom motion. In: Conference on Computer Vision andPattern Recognition, pp. 3001–3008. IEEE (2011)

8. Frahm, J.M., Fite-Georgel, P., Gallup, D., Johnson, T.,Raguram, R., Wu, C., Jen, Y.H., Dunn, E., Clipp, B.,Lazebnik, S., et al.: Building rome on a cloudless day. In:European Conference on Computer Vision, pp. 368–381.Springer (2010)

9. Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: To-wards internet-scale multi-view stereo. In: Computer So-ciety Conference on Computer Vision and Pattern Recog-nition, pp. 1434–1441. IEEE (2010)

10. Goldstein, T., Hand, P., Lee, C., Voroninski, V., Soatto,S.: Shapefit and shapekick for robust, scalable structurefrom motion. In: European Conference on Computer Vi-sion, pp. 289–304. Springer (2016)

11. Govindu, V.M.: Combining two-view constraints for mo-tion estimation. In: Computer Society Conference on

Page 9: arXiv:2108.12876v1 [cs.CV] 29 Aug 2021

Solving Viewing Graph Optimization for Simultaneous Position and Rotation Registration 9

Computer Vision and Pattern Recognition., pp. 218—-225. IEEE (2001)

12. Hartley, R., Aftab, K., Trumpf, J.: L1 rotation averagingusing the weiszfeld algorithm. In: Computer Vision andPattern Recognition (CVPR). IEEE Conference on, pp.3041–3048. IEEE (2011)

13. Hartley, R., Trumpf, J., Dai, Y., Li, H.: Rotation aver-aging. International Journal of Computer Vision 103(3),267–305 (2013)

14. Hartley, R.I., Sturm, P.: Triangulation. Computer Visionand Image Understanding 68(2), 146–157 (1997)

15. Havlena, M., Torii, A., Knopp, J., Pajdla, T.: Random-ized structure from motion based on atomic 3d modelsfrom camera triplets. In: Conference on Computer Visionand Pattern Recognition, pp. 2874–2881. IEEE (2009)

16. Jiang, N., Cui, Z., Tan, P.: A global linear method forcamera pose registration. In: International Conferenceon Computer Vision, pp. 481–488. IEEE (2013)

17. Kukelova, Z., Bujnak, M., Pajdla, T.: Polynomial eigen-value solutions to the 5-pt and 6-pt relative pose prob-lems. In: British Machine Vision Conference, vol. 2, pp.56.1–56.10 (2008)

18. Lourakis, M.I.A., Argyros, A.A.: SBA: A software pack-age for generic sparse bundle adjustment. ACM Trans-actions on Mathematical Software (TOMS) 36(1), 1–30(2009)

19. Moulon, P., Monasse, P., Marlet, R.: Global fusion of rel-ative motions for robust, accurate and scalable structurefrom motion. In: International Conference on ComputerVision, pp. 3248–3255. IEEE (2013)

20. Nasiri, S.M., Hosseini, R., Moradi, H.: Novel parameter-ization for gauss–newton methods in 3D pose graph op-timization. IEEE Transactions on Robotics (2020)

21. Nasiri, S.M., Moradi, H., Hosseini, R.: A linear leastsquare initialization method for 3D pose graph optimiza-tion problem. In: International Conference on Roboticsand Automation, pp. 2474–2479. IEEE (2018)

22. Nister, D.: An efficient solution to the five-point relativepose problem. IEEE Transactions on Pattern Analysisand Machine Intelligence 26(6), 756–770 (2004)

23. Ozyesil, O., Singer, A.: Robust camera location estima-tion by convex programming. In: Conference on Com-puter Vision and Pattern Recognition, pp. 2674–2683.IEEE (2015)

24. Ozyesil, O., Singer, A., Basri, R.: Stable camera motionestimation using convex programming. SIAM Journal onImaging Sciences 8(2), 1220–1262 (2015)

25. Rosen, D.M., Carlone, L., Bandeira, A.S., Leonard, J.J.:Se-sync: A certifiably correct algorithm for synchroniza-tion over the special euclidean group. The InternationalJournal of Robotics Research 38(2-3), 95–125 (2019)

26. Rouzban, S.M., Hosseini, R.: A rate of convergencefor two-block coordinate descent. arXiv preprintarXiv:1901.08794 (2019)

27. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: ex-ploring photo collections in 3d. In: ACM siggraph 2006papers, pp. 835–846 (2006)

28. Snavely, N., Seitz, S.M., Szeliski, R.: Skeletal graphs forefficient structure from motion. In: 2008 IEEE Confer-ence on Computer Vision and Pattern Recognition, pp.1–8. IEEE (2008)

29. Stewenius, H., Schaffalitzky, F., Nister, D.: How hard is3-view triangulation really? In: International Conferenceon Computer Vision, vol. 1, pp. 686–693. IEEE (2005)

30. Sweeney, C., Sattler, T., Hollerer, T., Turk, M., Polle-feys, M.: Optimizing the viewing graph for structure-from-motion. In: International Conference on ComputerVision, pp. 801–809. IEEE (2015)

31. Torr, P.H., Zisserman, A.: Mlesac: A new robust estima-tor with application to estimating image geometry. Com-puter Vision and Image Understanding 78(1), 138–156(2000)

32. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon,A.W.: Bundle adjustment—a modern synthesis. In: In-ternational workshop on vision algorithms, pp. 298–372.Springer (1999)

33. Tron, R., Vidal, R.: Distributed image-based 3-d local-ization of camera sensor networks. In: Conference onDecision and Control held jointly with Chinese ControlConference, pp. 901–908. IEEE (2009)

34. Tron, R., Vidal, R.: Distributed 3-d localization of cam-era sensor networks from 2-d image measurements. IEEETransactions on Automatic Control 59(12), 3325–3340(2014)

35. Wilson, K., Snavely, N.: Robust global translations with1dsfm. In: European Conference on Computer Vision,pp. 61–75. Springer (2014)

36. Wu, C.: Towards linear-time incremental structure frommotion. In: International Conference on 3D Vision-3DV2013, pp. 127–134. IEEE (2013)

37. Zhu, S., Zhang, R., Zhou, L., Shen, T., Fang, T., Tan,P., Quan, L.: Very large-scale global sfm by distributedmotion averaging. In: Conference on Computer Visionand Pattern Recognition, pp. 4568–4577. IEEE (2018)