

VIRAL-Fusion: A Visual-Inertial-Ranging-Lidar Sensor Fusion Approach

Thien-Minh Nguyen, Shenghai Yuan, Muqing Cao, Yang Lyu, Thien Hoang Nguyen, and Lihua Xie, Fellow, IEEE

Abstract—In recent years, Onboard Self Localization (OSL) methods based on cameras or Lidar have achieved significant progress. However, issues such as estimation drift and feature-dependence remain inherent limitations. On the other hand, infrastructure-based methods can generally overcome these issues, but at the expense of some installation cost. This poses an interesting problem of how to effectively combine these methods, so as to achieve localization with long-term consistency as well as flexibility compared to any single method. To this end, we propose a comprehensive optimization-based estimator for the 15-dimensional state of an Unmanned Aerial Vehicle (UAV), fusing data from an extensive set of sensors: inertial measurement units (IMUs), Ultra-Wideband (UWB) ranging sensors, and multiple onboard Visual-Inertial and Lidar odometry subsystems. In essence, a sliding window is used to formulate a sequence of robot poses, where relative rotational and translational constraints between these poses are observed in the IMU preintegration and OSL observations, while orientation and position are coupled in body-offset UWB range observations. An optimization-based approach is developed to estimate the trajectory of the robot in this sliding window. We evaluate the performance of the proposed scheme in multiple scenarios, including experiments on public datasets, a high-fidelity graphical-physical simulator, and field-collected data from UAV flight tests. The results demonstrate that our integrated localization method can effectively resolve the drift issue, while incurring minimal installation requirements.

Index Terms—Localization, Optimization, Aerial Robots

I. INTRODUCTION

State estimation is a fundamental task in developing autonomous mobile robots. The problem is even more critical when dealing with unmanned aerial vehicles (UAVs), since they can be easily destroyed when system failures occur, yet their limited payload capacity often prevents one from including many backup systems on the platform. Thus, depending on the specific scenario, one has to carefully assess the priorities on reliability, accuracy and flexibility, and then settle on an appropriate set of sensors and fusion techniques for the targeted application.

With the goal of achieving a reliable localization system for UAV-based inspection applications of 3D structures, in this work we study a general sensor fusion scheme for fusing measurements from Inertial Measurement Units (IMUs), camera/Lidar-based Onboard Self-Localization (OSL)

The work is supported by National Research Foundation (NRF) Singapore, ST Engineering-NTU Corporate Lab under its NRF Corporate Lab@University Scheme.

The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798.

Email for correspondence: [email protected] (Thien-Minh Nguyen)

systems, and UWB, hence the name Visual-Inertial-Ranging-Lidar Sensor Fusion framework. Indeed, many developers would find OSL the first choice for the localization task. Thanks to many advances in hardware and software design, it has become quite popular and ubiquitous in research and industry in recent years. Currently, one can find reliable and compact off-the-shelf camera/Lidar sensors on the market, along with many open-sourced software packages to be used with these sensors [1]–[11]. However, besides the sensitivity to lighting conditions and the availability of features in the environment, OSL systems have two characteristics that one needs to pay attention to. First, OSL data are often referenced to a local coordinate frame that coincides with the initial pose, and second, there will be significant estimation drift over a long period of time. These issues are particularly challenging in inspection or surveillance applications, where predefined trajectories need to be executed in reference to a chosen so-called world frame. Thus, a small misalignment of the initial pose can eventually lead to large errors towards the end of the operation.

Therefore, in this work we investigate the fusion of OSL data with range measurements to some UWB anchors, which can be deployed in a field to define such a global frame. To keep the system easy to use, we also employ very simple deployment strategies. Moreover, we innovate the use of UWB with a body-offset ranging scheme, where orientation is coupled with the range measurements besides the position. Such coupling ensures that both position and orientation estimates do not drift during the operation. Besides these advantages, since it is well known that most VIOs can potentially "lose track", relative observations that couple consecutive poses are also constructed from IMU data and fused with UWB measurements to ensure the continuity of estimation. Note that despite its high reliability and low cost, the integration of the IMU is not a trivial task, as one has to carefully derive the correct kinematic model, handle the biases, and on top of that properly synchronize the high-rate IMU data with other sensors. These issues are successfully addressed in this paper using a synchronization scheme inspired by the image-IMU synchronization process in state-of-the-art VIO [6], [12] and Lidar-inertial systems [8]–[10]. Hence, IMU, OSL and UWB measurements are effectively fused together to provide accurate and reliable localization for the UAV application.

The remainder of this paper is organized as follows. In Section II, we review some of the most related previous works and highlight our contributions. Section III provides several


TABLE I: Related works and their classifications.

Related works | Sensor types: UWB | IMU | Camera | Lidar | UWB Utilization: State Coupling | Localization Type | Fusion Method
UWB only: [13]–[15] | X | - | - | - | Position | Absolute | EKF
UWB-IMU: [16], [17] | X | X | - | - | Position | Absolute | EKF
Range-only Graph Optimization: [18], [19] | X | - | - | - | Position | Absolute | Optimization
UWB-aided Visual SLAM (VSLAM): [20] | X | - | X | - | Position | Absolute | Optimization
UWB-aided Monocular VSLAM: [21]–[23] | X | - | X | - | Position | Relative | Optimization
VIR-SLAM: [24] | X | X | X | - | Position | Relative | Optimization
UWB/Lidar 2D SLAM: [25] | X | - | - | X | 2D Position | Relative | Optimization
Ours | X | X | X | X | Position, Orientation, Velocity | Absolute | Optimization

basic concepts that will be employed in the later parts. Section IV provides an overview of the approach to the estimation problem. Section V goes into detail on how to construct the cost function. Section VI presents the experiment results. Finally, Section VII concludes our work.

II. RELATED WORKS

To the best of our knowledge, this work presents an original and comprehensive sensor fusion scheme in terms of the set of sensors, their utilization for estimation, and the fusion approach. Tab. I summarizes and compares the most related works and ours with respect to these aspects. Note that we only consider the works that employ UWB in combination with other types of sensors.

As Tab. I shows, over the last five years, many researchers have investigated the use of UWB for UAV localization as a standalone method [13]–[16], [26], or in combination with an IMU [16], [17]. These works have confirmed the effectiveness of UWB for localization at the mid-range scale (from 10 m to 100 m) [27]. Since these early works use the Extended Kalman Filter (EKF), the fusion of UWB with relative motion information from an OSL system can be quite cumbersome. In an early attempt to find an alternative to EKF, the authors in [18], [19] proposed optimizing a cost function comprising ranging and smoothness factors over a sliding window. This optimization process provides a sequence of pose estimates that best fit the available measurements. Note that the smoothness factors in this formulation are artificially designed relative factors, i.e. factors that couple two consecutive states in the sliding window, to keep the problem well-posed [19]. As the term smoothness suggests, these factors penalize solutions where consecutive positions are too close or too far from each other, and assume some conditions on the constant velocity and maximum speed of the robot. Hence the method may only be effective when the robot moves at low speed. In [20], the authors introduced new relative factors from a Visual SLAM system into the cost function, together with the range and smoothness constraints, and demonstrated improved localization accuracy compared with purely visual methods.

In the aforementioned methods, it is assumed that the UWB anchor positions are known and localization is with respect to the coordinate system specified by the anchor setup. These methods often require at least four anchors to ensure a unique solution. On the other hand, in [21]–[25], ranging to a single anchor or several unknown anchors is used to aid the OSL systems. More specifically, the works in [21]–[23] employ UWB

ranges to estimate the scale factors, which are ambiguous in monocular Visual SLAM systems. In [24], ranging to a single anchor and among a group of robots is used to reduce the drift of the OSL data. In [25], the authors proposed a 2D UWB/Lidar SLAM problem, where anchor-robot and anchor-anchor ranges along with Lidar scans are used to simultaneously estimate the anchor and robot positions. It should be noted that the position estimates from these methods are still in an arbitrary coordinate frame, and therefore not conducive to executing trajectories relative to some fixed frame of reference, which is typical in inspection tasks.

Compared with the previous works, the first contribution of our work is the fusion of a comprehensive set of sensors, namely UWB, IMU, VIO and LOAM (in fact, multiple cameras and Lidars are employed instead of one of each type in our system). Indeed, the integration of such a comprehensive set of sensors prompts a need to design a more comprehensive sensor fusion scheme, whereas previous schemes cannot be easily applied. We will discuss this issue in more detail in Section IV. Obviously, the first benefit of using more sensors is to improve the reliability of the localization, especially when OSL methods are involved, as they can lose track and destabilise the localization process. Hence, by employing multiple OSL methods together, we can more reliably detect anomalies in one method and temporarily ignore its data. Moreover, our work also employs a tightly-coupled fusion of IMU preintegration and UWB ranges in the optimization approach, hence even when all OSL information is lost, localization can still be maintained. This approach is also better than the previous works' use of smoothness factors, as no strict assumption on maximum velocity is needed.

Moreover, we notice that the previous works only feature the coupling of UWB measurements with the position state. As mentioned, our method introduces a body-offset ranging scheme plus an observation model that couples position, orientation and velocity with the range measurements. Thus, the use of range factors can reduce drift not only in position, but also in orientation and velocity. These improvements will be discussed in detail in Section VI.

Finally, it should be noted that the derivation of the Jacobians needed for optimization was often glossed over in previous works, or approximated with numerical methods [28]. Hence those methods have the potential of facing numerical instability. In contrast, since analytical forms of the Jacobians allow more accurate and faster computations, in this paper we also provide detailed mathematical derivations of the Jacobians for the IMU, OSL and body-offset UWB ranges, backed up by a comprehensive and succinct self-contained theoretical framework. Our system is implemented on the ceres optimization framework [29], which is also used by a majority of current state-of-the-art VIO and LOAM systems [8], [9], [12], [30], [31].

III. PRELIMINARIES

A. Notations

In this article we respectively use R, N, and N+ to denote the sets of real numbers, non-negative integers, and positive integers. R^m denotes the m-dimensional real vector space, m ∈ N.

We use (·)^⊤ to denote the transpose of a vector or matrix under (·). For a vector x ∈ R^m, ‖x‖ stands for its Euclidean norm, and ‖x‖²_G is short-hand for ‖x‖²_G = x^⊤ G x. For a matrix A, brc_m(A) returns the m×m block at the bottom-right corner of A, e.g. brc_1([5 7; 4 3]) = [3].

To make it explicit that a vector v is w.r.t. a coordinate frame A, we attach a left superscript A to v, e.g. ^A v. A rotation matrix and a transform matrix between two reference frames are denoted with the frames attached as left-hand-side superscript and subscript, e.g. ^A_B R and ^A_B T are the rotation and transform matrices from frame B to frame A, respectively. Moreover, the left-hand-side subscript B(·) ("B" is for body frame) is implied when none is specified, e.g., ^L q_k implies the quaternion representing the rotation from the local frame L to the body frame B at time t_k, i.e., ^L_{B_k} q.

For a list of vectors v_1, v_2, ..., v_n of various dimensions (including scalars), we use the tuple notation (v_1, v_2, ..., v_n) as a short-hand for [v_1^⊤, v_2^⊤, ..., v_n^⊤]^⊤.

Finally, throughout the paper, super/subscripts may be occasionally omitted for conciseness, or inserted for clarity, when deemed unambiguous in the context.

B. Geometric concepts

1) Special Orthogonal Group: SO(3) describes the group of 3D rotations, which is formally defined as

SO(3) = { R ∈ R^{3×3} : R^⊤ R = I, det(R) = 1 }.

On this algebraic group, the binary operation [32] representing the composition of rotations is defined as R_1 ∘ R_2 = R_1 R_2. By definition it can also be seen that the inverse of a rotation matrix is R^{-1} = R^⊤.

In a practical sense, a rotation matrix ^A_B R helps convert the coordinates of a vector in frame B to those in frame A, e.g. a vector v whose coordinate in B is ^B v has its coordinates in A as ^A v = ^A_B R ^B v.

2) Skew-symmetric matrix: We define the skew-symmetric matrix of a vector a = [a_x, a_y, a_z]^⊤ ∈ R^3 as

⌊a⌋_× = [ 0, −a_z, a_y ; a_z, 0, −a_x ; −a_y, a_x, 0 ],   (1)

which helps define the cross product × between two vectors:

a × b = ⌊a⌋_× b = −⌊b⌋_× a.   (2)

The space of skew-symmetric matrices is a Lie algebra associated with SO(3), denoted as so(3). We also define the following mappings between R^3 and so(3) that will be used in later formulations:

(·)^∧ : R^3 → so(3);  τ^∧ = ⌊τ⌋_× ;   (3)
(·)^∨ : so(3) → R^3;  ⌊τ⌋_×^∨ = τ .   (4)

The Lie algebra is called the "infinitesimal generator" of SO(3), as one can map an element of so(3) into SO(3) via the matrix exponential:

∀τ^∧ ∈ so(3) : exp(τ^∧) ≜ I + Σ_{k=1}^{∞} (1/k!) ⌊τ⌋_×^k = R ∈ SO(3).

Following [33], a closed-form solution of the above matrix exponential is given as follows:

Exp(φ) = I + (sin(‖φ‖)/‖φ‖) ⌊φ⌋_× + ((1 − cos(‖φ‖))/‖φ‖²) ⌊φ⌋_×² ,   (5)

where Exp(φ) ≜ exp(φ^∧). This equation is commonly known as Rodrigues' formula, and φ is the axis-angle representation of a rotation matrix R = Exp(φ), also called the rotation vector. The inverse mapping from a rotation matrix to a rotation vector is defined as

Log(R) = φ = (‖φ‖ / (2 sin(‖φ‖))) (R − R^⊤)^∨,  ‖φ‖ = cos^{-1}((Tr(R) − 1)/2),   (6)

where Tr(R) denotes the trace of the matrix R.

From the definitions of the rotation matrix and the skew-symmetric matrix, we can verify the following identities for all R ∈ SO(3) and φ ∈ R^3:

⌊Rφ⌋_× = R ⌊φ⌋_× R^⊤,   Exp(φ) R = R Exp(R^⊤ φ).   (7)

We also have the following first-order approximations for the mapping (5) under a small disturbance δφ:

Exp(φ + δφ) ≃ Exp(φ) Exp(H(φ) δφ),   (8)
Exp(φ) Exp(δφ) ≃ Exp(φ + H^{-1}(φ) δφ),   (9)

where H(φ) is termed the right Jacobian of SO(3) [32], and is formally defined as:

H(φ) = ∂Log[ Exp(φ)^{-1} Exp(φ + θ) ] / ∂θ |_{θ=0} .   (10)

The closed forms of H(φ) and its inverse are given as follows:

H(φ) = I − ((1 − cos(‖φ‖))/‖φ‖²) ⌊φ⌋_× + ((‖φ‖ − sin(‖φ‖))/‖φ‖³) ⌊φ⌋_×² ,

H^{-1}(φ) = I + (1/2) ⌊φ⌋_× + ( 1/‖φ‖² − (1 + cos(‖φ‖))/(2‖φ‖ sin(‖φ‖)) ) ⌊φ⌋_×² .

Note that when φ is small, H(φ) ≃ I. Thus for two small disturbances δφ_1, δφ_2 we have:

Exp(δφ_1 + δφ_2) ≃ Exp(δφ_1) Exp(δφ_2).   (11)

This result will later be used to derive the equations for propagating the covariance of the IMU preintegration measurements.
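As a concrete reference for the mappings above, the following minimal numpy sketch implements the skew operator (1), Rodrigues' formula (5) and the logarithm map (6); it is an illustration only, with just a simple guard for the small-angle case.

import numpy as np

def skew(a):
    """Skew-symmetric matrix of a 3-vector, i.e. the hat operator in (1)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def Exp(phi):
    """Rodrigues' formula (5): rotation vector -> rotation matrix."""
    angle = np.linalg.norm(phi)
    if angle < 1e-9:                      # small-angle guard
        return np.eye(3) + skew(phi)
    K = skew(phi)
    return (np.eye(3)
            + (np.sin(angle) / angle) * K
            + ((1.0 - np.cos(angle)) / angle**2) * K @ K)

def Log(R):
    """Inverse mapping (6): rotation matrix -> rotation vector."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-9:
        return np.zeros(3)
    # vee of the antisymmetric part, scaled as in (6)
    W = (angle / (2.0 * np.sin(angle))) * (R - R.T)
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

# Round-trip check: Log(Exp(phi)) recovers phi for ||phi|| < pi
phi = np.array([0.1, -0.2, 0.3])
assert np.allclose(Log(Exp(phi)), phi, atol=1e-8)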


3) Quaternion: A quaternion q is a four-dimensional complex number comprising a scalar part q_w ∈ R and a vector part q_v ∈ R^3:

q = q_w + i q_x + j q_y + k q_z = (q_w, q_v),   (12)
q_w, q_x, q_y, q_z ∈ R,  q_v = (q_x, q_y, q_z) ∈ R^3.   (13)

A unit quaternion, i.e. a quaternion q with ‖q‖ = 1, can be used to represent a rotation. The formulae to convert a unit quaternion q to a rotation matrix R = {a_ij}, i, j ∈ {1, 2, 3}, and vice versa are recalled from [34] as follows:

R(q) = (2q_w² − 1) I_{3×3} + 2 q_w ⌊q_v⌋_× + 2 q_v q_v^⊤,   (14)

Q(R) = (q_w, q_x, q_y, q_z),  q_w = sqrt(a_11 + a_22 + a_33 + 1) / 2,   (15)

q_x = (a_32 − a_23)/(4 q_w),  q_y = (a_13 − a_31)/(4 q_w),  q_z = (a_21 − a_12)/(4 q_w).   (16)

We also have the following formula for the compound mapping E(φ) = Q(Exp(φ)):

∀φ ∈ R^3 : E(φ) = ( cos(‖φ‖/2), (φ/‖φ‖) sin(‖φ‖/2) ).   (17)

Given two rotation matrices ^A_B R, ^B_C R and their corresponding quaternions ^A_B q, ^B_C q, the rotation and quaternion composition operations can be defined via matrix multiplications, i.e.:

^A_C R = ^A_B R ^B_C R  ⇔  q ⊗ p = ⌊q⌋_ℓ p = ⌊p⌋_r q,   (18)

⌊q⌋_ℓ = [ q_w, −q_v^⊤ ; q_v, q_w I + ⌊q_v⌋_× ],   (19)

⌊q⌋_r = [ q_w, −q_v^⊤ ; q_v, q_w I − ⌊q_v⌋_× ],   (20)

where ⊗ denotes the quaternion multiplication operator. Despite representing the same object, the use of a quaternion or a rotation matrix may be advantageous in some cases and not in others (more explanation is given in Section III-D). In this paper we take the liberty of using the rotation matrix interchangeably with its corresponding quaternion, as the conversion between them has been provided in the formulae (14) and (16).
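For concreteness, a small numpy sketch of the conversions (14)–(16) and the left-multiplication matrix (19) might look as follows; it assumes the (q_w, q_x, q_y, q_z) ordering used above and unit-norm inputs.

import numpy as np

def quat_to_rot(q):
    """Eq. (14): unit quaternion (qw, qx, qy, qz) -> rotation matrix."""
    qw, qv = q[0], np.asarray(q[1:])
    qv_x = np.array([[0, -qv[2], qv[1]],
                     [qv[2], 0, -qv[0]],
                     [-qv[1], qv[0], 0]])
    return (2*qw**2 - 1)*np.eye(3) + 2*qw*qv_x + 2*np.outer(qv, qv)

def rot_to_quat(R):
    """Eqs. (15)-(16): rotation matrix -> unit quaternion (assumes qw > 0)."""
    qw = np.sqrt(R[0, 0] + R[1, 1] + R[2, 2] + 1.0) / 2.0
    qx = (R[2, 1] - R[1, 2]) / (4*qw)
    qy = (R[0, 2] - R[2, 0]) / (4*qw)
    qz = (R[1, 0] - R[0, 1]) / (4*qw)
    return np.array([qw, qx, qy, qz])

def quat_mult(q, p):
    """Eq. (18): q (x) p via the left-multiplication matrix (19)."""
    qw, qv = q[0], np.asarray(q[1:])
    qv_x = np.array([[0, -qv[2], qv[1]],
                     [qv[2], 0, -qv[0]],
                     [-qv[1], qv[0], 0]])
    L = np.zeros((4, 4))
    L[0, 0], L[0, 1:] = qw, -qv
    L[1:, 0], L[1:, 1:] = qv, qw*np.eye(3) + qv_x
    return L @ np.asarray(p)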

4) Special Euclidean Group: SE(3) describes the group of rigid-body motions in 3D space, whose definition is

SE(3) ≜ { T = [ R, t ; 0, 1 ] ∈ R^{4×4} : R ∈ SO(3), t ∈ R^3 }.

Note that T is often referred to as a transform, R as rotation and t as translation. Similar to SO(3), the group operation on SE(3) is defined as T_1 ∘ T_2 = T_1 T_2 ∈ SE(3), ∀T_1, T_2 ∈ SE(3). Given a transform T, its inverse can be found as:

T^{-1} = [ R^{-1}, −R^{-1} t ; 0, 1 ].

In practice, a transform ^A_B T is used to convert a pose in frame B to frame A via the operation [^A p^⊤, 1]^⊤ = ^A_B T [^B p^⊤, 1]^⊤, or more simply put, ^A p = ^A_B R ^B p + ^A_B t.

Fig. 1: The coordinate frames that are considered under the sensor fusion scheme.

C. Coordinate frames and Extrinsic Parameters

In reference to Fig. 1, we denote B_k as the robot's body coordinate frame at time t_k, and W is the world frame. Obviously, the main goal of our work is to estimate the transform ^W_{B_k} T:

^W_{B_k} T = [ ^W_{B_k} R, ^W_{B_k} t ; 0, 1 ] = [ R(^W_{B_k} q), ^W_{B_k} p ; 0, 1 ],

where ^W_{B_k} q and ^W_{B_k} p are the orientation quaternion and position of the body frame w.r.t. the world frame at time t_k.

Following the common practice in the literature, we will consider the IMU-attached frame I_k to coincide with the body frame, i.e. I_k ≡ B_k, ∀k.

Also in reference to Fig. 1, let us denote C_{ik} as a coordinate frame attached to the main sensor of the i-th OSL system, which could be a camera or a Lidar. In most practical OSL systems, we can obtain a measurement of the transform ^{L_i}_{C_{ik}} T, where L_i is some local frame of reference, which usually coincides with C_{i0}. Similar to the case with the IMU, we can assume ^{B_k}_{C_{ik}} T ≡ ^B_{C_i} T, ∀k (i.e. the camera is rigidly mounted on the robot) and that ^B_{C_i} T is known beforehand (in the literature ^{B_k}_{C_{ik}} T is often called the extrinsic parameters and can be measured or estimated from the OSL's data by some software tools). Indeed, we can go a little bit further by assuming that ^{L_i}_{B_k} T = ^{L_i}_{C_{ik}} T · (^B_{C_i} T)^{-1} is the output of the OSL system. In effect, this is the same as assuming that all frames C_i coincide with B, i.e. ∀C_{ik} : ^{B_k}_{C_{ik}} T ≜ I_{4×4}. Henceforth, we will not have any further reference to the frame C_i in this paper. However, the transform ^W_{L_i} T remains a concern, as it is usually either unknown, very hard to obtain in practice, or subject to drift. In Section V-A, based on the characteristics of OSL systems, we will discuss converting these externally-referenced measurements into so-called pose-displacement measurements.

D. Optimization problem on Manifold

1) Local reparameterization: In robotic localization applications, we often encounter the following optimization problem

min_{X ∈ M} f(X),   (21)

where f(·) is a cost function constructed from the sensor measurements, X is the estimate of the robot states, which may include position, orientation quaternion, velocity, etc., and M is a manifold on which the robot states reside. Since the orientation states are often overparameterized (by nine variables for rotation matrices, and four for quaternions), the condition X ∈ M introduces extra nonlinear constraints to the problem, and solving such a problem with both a nonlinear cost function and constraints is not a trivial task. Instead, the problem (21) is often reparameterized so that the decision variable resides on the linear tangent space of the manifold defined at the current state X, i.e.

min_{δX ∈ R^n} f( L_X(δX) ),   (22)

where L_X(·) : R^n → M is called a local reparameterization at X, or retraction function [35]. After the reparameterization, for each iteration in the Gauss-Newton method, we can calculate the optimal step δX* and then "retract" the solution δX* from the tangent space back to the manifold using the operation X ← L_X(δX*). In practice, the mappings (5) and (17) are used to define the local parameterization of rotations as follows:

L_R(δφ) = R Exp(δφ),   L_q(δφ) = q ⊗ E(δφ).   (23)

For a vector state such as position and velocity, the retraction function is simply defined as:

L_p(δp) = p + δp,   L_v(δv) = v + δv.   (24)

Note that in (5) and (17), the use of trigonometric functions can be computationally expensive, and the variable in the denominator has the potential of causing a division-by-zero fault. This is why we often employ the following approximations

L_R(δφ) ≃ R (I + ⌊δφ⌋_×),   L_q(δφ) ≃ q ⊗ [1 ; (1/2)δφ],   (25)

which only involve basic arithmetic operations. As can be seen in later parts, the former is convenient for calculating the Jacobians and the covariance, while the latter is more convenient for calculating the rotational residual and the retraction step.
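A minimal sketch of the retractions (23)–(25), reusing the Exp, skew and quat_mult helpers from the earlier sketches, could read:

import numpy as np

def retract_rotation(R, dphi):
    """Exact retraction L_R in (23): right-multiply by Exp(dphi)."""
    return R @ Exp(dphi)

def retract_rotation_approx(R, dphi):
    """First-order approximation in (25), cheap and valid for small dphi."""
    return R @ (np.eye(3) + skew(dphi))

def retract_quaternion_approx(q, dphi):
    """Quaternion form of (25): q (x) [1, dphi/2], then renormalize."""
    dq = np.concatenate(([1.0], 0.5 * np.asarray(dphi)))
    q_new = quat_mult(q, dq)
    return q_new / np.linalg.norm(q_new)

def retract_vector(p, dp):
    """Vector-state retraction (24)."""
    return p + dp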

2) Jacobian and the chain rule: One other important concept for optimization on manifolds is the Jacobian of a function f(X) : M → N acting on manifolds (e.g. f(R) = Ry, where y ∈ R^3, M = SO(3) and N = R^3). The Jacobian (or more precisely the right Jacobian) of f(X) is defined as

J^{f(X)}_X ≜ ∂Log_N[ f(X)^{-1} ∘ f(L_X(θ)) ] / ∂θ |_{θ=0} ,   (26)

where Log_N(X) is the mapping from the manifold N to the tangent space at the identity of N. Recall that in (10) we have applied (26) in the case where f(X) = Exp(X), M = R^3 and N = SO(3).

From [33], the chain rule simply states that for Y = f(X) and Z = g(Y), we have

J^Z_X = J^Z_Y J^Y_X,   (27)

which can be generalized to any number of variables.

Fig. 2: Illustration of a sliding window consisting of four consecutive states X_k, ..., X_{k+3} of the robot registered at four time steps t_k, ..., t_{k+3}. For a closer look, during the interval (t_{k+1}, t_{k+2}], we have two body-offset UWB distances d^1_{k+2} and d^2_{k+2}, hence the subscript k+2 and the superscripts 1 and 2 are used. Between two consecutive poses, IMU and OSL sensors can provide relative displacement observations. Hence UWB, IMU and OSL data will be fused together to estimate the robot state X_k.

IV. METHODOLOGY

A. Problem formulation

Let us denote all of the states to be estimated at time t_k as X_k as follows:

X_k = (q_k, p_k, v_k, b^ω_k, b^a_k),   (28)

where q_k = ^W_{B_k} q, p_k = ^W_{B_k} p, v_k = ṗ_k ∈ R^3 are respectively the robot's orientation quaternion, position and velocity w.r.t. the world frame W at time t_k; b^a_k, b^ω_k ∈ R^3 are respectively the IMU accelerometer and gyroscope biases.

Hence, we define the state estimate at each time step k, and the sliding window, as follows:

X̂ = (X̂_k, X̂_{k+1}, ..., X̂_{k+M}),   (29)
X̂_k = (q̂_k, p̂_k, v̂_k, b̂^ω_k, b̂^a_k),   (30)

where M ∈ N is the size of the sliding window.

Thus, we will construct a cost function based on the observations from UWB, IMU and OSL measurements at every new time step. As can be seen in Fig. 2, UWB, IMU and OSL will provide observations that couple two consecutive states in the sliding window. Using all of the observations obtained from UWB, OSL and IMU, a cost function f(X̂) can be constructed at each time step t_{k+M} as:

f(X̂) ≜ Σ_{m=k}^{k+M−1} Σ_{i=1}^{N^m_U} ‖ r_U(X̂_m, X̂_{m+1}, U^i_m) ‖²_{P^{-1}_{U_i}}
      + Σ_{m=k}^{k+M−1} Σ_{i=1}^{N^m_V} ‖ r_V(X̂_m, X̂_{m+1}, V^i_m) ‖²_{P^{-1}_{V_i}}
      + Σ_{m=k}^{k+M−1} ‖ r_I(X̂_m, X̂_{m+1}, I_m) ‖²_{P^{-1}_{I_m}} ,   (31)

where r_U(·), r_V(·), r_I(·) are the residuals constructed from UWB, OSL and IMU observations; P_{U_i}, P_{V_i}, P_{I_m} are the covariance matrices of the measurement errors; and N^k_U, N^k_V respectively are the numbers of UWB and OSL observations obtained in the interval [t_k, t_{k+1}].

Note that each term ‖r(·)‖²_{P^{-1}} above is also referred to as a cost factor, or simply a factor. In Section V we will elaborate on the construction of these terms.

Fig. 3: Overview of the system's main components.
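To make the structure of (31) concrete, the sketch below accumulates Mahalanobis-weighted squared residuals over a sliding window. It is a simplification for illustration: the residual functions, observation containers and covariances are placeholders, and the actual system minimizes this cost with ceres and analytical Jacobians rather than merely evaluating it.

import numpy as np

def mahalanobis_sq(r, P):
    """Squared Mahalanobis norm ||r||^2_{P^{-1}} used for every factor in (31)."""
    return float(r @ np.linalg.solve(P, r))

def sliding_window_cost(states, uwb_obs, osl_obs, imu_obs,
                        r_uwb, r_osl, r_imu, P_uwb, P_osl, P_imu):
    """Evaluate f(X) in (31) for a window of M+1 states.

    states:  list of state tuples (q, p, v, b_w, b_a), length M+1
    *_obs:   per-interval lists of observations between states m and m+1
    r_*:     residual functions taking (X_m, X_m1, observation)
    """
    cost = 0.0
    for m in range(len(states) - 1):
        Xm, Xm1 = states[m], states[m + 1]
        for u in uwb_obs[m]:                     # body-offset UWB range factors
            cost += mahalanobis_sq(r_uwb(Xm, Xm1, u), P_uwb)
        for v in osl_obs[m]:                     # OSL pose-displacement factors
            cost += mahalanobis_sq(r_osl(Xm, Xm1, v), P_osl)
        cost += mahalanobis_sq(r_imu(Xm, Xm1, imu_obs[m]), P_imu)  # preintegration factor
    return cost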

B. Workflow

Fig. 3 provides an overview of our localization system. As introduced in Section I, the UWB ranging and communication network (Fig. 3, top part) is employed to gather the distances between the anchors after they have been deployed in the field. These distances are then used to calculate the positions of the anchors by using a self-localization algorithm similar to our previous works [14]. Once the anchors' positions have been calculated, they will be associated with the UWB measurements. Thus, a UWB observation consists of three main pieces of information: first, the position of the UWB node in the UAV's body frame; second, the UWB anchor position in the chosen world frame; and third, the distance between the UWB node and the UWB anchor. Note that some information regarding the sampling times will also be used, as explained in Section V-C.

After the anchor self-localization process, UWB, IMU and OSL data will go through multiple stages before they are fused together in the Optimization-based Sensor Fusion (OSF) process.

In the first stage, each UWB measurement obtained goes through some preliminary checks to eliminate bad measurements. These checks include comparing the measurement's signal-to-noise ratio (SNR), leading-edge-detection quality, and rate of change against some user-defined thresholds. Only those that are within these thresholds will be passed to the buffer for subsequent stages. For IMU and OSL, we transform these measurements into the body frame using the fixed extrinsic transforms (see Section III-C). These transformed measurements are then passed to the corresponding buffers for the next stages.

Fig. 4: Synchronization scheme: at every time step t_k, we would be able to obtain a set of UWB observations (see (88)), an IMU preintegration observation (see (53)), and a pose-displacement measurement from each OSL system (see (34)). Note that in practice multiple UWB measurements can be acquired simultaneously, hence there will be no other data from other sensors between them (more details in Sec. VI), or some measurements can be lost in some intervals. This will cause some significant challenges that are discussed in more detail in Remark 1.

In the second stage, after every fixed period of time, we extract UWB measurements together with the IMU and OSL

measurements from the buffers to construct the observations, namely the IMU preintegration I_k, the OSL-based pose displacement V^i_k and the UWB body-offset ranges U^i_k. In reference to Fig. 4, these measurements are extracted at the time instance t_{k+1}, which is also used to register a new state estimate X̂_{k+1}. Note that when the new step t_{k+1} is created, we can use it to make interpolated IMU and OSL measurements from the immediately preceding and succeeding samples (see Fig. 4). Hence all of the IMU measurements within [t_k, t_{k+1}], including the interpolated values, will be used to calculate the preintegration measurements, and the initial and final OSL values in the interval will be used to calculate the pose-displacement measurements. For UWB, each observation also includes the time offset δt_i from the latest earlier time step (see Fig. 4), which is needed for the construction of the range factor in Section V-C.
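As a simple illustration of the interpolation step mentioned above, a linear blend of the buffered samples bracketing the new step time might be sketched as follows (vector-valued measurements only; orientation samples would require slerp instead):

import numpy as np

def interpolate_at(t_query, t0, x0, t1, x1):
    """Linearly interpolate a vector-valued measurement (e.g. IMU gyro/accel)
    between the immediately preceding sample (t0, x0) and the immediately
    succeeding sample (t1, x1) to obtain its value at the new step time."""
    s = (t_query - t0) / (t1 - t0)
    return (1.0 - s) * np.asarray(x0) + s * np.asarray(x1)

# Example: create an interpolated IMU gyro sample at t_{k+1} = 0.10 s
gyro_interp = interpolate_at(0.10, 0.095, [0.01, 0.02, 0.00],
                                    0.105, [0.02, 0.02, 0.01])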

In the final stage, the stack of UWB data will be subjected to another check to see if any measurement is an outlier. This is done by comparing the measurement with the predicted range value using the IMU-predicted robot pose at time t_k + δt_i. When the UWB measurements have passed the outlier check, we can finally use them to construct the cost function, together with the preintegrated IMU measurements and the OSL-derived pose-displacement measurements. The function will then be optimized to provide an estimate of the robot states in the sliding window. The latest state estimate in the sliding window will be combined with the latest IMU measurements in the buffer, which are leftovers from the last extraction plus the newly arrived samples during the optimization process, to calculate the high-rate pose estimate. This high-rate pose estimate is also used to perform the outlier check on the UWB measurements that was mentioned earlier.

After the optimization has been completed, the oldest state estimate and its associated measurements will be abandoned, and a new time step will be registered when the conditions mentioned earlier are met.

Remark 1. The design of the synchronization scheme described above generalizes our method compared to previous works [18]–[24] and is a necessary requirement when fusing such an extensive set of sensor data. In [18]–[21], the creation of a new state X_k is synchronized with the acquisition of a new UWB sample, hence when multiple UWB samples are obtained at the same time, or UWB ranging is temporarily lost, the estimation scheme can be destabilized. Moreover, when the UWB rate increases, real-time processing may not be guaranteed, since the number of UWB factors increases (hence increasing the computational complexity), but the interval ∆t_k available for the optimization process is reduced. On the other hand, in [23], [24], the creation of a new state is synchronized with the camera sample times, and only UWB samples that are closest to the camera sample times are used, while others are discarded. In our framework, the time steps t_k are independent of the timing of the sensor data. Hence when more sensors are integrated and the complexity increases, we can easily adjust ∆t_k, or even skip the optimization in one step (see Section IV-C) to ensure real-time estimation.

C. Initialization and real-time optimization

While many OSL-based methods require elaborate initialization motions before actual operation [5], [36], this appears to be a very difficult task when working with UAV platforms of medium size or beyond. Thus, in this work, initialization is done by collecting about 100 UWB measurements on startup, then running the optimization with several initial yaw values in [0, 2π), and the estimate that yields the smallest cost is chosen. Note that roll and pitch are initialized directly from the angle between the measured gravity vector and the body frame's z axis.

During the estimation process, sometimes the optimization operation can return quickly, before the next batch of measurements is admitted to the sliding window. In this case, we can try to "explore" other minima of the cost function by nudging the yaw value around the current estimate and rerunning the optimization. Whenever a smaller cost is obtained, the estimate will be updated to this new value. Moreover, sometimes the optimization may finish only after multiple batches of measurements have already accumulated in the queue. In this case we can skip the optimization for a few steps and slide the window forward to use the latest batch for constructing the cost function, so as to ensure the real-time operation of the algorithm.
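A schematic version of this yaw-sweep initialization is sketched below; build_cost_from_uwb and optimize_pose are hypothetical placeholders standing in for the optimization described above, not functions of the actual implementation.

import numpy as np

def initialize_yaw(uwb_measurements, build_cost_from_uwb, optimize_pose,
                   num_candidates=8):
    """Try several initial yaw guesses in [0, 2*pi) and keep the one whose
    optimized cost over the startup UWB measurements is smallest."""
    best_cost, best_pose = np.inf, None
    for yaw0 in np.linspace(0.0, 2.0 * np.pi, num_candidates, endpoint=False):
        cost_fn = build_cost_from_uwb(uwb_measurements)
        pose, cost = optimize_pose(cost_fn, initial_yaw=yaw0)
        if cost < best_cost:
            best_cost, best_pose = cost, pose
    return best_pose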

V. CONSTRUCTION OF THE COST FACTORS

In this section, we will derive the factor for each type of observation in the cost function (31). In general, the basic procedure is as follows: 1) we construct the observation model that states the relationship between the sensor observation, the robot's true states, and the measurement error; 2) we calculate the covariance matrix of the error in order to properly apply the weightage of each least-squares term in the optimization problem; 3) we calculate the residual (difference) between each observation and its predicted value, which is obtained by plugging the current state estimate into the mathematical observation model; 4) we calculate the Jacobian for a fast implementation of the optimization process.

A. OSL factors

In this section, we discuss the construction of the pose-displacement observation from OSL data. Note that we omit the OSL system index i in (31), since the treatment is the same for all OSL systems available on the UAV.

1) Observation model: OSL systems can have slightly different implementations (monocular or stereo, RGBD or Lidar, etc.) and characteristics (the definition of features, overall accuracy and data rate, etc.). However, as discussed in Section III-C, due to such diversity, in this paper we consider OSL systems working in a manner similar to an odometer, where they essentially measure the relative motion of the sensor between two time instances, and then integrate these relative measurements to provide an estimate of the robot pose w.r.t. some local frame of reference L. To this end, we model the OSL measurement as follows:

^L_{B_{k+1}} q = ^L_{B_k} q ⊗ (∆q_k ⊗ E(φ_k)),   (32a)
^L_{B_{k+1}} p = ^L_{B_k} p + ^L_{B_k} R (∆p_k + d_k),   (32b)

where E(φ_k), d_k are some errors in the estimation process, and ∆q_k and ∆p_k are the relative rotation and displacement from time t_k to t_{k+1}, which are formally defined as:

∆q_k ≜ ^W_{B_k} q^{-1} ⊗ ^W_{B_{k+1}} q,   (33a)
∆p_k ≜ ^L_{B_k} R^{-1} (^L_{B_{k+1}} p − ^L_{B_k} p).   (33b)

From the user's end, only ^L_{B_{k+1}} q and ^L_{B_{k+1}} p are accessible. Thus, given these measurements, we can retrieve the pose-displacement observation as follows:

V_k ≜ (∆q̆_k, ∆p̆_k),   (34a)
∆q̆_k ≜ (^L_{B_k} q)^{-1} ⊗ ^L_{B_{k+1}} q = ∆q_k ⊗ E(φ_k),   (34b)
∆p̆_k ≜ (^L_{B_k} R)^{-1} (^L_{B_{k+1}} p − ^L_{B_k} p) = ∆p_k + d_k,   (34c)

where the breve accent denotes the quantities computed directly from the OSL outputs.

2) Observation error covariance: In reference to (34), we model the "noises" φ_k and d_k as independent zero-mean Gaussians with the covariance

P_{V_k} = Cov([φ_k ; d_k]) = ∆t_k diag(σ²_{V_1}, σ²_{V_2}, ..., σ²_{V_6}),   (35)

where σ²_{V_1}, σ²_{V_2}, ..., σ²_{V_6} are some parameters that can be empirically determined.

3) Residuals: Given the observations in (34), we can calculate the residuals from OSL as follows:

r_V(X̂_k, X̂_{k+1}, V_k) ≜ [ r_{∆q} ; r_{∆p} ] = [ 2 vec(∆q̆_k^{-1} ⊗ ∆q̂_k) ; ∆p̂_k − ∆p̆_k ],   (36)

where vec(·) returns the vector part of a quaternion, and ∆q̂_k and ∆p̂_k are the estimated pose displacements at time t_k, which can be calculated in the same manner as (34):

∆q̂_k ≜ q̂_k^{-1} ⊗ q̂_{k+1},   ∆p̂_k ≜ R̂_k^{-1} (p̂_{k+1} − p̂_k).   (37)
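Reusing the quaternion helpers assumed in the earlier sketches, the OSL pose-displacement residual (36)–(37) can be illustrated as:

import numpy as np

def quat_inv(q):
    """Inverse (conjugate) of a unit quaternion (qw, qx, qy, qz)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def osl_residual(q_k, p_k, q_k1, p_k1, dq_meas, dp_meas):
    """Residual r_V in (36): compare the estimated pose displacement (37)
    against the OSL-derived measurement (dq_meas, dp_meas)."""
    R_k = quat_to_rot(q_k)
    dq_est = quat_mult(quat_inv(q_k), q_k1)      # eq. (37), rotation part
    dp_est = R_k.T @ (np.asarray(p_k1) - np.asarray(p_k))   # eq. (37), translation part
    r_dq = 2.0 * quat_mult(quat_inv(dq_meas), dq_est)[1:]   # vector part of the quaternion
    r_dp = dp_est - np.asarray(dp_meas)
    return np.concatenate([r_dq, r_dp])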


4) Jacobian: Let us first examine J^{r_{∆q}}_{q_k}. From (36) and (37), and the fact that ∆q̆_k^{-1} ⊗ q̂_k^{-1} ⊗ q̂_{k+1} = (q̂_{k+1}^{-1} ⊗ q̂_k ⊗ ∆q̆_k)^{-1}, we have r_{∆q} = −2 vec[ q̂_{k+1}^{-1} ⊗ q̂_k ⊗ ∆q̆_k ], since vec(q) = −vec(q^{-1}) for a unit quaternion q. Hence:

r_{∆q}(L_{q̂_k}(δθ_k)) ≃ −2 vec[ q̂_{k+1}^{-1} ⊗ q̂_k ⊗ [1 ; (1/2)δθ_k] ⊗ ∆q̆_k ]
                      = −2 vec[ ⌊∆q̆_k⌋_r ⌊q̂_{k+1}^{-1} ⊗ q̂_k⌋_ℓ [1 ; (1/2)δθ_k] ],

and we can straightforwardly obtain

J^{r_{∆q}}_{q_k} = −brc_3( ⌊∆q̆_k⌋_r ⌊q̂_{k+1}^{-1} ⊗ q̂_k⌋_ℓ ).   (38)

Similarly, we can obtain the Jacobian J^{r_{∆q}}_{q_{k+1}} as

J^{r_{∆q}}_{q_{k+1}} = brc_3( ⌊∆q̆_k^{-1} ⊗ q̂_k^{-1} ⊗ q̂_{k+1}⌋_ℓ ).

To obtain J^{r_{∆p}}_{q_k}, by employing (25) and (2), we have (up to the constant −∆p̆_k term)

r_{∆p}(L_{q̂_k}(δθ_k)) ≃ (I + ⌊δθ_k⌋_×)^⊤ R̂_k^⊤ (p̂_{k+1} − p̂_k)
  = −⌊δθ_k⌋_× R̂_k^⊤ (p̂_{k+1} − p̂_k) + R̂_k^⊤ (p̂_{k+1} − p̂_k)
  = ⌊R̂_k^⊤ (p̂_{k+1} − p̂_k)⌋_× δθ_k + R̂_k^⊤ (p̂_{k+1} − p̂_k).

Thus, taking the derivative over δθ_k yields

J^{r_{∆p}}_{q_k} ≃ ⌊R̂_k^⊤ (p̂_{k+1} − p̂_k)⌋_× .

Following the same steps, we can easily find that

J^{r_{∆p}}_{p_k} ≃ −R̂_k^⊤ ,   J^{r_{∆p}}_{p_{k+1}} ≃ R̂_k^⊤ .   (39)

B. IMU preintegration factors:

1) IMU kinematic models: Before exploring the preintegration observation models, let us recall some features of the IMU. First, note the following kinematic model of the robot pose and velocity w.r.t. the world frame:

q̇_t = (1/2) q_t ⊗ (0, ω_t),   ṗ_t = v_t,   v̇_t = R_t a_t,   (40)

where ω_t, a_t are the angular velocity and acceleration of the robot expressed in the body frame. In reality we only have access to IMU measurements that are corrupted by noises and biases, i.e.:

ω̆_t = ω_t + b^ω_t + η^ω_t,   (41)
ă_t = a_t + R_t^⊤ g + b^a_t + η^a_t,   (42)
ḃ^ω_t = η^{bω}_t,   ḃ^a_t = η^{ba}_t,   (43)

where g = [0, 0, g]^⊤ is the gravitational acceleration, b^ω_t, b^a_t are respectively the biases of the gyroscope and the accelerometer, and η^ω_t, η^a_t, η^{bω}_t, η^{ba}_t are zero-mean Gaussian noises, whose standard deviations are σ_{η^ω}, σ_{η^a}, σ_{η^{bω}}, σ_{η^{ba}}, respectively. The breve accent denotes the raw measured values.

2) IMU preintegration: Let us substitute (41) and (42) into the kinematic model (40) and rewrite it in integral form as follows:

q_{k+1} = q_k ⊗ ∫^M_{t_k}^{t_{k+1}} (1/2) Ω(ω̆_s − b^ω_s − η^ω_s) q_s ds,   (44)

p_{k+1} = p_k + v_k ∆t_k − (1/2) g ∆t_k² + ^W R_k ∫_{t_k}^{t_{k+1}} ∫_{t_k}^{u} {}^k R_s (ă_s − b^a_s − η^a_s) ds du,   (45)

v_{k+1} = v_k − g ∆t_k + ^W R_k ∫_{t_k}^{t_{k+1}} {}^k R_s (ă_s − b^a_s − η^a_s) ds,   (47)

where ∫^M(·) denotes the integration operator on the manifold, and a few short-hands have been employed:

{}^k q_s ≜ ^{B_k}_{B_s} q,   ^W R_k = R_k,   {}^k R_s ≜ ^{B_k}_{B_s} R,   (48)
Ω(ω) ≜ [ 0, −ω^⊤ ; ω, −⌊ω⌋_× ],   ∆t_k ≜ t_{k+1} − t_k.   (49)

We can reorganize the terms in (44), (45), (47) to reveal the observation model on the robot states. Specifically,

γ_{k+1} ≜ ∫^M_{t_k}^{t_{k+1}} (1/2) Ω(ω̆_s − b^ω_s − η^ω_s) q_s ds = q_k^{-1} ⊗ q_{k+1},   (50)

β_{k+1} ≜ ∫_{t_k}^{t_{k+1}} {}^k R_s (ă_s − b^a_s − η^a_s) ds = R_k^{-1} (v_{k+1} − v_k + g ∆t_k),   (51)

α_{k+1} ≜ ∫_{t_k}^{t_{k+1}} ∫_{t_k}^{u} {}^k R_s (ă_s − b^a_s − η^a_s) ds du = R_k^{-1} (p_{k+1} − p_k − v_k ∆t_k + (1/2) g ∆t_k²).   (52)

Thus, we can see from (50), (51), (52) that the integrals obtained from the IMU measurements, on the left-hand side, are functions of the robot states on the right-hand side.

3) Observation model: In reality we do not know the exact values of the IMU noises and biases. However, if we ignore the noise and use some nominal values b^a_k, b^ω_k for the biases during [t_k, t_{k+1}], we can calculate the following tuple of preintegrated IMU observations:

I_k = (α_{k+1}, β_{k+1}, γ_{k+1}),   (53)

where

α_{k+1} ≜ ∫_{t_k}^{t_{k+1}} ∫_{t_k}^{u} {}^k R_s (ă_s − b^a_k) ds du,   (54)
β_{k+1} ≜ ∫_{t_k}^{t_{k+1}} {}^k R_s (ă_s − b^a_k) ds,   (55)
γ_{k+1} ≜ ∫^M_{t_k}^{t_{k+1}} (1/2) Ω(ω̆_s − b^ω_k) {}^k q_s ds.   (56)

In practice, the above integration can be realized by zero-order-hold (ZOH) or higher-order methods (e.g. Runge-Kutta). During the period from t_k to t_{k+1}, we have a sequence of m IMU samples (ω̆_0, ă_0), (ω̆_1, ă_1), ..., (ω̆_{m−1}, ă_{m−1}). Thus, if ZOH is chosen, we have the following recursive rule to update the preintegrated measurements:

α_{n+1} = α_n + β_n ∆τ_n + (1/2) {}^k R_n (ă_n − b^a_k) ∆τ_n²,   (57a)
β_{n+1} = β_n + {}^k R_n (ă_n − b^a_k) ∆τ_n,   (57b)
γ_{n+1} = γ_n ⊗ E( ∆τ_n (ω̆_n − b^ω_k) ),   (57c)
α_0 = 0,  β_0 = 0,  γ_0 = [1, 0_{1×3}]^⊤,   (57d)
∆τ_n = τ_{n+1} − τ_n,  n = 0, 1, ..., m−1,   (57e)

where τ_n is the time stamp of the n-th sample, and τ_0 = t_k, τ_m = t_{k+1}. Note that the above equations cannot be considered a usable observation model yet, since the observations are still functions of the chosen bias point, i.e. α_{k+1} = α_{k+1}(b_k), β_{k+1} = β_{k+1}(b_k), γ_{k+1} = γ_{k+1}(b_k), with b_k ≜ (b^ω_k, b^a_k). Let b̄_k = (b̄^ω_k, b̄^a_k) be a numerical value of the variable b_k; we can then further approximate the observation models (52), (51), (50) as follows:

α_{k+1}|_{b_k=b̄_k} + A^ω_{k+1} ∆b^ω_k + A^a_{k+1} ∆b^a_k = R_k^{-1} (p_{k+1} − p_k − v_k ∆t_k + (1/2) g ∆t_k²) − δα_{k+1},   (58)

β_{k+1}|_{b_k=b̄_k} + B^ω_{k+1} ∆b^ω_k + B^a_{k+1} ∆b^a_k = R_k^{-1} (v_{k+1} − v_k + g ∆t_k) − δβ_{k+1},   (59)

γ_{k+1}|_{b^ω_k=b̄^ω_k} ⊗ [1 ; (1/2) C^ω_{k+1} ∆b^ω_k] ≃ γ_{k+1}(b^ω_k) = q_k^{-1} ⊗ q_{k+1} ⊗ [1 ; −(1/2) δθ_{k+1}],   (60)

where δα_{k+1}, δβ_{k+1}, δθ_{k+1} are the measurement errors, and A, B, C are the Jacobians of the IMU preintegrations evaluated at the bias point b̄_k, i.e.

C^ω_{k+1} ≜ J^{γ(b̄^ω_k + ∆b^ω_k)}_{∆b^ω_k},   {}^k R_{k+1} ≜ R(γ_{k+1}),
A^ω_{k+1} ≜ ∂α_{k+1}(b^ω_k)/∂b^ω_k |_{b^ω_k=b̄^ω_k},   A^a_{k+1} ≜ ∂α_{k+1}(b^a_k)/∂b^a_k |_{b^a_k=b̄^a_k},
B^ω_{k+1} ≜ ∂β_{k+1}(b^ω_k)/∂b^ω_k |_{b^ω_k=b̄^ω_k},   B^a_{k+1} ≜ ∂β_{k+1}(b^a_k)/∂b^a_k |_{b^a_k=b̄^a_k},
∆b^ω_k ≜ b^ω_k − b̄^ω_k,   ∆b^a_k ≜ b^a_k − b̄^a_k.

In practice b̄_k can be chosen as the bias estimate at the beginning of the optimization process. During the optimization, when the bias estimate is updated, the preintegrated terms can be recalculated either by redoing the integrations (54), (55), (56) with the new bias point, or by the first-order approximation on the left-hand side of the approximate models (58), (59), (60), depending on how much the bias estimate changes.
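As an illustration of the ZOH recursion (57), a minimal numpy sketch (reusing the Exp and rot_to_quat helpers assumed earlier, and omitting the bias Jacobians and covariance for brevity) could be:

import numpy as np

def preintegrate_zoh(imu_samples, bw_bar, ba_bar):
    """Propagate the preintegrated terms alpha, beta, gamma of (57) over one
    interval [t_k, t_{k+1}], given IMU samples [(tau_n, gyro_n, accel_n), ...]
    (the last sample marking tau_m = t_{k+1}) and nominal biases bw_bar, ba_bar."""
    alpha = np.zeros(3)                 # (57d) initial conditions
    beta = np.zeros(3)
    R_kn = np.eye(3)                    # {}^k R_n, the rotation form of gamma_n
    for n in range(len(imu_samples) - 1):
        tau_n, gyro_n, accel_n = imu_samples[n]
        dtau = imu_samples[n + 1][0] - tau_n
        a_corr = np.asarray(accel_n) - ba_bar
        alpha = alpha + beta * dtau + 0.5 * (R_kn @ a_corr) * dtau**2   # (57a)
        beta = beta + (R_kn @ a_corr) * dtau                            # (57b)
        R_kn = R_kn @ Exp((np.asarray(gyro_n) - bw_bar) * dtau)         # (57c), matrix form
    gamma = rot_to_quat(R_kn)           # quaternion observation gamma_{k+1}
    return alpha, beta, gamma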

To fully stipulate the observation model, the Jacobians A, B, C must be stated as well. Based on (57), they can also be computed via an iterative scheme. First, let us find the formula for C^ω_{k+1}. We rewrite (57c) as

{}^k R_{n+1} = ∏_{j=0}^{n} Exp( (ω̆_j − b^ω_k) ∆τ_j ) = {}^k R_n Exp( (ω̆_n − b^ω_k) ∆τ_n ) = {}^k R_n {}^n R_{n+1}.   (61)

Then, we apply the approximation to each of the terms {}^k R_{n+1}, {}^k R_n, {}^n R_{n+1} under a small change in the bias:

{}^k R_{n+1} Exp(C^ω_{n+1} ∆b^ω_k)
  ≃ {}^k R_n Exp(C^ω_n ∆b^ω_k) {}^n R_{n+1} Exp(−H_n ∆τ_n ∆b^ω_k)          [by (8)]
  = {}^k R_{n+1} Exp( {}^n R_{n+1}^{-1} C^ω_n ∆b^ω_k ) Exp(−H_n ∆τ_n ∆b^ω_k)   [by (7)]
  ≃ {}^k R_{n+1} Exp( {}^n R_{n+1}^{-1} C^ω_n ∆b^ω_k − H_n ∆τ_n ∆b^ω_k ).      [by (11)]

Hence, multiplying by {}^k R_{n+1}^{-1} on the left and applying Log(·) to both sides, we have:

C^ω_{n+1} ≃ {}^n R_{n+1}^{-1} C^ω_n − H_n ∆τ_n,

where, in reference to (8), H_n and C^ω_0 are defined as:

H_n = H( (ω̆_n − b^ω_k) ∆τ_n ),   C^ω_0 = 0_{3×3}.   (62)

From (57a) it can be easily seen that

A^ω_{n+1} = A^ω_n + B^ω_n ∆τ_n − (1/2) {}^k R_n ⌊ă_n − b^a_k⌋_× C^ω_n ∆τ_n²,
A^a_{n+1} = A^a_n + B^a_n ∆τ_n − (1/2) {}^k R_n ∆τ_n²,
A^ω_0 = 0,  A^a_0 = 0.

Similarly, from (57b), we get

B^ω_{n+1} = B^ω_n − {}^k R_n ⌊ă_n − b^a_k⌋_× C^ω_n ∆τ_n,
B^a_{n+1} = B^a_n − {}^k R_n ∆τ_n,
B^ω_0 = 0,  B^a_0 = 0.

4) Measurement error covariance: The covariances of the IMU preintegration measurements are calculated based on a propagation scheme similar to [30], [37]. Specifically, recall that we have defined the measurement errors of the preintegrated IMU (pIMU) terms in (58), (59), (60) as I_t ≜ (δα_t, δβ_t, δθ_t, δb^ω_t, δb^a_t), where γ_t = γ_t|_{b̄_k} ⊗ (1, (1/2)δθ_t), α_t = α_t|_{b̄_k} + δα_t, β_t = β_t|_{b̄_k} + δβ_t, b^ω_t = b̄^ω_k + δb^ω_t, b^a_t = b̄^a_k + δb^a_t. The continuous-time dynamics of the error can be written as

İ_t = F_t I_t + G_t η_t,   (63)

where F_t, G_t, η_t are defined as

F_t ≜ [ 0   I   0                              0    0
        0   0   −{}^k R_t ⌊ă_t − b̄^a_k⌋_×      0   −{}^k R_t
        0   0   −⌊ω̆_t − b̄^ω_k⌋_×             −I    0
        0   0   0                              0    0
        0   0   0                              0    0 ],

G_t ≜ [  0          0    0   0
         0   −{}^k R_t   0   0
        −I          0    0   0
         0          0    I   0
         0          0    0   I ],   η_t ≜ (η^ω_t, η^a_t, η^{bω}_t, η^{ba}_t).   (64)

Hence, if we assume that the angular velocity and acceleration measurements are constant between two consecutive IMU sample times τ_n and τ_{n+1}, then F_t and G_t are constant on [τ_n, τ_{n+1}]. Thus the discrete-time dynamics of the measurement error can be calculated by

I_{n+1} = exp(F_n ∆τ_n) I_n + ∫_0^{∆τ_n} exp(F_n s) G_n η_{s+τ_n} ds,   (65)

and the covariance of I_t can be propagated from time t_k to t_{k+1} via the following iterative steps:

P_n = 0,  τ_n = t_k,   (66)
P_{n+1} = exp(F_n ∆τ_n) P_n exp(F_n ∆τ_n)^⊤ + Q_n,   (67)
Q_n = ∫_0^{∆τ_n} exp(F_n s) G_n Q G_n^⊤ exp(F_n s)^⊤ ds,   (68)

where Q ≜ diag(σ²_{η^ω}, σ²_{η^a}, σ²_{η^{bω}}, σ²_{η^{ba}}), and recall that exp(F_n ∆τ_n) is the matrix exponential of F_n ∆τ_n. This matrix exponential and G_n, Q_n can be computed by several techniques, such as the first-order approximation exp(F_n s) ≃ I + s F_n in [6], or the closed form in [37]. In this work we use the first-order approximation.
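A first-order version of the propagation (66)–(68), of the kind used here, might be sketched as follows; F_n and G_n built per (64) are passed in, and Q is the continuous-time noise covariance.

import numpy as np

def propagate_pimu_covariance(P_n, F_n, G_n, Q, dtau):
    """One step of (67)-(68) with the first-order approximation
    exp(F_n * s) ~ I + s * F_n, so Phi ~ I + dtau * F_n and the noise
    integral is approximated by dtau * G_n Q G_n^T."""
    Phi = np.eye(F_n.shape[0]) + dtau * F_n          # approximate state transition
    Q_n = dtau * (G_n @ Q @ G_n.T)                   # discretized noise covariance
    return Phi @ P_n @ Phi.T + Q_n

# Usage: start from P = 0 at tau_0 = t_k and fold in each IMU sample interval.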

5) Residual: The pIMU residual r_I(X̂_k, X̂_{k+1}, I_k) is therefore defined as

r_I(X̂_k, X̂_{k+1}, I_k) ≜ (r_γ, r_α, r_β, r_{bω}, r_{ba}),   (69)

r_γ ≜ 2 vec( (γ_{k+1} ⊗ E(C^ω_{k+1} ∆b̂^ω_k))^{-1} ⊗ ∆q̂_k ),   (70)

r_α ≜ R̂_k^{-1} (p̂_{k+1} − p̂_k − v̂_k ∆t_k + (1/2) g ∆t_k²) − A^ω_{k+1} ∆b̂^ω_k − A^a_{k+1} ∆b̂^a_k − α_{k+1},   (71)

r_β ≜ R̂_k^{-1} (v̂_{k+1} − v̂_k + g ∆t_k) − B^ω_{k+1} ∆b̂^ω_k − B^a_{k+1} ∆b̂^a_k − β_{k+1},   (72)

r_{bω} ≜ b̂^ω_{k+1} − b̂^ω_k,   r_{ba} ≜ b̂^a_{k+1} − b̂^a_k,   (73)

where ∆b̂^ω_k ≜ b̂^ω_k − b̄^ω_k, ∆b̂^a_k ≜ b̂^a_k − b̄^a_k, and ∆q̂_k ≜ q̂_k^{-1} ⊗ q̂_{k+1}.

6) Jacobian: Following the same manipulations as in the case of the OSL factor, we can find the Jacobians of the preintegrated rotation residual r_γ over the state estimates:

J^{r_γ}_{q_k} = −brc_3( ⌊γ̃_{k+1}⌋_r ⌊q̂_{k+1}^{-1} ⊗ q̂_k⌋_ℓ ),   (74)

J^{r_γ}_{q_{k+1}} = brc_3( ⌊γ̃_{k+1}^{-1} ⊗ q̂_k^{-1} ⊗ q̂_{k+1}⌋_ℓ ),   (75)

J^{r_γ}_{b^ω_k} = −brc_3( ⌊γ̃_{k+1}^{-1} ⊗ q̂_k^{-1} ⊗ q̂_{k+1}⌋_r ) C^ω_{k+1},   (76)

where γ̃_{k+1} ≜ γ_{k+1} ⊗ E(C^ω_{k+1} ∆b̂^ω_k).

The Jacobians of the preintegrated position residual r_α over the state estimates are

J^{r_α}_{q_k} = ⌊ R̂_k^⊤ (p̂_{k+1} − p̂_k − v̂_k ∆t_k + (∆t_k²/2) g) ⌋_× ,   (77)

J^{r_α}_{p_k} = −R̂_k^⊤ ,   J^{r_α}_{p_{k+1}} = R̂_k^⊤ ,   J^{r_α}_{v_k} = −R̂_k^⊤ ∆t_k ,   (78)

J^{r_α}_{b^ω_k} = −A^ω_{k+1} ,   J^{r_α}_{b^a_k} = −A^a_{k+1} .   (79)

The Jacobians of the preintegrated velocity residual r_β over the state estimates are

J^{r_β}_{q_k} = ⌊ R̂_k^⊤ (v̂_{k+1} − v̂_k + g ∆t_k) ⌋_× ,   (80)

J^{r_β}_{v_k} = −R̂_k^⊤ ,   J^{r_β}_{v_{k+1}} = R̂_k^⊤ ,   (81)

J^{r_β}_{b^ω_k} = −B^ω_{k+1} ,   J^{r_β}_{b^a_k} = −B^a_{k+1} .   (82)

Finally, for r_{bω} and r_{ba}, it is straightforward that

J^{r_{bω}}_{b^ω_k} = −I_{3×3} ,   J^{r_{bω}}_{b^ω_{k+1}} = I_{3×3} ,   (83)

J^{r_{ba}}_{b^a_k} = −I_{3×3} ,   J^{r_{ba}}_{b^a_{k+1}} = I_{3×3} .   (84)

Remark 2. The above calculations can be found in

C. UWB factors

At each time instance t_i ∈ (t_k, t_{k+1}] we have the following measurements and priors from a UWB range:

U^i = (d_i, x_i, y_i, δt_i, ∆t_k),  i ∈ N^k_U,   (85)

where d_i is the distance measurement, ∆t_k ≜ t_{k+1} − t_k, δt_i ≜ t_i − t_k, x_i is the position of the UWB anchor w.r.t. the world frame, and y_i is the coordinate of the UWB ranging node in the body frame. The UWB factor of this measurement is constructed via the following steps.

1) Observation model: If we assume that the velocity and orientation of the robot change at a constant rate from time t_k to t_{k+1}, then at time t_k + δt_i we have the following relative position of the UAV ranging node y_i from the anchor x_i:

n_i = n(p_{k+1}, q_k, q_{k+1}, v_k, v_{k+1}, δt_i, ∆t_k)
    ≜ p_{k+1} + R_k Exp( (δt_i/∆t_k) Log(R_k^{-1} R_{k+1}) ) y_i − ∫_{δt_i}^{∆t_k} ( v_k + (τ/∆t_k)(v_{k+1} − v_k) ) dτ − x_i
    = p_{k+1} + R_k Exp( s_i Log(R_k^{-1} R_{k+1}) ) y_i + a_i v_{k+1} + b_i v_k − x_i,   (86)

where

s_i = δt_i/∆t_k,   a_i = −(∆t_k² − δt_i²)/(2∆t_k),   b_i = −(∆t_k − δt_i)²/(2∆t_k).   (87)

Hence, we consider the distance measurement d_i at time t_k + δt_i as the norm of the vector n_i, corrupted by a zero-mean Gaussian noise η_{U_i} ∼ N(0, σ²_U):

d_i = ‖n(p_{k+1}, q_k, q_{k+1}, v_k, v_{k+1}, δt_i, ∆t_k)‖ + η_{U_i}.   (88)

2) Observation error covariance: Based on (88), we can straightforwardly identify the covariance of the measurement error as

P_{U_i} ≡ σ²_U.   (89)

3) Residual: The residual of a range measurement U^i is the difference between the distance estimate d̂_i and the distance measurement d_i:

r_U(X̂_k, X̂_{k+1}, U^i) ≜ r_{U_i} = ‖n̂_i‖ − d_i,   (90)

where n̂_i = n(p̂_{k+1}, q̂_k, q̂_{k+1}, v̂_k, v̂_{k+1}, δt_i, ∆t_k).
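A numpy sketch of the body-offset range prediction (86)–(87) and the residual (90), reusing the Exp and Log helpers assumed earlier, could be:

import numpy as np

def uwb_range_residual(p_k1, R_k, R_k1, v_k, v_k1, dt_i, Dt_k,
                       anchor, node_offset, d_meas):
    """Predict the body-offset range (86) at time t_k + dt_i and return the
    residual (90).  anchor = x_i (world frame), node_offset = y_i (body frame)."""
    s_i = dt_i / Dt_k
    a_i = -(Dt_k**2 - dt_i**2) / (2.0 * Dt_k)
    b_i = -(Dt_k - dt_i)**2 / (2.0 * Dt_k)
    # Orientation at t_k + dt_i, interpolated at a constant angular rate
    R_i = R_k @ Exp(s_i * Log(R_k.T @ R_k1))
    n_i = p_k1 + R_i @ node_offset + a_i * v_k1 + b_i * v_k - anchor
    return np.linalg.norm(n_i) - d_meas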


4) Jacobian: From (90) and (86), the Jacobians of the UWB residual over the position and velocity states can be calculated as follows:

J^{r_{U_i}}_{p_{k+1}} = n̂_i^⊤/‖n̂_i‖,   J^{r_{U_i}}_{v_{k+1}} = a_i n̂_i^⊤/‖n̂_i‖,   J^{r_{U_i}}_{v_k} = b_i n̂_i^⊤/‖n̂_i‖.   (91)

The Jacobians over the orientation states can be found by using the chain rule. First let us consider the variable q_{k+1}:

J^{r_{U_i}}_{q_{k+1}} = J^{r_{U_i}}_{R_i y} J^{R_i y}_{R_i} J^{R_i}_{s_i φ_k} J^{s_i φ_k}_{q_{k+1}},   (92)

where R_i ≜ Exp(s_i φ_k), φ_k ≜ Log(R̂_k^{-1} R̂_{k+1}), y ≜ y_i, and the intermediate Jacobians in the above chain are

J^{r_{U_i}}_{R_i y} = (n̂_i^⊤/‖n̂_i‖) R̂_k,   J^{R_i y}_{R_i} = −Exp(s_i φ_k) ⌊y⌋_× ,   (93)
J^{R_i}_{s_i φ_k} = H(s_i φ_k),   J^{s_i φ_k}_{q_{k+1}} = s_i H^{-1}(φ_k).   (94)

To find the Jacobian of r_{U_i} with respect to q_k, some manipulations are needed. First we notice that

R̂_k R_i = R̂_{k+1} R̂_{k+1}^{-1} R̂_k Exp(s_i φ_k)
        = R̂_{k+1} Exp( Log(R̂_{k+1}^{-1} R̂_k) ) Exp(s_i φ_k)
        = R̂_{k+1} Exp( (s_i − 1) φ_k ) ≜ R̂_{k+1} Exp(s̄_i φ_k) ≜ R̂_{k+1} R̄_i.

And thus we can obtain a Jacobian similar to (92), except with s̄_i and R̄_i in place of s_i and R_i:

J^{r_{U_i}}_{q_k} = J^{r_{U_i}}_{R̄_i y} J^{R̄_i y}_{R̄_i} J^{R̄_i}_{s̄_i φ_k} J^{s̄_i φ_k}_{q_k},   (95)

where the intermediate Jacobians are

J^{r_{U_i}}_{R̄_i y} = (n̂_i^⊤/‖n̂_i‖) R̂_{k+1},   J^{R̄_i y}_{R̄_i} = −Exp(s̄_i φ_k) ⌊y⌋_× ,   (96)
J^{R̄_i}_{s̄_i φ_k} = H(s̄_i φ_k),   J^{s̄_i φ_k}_{q_k} = s̄_i H^{-1}(φ_k).   (97)

VI. EXPERIMENTS

In this section, we report the results of the sensor fusion method on three types of datasets: public datasets, high-fidelity graphical-physical simulation, and field-collected datasets. Some implementation issues will also be discussed. A video illustrating the sensor fusion results on these datasets can be viewed at https://youtu.be/2um-n Wc9-k.

A. Public datasets

1) Preparation: Before working with physical systems, we employ the EuRoC datasets [38] to verify the efficacy of our sensor fusion method. We only focus on the V1 and V2 datasets, as these experiments provide both position and orientation ground truth data from VICON, which are necessary to demonstrate the capability of our method to estimate not only the position but also the orientation.

Since the datasets do not contain UWB measurements, the VICON data are used to generate UWB measurements, with realistic features taken into account. Specifically, we simulate a UWB network of four UWB anchor nodes and two UAV nodes. In addition, each UAV node is assumed to have two antennae A and B, thus there are effectively four UAV ranging nodes in total. We assign the ID numbers 100, 101, 102, 103 to the anchors, and their respective coordinates in the world frame are chosen as (3, 3, 3), (3, -3, 0.5), (-3, -3, 3), (-3, 3, 0.5). Similarly, we assign the ID numbers 200.A, 200.B, 201.A, 201.B to the UAV ranging nodes, and their coordinates in the body frame are chosen as (0.25, -0.25, 0), (0.25, 0.25, 0), (-0.25, 0.25, 0), (-0.25, -0.25, 0).

Each UAV ranging node is given a pre-determined ranging cycle. Specifically, the ranging sequence of node 200 is set as:

200.A→ 100, 200.B→ 100, 200.A→ 101, 200.B→ 101,

200.A→ 102, 200.B→ 102, 200.A→ 103, 200.B→ 103.

Concurrently, the ranging cycle of node 201 is similar, but the ID of the anchor node is cyclically shifted by two units, i.e.:

201.A→ 102, 201.B→ 102, 201.A→ 103, 201.B→ 103,

201.A→ 100, 201.B→ 100, 201.A→ 101, 201.B→ 101.

Hence, at each step, two UWB measurements can be obtained simultaneously, over a period of 6 steps. The steps are set 25 ms apart, so we effectively have an 80 Hz update rate for the UWB measurements. Finally, a zero-mean Gaussian noise with a standard deviation of 0.05 m is added to each measurement, based on the characteristics of the physical sensor.
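The following sketch illustrates how such a measurement could be synthesized from a ground-truth pose according to the body-offset model (88); the anchor and node coordinates are the ones listed above, and the noise level matches the stated 0.05 m standard deviation.

import numpy as np

ANCHORS = {100: np.array([3, 3, 3.0]), 101: np.array([3, -3, 0.5]),
           102: np.array([-3, -3, 3.0]), 103: np.array([-3, 3, 0.5])}
NODES = {"200.A": np.array([0.25, -0.25, 0.0]), "200.B": np.array([0.25, 0.25, 0.0]),
         "201.A": np.array([-0.25, 0.25, 0.0]), "201.B": np.array([-0.25, -0.25, 0.0])}

def simulate_uwb_range(p_WB, R_WB, node_id, anchor_id, sigma=0.05, rng=np.random):
    """Distance from a body-mounted ranging node to an anchor, with additive
    zero-mean Gaussian noise, given the ground-truth pose (p_WB, R_WB)."""
    node_world = p_WB + R_WB @ NODES[node_id]
    true_range = np.linalg.norm(node_world - ANCHORS[anchor_id])
    return true_range + rng.normal(0.0, sigma)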

The software packages are implemented under the ROS framework, and the optimization method employs the ceres solver. For each dataset, we run the VINS-Fusion¹ algorithm and ORB-SLAM3² [31] to estimate the aerial robot's pose based on data from the stereo camera and IMU. The VINS-Fusion estimate is then treated as OSL data and fused with UWB and IMU using the methods proposed in the previous sections.

2) Results and evaluation: After running the algorithms on the datasets, we obtain the OSL-based trajectory ^L X_OSL = {(^L p_k, ^L q_k)}_{k=0}^{K} from the OSL system, and ^W X_VIRAL = {(^W p_k, ^W q_k)}_{k=0}^{K} from the VIRAL method. Since each OSL method provides a trajectory estimate ^L X_OSL w.r.t. a local frame L that coincides with the initial pose (^W p_0, ^W q_0) of the UAV, we need to transform ^L X_OSL to ^W X_OSL = {(^W p_k, ^W q_k)}_{k=0}^{K} before we can compare it with ^W X_VIRAL. To this end, we perform:

^W p_k = R(^W q_0) ^L p_k + ^W p_0,   (98)
^W q_k = ^W q_0 ⊗ ^L q_k,   (99)

where (^W p_0, ^W q_0) is the initial pose of the UAV according to the ground truth. Hence we calculate the Root Mean Square Error for the position and rotation estimates as follows

RMSE_pos = √( (1/(K+1)) Σ_{k=0}^{K} ‖p̂_k − p_k‖² ),   (100)

RMSE_rot = √( (1/(K+1)) Σ_{k=0}^{K} ‖Log(R_k^{−1} R̂_k)‖² ),   (101)

where p_k and R_k are the position and orientation from ground truth, and p̂_k and R̂_k are the position and orientation estimates from either the OSL or the VIRAL algorithm. The results are shown in Tab. II.
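As an illustration, the world-frame alignment (98)-(99) and the error metrics (100)-(101) could be computed along the following lines. This is only a sketch: the function names and the use of scipy are our own choices, not part of the original implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as Rot

def align_to_world(p_local, q_local_xyzw, p0_world, q0_world_xyzw):
    """Transform a local OSL trajectory into the world frame using the
    ground-truth initial pose, as in (98)-(99)."""
    R0 = Rot.from_quat(q0_world_xyzw).as_matrix()
    p_world = (R0 @ p_local.T).T + p0_world
    q_world = (Rot.from_quat(q0_world_xyzw) * Rot.from_quat(q_local_xyzw)).as_quat()
    return p_world, q_world

def rmse_pos(p_est, p_gt):
    """Positional RMSE as in (100), inputs are (K+1, 3) arrays [m]."""
    return np.sqrt(np.mean(np.sum((p_est - p_gt) ** 2, axis=1)))

def rmse_rot(q_est_xyzw, q_gt_xyzw):
    """Rotational RMSE as in (101): geodesic angle of R_gt^{-1} R_est, in degrees."""
    dR = Rot.from_quat(q_gt_xyzw).inv() * Rot.from_quat(q_est_xyzw)
    ang = np.linalg.norm(dR.as_rotvec(), axis=1)   # rotation angle per pose [rad]
    return np.degrees(np.sqrt(np.mean(ang ** 2)))
```

The rotational error is reported in degrees to match the units used in Tab. II.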

1 https://github.com/HKUST-Aerial-Robotics/VINS-Fusion
2 https://github.com/UZ-SLAMLab/ORB_SLAM3


(a) V1 01 dataset. (b) V1 02 dataset. (c) V1 03 dataset.

(d) V2 01 dataset. (e) V2 02 dataset. (f) V2 03 dataset.

Fig. 5: Trajectories of the UAV and the estimates by multiple methods with the EuRoC datasets.

Fig. 6: Position estimation error of OSL and VIRAL methods on the V1 03 dataset.

TABLE II: Translational and rotational RMSE of VINS-Fusion, ORB-SLAM3 and VIRAL estimates on the EuRoC datasets.

Dataset    RMSEpos [m]                  RMSErot [deg]
           VINS     ORB3     VIRAL      VINS      ORB3     VIRAL
V1 01      0.2508   0.2230   0.1442     0.7935    2.7362    2.3019
V1 02      0.3215   0.3628   0.1509     1.7794    2.8454    2.7633
V1 03      0.2460   0.1937   0.1498     5.9205    3.3969    1.9597
V2 01      0.1948   0.1891   0.1482     6.9295    6.0469    5.6066
V2 02      0.3588   0.2569   0.1685     8.8905    8.4701    9.5030
V2 03      0.7027   0.2630   0.1663    11.9267    8.9584   10.2057

From Tab. II, we can clearly see that the VIRAL scheme achieves better positional RMSE compared to both VINS-Fusion and ORB-SLAM3. This is the desired and expected outcome, since the VINS-Fusion and ORB-SLAM3 algorithms, being OSL methods, will accumulate error over time, while the VIRAL method employs ranging measurements that can constrain the error. This can be seen more clearly in Fig. 5 and Fig. 6, where the trajectory estimate by VINS-Fusion clearly diverges from the ground truth towards the end of the trajectory, while the VIRAL estimate still tracks the ground truth.

Fig. 7: Orientation estimation error of OSL and VIRAL methods from the V1 03 dataset.

In terms of rotational RMSE, we can see from Tab. II that our proposed scheme achieves the best accuracy in as many datasets as each of the VIO methods (two out of six). The same explanation as for the positional RMSE applies here: the body-offset ranging scheme ensures that the estimation error will not drift, while the OSL methods may suffer from some bias error, as can be seen in Fig. 7.


(a) Dataset #01. (b) Dataset #02. (c) Dataset #03.

Fig. 8: Trajectories of the UAV and the estimates from multiple methods in the datasets generated from AirSim.

TABLE III: Translational, rotational and velocity RMSE of VINS-Fusion, ORB-SLAM3, LOAM and VIRAL estimates on our AirSim datasets.

Dataset   RMSEpos [m]                                     RMSErot [deg]                                    RMSEvel [m/s]
          VINS    ORB3    LOAM (H)  LOAM (V)  VIRAL       VINS    ORB3    LOAM (H)  LOAM (V)  VIRAL        VINS    ORB3    VIRAL
01        0.8616  1.1260  0.5561    4.5228    0.0736      5.7186  3.5593  1.2890    17.7741   1.2291       0.1071  0.2633  0.0699
02        0.4735  0.1102  0.1283    2.1169    0.0656      4.1718  0.1561  1.0533     2.3456   2.7413       0.0824  0.2613  0.0830
03        0.8392  0.6343  0.3427    0.3902    0.1222      1.6726  0.2957  1.6611     1.9817   1.4319       0.2352  0.4385  0.1438

Fig. 9: Velocity estimate from VINS-Fusion and VIRAL compared with ground truth from the AirSim test #02.

B. High fidelity visual and physical simulation

Though the EuRoC datasets have demonstrated the effectiveness of our localization method, they do not match well with our experiment setup in the later parts of this work. Specifically, the movement is limited within a small area, and the datasets do not contain Lidar measurements. Moreover, they do not contain velocity ground truth to compare with our estimate. Hence, to verify the method in a more general manner, we employ the AirSim software package [39] to construct new datasets that resemble the actual scenario with our UAV platform. The software packages needed to reconstruct the datasets for this experiment can be found at https://github.com/britsknguyen/airsim_pathplanner.

The Building 99 environment is used for this experiment, which features a large indoor foyer area. For this experiment, the following sensors are integrated with the UAV model: one 10 Hz stereo-camera setup, two 10 Hz Lidar sensors, one 400 Hz IMU, two UWB sensors, each with two antennae, and four anchors. The configuration of the UWB network is similar to Section VI-A1, except that we set the simulated anchors at the coordinates (0, 0, 11), (-10, 0, 11), (-10, -10, 11), (0, -10, 11). This imitates an installation of these anchors on the ceiling of the building in the real-world scenario. The coordinates of the four UWB ranging nodes in the body frame are set at (0.35, -0.35, 0), (0.35, 0.35, 0), (-0.35, 0.35, 0), (-0.35, -0.35, 0), catering to a bigger platform compared to that in the EuRoC experiment. Finally, the two Lidars are rotated so that one, the so-called horizontal Lidar, can effectively scan the surrounding walls, while the other, the so-called vertical Lidar, can scan the floor and the ceiling. The open-sourced A-LOAM software package is used to provide the LOAM estimate from the Lidar data. The data from the VINS-Fusion and LOAM methods are used as OSL data to be fused with UWB and IMU (ORB-SLAM3 is only used for reference, as the software package currently does not support publishing its data in real time to ROS).

3 https://github.com/microsoft/AirSim/releases/tag/v1.3.1-linux
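To summarize the simulated setup, the configuration could be captured in a small structure like the one below. This is not AirSim's native settings format, only an illustrative summary that a hypothetical dataset-generation script might consume; the anchor IDs are assumed to carry over from the EuRoC experiment, and the exact Lidar mounting angles are not stated in the text, so only their roles are recorded.

```python
# Compact, illustrative summary of the simulated AirSim sensor suite described above.
AIRSIM_SETUP = {
    "imu_rate_hz": 400,
    "stereo_camera_rate_hz": 10,
    "lidars": {
        "horizontal": {"rate_hz": 10, "role": "scans the surrounding walls"},
        "vertical":   {"rate_hz": 10, "role": "scans the floor and the ceiling"},
    },
    "uwb_anchors_world_m": {                    # ceiling-mounted anchors
        100: (0.0, 0.0, 11.0),    101: (-10.0, 0.0, 11.0),
        102: (-10.0, -10.0, 11.0), 103: (0.0, -10.0, 11.0),
    },
    "uwb_antennae_body_m": {                    # ranging antennae on the UAV body
        "200.A": (0.35, -0.35, 0.0), "200.B": (0.35, 0.35, 0.0),
        "201.A": (-0.35, 0.35, 0.0), "201.B": (-0.35, -0.35, 0.0),
    },
}
```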

4 https://github.com/HKUST-Aerial-Robotics/A-LOAM

Fig. 8 presents the trajectories of the UAV in three experiments, along with the estimates from VINS-Fusion, ORB-SLAM3, LOAM and our proposed method. These trajectories are generated from some fixed setpoints with intended purposes. In the first experiment, we generate a trajectory that goes through all the points in a lattice of cubic cells, and the heading setpoint is set to the direction from one setpoint to the next; thus this trajectory is supposed to cover all four degrees of freedom of the UAV (a sketch of such a lattice setpoint generator is given after this paragraph). In the second experiment, the trajectory is similarly set by some setpoints limited to an xy plane. Finally, we generate a trajectory from setpoints on the xz plane, and set the heading fixed to the -y direction. This trajectory most resembles our actual operation in the field test, where a vertical structure is to be inspected. From Fig. 8, it can be easily seen that all OSL methods exhibit significant drift over time, while our method does not suffer from this, thanks to the installation of some fixed anchors in the environment.
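A minimal sketch of how the lattice setpoints of the first experiment could be generated is given below. The lattice extents, the 2 m cell size and the back-and-forth visiting order are assumptions; the yaw-towards-next-setpoint rule follows the description above.

```python
import numpy as np

def lattice_setpoints(x_max=8.0, y_max=8.0, z_max=4.0, cell=2.0):
    """Generate (position, yaw) setpoints covering a lattice of cubic cells.
    The yaw at each setpoint points towards the next setpoint, so that all
    four controllable degrees of freedom (x, y, z, yaw) are exercised."""
    xs = np.arange(0.0, x_max + 1e-9, cell)
    ys = np.arange(0.0, y_max + 1e-9, cell)
    zs = np.arange(1.0, z_max + 1e-9, cell)   # start 1 m above the ground

    # Visit the lattice in a back-and-forth (boustrophedon) order so that
    # consecutive setpoints are adjacent cells.
    points = []
    for iz, z in enumerate(zs):
        ys_order = ys if iz % 2 == 0 else ys[::-1]
        for iy, y in enumerate(ys_order):
            xs_order = xs if iy % 2 == 0 else xs[::-1]
            points.extend(np.array([x, y, z]) for x in xs_order)

    setpoints, last_yaw = [], 0.0
    for i, p in enumerate(points):
        if i + 1 < len(points):
            d = points[i + 1] - p
            if np.hypot(d[0], d[1]) > 1e-6:      # keep previous yaw on purely vertical moves
                last_yaw = np.arctan2(d[1], d[0])  # heading towards the next setpoint
        setpoints.append((p, last_yaw))
    return setpoints
```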

Tab. III summarizes the RMSE of the estimation errors of all methods on these datasets. Again, we see the advantage of the VIRAL method in terms of positional estimation. In addition, the orientation and velocity estimates also demonstrate the best or second-best accuracy compared to the other methods. Fig. 9 shows the velocity estimate from VINS, VIRAL and the ground truth. It can be seen that the velocity estimate follows the ground truth quite closely, which demonstrates the advantage of our method over previous works that only considered positional states [18]–[22].

C. Field-collected datasets

To further verify the efficacy of the localization scheme, we develop a platform consisting of a camera system, one IMU, two Lidar sensors, two UWB modules and three anchors. The configuration is similar to the AirSim test. Again, the open-sourced VINS-Fusion and A-LOAM packages are used to generate the VIO and LOAM data for the fusion scheme. Unfortunately, we find that ORB-SLAM3 cannot work with these datasets, as the front end fails to detect reliable features. A Leica MS60 station with millimeter-level accuracy is used to provide ground truth for the experiment. All software is run on a small form factor computer with an Intel Core i7-8650U processor. Fig. 10 presents an overview of our setup.

Fig. 10: The hardware setup used in our flight tests.

As introduced earlier, to keep the problem simple and cost-effective, in actual operation we only use three anchors that can be quickly deployed in the field. Fig. 11 illustrates our deployment scheme. Specifically, given three anchors 0, 1, 2, we can define the origin of the coordinate system to be at anchor 0. Then the +x axis of the coordinate

5 http://wiki.ros.org/ueye_cam
6 https://www.vectornav.com/products/vn-100
7 https://ouster.com/products/os1-Lidar-sensor/
8 https://go.humatics.com/p440
9 https://leica-geosystems.com/en-sg/products/total-stations/multistation/leica-nova-ms60
10 https://simplynuc.com

Fig. 11: The simple deployment scheme for three anchors.

TABLE IV: Estimated coordinates of the three anchors in the four flight tests.

Test       Anc. 0 coord. [m]      Anc. 1 coord. [m]      Anc. 2 coord. [m]
           X     Y     Z          X      Y     Z         X      Y       Z
test 01    0.0   0.0   1.0        61.55  0.0   1.0       24.14  -14.15  1.0
test 02    0.0   0.0   1.0        39.53  0.0   1.0       14.28   -9.92  1.0
test 03    0.0   0.0   1.0        44.98  0.0   1.0       28.46   -9.97  1.0
test 04    0.0   0.0   1.0        12.16  0.0   1.0        6.08  -11.64  1.0

system can be defined by the direction from anchor 0 to anchor 1. By choosing the z axis to point upwards from the ground, we can easily determine the direction of the y axis using the right-hand rule, which is intended to point towards the structure that we seek to inspect. Finally, we place anchor 2 on the -y half-plane, further away from the structure than anchor 0 and anchor 1. All of the anchors are supposed to be on a plane parallel to the ground, and their z coordinates are set to 1 m. By using a communication protocol, the UAV can collect the distances between these three anchors and estimate the remaining unknown coordinates of the anchors, i.e. anchor 1's x coordinate and anchor 2's x and y coordinates (a short sketch of this computation is given below). Thanks to this automated self-localization scheme for the anchors, we can set up an anchor network spanning over 60 m within 10 to 15 minutes, most of which is walking time. Otherwise it would be quite a tedious task to measure the coordinates of these anchors, especially over such an extensive distance. The estimated coordinates of the anchors in the flight tests are given in Tab. IV. Fig. 12 shows the UWB measurements in one flight test to demonstrate the advantage of the body-offset ranging scheme.
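With the deployment convention above, the unknown anchor coordinates follow from a simple planar trilateration. The sketch below illustrates the computation; the function and argument names are our own and not the onboard code.

```python
import math

def self_localize_anchors(d01, d02, d12, z=1.0):
    """Recover the unknown anchor coordinates from the three inter-anchor
    distances, following the convention above: anchor 0 at the origin,
    anchor 1 on the +x axis, anchor 2 on the -y half-plane, all at height z."""
    x1 = d01                                        # anchor 1 lies on the +x axis
    x2 = (d01**2 + d02**2 - d12**2) / (2.0 * d01)   # planar trilateration
    y2 = -math.sqrt(max(d02**2 - x2**2, 0.0))       # -y half-plane by convention
    return {0: (0.0, 0.0, z), 1: (x1, 0.0, z), 2: (x2, y2, z)}

# Example: with inter-anchor distances back-computed from Tab. IV for illustration,
# self_localize_anchors(39.53, 17.39, 27.13) places anchor 2 at roughly
# (14.3, -9.9, 1.0), matching the coordinates reported for test 02.
```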

A total of three flight tests were conducted. A video recording of a flight test can be viewed online. As introduced earlier, since our work is meant to deliver a solution for UAV-based inspection applications, the UAV's trajectory is different from those in common datasets, which are often extensive in the x and y directions but have limited movement in the z direction. In our case, for inspection purposes, the UAV only takes off and moves in a vertical plane parallel to the object. Hence the movements along the x and z directions are extensive, while movement along the y direction is minimal. Fig. 13 shows the trajectories of the UAV in these flight tests, along with the estimated results.

Similar to the experiments in the previous section, for each dataset we run an experiment where the OSL data are fused with the UWB and IMU data. Since the transform between the ground truth coordinate

11 https://youtu.be/2um-n_Wc9-k


Fig. 12: Distance measurements to the anchors from field-collected Dataset 01. From the zoomed-in part of the 200.A→101 plot (middle), the variation of the distance is about 5 cm. In the zoomed-in part of the leftmost plot, we can see that the distances from the UAV nodes to the anchors are still distinguishable thanks to the significant separation of the UAV nodes on the body frame. We also notice some outliers in the 200.A→102 and 200.B→102 ranging measurements, caused by the pilot blocking the anchor during landing. However, they are successfully rejected by the algorithm.

(a) Dataset 01. (b) Dataset 02. (c) Dataset 03.

Fig. 13: Trajectories of the UAV and the estimates from multiple methods in the self-collected datasets. In datasets 01 and 02, ground truth was temporarily lost for some periods. The data in these periods are not used in the error analysis.

(a) Position estimate from the VIRAL system over time, compared with estimates from OSL systems.

(b) Orientation estimate from the VIRAL system over time, compared with estimates from OSL systems.

Fig. 14: Position and orientation estimates from the OSL and VIRAL methods in Dataset 03: it can be seen that all OSL methods exhibit significant drift in this scenario, while the VIRAL method provides drift-free estimates despite using only three anchor nodes deployed at the test area.

frame and the local frame of reference is not known, we transform the estimates using the method in [40], which finds the yaw and translation that minimize the position estimation error of each trajectory (a minimal sketch of this alignment step is given after this paragraph), and report these errors in Tab. V. Moreover, since the Leica ground station does not provide orientation measurements, we use the data from the onboard


(a) Position estimation error in each direction. (b) Orientation estimation error in yaw-pitch-roll representation.

Fig. 15: Position and orientation estimation error of VIRAL method in dataset #03.

TABLE V: Translational RMSE of VINS-Fusion, LOAM and VIRAL estimates on our collected datasets.

Dataset    RMSEpos [m]
           VINS     LOAM (H)   LOAM (V)   VIRAL
01         0.8416   5.4666      2.9470    0.3965
02         0.9381   2.9086      1.5804    0.2138
03         2.4057   1.0344     12.1921    0.1692

TABLE VI: Rotational RMSE of VINS-Fusion, LOAM and VIRAL estimates on our collected datasets.

Dataset    RMSErot [deg]
           VINS      LOAM (H)   LOAM (V)   VIRAL
01         10.6475   5.7859      4.8602    3.4666
02         10.6973   7.4157      5.4855    2.6131
03         10.1036   1.9316     11.5245    2.7682

attitude and heading reference system (AHRS) of the drone as ground truth, which is trusted to be accurate as it fuses magnetometer measurements, and no interference was detected at the test location on that day. The orientation estimation errors are reported in Tab. VI. Fig. 14 and Fig. 15 show the detailed plots of the position and orientation estimates in one of the experiments.
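For reference, a yaw-plus-translation alignment in the spirit of [40] admits a simple closed form. The sketch below is our own minimal version and not the evaluation code actually used; it assumes the two trajectories are already associated pose by pose.

```python
import numpy as np

def align_yaw_translation(p_est, p_gt):
    """Align an estimated trajectory to ground truth using only a yaw rotation
    and a translation; p_est and p_gt are (K, 3) arrays of corresponding
    positions. Returns the aligned estimated positions."""
    mu_e, mu_g = p_est.mean(axis=0), p_gt.mean(axis=0)
    a, b = p_est - mu_e, p_gt - mu_g
    # Closed-form yaw that minimizes the sum of squared position errors.
    C = np.sum(b[:, 0] * a[:, 0] + b[:, 1] * a[:, 1])
    S = np.sum(b[:, 1] * a[:, 0] - b[:, 0] * a[:, 1])
    yaw = np.arctan2(S, C)
    Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0,          0.0,         1.0]])
    t = mu_g - Rz @ mu_e
    return (Rz @ p_est.T).T + t
```

Restricting the rotation to yaw only is appropriate here because gravity renders roll and pitch observable, so only the heading and translation of the local frame are ambiguous.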

From the results in Tab. V and Tab. VI, we can reaffirm the benefit of the UWB-IMU-OSL fusion scheme, whose positional error is much smaller than that of the OSL methods. In Fig. 15a, we can see that the position error is bounded. The orientation estimate also performs quite well, and Fig. 15b shows that the orientation error is also bounded. Compared to the other experiments on the EuRoC and AirSim datasets, the positional error is slightly larger. This can be attributed to some small errors in the anchor self-localization procedure, as the anchors may not be exactly on the same level. Also, some biases in the distance measurements due to the extension cable may affect the accuracy of the UWB measurements. Nevertheless, as can be seen in Fig. 14, while all of the OSL estimates exhibit significant drift over the operation range, our sensor fusion's estimation error remains bounded, which is the most important feature required for this work.

VII. CONCLUSION

In this paper we have developed an optimization-based sensor fusion framework for visual, inertial, ranging and Lidar measurements. Detailed mathematical models for the construction of the cost function and the on-manifold optimization procedure have been carefully derived. The framework has been successfully implemented with the open-source ceres solver and the ROS framework. We have also conducted experiments to verify the efficacy and effectiveness of the sensor fusion scheme on multiple datasets, and shown that via the use of some anchors that can be quickly deployed in the field, the issues of OSL drift and misaligned frames of reference can be effectively overcome. Moreover, results on the estimation of position, orientation, and velocity have also been presented to show the comprehensiveness of the estimator. Via the simulation and experimental results, it can be seen that the VIRAL scheme is reliable, flexible, accurate, and has great potential in many real-world applications.

REFERENCES

[1] S. Shen, N. Michael, and V. Kumar, "3D indoor exploration with a computationally constrained MAV," in Robotics: Science and Systems, 2011.
[2] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 298–304.
[3] J. Engel, T. Schops, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
[4] A. Weinstein, A. Cho, G. Loianno, and V. Kumar, "Visual inertial odometry swarm: An autonomous swarm of vision-based quadrotors," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1801–1807, 2018.
[5] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 15–22.
[6] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[7] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[8] J. Zhang and S. Singh, "Laser–visual–inertial odometry and mapping with high robustness and low drift," Journal of Field Robotics, vol. 35, no. 8, pp. 1242–1264, 2018.
[9] H. Ye, Y. Chen, and M. Liu, "Tightly coupled 3D lidar inertial odometry and mapping," in 2019 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019.
[10] T. Shan, B. Englot, D. Meyers, W. Wang, C. Ratti, and D. Rus, "LIO-SAM: Tightly-coupled lidar inertial odometry via smoothing and mapping," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.
[11] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, "Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones," IEEE Journal of Solid-State Circuits, vol. 54, no. 4, pp. 1106–1119, 2019.
[12] T. Qin and S. Shen, "Online temporal calibration for monocular visual-inertial systems," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3662–3669.
[13] K. Guo, Z. Qiu, C. Miao, A. H. Zaini, C.-L. Chen, W. Meng, and L. Xie, "Ultra-wideband-based localization for quadcopter navigation," Unmanned Systems, vol. 4, no. 01, pp. 23–34, 2016.


[14] T.-M. Nguyen, A. H. Zaini, K. Guo, and L. Xie, "An ultra-wideband-based multi-UAV localization system in GPS-denied environments," in International Micro Air Vehicle Competition and Conference 2016, Beijing, China, Oct 2016, pp. 56–61.
[15] J. Tiemann and C. Wietfeld, "Scalable and precise multi-UAV indoor navigation using TDOA-based UWB localization," in 2017 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 2017, pp. 1–7.
[16] M. W. Mueller, M. Hamer, and R. D'Andrea, "Fusing ultra-wideband range measurements with accelerometers and rate gyroscopes for quadrocopter state estimation," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1730–1736.
[17] J. Li, Y. Bi, K. Li, K. Wang, F. Lin, and B. M. Chen, "Accurate 3D localization for MAV swarms by UWB and IMU fusion," in 2018 14th IEEE International Conference on Control and Automation (ICCA). IEEE, 2018, pp. 100–105.
[18] X. Fang, C. Wang, T.-M. Nguyen, and L. Xie, "Model-free approach for sensor network localization with noisy distance measurement," in 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 2018, pp. 1973–1978.
[19] X. Fang, C. Wang, T.-M. Nguyen, and L. Xie, "Graph optimization approach to range-based localization," IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–12, 2020.
[20] C. Wang, H. Zhang, T.-M. Nguyen, and L. Xie, "Ultra-wideband aided fast localization and mapping system," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
[21] T. H. Nguyen, T.-M. Nguyen, M. Cao, and L. Xie, "Loosely-coupled ultra-wideband-aided scale correction for monocular visual odometry," Unmanned Systems, vol. 08, no. 02, pp. 179–190, 2020. [Online]. Available: https://www.worldscientific.com/doi/10.1142/S2301385020500119
[22] T. H. Nguyen, T.-M. Nguyen, and L. Xie, "Tightly-coupled single-anchor ultra-wideband-aided monocular visual odometry system," in 2020 IEEE International Conference on Robotics and Automation (ICRA). Paris, France: IEEE, 2020.
[23] ——, "Tightly-coupled ultra-wideband-aided monocular visual SLAM with degenerate anchor configurations," Autonomous Robots (Under Review), 2020.
[24] Y. Cao and G. Beltrame, "VIR-SLAM: Visual, inertial, and ranging SLAM for single and multi-robot systems," arXiv preprint arXiv:2006.00420, 2020.
[25] Y. Song, M. Guan, W. P. Tay, C. L. Law, and C. Wen, "UWB/lidar fusion for cooperative range-only SLAM," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6568–6574.
[26] J. Paredes, F. Alvarez, T. Aguilera, and J. Villadangos, "3D indoor positioning of UAVs with spread spectrum ultrasound and time-of-flight cameras," Sensors, vol. 18, no. 1, p. 89, 2018.
[27] R. Mautz, "Indoor positioning technologies," ETH Zurich's Research Collection, 2012. [Online]. Available: https://doi.org/10.3929/ethz-a-007313554
[28] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[29] S. Agarwal, K. Mierle, and Others, "Ceres solver," http://ceres-solver.org.
[30] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[31] C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. Montiel, and J. D. Tardos, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM," arXiv preprint arXiv:2007.11898, 2020.
[32] G. S. Chirikjian, Stochastic Models, Information Theory, and Lie Groups, Volume 2: Analytic Methods and Modern Applications. Springer Science & Business Media, 2011, vol. 2.
[33] J. Sola, J. Deray, and D. Atchuthan, "A micro Lie theory for state estimation in robotics," arXiv preprint arXiv:1812.01537, 2018.
[34] W. Huang, H. Liu, and W. Wan, "An online initialization and self-calibration method for stereo visual-inertial odometry," IEEE Transactions on Robotics, 2020.
[35] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "On-manifold preintegration for real-time visual–inertial odometry," IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2016.
[36] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[37] K. Eckenhoff, P. Geneva, and G. Huang, "Closed-form preintegration methods for graph-based visual–inertial navigation," The International Journal of Robotics Research, vol. 38, no. 5, pp. 563–586, 2019.
[38] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
[39] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "AirSim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics. Springer, 2018, pp. 621–635.
[40] Z. Zhang and D. Scaramuzza, "A tutorial on quantitative trajectory evaluation for visual (-inertial) odometry," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 7244–7251.