
J Intell Robot Syst (2014) 74:1–16. DOI 10.1007/s10846-013-9907-6

On-Board Dual-Stereo-Vision for the Navigation of an Autonomous MAV

Konstantin Schauwecker · Andreas Zell

Received: 15 August 2013 / Accepted: 12 September 2013 / Published online: 11 October 2013
© Springer Science+Business Media Dordrecht 2013

Abstract We present a quadrotor Micro Aerial Vehicle (MAV) equipped with four cameras, which are arranged in two stereo configurations. The MAV is able to perform stereo matching for each camera pair on-board and in real-time, using an efficient sparse stereo method. In case of the camera pair that is facing forward, the stereo matching results are used for a reduced stereo SLAM system. The other camera pair, which is facing downwards, is used for ground plane detection and tracking. Hence, we are able to obtain a full 6DoF pose estimate from each camera pair, which we fuse with inertial measurements in an extended Kalman filter. Special care is taken to compensate various drift errors. In an evaluation we show that using two instead of one camera pair significantly increases the pose estimation accuracy and robustness.

Keywords Micro Aerial Vehicle (MAV) · Unmanned Aerial Vehicle (UAV) · Stereo vision · Robot vision · Simultaneous Localization And Mapping (SLAM) · Parallel Tracking And Mapping (PTAM) · Sensor fusion

K. Schauwecker (B) · A. Zell
Computer Science Department, Cognitive Systems, University of Tübingen, Sand 1, 72076 Tübingen, Germany
e-mail: [email protected]

1 Introduction

In recent years, Micro Aerial Vehicles (MAVs) have become increasingly popular in the robotics research community. In particular, rotary-wing MAVs such as quadrotors have received much attention. Unlike their fixed-wing counterparts, these MAVs are able to take off and land vertically, and can move at low speeds. These properties allow them to be operated inside buildings or other confined spaces.

One of the main challenges in constructing a robotic MAV is to enable the MAV to localize itself with respect to its environment. Only if the MAV can track its own position can it perform autonomous flight maneuvers. For outdoor flights, an MAV can rely on GPS to obtain an estimate of its current location. For indoor flights, however, this is generally not possible due to a lack of GPS coverage. We hence require different sensors to facilitate autonomous indoor flight.

The choice of available sensors is limited by the maximum payload and power consumption constraints imposed by the MAV platform. For wheeled robots, laser scanners are a popular sensor choice, as they offer an accurate 3D perception of the robot's environment and can be used for robust and accurate localization. However, such laser scanners have a considerable weight and power consumption, which makes them difficult to employ on an MAV. This is particularly true for laser scanners that use several beams to obtain measurements from more than one plane.

In this research, we hence focus on vision-based methods for the localization of a robotic MAV. Unlike laser scanners, cameras are generally much lighter and have a lower power consumption. A single camera can be used for 3D localization, as has been shown e.g. in [10]. However, in this case the camera position can only be determined with respect to an unknown and unobservable scaling factor. An alternative, which does not suffer from this problem, is stereo vision. Thanks to the additional depth measurements received from a stereo camera, the camera position can be tracked without any scale ambiguity. The processing of stereo images, however, requires significant computing resources. Hence, it has only recently become possible to equip an MAV with enough computing power to process stereo images on-board.

Usually the stereo camera is mounted in a forward-facing configuration. This provides a large field of view and facilitates the detection of obstacles that lie ahead in the flight direction. Unfortunately, however, a forward-facing camera does not perform well during fast yaw rotations. In this case, the camera registers fast image movements or even motion blur, which impedes current visual navigation methods. An alternative would be a downward-facing camera, which in the case of stereo vision can also be used for detecting the ground distance and the relative orientation of the MAV towards the ground plane. While a downward-facing camera is better at observing fast yaw rotations, it exhibits similar problems when the MAV is flying close to the ground. Further, in case of a downward-facing stereo camera, stereo matching is only possible once the MAV has reached a minimum altitude.

These dissimilar strengths of downward- and forward-facing cameras have led us to believe that they would complement each other when used in a combined setting. In this paper we hence propose an MAV design that is equipped with both a forward-facing and a downward-facing camera pair, each arranged in a canonical stereo configuration. Equipping the MAV with two instead of one camera pair also greatly increases the computational demands. We show that it is nevertheless possible to simultaneously process the imagery of all four cameras on-board the MAV in real-time. Further, we demonstrate that using two instead of one camera pair significantly increases the localization accuracy and robustness.

The presented work was initially published at the 2013 International Conference on Unmanned Aircraft Systems (ICUAS) [16]. Here, we provide an extended version of this work. In particular, we have extended the evaluation of our MAV design, in the hope of giving a better impression of the performance of our system.

2 Related Work

The key to autonomous flight is to enable the MAV to estimate its current pose (i.e. position and orientation) with respect to its environment. Despite their high weight and power consumption, much previous work has relied on laser scanners for this task. One example is the quadrotor MAV presented in [18], which also includes a monocular camera for loop closure detection and is able to navigate autonomously in an indoor environment. In [19] this MAV was extended to also perform autonomous path planning, facilitating the fully autonomous exploration of unknown indoor environments. Another outstanding autonomous MAV that relies on laser scanners is the fixed-wing aircraft presented in [4]. This MAV can perform aggressive flight maneuvers, including flight through a large, obstacle-covered indoor environment.

A possible substitute for laser scanners are cameras. Since cameras can be built very light and power efficient, using visual motion estimation is a compelling alternative. As we have already mentioned, however, the MAV position can only be observed with respect to an unknown scaling factor if only one camera is used. This is why most MAVs featuring monocular vision are only operable in specific environments.

One example is the quadrotor MAV presented in [21] that relies on visual markers. This MAV, however, does not perform any on-board image processing, but wirelessly transmits the camera images to a ground computer, which then remotely controls the MAV. An example of an approach with on-board processing can be found in [23], where the authors track the location of a landing pad. Because the geometry of this pad is known, it is possible to infer the MAV's relative pose from the observed perspective projection. In [24], the authors extended this approach such that the landing pad is only required during take-off. The MAV estimates the scale during take-off and then propagates it for further observations.

Instead of avoiding the scaling factor problem by flying in known environments, one can alternatively try to estimate this factor by means of additional sensors. An example is the remote-controlled low-cost quadrotor used in [6], which streams camera images from an on-board monocular camera to a ground computer running a Simultaneous Localization And Mapping (SLAM) software. This SLAM software is based on Parallel Tracking And Mapping (PTAM) [10], which is a popular open source SLAM implementation known for its robustness and efficiency. The scale is estimated using additional measurements from an on-board ultrasound altimeter. The MAV presented in [2] also employs PTAM for estimating its pose, but here the authors were able to perform all processing on-board. The required scaling factor is obtained by using measurements from an accelerometer and a pressure sensor. A similar system, which is also based on PTAM, has been published in [22]. Here the authors estimate the scale using measurements from an IMU.

As mentioned before, the scaling factor problem disappears if stereo vision is used instead of monocular vision. Stereo matching, however, generally has high computational demands. Most earlier work has thus focused on off-board stereo processing. For example, ground-mounted stereo cameras were used in [1, 13] to track a quadrotor MAV, which is then remotely controlled by a ground computer. In [5] a forward-facing stereo camera is mounted on a quadrotor MAV, which wirelessly transmits all camera images at a relatively low frame rate. Again, a ground computer receives those images and then remotely controls the MAV.

Only very recently have the first MAVs emerged that are able to perform stereo processing on-board. The MAV presented in [9] features a forward-facing stereo camera and runs a dense block matching algorithm at a resolution of 320 × 240. This MAV was later extended in [12] to use an image resolution of 640 × 480. While in both cases the stereo matching results are only used for obstacle avoidance and visual markers are still required for navigation, this limitation was resolved with a further extension of this platform in [7]. However, according to the numbers given for this final revision, stereo processing only runs at a relatively low frame rate of just 5 Hz.

One example of a downward-facing stereo camera with on-board stereo processing can be found in [20]. For this MAV, the authors use a dense correlation-based stereo algorithm which runs at a very low frame rate of just 3 Hz. The movement of the MAV is calculated using visual odometry, and the resulting data is fused with further odometry data gained from an on-board laser scanner. Other work that is worth mentioning includes the lighter-than-air MAV presented in [8], which is equipped with three fisheye cameras, of which two are arranged in a stereo configuration. However, no stereo matching is performed; rather, the imagery of each camera is tracked individually using PTAM. After tracking, the data from all cameras is fused in a pose alignment step. For this MAV, all processing is performed off-board and no autonomous control has been demonstrated.

What the presented MAVs that perform stereo matching have in common is that they all employ dense stereo algorithms. There exists, however, at least one example of sparse stereo matching with a forward-facing camera pair [14]. Here, the sparsely matched features are used for a modified version of the SLAM method presented in [17], which itself is an extension of PTAM that incorporates depth information. Because sparse stereo matching is much faster than any dense algorithm, this MAV can maintain a high pose estimation rate of 30 Hz. This high performance can be credited to the efficiency of PTAM as well as to the efficiency of the used sparse stereo algorithm, which was previously published in [15]. Because this method appears to be much faster than all other existing approaches, it makes the most promising base for creating an MAV capable of simultaneously processing the imagery of two stereo cameras.

3 Hardware Overview

A front and bottom view of our quadrotor MAV can be seen in Fig. 1. This quadrotor is based on the open source and open hardware PIXHAWK platform, which was developed by ETH Zürich [12]. We equipped this quadrotor with four USB cameras in two stereo configurations. Two cameras are facing forward with a baseline of 11 cm, while the remaining two cameras are facing downwards with a baseline of 5 cm. We operate the forward-facing cameras with a frame rate of 30 Hz, and the downward-facing cameras with 15 Hz. This unequal frame rate has been chosen in order to reduce the computational requirements. Even though the cameras are operated with different frame rates, they are still synchronized, with the downward-facing cameras skipping every other frame. All cameras have a gray-scale image sensor with a resolution of 640 × 480.

The MAV features an on-board computer with an Intel Core 2 Duo CPU, running at 1.86 GHz. All cameras are connected to this computer, which executes all image processing and motion estimation software. In addition to this on-board computer, the MAV is also equipped with a microcontroller, dedicated to all low-level control tasks. The microcontroller board, which also includes an IMU, is connected through an I2C bus with the main on-board computer. Through this bus, the on-board computer transmits high-level control instructions to the microcontroller, which in return transmits the current IMU measurements.

4 Processing of Forward-Facing Cameras

For the forward-facing cameras we use a method that largely matches the processing pipeline described in [14]. We improved the performance of this method mainly through code-level optimizations and by resolving a problem in the original PTAM code, which in the case of small maps causes bundle adjustment to be executed too frequently. We have made several extensions in order to allow the conjoint use with our processing methods for the downward-facing cameras. These extensions will be explained later, when their necessity becomes apparent. For completeness, we provide a summary of the original method from [14] at this point.

Fig. 1 Our quadrotor MAV seen from front and bottom

As this method is based on sparse stereo matching, the first processing step is to apply a feature detector and a sparse stereo matching algorithm. The two algorithms published in [15], for which an efficient open source implementation is available, are used for both of these tasks. Feature detection is, however, extended to include a scale space and an upper bound for the total number of features. We set our bound to 800 features, which is less than the 1000 features used in [14]. An example of the performance of this stereo method during an indoor flight of our MAV is given in Fig. 2.
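As a simple illustration of this feature cap, the following Python sketch (our own illustration, not the actual detector code; the response attribute is an assumed field) keeps only the strongest detections:

    def cap_features(features, max_count=800):
        # Keep only the strongest features, assuming each detection carries
        # a detector response score (attribute name is illustrative).
        return sorted(features, key=lambda f: f.response, reverse=True)[:max_count]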

Fig. 2 Example of on-board stereo matching performance

Fig. 3 Schematics of sensor fusion for only two cameras

The successfully matched features are used for estimating the current MAV pose. A SLAM system is employed for this task, which is based on the method proposed in [17]. This method is an adaptation of PTAM that incorporates depth information, which we receive from stereo matching. This SLAM system has, however, been modified in several ways. Most importantly, the map is continuously cleared of old keyframes, such that only a small set of the most recent keyframes persists. In our case, only the four most recent keyframes are preserved. This enables us to achieve a constant processing performance, which is crucial for the control of an MAV. Only keeping the most recent keyframes, however, no longer resembles real SLAM, as it leads to the absence of a global map. The resulting approach can be regarded as a compromise between SLAM and plain visual odometry. To avoid confusion, we will be referring to this method as local SLAM for the rest of this paper.
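The bounded keyframe map can be pictured as a small sliding window; the sketch below (class and method names are our own, not the actual implementation) shows the idea of dropping the oldest keyframe once more than four are stored:

    from collections import deque

    class KeyframeWindow:
        # Only the N most recent keyframes are kept (N = 4 in our setup),
        # so tracking and bundle adjustment cost stays roughly constant.
        def __init__(self, max_keyframes=4):
            self.keyframes = deque(maxlen=max_keyframes)

        def add(self, keyframe):
            # Appending to a full deque silently drops the oldest keyframe,
            # whose map points then also leave the local map.
            self.keyframes.append(keyframe)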

The pose received from local SLAM is fused with measurements from the IMU. The basic schematics of the sensor fusion are depicted in Fig. 3. For this task, the open source ROS package imu_filter [11] is employed, which is based on an extended Kalman filter. Here, the IMU measurements are used for performing the Kalman prediction, while the Kalman correction is then executed with the estimated pose. Unlike in [14], we process the IMU measurements at a higher rate of 100 Hz.
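To make the predict/correct split of Fig. 3 concrete, here is a heavily simplified stand-in for this filter (the actual system uses the ROS imu_filter package; the state layout, noise values and omitted orientation handling below are our own simplifying assumptions):

    import numpy as np

    class PoseFusionSketch:
        # Simplified Kalman filter: the IMU drives the prediction step
        # (100 Hz), the vision pose drives the correction step (30 Hz).
        # Only position and velocity are modelled; orientation is omitted.
        def __init__(self):
            self.x = np.zeros(6)          # [position, velocity]
            self.P = np.eye(6)
            self.Q = 1e-3 * np.eye(6)     # process noise (assumed)
            self.R = 1e-2 * np.eye(3)     # vision measurement noise (assumed)

        def predict_with_imu(self, accel_world, dt):
            # Constant-acceleration prediction from the IMU acceleration
            F = np.eye(6)
            F[:3, 3:] = dt * np.eye(3)
            self.x = F @ self.x
            self.x[:3] += 0.5 * dt ** 2 * accel_world
            self.x[3:] += dt * accel_world
            self.P = F @ self.P @ F.T + self.Q

        def correct_with_vision(self, position):
            # The vision pipeline measures position directly
            H = np.hstack([np.eye(3), np.zeros((3, 3))])
            y = position - H @ self.x
            S = H @ self.P @ H.T + self.R
            K = self.P @ H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(6) - K @ H) @ self.P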

For tracking a new frame, the local SLAM system requires an estimate of the current pose, which is obtained through a motion model. This motion model extrapolates the fused pose that is fed back to local SLAM after sensor fusion. Rotation, however, is predicted by using an Efficient Second Order Minimization (ESM) [3] based image alignment method, which can be found in the original PTAM implementation.

5 Problems with Forward-Facing Cameras

The method we have discussed above can be used to achieve autonomous hovering of an MAV, as has been demonstrated in [14]. However, there are some problems with using this approach as is. In this section we provide a qualitative overview of these issues. Quantitative examples for some of these problems can be found in Section 10.

One key issue is the potential drift of the estimated position and orientation. If the orientation estimated by local SLAM is used for controlling the quadrotor, any errors in the estimated pitch and roll angles will have a disrupting impact on flight stability. In [14] the MAV was controlled with the original PIXHAWK flight controller, which only relies on IMU measurements for determining the current attitude. It would be preferable to use the more accurate vision-estimated orientation instead. However, this would make the handling of orientation errors even more important.

Not only orientation drift is problematic; drift errors in the estimated position are also an issue. If the MAV is programmed to fly a preset track, a position deviation would cause the MAV to leave this track, which can potentially lead to a crash. But even if the MAV performs on-board path planning in consideration of the perceived environment, position drift can still cause trouble. For example, the MAV presented in [7] performs such autonomous on-board path planning, but only in two dimensions with a fixed flying altitude. If such an MAV does not have any other means of perceiving its current altitude, it is unable to react to position drift in the vertical direction.

The local SLAM method also has difficulties with yaw rotations. This was more severe for the original PTAM version, which just processes the imagery of one monocular camera. Because no triangulation can be performed for rotation-only movements, PTAM is not able to obtain reliable depth measurements in this case. The local SLAM method we use obtains its depth information from stereo vision, which should make yaw rotations less problematic. However, fast yaw rotations still lead to large image movements, which can result in poor tracking accuracy or even tracking failure.

Finally, if tracking ever fails, recovery can only occur if the camera still depicts a scene that has been well observed by at least one existing keyframe. Since the MAV is likely to be on a flying trajectory, we cannot expect this to be the case. Even if the camera has not moved much since the previous keyframe, recovery might still fail, which can lead to a random new position. Thus, recovery needs to be improved if we want to achieve robust flight.

The problems we have described so far can be solved, or at least reduced, if we employ a pair of downward-facing cameras in addition to the already used forward-facing ones. How exactly this can be achieved will be discussed in the following sections.

6 Processing of Downward-Facing Cameras

Our processing method for the downward-facing cameras differs fundamentally from the method presented above. At the beginning, however, there is again stereo matching, for which we employ the same technique previously used for the forward-facing cameras. This time, however, we set the maximum feature count to 300 and process images at a rate of just 15 Hz. This is done in order to save computing resources, as this method is less sensitive to the feature count and image rate than the local SLAM system used for the forward-facing cameras.

6.1 Ground Plane Detection

If we assume that the ground is flat, then all 3D points received from stereo matching are expected to lie in the same geometric plane. We obtain an estimate for this ground plane with a RANdom SAmple Consensus (RANSAC) based plane estimator. Given a plane equation of the form

ax + by + cz + d = 0,    (1)

we can extract the height h, pitch angle θ and roll angle φ with the following equations:

h = −d / b    (2)

θ = tan⁻¹(−a / b)    (3)

φ = tan⁻¹(−c / b)    (4)

Unlike the pose estimates of the local SLAM system, these measurements are absolute. Hence, they are not prone to drift or erroneous offsets, which is why we expect these measurements to increase the overall accuracy of the combined system. We estimate the variances of these measurements with a sampling-based approach. For the height variance σh, we calculate the distance between the plane model and each point that was selected as an inlier by the RANSAC method; the variance of the mean distance is then our estimate for σh. For the angular variances σθ and σφ, we first shuffle the set of inliers such that three consecutive points always have a large distance to each other. We then repeatedly draw three such points and use them to calculate samples for θ and φ. The variances of the means of those samples are then our final estimate.

Finally, we apply a simple outlier rejection based on the previously detected plane model. An example of how our method performs for ground plane detection can be found in Fig. 4. Here, the plane has been projected back into the unrectified camera image. The red points indicate features that were selected as inliers by the RANSAC plane estimator, while yellow points were classified as outliers.

Fig. 4 Example of on-board ground plane detection
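To make the plane-to-pose conversion concrete, the following Python sketch fits the ground plane with a basic RANSAC loop and then applies Eqs. 2–4. It is an illustration under our own assumptions (function names, thresholds and iteration counts are not taken from the actual flight software):

    import numpy as np

    def fit_ground_plane(points, iterations=200, inlier_thresh=0.02):
        # points: (N, 3) array of 3D points from sparse stereo matching.
        # Returns plane coefficients (a, b, c, d) and an inlier mask.
        best_plane, best_inliers = None, None
        rng = np.random.default_rng()
        for _ in range(iterations):
            sample = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
            norm = np.linalg.norm(n)
            if norm < 1e-9:
                continue                      # degenerate (collinear) sample
            n /= norm
            d = -np.dot(n, sample[0])
            inliers = np.abs(points @ n + d) < inlier_thresh
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_plane, best_inliers = (n[0], n[1], n[2], d), inliers
        return best_plane, best_inliers

    def plane_to_pose(a, b, c, d):
        # Height, pitch and roll from the plane coefficients (Eqs. 2-4)
        h = -d / b
        pitch = np.arctan(-a / b)
        roll = np.arctan(-c / b)
        return h, pitch, roll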

6.2 Frame-to-Frame Tracking

It is unfortunately not possible to infer horizontal displacement or yaw rotation from the detected plane. For measuring these quantities, we hence need to apply a different method. In our case, we have selected an approach based on frame-to-frame tracking. Because we assume a flat ground, horizontal displacements and yaw rotations result in an affine image transformation, consisting of a 2D translation and an in-plane rotation. We hence attempt to find the transformation that aligns a previously captured camera frame to the current camera image.

For this task we chose the previously mentioned ESM algorithm, which uses a homography to align two images. This happens by iteratively applying transformations to an initial homography until the algorithm converges to a steady solution.

In our case, we limit ourselves to a homography consisting of translations and in-plane rotations. Even though ESM is an efficient method, finding the transformation between two full-resolution frames is still very time consuming. Hence, we perform this step on two heavily sub-sampled frames with a resolution of just 80 × 60 pixels. This number might seem small, but because ESM works well at the sub-pixel level, we still receive sufficiently accurate results.

To have a large field of view, we use lenses with small focal lengths on the downward-facing cameras. This, however, leads to strong lens distortions that disrupt the frame-to-frame tracking. Hence, we first perform image rectification, which can be combined with the required sub-sampling into one single image transformation. This combined transformation is much faster than an individual rectification at full image resolution, and also avoids unnecessary blur.
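A minimal sketch of such a combined rectification and sub-sampling step, using OpenCV with placeholder calibration values (the actual intrinsics and distortion coefficients come from camera calibration), could look as follows:

    import cv2
    import numpy as np

    # Placeholder pinhole intrinsics and radial distortion for a 640x480 camera
    K = np.array([[350.0, 0.0, 320.0],
                  [0.0, 350.0, 240.0],
                  [0.0, 0.0, 1.0]])
    dist = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])

    scale = 80.0 / 640.0            # sub-sampling factor (640x480 -> 80x60)
    K_small = K.copy()
    K_small[:2, :] *= scale         # scale focal lengths and principal point

    # One look-up table that undistorts and sub-samples in a single remap
    map1, map2 = cv2.initUndistortRectifyMap(
        K, dist, np.eye(3), K_small, (80, 60), cv2.CV_16SC2)

    def rectify_and_subsample(image):
        return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)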

Once the ESM algorithm has converged, we extract our measurements from the found homography. The translation vector corresponds to the last column of the homography matrix. The rotation, however, cannot as easily be extracted without applying a homography decomposition. Thus, we apply the simple approach of transforming a distant point with the found homography, and then measuring the angle towards the point's initial location. Using the height that we received from the detected ground plane and the known camera parameters, we can convert the translation vector from pixel to world coordinates. Together with the yaw rotation and the measurements received from the detected ground plane, we obtain a full 6DoF estimate of the current MAV pose.
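The following sketch shows this extraction under our own assumptions (a homography restricted to in-plane rotation and translation; fx and fy are the focal lengths of the rectified, sub-sampled camera model; all names are illustrative):

    import numpy as np

    def pose_from_homography(H, height, fx, fy):
        # In-plane rotation: transform a distant point and measure the angle
        # between its original and transformed direction.
        p = np.array([1e6, 0.0, 1.0])
        q = H @ p
        yaw = np.arctan2(q[1], q[0]) - np.arctan2(p[1], p[0])

        # The pixel translation sits in the last column of the homography;
        # convert it to metric units using the ground distance and the
        # focal lengths (pinhole model).
        tx = H[0, 2] * height / fx
        ty = H[1, 2] * height / fy
        return yaw, np.array([tx, ty])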

7 Sensor Fusion

With the methods described above we receive two independent estimates of the MAV pose: one from the forward-facing and one from the downward-facing camera pair. These two estimates need to be unified into one single pose estimate, which happens during sensor fusion. For this purpose we extend the sensor fusion we have discussed earlier. The schematics of our extended sensor fusion are shown in Fig. 5. For now we ignore the block labeled "Angular Drift Correction", as this part will be discussed in detail later on.

Fig. 5 Schematics of sensor fusion with forward-facing and downward-facing cameras

Thanks to camera synchronization, measurements from both camera pairs are taken at exactly the same point in time when both are available. In this case, we fuse both poses before applying the Kalman filter. This happens by calculating the variance-weighted mean, which gives us a maximum-likelihood estimate of the current pose. The reason why we fuse both poses before applying the Kalman filter is that we want to avoid giving preference to either of them. If both pose estimates were processed individually, the pose processed last would have the higher influence on the filter outcome. This is usually the pose obtained through the forward-facing cameras, as it requires more time to be computed.
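The variance-weighted mean used here is the standard maximum-likelihood combination of two independent estimates; a minimal sketch (treating each pose component as a value with an associated variance, orientation handling omitted):

    import numpy as np

    def fuse_poses(pose_a, var_a, pose_b, var_b):
        # Weight each estimate by its inverse variance and normalise.
        pose_a, pose_b = np.asarray(pose_a, float), np.asarray(pose_b, float)
        w_a, w_b = 1.0 / np.asarray(var_a, float), 1.0 / np.asarray(var_b, float)
        fused = (w_a * pose_a + w_b * pose_b) / (w_a + w_b)
        fused_var = 1.0 / (w_a + w_b)
        return fused, fused_var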

If only one pose is available, because the bottom cameras skipped a frame or because one method failed to deliver a reliable pose estimate, the variance-weighted mean is skipped and the single pose estimate is processed by the Kalman filter as is. In any case, the fused pose is passed on to the low-level flight controller. Unlike the original PIXHAWK flight controller, we use the fused pose for attitude estimation, rather than relying only on IMU measurements.

8 Drift Correction

Preliminary experiments with the method presented so far have shown that there are still problems with flight stability. This can mostly be credited to accumulated errors that lead to unwanted drift. Two such error sources have been identified, and we have been able to compensate them by using additional processing steps.

8.1 Map Drift

One major source of error is the map generated by the local SLAM system. Once a keyframe has been created, its position is only altered by bundle adjustment, which generally only performs minor corrective changes. This means that if a keyframe has been created at an incorrect position, this error will not be corrected until the keyframe is discarded. Hence, all pose estimates that are obtained by matching against this keyframe will be inaccurate.

The downward-facing cameras deliver absolute measurements for height, roll and pitch. With those absolute measurements we should be able to at least partially correct inaccurate keyframes. The fused position, which contains contributions from those absolute measurements, is already fed back to the local SLAM method (see Fig. 5). However, so far the fused pose is only used for motion prediction, which does not have an influence on the existing map.

It is thus necessary to correct the pose of existing keyframes. We do this by applying a global transformation to the entire map. This transformation is chosen such that it compensates the difference between the last pose estimated by the local SLAM system and the final pose estimate after sensor fusion. If Ts is the transformation matrix for the pose estimated by local SLAM, and Tf is the transformation matrix after sensor fusion, then the matrix product Tf⁻¹ · Ts represents the transformation that is required to map Ts to Tf. We hence define our corrective transformation Tc as:

Tc = λ (Tf⁻¹ · Ts)    (5)

Here, the transformation is scaled with the weighting factor λ. This weight is set to a small value (we use 0.05), such that only small corrective steps are performed. Any drift or error is thus gradually reduced over several frames. Further, we force the horizontal displacement of the corrective transformation to 0: because there is no sensor that delivers absolute measurements of the horizontal position, we prefer to keep the position estimated by local SLAM.
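A sketch of this correction step is given below. The paper does not spell out how the transformation is scaled; here we take λ times the rotation angle (via axis-angle) and λ times the translation, and we assume a world frame with z pointing up when zeroing the horizontal displacement, so treat this as one plausible reading rather than the actual implementation:

    import numpy as np

    def rodrigues(axis, angle):
        # Rotation matrix from a unit axis and an angle (Rodrigues' formula)
        ax, ay, az = axis
        K = np.array([[0.0, -az, ay],
                      [az, 0.0, -ax],
                      [-ay, ax, 0.0]])
        return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

    def corrective_transform(T_s, T_f, lam=0.05):
        # Full correction Tf^-1 * Ts between the local SLAM pose and the
        # fused pose, both given as 4x4 homogeneous matrices.
        delta = np.linalg.inv(T_f) @ T_s

        # Scale the rotation by lam via its axis-angle representation
        R = delta[:3, :3]
        angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
        if angle < 1e-9:
            R_scaled = np.eye(3)
        else:
            axis = np.array([R[2, 1] - R[1, 2],
                             R[0, 2] - R[2, 0],
                             R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
            R_scaled = rodrigues(axis, lam * angle)

        T_c = np.eye(4)
        T_c[:3, :3] = R_scaled
        T_c[:3, 3] = lam * delta[:3, 3]   # scaled translation
        T_c[0, 3] = 0.0                   # drop horizontal correction (x)
        T_c[1, 3] = 0.0                   # drop horizontal correction (y)
        return T_c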

8.2 Angular Drift

Although the previously described drift correction works well for correcting the height of a keyframe, its performance is generally poor for roll and pitch errors. This can be explained by examining the variances used during sensor fusion. While the height measurements received from the downward-facing cameras are more accurate than the measurements received from local SLAM, the variances of the measured roll and pitch angles are several orders of magnitude larger. This means that roll and pitch measurements from the downward-facing cameras are mostly ignored during sensor fusion.

Unlike local SLAM, however, the downward-facing cameras provide an absolute measurement, which is why we do not want to disregard this information. We solve this problem by introducing an additional processing step during sensor fusion, which has been labeled "Angular Drift Correction" in Fig. 5. In this step, we try to estimate the angular drift of the local SLAM pose and correct it before sensor fusion starts. Because the angular measurements from the downward-facing cameras are considerably noisy, we employ an additional Kalman filter for this task. This Kalman filter tracks the difference between the orientation estimates gained from local SLAM and from the downward-facing cameras. We represent the orientation as quaternions, which matches the representation used in the entire sensor fusion pipeline.

If we are able to correct the angular drift, then the pose received after sensor fusion should contain the correct orientation. We know that the fused pose is fed back to local SLAM, where it is used to correct the map drift with respect to the weight λ. Hence, we can also expect that the angular drift will be reduced in the next frame. This knowledge can be incorporated into the model of our Kalman filter. We assume that the arithmetic difference Δq between the two orientation quaternions reduces to Δq · (1 − λ) from one frame to the next. If we ignore all other influences on the orientation drift, then we arrive at the following state transition matrix:

Fk = (1 − λ) · I4    (6)

where I4 is the 4 × 4 identity matrix, i.e. Fk is a diagonal matrix with all diagonal entries equal to 1 − λ.

The filtered quaternion difference is then added to the orientation quaternion from local SLAM, which effectively removes the orientation drift. However, we further adapt this pose by restoring the yaw rotation to its uncorrected value, as there are no absolute measurements for the yaw angle.
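A minimal sketch of this additional filter is given below (state: the 4-vector quaternion difference; the state transition of Eq. 6 shrinks it by 1 − λ per frame; the noise magnitudes are our own assumptions, not the tuned values of the flight software):

    import numpy as np

    class AngularDriftFilter:
        def __init__(self, lam=0.05, process_noise=1e-6, measurement_noise=1e-3):
            self.F = (1.0 - lam) * np.eye(4)   # state transition of Eq. 6
            self.Q = process_noise * np.eye(4)
            self.R = measurement_noise * np.eye(4)
            self.x = np.zeros(4)               # filtered quaternion difference
            self.P = np.eye(4)

        def step(self, q_slam, q_plane):
            # Prediction: the drift shrinks because the map is being corrected
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q

            # Correction with the newly observed quaternion difference
            z = np.asarray(q_plane) - np.asarray(q_slam)
            K = self.P @ np.linalg.inv(self.P + self.R)
            self.x = self.x + K @ (z - self.x)
            self.P = (np.eye(4) - K) @ self.P

            # Drift-corrected orientation (re-normalised quaternion)
            q_corr = np.asarray(q_slam) + self.x
            return q_corr / np.linalg.norm(q_corr)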

9 Recovery

The last remaining problem that needs to be solved is the recovery of the local SLAM system in case of tracking failure. As discussed previously, the recovery approach employed by PTAM does not work well for our application. We hence use a different technique that makes use of the redundant information available from both camera pairs. Even when the local SLAM method fails, we still receive a full 6DoF pose estimate from the downward-facing cameras. Thus, the pose of the MAV is still known, albeit with degraded accuracy. Nevertheless, we should be able to maintain control of the MAV until local SLAM has recovered.

If tracking fails, we force the local SLAM pose to the current output of the sensor fusion. In this case, the fused pose is only obtained through measurements of the downward-facing cameras and the IMU. This pose will, however, not match the current map of the local SLAM system, which prevents the system from recovering by itself. We hence discard the entire map and begin mapping from scratch, starting by adding the current frame at the currently available fused pose. The system should thus quickly recover once the cause of the error has disappeared. Usually, tracking failures result from quick camera movements. Hence, once the MAV has stabilized itself by using the less error-prone pose estimates from the downward-facing cameras, local SLAM will continue functioning.
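The control flow of this recovery strategy can be summarised by the following sketch; slam and fusion are placeholder interfaces of our own, not the actual software components:

    def handle_tracking_result(slam, fusion, tracking_ok):
        # If local SLAM tracked successfully, its pose feeds the sensor fusion
        # as usual; otherwise the map is rebuilt around the fused pose, which
        # at that moment stems from the downward-facing cameras and the IMU.
        if tracking_ok:
            fusion.correct(slam.current_pose(), slam.pose_variance())
        else:
            fused_pose = fusion.current_estimate()
            slam.clear_map()                        # discard all keyframes
            slam.force_pose(fused_pose)             # re-seed tracking
            slam.add_keyframe_from_current_frame()  # start mapping from scratch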

10 Evaluation

We have conducted several experiments to evaluate the quality of the proposed MAV design. All flying experiments took place in the same indoor lab environment. We covered the floor with a texture-rich carpet in order to provide sufficient features for the downward-facing cameras. Examples of the scenes observed by the forward- and downward-facing cameras are shown in Figs. 2 and 4. For all experiments we recorded ground truth motion information with an OptiTrack optical tracking system.

10.1 Hovering

In our first experiment we instructed the quadrotor to hover at a height of approximately 1 m for one minute, with take-off and landing being performed autonomously. This should allow an assessment of the MAV's flight stability and demonstrate its general operativeness. During this flight, the MAV recorded all sensor data (i.e. camera imagery and IMU measurements), allowing us to re-process all test runs offline.

The ground truth position and the position estimate obtained by the on-board software are depicted in Fig. 6. This figure contains two more curves, which are the position estimates received when processing the recorded data offline with only the local SLAM system or only our ground-plane-based pose estimation. By plotting these offline results, we can compare how the MAV would have behaved if it had been equipped with only two cameras. All plotted tracks have been aligned such that their starting position and orientation match closely. For this purpose we iteratively performed an error minimization for each position coordinate and the yaw rotation over the first 0.5 s of each track.

Fig. 6 Ground truth and estimated position during hovering flight in (top) perspective view and (bottom) top-down view. Scale is in meters

The slow take-off and landing in this experiment, as well as the stable hovering position, are also an easy challenge for the local-SLAM-only test run. Hence, the corresponding curve and the curve for our on-board estimate both closely match the recorded ground truth. The ground-plane-only pose estimate, however, shows accurate height but exhibits high horizontal drift. While in this case the absolute height can be measured, the horizontal position is only obtained through frame-to-frame tracking, which is particularly prone to error accumulation.


Fig. 7 Position error during hovering flight (error in meters over flight time in seconds, for the on-board estimate, the local-SLAM-only version and the ground-plane-only version)

We can quantify the deviation from the ground truth by examining the Euclidean distances between the estimated and ground truth positions, which are depicted in Fig. 7. We further computed the average error and Root Mean Square Error (RMSE) for all position estimates received during this test flight, which are listed in Table 1. In both cases we see that the errors received with our combined method are significantly lower than for the local-SLAM-only system. In fact, the average error and RMSE for the local-SLAM-only version are more than twice as large as for our on-board results. Much of this improvement can be credited to the more accurate height that we obtain with the combined approach. As we had already anticipated from Fig. 6, the errors received with the ground-plane-only method are much higher than for the other two test runs. However, the errors still remain bounded in this case.

Table 1 Position estimation errors for hovering flight

Method              RMSE (cm)   Avg. error (cm)
On-board estimate   1.41        1.44
Local SLAM only     3.14        3.15
Ground plane only   26.2        23.7

The more interesting measure, however, is the MAV's ability to keep its hovering location. To remain comparable, we apply the same measure as used in [14]: we calculate the position error with respect to the average position during hovering. In our case we receive an average error of 11.0 cm and an RMSE of 12.6 cm. Compared to the values reported in [14] (average error of 26 cm and RMSE of 32 cm), our position errors are much smaller, meaning that our MAV can keep its hovering location more precisely. We believe that much of this performance improvement can be credited to our modified flight controller, which now obtains its attitude estimate from our on-board vision software.

10.2 Drift Compensation

The next interesting characteristic is the performance of our drift correction methods. Because it is difficult to evaluate the drift correction in a flying experiment, we decided to simulate drift errors instead. We take the sensor data recorded during an autonomous hovering flight and re-process this data offline. While the MAV is hovering, we force an erroneous orientation and height into the system. We do this by disturbing the output of the sensor fusion for a short period of time. During this time, we keep applying an erroneous rotation and vertical translation to the fused pose, which is fed back to local SLAM. This disturbance forces local SLAM into recovery, which means that mapping starts again from scratch.

The height recorded in this experiment is plotted in Fig. 8. The disturbance is applied during the highlighted section. For comparison, we also included the undisturbed height that was estimated on-board during autonomous hovering. We see that the heights from both recordings diverge once the disturbance is applied. Once the disturbance period ends, however, the height measurements quickly converge again to the undisturbed on-board estimates. Similarly, the disturbed and undisturbed roll and pitch angles are shown in Fig. 9. Again, the angular measurements converge to the undisturbed estimates after the disturbance period has finished. This successfully demonstrates the functioning of our drift correction methods.

Fig. 8 Recovery of height estimates after forceful disturbance (height in meters over flight time in seconds)

Fig. 9 Recovery of roll and pitch angles after forceful disturbance (over flight time in seconds)

10.3 Square Flight

The previous hovering flight is a particularly easy challenge for the local SLAM method, as the MAV remains mostly stationary and is thus not required to map many keyframes. In a further flight experiment, we hence let the MAV fly a trajectory which resembles a horizontal square with an edge length of 1 m. Please note that this is just 2.5 times the rotor-to-rotor distance, which makes precise following of this trajectory challenging. In this experiment, the MAV took off autonomously and then approached each corner of the square twice before landing again. It hovered at each corner for about 5 seconds before continuing to the next corner. A top view of the ground truth and on-board position estimate is given in Fig. 10. Again, this figure contains the offline processing results for a local-SLAM-only and a ground-plane-only version of our software system. The achieved results show the same tendency we previously observed in the hovering experiment: while our on-board estimates and local-SLAM-only test runs provide results that closely match the recorded ground truth, the ground-plane-only version exhibits significant drift.

Fig. 10 Flight of a horizontal rectangle shape (top view; x and y in meters)

Fig. 11 Position error during square flight (error in meters over flight time in seconds)

As previously for the hovering flight, we again plot the position error in Fig. 11, and calculate the average error and RMSE in Table 2. Again, we see that our on-board estimates have a significantly lower error compared to a local-SLAM-only system. The RMSE and average error for such a local-SLAM-only version are again almost twice as large as for our on-board estimates. In this experiment, the error for the ground-plane-only version no longer stays bounded, which is why we have truncated the corresponding curve in Fig. 11.

Table 2 Position estimation errors for square flight

Method              RMSE (cm)   Avg. error (cm)
On-board estimate   2.38        2.20
Local SLAM only     4.53        4.39
Ground plane only   33.7        41.3

Despite the relatively small dimensions of the flown square trajectory, the movements performed by the MAV are large enough to force our local SLAM method into continuously adding new keyframes. Hence, even though this experiment was performed in a small confined space, the performance results should be representative of the performance on a long-range flight. In Fig. 12 we have plotted the age of the oldest keyframe in the current map. Initially, the maximum keyframe age increases until the map has reached the maximum keyframe count of four. Then, with every new keyframe that is added to the map, the maximum age decreases again as an older keyframe has to be deleted. We have marked the times where new keyframes are added with a small arrow.

Fig. 12 Maximum age and creation time of mapped keyframes

If we take a closer look at Fig. 12, we can see that periods with a high maximum keyframe age alternate with periods with a low maximum age. These high-age periods occur when the MAV reaches a new corner of the square trajectory. Because the MAV stays stationary for a few seconds at each corner, there is a short period where local SLAM is not required to map new keyframes. In total, local SLAM mapped 61 keyframes during the 88 s of this flight experiment.

10.4 Yaw Rotations

While the previous flight experiments are also feasible with only two cameras and local SLAM, the situation is different if we encounter yaw rotations. As we have discussed earlier, observing fast yaw rotations with the forward-facing cameras is particularly challenging. Thus, we expect our MAV to benefit much from the additional downward-facing cameras. To put this assumption to a test, we let our MAV perform a 360° yaw rotation. This rotation was divided into four separate 90° turns, for which our MAV required an average time of 2.3 s each. After each turn, the MAV waited for itself to stabilize and then hovered for 5 s before continuing with the next turn. An example of the scene observed by the forward-facing cameras after each turn is shown in Fig. 13.

Fig. 13 Scene observed by the forward-facing cameras during the 360° yaw rotation

Figure 14 contains the recorded ground truth and on-board position estimates for a typical test run of this experiment. We again re-processed the recorded camera imagery and IMU measurements offline with a local-SLAM-only and a ground-plane-only version of our software system, and included the results in Fig. 14. In this figure we see that despite the yaw rotations, the MAV is able to maintain an accurate estimate of its current position. The ground-plane-only test run again shows the already observed behavior of accurate height estimates but strong horizontal drift.

Fig. 14 Ground truth and estimated position during the 360° yaw rotation. Scale is in meters

Fig. 15 Yaw angles measured during the 360° yaw rotation (in degrees over time in seconds)

The position estimated by the local-SLAM-only version, on the other hand, shoots off in a random direction after the first 90° turn. Please note that the diagram has been truncated and that the position estimate continues to show the same erroneous behavior for each of the four turns. In fact, we have never been able to obtain a valid position estimate beyond the first turn in any test run with the local-SLAM-only version. If the MAV had used this erroneous position estimate for navigation, this would have inevitably led to a crash.

The recorded and estimated yaw rotations are depicted in Fig. 15. In this diagram we see that the yaw rotation estimated with our on-board method closely follows the ground truth, while the ground-plane-only version follows the ground truth less accurately. The local-SLAM-only version starts deviating significantly after the first turn, which matches the observation from Fig. 14.

The good performance of our method can in large part be credited to our recovery strategy. In fact, recovery of the local SLAM method was performed once during each turn. Because the more rotation-robust pose from the downward-facing cameras is used during recovery, our MAV is able to keep an accurate pose estimate throughout the experiment.

11 Conclusions

In this paper we have proposed a novel design for an autonomous quadrotor MAV. In a GPS-denied space, the key problem of autonomous flight is self-localization. We have solved this problem for our MAV by employing four cameras in two stereo configurations. We are able to perform stereo processing for both camera pairs on-board the MAV in real time. To our knowledge, on-board stereo processing with more than two cameras has not been demonstrated before.

We have used a dedicated processing technique for the stereo matching results of each camera pair. While the imagery of the forward-facing cameras is processed with an approach based on stereo SLAM, the downward-facing cameras can be processed more efficiently if we assume that the ground is flat. For this camera pair we thus use a method based on RANSAC plane detection and ESM homography estimation. Combining the pose estimates we obtain from each camera pair was one of the main challenges. While sensor fusion can be applied to receive the current pose estimate, we also need to update the map created by the local SLAM method. For this task, we included the two presented drift correction techniques.

The resulting MAV has been evaluated in several flight experiments. In a hovering test, the MAV took off autonomously, hovered at a given height and then landed again by itself. It was able to keep its hovering location with an average error of only 11.0 cm. By re-processing the collected sensor data with only the forward-facing cameras, we have been able to show that the downward-facing cameras indeed increase the pose estimation accuracy. The same observation has been made for an experiment where the MAV flew a trajectory resembling a horizontal square.

In further experiments we have demonstrated the robustness of our MAV design. Pose estimation is able to automatically recover when an erroneous height and orientation are forcefully introduced into the system. Further, we have shown that our MAV is able to successfully fly a 360° yaw rotation. This is a difficult maneuver, for which we received very poor results when using only the forward-facing cameras. Our MAV was able to perform this maneuver thanks to the availability of a redundant pose estimate from the downward-facing cameras. Further, our tracking recovery method, which re-initializes the local SLAM system, played an important role in this experiment.

With this work we have shown that by simultaneously using both a forward-facing and a downward-facing camera pair, we can significantly increase the accuracy and robustness of MAV navigation. Forward- and downward-facing cameras have different strengths and weaknesses, and are thus able to complement each other. Using more cameras also adds redundancy to the system, which leads to increased fail-safety. There exist, however, situations in which the performance of all cameras can be impeded, such as low lighting, fog or an insufficiently textured environment. To further increase redundancy, we would thus propose to equip the MAV with additional sensors, such as laser scanners, ultrasonic transceivers or a depth camera. The very limited payload of most MAVs, however, does not allow us to carry much sensory equipment. Hence, such extensions will have to wait until lighter and smaller sensors are available.

References

1. Achtelik, M., Zhang, T., Kuhnlenz, K., Buss, M.: Visual tracking and control of a quadcopter using a stereo camera system and inertial sensors. In: International Conference on Mechatronics and Automation (ICMA), pp. 2863–2869. IEEE (2009)

2. Achtelik, M., Achtelik, M., Weiss, S., Siegwart, R.: Onboard IMU and monocular vision based control for MAVs in unknown in- and outdoor environments. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 3056–3063 (2011)

3. Benhimane, S., Malis, E.: Real-time image-based tracking of planes using efficient second-order minimization. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 1, pp. 943–948 (2004)

4. Bry, A., Bachrach, A., Roy, N.: State estimation for aggressive flight in GPS-denied environments using onboard sensing. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8 (2012)

5. Carrillo, L.R.G., López, A.E.D., Lozano, R., Pégard, C.: Combining stereo vision and inertial navigation system for a quad-rotor UAV. J. Intell. Robot. Syst. 65(1), 373–387 (2012)

6. Engel, J., Sturm, J., Cremers, D.: Camera-based navigation of a low-cost quadrocopter. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2815–2821 (2012)

7. Fraundorfer, F., Heng, L., Honegger, D., Lee, G.H., Meier, L., Tanskanen, P., Pollefeys, M.: Vision-based autonomous mapping and exploration using a quadrotor MAV. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4557–4564 (2012)

8. Harmat, A., Sharf, I., Trentini, M.: Parallel tracking and mapping with multiple cameras on an unmanned aerial vehicle. In: International Conference on Intelligent Robotics and Applications (ICIRA), vol. 1, pp. 421–432. Springer (2012)

9. Heng, L., Meier, L., Tanskanen, P., Fraundorfer, F., Pollefeys, M.: Autonomous obstacle avoidance and maneuvering on a vision-guided MAV using on-board processing. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2472–2477 (2011)

10. Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 1–10 (2007)

11. Klose, S.: imu_filter. http://ros.org/wiki/imu_filter (2011). Accessed 15 Aug 2013

12. Meier, L., Tanskanen, P., Heng, L., Lee, G., Fraundorfer, F., Pollefeys, M.: PIXHAWK: a micro aerial vehicle design for autonomous flight using onboard computer vision. Auton. Robot. 33(1–2), 21–39 (2012)

13. Pebrianti, D., Kendoul, F., Azrad, S., Wang, W., Nonami, K.: Autonomous hovering and landing of a quadrotor micro aerial vehicle by means of on ground stereo vision system. J. Syst. Des. Dyn. 4(2), 269–284 (2010)

14. Schauwecker, K., Ke, N.R., Scherer, S.A., Zell, A.: Markerless visual control of a quad-rotor micro aerial vehicle by means of on-board stereo processing. In: Autonomous Mobile System Conference (AMS), pp. 11–20. Springer (2012)

15. Schauwecker, K., Klette, R., Zell, A.: A new feature detector and stereo matching method for accurate high-performance sparse stereo matching. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5171–5176 (2012)

16. Schauwecker, K., Zell, A.: On-board dual-stereo-vision for autonomous quadrotor navigation. In: International Conference on Unmanned Aircraft Systems (ICUAS), pp. 332–341. IEEE (2013)

17. Scherer, S.A., Dube, D., Zell, A.: Using depth in visual simultaneous localisation and mapping. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 5216–5221 (2012)

18. Shen, S., Michael, N., Kumar, V.: Autonomous multi-floor indoor navigation with a computationally constrained MAV. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 20–25 (2011)

19. Shen, S., Michael, N., Kumar, V.: Autonomous indoor 3D exploration with a micro-aerial vehicle. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 9–15 (2012)

20. Tomic, T., Schmid, K., Lutz, P., Domel, A., Kassecker, M., Mair, E., Grixa, I., Ruess, F., Suppa, M., Burschka, D.: Toward a fully autonomous UAV: research platform for indoor and outdoor urban search and rescue. IEEE Robot. Autom. Mag. 19(3), 46–56 (2012)

21. Tournier, G.P., Valenti, M., How, J., Feron, E.: Estimation and control of a quadrotor vehicle using monocular vision and Moiré patterns. In: AIAA Guidance, Navigation and Control Conference, pp. 2006–6711 (2006)

22. Weiss, S., Scaramuzza, D., Siegwart, R.: Monocular-SLAM-based navigation for autonomous micro helicopters in GPS-denied environments. J. Field Robot. 28(6), 854–874 (2011)

23. Yang, S., Scherer, S.A., Zell, A.: An onboard monocular vision system for autonomous takeoff, hovering and landing of a micro aerial vehicle. J. Intell. Robot. Syst. 69, 499–515 (2012)

24. Yang, S., Scherer, S.A., Schauwecker, K., Zell, A.: On-board monocular vision for landing of an MAV on a landing site specified by a single reference image. In: International Conference on Unmanned Aircraft Systems (ICUAS), pp. 317–324. IEEE (2013)