basil huber.pdf

Upload: douglas-winston

Post on 07-Jul-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/18/2019 BASIL HUBER.pdf

    1/60

    Department of Informatics

    Basil HuberSection de microtechnique

    École Polytechnique Fédérale de Lausanne

    High-Speed PoseEstimation using a

    Dynamic Vision Sensor

    Master Thesis

    Robotics and Perception GroupUniversity of Zurich

    SupervisionProf. Dr. Davide Scaramuzza, RPG, UHZ

    Prof. Dr. Dario Floreano, LIS, EPFLElias Müggler, RPG, UHZ

    March 2014

  • 8/18/2019 BASIL HUBER.pdf

    2/60

  • 8/18/2019 BASIL HUBER.pdf

    3/60

  • 8/18/2019 BASIL HUBER.pdf

    4/60

  • 8/18/2019 BASIL HUBER.pdf

    5/60

    Contents

    Abstract v

    Nomenclature vii

    1 Introduction 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2 DVS Calibration 72.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2.2.1 Displaying a Pattern . . . . . . . . . . . . . . . . . . . . . 72.2.2 Focusing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.2.3 Intrinsic Camera Calibration . . . . . . . . . . . . . . . . 82.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.3.1 Focusing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.3.2 Intrinsic Camera Calibration . . . . . . . . . . . . . . . . 10

    3 DVS Simulation 143.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3 Simulation Procedure . . . . . . . . . . . . . . . . . . . . . . . . 193.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    3.4.1 Correction of DVS-Screen Misalignment . . . . . . . . . . 203.4.2 Screen Refreshing Effects . . . . . . . . . . . . . . . . . . 21

    4 Pose Estimation 244.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    4.2.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . 254.2.2 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 314.3.1 Trajectory Simulation . . . . . . . . . . . . . . . . . . . . 314.3.2 DVS on Quadrotor . . . . . . . . . . . . . . . . . . . . . . 32

    4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.4.1 Trajectory Simulation . . . . . . . . . . . . . . . . . . . . 334.4.2 DVS on Quadrotor . . . . . . . . . . . . . . . . . . . . . . 38

    iii

  • 8/18/2019 BASIL HUBER.pdf

    6/60

    5 Conclusion 41

    5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

  • 8/18/2019 BASIL HUBER.pdf

    7/60

    Abstract

    We see because we move; we move because we see.James J. Gibson, The Perception of the Visual World

    Micro Aerial Vehicles (MAV) have gained importance in various elds, suchas search and rescue missions, surveillance, and delivery services over the lastyears. To stabilize and navigate reliably, the pose of the MAV must be knownprecisely. While impressive maneuvers can be performed using external motioncapture systems, their usage is limited to small predened areas. Numeroussolutions exist to estimate the pose using onboard sensors such as laser rangenders, Inertial Measurement Units (IMU) and cameras. To navigate quicklythrough cluttered and dynamic environments, an MAV must be able to react

    agilely to suddenly appearing obstacles. In current visual navigation techniques,the agility is limited to the update rate of the perception pipeline, typically inthe order of 30 Hz. The Dynamic Vision Sensor (DVS) is a novel visual sensorthat allows to push these limits.While conventional cameras provide brightnessvalues at a xed frame, this sensor only registers changes in brightness. Whenthe illumination of a pixel changes, an event is emitted containing the pixellocation, the sign of the change (i.e., increase or decrease of illumination) anda timestamp, indicating the precise time of the change. The events are emit-ted asynchronously leading to a latency of only 15 s. We rst introduce aconvenient method for the intrinsic camera calibration of the DVS using a com-puter screen to display the calibration pattern. We then present a simulationapproach where the real DVS is used in a virtual environment. The DVS isplaced in front of a computer screen displaying a scene. The recorded data issubjected to the genuine behavior of the DVS including noise while providingground truth for the DVS pose and the scene. We present a method to estimatethe 6-DoF pose of the DVS with respect to a known pattern. The algorithm iscapable of estimating the pose of the DVS mounted on quadrotor during a ipwith an angular rate of 1200 s− 1 .

    v

  • 8/18/2019 BASIL HUBER.pdf

    8/60

  • 8/18/2019 BASIL HUBER.pdf

    9/60

    Nomenclature

    Notation

    Scalars are written in lower case letters ( x), vectors in upper case letters ( X )and matrices in upper case bold letters ( X ).

    Acronyms and Abbreviations

    RPG Robotics and Perception GroupDoF Degree of FreedomIMU Inertial Measurement UnitMAV Micro Aerial Vehicle

    DVS Dynamic Vision Sensor

    vii

  • 8/18/2019 BASIL HUBER.pdf

    10/60

    Chapter 1

    Introduction

    1.1 Motivation

    Autonomous Micro Aerial Vehicles (MAV) have gained importance in variouselds such as search and rescue missions [1, 2], surveillance [3, 4], and deliveryservices [5, 6] over the last years. While quadrotors can move very agilely, keep-ing track of the vehicle pose during the maneuver is essential for stability andremains an open problem. Using external motion capture systems, very precisetrajectories can be performed with high accuracy [7, 8].

    However, external motion capture systems are not available for real world ap-plications. In most environments, the state estimation must rely on onboardsensors to achieve the necessary independence. While GPS is widely used forglobal localization, its accuracy is limited [9]. GPS sensors are therefore of-ten used in combination with Inertial Measurement Units (IMU) [10] and/orcameras [11, 12]. Another important drawback is the availability of GPS. Itis only unreliably available in urban canyons and cluttered environments andcompletely unavailable indoors. In [13] a laser range nder and an IMU areused to estimate the pose of an MAV in real-time. While laser range nders areheavy and power consuming, one [ 14] or multiple [15] on-board cameras can beused for the pose estimation. However, standard cameras have a limited framerate, typically 15 Hz to 100Hz. Furthermore, during fast motion, the imagessuffer from motion blur.While the above-mentioned approaches work well for relatively slow motions,they are not able to provide accurate pose estimation during aggressive maneu-vers. A novel sensor type, inspired by the human eye, allows to overcome thelimitations of frame rate and exposure time of conventional cameras. The Dy-namic Vision Sensor (DVS) [16] does not send pictures representing an instantin time as it is the case for an ideal (i.e., innitesimal exposure time) conven-tional camera. Instead, it sends an event whenever the illumination of a pixelchanges. An event is composed of the coordinates of the pixel and a timestamp,indicating the time of change. It also contains the sign of the illuminationchange, i.e. whether the illumination has increased or decreased. By send-ing only information about illumination changes, the data volume is decreased

    1

  • 8/18/2019 BASIL HUBER.pdf

    11/60

    2 1.1. Motivation

    signicantly compared to conventional cameras. The events are transmitted

    asynchronously with a theoretical latency of 15

    s [17]. However, due to thelimitation introduced by the USB interface, events are transmitted in packages.The timestamps are attributed on the device at hardware level and are hencenot subjected the transmission latency. If the DVS is stationary, only movingobjects produce events while the background produces no events. Figure 1.1shows a comparison between the output of the DVS and a conventional camera.The output of both sensors is shown when observing a dot on a rotating wheel.

    Time

    Time

    standardcameraoutput

    DVSoutput

    Figure 1.1: Comparison between a standard camera and a DVS: A black dot on a

    turning wheel (left) is observed by both sensors. The standard camera registers thewhole scene at a xed rate. The DVS observes only the black dot, but it is notrestrained to a frame rate. If the wheel stops turning, no events are emitted by theDVS while the conventional camera continues sending images.

    To take advantage of the asynchronous signal, new algorithms have to be de-veloped. Many traditional approaches are based on features detected in animage [18]. Features describe interest points in an image. These interest pointsshould have a high repeatability (i.e., they should be recognised as such underdifferent viewing conditions). Interest points can be corners and edges [19, 20],blobs [21, 22], maxima and minima of the difference of Gaussian (DoG) func-tion [23], or other salient image regions. The interest points are then describedby a feature descriptor which should be able to identify them across differentimages.Due to the asynchronous nature of the DVS, images in the traditional sense donot exist. Therefore, these methods cannot be directly applied to DVS data.One approach to get images from DVS data is to synthesize them by integratingthe events emitted by the DVS. In integrated images, the grey value of a pixelis dene by the number of events that this pixel emitted during the integrationtime. Depending on the implementation, events of different signs are handleddifferently. These images show gradients (typically edges) of objects that movewith respect to the DVS. The integration time is critical for the quality of theimages, comparable to the exposure time of conventional cameras. When theintegration time is long compared to the apparent motion, the image is blurred,

  • 8/18/2019 BASIL HUBER.pdf

    12/60

    Chapter 1. Introduction 3

    which is the equivalent of motion blur. In images with short integration time,

    the gradients are only partially visible. Furthermore, only gradients perpendic-ular to the apparent motion are visible in these images. Thus, the synthesizedimages depend not only on the appearance of the scene, but also on the move-ments during the integration time as shown in Figure 1.2. If many gradients

    Figure 1.2: Integrated images are depending on the apparent motion; The blue arrowsshow the direction of the apparant motion. White/black pixels indicate an intensityincrease/decrease. Hatched pixels indicate that the intensity rst decreases and thenincreases. Hence, an event of each polarity is emitted. The two movements on theright have identical initial and end position, but produce different integrated images.

    in different directions exist in the scene, some of the above-mentioned meth-ods might be adapted to the DVS, using edges as feature points. However,

    approaches based on integrated images cannot take full advantage of the asyn-chronous circuit since they introduce lag.Therefore, we aim for a purely event-based approach, where each emitted eventupdates the estimation of the MAV pose. This allows us to benet from thelow latency of the sensor and the sparsity of its output data, thus resulting inan update rate of several kHz. Approaches using conventional cameras have atheoretical limit at the camera frame rate (50 Hz to 100Hz). Due to the largeamount of redundant data, the update rate of the perception pipeline is typi-cally limited to 30Hz [24]. The perception pipeline is currently a bottle neck forthe agility of MAVs. We propose an algorithm that allows to estimate the 6 De-grees of Freedom (DoF) of the pose with minimal integration to avoid lag. Ourapproach is based on tracking straight gradient segments in the event stream of the DVS. We update the pose estimation upon the arrival of each event that isattributed to a tracked segment, allowing very high frequency pose updates.

    Several tools exist for calibrating conventional cameras [ 25, 26, 27]. These toolstypically involve taking pictures of a calibration pattern. The calibration param-eters are then estimated by minimizing the reprojection error. These techniquescannot directly be used for the DVS since it does not register static scenes. Wepropose a calibration method that uses a computer screen to display the cali-bration pattern. The backlighting of LED-backlit LCD screens ickers with ahigh frequency when dimmed. Hence, the pattern can be ”seen” by the DVS.This trick is also used for the focal adjustement of the DVS.

    Ground truth data for the DVS trajectory is an important aid for testing and

  • 8/18/2019 BASIL HUBER.pdf

    13/60

    4 1.2. Related Work

    evaluation of tracking and pose estimation algorithms. Setting up a system for

    ground truth measurement (e.g., using a motion capture system) is laborious.Simulations can not only provide ground truth but also permit to exibly emu-late the environment. For pure simulations, however, very precise models of thedevice are required to produce realistic data. While the DVS’ basic principleis straight forward to simulate, accurate models for temporal noise and otherphenomena related to the sensor and the readout circuit are not available. Theexact behavior of the DVS is complex and depends on numerous parametersthat can be set in form of bias current on the device’s circuit. These parame-ters inuence the ON/OFF threshold, bandwidth, event ring rate, and othercharacteristics.We propose a simulation method that allows simulating the DVS in a virtualenvironment, while having the real characteristics of the DVS. The virtual sceneis displayed on a computer screen. The DVS is lming this scene. By choos-ing the virtual camera appropriately, the DVS ”sees” the scene under the sameperspective as the virtual camera. To correct for the misalignment between theDVS and the screen, a transformation is applied to the output of the virtualcamera before displaying it on the screen.

    1.2 Related Work

    Despite being a relatively new technology, various applications using a DVS fortracking were proposed. In many applications, clusters of events were tracked.Examples include traffic control [28], ball tracking [29], and particle track-

    ing [30].Based on the particle tracker, a visual shape tracking algorithm was proposed [31].The shape tracking is used as a feedback for a robotic microgripper. The DVSoutput is used to determine the gripper’s position. In a rst step, objects aredetected using the generalized Hough transform. Contrary to the approach of this work, the objects are detected in the image of a conventional camera ratherthan in the DVS event stream. To estimate the gripper’s pose and shape, the lo-cation of incoming events is compared to a model of the gripper in the estimatedpose. The pose of the gripper relative to the model is then estimated using anIterative Closest Point (ICP) approach. The pose is then used to provide theoperator of the gripper with a haptic feedback.Several methods to track the knee of a walking person using a DVS were pro-posed [32]. One approach consists of an event-based line-segment detector. Inanother approach, the leg is found in the Hough space, similar to the approachof this work. The tracking is then implemented with a particle lter in theHough space.Another approach is using event-based optical ow to estimate motion [33].Since the DVS does not provide gray levels, it is challenging to nd the spatialgradient. To do so, they integrate the image over a very short time (50 s).The gradient is then approximated as the difference of the event count betweenneighboring pixels during the integration time. To avoid integrating events,they proposed an approach relying only on the timing of the events [ 34]. Forthis approach, the optical ow is estimated by comparing the timestamps of themost recent events of neighboring pixels. Knowing the time difference and the

  • 8/18/2019 BASIL HUBER.pdf

    14/60

    Chapter 1. Introduction 5

    distance between the pixels, they can directly calculate the velocity of the visual

    ow. To make their method more robust, they assume the ow velocity to belocally constant. In their experiments, they track a black bar painted on a con-veyor belt and on a turning wheel. Based on this optical ow implementation,they present a Time-Of-Contact (TTC) estimation method [ 35]. The TTC isthe time until the camera reaches an obstacle assuming uniform motion. It isestimated using the velocity of the optic ow at a point in the image and thedistance of this point to the Focus of Expansion (FoE). The TTC can then beused for obstacle avoidance and motion planning.They further estimated the depth (i.e., the distance to the camera) of an objectusing event-based stereo matching between two DVS [36]. They propose to es-timate the epipolar line of a DVS pixel by nding pixels of the other DVS thatemit events nearly simultaneously.Another group uses blinking LEDs on a quadrotor to nd its pose with a sta-tionary DVS [ 37]. While providing good results for pose estimation duringaggressive maneuvers, the tracking is limited to the eld of view of the DVS,similar to approaches using a motion capture system.In more recent work [38], they mounted the DVS on a two wheeled robot. Fea-tures are tracked using an onboard Kinect camera, providing pose estimates atthe frame rate of the camera. Between two subsequent frames, the relative mo-tion is estimated based on events emitted by the DVS. The received events arecompared to the events expected when considering the most recent image of theCMOS camera. The motion that has the highest coherence between expectedand received events is taken as estimate. To estimate the translation, depthinformation of the Kinect has to be included. This approach is very promisingto estimate rotational motion while it performs poorly for translation due to

    the low resolution of the DVS. However, tracking is lost when the motion be-tween to subsequent frames is faster than half the eld of view. Furthermore,when performing maneuvers, the CMOS images suffer from motion blur and cantherefore not be used for pose updates. Hence, the pose estimation relies thenonly on the estimation of the relative motion of the DVS. Over time, the poseestimation is drifting considerably until the next sharp CMOS image arrives.Although they claimed that it could be extended to estimate 3-DoF rotation,they only demonstrate a 1-DoF rotation implementation.Another approach performs localization based on a particle lter with DVSdata [39]. Similar to the above approach, they compare the received eventswith the expected events. However, rather then localizing with respect to acamera image, they localize with respect to a predened global map. Althoughthis approach is promising, it was only implemented in 2D (3-DoF). In laterwork [40], they expanded their method to perform simultaneous localizationand mapping (SLAM). Hence, the 2D map is built and expanded during oper-ation. They demonstrated the performance of their approach on a slow groundrobot.

    1.3 Contribution

    In this work, we rst introduce a novel technique for focusing and intrinsiccamera calibration of the DVS (Chapter 2). For this procedure, we display a

  • 8/18/2019 BASIL HUBER.pdf

    15/60

    6 1.3. Contribution

    pattern on a computer screen. The backlighting of most screens is ickering

    at a high frequency when they are dimmed. This effect is used to make thepattern visible for the DVS. Standard methods to estimate the DVS’ intrinsicparameters are then used. Focusing is performed by changing the sensor-lensdistance manually while observing the ”image” (accumulation of events overtime) of a calibration pattern. For the calibration, ”images” of the pattern aretaken from different viewpoints. By minimizing the reprojection error, we cannd the intrinsic camera parameters. We further present and discuss the resultsachieved with this method.In Chapter 3, we present a method to observe a virtual environment with areal DVS. A virtual camera is placed in this environment and can performarbitrary trajectories. The output of the virtual camera is displayed on thescreen. The real DVS is lming the screen. A transformation is applied tothe output of the virtual camera so that the real DVS ”sees” the scene fromthe same perspective. This allows convenient testing of pose estimation andother algorithms for the DVS while having the real camera properties such astemporal noise and spontaneous events. Thanks to its high repeatability, thismethod is well suited for benchmarking and experimenting on the parametersof the DVS. For our simulations, a virtual camera performs trajectories in avirtual scene. Not only can an arbitrary scene be shown, but also ground truthfor the camera pose is provided.The main contribution of this thesis is a line tracker, presented in Chapter 4.This algorithm can track straight gradient segments in the DVS event stream.The pose is then estimated by comparing the position and orientation of thegradients to their known location. To the best of our knowledge, this is the rstwork presenting 6-DoF pose estimation with an event based imaging sensor. We

    tested the performance of our algorithm using a simulation as well as a quadrotorequipped with a DVS. Using an AR.drone, we performed ips around the opticalaxis of the front-looking DVS with angular rates of up to 1200 s− 1 .

  • 8/18/2019 BASIL HUBER.pdf

    16/60

    Chapter 2

    DVS Calibration

    2.1 Outline

    In this section, the calibration process is explained and the results are presented.In the rst section, the methods for focusing and for the intrinsic camera cal-ibration are explained. First, we show how the screen can be used to displaypatterns so that they are visible for the DVS. Then, the procedure for focusingof the DVS is explained, followed by the procedure for the intrinsic camera cal-ibration. We explain the parameters obtained from the calibration and explainbriey the used distortion model.In Section 2.3, we demonstrate the performance of our focusing and intrinsiccamera calibration method. First, we compare the output of a DVS before andafter focusing. Next, we show a distorted and undistorted ”image”. We thendiscuss reasons for the remaining distortion.

    2.2 Approach

    While the DVS is different to conventional cameras in many ways, the opticsare the same. However, since the DVS can only detect intensity changes in theobserved image, conventional camera calibration tools cannot be used out of thebox where a pattern (typically a checkerboard) is held in front of the camera.For the DVS to nd a calibration pattern, it has to be moving or its brightnesshas to change. In our approach, we let the pattern blink to render it visible forthe DVS. A convenient way to produce a blinking pattern is to use a computerscreen. The same effect can be used to focus the camera.

    2.2.1 Displaying a Pattern

    Observing an LED-backlit LCD screen with an stationary DVS under differ-ent brightness settings reveals the mechanism used to dim this type of screens.

    7

  • 8/18/2019 BASIL HUBER.pdf

    17/60

    8 2.2. Approach

    When the screen is set to full brightness and a static image is displayed, no

    events are emitted. However, when the screen is dimmed, bright areas on thescreen emit events at a high rate ( ≈ 170 events/s per pixel) while dark areas donot emit any. This effect is caused by the pulse-width modulation of the screen’sbacklighting used for dimming [41]. The ickering caused by the modulationis not visible to the human eye since the frequency is typically in the range of 100 Hz to 200 Hz, but varies strongly from model to model [42].The high event rate of bright areas allows to nd the pattern during a shortintegration time and low sensitivity (i.e., only strong illumination changes emitevents). Hence, the percentage of events that are generated by noise is de-creased. These events include spontaneously generated events (e.g., shot noise,dark current) and events generated by unintentional movements of the DVSrelative to edges (e.g., the border of the screen). In addition, motion blur dueto unintentional movement of the DVS during the integration time is limited.

    2.2.2 Focusing

    As for conventional cameras, the distance between the image sensor and the lenshas to be adjusted to get a sharp image on the sensor. To adjust the focus, theuser has to change the sensor-lens distance by manually screwing the lens closeror farther from the sensor. Our blinking screen technique provides a patternvisible in the DVS data without the need to move the camera. After placing theDVS in front of the screen, the focusing pattern is shown on the dimmed screen.We chose a set of concentric unlled squares, alternately black and white asshown in Figure 2.1a. The squares are logarithmically scaled to provide squares

    with suitable thickness for different distances between the DVS and the screen.Integrated images are synthesized by setting pixels to white if they registeredmore events than a threshold value and to black otherwise. Hence, white screenregions appear white in the integrated images, while black regions appear black.A preview window showing the integrated output of the DVS allows the userto observe the sharpness of the image. The user can then adjust the focus untilthe preview appears as sharp as possible. When out of focus, the white linesare blurred and hence the pattern is not recognizable towards the center of theimage, where the distance between the the white lines is narrow. The betterthe sensor-lens distance is adjusted, the more of the pattern is visible as shownin Figure 2.2.

    2.2.3 Intrinsic Camera Calibration

    For the intrinsic calibration of the DVS, the pattern consists of white circleson a black background as shown in Figure 2.1b. There is tradeoff a betweenthe size and the number of circles. The larger and farther apart the circles are,the easier it is to detect the circles in the integrated image. However, morecircles result in more points for the calibration, hence improving the estimatedparameters. In our implementation, we chose a grid of 7 × 7 circles.

    As for convential camera calibration tools [ 25], the user should take several pic-tures of the pattern from different viewpoints. By increasing the number of

  • 8/18/2019 BASIL HUBER.pdf

    18/60

    Chapter 2. DVS Calibration 9

    (a) focusing pattern (b) calibration pattern

    Figure 2.1: Pattern shown on the screen for (a) focusing and (b) intrinsic cameracalibration.

    images and choosing the viewpoints as different as possible, the user can highlyincrease quality of the estimation of the intrinsic parameters. Especially tiltingthe camera with respect to the screen considerably increases the accuracy of theestimation [43].

    For the detection of the calibration pattern, the events are integrated over 75 ms.

    As for focusing, pixels with an event count higher than a certain threshold arewhite in the synthesized image whereas the other pixels are black. After thisthresholding, a morphological closing lter [44] is applied to ll holes or dentsin the white regions. This lter rst dilates the white regions and then shrinksthem again, resulting in more convex regions. The circles are then detectedusing the OpenCV [45] routine findCircleGrid [46]. This function providesthe centers of circles that are arranged in a projection of a grid.

    The calibration is performed using the OpenCV routine calibrateCamera [47]based on Bouguet’s Camera Calibration Toolbox [25]. The calibration routineprovides the DVS’ focal length ( f x ,f y ), principal point ( cx ,cy ) and the radialand tangential distortion coefficients ( k1 , . . . , k 5 ). For the focal length and theprincipal point, the pinhole camera model is used. In this model, the imagecoordinates X D ∈R 3 are described as

    λX D =f x 0 cx0 f y cy0 0 1

    X cam , (2.1)

    where X cam ∈ R 3 are the camera coordinates and λ is a scaling factor, so thatzcam = 1.Browns’s ”Plumb Bob” model [ 48] is used to approximate the distortion causedby imperfect centering of the lens and imperfections of the lens. The radialdistortion is approximated with a sixth order model. First, the normalized

  • 8/18/2019 BASIL HUBER.pdf

    19/60

    10 2.3. Results

    camera coordinates are found as

    X n = xnyn = xcam /z camycam /z cam

    . (2.2)

    The distorted coordinates X d are than dened as

    X d = (1 + k1 r 2 + k2 r 4 + k5 r 6 )X n + D t , (2.3)

    where r is the distance from the optical axis ( r 2 = x2n + y2n ) and Dt is thetangential distortion vector,

    D t =2k3 xn yn + k4 (r 2 + 2 x2n )2k4 xn yn + k3 (r 2 + 2 y2n )

    . (2.4)

    The distorted image points can be undistorted by iteratively solving ( 2.3) usingthe previous estimation of X n to calculate r and D t as follows:

    X n = X d − D t

    (1 + k1 r 2 + k2 r 4 + k5 r 6 ). (2.5)

    The distorted coordinates are used as an initial guess for the undistorted coor-dinates ( X n = X d ).

    2.3 Results

    2.3.1 Focusing

    Figure 2.2 shows the integrated images taken from the screen showing the fo-cusing pattern. On the left, it can be seen that the image is blurred due tobad focusing of the DVS. On the right, the image is shown after adjusting thecamera-lens distance. The right image is clearly sharper.

    2.3.2 Intrinsic Camera Calibration

    To investigate the quality of the estimated distortion parameters, we comparethe raw and the rectied output of the DVS. For this experiment, the DVS isplaced in front of a ickering (i.e., dimmed) screen after being calibrated withthe above-mentioned routine. A pattern consisting of a grid of white lines on ablack background is displayed. The output of the DVS is integrated for a longtime (2 s) in order to received a high number of events even where little lightreaches the sensor. Two different lenses are used: A 2 .8 mm S-mount lens and a3.5 mm C-mount lens. For each lens, 50 images were taken from different view-points. The choice of view points is limited by the fact that the screen emitslight mainly towards the front. Hence, when the camera is tilted too much withrespect to the screen, the pattern cannot be detected. Because of the smallamount of light that is passed through the second lens, it could not be tiltedmore than approximately 15 .

  • 8/18/2019 BASIL HUBER.pdf

    20/60

    Chapter 2. DVS Calibration 11

    (a) out of focus (b) in focus

    Figure 2.2: Integrated images taken during the focusing process: (a) out of focus; (b)sharp image after nding the optimal lens-camera distance.

    The result of the calibration for the rst lens can be seen in Figure 2.3. Onthe left, the output without correction of the distortion is shown. It can beobserved that the lines are curved due to the radial distortion. The furtherfrom the image center, the smaller becomes the radius of curvature. On theright, the pixel location is corrected using the above-mentioned method. Thelines appear overall straight. However, it can be observed that the lines showpiecewise curvature in the opposite direction (left border of the image). This

    effect is not due to an error in the estimation of the intrinsic parameters, but israther due to the spatially discrete nature of the sensor. Thus, it occurs for allpixel-based cameras but is more important for low resolution sensors. Despitebeing distorted, pieces of the line appear straight in the uncorrected image dueto the limited resolution of the DVS. These straight segments are then bent tocorrect the distortion resulting in ”overbent” segments. Figure 2.5 illustratesthis problem.

    Figure 2.4 shows the output for the second lens (3 .5 mm). Although the linesare slightly less curved in the corrected image, they still appear distorted. Thisis due to badly estimated intrinsic camera parameters.

    The proposed calibration method is mainly limited by two factors: LCD displaysemit the light towards the front and only little light is emitted sideways. Hence,not enough light can reach the imaging sensor when the DVS is tilted to much.Therefore, images of less different viewpoints can be taken, which decreases thequality of the intrinsic parameter estimation. The second reason lies in thelimited resolution of the DVS (128 × 128 pixel). The low resolution introducesan error on the estimation of the center of the circles in the integrated image.The rms of the reprojection error of all points used for the calibration is 0 .28 pixelfor the rst lens (2 .8 mm S-mount) and 0 .30 mm for the second lens (3 .5mm C-mount). It is calculated as (

    N − 1j =0

    M − 1i =0 d̂ij − dij

    2 ), where N = 50 is thenumber of images, M = 49 is the number of circles, and dij is the expected andd̂ the found position of the circle i in the integrated image j . The fact that the

  • 8/18/2019 BASIL HUBER.pdf

    21/60

    12 2.3. Results

    (a) distorted (b) undistorted

    Figure 2.3: Integrated image of a line grid using a 2 .8 mm S-mount lens with (a) andwithout (b) correction

    (a) distorted (b) undistorted

    Figure 2.4: Integrated image of a line grid using a 3 .5 mm C-mount lens with (a) andwithout (b) correction.

    reprojection error for both lenses is in the same range suggests, that the poorestimation of the intrinsic camera parameters with the the second lens is due tothe low variation of the viewing angles.

  • 8/18/2019 BASIL HUBER.pdf

    22/60

    Chapter 2. DVS Calibration 13

    (a) distorted (b) undistorted

    Figure 2.5: Pixel level schematics showing the inuence of the low resolution on therectication. (a) Shows the distorted image of two straight lines. Due to the lowresolution, parts of the curved lines appear straight. (b) Shows the rectied image.Convexly curved lines appear straight again. However, lines that are straight in theundistorted image are curved concavely (dashed curves). Therefore, the straight seg-ments from the distorted image become curved segments in the rectied image. Thiseffect occurs not only for the DVS, but for all pixel-based cameras.

  • 8/18/2019 BASIL HUBER.pdf

    23/60

    Chapter 3

    DVS Simulation

    3.1 Outline

    In this chapter, the proposed simulation approach is explained, its performanceis evaluated, and its limits are discussed.In the rst section, we explain our simulation approach, where the DVS is lm-ing a virtual environment. The projection pipeline is explained and the required

    formulas are presented. The correction for the misalignment between the DVSand the screen is explained.In the next section we explain the procedure to perform a simulation.In the nal section, the achieved results are presented and discussed. We char-acterize our simulation using the reprojection error as a quality measure. Thesources of the error are analyzed. Furthermore, we discuss the differences be-tween the simulation and the reality.

    (a) initialization (b) simulation

    Figure 3.1: Setup for the simulation; (a) The DVS is placed in front of the screen. Thecalibration pattern is shown to estimate the misalignment of the DVS with respect tothe screen. (b) The simulation is shown on the screen. Due to the applied misalignmentcorrection, the DVS ”sees” the scene under the intended perspective.

    14

  • 8/18/2019 BASIL HUBER.pdf

    24/60

    Chapter 3. DVS Simulation 15

    3.2 Approach

    For this method, a virtual environment is setup using OpenGL. A virtual cam-era is then placed in this environment. The output of this virtual camera isrendered live and shown on the computer screen. The DVS is placed in front of a computer screen as shown in Figure 3.1. Since the DVS cannot be perfectlyaligned with the screen, the output of the virtual camera is transformed by ahomography. In this way, the DVS ”sees” the scene under the same perspectiveas the virtual camera. This setup is shown in Figure 3.2. This allows to simu-late the environment and having the actual DVS behavior including noise andlatency. For the testing of our pose estimation algorithm, the virtual cameraperforms a predened trajectory.

    (a) without correction

    (b) with correction

    Figure 3.2: Schematics showing the setup for the simulation without (a) and with (b)misalignment correction.

    1. The virtual camera is lming the scene.The output is displayed on the screen.In (b), the homography to correct for the misalignment is applied.

    2. The DVS is ”lming” the screen.Note that the DVS is tilted with respect to the screen.

    3. Output of the DVS.

  • 8/18/2019 BASIL HUBER.pdf

    25/60

    16 3.2. Approach

    Projection Pipeline To provide a better understanding of the implementa-

    tion of the simulation, we explain the OpenGL rendering pipeline using OpenGLterminology. Throughout the whole rendering pipeline, 4D homogeneous coor-dinates are used. The virtual camera is implemented using the pinhole cameramodel. In a rst step, the world coordinates X W ∈R 4 are transformed to virtualcamera coordinates X cam ∈R 4 (sometimes referred to as eye coordinates). Thistransformation is described by the multiplication with the 4 × 4 view matrixM O as

    X cam = M O X W . (3.1)

    The view matrix is dened as

    M O =R O T O

    0 1

    , (3.2)

    where R O is the 3 × 3 rotation matrix from the world to the virtual cameraframe and T O is the virtual camera translation vector. The virtual cameracoordinates are then projected to OpenGL clipping coordinates X clip ∈ R 4 .Clipping coordinates dene the virtual image before the division by the depth.This transformation is performed by the multiplication with the 4 × 4 projectionmatrix K O and can be describes as

    X clip = K O X cam . (3.3)

    In our implementation, we choose

    K O =

    f O x 0 0 00 f O y 0 00 0 0 00 0 1 0

    , (3.4)

    where f O y and f O y are the focal length of the virtual camera in x and y directionrespectively. The clipping coordinates can then be written as

    X clip =

    xclipyclipzclipwclip

    =

    f O x xcamf O y ycam

    0zcam

    . (3.5)

    The coordinate zclip is stored in the depth buffer. The depth buffer is used byOpenGL to determine whether a point is visible or occluded by another point.To avoid very far objects from being rendered, only points with zclip between − 1and 1 are visible. For the simulation used in this work, we choose a 2D scene andhence no occlusion can occur. We therefore choose zclip = 0 to make all pointsvisible, independent of their distance to the camera. For 3D scenes, this has tobe adapted to allow OpenGL to handle occlusions. The coordinate wclip is usedto scale the coordinates by the depth of the point, independently of the valuein the depth buffer. The transformation from world to clipping coordinatesis performed on the Graphics Processing Unit (GPU). The programmer candescribe this behavior with a program that is loaded to the GPU called vertexshader.

  • 8/18/2019 BASIL HUBER.pdf

    26/60

  • 8/18/2019 BASIL HUBER.pdf

    27/60

    18 3.2. Approach

    where λ0 is the scaling factor so that zD = 1, R D is the rotation matrix from

    screen to DVS camera frame, and T D is the DVS translation vector. Assum-ing the DVS is tilted with respect to the screen (i.e., R D = I3× 3 ), the imagegenerated on the DVS image plane is distorted by the projection as shown inFigure 3.2b.

    Correction for the camera alignment and screen scaling Since the DVSimage plane will not be perfectly aligned with the screen, the output of thevirtual camera has to be warped. This warping is performed by applying amatrix transformation to the OpenGL clipping coordinates. By compensatingthe misalignment between the DVS and the screen and the scaling, the DVS”sees” the scene under the same perspective as the virtual camera.In the rst step, the transformation between the viewport and the DVS image

    plane has to be found. This transformation can be described by a homography.It includes the scaling from viewport to screen coordinates and the projection tothe DVS image plane in one 4 × 4 matrix H D . Using the viewport coordinatesrather than screen coordinates avoids the problem of the unknown screen scale.The homography can be formulated as

    λ1xvpyvp1

    = H D X D , (3.11)

    where λ1 is an arbitrary factor.To nd this homography, a blinking grid of circles is shown on the screen,similar as for the calibration (see Chapter 2). In this case, however, the screen’sbacklighting is constant and the pattern is blinking on the screen to preventchanges of the screen brightness during the procedure. The homography isthen found by comparing the position of the circles in the DVS image with thecorresponding position in the OpenGL viewport. Since the DVS image suffersfrom lens distortion, the position of the circles have to be undistorted. In a nextstep, the desired position of the circles in the DVS image X D ∈R 3 is calculatedby projecting the world coordinates of the circle centers directly to the DVSimage plan assuming for the DVS to have the same pose as the virtual camera.These coordinates, representing what the DVS would ”see” if it was placed atthe same position in the virtual world as the virtual camera, can be describedas

    λ2 X D = K D R O T O X W , (3.12)where λ 2 is a scaling factor so that zD = 1. Using the homography between theimage and viewport coordinates H D , the desired viewport coordinates X vp are

    λ3xvpyvp1

    = H D X D , (3.13)

    where λ3 is again a scaling factor. These coordinates represent the ideal positionof the circles on the OpenGL viewport, so that both cameras observe the sceneunder the same perspective.

  • 8/18/2019 BASIL HUBER.pdf

    28/60

    Chapter 3. DVS Simulation 19

    Knowing the current and the desired viewport coordinates, the homography

    between these two can be found using the homography equation

    λ4xvpyvp1

    = Hxvpyvp1

    , (3.14)

    where λ4 is a scaling factor and H is the homography matrix. This homographyis the transformation allowing to correct the virtual camera’s output. For thecorrection, this homography is applied to the clipping coordinates, since theprogrammer cannot modify the viewport coordinates. The corrected clippingcoordinates can be found as

    X̂ clip = H̃ X clip = H̃K OR

    O T O0 1 X W , (3.15)

    where H̃ is the homography H expanded to a 4x4 matrix:

    H̃ =

    h11 h12 0 h13h21 h22 0 h230 0 1 0

    h31 h32 0 h33

    with: H =h11 h12 h13h21 h22 h23h31 h32 h33

    . (3.16)

    The homography matrix H must be expanded to transform the homogeneouscoordinates correctly. Note that the third row and the third column are set tozero, except for h̃33 = 1. This is the case since, as explained above, the corrected

    clipping coordinates are divided by ˆwclip rather than ˆzclip .

    3.3 Simulation Procedure

    In this section, we show the procedure to record simulated data. The user canenter the desired camera trajectory directly in the source code. By default, thestart and end pose of the camera are entered. The trajectory is then calculatedas a linear interpolation of the translation vector and the orientation anglesyaw, pitch, and roll. This part can be easily modied in the code to get anyparametrizable trajectory. In our experiment, we display a black square on awhite background. The user can modify the scene to be displayed and the way

    it is rendered.Once both the trajectory and the scene are set, the camera is placed in front of the screen. It should be placed as close to the screen as possible but with stillthe whole height of the screen in view. If the camera is placed too far from thescreen, the correction above might result in an image that is too large for thescreen. A preview window showing the integrated DVS output assists the userin this task. The brightness must be set to the maximum, so that the screen isnot ickering.

    The user can then start the initialization. First, the initialization pattern isshown to determine the homography for the alignment correction. Then, the

  • 8/18/2019 BASIL HUBER.pdf

    29/60

    20 3.4. Results

    animation is shown and recorded. Figure 3.1 shows the setup for the initializa-

    tion phase on the left side and the simulation phase on the right side.

    3.4 Results

    3.4.1 Correction of DVS-Screen Misalignment

    To investigate the quality of our correction method for the DVS-screen mis-alignment, we propose the following setup: The DVS is placed in front of thescreen after being intrinsically calibrated (see Chapter 2). For this experiment,we use a 3.5 mm S-mount lens, which introduces only minimal distortion. Thecalibration pattern is displayed on the dimmed screen to nd the homographybetween the desired and the current viewpoint coordinates. Then, a grid of thin white lines is displayed on the screen without applying the correction. Animage is taken by integrating the DVS output for 100 ms. This image is shownin Figure 3.3a. In the next step, the correction is applied to the projectionmatrix and the grid is displayed again. Another DVS ”image” is taken, shownin Figure 3.3b. Both images are thresholded to remove noise.

    (a) uncorrected (b) corrected

    Figure 3.3: Integrated image of a line grid using a 3.5mm S-mount lens (a) with and (b)without misalignment correction; Certain image regions did not receive enough lightdue to the tilt of the DVS. Both images are undistorted and thresholded to removenoise. In (b) the expected position of the grid is shown in red.

    The position of the events is compared to the expected position of the lines.The mean distance between the event positions and the expected lines is 0.28pixels. Note that this error is not only due to inaccuracy of the correction. Ithas to be considered that the lines are thicker than one pixel at some places inthe image. Furthermore, the low resolution of the DVS introduces discretizationerror and the correction for the lens distortion is not perfect. The correctionitself is limited by the accuracy of the estimation of the circle centers from theimage. This estimation is inuenced by the quality of the correction of thelens distortion and suffers again from the low resolution. The error caused by

  • 8/18/2019 BASIL HUBER.pdf

    30/60

    Chapter 3. DVS Simulation 21

    the pixel nature of the screen can be neglected when comparing the resolution

    of a computer screen (typically in the order of 1600 × 900 pixel) to the DVS(128 × 128 pixel).

    3.4.2 Screen Refreshing Effects

    The simulation is not only limited by the misalignment between the DVS andthe screen, but also by the screen refreshing. While the DVS does not sufferfrom motion blur thanks to its asynchronous circuit, the screen starts to displayblurred images when motion is faster than one screen pixel per screen refresh.The screen refresh rate of our setup was measured to be 16 ms. This value isdetermined by measuring the time between two subsequent drawing commandssent by OpenGL. Furthermore, the individual pixels are updated row by rowfrom the top to the bottom. This could introduce an ”unnatural” chronologicalorder of the emitted events. To investigate this phenomenon, we display ablack square moving horizontally over the screen. The square is invisible at thebeginning of the animation. It then slides into view from the left side until thescreen is fully covered. To sensitivity of the DVS is set low to minimize thenoise in the measurement.

    Figure 3.4: Plot of the timestamp of the rst event of each pixel in milliseconds; Bluepixels red rst, red pixels last. White pixels indicate pixels that have not emitted anyevents. A black square was slided over the screen. The four vertical bands correspondto screen refreshes.

    Figure 3.4 visualizes the timestamps of the events registered during this exper-iment. Four vertical bands can clearly be seen. Each band represents a screenrefreshing. The time between these bands is coherent with the measured screenrefresh rate. It is therefore important for simulations to avoid to high apparentmotion. The speed of the simulation is a tradeoff between the screen’s motionblur and the noise level, since DVS output is more affected by noise during slow

  • 8/18/2019 BASIL HUBER.pdf

    31/60

    22 3.4. Results

    Figure 3.5: The experiment shown 3.4 is repeated, but 5 times slower. The horizontalgradient is smooth over the whole image, indicating that the simulation does not sufferfrom motion blur of the screen at that speed.

    simulations.When investigating the bands more closely, a vertical gradient from top to bot-tom can be observed, especially in the rightmost band. When turing the DVSby 180 around the optical axis, the gradient are upside down. This suggestthat the gradients arise from the row-wise screen refresh rather than from DVSrelated issues. Figure 3.5 shows the result when performing the experiment 5times slower. A smooth horizontal gradient over the whole screen can be seen.The absence of the bands observed before indicates that the simulation does notsuffer from motion blur of the screen at that speed.

    In a second experiment, the whole screen was turned from white to black inbetween two refreshes. The timestamp of the st event of each pixel are shownin Figure 3.6. Against our rst intuition, the timestamps of the events do notshow a continuous vertical gradient over the whole image plane. It appearsthat several gradients are overlapping, and different horizontal bands of thisgradients are visible. This could either be due to the refreshing of the screenor the readout of the pixels in the DVS. To disambiguate the origin of thisphenomenon, we turned the DVS 90 around the optical axis and repeated theexperiment. If it is due to the screen, the bands should be turned in the newplot as well, while they should stay the same if its related to the DVS. Theresult is shown in Figure 3.7. Again, horizontal bands of vertical gradients canbe observed. This indicates that this phenomenon is due to the DVS ratherthan to the screen refreshing. OFF events (i.e., decrease of illumination) arereadout row-wise. [17]. Since the vertical gradients within the bands do notchange orientation either, they are assumed to originate in the readout as well,rather than in the screen refreshing. Hence, the row-wise screen refresh is visible

  • 8/18/2019 BASIL HUBER.pdf

    32/60

    Chapter 3. DVS Simulation 23

    if a small amount of the screen changes. In that case, the DVS readout is fast

    enough to be inuenced by this effect. If, however, the whole screen changes itsbrightness, the lag introduced by the readout method dominates the timing of the events.

    Figure 3.6: Timestamps of the events in milliseconds; several overlaying vertical gra-dients can be seen.

    Figure 3.7: Timestamps of the events in milliseconds; the DVS is turned 90 withrespect to the plot in 3.6. The gradients are still vertical.

  • 8/18/2019 BASIL HUBER.pdf

    33/60

    Chapter 4

    Pose Estimation

    4.1 Outline

    In this chapter we present the main contribution of this thesis, the tracking andpose estimation algorithm. In the rst section, we explain the algorithm. Westart with the initialization phase, where we show how we nd the square whosesides are used as landmarks. Therefore, we explain rst how the line segmentsare extracted from the event stream. Then, we explain how the polygon isfound among the extracted line segments and how the initial pose is estimated.Subsequently, we explain the tracking and the pose estimation.In the next section, we present the conducted experiments. Therefore, the setupsand the motivation for the experiments are described. We rst do this for theexperiments involving our simulation method and then for those which involvereal data collected with the DVS on a quadrotor.In the last section, the results of the experiments are shown and discussed.Again, we rst treat the simulation, followed by the real data.

    4.2 Approach

    The pose of the camera is estimated by extracting the position of line segmentsfrom the DVS data and comparing them to their known position in world coor-dinates. In the initialization phase, the stream of events is searched for straightlines. We then search these straight lines for line segments. In our imple-mentation, we use the sides of a square as landmarks. Being a closed shape,the endpoints of the segments can be estimated as the intersection with theneighboring segment. Assuming only limited tilting of the camera, the lines areeasily discriminable since the angles are not very acute or obtuse and the linesare therefore not interfering with each other.The tracking algorithm then tracks the edges of the square through time. Thepose of the DVS is estimated after the arrival of each event, resulting in high-frequency event-based updates.

    24

  • 8/18/2019 BASIL HUBER.pdf

    34/60

    Chapter 4. Pose Estimation 25

    Upon the arrival of an event, it is checked whether the event is close to a

    tracked line segment. Events that are in a dened range of a segment areattributed to this segment. If a event has been attributed to a segment, thepose is update, taking into account all events of all line segments. Using thenew pose estimation, the line segments are than projected onto the image plane.The projected segments are then used as new estimates.

    4.2.1 Initialization

    Finding Line Segments To nd line segments in the stream of events, thearriving events are integrated to form an image. Therefore, an arriving event isadded to a buffer containing the event’s location.To detect lines in this image, the Hough transform is used [49]. It transformsan image point P to a curve in the Hough space described by the followingequation:

    r P (θ) = P x cosθ + P y sin θ. (4.1)

    Each pair ( r P (θ), θ) fullling the above equation represents a line passing throughthe point P [50]. The radius r P (θ) is the smallest distance from the line to theorigin and θ is the angle between the normal to the line through the origin andthe x axis, as shown in Figure 4.1a. The curves of a set of collinear points inter-sect in a single point in the Hough space, representing the line passing throughall points as shown in Figure 4.1b.

    For the implementation, the space of all possible lines is discrete. A bin foreach possible line is stored. Upon the arrival of an event, the value of each binwhich fulls (4.1) for the location of the event P i is incremented. Hence binswith a high vote count represent lines passing through many received events.Figure 4.1c shows the values of the bins for the example in Figure 4.1a. The res-olution of the Hough space (i.e., the number of bins) is an important parameter.Too few bins result in a poor estimate of the orientation and position of the line.If the number of bins is too high, clear maxima are not found since the pointsbelonging to a segment are generally not exactly on straight line. Furthermore,the computational cost increases with the number of bins. We chose an angu-lar resolution of 7 .5 and a radial resolution of 2 .5 pixel, resulting in 24 × 73 bins.

    After receiving a certain number of events, the bins are searched for potentiallines. Usually, the Hough bins are searched for local maxima. However, if onlylocal maxima are considered, line segments could be omitted if there is a linewith similar parameters that has more votes. This problem is shown in 4.2.Therefore, the space of all bins is thresholded: bins containing more than acertain number of events represent line candidates (we chose a threshold of 25 events). In a next step, each line candidate is searched for segments with ahigh event density (i.e., clusters of events along the line). To do so, all eventsthat are close enough to a line are attributed to the line (we chose a maximaldistance of 2 pixel). An event can be attributed to several lines. The eventsattributed to a line are then sorted according to their distance (parallel to theline) to the intersection of the line with its normal passing through the origin s as


[Figure 4.1: (a) image plane, axes x and y in pixel; (b) Hough space, θ in degrees and r in pixel, showing the curves r_P(θ); (c) Hough bins, color-coded vote count.]

Figure 4.1: Hough transform: The points are shown in (a). In (b), the continuous Hough transform of the points is shown; each point in the image corresponds to a curve of the same color. In (c), the discrete Hough transform is shown; the color indicates the number of points that support the line. We chose an angular resolution of 7.5° and a radial resolution of 2.5 pixel.

A segment is defined as the part of a line candidate on which neighbouring event locations are separated by no more than a certain distance (we chose 15 pixel). A segment furthermore has to have a minimal length and a minimal number of events lying on it (we chose 20 pixel for the minimal length and 25 events for the minimal event count). The set of found segments is then searched for a closed shape, as explained in the following paragraph. If no closed shape can be detected, the accumulation of events continues and the Hough space is searched for segments again after a certain number of additional events.
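The extraction of dense segments along one line candidate can be illustrated as follows (a Python sketch using the thresholds stated above; the event list and function names are assumptions, not the thesis code):

```python
import numpy as np

def extract_segments(events_xy, theta, r,
                     max_dist=2.0, max_gap=15.0, min_len=20.0, min_count=25):
    """Find dense segments on the line r = x cos(theta) + y sin(theta)."""
    n = np.array([np.cos(theta), np.sin(theta)])   # unit normal of the line
    t = np.array([-np.sin(theta), np.cos(theta)])  # unit direction of the line
    pts = np.asarray(events_xy, dtype=float)

    # keep only events whose orthogonal distance to the line is at most max_dist
    close = np.abs(pts @ n - r) <= max_dist
    s = np.sort(pts[close] @ t)                    # position along the line (parameter s)

    segments, start = [], 0
    for i in range(1, s.size + 1):
        # a gap larger than max_gap (or the end of the list) closes the current segment
        if i == s.size or s[i] - s[i - 1] > max_gap:
            length, count = s[i - 1] - s[start], i - start
            if length >= min_len and count >= min_count:
                segments.append((s[start], s[i - 1]))   # endpoints as s-coordinates
            start = i
    return segments
```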


[Figure 4.2: (a) image plane, axes x and y in pixel; (b) Hough bins, θ in degrees and r in pixel, color-coded vote count.]

Figure 4.2: Problem of the local-maxima approach for the segment search: although only the blue line contains a segment, the green line would be chosen as line candidate since it has more votes. Hence, we consider all lines with a certain bin count as candidates.


Figure 4.3: Parametrization of the event location on a line; the parameter s describes the location of an event along the line and is used to sort the events in order to find line segments.

Finding the Square  To detect the square, the image is searched for four line segments forming a quadrangle. In a first step, the line segments are sorted according to their length. The search for the square starts with the longest segment, giving priority to longer segments; hence, in case of ambiguity, larger quadrangles are found and selected first. One end of this segment is selected arbitrarily. A list of all segments that have one endpoint close to this point is generated (we count a point as close if its distance is smaller than one third of the length of the current segment). Only segments that are oriented counter-clockwise and enclose an angle larger than 45° and smaller than 135° with the current segment are considered. This condition implies that the camera is not tilted too much with respect to the square during the initialization phase, and it increases the robustness of the search since it avoids many false detections.


Further elements connected in the counter-clockwise direction are then looked for recursively, resulting in a chain of elements possibly forming a square. Once four elements are found, it is checked whether the far endpoint of the current segment is close to the far endpoint of the first segment. If so, the search is stopped. If not, the first element is removed from the chain and the search is continued. If no segment can be added to the current segment and the chain does not form a quadrangle, a dead end is hit and the search continues from the previous segment.
To guarantee that a possible quadrangle can be found, a segment may be checked twice: even if a segment has no close counter-clockwise neighbor at one end, the other end can still have an adjacent counter-clockwise neighbor, as shown in Figure 4.4. To avoid unnecessary computation, a list of possible segment endpoints is maintained. Once a segment is added, the endpoint close to the previous segment is removed from this list, which guarantees that segments are checked at most twice.
To avoid as many false positive detections as possible, a minimal side length for the quadrangle is enforced: once a quadrangle is found, its corners are calculated as the intersections of its sides. If the distance between two adjacent corners is smaller than the minimal segment length used in the segment detection (20 pixel), the quadrangle is rejected and the search is continued. This detection method could easily be applied to other closed convex shapes; a sketch of the closure and angle checks is given after Figure 4.4.


Figure 4.4: Square detection: segments have to be checked twice to see whether they belong to the square, since there could be a square on either side. For example, the segment in the middle is not attributed to a square when starting from point P1; it is, however, attributed when starting from P2.
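The closure and angle conditions for a candidate chain of four segments can be sketched as follows (a simplified Python check under the conditions stated above; it assumes the chain is already ordered and oriented counter-clockwise and does not reproduce the full recursive search of the thesis):

```python
import numpy as np

def cross2(u, v):
    """z-component of the 2D cross product."""
    return u[0] * v[1] - u[1] * v[0]

def line_intersection(seg_a, seg_b):
    """Intersection of the infinite lines through two segments (each a pair of 2D points)."""
    (p0, p1), (q0, q1) = seg_a, seg_b
    d1, d2 = p1 - p0, q1 - q0
    t = cross2(q0 - p0, d2) / cross2(d1, d2)
    return p0 + t * d1

def is_valid_quad(chain, min_side=20.0):
    """chain: list of four segments (start, end) as 2D numpy arrays, ordered around a candidate quad."""
    for (a0, a1), (b0, b1) in zip(chain, chain[1:] + chain[:1]):
        # consecutive endpoints must be close (within one third of the current segment length)
        if np.linalg.norm(b0 - a1) > np.linalg.norm(a1 - a0) / 3.0:
            return False
        u = (a1 - a0) / np.linalg.norm(a1 - a0)
        v = (b1 - b0) / np.linalg.norm(b1 - b0)
        # the turn must be counter-clockwise and between 45 and 135 degrees
        angle = np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))
        if cross2(u, v) <= 0.0 or not (45.0 <= angle <= 135.0):
            return False
    # corners (intersections of adjacent sides) must be at least min_side apart
    corners = [line_intersection(chain[i], chain[(i + 1) % 4]) for i in range(4)]
    return all(np.linalg.norm(corners[i] - corners[i - 1]) >= min_side for i in range(4))
```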

Initial Pose Estimation  A first coarse pose estimate is obtained by calculating the homography between the estimated positions of the corners in the image and their known positions in the world frame. The pose of the DVS is then estimated by decomposing the homography. The correspondence between world and image points is established by assuming the rotation of the DVS around the optical axis to be between −45° and 45°. The pose estimate is then refined by minimizing the reprojection error.
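For illustration, the initial pose can be computed with OpenCV along the following lines (a hedged sketch, not the thesis code; it assumes calibrated intrinsics K and uses cv2.solvePnP, which minimizes the reprojection error directly, in place of an explicit homography decomposition followed by refinement):

```python
import cv2
import numpy as np

def initial_pose(img_corners, square_side, K, dist_coeffs=None):
    """Coarse DVS pose from the four detected corners of the landmark square.

    img_corners: (4, 2) corner locations in the image, ordered consistently with the
                 world corners (rotation about the optical axis within +/- 45 degrees).
    square_side: side length of the square in meters.
    K:           3x3 intrinsic matrix from the DVS calibration.
    """
    s = square_side / 2.0
    world_corners = np.array([[-s, -s, 0.0], [ s, -s, 0.0],
                              [ s,  s, 0.0], [-s,  s, 0.0]])

    # homography between the world plane (z = 0) and the image plane
    H, _ = cv2.findHomography(world_corners[:, :2], img_corners)

    # pose that minimizes the reprojection error of the four correspondences
    ok, rvec, tvec = cv2.solvePnP(world_corners, img_corners.astype(np.float64),
                                  K, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)
    return ok, R, tvec, H
```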


    4.2.2 Tracking

After finding the line segments to be tracked in the image plane, their position is updated upon the arrival of each new event that can be attributed to a segment. If an event can be attributed to a segment, it is appended to the segment's event buffer. The DVS pose is then optimized considering all events attributed to all line segments, and the segment positions are re-estimated by projecting the world coordinates onto the image plane.

Event Attribution  When an event arrives, its orthogonal distance to each line segment is calculated as

$$ d_\perp = \left\| \mathbf{v}_0 - (\mathbf{v}_0 \cdot \mathbf{n})\,\mathbf{n} \right\|, \qquad (4.2) $$

where v0 is the vector from one of the segment endpoints P0 to the location P of the event, and n = (P1 − P0)/‖P1 − P0‖ is the unit vector of the segment. The distance parallel to the line to the closer of the segment's endpoints is calculated as

$$ d_\parallel = \begin{cases} -\,\mathbf{v}_0 \cdot \mathbf{n} & \text{if } |\mathbf{v}_0 \cdot \mathbf{n}| < |\mathbf{v}_1 \cdot \mathbf{n}| \\ \phantom{-\,}\mathbf{v}_1 \cdot \mathbf{n} & \text{if } |\mathbf{v}_0 \cdot \mathbf{n}| > |\mathbf{v}_1 \cdot \mathbf{n}| \end{cases} \qquad (4.3) $$

where v1 is the vector from the other segment endpoint P1 to the location P of the event. This distance is negative if the point lies between the endpoints and positive if it lies outside. An event is in the range of segment i if its orthogonal and parallel distances to the line are smaller than the respective thresholds:

$$ d_{\perp,i} < d_{\perp,\max} \;\wedge\; d_{\parallel,i} < d_{\parallel,\max}. \qquad (4.4) $$

In our implementation, we chose the thresholds d⊥,max = 2 pixel and d∥,max = 10 pixel. This large range for event attribution allows the tracker to recover from bad pose estimates. If an event is in the range of only one segment, it is attributed to it. If an event is in the range of no segment or of more than two segments, it is disregarded. In the case where the event is in the range of exactly two segments, its attribution depends on its relative position to the segments. Figure 4.5a illustrates how the events are attributed. If the event lies outside the endpoints of both segments la and lb (d∥,a > 0 ∧ d∥,b > 0, region C), it is attributed to both lines. If the event is located between the endpoints of la and outside the endpoints of lb (d∥,a < 0 ∧ d∥,b > 0, region A), it is attributed only to la. If the event lies between the endpoints of both lines (d∥,a < 0 ∧ d∥,b < 0, region D), it is discarded. Although this implies omitting valuable events coming from corners, it is necessary to prevent the polygon from shrinking. Shrinking can occur if an event lying in region D is attributed to both lines even though it was produced only by lb: the estimate of la is then drawn towards the inside of the square. If this occurs several times, the estimate moves too far from the actual position and misses the events produced by this segment, and the tracking of this line is lost. Figure 4.5b depicts this problem.
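The distance computations of Eqs. (4.2)–(4.4) can be sketched as follows (a Python sketch with illustrative names; the two-segment region rules of Figure 4.5 are only indicated by a comment):

```python
import numpy as np

D_PERP_MAX, D_PAR_MAX = 2.0, 10.0   # attribution thresholds in pixel

def distances_to_segment(p, p0, p1):
    """Orthogonal and parallel distance of event location p to segment (p0, p1), Eqs. (4.2)-(4.3)."""
    n = (p1 - p0) / np.linalg.norm(p1 - p0)       # unit vector of the segment
    v0, v1 = p - p0, p - p1
    d_perp = np.linalg.norm(v0 - (v0 @ n) * n)
    # parallel distance to the closer endpoint: negative inside, positive outside the segment
    d_par = -(v0 @ n) if abs(v0 @ n) < abs(v1 @ n) else (v1 @ n)
    return d_perp, d_par

def segments_in_range(p, segments):
    """Indices of segments whose range (Eq. 4.4) contains the event at p."""
    in_range = []
    for i, (p0, p1) in enumerate(segments):
        d_perp, d_par = distances_to_segment(p, p0, p1)
        if d_perp < D_PERP_MAX and d_par < D_PAR_MAX:
            in_range.append(i)
    # exactly one segment in range -> attribute; none or more than two -> discard;
    # exactly two -> apply the region rules of Figure 4.5 (not reproduced here)
    return in_range
```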


Figure 4.5: In (a), the attribution of events is shown: events located in region A or B are attributed only to the segment la or lb, respectively; events located in C are attributed to both segments; events located in D are attributed to neither segment. In (b), the shrinking of the square is depicted: events produced by lb can draw the segment la towards the center when attributed to la.

Event Buffer  Each line segment has its own event buffer. When a new event is attributed to a line, an old event in the buffer is replaced.

When a line rotates, the number of emitted events increases with the distance to the center of rotation. If the center of rotation is close to one endpoint of the line segment, this can lead to inaccurate estimates of the position and orientation of the segment, because all events stored for this segment gather at the other end of the segment, as shown in Figure 4.6a.

This problem can be addressed in two ways: increasing the number of events stored for each segment, or replacing stored events that occurred close to the location of the current event. The first solution introduces an unwanted lag, since old events located far from the current segment position are considered for the estimation; furthermore, the computational cost and the required memory increase. The second solution can improve the distribution of event locations along the segment: despite their age, events close to the center of rotation still lie on the line and are therefore still valid, and they can improve the estimation of the segment considerably. In the presented approach, a new event replaces a close-by stored event if the distance (parallel to the line) between the two is smaller than a threshold, chosen as half the spacing the stored events would have if they were distributed uniformly. This results in a more uniform distribution of events along the line and hence a better estimation. A disadvantage is that once an outlier is attributed to the line, it might influence the estimation for a long time. However, since the number of outliers is relatively small, the advantages outweigh this drawback.
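A minimal sketch of this buffer policy (illustrative Python, not the thesis code; the buffer stores the parallel position s of each event along the segment, and the fallback of replacing the oldest event when no stored event is close enough is an assumption):

```python
import numpy as np

class SegmentEventBuffer:
    """Fixed-size event buffer for one line segment."""

    def __init__(self, size, segment_length):
        self.size = size
        self.uniform_spacing = segment_length / size   # spacing if events were spread uniformly
        self.events = []                               # list of (s, xy) tuples, oldest first

    def add(self, s, xy):
        if len(self.events) < self.size:
            self.events.append((s, xy))
            return
        gaps = [abs(s - s_stored) for s_stored, _ in self.events]
        i_closest = int(np.argmin(gaps))
        if gaps[i_closest] < 0.5 * self.uniform_spacing:
            self.events[i_closest] = (s, xy)           # replace the close-by stored event
        else:
            self.events.pop(0)                         # assumption: otherwise replace the oldest
            self.events.append((s, xy))
```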

Pose Estimation  The pose is estimated by minimizing the distances of the events to their line segments. The initial guess for the optimization is the previous pose estimate. Due to the high rate of pose estimations, this guess is close to the current pose, and thus the minimization converges rapidly. The arguments of the minimization are the camera's position and Euler angles, with which the world coordinates are projected onto the image plane.


Figure 4.6: Pixel-level schematics showing the problem of replacing the oldest event stored for a line. The true line (black dashed) is rotated (black solid), and the line is estimated (red) based on the position of the events (gray). (a) New events replace the oldest event stored for this line; note how the stored events cluster on one end of the line, thus corrupting the line estimate. (b) Instead, new events replace the old event closest to their location; hence, the line estimate is more accurate.

The orthogonal distances d⊥ of the events to the projected lines are then minimized in the least-squares sense. The optimization is performed with the MATLAB function lsqnonlin, which implements the trust-region-reflective algorithm. The new estimate of the line segments is obtained by projecting the world coordinates using the pose resulting from the optimization.
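The same minimization can be sketched in Python with SciPy's trust-region-reflective solver (a sketch, not the original MATLAB code; project_segment is an assumed helper that projects a world segment into the image for a given pose and intrinsic matrix K):

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(pose, events_per_segment, world_segments, K, project_segment):
    """Orthogonal distances of all buffered events to the reprojected line segments."""
    res = []
    for events, world_seg in zip(events_per_segment, world_segments):
        p0, p1 = project_segment(world_seg, pose, K)      # segment endpoints in the image plane
        n = (p1 - p0) / np.linalg.norm(p1 - p0)
        for p in events:
            v0 = p - p0
            res.append(np.linalg.norm(v0 - (v0 @ n) * n)) # d_perp, Eq. (4.2)
    return np.asarray(res)

def update_pose(prev_pose, events_per_segment, world_segments, K, project_segment):
    # the previous pose is a good initial guess, so 'trf' converges within a few iterations
    sol = least_squares(residuals, prev_pose, method='trf',
                        args=(events_per_segment, world_segments, K, project_segment))
    return sol.x
```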

    4.3 Experimental Setup

To demonstrate the performance of our system, we performed simulations as well as an experiment with the DVS aboard a quadrotor. While the simulation provides accurate ground truth and allows defining an arbitrary trajectory for the DVS, the feasibility is best demonstrated on a real flying vehicle.

    4.3.1 Trajectory Simulation

The trajectory simulation is performed using the setup described in Chapter 3. The virtual camera follows a conic helical trajectory around the square, with the optical axis pointing towards the square's center, as shown in Figure 4.7. This trajectory is described by the following view matrix:

$$ M = \begin{pmatrix} \cos\alpha & \sin\alpha & 0 & 0 \\ -\sin\alpha\cos\gamma & \cos\alpha\cos\gamma & \sin\gamma & 0 \\ \sin\alpha\sin\gamma & -\cos\alpha\sin\gamma & \cos\gamma & z \end{pmatrix}, \qquad (4.5) $$

where α = 720° · t/T, γ = 210°, and z = 2.5 m · t/T. We chose the duration of the simulation as T = 4 s to minimize the effects of screen refreshing. This trajectory is used to investigate the influence of the size of the square in the DVS image plane.
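For reference, the view matrix of Eq. (4.5) can be generated as follows (a short Python sketch using the parameter values given above):

```python
import numpy as np

def view_matrix(t, T=4.0):
    """3x4 view matrix of Eq. (4.5) at time t for a trajectory of duration T seconds."""
    alpha = np.deg2rad(720.0 * t / T)   # two full turns around the square
    gamma = np.deg2rad(210.0)           # constant tilt
    z = 2.5 * t / T                     # distance grows linearly to 2.5 m
    ca, sa, cg, sg = np.cos(alpha), np.sin(alpha), np.cos(gamma), np.sin(gamma)
    return np.array([[ ca,       sa,      0.0, 0.0],
                     [-sa * cg,  ca * cg, sg,  0.0],
                     [ sa * sg, -ca * sg, cg,  z  ]])
```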

The second experiment consists of the same trajectory, except that the distance to the square is kept fixed. The experiment was performed with two different distances from the square.


First, the camera is at a distance of 1 m from the square; hence the projection of the square is a large quadrangle that should be easily trackable (the longest side measures 114 pixel and the shortest 70 pixel). The second distance is chosen as 2.5 m, resulting in a quadrangle whose longest side measures 45 pixel and whose shortest side measures 31 pixel; this is near the maximal distance at which the square can still be tracked. With these simulations, we want to observe the influence of the number of events stored for each line.


Figure 4.7: Trajectory of the virtual camera: the optical axis (blue) is always pointing towards the center of the square.

    4.3.2 DVS on Quadrotor

Setup  In this experiment, the DVS is mounted on top of the front-facing CMOS camera of a Parrot AR.Drone 2.0. The DVS including the lens and the housing weighs 122 g, which is too heavy for this drone. Therefore, the housing was removed and an S-type lens mount was fixed onto the DVS circuit board, resulting in a weight of 23 g.
An Odroid U2 computer is mounted on the drone to record the DVS data. It also sends the data to a laptop computer via Wi-Fi, which allows viewing the DVS output in real time and thereby making sure that the square is in sight of the camera.
A black square of 0.9 m × 0.9 m attached to a white wall serves as a landmark. Ground truth is provided by an OptiTrack motion capture system; to this end, reflective markers were attached to the quadrotor. The modified quadrotor is shown in Figure 4.8.

In order to demonstrate the performance of our system, 25 flips about the optical axis (z-axis) of the camera are performed, as shown in Figure 4.9.


Figure 4.8: Image of the modified AR.drone: 1) DVS mounted on top of the standard CMOS camera; 2) Odroid onboard computer; 3) reflective markers for the motion tracking system.

The drone is remote controlled during the whole flight using the AR.drone's standard smartphone application, in which flips can be triggered easily by selecting this maneuver. During a flip, the drone rises approximately 50 cm and then falls back to its initial height. The angular velocity during these flips reaches peak values of 1200° s−1. The distance to the wall was chosen such that the square is always in the field of view of the DVS, ranging from 0.75 m to 2 m.

Figure 4.9: AR.drone performing a flip; the black square can be seen in the background. As in the simulation, it measures 0.9 m × 0.9 m.

4.4 Results

4.4.1 Trajectory Simulation

Conic Helical Trajectory  The estimated trajectory and the ground truth for the helical trajectory are shown in Figure 4.10. The position is expressed as the camera translation vector T in camera coordinates, and the orientation is expressed as the Euler angles of the rotation matrix R that transforms world coordinates to camera coordinates such that

$$ X_{\mathrm{cam}} = \begin{pmatrix} R & T \\ \mathbf{0} & 1 \end{pmatrix} X_W, \qquad (4.6) $$


[Figure 4.10: plots of x, y, z (in m) and yaw, pitch, roll (in deg) over time, for (a) the estimate with ground truth and (b) the estimation error.]

Figure 4.10: Pose estimation for a helical trajectory; (a) estimation of the camera translation vector T and the Euler angles (blue) including ground truth (red); (b) error of the estimation; 8 events were stored per line.

where X_W ∈ R^4 are homogeneous world coordinates and X_cam ∈ R^4 are homogeneous camera coordinates. The error of the position estimate ∆T is described as the Euclidean distance between the estimated camera translation vector and the ground truth:

$$ \Delta T = \| \hat{T} - T \|, \qquad (4.7) $$

where T̂ is the estimated and T the ground truth translation vector. To express the error of the orientation estimate, the relative rotation between the estimated and the ground truth pose is expressed in the axis-angle representation.


The angle of the axis-angle representation is then used as a measure for the orientation error:

$$ \Delta\alpha = \arccos\!\left( \frac{\operatorname{trace}(\Delta R) - 1}{2} \right) \quad \text{with} \quad \Delta R = \hat{R} R^{-1}, \qquad (4.8) $$

where R̂ is the estimated rotation matrix and R the ground truth.
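These error measures can be computed directly from the estimated and ground-truth poses (a short Python sketch of Eqs. (4.7) and (4.8)):

```python
import numpy as np

def position_error(T_est, T_gt):
    """Euclidean distance between estimated and ground-truth translation vectors, Eq. (4.7)."""
    return np.linalg.norm(T_est - T_gt)

def orientation_error(R_est, R_gt):
    """Angle of the relative rotation in axis-angle representation, Eq. (4.8), in degrees."""
    dR = R_est @ R_gt.T                                  # R_gt is orthogonal, so R_gt^-1 = R_gt^T
    cos_angle = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```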


Figure 4.11: Error of the pose estimation for a helical trajectory; (a) norm of the distance between the estimated and ground truth camera translation vectors, ∆T; (b) angle between the estimated and the ground truth pose in axis-angle representation, ∆α.

Figure 4.11 shows the error of the estimation, expressed as stated above. The mean position error for the first second of the simulation is 1.85 cm with a standard deviation of 0.63 cm. The mean angular error for this period is 9.15° with a standard deviation of 0.96°. The low standard deviation compared to the high mean error suggests that this error is due to a lag or a bias. The plots indicate a lag on the order of 50 ms when looking at the pitch and roll angles. It is, however, not clear whether this lag is introduced by the tracker or whether it is due to a temporal misalignment of the ground truth and the measured data. The temporal alignment is based on the event density in the DVS event stream: the ground truth starts when the number of events exceeds 500 events s−1, averaged over 5 events.
It can be seen that the error of the estimation grows as the camera moves farther from the square. In particular, the errors in z direction (optical axis) and in pitch and roll increase rapidly towards the end of the simulation, as shown in Figure 4.10. When the camera is approximately 2.5 m away from the square, tracking is lost. This is caused by the low number of events that are emitted per segment. Figure 4.12b shows the number of events emitted during the simulation. The event rate decreases from 14 kHz to 4 kHz as the distance between the camera and the square increases from 1 m to 3 m. It can also be seen that, due to the decreasing quality of the estimation,


fewer events are attributed to a line segment towards the end of the simulation. At this distance, the shortest side of the square measures only 30 pixel and the apparent motion is much smaller. This increases the number of old events in the buffers of the segments. The quality of the estimation is further decreased since the events are close to each other rather than spread over the entire image plane. The low number of events also increases the influence of sensor noise.

[Figure 4.12: event rate in kHz over time; (a) coarse view over the whole simulation (time in s), (b) zoom (time in ms).]

Figure 4.12: Rate of emitted events during the simulation (blue) and pose updates (red), sampled at 10 Hz (a) and 2 kHz (b). The pose update rate is equal to the number of events attributed to the square per millisecond. (a) The number of events decreases as the camera moves farther from the square; as the estimation error increases, fewer events are attributed to the square. (b) In the zoom, the screen update frequency of 16 Hz can be seen as peaks; the width of the peaks corresponds to the temporal noise.


Circular Trajectory  Figure 4.13 shows the pose error as a function of the number of events stored per line (buffer size) for a circular trajectory of the DVS around the square. As for the helical trajectory, it can be seen that the error is higher if the DVS is farther from the square. For more than 30 events per segment, tracking is lost in the case of the farther trajectory (T_z = 2.5 m); 30 events per segment corresponds to nearly 1 event per pixel at this distance. As described in Section 4.2.2, a newly arriving event that is close to an event in the buffer replaces this event rather than the oldest one. Hence, once a stored event is far from the line, its chance of being replaced is small, and the more events are in the buffer, the smaller this chance becomes. Therefore, the estimation suffers from lag, and new events may eventually no longer be attributed to line segments due to the bad estimate.
The position error (Euclidean distance between the camera translation vector T and the corresponding ground truth vector) decreases with increasing buffer size. This behavior is expected, since more points are available for the optimization with large buffers. Furthermore, the more events are used to estimate the square, the lower is the influence of an event that does not originate from the line segment. In the case where the camera is farther away from the square, the error stagnates between 15 and 25 events and increases slightly for larger buffers. The angular error is nearly constant in the case where the camera is close to the square. In the other case, however, it increases for buffers larger than 6 events, with a local minimum at 13 and 14 events, presumably for the reason stated above.

[Figure 4.13: mean position error in m (left) and mean angular error in degrees (right) as a function of the number of events per line segment.]

Figure 4.13: Error of the camera translation vector and of the camera orientation in axis-angle representation for a circular trajectory 1 m (red) and 2.5 m (blue) above the square, as a function of the number of events per line segment. Tracking was lost with more than 30 events in the case of the farther trajectory.

Figure 4.14 shows the time needed for the pose estimation of the whole trajectory. It can be seen that the computation time decreases with increasing buffer size. This might appear counter-intuitive. However, when considering the number


of iterations per pose estimation shown in Figure 4.14, it can be seen that the more points are considered, the faster the minimization converges and thus the faster the algorithm is.

[Figure 4.14: left panel: mean calculation time [s]; right panel: mean calc. time per update; both as a function of the number of events per line segment.]

Figure 4.14: Time required to process the whole tracking of the circular trajectory (left) and the average number of iterations to find the optimal pose (right) as a function of the buffer size (number of events stored per line).

    4.4.2 DVS on Quadrotor

We let the quadrotor, equipped with a front-looking DVS, perform 25 flips. In only one case was the tracking lost during the flip, which corresponds to a success rate of 96%. Figure 4.15 shows the estimated pose and the ground truth (OptiTrack) for three consecutive flips. Note that the values are given with respect to the world coordinate system, with the origin in the center of the square, the x-axis pointing into the wall, the y-axis pointing to the left, and the z-axis pointing upwards. It can be seen that the error decreases from flip to flip. This is due to the forward movement of the drone between the flips (see the x-axis): with every flip, the drone is closer to the wall and hence the square becomes larger in the image frame. As in the simulation, the pose estimation is more precise if the square is larger in the image. In the last of the three flips, the error in the x- and y-directions is smaller than 10 cm, and in the z-direction it is smaller than 20 cm. The angular error remains smaller than 8°.

Noise  The observed noise is caused by several effects. One factor is the noise of the sensor itself. While the algorithm is robust against spontaneously emitted events, it suffers considerably from temporal noise, i.e., the difference in delay between different pixels. This causes the line to jitter, since delayed events draw the line back to a previous position. The effect is intensified by the fact that pixels emit several events when passing a gradient: although the sensor was configured to have a low pixel firing rate and a high threshold (only strong illumination changes produce events), several events per pixel were emitted during one transition. If a pixel fires two events, the second event arrives with a considerable delay.


[Figure 4.15: plots of x, y, z (in m) and yaw, pitch, roll (in deg) over time for the estimate, ground truth, and error.]

Figure 4.15: Estimated trajectory (red) with ground truth (blue) and errors (black) for three consecutive flips with a quadrotor.


Another source of old events in the event buffer of a segment is the method used to prevent events from clustering, described in Section 4.2.2. Furthermore, events tend to cluster on one end of the segment when it is rotated, despite the use of this method. The low resolution of the sensor is also a limiting factor for the estimation.

Considering the images of the drone's standard CMOS camera in Figure 4.16, it can be seen that the motion blur at these angular rates poses a severe problem for conventional optical pose estimation methods.

Figure 4.16: Output of the standard CMOS camera on the quadrotor during the flip; the images are corrupted by strong motion blur.


    Chapter 5

    Conclusion

In this work, we presented, to the best of our knowledge, the first intrinsic camera calibration tool for the DVS. It is convenient and easy to use, requiring only a computer screen. Using the flickering backlighting of a dimmed LED-backlit computer screen, this tool allows estimating the intrinsic camera parameters, including the distortion coefficient. Our method is limited by the direction of the light emitted from the screen in combination with the light passing through the lens to the DVS sensor. Since computer screens are designed to emit light only towards the front, the pattern cannot be detected when the camera is strongly tilted with respect to the screen. Fo