Robotic Home Sentinel
David Cruz Veiga Álvares Pereira
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Professor José António da Cruz Pinto Gaspar
Co-Supervisor: Doctor Ricardo Jorge Santos Ferreira
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor José António da Cruz Pinto Gaspar
Member of the Committee: Professor Rodrigo Martins de Matos Ventura
April 2014
Contents

Agradecimentos
Resumo
Abstract
1 Introduction
1.1 Related Work
1.2 Problem Formulation
1.3 Proposed Solution
1.4 Thesis Structure and Contributions
1.5 Acknowledgments
2 Background
2.1 Camera Model
2.1.1 Perspective Projection
2.1.2 Extrinsic Parameters
2.1.3 Intrinsic Parameters
2.1.4 Depth Image to 3D Points
2.1.5 Back-projection
2.2 Integration of measurements
2.2.1 The Discrete Kalman Filter Algorithm
2.2.2 Extended Kalman Filter
2.2.3 Other Types of Filters
2.3 Application to the thesis context
3 Navigation using Color-Depth Cameras
3.1 Occupancy Grids
3.1.1 Motion Estimation
3.1.2 Map Building
3.1.3 Loop Closure
3.2 Motion Planning
3.2.1 Potential Fields
3.2.2 Randomized Methods for Path Planning
3.2.3 Probabilistic Roadmaps
3.2.4 Rapidly Exploring Random Trees
3.2.5 Comparison of Trajectory Planning Methods
4 Navigation based on Monocular Vision
4.1 Depth Information Not Directly Available
4.2 Iterative Estimation of Tubular Structure and Camera Motion
4.2.1 State Vector
4.2.2 System Dynamics
4.2.3 Observation Model
4.3 Dewarping
4.3.1 Dewarping a VRML Tubular Shape
4.3.2 Dewarping a Real Tubular Shape
4.4 Rectangular Section for a Corridor
4.4.1 Results obtained for the synthetic case
4.4.2 Conclusion
5 Experiments
5.1 The Experimental Setup
5.2 Single Image Global Localization
5.3 Motion Estimation
5.4 Occupancy Grid
5.5 Patrol #1 - Run with no obstacles
5.6 Patrol #2 - Unknown blocked passage
5.7 Mapping and Navigation with Monocular Vision
6 Conclusion and Future Work
Agradecimentos
I would like to thank my supervisor, Professor José António Gaspar, and my co-supervisor, Doctor Ricardo Ferreira, for all the availability and support provided throughout this dissertation.
Finishing the degree in Electrical and Computer Engineering would not have been possible without the ever-present support of my family and friends. To all of them, many thanks for having influenced me in some way.
Resumo
This dissertation addresses the problem of mapping and navigating a mobile robot in indoor environments using artificial vision. The first sensor considered is an RGBD camera. The depth information makes it possible to lift to 3D the characteristic points selected in the image ("SIFT features"). A set of three or more 3D points, matched between consecutive images, allows estimating the motion of a camera as a rigid transformation (Procrustes problem). Besides motion estimation, the depth images allow computing the free space in front of the robot. From several free-space readings an occupancy grid can be built, which then allows finding trajectories for navigation.
As an alternative to an RGBD camera, we also consider the use of an RGB/grey camera. The combination of several images acquired over time, using a simultaneous localization and mapping (SLAM) methodology, allows recovering 3D information. In particular, we consider a geometric model for the scene, assuming a tubular structure with a rectangular section (corridor). Back-projection maps 2D points in the image plane to 3D in the camera frame, assuming the calibration is known. Over the map built, as in the previous case, a path planner is applied, yielding in the end a path from the current position to the desired one.
Both navigation methodologies were tested in virtual-reality environments. The results obtained, compared with the simulation data (ground truth), are promising.
Keywords: Camera Model, EKF, SLAM, Path Planner, Rigid Transformation.
Abstract
In this dissertation we approach the problem of mapping and navigating a mobile robot in indoor environments using artificial vision. A color-depth camera is the first sensor considered. Using the depth information it is possible to infer the 3D location of selected points in the image (visual features). A set of at least three 3D points matched between consecutive images allows estimating the motion of a camera as a rigid transformation (Procrustes problem). In addition to motion estimation, depth images also allow observing the free space in front of the robot. When the robot moves, the free-space information can be integrated into an occupancy grid. The occupancy grid is then used to find paths for navigation.
As an alternative to an RGBD camera, we also consider the use of a simple RGB camera. The combination of several images obtained over time, using a SLAM methodology, allows recovering 3D information. In particular, we consider a geometric model for the scene assuming a tubular structure with a rectangular section (corridor). Back-projection maps 2D points in the image plane to 3D in the camera frame, assuming the calibration is known. Over the map built, a path planner is applied, resulting in a trajectory from the current position to the desired one.
Experiments were conducted involving both methodologies in virtual-reality environments. Comparison with ground truth shows promising results.
Keywords: Camera Model, EKF, SLAM, Path Planner, Rigid Transformation.
List of Figures

1.1 Typical sentinel, visiting places 1-8.
1.2 Basic Diagram of the proposed solution.
2.1 Perspective Projection
2.2 Frontal Model
2.3 The discrete Kalman filter cycle.
3.1 Diagram detailing the steps for the navigation.
3.2 Various steps of the occupancy grid mapping process
3.3 Profiles for the estimated motion of the camera (translation and orientation).
3.4 Example of a Potential Field. Image taken from [24]
3.5 Example of a Roadmap. Image taken from [20]
3.6 Example of an expanding tree. Image taken from [15]
3.7 The RRT method
3.8 The RRT-Connect method
3.9 (a) Map 1 (b) Map 2 (c) Map 3 (d) Map 4. The red circle is the starting point and the green star is the goal point. All the maps solved with different methods: the second row is the Potential Fields method, the third row the Probabilistic Roadmaps, and the bottom row the Rapidly exploring Random Trees.
4.1 Basic steps in mapping and motion estimation for the Color-depth vs Visual SLAM.
4.2 Tubular structure used for testing the system behavior.
4.3 Creating a mosaic with the interior texture of the tube.
4.4 (a) 3D Coordinates of pixels in the image inside the tube. (b) Relation between the angle of the coordinate in the closed image and the vertical pixel in the mosaic. (c) Pixels in the mosaic.
4.5 (a) VRML setup. (b) View of the camera inside the tube. (c) Dewarping of the tube for a single step. (d) Result of dewarping several steps.
4.6 (a) Real setup. (b) Inside view of the tube. (c) Dewarping of the image for one step. (d) Result of dewarping several steps.
4.7 Back-projection Model for Rectangular Section.
4.8 Results for the synthetic corridor.
5.1 VRML model used for visualization.
5.2 Various consecutive steps of the POV of the robot inside the virtual world.
5.3 Map with robot ground truth locations (green crosses) and estimated locations (red circles) based on query and database images (a). The query images for the various steps (5, 81, 100 and 31) are on the left and the retrieved matching images on the right (b, c, d, e).
5.4 Pre-determined path used by the robot to go through the environment and map it.
5.5 Color-depth camera used for motion estimation.
5.6 All the estimated positions for the camera, resulting from the motion estimation.
5.7 All the steps to create a map used for navigation.
5.8 POV of the robot when there is an obstacle (5.8b) and when there is none (5.8a).
5.9 All the missions the sentinel function created. One for each goal point.
5.10 All the missions the sentinel function created. There is one extra mission because of the obstacle present in the environment.
5.11 Results for one corridor.
5.12 Map representation for one and two corridors.
5.13 Result for the VSLAM experiments.
List of Tables

3.1 Average values for 10 runs
Chapter 1
Introduction
Video surveillance is continuously increasing its place within human environments. A wide range of application scenarios is currently feasible because of the level of maturity reached by the main enabling technologies. In particular, the development of robots capable of working as extensions of human operators in the field is envisaged, namely doing patrols, recognizing and tracking intruders, and collaborating with other robots or even with humans via tele-operation and videoconference.
Applications include monitoring and operations in hazardous or remote environments (ocean, space, contaminated areas, areas destroyed by natural disasters, etc.), in industrial or civil engineering structures (pipeline monitoring and surveillance, bridges, dams), and in services (buildings, public areas, traffic monitoring and surveillance). Surveillance robots must have high levels of autonomy. This implies sensing and moving within the environment, working for long periods of time without human intervention, and avoiding situations harmful to people, pets, property, or the robot itself.
The consumer market is growing in size and in acceptance of this type of technology in the home. A robotic home sentinel would tour the house on demand, when the owner is away on vacation or at work, and take pictures at pre-specified locations. The idea is illustrated in Figure 1.1, where a Wowwee Rovio [27] is used to visit a list of locations.
Figure 1.1: Typical sentinel, visiting places 1-8.
Recent vision sensors, namely the Microsoft Kinect [25], allow effectively measuring the empty space in front of a robot. The Kinect depth sensor is a structured-light 3D scanner. Its infrared light source projects light patterns onto the space ahead. The reflected light is received by an infrared sensor and processed to extract geometric information about object surfaces, from which the distance to the camera's image plane is computed. The Kinect captures both depth and color images simultaneously, at a frame rate of up to 30 fps. Compared with traditional laser scanners, it is a much cheaper alternative. The Kinect has thus attracted the attention of researchers and developers, and is used in indoor mapping, surveillance, robot navigation and forensics applications, where accuracy requirements are less strict. This MSc thesis is therefore concerned with using Kinect-like sensors for freely navigating in a cluttered home environment while reporting pre-specified scene views.
Alternatively, we also explore the use of simpler conventional color cameras. Depth information has in this case to be inferred in order to navigate the robot safely. It is an objective of this thesis to understand and assess the advantages and disadvantages of color vs color-and-depth cameras.
1.1 Related Work
As mentioned before, sensing and moving in an unknown cluttered environment is nowadays facilitated by the recent introduction in the market of the MS Kinect camera. The Kinect camera provides not only visual information of the scenario, but also depth information. Some recent research works show that, by choosing the proper scene representations, it is possible to do data acquisition, processing and integration with a standard PC. One such reference is the work by Wurm et al [28], where a mobile robot maps a number of campuses. Another reference is the work by Shen et al [19], where a flying robot (quadrotor) maps a multi-floor scenario using a Kinect camera.
Considering ground-plane wheeled robots, the depth information provided by the camera yields in particular a horizontal stripe in front of the robot, and therefore indicates occupancy. This information is usually integrated into occupancy grids. The principal idea of the occupancy grid [9] is to represent a map of the environment as an evenly spaced field of binary random variables, each representing the presence or absence of an obstacle at that location in the environment. Occupancy grid algorithms compute approximate posterior estimates for these random variables.
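The posterior over each cell is typically maintained in log-odds form, so that integrating a new observation costs a single addition per cell. The sketch below is a minimal illustration of this update, not code from [9]; the inverse-sensor-model probabilities (0.7/0.3) and function names are assumptions made for the example.

```python
import numpy as np

# Illustrative inverse sensor model: log-odds increments for a cell that a
# range measurement reports as occupied (hit) or as traversed free space.
L_OCC = np.log(0.7 / 0.3)
L_FREE = np.log(0.3 / 0.7)

def update_cell(log_odds, hit):
    """One recursive Bayesian update of a single grid cell (log-odds form)."""
    return log_odds + (L_OCC if hit else L_FREE)

def occupancy_probability(log_odds):
    """Recover the posterior occupancy probability from the log-odds value."""
    return 1.0 - 1.0 / (1.0 + np.exp(log_odds))

# Starting from the uninformed prior p = 0.5 (log-odds 0), repeated
# "occupied" observations drive the posterior toward 1.
cell = 0.0
for _ in range(3):
    cell = update_cell(cell, hit=True)
```

Because the update is additive, an occupied and a free observation of equal strength cancel exactly, which is what makes the representation cheap to maintain over long runs.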
Given the occupancy grid, one still has to find trajectories that can be followed by the robot. The Robotics Toolbox [6] is a software package developed by Peter Corke that addresses several of the topics discussed in this work, in particular trajectory planning.
Considering conventional color cameras, depth information is not immediately available. Inferring depth can be done using tools first derived for range sensing. The Simultaneous Localization and Mapping (SLAM) methodology, first introduced by Randall Smith and Peter Cheeseman [21], and coined by J. Leonard and H.F. Durrant-Whyte [8], addresses the challenging problem of simultaneously predicting the location of an autonomous unmanned vehicle and mapping its surrounding environment, using the on-board sensors for data gathering and feeding the observation and motion models at every instant.
Extensive work has already been done in terms of monocular SLAM. In [7] a real-time algorithm is presented which can recover the 3D trajectory of a monocular camera moving rapidly through a previously unknown scene. The core of the approach is the online creation of a sparse but persistent map of natural landmarks within a probabilistic framework. In [4] a new parametrization is presented for point features within monocular simultaneous localization and mapping that permits efficient and accurate representation of uncertainty during undelayed initialization and beyond, all within the standard extended Kalman filter (EKF). For camera motion estimation an important reference is [29], where a kinematic model-based approach for the estimation of 3D motion parameters from a sequence of noisy stereo images is discussed. The approach is based on representing the constant-acceleration translational motion and constant-precession rotational motion in the form of a bilinear state-space model, using standard rectilinear states for translation and quaternions for rotation. Other references are [2] and [1], where the problem considered involves the use of a sequence of noisy monocular images of a three-dimensional moving object to estimate both its structure and kinematics. The object is assumed to be rigid, and its motion is assumed to be smooth. A set of object match points is assumed to be available, consisting of fixed features on the object, the image-plane coordinates of which have been extracted from successive images in the sequence.
This work started as the continuation of the work developed by João Tomaz [23] in his MSc thesis, which followed a SLAM approach. In his work he estimated the motion of a camera inside a tubular structure. He reconstructed a tubular shape by selecting and registering distinctive visual features along it. For that he also assumed a priori knowledge of a tubular model for the environment structure.
1.2 Problem Formulation
Using the available sensor, we must be able to characterize the surroundings at every step. The ultimate goal is to map the environment and plan the path the robot takes to reach its goal, first using a Kinect-like camera (RGB+D) and then a normal (RGB) camera. The robot uses only visual odometry and, during navigation, has to detect and bypass obstacles that may appear in its way.
1.3 Proposed Solution
The proposed solution is based on finding and tracking visual (SIFT) features while the robot moves. The depth information transforms the 2D points into 3D points, which can be used directly to estimate the camera motion (a rigid transformation, rotation R and translation t) between consecutive images.
Using the camera motion (R, t) and the depth information (a horizontal scan line from the depth image) allows building an occupancy grid map. Once the map of the environment is available, one computes the trajectory between start and goal points using a path planner such as Potential Fields, PRM or RRT [6]. If an obstacle interferes with the chosen path, the robot stops and computes a new path considering the obstacle found. When it reaches the goal, it starts planning the path to the next goal, if there is one. Figure 1.2 summarizes the proposed solution.
Figure 1.2: Basic Diagram of the proposed solution (mapping: occupancy grids and favorite locations; sentinel: navigate within the mapped areas).
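The motion-estimation step described above (three or more matched 3D points, rigid transformation, Procrustes problem) admits a closed-form SVD solution. The sketch below is a standard Kabsch/Procrustes solver in NumPy, given only as an illustration of the technique; it is not the thesis implementation, and the function name is ours.

```python
import numpy as np

def estimate_rigid_transform(P, Q):
    """Estimate R, t such that Q ~ R @ P + t from matched 3D points.

    P, Q: (3, N) arrays of corresponding points, N >= 3 and not collinear.
    Solves the orthogonal Procrustes problem in closed form via SVD.
    """
    p_mean = P.mean(axis=1, keepdims=True)
    q_mean = Q.mean(axis=1, keepdims=True)
    # Cross-covariance of the centered point sets.
    H = (Q - q_mean) @ (P - p_mean).T
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the recovered rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = q_mean - R @ p_mean
    return R, t
```

In practice the SIFT matches contain outliers, so a solver like this is usually wrapped in a robust loop (e.g. RANSAC) before the estimate is trusted.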
Alternatively, we also want to assess the use of color cameras. In order to infer depth information, we consider a SLAM methodology. Constraints on the scene structure should be explored in the design of improved filtering of the raw data obtained by registering features (SIFT) along image sequences.
1.4 Thesis Structure and Contributions
Chapter 1 introduces the problem approached in the dissertation and, in particular, presents a short discussion of the state of the art. Chapter 2 presents the theoretical background on the camera model, perspective projection and Kalman filtering. Chapter 3 describes how mapping and navigation using a color-depth camera is done. Chapter 4 describes how mapping and navigation using a monocular camera is considered. Chapter 5 shows the results of experiments done to validate the ideas presented. Chapter 6 summarizes the work performed and highlights the main achievements of this work. Moreover, that chapter proposes further work to extend the activities described in this document.
The main contribution of this thesis is the presentation of a new approach to the vSLAM methodology under the assumption that the camera navigates within a tubular-shaped structure with a rectangular section, simulating a corridor-like environment.
The work described hereafter was partially published in [17].
1.5 Acknowledgments
This work was developed under the project POETICON++ - Robots Need Language: A Computational Mechanism for Generalization and Generation of New Behaviors in Robots, in which VisLab is involved.
Chapter 2
Background
This chapter provides an introduction to the theoretical background concepts that will be used in the thesis. Two main aspects are introduced here: the filtering of noisy data based on Kalman filtering, and the camera model that gives structure to the measurement equations of the filtering process. In the following, we first detail the camera model and then the filtering.
2.1 Camera Model
The principles of perspective were discovered during the Italian Renaissance by da Vinci and other architects and painters, who found out how we see the 3D world. Cameras project 3D points onto the image plane. This transformation is not invertible unless we have additional information about the point's location in space (for instance, its depth). Cameras perform a projection of points in space onto the image plane, described by the perspective projection

\[ m = \Pi M \tag{2.1} \]

where M and m are the homogeneous coordinates of the points in space and in the image plane, and Π is a 3 × 4 matrix denoted as the camera matrix.
2.1.1 Perspective Projection
Cameras concentrate electromagnetic radiation (light rays) on the surface of their sensors using lenses. This whole process can, however, be approximated by the pinhole model [12], which reduces the lenses to a small hole through which all arriving light rays pass, called the Optical Center (O). Each point P in 3D space is projected onto a point p obtained by intersecting the straight line PO with the image plane. The optical axis of the camera is the straight line orthogonal to the image plane that contains the optical center; their intersection is called the principal point of the image.
In Figure 2.1 we can see two frames: one for the 3D points, centered in the Optical Center, and another for the 2D points (pixels), centered in the principal point of the image.

Figure 2.1: Perspective Projection

The coordinates of point P are (X, Y, Z) and those of its projected point p are (x, y). Using similar triangles we get

\[ x = -f\,\frac{X}{Z}, \qquad y = -f\,\frac{Y}{Z} \tag{2.2} \]
where f is the focal length of the camera. All these variables are measured in meters. For simplification purposes we can use the frontal model of the pinhole camera, shown in Figure 2.2, where

Figure 2.2: Frontal Model

\[ x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z} \tag{2.3} \]
All these equations become simpler if we represent the two points P, p using homogeneous coordinates:

\[ P = \alpha \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \qquad p = \beta \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{2.4} \]
where α and β are arbitrary constants. The normalized camera model (f = 1) in homogeneous coordinates is written as

\[ Z \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.5} \]
2.1.2 Extrinsic Parameters
The coordinates of a point in space are typically not known relative to the camera's optical center, since the optical center is hidden inside the camera. The coordinates X0 of a point in another frame, with arbitrary location and orientation, are related to the camera coordinates by a rigid-body transformation (rotation + translation):

\[ X = R X_0 + T \tag{2.6} \]

where R (3×3) is a rotation matrix and T (3×1) is a translation vector. These parameters (R, T) are called the extrinsic parameters of the camera, since they depend on the camera pose. The transformation can be written using homogeneous coordinates:

\[ \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix} \tag{2.7} \]
2.1.3 Intrinsic Parameters
The coordinates of a point in an image are measured in pixels, and the image frame is usually not centered at the principal point of the camera. The relation between the homogeneous coordinates of image points in meters and in pixels is given by

\[ K = \begin{bmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad x' = K x \tag{2.8} \]

where f is the focal length, s_x and s_y are the conversion factors from meters to pixels along the respective axes, o_x, o_y are the coordinates of the principal point in the new frame, and s_θ is a skew factor. These parameters are denoted the intrinsic parameters of the camera, and they are independent of the camera pose. The matrix K (3×3) is called the matrix of intrinsic parameters. The camera model with intrinsic and extrinsic parameters is given by

\[ \lambda x' = K [R \; T] X_0 \tag{2.9} \]
or

\[ \lambda x' = \Pi X_0 \tag{2.10} \]

with Π = K[R T]. The camera matrix Π can be written as

\[ \Pi = \begin{bmatrix} \pi_1^T \\ \pi_2^T \\ \pi_3^T \end{bmatrix} \tag{2.11} \]

where π_i^T is the i-th row of Π. Using λ = π_3^T X_0 we get

\[ x' = \frac{\pi_1^T X_0}{\pi_3^T X_0}, \qquad y' = \frac{\pi_2^T X_0}{\pi_3^T X_0} \tag{2.12} \]
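As a numerical sanity check of equations 2.9-2.12, the sketch below (NumPy; the function name and example intrinsics are ours, not from the thesis) forms Π = K[R T] and projects a world point to pixel coordinates:

```python
import numpy as np

def project_point(K, R, T, X0):
    """Project world point X0 to pixels via lambda x' = K [R T] X0 (eq. 2.9)."""
    Pi = K @ np.hstack([R, T.reshape(3, 1)])   # 3x4 camera matrix
    x_h = Pi @ np.append(X0, 1.0)              # homogeneous image point
    return x_h[:2] / x_h[2]                    # divide by lambda = pi3^T X0 (eq. 2.12)

# With the camera at the world origin, a point on the optical axis must
# project to the principal point (o_x, o_y).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = project_point(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 2.0]))
# uv -> (320.0, 240.0)
```

The division by λ in the last line of the function is exactly the ratio of rows of Π in equation 2.12.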
2.1.4 Depth Image to 3D Points
Cameras project points in 3D onto the image plane. The spatial information of the points is lost in this projection unless some additional information is available. Depth cameras give that additional information, making it possible to locate in space the position of every pixel in the image. The combination of the depth value with the camera intrinsics and extrinsics can be used to reconstruct a point cloud of the scene. The camera represents Cartesian (X, Y, Z) points as a depth map Z(u, v), where (u, v) are image (perspective projection) coordinates. Equation 2.13 relates the 3D position of a point with its projection in the image plane:
\[ \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha_u & 0 & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} \tag{2.13} \]
which, combining the first two matrices on the right-hand side, is the same as

\[ \begin{bmatrix} \lambda u \\ \lambda v \\ \lambda \end{bmatrix} = \begin{bmatrix} \alpha_u f & 0 & u_0 & 0 \\ 0 & \alpha_v f & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} \tag{2.14} \]
If we set λ = d, where d = Z(u, v) is the depth value for the pixel at image coordinates (u, v):

\[ d u = \alpha_u f X + u_0 Z, \qquad d v = \alpha_v f Y + v_0 Z, \qquad d = Z \tag{2.15} \]
Solving for X, Y and Z, we arrive at

\[ X(u,v) = \frac{u - u_0}{\alpha_u f}\, Z(u,v), \qquad Y(u,v) = \frac{v - v_0}{\alpha_v f}\, Z(u,v), \qquad Z(u,v) = d(u,v) \tag{2.16} \]

which gives the 3D coordinates of the pixel at (u, v).
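Equation 2.16 vectorizes directly over a whole depth image. The sketch below (NumPy) folds α_u·f and α_v·f into single pixel focal lengths f_x, f_y; the function name and parameterization are ours, given only as an illustration:

```python
import numpy as np

def depth_to_points(depth, fx, fy, u0, v0):
    """Per-pixel 3D points (camera frame) from a depth image, per eq. 2.16.

    depth: (H, W) array of Z values; fx = alpha_u * f and fy = alpha_v * f are
    the focal lengths in pixels; (u0, v0) is the principal point.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - u0) / fx * depth
    Y = (v - v0) / fy * depth
    return np.stack([X, Y, depth], axis=-1)   # shape (H, W, 3)
```

A horizontal stripe of the resulting point cloud is exactly the kind of free-space reading that is integrated into the occupancy grid in Chapter 3.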
2.1.5 Back-projection
The 3-D ray that starts at the camera optical center C and pierces the image plane at the projected point eventually reaches the 3-D point. The camera center C is defined by the projection matrix P through PC = 0. If we partition P as P = [p_{123} | p_4], where p_{123} is the 3 × 3 matrix formed by the first three columns of P and p_4 is its last column, we can compute the camera center as

\[ PC = 0 \;\Leftrightarrow\; P \begin{bmatrix} C \\ 1 \end{bmatrix} = 0 \;\Leftrightarrow\; p_{123} C + p_4 = 0 \;\Leftrightarrow\; C = -p_{123}^{-1} p_4. \tag{2.17} \]

D is the back-projection of the 2-D point in 3-D coordinates, defined by the ray leaving the camera center and intersecting the plane at infinity, and its representation becomes

\[ x = PD \;\Leftrightarrow\; P \begin{bmatrix} D \\ 0 \end{bmatrix} = x \;\Leftrightarrow\; p_{123} D = x \;\Leftrightarrow\; D = p_{123}^{-1} x \tag{2.18} \]

where x is in homogeneous coordinates. C and D determine a line in 3-D space, whose points are given by

\[ X = C + \alpha D \tag{2.19} \]

where α ∈ [−∞, +∞] is a scaling factor. The full expression for the back-projection of a point x into X is

\[ X = -p_{123}^{-1} p_4 + \alpha\, p_{123}^{-1} x. \tag{2.20} \]
2.2 Integration of measurements
A camera provides many measurements at each time stamp. In this thesis, color information allows registering features of the scenario, while the geometric location of the features allows inferring pose properties of the camera. Camera data is naturally corrupted by noise, both radiometric and geometric. Kalman filtering is a methodology for integrating camera measurements over time, such that the random nature of the noise is progressively attenuated.
In 1960, R.E. Kalman published his famous paper [13] proposing a recursive solution to the discrete-data linear filtering problem. The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) solution of the least-squares method. The filter estimates optimal states over time, given observations, in the presence of noise, of a dynamic system driven by noise inputs. This estimation is computed recursively as a Markov chain model, i.e. it only requires the previous estimate k − 1 of a given state at instant k. The optimality comes from minimizing the estimated error covariance. The general problem is thus to estimate the state x ∈ ℜ^n of a discrete-time controlled process governed by the linear stochastic difference equation

\[ x_{k+1} = A_k x_k + B u_k + w_k, \tag{2.21} \]

with a sensor measurement z ∈ ℜ^m given by

\[ z_k = H_k x_k + v_k. \tag{2.22} \]

The random variables w_k and v_k represent the process and measurement noise, respectively. They are assumed to be independent of each other, white, and with normal probability distributions

\[ p(w) \sim \mathcal{N}(0, Q), \qquad p(v) \sim \mathcal{N}(0, R). \tag{2.23} \]

The n × n matrix A in equation 2.21 relates the state at instant k to the state at instant k + 1, in the absence of either an input or process noise. The n × l matrix B relates the control input u ∈ ℜ^l to the state x. The m × n matrix H in the measurement equation 2.22 associates the state to the measurement z_k.
2.2.1 The Discrete Kalman Filter Algorithm
The Kalman filter estimates a process by using a form of feedback control: it estimates the process state
at some time and then obtains feedback in the form of noisy measurements. The filter equations therefore
fall into two groups: time update and measurement update equations. The time update equations are
responsible for projecting the current state and error covariance estimates forward in time to obtain the a
priori estimates for the next time step. The measurement update equations are in charge of incorporating the new
measurements into the a priori estimates to obtain an improved a posteriori estimate. Figure 2.3 shows
the basic operation of the filter.
The specific equations for the prediction step in the discrete Kalman filter are
$$\bar{x}_{k+1} = A_k x_k + B u_k \quad (2.24)$$

$$\bar{P}_{k+1} = A_k P_k A_k^T + Q_k \quad (2.25)$$
Figure 2.3: The discrete Kalman filter cycle.
and for the update step are
$$K_k = \bar{P}_k H_k^T \left(H_k \bar{P}_k H_k^T + R_k\right)^{-1} \quad (2.26)$$

$$x_k = \bar{x}_k + K_k \left(z_k - H_k \bar{x}_k\right) \quad (2.27)$$

$$P_k = \left(I - K_k H_k\right) \bar{P}_k \quad (2.28)$$
In the time update equations 2.24 and 2.25 it is clear that the state and covariance estimates are
projected from instant $k$ to $k+1$. The bar over a symbol means it is an a priori estimate. The
measurement update starts by computing the Kalman gain $K_k$ in 2.26, then uses the current
measurement to generate an a posteriori state estimate in 2.27, and at the same time obtains a new
error covariance estimate as in 2.28. This process repeats after each predict-update cycle is complete,
and the new a posteriori estimates are used as inputs to the next cycle's prediction step.
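The predict-update cycle of equations 2.24-2.28 can be sketched as follows; this is a minimal NumPy illustration of the textbook algorithm, not the thesis implementation:

```python
import numpy as np

def kalman_step(x, P, u, z, A, B, H, Q, R):
    """One predict-update cycle of the discrete Kalman filter
    (equations 2.24-2.28)."""
    # Time update: project state and covariance forward (a priori estimates).
    x_prior = A @ x + B @ u                       # eq. 2.24
    P_prior = A @ P @ A.T + Q                     # eq. 2.25
    # Measurement update: fold the noisy observation into the estimate.
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)          # eq. 2.26, Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)      # eq. 2.27
    P_post = (np.eye(len(x)) - K @ H) @ P_prior   # eq. 2.28
    return x_post, P_post
```

A typical use is a constant-velocity tracker, with $A = \begin{pmatrix}1 & \Delta t\\ 0 & 1\end{pmatrix}$ and $H = (1 \; 0)$ observing only the position.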
2.2.2 Extended Kalman Filter
The Extended Kalman Filter (EKF) handles non-linear system and measurement models
through linearization with a first-order Taylor series expansion. Both the Kalman filter and its non-linear
extension can be described in two main steps with a standard set of equations, denominated respectively
as prediction and update, also known as the a priori and a posteriori estimates. Let the state vector $s_k$ be
represented as a Gaussian process with mean $x_k$ and covariance matrix $P_k$ expressing the uncertainty of the
estimate, i.e. $s_k \sim \mathcal{N}(x_k, P_k)$, with the initial state estimate also normally distributed,
$s_0 \sim \mathcal{N}(x_0, P_0)$.
EKF Prediction Step The predicted transition of the mean $x_k$ and the covariance matrix $P_k$ from an
instant $k$ to $k+1$ is accomplished through the following set of equations:

$$x_{k+1} = f(x_k, u_k, \epsilon_k) \quad (2.29)$$

$$P_{k+1} = F_k P_k F_k^T + G_k Q_k G_k^T \quad (2.30)$$
The non-linear system dynamics model function $f(x_k, u_k, \epsilon_k)$ depends on the control input $u_k$
and on the input transition noise $\epsilon_k \sim \mathcal{N}(0, Q_k)$ with covariance matrix $Q_k$. $F_k$ and $G_k$ stand for the state
transition Jacobian matrix and the input transition noise Jacobian matrix, linearized around the predicted
estimate $x_{k+1}$.
EKF Update Step Whenever a measurement $z_{k+1}$ becomes available after the prediction step, one
applies the update procedure:

$$S_{k+1} = H_{k+1} P_{k+1} H_{k+1}^T + J_{k+1} R_{k+1} J_{k+1}^T \quad (2.31)$$

$$K_{k+1} = P_{k+1} H_{k+1}^T S_{k+1}^{-1} \quad (2.32)$$

$$x_{k+1} = x_{k+1} + K_{k+1}\left(z_{k+1} - h_{k+1}(x_{k+1}, \delta_{k+1})\right) \quad (2.33)$$

$$P_{k+1} = \left(I - K_{k+1} H_{k+1}\right) P_{k+1} \quad (2.34)$$
where $h_{k+1}(x_{k+1}, \delta_{k+1})$ is the observation model function, which produces estimates of the sensor measurements,
and $H_{k+1}$ is the observation model Jacobian matrix, linearized around the estimate $x_{k+1}$. $R_{k+1}$ defines
the measurement model noise covariance matrix, from the Gaussian noise $\delta_k \sim \mathcal{N}(0, R_k)$. The innovation
process $\nu_{k+1} = z_{k+1} - h_{k+1}(x_{k+1}, \delta_{k+1})$, which measures the error between actual and
predicted measurements, is normally distributed with the innovation covariance
matrix $S_{k+1}$. Much like $G_k$ in the prediction step, $J_{k+1}$ is the measurement noise Jacobian matrix, linearized
around the predicted state estimate. Finally, the Kalman gain $K_{k+1}$ weighs the predicted state against the
newly observed measurements.
2.2.3 Other Types of Filters
There are other data filtering techniques used in robotics besides the previously mentioned
Kalman filter. Particle filters [22] are a great way to track the state of a dynamic system for which a
Bayesian model is available. That is, given a model of how the system changes in time, possibly
in response to inputs, and a model of what observations should be seen in particular states, one can use
particle filters to track the belief state.
Particle Filters
The particle filter (PF) was introduced in 1993 as a numerical approximation to the nonlinear Bayesian
filtering problem; today there is a rather mature theory as well as a number of successful applications
described in the literature. A particle filter is a set of on-line posterior density estimation algorithms
that estimate the posterior density of the state space by directly implementing the Bayesian recursion
equations.
Its objective is to estimate the posterior density of the state variables given the observation variables.
The particle filter is designed for a hidden Markov model, where the system consists of hidden and
observable variables. The observable variables (observation process) are related to the hidden variables
(state process) by some functional form that is known. Similarly, the dynamical system describing the
evolution of the state variables is also known probabilistically. A generic particle filter estimates the
posterior distribution of the hidden states using the observation measurement process.
Particle methods assume that the state $x_k$ and the observations $y_k$ can be modeled as follows:
$x_k \mid x_{k-1} \sim p_{x_k \mid x_{k-1}}(x \mid x_{k-1})$, where $x_0, x_1, \ldots$ is a first-order Markov process with initial distribution
$p(x_0)$, and the observations $y_k \mid x_k \sim p_{y \mid x}(y \mid x_k)$ are conditionally independent provided that $x_0, x_1, \ldots$ are
known, meaning that $y_k$ depends only on $x_k$. A generic example is

$$x_k = g(x_{k-1}) + w_k \quad (2.35)$$

$$y_k = h(x_k) + v_k \quad (2.36)$$

where $w_k$ and $v_k$ are mutually independent, identically distributed sequences with known probability density
functions, and $g$ and $h$ are also known functions. These look very similar to the state-space equations
of the Kalman filter, except that for the discrete Kalman filter $g$ and $h$ are linear and $w_k$ and $v_k$ are
Gaussian. Kalman-filter-based methods are a first-order approximation (EKF), whereas particle filters
are also an approximation but, with enough particles, can be much more accurate.
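A bootstrap particle filter for the model of equations 2.35-2.36 can be sketched as follows; the Gaussian likelihood and the multinomial resampling below are illustrative choices, not prescribed by the text:

```python
import numpy as np

def particle_filter_step(particles, weights, y, g, h,
                         process_noise_std, meas_noise_std, rng):
    """One bootstrap particle filter step for the model
    x_k = g(x_{k-1}) + w_k,  y_k = h(x_k) + v_k  (eqs. 2.35-2.36)."""
    # Propagate each particle through the (possibly non-linear) dynamics.
    particles = g(particles) + rng.normal(0.0, process_noise_std,
                                          size=particles.shape)
    # Reweight by the measurement likelihood p(y_k | x_k), Gaussian here.
    weights = weights * np.exp(-0.5 * ((y - h(particles)) / meas_noise_std) ** 2)
    weights /= weights.sum()
    # Resample to counter weight degeneracy (systematic resampling also works).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```

The state estimate at each step is the weighted mean of the particles; repeating the step with informative measurements concentrates the particle cloud around the true state.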
2.3 Application to the thesis context
In the following we describe a simple map-building and navigation methodology based on occupancy
grids and trajectory planning. The camera is assumed to be a color-depth camera, such as the Microsoft
Kinect. More details are given in the next chapter.

After developing a basic navigation methodology, we study and develop a methodology to use
a simpler camera, namely a standard color camera, hence without depth measurements. To
compensate for the lack of depth measurements we use an EKF-based approach, more precisely Mono-
SLAM [7]. The dynamic motion model is a 6D constant-velocity model. In particular we assume the
hypothesis that the scene structure is nearly tubular. This hypothesis further constrains
the data, allowing more effective filtering, and in addition keeps the state size constant. More details
are given in Chapter 4.
Chapter 3
Navigation using Color-Depth Cameras
Navigation is the process of monitoring and controlling the movement of the robot from one location to
another. In other words, the robot needs to be able to locate itself relative to other
known locations (for instance, its last location). In this chapter we detail how mapping
and navigation using a color-depth camera are handled. A diagram with the basic steps of the implemented
navigation process is shown in Figure 3.1:
Figure 3.1: Diagram detailing the steps for the navigation.
3.1 Occupancy Grids
The principal idea of the occupancy grid [9] is to represent a map of the environment as an evenly spaced
field of binary random variables, each representing the presence or absence of an obstacle at that location in
the environment. Occupancy grid algorithms compute approximate posterior estimates for these random
variables.
The goal of these algorithms is to estimate the posterior probability over maps given the data,
$p(m \mid z_{1:t}, x_{1:t})$, where $m$ is the map, $z_{1:t}$ are the sensor readings from the beginning until time $t$,
and $x_{1:t}$ are the positions and orientations of the robot at the same time steps.

The posterior over the map is approximated by factoring it into $p(m \mid z_{1:t}, x_{1:t}) = \prod_i p(m_i \mid z_{1:t}, x_{1:t})$.
Due to this factorization, a binary Bayes filter can be used to estimate the occupancy probability of each
grid cell.
In this section we use the depth and color information directly provided by the camera (a Microsoft
Kinect-like camera) to build an occupancy grid map.
3.1.1 Motion Estimation
We need to estimate where the camera is at every step in order to correctly update the occupancy grid.
For that we use the RGB images. We now explain in more detail how that estimation is
made.

How the camera motion estimation between steps is made

First, SIFT [16] features are computed in the current and previous RGB images. Then a matching
algorithm returns the corresponding features in both images. By displaying these matched points over
the images (Figure 5.5a) we can see that there are still some wrong matches, or outliers. We therefore apply a
RANSAC filtering algorithm to the matches, using a rigid transformation model, to find the inliers
in the list of matches. This is only possible because we have depth information for the features
and can thus infer their 3D positions and use the rigid transformation model. We feed only those inliers
to Matlab's procrustes function, which returns the rigid transformation between the two
point clouds (one for each image). This greatly improves the motion estimation of the camera and
consequently the final map.
Steps to compute a rigid transformation
Let $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ be two sets of corresponding points in $\mathbb{R}^d$. We wish to
find a rigid transformation that optimally aligns the two sets in the least-squares sense:
$$(R, t) = \operatorname*{argmin}_{R,\,t} \sum_{i=1}^{n} w_i \left\| (R p_i + t) - q_i \right\|^2 \quad (3.1)$$
where the $w_i$ are weights assigned to each pair of points. We briefly explain the steps to compute the optimal
translation $t$ and rotation $R$ that minimize the Euclidean distance between one point cloud and the other,
transformed (Equation 3.1). First compute the weighted centroids of both point sets:
$$\bar{p} = \frac{\sum_{i=1}^{n} w_i\, p_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{q} = \frac{\sum_{i=1}^{n} w_i\, q_i}{\sum_{i=1}^{n} w_i}. \quad (3.2)$$
Then, center the vectors of each point cloud:

$$x_i := p_i - \bar{p} \quad (3.3)$$

$$y_i := q_i - \bar{q}, \qquad i = 1, 2, \ldots, n. \quad (3.4)$$
Next, calculate the $d \times d$ covariance matrix

$$S = X W Y^T, \quad (3.5)$$

where $X$ and $Y$ are $d \times n$ matrices having the $x_i$ and $y_i$ as their columns, respectively, and $W = \operatorname{diag}(w_1, w_2, \ldots, w_n)$.
Now we compute the singular value decomposition $S = U \Sigma V^T$, and get the rotation

$$R = V \begin{pmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & & \det(V U^T) \end{pmatrix} U^T. \quad (3.6)$$
Finally, we just need to get the translation

$$t = \bar{q} - R\,\bar{p}. \quad (3.7)$$
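The steps of equations 3.1-3.7 can be collected into a single routine; this is a NumPy sketch of the weighted Procrustes/Kabsch alignment, whereas the thesis itself relies on Matlab's procrustes function:

```python
import numpy as np

def rigid_transform(P, Q, w=None):
    """Weighted least-squares rigid alignment (equations 3.1-3.7).
    P, Q are d x n arrays of corresponding points; returns R, t
    such that Q is approximately R @ P + t."""
    d, n = P.shape
    w = np.ones(n) if w is None else np.asarray(w, float)
    # Eq. 3.2: weighted centroids of both point sets.
    p_bar = (P * w).sum(axis=1) / w.sum()
    q_bar = (Q * w).sum(axis=1) / w.sum()
    # Eqs. 3.3-3.4: center both clouds.
    X = P - p_bar[:, None]
    Y = Q - q_bar[:, None]
    # Eq. 3.5: d x d weighted covariance matrix.
    S = X @ np.diag(w) @ Y.T
    # Eq. 3.6: SVD, with a determinant correction to exclude reflections.
    U, _, Vt = np.linalg.svd(S)
    D = np.eye(d)
    D[-1, -1] = np.linalg.det(Vt.T @ U.T)
    R = Vt.T @ D @ U.T
    # Eq. 3.7: translation.
    t = q_bar - R @ p_bar
    return R, t
```

For noiseless correspondences this recovers the true rotation and translation exactly; with RANSAC inliers it gives the least-squares fit.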
3.1.2 Map Building
To validate this approach an experiment was conducted, and the results are presented next. Using the data
from a run around a VRML world composed of four corridors and an elevator hall (simulating the 5th
floor of the IST North Tower), we have the projection matrices and depth maps for every step. By choosing
the middle line of the depth image and assuming that the camera stays horizontal the whole time, we can
simulate a laser range finder. At each step we have measurements of obstacles. These measurements are
relative to the camera position in the world, so we have to take that into account and put the sensor
readings in world coordinates (change from the camera reference frame to the world reference frame). Then we need
to use that information (obstacles and free space) to update our probability grid. Each cell in the grid
holds a probability value for the belief that it is occupied. To update the free space we sample the
area the camera is viewing and check which samples fall inside the polygon defined by the obstacle
points and the camera position in the world. The free cells are updated with the following equations
$$Bel\_Occ = (1 - TrueHit) \times p(x, y) \quad (3.8)$$

$$Bel\_Unocc = (1 - FalseAlarm) \times (1 - p(x, y)) \quad (3.9)$$

$$p(x, y) = \frac{Bel\_Occ}{Bel\_Occ + Bel\_Unocc} \quad (3.10)$$
The values $Bel\_Occ$ and $Bel\_Unocc$ are the beliefs that the cell is occupied or not, respectively. We then
normalize the probability to get $p(x, y)$. The locations where obstacles were detected are
updated with
$$Bel\_Occ = TrueHit \times p(x, y) \quad (3.11)$$

$$Bel\_Unocc = FalseAlarm \times (1 - p(x, y)) \quad (3.12)$$

$$p(x, y) = \frac{Bel\_Occ}{Bel\_Occ + Bel\_Unocc} \quad (3.13)$$
The variables $TrueHit$ and $FalseAlarm$ represent, respectively, the probability that the sensor detects an
obstacle that is actually there, and the probability that the sensor detects something that is not there. We
do this for all the steps, and in the end we have an occupancy grid map of the environment. Figure 3.2
shows various steps of the occupancy grid mapping process. This was done with the ground-truth position
of the camera known at every step.
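The cell update of equations 3.8-3.13 can be sketched as a single function; the default true_hit and false_alarm values below are illustrative, not taken from the thesis:

```python
def update_cell(p, observed_occupied, true_hit=0.9, false_alarm=0.1):
    """Binary Bayes update of one grid cell's occupancy probability,
    following equations 3.8-3.13. The sensor model values true_hit and
    false_alarm are illustrative defaults."""
    if observed_occupied:           # obstacle detected: eqs. 3.11-3.12
        bel_occ = true_hit * p
        bel_unocc = false_alarm * (1.0 - p)
    else:                           # cell observed free: eqs. 3.8-3.9
        bel_occ = (1.0 - true_hit) * p
        bel_unocc = (1.0 - false_alarm) * (1.0 - p)
    # Normalize: eqs. 3.10 / 3.13.
    return bel_occ / (bel_occ + bel_unocc)
```

Repeated consistent observations push a cell's probability toward 1 (occupied) or 0 (free), which is the intended accumulation behavior of the grid.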
3.1.3 Loop Closure
The idea of loop closure is to propagate backwards the error observed when we know where the
robot should be and where it actually is, for instance when we know it started and ended at the same
place but, in the map built, it did not. When the robot passes through an area it has visited before, it
should be able to recognize it and correct its location, meaning that ideally the map should be a closed
loop if the robot performed an actual loop.

The adopted solution was the following: once loop closure is detected (i.e. a past location is being visited
again), and using the projection matrices at every step, it is possible to build a profile of the movement of
the camera along the $x$, $y$, $z$ axes for the translation and $\alpha$, $\beta$, $\gamma$ (the Euler angles computed using the Rodrigues
formula [26]) for the orientation; i.e., we can see in Figure 3.3 how the camera evolves in these reference
frames over all steps. Based on this evolution we need to distribute the final error over all the steps. If we
compute new camera projection matrices with the error correction added and rebuild the occupancy grid,
Figure 3.2: Various steps of the occupancy grid mapping process: (a) initial map; (b) step 1; (c) step 12; (d) step 25; (e) step 50; (f) step 75; (g) step 100; (h) final map.
the final error should be smaller. So for each camera matrix we update its rotation matrix and translation
vector with the weighted error. The only things we have to compute are essentially the error itself and the
weight of the error added at each step. The error is just the difference between the desired and actual end
poses. To find the weight of the error to add at each step we used the formula (written here for $x$, but
identical for the other coordinates)

$$e_x(t) = e_{x\max} \left( -\frac{1}{2} + \frac{1}{M_x} \int_0^t |v_x(\tau)|\, d\tau \right), \quad t \in [0, t_{max}], \qquad M_x \equiv \int_0^{t_{max}} |v_x(\tau)|\, d\tau \quad (3.14)$$

where $M_x$ is a normalization constant and $v_x$ is the velocity at each instant.
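A discrete version of equation 3.14 can be sketched as follows, replacing the integral with a cumulative sum over steps; the example velocities are hypothetical:

```python
import numpy as np

def error_weights(v, e_max):
    """Discrete version of equation 3.14: distribute the final pose error
    e_max over the trajectory steps, weighted by cumulative absolute
    velocity, so that steps where the camera moved more absorb more of
    the correction. `v` holds the per-step velocities of one coordinate."""
    v = np.abs(np.asarray(v, float))
    M = v.sum()                    # normalization constant M_x
    cum = np.cumsum(v) / M         # (1/M_x) * integral of |v_x|
    return e_max * (-0.5 + cum)    # e_x(t) evaluated at every step
```

Steps with zero velocity receive no additional correction relative to their predecessor, while the total sweep of the correction across the trajectory equals e_max.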
3.2 Motion Planning
Motion planning is a term used in robotics for the process of breaking down a desired movement task
into discrete motions that satisfy movement constraints and possibly optimize some aspect of the movement.
In this section we detail some trajectory planning methodologies, namely Potential Fields,
Probabilistic Roadmaps, and Rapidly Exploring Random Trees.
3.2.1 Potential Fields
Potential Fields is a motion planning technique in which the robot is treated as a particle placed in an
influencing potential field [11]. The topology of the field is created by the designer: one first creates
multiple behaviors, each assigned a particular task or function, then represents each of these behaviors
as a potential field, and finally produces the robot motion by combining
Figure 3.3: Profiles for the estimated motion of the camera (translation and orientation).
the referred fields (an example can be seen in Fig. 3.4). One can also think of it as a marble rolling down
hills that represent the influence of all the behaviors present.

Figure 3.4: Example of a Potential Field. Image taken from [24]

The action vector, the speed and orientation of the robot, is produced by each behavior based on
its purpose. The behavior of the goal state is to attract the agent, and that of an obstacle is to repel it. The
collection of these vectors over all possible locations of the agent is the potential field.

Based on this potential field the robot knows how to move around: it moves with the speed and
orientation of the vector at its current location.
Different Types of Potential Fields
There are more types of behaviors than just the goal and obstacle behaviors described above; it all depends on
the situation one wants to create. Follow/avoid a wall, circle around something, and do not repeat past
locations are just a few of the behaviors that can be recreated by choosing the right potential function to model
the wanted behavior. In this work, the avoid-past-locations behavior was implemented to escape local
minima where the agent gets stuck. This is done by adding a repulsive trail wherever the agent passes.
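A minimal attractive/repulsive potential field can be sketched as follows; the gain and influence-radius values are illustrative, and the repulsive term uses a common textbook form rather than the exact functions of this work:

```python
import numpy as np

def potential_field_step(pos, goal, obstacles,
                         k_att=1.0, k_rep=5.0, influence=3.0):
    """Combined action vector at `pos` from an attractive goal behavior and
    repulsive obstacle behaviors (gains and radius are illustrative)."""
    # Attractive behavior: pull toward the goal, proportional to distance.
    force = k_att * (goal - pos)
    # Repulsive behavior: push away from each obstacle within its influence radius.
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0.0 < d < influence:
            force += k_rep * (1.0 / d - 1.0 / influence) * diff / d**3
    return force
```

The robot then moves a small step along the (normalized) action vector; adding a repulsive trail over visited cells, as done in this work, amounts to appending past positions to the obstacle list.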
3.2.2 Randomized Methods for Path Planning
Motion planning problems involve searching the configuration space of one or more complicated
geometric bodies for a collision-free path that connects given start and goal configurations, while satisfying
constraints imposed by complicated obstacles. Complete algorithms are known to solve this class
of problems, but their computational complexity limits their use to low-dimensional configuration spaces.
This motivated the development of randomized methods. Although these methods are not complete, they
will, with probability approaching one, find a solution if one exists, given enough time. Randomized path
planning algorithms were designed for one of two contexts: single-query or multiple-query planning. For
single-query planning, it is assumed that a single path planning problem must be solved quickly, without any
preprocessing. For multiple-query planning, many path planning problems will be solved for the same
environment, and it is worthwhile to preprocess information and store it in a data structure that allows fast
path planning queries. In the coming sections a brief explanation of two randomized methods is presented.
3.2.3 Probabilistic Roadmaps
The Probabilistic Roadmaps algorithm works in two stages, the first called the learning stage and the second
called the query stage [3]. Given an instance of the RMP (Robot Mapping Problem), in the learning stage
the algorithm samples the configuration space and builds an undirected graph $G = (V, E)$ which captures
the information gathered. This graph is called the Probabilistic Roadmap; an example is depicted
in Fig. 3.5. In the query stage, the PRM is used to solve specific RMP problem instances, which are then
reduced to a graph search.
Figure 3.5: Example of a Roadmap. Image taken from [20]
The learning stage starts by generating a random configuration $c \in C_{free}$, where $C_{free}$ is the set of
configurations for which the robot does not collide with any static obstacle. It then checks for nearby vertices
whose distance to $c$ is lower than a certain threshold, connects them in increasing order of
their distance, and the planner verifies that the path between them is free.

In the query stage two configurations are given (start and goal). Both are connected to the graph
by the same ascending-distance method, and it is checked whether they belong to the same connected
component of the graph. If so, the path is returned. Possibly, path smoothing is later applied for a better solution.
Notes on PRM algorithm
The learning stage takes much more time than the query stage, so the roadmap is only worthwhile when
multiple queries are made. If the robot is to move in a dynamic environment, the PRM becomes invalid as
obstacles move, wasting the time spent building it. The method also has difficulty finding narrow passages
in the configuration space $C$.
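The learning stage described above can be sketched as follows, under a hypothetical 100x100 workspace; the in_free_space and path_is_free callables stand in for the collision checker:

```python
import numpy as np

def build_prm(n_samples, radius, in_free_space, path_is_free, rng):
    """PRM learning stage (simplified sketch): sample C_free, then connect
    each sample to its neighbors, in increasing order of distance, whenever
    the local path between them is collision-free."""
    nodes = []
    while len(nodes) < n_samples:       # rejection-sample C_free
        c = rng.uniform(0.0, 100.0, size=2)   # hypothetical 100x100 workspace
        if in_free_space(c):
            nodes.append(c)
    edges = []
    for i, c in enumerate(nodes):
        # Candidate neighbors of c, sorted by increasing distance.
        dists = [(np.linalg.norm(c - q), j)
                 for j, q in enumerate(nodes) if j != i]
        for d, j in sorted(dists):
            if d < radius and path_is_free(c, nodes[j]):
                edges.append((i, j))
    return nodes, edges
```

The query stage then connects start and goal to the roadmap by the same nearest-first strategy and runs an ordinary graph search over the edges.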
3.2.4 Rapidly Exploring Random Trees
The RRT Method
The RRT is an efficient data structure and sampling scheme to quickly search high-dimensional spaces
that have both algebraic and differential constraints [14]. The basic idea is to direct the search in the
configuration space towards a solution.
Figure 3.6: Example of an expanding tree. Image taken from [15]
It starts by building a tree rooted at the starting point $q_{init}$. A random configuration $q_{rand}$ is chosen
in $C_{free}$. The current tree is extended from its closest vertex by a fixed step size in the direction of that
configuration. Every time the tree is extended, a function checks for collisions between the closest vertex
and the new vertex (Figure 3.7). There are three possible outcomes for this tree extension: the path is
clear and the tree is extended in that direction with the new vertex; the path is obstructed by an obstacle
and the new vertex is rejected; or the random configuration itself is reached and added to the tree, because it lies
closer than the fixed step size. This process is repeated until the goal configuration is reached.
Something to keep in mind is that an efficient nearest-neighbor search is important for this method. The metric
Figure 3.7: The RRT method
chosen is also fundamental.
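The extension step described above can be sketched as follows; the linear nearest-neighbor search is a simplification (a k-d tree scales better), and collision_free stands in for the collision checker:

```python
import numpy as np

def rrt_extend(tree, q_rand, step, collision_free):
    """One RRT extension step: grow the tree from its nearest vertex toward
    q_rand by at most `step` (the three outcomes of Figure 3.7)."""
    # Nearest-neighbor search under the Euclidean metric (linear scan).
    q_near = np.asarray(min(tree,
                        key=lambda q: np.linalg.norm(np.asarray(q) - q_rand)))
    diff = q_rand - q_near
    d = np.linalg.norm(diff)
    # Either q_rand is within one step (reached), or take a fixed-size step.
    q_new = q_rand if d <= step else q_near + step * diff / d
    if collision_free(q_near, q_new):
        tree.append(tuple(q_new))   # path clear: accept the new vertex
        return True
    return False                    # path obstructed: reject the vertex
```

Repeatedly sampling $q_{rand}$ and calling this step grows the tree until a vertex lands close enough to the goal configuration.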
RRT-Connect Path Planner
A variant of the RRT method is RRT-Connect. The RRT-Connect planner is designed specifically
for path planning problems that involve no differential constraints. The method is based on two
ideas: the Connect heuristic, which attempts to move over a longer distance, and the growth of two RRTs,
one from $q_{init}$ and one from $q_{goal}$.
Instead of extending an RRT by a single step, the Connect heuristic iterates the extend
step until $q$ or an obstacle is reached (Figure 3.8). With this method, the greedy heuristic is combined
with the rapid and uniform exploration properties of RRTs, which helps avoid local minima.
The idea is the following: two trees are maintained at all times until they become connected and a solution is found.
In each iteration, one tree is extended and an attempt is made to connect the nearest vertex of the other
tree to the new vertex. Then the roles are reversed by swapping the two trees. This causes both trees to
explore the free space around them while trying to establish a connection between each other.
Figure 3.8: The RRT-Connect method
The key result is that the distribution of the RRT vertices converges toward the sampling distribution,
which is uniform.
3.2.5 Comparison of Trajectory Planning Methods
In order to obtain a fair comparison of the planning methods, an experiment has been conducted. The
results were obtained after 10 runs for each method and map, under the same random conditions (randinit
for the same method in different maps). The maps used are shown in Figure 3.9. The starting point was $(20, 10)$
and the goal point $(50, 30)$ for the first map. For all the others the chosen points were start $= (10, 10)$ and
goal $= (90, 90)$.
Figure 3.9: (a) Map 1, (b) Map 2, (c) Map 3, (d) Map 4. The red circle is the starting point and the green star is the goal point. All the maps are solved with the different methods: second row, the Potential Fields method; third row, the Probabilistic Roadmaps; bottom row, the Rapidly Exploring Random Trees.
The results are summarized in Table 3.1. Looking at them, we observe that the Potential Fields
(PF) method is on average the fastest and the one that covers the least distance (except on
the first map). The Rapidly Exploring Random Trees method usually covers a larger distance but has a
running time comparable to the PF method. The Probabilistic Roadmap takes the most time to compute and
covers a distance somewhere between the previous two methods. However, its success rate drops for hard
maps, where it is critical to find a narrow passage or there are several obstacles.
Map   Method             Distance   Time [s]   Success [%]
1     Potential Fields   1414        4.2       100
1     RRT                 328.3      4.8       100
1     PRM                 280       18.68       50
2     Potential Fields     80        0.18      100
2     RRT                 143.99     0.58      100
2     PRM                 105.6     34.35      100
3     Potential Fields    121        0.23      100
3     RRT                 161.74     0.3595    100
3     PRM                 135.5     38.36      100
4     Potential Fields    108        0.34      100
4     RRT                 161.63     4.9       100
4     PRM                 137       15.1        10

Table 3.1: Average values over 10 runs for each method and map.
The first map was considered very hard because it has many local-minima solutions, since various walls
must be circumvented. For that reason the Potential Fields method had to take various turns, greatly
increasing the distance traveled. The PRM method only had a 50% success rate on this map.

The second map was considered easy, since the obstacles do not get in the way of the direct path
between the start and the goal. All the methods were successful.

The third map was ranked of medium difficulty. The obstacle present made it mandatory to go around it. All
the methods were successful.

The fourth map was graded hard because of the narrow passage the path had to find. The PF and
RRT methods were successful, but the PRM had only a 10% completion rate.

Based on these results, the best method depends on the type/difficulty of the map
and on the locations of the start and goal points. For easy and medium maps, all the methods are well
suited. For hard ones, the PF and RRT seem to yield more robust results. For very hard
maps, the RRT and PRM (although the success rate of the PRM was not 100%) appear to do a better
job, taking the distance into account.
Chapter 4
Navigation based on Monocular Vision
This chapter addresses a Simultaneous Localization and Mapping (SLAM) [8] methodology for a system
capable of performing visual inspections in an unknown environment, assumed to be a tubular
shaped structure (TSS), using a monocular camera [23]. Under the TSS assumption, a geometrical model
is developed which directly maps visual features from planar images onto a 3D representation, through
a back-projection procedure, thus enabling a scenario reconstruction composed of cylindrical segments.
We also adapt the tubular shape to a rectangular shape. The EKF framework allows reconstructing
the path described by the camera and the structure it visualizes, based on visual landmarks described by
feature detectors.
Estimating the motion of a camera moving inside a shaped structure involves considering various aspects.
Two important ones are the number of degrees of freedom of the camera motion and the shape
of the environment structure. A TSS includes straight and curved sections. The camera can move in 6D
(3D in translation and 3D in rotation). Motion can be estimated by registering the texture, retrieving
distinctive visual features, and dewarping into a mosaic. We approached the problem by first reconstructing
points of the scene and then fitting a simple 3D cylindrical model to the various tube sections,
which makes the dewarping step simple [18].
4.1 Depth Information Not Directly Available
The big difference between this method and the one presented in Chapter 3 is the sensor used. Here,
a simple monocular camera does not provide depth information for its pixels. Because of this, a new
way of estimating the motion of the camera is necessary. The SLAM methodology is used, with some
assumptions made a priori. A structural model for the environment is defined, and an EKF with a motion
model and an observation model works to estimate, among other things, the pose of the camera.
The diagram in Figure 4.1 summarizes the basic steps of mapping and allows comparing visual SLAM with
the color-depth methodology.
Figure 4.1: Basic steps in mapping and motion estimation for the Color-depth vs Visual SLAM.
4.2 Iterative Estimation of Tubular Structure and Camera Motion
Assuming the world is a TSS domain (Fig. 4.2), navigated by a forward-looking, hand-held-like monocular
camera smoothly moving at constant speed inside a tube without revisiting previous positions, one
can determine the features' 3D locations, which are pinned to a cylindrical section wall of radius
$\rho$. By carefully defining a state vector $s$ composed of the estimates of both the camera and the TSS geometrical
parameters, and incorporating a geometrical observation model reliable enough to produce relevant
estimates of the cylindrical section's geometrical parameters, simultaneous camera localization and
mapping can be achieved.
Figure 4.2: Tubular structure used for testing the system behavior.
4.2.1 State Vector
The state vector $s$ is a joint composition of the camera pose $x$ and the cylinder geometrical parameters $y$:

$$s_k = [\,x_k \,;\, y_k\,] \quad (4.1)$$

where the semicolon denotes vertical stacking of vectors.
Camera motion behavior is modeled with a constant-velocity dynamic model. The state vector $x_k$
provides the position $r^{WC}_k$, the orientation $q^{WC}_k$, and both the linear $v^W_k$ and angular $\omega^C_k$ velocities at every
instant, i.e.

$$x_k = \left[\, r^{WC}_k \,;\, q^{WC}_k \,;\, v^W_k \,;\, \omega^C_k \,\right]. \quad (4.2)$$
A cylindrical section is characterized by a 7-dimensional state vector $y^n_k$, an array of the cylinder
parameters: centre position $p^{nW}_k$, orientation $o^n_k$, and radius $\rho^n_k$ expressed in logarithmic form:

$$y^n_k = \left[\, p^{nW}_k \,;\, o^n_k \,;\, \log(\rho^n_k) \,\right]. \quad (4.3)$$
4.2.2 System Dynamics
The non-linear state transition model function $f$ is defined as the compilation of two independent state
transition processes, $f_x$ and $f_y$, for the camera and cylinder state vectors $x$ and $y$, which are influenced
by an additive zero-mean Gaussian transition noise $\epsilon \sim \mathcal{N}(0, Q_k)$:

$$f = [\,f_x \,;\, f_y\,]. \quad (4.4)$$
The constant-velocity model allows smooth velocity variations, with zero-mean Gaussian-distributed
acceleration noise $n_x = (a^W \; \vartheta^C)^T \sim \mathcal{N}(0, N_{x_k})$. Under this model assumption, and defining $V^W = a^W \Delta t$
and $\Omega^C = \vartheta^C \Delta t$ as the linear and angular velocity impulses over a transition step $\Delta t$, one
can express the camera state transition $x_{k+1} = f_x(x_k, n_{x_k})$ [1, 4]:

$$\begin{bmatrix} r^{WC}_{k+1} \\ q^{WC}_{k+1} \\ v^W_{k+1} \\ \omega^C_{k+1} \end{bmatrix} = \begin{bmatrix} r^{WC}_k + \left(v^W_k + V^W\right)\Delta t \\ q^{WC}_k \times q\!\left(\left(\omega^C_k + \Omega^C\right)\Delta t\right) \\ v^W_k + V^W \\ \omega^C_k + \Omega^C \end{bmatrix}. \quad (4.5)$$
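The mean prediction of equation 4.5 can be sketched as follows; the quaternion helpers use the Hamilton convention, an assumption since the text does not state one:

```python
import numpy as np

def quat_mult(a, b):
    """Hamilton product of quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_from_rotvec(theta):
    """Quaternion q(theta) for a rotation vector theta (axis * angle)."""
    angle = np.linalg.norm(theta)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = theta / angle
    return np.concatenate([[np.cos(angle / 2.0)],
                           np.sin(angle / 2.0) * axis])

def predict_camera_state(r, q, v, w, dt, V=np.zeros(3), Omega=np.zeros(3)):
    """Constant-velocity prediction of the camera state (r, q, v, w) per
    equation 4.5; V and Omega are the velocity impulses from the
    acceleration noise (zero for the noiseless mean prediction)."""
    r_new = r + (v + V) * dt
    q_new = quat_mult(q, quat_from_rotvec((w + Omega) * dt))
    return r_new, q_new, v + V, w + Omega
```

In the EKF this mean prediction is paired with the Jacobians $F_k$ and $G_k$ of equation 2.30 to propagate the covariance.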
4.2.3 Observation Model
The observation model describes the process of implicitly representing observed features $\Lambda$ in a 3-D
parameterization as a function of the system's state parameters, by imposing constraints on the environment
structure and consequently on the features' 3-D locations. At an instant $k$ the state transition function $f$ is
applied to $s_k$, retrieving $s_{k+1} = f(s_k, n_k)$. $\Lambda_k$ is the input set of features, in pixel coordinates, acquired
at instant $k$, whereas $\Lambda_{k+1}$ is the set of features at the following instant $k+1$ and the observation model
output. The observation model can thus be written as a function $h$:

$$\Lambda_{k+1} = h_{k+1}(s_{k+1}, \Lambda_k) \quad (4.6)$$

Consider that the features in the set $\Lambda_k$ have 3-D locations on the TSS surface: they are the intersections
with the cylinder wall of rays leaving the camera centre $r^{WC}_k$ and piercing the 2-D image plane where all features lie.
With the information present in the state vector $s_k$, i.e. the camera pose $x_k$ and the cylinder section parameters
$y_k$, one can estimate the 3-D location of every feature in the set $\Lambda_k$ through a back-projection methodology,
and re-project these 3-D coordinates using the next instant's state vector $s_{k+1}$ onto the
2-D image plane, thus acquiring the features $\Lambda_{k+1}$.
4.3 Dewarping
The process of opening the tube (Figure 4.3), or dewarping its interior, consists of finding the transformation between the pixels in the scene viewed by the camera inside the TSS and the planar mosaic we want to build of the inside texture. In essence it is a conversion from polar to Cartesian coordinates; Figure 4.4 shows that relation. We first find the 3D position of each pixel relative to the camera's centre. After deciding how many samples we want, i.e. how many lines of the mosaic per step, and the depth we are able to see, we have the size of the new image. The position of each new pixel (u, v) is then a function of its 3D relation to the camera's optical centre. The depth (Z coordinate) determines the horizontal location in the mosaic, while the angle α (computed from the X and Y coordinates) determines the vertical location of the pixel in the mosaic. In this way we define a mapping function that assigns polar coordinates to each pixel in the image and, through a transformation, yields the corresponding pixel location in the new image, the planar mosaic. A final mosaic is stitched using a homography, showing the full inside texture of the tube. Some examples of dewarped images can be seen in Fig. 4.5 and Fig. 4.6.
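The pixel mapping just described can be sketched as follows (an illustrative sketch; the sampling parameters z_min, z_max, n_cols and n_rows are hypothetical names, not taken from the thesis):

```python
import numpy as np

def tube_point_to_mosaic(X, Y, Z, z_min, z_max, n_cols, n_rows):
    """Map a 3-D point on the tube wall, expressed in the camera
    frame, to a mosaic pixel (u, v): the depth Z sets the horizontal
    location, the angle alpha around the axis sets the vertical one."""
    u = (Z - z_min) / (z_max - z_min) * (n_cols - 1)     # horizontal: depth
    alpha = np.arctan2(Y, X)                             # angle around the axis
    v = (alpha + np.pi) / (2 * np.pi) * (n_rows - 1)     # vertical: angle
    return int(round(u)), int(round(v))
```

Applying this mapping to every visible wall point of one frame produces one strip of the mosaic; consecutive strips are then stitched into the full texture.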
Figure 4.3: Creating a mosaic with the interior texture of the tube.
Figure 4.4: (a) 3D coordinates of pixels in the image inside the tube. (b) Relation between the angle of the coordinate in the closed image and the vertical pixel in the mosaic. (c) Pixels in the mosaic.
4.3.1 Dewarping a VRML Tubular Shape
Datasets with real images, acquired from virtual or real cameras, introduce image processing frameworks to the developed system, incorporating image description techniques such as feature descriptor algorithms (e.g. SIFT, SURF), complemented by the RANSAC methodology for outlier removal. Figure 4.5 shows various steps of the process.
4.3.2 Dewarping a Real Tubular Shape
Another experiment consisted of evaluating the developed algorithm on a set of images retrieved from a real camera navigating inside a textured TSS. The setup and the results can be seen in Figure 4.6.
Figure 4.5: (a) VRML setup. (b) View of the camera inside the tube. (c) Dewarping of the tube for a single step. (d) Result of dewarping several steps.
Figure 4.6: (a) Real setup. (b) Inside view of the tube. (c) Dewarping of the image for one step. (d) Result of dewarping several steps.
4.4 Rectangular Section for a Corridor
To adapt this model from a tubular shape to our corridor model, some changes were needed. The idea was to switch the back-projection function to a rectangular section instead of a circular one; the state vector now holds the aspect ratio of the rectangular section in place of the cylinder radius. The concept is the same as before: find the 3D ray that starts at the camera optical centre C and pierces the image plane at the 2D projection of D, and then, using the known structure of the environment, compute the scaling factor α that represents the intersection between the 3D line and the corridor planes (Fig. 4.7):
M = C + αD (4.7)
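A sketch of this back-projection for a rectangular section, assuming a corridor frame with the z axis along the corridor and walls at x = ±half_w, y = ±half_h (a hypothetical parameterization, not the thesis code):

```python
import numpy as np

def backproject_rect(C, D, half_w, half_h):
    """Smallest positive alpha such that M = C + alpha*D lies on one
    of the four corridor planes x = +/-half_w, y = +/-half_h."""
    alphas = []
    for axis, bound in ((0, half_w), (1, half_h)):
        if abs(D[axis]) > 1e-12:
            for s in (+bound, -bound):
                a = (s - C[axis]) / D[axis]
                if a > 0:
                    M = C + a * D
                    # keep only intersections inside the section
                    other = 1 - axis
                    lim = half_h if axis == 0 else half_w
                    if abs(M[other]) <= lim + 1e-9:
                        alphas.append(a)
    return min(alphas) if alphas else None
```

The first (smallest) positive intersection is the visible wall point; rays parallel to both wall pairs return None.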
Figure 4.7: Back-projection model for a rectangular section.
4.4.1 Results obtained for the synthetic case
In this setup, a synthetic corridor was simulated by creating a point cloud of features (Fig. 4.8a) consisting of a rectangular section repeated along the z axis. The camera was then moved along that axis, starting at the origin until the point (0, 0, 5) was reached, and at each step it captured the projection of the features on its image plane. This was the dataset used in the simulation. In total there were 50 steps with a step size of 0.1 m. In the starting configuration file, the initial assumption of the camera position was purposely set wrong, to (0.5, 0, 0), to see if the filter would correct the state. The resulting plot of the camera pose in world coordinates exhibits a clear correction by the filter (Figure 4.8b and Figure 4.8c). The plots in Figure 4.8d and Figure 4.8e show in more detail the evolution of the state of the filter for camera position and orientation, respectively.
Figure 4.8: Results for the synthetic corridor. (a) Synthetic point cloud. (b) Synthetic section with ground truth (green): perspective view. (c) Synthetic section with ground truth (green): top view. (d) State of the filter for the position of the camera. (e) State of the filter for the orientation of the camera (Euler angles).
4.4.2 Conclusion
This chapter aimed at presenting a robust model to solve the Simultaneous Localization and Mapping problem with a priori structural knowledge of the environment, assumed to be first a TSS and then a rectangular tube. The camera's pose estimate and the camera trajectory can be recovered, up to a scale factor, assuming the constant-velocity dynamic model, allowing a posterior description of the environment structure and its reconstruction.
Chapter 5
Experiments
In this chapter a number of experiments are conducted to validate the mapping and navigation methods studied and introduced in previous chapters. The chapter describes the experiments and the results obtained from them.
5.1 The Experimental Setup
A virtual world created in VRML (Virtual Reality Modeling Language [5]), seen in Figure 5.1, is used to provide the imaging data that a robot can acquire. An example of a point of view (POV) for the robot inside the virtual world at various consecutive steps can be seen in Figure 5.2. Several 3D points coinciding with the VRML corners make up the facets used to simulate the depth sensor. The virtual world has textures on its walls that we will use to scan for visual features.
Figure 5.1: VRML model used for visualization. (a) VRML world wireframe. (b) VRML world with textures.
Figure 5.2: Various consecutive steps (1 to 9) of the POV of the robot inside the virtual world.
5.2 Single Image Global Localization
Usually the first mission of a robot in an unknown scenario is an exploration run to map the environment. During that run the robot periodically, after a certain number of steps, saves in its memory the SIFT descriptors and the map location of the image captured. The objective is that, if we randomly choose an RGB image from any past location, we can compute its SIFT descriptors and compare them with all the other SIFT descriptors in the database. The result can be a direct match, when the step we chose to locate happens to be one that was saved; otherwise we choose the image with the best score from the matching function, which should yield the best option to locate the robot. Essentially the method compares images and finds the most similar one. The main problems with this approach are the textures (some textures are not good for the SIFT method) and redundancy (repeated or similar images in different places).
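The best-match selection can be sketched as nearest-neighbour descriptor matching with Lowe's ratio test, scoring each database image by its match count (an illustrative numpy sketch over precomputed descriptor arrays; the ratio value is an assumption):

```python
import numpy as np

def match_count(query_desc, db_desc, ratio=0.8):
    """Number of query descriptors whose nearest neighbour in
    db_desc passes Lowe's ratio test against the second nearest."""
    count = 0
    for d in query_desc:
        dists = np.linalg.norm(db_desc - d, axis=1)
        order = np.argsort(dists)
        if len(order) > 1 and dists[order[0]] < ratio * dists[order[1]]:
            count += 1
    return count

def localize(query_desc, database):
    """Return the map location whose stored descriptors score best.
    `database` is a list of (location, descriptors) pairs."""
    scores = [match_count(query_desc, desc) for _, desc in database]
    return database[int(np.argmax(scores))][0]
```

In practice the descriptors would come from a SIFT extractor; here any fixed-length descriptor arrays work.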
In Figure 5.3 we can see examples of all the possible outcomes: correct correspondences, exact correspondences and incorrect correspondences. Figure 5.3 also shows a map of the environment. The points represent the locations where the RGBD images were taken. The yellow points represent where the SIFT features were saved; in this case they were saved every 3 steps. The blue points represent the points where the robot did not record any information. The green cross is the actual location of the image we are trying to locate and the red circle is the location of the strongest match returned by the algorithm.
The first case (step 5) is a correct one because the strongest match is the closest image (Figure 5.3b) with recorded features, as we can see on the map. The second case (step 81) is also correct, for the same reasons (Figure 5.3c); looking at the map, the green cross and the red circle are very close. The third case (step 100) is an exact match: the image we are trying to locate (Figure 5.3d) is one that is also stored in memory, so its SIFT features are exactly the same and we obtain its correct location. The fourth and final case (step 32) is a mismatch. From the RGB image (Figure 5.3e) we can speculate that the reason for the mismatch is the repeated texture that exists in several locations in the environment. The input image (green cross in the map) is located at the end of the second corridor and the output image (red circle, the strongest match) is in the third corridor.
The overall performance of the algorithm is around 80% correct correspondences, 8% almost correct (the output is not the closest point with recorded SIFTs, but it is close, i.e. in the same corridor) and 12% incorrect correspondences.
Figure 5.3: Map with robot ground truth locations (green crosses) and estimated locations (red circles) based on query and database images (a). The query images for the various steps (5, 81, 100 and 32) are on the left and the retrieved matching images on the right (b, c, d, e). (a) Global map with the locations. (b) Query image (step 5) and retrieved image. (c) Query image (step 81) and retrieved image. (d) Query image (step 100) and retrieved image. (e) Query image (step 32) and retrieved image.
The results obtained show that this method works well overall, but there are some problematic areas of the map that it consistently mismatched. For a better success rate in localizing, the robot cannot rely on a single match. The use of particle filters to localize the robot based on the sensor readings [10], which in this case are the SIFT descriptors from the RGB images, allows the robot to recover its correct location within a limited number of sequential images. In this method, given a map of the environment, the algorithm estimates the position and orientation of the robot as it moves and senses the environment. The algorithm uses a particle filter to represent the distribution of likely states, with each particle representing a possible state, i.e. a hypothesis of where the robot is. Whenever the robot moves, it shifts the particles to predict its new state after the movement. When the robot senses something, the particles are resampled based on recursive Bayesian estimation, i.e. on how well the actual sensed data correlate with the predicted state. The particles should converge towards the actual pose of the robot.
5.3 Motion Estimation
Motion estimation, as mentioned before, is the process of determining the movement of the camera from one image to the next. The path given to the camera is a pre-programmed one, as seen in Figure 5.4. The next figures show the intermediate steps of the motion estimation used in the mapping and navigation process. In Figure 5.5a we plot the corresponding matched SIFT features in two consecutive steps. Some of these correspondences are not correct, so we apply the RANSAC algorithm with the rigid transformation to search for inliers; those can be seen in Figure 5.5b. Using the inliers from consecutive steps we can compute two point clouds and use the Procrustes analysis to find the transformation between them. Figure 5.5c depicts both point clouds, one in blue and the other in red, with the matching drawn as green lines, and shows the pose of the camera before and after the computed transformation. If we superimpose all the estimated poses from all the steps into one plot we get Figure 5.6, which illustrates the total estimated trajectory of the camera. The resulting occupancy grid map, obtained from the camera matrices and the depth information (Figure 5.7a) at each location, is in Figure 5.7c.
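The Procrustes step, recovering the rigid transformation between two matched point clouds, can be sketched with the SVD-based (Kabsch) solution (a generic sketch, not the thesis code):

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rotation R and translation t with Q ~= R @ P + t,
    for matched 3xN point clouds (Procrustes / Kabsch analysis)."""
    p0 = P.mean(axis=1, keepdims=True)        # centroids
    q0 = Q.mean(axis=1, keepdims=True)
    H = (P - p0) @ (Q - q0).T                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # guard against a reflection solution
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = q0 - R @ p0
    return R, t
```

The inverse of the recovered (R, t) is then taken as the camera motion between the two frames.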
Figure 5.4: Pre-determined path used by the robot to go through the environment and map it.
Figure 5.5: Color-depth camera used for motion estimation. (a) Matched SIFT features in red. (b) Inliers from RANSAC in yellow. (c) Two point clouds from consecutive steps (inliers) and the position of the camera before and after the estimated motion.
Figure 5.6: All the estimated positions for the camera, resulting from the motion estimation.
5.4 Occupancy Grid
We assume that the camera at its first localization defines the origin of our world. In this case, our world consists of four corridors that form a closed loop. We move the camera throughout the environment and at each step we grab an RGB image and a depth image. Starting at step two, we compute the SIFT features from the current and previous RGB images and, using these, we can compute an estimate of the transformation between the current and previous step. At the same time we use the Kinect to detect whatever is in front of the robot. Combining this information we can construct a map incrementally: at each step we compute how much the camera has moved with respect to the last step and, based on that, we can place on the map all the obstacles the sensor detects. The map we are building is an occupancy grid map that stores in each cell the probability that the cell is occupied. It is continually updated with new information. The probabilities that the sensor detects, or fails to detect, an obstacle are set beforehand. After the run is made and the occupancy grid is finished, we have a matrix with values between zero (free cell) and one (occupied cell).
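The per-cell update is commonly carried out in log-odds form with the preset hit/miss probabilities; the values below are illustrative, not the ones used in the thesis:

```python
import numpy as np

def update_cell(logodds, hit, p_hit=0.7, p_miss=0.4):
    """Bayesian log-odds update for one grid cell: add the log-odds
    of the sensor model for a 'hit' (obstacle seen) or 'miss'."""
    p = p_hit if hit else p_miss
    return logodds + np.log(p / (1 - p))

def probability(logodds):
    """Recover the occupancy probability from the log-odds value."""
    return 1 - 1 / (1 + np.exp(logodds))

l = 0.0                        # prior: probability 0.5
for _ in range(3):             # three consecutive 'occupied' readings
    l = update_cell(l, hit=True)
```

Repeated consistent readings drive the cell probability towards one (occupied) or zero (free), which is the continual update described above.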
The pre-determined path used to move our sensor around the environment consists of wandering through four corridors (simulating the corridors on the 5th floor of the North Tower at IST) that form a square and thus a closed loop. Periodically, after a certain number of steps, the robot stores the SIFT features detected at that step in its database. If, later in the mapping process, the estimated current position of the robot approaches the initial starting position, we search the database of SIFT features from that area for the best match. Using it we can compute the error in the final position, which we then use in the loop closure correction. Figures 5.7d and 5.7e illustrate the occupancy grid map achieved before and after the correction for the loop closure.
Figure 5.7: All the steps to create a map used for navigation. (a) Depth image with scan line. (b) All the scan lines superimposed in the world. (c) Occupancy grid. (d) Occupancy grid map without loop closure; the area where the loop closes is highlighted with a red rectangle. (e) Occupancy grid map with loop closure; the area where the loop closes is highlighted with a red rectangle. (f) Occupancy grid map. (g) Initial bitmap. (h) Dilated bitmap.
5.5 Patrol #1 - Run with no obstacles
In this experiment we simulate the navigation of the robot. Using the map obtained by the mapping procedure, we plan our path and send that information to the robot. The robot POV is simulated with the VRML model as in Figure 5.8a. A mission is the procedure of following a certain path to reach that mission's goal point. These goal points are listed in the mission cycle, which holds the key points or favorite locations in the map, i.e. where we want our robot to pass. At each step we simulate our sensor as a rangefinder: our Kinect is used as a LIDAR to detect walls or other obstacles that may block our path. A list of 3D points comprising the facets where our obstacles are simulated is used. When something is detected, as in Figure 5.8b, its position in the global map is computed using the current estimate of the robot position. Anything that crosses our path is considered an obstacle. Once the robot detects something in its path, it stops and recomputes the path considering the new map with the obstacle in it. The obstacles are considered to be temporary, meaning that they only exist during the current mission.
Figure 5.8: POV of the robot when there is an obstacle (5.8b) and when there is none (5.8a). (a) Free path. (b) Obstacle blocking the path. (c) Depth image for a free path. (d) Depth image for an obstacle.
In this experiment we start by doing some processing on the occupancy grid matrix. The matrix is saturated with a chosen threshold (25%) and we end up with a bitmap of the same size containing only zeros and ones. After that, the map is dilated to take into account that the robot is not just a point in the map: its body has a certain size, so the obstacles (walls) are enlarged by 3 pixels (30 cm) to account for the robot's radius. Figures 5.7f, 5.7g and 5.7h show the various stages the map goes through.
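The saturation and dilation steps can be sketched as follows (a plain-numpy sketch; the square structuring element is an assumption):

```python
import numpy as np

def grid_to_planning_map(occ, threshold=0.25, radius=3):
    """Saturate the occupancy grid into a bitmap, then dilate the
    obstacles by the robot radius (3 cells = 30 cm here)."""
    bitmap = occ > threshold               # 1 = occupied
    out = bitmap.copy()
    rows, cols = np.nonzero(bitmap)
    for r, c in zip(rows, cols):           # grow each obstacle cell
        r0, r1 = max(r - radius, 0), min(r + radius + 1, occ.shape[0])
        c0, c1 = max(c - radius, 0), min(c + radius + 1, occ.shape[1])
        out[r0:r1, c0:c1] = True
    return out
```

Planning on the dilated bitmap lets the robot be treated as a point while still keeping its body clear of the walls.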
We use this map (Figure 5.7h) as one of the inputs to the sentinel mode function. The other necessary inputs are the initial pose of the robot, the list of points of interest we want our robot to go through, and the list of facets used to simulate the Kinect when measuring depth information. The sentinel function starts with the first point of interest in the list and plans the path necessary to reach that point. Once it has a possible path, it starts a mission to move the robot there. The mission consists of moving the robot, at each step, to the next point on the path. What it actually does is move the viewpoint in the VRML world to that new location, simulating in this way what the robot is seeing. The commands given to the VRML to update the viewpoint position are incremental translations in a specific direction: if the next point on the planned path is the pixel to the North in the map, the command 'Go one step North' is sent to the VRML, which translates it into 'Move one step to my North'. Every time the robot moves, a snapshot of the RGBD is taken, and the depth information is used to check for obstacles in the path and to estimate the motion made. This information is then returned to the sentinel function to check if it moved the right amount.
If any obstacle lies on the planned path, the robot stops and re-plans, with the updated map, a new path from the position where it detected the obstacle to the same location it was originally heading for. Once it reaches the objective it is aiming for, the mission is considered a success, and the sentinel starts a new mission to the next point of interest until there are none left. For this first experiment no obstacles were put in place. The list of points the robot had to visit was the following: it starts at (x = 20; y = 10) and has to visit, in order, (100; 10) → (190; 100) → (100; 190), then return to the starting point (20; 10).
The path planning method chosen for this experiment was Probabilistic Roadmaps. In Figure 5.9 we can see the various missions the robot had to complete to reach all the locations in its mission cycle. The total path traversed by the robot can be seen in Figure 5.9e.
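A minimal Probabilistic Roadmap over such a bitmap can be sketched as follows (an illustrative sketch: the sample count, connection radius and straight-line collision check are assumptions, and the thesis implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

def segment_free(bitmap, a, b):
    """Collision check: sample points along the segment a-b."""
    for t in np.linspace(0, 1, 50):
        p = (1 - t) * a + t * b
        if bitmap[int(p[0]), int(p[1])]:
            return False
    return True

def prm_path(bitmap, start, goal, n_samples=150, connect_r=40):
    """Minimal probabilistic roadmap: sample free cells, link nearby
    mutually visible samples, then BFS from start to goal."""
    nodes = [np.asarray(start, float), np.asarray(goal, float)]
    while len(nodes) < n_samples:
        p = rng.uniform([0, 0], bitmap.shape)
        if not bitmap[int(p[0]), int(p[1])]:
            nodes.append(p)
    adj = {i: [] for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if (np.linalg.norm(nodes[i] - nodes[j]) < connect_r
                    and segment_free(bitmap, nodes[i], nodes[j])):
                adj[i].append(j)
                adj[j].append(i)
    # BFS on the roadmap (node 0 = start, node 1 = goal)
    frontier, parent = [0], {0: None}
    while frontier:
        u = frontier.pop(0)
        if u == 1:
            path, v = [], 1
            while v is not None:
                path.append(nodes[v])
                v = parent[v]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    return None
```

Re-planning after an obstacle detection amounts to marking the obstacle cells in the bitmap and calling the planner again from the current position.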
Figure 5.9: All the missions the sentinel function created, one for each goal point. (a) Mission 1. (b) Mission 2. (c) Mission 3. (d) Mission 4. (e) Total path of the robot after completing all the missions.
In this experiment, we successfully completed our objective of visiting a list of places in the map and returning to the initial location.
5.6 Patrol #2 - Unknown blocked passage
In this second navigation experiment we repeat the same list of points of interest to tour, except this time we place an obstacle somewhere in the map. The only different input here is the facets list used to retrieve the depth information, and thus used to simulate an obstacle. The sentinel function works exactly as before, except that when a mission detects an obstacle in its path it aborts. In these cases the sentinel function receives an aborted mission and has to re-plan a new path considering the detected obstacle. After computing a new trajectory, a new mission is started. By design the obstacles are considered temporary and only remain at that location for the duration of the mission (when a new mission starts, all obstacles are cleared). This is because all the obstacles detected only during the sentinel mode, and not during the mapping mode, are considered not to be structural, fixed obstacles but only temporary ones (e.g. chairs). The obstacle used in this experiment (see Fig. 5.8b) was placed in a manner that completely blocked the first corridor (x = 50 and 0 ≤ y ≤ 20), effectively closing it down. The path planning method chosen for this experiment was Probabilistic Roadmaps. The trajectories that each mission planned are presented in Figure 5.10. The total path traversed by the robot can be seen in Figure 5.10f.
Figure 5.10: All the missions the sentinel function created. There is one extra mission because of the obstacle present in the environment. (a) Mission 1. (b) Mission 2, with the obstacle present in the map. (c) Mission 3. (d) Mission 4. (e) Mission 5. (f) Total path the robot had to make to visit all the points.
In the first mission (Figure 5.10a) the planned path was essentially a straight line from the initial point to the first objective. However, the obstacle present in the map lay between them and was detected early in the mission, so the first mission was aborted and a new mission started (Figure 5.10b) with a new path computed on the updated map. The obstacle is only present in the second mission because, as mentioned above, obstacles are temporary and only taken into account during that mission. Because of the obstacle, we see in the total path (Figure 5.10f) that the robot has to go around it (in this case, find a new path to reach its objective by going the other way, since the map is a closed loop) and has to take a much longer route to reach its goals. Despite the obstacle, it was still able to visit the pre-determined locations on the map and complete its patrol, starting and finishing in the same place.
5.7 Mapping and Navigation with Monocular Vision
An experiment was conducted to test the tubular-shape based VSLAM approach with textures in a VRML corridor. For this dataset the camera went from (1, 1, 2) to (1, 1, 12), so it traveled 10 meters in the z direction. All the walls (including the floor and ceiling) of the corridor were covered with paintings (Figure 5.2) to give better texture for the visual feature detector to find and track. In this case, the initial assumptions for the camera position were correct. There were 80 steps with a step size of 0.125 m. The resulting plot of the camera poses in world coordinates is in Figure 5.11a. Figures 5.11b and 5.11c show how the state of the filter evolved over time, for position and orientation.
Figure 5.11: Results for one corridor. (a) Camera poses returned by the filter. (b) State of the filter for the position of the camera. (c) State of the filter for the orientation of the camera (Euler angles).
From the plot we can conclude that the camera moved essentially in the z direction, as expected. The drift in the orientation of the camera can be attributed to the noise present in the quality of the detected features.
To test the VSLAM method for creating a map, a test was conducted using the approach explained in Chapter 4. In this test, the camera moves along one corridor and, based on the information retrieved by the filter, a sketch of the environment is made. This sketch is drawn as two straight parallel lines with the direction of the corridor updated by the filter at each step. It results in a series of line segments that are positioned on the map according to the filter's estimate of the camera position in the world and in relation to the walls (Figure 5.12a). The result is straightforward, and the corridor is visible in the final map in Figure 5.12b. In the next test the camera moves along two corridors connected by a 90-degree corner. The results for this test are not as good. The resulting map is not a good representation of the environment because of the rotation the camera undergoes when exiting one corridor and entering another. The structural assumption about the environment used in the observation model of the Kalman filter does not take this into account and fails to keep track of the back-projection of features during the rotation. It therefore fails to understand how the camera is moving (Fig. 5.12c) and cannot update the direction of the corridor correctly. This is visible in the final map, seen in Figure 5.12d.
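The sketching step, drawing two short wall segments per filter update, can be illustrated as follows (hypothetical parameter names; a 2-D top-view sketch, not the thesis code):

```python
import numpy as np

def wall_segments(cam_pos, direction, half_width, step_len):
    """Two short wall segments, parallel to the estimated corridor
    direction, placed half_width to each side of the camera (2-D)."""
    d = direction / np.linalg.norm(direction)
    n = np.array([-d[1], d[0]])            # normal to the corridor direction
    left = (cam_pos + half_width * n,
            cam_pos + half_width * n + step_len * d)
    right = (cam_pos - half_width * n,
             cam_pos - half_width * n + step_len * d)
    return left, right
```

Concatenating these segments over all filter steps produces the sketches shown in Figures 5.12a and 5.12c.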
Since the objective of this dissertation is to study and implement an application that is ultimately able to map the environment and then navigate in it, we use the map obtained for one corridor and navigate in it by visiting a sequence of points of interest, much like taking a tour. The list of points followed was: (10, 50) → (10, 90) → (10, 140), then returning to the starting point (10, 20). The trajectory planning method chosen was Probabilistic Roadmaps. Figure 5.13 shows the paths taken to reach each point, and the total path can be seen in Figure 5.13e.
Figure 5.12: Map representation for one and two corridors. (a) Sketch representing the direction of the tube in each step for one corridor. (b) Occupancy grid map for one corridor. (c) Sketch representing the direction of the tube in each step for two corridors. (d) Occupancy grid map for two corridors.
Figure 5.13: Result for the VSLAM experiments. (a) Mission 1. (b) Mission 2. (c) Mission 3. (d) Mission 4. (e) Total path of the robot after completing all the missions.
We conclude that the robot visited all the locations on the list, as it was supposed to.
Chapter 6
Conclusion and Future Work
The work described in this thesis aimed at developing and simulating a robotic sentinel that first maps an indoor environment and secondly is able to navigate through it, computing alternative trajectories when obstacles block the previously computed ones. The main sensor available to the robot is a single video camera.
The first video camera considered was a Color-Depth (RGBD) camera. With the depth information available, we were able to infer the 3D position of the features. By using consecutive image frames, we estimated the motion of the camera as a rigid transformation. The process is the following: first, SIFT features are computed in consecutive images and matched; then the matched 2D features are converted to 3D, in the camera coordinate frame; the next step consists simply of computing a rigid transformation between both point clouds; the computed transformation is the (inverse) motion the camera made between capturing the two images.
In addition to estimating the motion of the camera, the color-depth camera allows computing a horizontal depth profile, which can be integrated over time to form an occupancy grid. The occupancy grid, in turn, allows planning trajectories.
The second video camera considered was simply a Color (RGB) camera. Given the lack of depth information, we introduced an Extended Kalman Filter based on the camera model and a constant-velocity motion model, thereby forming a visual-SLAM framework. In particular, we considered that the scenario can be described by a tubular shape, which gives further consistency to the measurements. The tubular shape hypothesis has the additional benefit of keeping the memory footprint small and constant, as the parameters describing the tube are far fewer than the number of registered features.
Both types of cameras allowed, in the end, computing an occupancy grid and planning trajectories. Navigation results were promising in both cases, as the robot visits all the predefined locations and, if an obstacle blocks its path, tries to go around it and reach the objective through an alternative path.
The use of the color (only) camera is constrained in our work by the tubular shape model. In particular, a single straight section is considered at each moment, and consequently the EKF over-smooths the camera motion at corridor corners. The navigation methodology based on the color-depth camera allows a more general world structure, but is limited by the capabilities of the camera. In some cases, the smoothing properties of an EKF are desired. In the end, a combination of both methodologies may give the best result.
As future work, the visual SLAM method can be improved with a world model taking into account multiple corridors. The explicit modeling of corners, or a function that detects the exit from one corridor and the entrance into the next, would decrease the over-smoothing of the robot trajectories. Smoothed trajectories allow a topological-like navigation, where a robot can return home and reset the localization error. The explicit modeling of corners would also help the detection of closed loops, and therefore take more advantage of graph-based trajectory planning.
Bibliography
[1] Ted J. Broida and Rama Chellappa. Estimating the kinematics and structure of a rigid object from a sequence of monocular images. IEEE T-PAMI, 13(6):497–513, June 1991.
[2] Ted J. Broida, Rama Chellappa, and S. Chandrashekhar. Recursive 3-d motion estimation from a monocular image sequence. IEEE Transactions on Aerospace and Electronic Systems, 26(4):639–656, July 1990.
[3] Stefano Carpin. Randomized motion planning: a tutorial. I. J. Robotics and Automation, 21(3), 2006.
[4] Javier Civera, Andrew J. Davison, and J. M. M. Montiel. Inverse depth parametrization for monocular SLAM. IEEE Transactions on Robotics, 24(5):932–945, October 2008.
[5] W3 Consortium. Virtual reality modeling language. http://www.w3.org/MarkUp/VRML/.
[6] Peter I. Corke. Robotics, Vision & Control: Fundamental Algorithms in Matlab. Springer, 2011.
[7] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. MonoSLAM: Real-time single camera SLAM. IEEE T-PAMI, 29(6), 2007.
[8] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localisation and mapping (SLAM): Part I, the essential algorithms. IEEE Robotics and Automation Magazine, 13(2), 2006.
[9] Albert Elfes. Using occupancy grids for mobile robot perception and navigation. Computer, 22(6):46–57, 1989.
[10] Dieter Fox, Sebastian Thrun, Wolfram Burgard, and Frank Dellaert. Particle filters for mobile robot localization, 2001.
[11] Michael A. Goodrich. Potential fields tutorial.
[12] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[13] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, Series D:35–45, 1960.
[14] J. J. Kuffner and S. M. LaValle. RRT-connect: An efficient approach to single-query path planning. In IEEE Int. Conf. on Robotics and Automation, pages 995–1001, 2000.
[15] S. LaValle. http://mappingignorance.org/2013/02/14/rapidly-exploring-manifolds-when-going-from-a-to-b-aint-easy/.
[16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[17] D. Pereira, J. Tomaz, J. Gaspar, and R. Ferreira. Mosaicing the interior of tubular structures. In RecPad, 2013.
[18] Luís Ruivo. Omni-sonda: Câmara de visão omnidireccional para inspecção visual. Master's thesis, DEEC/IST, 2007.
[19] Shaojie Shen, Nathan Michael, and Vijay Kumar. 3D indoor exploration with a computationally constrained MAV, 2011.
[20] Simeon87. http://en.wikipedia.org/wiki/Motion_planning.
[21] R. Smith, M. Self, and P. Cheeseman. Estimating uncertain spatial relationships in robotics. In I. J. Cox and G. T. Wilfong, editors, Autonomous Robot Vehicles, volume 8, pages 167–193. Springer-Verlag New York, Inc., 1990.
[22] S. Thrun. Particle filters in robotics. In Proceedings of the 17th Annual Conference on Uncertainty in AI (UAI), 2002.
[23] João Pedro Borralho Tomaz. Visual self-localization and mapping using the inside texture of a tubular structure. Master's thesis, Instituto Superior Técnico, 2013.
[24] Taylor Wang. http://taylorwang.wordpress.com/2012/04/06/collision-free-path-planning-using-potential-field-method-for-highly-redundant-manipulators/.
[25] Wikipedia. Microsoft Kinect. http://en.wikipedia.org/wiki/Kinect.
[26] Wikipedia. Rodrigues' rotation formula. http://en.wikipedia.org/wiki/Rodrigues_rotation_formula.
[27] WowWee. WowWee Rovio. http://www.wowwee.com/en/products/tech/telepresence/rovio/rovio.
[28] Kai M. Wurm, Armin Hornung, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. Oc-
tomap: A probabilistic, flexible, and compact 3d map representation for robotic systems. InIn
Proc. of the ICRA 2010 workshop, 2010.
BIBLIOGRAPHY 59
[29] Gem-Sun Young and Rama Chellappa. 3-d motion estimation using a sequence of noisy stereo
images.IEEE T-PAMI, pages 710–716, June 1988.