

AUTONOMOUS MOBILE ROBOTS: AN ARTIFICIAL INTELLIGENCE APPROACH

BY

MOHAMMED E. EL-TELBANY

Electronics Research Institute

Computers and System Department


To everyone who has considerably changed my view of the world and gives me the moral support needed for my success.


TABLE OF CONTENTS

Table of Contents
Preface

Chapter I: Intelligence in Autonomous Robots
1.1 Mobile Robots
1.2 Mobile Robots Applications
1.3 Intelligence in Robotics
1.4 Mobile Robot Sensors
    1.4.1 Tactile Sensors
    1.4.2 Infrared Sensors
    1.4.3 Sonar Sensors
    1.4.4 Compass Sensors
    1.4.5 Shaft Encoder
    1.4.6 Vision Sensors
    1.4.7 Sensors Fusion
1.5 Mobile Robot Actuators
1.6 Assignments

Chapter II: Behavior-Based Control
2.1 Behavior-Based Robotics
2.2 Subsumption Control Architecture
2.3 Potential Fields
2.4 Schema-Based Reactive Control
2.5 Temporal Sequencing
2.6 Motor Schema Versus Subsumption Architecture
2.7 Other Architectures
    2.7.1 Colony Architecture
    2.7.2 DAMN Architecture
2.8 Deliberative Control
2.9 Inspiration Sources of Behaviours
2.10 Assignments

Chapter III: Behaviors Design and Programming
3.1 Behavior Encoding
3.2 Concurrent Behaviors Combination
3.3 Subsumption Architecture Behaviors' Programming
3.4 Behaviors and Schema Theory
3.5 Motor-Schema Behavior Programming
3.6 Motor-Schema Concurrent Behaviors Combination
3.7 Assignments

Chapter IV: Mapping, Localization and Navigation
4.1 Mapping
    4.1.1 Occupancy Grids
    4.1.2 Grid or Topological Maps: A Comparison
4.2 Localization
4.3 Navigation
    4.3.2 Reactive Navigational Approaches
    4.3.3 Deliberative Approaches
        4.3.3.1 Workspace and Configuration Space
        4.3.3.2 Roadmap Methods
        4.3.3.3 Cell Decomposition
        4.3.3.4 Potential Fields
        4.3.3.5 Virtual Force Histogram
4.4 Assignments

Chapter V: Robot Learning
5.1 Machine Learning in Mobile Robots
5.2 Reinforcement Learning
    5.2.1 Monte Carlo Learning
    5.2.2 Temporal Difference Learning
    5.2.3 Q-learning Algorithm
    5.2.4 SARSA Learning Algorithm
5.3 Which Algorithm Should We Use?
5.4 Planning in Reinforcement Learning
5.5 Designing the Reinforcement Functions
5.6 Other Reinforcement Learning Techniques
    5.6.1 Evolutionary Reinforcement Learning
    5.6.2 Statistical Reinforcement Learning
5.7 Comparison Between Learning Techniques
5.8 Multi-Agent Robotic Systems
5.9 Communication Techniques in Robotics
5.10 Collective Learning

References


PREFACE

These notes discuss the artificial intelligence approaches implemented in the field of mobile robots. They were motivated by the desire to educate students interested in mobile robots and artificial intelligence. The notes attempt to provide both a foundation and an appreciation for mobile robotics, while putting newer trends in machine learning techniques, especially reinforcement learning, in perspective.

Mohammed E. El-Telbany


CHAPTER I

INTELLIGENCE IN AUTONOMOUS ROBOTS

Artificial intelligence (AI) as a discipline was born of an eagerness to understand the nature and mechanisms of intelligence and possibly replicate them. Robotics, as a main field of AI, demonstrates this clearly. Robots are generally understood as artificial creatures that exist physically in the real world and exhibit anthropomorphic characteristics. Most artificial intelligence researchers who study robotics are working on mobile robots. Mobile robots pose a unique challenge to the artificial intelligence community, since they force researchers to deal with tasks like uncertainty in sensing and action, planning, mapping, localization and navigation. Moreover, they use machine-learning techniques, such as reinforcement learning, neural networks and genetic algorithms, to improve their performance.

1.1 Mobile Robots

Russell and Norvig [1995] defined the robot as "an active, artificial agent whose environment is the physical world." This broad definition includes all artificial creatures that exist physically in the real world and interact with it in some manner. However, we can define the robot as "a machine able to extract information from its environment and use knowledge about its world to move safely in a meaningful and purposive manner." The focus will be primarily on autonomous robots, which can operate on their own without a human directly controlling them. A particular class of robots that has attracted special attention is that of mobile robots. Mobile robots emphasize problems related to the understanding of large-scale space [Dudek and Jenkin, 2000], where the influence of space creates the fundamental tasks of moving through the space, sensing the space, and reasoning about the space. Theoretically, the environment of a mobile robot can be land, water, air, space or any combination. Most research efforts have focused on the first class, namely land mobile robots, which is also the simplest case.

Mobile robots are achievements of engineering. The effectors, processors, user

interface, sensors, and communication mechanism that permit a mobile robot to operate

must be integrated so as to permit the entire system to function as a complete whole. The

effectors and sensors with which they are equipped distinguish mobile robots from each

other. It is obvious that a mobile robot needs both. An effector is any device under the

control of the robot that affects the environment. Effectors are used in two ways: to

change the position of the robot within its environment (locomotion) and to move other

objects in the environment (manipulation). To have an impact on the physical world, an

effector must be equipped with an actuator that converts software commands into physical

motion. The actuators themselves are electric motors or hydraulic or pneumatic cylinders. The

correspondence between the actuator motions in a mechanism and the resulting motion in

its various parts can be described with kinematics, the study of motion. In the next sections, descriptions of the most common sensor and actuator types used in mobile robots to date are presented, giving examples of the kind of readings that can be obtained from them and their most common applications.

1.2 Mobile Robots Applications

Mobile robotics, however, is still one step behind in industrial applications. In a lot of cases, mobile robots are used as wheeled transport vehicles in warehouses. These vehicles follow paths marked by reflective tape, paint or buried wire. In modern approaches, more autonomous systems are designed which, for example, make it possible to build mobile delivery systems for hospitals. Autonomous systems in these cases may react dynamically to changes in their environment or may easily be able to switch between different tasks. The developers of those systems created different methods, like behavior-based and bio-inspired approaches or more algorithmic solutions based on physical models, to solve the emerging problems. The real applications in which current mobile robots have been successfully installed are characterized by one or more of the following attributes: an absence of an on-site human operator, a potentially high cost, and the need to tolerate environmental conditions that might not be acceptable to a human. As such, robots are especially well suited for tasks where: (1) a human is at significant risk (i.e., nuclear, space, military), (2) the economics or menial nature of the application results in insufficient use of human workers (i.e., service industry, agriculture, cleaning streets or vacuuming apartments), (3) there is a great humanitarian risk (i.e., de-mining an area of land mines, urban search and rescue), and (4) the task has a high duty cycle or a very high fatigue factor (i.e., security guards, transportation of objects, welding).

Figure 1.1: An automated guided vehicle.

For example, in industry, a lot of different transport systems are established. Besides conveyors and forklifts, other automated transporters called AGVs (Automated Guided Vehicles) carry parts to different stations in production halls. Figure 1.1 shows such a system, where the AGVs act automatically but not autonomously. Usually, they follow marked paths to find their goal points with high accuracy, but this precision is paid for with a lack of autonomy. If one looks for autonomous systems that can perform transportation tasks, very often behavior-based systems can be found. Figure 1.2 shows another robot with a manipulator arm and a payload of about 100 kg. Again, this system is controlled with a behavior-based approach. Mobile robots equipped with a manipulator arm, in particular, are very rarely found as commercial products.

Figure 1.2: A mobile robot with arm and a high payload of about 100 kg.

Figure 1.3 shows a service robot as a last example. Such a service task does not require millimeter accuracy. In fact, these systems need a reactive behavior in the manner of "the cup is not above the table yet, so move a bit forward".

Figure 1.3: A service robot delivers a cup.

1.3 Intelligence in Robotics

Mobile robots pose a unique challenge to the artificial intelligence community, since they force researchers to deal with tasks like uncertainty in sensing and action, planning, learning, reliability, and real-time response. By building and expanding this knowledge into the mobile robot, it may successfully become an intelligent robot. However, there are far more fundamental problems that have to be solved before the robot becomes completely intelligent. These problems are:

Navigation: deals with the problem of how to get from point A to point B. In order to move around in the house, the robot needs to be able to avoid running into things. Much research has been devoted to this problem, but no solution exists that solves it in all cases, as all sensors have limited regions of operation (cf. Chapter IV).

Localization: the robot also needs information about its position in the world. One can think about different ways to express this information. It could be in relation to some global coordinate system, or it could be relative to some objects. A combination is likely to be needed, as every physical contact requires the robot to position itself relative to the object, whereas the robot will need its global position when reasoning about how to go from one place to another. In order to do anything meaningful, a model of the world is needed. This model, or map, can be of different types, e.g., grid maps and feature-based maps (cf. Chapter IV). The way the map is acquired also varies from system to system, but for a fully autonomous system, the robot acquires the map on its own. Hence an unknown environment must be explored and a map of it built.

Recognition: in order to know what to manipulate and where to navigate, objects have to be recognized. In the past this problem has been solved by limiting the set of objects to contain only objects that can be unambiguously identified by simple features, such as color or shape. In real environments this is not feasible, so the robot has to rely on some kind of information extracted from its vision, as we do.

Learning: Learning is one of the important issues in intelligent robotics. The goal of learning is to prepare the robot to deal with unforeseen situations and circumstances in its environment by automatically acquiring knowledge, developing strategies and modifying its behavioral tendencies through experience. The robots would like to adapt themselves to changes of external and internal circumstances (e.g., changes in terrain, drift in sensors and actuators, loss of power).

The four key problems mentioned above can be studied individually. However, each one complements the others and assists in building a fully intelligent and autonomous robot.

1.4 Mobile Robot Sensors

Sensors are devices that can sense and measure physical properties of the environment and deliver low-level information about the environment the robot is working in. This information is noisy (i.e., imprecise), often contradictory and ambiguous. The sensors can be grouped into internal and external sensors. The internal sensors measure some internal state, such as the orientation of the wheel axis or the direction of the camera. The external sensors measure, e.g., distance to objects or position relative to a beacon. Another possible way to group the sensors would be into those that are active and those that are passive. The active sensors probe the world by looking at the effect of some action. Here we find sensors like sonar that probe the world by sending out a sound signal and looking at what comes back. The camera, used in ambient light, is on the other hand an example of a passive sensor. All sensors are characterized by a number of properties that describe their capabilities. The most important are [Nehmzow, 2000]:

Sensitivity: ratio of change of output to change of input.

Linearity: measure of the constancy of the ratio of input to output.

Measurement range: difference between minimum and maximum values measurable.

Response time: time required for a change in input to be observable in the output.

Accuracy: the difference between actual and measured value.


Resolution: smallest observable increment in input.

Repeatability: the difference between successive measurements of the same entity.

Type of output.

A good physical understanding of the sensors will greatly increase the ability to utilize their potential. Moreover, it is important to understand how the sensors work, as they provide the input to the algorithms.

1.4.1 Tactile Sensors

Tactile sensors detect physical contact with an object. The simplest tactile sensors are micro-switches or whisker sensors, as shown in Figure 1.4. When a bumper is in contact with an object, a micro-switch is closed and gives an electrical signal that can be sensed by the controller. Other methods of building simple touch sensors include using strain gauges, or piezoelectric transducers that measure the degree of deformation and give more information than binary micro-switch sensors.

Figure 1.4: Tactile sensors. (a) A micro-switch used as a short-range digital touch sensor. (b) A whisker and micro-switch used as a long-range touch sensor.

1.4.2 Infrared Sensors

Infrared (IR) sensors are probably the simplest type of non-contact sensors; they are widely used in mobile robots to detect obstacles. They operate by emitting infrared light and detecting any reflections off surfaces in front of the robot. To differentiate the emitted IR from ambient IR stemming from fluorescent tubes or the sun, the emitted signal is usually modulated with a low frequency (e.g., 100 Hz). Since the intensity of the reflected light is inversely proportional to the square of the distance, IR sensors are inherently short-range sensors, and the maximum ranges for reflection detection are 50 to 100 cm.
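As a rough illustration of this inverse-square relationship, the sketch below (in Python; the calibration constant and the example readings are assumptions, not values for any particular sensor) converts a reflected-intensity reading into an approximate range and treats anything beyond the usable range as "no detection".

```python
import math

def ir_intensity_to_range(intensity, k=2500.0, max_range_cm=80.0):
    """Estimate range from reflected IR intensity using an inverse-square model.

    intensity ~ k / d^2  =>  d ~ sqrt(k / intensity).
    k is a hypothetical calibration constant; beyond max_range_cm the
    reflection is too weak to be trusted, so we report "no detection".
    """
    if intensity <= 0:
        return None                      # nothing reflected back
    d = math.sqrt(k / intensity)         # inverse-square assumption
    return d if d <= max_range_cm else None

# Example: a strong reading maps to a short range, a weak one to "no detection".
print(ir_intensity_to_range(25.0))   # ~10 cm
print(ir_intensity_to_range(0.2))    # None (beyond the usable range)
```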

1.4.3 Sonar Sensors

In a sonar sensor, a powerful pulse covering a range of frequencies is emitted, and its reflection off objects ahead of the sensor is detected by a receiver. The sensitivity of the sonar receiver is not uniform, but consists of a main lobe and side lobes. The echoed signal can be processed in many ways, but the most common way is to measure the time-of-flight (TOF), i.e., the time from the transmission to the reception of the signal. The received signal is integrated until it reaches a certain threshold, and then the time is measured. As the speed of sound is known, the distance to the object that reflected the pulse can be computed from the time that elapsed between emission and reception, using Equation 1.1.

d = (1/2) · v · t        (1.1)

with d being the distance to the nearest object within the sonar cone, v the speed of the emitted signal in the medium (i.e., 344 m/s for sound in air at 20 °C), and t the time elapsed between the emission of a pulse and the reception of its echo, as shown in Figure 1.5.

Figure 1.5: The echo signal of robot’s sonar signals.
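A minimal sketch of Equation 1.1 in code is given below; the speed of sound is the value quoted above, and the example time of flight is hypothetical.

```python
SPEED_OF_SOUND = 344.0   # m/s in air at 20 °C (value used in Equation 1.1)

def sonar_range(time_of_flight_s, v=SPEED_OF_SOUND):
    """Distance to the nearest object: d = (1/2) * v * t.

    The factor 1/2 accounts for the pulse travelling to the object and back.
    """
    return 0.5 * v * time_of_flight_s

# An echo received 5.8 ms after emission corresponds to roughly 1 m.
print(sonar_range(5.8e-3))   # ~0.998 m
```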

There are a number of uncertainties associated with readings from sonar sensors. Firstly, the sensitivity of sonar is cone-shaped, as shown in Figure 1.6, and an object detected at distance d could be anywhere within the sonar cone, on an arc at distance d from the robot. Thus, the accuracy of object position measurement is a function of the width of the sonar beam pattern.

Figure 1.6: The sonar beam pattern at 50 kHz.

Secondly, objects with flat surfaces and sharp edges reflect very little echo in most directions and will probably not be noticed; this is called the specular reflection problem. Finally, after being reflected back from a surface, the sound may strike another surface and be reflected back to the sensor. The time delay will then not correspond to a physical object, but to a "ghost" object.

1.4.4 Compass Sensors

Compass sensors are very important for navigation applications and are widely used. The compass mechanisms used in mobile robots measure the horizontal component of the earth's natural magnetic field.

1.4.5 Shaft Encoder

Shaft encoders are used to measure the rotation of the robot's axles. They are mounted on the rotating shaft to generate a binary representation of the shaft's position. There are two types of shaft encoders: absolute and incremental. In absolute shaft encoders, a disk showing a Gray code is read by an optical sensor and indicates the shaft's position. Incremental encoders are used for integrating movement over time. Shaft encoders are used to measure the translational and rotational movement of the robot. This method is called path integration or dead reckoning. Based on their measurements, odometry can provide an estimate of the robot's location that is very accurate when expressed relative to the robot's previous location. Unfortunately, because of slippage as the robot moves, the position error from wheel motion increases.
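The sketch below illustrates dead reckoning for a differential-drive robot from incremental encoder ticks; the wheel radius, axle length, encoder resolution and tick counts are hypothetical, and the simple kinematic model ignores the wheel slippage that, as noted above, makes the odometric error grow over time.

```python
import math

TICKS_PER_REV = 512        # hypothetical incremental-encoder resolution
WHEEL_RADIUS = 0.05        # m
AXLE_LENGTH = 0.30         # m, distance between the two wheels

def update_pose(x, y, theta, left_ticks, right_ticks):
    """Integrate one pair of encoder readings into a new (x, y, theta) pose."""
    dl = 2 * math.pi * WHEEL_RADIUS * left_ticks / TICKS_PER_REV   # left wheel arc
    dr = 2 * math.pi * WHEEL_RADIUS * right_ticks / TICKS_PER_REV  # right wheel arc
    dc = (dl + dr) / 2.0                 # distance travelled by the robot centre
    dtheta = (dr - dl) / AXLE_LENGTH     # change of heading
    x += dc * math.cos(theta + dtheta / 2.0)
    y += dc * math.sin(theta + dtheta / 2.0)
    return x, y, theta + dtheta

pose = (0.0, 0.0, 0.0)
for left, right in [(100, 100), (80, 120), (100, 100)]:   # hypothetical tick counts
    pose = update_pose(*pose, left, right)
print(pose)
```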

1.4.6 Vision Sensors

To supplement sonar information, real-time vision is often used, where structured light sensors can determine the shape of an object by projecting stripes of light on it, and stereo cameras provide pairs of images recorded simultaneously for depth calculations. Cameras are used to generate matrices of numbers that correspond to the light-level distribution in an image. Cameras are available for grey-level and color image acquisition, with a range of image resolutions and frame rates. All vision systems require at least the following components: an object to be viewed, an illumination source, a vision sensor, an edge detector, a line detector, an object detector, and previously recorded information for interpreting what is being seen, as shown in Figure 1.7.

Figure 1.7: The vision system architecture.


Edge detection is the process of finding the boundaries between areas of different image intensity.

Motion detection is the process of detecting things that are moving in the field of

vision. You can detect motion by comparing two edge-detection results. If an edge has

moved or a new edge has appeared since the previous edge-detection process was

performed, motion has occurred.

Object line detection is the process of connecting detected edges into lines.

Object shape detection is the process of assembling detected lines to see if they form

some known shape.

Image understanding, the final step of vision processing, is where the system assesses whether it is seeing the right things. Color, depth, and texture provide additional information about the viewed scene.

Computer vision requires knowledge of the field of computer graphics and image

processing and will not be addressed in this report.
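Although a full treatment is out of scope, the following toy sketch illustrates the edge-detection and motion-detection steps described above: edges are marked where neighboring pixel intensities differ strongly, and motion is flagged wherever the edge map changes between two frames. The images and the threshold here are synthetic assumptions, not part of any particular vision system.

```python
import numpy as np

def detect_edges(image, threshold=30.0):
    """Mark pixels where the horizontal or vertical intensity gradient is large."""
    gx = np.abs(np.diff(image.astype(float), axis=1))   # horizontal differences
    gy = np.abs(np.diff(image.astype(float), axis=0))   # vertical differences
    edges = np.zeros_like(image, dtype=bool)
    edges[:, :-1] |= gx > threshold
    edges[:-1, :] |= gy > threshold
    return edges

def detect_motion(prev_edges, curr_edges):
    """Motion is flagged wherever an edge appeared or disappeared between frames."""
    return prev_edges ^ curr_edges

# Two synthetic 8x8 frames: a bright block that shifts one pixel to the right.
frame1 = np.zeros((8, 8)); frame1[2:5, 2:5] = 255
frame2 = np.zeros((8, 8)); frame2[2:5, 3:6] = 255
moved = detect_motion(detect_edges(frame1), detect_edges(frame2))
print(int(moved.sum()), "pixels changed between the two edge maps")
```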

1.4.7 Sensors Fusion

Sensor fusion is a technical term used to describe the process of using information from several sensor modalities (i.e., types of sensors) to form one image of the world. Usually, assumptions have to be made in order to form a global view of the world, and results have to be amended in the light of new evidence. Because sensors have different characteristics, different strengths and different weaknesses, sensor fusion is an extremely difficult task. In the context of mobile robots, sensor fusion must be accomplished along at least three different axes: (1) sensors, (2) positions, and (3) times. Combining measurements from multiple positions, times, or sensors is closely related to interpolation, extrapolation, and fitting an analytical function to data [Dudek and Jenkin, 2000]. One method of sensor fusion is to use artificial neural networks that learn to associate perception with some kind of output (e.g., an action, or a classification) [Nehmzow, 2000].
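As a simpler alternative to the learned neural-network association mentioned above, the sketch below fuses two range estimates by inverse-variance weighting, so the less noisy sensor dominates the result; the readings and noise variances are hypothetical.

```python
def fuse_ranges(sonar_range, sonar_var, ir_range, ir_var):
    """Combine two range estimates by weighting each with the inverse of its variance.

    The noisier sensor contributes less to the fused estimate; the variances
    here would come from a sensor model and are assumptions in this sketch.
    """
    w_sonar = 1.0 / sonar_var
    w_ir = 1.0 / ir_var
    fused = (w_sonar * sonar_range + w_ir * ir_range) / (w_sonar + w_ir)
    fused_var = 1.0 / (w_sonar + w_ir)
    return fused, fused_var

# Sonar says 0.92 m (noisy), IR says 0.85 m (more precise at short range).
print(fuse_ranges(0.92, 0.04, 0.85, 0.01))   # fused estimate lies closer to the IR reading
```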

1.5 Mobile Robot Actuators

Robot actuators can take many forms such as a tool for drilling, grinding, painting, and

welding as shown in Figure 1.8. Three types of power are used for actuators: electrical,

pneumatic, and hydraulic. Electrical power is embodied in an electric motor, usually a DC motor or a stepper motor. The former is the easiest to control: applying DC power operates the motor. The latter's stator coils are sectioned, so that the rotor movement can be determined by applying a voltage to the desired section of the coil. This generates a rotation by a known angle. Stepper motors can be used for fine and precise motion.

The simplest and cheapest of all the drives are pneumatic actuators, which typically operate between fixed positions by pumping compressed air between air chambers, and thus moving a piston. However, they are not capable of high precision. Hydraulic actuators use pressurized oil to generate motion. They are able to generate both high power and high-precision movements, but are usually too heavy, dirty and expensive to be used on mobile robots.


Figure 1.8: The three-finger gripper.

1.6 Assignments

1. Write a report on the impact of mobile robots on society. (Hint: explore the applications of mobile robots in industry, agriculture, services, etc., by searching the Internet.)

2. What is an intelligent robot?

3. Write a report surveying the methods and techniques used in sensor fusion.

4. For the Electronics Research Institute mobile robot platform, obtain a copy of the system documentation. How is it powered, recharged, and controlled? Learn how to power the robot on and off and how to control it via software.


CHAPTER II

BEHAVIOR-BASED CONTROL

Building a mobile robot that shows intelligent behavior while executing a complex task is an interesting research challenge. At the same time, there has been an upsurge of interest in learning in autonomous mobile robots that deals with the behavior-based paradigm. This approach concentrates on physical systems situated in the real world and promotes simple associative learning between sensing and acting. Behavior-based control has served as an effective methodology for autonomous mobile robot control and learning in a large number of autonomous robot problem domains. This chapter overviews the behavior-based control methods and compares them to show their pros and cons.

2.1 Behavior-Based Robotics

There are two approaches used for robot control: the traditional approach and the behavior-based approach. The traditional approach (Akin, 1998; Russell and Norvig, 1995) structures a robot's control into functional modules – perception, planning, learning, etc. – and constrains as much as possible the environment where the robot will operate. A model of the environment is created, and the robot preprocesses sensor information into abstracted internal representations that are acted on by a central planner; the results are then instantiated as actions that can be executed by the robot to reach a specific goal. This can be represented as a horizontal control architecture, as shown in Figure 2.1.

Figure 2.1: Horizontal control architecture of the robot.

Control systems using this architecture solve their task in several steps. First, the sensor input is used to modify the internal representation of the environment. Second, planning is performed based on the internal representation. This results in a series of actions for the robot to take to reach a specified goal. Third, this series of actions is used to control the motors of the robot. This completes the cycle of the control system, and it is restarted to achieve new goals. Shakey, the first general-purpose mobile robot, constructed in the late 1960s at the Stanford Research Institute and shown in Figure 2.2, demonstrated that general-purpose planning systems using STRIPS were not very efficient and much too slow for practical use.


Figure 2.2: Shakey, the first "general-purpose" mobile platform.

The general planning approach has several problems. Maintaining the model is in many cases difficult because of sensor limitations or imperfections. The plans produced by the planner often do not have the anticipated effects in the real world, so planning cannot be done using this open-loop approach (Wyatt, 1997; Gat, 1998). Brooks (1990a; 1990b; 1991a; 1991b) argues that this traditional approach is very difficult and slow, and that the resulting programs are very brittle and not generalizable to other robotic platforms or to a different environment. It is clear that a different approach is required if one wants to bring robots to everyday life: systems that are more robust and that do not require too many modifications to the layout of the environment where the robot will operate. To overcome these limitations, the behavior-based control approach was developed (Brooks, 1986). In the behavior-based approach, instead of decomposing the task based on functionality, the decomposition is done into task-achieving modules or competences, called behaviors, layered on top of each other as shown in Figure 2.3; this is called a vertical control architecture.

Figure 2.3: Vertical control architecture of the robot.


The most common behavior-based control approaches are the subsumption (Brooks, 1986) and motor schemas (Arkin, 1989; 1998; Arkin and Balch, 1997; Murphy, 2000) architectures, which are made up of several parallel running behaviors1. Each behavior calculates a mapping from sensor inputs - only the sensor inputs relevant for the task of that behavior are used - to motor outputs (cf. Appendix A). The suggested motor outputs from the behavior with the highest priority are used to control the robot's motors, or they are summed to generate one motor output. These architectures are called behavior-based control approaches and represent methodologies for endowing robots with a collection of intelligent behaviors (Matarić, 1994b; 1995b; 1995c; 1997a; 1998a; Parker, 1996). Behavior-based approaches are an extension of reactive architectures; their computation is not limited to lookup tables and the execution of simple functional mappings. Behavior-based systems are typically designed so that the effects of the behaviors interact in the environment rather than internally through the system.

2.2 Subsumption Control Architecture

Brooks' subsumption architecture is a hierarchical architecture (Brooks, 1986). Lower levels control "instinctive" behaviors and higher levels control tasks that are regarded as more abstract. Each behavior consists of a fixed hardwired finite state machine (FSM), and cannot be altered without redesigning the whole system, as shown in Figure 2.4. The high-level behaviors can influence low-level ones through subsumption links. These links can either inhibit the lower levels completely, or influence their behaviors.

Figure 2.4: Behavior module as augmented FSA.

A major limitation of the subsumption architecture is its complexity. Creating correct subsuming and inhibitory links into other behaviors requires detailed knowledge of how the other modules, and the system as a whole, function. Another important limitation is the lack of explicit goals and of goal-handling capabilities (Pirjanian, 1999). Since there is no centralized control, the higher-level behavior wins the arbitration when behaviors compete against lower-level ones, and the functionality of the subsumption architecture is viewed as an emergent property of the interaction of its behaviors, which depends on the static and dynamic properties of the environment.

1 There are infinitely many possible robot control schemes. These schemes are classified into four classes: reactive control ("don't think, react"), deliberative control ("think, then act"), hybrid control ("think and act independently in parallel"), and behavior-based control ("think the way you act"). Each of these approaches has its strengths and weaknesses, and all play important and successful roles in certain problems and applications. Behavior-based and hybrid control have the same expressive and computational capabilities: both can store representations and look ahead, but they work in very different ways. Reactive control is also popular in highly stochastic environments that demand very fast responses. Deliberative control is the best choice for domains that require a great deal of strategy and optimization, and in turn search and learning (Matarić, 2002).
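The following minimal sketch illustrates the priority-based, winner-take-all arbitration described above; the behaviors, sensor fields, and motor commands are hypothetical, and real subsumption layers are implemented as augmented finite state machines with suppression and inhibition links rather than as plain functions.

```python
def wander(sensors):
    """Lowest layer: always proposes a slow forward drift."""
    return (0.2, 0.0)                     # (forward speed, turn rate)

def avoid_obstacles(sensors):
    """Higher layer: takes over only when something is close ahead."""
    if sensors["front_range"] < 0.5:
        return (0.0, 1.0)                 # stop and turn away
    return None                           # no opinion; let lower layers act

# Behaviors listed from highest to lowest priority, as in a subsumption stack.
LAYERS = [avoid_obstacles, wander]

def arbitrate(sensors):
    """The highest-priority behavior that produces an output suppresses the rest."""
    for behavior in LAYERS:
        command = behavior(sensors)
        if command is not None:
            return command
    return (0.0, 0.0)                     # nothing active: stand still

print(arbitrate({"front_range": 2.0}))    # -> wander's command
print(arbitrate({"front_range": 0.3}))    # -> avoid_obstacles suppresses wander
```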

2.3 Potential Fields

The potential-field approach is an approach to motion planning where the robot, which is represented as a point in configuration space, moves under the influence of an artificial potential field produced by an attractive force at the goal configuration and repulsive forces at the obstacles (Khatib, 1986). The motor commands of the robot at any position in an artificial potential field correspond to the vector on which the robot is situated. Goals attract, and thus the goals will have vectors pointing towards them; obstacles repulse, and will be surrounded by vectors pointing away. Forces can be of constant magnitude on every vector in a system, or operate on a gradient. Many potential field implementations have goals exert a constant force across the entire field, while obstacles exert a greater outward force the nearer the position is to the obstacle. The vector at any given position is computed by summing all the forces exerted at that position. In a simple domain with one goal and two obstacles, the potential function U is constructed as the sum of two more elementary potential functions:

U(q) = U_att(q) + U_rep(q)        (2.1)

where U_att is the attractive potential and U_rep is a repulsive potential associated with obstacles. The force F(q) = -∇U(q) corresponds to the most promising direction to move in. From Equation (2.1) it is seen that F is the sum of two forces: an attractive force F_att = -∇U_att and a repulsive force F_rep = -∇U_rep. These force vectors will generally point directly towards the goal except in the area around the obstacles, as shown in Figure 2.5. There are many ways to define the elementary functions. For example, the attractive potential field U_att can be defined as a parabolic well:

U_att(q) = (1/2) ξ ||q - q_goal||²        (2.2)

where ξ > 0 is a scaling factor and q_goal denotes the goal configuration. Figure 2.5(b) is a plot of U_att for a two-dimensional case. The repulsive potential field U_rep can be defined as follows:

U_rep(q) = (1/2) η (1/ρ(q) - 1/ρ₀)²  if ρ(q) ≤ ρ₀,   U_rep(q) = 0  if ρ(q) > ρ₀        (2.3)

where η > 0 is a scaling factor, ρ(q) denotes the distance from q to the closest obstacle, and ρ₀ > 0 is the distance within which the robot is under the influence of the repulsive force. Figure 2.5(c) is a plot of U_rep for a two-dimensional case. Figure 2.5(d) is a plot of the total potential field function U. These force vectors will generally point directly towards the goal except in the area around the obstacles, as shown in Figure 2.5.

Figure 2.5: (a) The configuration of obstacles (black) in the map. (b) The attractive potential associated with the goal configuration. (c) The repulsive potential associative with obstacles. (d) The sum of the attractive and repulsive potentials (adopted from Pirjanian, 1999).
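The sketch below follows the gradients of Equations 2.1 to 2.3 for point obstacles, taking small steps along the resultant force; the scaling factors ξ and η, the influence distance ρ₀, the step size, and the obstacle layout are all assumed values chosen purely for illustration.

```python
import math

XI, ETA, RHO0 = 1.0, 0.5, 1.0            # scaling factors and influence distance (assumed)

def attractive_force(q, q_goal):
    """F_att = -grad U_att = -xi * (q - q_goal): always points at the goal."""
    return (-XI * (q[0] - q_goal[0]), -XI * (q[1] - q_goal[1]))

def repulsive_force(q, q_obs):
    """F_rep = -grad U_rep: pushes away from an obstacle closer than rho0."""
    dx, dy = q[0] - q_obs[0], q[1] - q_obs[1]
    rho = math.hypot(dx, dy)
    if rho > RHO0 or rho == 0.0:
        return (0.0, 0.0)                # outside the obstacle's region of influence
    scale = ETA * (1.0 / rho - 1.0 / RHO0) / (rho ** 3)
    return (scale * dx, scale * dy)

def resultant(q, q_goal, obstacles):
    """Sum the attractive force and all repulsive forces (Equation 2.1)."""
    fx, fy = attractive_force(q, q_goal)
    for q_obs in obstacles:
        rx, ry = repulsive_force(q, q_obs)
        fx, fy = fx + rx, fy + ry
    return fx, fy

# Take small steps along the resultant force from the start towards the goal.
q, goal, obstacles = (0.0, 0.0), (5.0, 0.0), [(2.5, 0.3)]
for _ in range(200):
    fx, fy = resultant(q, goal, obstacles)
    q = (q[0] + 0.05 * fx, q[1] + 0.05 * fy)
print(q)   # should end up close to the goal, having bent around the obstacle
```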

There are two major difficulties with the potential fields method as applied to robot

navigation. The first is that this method can be slow and require substantial internal state,

as it appears that vectors must be computed based on a map of the entire region. But a

potential fields method can be easily adapted to a reactive system. There is no need to

factor in forces outside of the robot's perceptual range, though true potential fields require computing vectors based on the entire field. A goal outside the perceptual range

may need to be maintained, but obstacles that the robot cannot sense do not necessarily

need to affect the motion vectors. Thus the vector accompanying the current position of

the robot can be computed using only the sensory information from a single time step

(Arkin, 1989). This vector computation using only current sensor readings makes this

method reactive, in that it requires no internal state, and allows the vector on the current

position of the robot to be computed quite quickly. The second potential field problem is

local minima. The robot’s motion at any moment is the sum of the forces affecting it. What

happens, then, when these forces sum perfectly to zero? The robot could remain

indefinitely in a single position, perfectly immobilized at a local minimum. If no force


disrupts this equilibrium, movement may cease. Thus any potential fields method must

employ strategies for coping with this problem of local minima. One of these strategies is

generating a time-varying noise vector; another is to relocate the goal temporarily when stuck in a

local minimum, or to mark areas that have been visited already as impassable. None of these methods avoids the existence of local minima, and they are not always successful in their solutions.

Potential field navigation is very suitable for local navigation since the environment has

only to be known in the vicinity of the vehicle and only a short piece of the path is

calculated with each evaluation of the force fields.

2.4 Schema-Based Reactive Control

The potential fields method is sufficient to create good behavior when the environment consists of only simple attractive and repulsive forces, but it is not designed to support more complicated and situation-dependent forces. The motor schema method is the reactive control system component of Arkin's Autonomous Robot Architecture (AuRA) and is an attempt to create a more general, behavior-based conception of the potential field method (Arkin, 1989; 1998; Arkin and Balch, 1997; Murphy, 2000). A motor-schema system consists of a pre-programmed collection of parallel, concurrently active behaviors, some of which gather sensor information, called perceptual schemas, and some of which drive effectors, called motor schemas. Each perceptual schema is responsible for one sensory modality, and each motor schema is responsible for one type of primitive behavior, expressing a goal or constraint for a task.

added to each other resulting in a field taking information from all behaviors into account.

The direction and magnitude of the vector field are determined by the nature and gain of

the schema. The absence of arbitration between behaviors (schemas), the fusion of

behavioral outputs using vector summation in a manner analogous to the potential field

method2. As an example, important schema for a navigational task would include

avoid_obstacles and move_to_goal. Since schemas are independent, they can run concurrently,

provide parallelism and speed. Sensors inputs are processed by perceptual-schemas embedded

in the motor behaviors. Perceptual processing is minimal and provides just information

pertinent to the motor schema. For instance, a find_obstacles perceptual schema, which

provides a list of sensed obstacles, is embedded in the avoid_obstacles motor schema.

The concurrently running motor schemas are integrated as follows. First, each schema produces a vector indicating the direction the robot should move to satisfy that schema's goal or constraint. The magnitude of the vector indicates the importance of achieving it. It is not critical, for instance, to avoid an obstacle if it is distant, but critical if it is close by. The magnitude of the avoid-obstacles vector is correspondingly small for distant obstacles and large for close ones. The importance of motor schemas relative to each other is indicated by the gain value for each one. The gain is usually set by the human designer, but may also be determined by learning using genetic algorithms (Pearce et al., 1992; Arkin, 1998). Each motor vector is multiplied by its associated gain value, and the results are summed and normalized. The resultant vector is sent to the robot hardware for execution, as shown in Figure 2.6.

2 The programming of schema-based reactive control is demonstrated in Appendix A.

Figure 2.6: Motor schema example. The diagram on the left shows a vector corresponding to move-to-goal schema; the center diagram shows

avoid-obstacles schema. On the right, the two schemas are summed, resulting in a complete behavior for reaching the goal.
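A minimal sketch of this gain-weighted vector summation is given below; the move_to_goal and avoid_obstacle schemas, their fields, and the gains are simplified assumptions rather than Arkin's exact formulations.

```python
import math

def move_to_goal(robot_pos, goal_pos):
    """Unit vector pointing from the robot towards the goal."""
    dx, dy = goal_pos[0] - robot_pos[0], goal_pos[1] - robot_pos[1]
    d = math.hypot(dx, dy) or 1.0
    return (dx / d, dy / d)

def avoid_obstacle(robot_pos, obstacle_pos, influence=1.0):
    """Vector pointing away from the obstacle, larger the closer it is."""
    dx, dy = robot_pos[0] - obstacle_pos[0], robot_pos[1] - obstacle_pos[1]
    d = math.hypot(dx, dy)
    if d >= influence or d == 0.0:
        return (0.0, 0.0)
    w = (influence - d) / influence       # 0 at the edge of influence, 1 at contact
    return (w * dx / d, w * dy / d)

def combine(schema_outputs, gains):
    """Multiply each schema vector by its gain, sum, then normalize the result."""
    sx = sum(g * v[0] for g, v in zip(gains, schema_outputs))
    sy = sum(g * v[1] for g, v in zip(gains, schema_outputs))
    n = math.hypot(sx, sy) or 1.0
    return (sx / n, sy / n)

robot, goal, obstacle = (0.0, 0.0), (4.0, 0.0), (0.5, 0.2)
vectors = [move_to_goal(robot, goal), avoid_obstacle(robot, obstacle)]
print(combine(vectors, gains=[1.0, 2.0]))   # heading bent away from the obstacle
```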

2.5 Temporal Sequencing

If a single set of schemata does not seem sufficient to obtain the desired behavior, motor schemata can be clumped together into complex, emergent behaviors, in which a behavior consists of a number of different schemata. Groups of behaviors are referred to as behavioral assemblages. One way behavioral assemblages may be used in solving a complex mission is to develop an assemblage for each task and to execute the assemblages in an appropriate sequence. The steps in the sequence are separate behavioral states. Perceptual events that cause transitions from one behavioral state to another are called perceptual triggers. The resulting mission-solving strategy can be represented as a Finite State Automaton (FSA). The technique is referred to as temporal sequencing (Arkin and MacKenzie, 1994; Arkin, 1998).

Figure 2.7: The Forage finite state automaton (FSA).

Consider, for example, the foraging task shown in Figure 2.7. The forage task for a robot is to wander about the environment looking for items of interest (attractors). Upon encountering one of these attractors, the robot moves towards it, finally attaching itself. After attachment the robot navigates to the home base, where it deposits the attractor. In this approach to solving the forage task, a robot can be in one of three behavioral states: wander, acquire and deliver. The robot begins in the wander state. If there are no attractors within the robot's field of view, the robot remains in wander until one is encountered. When an attractor is encountered, a transition to the acquire state is triggered. While in the acquire state the robot moves towards the attractor and, when it is sufficiently close, attaches to it. The last state, deliver, is triggered when the robot attaches to the attractor. While in the deliver state the robot carries the attractor back to home base. Upon reaching home base, the robot deposits the attractor there and reverts to the wander state (Balch, 1998a).
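The forage FSA can be sketched as a small table-driven state machine, as below; the state and trigger names follow the description above, while the percept inputs are hypothetical.

```python
# States and perceptual triggers of the forage FSA (Figure 2.7), sketched as a
# dictionary-driven state machine; the percept names are illustrative only.
TRANSITIONS = {
    ("wander", "attractor_visible"): "acquire",
    ("acquire", "attractor_attached"): "deliver",
    ("deliver", "at_home_base"): "wander",
}

def step(state, percepts):
    """Fire the first perceptual trigger that applies to the current state."""
    for percept in percepts:
        nxt = TRANSITIONS.get((state, percept))
        if nxt is not None:
            return nxt
    return state            # no trigger fired: stay in the same behavioral state

state = "wander"
for percepts in [[], ["attractor_visible"], [], ["attractor_attached"], ["at_home_base"]]:
    state = step(state, percepts)
    print(state)
# wander -> acquire -> acquire -> deliver -> wander
```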

2.6 Motor Schema Versus Subsumption Architecture

Comparison with the subsumption architecture can highlight some of the distinctive aspects of the motor schema architecture. Subsumption architectures consist of a behavioral hierarchy, with each behavior running simultaneously and in parallel. As in the motor schema approach, each behavior has independent access to the sensory data, and uses that data to formulate an action to be taken, or it may decide to recommend no action at all. But unlike the motor schema approach, a subsumption-style architecture creates a hierarchy of behaviors, consisting of low-level behaviors that have no knowledge of the higher-level behaviors. Coordination of behaviors occurs according to the priority hierarchy in a winner-take-all approach. In the motor schema approach the action outputs of each of the schemas are combined by summation, and all the schemata are blended together to produce the actions of the robot at each time step (see §2.2 and §2.4). Each type of architecture has advantages and disadvantages. Subsumptive design gives a single behavior control over the system, meaning that the behavior is given complete priority. This can mean that the robot does whatever behavior is currently in control very well, but it can also mean that the robot loses sight of the larger goals of the system. Motor schemata allow the blending of behaviors, allowing the behavior of the robot to combine multiple goals. But the motor schema architecture must contend with local minima and with situations in which having multiple goals means that none of them is fulfilled to the designer's satisfaction. Moreover, motor schemata have inherent flexibility due to the dynamic instantiation and deinstantiation of behaviors on an as-needed basis, and easy reconfigurability through the use of high-level planners or adaptive learning systems (Pearce et al., 1992; Gat, 1998; Connell, 1991). The comparison is summarized in Table 2.1.

                              Subsumption architecture    Motor schemata
Hierarchical                  Yes                         No
Reconfigurability/Learning    No                          Yes
Arbitration mechanism         Competitive                 Cooperative
Behavior implementation       Hardwired                   Software
Behavior reusability          No                          Yes

Table 2.1: Comparison between the subsumption architecture and motor schemas.


2.7 Other Architectures

There is a wide range of other behavior-based architectures that share a philosophy of eliminating or reducing symbolic representational knowledge and using behavioral units as their primary building blocks. Each one's uniqueness arises from its choice of coordination mechanisms, the response encoding method used, and the design methodology. A survey of these architectures is given, highlighting each one's approach and contribution.

2.7.1 Colony Architecture

The colony architecture [Connell, 1990] is a direct descendent of the subsumption architecture that uses simpler coordination strategies (i.e., suppression only) and permits a more flexible specification of behavioral relations. The colony architecture permits a treelike ordering of behavioral priority, as opposed to the total ordering of layered behaviors found in subsumption.

2.7.2 DAMN Architecture

The Distributed Architecture for Mobile Navigation (DAMN) developed by Rosenblatt [1995] introduces a unique coordination mechanism. The behaviors in DAMN are themselves asynchronous processes, each generating outputs as a collection of votes cast over a range of responses. The votes for each behavior can be cast in a variety of ways, including differing statistical distributions over a range of responses. The behavior arbitration method is a winner-take-all strategy in which the response with the largest number of votes is selected for execution.

2.8 Deliberative Control

Over the years many different techniques and approaches for robotic control were developed. The complete spectrum ranges from traditional methods that use deliberative reasoning on one side to reactive and behaviour-based control on the other, as shown in Figure 2.8.

Figure 2.8: The spectrum of robot control (Adopted from Arkin, 1998).


The robot that uses deliberative reasoning in control requires much knowledge about

the world and uses this knowledge to predict the outcome of actions to optimize its

performance. If the information is inaccurate or out of date, the outcome of the reasoning

process is probably incorrect. In a dynamic world, where objects may be moving, it is

potentially dangerous to rely on past information that may no longer be valid.

Representational world models are therefore generally constructed from both prior

knowledge about the environment and incoming sensor data (cf. chapter IV). On the other

side of the control spectrum are the reactive systems. Reactive control is a technique for tightly

coupling perception and action, typically in the context of motor behaviors, to produce timely

robotic responses in dynamic and unstructured worlds.

Both deliberative planning systems and purely reactive control systems have their

limitations, but using both forms of knowledge in a robotic architecture can make

behavior-based navigation more flexible and general. Hybrid deliberative-reactive robotic

architecture can combine the traditional abstract representational knowledge with the

responsiveness, robustness, and flexibility of purely reactive systems. However, building

such a hybrid system requires certain compromises. The interface between deliberation and

reactivity is not yet well understood, and research in this area is still ongoing. An

important design issue of hybrid control systems is the number of layers. Usually the hybrid

architecture consists of two or three layers. In the case of two layers the interface between

the layers is very important because it links rapid reactions with long-range planning. In the

case of three layers the middle layer coordinates between the other two layers, much like

the interface in the two-layer architecture. There are four main interface strategies for the

various hybrid architectural designs:

Selection: planning is viewed as configuration and determines the behavioral composition and parameters used during execution.

Advising: the planner suggests changes that the reactive control system may or may not use.

Adaptation: the planner continuously alters the ongoing reactive behaviors based on the changing conditions within the world and task requirements.

Postponing: planning is viewed as a least commitment process. The planner postpones making decisions on action until as late as possible.

Strong evidence exists that hybrid deliberative and behavior-based systems are found in

biology, so research in this area could lead to new insights. For example, there seem to be

two distinct systems concerned with controlling human behavior, a reactive and automatic

system (e.g. when one’s hand touches something hot) and a willed and conscious control

system (e.g. grasping something).

2.9 Inspiration Sources of Behaviours

Behavior-based researchers argue that there is much to be gained for robotics through the study of neuroscience, psychology and ethology. Some scientists attempt to implement these results as closely as possible, testing the hypotheses of the biological models in question. Others use them as an inspiration to create more intelligent robots [Arkin, 1998].


2.9.1 Neuroscientific and Psychological Basis for Behaviors

Neuroscience has inspired computer scientists to create artificial neural networks at the cellular level of neurons, but much can also be learned at the organizational level of brain structure. Neuroscience models have often had an impact on behavior-based robot design.

Two examples are:

The distinction between deliberative and automatic behavior control systems.

Parallel mechanisms associated with both long and short-term memory.

Psychological models focus on the concept of mind and behavior rather than on the brain

itself. Various often-opposing psychological schools have inspired roboticists. Examples

are:

Using stimulus-response mechanisms for the expression of the behaviors.

Capturing the relationship a robot has with its environment based on the theories of

ecological psychology.

Using computational models from cognitive psychology to describe a robot’s behavior

within the world.

2.9.2 Ethological Basis for Behaviors

Ethology is the study of animal behavior in its natural environment. For the true ethologist, the behavior of an organism is intimately coupled with its environment, so removing the animal from its environment destroys the context for its behavior. Murphy [2000] divides the behaviors into three broad categories. Reflexive behaviors are stimulus-response (S-R), such as when your knee is tapped, it jerks upward. Reactive behaviors are learned, and then consolidated to where they can be executed without conscious thought (e.g., riding a bike, skiing). Conscious behaviors are deliberative, such as stringing together previously developed behaviors. The reflexive types of behaviors are of particular interest and can be further divided into three classes [Murphy, 2000]:

Reflexes are rapid, automatic and involuntary responses triggered by certain environmental stimuli. The reflexive response persists only as long as the duration of the stimuli. Further, the response intensity is correlated with the stimulus's strength. Certain escape behaviors, such as those found in snails and bristle worms, involve reflexive actions.

Taxes are voluntary behavioral responses that direct the animal toward or away from a stimulus (attractive or repulsive). Taxes occur in response to visual, chemical, mechanical and electromagnetic phenomena in a wide range of animals. Chemotaxis for example, can be seen in response to chemical stimuli, used in trail following of ants.

Fixed-action patterns are time-extended response patterns triggered by a stimulus but persisting for longer than the stimulus itself. The strength and duration of the stimulus do not control the intensity and duration of the response. Unlike reflexes, fixed-action patterns may be motivated and they may result from a much broader range of stimuli. Examples include the egg-retrieving behavior of the greylag goose and the song of crickets.

One of the most important concepts for behavior-based robotics drawn from the field of ethology is the ethological niche. Animals survive in nature because they have found a reasonable, stable niche: a place where they can coexist with their environment.


Evolution has shaped animals to fit their niche. As the environment is always changing, a successful animal must be capable of adapting to these changes or it will become extinct.

2.10 Assignments

1. Explain in one or two sentences each of the following terms: reflexes, taxes, and fixed-

action pattern.

2. Write a GUI program using Java or C++ to simulate a robot containing sensors and actuators, which are coupled together in reflexive behaviours to do:

a. Reflexive avoid: turn left when the robot touches something (use touch sensors and two motors).

b. Phototaxis: follow a black line (use the IR sensor to detect the difference between light and dark).

c. Fixed-action pattern avoid: back up and turn right when the robot encounters a "negative obstacle" (a cliff).

3. Describe the difference between robot control using a horizontal decomposition and a

vertical decomposition.

4. Evaluate the subsumption architecture in terms of: support for modularity, niche

targetability, ease of portability to other domains, robustness.


CHAPTER III

BEHAVIORS DESIGN AND PROGRAMMING

3.1 Behavior Encoding

A behavior is a mapping of sensory inputs to a pattern of motor actions, which are then used to achieve a task, as shown in Figure 3.1.

Figure 3.1: Simple Behavior Diagram.

To encode the behavioral response that a stimulus evokes, a functional mapping is created from the stimulus plane to the motor plane by factoring the robot's motor response into orthogonal components: strength and orientation. Strength denotes the magnitude of the response and orientation denotes the direction of action. A behavior can be expressed as a triple (S, R, β), where S denotes the domain of all interpretable stimuli, R denotes the range of possible responses, and β denotes the mapping β: S → R.

S consists of all the perceivable stimuli. Each individual stimulus or percept s ∈ S is represented as a binary tuple (p, λ), having a particular type or perceptual class p and a property of strength λ. The stimulus strength can be defined in a variety of ways: discrete (e.g., binary: absent or present; categorical: absent, weak, medium, strong), real valued, or continuous. The instantaneous response r ∈ R of a mobile robot that moves on flat ground and can rotate only about its central axis has three degrees of freedom, expressed as r = [x, y, θ]. For each active behavior we can formally establish a mapping between stimulus and response using the behavior function β(s) = r. Associated with a particular behavior β may be a scalar gain value g that modifies the magnitude of the overall response r for a given s as r' = g·r. These gain values are used to compose behaviors by specifying their strengths relative to one another. In the extreme case, g can be used to turn off a behavior by setting it to 0, thus reducing r' to zero [Arkin, 1998]. β can be defined arbitrarily, but it must be defined over all relevant p in S. The behavior mappings fall into three general categories:

Null: The stimulus produces no motor response.

Discrete: The stimulus produces a response from an enumerated set of prescribed choices (e.g., turn-right, stop, etc.). For example, β may consist of a finite set of (situation, response) pairs, where sensing provides the index for finding the appropriate situation. Another strategy involves the use of a rule-based system, in which β is represented as a collection of IF-THEN rules [Matarić, 1994]; this can be extended using fuzzy logic by synthesizing fuzzy IF-THEN rules [Saffiotti, 1997; Hoffmann, 1998].


Continuous: The stimulus produces a response that is continuous over R's range. One of the most common methods for implementing a continuous response is based on the potential field technique (cf. §2.3).
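To make the (S, R, β) formalism concrete, the following minimal Java sketch encodes a stimulus as a (perceptual class, strength) pair, a response as the [x, y, θ] vector, and a behavior as a mapping β with a gain g applied to its output. The class and method names are illustrative only and are not part of Arkin's formal notation.

/* Minimal sketch of the (S, R, beta) behavior formalism; names are illustrative. */
class Stimulus {
    String perceptualClass;   // p: the perceptual class of the stimulus
    double strength;          // lambda: stimulus strength (0 = absent)
    Stimulus(String p, double strength) { this.perceptualClass = p; this.strength = strength; }
}

class Response {
    double x, y, theta;       // r = [x, y, theta] for a planar robot
    Response(double x, double y, double theta) { this.x = x; this.y = y; this.theta = theta; }
}

abstract class EncodedBehavior {
    double gain = 1.0;        // g: relative strength of this behavior

    /* beta: S -> R, the mapping from stimulus to response */
    abstract Response beta(Stimulus s);

    /* r' = g * r; setting the gain to 0 switches the behavior off */
    Response respond(Stimulus s) {
        Response r = beta(s);
        return new Response(gain * r.x, gain * r.y, gain * r.theta);
    }
}

/* Example of a discrete mapping: turn away whenever an obstacle is perceived */
class TurnAway extends EncodedBehavior {
    Response beta(Stimulus s) {
        if (s.perceptualClass.equals("obstacle") && s.strength > 0.0) {
            return new Response(0.0, 0.0, 90.0);   // turn in place
        }
        return new Response(0.0, 0.0, 0.0);        // null response
    }
}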

3.2 Concurrent Behaviors Combination

Having discussed methods for describing individual behaviors, we now study methods for constructing systems consisting of multiple behaviors, where multiple behaviors may be concurrently active within the robot system. The behavioral coordination function, C, is now defined such that ρ = C(G * B(S)) or, equivalently, ρ = C(G * R), where G is the vector of behavior gains, B the set of behaviors, S the stimuli, and R the behaviors' responses. C can be arbitrarily defined, but several strategies are commonly used to encode this function. They are split across two dimensions: competitive and cooperative. The simplest competitive method is pure arbitration, where only one behavior's output (ri) is selected from R and assigned to ρ; that is, arbitrarily choosing only one response from the many available. Several methods are used to implement this particular technique, including behavioral prioritization (subsumption). Cooperative methods, on the other hand, blend the outputs of multiple behaviors in some way consistent with the robot's overall goal. The most common method of this type is vector addition or super-positioning. Competitive and cooperative methods can be composed as well [see Arkin, 1998 for more details]. This is illustrated in Figure 3.2.

3.3 Subsumption Architecture Behaviors’ Programming

Task-achieving behaviors in the subsumption architecture are represented as separate layers. Individual layers work on individual goals concurrently and asynchronously, and each behavior is represented by the augmented finite state machine (AFSM) model (cf. §2.2). This method is difficult to design and implement. Brooks [1990c] recognized the problem and developed the Behavior Language, which provides a new abstraction independent of the AFSMs, using a single rule set to encode each behavior. This high-level language is then compiled to the intermediate AFSM representation, which can then be further compiled to run on a range of target processors.

Figure 3.2: The classification of multiple behaviors coordination techniques (adopted from Pirjanian, 1999).


To illustrate how a subsumption-based design is programmed, we design and implement a foraging example for a robot having two sonar sensors on the front of its body. Each behavior in the system is encoded as a set of rules (standard for the Behavior Language) [Arkin, 1998]. Before programming of the behaviors begins, their priorities have to be thought out. This looks like a hard problem, since the many layers of behaviors do not divide the work neatly and occasionally pass results to each other. Usually the lowest level of behavior is programmed first and tested on the robot until it works. When the behavior is working it is frozen, and programming of the behavior at the next level is initiated. The new behavior is tested together with the previous behavior and adjusted until everything works. This process is continued until the desired level of competence is achieved. The proposed architecture is shown in Figure 3.3; it consists of five behaviors layered according to their priorities.

Figure 3.3: Subsumption-based model for a foraging robot (adopted from Arkin, 1998).

Coding these behaviors is straightforward. For example, a wander behavior, which makes the robot move in a random direction for some time, has the following Java code segment.

private int wander() {
    if (tt > 5) {
        tt--;
        move_random();
    }
    return 0;  /* set priority = 0, which activates the find_goal behavior */
}

The pickup behavior, which turns toward the sensed attractor and goes forward, and closes the gripper once at the attractor, has the following Java code segment.


private int pickup() {
    int dif;
    if (obstacle) {
        return 1;  /* set priority = 1, which activates the avoid_obstacles behavior */
    }
    if (pellet) {
        pellet = false;
        return 0;  /* set priority = 0, which activates the find_goal behavior */
    }
    if (energy < energy_low) {
        return 3;  /* set priority = 3, which activates the homing behavior */
    }
    dif = laser_sen[0] - laser_sen[1];  /* difference between right and left laser sensors */
    if (dif >= -40 && dif <= 40) {
        move_forward();
    }
    if (dif < -40) {
        move_left();
    }
    if (dif > 40) {
        move_right();
    }
    return 2;  /* set priority = 2, which keeps the pickup behavior active */
}

Coordination of the multiple behaviors is based on behavioral prioritization. Each behavior has a priority value, and a behavior executes its action if its priority matches the current value of the priority variable. This can be coded as follows:

priority = 0;

public void run() {
    while (isAlive()) {
        switch (priority) {
            case 0:  priority = find_goal();       break;
            case 1:  priority = avoid_obstacles(); break;
            case 2:  priority = pickup();          break;
            case 3:  priority = homing();          break;
            case 4:  priority = upload();          break;
            case 5:  priority = recharge();        break;
            default: priority = wander();          break;
        }
    }
}

3.4 Behaviors and Schema Theory

Schemas were conceived of by psychologists as a way of expressing the basic unit of activity. A schema consists both of the knowledge of how to act and/or perceive and of the computational process by which it is used to accomplish the activity. The idea of a schema maps very well onto a class in object-oriented programming. Extending schema theory toward a computational theory of intelligence, the building block for robot intelligence is equivalent to a schema composed of a motor schema and a perceptual schema. The motor


schema represents the template for the physical activity; the perceptual schema embodies

the sensing as shown in Figure 3.4.

Figure 3.4: Behavior decomposed into perceptual and motor schema (adopted from Murphy, 2000).

Schema theory implements the motor schema as a vector field, which encodes the direction and magnitude of the action. In schema theory, the perceptual schema is permitted to pass both the percept and a gain to the motor schema; the motor schema can use the gain to compute the magnitude of the output action. However, schema theory does not specify how the output from concurrent behaviors is combined: in some circumstances the outputs of concurrent behaviors are combined or summed, in others they occur in a sequence, and sometimes there is a winner-take-all effect.

3.5 Motor-Schema Behavior Programming

The motor-schema behavior-based methodology is based on potential fields. Potential fields are actually easy to program, especially since the fields are egocentric to the robot. The robot computes the effect of the potential field, usually as a straight line, at every update, with no memory of where it was previously or where it has moved. There are five basic potential fields, or primitives, which can be combined to build more complex fields: uniform, perpendicular, attractor, repulsive, and tangential, as shown in Figure 3.5.

Figure 3.5: Five primitive potential fields: (a) uniform, (b) perpendicular, (c) attractive, (d) repulsive, and (e) tangential (adopted from Murphy, 2000).


Any primitive potential field is usually represented by a single function, and the vector acting on the robot is computed at each update. Consider a robot with a single range sensor facing forward. The designer has decided that a repulsive field with a linear drop-off is appropriate. The formula is:

V_{direction} = -180^\circ, \qquad V_{magnitude} = \begin{cases} (D - d)/D & \text{for } d \le D \\ 0 & \text{for } d > D \end{cases}    (3.1)

where D is the maximum range of the field's effect, that is, the maximum distance at which the robot can detect the obstacle and should respond to the stimulus. Suppose the roboticist sets D to 2 meters and the robot's maximum velocity to 10 meters per second. This can be coded in Java as follows:

class vector {
    double magnitude;
    double direction;
    public vector() {
        magnitude = 0;
        direction = 0;
    }
}

class perceptual_schema {
    double readSonar() {
        /* hardware dependent function */
        return 0.0;  /* placeholder reading */
    }
}

class motor_schema {
    public static final double MAX_DIST = 2.0;
    public static final double MAX_VELOCITY = 10.0;

    vector repulsive(double d, double D) {
        vector output = new vector();
        if (d <= D) {
            output.direction = -180;
            output.magnitude = (D - d) / D;
            return output;
        } else {
            return new vector();  /* stay where you are */
        }
    }
}

To illustrate how the repulsive potential field can be used by a behavior, consider runaway for a robot with a single sensor as a motor schema. The following code segment clarifies the idea.


class runaway extends Behavior {
    perceptual_schema perceptual;
    motor_schema motor;

    public runaway(perceptual_schema p, motor_schema m) {
        perceptual = p;  /* construct the runaway behavior */
        motor = m;
    }

    vector execute_runaway() {
        /* at every time step, get the perceptual value and calculate the potential field */
        double reading = perceptual.readSonar();
        vector output = motor.repulsive(reading, motor_schema.MAX_DIST);
        return (output);
    }
}

The robot can execute the runaway behavior as soon as it is switched on, as follows:

vector output;
while (isAlive()) {
    output = execute_runaway();
    turn(output.direction);
    forward(output.magnitude * motor_schema.MAX_VELOCITY);
}

The implementation of the runaway behavior for a robot with a single sensor is straightforward. But what if the robot has multiple range sensors? Multiple sensors may detect a larger obstacle, or several obstacles, at the same time. The common approach is to have a runaway behavior for each sensor. This is called multiple instantiation of the same behavior; the vector summation property of potential fields is then used to generate a single vector value.
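A minimal sketch of this multiple-instantiation idea is given below. It reuses the vector and motor_schema classes defined above and assumes a hypothetical readSonar(i) call and fixed sensor mounting angles; each sensor drives its own repulsive computation, and the resulting vectors are summed component-wise to produce a single output vector.

/* Sketch: one runaway instance per range sensor, combined by vector summation.
   The sensor angles and readSonar(i) helper are assumptions made for illustration. */
class MultiSensorRunaway {
    double[] sensorAngles = { -45.0, 0.0, 45.0 };   // assumed mounting angles (degrees)

    double readSonar(int i) { return 1.0; /* placeholder for a hardware-dependent call */ }

    vector sum(vector a, vector b) {
        // convert polar (magnitude, direction) vectors to Cartesian, add, convert back
        double ax = a.magnitude * Math.cos(Math.toRadians(a.direction));
        double ay = a.magnitude * Math.sin(Math.toRadians(a.direction));
        double bx = b.magnitude * Math.cos(Math.toRadians(b.direction));
        double by = b.magnitude * Math.sin(Math.toRadians(b.direction));
        vector out = new vector();
        out.magnitude = Math.hypot(ax + bx, ay + by);
        out.direction = Math.toDegrees(Math.atan2(ay + by, ax + bx));
        return out;
    }

    vector execute() {
        motor_schema motor = new motor_schema();
        vector total = new vector();                 // zero vector
        for (int i = 0; i < sensorAngles.length; i++) {
            vector v = motor.repulsive(readSonar(i), motor_schema.MAX_DIST);
            v.direction += sensorAngles[i];          // express the repulsion in the robot frame
            total = sum(total, v);
        }
        return total;                                // single vector driving the robot
    }
}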

3.6 Motor-Schema Concurrent Behaviors Combination

A robot will generally have forces acting on it from multiple behaviors. In the motor-schema approach, concurrent behaviors are combined and coordinated by vector summation. Each behavior's relative strength or gain, gi, is used as a multiplier of its vector before addition. Figure 3.6 illustrates how this is accomplished.

Figure 3.6: Behavior Fusion via vector summation (adopted from Arkin, 1998).


For example, consider adding another behavior, move_to_goal, that is represented by an attractive potential field and uses the shaft encoders to sense whether the goal position has been reached. The move_to_goal behavior exerts an attractive field over the entire space: wherever the robot is, it will feel a force from the goal. The runaway behavior exerts a repulsive field in a radius around the obstacle. Extending the example of the previous section, the two behaviors can be combined in Java as follows:

vector output, vec1, vec2;
while (isAlive()) {
    vec1 = execute_runaway();
    vec2 = execute_move_to_goal();
    /* the two vectors may first be multiplied by their gain values */
    output = vector_sum(vec1, vec2);  /* assumed helper that sums the two vectors */
    turn(output.direction);
    forward(output.magnitude * motor_schema.MAX_VELOCITY);
}

3.7 Assignments

1. Describe the difference between the two dominant methods for combining behaviours

in a reactive architecture, subsumption and potential field summation.

2. Suppose you were to construct a library of potential fields of the five primitives using

Java or C++. What parameters would you include as arguments to allow a user to

customize the fields?


CHAPTER IV

MAPPING, LOCALIZATION AND NAVIGATION

The behavior-based robotics community generally views the use of symbolic representational knowledge as a hindrance to efficient and effective robotic control, since the symbol-grounding problem arises [Arkin, 1998]. The symbol-grounding problem refers to the difficulty of connecting the meaning of an arbitrary symbol to the real entity or event it denotes. Nevertheless, if we want our robots to act knowledgeably in order to perform navigation and other tasks, various types of knowledge are necessary. The suggested types of knowledge are:

Spatial world knowledge: this knowledge deals with understanding the navigable space and structure surrounding the robot.

Object knowledge: this knowledge deals with recognizing the categories or instances of particular types of things within the world.

Behavior knowledge: this knowledge deals with understanding how to react in different situations.

Perceptual knowledge: information regarding how to sense the environment under various circumstances.

In this chapter, we are concerned with the spatial world knowledge that assists in the robotic navigation task. The problem of navigation can be broken down into four questions: "What should I remember?", "Where am I?", "Where am I going?", and "How should I get there?". The first question has been termed the mapping problem: how can the navigator acquire spatial knowledge by integrating observations gathered over time to create a cognitive map, and use it to find routes and determine the relative positions of places?

The need for cognitive mapping is evident, since the robot must store information from its

observations and experiences to use later for effective navigation. The second question has

been termed the localization problem, that is, how can the navigator determine its location in

the environment, given its current view of the world (i.e., cognitive map) and what it has

encountered while traveling thus far? The third and fourth questions are those of specifying

a goal and finding an appropriate path to that goal. Robotic research in path planning and

obstacle avoidance is concerned with these last two questions. However, the localization

problem is central to the task of robotic navigation. Without some idea of its current

location, the navigator will be unable to follow a proper path to reach its goal. For example,

consider the case of a hypothetical robot delivery vehicle. The robot must navigate through

busy city streets to reach its destination while avoiding obstacles in its path (the other cars)

and obeying traffic laws. But in order to do so, the robot must have accurate information

of its location: its position in the traffic lane, its location in the city, whether or not it has

reached the goal, etc. Furthermore, the task may be assisted by the ability to recognize

landmarks or distinctive areas on the streets of interest. Navigation and mapping are crucial

to all mobile robot systems. Navigation and mapping tasks include sonar sensor

interpretation, collision avoidance, localization and path planning.


4.1 Mapping

Mapping is the process of constructing a model of the environment based on sensor measurements. There are different approaches to representing and using spatial information. On one side, there are purely grid-based maps, also called geometric or metric maps. In these representations, the robot's environment is defined by a single, global coordinate system in which all mapping and navigation take place. In geometric mapping, low-level

geometric primitives are extracted from sensor data to build up high-level abstractions of the

environment. Thus, the world might be modeled using wire-frame, surface boundary, or

volumetric representations, for example. This approach is appealing because it provides a

concise representation that is easily manipulated and displayed by computers and is readily

understood by humans. However, the geometric paradigm has drawbacks in that it leads to

world models that are potentially fragile and unrepresentative of the sensor data because it

relies heavily upon a priori environment assumptions and heuristics [Elfes, 1987]. An

alternative to the geometric paradigm is the sensor-based paradigm. Here, the environment is

represented as a collection of the sensor readings, obtained through sensor models. No a

priori assumptions or models of the world are employed. Most typical of this approach are

cellular representations such as the occupancy grid [Elfes, 1989]. However, representing each cell of the grid explicitly leads to unrealistic storage requirements; one alternative is to take advantage of the fact that many of the cells will have a similar state, especially those cells that correspond to spatially adjacent regions. The general approach to solving this problem is to represent the space using cells of non-uniform shape and size, such as quadtrees [Nebot and Pagac, 1995]. As shown in Figure 4.1, the map is a grid with each cell representing some amount of space in the real world.


Figure 4.1: The environment representation: (a) grid-based mapping; (b) quadtree mapping.

These methods make weaker assumptions about the environment than the geometric paradigm and usually have a high sensing-to-computation ratio, since no high-level abstractions are generated. These approaches work well within bounded environments where the robot has opportunities to realign itself with the global coordinate system using external markers. On the other side are topological maps, also called qualitative maps, in which the robot's environment is represented as places and connections between places. Each node in such a graph corresponds to a distinct place or landmark, as shown in Figure 4.2.


Arcs connect these nodes if there exists a direct path between them. This approach has the advantage that it eliminates the inevitable problems of dealing with movement uncertainty: movement errors do not accumulate in topological maps as they do in maps with a global coordinate system, since the robot navigates locally.

Figure 4.2: Representation of a floor plan as a topological map.
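Since a topological map is just a graph of places and connections, it can be held in a simple adjacency structure. The following Java sketch is purely illustrative (the place names are invented) and shows one possible representation.

import java.util.*;

/* Sketch of a topological map: places are nodes, navigable paths are edges. */
class TopologicalMap {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    void addPlace(String place) {
        adjacency.putIfAbsent(place, new HashSet<>());
    }

    void connect(String a, String b) {          // a direct path exists between a and b
        addPlace(a);
        addPlace(b);
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
    }

    Set<String> neighbours(String place) {
        return adjacency.getOrDefault(place, Collections.emptySet());
    }

    public static void main(String[] args) {
        TopologicalMap map = new TopologicalMap();
        map.connect("corridor", "office-1");
        map.connect("corridor", "office-2");
        map.connect("corridor", "lab");
        System.out.println("From the corridor the robot can reach: " + map.neighbours("corridor"));
    }
}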

4.1.1 Occupancy Grids

Moravec and Elfes [1988; 1989] introduced the statistical occupancy grid method for map-making as a way of making detailed geometric maps using noisy sensors. A statistical occupancy grid is an array where the content of each cell is based on the confidence that the part of the environment corresponding to that cell is occupied. The main idea behind the statistical occupancy grid is that, since range data is inexact, the content of each cell should be a probability: a cell containing a high probability is likely to be occupied, while a cell with a low probability is likely empty. Ultrasonic sonar sensors have often been used in implementations of the certainty grid system. A typical ultrasonic sensor using time-of-flight measurements yields precise range readings for object location, but lacks angular resolution. Figure 4.3 shows the cross section of an ultrasonic signal intercepting an object in the environment.

Figure 4.3: Cross section of a conical sonar signal. Any object along arc A will yield a sensed distance of d.


Since the sensor uses merely the first echo it receives for range calculations, it is impossible to determine the exact angular location of the reflecting object with respect to the acoustic axis; only its radial distance can be measured. The arc labeled "A" depicts the angular uncertainty of the ultrasonic sensor; it is bounded by the width of the ultrasonic cone (typically 30 degrees). The occupancy grid system [Moravec, 1988; Elfes, 1989] accounts for this uncertainty with a Gaussian probability function that models the error of the ultrasonic sensor. Here, cells close to the sensed distance from the sensor and near the acoustic axis have a high likelihood of occupancy. Likewise, cells near the ultrasound sensor and close to the acoustic axis have a high likelihood of vacancy. After each sensor reading, Bayes' theorem (Equation 4.1) is applied to update the probability of occupancy of each of the cells intersecting the ultrasonic cone. A Gaussian model is used as the evidence of occupancy (occ) or vacancy (vac), and the current value of the cell provides the prior probability of occupancy.

P(occ \mid s_{k+1}) = \frac{P(s_{k+1} \mid occ)\, P(occ \mid s_k)}{P(s_{k+1} \mid occ)\, P(occ \mid s_k) + P(s_{k+1} \mid vac)\, P(vac \mid s_k)}    (4.1)

Here P(occ | s_k) is the current value of the cell (after k sensor readings), P(occ | s_{k+1}) will be the new cell value, and P(s_{k+1} | occ) and P(s_{k+1} | vac) are obtained from the sensor model. Initially, the occupancy status of a cell is unknown, thus we set P(occ | s_0) = 0.5 for all cells. After several readings by multiple sensors at different locations in the environment, a map representing the objects and empty space of the environment is constructed. The algorithm summary is shown in Table 4.1.

Occupancy grid algorithm

Initialize all the cells of the map to a reasonable prior

Repeat:

Take a sensor scan.

For each cell on the map in sensor range, update the probability in that cell using the sensor model and the current value in the cell as the prior.

Move to somewhere else.

Table 4.1: The occupancy grid algorithm.
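The cell update of Equation 4.1 translates directly into code. The sketch below is a minimal, illustrative implementation that assumes the sensor model supplies P(s_{k+1} | occ) for a cell and, for simplicity, takes P(s_{k+1} | vac) as its complement; the class and method names are not from any particular library.

/* Sketch of the Bayesian occupancy-grid update of Equation 4.1; names are illustrative. */
class OccupancyGrid {
    private final double[][] pOcc;                 // P(occ | s_k) for every cell

    OccupancyGrid(int width, int height) {
        pOcc = new double[width][height];
        for (double[] column : pOcc)
            java.util.Arrays.fill(column, 0.5);    // occupancy initially unknown: P(occ | s_0) = 0.5
    }

    /* pSgivenOcc is P(s_{k+1} | occ) from the sensor model for this cell;
       P(s_{k+1} | vac) is taken as its complement here, a simplifying assumption. */
    void update(int x, int y, double pSgivenOcc) {
        double prior = pOcc[x][y];                                  // P(occ | s_k)
        double pSgivenVac = 1.0 - pSgivenOcc;
        double numerator = pSgivenOcc * prior;
        double denominator = numerator + pSgivenVac * (1.0 - prior);
        pOcc[x][y] = numerator / denominator;                       // P(occ | s_{k+1})
    }

    double occupancy(int x, int y) { return pOcc[x][y]; }
}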

4.1.2 Grid or Topological Maps: A Comparison

When using a grid, the environment is represented by evenly-spaced cells indicating, for example, the presence of an obstacle in the corresponding region of the environment. When using the topological approach, the environment is represented by a graph. Each node in such a graph corresponds to a distinct place or landmark. Arcs connect these nodes if there exists a direct path between them. The grid map is very easy to build and maintain, and the relationship between the map and the environment is straightforward. In addition, multiple viewpoints can easily be integrated by using a coordinate transformation, and calculation of the shortest path is fairly easy. However, grid maps consume much space and


memory, since the complexity of the map does not depend on the complexity of the environment. As a result, processing these kinds of maps can be very time consuming, and the position of the robot itself must be known accurately. Grid maps also require more pre-processing when used by symbolic problem solvers such as behavior coordination algorithms.

On the other hand, the topological map is difficult to construct and maintain, since it requires recognition of landmarks and places; the robot must have the sensory equipment for doing this, and even then it is no simple task. Topological maps are efficient and faster for planning, since the exact robot position is not that important. They save memory and space, since the resolution depends on the complexity of the environment, and they provide a very convenient interface towards symbolic problem solvers. However, the paths calculated based on topological maps may not be optimal in terms of energy consumption or distance traveled. Both methods have their advantages and disadvantages; the most important ones are stated in Table 4.2.

                              Grid Maps      Topological Maps
Building / maintainability    easy           difficult
Memory space                  huge           small
Creates shortest path         yes            no
Planning speed                slow           fast

Table 4.2: Comparison between grid and topological maps.

4.2 Localization

An important sub-problem in the robot navigation task is that of localization: given accumulated knowledge about the world and current sensor readings, what is the position of the robot within the workspace? Localization is particularly difficult for map-based approaches that learn their maps, since the accuracy of a metric map depends on the alignment of the robot with its map. This would be a trivial problem if the navigation system could rely upon its velocity history (i.e., employ "dead reckoning") to calculate its global coordinates. Unfortunately, distance and orientation measurements made from dead reckoning are highly prone to error due to shaft encoder imprecision, wheel slippage, irregularities in the surface of the workspace, etc., which accumulate over time. Odometry and inertial navigation methods of this kind, known as relative position measurements, require supplemental information in order to periodically "flush" the metric error from the system and re-determine the robot's position in the environment [Borenstein et al., 1996].

External measurement techniques, also called absolute measurement techniques, are often used in conjunction with the relative measurement techniques to improve the localization estimate. Examples of absolute measurement techniques are:

Guidepath: one of the simplest forms of localization. The guidepath can be, for instance, a wire (which transmits audio or radio signals) or a magnetic strip [Everett,


1995], which the robot can follow. Guidepaths indicate static paths in an environment and are not suitable in applications where mobile robots should be allowed to move flexibly.

Active beacons use a set of transmitters (beacons) whose locations in the workspace are known in advance. The transmitters, in most cases, use light or radio signals to announce their presence. The active beacons may transmit their identity and current location. For a robot to be able to use the active beacons for its localization at some position in the workspace, at least three of the beacons have to be "visible" or "detectable" at that position (a minimal trilateration sketch follows this list).

Artificial landmark recognition methods try to recognize objects or images with a distinctive shape placed in the workspace. The positions of the landmarks are known, and if three or more landmarks are detectable at a certain position, an estimate of the robot's position can be calculated.
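To illustrate how three detectable beacons determine a position, the following minimal Java sketch performs trilateration by linearizing the three range equations into a 2x2 linear system; the beacon positions and ranges in main are invented test values. In practice the ranges are noisy, so a least-squares solution over more than three beacons would normally be used.

/* Minimal trilateration sketch: estimating (x, y) from three beacons with known
   positions and measured ranges. Beacon coordinates and ranges are illustrative. */
class BeaconLocalizer {
    /* bx, by: beacon positions; d: measured distances to each beacon */
    static double[] trilaterate(double[] bx, double[] by, double[] d) {
        // Subtracting the first range equation from the other two gives a 2x2 linear system
        double a11 = 2 * (bx[1] - bx[0]), a12 = 2 * (by[1] - by[0]);
        double a21 = 2 * (bx[2] - bx[0]), a22 = 2 * (by[2] - by[0]);
        double b1 = d[0]*d[0] - d[1]*d[1] - bx[0]*bx[0] + bx[1]*bx[1] - by[0]*by[0] + by[1]*by[1];
        double b2 = d[0]*d[0] - d[2]*d[2] - bx[0]*bx[0] + bx[2]*bx[2] - by[0]*by[0] + by[2]*by[2];
        double det = a11 * a22 - a12 * a21;        // zero when the beacons are collinear
        return new double[] { (b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det };
    }

    public static void main(String[] args) {
        double[] bx = { 0.0, 10.0, 0.0 };
        double[] by = { 0.0, 0.0, 10.0 };
        double[] d  = { 5.0, Math.sqrt(65.0), Math.sqrt(45.0) };   // ranges from the point (3, 4)
        double[] pos = trilaterate(bx, by, d);
        System.out.printf("Estimated position: (%.2f, %.2f)%n", pos[0], pos[1]);
    }
}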

4.3 Navigation

The process of generating a sequence of actions that has to be performed in order for a robot to move through its environment (also called the workspace) autonomously and without collisions is called navigation. Methods for navigation, in their most simple form, fall into one of two categories, deliberative or reactive. Pure reactive methods offer real-time motion response provided by simple computations and do not use any representation of the environment (i.e., the robot has no understanding of the layout of the workspace, e.g., where the obstacles are situated). They compute a function using the parameters obtained from external and proprioceptive sensors on the robot. A reactive method can guarantee reaching the target only if the robot has already traversed and learned the way, or if it was designed to follow a certain path.

Pure deliberative methods, on the other hand, are highly representation dependent and express a higher level of intelligence. They use a priori knowledge to pre-calculate robot motion. A typical way to utilize this knowledge is to plan a path through the workspace for the robot, so-called path planning. The path planner generates minimum-cost paths to the goal(s) using the map of the workspace. There are a large number of approaches that combine both reactive and deliberative navigation, and these seem to be the most effective ones. They combine the efficiency of planning with the flexibility and adaptive ability of reactive systems. As a result of this integration, deliberative navigation provides intermediate goals to the reactive collision avoidance behavior that controls the velocity and the exact direction of the robot's motion.

4.3.2 Reactive Navigational Approaches

Reactive approaches attempt to overcome the computational limitations of deliberative approaches by offering fast solutions for real-time applications. The basic idea is to couple perception with actuation, so that particular sensed patterns directly activate appropriate motor commands. In reactive navigation, the robot plans a path using topological maps, also called qualitative maps, in which the robot's environment is represented as places and connections between places. These places or nodes represent gateways, landmarks, or goals. Edges or connections represent a navigable path between two nodes, and the


robot’ path consisting of segments. Gateways or landmarks are needed so the robot can tell

when it has completed a segment another should begin (§4.1). These distinctive places are

derived from the sensor readings and after the robot determines these observable and

ultimately recognizable landmarks, they can be integrated easily into behaviours control.

For example, Matarić [1990] demonstrated the integration of qualitative maps and

behavior-based robotic systems in a subsumption-based system as shown in Figure 4.4.

Figure 4.4: Integration of topological map in subsumption architecture (Adopted from Arkin, 1998).

Landmarks are derived from sonar data, using features that are stable and consistent over time. In particular, right walls, left walls, and corridors are the landmark features derived from the sensors. Spatial relationships connecting the various landmarks are added by constructing the graph. Map navigation consists of following the boundaries of wall or corridor regions.

4.3.3 Deliberative Approaches

Classical motion planners assume that full knowledge of the geometry of the robot's environment is available a priori. The idea is to model the environment and the robot using geometric structures (e.g., polygons, rectangles, circles) and then define the necessary rotations and translations of these objects that will lead to the desired result. A famous problem in this category is the "piano-movers" problem, which deals with the maneuvers required to move a long piano out of a room through a narrow door. Geometric methods have rarely been applied to real robotic problems, due to the computational cost of modeling and manipulating geometric objects. Additionally, they cannot easily be adapted to dynamically changing worlds. However, the classical planners have the useful properties of correctness and completeness. A path is correct if it lies wholly within F and, if the goal is reachable, connects the robot's initial position with the goal. Whereas any useful path planner produces correct paths, completeness is a highly desirable virtue: a complete planner will generate a path if one is possible, and will halt otherwise in finite time. These approaches use geometric methods to derive solutions to the path planning and navigation problem.

4.3.3.1 Workspace and Configuration Space

Any robot operates in a subset of the real world, called the workspace of the robot. It is important that the exact position of the robot in its workspace, with respect to some frame of


reference, can be precisely specified. The configuration of the robot is a vector that holds the current values of all the single-valued parameters that need to be specified in order to identify the robot's exact position uniquely. These parameters are broadly known as the generalized coordinates of the robot. For a mobile robot the generalized coordinates might be the Cartesian coordinates of its center and the orientation of its heading. The configuration space C of the robot is simply the space in which its configuration vector lives, i.e., the set of all possible configurations, and its dimension is equal to the number of generalized coordinates [Lozano-Pérez, 1983]. Since the robot can only be in a single configuration at any time, it is represented as a single point in C. Obstacles in the workspace imply that certain areas are not accessible by the robot.

The configuration q of a robot A specifies the exact position and orientation of A relative to a fixed reference frame. The configuration space (often referred to as "C-space") of A is the set of all possible configurations of A. For a robot with n degrees of freedom (DOF), its C-space, C, is an n-dimensional space (typically an n-dimensional manifold), since there is a one-to-one, onto map between q and a specific arrangement of the robot's joints and the robot's position within its (real world) workspace. Obstacles are mapped into C-space by determining which configurations of the robot produce collisions with an obstacle; these configurations are deemed forbidden. Let A(q) denote the location of A's particles when A is in configuration q. A C-space obstacle (or "C-obstacle") associated with a physical obstacle B is defined as:

CB = \{\, q \in C \mid A(q) \cap B \neq \emptyset \,\}    (4.2)

The complement of the C-obstacles is termed the “freespace," F:

F = C \setminus \bigcup_i CB_i    (4.3)

Motion plans are constructed in F. Note that the closure of freespace includes the

obstacle boundaries. Since generally a robot is allowed to touch an obstacle (but not

penetrate it), the word "freespace" (and symbol F) will often be used to refer to the closure of

freespace as well. In the case illustrated in Figure 4.5 (part (a)), the robot is a rigid body in a

2D environment (its workspace), and it is capable of translating in any direction (i.e.,

horizontally and vertically) but not rotating. Thus, this robot has 2-DOF, and its

configuration space (shown in (b)) is equivalent to ℝ². Configuration space is a useful concept for classical motion planners aimed at creating paths for (holonomic3) multi-DOF robots. Classical motion planning is a two-step process. In the first step, the C-obstacles are computed from knowledge of the physical obstacles and of the robot geometry. The second step constructs a path within F.

3 Holonomic robots are capable of motion directly between configurations: for example, a robot which can move forward, backward, and side-to-side freely in an obstacle-free 2D environment. An automobile is an example of a non-holonomic vehicle; other techniques must be used (perhaps in addition to C-space) to capture the inability of a car to move directly sideways.


A path in C of the robot A from the start configuration, q_start, to the goal configuration, q_goal, is a continuous map τ: [0, 1] → C, with τ(0) = q_start and τ(1) = q_goal. The resultant path can then be translated back into actual robot configurations through an interface with the robot's automatic controllers [Hwang and Ahuja, 1992].

Figure 4.5: A sketch of the configuration space of an oblong robot with the two polygonal obstacles shown in (a). The robot can slide in any

direction, but not rotate. The polygons in (b) are the configuration space obstacles corresponding to the point highlighted on the robot, which would then be used by a motion planner to plot the robot's trajectory.

4.3.3.2 Roadmap Methods

Roadmap methods basically attempt to reduce the dimensionality of the space in which path planning and navigation take place. They attempt to capture the structure of the free space by a set of one-dimensional curves and their interconnections, the roadmap. The result is a graph-like structure in which paths can be planned easily. The full path consists of three components: an initial path segment that connects the current configuration to the roadmap, the main path that lies on the roadmap, and the final segment that connects the final configuration to the roadmap. The search can be performed using a variety of approaches, but among the most popular is the A* algorithm. This search method uses a heuristic measure to find minimum-cost paths. It can be shown that the A* algorithm is admissible: it always finds the shortest path, and it is optimal [Doyle, 1995].
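As an illustration of how a roadmap graph can be searched, the sketch below runs A* over a small adjacency structure using straight-line distance as the heuristic. The graph, node coordinates, and edge list are invented for the example and do not come from the text.

import java.util.*;

/* Sketch: A* search over a roadmap graph. Nodes, coordinates and edges are illustrative. */
class RoadmapAStar {
    static double[][] coords = { {0, 0}, {2, 1}, {2, -1}, {4, 0}, {6, 0} };   // node positions
    static int[][] edges = { {0, 1}, {0, 2}, {1, 3}, {2, 3}, {3, 4} };        // roadmap links

    static double dist(int a, int b) {
        double dx = coords[a][0] - coords[b][0], dy = coords[a][1] - coords[b][1];
        return Math.hypot(dx, dy);
    }

    static List<Integer> aStar(int start, int goal) {
        int n = coords.length;
        double[] g = new double[n];
        Arrays.fill(g, Double.POSITIVE_INFINITY);
        g[start] = 0;
        int[] parent = new int[n];
        Arrays.fill(parent, -1);
        // queue ordered by f = g + h, where h is the straight-line distance to the goal
        PriorityQueue<double[]> open = new PriorityQueue<>(Comparator.comparingDouble((double[] entry) -> entry[1]));
        open.add(new double[] { start, dist(start, goal) });
        boolean[] closed = new boolean[n];
        while (!open.isEmpty()) {
            int current = (int) open.poll()[0];
            if (closed[current]) continue;
            closed[current] = true;
            if (current == goal) break;
            for (int[] edge : edges) {
                int next = (edge[0] == current) ? edge[1] : (edge[1] == current ? edge[0] : -1);
                if (next < 0) continue;
                double tentative = g[current] + dist(current, next);
                if (tentative < g[next]) {
                    g[next] = tentative;
                    parent[next] = current;
                    open.add(new double[] { next, tentative + dist(next, goal) });
                }
            }
        }
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = goal; v != -1; v = parent[v]) path.addFirst(v);
        return path;
    }

    public static void main(String[] args) {
        System.out.println("Shortest roadmap path: " + aStar(0, 4));
    }
}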

Well-known methods in this category are visibility graphs and Voronoi diagrams. The visibility graph is a roadmap that can be used in two-dimensional configuration spaces. It is constructed by using all the vertices of the obstacles and the start and goal configurations as the graph nodes; the links are the line segments that connect two nodes without intersecting any obstacles. An improvement to this method is the reduced visibility or tangent graph, produced by removing all the line segments that are not tangent to the obstacles. This reduces the size of the graph, which can make the structure quicker to search. The tangent graph will always contain the shortest path between the start and the goal, which will always be found when using the A* search algorithm. The Voronoi diagram contains the set of points that are equidistant from two or more object features. These create roadmaps that connect the initial and goal configurations by forming paths consisting of line segments and parabolic arcs (for polygonal obstacles) that maximize the clearance between the robot and the obstacles, as shown in Figure 4.6. The space is partitioned into regions, each containing one feature, and any point in a region is closer to


the feature in that region than to any other feature. The resulting roadmap is the edge between regions. This edge is the safest possible path, as it avoids obstacles by the maximum distance possible.

Figure 4.6: The roadmap approach.

A related problem in this case is how the roadmap can be constructed (a computationally heavy problem, in general) and how it can be modified in case the environment changes dynamically. Roadmap methods are discussed in [Canny, 1988].

4.3.3.3 Cell Decomposition

The idea here is to decompose the free space into a large number of small regions, called cells. The decomposition is such that the robot can trivially navigate between adjacent cells. The result is a non-directed connectivity graph that represents the adjacency of the cells: the nodes of the graph are free-space cells, and cells are linked only if they satisfy the adjacency rule. The sizes of the cells can be variable, as can their distribution. Searching this graph from a start node to a goal node will only be successful if a channel can be found linking the start and the goal. The method can be divided into two categories: exact and approximate cell decomposition. Exact cell decomposition refers to the case where the union of the cells is exactly the free space, as opposed to approximate cell decomposition, where the union of the cells is properly included in the free space. In exact cell decomposition the free space is divided into trapezoidal and triangular segments using the external and internal obstacle vertices, as demonstrated in Figure 4.7.

Figure 4.7: Exact Cell Decomposition.


In approximate cell decomposition, cells usually have a standard predefined shape but perhaps variable size. Any algorithm for graph search can realize the path construction. The details of the decomposition, however, can significantly affect the completeness and the computational complexity of the algorithm, as well as the quality of the resulting path. These methods are perhaps the best studied and have been applied widely for robot navigation. Simple instances of such methods are quadtrees [Nebot and Pagac, 1995] and grid-based representations [Moravec and Martin, 1996], as shown in Figures 4.8 and 4.9.

Figure 4.8: Approximate Cell Decomposition (grid-based decomposition).

One example of the approximate cell decomposition method is the quadtree algorithm. This algorithm represents C-space as a quadtree structure by performing the steps listed in Table 4.3; the resulting decomposition is shown in Figure 4.9. A path can be created at any time by building a connectivity graph based on the quadtree and computing a path through the known portions of F.

Quadtree cell decomposition algorithm

1. Place C-space as the root node and current cell.

2. Break current cell into n non-overlapping cells.

3. Check each cell for obstacles, noting whether obstacles take up all, some, or none of the cell. Represent each of these non-overlapping cells as a child node.

4. If a cell is completely free, or completely full of obstacles, it has no children, as the characteristics of the entire cell are known.

5. If a cell is mixed (partially free, partially full by obstacles) repeat this process by returning to step 2. (Note: Once the desired resolution is achieved, you may stop the algorithm).

Table 4.3: The quadtree cell decomposition algorithm.
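The recursion of Table 4.3 can be sketched directly in code. In the following illustrative Java fragment, the occupied() test standing in for the obstacle check and the resolution limit are assumptions made only for the example.

/* Sketch of the quadtree decomposition of Table 4.3; the occupancy test is an assumption. */
class QuadtreeCell {
    double x, y, size;                 // lower-left corner and side length of the cell
    String status;                     // "free", "full", or "mixed"
    QuadtreeCell[] children;           // four children when the cell is mixed

    QuadtreeCell(double x, double y, double size) { this.x = x; this.y = y; this.size = size; }

    /* placeholder returning the fraction of the cell covered by obstacles (map dependent) */
    static double occupied(double x, double y, double size) { return 0.0; }

    void decompose(double minSize) {
        double coverage = occupied(x, y, size);
        if (coverage <= 0.0) { status = "free"; return; }   // completely free: no children
        if (coverage >= 1.0) { status = "full"; return; }   // completely full: no children
        status = "mixed";
        if (size / 2 < minSize) { return; }                 // desired resolution reached
        children = new QuadtreeCell[4];
        double h = size / 2;
        children[0] = new QuadtreeCell(x,     y,     h);
        children[1] = new QuadtreeCell(x + h, y,     h);
        children[2] = new QuadtreeCell(x,     y + h, h);
        children[3] = new QuadtreeCell(x + h, y + h, h);
        for (QuadtreeCell c : children) c.decompose(minSize);   // repeat on mixed cells
    }
}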

4.3.3.4 Potential Fields

Potential field methods are based on the construction of a function of the configuration vector, called a potential function, such that the value of the function increases monotonically toward obstacles and decreases monotonically toward the goal configuration, where it achieves its minimum value (cf. §2.3 for more details). The search progresses iteratively: at each step the robot moves a small, pre-defined distance in the direction


determined by the field, until the goal configuration is reached. Using only the potential energy function U(q), we can determine the path which the robot should take through F using the algorithm listed in Table 4.4.

Figure 4.9: Approximate Cell Decomposition (quadtree decomposition).

Figure 4.10: The potential field approach.

Potential field path planning algorithm

1. Let e = maximum distance traveled in each step.
2. Let t = 0.
3. Find the gradient of the potential function, ∇ = grad U(q).
4. Let u_t be the direction opposite to the gradient vector. (Note: u_t lies in [0, 2π).)
5. Let the new position be (x_{t+1}, y_{t+1}) = (x_t + e·cos u_t, y_t + e·sin u_t).
6. If (x_{t+1}, y_{t+1}) is within distance D of the goal, declare success.
7. If Q iterations have passed, declare failure (this indicates a local minimum is probably present).
8. Increment t and go to step 3.

Table 4.4: The potential field path planning algorithm.
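A hedged Java sketch of the loop in Table 4.4 is shown below. It assumes a potential function U(q) built from an attractive goal term and repulsive obstacle terms and estimates the gradient numerically; the start, goal, obstacle positions, and step parameters are illustrative values.

/* Sketch of the potential field planner of Table 4.4; the potential function is an assumption. */
class PotentialFieldPlanner {
    static final double[] GOAL = { 9.0, 9.0 };
    static final double[][] OBSTACLES = { { 4.0, 4.5 }, { 6.0, 7.0 } };

    /* U(q): attractive toward the goal, repulsive near the obstacles */
    static double potential(double x, double y) {
        double u = Math.hypot(x - GOAL[0], y - GOAL[1]);
        for (double[] o : OBSTACLES) {
            double d = Math.hypot(x - o[0], y - o[1]);
            u += 1.0 / Math.max(d, 0.01);                 // grows toward obstacles
        }
        return u;
    }

    public static void main(String[] args) {
        double e = 0.1;                                   // step length
        double x = 0.0, y = 0.0;                          // start configuration
        for (int t = 0; t < 1000; t++) {                  // Q iterations before declaring failure
            double h = 1e-4;                              // numerical gradient of U(q)
            double gx = (potential(x + h, y) - potential(x - h, y)) / (2 * h);
            double gy = (potential(x, y + h) - potential(x, y - h)) / (2 * h);
            double u = Math.atan2(-gy, -gx);              // direction opposite to the gradient
            x += e * Math.cos(u);
            y += e * Math.sin(u);
            if (Math.hypot(x - GOAL[0], y - GOAL[1]) < 0.2) {
                System.out.printf("Goal reached at (%.2f, %.2f) after %d steps%n", x, y, t);
                return;
            }
        }
        System.out.println("Failure: probably stuck in a local minimum");
    }
}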


4.3.3.5 Virtual Force Field and Vector Field Histogram

Various heuristics based on ideas similar to the potential field have been used for mobile robot control. The virtual force field (VFF) algorithm [Borenstein and Koren, 1989], for example, constructs an occupancy grid on-line and then has obstacles (and the goal) exert forces on the robot. As in potential field approaches, these forces drive the robot towards the goal while driving it away from obstacles. Unlike true potential fields, however, the VFF algorithm only considers forces from a small local neighbourhood around the robot. Moreover, the VFF reduces the spatial distribution of forces to a single vector quantity, so the detailed information concerning the local distribution of obstacles is lost. The vector field histogram (VFH) algorithm [Borenstein and Koren, 1991] overcomes this problem by constructing a local polar representation of obstacle density. The local region around the robot, defined by a circular window of width w, is divided into angular sectors, and obstacles within each sector are combined into a measure of that sector's traversability. A threshold is set: sectors above the threshold are termed impassable, whereas sectors below the threshold are identified as candidate directions for the robot. The candidate region most closely aligned with the robot's preferred direction of motion is chosen as the appropriate direction of travel. The robot and obstacles are defined in a pre-existing

occupancy grid C expressed in Cartesian coordinates as (x_r, y_r) and (x_i, y_i). The vector direction is given by

\beta_{i,j} = \arctan\!\left(\frac{y_i - y_r}{x_i - x_r}\right)    (4.4)

and the vector magnitude is given by

m_{i,j} = c_{i,j}^{\,2}\,(a - b\,d_{i,j})    (4.5)

where c_{i,j} is the certainty of cell C_{i,j} in the grid, d_{i,j} is the distance from the robot to cell C_{i,j}, and a and b are constants.
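To make Equations 4.4 and 4.5 concrete, the following sketch builds the polar obstacle-density histogram from an occupancy-grid window around the robot and then picks the passable sector closest to the preferred heading. The sector width, the constants a and b, and the threshold are illustrative assumptions.

/* Sketch of the vector field histogram construction of Equations 4.4 and 4.5. */
class VectorFieldHistogram {
    static final int SECTORS = 72;                    // 5-degree angular sectors
    static final double A = 10.0, B = 1.0;            // constants of Equation 4.5 (assumed)
    static final double THRESHOLD = 2.0;              // traversability threshold (assumed)

    /* certainty[i][j] is c_ij in a window of the occupancy grid centred on the robot */
    static double[] buildHistogram(double[][] certainty, double cellSize, int robotI, int robotJ) {
        double[] h = new double[SECTORS];
        for (int i = 0; i < certainty.length; i++) {
            for (int j = 0; j < certainty[i].length; j++) {
                if (i == robotI && j == robotJ) continue;
                double dx = (i - robotI) * cellSize, dy = (j - robotJ) * cellSize;
                double beta = Math.atan2(dy, dx);                            // Equation 4.4
                double d = Math.hypot(dx, dy);
                double m = certainty[i][j] * certainty[i][j] * (A - B * d);  // Equation 4.5
                if (m <= 0) continue;
                int sector = (int) Math.floor((beta + Math.PI) / (2 * Math.PI) * SECTORS) % SECTORS;
                h[sector] += m;                        // obstacle density of that sector
            }
        }
        return h;
    }

    /* choose the passable sector closest to the preferred direction of travel */
    static int chooseSector(double[] h, int preferredSector) {
        int best = -1;
        for (int s = 0; s < SECTORS; s++) {
            if (h[s] >= THRESHOLD) continue;           // impassable sector
            if (best < 0 || angularDistance(s, preferredSector) < angularDistance(best, preferredSector))
                best = s;
        }
        return best;                                   // -1 means no candidate direction exists
    }

    static int angularDistance(int a, int b) {
        int d = Math.abs(a - b) % SECTORS;
        return Math.min(d, SECTORS - d);
    }
}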

4.4 Assignments

1. Implement a potential field path planning method and compute a 2-D trajectory across a rectangular room with several non-convex obstacles in it.

2. Implement the visibility graph-based planner for an environment composed of

polygonal obstacles.

3. An important property of an occupancy grid is that it supports sensor fusion. Discuss

this sentence.


CHAPTER V

ROBOT LEARNING

In order to build mobile robot systems that come close to what people dream of, researchers have suggested embedding learning techniques in them to increase their efficiency and to prepare them to deal with unforeseen situations and circumstances in their environments. Until now, learning has often been neglected, and the designer of a mobile robot system has generally taken on the extremely difficult task of trying to anticipate all possible contingencies and interactions among the robot components ahead of time. The ultimate goal of machine learning techniques is to deal with the noise in the robot's internal and external sensors and with the inconsistencies in the behavior of the environment. Machine learning focuses on developing approaches and algorithms that enable an artificial agent (i.e., a robot) to automatically acquire knowledge, develop strategies, and modify its behavioral tendencies through experience.

5.1 Machine Learning in Mobile Robots

Research in machine learning has undergone tremendous growth. This field is contained within the broader confines of artificial intelligence, and its attraction for researchers stems from many sources. Machine learning focuses on developing approaches and algorithms that enable an artificial agent (i.e., a robot) to automatically acquire knowledge, develop strategies, and modify its behavioral tendencies through experience. This is especially important in mobile robot systems, where learning has often been neglected. The designers of mobile robot systems have generally taken on the extremely difficult task of trying to anticipate all possible contingencies and interactions between the robot and its environment ahead of time. Therefore, the ultimate goal of machine learning is to make intelligent robots that can deal with noise in their internal and external sensors and with inconsistencies in the behavior of the environment. Machine learning has studied many different types of learning; they are classified under two types: supervised and unsupervised [Russell and Norvig, 1995; Russell, 1996; Mahadevan, 1996a; Dietterich, 1997; Mitchell, 1997].

In supervised learning, a designer carefully selects training examples of the task to be performed by the robot; these must be supplied by an expert, whether human or computerized. The backpropagation algorithm for training artificial neural networks is one common technique for supervised learning [Mitchell, 1997; Haykin, 1999]. If the training information provided is more limited, in the form of a scalar, performance-related reward that does not indicate the correct action itself, then the learning method becomes reinforcement learning. ALVINN (Autonomous Land Vehicle In a Neural Network) is a well-known application of supervised learning to robot control [Pomerleau, 1993]. The network receives as input an image of the road ahead and produces as output the steering command required to keep the car on the road.


Reinforcement learning algorithms can be used as components of multi-robot systems. Here the environment supplies a scalar reward signal that indicates how good or bad the robot's action was. The mapping from state to action performed by the robot, the policy, is the object modified during learning [Nehmzov and Mitchell, 1995]. Usually the policy is based on a value function providing an estimate of the expected future reward to be received in a state or for a state-action pair. The reward is said to be immediate when, and only when, an optimal policy always takes the action with the highest next reward without considering future consequences. If such a policy is not optimal, rewards are delayed. Reinforcement learning problems can in principle be solved using dynamic programming algorithms [Barto et al., 1995; Sutton and Barto, 1998]. However, dynamic programming requires a model of the environment, which is not available in dynamic interactive tasks, and time that is polynomial in the number of states. Some recent reinforcement learning algorithms have been designed to perform dynamic programming in an incremental manner, without needing the state transition probabilities and reward structure of the environment, by improving the performance on-line while interacting with the environment. If the members of a group of robots each employ one of these algorithms, a new collective algorithm emerges for the group as a whole. This type of collective algorithm allows control policies to be learned in a decentralized way.

Unsupervised learning, on the other hand, does not provide feedback on the learning task. Unsupervised learning techniques perform a clustering of incoming information without knowing the target output. Unsupervised learning can identify clusters of similar data points, enabling the data to be represented in terms of more orthogonal features. This reduces the effective dimensionality of the data, enabling a more concise representation and supporting more accurate supervised learning [Mitchell, 1997; Haykin, 1999]. One unsupervised learning paradigm is the Kohonen map, which is frequently used in autonomous robots, most often for categorization, map-building, and motor control tasks. A well-known application of the Kohonen network is location recognition, where the robot learns to recognize a particular location based on information from motor signals [Nehmzow and Smithers, 1992].

From the different approaches to machine learning, I have chosen reinforcement learning because it is suitable for lifelong learning: the a priori knowledge required from the designer of the learning robot consists only of a good specification of the task to be learned.

5.2 Reinforcement Learning

In artificial intelligence, a very general model of the environment is assumed to be deterministic, and planning and heuristic search typically attempt to construct open-loop control policies. Such methods are very general in that they can find correct policies for complex, non-linear environments if the model is accurate, perception is complete, and the controlled system is deterministic. These assumptions break down in complex dynamic environments for which we do not


possess models, and in which the robot may be hampered by incomplete perception and unreliable effectors. On the other hand, control theory gives us methods for constructing closed-loop adaptive control policies that depend on the environment's state and on time [Singh, 1994; Barto et al., 1995]. Techniques for constructing such control policies include analytical methods for simple (linear) systems, and numerical methods such as dynamic programming for more complex (non-linear and stochastic) systems. These methods construct control policies off-line, given an accurate environment model [Wyatt, 1997].

When trying to design adaptive control policies for robots, we typically have no model of the environment. Thus machine learning arises as the source of on-line and model-free approaches and algorithms for constructing the robot's closed-loop policy by automatically acquiring knowledge, developing strategies, and modifying its behavioral tendencies through experience [Kaelbling et al., 1996; Sutton and Barto, 1998; Keerthi and Ravindran, 1999; Mahadevan, 1996b; Russell, 1996; Mitchell, 1997; Dietterich, 1997]. Reinforcement learning (RL) describes a large class of learning tasks and algorithms characterized by the fact that the robot learns an associative mapping π: X → A (from the state space X to the action space A) by maximizing a scalar evaluation (reward) of its performance obtained from the environment. The robot learns strategies for interacting with its environment in a closed-

loop cycle, as shown in Figure 5.1. This is the basic reinforcement-learning model, in which the robot-environment interaction is viewed as a coupled dynamic system: the outputs of the robot are transformed into inputs to the environment, and the outputs of the environment are transformed into inputs to the robot. The inputs to the robot usually do not capture a complete description of the environment's state, and the environment changes continuously through time. Furthermore, the robot influences the future sequence of its inputs by its outputs at each stage. Through this interaction, the robot explores its dynamic environment, attempting to monitor the environment and figure out by itself how it can exploit acquired experience to maximize the reward it will receive based on the sequence of actions it takes.

Figure 5.1: The basic reinforcement-learning model. The robot interacts with an environment in a feedback cycle.

Learning solely from reinforcement signals makes the environment a powerful force shaping the behaviors of autonomous robots. Robots can also directly shape their environments. Because of this interaction of shaping forces, it is important to develop autonomous robots in a system where all of the forces for change are learnable, to optimize


system performance over its lifetime. When learning control policies, we must be able to

evaluate them with respect to each other. The robot’s goal is to find, based on its

experience with the environment, a strategy, or an optimal policy, for choosing actions that

will yield as much reward as possible. The most obvious metric, the sum of all rewards over

the life of the robot,

\sum_{t=0}^{\infty} r_t    (5.1)

is generally not used. For robots with infinite lifetimes, all possible sequences of rewards would sum to infinity. This is not the case for a robot with a finite lifetime, however; in this case, the obvious metric turns into the finite horizon measure,

\sum_{t=0}^{k} r_t \qquad (5.2)

This measure sums the rewards over some finite number, k, of time steps. The average-reward measure,

\lim_{k \to \infty} \frac{1}{k} \sum_{t=0}^{k} r_t \qquad (5.3)

extends this by using the average reward received over the whole lifetime of the robot.

The infinite horizon discounted measure,

\sum_{t=0}^{\infty} \gamma^{t} r_t \qquad (5.4)

uses a discount factor, $0 \le \gamma < 1$, to control the relative importance of short-term and long-term reinforcements. As $\gamma \to 0$ the short-term reinforcement signals become more important. When $\gamma = 0$ the only reinforcement signal that matters is the immediate

reinforcement signal, and we obtain the one-step greedy policy; the best action is the one

that gives the greatest immediate reward. The optimal finite horizon policy may be non-

stationary, in the sense that different actions may be selected for the same state at different time steps. In problems with hard deadlines, finite horizon policies are the right choice. In problems without hard deadlines, $k \to \infty$, and an infinite-horizon policy is desired. It is a theorem that the optimal infinite horizon policy is stationary [Haykin, 1999]. Most of the

reinforcement-learning problems are typically cast as Markov decision processes (MDPs)4 as

shown in Figure 5.2.
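As a small illustration of these measures, the following Python sketch computes the finite-horizon, average, and (truncated) discounted returns of Equations (5.2)-(5.4); the reward sequence and discount factor here are arbitrary, invented values used only for illustration.

# Illustrative sketch: the reward measures of Eqs. (5.2)-(5.4)
# for a hypothetical reward sequence.
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # hypothetical rewards r_0 ... r_k
gamma = 0.9                            # discount factor, 0 <= gamma < 1

finite_horizon = sum(rewards)                                        # Eq. (5.2)
average_reward = sum(rewards) / len(rewards)                         # Eq. (5.3), finite approximation
discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))    # Eq. (5.4), truncated sum

print(finite_horizon, average_reward, discounted)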

4 Complete observability of the state is necessary for learning methods based on MDPs. In the case of noisy and incomplete information in the robot's observations, the "perceptual aliasing" or "hidden state" problem arises, and the MDP framework needs to be extended to the partially observable Markov decision process (POMDP). See Kaelbling et al. (1996) for the strategies implemented to deal with this problem.


Figure 5.2: The Markov decision process model.

In a Markov decision process there is a finite set of states, S, a finite set of actions, A, and time is discrete. The reward function

R : S \times A \to \mathbb{R} \qquad (5.5)

returns an immediate measure of how good an action was. The resulting state, $s_{t+1}$, is dependent on the transition function

T : S \times A \to \Pi(S) \qquad (5.6)

which returns a probability distribution over possible next states. An important

property of MDPs is that these state transitions depend only on the last state and action.

This is known as the Markov property. The problem, then, is to generate a policy, $\pi : S \to A$, based on these immediate rewards, that maximizes our expected long-term reward measure, such

as:

V^{\pi}(s_0) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} R_t \mid s_0 \right] \qquad (5.7)

where $E$ indicates the expected value and $R(s,a) = \sum_{s'} T(s,a,s')\, R(s,a,s')$ is the expected reward for a state-action pair. If we know the functions $T$ and $R$ then we can define an

optimal value function, $V^{*}(s)$, over states:

V^{*}(s) = \max_{a \in A}\left[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \right] \qquad (5.8)

This function assigns a value to each state, which is the best immediate reward that we

can get for any action from that state added to the optimal value from each of the possible

resulting states, weighted by their probability. If we know this function, then we can define

the optimal policy, $\pi^{*}$, by simply selecting the action, a, that gives the maximum value:

\pi^{*}(s) = \arg\max_{a \in A}\left[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \right] \qquad (5.9)

Equation (5.8) is Bellman's optimality equation. Together with Equation (5.9) it provides a means of computing the optimal policy for the Markov decision process by performing a one-step look-ahead search


and choosing the action whose backup value is the largest. There are well-understood

dynamic-programming-based methods for computing $V^{*}$, such as value iteration and policy iteration, which lead to a simple procedure for learning the optimal value function, and hence the optimal policy. The two algorithms, policy iteration and value iteration, are off-line and model-based methods; they require the $T$ and $R$ functions to be known or learned [Dietterich, 1997]. Instead of learning $T$ and $R$, however, we can incrementally learn the optimal value function directly. In the next sections, we describe well-known algorithms that attempt to iteratively approximate the optimal value function.
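As a concrete illustration of the backup in Equation (5.8), the following Python sketch runs value iteration on a tiny, hypothetical MDP; the states, actions, transition table T, rewards R, and discount factor below are invented for illustration only.

# Minimal value-iteration sketch for a tiny, hypothetical MDP.
states = [0, 1]
actions = [0, 1]
gamma = 0.9

# T[s][a] is a dict {next_state: probability}; R[s][a] is the expected reward (illustrative values).
T = {0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

V = {s: 0.0 for s in states}             # arbitrary initial value function
for _ in range(100):                     # repeatedly apply the backup of Eq. (5.8)
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in actions)
         for s in states}

# Greedy policy of Eq. (5.9): pick the action with the largest backed-up value.
policy = {s: max(actions,
                 key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in states}
print(V, policy)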

5.2.1 Monte Carlo Learning

Dynamic programming methods require complete environment dynamics to find the optimal value function. However, in many problem domains full environmental information is not available, or there is no model of the environment. In such cases the environment must be observed through experience. Monte Carlo methods provide a means of making this possible. Monte Carlo methods are for episodic tasks only, because they are based on complete averaged returns. By episodic we mean that the task somehow (with success or failure) terminates and then restarts. MC methods learn in an episode-by-episode manner rather than a step-by-step manner. This implies that the value function estimates and policies are changed after each episode and remain the same until the episode completes. Monte Carlo methods average the returns observed from the states that are visited in an episode. After many episodes and many visits to each state, the averaged values become approximately correct estimates.

First-visit Monte Carlo Algorithm
Initialize:
    π ← policy to be evaluated
    V(s) ← an arbitrary state-value function
    Returns(s) ← an empty list, for all s ∈ S
Do forever:
    Generate an episode using π.
    For each state s appearing in the episode:
        R ← return following the first occurrence of s.
        Append R to Returns(s).
        V(s) ← average(Returns(s)).
End Do

Table 5.1: First-visit Monte Carlo method for estimating $V^{\pi}(s)$.

In order to estimate $V^{\pi}(s)$, the value of state s under policy π, given a set of episodes obtained by following π and passing through s, we first define each occurrence of state s in an episode to be a visit to s. The every-visit Monte Carlo method then estimates $V^{\pi}(s)$ as the average of the returns following all the visits to s in a set of episodes. Another approach, called the first-visit Monte Carlo method (shown in Table 5.1), averages just the returns following the first visit to s. Both first-visit and every-visit Monte Carlo converge to $V^{\pi}(s)$ as the number of visits (or first visits) to s goes to infinity [Kaelbling et al., 1996; Sutton and Barto, 1998; Keerthi and Ravindran, 1999].
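A minimal Python sketch of the first-visit procedure of Table 5.1 might look as follows; the episode generator, the policy, and the discount factor are placeholders to be supplied by the application, and only the averaging logic follows the table.

# First-visit Monte Carlo policy evaluation, following Table 5.1.
# generate_episode(policy) is an assumed placeholder returning a list of
# (state, reward) pairs for one complete episode.
from collections import defaultdict

def first_visit_mc(policy, generate_episode, gamma=0.9, n_episodes=1000):
    returns = defaultdict(list)          # Returns(s): list of observed returns
    V = defaultdict(float)               # state-value estimates
    for _ in range(n_episodes):
        episode = generate_episode(policy)        # [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        returns_after = {}               # return following the first visit to s
        for state, reward in reversed(episode):
            G = reward + gamma * G       # discounted return from this step onward
            returns_after[state] = G     # overwriting keeps the earliest (first) visit
        for state, G in returns_after.items():
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V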

5.2.2 Temporal Difference Learning

Temporal difference learning (TD(0)) [Sutton, 1988] is a combination of Monte Carlo and dynamic programming methods for learning a value function for states, $V(s)$, based on state transitions and rewards, $(s_t, r_{t+1}, s_{t+1})$. Starting with a random value for each state, it iteratively updates the value-function approximation according to the following update rule:

V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \qquad (5.10)

There are two parameters: a learning rate, $\alpha$, and a discount factor, $\gamma$. The learning rate controls how much we change the current estimate of $V(s)$ based on each new experience. Equation (5.10) forms the basis for the TD(0) algorithm, which is really an instance of a more general class of algorithms called TD($\lambda$), with $\lambda = 0$, where $0 \le \lambda \le 1$ is a parameter that controls how much of the current difference $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is applied to all previous states. As shown in Table 5.2, TD(0) looks only one step ahead when adjusting value estimates; although it will eventually arrive at the correct answer, it can take quite a while to do so. The general TD($\lambda$) is similar to TD(0) but is generalized to propagate information across trajectories of states according to their eligibility, $e_t(s)$, which is a function of $\lambda$ that indicates how far in the past and how frequently the state has been visited.

V(s) \leftarrow V(s) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] e_t(s) \qquad (5.11)

The value of $e_t(s)$ is updated on each transition for all $s \in S$. There are two forms of update equations. An accumulating trace updates the eligibility of a state using

e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = \text{current state} \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases} \qquad (5.12)

A replacing trace update is defined by,

e_t(s) = \begin{cases} 1 & \text{if } s = \text{current state} \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases} \qquad (5.13)

Under both mechanisms, the eligibility of a state decays away exponentially when the

state is unvisited. Under an accumulating trace, the eligibility is increased by a constant

every time the state is visited, and under a replacing trace the eligibility of a state is reset to a

constant on each visit. An eligibility trace can be thought of as a short-term memory process,

initiated the first time a state is visited by a robot. The degree of activation depends on the


recency of the most recent visit and on the frequency of visits. Thus, eligibility traces

implement two heuristics, a recency heuristic, and a frequency heuristic [Sutton and Barto, 1998].

Sutton [1988] and others [Dayan, 1992; Dayan and Sejnowski, 1994] note that TD($\lambda$) converges more quickly for $\lambda > 0$. Learning the value function $V^{\pi}(s)$ for a fixed policy can be combined with a policy learner to get what is known as an actor-critic or adaptive heuristic critic system [Barto et al., 1983]. This alternates between learning the value function for the current policy and modifying the policy based on the learned value function.

TD(0) Algorithm
For each s ∈ S, initialize the table entry V(s) to 0. Observe the current state st. Do forever:
    Select an action at using st and π, and execute it.
    Receive immediate reward rt+1.
    Observe the new state st+1.
    Update the table entry V(st) as in Eq. (5.10).
    Set st to st+1.
End Do

Table 5.2: TD(0) learning algorithm for estimating $V^{\pi}(s)$.
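The updates of Equations (5.10)-(5.13) can be coded directly on a value table. The sketch below is illustrative only: the learning rate, discount factor, and trace-decay parameter are assumed values, and setting lam = 0 reduces it to the one-step TD(0) update of Table 5.2, while lam > 0 gives TD(λ) with a replacing trace.

# Tabular TD(lambda) with a replacing trace, following Eqs. (5.10)-(5.13).
from collections import defaultdict

def td_lambda_update(V, e, s, r_next, s_next, alpha=0.1, gamma=0.9, lam=0.0):
    """Apply one observed transition (s, r_next, s_next) to the value table V and traces e."""
    delta = r_next + gamma * V[s_next] - V[s]   # the TD error of Eq. (5.10)
    e[s] = 1.0                                  # replacing trace, Eq. (5.13)
    for state in list(e.keys()):
        V[state] += alpha * delta * e[state]    # Eq. (5.11): credit all eligible states
        e[state] *= gamma * lam                 # decay eligibility, Eqs. (5.12)/(5.13)
    return V, e

# usage sketch: V = defaultdict(float); e = defaultdict(float)
# then, for each observed transition: V, e = td_lambda_update(V, e, s, r, s_next)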

5.2.3 Q-learning Algorithm

Instead of maintaining both a policy and a value function using actor-critic techniques, Q-learning [Watkins, 1989] performs essentially the same function by combining them into a single function, known as the Q-function. This function, $Q(s,a)$, reflects how good it is, in a long-term sense that depends on the evaluation measure, to take action a from state s. In other words, the quantity $Q(s,a)$ gives the expected cumulative discounted reward of performing action a in state s and then pursuing the current policy thereafter. It uses 4-tuples $(s_t, a_t, r_{t+1}, s_{t+1})$ to iteratively update an approximation to the optimal Q-function,

Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q^{*}(s',a') \qquad (5.14)

Once the optimal value function is known, the optimal policy, $\pi^{*}(s)$, can be easily calculated:

\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a) \qquad (5.15)

Starting with random values, the Q-function can be approximated incrementally according to the following rule:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \right] \qquad (5.16)


The Q-learning algorithm is summarized in Table 5.3. Watkins and Dayan [1992]

proved the iteration converges to Q*(s, a) under the condition that the learning set includes

an infinite number of episodes for each state and action. If the function is properly

computed, an agent can act optimally simply by looking up the best-valued action for any

situation. Q-learning chooses the optimal action for a state based on the value of the Q-function for that state. If only the best action is chosen, it is possible that some actions will never be chosen from some states. This is bad, because we can never be certain of having found the best action from a state unless we have tried them all. So, we must have some sort of exploration strategy, which encourages us to take some non-optimal actions in order to gain information about the world. This is known as the exploration/exploitation problem [Thrun, 1992a, 1992b; Wiering, 1999].

Q-learning Algorithm
For each pair (s ∈ S, a ∈ A), initialize the table entry Q(s, a) to 0. Observe the current state st. Do forever:
    Select an action at and execute it.
    Receive immediate reward rt+1.
    Observe the new state st+1.
    Update the table entry Q(st, at) using this real experience, as in Equation (5.16).
    Set st to st+1.
End Do

Table 5.3: Q-learning algorithm.
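Table 5.3 can be coded directly; the sketch below is illustrative only, with the environment interface (env.reset, env.step) and the action-selection routine assumed rather than taken from the text, and it applies the update of Equation (5.16) at every step.

# Tabular Q-learning sketch following Table 5.3 and Eq. (5.16).
# `env` is an assumed environment with reset()/step(a); `select_action`
# is any exploration strategy (e.g. epsilon-greedy or soft-max).
from collections import defaultdict

def q_learning(env, actions, select_action, alpha=0.1, gamma=0.9, n_steps=10000):
    Q = defaultdict(float)                       # Q[(s, a)], initialized to 0
    s = env.reset()
    for _ in range(n_steps):
        a = select_action(Q, s, actions)
        s_next, r = env.step(a)                  # execute a, observe reward and next state
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        # Eq. (5.16): move Q(s,a) toward r + gamma * max_a' Q(s', a')
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    return Q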

The simplest exploration strategy follows a policy of optimism in the face of uncertainty.

When the effect of an action is poorly understood, it is usually better to take that action in order to reduce this uncertainty; for example, by introducing and propagating exploration bonuses to artificially raise the estimated value of untried actions [Sutton, 1990a; Wyatt, 1997; Wiering, 1999]. Another approach, the $\epsilon$-greedy strategy, is to select a random action a certain percentage $\epsilon$ of the time, and the best action otherwise [Sutton and Barto, 1998]. The last approach attempts to take into account how certain we are that actions are good or bad. It is based on the Boltzmann or Gibbs distribution and is often called soft-max exploration, where at each time step action a is chosen from state s with the following probability:

\Pr(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}} \qquad (5.17)

The parameter $T > 0$ is called the temperature. Varying the temperature parameter provides a continuum of stochastic policies, varying between the random policy at high temperature and the greedy policy at low temperature. Even at a fixed temperature, the nature of

the policy will change as the system learns, because a lot of exploration occurs in states


where Q-values for different actions are almost equal, and little exploration in states where

Q-values are very different.
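Both the ε-greedy rule and the soft-max rule of Equation (5.17) are easy to implement on top of a Q-table. The following sketch is illustrative only; epsilon and the temperature T are assumed parameters, and the Q-table is assumed to be the same (state, action)-keyed dictionary used in the Q-learning sketch above.

# Exploration strategies over a tabular Q-function (illustrative sketch).
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon take a random action, otherwise the greedy one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax_action(Q, s, actions, T=1.0):
    # Boltzmann (soft-max) exploration, Eq. (5.17): Pr(a|s) ~ exp(Q(s,a)/T).
    prefs = [math.exp(Q[(s, a)] / T) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]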

5.2.4 SARSA Learning Algorithm

While Q-learning updates Q-values by making use of the assumption that the currently best action will be taken at the next step, there is a simple scheme which makes use of the real-world experience instead. This approach is called SARSA, and it was introduced by Rummery and Niranjan [1994]. SARSA is similar to Q-learning in that it attempts to learn the state-action value function $Q(s,a)$. The main difference between SARSA and Q-learning, however, is in the incremental update function: the max operator is dropped from the update equation and the state-action value of the next state-action pair is used. The update equation is:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \qquad (5.18)

Just like TD, SARSA learns the value function for a fixed policy and must be combined with a policy-learning component in order to make a complete reinforcement-learning system.
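The difference from the Q-learning sketch is essentially one line; the minimal sketch below applies the update of Equation (5.18), with the same assumed environment and exploration interfaces as before.

# Tabular SARSA sketch following Eq. (5.18): the bootstrap term uses the action
# actually selected in the next state, not the maximizing action.
from collections import defaultdict

def sarsa(env, actions, select_action, alpha=0.1, gamma=0.9, n_steps=10000):
    Q = defaultdict(float)
    s = env.reset()
    a = select_action(Q, s, actions)
    for _ in range(n_steps):
        s_next, r = env.step(a)
        a_next = select_action(Q, s_next, actions)     # on-policy: next action is the real one
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q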

5.3 Which Algorithm Should We Use?

The algorithms presented in previous sections have all been shown to be effective in

solving a variety of reinforcement learning tasks. However, there is a fundamental

difference between Q-learning and the others that makes it much more appealing for our

purposes. The temporal difference and SARSA algorithms are known as on-policy

algorithms. The value function that they learn is dependent on the policy that is followed

during learning. Q-learning, on the other hand, is an off-policy algorithm. The learned policy is

independent of the policy followed during learning. Using an off-policy algorithm, such as Q-

learning, frees us from worrying about the quality of the policy that we follow during learning, since we might not know a good policy for the task that we are attempting to learn. Using an on-policy algorithm with an arbitrarily bad training policy might cause us not

to learn the optimal policy. Using an off-policy method allows us to avoid this problem by

using exploration strategies to maximize the information obtained during learning [Smart,

2002]. There are still potential problems, however, with the speed of convergence to the

optimal policy when using training policies that are not necessarily very good.

5.4 Planning in Reinforcement Learning

Reinforcement learning algorithms learn from experiences obtained by interacting with the environment. Although making errors is required in order to learn, errors may cost the robot its life (hazardous situations). If the robot has an almost correct model, it can produce hypothetical experience to learn from, since the learning process for an optimal policy needs many experiences [Lin, 1992; 1993; Mahadevan and Connell, 1992]. From an AI point of view, the optimal policy might be obtained if the system carried out explicit planning, which can be thought of as simulating possible future robot-environment interactions to determine long-range consequences. Planning is


learning from these hypothetical experiences, and it is useful when making errors is costly, when the robot has few real experiences from interaction with the environment, and when speeding up the learning process is urgent [Sutton, 1990a; 1990b; 1991; Clouse, 1997]. The earliest use of planning in reinforcement learning was the Dyna architecture, shown in Table 5.4, which integrates learning and planning and was proposed by Sutton [1990a; 1990b; 1991]. Planning is treated as being virtually identical to reinforcement learning: while learning updates the appropriate value function estimates according to real experience, planning differs only in that it updates these same value function estimates for hypothetical experience drawn from the world model. The model component of the Dyna algorithm is a table indexed by state-action pairs, containing the resultant state and a reward as each entry [Sutton and Barto, 1998]. The table is filled as real experience is gathered. When the state and action spaces are too large, storing all experience becomes impractical. There is a version of Dyna called Prioritized Sweeping [Moore and Atkeson, 1993], and the closely related Queue-Dyna [Peng and Williams, 1993], that store a fixed number of the most useful experiences and update the experience tuples for which there will be large update errors, using prioritized value functions. Lin [1993] proposed the experience replay method, where past useful experiences are stored in a fixed-length list and presented to the reinforcement learner again and again in reverse order. This is also a kind of planning, because some costly experiences from the past are used to learn from once again. It is useful for learning in real worlds, in which exploring new experiences is often much more expensive than replaying old experiences.

Q-Dyna Learning Algorithm
For each pair (s ∈ S, a ∈ A), initialize the table entry Q(s, a) to 0. Observe the current state st. Do forever:
    Select an action at and execute it.
    Receive immediate reward rt+1.
    Observe the new state st+1.
    Update the table entry Q(st, at) using this real experience, as in Equation (5.16).
    Set st to st+1.
    Update the Model using this real experience.
    Repeat k times:
        Predict a reward rt+1 and next state st+1 from the Model.
        Update the corresponding table entry Q(st, at) using this hypothetical experience, as in Equation (5.16).
    End Repeat
End Do

Table 5.4: Q-Dyna learning algorithm.
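A minimal sketch corresponding to Table 5.4 is shown below; it is illustrative only, with env, select_action, and the number k of planning updates per real step all assumed, and it interleaves one real-experience update of Equation (5.16) with k hypothetical updates drawn from the learned table model.

# Dyna-Q sketch following Table 5.4.
from collections import defaultdict
import random

def dyna_q(env, actions, select_action, alpha=0.1, gamma=0.9, k=10, n_steps=5000):
    Q = defaultdict(float)
    model = {}                                   # model[(s, a)] = (reward, next_state)
    s = env.reset()
    for _ in range(n_steps):
        a = select_action(Q, s, actions)
        s_next, r = env.step(a)
        best = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])     # real experience, Eq. (5.16)
        model[(s, a)] = (r, s_next)                             # update the world model
        for _ in range(k):                                      # planning: replay the model
            sp, ap = random.choice(list(model.keys()))
            rp, sp_next = model[(sp, ap)]
            best_p = max(Q[(sp_next, a2)] for a2 in actions)
            Q[(sp, ap)] += alpha * (rp + gamma * best_p - Q[(sp, ap)])
        s = s_next
    return Q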

5.5 Designing the Reinforcement Functions

Many researchers have recognized that for any real artificial agent implementation, there is no convenient reinforcement function in the world that inserts appropriate values into the robot [Karlsson, 1997]. The reinforcement function is thus depicted as a part of the robot that


translates the observed state into a reinforcement value. That is, part of the robot’s a priori

knowledge is the ability to recognize states and assign a value to them. Balch [1998b] provides taxonomies for various kinds of reinforcement functions that have been developed for multi-robot systems, based on a number of issues. The taxonomy of multi-robot reinforcement functions examines the source of reward, its relation to performance, locality, time, and continuity. In order to define the reinforcement function, the designer distills a large amount of task-specific knowledge into a number. Thus, the reinforcement function is usually closely coupled to the performance metric for a task. Since learning robots strive to

maximize the reward signal provided them, performance is maximized when their reward

closely parallels performance. It is sometimes the case, however, in many embedded autonomous agent applications, especially in multi-robot systems, that robots cannot or should not be rewarded strictly according to overall system performance. For example, the reinforcement function designer will encode a notion of what classes of states represent

the robot’s goal. However, the robot’s sensors do not provide enough information for an

accurate computation of performance, reward is delayed or performance depends on the

actions of other robots over which the agent has limited knowledge and/or control [Balch,

1998b]. Since the reinforcement function is the robot’s only indication of how well it is

meeting its goals, it must be designed with great care. For all practical purposes, the

reinforcement function defines the robot’s task. For complex problems, it may be necessary

to refine the reinforcement function as a robot discovers ways to get the reinforcement

without accomplishing the task. In order to design a correct reinforcement function, there must first be a clear idea of what the robot's goals should be, and the reinforcement function must parallel the performance metric [Balch, 1998b]. Given a simple goal, it is straightforward to define a reinforcement function leading to the proper behavior. Reinforcement learning gains much of its power from the ability to use the reinforcement function to specify different degrees of goal satisfaction and to identify sub-goals of a complex task. Rewarding sub-goals is one way to encode domain and task knowledge and to speed up learning [Matarić, 1994c]. The relative magnitude of reinforcements needs to

reflect the relative importance of the states being rewarded. One could prioritize sub-goals

by assigning reinforcements of greater magnitude to the more important ones [Karlsson,

1997].
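As a small illustration of these ideas, the sketch below defines a hypothetical reinforcement function for a navigation-style task in which reaching the goal is rewarded most strongly, a sub-goal is rewarded less, and collisions are penalized; the state fields and reward magnitudes are invented for illustration only.

# Hypothetical reinforcement function with prioritized sub-goals (illustrative only).
def reward(state):
    # `state` is assumed to expose boolean flags set by the robot's perception.
    if state.get("at_goal"):
        return 10.0          # the main goal carries the largest magnitude
    if state.get("collided"):
        return -5.0          # hazardous states are penalized
    if state.get("passed_waypoint"):
        return 1.0           # a sub-goal encodes task knowledge and speeds up learning
    return -0.01             # small step cost discourages aimless wandering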

5.6 Other Reinforcement Learning Techniques

There are two other reinforcement-learning techniques that are not cast in the Markov decision process framework. The two techniques take differing approaches, with disparate assumptions and algorithms, to endow the robot with the capacity to learn to solve problems. The techniques that we present in this section are (1) evolutionary and (2) statistical reinforcement learning [Holland, 1986; Goldberg, 1989; Moriarty et al., 1999; Maes and Brooks, 1991; Parker, 1994; 1995b; 1998a; 2000b]5.

5 Evolutionary learning is a form of learning. The main difference between evolution and learning concerns how the adaptation process is carried out in space and time. Evolution is based on a population of individuals, while learning takes place within a single individual. Evolution captures relatively slow environmental changes; learning, on the other hand, captures environmental changes faster.


5.6.1 Evolutionary Reinforcement Learning

Evolutionary learning algorithms [Goldberg, 1989; Michalewicz 1994; Koza, 1992] are

used to search the space of decision policies, and use a direct mapping that associates state

descriptions directly with a recommended action. Evolutionary algorithms encode the potential policies into structures called chromosomes. During each iteration, the evolutionary algorithm evaluates policies and generates offspring based on the fitness of each solution in the task. Substructures, or genes, of the policy are then modified through genetic operators such as mutation and crossover. The idea is that policies that led to good solutions in previous evaluations can be mutated or combined to form even better policies in subsequent evaluations. The policies are evaluated by a so-called fitness or objective function, indicating the utility or immediate rewards of a candidate policy [Moriarty et al., 1999]. Several researchers have employed evolutionary reinforcement learning, which combines genetic evolution with a rich variety of structures that have been put under evolution, such as neural networks [Lund, 1995; Floreano, 1997], symbolic rules [Grefenstette and Schultz, 1994], and fuzzy rules [Hoffmann, 1998], for robot control. The evolutionary learning algorithms are model-free approaches; however, they address the credit assignment and statistical information implicitly, based on the fitness evaluations of the entire policy sequence obtained [Moriarty et al., 1999]. In addition, evolutionary learning algorithms require a lot of memory space and evaluation time for the policies, which makes them actually off-line learning techniques [Matarić and Cliff, 1996]. Co-evolution has been used for evolving robust cooperative and competitive behavior in societies, but has not yet been applied to evolve control policies for physical robots [Matarić and Cliff, 1996]6.
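A minimal sketch of this policy-search loop is shown below; it is illustrative only, with the fitness function, the encoding of a policy as a state-to-action lookup table, and the genetic-operator rates all assumptions rather than a specific published algorithm.

# Minimal evolutionary policy search sketch: chromosomes map states to actions;
# fitness(policy) is an assumed task-specific evaluation function.
import random

def evolve(n_states, actions, fitness, pop_size=20, generations=50, p_mut=0.05):
    # Each chromosome is a list: chromosome[s] = action chosen in state s.
    population = [[random.choice(actions) for _ in range(n_states)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]                 # select the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_states)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [random.choice(actions) if random.random() < p_mut else g
                     for g in child]                     # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)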

5.6.2 Statistical Reinforcement Learning

Statistical learning deals with learning a world model using analysis and interpretation of data. These world models are usually not known a priori, but it is possible to learn them from experience. In statistical reinforcement algorithms - by using a mathematical world model combining a set of parameters - the robot tries to learn parameter adjustments, or to alter the precondition list for behavior activation, using data collected by monitoring the performance of the system and accumulating statistical factors, such as the correlation between the reinforcement signal and behavior activation [Maes and Brooks, 1991], or the average and standard deviation of the task performance [Parker, 1994; 1995b; 2000b]. These mathematical world models guide the system to learn the conditions for activation and selection of the behaviors through feedback signals. Furthermore, they are useful for explaining causal relationships, for explaining why the robot prefers some choice over another, and for reporting what the robot has been doing most of its time.
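A small sketch of this idea is given below; it is illustrative only, with the logged record fields assumed, and it simply accumulates, for each behavior, the average reinforcement received when the behavior was active versus inactive, in the spirit of correlating activation with reward.

# Illustrative statistical sketch: relate behavior activation to reinforcement
# and keep running performance statistics per behavior.
import statistics

def behavior_statistics(log):
    # `log` is an assumed list of records: {"behavior": name, "active": bool, "reward": float}
    stats = {}
    for name in {rec["behavior"] for rec in log}:
        rewards_on = [r["reward"] for r in log if r["behavior"] == name and r["active"]]
        rewards_off = [r["reward"] for r in log if r["behavior"] == name and not r["active"]]
        stats[name] = {
            "mean_when_active": statistics.mean(rewards_on) if rewards_on else 0.0,
            "mean_when_inactive": statistics.mean(rewards_off) if rewards_off else 0.0,
            "stdev_when_active": statistics.pstdev(rewards_on) if rewards_on else 0.0,
        }
    return stats   # a behavior whose activation correlates with reward can be favored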

6 Combining evolution and learning during the life of the robot is one way to overcome the lengthy evolutionary time scale. Learning can improve the search properties of artificial evolution by making the controller more robust to changes that occur very fast. This is typically achieved by evolving neural controllers that can learn with reinforcement learning or back-propagation.


5.7 Comparison Between Learning Techniques

One can characterize the techniques along dimensions such as: the memory space used by these algorithms, the ability to track a changing world and noise, whether a world model is learned for an optimal policy, and the optimality of the learned policy. One can also discuss how each technique

deals with the credit assignment and explore/exploit problems. Finally, another quality on which

the methods differ is the speed with which each achieves its objective. One can compare

and contrast these techniques along these dimensions and notice their pros and cons.

Learning to solve a problem requires finding a policy. Each algorithm uses different data structures and memory allocation. In evolutionary learning algorithms a lot of memory space and evaluation time is required for the policies; they create a population of candidate policies to be evaluated, which makes them actually off-line learning techniques [Matarić and Cliff, 1996]. The other techniques keep only one policy and gradually improve it. Moreover, this enables them to track a changing world and noise. The evolutionary learning algorithms are independent of any model of the environment in learning the policy, temporal difference learning can be either a model-free or a model-based approach, and the statistical approach is based completely on a world model. The use of world models speeds up the learning process and the reaching of objectives. In addition, world models make it possible to track a changing world and noise. The optimality of the learned policy is also a characteristic of concern. There is no evidence that a robot employing evolutionary learning will learn an optimal policy, especially with premature convergence [Goldberg, 1989; Michalewicz, 1994]. For statistical learning, there is barely any empirical evidence [Maes and Brooks, 1991; Parker, 1994; 1995b] that the learner will develop a useful policy at all, let alone an optimal one. On the other hand, there is much theoretical support for the claim that a temporal difference reinforcement learning robot will eventually learn an optimal policy. Many convergence proofs for TD($\lambda$) and Q-learning exist [Watkins, 1989; Watkins and Dayan, 1992; Tsitsiklis, 1994; Sutton and Barto, 1998]. Reinforcement learning must face the credit assignment problem and the explore/exploit tradeoff. Evolutionary learning algorithms address the credit assignment and statistical information implicitly, based on the fitness evaluations of the entire policy sequence obtained [Moriarty et al., 1999], with one exception for the classifier system. Temporal difference learning addresses the credit assignment and statistical information explicitly by distributing credit and statistics for every state. On the other hand, statistical reinforcement learning addresses credit assignment explicitly by correlating sampled experience with reinforcement rewards. Finally, both evolutionary learning and temporal difference algorithms need to explore their environment, but for statistical learning there is no reason for exploring, since the experience data is cataloged and correlated with reinforcement signals. The comparison is summarized in Table 5.5. In our comparison, we point out why the temporal difference and statistical learning techniques are preferred in the robotics and multi-robot domains [El-Telbany et al., 2002b].

                          Evolutionary learning   Temporal Difference   Statistical learning
Memory space              Large                    Small                 Small
Tracking changes          No                       Yes                   Yes
Model-based learning      No                       Yes                   Yes
Learning speed            Slow                     Middle                Fast
Optimality of policy      Mostly                   Approached            Mostly
Credit assignment         Implicit                 Explicit              Explicit
Explore/exploit problem   Yes                      Yes                   No

Table 5.5: Comparison between different reinforcement techniques in a set of characteristics.

5.8 Multi-Agent Robotic Systems

Decentralized control and distributed artificial intelligence have become common in everyday systems, and techniques for coordinating multiple intelligent mobile robots are gaining popularity in the robotics community. Certainly, a decentralized and distributed solution is the only viable approach for many real-world application domains that are inherently distributed in time, space, and/or behavior. Therefore, there has been an upsurge of interest in multiple autonomous mobile robots engaged in collective behavior, supporting and complementing each other. Many of these real-world applications are

space missions, operations in hazardous environments (fire fighting, cleanup of toxic waste,

nuclear power plant decommissioning, security), and military operations (Arkin, 1998;

Balch and Arkin, 1998). When compared with a single robot, multi-robot systems offer a

number of advantages, which include:
Distributed actions: the robots can be separated over a wide area. A single robot cannot be in different places at the same time.
Parallelism: this leads to a speed-up of the system's operations.
Scalability: multi-robot systems are open systems; one may add a new robot without major changes to the overall system.
Robustness or fault tolerance: being distributed, they are less prone to failures. Even if a robot is damaged, the overall performance need not be compromised (provided that there is redundancy).
Simplicity: multi-robot systems are modular, making programming and testing of the individual robots easier. Moreover, integrating many functions in one robot is more difficult than having many robots with different functions.

Despite these advantages, multi-robot systems have some drawbacks:
How are multiple robots coordinated?
How do they cope with dynamic environments?
How is the robotic team's performance measured?


Coordination and cooperation are a vital part of applying multiple robots to a mission in an efficient manner. Cooperation is a form of interaction that makes the individual robot capable of collaborating with other robots to accomplish tasks that are beyond an individual's capabilities. Designing and implementing a cooperative team of robots requires deciding how to resolve organization problems. Learning to choose an organization scheme is based on the improvement gained in completing the mission cooperatively within the efficiency constraints (i.e., execution time, interference, and robustness). Gaining dynamic organization behavior through learning lets the robots complement each other in a cooperative manner and cope with dynamic environments. Different organizations will exist, which have different behaviors, and this behavioral difference leads to a difference in performance. Therefore, no one organization is best suited for every mission. Dynamically assigning each robot different roles through learning allows a group of robots to improve their performance by providing constraints for the robots, which accept tasks that meet their roles in the organization based upon the robots' skills and experience. As shown in Figure 5.3, increasing the level of communication and dynamic task allocation leads to a virtual centralized controller that directs the multi-robot system toward coherent behavior. The cooperative and adaptive character of the robots' behavior increases the organization level, which is demonstrated by high efficiency and decreasing execution time. In addition, it preserves the autonomy of the robots in the team.

Figure 5.3: Dynamic team organization behavior gained through learning, which lets the robots complement each other in a cooperative manner.

5.9 Communication Techniques in Robotics

Communication plays a large role in robot team coordination and cooperation.

Communication is necessary for cooperation among multiple robots if and only if the

robots use the information that they communicate to enhance their performance as

individuals and/or as a group. Range, content, and guarantees for communication are

important factors in the design of social behavior. Communication is not error free and can be undependable. There are three kinds of communication used in robotics [Cao et al.,

1997]. Explicit communication occurs when symbols representing data or commands are

exchanged between robots through direct channels. This active communication may be

local or global (broadcast). Implicit communication happens when robots gain useful


information by observing each other performing actions in the environment and understanding those actions. Using implicit communication, robots cooperate by predicting the behavior of, maintaining awareness of, or estimating the reinforcements of other robots from past observations; this type of information does not require the robot to model another robot's internal state [Parker, 1995a; Matarić, 1994a; 1997a]. This form of communication is passive and local: communication occurs when individuals sense other individuals.

Stigmergic communication occurs when a robot gains useful information by observing the changes another robot has made to the environment. It treats the environment as an implicit communication channel that provides awareness of one robot by another; usually it refers to communication based on modification of the environment. All three types of

communication occur in the natural world. For instance, the honeybee’s dance and human

language are explicit. Implicit communication occurs when you approach a door at the

same time as another person and they hold open the door to allow you to pass through

first. Stigmergy is seen when one ant follows the pheromone trail another one laid to find food. Neither ant ever communicated directly with the other – they only changed or observed the change in the environment [Cao et al., 1997].

5.10 Collective Learning

Various forms of machine learning have been applied to robotic teams, including

reinforcement learning and imitation. Social learning is the process of acquiring new

(cooperative) behavior patterns by learning from others. Social learning includes learning

how to perform a behavior, and when to perform it. Matarić defines the basic forms of social learning as imitation or mimicry and social facilitation [Matarić, 1994]. Imitation

involves first observing another agent’s actions (either human or robot), then encoding that

action in some internal representation and finally reproducing the initial action.

Reinforcements can result from a robot's action directly, from observation of another robot's actions, or from observations of the reinforcement another robot receives. Tensions between individual and group needs can exist. Robots may be strongly self-interested and have no concern for the society's overall well-being. Optimization in social robots usually

focuses on minimizing interference between robots and maximizing the society’s reward.


REFERENCES

[1] Albus S., (1981). A new approach to manipulator control: Cerebeller model

articulation controller (cmac). Journal of Dynamic Systems, Management and

Control, 97, 220-227.

[2] Arkin R., (1989). Towards the Unification of Navigational Planning and Reactive

Control. In AAAI Spring Symposium on Robot Navigation, pp. 1-5.

[3] Arkin R., (1992). Cooperation without Communication: Multiagent Schema Based

Robot Navigation. Journal of Robotic Systems.

[4] Arkin R., (1998). Behavior-Based robots. MIT Press.

[5] Arkin R., and Balch T., (1994). Communication in Reactive Multi-agent Robotic

Systems. Journal of Autonomous Robots, 1(1): 27-53.

[6] Arkin R., and Balch T., (1997). AuRA: Principles and Practice in Review. Journal of

Experimental and Theoretical Artificial Intelligence.

[7] Arkin R., and Balch T., (1998). Cooperative Multiagent Robotic Systems. In

Artificial Intelligence and Mobile Robots, Kortenkamp D., Bonasso R., and

Murphy R., (eds.), MIT Press.

[8] Arkin R., and MackKenzie D., (1994). Temporal coordination of perceptual

algorithms for mobile robot navigation. IEEE Transactions on Robotics and

Automation, 10(3): 276:286.

[9] Arkin R., Balch T., and Nitz E., (1993). Communication of behavioral State in

Multiagent Retrieval Tasks. In IEEE International Conference on Robotics and

Automation, pp. 588-594.

[10] Balch T., (1997). Clay: Integrating Motor Schemas and Reinforcement Learning.

Technical Report GIT-CC-97-11, Georgia Institute of Technology.

[11] Balch T., (1998a). Behavior Diversity in Learning Robots Teams. Ph.D. Thesis,

Georgia Institute of Technology.

[12] Balch T., (1998b). Taxonomies of multi-robot task and reward. Technical Report,

Carnegie Mellon University.

[13] Balch T., (1999). Behavioral Diversity as Multiagent Cooperation. SPIE’99

Workshop on Multi-agents Systems, Boston.

[14] Balch T., (2000). Hierarchical social entropy: an information theoretic measure of

robot team diversity. Journal Autonomous Robots.

[15] Balch T., and Arkin R., (1998). Behavior-Based Formation Control for Multi-Robot

Teams. IEEE Transaction on Robotics and Automation, 14(6): 1-15.

[16] Barto A., Bradtke S., and Singh S., (1995). Learning to act using real-time dynamic

programming. Journal of Artificial Intelligence, 72: 81-138.

[17] Barto A., Sutton R., and Anderson C. (1983). Neurolike Adaptive Elements that

can solve Difficult Learning Control Problems. IEEE Transaction on Systems,

Man and Cybernetics, Vol. SMC-13, No. 5.

[18] Bonabeau E., Dorigo M., and Theraulaz G., (1999). Swarm Intelligence: From

Natural to Artificial Systems. Oxford University Press.


[19] Borenstein J., and Koren Y., (1989). Real-Time Obstacle Avoidance for Fast Mobile

Robots. IEEE Tans. Systems, Man, and Cybernetics, 19(5): 1179-1187.

[20] Borenstein J., and Koren Y., (1991). The Vector Field Histogram- Fast obstacle avoidance

for mobile robots. IEEE Tans. Robotics and Automation, 7(3): 278-288.

[21] Borenstein J., Everett H., and Feng L., (1996). Navigating Mobile Robots- Systems and

Techniques. A.K. Peters, Wessley, MA.

[22] Brooks R. and Stein L. (1994). Building brains for bodies. Journal Autonomous

Robots, 1: 7-25.

[23] Brooks R., (1986). A Robust Layered Control System for a Mobile Robot. IEEE

Journal of Robotics and Automation, 2(1): 14-22.

[24] Brooks R., (1990a). Challenges for Complete Creature Architectures. In:

Proceedings of the first international conference on simulation of adaptive

Behavior, MIT Press, pp. 484-448.

[25] Brooks R., (1990b). Elephants Don’t Play Chess. In: Maes P., (Ed.), Designing

Autonomous Agents, MIT Press, pp. 3-15.

[26] Brooks R., (1990c). The Behavior Language; User’s Guide. Technical report AIM-

1127, MIT Artificial Intelligence Lab.

[27] Brooks R., (1991a). Intelligence Without Reason. In: Proceedings IJCAI-91,

Sydney.

[28] Brooks R., (1991b). Intelligence Without Representation. Artificial Intelligence 47,

139-160.

[29] Canny John (1988). The Complexity of Robot Motion Planning. Cambridge,

Massachusetts: MIT Press.

[30] Cao Y., Fukunaga A., and Kahng A. (1997). Co-operative Mobile Robotics:

Antecedents and Directions. Journal of Autonomous Robotics, 4:1-23.

[31] Clouse J., (1997). On Integrating Apprentice learning and Reinforcement Learning.

Ph.D. Thesis, University of Massachusetts.

[32] Clouse J., and Utgoff P., (1992). A teaching method for reinforcement. Machine

learning: In Proceedings of the Ninth International Conference, pp. 182-189.

[33] Connell J. (1990). Minimalist Mobile Robotics: A Colony Architecture for an Artificial

Creature. Academic Press.

[34] Connell J. (1991). SSS: a hybrid architecture applied to robot navigation. In:

proceedings of IEEE International conference on robotics and automation, France,

pp. 2719-2724.

[35] Dayan P., (1992). The Convergence of TD(λ) for General λ. Machine Learning, 8: 341-362.

[36] Dayan P., and Sejnowski T., (1994). TD(λ) Converges with Probability 1. Machine Learning, 14: 295-301.

[37] Dias B., and Stentz A., (2000). A Free Market Architecture for Distributed Control

of a Multirobot System. In: the 6th International Conference on Intelligent

Autonomous Systems, pp. 115-122.

[38] Dietterich T., (1997). Machine Learning Research: Four Current Directions. AI

Magazine 18(4), 97-136.


[39] Doran J., Franklin S., Jennings N., and Norman T., (1997). On Cooperation in

Multi-Robot Systems. The knowledge Engineering review, 12 (3).

[40] Doyle A., (1995). Algorithms and Computational Techniques for Robot Path Planning.

Ph.D. Thesis, School of Electronic Engineering and Computer Systems, University

of Wales.

[41] Dudek G., and Jenkin M., (2000). Competently Principles of Mobile Robots. Cambridge

University Press.

[42] Elfes A., (1987). Sonar-based real-world mapping and navigation. IEEE Transaction on

Robotics and Automation RA-3(3), pp. 249-265.


[44] Elfes A., (1989). Using occupancy grids for mobile robot perception and navigation. IEEE

Computer, pp. 46-57.


[46] El-Telbany M., AbdelWhab A., and Shaheen S. (2001). Learning Multi-robot Co-

operation by Expertise Distribution. Proceedings of the International Conference

on Industrial Electronics, Technology & Automation. Cairo, Egypt.

[47] El-Telbany M., AbdelWhab A., and Shaheen S. (2002a). Learning Co-ordination by

Spatial and Expertise Distribution in Multi-agent Systems. The Egyptian Journal of

Engineering and Applied Science, 49 (2), 357-367.

[48] El-Telbany M., AbdelWhab A., and Shaheen S. (2002b). Behaviour-Based Multi-

Robots Learning: A Reinforcement Learning Survey. The Egyptian Informatics

Journal. (To appear).

[49] El-Telbany M., AbdelWhab A., and Shaheen S. (2002c). Speed up Reinforcement

Learning in Multi-robot Domains by Apprentice Learning. In WSEAS

International conference, Greece.

[50] Everett H., (1995) Sensors for Mobile Robots- Theory and Application. A.K.

Peters, Wessley, MA.

[51] Floreano D., (1997). Evolutionary Mobile Robots. In D. Quagliarelli , J Periaux, C.

Polini, and G. Winter (eds), Genetic Algorithms in Engineering and Computer

Science. Chichester: John Wiley and Sons, Ltd.

[52] Fontán M., and Matarić M., (1996). A Study of Territoriality: the Role of Critical

Mass in Adaptive Task Division. In Proceedings of fourth International

Conference on Simulation and Adaptive Behavior, pp. 553-561.

[53] Fontán M., and Matarić M., (1998). Territorial Multi-Robot Task Division. IEEE

transaction on Robotics and Automation, 14 (5).

[54] Gat E., (1998). On three-layer architectures. In: Bonnasso R., and Merphy P.

(Eds.), Artificial Intelligence and mobile robots, AAAI Press.

[55] Gerkey B., and Matarić M., (2000). Murdoch: Publish/Subscribe Task Allocation

for Heterogeneous Agents. Proceedings, Autonomous Agents 2000, Barcelona,

Spain, June 3-7, 203-204.


[56] Gerkey B., and Matarić M., (2001). Principled Communication for Dynamic Multi-

Robot Task Allocation. In Rus D., and Singh S., (eds.) Experimental Robotics VII,

353-362, Springer-Verlag, Berlin.

[57] Gerkey B., and Matarić M., (2002). Pusher-Watcher: an Approach to Fault-Tolerant

Tightly Coupled Robot Coordination. In Proceedings of IEEE International

Conference on Robotics and Automation, Washington.

[58] Gerkey B., and Matarić M., (2002). Sold!: Auction Methods for Multi-Robot

Coordination. IEEE Transactions on Robotics and Automation.

[59] Goldberg D., (1989). Genetic Algorithms in Search, optimization, and Machine

Learning. Addison-Wesley.

[60] Goldberg D., and Matarić M., (1997). Interference as a Tool for Designing and

Evaluating Multi-Robot Controllers. In Proceedings AAAI-97, pp. 637-642.

[61] Goldberg D., and Matarić M., (1999). Coordinating mobile robot group behavior

using a model of interaction dynamics. In: Proceedings of the third international

Conference on Autonomous Agents, Washington, ACM Press.

[62] Grefenstette J. and Schultz A. (1994). An Evolutionary approach to learning in

robots. In proceedings Machine Learning Workshop on Robot Learning, New

Brunswick, NJ.

[63] Haykin S., (1999). Neural Networks: A Comprehensive Foundation. Second

Edition, Prentice-Hall, Inc.

[64] Hoffmann F., (1998). Soft computing techniques for the design of the mobile robot

behaviors. Journal of Information Sciences.

[65] Holland J., (1986). Escaping brittleness: the possibilities of general-purpose learning

algorithms applied to parallel rule-based systems. In R.S. Michalski and T.M.

Mitchell (eds.), Machine learning: An Artificial Intelligence Approach, Volume II,

Morgan Kaufmann.

[66] Humphrys M., (1995). W-Learning: Competition among selfish Q-Learners.

Technical Report no. 362, University of Cambridge, Computer Laboratory.

[67] Humphrys M., (1997). Action Selection Methods using Reinforcement Learning.

Ph.D., Thesis, University of Cambridge, Computer Laboratory.

[68] Kaelbling L., Littman M., and Moore A., (1996). Reinforcement Learning: A

Survey. Journal of Artificial Intelligence Research, 4, pp.237-258.

[69] Kaiser M., Dillmann R., and Rogalla O., (1996). Communication as the Basis for

Learning in Multi-Agent Systems. In: Workshop on Learning in Distributed AI

Systems, Budapest, Hungary, 1996.

[70] Karlsson J., (1997). Learning to Solve Multiple Goals. Ph.D. Thesis, University of

Rochester, New York.

[71] Keerthi S., and Ravindran B., (1999) A Tutorial Survey of Reinforcement Learning.

Department of Computer Science, Indian Institute of Science.

[72] Khatib O., (1986). Real-time obstacle avoidance for manipulators and mobile

robots. The International Journal of Robotics Research, 5(1).

[73] Koza J., (1992). Genetic Programming, on the Programming of Computers by

Means of Natural Selection. MIT Press.


[74] Kube C. R., and Zhang H., (1992). Collective Robotics Intelligence. In: From

Animal to Animates: Second International conference on Simulation of Adaptive

Behaviors, PP. 460-468.

[75] Kube C. R., and Zhang H., (1993). Collective Robotics: From Social Insects to

Robots. Journal of Adaptive Behavior, 2 (2): 189-219.

[76] Kube C. R., and Zhang H., (1996). The use of Perceptual cues in Multi-Robot Box-

Pushing. In: Proceedings of IEEE International Conference on Robotics and

Automation, pp. 2085-2090.

[77] Kube R., and Bonabeau E., (1998). Cooperative transport by ants and robots.

Journal of Robotics and Autonomous Systems.

[78] Lin Long-Ji, (1992). Self-improving reactive agents based on reinforcement

learning, planning and teaching. Machine learning, 8, 293-321.

[79] Lin Long-Ji, (1993). Scaling up reinforcement learning for robot control.

Proceedings of the 10th International Conference of Machine learning, pp. 182-189.

[80] Lozano-Pérez T., (1983). Spatial Planning: A Configuration Space Approach. IEEE

Transaction on Computers, C-32, pp. 108-120.

[81] Lund, H. (1995). Evolving Robot Control System. In Proceedings of the first

NWGA, University of Vaasa, Finland.

[82] Maes P., (1994). Modeling Adaptive Autonomous Agents. Journal of Artificial Life

1(2), 135-162.

[83] Maes P., and Brooks R., (1991). Learning to Coordinate Behaviors. In: Proceedings

of the Eighth National Conference on AI, Boston.

[84] Mahadevan S., (1996a). Machine Learning for robots: A Comparison of Different

Paradigms. In: proceedings of International Conference on Intelligent Robots and

systems.

[85] Mahadevan S., (1996b). The NSF Workshop on Reinforcement Learning: Summary

and Observations. AI Magazine.

[86] Mahadevan S., Connell J., (1992). Automatic Programming of behavior-based

Robots using reinforcement Learning. Artificial Intelligent, pp. 311-365.

[87] Matarić M., (1990). A Distributed Model for Mobile-Robot Environment Learning

and navigation. Ms.c. Thesis, Massachusetts Institute of Technology, Artificial

Intelligence laboratory, Cambridge, USA.

[88] Matarić M., (1992). Minimizing complexity in controlling a collection of mobile

robots. In: proceedings of IEEE International Conference on Robotics and

Automation, pp. 830-835.

[89] Matarić M., (1994a). Learning Social Behaviors. In: Proceedings of the Third

International Conference on Simulation of Adaptive Behavior, PP. 453-462.

[90] Matarić M., (1994b). Interaction and Intelligent Behavior. Ph.D. Thesis,

Massachusetts Institute of Technology, Artificial Intelligence laboratory,

Cambridge, USA.

[91] Matarić M., (1994c). Reward Functions for Accelerating Learning. In: Proceedings

of the Eleventh International Machine Learning conference, Morgan Kaufmanns

Publisher Inc.


[92] Matarić M., (1995a). Learning in Multi-Robot Systems. In: Proceedings of the

IJCAI-95 Workshop on Adaptation and Learning in Multiagent Systems, 16:321-

331.

[93] Matarić M., (1995b). Issues and Approaches in the Design of Collective

Autonomous Robots. Journal of Robotics and Autonomous Systems, 16(2-4), 321-

331.

[94] Matarić M., (1995c). Designing and Understanding Adaptive Group Behavior.

Journal of Adaptive Behavior, pp. 51-80.

[95] Matarić M., (1997a). Behavior-Based Control: Examples from Navigation, Learning

and Group Behavior. Journal of Experimental and Theoretical Artificial

Intelligence, 9 (2-3), 323-336.

[96] Matarić M., (1997b). Using Communication to Reduce Locality in Distributed

Multi-Agent learning. In: Proceedings, AAAI-97, pp. 643-648.

[97] Matarić M., (1997c). Reinforcement Learning in the Multi-Robot Domain.

Autonomous Robots 4(1), 73-83.

[98] Matarić M., (1998a). Behavior-Based Robotics as a Tool for Synthesis of Artificial

behavior and Analysis of Natural Behavior. In Trends in cognitive science, Vol. 2,

No. 3, 82-87.

[99] Matarić M., (1998b). Coordination and Learning in Multi-robot Systems. In IEEE

Intelligent Systems.

[100] Matarić M., (2001). Learning in behavior-based multi-robot systems: policies,

models, and other agents. Journal of Cognitive Systems Research, 2: 81-93.

[101] Matarić M., (2002). Situated Robotics. In the Encyclopedia of Cognitive Science,

Nature Publishers Group.

[102] Matarić M., and Cliff D. (1996). Challenges in Evolving Controllers for Physical

Robots. Journal of Robotics and Autonomous Systems, 19:67-83.

[103] Matarić M., and Sukhatme G., (2001). Task-allocation and Coordination of Multiple

robots for Planetary Exploration. In the 10th International Conference on

Advanced Robotics, Buda, Hungary, pp. 61-70.

[104] Matarić M., Nilsson M., and Simsarian K. (1995). Co-operative Multi-Robot Box-

Pushing. In proceedings, IROS-95, IEEE computer society press Pittsburgh, PA.

[105] Michalewicz Z., (1994). Genetic Algorithms + Data Structure = Evolutionary

Programs. Springer-Verlag.

[106] Mitchell T., (1997). Machine Learning. McGraw-Hill Inc.

[107] Moore A., and Atkeson C., (1993). Prioritized sweeping: Reinforcement learning

with less data and less time. Machine learning, 103-130.

[108] Moravec H., and Martin M., (1996). Robot Evidence Grids. Technical Report CMU-

RI-TR-96-06, Carnegie Mellon University, Robot Institute.

[109] Moriarty D., Schultz A., and Grefenstette J., (1999). Evolutionary Algorithms for

Reinforcement Learning. Journal of Artificial Intelligence Research, 11, pp. 241-

278.

[110] Murphy R., (2000). Introduction to AI Robotics. MIT Press.


[111] Nagayuki Y., Ishii S., and Doya K., (1999). Multi-Agent Reinforcement Learning:

An Approach Based on the Other Agent’s Internal Model. Advanced

Telecommunications Research Institute, Japan.

[112] Nebot E., and Pagac D., (1995). Quadtree Representations and Ultrasonic Information for

Mapping an Autonomous Guided Vehicles Environment. International Journal of

Computers and Their Applications, 2(3), pp. 160-170.

[113] Nehmzov U., and Mitchell T., (1995). A Prospective Student’s Introduction to the

Robot Learning Problem. Technical Report UMCS-95-12-6, University of

Manchester.

[114] Nehmzow U. and Smithers T., (1992). Using motor actions for location

recognition. In Varela F., and Bourgine P. (Eds.) Toward a practice of autonomous

systems. MIT Press, pp. 96-104.

[115] Ostergaard E., Matarić M., and Sukhatme G., (2001b). Distributed Multi-Robot

Task Allocation for Emergency Handling. In Proceedings of IEEE/RSJ

International Conference on Robots and Systems (IROS), Maui, Hawaii, 821-826.

[116] Ostergaard E., Matarić M., and Sukhatme G., (2002). Multi-Robot Task Allocation

in the Light of Uncertainty. In Proceedings of IEEE International Conference on

Robotics and Automation (ICRA-2002), Washington.

[117] Ostergaard E., Sukhatme G., and Matarić M., (2001a). Emergent Bucket Brigading.

Autonomous Agents 2001, Monteral, Canada.

[118] Parker L., (1992). Adaptive Action Selection for Cooperative Agent Teams. In

International Conference on Simulation of Adaptive Behavior, pp. 442-450.

[119] Parker L., (1994). Heterogeneous Multi-robot Cooperation. Ph.D. thesis,

Massachusetts Institute of Technology, Artificial Intelligence laboratory,

Cambridge, USA.

[120] Parker L., (1995a). The Effect of Action Recognition and robot Awareness in

Cooperative Robotic Teams. In proceedings of International Conference on

Intelligent Robotics and Systems, 1, pp. 212-219.

[121] Parker L., (1995b). L-ALLIANCE: A mechanism for adaptive action selection in

heterogeneous multi-robot teams. Technical Report ORNL/TM-13000, Oak Ridge

National laboratory, Tennessee.

[122] Parker L., (1996). On the Design of Behavior-based Multi-Robot Teams. In Journal

of Advanced Robotics, 10, pp. 547-578.

[123] Parker L., (1998a). ALLIANCE: An Architecture for Fault-Tolerant Multi-robot

Cooperation. IEEE Transactions on Robotics and Automation 14(2), pp. 220-240.

[124] Parker L., (1998b). Distributed Control of Multi-robot Teams: Cooperative Baton-

Passing Task. In: proceedings of the 4th International Conference on Information

Systems Analysis and Synthesis, Volume 3, pp. 89-94

[125] Parker L., (2000a). Current State of the art in Distributed Autonomous Mobile

Robotics. In: proceedings of Fifth International Symposium on Distributed

Autonomous Robotic systems.


[126] Parker L., (2000b). Multi-robot Learning in a Cooperative Observation Task. In:

proceedings of Fifth International Symposium on Distributed Autonomous

Robotic systems.

[127] Pearce M., Arkin R., and Ram A., (1992). The learning of reactive Control

Parameters through Genetic Algorithms. In Proceedings of the IEEE/RSJ

International Conference on Intelligent Robotics and Systems.

[128] Peng J., and Williams J., (1993). Efficient learning and planning within the Dyna

framework. Adaptive Behavior, 4, 323-334.

[129] Pirjanian P., (1999). Behavior coordination Mechanisms –State-of-the-art.

University of Southern California, Robotics Research laboratory.

[130] Pomerleau A. (1993). Neural Network Perception for mobile Robot Guidance.

Kluwer Academic Publishers.

[131] Rummery G., and Niranjan M., (1994). On-line Q-learning Using Connectionist

Systems. Technical Report CUED/F-INFENG/TR 166. Cambridge University.

[132] Russell S., (1996). Machine Learning. In Boden, M., A., Artificial Intelligence,

Academic Press, Inc.

[133] Russell S., and Norvig P., (1995). Artificial intelligence: A Modern Approach.

Prentice-Hall International, Inc.

[134] Saffiotti A., (1997). The uses of fuzzy logic in autonomous robot navigation: a

catalogue raisonne. Soft Computing 1(4): 180-197.

[135] Salustowicz P., Wiering A., and Schmidhuber H., (1998). Learning Team Strategies:

Soccer case studies. Machine Learning, 33(2/3).

[136] Samuel A., (1967). Some studies in machine learning using the game for Checkers

II: Recent Progress. IBM Journal of Research and Development, 11, 601-617.

[137] Schneider J., Wong W., Moore A., and Riedmiller M., (1999). Distributed Value

Functions. International Conference on Machine Learning.

[138] Shappard J., (1997). Multi-agent Reinforcement learning in Markov games. PhD.

Thesis, Johan Hopkins University.

[139] Simsaian K., and Matarić M., (1995). Learning to Cooperate using Two Six-Legged

Mobile Robots. In: Proceedings of 3rd European Workshop of Learning Robots,

Crete, Greece.

[140] Singh S., (1994). Learning to Solve Markovian Decision Processes. Ph.D. Thesis,

University of Massachusetts.

[141] Smart W., (2002). Making Reinforcement Learning Work on Real Robots. Ph.D.

Thesis, Department of Computer Science, University of Brown, USA.

[142] Stone P., (1998) Layered Learning in Multi-Agent Systems. Technical Report CMU-

CS-98-187, Computer Science Department, Carnegie Mellon University, Pittsburgh,

PA.

[143] Stone P., and Veloso M., (1997). Multi-Agent Systems: A Survey from Machine

Learning Perspective. Technical Report CMU-CS-97-1993, Computer Science

Department, Carnegie Mellon University, Pittsburgh, PA.

[144] Sutton R. S., and Barto A. G., (1998). Reinforcement Learning: An Introduction.

MIT Press.

[145] Sutton R., (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1), pp. 9-44.

[146] Sutton R., (1990a). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Proceedings of the 7th International Conference on Machine Learning.

[147] Sutton R., (1990b). Planning by Incremental Dynamic Programming. In Proceedings of the 9th International Conference on Machine Learning.

[148] Sutton R., (1991). DYNA, an Integrated Architecture for Learning, Planning and Reacting. In Working Notes of the AAAI Spring Symposium on Integrated Intelligent Architectures.

[149] Tan M., (1993). Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the 10th International Conference on Machine Learning, pp. 330-337.

[150] Tennenholtz M., and Shoham Y., (1992). Emergent Conventions in Multi-Agent Systems. Robotics Laboratory, Department of Computer Science, Stanford University. In Proceedings of KR, Boston.

[151] Thrun S., (1992a). Efficient Exploration in Reinforcement Learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, Computer Science Department, Pittsburgh, PA.

[152] Thrun S., (1995). Exploration in Active Learning. In Arbib M. (Ed.), Handbook of Brain and Cognitive Science. MIT Press.

[153] Touzet C., (2000). Distributed Lazy Q-Learning for Cooperative Mobile Robots. Autonomous Robots, (2), pp. 1-18.

[154] Tsitsiklis J., (1994). Asynchronous Stochastic Approximation and Q-Learning. Machine Learning, 16, pp. 185-202.

[155] Versino C., and Gambardella L., (1996). Ibots: Learning Real Team Solutions. In MAS96 Workshop on Learning, Interaction and Organization in Multi-Agent Environments, Kyoto, Japan.

[156] Watkins C., (1989). Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University, England.

[157] Watkins C., and Dayan P., (1992). Technical Note: Q-Learning. Machine Learning, 8(3/4), pp. 279-292.

[158] Weiss G., (1997). Adaptation and Learning in Multi-Agent Systems: Some Remarks and a Bibliography. In Weiss G. (Ed.), Distributed Artificial Intelligence Meets Machine Learning: Learning in Multi-Agent Environments, Lecture Notes in AI 1221, Springer-Verlag, Berlin, pp. 223-241.

[159] Werger B., and Matarić M., (2000a). Broadcast of Local Eligibility for Multi-Target Observation. In Proceedings of the International Symposium on Distributed Autonomous Robotic Systems, Knoxville, Tennessee.

[160] Werger B., and Matarić M., (2000b). Broadcast of Local Eligibility: Behavior-Based Control for Strongly Cooperative Robot Teams. In Proceedings of Autonomous Agents 2000, Barcelona, Spain, June 3-7, pp. 203-204.

[161] Wiering M., (1999). Explorations in Efficient Reinforcement Learning. Ph.D. Thesis, University of Amsterdam.

[162] Wiering M., Krose B., and Groen F., (2000). Learning in Multi-Agent Systems. Department of Computer Science, University of Amsterdam.

[163] Wyatt J., (1997). Exploration and Inference in Learning from Reinforcement. Ph.D. Thesis, University of Edinburgh, Scotland.

[164] Yanco H., and Stein L., (1993). An Adaptive Communication Protocol for Cooperative Mobile Robots. In Meyer J., Roitblat H., and Wilson S. (Eds.), From Animals to Animats: International Conference on Simulation of Adaptive Behavior. MIT Press, pp. 478-485.

[165] Zlot R., Stentz A., Dias B., and Thayer S., (2002). Multi-Robot Exploration Controlled by a Market Economy. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA-2002).

[166] Hwang Y., and Ahuja N., (1992). Gross Motion Planning: A Survey. ACM Computing Surveys, 24, pp. 219-291.