
Learning Outdoor Mobile Robot Behaviors by Example

Richard Roberts, Charles Pippin, and Tucker Balch
Center for Robotics and Intelligent Machines
Georgia Institute of Technology
Atlanta, Georgia 30332
e-mail: [email protected], [email protected], [email protected]

Received 8 April 2008; accepted 16 December 2008

We present an implementation and analysis of a real-time, online, supervised learning system for nonparametrically learning behaviors from a human trainer on a mobile robot in outdoor environments. This approach enables a human operator to train and tune robot behaviors simply by driving the robot with a remote control. Hand-designed behaviors for outdoor environments often require many parameters, and complicated behaviors can be difficult or impossible to specify with a manageable number of parameters. Furthermore, their design requires knowledge of the robot's internal models and knowledge of the environment in which the behaviors will be used. In real-world scenarios, we can design new behaviors using our learning system much more quickly than we can write hand-crafted behaviors. We present the results of training the robot to execute several specialized and general-purpose behaviors, including traversing a slalom, staying near "cover," navigating on paths, navigating in an obstacle field, and general-purpose navigation. Our system learns and executes most of these behaviors well after 1-4 h of operator training time. In quantitative tests, the learned behavior is not as robust as a hand-crafted behavior but often completes obstacle courses more quickly. Additionally, we identify the factors that influence the effectiveness of this approach and investigate the properties of the training data provided by the human trainer. On the basis of our analyses, we suggest future work to ensure sufficient training, handle conflicting training examples, model robot dynamics, and further investigate dimensionality reduction of perception features. © 2009 Wiley Periodicals, Inc.

1. INTRODUCTION

In this paper we describe and evaluate a system for outdoor robot navigation that uses a learned policy directly mapping local robot-centric features to control commands. The robot learns this policy as a human teleoperates it via remote control, such that it learns to "drive like the human trainer does."

A compilation video of the robot behaviors is available in the online issue at www.interscience.wiley.com.

The motivation for learning robot behaviors instead of designing them by hand is twofold. Hand-crafted behaviors for outdoor environments often require many parameters, and complicated behaviors can be difficult or impossible to specify with a manageable number of parameters. Furthermore, their design requires knowledge of the robot's internal models and knowledge of the environment in which the behaviors will operate.

Figure 1. The polar range measurements radiate out from the robot to the nearest obstacles. The large arrow indicates the direction to the goal. These measurements comprise the local, robot-centric feature space.

We desire that the learned behaviors can successfully guide the robot through any environment in which it would typically operate, without having to create or refine a global policy for the environment. To this end, we present a learning system that maps local perception and robot-centric features to motor commands. We extract measurements about the robot's surroundings in the form of range measurements radiating from the robot to their intersections with obstacles, as in Figure 1, and use principal components analysis (PCA) to reduce the dimensionality of these measurements. We then combine this reduced-dimensionality feature vector with the robot's relative heading to a goal location to form the learner's input space. When running autonomously, the controller executes the motion command from the nearest neighbor in this input space.

Although these learned behaviors can operate as components of a multitier reactive and deliberative system, we perform all of our investigation here with the behaviors standing alone, forming a purely reactive system. Generally, a purely reactive system has difficulty efficiently escaping mazes, cul-de-sacs, and other local minima that extend beyond its reactive horizon, as a deliberative system with global knowledge would be able to. This is partly because local features alone generally do not contain enough spatial information for large-scale, goal-directed action and because the consideration of full kinematics and sometimes dynamics in reactive controllers makes producing goal-directed behavior computationally complex.

Hybrid reactive/deliberative systems, in which reactive behaviors are coupled with a deliberative planner, support both goal-directed global behavior and fast, smooth local behavior. A deliberative component is arguably necessary for long-term, goal-oriented behavior but does not generally address robot dynamics or motion constraints because these introduce many additional state-space dimensions. Reactive controllers can address these dynamics and enable smooth and robust control of robot position, heading, and speed (Arkin, 1998; Brooks, 1986) or perform more complicated maneuvers, as demonstrated in this paper and by many other groups. Examples of situations in which learned reactive controllers perform especially well include path and corridor following, small obstacle avoidance, and systems with significant or unstable dynamics (Abbeel, Coates, Quigley, & Ng, 2007; Hamner, Singh, & Scherer, 2006; Jonsson, Wiberg, & Wickstrom, 1996; Pomerleau, 1989, for example).

In this work, we make the assumption that the robot's goal is a location in the world, derived by a higher level system (such as a planner), to which the robot knows its relative heading.

Our main contribution with this work is to evaluate the properties and limitations of a function-approximation approach to behavior learning using local, but general, perception features. On the basis of this evaluation, we highlight some of the factors that influence the effectiveness of this approach. Additionally, we present discussion and analysis of the training data the human expert provided to the robot.

Our approach has several advantages over other behavior-learning approaches. First, it is more feasible to implement than an end-to-end approach that directly uses images as its input because the space of images is very large and any manifold therein is likely to be highly nonlinear. Second, in complex outdoor environments, it is more capable than approaches that use simplified features such as "bearing and range to the nearest obstacle," because our range-measurement features can resolve paths, single obstacles, or large groups of obstacles. Finally, unlike approaches that derive a complete global policy for new environments, our approach does not require knowledge of transition or reward functions.

Dimensionality reduction of the raw range data was necessary to allow the nearest-neighbor learner to generalize with modest numbers of training exemplars. We observed that when running the learner on the raw range data, the "nearest neighbors" did not resemble the query instances. PCA, however, effectively preserved the important aspects of the training data, while giving us the additional benefit of removing some noise. Additionally, principal components corresponded intuitively to typical obstacle configurations. We further discuss the behavior of PCA with our training data in Section 5.

Several other groups successfully used PCA for dimensionality reduction of laser range scans (Pfaff, Stachniss, Plagemann, & Burgard, 2008, for example), to which our radial obstacle-distance features are in fact very similar. With images, on the other hand, several groups have demonstrated advantages of nonlinear dimensionality reduction: Grudic and Mulligan (2005) find clusters on manifolds in images, and Grollman et al. show significant improvement using this method with depth images over clustering in a latent space found by PCA (Grollman, Jenkins, & Wood, 2006).

We used a single nearest-neighbor learner to map the compressed feature vectors to motor commands. Several other groups have also shown success with single nearest-neighbor methods for control in local perception space (Argall, Browning, & Veloso, 2007; Stolle, Tappeiner, Chestnutt, & Atkeson, 2007, for example). Other groups have used related techniques such as locally weighted projection regression (Vijayakumar, D'Souza, & Schaal, 2005). All of these methods share the advantage of incorporating additional training data quickly, without long retraining times. Fast incremental training is a requirement for an interactive learning system. We chose to use a single nearest neighbor instead of a multiple nearest-neighbor technique such as locally weighted projection regression, because the training data invariably contained noise and inconsistencies, such as aliasing of multiple actions onto the same features. We further discuss the problems arising from these properties of the training data in Section 5.

We found it necessary to train the robot in a wide variety of scenarios and also to act consistently to reduce the variability of the actions associated with any particular state. LeCun et al. also identify these requirements for an end-to-end behavior-learning system operating on images (LeCun, Muller, Ben, Cosatto, & Flepp, 2005). Acting consistently for any given state is not always possible, however. The human trainer may demonstrate different actions for the same state, either unintentionally or intentionally, because he or she is aware of information not representable in the state space of the robot or not perceived by the robot (such as knowing the shortest way around an obstacle or avoiding unseen obstacles). To help reduce this "intentional aliasing," LeCun et al. used a human trainer wearing video goggles while training the robot. In our work, however, the state space is not made up of images, and it is difficult for a human to control the robot only by looking at the obstacle range measurements. Thus, our human trainer operated the robot from a third-person view, observing its motion from nearby, but we had to be careful to act as consistently as possible and not to respond to obstacles in the environment the robot could not perceive.

2. RELATED WORK

We consider the problem of learning from example in a state-action-policy framework, in which these terms have the same meaning as in Markov decision processes (MDPs). Within this framework, we classify the work in this field into the categories of trajectory following, task learning, and behavior learning. These categories are each characterized by the assumptions they rely on about the nature of a robot's state and policy, about which parts of the problem are learned and which are specified by the designer, and in the way in which learning generalizes to new situations.

Our work in this paper fits into the category of behavior learning, but we believe there are many opportunities for crossover between these categories in our future work.

2.1. Trajectory Following

The goal of trajectory following is to be able to provide demonstrations of desired trajectories to a robot and have the robot follow one or some combination of those trajectories. Trajectory following assumes that the robot can directly perceive or compute its state, that its state consists of position and/or velocity, and that the goal of the policy is to cause the robot to follow a particular trajectory through its state space.

Atkeson and Schaal (1997) focus on the task of swinging up and balancing an inverted pendulum from observation of a human performing the task. The robot learns a physical model by watching itself perform the task and learns a reward function from the human demonstration. The authors use reinforcement learning such that the robot learns to follow the demonstrated trajectory.

Calinon and Billard (2005) apply PCA and independent component analysis to time-series demonstrations of a gesture and then train a hidden Markov model (HMM) on the reduced-dimensionality, time-series data to form a model of the gesture. The authors then present a controller that follows a generalized version of the gesture, constructed from the HMM.

The goal of trajectory following is to have the robot move as accurately as possible between a series of locations. Many systems adapt the trajectory to account for minor workpiece and environmental changes, but with trajectory following, the robot stays within a narrow band in state space. Unlike task and behavior learning, the goal of generalizing to entirely new environments is not usually included in trajectory following.

2.2. Task Learning

In task learning, the agent learns to perform a set of actions over which action ordering is usually important. Time and accumulated history may both be important components of the agent's state representation. Often, task learning is also associated with high-level concepts such as objects and complex or intentional actions. Unlike with trajectory following, the environment may change significantly with task learning, and the agent must do the right thing with respect to accomplishing some task, adapting its trajectory and actions accordingly.

Calinon, Guenter, and Billard (2007) present a method for using measurements from multiple reference frames, and other arbitrary variables, in the state. As in their previous work, the authors use PCA to reduce the dimensionality of the input data, which consist of joint angles, global end-effector positions, object-centric end-effector positions, gripper status, and time. After applying dynamic time warping (DTW) to align the training sequences, the method then fits a Gaussian mixture model (GMM) to the reduced-dimensionality data, in which the time is an extra dimension for each variable. The authors design a controller to follow the generalized trajectory, represented by the GMM, through the latent space. The robot follows paths in important parts of the state space closely, while ignoring parts that are inconsistent between training sequences. In experiments, a robot using this method was able to move objects from a variable initial position to a fixed final global position.

Nicolescu and Mataric (2003) present a method for learning somewhat higher level task representations. They start with a set of predefined low-level behaviors, which a human demonstrator activates at run time. When key points in the task are reached, the low-level behaviors record their respective relevant parts of the robot's state. A "behavior network" combines multiple demonstrated sequences of behavior activations, representing the dependencies gleaned from the similarities and differences among the training sequences. The robot can then perform the demonstrated task autonomously by traversing the behavior network.

The goal of task learning is to have the robot reach some end state, at which the state includes all relevant information about the environment. A common assumption is that there are multiple ways of reaching the end state, which often include strong requirements about the order in which actions are performed or states are visited. Unlike behavior learning, task learning often deals with high-level states, such as object status or location, instead of lower level operations, such as interpreting perception or controlling motors.

2.3. Behavior Learning

The goal of behavior learning is to learn an appropriate policy mapping state to action for all regions of the state space the robot could be in. The state is often built from the robot's perception and information about its location in the world. Behavior learning shares with task learning that the end state (such as reaching a goal location) is important and that there are multiple ways of arriving at that state. With behavior learning, the state is often continuous and closely tied to the robot's local perception. Some trajectories through the state space will be possible, whereas others may be impossible, but with behavior learning, the particular trajectory taken is not explicitly controlled and is influenced by errors and random environmental, perception, and control events. Learned behaviors, then, are maps, from either sensor input or some stored representation of the state of the world, to robot actions.

Learning with function approximation in local feature space. Hayes and Demiris (1994) presented a method in which a learner agent, which follows a teacher agent through a series of "hallways" (with no branches), builds a mapping from its perception of the immediately surrounding wall configuration to an action provided by imitating another robot. The agent learns to perform the actions necessary to navigate the hallway, which means moving straight when the walls are only to the sides of the agent and turning corners when the agent reaches them.

A classic example of end-to-end behavior learning is ALVINN, the back-propagation network that controlled an autonomous van on the road, using as input low-resolution monocular images (Pomerleau, 1991). The network was trained online as a human drove the van, and the learning system recorded example images and control outputs.

LeCun et al. (2005) present a system that uses a convolutional network to learn a mapping directly from stereo images to motor control commands. Their method successfully learns obstacle avoidance in a range of outdoor environments. Ideally, a convolutional network learns filter kernels that extract useful information from the images, reducing the number of parameters that must be learned compared to a fully connected neural network. The authors also identify several requirements for the human-provided training data, and we have discussed our experience with these requirements above (LeCun et al., 2005).

Our work uses function approximation with local features but differs from the above work in that we use general descriptive features.

Improving upon the policy after demonstration. Argall et al. (2007) employ a policy-learning system in which the robot stores human-demonstrated trajectories in a local, robot-centric state space (consisting of relative headings and velocities) and uses a nearest-neighbor method to query its example bank at run time. In addition, the authors' system incorporates a critiquing signal, given by the human while the robot is executing the policy, by applying a scaling factor to the nearest-neighbor distances that is based on whether the human labels segments of the robot's trajectory as good or bad (Argall et al., 2007).

Kaiser and Dillmann (1996) use radial basis function (RBF) networks to incrementally learn a functional state-action mapping from example actions but then alter the output of this mapping based on a reward function. The reward function is also estimated from the training examples and allows the robot to improve its execution of a task over time. Using this method, the authors train their system to control a robotic arm to insert a peg in a hole and to open a door. Both of these tasks require the use of force feedback, and the difficulty experienced by the human controlling the arm with these tasks made the reinforcement learning portion of this method necessary and very beneficial.

Chernova and Veloso (2007) use Gaussian mixture models to learn behaviors from a human demonstrator with a discrete action space. Also, they allow the agent to request an additional demonstration if its confidence in the most likely correct action is low. This ability to improve the policy in areas of the state space with few examples or low confidence leads to a significant improvement in behavior performance (Chernova & Veloso, 2007). An extension of our present work could incorporate a similar confidence-based method to ensure that all important regions of the state space are covered by training examples.

Potential future work includes refining the policy and identifying areas of the state space that are covered too sparsely by training examples.

Reinforcement learning. It is tempting to allow a human trainer to provide a reward signal to a reinforcement learning system. Thomaz and Breazeal (2006) evaluated the merit of such a method and compared it to an expanded method in which the human could also influence an agent's future action choices. The authors found that allowing the human to influence the agent's future decisions directly, instead of providing feedback only after the fact, not only greatly improved learner performance, but also was more suited to the way that humans naturally teach (Thomaz & Breazeal, 2006).

Smart and Kaelbling (2002) use reinforcement learning in conjunction with a human expert trainer. While the trainer operates the robot, the learning system begins to learn the Q-function by observing the actions taken, given the states visited and the rewards received. Later, the system refines the Q-function while the learned policy controls the robot.

Abbeel et al. (2007) also present a take on behavior learning that uses reinforcement learning in conjunction with a human trainer to perform aerobatic maneuvers with an autonomous helicopter. Their method learns a model of the helicopter dynamics and a reward function from expert demonstrations of each maneuver. This work shares a pattern with those above, in that only the "good" areas of the state space are illustrated by the expert demonstrator. In this work, that fact is used to obtain a reward function, and reinforcement learning methods are used to produce a state-action function for the suboptimal regions of the state space in which the helicopter finds itself.

Reinforcement learning can refine a policy with exploration as well as learn it from initial and incremental demonstration.

Learning with primitives. Bentivegna (2004) specifies task primitives, or parameterized actions a robot can perform, and learns both the most appropriate primitive and the parameters with which to perform the primitive for any possible state in a global state space, by observing a human performing the same task. The author uses a nearest-neighbor method for primitive selection and additionally other function-learning methods, including a nearest-neighbor method, for selecting the parameters of each action. Using parameterized task primitives helps to simplify the learning of actions while maintaining generality among situations. Bentivegna uses multiple levels of abstraction for policy learning. First, for any given state, a learner selects a primitive from among the robot's library of primitives. Second, another learner selects the desired "subgoal," or outcome, of performing the action associated with the primitive. Finally, an additional learner selects the motor commands to achieve that subgoal.

Learning state costs. Several groups have succeeded in learning certain goal-directed behaviors by integrating learned costs into a global planner. Solving for the MDP whose optimal policy predicts the trajectories taken by a human trainer yields learned costs of local features, which in turn enable a global planning algorithm to plan paths similar to those of the human trainer (Ratliff, Bagnell, & Zinkevich, 2006). For many outdoor mobile robot tasks, one can make the assumption that the terrain far from the trajectory taken by a human trainer is not preferable, whereas the terrain close to the trainer's trajectory is preferable. This assumption allows costs to be inferred directly from sample trajectories and used with a global planner to produce planned trajectories similar to those of the trainer, as in Ollis, Huang, and Happold (2007) and Sun et al. (2007).

Building global policies. Stolle and Atkeson (2006) describe using a nearest-neighbor method to build a complete global policy from sample "trajectory libraries" in a global state space. In their work, these trajectory libraries are initially built using a planning algorithm. They and others then discuss transferring these trajectory libraries to new global state spaces using a nearest-neighbor lookup in a reduced-dimensionality local feature space (Stolle et al., 2007). Additionally, they discuss the more general case of transferring entire policies between global domains using similar techniques (Stolle & Atkeson, 2007). These authors note that executing a policy in the local feature space often resulted in the agent getting stuck or traveling in loops, thus not exhibiting goal-directed behavior. To address this problem, after building the initial transferred policy in global state space, they refine it using dynamic programming. This refining step, however, requires knowledge of the cost and transition functions of the underlying MDP.

3. METHODS

We implemented our learning-from-example system on the LAGR robot, shown in Figure 2, a differential-drive outdoor robot equipped with two pairs of Point Grey Bumblebee stereo cameras, wheel encoders, an inertial measurement unit (IMU), a global positioning system (GPS), and front bumper switches. We wrote our software in C++ and Java.

Figure 2. Our experimental platform is the LAGR robot. This is a differential-drive robot with wheel encoders, an IMU, and a GPS. For perception it has two pairs of Point Grey Bumblebee stereo cameras. Dimensions of the robot are approximately 120 × 75 × 105 cm.

Our system learns a nonparametric mapping from the state of the world to the robot's motor commands in a supervised manner using a nearest-neighbor method. The state of the world, in our case, consists of a representation of the obstacles and free space around the robot as a set of regularly spaced polar range measurements, as well as the robot's relative heading to the goal location.

To produce polar range measurements, we cast virtual "rays" out around the robot's location in an obstacle map, produced with stereo perception, as shown in Figure 1. Each ray represents the distance to the closest obstacle along the direction of the ray. We iterate over the grid cells within a certain radius of the robot, computing for each ray the grid cell closest to the robot that contains an obstacle. For most tasks, we use rays that each subtend 3 deg, spanning a total of 180 deg in front of the robot, in 60 rays, with a maximum ray range of 5 m. For the "slalom" task described below, the robot benefited from seeing farther behind itself, so we used additional rays to span 210 deg. For the "stealth" task, we used a maximum ray range of 10 m so that the robot could move toward cover from a greater distance. For each behavior, we perform unsupervised dimensionality reduction on these range measurements to produce feature vectors in a six-dimensional latent space for supervised learning.
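The following Python sketch illustrates this ray-casting step on a robot-centered occupancy grid. It is illustrative only (our implementation was written in C++ and Java), and the grid layout, the assumption that the robot sits at the grid center, and the function name are ours rather than details of the system.

    import numpy as np

    def polar_range_features(grid, cell_size, n_rays=60, span_deg=180.0, max_range=5.0):
        # Cast n_rays evenly spaced rays from the grid center (assumed robot
        # location) and return the distance to the nearest occupied cell
        # along each ray, capped at max_range.
        h, w = grid.shape
        cx, cy = w / 2.0, h / 2.0
        angles = np.deg2rad(np.linspace(-span_deg / 2.0, span_deg / 2.0, n_rays))
        ranges = np.full(n_rays, max_range)
        step = cell_size / 2.0  # march along each ray at half the grid resolution
        for i, a in enumerate(angles):
            r = step
            while r < max_range:
                x = int(cx + (r / cell_size) * np.cos(a))
                y = int(cy + (r / cell_size) * np.sin(a))
                if 0 <= x < w and 0 <= y < h and grid[y, x]:
                    ranges[i] = r  # first obstacle cell hit along this ray
                    break
                r += step
        return ranges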

The choice of six principal components was empirical. We used leave-one-out cross-validation to determine that keeping six principal components provided the best accuracy in predicting motor commands with a typical data set.
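Such a selection could be reproduced along the lines of the following sketch, written here with scikit-learn for brevity; the candidate range and the function name are illustrative assumptions, not part of our system.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.pipeline import make_pipeline

    def choose_n_components(ranges, commands, candidates=range(2, 13)):
        # Score each latent dimensionality by how well a single-nearest-neighbor
        # regressor predicts the recorded motor commands under leave-one-out CV.
        scores = {}
        for k in candidates:
            model = make_pipeline(PCA(n_components=k),
                                  KNeighborsRegressor(n_neighbors=1))
            s = cross_val_score(model, ranges, commands, cv=LeaveOneOut(),
                                scoring="neg_mean_squared_error")
            scores[k] = s.mean()
        return max(scores, key=scores.get)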

Figure 3 shows an overview of the flow of information during online operation of our learning-from-example system. During a training session, the trainer drives the robot using a remote control, while the robot records instances consisting of polar obstacle range measurements, the relative heading of the goal location (informed by GPS), and the forward velocity and turning rate commanded by the remote. As soon as the user ends the training session, the robot appends the new instances to the previously recorded ones and performs PCA on the polar range measurements of the entire set of training examples for the behavior. It then projects all of the polar range measurements down to the latent space spanned by the six largest-eigenvalue principal components.
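The post-session processing step amounts to refitting PCA on the full set of recorded range vectors and projecting them into the latent space. A minimal numpy sketch of that computation follows; the variable names are ours, for illustration only.

    import numpy as np

    def fit_latent_space(all_ranges, n_components=6):
        # Center the recorded range vectors, take the top principal directions
        # via SVD, and project every instance into the 6-D latent space.
        mean = all_ranges.mean(axis=0)
        centered = all_ranges - mean
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        components = vt[:n_components]     # rows are the principal directions
        latent = centered @ components.T   # one 6-D feature vector per instance
        return mean, components, latent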

We discretize the goal heading measurements into eight values, a resolution that we determined through preliminary experiments, but which did not have a large effect. The robot groups the training instances according to the discrete goal heading under which they fall and builds eight nearest-neighbor learners. Each learner handles training examples from one goal heading. Effectively, this causes the goal heading to act as an attribute with very high importance.
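A sketch of this grouping, assuming goal headings in radians in [-pi, pi) and using scipy's cKDTree as a stand-in for the nearest-neighbor library used on the robot:

    import numpy as np
    from scipy.spatial import cKDTree

    def build_learners(latent, goal_headings, commands, n_bins=8):
        # Assign each instance to one of n_bins discrete goal headings and
        # build one KD-tree (plus its table of motor commands) per bin.
        bins = np.floor((goal_headings + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
        learners = {}
        for b in range(n_bins):
            idx = np.where(bins == b)[0]
            if idx.size > 0:
                learners[b] = (cKDTree(latent[idx]), commands[idx])
        return learners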

When running autonomously, the robot projects its current perception of polar range measurements onto the same principal components found in the training data and discretizes its current goal heading. It then queries the learner corresponding to that goal heading with the projected polar range measurements, executing the motor command associated with the single nearest-matching training example with the same discrete goal heading.
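Continuing the same sketch, the run-time query combines the pieces above; again the names are illustrative, not those of our implementation.

    import numpy as np

    def query_command(ranges, goal_heading, mean, components, learners, n_bins=8):
        # Project the current scan onto the stored principal components, pick
        # the learner for the current discrete goal heading, and return the
        # motor command of the single nearest training example.
        z = (ranges - mean) @ components.T
        b = int(np.floor((goal_heading + np.pi) / (2 * np.pi) * n_bins)) % n_bins
        tree, commands = learners[b]
        _, i = tree.query(z)
        return commands[i]  # e.g., (forward velocity, turning rate)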

For the majority of our experiments, the robot recorded training instances at 3 Hz and, when running autonomously, queried learners to obtain motor commands at 8 Hz. Although we tested our system recording and querying at up to 20 Hz, we saw no increase in performance at higher rates. We use an exact nearest-neighbor method, employing k-dimensional (KD) trees, from the approximate-nearest-neighbor library by Mount and Arya (2006).

Figure 3. Information flow in our learning-by-example system.

Table I. Training examples and human operator time to train each behavior, and robot time required to process data after training and perform queries at run time.

Behavior                        Training instances   Time in training mode (min)   Approx. operator time   Posttraining processing (ms)   Online recall time (ms)
Path navigation                 661 (3 Hz)           3.7                           1 h                     40                             <1
Slalom                          682 (3 Hz)           3.8                           1 h                     42                             <1
Stealth                         1,240 (10 Hz)        2.1                           1 h                     104                            <1
Sparse obstacles                1,281 (3 Hz)         7.1                           3 h                     92                             <1
General purpose                 2,560 (3 Hz)         14.2                          4 h                     227                            <1
Hand-crafted general purpose                                                       2 months

4. EXPERIMENTS AND RESULTS

We trained the robot to execute five behaviors, three of which we considered nontrivial or difficult to program and tune by hand. In addition, we measured the time required of the human operator to train each behavior and the time required by the robot to process the training data after each run. In contrast to each of these learned behaviors, our hand-crafted behavior for avoiding obstacles required several months of testing and tuning cycles.

Table I shows the number of example instances the robot recorded during training for each of the tasks described below. In addition, the table shows the actual time spent in training mode (given that the robot records examples at a fixed rate) and the total operator time spent training, which includes running the robot autonomously to identify scenarios for which the robot needs additional training and driving the robot back to the start locations on the courses. The table also shows the time required for the robot to process the training data, including performing dimensionality reduction and building KD-tree learners. Finally, the table shows the recall time, or how long queries at run time take, including transforming the polar range measurement vector to the low-dimensionality space and finding the closest matching instance.

The hand-crafted behavior, to which we compare the learned "general-purpose" behavior, was developed over the course of several months and was used on our LAGR robot prior to developing the learned behaviors. The hand-crafted behavior was implemented such that a collection of basic controllers vote on directions of travel for the robot. One controller votes to drive toward the goal, whereas another votes to avoid obstacles, using the same polar range measurements described previously. Additionally, a third controller disallows driving directions that would cause collisions with obstacles, using a full local perception map and considering the robot's configuration space. Careful tuning of the weights of each controller's votes yields a behavior that drives toward a goal location while avoiding obstacles and is fairly robust in avoiding collisions. This hand-crafted behavior is described in detail in Wooden, Powers, MacKenzie, Balch, and Egerstedt (2007).
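The flavor of this vote-based arbitration can be conveyed with a small sketch; the weights, the candidate-heading scheme, and all names below are hypothetical and stand in for the considerably more detailed implementation described by Wooden et al. (2007).

    import numpy as np

    def vote_for_heading(candidate_headings, goal_heading, ranges, ray_angles,
                         w_goal=1.0, w_avoid=1.5, clearance=0.8, max_range=5.0):
        # Each candidate heading collects a goal-attraction vote and an
        # obstacle-avoidance vote; headings that would cause a collision
        # are vetoed outright.
        goal_votes = np.cos(candidate_headings - goal_heading)
        nearest = np.array([ranges[np.argmin(np.abs(ray_angles - h))]
                            for h in candidate_headings])
        avoid_votes = np.clip(nearest / max_range, 0.0, 1.0)
        votes = w_goal * goal_votes + w_avoid * avoid_votes
        votes[nearest < clearance] = -np.inf  # veto blocked directions
        return candidate_headings[np.argmax(votes)]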

The behaviors and scenarios we trained, along with our observations, are described in detail below. In the autonomous run trajectories illustrated in this section, we transcribed the map layouts and robot trajectories from video recordings. Additionally, the behaviors are shown in the attached video.

4.1. Slalom

We created slalom courses from movable barrels and trained the robot to "zigzag" in between the barrels, as illustrated in Figures 4(a) and 4(b). We rearranged the barrels frequently during training and evaluation runs, training the robot on new slalom courses until it could successfully navigate courses it had not seen before. For this test, we increased the span of the polar range measurements from 180 to 210 deg so that the robot did not lose sight of the barrels during turns. With slalom courses, the goal location was positioned far from the course, so that it acted as a compass.

Figure 4. (a) A typical slalom course and (b) the pattern in which we drove the robot while training on this course. The objective of the slalom behavior was that the robot "zigzag" in between the barrels. During training and evaluation, we frequently rearranged the barrels to ensure that learning could generalize. The goal location was positioned far from the course. (c) and (d) Autonomous slalom course runs. The robot sometimes skipped individual barrels, as seen in (d), but skipped only one or two nonconsecutive barrels and reached the end of the course approximately 80% of the time.

When traversing the slalom, the robot often skipped barrels, especially when the arrangement of the barrels was not a straight line. Figures 4(c) and 4(d) show two successful autonomous slalom runs, one in which the robot skips two nonconsecutive barrels. When it skipped a barrel, the robot usually could recover by driving between the next pair of barrels and then continue to traverse the rest of the course. In approximately 80% of the runs, the robot skipped only one or two nonconsecutive barrels and reached the end of the course. In the remaining runs, it lost sight of the barrels and drove away from the course.

4.2. Stealth Operation

To stay near cover, otherwise known as "stealth" mode, we trained the robot to make its way toward the goal location, but to hug large obstacles whenever possible. We trained the robot to turn toward the goal location only when the goal was at least 90 deg to its left or right. We used four different courses, with large natural obstacles including scrap piles, a cargo container, a shed, and bushes. For this test, we increased the maximum polar range measurement to 10 m, instead of the 5 m we used for the other experiments, so that the robot could react to farther objects that could provide cover. Figures 5(a) and 5(b) show a typical course and training pattern for the stealth behavior.


Figure 5. (a) Typical stealth course and (b) typical training pattern. The objective of the stealth behavior was to stay near large obstacles ("cover") while navigating toward the goal. In the photo (a), the goal location is out of the image to the left, beyond the silver pickup truck. (c) and (d) Autonomous runs. The point at which the robot stops hugging obstacles and turns toward the goal varies between runs because of the discretization of the goal heading measurement. With the stealth behavior, the robot performed especially well, seeking and driving along large obstacles while making its way to the goal.

Creating a hand-crafted behavior for this task would be nontrivial and likely require careful tuning, as perception of the boundaries of large obstacles is patchy. Sets of rules and thresholds would be required to determine when the robot should continue hugging an obstacle, turn toward the goal, or turn toward a different obstacle.

The behavior that we trained for stealth operation worked very well. When the robot perceived large objects, it approached them and drove along them. When no objects were visible, or when the goal was approximately 90 deg to the robot's left or right, it drove toward the goal location. Figures 5(c) and 5(d) show typical autonomous runs for the stealth behavior.

4.3. Path Navigation

Navigating on real paths with imperfect perception presents several problems for writing hand-tuned controllers. There may be additional paths splitting off of the robot's current path, which the user may or may not want the robot to take, depending on the sizes of the paths and whether they head toward the goal. Additionally, detecting and responding to sharp turns is nontrivial when writing a hand-tuned path-following controller. Typical path and training patterns are shown in Figures 6(a) and 6(b).

We were able to train the robot to successfully navigate on paths, including turning corners at intersections, for approximately 30 s at a time. Although the robot eventually veered to the side of the path and became entangled in brush, it successfully stayed in the center of the path until doing so. Following a path presents a difficult perception problem for our current system, as the vegetation lining the path sometimes becomes too low to be classified as an obstacle, making the edge of the path appear very sparse. Additionally, once the robot ends up inside an obstacle, all polar obstacle measurements are zero. When the robot did veer off the path, it usually did so in the direction of the goal. Figures 6(c) and 6(d) show several typical path runs.

Figure 6. (a) and (b) Typical paths and the patterns in which we drove the robot to teach path navigation. The objective of path navigation was to drive down the center of a path bounded by obstacles on both sides and to turn onto intersecting paths that head more directly toward the goal than the current path, as long as they were not too narrow for the robot to drive on. Path following was particularly difficult, because the robot's perception of the path edges often contained false gaps. Additionally, false obstacles were often perceived in the middle of the path. (c) and (d) On the path behavior, the robot often performed well but occasionally deviated from the path and entered the brush.

4.4. Random Obstacle Fields

As shown in Figure 7, we generated random courses by placing 20 rectangular obstacles (plastic bins) in a 10 × 15 m "arena" at randomly generated locations and orientations. For each random course, we started the robot at each of five starting locations at one end of the arena (the five square markers at the left end of the arena, visible only in the diagram) and assigned a goal location a long distance past the other end of the arena.

We initially trained the robot by driving manually through two random courses from each of the five starting locations.

Next, we entered a period of corrective training, starting the robot autonomously from each of the five starting locations. Every time the robot collided with an obstacle, we interrupted autonomous operation, manually backed the robot away from the obstacle, and then provided a training sequence of driving smoothly around the obstacle. For each starting location on each course, we started the robot up to four times, moving on to the next starting location when the robot successfully completed the course twice or after four attempts, whichever occurred first. After completing an entire course, we rerandomized the course. After repeating this procedure on six random courses, we perceived no improvement with further incremental training.

Figure 7. The arena used for the random obstacle field tests, with a typical course configuration. The objective of the behavior was for the robot to reach the goal location by driving through the course without hitting any obstacles. We intended this experiment to lend insight into the properties of the learning system alone by making the task of perception easy for the robot.

Finally, we ran the robot autonomously on two additional random courses, with no interruption, to evaluate performance.

During each autonomous run, we measured the time the robot spent on the course before crossing the finish line. Additionally, we measured the number of collisions with obstacles. Upon each collision, we stopped the timer, executed a consistent maneuver to shift the robot sideways, and restarted the robot and the timer. When the robot is running fully autonomously, we detect collisions using the robot's front bumper and execute a similar automatic maneuver. The robot incurred a significant time penalty with each collision because of the time it takes to accelerate from a halt.

We repeated the time and collision count measurements from each start location for both the learned behavior and our hand-tuned behavior. We observed very different traversal times from each start location, due to variable difficulty of the course in front of each start location. In addition, we observed different times from each random course, again due to variability in difficulty of the courses.

Table II presents the fraction of runs on each course for which the behaviors reached the finish line without colliding with any course obstacles and the average number of collisions per run. The learned behavior usually collided more frequently with obstacles than did the hand-coded behavior.

Table II. Percentage of runs free of collisions and average number of collisions per run for each course.

Course   Learned, collision free (%)   Hand-coded, collision free (%)   Learned, collisions per run   Hand-coded, collisions per run
3        49                            100                              0.69                          0.00
4        33                            72                               1.22                          0.28
5        71                            64                               0.57                          0.36
6        56                            69                               0.56                          0.31
7        50                            90                               0.80                          0.10
8        30                            70                               0.90                          0.30


Table III and Figure 8 summarize the course traversal times for the learned behavior and the hand-crafted behavior. The table shows the times, averaged per course, for all runs. Despite many more collisions, which impose time penalties, the learned behavior is competitive with the hand-coded behavior. The figure shows the time differences between the hand-coded and learned behaviors from each start position on each course, as well as per-course averages with 90% confidence intervals. Positive differences indicate that the learned behavior was faster, whereas negative differences indicate that the hand-coded behavior was faster.

Table IV and Figure 9 show a subset of these times, including only runs for which neither behavior collided with any obstacles. When the robot was running autonomously, many of the collisions we observed appeared to be due to the confounding factors, especially the "oscillation problem," as described in our Discussion (Section 5). When the learned behaviors did not collide with obstacles, however, they were usually faster than the hand-coded behaviors. Unfortunately, as seen in Figure 9(b), this difference was statistically significant only on two of the six tested courses.
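The per-course averages with 90% confidence bounds referenced above can be computed from the per-run time differences in the usual way; the brief sketch below (using scipy's t-distribution) illustrates one common procedure, not necessarily the exact one we used.

    import numpy as np
    from scipy import stats

    def mean_with_90ci(time_differences):
        # Mean of per-run time differences (hand-crafted minus learned) with a
        # two-sided 90% confidence interval based on the t-distribution.
        d = np.asarray(time_differences, dtype=float)
        m, se = d.mean(), stats.sem(d)
        half_width = se * stats.t.ppf(0.95, df=d.size - 1)
        return m, (m - half_width, m + half_width)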

Table III. All times, averaged over runs for each course.

Course   Learned behavior average time (s)   Hand-coded behavior average time (s)
3        22.5                                22.5
4        23.5                                22.8
5        23.4                                22.1
6        22.5                                21.8
7        23.0                                21.4
8        21.7                                22.3

Table IV. Times, averaged over runs for each course, for runs in which neither behavior collided with any obstacles.

Course   Learned behavior average time (s)   Hand-coded behavior average time (s)
3        20.2                                22.5
4        21.8                                22.6
5        22.4                                22.6
6        20.7                                21.2
7        23.6                                21.0
8        19.0                                20.9

Qualitatively, we observed the learned behaviors executing certain maneuvers, such as turning while driving between two close obstacles or driving around single obstacles, more smoothly and efficiently than the hand-coded behaviors, which often took indirect paths or slowed to make precise turns.

4.5. General-Purpose Navigation

For the LAGR robot tests, we developed a "general-purpose" behavior by training the robot in a mixture of scenarios containing both paths and various obstacle configurations. Figure 10(a) shows a typical scenario, in which the goal location is on the opposite side of the tree.

We tried to make this training as realistic as possible for the types of courses the robot would encounter. We spent 3 h training the robot in obstacle courses, in addition to the hour spent training it to follow paths. The performance of the robot in "general-purpose" tasks was very good. It could successfully navigate around natural obstacles with very few collisions. It could escape shallow cul-de-sacs without the use of a planner, as in the right-hand trajectory of Figure 10(b), a task at which our hand-tuned behavior failed without the planner.

For the final LAGR test, we employed a deliberative planner that handed off waypoints as goals to the learned "general-purpose" behavior. Navigation in this setting was very robust, as the occasional failures of the learned behaviors were corrected as the planner directed the robot toward open space. When using the learned behaviors in place of the hand-tuned behaviors, in conjunction with the planner, the robot wandered less and made fewer unnecessary movements, usually reaching the goal in less time.

5. DISCUSSION

In our discussion, we investigate the principal components extracted from the training data and present our observations of several capabilities of our learning system. Additionally, we discuss "grooves" in the state space generated by training along single trajectories and the need to fill nearby areas of the state space with training examples. Finally, we identify factors that affected the performance of our learning system and suggest methods for addressing specific problems identified in our experiments.

Figure 8. Time differences (hand-crafted minus learned) for all runs. (a) Each bar shows the averaged time differences for one starting location on one course. There were between two and four runs per starting location, per course, as described in Section 4.4. Runs are in chronological order, so those on the left ran with the fewest training examples, and those on the right ran with the most training examples from corrective training. (b) The averages of time differences grouped by course, with 90% confidence bounds. Courses are also in chronological order, and runs on courses 7 and 8 introduced no further training.

Figure 9. Time differences (hand-crafted minus learned) for only the runs in which neither behavior collided with any obstacles. (a) All runs individually, in chronological order. (b) The averages of these time differences for each course, with 90% confidence bounds. Courses are also in chronological order, and runs on courses 7 and 8 included no further training.

Figure 10. A typical "general-purpose navigation" scenario involving various obstacles. The goal location is on the far side of the tree. In the autonomous runs shown in (b), the robot navigates successfully to the goal. In the left-hand trajectory, the robot drives in the middle of the narrow passage, instead of staying on the side closest to the goal. In the right-hand trajectory, the robot escapes a concave group of obstacles, or cul-de-sac, by driving right along the concrete pipes, instead of trying to go toward the goal.

Principal components. The principal components extracted from the training data provide insight into the environmental features our learning system is using to map world states to motor commands. Figure 11 shows the first six principal components for each of the five learned behaviors. The principal components are sorted from highest to lowest eigenvalue. Each principal component describes a linear combination of each of the polar range measurements. The plots in the figure show the scale, for each principal component, by which each polar range measurement is multiplied to obtain the response of that component to the robot's perception. Red and blue lines in the plot have opposite signs (i.e., one is positive and the other is negative).

Principal components represent the orthogonal directions of highest variance in the data. By using PCA, we are assuming that high-variance directions in our data represent important environmental features from which to determine the correct motor commands. In fact, we do find relevant environmental features in the principal components shown in Figure 11:

• Principal component 1 of (a), (b), and (d) and principal component 2 of (c) and (e) distinguish obstacles on the left from obstacles on the right.

• Principal component 2 of (a), (b), and (d) and principal component 3 of (c) and (e) distinguish obstacles on the sides of the robot from obstacles in front of the robot.

• Principal component 1 of (c) and (e) responds to lines of obstacles appearing on both sides, a situation that occurs when the robot is on a path.

• Principal component 4 of (d) responds to any obstacles perceivable by the robot, except those on the extreme right and left.

PCA also helped to reduce noise in the perception vector. Gaps in the robot's perception of obstacles led to range measurements alternating between the true distance to the obstacle and larger or smaller distances. Each principal component comprised measurements from a range of nearby angles, which smoothed out this noise.

Further investigating other methods of unsupervised learning for learning from example is part of our ongoing research.

Figure 11. Our learning system reduces dimensionality by projecting polar range measurements onto the first six principal components of the training data. This figure shows the magnitudes and signs of these principal components, plotted as they correspond to the polar range measurements. Red and blue colored lines represent opposite signs. For each behavior, as described in Section 5, the first two or three principal components yield intuitively relevant features the robot uses to describe the obstacles in the environment.

Capabilities. We found it interesting that our learned behaviors can learn to escape cul-de-sacs by following the wall of the cul-de-sac, as in Figure 10(b), where the hand-crafted behavior (without the planner) gets stuck turning back and forth in the bottom of the cul-de-sac. This capability arises because we train the robot to follow a wall of obstacles if there are no gaps in the wall it can drive through. One could in theory design a hand-crafted behavior to exhibit the same behavior, and in previous work we attempted to do so on the same platform, designing the behavior to switch to a modified form of the "bug" algorithm when it encountered a wall of obstacles. However, we experienced great difficulty tuning the hand-crafted behavior to reliably recognize and follow the boundaries of irregular obstacles such as lines of vegetation.

Additionally, our learning system can recognize and react to situations that would be complicated to detect with a parametric model. For instance, at path junctions, such as in Figures 6(c) and 6(d), the robot correctly turns down paths that lead in the general direction of the goal.

We observed that even though the learned policy is in local feature space, individual maneuvers are often conserved, allowing the robot to smoothly execute motions to avoid obstacles, reverse away from obstacles, and escape small cul-de-sacs. The robot was executing these maneuvers with significantly delayed closed-loop control. The ability to act with delayed feedback, or even execute open-loop maneuvers at times, is important for smooth and efficient motion in mobile robots, humans, and other animals, for which control lag is often significant and actuator control is imprecise, making high-speed, closed-loop control at all control levels impossible. Humans and animals have evolved and learned to be particularly good at controlling such systems.

“Grooves.” We observed in this work that we had to do a great deal of corrective training on the same courses. It appears that when the robot deviates more than a small amount from the trajectory on which we trained it (a “groove”), the nearest matching examples are poor matches for the robot’s situation and usually cause the robot to drive straight ahead, resulting in a collision. We suspect that because in the majority of examples the robot is primarily driving straight, the poor matches are by chance usually closest to the examples in which the robot is driving forward. The underlying problem is that when the robot deviates from the trajectories on which it was trained, or when it encounters a situation it has never seen, it enters regions of the state space with no relevant training examples.

Pomerleau (1991) encountered this same problem with ALVINN: because the human usually drove down the center of the road, only regions of the state space in which the van was already in the correct position on the road were filled with training exemplars. This is a difficulty particular to behavior learning; in trajectory learning it is solved by hand-designed controllers.

Grooves, or trajectories, in global pose space can correspond to small areas in perception state space. In vision-based lane following, for instance, the perception includes neither time nor distance traveled in the lane, and thus the majority of a complete training trajectory occupies a small region of the perception state space corresponding to the robot’s typical view from the middle of the lane.

The presence of grooves thus suggests that there are good and bad regions of the state space. If the state–action mapping is defined only in the good regions, there is no way for the agent to escape from the bad regions. The reinforcement learning method of Kaiser and Dillmann (1996) used exploration to find its way back to the good regions. Pomerleau (1991) solved this problem by synthetically generating new training examples from every provided example, shifting the images left and right several times and adjusting the control command to steer the robot back to the center. In this work, we attempted to solve this problem by starting the robot in the bad regions, such as near the edge of a path or prior to colliding with an obstacle, and giving training examples of escaping these situations.
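For concreteness, the following sketch illustrates the kind of synthetic augmentation Pomerleau describes, transposed to a polar range representation; the shift amounts, steering correction gain, and function name are illustrative assumptions of ours, not details of ALVINN or of our implemented system.

import numpy as np

def augment_example(scan, steering, shifts=(-2, -1, 1, 2), gain=0.1):
    # Illustrative only: rotate the polar range scan by a few angular bins,
    # as if the robot had drifted off the trained trajectory, and adjust the
    # steering command so that it drives the robot back toward the "groove."
    examples = [(scan, steering)]
    for s in shifts:
        shifted = np.roll(scan, s)        # same scene, viewed slightly rotated
        corrected = steering - gain * s   # steer back toward the original heading
        examples.append((shifted, corrected))
    return examples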

Further investigating and addressing the problem of filling in missing areas of the policy is part of our future work. Specifically, we wish to recognize online when the current training is insufficient and then to improve the policy by requesting additional training examples or through “practice” while running autonomously, as in Kaiser and Dillmann (1996) and Chernova and Veloso (2007).

Confounding factors and future work. For a single situation, the motor commands associated with the several nearest matches can be quite different. Inspecting typical nearest perception matches, shown in Figure 12, reveals that multiple close matches can be associated with entirely different motor commands. These differences are especially pronounced when an obstacle is directly in front of the robot and it could drive either left or right around it. Because of these differences, our learning system can exhibit chaotic changes in the motor command, due to rapidly changing nearest neighbors. In the worst case, these changes are effectively smoothed by the robot’s motor controller and momentum, and the robot drives straight into an obstacle when it could have driven around either side.

Although these oscillations arise from inconsistencies in the training data, we believe that to be usable by laypersons, a learning-from-example system must deal with both unintentional and intentional inconsistency of the human trainer. Neither smoothing of motor commands over time nor averaging of multiple nearest neighbors would fix this oscillation problem, as right and left would still be averaged. Future work that we believe will solve this problem includes investigating a scheme in which nearest neighbors vote on motor commands and incorporating a hysteresis effect that increases the likelihood that motor commands similar to the previous one are chosen.
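One way to make the proposed fix concrete is sketched below: the k nearest neighbors vote over coarse steering classes (left, straight, right), with a small bonus for the class chosen on the previous control cycle. The discretization thresholds, bonus value, and function name are illustrative assumptions of ours, not part of the implemented system.

import numpy as np

def vote_with_hysteresis(neighbor_steering, previous_class, bonus=1.5):
    # Illustrative only: neighbor_steering is an array of steering commands
    # from the k nearest training examples. Discretize into three classes
    # (0 = left, 1 = straight, 2 = right) and tally votes.
    classes = np.digitize(neighbor_steering, bins=[-0.1, 0.1])
    votes = np.bincount(classes, minlength=3).astype(float)
    # Hysteresis: favor the class chosen on the previous cycle, provided at
    # least one current neighbor supports it.
    if votes[previous_class] > 0:
        votes[previous_class] += bonus
    chosen = int(np.argmax(votes))
    # Average only the commands that agree with the winning class, so that
    # "left" and "right" examples are never averaged together.
    command = float(np.mean(neighbor_steering[classes == chosen]))
    return chosen, command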

Figure 12. A group of very close perception matches in which an obstacle is directly in front of the robot and the associated motor commands (shown by large arrows) point both to the left and to the right. The net effect is that the motor command oscillates between turning left and right, and the robot effectively goes straight.

Every so often, our learned behaviors fail by running into an obstacle. Whether due to the oscillation problem or due to unseen regions of state space, the result is usually that the robot becomes entangled or stuck in an obstacle, requiring human intervention. We observed, however, that when coupled with a planner, the learned behaviors collide with obstacles less frequently than the hand-crafted behaviors coupled with a planner do. We believe that our behaviors perform better under the planner because we are able to train the learned behaviors to solve problems that confound the hand-crafted behaviors, such as driving through gaps, ignoring noise (in the form of false-negative and false-positive obstacle detection), and driving away from the goal when necessary to get around concave obstacles.

We hold several hypotheses for why our learned behavior did not perform well on the random obstacle fields. The bins from which we built the courses were high enough that the robot often could not see some bins until after it emerged from around others, and because of the lag in our perception system, the robot thus could not see some bins until it was almost upon them. We saw evidence for this in that we had to train the robot to stop and turn in place when it was close to an obstacle. We believe that this problem also arises partially from the robot’s dynamics: the control command for a sudden turn when the robot is moving quickly, unlike the corresponding command when the robot is moving slowly, must also slow the robot as much as possible. To model these dynamics, we could incorporate the robot’s speed into the learner’s input state. Alternatively, we could learn trajectories instead of control commands, as input to a nonparametric dynamics model, as in Sermanet, Scoffier, Crudele, Muller, and LeCun (2008), which would then generate control commands that are sensitive to the robot’s speed.
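As a small illustration of the first option, the learner’s input vector could simply be extended with the measured speed before the nearest-neighbor lookup; the feature names and scaling below are assumptions of ours for illustration, not part of the implemented system.

import numpy as np

def build_learner_input(pca_features, goal_features, speed, speed_scale=2.0):
    # Illustrative only: append the robot's current speed, scaled so that it
    # contributes comparably to the perception and goal features in the
    # nearest-neighbor distance metric.
    return np.concatenate([pca_features, goal_features, [speed / speed_scale]])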

Another hypothesis for the poor performance in the random obstacle fields is that, because they were random, the number of scenarios to which the robot had to respond was larger than in the other scenarios. The state space that had to be filled by examples would thus be larger than in the other scenarios and potentially not captured by six principal components. Additionally, PCA may not capture the important aspects of the environment for these courses if high variance is caused by unimportant aspects of the environment. Thus, future work includes further characterization of the latent space found by PCA. In particular, the number of principal components required to preserve important aspects of the environment likely varies between environments, as evidenced by Figure 11. Finally, a different method of dimensionality reduction, such as independent component analysis, or a nonlinear method, may be better suited to some environments and behaviors.

6. CONCLUSION

We have presented a method for real-time, online learning of new reactive behaviors using supervised learning. In particular, because this system uses a nonparametric learning method, it is able to learn a wide variety of behaviors. Additionally, the use of a memory-based learning method allows the trainer to immediately examine the robot’s performance and provide additional training examples if necessary. Because this system incorporates a relative goal location, it works in conjunction with a global planner. Learning is highly interactive, as training sessions are conducted by flipping a remote control switch and driving the robot. Training sessions are immediately integrated into the robot’s behavior online. In this way, rapid incremental training can be performed when situations arise in which the robot does not perform optimally.

Learning by example is useful for adapting or prototyping robot behaviors for new environments and tasks without having to hand-code behavior logic. A human trainer can train the robot for new behaviors or environments using this method without having knowledge of the robot’s internals. Once the robot has been trained, it is immediately ready for testing and operation, as training data processing times are less than 1 s. We quickly trained behaviors in 1–4 h that would be complicated or would take weeks or months to design and tune by hand.

Our learned behaviors generally performed very well on an outdoor mobile robot in the presence of sensor error and in unstructured environments. In quantitative tests, the learned behavior is not as robust as the hand-tuned behavior but often completes obstacle courses more quickly.

ACKNOWLEDGMENTS

The authors wish to thank Jianxin Wu, Jinhan Lee, and Aaron Bobick for helpful discussions. This work was funded by DARPA under the LAGR program.

REFERENCES

Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19 (pp. 1–8). Cambridge, MA: MIT Press.

Argall, B., Browning, B., & Veloso, M. (2007). Learning by demonstration with critique from a human teacher. In ACM SIGCHI/SIGART Human–Robot Interaction (pp. 57–64). New York: ACM Press.

Arkin, R. C. (1998). Behavior-based robotics. Cambridge, MA: MIT Press.

Atkeson, C. G., & Schaal, S. (1997). Robot learning from demonstration. In International Conference on Machine Learning, Nashville, TN (pp. 12–20). San Francisco, CA: Morgan Kaufmann.

Bentivegna, D. C. (2004). Learning from observation using primitives. Ph.D. thesis, Georgia Institute of Technology, Atlanta.

Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1), 14–23.

Calinon, S., & Billard, A. (2005). Recognition and reproduction of gestures using a probabilistic framework combining PCA, ICA and HMM. In Proceedings of the International Conference on Machine Learning (ICML), Bonn, Germany (pp. 105–112). New York: ACM Press.

Calinon, S., Guenter, F., & Billard, A. (2007). On learning, representing, and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B, Special Issue on Robot Learning by Observation, Demonstration, and Imitation, 37(2), 286–298.

Chernova, S., & Veloso, M. (2007). Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS’07), Honolulu, HI (pp. 1–8). New York: ACM Press.

Grollman, D. H., Jenkins, O. C., & Wood, F. (2006). Discovering natural kinds of robot sensory experiences in unstructured environments. Journal of Field Robotics, 23(11–12), 1077–1089.

Grudic, G., & Mulligan, J. (2005). Topological mapping with multiple visual manifolds. In Proceedings of Robotics: Science and Systems, Cambridge, MA (pp. 25–32). Cambridge, MA: MIT Press.

Hamner, B., Singh, S., & Scherer, S. (2006). Learning obstacle avoidance parameters from operator behavior. Journal of Field Robotics, 23(11–12), 1037–1058.

Hayes, G., & Demiris, J. (1994). A robot controller using learning by imitation. In A. Borkowski & J. L. Crowley (Eds.), Proceedings of the 2nd International Symposium on Intelligent Robotic Systems (SIRS-94), Grenoble, France (pp. 198–204).

Jonsson, M., Wiberg, P., & Wickstrom, N. (1996). Vision-based low-level navigation using a feed-forward neural network. In Proceedings of the 2nd International Workshop on Mechatronical Computer Systems for Perception and Action 1997, Pisa, Italy.

Kaiser, M., & Dillmann, R. (1996). Building elementary robot skills from human demonstration. In IEEE International Conference on Robotics and Automation, Minneapolis, MN (pp. 2700–2705). IEEE.

LeCun, Y., Muller, U., Ben, J., Cosatto, E., & Flepp, B. (2005). Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems (NIPS 2005) (pp. 739–746). Cambridge, MA: MIT Press.

Mount, D. M., & Arya, S. (2006). Approximate nearest neighbor (ANN) library, version 1.1.1. University of Maryland, http://www.cs.umd.edu/~mount/ANN/.

Nicolescu, M., & Mataric, M. J. (2003). Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multi-Agent Systems, Melbourne, Australia (pp. 241–248). New York: ACM Press.

Ollis, M., Huang, W., & Happold, M. (2007). A Bayesian approach to imitation learning for robot navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), San Diego, CA (pp. 709–714). IEEE.

Pfaff, P., Stachniss, C., Plagemann, C., & Burgard, W. (2008). Efficiently learning high-dimensional observation models for Monte-Carlo localization using Gaussian mixtures. In Proceedings of the International Conference on Intelligent Robots and Systems, Nice, France. IEEE.

Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1), 88–97.

Ratliff, N., Bagnell, J., & Zinkevich, M. (2006). Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA (pp. 729–736). New York: ACM Press.

Sermanet, P., Scoffier, M., Crudele, C., Muller, U., & LeCun, Y. (2008). Learning maneuver dictionaries for ground robot planning. In 39th International Symposium on Robotics, Seoul, South Korea.

Smart, W., & Kaelbling, L. (2002). Effective reinforcement learning for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, Washington, DC (vol. 4, pp. 3404–3410). IEEE.

Stolle, M., & Atkeson, C. (2006). Policies based on trajectory libraries. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), Orlando, FL (pp. 3344–3349). IEEE.

Stolle, M., & Atkeson, C. G. (2007). Knowledge transfer using local features. In Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), Honolulu, HI (pp. 26–31). IEEE.

Stolle, M., Tappeiner, H., Chestnutt, J., & Atkeson, C. (2007). Transfer of policies based on trajectory libraries. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), San Diego, CA (pp. 2981–2986). IEEE.

Sun, J., Mehta, T., Wooden, D., Powers, M., Rehg, J., Balch, T., & Egerstedt, M. (2007). Learning from examples in unstructured, outdoor environments. Journal of Field Robotics, 23(11–12), 1019–1036.

Thomaz, A. L., & Breazeal, C. (2006). Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In 21st National Conference on Artificial Intelligence (AAAI), Boston, MA (pp. 1444–1449). AAAI Press.

Vijayakumar, S., D’Souza, A., & Schaal, S. (2005). Incremental online learning in high dimensions. Neural Computation, 17(12), 2602–2634.

Wooden, D., Powers, M., MacKenzie, D., Balch, T., & Egerstedt, M. (2007). Control-driven mapping and planning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), San Diego, CA (pp. 3056–3061). IEEE.
