
Policy Search for Multi-Robot Coordination under Uncertainty

Christopher Amato∗, George Konidaris†, Ariel Anders‡, Gabriel Cruz‡, Jonathan P. How§ and Leslie P. Kaelbling‡

∗Department of Computer Science, University of New Hampshire, Durham, NH 03824
†Departments of Computer Science & Electrical and Computer Engineering, Duke University, Durham, NC 27708

‡CSAIL, MIT, Cambridge, MA 02139
§LIDS, MIT, Cambridge, MA 02139

Abstract—We introduce a principled method for multi-robot coordination based on a generic model (termed a MacDec-POMDP) of multi-robot cooperative planning in the presence of stochasticity, uncertain sensing, and communication limitations. We present a new MacDec-POMDP planning algorithm that searches over policies represented as finite-state controllers, rather than the existing policy tree representation. Finite-state controllers can be much more concise than trees, are much easier to interpret, and can operate over an infinite horizon. The resulting policy search algorithm requires a substantially simpler simulator that models only the outcomes of executing a given set of motor controllers, not the details of the executions themselves, and can solve significantly larger problems than existing MacDec-POMDP planners. We demonstrate significantly improved performance over previous methods and application to a cooperative multi-robot bartending task, showing that our method can be used for actual multi-robot systems.

I. INTRODUCTION

In order to fulfill the potential of increasingly capable and affordable robot hardware, effective methods for controlling robot teams must be developed. Although many algorithms have been proposed for multi-robot problems, the vast majority are specialized methods engineered to match specific team or problem characteristics. Progress in more general settings requires the specification of a model class that captures the core challenges of controlling multi-robot teams in a generic fashion. Such general models—in particular, the Markov decision process [22] and partially observable Markov decision process [11]—have led to significant progress in single-robot settings through standardized models that enable empirical comparisons between very general planners that optimize a common metric.

Decentralized partially observable Markov decision processes (or Dec-POMDPs [8]) are the natural extension of such frameworks to the multi-robot case—modeling multi-agent coordination problems in the presence of stochasticity, uncertain sensing and action, and communication limitations. Unfortunately, Dec-POMDPs are exactly solvable only for very small problems. The search for tractable approximations led to the recent introduction of the MacDec-POMDP model [4], which includes temporally extended macro-actions that naturally model robot motor controllers which may require multiple time steps to execute (e.g., navigating to a waypoint, lifting an object, or waiting for another robot). Planning now takes

Fig. 1. The bartender and waiters domain

place at the level of selecting controllers to execute, rather than sequencing low-level motions, and MacDec-POMDP solution methods can scale up to reasonably realistic robotics problems; for example, solving a multi-robot warehousing problem that was orders of magnitude larger than those solvable by previous methods [5]. General-purpose planners based on MacDec-POMDPs have the potential to replace the abundance of ad-hoc multi-robot algorithms for specific task scenarios with a single precise and generic formulation of cooperative multi-robot problems that is powerful enough to include (and naturally combine) all existing cooperative scenarios.

Unfortunately, existing MacDec-POMDP planners have two critical flaws. First, even though using macro-actions drastically increases the size of problems that can be solved, planning still scales poorly with the horizon. Second, they assume that the underlying (primitive) problem is discrete, and that a low-level model of that problem is completely specified and available to all agents. These difficulties significantly limit the applicability of existing MacDec-POMDP planners.

This paper introduces a new MacDec-POMDP planning algorithm that searches over policies represented as finite-state controllers, rather than the currently used policy trees. Finite-state controllers are often much more concise than trees, are easier to interpret, and can operate for an infinite horizon. The resulting policy-search algorithms only require a model of the problem at the macro-action level—at the level of modeling the outcome of executing given motor controllers,


not the details of execution itself—substantially reducing the knowledge required for planning. We show that the new planner can solve significantly larger problems than existing MacDec-POMDP planners (and by extension, all existing Dec-POMDP planners), and demonstrate its application to a cooperative multi-robot bartending task, showing that our method can automatically optimize solutions to multi-robot problems from a high-level specification.

II. MOTIVATING PROBLEM

As a motivating experimental domain we consider a heterogeneous multi-robot problem, shown in Figure 1. The robot team consists of a PR2 bartender and two TurtleBot waiters. There are also three rooms in which people can order drinks from the waiters. Our goal is to bring drinks to the rooms with orders efficiently. We impose communication limitations so the robots cannot communicate unless they are in close range. As a result, the robots must make decisions based on their own information, reasoning about the status and behavior of the other robots. This is a challenging task with stochasticity in ordering, navigation, picking, and placing objects as well as partial observability in the orders and the location and status of the other robots.

We model this domain as a MacDec-POMDP and introduce a planning algorithm capable of automatically generating controllers for the robots (in the form of finite-state machines) that collectively maximize team utility. This problem involves aspects of communication, task allocation, and cooperative navigation—tasks for which specialized algorithms exist—but modeling it as a MacDec-POMDP allows us to automatically generate controllers that express and combine aspects of these behaviors, without specifying them in advance, while trading off their costs in a principled way.

III. BACKGROUND

Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) generalize POMDPs to the multiagent, decentralized setting [8]. Multiple agents operate under uncertainty based on partial views of the world, with execution unfolding over a bounded or unbounded number of steps. At each step, every agent chooses an action (in parallel) based purely on locally observable information and then receives an observation. The agents share a single reward function based on the actions of all agents, making the problem cooperative, but their local views mean that execution is decentralized.

Formally, a Dec-POMDP is defined by a tuple ⟨I, S, {A_i}, T, R, {Z_i}, O, h⟩, where I is a finite set of agents; S is a finite set of states with designated initial state distribution b_0; A_i is a finite set of actions for each agent i, with A = ×_i A_i the set of joint actions; T is a state transition probability function, T : S × A × S → [0, 1], that specifies the probability of transitioning from state s ∈ S to s′ ∈ S when the actions a ∈ A are taken by the agents (i.e., T(s, a, s′) = Pr(s′ | a, s)); R is a reward function, R : S × A → ℝ, the immediate reward for being in state s ∈ S and taking the actions a ∈ A; Z_i is a finite set of observations for each agent i, with Z = ×_i Z_i the set of joint observations; O is an observation probability function, O : Z × A × S → [0, 1], the probability of seeing observations o ∈ Z given that actions a ∈ A were taken and resulted in state s′ ∈ S (i.e., O(o, a, s′) = Pr(o | a, s′)); and h is the number of (possibly infinite) steps until termination, called the horizon.
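For concreteness, the pieces of this tuple can be collected into a small container; the following Python sketch is illustrative only (the class and field names are ours, not from an existing library):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[str, ...]
JointObs = Tuple[str, ...]

@dataclass
class DecPOMDP:
    """Container for the Dec-POMDP tuple <I, S, {A_i}, T, R, {Z_i}, O, h>."""
    agents: List[int]                                 # I: agent indices
    states: List[str]                                 # S: finite state set
    b0: Dict[str, float]                              # initial state distribution
    actions: List[List[str]]                          # A_i: per-agent action sets
    observations: List[List[str]]                     # Z_i: per-agent observation sets
    T: Callable[[str, JointAction, str], float]       # T(s, a, s') = Pr(s' | s, a)
    R: Callable[[str, JointAction], float]            # R(s, a): shared immediate reward
    O: Callable[[JointObs, JointAction, str], float]  # O(o, a, s') = Pr(o | a, s')
    h: float                                          # horizon (math.inf for infinite horizon)
```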

A solution to a Dec-POMDP is a joint policy—a set of policies, one for each agent. Because the full state is not directly observed, it is often beneficial for each agent to remember a history of its observations. A local policy for agent i is a mapping from local observation histories to actions, H_i^O → A_i. Because the system state depends on the behavior of all agents, it is typically not possible to estimate the system state (i.e., calculate a belief state) from the history of a single agent, as is often done in the POMDP case. Since a policy is a function of history, rather than of a directly observed state (or a calculated belief state), it is typically represented explicitly. The most common representation is a policy tree, where the vertices indicate actions to execute and the edges indicate transitions conditioned on an observation (with the history represented as the current path in the tree).
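As a small illustration of the tree representation (the names below are ours), each node stores an action and one child per observation, and an agent's observation history selects a path from the root:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PolicyTreeNode:
    """One node of an agent's policy tree: an action plus one subtree per observation."""
    action: str
    children: Dict[str, "PolicyTreeNode"] = field(default_factory=dict)

def act_from_history(root: PolicyTreeNode, history: List[str]) -> str:
    """Follow the agent's local observation history from the root and return the action."""
    node = root
    for obs in history:
        node = node.children[obs]
    return node.action
```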

The value of a joint policy, π, from state s is $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t R(a_t, s_t) \mid s, \pi\right]$, which is the expected sum of rewards for the agents given the action prescribed by the policy at each step until the horizon is reached. In the finite-horizon case, the discount factor, γ, is typically set to 1. In the infinite-horizon case, the discount factor γ ∈ [0, 1) is included to maintain a finite sum and h = ∞. An optimal policy beginning at state s is $\pi^*(s) = \arg\max_\pi V^\pi(s)$. Dec-POMDP solution methods typically assume that the set of policies is generated in a centralized manner, but they are executed in a decentralized manner based on each agent's histories.

Although Dec-POMDPs have been widely studied [3, 18, 23], optimal (and boundedly optimal) methods do not scale to large problems, while approximate methods do not scale or perform poorly as problem size (including horizon) grows. Subclasses of the full Dec-POMDP model have been explored, but they make strong assumptions about the domain (e.g., assuming a large amount of independence between agents).

Macro-Actions for Dec-POMDPs: Dec-POMDPs typically require synchronous decision-making: every agent chooses an action to execute, and then executes it in a single time step. This restriction is problematic for two reasons. First, many systems use a set of controllers (e.g., for waypoint navigation, grasping an object, waiting for a signal), and planning consists of sequencing the execution of those controllers. These controllers often require different amounts of time to execute, so synchronous decision-making requires waiting until all agents have completed their controller execution (and achieved common knowledge of this fact). This is inefficient and may not even be possible in some domains (e.g., when controlling airplanes or underwater vehicles that cannot stay in place). Second, the planning complexity of a Dec-POMDP is doubly exponential in the horizon. A planner that reasons about all agents' possible policies at every time step will only ever be


able to make very short plans.

The Dec-POMDP model was recently extended to plan using macro-actions, or temporally extended actions [4] (hence the MacDec-POMDP model). This formulation uses higher-level planning to compute near-optimal solutions for problems with significantly longer horizons by extending the MDP-based options framework [26] to Dec-POMDPs, using macro-actions, m_i, that execute a policy in a low-level Dec-POMDP from states that satisfy the initial conditions until some terminal condition is met. Policies for each agent, µ_i, can be defined for choosing macro-actions that depend on high-level observations. Because macro-action policies are built from primitive actions, we can evaluate the high-level policies in a way that is similar to other Dec-POMDP-based approaches. Given a joint policy, the primitive action at each step is determined by the (high-level) policy, which chooses the macro-action, and the macro-action policy, which chooses the (primitive) action. The joint policy and macro-action policies can then be evaluated as $V^\mu(s) = \mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t R(a_t, s_t) \mid s, \pi, \mu\right]$. The goal is to obtain a hierarchically optimal policy, $\mu^*(s) = \arg\max_\mu V^\mu(s)$, which produces the highest expected value that can be obtained by sequencing the agents' given macro-actions.

Two Dec-POMDP algorithms have been extended to the MacDec-POMDP case [4], but other extensions are possible. The key difference is that nodes in a policy tree now select macro-actions (rather than primitive actions) and edges correspond to terminal conditions (or high-level observations).

IV. FINITE-STATE CONTROLLERS FOR MACDEC-POMDPS

A tree-based representation of a policy requires the agent to remember its entire history to determine its next action. In finite-horizon problems, the memory requirement is exponential in the horizon, and for infinite-horizon problems an agent would need infinite memory. Clearly, this is not feasible. As an alternative, we introduce a finite-state controller representation and provide algorithms for generating these controllers.

Finite-state controllers (FSCs) can be used to represent policies in a manner similar to policy trees. There is a designated initial node and, after action selection at a node, the controller transitions to the next node depending on the resulting observation. This continues for the infinite steps of the problem. Nodes in an agent's controller represent internal states, which prescribe actions based on that agent's finite memory.

A set of controllers, one per agent, provides the joint policy of the agents. FSCs are widely used as solution representations for POMDPs and Dec-POMDPs [2, 6, 9, 15, 21, 27, 28].

A. Mealy Controllers

There are two main types of controllers, Moore and Mealy, that have been used for POMDP and Dec-POMDP solutions. Moore controllers associate actions with nodes and Mealy controllers associate actions with controller transitions (i.e., nodes and observations). We use the Mealy representation.

A Mealy controller for agent i is a tuple m_i = ⟨Q_i, A_i, Z_i, δ_i, λ_i, q_i^0⟩:
• Q_i is the set of nodes;
• A_i and Z_i are the output and input alphabets (i.e., the action chosen and the observation seen);
• δ_i : Q_i × Z_i → Q_i is the node transition function;
• λ_i : Q_i × Z_i → A_i is the output function for nodes ≠ q_i^0, which associates output symbols with transitions;
• λ_i^0 : Q_i → A_i is the output function for node q_i^0;
• q_i^0 ∈ Q_i is the initial node.

Because action selection depends on the observation as well as the current node, for the first node (where no observations have yet been seen), the action only depends on the node. For all other nodes, the action output depends on the node and observation, λ_i(q, o).
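A single agent's Mealy controller can be captured with two lookup tables plus the initial node and its action; the sketch below is an illustration with our own naming, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class MealyController:
    """One agent's Mealy controller <Q_i, A_i, Z_i, delta_i, lambda_i, q0_i>."""
    q0: int                              # initial node q0_i
    initial_action: str                  # lambda0_i(q0_i): action chosen before any observation
    delta: Dict[Tuple[int, str], int]    # delta_i(q, z): next node for (node, observation)
    output: Dict[Tuple[int, str], str]   # lambda_i(q, z): action emitted on that transition

    def start(self) -> Tuple[int, str]:
        """Initial node and the action chosen before any observation has been seen."""
        return self.q0, self.initial_action

    def step(self, q: int, z: str) -> Tuple[int, str]:
        """Next node and action, given the current node and a newly seen observation."""
        return self.delta[(q, z)], self.output[(q, z)]
```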

Mealy controllers are a natural policy representation for MacDec-POMDPs because the initial conditions of macro-actions can be easily checked. That is, since the (macro-)action is chosen after an observation is seen, MacDec-POMDPs in which initial conditions depend solely on an agent's local observations can be verified directly. As such, algorithms that use Mealy controllers can ensure that valid macro-action policies are generated for each agent in a simple manner.

For a set of Mealy controllers, m, when the initial state is s, the joint observation is o, and the current nodes of m are q, the value is denoted V_m(q, o, s) and satisfies:

$$V_m(q, o, s) = R(s, \lambda(q, o)) + \gamma \sum_{s', o'} \Pr(s' \mid s, a)\, \Pr(o' \mid s', a)\, V_m(\delta(q, o), o', s'),$$

where $\lambda(q, o) = \{\lambda_1(q_1, o_1), \ldots, \lambda_n(q_n, o_n)\}$ are the actions selected by each agent given the current node of its controller and the observation seen, and $\delta(q, o) = \{\delta_1(q_1, o_1), \ldots, \delta_n(q_n, o_n)\}$ are the next nodes for each agent given that agent's current node and observation. Because the first nodes do not depend on observations, the value of the controllers m at b is $V_m(b) = \sum_s b(s) V_m(q^0, s)$, where q^0 is the set of initial nodes for all agents.¹

¹The value can also be represented as $V_m(b) = \sum_s b(s) V_m(q^0, o^*, s)$, where o^* are dummy observations that are only received on the first step.
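Given a full low-level model, this recursion can be evaluated by brute-force fixed-point iteration over all (q, o, s) triples. The sketch below assumes the DecPOMDP container and MealyController class from the earlier sketches and small, enumerable state and observation sets; it is a naive evaluator for illustration, not the approach used later in the paper:

```python
import itertools
from typing import Dict

def evaluate_mealy(model, controllers, gamma: float, iters: int = 200) -> Dict[tuple, float]:
    """Fixed-point evaluation of V_m(q, o, s) for a joint set of Mealy controllers.

    model: DecPOMDP-style container exposing .states, .observations, .T, .R, .O
    controllers: one MealyController per agent (as sketched above)
    """
    per_agent_nodes = [sorted({key[0] for key in c.delta}) for c in controllers]
    joint_nodes = list(itertools.product(*per_agent_nodes))
    joint_obs = list(itertools.product(*model.observations))
    V = {(q, o, s): 0.0 for q in joint_nodes for o in joint_obs for s in model.states}
    for _ in range(iters):
        V_new = {}
        for (q, o, s) in V:
            # joint action and joint successor nodes chosen on this (node, observation) pair
            a = tuple(c.output[(qi, oi)] for c, qi, oi in zip(controllers, q, o))
            q_next = tuple(c.delta[(qi, oi)] for c, qi, oi in zip(controllers, q, o))
            backup = 0.0
            for s2 in model.states:
                p_s = model.T(s, a, s2)
                if p_s == 0.0:
                    continue
                for o2 in joint_obs:
                    p_o = model.O(o2, a, s2)
                    if p_o > 0.0:
                        backup += p_s * p_o * V[(q_next, o2, s2)]
            V_new[(q, o, s)] = model.R(s, a) + gamma * backup
        V = V_new
    return V
```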

B. Macro-action controllers

Representing policies in MacDec-POMDPs with the finite-state controllers discussed above is trivial, since we can replace the primitive actions with macro-actions. The output function becomes λ_i : Q_i × Z_i → M_i, where Z_i are now the observations resulting from macro-actions and M_i are the macro-actions for agent i. Unfortunately, evaluation of these macro-action controllers is non-trivial due to the fact that macro-actions may require different amounts of time. Therefore, we must explicitly consider time when performing evaluation.

To perform this evaluation, we can build on recent work for modeling decentralized partially observable semi-Markov decision processes (Dec-POSMDPs) [19]. The Dec-POSMDP model explicitly considers actions with different durations, using a reward model that accumulates value until any agent terminates a (macro-)action and a transition model that considers how many time steps take place until termination. These lengths of time may be different (e.g., when all agents are executing macro-actions that require a large amount of time to terminate, the time until any agent terminates will be long).

We propose a more general Dec-POSMDP model as follows. We can model the state as S ×_i S_i^m, which includes the world state and a state of each of the agents' currently executing macro-actions. In our experimental domain, S_i^m will be the amount of time that agent's macro-action has been executing (since this is sufficient information to determine how much more time will be required in that problem), but S_i^m could correspond to a node in a lower-level finite-state controller or other relevant information. The actions are the macro-actions, A_i = M_i. The transition probability function, T, now includes the number of discrete time steps until completion as Pr(s′, k | s, m), where k is this number of steps and m is the joint set of macro-actions being executed. The reward function, R(s, m), is the value until any agent terminates, $\mathbb{E}[r_t + \ldots + \gamma^{k-1} r_{t+k} \mid s, m, t]$, starting at time t. Z_i is now a finite set of high-level observations that are only observed after an agent's macro-action has been completed. The observation probability function, Pr(o | s′, m), generates an observation for each agent based on the resulting state and the macro-action that was executed. The horizon h is the number of (low-level, not macro-action) steps until termination.
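Under these assumptions, the planner only needs the ability to sample macro-level outcomes, rewards, and high-level observations from the domain. A minimal interface sketch (our own naming) of what such a macro-level model would have to provide:

```python
from typing import Protocol, Sequence, Tuple

class MacroLevelModel(Protocol):
    """What a planner needs from the domain at the macro-action (Dec-POSMDP) level."""

    def sample_outcome(self, s, joint_macro: Sequence[str]) -> Tuple[object, int]:
        """Sample (s', k): next state and number of low-level steps until the first
        agent terminates its macro-action, i.e. a draw from Pr(s', k | s, m)."""
        ...

    def reward(self, s, joint_macro: Sequence[str]) -> float:
        """Expected discounted reward R(s, m) accumulated until that first termination."""
        ...

    def sample_observation(self, agent: int, s_next, joint_macro: Sequence[str]) -> str:
        """High-level observation for an agent whose macro-action just completed,
        drawn from Pr(o_i | s', m)."""
        ...
```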

Using this model, we can evaluate a joint policy that is represented as a set of Mealy controllers, µ, using the following Bellman equation:

$$V^\mu(q, o, s) = R(s, \lambda(q, o)) + \sum_{k=1}^{\infty} \gamma^k \sum_{s'} \Pr(s', k \mid s, \lambda(q, o)) \sum_{o'} \Pr(o' \mid s', \lambda(q, o))\, V^\mu(\delta(q, o'), o', s').$$

Note that, in the Dec-POSMDP, observations are only generated for agents that complete their macro-actions. As such, the observation, o_i, and the current controller node, q_i, do not update until agent i terminates its macro-action execution.

C. Exploiting domain structure

The Bellman equation provides a formal framework for evaluating policies in MacDec-POMDPs directly if we have a model (or simulator) of the system. It is often difficult to construct a full model in complex domains, but many domains possess structure that allows efficient evaluation.

For example, in our bartender domain, we perform a sample-based evaluation of policies using a high-level simulator. This simulator uses distributions for each macro-action describing the completion time at each terminal condition given each possible initial condition. Having this time information is a much less restrictive assumption than knowing the full policy of each macro-action. We will also assume that the reward only depends on the state, and the observations only depend on the state and terminal condition of the macro-action. This simulator allows us to evaluate policies while keeping track of the relevant state information and execution of macro-actions.

Algorithm 1 MacDec-POMDP Sample-Based Evaluation

function SAMPLEEVAL(µ, s_0, numSims, maxTime)
    totalReturn ← 0
    for sim = 0 to numSims do
        s ← s_0, q ← q^0, o ← o^*
        t ← 0, tAg_i ← 0 for every agent i
        while t < maxTime do
            minTime ← ∞, minAgents ← ∅, termConds ← ∅
            for all agents i ∈ I do
                for all terminal conditions β_i of λ_i(q_i, o_i) do
                    t_β ← SampleFromDist(s, λ_i(q_i, o_i), β_i)
                    if t_β − tAg_i = minTime then
                        minAgents ← minAgents ∪ {i}, termConds ← termConds ∪ {β_i}
                    else if t_β − tAg_i < minTime then
                        minTime ← t_β − tAg_i, minAgents ← {i}, termConds ← {β_i}
            t ← t + minTime
            s′ ← sampledStateUpdate(s, λ(q, o), termConds)
            r ← R(s′)
            for all agents i ∈ minAgents do
                o_i ← sampleObs(s′, termConds)
                q_i ← δ(q_i, o_i)
                tAg_i ← 0
            for all agents i ∉ minAgents do
                tAg_i ← tAg_i + minTime
            totalReturn ← totalReturn + r
            s ← s′
    return totalReturn / numSims

Pseudocode for sample-based evaluation is given in Algorithm 1. The simulator keeps track of the world state, the current node and last seen observation of each agent, the current time in the system, and the amount of time each agent has been executing its macro-action. At each iteration, the simulator determines the set of agents which terminate their macro-actions in the least amount of time. The system state updates based on the termination of these completed macro-actions, and the corresponding agents observe new observations and transition in their controllers. These iterations continue until the system time reaches a limit. This sample-based evaluation can calculate the value of policies in problems with very large state spaces using a small number of simulations.
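A Python rendering of this sample-based evaluation, assuming the MealyController sketch from Section IV and a simulator object whose method names (terminal_conditions, sample_duration, state_update, reward, sample_obs) are our own placeholders rather than the paper's, might look as follows:

```python
def sample_eval(controllers, sim, s0, num_sims=100, max_time=1000.0):
    """Monte-Carlo policy evaluation in the spirit of Algorithm 1 (illustrative sketch)."""
    n = len(controllers)
    total = 0.0
    for _ in range(num_sims):
        s = s0
        q, macro = map(list, zip(*[c.start() for c in controllers]))  # initial nodes / macro-actions
        t, t_exec, ret = 0.0, [0.0] * n, 0.0
        while t < max_time:
            # find the agents whose current macro-action terminates soonest
            min_dt, term_conds = float("inf"), {}
            for i in range(n):
                for beta in sim.terminal_conditions(macro[i]):
                    dt = sim.sample_duration(s, macro[i], beta) - t_exec[i]
                    if dt < min_dt:
                        min_dt, term_conds = dt, {i: beta}
                    elif dt == min_dt:
                        term_conds[i] = beta
            t += min_dt
            s = sim.state_update(s, macro, term_conds)
            ret += sim.reward(s)                       # undiscounted, as in Algorithm 1
            for i in range(n):
                if i in term_conds:                    # this agent finished: observe and transition
                    o_i = sim.sample_obs(i, s, term_conds)
                    q[i], macro[i] = controllers[i].step(q[i], o_i)
                    t_exec[i] = 0.0
                else:                                  # still executing its macro-action
                    t_exec[i] += min_dt
        total += ret
    return total / num_sims
```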

V. POLICY SEARCH

Policy evaluation is an important step, but we must also determine the policies of each agent. We propose a heuristic search method that seeks to optimize the parameters of each agent's controller. Such methods have been able to generate high-quality controllers in the (primitive-action) Dec-POMDP case [1, 27]. Our new method, termed MacDec-POMDP heuristic search (or MDHS), integrates our sample-based evaluation and searches for a policy that is optimal with respect to a given controller size. Our heuristic search method constructs a search tree of possible controllers for each


Algorithm 2 MacDec-POMDP Heuristic Search (MDHS)

function HEURSEARCH(s_0, n)
    V ← V_init
    polSet ← ∅
    repeat
        δ ← selectBest(polSet)
        ∆′ ← expandNextStep(δ)
        for δ′ ∈ ∆′ do
            if isFullPol(δ′, n) then
                v ← valueOf(δ′)
                if v > V then
                    µ∗ ← δ′
                    V ← v
                    prune(polSet, V)
            else
                v ← valueUpperOf(δ′)
                if v > V then
                    polSet ← polSet ∪ {δ′}
        polSet ← polSet \ {δ}
    until polSet is empty
    return µ∗

agent, and searches through this space of policies by fixing the parameters (of all agents) for one node at a time, using heuristic upper bound values to direct the search.

Pseudocode of our policy search approach is given in Algorithm 2. A lower bound value, V, is initialized with the value of the best known joint policy (e.g., a random or hand-coded policy). An open list, polSet, which represents the set of partial policies that are available to be expanded, is initialized to be the empty set. At each step, the partial joint policy (node in the search tree) with the highest estimated value is selected. Then, this partial policy is expanded, generating policies with the action selection and node transition parameters for an additional node specified (all children in the search tree). This set is called ∆′. These expanded policies are examined to determine if they are fully specified. If they are, their value is compared with the value of the best known policy, which is updated accordingly (allowing for pruning of policies with value less than the new V). If a policy is not fully specified, its upper bound is calculated and it is added to the candidate set for expansion as long as that bound is greater than the value of the current best policy. The partial policy that was expanded is removed from the candidate set (the open list) and this process continues until the optimal policy (of size n) is found.

While this approach will generate a set of optimal controllers of a fixed size when it completes, it can also be stopped at any time to return the best solution found so far. In our implementation, we set the initial lower bound to be the value of a random policy and the upper bound to be the highest-valued single trajectory which uses random actions for controller nodes that have not been specified (rather than the expected value). These are relatively loose bounds, but they performed well in our experiments. In the future, we can explore tighter bounds.
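The skeleton of this best-first search can be written compactly with a max-heap keyed on upper bounds; in the sketch below the callables expand, is_full, value, and upper stand in for the expansion, evaluation, and bound routines described above and are assumptions rather than the paper's code:

```python
import heapq
import itertools
from typing import Callable, List, Tuple

def mdhs_search(expand: Callable, is_full: Callable, value: Callable, upper: Callable,
                root, v_init: float):
    """Best-first search over partially specified controllers (Algorithm 2 sketch).

    expand(partial) -> children with one more controller node fixed (for all agents)
    is_full(partial) -> True when every node's action/transition parameters are set
    value(partial)   -> sample-based value of a fully specified joint policy
    upper(partial)   -> heuristic upper bound on the value of any completion
    v_init           -> value of an initial (e.g. random or hand-coded) joint policy
    """
    best_value, best_policy = v_init, None
    counter = itertools.count()                       # tie-breaker so the heap never compares policies
    open_list: List[Tuple[float, int, object]] = []
    heapq.heappush(open_list, (-upper(root), next(counter), root))  # seed with the empty partial policy
    while open_list:
        neg_bound, _, partial = heapq.heappop(open_list)
        if -neg_bound <= best_value:                  # prune: bound cannot beat the incumbent
            continue
        for child in expand(partial):
            if is_full(child):
                v = value(child)
                if v > best_value:
                    best_value, best_policy = v, child
            else:
                ub = upper(child)
                if ub > best_value:                   # keep only partial policies that may improve
                    heapq.heappush(open_list, (-ub, next(counter), child))
    return best_policy, best_value
```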

TABLE I: Values, times (in s), and policy sizes on NAMO benchmarks of size 5×5 and 25×25.

         MDHS Controller        Tree
         5×5      25×25         5×5      25×25
Value    −5.33    −9.91         −5.30    −9.87
Time     180      180           388      4959
Size     5        5             10049    10044

VI. EXPERIMENTS

We perform comparisons with previous work on existing benchmark domains and demonstrate the effectiveness of our MDHS policy search in the bartender scenario. For all comparisons, we report the best solution found by our method after a time cut-off of 180 seconds. In the first two problems, controller sizes are fixed to be 5 nodes, but in the future, sizes could be generated from trajectories in the simulator (similar to previous methods [1]). Experiments were run on a single core of a 2.2 GHz Intel i7 with a maximum of 8 GB of memory. The benchmark and simulation experiments provide a rigorous quantitative analysis of the efficacy of the MacDec-POMDP planner, while the real-world experiments show that our method can be used for actual multi-robot systems.

A. A benchmark problem

For comparison with previous methods, we consider robots navigating among movable obstacles [24]. This domain was designed as a finite-horizon problem [4], so we add a discount factor (of 0.9) for the infinite-horizon case. Previous MacDec-POMDP methods were only designed for finite-horizon problems [4], but can produce policies that have a high value in infinite-horizon problems by using a large planning horizon. We compare against an optimal (finite-horizon) method for horizon 50, which can produce a solution within 0.046 of the optimal (infinite-horizon) value in both instances of the problem we consider. As mentioned above, our MDHS method used a random lower bound and the best single trajectory that was sampled as an upper bound. No other parameters are needed except for the desired controller size for each agent (which trades off solution quality against computational complexity).

As seen in Table I, our method ("MDHS Controller") produces solutions that are near optimal (as given by the finite-horizon "Tree" method) in much less time and with a much more concise representation. The previous tree-based dynamic programming method can produce a (near) optimal solution, but requires a representation exponential in the problem horizon and must search through many more possible policies before generating a solution.

B. A small warehousing problem

A small warehousing problem has also been modeled and solved as a MacDec-POMDP [5]. Previous solution methods (which are based on the policy tree methods [4]) were able to automatically generate a set of policies for a team of iRobot Creates, but were unable to exceed a problem horizon of 9.


TABLE II: Values, times (in s), and policy sizes on two warehousing problems (with no and limited communication). The tree-based method is unable to solve these problems.

         MDHS Controller        Tree
         NoCom    ComLimit      NoCom    ComLimit
Value    11.38    12.41         -        -
Time     180      180           -        -
Size     5        5             -        -

As seen in Table II, MDHS can produce concise solutions very quickly on these problems. A direct comparison with previous work is not possible (due to the existing method's lack of scalability to a large enough horizon and lack of bounds on the solution quality). Nevertheless, our method is able to solve these problems, while previous methods could not. We omit the results for the other (signaling) problem discussed in the previous paper, but the results are similar.

As an additional comparison, we also evaluate the controllers generated by our MDHS algorithm for the same number of steps that the previous algorithm used (9) without a discount factor. In this case, the previous tree-based method produced solutions with values of 1.16 and 1.6, while our method produces solutions with values 1.12 and 1.14. Note that our solution was not optimized for this particular horizon, but it shows that both methods have similar solution quality when executed for only 9 steps.

C. Bartender and waiters problem

The bartender and waiters problem is a multi-robot problem modeled after waiters gathering drinks and delivering them to different rooms. The waiters can go to different rooms to find out about and deliver drink orders. The waiters can go to the bar to obtain drinks from the bartender. The bartender can serve at most one waiter at a time, and the rooms can have at most one order at a time. Any waiter can fulfill an order (even if that waiter did not have previous knowledge about the order). Drink orders are created stochastically: a new order will be created in a room with 1% probability at each (low-level) time step when one does not currently exist. The reward for delivering a drink is 100 − (t_now − t_order)/10, where t_now is the current time step and t_order is the time step at which the order was created. The domain consists of three types of macro-actions for the waiters, as shown in Table III. The state variables each waiter can observe are shown in Table IV.
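For concreteness, the order dynamics and delivery reward described above can be simulated in a few lines; the per-step structure is our own sketch, while the 1% arrival probability and the 100 − (t_now − t_order)/10 reward come from the problem definition:

```python
import random

ORDER_PROB = 0.01   # chance that a room without an open order generates one per low-level step

def step_orders(order_time, t_now):
    """order_time maps room -> time step at which its open order was created, or None."""
    for room, created in order_time.items():
        if created is None and random.random() < ORDER_PROB:
            order_time[room] = t_now
    return order_time

def delivery_reward(t_now, t_order):
    """Reward for delivering a drink to an order created at time step t_order."""
    return 100 - (t_now - t_order) / 10
```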

To develop a simulator that is similar to the robot implementation, we estimated the macro-action times by measuring them in the actual domain over a number of trials (starting the macro-actions at each possible initial condition and executing until each possible terminal condition, generating a distribution for the results). There is a large amount of uncertainty in the problem in terms of the time required to complete a macro-action and outcomes such as receiving orders. We did not explicitly model failures in the navigation or PR2 picking/placing, but these could be easily modeled.
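One simple way to turn the measured trials into the simulator's completion-time model is to store the raw samples per (macro-action, initial condition, terminal condition) triple and resample from them; the class below is an assumed sketch of that idea, not the paper's code:

```python
import random
from collections import defaultdict

class DurationModel:
    """Empirical macro-action completion-time distributions measured on the real robots."""

    def __init__(self):
        self.samples = defaultdict(list)   # (macro, init_cond, term_cond) -> observed durations

    def record(self, macro, init_cond, term_cond, duration):
        """Log one measured trial."""
        self.samples[(macro, init_cond, term_cond)].append(duration)

    def sample_duration(self, macro, init_cond, term_cond):
        """Draw a completion time by resampling the measured trials (bootstrap-style)."""
        return random.choice(self.samples[(macro, init_cond, term_cond)])
```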

TABLE III: Macro-actions for the waiters.

ROOM N      Go to room n, observe orders, and deliver drinks.
BAR         Go to the bar and observe the current status of the bartender.
GET DRINK   Obtain a drink from the bartender.

TABLE IV: Observations for the waiters.

Variable    Values              Description
loc         {room n, bar}       waiter's location
orders      {True, False}       drink order for the current room
holding     {True, False}       waiter holding drink status
bartender   not serving         not serving and not ready to serve
            ready to serve      not serving and is ready to serve
            serving waiter      serving a drink
            no obs              cannot observe the bartender

Our instance of the bartender and waiters problem consisted of one bartender and two waiters. The domain had four rooms: the bar and rooms 1–3. We had a total of 5 macro-actions, since there is a macro-action for each room as well as one for requesting a drink. No communication was used except between the bartender and waiters in the bar area. There were also 64 observations (from Table IV), and the underlying state space consists of the continuous locations of the TurtleBots and the status of the PR2 (which we abstract into execution time in macro-actions) and discrete variables for orders in each room and whether each TurtleBot is holding a beverage.

D. Robot implementation

As shown in Figure 1, we used two TurtleBots (we will call the blue one Leonardo and the red one Raphael) as waiters and the PR2 as a bartender. The TurtleBots had 2 types of macro-actions: navigation and obtaining a drink from the bartender. The navigation actions were created using a given map, shown in Figure 2, with simple collision avoidance. For picking and placing drinks, we combined several ROS controllers for grasping and manipulation. The GET DRINK macro-action implemented a queueing system to serve multiple TurtleBots in the order they arrived.

For the observations, we used state-action deduction and communication. That is, the GET DRINK action was assumed to always succeed (but may require different amounts of time); the TurtleBot asserted it was holding a drink after this action. When the TurtleBot entered a room, it would prompt the user to take the drink it was holding or to place an order. The user could give a boolean response by toggling a red button on top of the TurtleBot. After a user picked up the drink, the waiter observed not holding until it completed the next GET DRINK action. The location observations were set with the localization functionality of the TurtleBot navigation stack. To obtain information about the bartender, the PR2 would broadcast its current state (serving, not serving, or ready to serve). The TurtleBots were only able to listen to the message in the bar location.


Fig. 2. Navigation map generated by the TurtleBots.

E. Bartender and waiter problem results

Our MDHS planner automatically generated the bartender and waiter solution based on the macro-action definitions and our high-level problem description (discussed above). The solution is a Mealy controller that maps nodes and observations to actions. To easily examine the results, we generated policies with 1 and 5 nodes. In the 1-node case, the action selections are memoryless and depend solely on the current state observation. For the 5-node case, the node transitions allow each robot to remember some relevant information.

Figures 3(a)–3(c) show parts of the generated 1-node policies. Analysis of the solution is naturally segmented into three phases: bar, delivery, and ordering, which correspond to 1) the waiter being located in the bar, 2) holding a drink, and 3) not holding a drink. In general, the solution spread out the serving and delivery behaviors of the TurtleBots between the three rooms: Leonardo only visited rooms 1 and 3, whereas Raphael focused on rooms 2 and 1. Additionally, the TurtleBots' controllers selected the BAR macro-action even when drinks were not ordered. This allowed the TurtleBots to have drinks that were ready to deliver, even if they did not previously know about an order.

Figure 3(a) shows the macro-actions for each TurtleBot when it is located in the bar. BAR is a navigation macro-action that takes the TurtleBot to the bar from any location. Once in the bar, the TurtleBot can observe the bartender's status. If the bartender is ready to serve, either agent will execute the GET DRINK action. Following the GET DRINK action, Raphael and Leonardo will execute ROOM 2 and ROOM 3 macro-actions, respectively. If the bartender is not ready to serve, the waiter will execute ROOM 1 or ROOM 2 macro-actions, depending on the observation. The distance to ROOM 3 is the farthest, so it requires less time to visit the other rooms when the bartender is not ready to serve.

Once a TurtleBot is holding a drink, it is in the delivery phase. Figure 3(b) shows the sequence of macro-actions executed in this case. Raphael receives a drink from the bartender and tries to complete deliveries in the following order: ROOM 2, ROOM 1, ROOM 3. That is, it continues looping through all rooms while holding a drink. Leonardo executes the ROOM 3 macro-action after receiving a drink from the bartender. If the drink is not delivered, then it chooses the ROOM 1 macro-action. It continues looping between ROOM 1 and ROOM 3 actions until a delivery is made.

It is important to note that the one-node controllers cannot contain a more complex solution that would let the waiters go to different rooms after receiving a drink. This controller is an elegant solution given the constraint: Raphael serves room 2 then room 1, whereas Leonardo serves room 3 followed by room 1. This resulting behavior shows clear cooperation between the two robots to efficiently cover the rooms.

After a TurtleBot has delivered a drink, it enters the ordering phase. Figure 3(c) shows the macro-action sequence for the case when the waiters are not holding any drinks. The dashed and dotted lines show the two cases when the waiters do not go to the bar. This happens when there is no order placed in rooms 2 and 3; the waiters go to the bar for all other observations. This behavior balances having a drink ready for unknown orders against the time used to visit other rooms.

An example execution of our generated controllers (for the 1-node case) is shown in Figure 4. Initially, the TurtleBots start in the bar next to the PR2. The PR2 immediately starts picking up a can (Figure 4(a)) and the two TurtleBots navigate to different rooms (Figure 4(b)). Then Leonardo successfully receives a drink from the PR2 (Figure 4(c)) and starts the delivery phase by going to room 2 (Figure 4(e)). There is no drink order in room 2, so Leonardo continues to room 1 and successfully delivers the drink to a thirsty graduate student (Figure 4(f)). While Leonardo is served by the PR2, Raphael goes to the bar and observes the PR2 is busy (Figure 4(d)). Since the PR2 is busy, Raphael continues the ordering phase by navigating to room 1 (Figure 4(e)).

We also tested a 5-node controller on the robots. The solutions for the 1-node and 5-node cases deviate since the 5-node controller can keep track of more information. For example, the waiter can choose a different room based on information such as the other rooms the waiter has been in previously or whether orders have been placed in the other rooms recently. These solutions quickly become complex and non-obvious to specify by hand. Both controller sizes allow for high-quality solutions, but having 5 nodes permits improved cooperation. For instance, the 5-node solutions were able to select different rooms to visit after receiving a drink from the PR2 based on memory of orders in those locations as well as observing the other waiter. These improvements are seen in the simulator, where solution value (over 50 macro-action steps, or approximately 1000 low-level steps) was 1231 (an average of 13.98 drinks delivered) for the 1-node controller and 1296 (14.56 drinks delivered) for the 5-node controller. For comparison, a hand-coded controller that assigns one robot (Leonardo) to room 3 (since it is farthest from the kitchen) and one robot (Raphael) to rooms 1 and 2 produces a solution with value 917 (11.13 drinks delivered).


Fig. 3. Controller phases for each waiter: (a) bar phase while located in the bar; (b) delivery phase while holding a drink; (c) ordering phase while not holding a drink.

Fig. 4. Images from the bartender and waiter experiments: (a) PR2 picking up a drink; (b) TurtleBots go to first rooms; (c) Leonardo sees the PR2 ready and gets a drink; (d) Raphael sees the PR2 serving Leonardo; (e) TurtleBots go to rooms 1 and 2; (f) Leonardo delivers to room 1.

These results demonstrate that the MDHS planner is able to effectively generate a solution to a cooperative multi-robot problem, given a declarative MacDec-POMDP specification. Note that the same planner solved all these experimental problems based on a high-level domain description.

VII. RELATED WORK

Other frameworks exist for multi-robot decision making. For instance, behavioral methods have been studied for performing task allocation over time with loosely-coupled [20] or tightly-coupled [25] tasks. These are heuristic in nature and make strong assumptions about the type of tasks that will be completed. Market-based approaches use traded value

to establish an optimization framework for task allocation [12, 14]. These approaches have been used to solve real multi-robot problems [16, 10], but are largely aimed at tasks where the robots can communicate through a bidding mechanism.

One important related class of methods is based on linear temporal logic (LTL) [7, 17] to specify behavior for a robot; reactive controllers that are guaranteed to satisfy the resulting specification are then derived. These methods are appropriate when the world dynamics can be effectively described non-probabilistically and when there is a useful characterization of the robot's desired behavior in terms of a set of discrete constraints. When applied to multiple robots, it is necessary to give each robot its own behavior specification. By contrast, our approach (probabilistically) models the domain and allows the planner to automatically optimize the robots' behavior.

There has been less work on scaling Dec-POMDPs to real robotics scenarios. Emery-Montemerlo et al. [13] introduced a (cooperative) game-theoretic formalization of multi-robot systems which resulted in solving a Dec-POMDP. An approximate forward search algorithm was used to generate solutions, but because a (relatively) low-level Dec-POMDP was used, scalability was limited, and their system required synchronized execution by the robots. The introduction of MacDec-POMDP methods has largely eliminated these two concerns.

VIII. SUMMARY AND CONCLUSION

We introduced an extended MacDec-POMDP model for representing cooperative multi-robot systems under uncertainty using a high-level problem description. We also developed MDHS, a new MacDec-POMDP planning algorithm that searches over policies represented as finite-state controllers. In the bartender and waiters problem, MDHS was able to automatically generate controllers for a heterogeneous robot team that collectively maximized team utility, using only a high-level model of the task. For this problem, an accurate low-level simulator would have been hard to build, and generating a solution is beyond the reach of existing planners. MDHS is therefore a significant step forward in the development of general-purpose planners for cooperative multi-robot systems.

ACKNOWLEDGMENTS

Research supported by US Office of Naval Research under MURI program award #N000141110688, NSF award #1463945, and the ASD R&E under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.


REFERENCES

[1] Christopher Amato and Shlomo Zilberstein. Achieving goals in decentralized POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages 593–600, 2009.

[2] Christopher Amato, Daniel S. Bernstein, and Shlomo Zilberstein. Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Journal of Autonomous Agents and Multi-Agent Systems, 21(3):293–320, 2010.

[3] Christopher Amato, Girish Chowdhary, Alborz Geramifard, Nazim Kemal Ure, and Mykel J. Kochenderfer. Decentralized control of partially observable Markov decision processes. In Proceedings of the Fifty-Second IEEE Conference on Decision and Control, pages 2398–2405, 2013.

[4] Christopher Amato, George D. Konidaris, and Leslie P. Kaelbling. Planning with macro-actions in decentralized POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages 1273–1280, 2014.

[5] Christopher Amato, George D. Konidaris, Gabriel Cruz, Christopher A. Maynor, Jonathan P. How, and Leslie P. Kaelbling. Planning for decentralized control of multiple robots under uncertainty. In Proceedings of the International Conference on Robotics and Automation, 2015.

[6] Haoyu Bai, David Hsu, and Wee Sun Lee. Integrated perception and planning in the continuous space: A POMDP approach. International Journal of Robotics Research, 33:1288–1302, 2013.

[7] Calin Belta, Antonio Bicchi, Magnus Egerstedt, Emilio Frazzoli, Eric Klavins, and George J. Pappas. Symbolic planning and control of robot motion. IEEE Robotics & Automation Magazine, 14(1):61–70, 2007.

[8] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

[9] Daniel S. Bernstein, Christopher Amato, Eric A. Hansen, and Shlomo Zilberstein. Policy iteration for decentralized control of Markov decision processes. Journal of Artificial Intelligence Research, 34:89–132, 2009.

[10] Jesus Capitan, Matthijs T. J. Spaan, Luis Merino, and Aníbal Ollero. Decentralized multi-robot cooperation with auctioned POMDPs. International Journal of Robotics Research, 32(6):650–671, 2013.

[11] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the National Conference on Artificial Intelligence, 1994.

[12] M. Bernardine Dias and Anthony Stentz. A comparative study between centralized, market-based, and behavioral multirobot coordination approaches. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 2279–2284, October 2003.

[13] Rosemary Emery-Montemerlo, Geoff Gordon, Jeff Schneider, and Sebastian Thrun. Game theoretic control for robot teams. In Proceedings of the International Conference on Robotics and Automation, pages 1163–1169, April 2005.

[14] Brian P. Gerkey and Maja J. Mataric. A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23(9):939–954, 2004.

[15] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:1–45, 1998.

[16] Nidhi Kalra, David Ferguson, and Anthony Stentz. Hoplites: A market-based framework for planned tight coordination in multirobot teams. In Proceedings of the International Conference on Robotics and Automation, pages 1170–1177, April 2005.

[17] Savvas G. Loizou and Kostas J. Kyriakopoulos. Automatic synthesis of multi-agent motion tasks based on LTL specifications. In Proceedings of the Forty-Third IEEE Conference on Decision and Control, volume 1, pages 153–158. IEEE, 2004.

[18] Frans A. Oliehoek. Decentralized POMDPs. In Marco Wiering and Martijn van Otterlo, editors, Reinforcement Learning: State of the Art, volume 12 of Adaptation, Learning, and Optimization, pages 471–503. Springer Berlin Heidelberg, 2012.

[19] Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Christopher Amato, and Jonathan P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Proceedings of the International Conference on Robotics and Automation, 2015.

[20] Lynne E. Parker. ALLIANCE: An architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation, 14(2):220–240, 1998.

[21] Pascal Poupart and Craig Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16, pages 823–830, 2003.

[22] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.

[23] Sven Seuken and Shlomo Zilberstein. Formal models and algorithms for decentralized control of multiple agents. Journal of Autonomous Agents and Multi-Agent Systems, 17(2):190–250, 2008.

[24] M. Stilman and J. Kuffner. Navigation among movable obstacles: Real-time reasoning in complex environments. International Journal of Humanoid Robotics, 2(4):479–504, 2005.

[25] Ashley W. Stroupe, Ramprasad Ravichandran, and Tucker Balch. Value-based action selection for exploration and dynamic target observation with robot teams. In Proceedings of the International Conference on Robotics and Automation, volume 4, pages 4190–4197. IEEE, 2004.

[26] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[27] Daniel Szer and Francois Charpillet. An optimal best-first search algorithm for solving infinite horizon DEC-POMDPs. In Proceedings of the European Conference on Machine Learning, pages 389–399, 2005.

[28] Feng Wu, Shlomo Zilberstein, and Xiaoping Chen. Point-based policy generation for decentralized POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages 1307–1314, 2010.