1
Execution-Time Communication Decisions for Coordination of Multi-Agent Teams
Maayan Roth, Thesis Defense
Carnegie Mellon University
September 4, 2007
2
Cooperative Multi-Agent Teams Operating Under Uncertainty and Partial Observability
Cooperative teams
– Agents work together to achieve a team reward
– No individual motivations
Uncertainty
– Actions have stochastic outcomes
Partial observability
– Agents don't always know the world state
3
Coordinating When Communication is a Limited Resource
Tight coordination
– One agent's best action choice depends on the action choices of its teammates
– We wish to Avoid Coordination Errors
Limited communication
– Communication costs
– Limited bandwidth
4
Thesis Question
“How can we effectively use communication to enable the coordination of cooperative multi-agent teams making sequential decisions under uncertainty and partial observability?”
5
Multi-Agent Sequential Decision Making
6
Thesis Statement
“Reasoning about communication decisions at execution-time provides a more tractable means for coordinating
teams of agents operating under uncertainty and partial observability.”
7
Thesis Contributions
Algorithms that:
– Guarantee agents will Avoid Coordination Errors (ACE) during decentralized execution
– Answer the questions of when and what agents should communicate
8
Outline
Dec-POMDP model
– Impact of communication on complexity
Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
– ACE-PJB-Comm: When should agents communicate?
– Selective ACE-PJB-Comm: What should agents communicate?
Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
Future directions
9
Dec-POMDP Model
Decentralized Partially Observable Markov Decision Process
– Multi-agent extension of the single-agent POMDP model
– Sequential decision-making in domains with:
Uncertainty in the outcomes of actions
Partial observability - uncertainty about the world state
10
Dec-POMDP Model
M = <m, S, {Ai}i≤m, T, {Ωi}i≤m, O, R>
– m is the number of agents
– S is the set of possible world states
– {Ai}i≤m is the set of joint actions, <a1, …, am> where ai ∈ Ai
– T defines transition probabilities over joint actions
– {Ωi}i≤m is the set of joint observations, <ω1, …, ωm> where ωi ∈ Ωi
– O defines observation probabilities over joint actions and joint observations
– R is the team reward function
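The tuple above can be made concrete with a small Python sketch, instantiated for the two-agent tiger domain introduced later in the talk. The encoding and names are illustrative assumptions, not the thesis code; reward values follow the appendix payoff table.

```python
from itertools import product

# Illustrative encoding of the two-agent tiger Dec-POMDP
STATES = ["SL", "SR"]                    # tiger behind left / right door
ACTIONS = ["OpenL", "OpenR", "Listen"]   # individual action set A_i
OBSERVATIONS = ["HL", "HR"]              # individual observation set Omega_i

JOINT_ACTIONS = list(product(ACTIONS, repeat=2))
JOINT_OBSERVATIONS = list(product(OBSERVATIONS, repeat=2))

def T(s, joint_a, s2):
    """Transition model: listening leaves the tiger in place; opening
    any door resets the problem with the tiger placed uniformly."""
    if joint_a == ("Listen", "Listen"):
        return 1.0 if s2 == s else 0.0
    return 0.5

def O(joint_a, s2, joint_o):
    """Observation model: while both agents listen, each independently
    hears the tiger's true side with probability 0.7."""
    if joint_a != ("Listen", "Listen"):
        return 1.0 / len(JOINT_OBSERVATIONS)
    correct = "HL" if s2 == "SL" else "HR"
    p = 1.0
    for o in joint_o:
        p *= 0.7 if o == correct else 0.3
    return p

def R(s, joint_a):
    """Team reward (listen/open mixes omitted here for brevity)."""
    table = {("OpenR", "OpenR"): {"SL": 20, "SR": -50},
             ("OpenL", "OpenL"): {"SL": -50, "SR": 20},
             ("OpenR", "OpenL"): {"SL": -100, "SR": -100},
             ("OpenL", "OpenR"): {"SL": -100, "SR": -100},
             ("Listen", "Listen"): {"SL": -2, "SR": -2}}
    return table[joint_a][s]
```

Note that T and O are defined over joint actions and joint observations, which is exactly why each agent must reason about its teammate's choices.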
11
Dec-POMDP Complexity
Goal - compute a policy which, for each agent, maps its local observation history to an action
For all m ≥ 2, a Dec-POMDP with m agents is NEXP-complete [Bernstein et al., 2000]
– Agents must reason about the possible actions and observations of their teammates
12
Impact of Communication on Complexity [Pynadath and Tambe, 2002]
If communication is free:
– the Dec-POMDP is reducible to a single-agent POMDP
– the optimal communication policy is to communicate at every time step
When communication has any cost, the Dec-POMDP is still intractable (NEXP-complete)
– Agents must reason about the value of information
13
Classifying Communication Heuristics
AND- vs. OR-communication [Emery-Montemerlo, 2005]
– AND-communication does not replace domain-level actions
– OR-communication does replace domain-level actions
Initiating communication [Xuan et al., 2001]
– Tell - Agent decides to tell local information to teammates
– Query - Agent asks a teammate for information
– Sync - All agents broadcast all information simultaneously
14
Classifying Communication Heuristics
Does the algorithm consider communication cost?
Is the algorithm applicable to:
– General Dec-POMDP domains
– General Dec-MDP domains
– Restricted domains
Are the agents guaranteed to Avoid Coordination Errors?
15
Related Work
[Comparison table marking which properties each approach provides - columns: Unrestricted, Cost, Sync, Query, Tell, OR, AND, ACE; rows: [Xuan and Lesser, 2002], Communicative JESP [Nair et al., 2003], BaGA-Comm [Emery-Montemerlo, 2005], ACE-PJB-Comm, Selective ACE-PJB-Comm, ACE-IFP.]
16
Overall Approach
Recall: if communication is free, a Dec-POMDP can be treated as a single-agent POMDP
1) At plan-time, pretend communication is free
- Generate a centralized policy for the team
2) At execution-time, use communication to enable decentralized execution of this policy while Avoiding Coordination Errors
17
Outline
Dec-POMDP, Dec-MDP models
– Impact of communication on complexity
Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
– ACE-PJB-Comm: When should agents communicate?
– Selective ACE-PJB-Comm: What should agents communicate?
Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
Future directions
18
Tiger Domain: (States, Actions)
Two-agent tiger problem [Nair et al., 2003]:
S: {SL, SR}
– The tiger is either behind the left door or behind the right door
Individual actions: ai ∈ {OpenL, OpenR, Listen}
– Each robot can open the left door, open the right door, or listen
19
Tiger Domain: (Observations)
Individual observations: ωi ∈ {HL, HR}
– Each robot can hear the tiger behind the left door or hear it behind the right door
Observations are noisy and independent.
20
Tiger Domain: (Reward)
Coordination problem - agents must act together for maximum reward
– Maximum reward (+20) when both agents open the door with the treasure
– Minimum reward (-100) when only one agent opens the door with the tiger
– Listen has a small cost (-1 per agent)
– Both agents opening the door with the tiger leads to a medium negative reward (-50)
21
Coordination Errors
[Observation history: HL, HL, HL, …]
a1 = OpenR
a2 = OpenL
Reward(<OpenR, OpenL>) = -100
Reward(<OpenL, OpenL>) ≥ -50
Agents Avoid Coordination Errors when each agent’s action is a best response to its teammates’ actions.
22
Avoid Coordination Errors by Reasoning Over Possible Joint Beliefs (ACE-PJB)
A centralized POMDP policy maps joint beliefs to joint actions
– Joint belief (bt) - distribution over world states
Individual agents can't compute the joint belief
– They don't know what their teammates have observed or what actions they selected
Simplifying assumption:
– What if agents knew the joint action at each timestep?
– Then agents would only have to reason about possible observations
– How can this be assured?
23
Ensuring Action Synchronization
Agents are only allowed to choose actions based on information known to all team members
At the start of execution, agents know:
– b0 - initial distribution over world states
– a0 - optimal joint action given b0, based on the centralized policy
At each timestep, each agent computes Lt, the distribution of possible joint beliefs:
– Lt = {<bt, pt, ωt>}
– ωt - observation history that led to bt
– pt - likelihood of observing ωt
24
Possible Joint Beliefs
a = <Listen, Listen>
HL
HL
How should agents select actions over joint beliefs?
),|( 111 −−− ×= ttttt abPpp
b: P(SL) = 0.5
p: p(b) = 1.0L0
b: P(SL) = 0.8
p: p(b) = 0.29L1
b: P(SL) = 0.5
p: p(b) = 0.21
b: P(SL) = 0.5
p: p(b) = 0.21
b: P(SL) = 0.2
p: p(b) = 0.29
HL,HL
HL,HR
HR
,HL
HR,HR
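A minimal sketch of this leaf expansion for the tiger domain, assuming the 0.7-accuracy hearing model from the appendix slides (function names are illustrative):

```python
# One step of joint-belief expansion under a = <Listen, Listen>, using
# p_t = p_{t-1} * P(omega_t | b_{t-1}, a_{t-1}).
# Beliefs are summarized by b = P(SL).

def obs_prob(joint_obs, state):
    """P(joint observation | state) while both agents listen."""
    correct = "HL" if state == "SL" else "HR"
    p = 1.0
    for o in joint_obs:
        p *= 0.7 if o == correct else 0.3
    return p

def expand(leaf, joint_obs):
    """Child of one leaf <b, p, history> for one joint observation."""
    b, p, hist = leaf
    # P(omega | b, a): marginalize the observation model over the belief
    p_omega = b * obs_prob(joint_obs, "SL") + (1 - b) * obs_prob(joint_obs, "SR")
    # Bayes update of P(SL)
    b_new = b * obs_prob(joint_obs, "SL") / p_omega
    return (b_new, p * p_omega, hist + [joint_obs])

root = (0.5, 1.0, [])   # L0: uniform prior, probability 1
L1 = [expand(root, w) for w in
      [("HL", "HL"), ("HL", "HR"), ("HR", "HL"), ("HR", "HR")]]
```

The resulting leaf probabilities are 0.29, 0.21, 0.21, 0.29, matching the tree on this slide.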
25
Q-POMDP Heuristic
Select the joint action that maximizes expected reward over possible joint beliefs
Q-MDP [Littman et al., 1995]
– approximate solution to a large POMDP using the underlying MDP:
QMDP(b) = argmaxa Σs∈S b(s) × Va(s)
Q-POMDP [Roth et al., 2005]
– approximate solution to a Dec-POMDP using the underlying single-agent POMDP:
QPOMDP(Lt) = argmaxa ΣLti∈Lt p(Lti) × Q(b(Lti), a)
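The Q-POMDP rule can be sketched as below. The centralized Q-function is replaced here by a myopic one-step stand-in over b = P(SL) built from the tiger reward table, so this only illustrates the argmax-over-leaves structure, not the thesis implementation:

```python
# Q-POMDP sketch: pick the joint action maximizing expected value over
# all possible joint beliefs (the leaves of the belief tree).

def q(b, joint_a):
    """Myopic stand-in for the centralized POMDP Q-function."""
    if joint_a == ("OpenR", "OpenR"):   # +20 in SL, -50 in SR
        return 70 * b - 50
    if joint_a == ("OpenL", "OpenL"):   # -50 in SL, +20 in SR
        return 20 - 70 * b
    return -2.0                          # <Listen, Listen>

JOINT = [("OpenR", "OpenR"), ("OpenL", "OpenL"), ("Listen", "Listen")]

def q_pomdp(leaves):
    """leaves: list of (belief, probability) pairs."""
    return max(JOINT, key=lambda a: sum(p * q(b, a) for b, p in leaves))

# With the symmetric leaf set from the belief tree, Listen wins:
choice = q_pomdp([(0.8, 0.29), (0.5, 0.21), (0.5, 0.21), (0.2, 0.29)])
```

Because the leaf probabilities are symmetric, the expected belief is 0.5 and both open actions look bad in expectation, which is exactly the conservatism noted on the next slide.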
26
Q-POMDP Heuristic
QPOMDP(Lt) = argmaxa ΣLti∈Lt p(Lti) × Q(b(Lti), a)
Choose the joint action by computing expected reward over all leaves
Agents will independently select the same joint action, guaranteeing that they avoid coordination errors…
but the action choice is very conservative (always <Listen, Listen>)
ACE-PJB-Comm: Communication adds local observations to the joint belief
[Belief tree repeated from the previous slide: leaves <HL,HL> (P(SL) = 0.8, p = 0.29), <HL,HR> (P(SL) = 0.5, p = 0.21), <HR,HL> (P(SL) = 0.5, p = 0.21), <HR,HR> (P(SL) = 0.2, p = 0.29)]
27
ACE-PJB-Comm Example
[Tree: root {} expands into L1 leaves <HL,HL>, <HL,HR>, <HR,HL>, <HR,HR>; agent 1 observes HL]
aNC = Q-POMDP(L1) = <Listen, Listen>
L* = circled nodes (the leaves consistent with agent 1's observation)
aC = Q-POMDP(L*) = <Listen, Listen>
Don't communicate
28
ACE-PJB-Comm Example
a = <Listen, Listen>, agent 1 has observed {HL, HL}
[Tree: the four L1 leaves each expand into four L2 leaves, one per joint observation pair]
aNC = Q-POMDP(L2) = <Listen, Listen>
L* = circled nodes (the leaves consistent with agent 1's observations)
aC = Q-POMDP(L*) = <OpenR, OpenR>
V(aC) - V(aNC) > ε
Agent 1 communicates
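The decision in this example can be sketched as follows, again with a toy myopic stand-in for the Q-function over b = P(SL) (the real algorithm evaluates the centralized POMDP policy); the leaf sets and ε are illustrative:

```python
# ACE-PJB-Comm test: a_NC is the joint action chosen over all possible
# joint beliefs L; a_C is the action over L*, the leaves consistent with
# this agent's own observations. Communicate when the expected gain,
# measured over L*, exceeds epsilon.

def q(b, a):
    """Toy stand-in Q: 'open' the right door vs. 'listen'."""
    return {"open": 70 * b - 50, "listen": -2.0}[a]

def value(leaves, a):
    return sum(p * q(b, a) for b, p in leaves)

def ace_pjb_comm(L, L_star, joint_actions, eps):
    a_nc = max(joint_actions, key=lambda a: value(L, a))
    a_c = max(joint_actions, key=lambda a: value(L_star, a))
    communicate = value(L_star, a_c) - value(L_star, a_nc) > eps
    return communicate, a_nc, a_c

# Over all leaves the team would keep listening, but the leaves consistent
# with agent 1's two HL observations favor opening:
comm, a_nc, a_c = ace_pjb_comm([(0.8, 0.29), (0.5, 0.42), (0.2, 0.29)],
                               [(0.93, 1.0)], ["open", "listen"], 1.0)
```

The rule only compares actions the whole team could coordinate on, so an agent never unilaterally deviates; it either communicates or follows aNC.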
29
ACE-PJB-Comm Example
a = <Listen, Listen>, agent 1 communicates <HL, HL>
[Tree from the previous slide, pruned to the L2 leaves consistent with agent 1's communicated observations]
Q-POMDP(L2) = <OpenR, OpenR>
Agents open the right door!
30
ACE-PJB-Comm Results
20,000 trials in the 2-Agent Tiger Domain
– 6 timesteps per trial
Agents communicate 49.7% fewer observations using ACE-PJB-Comm, and send 93.3% fewer messages
The difference in expected reward arises because ACE-PJB-Comm is slightly pessimistic about the outcome of communication
                    Mean Reward    Mean Messages   Mean Observations
Full Communication  7.14 (27.88)   10.0 (0.0)      10.0 (0.0)
ACE-PJB-Comm        5.31 (19.79)   1.77 (0.79)     5.13 (2.38)
31
Additional Challenges
The number of possible joint beliefs grows exponentially
– Use a particle filter to model the distribution of possible joint beliefs
ACE-PJB-Comm answers the question of when agents should communicate
– It doesn't deal with what to communicate
– Agents communicate all observations that they haven't previously communicated
32
Selective ACE-PJB-Comm[Roth et al., 2006]
Answers what agents should communicate
Chooses the most valuable subset of observations
Hill-climbing heuristic to choose observations that “push” the team towards aC
– aC - the joint action that would be chosen if the agent communicated all observations
– See details in the thesis document
33
Selective ACE-PJB-Comm Results
2-Agent Tiger domain:
Communicates 28.7% fewer observations
Same expected reward
Slightly more messages

                         Mean Reward    Mean Messages   Mean Observations
ACE-PJB-Comm             5.30 (19.79)   1.77 (0.79)     5.13 (2.38)
Selective ACE-PJB-Comm   5.31 (19.74)   1.81 (0.92)     3.66 (1.67)
34
Outline
Dec-POMDP, Dec-MDP models
– Impact of communication on complexity
Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
– ACE-PJB-Comm: When should agents communicate?
– Selective ACE-PJB-Comm: What should agents communicate?
Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
Future directions
35
Dec-MDP
State is collectively observable
– One agent can't identify the full state on its own
– The union of the team's observations uniquely identifies the state
The underlying problem is an MDP, not a POMDP
A Dec-MDP has the same complexity as a Dec-POMDP
– NEXP-complete
36
Acting Independently
ACE-PJB requires agents to know joint action at every timestep
Claim: In many multi-agent domains, agents can act independently for long periods of time, only needing to coordinate infrequently
37
Meeting-Under-Uncertainty Domain
Agents must move to the goal location and signal simultaneously
Reward:
+20 - Both agents signal at the goal
-50 - Both agents signal at another location
-100 - Only one agent signals
-1 - Agents move north, south, east, west, or stop
38
Factored Representations
Represent relationships among state variables instead of relationships among states
S = <X0, Y0, X1, Y1>
Each agent observes its own position
39
Factored Representations
A Dynamic Decision Network models state variables over time [diagram shown for at = <East, *>]
40
Tree-structured Policies
A decision tree that branches over state variables
A tree-structured joint policy has joint actions at the leaves
41
Approach [Roth et al., 2007]
Generate tree-structured joint policies for underlying centralized MDP
Use this joint policy to generate a tree-structured individual policy for each agent*
Execute individual policies
* See details in thesis document
42
Context-specific Independence
Claim: In many multi-agent domains, one agent's individual policy will have large sections where it is independent of the variables that its teammates observe.
43
Individual Policies
One agent’s individual policy may depend on state features it doesn’t observe
44
Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
The robot traverses its policy tree according to its observations
– If it reaches a leaf, its action is independent of its teammates' observations
– If it reaches a state variable that it does not observe directly, it must ask a teammate for the current value of that variable
The amount of communication needed to execute a particular policy corresponds to the amount of context-specific independence in the domain
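This traversal can be sketched with a simple dictionary-based policy tree; the class and function names are hypothetical, and a real agent would also broadcast values its teammates ask for:

```python
# ACE-IFP execution sketch: walk an individual factored policy tree.
# At a locally observed variable, branch on the agent's own value; at a
# teammate's variable, query the teammate (one message) and continue.

class Node:
    def __init__(self, var=None, children=None, action=None):
        self.var = var            # state variable to branch on (internal)
        self.children = children  # dict: variable value -> subtree
        self.action = action      # individual action (leaf)

def execute_step(root, local_obs, query_teammate):
    """Return (action, variables asked about) for one decision."""
    asked, node = [], root
    while node.action is None:
        if node.var in local_obs:
            value = local_obs[node.var]
        else:
            value = query_teammate(node.var)   # communication happens here
            asked.append(node.var)
        node = node.children[value]
    return node.action, asked

# Toy tree: the agent only needs teammate variable Y1 in one context
tree = Node(var="X0", children={
    0: Node(action="East"),                         # independent branch
    1: Node(var="Y1", children={0: Node(action="Stop"),
                                1: Node(action="Signal")})})
```

When the X0 = 0 branch is taken, the agent acts with no communication at all; messages are only generated in contexts where the policy genuinely depends on a teammate's variable.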
45
Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
Benefits:
– Agents can act independently, without reasoning about the possible observations or actions of their teammates
– The policy directs agents about when, what, and with whom to communicate
Drawback:
– In domains with little independence, agents may need to communicate a lot
46
Experimental Results
In the 3x3 domain, executing the factored policy required less than half as many messages as full communication, with the same reward
Communication usage decreases relative to full communication as the domain size increases

                     Mean Reward   Mean Messages Sent   Mean Variables Sent
Full Communication   17.484        7.032                14.064
Factored Execution   17.484        3.323                6.646
47
Factored Dec-POMDPs
[Hansen and Feng, 2000] looked at factored POMDPs
– ADD representations of the transition, observation, and reward functions
– The policy is a finite-state controller
Nodes are actions
Transitions depend on conjunctions of state-variable assignments
To extend this to Dec-POMDPs, make each individual policy a finite-state controller over individual actions
– Somehow combine nodes with the same action
– Communicate to enable transitions between action nodes
48
Future Directions
Considering communication cost in ACE-IFP
– All children of a particular variable may have similar values
– Worst-case cost of mis-coordination?
– Modeling teammate variables requires reasoning about possible teammate actions
Extending factoring to Dec-POMDPs
49
Future Directions
Knowledge persistence
– Modeling teammates' variables
– Can we identify “necessary conditions”?
e.g. “Tell me when you reach the goal” instead of repeatedly asking “Are you here yet?”
50
Contributions
Decentralized execution of centralized policies
– Guarantee that agents will Avoid Coordination Errors
– Make effective use of limited communication resources
– When should agents communicate?
– What should agents communicate?
Demonstrated significant communication savings in experimental domains
51
Contributions
[Comparison table marking which properties each contributed algorithm provides - columns: Unrestricted, Cost, Sync, Query, Tell, OR, AND, ACE, When?, What?, Who?; rows: ACE-PJB-Comm, Selective ACE-PJB-Comm, ACE-IFP.]
52
Thank You!
Advisors: Reid Simmons, Manuela Veloso
Committee: Carlos Guestrin, Jeff Schneider, Milind Tambe
RI Folks: Suzanne, Alik, Damion, Doug, Drew, Frank, Harini, Jeremy, Jonathan, Kristen, Rachel (and many others!)
Aba, Ema, Nitzan, Yoel
53
References
Roth, M., Simmons, R., and Veloso, M. “Reasoning About Joint Beliefs for Execution-Time Communication Decisions” In AAMAS, 2005
Roth, M., Simmons, R., and Veloso, M. “What to Communicate? Execution-Time Decisions in Multi-agent POMDPs” In DARS, 2006
Roth, M., Simmons, R., and Veloso, M. “Exploiting Factored Representations for Decentralized Execution in Multi-agent Teams” In AAMAS, 2007
Bernstein, D., Zilberstein, S., and Immerman, N. “The Complexity of Decentralized Control of Markov Decision Processes” In UAI, 2000
Pynadath, D. and Tambe, M. “The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models” In JAIR, 2002
Becker, R., Zilberstein, S., Lesser, V., and Goldman, C. “Transition-independent Decentralized Markov Decision Processes” In AAMAS, 2003
Nair, R., Roth, M., Yokoo, M., and Tambe, M. “Communication for Improving Policy Computation in Distributed POMDPs” In IJCAI, 2003
54
Tiger Domain Details
Rewards:
Action / State     SL     SR
<OpenR, OpenR>     +20    -50
<OpenL, OpenL>     -50    +20
<OpenR, OpenL>     -100   -100
<OpenL, OpenR>     -100   -100
<Listen, Listen>   -2     -2
<Listen, OpenR>    +9     -101
<Listen, OpenL>    -101   +9
<OpenR, Listen>    +9     -101
<OpenL, Listen>    -101   +9

Transitions (listening leaves the state unchanged; opening any door resets it uniformly):
Action             SL → SL   SL → SR   SR → SL   SR → SR
<OpenR, *>         0.5       0.5       0.5       0.5
<OpenL, *>         0.5       0.5       0.5       0.5
<*, OpenR>         0.5       0.5       0.5       0.5
<*, OpenL>         0.5       0.5       0.5       0.5
<Listen, Listen>   1.0       0.0       0.0       1.0

Observations:
Action             State   HL    HR
<Listen, Listen>   SL      0.7   0.3
<Listen, Listen>   SR      0.3   0.7
<OpenR, *>         *       0.5   0.5
<OpenL, *>         *       0.5   0.5
<*, OpenR>         *       0.5   0.5
<*, OpenL>         *       0.5   0.5
55
Particle filter representation
Each particle is a possible joint belief
Each agent maintains two particle filters:
– Ljoint: possible joint team beliefs
– Lown: possible joint beliefs that are consistent with the local observation history
Compare the action selected by Q-POMDP over Ljoint to the action selected over Lown, and communicate as needed
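The consistency filtering step can be sketched as below, assuming each particle carries the joint-observation history that produced it (the representation is illustrative):

```python
# L_joint holds (belief, probability, joint-observation-history) particles
# for the whole team; L_own keeps only those whose history agrees with this
# agent's own observations, renormalized. Comparing the Q-POMDP actions
# selected over the two sets drives the communication decision.

def filter_own(L_joint, agent_idx, own_history):
    """Keep particles consistent with this agent's observation history."""
    kept = [(b, p, h) for (b, p, h) in L_joint
            if [w[agent_idx] for w in h] == own_history]
    z = sum(p for _, p, _ in kept)   # renormalize surviving probability
    return [(b, p / z, h) for (b, p, h) in kept]

# Toy example from the tiger belief tree: agent 0 actually heard HL
L_joint = [(0.8, 0.29, [("HL", "HL")]), (0.5, 0.21, [("HL", "HR")]),
           (0.5, 0.21, [("HR", "HL")]), (0.2, 0.29, [("HR", "HR")])]
L_own = filter_own(L_joint, 0, ["HL"])
```

Only the two leaves in which agent 0 heard HL survive, with probabilities rescaled to sum to one.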
56
Related Work: Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
Dec-MDP - collective observability
Transition independence:
– Local state transitions: each agent observes its local state, and individual actions only affect local state transitions
– The team is connected only through the joint reward
Coverage set algorithm - finds optimal policies quickly in experimental domains
No communication
57
Related Work: COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
Add a SYNC action to the domain
– If one agent chooses SYNC, all other agents SYNC
– At SYNC, agents send their entire observation histories since the last SYNC
SYNC brings agents to a synchronized belief over world states
Policies are indexed by the root synchronized belief and the observation history since the last SYNC
[Belief tree: t=0 (SL 0.5, SR 0.5), a = <Listen, Listen>, branching on ω = HL or HR into beliefs over (state, history) pairs; a = SYNC collapses the histories back to a synchronized belief at t=2, e.g. (SL 0.97, SR 0.03) or (SL 0.5, SR 0.5)]
“At most K” heuristic - there must be a SYNC within at most K timesteps
58
Related Work: “No news is good news” [Xuan, Lesser, Zilberstein, 2000]
Applies to transition-independent Dec-MDPs
Agents form a joint plan
– “plan”: the exact path to be followed to accomplish the goal
Communicate when a deviation from the plan occurs
– an agent sees it has slipped from the optimal path
– it communicates the need for re-planning
59
Related Work: BaGA-Comm [Emery-Montemerlo, 2005]
Each agent has a type
– Its observation and action history
Agents model the distribution of possible joint types
– Choose actions by finding the joint type closest to their own local type
– Allows coordination errors
Communicate if the gain in expected reward is greater than the cost of communication
60
Colorado/Wyoming Domain
Robots must meet in the capital, but do not know if they are in Colorado or Wyoming
Robots receive a positive reward of +20 only if they SIGNAL simultaneously from the correct goal location
To simplify the problem, each robot knows both its own and its teammate's position
[Maps of Colorado and Wyoming, marking each capital]
61
Colorado/Wyoming Domain
Noisy observations - mountain (Mt), plain (Pl), Pikes Peak (PP), Old Faithful (Of)
Communication can help the team reach the goal more efficiently
[Maps marking the possible goal locations, Pikes Peak, and Old Faithful]

State   Mt    Pl    PP     Of
C       0.7   0.1   0.19   0.01
W       0.1   0.7   0.01   0.19
62
Build-Message: What to Communicate
First, determine if communication is necessary
– Calculate aC using ACE-PJB-Comm
– If aC = aNC, do not communicate
Greedily build the message
– “Hill-climb” towards aC, away from aNC
– Choose the single observation that most increases the difference between the Q-POMDP values of aC and aNC
[Observation history: Mt, Pl, Mt, Pike]
63
Build-Message: What to Communicate
Is communication necessary?
[Observation history: Mt, Pl, Mt, Pike]
aNC = [east, south]
aC = [east, west]
aC ≠ aNC, so the agent should communicate
64
Build-Message: What to Communicate
aC = [east, west] - “toward Denver”
[Figure: histograms over P(State = Colorado) comparing the distribution of possible joint beliefs if the agent communicates its entire observation history with the distributions after communicating each single observation (Mt, Pl, Pike)]
65
Build-Message: What to Communicate
[Observation history: Mt, Pl, Mt, Pike]
aC = [east, west] - “toward Denver”
- PIKE is the single best observation
- In this case, PIKE is sufficient to change the joint action to aC, so the agent communicates only one observation
m = {Pike}
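The greedy loop can be sketched with the Q-POMDP computations abstracted as callables; the observation weights and the flip test below are made-up stand-ins chosen so that the outcome mirrors this example:

```python
# Build-Message sketch: repeatedly add the single observation that most
# increases the Q-POMDP gap between a_C and a_NC, stopping as soon as the
# communicated subset is enough to flip the joint action to a_C.

def build_message(observations, gap, flips_action):
    """gap(msg): Q-POMDP value gap if msg were sent;
    flips_action(msg): does msg already shift the team to a_C?"""
    msg, rest = [], list(observations)
    while rest and not flips_action(msg):
        best = max(rest, key=lambda o: gap(msg + [o]))
        msg.append(best)
        rest.remove(best)
    return msg

# Toy scores: Pike is by far the most informative observation and alone
# suffices to flip the action (weights are illustrative, not computed)
weights = {"Mt": 1.0, "Pl": -1.0, "Pike": 5.0}
gap = lambda m: sum(weights[o] for o in m)
message = build_message(["Mt", "Pl", "Mt", "Pike"], gap, lambda m: gap(m) >= 4.0)
```

As on the slide, the greedy choice selects Pike first and then stops, so only one observation is sent.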
66
Context-specific Independence
A variable may be independent of a parent variable in some contexts but not others
– e.g. X2 depends on X3 when X1 has value 1, but is independent of X3 otherwise
Claim - Many multi-agent domains exhibit a large amount of context-specific independence
67
Constructing Individual Factored Policies
[Boutilier et al., 2000] defined Merge and Simplify operations for policy trees
We want to construct trees that maximize context-specific independence
– This depends on the variable ordering in the policy
– We define Intersect and Independent operations
68
Intersect
Find the intersection of the action sets of a node's children
1. If all children are leaves, and their action sets have a non-empty intersection, replace the node with the intersection
2. If all but one child is a leaf, and all the actions in the non-leaf child's subtree are included in the leaf children's intersection, replace the node with the non-leaf child
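Rule 1 can be sketched as a recursive pass over a policy tree; the node representation (leaves carrying sets of equally-good actions) is an assumption for illustration:

```python
# Intersect sketch (rule 1 only): if every child of a node is a leaf and
# the children's action sets share a non-empty intersection, the branch
# is irrelevant -- replace the node with a leaf labeled the intersection.

class PNode:
    def __init__(self, var=None, children=None, actions=None):
        self.var = var
        self.children = children or {}
        self.actions = actions      # non-None for leaves: a set of actions

def intersect(node):
    if node.actions is not None:        # already a leaf
        return node
    node.children = {v: intersect(c) for v, c in node.children.items()}
    kids = list(node.children.values())
    if all(k.actions is not None for k in kids):
        common = set.intersection(*(k.actions for k in kids))
        if common:                       # branch on node.var is unnecessary
            return PNode(actions=common)
    return node

# Both branches agree that action "b" is acceptable, so the test on X1
# can be dropped entirely:
collapsed = intersect(PNode(var="X1",
                            children={0: PNode(actions={"a", "b"}),
                                      1: PNode(actions={"b", "c"})}))
```

Collapsing such nodes is what removes teammate variables from an individual policy, which directly reduces how often the agent must query during ACE-IFP execution.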
69
Independent
An individual action is Independent in a particular leaf of a policy tree if it is optimal when paired with any action its teammate could perform at that leaf
[Figure: one leaf where action a is independent for agent 1, and one where agent 1 has no independent actions]
70
Generate Individual Policies
Generate a tree-structured joint policy
For each agent:
– Reorder variables in the joint policy so that variables local to this agent are near the root
– For each leaf in the policy, find the Independent actions
– Break ties among the remaining joint actions
– Convert joint actions to individual actions
– Intersect and Simplify
71
Why Break Ties?
Ensure agents select the same optimal joint action to prevent mis-coordination