1
Execution-Time Communication Decisions for Coordination of Multi-Agent Teams
Maayan Roth, Thesis Defense
Carnegie Mellon University
September 4, 2007
2
Cooperative Multi-Agent Teams Operating Under Uncertainty and Partial Observability
Cooperative teams
– Agents work together to achieve a team reward
– No individual motivations
Uncertainty
– Actions have stochastic outcomes
Partial observability
– Agents don't always know the world state
3
Coordinating When Communication is a Limited Resource
Tight coordination
– One agent's best action choice depends on the action choices of its teammates
– We wish to Avoid Coordination Errors
Limited communication
– Communication costs
– Limited bandwidth
4
Thesis Question
“How can we effectively use communication to enable the coordination of cooperative multi-agent teams making sequential decisions under uncertainty and partial observability?”
5
Multi-Agent Sequential Decision Making
6
Thesis Statement
“Reasoning about communication decisions at execution-time provides a more tractable means for coordinating
teams of agents operating under uncertainty and partial observability.”
7
Thesis Contributions
Algorithms that:
– Guarantee agents will Avoid Coordination Errors (ACE) during decentralized execution
– Answer the questions of when and what agents should communicate
8
Outline
Dec-POMDP model
– Impact of communication on complexity
Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
– ACE-PJB-Comm: When should agents communicate?
– Selective ACE-PJB-Comm: What should agents communicate?
Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
Future directions
9
Dec-POMDP Model
Decentralized Partially Observable Markov Decision Process
– Multi-agent extension of the single-agent POMDP model
– Sequential decision-making in domains with:
Uncertainty in the outcomes of actions
Partial observability - uncertainty about the world state
10
Dec-POMDP Model
M = <m, S, {Ai}i≤m, T, {Ωi}i≤m, O, R>
– m is the number of agents
– S is the set of possible world states
– {Ai}i≤m is the set of joint actions, <a1, …, am> where ai ∈ Ai
– T defines transition probabilities over joint actions
– {Ωi}i≤m is the set of joint observations, <ω1, …, ωm> where ωi ∈ Ωi
– O defines observation probabilities over joint actions and joint observations
– R is the team reward function
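The tuple above can be made concrete with a small Python sketch, instantiated for the two-agent tiger domain introduced later in the talk. The encoding and names are illustrative assumptions, not the thesis code; reward values follow the appendix payoff table.

```python
from itertools import product

# Illustrative encoding of the two-agent tiger Dec-POMDP
STATES = ["SL", "SR"]                    # tiger behind left / right door
ACTIONS = ["OpenL", "OpenR", "Listen"]   # individual action set A_i
OBSERVATIONS = ["HL", "HR"]              # individual observation set Omega_i

JOINT_ACTIONS = list(product(ACTIONS, repeat=2))
JOINT_OBSERVATIONS = list(product(OBSERVATIONS, repeat=2))

def T(s, joint_a, s2):
    """Transition model: listening leaves the tiger in place; opening
    any door resets the problem with the tiger placed uniformly."""
    if joint_a == ("Listen", "Listen"):
        return 1.0 if s2 == s else 0.0
    return 0.5

def O(joint_a, s2, joint_o):
    """Observation model: while both agents listen, each independently
    hears the tiger's true side with probability 0.7."""
    if joint_a != ("Listen", "Listen"):
        return 1.0 / len(JOINT_OBSERVATIONS)
    correct = "HL" if s2 == "SL" else "HR"
    p = 1.0
    for o in joint_o:
        p *= 0.7 if o == correct else 0.3
    return p

def R(s, joint_a):
    """Team reward (listen/open mixes omitted here for brevity)."""
    table = {("OpenR", "OpenR"): {"SL": 20, "SR": -50},
             ("OpenL", "OpenL"): {"SL": -50, "SR": 20},
             ("OpenR", "OpenL"): {"SL": -100, "SR": -100},
             ("OpenL", "OpenR"): {"SL": -100, "SR": -100},
             ("Listen", "Listen"): {"SL": -2, "SR": -2}}
    return table[joint_a][s]
```

Note that T and O are defined over joint actions and joint observations, which is exactly why each agent must reason about its teammate's choices.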
11
Dec-POMDP Complexity
Goal - compute a policy which, for each agent, maps its local observation history to an action
For all m ≥ 2, a Dec-POMDP with m agents is NEXP-complete [Bernstein et al., 2000]
– Agents must reason about the possible actions and observations of their teammates
12
Impact of Communication on Complexity [Pynadath and Tambe, 2002]
If communication is free:
– the Dec-POMDP is reducible to a single-agent POMDP
– the optimal communication policy is to communicate at every time step
When communication has any cost, the Dec-POMDP is still intractable (NEXP-complete)
– Agents must reason about the value of information
13
Classifying Communication Heuristics
AND- vs. OR-communication [Emery-Montemerlo, 2005]
– AND-communication does not replace domain-level actions
– OR-communication does replace domain-level actions
Initiating communication [Xuan et al., 2001]
– Tell - Agent decides to tell local information to teammates
– Query - Agent asks a teammate for information
– Sync - All agents broadcast all information simultaneously
14
Classifying Communication Heuristics
Does the algorithm consider communication cost?
Is the algorithm applicable to:
– General Dec-POMDP domains
– General Dec-MDP domains
– Restricted domains
Are the agents guaranteed to Avoid Coordination Errors?
15
Related Work
[Comparison table marking which properties each approach provides - columns: Unrestricted, Cost, Sync, Query, Tell, OR, AND, ACE; rows: [Xuan and Lesser, 2002], Communicative JESP [Nair et al., 2003], BaGA-Comm [Emery-Montemerlo, 2005], ACE-PJB-Comm, Selective ACE-PJB-Comm, ACE-IFP.]
16
Overall Approach
Recall: if communication is free, a Dec-POMDP can be treated as a single-agent POMDP
1) At plan-time, pretend communication is free
- Generate a centralized policy for the team
2) At execution-time, use communication to enable decentralized execution of this policy while Avoiding Coordination Errors
17
Outline
Dec-POMDP, Dec-MDP models
– Impact of communication on complexity
Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
– ACE-PJB-Comm: When should agents communicate?
– Selective ACE-PJB-Comm: What should agents communicate?
Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
Future directions
18
Tiger Domain: (States, Actions)
Two-agent tiger problem [Nair et al., 2003]:
S: {SL, SR}
– The tiger is either behind the left door or behind the right door
Individual actions: ai ∈ {OpenL, OpenR, Listen}
– Each robot can open the left door, open the right door, or listen
19
Tiger Domain: (Observations)
Individual observations: ωi ∈ {HL, HR}
– Each robot can hear the tiger behind the left door or hear it behind the right door
Observations are noisy and independent.
20
Tiger Domain: (Reward)
Coordination problem - agents must act together for maximum reward
– Maximum reward (+20) when both agents open the door with the treasure
– Minimum reward (-100) when only one agent opens the door with the tiger
– Listen has a small cost (-1 per agent)
– Both agents opening the door with the tiger leads to a medium negative reward (-50)
21
Coordination Errors
[Observation history: HL, HL, HL, …]
a1 = OpenR
a2 = OpenL
Reward(<OpenR, OpenL>) = -100
Reward(<OpenL, OpenL>) ≥ -50
Agents Avoid Coordination Errors when each agent’s action is a best response to its teammates’ actions.
22
Avoid Coordination Errors by Reasoning Over Possible Joint Beliefs (ACE-PJB)
A centralized POMDP policy maps joint beliefs to joint actions
– Joint belief (bt) - distribution over world states
Individual agents can't compute the joint belief
– They don't know what their teammates have observed or what actions they selected
Simplifying assumption:
– What if agents knew the joint action at each timestep?
– Then agents would only have to reason about possible observations
– How can this be assured?
23
Ensuring Action Synchronization
Agents are only allowed to choose actions based on information known to all team members
At the start of execution, agents know:
– b0 - initial distribution over world states
– a0 - optimal joint action given b0, based on the centralized policy
At each timestep, each agent computes Lt, the distribution of possible joint beliefs:
– Lt = {<bt, pt, ωt>}
– ωt - observation history that led to bt
– pt - likelihood of observing ωt
24
Possible Joint Beliefs
a = <Listen, Listen>
HL
HL
How should agents select actions over joint beliefs?
),|( 111 −−− ×= ttttt abPpp
b: P(SL) = 0.5
p: p(b) = 1.0L0
b: P(SL) = 0.8
p: p(b) = 0.29L1
b: P(SL) = 0.5
p: p(b) = 0.21
b: P(SL) = 0.5
p: p(b) = 0.21
b: P(SL) = 0.2
p: p(b) = 0.29
HL,HL
HL,HR
HR
,HL
HR,HR
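A minimal sketch of this leaf expansion for the tiger domain, assuming the 0.7-accuracy hearing model from the appendix slides (function names are illustrative):

```python
# One step of joint-belief expansion under a = <Listen, Listen>, using
# p_t = p_{t-1} * P(omega_t | b_{t-1}, a_{t-1}).
# Beliefs are summarized by b = P(SL).

def obs_prob(joint_obs, state):
    """P(joint observation | state) while both agents listen."""
    correct = "HL" if state == "SL" else "HR"
    p = 1.0
    for o in joint_obs:
        p *= 0.7 if o == correct else 0.3
    return p

def expand(leaf, joint_obs):
    """Child of one leaf <b, p, history> for one joint observation."""
    b, p, hist = leaf
    # P(omega | b, a): marginalize the observation model over the belief
    p_omega = b * obs_prob(joint_obs, "SL") + (1 - b) * obs_prob(joint_obs, "SR")
    # Bayes update of P(SL)
    b_new = b * obs_prob(joint_obs, "SL") / p_omega
    return (b_new, p * p_omega, hist + [joint_obs])

root = (0.5, 1.0, [])   # L0: uniform prior, probability 1
L1 = [expand(root, w) for w in
      [("HL", "HL"), ("HL", "HR"), ("HR", "HL"), ("HR", "HR")]]
```

The resulting leaf probabilities are 0.29, 0.21, 0.21, 0.29, matching the tree on this slide.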
25
Q-POMDP Heuristic
Select the joint action that maximizes expected reward over possible joint beliefs
Q-MDP [Littman et al., 1995]
– approximate solution to a large POMDP using the underlying MDP:
QMDP(b) = argmaxa Σs∈S b(s) × Va(s)
Q-POMDP [Roth et al., 2005]
– approximate solution to a Dec-POMDP using the underlying single-agent POMDP:
QPOMDP(Lt) = argmaxa ΣLti∈Lt p(Lti) × Q(b(Lti), a)
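The Q-POMDP rule can be sketched as below. The centralized Q-function is replaced here by a myopic one-step stand-in over b = P(SL) built from the tiger reward table, so this only illustrates the argmax-over-leaves structure, not the thesis implementation:

```python
# Q-POMDP sketch: pick the joint action maximizing expected value over
# all possible joint beliefs (the leaves of the belief tree).

def q(b, joint_a):
    """Myopic stand-in for the centralized POMDP Q-function."""
    if joint_a == ("OpenR", "OpenR"):   # +20 in SL, -50 in SR
        return 70 * b - 50
    if joint_a == ("OpenL", "OpenL"):   # -50 in SL, +20 in SR
        return 20 - 70 * b
    return -2.0                          # <Listen, Listen>

JOINT = [("OpenR", "OpenR"), ("OpenL", "OpenL"), ("Listen", "Listen")]

def q_pomdp(leaves):
    """leaves: list of (belief, probability) pairs."""
    return max(JOINT, key=lambda a: sum(p * q(b, a) for b, p in leaves))

# With the symmetric leaf set from the belief tree, Listen wins:
choice = q_pomdp([(0.8, 0.29), (0.5, 0.21), (0.5, 0.21), (0.2, 0.29)])
```

Because the leaf probabilities are symmetric, the expected belief is 0.5 and both open actions look bad in expectation, which is exactly the conservatism noted on the next slide.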
26
Q-POMDP Heuristic
QPOMDP(Lt) = argmaxa ΣLti∈Lt p(Lti) × Q(b(Lti), a)
Choose the joint action by computing expected reward over all leaves
Agents will independently select the same joint action, guaranteeing that they avoid coordination errors…
but the action choice is very conservative (always <Listen, Listen>)
ACE-PJB-Comm: Communication adds local observations to the joint belief
[Belief tree repeated from the previous slide: leaves <HL,HL> (P(SL) = 0.8, p = 0.29), <HL,HR> (P(SL) = 0.5, p = 0.21), <HR,HL> (P(SL) = 0.5, p = 0.21), <HR,HR> (P(SL) = 0.2, p = 0.29)]
27
ACE-PJB-Comm Example
[Tree: root {} expands into L1 leaves <HL,HL>, <HL,HR>, <HR,HL>, <HR,HR>; agent 1 observes HL]
aNC = Q-POMDP(L1) = <Listen, Listen>
L* = circled nodes (the leaves consistent with agent 1's observation)
aC = Q-POMDP(L*) = <Listen, Listen>
Don't communicate
28
ACE-PJB-Comm Example
a = <Listen, Listen>, agent 1 has observed {HL, HL}
[Tree: the four L1 leaves each expand into four L2 leaves, one per joint observation pair]
aNC = Q-POMDP(L2) = <Listen, Listen>
L* = circled nodes (the leaves consistent with agent 1's observations)
aC = Q-POMDP(L*) = <OpenR, OpenR>
V(aC) - V(aNC) > ε
Agent 1 communicates
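The decision in this example can be sketched as follows, again with a toy myopic stand-in for the Q-function over b = P(SL) (the real algorithm evaluates the centralized POMDP policy); the leaf sets and ε are illustrative:

```python
# ACE-PJB-Comm test: a_NC is the joint action chosen over all possible
# joint beliefs L; a_C is the action over L*, the leaves consistent with
# this agent's own observations. Communicate when the expected gain,
# measured over L*, exceeds epsilon.

def q(b, a):
    """Toy stand-in Q: 'open' the right door vs. 'listen'."""
    return {"open": 70 * b - 50, "listen": -2.0}[a]

def value(leaves, a):
    return sum(p * q(b, a) for b, p in leaves)

def ace_pjb_comm(L, L_star, joint_actions, eps):
    a_nc = max(joint_actions, key=lambda a: value(L, a))
    a_c = max(joint_actions, key=lambda a: value(L_star, a))
    communicate = value(L_star, a_c) - value(L_star, a_nc) > eps
    return communicate, a_nc, a_c

# Over all leaves the team would keep listening, but the leaves consistent
# with agent 1's two HL observations favor opening:
comm, a_nc, a_c = ace_pjb_comm([(0.8, 0.29), (0.5, 0.42), (0.2, 0.29)],
                               [(0.93, 1.0)], ["open", "listen"], 1.0)
```

The rule only compares actions the whole team could coordinate on, so an agent never unilaterally deviates; it either communicates or follows aNC.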
29
ACE-PJB-Comm Example
a = <Listen, Listen>, agent 1 communicates <HL, HL>
[Tree from the previous slide, pruned to the L2 leaves consistent with agent 1's communicated observations]
Q-POMDP(L2) = <OpenR, OpenR>
Agents open the right door!
30
ACE-PJB-Comm Results
20,000 trials in the 2-Agent Tiger Domain
– 6 timesteps per trial
Agents communicate 49.7% fewer observations using ACE-PJB-Comm, and send 93.3% fewer messages
The difference in expected reward arises because ACE-PJB-Comm is slightly pessimistic about the outcome of communication
                    Mean Reward    Mean Messages   Mean Observations
Full Communication  7.14 (27.88)   10.0 (0.0)      10.0 (0.0)
ACE-PJB-Comm        5.31 (19.79)   1.77 (0.79)     5.13 (2.38)
31
Additional Challenges
The number of possible joint beliefs grows exponentially
– Use a particle filter to model the distribution of possible joint beliefs
ACE-PJB-Comm answers the question of when agents should communicate
– It doesn't deal with what to communicate
– Agents communicate all observations that they haven't previously communicated
32
Selective ACE-PJB-Comm[Roth et al., 2006]
Answers what agents should communicate
Chooses the most valuable subset of observations
Hill-climbing heuristic to choose observations that “push” the team towards aC
– aC - the joint action that would be chosen if the agent communicated all observations
– See details in the thesis document
33
Selective ACE-PJB-Comm Results
2-Agent Tiger domain:
Communicates 28.7% fewer observations
Same expected reward
Slightly more messages

                         Mean Reward    Mean Messages   Mean Observations
ACE-PJB-Comm             5.30 (19.79)   1.77 (0.79)     5.13 (2.38)
Selective ACE-PJB-Comm   5.31 (19.74)   1.81 (0.92)     3.66 (1.67)
34
Outline
Dec-POMDP, Dec-MDP models
– Impact of communication on complexity
Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
– ACE-PJB-Comm: When should agents communicate?
– Selective ACE-PJB-Comm: What should agents communicate?
Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
Future directions
35
Dec-MDP
State is collectively observable
– One agent can't identify the full state on its own
– The union of the team's observations uniquely identifies the state
The underlying problem is an MDP, not a POMDP
A Dec-MDP has the same complexity as a Dec-POMDP
– NEXP-complete
36
Acting Independently
ACE-PJB requires agents to know joint action at every timestep
Claim: In many multi-agent domains, agents can act independently for long periods of time, only needing to coordinate infrequently
37
Meeting-Under-Uncertainty Domain
Agents must move to the goal location and signal simultaneously
Reward:
+20 - Both agents signal at the goal
-50 - Both agents signal at another location
-100 - Only one agent signals
-1 - Agents move north, south, east, west, or stop
38
Factored Representations
Represent relationships among state variables instead of relationships among states
S = <X0, Y0, X1, Y1>
Each agent observes its own position
39
Factored Representations
A Dynamic Decision Network models state variables over time [diagram shown for at = <East, *>]
40
Tree-structured Policies
A decision tree that branches over state variables
A tree-structured joint policy has joint actions at the leaves
41
Approach [Roth et al., 2007]
Generate tree-structured joint policies for underlying centralized MDP
Use this joint policy to generate a tree-structured individual policy for each agent*
Execute individual policies
* See details in thesis document
42
Context-specific Independence
Claim: In many multi-agent domains, one agent's individual policy will have large sections where it is independent of the variables that its teammates observe.
43
Individual Policies
One agent’s individual policy may depend on state features it doesn’t observe
44
Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
The robot traverses its policy tree according to its observations
– If it reaches a leaf, its action is independent of its teammates' observations
– If it reaches a state variable that it does not observe directly, it must ask a teammate for the current value of that variable
The amount of communication needed to execute a particular policy corresponds to the amount of context-specific independence in the domain
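This traversal can be sketched with a simple dictionary-based policy tree; the class and function names are hypothetical, and a real agent would also broadcast values its teammates ask for:

```python
# ACE-IFP execution sketch: walk an individual factored policy tree.
# At a locally observed variable, branch on the agent's own value; at a
# teammate's variable, query the teammate (one message) and continue.

class Node:
    def __init__(self, var=None, children=None, action=None):
        self.var = var            # state variable to branch on (internal)
        self.children = children  # dict: variable value -> subtree
        self.action = action      # individual action (leaf)

def execute_step(root, local_obs, query_teammate):
    """Return (action, variables asked about) for one decision."""
    asked, node = [], root
    while node.action is None:
        if node.var in local_obs:
            value = local_obs[node.var]
        else:
            value = query_teammate(node.var)   # communication happens here
            asked.append(node.var)
        node = node.children[value]
    return node.action, asked

# Toy tree: the agent only needs teammate variable Y1 in one context
tree = Node(var="X0", children={
    0: Node(action="East"),                         # independent branch
    1: Node(var="Y1", children={0: Node(action="Stop"),
                                1: Node(action="Signal")})})
```

When the X0 = 0 branch is taken, the agent acts with no communication at all; messages are only generated in contexts where the policy genuinely depends on a teammate's variable.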
45
Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
Benefits:
– Agents can act independently, without reasoning about the possible observations or actions of their teammates
– The policy directs agents about when, what, and with whom to communicate
Drawback:
– In domains with little independence, agents may need to communicate a lot
46
Experimental Results
In the 3x3 domain, executing the factored policy required less than half as many messages as full communication, with the same reward
Communication usage decreases relative to full communication as the domain size increases

                     Mean Reward   Mean Messages Sent   Mean Variables Sent
Full Communication   17.484        7.032                14.064
Factored Execution   17.484        3.323                6.646
47
Factored Dec-POMDPs
[Hansen and Feng, 2000] looked at factored POMDPs
– ADD representations of the transition, observation, and reward functions
– The policy is a finite-state controller
Nodes are actions
Transitions depend on conjunctions of state-variable assignments
To extend this to Dec-POMDPs, make each individual policy a finite-state controller over individual actions
– Somehow combine nodes with the same action
– Communicate to enable transitions between action nodes
48
Future Directions
Considering communication cost in ACE-IFP
– All children of a particular variable may have similar values
– Worst-case cost of mis-coordination?
– Modeling teammate variables requires reasoning about possible teammate actions
Extending factoring to Dec-POMDPs
49
Future Directions
Knowledge persistence
– Modeling teammates' variables
– Can we identify “necessary conditions”?
e.g. “Tell me when you reach the goal” instead of repeatedly asking “Are you here yet?”
50
Contributions
Decentralized execution of centralized policies
– Guarantee that agents will Avoid Coordination Errors
– Make effective use of limited communication resources
– When should agents communicate?
– What should agents communicate?
Demonstrated significant communication savings in experimental domains
51
Contributions
[Comparison table marking which properties each contributed algorithm provides - columns: Unrestricted, Cost, Sync, Query, Tell, OR, AND, ACE, When?, What?, Who?; rows: ACE-PJB-Comm, Selective ACE-PJB-Comm, ACE-IFP.]
52
Thank You!
Advisors: Reid Simmons, Manuela Veloso
Committee: Carlos Guestrin, Jeff Schneider, Milind Tambe
RI Folks: Suzanne, Alik, Damion, Doug, Drew, Frank, Harini, Jeremy, Jonathan, Kristen, Rachel (and many others!)
Aba, Ema, Nitzan, Yoel
53
References
Roth, M., Simmons, R., and Veloso, M. “Reasoning About Joint Beliefs for Execution-Time Communication Decisions” In AAMAS, 2005
Roth, M., Simmons, R., and Veloso, M. “What to Communicate? Execution-Time Decisions in Multi-agent POMDPs” In DARS, 2006
Roth, M., Simmons, R., and Veloso, M. “Exploiting Factored Representations for Decentralized Execution in Multi-agent Teams” In AAMAS, 2007
Bernstein, D., Zilberstein, S., and Immerman, N. “The Complexity of Decentralized Control of Markov Decision Processes” In UAI, 2000
Pynadath, D. and Tambe, M. “The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models” In JAIR, 2002
Becker, R., Zilberstein, S., Lesser, V., and Goldman, C. “Transition-independent Decentralized Markov Decision Processes” In AAMAS, 2003
Nair, R., Roth, M., Yokoo, M., and Tambe, M. “Communication for Improving Policy Computation in Distributed POMDPs” In IJCAI, 2003
54
Tiger Domain Details
Rewards:
Action / State     SL     SR
<OpenR, OpenR>     +20    -50
<OpenL, OpenL>     -50    +20
<OpenR, OpenL>     -100   -100
<OpenL, OpenR>     -100   -100
<Listen, Listen>   -2     -2
<Listen, OpenR>    +9     -101
<Listen, OpenL>    -101   +9
<OpenR, Listen>    +9     -101
<OpenL, Listen>    -101   +9

Transitions (listening leaves the state unchanged; opening any door resets it uniformly):
Action             SL → SL   SL → SR   SR → SL   SR → SR
<OpenR, *>         0.5       0.5       0.5       0.5
<OpenL, *>         0.5       0.5       0.5       0.5
<*, OpenR>         0.5       0.5       0.5       0.5
<*, OpenL>         0.5       0.5       0.5       0.5
<Listen, Listen>   1.0       0.0       0.0       1.0

Observations:
Action             State   HL    HR
<Listen, Listen>   SL      0.7   0.3
<Listen, Listen>   SR      0.3   0.7
<OpenR, *>         *       0.5   0.5
<OpenL, *>         *       0.5   0.5
<*, OpenR>         *       0.5   0.5
<*, OpenL>         *       0.5   0.5
55
Particle filter representation
Each particle is a possible joint belief
Each agent maintains two particle filters:
– Ljoint: possible joint team beliefs
– Lown: possible joint beliefs that are consistent with the local observation history
Compare the action selected by Q-POMDP over Ljoint to the action selected over Lown, and communicate as needed
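The consistency filtering step can be sketched as below, assuming each particle carries the joint-observation history that produced it (the representation is illustrative):

```python
# L_joint holds (belief, probability, joint-observation-history) particles
# for the whole team; L_own keeps only those whose history agrees with this
# agent's own observations, renormalized. Comparing the Q-POMDP actions
# selected over the two sets drives the communication decision.

def filter_own(L_joint, agent_idx, own_history):
    """Keep particles consistent with this agent's observation history."""
    kept = [(b, p, h) for (b, p, h) in L_joint
            if [w[agent_idx] for w in h] == own_history]
    z = sum(p for _, p, _ in kept)   # renormalize surviving probability
    return [(b, p / z, h) for (b, p, h) in kept]

# Toy example from the tiger belief tree: agent 0 actually heard HL
L_joint = [(0.8, 0.29, [("HL", "HL")]), (0.5, 0.21, [("HL", "HR")]),
           (0.5, 0.21, [("HR", "HL")]), (0.2, 0.29, [("HR", "HR")])]
L_own = filter_own(L_joint, 0, ["HL"])
```

Only the two leaves in which agent 0 heard HL survive, with probabilities rescaled to sum to one.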
56
Related Work: Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
Dec-MDP - collective observability
Transition independence:
– Local state transitions: each agent observes its local state, and individual actions only affect local state transitions
– The team is connected only through the joint reward
Coverage set algorithm - finds optimal policies quickly in experimental domains
No communication
57
Related Work: COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
Add a SYNC action to the domain
– If one agent chooses SYNC, all other agents SYNC
– At SYNC, agents send their entire observation histories since the last SYNC
SYNC brings agents to a synchronized belief over world states
Policies are indexed by the root synchronized belief and the observation history since the last SYNC
[Belief tree: t=0 (SL 0.5, SR 0.5), a = <Listen, Listen>, branching on ω = HL or HR into beliefs over (state, history) pairs; a = SYNC collapses the histories back to a synchronized belief at t=2, e.g. (SL 0.97, SR 0.03) or (SL 0.5, SR 0.5)]
“At most K” heuristic - there must be a SYNC within at most K timesteps
58
Related Work: “No news is good news” [Xuan, Lesser, Zilberstein, 2000]
Applies to transition-independent Dec-MDPs
Agents form a joint plan
– “plan”: the exact path to be followed to accomplish the goal
Communicate when a deviation from the plan occurs
– an agent sees it has slipped from the optimal path
– it communicates the need for re-planning
59
Related Work: BaGA-Comm [Emery-Montemerlo, 2005]
Each agent has a type
– Its observation and action history
Agents model the distribution of possible joint types
– Choose actions by finding the joint type closest to their own local type
– Allows coordination errors
Communicate if the gain in expected reward is greater than the cost of communication
60
Colorado/Wyoming Domain
Robots must meet in the capital, but do not know if they are in Colorado or Wyoming
Robots receive a positive reward of +20 only if they SIGNAL simultaneously from the correct goal location
To simplify the problem, each robot knows both its own and its teammate's position
[Maps of Colorado and Wyoming, marking each capital]
61
Colorado/Wyoming Domain
Noisy observations - mountain (Mt), plain (Pl), Pikes Peak (PP), Old Faithful (Of)
Communication can help the team reach the goal more efficiently
[Maps marking the possible goal locations, Pikes Peak, and Old Faithful]

State   Mt    Pl    PP     Of
C       0.7   0.1   0.19   0.01
W       0.1   0.7   0.01   0.19
62
Build-Message: What to Communicate
First, determine if communication is necessary
– Calculate aC using ACE-PJB-Comm
– If aC = aNC, do not communicate
Greedily build the message
– “Hill-climb” towards aC, away from aNC
– Choose the single observation that most increases the difference between the Q-POMDP values of aC and aNC
[Observation history: Mt, Pl, Mt, Pike]
63
Build-Message: What to Communicate
Is communication necessary?
[Observation history: Mt, Pl, Mt, Pike]
aNC = [east, south]
aC = [east, west]
aC ≠ aNC, so the agent should communicate
64
Build-Message: What to Communicate
aC = [east, west] - “toward Denver”
[Figure: histograms over P(State = Colorado) comparing the distribution of possible joint beliefs if the agent communicates its entire observation history with the distributions after communicating each single observation (Mt, Pl, Pike)]
65
Build-Message: What to Communicate
[Observation history: Mt, Pl, Mt, Pike]
aC = [east, west] - “toward Denver”
- PIKE is the single best observation
- In this case, PIKE is sufficient to change the joint action to aC, so the agent communicates only one observation
m = {Pike}
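The greedy loop can be sketched with the Q-POMDP computations abstracted as callables; the observation weights and the flip test below are made-up stand-ins chosen so that the outcome mirrors this example:

```python
# Build-Message sketch: repeatedly add the single observation that most
# increases the Q-POMDP gap between a_C and a_NC, stopping as soon as the
# communicated subset is enough to flip the joint action to a_C.

def build_message(observations, gap, flips_action):
    """gap(msg): Q-POMDP value gap if msg were sent;
    flips_action(msg): does msg already shift the team to a_C?"""
    msg, rest = [], list(observations)
    while rest and not flips_action(msg):
        best = max(rest, key=lambda o: gap(msg + [o]))
        msg.append(best)
        rest.remove(best)
    return msg

# Toy scores: Pike is by far the most informative observation and alone
# suffices to flip the action (weights are illustrative, not computed)
weights = {"Mt": 1.0, "Pl": -1.0, "Pike": 5.0}
gap = lambda m: sum(weights[o] for o in m)
message = build_message(["Mt", "Pl", "Mt", "Pike"], gap, lambda m: gap(m) >= 4.0)
```

As on the slide, the greedy choice selects Pike first and then stops, so only one observation is sent.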
66
Context-specific Independence
A variable may be independent of a parent variable in some contexts but not others
– e.g. X2 depends on X3 when X1 has value 1, but is independent of X3 otherwise
Claim - Many multi-agent domains exhibit a large amount of context-specific independence
67
Constructing Individual Factored Policies
[Boutilier et al., 2000] defined Merge and Simplify operations for policy trees
We want to construct trees that maximize context-specific independence
– This depends on the variable ordering in the policy
– We define Intersect and Independent operations
68
Intersect
Find the intersection of the action sets of a node's children
1. If all children are leaves, and their action sets have a non-empty intersection, replace the node with the intersection
2. If all but one child is a leaf, and all the actions in the non-leaf child's subtree are included in the leaf children's intersection, replace the node with the non-leaf child
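Rule 1 can be sketched as a recursive pass over a policy tree; the node representation (leaves carrying sets of equally-good actions) is an assumption for illustration:

```python
# Intersect sketch (rule 1 only): if every child of a node is a leaf and
# the children's action sets share a non-empty intersection, the branch
# is irrelevant -- replace the node with a leaf labeled the intersection.

class PNode:
    def __init__(self, var=None, children=None, actions=None):
        self.var = var
        self.children = children or {}
        self.actions = actions      # non-None for leaves: a set of actions

def intersect(node):
    if node.actions is not None:        # already a leaf
        return node
    node.children = {v: intersect(c) for v, c in node.children.items()}
    kids = list(node.children.values())
    if all(k.actions is not None for k in kids):
        common = set.intersection(*(k.actions for k in kids))
        if common:                       # branch on node.var is unnecessary
            return PNode(actions=common)
    return node

# Both branches agree that action "b" is acceptable, so the test on X1
# can be dropped entirely:
collapsed = intersect(PNode(var="X1",
                            children={0: PNode(actions={"a", "b"}),
                                      1: PNode(actions={"b", "c"})}))
```

Collapsing such nodes is what removes teammate variables from an individual policy, which directly reduces how often the agent must query during ACE-IFP execution.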
69
Independent
An individual action is Independent in a particular leaf of a policy tree if it is optimal when paired with any action its teammate could perform at that leaf
[Figure: one leaf where action a is independent for agent 1, and one where agent 1 has no independent actions]
70
Generate Individual Policies
Generate a tree-structured joint policy
For each agent:
– Reorder variables in the joint policy so that variables local to this agent are near the root
– For each leaf in the policy, find the Independent actions
– Break ties among the remaining joint actions
– Convert joint actions to individual actions
– Intersect and Simplify
71
Why Break Ties?
Ensure agents select the same optimal joint action to prevent mis-coordination