Markov Models for Multi-Agent Coordination
Maayan Roth
Multi-Robot Reading Group
April 13, 2005
Multi-Agent Coordination

Teams:
– Agents work together to achieve a common goal
– No individual motivations

Objective:
– Generate policies (individually or globally) that yield the best team performance
Observability [Pynadath and Tambe, 2002]

“Observability” – the degree to which agents can, either individually or as a team, identify the current world state
– Individual observability: MMDP [Boutilier, 1996]
– Collective observability: O1 + O2 + … + On uniquely identify the state, e.g. [Xuan, Lesser, Zilberstein, 2001]
– Collective partial observability: DEC-POMDP [Bernstein et al., 2000], POIPSG [Peshkin, Kim, Meuleau, Kaelbling, 2000]
– Non-observability
Communication [Pynadath and Tambe, 2002]

“Communication” – explicit message-passing from one agent to another
– Free communication: no cost to send messages; transforms an MMDP into an MDP, a DEC-POMDP into a POMDP
– General communication: communication is available but has a cost or is limited
– No communication: no explicit message-passing
Taxonomy of Cooperative MAS

| | No Communication | General Communication | Free Communication |
| --- | --- | --- | --- |
| Full Observability | MMDP | MMDP | MDP |
| Collective Observability | DEC-MDP | DEC-MDP | MDP |
| Partial Observability | DEC-POMDP | DEC-POMDP | POMDP |
| Unobservability | Open-loop control? | Open-loop control? | Open-loop control? |
Complexity

| | Individually Observable | Collectively Observable | Collectively Partially Observable |
| --- | --- | --- | --- |
| No Comm. | P-complete | NEXP-complete | NEXP-complete |
| General Comm. | P-complete | NEXP-complete | NEXP-complete |
| Free Comm. | P-complete | P-complete | PSPACE-complete |
Multi-Agent MDP (MMDP) [Boutilier, 1996]

Also called IPSG (identical payoff stochastic games)

$M = \langle S, \{A_i\}_{i \leq m}, T, R \rangle$
– $S$ is the set of possible world states
– $\{A_i\}_{i \leq m}$ defines the set of joint actions $\langle a_1, \ldots, a_m \rangle$, where $a_i \in A_i$
– $T$ defines transition probabilities over joint actions
– $R$ is the team reward function

State is fully observable by each agent → P-complete
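For concreteness, here is a minimal sketch (not from the talk) of how the MMDP tuple might be represented in code; all names are illustrative:

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

State = str
JointAction = Tuple[str, ...]  # one action per agent: <a_1, ..., a_m>

class MMDP(NamedTuple):
    states: List[State]
    joint_actions: List[JointAction]
    # T[s][a] is a probability distribution over successor states
    T: Dict[State, Dict[JointAction, Dict[State, float]]]
    # R(s, a) is the single team reward shared by all agents
    R: Callable[[State, JointAction], float]
```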
Coordination Problems

Multiple equilibria – two optimal joint policies

[Figure: a six-state MMDP. From s1, joint action <a,*> leads to s2 and <b,*> leads to s3. From s2, matching joint actions (<a,a>; <b,b>) reach s4 with reward +10, while mismatched ones (<a,b>; <b,a>) reach s5 with reward -10. From s3, any joint action <*,*> reaches s6 with reward +5.]
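A quick arithmetic check of why the mismatch risk matters, using a hypothetical encoding of the s2 payoffs from the figure:

```python
# Team payoff in s2 for each joint action (from the figure above).
payoff = {("a", "a"): 10, ("b", "b"): 10, ("a", "b"): -10, ("b", "a"): -10}

# With two optimal equilibria (<a,a> and <b,b>) and no way to agree on
# one, suppose each agent picks a or b uniformly at random.
expected = sum(0.25 * r for r in payoff.values())
print(expected)  # 0.0 -- worse than the guaranteed +5 via s3
```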
Randomization with Learning

Expand the MMDP to include an FSM that selects actions based on whether the agents are coordinated or uncoordinated

Build the expanded MMDP by adding a coordination mechanism at every state where a coordination problem is discovered

[Figure: a three-state coordination FSM. In the uncoordinated state U, each agent plays random(a,b); the joint outcome <a,a> moves the FSM to coordinated state A (keep playing a), <b,b> moves it to B (keep playing b), and a mismatch (<a,b>; <b,a>) leaves it in U.]
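A minimal sketch of such a coordination FSM, assuming transitions are driven by the publicly observed joint action (names illustrative):

```python
import random

# Each agent runs its own copy of the FSM; the copies stay in lockstep
# because transitions depend only on the observed joint action.
def fsm_step(state: str, joint: tuple) -> str:
    if state == "U":
        if joint == ("a", "a"):
            return "A"   # locked in to the <a,a> equilibrium
        if joint == ("b", "b"):
            return "B"   # locked in to the <b,b> equilibrium
        return "U"       # mismatch: still uncoordinated
    return state         # A and B are absorbing

def fsm_act(state: str) -> str:
    return random.choice(["a", "b"]) if state == "U" else state.lower()
```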
Multi-Agent POMDP

DEC-POMDP [Bernstein et al., 2000], MTDP [Pynadath et al., 2002], POIPSG [Peshkin et al., 2000]

$M = \langle S, \{A_i\}_{i \leq m}, T, \{\Omega_i\}_{i \leq m}, O, R \rangle$
– $S$ is the set of possible world states
– $\{A_i\}_{i \leq m}$ defines the set of joint actions $\langle a_1, \ldots, a_m \rangle$, where $a_i \in A_i$
– $T$ defines transition probabilities over joint actions
– $\{\Omega_i\}_{i \leq m}$ defines the set of joint observations $\langle \omega_1, \ldots, \omega_m \rangle$, where $\omega_i \in \Omega_i$
– $O$ defines observation probabilities over joint actions and joint observations
– $R$ is the team reward function
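Extending the earlier MMDP sketch with the observation machinery (again illustrative, not from the talk):

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

State = str
JointAction = Tuple[str, ...]       # <a_1, ..., a_m>
JointObservation = Tuple[str, ...]  # <omega_1, ..., omega_m>

class DecPOMDP(NamedTuple):
    states: List[State]
    joint_actions: List[JointAction]
    joint_observations: List[JointObservation]
    # T[s][a] is a distribution over successor states
    T: Dict[State, Dict[JointAction, Dict[State, float]]]
    # O[a][s2] is a distribution over the joint observation received
    # when joint action a leads to successor state s2
    O: Dict[JointAction, Dict[State, Dict[JointObservation, float]]]
    # R(s, a) is the team reward
    R: Callable[[State, JointAction], float]
```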
Complexity [Bernstein, Zilberstein, Immerman, 2000]

For all m ≥ 2, a DEC-POMDP with m agents is NEXP-complete
– “For a given DEC-POMDP with finite time horizon T and an integer K, is there a policy which yields a total reward ≥ K?”
– Agents must reason about all possible actions and observations of their teammates

For all m ≥ 3, a DEC-MDP with m agents is NEXP-complete
Dynamic Programming [Hansen, Bernstein, Zilberstein, 2004]

Finds the optimal solution

Partially observable stochastic game (POSG) framework
– Iterate between a DP step and a pruning step that removes dominated strategies

Experimental results show a speed-up in some domains
– Still NEXP-complete, so hyper-exponential in the worst case

No communication
Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]

DEC-MDP – collective observability

Transition independence:
– Local state transitions: each agent observes its local state, and individual actions only affect local state transitions
– Team is connected through the joint reward

Coverage set algorithm – finds the optimal policy quickly in experimental domains

No communication
Efficient Policy Computation [Nair, Pynadath, Yokoo, Tambe, Marsella, 2003]

Joint Equilibrium-Based Search for Policies (JESP):
– initialize random policies for all agents
– for each agent i: fix the policies of all other agents, then find the optimal policy for agent i
– repeat until no policies change

Finds locally optimal policies, with potentially exponential speedups over finding the global optimum (see the sketch below)

No communication
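A minimal sketch of the JESP loop, assuming a best_response helper that optimally solves agent i's single-agent problem with all other policies held fixed:

```python
from typing import Callable, List

def jesp(num_agents: int,
         random_policy: Callable[[int], object],
         best_response: Callable[[int, List[object]], object]) -> List[object]:
    # initialize random policies for all agents
    policies = [random_policy(i) for i in range(num_agents)]
    changed = True
    while changed:  # repeat until no policies change
        changed = False
        for i in range(num_agents):
            # fix everyone else's policy; best-respond for agent i
            new_policy = best_response(i, policies)
            if new_policy != policies[i]:
                policies[i] = new_policy
                changed = True
    return policies  # a locally optimal joint policy
```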
Communication [Pynadath and Tambe, 2002]

COM-MTDP framework for analyzing communication
– Enhance the model with $\{\Sigma_i\}_{i \leq m}$, where $\Sigma_i$ is the set of possible messages agent i can send
– Reward function includes the cost (≤ 0) of communication

If communication is free:
– the DEC-POMDP is reducible to a single-agent POMDP (PSPACE-complete)
– the optimal communication policy is to communicate at every time step

Under general communication (no restrictions on cost), the DEC-POMDP is still NEXP-complete
– $\Sigma_i$ may contain every possible observation history
COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]

Add a SYNC action to the domain
– If one agent chooses SYNC, all other agents SYNC
– At SYNC, send the entire observation history since the last SYNC

SYNC brings agents to a synchronized belief over world states

Policies are indexed by the root synchronized belief and the observation history since the last SYNC

[Figure: a belief tree for the two-agent tiger problem. At t=0 the synchronized belief is (SL 0.5, SR 0.5). After a = {Listen, Listen}, the branch where an agent hears HL carries a joint distribution over (state, teammate observation) pairs: (SL (HL) 0.7225), (SL (HR) 0.1275), (SR (HL) 0.0225), (SR (HR) 0.1275); the ω = HR branch is symmetric. Choosing a = SYNC then yields a synchronized belief at t=2, e.g. (SL 0.97, SR 0.03) when both agents heard HL, or (SL 0.5, SR 0.5) when they heard differently.]

“At-most K” heuristic – there must be a SYNC within at most K timesteps
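A sketch of what the SYNC step might compute, under my reading of the mechanism (not code from the paper); belief_update is a standard POMDP belief update, passed in as a helper:

```python
def sync(root_belief, joint_actions, per_agent_histories, belief_update):
    """Recompute the synchronized belief after a SYNC.

    per_agent_histories holds one observation list per agent since the
    last SYNC; zipping them recovers the joint observation at each step.
    """
    b = root_belief
    for a, omega in zip(joint_actions, zip(*per_agent_histories)):
        b = belief_update(b, a, omega)  # fold in one joint step
    return b  # identical for every agent, hence "synchronized"
```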
Dec-Comm [Roth, Simmons, Veloso, 2005]

At plan-time, assume communication is free
– generate a joint policy using a single-agent POMDP solver

At execution time, reason over possible joint beliefs to execute the policy

Use communication to integrate local observations into the team belief
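A high-level sketch of the execution loop this implies, under my reading of the approach; q_pomdp, value, and broadcast stand in for machinery introduced on the following slides, and EPSILON is a hypothetical communication threshold:

```python
EPSILON = 0.1  # illustrative gain threshold for communicating

def my_part(history, me):
    """Project a joint-observation history onto agent `me`'s observations."""
    return [omega[me] for omega in history]

def dec_comm_step(leaves, me, my_history, q_pomdp, value, broadcast):
    # leaves: possible joint beliefs as (belief, p, joint-obs history)
    a_nc = q_pomdp(leaves)                       # action without communicating
    mine = [l for l in leaves if my_part(l[2], me) == my_history]
    a_c = q_pomdp(mine)                          # action if I shared my history
    if value(a_c, mine) - value(a_nc, mine) > EPSILON:
        broadcast(my_history)                    # teammates prune to `mine`
        return a_c, mine
    return a_nc, leaves
```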
Tiger Domain: (States, Actions)

Two-agent tiger problem [Nair et al., 2003]:

S: {SL, SR}
– the tiger is either behind the left door or behind the right door

Individual actions: $a_i \in$ {OpenL, OpenR, Listen}
– a robot can open the left door, open the right door, or listen
Tiger Domain: (Observations)

Individual observations: $\omega_i \in$ {HL, HR}
– a robot can hear the tiger behind the left door or hear it behind the right door

Observations are noisy and independent.
Tiger Domain: (Reward)

Coordination problem – agents must act together for maximum reward
– Maximum reward (+20) when both agents open the door with the treasure
– Minimum reward (-100) when only one agent opens the door with the tiger
– Listen has a small cost (-1 per agent)
– Both agents opening the door with the tiger gives a medium negative reward (-50)
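For concreteness, the stated payoffs could be tabulated as below (a hypothetical encoding for the case where the tiger is behind the left door; joint actions the slide does not price are deliberately omitted):

```python
# Team reward when the state is SL (tiger behind the left door).
REWARD_SL = {
    ("OpenR",  "OpenR"):  +20,   # both open the treasure door
    ("OpenL",  "OpenL"):  -50,   # both open the tiger door together
    ("OpenL",  "Listen"): -100,  # only one agent meets the tiger
    ("Listen", "OpenL"):  -100,
    ("Listen", "Listen"): -2,    # -1 per listening agent
}
```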
Joint Beliefs

Joint belief ($b^t$) – a distribution over world states

Why compute possible joint beliefs?
– action coordination
– the transition and observation functions depend on the joint action
– an agent can't accurately estimate its belief if the joint action is unknown

To ensure action coordination, agents can only reason over information known by all teammates
Possible Joint Beliefs

[Figure: the tree of possible joint beliefs. The root node L0 has b: P(SL) = 0.5, p(b) = 1.0, ω = {}. After a = <Listen, Listen>, the L1 leaves are:]

| ω | b: P(SL) | p: p(b) |
| --- | --- | --- |
| {HL,HL} | 0.8 | 0.29 |
| {HL,HR} | 0.5 | 0.21 |
| {HR,HL} | 0.5 | 0.21 |
| {HR,HR} | 0.2 | 0.29 |

Leaf probabilities: $p^t = p^{t-1} \cdot P(\omega^t \mid b^{t-1}, a^{t-1})$

How should agents select actions over joint beliefs?
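A sketch of how this tree might be grown one step, assuming helpers obs_prob and belief_update derived from the model's O and T:

```python
def expand_leaves(leaves, a, joint_observations, obs_prob, belief_update):
    """Grow the set of possible joint beliefs by one joint action.

    leaves: list of (belief, p, joint-observation history);
    obs_prob(omega, b, a) = P(omega | b, a); belief_update is the
    standard POMDP belief update (both assumed given).
    """
    new_leaves = []
    for b, p, history in leaves:
        for omega in joint_observations:
            p_omega = obs_prob(omega, b, a)
            if p_omega > 0:
                # p^t = p^{t-1} * P(omega^t | b^{t-1}, a^{t-1})
                new_leaves.append((belief_update(b, a, omega),
                                   p * p_omega,
                                   history + [omega]))
    return new_leaves
```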
Q-POMDP Heuristic

Select a joint action over the possible joint beliefs

Q-MDP [Littman et al., 1995]
– approximate solution to a large POMDP using the underlying MDP:

$Q_{MDP}(b) = \arg\max_{a} \sum_{s \in S} b(s) \, V_a(s)$

Q-POMDP
– approximate solution to a DEC-POMDP using the underlying single-agent POMDP:

$Q_{POMDP}(L^t) = \arg\max_{a} \sum_{L_i^t \in L^t} p(L_i^t) \, Q(b(L_i^t), a)$
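A direct transcription of the Q-POMDP rule, assuming the plan-time single-agent POMDP Q-function q(b, a) is available:

```python
def q_pomdp(leaves, joint_actions, q):
    """Q-POMDP: argmax over a of sum_i p(L_i) * Q(b(L_i), a).

    leaves: list of (belief, p, history); q(b, a): plan-time Q-value.
    """
    return max(joint_actions,
               key=lambda a: sum(p * q(b, a) for b, p, _ in leaves))
```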
Q-POMDP Heuristic

[Figure: the same L1 belief tree as above.]

Choose the joint action by computing expected reward over all leaves

Agents will independently select the same joint action…

…but the action choice is very conservative (always <Listen, Listen>)

Dec-Comm: use communication to add local observations to the joint belief
Dec-Comm Example

[Figure: the L1 belief tree. Agent 1 has heard HL, so the leaves consistent with that observation, ω = {HL,HL} and {HL,HR}, are circled.]

aNC = Q-POMDP(L1) = <Listen, Listen>
L* = circled nodes
aC = Q-POMDP(L*) = <Listen, Listen>

Communicating would not change the action, so don't communicate
Dec-Comm Example (cont’d)

[Figure: the tree after a second a = <Listen, Listen>. The L1 leaves are ω = {HL,HL} (P(SL) = 0.85, p = 0.29), {HL,HR} (0.5, 0.21), {HR,HL} (0.5, 0.21), and {HR,HR} (0.15, 0.29). The L2 leaves under the {HL,HL} and {HL,HR} branches are:]

| ω history | b: P(SL) | p: p(b) |
| --- | --- | --- |
| {<HL,HL>, <HL,HL>} | 0.97 | 0.12 |
| {<HL,HL>, <HL,HR>} | 0.85 | 0.06 |
| {<HL,HL>, <HR,HL>} | 0.85 | 0.06 |
| {<HL,HL>, <HR,HR>} | 0.5 | 0.04 |
| {<HL,HR>, <HL,HL>} | 0.85 | 0.06 |
| {<HL,HR>, <HL,HR>} | 0.5 | 0.04 |
| {<HL,HR>, <HR,HL>} | 0.5 | 0.04 |
| {<HL,HR>, <HR,HR>} | 0.16 | 0.06 |
| … | | |

Agent 1 has now heard HL twice; the L2 leaves consistent with that history are circled.

aNC = Q-POMDP(L2) = <Listen, Listen>
L* = circled nodes
aC = Q-POMDP(L*) = <OpenR, OpenR>
V(aC) - V(aNC) > ε → Agent 1 communicates
Dec-Comm Example (cont’d)

[Figure: the same L2 belief tree as above.]

Agent 1 communicates <HL, HL>
Agent 2 communicates <HL, HL>
Q-POMDP(L2) = <OpenR, OpenR>

Agents open the right door!
References
Becker, R., Zilberstein, S., Lesser, V., Goldman, C. V. Transition-independent decentralized Markov decision processes. In International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.
Bernstein, D., Zilberstein, S., Immerman, N. The complexity of decentralized control of Markov decision processes. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
Boutilier, C. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
Hansen, E. A., Bernstein, D. S., Zilberstein, S. Dynamic programming for partially observable stochastic games. In National Conference on Artificial Intelligence, 2004.
Nair, R., Roth, M., Yokoo, M., Tambe, M. Communication for improving policy computation in distributed POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2004.
Nair, R., Pynadath, D., Yokoo, M., Tambe, M., Marsella, S. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of International Joint Conference on Artificial Intelligence, 2003.
Peshkin, L., Kim, K.-E., Meuleau, N., Kaelbling, L. P. Learning to cooperate via policy search. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2000.
Pynadath, D. and Tambe, M. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 2002.
Roth, M., Simmons, R., Veloso, M. Reasoning about joint beliefs for execution-time communication decisions. To appear in International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2005.
Xuan, P., Lesser, V., Zilberstein, S. Communication decisions in multi-agent cooperation: Model and experiments. In Proceedings of the International Joint Conference on Artificial Intelligence, 2001.