Dynamic Programming for Partially Observable Stochastic Games
Daniel S. Bernstein, University of Massachusetts Amherst
in collaboration with
Christopher Amato, Eric A. Hansen,
Shlomo Zilberstein
June 23, 2004
Extending the MDP Framework
• The MDP framework can be extended to incorporate partial observability and multiple agents
• Can we still do dynamic programming?
– Lots of work on the single-agent case (POMDP) [Sondik 78, Cassandra et al. 97, Hansen 98]
– Some work on the multi-agent case, but limited theoretical guarantees [Varaiya & Walrand 78, Nair et al. 03]
Our Contribution
• We extend DP to the multi-agent case
• For cooperative agents (DEC-POMDP):
– First optimal DP algorithm
• For noncooperative agents:
– First DP algorithm for iterated elimination of dominated strategies
• Unifies ideas from game theory and partially observable MDPs
Game Theory
• Normal form game
• Only one decision to make – no dynamics
• A mixed strategy is a distribution over strategies

         a1     a2
  b1    3,3    0,4
  b2    4,0    1,1
Solving Games
• One approach to solving games is iterated elimination of dominated strategies
• Roughly speaking, this removes all unreasonable strategies
• Unfortunately, can’t always prune down to a single strategy per player
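Iterated elimination can be made concrete on the 2×2 game from the earlier slide (rows b1, b2; columns a1, a2). A minimal sketch, restricted to *pure*-strategy strict dominance (the general test over mixed strategies needs the LP shown on the next slide):

```python
import numpy as np

def eliminate_strictly_dominated(R1, R2):
    """Iterated elimination of strictly dominated pure strategies.

    R1[i, j], R2[i, j] are the row and column player's payoffs.
    Returns the index sets of surviving rows and columns.
    """
    rows = list(range(R1.shape[0]))
    cols = list(range(R1.shape[1]))
    changed = True
    while changed:
        changed = False
        for i in list(rows):   # row i is dominated if some row k beats it everywhere
            if any(all(R1[k, j] > R1[i, j] for j in cols) for k in rows if k != i):
                rows.remove(i); changed = True
        for j in list(cols):   # symmetrically for columns
            if any(all(R2[i, k] > R2[i, j] for i in rows) for k in cols if k != j):
                cols.remove(j); changed = True
    return rows, cols

# The 2x2 game from the slide: rows b1, b2; columns a1, a2.
R1 = np.array([[3, 0], [4, 1]])   # row player's payoffs
R2 = np.array([[3, 4], [0, 1]])   # column player's payoffs
print(eliminate_strictly_dominated(R1, R2))   # ([1], [1]) -- only b2 and a2 survive
```

In this particular game elimination does reach a single strategy per player; the bullet above is the caveat that this is not true in general.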
Dominance
• A strategy is dominated if for every joint distribution over strategies for the other players, there is another strategy that is at least as good
• Dominance test looks like this:
• Can be done using linear programming
[Figure: value lines for strategies a1, a2, and a3 over the space of distributions on b1 and b2; a3 lies below the upper envelope of a1 and a2, so a3 is dominated]
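The LP can be sketched directly. This is my own illustration rather than the authors' implementation; it assumes SciPy's `linprog` is available, and the payoff numbers for a1, a2, a3 are hypothetical values chosen to match the figure:

```python
import numpy as np
from scipy.optimize import linprog   # assumes SciPy is available

def is_dominated(M, row, tol=1e-9):
    """Is strategy `row` weakly dominated by a mixture of the other rows?

    M[i, j] = payoff of strategy i against opponent strategy j.  Solves
        max eps  s.t.  sum_i x_i * M[i, j] >= M[row, j] + eps  for all j,
                       x a probability distribution over the other rows,
    and reports domination when the optimal eps is >= 0.
    """
    others = np.delete(M, row, axis=0)
    k, n = others.shape
    c = np.zeros(k + 1)
    c[-1] = -1.0                                    # linprog minimizes, so min -eps
    A_ub = np.hstack([-others.T, np.ones((n, 1))])  # eps - sum_i x_i M[i,j] <= -M[row,j]
    b_ub = -np.asarray(M, dtype=float)[row]
    A_eq = np.hstack([np.ones((1, k)), np.zeros((1, 1))])  # weights sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * k + [(None, None)], method="highs")
    return bool(res.success and -res.fun >= -tol)

# Hypothetical payoffs matching the figure: a3 sits below the mix of a1 and a2.
M = np.array([[3.0, 0.0],   # a1 against b1, b2
              [0.0, 3.0],   # a2
              [1.0, 1.0]])  # a3
print(is_dominated(M, 2), is_dominated(M, 0))   # True False
```

Here the 0.5/0.5 mixture of a1 and a2 yields 1.5 in both columns, beating a3's 1.0, so a3 is pruned.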
Dynamic Programming for POMDPs
• We’ll start with some important concepts:
[Figure: three concepts side by side — a policy tree (root action a1, branching on observations o1, o2 into subtrees rooted at a2 and a3), a belief state (s1: 0.25, s2: 0.40, s3: 0.35), and a linear value function over beliefs between s1 and s2]
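These concepts fit together numerically: each policy tree induces a linear value function, a vector with one entry per state (an "alpha vector"), and the value of a belief state is its dot product with the best such vector. A minimal sketch, with hypothetical alpha-vector values:

```python
import numpy as np

# Hypothetical alpha vectors for three policy trees over states s1, s2, s3:
# alphas[i][s] = expected reward of executing tree i starting in state s.
alphas = np.array([[2.0, 0.5, 1.0],
                   [0.0, 3.0, 1.0],
                   [1.0, 1.0, 1.5]])

belief = np.array([0.25, 0.40, 0.35])   # the belief state from the slide

values = alphas @ belief                # value of each tree at this belief
best = int(np.argmax(values))           # tree to execute at this belief
print(values, best)
```

The value of a belief is thus the upper surface of a set of linear functions, which is why pruning trees below that surface loses nothing.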
Dynamic Programming
[Figure: the depth-1 policy trees a1 and a2, with their linear value functions over beliefs between s1 and s2]
Dynamic Programming
[Figure: all eight depth-2 policy trees built from actions a1, a2 and observations o1, o2, with their value functions over beliefs between s1 and s2]
Dynamic Programming
[Figure: after pruning, four depth-2 trees remain]
Dynamic Programming
[Figure: the value functions of the remaining trees over beliefs between s1 and s2]
Properties of Dynamic Programming
• After T steps, the best policy tree for s0 is contained in the set
• The pruning test is exactly the same as in elimination of dominated strategies in normal form games
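The backup step that grows depth-t trees into depth-(t+1) trees can be sketched as follows. The model numbers are hypothetical, and the sketch performs only the exhaustive backup, leaving out the pruning pass:

```python
import numpy as np
from itertools import product

def dp_backup(alphas, P, O, R):
    """One exhaustive POMDP DP backup (a sketch, with no pruning).

    alphas : list of depth-t alpha vectors (arrays over states)
    P[a][s, s2] : transition probabilities;  O[a][s2, o] : observation
    probabilities;  R[s, a] : rewards.  Each depth-(t+1) tree picks a root
    action a and, per observation o, a depth-t subtree; its alpha vector is
        alpha(s) = R(s, a) + sum_{s2, o} P[a][s, s2] * O[a][s2, o] * alpha_o(s2).
    """
    S, nA = R.shape
    nO = O[0].shape[1]
    new = []
    for a in range(nA):
        for choice in product(range(len(alphas)), repeat=nO):
            alpha = np.array([
                R[s, a] + sum(P[a][s, s2] * O[a][s2, o] * alphas[choice[o]][s2]
                              for s2 in range(S) for o in range(nO))
                for s in range(S)])
            new.append(alpha)
    return new

# A tiny hypothetical model: 2 states, 2 actions, observations reveal the state.
P = [np.eye(2), np.eye(2)]
O = [np.eye(2), np.eye(2)]
R = np.array([[1.0, 0.0], [0.0, 1.0]])      # R[s, a]
depth1 = [R[:, a] for a in range(2)]        # depth-1 trees are single actions
depth2 = dp_backup(depth1, P, O, R)
print(len(depth2))                          # 2 * 2**2 = 8 depth-2 trees
```

Pruning then removes every new alpha vector that is dominated over the belief simplex — exactly the LP test from the normal-form setting.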
Partially Observable Stochastic Game
• Multiple agents control a Markov process
• Each can have a different observation and reward function

[Figure: agents 1 and 2 each send actions a1, a2 to the world and receive observations and rewards (o1, r1) and (o2, r2)]
POSG – Formal Definition
• A POSG is ⟨S, A1, A2, P, R1, R2, Ω1, Ω2, O⟩, where
– S is a finite state set, with initial state s0
– A1, A2 are finite action sets
– P(s, a1, a2, s′) is a state transition function
– R1(s, a1, a2) and R2(s, a1, a2) are reward functions
– Ω1, Ω2 are finite observation sets
– O(s, o1, o2) is an observation function
• Straightforward generalization to n agents
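The tuple transcribes directly into code. The array shapes below are one concrete encoding choice of mine, not prescribed by the model, and the tiny uniform instance is purely illustrative:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POSG:
    """A two-agent POSG ⟨S, A1, A2, P, R1, R2, Ω1, Ω2, O⟩ (a sketch)."""
    n_states: int        # |S|; states are 0 .. n_states-1
    s0: int              # initial state
    n_actions: tuple     # (|A1|, |A2|)
    P: np.ndarray        # P[s, a1, a2, s'] transition probabilities
    R1: np.ndarray       # R1[s, a1, a2] reward to agent 1
    R2: np.ndarray       # R2[s, a1, a2] reward to agent 2
    n_obs: tuple         # (|Ω1|, |Ω2|)
    O: np.ndarray        # O[s, o1, o2] joint observation probabilities

# A tiny uniform instance, just to show the shapes.
game = POSG(n_states=2, s0=0, n_actions=(2, 2),
            P=np.full((2, 2, 2, 2), 0.5),
            R1=np.zeros((2, 2, 2)), R2=np.zeros((2, 2, 2)),
            n_obs=(2, 2), O=np.full((2, 2, 2), 0.25))
```

A DEC-POMDP is the cooperative special case R1 = R2.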
POSG – More Definitions
• A local policy is a mapping δi : Ωi* → Ai
• A joint policy is a pair ⟨δ1, δ2⟩
• Each agent wants to maximize its own expected reward over T steps
• Although execution is distributed, planning is centralized
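A local policy is just a lookup from observation histories to actions, which is what makes distributed execution possible. A minimal sketch with hypothetical action and observation names (a depth-2 policy tree flattened into a dictionary):

```python
# A depth-2 local policy written as an explicit mapping Ωi* -> Ai.
policy = {(): "a1",          # root action, before any observation
          ("o1",): "a2",     # action after observing o1
          ("o2",): "a1"}     # action after observing o2

def act(policy, history):
    """Look up the action for the observation history seen so far."""
    return policy[tuple(history)]

print(act(policy, []), act(policy, ["o1"]))   # a1 a2
```

Planning is centralized because choosing good entries for each agent's table requires reasoning about the other agents' tables.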
Strategy Elimination in POSGs
• Could simply convert to normal form
• But the number of strategies is doubly exponential in the horizon length
[Table: the induced normal form — an m × n matrix whose (i, j) entry is the payoff pair (R1ij, R2ij), one row and column per complete policy tree]
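The blowup is easy to make concrete: a deterministic depth-T policy tree makes one action choice at every node of a complete |Ω|-ary tree, so the count of trees is |A| raised to the number of nodes:

```python
def num_policy_trees(n_actions, n_obs, horizon):
    """Number of deterministic depth-T policy trees: |A| ** (number of nodes
    in a complete n_obs-ary tree of depth T)."""
    nodes = (n_obs ** horizon - 1) // (n_obs - 1) if n_obs > 1 else horizon
    return n_actions ** nodes

print(num_policy_trees(2, 2, 2))   # 8  -- the eight depth-2 trees seen earlier
print(num_policy_trees(2, 2, 4))   # 32768 trees per agent already at horizon 4
```

Even with two actions and two observations, the normal form becomes astronomically large within a few steps, which is what the DP approach avoids.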
A Better Way to Do Elimination
• We use dynamic programming to eliminate dominated strategies without first converting to normal form
• Pruning a subtree eliminates the set of trees containing it
[Figure: pruning one depth-2 subtree from a depth-3 tree eliminates every larger tree that contains it]
Generalizing Dynamic Programming
• Build policy trees as in the single-agent case
• Pruning rule is a natural generalization

  Setting            What to prune   Space for pruning
  Normal form game   strategy        strategies of other agents
  POMDP              policy tree     states
  POSG               policy tree     states × policy trees of other agents
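The POSG row amounts to a bookkeeping change on the normal-form test: the dominance LP runs over distributions whose "columns" are (state, opponent-tree) pairs. With a hypothetical value table, the flattening is a one-liner:

```python
import numpy as np

# Hypothetical values: V[i, s, j] = value of our tree i when the hidden state
# is s and the other agent executes its tree j (2 own trees, 2 states, 2
# opponent trees -- all numbers are illustrative).
V = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.5, 0.5]]])

# Tree i is tested against joint distributions over (s, j): flatten those two
# axes so each row is a strategy and each column a point of the belief space,
# exactly the matrix form that the normal-form dominance LP consumes.
M = V.reshape(V.shape[0], -1)
print(M.shape)   # (2, 4)
```

The same LP as before then decides, for each tree, whether some mixture of the agent's other trees does at least as well against every such joint distribution.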
Dynamic Programming
[Figure: the depth-1 policy trees a1 and a2 for each of the two agents]
Dynamic Programming
[Figure: the full sets of depth-2 policy trees — all eight trees over actions a1, a2 and observations o1, o2 for each agent, 16 in total]
Dynamic Programming
[Figure: after a pruning pass, 14 of the 16 depth-2 trees remain across the two agents]
Dynamic Programming
[Figure: further pruning leaves 11 trees]
Dynamic Programming
[Figure: further pruning leaves 9 trees]
Dynamic Programming
[Figure: pruning converges with 8 trees remaining]
Correctness of Dynamic Programming
Theorem: DP performs iterated elimination of dominated strategies in the normal form of the POSG.
Corollary: DP can be used to find an optimal joint policy in a cooperative POSG.
Dynamic Programming in Practice
• Initial empirical results show that much pruning is possible
• Can solve problems with small state sets
• And we can import ideas from the POMDP literature to scale up to larger problems [Boutilier & Poole 96, Hauskrecht 00, Feng & Hansen 00, Hansen & Zhou 03, Theocharous & Kaelbling 03]
Conclusion
• First exact DP algorithm for POSGs
• Natural combination of two ideas:
– Iterated elimination of dominated strategies
– Dynamic programming for POMDPs
• Initial experiments on small problems, ideas for scaling to larger problems