
Decentralized Decision Making in Partially Observable, Uncertain Worlds

Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst

Joint work with Martin Allen, Christopher Amato, Daniel Bernstein, Alan Carlin, Claudia Goldman, Eric Hansen, Akshat Kumar, Marek Petrik, Sven Seuken, Feng Wu, and Xiaojian Wu

IJCAI’11 Workshop on Decision Making in Partially Observable, Uncertain Worlds Barcelona, Spain July 18, 2011

Decentralized Decision Making

- Challenge: How to achieve intelligent coordination of a group of decision makers in spite of stochasticity and partial observability?
- Key objective: Develop effective decision-theoretic methods to address the uncertainty about the domain, the outcomes of actions, and the knowledge, beliefs, and intentions of the other agents.

Problem Characteristics

- A group of decision makers or agents interact in a stochastic environment
- Each "episode" involves a sequence of decisions over a finite or infinite horizon
- The change in the environment is determined stochastically by the current state and the set of actions taken by the agents
- Each decision maker obtains different partial observations of the overall situation
- Decision makers have the same objectives

Applications

- Autonomous rovers for space exploration
- Protocol design for multi-access broadcast channels
- Coordination of mobile robots
- Decentralized detection and tracking
- Decentralized detection of hazardous weather events

Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion

Decentralized POMDP

- Generalization of the POMDP involving multiple cooperating decision makers with different observation functions

[Figure: two agents interacting with a shared world; each agent i receives its own observation oi and takes its own action ai, and the world generates a joint reward r]

DEC-POMDPs

- A DEC-POMDP is defined by a tuple 〈S, A1, A2, P, R, Ω1, Ω2, O〉, where
  - S is a finite set of domain states, with initial state s0
  - A1, A2 are finite action sets
  - P(s, a1, a2, s') is the state transition function
  - R(s, a1, a2) is the reward function
  - Ω1, Ω2 are finite observation sets
  - O(a1, a2, s', o1, o2) is the observation function
- Straightforward generalization to n agents (a toy encoding sketch follows)
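To make the tuple above concrete, here is a minimal Python sketch (my illustration, not from the slides) of how a two-agent DEC-POMDP could be encoded; the tiny transition, reward, and observation tables are invented purely for illustration.

```python
import itertools
from dataclasses import dataclass

@dataclass
class DecPOMDP:
    """Two-agent DEC-POMDP <S, A1, A2, P, R, Omega1, Omega2, O> (illustrative sketch)."""
    states: list
    actions1: list
    actions2: list
    obs1: list
    obs2: list
    P: dict    # (s, a1, a2) -> {s': prob}
    R: dict    # (s, a1, a2) -> reward
    O: dict    # (a1, a2, s') -> {(o1, o2): prob}
    b0: dict   # initial state distribution

    def validate(self):
        # Transition and observation distributions must each sum to one.
        for s, a1, a2 in itertools.product(self.states, self.actions1, self.actions2):
            assert abs(sum(self.P[(s, a1, a2)].values()) - 1.0) < 1e-9
            for s_next in self.states:
                assert abs(sum(self.O[(a1, a2, s_next)].values()) - 1.0) < 1e-9

# A deliberately tiny, made-up instance: two states, two actions and two observations per agent.
S, A, Omega = ["s0", "s1"], ["a", "b"], ["o0", "o1"]
P = {(s, a1, a2): {"s0": 0.5, "s1": 0.5} for s in S for a1 in A for a2 in A}
R = {(s, a1, a2): (1.0 if (s, a1, a2) == ("s1", "a", "a") else 0.0)
     for s in S for a1 in A for a2 in A}
O = {(a1, a2, sn): {(o1, o2): 0.25 for o1 in Omega for o2 in Omega}
     for a1 in A for a2 in A for sn in S}

model = DecPOMDP(S, A, A, Omega, Omega, P, R, O, b0={"s0": 1.0})
model.validate()
```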

Formal Models
[figure omitted]

Example: Mobile Robot Planning

- States: grid cell pairs
- Actions: ↑, ↓, ←, →
- Transitions: noisy
- Goal: meet quickly
- Observations: red lines (shown in the original slide's figure)

Example: Cooperative Box-Pushing

- Goal: push as many boxes as possible to the goal area; the larger box yields a higher reward but requires both agents to move it.

Solving DEC-POMDPs

- Each agent's behavior is described by a local policy δi
- A policy can be represented as a mapping from (see the sketch below)
  - local observation sequences to actions; or
  - local memory states to actions
- Actions can be selected deterministically or stochastically
- Goal: maximize expected reward over a finite horizon or a discounted infinite horizon
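As a hedged illustration of the two policy representations listed above, the sketch below encodes a horizon-2 history-based local policy and an equivalent finite-state controller for one agent; the action and observation names ("listen", "hear-left", etc.) are made up for this example.

```python
# Local policy of one agent as a mapping from observation histories to actions (horizon 2).
# Observation and action names are illustrative only.
history_policy = {
    (): "listen",                     # first step: no observations yet
    ("hear-left",): "open-right",
    ("hear-right",): "open-left",
}

# The same behavior as a (deterministic) finite-state controller:
# memory states -> actions, plus a transition rule (memory state, observation) -> next memory state.
fsc_actions = {"q0": "listen", "qL": "open-right", "qR": "open-left"}
fsc_transitions = {
    ("q0", "hear-left"): "qL",
    ("q0", "hear-right"): "qR",
}

def act_from_history(history):
    return history_policy[tuple(history)]

def act_from_controller(state, observation=None):
    next_state = fsc_transitions[(state, observation)] if observation is not None else state
    return next_state, fsc_actions[next_state]

print(act_from_history(["hear-left"]))          # open-right
print(act_from_controller("q0", "hear-left"))   # ('qL', 'open-right')
```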

Work on Decentralized Decision Making and DEC-POMDPs

- Team theory [Marschak 55, Tsitsiklis & Papadimitriou 82]
- Incorporating dynamics [Witsenhausen 71]
- Communication strategies [Varaiya & Walrand 78, Xuan et al. 01, Pynadath & Tambe 02]
- Approximation algorithms [Peshkin et al. 00, Guestrin et al. 01, Nair et al. 03, Emery-Montemerlo et al. 04]
- First exact DP algorithm [Hansen et al. 04]
- First policy iteration algorithm [Bernstein et al. 05]
- Many recent exact and approximate DEC-POMDP algorithms

Some Fundamental Questions

- Are DEC-POMDPs significantly harder to solve than POMDPs? Why?
- What features of the problem domain affect the complexity, and how?
- Is optimal dynamic programming possible?
- Can dynamic programming be made practical?
- Is it beneficial to treat communication as a separate type of action?
- How can we exploit the locality of agent interaction to develop more scalable algorithms?

Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion

Previous Complexity Results

Finite horizon:
- MDP: P-complete (if T < |S|) [Papadimitriou & Tsitsiklis 87]
- POMDP: PSPACE-complete (if T < |S|) [Papadimitriou & Tsitsiklis 87]

Infinite-horizon discounted:
- MDP: P-complete [Papadimitriou & Tsitsiklis 87]
- POMDP: Undecidable [Madani et al. 99]

How Hard are DEC-POMDPs?
Bernstein, Givan, Immerman & Zilberstein, UAI 2000, MOR 2002

- The complexity of finite-horizon DEC-POMDPs had been hard to establish.
- A static version of the problem, where a single set of decisions is made in response to a single set of observations, was shown to be NP-hard [Tsitsiklis and Athans, 1985]
- We proved that two-agent finite-horizon DEC-POMDPs are NEXP-hard
- But these are worst-case results! Are real-world problems easier?

What Features of the Domain Affect the Complexity and How?

- Factored state spaces (structured domains)
- Independent transitions (IT)
- Independent observations (IO)
- Structured reward function (SR)
- Goal-oriented objectives (GO)
- Degree of observability (partial, full, jointly full)
- Degree and structure of interaction
- Degree of information sharing and communication

Complexity of Sub-Classes
Goldman & Zilberstein, JAIR 2004

[Figure: a map of finite-horizon DEC-MDP sub-classes and their complexities (P-complete, NP-complete, NEXP-complete), covering the general DEC-MDP, IO & IT, goal-oriented objectives, |G| = 1, |G| > 1, certain conditions, and information sharing]

Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion

JESP: First DP Algorithm
Nair, Tambe, Yokoo, Pynadath & Marsella, IJCAI 2003

- JESP: Joint Equilibrium-based Search for Policies
- Complexity: exponential
- Result: only locally optimal solutions

Is Exact DP Possible?

- The key to solving POMDPs is that they can be viewed as belief-state MDPs [Smallwood & Sondik 73]
- It is not as clear how to define a belief-state MDP for a DEC-POMDP
- The first exact DP algorithm for finite-horizon DEC-POMDPs uses the notion of a generalized belief state
- The algorithm also applies to competitive situations modeled as POSGs

Generalized Belief State

A generalized belief state captures the uncertainty of one agent with respect to the state of the world as well as the policies of the other agents.

Strategy Elimination

- Any finite-horizon DEC-POMDP can be converted to a normal form game
- But the number of strategies is doubly exponential in the horizon length!

[Payoff matrix: rows are agent 1's strategies 1..m, columns are agent 2's strategies 1..n, and entry (i, j) is the value pair (R¹ᵢⱼ, R²ᵢⱼ)]

A Better Way to Do Elimination
Hansen, Bernstein & Zilberstein, AAAI 2004

- We can use dynamic programming to eliminate dominated strategies without first converting to normal form
- Pruning a subtree eliminates the set of trees containing it

[Figure: pruning a policy subtree eliminates every larger policy tree that contains it]

First Exact DP for DEC-POMDPs
Hansen, Bernstein & Zilberstein, AAAI 2004

- Theorem: DP performs iterated elimination of dominated strategies in the normal form of the POSG.
- Corollary: DP can be used to find an optimal joint policy in a DEC-POMDP.
- The algorithm is complete and optimal
- Complexity is doubly exponential (see the backup sketch below)
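The exact DP algorithm alternates exhaustive backups with elimination of dominated policy trees. The following sketch (my own illustration, not the authors' code) shows only the backup step: every depth-(t+1) tree picks a root action and assigns an existing depth-t tree to each observation, which is why the candidate sets grow doubly exponentially before pruning.

```python
import itertools

def exhaustive_backup(trees, actions, observations):
    """Given a set of depth-t policy trees for one agent, build all depth-(t+1) trees.

    A tree is a pair (root_action, {observation: subtree}); a depth-1 tree is (action, {}).
    """
    new_trees = []
    for a in actions:
        # One existing subtree is chosen independently for every observation.
        for subtrees in itertools.product(trees, repeat=len(observations)):
            new_trees.append((a, dict(zip(observations, subtrees))))
    return new_trees

actions, observations = ["a1", "a2"], ["o1", "o2"]
trees = [(a, {}) for a in actions]            # depth-1 trees
for t in range(2, 5):
    trees = exhaustive_backup(trees, actions, observations)
    # |A| * |Q|^|O| candidates per agent before any pruning
    print(f"depth {t}: {len(trees)} policy trees")
```

In the full algorithm, dominated trees are eliminated between backups, which is what keeps the procedure feasible on small problems.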

Alternative: Heuristic Search
Szer, Charpillet & Zilberstein, UAI 2005

- Perform forward best-first search in the space of joint policies
- Take advantage of a known start state distribution
- Take advantage of domain-independent heuristics for pruning

The MAA* Algorithm
Szer, Charpillet & Zilberstein, UAI 2005

- MAA* is complete and optimal
- Main advantage: significant reduction in memory requirements compared to the dynamic programming approach

Scaling Up Heuristic Search
Spaan, Oliehoek, and Amato, IJCAI 2011

- Problem with MAA*: the number of children of a node is doubly exponential in the node's depth
- Basic idea: avoid the full expansion of each node by incrementally generating children only when a child might have a higher heuristic value
- Introduces a more memory-efficient representation for heuristic functions
- Yields a speedup over the state of the art, allowing optimal solutions over longer horizons


Memory-Bounded DP (MBDP)
Seuken & Zilberstein, IJCAI 2007

- Combines the two approaches:
  - the DP algorithm is a bottom-up approach
  - the search operates top-down
- The DP step can only eliminate a policy tree if it is dominated for every belief state
- But only a small subset of the belief space is actually reachable
- Furthermore, the combined approach allows the algorithm to focus on a small subset of joint policies that appear best

Memory-Bounded DP (Cont.)

The MBDP Algorithm
[algorithm listing omitted; a rough sketch of the selection step follows]
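The published MBDP pseudocode is not reproduced in this transcript, so the following is only a rough sketch, under my own simplifying assumptions, of its memory-bounding selection step: after a backup, each agent keeps at most max_trees policy trees, chosen by evaluating candidate joint trees at a few heuristically sampled belief states. The value vectors and beliefs below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

num_states, max_trees = 4, 3
# Placeholder: value of each candidate joint tree (i, j) as a vector over states,
# as would be computed by evaluating the trees after an exhaustive backup.
candidates1, candidates2 = range(6), range(6)
values = rng.uniform(0, 10, size=(len(candidates1), len(candidates2), num_states))

# Placeholder belief states "sampled" by a heuristic policy (e.g., the MDP heuristic).
beliefs = rng.dirichlet(np.ones(num_states), size=max_trees)

keep1, keep2 = set(), set()
for b in beliefs:
    # Pick the joint tree with the highest expected value at this belief.
    expected = values @ b                      # shape: (|candidates1|, |candidates2|)
    i, j = np.unravel_index(np.argmax(expected), expected.shape)
    keep1.add(candidates1[i])
    keep2.add(candidates2[j])

print("kept trees, agent 1:", sorted(keep1))
print("kept trees, agent 2:", sorted(keep2))
```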

Generating "Good" Belief States

- MDP heuristic: obtained by solving the corresponding fully observable multiagent MDP
- Infinite-horizon heuristic: obtained by solving the corresponding infinite-horizon DEC-POMDP
- Random policy heuristic: can augment another heuristic by adding random exploration
- Heuristic portfolio: maintain a set of belief states generated by a set of different heuristics
- Recursive MBDP

Performance of MBDP
[results figure omitted]

MBDP Successors

- Improved MBDP (IMBDP) [Seuken and Zilberstein, UAI 2007]
- MBDP with Observation Compression (MBDP-OC) [Carlin and Zilberstein, AAMAS 2008]
- Point-Based Incremental Pruning (PBIP) [Dibangoye, Mouaddib, and Chaib-draa, AAMAS 2009]
- PBIP with Incremental Policy Generation (PBIP-IPG) [Amato, Dibangoye, Zilberstein, AAAI 2009]
- Constraint-Based Dynamic Programming (CBDP) [Kumar and Zilberstein, AAMAS 2009]
- Point-Based Backup for Decentralized POMDPs [Kumar and Zilberstein, AAMAS 2010]
- Point-Based Policy Generation (PBPG) [Wu, Zilberstein, and Chen, AAMAS 2010]

Key Ideas Behind These Algorithms

- Perform search in a reduced policy space
- Exact algorithms perform only lossless pruning
- Approximate algorithms rely on more aggressive pruning
- MBDP represents an exponentially large policy in linear space, O(maxTrees × T)
- The resulting policy is an acyclic finite-state controller

Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion

Infinite-Horizon DEC-POMDPs

- It is unclear how to define a compact belief state without fixing the policies of the other agents
- Value iteration does not generalize to the infinite-horizon case
- Policy iteration for POMDPs can be generalized [Hansen 98, Poupart & Boutilier 04]
- Basic idea: represent local policies using (deterministic or stochastic) finite-state controllers and define a set of controller transformations that guarantee improvement and convergence

Policies as Controllers

- A finite-state controller represents each local policy
  - fixed memory
  - randomness is used to offset memory limitations
  - action selection: ψ : Qi → ΔAi
  - transitions: η : Qi × Ai × Oi → ΔQi
- Value of a two-agent joint controller given by the Bellman equation:

V(q1, q2, s) = ∑_{a1,a2} P(a1|q1) P(a2|q2) [ R(s, a1, a2) + γ ∑_{s'} P(s'|s, a1, a2) ∑_{o1,o2} O(o1, o2|s', a1, a2) ∑_{q1',q2'} P(q1'|q1, a1, o1) P(q2'|q2, a2, o2) V(q1', q2', s') ]
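Once the controller parameters are fixed, the Bellman equation above is linear in V, so a joint controller can be evaluated by solving one linear system. The sketch below does this with NumPy on a randomly generated two-agent model and random stochastic controllers; all sizes and distributions are synthetic and only meant to mirror the structure of the equation.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, nO, nQ, gamma = 3, 2, 2, 2, 0.9   # tiny synthetic sizes

def rand_dist(*shape):
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)

P = rand_dist(nS, nA, nA, nS)                                # P(s' | s, a1, a2)
O = rand_dist(nS, nA, nA, nO * nO).reshape(nS, nA, nA, nO, nO)  # O(o1, o2 | s', a1, a2)
R = rng.random((nS, nA, nA))                                 # R(s, a1, a2)
psi1, psi2 = rand_dist(nQ, nA), rand_dist(nQ, nA)            # P(ai | qi)
eta1, eta2 = rand_dist(nQ, nA, nO, nQ), rand_dist(nQ, nA, nO, nQ)  # P(qi' | qi, ai, oi)

def idx(q1, q2, s):
    return (q1 * nQ + q2) * nS + s

n = nQ * nQ * nS
A_mat, b = np.eye(n), np.zeros(n)

# Build the linear system (I - gamma * T) V = expected immediate reward.
for q1, q2, s in np.ndindex(nQ, nQ, nS):
    row = idx(q1, q2, s)
    for a1, a2 in np.ndindex(nA, nA):
        w = psi1[q1, a1] * psi2[q2, a2]
        b[row] += w * R[s, a1, a2]
        for s2, o1, o2, r1, r2 in np.ndindex(nS, nO, nO, nQ, nQ):
            coeff = (w * gamma * P[s, a1, a2, s2] * O[s2, a1, a2, o1, o2]
                     * eta1[q1, a1, o1, r1] * eta2[q2, a2, o2, r2])
            A_mat[row, idx(r1, r2, s2)] -= coeff

V = np.linalg.solve(A_mat, b)             # V(q1, q2, s), flattened
print("V(q1=0, q2=0, s=0) =", V[idx(0, 0, 0)])
```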

Controller Example

- Stochastic controller for one agent
  - 2 nodes, 2 actions, 2 observations
  - Parameters: P(ai | qi) and P(qi' | qi, oi)

[Figure: a two-node stochastic controller with edges labeled by observation-conditioned transition probabilities such as 0.5/0.5, 0.75/0.25, and 1.0]

Finding Optimal Controllers

- How can we search the space of possible joint controllers?
- How do we set the parameters of the controllers to maximize value?
- Deterministic controllers: can use traditional search methods such as BFS or B&B
- Stochastic controllers: a continuous optimization problem
- Key question: how to best use a limited amount of memory to optimize value?

Independent Joint Controllers

- The local controller for agent i is defined by a conditional distribution P(ai, qi' | qi, oi)
- An independent joint controller is expressed by the product Πi P(ai, qi' | qi, oi)
- Can be represented as a dynamic Bayes net

[Figure: dynamic Bayes net over states st, controller nodes qit, actions ait, and observations oit]

Correlated Joint Controllers
Bernstein, Hansen & Zilberstein, IJCAI 2005, JAIR 2009

- A correlation device, [Qc, ψ], is a set of nodes and a stochastic state transition function
- Joint controller: ∑_{qc'} P(qc' | qc) Πi P(ai, qi' | qi, oi, qc)
- A shared source of randomness affecting decisions and memory-state updates
- Random bits for the correlation device can be determined prior to execution time (see the execution sketch below)

[Figure: dynamic Bayes net for a correlated joint controller, with the correlation device node qc influencing both agents' action selection and controller transitions]
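To show how a correlation device is used at execution time, here is a small simulation sketch under my own assumptions (it is not code from the paper): each agent conditions its action and memory update on the shared device state, and the device evolves independently of the agents' observations. All distributions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
nQ, nA, nO, nC = 2, 2, 2, 2   # local nodes, actions, observations, device states

def rand_dist(*shape):
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)

device = rand_dist(nC, nC)                       # psi: P(qc' | qc)
# Local parameters P(ai, qi' | qi, oi, qc), flattened over the joint (ai, qi') outcome.
local = [rand_dist(nQ, nO, nC, nA * nQ) for _ in range(2)]

def step(q_locals, qc, observations):
    """One decentralized execution step with a shared correlation signal qc."""
    actions, next_q = [], []
    for i, (qi, oi) in enumerate(zip(q_locals, observations)):
        joint = rng.choice(nA * nQ, p=local[i][qi, oi, qc])
        actions.append(joint // nQ)              # ai
        next_q.append(joint % nQ)                # qi'
    qc_next = rng.choice(nC, p=device[qc])       # device transitions on its own
    return actions, next_q, qc_next

# The device's random bits could be drawn before execution; here we just simulate a few steps.
q_locals, qc = [0, 0], 0
for t in range(3):
    obs = [rng.integers(nO), rng.integers(nO)]
    actions, q_locals, qc = step(q_locals, qc, obs)
    print(f"t={t}: obs={obs}, actions={actions}, nodes={q_locals}, device={qc}")
```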

Exhaustive Backups

- Add a node for every possible action and deterministic transition rule
- Repeated backups converge to optimality, but lead to very large controllers

[Figure: a small two-agent joint controller before and after an exhaustive backup adds all candidate nodes]

Value-Preserving Transformations

- A value-preserving transformation changes the joint controller without sacrificing value
- Formally, there must exist mappings fi : Qi → ΔRi for each agent i and fc : Qc → ΔRc such that, for all s ∈ S, q ∈ Q, and qc ∈ Qc:

V(s, q, qc) ≤ ∑_{r, rc} P(r | q) P(rc | qc) V(s, r, rc)

(here q and r denote joint nodes, one component per agent)

Bounded Policy Iteration Algorithm
Bernstein, Hansen & Zilberstein, IJCAI 2005, JAIR 2009

Repeat
  1) Evaluate the controller
  2) Perform an exhaustive backup
  3) Perform value-preserving transformations
Until the controller is ε-optimal for all states

Theorem: For any ε, bounded policy iteration returns a joint controller that is ε-optimal for all initial states in a finite number of iterations.

Useful Transformations

- Controller reductions: shrink the controller without sacrificing value
- Bounded dynamic programming updates: increase value while keeping the size fixed
- Both can be done using polynomial-size linear programs
- Generalize ideas from the POMDP literature, particularly the BPI algorithm [Poupart & Boutilier 03]

Controller Reduction

- For some node qi, find a convex combination of nodes in Qi \ {qi} that dominates qi for all states and all nodes of the other controllers; merge qi into the convex combination by changing transition probabilities
- Corresponding linear program (an LP sketch follows):
  Variables: ε, P(q̂i)
  Objective: maximize ε
  Constraints: ∀ s ∈ S, q–i ∈ Q–i, qc ∈ Qc:
    V(s, qi, q–i, qc) + ε ≤ ∑_{q̂i} P(q̂i) V(s, q̂i, q–i, qc)
- Theorem: A controller reduction is a value-preserving transformation.
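Below is a hedged sketch of that linear program using scipy.optimize.linprog, with a random array standing in for the real value function V(s, qi, q–i, qc); if the optimal ε is non-negative, node qi is dominated by the convex combination P(q̂i) and can be merged away.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
nS, nQ, nQother, nC = 3, 4, 3, 2        # states, my nodes, other agent's nodes, device nodes
V = rng.random((nS, nQ, nQother, nC))   # placeholder value function V(s, q_i, q_-i, q_c)
qi = 0                                  # node we try to remove
others = [q for q in range(nQ) if q != qi]

# Variables: x = [epsilon, p(q_hat) for q_hat in others]; maximize epsilon == minimize -epsilon.
c = np.zeros(1 + len(others))
c[0] = -1.0

A_ub, b_ub = [], []
for s, qo, qc in np.ndindex(nS, nQother, nC):
    # V(s, qi, qo, qc) + eps <= sum_qhat p(qhat) V(s, qhat, qo, qc)
    row = np.concatenate(([1.0], [-V[s, qh, qo, qc] for qh in others]))
    A_ub.append(row)
    b_ub.append(-V[s, qi, qo, qc])

A_eq = [np.concatenate(([0.0], np.ones(len(others))))]   # mixture probabilities sum to 1
bounds = [(None, None)] + [(0.0, 1.0)] * len(others)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
eps, p = res.x[0], res.x[1:]
print("epsilon:", eps, " dominated:", eps >= 0, " mixture:", p)
```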

Bounded DP Update

- For some node qi, find better parameters assuming that the old parameters will be used from the second step onwards; the new parameters must yield value at least as high for all states and all nodes of the other controllers
- Corresponding linear program:
  Variables: ε, P(ai, qi' | qi, oi, qc)
  Objective: maximize ε
  Constraints: ∀ s ∈ S, q–i ∈ Q–i, qc ∈ Qc:
    V(s, q, qc) + ε ≤ ∑_{a} P(a | q, qc) [ R(s, a) + γ ∑_{s', o, q', qc'} P(q' | q, a, o, qc) P(s', o | s, a) P(qc' | qc) V(s', q', qc') ]
- Theorem: A bounded DP update is a value-preserving transformation.

Modifying the Correlation Device

- Both transformations can be applied to the correlation device
- Slightly different linear programs to solve
- Can think of the correlation device as another agent
- Lots of implementation questions:
  - What to use as an initial joint controller?
  - Which transformations to perform?
  - In what order to choose nodes to remove or improve?

Decentralized BPI Summary

- DEC-BPI finds better and much more compact solutions than exhaustive backups
- A larger correlation device tends to lead to higher values on average
- Larger local controllers tend to yield higher average values, up to a point
- But bounded DP is limited by improving one controller at a time
- The linear program (one-step lookahead) results in local optimality and tends to "get stuck"

Nonlinear Optimization Approach
Amato, Bernstein & Zilberstein, UAI 2007, JAAMAS 2010

- Basic idea: model the problem as a nonlinear program (NLP)
- Consider node values (as well as controller parameters) as variables
- The NLP can take advantage of an initial state distribution when it is given
- Improvement and evaluation all in one step (equivalent to an infinite lookahead)
- Additional constraints maintain valid values

NLP Representation

Variables: x(q, a) = P(a | q),  y(q, a, o, q') = P(q' | q, a, o),  z(q, s) = V(q, s)

Objective: maximize ∑_{s} b0(s) z(q⁰, s)

Value constraints: ∀ s ∈ S, q ∈ Q:
  z(q, s) = ∑_{a} x(q, a) [ R(s, a) + γ ∑_{s'} P(s' | s, a) ∑_{o} O(o | s', a) ∑_{q'} y(q, a, o, q') z(q', s') ]

Additional linear constraints:
- ensure controllers are independent
- all probabilities sum to 1 and are non-negative

(A quick size estimate of this program follows below.)
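To get a feel for how quickly this program grows, the short calculation below (my own, with arbitrary example sizes) counts the x, y, and z variables and the value constraints for a two-agent problem.

```python
def nlp_size(nS, nA, nO, nQ, n_agents=2):
    """Count variables and value constraints of the fixed-size controller NLP
    for n symmetric agents (illustrative arithmetic only)."""
    jQ, jA, jO = nQ ** n_agents, nA ** n_agents, nO ** n_agents  # joint nodes/actions/observations
    x = jQ * jA                    # x(q, a)
    y = jQ * jA * jO * jQ          # y(q, a, o, q')
    z = jQ * nS                    # z(q, s)
    value_constraints = nS * jQ
    return {"x": x, "y": y, "z": z, "value constraints": value_constraints}

# Arbitrary example: 4 states, 2 actions, 2 observations, 3 controller nodes per agent.
print(nlp_size(nS=4, nA=2, nO=2, nQ=3))
```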

Independence Constraints

- Independence constraints guarantee that action selection and controller transition probabilities for each agent depend only on local information
- Action selection independence
- Controller transition independence

Probability Constraints

- Probability constraints guarantee that action selection and controller transition probabilities are non-negative and sum to 1
  (in the original slide's equations, superscript f's represent arbitrary fixed values)

Optimality

- Theorem: An optimal solution of the NLP yields optimal stochastic controllers for the given size and initial state distribution.
- Advantages of the NLP approach:
  - efficient policy representation with fixed memory
  - the NLP represents the optimal policy for the given size
  - takes advantage of the known start state
  - easy to implement using off-the-shelf solvers
- Limitations:
  - difficult to solve optimally

Adding a Correlation Device

- The NLP approach can be extended to include a correlation device: a new variable w(c, c') represents the transition function of the device, and action selection and controller transitions condition on the new shared signal.

Comparison of NLP & DEC-BPI
Amato, Bernstein & Zilberstein, UAI 2007, JAAMAS 2010

- Used a freely available nonlinear constrained optimization solver called "filter" on the NEOS server (http://www-neos.mcs.anl.gov/neos/)
- The solver guarantees a locally optimal solution
- Used 10 random initial controllers for a range of controller sizes
- Compared NLP with DEC-BPI, with and without a small (2-node) correlation device

Results: Broadcast Channel
Amato, Bernstein & Zilberstein, UAI 2007

- A simple two-agent networking problem (2 agents, 4 states, 2 actions, 5 observations)
- Average solution quality over 10 trials and average run time were reported

Results: Multi-Agent Tiger
Amato, Bernstein & Zilberstein, JAAMAS 2010

- A two-agent version of a well-known POMDP benchmark [Nair et al. 03] (2 states, 3 actions, 2 observations)
- Reported the average quality of various controller sizes using the NLP methods with and without a 2-node correlation device, and BFS

Results: Meeting in a Grid
Amato, Bernstein & Zilberstein, JAAMAS 2010

- A two-agent domain with 16 states, 5 actions, 2 observations
- Reported the average quality of various controller sizes using the NLP methods and DEC-BPI with and without a 2-node correlation device, and BFS

Results: Box Pushing
Amato, Bernstein & Zilberstein, JAAMAS 2010

- Values and running times (in seconds) for each controller size using the NLP methods and DEC-BPI with and without a 2-node correlation device, and BFS; an "x" indicates that the approach was not able to solve the problem.

NLP Approach Summary

- The NLP defines the optimal fixed-size stochastic controller
- The approach shows consistent improvement over DEC-BPI using an off-the-shelf, locally optimal solver
- A small correlation device can have significant benefits
- Better performance may be obtained by exploiting the structure of the NLP

Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion

Exploiting the Locality of Interaction

- In practical settings that involve many agents, each agent often interacts with a small number of "neighboring" agents (e.g., firefighting, sensor networks)
- Algorithms designed to exploit this property include LID-JESP [Nair et al. AAAI 05], SPIDER [Varakantham et al. AAMAS 07], and FANS [Marecki et al. AAMAS 08]
- FANS uses finite-state controllers (FSCs) for policy representation and
  - exploits FSCs for dynamic programming in policy evaluation and heuristic computations, providing significant speedups
  - introduces novel heuristics to automatically vary the FSC size across agents
  - performs policy search that exploits the locality of agent interactions

Constraint-Based DP
Kumar & Zilberstein, AAMAS 2009

- Models the domain as a Network Distributed POMDP (ND-POMDP), a restricted class of DEC-POMDPs characterized by a decomposable reward function.
- CBDP uses point-based dynamic programming (similar to MBDP).
- CBDP uses constraint network algorithms to improve the efficiency of key steps:
  - computation of the heuristic function
  - belief sampling using the heuristic function
  - finding the best joint policy for a particular belief

Results: Sensors Tracking Target
Kumar & Zilberstein, AAMAS 2009

- CBDP provides orders of magnitude of speedup over FANS
- Provides better solution quality on all test instances
- Provides strong theoretical guarantees on time and space complexity, enhancing scalability:
  - linear complexity in the planning-horizon length
  - linear in the number of agents, which is necessary to solve large realistic problems
  - exponential only in a small parameter that depends on the level of interaction among the agents

Sample Results

- A 7-agent configuration with 4 actions (N, S, E, W) per agent; two adjacent agents are required to track a target
- Graphs show the solution quality (left) and running time (right) of our approach (CBDP) compared with the best existing method (FANS)
- FANS is not scalable beyond horizon 7; CBDP has linear complexity in the horizon and provides better solution quality in less time

[Figure: solution quality and running time (seconds, log scale) of CBDP vs. FANS for horizons 2 through 10]

New Scalable Approach
Kumar, Zilberstein, and Toussaint, IJCAI 2011

- Extends an approach [Toussaint and Storkey, ICML 06] that maps planning under uncertainty (POMDP) problems into probabilistic inference
- Characterizes general constraints on the interaction graph that facilitate scalable planning
- Introduces an efficient algorithm to solve such models using probabilistic inference
- Identifies a number of existing models that satisfy such constraints

Value Factorization

- θi denotes the policy parameters of agent i
- Factored state space: s = (s1, . . . , sM)
- Example: consider four agents with value factors such that V = V12 + V23 + V34 (a toy example follows)
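A toy illustration of the value-factorization property (my example, with random stand-in tables): with four agents and factors {1,2}, {2,3}, {3,4}, the joint value is the sum of small local-factor values, each depending only on its own agents' local states and parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n_local_states, n_params = 3, 2   # per-agent local-state and policy-parameter sizes (toy)

# One value table per factor, indexed by the local states and parameters of its two agents.
factors = {(1, 2): rng.random((n_local_states,) * 2 + (n_params,) * 2),
           (2, 3): rng.random((n_local_states,) * 2 + (n_params,) * 2),
           (3, 4): rng.random((n_local_states,) * 2 + (n_params,) * 2)}

def joint_value(local_states, params):
    """V = V12 + V23 + V34: evaluating the joint value touches each small factor once."""
    total = 0.0
    for (i, j), table in factors.items():
        total += table[local_states[i], local_states[j], params[i], params[j]]
    return total

local_states = {1: 0, 2: 2, 3: 1, 4: 0}
params = {1: 1, 2: 0, 3: 1, 4: 0}
print("V =", joint_value(local_states, params))
```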

Existing Models Satisfy VF

- Each agent / state variable can participate in multiple value factors
- Worst-case complexity remains NEXP-complete
- TI-DEC-MDP, ND-POMDP, and TD-POMDP satisfy value factorization

Computational Advantages

- Applicability
  - In models that satisfy VF, inference in the EM framework can be done independently in each value factor
  - Smaller value factors ⇒ more efficient inference
  - Planning is no longer exponential in the number of agents; it is linear in the number of factors
- Implementation
  - Distributed planning
  - Efficient implementation using message passing
  - Parallel computation of messages

Planning by Inference

- Recasts planning as likelihood maximization in a DBN mixture with a binary reward variable r (a rescaling sketch follows):
  P(r = 1 | s, a1, a2) ∝ R(s, a1, a2)

[Figure: the DBN mixture]
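The reduction to inference relies on rescaling the reward so that a binary variable r can "observe" it, exactly as in the proportionality above. A minimal sketch of that rescaling, assuming bounded rewards and using invented numbers:

```python
import numpy as np

rng = np.random.default_rng(5)
R = rng.uniform(-5.0, 10.0, size=(3, 2, 2))     # toy R(s, a1, a2)

# Shift and scale rewards into [0, 1] so they can be read as P(r = 1 | s, a1, a2).
R_min, R_max = R.min(), R.max()
p_r = (R - R_min) / (R_max - R_min)

# Because the transformation is affine and order-preserving, maximizing the likelihood
# of observing r = 1 in the DBN mixture corresponds to maximizing expected reward.
assert p_r.min() >= 0.0 and p_r.max() <= 1.0
print(p_r[0])
```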

Exploiting the VF Property

- Exploit the additive nature of the value function for scalability
- The outer mixture simulates the VF property
- Each Vf(θf, sf) is evaluated using a time-dependent mixture
- Theorem: Maximizing the likelihood of observing the variable r = 1 optimizes the joint policy

The Expectation-Maximization Algorithm

- Observed data: r = 1; every other variable is hidden
- Use the EM algorithm to maximize the likelihood
- Implemented using message passing on the VF graph
- Example: 3 factors {Ag1, Ag2}, {Ag2, Ag3}, and {Ag3, Ag4}

Properties of the EM Algorithm

- Scalability
  - The μ message requires independent inference in each factor
  - Agents/state variables can be involved in multiple factors, so complex systems can be modeled via simpler interactions
  - Distributed planning via message passing
- Complexity
  - Linear in the number of factors, exponential in the number of agents/state variables in a factor
- Generality
  - No additional assumptions (such as TOI) required; a general optimization recipe for models with the VF property
  - Local optima?

Experiments

- ND-POMDP domains involving target tracking in sensor networks with imperfect sensing
- Multiple targets; limited sensors with batteries
- Penalty of -1 per sensor for miscoordination or for recharging the battery; positive reward (+80) per target scanned simultaneously by two adjacent sensors

Comparisons with the NLP Approach (5P Domain)
[results figure omitted]

Scalability on Larger Benchmarks

- 15-agent and 20-agent domains, with 5 internal states per agent

Summary of the EM Approach

- Value factorization (VF) facilitates scalability
- Several existing weakly coupled models satisfy VF
- An EM algorithm can solve models with this property and yields good-quality solutions
- Scalability: the E-step decomposes according to value factors; smaller factors lead to more efficient inference
- Can easily be implemented using message passing among the agents
- Future work: explore techniques for even faster inference and establish better error bounds

Outline

- Models for decentralized decision making
- Complexity results
- Solving finite-horizon DEC-POMDPs
- Solving infinite-horizon DEC-POMDPs
- Scalability beyond two agents
- Conclusion

Back to Some Basic Questions

- Are DEC-POMDPs significantly harder to solve than POMDPs? Why?
- What features of the problem domain affect the complexity, and how?
- Is optimal dynamic programming possible?
- Can dynamic programming be made practical?
- Is it beneficial to treat communication as a separate type of action?
- How can we exploit the locality of agent interaction to develop more scalable algorithms?

Questions?

Additional Information:
Resource-Bounded Reasoning Lab
University of Massachusetts, Amherst
http://rbr.cs.umass.edu