Exploiting Coordination Locales in DisPOMDPs via Social Model Shaping
Pradeep Varakantham, Singapore Management University
Joint work with J.Y. Kwak, M. Taylor, J. Marecki, P. Scerri, M. Tambe


TRANSCRIPT

Page 1:

Exploiting Coordination Locales in DisPOMDPs via Social Model Shaping

Pradeep Varakantham Singapore Management University

Joint work with J.Y.Kwak, M.Taylor, J. Marecki, P. Scerri, M.Tambe

Page 2: Motivating Domains

Disaster rescue, sensor networks

Characteristics of these domains: uncertainty, coordination of multiple agents, sequential decision making

Page 3: Meeting the Challenges

Problem: Multiple agents coordinating to perform multiple tasks in the presence of uncertainty

Solution: Represent as a Distributed POMDP and solve. Computing an optimal solution is NEXP-complete, so we use an approximate algorithm that dynamically exploits structure in interactions.

Result: Vast improvement in performance over existing algorithms

Page 4:

Outline

Illustrative Domain

Model

Approach: Exploit dynamic structure in interactions

Results

Page 5: Illustrative Domain

Multiple types of robots

Uncertainty in movements

Reward: saving victims, collisions, clearing debris

Maximize expected joint reward

Page 6: Model

DisPOMDPs with Coordination Locales (DPCL)

Joint model: <S, A, Ω, P, R, O, Ag>

Global state represents completion of tasks

Agents are independent except in coordination locales (CLs)

Two types of CLs:

Same-time CL (e.g., agents colliding with each other)

Future-time CL (e.g., a cleaner robot clearing debris assists a rescue robot in reaching its goal)

Individual observability

Page 7: Solving DPCLs with TREMOR

TREMOR: Teams REshaping of MOdels for Rapid execution

Two steps:

1. Branch and bound search, with MDP-based heuristics

2. Task assignment evaluation, by computing policies for every agent; joint policy computation is performed only at CLs

Page 8:

1. Branch and Bound search
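The branch-and-bound step can be illustrated with a best-first sketch over task-to-agent assignments. This is a minimal sketch under my own assumptions, not the authors' implementation: `upper_bound` stands in for the MDP-based heuristic (assumed to overestimate the true value) and `evaluate` for the full POMDP-based assignment evaluation; both names are hypothetical.

```python
import heapq

def branch_and_bound(tasks, agents, upper_bound, evaluate):
    """Best-first branch and bound over task-to-agent assignments.
    upper_bound: admissible (over-)estimate of a partial assignment's value,
    here assumed to come from an MDP relaxation; evaluate: exact value of a
    complete assignment (in TREMOR, via POMDP policy computation)."""
    best_val, best_assign = float("-inf"), None
    frontier = [(-upper_bound({}), ())]       # entries: (-bound, partial tuple)
    while frontier:
        neg_bound, partial = heapq.heappop(frontier)
        if -neg_bound <= best_val:
            continue                          # bound cannot beat incumbent: prune
        if len(partial) == len(tasks):
            val = evaluate(dict(partial))     # full evaluation of a leaf
            if val > best_val:
                best_val, best_assign = val, dict(partial)
            continue
        task = tasks[len(partial)]            # branch on the next unassigned task
        for ag in agents:
            child = partial + ((task, ag),)
            bound = upper_bound(dict(child))
            if bound > best_val:
                heapq.heappush(frontier, (-bound, child))
    return best_assign, best_val
```

Pruning is sound only if the heuristic never underestimates the achievable joint reward, which is why an MDP relaxation (ignoring partial observability) is a natural choice.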

Page 9: 2. Task Assignment Evaluation

Until convergence of policies or maximum iterations:

1) Solve individual POMDPs

2) Identify potential coordination locales

3) Based on the type and value of each coordination locale, shape P and R of the relevant individual agents, to capture interactions and encourage/discourage them

4) Go to step 1
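This iterative loop can be sketched as follows; the three helper functions are injected and are purely hypothetical stand-ins for the paper's components (individual POMDP solving, CL detection, and P/R shaping):

```python
def tremor_task_evaluation(agents, solve_pomdp, find_cls, shape_model, max_iters=20):
    """Sketch of the task-assignment evaluation loop: repeatedly solve each
    agent's individual POMDP, detect coordination locales under the current
    policies, and reshape the involved agents' transition/reward models."""
    policies = {}
    for _ in range(max_iters):
        new_policies = {a: solve_pomdp(a) for a in agents}      # step 1
        if new_policies == policies:
            return new_policies                # converged: no policy changed
        policies = new_policies
        for cl, involved in find_cls(agents, policies):         # step 2
            for a in involved:                 # step 3: shape P and R to
                shape_model(a, cl)             # encourage/discourage the CL
    return policies
```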

Page 10: Identifying Potential CLs

CL = <State, Action>

Probability of a CL occurring at a time step T, given the starting belief: standard belief update given the policy

Policy over belief states

Probability of observing ω in belief state b

Updating b
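The belief-propagation machinery named on this slide can be written out as a minimal sketch (the array layout and function names are my assumptions, not the paper's code): a standard Bayesian belief update, and a CL-probability computation that rolls the belief forward under the policy.

```python
import numpy as np

def belief_update(b, a, w, P, O):
    """Standard update: b'(s') ∝ O(w | s', a) * Σ_s P(s' | s, a) b(s).
    P[a] is an |S|x|S| transition matrix, O[a] an |S|x|Ω| observation matrix."""
    unnorm = O[a][:, w] * (b @ P[a])
    pw = unnorm.sum()                         # probability of observing w in b
    return (unnorm / pw, pw) if pw > 0 else (None, 0.0)

def cl_probability(b0, policy, cl_state, cl_action, P, O, T):
    """Probability that the agent is at cl_state taking cl_action at step T,
    branching over observation sequences (exponential in T; small horizons)."""
    frontier = [(b0, 1.0)]                    # (belief, probability of branch)
    for t in range(T):
        nxt = []
        for b, pr in frontier:
            a = policy(b, t)
            for w in range(O[a].shape[1]):
                b2, pw = belief_update(b, a, w, P, O)
                if pw > 0:
                    nxt.append((b2, pr * pw))
        frontier = nxt
    return sum(pr * b[cl_state] for b, pr in frontier if policy(b, T) == cl_action)
```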

Page 11: Type of CL

STCL, if there exist s and a for which the transition or reward function is not decomposable:

P(s,a,s') ≠ Π_{1≤i≤N} P((s_g,s_i),a_i,(s_g',s_i'))  OR  R(s,a,s') ≠ Σ_{1≤i≤N} R((s_g,s_i),a_i,(s_g',s_i'))

FTCL, if completion of a task (global state) by an agent at t' affects the transitions/rewards of other agents at t

Page 12: Shaping the Model (STCL)

Shaping the transition function: from the joint transition probability when the CL occurs, derive a new transition probability for agent i

Shaping the reward function
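Concretely, STCL shaping might look like the following sketch (a guess at one reasonable realization; the array layout, `p_cl`, and the marginal `P_joint_i` are my assumptions): agent i's transition row at the locale is mixed with the outcome it experiences when the interaction actually occurs, and the expected interaction cost is folded into its reward.

```python
import numpy as np

def shape_transition(P_i, s, a, p_cl, P_joint_i):
    """Mix agent i's individual transition row at (s, a) with the marginal
    transition it experiences when the CL occurs; p_cl is the probability
    (derived from the other agents' policies) that the CL arises."""
    P_new = P_i.copy()                        # P_i: array of shape (|A|, |S|, |S|)
    P_new[a][s] = (1 - p_cl) * P_i[a][s] + p_cl * P_joint_i
    return P_new

def shape_reward(R_i, s, a, p_cl, r_cl):
    """Fold the expected interaction reward/penalty (e.g., a collision cost)
    into agent i's reward at (s, a), weighted by the CL's probability."""
    R_new = R_i.copy()                        # R_i: array of shape (|A|, |S|)
    R_new[a][s] += p_cl * r_cl
    return R_new
```

Because each agent then re-solves its individual POMDP against the shaped model, the social effect of the interaction is felt without ever building the full joint model.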

Page 13: Results

Benchmark algorithms: Independent POMDPs; Memory Bounded Dynamic Programming (MBDP)

Criteria: decision quality, run-time

Parameters varied: (i) agents; (ii) CLs; (iii) states; (iv) horizon

Page 14: State space

Page 15: Agents

Page 16: Coordination Locales

Page 17: Time Horizon

Page 18: Related Work

DEC-MDPs: assume individual or collective full observability; task allocation and dependencies given as input

DEC-POMDPs: JESP, MBDP; exploiting independence in transition/reward/observation

Model shaping: Guestrin and Gordon, 2002

Page 19: Conclusion

DPCL, a specialization of Distributed POMDPs

TREMOR exploits the presence of few CLs in domains

TREMOR depends on single-agent POMDP solvers

Results: TREMOR outperformed DisPOMDP algorithms, except in tightly coupled small problems

Page 20:

Questions?

Page 21: Same-Time CL (STCL)

There is an STCL if:

Transition function is not decomposable: P(s,a,s') ≠ Π_{1≤i≤N} P((s_g,s_i),a_i,(s_g',s_i')), OR

Observation function is not decomposable: O(s',a,o) ≠ Π_{1≤i≤N} O(o_i,a_i,(s_g',s_i')), OR

Reward function is not decomposable: R(s,a,s') ≠ Σ_{1≤i≤N} R((s_g,s_i),a_i,(s_g',s_i'))

Ex: Two robots colliding in a narrow corridor
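For small enumerated models, the decomposability condition can be checked numerically. A sketch (the dictionary encoding of joint states and actions as tuples is my assumption):

```python
def is_decomposable(P_joint, P_individuals, tol=1e-9):
    """Return True if the joint transition function factors into a product of
    per-agent transitions, i.e. no same-time coordination locale exists.
    P_joint maps (s, a, s') over tuples of per-agent components to a probability;
    P_individuals[i] maps (s_i, a_i, s_i') to agent i's individual probability."""
    for (s, a, s2), p in P_joint.items():
        prod = 1.0
        for i, P_i in enumerate(P_individuals):
            prod *= P_i.get((s[i], a[i], s2[i]), 0.0)
        if abs(p - prod) > tol:
            return False          # this <s, a> witnesses an STCL
    return True
```

The same pattern applies to the observation function (with a product) and to the reward function (with a sum over agents).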

Page 22: Future-Time CL (FTCL)

Actions of one agent at t' can affect the transitions, observations, or rewards of other agents at t:

P((s^t_g,s^t_i), a^t_i, (s'^t_g,s'^t_i) | a^{t'}_j) ≠ P((s^t_g,s^t_i), a^t_i, (s'^t_g,s'^t_i)), for some t' < t

R((s^t_g,s^t_i), a^t_i, (s'^t_g,s'^t_i) | a^{t'}_j) ≠ R((s^t_g,s^t_i), a^t_i, (s'^t_g,s'^t_i)), for some t' < t

O(ω^t_i, a^t_i, (s'^t_g,s'^t_i) | a^{t'}_j) ≠ O(ω^t_i, a^t_i, (s'^t_g,s'^t_i)), for some t' < t

Ex: Clearing of debris assists rescue robots in getting to victims faster
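As a toy illustration of a future-time CL (all states, flags, and probabilities here are invented for illustration, not taken from the paper): the rescue robot's transition depends on a global task flag that only a different agent's earlier action can set.

```python
def rescue_transition(s_g, s_i, a_i):
    """Toy FTCL: the rescue robot's chance of getting through the corridor
    depends on the global flag 'debris_cleared', which the cleaner robot's
    earlier action sets. Numbers are illustrative only."""
    if s_i == "corridor" and a_i == "move":
        p = 0.9 if s_g["debris_cleared"] else 0.3
        return {"goal": p, "corridor": 1 - p}
    return {s_i: 1.0}                 # other (state, action) pairs: stay put
```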