Planning and Acting in Partially Observable Stochastic Domains
By Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra
Artificial Intelligence 101 (1998) 99-134
Iss seminar 11/16/06
Presented by Darin Hitchings
Introduction
• Essentially a planning problem: given a model of the world dynamics and a reward structure, find an optimal way to behave.
• Decisions and rewards come in stages
• Limitation: discrete state representation (MDP case)
• Stochastic: actions have uncertain outcomes (and, in the POMDP case, state uncertainty)
• Actions both change the state of the world <=> gain information
MDP Model
• Tradeoff: value of information vs immediate reward vs long-term cost
• Uncertainty in actions' effects, but perfect perception of the state
• MDP described by the tuple (S, A, T, R):
  - S: finite set of world states
  - A: finite set of actions
  - T : S × A → Π(S), the state-transition function, T(s, a, s') = P(s' | s, a)
  - R : S × A → ℝ, the expected immediate reward R(s, a)
• Markov Property: the next state and reward depend only on the current state and action:
  P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
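Concretely, the (S, A, T, R) tuple can be held in a pair of arrays. This is only a minimal sketch (the class name and array layout are mine, not the paper's); the later sketches in this transcript assume this container.

    import numpy as np

    class MDP:
        """Sketch of an MDP (S, A, T, R) with states and actions indexed by integers."""
        def __init__(self, T, R, gamma):
            self.T = np.asarray(T)      # T[s, a, s'] = P(s' | s, a)
            self.R = np.asarray(R)      # R[s, a] = expected immediate reward
            self.gamma = gamma          # discount factor, 0 <= gamma < 1
            self.num_states, self.num_actions = self.R.shape
            # each row of T must be a probability distribution over next states
            assert np.allclose(self.T.sum(axis=2), 1.0)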
Optimality Concepts
• Finite-horizon optimality: maximize the expected sum of rewards over the next k steps,
  max E[ r_0 + r_1 + ... + r_{k-1} ]
• ∞-horizon optimality: maximize the expected discounted sum of rewards,
  max E[ Σ_{t=0}^{∞} γ^t r_t ],  0 ≤ γ < 1 (geometric discounting)
• Two kinds of policies: stationary (π : S → A) and non-stationary (π_t : S → A for the t-th-to-last step)
• Finite-horizon model: the optimal policy is typically not stationary
Optimality Cont
• Finite horizon: the last move is usually very different from the first move
• ∞-horizon: the discounted criterion is also the right one if the probability of stopping early is (1 − γ) at each step ⇒ in effect a finite-horizon problem, since the process then stops after a geometrically distributed number of steps
• Define the value function for being in state s at time t, subject to policy π: V_{π,t}(s)
• Or, for the stationary case: V_π(s)
Recursion Relations
• Finite horizon, non-stationary policy π = (π_t, π_{t-1}, ..., π_1):
  V_{π,1}(s) = R(s, π_1(s))                                              (reward at last step)
  V_{π,t}(s) = R(s, π_t(s)) + γ Σ_{s'} T(s, π_t(s), s') V_{π,t-1}(s')    (in general)
  - R(s, π_t(s)): current "local" reward
  - γ: discount (amortization for time delay)
  - Σ_{s'} T(s, π_t(s), s') V_{π,t-1}(s'): expectation (over states) of future reward
• ∞-horizon, stationary policy π:
  V_π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V_π(s')                  (same value function on both sides!)
Recursion Relations Cont
• ∞-horizon: the solution is the unique solution of a simultaneous set of linear equations, one equation for each state s
• Relies on having a discrete state space
• Have shown: policy ⇒ value function; what about value function ⇒ policy?
• Can compute a greedy (myopic) policy given a value function V:
  π_V(s) = argmax_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
  - R(s, a): expected immediate reward
  - Σ_{s'} T(s, a, s') V(s'): expected reward of the next state
• Bellman equation: what is the optimal policy for a single move? Can figure out an expression for the general case by working backwards from the end of the horizon:
  V*_t(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*_{t-1}(s') ]
  i.e. V*_t is derived from V*_{t-1}, and π*_t is the maximizing action.
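A sketch of both directions (policy ⇒ value function by solving the linear equations, and value function ⇒ greedy policy), assuming the MDP container sketched earlier:

    import numpy as np

    def evaluate_policy(mdp, policy):
        """Solve the |S| simultaneous linear equations
        V(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') V(s')
        for a stationary policy given as an array of action indices."""
        n = mdp.num_states
        T_pi = mdp.T[np.arange(n), policy, :]   # T_pi[s, s'] = T(s, pi(s), s')
        R_pi = mdp.R[np.arange(n), policy]      # R_pi[s] = R(s, pi(s))
        return np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi)

    def greedy_policy(mdp, V):
        """Greedy (myopic) policy with respect to a value function V."""
        Q = mdp.R + mdp.gamma * (mdp.T @ V)     # Q[s, a] = R(s,a) + gamma * E[V(s')]
        return np.argmax(Q, axis=1)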
Stationary Policy Exists
• For ∞-horizon discounted problems, Howard showed that a stationary optimal policy always exists; its value function V* satisfies
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]
• An optimal policy, π*, is just a greedy policy with respect to V*
Value Iteration Algorithm
• Algorithm 1. The value-iteration algorithm for finite-state-space MDPs (a reconstruction follows below).
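Algorithm 1 itself appears as a figure in the original slides; the following is a standard reconstruction of finite-state value iteration under the MDP container sketched earlier (the stopping test uses the Bellman error magnitude discussed on the next slide).

    import numpy as np

    def value_iteration(mdp, epsilon=1e-6):
        """Iterate V_t(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V_{t-1}(s') ]
        until the Bellman error magnitude drops below epsilon."""
        V = np.zeros(mdp.num_states)
        while True:
            Q = mdp.R + mdp.gamma * (mdp.T @ V)   # Q[s, a] for all states and actions
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < epsilon:
                return V_new, Q.argmax(axis=1)    # value function and greedy policy
            V = V_new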
Value Iteration Cont
• Q_t(s, a) is the t-step value of starting in state s, taking action a, then continuing with the optimal (t−1)-step nonstationary policy:
  Q_t(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V*_{t-1}(s')
• max_s |V_t(s) − V_{t-1}(s)| is the Bellman error magnitude
• If |V_t(s) − V_{t-1}(s)| ≤ ε for all s, then it can be shown that the value of the greedy policy with respect to V_t differs from the optimal ∞-horizon value function by at most 2εγ / (1 − γ) at every state
• Tighter bounds may be obtained by using the span semi-norm on the value function [49]
POMDP Model
• POMDP described by the tuple (S, A, T, R, Ω, O):
  - S, A, T, R describe a Markov decision process, as before
  - Ω is a finite set of observations of the world
  - O : S × A → Π(Ω) is the observation function (a probability distribution over possible observations for each resulting state and action), O(s', a, o) = P(o | s', a)
• New state space ⇒ a continuous distribution over the discrete states of the original MDP: the "belief state" b
• The belief state is a sufficient statistic for the past history and initial belief state of the agent
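A minimal extension of the earlier MDP sketch to hold Ω and O (again, the field names and array layout are mine, not the paper's):

    import numpy as np

    class POMDP(MDP):
        """Sketch of a POMDP (S, A, T, R, Omega, O); observations indexed by integers."""
        def __init__(self, T, R, O, gamma):
            super().__init__(T, R, gamma)
            self.O = np.asarray(O)      # O[a, s', o] = P(o | s', a) for the resulting state s'
            self.num_obs = self.O.shape[2]
            assert np.allclose(self.O.sum(axis=2), 1.0)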
POMDP Example 1
• S = 4 states in a row, A = {East, West}, Ω = {not at goal, at goal}
• P(move W | a = W) = 0.90, P(move E | a = W) = 0.10
• P(move E | a = E) = 0.90, P(move W | a = E) = 0.10
• No movement if the move would cross a boundary (but no other information)
• Initial b = { 0.333, 0.333, 0.000, 0.333 }
• Belief state given a = East, o = not at goal ⇒ { 0.100, 0.450, 0.000, 0.450 }
• 2nd belief state (East twice, no goal) ⇒ { 0.100, 0.164, 0.000, 0.736 }
Computing New Belief States
• Know b(s) for all s ∈ S; must have Σ_s b(s) = 1 for b to be a valid distribution
• The new belief b'(s') = P(s' | o, a, b) after taking action a and observing o:
  b'(s') = P(o | s', a, b) P(s' | a, b) / P(o | a, b)                       (Bayes rule)
         = P(o | s', a) Σ_s P(s' | a, b, s) P(s | a, b) / P(o | a, b)       (Markov property)
         = O(s', a, o) Σ_s T(s, a, s') b(s) / P(o | a, b)                   (plugging in definitions)
  where P(o | a, b) = Σ_{s'} O(s', a, o) Σ_s T(s, a, s') b(s) is just a normalizing factor
• So we now have a new state-transition equation over belief states
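A sketch of this update applied to the hallway example from the slide above, using the POMDP container sketched earlier; the exact observation model (the goal state is observed perfectly, all other states look alike) is an assumption chosen so that the numbers match the slide.

    import numpy as np

    def belief_update(pomdp, b, a, o):
        """SE(b, a, o): the Bayes-filter belief update derived above."""
        predicted = b @ pomdp.T[:, a, :]             # sum_s T(s, a, s') b(s)
        unnormalized = pomdp.O[a, :, o] * predicted  # weight by O(s', a, o)
        return unnormalized / unnormalized.sum()     # divide by P(o | a, b)

    # Hypothetical encoding of the 4-state East/West hallway
    # (state index 2 is the goal; observations: 0 = "not at goal", 1 = "at goal")
    T = np.zeros((4, 2, 4))                          # actions: 0 = East, 1 = West
    for s in range(4):
        e, w = min(s + 1, 3), max(s - 1, 0)          # no movement past a boundary
        T[s, 0, e] += 0.9; T[s, 0, w] += 0.1         # East succeeds with prob 0.9
        T[s, 1, w] += 0.9; T[s, 1, e] += 0.1         # West succeeds with prob 0.9
    O = np.zeros((2, 4, 2))
    O[:, :, 0] = 1.0                                 # "not at goal" everywhere ...
    O[:, 2, :] = [0.0, 1.0]                          # ... except the goal, observed exactly
    hallway = POMDP(T, np.zeros((4, 2)), O, gamma=0.9)

    b = np.array([1/3, 1/3, 0.0, 1/3])
    b = belief_update(hallway, b, a=0, o=0)          # ~ [0.100, 0.450, 0.000, 0.450]
    b = belief_update(hallway, b, a=0, o=0)          # ~ [0.100, 0.164, 0.000, 0.736]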
Belief Space State Transitions
• B, the set of belief states, is the new state space
• A, the set of actions, remains the same
• New state-transition function:
  τ(b, a, b') = Σ_o P(b' | b, a, o) P(o | a, b)
  where P(b' | b, a, o) = 1 if the belief update above maps (b, a, o) to b', and 0 otherwise
• Reward function: ρ(b, a) = Σ_s b(s) R(s, a)   (expected reward over belief states)
Policy Trees
• Simplest case: a 1-step policy tree is a single action
• General case: a t-step nonstationary policy ⇒ a tree of depth t, with an action at each node and one branch per observation
• Actions generate observations, which control the evolution of the belief (state)
Policy Trees Cont
• If p is a t-step policy tree, a(p) is the action specified at its top node, and o_i(p) is the (t−1)-step policy subtree associated with observation o_i, then the value of executing p from world state s is
  V_p(s) = R(s, a(p)) + γ Σ_{s'} T(s, a(p), s') Σ_{o_i} O(s', a(p), o_i) V_{o_i(p)}(s')
  - Σ_{s'} T(s, a(p), s') ...: expectation over possible future states
  - V_{o_i(p)}(s'): expected value of continuing from state s' with the subtree selected by o_i
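As a sketch, the same recursion computes the alpha vector of a tree from the alpha vectors of its observation subtrees (the function name and container are assumptions carried over from the earlier sketches):

    import numpy as np

    def tree_alpha(pomdp, a, subtree_alphas):
        """Alpha vector of a t-step tree: action a at the root, subtree_alphas[o]
        the alpha vector of the (t-1)-step subtree followed after observation o."""
        # future[s'] = sum_o O(s', a, o) * V_{o(p)}(s')
        future = sum(pomdp.O[a, :, o] * subtree_alphas[o] for o in range(pomdp.num_obs))
        return pomdp.R[:, a] + pomdp.gamma * (pomdp.T[:, a, :] @ future)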
Value on Policy Trees
• Value of executing a policy tree p from some belief state b:
  V_p(b) = Σ_s b(s) V_p(s)
  (just an expectation over world states of executing p in each state)
• More compactly, if α_p = ⟨V_p(s_1), ..., V_p(s_n)⟩, then V_p(b) = b · α_p
• Let P_t be the finite set of all t-step policy trees; then
  V_t(b) = max_{p ∈ P_t} b · α_p
• Important geometric consequences!
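Evaluating the resulting value function at a belief is then just a maximum over dot products, for example:

    import numpy as np

    def value_at(alphas, b):
        """V_t(b) = max_p b . alpha_p over a list of alpha vectors;
        also returns the index of the winning policy tree."""
        values = np.array([b @ alpha for alpha in alphas])
        best = int(values.argmax())
        return values[best], best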
Geometric Import
• Each policy tree p induces a value function V_p that is linear in b
• V_t is the upper surface of these value functions
• V_t is therefore piecewise-linear and convex
• (Figure: with two states the value function is 1-D over b(s_1); each policy component is a line (hyperplane), and the max of a set of linear functions is convex)
Simplex Constraints
• For S = 3, the belief simplex projected onto (b(s_1), b(s_2)) has vertices (0, 0), (0, 1), (1, 0)
• Makes intuitive sense: high value on the edges, low value in the center (higher uncertainty)
• Just one policy is optimal within any one colored region of the simplex
• Can find the optimal policy by projecting back down into belief space: within each region, use the tree p whose α_p is maximal over the entire region
More on Projection
• The optimal t-step situation-action mapping is found by projecting the value function back down onto belief space
• Especially interested in a(p), the action at the root node of policy tree p
• (Figure: 1-D belief space, S = 2; remember there are multiple possible observations per action, and each α_p has dimension |S|)
• Define R(p) to be the region of belief space where α_p dominates:
  R(p) = { b : b · α_p ≥ b · α_q for all other trees q }
Parsimonious Representation
• (Figure: four value functions over a 1-D belief space) p_d is never useful; p_c is never useful given p_a and p_b
• Can find a point in R(p) by linear programming (see the LP sketch after this slide)
• There are |A|^((|Ω|^t − 1)/(|Ω| − 1)) possible t-step policy trees, so the full set is enormous
• Could instead construct a superset of all potentially useful policy trees for V_t from a useful (minimal) set of policy trees for V_{t-1}
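One standard way to set up that linear program (this formulation and the function name are mine, not necessarily the paper's exact LP): maximize the margin δ by which α_p beats every other vector at some belief b; if the optimal δ is not positive, p is nowhere strictly dominant and can be pruned.

    import numpy as np
    from scipy.optimize import linprog

    def find_dominating_belief(alpha, others):
        """Return (b, delta): a belief maximizing the margin delta by which
        `alpha` beats every vector in `others` (assumed non-empty)."""
        n = len(alpha)
        # variables x = [b(1), ..., b(n), delta]; linprog minimizes, so use -delta
        c = np.zeros(n + 1); c[-1] = -1.0
        # b . q + delta <= b . alpha   <=>   b . (q - alpha) + delta <= 0
        A_ub = np.array([np.append(q - alpha, 1.0) for q in others])
        b_ub = np.zeros(len(others))
        # b must be a probability distribution: sum b = 1, b >= 0; delta is free
        A_eq = np.array([np.append(np.ones(n), 0.0)])
        b_eq = np.array([1.0])
        bounds = [(0, None)] * n + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        return res.x[:n], res.x[-1]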
Pruning and Vt
• Pruning method? A simple technique is to remove nowhere-dominant policy trees (those with empty R(p)); more complex / better techniques exist
• How do we find V_t from V_{t-1}?
• Exhaustive enumeration wastes time generating trees that are then pruned away
• Redefine V_t to be the minimal set of policy trees that forms the upper surface of the t-step value function
Directly Generating Vt
• Generating the minimal set for V_t directly is hard: it is equivalent to the long-standing question "Does NP = RP?"
• Instead we compute, for each action a, a set Q_t^a of t-step policy trees that have action a at the root
• Goal: an algorithm for generating V_t that is polynomial in |S|, |A|, |Ω|, |V_{t-1}|, and the size of the resulting representation
• We know V_t(b) = max_a Q_t^a(b), i.e. the union of the Q_t^a sets represents V_t (and can then be pruned)
Witness Algorithm
• Algorithm 2. Outer loop of the witness algorithm
• Basic structure is still value iteration:
  - for each action a, construct the policy trees Q_t^a from the trees in V_{t-1} (the inner loop, next slide)
  - then prune the union of the Q_t^a sets using linear programs
• Witness algorithm usually runs in polynomial time!
Witness Inner Loop
• Initialize the set U_a with a single policy tree that is best for some arbitrary belief b
• Ask: is there a belief b at which the value obtained from one-step lookahead with V_{t-1} (the true Q_t^a(b)) exceeds the value given by our current model for V_t, namely the trees in U_a?
• Such a point is called a witness point (it witnesses that U_a is still incomplete)
• When we can prove that no more witness points exist, our model for V_t is exact!
• Note: let p_new = p with just one observation subtree replaced
Witness Theorem
• The true Q-function Q_t^a differs from the approximate one (the upper surface over U_a) at some belief b if and only if some tree p in U_a can be improved at b by replacing a single observation subtree with a tree from V_{t-1}
• Contrapositive: if no tree can be improved by replacing a single subtree, there are no witness points, and the current set is exact
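Once a witness point b has been found, the best t-step tree with action a at its root for that particular b can be built greedily: for each observation, pick the (t−1)-step vector that is best at the updated belief. A sketch, reusing belief_update and tree_alpha from the earlier sketches (all of which are my own constructions, not the paper's code):

    import numpy as np

    def best_tree_alpha_at(pomdp, b, a, V_prev_alphas):
        """Alpha vector of the best tree rooted at action a for belief b,
        choosing each observation subtree from the (t-1)-step set V_prev_alphas."""
        chosen = []
        for o in range(pomdp.num_obs):
            predicted = b @ pomdp.T[:, a, :]
            unnormalized = pomdp.O[a, :, o] * predicted
            if unnormalized.sum() == 0.0:        # observation impossible from b under a
                chosen.append(V_prev_alphas[0])  # arbitrary subtree; it gets zero weight
                continue
            b_next = unnormalized / unnormalized.sum()
            values = [b_next @ alpha for alpha in V_prev_alphas]
            chosen.append(V_prev_alphas[int(np.argmax(values))])
        return tree_alpha(pomdp, a, chosen)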