Planning and Acting in Partially Observable Stochastic Domains
By Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra
Artificial Intelligence 101 (1998) 99-134
Iss seminar 11/16/06
Presented by Darin Hitchings
Introduction
• Essentially a planning problem: given a model of the world dynamics and a reward structure, find an optimal way to behave.
• Decisions and rewards come in stages
• Limitation: discrete state representation (MDP case)
• Stochastic: actions have uncertain outcomes (and, in the POMDP case, state uncertainty)
• Actions both change the state of the world <=> gain information
MDP Model
• Tradeoff: value of information vs immediate reward vs long-term cost
• Uncertainty in actions' effects, but perfect perception of the state
• MDP described by the tuple (S, A, T, R):
  - S: finite set of world states
  - A: finite set of actions
  - T : S × A → Π(S), the state-transition function, T(s, a, s') = P(s' | s, a)
  - R : S × A → ℝ, the expected immediate reward R(s, a)
• Markov Property: the next state and reward depend only on the current state and action:
  P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
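Concretely, the (S, A, T, R) tuple can be held in a pair of arrays. This is only a minimal sketch (the class name and array layout are mine, not the paper's); the later sketches in this transcript assume this container.

    import numpy as np

    class MDP:
        """Sketch of an MDP (S, A, T, R) with states and actions indexed by integers."""
        def __init__(self, T, R, gamma):
            self.T = np.asarray(T)      # T[s, a, s'] = P(s' | s, a)
            self.R = np.asarray(R)      # R[s, a] = expected immediate reward
            self.gamma = gamma          # discount factor, 0 <= gamma < 1
            self.num_states, self.num_actions = self.R.shape
            # each row of T must be a probability distribution over next states
            assert np.allclose(self.T.sum(axis=2), 1.0)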
Optimality Concepts
• Finite-horizon optimality: maximize the expected sum of rewards over the next k steps,
  max E[ r_0 + r_1 + ... + r_{k-1} ]
• ∞-horizon optimality: maximize the expected discounted sum of rewards,
  max E[ Σ_{t=0}^{∞} γ^t r_t ],  0 ≤ γ < 1 (geometric discounting)
• Two kinds of policies: stationary (π : S → A) and non-stationary (π_t : S → A for the t-th-to-last step)
• Finite-horizon model: the optimal policy is typically not stationary
Optimality Cont
• Finite horizon: the last move is usually very different from the first move
• ∞-horizon: the discounted criterion is also the right one if the probability of stopping early is (1 − γ) at each step ⇒ in effect a finite-horizon problem, since the process then stops after a geometrically distributed number of steps
• Define the value function for being in state s at time t, subject to policy π: V_{π,t}(s)
• Or, for the stationary case: V_π(s)
Recursion Relations
• Finite horizon, non-stationary policy π = (π_t, π_{t-1}, ..., π_1):
  V_{π,1}(s) = R(s, π_1(s))                                              (reward at last step)
  V_{π,t}(s) = R(s, π_t(s)) + γ Σ_{s'} T(s, π_t(s), s') V_{π,t-1}(s')    (in general)
  - R(s, π_t(s)): current "local" reward
  - γ: discount (amortization for time delay)
  - Σ_{s'} T(s, π_t(s), s') V_{π,t-1}(s'): expectation (over states) of future reward
• ∞-horizon, stationary policy π:
  V_π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V_π(s')                  (same value function on both sides!)
Recursion Relations Cont
• ∞-horizon: the solution is the unique solution of a simultaneous set of linear equations, one equation for each state s
• Relies on having a discrete state space
• Have shown: policy ⇒ value function; what about value function ⇒ policy?
• Can compute a greedy (myopic) policy given a value function V:
  π_V(s) = argmax_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
  - R(s, a): expected immediate reward
  - Σ_{s'} T(s, a, s') V(s'): expected reward of the next state
• Bellman equation: what is the optimal policy for a single move? Can figure out an expression for the general case by working backwards from the end of the horizon:
  V*_t(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*_{t-1}(s') ]
  i.e. V*_t is derived from V*_{t-1}, and π*_t is the maximizing action.
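A sketch of both directions (policy ⇒ value function by solving the linear equations, and value function ⇒ greedy policy), assuming the MDP container sketched earlier:

    import numpy as np

    def evaluate_policy(mdp, policy):
        """Solve the |S| simultaneous linear equations
        V(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') V(s')
        for a stationary policy given as an array of action indices."""
        n = mdp.num_states
        T_pi = mdp.T[np.arange(n), policy, :]   # T_pi[s, s'] = T(s, pi(s), s')
        R_pi = mdp.R[np.arange(n), policy]      # R_pi[s] = R(s, pi(s))
        return np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi)

    def greedy_policy(mdp, V):
        """Greedy (myopic) policy with respect to a value function V."""
        Q = mdp.R + mdp.gamma * (mdp.T @ V)     # Q[s, a] = R(s,a) + gamma * E[V(s')]
        return np.argmax(Q, axis=1)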
Stationary Policy Exists
• For ∞-horizon discounted problems, Howard showed that a stationary optimal policy always exists; its value function V* satisfies
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]
• An optimal policy, π*, is just a greedy policy with respect to V*
Value Iteration Algorithm
• Algorithm 1. The value-iteration algorithm for finite-state-space MDPs (a reconstruction follows below).
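Algorithm 1 itself appears as a figure in the original slides; the following is a standard reconstruction of finite-state value iteration under the MDP container sketched earlier (the stopping test uses the Bellman error magnitude discussed on the next slide).

    import numpy as np

    def value_iteration(mdp, epsilon=1e-6):
        """Iterate V_t(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V_{t-1}(s') ]
        until the Bellman error magnitude drops below epsilon."""
        V = np.zeros(mdp.num_states)
        while True:
            Q = mdp.R + mdp.gamma * (mdp.T @ V)   # Q[s, a] for all states and actions
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < epsilon:
                return V_new, Q.argmax(axis=1)    # value function and greedy policy
            V = V_new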
Value Iteration Cont
• Q_t(s, a) is the t-step value of starting in state s, taking action a, then continuing with the optimal (t−1)-step nonstationary policy:
  Q_t(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V*_{t-1}(s')
• max_s |V_t(s) − V_{t-1}(s)| is the Bellman error magnitude
• If |V_t(s) − V_{t-1}(s)| ≤ ε for all s, then it can be shown that the value of the greedy policy with respect to V_t differs from the optimal ∞-horizon value function by at most 2εγ / (1 − γ) at every state
• Tighter bounds may be obtained by using the span semi-norm on the value function [49]
POMDP Model
• POMDP described by the tuple (S, A, T, R, Ω, O):
  - S, A, T, R describe a Markov decision process, as before
  - Ω is a finite set of observations of the world
  - O : S × A → Π(Ω) is the observation function (a probability distribution over possible observations for each resulting state and action), O(s', a, o) = P(o | s', a)
• New state space ⇒ a continuous distribution over the discrete states of the original MDP: the "belief state" b
• The belief state is a sufficient statistic for the past history and initial belief state of the agent
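A minimal extension of the earlier MDP sketch to hold Ω and O (again, the field names and array layout are mine, not the paper's):

    import numpy as np

    class POMDP(MDP):
        """Sketch of a POMDP (S, A, T, R, Omega, O); observations indexed by integers."""
        def __init__(self, T, R, O, gamma):
            super().__init__(T, R, gamma)
            self.O = np.asarray(O)      # O[a, s', o] = P(o | s', a) for the resulting state s'
            self.num_obs = self.O.shape[2]
            assert np.allclose(self.O.sum(axis=2), 1.0)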
POMDP Example 1
• S = 4 states in a row, A = {East, West}, Ω = {not at goal, at goal}
• P(move W | a = W) = 0.90, P(move E | a = W) = 0.10
• P(move E | a = E) = 0.90, P(move W | a = E) = 0.10
• No movement if the move would cross a boundary (but no other information)
• Initial b = { 0.333, 0.333, 0.000, 0.333 }
• Belief state given a = East, o = not at goal ⇒ { 0.100, 0.450, 0.000, 0.450 }
• 2nd belief state (East twice, no goal) ⇒ { 0.100, 0.164, 0.000, 0.736 }
Computing New Belief States
• Know b(s) for all s ∈ S; must have Σ_s b(s) = 1 for b to be a valid distribution
• The new belief b'(s') = P(s' | o, a, b) after taking action a and observing o:
  b'(s') = P(o | s', a, b) P(s' | a, b) / P(o | a, b)                       (Bayes rule)
         = P(o | s', a) Σ_s P(s' | a, b, s) P(s | a, b) / P(o | a, b)       (Markov property)
         = O(s', a, o) Σ_s T(s, a, s') b(s) / P(o | a, b)                   (plugging in definitions)
  where P(o | a, b) = Σ_{s'} O(s', a, o) Σ_s T(s, a, s') b(s) is just a normalizing factor
• So we now have a new state-transition equation over belief states
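A sketch of this update applied to the hallway example from the slide above, using the POMDP container sketched earlier; the exact observation model (the goal state is observed perfectly, all other states look alike) is an assumption chosen so that the numbers match the slide.

    import numpy as np

    def belief_update(pomdp, b, a, o):
        """SE(b, a, o): the Bayes-filter belief update derived above."""
        predicted = b @ pomdp.T[:, a, :]             # sum_s T(s, a, s') b(s)
        unnormalized = pomdp.O[a, :, o] * predicted  # weight by O(s', a, o)
        return unnormalized / unnormalized.sum()     # divide by P(o | a, b)

    # Hypothetical encoding of the 4-state East/West hallway
    # (state index 2 is the goal; observations: 0 = "not at goal", 1 = "at goal")
    T = np.zeros((4, 2, 4))                          # actions: 0 = East, 1 = West
    for s in range(4):
        e, w = min(s + 1, 3), max(s - 1, 0)          # no movement past a boundary
        T[s, 0, e] += 0.9; T[s, 0, w] += 0.1         # East succeeds with prob 0.9
        T[s, 1, w] += 0.9; T[s, 1, e] += 0.1         # West succeeds with prob 0.9
    O = np.zeros((2, 4, 2))
    O[:, :, 0] = 1.0                                 # "not at goal" everywhere ...
    O[:, 2, :] = [0.0, 1.0]                          # ... except the goal, observed exactly
    hallway = POMDP(T, np.zeros((4, 2)), O, gamma=0.9)

    b = np.array([1/3, 1/3, 0.0, 1/3])
    b = belief_update(hallway, b, a=0, o=0)          # ~ [0.100, 0.450, 0.000, 0.450]
    b = belief_update(hallway, b, a=0, o=0)          # ~ [0.100, 0.164, 0.000, 0.736]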
Belief Space State Transitions
• B, the set of belief states, is the new state space
• A, the set of actions, remains the same
• New state-transition function:
  τ(b, a, b') = Σ_o P(b' | b, a, o) P(o | a, b)
  where P(b' | b, a, o) = 1 if the belief update above maps (b, a, o) to b', and 0 otherwise
• Reward function: ρ(b, a) = Σ_s b(s) R(s, a)   (expected reward over belief states)
Policy Trees
• Simplest case: a 1-step policy tree is a single action
• General case: a t-step nonstationary policy ⇒ a tree of depth t, with an action at each node and one branch per observation
• Actions generate observations, which control the evolution of the belief (state)
Policy Trees Cont
• If p is a t-step policy tree, a(p) is the action specified at its top node, and o_i(p) is the (t−1)-step policy subtree associated with observation o_i, then the value of executing p from world state s is
  V_p(s) = R(s, a(p)) + γ Σ_{s'} T(s, a(p), s') Σ_{o_i} O(s', a(p), o_i) V_{o_i(p)}(s')
  - Σ_{s'} T(s, a(p), s') ...: expectation over possible future states
  - V_{o_i(p)}(s'): expected value of continuing from state s' with the subtree selected by o_i
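As a sketch, the same recursion computes the alpha vector of a tree from the alpha vectors of its observation subtrees (the function name and container are assumptions carried over from the earlier sketches):

    import numpy as np

    def tree_alpha(pomdp, a, subtree_alphas):
        """Alpha vector of a t-step tree: action a at the root, subtree_alphas[o]
        the alpha vector of the (t-1)-step subtree followed after observation o."""
        # future[s'] = sum_o O(s', a, o) * V_{o(p)}(s')
        future = sum(pomdp.O[a, :, o] * subtree_alphas[o] for o in range(pomdp.num_obs))
        return pomdp.R[:, a] + pomdp.gamma * (pomdp.T[:, a, :] @ future)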
Value on Policy Trees
• Value of executing a policy tree p from some belief state b:
  V_p(b) = Σ_s b(s) V_p(s)
  (just an expectation over world states of executing p in each state)
• More compactly, if α_p = ⟨V_p(s_1), ..., V_p(s_n)⟩, then V_p(b) = b · α_p
• Let P_t be the finite set of all t-step policy trees; then
  V_t(b) = max_{p ∈ P_t} b · α_p
• Important geometric consequences!
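Evaluating the resulting value function at a belief is then just a maximum over dot products, for example:

    import numpy as np

    def value_at(alphas, b):
        """V_t(b) = max_p b . alpha_p over a list of alpha vectors;
        also returns the index of the winning policy tree."""
        values = np.array([b @ alpha for alpha in alphas])
        best = int(values.argmax())
        return values[best], best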
Geometric Import
• Each policy tree p induces a value function V_p that is linear in b
• V_t is the upper surface of these value functions
• V_t is therefore piecewise-linear and convex
• (Figure: with two states the value function is 1-D over b(s_1); each policy component is a line (hyperplane), and the max of a set of linear functions is convex)
Simplex Constraints
• For S = 3, the belief simplex projected onto (b(s_1), b(s_2)) has vertices (0, 0), (0, 1), (1, 0)
• Makes intuitive sense: high value on the edges, low value in the center (higher uncertainty)
• Just one policy is optimal within any one colored region of the simplex
• Can find the optimal policy by projecting back down into belief space: within each region, use the tree p whose α_p is maximal over the entire region
More on Projection
• The optimal t-step situation-action mapping is found by projecting the value function back down onto belief space
• Especially interested in a(p), the action at the root node of policy tree p
• (Figure: 1-D belief space, S = 2; remember there are multiple possible observations per action, and each α_p has dimension |S|)
• Define R(p) to be the region of belief space where α_p dominates:
  R(p) = { b : b · α_p ≥ b · α_q for all other trees q }
Parsimonious Representation
• (Figure: four value functions over a 1-D belief space) p_d is never useful; p_c is never useful given p_a and p_b
• Can find a point in R(p) by linear programming (see the LP sketch after this slide)
• There are |A|^((|Ω|^t − 1)/(|Ω| − 1)) possible t-step policy trees, so the full set is enormous
• Could instead construct a superset of all potentially useful policy trees for V_t from a useful (minimal) set of policy trees for V_{t-1}
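One standard way to set up that linear program (this formulation and the function name are mine, not necessarily the paper's exact LP): maximize the margin δ by which α_p beats every other vector at some belief b; if the optimal δ is not positive, p is nowhere strictly dominant and can be pruned.

    import numpy as np
    from scipy.optimize import linprog

    def find_dominating_belief(alpha, others):
        """Return (b, delta): a belief maximizing the margin delta by which
        `alpha` beats every vector in `others` (assumed non-empty)."""
        n = len(alpha)
        # variables x = [b(1), ..., b(n), delta]; linprog minimizes, so use -delta
        c = np.zeros(n + 1); c[-1] = -1.0
        # b . q + delta <= b . alpha   <=>   b . (q - alpha) + delta <= 0
        A_ub = np.array([np.append(q - alpha, 1.0) for q in others])
        b_ub = np.zeros(len(others))
        # b must be a probability distribution: sum b = 1, b >= 0; delta is free
        A_eq = np.array([np.append(np.ones(n), 0.0)])
        b_eq = np.array([1.0])
        bounds = [(0, None)] * n + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        return res.x[:n], res.x[-1]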
Pruning and Vt
• Pruning method? A simple technique is to remove nowhere-dominant policy trees (those with empty R(p)); more complex / better techniques exist
• How do we find V_t from V_{t-1}?
• Exhaustive enumeration wastes time generating trees that are then pruned away
• Redefine V_t to be the minimal set of policy trees that forms the upper surface of the t-step value function
Directly Generating Vt
• Generating the minimal set for V_t directly is hard: it is equivalent to the long-standing question "Does NP = RP?"
• Instead we compute, for each action a, a set Q_t^a of t-step policy trees that have action a at the root
• Goal: an algorithm for generating V_t that is polynomial in |S|, |A|, |Ω|, |V_{t-1}|, and the size of the resulting representation
• We know V_t(b) = max_a Q_t^a(b), i.e. the union of the Q_t^a sets represents V_t (and can then be pruned)
Witness Algorithm
• Algorithm 2. Outer loop of the witness algorithm
• Basic structure is still value iteration:
  - for each action a, construct the policy trees Q_t^a from the trees in V_{t-1} (the inner loop, next slide)
  - then prune the union of the Q_t^a sets using linear programs
• Witness algorithm usually runs in polynomial time!
Witness Inner Loop
• Initialize the set U_a with a single policy tree that is best for some arbitrary belief b
• Ask: is there a belief b at which the value obtained from one-step lookahead with V_{t-1} (the true Q_t^a(b)) exceeds the value given by our current model for V_t, namely the trees in U_a?
• Such a point is called a witness point (it witnesses that U_a is still incomplete)
• When we can prove that no more witness points exist, our model for V_t is exact!
• Note: let p_new = p with just one observation subtree replaced
Witness Theorem
• The true Q-function Q_t^a differs from the approximate one (the upper surface over U_a) at some belief b if and only if some tree p in U_a can be improved at b by replacing a single observation subtree with a tree from V_{t-1}
• Contrapositive: if no tree can be improved by replacing a single subtree, there are no witness points, and the current set is exact
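Once a witness point b has been found, the best t-step tree with action a at its root for that particular b can be built greedily: for each observation, pick the (t−1)-step vector that is best at the updated belief. A sketch, reusing belief_update and tree_alpha from the earlier sketches (all of which are my own constructions, not the paper's code):

    import numpy as np

    def best_tree_alpha_at(pomdp, b, a, V_prev_alphas):
        """Alpha vector of the best tree rooted at action a for belief b,
        choosing each observation subtree from the (t-1)-step set V_prev_alphas."""
        chosen = []
        for o in range(pomdp.num_obs):
            predicted = b @ pomdp.T[:, a, :]
            unnormalized = pomdp.O[a, :, o] * predicted
            if unnormalized.sum() == 0.0:        # observation impossible from b under a
                chosen.append(V_prev_alphas[0])  # arbitrary subtree; it gets zero weight
                continue
            b_next = unnormalized / unnormalized.sum()
            values = [b_next @ alpha for alpha in V_prev_alphas]
            chosen.append(V_prev_alphas[int(np.argmax(values))])
        return tree_alpha(pomdp, a, chosen)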