Chapter 17 – Making Complex Decisions A
TRANSCRIPT
Sequential decision problems: utility depends on a sequence of decisions rather than the one-shot or episodic decisions of the last chapter
Stochastic environments are non-deterministic: action outcomes are governed by probabilities
Overview
• Markov decision problems (MDPs)
• Algorithms for finding optimal policies – the solution to the MDP
• Partially observable Markov decision problems (POMDPs)
17.1 – Sequential Decision Problems: Simple Example
• Fully observable environment
• Markovian property: the probability of moving to a new state depends only on the current state
• Actions: [Left, Right, Up, Down]
• The agent moves in the chosen direction with probability 0.8
• Reward = +1 and -1 for the two goal states, -0.04 for non-goal states
• Want to maximize total reward or utility
Markov Decision Process (MDP)
We use MDPs to solve sequential decision problems. We eventually want to find the best choice of action for each state.
Consists of:
• a set of actions A(s) – the actions available in each state s
• a transition model P(s' | s, a) – the probability of reaching s' by doing action a in s
  – transitions are Markovian: they depend only on s, not on previous states
• a reward function R(s) – the reward an agent receives for arriving in state s
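These components can be collected into a small data structure; a minimal sketch in Python, where the class and example states are my own rather than the chapter's:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list    # all states s
    actions: dict   # A(s): state -> available actions
    P: dict         # (s, a) -> {s': probability of reaching s'}
    R: dict         # s -> reward for arriving in state s
    gamma: float    # discount factor

# Tiny two-state example: from 'a', action 'go' reaches 'b' with
# probability 0.8 and stays in 'a' with probability 0.2; 'b' is absorbing.
mdp = MDP(
    states=['a', 'b'],
    actions={'a': ['go'], 'b': ['stay']},
    P={('a', 'go'): {'b': 0.8, 'a': 0.2},
       ('b', 'stay'): {'b': 1.0}},
    R={'a': -0.04, 'b': 1.0},
    gamma=0.9,
)

# Each P(. | s, a) is a probability distribution over next states,
# so the probabilities out of every (s, a) pair must sum to 1.
for dist in mdp.P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12
```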
Policies
• Utility U(s): the sum of rewards obtained over the sequence starting from (and including) that state
• The agent wants to maximize utility
• Policy π: a function giving the action to take in each state
• Expected utility: the expected sum of rewards when following a policy
• Optimal policy π*: the policy that yields the highest expected utility – the solution to the problem
Utilities Over Time
• A time interval is one step in the sequence
• Finite or infinite horizon: an infinite horizon means an unlimited number of time steps
  – infinite horizons give stationary policies – the policy doesn't change over time
How to calculate the utility of state sequences?
• assume stationary preferences over state sequences: preferences between sequences don't depend on when the sequence starts
• additive rewards
• discounted rewards
• utility is finite for an infinite horizon with discounted rewards
  – if γ < 1 then Uh ≤ Rmax / (1 – γ) when summed to infinity
• improper policy: one that is not guaranteed to terminate
• discounting is used mainly to avoid problems with improper policies
Discounted rewards: Uh([s0, s1, s2, ...]) = R(s0) + γR(s1) + γ²R(s2) + ...
Additive rewards: Uh([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
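The discounted-rewards sum and its Rmax / (1 – γ) bound can be checked in a few lines; the helper name here is my own:

```python
# Uh([s0, s1, s2, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.5 and rewards [1, 1, 1]: 1 + 0.5 + 0.25 = 1.75.
u = discounted_utility([1, 1, 1], 0.5)

# The infinite sum is bounded by Rmax / (1 - gamma); here Rmax = 1,
# so no prefix of the sequence can exceed 1 / (1 - 0.5) = 2.
assert u <= 1 / (1 - 0.5)
```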
Optimal Policies for utilities of states
Expected utility for some policy π starting in state s
The optimal policy π* has the highest expected utility and will be given by
This sets π*(s) to the argument a of A(s) which gives the highest utility
The optimal policy is actually independent of the start state:
• the actions taken will differ, but the policy itself never changes
• this follows from the nature of a Markovian decision problem with discounted utilities over infinite horizons
• U(s) is likewise independent of the start state – it depends only on s itself
Uπ(s) = E[ Σ_{t=0}^{∞} γ^t R(St) ]

π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
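The argmax over actions can be written out directly; a hedged sketch with made-up states, transitions, and utilities:

```python
# pi*(s) = argmax over a in A(s) of  sum over s' of P(s'|s,a) * U(s')
def best_policy(states, actions, P, U):
    return {s: max(actions[s],
                   key=lambda a: sum(p * U[s2]
                                     for s2, p in P[(s, a)].items()))
            for s in states}

states = ['a', 'b']
actions = {'a': ['stay', 'go'], 'b': ['stay']}
P = {('a', 'stay'): {'a': 1.0},
     ('a', 'go'): {'b': 1.0},
     ('b', 'stay'): {'b': 1.0}}
U = {'a': 0.5, 'b': 1.0}   # assumed utilities, e.g. from value iteration

pi = best_policy(states, actions, P, U)
# In 'a', 'go' leads to the higher-utility state 'b', so pi['a'] is 'go'.
```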
Optimal Policies for utilities of states
Utilities for states in simple example with γ = 1
17.2 – Value Iteration Bellman Equation for Utilities
Bellman Equation:
Utility of a state = the reward of that state plus the discounted utility of the next state
Non-linear (because of the max operator), and a function of the utilities of future states
Sample calculation for simple example:
U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
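A single backup can be worked in code; the 0.8/0.1/0.1 split matches the grid example above, but the neighbour utilities used here are illustrative values of my own, not the chapter's:

```python
# One Bellman backup for a single non-goal state, with gamma = 1.
gamma = 1.0
R_s = -0.04   # reward for a non-goal state

# Expected utility of the next state for each action:
#   sum over s' of P(s'|s,a) * U(s').
# With probability 0.8 the move succeeds; with 0.1 each the agent
# slips to a side neighbour.
expected = {
    'Up':   0.8 * 0.918 + 0.1 * 0.660 + 0.1 * 0.868,
    'Left': 0.8 * 0.660 + 0.1 * 0.918 + 0.1 * 0.611,
}

# U(s) = R(s) + gamma * max over actions of the expected next utility.
U_s = R_s + gamma * max(expected.values())
```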
Value Iteration Algorithm
Hard to solve directly because the equations are non-linear (the max operator), so use an iterative algorithm. The basic idea: start with an initial value for every state, then repeatedly update each state from its neighbours until the values reach equilibrium.
Algorithm
Start with an initial value for all states, e.g. zero
Repeat
  For each state
    Update the state using the utilities of its neighbours from the previous iteration
Until at equilibrium
Each iteration is called a Bellman update, given by
U_{i+1}(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U_i(s')
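The whole loop, including the convergence cutoff discussed later in this section, can be sketched as follows (a toy two-state MDP; all names are my own):

```python
# Value iteration: repeat the Bellman update
#   U_{i+1}(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * U_i(s')
# until successive utilities differ by less than epsilon*(1-gamma)/gamma.
def value_iteration(states, actions, P, R, gamma, epsilon=1e-6):
    U = {s: 0.0 for s in states}          # initial values: zero
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in P[(s, a)].items())
                       for a in actions[s])
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon * (1 - gamma) / gamma:   # convergence cutoff
            return U

# Chain example: 'a' -> 'b' deterministically, 'b' absorbing with reward 1.
states = ['a', 'b']
actions = {'a': ['go'], 'b': ['stay']}
P = {('a', 'go'): {'b': 1.0}, ('b', 'stay'): {'b': 1.0}}
R = {'a': 0.0, 'b': 1.0}

U = value_iteration(states, actions, P, R, gamma=0.5)
# Fixed point: U(b) = 1/(1 - 0.5) = 2, and U(a) = 0.5 * U(b) = 1.
```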
Using Value Iteration on the Simple Example
This shows the utilities over a number of iterations. Notice that they all converge to an equilibrium.
Convergence of Value Iteration
We need to know when to stop iterating
• Early iterations carry some error because the values haven't reached equilibrium yet
• Error = |U'(s) – U(s)| → the difference between the utilities in two successive iterations of the Bellman update
• We cut off the algorithm when max over s of |U'(s) – U(s)| < ε(1 – γ) / γ
  – where ε = c·Rmax and c is a constant factor
• We do this because we want to keep the remaining error small
• We also want to stop after a reasonable number of iterations
Convergence of Value Iteration
Why use c·Rmax(1 – γ) / γ as the cutoff?
• Recall: if γ < 1 over an infinite horizon, Uh is bounded by Rmax / (1 – γ)
• So we stop when the largest update falls below c·Rmax(1 – γ) / γ, where c is a constant factor less than 1
• Normally c is fairly small, around 0.1
• Even with some error, the policy equation (shown before) almost always produces an optimal policy
π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
Convergence of Value Iteration
Number of Bellman update iterations required to converge vs the discount factor for different values of c.
17.3 – Policy Iteration
Uses the idea that in each state, one action is most likely significantly better than the others
Algorithm
start with an initial policy π0
repeat
  Policy evaluation: for each state, calculate Ui given the current policy πi
    – a simplified version of the Bellman update equation – no max needed
  Policy improvement: for each state
    – if the best action under the new utilities gives a higher expected utility than πi(s), set the policy for that state to the better action
until the policy is unchanged
U_{i+1}(s) ← R(s) + γ Σ_{s'} P(s' | s, π_i(s)) U_i(s')
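Both phases can be sketched together; a toy MDP of my own, with the evaluation phase done iteratively (a fixed number of sweeps of the simplified, max-free update):

```python
# Policy iteration: alternate policy evaluation (no max; the policy
# fixes the action) with greedy policy improvement.
def policy_iteration(states, actions, P, R, gamma, k=50):
    pi = {s: actions[s][0] for s in states}   # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Evaluation: k sweeps of
        #   U_{i+1}(s) <- R(s) + gamma * sum_{s'} P(s'|s, pi(s)) * U_i(s')
        for _ in range(k):
            U = {s: R[s] + gamma * sum(p * U[s2]
                                       for s2, p in P[(s, pi[s])].items())
                 for s in states}
        # Improvement: switch to a strictly better action where one exists.
        unchanged = True
        for s in states:
            best = max(actions[s],
                       key=lambda a: sum(p * U[s2]
                                         for s2, p in P[(s, a)].items()))
            if best != pi[s]:
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U

states = ['a', 'b']
actions = {'a': ['risky', 'safe'], 'b': ['stay']}
P = {('a', 'risky'): {'b': 0.5, 'a': 0.5},   # may fail and stay in 'a'
     ('a', 'safe'):  {'b': 1.0},             # always reaches 'b'
     ('b', 'stay'):  {'b': 1.0}}
R = {'a': 0.0, 'b': 1.0}

pi, U = policy_iteration(states, actions, P, R, gamma=0.9)
# 'safe' reaches the rewarding state 'b' sooner, so it wins in 'a'.
```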
17.3 – Policy Iteration Algorithm
17.4 – Partially Observable MDPs (POMDPs)
Properties
Like MDPs, has:
• a set of actions A(s)
• a transition model P(s' | s, a)
• a reward function R(s)
Also has:
• a sensor model P(e | s)
  – gives the probability of perceiving evidence e in state s
• a belief state b(s)
Belief States
• A belief state is a probability distribution over all possible states
• Similar to the probability distributions from chapter 13
• In the update equation, the second term (the sum) forward-projects every possible state s to the next state s'
• The first term, P(e | s'), then updates that prediction with the evidence e
• α is a normalization constant
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
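The update can be sketched in a few lines; a toy example with states, transition, and sensor models invented for illustration:

```python
# b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) * b(s)
def belief_update(b, a, e, P, sensor):
    next_states = {s2 for dist in P.values() for s2 in dist}
    # Forward-project the current belief through the transition model.
    b_pred = {s2: sum(P[(s, a)].get(s2, 0.0) * p for s, p in b.items())
              for s2 in next_states}
    # Weight the prediction by the sensor model, then normalize (alpha).
    unnorm = {s2: sensor[(e, s2)] * p for s2, p in b_pred.items()}
    alpha = 1.0 / sum(unnorm.values())
    return {s2: alpha * p for s2, p in unnorm.items()}

# Two states; 'go' leaves 'a' for 'b' half the time, and the sensor
# reports the true state with probability 0.9.
P = {('a', 'go'): {'a': 0.5, 'b': 0.5}, ('b', 'go'): {'b': 1.0}}
sensor = {('see_b', 'a'): 0.1, ('see_b', 'b'): 0.9}

b = belief_update({'a': 1.0, 'b': 0.0}, 'go', 'see_b', P, sensor)
# Prediction alone gives 0.5/0.5; seeing 'b' shifts the belief toward 'b'.
```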
Decision Cycle for POMDP Agent
General decision cycle is as follows:
• Given belief state b, execute action given from policy, a = π*(b)
• Receive percept e
• Set new belief state using
Neat fact: a POMDP over a physical state space can be reduced to an MDP over the belief-state space. This is not very effective in practice, though, because of the high dimensionality of the belief space.
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
Value Iteration for POMDPs
• Generally very inefficient
• Even the simple example discussed previously is too hard
• Better to use an approximate online algorithm based on look-ahead search
Online agents for POMDPs
• Transition and sensor models are represented by dynamic Bayesian networks
• They are extended with decision and utility nodes
• The model is called a dynamic decision network
• Filtering algorithms are used to update the belief state as percepts arrive and actions are taken
• Decisions are made by projecting forward possible action sequences and choosing the best one
Very similar to the minimax algorithm from chapter 5:
• builds a tree of possible actions to a limited depth
• performs the best-looking action
• repeats the process