CS 7180: Behavioral Modeling and Decision-making in AI
Markov Decision Processes for Complex Decision-making
Prof. Amy Sliva
October 17, 2012
Decisions are nondeterministic
• In many situations, behavior and decisions may have more than one possible outcome
  • Action failures, e.g., the gripper drops its load
  • Exogenous events, e.g., a road is closed
• Still need to be able to plan and make decisions in such situations
• One approach: Markov Decision Processes
[Figure: blocks world with blocks a, b, c — executing grasp(c) may produce the intended outcome (c is held) or the unintended outcome (c is dropped)]
Decision-making under uncertainty
• Decisions traditionally represented in decision trees
  • Example decision problem: Should I have my party inside or outside?
[Figure: decision tree — choose in or out, then weather is dry or wet; outcomes: in/dry = Regret, in/wet = Relieved, out/dry = Perfect!, out/wet = Disaster]
• Limitations of decision trees
  • Exponential in the number of decisions
  • What if there are many stages to consider?
  • What if there are many different outcomes?
• Markov decision processes (MDPs)
  • Framework for complex multi-stage decision problems under uncertainty
  • More compact representation of the problem, with efficient solutions
  • A further specialization of influence diagrams
  • No explicit representation of qualitative dependencies
Stochastic systems of actions
• We have already learned about stochastic systems
  • Markov chain/Markov process
• Stochastic system: a triple Σ = (S, A, P)
  • S = finite set of states
  • A = finite set of actions
  • Pa(s′ | s) = probability of going to s′ if we execute a in s
  • ∑s′ ∈ S Pa(s′ | s) = 1
• Several different possible action representations
  • e.g., Bayes networks (or influence diagrams), probabilistic state-space operators, etc.
  • Same underlying semantics
• Fully specified transition model—explicit enumeration of each Pa(s′ | s)
MDPs are stochastic systems with utility
• Model the dynamics of the environment under different actions
• Formally defined as a 4-tuple MDP = (S, A, T, R)
  • S = set of states: the environmental context
  • A = set of actions: possible behaviors
  • T = transition model, T(s, a, s′) = Pa(s′ | s): what states can result from an action?
  • R = reward model, R(s) and C(s, a): a reward for each state and a cost for each state–action pair
• Markov assumption—next state depends only on current state and action
• Full observability—cannot predict exactly which state will result from an action, but once it is realized we know what it is
Graphical MDP representation
• Nodes are possible states of the world
  • e.g., location of the robot
• Arcs are actions
  • e.g., move to a new location
• Each arc has an associated transition probability
• Each state has an associated reward or cost
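To make the graph concrete, here is a minimal sketch of this structure in Python, using the five-state robot example that appears later in the lecture. The specific arcs, probabilities, and costs are read off the lecture's figures where visible and are otherwise assumptions.

```python
# Sketch of an MDP as a graph: nodes = states, arcs = actions with
# transition probabilities and costs. Arcs/probabilities/costs below are
# assumptions reconstructed from the lecture's robot-navigation figures.
S = ["s1", "s2", "s3", "s4", "s5"]      # robot at locations l1 .. l5

# T[(s, a)] = list of (s', Pa(s' | s)); each list's probabilities sum to 1
T = {
    ("s1", "wait"):           [("s1", 1.0)],
    ("s1", "move(r1,l1,l2)"): [("s2", 1.0)],
    ("s1", "move(r1,l1,l4)"): [("s1", 0.5), ("s4", 0.5)],
    ("s2", "move(r1,l2,l3)"): [("s3", 0.8), ("s5", 0.2)],
    # ... remaining arcs are analogous
}

# R(s) = reward attached to a state; C[(s, a)] = cost attached to an arc
R = {"s1": 0, "s2": 0, "s3": 0, "s4": 100, "s5": -100}
C = {("s1", "wait"): 1,
     ("s1", "move(r1,l1,l2)"): 100,    # "vertical" moves cost 100
     ("s1", "move(r1,l1,l4)"): 1,
     ("s2", "move(r1,l2,l3)"): 1}      # "horizontal" moves cost 1
```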
First-order Markov dynamics and rewards
• First-order Markov dynamics (history independence)
  • P(st+1 | at, st, at-1, st-1, …, s0) = P(st+1 | at, st)
  • Next state depends only on current state and current action
• First-order Markov reward process
  • P(rt | at, st, at-1, st-1, …, s0) = P(rt | at, st)
  • Reward depends only on current state and action
  • Assume reward is specified by a deterministic function R(s)
• Stationary dynamics and reward
  • P(st+1 | at, st) = P(sk+1 | ak, sk) for all t, k
  • The world dynamics do not change over time
Planning actions in MDPs
• Robot r1 starts at location l1
  • State s1 in the diagram
• Objective is to get r1 to location l4
  • State s4 in the diagram
[Figure: five-state navigation MDP; Start = s1, Goal = s4]
Probability of initial states
• For every state s, there is a probability P(s) that the system starts in s
• Sometimes we assume a unique state s0 such that the system always starts in s0
• In the example, s0 = s1
  • P(s1) = 1
  • P(s) = 0 for all s ≠ s1
[Figure: same navigation MDP; Start = s1, Goal = s4]
Planning actions in MDPs
• Robot r1 starts at location l1 (state s1); objective is to get r1 to location l4 (state s4)
• No classical plan (sequence of actions) can be a solution. Why not?
  • There is no guarantee we will be in a state where the next action is applicable
  π = 〈move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)〉
[Figure: same navigation MDP; Start = s1, Goal = s4]
Policies for choosing actions
• A policy is a function mapping states to actions
• Stationary policy
  • π: S → A
  • π(s) is the action to do at state s (regardless of time)—a set of state–action pairs
  • Specifies a continuously reactive controller
• Nonstationary policy
  • π: S × ℕ → A, where ℕ is the non-negative integers
  • π(s, t) is the action to do at state s with t stages to go
  • What if we want to keep acting indefinitely?
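As a quick illustration, a sketch of the two policy types in Python, reusing the policy π1 that appears on the following slides (the nonstationary rule itself is purely illustrative):

```python
# Stationary policy: a fixed mapping S -> A, independent of time.
pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

def nonstationary(s: str, t: int) -> str:
    # Nonstationary policy pi: S x N -> A. Illustrative only: with no
    # stages left we just wait; otherwise we act like pi1.
    return "wait" if t == 0 else pi1[s]
```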
Example stationary policy
• Robot r1 starts at location l1 (state s1); objective is to get r1 to location l4 (state s4)
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
[Figure: same navigation MDP; Start = s1, Goal = s4, with wait self-loops at s4 and s5]
Histories of previous states
• A history is a sequence of system states
  h = 〈s0, s1, s2, s3, s4, …〉
• Examples:
  h0 = 〈s1, s3, s1, s3, s1, …〉
  h1 = 〈s1, s2, s3, s4, s4, …〉
  h2 = 〈s1, s2, s5, s5, s5, …〉
  h3 = 〈s1, s2, s5, s4, s4, …〉
  h4 = 〈s1, s4, s4, s4, s4, …〉
  h5 = 〈s1, s1, s4, s4, s4, …〉
  h6 = 〈s1, s1, s1, s4, s4, …〉
  h7 = 〈s1, s1, s1, s1, s1, …〉
• Each policy induces a probability distribution over histories
  • If h = 〈s0, s1, …〉 then P(h | π) = P(s0) ∏i ≥ 0 Pπ(si)(si+1 | si)
Example history distribution
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = 〈s1, s2, s3, s4, s4, …〉   P(h1 | π1) = 1 × 1 × 0.8 × 1 × … = 0.8
h2 = 〈s1, s2, s5, s5, …〉       P(h2 | π1) = 1 × 1 × 0.2 × 1 × … = 0.2
P(h | π1) = 0 for all other h
[Figure: same navigation MDP; Start = s1, Goal = s4]
π1 reaches the goal with probability 0.8
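A small sketch of the computation P(h | π) = P(s0) ∏i Pπ(si)(si+1 | si) for this example. Only the arcs that h1 and h2 actually traverse are listed; the stochastic probabilities are the ones given above, and every other arc is deterministic:

```python
P0 = {"s1": 1.0}                       # unique start state: P(s1) = 1
T = {                                  # Pa(s' | s) for the arcs of h1, h2
    ("s1", "move(r1,l1,l2)"): [("s2", 1.0)],
    ("s2", "move(r1,l2,l3)"): [("s3", 0.8), ("s5", 0.2)],
    ("s3", "move(r1,l3,l4)"): [("s4", 1.0)],
    ("s4", "wait"): [("s4", 1.0)],
    ("s5", "wait"): [("s5", 1.0)],
}
pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

def history_prob(h, pi):
    # P(h | pi) = P(s0) * prod_i P_{pi(s_i)}(s_{i+1} | s_i)
    p = P0.get(h[0], 0.0)
    for s, s_next in zip(h, h[1:]):
        p *= dict(T.get((s, pi[s]), [])).get(s_next, 0.0)
    return p

print(history_prob(["s1", "s2", "s3", "s4", "s4"], pi1))  # h1 -> 0.8
print(history_prob(["s1", "s2", "s5", "s5", "s5"], pi1))  # h2 -> 0.2
```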
Example history distribution
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h1 = 〈s1, s2, s3, s4, s4, …〉   P(h1 | π2) = 1 × 1 × 0.8 × 1 × … = 0.8
h3 = 〈s1, s2, s5, s4, s4, …〉   P(h3 | π2) = 1 × 1 × 0.2 × 1 × … = 0.2
P(h | π2) = 0 for all other h
[Figure: same navigation MDP; Start = s1, Goal = s4]
π2 reaches the goal with probability 1.0
Example history distribution
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h4 = 〈s1, s4, s4, s4, …〉        P(h4 | π3) = 0.5 × 1 × 1 × 1 × … = 0.5
h5 = 〈s1, s1, s4, s4, s4, …〉    P(h5 | π3) = 0.5 × 0.5 × 1 × 1 × … = 0.25
h6 = 〈s1, s1, s1, s4, s4, …〉    P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
  ⋮
h7 = 〈s1, s1, s1, s1, s1, …〉    P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0
[Figure: same navigation MDP; Start = s1, Goal = s4]
π3 reaches the goal with probability 1.0
Determining the value of a policy
• How good is a policy π? How do we measure "accumulated" reward?
• A utility function V: S → ℝ associates a value with each state (or with each state and time for a nonstationary π)
  • Vπ(s) denotes the value of policy π at state s
  • Depends on the immediate reward, but also on what you achieve subsequently by following π
• An optimal policy is no worse than any other policy at any state
• The goal of MDP reasoning is to compute an optimal policy (the method depends on how we define utility)
Compute utility using reward function
• Numeric cost C(s, a) for each state s and action a
• Numeric reward R(s) for each state s
• No explicit goals now—desirable states have high rewards
• Example:
  • C(s, wait) = 0 at every state except s3
  • C(s, a) = 1 for each "horizontal" action
  • C(s, a) = 100 for each "vertical" action
  • R as shown
• Utility of a history under a policy:
  • If h = 〈s0, s1, …〉, then V(h | π) = ∑i ≥ 0 [R(si) – C(si, π(si))]
[Figure: navigation MDP with rewards r(s4) = 100, r(s5) = –100, r = 0 elsewhere; horizontal moves cost c = 1, vertical moves cost c = 100]
Example utility computation
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = 〈s1, s2, s3, s4, s4, …〉
h2 = 〈s1, s2, s5, s5, …〉
V(h1 | π1) = [R(s1) – C(s1, π1(s1))] + [R(s2) – C(s2, π1(s2))] + [R(s3) – C(s3, π1(s3))] + [R(s4) – C(s4, π1(s4))] + [R(s4) – C(s4, π1(s4))] + …
           = [0 – 100] + [0 – 1] + [0 – 100] + [100 – 0] + [100 – 0] + … = ∞
V(h2 | π1) = [0 – 100] + [0 – 1] + [–100 – 0] + [–100 – 0] + [–100 – 0] + … = –∞
[Figure: same navigation MDP with the rewards and costs above]
Discounted utility
• In a long history of states, distant rewards/costs should have less influence on the current decision
• Use a discount factor γ, 0 ≤ γ ≤ 1
• Discounted utility of a history:
  V(h | π) = ∑i ≥ 0 γ^i [R(si) – C(si, π(si))]
• Convergence is guaranteed if 0 ≤ γ < 1
• Expected utility of a policy: E(π) = ∑h P(h | π) V(h | π)
Discounted and expected utility
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = 〈s1, s2, s3, s4, s4, …〉
h2 = 〈s1, s2, s5, s5, …〉
V(h1 | π1) = 0.9^0[0 – 100] + 0.9^1[0 – 1] + 0.9^2[0 – 100] + 0.9^3[100 – 0] + 0.9^4[100 – 0] + … = 547.1
V(h2 | π1) = 0.9^0[0 – 100] + 0.9^1[0 – 1] + 0.9^2[–100 – 0] + 0.9^3[–100 – 0] + … = –910.9
E(π1) = 0.8 V(h1 | π1) + 0.2 V(h2 | π1) = 0.8(547.1) + 0.2(–910.9) = 255.5
[Figure: same navigation MDP with the rewards and costs above]
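These sums can be checked with a short sketch. The per-step values [R(si) – C(si, π1(si))] are read off the history computations above; the final entry repeats forever (an absorbing wait loop), so its geometric tail is summed in closed form:

```python
GAMMA = 0.9

def discounted_utility(step_values):
    # V(h | pi) = sum_i gamma^i [R(s_i) - C(s_i, pi(s_i))], where the last
    # entry of step_values repeats forever.
    head, tail = step_values[:-1], step_values[-1]
    v = sum(GAMMA**i * x for i, x in enumerate(head))
    return v + tail * GAMMA**len(head) / (1 - GAMMA)  # exact geometric tail

v1 = discounted_utility([0 - 100, 0 - 1, 0 - 100, 100 - 0])  # h1 -> 547.1
v2 = discounted_utility([0 - 100, 0 - 1, -100 - 0])          # h2 -> -910.9
print(v1, v2, 0.8 * v1 + 0.2 * v2)                           # E(pi1) = 255.5
```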
Planning as optimization
• Consider the special case with a start state s0 and all rewards 0
  • Consider cost rather than utility—the negative of what we had before
  • Equations are slightly simpler—can generalize to the case of nonzero rewards
• Discounted cost of a history h under policy π: C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))
• Expected cost of a policy π: E(π) = ∑h P(h | π) C(h | π)
• A policy π is optimal if for every π′, E(π) ≤ E(π′)
• A policy π is everywhere optimal if for every s and every π′, Eπ(s) ≤ Eπ′(s), where Eπ(s) is the expected cost if we start at s rather than s0
Bellman’s theorem
• If π is any policy, then for every s,
  Eπ(s) = C(s, π(s)) + γ ∑s′ ∈ S Pπ(s)(s′ | s) Eπ(s′)
• Let Qπ(s, a) be the expected cost in a state s if we start by executing the action a and use the policy π from then onward:
  Qπ(s, a) = C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ(s′)
• Bellman’s theorem: Suppose π* is everywhere optimal. Then for every s,
  Eπ*(s) = mina∈A(s) Qπ*(s, a)
[Figure: state s with action π(s) leading to successor states s1, s2, …, sn]
Intuition behind Bellman’s theorem
• Bellman’s theorem: Suppose π* is everywhere optimal. Then for every s, Eπ*(s) = mina∈A(s) Qπ*(s, a)
• If we use π* everywhere else, then the optimal actions at s are those in arg mina∈A(s) Qπ*(s, a)
  • If π* is optimal, then at each state it will pick one of those
  • Otherwise we could construct a better policy by using an action in arg mina∈A(s) Qπ*(s, a) instead of the action that π* uses
• From Bellman’s theorem it follows that for all s,
  Eπ*(s) = mina∈A(s) {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ*(s′)}
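A minimal sketch of the Q computation and the backup that Bellman’s theorem justifies, assuming the dict-based T and C structures from the earlier MDP sketch:

```python
def q_value(s, a, C, T, E, gamma=0.9):
    # Q_pi(s, a) = C(s, a) + gamma * sum_{s'} Pa(s' | s) * E_pi(s')
    return C[(s, a)] + gamma * sum(p * E[s2] for s2, p in T[(s, a)])

def bellman_backup(s, actions, C, T, E, gamma=0.9):
    # E_pi*(s) = min_{a in A(s)} Q_pi*(s, a); also return a minimizing action
    best = min(actions[s], key=lambda a: q_value(s, a, C, T, E, gamma))
    return q_value(s, best, C, T, E, gamma), best
```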
Finite horizon decision-making
• First look at policy optimization over a finite horizon
  • Assumes the agent has n time steps to live
  • γ = 1
• To act optimally, should we use a stationary or nonstationary policy?
• Put another way: if you had only one week to live, would you act the same way as if you had fifty years to live?
Finite horizon problems
• Value (utility) depends on stages to go, hence so should the policy
  • Nonstationary π(s, k)
• Vπk(s) is the k-stage-to-go value function for π
  • Expected total reward/cost after executing π for k time steps:
  Vπk(s) = ∑t = 0 to k C(st, π(st)) P(ht | π),
  where C(st, π(st)) is the cost received at stage t and P(ht | π) is the probability of the history seen up to stage t, given the policy π
Computing the finite-horizon value
• The Markov property facilitates dynamic programming
  • Use it to compute Vπk(s)
  • Work backward from the final stage to determine the optimal policy
  Vπ0(s) = C(s, π(s, 0))
  Vπk(s) = C(s, π(s, k)) + ∑s′ Pπ(s,k)(s′ | s) Vπk-1(s′)
  (immediate cost, plus expected future cost with k–1 stages to go)
[Figure: backup from Vk-1 to Vk via action π(s, k), with transition probabilities 0.7 and 0.3]
Bellman backup
• Use Bellman’s theorem to incrementally compute utility values
• Back up the values from Vt to Vt+1 using dynamic programming, choosing the minimum-cost option:
  Vt+1(s) = min{ C(s, a1) + 0.7Vt(s1) + 0.3Vt(s4),
                 C(s, a2) + 0.4Vt(s2) + 0.6Vt(s3) }
• This backup is the core of a procedure called value iteration
[Figure: state s with actions a1 (to s1 with probability 0.7, s4 with 0.3) and a2 (to s2 with 0.4, s3 with 0.6), backing up Vt to Vt+1(s)]
General case value iteration algorithm
1. Start with an arbitrary cost E0(s) for each s and a small ε > 0
2. For i = 1, 2, …
   a) For every s in S and a in A:
      i. Qi(s, a) := C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei–1(s′)
      ii. Ei(s) := mina∈A(s) Qi(s, a)
      iii. πi(s) := arg mina∈A(s) Qi(s, a)
   b) If maxs ∈ S |Ei(s) – Ei–1(s)| < ε, then exit
• πi converges to π* after finitely many iterations, but how can we tell it has converged?
  • In general, Ei ≠ Eπi
  • When πi doesn’t change, Ei may still change, and the changes in Ei may make πi start changing again
• If Ei changes by less than ε and ε is small enough, then πi will no longer change
  • In this case πi has converged to π*
  • How small is small enough?
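A sketch of this loop in Python (cost minimization with discounting and the ε test; data structures as in the earlier sketches):

```python
def value_iteration(S, actions, C, T, gamma=0.9, eps=1e-3):
    E = {s: 0.0 for s in S}                      # arbitrary initial costs E0
    while True:
        # Q_i(s, a) := C(s, a) + gamma * sum_{s'} Pa(s' | s) * E_{i-1}(s')
        Q = {(s, a): C[(s, a)] + gamma * sum(p * E[s2] for s2, p in T[(s, a)])
             for s in S for a in actions[s]}
        E_new = {s: min(Q[(s, a)] for a in actions[s]) for s in S}   # E_i
        pi = {s: min(actions[s], key=lambda a: Q[(s, a)]) for s in S}
        if max(abs(E_new[s] - E[s]) for s in S) < eps:   # termination test
            return E_new, pi
        E = E_new
```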
Value iteration in finite-horizon case
• Use the Markov property and Bellman backups to avoid enumerating all possibilities
• Value iteration
  • Initialize to the final-stage cost: V0(s) = C(s)
  • Compute Vk(s), the optimal k-stage-to-go value function:
    Vk(s) = mina∈A(s) {C(s, a) + ∑s′ ∈ S Pa(s′ | s) Vk-1(s′)}
  • Derive the optimal k-stage-to-go policy:
    π*(s, k) = arg mina∈A(s) {C(s, a) + ∑s′ ∈ S Pa(s′ | s) Vk-1(s′)}
• The optimal value function is unique, but the optimal policy is not
  • Many policies can have the same value
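A sketch of this finite-horizon variant: backward induction with γ = 1, producing both the value functions Vk and a nonstationary policy π*(s, k). The parameter C_final is an assumption standing in for the final-stage cost V0(s) = C(s):

```python
def finite_horizon_vi(S, actions, C, T, C_final, K):
    V = [{s: C_final[s] for s in S}]        # V[0](s) = C(s), final stage
    pi = [{}]                               # pi[k][s] = optimal pi*(s, k)
    for k in range(1, K + 1):
        Vk, pik = {}, {}
        for s in S:
            # Vk(s) = min_a { C(s, a) + sum_{s'} Pa(s' | s) * V_{k-1}(s') }
            q = {a: C[(s, a)] + sum(p * V[k - 1][s2] for s2, p in T[(s, a)])
                 for a in actions[s]}
            pik[s] = min(q, key=q.get)
            Vk[s] = q[pik[s]]
        V.append(Vk)
        pi.append(pik)
    return V, pi
```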
Finite-horizon value iteration example
V1(s4) = mina∈A(s4) {C(s4, a1) + 0.7V0(s1) + 0.3V0(s4),
                     C(s4, a2) + 0.4V0(s2) + 0.6V0(s3)}
[Figure: four-state example; from s4, action a1 goes to s1 with probability 0.7 and s4 with 0.3, action a2 goes to s2 with 0.4 and s3 with 0.6; columns show the value functions V0, V1, V2, V3]
Finite-horizon value iteration example
π*(s4, 1) = arg mina∈{a1,a2} {C(s4, a) + ∑s′ ∈ S Pa(s′ | s4) V0(s′)}
[Figure: same four-state example, showing the value columns V0 through V3 and the actions a1, a2 chosen at each stage]
Complexity of finite-horizon value iteration
• The optimal solution to the (k–1)-stage problem can be used without modification as part of the optimal solution to the k-stage problem
  • Dynamic programming
• Because of the finite horizon, the policy is nonstationary
• What is the computational complexity?
  • T iterations
  • At each iteration, each of n states computes an expectation for |A| actions
  • Each expectation takes O(n) time
• Total time complexity: O(T|A|n²)
  • Polynomial in the number of states. Is this good?
Discounted infinite-horizon MDPs
• Defining value as total reward is problematic with infinite horizons
  • Many or all policies have infinite expected reward
• Use discounted utility to discount the future reward at each time step—discount factor γ, 0 ≤ γ < 1
• Can restrict attention to stationary policies
• Discounted cost of a history h under policy π:
  C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))
• Expected cost of a policy π: E(π) = ∑h P(h | π) C(h | π)
• A policy π is optimal if for every π′, E(π) ≤ E(π′)
Value iteration in infinite horizon case
• Use value iteration and Bellman backups to compute optimal policies just as in the finite-horizon case, now including the discount factor in the computation
• Initialize E0(s) to an arbitrary cost
• Compute Ei(s), the estimate of the optimal value function:
  Ei(s) = mina∈A(s) {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei-1(s′)}
• Converges to the optimal value function as i gets large
Infinite horizon value iteration example
• Let aij be the action that moves from si to sj
  • e.g., a11 = wait and a12 = move(r1,l1,l2)
• Start with E0(s) = 0 for all s, and ε = 1
[Figure: navigation MDP with wait self-loops at every state; horizontal moves cost c = 1, vertical moves cost c = 100; C(s1, wait) = C(s2, wait) = 1, C(s4, wait) = 0, C(s5, wait) = 100]
Infinite horizon value iteration example
For each s and a, compute Q(s, a) := C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei–1(s′):
Q(s1, a11) = 1 + 0.9×0 = 1      Q(s1, a12) = 100 + 0.9×0 = 100    Q(s1, a14) = 1 + 0.9(½×0 + ½×0) = 1
Q(s2, a21) = 100 + 0.9×0 = 100  Q(s2, a22) = 1 + 0.9×0 = 1        Q(s2, a23) = 1 + 0.9(½×0 + ½×0) = 1
Q(s3, a32) = 1 + 0.9×0 = 1      Q(s3, a34) = 100 + 0.9×0 = 100
Q(s4, a41) = 1 + 0.9×0 = 1      Q(s4, a43) = 100 + 0.9×0 = 100    Q(s4, a44) = 0 + 0.9×0 = 0    Q(s4, a45) = 100 + 0.9×0 = 100
Q(s5, a52) = 1 + 0.9×0 = 1      Q(s5, a54) = 100 + 0.9×0 = 100    Q(s5, a55) = 100 + 0.9×0 = 100
Infinite horizon value iteration example
For each s, compute E1(s) = mina Q(s, a) and π1(s) = arg mina Q(s, a):
E1(s1) = 1   E1(s2) = 1   E1(s3) = 1   E1(s4) = 0   E1(s5) = 1
π1(s1) = a11 = wait
π1(s2) = a22 = wait
π1(s3) = a32 = move(r1,l3,l2)
π1(s4) = a44 = wait
π1(s5) = a52 = move(r1,l5,l2)
What other actions could we have chosen? Is ε small enough?
Deciding how to act using MDPs
• Given an Ei from value iteration that closely approximates Eπ*, what should we use as our policy?
• Use the greedy policy:
  π(s) = greedy[Ei(s)] = arg mina∈A(s) Q(s, a),
  where Q(s, a) = C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei(s′)
• Note that the value of the greedy policy may not equal Ei
  • Let EG be the value of the greedy policy. How close is EG to Eπ*?
• Greedy is not too far from optimal if Ei is close to Eπ*
  • In particular, if Ei is within ε of Eπ*, then EG is within 2εγ/(1–γ) of Eπ*
  • Furthermore, there exists a finite ε such that the greedy policy is optimal
  • i.e., even if the value estimate is off, greedy is optimal once the estimate is close enough
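The greedy extraction step as a short sketch, one backup per state over the final value estimate (data structures as in the earlier sketches):

```python
def greedy_policy(S, actions, C, T, E, gamma=0.9):
    # pi(s) = greedy[E](s) = argmin_a { C(s,a) + gamma * sum Pa(s'|s) E(s') }
    def q(s, a):
        return C[(s, a)] + gamma * sum(p * E[s2] for s2, p in T[(s, a)])
    return {s: min(actions[s], key=lambda a: q(s, a)) for s in S}
```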
Policy iteration for infinite-horizon planning
• Policy iteration is another way to find π*
• Suppose there are n states s1, …, sn
• Start with an arbitrary initial policy π1
• For i = 1, 2, …
  • Compute πi's expected costs by solving n equations with n unknowns—n instances of the discounted expected cost equation:
    Eπi(s1) = C(s1, πi(s1)) + γ ∑k=1..n Pπi(s1)(sk | s1) Eπi(sk)
      ⋮
    Eπi(sn) = C(sn, πi(sn)) + γ ∑k=1..n Pπi(sn)(sk | sn) Eπi(sk)
  • For every sj:
    πi+1(sj) = arg mina∈A Qπi(sj, a) = arg mina∈A {C(sj, a) + γ ∑k=1..n Pa(sk | sj) Eπi(sk)}
  • If πi+1 = πi then exit
• Converges in a finite number of iterations
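A sketch of this loop under the data-structure assumptions of the earlier sketches; numpy's linear solver handles the n-equations, n-unknowns evaluation step:

```python
import numpy as np

def policy_iteration(S, actions, C, T, gamma=0.9):
    idx = {s: i for i, s in enumerate(S)}
    pi = {s: actions[s][0] for s in S}       # arbitrary initial policy
    while True:
        # Evaluate pi: solve (I - gamma * P_pi) E = C_pi for E
        P = np.zeros((len(S), len(S)))
        c = np.zeros(len(S))
        for s in S:
            c[idx[s]] = C[(s, pi[s])]
            for s2, p in T[(s, pi[s])]:
                P[idx[s], idx[s2]] = p
        E = np.linalg.solve(np.eye(len(S)) - gamma * P, c)
        # Improve: pi_{i+1}(s) = argmin_a Q_{pi_i}(s, a)
        new_pi = {}
        for s in S:
            q = {a: C[(s, a)] + gamma * sum(p * E[idx[s2]] for s2, p in T[(s, a)])
                 for a in actions[s]}
            new_pi[s] = min(q, key=q.get)
        if new_pi == pi:                      # exit when the policy is stable
            return pi, {s: E[idx[s]] for s in S}
        pi = new_pi
```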
Policy iteration example
• Assume discount factor γ = 0.9
• s5 is undesirable: C(s5, wait) = 100
• Incentive to leave non-goal states:
  • C(s1, wait) = 1
  • C(s2, wait) = 1
[Figure: same navigation MDP with wait self-loops; C(s4, wait) = 0]
Policy iteration example
• Start with the arbitrary policy π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
• Compute expected costs across all states
[Figure: same navigation MDP with the costs above]
Policy iteration example
• Compute expected costs across all states
• At each state s, let π2(s) = arg mina∈A(s) Qπ1(s, a)
• π2 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
[Figure: same navigation MDP with the costs above]
Policy iteration vs. value iteration
• Policy iteration computes an entire policy in each iteration, and computes values based on that policy
  • More work per iteration—needs to solve a set of simultaneous equations
  • Usually converges in a smaller number of iterations
• Value iteration computes new values in each iteration, and chooses a policy based on those values
  • In general, the values are not the values one would get from the chosen policy or any other policy
  • Less work per iteration
  • Usually takes more iterations to converge
Policy iteration vs. value iteration
• For both, each iteration takes time polynomial in the number of states
  • But the number of states is usually quite large
  • Need to examine the entire state space in each iteration
• Thus, these algorithms can take huge amounts of time and space
• Use real-time dynamic programming and heuristics to improve efficiency
Real-time dynamic programming basics
• Explicitly specify goal states
  • If s is a goal, then actions at s have no cost and produce no change
• For each state s, maintain a value V(s) that gets updated as the algorithm proceeds
  • Initially V(s) = h(s), where h is a heuristic function
• Greedy policy: π(s) = greedy[V(s)] = arg mina∈A(s) Q(s, a),
  where Q(s, a) = C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) V(s′)
RTDP algorithm

procedure RTDP(s)
1. Loop until termination condition:
   a) RTDP-trial(s)

procedure RTDP-trial(s)
1. While s is not a goal state:
   a) a := arg mina∈A(s) Q(s, a)
   b) V(s) := Q(s, a)
   c) Randomly pick s′ with probability Pa(s′ | s)
   d) s := s′

• Each trial is a forward search from s toward the goal
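A runnable sketch of these two procedures, with a fixed trial count standing in for the unspecified termination condition (data structures as in the earlier sketches):

```python
import random

def rtdp_trial(s, goals, actions, C, T, V, gamma=0.9):
    # One trial: greedy forward search from s, backing up V along the way.
    while s not in goals:
        q = {a: C[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)])
             for a in actions[s]}
        a = min(q, key=q.get)                       # a := argmin_a Q(s, a)
        V[s] = q[a]                                 # V(s) := Q(s, a)
        succ, probs = zip(*T[(s, a)])
        s = random.choices(succ, weights=probs)[0]  # sample s' ~ Pa(. | s)

def rtdp(s0, goals, actions, C, T, h, trials=100, gamma=0.9):
    V = dict(h)                    # initially V(s) = h(s), h a heuristic
    for _ in range(trials):        # "loop until termination condition"
        rtdp_trial(s0, goals, actions, C, T, V, gamma)
    return V
```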
Example of RTDP
• Same navigation MDP, with γ = 0.9 and h(s) = 0 for all s, so initially V(s) = 0 at every state
• First trial, at the start state s1: with all V values still 0, the Q-values of the three applicable actions are
  Q = 100 + 0.9×0 = 100,  Q = 1 + 0.9(½×0 + ½×0) = 1,  Q = 100 + 0.9×0 = 100
  so the algorithm picks the action with Q = 1, sets V(s1) := 1, and randomly samples the successor (s1 or s4, each with probability ½), continuing until it reaches the goal s4
• Next trial, at s1: the stored value V(s1) = 1 now feeds back into the backup,
  Q = 1 + 0.9(½×1 + ½×0) = 1.45
  so V(s1) := 1.45
• Subsequent trials repeat these greedy backups; the trace shows V(s1) remaining at 1.45
[Figure: navigation MDP annotated with the evolving V and Q values at each step of the trials]
Performance of RTDP
• In practice, RTDP can solve much larger problems than policy iteration and value iteration
• It won't always find an optimal solution, and won't always terminate
  • If h does not overestimate (i.e., h is an admissible heuristic), and a goal is reachable with positive probability from every state, then RTDP will terminate
• If, in addition, there is a positive-probability path between every pair of states, then RTDP will find an optimal solution