Stochastic Games (Part I): Policy Improvement in Discounted (Noncompetitive) Markov Decision Processes
Paul Varkey
Multi Agent Systems Group, Department of Computer Science, UIC
4th Annual Graduate Student Probability Conference, Apr 30, 2010
Duke University, Durham NC
Paul Varkey (CS, UIC) Policy Improvement in Discounted MDPs 4th GSPC, Apr 30, 2010 1 / 18
Outline
1 The Model (definitions, notations and the problem statement)
2 Basic Theorems
3 The Algorithm
4 An Example
References
BLACKWELL, D. (1962): "Discrete Dynamic Programming", The Annals of Mathematical Statistics, Vol. 33, No. 2 (Jun. 1962), pp. 719–726.
FILAR, J.A. and VRIEZE, O.J. (1996): "Competitive Markov Decision Processes: Theory, Algorithms, and Applications", Springer-Verlag, New York, 1996.
Decision Processes, States and Actions
A decision process is a discrete stochastic process observed at discrete time points t = 0, 1, 2, 3, ... called stages, where a decision-maker (or controller) chooses an action at each stage.
S denotes the state space – the process may be in one of (finitely) many states
A denotes the action space – the actions may be chosen from one of (finitely) many actions
Choosing action a in state s results in
(i) an immediate reward r(s, a)
(ii) a probabilistic transition to a state s′ given by p(s′|s, a)
The transition law p(s′|s, a) depends only on the current state and action – this is the (stationary) Markov assumption.
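As a concrete illustration, a finite MDP of this kind can be stored as two tables indexed by state and action. This is a minimal sketch (the dictionary layout is my choice, not from the talk); the rewards and transition probabilities are those of the two-state example worked at the end of the talk.

```python
# A two-state MDP: r[s][a] is the immediate reward r(s, a);
# p[s][a] is the distribution p(.|s, a) over next states.
r = {
    "s1": {"a1": 1.0, "a2": 0.0},
    "s2": {"a1": -1.0, "a2": 0.0},
}
p = {
    "s1": {"a1": {"s1": 0.6, "s2": 0.4}, "a2": {"s1": 1.0, "s2": 0.0}},
    "s2": {"a1": {"s1": 0.6, "s2": 0.4}, "a2": {"s1": 0.0, "s2": 1.0}},
}

# Stationarity: p depends only on (s, a), and every row of p must be a
# probability distribution over S.
assert all(abs(sum(dist.values()) - 1.0) < 1e-12
           for acts in p.values() for dist in acts.values())
```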
Decision Rules, Strategies, Transitions & Rewards
A decision rule f : S → A is a function that specifies the action that the controller chooses in a given state
Each decision rule f defines
– a transition probability matrix P(f) whose (s, s′)-th entry is given by
  P(f)[s, s′] = p(s′|s, f(s))
– a reward vector r(f) whose s-th entry is given by
  r(f)[s] = r(s, f(s))
A (Markov) strategy π is a sequence of decision rules {f_n, n = 0, 1, ...} such that f_n is used by the controller at stage n
A stationary strategy is a stage-independent strategy; i.e. it uses the same decision rule f at each stage, and will be denoted f
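The constructions P(f) and r(f) are mechanical. A small sketch (function names are mine; the data are the two-state example from the end of the talk):

```python
# Build P(f) and r(f) from a decision rule f: S -> A.
S = ["s1", "s2"]
r = {("s1", "a1"): 1.0, ("s1", "a2"): 0.0, ("s2", "a1"): -1.0, ("s2", "a2"): 0.0}
p = {("s1", "a1"): [0.6, 0.4], ("s1", "a2"): [1.0, 0.0],
     ("s2", "a1"): [0.6, 0.4], ("s2", "a2"): [0.0, 1.0]}

def P_of(f):
    """Transition matrix whose (s, s')-th entry is p(s' | s, f(s))."""
    return [p[(s, f[s])] for s in S]

def r_of(f):
    """Reward vector whose s-th entry is r(s, f(s))."""
    return [r[(s, f[s])] for s in S]

f = {"s1": "a1", "s2": "a2"}   # a decision rule
print(P_of(f))   # [[0.6, 0.4], [0.0, 1.0]]
print(r_of(f))   # [1.0, 0.0]
```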
Decision Rules, Strategies, Transitions & Rewards
Each strategy π defines a t-stage transition probability matrix as
P_t(π) := P(f_0)P(f_1)···P(f_{t−1}) for t = 1, 2, ...
where
P_0(π) := I_N (the N × N identity matrix)
Associate with each strategy π the β-discounted value vector
φ_β(π) = Σ_{t=0}^{∞} β^t P_t(π) r(f_t)
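For a stationary strategy f, the discounted value φ_β(f) is the fixed point of v = r(f) + βP(f)v, which can be approximated by simply iterating that map (the iteration contracts at rate β < 1). A minimal sketch, using P(f) and r(f) for the policy f = (a1, a2) from the talk's closing example:

```python
# Approximate phi_beta(f) for a stationary strategy by iterating
# v <- r(f) + beta * P(f) v, which converges geometrically for beta < 1.
beta = 0.9
P = [[0.6, 0.4], [0.0, 1.0]]   # P(f) for f = (a1, a2)
rvec = [1.0, 0.0]              # r(f)

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def value_of(P, rvec, beta, iters=2000):
    v = [0.0] * len(rvec)
    for _ in range(iters):
        Pv = matvec(P, v)
        v = [rvec[i] + beta * Pv[i] for i in range(len(v))]
    return v

v = value_of(P, rvec, beta)
print(v[0])   # ~2.1739 (= 1 / 0.46), as on the example slide
```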
≥ Ordering for Value Vectors and Strategies
≥ Ordering Value Vectors
Denote u ≥ w whenever [u]_s ≥ [w]_s for all s ∈ S
Denote u > w whenever u ≥ w and u ≠ w
≥ Ordering Strategies
π1 ≥ π2 iff φβ(π1) ≥ φβ(π2)
Optimal Strategies
π0 is an optimal strategy if
π0 ≥ π for all π
or, in other words,
π0 “maximizes” φβ(π)
The Problem
Given an MDP, find the optimal strategy
The L(·) operator
Let π+ := {f_1, f_2, ...} (where π = {f_0, f_1, f_2, ...})
Observe that
φ_β(π) = r(f_0) + βP(f_0) Σ_{t=1}^{∞} β^{t−1} P_{t−1}(π+) r(f_t)
       = r(f_0) + βP(f_0) φ_β(π+)
For every decision rule g, we define the associated operator
L(g) : R^|S| → R^|S|
as
L(g)(u) := r(g) + βP(g)u ,  u ∈ R^|S|
In this notation,
φ_β(f, π) = L(f)(φ_β(π))
and, more generally,
φ_β(f_0, f_1, ..., f_{t−1}, π) = L(f_0)L(f_1)···L(f_{t−1})(φ_β(π))
L(g) is a monotone operator:
u ≥ w ⇒ L(g)(u) ≥ L(g)(w)
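The operator L(g) is one affine map per decision rule, and monotonicity follows because P(g) has nonnegative entries. A small numeric spot-check (the vectors u, w are my own choices; P(g), r(g) use the example data from the end of the talk):

```python
# L(g)(u) = r(g) + beta * P(g) u, and a spot-check of its monotonicity:
# u >= w componentwise implies L(g)(u) >= L(g)(w).
beta = 0.9
P_g = [[0.6, 0.4], [0.6, 0.4]]   # P(g) for g = (a1, a1) in the example
r_g = [1.0, -1.0]                # r(g)

def L(u):
    return [r_g[i] + beta * sum(P_g[i][j] * u[j] for j in range(len(u)))
            for i in range(len(u))]

u, w = [2.0, 1.0], [1.0, 0.5]    # u >= w componentwise
Lu, Lw = L(u), L(w)
assert all(a >= b for a, b in zip(Lu, Lw))   # monotonicity holds here
print(Lu, Lw)
```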
Theorem 1
Let π0 = {f^0_n, n = 0, 1, ...}. If π0 ≥ (f, π0) for all f, then π0 is optimal.
Proof.
Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy.
π0 ≥ (f, π0) ⇒ φ_β(π0) ≥ φ_β(f, π0) = L(f)(φ_β(π0)) for every decision rule f
We use the monotonicity of the L(f) operator to obtain, e.g.,
φ_β(π0) ≥ L(f_{T−1})(φ_β(π0)) ≥ L(f_{T−1})L(f_T)(φ_β(π0))
Iterating this argument over the first (T + 1) decision rules of π, we obtain
φ_β(π0) ≥ L(f_0)L(f_1)···L(f_T)(φ_β(π0))
        = φ_β(f_0, f_1, ..., f_T, π0)
        = Σ_{t=0}^{T} β^t P_t(π) r(f_t) + Σ_{t=T+1}^{∞} β^t P_t(f_0, ..., f_T, π0) r(f^0_{t−T−1})
for every nonnegative integer T.
Letting T → ∞, the last term above vanishes (β < 1 and the rewards are bounded), leading to
φ_β(π0) ≥ Σ_{t=0}^{∞} β^t P_t(π) r(f_t) = φ_β(π)
Since π was arbitrary, the proof is complete.
Theorem 2
Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy and f a decision rule such that (f, π) > π. Then f > π.
Proof.
(f, π) > π ⇒ φ_β(f, π) = L(f)(φ_β(π)) > φ_β(π)
As before, we use the monotonicity of L(f) and repeated application of the above inequality to obtain, for T copies of f,
φ_β(f, f, ..., f, π) = L^(T)(f)(φ_β(π)) ≥ L(f)(φ_β(π)) = φ_β(f, π) > φ_β(π)
Letting T → ∞ yields
φ_β(f) = φ_β(f, f, f, ...) ≥ φ_β(f, π) > φ_β(π)
⇒ f > π
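The limit step in this proof can be seen numerically: repeatedly applying L(f) from any starting vector drives it to φ_β(f), since L(f) contracts distances by the factor β. A sketch, assuming the decision rule f = (a1, a1) from the talk's example (for these data the exact fixed point works out to (2.8, 0.8)):

```python
# Repeated application of L(f) from an arbitrary vector converges to
# phi_beta(f): L(f) is a beta-contraction in the sup norm.
beta = 0.9
P_f = [[0.6, 0.4], [0.6, 0.4]]   # P(f) for f = (a1, a1)
r_f = [1.0, -1.0]                # r(f)

def L(u):
    return [r_f[i] + beta * sum(P_f[i][j] * u[j] for j in range(2))
            for i in range(2)]

u = [100.0, -50.0]               # arbitrary starting vector
for _ in range(500):
    u = L(u)
print(u)   # approaches [2.8, 0.8], the fixed point v = r(f) + beta P(f) v
```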
Policy Improvement Algorithm
Step 0. (Initialization) Set k := 0. Select any pure strategy f. Set f_0 := f and φ^0 := φ_β(f_0) = [I − βP(f_0)]^{−1} r(f_0)
Step 1. (Check optimality) Let a^k_s be the action selected by f_k in state s. If the "optimality equation"
r(s, a^k_s) + β Σ_{s′=1}^{N} p(s′|s, a^k_s) φ^k(s′) = max_{a∈A} { r(s, a) + β Σ_{s′=1}^{N} p(s′|s, a) φ^k(s′) }   (∗)
holds for each s ∈ S, STOP. The strategy f_k is optimal and φ^k is the discounted optimal value vector.
Policy Improvement Algorithm
Step 2. (Policy improvement) Let S̄ be the (nonempty) subset of states for which equality is violated in (∗), i.e., the left side is strictly smaller than the right side. For each s ∈ S̄, define
â_s = argmax_{a∈A} { r(s, a) + β Σ_{s′=1}^{N} p(s′|s, a) φ^k(s′) }
and a new decision rule g by
g(s) = f_k(s) if s ∉ S̄
g(s) = â_s if s ∈ S̄
Set f_{k+1} := g and φ^{k+1} := φ_β(f_{k+1})
Step 3. (Iteration) Set k := k + 1. Return to Step 1.
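Steps 0–3 can be sketched in a few lines. This is a minimal implementation of the loop above, not the talk's own code; it evaluates policies iteratively rather than by the matrix inverse of Step 0, and it runs on the two-state example from the end of the talk.

```python
# Policy improvement on a finite MDP (Steps 0-3), two-state example data.
beta = 0.9
S, A = [0, 1], ["a1", "a2"]
r = {(0, "a1"): 1.0, (0, "a2"): 0.0, (1, "a1"): -1.0, (1, "a2"): 0.0}
p = {(0, "a1"): [0.6, 0.4], (0, "a2"): [1.0, 0.0],
     (1, "a1"): [0.6, 0.4], (1, "a2"): [0.0, 1.0]}

def evaluate(f, iters=4000):
    """phi_beta(f): fixed point of v = r(f) + beta P(f) v, by iteration."""
    v = [0.0] * len(S)
    for _ in range(iters):
        v = [r[(s, f[s])] + beta * sum(p[(s, f[s])][t] * v[t] for t in S)
             for s in S]
    return v

def improve(v):
    """Greedy decision rule w.r.t. v; equals f_k exactly when (*) holds."""
    return {s: max(A, key=lambda a: r[(s, a)]
                   + beta * sum(p[(s, a)][t] * v[t] for t in S))
            for s in S}

f = {0: "a2", 1: "a2"}        # Step 0: any pure strategy
while True:
    v = evaluate(f)           # policy evaluation
    g = improve(v)            # Steps 1-2: check (*) / improve
    if g == f:                # optimality equation holds -> STOP
        break
    f = g                     # Step 3: iterate
print(f, v)                   # optimal strategy (a1, a1)
```

The loop visits (a2, a2) → (a1, a2) → (a1, a1), mirroring the example slides.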
Correctness & Termination
Improvement Lemma
fk+1 > fk
Proof.
Let f_{k+1} be the decision rule corresponding to f_{k+1}. By its definition (Step 2), we have that
φ_β(f_{k+1}, f_k) = r(f_{k+1}) + βP(f_{k+1}) φ_β(f_k) > φ_β(f_k)
i.e., that
(f_{k+1}, f_k) > f_k
Then, from Theorem 2, f_{k+1} > f_k.
Correctness & Termination
Termination Lemma
The Policy Improvement Algorithm terminates in finitely many steps.
Proof.
At stage k, if the algorithm has not already found an optimal strategy, it computes f_{k+1} such that f_{k+1} > f_k (Improvement Lemma). Therefore the algorithm never revisits a strategy, i.e., it does not cycle. Since there are only finitely many pure stationary strategies, the algorithm must terminate; by Theorem 1, it does so at an (the) optimal strategy.
An Example
State s1:
  a     r(s1, a)   p([s1, s2] | s1, a)
  a1    1          [0.6, 0.4]
  a2    0          [1, 0]
State s2:
  a     r(s2, a)   p([s1, s2] | s2, a)
  a1    -1         [0.6, 0.4]
  a2    0          [0, 1]
A policy is denoted as (action in state 1, action in state 2). The corresponding value vector gives the value of the current policy when starting in (state 1, state 2).
Let β = 0.9
Iteration 0: Policy f_0 = (a2, a2)
Evaluate: value vector φ^0 = (0, 0)
Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6·0 + 0.4·0) = 1
    consider a2: 0 + 0.9(1·0 + 0·0) = 0
  For state s2,
    consider a1: −1 + 0.9(0.6·0 + 0.4·0) = −1
    consider a2: 0 + 0.9(0·0 + 1·0) = 0
  s1 improves under a1; s2 stays with a2 ⇒ f_1 = (a1, a2)
Iteration 1: Policy f_1 = (a1, a2)
Evaluate: value vector φ^1 = (2.1739, 0)
Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6·2.1739 + 0.4·0) = 2.1739
    consider a2: 0 + 0.9(1·2.1739 + 0·0) = 1.9565
  For state s2,
    consider a1: −1 + 0.9(0.6·2.1739 + 0.4·0) = 0.1739
    consider a2: 0 + 0.9(0·2.1739 + 1·0) = 0
  s2 improves under a1 ⇒ f_2 = (a1, a1)
Iteration 2: Policy f_2 = (a1, a1)
Evaluate: value vector φ^2 = (2.8, 0.8) (solving v = r(f_2) + 0.9 P(f_2) v exactly)
Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6·2.8 + 0.4·0.8) = 2.8
    consider a2: 0 + 0.9(1·2.8 + 0·0.8) = 2.52
  For state s2,
    consider a1: −1 + 0.9(0.6·2.8 + 0.4·0.8) = 0.8
    consider a2: 0 + 0.9(0·2.8 + 1·0.8) = 0.72
No improvement! ⇒ Optimal strategy: (a1, a1)
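The policy evaluations in the example can be checked in closed form via φ_β(f) = [I − βP(f)]^{−1} r(f). A sketch with a hand-rolled 2×2 inverse (the helper `solve2` is mine, not from the talk):

```python
# Closed-form check of the example's policy evaluations:
# phi_beta(f) = (I - beta P(f))^{-1} r(f), done with a 2x2 inverse.
beta = 0.9

def solve2(P, r):
    # Invert M = I - beta*P for the 2x2 case and apply M^{-1} to r.
    a, b = 1 - beta * P[0][0], -beta * P[0][1]
    c, d = -beta * P[1][0], 1 - beta * P[1][1]
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (-c * r[0] + a * r[1]) / det]

print(solve2([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))   # f_0 = (a2,a2): [0, 0]
print(solve2([[0.6, 0.4], [0.0, 1.0]], [1.0, 0.0]))   # f_1 = (a1,a2): [2.1739..., 0]
print(solve2([[0.6, 0.4], [0.6, 0.4]], [1.0, -1.0]))  # f_2 = (a1,a1): [2.8, 0.8]
```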
Thank You! Any Questions?