Stochastic Games (Part I): Policy Improvement in Discounted (Noncompetitive) Markov Decision Processes
Paul Varkey
Multi Agent Systems Group, Department of Computer Science, UIC
4th Annual Graduate Student Probability Conference, Apr 30, 2010
Duke University, Durham NC
Paul Varkey (CS, UIC) Policy Improvement in Discounted MDPs 4th GSPC, Apr 30, 2010 1 / 18
Outline
1 The Model (definitions, notations and the problem statement)
2 Basic Theorems
3 The Algorithm
4 An Example
References
BLACKWELL, D. (1962): "Discrete Dynamic Programming", The Annals of Mathematical Statistics, Vol. 33, No. 2 (Jun. 1962), pp. 719–726.
FILAR, J.A. and VRIEZE, O.J. (1996): "Competitive Markov Decision Processes: Theory, Algorithms, and Applications", Springer-Verlag, New York, 1996.
Decision Processes, States and Actions
A decision process is a discrete stochastic process observed at discrete time points t = 0, 1, 2, 3, ... called stages, where a decision-maker (or controller) chooses an action at each stage.
S denotes the state space – the process may be in one of (finitely) many states
A denotes the action space – the actions may be chosen from one of (finitely) many actions
Choosing action a in state s results in
(i) an immediate reward r(s, a)
(ii) a probabilistic transition to a state s′ given by p(s′|s, a)
The transition law p(s′|s, a) depends only on the current state and action – this is the (stationary) Markov assumption.
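As a concrete illustration, a finite MDP of this kind can be stored as two tables indexed by state and action. This is a minimal sketch (the dictionary layout is my choice, not from the talk); the rewards and transition probabilities are those of the two-state example worked at the end of the talk.

```python
# A two-state MDP: r[s][a] is the immediate reward r(s, a);
# p[s][a] is the distribution p(.|s, a) over next states.
r = {
    "s1": {"a1": 1.0, "a2": 0.0},
    "s2": {"a1": -1.0, "a2": 0.0},
}
p = {
    "s1": {"a1": {"s1": 0.6, "s2": 0.4}, "a2": {"s1": 1.0, "s2": 0.0}},
    "s2": {"a1": {"s1": 0.6, "s2": 0.4}, "a2": {"s1": 0.0, "s2": 1.0}},
}

# Stationarity: p depends only on (s, a), and every row of p must be a
# probability distribution over S.
assert all(abs(sum(dist.values()) - 1.0) < 1e-12
           for acts in p.values() for dist in acts.values())
```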
Decision Rules, Strategies, Transitions & Rewards
A decision rule f : S → A is a function that specifies the action that the controller chooses in a given state
Each decision rule f defines
– a transition probability matrix P(f) whose (s, s′)-th entry is given by
  P(f)[s, s′] = p(s′|s, f(s))
– a reward vector r(f) whose s-th entry is given by
  r(f)[s] = r(s, f(s))
A (Markov) strategy π is a sequence of decision rules {f_n, n = 0, 1, ...} such that f_n is used by the controller at stage n
A stationary strategy is a stage-independent strategy; i.e. it uses the same decision rule f at each stage, and will be denoted f
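The constructions P(f) and r(f) are mechanical. A small sketch (function names are mine; the data are the two-state example from the end of the talk):

```python
# Build P(f) and r(f) from a decision rule f: S -> A.
S = ["s1", "s2"]
r = {("s1", "a1"): 1.0, ("s1", "a2"): 0.0, ("s2", "a1"): -1.0, ("s2", "a2"): 0.0}
p = {("s1", "a1"): [0.6, 0.4], ("s1", "a2"): [1.0, 0.0],
     ("s2", "a1"): [0.6, 0.4], ("s2", "a2"): [0.0, 1.0]}

def P_of(f):
    """Transition matrix whose (s, s')-th entry is p(s' | s, f(s))."""
    return [p[(s, f[s])] for s in S]

def r_of(f):
    """Reward vector whose s-th entry is r(s, f(s))."""
    return [r[(s, f[s])] for s in S]

f = {"s1": "a1", "s2": "a2"}   # a decision rule
print(P_of(f))   # [[0.6, 0.4], [0.0, 1.0]]
print(r_of(f))   # [1.0, 0.0]
```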
Decision Rules, Strategies, Transitions & Rewards
Each strategy π defines a t-stage transition probability matrix as
P_t(π) := P(f_0)P(f_1)···P(f_{t−1}) for t = 1, 2, ...
where
P_0(π) := I_N (the N × N identity matrix)
Associate with each strategy π the β-discounted value vector
φ_β(π) = Σ_{t=0}^{∞} β^t P_t(π) r(f_t)
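For a stationary strategy f, the discounted value φ_β(f) is the fixed point of v = r(f) + βP(f)v, which can be approximated by simply iterating that map (the iteration contracts at rate β < 1). A minimal sketch, using P(f) and r(f) for the policy f = (a1, a2) from the talk's closing example:

```python
# Approximate phi_beta(f) for a stationary strategy by iterating
# v <- r(f) + beta * P(f) v, which converges geometrically for beta < 1.
beta = 0.9
P = [[0.6, 0.4], [0.0, 1.0]]   # P(f) for f = (a1, a2)
rvec = [1.0, 0.0]              # r(f)

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def value_of(P, rvec, beta, iters=2000):
    v = [0.0] * len(rvec)
    for _ in range(iters):
        Pv = matvec(P, v)
        v = [rvec[i] + beta * Pv[i] for i in range(len(v))]
    return v

v = value_of(P, rvec, beta)
print(v[0])   # ~2.1739 (= 1 / 0.46), as on the example slide
```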
≥ Ordering for Value Vectors and Strategies
≥ Ordering Value Vectors
Denote u ≥ w whenever [u]_s ≥ [w]_s for all s ∈ S
Denote u > w whenever u ≥ w and u ≠ w
≥ Ordering Strategies
π1 ≥ π2 iff φβ(π1) ≥ φβ(π2)
Optimal Strategies
π0 is an optimal strategy if
π0 ≥ π for all π
or, in other words,
π0 “maximizes” φβ(π)
The Problem
Given an MDP, find the optimal strategy
The L(·) operator
Let π+ := {f_1, f_2, ...} (where π = {f_0, f_1, f_2, ...})
Observe that
φ_β(π) = r(f_0) + βP(f_0) Σ_{t=1}^{∞} β^{t−1} P_{t−1}(π+) r(f_t)
       = r(f_0) + βP(f_0) φ_β(π+)
For every decision rule g, we define the associated operator
L(g) : R^|S| → R^|S|
as
L(g)(u) := r(g) + βP(g)u ,  u ∈ R^|S|
In this notation,
φ_β(f, π) = L(f)(φ_β(π))
and, more generally,
φ_β(f_0, f_1, ..., f_{t−1}, π) = L(f_0)L(f_1)···L(f_{t−1})(φ_β(π))
L(g) is a monotone operator:
u ≥ w ⇒ L(g)(u) ≥ L(g)(w)
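The operator L(g) is one affine map per decision rule, and monotonicity follows because P(g) has nonnegative entries. A small numeric spot-check (the vectors u, w are my own choices; P(g), r(g) use the example data from the end of the talk):

```python
# L(g)(u) = r(g) + beta * P(g) u, and a spot-check of its monotonicity:
# u >= w componentwise implies L(g)(u) >= L(g)(w).
beta = 0.9
P_g = [[0.6, 0.4], [0.6, 0.4]]   # P(g) for g = (a1, a1) in the example
r_g = [1.0, -1.0]                # r(g)

def L(u):
    return [r_g[i] + beta * sum(P_g[i][j] * u[j] for j in range(len(u)))
            for i in range(len(u))]

u, w = [2.0, 1.0], [1.0, 0.5]    # u >= w componentwise
Lu, Lw = L(u), L(w)
assert all(a >= b for a, b in zip(Lu, Lw))   # monotonicity holds here
print(Lu, Lw)
```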
Theorem 1
Let π0 = {f^0_n, n = 0, 1, ...}. If π0 ≥ (f, π0) for all f, then π0 is optimal.
Proof.
Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy.
π0 ≥ (f, π0) ⇒ φ_β(π0) ≥ φ_β(f, π0) = L(f)(φ_β(π0)) for every decision rule f
We use the monotonicity of the L(f) operator to obtain, e.g.,
φ_β(π0) ≥ L(f_{T−1})(φ_β(π0)) ≥ L(f_{T−1})L(f_T)(φ_β(π0))
Iterating this argument over the first (T + 1) decision rules of π, we obtain
φ_β(π0) ≥ L(f_0)L(f_1)···L(f_T)(φ_β(π0))
        = φ_β(f_0, f_1, ..., f_T, π0)
        = Σ_{t=0}^{T} β^t P_t(π) r(f_t) + Σ_{t=T+1}^{∞} β^t P_t(f_0, ..., f_T, π0) r(f^0_{t−T−1})
for every nonnegative integer T.
Letting T → ∞, the last term above vanishes (β < 1 and the rewards are bounded), leading to
φ_β(π0) ≥ Σ_{t=0}^{∞} β^t P_t(π) r(f_t) = φ_β(π)
Since π was arbitrary, the proof is complete.
Theorem 2
Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy and f a decision rule such that (f, π) > π. Then f > π.
Proof.
(f, π) > π ⇒ φ_β(f, π) = L(f)(φ_β(π)) > φ_β(π)
As before, we use the monotonicity of L(f) and repeated application of the above inequality to obtain, for T copies of f,
φ_β(f, f, ..., f, π) = L^(T)(f)(φ_β(π)) ≥ L(f)(φ_β(π)) = φ_β(f, π) > φ_β(π)
Letting T → ∞ yields
φ_β(f) = φ_β(f, f, f, ...) ≥ φ_β(f, π) > φ_β(π)
⇒ f > π
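The limit step in this proof can be seen numerically: repeatedly applying L(f) from any starting vector drives it to φ_β(f), since L(f) contracts distances by the factor β. A sketch, assuming the decision rule f = (a1, a1) from the talk's example (for these data the exact fixed point works out to (2.8, 0.8)):

```python
# Repeated application of L(f) from an arbitrary vector converges to
# phi_beta(f): L(f) is a beta-contraction in the sup norm.
beta = 0.9
P_f = [[0.6, 0.4], [0.6, 0.4]]   # P(f) for f = (a1, a1)
r_f = [1.0, -1.0]                # r(f)

def L(u):
    return [r_f[i] + beta * sum(P_f[i][j] * u[j] for j in range(2))
            for i in range(2)]

u = [100.0, -50.0]               # arbitrary starting vector
for _ in range(500):
    u = L(u)
print(u)   # approaches [2.8, 0.8], the fixed point v = r(f) + beta P(f) v
```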
Policy Improvement Algorithm
Step 0. (Initialization) Set k := 0. Select any pure strategy f. Set f_0 := f and φ^0 := φ_β(f_0) = [I − βP(f_0)]^{−1} r(f_0)
Step 1. (Check optimality) Let a^k_s be the action selected by f_k in state s. If the "optimality equation"
r(s, a^k_s) + β Σ_{s′=1}^{N} p(s′|s, a^k_s) φ^k(s′) = max_{a∈A} { r(s, a) + β Σ_{s′=1}^{N} p(s′|s, a) φ^k(s′) }   (∗)
holds for each s ∈ S, STOP. The strategy f_k is optimal and φ^k is the discounted optimal value vector.
Policy Improvement Algorithm
Step 2. (Policy improvement) Let S̄ be the (nonempty) subset of states for which equality is violated in (∗), i.e., the left side is strictly smaller than the right side. For each s ∈ S̄, define
â_s = argmax_{a∈A} { r(s, a) + β Σ_{s′=1}^{N} p(s′|s, a) φ^k(s′) }
and a new decision rule g by
g(s) = f_k(s) if s ∉ S̄
g(s) = â_s if s ∈ S̄
Set f_{k+1} := g and φ^{k+1} := φ_β(f_{k+1})
Step 3. (Iteration) Set k := k + 1. Return to Step 1.
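Steps 0–3 can be sketched in a few lines. This is a minimal implementation of the loop above, not the talk's own code; it evaluates policies iteratively rather than by the matrix inverse of Step 0, and it runs on the two-state example from the end of the talk.

```python
# Policy improvement on a finite MDP (Steps 0-3), two-state example data.
beta = 0.9
S, A = [0, 1], ["a1", "a2"]
r = {(0, "a1"): 1.0, (0, "a2"): 0.0, (1, "a1"): -1.0, (1, "a2"): 0.0}
p = {(0, "a1"): [0.6, 0.4], (0, "a2"): [1.0, 0.0],
     (1, "a1"): [0.6, 0.4], (1, "a2"): [0.0, 1.0]}

def evaluate(f, iters=4000):
    """phi_beta(f): fixed point of v = r(f) + beta P(f) v, by iteration."""
    v = [0.0] * len(S)
    for _ in range(iters):
        v = [r[(s, f[s])] + beta * sum(p[(s, f[s])][t] * v[t] for t in S)
             for s in S]
    return v

def improve(v):
    """Greedy decision rule w.r.t. v; equals f_k exactly when (*) holds."""
    return {s: max(A, key=lambda a: r[(s, a)]
                   + beta * sum(p[(s, a)][t] * v[t] for t in S))
            for s in S}

f = {0: "a2", 1: "a2"}        # Step 0: any pure strategy
while True:
    v = evaluate(f)           # policy evaluation
    g = improve(v)            # Steps 1-2: check (*) / improve
    if g == f:                # optimality equation holds -> STOP
        break
    f = g                     # Step 3: iterate
print(f, v)                   # optimal strategy (a1, a1)
```

The loop visits (a2, a2) → (a1, a2) → (a1, a1), mirroring the example slides.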
Correctness & Termination
Improvement Lemma
fk+1 > fk
Proof.
Let f_{k+1} be the decision rule corresponding to f_{k+1}. By its definition (Step 2), we have that
φ_β(f_{k+1}, f_k) = r(f_{k+1}) + βP(f_{k+1}) φ_β(f_k) > φ_β(f_k)
i.e., that
(f_{k+1}, f_k) > f_k
Then, from Theorem 2, f_{k+1} > f_k.
Correctness & Termination
Termination Lemma
The Policy Improvement Algorithm terminates in finitely many steps.
Proof.
At stage k, if the algorithm has not already found an optimal strategy, it computes f_{k+1} such that f_{k+1} > f_k (Improvement Lemma). Therefore the algorithm never revisits a strategy, i.e., it does not cycle. Since there are only finitely many pure stationary strategies, the algorithm must terminate; by Theorem 1, it does so at an (the) optimal strategy.
An Example
State s1:
  a     r(s1, a)   p([s1, s2] | s1, a)
  a1    1          [0.6, 0.4]
  a2    0          [1, 0]
State s2:
  a     r(s2, a)   p([s1, s2] | s2, a)
  a1    -1         [0.6, 0.4]
  a2    0          [0, 1]
A policy is denoted as (action in state 1, action in state 2). The corresponding value vector gives the value of the current policy when starting in (state 1, state 2).
Let β = 0.9
Iteration 0: Policy f_0 = (a2, a2)
Evaluate: value vector φ^0 = (0, 0)
Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6·0 + 0.4·0) = 1
    consider a2: 0 + 0.9(1·0 + 0·0) = 0
  For state s2,
    consider a1: −1 + 0.9(0.6·0 + 0.4·0) = −1
    consider a2: 0 + 0.9(0·0 + 1·0) = 0
  s1 improves under a1; s2 stays with a2 ⇒ f_1 = (a1, a2)
Iteration 1: Policy f_1 = (a1, a2)
Evaluate: value vector φ^1 = (2.1739, 0)
Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6·2.1739 + 0.4·0) = 2.1739
    consider a2: 0 + 0.9(1·2.1739 + 0·0) = 1.9565
  For state s2,
    consider a1: −1 + 0.9(0.6·2.1739 + 0.4·0) = 0.1739
    consider a2: 0 + 0.9(0·2.1739 + 1·0) = 0
  s2 improves under a1 ⇒ f_2 = (a1, a1)
Iteration 2: Policy f_2 = (a1, a1)
Evaluate: value vector φ^2 = (2.8, 0.8) (solving v = r(f_2) + 0.9 P(f_2) v exactly)
Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6·2.8 + 0.4·0.8) = 2.8
    consider a2: 0 + 0.9(1·2.8 + 0·0.8) = 2.52
  For state s2,
    consider a1: −1 + 0.9(0.6·2.8 + 0.4·0.8) = 0.8
    consider a2: 0 + 0.9(0·2.8 + 1·0.8) = 0.72
No improvement! ⇒ Optimal strategy: (a1, a1)
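The policy evaluations in the example can be checked in closed form via φ_β(f) = [I − βP(f)]^{−1} r(f). A sketch with a hand-rolled 2×2 inverse (the helper `solve2` is mine, not from the talk):

```python
# Closed-form check of the example's policy evaluations:
# phi_beta(f) = (I - beta P(f))^{-1} r(f), done with a 2x2 inverse.
beta = 0.9

def solve2(P, r):
    # Invert M = I - beta*P for the 2x2 case and apply M^{-1} to r.
    a, b = 1 - beta * P[0][0], -beta * P[0][1]
    c, d = -beta * P[1][0], 1 - beta * P[1][1]
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (-c * r[0] + a * r[1]) / det]

print(solve2([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))   # f_0 = (a2,a2): [0, 0]
print(solve2([[0.6, 0.4], [0.0, 1.0]], [1.0, 0.0]))   # f_1 = (a1,a2): [2.1739..., 0]
print(solve2([[0.6, 0.4], [0.6, 0.4]], [1.0, -1.0]))  # f_2 = (a1,a1): [2.8, 0.8]
```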
Thank You! Any Questions?