
Page 1: Topics in Computational Sustainability (ermon/cs325/slides/mdp2.pdf, 2016-05-17)

Topics in Computational Sustainability
CS 325

Spring 2016

Making Choices: Sequential Decision Making

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Page 2

Stochastic programming

[Figure: a decision node branches to actions a, b, c with outcome probabilities {a(pa), b(pb), c(pc)}, driven by a probabilistic model (e.g., wet vs. dry).]

The decision maximizes expected utility (equivalently, minimizes expected cost).

Page 3

Problem Setup

Given a limited budget, which parcels should I conserve to maximize the expected number of occupied territories in 50 years?

[Figure: map showing conserved parcels, available parcels, current territories, and potential territories.]

Page 4

Metapopulation = Cascade

[Figure: patches i, j, k, l, m replicated across several time steps in a layered graph.]

• The metapopulation model can be viewed as a cascade in the layered graph representing territories over time

• Target nodes: territories at the final time step

Page 5

Management Actions

• Conserving parcels adds nodes to the network to create new pathways for the cascade

[Figure: initial network plus candidate Parcel 1 and Parcel 2.]

Page 6

Management Actions (animation step of the Page 5 figure)

Page 7

Management Actions (animation step of the Page 5 figure)

Page 8

Cascade Optimization Problem

Given:

• Patch network
  – Initially occupied territories
  – Colonization and extinction probabilities

• Management actions
  – Already-conserved parcels
  – List of available parcels and their costs

• Time horizon T

• Budget B

Find a set of parcels with total cost at most B that maximizes the expected number of occupied territories at time T.

Can we make our decision adaptively?
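As a sketch of how the objective might be evaluated: the expected number of occupied territories under a candidate conservation plan can be estimated by Monte Carlo simulation of the cascade. Everything below (function names, the dictionary encoding of colonization and extinction probabilities, the assumption that a territory colonizes neighbors before its own extinction is applied) is hypothetical scaffolding, not the course's implementation:

```python
import random

def simulate_cascade(territories, active, col_prob, ext_prob, T, rng):
    """One stochastic rollout of the metapopulation cascade.

    territories: list of territory ids; active: set of initially occupied ids;
    col_prob[(i, j)]: per-step chance that occupied i colonizes j;
    ext_prob[i]: per-step chance that an occupied territory i goes extinct.
    (Colonization attempts happen before extinction is applied -- a modeling
    assumption for this sketch.)
    """
    occupied = set(active)
    for _ in range(T):
        nxt = set()
        for i in occupied:
            if rng.random() >= ext_prob[i]:        # territory survives this step
                nxt.add(i)
            for j in territories:
                if j not in occupied and rng.random() < col_prob.get((i, j), 0.0):
                    nxt.add(j)                     # colonization event
        occupied = nxt
    return len(occupied)

def expected_occupied(territories, active, col_prob, ext_prob, T,
                      n_samples=1000, seed=0):
    """Sample-average estimate of the expected occupied count at time T."""
    rng = random.Random(seed)
    total = sum(simulate_cascade(territories, active, col_prob, ext_prob, T, rng)
                for _ in range(n_samples))
    return total / n_samples
```

With such an estimator, a (heuristic) plan search could score each affordable parcel set by `expected_occupied` on the network it induces.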

Page 9

Sequential decision making

• We have a system that changes state over time

• We can (partially) control the system's state transitions by taking actions

• An objective specifies which states (or state sequences) are more or less preferred

• Problem: at each time step, select an action to optimize the overall (long-term) objective, i.e., produce the most preferred sequences of states

Page 10

Discounted Rewards/Costs

An assistant professor gets paid, say, $20K per year.

How much, in total, will the A.P. earn in their life?

20 + 20 + 20 + 20 + 20 + … = infinity

What’s wrong with this argument?

Page 11

Discounted Rewards

“A reward (payment) in the future is not worth quite as much as a reward now.”

– Because of the chance of obliteration
– Because of inflation

Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.

Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.’s future discounted sum of rewards?

Page 12

Infinite Sum

Assuming a discount rate of 0.9, how much does the assistant professor get in total?

20 + 0.9·20 + 0.9²·20 + 0.9³·20 + …
  = 20 + 0.9 (20 + 0.9·20 + 0.9²·20 + …)

x = 20 + 0.9 x
x = 20 / 0.1 = 200
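The fixed-point argument can be checked numerically with a truncated version of the infinite sum:

```python
# Numerically check the closed form x = 20 / (1 - 0.9) = 200.
gamma, payment = 0.9, 20
x = sum(payment * gamma**n for n in range(1000))  # truncated infinite sum
print(round(x, 6))  # 200.0 (the tail beyond n=1000 is negligibly small)
```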

Page 13

Discount Factors

People in economics and probabilistic decision-making do this all the time. The “discounted sum of future rewards” using discount factor γ is

(reward now) +
γ (reward in 1 time step) +
γ² (reward in 2 time steps) +
γ³ (reward in 3 time steps) +
… (infinite sum)

Page 14

Markov System: the Academic Life

Define:

JA = expected discounted future rewards starting in state A
JB = expected discounted future rewards starting in state B
JT, JS, JD = likewise for states T, S, D

How do we compute JA, JB, JT, JS, JD?

[Figure: Markov chain over states A (Assistant Prof, reward 20), B (Assoc. Prof, 60), T (Tenured Prof, 400), S (On the Street, 10), D (Dead, 0), with transition probabilities 0.2, 0.3, 0.6, 0.7 on the arcs.]

Page 15

Working Backwards

[Figure: the chain with rewards A. Assistant Prof: 20, B. Associate Prof: 60, T. Tenured Prof: 100, S. Out on the Street: 10, D. Dead: 0; discount factor 0.9. Computed values: JD = 0, JS = 27, JT = 270, JB = 247, JA = 151.]

Page 16

Reincarnation?

[Figure: the same chain and rewards, discount factor 0.9, except D is no longer absorbing: it transitions back to A with probability 0.5 and to itself with probability 0.5.]

Page 17

System of Equations

L(A) = 20 + .9(.6 L(A) + .2 L(B) + .2 L(S))

L(B) = 60 + .9(.6 L(B) + .2 L(S) + .2 L(T))

L(S) = 10 + .9(.7 L(S) + .3 L(D))

L(T) = 100 + .9(.7 L(T) + .3 L(D))

L(D) = 0 + .9 (.5 L(D) + .5 L(A))

Page 18

Solving a Markov System with Matrix Inversion

• Upside: you get an exact answer

• Downside: if you have 100,000 states, you’re solving a 100,000 × 100,000 system of equations.
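For the five-state system above this is tiny. A minimal sketch with NumPy, reading the rewards and transition probabilities directly off the system of equations on the previous slide (including the reincarnation transition for D):

```python
import numpy as np

# States in order: A, B, S, T, D, with rewards from the slide.
r = np.array([20.0, 60.0, 10.0, 100.0, 0.0])
gamma = 0.9
# Transition matrix read off the system of equations.
P = np.array([
    [0.6, 0.2, 0.2, 0.0, 0.0],  # A -> A, B, S
    [0.0, 0.6, 0.2, 0.2, 0.0],  # B -> B, S, T
    [0.0, 0.0, 0.7, 0.0, 0.3],  # S -> S, D
    [0.0, 0.0, 0.0, 0.7, 0.3],  # T -> T, D
    [0.5, 0.0, 0.0, 0.0, 0.5],  # D -> A, D (reincarnation)
])
# J = r + gamma * P J   =>   (I - gamma * P) J = r
J = np.linalg.solve(np.eye(5) - gamma * P, r)
for name, value in zip("ABSTD", J):
    print(name, round(value, 1))
```

For 100,000 states this direct solve costs on the order of N³ operations, which is exactly the downside the slide points out.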

Page 19

Value Iteration: another way to solve a Markov System

Define:

J¹(Si) = expected discounted sum of rewards over the next 1 time step
J²(Si) = expected discounted sum of rewards over the next 2 steps
J³(Si) = expected discounted sum of rewards over the next 3 steps
⋮
Jᵏ(Si) = expected discounted sum of rewards over the next k steps

J¹(Si) = (what?)
J²(Si) = (what?)
Jᵏ⁺¹(Si) = (what?)

Page 20

Value Iteration: another way to solve a Markov System

Define (as before):

Jᵏ(Si) = expected discounted sum of rewards over the next k steps

J¹(Si) = ri

J²(Si) = ri + γ Σⱼ₌₁ᴺ pij J¹(Sj)
⋮
Jᵏ⁺¹(Si) = ri + γ Σⱼ₌₁ᴺ pij Jᵏ(Sj)

(N = number of states)

Page 21

Let’s do Value Iteration

k    Jk(SUN)    Jk(WIND)    Jk(HAIL)
1
2
3
4
5

[Figure: three-state chain with rewards SUN +4, WIND 0, HAIL −8; every transition probability is 1/2; γ = 0.5.]

Page 22

Let’s do Value Iteration

k    Jk(SUN)    Jk(WIND)    Jk(HAIL)
1    4          0           −8
2    5          −1          −10
3    5          −1.25       −10.75
4    4.94       −1.44       −11
5    4.88       −1.52       −11.11

[Figure: the same SUN/WIND/HAIL chain, γ = 0.5.]
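The table above can be reproduced in a few lines. The transition structure below is a reconstruction of the partly garbled diagram — SUN → {SUN, WIND}, WIND → {SUN, HAIL}, HAIL → {WIND, HAIL}, each with probability 1/2 — and this choice reproduces the table’s numbers exactly:

```python
# Value iteration for the SUN/WIND/HAIL Markov system (gamma = 0.5).
r = {"SUN": 4.0, "WIND": 0.0, "HAIL": -8.0}
P = {  # transition probabilities reconstructed from the figure
    "SUN":  {"SUN": 0.5, "WIND": 0.5},
    "WIND": {"SUN": 0.5, "HAIL": 0.5},
    "HAIL": {"WIND": 0.5, "HAIL": 0.5},
}
gamma = 0.5

J = dict(r)                       # J1(Si) = ri
history = [dict(J)]
for k in range(2, 6):             # compute J2 .. J5
    J = {s: r[s] + gamma * sum(p * J[t] for t, p in P[s].items()) for s in r}
    history.append(dict(J))

for k, Jk in enumerate(history, start=1):
    print(k, {s: round(v, 2) for s, v in Jk.items()})
# the k = 2 row prints {'SUN': 5.0, 'WIND': -1.0, 'HAIL': -10.0}, matching the table
```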

Page 23

Value Iteration for solving Markov Systems

• Compute J¹(Si) for each i
• Compute J²(Si) for each i
⋮
• Compute Jᵏ(Si) for each i

As k → ∞, Jᵏ(Si) → J*(Si).

When to stop? When

maxᵢ |Jᵏ⁺¹(Si) − Jᵏ(Si)| < ξ

This is faster than matrix inversion (O(N³)-style) if the transition matrix is sparse.

What if we have a way to interact with the Markov system?

Page 24

A Markov Decision Process

γ = 0.9

You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).

[Figure: four states — Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10) — with S and A arcs whose probabilities are 1 or 1/2.]

Page 25

Markov Decision Processes

An MDP has…

• A set of states {S1 ··· SN}

• A set of actions {a1 ··· aM}

• A set of rewards {r1 ··· rN} (one for each state)

• A transition probability function Pᵏij = Prob(next = Sj | this = Si and I use action aₖ)

On each step:

0. Call the current state Si
1. Receive reward ri
2. Choose an action from {a1 ··· aM}
3. If you choose action aₖ you’ll move to state Sj with probability Pᵏij
4. All future rewards are discounted by γ

What’s a solution to an MDP? A sequence of actions?

Page 26

A Policy

A policy is a mapping from states to actions.

Examples:

Policy Number 1:
STATE → ACTION
PU → S
PF → A
RU → S
RF → A

Policy Number 2:
STATE → ACTION
PU → A
PF → A
RU → A
RF → A

• How many possible policies are there in our example?
• Which of the above two policies is best?
• How do you compute the optimal policy?

[Figure: the startup MDP (PU 0, PF 0, RU +10, RF +10) drawn twice, once with each policy’s chosen action at every state; arc probabilities are 1 or 1/2.]

Page 27

Interesting Fact

For every MDP there exists an optimal policy.

It’s a policy such that for every possible start state there is no better option than to follow the policy.

Page 28

Computing the Optimal Policy

Idea One: run through all possible policies and select the best.

What’s the problem?
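Idea One can be made concrete on the startup MDP: enumerate all deterministic policies and evaluate each one exactly by solving the linear system J = r + γ PπJ. The transition matrices below are a reconstruction of the partly garbled diagram on the earlier slide, so treat them as an illustrative assumption:

```python
import itertools
import numpy as np

# States PU, PF, RU, RF with rewards 0, 0, 10, 10 and gamma = 0.9.
states = ["PU", "PF", "RU", "RF"]
r = np.array([0.0, 0.0, 10.0, 10.0])
gamma = 0.9
P = {  # P[action][i][j] = probability of moving from state i to state j
    "S": np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.5, 0.0, 0.0, 0.5],
                   [0.5, 0.0, 0.5, 0.0],
                   [0.0, 0.0, 0.5, 0.5]]),
    "A": np.array([[0.5, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0, 0.0],
                   [0.5, 0.5, 0.0, 0.0]]),
}

best_policy, best_J = None, None
for policy in itertools.product("SA", repeat=len(states)):  # 2^4 = 16 policies
    P_pi = np.array([P[a][i] for i, a in enumerate(policy)])  # row i uses policy[i]
    J = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, r)
    if best_J is None or J.sum() > best_J.sum():
        best_policy, best_J = policy, J
print(dict(zip(states, best_policy)))
```

The problem, of course, is the count: with M actions and N states there are Mᴺ policies. 2⁴ = 16 is harmless; 2¹⁰⁰⁰⁰⁰ is not.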

Page 29

Optimal Value Function

Define J*(Si) = expected discounted future rewards, starting from state Si, assuming we use the optimal policy.

[Figure: three-state MDP with rewards S1: +0, S2: +3, S3: +2; action arcs with probabilities 1, 1/2, and 1/3.]

Question: what is an optimal policy for that MDP (assume γ = 0.9)?

What is J*(S1)? What is J*(S2)? What is J*(S3)?

Page 30

Computing the Optimal Value Function with Value Iteration

Define Jᵏ(Si) = the maximum possible expected sum of discounted rewards I can get if I start at state Si and live for k time steps.

Note that J¹(Si) = ri.

Page 31

Let’s compute Jk(Si) for our example

k    Jk(PU)    Jk(PF)    Jk(RU)    Jk(RF)
1
2
3
4
5
6

Page 32

k    Jk(PU)    Jk(PF)    Jk(RU)    Jk(RF)
1    0         0         10        10
2    0         4.5       14.5      19
3    2.03      8.55      16.52     25.08

Page 33

Bellman’s Equation

Jⁿ⁺¹(Si) = maxₐ [ ri + γ Σⱼ₌₁ᴺ Pᵃij Jⁿ(Sj) ]

Converged when maxᵢ |Jⁿ⁺¹(Si) − Jⁿ(Si)| is sufficiently small.

Value Iteration for solving MDPs

• Compute J¹(Si) for all i

• Compute J²(Si) for all i

• ⋮

• Compute Jⁿ(Si) for all i

… until converged

… also known as Dynamic Programming
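Bellman’s equation is exactly the update that produced the Jk table on Page 32. A short sketch that reproduces it; the transition lists are a reconstruction of the partly garbled startup-MDP diagram, so treat them as an illustrative assumption:

```python
# Value iteration with Bellman's equation on the startup MDP (gamma = 0.9).
states = ["PU", "PF", "RU", "RF"]
r = [0.0, 0.0, 10.0, 10.0]
gamma = 0.9
P = {  # P[action][i] = [(next_state_index, probability), ...]
    "S": [[(0, 1.0)], [(0, 0.5), (3, 0.5)], [(0, 0.5), (2, 0.5)], [(2, 0.5), (3, 0.5)]],
    "A": [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)]],
}

J = list(r)                        # J1(Si) = ri
table = [list(J)]
for k in range(2, 4):              # compute J2 and J3
    J = [r[i] + gamma * max(sum(p * J[j] for j, p in P[a][i]) for a in P)
         for i in range(len(states))]
    table.append(list(J))

for k, Jk in enumerate(table, start=1):
    print(k, [round(v, 2) for v in Jk])
```

The inner `max` over actions is Bellman’s equation; dropping it recovers the plain Markov-system update from Page 20.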