
Machine Learning

Reinforcement Learning

Fall 2007, Sharif University of Technology

Reinforcement Learning
Based on "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto

Slides are based on the course material provided by the same authors


What is Reinforcement Learning?

Learning from interaction
Goal-oriented learning
Learning about, from, and while interacting with an external environment
Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal


Supervised Learning

Diagram: Inputs → Supervised Learning System → Outputs

Training Info = desired (target) outputs

Error = (target output – actual output)


Reinforcement Learning

Diagram: Inputs → RL System → Outputs ("actions")

Training Info = evaluations (“rewards” / “penalties”)

Objective: get as much reward as possible


Key Features of RL

Learner is not told which actions to take
Trial-and-error search
Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
The need to explore and exploit
Considers the whole problem of a goal-directed agent interacting with an uncertain environment


Complete Agent

Temporally situated
Continual learning and planning
Objective is to affect the environment
Environment is stochastic and uncertain

Diagram: the agent takes an action; the environment returns a reward and the next state (agent-environment interaction loop).


Elements of RL

Policy: what to do
Reward: what is good
Value: what is good because it predicts reward
Model: what follows what


An Extended Example: Tic-Tac-Toe

Figure: a tic-tac-toe game tree starting from the empty board, alternating x's moves and o's moves.

Assume an imperfect opponent: he/she sometimes makes mistakes


An RL Approach to Tic-Tac-Toe

1. Make a table with one entry per state:

2. Now play lots of games. To pick our moves, look ahead one step:

Table (one entry per state): V(s) = estimated probability of winning
    ordinary states: 0.5 (unknown)
    states with three x's in a row: 1 (win)
    states with three o's in a row, or a full board: 0 (loss or draw)

Figure: from the current state, consider the various possible next states.

Just pick the next state with the highest estimated probability of winning, the largest V(s): a greedy move.

But 10% of the time pick a move at random: an exploratory move.


RL Learning Rule for Tic-Tac-Toe

“Exploratory” move

s  – the state before our greedy move
s' – the state after our greedy move

We increment each V(s) toward V(s'), a backup:

    V(s) ← V(s) + α [ V(s') − V(s) ]

where α is a small positive fraction, e.g., α = 0.1, the step-size parameter.
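To make the move selection and the backup concrete, here is a minimal sketch in Python (added for illustration, not part of the original slides). Board states are assumed to be hashable encodings; unseen states default to the 0.5 initial estimate, and the 10% exploration rate from the previous slide is the epsilon parameter.

```python
import random

def pick_move(V, next_states, epsilon=0.1):
    """Pick the afterstate with the largest V(s); explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(next_states)                      # exploratory move
    return max(next_states, key=lambda s: V.get(s, 0.5))       # greedy move

def backup(V, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * [V(s') - V(s)], applied after a greedy move."""
    V.setdefault(s, 0.5)
    V.setdefault(s_next, 0.5)
    V[s] += alpha * (V[s_next] - V[s])
```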


How is Tic-Tac-Toe Too Easy?

Finite, small number of states
One-step look-ahead is always possible
State completely observable
…


Some Notable RL Applications

TD-Gammon (Tesauro): world's best backgammon program

Elevator Control (Crites & Barto): high-performance down-peak elevator controller

Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls


TD-Gammon

Start with a random network
Play very many games against self
Learn a value function from this simulated experience

This produces arguably the best player in the world

Figure: the TD-Gammon value network; action selection by 2–3 ply search; TD error V_{t+1} − V_t; effective branching factor about 400 (Tesauro, 1992–1995).


Elevator Dispatching

10 floors, 4 elevator cars

STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls

ACTIONS: stop at, or go by, next floor

REWARDS: roughly, –1 per time step for each person waiting

Conservatively about 10^22 states

Crites and Barto, 1996


Evaluative Feedback

Evaluating actions vs. instructing by giving correct actions

Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.

Supervised learning is instructive; optimization is evaluative

Associative vs. Nonassociative:

Associative: inputs mapped to outputs; learn the best output for each input

Nonassociative: “learn” (find) one best output

n-armed bandit (at least how we treat it) is:

Nonassociative

Evaluative feedback


The n-Armed Bandit Problem

Choose repeatedly from one of n actions; each choice is called a play.
After each play a_t, you get a reward r_t, where

    E{ r_t | a_t } = Q*(a_t)

These are the unknown action values.
The distribution of r_t depends only on a_t.

Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.


The Exploration/Exploitation Dilemma

Suppose you form action-value estimates Q_t(a) ≈ Q*(a).

The greedy action at t is
    a_t* = argmax_a Q_t(a)

    a_t = a_t*  ⇒  exploitation
    a_t ≠ a_t*  ⇒  exploration

You can't exploit all the time; you can't explore all the time.
You can never stop exploring; but you should always reduce exploring.


Action-Value Methods

Methods that adapt action-value estimates and nothing else. E.g., suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}; then

    Q_t(a) = ( r_1 + r_2 + ... + r_{k_a} ) / k_a        ("sample average")

    lim_{k_a → ∞} Q_t(a) = Q*(a)


ε-Greedy Action Selection

Greedy action selection:
    a_t = a_t* = argmax_a Q_t(a)

ε-Greedy:
    a_t = a_t*  with probability 1 − ε
    a_t = a random action  with probability ε

… the simplest way to try to balance exploration and exploitation
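A minimal sketch of ε-greedy selection in Python (added for illustration, not from the slides); Q is assumed to be a list of current estimates Q_t(a) indexed by action.

```python
import random

def epsilon_greedy(Q, epsilon):
    """Return an action index: random with probability epsilon, otherwise greedy."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                         # explore
    best = max(Q)
    greedy_actions = [a for a, q in enumerate(Q) if q == best]
    return random.choice(greedy_actions)                        # exploit, ties broken randomly
```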


10-Armed Testbed

n = 10 possible actions
Each Q*(a) is chosen randomly from a normal distribution N(0, 1); each reward r_t is also normal, N(Q*(a_t), 1)
1000 plays
Repeat the whole thing 2000 times (reselecting the Q*(a)) and average the results
Evaluative versus instructive feedback


ε-Greedy Methods on the 10-Armed Testbed

Figure: average reward and % optimal action versus plays (0–1000), averaged over 2000 runs, for ε = 0 (greedy), ε = 0.01, and ε = 0.1.


Softmax Action Selection

Softmax action selection methods grade action probabilities by estimated values.
The most common softmax uses a Gibbs, or Boltzmann, distribution:

Choose action a on play t with probability

    e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}

where τ is the "computational temperature".
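A minimal sketch of Boltzmann (softmax) selection in Python (added for illustration, not from the slides); subtracting the maximum preference is only a numerical-stability detail and does not change the distribution.

```python
import math
import random

def softmax_action(Q, tau):
    """Choose action a with probability exp(Q[a]/tau) / sum_b exp(Q[b]/tau)."""
    m = max(q / tau for q in Q)                       # for numerical stability
    prefs = [math.exp(q / tau - m) for q in Q]
    r, cum = random.random() * sum(prefs), 0.0
    for a, p in enumerate(prefs):
        cum += p
        if r < cum:
            return a
    return len(Q) - 1
```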


Evaluation Versus Instruction

Suppose there are K possible actions and you select action number k.
Evaluative feedback would give you a single score f, say 7.2.
Instructive feedback, on the other hand, would say that action k', possibly different from action k, would actually have been correct.
Obviously, instructive feedback is much more informative (even if it is noisy).


Binary Bandit Tasks

Suppose you have just two actions: a_t = 1 or a_t = 2
and just two rewards: r_t = success or r_t = failure

Then you might infer a target or desired action:
    d_t = a_t  if success
    d_t = the other action  if failure

and then always play the action that was most often the target.
Call this the supervised algorithm. It works fine on deterministic tasks but is suboptimal if the rewards are stochastic.


Contingency Space

The space of all possible binary bandit tasks:


Automaton

Mathematically, an automaton is defined as a set of input actions B, a set of states Q, a set of output actions A, and the functions F and G needed to compute the next state of the automaton and its output, respectively. The input determines the evolution of the automaton from its current state to the next state.


Automata

Definition: An automaton is a quintuple <A, B, Q, F, G>, where:

A = {a1, a2, …, ar} is the set of output actions of the automaton.
B is the set of input actions, which can be finite or infinite.
Q is the state vector of the automaton, with q(t) denoting the state at instant t: Q = (q1(t), q2(t), …, qs(t)).
F: Q × B → Q is the transition function that determines the state at instant t+1 in terms of the state and input at instant t: q(t+1) = F(q(t), B(t)). This mapping can be either deterministic or stochastic.
G is the output function, which determines the output of the automaton at any instant t based on the state at that instant: a(t) = G(q(t)).

If both F and G are independent of time and the input sequence, the automaton is called a fixed-structure automaton; otherwise it is called a variable-structure automaton.


Environment

The environment can be defined mathematically by a triple <A, C, B>, where:

A = {a1, a2, ..., ar} represents a finite input set, B = {b1, b2, ..., bm} is the output set of the environment, and C = {c1, c2, ..., cr} is a set of penalty probabilities, where each element ci of C corresponds to an input action ai.


Learning Automata

A "learning" automaton is an automaton that interacts with a random environment, with the goal of improving its behavior. It is connected to the environment in a feedback loop, such that the input of the automaton is the output of the environment and the output of the automaton is the input to the environment.


Norms of Behavior

Average Penalty:

Pure-Chance Automata:




Fixed Structure Automata

Tsetlin Automaton


Fixed Structure Automata

Krinsky Automaton


Variable Structure Automata

Definition: A variable structure automaton is a 4-tuple <A, B, T, P>, where:

A is the set of actions,
B is the set of inputs of the automaton (the set of outputs of the environment),
T: [0,1]^r × B → [0,1]^r is a learning algorithm such that P(t+1) = T(P(t), B(t)), and
P is the action probability vector.


Learning Algorithm

A general learning algorithm for variable structure automata operating in a stationary environment with B = {0, 1} can be represented as:

If the chosen action is a(t) = a_i, the updated action probabilities are:


Learning Algorithm

Since P(t) is a probability vector, we have

    Σ_{i=1}^{r} p_i(t) = 1  for all t


Learning Algorithms

In this representation, the functions h_j and g_j have the following properties:

h_j and g_j are continuous functions (assumed for mathematical convenience)
h_j and g_j are nonnegative functions


Linear Reward-Penalty (LR-P)

In a linear reward-penalty algorithm, the automaton increases the probability for the action that has been rewarded, and decreases the probability for the action that has been penalized.


Linear Reward-Inaction (LR-I)

In a linear reward-inaction algorithm, the automaton increases the probability for the action that has been rewarded, and doesn't change the probability for the action that has been penalized.
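The slides' update equations for these schemes did not survive extraction. The sketch below uses the standard textbook form of the linear schemes (an assumption, not taken from the slides): a is the reward step size, b the penalty step size, and setting b = 0 yields linear reward-inaction.

```python
def linear_update(p, i, rewarded, a=0.1, b=0.1):
    """One step of a linear reward-penalty scheme on the action-probability vector p.

    p        : list of action probabilities (sums to 1)
    i        : index of the action just chosen
    rewarded : True if the environment rewarded the action
    a, b     : reward and penalty step sizes; b = 0 gives linear reward-inaction (LR-I)
    """
    r = len(p)
    q = p[:]
    if rewarded:
        for j in range(r):
            q[j] = p[j] + a * (1 - p[j]) if j == i else (1 - a) * p[j]
    else:
        for j in range(r):
            q[j] = (1 - b) * p[j] if j == i else b / (r - 1) + (1 - b) * p[j]
    return q
```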


Discretized Learning Automata

All the existing continuous variable structure automata permit action probabilities to take any value in the interval [0,1]; in their implementation, the learning automata use a random-number generator to determine which action to choose.

In theory, an action probability can take any value between 0 and 1, so the random number generator is required to be very accurate; however, in practice, the probabilities are rounded-off to an accuracy depending on the architecture of the machine that is used to implement the automaton.

In order to increase the speed of convergence of these algorithms and to minimize the requirements on the random number generators, the concept of discretizing the probability space was introduced.


Discretized Linear Reward-Inaction

The probability space [0,1] is divided into N intervals, where N is a resolution parameter and is recommended to be an even integer.

A state is associated with every possible probability value, which determines the following set of states: Q = (q1, q2, …, qN).

In every state qi, the probability that the automaton chooses action a1 is i/N, and the probability of choosing action a2 is (1 − i/N).



Estimator Algorithms

The main feature of these algorithms is that they maintain running estimates of the penalty probability of each possible action and use them in the probability-updating equations. The purpose of these estimates is to crystallize the confidence in the reward capabilities of each action.

For the definition of an estimator algorithm, a vector of reward estimates d'(t) must be introduced. Hence, the state vector Q(t) is defined as Q(t) = <P(t), d'(t)>, where d'(t) = [d'_1(t), …, d'_r(t)].


Pursuit Algorithms

These algorithms are characterized by the fact that they pursue the action that is currently estimated to be the optimal action. They achieve this by increasing the probability of the current optimal action whether the chosen action was rewarded or penalized by the environment.
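The slides' pursuit equations are missing from the extraction; the following sketch shows one common form of the update (an assumption about the exact scheme): regardless of the last outcome, the probability vector is moved a small step toward the unit vector of the action with the best current reward estimate d'.

```python
def pursuit_step(p, d_hat, lam=0.01):
    """One pursuit update: move p toward the unit vector of the estimated-best action.

    p     : action probability vector
    d_hat : running reward estimates d'_j(t), one per action
    lam   : learning rate in (0, 1)
    """
    m = max(range(len(p)), key=lambda j: d_hat[j])     # currently estimated-optimal action
    return [(1 - lam) * pj + (lam if j == m else 0.0) for j, pj in enumerate(p)]
```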



Incremental Implementation

Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):

    Q_k = ( r_1 + r_2 + ... + r_k ) / k

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

    Q_{k+1} = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ]

This is a common form for update rules:

    NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
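A minimal sketch of the incremental sample average in Python (added for illustration, not from the slides):

```python
def incremental_mean(rewards):
    """Q_{k+1} = Q_k + 1/(k+1) * (r_{k+1} - Q_k), without storing past rewards."""
    Q, k = 0.0, 0
    for r in rewards:
        k += 1
        Q += (r - Q) / k     # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return Q
```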


Computation

The average of the first k+1 rewards can be rewritten as an incremental update:

    Q_{k+1} = (1/(k+1)) Σ_{i=1}^{k+1} r_i
            = (1/(k+1)) [ r_{k+1} + Σ_{i=1}^{k} r_i ]
            = (1/(k+1)) [ r_{k+1} + k Q_k ]
            = (1/(k+1)) [ r_{k+1} + (k+1) Q_k − Q_k ]
            = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ]

Step size: constant, or changing with time.


Tracking a Non-stationary Problem

Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time.
But not in a non-stationary problem.

Better in the non-stationary case is:

    Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]        for constant α, 0 < α ≤ 1
            = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i
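A small numerical check (added for illustration, not from the slides) that the iterative constant-α update equals the exponential recency-weighted closed form above:

```python
def constant_alpha(rewards, alpha=0.1, Q0=0.0):
    """Iterate Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k)."""
    Q = Q0
    for r in rewards:
        Q += alpha * (r - Q)
    return Q

def closed_form(rewards, alpha=0.1, Q0=0.0):
    """(1 - alpha)^k * Q0 + sum_{i=1..k} alpha * (1 - alpha)^(k - i) * r_i."""
    k = len(rewards)
    return (1 - alpha) ** k * Q0 + sum(
        alpha * (1 - alpha) ** (k - i) * r for i, r in enumerate(rewards, start=1))

rewards = [1.0, 0.0, 2.0, 1.5]
assert abs(constant_alpha(rewards) - closed_form(rewards)) < 1e-12
```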


Computation

Use

    Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]

Then

    Q_k = Q_{k−1} + α [ r_k − Q_{k−1} ]
        = α r_k + (1 − α) Q_{k−1}
        = α r_k + (1 − α) α r_{k−1} + (1 − α)^2 Q_{k−2}
        = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i

In general, convergence requires

    Σ_k α_k(a) = ∞   and   Σ_k α_k(a)^2 < ∞

satisfied for α_k = 1/k but not for fixed α.

Notes:
1.  (1 − α)^k + Σ_{i=1}^{k} α (1 − α)^{k−i} = 1
2.  α_k(a) is the step-size parameter after the k-th application of action a.


Optimistic Initial Values

All methods so far depend on Q_0(a), i.e., they are biased.

Suppose instead we initialize the action values optimistically, e.g., on the 10-armed testbed use Q_0(a) = 5 for all a.

Optimistic initialization can force exploration behavior!


The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...

Agent observes state at step t:    s_t ∈ S
produces action at step t:         a_t ∈ A(s_t)
gets resulting reward:             r_{t+1} ∈ ℝ
and resulting next state:          s_{t+1}

Figure: ... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → ...


The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities
    π_t(s, a) = probability that a_t = a when s_t = s

Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.


Getting the Degree of Abstraction Right

Time steps need not refer to fixed intervals of real time.
Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention), etc.
States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
An RL agent is not like a whole animal or robot, which may consist of many RL agents as well as other components.
The environment is not necessarily unknown to the agent, only incompletely controllable.
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.


Goals and Rewards

Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, thus outside the agent.
The agent must be able to measure success:
    explicitly;
    frequently during its lifespan.


Returns

Suppose the sequence of rewards after step t is:
    r_{t+1}, r_{t+2}, r_{t+3}, …
What do we want to maximize?

In general, we want to maximize the expected return E{R_t}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

    R_t = r_{t+1} + r_{t+2} + ... + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.


Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.

Discounted return:

    R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.

    shortsighted  0 ← γ → 1  farsighted


An Example

Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.

As an episodic task where the episode ends upon failure:
    reward = +1 for each step before failure
    ⇒ return = number of steps before failure

As a continuing task with discounted return:
    reward = −1 upon failure; 0 otherwise
    ⇒ return = −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.


A Unified Notation

In episodic tasks, we number the time steps of each episode starting from zero.
We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
Think of each episode as ending in an absorbing state that always produces a reward of zero.

We can cover all cases by writing

    R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.


The Markov Property

By "the state" at step t, we mean whatever information is available to the agent at step t about its environment.
The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:

    Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0 }
        = Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }

for all s', r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0.

Markov Decision Processes

If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
    state and action sets
    one-step "dynamics" defined by transition probabilities:

        P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }   for all s, s' ∈ S, a ∈ A(s)

    expected rewards:

        R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }   for all s, s' ∈ S, a ∈ A(s)


Recycling Robot

An Example Finite MDP

At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected


Recycling Robot MDP

Figure: transition graph. From high, search stays in high with probability α or moves to low with probability 1−α (reward R^search either way); from low, search stays in low with probability β (reward R^search) or depletes the battery with probability 1−β, in which case the robot is rescued and returned to high with reward −3; wait stays in the same state with probability 1 (reward R^wait); recharge moves from low to high with probability 1 and reward 0.

    S = {high, low}
    A(high) = {search, wait}
    A(low)  = {search, wait, recharge}

    R^search = expected no. of cans while searching
    R^wait   = expected no. of cans while waiting
    R^search > R^wait


Transition Table


Value Functions

The value of a state is the expected return starting from that state. It depends on the agent's policy:

State-value function for policy π:

    V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:

Action-value function for policy π:

    Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }


Bellman Equation for a Policy π

The basic idea:

    R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + ...
        = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ^2 r_{t+4} + ... )
        = r_{t+1} + γ R_{t+1}

So:

    V^π(s) = E_π{ R_t | s_t = s }
           = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or:

    V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]


Derivation


More on the Bellman Equation

    V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.

Backup diagrams: (a) for V^π, (b) for Q^π.


Grid World

Actions: north, south, east, west; deterministic.

If an action would take the agent off the grid: no move, but reward = −1

Other actions produce reward = 0, except actions that move agent out of special states A and B as shown.

State-value function for the equiprobable random policy; γ = 0.9


Optimal Value Functions

For finite MDPs, policies can be partially ordered:

    π ≥ π'  if and only if  V^π(s) ≥ V^{π'}(s) for all s ∈ S

There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.

Optimal policies share the same optimal state-value function:

    V*(s) = max_π V^π(s)   for all s ∈ S

Optimal policies also share the same optimal action-value function:

    Q*(s, a) = max_π Q^π(s, a)   for all s ∈ S and a ∈ A(s)

This is the expected return for taking action a in state s and thereafter following an optimal policy.


Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

    V*(s) = max_{a∈A(s)} Q^{π*}(s, a)
          = max_{a∈A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
          = max_{a∈A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

The relevant backup diagram: a max over the actions available at the state.

V* is the unique solution of this system of nonlinear equations.


Bellman Optimality Equation for Q*

    Q*(s, a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }
             = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]

The relevant backup diagram: a max over the actions available at each next state.

Q* is the unique solution of this system of nonlinear equations.


Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the grid world.


What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

    π*(s) = argmax_{a∈A(s)} Q*(s, a)


Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
    accurate knowledge of environment dynamics;
    enough space and time to do the computation;
    the Markov Property.

How much space and time do we need?
    polynomial in the number of states (via dynamic programming methods; see later),
    BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the Bellman Optimality Equation.


Dynamic Programming

Objectives of the next slides:
    Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
    Show how DP can be used to compute value functions, and hence, optimal policies
    Discuss efficiency and utility of DP


Policy Evaluation

Policy Evaluation: for a given policy π, compute the state-value function V^π.

Recall the state-value function for policy π:

    V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

Bellman equation for V^π:

    V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

A system of |S| simultaneous linear equations.


Iterative Methods

    V_0 → V_1 → ... → V_k → V_{k+1} → ... → V^π

A "sweep" consists of applying a backup operation to each state.

A full policy-evaluation backup:

    V_{k+1}(s) ← Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]


Iterative Policy Evaluation
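The algorithm box for this slide did not survive extraction. The following is a minimal sketch of iterative policy evaluation in Python (illustrative, not the original); the MDP is assumed to be given as dictionaries P[(s, a, s')] and R[(s, a, s')] and the policy as pi[(s, a)], names chosen here for convenience.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation by repeated full sweeps.

    P[(s, a, s2)] : transition probability P^a_{s s2}
    R[(s, a, s2)] : expected reward R^a_{s s2}
    pi[(s, a)]    : probability of taking a in s under the policy being evaluated
    Returns a dict V of state values.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi.get((s, a), 0.0) *
                    sum(P.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:          # stop when the largest change in a sweep is small
            return V
```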


Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

    Q^π(s, a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a }
              = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

It is better to switch to action a for state s if and only if

    Q^π(s, a) > V^π(s)


The Policy Improvement Theorem


Proof sketch


Policy Improvement Cont.

Do this for all states to get a new policy π' that is greedy with respect to V^π:

    π'(s) = argmax_a Q^π(s, a)
          = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Then V^{π'} ≥ V^π.


Policy Improvement Cont.

What if V^{π'} = V^π? I.e., for all s ∈ S,

    V^{π'}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ] ?

But this is the Bellman Optimality Equation.
So V^{π'} = V* and both π and π' are optimal policies.


Policy Iteration

    π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*

    policy evaluation alternating with policy improvement ("greedification")


Policy Iteration
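The policy-iteration algorithm box is likewise missing from the extraction. Below is a sketch under the same assumed dictionary representation as the earlier policy-evaluation sketch, restricted to deterministic policies (illustrative, not the original algorithm box).

```python
def greedy_q(s, a, states, P, R, V, gamma):
    """One-step lookahead: sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ]."""
    return sum(P.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2 in states)

def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation (for a deterministic policy) and greedy improvement."""
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in states:
                v = greedy_q(s, policy[s], states, P, R, V, gamma)
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < theta:
                break
        # policy improvement ("greedification")
        stable = True
        for s in states:
            best = max(actions, key=lambda a: greedy_q(s, a, states, P, R, V, gamma))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```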


Value Iteration

A drawback of policy iteration is that each iteration involves a policy evaluation, which itself may require multiple sweeps.
Convergence of V^π occurs only in the limit, so in principle we have to wait until convergence.
As we have seen, the optimal policy is often obtained long before V^π has converged.
Fortunately, the policy evaluation step can be truncated in several ways without losing the convergence guarantees of policy iteration.
Value iteration: stop policy evaluation after just one sweep.


Value Iteration

Recall the full policy-evaluation backup:

    V_{k+1}(s) ← Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Here is the full value-iteration backup:

    V_{k+1}(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Combination of policy improvement and truncated policy evaluation.
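A sketch of the value-iteration backup loop with greedy policy extraction, under the same assumed dictionary representation as before (illustrative, not from the slides):

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """V_{k+1}(s) <- max_a sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V_k(s') ]."""
    def backup(s, a, V):
        return sum(P.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2 in states)

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(backup(s, a, V) for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # extract a greedy policy from the converged value function
    policy = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
    return V, policy
```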


Value Iteration Cont.


Asynchronous DP

All the DP methods described so far require exhaustive sweeps of the entire state set.
Asynchronous DP does not use sweeps. Instead it works like this:
    Repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup.
Still needs lots of computation, but does not get locked into hopelessly long sweeps.
Can you select states to back up intelligently? YES: an agent's experience can act as a guide.


Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

Figure: evaluation drives V → V^π; improvement drives π → greedy(V); the two processes converge jointly to V* and π*.

A geometric metaphor for convergence of GPI: the value function and policy move toward the intersection of the two constraints V = V^π and π = greedy(V), which is V*, π*.


Efficiency of DP

Finding an optimal policy is polynomial in the number of states…
BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
It is surprisingly easy to come up with MDPs for which DP methods are not practical.


Summary

Policy evaluation: backups without a max
Policy improvement: form a greedy policy, if only locally
Policy iteration: alternate the above two processes
Value iteration: backups with a max
Full backups (to be contrasted later with sample backups)
Generalized Policy Iteration (GPI)
Asynchronous DP: a way to avoid exhaustive sweeps
Bootstrapping: updating estimates based on other estimates


Monte Carlo Methods

Monte Carlo methods learn from complete sample returns

Only defined for episodic tasks

Monte Carlo methods learn directly from experience
    On-line: no model necessary and still attains optimality
    Simulated: no need for a full model


Monte Carlo Policy Evaluation

Goal: learn V^π(s)
Given: some number of episodes under π which contain s
Idea: average the returns observed after visits to s

Every-visit MC: average the returns for every time s is visited in an episode
First-visit MC: average the returns only for the first time s is visited in an episode
Both converge asymptotically


First-visit Monte Carlo policy evaluation
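The algorithm box for this slide did not survive extraction. A minimal sketch of first-visit MC policy evaluation (illustrative, not the original), assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation: V(s) = average first-visit return."""
    returns = defaultdict(list)
    for episode in episodes:
        # return from each time step, computed backwards through the episode
        G_at = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        # record the return only at the first visit to each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        for s, t in first_visit.items():
            returns[s].append(G_at[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```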


Monte Carlo Estimation of Action Values (Q)

Monte Carlo is most useful when a model is not available

We want to learn Q*

Q^π(s, a): the average return starting from state s and action a, following π
Also converges asymptotically if every state-action pair is visited
Exploring starts: every state-action pair has a non-zero probability of being the starting pair


Monte Carlo Control

MC policy iteration: policy evaluation using MC methods, followed by policy improvement
Policy improvement step: greedify with respect to the value (or action-value) function

Diagram: evaluation drives Q → Q^π; improvement drives π → greedy(Q).


Convergence of MC Control

The policy improvement theorem tells us:

    Q^{π_k}(s, π_{k+1}(s)) = Q^{π_k}(s, argmax_a Q^{π_k}(s, a))
                           = max_a Q^{π_k}(s, a)
                           ≥ Q^{π_k}(s, π_k(s))
                           = V^{π_k}(s)

This assumes exploring starts and an infinite number of episodes for MC policy evaluation.
To solve the latter:
    update only to a given level of performance
    alternate between evaluation and improvement per episode


Monte Carlo Exploring Starts

Fixed point is optimal policy π∗


Summary about Monte Carlo Techniques

MC has several advantages over DP:
    Can learn directly from interaction with environment
    No need for full models
    No need to learn about ALL states
    Less harm by Markovian violations
MC methods provide an alternate policy evaluation process

One issue to watch for: maintaining sufficient exploration

exploring starts, soft policies


Temporal Difference Learning

Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Recall the simple every-visit Monte Carlo method:

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

    (target: the actual return after time t)

The simplest TD method, TD(0):

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

    (target: an estimate of the return)
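A minimal TD(0) prediction sketch in Python (illustrative, not from the slides); env_step and policy are assumed interfaces rather than names used in the course material.

```python
def td0_episode(env_step, start_state, policy, V, alpha=0.1, gamma=1.0):
    """Run one episode of TD(0) prediction for a given policy.

    env_step(s, a) -> (reward, next_state, done)   # assumed environment interface
    policy(s)      -> action
    V              : dict of state-value estimates, updated in place
    """
    s, done = start_state, False
    while not done:
        r, s2, done = env_step(s, policy(s))
        target = r if done else r + gamma * V.get(s2, 0.0)   # terminal state has value 0
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
        s = s2
    return V
```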


Simple Monte Carlo

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.

Figure: backup diagram; the update uses the complete return from the rest of the episode.

Simplest TD Method

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

Figure: backup diagram; the update uses only the immediately following reward r_{t+1} and state s_{t+1}.


cf. Dynamic Programming

    V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }

Figure: backup diagram; a full backup over all possible next states, as in DP.


Sarsa: Learning an Action-Value Function

Figure: trajectory ... s_t, a_t → r_{t+1} → s_{t+1}, a_{t+1} → r_{t+2} → s_{t+2}, a_{t+2} → ...

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
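A sketch of one Sarsa episode with an ε-greedy behavior policy (illustrative, not from the slides; env_step is an assumed environment interface):

```python
import random

def sarsa_episode(env_step, start_state, actions, Q, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One episode of Sarsa: on-policy TD control with epsilon-greedy behavior.

    env_step(s, a) -> (reward, next_state, done)   # assumed environment interface
    Q              : dict mapping (state, action) -> value, updated in place
    """
    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    s = start_state
    a = choose(s)
    done = False
    while not done:
        r, s2, done = env_step(s, a)
        if done:
            target = r                                  # Q of a terminal state is 0
        else:
            a2 = choose(s2)
            target = r + gamma * Q.get((s2, a2), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        if not done:
            s, a = s2, a2
    return Q
```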


Sarsa

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.


Q-Learning

One-step Q-learning:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
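For comparison with Sarsa, a sketch of one episode of one-step Q-learning under the same assumed interface (illustrative, not from the slides); note the max over next actions in the target, which makes the update off-policy.

```python
import random

def q_learning_episode(env_step, start_state, actions, Q, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One episode of one-step Q-learning with epsilon-greedy behavior.

    env_step(s, a) -> (reward, next_state, done)   # assumed environment interface
    Q              : dict mapping (state, action) -> value, updated in place
    """
    s, done = start_state, False
    while not done:
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q.get((s, x), 0.0))
        r, s2, done = env_step(s, a)
        best_next = 0.0 if done else max(Q.get((s2, x), 0.0) for x in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        s = s2
    return Q
```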


Continuous Action-Set Reinforcement Learning