Reinforcement Learning
Yishay Mansour, Tel-Aviv University
Reinforcement Learning: Course Information
• Classes: Wednesday
  – Lecture 10-13: Yishay Mansour
  – Recitations 14-15 / 15-16: Eliya Nachmani, Adam Polyak
• Course Web site: rl-tau-2018.wikidot.com
• Resources:
  – Markov Decision Processes – Puterman
  – Reinforcement Learning – Sutton and Barto
  – Neuro-dynamic Programming – Bertsekas and Tsitsiklis
Reinforcement Learning: Course Requirements
• Homework:
  – Every two weeks: theory and programming
• Project:
  – Near the end of the term
  – Deep RL / Atari
• Final Exam
• Grade:
  – 60% final exam (must pass the exam)
  – 20% homework
  – 20% project
Playing Board Games
Gerald Tesauro, TD-Gammon, 1992
AlphaGo, DeepMind, 2015-17
Other notable board games
Arthur Samuel (checkers), 1962
Deep Blue (chess), 1996
Playing Atari Games
Controlling Robots
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Goal of Reinforcement Learning
Goal-oriented learning through interaction:
control of large-scale stochastic environments with partial knowledge.
Contrast with supervised / unsupervised learning, which learns from labeled / unlabeled examples.
Mathematical Model - Motivation
A model of uncertainty: the environment, the actions, and our own knowledge.
Focus on decision making.
Maximize the long-term reward.
The model: Markov Decision Process (MDP).
Contrast with Supervised Learning
The system has a “state”.
The algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation. There is a cost to discovering information!
Mathematical Model - MDP
Markov Decision Processes
S – set of states
A – set of actions
δ – transition probability
R – reward function
(Similar to a DFA!)
MDP model - states and actions
Environment = states; actions = transitions.
Performing action a in state s moves to state s' with probability δ(s, a, s') (e.g., 0.7 to one successor state and 0.3 to another, as in the figure).
MDP model - rewards
R(s, a) = reward at state s for doing action a (a random variable).
Example: R(s, a) = -1 with probability 0.5, +10 with probability 0.35, +20 with probability 0.15.
MDP model - trajectories
trajectory:
s0, a0, r0, s1, a1, r1, s2, a2, r2, ...
MDP - Return function.
Combining all the immediate rewards into a single value.
Modeling issues:
Are early rewards more valuable than later rewards?
Is the system “terminating” or continuous?
Usually the return is linear in the immediate rewards.
MDP model - return functions
Finite horizon - parameter H:
  return = Σ_{i=1}^{H} R(s_i, a_i)
Infinite horizon, discounted - parameter γ < 1:
  return = Σ_{i=0}^{∞} γ^i R(s_i, a_i)
Infinite horizon, undiscounted (average reward):
  return = (1/N) Σ_{i=0}^{N-1} R(s_i, a_i), as N → ∞
Terminating MDP: sum the rewards until termination.
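For illustration, a minimal Python sketch of the discounted and average returns (the reward sequence and the value of γ below are arbitrary, not from the lecture):

    # Discounted return: sum_i gamma^i * r_i
    def discounted_return(rewards, gamma=0.5):
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    # Undiscounted (average) return over N steps: (1/N) * sum_i r_i
    def average_return(rewards):
        return sum(rewards) / len(rewards)

    print(discounted_return([1, 1, 1]))  # 1 + 0.5 + 0.25 = 1.75
    print(average_return([1, 1, 1]))     # 1.0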
MDP Example: Inventory
State x_t = inventory at time t; action a_t = quantity ordered; d_t = demand.
Dynamics: x_{t+1} = [x_t + a_t - d_t]^+
Sales: s_t = min(x_t + a_t, d_t)
Reward: r_t = P·s_t - J(a_t) - C(x_{t+1}), where P = price per item, J(·) = order cost, C(·) = inventory cost.
MDP model - action selection
Policy - mapping from states to actions
Fully Observable - can “see” the “exact” state.
AIM: Maximize the expected return. This talk: the discounted return.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
MDP model - summary
s ∈ S – set of states, |S| = n.
a ∈ A – set of actions, |A| = k.
δ(s, a, s') – transition function.
R(s, a) – immediate reward function.
π: S → A – policy.
Σ_{i=0}^{∞} γ^i r_i – discounted cumulative return.
Contrast with Supervised Learning
Supervised learning: a fixed distribution on examples.
Reinforcement learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
Simple setting: Multi-armed bandit
Single state.
(A single state s with actions a1, a2, a3.)
Goal: Maximize sum of immediate rewards.
Difficulty: unknown rewards.
Given the model: Greedy action.
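A minimal ε-greedy sketch for handling the unknown rewards; the arm interface 'pull' and the value of ε are illustrative assumptions:

    import random

    def epsilon_greedy(pull, k=3, steps=1000, epsilon=0.1):
        counts = [0] * k
        means = [0.0] * k                   # empirical mean reward per arm
        for _ in range(steps):
            if random.random() < epsilon:
                a = random.randrange(k)     # explore: random arm
            else:                           # exploit: empirically best arm
                a = max(range(k), key=lambda i: means[i])
            r = pull(a)                     # observe a reward sample
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]  # incremental mean
        return means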
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Planning - Basic Problems.
Given a complete MDP model:
Policy evaluation - given a policy π, evaluate its return.
Optimal control - find an optimal policy π* (maximizing the return from any start state).
Planning - Value Functions
V^π(s) - the expected return starting at state s and following π.
Q^π(s, a) - the expected return starting at state s with action a, and then following π.
V*(s) and Q*(s, a) are defined using an optimal policy π*:
V*(s) = max_π V^π(s)
Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E[R(s, π(s))] + γ Σ_{s'} δ(s, π(s), s') V^π(s')
Rewriting the expectation:
V^π(s) = E_{s' ~ δ(s, π(s))} [ R(s, π(s)) + γ V^π(s') ]
This is a linear system of equations in the values V^π(s).
Algorithms - Policy Evaluation Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a}; π random; ∀a: R(s_i, a) = i.
(States s0, s1, s2, s3 arranged in a cycle.)
V^π(s0) = 0 + γ [ π(s0, +1) V^π(s1) + π(s0, -1) V^π(s3) ]
Algorithms - Policy Evaluation Example (continued)
With π random and γ = 1/2:
V^π(s0) = 0 + (V^π(s1) + V^π(s3)) / 4
V^π(s1) = 1 + (V^π(s0) + V^π(s2)) / 4
V^π(s2) = 2 + (V^π(s1) + V^π(s3)) / 4
V^π(s3) = 3 + (V^π(s0) + V^π(s2)) / 4
Solution: V^π(s0) = 5/3, V^π(s1) = 7/3, V^π(s2) = 11/3, V^π(s3) = 13/3.
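The solution can be checked numerically by solving the linear system V = R + γPV; a minimal numpy sketch (the matrix P below encodes the random policy on the 4-cycle):

    import numpy as np

    gamma = 0.5
    R = np.array([0.0, 1.0, 2.0, 3.0])   # R(s_i) = i
    P = np.array([[0, .5, 0, .5],        # random walk on the cycle
                  [.5, 0, .5, 0],
                  [0, .5, 0, .5],
                  [.5, 0, .5, 0]])
    V = np.linalg.solve(np.eye(4) - gamma * P, R)
    print(V)   # [5/3, 7/3, 11/3, 13/3]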
Algorithms - optimal control
State-action value function (for a deterministic policy π):
Q^π(s, a) = E[R(s, a)] + γ E_{s' ~ δ(s, a)} [ V^π(s') ]
Note that V^π(s) = Q^π(s, π(s)).
Algorithms - Optimal Control Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a}; π random; R(s_i, a) = i.
Q^π(s0, +1) = 0 + γ V^π(s1) = 7/6
Q^π(s0, -1) = 0 + γ V^π(s3) = 13/6
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a { Q^π(s, a) }   (Bellman equation)
PROOF (only if): Assume there is a state s and an action a such that V^π(s) < Q^π(s, a). Then the strategy that performs a at state s (the first time) and follows π afterwards is better than π. This holds every time we visit s, so the policy that always performs action a at state s is better than π.
Algorithms - Optimal Control Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a}; π random; R(s_i, a) = i.
Change the policy using the state-action value function: at s0, Q^π(s0, -1) = 13/6 > Q^π(s0, +1) = 7/6, so the improved policy takes action -1 at s0.
MDP - computing optimal policy
1. Linear programming.
2. Value iteration:
   V_{t+1}(s) ← max_a { R(s, a) + γ Σ_{s'} δ(s, a, s') V_t(s') }
3. Policy iteration:
   π_i(s) = argmax_a { Q^{π_{i-1}}(s, a) }
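A minimal value-iteration sketch in Python/numpy; the array shapes and the fixed iteration count are illustrative assumptions:

    import numpy as np

    # R[s, a]   : expected immediate reward
    # P[s, a, t]: transition probability delta(s, a, t)
    def value_iteration(R, P, gamma=0.9, iters=1000):
        n, k = R.shape
        V = np.zeros(n)
        for _ in range(iters):
            # V_{t+1}(s) = max_a { R(s,a) + gamma * sum_t P(s,a,t) V_t(t) }
            V = np.max(R + gamma * (P @ V), axis=1)
        # Greedy policy extraction: pi(s) = argmax_a Q(s, a)
        pi = np.argmax(R + gamma * (P @ V), axis=1)
        return V, pi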
Example: Grid world
(Figure: the move actions, shown as arrows.)
Initial state = red; final state = blue; cost = steps to the target (blue).
Example: Grid world - value iteration
Initial values (t = 0): every state is ∞, except the target, which is 0.
After one iteration (t = 1): the neighbors of the target have value 1.
After two iterations (t = 2): states at distance 2 from the target have value 2.
Example: Grid world
• Optimal policy
Q(s, →) = walk → and add the distance to the target from there.
Derive the optimal policy from Q(s, ·).
Convergence
• Value iteration
  – Each iteration cuts the distance to the optimal value by a (1-γ) fraction, i.e., the update is a γ-contraction.
• Policy iteration
  – The policy improves monotonically.
  – The number of iterations is at most that of value iteration.
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Learning Algorithms
Given access to the environment only through performing actions:
1. Policy evaluation.
2. Control - find an optimal policy.
Two approaches:
1. Model based.
2. Model free.
Learning - Model Based
Estimate the model from the observations (both the transition probabilities and the rewards).
Use the estimated model as if it were the true model, and find a near-optimal policy for it.
If the estimated model is "good", the resulting policy should be near-optimal.
Learning - Model Based: off-policy
• Let the policy run for a "long" time.
  – What is "long"?!
  – Assumes some "exploration".
• Build an "observed model":
  – Transition probabilities
  – Rewards
  (The two estimates are independent!)
• Use the "observed model" to learn an optimal policy.
Learning - Model Based: off-policy algorithm
• Observe a trajectory generated by a policy π.
  – Off-policy: no need to control the actions.
• For every s, s' ∈ S and a ∈ A:
  – δ'(s, a, s') = #(s, a, s') / #(s, a, ·)
  – R'(s, a) = average(r | s, a)
• Find an optimal policy for the estimated MDP (S, A, δ', R'). (A sketch of the estimation step follows.)
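A minimal sketch of the estimation step; the trajectory format (a list of (s, a, r) triples, with the next state taken from the following triple) is an illustrative assumption:

    from collections import defaultdict

    def estimate_model(trajectory):
        counts = defaultdict(lambda: defaultdict(int))  # #(s, a, s')
        reward_sums = defaultdict(float)                # sum of r at (s, a)
        totals = defaultdict(int)                       # #(s, a, .)
        for (s, a, r), (s_next, _, _) in zip(trajectory, trajectory[1:]):
            counts[(s, a)][s_next] += 1
            reward_sums[(s, a)] += r
            totals[(s, a)] += 1
        delta_hat = {sa: {t: c / totals[sa] for t, c in d.items()}
                     for sa, d in counts.items()}
        R_hat = {sa: reward_sums[sa] / totals[sa] for sa in totals}
        return delta_hat, R_hat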
Learning - Model Based
• Claim: if the model is “accurate” we will compute a near-optimal policy.
• Hidden assumption:
  – Each (s, a, ·) is sampled many times.
  – This is the "responsibility" of the policy π (we are off-policy).
• Simple question: how many samples do we need for each (s, a, ·)?
Learning - Model Based: on-policy
• The learner has control over the actions.
  – The immediate goal is to learn a model.
• As before:
  – Build an "observed model" (transition probabilities and rewards).
  – Use the "observed model" to estimate the value of the policy.
• Accelerating the learning: how do we reach "unexplored" states?!
Learning - Model Based: on-policy
(Figure: the state space split into well-sampled states and relatively unknown states; the unknown states are assigned a HIGH REWARD to encourage visiting them.)
Exploration → planning in a new MDP.
Learning: Policy improvement
• Assume that we can perform the following: given a policy π, compute its value functions V^π and Q^π.
• Then we can run policy improvement: π' = Greedy(Q^π).
• The process converges if the estimates are accurate.
Model-Free learning
Q-Learning: off-policy
Basic idea: learn the Q-function directly.
On a move (s, a) → s', update:
Q_{t+1}(s, a) = (1 - α_t(s, a)) Q_t(s, a) + α_t(s, a) [ R(s, a) + γ max_u Q_t(s', u) ]
The first term weights the old estimate and the second the new estimate; the learning rate at (s, a) is α_t(s, a) = 1/t^ω.
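A minimal sketch of this update; the dictionary-of-dictionaries representation of Q is an illustrative assumption:

    # One Q-learning update for the observed move (s, a) -> s' with reward r.
    def q_update(Q, s, a, r, s_next, alpha, gamma=0.9):
        target = r + gamma * max(Q[s_next].values())      # new estimate
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target  # blend with old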
Q-Learning: update equation
Equivalently, written as an update by the temporal difference:
Q_{t+1}(s, a) = Q_t(s, a) - α_t(s, a) Δ_t
Δ_t = Q_t(s, a) - [ R(s, a) + γ max_u Q_t(s', u) ]
(old estimate minus new estimate; α_t(s, a) is the learning rate)
Q-Learning: Intuition
• The update is based on the difference:
  Δ_t = Q_t(s, a) - [ R(s, a) + γ max_u Q_t(s', u) ]
• Assume we have the right Q-function.
• Good news: the expectation of the difference is then zero!
• Challenge: understanding the dynamics of the resulting stochastic process.
Learning - Model Free. Policy evaluation: TD(0)
An online view: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}.
Our "estimation error" is Err_t = r_t + γ V(s_{t+1}) - V(s_t). The update:
V_{t+1}(s_t) = V_t(s_t) + α Err_t
Note that for the correct value function we have:
E[ r + γ V(s') - V(s) ] = E[ Err_t ] = 0
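A minimal sketch of the TD(0) update; the table representation of V and the constant step size α are illustrative assumptions:

    # One TD(0) update after observing (s_t, r_t, s_{t+1}).
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        err = r + gamma * V[s_next] - V[s]   # estimation error Err_t
        V[s] += alpha * err
        return err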
Learning - Model Free. Policy evaluation: TD(λ)
Again: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}. The "estimation error" is Err_t = r_t + γ V(s_{t+1}) - V(s_t).
Update every state s:
V_{t+1}(s) = V_t(s) + α Err_t e(s)
Update of the eligibility trace e(s):
When visiting s, increment it by 1: e(s) = e(s) + 1.
For all other s', decay it by a γλ factor: e(s') = γλ e(s').
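A minimal sketch of the TD(λ) update with accumulating traces (here all traces are decayed after the value update, which is one common convention; V and e are assumed to be dictionaries over the same states):

    # One TD(lambda) update after observing (s_t, r_t, s_{t+1}).
    def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
        err = r + gamma * V[s_next] - V[s]   # estimation error Err_t
        e[s] += 1                            # bump the visited state's trace
        for state in V:
            V[state] += alpha * err * e[state]
            e[state] *= gamma * lam          # decay traces by gamma*lambda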
Different Model-Free Algorithms
• On-policy versus off-policy.
• The function being approximated (V vs. Q).
• Fairly similar general methodologies.
• Challenge: controlling the stochastic process.
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Large state MDP
Previous methods were tabular, suitable for a small state space.
A large state space is typically of exponential size.
Here the problem becomes similar to the basic view of (supervised) learning: we must generalize across states.
Large scale MDP
Approaches:
1. Restricted value function class.
2. Restricted policy class.
3. Restricted model of the MDP.
4. A different MDP representation: a generative model.
Large MDP - Restricted Value Function
Applications: most of the recent work (AlphaGo, Atari, etc.).
General idea: reduce the problem to supervised (deep) learning.
(Value) function approximation: use a limited class of functions to estimate the value function.
Given a good approximation of the value function, we can estimate the optimal policy; a sketch follows.
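A minimal sketch of TD(0) with a linear value-function approximation V(s) ≈ w·φ(s); the feature map phi and the step sizes are illustrative assumptions:

    import numpy as np

    # One semi-gradient TD(0) step on the weight vector w.
    def td0_linear_update(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
        err = r + gamma * np.dot(w, phi(s_next)) - np.dot(w, phi(s))
        w += alpha * err * phi(s)   # move w along the feature direction
        return w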
Large MDP: Restricted Policy
• Fix a policy class Π = {π : S → A}.
  – Given a policy π ∈ Π, approximate V^π and Q^π.
• Run policy improvement: π' = Greedy(Q^π).
• The quality depends on the approximation.
  – Convergence is not guaranteed.
Large MDP: Policy Gradient
• Improve the parameters of the policy by taking a gradient step.
• Challenge: the update impacts both:
  – the action probabilities, and
  – the distribution over states.
• Can use off-policy data: compute the gradient from the policy history. (A gradient sketch follows.)
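A minimal REINFORCE-style sketch of estimating the gradient from one episode; the tabular softmax parameterization theta[s, a] is an illustrative assumption, not the specific method from the lecture:

    import numpy as np

    def softmax_probs(theta, s):
        z = np.exp(theta[s] - np.max(theta[s]))
        return z / z.sum()

    # episode: list of (s, a, r); returns an estimate of the gradient
    # of the discounted return with respect to theta.
    def reinforce_gradient(theta, episode, gamma=0.99):
        grad = np.zeros_like(theta)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G                # return from time t onward
            grad_log = -softmax_probs(theta, s)
            grad_log[a] += 1.0               # d log pi(a|s) / d theta[s, :]
            grad[s] += (gamma ** t) * G * grad_log
        return grad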
Large MDP: Generative Model.
There are algorithms that estimate an optimal policy (for the discounted infinite horizon) in time independent of the number of states.
Generative model representation: given (s, a), the generator returns a sample (s', r).
Clearly we cannot use a matrix representation of the MDP.
Large state MDP: Generative Model.
• Use the generative model for "look-ahead":
  – Sample a tree of shallow depth using the generative model.
  – Compute an optimal policy on the tree.
• This results in an approximately optimal policy in the MDP; see the sketch below.
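A minimal sketch of the shallow-tree look-ahead; the generator interface and the width/depth parameters are illustrative assumptions:

    # generator(s, a) -> (s_next, r) samples one transition.
    def lookahead_value(generator, s, actions, depth=3, width=5, gamma=0.9):
        if depth == 0:
            return 0.0
        best = float('-inf')
        for a in actions:
            total = 0.0
            for _ in range(width):   # sample 'width' successors per action
                s_next, r = generator(s, a)
                total += r + gamma * lookahead_value(
                    generator, s_next, actions, depth - 1, width, gamma)
            best = max(best, total / width)  # Monte Carlo estimate of Q(s, a)
        return best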
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Course (tentative) Outline
• Part 1: MDP basics and planning
• Part 2: MDP learning
  – Model based and model free
• Part 3: Large state MDP
  – Policy gradient, Deep Q-Network
• Part 4: Special MDPs
  – Bandits and partially observable MDPs
• Part 5: Advanced topics