Reinforcement Learning
Yishay Mansour, Tel-Aviv University
Reinforcement Learning: Course Information
• Classes: Wednesday
  – Lecture 10-13: Yishay Mansour
  – Recitations 14-15 / 15-16: Eliya Nachmani, Adam Polyak
• Course Web site: rl-tau-2018.wikidot.com
• Resources:
  – Markov Decision Processes – Puterman
  – Reinforcement Learning – Sutton and Barto
  – Neuro-dynamic Programming – Bertsekas and Tsitsiklis
Reinforcement Learning: Course Requirements
• Homework:
  – Every two weeks: theory and programming
• Project:
  – Near the end of the term
  – Deep RL / Atari
• Final Exam
• Grade:
  – 60% final exam (must pass the exam)
  – 20% homework
  – 20% project
Playing Board Games
Gerald Tesauro, TD-Gammon, 1992
AlphaGo, DeepMind, 2015-17
Other notable board games
Arthur Samuel (checkers), 1962
Deep Blue (chess), 1996
Playing Atari Games
Controlling Robots
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Goal of Reinforcement Learning
Goal-oriented learning through interaction:
control of large-scale stochastic environments with partial knowledge.
Contrast with supervised / unsupervised learning, which learns from labeled / unlabeled examples.
Mathematical Model - Motivation
A model of uncertainty: the environment, the actions, and our own knowledge.
Focus on decision making.
Maximize the long-term reward.
The model: Markov Decision Process (MDP).
Contrast with Supervised Learning
The system has a “state”.
The algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation. There is a cost to discovering information!
Mathematical Model - MDP
Markov Decision Processes
S – set of states
A – set of actions
δ – transition probability
R – reward function
(Similar to a DFA!)
MDP model - states and actions
Environment = states; actions = transitions.
Performing action a in state s moves to state s' with probability δ(s, a, s') (e.g., 0.7 to one successor state and 0.3 to another, as in the figure).
MDP model - rewards
R(s, a) = reward at state s for doing action a (a random variable).
Example: R(s, a) = -1 with probability 0.5, +10 with probability 0.35, +20 with probability 0.15.
MDP model - trajectories
trajectory:
s0, a0, r0, s1, a1, r1, s2, a2, r2, ...
MDP - Return function.
Combining all the immediate rewards into a single value.
Modeling issues:
Are early rewards more valuable than later rewards?
Is the system “terminating” or continuous?
Usually the return is linear in the immediate rewards.
MDP model - return functions
Finite horizon - parameter H:
  return = Σ_{i=1}^{H} R(s_i, a_i)
Infinite horizon, discounted - parameter γ < 1:
  return = Σ_{i=0}^{∞} γ^i R(s_i, a_i)
Infinite horizon, undiscounted (average reward):
  return = (1/N) Σ_{i=0}^{N-1} R(s_i, a_i), as N → ∞
Terminating MDP: sum the rewards until termination.
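For illustration, a minimal Python sketch of the discounted and average returns (the reward sequence and the value of γ below are arbitrary, not from the lecture):

    # Discounted return: sum_i gamma^i * r_i
    def discounted_return(rewards, gamma=0.5):
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    # Undiscounted (average) return over N steps: (1/N) * sum_i r_i
    def average_return(rewards):
        return sum(rewards) / len(rewards)

    print(discounted_return([1, 1, 1]))  # 1 + 0.5 + 0.25 = 1.75
    print(average_return([1, 1, 1]))     # 1.0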
MDP Example: Inventory
State x_t = inventory at time t; action a_t = quantity ordered; d_t = demand.
Dynamics: x_{t+1} = [x_t + a_t - d_t]^+
Sales: s_t = min(x_t + a_t, d_t)
Reward: r_t = P·s_t - J(a_t) - C(x_{t+1}), where P = price per item, J(·) = order cost, C(·) = inventory cost.
MDP model - action selection
Policy - mapping from states to actions
Fully Observable - can “see” the “exact” state.
AIM: Maximize the expected return. This talk: the discounted return.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
MDP model - summary
s ∈ S – set of states, |S| = n.
a ∈ A – set of actions, |A| = k.
δ(s, a, s') – transition function.
R(s, a) – immediate reward function.
π: S → A – policy.
Σ_{i=0}^{∞} γ^i r_i – discounted cumulative return.
Contrast with Supervised Learning
Supervised learning: a fixed distribution on examples.
Reinforcement learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.
Simple setting: Multi-armed bandit
Single state.
(A single state s with actions a1, a2, a3.)
Goal: Maximize sum of immediate rewards.
Difficulty: unknown rewards.
Given the model: Greedy action.
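A minimal ε-greedy sketch for handling the unknown rewards; the arm interface 'pull' and the value of ε are illustrative assumptions:

    import random

    def epsilon_greedy(pull, k=3, steps=1000, epsilon=0.1):
        counts = [0] * k
        means = [0.0] * k                   # empirical mean reward per arm
        for _ in range(steps):
            if random.random() < epsilon:
                a = random.randrange(k)     # explore: random arm
            else:                           # exploit: empirically best arm
                a = max(range(k), key=lambda i: means[i])
            r = pull(a)                     # observe a reward sample
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]  # incremental mean
        return means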
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Planning - Basic Problems.
Given a complete MDP model:
Policy evaluation - given a policy π, evaluate its return.
Optimal control - find an optimal policy π* (maximizing the return from any start state).
Planning - Value Functions
V^π(s) - the expected return starting at state s and following π.
Q^π(s, a) - the expected return starting at state s with action a, and then following π.
V*(s) and Q*(s, a) are defined using an optimal policy π*:
V*(s) = max_π V^π(s)
Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E[R(s, π(s))] + γ Σ_{s'} δ(s, π(s), s') V^π(s')
Rewriting the expectation:
V^π(s) = E_{s' ~ δ(s, π(s))} [ R(s, π(s)) + γ V^π(s') ]
This is a linear system of equations in the values V^π(s).
Algorithms - Policy Evaluation Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a}; π random; ∀a: R(s_i, a) = i.
(States s0, s1, s2, s3 arranged in a cycle.)
V^π(s0) = 0 + γ [ π(s0, +1) V^π(s1) + π(s0, -1) V^π(s3) ]
Algorithms - Policy Evaluation Example (continued)
With π random and γ = 1/2:
V^π(s0) = 0 + (V^π(s1) + V^π(s3)) / 4
V^π(s1) = 1 + (V^π(s0) + V^π(s2)) / 4
V^π(s2) = 2 + (V^π(s1) + V^π(s3)) / 4
V^π(s3) = 3 + (V^π(s0) + V^π(s2)) / 4
Solution: V^π(s0) = 5/3, V^π(s1) = 7/3, V^π(s2) = 11/3, V^π(s3) = 13/3.
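The solution can be checked numerically by solving the linear system V = R + γPV; a minimal numpy sketch (the matrix P below encodes the random policy on the 4-cycle):

    import numpy as np

    gamma = 0.5
    R = np.array([0.0, 1.0, 2.0, 3.0])   # R(s_i) = i
    P = np.array([[0, .5, 0, .5],        # random walk on the cycle
                  [.5, 0, .5, 0],
                  [0, .5, 0, .5],
                  [.5, 0, .5, 0]])
    V = np.linalg.solve(np.eye(4) - gamma * P, R)
    print(V)   # [5/3, 7/3, 11/3, 13/3]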
Algorithms - optimal control
State-action value function (for a deterministic policy π):
Q^π(s, a) = E[R(s, a)] + γ E_{s' ~ δ(s, a)} [ V^π(s') ]
Note that V^π(s) = Q^π(s, π(s)).
Algorithms - Optimal Control Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a}; π random; R(s_i, a) = i.
Q^π(s0, +1) = 0 + γ V^π(s1) = 7/6
Q^π(s0, -1) = 0 + γ V^π(s3) = 13/6
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a { Q^π(s, a) }   (Bellman equation)
PROOF (only if): Assume there is a state s and an action a such that V^π(s) < Q^π(s, a). Then the strategy that performs a at state s (the first time) and follows π afterwards is better than π. This holds every time we visit s, so the policy that always performs action a at state s is better than π.
Algorithms - Optimal Control Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{i+a}; π random; R(s_i, a) = i.
Change the policy using the state-action value function: at s0, Q^π(s0, -1) = 13/6 > Q^π(s0, +1) = 7/6, so the improved policy takes action -1 at s0.
MDP - computing optimal policy
1. Linear programming.
2. Value iteration:
   V_{t+1}(s) ← max_a { R(s, a) + γ Σ_{s'} δ(s, a, s') V_t(s') }
3. Policy iteration:
   π_i(s) = argmax_a { Q^{π_{i-1}}(s, a) }
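A minimal value-iteration sketch in Python/numpy; the array shapes and the fixed iteration count are illustrative assumptions:

    import numpy as np

    # R[s, a]   : expected immediate reward
    # P[s, a, t]: transition probability delta(s, a, t)
    def value_iteration(R, P, gamma=0.9, iters=1000):
        n, k = R.shape
        V = np.zeros(n)
        for _ in range(iters):
            # V_{t+1}(s) = max_a { R(s,a) + gamma * sum_t P(s,a,t) V_t(t) }
            V = np.max(R + gamma * (P @ V), axis=1)
        # Greedy policy extraction: pi(s) = argmax_a Q(s, a)
        pi = np.argmax(R + gamma * (P @ V), axis=1)
        return V, pi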
Example: Grid world
(Figure: the move actions, shown as arrows.)
Initial state = red; final state = blue; cost = steps to the target (blue).
Example: Grid world - value iteration
Initial values (t = 0): every state is ∞, except the target, which is 0.
After one iteration (t = 1): the neighbors of the target have value 1.
After two iterations (t = 2): states at distance 2 from the target have value 2.
Example: Grid world
• Optimal policy
Q(s, →) = walk → and add the distance to the target from there.
Derive the optimal policy from Q(s, ·).
Convergence
• Value iteration
  – Each iteration cuts the distance to the optimal value by a (1-γ) fraction, i.e., the update is a γ-contraction.
• Policy iteration
  – The policy improves monotonically.
  – The number of iterations is at most that of value iteration.
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Learning Algorithms
Given access to the environment only through performing actions:
1. Policy evaluation.
2. Control - find an optimal policy.
Two approaches:
1. Model based.
2. Model free.
Learning - Model Based
Estimate the model from the observations (both the transition probabilities and the rewards).
Use the estimated model as if it were the true model, and find a near-optimal policy for it.
If the estimated model is "good", the resulting policy should be near-optimal.
Learning - Model Based: off-policy
• Let the policy run for a "long" time.
  – What is "long"?!
  – Assumes some "exploration".
• Build an "observed model":
  – Transition probabilities
  – Rewards
  (The two estimates are independent!)
• Use the "observed model" to learn an optimal policy.
Learning - Model Based: off-policy algorithm
• Observe a trajectory generated by a policy π.
  – Off-policy: no need to control the actions.
• For every s, s' ∈ S and a ∈ A:
  – δ'(s, a, s') = #(s, a, s') / #(s, a, ·)
  – R'(s, a) = average(r | s, a)
• Find an optimal policy for the estimated MDP (S, A, δ', R'). (A sketch of the estimation step follows.)
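A minimal sketch of the estimation step; the trajectory format (a list of (s, a, r) triples, with the next state taken from the following triple) is an illustrative assumption:

    from collections import defaultdict

    def estimate_model(trajectory):
        counts = defaultdict(lambda: defaultdict(int))  # #(s, a, s')
        reward_sums = defaultdict(float)                # sum of r at (s, a)
        totals = defaultdict(int)                       # #(s, a, .)
        for (s, a, r), (s_next, _, _) in zip(trajectory, trajectory[1:]):
            counts[(s, a)][s_next] += 1
            reward_sums[(s, a)] += r
            totals[(s, a)] += 1
        delta_hat = {sa: {t: c / totals[sa] for t, c in d.items()}
                     for sa, d in counts.items()}
        R_hat = {sa: reward_sums[sa] / totals[sa] for sa in totals}
        return delta_hat, R_hat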
Learning - Model Based
• Claim: if the model is “accurate” we will compute a near-optimal policy.
• Hidden assumption:
  – Each (s, a, ·) is sampled many times.
  – This is the "responsibility" of the policy π (we are off-policy).
• Simple question: how many samples do we need for each (s, a, ·)?
Learning - Model Based: on-policy
• The learner has control over the actions.
  – The immediate goal is to learn a model.
• As before:
  – Build an "observed model" (transition probabilities and rewards).
  – Use the "observed model" to estimate the value of the policy.
• Accelerating the learning: how do we reach "unexplored" states?!
Learning - Model Based: on-policy
(Figure: the state space split into well-sampled states and relatively unknown states; the unknown states are assigned a HIGH REWARD to encourage visiting them.)
Exploration → planning in a new MDP.
Learning: Policy improvement
• Assume that we can perform the following: given a policy π, compute its value functions V^π and Q^π.
• Then we can run policy improvement: π' = Greedy(Q^π).
• The process converges if the estimates are accurate.
Model-Free learning
Q-Learning: off-policy
Basic idea: learn the Q-function directly.
On a move (s, a) → s', update:
Q_{t+1}(s, a) = (1 - α_t(s, a)) Q_t(s, a) + α_t(s, a) [ R(s, a) + γ max_u Q_t(s', u) ]
The first term weights the old estimate and the second the new estimate; the learning rate at (s, a) is α_t(s, a) = 1/t^ω.
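A minimal sketch of this update; the dictionary-of-dictionaries representation of Q is an illustrative assumption:

    # One Q-learning update for the observed move (s, a) -> s' with reward r.
    def q_update(Q, s, a, r, s_next, alpha, gamma=0.9):
        target = r + gamma * max(Q[s_next].values())      # new estimate
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target  # blend with old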
Q-Learning: update equation
Equivalently, written as an update by the temporal difference:
Q_{t+1}(s, a) = Q_t(s, a) - α_t(s, a) Δ_t
Δ_t = Q_t(s, a) - [ R(s, a) + γ max_u Q_t(s', u) ]
(old estimate minus new estimate; α_t(s, a) is the learning rate)
Q-Learning: Intuition
• The update is based on the difference:
  Δ_t = Q_t(s, a) - [ R(s, a) + γ max_u Q_t(s', u) ]
• Assume we have the right Q-function.
• Good news: the expectation of the difference is then zero!
• Challenge: understanding the dynamics of the resulting stochastic process.
Learning - Model Free. Policy evaluation: TD(0)
An online view: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}.
Our "estimation error" is Err_t = r_t + γ V(s_{t+1}) - V(s_t). The update:
V_{t+1}(s_t) = V_t(s_t) + α Err_t
Note that for the correct value function we have:
E[ r + γ V(s') - V(s) ] = E[ Err_t ] = 0
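A minimal sketch of the TD(0) update; the table representation of V and the constant step size α are illustrative assumptions:

    # One TD(0) update after observing (s_t, r_t, s_{t+1}).
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        err = r + gamma * V[s_next] - V[s]   # estimation error Err_t
        V[s] += alpha * err
        return err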
Learning - Model Free. Policy evaluation: TD(λ)
Again: at state s_t we performed action a_t, received reward r_t, and moved to state s_{t+1}. The "estimation error" is Err_t = r_t + γ V(s_{t+1}) - V(s_t).
Update every state s:
V_{t+1}(s) = V_t(s) + α Err_t e(s)
Update of the eligibility trace e(s):
When visiting s, increment it by 1: e(s) = e(s) + 1.
For all other s', decay it by a γλ factor: e(s') = γλ e(s').
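A minimal sketch of the TD(λ) update with accumulating traces (here all traces are decayed after the value update, which is one common convention; V and e are assumed to be dictionaries over the same states):

    # One TD(lambda) update after observing (s_t, r_t, s_{t+1}).
    def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
        err = r + gamma * V[s_next] - V[s]   # estimation error Err_t
        e[s] += 1                            # bump the visited state's trace
        for state in V:
            V[state] += alpha * err * e[state]
            e[state] *= gamma * lam          # decay traces by gamma*lambda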
Different Model-Free Algorithms
• On-policy versus off-policy.
• The function being approximated (V vs. Q).
• Fairly similar general methodologies.
• Challenge: controlling the stochastic process.
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Large state MDP
Previous methods were tabular, suitable for a small state space.
A large state space is typically of exponential size.
Here the problem becomes similar to the basic view of (supervised) learning: we must generalize across states.
Large scale MDP
Approaches:
1. Restricted value function class.
2. Restricted policy class.
3. Restricted model of the MDP.
4. A different MDP representation: a generative model.
Large MDP - Restricted Value Function
Applications: most of the recent work (AlphaGo, Atari, etc.).
General idea: reduce the problem to supervised (deep) learning.
(Value) function approximation: use a limited class of functions to estimate the value function.
Given a good approximation of the value function, we can estimate the optimal policy; a sketch follows.
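A minimal sketch of TD(0) with a linear value-function approximation V(s) ≈ w·φ(s); the feature map phi and the step sizes are illustrative assumptions:

    import numpy as np

    # One semi-gradient TD(0) step on the weight vector w.
    def td0_linear_update(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
        err = r + gamma * np.dot(w, phi(s_next)) - np.dot(w, phi(s))
        w += alpha * err * phi(s)   # move w along the feature direction
        return w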
Large MDP: Restricted Policy
• Fix a policy class Π = {π : S → A}.
  – Given a policy π ∈ Π, approximate V^π and Q^π.
• Run policy improvement: π' = Greedy(Q^π).
• The quality depends on the approximation.
  – Convergence is not guaranteed.
Large MDP: Policy Gradient
• Improve the parameters of the policy by taking a gradient step.
• Challenge: the update impacts both:
  – the action probabilities, and
  – the distribution over states.
• Can use off-policy data: compute the gradient from the policy history. (A gradient sketch follows.)
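A minimal REINFORCE-style sketch of estimating the gradient from one episode; the tabular softmax parameterization theta[s, a] is an illustrative assumption, not the specific method from the lecture:

    import numpy as np

    def softmax_probs(theta, s):
        z = np.exp(theta[s] - np.max(theta[s]))
        return z / z.sum()

    # episode: list of (s, a, r); returns an estimate of the gradient
    # of the discounted return with respect to theta.
    def reinforce_gradient(theta, episode, gamma=0.99):
        grad = np.zeros_like(theta)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G                # return from time t onward
            grad_log = -softmax_probs(theta, s)
            grad_log[a] += 1.0               # d log pi(a|s) / d theta[s, :]
            grad[s] += (gamma ** t) * G * grad_log
        return grad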
Large MDP: Generative Model.
There are algorithms that estimate an optimal policy (for the discounted infinite horizon) in time independent of the number of states.
Generative model representation: given (s, a), the generator returns a sample (s', r).
Clearly we cannot use a matrix representation of the MDP.
Large state MDP: Generative Model.
• Use the generative model for "look-ahead":
  – Sample a tree of shallow depth using the generative model.
  – Compute an optimal policy on the tree.
• This results in an approximately optimal policy in the MDP; see the sketch below.
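A minimal sketch of the shallow-tree look-ahead; the generator interface and the width/depth parameters are illustrative assumptions:

    # generator(s, a) -> (s_next, r) samples one transition.
    def lookahead_value(generator, s, actions, depth=3, width=5, gamma=0.9):
        if depth == 0:
            return 0.0
        best = float('-inf')
        for a in actions:
            total = 0.0
            for _ in range(width):   # sample 'width' successors per action
                s_next, r = generator(s, a)
                total += r + gamma * lookahead_value(
                    generator, s_next, actions, depth - 1, width, gamma)
            best = max(best, total / width)  # Monte Carlo estimate of Q(s, a)
        return best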
Today Outline: Overview
• Basics
  – Goal of Reinforcement Learning
  – Mathematical Model (MDP)
• Planning
  – Value iteration
  – Policy iteration
• Learning Algorithms
  – Model based
  – Model free
• Large state space
  – Function approximation
  – Policy gradient
Course (tentative) Outline
• Part 1: MDP basics and planning
• Part 2: MDP learning
  – Model based and model free
• Part 3: Large state MDP
  – Policy gradient, Deep Q-Network
• Part 4: Special MDPs
  – Bandits and partially observable MDPs
• Part 5: Advanced topics