The Right Way to do Reinforcement Learning with Function Approximation

DESCRIPTION

The Right Way to do Reinforcement Learning with Function Approximation. Rich Sutton, AT&T Labs, with thanks to Satinder Singh, David McAllester, and Mike Kearns. The prize: to find the "Right Way" to do RL with FA, one that is sound (stable, non-divergent) and ends up with a good policy.

TRANSCRIPT
Page 1

The Right Way to do Reinforcement Learning with Function Approximation

Rich Sutton, AT&T Labs
with thanks to Satinder Singh, David McAllester, Mike Kearns
Page 2

The Prize

• To find the "Right Way" to do RL with FA
  – sound (stable, non-divergent)
  – ends up with a good policy
  – gets there quickly, efficiently
  – applicable to any (discrete-time, finite-state) MDP
  – compatible with (at least) linear FA
  – online and incremental
• To prove that it is so
• Tensions:
  – Proof and practice often pull in different directions
  – Lack of knowledge ≠ negative knowledge
  – We have not handled this well as a field

Critical to the viability of RL!
Page 3

Outline

• Questions
• History: from policy to value and back to policy
• Problem definition
  – Why function approximation changes everything
• REINFORCE
• Policy gradient theory
• Do we need values? Do we need TD?
  – Return baselines: using values without bias
  – TD/bootstrapping/truncation
    • may not be possible without bias
    • but seems essential for reducing variance
Page 4

Questions

Is RL theory fundamentally different/harder with FA? Yes.
Are value methods unsound with FA? Absolutely not.
Should we prefer policy methods for other reasons? Probably.
Is it sufficient to learn just a policy, not a value function? Apparently not.
Didn't we already do all this policy stuff in the 1980s? Only some of it.
Can values be used without introducing bias? Yes.
Can TD (bootstrapping) be done without bias? I wish.
Is TD much more efficient than Monte Carlo? Apparently.
Is it TD that makes FA hard? Yes and no, but mostly no.
So are we stuck with dual, "actor-critic" methods? Maybe so.
Are we talking about genetic algorithms? No!
What about learning "heuristic" or "relative" values? Are these policy methods or value methods? Policy.
Page 5

The Swing towards Value Functions

• Early RL methods all used parameterized policies
• But adding value functions seemed key to efficiency
• Why not just learn action-value functions and compute policies from them?
• A prediction problem, almost supervised
• Fewer parameters
• Enabled first proofs of convergence to the optimal policy
• Impressive applications using FA
• So successful that early policy work was bypassed

Q-learning (Watkins, 1989):

$$Q^*(s,a) = E\left\{ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \;\middle|\; s_0 = s,\; a_0 = a,\; \pi^* \right\}$$

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Cleaner, simpler, easier to use.
Page 6

The Swing away from Value Functions

• Theory hit a brick wall for RL with FA
• Q-learning shown to diverge with linear FA
• Many counterexamples to convergence
• Widespread scepticism about any argmax-VF solution
  – that is, about any way to get conventional convergence
• But is this really a problem?
  – In practice, on-policy methods perform well
  – Is this only a problem for our theory?
• With Gordon's latest result these concerns seem to have been hasty, now invalid
Page 7

Why FA makes RL hard

• All the states interact and must be balanced, traded off
• Which states are visited is affected by the policy
• A small change (or error) in the VF estimate can cause a large, discontinuous change in the policy, which in turn can cause a large change in the VF estimate
Page 8

Diagram of What Happens in Value Function Space

[Diagram: the space of value functions, showing the inadmissible value functions, the value functions consistent with the true V*, the region of π*, and the best admissible policy. Four outcomes are marked:]

• Original naïve hope: guaranteed convergence to a good policy
• Residual gradient et al.: guaranteed convergence to a less desirable policy
• Sarsa, TD(λ) & other on-policy methods: chattering, without divergence or guaranteed convergence
• Q-learning, DP & other off-policy methods: divergence possible
Page 9

…and towards Policy Parameterization

• A parameterized policy (PP) can be changed continuously
• A PP can find stochastic policies
  – the optimal policy will often be stochastic with FA
• A PP can omit the argmax (the action-space search)
  – necessary for large/continuous action spaces
• A PP can be more direct, simpler
• Prior knowledge is often better expressed as a PP
• A PP method can be proven convergent to a local optimum for general differentiable FA

REINFORCE (Williams, 1988)
Page 10

Defining the Problem (RL with FA)

Part I: Parameterized Policies

• Finite state and action sets $S$, $A$, with $|S| = N$
• Discrete time: $t = 0, 1, 2, 3, \ldots$
• Transition probabilities: $p_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
• Expected rewards: $r_s^a = E\{r_{t+1} \mid s_t = s, a_t = a\}$
• Stochastic policy: $\pi(s,a) = \Pr\{a_t = a \mid s_t = s\}$, possibly parameterized, e.g., by $\theta \in \Re^n$ with $n \ll N$ (w.l.o.g.)
Page 11

Examples of Policy Parameterizations

Gibbs: ranking #s, one per action, computed from features of $s_t$ with $\theta$ = weights, then normalized (so the action probabilities $\pi(s_t,a)$ sum to 1). Alternatively: a single ranking #, computed from features of both $s_t$ and $a$ and repeated for each $a \in A$, then normalized the same way. For example, the standard Gibbs (softmax) form:

$$\pi(s,a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_{b} e^{\theta^\top \phi_{sb}}}$$

• Ranking #s are mechanistically like action values
• But they do not have value semantics
• Many "heuristic" or "relative" values are better viewed as ranking #s
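As an illustration only (not from the slides), a Gibbs policy over linear ranking numbers can be sketched in a few lines; the one-hot features and problem sizes here are assumptions:

```python
import numpy as np

def gibbs_policy(theta, features, s, n_actions):
    """Action probabilities from ranking numbers theta^T phi(s,a) (Gibbs/softmax)."""
    ranks = np.array([theta @ features(s, a) for a in range(n_actions)])
    ranks -= ranks.max()          # subtract max for numerical stability
    e = np.exp(ranks)
    return e / e.sum()            # normalization: probabilities sum to 1

# Illustrative one-hot features over (state, action) pairs
n_states, n_actions = 3, 2
def features(s, a):
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

theta = np.zeros(n_states * n_actions)   # equal ranking numbers
p = gibbs_policy(theta, features, s=0, n_actions=n_actions)
print(p)   # equal ranking numbers give a uniform policy
```

Note that the ranking numbers play the mechanical role of action values here, but nothing in the construction gives them value semantics.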
Page 12

More Policy Parameterizations

Continuous actions work too, e.g., a Gaussian sampler: features of $s_t$, with $\theta$ = weights, determine the mean and std. dev. of $a_t$ and thereby implicitly determine the continuous distribution $\pi(s_t,a)$.

Much stranger parameterizations are possible, e.g., cascades of interacting stochastic processes, such as in a communications network or factory.

We require only that our policy process produces $a_t$ according to some distribution $\pi(s_t,a)$, and that $\nabla_\theta \pi(s_t,a)$, perhaps in the form $\nabla_\theta \pi(s_t,a)/\pi(s_t,a)$, is available.
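A minimal sketch (assumed code, not from the slides) of the Gaussian sampler just described, with a linear mean and a log-linear std. dev. to keep it positive:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy(theta_mu, theta_sigma, x):
    """Features x of s_t determine the mean and std. dev. of the action a_t."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)   # exponentiate so sigma > 0
    return mu, sigma

def sample_action(mu, sigma):
    return rng.normal(mu, sigma)

def grad_log_pi(a, mu, sigma, x):
    """Eligibility grad(log pi) = grad(pi)/pi for the Gaussian, per weight vector."""
    g_mu = (a - mu) / sigma**2 * x
    g_sigma = ((a - mu) ** 2 / sigma**2 - 1.0) * x
    return g_mu, g_sigma

x = np.array([1.0, 0.5])              # illustrative state features
mu, sigma = gaussian_policy(np.array([0.2, -0.4]), np.zeros(2), x)
a = sample_action(mu, sigma)
```

The eligibility term is exactly the quantity the slides require of any parameterization: a way to compute $\nabla_\theta \pi(s_t,a)/\pi(s_t,a)$ for the sampled action.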
Page 13

Defining the Problem (RL with FA) II

Choose $\pi$ to maximize a measure of total future reward $R_t$, called the return.

Values are expected returns:

$$V^\pi(s) = E\{R_t \mid s_t = s, \pi\} \qquad Q^\pi(s,a) = E\{R_t \mid s_t = s, a_t = a, \pi\}$$

Optimal policies:

$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in S \qquad \pi^*(s) = \arg\max_a Q^{\pi^*}(s,a)$$

Value methods maintain a parameterized approximation to a value function ($V^\pi$, $Q^\pi$, $V^{\pi^*}$, or $Q^{\pi^*}$) and then compute their policy from it, e.g.,

$$\pi(s) = \arg\max_a \hat{Q}(s,a)$$
Page 14

FA Breaks the Standard Problem Def'n!

Discounted case. One infinite, ergodic episode: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$

Return: $R_t = (1-\gamma)\left( r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \right)$

Let $\Pi$ be the space of all policies. Let $\tilde\Pi \subseteq \Pi$ be all policies consistent with the parameterization.

Problem: $\arg\max_{\pi \in \tilde\Pi} V^\pi(s)$ depends on $s$! No one policy in $\tilde\Pi$ is best for all states; states compete for control of $\pi$.

We need an overall (not per-state) measure of policy quality, e.g.,

$$J(\pi) = \sum_s d^\pi(s)\, V^\pi(s)$$

where $d^\pi(s)$ is the asymptotic fraction of time spent in $s$ under $\pi$. But! Thm: this $J(\pi)$ is independent of $\gamma$: $J(\pi)$ = average reward per step.
Page 15

RL Cases Consistent with FA

Average-reward case. One infinite, ergodic episode: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$

$$J(\pi) = \text{average reward per time step under } \pi \qquad R_t = \sum_{k=1}^{\infty} \left[ r_{t+k} - J(\pi) \right]$$

Episodic case. Many episodes, all starting from $s_0$.

$$J(\pi) = V^\pi(s_0) \qquad R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
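For the episodic case, the returns $R_t$ can be computed in a single backward pass; a minimal sketch (the rewards and discount below are made up for illustration):

```python
def episodic_returns(rewards, gamma):
    """R_t = r_{t+1} + gamma*r_{t+2} + ... + gamma^(T-t-1)*r_T, right to left."""
    R = 0.0
    out = []
    for r in reversed(rewards):   # rewards = [r_1, ..., r_T] for one episode
        R = r + gamma * R
        out.append(R)
    return out[::-1]              # out[t] is R_t

print(episodic_returns([1.0, 0.0, 2.0], 0.5))   # [1.5, 1.0, 2.0]
```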
Page 16

Outline

• Questions
• History: from policy to value and back to policy
• Problem definition
  – Why function approximation changes everything
• REINFORCE
• Policy gradient theory
• Do we need values? Do we need TD?
  – Return baselines: using values without bias
  – TD/bootstrapping/truncation
    • may not be possible without bias
    • but seems essential for reducing variance
Page 17

Do we need Values at all?

Extended REINFORCE (Williams, 1988):

$$\Delta_t = \alpha R_t \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)}, \qquad \Delta = \sum_t \Delta_t \quad \text{(offline updating, episodic case)}$$

Thm: $E_\pi\{\Delta\} = \alpha \nabla J(\pi)$

There is also an online, incremental implementation using eligibility traces, and an extension to the average-reward case (Baxter and Bartlett, 1999).

Converges to a local optimum of $J$ for general differentiable FA! Simple, clean, a single parameter... Why didn't we love this algorithm in 1988? No TD/bootstrapping (it is a Monte Carlo method), thought to be inefficient.
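A hedged sketch of the episodic update $\Delta_t = \alpha R_t \nabla_\theta\pi/\pi$ above, on a deliberately tiny made-up problem (one-step episodes with two actions); the task, seed, and constants are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(2)            # one ranking number per action (Gibbs policy)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """grad(pi(a))/pi(a) = grad(log pi(a)) for the Gibbs policy."""
    g = -pi(theta)
    g[a] += 1.0
    return g

alpha = 0.1
for episode in range(3000):
    p = pi(theta)
    a = rng.choice(2, p=p)
    # One-step episode: action 0 pays off with prob 0.8, action 1 with prob 0.2
    R = float(rng.random() < (0.8 if a == 0 else 0.2))
    theta += alpha * R * grad_log_pi(theta, a)   # Delta_t = alpha * R_t * grad/pi

print(pi(theta))   # the policy comes to prefer action 0
```

No value function is maintained anywhere: the return $R$ alone drives the update.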
Page 18

Policy Gradient Theorem

Thm:

$$\nabla_\theta J(\pi) = \sum_s d^\pi(s) \sum_a Q^\pi(s,a)\, \nabla_\theta \pi(s,a)$$

where $d^\pi(s)$ is how often $s$ occurs under $\pi$. Note that the gradient does not involve $\nabla_\theta d^\pi(s)$!

(Williams '88; Jaakkola, Singh & Jordan '95; Cao & Chen '97; Marbach & Tsitsiklis '98; Sutton, McAllester, Singh & Mansour '99; Konda & Tsitsiklis '99)
Page 19

Policy Gradient Theory

Thm:

$$\nabla_\theta J(\pi) = \sum_s d^\pi(s) \sum_a Q^\pi(s,a)\, \nabla_\theta \pi(s,a)$$

where $d^\pi(s)$ is how often $s$ occurs under $\pi$. Since $\sum_a \nabla_\theta \pi(s,a) = 0$ for all $s$, any baseline $b: S \to \Re$ can be subtracted:

$$= \sum_s d^\pi(s) \sum_a \left[ Q^\pi(s,a) - b(s) \right] \nabla_\theta \pi(s,a)$$

$$= \sum_s d^\pi(s) \sum_a \pi(s,a) \left[ Q^\pi(s,a) - b(s) \right] \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)}$$

where $d^\pi(s)\,\pi(s,a)$ is how often the pair $(s,a)$ occurs under $\pi$, so

$$= E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$
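The baseline step relies on $\sum_a \nabla_\theta \pi(s,a) = 0$, which holds simply because the probabilities sum to 1. A quick numerical check for a Gibbs policy (illustrative code, not from the slides):

```python
import numpy as np

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def jacobian_pi(theta):
    """d pi(a) / d theta_b for a Gibbs policy: diag(p) - p p^T."""
    p = pi(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.3, -1.2, 0.7])
J = jacobian_pi(theta)
# Each column sums to zero: sum over a of d pi(a)/d theta_b = 0 for every b,
# so subtracting any baseline b(s) leaves the gradient of J unchanged.
print(J.sum(axis=0))
```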
Page 20

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} = E_\pi\left\{ R_t \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \quad \text{(REINFORCE)}$$
Page 21

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

$$= E_\pi\left\{ R_t \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \quad \text{(REINFORCE)}$$

OR

$$= E_\pi\left\{ \left[ r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}
\approx E_\pi\left\{ \left[ r_{t+1} + \gamma \hat{V}^\pi(s_{t+1}) - \hat{V}^\pi(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \quad \text{(actor-critic)}$$

OR

$$\approx E_\pi\left\{ \left[ R_t^\lambda - \hat{V}^\pi(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

(general form, includes all of the above: possible TD/bootstrapping via $\lambda$; ideal baseline?)
Page 22

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

Conjecture: the ideal baseline is $b(s) = V^\pi(s)$, in which case our error term is an advantage (Baird '93):

$$Q^\pi(s_t,a_t) - V^\pi(s_t) = A^\pi(s_t,a_t)$$

No bias is introduced by using an approximation here: $b(s) = \hat{V}^\pi(s)$.

How important is a baseline to the efficiency of REINFORCE? Apparently very important, but previous tests were flawed.
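To see why $b(s) = V^\pi(s)$ helps, here is an illustrative check (made-up numbers, not the talk's testbed) that subtracting the baseline changes the variance but not the mean of the REINFORCE gradient estimate, on a single-state, two-action case:

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.5, 0.5])       # fixed Gibbs policy over two actions
q = np.array([1.0, 0.2])       # true action values Q(a), chosen arbitrarily
V = p @ q                      # state value: the conjectured ideal baseline

def grad_log_pi(a):
    g = -p.copy()
    g[a] += 1.0
    return g

def gradient_samples(baseline, n=5000):
    out = []
    for _ in range(n):
        a = rng.choice(2, p=p)
        out.append((q[a] - baseline) * grad_log_pi(a))  # [Q - b] * grad/pi
    return np.array(out)

no_b = gradient_samples(0.0)
with_b = gradient_samples(V)
print(no_b.mean(axis=0), with_b.mean(axis=0))            # same gradient either way
print(no_b.var(axis=0).sum(), with_b.var(axis=0).sum())  # variance shrinks with b
```

Both estimators are unbiased, as the theory on the previous pages guarantees; only the variance differs.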
Page 23

Random MDP Testbed

• 50 randomly constructed episodic MDPs
  – 50 states, uniform starting distribution
  – 2 actions per state
  – 2 possible next states per action
  – expected rewards drawn from N(1,1); actual rewards add N(0,0.1) noise
  – 0.1 probability of termination on each step
• State-aggregation FA: 5 groups of 10 states each
• Gibbs action selection
• Baseline $b(s) = \hat{V}^\pi(s)$ learned by gradient descent
• Parameters initially $\theta = 0$, $w = 0$
• Step-size parameters for the policy and the baseline
Page 24

Effect of Learned Baseline

[Figure: $J(\pi)$ after 50 episodes as a function of the step size $\alpha$ (0.01 to 10, log scale), for REINFORCE with per-episode updating, with and without a learned baseline.]

Much better to learn a baseline approximating $V^\pi$.
Page 25

Can We TD without Introducing Bias?

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

Thm (Sutton et al. '99; Konda & Tsitsiklis '99): an approximation $\hat{Q}$ can replace $Q^\pi$ without bias if it is of the form

$$\hat{Q}(s,a) = w^\top \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)}$$

and has converged to a local optimum.

However! Thm (Singh, McAllester & Sutton, unpublished): under batch updating, such a $\hat{Q}$ results in exactly the same updates as REINFORCE. There is no useful bootstrapping. Empirically, there is also no win with per-episode updating.
Page 26

Effect of Unbiased Linear Q̂

[Figure: $J(\pi)$ after 50 episodes as a function of the step size $\alpha$, per-episode updating: REINFORCE and the unbiased linear $\hat{Q}$ at its best perform comparably.]
Page 27

TD Creates Bias; Must We TD?

Is TD really more efficient than Monte Carlo? Apparently "Yes", but this question deserves a better answer.

[Figure: four panels plotting performance against $\lambda$ from 0 to 1, for accumulating and replacing traces: RMS error on Random Walk; failures per 100,000 steps on Cart and Pole; steps per episode on Mountain Car; cost per episode on Puddle World.]
Page 28

Is it TD that makes FA hard?

• Yes: TD prediction with FA is trickier than Monte Carlo
  – even the linear case converges only near an optimum
  – nonlinear cases can even diverge
• No: TD is not the reason the control case is hard
  – a small change in value can cause a discontinuous change in the policy, which causes a large change in the state distribution, which causes a large change in value
  – this problem is intrinsic to control + FA
  – it happens even with Monte Carlo methods
Page 29

Small Sample Importance Sampling: A Superior Eligibility Term?

Thm:

$$\nabla_\theta J(\pi) = \sum_s d^\pi(s) \sum_a Q^\pi(s,a)\, \nabla_\theta \pi(s,a)$$

$$= \sum_s d^\pi(s) \sum_a \left[ Q^\pi(s,a) - b(s) \right] \nabla_\theta \pi(s,a) \quad \text{for any } b: S \to \Re \text{, since } \sum_a \nabla_\theta \pi(s,a) = 0 \ \forall s$$

$$= \sum_s d^\pi(s) \sum_a \pi(s,a) \left[ Q^\pi(s,a) - b(s) \right] \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)}$$

$$= E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

[The slide marks the $\pi(s_t,a_t)$ terms with "n?", querying whether a small-sample estimate could replace the exact probabilities in the eligibility term.]
Page 30

Questions

Is RL theory fundamentally different/harder with FA? Yes.
Are value methods unsound with FA? Absolutely not.
Should we prefer policy methods for other reasons? Probably.
Is it sufficient to learn just a policy, not a value function? Apparently not.
Can values be used without introducing bias? Yes.
Can TD (bootstrapping) be done without bias? I wish.
Is TD much more efficient than Monte Carlo? Apparently.
Is it TD that makes FA hard? Yes and no, but mostly no.
So are we stuck with dual, "actor-critic" methods? Maybe so.