A Comparative Analysis of Expected and Distributional Reinforcement Learning
Clare Lyle, Pablo Samuel Castro, Marc G. Bellemare
Presented by Jerrod Parker and Shakti Kumar
Outline:
1. Motivation
2. Background
3. Proof Sequence
4. Experiments
5. Limitations
Why Distributional RL?
1. Why restrict ourselves to the mean of the value distribution? (i.e., approximate only the expectation vs. approximate the full distribution)
2. Can we better approximate multimodal returns?
Motivation
● Poor theoretical understanding of the distributional RL framework
● Benefits have so far been observed only in deep RL architectures; it is not known whether simpler architectures benefit at all
Contributions
● Is distributional RL different from expected RL?
○ Tabular setting
○ Tabular setting with a categorical distribution approximator
○ Linear function approximation
○ Nonlinear function approximation
● Insights into nonlinear function approximators’ interaction with distributional RL
General Background: Formulation
The distributional Bellman equation: Z(x, a) =_D R(x, a) + γ Z(X′, A′), where X′ and A′ (the next state and action) are random variables.
Sources of randomness in the return:
1. Immediate rewards
2. Dynamics
3. A possibly stochastic policy
General Background: Visualization
r denotes the scalar reward obtained for a transition.
General Background: Contractions?
1. Is the policy evaluation step a contraction? Can I trust that during policy evaluation my distribution converges to the true return distribution?
2. Is contraction guaranteed in the control case, when I want to improve the current policy? Can I trust that the Bellman optimality operator leads me to the optimal policy?
Contraction in Policy Evaluation?
Is the policy evaluation step a contraction? Can I trust that during policy evaluation my distribution converges to the true return distribution?
Formally: given a policy π, do the iterates Z_{k+1} = T^π Z_k converge to Z^π?
Contraction in Policy Evaluation?
Given a policy π, do the iterates Z_{k+1} = T^π Z_k converge to Z^π?
The result says yes! You can rely on the distributional Bellman updates for policy evaluation.
Detour: the Wasserstein Metric
The p-Wasserstein distance between two distributions with CDFs F and G is defined as
W_p(F, G) = ( ∫_0^1 |F^{-1}(u) − G^{-1}(u)|^p du )^{1/p}
where F^{-1} and G^{-1} are the inverse CDFs of F and G respectively.
The maximal form of the Wasserstein metric is
W̄_p(Z_1, Z_2) = sup_{(x,a)} W_p(Z_1(x,a), Z_2(x,a)) for Z_1, Z_2 ∈ Ƶ,
where Ƶ denotes the space of value distributions with bounded moments.
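A quick numerical sketch of this definition (my own illustration, not code from the paper; function names are made up): for discrete distributions the inverse-CDF integrand is piecewise constant, so the integral can be computed exactly over the union of the two CDFs' breakpoints.

```python
import numpy as np

def inverse_cdf(atoms, probs, u):
    """Generalized inverse CDF of a discrete distribution.
    `atoms` must be sorted in ascending order."""
    cdf = np.cumsum(probs)
    idx = np.searchsorted(cdf, u, side="left")
    return np.asarray(atoms)[np.minimum(idx, len(atoms) - 1)]

def wasserstein(atoms_f, probs_f, atoms_g, probs_g, p=1):
    """p-Wasserstein distance via the inverse-CDF formulation:
    (integral over u in [0,1] of |F^-1(u) - G^-1(u)|^p du)^(1/p).
    Both inverse CDFs are piecewise constant, so we sum exactly over
    the constant pieces between consecutive CDF breakpoints."""
    levels = np.union1d(np.cumsum(probs_f), np.cumsum(probs_g))
    widths = np.diff(np.concatenate(([0.0], levels)))
    mids = levels - widths / 2  # interior point of each constant piece
    f_inv = inverse_cdf(atoms_f, probs_f, mids)
    g_inv = inverse_cdf(atoms_g, probs_g, mids)
    return float((widths @ np.abs(f_inv - g_inv) ** p) ** (1 / p))
```

For example, two-Dirac distributions at ε±1 and −ε±1 (as in the control counterexample later in the deck) come out 2ε apart in 1-Wasserstein distance.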
Contraction in Policy Evaluation?
Given a policy π, T^π is a γ-contraction in the maximal Wasserstein metric:
W̄_p(T^π Z_1, T^π Z_2) ≤ γ W̄_p(Z_1, Z_2)
Thus, by the Banach fixed-point theorem, the iterates Z_{k+1} = T^π Z_k converge to Z^π.
Contraction in Control/Improvement?
Background (Definitions 1 and 2 of "A Distributional Perspective on Reinforcement Learning"): in the control case, the distributional Bellman optimality operator backs up the return distribution of a greedy action.
Unfortunately, a contraction cannot be guaranteed here...
Contraction in Policy Improvement?
Consider a transition x_1 → x_2. At x_2 two actions are possible: r(a_1) = 0, and r(a_2) = ε+1 or ε−1, each with probability 0.5.
Assume a_1, a_2 are terminal actions and the environment is undiscounted.
What is the Bellman update TZ(x_2, a_2)?
Since the actions are terminal, the backed-up distribution equals the reward distribution.
Thus TZ(x_2, a_2) = ε±1 (two Diracs, at ε+1 and ε−1).
Contraction in Policy Improvement?
Recall that when rewards are scalar, Bellman updates are just the old distributions Z, scaled and translated.
Thus the original distribution Z(x_2, a_2) can be viewed as a translated version of TZ(x_2, a_2); let Z(x_2, a_2) = −ε±1.
The 1-Wasserstein distance between Z and Z* (assuming Z and Z* agree everywhere except at (x_2, a_2)) is then W̄_1(Z, Z*) = 2ε.
Contraction in Policy Improvement?
When we apply T to Z, the greedy action a_1 is selected (since E[Z(x_2, a_1)] = 0 > −ε), so TZ(x_1) = Z(x_2, a_1). For small ε this makes W̄_1(TZ, TZ*) larger than W̄_1(Z, Z*) = 2ε, which shows that the undiscounted update is not a contraction.
Thus a contraction cannot be guaranteed in the control case.
So is distributional RL a dead end?
Bellemare et al. showed that if there is a total ordering on the set of optimal policies and the state space is finite, then there exists an optimal distribution that is a fixed point of the Bellman update in the control case, and policy improvement converges to this fixed point [4].
![Page 35: A Comparative Analysis of Expected and Jerrod Parker and](https://reader031.vdocument.in/reader031/viewer/2022031517/622b1fef675a681e5228e764/html5/thumbnails/35.jpg)
Contraction in Policy Improvement?So is distributional RL a dead end?
Bellemare showed that if there is a total ordering on the set of optimal policies, and the state space is finite, then there exists an optimal distribution which is the fixed point of the bellman update in the control case
Here Z** denotes the set of value distributions corresponding to the set of optimal policies; this is a set of non-stationary optimal value distributions.
The C51 Algorithm
We could minimize the Wasserstein metric between TZ and Z and thereby obtain a learning algorithm. But this cannot be done with samples: the expected sample Wasserstein distance between two distributions is in general greater than the true Wasserstein distance between them.
So how do you develop an algorithm? Instead, project the updates TZ onto a finite support. This implicitly minimizes the Cramér distance to the original distribution, so we still approximate the original distribution while keeping its expectation the same.
So now we can see the entire algorithm!
The C51 Algorithm
The projection step is the same as a Cramér projection, which we'll see next.
C51 Visually
The distribution is represented by Diracs δ_{z_i} on support atoms z_1, z_2, …, z_K.
Update each Dirac as per the distributional Bellman operator, then distribute the mass of misaligned Diracs onto the support atoms.
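The mass-distribution step can be sketched as follows (my own illustration of the standard categorical projection; names are made up): each updated atom is clipped to the support range and its mass is split linearly between the two nearest support points.

```python
import numpy as np

def project_to_support(atoms, probs, support):
    """Project a discrete distribution onto a fixed, evenly spaced support:
    each Dirac's mass is split between its two neighbouring support atoms,
    proportionally to proximity."""
    v_min, v_max = support[0], support[-1]
    dz = support[1] - support[0]
    out = np.zeros(len(support))
    for z, p in zip(atoms, probs):
        b = (np.clip(z, v_min, v_max) - v_min) / dz  # fractional index
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                 # atom already sits on a support point
            out[lo] += p
        else:
            out[lo] += p * (hi - b)  # closer neighbour gets more mass
            out[hi] += p * (b - lo)
    return out
```

As Proposition 1 later states, when the atoms lie inside [z_1, z_K] this projection preserves the expectation.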
Cramér Distance
● The gradient of the sample Wasserstein distance is biased (see Section 3 of Reference [1]).
● For two probability distributions with CDFs F_P and F_Q, the Cramér metric is defined as
ℓ_2(P, Q) = ( ∫ (F_P(x) − F_Q(x))^2 dx )^{1/2}
Cramér Distance
● An attractive metric for distributional manipulations:
1. The policy evaluation Bellman operator is a contraction in the Cramér distance as well, as shown by Rowland et al. (2018).
2. The Cramér projection produces a distribution supported on z that minimizes the Cramér distance to the original distribution.
If the support of the original distribution is contained in the interval [z_1, z_K], it is straightforward to show that the Cramér projection preserves the distribution's expected value.
Cramér Distance
As we saw earlier, in distributional RL we need to approximate distributions. One way is to formulate them as categorical distributions, as C51 does. The Cramér distance is then
ℓ_2(P, Q)^2 = Σ_{i=1}^{K−1} (z_{i+1} − z_i) (F_P(z_i) − F_Q(z_i))^2
This is a weighted Euclidean norm between the CDF vectors of the two distributions.
When the atoms of the support are equally spaced, we get a scalar multiple of the Euclidean distance between the CDF vectors.
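For categorical distributions on a shared, evenly spaced support, that formula is a one-liner (an illustrative sketch, not code from the paper):

```python
import numpy as np

def cramer_distance(p, q, spacing=1.0):
    """Cramer (l2) distance between two categorical distributions on the
    same evenly spaced support. Both CDFs are constant between atoms, so
    the integral of (F_P - F_Q)^2 collapses to a scaled Euclidean norm
    of the CDF difference vector."""
    diff = np.cumsum(p) - np.cumsum(q)  # last entry ~0: both CDFs reach 1
    return float(np.sqrt(spacing * np.sum(diff ** 2)))
```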
Methods
● Compare policy evaluation in expected RL vs. distributional RL in several settings (i.e., tabular, linear approximation, nonlinear approximation).
● For each setting, the goal is to show expectation equivalence of the expected version vs. an analogous distributional version.
● Expectation equivalence: we want to show E[Z_t(s,a)] = Q_t(s,a) at every step, using the same experience in both.
Methods: Sequence of Proofs
1. Tabular models: represent the distribution over returns at each (s,a) separately
   a. Model based (full knowledge of the transition model and policy):
      i. No constraint on the type of distribution used to model returns
      ii. Approximate return distributions as categorical on a fixed support
   b. Sample based (SARSA-style updates, i.e., only using samples):
      i. No constraint on the type of distribution used to model returns
      ii. Approximate return distributions as categorical on a fixed support
      iii. Semi-gradient (w.r.t. CDF) update for distributional, compared to SARSA
      iv. Semi-gradient (w.r.t. PDF) update for distributional, compared to SARSA (doesn't hold)
2. Linear approximation:
   a. Semi-gradient of the Cramér distance w.r.t. the CDF
3. Nonlinear approximation:
   a. There exists a nonlinear representation of the CDF such that we initially have equivalence but lose it after the first weight update.
Proposition 1: Cramér Projection
● If a categorical distribution has support lying within [z_1, z_K], then Cramér-projecting it onto the support z preserves its expectation.
Proposition 2: Tabular, Model-Based
Z(s,a) and Q(s,a) are defined separately for each (s,a).
Proof Proposition 2
Proposition 3: Tabular, Model-Based, Categorical Distributions
Suppose Z has finite support; then applying the distributional Bellman operator can cause the resulting distribution to require a projection back onto the support.
SARSA vs. Distributional SARSA (Arbitrary Distribution)
Given a transition (s_t, a_t, r_t, s_{t+1}, a_{t+1}):
Q_{t+1}(s_t, a_t) = (1−α) Q_t(s_t, a_t) + α (r_t + γ Q_t(s_{t+1}, a_{t+1}))
Z_{t+1}(s_t, a_t) = (1−α) Z_t(s_t, a_t) + α (r_t + γ Z_t(s_{t+1}, a_{t+1})) (a mixture update)
Proposition 4: these two policy evaluation methods have expectation equivalence.
Proof: SARSA vs. Distributional SARSA
Expand the PMF of Z_{t+1} and take its expectation; the resulting update matches the expected SARSA update term by term.
SARSA vs. Distributional SARSA (with Categorical Distribution)
Recall the mixture update from the previous case. The difference: the bootstrap target is now projected onto the fixed support.
Proof: SARSA vs. Distributional SARSA (Categorical)
The bootstrapped variable r_t + γ Z_t(s_{t+1}, a_{t+1}) must be Cramér-projected back onto the support; by Proposition 1 this preserves its expectation.
SARSA vs. Semi-Gradient of the Cramér Distance
Assume the distribution is approximated as categorical on a c-spaced support. The gradient of the squared Cramér distance w.r.t. the CDF is proportional to the difference between the predicted and target CDFs.
Goal (Proposition 6): show there is a semi-gradient update that maintains expectation equivalence to SARSA (with a slight change in step size).
Results:
● Semi-gradient w.r.t. CDF ⇒ expectation equivalence
● Semi-gradient w.r.t. PDF ⇒ no expectation equivalence
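The CDF case can be illustrated numerically (a sketch under the unit-spacing assumption used later in the deck; the concrete CDF vectors here are made up): a semi-gradient step on the squared Cramér loss, taken w.r.t. the CDF, moves the implied expectation exactly as a TD/SARSA step would.

```python
import numpy as np

alpha = 0.1
z = np.array([0.0, 1.0, 2.0, 3.0])    # unit-spaced support

def q_from_cdf(F):
    """Expected value of the categorical distribution with CDF vector F."""
    pmf = np.diff(np.concatenate(([0.0], F)))
    return float(z @ pmf)

F = np.array([0.2, 0.5, 0.9, 1.0])    # predicted CDF
G = np.array([0.1, 0.3, 0.7, 1.0])    # (projected) bootstrap target CDF

# semi-gradient of the squared Cramer loss w.r.t. the CDF is (F - G),
# so the update nudges the predicted CDF toward the target CDF
F_new = F - alpha * (F - G)

q, q_target, q_new = q_from_cdf(F), q_from_cdf(G), q_from_cdf(F_new)
# q_new equals q + alpha * (q_target - q): the SARSA update on expectations
```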
Linear Function Approximation
Loss Functions
Update Rule
Expanding the θ update from the previous slide gives the takeaway: if (1) the predicted distributions sum to one, and (2) the distance between bins in the distribution is one, then expectation equivalence holds.
Nonlinear Function Approximation
A constructed example shows that expectation equivalence doesn't always hold:
1. Start with expectation equivalence.
2. Fix a transition such that the target and prediction have the same expectation but different distributions.
3. Take a gradient step (using the gradient of the Cramér distance): the expectation of the predicted distribution changes but the Q-value does not, so expectation equivalence is broken.
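A toy illustration of the same phenomenon (this is my own construction, not the paper's example; it uses a softmax PMF parameterization rather than the paper's CDF representation): prediction and target share the same mean, yet one Cramér gradient step changes the predicted mean, whereas the corresponding expected-RL update would leave Q unchanged.

```python
import numpy as np

z = np.array([0.0, 1.0, 2.0])   # support atoms
alpha = 0.1

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def cramer_sq(p, q):
    """Squared Cramer distance on a shared unit-spaced support."""
    d = np.cumsum(p) - np.cumsum(q)
    return float(np.sum(d ** 2))

w = np.array([0.3, -0.2, 0.1])  # nonlinear (softmax) PMF parameters
m = float(z @ softmax(w))       # predicted mean

# target distribution with the *same* mean but a different shape:
# mass only on atoms 0 and 2, chosen so the mean is still m
target = np.array([1.0 - m / 2, 0.0, m / 2])

def num_grad(f, w, eps=1e-6):
    """Central-difference gradient, avoiding a hand-derived chain rule."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w); d[i] = eps
        g[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

w_new = w - alpha * num_grad(lambda w_: cramer_sq(softmax(w_), target), w)
m_new = float(z @ softmax(w_new))
# m_new differs from m: the gradient step changed the predicted expectation
# even though prediction and target agreed in expectation
```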
Nonlinear Function Approximation
Takeaways:
● This doesn't prove that equivalence breaks for all nonlinear functions.
● The gradient is taken w.r.t. the Cramér distance, which isn't the case in many successful algorithms (quantile distributional RL, for example, minimizes the Wasserstein distance).
● Expectation equivalence never breaks in the linear case, which might mean that the benefits of distributional RL seen in practice have to do with its interplay with nonlinear function approximation.
Recap: Sequence of Proofs
1. Tabular models: represent the distribution over returns at each (s, a) separately
   a. Model based (full knowledge of the transition model and policy):
      i. No constraint on the type of distribution used to model returns
      ii. Return distributions constrained to be categorical on a fixed support
   b. Sample based (SARSA-style updates, i.e., using only samples):
      i. No constraint on the type of distribution used to model returns
      ii. Return distributions constrained to be categorical on a fixed support
      iii. Semi-gradient update w.r.t. the CDF for distributional, compared to SARSA
      iv. Semi-gradient update w.r.t. the PDF for distributional, compared to SARSA (equivalence doesn't hold)
2. Linear approximation:
   a. Semi-gradient of the Cramér distance w.r.t. the CDF
3. Nonlinear approximation:
   a. There exists a nonlinear representation of the CDF such that initially we have equivalence but lose it after the first weight update.
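The sample-based categorical case (item 1.b.ii above) can be sketched in a few lines. Assuming the standard categorical (Cramér) projection of Rowland et al. [2] onto a fixed support and a mixture update (which is what the semi-gradient CDF update amounts to), the mean of the updated distribution matches the classical SARSA update exactly. The support, step size, and the two distributions below are illustrative choices.

```python
import numpy as np

support = np.linspace(0.0, 10.0, 11)  # fixed categorical support (illustrative)

def project(atoms, probs):
    # Cramér/categorical projection: split each atom's mass linearly between
    # its two nearest support points (mean-preserving for in-range atoms).
    vmin, vmax = support[0], support[-1]
    dz = support[1] - support[0]
    out = np.zeros_like(support)
    for x, p in zip(atoms, probs):
        b = (np.clip(x, vmin, vmax) - vmin) / dz
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            out[lo] += p
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out

p = np.zeros(11);  p[2], p[7] = 0.5, 0.5       # Z(s, a): current prediction
p2 = np.zeros(11); p2[0], p2[10] = 0.25, 0.75  # Z(s', a'): next-state distribution
r, gamma, alpha = 1.0, 0.9, 0.1

# Distributional semi-gradient (CDF) update = mixture with the projected target
target = project(r + gamma * support, p2)
p_new = (1 - alpha) * p + alpha * target

# Classical SARSA update on the means
Q, Q2 = p @ support, p2 @ support
sarsa = Q + alpha * (r + gamma * Q2 - Q)
print(p_new @ support, sarsa)  # equal: expectation equivalence
```

Because the projection preserves the mean of any target supported inside [vmin, vmax], the distributional update moves the expectation by exactly the SARSA TD step.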
Takeaways
1. In the cases where expectation equivalence was proven, there is nothing to gain from distributional RL in terms of expected return. For example:
   a. The variance of the expected return is the same in expected and distributional RL, since Var[E(Z)] = Var[Q].
   b. When using greedy methods, policy improvement steps are equivalent, since the expected value is the same for each action.
2. Distributional RL and expected RL are usually expectation-equivalent for tabular representations and linear function approximation.
3. Expectation equivalence doesn't always hold when using nonlinear function approximation.
Experimental Results: Tabular Case (12x12 Grid)
Compared Q-learning, distributional RL with CDF updates, and distributional RL with PDF updates, using the same random seed and ε-greedy action selection (so that, if expectation equivalence holds, all methods end with the same results).
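A toy stand-in for this experiment can be run in a few lines (a 5-state chain rather than the 12x12 grid, and a uniform-random behavior policy so both agents see identical trajectories by construction; all sizes, supports, and step sizes below are illustrative): expected Q-learning and the distributional agent with categorical CDF updates end with matching Q-values.

```python
import numpy as np

N_S, N_A, GAMMA, ALPHA = 5, 2, 0.9, 0.1
SUPPORT = np.linspace(0.0, 10.0, 51)

def step(s, a):
    # 5-state chain: action 1 moves right, action 0 moves left;
    # reward 1 (and episode end) on reaching the rightmost state.
    s2 = min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == N_S - 1), s2 == N_S - 1

def project(atoms, probs):
    # Cramér/categorical projection onto the fixed support
    # (mean-preserving for atoms inside the support range).
    dz = SUPPORT[1] - SUPPORT[0]
    out = np.zeros_like(SUPPORT)
    for x, p in zip(atoms, probs):
        b = (np.clip(x, SUPPORT[0], SUPPORT[-1]) - SUPPORT[0]) / dz
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            out[lo] += p
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out

def run(distributional, episodes=100, seed=0):
    rng = np.random.RandomState(seed)
    dist = np.zeros((N_S, N_A, len(SUPPORT)))
    dist[:, :, 0] = 1.0                      # all mass at 0, so E[Z] = 0
    Q = np.zeros((N_S, N_A))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randint(N_A)             # uniform-random behavior policy
            s2, r, done = step(s, a)
            if distributional:
                a2 = int(np.argmax(dist[s2] @ SUPPORT))
                target = (project([r], [1.0]) if done
                          else project(r + GAMMA * SUPPORT, dist[s2, a2]))
                dist[s, a] = (1 - ALPHA) * dist[s, a] + ALPHA * target
            else:
                q2 = 0.0 if done else np.max(Q[s2])
                Q[s, a] += ALPHA * (r + GAMMA * q2 - Q[s, a])
            s = s2
    return dist @ SUPPORT if distributional else Q

q_dist = run(True)
q_exp = run(False)
print(np.allclose(q_dist, q_exp))
```

Since both agents consume the random stream identically and the projected mixture update is mean-preserving, the two learned value tables agree up to floating-point drift, mirroring the paper's tabular result.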
Experimental Results: Linear Approximation
[Plots: learning curves on Cart Pole and Acrobot]
Experimental Results: Nonlinear Approximation
[Plots: learning curves on Cart Pole and Acrobot]
Limitations
● Their results all hold for minimizing the Cramér distance, but possibly not for other metrics used in some successful distributional RL algorithms (Wasserstein, cross-entropy).
● The algorithm used throughout their proofs doesn't seem to lead to quality results in practice.
● Even though [1] argues that the Cramér distance fixes limitations of the Wasserstein distance for distributional RL, the empirical results here don't convey this.
Open Questions
1. What happens in deep neural networks that benefits most from the distributional perspective?
2. Is there a regularizing effect from modeling a distribution instead of just the expected value?
Questions
1. Derive:
2. What is one of the major benefits of the Cramér projection?
3. What are some possible reasons for the performance improvement of distributional RL over expected-value RL when using nonlinear function approximation?
References
[1] Bellemare, M. G.; Danihelka, I.; Dabney, W.; Mohamed, S.; Lakshminarayanan, B.; Hoyer, S.; and Munos, R. 2017. The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743.
[2] Rowland, M.; Bellemare, M.; Dabney, W.; Munos, R.; and Teh, Y. W. 2018. An analysis of categorical distributional reinforcement learning. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 of Proceedings of Machine Learning Research, 29–37. PMLR.
[3] Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In ICML.
[4] Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence.