[Page 1]
Rainbow - Combining Improvements in Deep Reinforcement Learning
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Mohan Zhang
[Page 2]
Problem settings
• The Markov Decision Process $\langle S, A, T, r, \gamma \rangle$
• $S$: states
• $A$: actions
• $T(s, a, s') = P[S_{t+1} = s' \mid S_t = s, A_t = a]$: the (stochastic) transition function
• $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$: the reward function
• $\gamma_t$: discount factor at time $t$
• Discounted return $G_t = \sum_{k=0}^{\infty} \gamma_t^{(k)} R_{t+k+1}$, where the discount factor $\gamma_t^{(k)} = \prod_{i=1}^{k} \gamma_{t+i}$
• (With a constant discount $\gamma_t = \gamma$, the product collapses to $\gamma_t^{(k)} = \gamma^k$, recovering the familiar $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.)
[Page 3]
Tons of tricks in a nutshell! Ready?
[Page 4]
Value-based RL
• $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ or $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
• Then, with the value (or Q value) as a proxy, we can derive the policy $\pi$ with an $\epsilon$-greedy argmax: take the max-value action with probability $1 - \epsilon$, or sample uniformly from the action space $A$ with probability $\epsilon$.
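As a minimal sketch (not from the slides), $\epsilon$-greedy selection over a vector of Q values might look like this in Python; the q_values list is a hypothetical stand-in for the network's output for one state:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action w.p. 1 - epsilon, else a uniform random action.

    q_values: list of estimated action values Q(s, a) for a fixed state s.
    """
    if random.random() < epsilon:
        # Explore: sample uniformly from the action space A.
        return random.randrange(len(q_values))
    # Exploit: take the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```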
[Page 5]
DQN
• A deep Q network (DQN) is a multi-layered neural network that, for a given state $s$, outputs a vector of action values $Q(s, \cdot\,; \theta)$, where $\theta$ are the parameters of the network.
• Target network: its parameters ($\theta^-$) are copied from the online network ($\theta$) every episode to make training more stable.
• Replay buffer (experience replay): transitions (states, actions, and rewards) are stored for some time and sampled uniformly from this memory bank to update the network. This prevents the DNN from overfitting to the current episode.
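Both ingredients are easy to sketch. Below is a hedged, framework-free illustration; the parameter dicts in sync_target are hypothetical stand-ins for real network weights:

```python
import collections
import random

# One stored transition: (S_t, A_t, R_{t+1}, S_{t+1}, done).
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size memory bank; uniform sampling decorrelates updates."""

    def __init__(self, capacity):
        self.memory = collections.deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling, as in vanilla DQN.
        return random.sample(self.memory, batch_size)

def sync_target(online_params, target_params):
    """Copy online parameters θ into the target parameters θ⁻ in place.

    Parameters here are a hypothetical dict of name -> value; with a real
    framework this would be a state-dict copy.
    """
    target_params.update(online_params)
```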
[Page 6]
Double Q-Learning
• Two separate value functions (DNNs in our case)
• Pick a batch of experience, then assign each experience randomly to one of the two DNNs to update it. After this, we get two sets of parameters, $\theta$ and $\theta'$.
• For each update, one network is used to determine the action greedily; the other is used to determine the Q value.
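A minimal sketch of that decoupled target, assuming q_a and q_b are hypothetical functions returning a list of action values for a state:

```python
def double_q_target(q_a, q_b, reward, next_state, gamma):
    """Double Q-learning target for updating estimator A.

    Estimator A (q_a) selects the greedy action; estimator B (q_b)
    evaluates it. Decoupling selection from evaluation reduces the
    overestimation bias of plain max-based Q-learning.
    """
    next_qs = q_a(next_state)
    best_action = max(range(len(next_qs)), key=lambda a: next_qs[a])
    return reward + gamma * q_b(next_state)[best_action]
```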
[Page 7]
From Double Q-Learning to Prioritized Replay
• For SGD, we use the temporal-difference (TD) error:
• $\Delta = R_{t+1} + \gamma_{t+1} \max_{a'} Q_{\theta^-}(S_{t+1}, a') - Q_{\theta}(S_t, A_t)$
• We perform gradient descent over $\theta$, then update $\theta^-$ at the beginning of every episode.
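A hedged one-function sketch of that error; q_online and q_target are hypothetical functions mapping a state to a list of action values for the online ($\theta$) and target ($\theta^-$) networks:

```python
def td_error(q_online, q_target, s, a, r, s_next, gamma):
    """Δ = R_{t+1} + γ_{t+1} · max_a' Q_θ⁻(S_{t+1}, a') − Q_θ(S_t, A_t).

    The bootstrap comes from the frozen target network, which keeps the
    regression target stable between syncs of θ⁻.
    """
    bootstrap = max(q_target(s_next))
    return r + gamma * bootstrap - q_online(s)[a]
```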
[Page 8]
Backup: vanilla Q-Learning
• 𝜋: policy
• 𝛾: discount factor
• R: reward
• S: state
• A: action
• Then, the optimal Q value satisfies the backup $Q^*(s_t, a_t) \equiv \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \right]$
[Page 9]
Backup: vanilla Q-Learning
• The optimal Q value can be learned by Q-learning.
• In most cases, we cannot store all action values in all states separately. So, we parametrize the Q value by $\theta$: $Q(s, a; \theta_t)$, which can be updated with SGD:
• $\theta_{t+1} = \theta_t + \alpha \left( Y_t^Q - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t)$
• $Y_t^Q \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t)$
• $Y_t^Q$ represents the target: the estimate of the optimal Q value given the best choice of action under the current $\theta$
• $\alpha$ is the learning rate
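The tabular special case of the same rule makes the role of $\alpha$ concrete. A hedged sketch, with Q as a hypothetical dict of (state, action) pairs:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One tabular Q-learning step toward Y_t = R_{t+1} + γ · max_a' Q(s', a').

    Q: hypothetical dict mapping (state, action) -> value estimate.
    actions: the actions available in s_next.
    """
    y = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    # Move the estimate a fraction α of the way toward the target.
    Q[(s, a)] = q_sa + alpha * (y - q_sa)
```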
[Page 10]
Prioritized Replay
• DQN samples uniformly from the replay buffer.
• Instead, we sample important transitions (those with high expected learning progress) more frequently.
• Sampling probability for a (traditional) transition $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$ is proportional to the absolute TD error: $p_t \propto |\Delta_t|^{\omega}$
• $\omega$ is a hyper-parameter that determines the shape of the distribution.
• Note that stochastic transitions might also be favored, even when there is little left to learn about them, because noise keeps their TD error high.
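A hedged sketch of sampling an index with $p_i \propto |\Delta_i|^{\omega}$; priorities is a hypothetical list of last-seen absolute TD errors, one per stored transition:

```python
import random

def sample_prioritized(priorities, omega):
    """Sample one buffer index with probability proportional to p_i^ω.

    omega = 0 recovers uniform sampling; larger omega concentrates
    sampling on high-error transitions.
    """
    weights = [p ** omega for p in priorities]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]
```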
[Page 11]
Dueling Networks
• Basically, add another stream to the network to evaluate action advantages.
• We evaluate the "goodness" of a state $s$ (on the edge or not) and the advantage of choosing an action $a$ (turn left or right).
• $v_\eta$: value of the state; $\eta$ are the parameters of the value stream.
• $a_\psi$: value of the action advantage; $\psi$ are the parameters of the advantage stream.
• $f_\xi$: the shared convolutional encoder (network).
• The streams are combined as $q_\theta(s, a) = v_\eta(f_\xi(s)) + a_\psi(f_\xi(s), a) - \frac{1}{|A|} \sum_{a'} a_\psi(f_\xi(s), a')$; subtracting the mean advantage keeps the decomposition identifiable.
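A hedged sketch of that forward pass, with f_xi, v_eta, and a_psi as hypothetical callables standing in for the three sub-networks:

```python
def dueling_q_values(state, f_xi, v_eta, a_psi):
    """Combine value and advantage streams into per-action Q values.

    f_xi:  shared encoder, state -> feature vector.
    v_eta: features -> scalar state value.
    a_psi: features -> list of per-action advantages.
    """
    phi = f_xi(state)
    v = v_eta(phi)
    adv = a_psi(phi)
    # Center the advantages so that value and advantage are identifiable.
    mean_adv = sum(adv) / len(adv)
    return [v + a - mean_adv for a in adv]
```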
[Page 12]
Multi-step Learning
• Truncated n-step return from a given state $S_t$: $R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}$
• Then, the multi-step variant of DQN is defined by minimizing the alternative loss: the same as before, just with $R_{t+1}$ replaced by $R_t^{(n)}$ (and the bootstrap taken at $S_{t+n}$).
• Multi-step targets with a suitably tuned $n$ often lead to faster learning.
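A hedged sketch of computing that truncated return plus the discounted bootstrap; rewards and gammas are hypothetical per-step lists matching the definitions above:

```python
def n_step_return(rewards, gammas, n, bootstrap):
    """R_t^(n) plus the discounted bootstrap value past the truncation.

    rewards:   [R_{t+1}, ..., R_{t+n}]
    gammas:    [γ_{t+1}, ..., γ_{t+n}]
    bootstrap: e.g. max_a' Q_θ⁻(S_{t+n}, a')
    """
    g, discount = 0.0, 1.0  # γ_t^(0) = 1 (empty product)
    for k in range(n):
        g += discount * rewards[k]
        discount *= gammas[k]  # accumulate γ_t^(k+1)
    return g + discount * bootstrap
```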
[Page 13]
Distributional RL
• Learn to approximate the distribution of returns instead of the expected return.
• We still maximize the expected sum of future rewards.
• New Bellman equation: $V^\pi(x) \equiv \mathbb{E}_{P^\pi}\left[ \sum_t \gamma^t R(x_t) \mid x_0 = x \right] = \mathbb{E}[R(x)] + \gamma\, \mathbb{E}_{x' \sim P^\pi}[V^\pi(x')]$
• The expectation over the future makes modeling even more complex! We use a random variable $Z$ to model the value distribution:
• $V^\pi(x) = \mathbb{E}[Z^\pi(x)] = \mathbb{E}[R(x) + \gamma Z^\pi(x')]$, where $x' \sim P^\pi(\cdot \mid x)$
• C51 measures this with a discrete distribution: simply replace the Q-value output in DQN with a softmax over 51 probabilities per action (more bins, better performance!)
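Greedy control still needs scalar Q values, which are recovered as the expectation over the fixed atoms. A hedged sketch, with probs and support as hypothetical names for the softmax output and the atom locations:

```python
def expected_q(probs, support):
    """Expected return from a categorical value distribution (C51-style).

    probs:   softmax output, one probability per atom (e.g., 51 of them).
    support: fixed atom locations z_1 < ... < z_51 on the return axis.
    Acting greedily w.r.t. this expectation recovers ordinary control.
    """
    return sum(p * z for p, z in zip(probs, support))
```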
[Page 14]
Distributional RL
• The distributional variant of Q-learning: construct a target distribution $d_t'$ on a new support, then update $d_t$ by minimizing the KL divergence $D_{KL}(d_t' \,\|\, d_t)$ between $d_t$ and the target $d_t'$.
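A hedged sketch of that objective, assuming the target distribution has already been projected onto the fixed support of $d_t$:

```python
import math

def kl_divergence(target_probs, pred_probs, eps=1e-8):
    """D_KL(d_t' || d_t) between target and predicted atom probabilities.

    Since the target is held fixed, minimizing this in the predicted
    probabilities is equivalent to minimizing the cross-entropy term.
    """
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(target_probs, pred_probs))
```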
[Page 15]
Noisy Nets
• In games where many actions must be executed to collect the first reward (Montezuma's Revenge), what do we do?
• Add noise to the network for better exploration!
• Over time, the network can learn to ignore the noisy stream, at different rates in different parts of the state space:
• self-annealing exploration
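A hedged, list-based sketch of one noisy linear layer in the NoisyNet spirit; all names and the independent-Gaussian noise choice are illustrative assumptions, not the slides' exact construction:

```python
import random

def noisy_linear(x, w, b, sigma_w, sigma_b):
    """y = (w + σ_w · ε_w) x + (b + σ_b · ε_b), with fresh noise per call.

    The σ parameters are learned alongside w and b; if training drives
    them toward zero in some region of state space, exploration there
    anneals away on its own.
    """
    out = []
    for i in range(len(b)):
        acc = b[i] + sigma_b[i] * random.gauss(0.0, 1.0)
        for j in range(len(x)):
            acc += (w[i][j] + sigma_w[i][j] * random.gauss(0.0, 1.0)) * x[j]
        out.append(acc)
    return out
```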
[Page 16]
Now, group together!
[Page 17]
Recap
[Page 18]
[Annotated equation: the final KL target to minimize, with each term labeled by the component it comes from: target net, double DQN, dueling network, prioritized replay, multi-step learning, distributional RL]
[Page 19]
Experiments:
• “All Rainbow’s components have a number of hyper-parameters. The combinatorial space of hyper-parameters is too large for an exhaustive search, therefore we have performed limited tuning.”
[Page 20]
Experiments:
• Double DQN is redundant?
• Is it just useless, or is its functionality shadowed by the combination of the other tricks?
[Page 21]
• Pros:
All tricks together, SOTA performance!
A good base on which to build your own algorithms
• Cons:
Hard to tune, hard to implement
No clue how to make it more efficient
[Page 22]
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
[Page 23]
IMPALA
[Page 24]
V-trace correction
• Truncated importance weights $\rho_t = \min\!\left(\bar\rho,\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)$ correct for the lag between the actors' behaviour policy $\mu$ and the learner's target policy $\pi$: if $\mu > \pi$ for an action, its transition is down-weighted.
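A sketch of the n-step V-trace target recursion from the IMPALA paper (Espeholt et al., 2018); the function name and plain-list interface are illustrative assumptions:

```python
def v_trace_targets(values, rewards, rhos, gamma, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s backwards over an n-step trajectory.

    values:  [V(x_s), ..., V(x_{s+n})]  (length n+1, includes bootstrap)
    rewards: [r_s, ..., r_{s+n-1}]
    rhos:    importance ratios π(a_t | x_t) / μ(a_t | x_t)
    Truncating the ratios at rho_bar / c_bar keeps variance bounded;
    ratios below 1 (μ more likely than π) pass through unchanged.
    """
    n = len(rewards)
    vs = list(values)
    acc = 0.0  # accumulates v_{t+1} - V(x_{t+1})
    for t in reversed(range(n)):
        rho_t = min(rho_bar, rhos[t])
        c_t = min(c_bar, rhos[t])
        delta = rho_t * (rewards[t] + gamma * values[t + 1] - values[t])
        acc = delta + gamma * c_t * acc
        vs[t] = values[t] + acc
    return vs[:n]
```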
[Page 25]
My Questions:
• Where were they from?
[Page 26]
• What is the most important contribution of IMPALA?
(hint: distributed)