Deep Q-Learning
A Reinforcement Learning approach
What is Reinforcement Learning?
- Much like how biological agents learn to behave
- No supervisor, only a reward signal
- Data is time-dependent (non-i.i.d.)
- Feedback is delayed
- The agent's actions affect the data it subsequently receives
Examples
- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- RoboCup Soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)
Videos
Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward.
- Defeat the world champion at Go: +R / -R for winning / losing a game
- Make a robot walk: +R for moving forward, -R for falling over
- Play ATARI games: +R / -R for increasing / decreasing the score
- Control a helicopter: +R / -R for following the trajectory / crashing
Agent and Environment
At each time step the agent receives an observation and a reward from the environment and executes an action; the environment receives the action and emits the next observation and reward.
Fully and Partially Observable Environments
Fully observable environments (agent state = environment state):
- The agent directly observes the environment state
- Example: a chess board

Partially observable environments (agent state ≠ environment state):
- The agent indirectly observes the environment
- Example: a robot with a motion sensor or a camera
- The agent must construct its own state representation
RL components: Policy and Value Function
A policy is the agent's behaviour function:
- Maps from state to action
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$

A value function is a prediction of future reward:
- Used to evaluate states and to select between actions
- $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$

A computational sketch of this discounted sum follows below.
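For intuition, the discounted sum inside $v_\pi$ can be computed backwards over a finite list of recorded rewards. A minimal sketch (the episode rewards and $\gamma$ below are made-up values):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    Computed right-to-left in a single pass.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A made-up episode: two steps of no reward, then a reward of 1.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # -> 0.81
```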
Model
A model predicts what the environment will do next:
- Transitions: $\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ predicts the next state
- Rewards: $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ predicts the next (immediate) reward
Maze example: r = -1 per time-step, and a policy mapping each state to an action
[David Silver. Advanced Topics: RL]
Maze example: Value function and Model
[David Silver. Advanced Topics: RL]
Exploration - Exploitation dilemma
The agent must trade off exploiting the action that currently looks best against exploring other actions that might turn out to be better. A common heuristic is ε-greedy action selection, sketched below.
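A minimal sketch of ε-greedy selection (the q_values list is illustrative, not from the slides):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore (uniform random action);
    otherwise exploit (pick an action with the highest Q-value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Illustrative Q-values for a 3-action state.
action = epsilon_greedy([0.2, 1.5, -0.3], epsilon=0.1)
```

Annealing ε from 1.0 towards a small value over training is a common way to shift gradually from exploration to exploitation.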
Math: Markov Decision Process (MDP)
Almost all RL problems can be formalised as MDPs.

An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function: $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma \in [0, 1]$ is a discount factor
State-Value and Action-Value functions, Bellman equations
The state-value function $v_\pi(s)$ is the expected return starting from state $s$ and then following policy $\pi$:

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$

The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

Both decompose recursively into Bellman expectation equations:

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
Finding an Optimal Policy
- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function $v_*(s)$
- All optimal policies achieve the optimal action-value function $q_*(s, a)$

All you need is to find $q_*(s, a)$: acting greedily with respect to it, $\pi_*(s) = \arg\max_a q_*(s, a)$, is optimal.
Bellman Optimality Equation for the state-value function:

$v_*(s) = \max_a q_*(s, a)$

[David Silver. Advanced Topics: RL]

Bellman Optimality Equation for the action-value function:

$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s')$

[David Silver. Advanced Topics: RL]

Substituting each equation into the other gives the recursive forms:

$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \right)$

$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$

[David Silver. Advanced Topics: RL]
Policy Iteration Demo
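In the same spirit as the demo, here is a compact policy-iteration sketch over a tabular MDP, alternating Bellman-expectation evaluation with greedy improvement. The array encoding of P and R is an assumption for illustration and is not tied to the demo's code:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Policy iteration on a tabular MDP.

    P: transition probabilities, shape (S, A, S), rows summing to 1.
    R: expected immediate rewards, shape (S, A).
    (This array encoding is an illustrative assumption.)
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    while True:
        # 1. Policy evaluation: iterate the Bellman expectation backup.
        while True:
            V_new = R[np.arange(S), policy] + gamma * P[np.arange(S), policy] @ V
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # 2. Policy improvement: act greedily with respect to q(s, a).
        q = R + gamma * P @ V          # shape (S, A)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V           # a stable greedy policy is optimal
        policy = new_policy

# Tiny made-up 2-state, 2-action MDP: each state is rewarded for
# switching to the other state, so the optimal policy bounces.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(policy_iteration(P, R))  # policy [1, 0], V ~ [10, 10]
```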
Q-Learning: a model-free, off-policy control algorithm
Model-free (vs model-based):
- The MDP model is unknown, but experience can be sampled
- Or the MDP model is known, but is too big to use except through samples

Off-policy (vs on-policy):
- Can learn about the target policy from experience sampled from some other (behaviour) policy

Control (vs prediction):
- Find the best policy, rather than evaluating a given one
Q-Learning
Update the current estimate towards the greedy one-step target:

$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$

[David Silver. Advanced Topics: RL]
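In code, this update is one line inside the interaction loop. A minimal tabular sketch, assuming a hypothetical Gym-style env with reset() and step() and a known action count n_actions:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control).

    `env` is a hypothetical Gym-style environment:
    reset() -> state, step(a) -> (next_state, reward, done).
    """
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy over current estimates.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = env.step(a)
            # Target policy: greedy (the max), hence "off-policy".
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```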
DQN: Q-Learning with function approximation
Represent the action-value function by a deep neural network with weights $\theta$, $Q(s, a; \theta) \approx q_*(s, a)$, and train it to minimise the squared TD error.
[Human-level control through deep reinforcement learning]
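For reference, the network described in the paper cited above maps a stack of the four most recent 84×84 frames to one Q-value per action. A PyTorch sketch of that architecture (layer sizes follow the paper; treat it as an illustration rather than the authors' code):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Conv net from 4 stacked 84x84 grayscale frames to one Q-value
    per action (layer sizes as described in the Nature DQN paper)."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):  # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.net(x)
```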
Issues with Q-Learning with a neural network
- Data is sequential (non-i.i.d.): successive samples are correlated
- Policy changes rapidly with slight changes to Q-values
  - Policy may oscillate
  - Distribution of experience swings from one extreme to another
- Scale of rewards and Q-values is unknown
  - Backpropagation can be unstable due to large gradients
DQN solutions (a combined sketch follows this list):
- Use experience replay
  - Breaks correlations in the data
  - Learns from all past policies
  - Uses off-policy Q-learning
- Freeze the target Q-network
  - Avoids policy oscillations
  - Breaks the correlation between the Q-network and its target
- Clip rewards and gradients to a sensible range
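Putting the three fixes together, one training step might look like the following sketch (PyTorch; the buffer layout, clipping range, and Huber loss are common choices for implementing these ideas, not a transcription of the paper's code):

```python
import random
from collections import deque, namedtuple
import torch
import torch.nn.functional as F

Transition = namedtuple("Transition", "s a r s2 done")

class ReplayBuffer:
    """Uniform experience replay: sampling old transitions at random
    breaks the temporal correlations in the stream of experience."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)
    def push(self, *args):
        self.buf.append(Transition(*args))
    def sample(self, n):
        return random.sample(self.buf, n)

def train_step(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    batch = buffer.sample(batch_size)
    s    = torch.stack([t.s for t in batch])
    a    = torch.tensor([t.a for t in batch])
    r    = torch.tensor([t.r for t in batch]).clamp(-1.0, 1.0)  # reward clipping
    s2   = torch.stack([t.s2 for t in batch])
    done = torch.tensor([float(t.done) for t in batch])

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # frozen target network: no gradients flow through it
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)  # Huber loss keeps gradients bounded

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (e.g. every few thousand steps) re-freeze the target:
# target_net.load_state_dict(q_net.state_dict())
```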
Neon Demo
Links
- Paper: Human-level control through deep reinforcement learning
- Course: David Silver. Advanced Topics: RL
- Tutorial: David Silver. Deep Reinforcement Learning
- Book: Sutton, Barto. Reinforcement Learning
- Source code: simple_dqn
- Reinforcejs
- The Arcade Learning Environment