Hardware Acceleration for Data Processing – Fall 2017
14.11.2017
Asynchronous Methods for
Deep Reinforcement Learning
Lavrenti Frobeen
▪ Set of states S
▪ Set of actions A

The agent observes s_t ∈ S and…
…takes action a_t ∈ A according to its policy…
…receiving a new state s_{t+1} ∈ S and a reward r_{t+1} ∈ ℝ.

▪ Total reward: R_t = ∑_{k=0}^{∞} γ^k r_{t+k},  γ ∈ (0, 1)
Ingredients of Reinforcement Learning
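The discounted total reward can be computed with a simple backward recursion; a minimal sketch (the function name and example rewards are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Total reward R_t = sum_{k=0}^inf gamma^k * r_{t+k},
    computed backwards over a finite episode."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# A single reward of 1 arriving two steps in the future,
# discounted with gamma = 0.5, is worth 0.5^2 = 0.25 now.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```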
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment /
Actions influence training data
What makes RL difficult?
Idea: Tabulate the "quality" of state–action pairs:

Q^π(s, a) = 𝔼[R_t | s_t = s, a_t = a, π]

Choose the action that maximizes Q (ε-greedily).

For an optimal policy it holds that:

Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else

Solving RL: Q-Learning (1)

|         | Action 1 | … | Action m |
|---------|----------|---|----------|
| State 1 | 1        | … | 0.5      |
| State 2 | 0        | … | 0.6      |
| …       | …        | … | …        |
| State n | 0.7      | … | 2        |
Iterate two steps until convergence:
1. Choose the action that maximizes Q (ε-greedily)
2. Update the table according to

Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else

Solving RL: Q-Learning (2) [Watkins 1989]

|         | Action 1 | … | Action m |
|---------|----------|---|----------|
| State 1 | 1        | … | 0.5      |
| State 2 | 0        | … | 0.6      |
| …       | …        | … | …        |
| State n | 0.7      | … | 2        |
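The iterate-and-update loop can be sketched as a tabular backup; a minimal illustration (the dict-based table and helper names are assumptions, not from the slides):

```python
import random

def q_update(Q, s, a, r, s_next, actions, gamma, terminal=False):
    """Bellman backup: overwrite Q(s, a) with r if s_next is terminal,
    otherwise with r + gamma * max_a' Q(s_next, a')."""
    if terminal:
        Q[(s, a)] = r
    else:
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

def epsilon_greedy(Q, s, actions, eps):
    """Explore with probability eps, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("s1", "a"): 0.0, ("s1", "b"): 0.0, ("s2", "a"): 2.0, ("s2", "b"): 0.0}
q_update(Q, "s1", "a", 1.0, "s2", ["a", "b"], gamma=0.5)
print(Q[("s1", "a")])  # 1 + 0.5 * 2 = 2.0
```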
Problem: Q-Tables Do Not Scale

|              | Action 1 | … | Action 361 |
|--------------|----------|---|------------|
| State 1      | 1        | … | 0.5        |
| State 2      | 0        | … | 0.6        |
| …            | …        | … | …          |
| State 10^170 | 0.7      | … | 2          |

Input State → Q-values
Reinforcement Learning with Neural Networks
Input State → Representation → Q-values

Q(s, a, θ) ≈ Q*(s, a)
Reinforcement Learning with Deep Neural Networks [Mnih et al. 2015]

Input State → Representation → Q-values

Figure taken from "Human-level control through deep reinforcement learning" (Mnih et al. 2015)
The network is parameterized by θ.
θ can be trained by minimizing the loss

L(θ) = 𝔼[(target Q − estimated Q)²]
     = 𝔼[(r + γ max_{a′} Q(s′, a′, θ) − Q(s, a, θ))²]

Deep Q-Learning

(Diagram: state s and the successor state s′ (after action a) are fed through the network; Q(s, a, θ), max_{a′} Q(s′, a′, θ) and the reward r for action a combine into the gradient ∂L/∂θ.)
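A minimal sketch of this loss over a batch of transitions (the `q` callable standing in for the network, and all other names, are illustrative assumptions):

```python
def dqn_loss(batch, q, gamma):
    """Mean squared TD error over (s, a, r, s_next, done) transitions.
    q(s) returns a list of Q-values, one entry per action."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # Target: just r at episode end, otherwise r + gamma * max_a' Q(s', a').
        target = r if done else r + gamma * max(q(s_next))
        total += (target - q(s)[a]) ** 2
    return total / len(batch)

# Toy check with a constant "network" that always outputs [0, 1]:
q = lambda s: [0.0, 1.0]
print(dqn_loss([("s", 0, 1.0, "s2", False)], q, gamma=0.5))  # (1.5 - 0)^2 = 2.25
```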
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect Q-function globally

Issues with Deep Q-Learning

(Diagram: the same network θ produces both the target max_{a′} Q(s′, a′, θ) and the estimate Q(s, a, θ), so every gradient step ∂L/∂θ also moves its own target.)
Breaking Feedback Loops

(Diagram: a separate target network with frozen parameters θ⁻ computes max_{a′} Q(s′, a′, θ⁻), while the trained network (θ) computes Q(s, a, θ); only θ receives the gradient ∂L/∂θ.)
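One way to sketch this split between trained and target parameters (the class layout, names and the plain SGD step are illustrative assumptions):

```python
class DQNWithTarget:
    """Trained parameters theta plus a frozen copy theta_minus,
    refreshed every `sync_every` gradient steps."""
    def __init__(self, theta, sync_every):
        self.theta = list(theta)         # receives gradients
        self.theta_minus = list(theta)   # used only to compute targets
        self.sync_every = sync_every
        self.steps = 0

    def apply_gradient(self, grad, lr):
        self.theta = [t - lr * g for t, g in zip(self.theta, grad)]
        self.steps += 1
        if self.steps % self.sync_every == 0:   # periodic copy theta -> theta-
            self.theta_minus = list(self.theta)

net = DQNWithTarget([1.0], sync_every=2)
net.apply_gradient([1.0], lr=0.1)   # theta = [0.9], theta_minus still [1.0]
net.apply_gradient([1.0], lr=0.1)   # theta ≈ [0.8], theta_minus now synced
```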
Breaking Feedback Loops

(Diagram: periodically, the trained parameters θ are copied into the target network θ⁻.)
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect Q-function globally

Issues with Deep Q-Learning
Problem
Learning from new experiences might override previous progress!

Idea
Store state transitions during exploration.
Sample from past experience during learning to maintain diversity.

Experience Replay

(Diagram: stored transitions such as (a = f, r = 0), (a = f, r = 1), (a = b, r = 0) are sampled to compute ∂L/∂θ.)
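A minimal replay-buffer sketch (the capacity, naming and uniform sampling are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; the learner samples uniformly
    from past experience instead of using only the newest transition."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=2)
for t in range(3):
    buf.store(t, "f", 0.0, t + 1, False)
# Capacity 2: the oldest transition (t = 0) has been evicted.
```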
Problem
Rewards propagate through the DQN slowly!

Idea
Take multiple actions in sequence before updating the network.

N-Step Methods
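The resulting n-step target can be sketched as a backward recursion (function name and the example numbers are illustrative):

```python
def n_step_target(rewards, gamma, bootstrap):
    """n-step target: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * bootstrap, where bootstrap is e.g. max_a Q at the
    state reached after n steps."""
    target = bootstrap
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Two steps, gamma = 0.5, bootstrap value 2:
# 0 + 0.5 * (1 + 0.5 * 2) = 1.0
print(n_step_target([0.0, 1.0], 0.5, 2.0))  # 1.0
```

With n > 1, a reward observed several steps ahead reaches Q(s_t, a_t) in a single update instead of percolating backwards one state per update.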
Parallel Deep Q-Learning

(Diagram: a parameter server holds θ and shares it with Agent 1, Agent …, Agent N.)
▪ Distributed version of DQN
▪ (Then) state-of-the-art results on the Atari domain
▪ Training takes days on 100+ machines

Massively Parallel Methods for Deep RL [Nair et al. 2015]

Figure taken from "Massively Parallel Methods for Deep Reinforcement Learning" (Nair et al. 2015)
Going HOGWILD!

(Diagram: on a single machine, Agent/Thread 1, Agent/Thread …, Agent/Thread N share the parameters θ in RAM and update them without locks.)
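A minimal lock-free sketch using Python threads (illustrative: the constant stand-in gradient and all names are assumptions; the point of HOGWILD!-style updates is that the occasional lost write from a race is tolerated):

```python
import threading

theta = [0.0] * 4  # parameters shared by all threads, no lock

def worker(n_updates, lr=0.01):
    for _ in range(n_updates):
        grad = [0.1] * len(theta)   # stand-in for a locally computed gradient
        for i, g in enumerate(grad):
            theta[i] -= lr * g      # unsynchronized write, HOGWILD!-style

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# theta has moved towards -0.4 per entry; a race may lose an update
# now and then, which HOGWILD! accepts by design.
```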
Asynchronous one-step Q-learning [Mnih et al. 2016]

(Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ⁻; accumulated gradients are applied to θ every 5 frames, and θ is copied into θ⁻ every 40000 frames.)
Experience Replay was scrapped!

Intuition
Parallel learners explore different paths anyway!
Parallelism has a similar effect to experience replay.
Diversity can be enforced by different exploration policies.

What about Experience Replay?

(Diagram: experience replay samples stored transitions such as (a = f, r = 0), (a = f, r = 1) to compute ∂L/∂θ; with actor parallelism, Agent/Thread 1 … Agent/Thread N instead each contribute their own ∂L/∂θ.)
The policy is parametrized directly as

π(a_t | s_t; θ)

and optimized by gradient ascent using

∇_θ log π(a_t | s_t; θ) R_t

which is an unbiased estimate of

∇_θ 𝔼[R_t]

Policy Gradient Methods

(Diagram: the network maps the state to a distribution π(a_t | s_t; θ) over the actions a_1, a_2, a_3, …, a_n.)
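For a softmax policy over action preferences, ∇_θ log π(a) has the closed form one-hot(a) − π; a minimal sketch of one gradient-ascent step (all names are illustrative):

```python
import math

def softmax(prefs):
    m = max(prefs)                        # subtract max for stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def log_pi_grad(theta, a):
    """d/d theta of log softmax(theta)[a] = one_hot(a) - pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]

def reinforce_step(theta, a, R, lr):
    """Ascend grad log pi(a_t|s_t; theta) * R_t."""
    return [t + lr * R * g for t, g in zip(theta, log_pi_grad(theta, a))]

theta = reinforce_step([0.0, 0.0], a=0, R=1.0, lr=0.1)
# Action 0 was rewarded, so its preference rose: theta ≈ [0.05, -0.05]
```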
Advantage Actor-Critic

∇_θ log π(a_t | s_t; θ) (R_t − V(s_t, θ_v))

Note:
R_t depends on the action taken a_t.
V(s_t, θ_v) depends on the state only.
R_t − V(s_t, θ_v) is called the advantage of a_t.

Asynchronous Advantage Actor-Critic (A3C)

(Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ_v; the network outputs both π(a_t | s_t; θ) and V(s_t, θ_v).)
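The update above can be sketched as a pair of gradients, one for the actor and one for the critic (the function shapes and names are illustrative assumptions):

```python
def a2c_grads(log_pi_grad, v, R):
    """Advantage actor-critic gradients for one transition.
    log_pi_grad: grad of log pi(a_t | s_t; theta), as a list.
    v: the critic's estimate V(s_t, theta_v).
    R: the observed return R_t."""
    adv = R - v                                   # advantage of a_t
    policy_grad = [adv * g for g in log_pi_grad]  # scale by the advantage
    value_grad = -2.0 * adv                       # d/dv of (R - v)^2
    return policy_grad, value_grad

pg, vg = a2c_grads([0.5, -0.5], v=1.0, R=3.0)  # advantage = 2
# pg = [1.0, -1.0]; vg = -4.0 (the critic is pushed towards R)
```

Using the advantage instead of the raw return leaves the estimate unbiased (the baseline V does not depend on a_t) while reducing its variance.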
Asynchronous Advantage Actor-Critic (A3C) [Mnih et al. 2016]

(Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ_v and asynchronously apply their gradients for π(a_t | s_t; θ) and V(s_t, θ_v).)
Evaluation on the Atari Domain
Normalized scores on 57 Atari games [Mnih et al. 2016]
Evaluation: Asynchronous Methods vs. DQN

Averaged scores of different algorithms for 5 games over time [Mnih et al. 2016]
(Super-) Linear Speedup
Average speedup of different algorithms for different thread counts [Mnih et al. 2016]

Average speedup of A3C for different thread counts [Mnih et al. 2016]
Evaluation with TORCS
Averaged scores of different algorithms on the TORCS racing simulator [Mnih et al. 2016]
Stability & Robustness
Scores of A3C with different learning rates for 5 games [Mnih et al. 2016]
Stability & Robustness

Scores of A3C with different learning rates for 5 games [Mnih et al. 2016]

Raw scores of different algorithms for a selection of Atari games [Mnih et al. 2016]
▪ Deep reinforcement learning is hard
▪ It requires techniques like experience replay
▪ Deep RL is easily parallelizable
▪ Parallelism can replace experience replay
▪ Dropping experience replay allows on-policy methods like actor-critic
▪ A3C surpasses state-of-the-art performance

Takeaway Message
QUESTIONS?
One-step Q-learning vs. N-step Q-learning
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 0        |
| Goal   | –         | –        |

Hyperparameters: ε = 0.999, γ = 0.5
Next Action: Go Forwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 0        |
| Goal   | –         | –        |

Previous Action: Go Forwards
No reward
Next Action: Go Forwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 1        |
| Goal   | –         | –        |

Previous Action: Go Forwards
Reward: +1
Update the table according to
Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 1        |
| Goal   | –         | –        |

The game has restarted!
Next Action: Go Forwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0.5      |
| Street | 0         | 1        |
| Goal   | –         | –        |

Previous Action: Go Forwards
No reward, but a rewarding follow-up action
Update the table according to
Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else
Next Action: Go Backwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0.5      |
| Street | 0.25      | 1        |
| Goal   | –         | –        |

Previous Action: Go Backwards
No reward, but a rewarding follow-up action
Update the table according to
Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else
Next Action: Go Backwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0.25      | 0.5      |
| Street | 0.25      | 1        |
| Goal   | –         | –        |

Done
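The whole walkthrough can be replayed in a few lines. This sketch assumes the transition structure implied by the slides (Forwards: Start → Street → Goal with reward 1 at the terminal Goal; Backwards: back to, or staying at, Start) and hard-codes the action sequence shown above:

```python
GAMMA = 0.5
Q = {("Start", "Backwards"): 0.0, ("Start", "Forwards"): 0.0,
     ("Street", "Backwards"): 0.0, ("Street", "Forwards"): 0.0}

def step(s, a):
    """(reward, next_state, terminal) under the assumed transition model."""
    if s == "Start":
        return (0.0, "Street", False) if a == "Forwards" else (0.0, "Start", False)
    return (1.0, "Goal", True) if a == "Forwards" else (0.0, "Start", False)

def update(s, a):
    r, s2, terminal = step(s, a)
    if terminal:
        Q[(s, a)] = r
    else:
        Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ("Backwards", "Forwards"))

# The action sequence from the slides (the game restarts after reaching Goal):
for s, a in [("Start", "Forwards"), ("Street", "Forwards"),   # episode 1
             ("Start", "Forwards"), ("Street", "Backwards"),  # episode 2
             ("Start", "Backwards")]:
    update(s, a)

# Reproduces the final table: Start 0.25 / 0.5, Street 0.25 / 1.
```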