Hardware Acceleration for Data Processing – Fall 2017
14.11.2017
Asynchronous Methods for
Deep Reinforcement Learning
Lavrenti Frobeen
▪ Set of states S
▪ Set of actions A

The agent observes s_t ∈ S and…
…takes action a_t ∈ A according to its policy…
…receiving a new state s_{t+1} ∈ S and a reward r_{t+1} ∈ ℝ.

▪ Total reward: R_t = ∑_{k=0}^{∞} γ^k r_{t+k},  γ ∈ (0, 1)
Ingredients of Reinforcement Learning
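The discounted total reward can be computed with a simple backward recursion; a minimal sketch (the function name and example rewards are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Total reward R_t = sum_{k=0}^inf gamma^k * r_{t+k},
    computed backwards over a finite episode."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# A single reward of 1 arriving two steps in the future,
# discounted with gamma = 0.5, is worth 0.5^2 = 0.25 now.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```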
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment /
Actions influence training data
What makes RL difficult?
Idea: Tabulate the "quality" of state–action pairs:

Q^π(s, a) = 𝔼[R_t | s_t = s, a_t = a, π]

Choose the action that maximizes Q (ε-greedily).

For an optimal policy it holds that:

Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else

Solving RL: Q-Learning (1)

|         | Action 1 | … | Action m |
|---------|----------|---|----------|
| State 1 | 1        | … | 0.5      |
| State 2 | 0        | … | 0.6      |
| …       | …        | … | …        |
| State n | 0.7      | … | 2        |
Iterate two steps until convergence:
1. Choose the action that maximizes Q (ε-greedily)
2. Update the table according to

Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else

Solving RL: Q-Learning (2) [Watkins 1989]

|         | Action 1 | … | Action m |
|---------|----------|---|----------|
| State 1 | 1        | … | 0.5      |
| State 2 | 0        | … | 0.6      |
| …       | …        | … | …        |
| State n | 0.7      | … | 2        |
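The iterate-and-update loop can be sketched as a tabular backup; a minimal illustration (the dict-based table and helper names are assumptions, not from the slides):

```python
import random

def q_update(Q, s, a, r, s_next, actions, gamma, terminal=False):
    """Bellman backup: overwrite Q(s, a) with r if s_next is terminal,
    otherwise with r + gamma * max_a' Q(s_next, a')."""
    if terminal:
        Q[(s, a)] = r
    else:
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

def epsilon_greedy(Q, s, actions, eps):
    """Explore with probability eps, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("s1", "a"): 0.0, ("s1", "b"): 0.0, ("s2", "a"): 2.0, ("s2", "b"): 0.0}
q_update(Q, "s1", "a", 1.0, "s2", ["a", "b"], gamma=0.5)
print(Q[("s1", "a")])  # 1 + 0.5 * 2 = 2.0
```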
Problem: Q-Tables Do Not Scale

|              | Action 1 | … | Action 361 |
|--------------|----------|---|------------|
| State 1      | 1        | … | 0.5        |
| State 2      | 0        | … | 0.6        |
| …            | …        | … | …          |
| State 10^170 | 0.7      | … | 2          |

Input State → Q-values
Reinforcement Learning with Neural Networks
Input State → Representation → Q-values

Q(s, a, θ) ≈ Q*(s, a)
Reinforcement Learning with Deep Neural Networks [Mnih et al. 2015]

Input State → Representation → Q-values

Figure taken from "Human-level control through deep reinforcement learning" (Mnih et al. 2015)
The network is parameterized by θ.
θ can be trained by minimizing the loss

L(θ) = 𝔼[(target Q − estimated Q)²]
     = 𝔼[(r + γ max_{a′} Q(s′, a′, θ) − Q(s, a, θ))²]

Deep Q-Learning

(Diagram: state s and the successor state s′ (after action a) are fed through the network; Q(s, a, θ), max_{a′} Q(s′, a′, θ) and the reward r for action a combine into the gradient ∂L/∂θ.)
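A minimal sketch of this loss over a batch of transitions (the `q` callable standing in for the network, and all other names, are illustrative assumptions):

```python
def dqn_loss(batch, q, gamma):
    """Mean squared TD error over (s, a, r, s_next, done) transitions.
    q(s) returns a list of Q-values, one entry per action."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # Target: just r at episode end, otherwise r + gamma * max_a' Q(s', a').
        target = r if done else r + gamma * max(q(s_next))
        total += (target - q(s)[a]) ** 2
    return total / len(batch)

# Toy check with a constant "network" that always outputs [0, 1]:
q = lambda s: [0.0, 1.0]
print(dqn_loss([("s", 0, 1.0, "s2", False)], q, gamma=0.5))  # (1.5 - 0)^2 = 2.25
```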
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect Q-function globally

Issues with Deep Q-Learning

(Diagram: the same network θ produces both the target max_{a′} Q(s′, a′, θ) and the estimate Q(s, a, θ), so every gradient step ∂L/∂θ also moves its own target.)
Breaking Feedback Loops

(Diagram: a separate target network with frozen parameters θ⁻ computes max_{a′} Q(s′, a′, θ⁻), while the trained network (θ) computes Q(s, a, θ); only θ receives the gradient ∂L/∂θ.)
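One way to sketch this split between trained and target parameters (the class layout, names and the plain SGD step are illustrative assumptions):

```python
class DQNWithTarget:
    """Trained parameters theta plus a frozen copy theta_minus,
    refreshed every `sync_every` gradient steps."""
    def __init__(self, theta, sync_every):
        self.theta = list(theta)         # receives gradients
        self.theta_minus = list(theta)   # used only to compute targets
        self.sync_every = sync_every
        self.steps = 0

    def apply_gradient(self, grad, lr):
        self.theta = [t - lr * g for t, g in zip(self.theta, grad)]
        self.steps += 1
        if self.steps % self.sync_every == 0:   # periodic copy theta -> theta-
            self.theta_minus = list(self.theta)

net = DQNWithTarget([1.0], sync_every=2)
net.apply_gradient([1.0], lr=0.1)   # theta = [0.9], theta_minus still [1.0]
net.apply_gradient([1.0], lr=0.1)   # theta ≈ [0.8], theta_minus now synced
```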
Breaking Feedback Loops

(Diagram: periodically, the trained parameters θ are copied into the target network θ⁻.)
▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence training data
▪ Feedback loops
▪ Updates affect Q-function globally

Issues with Deep Q-Learning
Problem
Learning from new experiences might override previous progress!

Idea
Store state transitions during exploration.
Sample from past experience during learning to maintain diversity.

Experience Replay

(Diagram: stored transitions such as (a = f, r = 0), (a = f, r = 1), (a = b, r = 0) are sampled to compute ∂L/∂θ.)
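A minimal replay-buffer sketch (the capacity, naming and uniform sampling are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; the learner samples uniformly
    from past experience instead of using only the newest transition."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=2)
for t in range(3):
    buf.store(t, "f", 0.0, t + 1, False)
# Capacity 2: the oldest transition (t = 0) has been evicted.
```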
Problem
Rewards propagate through the DQN slowly!

Idea
Take multiple actions in sequence before updating the network.

N-Step Methods
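The resulting n-step target can be sketched as a backward recursion (function name and the example numbers are illustrative):

```python
def n_step_target(rewards, gamma, bootstrap):
    """n-step target: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * bootstrap, where bootstrap is e.g. max_a Q at the
    state reached after n steps."""
    target = bootstrap
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Two steps, gamma = 0.5, bootstrap value 2:
# 0 + 0.5 * (1 + 0.5 * 2) = 1.0
print(n_step_target([0.0, 1.0], 0.5, 2.0))  # 1.0
```

With n > 1, a reward observed several steps ahead reaches Q(s_t, a_t) in a single update instead of percolating backwards one state per update.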
Parallel Deep Q-Learning

(Diagram: a parameter server holds θ and shares it with Agent 1, Agent …, Agent N.)
▪ Distributed version of DQN
▪ (Then) state-of-the-art results on the Atari domain
▪ Training takes days on 100+ machines

Massively Parallel Methods for Deep RL [Nair et al. 2015]

Figure taken from "Massively Parallel Methods for Deep Reinforcement Learning" (Nair et al. 2015)
Going HOGWILD!

(Diagram: on a single machine, Agent/Thread 1, Agent/Thread …, Agent/Thread N share the parameters θ in RAM and update them without locks.)
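A minimal lock-free sketch using Python threads (illustrative: the constant stand-in gradient and all names are assumptions; the point of HOGWILD!-style updates is that the occasional lost write from a race is tolerated):

```python
import threading

theta = [0.0] * 4  # parameters shared by all threads, no lock

def worker(n_updates, lr=0.01):
    for _ in range(n_updates):
        grad = [0.1] * len(theta)   # stand-in for a locally computed gradient
        for i, g in enumerate(grad):
            theta[i] -= lr * g      # unsynchronized write, HOGWILD!-style

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# theta has moved towards -0.4 per entry; a race may lose an update
# now and then, which HOGWILD! accepts by design.
```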
Asynchronous one-step Q-learning [Mnih et al. 2016]

(Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ⁻; accumulated gradients are applied to θ every 5 frames, and θ is copied into θ⁻ every 40000 frames.)
Experience Replay was scrapped!

Intuition
Parallel learners explore different paths anyway!
Parallelism has a similar effect to experience replay.
Diversity can be enforced by different exploration policies.

What about Experience Replay?

(Diagram: experience replay samples stored transitions such as (a = f, r = 0), (a = f, r = 1) to compute ∂L/∂θ; with actor parallelism, Agent/Thread 1 … Agent/Thread N instead each contribute their own ∂L/∂θ.)
The policy is parametrized directly as

π(a_t | s_t; θ)

and optimized by gradient ascent using

∇_θ log π(a_t | s_t; θ) R_t

which is an unbiased estimate of

∇_θ 𝔼[R_t]

Policy Gradient Methods

(Diagram: the network maps the state to a distribution π(a_t | s_t; θ) over the actions a_1, a_2, a_3, …, a_n.)
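For a softmax policy over action preferences, ∇_θ log π(a) has the closed form one-hot(a) − π; a minimal sketch of one gradient-ascent step (all names are illustrative):

```python
import math

def softmax(prefs):
    m = max(prefs)                        # subtract max for stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def log_pi_grad(theta, a):
    """d/d theta of log softmax(theta)[a] = one_hot(a) - pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]

def reinforce_step(theta, a, R, lr):
    """Ascend grad log pi(a_t|s_t; theta) * R_t."""
    return [t + lr * R * g for t, g in zip(theta, log_pi_grad(theta, a))]

theta = reinforce_step([0.0, 0.0], a=0, R=1.0, lr=0.1)
# Action 0 was rewarded, so its preference rose: theta ≈ [0.05, -0.05]
```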
Advantage Actor-Critic

∇_θ log π(a_t | s_t; θ) (R_t − V(s_t, θ_v))

Note:
R_t depends on the action taken a_t.
V(s_t, θ_v) depends on the state only.
R_t − V(s_t, θ_v) is called the advantage of a_t.

Asynchronous Advantage Actor-Critic (A3C)

(Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ_v; the network outputs both π(a_t | s_t; θ) and V(s_t, θ_v).)
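The update above can be sketched as a pair of gradients, one for the actor and one for the critic (the function shapes and names are illustrative assumptions):

```python
def a2c_grads(log_pi_grad, v, R):
    """Advantage actor-critic gradients for one transition.
    log_pi_grad: grad of log pi(a_t | s_t; theta), as a list.
    v: the critic's estimate V(s_t, theta_v).
    R: the observed return R_t."""
    adv = R - v                                   # advantage of a_t
    policy_grad = [adv * g for g in log_pi_grad]  # scale by the advantage
    value_grad = -2.0 * adv                       # d/dv of (R - v)^2
    return policy_grad, value_grad

pg, vg = a2c_grads([0.5, -0.5], v=1.0, R=3.0)  # advantage = 2
# pg = [1.0, -1.0]; vg = -4.0 (the critic is pushed towards R)
```

Using the advantage instead of the raw return leaves the estimate unbiased (the baseline V does not depend on a_t) while reducing its variance.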
Asynchronous Advantage Actor-Critic (A3C) [Mnih et al. 2016]

(Diagram: Agent/Thread 1 … Agent/Thread N share θ and θ_v and asynchronously apply their gradients for π(a_t | s_t; θ) and V(s_t, θ_v).)
Evaluation on the Atari Domain
Normalized scores on 57 Atari games [Mnih et al. 2016]
Evaluation: Asynchronous Methods vs. DQN

Averaged scores of different algorithms for 5 games over time [Mnih et al. 2016]
(Super-) Linear Speedup
Average speedup of different algorithms for different thread counts [Mnih et al. 2016]

Average speedup of A3C for different thread counts [Mnih et al. 2016]
Evaluation with TORCS
Averaged scores of different algorithms on the TORCS racing simulator [Mnih et al. 2016]
Stability & Robustness
Scores of A3C with different learning rates for 5 games [Mnih et al. 2016]
Stability & Robustness

Scores of A3C with different learning rates for 5 games [Mnih et al. 2016]

Raw scores of different algorithms for a selection of Atari games [Mnih et al. 2016]
▪ Deep reinforcement learning is hard
▪ It requires techniques like experience replay
▪ Deep RL is easily parallelizable
▪ Parallelism can replace experience replay
▪ Dropping experience replay allows on-policy methods like actor-critic
▪ A3C surpasses state-of-the-art performance

Takeaway Message
QUESTIONS?
One-step Q-learning vs. N-step Q-learning
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 0        |
| Goal   | –         | –        |

Hyperparameters: ε = 0.999, γ = 0.5
Next Action: Go Forwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 0        |
| Goal   | –         | –        |

Previous Action: Go Forwards
No reward
Next Action: Go Forwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 1        |
| Goal   | –         | –        |

Previous Action: Go Forwards
Reward: +1
Update the table according to
Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 1        |
| Goal   | –         | –        |

The game has restarted!
Next Action: Go Forwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0.5      |
| Street | 0         | 1        |
| Goal   | –         | –        |

Previous Action: Go Forwards
No reward, but a rewarding follow-up action
Update the table according to
Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else
Next Action: Go Backwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0.5      |
| Street | 0.25      | 1        |
| Goal   | –         | –        |

Previous Action: Go Backwards
No reward, but a rewarding follow-up action
Update the table according to
Q*(s, a) = r                          if s′ is terminal
Q*(s, a) = r + γ max_{a′} Q*(s′, a′)  else
Next Action: Go Backwards
Q-Learning: A Toy Example

Current Q-Table

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0.25      | 0.5      |
| Street | 0.25      | 1        |
| Goal   | –         | –        |

Done
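The whole walkthrough can be replayed in a few lines. This sketch assumes the transition structure implied by the slides (Forwards: Start → Street → Goal with reward 1 at the terminal Goal; Backwards: back to, or staying at, Start) and hard-codes the action sequence shown above:

```python
GAMMA = 0.5
Q = {("Start", "Backwards"): 0.0, ("Start", "Forwards"): 0.0,
     ("Street", "Backwards"): 0.0, ("Street", "Forwards"): 0.0}

def step(s, a):
    """(reward, next_state, terminal) under the assumed transition model."""
    if s == "Start":
        return (0.0, "Street", False) if a == "Forwards" else (0.0, "Start", False)
    return (1.0, "Goal", True) if a == "Forwards" else (0.0, "Start", False)

def update(s, a):
    r, s2, terminal = step(s, a)
    if terminal:
        Q[(s, a)] = r
    else:
        Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ("Backwards", "Forwards"))

# The action sequence from the slides (the game restarts after reaching Goal):
for s, a in [("Start", "Forwards"), ("Street", "Forwards"),   # episode 1
             ("Start", "Forwards"), ("Street", "Backwards"),  # episode 2
             ("Start", "Backwards")]:
    update(s, a)

# Reproduces the final table: Start 0.25 / 0.5, Street 0.25 / 1.
```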