Learning in sequential environments
Raia Hadsell, Staff Research Scientist, DeepMind
raiahadsell.com
Scaling deep reinforcement learning towards the real world:
Part 1: learning sequential tasks without forgetting
Part 2: learning to navigate in complex worlds
Reinforcement Learning

[Diagram: the agent-environment loop; the environment sends OBSERVATIONS and REWARD to the agent, which sends back ACTIONS]
Value Iteration

○ Maximizing Qπ(s,a) over possible policies gives the optimal action-value function and the Bellman equation:

  Q*(s,a) = max_π Qπ(s,a) = E[ r + γ max_a' Q*(s',a') | s, a ]

○ Basic idea:
■ Approximate Qi → Q*
■ Apply the Bellman equation as an iterative update:

  Qi+1(s,a) = E[ r + γ max_a' Qi(s',a') | s, a ]
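As a concrete illustration of the iterative update, here is a minimal tabular value-iteration sketch in Python (the toy MDP tables P and R are invented for the example, not from the talk):

```python
import numpy as np

# Toy MDP: 3 states, 2 actions (transition and reward tables are illustrative).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.random((n_states, n_actions))                             # R[s, a]

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Bellman update: Q_{i+1}(s,a) = R(s,a) + gamma * E_{s'}[ max_{a'} Q_i(s',a') ]
    Q_new = R + gamma * P @ Q.max(axis=1)
    converged = np.abs(Q_new - Q).max() < 1e-8
    Q = Q_new
    if converged:
        break

print("Greedy policy:", Q.argmax(axis=1))  # optimal action per state
```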
End-to-End Reinforcement Learning

○ Use a neural network for Q(s,a; θ)
○ Train end-to-end from raw pixels
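A hedged PyTorch sketch of the same idea at DQN scale: a convolutional network maps raw pixels to Q-values and is trained toward a one-step Bellman target (layer sizes and hyperparameters here are illustrative, not the talk's):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of raw frames to Q(s, a; theta), one output per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked 84x84 frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def bellman_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
    """TD error between Q(s,a) and r + gamma * max_a' Q_target(s', a')."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```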
but... a network for every task?
one network for all?
Catastrophic forgetting

● Well-known phenomenon
● Especially severe in Deep RL
Catastrophic forgetting
https://www.youtube.com/watch?v=Fh_zNpdc0Xs
Catastrophic forgetting

https://www.youtube.com/watch?v=yk_sW4x6zb0
https://www.youtube.com/watch?v=V4oT1Ei-8_k
https://www.youtube.com/watch?v=LjFGy4BxOL8
An illustration

[Figure: parameter-space schematic around θ*A, showing low-error regions for Task A and Task B and the paths taken by plain SGD, an L2 penalty, and EWC when training on Task B]
Elastic Weight Consolidation

Elastic Weight Consolidation (EWC): constrain important parameters to stay close to their old values.

Continual learning in the brain: synaptic consolidation reduces the plasticity of synapses that are vital to previous tasks.

[Figure: the same parameter-space schematic, with SGD, L2, and EWC paths from θ*A]
Elastic Weight Consolidation

Implement the constraint as a quadratic penalty that is applied while training on B, but not uniformly: the penalty should be greater for parameters that are important to Task A.

The posterior distribution p(θ | Task A data) contains exactly this importance information, but is intractable.
Elastic Weight Consolidation

Estimate the posterior with a Gaussian (a Laplace approximation):
● Mean: the Task-A parameter vector θ*A
● Diagonal precision given by an approximation of the Fisher information F
Elastic Weight Consolidation

This yields the EWC objective while training on Task B:

  L(θ) = LB(θ) + Σi (λ/2) Fi (θi − θ*A,i)²

where LB is the Task-B loss, Fi is the diagonal Fisher estimate, and λ sets how important the old task is relative to the new one.
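A sketch of this in PyTorch, under simplifying assumptions (the Fisher is estimated from gradients on labelled Task-A batches rather than by sampling from the model, and the helper names and λ value are illustrative):

```python
import torch

def diagonal_fisher(model, task_A_loader, loss_fn):
    """Diagonal Fisher estimate: mean squared gradient at theta*_A."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in task_A_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(task_A_loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, theta_star_A, lam=400.0):
    """Quadratic penalty: (lam/2) * sum_i F_i * (theta_i - theta*_A,i)^2."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - theta_star_A[n]) ** 2).sum()
    return 0.5 * lam * penalty

# After training on Task A:
#   theta_star_A = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher_A = diagonal_fisher(model, task_A_loader, loss_fn)
# While training on Task B:
#   loss = task_B_loss + ewc_penalty(model, fisher_A, theta_star_A)
```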
Experiment: Permuted MNIST

Random, fixed permutations of the MNIST dataset.
Train a multilayer, fully-connected network with ReLUs until convergence.
We compare SGD, L2 regularisation, and EWC.

[Images: sample digits under permutations A, B, and C]
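Building a permuted-MNIST task is just a fixed pixel shuffle; a minimal numpy sketch, assuming the images are already loaded as a flat (N, 784) array:

```python
import numpy as np

def make_permuted_task(images, seed):
    """Apply one random, fixed pixel permutation to all images.

    images: float array of shape (N, 784). Each seed defines one
    task (Perm A, Perm B, Perm C, ...); labels are unchanged.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])  # fixed for the whole task
    return images[:, perm]

# perm_A, perm_B, perm_C = (make_permuted_task(x_train, s) for s in (0, 1, 2))
```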
Experiment: Permuted MNIST

[Results plots: training curves as permutations A, B, C are learned in sequence; with SGD and L2 the performance on earlier permutations degrades, while EWC retains it]
Let’s try something harder...

Sequential reinforcement learning tasks (10 Atari games)
● Random ordering, with extended game play on each task and multiple returns to each game
● Unknown task boundaries
● Regular testing of all 10 games
● Single network with fixed capacity
Experiment: Atari 10

The Forget-Me-Not process [1] allows labeling of data segments, used for:
● EWC regularisation
● Task-specific replay buffers for DDQN [2]
● Task-specific biases and gains at each network layer

The Fisher information is estimated at each task boundary and the EWC penalty is updated.

[1] The Forget-Me-Not Process, Milan et al., NIPS 2016
[2] Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016

https://www.youtube.com/watch?v=Ry2WRcnwsYU
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell
Overcoming catastrophic forgetting in neural networks
PNAS 2017. arxiv.org/abs/1612.00796
Learning to navigate in complex mazes
Navigation mazes

Game episode:
1. Random start
2. Find the goal (+10)
3. Teleport randomly
4. Re-find the goal (+10)
5. Repeat (limited time)

Variants:
● Static maze, static goal
● Static maze, random goal
● Random maze
(3600 or 10800 steps per episode, depending on the maze)

Observations: RGB, velocity
Actions: 8

[Maze screenshots; rewards: +10 for the goal, +1 for apples]
Why is learning navigation via reinforcement learning hard?

Given: sparse rewards. Wanted: spatial knowledge.

[Reward curves (x-axis scale 1e7 steps): a long flat stretch ("the vast and meaningless silence of an agent exploring...") followed by a sharp rise once the agent can localise ("I have been here before! I know where to go!")]
Why is learning navigation via reinforcement learning hard?

Given: sparse rewards
Wanted: spatial knowledge

1. Accelerate reinforcement learning through auxiliary losses
➔ Stable gradients help learning, even if unrelated to reward
2. Drive spatial knowledge through the choice of auxiliary tasks:
● Depth prediction
● Loop closure prediction
Nav agent ingredients:

1. Convolutional encoder and RGB inputs (xt)
2. Stacked LSTM
3. Additional inputs (reward rt-1, action at-1, and velocity vt)
4. RL: Asynchronous advantage actor-critic (A3C), Mnih et al. (2016)
5. Aux task 1: Depth predictors (D1 from convolutional features, D2 from LSTM features)
6. Aux task 2: Loop closure predictor (L)
7. For analysis: Position decoder

[Architecture diagram: encoder over xt; inputs rt-1 and {vt, at-1} feeding the stacked LSTM; output heads for policy and value, Depth (D1, D2), Loop (L), and Position]
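A structural PyTorch sketch of these ingredients (sizes are illustrative, and the depth heads are simplified to single classifications rather than the paper's per-cell depth predictions):

```python
import torch
import torch.nn as nn

class NavA3CAgent(nn.Module):
    def __init__(self, n_actions=8, hidden=256, vel_dim=3, depth_classes=8):
        super().__init__()
        # 1. Convolutional encoder over the RGB input x_t
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
        )
        # 2./3. Stacked LSTM; r_{t-1}, v_t, and one-hot a_{t-1} join the second layer
        extra = 1 + vel_dim + n_actions
        self.lstm1 = nn.LSTMCell(hidden, hidden)
        self.lstm2 = nn.LSTMCell(hidden + extra, hidden)
        # 4. A3C heads: policy logits and value
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)
        # 5. Aux depth heads: D1 from conv features, D2 from LSTM features
        self.depth1 = nn.Linear(hidden, depth_classes)
        self.depth2 = nn.Linear(hidden, depth_classes)
        # 6. Loop closure head (binary: back at a visited location?)
        self.loop = nn.Linear(hidden, 1)

    def forward(self, x, extras, state1=None, state2=None):
        f = self.enc(x)
        h1, c1 = self.lstm1(f, state1)
        h2, c2 = self.lstm2(torch.cat([h1, extras], dim=1), state2)
        return (self.policy(h2), self.value(h2),
                self.depth1(f), self.depth2(h2), self.loop(h2),
                (h1, c1), (h2, c2))
```

(Ingredient 7, the position decoder, is a separate head trained on the LSTM state for analysis.)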
Details

Policy gradient (A3C), with an entropy bonus on the policy:

  ∇θ log π(at | st; θ) (Rt − V(st; θv)) + β ∇θ H(π(· | st; θ))

Depth prediction from visual (convolutional) features, D1: classification loss over quantised depth values.
Depth prediction from LSTM features, D2: the same loss, computed from the LSTM state.
Loop closure prediction from LSTM features, L: binary classification loss (has the agent returned to a previously visited location?).
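A hedged sketch of how these terms combine into one objective (the weighting coefficients are illustrative hyperparameters):

```python
import torch.nn.functional as F

def combined_loss(policy_logits, value, returns, actions,
                  d1_logits, d2_logits, depth_targets,
                  loop_logits, loop_targets,
                  beta_ent=1e-3, beta_d=0.3, beta_l=0.3):
    # A3C: policy gradient with advantage, value regression, entropy bonus
    log_pi = F.log_softmax(policy_logits, dim=1)
    adv = (returns - value.squeeze(1)).detach()
    pg_loss = -(log_pi.gather(1, actions.unsqueeze(1)).squeeze(1) * adv).mean()
    v_loss = F.mse_loss(value.squeeze(1), returns)
    entropy = -(log_pi.exp() * log_pi).sum(dim=1).mean()
    # Aux: depth as classification over quantised bins (D1, D2)
    d_loss = (F.cross_entropy(d1_logits, depth_targets) +
              F.cross_entropy(d2_logits, depth_targets))
    # Aux: loop closure as binary classification (L)
    l_loss = F.binary_cross_entropy_with_logits(loop_logits.squeeze(1), loop_targets)
    return pg_loss + 0.5 * v_loss - beta_ent * entropy + beta_d * d_loss + beta_l * l_loss
```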
Experiments

Four agent variants:
a. FF A3C: feedforward encoder only
b. LSTM A3C: encoder plus an LSTM
c. Nav A3C: stacked LSTM with the additional inputs rt-1 and {vt, at-1}
d. Nav A3C+D1D2L: Nav A3C with the Depth (D1, D2) and Loop (L) auxiliary heads

[Architecture diagrams for each variant]
Results on large maze with static goal

https://www.youtube.com/watch?v=zHhbypmKaj0
Should depth be an input? Or a target?

[Diagrams: (left) depth as input, an RGBD observation rgbdt fed to the encoder; (right) depth as target, an RGB observation rgbt with depth predicted by the auxiliary heads D1 and D2]

Answer: the dense, low-noise gradients from depth as a prediction target are more helpful.
Results with random goal locations

Is the agent remembering the goal location?
● Mean time to first goal find of an episode: 14.0 sec
● Mean time to subsequent goal finds: 7.2 sec
● Not as impressive for large mazes: 15.4 sec vs 15.0 sec

[Reward curves for the small and large random-goal mazes]
Latency to goal (as the agent returns)
● Trajectories of the Nav A3C+D+L agent in the I-maze and random goal maze over the course of one episode
● Value function and goal finding (red lines) are shown
Position decoding

● Trajectories of the Nav A3C+D+L agent in the random goal maze
● Position likelihoods are overlaid (predicted from the LSTM hidden state)
● Initial uncertainty gives way to accurate position estimation.

[Diagram: encoder with inputs xt, rt-1, {vt, at-1}; heads D1, D2, Loop (L), and the Position decoder]
Results in random mazes (small and large)

https://www.youtube.com/watch?v=EKXQAjoNdGM
https://www.youtube.com/watch?v=lNoaTyMZsWI
Thank you!
raiahadsell.com

Piotr Mirowski, Razvan Pascanu, Fabio, Ross, Andy, Hubert, Laurent, Koray, Dharsh, Misha, Andrea

Learning to navigate in complex environments
ICLR 2017. arxiv.org/abs/1611.03673