Machine LearningLecture 16: Value-Based Deep Reinforcement Learning
Nevin L. Zhang ([email protected])
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
This set of notes is based on the references listed at the end and internet resources.
Nevin L. Zhang (HKUST) Machine Learning 1 / 37
Introduction
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
    Double DQN
    Prioritized Experience Replay
Introduction
Motivation
Q-learning:
Difficult to represent Q(s, a) as a lookup table when the number of states is large.

Atari games: A state is an image with 210 × 160 pixels, and each pixel has 128 possible color values. The total number of states is 128^(210×160).
Cannot determine actions for states never visited (and most states are never visited).
Introduction
Deep Q-Networks
In DQN, we aim to approximate the optimal action-value function Q∗(s, a) using a neural network with parameters θ:

Q(s, a; θ) ≈ Q∗(s, a)

The function is defined over all possible states (images) without enumerating them.

It provides information for picking an action in any state.
Key question: How to learn θ from experiences with the environment?
s0, a0, r0, s1, a1, r1, s2, a2, r2, . . .
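As a sketch of what "a neural network with parameters θ" might look like (the layer sizes and the NumPy implementation here are illustrative, not from the lecture), a Q-network takes a state and returns one Q-value per action:

```python
import numpy as np

def init_qnet(state_dim, n_actions, hidden=32, seed=0):
    """Randomly initialize a tiny two-layer network; this dict is theta."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(theta, s):
    """Q(s, a; theta) for all actions a at once: one forward pass per state."""
    h = np.maximum(0.0, s @ theta["W1"] + theta["b1"])  # ReLU hidden layer
    return h @ theta["W2"] + theta["b2"]

theta = init_qnet(state_dim=4, n_actions=2)
s = np.array([0.1, -0.2, 0.3, 0.0])
q = q_values(theta, s)          # array of 2 Q-values, one per action
greedy_action = int(q.argmax())
```

A single forward pass yields Q-values for every action, so the greedy action is just an argmax over the output layer.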
Training Target
Learning DQN
Q-learning:
We will learn θ iteratively, just as Q(s, a) is learned iteratively in Q-learning.
As the first step toward a DQN learning algorithm, we can change the Q-value update step into a θ update step:

θ ← update(θ; s, a, s′, r)

To do so, we need to come up with an objective function and update θ using its gradient.
Training Target
Learning from experience tuple e = (s, a, s′, r)
We want to change θ so that Q(s, a; θ) gets closer to Q∗(s, a).
The problem is that we do not know the true target Q∗(s, a).
So, we use the TD target:

y = r(s, a) + γ max_a′ Q(s′, a′; θ)

Let Q′ be the result of running one step of value iteration on Q(s, a; θ). It is closer to Q∗ than Q. As explained in the previous lecture, y is an unbiased estimate of Q′ based on the experience tuple.
Training Target
Learning from experience tuple e = (s, a, s ′, r)
Our goal is now to push Q(s, a; θ) toward the TD target, so we use the following loss function:

L(θ) = (y − Q(s, a; θ))² = ([r(s, a) + γ max_a′ Q(s′, a′; θ)] − Q(s, a; θ))²   (1)

The update rule for θ is:

θ ← θ − α ∇θ ([r(s, a) + γ max_a′ Q(s′, a′; θ)] − Q(s, a; θ))²   (2)
where α is the learning rate.
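A minimal sketch of update rule (2) with a linear approximator Q(s, a; θ) = θ·φ(s, a). The feature vectors below are made up for illustration, and the TD target is treated as a constant when differentiating (the "semi-gradient" simplification commonly used in practice):

```python
import numpy as np

def td_update(theta, phi_sa, r, phi_next, alpha=0.1, gamma=0.9):
    """One application of update rule (2) for a linear Q-function.
    phi_next holds one feature vector phi(s', a') per next action a'."""
    y = r + gamma * max(phi @ theta for phi in phi_next)  # TD target
    q = theta @ phi_sa                                    # current Q(s, a; theta)
    # grad of (y - q)^2 w.r.t. theta (with y held fixed) is -2 (y - q) phi_sa,
    # so the gradient-descent step pushes q toward y
    return theta + alpha * 2.0 * (y - q) * phi_sa

theta = np.zeros(3)
phi_sa = np.array([1.0, 0.0, 0.0])   # hypothetical features of (s, a)
phi_next = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
theta = td_update(theta, phi_sa, r=1.0, phi_next=phi_next)
```

Here y = 1 + 0.9·0 = 1 while Q(s, a; θ) = 0, so the step moves θ in the direction of φ(s, a), raising Q(s, a; θ) toward the target.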
Training Target
Moving Target
The TD target y = r(s, a) + γ max_a′ Q(s′, a′; θ) depends on the parameters θ.
Every time the parameters are updated, the target is also changed.
So, learning θ with update rule (2) is like chasing a moving target.
This makes learning unstable.
Training Target
Target Network
To deal with the moving-target issue, create a target network that has the same structure as the DQN. Let θ− be its parameters.

Use the target network to compute the TD target y.

Set θ− = θ once in a while (every C steps).

So, we have this new update rule:

θ ← θ − α ∇θ ([r(s, a) + γ max_a′ Q(s′, a′; θ−)] − Q(s, a; θ))²   (3)
This way, we are solving a sequence of well-defined optimization problems.
Training Target
Deep Q-Learning with Target Network
Repeat:

Take action a in current state s; observe r and s′.

Update the parameters:

θ ← θ − α ∇θ ([r(s, a) + γ max_a′ Q(s′, a′; θ−)] − Q(s, a; θ))²

s ← s′

θ− ← θ every C steps.
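This loop can be sketched end to end. To keep the sketch dependency-free, a tabular array theta[s, a] stands in for the network Q(s, a; θ) (a real DQN would backpropagate through a network instead), and the environment is a made-up 5-state ring that pays reward 1 for returning to state 0:

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, C, epsilon = 0.1, 0.9, 10, 0.3
rng = np.random.default_rng(0)

theta = np.zeros((n_states, n_actions))   # stands in for Q(s, a; theta)
theta_minus = theta.copy()                # target-network parameters

def env_step(s, a):
    """Toy dynamics: action 1 moves around the ring, action 0 stays put."""
    s_next = (s + 1) % n_states if a == 1 else s
    return s_next, float(s_next == 0)

s = 0
for t in range(1, 2001):
    # epsilon-greedy action from the current Q estimate
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(theta[s].argmax())
    s_next, r = env_step(s, a)
    y = r + gamma * theta_minus[s_next].max()   # TD target computed from theta_minus
    theta[s, a] += alpha * (y - theta[s, a])    # gradient step on (y - Q)^2
    s = s_next
    if t % C == 0:
        theta_minus = theta.copy()              # theta_minus <- theta every C steps
```

Note that the target network only changes at the periodic sync, so between syncs each update chases a fixed target.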
Experience Replay
Experience Replay
In Deep Q-Learning with a target network, only the latest experience tuple is used for each parameter update.

The idea of experience replay is:

Store experience tuples in a buffer.
Use a random minibatch of experience tuples for each parameter update.
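A replay buffer can be sketched in a few lines (the capacity and tuple layout here are illustrative, not values from the lecture):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of experience tuples (s, a, s_next, r)."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are evicted first

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # uniform random minibatch (without replacement within one batch)
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for i in range(5):
    buf.add(i, 0, i + 1, 0.0)
batch = buf.sample(3)   # three random experience tuples
```

Using a deque with maxlen gives the sliding-window behavior for free: once the buffer is full, adding a new tuple silently discards the oldest one.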
Experience Replay
Deep Q-Learning with Target Network and Experience Replay
Repeat:
Take action a in current state s; observe r and s′; add experience tuple (s, a, s′, r) to a buffer D; s ← s′.

Sample a minibatch B = {(sj, aj, s′j, rj)} from D.

Update the parameters:

θ ← θ − α ∇θ Σ_j ([r(sj, aj) + γ max_a′j Q(s′j, a′j; θ−)] − Q(sj, aj; θ))²
θ− ← θ every C steps.
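The minibatch update can be sketched as follows, with a tabular array theta[s, a] standing in for Q(s, a; θ) so the sketch runs without a deep-learning library; a real DQN would backpropagate the summed loss through the network in one step:

```python
import numpy as np

def minibatch_update(theta, theta_minus, batch, alpha=0.1, gamma=0.9):
    """Gradient step on the summed squared TD errors of a minibatch.
    theta_minus supplies the TD targets; batch holds (s, a, s_next, r) tuples."""
    theta = theta.copy()
    for s, a, s_next, r in batch:
        y = r + gamma * theta_minus[s_next].max()   # per-tuple TD target
        theta[s, a] += alpha * (y - theta[s, a])    # one term of the sum over j
    return theta

theta = np.zeros((3, 2))
theta_minus = np.zeros((3, 2))
batch = [(0, 1, 1, 0.0), (1, 1, 2, 1.0), (2, 0, 0, 0.0)]
theta = minibatch_update(theta, theta_minus, batch)
```

Only the tuple with reward 1 produces a nonzero TD target here, so only its entry of theta moves; the signal then spreads to neighboring state-action pairs on later updates.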
Experience Replay
Advantages of Experience Replay
Without experience replay, the experience tuples used in consecutive parameter updates are strongly correlated. Experience replay breaks these correlations and hence reduces the variance of the updates.

Experience replay avoids oscillations or divergence in the parameters.

With experience replay, each experience tuple is potentially used multiple times, hence better data efficiency.

With experience replay, each update improves Q(s, a; θ) using signals from multiple points (experience tuples), hence faster convergence.
Nevin L. Zhang (HKUST) Machine Learning 16 / 37
![Page 52: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/52.jpg)
Experience Replay
A More General View of DQN
Processes 1 and 3 run at the same pace, while Process 2 is slower.
Old experiences are discarded when the buffer is full.
Nevin L. Zhang (HKUST) Machine Learning 17 / 37
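The three processes can be sketched as a single loop (dummy stand-ins below, not a full agent; the update period C is a hypothetical value):

```python
from collections import deque

# Process 1: interact with the environment and store experiences (every step)
# Process 3: sample a minibatch and update theta (every step)
# Process 2: copy theta into the target parameters theta_minus (only every C steps)
buffer = deque(maxlen=100)      # old experiences dropped when full
theta, theta_minus = 0, 0       # stand-ins for network parameters
C = 4                           # target-network update period
updates_to_theta, updates_to_target = 0, 0

for step in range(12):
    buffer.append(("s", "a", "r", "s_next"))   # Process 1: store experience
    theta += 1                                 # Process 3: gradient update (dummy)
    updates_to_theta += 1
    if step % C == C - 1:                      # Process 2: runs at a slower pace
        theta_minus = theta
        updates_to_target += 1

print(updates_to_theta, updates_to_target)     # 12 3
```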
![Page 54: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/54.jpg)
DQN in Atari
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
Double DQN
Prioritized Experience Replay
Nevin L. Zhang (HKUST) Machine Learning 18 / 37
![Page 55: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/55.jpg)
DQN in Atari
Atari Games
Space Invaders
Nevin L. Zhang (HKUST) Machine Learning 19 / 37
![Page 56: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/56.jpg)
DQN in Atari
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s
Input state s is a stack of raw pixels from the last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is the change in score for that step
Network architecture and hyperparameters fixed across all games
Nevin L. Zhang (HKUST) Machine Learning 20 / 37
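Building the input state as a stack of the last 4 frames can be sketched as follows (the 84×84 preprocessed frame size is an assumption following common DQN preprocessing, not stated on the slide):

```python
import numpy as np
from collections import deque

# Keep only the 4 most recent preprocessed frames.
frames = deque(maxlen=4)

def observe(frame):
    """Append a new frame and return the stacked state s."""
    frames.append(frame)
    while len(frames) < 4:            # pad at episode start by repeating the frame
        frames.append(frame)
    return np.stack(frames, axis=0)   # shape (4, 84, 84), the network input

state = observe(np.zeros((84, 84), dtype=np.float32))
print(state.shape)                    # (4, 84, 84)
```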
![Page 61: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/61.jpg)
DQN in Atari
Comparable with or superior to human performance on the majority of games
Nevin L. Zhang (HKUST) Machine Learning 21 / 37
![Page 62: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/62.jpg)
DQN in Atari
Results on Breakout
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Nevin L. Zhang (HKUST) Machine Learning 22 / 37
![Page 63: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/63.jpg)
Improvements to DQN
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
Double DQN
Prioritized Experience Replay
Nevin L. Zhang (HKUST) Machine Learning 23 / 37
![Page 64: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/64.jpg)
Improvements to DQN Double DQN
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
Double DQN
Prioritized Experience Replay
Nevin L. Zhang (HKUST) Machine Learning 24 / 37
![Page 65: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/65.jpg)
Improvements to DQN Double DQN
Overestimate of the Value Function
In (3), we estimate the TD target as follows:
yQ = r(s, a) + γ max_{a′} Q(s′, a′; θ−)
The subscript Q indicates that yQ is the estimate used in Q-learning (DQN).
The Q-values used are not accurate, and are influenced by random factors denoted ω−. So, we can write:
yQ = r(s, a) + γ max_{a′} Q(s′, a′; θ−, ω−)
If we run the process many times and take the average of the y values, we get
Eω−[yQ] = r(s, a) + γ Eω−[max_{a′} Q(s′, a′; θ−, ω−)]
Nevin L. Zhang (HKUST) Machine Learning 25 / 37
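The upward bias of the averaged target can be seen in a small simulation (the reward, discount, and noise scale below are hypothetical numbers chosen for illustration):

```python
import random

# True Q-values of all 4 actions at s' equal 1.0, but each estimate is
# corrupted by zero-mean noise omega.  Averaging the noisy target
# y_Q = r + gamma * max_a' Q(s', a'; theta-, omega-) over many runs
# reveals that E[y_Q] exceeds the noise-free target.
random.seed(0)
gamma, r, true_q, n_actions = 0.9, 0.0, 1.0, 4

def noisy_q():
    return [true_q + random.gauss(0, 0.5) for _ in range(n_actions)]

runs = [r + gamma * max(noisy_q()) for _ in range(10_000)]
avg_y_q = sum(runs) / len(runs)
ideal_y = r + gamma * true_q       # = 0.9
print(avg_y_q > ideal_y)           # True: the max over noisy values is biased up
```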
![Page 71: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/71.jpg)
Improvements to DQN Double DQN
Overestimate of the Value Function
Ideally, we would like to first get a robust estimate of Q by taking the average over many runs, i.e.,
Eω−[Q(s′, a′; θ−, ω−)]
and then use it to estimate the TD target:
y = r(s, a) + γ max_{a′} Eω−[Q(s′, a′; θ−, ω−)]
On average, yQ overestimates the ideal value y because
Eω−[yQ] = r(s, a) + γ Eω−[max_{a′} Q(s′, a′; θ−, ω−)]
≥ r(s, a) + γ max_{a′} Eω−[Q(s′, a′; θ−, ω−)]
= y
Note that, for any two random variables X1 and X2, E[max(X1, X2)] ≥ max(E[X1], E[X2]).
Nevin L. Zhang (HKUST) Machine Learning 26 / 37
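The inequality E[max(X1, X2)] ≥ max(E[X1], E[X2]) is easy to check numerically, e.g. with two independent coin-flip variables (a toy example, not from the slides):

```python
import random

# X1, X2 independent, each 0 or 1 with probability 1/2, so E[X1] = E[X2] = 0.5.
# max(X1, X2) is 1 unless both flips are 0, so E[max(X1, X2)] = 3/4 > 1/2.
random.seed(0)
n = 100_000
samples = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(n)]
e_max = sum(max(x1, x2) for x1, x2 in samples) / n    # approx. 0.75
max_e = max(sum(x1 for x1, _ in samples) / n,
            sum(x2 for _, x2 in samples) / n)         # approx. 0.5
print(e_max >= max_e)                                 # True
```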
![Page 76: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/76.jpg)
Improvements to DQN Double DQN
Double DQN
Double DQN uses the current network θ to select the action and the target network θ− to evaluate it:
a∗ = arg max_{a′} Q(s′, a′; θ, ω)
yDoubleQ = r(s, a) + γ Q(s′, a∗; θ−, ω−)
Taking the average, we get
Eω−[yDoubleQ] = r(s, a) + γ Eω−[Q(s′, a∗; θ−, ω−)]
This is closer to
y = r(s, a) + γ max_{a′} Eω−[Q(s′, a′; θ−, ω−)]
than
Eω−[yQ] = r(s, a) + γ Eω−[max_{a′} Q(s′, a′; θ−, ω−)]
Nevin L. Zhang (HKUST) Machine Learning 27 / 37
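The two targets can be contrasted on toy numbers (the dictionaries below stand in for the online network θ and the target network θ−; all values are hypothetical):

```python
# Toy Q-values for three actions at s'.
q_online = {"left": 1.0, "stay": 2.0, "right": 1.5}   # Q(s', .; theta)
q_target = {"left": 1.2, "stay": 0.8, "right": 1.1}   # Q(s', .; theta-)
r, gamma = 0.0, 0.9

# DQN: the target network both selects and evaluates the action.
y_q = r + gamma * max(q_target.values())               # 0.9 * 1.2

# Double DQN: the online network selects, the target network evaluates.
a_star = max(q_online, key=q_online.get)               # "stay"
y_double_q = r + gamma * q_target[a_star]              # 0.9 * 0.8

print(round(y_q, 2), round(y_double_q, 2))             # 1.08 0.72
```

Decoupling selection from evaluation means a noisily inflated target-network value can no longer pick itself, which reduces the overestimation.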
![Page 81: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/81.jpg)
Improvements to DQN Double DQN
An Analogy
TD target in DQN:
yQ = r(s, a) + γ max_{a′} Q(s′, a′; θ−)
The second term estimates the future optimal value of s′ in two steps:
Pick the next action a′, and
Evaluate Q(s′, a′).
In both steps, we use θ−.
Nevin L. Zhang (HKUST) Machine Learning 28 / 37
![Page 83: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/83.jpg)
Improvements to DQN Double DQN
An Analogy
Task: Estimate the highest exam score of a class given the current situation s′.
Method 1: Ask one teacher θ− to
Estimate the score of each student a′: Q(s′, a′; θ−).
Pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ−).
Report Q(s′, a∗; θ−).
Problem: The best student picked by teacher θ− might not be the best after all. The score for a∗ is overestimated, and hence the max score of the class is overestimated.
Method 2:
Ask one teacher θ to pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ).
Ask another teacher θ− to estimate that student's score: Q(s′, a∗; θ−).
Nevin L. Zhang (HKUST) Machine Learning 29 / 37
![Page 93: Machine Learning - Lecture 16: Value-Based Deep ...home.cse.ust.hk/~lzhang/teach/comp5212/slides/l16.t.pdf · Machine Learning Lecture 16: Value-Based Deep Reinforcement Learning](https://reader033.vdocument.in/reader033/viewer/2022060409/5f1007d37e708231d447193a/html5/thumbnails/93.jpg)
Improvements to DQN Double DQN
Double DQN
Update rule for DQN:
θ ← θ − α ∇θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²
Or, equivalently:
θ ← θ − α ∇θ ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²
where a∗ = arg max_{a′} Q(s′, a′; θ−).
Update rule for Double DQN:
θ ← θ − α ∇θ ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²   (4)
where a∗ = arg max_{a′} Q(s′, a′; θ).
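The only difference between the two targets is which network selects a∗. A minimal NumPy sketch (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def dqn_target(r, gamma, q_next_target):
    # DQN: the target network theta- both selects and evaluates a*.
    return r + gamma * q_next_target.max()

def double_dqn_target(r, gamma, q_next_online, q_next_target):
    # Double DQN (Eq. 4): the online network theta selects a*,
    # and the target network theta- evaluates it.
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

# Toy Q-values over 3 actions at s'
q_online = np.array([1.0, 2.5, 0.5])   # Q(s', ., theta)
q_target = np.array([1.2, 2.0, 3.0])   # Q(s', ., theta-)

print(dqn_target(1.0, 0.9, q_target))                   # 1 + 0.9*3.0 = 3.7
print(double_dqn_target(1.0, 0.9, q_online, q_target))  # a*=1, so 1 + 0.9*2.0 = 2.8
```

In the toy example the networks disagree about the best action, and the Double DQN target is smaller than the plain DQN target, illustrating the reduced overestimation.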
DQN versus Double DQN
Outline
1 Introduction
2 Training Target
3 Experience Replay
4 DQN in Atari
5 Improvements to DQN
  Double DQN
  Prioritized Experience Replay
Why Prioritized Experience Replay?
In DQN and Double DQN, experience transitions are uniformly sampled from a replay memory for the purpose of Q-function regression. (Process 3)
This approach does not take the significance of the experience transitions into consideration.
Prioritized Experience Replay
The significance of an experience tuple (s, a, s′, r) is measured using the TD error:
δ = [r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ)
Let {(s_i, a_i, s′_i, r_i)} be the collection of experience tuples in the replay memory.
For each update of θ, experience tuples are sampled using the following distribution:
P(i) = p_i^α / Σ_k p_k^α
p_i is a function of δ_i: p_i = |δ_i| + ε or p_i = 1/rank(i).
α determines how much prioritization is used.
α = 0 amounts to uniform sampling.
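The sampling distribution can be sketched as follows (a NumPy illustration; the function names are assumptions, and practical implementations sample from P(i) via a sum-tree rather than recomputing the full distribution):

```python
import numpy as np

def priorities(td_errors, variant="proportional", eps=1e-2):
    """Priority p_i for each transition, from its TD error delta_i."""
    td_errors = np.asarray(td_errors, dtype=float)
    if variant == "proportional":
        return np.abs(td_errors) + eps      # p_i = |delta_i| + eps
    # Rank-based: p_i = 1 / rank(i), where rank 1 is the largest |delta|.
    ranks = np.empty_like(td_errors)
    order = np.argsort(-np.abs(td_errors))
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probs(td_errors, alpha=0.6, variant="proportional"):
    p = priorities(td_errors, variant) ** alpha
    return p / p.sum()                      # P(i) = p_i^alpha / sum_k p_k^alpha

deltas = [0.1, -2.0, 0.5, 0.0]
print(sampling_probs(deltas, alpha=0.6))   # largest |delta| gets the largest P(i)
print(sampling_probs(deltas, alpha=0.0))   # uniform: every P(i) = 0.25
```

With α = 0 every p_i^α equals 1, so the distribution is uniform, matching the remark above.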
Prioritized Experience Replay Improves Double DQN in Most Cases
Code Example
github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial6/README.md
References
Sergey Levine. Deep Reinforcement Learning, http://rail.eecs.berkeley.edu/deeprlcourse/, 2018.
David Silver. Tutorial: Deep Reinforcement Learning, http://hunch.net/~beygel/deep_rl_tutorial.pdf, 2016.
Tambet Matiisen. Demystifying Deep Reinforcement Learning. https://ai.intel.com/demystifying-deep-reinforcement-learning/, 2015.
Watkins (1989). Learning from delayed rewards: introduces Q-learning.
Riedmiller (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks.
Mnih et al. (2015). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
Van Hasselt, Guez, Silver (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In ICLR, 2016.
Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in the Q-function.