Deep Learning Lab 17: Deep RL
Bing-Han Chiang & Datalab
Outline
• From RL to Deep RL
• Deep Q-Network
  • Naïve Algorithm (TD)
  • Experience Replay
  • Delayed Target Network
  • Complete Algorithm
• Implementation
• Assignment
From RL to Deep RL
• (Tabular) RL
  • Q-learning:
$$Q^*(s, a) \leftarrow Q^*(s, a) + \eta\left[\left(R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right) - Q^*(s, a)\right]$$
  • It requires a large table to store the $Q^*$ values in realistic environments with a large state/action space.
    • Flappy Bird: $O(10^5)$; Tetris: $O(10^{60})$; autonomous car: ???
  • It is hard to visit all $(s, a)$ pairs in limited training time.
• Agents must derive efficient representations of the environment from high-dimensional inputs and use these to generalize past experience to new situations.
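The tabular update above fits in a few lines of Python; the toy states, actions, and reward below are illustrative assumptions, not any particular environment:

```python
from collections import defaultdict

GAMMA, ETA = 0.9, 0.1   # discount factor and learning rate
Q = defaultdict(float)  # the table: (s, a) -> Q*(s, a), zero-initialized

def q_learning_update(s, a, r, s_next, actions):
    """One tabular Q-learning step: move Q*(s, a) toward the TD target."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ETA * (target - Q[(s, a)])

# Toy transition: from state 0, action 1 yields reward 1.0 and lands in state 1
q_learning_update(s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])  # 0.1 = ETA * (1.0 + GAMMA * 0 - 0)
```

The table grows by one entry per visited $(s, a)$ pair, which is exactly what becomes infeasible in the large state spaces above.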
Deep Q-Network
• To learn a function $f_{Q^*}(s, a; \theta)$ that approximates $Q^*(s, a)$:
  • Trained from a small number of samples.
  • Generalizes to unseen states/actions.
  • Requires only a smaller parameter set $\theta$ to store.
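A minimal sketch of such an approximator, here a two-layer NumPy MLP mapping a state vector to one Q-value per action (the layer sizes and the 4-dimensional state are illustrative assumptions, not the lab's actual architecture):

```python
import numpy as np

STATE_DIM, HIDDEN, N_ACTIONS = 4, 32, 2  # illustrative sizes
rng = np.random.default_rng(0)

# theta: a small set of weights replacing a (possibly enormous) Q-table
theta = {
    "W1": rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN)), "b1": np.zeros(HIDDEN),
    "W2": rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS)), "b2": np.zeros(N_ACTIONS),
}

def f_q(s, theta):
    """f_Q*(s, .; theta): Q-value estimates for every action at once."""
    h = np.maximum(0.0, s @ theta["W1"] + theta["b1"])  # ReLU hidden layer
    return h @ theta["W2"] + theta["b2"]

q_values = f_q(np.ones(STATE_DIM), theta)  # shape (N_ACTIONS,)
```

Outputting all actions' Q-values in one forward pass makes the later $\max_{a'}$ a single vector operation instead of one network call per action.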
Deep Q-Network
• Naïve Algorithm (TD)
  1. Take action $a$ from $s$ using some exploration policy $\pi'$ derived from $f_{Q^*}$ (e.g. $\varepsilon$-greedy).
  2. Observe $s'$ and reward $R(s, a, s')$, and update $\theta$ using SGD:
$$\theta \leftarrow \theta - \eta \nabla_\theta C, \quad \text{where } C(\theta) = \left[\left(R(s, a, s') + \gamma \max_{a'} f_{Q^*}(s', a'; \theta)\right) - f_{Q^*}(s, a; \theta)\right]^2$$

Recall the Q-learning update formula:
$$Q^*(s, a) \leftarrow Q^*(s, a) + \eta\left[\left(R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right) - Q^*(s, a)\right]$$
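One step of this naïve algorithm can be sketched with a linear approximator $f_{Q^*}(s, a; \theta) = \theta_a^\top s$ as a stand-in for a real network; as is standard practice, the target term is treated as a constant when differentiating $C$:

```python
import numpy as np

GAMMA, ETA = 0.99, 0.01
STATE_DIM, N_ACTIONS = 4, 2
theta = np.zeros((N_ACTIONS, STATE_DIM))  # linear Q: f(s, a) = theta[a] @ s

def naive_td_step(s, a, r, s_next):
    """SGD step on C(theta). Note the target uses the SAME theta,
    which is exactly the moving-target problem discussed next."""
    target = r + GAMMA * np.max(theta @ s_next)  # max over a' of f(s', a')
    td_error = target - theta[a] @ s
    theta[a] += ETA * td_error * s               # descend the squared TD error

s = np.ones(STATE_DIM)
naive_td_step(s, a=0, r=1.0, s_next=s)  # theta[0] moves toward the target
```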
Deep Q-Network
• However, the naïve TD algorithm diverges because:
  1. Samples are correlated (violating the i.i.d. assumption on training examples).
  2. The target is non-stationary ($f_{Q^*}(s', a'; \theta)$ changes as $\theta$ is updated for the current $a$).
• As a result, the Deep Q-Network applies two stabilization techniques to address these problems respectively:
1. Experience Replay
2. Delayed Target Network
Deep Q-Network
• Experience Replay
  • Removes the correlations in the observation sequence.
  • Use a replay memory $\mathbb{D}$ to store recently seen transitions $(s, a, r, s')$.
  • Sample a mini-batch from $\mathbb{D}$ and update $\theta$.
    • The transitions in the mini-batch no longer form a consecutive sequence.
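A replay memory of this kind is only a few lines of Python; the capacity and batch size below are arbitrary illustrative values:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer D of (s, a, r, s') transitions. Uniform random
    sampling breaks the temporal correlation between consecutive steps."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # without replacement

D = ReplayMemory(capacity=100)
for t in range(10):
    D.add(s=t, a=0, r=1.0, s_next=t + 1)
batch = D.sample(4)  # 4 transitions drawn from anywhere in the buffer
```

The bounded `deque` also implements the "recently seen" part: once the buffer is full, the oldest transition is discarded automatically.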
Deep Q-Network
• Delayed Target Network
  • Avoids chasing a moving target.
  • Set the target value to the output of the network parameterized by the old parameters $\theta^-$.
  • Update $\theta^- \leftarrow \theta$ every $K$ iterations.

Update formula of the naïve TD algorithm:
$$C(\theta) = \left[\left(R(s, a, s') + \gamma \max_{a'} f_{Q^*}(s', a'; \theta)\right) - f_{Q^*}(s, a; \theta)\right]^2$$

Update formula after applying the Delayed Target Network:
$$C(\theta) = \left[\left(R(s, a, s') + \gamma \max_{a'} f_{Q^*}(s', a'; \theta^-)\right) - f_{Q^*}(s, a; \theta)\right]^2$$
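The mechanism is just a periodically refreshed copy of the parameters; a sketch with a toy linear $f_{Q^*}$, where the values of $\theta$ and $K$ are illustrative:

```python
import numpy as np

GAMMA, K = 0.99, 100  # K: how often theta^- is refreshed (illustrative value)
theta = np.array([[0.5, 0.1], [0.2, 0.3]])  # online parameters (toy linear Q)
theta_minus = theta.copy()                   # delayed copy theta^-

def td_target(r, s_next):
    """Target computed with theta^-: it stays fixed between syncs,
    even while theta itself is updated every step."""
    return r + GAMMA * np.max(theta_minus @ s_next)

for step in range(1, 301):
    # ... a gradient update on theta would happen here ...
    if step % K == 0:
        theta_minus = theta.copy()  # theta^- <- theta every K iterations

y = td_target(1.0, np.ones(2))  # 1 + 0.99 * max(0.6, 0.5) = 1.594
```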
Deep Q-Network
• Complete Algorithm
  • Naïve algorithm (TD) + Experience Replay + Delayed Target Network
  • Initialize $\theta$ arbitrarily and set $\theta^- = \theta$. Iterate until convergence:
    1. Take action $a$ from $s$ using some exploration policy $\pi'$ derived from $f_{Q^*}$ (e.g. $\varepsilon$-greedy).
    2. Observe $s'$ and reward $R(s, a, s')$, and add $(s, a, R, s')$ to $\mathbb{D}$.
    3. Sample a mini-batch of $(s, a, R, s')$ tuples from $\mathbb{D}$ and do:
$$\theta \leftarrow \theta - \eta \nabla_\theta C, \quad \text{where } C(\theta) = \left[\left(R(s, a, s') + \gamma \max_{a'} f_{Q^*}(s', a'; \theta^-)\right) - f_{Q^*}(s, a; \theta)\right]^2$$
    4. Update $\theta^- \leftarrow \theta$ every $K$ iterations.
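Putting the four steps together, here is a complete toy run of the algorithm using pure NumPy, a hypothetical two-state environment, and illustrative hyperparameters; a real implementation would substitute a deep network and a game environment:

```python
import random
from collections import deque
import numpy as np

GAMMA, ETA, EPSILON, K, BATCH = 0.9, 0.05, 0.1, 50, 8  # illustrative values
N_ACTIONS = 2
rng = np.random.default_rng(0)
random.seed(0)

theta = np.zeros((N_ACTIONS, 2))  # linear Q over one-hot states
theta_minus = theta.copy()        # delayed target parameters theta^-
D = deque(maxlen=1000)            # replay memory

def env_step(s, a):
    """Stand-in environment: action 1 taken in state [1, 0] earns reward 1."""
    r = 1.0 if (a == 1 and s[0] == 1.0) else 0.0
    s_next = np.array([1.0, 0.0]) if a == 0 else np.array([0.0, 1.0])
    return r, s_next

s = np.array([1.0, 0.0])
for step in range(1, 2001):
    # 1. epsilon-greedy action derived from f_Q*
    if rng.random() < EPSILON:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax(theta @ s))
    # 2. observe s' and R, store the transition in D
    r, s_next = env_step(s, a)
    D.append((s, a, r, s_next))
    # 3. sample a mini-batch and take a gradient step; the target uses theta^-
    if len(D) >= BATCH:
        for bs, ba, br, bs_next in random.sample(D, BATCH):
            target = br + GAMMA * np.max(theta_minus @ bs_next)
            theta[ba] += ETA * (target - theta[ba] @ bs) * bs
    # 4. refresh the delayed target network every K iterations
    if step % K == 0:
        theta_minus = theta.copy()
    s = s_next
```

Because the states are one-hot, the linear $f_{Q^*}$ here behaves like a tiny Q-table, so after training the learned values prefer the rewarding action in state $[1, 0]$.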
Assignment – state-based DQN
• What you should do:
  • Change the input from a stack of frames to the game state (as in Lab 16).
  • Change the network structure from CNNs to Dense layers.
  • Train the state-based DQN agent to play Flappy Bird.
• Evaluation metrics:
  • Code (whether the implementation is correct) (50%).
  • The bird is able to fly through at least 1 pipe (50%).
• Requirements:
  • Upload the notebook named Lab17_{student_id}.ipynb to Google Drive, and submit the link to iLMS.
  • Deadline: 2020-01-02 (Thu.) 23:59.