Ben Lau, Quantitative Researcher, Complus Asset Management Limited, at MLconf NYC 2017
TRANSCRIPT
![Page 1: Ben Lau, Quantitative Researcher, Complus Asset Management Limited, at MLconf NYC 2017](https://reader036.vdocument.in/reader036/viewer/2022081604/58e4a4661a28aba3458b689b/html5/thumbnails/1.jpg)
Deep Reinforcement Learning: using deep learning to play self-driving car games
Ben Lau
Ben Lau - Deep Learning and Reinforcement Learning
MLConf 2017, New York City
What is Reinforcement Learning?
Three classes of learning:

- Supervised Learning: labeled data, direct feedback
- Unsupervised Learning: no labels, no feedback, "find hidden structure"
- Reinforcement Learning: uses reward as feedback, learns a series of actions, trial and error
RL: Agent and Environment
(Diagram: Agent and Environment in a loop of action, observation, and reward R_t.)

At each step t:

- the Agent receives an observation, executes an action, and receives a reward
- the Environment receives the action, sends the next observation, and sends the reward
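This agent-environment loop can be sketched in a few lines of Python. The tiny `LineWorld` environment below is a hypothetical stand-in, not from the talk:

```python
import random

class LineWorld:
    """Toy environment (hypothetical): the agent walks along a line of
    `size` cells and is rewarded for reaching the last one."""
    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos                    # initial observation

    def step(self, action):                # action: -1 (left) or +1 (right)
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1
        reward = 1.0 if done else -0.1     # small cost per step until the goal
        return self.pos, reward, done      # observation, reward, done flag

# The loop from the slide: agent acts, environment responds, repeat.
env = LineWorld()
obs = env.reset()
total_reward = 0.0
for t in range(20):
    action = random.choice([-1, 1])        # agent: execute action
    obs, reward, done = env.step(action)   # environment: sends obs + reward
    total_reward += reward                 # agent: receives reward
    if done:
        break
```

The `reset`/`step` split mirrors the convention popularized by RL simulator APIs: the environment owns the state, the agent only sees observations and rewards.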
RL: State
Experience is a sequence of observations, actions, and rewards.

The state is a summary of that experience.

Note: not all states are fully observable.

(Images contrast a fully observable example with a not fully observable one.)
Approach to Reinforcement Learning
Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.

Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.

Model-Based RL: build a model of the environment; plan (e.g. by lookahead) using the model.
Deep Learning + RL = AI
(Diagram: game input feeds a deep convolutional network, trained with the reward signal, which outputs Steer, Gas Pedal, Brake.)
Policies
A deterministic policy is the agent's behavior. It is a map from state to action: a = π(s).

In reinforcement learning, the agent's goal is to choose each action so as to maximize the sum of future rewards.

Choose a_t to maximize R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …

γ is a discount factor in [0, 1], as the reward is less certain when it is further away.

| State (s) | Action (a) |
| --- | --- |
| Obstacle | Brake |
| Corner | Left/Right |
| Straight line | Acceleration |
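The discounted sum of future rewards can be computed directly; a minimal sketch (the function name is mine, the formula is the slide's):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    gamma in [0, 1] down-weights rewards that are further away
    (and therefore less certain)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

With gamma = 1 this is the plain sum of rewards; with gamma = 0 only the immediate reward counts.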
Approach to Reinforcement Learning
Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.

Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.

Model-Based RL: build a model of the environment; plan (e.g. by lookahead) using the model.
Value Function
A value function is a prediction of future reward: how much reward will I get from action a in state s?

A Q-value function gives the expected total reward from state-action pair (s, a), under policy π, with discount factor γ:

Q^π(s, a) = E[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … | s, a ]

The optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a).

Once we have Q*, we can act optimally: π*(s) = argmax_a Q*(s, a).
Understanding Q Function
The best way to understand the Q-function is to think of it as a "strategy guide".

Suppose you are playing a difficult game (Doom). If you have a strategy guide, it's pretty easy: just follow the guide.

Suppose you are in state s and need to make a decision. If you have this magical Q-function (strategy guide), it is easy: just pick the action with the highest Q.

(Image: Doom strategy guide.)
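"Pick the action with the highest Q" is just an argmax over the current state's Q-values; a sketch with a made-up toy table:

```python
def greedy_action(q_table, state):
    """Follow the 'strategy guide': pick the action with the highest
    Q-value in the current state.  q_table maps (state, action) -> value."""
    candidates = {a: v for (s, a), v in q_table.items() if s == state}
    return max(candidates, key=candidates.get)

# Toy Q-table with hypothetical states, actions, and values:
Q = {("corner", "left"): 2.0, ("corner", "brake"): 1.5,
     ("straight", "accelerate"): 3.0, ("straight", "brake"): -1.0}
```

Calling `greedy_action(Q, "corner")` consults only the rows for `"corner"` and returns the best-scoring action.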
How to find Q-function
Discounted future reward: R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …

which can be written as: R_t = r_t + γ·R_{t+1}

Recall the definition of the Q-function (the maximum reward achievable if we choose action a in state s).

Therefore, we can rewrite the Q-function as:

Q(s, a) = r + γ·max_{a'} Q(s', a')

In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward attainable in the next state s' over actions a'.

It can be solved by dynamic programming or by an iterative method.
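The iterative solution can be sketched for a tiny, known, deterministic MDP (a toy illustration, not the talk's code): repeatedly sweep the Bellman update until the values settle.

```python
def q_iteration(transitions, gamma=0.9, sweeps=50):
    """Iteratively apply the Bellman equation
        Q(s, a) = r + gamma * max_a' Q(s', a')
    over a known, deterministic transition table
    transitions[(s, a)] = (reward, next_state)."""
    Q = {sa: 0.0 for sa in transitions}
    for _ in range(sweeps):
        for (s, a), (r, s2) in transitions.items():
            # max over actions available in the next state (0 if terminal)
            next_best = max(
                (Q[(s_, a_)] for (s_, a_) in transitions if s_ == s2),
                default=0.0,
            )
            Q[(s, a)] = r + gamma * next_best
    return Q
```

On a two-state chain A → B → terminal with reward 1 on the final step, the update propagates the reward backwards, discounted by γ at each step.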
Deep Q-Network (DQN)
The action-value function (Q-function) is often very big: one entry for every state-action pair.

DQN idea: use a neural network to compress this Q-table, representing it with the network's weights w.

Training then becomes finding an optimal set of weights w instead.

In the literature this is often called "non-linear function approximation".

| State | Action | Value |
| --- | --- | --- |
| A | 1 | 140.11 |
| A | 2 | 139.22 |
| B | 1 | 145.89 |
| B | 2 | 140.23 |
| C | 1 | 123.67 |
| C | 2 | 135.27 |
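The compression idea in miniature: instead of storing one number per (state, action) row, Q is computed from a shared parameter vector w applied to features of the pair. A linear model is the simplest function approximator (illustrative only; DQN uses a deep network, and the features here are hypothetical):

```python
def q_approx(w, features):
    """Q(s, a; w): the value is computed from weights w and a feature
    vector for the (state, action) pair, instead of being looked up in
    a table.  Linear for clarity; DQN replaces this with a deep net."""
    return sum(wi * fi for wi, fi in zip(w, features))

# Two weights can score any number of (state, action) pairs,
# whereas the table needs one stored value per pair.
w = [1.0, 2.0]
value = q_approx(w, [3.0, 0.5])   # 1*3 + 2*0.5 = 4.0
```

Training adjusts `w` so that `q_approx` reproduces the Q-values, which is exactly the "find optimal weights w" reframing from the slide.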
DQN Demo: using a Deep Q-Network to play Doom
Approach to Reinforcement Learning
Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.

Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.

Model-Based RL: build a model of the environment; plan (e.g. by lookahead) using the model.
Deep Policy Network
Review: a policy is the agent's behavior, a map from state to action: a_t = π(s_t).

We can search for the policy directly.

Let's parameterize the policy by some model parameters θ: a_t = π_θ(s_t).

We call it policy-based reinforcement learning because we adjust the model parameters θ directly.

The goal is to maximize the total discounted reward from the beginning:

maximize R = r_1 + γ·r_2 + γ²·r_3 + …
Policy Gradient
How do we make good actions more likely?

Define the objective function as the total discounted reward:

J(θ) = E[ r_1 + γ·r_2 + γ²·r_3 + … | π_θ ]

where the expectation of the total reward R is taken under a probability distribution parameterized by θ.

The goal becomes: maximize the total reward by adjusting θ, i.e. compute the gradient ∇_θ J(θ) and follow it uphill.
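The "compute the gradient and maximize J(θ)" step can be sketched with a numerical gradient and plain gradient ascent. The deterministic toy objective below stands in for the expectation over trajectories; learning rate and step count are illustrative:

```python
def gradient_ascent(J, theta, lr=0.1, steps=200, eps=1e-5):
    """Maximize an objective J(theta) by estimating dJ/dtheta with a
    central difference and stepping uphill.  Real policy-gradient
    methods estimate this gradient from sampled trajectories instead."""
    for _ in range(steps):
        grad = (J(theta + eps) - J(theta - eps)) / (2.0 * eps)
        theta += lr * grad
    return theta
```

On a concave toy objective such as J(θ) = −(θ − 2)², the iterates converge to the maximizer θ = 2.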
Policy Gradient (II)
Recall: the Q-function is the maximum discounted future reward in state s, action a.

In the continuous case we can write the objective as J(θ) = E[ Q(s, π_θ(s)) ].

Therefore, we can compute the gradient as ∇_θ J = E[ ∇_θ Q(s, π_θ(s)) ].

Using the chain rule, we can rewrite it as ∇_θ J = E[ ∇_a Q(s, a)|_{a=π_θ(s)} · ∇_θ π_θ(s) ].

No dynamics model required!

1. Only requires that Q is differentiable w.r.t. a
2. As long as a can be parameterized as a function of θ
The power of Policy Gradient
Because the policy gradient does not require a dynamics model, no prior domain knowledge is required.

AlphaGo does not pre-program any domain knowledge.

It keeps playing many games (via self-play) and adjusts the policy parameters to maximize the reward (winning probability).
Intuition: Value vs Policy RL
Value-based RL is similar to a driving instructor: a score is given for every action the student takes.

Policy-based RL is similar to the driver: it is the actual policy for how to drive the car.
The car racing game TORCS
TORCS is a state-of-the-art open-source simulator written in C++.

Main features: sophisticated dynamics; comes with several tracks and controllers.

Sensors: rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with track.

Quite close to a real self-driving car setup. (Image: track sensors.)
Deep Learning Recipe
Game input (state s): rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with track.

A deep neural network, trained with the reward signal, maps the state s to three outputs: Steer, Gas Pedal, Brake.

Compute the optimal policy via the policy gradient.
Design of the reward function
Obvious choice: the highest velocity of the car.

However, it works better to use a modified reward function that subtracts a penalty proportional to |trackPos|, which encourages the car to stay in the center of the track.
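A sketch of such a modified reward; the exact form and coefficients used in the talk's code may differ:

```python
def modified_reward(v, cos_theta, track_pos):
    """Reward speed projected along the track (v * cos_theta), minus a
    speed-scaled penalty on |track_pos| (distance from the center line,
    0 at the center), so the agent is encouraged to stay centered.
    Illustrative coefficients, not the talk's exact reward."""
    return v * cos_theta - v * abs(track_pos)
```

Driving fast and centered (`track_pos = 0`) earns the full speed as reward; drifting off-center eats into it in proportion to how far and how fast the car is going.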
Source code available here. Google: "DDPG Keras"
Training Set: Aalborg Track
Validation Set: Alpine Tracks. Recall basic machine learning: make sure you test the model on the validation set, not the training set.
Learning how to brake
Since we try to maximize the velocity of the car, the AI agent doesn't want to hit the brake at all (braking goes against the reward function). Solution: use the stochastic brake idea.
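One way a stochastic brake could be implemented (an assumption on my part; the talk does not spell out the exact mechanism): with small probability, force the brake on during training so the agent actually experiences braking.

```python
import random

def stochastic_brake(policy_brake, p=0.1, rng=random):
    """With probability p, override the policy's brake output with a
    hard brake so braking states get explored at all; otherwise keep
    the policy's output.  p and the forced value 1.0 are illustrative
    choices, not the talk's exact settings."""
    if rng.random() < p:
        return 1.0           # forced hard brake (exploration)
    return policy_brake      # otherwise follow the learned policy
```

During evaluation one would set `p = 0` and let the learned policy brake on its own.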
Final Demo: the car does not stay in the center of the track
Future Application
Self-driving cars
Future Application
Thank you! Twitter: @yanpanlau
Appendix
How to find Q-function (II)
Q(s, a) = r + γ·max_{a'} Q(s', a')

We can use an iterative method to solve for the Q-function, given a transition (s, a, r, s').

We want Q(s, a) to equal r + γ·max_{a'} Q(s', a').

Treating the search for the Q-function as a regression task, we can define a loss function:

Loss = [ (r + γ·max_{a'} Q(s', a')) − Q(s, a) ]²

where the first term is the target and the second is the prediction. Q is optimal when the loss function is at its minimum.
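The loss for a single transition follows directly from the target and prediction terms; in this sketch `q` is a plain dict standing in for the network, and `next_actions` (my naming) lists the actions available in s':

```python
def td_loss(q, transition, gamma=0.99):
    """Squared error between target and prediction for one transition:
        target     = r + gamma * max_a' Q(s', a')
        prediction = Q(s, a)
    q maps (state, action) -> value.  A sketch of the idea, not DQN code."""
    s, a, r, s_next, next_actions = transition
    target = r + gamma * max(q[(s_next, a2)] for a2 in next_actions)
    prediction = q[(s, a)]
    return (target - prediction) ** 2
```

In DQN proper, this squared error is minimized over minibatches of stored transitions by gradient descent on the network weights w.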