a brief introduction to reinforcement...
TRANSCRIPT
![Page 2: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/2.jpg)
Outline
• Characteristics of Reinforcement Learning (RL)
• Components of RL (MDP, value, policy, Bellman)
• Planning (policy iteration, value iteration)
• Model-free Prediction (MC, TD)
• Model-free Control (Q-Learning)
• Deep Reinforcement Learning (DQN)
Characteristics of RL: SL vs. RL
• Supervised Learning
  • i.i.d. data
  • direct and strong supervision (label: what is the right thing to do)
  • instantaneous feedback
• Reinforcement Learning
  • sequential, non-i.i.d. data
  • no supervisor, only a reward signal (rule: what you did is good or bad)
  • delayed feedback
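Components of RL: MDP
• A general framework for sequential decision making
• An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
  • $\mathcal{S}$: states
  • $\mathcal{A}$: actions
  • $\mathcal{P}$: transition probability, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$: reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$: discount factor, $\gamma \in [0, 1]$
• Markov property: “The future is independent of the past given the present”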
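As a minimal sketch of the tuple above, a small tabular MDP can be written as plain dictionaries. The two states, the two actions, and all numbers are hypothetical and chosen only for illustration; `step` and `is_valid_mdp` are illustrative helper names.

```python
import random

# Hypothetical two-state MDP used only for illustration.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
GAMMA = 0.9  # discount factor, must lie in [0, 1]

# P[s][a] maps next state -> probability, i.e. P^a_{ss'}.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
# R[s][a] is the expected immediate reward R^a_s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

def step(state, action, rng):
    """Sample S_{t+1} from P and return it with the expected reward.

    The result depends only on (state, action), not on earlier history,
    which is the Markov property.
    """
    next_states = list(P[state][action])
    weights = [P[state][action][s2] for s2 in next_states]
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return next_state, R[state][action]

def is_valid_mdp():
    """Check that every transition distribution sums to 1."""
    return all(
        abs(sum(P[s][a].values()) - 1.0) < 1e-12
        for s in STATES
        for a in ACTIONS
    )
```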
Components of RL: Policy & Return & Value
• Policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
• Return: $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
• State-value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• Action-value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
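For a finite reward sequence, the return defined above can be computed with a single backward pass. This is a small sketch; the function name `discounted_return` is illustrative and the reward list stands for R_{t+1}, R_{t+2}, and so on.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k R_{t+k+1} for a finite list of rewards."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `gamma = 0` only the first reward counts; with `gamma = 1` the return is the plain sum of rewards.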
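Components of RL: Bellman Equations
• Bellman Expectation Equation
  $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
  $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$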
• Bellman Optimality Equation
  $v_*(s) = \max_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$
Components of RL: Prediction vs. Control
• Prediction: given a policy, evaluate how much reward you can get by following that policy
• Control: find an optimal policy that maximizes the cumulative future reward
Components of RL: Planning vs. Learning
• Planning
  • the underlying MDP is known
  • the agent only needs to perform computations on the given model
  • dynamic programming (policy iteration, value iteration)
• Learning
  • the underlying MDP is initially unknown
  • the agent needs to interact with the environment
  • model-free (learn value / policy) or model-based (learn a model, plan on it)
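Planning: Dynamic Programming
• Applied when optimal solutions can be decomposed into subproblems
• For prediction (iterative policy evaluation):
  • Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$
  • Output: value function $v_\pi$
• For control (policy iteration, value iteration):
  • Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  • Output: optimal value function $v_*$ and optimal policy $\pi_*$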
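Planning: Iterative Policy Evaluation
• Iterative application of Bellman Expectation backup: $v_1 \to v_2 \to \dots \to v_\pi$
  $v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$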
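The backup above can be sketched for a tabular MDP as follows. This is a minimal illustration, not a reference implementation: the 2-state MDP, the uniform policy, the stopping tolerance `tol`, and the iteration cap `max_iters` are all assumptions introduced here.

```python
def policy_evaluation(P, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Synchronous Bellman expectation backups until the change is below tol.

    P[s][a][s2]  : transition probability
    R[s][a]      : expected immediate reward
    policy[s][a] : probability of taking action a in state s
    """
    n_states = len(P)
    v = [0.0] * n_states
    for _ in range(max_iters):
        # Every state is updated from the previous iterate v (synchronous).
        v_new = [
            sum(
                policy[s][a]
                * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states)))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if delta < tol:
            break
    return v

# Hypothetical deterministic MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
v_pi = policy_evaluation(P, R, uniform, gamma=0.5)
```

For this toy problem the linear system can also be solved by hand, giving v_pi = (1.25, 1.75), which the iteration approaches.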
Planning: Policy Iteration
• Evaluate the given policy and get: $v_\pi$
• Get an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$
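The two alternating steps can be sketched for a tabular MDP with deterministic policies. The helper names `evaluate`, `greedy`, and `policy_iteration`, the fixed number of evaluation sweeps, the round cap, and the toy MDP are assumptions made for this sketch.

```python
def evaluate(P, R, policy, gamma, sweeps=200):
    """Approximate v_pi for a deterministic policy with repeated backups."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            R[s][policy[s]] + gamma * sum(P[s][policy[s]][s2] * v[s2] for s2 in range(n))
            for s in range(n)
        ]
    return v

def greedy(P, R, v, gamma):
    """pi'(s) = argmax_a ( R[s][a] + gamma * sum_s' P[s][a][s'] * v[s'] )."""
    n = len(P)

    def q(s, a):
        return R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n))

    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(n)]

def policy_iteration(P, R, gamma, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)
    for _ in range(max_rounds):
        v = evaluate(P, R, policy, gamma)
        improved = greedy(P, R, v, gamma)
        if improved == policy:
            break
        policy = improved
    return policy, v

# Hypothetical deterministic MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
pi_star, v_star = policy_iteration(P, R, gamma=0.5)
```

In this toy MDP the stable policy is to switch out of state 0 and stay in state 1, with values (3, 4).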
Planning: Value Iteration
• Iterative application of Bellman Optimality backup: $v_1 \to v_2 \to \dots \to v_*$
  $v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$
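A minimal tabular sketch of this backup is shown below. The stopping tolerance, the iteration cap, and the toy MDP are assumptions for illustration; only the max-backup itself comes from the slide.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman optimality backup until the change is below tol."""
    n = len(P)
    v = [0.0] * n
    for _ in range(max_iters):
        v_new = [
            max(
                R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
        delta = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if delta < tol:
            break
    return v

# Hypothetical deterministic MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
v_star = value_iteration(P, R, gamma=0.5)
```

Unlike policy iteration, no explicit policy is kept during the sweeps; a greedy policy can be read off from the final values.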
Planning: Synchronous DP Algorithms
Model-free Prediction: MC vs. TD
• Monte Carlo Learning
  • learns from complete trajectories, no bootstrapping
  • estimates values from sample returns, using the empirical mean return
• Temporal Difference Learning
  • learns from incomplete episodes by bootstrapping: the remainder of the trajectory is replaced with our current estimate
  • updates a guess towards a guess
Model-free Prediction: MC
• Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
• Recall: Return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
• Recall: Value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• MC policy evaluation (every-visit MC): uses empirical mean return instead of expected return
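Every-visit MC evaluation can be sketched as averaging the returns that follow each visit to a state. The episode format (a list of `(state, reward)` pairs, with the reward being R_{t+1}) and the two hand-written episodes are assumptions made for this example.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma):
    """Estimate v_pi(s) as the mean return observed after every visit to s.

    Each episode is a list of (state, reward) pairs, where reward is
    R_{t+1}, the reward received after leaving S_t.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so that g is the return G_t from each step onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            total[state] += g
            count[state] += 1
    return {s: total[s] / count[s] for s in total}

# Two hypothetical complete episodes.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("B", 0.0), ("A", 2.0)],
]
values = every_visit_mc(episodes, gamma=1.0)
```

Note that the estimate can only be updated once an episode has terminated, because the full return is needed.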
Model-free Prediction: MC -> TD
• Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
• MC: updates $V(S_t)$ towards the actual return $G_t$
  $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$
• TD: updates $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$
  $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$
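The two update rules differ only in their target, which a short sketch makes visible. The function names and the `terminal` flag (which drops the bootstrap term at the end of an episode) are assumptions added here, not part of the slide.

```python
def mc_update(V, s, g, alpha):
    """Incremental MC step: move V[s] towards the sampled return g."""
    V[s] += alpha * (g - V[s])
    return V

def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """TD(0) step: move V[s] towards the target r + gamma * V[s_next]."""
    # Assumption: a terminal next state contributes no bootstrapped value.
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)  # target 1.9
mc_update(V, "B", g=3.0, alpha=0.5)                          # target 3.0
```

The TD step can be applied after every transition, while the MC step needs the complete return of the episode.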
Model-free Prediction: MC vs. TD: Driving Home
Model-free Prediction: MC Backup
Model-free Prediction: TD Backup
Model-free Prediction: DP Backup
Model-free Prediction: Unified View
Model-free Control: Why model-free?
• MDP is unknown: but experience can be sampled
• MDP is known: but too big to use except to sample from it
Model-free Control: Generalized Policy Iteration
• Evaluate the given policy and get: $v_\pi$
• Get an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$
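Model-free Control: V -> Q
• Greedy policy improvement over V(s) requires a model of the MDP:
  $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V(s') \right)$
• Greedy policy improvement over Q(s, a) is model-free:
  $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$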
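The contrast can be sketched in a few lines: improving over V needs the transition and reward tables, while improving over Q only needs the Q table. The dictionary layout and the toy numbers are assumptions for illustration.

```python
def greedy_from_v(P, R, V, gamma):
    """Needs the model: P[s][a] maps next state -> probability, R[s][a] is reward."""
    return {
        s: max(
            P[s],
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()),
        )
        for s in P
    }

def greedy_from_q(Q):
    """Model-free: pi'(s) = argmax_a Q[s][a]."""
    return {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}

# Hypothetical deterministic model and value estimates.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 2.0, "move": 0.0}}
V = {"s0": 3.0, "s1": 4.0}
Q = {"s0": {"stay": 1.5, "move": 3.0}, "s1": {"stay": 4.0, "move": 1.5}}

pi_from_v = greedy_from_v(P, R, V, gamma=0.5)
pi_from_q = greedy_from_q(Q)
```

Both calls return the same policy here because the Q table was chosen to be consistent with V and the model.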
Model-free Control: SARSA
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$
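A single SARSA step on a tabular Q can be sketched as below. The nested-dictionary layout of `Q` and the example numbers are assumptions; the next action `a_next` is the one the behaviour policy actually takes in the next state, which is what makes the update on-policy.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD control: bootstrap from the action actually taken next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Hypothetical Q table and one observed transition (S, A, R, S', A').
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="left", alpha=0.5, gamma=0.9)
```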
Model-free Control: Q-Learning
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$
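The update can be sketched together with a short training loop. Everything around the update rule is an assumption made for illustration: the deterministic 2-state environment (`NEXT`, `REWARD`), the uniformly random behaviour policy, and the hyperparameters.

```python
import random

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy TD control: bootstrap from max_a' Q[s_next][a']."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Hypothetical deterministic MDP: action 0 stays, action 1 switches state.
NEXT = [[0, 1], [1, 0]]
REWARD = [[0.0, 1.0], [2.0, 0.0]]

def train(steps=5000, alpha=0.5, gamma=0.5, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)  # uniformly random behaviour policy
        s_next, r = NEXT[s][a], REWARD[s][a]
        q_learning_update(Q, s, a, r, s_next, alpha, gamma)
        s = s_next
    return Q

Q = train()
```

Although the agent acts randomly, the learned table approaches the optimal action values, because the target uses the maximum over next actions rather than the action actually taken.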
Model-free Control: SARSA vs. Q-Learning
Model-free Control: DP vs. TD
Deep Reinforcement Learning: Why?
• So far we represented the value function by a lookup table
  • every state s has an entry V(s)
  • every state-action pair (s, a) has an entry Q(s, a)
• Problem with large MDPs
  • too many states and/or actions to store in memory
  • too slow to learn the value of each state individually
Deep Reinforcement Learning How to?
• Use deep networks to represent:
• value function (value-based methods)
• policy (policy-based methods)
• model (model-based methods)
• Optimize value function / policy / model end-to-end
66
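As a rough illustration of "representing" a value function or a policy with a network, the sketch below uses a tiny one-hidden-layer network written in plain Python. The layer sizes, the absence of bias terms, and the names `init_layer`, `mlp`, and `softmax` are assumptions made for brevity; a real implementation would use a deep learning library and a deeper model.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weight matrix with n_out rows of n_in weights (no biases, for brevity)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, x):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mlp(params, x):
    """One hidden ReLU layer followed by a linear output layer."""
    hidden = [max(0.0, z) for z in dense(params["w1"], x)]
    return dense(params["w2"], hidden)


def softmax(logits):
    """Turn arbitrary scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
state = [0.1, -0.2, 0.0, 0.3]  # assumed 4-dimensional state features
num_actions = 2                # assumed number of actions

# Value-based: the network maps a state to one Q-value per action.
q_params = {"w1": init_layer(4, 8, rng), "w2": init_layer(8, num_actions, rng)}
q_values = mlp(q_params, state)

# Policy-based: the network maps a state to action probabilities.
pi_params = {"w1": init_layer(4, 8, rng), "w2": init_layer(8, num_actions, rng)}
action_probs = softmax(mlp(pi_params, state))

print(q_values, action_probs)
```

In both cases the weights replace the lookup table: states that share features share parameters, so the agent can generalize to states it has never visited.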
![Page 67: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/67.jpg)
![Page 68: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/68.jpg)
Deep Reinforcement Learning Q-learning -> DQN
68
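The step from tabular Q-learning to a DQN-style update can be sketched as follows. This is a simplified illustration under stated assumptions, not the slide's own algorithm: a linear model with one weight vector per action stands in for the deep network, the values of `GAMMA` and `ALPHA` are arbitrary, and the helper names are hypothetical. It keeps the two ingredients DQN adds on top of Q-learning: transitions are drawn from a replay memory, and the bootstrap target is computed with a frozen copy of the weights.

```python
import random

GAMMA = 0.9  # discount factor (assumed value)
ALPHA = 0.5  # tabular learning rate (assumed value)


def tabular_q_update(q, s, a, r, s_next, actions, done):
    """One Q-learning step on a lookup table q[(state, action)]."""
    best_next = 0.0 if done else max(q.get((s_next, b), 0.0) for b in actions)
    td_error = r + GAMMA * best_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + ALPHA * td_error
    return td_error


def q_values(weights, features):
    """Linear stand-in for a deep network: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def dqn_style_update(weights, target_weights, batch, lr=0.1):
    """Semi-gradient step on the squared TD error, one transition at a time.

    The target r + GAMMA * max_a' Q(s', a') is computed with the frozen
    target_weights; only weights is modified.
    """
    for phi, a, r, phi_next, done in batch:
        best_next = 0.0 if done else max(q_values(target_weights, phi_next))
        td_error = r + GAMMA * best_next - q_values(weights, phi)[a]
        # For a linear model, the gradient of Q(s, a) w.r.t. weights[a] is phi.
        weights[a] = [w + lr * td_error * x for w, x in zip(weights[a], phi)]


# Replay memory of (features, action, reward, next_features, done) transitions.
replay = [
    ([1.0, 0.0], 0, 0.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1, 1.0, [0.0, 0.0], True),
    ([1.0, 0.0], 1, 0.0, [1.0, 0.0], False),
]
weights = [[0.0, 0.0], [0.0, 0.0]]
target_weights = [row[:] for row in weights]  # frozen copy, refreshed only occasionally

batch = random.Random(0).sample(replay, 2)    # sample transitions from replay memory
dqn_style_update(weights, target_weights, batch)
print(weights)
```

Sampling from the replay memory breaks the correlation between consecutive transitions, and holding `target_weights` fixed keeps the regression target from shifting with every update.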
![Page 69: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/69.jpg)
![Page 70: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/70.jpg)
Deep Reinforcement Learning AI = RL + DL
• Reinforcement Learning (RL)
• a general purpose framework for decision making
• learn policies to maximize future reward
• Deep Learning (DL)
• a general purpose framework for representation learning
• given an objective, learn the representation required to achieve it
• DRL: the goal is a single agent that can solve any human-level task
• RL defines the objective
• DL gives the mechanism
• RL + DL = general intelligence
70
![Page 71: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/71.jpg)
![Page 72: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/72.jpg)
![Page 73: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/73.jpg)
Some Recommendations
• Reinforcement Learning course by David Silver on YouTube
• Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd Edition
• DQN Nature paper: Human-level Control Through Deep Reinforcement Learning
• Flappy Bird:
• Tabular RL: https://github.com/SarvagyaVaish/FlappyBirdRL
• Deep RL: https://github.com/songrotek/DRL-FlappyBird
• Many third-party implementations: search for “deep reinforcement learning”, “dqn”, “a3c” on GitHub
• My implementations in PyTorch: https://github.com/jingweiz/pytorch-rl
73
![Page 74: A Brief Introduction to Reinforcement Learningais.informatik.uni-freiburg.de/.../deep_learning... · Outline • Characteristics of Reinforcement Learning (RL) • The RL Problem](https://reader030.vdocument.in/reader030/viewer/2022041015/5ec67826a60cb616bc756a4c/html5/thumbnails/74.jpg)