Q-learning Tutorial (CSC 411)
Geoffrey Roeder
Slides adapted from lectures by Rich Zemel, Raquel Urtasun, Sanja Fidler, and Nitish Srivastava
Tutorial Agenda

• Refresh RL terminology through Tic Tac Toe
• Deterministic Q-Learning: what and how
• Q-learning Matlab demo: Gridworld
• Extensions: non-deterministic reward, next state
• More cool demos
Tic Tac Toe Redux
The agent receives a reward R at the end of a game:

Lose = −1, Tie = 0, Win = +1

The state s_t is the current board position (an arrangement of X's and O's).

A policy maps states to actions: π : S → A, so π(s_t) ↦ a.

A value function maps states to expected future reward: V^π : S → ℝ, so V^π(s_t) ↦ r_future.
RL & Tic-Tac-Toe

Each board position (taking into account symmetry) has some probability of winning.

Simple learning process:

• start with all values = 0.5
• policy: choose the move with the highest probability of winning, given the legal moves from the current state
• update entries in the table based on the outcome of each game
• after many games, the value function will represent the true probability of winning from each state

Can try an alternative policy: sometimes select moves randomly (exploration).
MDP Refresher
Familiar? Skip?
MDP Formulation

Goal: find a policy π that maximizes the expected accumulated future reward V^π(s_t), obtained by following π from state s_t:

$$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

Game show example:

• assume a series of questions, increasingly difficult, but with increasing payoff
• choice: accept accumulated earnings and quit, or continue and risk losing everything

Notice that:

$$V^\pi(s_t) = r_t + \gamma V^\pi(s_{t+1})$$
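The recursion above is easy to sanity-check numerically. Here is a minimal sketch (not from the slides; the reward sequence is a made-up example) verifying that the discounted return satisfies V(s_t) = r_t + γ V(s_{t+1}):

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical r_t, r_{t+1}, r_{t+2}, r_{t+3}

def discounted_return(rs, gamma):
    """sum_i gamma^i * r_{t+i} for a finite episode."""
    return sum(gamma ** i * r for i, r in enumerate(rs))

v_t = discounted_return(rewards, gamma)
v_next = discounted_return(rewards[1:], gamma)
assert abs(v_t - (rewards[0] + gamma * v_next)) < 1e-12
print(v_t)   # ~ 6.265
```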
What to Learn

We might try to learn the function V (which we write as V*):

$$V^*(s) = \max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$

Here δ(s, a) gives the next state if we perform action a in current state s.

We could then do a lookahead search to choose the best action from any state s:

$$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$

But there's a problem:

• this works well if we know δ(·) and r(·)
• but when we don't, we cannot choose actions this way
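For concreteness, here is a minimal sketch of that lookahead, assuming (hypothetically) that the model is available as Python callables r(s, a) and delta(s, a), and V* as a dict V_star:

```python
def lookahead_policy(s, actions, r, delta, V_star, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])
```

Note how this needs both r and delta; the point of the next slides is that learning Q removes that need.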
Q Learning
Deterministic rewards and actions
Q Learning

Define a new function, very similar to V*:

$$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$$

If we learn Q, we can choose the optimal action even without knowing δ!

$$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right] = \arg\max_a Q(s, a)$$

Q is then the evaluation function we will learn.
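Compare with the lookahead sketch earlier: once Q is learned (stored here, as an assumed representation, in a dict keyed by (state, action) pairs), acting greedily needs no model at all:

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a); no r or delta required."""
    return max(actions, key=lambda a: Q[(s, a)])
```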
Training Rule to Learn Q

Q and V* are closely related:

$$V^*(s) = \max_a Q(s, a)$$

So we can write Q recursively:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$

Let Q̂ denote the learner's current approximation to Q. Consider the training rule

$$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$$

where s' is the state resulting from applying action a in state s.
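As code, the training rule is a one-line table write. A sketch, again assuming Q_hat is a dict keyed by (state, action):

```python
def q_backup(Q_hat, s, a, r, s_next, actions, gamma=0.9):
    """Deterministic rule: Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')."""
    Q_hat[(s, a)] = r + gamma * max(Q_hat[(s_next, a2)] for a2 in actions)
```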
Q Learning for Deterministic World

For each s, a, initialize the table entry Q̂(s, a) ← 0

Start in some initial state s

Do forever:

• select an action a and execute it
• receive immediate reward r
• observe the new state s'
• update the table entry for Q̂(s, a) using the Q-learning rule:

$$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$$

• s ← s'

If we reach an absorbing state, restart at the initial state and run through the "do forever" loop until we again reach an absorbing state. A runnable sketch of this loop is given below.
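Putting the loop together: a minimal, self-contained sketch on a toy 1-D chain world. The environment is an illustrative assumption, not the tutorial's Matlab Gridworld:

```python
import random

n_states = 6                       # states 0..5; state 5 is the absorbing goal
actions = [-1, +1]                 # move left / move right
gamma = 0.9

def step(s, a):
    """Deterministic world: delta clips moves to the chain, r = 100 on entering the goal."""
    s_next = min(max(s + a, 0), n_states - 1)
    r = 100 if s_next == n_states - 1 else 0
    return r, s_next

Q_hat = {(s, a): 0.0 for s in range(n_states) for a in actions}

for episode in range(200):
    s = 0                                      # restart at the initial state
    while s != n_states - 1:                   # ... until the absorbing state
        a = random.choice(actions)             # pure exploration while learning
        r, s_next = step(s, a)
        Q_hat[(s, a)] = r + gamma * max(Q_hat[(s_next, a2)] for a2 in actions)
        s = s_next

print(max(Q_hat[(0, a)] for a in actions))     # converges to 100 * 0.9**4 = 65.61
```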
Updating Estimated Q

Assume the robot is in state s₁ and executes a rightward move into state s₂, where its current Q̂ estimates for the available actions are 63, 81, and 100; the reward on this non-goal transition is r = 0:

$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$

Important observation: at each time step (taking action a in state s), only one entry of Q̂ will change: the entry Q̂(s, a).

Notice that if rewards are non-negative, then the Q̂ values only increase from 0 and approach the true Q.
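A one-line check of the arithmetic in the worked update above (the three estimates are taken from the example):

```python
gamma = 0.9
q_s2 = [63, 81, 100]             # current Q_hat(s2, a') estimates from the example
print(0 + gamma * max(q_s2))     # 90.0
```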
Q Learning: Summary

The training set consists of a series of episodes: sequences of (state, action, reward) triples, each ending at an absorbing state.

Each executed action a results in a transition from state s_i to s_j; the algorithm updates Q̂(s_i, a) using the learning rule.

Intuition for a simple grid world with reward only upon entering the goal state: Q estimates improve backwards from the goal state (see the sketch after this list):

1. All Q̂(s, a) start at 0.
2. First episode: only the Q̂(s, a) for the transition leading into the goal state is updated.
3. Next episode: if we go through this next-to-last transition, we update Q̂(s, a) another step back.
4. Eventually the information from transitions with non-zero reward propagates throughout the state-action space.
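A self-contained sketch of that backward propagation on an assumed 4-state chain (states 0–3, goal at 3, reward 100 on entering it, only a "right" action), with one forward sweep per episode:

```python
gamma = 0.9
Q = [0.0, 0.0, 0.0]        # Q_hat(s, right) for s = 0, 1, 2; state 3 is the goal

for episode in range(1, 4):
    for s in range(3):     # walk 0 -> 1 -> 2 -> 3, applying the update at each step
        r = 100 if s == 2 else 0
        next_best = Q[s + 1] if s + 1 < 3 else 0.0   # the goal state has value 0
        Q[s] = r + gamma * next_best
    print(episode, Q)

# 1 [0.0, 0.0, 100.0]   only the transition into the goal updates
# 2 [0.0, 90.0, 100.0]  the information moves one step back
# 3 [81.0, 90.0, 100.0]
```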
Gridworld Demo
Extensions
Non-deterministic reward and actions
Q Learning: Exploration/Exploitation

We have not specified how actions are chosen during learning.

Can choose actions to maximize Q̂(s, a). Good idea?

Can instead employ stochastic action selection (policy):

$$P(a_i \mid s) = \frac{\exp(k \hat{Q}(s, a_i))}{\sum_j \exp(k \hat{Q}(s, a_j))}$$

Can vary k during learning (see the sketch below):

• more exploration early on, shifting towards exploitation later
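A sketch of that Boltzmann/softmax action selection, assuming Q_hat as a dict keyed by (state, action); k plays the role of an inverse temperature:

```python
import math
import random

def softmax_action(Q_hat, s, actions, k):
    """Sample a_i with probability proportional to exp(k * Q_hat(s, a_i))."""
    weights = [math.exp(k * Q_hat[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Small k makes the choice nearly uniform (exploration); large k concentrates on the greedy action (exploitation), so annealing k upward over training gives the schedule described above.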
Non-deterministic Case

What if reward and next state are non-deterministic?

We redefine V and Q as expected values:

$$V^\pi(s) = E_\pi\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \right] = E_\pi\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$

and

$$Q(s, a) = E\left[ r(s, a) + \gamma V^*(\delta(s, a)) \right] = E[r(s, a)] + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')$$
Non-deterministic Case: Learning Q

The deterministic training rule does not converge here (it can keep changing Q̂ even if Q̂ is initialized to the true Q values).

So we modify the training rule so that the estimates change more slowly:

$$\hat{Q}(s, a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$

where s' is the state we land in after taking action a in state s, and a' indexes the actions that can be taken in state s', with

$$\alpha_n = \frac{1}{1 + visits_n(s, a)}$$

where visits_n(s, a) is the number of times action a has been taken in state s.
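A sketch of this slowed-down update with the per-(s, a) visit count; Q_hat as a dict and visits as a defaultdict are assumed representations:

```python
from collections import defaultdict

visits = defaultdict(int)

def nd_q_backup(Q_hat, s, a, r, s_next, actions, gamma=0.9):
    """Running-average backup with alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q_hat[(s_next, a2)] for a2 in actions)
    Q_hat[(s, a)] = (1.0 - alpha) * Q_hat[(s, a)] + alpha * target
```

The decaying alpha averages over the noise in r and s', which is what restores convergence in the non-deterministic setting.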
More Cool Demos
Other examples:

• Super Mario World: https://www.youtube.com/watch?v=L4KBBAwF_bE
• Model-based RL, Pole Balancing: https://www.youtube.com/watch?v=XiigTGKZfks
Learn how to fly a Helicopter
• http://heli.stanford.edu/
• Formulate as an RL problem
• State: position, orientation, velocity, angular velocity
• Actions: front-back pitch, left-right pitch, tail rotor pitch, blade angle
• Dynamics: map actions to states. Difficult!
• Rewards: don't crash; do interesting things.
Slide credit: Nitish Srivastava