Model-Free Methods
TRANSCRIPT

Page 1: Model-Free Methods
Page 2

[Diagram: a tree rooted at state S1, with actions A1, A2, A3 branching to successor states S1, S2, S3; two of the outcome branches carry rewards R = 2 and R = -1.]

Model-based: use all branches.
In model-based methods we update Vπ(S) using all the possible successor states S'. In model-free methods we take a step, and update based on this single sample.
Page 3

[Diagram repeated: the same tree from S1, with actions A1, A2, A3 and rewards R = 2 and R = -1.]

Model-based: use all branches.
In model-free we take a step, and update based on this sample:

V(S1) ← V(S1) + α [r + γ V(S3) − V(S1)]

which has the general form <V> ← <V> + α (V − <V>).
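A minimal sketch of this one-step TD update in code; the step size, discount, and the sampled transition (S1 to S3 with reward 2) are illustrative values, not taken from the slides.

```python
# One-step TD update on a tabular value function (illustrative values).
alpha, gamma = 0.1, 0.9        # learning rate and discount factor
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0}

# One sampled step: from S1 we observed reward r and landed in S3.
s, r, s_next = "S1", 2.0, "S3"
V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])   # V(S1) becomes 0.2
```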
Page 4

[Diagram: from the current state St, taking action A leads to state S1 with reward r1; S2 and S3 are the alternative successors.]

On-line: take an action A, ending at S1.
The update again has the form <V> ← <V> + α (V − <V>).
Page 5

TD Prediction Algorithm
Terminology: prediction -- computing Vπ(S) for a given π.
Prediction error: [r + γV(S') − V(S)]. Expected: V(S); observed: r + γV(S').
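A hedged sketch of TD(0) prediction (policy evaluation) for a fixed policy, driven by the prediction error above; the 3-state chain environment and all constants are invented for illustration.

```python
# TD(0) prediction on a toy chain: states 0 -> 1 -> 2, where state 2 is
# terminal and reaching it pays reward 1. The fixed policy always moves right.
alpha, gamma = 0.1, 1.0
V = {0: 0.0, 1: 0.0, 2: 0.0}          # V(terminal) stays 0

def step(s):
    """One step of the fixed policy: move right; reward 1 on reaching terminal."""
    s_next = s + 1
    return (1.0 if s_next == 2 else 0.0), s_next

for _ in range(500):                   # episodes
    s = 0
    while s != 2:
        r, s_next = step(s)
        delta = r + gamma * V[s_next] - V[s]   # prediction error
        V[s] += alpha * delta                  # move V(s) toward the target
        s = s_next

# With gamma = 1 the true values are V(0) = V(1) = 1, and the estimates
# approach them.
```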
Page 6

[Diagram: from the current state St, taking action A leads to state S1 with reward r1; S2 and S3 are the alternative successors.]

Learning a Policy -- the exploration problem: take an action A, ending at S1.
Update St, then update S1. We may never explore the alternative actions to A.
Page 7

From Value to Action
• Based on V(S), an action can be selected
• 'Greedy' selection is not good enough (select the action A with the current maximum expected future reward)
• Need for 'exploration'
• For example: 'ε-greedy'
• Take the max-return action with probability p = 1−ε, and with probability p = ε one of the other actions
• Can be a more complex decision
• Done here in episodes
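A minimal ε-greedy selector along these lines. This is the common variant in which exploration picks uniformly among all actions; the Q-values shown are placeholders.

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps explore uniformly, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Placeholder Q-values; with eps = 0 the choice is purely greedy.
Q = {("S1", "A1"): 2.0, ("S1", "A2"): -1.0, ("S1", "A3"): 0.5}
a = epsilon_greedy(Q, "S1", ["A1", "A2", "A3"], eps=0.0)  # greedy -> "A1"
```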
Page 8

TD Policy Learning
ε-greedy performs exploration. It can be more complex, e.g. changing ε with time or with conditions.
Page 9

TD 'Actor-Critic'
Terminology: prediction is the same as policy evaluation -- computing Vπ(S); this is the role of the 'critic', while the 'actor' selects the actions.
Motivated by brain modeling.
Page 10

'Actor-critic' scheme -- standard drawing.
Motivated by brain modeling (e.g. the ventral striatum is the critic, the dorsal striatum is the actor).
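A hedged tabular sketch of one actor-critic step (Sutton-and-Barto style, not necessarily the slide's exact scheme): the critic maintains V and computes the TD error δ, and the same δ adjusts the actor's action preferences, from which a softmax policy is derived. All states, rewards, and step sizes below are illustrative.

```python
import math
import random

random.seed(1)
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.9
V = {"S1": 0.0, "S2": 0.0}                      # critic's value table
pref = {("S1", a): 0.0 for a in ("A1", "A2")}   # actor's action preferences

def softmax_policy(s, actions):
    """Sample an action with probability proportional to exp(preference)."""
    weights = [math.exp(pref[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

# One actor-critic step on a made-up transition S1 -> S2 with reward 1.
s, actions = "S1", ("A1", "A2")
a = softmax_policy(s, actions)
r, s_next = 1.0, "S2"
delta = r + gamma * V[s_next] - V[s]   # critic's TD error
V[s] += alpha_v * delta                # critic update
pref[(s, a)] += alpha_p * delta        # actor update: reinforce action a
```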
Page 11
Q-learning
• The main algorithm used for model-free RL
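The standard one-step Q-learning update is Q(S,A) ← Q(S,A) + α [r + γ max over a' of Q(S',a') − Q(S,A)]. A minimal sketch applied to one made-up transition; all numbers are illustrative.

```python
# Tabular Q-learning: a single update from one observed transition.
alpha, gamma = 0.5, 0.9
actions = ("A1", "A2", "A3")
Q = {(s, a): 0.0 for s in ("S1", "S2", "S3") for a in actions}

# Observed transition: in S1 we took A1, received r = 2, and landed in S3.
s, a, r, s_next = "S1", "A1", 2.0, "S3"
best_next = max(Q[(s_next, b)] for b in actions)   # max over a' of Q(S', a')
Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # -> 1.0
```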
Page 12

Q-values (state-action)

[Diagram: the same tree from S1; the branches taken via A1 and A3 are labeled Q(S1, A1) and Q(S1, A3); rewards R = 2 and R = -1.]

Qπ(S,a) is the expected return starting from S, taking the action a, and thereafter following policy π.
Page 13

Q-values (state-action)
• The same update is done on Q-values rather than on V
• Used in most practical algorithms and in some brain models
• Qπ(S,a) is the expected return starting from S, taking the action a, and thereafter following policy π
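Since Qπ(S,a) is defined as an expected return, it can be illustrated with a Monte-Carlo estimate: average the returns of many sampled episodes that start in S, take a, and then follow π. The stochastic toy outcome below is invented: taking A1 in S1 ends the episode with R = 2 half the time and R = 0 otherwise, so Qπ(S1, A1) = 1.0.

```python
import random

random.seed(0)

def sampled_return(a):
    """One sampled episode return after taking action a in S1 (toy model)."""
    if a == "A1":
        return 2.0 if random.random() < 0.5 else 0.0
    return -1.0

# Monte-Carlo estimate of Q_pi(S1, A1) as an average over sampled returns.
n = 100_000
q_estimate = sum(sampled_return("A1") for _ in range(n)) / n   # about 1.0
```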
Page 14

Q-values (state-action)

[Diagram repeated: the tree from S1 with branches labeled Q(S1, A1) and Q(S1, A3), and rewards R = 2 and R = -1.]
Page 15

SARSA
It is called SARSA because it uses s(t), a(t), r(t+1), s(t+1), a(t+1). A step like this uses the current π, so that each S has its a = π(S).
Page 16

SARSA RL Algorithm
ε-greedy: with probability ε, instead of the greedy action, select with equal probability among all actions.
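A hedged sketch of the full SARSA loop on a made-up two-state chain (0 → 1 → terminal 'T', reward 1 on reaching 'T'). Note the update target uses the action a' actually selected by the current ε-greedy policy — the quintuple (S, A, R, S', A') that gives SARSA its name.

```python
import random

random.seed(0)
alpha, gamma, eps = 0.5, 0.9, 0.1
actions = ("left", "right")
Q = {(s, a): 0.0 for s in (0, 1, "T") for a in actions}

def env_step(s, a):
    """Toy dynamics: 'right' advances toward the terminal 'T', which pays 1."""
    s_next = (1 if s == 0 else "T") if a == "right" else 0
    return (1.0 if s_next == "T" else 0.0), s_next

def eps_greedy(s):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

for _ in range(200):                      # episodes
    s = 0
    a = eps_greedy(s)
    while s != "T":
        r, s_next = env_step(s, a)
        a_next = eps_greedy(s_next)       # A' chosen by the same policy
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next

# Q(1, 'right') approaches 1, since the step into the terminal pays 1.
```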
Page 17

On Convergence
• Using episodes:
• Some of the states are 'terminals'
• When the computation reaches a terminal s, it stops
• It re-starts at a new state s according to some probability distribution
• At the starting state, each action has a non-zero probability (exploration)
• As the number of episodes goes to infinity, Q(S,A) will converge to Q*(S,A)
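The convergence claim can be illustrated numerically (this is not a proof): tabular Q-learning with ε-greedy exploration and episodic restarts on a toy chain, compared against analytically known optimal values. The chain (0 → 1 → terminal 'T', reward 1 on reaching 'T') is invented here; for it, Q*(1,'right') = 1 and Q*(0,'right') = γ · 1 = 0.9.

```python
import random

random.seed(0)
alpha, gamma, eps = 0.5, 0.9, 0.2
actions = ("left", "right")
Q = {(s, a): 0.0 for s in (0, 1, "T") for a in actions}

def env_step(s, a):
    s_next = (1 if s == 0 else "T") if a == "right" else 0
    return (1.0 if s_next == "T" else 0.0), s_next

for _ in range(500):                      # each episode restarts at state 0
    s = 0
    while s != "T":                       # stop on reaching the terminal
        # epsilon-greedy: every action keeps a non-zero probability
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next = env_step(s, a)
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        s = s_next

# Q(1,'right') and Q(0,'right') end up very close to Q* = 1 and 0.9.
```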