Policy estimate from training episodes
We have:
- an unknown square world of unknown size and structure
- a robot/agent that moves in unknown directions with unknown parameters
→ We do not know anything
- only a few episodes the robot has tried
What to do?
A: Run away :-)
B: Examine episodes and learn
C: Guess
D: Try something
Example I
| Episode 1 | Episode 2 | Episode 3 | Episode 4 |
| --- | --- | --- | --- |
| (B,→,C,−3) | (B,←,A,−1) | (C,→,D,−3) | (C,←,B,−1) |
| (C,→,D,−3) | (A,→,exit,6) | (D,→,exit,6) | (B,→,C,−3) |
| (D,←,exit,6) | | | (C,←,B,−1) |
| | | | (B,←,A,−1) |
| | | | (A,←,exit,6) |
Each field in the table is a tuple (s, a, s′, r); the discount factor γ = 1 is known.
Task: for the non-terminal states, determine the optimal policy. Use model-based learning.
What do we have to learn (model-based learning)?
A: policy π
B: state set S, policy π
C: state set S, action set A, transition model p(s′|s, a)
D: state set S, action set A, rewards r, transition model p(s′|s, a)
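The episodes are plain sequences of (s, a, s′, r) tuples, so they can be encoded directly. A minimal Python sketch of the training data (the variable name `episodes` is illustrative, not from the slides; the sketches further below reuse it):

```python
# The four training episodes from the table above, each a list of
# (state, action, next_state, reward) tuples; "exit" marks leaving the world.
episodes = [
    [("B", "→", "C", -3), ("C", "→", "D", -3), ("D", "←", "exit", 6)],  # Episode 1
    [("B", "←", "A", -1), ("A", "→", "exit", 6)],                       # Episode 2
    [("C", "→", "D", -3), ("D", "→", "exit", 6)],                       # Episode 3
    [("C", "←", "B", -1), ("B", "→", "C", -3), ("C", "←", "B", -1),
     ("B", "←", "A", -1), ("A", "←", "exit", 6)],                       # Episode 4
]
```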
What is the state set S?
A: S = {B,C}
B: S = {A,B,C,D, exit}
C: S = {A,B,C,D}
D: S = {A,D}
State set S = {A,B,C,D}. What are the terminal states?
A: {A,B,C,D}
B: {A,D}
C: {B,C}
D: {A,C,D}
Terminal states: {A,D}. What are the non-terminal states?
A: {A,B,C,D}
B: {A,D}
C: {B,C}
D: {A,B,C}
Non-terminal states: {B,C}
What is the action set?
A: {→,←}
B: {→,←,↑,↓}
C: {→,←,↑}
D: {→,←,↓}
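All of the sets asked about so far can be read off mechanically from the tuples. A sketch reusing the `episodes` list above (the helper name `extract_sets` is mine, not the slides'):

```python
def extract_sets(episodes):
    """Recover state set, action set and terminal states from raw episodes."""
    states, actions, terminals = set(), set(), set()
    for episode in episodes:
        for s, a, s_next, r in episode:
            states.add(s)
            actions.add(a)
            if s_next == "exit":       # a state we exit from is terminal
                terminals.add(s)
    return states, actions, terminals

S, A, T = extract_sets(episodes)
print(sorted(S))    # ['A', 'B', 'C', 'D']
print(sorted(A))    # ['←', '→']
print(sorted(T))    # ['A', 'D'] terminal; S - T = {'B', 'C'} non-terminal
```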
Action set A = {→,←}
What is the transition model?
A: deterministic
B: non-deterministic
Let’s examine :-)
How to compute?
A: for each state and action
B: for each state, action and new state
C: for each state
D: for each action and new state
1. for each state, action and new state
2. A: as relative frequencies in one episode
   B: as sum of occurrences in one episode
   C: as relative frequencies in all episodes
   D: as sum of occurrences in all episodes
1. for each state, action and new state
2. as relative frequencies in all episodes
Evaluate p(C|B,→):
A: 1
B: 2/3
C: 1/2
D: 1/3
A: 1 = #(B,→,C,·) / #(B,→,·,·) = 2/2
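Counting over all episodes turns this into a few lines of code. A sketch of the relative-frequency estimate (again reusing `episodes`; `defaultdict` just keeps the counting short):

```python
from collections import defaultdict

def estimate_transition_model(episodes):
    """Estimate p(s'|s,a) as #(s,a,s') / #(s,a,·) over all episodes."""
    sa_counts = defaultdict(int)    # #(s,a,·): how often action a was taken in s
    sas_counts = defaultdict(int)   # #(s,a,s'): how often that led to s'
    for episode in episodes:
        for s, a, s_next, r in episode:
            sa_counts[(s, a)] += 1
            sas_counts[(s, a, s_next)] += 1
    return {(s, a, s_next): n / sa_counts[(s, a)]
            for (s, a, s_next), n in sas_counts.items()}

p = estimate_transition_model(episodes)
print(p[("B", "→", "C")])   # 2/2 = 1.0
```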
- p(C|B,→) = 2/2 = 1
- p(A|B,←) = 2/2 = 1
- p(D|C,→) = 2/2 = 1
- p(B|C,←) = 2/2 = 1
A: non-deterministic
B: deterministic
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
What is the world structure?
A: A C B D
B: A B C D
C: B A C D
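The layout can be read off the learned deterministic transitions: each observed transition states a left/right adjacency, e.g. (B,→,C) and (C,←,B) both say C is right of B. A sketch (the adjacency map is transcribed by hand from the four learned transitions):

```python
# Left-to-right adjacencies implied by the learned model:
# (B,←,A) puts B right of A; (B,→,C) puts C right of B; (C,→,D) puts D right of C.
right_neighbour = {"A": "B", "B": "C", "C": "D"}

# A is never anyone's right neighbour, so it is the left end; walk rightwards.
cell, order = "A", ["A"]
while cell in right_neighbour:
    cell = right_neighbour[cell]
    order.append(cell)
print(order)   # ['A', 'B', 'C', 'D'] → option B
```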
World structure: A B C D
What is a correct value for the reward function?
A: r(B) = −1
B: r(B,←,A) = −4
C: r(B) = −3
D: r(B,←) = −1
- r(B,←) = −1
What is also correct for the reward function?
A: r(B) = −1
B: r(B,→) = −3
C: r(B) = −3
D: r(B,→,C) = −1
- r(B,→) = −3
What is also correct for the reward function?
A: r(C) = −1
B: r(C,←,B) = −3
C: None
D: r(C,←) = −1
- r(C,←) = −1
What is also correct for the reward function?
A: r(C) = −1
B: r(C,→) = −3
C: r(C) = −3
D: r(C,→,D) = −4
- r(C,→) = −3
Discussion point: do we need more reward values?
A: Yes, for all states and actions.
B: No.
C: Yes, for terminal states.
Reward function: r({B,C},←) = −1, r({B,C},→) = −3
Add also the terminal state rewards: r({A,D}, {←,→}) = 6
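Estimating the rewards works exactly like the transition counts; since each observed (s, a) pair always produced the same reward here, averaging the observed rewards recovers the table above. A sketch under the same assumptions as the earlier ones:

```python
from collections import defaultdict

def estimate_rewards(episodes):
    """Average the observed reward for each (state, action) pair."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        for s, a, s_next, r in episode:
            totals[(s, a)] += r
            counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

r = estimate_rewards(episodes)
print(r[("B", "←")], r[("B", "→")])   # -1.0 -3.0
print(r[("A", "←")], r[("D", "→")])   #  6.0  6.0
```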
Do we have all we need?
A: Yes
B: No
Let’s compute the policy.
Observation: Immediate rewards significantly decrease state value.
A: Best is to go directly to terminal state
B: We can go to the terminal state arbitrarily
→ Best is to go directly to a terminal state.
Compute:
A: q(B,←) = 5
B: q(B,←) = 3
C: q(B,←) = −1
D: q(B,←) = −3
A: q(B,←) = B ← A = 6 − 1 = 5
- q(B,←) = 5
(What can we assume about π(C)?)
A: q(B,→) = 5
B: q(B,→) = 3
C: q(B,→) = 0
C: q(B,→) = B → C → D = 6 − 3 − 3 = 0
→ π(B) = ←
Compute now π(C):
A: q(C,→) = 5
B: q(C,→) = 3
C: q(C,→) = 0
D: q(C,→) = −3
B: q(C,→) = C → D = 6 − 3 = 3
- q(C,→) = 3
A: q(C,←) = 4
B: q(C,←) = 3
C: q(C,←) = 0
A: q(C,←) = C ← B ← A = 6 − 1 − 1 = 4
- q(C,←) = 4
→ π(C) = ←
Solution:

I π(B) = ←
I π(C) = ←
Example II

| Episode 1 | Episode 2 | Episode 3 | Episode 4 | Episode 5 | Episode 6 | Episode 7 | Episode 8 |
|---|---|---|---|---|---|---|---|
| (B,→,C,−3) | (B,←,A,−1) | (C,→,D,−3) | (C,←,B,−1) | (B,←,C,−3) | (B,→,A,−1) | (C,→,B,−1) | (C,→,D,−3) |
| (C,→,D,−3) | (A,→,exit,6) | (D,→,exit,6) | (B,→,C,−3) | (C,←,B,−1) | (A,→,exit,6) | (B,→,C,−3) | (D,→,exit,6) |
| (D,←,exit,6) |  |  | (C,←,B,−1) | (B,←,A,−1) |  | (C,←,D,−3) |  |
|  |  |  | (B,←,A,−1) | (A,←,exit,6) |  | (D,←,exit,6) |  |
|  |  |  | (A,←,exit,6) |  |  |  |  |
Calculating policy:

I state set S,
I action set A,
I rewards r,
I transition model p(s′|s, a),
I policy π
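To make these steps concrete, here is a minimal Python sketch (the episode encoding and all names are mine, not from the slides; "R" and "L" stand for → and ←) that recovers the state set, the action set, the observed rewards, and a counting-based transition model from the Example II episodes:

```python
from collections import defaultdict

# Example II episodes as lists of (s, a, s2, r) steps;
# "R"/"L" stand for the arrows, and "exit" ends an episode.
episodes = [
    [("B", "R", "C", -3), ("C", "R", "D", -3), ("D", "L", "exit", 6)],
    [("B", "L", "A", -1), ("A", "R", "exit", 6)],
    [("C", "R", "D", -3), ("D", "R", "exit", 6)],
    [("C", "L", "B", -1), ("B", "R", "C", -3), ("C", "L", "B", -1),
     ("B", "L", "A", -1), ("A", "L", "exit", 6)],
    [("B", "L", "C", -3), ("C", "L", "B", -1), ("B", "L", "A", -1),
     ("A", "L", "exit", 6)],
    [("B", "R", "A", -1), ("A", "R", "exit", 6)],
    [("C", "R", "B", -1), ("B", "R", "C", -3), ("C", "L", "D", -3),
     ("D", "L", "exit", 6)],
    [("C", "R", "D", -3), ("D", "R", "exit", 6)],
]

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s2] = occurrences
rewards = {}                                    # rewards[(s, a, s2)] = observed r
states, actions = set(), set()

for episode in episodes:
    for s, a, s2, r in episode:
        states.add(s)
        actions.add(a)
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Maximum-likelihood transition model: relative frequency of each successor.
p = {sa: {s2: n / sum(succ.values()) for s2, n in succ.items()}
     for sa, succ in counts.items()}

print(sorted(states))   # ['A', 'B', 'C', 'D']
print(p[("B", "R")])    # {'C': 0.75, 'A': 0.25}
print(p[("C", "L")])    # {'B': 0.75, 'D': 0.25}
```

The printed frequencies anticipate the answer derived on the following slides: each action succeeds in 3 of its 4 observed occurrences.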
What is the transition model?
A: deterministic
B: non-deterministic
Which transition probability is correct?

A: p(C|B,→) = 0.75
B: p(A|B,→) = 0.75
C: p(A|B,←) = 0.25
D: p(D|B,←) = 0.75
Answer A: p(C|B,→) = 0.75 — see the episodes: (B,→) occurs 4 times, three of which lead to C and one to A; thus also p(A|B,→) = 0.25.

Transition model: similarly for the other probabilities. The agent follows the given direction with probability 0.75; otherwise, it goes in the other direction.
What is the reward function?

A: r(B,→,C) = −3
B: p(B,→,A) = −3
C: p(B,←,A) = −3
D: p(B,←,C) = −3
Answer A: r(B,→,C) = −3 — landing in C always yields −3, landing in A yields −1.
Result:

I States: S = {A,B,C,D}, terminal = {A,D}, non-terminal = {B,C}
I Action set: {←,→}
I Rewards: r(B,{←,→},C) = −3, r(B,{←,→},A) = −1, r(C,{←,→},B) = −1, r(C,{←,→},D) = −3
I World structure: A – B – C – D
I Transition model: the agent follows the given direction with probability 0.75; otherwise, it goes in the other direction
I Policy: π(B) = ?, π(C) = ?
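The q-values evaluated below are instances of standard fixed-policy evaluation; restated in the slides' notation (γ = 1, and the terminal values V(A) = V(D) = 6 are just the exit reward):

```latex
q\bigl(s,\pi(s)\bigr) \;=\; \sum_{s'} p\bigl(s' \mid s,\pi(s)\bigr)
\,\bigl[\, r\bigl(s,\pi(s),s'\bigr) + V(s') \,\bigr],
\qquad V(s) = q\bigl(s,\pi(s)\bigr) \ \text{for non-terminal } s.
```

For example, q(B,←) = 0.75 · (−1 + 6) + 0.25 · (−3 + V(C)), which is exactly the shape of the computations that follow.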
Policy evaluation — the four candidate policies (π(B), π(C)):

←,→: q(B,←) = ?, q(C,→) = ?
→,→: q(B,→) = ?, q(C,→) = ?
→,←: q(B,→) = ?, q(C,←) = ?
←,←: q(B,←) = ?, q(C,←) = ?
A single policy computation:

←,→: q(B,←) = ?, q(C,→) = ?

A: q(B,←) = .5 · (−1) + .5 · (−3), q(C,→) = .5 · (−1) + .5 · (−3)
B: q(B,←) = .25 · (6 − 1) + .75 · (−3 + V(C)), q(C,→) = .25 · (−1) + .75 · (−3 + V(B))
C: q(B,←) = .75 · (6 − 1) + .25 · (−3 + V(C)), q(C,→) = .75 · (−3 + 6) + .25 · (−1 + V(B))
D: q(B,←) = .75 · (6 − 1) + .25 · (−3), q(C,→) = .5 · (−1) + .25 · (−3)
Answer C: with probability .75 the intended move succeeds (from B, ← reaches A: reward −1 plus the exit reward 6; from C, → reaches D: reward −3 plus 6), and with probability .25 the agent slips to the other neighbor and collects that reward plus the neighbor's value.
A single policy computation. As the policy is fixed, V(B) = q(B,←) and V(C) = q(C,→):

I q(B,←) = .75 · (6 − 1) + .25 · (−3 + q(C,→))
I q(C,→) = .75 · (−3 + 6) + .25 · (−1 + q(B,←))

Therefore:

I q(B,←) = .75 · 5 + .25 · (−3 + .75 · 3 + .25 · (−1 + q(B,←))) = … ≈ 3.73
I q(C,→) = .75 · 3 + .25 · (−1 + 3.73) ≈ 2.93

And we calculate likewise for the remaining policies.
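To reproduce this computation, and the full table below, here is a minimal Python sketch (all names mine) that solves the two coupled equations for each fixed policy by fixed-point iteration, with γ = 1 and the exit values V(A) = V(D) = 6 folded in:

```python
# Intended successor for each (state, action); the slip goes the other way.
INTENDED = {("B", "L"): "A", ("B", "R"): "C",
            ("C", "L"): "B", ("C", "R"): "D"}
OTHER = {"L": "R", "R": "L"}
# Rewards depend on the landing state: -1 for landing in A or B, -3 for C or D.
REWARD = {("B", "A"): -1, ("B", "C"): -3,
          ("C", "B"): -1, ("C", "D"): -3}

def evaluate(policy, sweeps=50):
    """Iterate the two fixed-policy equations until the values settle."""
    V = {"A": 6.0, "D": 6.0, "B": 0.0, "C": 0.0}  # terminals carry the exit reward
    for _ in range(sweeps):
        for s in ("B", "C"):
            a = policy[s]
            hit = INTENDED[(s, a)]             # intended move, probability 0.75
            miss = INTENDED[(s, OTHER[a])]     # slip to the other side, 0.25
            V[s] = (0.75 * (REWARD[(s, hit)] + V[hit])
                    + 0.25 * (REWARD[(s, miss)] + V[miss]))
    return V["B"], V["C"]

for pi_B, pi_C in (("L", "R"), ("R", "R"), ("R", "L"), ("L", "L")):
    q_B, q_C = evaluate({"B": pi_B, "C": pi_C})
    print(pi_B, pi_C, round(q_B, 2), round(q_C, 2))
# L R 3.73 2.93 / R R 0.62 2.15 / R L -2.29 -1.71 / L L 3.69 2.77
```

Since each slip term carries a factor 0.25, the iteration contracts quickly; far fewer than 50 sweeps would already suffice.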
←,→: q(B,←) ≈ 3.73, q(C,→) ≈ 2.93
→,→: q(B,→) ≈ 0.62, q(C,→) ≈ 2.15
→,←: q(B,→) ≈ −2.29, q(C,←) ≈ −1.71
←,←: q(B,←) ≈ 3.69, q(C,←) ≈ 2.77

And we can determine the best policy: π(B) = ←, π(C) = →.