Structural Return Maximization for Reinforcement Learning
Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy
How should we act in the presence of complex, unknown dynamics?
What do I mean by complex dynamics?
• Can’t derive from first principles / intuition
• Any dynamics model will be approximate
• Limited data
– Otherwise just do nearest neighbors
• Batch data
– Trying to keep it as simple as possible for now
– Fairly straightforward to extend to active learning
How does RL solve these problems?
• Assume some representation class for:
– Dynamics model
– Value function
– Policy
• Collect some data
• Find the “best” representation based on the data
How does RL solve these problems?
• The “best” representation based on the data
• This defines the best policy… not the best representation
[Figure: value (return) as a function of the policy, the starting state, the reward, and the unknown dynamics model]
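In standard notation (a reconstruction of the slide's diagram, not text taken verbatim from it), the value (return) being discussed is the expected sum of rewards under the unknown dynamics:

```latex
V(\pi) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 \sim p_0,\; a_t = \pi(s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)\right]
```

where $p_0$ is the starting-state distribution, $r$ the reward, and $p$ the unknown dynamics model.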
…but does RL actually solve this problem?
• Policy Search
– Policy directly parameterized by
– True return replaced by an empirical estimate averaged over a number of episodes
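A minimal sketch of this empirical return estimate: average the summed reward over several episodes. The environment (a noisy 1-D chain) and the policy here are illustrative stand-ins, not the talk's specific setup:

```python
import random

random.seed(0)  # reproducible rollouts for this toy example

def rollout(policy, step, init_state, horizon):
    """Run one episode and return its summed reward."""
    s, total = init_state(), 0.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total += r
    return total

def empirical_return(policy, step, init_state, horizon, n_episodes):
    """Average return over n_episodes -- the empirical estimate on the slide."""
    return sum(rollout(policy, step, init_state, horizon)
               for _ in range(n_episodes)) / n_episodes

# Toy 1-D chain: the action +1/-1 nudges the state; reward is -|state|.
def init_state():
    return 0

def step(s, a):
    s2 = s + a + random.choice([-1, 0, 1])  # noisy, "unknown" dynamics
    return s2, -abs(s2)

go_home = lambda s: -1 if s > 0 else 1  # a simple hand-written policy
est = empirical_return(go_home, step, init_state, horizon=20, n_episodes=500)
```

Policy search then maximizes `est` over the policy's parameters; with few episodes the estimate is noisy, which is exactly the limited-data concern raised later in the talk.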
…but does RL actually solve this problem?
• Model-based RL
– Dynamics model =
Maximizing likelihood != maximizing return
…similar story for value-based methods
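The mismatch can be written out in standard notation (a reconstruction, not taken from the slides): model-based RL typically fits the dynamics parameters by maximum likelihood, while what we care about is the return of the resulting policy:

```latex
\hat{\theta}_{\mathrm{ML}} \;=\; \arg\max_{\theta} \sum_{i} \log p_{\theta}\!\left(s'_i \mid s_i, a_i\right)
\qquad \text{vs.} \qquad
\theta^{\star} \;=\; \arg\max_{\theta}\; V\!\left(\pi^{\star}_{p_{\theta}}\right)
```

where $\pi^{\star}_{p_{\theta}}$ is the optimal policy under the model $p_{\theta}$. When the model class is misspecified, nothing forces these two maximizers to coincide.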
ML model selection in RL
• So why do we do it?
– It’s easy
– It sometimes works really well
– Intuitively it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
– Chooses an “average” model based on the data
– Ignores the reward function
• What do we do then?
Our Approach
• Model-based RL
– Dynamics model =
– Chosen using an empirical estimate of the return
Planning with Misspecified Model Classes
We can do the same thing in a value-based setting.
…but
• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be “small”
• Small = less data?
– Intuitively you’d think so
– Empirical evidence from toy problems
• But all of our guarantees rely on infinite data
• …maybe there’s a way to be more concrete
What we want
• How does the representation space relate to true return?
• …they’ve been doing this in classification since the 1960s
– Relationship between the “size” of the representation space and the amount of data
How to get there
• Model-based, value-based, policy search
• Map RL to classification → Empirical Risk Minimization
• Measuring function class size → Bound on true risk
• Structure of function classes → Structural risk minimization
Classification
• Example: a linear classifier in 2-D,
f([x1, x2]) = sign([θ1, θ2]^T [x1, x2]) = sign(θ1·x1 + θ2·x2)
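A minimal sketch of that decision rule in code (the particular θ is just an illustration):

```python
def linear_classifier(theta, x):
    """f(x) = sign(theta^T x): the 2-D linear classifier from the slide."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1 if score > 0 else -1  # break ties toward -1

# theta = (1, -1) labels points by whether x1 > x2
assert linear_classifier((1.0, -1.0), (2.0, 0.5)) == 1
assert linear_classifier((1.0, -1.0), (0.5, 2.0)) == -1
```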
• Risk: the expected loss (cost) under the unknown data distribution,
R(f) = E_{(x,y)~P}[ L(f(x), y) ]
Empirical Risk Minimization
• Replace the unknown data distribution with the N samples and minimize the empirical estimate:
R̂(f) = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i)
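Concretely, the empirical risk is just the average loss over the samples. A sketch with 0-1 loss (the loss, data, and fixed classifier below are illustrative choices, not the talk's):

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(f, data, loss=zero_one_loss):
    """R_hat(f) = (1/N) * sum_i loss(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Three labeled points and a fixed linear rule; it misses the third point.
data = [((2.0, 0.5), 1), ((0.5, 2.0), -1), ((1.0, 3.0), 1)]
f = lambda x: 1 if x[0] - x[1] > 0 else -1
risk = empirical_risk(f, data)  # 1 mistake out of 3 -> 1/3
```

ERM searches the function class for the f minimizing this quantity.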
Mapping RL to Classification
Measuring the size of a function class: VC Dimension
• Introduces a notion of “shattering”:
– I pick the inputs
– You pick the labels
– VC Dim = max number of points I can perfectly decide
– e.g., VC Dim = 3 for the class of 2-D linear separators pictured on the slide
• Magically, shattering (VC Dim) can be used to bound the true risk
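The shattering game can be played out in code. Below, a perceptron (with a bias term, so the separators are arbitrary lines rather than lines through the origin -- an assumption beyond the slide's sign(θᵀx) example) perfectly realizes all 2³ = 8 labelings of three non-collinear points, consistent with VC Dim = 3 for 2-D linear classifiers with offset:

```python
from itertools import product

def perceptron_separates(points, labels, epochs=100):
    """Return True if the perceptron finds a perfect linear separator."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != y:  # standard perceptron update on a mistake
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True  # converged: this labeling is linearly separable
    return False

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # non-collinear
shattered = sum(perceptron_separates(points, labels)
                for labels in product([-1, 1], repeat=3))
# shattered counts how many of the 8 labelings the class realizes
```

By the perceptron convergence theorem, every separable labeling is found within the epoch budget, so `shattered` comes out to 8 here; no set of four points can be fully labeled by lines (the XOR labeling fails), which is why the VC dimension stops at 3.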
For those of you familiar with statistical learning theory…
• VC Dim
– Only known for a few function classes
– Difficult to estimate or bound
• Rademacher complexity
– Use the data to estimate the “volume” of the function class
– This volume can then be used in a similar bound
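For a finite function class, the empirical Rademacher complexity can be estimated directly from data: draw random ±1 signs σ and measure how well the class can correlate with them. A sketch (the tiny classes and data below are illustrative; real uses involve richer classes with the supremum handled analytically):

```python
import random

def empirical_rademacher(functions, xs, n_draws=500, seed=0):
    """Estimate E_sigma[ sup_f (1/N) sum_i sigma_i f(x_i) ] by sampling."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice([-1, 1]) for _ in range(n)]
        total += max(sum(s * f(x) for s, x in zip(sigma, xs)) / n
                     for f in functions)
    return total / n_draws

xs = [-2.0, -1.0, 1.0, 2.0]
# Two tiny classes: a single constant function vs. four threshold rules.
small = [lambda x: 1.0]
big = [lambda x, t=t: 1.0 if x > t else -1.0
       for t in (-3.0, -1.5, 0.0, 1.5)]
r_small = empirical_rademacher(small, xs)
r_big = empirical_rademacher(big, xs)
```

The richer class can chase the random signs better, so `r_big` exceeds `r_small`: larger "volume", larger complexity term in the bound.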
Measuring the size of a function class
• Now we can say concrete things about why we may prefer one representation over another with limited data
Empirical Risk Minimization and Limited Data
• But if we have limited data, we cannot expect small empirical risk to result in small true risk
• If the bound is large, we cannot expect small empirical risk to result in small true risk
• …so what do we do?
• Choose the function class which minimizes the bound!
Structural Risk Minimization
• Using a “structure” of function classes
• For N data, we choose the function class that minimizes the risk bound
Many natural structures of policy classes!
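The selection step can be sketched concretely: fit each class in the structure by empirical risk minimization, then pick the class minimizing empirical risk plus a complexity penalty. The penalty below is the standard VC-type confidence term; the nested classes, sample size, and empirical risks are hypothetical numbers for illustration:

```python
import math

def vc_penalty(h, n, delta=0.05):
    """Classic VC confidence term: sqrt((h*(ln(2n/h)+1) + ln(4/delta)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

def srm_select(classes, n):
    """classes: [(vc_dim, empirical_risk_of_ERM_solution), ...] ordered by
    increasing vc_dim. Returns the index minimizing the risk bound."""
    bounds = [emp + vc_penalty(h, n) for h, emp in classes]
    return min(range(len(classes)), key=lambda i: bounds[i])

# Richer classes fit the data better but pay a larger complexity penalty.
classes = [(2, 0.30), (10, 0.10), (100, 0.05)]
best = srm_select(classes, n=1000)  # picks the middle class here
```

With N = 1000 samples the middle class wins the trade-off; as N grows, the penalty shrinks and SRM drifts toward richer classes, which is the concrete data-vs-representation-size relationship the earlier slides asked for.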
Is this Bayesian?
• Prior knowledge
– Structure encodes prior knowledge
• Robust to over-fitting
– Choose the function class based on the risk bound
• No Bayes update
• No assumption that the true function is somewhere in the structure
– Breaks most (all?) Bayesian nonparametrics
Contribution
• Classification-to-RL mapping
• Transferred probabilistic bounds from statistical learning theory to RL
• Applied structural risk minimization to RL
Backup Slides
From last time…
{𝒎𝒄 ,𝒎𝒑 ,𝒍 }
≈?
Measuring the size of a function class
• Rademacher complexity