Structural Return Maximization for Reinforcement Learning
Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy
How should we act in the presence of complex, unknown dynamics?
What do I mean by complex dynamics?
• Can’t derive from first principles / intuition
• Any dynamics model will be approximate
• Limited data
– Otherwise just do nearest neighbors
• Batch data
– Trying to keep it as simple as possible for now
– Fairly straightforward to extend to active learning
How does RL solve these problems?
• Assume some representation class for:
– Dynamics model
– Value function
– Policy
• Collect some data
• Find the “best” representation based on the data
How does RL solve these problems?
• The “best” representation based on the data
• This defines the best policy… not the best representation
[Figure: value (return) as a function of the policy, the starting state, the reward, and the unknown dynamics model]
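In standard notation (a reconstruction of the slide's diagram, not text taken verbatim from it), the value (return) being discussed is the expected sum of rewards under the unknown dynamics:

```latex
V(\pi) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 \sim p_0,\; a_t = \pi(s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)\right]
```

where $p_0$ is the starting-state distribution, $r$ the reward, and $p$ the unknown dynamics model.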
…but does RL actually solve this problem?
• Policy Search
– Policy directly parameterized by
– True return replaced by an empirical estimate averaged over a number of episodes
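A minimal sketch of this empirical return estimate: average the summed reward over several episodes. The environment (a noisy 1-D chain) and the policy here are illustrative stand-ins, not the talk's specific setup:

```python
import random

random.seed(0)  # reproducible rollouts for this toy example

def rollout(policy, step, init_state, horizon):
    """Run one episode and return its summed reward."""
    s, total = init_state(), 0.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total += r
    return total

def empirical_return(policy, step, init_state, horizon, n_episodes):
    """Average return over n_episodes -- the empirical estimate on the slide."""
    return sum(rollout(policy, step, init_state, horizon)
               for _ in range(n_episodes)) / n_episodes

# Toy 1-D chain: the action +1/-1 nudges the state; reward is -|state|.
def init_state():
    return 0

def step(s, a):
    s2 = s + a + random.choice([-1, 0, 1])  # noisy, "unknown" dynamics
    return s2, -abs(s2)

go_home = lambda s: -1 if s > 0 else 1  # a simple hand-written policy
est = empirical_return(go_home, step, init_state, horizon=20, n_episodes=500)
```

Policy search then maximizes `est` over the policy's parameters; with few episodes the estimate is noisy, which is exactly the limited-data concern raised later in the talk.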
…but does RL actually solve this problem?
• Model-based RL
– Dynamics model =
Maximizing likelihood != maximizing return
…similar story for value-based methods
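The mismatch can be written out in standard notation (a reconstruction, not taken from the slides): model-based RL typically fits the dynamics parameters by maximum likelihood, while what we care about is the return of the resulting policy:

```latex
\hat{\theta}_{\mathrm{ML}} \;=\; \arg\max_{\theta} \sum_{i} \log p_{\theta}\!\left(s'_i \mid s_i, a_i\right)
\qquad \text{vs.} \qquad
\theta^{\star} \;=\; \arg\max_{\theta}\; V\!\left(\pi^{\star}_{p_{\theta}}\right)
```

where $\pi^{\star}_{p_{\theta}}$ is the optimal policy under the model $p_{\theta}$. When the model class is misspecified, nothing forces these two maximizers to coincide.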
ML model selection in RL
• So why do we do it?
– It’s easy
– It sometimes works really well
– Intuitively it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
– Chooses an “average” model based on the data
– Ignores the reward function
• What do we do then?
Our Approach
• Model-based RL
– Dynamics model =
– Chosen using an empirical estimate of the return
Planning with Misspecified Model Classes
We can do the same thing in a value-based setting.
…but
• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be “small”
• Small = less data?
– Intuitively you’d think so
– Empirical evidence from toy problems
• But all of our guarantees rely on infinite data
• …maybe there’s a way to be more concrete
What we want
• How does the representation space relate to true return?
• …they’ve been doing this in classification since the 1960s
– Relationship between the “size” of the representation space and the amount of data
How to get there
• Model-based, value-based, policy search
• Map RL to classification → Empirical Risk Minimization
• Measuring function class size → Bound on true risk
• Structure of function classes → Structural risk minimization
Classification
• Example: a linear classifier in 2-D,
f([x1, x2]) = sign([θ1, θ2]^T [x1, x2]) = sign(θ1·x1 + θ2·x2)
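A minimal sketch of that decision rule in code (the particular θ is just an illustration):

```python
def linear_classifier(theta, x):
    """f(x) = sign(theta^T x): the 2-D linear classifier from the slide."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1 if score > 0 else -1  # break ties toward -1

# theta = (1, -1) labels points by whether x1 > x2
assert linear_classifier((1.0, -1.0), (2.0, 0.5)) == 1
assert linear_classifier((1.0, -1.0), (0.5, 2.0)) == -1
```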
• Risk: the expected loss (cost) under the unknown data distribution,
R(f) = E_{(x,y)~P}[ L(f(x), y) ]
Empirical Risk Minimization
• Replace the unknown data distribution with the N samples and minimize the empirical estimate:
R̂(f) = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i)
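Concretely, the empirical risk is just the average loss over the samples. A sketch with 0-1 loss (the loss, data, and fixed classifier below are illustrative choices, not the talk's):

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(f, data, loss=zero_one_loss):
    """R_hat(f) = (1/N) * sum_i loss(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Three labeled points and a fixed linear rule; it misses the third point.
data = [((2.0, 0.5), 1), ((0.5, 2.0), -1), ((1.0, 3.0), 1)]
f = lambda x: 1 if x[0] - x[1] > 0 else -1
risk = empirical_risk(f, data)  # 1 mistake out of 3 -> 1/3
```

ERM searches the function class for the f minimizing this quantity.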
Mapping RL to Classification
Measuring the size of a function class: VC Dimension
• Introduces a notion of “shattering”:
– I pick the inputs
– You pick the labels
– VC Dim = max number of points I can perfectly decide
– e.g., VC Dim = 3 for the class of 2-D linear separators pictured on the slide
• Magically, shattering (VC Dim) can be used to bound the true risk
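The shattering game can be played out in code. Below, a perceptron (with a bias term, so the separators are arbitrary lines rather than lines through the origin -- an assumption beyond the slide's sign(θᵀx) example) perfectly realizes all 2³ = 8 labelings of three non-collinear points, consistent with VC Dim = 3 for 2-D linear classifiers with offset:

```python
from itertools import product

def perceptron_separates(points, labels, epochs=100):
    """Return True if the perceptron finds a perfect linear separator."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != y:  # standard perceptron update on a mistake
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True  # converged: this labeling is linearly separable
    return False

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # non-collinear
shattered = sum(perceptron_separates(points, labels)
                for labels in product([-1, 1], repeat=3))
# shattered counts how many of the 8 labelings the class realizes
```

By the perceptron convergence theorem, every separable labeling is found within the epoch budget, so `shattered` comes out to 8 here; no set of four points can be fully labeled by lines (the XOR labeling fails), which is why the VC dimension stops at 3.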
For those of you familiar with statistical learning theory…
• VC Dim
– Only known for a few function classes
– Difficult to estimate or bound
• Rademacher complexity
– Use the data to estimate the “volume” of the function class
– This volume can then be used in a similar bound
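For a finite function class, the empirical Rademacher complexity can be estimated directly from data: draw random ±1 signs σ and measure how well the class can correlate with them. A sketch (the tiny classes and data below are illustrative; real uses involve richer classes with the supremum handled analytically):

```python
import random

def empirical_rademacher(functions, xs, n_draws=500, seed=0):
    """Estimate E_sigma[ sup_f (1/N) sum_i sigma_i f(x_i) ] by sampling."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice([-1, 1]) for _ in range(n)]
        total += max(sum(s * f(x) for s, x in zip(sigma, xs)) / n
                     for f in functions)
    return total / n_draws

xs = [-2.0, -1.0, 1.0, 2.0]
# Two tiny classes: a single constant function vs. four threshold rules.
small = [lambda x: 1.0]
big = [lambda x, t=t: 1.0 if x > t else -1.0
       for t in (-3.0, -1.5, 0.0, 1.5)]
r_small = empirical_rademacher(small, xs)
r_big = empirical_rademacher(big, xs)
```

The richer class can chase the random signs better, so `r_big` exceeds `r_small`: larger "volume", larger complexity term in the bound.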
Measuring the size of a function class
• Now we can say concrete things about why we may prefer one representation over another with limited data
Empirical Risk Minimization and Limited Data
• But if we have limited data, we cannot expect small empirical risk to result in small true risk
• If the bound is large, we cannot expect small empirical risk to result in small true risk
• …so what do we do?
• Choose the function class which minimizes the bound!
Structural Risk Minimization
• Using a “structure” of function classes
• For N data, we choose the function class that minimizes the risk bound
Many natural structures of policy classes!
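The selection step can be sketched concretely: fit each class in the structure by empirical risk minimization, then pick the class minimizing empirical risk plus a complexity penalty. The penalty below is the standard VC-type confidence term; the nested classes, sample size, and empirical risks are hypothetical numbers for illustration:

```python
import math

def vc_penalty(h, n, delta=0.05):
    """Classic VC confidence term: sqrt((h*(ln(2n/h)+1) + ln(4/delta)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

def srm_select(classes, n):
    """classes: [(vc_dim, empirical_risk_of_ERM_solution), ...] ordered by
    increasing vc_dim. Returns the index minimizing the risk bound."""
    bounds = [emp + vc_penalty(h, n) for h, emp in classes]
    return min(range(len(classes)), key=lambda i: bounds[i])

# Richer classes fit the data better but pay a larger complexity penalty.
classes = [(2, 0.30), (10, 0.10), (100, 0.05)]
best = srm_select(classes, n=1000)  # picks the middle class here
```

With N = 1000 samples the middle class wins the trade-off; as N grows, the penalty shrinks and SRM drifts toward richer classes, which is the concrete data-vs-representation-size relationship the earlier slides asked for.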
Is this Bayesian?
• Prior knowledge
– Structure encodes prior knowledge
• Robust to over-fitting
– Choose the function class based on the risk bound
• No Bayes update
• No assumption that the true function is somewhere in the structure
– Breaks most (all?) Bayesian nonparametrics
Contribution
• Classification-to-RL mapping
• Transferred probabilistic bounds from statistical learning theory to RL
• Applied structural risk minimization to RL
Backup Slides
From last time…
{𝒎𝒄 ,𝒎𝒑 ,𝒍 }
≈?
Measuring the size of a function class
• Rademacher complexity