Reinforcement Learning with Misspecified Bayesian Nonparametric Model Classes
Joshua Joseph, Alborz Geramifard, Jonathan P. How, and Nicholas Roy

Key idea: rather than minimizing prediction error, maximize what we actually care about: the return of the policy.
Poor Performance in Standard Model-Based Reinforcement Learning due to Misspecification

A model class is misspecified when the true dynamics cannot be represented by any model in the class, which is the case for most real-world problems.
The standard approach to learning a model in RL:

Training Data → Model Class Prior ($x_{t+1} \sim f(x_t, a_t; \theta)$) → Maximum a Posteriori ($\theta_{MAP}$, giving $x_{t+1} \sim f(x_t, a_t; \theta_{MAP})$) → Policy Determination ($\pi_{\theta_{MAP}}$)

…the model is fit without considering the reward. When the true dynamics are discontinuous, the policy that results from using a Gaussian process (which assumes smoothness) to model the dynamics can be far from optimal.

[Figure: the true, discontinuous dynamics ($\Delta x_2$ as a function of $x_1$) versus the smooth MAP fit, with the resulting MAP policy compared to the optimal policy.]
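To make the pipeline concrete, here is a minimal sketch of prediction-error model fitting; the one-parameter model class and all function names are illustrative assumptions, not from the poster.

import numpy as np

def f(x, a, theta):
    """Hypothetical parametric dynamics class: next state is linear in the
    action with unknown gain theta."""
    return x + theta * a

def fit_map(transitions, thetas):
    """Pick the theta minimizing one-step squared prediction error (with a
    flat prior, the MAP estimate reduces to this maximum-likelihood fit)."""
    def prediction_error(theta):
        return sum((x_next - f(x, a, theta)) ** 2
                   for x, a, x_next in transitions)
    return min(thetas, key=prediction_error)

# Note what is missing: the reward never appears. The model is fit first,
# and only afterwards is a policy derived from f(., .; theta_map).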
Parametric Reward Based Model Search

Reward Based Model Search (RBMS) [1]:
a) Policy evaluation is performed using [2]
b) Policy improvement is performed by gradient ascent

Training Data → Model Class ($x_{t+1} \sim f(x_t, a_t; \theta)$, $\theta \in \Theta$) → Policy Determination ($\pi_\theta$) → Policy Evaluation ($V^{\pi_\theta}$) → Maximum Return Policy

a) Policy evaluation from a batch of arbitrary (off-policy) data:
$V^{\pi} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t} r_{n,t}$
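A simplified sketch of this style of evaluation, in the spirit of [2]: artificial trajectories are stitched together from the nearest one-step transitions in the off-policy batch and their returns averaged. The distance metric, the no-reuse rule, and all names below are assumptions, not the exact algorithm of [2].

import numpy as np

def mfmc_value(policy, batch, x0, horizon, n_traj=10):
    """Model-free Monte Carlo-style estimate of V^pi from a batch of
    (x, a, r, x_next) transitions collected under arbitrary policies."""
    used = set()
    returns = []
    for _ in range(n_traj):
        x, total = x0, 0.0
        for _ in range(horizon):
            a = policy(x)
            # nearest stored transition with a matching action, not yet used
            candidates = [i for i, (xi, ai, _, _) in enumerate(batch)
                          if ai == a and i not in used]
            if not candidates:
                break
            j = min(candidates,
                    key=lambda i: np.linalg.norm(np.asarray(batch[i][0], dtype=float)
                                                 - np.asarray(x, dtype=float)))
            used.add(j)
            _, _, r, x_next = batch[j]
            total += r
            x = x_next
        returns.append(total)
    # average summed reward over stitched trajectories, cf. V^pi = (1/N) sum r_{n,t}
    return np.mean(returns)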
b) Policy improvement by gradient ascent over the model class:

$\theta \leftarrow \theta + c \, \frac{\partial V^{\pi_\theta}}{\partial \theta}$

[Figure: gradient ascent moves through models $\theta_1, \theta_2, \theta_3, \theta_4$ in the misspecified model class, each evaluated by its return $V^{\pi_{\theta_i}}$ on the true system.]
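The poster does not spell out how $\partial V^{\pi_\theta}/\partial \theta$ is obtained in the parametric case ([1] has the details); the sketch below substitutes a simple central finite-difference estimate as one plausible stand-in.

import numpy as np

def rbms_parametric(theta0, value_of, c=0.1, eps=1e-2, iters=50):
    """Hill-climb the estimated return over a 1-D parameter vector theta.
    `value_of(theta)` should evaluate V^{pi_theta}, e.g. via mfmc_value on
    the policy derived from f(., .; theta)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for j in range(theta.size):
            d = np.zeros_like(theta)
            d[j] = eps
            grad[j] = (value_of(theta + d) - value_of(theta - d)) / (2 * eps)
        theta += c * grad   # theta <- theta + c * dV/dtheta
    return theta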
Bayesian Nonparametric RBMS

$x_{t+1} \sim f(x_t, a_t; \theta)$, where now $\theta = (\varphi, D)$: the hyperparameters $\varphi$ together with the training data $D$.
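To see why the data is part of the model, consider the posterior mean of a Gaussian process with a squared-exponential kernel; a minimal numpy sketch (all names illustrative):

import numpy as np

def gp_mean(x_star, D, phi):
    """Posterior mean of a 1-D GP. The prediction depends on the
    hyperparameters phi = (lengthscale, signal std, noise std) AND on the
    training data D = (X, y) -- hence theta = (phi, D)."""
    X, y = D
    ell, sf, sn = phi
    k = lambda A, B: sf**2 * np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)
    K = k(X, X) + sn**2 * np.eye(len(X))
    return k(np.atleast_1d(np.asarray(x_star, dtype=float)), X) @ np.linalg.solve(K, y)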
• Unlike the parametric approach, the predictions of a Bayesian nonparametric model are a function of the training data.
• To adapt parametric RBMS to Bayesian nonparametric model classes:
  • Policy evaluation using [2] still works
  • Policy improvement is unclear
Open question: how should we perform policy improvement? In the update $\theta \leftarrow \theta + c \, \frac{\partial V^{\pi_\theta}}{\partial \theta}$, what does the gradient term mean for a Bayesian nonparametric model, where $\theta = (\varphi, D)$?
• Ascending using the gradient in the hyperparameter space ($\varphi$) is straightforward
• What about the data ($D$)? We could:
  • Remove data from $D$
  • Add in "fake" data generated from $f$
  • Move the data
One possible instantiation of the first two options is sketched below.
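Mirroring the random add/remove procedure used in the experiments below, here is a hedged hill-climbing sketch; the function names and the greedy acceptance rule are assumptions.

import random

def improve_by_data_moves(D, candidates, value_of, iters=100, seed=0):
    """Treat the GP's training set D as part of theta and hill-climb it:
    randomly add a candidate point or remove an existing one, keeping the
    change only if the estimated return V^{pi_theta} improves."""
    rng = random.Random(seed)
    best_D, best_v = list(D), value_of(list(D))
    for _ in range(iters):
        proposal = list(best_D)
        if proposal and (not candidates or rng.random() < 0.5):
            proposal.pop(rng.randrange(len(proposal)))   # remove a training point
        elif candidates:
            proposal.append(rng.choice(candidates))      # add a candidate point
        else:
            continue
        v = value_of(proposal)
        if v > best_v:                                   # keep only improvements
            best_D, best_v = proposal, v
    return best_D, best_v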
Results on a Toy Problem

Domain description:
• Actions = {up, right}
• Gaussian process dynamics model
• -1 reward for each time step
• -100 for falling in a pit
• Taking actions on the ice results in "slipping" south
• 100 episodes of training data from a random policy

The "gradient" for RBMS policy improvement was computed by randomly adding and removing data from the Gaussian process.

[Figure: the Gaussian process dynamics model's mean function for each action, for the true system and for the misspecified model class; the agent must travel from start to goal despite the wind.]

Resulting policy's return — MAP model: -62; RBMS model: -20.
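An illustrative reconstruction of the domain's transition rule; the grid layout, wind model, and terrain placement are not specified on the poster, so everything below is an assumption.

def step(state, action, grid):
    """One transition of the toy domain: -1 per step, -100 (terminal) on
    falling in a pit, and acting on ice 'slips' the agent one cell south.
    Wind is omitted for brevity. `grid` maps (x, y) cells to 'ice',
    'pit', or 'goal'."""
    x, y = state
    dx, dy = {"up": (0, 1), "right": (1, 0)}[action]
    if grid.get((x, y)) == "ice":       # slipping south overrides the move
        dx, dy = 0, -1
    x, y = x + dx, y + dy
    if grid.get((x, y)) == "pit":
        return (x, y), -100, True
    return (x, y), -1, grid.get((x, y)) == "goal"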
Conclusion
• RBMS is able to learn models from misspecified model classes that perform well in cases where learning based on minimizing prediction error does not
• RBMS allows us to use smaller model classes, resulting in significantly lower sample complexity than using larger, more expressive model classes
• The extension of RBMS to Bayesian nonparametric models is promising but requires more work to understand how to perform policy improvement
References
[1] J. Joseph, A. Geramifard, J. W. Roberts, J. P. How, and N. Roy, "Reinforcement learning with misspecified model classes," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2013), under review.
[2] R. Fonteneau, S. A. Murphy, L. Wehenkel, and D. Ernst, "Model-free Monte Carlo-like policy evaluation," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 217–224, 2010.