
Safe Policy Improvement with Baseline Bootstrapping

Romain Laroche, Paul Trichelair, Rémi Tachet des Combes


Problem setting

Batch setting
• Fixed dataset, no direct interaction with the environment.
• Access to the behavioural policy, called the baseline.
• Objective: improve the baseline with high probability.
• Commonly encountered in real-world applications.

Examples: distributed systems, long trajectories.



Contributions

Novel batch RL algorithm: SPIBB

• SPIBB comes with reliability guarantees in finite MDPs.
• SPIBB is as computationally efficient as classic RL.

Finite MDPs benchmark
• Extensive benchmark of existing algorithms.
• Empirical analysis on random MDPs and baselines.

Infinite MDPs benchmark
• Model-free SPIBB for use with function approximation.
• First deep RL algorithm reliable in the batch setting.



Robust Markov Decision Processes

[Iyengar, 2005, Nilim and El Ghaoui, 2005]

• True environment M* = 〈X, A, P*, R*, γ〉 is unknown.
• Maximum Likelihood Estimation (MLE) MDP built from the counts: M̂ = 〈X, A, P̂, R̂, γ〉.
• Robust MDP set Ξ(M̂, e): M* ∈ Ξ(M̂, e) with probability at least 1 − δ.
• Error function e(x, a) derived from concentration bounds.
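For concreteness, a minimal sketch of an error function of this kind, assuming a Hoeffding-style concentration bound whose log term matches the one in the safe-improvement theorem later in the deck; the function name, array shapes, and the exact constant are illustrative assumptions rather than the paper's definition:

```python
import numpy as np

def error_function(counts, n_states, n_actions, delta):
    """Concentration-bound error e(x, a) for every state-action pair.

    counts: (n_states, n_actions) array of N_D(x, a).
    Returns e(x, a) = sqrt(2 / N_D(x, a) * log(2 |X| |A| 2^|X| / delta)),
    with +inf for pairs that never appear in the dataset.
    """
    # log(2 |X| |A| 2^|X| / delta), computed without forming 2^|X| explicitly.
    log_term = np.log(2.0 * n_states * n_actions / delta) + n_states * np.log(2.0)
    with np.errstate(divide="ignore"):
        e = np.sqrt(2.0 * log_term / counts)
    e[counts == 0] = np.inf  # maximal uncertainty on unvisited pairs
    return e
```

The rarer the pair, the larger its error term, so the robust MDP set Ξ(M̂, e) is loose exactly where the data is thin.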


Existing algorithms

[Petrik et al., 2016]: SPI by robust baseline regret minimization
• Robust MDPs consider the maxmin of the value over Ξ → favors over-conservative policies.
• They also consider the maxmin of the value improvement → NP-hard problem.
• RaMDP hacks the reward to account for uncertainty (see the sketch below):
  R(x, a) ← R(x, a) − κ_adj / √(N_D(x, a))
  → not theoretically grounded.
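A minimal sketch of that reward adjustment, with κ_adj and the count array as assumed inputs (names are illustrative):

```python
import numpy as np

def ramdp_rewards(rewards, counts, kappa_adj):
    """RaMDP reward hack: R(x, a) <- R(x, a) - kappa_adj / sqrt(N_D(x, a)).

    Unvisited pairs receive an infinite penalty, which effectively
    discourages the planner from ever selecting them.
    """
    with np.errstate(divide="ignore"):
        penalty = kappa_adj / np.sqrt(counts)
    return rewards - penalty
```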

[Thomas, 2015]: High-Confidence Policy Improvement

• HCPI searches for the best regularization hyperparameter to allow safe policy improvement.



Safe Policy Improvement with Baseline Bootstrapping

Safe Policy Improvement with Baseline Bootstrapping (SPIBB)
• Tractable approximate solution to the robust policy improvement formulation.
• SPIBB allows policy updates only with sufficient evidence.
• Sufficient evidence = state-action count that exceeds some threshold hyperparameter N∧.

SPIBB algorithm
• Construction of the bootstrapped set:
  B = {(x, a) ∈ X × A : N_D(x, a) < N∧}.
• Optimization over a constrained policy set:
  π_spibb = argmax_{π ∈ Π_b} ρ(π, M̂),
  Π_b = {π s.t. π(a|x) = π_b(a|x) if (x, a) ∈ B}.
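A minimal sketch of one Π_b-SPIBB policy-improvement step in a finite MDP, following the constraint above: keep π_b on bootstrapped pairs and give the remaining probability mass to the greedy non-bootstrapped action. Array shapes and names are illustrative assumptions:

```python
import numpy as np

def spibb_policy_improvement(q, pi_b, counts, n_wedge):
    """One Pi_b-SPIBB improvement step.

    q:       (n_states, n_actions) current Q-value estimate in the MLE MDP.
    pi_b:    (n_states, n_actions) baseline policy probabilities.
    counts:  (n_states, n_actions) state-action counts N_D(x, a).
    n_wedge: threshold N (written N∧ in the slides); pairs with fewer counts
             are bootstrapped.
    """
    bootstrapped = counts < n_wedge              # the set B
    pi = np.where(bootstrapped, pi_b, 0.0)       # copy the baseline on B
    for x in range(q.shape[0]):
        free = ~bootstrapped[x]
        if free.any():
            # Mass the baseline puts outside B goes to the best free action.
            free_mass = pi_b[x, free].sum()
            best = np.argmax(np.where(free, q[x], -np.inf))
            pi[x, best] += free_mass
    return pi
```

Each row of the returned policy still sums to one, and the policy-evaluation step in the MLE MDP is unchanged from classic policy iteration, which is why SPIBB keeps the computational cost of classic RL.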



SPIBB policy iteration




Policy improvement step example:

  Q-value                 Baseline policy      Bootstrapping    SPIBB policy update
  Q^(i)_M̂(x, a1) = 1      π_b(a1|x) = 0.1      (x, a1) ∈ B      π^(i+1)(a1|x) = 0.1
  Q^(i)_M̂(x, a2) = 2      π_b(a2|x) = 0.4      (x, a2) ∉ B      π^(i+1)(a2|x) = 0.0
  Q^(i)_M̂(x, a3) = 3      π_b(a3|x) = 0.3      (x, a3) ∉ B      π^(i+1)(a3|x) = 0.7
  Q^(i)_M̂(x, a4) = 4      π_b(a4|x) = 0.2      (x, a4) ∈ B      π^(i+1)(a4|x) = 0.2
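A small self-contained check that reproduces the row-by-row numbers above; the array layout is an assumption, and the 0.4 + 0.3 = 0.7 of baseline mass on the non-bootstrapped actions all moves to a3, the best of them:

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0, 4.0])           # Q^(i)(x, a1..a4)
pi_b = np.array([0.1, 0.4, 0.3, 0.2])        # baseline pi_b(. | x)
in_B = np.array([True, False, False, True])  # (x, a1) and (x, a4) are bootstrapped

pi_new = np.where(in_B, pi_b, 0.0)           # keep the baseline on B
free_mass = pi_b[~in_B].sum()                # 0.4 + 0.3 = 0.7
best_free = np.argmax(np.where(~in_B, q, -np.inf))
pi_new[best_free] += free_mass

print(pi_new)  # [0.1 0.  0.7 0.2] -- the SPIBB policy update column
```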


Theoretical analysis

Theorem (Convergence)
Policy iteration converges to a policy π_spibb that is Π_b-optimal in the MLE MDP M̂.

Theorem (Safe policy improvement)
With high probability 1 − δ:

  ρ(π_spibb, M*) − ρ(π_b, M*) ≥ ρ(π_spibb, M̂) − ρ(π_b, M̂) − (4 V_max / (1 − γ)) √( (2 / N∧) log( 2|X||A| 2^|X| / δ ) )
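The guarantee therefore trades improvement in the MLE MDP against a penalty that shrinks as N∧ grows. A minimal sketch evaluating that penalty term; every numeric value below is an illustrative assumption, not a figure from the paper:

```python
import numpy as np

def spibb_penalty(v_max, gamma, n_wedge, n_states, n_actions, delta):
    """Penalty term of the bound:
    4 V_max / (1 - gamma) * sqrt(2 / N_wedge * log(2 |X| |A| 2^|X| / delta))."""
    log_term = np.log(2.0 * n_states * n_actions / delta) + n_states * np.log(2.0)
    return 4.0 * v_max / (1.0 - gamma) * np.sqrt(2.0 * log_term / n_wedge)

# Illustrative values for a gridworld-sized problem.
print(spibb_penalty(v_max=1.0, gamma=0.95, n_wedge=20, n_states=25, n_actions=4, delta=0.05))
```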



Model-free formulation

SPIBB algorithm
• It may be formulated in a model-free manner by setting the targets:

  y^(i)_j = r_j + γ Σ_{a' : (x'_j, a') ∈ B} π_b(a' | x'_j) Q^(i)(x'_j, a')
                + γ ( Σ_{a' : (x'_j, a') ∉ B} π_b(a' | x'_j) ) max_{a' : (x'_j, a') ∉ B} Q^(i)(x'_j, a').
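A minimal sketch of that target for a single transition (x_j, a_j, r_j, x'_j); the interface (per-action Q-values, baseline probabilities, and a boolean mask for membership in B at the next state) is an illustrative assumption:

```python
import numpy as np

def spibb_target(r, q_next, pi_b_next, in_B_next, gamma):
    """SPIBB regression target y_j for one transition.

    q_next:    (n_actions,) Q^(i)(x'_j, .) from the current table or network.
    pi_b_next: (n_actions,) baseline probabilities pi_b(. | x'_j).
    in_B_next: (n_actions,) True where (x'_j, a') is bootstrapped.
    """
    # Bootstrapped actions: follow the baseline with their own Q-values.
    bootstrap_term = np.sum(pi_b_next[in_B_next] * q_next[in_B_next])
    # Non-bootstrapped actions: their baseline mass goes to the best such action.
    free = ~in_B_next
    greedy_term = pi_b_next[free].sum() * q_next[free].max() if free.any() else 0.0
    return r + gamma * (bootstrap_term + greedy_term)
```

With function approximation, a target of this form can be plugged into the usual DQN regression loss, which is the spirit of the SPIBB-DQN variants used in the experiments.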

Theorem (Model-free formulation equivalence)
In finite MDPs, the model-free formulation admits a unique fixed point that coincides with the Q-value of π_spibb.



25-state stochastic gridworld – mean

[Figure: mean performance vs. number of trajectories in dataset D (10^1 to 10^4), for the baseline, the optimal policy, basic RL, RaMDP, Robust MDP, HCPI doubly robust, Πb-SPIBB, and Π≤b-SPIBB.]


25-state stochastic gridworld – 1%-CVaR

[Figure: 1%-CVaR performance vs. number of trajectories in dataset D (10^1 to 10^4), for the same algorithms as above.]


Random MDPs, random baseline – 1%-CVaR

[Figure: 1%-CVaR of the normalized performance ρ̄ = (ρ(π, M*) − ρ_b) / (ρ* − ρ_b) vs. number of trajectories in dataset D (10^1 to 10^3), for the same algorithms as above.]


Gridworld – RaMDP hyperparameter sensitivity

[Heatmap: normalized performance of the target policy π, ρ̄ = (ρ(π, M*) − ρ_b) / (ρ* − ρ_b), as a function of the number of trajectories (10 to 10000) and the RaMDP hyperparameter κ (0.001 to 0.005).]


Gridworld – SPIBB hyperparameter sensitivity

[Heatmap: normalized performance of the target policy π, ρ̄ = (ρ(π, M*) − ρ_b) / (ρ* − ρ_b), as a function of the number of trajectories (10 to 10000) and the SPIBB hyperparameter N∧ (5 to 100).]


Helicopter domain (continuous task)


Helicopter domain - benchmark (improved results)

[Figure: mean and 10%-CVaR performance vs. hyperparameter (ε, 1/√N∧, or 1/κ; 10^−1 to 10^0) for the baseline, RaMDP-DQN, Πb-SPIBB-DQN, and Approx-Soft-SPIBB-DQN.]

Vanilla DQN is off the chart:
• mean = 0.22,
• 10%-CVaR = −1 (minimal score).


Conclusion

SPIBB
• Assumes a fixed dataset and a known behavioural policy.
• Tractable, provably reliable, sample-efficient algorithm.
• Successfully transferred to DQN architectures.

Follow-up work
• Factored SPIBB [Simao and Spaan, 2019a].
• Coupled with structure learning [Simao and Spaan, 2019b].
• Soft SPIBB [Nadjahi et al., 2019].

Still to do
• Improve the pseudo-count/error estimates.
• Investigate an online SPIBB-inspired algorithm.



Thanks for your attention (POSTER #101)

Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research.

Nadjahi, K., Laroche, R., and Tachet des Combes, R. (2019). Safe policy improvement with soft baseline bootstrapping. In Proceedings of the 17th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research.

Petrik, M., Ghavamzadeh, M., and Chow, Y. (2016). Safe policy improvement by minimizing robust baseline regret. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS).

Simao, T. D. and Spaan, M. T. J. (2019a). Safe policy improvement with baseline bootstrapping in factored environments. In Proceedings of the AAAI Conference on Artificial Intelligence.

Simao, T. D. and Spaan, M. T. J. (2019b). Structure learning for safe policy improvement. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI).

Thomas, P. S. (2015). Safe reinforcement learning. PhD thesis, University of Massachusetts Amherst.