
Safe Policy Improvement with Baseline Bootstrapping

Romain Laroche 1 Paul Trichelair 1

Abstract: A common goal in Reinforcement Learning is to derive a good strategy given a limited batch of data. In this paper, we adopt the safe policy improvement (SPI) approach: we compute a target policy guaranteed to perform at least as well as a given baseline policy. Our SPI strategy, inspired by the knows-what-it-knows paradigm, consists in bootstrapping the target policy with the baseline policy when it does not know. We develop two computationally efficient bootstrapping algorithms, one value-based and one policy-based, both accompanied with theoretical SPI bounds. Three additional algorithm variants are proposed. We empirically show the limits of the literature algorithms on a small stochastic gridworld problem, and then demonstrate that our algorithms improve both the worst-case scenario and the mean performance.

1. Introduction

Reinforcement Learning (RL, Sutton & Barto 1998) consists in discovering by trial-and-error, in an unknown uncertain environment, which action is the most valuable in a particular situation. In an online learning setting, trial-and-error works well, because a good outcome brings a policy improvement, and even an error leads to learning not to repeat it, at a lesser cost. However, most real-world algorithms are to be widely deployed on independent devices/systems, and as such their policies cannot be updated as often as online learning would require. In this offline setting, batch RL algorithms should be applied (Lange et al., 2012). But the trial-and-error paradigm shows its limits when policy updates are rare, because the commitment to the trial is too strong and the error impact may be severe. In this paper, we endeavour to build batch RL algorithms that are safe in this regard.

The notion of safety in RL has been defined in several contexts (García & Fernández, 2015). Two notions of uncertainty, the internal and the parametric, are defined in Ghavamzadeh et al. (2016): internal uncertainty reflects the uncertainty of the return due to stochastic transitions and rewards, for a single known MDP, while parametric uncertainty reflects the uncertainty about the unknown MDP parameters: the transition and reward distributions. In short, work on internal uncertainty intends to guarantee a certain level of return for each individual trajectory (Schulman et al., 2015; 2017), which is critical in view of potentially harmful behaviour (Amodei et al., 2016) or in catastrophe avoidance scenarios (Geibel & Wysotzki, 2005; Lipton et al., 2016). In this paper, we focus more specifically on the parametric uncertainty in order to guarantee a given expected return for the trained policy in the batch RL setting (Thomas et al., 2015a; Petrik et al., 2016).

1 Microsoft Research, [email protected]

More specifically, we seek high confidence that the trained policy approximately outperforms the behavioural policy, called the baseline policy from now onwards. The goal is therefore to improve the policy, even in the worst-case scenario. As such, this family of algorithms can be seen as pessimistic: the counterpart of optimism in the face of uncertainty (Brafman & Tennenholtz, 2002; Auer & Ortner, 2007; Szita & Lőrincz, 2008). Section 2 recalls the necessary background on MDPs and safe policy improvement (SPI).

Section 3 presents our novel optimization formulation: SPI by Baseline Bootstrapping (SPIBB). It consists in bootstrapping the trained policy with the baseline policy in the state-action pair transitions that were not probed sufficiently often in the dataset. We develop two novel computationally efficient SPIBB algorithms, one value-based and one policy-based, both accompanied with theoretical SPI bounds. At the expense of theoretical guarantees, we implement three additional algorithm variants. We then position our work with respect to the related literature, where we argue that the algorithms found in the literature are impractical because they are intractable in non-small MDPs and/or make unreasonable assumptions on the dataset size and distribution.

Then, Section 4 empirically validates the theoretical results on a gridworld problem, where our algorithms are compared to the state-of-the-art algorithms. The results show that our algorithms significantly outperform the competitors, both on the mean performance and on the worst-case scenario, while being as computationally efficient as a standard model-based algorithm.

Finally, Section 5 concludes the paper with prospective ideas of improvement. The appendix includes the proofs of all theorems and some additional experimental results.

arXiv:1712.06924v3 [cs.LG] 18 Jan 2018


2. Background

2.1. The MDP framework

Markov Decision Processes (MDPs, Bellman 1957) are a widely used framework to address the problem of optimizing sequential decision making. In our work, we assume that the true environment is modelled as an unknown MDP $M^* = \langle \mathcal{X}, \mathcal{A}, R^*, P^*, \gamma \rangle$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $R^*(x,a) \in [-R_{max}, R_{max}]$ is the true bounded stochastic reward function, $P^*(\cdot|x,a)$ is the true transition probability, and $\gamma \in [0,1)$ is the discount factor. Without loss of generality, we assume that the process deterministically begins in state $x_0$, the stochastic initialization being modelled by $P^*(\cdot|x_0, a_0)$ and leading the agent to state $x_1$. The agent then makes a decision about which action $a_1$ to select. This action leads to a new state that depends on the transition probability, and the agent receives a reward $R^*(x_1, a_1)$ reflecting the quality of the decision. This process is repeated until the end of the episode. We denote by $\pi$ the policy, i.e. the decision-making mechanism that assigns actions to states, and by $\Pi = \{\pi : \mathcal{X} \to \Delta_{\mathcal{A}}\}$ the set of stochastic policies, with $\Delta_{\mathcal{A}}$ the set of probability distributions over the set of actions $\mathcal{A}$.

The state value function $V^\pi_M(x)$ (resp. state-action value function $Q^\pi_M(x,a)$) evaluates the performance of policy $\pi \in \Pi$ starting from state $x \in \mathcal{X}$ (resp. performing action $a \in \mathcal{A}$ in state $x \in \mathcal{X}$) in the MDP $M = \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$:

$$V^\pi_M(x) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t R(x_t,a_t) \;\middle|\; x_0 = x,\; a_t \sim \pi(\cdot|x_t),\; x_{t+1} \sim P(\cdot|x_t,a_t)\right]$$

$$Q^\pi_M(x,a) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t R(x_t,a_t) \;\middle|\; x_0 = x,\; a_0 = a,\; a_t \sim \pi(\cdot|x_t),\; x_{t+1} \sim P(\cdot|x_t,a_t)\right]$$

The goal of a reinforcement learning algorithm is to discover the unique optimal state value function $V^*_M$ (resp. state-action value function $Q^*_M$). We define the performance of a policy by its expected return $\rho(\pi, M) = V^\pi_M(x_0)$. Given a policy subset $\Pi' \subseteq \Pi$, a policy $\pi'$ is said to be $\Pi'$-optimal for an MDP $M$ when it maximises its performance: $\rho(\pi', M) = \max_{\pi \in \Pi'} \rho(\pi, M)$. Later, we also make use of the notation $V_{max} \leq \frac{R_{max}}{1-\gamma}$ as a known upper bound on the absolute value of the return.
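To make the performance criterion concrete, the following minimal sketch evaluates a stochastic policy on a known tabular MDP by solving the linear Bellman system, and returns $\rho(\pi, M) = V^\pi_M(x_0)$. The array layout (a transition tensor of shape (S, A, S) and a reward matrix of shape (S, A)) is our own convention, not one prescribed by the paper.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact evaluation of a stochastic policy on a tabular MDP.

    P:  transition tensor, shape (S, A, S), P[x, a, y] = P(y | x, a)
    R:  expected reward matrix, shape (S, A)
    pi: policy matrix, shape (S, A), rows sum to 1
    Returns V, the state value function (shape (S,)).
    """
    S = P.shape[0]
    P_pi = np.einsum("xay,xa->xy", P, pi)   # state-to-state kernel under pi
    R_pi = np.einsum("xa,xa->x", R, pi)     # expected immediate reward under pi
    # Solve (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def performance(P, R, pi, gamma, x0=0):
    """rho(pi, M) = V^pi(x0), the criterion used throughout the paper."""
    return policy_evaluation(P, R, pi, gamma)[x0]
```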

2.2. Percentile criterion

We transpose here the percentile criterion (Delage & Mannor, 2010) to the safe policy improvement objective:

$$\pi_C = \operatorname*{argmax}_{\pi \in \Pi} \mathbb{E}\left[\rho(\pi, M) \mid M \sim P_{MDP}(\cdot|D)\right], \quad (1)$$
$$\text{s.t. } \mathbb{P}\left(\rho(\pi, M) \geq \rho(\pi_b, M) - \zeta \mid M \sim P_{MDP}(\cdot|D)\right) \geq 1 - \delta,$$

where $P_{MDP}(\cdot|D)$ is the posterior probability of the MDP parameters, $1-\delta$ is the high-probability meta-parameter, and $\zeta$ is the error meta-parameter. Petrik et al. (2016) bound the constraint from below by considering $\Xi(\hat{M}, e)$, the set of admissible MDPs with high probability $1-\delta$, where $\hat{M} = \langle \mathcal{X}, \mathcal{A}, \hat{P}, \hat{R}, \gamma \rangle$ is the MDP parameter estimate and $e : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is an error function parametrised by the dataset $D$ and the meta-parameter $\delta$:

$$\Xi(\hat{M}, e) = \left\{ M \;:\; \forall (x,a) \in \mathcal{X} \times \mathcal{A},\; \|P(\cdot|x,a) - \hat{P}(\cdot|x,a)\|_1 \leq e(x,a),\; \|R(\cdot|x,a) - \hat{R}(\cdot|x,a)\|_1 \leq e(x,a) R_{max} \right\}$$

Instead of the intractable expectation of Equation 1, Robust MDPs (Iyengar, 2005; Nilim & El Ghaoui, 2005) classically consider the worst-case scenario in $\Xi$ (from now on, the $\Xi(\hat{M}, e)$ notation is simplified to $\Xi$) for the maximization of the performance $\rho(\pi, M)$. Rather, Petrik et al. (2016) contemplate the worst-case scenario in $\Xi$ of the SPI problem: the maximization of the gap between the target and the baseline performances:

$$\pi_S = \operatorname*{argmax}_{\pi \in \Pi} \; \min_{M \in \Xi} \left(\rho(\pi, M) - \rho(\pi_b, M)\right) \quad (2)$$

Unfortunately, they prove that this is an NP-hard problem. They propose two algorithms approximating the solution, without any formal proof. First, Approximate Robust Baseline Regret Minimization (ARBRM) assumes that there is no error in the transition probabilities of the baseline policy, which is a hazardous assumption. Also, considering its high complexity (polynomial time), it is difficult to empirically assess its percentile-criterion safety. Second, the Robust MDP solver uses a $\Xi$-worst-case safety test to guarantee safety, which is very conservative.

3. SPI with Baseline Bootstrapping

3.1. SPIBB methodology

As evoked in Section 2.2, we endeavour in this section to further reformulate the percentile criterion in order to find an efficient and provably safe policy within a tractable amount of computer time. Our new criterion consists in optimising the policy with respect to its performance in the MDP estimate $\hat{M}$, while being guaranteed to be $\zeta$-approximately at least as good as $\pi_b$ in the admissible MDP set $\Xi$, with high probability $1-\delta$. More formally, we write it as follows:

$$\max_{\pi \in \Pi} \rho(\pi, \hat{M}), \quad \text{s.t. } \forall M \in \Xi,\; \rho(\pi, M) \geq \rho(\pi_b, M) - \zeta \quad (3)$$


Algorithm 1: Construction of the set of bootstrapped pairs
Data: dataset $D$; parameters $\epsilon$, $\delta$, and $\gamma$; state and action sets $\mathcal{X}$ and $\mathcal{A}$
Initiate the $Q^{\pi_b}$-bootstrapped state-action pair set: $B = \emptyset$.
$N_\wedge = \frac{2}{\epsilon^2} \log \frac{|\mathcal{X}||\mathcal{A}| 2^{|\mathcal{X}|}}{\delta}$
for $(x,a) \in \mathcal{X} \times \mathcal{A}$ do
    Compute $N_D(x,a)$, the number of transitions from the state-action pair $(x,a)$ in $D$.
    if $N_D(x,a) < N_\wedge$ then $B = B \cup \{(x,a)\}$
end for
return $B$
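A direct transcription of Algorithm 1 might look as follows. This is a sketch: the dataset is assumed to be an iterable of transitions (x, a, r, x'), a storage format of our choosing rather than one prescribed by the paper.

```python
import numpy as np
from collections import Counter

def bootstrapped_set(dataset, n_states, n_actions, eps, delta):
    """Algorithm 1: collect the state-action pairs whose count is below N_wedge.

    dataset: iterable of transitions (x, a, r, x_next)
    Returns the set B of bootstrapped (uncertain) state-action pairs.
    """
    # N_wedge = (2 / eps^2) * log(|X||A| 2^|X| / delta), computed in log form.
    n_wedge = (2.0 / eps**2) * (np.log(n_states * n_actions / delta)
                                + n_states * np.log(2.0))
    counts = Counter((x, a) for (x, a, _, _) in dataset)
    return {(x, a)
            for x in range(n_states)
            for a in range(n_actions)
            if counts[(x, a)] < n_wedge}
```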

In order to have this constraint fulfilled with high probability $1-\delta$ for a model-based RL learner, the choice of $\zeta$ is determined by Theorem 8 from Petrik et al. (2016), which we restate hereinbelow with the additional consideration of the uncertainty on the reward model.

Theorem 1 (Near optimality of model-based RL). Let $\pi^*_{\hat{M}}$ be an optimal policy in the MDP $\hat{M}$ constructed from the dataset. Let $\pi^*$ be an optimal policy in the true MDP $M^*$. If at each state-action pair $(x,a) \in \mathcal{X} \times \mathcal{A}$ we define:

$$e(x,a) = \sqrt{\frac{2}{N_D(x,a)}\log\frac{|\mathcal{X}||\mathcal{A}|2^{|\mathcal{X}|}}{\delta}}, \quad (4)$$

then, with high probability $1-\delta$,

$$\rho(\pi^*_{\hat{M}}, M^*) \geq \rho(\pi^*, M^*) - \frac{2 V_{max}}{1-\gamma}\|e\|_\infty. \quad (5)$$

Inversely, given a desired $\zeta$, the count should satisfy, for every state-action pair $(x,a) \in \mathcal{X} \times \mathcal{A}$:

$$N_D(x,a) \geq N_\wedge = \frac{8 V_{max}^2}{\zeta^2(1-\gamma)^2}\log\frac{|\mathcal{X}||\mathcal{A}|2^{|\mathcal{X}|}}{\delta} \quad (6)$$

In this paper, we extend this previous result by allowing this constraint to be only partially satisfied, in a subset of $\mathcal{X} \times \mathcal{A}$. Its complementary subset, the set of uncertain state-action pairs, is called the set of bootstrapped pairs and is denoted by $B$ in the following. $B$ depends on the state set $\mathcal{X}$, on the action set $\mathcal{A}$, on the dataset $D$, and on a parameter $N_\wedge$, which itself depends on three parameters: the return precision level $\zeta$, or equivalently the MDP model precision level $\epsilon = \|e\|_\infty$, the high probability $1-\delta$, and the discount factor $\gamma$. For ease of notation, these dependencies are omitted. The pseudocode for the construction of the set of bootstrapped state-action pairs is presented in Algorithm 1.

We call SPI with Baseline Bootstrapping (SPIBB) the methodology of bootstrapping the uncertain state-action pairs with low-variance value estimators/policies obtained from the baseline policy, and then using RL to train a policy. We implement two novel SPIBB algorithms in the next subsections. We show that this approach is safe and prove SPI bounds. We also derive three additional SPIBB variants that work better to some extent in our experiments.

3.2. Value-based SPIBB

In this section, we consider bootstrapping the uncertain state-action pairs $(x,a) \in B$ with a transition to a terminal state yielding an immediate reward equal to the baseline policy expected return estimate $\widehat{Q}^{\pi_b}(x,a)$. Considering that the baseline policy is the behavioural policy used for the generation of dataset $D$, the estimates $\widehat{Q}^{\pi_b}(x,a)$ can be obtained by averaging the returns obtained in the dataset after $(x,a)$ transitions (if the baseline policy is not the behavioural policy, the estimates $\widehat{Q}^{\pi_b}(x,a)$ can still be obtained through importance sampling, but this suffers from a large variance that may compromise the use of a small $\epsilon$). Indeed, the constraint on $N_D(x,a)$ for estimation at precision $\epsilon$ with probability $1-\delta$ grows logarithmically with the state set size:

Proposition 1. If for all state-action pairs $(x,a) \in B$, $\sqrt{\frac{2}{N_D(x,a)} \log \frac{2|\mathcal{X}||\mathcal{A}|}{\delta}} \leq \epsilon$, then, with probability at least $1-\delta$:

$$\forall (x,a) \notin B,\; \|P^*(\cdot|x,a) - \hat{P}(\cdot|x,a)\|_1 \leq \epsilon$$
$$\forall (x,a) \notin B,\; |R^*(x,a) - \hat{R}(x,a)| \leq \epsilon R_{max}$$
$$\forall (x,a) \in B,\; |Q^{\pi_b}(x,a) - \widehat{Q}^{\pi_b}(x,a)| \leq \epsilon V_{max} \quad (7)$$

Inversely, given a desired $\epsilon$, the count should satisfy:

$$N_D(x,a) \geq N_\bot = \frac{2}{\epsilon^2} \log \frac{2|\mathcal{X}||\mathcal{A}|}{\delta} \quad (8)$$
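The Monte Carlo estimate of $\widehat{Q}^{\pi_b}(x,a)$ described above can be sketched as follows. The dataset is assumed to be stored as a list of trajectories, each a list of (x, a, r) steps generated by the baseline policy; this storage format is our assumption, not the paper's.

```python
from collections import defaultdict

def monte_carlo_q_baseline(trajectories, gamma):
    """Average the discounted return observed after each (x, a) visit.

    trajectories: list of episodes, each a list of (x, a, r) tuples,
                  generated by the baseline (behavioural) policy.
    Returns a dict mapping (x, a) -> Monte Carlo estimate of Q^{pi_b}(x, a).
    """
    returns = defaultdict(list)
    for episode in trajectories:
        g = 0.0
        # Backward pass: g is the discounted return from each step onwards.
        for (x, a, r) in reversed(episode):
            g = r + gamma * g
            returns[(x, a)].append(g)
    return {xa: sum(gs) / len(gs) for xa, gs in returns.items()}
```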

In the rest of this subsection, we assume that this inequality is satisfied for every state-action pair $(x,a) \in B$. If so, we can bootstrap the uncertain state-action pairs with the Q-function estimates of the baseline policy by creating an estimated $Q^{\pi_b}$-bootstrapped MDP $\widehat{\widetilde{M}}$, as described earlier and formalised in Algorithm 2. Then, we solve $\widehat{\widetilde{M}}$ and let $\pi_{val}$ denote an optimal policy. Hereinbelow, Theorem 2 provides bounds on its near-optimality in the $Q^{\pi_b}$-bootstrapped MDP $\widetilde{M}$ built on the true parameters, while Theorem 3 offers guarantees on improving the baseline policy in $M^*$.

Theorem 2 (Near optimality of $Q^{\pi_b}$-SPIBB-1). Let $\pi_{val}$ be an optimal policy of the reward maximization problem of the estimated $Q^{\pi_b}$-bootstrapped MDP $\widehat{\widetilde{M}}$. Then, under the construction properties of $\widehat{\widetilde{M}}$ and under the assumption of Proposition 1, the performance of $\pi_{val}$ in $\widetilde{M}$ is near-optimal:

$$\rho(\pi_{val}, \widetilde{M}) \geq \max_{\pi \in \Pi} \rho(\pi, \widetilde{M}) - \frac{2\epsilon V_{max}}{1-\gamma} \quad (9)$$


Algorithm 2: $Q^{\pi_b}$-SPIBB-1 algorithm
Data: dataset $D$; set of bootstrapped state-action pairs $B$; state and action sets $\mathcal{X}$ and $\mathcal{A}$
Compute the estimates $\hat{P}$, $\hat{R}$, and $\widehat{Q}^{\pi_b}$ from $D$.
Construct the estimated $Q^{\pi_b}$-bootstrapped MDP $\widehat{\widetilde{M}} = \langle \mathcal{X}, \mathcal{A}, \widetilde{P}, \widetilde{R}, \gamma \rangle$ such that:
- $\widetilde{P}(x'|x,a) = \hat{P}(x'|x,a)$ if $(x,a) \notin B$; $\widetilde{P}(x_f|x,a) = 1$ otherwise, where $x_f$ is a terminal state;
- $\widetilde{R}(x,a) = \hat{R}(x,a)$ if $(x,a) \notin B$; $\widetilde{R}(x,a) = \widehat{Q}^{\pi_b}(x,a)$ otherwise.
return $\pi_{val} = \operatorname*{argmax}_{\pi \in \Pi} \rho(\pi, \widehat{\widetilde{M}})$
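In code, the construction step of Algorithm 2 amounts to redirecting every pair in B to an absorbing terminal state and replacing its reward by the baseline return estimate. A sketch under the same tabular array conventions as above (the extra terminal state indexed S is our own encoding choice):

```python
import numpy as np

def build_q_bootstrapped_mdp(P_hat, R_hat, q_hat_baseline, B):
    """Algorithm 2 (construction step): estimated Q^{pi_b}-bootstrapped MDP.

    P_hat:          estimated transitions, shape (S, A, S)
    R_hat:          estimated rewards, shape (S, A)
    q_hat_baseline: dict (x, a) -> hat{Q}^{pi_b}(x, a) for pairs in B
    B:              set of bootstrapped state-action pairs
    Returns (P_tilde, R_tilde) with one extra absorbing terminal state.
    """
    S, A = R_hat.shape
    P_tilde = np.zeros((S + 1, A, S + 1))
    P_tilde[:S, :, :S] = P_hat
    R_tilde = np.zeros((S + 1, A))
    R_tilde[:S, :] = R_hat
    P_tilde[S, :, S] = 1.0          # terminal state loops on itself with reward 0
    for (x, a) in B:
        P_tilde[x, a, :] = 0.0
        P_tilde[x, a, S] = 1.0      # bootstrapped pair jumps to the terminal state
        R_tilde[x, a] = q_hat_baseline[(x, a)]
    return P_tilde, R_tilde
```

Any standard planner (e.g. value iteration) run on (P_tilde, R_tilde) then yields a policy playing the role of πval.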

Theorem 3 (Safe policy improvement of $Q^{\pi_b}$-SPIBB-1). Let $\pi_{val}$ be an optimal policy of the reward maximization problem of the estimated $Q^{\pi_b}$-bootstrapped MDP $\widehat{\widetilde{M}}$. Then, under the construction properties of $\widehat{\widetilde{M}}$ and under the assumption of Proposition 1, $\pi_{val}$ applied in $\widetilde{M}$ is an approximate safe policy improvement over the baseline policy $\pi_b$ with high probability $1-\delta$:

$$\rho(\pi_{val}, \widetilde{M}) \geq \rho(\pi_b, M^*) - \frac{2\epsilon V_{max}}{1-\gamma} \quad (10)$$

$\pi_{val}$ is trained on the estimated $Q^{\pi_b}$-bootstrapped MDP, which can be performed with any RL algorithm at the same computational cost.

During utilization of $\pi_{val}$, casting the environment into its $Q^{\pi_b}$-bootstrapped version means that once a bootstrapped state-action pair $(x,a) \in B$ has been performed, the reached state in $\widetilde{M}$ is terminal and the baseline policy $\pi_b$ should take control of the trajectory until its end. This has two practical shortcomings: 1/ it means that $\pi_b$ must be known, and 2/ it does not take advantage of the fact that $\pi_{val}$ is defined over all state-action pairs and is expected to be more efficient than $\pi_b$. As a consequence, despite the lack of theoretical guarantees, the experimental section also assesses the empirical safety of continuing to control the trajectory with $\pi_{val}$ after choosing a bootstrapping action. These two variants of value-based SPIBB are respectively referred to as $Q^{\pi_b}$-SPIBB-1 and $Q^{\pi_b}$-SPIBB-∞.
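The difference between the two variants only appears at execution time. A sketch of the control rule, where act_trained and act_baseline are assumed action-sampling functions for πval and πb (names of our own choosing):

```python
def act(x, handed_over, B, act_trained, act_baseline, keep_control=False):
    """Choose an action at state x during deployment.

    keep_control=False mimics Q^{pi_b}-SPIBB-1: once a bootstrapped pair has
    been executed, the baseline keeps control until the end of the episode.
    keep_control=True mimics Q^{pi_b}-SPIBB-inf: pi_val stays in control.
    Returns (action, handed_over), where handed_over is the updated flag.
    """
    if handed_over and not keep_control:
        return act_baseline(x), True
    a = act_trained(x)
    if (x, a) in B:
        handed_over = True
    return a, handed_over
```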

3.3. Policy-based SPIBB

In the previous section, we propose to bootstrap the uncertain state-action pairs with a Monte Carlo evaluation of the baseline policy. In this section, we adopt a policy bootstrapping instead. More precisely, when a state-action pair $(x,a)$ is rarely seen in the dataset, i.e. $(x,a) \in B$, the batch algorithm is unable to assess its performance and instead relies on the baseline policy by copying its probability of taking this action in this particular situation: $\pi(a|x) = \pi_b(a|x)$ if $(x,a) \in B$. Algorithm 3 provides the pseudo-code of the baseline policy bootstrapping. It consists in constructing the set of allowed policies $\Pi_b$ and then searching for the $\Pi_b$-optimal policy $\pi_{pol}$ in the MDP model $\hat{M}$ estimated from dataset $D$. In practice, the optimisation process may be performed by policy iteration (Howard, 1960; Puterman & Brumelle, 1979): the current policy $\pi^{(i)}$ is evaluated with $Q^{(i)}$, and then the next iteration policy $\pi^{(i+1)}$ is made greedy with respect to $Q^{(i)}$ under the constraint of belonging to $\Pi_b$ (see Algorithm 4 in Appendix).

Algorithm 3: $\Pi_b$-SPIBB algorithm
Data: dataset $D$; baseline policy $\pi_b$; set of bootstrapped state-action pairs $B$; state and action sets $\mathcal{X}$ and $\mathcal{A}$
Compute the estimated MDP $\hat{M} = \langle \mathcal{X}, \mathcal{A}, \hat{P}, \hat{R}, \gamma \rangle$, where $\hat{P}$ and $\hat{R}$ are the model estimates obtained from $D$.
Define $\Pi_b = \{\pi \in \Pi \mid \pi(a|x) = \pi_b(a|x) \text{ if } (x,a) \in B\}$.
return $\pi_{pol} = \operatorname*{argmax}_{\pi \in \Pi_b} \rho(\pi, \hat{M})$
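A sketch of Algorithm 3 with the policy-iteration scheme described above: the greedy step constrained to Π_b freezes the baseline probabilities on B and puts the remaining mass on the best estimated action. The tabular array conventions are the same assumptions as in the earlier sketches.

```python
import numpy as np

def pi_b_spibb(P_hat, R_hat, pi_b, B, gamma, n_iterations=100):
    """Pi_b-SPIBB: constrained policy iteration on the MDP estimate.

    pi_b: baseline policy, shape (S, A); B: set of bootstrapped pairs.
    Returns pi_pol, a Pi_b-optimal policy of the estimated MDP.
    """
    S, A = R_hat.shape
    pi = pi_b.copy()
    for _ in range(n_iterations):
        # Policy evaluation on the MDP estimate (exact, tabular).
        P_pi = np.einsum("xay,xa->xy", P_hat, pi)
        R_pi = np.einsum("xa,xa->x", R_hat, pi)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        Q = R_hat + gamma * P_hat @ V               # shape (S, A)
        # Greedy step constrained to Pi_b.
        new_pi = np.zeros_like(pi)
        for x in range(S):
            frozen = [a for a in range(A) if (x, a) in B]
            free = [a for a in range(A) if (x, a) not in B]
            mass = 1.0
            for a in frozen:                        # copy pi_b on bootstrapped pairs
                new_pi[x, a] = pi_b[x, a]
                mass -= pi_b[x, a]
            if free:                                # remaining mass on the greedy action
                new_pi[x, free[int(np.argmax(Q[x, free]))]] += mass
        if np.allclose(new_pi, pi):
            break
        pi = new_pi
    return pi
```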

Similarly to Theorems 2 and 3 for $Q^{\pi_b}$-SPIBB-1, the near $\Pi_b$-optimality and the safe improvement over the baseline policy can be proven for $\Pi_b$-SPIBB:

Theorem 4 (Near $\Pi_b$-optimality of $\Pi_b$-SPIBB). Let $\Pi_b$ be the set of policies under the constraint of following $\pi_b$ when $(x,a) \in B$. Let $\pi_{pol}$ be a $\Pi_b$-optimal policy of the reward maximization problem of an estimated MDP $\hat{M}$. Then, the performance of $\pi_{pol}$ is near $\Pi_b$-optimal in the true MDP $M^*$:

$$\rho(\pi_{pol}, M^*) \geq \max_{\pi \in \Pi_b} \rho(\pi, M^*) - \frac{4\epsilon V_{max}}{1-\gamma}. \quad (11)$$

Theorem 5 (Safe policy improvement of $\Pi_b$-SPIBB). Let $\Pi_b$ be the set of policies under the constraint of following $\pi_b$ when $(x,a) \in B$. Let $\pi_{pol}$ be a $\Pi_b$-optimal policy of the reward maximization problem of an estimated MDP $\hat{M}$. Then, $\pi_{pol}$ is an approximate safe policy improvement over the baseline policy $\pi_b$ with high probability $1-\delta$:

$$\rho(\pi_{pol}, M^*) \geq \rho(\pi_b, M^*) - \frac{4\epsilon V_{max}}{1-\gamma}. \quad (12)$$

Algorithm 3, referred to as $\Pi_b$-SPIBB, has a tendency to reproduce the rare actions of the baseline policy. This is what allows us to guarantee a performance almost as good as the baseline policy's.


| Q-value | Baseline policy | Bootstrapped pair? | $\Pi_b$-SPIBB $\pi^{(i)}(a\|x)$ | $\Pi_0$-SPIBB $\pi^{(i)}(a\|x)$ | $\Pi_{\leq b}$-SPIBB $\pi^{(i)}(a\|x)$ |
| $Q^{(i)}(x,a_1) = 1$ | $\pi_b(a_1\|x) = 0.1$ | $(x,a_1) \in B$ | 0.1 | 0 | 0 |
| $Q^{(i)}(x,a_2) = 2$ | $\pi_b(a_2\|x) = 0.4$ | $(x,a_2) \notin B$ | 0 | 0 | 0 |
| $Q^{(i)}(x,a_3) = 3$ | $\pi_b(a_3\|x) = 0.3$ | $(x,a_3) \notin B$ | 0.7 | 1 | 0.8 |
| $Q^{(i)}(x,a_4) = 4$ | $\pi_b(a_4\|x) = 0.2$ | $(x,a_4) \in B$ | 0.2 | 0 | 0.2 |

Table 1: Policy improvement step at iteration (i) for the three policy-based SPIBB algorithms.

However, it may prove to be toxic when the baseline policy is already near-optimal, for two reasons: 1/ the rarely visited state-action pairs are generally the actions to which the behavioural policy assigns a low probability, meaning that these actions are likely to be bad; 2/ the baseline exploratory strategies fall into this category, and copying the baseline policy in this case means reproducing these strategies.

Another way to look at the problem is therefore to consider that rare actions must be avoided, because they are risky, and thus to force the policy to assign them a probability of 0. Algorithm 3 remains unchanged except that the policy search space $\Pi_b$ is replaced with $\Pi_0$ (see Algorithm 5 in Appendix), defined as follows:

$$\Pi_0 = \{\pi \in \Pi \mid \pi(a|x) = 0 \text{ if } (x,a) \in B\} \quad (13)$$

The empirical analysis of Section 4 shows that this variant, referred to as $\Pi_0$-SPIBB, often proves to be unsafe. We believe that a better policy-based SPIBB lies in-between: the space of policies to search in should be constrained not to give more weight than $\pi_b$ to actions that were not tried out enough to significantly assess their performance, while still leaving the possibility to completely cut off badly performing actions even though their evaluation is uncertain. The resulting algorithm is referred to as $\Pi_{\leq b}$-SPIBB. Once again, Algorithm 3 is reused, replacing the policy search space $\Pi_b$ with $\Pi_{\leq b}$ defined as follows:

$$\Pi_{\leq b} = \{\pi \in \Pi \mid \pi(a|x) \leq \pi_b(a|x) \text{ if } (x,a) \in B\} \quad (14)$$
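The greedy projection onto Π_{≤b} (Algorithm 4 in the appendix, not included in this excerpt) can be sketched as follows: bootstrapped actions receive at most their baseline probability, assigned in decreasing order of Q-value, and the remaining mass goes to the best non-bootstrapped action. On the hypothetical values of Table 1, this sketch reproduces the Π_{≤b}-SPIBB column.

```python
import numpy as np

def greedy_projection_leq_b(q_row, pi_b_row, bootstrapped):
    """One-state greedy step constrained to Pi_{<=b}.

    q_row:        Q^{(i)}(x, .), shape (A,)
    pi_b_row:     pi_b(.|x), shape (A,)
    bootstrapped: boolean mask, True where (x, a) is in B
    Returns pi^{(i+1)}(.|x), shape (A,).
    """
    new_pi = np.zeros(len(q_row))
    mass = 1.0
    # Visit actions from highest to lowest Q-value.
    for a in np.argsort(-q_row):
        if bootstrapped[a]:
            # Cannot exceed the baseline probability on a bootstrapped pair.
            new_pi[a] = min(mass, pi_b_row[a])
        else:
            # A non-bootstrapped action may take all the remaining mass.
            new_pi[a] = mass
        mass -= new_pi[a]
        if mass <= 0:
            break
    return new_pi

# Example of Table 1: Q = (1, 2, 3, 4), pi_b = (0.1, 0.4, 0.3, 0.2), B = {a1, a4}
# -> returns (0, 0, 0.8, 0.2), matching the Pi_{<=b}-SPIBB column.
```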

Algorithm 4 in Appendix describes the greedy projection of $Q^{(i)}$ onto $\Pi_{\leq b}$. Despite the lack of theoretical guarantees, in our experiments, $\Pi_{\leq b}$-SPIBB proves to be safe while outperforming $\Pi_b$-SPIBB. However, in a growing batch setting (Lange et al., 2012), it might be valuable to keep exploring the way $\Pi_b$-SPIBB does.

Table 1 shows the differences in policy projection during the policy improvement step of the policy iteration process. It shows how the baseline policy probability mass is redistributed among the different actions by the three policy-based SPIBB algorithms. We observe that for $\Pi_b$-SPIBB, the probabilities of the bootstrapped state-action pairs are untouched. At the opposite extreme, $\Pi_0$-SPIBB removes all mass from the bootstrapped state-action pairs. Finally, $\Pi_{\leq b}$-SPIBB lies in-between.

3.4. Discussion

Table 2 summarizes the algorithms' strengths and weaknesses. High-Confidence PI refers to the family of algorithms introduced in Thomas et al. (2015a), which rely on the ability to produce high-confidence policy evaluations (Mandel et al., 2014; Thomas et al., 2015b) of the trained policy, which are known to have high variance (Paduraru, 2013). As a consequence, they formally rely on a policy improvement safety test that is unlikely to be positive without an immense dataset.

Petrik et al. (2016) propose a large variety of algorithms. The basic (model-based) RL approach searches for the optimal policy in the MDP estimate $\hat{M}$. It is proved to converge to the optimal policy, but this proof relies on the impractical assumption of having sampled every state-action pair a huge number of times (see Theorem 1 in Section 3.1). ARBRM assumes the transition model is known around the baseline policy, which is a strong assumption that cannot be made in most practical problems. ARBRM, and Robust MDP to a lesser extent, suffer from high complexity even with a finite state space MDP: NP-hard, reduced through approximation to polynomial time; and they also lack safety guarantees with respect to the approximations made to keep them tractable. Finally, the Reward-Adjusted MDP algorithm has no proven safety and relies as a consequence on a safety test, similarly to High-Confidence PI.

Berkenkamp et al. (2017) assume the existence of a local, stable policy, and their Safe Lyapunov RL algorithm allows safe exploration outside of the safe region without ever leaving it. It exclusively addresses stabilization tasks in the online scenario. Roy et al. (2017) develop robust versions of the main model-free algorithms, offering algorithms that robustly converge online to the optimal policy. This does not help in the general batch setting, where the trust regions are infinite if a state-action pair has never been observed. Papini et al. (2017) propose to adapt the batch size to ensure that the gradients safely improve the policy. It has two main shortcomings: it requires the use of policy gradient updates, which is less sample-efficient than model-based methods, and it commands the batch size, which is usually not a feature over which the system has control in an offline setting like ours. These three very recent algorithms use different assumptions than ours.


| algorithm name | safe PI | PI test | known πb? | N_D(x,a) ≥ N• ? |
| Basic model-based RL | yes | no | no | N∧ = (2/ε²) log(|X||A| 2^|X| / δ) |
| ARBRM | yes | yes | yes | no |
| Robust MDP | yes | yes | yes | no |
| Reward-Adjusted MDP | yes | yes | no | no |
| High-Confidence PI | no | yes | no | no |
| Qπb-SPIBB-1 | Th. 3 | no | yes | N⊥ = (2/ε²) log(2|X||A| / δ) |
| Qπb-SPIBB-∞ | no | n/a | no | N⊥ = (2/ε²) log(2|X||A| / δ) |
| Πb-SPIBB | Th. 5 | no | yes | no |
| Π0-SPIBB | no | n/a | yes | no |
| Π≤b-SPIBB | no | n/a | yes | no |

Table 2: Brief summary of safe improvement algorithms. The columns are, from left to right: name of the algorithm, existence of safe improvement guarantees, whether it relies on a policy improvement test, whether the baseline policy needs to be known, and whether there is a constraint on the state-action pair counts. The safety of basic model-based RL is proved in Petrik et al. (2016). ARBRM, Robust MDP, and Reward-Adjusted MDP are respectively Algorithms 1, 2, and 3 in Petrik et al. (2016). High-Confidence PI is the general approach to policy improvement guaranteed by off-policy evaluation confidence intervals defended in Thomas (2015).

Our algorithms take inspiration from the ARBRM idea of finding a policy that is guaranteed to be an improvement for any realization of the uncertain parameters. Like ARBRM, they do so by taking into account the estimation error as a function of the state-action pair counts. But instead of searching for the analytic optimum, they go straightforwardly to a solution improving the baseline policy where the improvement can be guaranteed, and bootstrapping on the baseline policy where the uncertainty is too high. One can see this as a knows-what-it-knows algorithm (Li et al., 2008), asking for help from the baseline policy when it does not know whether it knows. As a consequence, our proofs do not require a policy improvement safety test. $Q^{\pi_b}$-SPIBB-1 is proved to be safe under conditions of application that are widely relaxed compared with the basic model-based RL Theorem 1. The $\Pi_b$-SPIBB algorithm does not rely on any condition of application other than knowing the baseline policy. Additionally, contrary to the other robust/safe batch RL algorithms in the literature, all SPIBB algorithms maintain a computational cost equal to that of basic RL algorithms.

4. Experimental evaluation

4.1. Gridworld setting

Our case study is a straightforward discrete, stochastic 5×5 gridworld (see Figure 1a). We use four actions: up, down, left and right. The transition function is stochastic: an action moves the agent in the specified direction with 75% chance, in the opposite direction with 5% chance, and to each side with 10% chance. The initial and final states are respectively the bottom-left and top-right corners. The reward is −10 when hitting a wall (in which case the agent does not move) and +100 when the final state is reached. Each run consists in generating a dataset, training a policy on it, and evaluating the trained policy.
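A minimal sketch of the transition noise described above. The grid encoding, the coordinates of the goal, and the exact wall handling are our own assumptions; the authors' implementation may differ in these details.

```python
import numpy as np

# Action displacements on the 5x5 grid: 0 up, 1 down, 2 left, 3 right.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
SIDE = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}       # perpendicular actions
OPPOSITE = {0: 1, 1: 0, 2: 3, 3: 2}

def step(state, action, rng, size=5):
    """One stochastic transition: 75% intended, 5% opposite, 10% each side."""
    row, col = state
    executed = rng.choice(
        [action, OPPOSITE[action], SIDE[action][0], SIDE[action][1]],
        p=[0.75, 0.05, 0.10, 0.10])
    dr, dc = MOVES[executed]
    new_row, new_col = row + dr, col + dc
    if not (0 <= new_row < size and 0 <= new_col < size):
        return (row, col), -10.0, False            # hit a wall: stay, reward -10
    if (new_row, new_col) == (0, size - 1):        # top-right corner: goal state
        return (new_row, new_col), 100.0, True
    return (new_row, new_col), 0.0, False
```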

The gridworld domain is justified by the fact that basic model-based RL already fails to be safe in this simple environment, and by the empirical worst-case evaluation, which requires 1,000 runs for each of 8 algorithms (the 5 SPIBB algorithms, basic RL, Robust MDP, and Reward-Adjusted MDP), 11 dataset sizes, and 8 $N_\wedge$ values. For basic RL, several Q-function initializations were investigated: the optimistic ($V_{max}$), the null (0), and the pessimistic ($-V_{max}$). The first two yielded awful performances.

Figure 1: Gridworld domain and dataset distribution. (a) Stochastic gridworld domain and its optimal trajectory. (b) log10 state-action pair counts after 1.2 × 10^7 trajectories.


Figure 2: Expectation of the number of (x, a) pairs counting a unique transition, as a function of the dataset size.

All the presented results are obtained with the pessimistic initialization.

Two baseline policies were used to generate the datasets and to bootstrap on. The literature benchmark is performed on the first one, a strong softmax exploration around the optimal Q-function. The SPIBB benchmark is performed on the second one, which differs in that it favours walking along the walls, although it should avoid doing so to prevent bad stochastic transitions. This baseline was constructed in order to demonstrate the unsafety of algorithm Π0-SPIBB.

The results are presented in two forms: the mean performance over all runs, and the worst-case performance over the 10% (decile) or 1% (centile) worst runs. For the SPIBB algorithms, values of N∧ from 5 to 1000 are tested.
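The two safety measures are simple order statistics over the runs; a sketch of how they can be computed (hypothetical helper, not taken from the paper):

```python
import numpy as np

def mean_and_worst_quantile_performance(run_performances, fraction=0.01):
    """Mean performance and mean of the worst `fraction` of runs.

    fraction=0.10 gives the worst-decile measure, 0.01 the worst-centile one.
    """
    perfs = np.sort(np.asarray(run_performances))
    k = max(1, int(np.floor(fraction * len(perfs))))
    return perfs.mean(), perfs[:k].mean()
```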

4.2. Basic model-based RL failure

All the state-of-the-art algorithms for batch RL assume that every state-action pair has been experienced a certain number of times (Delage & Mannor, 2010; Petrik et al., 2016). In this subsection, we aim to empirically demonstrate that this assumption is generally transgressed, even in our simple gridworld domain. To do so, we collect 12 million trajectories with the first baseline. The map of the state-action count logarithms (log10, see Figure 1b) shows how unbalanced the $N_D(x,a)$ counts are: some transitions are experienced in each trajectory, some only once every few million trajectories, and some are never seen at all.

Moreover, the actions that are rarely chosen are likely to be the dangerous ones, and for those, a bad model might lead to a catastrophic policy. Figure 2 displays the expected number of transitions that are seen exactly once in a dataset as a function of its size. This curve decreases slowly as more trajectories are collected. But if we look more specifically at dangerous transitions, i.e. the ones that direct the agent towards a wall, we observe a peak around 1,000 trajectories.

Figure 3: Literature benchmark: Basic RL, Robust MDP, and Reward-Adjusted MDP are compared to our Π≤b-SPIBB with N∧ = 100, on mean and safe performances.

In the next subsection, we see that this strongly affects the safety of basic RL: surprisingly, the models trained with 10 trajectories yield better returns on average than those trained with 1,000 trajectories. We conjecture that this issue is faced in most practical applications too. For instance, in dialogue systems, the collected human dialogue transitions all concentrate on what is actually being discussed.

4.3. Results

Figure 3 shows the literature benchmark results against our best algorithm, $\Pi_{\leq b}$-SPIBB. The basic RL algorithm performs reasonably well on average, but fails to be safe, and sometimes outputs a policy that is disastrous. We can notice that the performance reaches a valley for datasets of around 1,000 trajectories. We interpret it as a consequence of the rare-pair count effect described in the previous subsection. Neither Robust MDP nor Reward-Adjusted MDP seems to improve the safety when the safety test is omitted. We omitted it in our curves to make a relevant comparison: this test appears to be always negative because of its wide confidence interval. It is also worth mentioning that the Reward-Adjusted MDP algorithm tends to become suicidal in environments where the agent can get killed (not the case in our domain). Indeed, the intrinsic penalty adjustment may overwhelm the environment reward, and the optimal strategy may then be to stop the trajectory as fast as possible. Our algorithm $\Pi_{\leq b}$-SPIBB with $N_\wedge = 100$ is safe. Its worst-decile performance is even significantly higher than the other algorithms' mean performance. The empirical results of the SPIBB algorithms are discussed at length hereinbelow.

Our SPIBB algorithms are so efficient and safe that we shift the dataset size range to the [10, 1000] window. The safety is assessed by a worst-centile measure: the mean of the performance of the 1% worst runs. The basic RL worst centile is too low to appear on the plots.


Figure 4: SPIBB benchmark (N∧ = 5): (a) mean performance and (b) worst-centile performance, plotted against the number of trajectories in dataset D, for the baseline, Basic RL, Πb-SPIBB, Π0-SPIBB, Π≤b-SPIBB, Qπb-SPIBB-1, and Qπb-SPIBB-∞.

A wide range of values for $N_\wedge$ is evaluated, from 5 to 1000. The main lessons are that the safety of the improvement over the baseline is not much impacted by the choice of $N_\wedge$, but that a higher $N_\wedge$ makes the SPIBB algorithms more conservative and leads them to bootstrap more often on the baseline. For complete results, we refer the interested reader to the Appendix. Even though the theory would advise using higher values, we report here our best empirical results, obtained with $N_\wedge = 5$.

The value-based SPIBB algorithms $Q^{\pi_b}$-SPIBB-1 and $Q^{\pi_b}$-SPIBB-∞ fail at being safe with small datasets. The reason is that these algorithms rely on the assumption that even the bootstrapped state-action pairs have been experienced a small number of times. We also notice that, despite the lack of guarantees, $Q^{\pi_b}$-SPIBB-∞ improves the safety as compared to $Q^{\pi_b}$-SPIBB-1.

$\Pi_b$-SPIBB and $\Pi_{\leq b}$-SPIBB get a worst-case scenario only 10 points below the baseline, which is partially explained by the variance in the evaluation and is not likely to be a consequence of a bad policy. $\Pi_0$-SPIBB lacks safety with very small datasets, because it tends to completely abandon actions that are not sampled enough in the dataset, regardless of their performance. Results with higher $N_\wedge$ values show that there is a dataset size for which $\Pi_0$-SPIBB tends to cut the optimal actions (of not walking along the wall), which causes a strong performance drop, both in the worst-case scenario and in mean performance (see the Appendix). $\Pi_b$-SPIBB is more conservative and fails to improve as fast as the two other policy-based SPIBB algorithms, but it does so safely. $\Pi_{\leq b}$-SPIBB is the best of both worlds: safe, yet still capable of cutting bad actions even with only a small number of samples. However, for growing batch settings, it might be better to keep trying out the actions that were not sufficiently explored yet, and $\Pi_b$-SPIBB might be the best algorithm in this setting.

5. Conclusion and future work

In this paper, we tackle the problem of batch Reinforcement Learning and its safety. We reformulate the percentile criterion without compromising its safety, at the expense of the optimality of the safe solution. The gain is that it allows us to implement two algorithms, $Q^{\pi_b}$-SPIBB-1 and $\Pi_b$-SPIBB, that run as fast as a basic model-based RL algorithm, while generating a provably safe policy improvement over a known baseline $\pi_b$. Three other SPIBB algorithms are derived without any safety guarantees: $Q^{\pi_b}$-SPIBB-∞, $\Pi_0$-SPIBB, and $\Pi_{\leq b}$-SPIBB.

The empirical analysis shows that, even on a very simple domain, the basic RL algorithm fails to be safe, and the state-of-the-art safe batch RL algorithms do no better when the policy improvement safety test is omitted. This safety test has also proven to be (almost) always negative in our tests, consequently preventing any improvement over the baseline. The SPIBB algorithms show significantly better results: their worst-centile performance even surpasses the basic RL mean performance in most settings.

Future work includes developing model-free versions of our algorithms in order to ease their use in continuous-state MDPs and complex real-world applications, with state representation approximation, using density networks (Veness et al., 2012; Van Den Oord et al., 2016) to compute pseudo-counts (Bellemare et al., 2016), in a similar way to that of optimism-motivated online RL (Osband et al., 2016; Laroche & Barlier, 2017; Ostrovski et al., 2017). Future work also includes designing a Bayesian policy projection to take into account the uncertainty of local policy evaluation, and demonstrating that our algorithms may be used in conjunction with imitation learning to compute the baseline policy estimate.


References

Amodei, Dario, Olah, Chris, Steinhardt, Jacob, Christiano, Paul, Schulman, John, and Mané, Dan. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Auer, Peter and Ortner, Ronald. Logarithmic online regret bounds for undiscounted reinforcement learning. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS), 2007.

Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Rémi. Unifying count-based exploration and intrinsic motivation. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS), 2016.

Bellman, Richard. A Markovian decision process. Journal of Mathematics and Mechanics, 1957.

Berkenkamp, Felix, Turchetta, Matteo, Schoellig, Angela, and Krause, Andreas. Safe model-based reinforcement learning with stability guarantees. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS), 2017.

Brafman, Ronen I and Tennenholtz, Moshe. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.

Delage, Erick and Mannor, Shie. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 2010.

García, Javier and Fernández, Fernando. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 2015.

Geibel, Peter and Wysotzki, Fritz. Risk-sensitive reinforcement learning applied to control under constraints. 2005.

Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, and Tamar, Aviv. Bayesian reinforcement learning: A survey. CoRR, abs/1609.04436, 2016.

Howard, Ronald A. Dynamic programming and Markov processes. 1960.

Iyengar, Garud N. Robust dynamic programming. Mathematics of Operations Research, 2005.

Lange, Sascha, Gabel, Thomas, and Riedmiller, Martin. Batch reinforcement learning. In Reinforcement Learning. 2012.

Laroche, Romain and Barlier, Merwan. Transfer reinforcement learning with shared dynamics. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.

Li, Lihong, Littman, Michael L, and Walsh, Thomas J. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

Lipton, Zachary C, Kumar, Abhishek, Gao, Jianfeng, Li, Lihong, and Deng, Li. Combating deep reinforcement learning's sisyphean curse with reinforcement learning. arXiv preprint arXiv:1611.01211, 2016.

Mandel, Travis, Liu, Yun-En, Levine, Sergey, Brunskill, Emma, and Popovic, Zoran. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 13th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2014.

Nilim, Arnab and El Ghaoui, Laurent. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 2005.

Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.

Ostrovski, Georg, Bellemare, Marc G, Oord, Aaron van den, and Munos, Rémi. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

Paduraru, Cosmin. Off-policy Evaluation in Markov Decision Processes. PhD thesis, McGill University, 2013.

Papini, Matteo, Pirotta, Matteo, and Restelli, Marcello. Adaptive batch size for safe policy gradients. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS), 2017.

Parr, Ronald E. and Russell, Stuart. Hierarchical control and learning for Markov decision processes. University of California, Berkeley, Berkeley, CA, 1998.

Petrik, Marek, Ghavamzadeh, Mohammad, and Chow, Yinlam. Safe policy improvement by minimizing robust baseline regret. In Proceedings of the 29th Advances in Neural Information Processing Systems, 2016.

Puterman, Martin L and Brumelle, Shelby L. On the convergence of policy iteration in stationary dynamic programming. Mathematics of Operations Research, 1979.

Roy, Aurko, Xu, Huan, and Pokutta, Sebastian. Reinforcement learning under model mismatch. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS), 2017.

Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.

Szita, István and Lőrincz, András. The many faces of optimism: a unifying approach. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

Thomas, Philip, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015a.

Thomas, Philip S. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2015.

Thomas, Philip S, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High-confidence off-policy evaluation. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015b.

Van Den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1747-1756. JMLR.org, 2016.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael. Context tree switching. In Proceedings of the 22nd Data Compression Conference (DCC), pp. 327-336. IEEE, 2012.


A. Matrix notations for the proofs

The proofs make use of the matrix representation of the Q-function, the V-function, the policy, the transition function, the reward function, and the discount rate (when dealing with semi-MDPs).

The Q-function matrices have 1 row and $|\mathcal{X}||\mathcal{A}|$ columns.

The V-function matrices have 1 row and $|\mathcal{X}|$ columns.

The policy matrices $\pi$ have $|\mathcal{X}||\mathcal{A}|$ rows and $|\mathcal{X}|$ columns. Even though a policy is generally defined as a function from $\mathcal{X}$ to $\Delta_{\mathcal{A}}$ and could be represented by a compact matrix with $|\mathcal{A}|$ rows and $|\mathcal{X}|$ columns, in order to use simple matrix operators, we need the policy matrix to output a distribution over the state-action pairs. Consequently, our policy matrix is obtained through the following expansion along the diagonal:

$$\begin{bmatrix} \pi_{11} & \dots & \pi_{1j} & \dots & \pi_{1|\mathcal{X}|} \\ \vdots & & \vdots & & \vdots \\ \pi_{i1} & \dots & \pi_{ij} & \dots & \pi_{i|\mathcal{X}|} \\ \vdots & & \vdots & & \vdots \\ \pi_{|\mathcal{A}|1} & \dots & \pi_{|\mathcal{A}|j} & \dots & \pi_{|\mathcal{A}||\mathcal{X}|} \end{bmatrix} = \begin{bmatrix} \pi_{\cdot 1} & \dots & \pi_{\cdot j} & \dots & \pi_{\cdot |\mathcal{X}|} \end{bmatrix} \longrightarrow \begin{bmatrix} \pi_{\cdot 1} & & 0 \\ & \ddots & \\ & \pi_{\cdot j} & \\ & & \ddots \\ 0 & & \pi_{\cdot |\mathcal{X}|} \end{bmatrix}$$

The transition matrices $P$ have $|\mathcal{X}|$ rows and $|\mathcal{X}||\mathcal{A}|$ columns.

The reward matrices $R$ have 1 row and $|\mathcal{X}||\mathcal{A}|$ columns.

The discount rate matrices $\Gamma$ have $|\mathcal{X}|$ rows and $|\mathcal{X}||\mathcal{A}|$ columns.

The expression $AB$ is the matrix product between matrices $A$ and $B$, for which column and row dimensions coincide.

The expression $A \odot B$ is the element-wise product between matrices $A$ and $B$ of the same dimension.

$\mathbf{I}$ denotes the identity matrix (the diagonal matrix with only ones), whose dimension is given by the context.

$\mathbb{1}_y$ denotes the column unit vector with zeros everywhere except for the element of index $y$, which equals 1. For instance, $Q\mathbb{1}_{x,a}$ denotes the value of performing action $a$ in state $x$.

The regular and option Bellman equations are therefore respectively written as follows:

$$Q = R + \gamma Q \pi P \quad (15)$$
$$Q = R + Q \pi (\Gamma \odot P) \quad (16)$$
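As a sanity check of these conventions, the following sketch builds the expanded policy matrix for a small random MDP (our own illustrative construction, not from the paper) and verifies that the regular matrix Bellman equation (15) holds with the stated shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random MDP in the conventions of this appendix: columns indexed by (x, a) = x*A + a.
P = rng.dirichlet(np.ones(S), size=S * A).T        # shape (S, S*A): each column is P(.|x, a)
R = rng.uniform(-1, 1, size=(1, S * A))            # shape (1, S*A)
pi_sa = rng.dirichlet(np.ones(A), size=S)          # ordinary |X| x |A| policy

# Expanded policy matrix: shape (S*A, S), block-diagonal as described above.
pi = np.zeros((S * A, S))
for x in range(S):
    pi[x * A:(x + 1) * A, x] = pi_sa[x]

# Q solves Q = R + gamma * Q pi P  <=>  Q (I - gamma * pi P) = R.
Q = np.linalg.solve((np.eye(S * A) - gamma * pi @ P).T, R.T).T
assert np.allclose(Q, R + gamma * Q @ pi @ P)      # Equation (15) holds
```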


B. Proofs for $Q^{\pi_b}$-SPIBB-1 (Section 3.2)

B.1. $\epsilon$ error with high probability $1-\delta$

Proposition 1. If for all state-action pairs $(x,a) \in B$, $\sqrt{\frac{2}{N_D(x,a)} \log \frac{2|\mathcal{X}||\mathcal{A}|}{\delta}} \leq \epsilon$, then, with probability at least $1-\delta$:

$$\forall (x,a) \notin B,\; \|P^*(\cdot|x,a) - \hat{P}(\cdot|x,a)\|_1 \leq \epsilon$$
$$\forall (x,a) \notin B,\; |R^*(x,a) - \hat{R}(x,a)| \leq \epsilon R_{max}$$
$$\forall (x,a) \in B,\; |Q^{\pi_b}(x,a) - \widehat{Q}^{\pi_b}(x,a)| \leq \epsilon V_{max} \quad (17)$$

Proof. From the construction of $B$, and from Proposition 9 of Petrik et al. (2016), the first condition is satisfied for every state-action pair $(x,a) \notin B$ individually with probability $\frac{\delta}{|\mathcal{X}||\mathcal{A}|}$.

The second and the third inequalities are obtained similarly. The proof is further detailed only for the third inequality hereinafter: given $(x,a) \in B$, and from the two-sided Hoeffding inequality:

$$\mathbb{P}\left(|Q^{\pi_b}(x,a) - \widehat{Q}^{\pi_b}(x,a)| > \epsilon V_{max}\right) = \mathbb{P}\left(\frac{|Q^{\pi_b}(x,a) - \widehat{Q}^{\pi_b}(x,a)|}{2V_{max}} > \sqrt{\frac{1}{2N_D(x,a)} \log \frac{2|\mathcal{X}||\mathcal{A}|}{\delta}}\right) \quad (18)$$
$$\leq 2 \exp\left(-2 N_D(x,a)\,\frac{1}{2N_D(x,a)} \log \frac{2|\mathcal{X}||\mathcal{A}|}{\delta}\right) \quad (19)$$
$$\leq \frac{\delta}{|\mathcal{X}||\mathcal{A}|} \quad (20)$$

Summing over all $|\mathcal{X}||\mathcal{A}|$ state-action pairs, each with probability lower than $\frac{\delta}{|\mathcal{X}||\mathcal{A}|}$, gives a total probability lower than $\delta$, which proves the proposition.

B.2. Value function error bounds

Lemma 1 (Value function error bounds). Consider two transition functions $P_1$ and $P_2$, two reward functions $R_1$ and $R_2$, and two bootstrapping Q-functions $Q_1$ and $Q_2$, used to bootstrap two MDPs $M_1$ and $M_2$. Consider a policy $\pi \in \Pi$, and let $V_1$ and $V_2$ be the state value functions of $\pi$ given $(P_1, R_1, Q_1)$ and $(P_2, R_2, Q_2)$, respectively. If

$$\forall (x,a) \notin B,\; \|P_1(\cdot|x,a) - P_2(\cdot|x,a)\|_1 \leq \epsilon, \qquad \forall (x,a) \notin B,\; |R_1(x,a) - R_2(x,a)| \leq \epsilon R_{max}, \qquad \forall (x,a) \in B,\; |Q_1(x,a) - Q_2(x,a)| \leq \epsilon V_{max},$$

then

$$\forall x \in \mathcal{X},\; |V_1(x) - V_2(x)| \leq \frac{\epsilon V_{max}}{1-\gamma}, \quad (21)$$

where $V_{max}$ is the known maximum of the value function.

Proof. In order to prove this lemma, we adopt the matrix notations. The premises are therefore equivalently written:

$$\forall (x,a) \notin B,\; \|P_1\mathbb{1}_{x,a} - P_2\mathbb{1}_{x,a}\|_1 \leq \epsilon, \qquad \forall (x,a) \notin B,\; |R_1\mathbb{1}_{x,a} - R_2\mathbb{1}_{x,a}| \leq \epsilon R_{max}, \qquad \forall (x,a) \in B,\; |Q_1\mathbb{1}_{x,a} - Q_2\mathbb{1}_{x,a}| \leq \epsilon V_{max}, \quad (22)$$

and the conclusion:

$$|V_1\mathbb{1}_x - V_2\mathbb{1}_x| \leq \frac{\epsilon V_{max}}{1-\gamma}. \quad (23)$$

The policy $\pi$ can be decomposed as the aggregation of two partial policies: $\pi = \bar{\pi} + \underline{\pi}$, where $\bar{\pi}$ contains the non-bootstrapped action probabilities and $\underline{\pi}$ the bootstrapped action probabilities. Then, the difference between the two value functions can be written:

$$V_1 - V_2 = R_1\pi + \gamma V_1 P_1 \pi - R_2\pi - \gamma V_2 P_2 \pi \quad (24)$$
$$= R_1\pi + \gamma V_1 P_1 \pi - R_2\pi - \gamma V_2 P_2 \pi + \gamma V_2 P_1 \pi - \gamma V_2 P_1 \pi \quad (25)$$
$$= (R_1 - R_2)\pi + \gamma (V_1 - V_2) P_1 \pi + \gamma V_2 (P_1 - P_2) \pi \quad (26)$$
$$= (R_1 - R_2)\pi + \gamma (V_1 - V_2) P_1 \bar{\pi} + \gamma V_2 (P_1 - P_2) \bar{\pi} \quad (27)$$
$$= \left[(R_1 - R_2)\pi + \gamma V_2 (P_1 - P_2) \bar{\pi}\right] (\mathbf{I} - \gamma P_1 \bar{\pi})^{-1}. \quad (28)$$

Line 27 is explained by the fact that the bootstrapping actions lead to a terminal state: therefore $P_1\underline{\pi} = P_2\underline{\pi} = 0$. Line 28 is obtained by passing the second term to the left-hand side of the equation, factorising over $V_1 - V_2$ and dividing by its factor. Now, using Hölder's inequality, and recalling that $\bar{\pi}$, the non-bootstrapped part of the policy, has its support exclusively outside of $B$, we have for any state-action pair $(x,a) \notin B$:

$$|\gamma V_2 (P_1 - P_2) \bar{\pi}\mathbb{1}_x| \leq |V_2(P_1\mathbb{1}_{x,a} - P_2\mathbb{1}_{x,a})|\,\|\bar{\pi}\mathbb{1}_x\|_1 \quad (29)$$
$$\leq \|V_2\|_\infty \|P_1\mathbb{1}_{x,a} - P_2\mathbb{1}_{x,a}\|_1 \|\bar{\pi}\mathbb{1}_x\|_1 \quad (30)$$
$$\leq \epsilon V_{max} \|\bar{\pi}\mathbb{1}_x\|_1. \quad (31)$$

Also, considering the reward term (whose entries on $B$ are the bootstrapping Q-functions), we get:

$$|(R_1 - R_2)\pi\mathbb{1}_x| \leq \|R_1 - R_2\|_\infty^{\bar{B}}\|\bar{\pi}\mathbb{1}_x\|_1 + \|Q_1 - Q_2\|_\infty^{B}\|\underline{\pi}\mathbb{1}_x\|_1 \quad (32)$$
$$\leq \epsilon R_{max}\|\bar{\pi}\mathbb{1}_x\|_1 + \epsilon V_{max}\|\underline{\pi}\mathbb{1}_x\|_1, \quad (33)$$

where the $\bar{B}$ and $B$ superscripts denote the supremum norms restricted to the pairs outside and inside $B$, respectively. Inserting 31 and 33 into Equation 28 gives:

$$|V_1\mathbb{1}_x - V_2\mathbb{1}_x| \leq \left\|(\mathbf{I} - \gamma P_1\bar{\pi})^{-1}\right\|_1 \left[\epsilon R_{max}\|\bar{\pi}\mathbb{1}_x\|_1 + \epsilon V_{max}\|\underline{\pi}\mathbb{1}_x\|_1 + \gamma \epsilon V_{max}\|\bar{\pi}\mathbb{1}_x\|_1\right] \quad (34)$$
$$\leq \frac{\epsilon V_{max}\|\underline{\pi}\mathbb{1}_x\|_1 + \epsilon\|\bar{\pi}\mathbb{1}_x\|_1(R_{max} + \gamma V_{max})}{1-\gamma} \quad (35)$$
$$\leq \frac{\epsilon V_{max}\|\underline{\pi}\mathbb{1}_x\|_1 + \epsilon V_{max}\|\bar{\pi}\mathbb{1}_x\|_1}{1-\gamma} \quad (36)$$
$$\leq \frac{\epsilon V_{max}}{1-\gamma} \quad (37)$$

B.3. Near optimality

Theorem 2 (Near optimality of $Q^{\pi_b}$-SPIBB-1). Let $\pi_{val}$ be an optimal policy of the reward maximization problem of the estimated $Q^{\pi_b}$-bootstrapped MDP $\widehat{\widetilde{M}}$. Then, under the construction properties of $\widehat{\widetilde{M}}$ and under the assumption of Proposition 1, the performance of $\pi_{val}$ in the $Q^{\pi_b}$-bootstrapped MDP $\widetilde{M}$ built on the true parameters is near-optimal:

$$\rho(\pi_{val}, \widetilde{M}) \geq \max_{\pi \in \Pi} \rho(\pi, \widetilde{M}) - \frac{2\epsilon V_{max}}{1-\gamma} \quad (38)$$

Proof. From Lemma 1, with $\pi = \pi_{val}$, $P_1 = P^*$, $P_2 = \hat{P}$, $R_1 = R^*$, $R_2 = \hat{R}$, $Q_1 = Q^{\pi_b}$, and $Q_2 = \widehat{Q}^{\pi_b}$, we have:

$$|\rho(\pi_{val}, \widetilde{M}) - \rho(\pi_{val}, \widehat{\widetilde{M}})| = |V^{\pi_{val}}_{\widetilde{M}}(x_0) - V^{\pi_{val}}_{\widehat{\widetilde{M}}}(x_0)| \quad (39)$$
$$\leq \frac{\epsilon V_{max}}{1-\gamma} \quad (40)$$

And if we write $\tilde{\pi}^* = \operatorname*{argmax}_{\pi \in \Pi} \rho(\pi, \widetilde{M})$, analogously, we also have:

$$|\rho(\tilde{\pi}^*, \widetilde{M}) - \rho(\tilde{\pi}^*, \widehat{\widetilde{M}})| \leq \frac{\epsilon V_{max}}{1-\gamma} \quad (41)$$

Thus, we may write:

$$\rho(\tilde{\pi}^*, \widetilde{M}) - \rho(\pi_{val}, \widetilde{M}) \overset{(a)}{\leq} \rho(\tilde{\pi}^*, \widetilde{M}) - \rho(\pi_{val}, \widehat{\widetilde{M}}) + \frac{\epsilon V_{max}}{1-\gamma} \quad (42)$$
$$\overset{(b)}{\leq} \rho(\tilde{\pi}^*, \widetilde{M}) - \rho(\tilde{\pi}^*, \widehat{\widetilde{M}}) + \frac{\epsilon V_{max}}{1-\gamma} \quad (43)$$
$$\overset{(c)}{\leq} \frac{\epsilon V_{max}}{1-\gamma} + \frac{\epsilon V_{max}}{1-\gamma} \quad (44)$$
$$\overset{(d)}{\leq} \frac{2\epsilon V_{max}}{1-\gamma}, \quad (45)$$

where each step is obtained as follows:

(a) From Equation 40.

(b) Optimality of $\pi_{val}$ in the estimated $Q^{\pi_b}$-bootstrapped MDP $\widehat{\widetilde{M}}$.

(c) From Equation 41.

(d) Summation.

B.4. Safe policy improvement

Proposition 2 (Baseline policy value conservation under $Q^{\pi_b}$-bootstrapping).

$$V^{\pi_b}_{\widetilde{M}} = V^{\pi_b}_{M^*}. \quad (46)$$

Proof. We adopt the matrix notations. The V-value function can be decomposed as follows:

$$V^{\pi_b}_{\widetilde{M}} = \left(\widetilde{R} + \gamma V^{\pi_b}_{\widetilde{M}}\widetilde{P}\right)\pi_b \quad (47)$$
$$= \left(\widetilde{R} + \gamma V^{\pi_b}_{\widetilde{M}}\widetilde{P}\right)\bar{\pi}_b + \left(\widetilde{R} + \gamma V^{\pi_b}_{\widetilde{M}}\widetilde{P}\right)\underline{\pi}_b \quad (48)$$
$$= \left(R^* + \gamma V^{\pi_b}_{\widetilde{M}}P^*\right)\bar{\pi}_b + Q^{\pi_b}_{M^*}\underline{\pi}_b \quad (49)$$
$$= \left(R^* + \gamma V^{\pi_b}_{\widetilde{M}}P^*\right)\bar{\pi}_b + \left(R^* + \gamma V^{\pi_b}_{M^*}P^*\right)\underline{\pi}_b, \quad (50)$$

where $\widetilde{P}$ and $\widetilde{R}$ are the transition and reward functions of $\widetilde{M}$, and $\pi_b = \bar{\pi}_b + \underline{\pi}_b$ is the decomposition of the baseline policy into its non-bootstrapped and bootstrapped parts. From Equation 50, it is direct to conclude that $V^{\pi_b}_{M^*}$ is the unique solution of the Bellman equation satisfied by $V^{\pi_b}_{\widetilde{M}}$, and therefore that $V^{\pi_b}_{\widetilde{M}} = V^{\pi_b}_{M^*}$.

Corollary 1 (Baseline policy return conservation under bootstrapping).

$$\rho(\pi_b, \widetilde{M}) = \rho(\pi_b, M^*). \quad (51)$$


Proof. This is a direct consequence of Proposition 2:

$$\rho(\pi_b, \widetilde{M}) = V^{\pi_b}_{\widetilde{M}}(x_0) = V^{\pi_b}_{M^*}(x_0) = \rho(\pi_b, M^*). \quad (52)$$

Theorem 3 (Safe policy improvement ofQπb -SPIBB-1). Let πval be an optimal policy of the reward maximization problem

of an estimated Qπb -bootstrapped MDP M . Then, under the construction properties of M and under the assumptionof Proposition 1, πval applied in M is an approximate safe policy improvement over the baseline policy πb with highprobability 1− δ:

ρ(πval, M) ≥ ρ(πb,M∗)− 2εVmax

1− γ (53)

Proof.

ρ(πval, M)(a)

≥ maxπ∈Π

ρ(π, M)− 2εVmax

1− γ (54)

(b)

≥ ρ(πb, M)− 2εVmax

1− γ (55)

(c)

≥ ρ(πb,M∗)− 2εVmax

1− γ , (56)

where each step is obtained as follows:

(a) From Theorem 2.

(b) Optimality of maxπ∈Π ρ(π, M), which upper-bounds ρ(πb, M) since πb ∈ Π.

(c) From Corollary 1.


C. Proofs for Πb-SPIBB (Section 3.3)

Proposition 3. Consider an environment modelled with a semi-MDP (Parr & Russell, 1998; Sutton et al., 1999) M = ⟨X, Ω_A, P∗, R∗, Γ∗⟩, where Γ∗ is the discount factor, smaller than or equal to γ, that varies with the state-action transitions, and the empirical semi-MDP M̂ = ⟨X, Ω_A, P̂, R̂, Γ̂⟩ estimated from a dataset D. If, in every state x where option o_a may be initiated (x ∈ I_a), we have
\begin{equation*}
\sqrt{\frac{2}{N_D(x,a)} \log\frac{|X||A|2^{|X|}}{\delta}} \leq \epsilon,
\end{equation*}
then, with probability at least 1 − δ:
\begin{equation}
\forall a \in A, \forall x \in I_a, \quad
\begin{cases}
\|\Gamma^* P^*(x, o_a) - \hat{\Gamma}\hat{P}(x, o_a)\|_1 \leq \epsilon\\
|R^*(x, o_a) - \hat{R}(x, o_a)| \leq \epsilon R_{max}
\end{cases} \tag{57}
\end{equation}

Proof. The proof is similar to that of Proposition 1.
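In practice, the count condition of Proposition 3 (like the analogous condition of Proposition 1) is turned into an explicit threshold on ND(x, a) below which a state-action pair is bootstrapped, a role played by the threshold N∧ in the experiments of Appendix E. The snippet below is only a sketch under assumed dataset counts, not the authors' implementation.

```python
# Sketch: derive the bootstrapped set B from state-action counts, by inverting
# sqrt( (2 / N_D(x,a)) * log(|X| |A| 2^{|X|} / delta) ) <= epsilon.
import numpy as np

def bootstrapped_set(counts, n_states, n_actions, epsilon, delta):
    """counts[x, a] = N_D(x, a); returns (mask of bootstrapped pairs, threshold)."""
    # log(|X| |A| 2^{|X|} / delta), computed in log-space to avoid overflow.
    log_term = np.log(n_states * n_actions / delta) + n_states * np.log(2.0)
    n_wedge = 2.0 * log_term / epsilon ** 2      # minimal admissible count
    return counts < n_wedge, n_wedge

rng = np.random.default_rng(2)
counts = rng.poisson(30, size=(10, 4))           # hypothetical dataset counts N_D(x, a)
B, n_wedge = bootstrapped_set(counts, n_states=10, n_actions=4, epsilon=1.0, delta=0.05)
print(f"threshold = {n_wedge:.1f}, bootstrapped pairs: {int(B.sum())}/{B.size}")
```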

C.1. Q-function error bounds with Πb-SPIBB

Lemma 2 (Q-function error bounds with Πb-SPIBB). Consider two semi-MDPs M1 = ⟨X, Ω_A, P1, R1, Γ1⟩ and M2 = ⟨X, Ω_A, P2, R2, Γ2⟩, a policy π, and let Q1 and Q2 be the state-action value functions of π in M1 and M2, respectively. If:
\begin{equation}
\forall a \in A, \forall x \in I_a, \quad
\begin{cases}
|R_1\mathbb{1}_{x,o_a} - R_2\mathbb{1}_{x,o_a}| \leq \epsilon R_{max}\\
\|(\Gamma_1 P_1)\mathbb{1}_{x,o_a} - (\Gamma_2 P_2)\mathbb{1}_{x,o_a}\|_1 \leq \epsilon,
\end{cases} \tag{58}
\end{equation}
then, we have:
\begin{equation}
\forall a \in A, \forall x \in I_a, \quad |Q_1\mathbb{1}_{x,o_a} - Q_2\mathbb{1}_{x,o_a}| \leq \frac{2\epsilon V_{max}}{1-\gamma}, \tag{59}
\end{equation}

where Vmax is the known maximum of the value function.

Proof. We adopt the matrix notations. The difference between the two state-option value functions can be written:

\begin{align}
Q_1 - Q_2 &= R_1 + Q_1\pi(\Gamma_1 P_1) - R_2 - Q_2\pi(\Gamma_2 P_2) \tag{60}\\
&= R_1 + Q_1\pi(\Gamma_1 P_1) - R_2 - Q_2\pi(\Gamma_2 P_2) + Q_2\pi(\Gamma_1 P_1) - Q_2\pi(\Gamma_1 P_1) \tag{61}\\
&= R_1 - R_2 + (Q_1 - Q_2)\pi(\Gamma_1 P_1) + Q_2\pi\big((\Gamma_1 P_1) - (\Gamma_2 P_2)\big) \tag{62}\\
&= \big[R_1 - R_2 + Q_2\pi\big((\Gamma_1 P_1) - (\Gamma_2 P_2)\big)\big]\,(I - \pi(\Gamma_1 P_1))^{-1}. \tag{63}
\end{align}

Now, using Hölder's inequality and the second assumption, we have:

\begin{equation}
|Q_2\pi\big((\Gamma_1 P_1) - (\Gamma_2 P_2)\big)\mathbb{1}_{x,o_a}| \leq \|Q_2\|_\infty \|\pi\|_\infty \|(\Gamma_1 P_1)\mathbb{1}_{x,o_a} - (\Gamma_2 P_2)\mathbb{1}_{x,o_a}\|_1 \leq \epsilon V_{max}. \tag{64}
\end{equation}

Inserting 64 into Equation 63 and using the first assumption, we obtain:

\begin{align}
|Q_1\mathbb{1}_{x,o_a} - Q_2\mathbb{1}_{x,o_a}| &\leq \left[\epsilon R_{max} + \epsilon V_{max}\right] \|(I - \pi(\Gamma_1 P_1))^{-1}\mathbb{1}_{x,o_a}\|_1 \tag{65}\\
&\leq \frac{2\epsilon V_{max}}{1-\gamma}, \tag{66}
\end{align}

which proves the lemma. In contrast to Lemma 1, there is an additional factor of 2 that deserves some discussion. It comes from the fact that, in the semi-MDP setting, we do not control how large Rmax can be (it might be as big as Vmax), and we do not control the γ factor in front of the second term either. As a consequence, we surmise that a tighter bound of εVmax/(1−γ) holds, but this still has to be proven.


C.2. Near optimality of Πb-SPIBB

Theorem 4 (Near Πb-optimality of Πb-SPIBB). Let Πb be the set of policies under the constraint of following πb when (x, a) ∈ B. Let πpol be a Πb-optimal policy of the reward maximization problem of an estimated MDP M̂. Then, the performance of πpol is near Πb-optimal in the true MDP M∗:
\begin{equation}
\rho(\pi_{pol}, M^*) \geq \max_{\pi\in\Pi_b} \rho(\pi, M^*) - \frac{4\epsilon V_{max}}{1-\gamma}. \tag{67}
\end{equation}

Proof. We transform the true MDP M∗ and the MDP estimate M̂ into their bootstrapped semi-MDP counterparts, which we denote M̃∗ and M̃ respectively. In these semi-MDPs, the actions A are replaced by options Ω_A = {o_a}_{a∈A} constructed as follows:
\begin{equation}
o_a = \langle I_a,\ a{:}\pi_b,\ \beta\rangle \quad\text{with}\quad
\begin{cases}
I_a = \{x \in X \text{ such that } (x, a) \notin B\}\\
a{:}\pi_b = \text{perform } a \text{ at initialization, then follow } \pi_b\\
\beta(x) = \|\tilde{\pi}_b(x, \cdot)\|_1,
\end{cases} \tag{68}
\end{equation}
where πb has been decomposed as the aggregation of two partial policies: $\pi_b = \tilde{\pi}_b + \bar{\pi}_b$, with $\tilde{\pi}_b(x, a)$ containing the non-bootstrapped action probabilities in state x, and $\bar{\pi}_b(x, a)$ the bootstrapped action probabilities. Let Π̃ denote the set of policies over the bootstrapped semi-MDPs.

I_a is the initiation set: it determines the set of states where the option is available. a:πb is the option policy followed during the length of the option. Finally, β(x) is the termination function, defining the probability of the option terminating in each state³. Please notice that some states might have no available option, but this is not an issue, since every option has a termination function equal to 0 in those states, meaning that they are never reached as option-initiation states. To avoid being in this situation at the beginning of the trajectory, we use the notion of a starting option: a trajectory starts with a void option o_∅ = ⟨{x_0}, πb, β⟩. By construction, x ∈ I_a if and only if (x, a) ∉ B, i.e. if and only if the condition on the state-action counts of Proposition 3 is fulfilled⁴. Also, any policy π ∈ Πb is implemented by a policy π̃ ∈ Π̃ in the bootstrapped semi-MDPs. Inversely, any policy π̃ ∈ Π̃ admits a policy π ∈ Πb in the original MDP.

Note also that, by construction, the transition and reward functions are only defined for (x, o_a) pairs such that x ∈ I_a. By convention, we set them to 0 for the other pairs. Their corresponding Q-functions are therefore set to 0 as well.
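To make the construction above concrete, here is a minimal sketch of the option objects of Equation (68); the data layout (policies indexed as pi_b[x][a]) and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: options o_a = <I_a, a:pi_b, beta> built from the baseline policy pi_b
# and the bootstrapped set B, following Equation (68).
from dataclasses import dataclass
from typing import Callable, Set, Tuple

State, Action = int, int

@dataclass
class Option:
    initiation_set: Set[State]                    # I_a = {x : (x, a) not in B}
    first_action: Action                          # perform a at initialization, then follow pi_b
    termination_prob: Callable[[State], float]    # beta(x): non-bootstrapped mass of pi_b in x

def build_options(states, actions, B: Set[Tuple[State, Action]], pi_b):
    def beta(x: State) -> float:
        # Shared by all options (cf. footnote 3): probability that pi_b picks
        # a non-bootstrapped action in x, i.e. that the option terminates there.
        return sum(pi_b[x][a] for a in actions if (x, a) not in B)
    return {a: Option(initiation_set={x for x in states if (x, a) not in B},
                      first_action=a,
                      termination_prob=beta)
            for a in actions}
```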

This means that Lemma 2 may be applied with $\pi = \tilde{\pi}_{pol}$, $M_1 = \tilde{M}^*$, and $M_2 = \tilde{M}$. We have:

\begin{align}
|\rho(\pi_{pol}, M^*) - \rho(\pi_{pol}, \hat{M})| &= |\rho(\tilde{\pi}_{pol}, \tilde{M}^*) - \rho(\tilde{\pi}_{pol}, \tilde{M})| \tag{69}\\
&= |V^{\tilde{\pi}_{pol}}_{\tilde{M}^*}(x_0) - V^{\tilde{\pi}_{pol}}_{\tilde{M}}(x_0)| \tag{70}\\
&= |Q^{\tilde{\pi}_{pol}}_{\tilde{M}^*}(x_0, o_\emptyset) - Q^{\tilde{\pi}_{pol}}_{\tilde{M}}(x_0, o_\emptyset)| \tag{71}\\
&\leq \frac{2\epsilon V_{max}}{1-\gamma}. \tag{72}
\end{align}

Let π∗ denote the optimal policy in the set of admissible policies Πb: $\pi^* = \operatorname{argmax}_{\pi\in\Pi_b} \rho(\pi, M^*)$. Analogously to 72, we also have:
\begin{equation}
|\rho(\pi^*, M^*) - \rho(\pi^*, \hat{M})| \leq \frac{2\epsilon V_{max}}{1-\gamma}. \tag{73}
\end{equation}

³ Notice that all options have the same termination function.
⁴ Also, note that there is the requirement here that the trajectories are generated under policy πb, so that the options are consistent with the dataset.


Thus, we may write:

\begin{align}
\rho(\pi^*, M^*) - \rho(\pi_{pol}, M^*) &\overset{(a)}{\leq} \rho(\pi^*, M^*) - \rho(\pi_{pol}, \hat{M}) + \frac{2\epsilon V_{max}}{1-\gamma} \tag{74}\\
&\overset{(b)}{\leq} \rho(\pi^*, M^*) - \rho(\pi^*, \hat{M}) + \frac{2\epsilon V_{max}}{1-\gamma} \tag{75}\\
&\overset{(c)}{\leq} \frac{2\epsilon V_{max}}{1-\gamma} + \frac{2\epsilon V_{max}}{1-\gamma} \tag{76}\\
&\overset{(d)}{\leq} \frac{4\epsilon V_{max}}{1-\gamma}, \tag{77}
\end{align}

where each step is obtained as follows:

(a) From equation 72.

(b) Πb-optimality of πpol in the estimated MDP M̂ (and π∗ ∈ Πb).

(c) From equation 73.

(d) From summation.

C.3. Safe policy improvement of Πb-SPIBB

Theorem 5 (Safe policy improvement of Πb-SPIBB). Let Πb be the set of policies under the constraint of following πb when (x, a) ∈ B. Let πpol be a Πb-optimal policy of the reward maximization problem of an estimated MDP M̂. Then, πpol is an approximate safe policy improvement over the baseline policy πb with high probability 1 − δ:
\begin{equation}
\rho(\pi_{pol}, M^*) \geq \rho(\pi_b, M^*) - \frac{4\epsilon V_{max}}{1-\gamma}. \tag{78}
\end{equation}

Proof. It is direct to observe that πb ∈ Πb, and therefore that:

\begin{equation}
\max_{\pi\in\Pi_b} \rho(\pi, M^*) \geq \rho(\pi_b, M^*). \tag{79}
\end{equation}

It follows from Theorem 4 that:

\begin{equation}
\rho(\pi_{pol}, M^*) \geq \rho(\pi_b, M^*) - \frac{4\epsilon V_{max}}{1-\gamma}. \tag{80}
\end{equation}


D. Algorithms for the greedy projection of Q(i) on Πb, Π0, and Π≤b

The policy-based SPIBB algorithms rely on a policy iteration process whose policy improvement step is performed under the constraint that the generated policy belongs to Πb, Π0, or Π≤b. These constrained greedy projections are respectively described in Algorithms 4, 5, and 6; an illustrative Python sketch follows each listing.

Algorithm 4 Greedy projection of Q(i) on Πb
Data: Baseline policy πb
Data: Last iteration value function Q(i)
Data: Set of bootstrapped state-action pairs B
Data: Current state x and action set A
Initialize π(i)pol = 0
for (x, a) ∈ B do π(i)pol(x, a) = πb(x, a)
π(i)pol(x, argmax_{(x,a)∉B} Q(i)(x, a)) = 1 − Σ_{(x,a′)∈B} πb(x, a′)
return π(i)pol
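For readability, here is a direct Python transcription of Algorithm 4, written as a sketch: the array layout (one row per state, boolean mask for B) is an assumption, and a guard is added for the corner case where every action of the current state is bootstrapped (in which case Πb forces the policy to follow πb).

```python
import numpy as np

def greedy_projection_pi_b(Q_i, pi_b, B, x):
    """Greedy projection of Q^(i) on Pi_b for state x (Algorithm 4).
    Q_i, pi_b: (n_states, n_actions) arrays; B: boolean mask of bootstrapped pairs."""
    pi_row = np.zeros(Q_i.shape[1])
    pi_row[B[x]] = pi_b[x, B[x]]                 # copy the baseline on bootstrapped pairs
    free = np.flatnonzero(~B[x])                 # non-bootstrapped actions of state x
    if free.size == 0:                           # corner case: everything is bootstrapped
        return pi_b[x].copy()
    best = free[np.argmax(Q_i[x, free])]
    pi_row[best] = 1.0 - pi_b[x, B[x]].sum()     # remaining mass on the greedy action
    return pi_row
```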

Algorithm 5 Greedy projection of Q(i) on Π0
Data: Baseline policy πb
Data: Last iteration value function Q(i)
Data: Set of bootstrapped state-action pairs B
Data: Current state x and action set A
Initialize π(i)pol = 0
π(i)pol(x, argmax_{(x,a)∉B} Q(i)(x, a)) = 1
return π(i)pol
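The projection on Π0 admits the same transcription, reduced to a single assignment (again a sketch with assumed data structures; it supposes that state x has at least one non-bootstrapped action).

```python
import numpy as np

def greedy_projection_pi_0(Q_i, B, x):
    """Greedy projection of Q^(i) on Pi_0 for state x (Algorithm 5)."""
    free = np.flatnonzero(~B[x])                 # non-bootstrapped actions of state x
    pi_row = np.zeros(Q_i.shape[1])
    pi_row[free[np.argmax(Q_i[x, free])]] = 1.0  # all the mass on the best known action
    return pi_row
```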

Algorithm 6 Greedy projection of Q(i) on Π≤b
Data: Baseline policy πb
Data: Last iteration value function Q(i)
Data: Set of bootstrapped state-action pairs B
Data: Current state x and action set A
Sort A in decreasing order of the action values Q(i)(x, a)
Initialize π(i)pol = 0
for a ∈ A do
    if (x, a) ∈ B then
        if πb(a|x) ≥ 1 − Σ_{a′∈A} π(i)pol(a′|x) then
            π(i)pol(a|x) = 1 − Σ_{a′∈A} π(i)pol(a′|x)
            return π(i)pol
        else
            π(i)pol(a|x) = πb(a|x)
        end
    else
        π(i)pol(a|x) = 1 − Σ_{a′∈A} π(i)pol(a′|x)
        return π(i)pol
    end
end
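Finally, a Python sketch of Algorithm 6: actions are visited in decreasing order of Q(i)(x, a), a bootstrapped action never receives more probability mass than the baseline gives it, and the first non-bootstrapped action encountered absorbs whatever mass remains. Data structures are assumed as in the previous sketches.

```python
import numpy as np

def greedy_projection_pi_leq_b(Q_i, pi_b, B, x):
    """Greedy projection of Q^(i) on Pi_{<=b} for state x (Algorithm 6)."""
    pi_row = np.zeros(Q_i.shape[1])
    remaining = 1.0
    for a in np.argsort(-Q_i[x]):                # decreasing order of Q^(i)(x, a)
        if B[x, a]:
            if pi_b[x, a] >= remaining:          # the baseline mass covers what is left
                pi_row[a] = remaining
                return pi_row
            pi_row[a] = pi_b[x, a]               # cap at the baseline probability
            remaining -= pi_b[x, a]
        else:
            pi_row[a] = remaining                # best remaining non-bootstrapped action
            return pi_row
    return pi_row
```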


E. Extensive SPIBB experimental results

[Figure: Figures 5 to 12 share the same layout: three panels, (a) mean performance, (b) worst decile performance and (c) worst centile performance, plotted against the number of trajectories in dataset D (log scale, 10^1 to 10^3), for the Baseline, Basic RL, Πb-SPIBB, Π0-SPIBB, Π≤b-SPIBB, Qπb-SPIBB-1 and Qπb-SPIBB-∞ algorithms.]

Figure 5: SPIBB benchmark (N∧ = 5): mean, worst decile and worst centile performances.


Figure 6: SPIBB benchmark (N∧ = 10): mean, worst decile and worst centile performances.


Figure 7: SPIBB benchmark (N∧ = 20): mean, worst decile and worst centile performances.


Figure 8: SPIBB benchmark (N∧ = 50): mean, worst decile and worst centile performances.



Figure 9: SPIBB benchmark (N∧ = 100): mean, worst decile and worst centile performances.


Figure 10: SPIBB benchmark (N∧ = 200): mean, worst decile and worst centile performances.


Figure 11: SPIBB benchmark (N∧ = 500): mean, worst decile and worst centile performances.


Figure 12: SPIBB benchmark (N∧ = 1000): mean, worst decile and worst centile performances.