
Lipschitz Lifelong Reinforcement Learning

Erwan Lecarpentier¹,², David Abel³, Kavosh Asadi³,⁴*, Yuu Jinnai³, Emmanuel Rachelson¹, Michael L. Littman³

¹ISAE-SUPAERO, Université de Toulouse, France
²ONERA, The French Aerospace Lab, Toulouse, France
³Brown University, Providence, Rhode Island, USA
⁴Amazon Web Service, Palo Alto, California, USA

[email protected]

Abstract

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show the method to experience no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.

1 Introduction

Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience to speed up the resolution of new tasks is a key question in that setting (Lazaric 2012; Taylor and Stone 2009). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task could then perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer, practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal Q-value function of an unknown target task, given the learned optimal Q-value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (Brafman and Tennenholtz 2002), which relies on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists of computing online the distance between source and target tasks, deducing the upper bound on the optimal Q-value function of the target task, and using this bound to accelerate learning. Importantly, the method exhibits no negative transfer, i.e., it cannot cause performance degradation, as the computed upper bound provably does not underestimate the optimal Q-value function.

* Kavosh Asadi finished working on this project before joining Amazon.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method applicable online and safe. In Section 4.3, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting. We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).

2 Background and Related Work

Reinforcement Learning (RL) (Sutton and Barto 2018) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (Puterman 2014) consisting of a 4-tuple ⟨S, A, R, T⟩, where S is a state space, A an action space, R^a_s is the expected reward of taking action a in state s, and T^a_{ss'} is the transition probability of reaching state s' when taking action a in state s. Without loss of generality, we assume R^a_s ∈ [0, 1]. Given a discount factor γ ∈ [0, 1), the expected cumulative return Σ_t γ^t R^{a_t}_{s_t} obtained along a trajectory starting with state s and action a using policy π in MDP M is denoted by Q^π_M(s, a) and called the Q-function. The optimal Q-function Q*_M is the highest attainable expected return from (s, a), and V*_M(s) = max_{a∈A} Q*_M(s, a) is the optimal value function in s. Notice that R^a_s ≤ 1 implies Q*_M(s, a) ≤ 1/(1−γ) for all (s, a) ∈ S × A. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q-function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.
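The code sketches added throughout this document use the following minimal tabular conventions. They are illustrations under assumptions of this rewrite (array layout R[s, a], T[s, a, s'] and the helper names), not the authors' implementation.

```python
import numpy as np

# Conventions assumed by the sketches in this document: a tabular task is a
# pair of arrays R[s, a] in [0, 1] and T[s, a, s'] whose rows sum to one.
def check_task(R, T):
    assert R.min() >= 0.0 and R.max() <= 1.0       # R^a_s in [0, 1]
    assert np.allclose(T.sum(axis=2), 1.0)         # valid transition probabilities

# With R^a_s <= 1, every Q-value is at most 1 / (1 - gamma): RMax uses this
# quantity as its optimistic initialization of the learned Q-function.
def rmax_upper_bound(n_states, n_actions, gamma):
    return np.full((n_states, n_actions), 1.0 / (1.0 - gamma))
```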

Lifelong RL (Silver, Yang, and Li 2013; Brunskill and Li 2014) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs. In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (Wilson et al. 2007; Brunskill and Li 2014; Abel et al. 2018), we restrict the scope of the study to the case where sampled MDPs share the same state-action space S × A. For brevity, we will refer interchangeably to MDPs, models or tasks, and write them M = ⟨R, T⟩.

Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. Song et al. (2016) define a metric based on the bi-simulation metric introduced by Ferns, Panangaden, and Precup (2004) and the Wasserstein metric (Villani 2008). Value transfer is performed between states with low bi-simulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting, where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose can lead to negative transfer (see Appendix, Section 1). Carroll and Seppi (2005) also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. Mahmud et al. (2013) and Brunskill and Li (2013) propose MDP clustering methods, respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP, and the L1-norm between the MDP models. An advantage of clustering is to prune the set of possible source tasks. They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. Ammar et al. (2014) learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. Lazaric, Restelli, and Bonarini (2008) provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models being identical. Their approach is applicable in a batch RL setting, as opposed to the online setting considered in this paper. The approach developed by Sorg and Singh (2009) is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. Brunskill and Li (2013) also provide a method that can be used for Q-function bounding in multi-task RL.

Abel et al. (2018) present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (Strehl, Li, and Littman 2009). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task's optimal Q-function. This approach results in a method for non-negative transfer with high probability once enough tasks have been sampled. The method developed by Abel et al. (2018) is similar to ours in two fundamental points: first, a theoretical upper bound on optimal Q-values across the MDP space is built; secondly, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum 1/(1−γ) bound in an RMax-like algorithm, providing PAC guarantees. The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by Abel et al. (2018), and the LRMax bound is the one we present in this paper. In this figure, the essence of the LRMax bound is noticeable. It stems from the fact that the optimal Q-value function is locally Lipschitz continuous in the MDP space w.r.t. a specific pseudometric. Confirming the intuition, close MDPs w.r.t. this metric have close optimal Q-values. It should be noticed that no bound is uniformly better than the other, as intuited by Figure 1. Hence, combining all the bounds results in a tighter upper bound, as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.

Figure 1: The optimal Q-value function represented for a particular (s, a) pair across the MDP space, between 1/(1−γ) and Q*_M(s, a). The RMax, MaxQInit and LRMax bounds are represented for three sampled MDPs.

3 Lipschitz Continuity of Q-Functions

The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-function over the MDP space ℳ. The remainder of this section mathematically formalizes this intuition, which will be used in the next section to derive a practical method for value transfer. To that end, we introduce a local pseudometric characterizing the distance between the models of two MDPs at a particular state-action pair. A reminder and a detailed discussion on the metrics used herein can be found in the Appendix, Section 2.

Definition 1. Given two tasks M = ⟨R, T⟩, M̄ = ⟨R̄, T̄⟩, and a function f : S → R+, we define the pseudometric between models at (s, a) ∈ S × A w.r.t. f as:

    D^f_sa(M, M̄) ≜ |R^a_s − R̄^a_s| + Σ_{s'∈S} f(s') |T^a_{ss'} − T̄^a_{ss'}|.   (1)
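As an illustration, a minimal sketch (under the array conventions introduced in Section 2, with illustrative function names) of how the pseudometric of Equation 1 can be evaluated for two tabular tasks:

```python
import numpy as np

def model_pseudometric(R, T, R_bar, T_bar, s, a, f):
    # D^f_sa(M, Mbar): reward gap plus f-weighted L1 gap between next-state
    # distributions (Equation 1); f is a non-negative weight per next state.
    reward_gap = abs(R[s, a] - R_bar[s, a])
    transition_gap = np.dot(f, np.abs(T[s, a] - T_bar[s, a]))
    return reward_gap + transition_gap
```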

This pseudometric is relative to a positive function f. We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. For the sake of clarity in the remainder of this study, we introduce

    D_sa(M‖M̄) ≜ D^{γV*_M̄}_sa(M, M̄),

corresponding to the pseudometric between models with the particular choice of f = γV*_M̄. From this definition stems the following pseudo-Lipschitz continuity result.


Proposition 1 (Local pseudo-Lipschitz continuity). For two MDPs M, M̄, for all (s, a) ∈ S × A,

    |Q*_M(s, a) − Q*_M̄(s, a)| ≤ Δ_sa(M, M̄),   (2)

with the local MDP pseudometric Δ_sa(M, M̄) ≜ min{d_sa(M‖M̄), d_sa(M̄‖M)}, where the local MDP dissimilarity d_sa(M‖M̄) is the unique solution to the following fixed-point equation on d_sa:

    d_sa = D_sa(M‖M̄) + γ Σ_{s'∈S} T^a_{ss'} max_{a'∈A} d_{s'a'},   ∀(s, a).   (3)

All the proofs of the paper can be found in the Appendix. This result establishes that the distance between the optimal Q-functions of two MDPs at (s, a) ∈ S × A is controlled by a local dissimilarity between the MDPs. The latter follows a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (Bellman 1957). Note that, although the local MDP dissimilarity d_sa(M‖M̄) is asymmetric, Δ_sa(M, M̄) is a pseudometric, hence the name pseudo-Lipschitz continuity. Notice that the policies in Equation 2 are the optimal ones for the two MDPs and are thus different. Proposition 1 is a mathematical result stemming from Definition 1 and should be distinguished from other frameworks of the literature that assume the continuity of the reward and transition models w.r.t. S × A (Rachelson and Lagoudakis 2010; Pirotta, Restelli, and Bascetta 2015; Asadi, Misra, and Littman 2018). This result establishes that the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Hence, given Q*_M̄, the function

    s, a ↦ Q*_M̄(s, a) + Δ_sa(M, M̄)   (4)

can be used as an upper bound on Q*_M, with M an unknown MDP. This is the idea on which we construct a computable and transferable upper bound in Section 4. In Figure 1, the upper bound of Equation 4 is represented by the LRMax bound. Noticeably, we provide a global pseudo-Lipschitz continuity property, along with similar results for the optimal value function V*_M and the value function of a fixed policy. As these results do not directly serve the purpose of this article, we report them in the Appendix, Section 4.
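To make the construction concrete, the following sketch (an illustration, not the authors' implementation) solves the fixed-point equation (3) by simple iteration (a γ-contraction, see Lemma 1 in the Appendix) and then forms the upper bound of Equation 4. It assumes full tabular models and precomputed source-task values Q*_M̄ and V*_M̄, which is precisely the assumption that Section 4 relaxes.

```python
import numpy as np

def local_dissimilarity(R, T, R_bar, T_bar, v_star_bar, gamma, tol=1e-8):
    # d_sa(M || Mbar): fixed point of Equation 3, computed by iteration.
    f = gamma * v_star_bar                                  # f = gamma * V*_Mbar
    D = np.abs(R - R_bar) + np.abs(T - T_bar) @ f           # Equation 1 at every (s, a)
    d = np.zeros_like(R)
    while True:
        d_new = D + gamma * T @ d.max(axis=1)               # DP backup of Equation 3
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new

def lipschitz_upper_bound(q_star_bar, d_forward, d_backward):
    # Equation 4 with the symmetrized Delta_sa of Proposition 1; d_backward is
    # d_sa(Mbar || M), obtained by swapping the roles of the two tasks.
    return q_star_bar + np.minimum(d_forward, d_backward)
```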

4 Transfer Using the Lipschitz Continuity

A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions:

C1. the resulting algorithm is PAC-MDP;
C2. the transfer accelerates learning;
C3. the transfer is non-negative.

To achieve these conditions, we first present a transferable upper bound on Q*_M in Section 4.1. This upper bound stems from the Lipschitz continuity result of Proposition 1. Then, we propose a practical way to compute this upper bound in Section 4.2. Precisely, we propose a surrogate bound that can be calculated online in the Lifelong RL setting, without having explored the source and target tasks completely. Finally, we implement the method in an algorithm described in Section 4.3, and demonstrate formally that it meets conditions C1, C2 and C3. Improvements are discussed in Section 4.4.

4.1 A Transferable Upper Bound on Q*_M

From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.

Definition 2. Given two tasks M and M̄, for all (s, a) ∈ S × A, the Lipschitz upper bound on Q*_M induced by Q*_M̄ is defined as U_M̄(s, a) ≥ Q*_M(s, a) with:

    U_M̄(s, a) ≜ Q*_M̄(s, a) + Δ_sa(M, M̄).   (5)

The optimism in the face of uncertainty principle leads to considering that the long-term expected return from any state is the maximum return 1/(1−γ), unless proven otherwise. In particular, the RMax algorithm (Brafman and Tennenholtz 2002) explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PAC-MDP guarantees (Strehl, Li, and Littman 2009), meaning that convergence to a near-optimal policy is guaranteed in a polynomial number of missteps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound U on the optimal Q-function, then acts greedily w.r.t. U. By default, it takes the maximum value U(s, a) = 1/(1−γ), but any tighter upper bound is admissible. Thus, shrinking U with Equation 5 is expected to improve the learning speed, or sample complexity, for new tasks in Lifelong RL.

In RMax, during the resolution of a task M, S × A is split into a subset of known state-action pairs K and its complement K^c of unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an ε-accurate model in L1-norm with probability higher than 1 − δ. We refer to ε and δ as the RMax precision parameters. This results in a threshold n_known on the number of visits n(s, a) to a pair (s, a) that are necessary to reach this precision. Given the experience of a set of m MDPs ℳ̄ = {M̄_1, ..., M̄_m}, we define the total bound as the minimum over all the induced Lipschitz bounds.

Proposition 2. Given a partially known task M = ⟨R, T⟩, the set of known state-action pairs K, and the set of Lipschitz bounds on Q*_M induced by previous tasks {U_M̄_1, ..., U_M̄_m}, the function Q defined below is an upper bound on Q*_M for all (s, a) ∈ S × A:

    Q(s, a) ≜ R^a_s + γ Σ_{s'∈S} T^a_{ss'} max_{a'∈A} Q(s', a')   if (s, a) ∈ K,
    Q(s, a) ≜ U(s, a)   otherwise,   (6)

with U(s, a) = min{ 1/(1−γ), U_M̄_1(s, a), ..., U_M̄_m(s, a) }.

Commonly in RMax, Equation 6 is solved to a precision ε_Q via Value Iteration. This yields a function Q that is a valid heuristic (bound on Q*_M) for the exploration of MDP M.
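A minimal sketch of this computation (illustrative, assuming numpy arrays for the learned model, a boolean mask for K, and a heuristic array U as in the earlier sketches):

```python
import numpy as np

def optimistic_q(R_hat, T_hat, known, U, gamma, eps_q=1e-6):
    # Value iteration on Equation 6: back up through the learned model on known
    # pairs, fall back to the heuristic U (e.g., Equation 10) on unknown pairs.
    Q = U.copy()
    while True:
        backup = R_hat + gamma * T_hat @ Q.max(axis=1)
        Q_new = np.where(known, backup, U)
        if np.abs(Q_new - Q).max() < eps_q:
            return Q_new
        Q = Q_new
```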

4.2 A Computable Upper Bound on Q*_M

The key issue addressed in this section is how to actually compute U_M̄(s, a), particularly when both source and target tasks are partially explored. Consider two tasks M and M̄, on which vanilla RMax has been applied, yielding the respective sets of known state-action pairs K and K̄, along with the learned models M̂ = ⟨T̂, R̂⟩ and M̄̂ = ⟨T̄̂, R̄̂⟩, and the upper bounds Q and Q̄, respectively on Q*_M and Q*_M̄. Notice that, if K = ∅, then Q(s, a) = 1/(1−γ) for all (s, a) pairs. Conversely, if K^c = ∅, Q is an ε-accurate estimate of Q*_M in L1-norm with high probability. Equation 6 allows the transfer of knowledge from M̄ to M if U_M̄(s, a) can be computed. Unfortunately, the true models and optimal value functions, necessary to compute U_M̄, are only partially known (see Equation 5). Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound D̂_sa(M‖M̄) on D_sa(M‖M̄) (Definition 1).

Proposition 3. Given two tasks M, M̄ and respectively K, K̄ the subsets of S × A where their models are known with accuracy ε in L1-norm with probability at least 1 − δ,

    Pr( D̂_sa(M‖M̄) ≥ D_sa(M‖M̄) ) ≥ 1 − δ,

with D̂_sa(M‖M̄) the upper bound on the pseudometric between models defined below, for B = ε(1 + γ max_{s'} V̄(s')):

    D̂_sa(M‖M̄) ≜ D^{γV̄}_sa(M̂, M̄̂) + 2B                 if (s, a) ∈ K ∩ K̄,
    D̂_sa(M‖M̄) ≜ max_{μ∈ℳ} D^{γV̄}_sa(M̂, μ) + B         if (s, a) ∈ K ∩ K̄^c,
    D̂_sa(M‖M̄) ≜ max_{μ∈ℳ} D^{γV̄}_sa(μ, M̄̂) + B         if (s, a) ∈ K^c ∩ K̄,
    D̂_sa(M‖M̄) ≜ max_{μ,μ'∈ℳ²} D^{γV̄}_sa(μ, μ')          if (s, a) ∈ K^c ∩ K̄^c.   (7)

Importantly, this upper bound D̂_sa(M‖M̄) can be calculated analytically (see Appendix, Section 7). This makes D̂_sa(M‖M̄) usable in the online Lifelong RL setting, where already explored tasks may be partially learned and little knowledge has been gathered on the current task. The magnitude of the B term is controlled by ε. In the case where no information is available on the maximum value of V̄, we have that B = ε/(1−γ). ε measures the accuracy with which the tasks are known: the smaller ε, the tighter the B bound. Note that V̄ is used as an upper bound on the true V*_M̄. In many cases, max_{s'} V*_M̄(s') is much smaller than 1/(1−γ); e.g., for stochastic shortest path problems, which feature rewards only upon reaching terminal states, we have that max_{s'} V*_M̄(s') = 1, and thus B = (1 + γ)ε is a tighter bound for transfer. Combining D̂_sa(M‖M̄) and Equation 3, one can derive an upper bound d̂_sa(M‖M̄) on d_sa(M‖M̄), detailed in Proposition 4.

Proposition 4. Given two tasks M, M̄ ∈ ℳ and K the set of state-action pairs where (R, T) is known with accuracy ε in L1-norm with probability at least 1 − δ, if γ(1 + ε) < 1, the solution d̂_sa(M‖M̄) of the following fixed-point equation on d_sa (for all (s, a) ∈ S × A) is an upper bound on d_sa(M‖M̄) with probability at least 1 − δ:

    d_sa = D̂_sa(M‖M̄) + γ ( Σ_{s'∈S} T̂^a_{ss'} max_{a'∈A} d_{s'a'} + ε max_{s',a'∈S×A} d_{s'a'} )   if (s, a) ∈ K,
    d_sa = D̂_sa(M‖M̄) + γ max_{s',a'∈S×A} d_{s'a'}   otherwise.   (8)

Similarly as in Proposition 3, the condition γ(1 + ε) < 1 illustrates the fact that for a large return horizon (large γ), a high accuracy (small ε) is needed for the bound to be computable. Eventually, a computable upper bound on Q*_M given M̄, valid with high probability, is given by

    Û_M̄(s, a) = Q̄(s, a) + min{ d̂_sa(M‖M̄), d̂_sa(M̄‖M) }.   (9)

The associated upper bound on U(s, a) (Equation 6), given the set of previous tasks ℳ̄ = {M̄_i}^m_{i=1}, is defined by

    Û(s, a) = min{ 1/(1−γ), Û_M̄_1(s, a), ..., Û_M̄_m(s, a) }.   (10)

This upper bound can be used to transfer knowledge from a partially solved source task to a target task. If Û(s, a) < 1/(1−γ) on a subset of S × A, then the convergence rate can be improved. As complete knowledge of both tasks is not needed to compute the upper bound, it can be applied online in the Lifelong RL setting. In the next section, we describe an algorithm that leverages this value-transfer method.
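A short sketch of how Equations 9 and 10 combine the stored per-task quantities into the final heuristic; the memory layout (a list of triples of (S, A) arrays) is an assumption of this illustration:

```python
import numpy as np

def combined_upper_bound(memory, gamma, shape):
    # Each stored source task contributes Q_bar + min(d_hat, d_hat_reversed)
    # (Equation 9); the final heuristic is the element-wise minimum of all of
    # them together with the default RMax bound 1 / (1 - gamma) (Equation 10).
    U = np.full(shape, 1.0 / (1.0 - gamma))
    for Q_bar, d_hat, d_hat_rev in memory:
        U = np.minimum(U, Q_bar + np.minimum(d_hat, d_hat_rev))
    return U
```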

4.3 Lipschitz RMax Algorithm

In Lifelong RL, MDPs are encountered sequentially. Applying RMax to a task M yields the set of known state-action pairs K, the learned models T̂ and R̂, and the upper bound Q on Q*_M. Saving this information when the task changes allows computing the upper bound of Equation 10 for the new target task, and using it to shrink the optimistic heuristic of RMax. This computation effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of D̂_sa(M‖M̄), d̂_sa(M‖M̄) and eventually Û, allowing for more efficient transfer where the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm 1. To avoid ambiguities with M, we use ℳ̄ to store the learned features (T̂, R̂, K, Q) of previous MDPs. In a nutshell, the behavior of LRMax is precisely that of RMax, but with a tighter admissible heuristic U that becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1), as stated in Propositions 5 and 6 below. With S = |S| and A = |A|, the sample complexity of vanilla RMax is O(S²A / (ε³(1−γ)³)), which is improved by LRMax in Proposition 5 and meets Condition C2. Finally, Û is a provable upper bound on Q*_M with high probability, which avoids negative transfer and meets Condition C3.

Proposition 5 (Sample complexity (Strehl, Li, and Littman 2009)). With probability 1 − δ, the greedy policy w.r.t. Q computed by LRMax achieves an ε-optimal return in MDP M after

    O( S |{(s, a) ∈ S × A : Û(s, a) ≥ V*_M(s) − ε}| / (ε³(1−γ)³) )

samples (when logarithmic factors are ignored), with Û defined in Equation 10 a non-static, decreasing quantity, upper bounded by 1/(1−γ).


Algorithm 1: Lipschitz RMax algorithm

Initialize ℳ̄ = ∅.
for each newly sampled MDP M do
    Initialize Q(s, a) = 1/(1−γ), ∀(s, a), and K = ∅
    Initialize T̂ and R̂ (RMax initialization)
    Q ← UpdateQ(ℳ̄, T̂, R̂)
    for t ∈ [1, max number of steps] do
        s = current state, a = argmax_{a'} Q(s, a')
        Observe reward r and next state s'
        n(s, a) ← n(s, a) + 1
        if n(s, a) < n_known then
            Store (s, a, r, s')
        if n(s, a) = n_known then
            Update K and (T̂^a_{ss'}, R̂^a_s) (learned model)
            Q ← UpdateQ(ℳ̄, T̂, R̂)
    Save M̄ = (T̂, R̂, K, Q) in ℳ̄

Function UpdateQ(ℳ̄, T̂, R̂):
    for M̄ ∈ ℳ̄ do
        Compute D̂_sa(M‖M̄), D̂_sa(M̄‖M) (Eq. 7)
        Compute d̂_sa(M‖M̄), d̂_sa(M̄‖M) (DP on Eq. 8)
        Compute Û_M̄ (Eq. 9)
    Compute Û (Eq. 10)
    Compute and return Q (DP on Eq. 6 using Û)
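For concreteness, a compact Python sketch of the main loop of Algorithm 1, assuming a simple tabular environment interface (reset/step returning integer states) and the helper functions sketched earlier (combined_upper_bound, optimistic_q). Unlike the full algorithm, this sketch computes the transferred bound Û only once per task instead of refining it each time the task similarity is reassessed.

```python
import numpy as np

def lrmax_task_loop(env, S, A, gamma, n_known, memory, n_steps):
    # Counts, empirical model and known-pair mask for the current task.
    n = np.zeros((S, A), dtype=int)
    r_sum = np.zeros((S, A))
    t_count = np.zeros((S, A, S))
    known = np.zeros((S, A), dtype=bool)
    R_hat = np.ones((S, A))                       # values on unknown pairs are unused
    T_hat = np.full((S, A, S), 1.0 / S)
    # Transferred heuristic (Eq. 10); equals 1/(1-gamma) if the memory is empty.
    U = combined_upper_bound(memory, gamma, (S, A))
    Q = optimistic_q(R_hat, T_hat, known, U, gamma)
    s = env.reset()
    for _ in range(n_steps):
        a = int(np.argmax(Q[s]))                  # act greedily w.r.t. the bound
        s_next, r = env.step(a)                   # assumed interface: (next state, reward)
        if n[s, a] < n_known:
            n[s, a] += 1
            r_sum[s, a] += r
            t_count[s, a, s_next] += 1
            if n[s, a] == n_known:                # the pair becomes known
                known[s, a] = True
                R_hat[s, a] = r_sum[s, a] / n_known
                T_hat[s, a] = t_count[s, a] / n_known
                Q = optimistic_q(R_hat, T_hat, known, U, gamma)
        s = s_next
    return R_hat, T_hat, known, Q                 # saved in the memory for later tasks
```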

Proposition 5 shows that the sample complexity of LRMax is no worse than that of RMax. Consequently, in the worst case, LRMax performs as badly as learning from scratch, which is to say that the transfer method is not negative as it cannot degrade the performance.

Proposition 6 (Computational complexity). The total computational complexity of LRMax (Algorithm 1) is

    O( τ + (S³A²N / (1−γ)) ln( 1 / (ε_Q(1−γ)) ) ),

with τ the number of interaction steps, ε_Q the precision of value iteration, and N the number of source tasks.

4.4 Refining the LRMax Bounds

LRMax relies on bounds on the local MDP dissimilarity (Equation 8). The quality of the Lipschitz bound on Q*_M can be improved according to the quality of those estimates. We discuss two methods to provide finer estimates.

Refining with prior knowledge. First, from the definition of D_sa(M‖M̄), it is easy to show that this pseudometric between models is always upper bounded by (1+γ)/(1−γ). However, in practice, the tasks experienced in a Lifelong RL experiment might not cover the full span of possible MDPs ℳ and may systematically be closer to each other than (1+γ)/(1−γ). For instance, the distance between two games of the Arcade Learning Environment (ALE) (Bellemare et al. 2013) is smaller than the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section 11). Let us consider the set of possible MDPs for a particular Lifelong RL experiment, a subset of ℳ, and let D_max(s, a) ≜ max_{M,M̄} D_sa(M‖M̄) be the maximum model pseudo-distance at a particular (s, a) pair over this subset. Prior knowledge might indicate a smaller upper bound for D_max(s, a) than (1+γ)/(1−γ). We will note such an upper bound D_max, considered valid for all (s, a) pairs, i.e., such that D_max ≥ max_{s,a} D_max(s, a). In a Lifelong RL experiment, D_max can be seen as a rough estimate of the maximum model discrepancy an agent may encounter. Figure 2 illustrates the relative importance of D_max vs. (1+γ)/(1−γ).

Figure 2: Illustration of the prior knowledge on the maximum pseudo-distance between models for a particular (s, a) pair: within the set ℳ of all MDPs, the set of possible MDPs of a Lifelong RL experiment (containing the sampled MDPs) has maximum model pseudo-distance D_max, smaller than (1+γ)/(1−γ).

Solving Equation 8 boils down to accumulating D̂_sa(M‖M̄) values in d̂_sa(M‖M̄). Hence, reducing a D̂_sa(M‖M̄) estimate in a single (s, a) pair actually reduces d̂_sa(M‖M̄) in all (s, a) pairs. Thus, replacing D̂_sa(M‖M̄) in Equation 8 by min{D_max, D̂_sa(M‖M̄)} provides a smaller upper bound d̂_sa(M‖M̄) on d_sa(M‖M̄), and thus a smaller Û, which allows transfer if it is less than 1/(1−γ). Consequently, the knowledge of such a bound D_max can make the difference between successful and unsuccessful transfer, even if its value is of little importance. Conversely, setting a value for D_max quantifies the distance between MDPs for which transfer is efficient.
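For completeness, the (1+γ)/(1−γ) bound mentioned above follows directly from Definition 1, using |R^a_s − R̄^a_s| ≤ 1, V*_M̄(s') ≤ 1/(1−γ) and Σ_{s'} |T^a_{ss'} − T̄^a_{ss'}| ≤ 2:

    D_sa(M‖M̄) = |R^a_s − R̄^a_s| + Σ_{s'∈S} γ V*_M̄(s') |T^a_{ss'} − T̄^a_{ss'}|
               ≤ 1 + (γ/(1−γ)) Σ_{s'∈S} |T^a_{ss'} − T̄^a_{ss'}|
               ≤ 1 + 2γ/(1−γ) = (1+γ)/(1−γ).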

Refining by learning the maximum distance. The value of D_max(s, a) can be estimated online for each (s, a) pair, discarding the hypothesis of available prior knowledge. We propose to use an empirical estimate of the maximum model distance at (s, a): D̂_max(s, a) ≜ max D̂_sa(M‖M̄), where the maximum is taken over the pairs of explored tasks. The pitfall of this approach is that, with few explored tasks, D̂_max(s, a) could underestimate D_max(s, a). Proposition 7 provides a lower bound on the probability that D̂_max(s, a) + ε does not underestimate D_max(s, a), depending on the number of sampled tasks.

Proposition 7. Consider an algorithm producing ε-accurate model estimates D̂_sa(M‖M̄) for a subset K of S × A after interacting with any two MDPs M, M̄ of the set of possible tasks. Assume D̂_sa(M‖M̄) ≥ D_sa(M‖M̄) for any (s, a) ∉ K. For all (s, a) ∈ S × A and δ ∈ (0, 1], after sampling m tasks, if m is large enough to verify 2(1 − p_min)^m − (1 − 2p_min)^m ≤ δ, with p_min a lower bound on the sampling probability of any task, then

    Pr( D̂_max(s, a) + ε ≥ D_max(s, a) ) ≥ 1 − δ.


This result indicates when D̂_max(s, a) + ε upper bounds D_max(s, a) with high probability. In such a case, D̂_sa(M‖M̄) in Equation 8 can be replaced by min{D̂_max(s, a) + ε, D̂_sa(M‖M̄)} to tighten the bound on d_sa(M‖M̄). Assuming a lower bound p_min on the sampling probability of a task implies that the set of possible tasks is finite, and it is seen as a non-adversarial task sampling rule (Abel et al. 2018).
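As an illustration, the smallest number of tasks m satisfying the condition of Proposition 7 can be found by direct enumeration (a sketch with illustrative parameter values):

```python
def tasks_needed(p_min, delta, m_max=10**6):
    # Smallest m such that 2(1 - p_min)^m - (1 - 2 p_min)^m <= delta.
    for m in range(1, m_max + 1):
        if 2 * (1 - p_min) ** m - (1 - 2 * p_min) ** m <= delta:
            return m
    return None

# E.g., tasks_needed(0.2, 0.05) returns the first m for which the empirical
# maximum distance estimate can be trusted with probability at least 1 - delta.
```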

5 Experiments

The experiments reported here¹ illustrate how the Lipschitz bound (Equation 9) provides a tighter upper bound on Q*, improving the sample complexity of LRMax compared to RMax and making the transfer of inter-task knowledge effective. Graphs are displayed with 95% confidence intervals. For information in line with the Machine Learning Reproducibility Checklist (Pineau 2019), see the Appendix, Section 16.

We evaluate different variants of LRMax in a Lifelong RL experiment. The RMax algorithm is used as a no-transfer baseline. LRMax(x) denotes Algorithm 1 with prior D_max = x. MaxQInit denotes the MAXQINIT algorithm from Abel et al. (2018), a state-of-the-art PAC-MDP algorithm. Both the LRMax and MaxQInit algorithms achieve value transfer by providing a tighter upper bound on Q* than 1/(1−γ). Computing both upper bounds and taking the minimum results in combining the two approaches. We include such a combination in our study with the LRMaxQInit algorithm. Similarly, the latter algorithm benefiting from prior knowledge D_max = x is denoted by LRMaxQInit(x). For the sake of comparison, we only compare algorithms with the same features, namely tabular, online, PAC-MDP methods presenting non-negative transfer.

The environment used in all experiments is a variant of the "tight" task used by Abel et al. (2018). It is an 11 × 11 grid-world; the initial state is in the centre and the actions are the cardinal moves (Appendix, Section 12). The reward is always zero except in the three goal cells in the upper-right corner. Each sampled task has its own reward values, drawn from [0.8, 1] for each of the three goal cells, and its own probability of slipping (performing a different action than the one selected), picked in [0, 0.1]. Hence, tasks have different reward and transition functions. Notice the distinction in applicability between MaxQInit, which requires the set of MDPs to be finite, and LRMax, which can be used with any set of MDPs. For the comparison between both to be possible, we drew tasks from a finite set of 5 MDPs. We sample 15 tasks sequentially among this set, each run for 2000 episodes of length 10. The operation is repeated 10 times to narrow the confidence intervals. We set n_known = 10, δ = 0.05, and ε = 0.01 (discussion in Appendix, Section 15). Other Lifelong RL experiments are reported in Appendix, Section 13.
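A hedged sketch of the task distribution described above; function and field names are illustrative, and the grid layout itself (11 × 11, centre start, cardinal moves) is fixed across tasks:

```python
import numpy as np

def sample_tight_task(rng):
    # Each task draws its own goal rewards and slip probability.
    return {
        "goal_rewards": rng.uniform(0.8, 1.0, size=3),   # one value per goal cell
        "slip_probability": rng.uniform(0.0, 0.1),       # chance of a different action
    }

rng = np.random.default_rng(0)
candidates = [sample_tight_task(rng) for _ in range(5)]    # finite set of 5 MDPs
tasks = [candidates[rng.integers(5)] for _ in range(15)]   # 15 sequentially drawn tasks
```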

The results are reported in Figure 3. Figure 3a displays the discounted return for each task, averaged across episodes. Similarly, Figure 3b displays the discounted return for each episode, averaged across tasks (same color code as Figure 3a). Figure 3c displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relative to an estimator of the optimal expected return for each task. For readability, Figures 3b and 3c display a moving average over 100 episodes. Figure 3d reports the benefits of various values of D_max on the algorithmic properties.

¹ Code available at https://github.com/SuReLI/llrl

In Figure 3a, we first observe that LRMax benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. In contrast, MaxQInit needs to wait until task 12 before benefiting from transfer. As suggested in Section 4.4, increasing amounts of prior knowledge allow the LRMax transfer method to be more efficient: a smaller known upper bound D_max on D_sa(M‖M̄) accelerates convergence. Combining both approaches in the LRMaxQInit algorithm outperforms all other methods. Episode-wise, we observe in Figure 3b that the LRMax transfer method allows for faster convergence, i.e., lower sample complexity. Interestingly, LRMax exhibits three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This behavior is a consequence of the combined facts that the Lipschitz bound (Equation 9) is larger on promising regions of S × A seen in previous tasks, and that LRMax acts greedily w.r.t. that bound. 2) This high-performance regime is followed by the exploration of unknown regions of S × A, in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low-payoff regions. Such regions are then identified and never revisited. 3) Eventually, LRMax stops exploring and converges to the optimal policy. Importantly, in all experiments, LRMax never experiences negative transfer, as supported by the provability of the Lipschitz upper bound with high probability. LRMax is at least as efficient as the no-transfer RMax baseline.

Figure 3c displays the collected returns of RMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from transfer as early as Task 2, where the previous three-stage behavior is visible. MaxQInit takes until Task 12 to leverage the transfer method. However, the bound it provides is tight enough that it does not have to explore.

In Figure 3d, we display the following quantities for various values of D_max: ρ_Lip, the fraction of the time the Lipschitz bound was tighter than the RMax bound 1/(1−γ); ρ_Speed-up, the relative gain in time steps before convergence when comparing LRMax to RMax, estimated based on the last updates of the empirical model M̂; and ρ_Return, the relative total return gain of LRMax w.r.t. RMax over 2000 episodes. First, we observe an increase of ρ_Lip as D_max becomes tighter. This means that the Lipschitz bound of Equation 9 becomes effectively smaller than 1/(1−γ). This phenomenon leads to faster convergence, indicated by ρ_Speed-up. Eventually, this increased convergence rate allows for a net total return gain, as can be seen with the increase of ρ_Return.

Overall, in this analysis, we have shown that LRMax benefits from an enhanced sample complexity thanks to the value-transfer method. The knowledge of a prior D_max increases this benefit. The method is comparable to the MaxQInit method and has some advantages, such as the early fitness for use and the applicability to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. Additionally, we show in Appendix, Section 14, that, when provided with any prior D_max, LRMax increasingly stops using it during exploration, confirming the claim of Section 4.4 that providing D_max enables transfer even if its value is of little importance.

Figure 3: Experimental results for RMax, LRMax, LRMax(0.2), LRMax(0.1), MaxQInit, LRMaxQInit and LRMaxQInit(0.1): (a) average discounted return vs. tasks; (b) average discounted return vs. episodes; (c) discounted return for specific tasks; (d) algorithmic properties (ρ_Lip, ρ_Speed-up, ρ_Return) vs. the prior knowledge D_max. LRMax benefits from an enhanced sample complexity thanks to the value-transfer method.

6 Conclusion

We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space w.r.t. a new metric. We proved a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks, and exhibits no negative transfer. Improvements of the algorithm were discussed, in the form of prior knowledge on the maximum distance between models and online estimation of this distance. As a non-negative, similarity-based, PAC-MDP transfer method, the LRMax algorithm is the first method in the literature combining those three appealing features. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning while not experiencing any negative transfer. Notably, our approach can directly extend other PAC-MDP algorithms (Szita and Szepesvari 2010; Rao and Whiteson 2012; Pazis, Parr, and How 2016; Dann, Lattimore, and Brunskill 2017) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis for non-negative value transfer via MDP similarity, a study that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme achieving non-negative transfer with PAC guarantees. Further, designing scalable methods conveying the same intuition could be a promising research direction.


Acknowledgments

We would like to thank Dennis Wilson for fruitful discussions and comments on the paper. This research was supported by the Occitanie region, France; ISAE-SUPAERO (Institut Supérieur de l'Aéronautique et de l'Espace); Fondation ISAE-SUPAERO; École Doctorale Systèmes; and ONERA, the French Aerospace Lab.

References

Abel, D.; Jinnai, Y.; Guo, S. Y.; Konidaris, G.; and Littman, M. L. 2018. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 20–29.

Ammar, H. B.; Eaton, E.; Taylor, M. E.; Mocanu, D. C.; Driessens, K.; Weiss, G.; and Tuyls, K. 2014. An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning. In Workshops at the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).

Asadi, K.; Misra, D.; and Littman, M. L. 2018. Lipschitz Continuity in Model-Based Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018).

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.

Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.

Brafman, R. I.; and Tennenholtz, M. 2002. R-max - a General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3(Oct): 213–231.

Brunskill, E.; and Li, L. 2013. Sample Complexity of Multi-task Reinforcement Learning. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013).

Brunskill, E.; and Li, L. 2014. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 316–324.

Carroll, J. L.; and Seppi, K. 2005. Task Similarity Measures for Transfer in Reinforcement Learning Task Libraries. In Proceedings of the 5th International Joint Conference on Neural Networks (IJCNN 2005), volume 2, 803–808. IEEE.

Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5713–5723.

Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.

Lazaric, A. 2012. Transfer in Reinforcement Learning: a Framework and a Survey. In Reinforcement Learning, 143–173. Springer.

Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Transfer of Samples in Batch Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 544–551.

Mahmud, M. M.; Hawasly, M.; Rosman, B.; and Ramamoorthy, S. 2013. Clustering Markov Decision Processes for Continual Transfer. Computing Research Repository (arXiv/CoRR). URL https://arxiv.org/abs/1311.3959.

Pazis, J.; Parr, R. E.; and How, J. P. 2016. Improving PAC Exploration Using the Median of Means. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 3898–3906.

Pineau, J. 2019. Machine Learning Reproducibility Checklist. https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf. Version 1.2, March 27, 2019, last accessed on August 27, 2020.

Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy Gradient in Lipschitz Markov Decision Processes. Machine Learning 100(2-3): 255–283.

Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rachelson, E.; and Lagoudakis, M. G. 2010. On the Locality of Action Domination in Sequential Decision Making. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010).

Rao, K.; and Whiteson, S. 2012. V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), 375–382.

Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, 05.

Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.

Sorg, J.; and Singh, S. 2009. Transfer via Soft Homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), 741–748. International Foundation for Autonomous Agents and Multiagent Systems.

Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Szita, I.; and Szepesvari, C. 2010. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 1031–1038.

Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.


Lipschitz Lifelong Reinforcement Learning
Appendix

Erwan Lecarpentier¹,², David Abel³, Kavosh Asadi³,⁴*, Yuu Jinnai³, Emmanuel Rachelson¹, Michael L. Littman³

¹ISAE-SUPAERO, Université de Toulouse, France
²ONERA, The French Aerospace Lab, Toulouse, France
³Brown University, Providence, Rhode Island, USA
⁴Amazon Web Service, Palo Alto, California, USA

[email protected]

1 Negative transfer

In the Lifelong RL setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs. Such a practice, known as knowledge transfer, sometimes causes the opposite effect, i.e., it decreases the performance. In such a case, we talk about negative transfer. Several attempts to formally define negative transfer have been made, but researchers hardly agree on a single definition, as performance can be defined in various ways. For instance, it can be characterized by the speed of convergence, the area under the learning curve, the final score of the learned policy or classifier, and many other things. Defining negative transfer is out of the scope of this paper, but let us give an example of why this phenomenon can be problematic.

In their paper, Song et al. (2016) propose a transfer method based on the metric between MDPs they introduce, stemming from the bi-simulation metric introduced by Ferns, Panangaden, and Precup (2004). In their method, a bi-simulation metric is computed between each pair of states belonging respectively to the source and target MDPs. Roughly, this metric tells how different the transition and reward models corresponding to the state pairs are, for the action maximizing the distance. More precisely, if we note (T, R) and (T̄, R̄) the models of two MDPs, and (s, s') a state pair, the distance d between s and s' is defined by

    d(s, s') = max_{a∈A} ( |R^a_s − R̄^a_{s'}| + c W₁(T^a_{s·}, T̄^a_{s'·}) ),   (11)

where c ∈ R is a positive constant and W₁ is the 1-Wasserstein metric. For each state of the target model, the closest counterpart state (with the smallest bi-simulation distance) of the source MDP is identified, and its learned Q-values are used to initialize the Q-function of the target MDP. In their experiments, Song et al. (2016) run a standard Q-Learning algorithm (Watkins and Dayan 1992) with an ε-greedy exploration strategy thereafter.
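As an illustration of Equation 11, the following sketch computes this distance for tabular models, using scipy's one-dimensional Wasserstein distance over integer state indices, i.e., assuming |i − j| as the ground metric; this is a simplification made for this example only, since the general definition uses a metric over the state space.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def bisimulation_style_distance(R, T, R_bar, T_bar, s, s_bar, c=1.0):
    # Maximum over actions of the reward gap plus c times the Wasserstein
    # distance between the next-state distributions (Equation 11).
    n_states = T.shape[2]
    support = np.arange(n_states)                # states embedded on the integer line
    gaps = []
    for a in range(R.shape[1]):
        reward_gap = abs(R[s, a] - R_bar[s_bar, a])
        w1 = wasserstein_distance(support, support, T[s, a], T_bar[s_bar, a])
        gaps.append(reward_gap + c * w1)
    return max(gaps)
```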

Figure 1: The T-shaped MDP transfer task. The source task (left) and the target task (right) each consist of an initial state I (resp. I'), k intermediate states, and a state A (resp. A') from which two states B and C (resp. B' and C') can be reached; a reward R = +1 is received in B in the source task and in C' in the target task.

Let us now consider applying this method to a task similar to the T-shaped MDP transfer task proposed by Taylor and Stone (2009). The source and target tasks are respectively described on the left and right sides of Figure 1. In each task, the states are represented by the circles, and the arrows between them correspond to the available actions that allow moving from one state to the other. The initial states are the left-most states, I for the source task and I' for the target task. Between the states I and A in the source task (respectively I' and A' in the target task) are k states, k being a parameter increasing the distance to travel from I to A (respectively from I' to A'). The tasks are deterministic, and the reward is zero everywhere except for the state B in the source task and C' in the target task, where a reward of +1 is received. Consequently, the optimal policy in the source task is to go to the state A and then to the state B. In the target task, the same applies, except that a transition to state C' should be taken in place of a transition to state B' when the agent is in state A'.

Regardless of the parameters used in the bi-simulation metric of Equation 11, the direct state transfer method from Song et al. (2016) maps the following states together, as they share the exact same model:

    I ←→ I'
    k states ←→ k states
    A ←→ A'.

Hence, during learning, the Q-function of the target task is initialized with the values of the Q-function of the source task. Therefore, the behavior derived with the Q-Learning algorithm is the optimal policy of the source task, executed in the target task. Depending on the value of the learning rate of the algorithm, the time needed to favor action DOWN in state A' instead of action UP can be arbitrarily long. Also, depending on the value of ε, the exploration of state C' due to the ε-greedy strategy can be arbitrarily unlikely. Finally, the time needed for one of those two events to occur increases proportionally to the value of k, which can be arbitrarily large.

This case illustrates the difficulty facing any transfer method in the general context of Lifelong RL. The method proposed by Song et al. (2016) can be highly efficient in some cases, as they show in experiments, but the lack of theoretical guarantees makes negative transfer possible. Generally, using a similarity measure such as the bi-simulation metric helps to discourage using some source tasks when the computed similarity is too low. However, as we saw in the T-shaped MDP example, this rule is not absolute, and the choice of the metric is important. The approach we develop in this paper aims at avoiding negative transfer by providing conservative transferred knowledge that is simply of no use when the similarity between source and target tasks is too low. This is intuitive, as we do not expect any task to provide transferable knowledge to any other task.

2 Discussion on metrics and related notions

A metric on a set X is a function m : X × X → R which has the following properties for any x, y, z ∈ X:

P1. m(x, y) ≥ 0 (positivity),

P2. m(x, y) = 0 ⇔ x = y (positive definiteness),

P3. m(x, y) = m(y, x) (symmetry),

P4. m(x, z) ≤ m(x, y) + m(y, z) (triangle inequality).

If property P2 is not verified by m, but instead we have m(x, x) = 0 for any x ∈ X, then m is called a pseudo-metric. If m only verifies P1, P2 and P4, then m is called a quasi-metric. If m only verifies P1 and P2 and X is a set of probability measures, then m is called a divergence.

From this, the pseudometric between models of Definition 1 is indeed a pseudo-metric, as it is relative to a positive function f that could be equal to zero and break property P2.

The local MDP dissimilarity d_sa(M‖M̄) of Proposition 1 does not respect properties P2 and P3, hence the name dissimilarity. The quantity Δ_sa(M, M̄) = min{d_sa(M‖M̄), d_sa(M̄‖M)}, however, regains property P3 and is hence a pseudo-metric. A noticeable consequence is that Proposition 1 is "in the spirit" of a Lipschitz continuity result but cannot be called as such, hence the name pseudo-Lipschitz continuity.

The same goes for the global dissimilarity d(M‖M̄) = (1/(1−γ)) max_{s,a∈S×A} D_sa(M‖M̄). However, using min{d(M‖M̄), d(M̄‖M)} allows us to regain property P3 and makes this quantity a pseudo-metric between MDPs.

3 Proof of Proposition 1

Notation 1. Given two sets X and Y, we note F(X, Y) the set of functions defined on the domain X with codomain Y.

Lemma 1. Given two MDPs M, M̄ ∈ ℳ, the following equation on d ∈ F(S × A, R) is a fixed-point equation admitting a unique solution for any (s, a) ∈ S × A:

    d_sa = D_sa(M‖M̄) + γ Σ_{s'∈S} T^a_{ss'} max_{a'∈A} d_{s'a'}.

We refer to this unique solution as d_sa(M‖M̄).


Proof of Lemma 1. The proof follows closely that in (Puterman 2014), which proves that the Bellman operator over value functions is a contraction mapping. Let L be the functional operator that maps any function d ∈ F(S × A, R) to
$$Ld : (s, a) \mapsto D_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} d_{s'a'} .$$
Then, for f and g, two functions from S × A to R, and (s, a) ∈ S × A, we have that
$$Lf_{sa} - Lg_{sa} = \gamma \sum_{s' \in S} T^a_{ss'} \left( \max_{a' \in A} f_{s'a'} - \max_{a' \in A} g_{s'a'} \right) \le \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} \left( f_{s'a'} - g_{s'a'} \right) \le \gamma \left\| f - g \right\|_\infty .$$
Since this is true for any pair (s, a) ∈ S × A, we have that
$$\left\| Lf - Lg \right\|_\infty \le \gamma \left\| f - g \right\|_\infty .$$
Since γ < 1, L is a contraction mapping in the metric space (F(S × A, R), ‖·‖∞). This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.
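The contraction argument above also suggests a direct way to evaluate $d_{sa}(M\|\bar{M})$ numerically: iterate the operator L until convergence, exactly as in value iteration. The sketch below is an illustrative tabular implementation under assumed array layouts (it is not the authors' released code); D holds the local model pseudo-distances $D_{sa}(M\|\bar{M})$ and T the transition model of M.

```python
import numpy as np

def lipschitz_distance(D, T, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the operator L of Lemma 1 to compute d_sa(M || M_bar).

    D: array of shape (S, A), local model pseudo-distances D_sa(M || M_bar).
    T: array of shape (S, A, S), transition probabilities T^a_{ss'} of M.
    Returns d of shape (S, A), the unique fixed point of d = L d.
    """
    d = np.zeros_like(D)
    for _ in range(max_iter):
        # (L d)_{sa} = D_{sa} + gamma * sum_{s'} T^a_{ss'} * max_{a'} d_{s'a'}
        d_new = D + gamma * (T @ d.max(axis=1))
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Tiny usage example with 3 states and 2 actions.
rng = np.random.default_rng(0)
T = rng.random((3, 2, 3))
T /= T.sum(axis=-1, keepdims=True)
D = 0.1 * rng.random((3, 2))
print(lipschitz_distance(D, T, gamma=0.9))
```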

Proof of Proposition 1. The proof is by induction. The value iteration sequence of iterates $(Q^n_M)_{n\in\mathbb{N}}$ of the optimal Q-function of any MDP M ∈ M is defined for all (s, a) ∈ S × A by:
$$Q^0_M(s, a) = 0 , \qquad Q^{n+1}_M(s, a) = R^a_s + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} Q^n_M(s', a'), \quad \forall n \in \mathbb{N} .$$
Consider two MDPs $M, \bar{M} \in \mathcal{M}$. It is obvious that $\left| Q^0_M(s, a) - Q^0_{\bar{M}}(s, a) \right| \le d_{sa}(M\|\bar{M})$ for all (s, a) ∈ S × A. Suppose the property $\left| Q^n_M(s, a) - Q^n_{\bar{M}}(s, a) \right| \le d_{sa}(M\|\bar{M})$ true at rank n ∈ N for all (s, a) ∈ S × A. Consider now the rank n + 1 and a pair (s, a) ∈ S × A:
$$\begin{aligned}
\left| Q^{n+1}_M(s, a) - Q^{n+1}_{\bar{M}}(s, a) \right| &= \left| R^a_s - \bar{R}^a_s + \gamma \sum_{s' \in S} \left[ T^a_{ss'} \max_{a' \in A} Q^n_M(s', a') - \bar{T}^a_{ss'} \max_{a' \in A} Q^n_{\bar{M}}(s', a') \right] \right| \\
&\le \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} \left| T^a_{ss'} \max_{a' \in A} Q^n_M(s', a') - \bar{T}^a_{ss'} \max_{a' \in A} Q^n_{\bar{M}}(s', a') \right| \\
&\le \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} \max_{a' \in A} Q^n_{\bar{M}}(s', a') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| + \gamma \sum_{s' \in S} T^a_{ss'} \left| \max_{a' \in A} Q^n_M(s', a') - \max_{a' \in A} Q^n_{\bar{M}}(s', a') \right| \\
&\le \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} V^*_{\bar{M}}(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} \left| Q^n_M(s', a') - Q^n_{\bar{M}}(s', a') \right| \\
&\le D_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} d_{s'a'}(M\|\bar{M}) \\
&\le d_{sa}(M\|\bar{M}) ,
\end{aligned}$$
where we used Lemma 1 in the last inequality. Since $Q^*_M$ and $Q^*_{\bar{M}}$ are respectively the limits of the sequences $(Q^n_M)_{n\in\mathbb{N}}$ and $(Q^n_{\bar{M}})_{n\in\mathbb{N}}$, it results from passage to the limit that
$$\left| Q^*_M(s, a) - Q^*_{\bar{M}}(s, a) \right| \le d_{sa}(M\|\bar{M}) .$$
By symmetry, we also have $\left| Q^*_M(s, a) - Q^*_{\bar{M}}(s, a) \right| \le d_{sa}(\bar{M}\|M)$ and we can take the minimum of the two valid upper bounds, yielding:
$$\left| Q^*_M(s, a) - Q^*_{\bar{M}}(s, a) \right| \le \min\left\{ d_{sa}(M\|\bar{M}), d_{sa}(\bar{M}\|M) \right\} ,$$
which concludes the proof.


4 Similar results to Proposition 1

Similar results to Proposition 1 can be derived. First, an important consequence is the global pseudo-Lipschitz continuity result presented below.

Proposition 8 (Global pseudo-Lipschitz continuity). For two MDPs $M, \bar{M}$, for all (s, a) ∈ S × A,
$$\left| Q^*_M(s, a) - Q^*_{\bar{M}}(s, a) \right| \le \Delta(M, \bar{M}), \qquad (12)$$
with $\Delta(M, \bar{M}) \triangleq \min\left\{ d(M\|\bar{M}), d(\bar{M}\|M) \right\}$ and
$$d(M\|\bar{M}) \triangleq \frac{1}{1 - \gamma} \max_{s, a \in S \times A} D_{sa}(M\|\bar{M}) .$$

From a pure transfer perspective, Equation 12 is interesting since its right-hand side does not depend on (s, a). Hence, the counterpart of the upper bound of Equation 4, namely
$$s, a \mapsto Q^*_{\bar{M}}(s, a) + \Delta(M, \bar{M}),$$
is easier to compute. Indeed, $\Delta(M, \bar{M})$ can be computed once and for all, contrary to $\Delta_{sa}(M, \bar{M})$, which needs to be evaluated for every (s, a) pair. However, we do not use this result for transfer because it is impractical to compute online: estimating the maximum in the definition of $d(M\|\bar{M})$ can be as hard as solving both MDPs, which, when it happens, is too late for transfer to be useful.
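To illustrate why this quantity is cheap once the local model distances are available, the following minimal sketch (with an assumed array layout of our own, not the paper's code) reduces the two matrices of local pseudo-distances to the single scalar $\Delta(M, \bar{M})$ of Proposition 8.

```python
import numpy as np

def global_delta(D_m_mbar, D_mbar_m, gamma):
    """Compute Delta(M, M_bar) = min{ d(M || M_bar), d(M_bar || M) } of Proposition 8.

    D_m_mbar, D_mbar_m: arrays of shape (S, A) holding the local model
    pseudo-distances D_sa(M || M_bar) and D_sa(M_bar || M).
    """
    d_m_mbar = D_m_mbar.max() / (1.0 - gamma)   # d(M || M_bar)
    d_mbar_m = D_mbar_m.max() / (1.0 - gamma)   # d(M_bar || M)
    return min(d_m_mbar, d_mbar_m)
```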

Proof of Proposition 8. The proof is by induction. We consider the sequence of value iteration iterates defined for any MDP M ∈ M for (s, a) ∈ S × A by
$$Q^0_M(s, a) = 0 , \qquad Q^{n+1}_M(s, a) = R^a_s + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} Q^n_M(s', a'), \quad \forall n \in \mathbb{N} .$$
Consider two MDPs $M, \bar{M} \in \mathcal{M}$. It is immediate for any (s, a) ∈ S × A that
$$\left| Q^0_M(s, a) - Q^0_{\bar{M}}(s, a) \right| \le d(M\|\bar{M}) ,$$
and, by symmetry, the result also holds for $d(\bar{M}\|M)$. Suppose that it is true at rank n ∈ N. Consider rank n + 1 and (s, a) ∈ S × A; we have that:
$$\begin{aligned}
\left| Q^{n+1}_M(s, a) - Q^{n+1}_{\bar{M}}(s, a) \right| &\le D_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} \left| Q^n_M(s', a') - Q^n_{\bar{M}}(s', a') \right| \\
&\le \max_{(s, a) \in S \times A} D_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \, \frac{1}{1 - \gamma} \max_{(s, a) \in S \times A} D_{sa}(M\|\bar{M}) \\
&\le \max_{(s, a) \in S \times A} D_{sa}(M\|\bar{M}) \left( 1 + \frac{\gamma}{1 - \gamma} \right) \\
&\le d(M\|\bar{M}) .
\end{aligned}$$
By symmetry, the result also holds for $d(\bar{M}\|M)$, which concludes the proof by induction.

The second result is for the value function and is stated below.

Proposition 9 (Local pseudo-Lipschitz continuity of the optimal value function). For any two MDPs $M, \bar{M} \in \mathcal{M}$, for all s ∈ S,
$$\left| V^*_M(s) - V^*_{\bar{M}}(s) \right| \le \max_{a \in A} \Delta_{sa}(M, \bar{M}) ,$$
where the local MDP pseudo-metric $\Delta_{sa}(M, \bar{M})$ has the same definition as in Proposition 1.

Proof of Proposition 9. The proof follows exactly the same steps as the proof of Proposition 1, i.e., by first constructing the value iteration sequence of iterates of the optimal value function, showing the result by induction for rank n ∈ N, and then concluding with a passage to the limit.

Another result can be derived for the value of any policy π. For the sake of generality, we state the result for any stochastic policy mapping states to distributions over actions. Note that a deterministic policy is a stochastic policy mapping states to Dirac distributions over actions. First, we state the result for the value function in Proposition 10 and then for the Q-function in Proposition 11.


Proposition 10 (Local pseudo-Lipschitz continuity of the value function of any policy). For any two MDPs $M, \bar{M} \in \mathcal{M}$, for any stochastic stationary policy π, for all s ∈ S,
$$\left| V^\pi_M(s) - V^\pi_{\bar{M}}(s) \right| \le \Delta^\pi_s(M, \bar{M}) ,$$
where $\Delta^\pi_s(M, \bar{M}) \triangleq \min\left\{ d^\pi_s(M\|\bar{M}), d^\pi_s(\bar{M}\|M) \right\}$ and $d^\pi_s(M\|\bar{M})$ is defined as the fixed point of the following fixed-point equation on d ∈ F(S, R):
$$d_s = \sum_{a \in A} \pi(a \mid s) \left( D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} d_{s'} \right) .$$

Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.

Lemma 2. Given two MDPs $M, \bar{M} \in \mathcal{M}$ and any stochastic stationary policy π, the following equation on d ∈ F(S, R) is a fixed-point equation admitting a unique solution for any s ∈ S:
$$d_s = \sum_{a \in A} \pi(a \mid s) \left( D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} d_{s'} \right) .$$
We refer to this unique solution as $d^\pi_s(M\|\bar{M})$.

Proof of Lemma 2. Let L be the functional operator that maps any function d ∈ F(S, R) to
$$Ld : s \mapsto \sum_{a \in A} \pi(a \mid s) \left( D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} d_{s'} \right) .$$
Then, for f and g, two functions from S to R, we have that
$$Lf_s - Lg_s = \gamma \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} T^a_{ss'} \left( f_{s'} - g_{s'} \right) \le \gamma \left\| f - g \right\|_\infty .$$
Hence we have that ‖Lf − Lg‖∞ ≤ γ ‖f − g‖∞. Since γ < 1, L is a contraction mapping in the metric space (F(S, R), ‖·‖∞). This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 10. Consider a stochastic stationary policy π. The value iteration sequence of iterates $(V^{\pi,n}_M)_{n\in\mathbb{N}}$ of the value function of any MDP M ∈ M is defined for all s ∈ S by:
$$V^{\pi,0}_M(s) = 0 , \qquad V^{\pi,n+1}_M(s) = \sum_{a \in A} \pi(a \mid s) \left( R^a_s + \gamma \sum_{s' \in S} T^a_{ss'} V^{\pi,n}_M(s') \right) .$$
Consider two MDPs $M, \bar{M} \in \mathcal{M}$. It is obvious that $\left| V^{\pi,0}_M(s) - V^{\pi,0}_{\bar{M}}(s) \right| \le d^\pi_s(M\|\bar{M})$ for all s ∈ S. Suppose the property $\left| V^{\pi,n}_M(s) - V^{\pi,n}_{\bar{M}}(s) \right| \le d^\pi_s(M\|\bar{M})$ true at rank n ∈ N for all s ∈ S. Consider now the rank n + 1 and the state s ∈ S:
$$\begin{aligned}
\left| V^{\pi,n+1}_M(s) - V^{\pi,n+1}_{\bar{M}}(s) \right| &\le \sum_{a \in A} \pi(a \mid s) \left| R^a_s - \bar{R}^a_s + \gamma \sum_{s' \in S} \left( T^a_{ss'} V^{\pi,n}_M(s') - \bar{T}^a_{ss'} V^{\pi,n}_{\bar{M}}(s') \right) \right| \\
&\le \sum_{a \in A} \pi(a \mid s) \left( \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} V^{\pi,n}_{\bar{M}}(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| + \gamma \sum_{s' \in S} T^a_{ss'} \left| V^{\pi,n}_M(s') - V^{\pi,n}_{\bar{M}}(s') \right| \right) \\
&\le \sum_{a \in A} \pi(a \mid s) \left( D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} d^\pi_{s'}(M\|\bar{M}) \right) \\
&\le d^\pi_s(M\|\bar{M}) ,
\end{aligned}$$


where we used Lemma 2 in the last inequality. Since $V^\pi_M$ and $V^\pi_{\bar{M}}$ are respectively the limits of the sequences $(V^{\pi,n}_M)_{n\in\mathbb{N}}$ and $(V^{\pi,n}_{\bar{M}})_{n\in\mathbb{N}}$, it results from passage to the limit that
$$\left| V^\pi_M(s) - V^\pi_{\bar{M}}(s) \right| \le d^\pi_s(M\|\bar{M}) .$$
By symmetry, we also have $\left| V^\pi_M(s) - V^\pi_{\bar{M}}(s) \right| \le d^\pi_s(\bar{M}\|M)$ and we can take the minimum of the two valid upper bounds, yielding:
$$\left| V^\pi_M(s) - V^\pi_{\bar{M}}(s) \right| \le \min\left\{ d^\pi_s(M\|\bar{M}), d^\pi_s(\bar{M}\|M) \right\} ,$$
which concludes the proof.

Proposition 11 (Local pseudo-Lipschitz continuity of the Q-function of any policy). For any two MDPs $M, \bar{M} \in \mathcal{M}$, for any stochastic stationary policy π, for all (s, a) ∈ S × A,
$$\left| Q^\pi_M(s, a) - Q^\pi_{\bar{M}}(s, a) \right| \le \Delta^\pi_{sa}(M, \bar{M}) ,$$
where $\Delta^\pi_{sa}(M, \bar{M}) \triangleq \min\left\{ d^\pi_{sa}(M\|\bar{M}), d^\pi_{sa}(\bar{M}\|M) \right\}$ and $d^\pi_{sa}(M\|\bar{M})$ is defined as the fixed point of the following fixed-point equation on d ∈ F(S × A, R):
$$d_{sa} = D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{(s', a') \in S \times A} T^a_{ss'} \pi(a' \mid s') d_{s'a'} .$$

Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.

Lemma 3. Given two MDPs $M, \bar{M} \in \mathcal{M}$ and any stochastic stationary policy π, the following equation on d ∈ F(S × A, R) is a fixed-point equation admitting a unique solution for any (s, a) ∈ S × A:
$$d_{sa} = D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{(s', a') \in S \times A} T^a_{ss'} \pi(a' \mid s') d_{s'a'} .$$
We refer to this unique solution as $d^\pi_{sa}(M\|\bar{M})$.

Proof of Lemma 3. Let L be the functional operator that maps any function d ∈ F(S × A, R) to
$$Ld : (s, a) \mapsto D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{(s', a') \in S \times A} T^a_{ss'} \pi(a' \mid s') d_{s'a'} .$$
Then, for f and g, two functions from S × A to R, we have for all (s, a) ∈ S × A that
$$Lf_{sa} - Lg_{sa} = \gamma \sum_{(s', a') \in S \times A} T^a_{ss'} \pi(a' \mid s') \left( f_{s'a'} - g_{s'a'} \right) \le \gamma \left\| f - g \right\|_\infty .$$
Hence we have that ‖Lf − Lg‖∞ ≤ γ ‖f − g‖∞. Since γ < 1, L is a contraction mapping in the metric space (F(S × A, R), ‖·‖∞). This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 11. Consider a stochastic stationary policy π. The value iteration sequence of iterates $(Q^{\pi,n}_M)_{n\in\mathbb{N}}$ of the Q-function for the policy π and MDP M ∈ M is defined for all (s, a) ∈ S × A by:
$$Q^{\pi,0}_M(s, a) = 0 , \qquad Q^{\pi,n+1}_M(s, a) = R^a_s + \gamma \sum_{(s', a') \in S \times A} T^a_{ss'} \pi(a' \mid s') Q^{\pi,n}_M(s', a') .$$
Consider two MDPs $M, \bar{M} \in \mathcal{M}$. It is obvious that $\left| Q^{\pi,0}_M(s, a) - Q^{\pi,0}_{\bar{M}}(s, a) \right| \le d^\pi_{sa}(M\|\bar{M})$ for all (s, a) ∈ S × A. Suppose the property $\left| Q^{\pi,n}_M(s, a) - Q^{\pi,n}_{\bar{M}}(s, a) \right| \le d^\pi_{sa}(M\|\bar{M})$ true at rank n ∈ N for all (s, a) ∈ S × A. Consider now the rank n + 1


and the state-action pair (s, a) ∈ S × A:
$$\begin{aligned}
\left| Q^{\pi,n+1}_M(s, a) - Q^{\pi,n+1}_{\bar{M}}(s, a) \right| &\le \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{(s', a') \in S \times A} \pi(a' \mid s') \left| T^a_{ss'} Q^{\pi,n}_M(s', a') - \bar{T}^a_{ss'} Q^{\pi,n}_{\bar{M}}(s', a') \right| \\
&\le \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{(s', a') \in S \times A} \pi(a' \mid s') Q^{\pi,n}_{\bar{M}}(s', a') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| + \gamma \sum_{(s', a') \in S \times A} \pi(a' \mid s') T^a_{ss'} \left| Q^{\pi,n}_M(s', a') - Q^{\pi,n}_{\bar{M}}(s', a') \right| \\
&\le \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} V^\pi_{\bar{M}}(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| + \gamma \sum_{(s', a') \in S \times A} \pi(a' \mid s') T^a_{ss'} d^\pi_{s'a'}(M\|\bar{M}) \\
&\le D^{\gamma V^\pi_{\bar{M}}}_{sa}(M, \bar{M}) + \gamma \sum_{(s', a') \in S \times A} T^a_{ss'} \pi(a' \mid s') d^\pi_{s'a'}(M\|\bar{M}) \\
&\le d^\pi_{sa}(M\|\bar{M}) ,
\end{aligned}$$
where we used Lemma 3 in the last inequality. Since $Q^\pi_M$ and $Q^\pi_{\bar{M}}$ are respectively the limits of the sequences $(Q^{\pi,n}_M)_{n\in\mathbb{N}}$ and $(Q^{\pi,n}_{\bar{M}})_{n\in\mathbb{N}}$, it results from passage to the limit that
$$\left| Q^\pi_M(s, a) - Q^\pi_{\bar{M}}(s, a) \right| \le d^\pi_{sa}(M\|\bar{M}) .$$
By symmetry, we also have $\left| Q^\pi_M(s, a) - Q^\pi_{\bar{M}}(s, a) \right| \le d^\pi_{sa}(\bar{M}\|M)$ and we can take the minimum of the two valid upper bounds, yielding for all (s, a) ∈ S × A:
$$\left| Q^\pi_M(s, a) - Q^\pi_{\bar{M}}(s, a) \right| \le \min\left\{ d^\pi_{sa}(M\|\bar{M}), d^\pi_{sa}(\bar{M}\|M) \right\} ,$$
which concludes the proof.

5 Proof of Proposition 2

Proof of Proposition 2. The result is clear for all (s, a) ∉ K since the Lipschitz bounds are provably greater than $Q^*_M$. For (s, a) ∈ K, the result is shown by induction. Let us consider the Dynamic Programming (Bellman 1957) sequences converging to $Q^*_M$ and U, whose definitions follow for all (s, a) ∈ S × A and for n ∈ N:
$$Q^0_M(s, a) = 0 , \qquad Q^{n+1}_M(s, a) = R^a_s + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} Q^n_M(s', a') ,$$
$$U^0(s, a) = 0 , \qquad U^{n+1}(s, a) = R^a_s + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} U^n(s', a') .$$
Obviously, we have at rank n = 0 that $Q^0_M(s, a) \le U^0(s, a)$ for all (s, a) ∈ S × A. Suppose the property true at rank n ∈ N and consider rank n + 1:
$$\begin{aligned}
Q^{n+1}_M(s, a) - U^{n+1}(s, a) &= \gamma \sum_{s' \in S} T^a_{ss'} \left( \max_{a' \in A} Q^n_M(s', a') - \max_{a' \in A} U^n(s', a') \right) \\
&\le \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} \left( Q^n_M(s', a') - U^n(s', a') \right) \\
&\le 0 ,
\end{aligned}$$
which concludes the proof by induction. The result holds by passage to the limit since the considered Dynamic Programming sequences converge to the true functions.

6 Proof of Proposition 3

Proof of Proposition 3. Consider two tasks M = (T, R) and $\bar{M} = (\bar{T}, \bar{R})$, with K and $\bar{K}$ the respective sets of state-action pairs where their learned models $\hat{M} = (\hat{T}, \hat{R})$ and $\hat{\bar{M}} = (\hat{\bar{T}}, \hat{\bar{R}})$ are known with accuracy ε in L1-norm with probability at least 1 − δ, i.e., we have that
$$\Pr\left( \left| R^a_s - \hat{R}^a_s \right| \le \epsilon \text{ and } \left\| T^a_{ss'} - \hat{T}^a_{ss'} \right\|_1 \le \epsilon, \ \forall (s, a) \in K, \text{ and } \left| \bar{R}^a_s - \hat{\bar{R}}^a_s \right| \le \epsilon \text{ and } \left\| \bar{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right\|_1 \le \epsilon, \ \forall (s, a) \in \bar{K} \right) \ge 1 - \delta. \qquad (13)$$


Importantly, notice that the probabilistic event of Inequality 13 is the intersection of at most 4SA individual events of estimating either $R^a_s$, $T^a_{ss'}$, $\bar{R}^a_s$ or $\bar{T}^a_{ss'}$ with precision ε. Each one of those individual events is itself true with probability at least 1 − δ′, where δ′ ∈ (0, 1] is a parameter. For all the individual events to be true at the same time, i.e., for Inequality 13 to be verified, one must apply Boole's inequality and set δ′ = δ/(4SA) to ensure a total probability (i.e., a probability of the intersection of all the individual events) of at least 1 − δ.

We now demonstrate the result for each one of the three cases:

(i) $(s, a) \in K \cap \bar{K}$,
(ii) $(s, a) \in K \cap \bar{K}^c$, and
(iii) $(s, a) \in K^c \cap \bar{K}^c$,

the case $(s, a) \in K^c \cap \bar{K}$ being the symmetric of case (ii).

(i) If $(s, a) \in K \cap \bar{K}$, then we have ε-close estimates of both models with high probability, as described by Inequality 13. By definition:
$$D^{\gamma V^*_{\bar{M}}}_{sa}(M, \bar{M}) = \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} V^*_{\bar{M}}(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| . \qquad (14)$$
The first term of the right-hand side of Equation 14 respects the following sequence of inequalities with probability at least 1 − δ:
$$\left| R^a_s - \bar{R}^a_s \right| \le \left| R^a_s - \hat{R}^a_s \right| + \left| \hat{R}^a_s - \hat{\bar{R}}^a_s \right| + \left| \bar{R}^a_s - \hat{\bar{R}}^a_s \right| \le \left| \hat{R}^a_s - \hat{\bar{R}}^a_s \right| + 2\epsilon . \qquad (15)$$

The second term of the right-hand side of Equation 14 respects the following sequence of inequalities with probability at least 1 − δ:
$$\begin{aligned}
\gamma \sum_{s' \in S} V^*_{\bar{M}}(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| &\le \gamma \sum_{s' \in S} V(s') \left( \left| T^a_{ss'} - \hat{T}^a_{ss'} \right| + \left| \hat{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| + \left| \bar{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| \right) \\
&\le \gamma \max_{s'} V(s') \sum_{s' \in S} \left| T^a_{ss'} - \hat{T}^a_{ss'} \right| + \gamma \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| + \gamma \max_{s'} V(s') \sum_{s' \in S} \left| \bar{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| \\
&\le \gamma \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| + 2\epsilon\gamma \max_{s' \in S} V(s') . \qquad (16)
\end{aligned}$$
Replacing Inequalities 15 and 16 in Equation 14 yields
$$D_{sa}(M\|\bar{M}) \le \left| \hat{R}^a_s - \hat{\bar{R}}^a_s \right| + \gamma \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| + 2\epsilon + 2\epsilon\gamma \max_{s' \in S} V(s') \le D^{\gamma V}_{sa}(\hat{M}, \hat{\bar{M}}) + 2\epsilon \left( 1 + \gamma \max_{s' \in S} V(s') \right) ,$$
which holds with probability at least 1 − δ and proves the result for case (i).

(ii) If $(s, a) \in K \cap \bar{K}^c$, then we do not have an ε-close estimate of $\bar{T}^a_{s\cdot}$ and $\bar{R}^a_s$. Similarly to the proof of case (i), we upper bound sequentially the two terms of the right-hand side of Equation 14. With probability at least 1 − δ, we have the following:
$$\left| R^a_s - \bar{R}^a_s \right| \le \left| R^a_s - \hat{R}^a_s \right| + \left| \hat{R}^a_s - \bar{R}^a_s \right| \le \epsilon + \max_{\bar{R} \in [0, 1]} \left| \hat{R}^a_s - \bar{R} \right| . \qquad (17)$$

Similarly, with probability at least 1 − δ, we have:
$$\gamma \sum_{s' \in S} V^*_{\bar{M}}(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| \le \gamma \sum_{s' \in S} V(s') \left( \left| T^a_{ss'} - \hat{T}^a_{ss'} \right| + \left| \hat{T}^a_{ss'} - \bar{T}^a_{ss'} \right| \right) \le \gamma \max_{s' \in S} V(s') \, \epsilon + \gamma \max_{\bar{T} \in V_S} \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \bar{T}_{s'} \right| , \qquad (18)$$


where $V_S$ is the set of probability vectors of size S. Combining Inequalities 17 and 18, and noticing $D^{\gamma V^*_{\bar{M}}}_{sa}(M, \bar{M})$ on the left-hand side, we get the following with probability at least 1 − δ:
$$D_{sa}(M\|\bar{M}) \le \max_{\bar{m} \in \mathcal{M}} D^{\gamma V}_{sa}(\hat{M}, \bar{m}) + \epsilon \left( 1 + \gamma \max_{s'} V(s') \right) ,$$
which is the expected result for case (ii).

(iii) If $(s, a) \in K^c \cap \bar{K}^c$, then we do not have an ε-close estimate of either task. In such a case, the result
$$D_{sa}(M\|\bar{M}) \le \max_{m, \bar{m} \in \mathcal{M}^2} D^{\gamma V}_{sa}(m, \bar{m})$$
is straightforward, by remarking that, as a consequence of Inequality 13, we have that $V^*_{\bar{M}}(s) \le V(s)$ with probability at least 1 − δ.

7 Analytical calculation of $D_{sa}(M\|\bar{M})$ in Proposition 3

Consider two tasks M = (T, R) and $\bar{M} = (\bar{T}, \bar{R})$, with K and $\bar{K}$ the respective sets of state-action pairs where their learned models $\hat{M} = (\hat{T}, \hat{R})$ and $\hat{\bar{M}} = (\hat{\bar{T}}, \hat{\bar{R}})$ are known with accuracy ε in L1-norm with probability at least 1 − δ. We denote by Vmax a known upper bound on the maximum achievable value. In the worst case where one does not have any information on the value of Vmax, setting Vmax = 1/(1 − γ) is a valid upper bound. We detail the computation of $D_{sa}(M\|\bar{M})$ for each case: 1) $(s, a) \in K \cap \bar{K}$, 2) $(s, a) \in K \cap \bar{K}^c$, and 3) $(s, a) \in K^c \cap \bar{K}^c$. The case $(s, a) \in K^c \cap \bar{K}$ being the symmetric of case 2), the same calculations apply.

1) If $(s, a) \in K \cap \bar{K}$, we have
$$D_{sa}(M\|\bar{M}) = D^{\gamma V}_{sa}(\hat{M}, \hat{\bar{M}}) + 2B = \left| \hat{R}^a_s - \hat{\bar{R}}^a_s \right| + \gamma \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \hat{\bar{T}}^a_{ss'} \right| + 2\epsilon \left( 1 + \gamma \max_{s' \in S} V(s') \right) .$$
Since (s, a) is a known state-action pair, everything is known and computable in this last equation. Note that $\max_{s' \in S} V(s')$ can be tracked along the updates of V, so its computation does not induce any additional computational complexity.
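For illustration, here is a minimal sketch of this case-1 computation, assuming the learned reward and transition estimates and the value upper bound V are stored as plain arrays; the function and variable names are ours, not the paper's.

```python
import numpy as np

def d_hat_known_known(R_hat_sa, Rbar_hat_sa, T_hat_sa, Tbar_hat_sa, V, gamma, eps):
    """Upper bound on D_sa(M || M_bar) when (s, a) is known in both tasks (case 1).

    R_hat_sa, Rbar_hat_sa: scalar reward estimates for (s, a) in both tasks.
    T_hat_sa, Tbar_hat_sa: arrays of shape (S,), estimated transition rows.
    V: array of shape (S,), upper bound on the optimal value function.
    """
    model_term = abs(R_hat_sa - Rbar_hat_sa) + gamma * np.sum(V * np.abs(T_hat_sa - Tbar_hat_sa))
    slack = 2.0 * eps * (1.0 + gamma * V.max())
    return model_term + slack
```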

2) If $(s, a) \in K \cap \bar{K}^c$, we have
$$\begin{aligned}
D_{sa}(M\|\bar{M}) &= \max_{\bar{\mu} \in \mathcal{M}} D^{\gamma V}_{sa}(\hat{M}, \bar{\mu}) + B \\
&= \max_{\bar{R}^a_s, \bar{T}^a_{ss'}} \left( \left| \hat{R}^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \bar{T}^a_{ss'} \right| \right) + \epsilon \left( 1 + \gamma \max_{s' \in S} V(s') \right) \\
&= \max_{\bar{r} \in [0, 1]} \left| \hat{R}^a_s - \bar{r} \right| + \gamma \max_{\substack{\bar{t} \in [0, 1]^S \\ \text{s.t. } \sum_{s' \in S} \bar{t}_{s'} = 1}} \left( \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - \bar{t}_{s'} \right| \right) + \epsilon \left( 1 + \gamma \max_{s' \in S} V(s') \right) .
\end{aligned}$$
First, we have
$$\max_{\bar{r} \in [0, 1]} \left| \hat{R}^a_s - \bar{r} \right| = \max\left\{ \hat{R}^a_s, \, 1 - \hat{R}^a_s \right\} .$$

Maximizing over the variable t ∈ [0, 1]^S such that $\sum_{s' \in S} t_{s'} = 1$ is equivalent to maximizing a convex combination of the positive vector V, whose terms are not independent as they must sum to one. This is easily solvable as a linear programming problem. A straightforward (simplex-like) resolution procedure consists in progressively adding mass on the terms that will maximize the convex combination, as follows (a code sketch of this procedure is given after the list):

• $t_{s'} = 0, \ \forall s' \in S$
• l = sort states by decreasing values of V
• While $\sum_{s \in S} t_s < 1$:
  – s′ = pop first state in l
  – Assign $t_{s'} \leftarrow \arg\max_{t \in [0, 1]} \left| \hat{T}^a_{ss'} - t \right|$ to s′ (note that $t_{s'} \in \{0, 1\}$)
  – If $\sum_{s \in S} t_s > 1$, then $t_{s'} \leftarrow 1 - \sum_{s \in S \setminus s'} t_s$
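A minimal sketch of this greedy assignment follows, assuming the estimated transition row and the value upper bound V are available as float arrays; the names are ours and this is an illustration of the procedure above, not the authors' implementation.

```python
import numpy as np

def maximize_weighted_l1(T_hat_row, V):
    """Greedy assignment of the procedure above: returns t on the simplex that
    (approximately) maximizes sum_{s'} V(s') * |T_hat(s') - t(s')|.

    T_hat_row: float array of shape (S,), estimated transition row for (s, a).
    V: float array of shape (S,), value upper bound used as weights.
    """
    t = np.zeros_like(T_hat_row)
    order = list(np.argsort(-V))          # states sorted by decreasing V
    while t.sum() < 1.0 and order:
        s_prime = order.pop(0)
        # argmax over x in [0, 1] of |T_hat(s') - x| is attained at 0 or 1
        t[s_prime] = 0.0 if T_hat_row[s_prime] > 0.5 else 1.0
        if t.sum() > 1.0:                 # keep t on the probability simplex
            t[s_prime] = 1.0 - (t.sum() - t[s_prime])
    value = float(np.sum(V * np.abs(T_hat_row - t)))
    return t, value
```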


This allows calculating the maximum over transition models. Notice that there is a simpler computation that almost always yields the same result (when it does not, it provides an upper bound) and does not require the burden of the previous procedure. Consider the subset of states for which $V(s') = \max_{s \in S} V(s)$ (often these are states in $K^c$). Among those states, let us suppose there exists $s^+$, unreachable from (s, a) according to $\hat{T}$, i.e., $\hat{T}^a_{ss^+} = 0$. If M has not been fully explored, as is often the case in RMax, there may be many such states. Then the distribution t with all its mass on $s^+$ maximizes the $\max_{t \in [0, 1]^S}$ term. Conversely, if such a state does not exist (that is, if for all such states $\hat{T}^a_{ss^+} > 0$), then $\max_{s \in S} V(s)$ is an upper bound on the $\max_{t \in [0, 1]^S}$ term. Therefore:
$$\max_{t \in [0, 1]^S} \left( \sum_{s' \in S} V(s') \left| \hat{T}^a_{ss'} - t_{s'} \right| \right) \le \max_{s \in S} V(s) ,$$

with equality in many cases.

3) If $(s, a) \in K^c \cap \bar{K}^c$, the resolution is trivial and we have
$$\begin{aligned}
D_{sa}(M\|\bar{M}) &= \max_{\mu, \bar{\mu} \in \mathcal{M}^2} D^{\gamma V}_{sa}(\mu, \bar{\mu}) \\
&= \max_{R^a_s, T^a_{ss'}, \bar{R}^a_s, \bar{T}^a_{ss'}} \left( \left| R^a_s - \bar{R}^a_s \right| + \gamma \sum_{s' \in S} V(s') \left| T^a_{ss'} - \bar{T}^a_{ss'} \right| \right) \\
&= \max_{r, \bar{r} \in [0, 1]} \left| r - \bar{r} \right| + \gamma \max_{\substack{t, \bar{t} \in [0, 1]^S \\ \text{s.t. } \sum_{s} t_s = 1 \text{ and } \sum_{s} \bar{t}_s = 1}} \sum_{s' \in S} V(s') \left| t_{s'} - \bar{t}_{s'} \right| \\
&= 1 + 2\gamma \max_{s \in S} V(s) .
\end{aligned}$$
Overall, computing the value of the provided upper bound in these three cases allows computing $D_{sa}(M\|\bar{M})$ for all (s, a) ∈ S × A.

8 Proof of Proposition 4

Lemma 4. Given two tasks $M, \bar{M} \in \mathcal{M}$, let K be the set of state-action pairs for which (R, T) is known with accuracy ε in L1-norm with probability at least 1 − δ. If γ(1 + ε) < 1, the following equation on d ∈ F(S × A, R) is a fixed-point equation admitting a unique solution:
$$d_{sa} = \begin{cases} \hat{D}_{sa}(M\|\bar{M}) + \gamma \left( \sum_{s' \in S} \hat{T}^a_{ss'} \max_{a' \in A} d_{s'a'} + \epsilon \max_{(s', a') \in S \times A} d_{s'a'} \right) & \text{if } (s, a) \in K, \\ \hat{D}_{sa}(M\|\bar{M}) + \gamma \max_{(s', a') \in S \times A} d_{s'a'} & \text{otherwise.} \end{cases}$$
We refer to this unique solution as $\hat{d}_{sa}(M\|\bar{M})$.

Proof of Lemma 4. Let L be the functional operator that maps any function d ∈ F(S × A, R) to
$$Ld : (s, a) \mapsto \begin{cases} \hat{D}_{sa}(M\|\bar{M}) + \gamma \left( \sum_{s' \in S} \hat{T}^a_{ss'} \max_{a' \in A} d_{s'a'} + \epsilon \max_{(s', a') \in S \times A} d_{s'a'} \right) & \text{if } (s, a) \in K, \\ \hat{D}_{sa}(M\|\bar{M}) + \gamma \max_{(s', a') \in S \times A} d_{s'a'} & \text{otherwise.} \end{cases}$$
Let f and g be two functions from S × A to R. If (s, a) ∈ K, we have
$$\begin{aligned}
Lf_{sa} - Lg_{sa} &= \gamma \sum_{s' \in S} \hat{T}^a_{ss'} \left( \max_{a' \in A} f_{s'a'} - \max_{a' \in A} g_{s'a'} \right) + \gamma\epsilon \left( \max_{(s', a') \in S \times A} f_{s'a'} - \max_{(s', a') \in S \times A} g_{s'a'} \right) \\
&\le (\gamma + \gamma\epsilon) \left( \max_{(s', a') \in S \times A} f_{s'a'} - \max_{(s', a') \in S \times A} g_{s'a'} \right) \\
&\le \gamma (1 + \epsilon) \max_{(s', a') \in S \times A} \left( f_{s'a'} - g_{s'a'} \right) \\
&\le \gamma (1 + \epsilon) \left\| f - g \right\|_\infty .
\end{aligned}$$


If (s, a) ∉ K, we have
$$Lf_{sa} - Lg_{sa} = \gamma \left( \max_{(s', a') \in S \times A} f_{s'a'} - \max_{(s', a') \in S \times A} g_{s'a'} \right) \le \gamma \max_{(s', a') \in S \times A} \left( f_{s'a'} - g_{s'a'} \right) \le \gamma (1 + \epsilon) \left\| f - g \right\|_\infty .$$
In both cases, ‖Lf − Lg‖∞ ≤ γ(1 + ε) ‖f − g‖∞. If γ(1 + ε) < 1, L is a contraction mapping in the metric space (F(S × A, R), ‖·‖∞). This metric space being complete and non-empty, it follows from the Banach fixed-point theorem that d = Ld admits a unique solution.

Proof of Proposition 4. Consider two MDPs $M, \bar{M} \in \mathcal{M}$. Before proving the result, notice that we put ourselves in the setting of Proposition 3, so that the upper bounds $\hat{D}_{sa}(M\|\bar{M})$ on the pseudo-metric between models are true upper bounds with probability at least 1 − δ for all (s, a) ∈ S × A. As seen in the proof of Proposition 3, this implies learning any reward or transition function with precision ε in L1-norm with probability at least 1 − δ/(4SA).

The proof is done by induction, by calculating the values of $d_{sa}(M\|\bar{M})$ and $\hat{d}_{sa}(M\|\bar{M})$ following the value iteration algorithm. Those values can respectively be computed via the sequences of iterates $(d^n)_{n\in\mathbb{N}}$ and $(\hat{d}^n)_{n\in\mathbb{N}}$ defined as follows for all (s, a) ∈ S × A:
$$d^0_{sa}(M\|\bar{M}) = 0 , \qquad d^{n+1}_{sa}(M\|\bar{M}) = D_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} d^n_{s'a'}(M\|\bar{M}) ,$$
and
$$\hat{d}^0_{sa}(M\|\bar{M}) = 0 , \qquad \hat{d}^{n+1}_{sa}(M\|\bar{M}) = \begin{cases} \hat{D}_{sa}(M\|\bar{M}) + \gamma \left( \sum_{s' \in S} \hat{T}^a_{ss'} \max_{a' \in A} \hat{d}^n_{s'a'}(M\|\bar{M}) + \epsilon \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \right) & \text{if } (s, a) \in K, \\ \hat{D}_{sa}(M\|\bar{M}) + \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) & \text{otherwise.} \end{cases}$$
The proof at rank n = 0 is trivial. Let us assume the proposition $d^n_{sa}(M\|\bar{M}) \le \hat{d}^n_{sa}(M\|\bar{M}), \ \forall (s, a) \in S \times A$ true at rank n ∈ N and consider rank n + 1. There are two cases, depending on whether (s, a) is in K or not.

If (s, a) ∈ K, we have
$$\begin{aligned}
d^{n+1}_{sa}(M\|\bar{M}) - \hat{d}^{n+1}_{sa}(M\|\bar{M}) ={}& D_{sa}(M\|\bar{M}) - \hat{D}_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} \left( T^a_{ss'} \max_{a' \in A} d^n_{s'a'}(M\|\bar{M}) - \hat{T}^a_{ss'} \max_{a' \in A} \hat{d}^n_{s'a'}(M\|\bar{M}) \right) \\
&- \gamma\epsilon \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) .
\end{aligned}$$
Using Proposition 3, we have that $\hat{D}_{sa}(M\|\bar{M})$ is an upper bound on $D_{sa}(M\|\bar{M})$ with probability at least 1 − δ. Hence
$$\Pr\left( D_{sa}(M\|\bar{M}) - \hat{D}_{sa}(M\|\bar{M}) \le 0 \right) \ge 1 - \delta .$$
This, plus the fact that $d^n_{sa}(M\|\bar{M}) \le \hat{d}^n_{sa}(M\|\bar{M}), \ \forall (s, a) \in S \times A$ by induction hypothesis, yields with probability at least 1 − δ,
$$\begin{aligned}
d^{n+1}_{sa}(M\|\bar{M}) - \hat{d}^{n+1}_{sa}(M\|\bar{M}) &\le \gamma \sum_{s' \in S} \max_{a' \in A} \hat{d}^n_{s'a'}(M\|\bar{M}) \left( T^a_{ss'} - \hat{T}^a_{ss'} \right) - \gamma\epsilon \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \\
&\le \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \sum_{s' \in S} \left( T^a_{ss'} - \hat{T}^a_{ss'} \right) - \gamma\epsilon \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \\
&\le \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \left( \left\| T^a_{s\cdot} - \hat{T}^a_{s\cdot} \right\|_1 - \epsilon \right) .
\end{aligned}$$
Since $\Pr\left( \left\| T^a_{s\cdot} - \hat{T}^a_{s\cdot} \right\|_1 \le \epsilon \right) \ge 1 - \delta$, we have with probability at least 1 − δ,
$$d^{n+1}_{sa}(M\|\bar{M}) - \hat{d}^{n+1}_{sa}(M\|\bar{M}) \le \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \left( \epsilon - \epsilon \right) = 0 ,$$


which concludes the proof in the first case. Conversely, if (s, a) ∉ K, we have
$$d^{n+1}_{sa}(M\|\bar{M}) - \hat{d}^{n+1}_{sa}(M\|\bar{M}) = D_{sa}(M\|\bar{M}) - \hat{D}_{sa}(M\|\bar{M}) + \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} d^n_{s'a'}(M\|\bar{M}) - \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) .$$
Using the same reasoning as in the case (s, a) ∈ K, we have with probability higher than 1 − δ,
$$\begin{aligned}
d^{n+1}_{sa}(M\|\bar{M}) - \hat{d}^{n+1}_{sa}(M\|\bar{M}) &\le \gamma \sum_{s' \in S} T^a_{ss'} \max_{a' \in A} d^n_{s'a'}(M\|\bar{M}) - \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \\
&\le \gamma \max_{(s', a') \in S \times A} d^n_{s'a'}(M\|\bar{M}) - \gamma \max_{(s', a') \in S \times A} \hat{d}^n_{s'a'}(M\|\bar{M}) \\
&\le 0 ,
\end{aligned}$$
which concludes the proof in the second case.
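The quantity $\hat{d}_{sa}(M\|\bar{M})$ can be computed by iterating the operator of Lemma 4 in the same spirit as the sketch given for Lemma 1, with the ε inflation term on known pairs and the pessimistic branch on unknown pairs. The following is an illustrative sketch under assumed array layouts (D_hat, T_hat, known are our names), not the released LRMax implementation.

```python
import numpy as np

def d_hat_fixed_point(D_hat, T_hat, known, gamma, eps, tol=1e-8, max_iter=10_000):
    """Iterate the operator of Lemma 4 to compute d_hat_sa(M || M_bar).

    D_hat: (S, A) array of upper bounds on D_sa(M || M_bar).
    T_hat: (S, A, S) array, estimated transition model.
    known: (S, A) boolean array, True where (s, a) is a known pair.
    Requires gamma * (1 + eps) < 1 for the iteration to be a contraction.
    """
    assert gamma * (1.0 + eps) < 1.0
    d = np.zeros_like(D_hat)
    for _ in range(max_iter):
        max_all = d.max()
        backup = T_hat @ d.max(axis=1)   # sum_{s'} T_hat^a_{ss'} * max_{a'} d_{s'a'}
        d_known = D_hat + gamma * (backup + eps * max_all)
        d_unknown = D_hat + gamma * max_all
        d_new = np.where(known, d_known, d_unknown)
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d
```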

9 Proof of Proposition 6

Proof of Proposition 6. The cost of LRMax is constant on most time steps since the action is greedily chosen w.r.t. the upper bound on the optimal Q-function, which is a lookup table. Let N ∈ N be the number of source tasks that have been learned by LRMax during a lifelong RL experiment. When updating a new state-action pair, i.e., labeling it as a known pair, the algorithm performs 2N Dynamic Programming (DP) computations to update the induced Lipschitz bounds (Equation 8) plus one DP computation to update the total bound (Equation 6). In total, we apply (2N + 1) DP computations for each state-action pair update. As at most SA state-action pairs are updated during the exploration of the current MDP, the total number of DP computations is at most SA(2N + 1), for which we use the value iteration algorithm.

We use value iteration as a Dynamic Programming method. Strehl, Li, and Littman (2009) report the minimum number of iterations needed by the value iteration algorithm to estimate a value function (or Q-function in our case) that is $\epsilon_Q$-close to the optimum in maximum norm. This minimum number is given by
$$\left\lceil \frac{1}{1 - \gamma} \ln\left( \frac{1}{\epsilon_Q (1 - \gamma)} \right) \right\rceil .$$

Each iteration has a cost $S^2 A$. Overall, the cost of all the DP computations in a complete run of LRMax is
$$O\left( \frac{S^3 A^2 N}{1 - \gamma} \ln\left( \frac{1}{\epsilon_Q (1 - \gamma)} \right) \right) .$$

This, plus the constant cost O(1) incurred at each of the τ decision epochs, concludes the proof.
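For illustration, the quantities appearing in this bound are easy to evaluate numerically; the back-of-the-envelope helper below (our own, not part of the algorithm) computes the minimum number of value iteration sweeps of Strehl, Li, and Littman (2009) and the resulting worst-case count of elementary DP operations used in the proof.

```python
import math

def vi_iterations(gamma, eps_q):
    """Minimum number of value iteration sweeps for an eps_q-accurate Q-function."""
    return math.ceil(1.0 / (1.0 - gamma) * math.log(1.0 / (eps_q * (1.0 - gamma))))

def lrmax_dp_cost(S, A, N, gamma, eps_q):
    """Worst-case number of elementary DP operations in a run of LRMax:
    at most S*A pair updates, each triggering (2N + 1) DP computations,
    each DP computation costing S^2 * A per sweep."""
    return S * A * (2 * N + 1) * vi_iterations(gamma, eps_q) * S**2 * A

print(vi_iterations(gamma=0.9, eps_q=0.01))        # 70 sweeps
print(lrmax_dp_cost(S=25, A=4, N=5, gamma=0.9, eps_q=0.01))
```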

10 Proof of Proposition 7

Proof of Proposition 7. Consider an algorithm producing ε-accurate estimates $\hat{D}_{sa}(M\|\bar{M})$ of the model pseudo-distances for a subset K of S × A after interacting with any two MDPs $M, \bar{M} \in \mathcal{M}$. Assume $\hat{D}_{sa}(M\|\bar{M})$ to be an upper bound on $D_{sa}(M\|\bar{M})$ for any (s, a) ∉ K. These assumptions are guaranteed with high probability by Proposition 3 while running Algorithm 1 in the lifelong RL setting. Then, for any (s, a) ∈ S × A and any two MDPs $M, \bar{M} \in \mathcal{M}$, we have that
$$\begin{cases} \hat{D}_{sa}(M\|\bar{M}) = D_{sa}(M\|\bar{M}) \pm \epsilon & \text{if } (s, a) \in K, \\ \hat{D}_{sa}(M\|\bar{M}) \ge D_{sa}(M\|\bar{M}) & \text{otherwise.} \end{cases}$$
In particular, $\hat{D}_{sa}(M\|\bar{M}) + \epsilon \ge D_{sa}(M\|\bar{M})$ for all (s, a) ∈ S × A and any $M, \bar{M} \in \mathcal{M}$. By definition of $D_{\max}(s, a)$, this implies that, for all (s, a) ∈ S × A,
$$\max_{M, \bar{M} \in \mathcal{M}} \hat{D}_{sa}(M\|\bar{M}) + \epsilon \ge D_{\max}(s, a) , \qquad (19)$$
where $\mathcal{M}$ is the set of possible tasks in the considered lifelong RL experiment. Consider $\hat{\mathcal{M}}$, the set of sampled MDPs, which allows us to define $\hat{D}_{\max}(s, a) = \max_{M, \bar{M} \in \hat{\mathcal{M}}} \hat{D}_{sa}(M\|\bar{M})$ as the maximum model distance over all the experienced MDPs at (s, a) ∈ S × A. We have that
$$\hat{D}_{\max}(s, a) = \max_{M, \bar{M} \in \mathcal{M}} \hat{D}_{sa}(M\|\bar{M})$$
only if two MDPs maximizing the right-hand side of this equation belong to $\hat{\mathcal{M}}$. If it is the case, then Equation 19 implies that
$$\hat{D}_{\max}(s, a) + \epsilon \ge D_{\max}(s, a) . \qquad (20)$$


Overall, we require the two MDPs maximizing $\max_{M, \bar{M} \in \mathcal{M}} \hat{D}_{sa}(M\|\bar{M})$ to be sampled for Equation 20 to hold. Let us now derive the probability that those two MDPs have been sampled. We denote them $M_1$ and $M_2$. There may exist more candidates for the maximization but, for the sake of generality, we put ourselves in the case where only two MDPs achieve the maximization. Let us consider drawing m ∈ N tasks. We denote by $p_1$ (respectively $p_2$) the probability of sampling $M_1$ (respectively $M_2$) each time a task is sampled. We denote by $X_1$ (respectively $X_2$) the random variable of the first occurrence of the task $M_1$ (respectively $M_2$) among the m trials. Hence, the probability of sampling $M_1$ for the first time at trial k ∈ {1, ..., m} is given by the geometric law and is equal to
$$\Pr(X_1 = k) = p_1 (1 - p_1)^{k - 1} .$$
Additionally, the probability of sampling $M_1$ at least once in the first m trials is given by the cumulative distribution function:
$$\Pr(X_1 \le m) = 1 - (1 - p_1)^m . \qquad (21)$$

We are interested in the probability of the event that $M_1$ and $M_2$ have been sampled in the first m trials, i.e., $\Pr(X_1 \le m \cap X_2 \le m)$. Following the rule of addition for probabilities, we have that
$$\Pr(X_1 \le m \cap X_2 \le m) = \Pr(X_1 \le m) + \Pr(X_2 \le m) - \Pr(X_1 \le m \cup X_2 \le m) .$$
Given that the event of sampling either $M_1$ or $M_2$ during a single trial happens with probability $p_1 + p_2$, we have by analogy with Equation 21 that $\Pr(X_1 \le m \cup X_2 \le m) = 1 - (1 - (p_1 + p_2))^m$. As a result, the following holds:
$$\begin{aligned}
\Pr(X_1 \le m \cap X_2 \le m) &= 1 - (1 - p_1)^m + 1 - (1 - p_2)^m - \left( 1 - (1 - (p_1 + p_2))^m \right) \\
&= 1 - (1 - p_1)^m - (1 - p_2)^m + (1 - (p_1 + p_2))^m \\
&\ge 1 - 2 (1 - p_{\min})^m + (1 - 2 p_{\min})^m .
\end{aligned}$$

As said earlier, Equation 20 holds if $M_1$ and $M_2$ have been sampled during the first m trials, which implies that the probability for Equation 20 to hold is at least equal to the probability of sampling both tasks. Formally,
$$\Pr\left( \hat{D}_{\max}(s, a) + \epsilon \ge D_{\max}(s, a) \right) \ge \Pr(X_1 \le m \cap X_2 \le m) \ge 1 - 2 (1 - p_{\min})^m + (1 - 2 p_{\min})^m .$$
In turn, if m verifies $2 (1 - p_{\min})^m - (1 - 2 p_{\min})^m \le \delta$, then $1 - 2 (1 - p_{\min})^m + (1 - 2 p_{\min})^m \ge 1 - \delta$ and $\Pr\left( \hat{D}_{\max}(s, a) + \epsilon \ge D_{\max}(s, a) \right) \ge 1 - \delta$, which concludes the proof.
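The condition on m at the end of the proof is easy to check numerically; the small helper below (hypothetical, for illustration only) returns the smallest m satisfying $2(1 - p_{\min})^m - (1 - 2p_{\min})^m \le \delta$.

```python
def min_tasks_for_confidence(p_min, delta, m_max=1_000_000):
    """Smallest m such that 2*(1 - p_min)**m - (1 - 2*p_min)**m <= delta,
    i.e., both maximizing tasks have been sampled with probability >= 1 - delta."""
    for m in range(1, m_max + 1):
        if 2 * (1 - p_min) ** m - (1 - 2 * p_min) ** m <= delta:
            return m
    return None

# Example: with p_min = 0.1 and delta = 0.05, about 35 sampled tasks suffice.
print(min_tasks_for_confidence(0.1, 0.05))
```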

11 Discussion on an upper bound on distances between MDP models

Section 4.4 introduced the idea of exploiting prior knowledge on the maximum distance between two MDP models. This idea begs for a more detailed discussion. Consider two MDPs M and $\bar{M}$. By definition of the local model pseudo-metric in Equation 1, the maximum possible distance is given by
$$\max_{M, \bar{M} \in \mathcal{M}^2} D_{sa}(M\|\bar{M}) = \frac{1 + \gamma}{1 - \gamma} .$$
But this assumes that any transition or reward model can define M and $\bar{M}$. In other words, the maximization is made over the whole set of possible MDPs. To illustrate why this is too naive, consider a game within the Arcade Learning Environment (Bellemare et al. 2013). We, as humans, have a strong bias concerning similarity between environments. If the game changes, we still assume groups of pixels will move together on the screen as the result of game actions. For instance, we generally discard possible new games $\bar{M}$ that "teleport" objects across the screen without physical considerations. We also discard new games that allow transitions from a given screen to another screen full of static. These examples illustrate why the knowledge of Dmax is very natural (and also why its precise value may be irrelevant). The same observation can be made for the "tight" experiment of Section 5: the set of possible MDPs is restricted by some implicit assumptions that constrain the maximum distance between tasks. For instance, in these experiments, all transitions move to a neighboring state and never "teleport" the agent to the other side of the gridworld. Without the knowledge of Dmax, LRMax assumes such environments are possible and therefore transfers values very cautiously (with the ultimate goal of not underestimating the optimal Q-function, in order to avoid negative transfer). Overall, the experiments of Section 5 confirm this important insight: safe transfer occurs slowly if no prior is given on the maximum distance between MDPs. On the contrary, the knowledge of Dmax allows a faster and more efficient transfer between environments.

12 The "tight" environment used in the experiments of Section 5

The tight environment is an 11 × 11 grid-world illustrated in Figure 2. The initial state of the agent is the central cell, displayed with an "S". The actions move the agent one cell in one of the four cardinal directions. The reward is 0 everywhere, except for executing an action in one of the three teal cells in the upper-right corner. Each time a task is sampled, a slipping probability of executing another action than the one selected is drawn in [0, 1], and the reward received in each one of the teal cells is picked in [0.8, 1.0].
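Each sampled task of the tight environment thus differs only through a slip probability and the teal-cell rewards. A minimal sketch of this task distribution follows; the function name and parameterization are ours, and whether each teal cell receives an independent reward draw is our reading of the description above.

```python
import random

def sample_tight_task(n_reward_cells=3, rng=random):
    """Draw one task of the 'tight' environment: a slip probability in [0, 1]
    and a reward in [0.8, 1.0] for each of the rewarding (teal) cells."""
    return {
        "slip_prob": rng.uniform(0.0, 1.0),                                  # prob. of executing another action
        "cell_rewards": [rng.uniform(0.8, 1.0) for _ in range(n_reward_cells)],
    }

print(sample_tight_task())
```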


Figure 2: The tight grid-world environment (the start cell is marked "S").

Figure 3: The corridor grid-world environment (a 20-cell corridor with the start "S" in the center and the goal "G" at the right end).

13 Additional lifelong RL experiments

We ran additional experiments on the corridor grid-world environment represented in Figure 3. The initial state of the agent is the central cell, labeled with the letter "S". The actions are {left, right} and the goal is to reach the cell labeled with the letter "G" on the extreme right. A reward R > 0 is received when reaching the goal and 0 otherwise. At each new task, a new value of R is sampled in [0.8, 1]. The transition function is fixed and deterministic.

The key insight in this experiment is not to lose time exploring the left part of the corridor. We ran 20 episodes of 11 time steps for each one of the 20 sampled tasks. Results are displayed in Figures 4a and 4b, respectively showing the average relative discounted return over tasks and over episodes. As in Section 5, we observe in Figure 4a that LRMax benefits from the transfer method as early as the second task. The MaxQInit algorithm benefits from the transfer from task number 12. The prior knowledge Dmax decreases the sample complexity of LRMax, as reported earlier, and the combination of LRMax with MaxQInit outperforms all other methods by providing a tighter upper bound on the optimal Q-value function. This decrease of sample complexity is also observed in the episode-wise display of Figure 4b, where the convergence happens more quickly on average for LRMax and even more so for MaxQInit. This figure also shows the three learning stages of LRMax reported in Section 5.

We also ran lifelong RL experiments in the maze grid-world of Figure 5. The task consists in reaching the goal cell labeled with a "G" while the initial state of the agent is the central cell, labeled with an "S". Two wall configurations are possible, yielding two different tasks, each sampled with probability 1/2 in the lifelong RL setting. The first task corresponds to the case where the orange walls are actual walls and the green cells are normal white cells where the agent can go. The second task is the converse, where the green walls are walls and the orange cells are normal white cells. We ran 100 episodes of length 15 time steps and sampled a total of 30 different tasks. Results can be found in Figure 6. In this experiment, we observe an increase in performance of LRMax as the value of Dmax decreases. The three-stage behavior of LRMax reported in Section 5 does not appear in this case. We tested the performance of using the online estimation of the local model distances of Proposition 7 in the algorithm referred to as LRMax in Figure 6. Once enough tasks have been sampled, the estimate of the local model distance is used with high confidence in its value and refines the upper bound computed analytically in Equation 7. Importantly, this instance of LRMax achieved the best result in this particular environment, demonstrating the usefulness of this result. This method being similar to the MaxQInit estimation of maximum Q-values, we unsurprisingly observe that both algorithms feature a similar performance in the maze environment.

14 Prior Dmax use experiment

Consider two MDPs $M, \bar{M} \in \mathcal{M}$. Each time a state-action pair (s, a) ∈ S × A is updated, we compute the local distance upper bound $\hat{D}_{sa}(M\|\bar{M})$ (Equation 7) for all (s, a) ∈ S × A. In this computation, one can leverage the knowledge of Dmax to select $\min\{\hat{D}_{sa}(M\|\bar{M}), D_{\max}\}$. We show that LRMax relies less and less on Dmax as knowledge on the current task increases. For this experiment, we used the two grid-world environments displayed in Figures 7a and 7b.


Figure 4: Results of the corridor lifelong RL experiment with 95% confidence intervals, for RMax, LRMax, LRMax(0.2), MaxQInit, LRMaxQInit and LRMaxQInit(0.2). (a) Average discounted return vs. tasks. (b) Average discounted return vs. episodes.

Figure 5: The maze grid-world environment. The walls correspond to the black cells and either the green ones or the orange ones.

The rewards collected with any action performed in the teal cells of both tasks are defined as:
$$R_{sa} = \exp\left( - \frac{(s_x - g_x)^2 + (s_y - g_y)^2}{2\sigma^2} \right), \quad \forall s = (s_x, s_y) \in S, \ a \in A ,$$
where $(s_x, s_y)$ are the coordinates of the current state, $(g_x, g_y)$ the coordinates of the goal cell labelled with a "G", and σ is a span parameter equal to 1 in the first environment and 1.5 in the second. The agent starts at the cell labelled with the letter "S". Black cells represent unreachable cells (walls). We run LRMax twice on the two different heat-map grid-worlds and record, for each model update, the proportion of times Dmax is smaller than $\hat{D}_{sa}(M\|\bar{M})$, reported in Figure 8 as the % use of Dmax.
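The reward shaping used in these heat-map grid-worlds is a Gaussian bump centered on the goal; here is a minimal sketch of the formula above (the function and variable names are ours, for illustration only).

```python
import math

def heatmap_reward(s, goal, sigma):
    """Heat-map grid-world reward: exp(-||s - goal||^2 / (2 * sigma^2)).

    s, goal: (x, y) integer cell coordinates; sigma: span parameter (1 or 1.5 here).
    """
    sx, sy = s
    gx, gy = goal
    return math.exp(-((sx - gx) ** 2 + (sy - gy) ** 2) / (2.0 * sigma ** 2))

print(heatmap_reward((0, 0), (3, 3), sigma=1.0))   # far from the goal: near zero
print(heatmap_reward((3, 3), (3, 3), sigma=1.0))   # at the goal: 1.0
```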

With the maximum value Dmax = 19, $\hat{D}_{sa}(M\|\bar{M})$ is systematically smaller than Dmax, resulting in 0% use. Conversely, with the minimum value Dmax = 0, the use expectedly increases to 100%. The in-between value Dmax = 10 displays a linear decay of the use. This suggests that, at each update, $\hat{D}_{sa}(M\|\bar{M}) \le D_{\max}$ becomes true for only one more unique (s, a) pair, resulting in a constant decay of the use. With a less informative prior (Dmax = 15 or 17), updating a single (s, a) pair allows $\hat{D}_{sa}(M\|\bar{M})$ to drop under Dmax for more than one pair, resulting in less use of the prior knowledge. The conclusion of this experiment is that Dmax is only useful at the beginning of the exploration, while LRMax relies more on its own bound $\hat{D}_{sa}(M\|\bar{M})$ once partial knowledge of the task has been acquired.


Figure 6: Averaged discounted return over tasks for the maze grid-world lifelong RL experiment, for RMax, LRMax-prior2.0, LRMax-prior1.0, LRMax-prior0.5, LRMax and RMax-MaxQInit.

Figure 7: The two grid-worlds of the prior use experiment. (a) 4 × 4 heat-map grid-world with a slipping probability of 10%. (b) 4 × 4 heat-map grid-world with a slipping probability of 5%.

Figure 8: Proportion of times where $D_{\max} \le \hat{D}_{sa}(M\|\bar{M})$, i.e., use of the prior, vs. computation of the Lipschitz bound, for LRMax with prior Dmax ∈ {19, 17, 15, 10, 0}. Each curve is displayed with 95% confidence intervals.


Task | Number of experiment repetitions | Number of sampled tasks | Number of episodes | Maximum length of episodes | Total number of collected transition samples (s, a, r, s′)
"Tight" task of Figures 3a, 3b and 3c | 10 | 15 | 2000 | 10 | 3,000,000
"Tight" task of Figure 3d | 100 | 2 | 2000 | 10 | 4,000,000
Corridor task, Section 13 | 1 | 20 | 20 | 11 | 4,400
Maze task, Section 13 | 1 | 30 | 100 | 15 | 45,000
Heat-map, Section 14 | 100 | 2 | 100 | 30 | 600,000

Table 1: Summary of the number of experiment repetitions, number of sampled tasks, number of episodes, maximum length of episodes, and upper bounds on the number of collected samples.

15 Discussion on RMax precision parameters ε, δ, nknown

We used n_known = 10, δ = 0.05 and ε = 0.01. Theoretically, n_known should be much larger (≈ 10^5) in order to reach an accuracy ε = 0.01 according to Strehl, Li, and Littman (2009). However, it is common practice to assume that such small values of n_known are sufficient to reach an acceptable model accuracy ε. Interestingly, empirical validation did not confirm this assumption for any RMax-based algorithm. We keep these values nonetheless for the sake of comparability between algorithms and consistency with the literature. Despite this absence of accuracy guarantees, RMax-based algorithms still perform surprisingly well and are robust to model estimation uncertainties.

16 Information about the Machine Learning reproducibility checklist

For the experiments run in Section 5, the computing infrastructure used was a laptop with a single 64-bit CPU (model: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz). The collected sample sizes and the number of evaluation runs for each experiment are summarized in Table 1.

The confidence intervals displayed for the curves presented in the paper are 95% confidence intervals (Neyman 1937) on the displayed mean. No data were excluded nor pre-computed. Hyper-parameters were chosen to the best of our judgment; they may be sub-optimal, but we found the results convincing enough to display interesting behaviors.

References

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.
Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.
Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.
Neyman, J. 1937. X. Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences 236(767): 333–380.
Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.
Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.
Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.
Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
Watkins, C. J. C. H.; and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4): 279–292.
Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi-Task Reinforcement Learning: A Hierarchical Bayesian Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 1015–1022.