Smoothness in Reinforcement Learning with Large State and Action Spaces
by
Kavosh Asadi
B. E., University of Tehran, 2013
M. Sc., University of Alberta, 2015
A dissertation submitted in partial fulfillment of the
requirements for the Degree of Doctor of Philosophy
in the Department of Computer Science at Brown University
Providence, Rhode Island
March 31st, 2020
© Copyright 2020 by Kavosh Asadi
This dissertation by Kavosh Asadi is accepted in its present form by
the Department of Computer Science as satisfying the dissertation requirement
for the degree of Doctor of Philosophy.
Date    Michael L. Littman, Director
Recommended to the Graduate Council
Date    George D. Konidaris, Reader
Date    Ronald E. Parr, Reader
(Duke University)
Approved by the Graduate Council
Date    Andrew G. Campbell
Dean of the Graduate School
Vita
Kavosh was born and raised in Chalous, a small, green, and beautiful town in Northern Iran. As a
child, he spent an excessive amount of time playing video games, which prompted his deep interest
in minds, specifically the artificial ones. In 2008, he moved to Iran’s capital, Tehran, to complete
his undergraduate degree in computer engineering at the University of Tehran. In 2015, he obtained
his Master’s degree from the University of Alberta in Canada. Since then, he has been pursuing a
PhD degree at the Department of Computer Science at Brown University in the United States.
Preface
Writing this thesis would not have been fun without the collaboration with several amazing people.
Special credit goes to my wonderful mentor, Michael Littman, with whom I spent a lot of time
thinking about this thesis. Michael also provided me with remarkable emotional support, which I
could definitely use in light of the political climate. I am equally thankful to my mentor, George
Konidaris, who has guided me as if I were his younger brother. It has also been a great pleasure to
work with Ron Parr. Ron is a deep thinker. Many parts of this thesis are shaped by my experience
working with Rich Sutton at Alberta. Rich believed in my potential when some did not, and for
that I will always remain grateful. I appreciated my collaborations with Dipendra Misra, Seungchan
Kim, and Evan Cater, with whom I coauthored papers that form the building blocks of this thesis.
Specifically, Chapter 3 is based on a paper that I co-authored with Michael [6], and a follow-up paper
that I co-authored with Seungchan, George, and Michael [59]. Chapter 4 is based on two papers: a
first paper that I co-authored with Dipendra and Michael [8], and another paper that I co-authored
with Evan, Dipendra, and Michael [5]. Lastly, Chapter 5 is based on a paper that I co-authored with
Ron, George, and Michael [9].
I also want to acknowledge the invaluable role of several people who definitely shaped my ideas
throughout the PhD journey: Ben Abbatematteo, David Abel, Cameron Allen, Barrett Ames, Dilip
Arumugam, Akhil Bagaria, Sina Ghiassian, Chris Grimm, Yuu Jinnai, John Langford, Erwan Lecarpentier, Lucas Lehnert, Lihong Li, Sam Lobel, Robert Loftin, Marlos Machado, Rupam Mahmood,
Joseph Modayil, Melrose Roderick, Eric Rosen, Sam Saarinen, Saket Tiwari, Eli Upfal, Hado van
Hasselt, Harm van Seijen, Jason Williams, and Geoffrey Zweig.
Contents
List of Tables
List of Figures

1 Introduction
2 Background
  2.1 The Reinforcement Learning Problem
  2.2 Function Approximation
  2.3 Smoothness
3 A Smooth Approximation of Max for Convergent Reinforcement Learning
  3.1 Introduction
  3.2 Boltzmann Misbehaves
  3.3 Boltzmann Has Multiple Fixed Points
  3.4 Mellowmax and Its Properties
    3.4.1 Mellowmax is a Non-Expansion
    3.4.2 Maximization
    3.4.3 Derivatives
    3.4.4 Averaging
  3.5 Maximum Entropy Mellowmax Policy
  3.6 Experiments in the Tabular Setting
    3.6.1 Two-state MDP
    3.6.2 Random MDPs
    3.6.3 Multi-passenger Taxi Domain
  3.7 Experiments in the Function Approximation Setting
    3.7.1 DeepMellow vs DQN without a Target Network
    3.7.2 DeepMellow vs DQN with a Target Network
  3.8 Conclusion
4 Smoothness in Model-based Reinforcement Learning
  4.1 Introduction
  4.2 Lipschitz Model Class
  4.3 On the Choice of Probability Metric
    4.3.1 Value-Aware Model Learning (VAML) Loss
    4.3.2 Lipschitz Generalized Value Iteration
    4.3.3 Equivalence Between VAML and Wasserstein
  4.4 Understanding the Compounding Error Phenomenon
  4.5 Value Error with Lipschitz Models
  4.6 Experiments
  4.7 Conclusion
5 Smoothness in Continuous Control
  5.1 Introduction
  5.2 Deep RBF Value Functions
  5.3 Experiments: Continuous Optimization
  5.4 Experiments: Continuous Control
  5.5 Conclusion
6 Proposed Work: Safe Reinforcement Learning with Smooth Policies
  6.1 Rademacher Complexity
  6.2 A Generalization Bound
  6.3 Timeline
List of Tables
3.1 A comparison between Mellowmax and Boltzmann in terms of convergence to a unique fixed point.
3.2 Experimental details for evaluating DeepMellow for each domain.
4.1 Lipschitz constant for various functions used in a neural network. Here, Wj denotes the jth row of a weight matrix W.
List of Figures
2.1 An illustration of the reinforcement learning problem.
2.2 An illustration of Lipschitz continuity.
3.1 An MDP in which Boltzmann softmax misbehaves.
3.2 Oscillating values under the Boltzmann softmax policy.
3.3 GVI with Boltzmann could have multiple fixed points.
3.4 GVI with Boltzmann misbehaves.
3.5 GVI with mellowmax exhibits a sound behavior.
3.6 Number of iterations before termination of GVI on the example MDP. GVI under mmω outperforms the alternatives.
3.7 An illustration of the multi-passenger Taxi domain.
3.8 Evaluation of the maximum-entropy mellowmax policy on Taxi.
3.9 Comparison between DeepMellow and DQN in the absence of a target network.
3.10 Comparison between DeepMellow and DQN with a target network.
4.1 An example of a Lipschitz model class.
4.2 An example of a case where the accuracy of a stochastic model is best quantified using Wasserstein.
4.3 A comparison between KL, TV, and Wasserstein in terms of capturing the effectiveness of transition models for planning.
4.4 Correlation between value-prediction error and model error.
4.5 Impact of controlling the Lipschitz constant of learned models.
4.6 A stochastic problem solved by training a Lipschitz model class using EM.
4.7 Impact of controlling the Lipschitz constant in the supervised-learning domain.
4.8 Performance of a Lipschitz model class on the gridworld domain.
5.1 The architecture of a deep RBF value (Q) function.
5.2 An RBF Q function for different smoothing parameters.
5.3 Surface of a true reward function, and its approximation using an RBF reward network.
5.4 Mean and standard deviation of performance for various methods on the continuous optimization task.
5.5 A comparison between RBF–DQN and standard deep-RL baselines.
Abstract
Reinforcement learning (RL) is the study of the interaction between an environment and an agent
that learns to achieve a goal through trial-and-error. Owing to its generality, RL has successfully been
applied to numerous domains, including those with enormous state and action spaces. In light of
the curse of dimensionality in large settings, a fundamental question in RL is how to design algorithms
that are compatible with function approximation but also capable of tackling longstanding challenges
including convergence guarantees, exploration-exploitation, planning, and safety. In this thesis I
study approximate RL through the lens of smoothness formally defined using Lipschitz continuity.
RL algorithms are typically composed of several key ingredients such as value functions, models,
operators, and policies. I present theoretical results showing an essential role for the smoothness
of these ingredients in stability and convergence of RL, effective model learning and planning, and
state-of-the-art continuous control. Through many examples and experiments, I also demonstrate
how to adjust the smoothness of these ingredients to improve the performance of approximate RL
in large problems.
Thesis Statement
Smoothness, defined by Lipschitz continuity, is essential for approximate reinforcement learning in
terms of convergence to a unique fixed-point, near-accurate planning, state-of-the-art continuous
control, and safety.
Chapter 1
Introduction
Reinforcement learning (RL) is the study of the interaction between an environment and an agent
that learns to achieve a goal. Owing to its generality, RL has many emerging applications including
in settings with large (and sometimes infinite) state and action spaces [73, 61, 116, 10, 46, 96, 64].
Successfully applying RL in these settings hinges on developing learning algorithms that are
compatible with function approximation, but also capable of addressing longstanding challenges in
RL such as exploration-exploitation, generalization, planning, and safety. RL algorithms are typ-
ically composed of key ingredients including operators, value functions, models, and policies. In
this thesis I develop a theory that studies these ingredients through the lens of smoothness and in
presence of function approximation.
I define a smooth function as any function with the following property: For any two distinct points
on its domain, the line that connects the surface of the function at the two points should have a
finite slope. The Lipschitz constant of the function is defined as the maximum value of this slope
for any pair of points. A function with a Lipschitz constant K is said to be K-Lipschitz.
My first result pertains to convergence of RL to a unique fixed point, which hinges on using K-
Lipschitz operators with K ≤ 1. A Lipschitz operator with this property is known as a non-expansion.
I show that the Boltzmann softmax operator, commonly used for addressing the exploration–exploitation
trade-off in large RL problems, is not a non-expansion and is prone to misbehav-
ior. I then introduce an alternative softmax operator which, among other nice properties, has the
non-expansion property. The new operator exhibits convergent behavior, and is also useful in large
domains such as Atari.
Second, I focus on model-based reinforcement learning, where I show that in the absence of model
smoothness even a near-perfect model, defined as a model with bounded one-step error, can have
arbitrarily large multi-step errors. I show novel bounds on multi-step prediction error and value-
estimation error of one-step smooth models. I then make the case for controlling the Lipschitz
constant of learned models for effective planning.
Third, I move to the setting with continuous states and actions, where finding an action that is
optimal with respect to a learned state–action value function can be difficult. I introduce deep RBF
value functions: state–action value functions learned using a deep neural network with a radial-basis
function (RBF) output layer. I show that the optimal action with respect to a deep RBF value
function can be easily approximated up to any desired accuracy, and that deep RBF value functions
can support universal function approximation. I show that controlling the smoothness of deep RBF
value functions yields state-of-the-art performance in continuous-action RL problems.
Finally, I describe proposed future work on RL safety, which demands that, each time the learning
algorithm changes its policy, the new policy retains a performance lower bound with high
probability. I show how we may meet this safety requirement by controlling the
Lipschitz constant of the agent’s policy, and I also show how to learn such a safe policy using neural
networks.
Chapter 2
Background
This thesis builds on background material that I present in this chapter. I start by explaining the
reinforcement learning problem, then introduce the function approximation setting, and end the
chapter by formulating the notion of smoothness.
2.1 The Reinforcement Learning Problem
Reinforcement learning (RL) is an area of artificial intelligence (AI) that revolves around the in-
teraction between an environment and an agent that seeks to maximize reward. This problem is
illustrated in Figure 2.1. At each timestep t, the RL agent, in a state St = s, takes an action At = a.
The agent then receives a scalar reward signal Rt = r and moves to a new state St+1 = s′. The goal of the RL
agent is to maximize the sum of rewards across future timesteps by learning a good action-selection
strategy from environmental interaction.
Figure 2.1: An illustration of the reinforcement learning problem.
The RL problem is typically formulated using Markov Decision Processes (MDPs) [89]. An MDP
is usually specified by a tuple: 〈S,A, T,R, γ〉. Here S and A denote the state space and the action
space of the MDP. In the most basic case S and A are discrete, but more generally one or both
of them can be continuous. The MDP model consists of two functions, namely the transition
model T : S ×A×S → [0, 1], and the reward model R : S ×A → R. The discount factor, γ ∈ [0, 1),
determines the importance of immediate reward as opposed to rewards received in the future. The
goal of an RL agent is to find a policy π : S → A that collects a high sum of discounted rewards
across timesteps.
For a state s ∈ S, an action a ∈ A, and a policy π, we define the state–action value function:
$$Q^{\pi}(s,a) := \mathbb{E}_{\pi}\big[G_t \mid S_t = s, A_t = a\big],$$
where $G_t := \sum_{i=t}^{\infty} \gamma^{i-t} R_i$ is called the return at timestep t. We define Q∗ as the maximum Q value
of a state–action pair among all policies:
$$Q^{*}(s,a) := \max_{\pi} Q^{\pi}(s,a)\,.$$
Under a discrete state space, the quantity Q∗, known as the optimal state–action value function,
can be written recursively [23]:
$$Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s')\, \max_{a'} Q^{*}(s',a')\,. \tag{2.1}$$
If the model of the MDP 〈R, T 〉 is available, standard dynamic programming (DP) approaches find
Q∗ by solving for the fixed point of (2.1), known as the Bellman equation. A notable example of
these DP algorithms is Value Iteration, which starts by a randomly initialized Q, and proceeds by
repeatedly performing the following update:
$$Q(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s')\, \max_{a'} Q(s',a')\,. \tag{2.2}$$
In the absence of a model, there are two classes of RL algorithms: model-based and model-free.
Model-based algorithms estimate a model $\langle \widehat{T}, \widehat{R} \rangle$ of the MDP using supervised learning:
$$\widehat{T} \approx T\,, \qquad \widehat{R} \approx R\,.$$
Once a model is learned, it can be used, for example, in conjunction with Value Iteration to solve
the MDP.
The second class of algorithms solves for the fixed point of the Bellman equation using environmental
interaction and without learning a model. Q-learning [115], a notable example of these so-called
model-free algorithms, learns an approximation of Q∗, denoted $\widehat{Q}$, as follows:
$$\widehat{Q}(s,a) \leftarrow \widehat{Q}(s,a) + \alpha\big(r + \gamma \max_{a' \in \mathcal{A}} \widehat{Q}(s',a') - \widehat{Q}(s,a)\big)\,. \tag{2.3}$$
It can be shown that, under mild conditions, Value Iteration, model-based reinforcement learning,
and model-free Q-learning converge to the optimal state–action value function Q∗.
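The tabular update (2.3) is short enough to sketch end to end; the one-state environment below is a hypothetical toy with a known fixed point, not a domain from this thesis:

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # One application of the tabular Q-learning update (2.3).
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Hypothetical toy environment: a single state 0 where action 1 yields
# reward 1 and action 0 yields reward 0; the agent always stays in state 0.
# With gamma = 0.5, the fixed point is Q*(0,1) = 2 and Q*(0,0) = 1.
Q = defaultdict(float)
actions = [0, 1]
for _ in range(2000):
    a = random.choice(actions)          # uniform exploration
    r = 1.0 if a == 1 else 0.0
    q_learning_update(Q, 0, a, r, 0, actions, alpha=0.1, gamma=0.5)
```

Each update contracts the estimate toward the bootstrapped target, so the estimates approach the fixed point of the Bellman equation.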
2.2 Function Approximation
In many interesting RL applications, the state space or the action space of the MDP is enormous,
or sometimes even infinite. In such settings, I assume that the state space and the action space are
metric spaces, each endowed with a distance metric:
$$(\mathcal{S}, d_{\mathcal{S}}) \quad \text{and} \quad (\mathcal{A}, d_{\mathcal{A}})\,.$$
The metric can be defined flexibly and in light of the specific problem we desire to solve. Under an
infinite state space or action space, it is simply infeasible to learn about each state–action pair
individually, whether the goal is to learn a value function, a policy, or a model. The idea behind
function approximation is to represent these ingredients using a parameterized function class.
As a concrete example, when combined with function approximation, Q-learning updates its parameters θ as follows:
$$\theta \leftarrow \theta + \alpha\big(r + \gamma \max_{a' \in \mathcal{A}} \widehat{Q}(s',a';\theta) - \widehat{Q}(s,a;\theta)\big)\, \nabla_{\theta} \widehat{Q}(s,a;\theta)\,. \tag{2.4}$$
Note that Q-learning's update rule (2.4) is agnostic to the choice of function class, so in principle
any differentiable parameterized function class can be used in conjunction with the above update to
learn the parameters θ.
2.3 Smoothness
Figure 2.2: An illustration of Lipschitz continuity. Pictorially, Lipschitz continuity ensures that f lies in between the two affine functions (colored in blue) with slopes K and −K. In this one-dimensional example, the slope is equal to the maximum value of the gradient of the function on its domain.
In this thesis I leverage the “smoothness” of various functions (operators, value functions, models,
policies), which I formulate using the mathematical notion of Lipschitz continuity defined momen-
tarily. Intuitively, I understand a smooth function to have the following property: For any two
distinct points on the domain of the function, the line that connects the surface of the function at
the two points should have a finite slope. The Lipschitz constant of the function is defined as the
maximum value of this slope for any pair of points. A function with a Lipschitz constant K is said
to be K-Lipschitz. I now define a smooth (Lipschitz) function more formally:
Definition 2.3.1. Given two metric spaces (M1, d1) and (M2, d2), each consisting of a space and a
distance metric, a function f : M1 → M2 is Lipschitz continuous (sometimes simply Lipschitz) if the
Lipschitz constant, defined as
$$K_{d_1,d_2}(f) := \sup_{s_1 \in M_1,\, s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}\,, \tag{2.5}$$
is finite.

Equivalently, for a Lipschitz f,
$$\forall s_1, \forall s_2 \quad d_2\big(f(s_1), f(s_2)\big) \le K_{d_1,d_2}(f)\, d_1(s_1, s_2)\,.$$
The concept of Lipschitz continuity is visualized in Figure 2.2.
A Lipschitz function f is called a non-expansion when $K_{d_1,d_2}(f) \le 1$ and a contraction when
$K_{d_1,d_2}(f) < 1$. Lipschitz continuity, in one form or another, has been a key tool in the theory of
RL [27, 28, 68, 77, 41, 50, 90, 107, 83, 87, 86, 26, 22] and bandits [60, 33]. Below, I also define
Lipschitz continuity over a subset of inputs.
Definition 2.3.2. A function f : M1 × A → M2 is uniformly Lipschitz continuous in A if
$$K^{\mathcal{A}}_{d_1,d_2}(f) := \sup_{a \in \mathcal{A}}\ \sup_{s_1, s_2} \frac{d_2\big(f(s_1,a), f(s_2,a)\big)}{d_1(s_1, s_2)}\,, \tag{2.6}$$
is finite.
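A quick way to probe Definition 2.3.1 numerically is to take the supremum in (2.5) over sampled pairs of points; this only lower-bounds the true Lipschitz constant. The helper below is a hypothetical sketch, demonstrated on a function whose constant is known exactly:

```python
import random

def lipschitz_lower_bound(f, samples, d1, d2):
    # Empirical version of (2.5): the maximum of d2(f(x), f(y)) / d1(x, y)
    # over sampled pairs. This is a lower bound on K_{d1,d2}(f); the true
    # constant is a supremum over all pairs in the domain.
    best = 0.0
    for x in samples:
        for y in samples:
            if x != y:
                best = max(best, d2(f(x), f(y)) / d1(x, y))
    return best

# f(x) = 3x is 3-Lipschitz when both metrics are the absolute difference.
pts = [random.uniform(-1.0, 1.0) for _ in range(200)]
K = lipschitz_lower_bound(lambda x: 3.0 * x,
                          pts,
                          lambda a, b: abs(a - b),
                          lambda a, b: abs(a - b))
```

For a linear function the sampled ratio is constant, so the estimate matches the true Lipschitz constant; for general functions it only approaches it as sampling becomes dense.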
Chapter 3
A Smooth Approximation of Max
for Convergent Reinforcement
Learning
In RL problems, it is typical to use a soft approximation of maximum whenever it is necessary to
maximize utility but also to hedge against problems that arise from putting all of one’s weight behind
a single maximum-utility decision. These so-called softmax operators, applied to a set of values, act
somewhat like the maximization function and somewhat like an average. The Boltzmann softmax
operator is the most commonly used softmax operator in this setting, but I show that this operator
is prone to misbehavior. In this chapter, I study an alternative softmax operator that, among other
properties, is a non-expansion (has a Lipschitz constant of at most 1), ensuring convergent
behavior in learning and planning. I introduce a variant of the SARSA algorithm that, by utilizing the
new operator, computes a Boltzmann policy with a state-dependent temperature parameter. I show
that the algorithm is convergent and that it performs favorably in practice.
3.1 Introduction
There is a fundamental tension in decision making between choosing the action that has the highest
expected utility and avoiding “starving” the other actions. The issue arises in the context of the
exploration–exploitation dilemma [110], non-stationary decision problems [102], and when interpret-
ing observed decisions [16].
In reinforcement learning, an approach to addressing the tension is the use of softmax operators
for value-function optimization, and softmax policies for action selection. Examples include value-
based methods such as SARSA [92] or expected SARSA [103, 113], and policy-search methods such
as REINFORCE [117].
An ideal softmax operator is a parameterized set of operators that:
1. has parameter settings that allow it to approximate maximization arbitrarily accurately to
perform reward-seeking behavior;
2. is a non-expansion for all parameter settings ensuring convergence to a unique fixed point;
3. is differentiable to make it possible to improve via gradient-based optimization; and
4. avoids the starvation of non-maximizing actions.
Let $\mathbf{X} = x_1, \ldots, x_n$ be a vector of values. We define the following operators:

• $\max(\mathbf{X}) = \max_{i \in \{1,\ldots,n\}} x_i$,

• $\text{mean}(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n} x_i$,

• $\text{eps}_{\epsilon}(\mathbf{X}) = \epsilon\, \text{mean}(\mathbf{X}) + (1-\epsilon) \max(\mathbf{X})$,

• $\text{boltz}_{\beta}(\mathbf{X}) = \dfrac{\sum_{i=1}^{n} x_i\, e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}}$.
The first operator, max(X), is known to be a non-expansion [69]. However, it is non-differentiable
(Property 3), and ignores non-maximizing selections (Property 4).
The next operator, mean(X), computes the average of its inputs. It is differentiable and, like any
operator that takes a fixed convex combination of its inputs, is a non-expansion. However, it does
not allow for maximization (Property 1).
The third operator, epsε(X), commonly referred to as epsilon-greedy [103], interpolates between max
and mean. The operator is a non-expansion, because it is a convex combination of two non-expansion
operators. But it is non-differentiable (Property 3).
The Boltzmann operator boltzβ(X) is differentiable. It also approximates max as β →∞, and mean
as β → 0. However, it is not a non-expansion (Property 2), and therefore, prone to misbehavior as
will be shown in the next section.
In the following section, we provide a simple example illustrating why the non-expansion property
is important, especially in the context of planning and on-policy learning. We then present a new
softmax operator that is similar to the Boltzmann operator yet is a non-expansion. We prove several
critical properties of this new operator, introduce a new softmax policy, and present empirical re-
sults.
3.2 Boltzmann Misbehaves
I first show that boltzβ can lead to problematic behavior. To this end, I ran SARSA with a Boltzmann
softmax policy (Algorithm 1) on the MDP shown in Figure 3.1. The edges are labeled with a
transition probability (unsigned) and a reward number (signed). Also, state s2 is a terminal state,
so I only consider two action values, namely Q(s1, a) and Q(s2, b). Recall that the Boltzmann
softmax policy assigns the following probability to each action:
$$\pi(a \mid s) = \frac{e^{\beta Q(s,a)}}{\sum_{a} e^{\beta Q(s,a)}}\,.$$
Figure 3.1: A simple MDP with two states, two actions, and γ = 0.98. The use of a Boltzmann softmax policy is not sound in this simple domain.
In Figure 3.2, I plot state–action value estimates at the end of each episode of a single run (smoothed
by averaging over ten consecutive points). I set α = .1 and β = 16.55. The value estimates are
unstable.
SARSA is known to converge in the tabular setting using ε-greedy exploration [69], under decreasing
exploration [99], and to a region in the function-approximation setting [45]. There are also variants
of the SARSA update rule that converge more generally [84, 14, 113]. However, this example is the
first, to our knowledge, to show that SARSA fails to converge in the tabular setting with a Boltzmann
policy. The next section provides background for our analysis of the example.
Algorithm 1 SARSA with Boltzmann softmax policy

Input: initial Q(s, a) ∀s ∈ S, ∀a ∈ A, α, and β
for each episode do
    Initialize s
    a ∼ Boltzmann with parameter β
    repeat
        Take action a, observe r, s′
        a′ ∼ Boltzmann with parameter β
        Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]
        s ← s′, a ← a′
    until s is terminal
end for
Figure 3.2: Values estimated by SARSA with Boltzmann softmax (x-axis: episode number). The algorithm never achieves stable values.
3.3 Boltzmann Has Multiple Fixed Points
Although it has been known for a long time that the Boltzmann operator is not a non-expansion [70],
I am not aware of a published example of an MDP for which two distinct fixed points exist. The
MDP presented in Figure 3.1 is the first example where, as shown in Figure 3.3, generalized value iteration (GVI) under boltzβ
has two distinct fixed points. I also show, in Figure 3.4, a vector field visualizing GVI updates
under boltzβ=16.55. The updates can move the current estimates farther from the fixed points. The
behavior of SARSA (Figure 3.2) results from the algorithm stochastically bouncing back and forth
between the two fixed points. When the learning algorithm performs a sequence of noisy updates,
it moves from one fixed point to the other. As I will show later, planning will also progress extremely
slowly near the fixed points. The lack of the non-expansion property leads to multiple fixed points
and ultimately a misbehavior in learning and planning.
3.4 Mellowmax and Its Properties
I advocate for an alternative softmax operator defined as follows:
$$\text{mm}_{\omega}(\mathbf{X}) = \frac{\log\big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega}\,,$$
Figure 3.3: Fixed points of GVI under boltzβ for varying β. Two distinct fixed points (red and blue) co-exist for a range of β.

Figure 3.4: A vector field showing GVI updates under boltzβ=16.55. Fixed points are marked in black. For some points, such as the large blue point, updates can move the current estimates farther from the fixed points. Also, for points that lie in between the two fixed points, progress is extremely slow.
which can be viewed as a particular instantiation of the quasi-arithmetic mean [21]. It can also be
derived from information theoretical principles as a way of regularizing policies with a cost function
defined by KL divergence [112, 91, 42]. Note that the operator has previously been utilized in other
areas, such as power engineering [95].
I show that mmω, which we refer to as mellowmax, has the desired properties and that it compares
quite favorably to boltzβ in practice.
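When computing mmω in practice, the exponentials can overflow for large ω·xi; a log-sum-exp rearrangement avoids this. The implementation below is a sketch using that standard trick, not code from the thesis:

```python
import math

def mellowmax(x, omega):
    # mm_omega(X) = log( (1/n) sum_i e^{omega x_i} ) / omega, for omega != 0.
    # Factoring out e^{omega m}, with m chosen so every remaining exponent
    # is <= 0, keeps the sum in a safe floating-point range.
    n = len(x)
    m = max(x) if omega > 0 else min(x)
    s = sum(math.exp(omega * (xi - m)) for xi in x)
    return m + (math.log(s) - math.log(n)) / omega
```

The rearrangement is exact: log((1/n) Σ e^{ωxᵢ})/ω = m + (log Σ e^{ω(xᵢ−m)} − log n)/ω for any nonzero ω.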
3.4.1 Mellowmax is a Non-Expansion
I prove that mmω is a non-expansion (Property 2), and therefore, GVI and SARSA under mmω are
guaranteed to converge to a unique fixed point.
Let $\mathbf{X} = x_1, \ldots, x_n$ and $\mathbf{Y} = y_1, \ldots, y_n$ be two vectors of values. Let $\Delta_i = |x_i - y_i|$ for $i \in \{1, \ldots, n\}$
be the difference of the ith components of the two vectors. Also, let i∗ be the index with the
maximum component-wise difference, $i^{*} = \operatorname{argmax}_i \Delta_i$. For simplicity, we assume that i∗ is unique
and ω > 0. Also, without loss of generality, we assume that xi∗ − yi∗ ≥ 0. It follows that:
$$
\begin{aligned}
\big|\text{mm}_{\omega}(\mathbf{X}) - \text{mm}_{\omega}(\mathbf{Y})\big|
&= \Big|\, \log\Big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\Big)\big/\omega \;-\; \log\Big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega y_i}\Big)\big/\omega \,\Big| \\
&= \Big|\, \log \frac{\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}}{\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&= \Big|\, \log \frac{\sum_{i=1}^{n} e^{\omega(y_i + \Delta_i)}}{\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&\le \Big|\, \log \frac{\sum_{i=1}^{n} e^{\omega(y_i + \Delta_{i^{*}})}}{\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&= \Big|\, \log \frac{e^{\omega \Delta_{i^{*}}} \sum_{i=1}^{n} e^{\omega y_i}}{\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&= \big|\, \log\big(e^{\omega \Delta_{i^{*}}}\big)\big/\omega \,\big| = \big|\Delta_{i^{*}}\big| = \max_i \big|x_i - y_i\big|\,,
\end{aligned}
$$
allowing us to conclude that mellowmax is a non-expansion under the infinity norm.
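The non-expansion bound just derived can be sanity-checked numerically on random vectors; this is a standalone sketch, with mellowmax re-implemented inline:

```python
import math
import random

def mellowmax(x, omega):
    # mm_omega(X), computed with the max subtracted for numerical stability.
    m = max(x)
    s = sum(math.exp(omega * (xi - m)) for xi in x)
    return m + (math.log(s) - math.log(len(x))) / omega

random.seed(0)
omega = 5.0
violations = 0
for _ in range(1000):
    X = [random.uniform(-10.0, 10.0) for _ in range(4)]
    Y = [random.uniform(-10.0, 10.0) for _ in range(4)]
    gap = abs(mellowmax(X, omega) - mellowmax(Y, omega))
    # Non-expansion under the infinity norm: gap <= max_i |x_i - y_i|.
    if gap > max(abs(a - b) for a, b in zip(X, Y)) + 1e-12:
        violations += 1
```

By the proof above, no violation should ever occur, for any ω and any pair of vectors.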
3.4.2 Maximization
Mellowmax includes parameter settings that allow for maximization (Property 1) as well as for min-
imization. In particular, as ω goes to infinity, mmω acts like max.
Let $m = \max(\mathbf{X})$ and let $W = |\{x_i = m \mid i \in \{1,\ldots,n\}\}|$. Note that $W \ge 1$ is the number of
maximum values ("winners") in $\mathbf{X}$. Then:
$$
\begin{aligned}
\lim_{\omega \to \infty} \text{mm}_{\omega}(\mathbf{X})
&= \lim_{\omega \to \infty} \frac{\log\big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega} \\
&= \lim_{\omega \to \infty} \frac{\log\big(\tfrac{1}{n}\, e^{\omega m} \sum_{i=1}^{n} e^{\omega(x_i - m)}\big)}{\omega} \\
&= \lim_{\omega \to \infty} \frac{\log\big(\tfrac{1}{n}\, e^{\omega m}\, W\big)}{\omega} \\
&= \lim_{\omega \to \infty} \frac{\log(e^{\omega m}) - \log(n) + \log(W)}{\omega} \\
&= m + \lim_{\omega \to \infty} \frac{-\log(n) + \log(W)}{\omega} = m = \max(\mathbf{X})\,.
\end{aligned}
$$
That is, the operator acts more and more like pure maximization as the value of ω is increased.
Conversely, as ω goes to −∞, the operator approaches the minimum.
3.4.3 Derivatives
We can take the derivative of mellowmax with respect to each one of the arguments $x_i$, for any
non-zero ω:
$$\frac{\partial\, \text{mm}_{\omega}(\mathbf{X})}{\partial x_i} = \frac{e^{\omega x_i}}{\sum_{i=1}^{n} e^{\omega x_i}} \ge 0\,.$$
Note that the operator is non-decreasing in each component of $\mathbf{X}$.

Moreover, we can take the derivative of mellowmax with respect to ω. We define $n_{\omega}(\mathbf{X}) = \log\big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)$ and $d_{\omega}(\mathbf{X}) = \omega$. Then:
$$\frac{\partial n_{\omega}(\mathbf{X})}{\partial \omega} = \frac{\sum_{i=1}^{n} x_i\, e^{\omega x_i}}{\sum_{i=1}^{n} e^{\omega x_i}} \quad \text{and} \quad \frac{\partial d_{\omega}(\mathbf{X})}{\partial \omega} = 1\,,$$
and so:
$$\frac{\partial\, \text{mm}_{\omega}(\mathbf{X})}{\partial \omega} = \frac{\frac{\partial n_{\omega}(\mathbf{X})}{\partial \omega}\, d_{\omega}(\mathbf{X}) - n_{\omega}(\mathbf{X})\, \frac{\partial d_{\omega}(\mathbf{X})}{\partial \omega}}{d_{\omega}(\mathbf{X})^{2}}\,,$$
ensuring differentiability of the operator (Property 3).
3.4.4 Averaging
Because of the division by ω in the definition of mmω, the parameter ω cannot be set to zero.
However, we can examine the behavior of mmω as ω approaches zero and show that the operator
computes an average in the limit.
Since both the numerator and denominator go to zero as ω goes to zero, we will use L'Hôpital's rule
and the derivative given in the previous section to derive the value in the limit:
$$
\lim_{\omega \to 0} \text{mm}_{\omega}(\mathbf{X})
= \lim_{\omega \to 0} \frac{\log\big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega}
\overset{\text{L'H\^opital}}{=} \lim_{\omega \to 0} \frac{\tfrac{1}{n}\sum_{i=1}^{n} x_i\, e^{\omega x_i}}{\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}}
= \frac{1}{n}\sum_{i=1}^{n} x_i = \text{mean}(\mathbf{X})\,.
$$
That is, as ω gets closer to zero, mmω(X) approaches the mean of the values in X.
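Both limiting behaviors, max as ω → ∞ and mean as ω → 0, can be observed numerically; this is again a standalone sketch with mellowmax defined inline:

```python
import math

def mellowmax(x, omega):
    # mm_omega(X), computed with the max subtracted for numerical stability.
    m = max(x)
    s = sum(math.exp(omega * (xi - m)) for xi in x)
    return m + (math.log(s) - math.log(len(x))) / omega

x = [1.0, 2.0, 4.0]
large = mellowmax(x, 200.0)   # approaches max(x) as omega grows
small = mellowmax(x, 1e-6)    # approaches mean(x) as omega shrinks
```

For any finite ω > 0, mmω(X) sits strictly between the mean and the maximum, approaching the two endpoints in the respective limits.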
3.5 Maximum Entropy Mellowmax Policy
As described, mmω computes a value for a list of numbers somewhere between its minimum and
maximum. However, it is often useful to actually provide a probability distribution over the actions
such that (1) a non-zero probability mass is assigned to each action, and (2) the resulting expected
value equals the computed value. Such a probability distribution can then be used for action selec-
tion in algorithms such as SARSA.
In this section, I address the problem of identifying such a probability distribution as a maximum
entropy problem—over all distributions that satisfy the properties above, pick the one that maximizes
information entropy [36, 85]. I formally define the maximum entropy mellowmax policy of a state s
as:
\[
\pi_{\text{mm}}(s) = \operatorname*{argmin}_{\pi} \sum_{a\in A} \pi(a|s)\, \log\big(\pi(a|s)\big) \tag{3.1}
\]
subject to:
\[
\sum_{a\in A}\pi(a|s)\, Q(s,a) = \text{mm}_\omega\big(Q(s,\cdot)\big)\, ,
\qquad \pi(a|s) \ge 0\, ,
\qquad \sum_{a\in A}\pi(a|s) = 1\, .
\]
Note that this optimization problem is convex and can be solved reliably using any numerical convex
optimization library.
One way of finding the solution, which leads to an interesting policy form, is to use the method of
Lagrange multipliers. Here, the Lagrangian is:
\[
\mathcal{L}(\pi,\lambda_1,\lambda_2) = \sum_{a\in A}\pi(a|s)\, \log\big(\pi(a|s)\big)
- \lambda_1\Big(\sum_{a\in A}\pi(a|s) - 1\Big)
- \lambda_2\Big(\sum_{a\in A}\pi(a|s)\, Q(s,a) - \text{mm}_\omega\big(Q(s,\cdot)\big)\Big)\, .
\]
Taking the partial derivative of the Lagrangian with respect to each π(a|s) and setting them to zero,
we obtain:
\[
\frac{\partial \mathcal{L}}{\partial \pi(a|s)} = \log\big(\pi(a|s)\big) + 1 - \lambda_1 - \lambda_2\, Q(s,a) = 0 \quad \forall\, a \in A\, .
\]
These |A| equations, together with the two linear constraints in (3.1), form |A| + 2 equations that
constrain the |A| + 2 variables π(a|s) ∀a ∈ A and the two Lagrange multipliers λ1 and λ2.
Solving this system of equations, the probability of taking an action under the maximum entropy
mellowmax policy has the form:
\[
\pi_{\text{mm}}(a|s) = \frac{e^{\beta Q(s,a)}}{\sum_{a'\in A} e^{\beta Q(s,a')}} \quad \forall\, a \in A\, ,
\]
where β is a value for which:
\[
\sum_{a\in A} e^{\beta\big(Q(s,a) - \text{mm}_\omega(Q(s,\cdot))\big)} \Big(Q(s,a) - \text{mm}_\omega\big(Q(s,\cdot)\big)\Big) = 0\, .
\]
The argument for the existence of a unique root is simple. As β → ∞, the term corresponding
to the best action dominates, and so the function is positive. Conversely, as β → −∞, the term
corresponding to the action with lowest utility dominates, and so the function is negative. Finally,
by taking the derivative, it is clear that the function is monotonically increasing, allowing
us to conclude that there exists only a single root. Therefore, we can find β easily using any root-
finding algorithm. In particular, I use Brent’s method [30], available in the SciPy library of Python.
This policy has the same form as Boltzmann softmax, but with a parameter β whose value depends
indirectly on ω. This mathematical form arose not from the structure of mmω, but from maxi-
mizing the entropy. One way to view the use of the mellowmax operator, then, is as a form of
Boltzmann policy with a temperature parameter chosen adaptively in each state to ensure that the
non-expansion property holds.
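A minimal sketch of this procedure is given below. It finds β by simple bisection (standing in for Brent's method; the monotonicity argument above guarantees a unique root) and then checks that the resulting policy satisfies the first constraint of (3.1). All names and constants are illustrative:

```python
import numpy as np

def mellowmax(q, omega):
    q = np.asarray(q, dtype=float)
    return np.log(np.mean(np.exp(omega * q))) / omega

def mm_policy(q, omega, lo=-100.0, hi=100.0, iters=200):
    """Maximum-entropy mellowmax policy: a softmax whose inverse temperature
    beta is chosen so the policy's expected value equals mm_omega(q)."""
    q = np.asarray(q, dtype=float)
    adv = q - mellowmax(q, omega)           # advantages relative to mm_omega
    f = lambda b: np.sum(np.exp(b * adv) * adv)
    for _ in range(iters):                  # f is monotonically increasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    beta = 0.5 * (lo + hi)
    w = np.exp(beta * (q - q.max()))        # shift for numerical stability
    return w / w.sum()

q = np.array([0.3, 0.1, 0.25])
pi = mm_policy(q, omega=5.0)
# The policy's expected value matches the mellowmax value (first constraint).
assert abs(pi @ q - mellowmax(q, 5.0)) < 1e-6
```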
Finally, note that the SARSA update under the maximum entropy mellowmax policy could be
thought of as a stochastic implementation of the GVI update under the mmω operator:
\[
\mathbb{E}_{\pi_{\text{mm}}}\big[r + \gamma Q(s',a') \,\big|\, s,a\big]
= \sum_{s'\in S} P(s,a,s') \Big( R(s,a,s') + \gamma \underbrace{\sum_{a'\in A} \pi_{\text{mm}}(a'|s')\, Q(s',a')}_{\text{mm}_\omega(Q(s',\cdot))} \Big)
\]
due to the first constraint of the convex optimization problem (3.1). Because mellowmax is a
non-expansion, SARSA with the maximum entropy mellowmax policy is guaranteed to converge to
a unique fixed point. Note also that, similar to other variants of SARSA, the algorithm simply
bootstraps using the value of the next state while implementing the new policy.
3.6 Experiments in the Tabular Setting
Before presenting experiments, I note that in practice computing mellowmax can yield overflow if
the exponentiated values are large. In this case, we can safely shift the values by a constant before
exponentiating them due to the following equality:
\[
\frac{\log\!\big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega}
= c + \frac{\log\!\big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega (x_i - c)}\big)}{\omega}\, .
\]
A value of \(c = \max_i x_i\) usually avoids overflow.
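A minimal implementation of this shifted computation might look as follows; the input values are chosen so that naive exponentiation would overflow:

```python
import numpy as np

def mellowmax_stable(x, omega):
    """Shift by c = max(x) before exponentiating to avoid overflow."""
    x = np.asarray(x, dtype=float)
    c = x.max()
    return c + np.log(np.mean(np.exp(omega * (x - c)))) / omega

x = np.array([500.0, 400.0, 100.0])
# exp(5 * 500) would overflow a double; the shifted form stays finite.
print(mellowmax_stable(x, omega=5.0))  # close to max(x) = 500
```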
3.6.1 Two-state MDP
I repeat the experiment from Figure 3.4 for mellowmax with ω = 16.55 to get a vector field. The
result, presented in Figure 3.5, shows a rapid and steady convergence towards the unique fixed point.
As a result, GVI under mmω can terminate significantly faster than GVI under boltzβ , as illustrated
in Figure 3.6.
Figure 3.5: GVI updates under mmω=16.55. The fixed point is unique, and all updates move quickly toward the fixed point.
Figure 3.6: Number of iterations before termination of GVI on the example MDP. GVI under mmω
outperforms the alternatives.
3.6.2 Random MDPs
The example in Figure 3.1 was contrived. It is interesting to know whether such examples are likely
to be encountered naturally. To this end, I constructed 200 MDPs as follows: I sampled |S| from
{2, 3, ..., 10} and |A| from {2, 3, 4, 5} uniformly at random. I initialized the transition probabilities
by sampling uniformly from [0, .01]. I then added to each entry, with probability 0.5, Gaussian noise
with mean 1 and variance 0.1. I next added, with probability 0.1, Gaussian noise with mean 100
and variance 1. Finally, I normalized the raw values to ensure a valid transition matrix. I followed
a similar process for rewards, with the difference that I divided each entry by the maximum entry
and multiplied by 0.5 to ensure that Rmax = 0.5.
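The construction above can be sketched as follows. This is an illustrative reconstruction of the sampling recipe, not the original experimental code; in particular, taking absolute values before normalizing is an added assumption to keep the raw entries nonnegative:

```python
import numpy as np

def random_mdp(rng):
    """Sample an MDP following the recipe in the text (an illustrative sketch)."""
    n_s = int(rng.integers(2, 11))          # |S| uniform over {2, ..., 10}
    n_a = int(rng.integers(2, 6))           # |A| uniform over {2, ..., 5}
    raw = rng.uniform(0.0, 0.01, size=(n_s, n_a, n_s))
    mask = rng.random(raw.shape) < 0.5      # with prob. 0.5: add N(1, var 0.1)
    raw[mask] += rng.normal(1.0, np.sqrt(0.1), size=int(mask.sum()))
    mask = rng.random(raw.shape) < 0.1      # with prob. 0.1: add N(100, var 1)
    raw[mask] += rng.normal(100.0, 1.0, size=int(mask.sum()))
    raw = np.abs(raw)                       # assumption: keep entries nonnegative
    T = raw / raw.sum(axis=-1, keepdims=True)   # normalize rows into probabilities
    R = rng.uniform(0.0, 0.01, size=(n_s, n_a, n_s))
    R = 0.5 * R / R.max()                   # rescale so that Rmax = 0.5
    return T, R

T, R = random_mdp(np.random.default_rng(0))
```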
I measured the failure rate of GVI under boltzβ and mmω by stopping GVI when it did not terminate
in 1000 iterations. I also computed the average number of iterations needed before termination. A
summary of results is presented in the table below. Mellowmax outperforms Boltzmann based on
the three measures provided below.
         | MDPs, no termination | MDPs, > 1 fixed point | average iterations
boltzβ   | 8 of 200             | 3 of 200              | 231.65
mmω      | 0 of 200             | 0 of 200              | 201.32

Table 3.1: A comparison between mellowmax and Boltzmann in terms of convergence to a unique fixed point.
Figure 3.7: Multi-passenger taxi domain. The discount rate γ is 0.99. Reward is +1 for delivering one passenger, +3 for two passengers, and +15 for three passengers. Reward is zero for all other transitions. Here F, S, and D denote passengers, the start state, and the destination, respectively.
3.6.3 Multi-passenger Taxi Domain
I evaluated SARSA on the multi-passenger taxi domain introduced by Dearden et al. [37]. (See
Figure 3.7.)
One challenging aspect of this domain is that it admits many locally optimal policies. Exploration
needs to be set carefully to avoid either over-exploring or under-exploring the state space. Note
also that Boltzmann softmax performs remarkably well on this domain, outperforming sophisticated
Bayesian reinforcement-learning algorithms [37].
Figure 3.8: Comparison on the multi-passenger taxi domain. Results are shown for different values of ε, β, and ω. For each setting, the learning rate is optimized. Results are averaged over 25 independent runs, each consisting of 300,000 time steps.
As shown in Figure 3.8, SARSA with the epsilon-greedy policy performs poorly. In fact, in our experiment, the algorithm rarely was able to deliver all
the passengers. However, SARSA with Boltzmann softmax and SARSA with the maximum entropy
mellowmax policy achieved significantly higher average reward. The maximum entropy mellowmax policy
is no worse than Boltzmann softmax here, suggesting that the greater stability does not come at
the expense of less effective exploration.
Parameters        | Acrobot | Lunar Lander | Atari
learning rate     | 10−3    | 10−4         | 0.00025
neural network    | MLP     | MLP          | CNN
layers            | 3       | 3            | 4
neurons per layer | 300     | 500          | –
update frequency  | 100     | 200          | 10000
number of runs    | 100     | 50           | 5
processing unit   | CPU     | GPU          | GPU

Table 3.2: Experimental details for evaluating DeepMellow for each domain.
3.7 Experiments in the Function Approximation Setting
I now move to the function approximation case where we use mellowmax as the bootstrapping
operator in Deep Q Networks (DQN) [73], hence the name DeepMellow. We simply change the
bootstrapping operator of DQN from max to mellowmax.
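The change from DQN to DeepMellow can be sketched as a one-line swap in the computation of TD targets. The sketch below is illustrative only (NumPy rather than a deep-learning framework, and all names are assumptions):

```python
import numpy as np

def mellowmax(q, omega):
    """Row-wise mellowmax over actions, with a max-shift for stability."""
    c = q.max(axis=1, keepdims=True)
    return c[:, 0] + np.log(np.mean(np.exp(omega * (q - c)), axis=1)) / omega

def td_targets(rewards, next_q, dones, gamma=0.99, omega=None):
    """DQN targets use max_a' Q(s', a'); DeepMellow swaps in mellowmax."""
    boot = next_q.max(axis=1) if omega is None else mellowmax(next_q, omega)
    return rewards + gamma * (1.0 - dones) * boot

next_q = np.array([[1.0, 2.0], [0.5, 0.25]])
r = np.array([0.0, 1.0])
d = np.array([0.0, 1.0])  # second transition is terminal
print(td_targets(r, next_q, d))             # DQN-style targets
print(td_targets(r, next_q, d, omega=5.0))  # DeepMellow targets (bounded above by DQN's)
```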
We tested DeepMellow in two control domains (Acrobot, Lunar Lander) and two Atari games (Break-
out, Seaquest). We used multilayer perceptrons (MLPs) for the control domains, and convolutional
neural networks (CNNs) for Atari games. The parameters and neural network architectures for each
domain are summarized in Table 3.2. For the Atari game domains, we used the same CNN as [73].
Target network update frequency is a crucial factor in our experiments. In DQN, while the real
action-value function is updated every iteration, the target network is only updated every C iterations—
we call C the target network update frequency.1 When C > 1, the target network is updated with a
delay. On the other hand, setting C = 1 means that, after every update, the target network is copied
from the real action-value function immediately; we will use C = 1 as a synonym for eliminating the
separate target network.
Choice of Temperature Parameter ω
The temperature parameter ω of DeepMellow is an additional parameter that should be tuned.
Both too large and too small ω values are bad—as ω increases, Mellowmax behaves more like a
max operator, so there is no advantage to using it. With an ω close to zero, Mellowmax behaves
like a mean operator, resulting in persistent random behaviors. Intermediate values yield better
performances than the two extremes, but the beginning and end of the “reasonable performance
range” of the parameter vary with domain. To find the optimal ranges of the ω parameter for each
domain, we used a grid search method, as we did for other hyperparameters. We empirically found
that simpler domains (Acrobot, Lunar Lander) require relatively smaller ω values while large-scale
Atari domains require larger values.
1Though we use the term “frequency”, “period” might be more apt.
Figure 3.9: The performance of DeepMellow (no target network) and DQN (no target network) in control domains (left) and Atari games (right). DeepMellow outperforms DQN in all domains, in the absence of a target network. Note that the best-performing temperature ω values vary across domains.
For Acrobot and Lunar Lander, our parameter search set was ω ∈ {1, 2, 5, 10}. For Breakout and Seaquest, we tested ω ∈ {100, 250, 1000, 2000, 3000, 5000} and
ω ∈ {10, 20, 30, 40, 50, 100, 200}, respectively. An adaptive approach to choosing this parameter can
benefit DeepMellow, but we leave this direction for future work.
3.7.1 DeepMellow vs DQN without a Target Network
We first compared DeepMellow and DQN in the absence of a target network (i.e., target network
update frequency C = 1). The control domain results are shown in Figure 3.9 (left). In Acrobot,
DeepMellow achieves more stable learning than DQN—without a target network, the learning
curve of DQN goes upward fast, but soon starts fluctuating and fails to improve towards the end.
By contrast, DeepMellow (especially with temperature parameter ω = 1) succeeds. Similar results
are observed in Lunar Lander. DeepMellow (ω ∈ {1, 2}) achieves more stable learning and higher
average returns than DQN without a target network.
In order to quantify the performances in each domain, we computed the areas under the curves
for DeepMellow and DQN (for the first 1000 episodes; the y-axis lower bound is −500). Normalizing
the areas under the curves of DeepMellow (ω = 1) to 100%, the areas under the curves of DeepMellow
(ω = 2) and DQN were 88.9% and 78.7%, respectively, in Acrobot. Similarly, in the Lunar Lander
domain, the areas under the curves of DeepMellow (ω = 2) and DQN were 93.4% and 79.2% of
that of DeepMellow (ω = 1). In both domains, DeepMellow (ω = 1) performed best, followed by
DeepMellow (ω = 2), and then by DQN.
Figure 3.10: Performances of DeepMellow (no target network) and DQN (with a target network). If tuned with an optimal temperature parameter ω value, DeepMellow learns faster than DQN with a target network.
We also compared the performances of DeepMellow and DQN in two Atari games, Breakout and
Seaquest. We chose these two domains because the effects of having a target network are known to
be different in each domain [73]. In Breakout, the performance of DQN does not differ significantly
with and without a target network. On the other hand, Seaquest is a domain that shows a significant
performance drop when the target network is absent. Thus, these two domains are two contrasting
examples for us to see whether DeepMellow obviates the need for a target network.
Figure 3.9 (right) shows the performances of DeepMellow and DQN in these games. We used similar
methods to quantify their performances (computing the areas under the curves), and the results were
as follows: in Breakout, compared with the area under the curve of DeepMellow (ω = 1000), those
of DeepMellow (ω = 250, 3000, 5000) and DQN were 93.1%, 68.5%, 67.1%, and 64.5%, respectively.
In Seaquest, the performance gaps widened as expected: the areas under the curves of DeepMellow
(ω = 20, 40) and DQN were 69.5%, 31.1%, and 14.9% of that of DeepMellow (ω = 30).
DeepMellow performed better than DQN without a target network in both Breakout and Seaquest;
especially in Seaquest, the performance gap was substantial. Also, note that there are intermediate
ω values that yield the best performance of DeepMellow in each domain (ω = 30 is better than ω = 20
or ω = 40 in Seaquest; ω = 1000 is better than ω = 250, 3000, or 5000 in Breakout). These
results are consistent with the property of the mellowmax operator that both too-large (behaving like
the max operator) and too-small (entailing random action selection) ω values degrade performance.
3.7.2 DeepMellow vs DQN with a Target Network
In the previous section, we showed that DeepMellow outperforms DQN without a target network.
The next question that naturally arises is whether DeepMellow without a target network performs
even better than DQN with a target network.
To see if DeepMellow has an advantage over DQN with a target network, we compared their perfor-
mances, focusing on their learning speed. Our prediction is that DeepMellow will learn faster than
DQN, because DQN’s updates are delayed (C > 1), and DeepMellow is likely to react faster to the
environment.
As shown in Figure 3.10, DeepMellow does learn faster than DQN in the Lunar Lander, Breakout, and
Seaquest domains. In Acrobot (not shown), there was no significant difference, because both algorithms
learned very quickly. In Lunar Lander, DeepMellow reaches a score of 0 at episode 517 on
average, while DQN reaches the same point around episode 561 on average. (DeepMellow is 8%
faster than DQN in reaching the zero score.)
In Breakout and Seaquest, DeepMellow learns faster than DQN and achieves higher performance
if tuned with an optimal ω parameter. In Breakout, DeepMellow (ω = 1000) reaches scores of
15 and 20 at timesteps 55 × 10⁴ and 95 × 10⁴, while DQN reaches them at timesteps 153 × 10⁴ and
503 × 10⁴. Similarly, in Seaquest, DeepMellow (ω = 30) reaches scores of 600 and 800 at timesteps
140 × 10⁴ and 193 × 10⁴, while DQN reaches them at 296 × 10⁴ and 406 × 10⁴. We also observed
that, in both Atari games, DeepMellow achieves higher scores than DQN. In Breakout, at timestep
1000 × 10⁴, the scores of DeepMellow (ω = 1000) and DQN are 38 and 26, respectively, which means
that the score of DeepMellow at this timepoint is 42.6% higher than DQN's. Similarly, in Seaquest,
at timestep 800 × 10⁴, the scores of DeepMellow (ω = 30) and DQN are 1205 and 962, respectively.
At this timepoint, DeepMellow achieves a score that is 25.2% higher than DQN's.
3.8 Conclusion
In this chapter, I proposed and evaluated the mellowmax operator as an appealing smooth ap-
proximation of max. I showed that mellowmax has several desirable properties and that it works
favorably in practice, including on large Atari games. Arguably, mellowmax could be used in place
of Boltzmann throughout reinforcement-learning research.
Chapter 4
Smoothness in Model-based
Reinforcement Learning
When could a learned model be useful for planning? In this chapter I answer this question by
examining the impact of learning Lipschitz continuous models. I make the case for a new loss
function for model-based RL based on the Wasserstein metric. By formalizing a good one-step
model as a Lipschitz model with bounded one-step Wasserstein error, I provide a novel bound on
multi-step prediction error of Lipschitz models, as well as on value-estimation error of the one-step
model. I conclude with empirical results that show the benefits of controlling the Lipschitz constant
of transition models when they are represented using neural networks.
4.1 Introduction
The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of
the environment to plan and make high-quality decisions [54, 103, 11]. Although the behavior of
model-based algorithms in tabular environments is well understood and can be effective [103], scaling
up to the approximate setting can cause instabilities. Even small model errors can be magnified by
the planning process resulting in poor performance [108].
In this chapter, I study model-based RL through the lens of Lipschitz continuity. I show that the
ability of a model to make accurate multi-step predictions hinges on not just the model’s one-step
accuracy, but also the magnitude of the Lipschitz constant (smoothness) of the model. I further
show that the dependence on the Lipschitz constant carries over to the value-prediction problem,
ultimately influencing the quality of the policy found by planning.
I consider a setting with continuous state spaces and stochastic transitions where I quantify the
distance between distributions using the Wasserstein metric. I introduce a novel characterization
of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set
of component deterministic functions. This allows us to study any such stochastic dynamics through the
Lipschitz continuity of the component deterministic functions. To learn a Lipschitz model class in
continuous state spaces, I provide an Expectation-Maximization algorithm [38].
One promising direction for mitigating the effects of inaccurate models is the idea of limiting the
complexity of the learned models or reducing the horizon of planning [52]. Doing so can sometimes
make models more useful, much as regularization in supervised learning can improve generalization
performance [111]. I examine a type of regularization that comes from controlling the Lipschitz
constant of models. This regularization technique can be applied efficiently, as I will show, when we
represent the transition model by neural networks.
4.2 Lipschitz Model Class
We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a
set of deterministic components.
Definition 4.2.1. Given a metric state space (S, dS) and an action space A, we define Fg as a
collection of functions Fg = {f : S → S} distributed according to g(f | a), where a ∈ A. We say
that Fg is a Lipschitz model class if
\[
K_F := \sup_{f \in F_g} K_{d_S,d_S}(f)
\]
is finite.
27
Figure 4.1: An example of a Lipschitz model class in a gridworld environment [93]. The dynamics are such that any action choice results in an attempted transition in the corresponding direction with probability 0.8 and in the neighboring directions with probabilities 0.1 and 0.1. We can define Fg = {fup, fright, fdown, fleft}, where each f outputs a deterministic next position in the grid (factoring in obstacles). For a = up, we have: g(fup | a = up) = 0.8, g(fright | a = up) = g(fleft | a = up) = 0.1, and g(fdown | a = up) = 0. Defining distances between states as their Manhattan distance in the grid, we have ∀f: sup_{s1,s2} d(f(s1), f(s2))/d(s1, s2) = 2, and so KF = 2. So, the four functions and g comprise a Lipschitz model class.
Our definition captures a subset of stochastic transitions, namely ones that can be represented as
a state-independent distribution over deterministic transitions. An example is provided in Figure 4.1.
Associated with a Lipschitz model class is a transition function given by:
\[
T(s' \mid s, a) = \sum_{f} \mathbb{1}\big(f(s) = s'\big)\, g(f \mid a)\, .
\]
Given a state distribution µ(s), I also define a generalized notion of transition function TG(· | µ, a)
given by:
\[
T_G(s' \mid \mu, a) = \int_{s} \underbrace{\sum_{f} \mathbb{1}\big(f(s) = s'\big)\, g(f \mid a)}_{T(s' \mid s, a)}\; \mu(s)\, ds\, .
\]
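The gridworld example of Figure 4.1 can be sketched in code. The sketch below uses a simplified, obstacle-free 3×3 grid, so the enumerated Lipschitz constant is 1 rather than the 2 reported for the grid with obstacles in the figure; function and variable names are illustrative:

```python
import itertools

# A simplified, obstacle-free 3x3 gridworld version of Figure 4.1 (a sketch).
SIZE = 3
def clip(p):
    return (min(max(p[0], 0), SIZE - 1), min(max(p[1], 0), SIZE - 1))

F = {  # component deterministic functions, one per direction
    "up":    lambda s: clip((s[0] - 1, s[1])),
    "down":  lambda s: clip((s[0] + 1, s[1])),
    "left":  lambda s: clip((s[0], s[1] - 1)),
    "right": lambda s: clip((s[0], s[1] + 1)),
}
g = {"up": 0.8, "left": 0.1, "right": 0.1, "down": 0.0}  # g(f | a = up)

def manhattan(s1, s2):
    return abs(s1[0] - s2[0]) + abs(s1[1] - s2[1])

# K_F = sup_f sup_{s1 != s2} d(f(s1), f(s2)) / d(s1, s2), here by enumeration.
states = list(itertools.product(range(SIZE), repeat=2))
KF = max(manhattan(f(s1), f(s2)) / manhattan(s1, s2)
         for f in F.values() for s1, s2 in itertools.permutations(states, 2))

# Induced transition function T(s' | s, a=up) = sum_f 1(f(s) = s') g(f | a).
def T(s):
    out = {}
    for name, f in F.items():
        out[f(s)] = out.get(f(s), 0.0) + g[name]
    return out

probs = T((1, 1))  # e.g., moves up with probability 0.8
```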
We are primarily interested in \(K^A_{d,d}(T_G)\), the Lipschitz constant of TG. However, since TG takes
a probability distribution as input and also outputs a probability distribution, we require a notion of
distance between two distributions. This notion is quantified using the Wasserstein metric and is justified in
the next section.
4.3 On the Choice of Probability Metric
I consider the stochastic model-based setting and show through an example that the Wasserstein
metric is a reasonable choice compared to other common options.
Figure 4.2: A state distribution µ(s) (top), a stochastic environment that randomly adds or subtracts c1 (middle), and an approximate transition model that randomly adds or subtracts a second scalar c2 (bottom).
Consider a uniform distribution over states µ(s) as shown in black in Figure 4.2 (top). Take a
transition function TG in the environment that, given an action a, uniformly randomly adds or
subtracts a scalar c1. The distribution of states after one transition is shown in red in Figure 4.2
(middle). Now, consider a transition model \(\hat{T}_G\) that approximates TG by uniformly randomly adding
or subtracting the scalar c2. The distribution over states after one transition using this imperfect
model is shown in blue in Figure 4.2 (bottom). We desire a metric that captures the similarity
between the outputs of the two transition functions. I first consider the Kullback-Leibler (KL) divergence
and observe that:
\[
KL\big(T_G(\cdot \mid \mu, a)\,\|\,\hat{T}_G(\cdot \mid \mu, a)\big)
:= \int T_G(s' \mid \mu, a)\, \log\frac{T_G(s' \mid \mu, a)}{\hat{T}_G(s' \mid \mu, a)}\, ds' = \infty\, ,
\]
unless the two constants are exactly the same.
The next possible choice is Total Variation (TV), defined as:
\[
TV\big(T_G(\cdot \mid \mu, a),\, \hat{T}_G(\cdot \mid \mu, a)\big)
:= \frac{1}{2} \int \big| T_G(s' \mid \mu, a) - \hat{T}_G(s' \mid \mu, a) \big|\, ds' = 1\, ,
\]
if the two distributions have disjoint supports, regardless of how far apart the supports are.
In contrast, Wasserstein is sensitive to how far apart the constants are:
\[
W\big(T_G(\cdot \mid \mu, a),\, \hat{T}_G(\cdot \mid \mu, a)\big) = |c_1 - c_2|\, .
\]
It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely \(\hat{T}_G\)
approximates TG. This is particularly important in high-dimensional spaces, where the true distribution
is known to usually lie on low-dimensional manifolds [79].
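The example can also be checked numerically. In the sketch below, the one-dimensional Wasserstein-1 distance between two empirical samples is the mean gap between their sorted values; KL and TV are not computed because, as argued above, they are ∞ and 1 for these disjoint supports. Constants and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.uniform(0.0, 1.0, size=100_000)      # start-state distribution mu(s)
c1, c2 = 2.0, 2.5                             # environment and model offsets
sign = rng.choice([-1.0, 1.0], size=mu.size)  # shared coin flip for both
true_next = mu + sign * c1                    # environment: add/subtract c1
model_next = mu + sign * c2                   # model: add/subtract c2 instead

# 1-D Wasserstein-1 between empirical samples: mean gap of sorted values.
# (KL is infinite and TV is 1 here, since the supports are disjoint.)
w = np.mean(np.abs(np.sort(true_next) - np.sort(model_next)))
print(w)  # close to |c1 - c2| = 0.5
```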
In the next section, I present a theoretical argument for the choice of Wasserstein.
4.3.1 Value-Aware Model Learning (VAML) Loss
In model-based RL, it is very common to have a model-learning process that is agnostic to the specific
planning process. In such cases, the usefulness of the model for the specific planning procedure comes
as an afterthought. In contrast, the basic idea behind VAML [39] is to learn a model tailored to the
planning algorithm that intends to use it. To illustrate this idea, consider Bellman equations [24]
which are at the core of many planning and RL algorithms [103]:
\[
Q(s,a) = R(s,a) + \gamma \int T(s' \mid s,a)\, f\big(Q(s',\cdot)\big)\, ds'\, ,
\]
where f can generally be any arbitrary operator [68], such as max. We also define:
\[
v(s') := f\big(Q(s',\cdot)\big)\, .
\]
A good model \(\hat{T}\) could then be thought of as one that minimizes the error:
\begin{align*}
l(\hat{T}, T)(s,a) &= R(s,a) + \gamma \int T(s' \mid s,a)\, v(s')\, ds'
- R(s,a) - \gamma \int \hat{T}(s' \mid s,a)\, v(s')\, ds' \\
&= \gamma \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds'\, .
\end{align*}
Note that minimizing this objective requires access to the value function in the first place, but we
can obviate this need by leveraging Hölder's inequality:
\[
l(\hat{T}, T)(s,a) = \gamma \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds'
\le \gamma \big\| T(\cdot \mid s,a) - \hat{T}(\cdot \mid s,a) \big\|_1 \, \|v\|_\infty\, .
\]
Further, we can use Pinsker's inequality to write:
\[
\big\| T(\cdot \mid s,a) - \hat{T}(\cdot \mid s,a) \big\|_1
\le \sqrt{2\, KL\big(T(\cdot \mid s,a)\,\|\,\hat{T}(\cdot \mid s,a)\big)}\, .
\]
This justifies the use of maximum likelihood estimation for model learning, a common practice in
model-based RL [13, 2, 3], since maximum likelihood estimation is equivalent to empirical KL minimization.
However, there exists a major drawback with the KL objective, namely that it ignores the structure
of the value function during model learning. As a simple example, if the value function is constant
across the state space, any randomly chosen model \(\hat{T}\) will, in fact, yield zero Bellman error. However,
a model-learning algorithm that ignores the structure of the value function can potentially require
many samples to provide any guarantee about the performance of the learned policy.
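The constant-value-function remark can be verified in a few lines; this is a toy illustration with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
T_true = rng.random(n); T_true /= T_true.sum()   # T(. | s, a)
T_hat = rng.random(n);  T_hat /= T_hat.sum()     # an arbitrary, unrelated model
v_const = np.full(n, 3.7)                        # a constant value function

# l = gamma * sum_{s'} (T - T_hat) v(s'): zero for any model when v is
# constant, because both T and T_hat sum to one.
gamma = 0.99
loss = gamma * np.sum((T_true - T_hat) * v_const)
assert abs(loss) < 1e-12
```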
Consider the objective function \(l(\hat{T}, T)\), and notice again that v itself is not known, so we cannot
directly optimize this objective. Farahmand et al. [40] proposed to search for a model that
results in the lowest error over all possible value functions belonging to a specific class:
\[
L(\hat{T}, T)(s,a) = \sup_{v \in \mathcal{F}} \Big| \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds' \Big|^2 \tag{4.1}
\]
Note that minimizing this objective is shown to be tractable if, for example, F is restricted to the
class of exponential functions. Observe that the VAML objective (4.1) is similar to the dual form of
Wasserstein:
\[
W(\mu_1, \mu_2) = \sup_{f:\, K_{d,d_R}(f) \le 1} \int \big(\mu_1(s) - \mu_2(s)\big)\, f(s)\, ds\, ,
\]
but the difference is in the space of value functions. In the next section, we show that, under certain
conditions, even the spaces of value functions are the same.
4.3.2 Lipschitz Generalized Value Iteration
We show that solving for a class of Bellman equations yields a Lipschitz value function. Our proof
is in the context of GVI [68], which defines Value Iteration [24] with arbitrary backup operators.
We make use of the following lemmas.
Lemma 4.3.1. Given a non-expansion f : S → R:
\[
K^{A}_{d_S,d_R}\Big( \int T(s' \mid s,a)\, f(s')\, ds' \Big) \le K^{A}_{d_S,W}(T)\, .
\]
Proof. Starting from the definition, we write:
\begin{align*}
K^{A}_{d_S,d_R}\Big(\int T(s' \mid s,a)\, f(s')\, ds'\Big)
&= \sup_a \sup_{s_1,s_2} \frac{\big|\int \big(T(s' \mid s_1,a) - T(s' \mid s_2,a)\big)\, f(s')\, ds'\big|}{d(s_1,s_2)} \\
&\le \sup_a \sup_{s_1,s_2} \frac{\big|\sup_{g} \int \big(T(s' \mid s_1,a) - T(s' \mid s_2,a)\big)\, g(s')\, ds'\big|}{d(s_1,s_2)}
\quad \text{(where } K_{d_S,d_R}(g) \le 1\text{)} \\
&= \sup_a \sup_{s_1,s_2} \frac{\sup_{g} \int \big(T(s' \mid s_1,a) - T(s' \mid s_2,a)\big)\, g(s')\, ds'}{d(s_1,s_2)} \\
&= \sup_a \sup_{s_1,s_2} \frac{W\big(T(\cdot \mid s_1,a),\, T(\cdot \mid s_2,a)\big)}{d(s_1,s_2)} = K^{A}_{d_S,W}(T)\, .
\end{align*}
Lemma 4.3.2. The following operators are non-expansions (\(K_{\|\cdot\|_\infty,d_R}(\cdot) = 1\)):
1. max(x), mean(x)
2. ε-greedy(x) := ε mean(x) + (1 − ε) max(x)
3. \(\text{mm}_\beta(x) := \frac{\log\!\big(\frac{1}{n}\sum_i e^{\beta x_i}\big)}{\beta}\)
Proof. 1 is proven by Littman & Szepesvári [68]. 2 follows from 1 (metrics not shown for brevity):
\begin{align*}
K\big(\text{ε-greedy}(x)\big) &= K\big(\varepsilon\, \text{mean}(x) + (1-\varepsilon)\max(x)\big) \\
&\le \varepsilon\, K\big(\text{mean}(x)\big) + (1-\varepsilon)\, K\big(\max(x)\big) = 1\, .
\end{align*}
Finally, 3 is proven multiple times in the literature [6, 78, 80].
Algorithm 2 GVI algorithm
Input: initial Q(s, a), δ, and a choice of operator f
repeat
  diff ← 0
  for each s ∈ S do
    for each a ∈ A do
      Qcopy ← Q(s, a)
      Q(s, a) ← R(s, a) + γ ∫ T(s′ | s, a) f(Q(s′, ·)) ds′
      diff ← max{diff, |Qcopy − Q(s, a)|}
    end for
  end for
until diff < δ
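For a finite MDP, Algorithm 2 can be implemented in a few lines. The sketch below uses synchronous sweeps rather than the in-place updates of the pseudocode, which does not affect the fixed point; names and the toy MDP are illustrative:

```python
import numpy as np

def gvi(T, R, gamma, f, delta=1e-8):
    """Generalized Value Iteration for a finite MDP (synchronous sweeps).
    T: (S, A, S) transition tensor, R: (S, A) rewards, f: backup operator
    mapping a vector of action values to a scalar (e.g. np.max or np.mean)."""
    Q = np.zeros(R.shape)
    while True:
        v = np.array([f(q) for q in Q])        # v(s') = f(Q(s', .))
        Q_new = R + gamma * (T @ v)            # batched Bellman backup
        if np.max(np.abs(Q_new - Q)) < delta:  # diff < delta: terminate
            return Q_new
        Q = Q_new

# Toy chain: s0 -> s1 (reward 1), s1 -> s1 (reward 0), one action.
T = np.zeros((2, 1, 2))
T[0, 0, 1] = T[1, 0, 1] = 1.0
R = np.array([[1.0], [0.0]])
print(gvi(T, R, gamma=0.5, f=np.max))  # Q(s0) = 1.0, Q(s1) = 0.0
```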
I now present the main result of the section.
Theorem 4.3.3. For any choice of backup operator f outlined in Lemma 4.3.2, GVI computes a
value function with a Lipschitz constant bounded by
\(\frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)}\), provided that \(\gamma K^{A}_{d_S,W}(T) < 1\).
Proof. From Algorithm 2, in the nth round of GVI updates we have:
\[
Q_{n+1}(s,a) \leftarrow R(s,a) + \gamma \int T(s' \mid s, a)\, f\big(Q_n(s',\cdot)\big)\, ds'\, .
\]
First observe that:
\begin{align*}
K^{A}_{d_S,d_R}(Q_{n+1})
&\le K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,d_R}\Big(\int T(s' \mid s, a)\, f\big(Q_n(s',\cdot)\big)\, ds'\Big) \\
&\le K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,W}(T)\, K_{d_S,d_R}\big(f(Q_n(s,\cdot))\big)
&& \text{(due to Lemma 4.3.1)} \\
&\le K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,W}(T)\, K_{\|\cdot\|_\infty,d_R}(f)\, K^{A}_{d_S,d_R}(Q_n)
&& \text{(due to the Composition Lemma)} \\
&= K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,W}(T)\, K^{A}_{d_S,d_R}(Q_n)
&& \text{(due to Lemma 4.3.2, the non-expansion property of } f\text{)}
\end{align*}
Equivalently:
\[
K^{A}_{d_S,d_R}(Q_{n+1}) \le K^{A}_{d_S,d_R}(R) \sum_{i=0}^{n} \big(\gamma K^{A}_{d_S,W}(T)\big)^i
+ \big(\gamma K^{A}_{d_S,W}(T)\big)^{n+1} K^{A}_{d_S,d_R}(Q_0)\, .
\]
By computing the limit of both sides, we get:
\[
\lim_{n\to\infty} K^{A}_{d_S,d_R}(Q_n) \le \frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)} + 0\, ,
\]
where we used the fact that \(\lim_{n\to\infty} \big(\gamma K^{A}_{d_S,W}(T)\big)^n = 0\) since
\(\gamma K^{A}_{d_S,W}(T) < 1\). This concludes the proof.
Now notice that, as defined earlier:
\[
V_n(s) := f\big(Q_n(s,\cdot)\big)\, ,
\]
so as a relevant corollary of our theorem we get:
\[
K_{d_S,d_R}\big(v(s)\big) = \lim_{n\to\infty} K_{d_S,d_R}(V_n)
= \lim_{n\to\infty} K_{d_S,d_R}\big(f(Q_n(s,\cdot))\big)
\le \lim_{n\to\infty} K^{A}_{d_S,d_R}(Q_n)
\le \frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)}\, .
\]
That is, solving for the fixed point of this general class of Bellman equations results in a Lipschitz
state-value function.
4.3.3 Equivalence Between VAML and Wasserstein
We now show the main claim of this section, namely that minimizing the VAML objective is the
same as minimizing the Wasserstein metric.
Consider again the VAML objective:
\[
L(\hat{T}, T)(s,a) = \sup_{v \in \mathcal{F}} \Big| \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds' \Big|^2\, ,
\]
where F can generally be any class of functions. From our theorem, however, the space of value
functions F should be restricted to Lipschitz functions. Moreover, it is easy to design an MDP and
a policy such that a desired Lipschitz value function is attained.
This space LC can then be defined as follows:
\[
L_C = \{f : K_{d_S,d_R}(f) \le C\}\, ,
\quad\text{where}\quad
C = \frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)}\, .
\]
So we can rewrite the VAML objective L as follows:
\begin{align*}
L(\hat{T}, T)(s,a)
&= \sup_{f \in L_C} \Big| \int f(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2 \\
&= \sup_{f \in L_C} \Big| \int C\, \frac{f(s')}{C} \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2 \\
&= C^2 \sup_{g \in L_1} \Big| \int g(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2\, .
\end{align*}
It is clear that a function g that maximizes the Kantorovich-Rubinstein dual form:
\[
\sup_{g \in L_1} \int g(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds'
:= W\big(T(\cdot \mid s,a),\, \hat{T}(\cdot \mid s,a)\big)\, ,
\]
will also maximize:
\[
L(\hat{T}, T)(s,a) = \Big| \int g(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2\, .
\]
This is due to the fact that g ∈ L1 implies −g ∈ L1, and so taking the absolute value or squaring
the term does not change the arg max in this case.
As a result:
\[
L(\hat{T}, T)(s,a) = \Big( C\, W\big(T(\cdot \mid s,a),\, \hat{T}(\cdot \mid s,a)\big) \Big)^2\, .
\]
This highlights a nice property of Wasserstein, namely that minimizing this metric yields a value-
aware model. Therefore, the strong theoretical properties shown for the value-aware loss [39] further
justify our choice of Wasserstein, assuming that the Lipschitz assumption holds.
4.4 Understanding the Compounding Error Phenomenon
To extract a prediction with a horizon n > 1, model-based algorithms typically apply the model for n
steps by taking the state input in step t to be the state output from the step t−1. Previous work has
shown that model error can result in poor long-horizon predictions and ineffective planning [108, 109].
Observed even beyond reinforcement learning [71, 114], this is referred to as the compounding error
phenomenon. The goal of this section is to provide a bound on multi-step prediction error of a
model. In light of the previous section, I formalize the notion of model accuracy below:
Definition 4.4.1. Given an MDP with a transition function T, we identify a Lipschitz model class Fg
as ∆-accurate if its induced \(\hat{T}\) satisfies:
\[
\forall s\ \forall a \quad W\big(\hat{T}(\cdot \mid s,a),\, T(\cdot \mid s,a)\big) \le \Delta\, .
\]
We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error
and the Lipschitz constant of the transition function TG . I provide a bound on the Lipschitz constant
of TG using the following lemma:
Lemma 4.4.1. A generalized transition function TG induced by a Lipschitz model class Fg is Lipschitz
with a constant:
\[
K^{A}_{W,W}(T_G) := \sup_a \sup_{\mu_1,\mu_2} \frac{W\big(T_G(\cdot \mid \mu_1, a),\, T_G(\cdot \mid \mu_2, a)\big)}{W(\mu_1,\mu_2)} \le K_F\, .
\]
Proof.
\begin{align*}
W\big(T_G(\cdot \mid \mu_1, a),\, T_G(\cdot \mid \mu_2, a)\big)
&:= \inf_{j} \int_{s'_1}\!\int_{s'_2} j(s'_1, s'_2)\, d(s'_1, s'_2)\, ds'_1\, ds'_2 \\
&= \inf_{j} \int_{s_1}\!\int_{s_2}\!\int_{s'_1}\!\int_{s'_2} \sum_f \mathbb{1}\big(f(s_1)=s'_1 \wedge f(s_2)=s'_2\big)\, j(s_1,s_2,f)\, d(s'_1,s'_2)\, ds'_1\, ds'_2\, ds_1\, ds_2 \\
&= \inf_{j} \int_{s_1}\!\int_{s_2} \sum_f j(s_1,s_2,f)\, d\big(f(s_1), f(s_2)\big)\, ds_1\, ds_2 \\
&\le K_F \inf_{j} \int_{s_1}\!\int_{s_2} \sum_f g(f \mid a)\, j(s_1,s_2)\, d(s_1,s_2)\, ds_1\, ds_2 \\
&= K_F \sum_f g(f \mid a) \inf_{j} \int_{s_1}\!\int_{s_2} j(s_1,s_2)\, d(s_1,s_2)\, ds_1\, ds_2 \\
&= K_F \sum_f g(f \mid a)\, W(\mu_1,\mu_2) = K_F\, W(\mu_1,\mu_2)\, .
\end{align*}
Intuitively, Lemma 4.4.1 states that, if the two input distributions are similar, then for any action
the output distributions given by TG are also similar up to a KF factor.
Given the one-step error (Definition 4.4.1), a start-state distribution µ, and a fixed sequence of actions
a0, ..., an−1, we desire a bound on the n-step error:
\[
\delta(n) := W\big(\hat{T}^{\,n}_G(\cdot \mid \mu),\, T^n_G(\cdot \mid \mu)\big)\, ,
\]
where \(T^n_G(\cdot \mid \mu) := \underbrace{T_G\big(\cdot \mid T_G(\cdot \mid \cdots T_G(\cdot \mid \mu, a_0) \cdots, a_{n-2}),\, a_{n-1}\big)}_{n \text{ recursive calls}}\)
and \(\hat{T}^{\,n}_G(\cdot \mid \mu)\) is defined similarly. I provide
a useful lemma followed by the theorem.
Lemma 4.4.2. (Composition Lemma) Consider three metric spaces (M1, d1), (M2, d2), and (M3, d3),
and Lipschitz functions f : M2 → M3 and g : M1 → M2 with constants Kd2,d3(f) and Kd1,d2(g).
Then h = f ◦ g : M1 → M3 is Lipschitz with constant Kd1,d3(h) ≤ Kd2,d3(f) Kd1,d2(g).
Proof.
\begin{align*}
K_{d_1,d_3}(h) &= \sup_{s_1, s_2} \frac{d_3\big(f(g(s_1)),\, f(g(s_2))\big)}{d_1(s_1, s_2)} \\
&= \sup_{s_1, s_2} \frac{d_2\big(g(s_1), g(s_2)\big)}{d_1(s_1, s_2)} \cdot \frac{d_3\big(f(g(s_1)),\, f(g(s_2))\big)}{d_2\big(g(s_1), g(s_2)\big)} \\
&\le \sup_{s_1, s_2} \frac{d_2\big(g(s_1), g(s_2)\big)}{d_1(s_1, s_2)} \cdot \sup_{s_1, s_2} \frac{d_3\big(f(s_1), f(s_2)\big)}{d_2(s_1, s_2)} \\
&= K_{d_1,d_2}(g)\, K_{d_2,d_3}(f)\,.
\end{align*}
Similar to composition, we can show that summation preserves Lipschitz continuity, with a constant
bounded by the sum of the Lipschitz constants of the two functions. We omit this result for brevity.
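As an illustration (my addition, not part of the original analysis), the composition bound of Lemma 4.4.2 can be checked numerically. The sketch below uses two hypothetical one-dimensional functions with known Lipschitz constants; `lipschitz_estimate` is an illustrative helper that lower-bounds the true constant by sampling difference quotients:

```python
import math
import random

def lipschitz_estimate(fn, samples=10_000, lo=-5.0, hi=5.0):
    """Lower-bound sup_{x != y} |fn(x) - fn(y)| / |x - y| by sampling pairs."""
    rng = random.Random(0)
    best = 0.0
    for _ in range(samples):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x != y:
            best = max(best, abs(fn(x) - fn(y)) / abs(x - y))
    return best

g = lambda x: 0.5 * math.sin(x)   # Lipschitz with constant K_g = 0.5
f = lambda x: 2.0 * math.tanh(x)  # Lipschitz with constant K_f = 2.0
h = lambda x: f(g(x))             # composition h = f o g

K_h = lipschitz_estimate(h)
# Lemma 4.4.2 predicts K_h <= K_f * K_g = 1.0; the sampled estimate
# can only fall below the true constant, so it must respect the bound.
assert K_h <= 2.0 * 0.5 + 1e-12
```

The same sampling idea extends to the summation case mentioned above, where the constant is bounded by the sum rather than the product.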
Theorem 4.4.3. Consider a ∆-accurate \widehat{T}_G with Lipschitz constant K_F and an MDP with a
Lipschitz transition function T_G with constant K_T. Let K = min{K_F, K_T}. Then ∀n ≥ 1:
\[
\delta(n) := W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} K^{i}\,.
\]
Proof. We construct a proof by induction. Using Kantorovich–Rubinstein duality (the Lipschitz
property of f is not shown, for brevity) we first prove the base of the induction:
\begin{align*}
\delta(1) &:= W\big(\widehat{T}_G(\cdot \mid \mu, a_0),\, T_G(\cdot \mid \mu, a_0)\big) \\
&:= \sup_{f} \int \! \int \big(\widehat{T}(s' \mid s, a_0) - T(s' \mid s, a_0)\big) f(s')\, \mu(s)\, ds\, ds' \\
&\le \int \underbrace{\sup_{f} \int \big(\widehat{T}(s' \mid s, a_0) - T(s' \mid s, a_0)\big) f(s')\, ds'}_{=\, W(\widehat{T}(\cdot \mid s, a_0),\, T(\cdot \mid s, a_0)) \text{ due to duality (4.3.1)}} \mu(s)\, ds \\
&= \int \underbrace{W\big(\widehat{T}(\cdot \mid s, a_0),\, T(\cdot \mid s, a_0)\big)}_{\le\, \Delta \text{ due to Definition 4.4.1}} \mu(s)\, ds \\
&\le \int \Delta\, \mu(s)\, ds = \Delta\,.
\end{align*}
We now prove the inductive step. Assuming $\delta(n-1) := W\big(\widehat{T}^{n-1}_G(\cdot \mid \mu),\, T^{n-1}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-2} (K_F)^{i}$,
we can write:
\begin{align*}
\delta(n) &:= W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big) \\
&\le W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, \widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big)
+ W\big(\widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, T^{n}_G(\cdot \mid \mu)\big) \quad \text{(triangle inequality)} \\
&= W\big(\widehat{T}_G(\cdot \mid \widehat{T}^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, \widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big)
+ W\big(\widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, T_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big)
\end{align*}
We now use Lemma 4.4.1 and Definition 4.4.1 to upper bound the first and the second term of the
last line respectively.
\begin{align*}
\delta(n) &\le K_F\, W\big(\widehat{T}^{n-1}_G(\cdot \mid \mu),\, T^{n-1}_G(\cdot \mid \mu)\big) + \Delta \\
&= K_F\, \delta(n-1) + \Delta \le \Delta \sum_{i=0}^{n-1} (K_F)^{i}\,. \tag{4.2}
\end{align*}
Note that in the triangle inequality we may replace the intermediate distribution
$\widehat{T}_G\big(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}\big)$ with $T_G\big(\cdot \mid \widehat{T}^{n-1}_G(\cdot \mid \mu), a_{n-1}\big)$ and
follow the same basic steps to get:
\[
W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} (K_T)^{i}\,. \tag{4.3}
\]
Combining (4.2) and (4.3) allows us to write:
\begin{align*}
\delta(n) = W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big)
\le \min \Big\{ \Delta \sum_{i=0}^{n-1} (K_T)^{i},\ \Delta \sum_{i=0}^{n-1} (K_F)^{i} \Big\}
= \Delta \sum_{i=0}^{n-1} K^{i}\,,
\end{align*}
which concludes the proof.
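To make the role of K concrete, here is a small sketch (my addition, with arbitrary example constants) that evaluates the bound of Theorem 4.4.3: for K < 1 the multi-step error bound saturates at ∆/(1 − K), while for K > 1 it grows geometrically with the horizon n:

```python
def multi_step_error_bound(delta, K, n):
    """Upper bound of Theorem 4.4.3: delta * sum_{i=0}^{n-1} K^i."""
    return delta * sum(K**i for i in range(n))

# With K < 1 the bound saturates at delta / (1 - K); with K > 1 it
# grows geometrically with n, illustrating compounding model error.
contractive = multi_step_error_bound(0.1, 0.5, 50)  # stays below 0.1 / (1 - 0.5) = 0.2
expansive = multi_step_error_bound(0.1, 1.5, 50)    # explodes with the horizon
assert contractive < 0.2
assert expansive > 1e5
```

This is why controlling the Lipschitz constant of the learned model matters: it determines whether one-step errors compound or stay bounded over long rollouts.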
There exist similar results in the literature relating one-step transition error to multi-step transi-
tion error and to sub-optimality bounds for planning with an approximate model. The Simulation
Lemma [58, 101] is for discrete-state MDPs and relates error in the one-step model to the value
obtained by using it for planning. A related result for continuous state spaces [55] bounds the error
in estimating the probability of a trajectory using total variation. A second related result [114]
provides a slightly looser bound for prediction error in the deterministic case; Theorem 4.4.3 can be
thought of as a generalization of that result to the probabilistic case.
4.5 Value Error with Lipschitz Models
We next investigate the error in the state-value function induced by a Lipschitz model class. To
answer this question, we consider an MRP M_1 denoted by ⟨S, A, T, R, γ⟩ and a second MRP M_2,
⟨S, A, \widehat{T}, R, γ⟩, that differs from the first only in its transition function. Let A = {a} be an
action set with a single action a. We further assume that the reward function depends only on
state. We first express the state-value function for a start state s with respect to the two transition
functions. By δ_s below, we mean a Dirac delta function denoting a distribution with probability 1
at state s:
\[
V_T(s) := \sum_{n=0}^{\infty} \gamma^{n} \int T^{n}_G(s' \mid \delta_s)\, R(s')\, ds'\,, \qquad
V_{\widehat{T}}(s) := \sum_{n=0}^{\infty} \gamma^{n} \int \widehat{T}^{n}_G(s' \mid \delta_s)\, R(s')\, ds'\,.
\]
Next we derive a bound on $\big|V_T(s) - V_{\widehat{T}}(s)\big|$ for all s.
Theorem 4.5.1. Assume a Lipschitz model class F_g with a ∆-accurate \widehat{T}, and let K = min{K_F, K_T}.
Further, assume a Lipschitz reward function with constant K_R = K_{d_S,R}(R). Then ∀s ∈ S and
K ∈ [0, 1/γ):
\[
\big|V_T(s) - V_{\widehat{T}}(s)\big| \le \frac{\gamma K_R \Delta}{(1 - \gamma)(1 - \gamma K)}\,.
\]
Proof. We first define the function f(s) = R(s)/K_R. It can be observed that K_{d_S,R}(f) = 1. We now
write:
\begin{align*}
V_T(s) - V_{\widehat{T}}(s)
&= \sum_{n=0}^{\infty} \gamma^{n} \int R(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds' \\
&= K_R \sum_{n=0}^{\infty} \gamma^{n} \int f(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds'\,.
\end{align*}
Let F = {h : K_{d_S,R}(h) ≤ 1}. Then, given f ∈ F:
\begin{align*}
K_R \sum_{n=0}^{\infty} \gamma^{n} \int f(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds'
&\le K_R \sum_{n=0}^{\infty} \gamma^{n} \underbrace{\sup_{f \in F} \int f(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds'}_{:=\, W(T^{n}_G(\cdot \mid \delta_s),\, \widehat{T}^{n}_G(\cdot \mid \delta_s)) \text{ due to duality (4.3.1)}} \\
&= K_R \sum_{n=0}^{\infty} \gamma^{n} \underbrace{W\big(T^{n}_G(\cdot \mid \delta_s),\, \widehat{T}^{n}_G(\cdot \mid \delta_s)\big)}_{\le\, \sum_{i=0}^{n-1} \Delta K^{i} \text{ due to Theorem 4.4.3}} \\
&\le K_R \sum_{n=0}^{\infty} \gamma^{n} \sum_{i=0}^{n-1} \Delta K^{i} \\
&= K_R \Delta \sum_{n=0}^{\infty} \gamma^{n}\, \frac{1 - K^{n}}{1 - K} = \frac{\gamma K_R \Delta}{(1 - \gamma)(1 - \gamma K)}\,.
\end{align*}
Function f                          Definition                    Lipschitz constant K_{‖·‖_p,‖·‖_p}(f)
                                                                  p = 1            p = 2                p = ∞
ReLU : R^n → R^n                    ReLU(x)_i := max{0, x_i}      1                1                    1
(+b) : R^n → R^n, ∀b ∈ R^n          (+b)(x) := x + b              1                1                    1
(×W) : R^n → R^m, ∀W ∈ R^{m×n}      (×W)(x) := Wx                 Σ_j ‖W_j‖_∞      √(Σ_j ‖W_j‖²_2)      sup_j ‖W_j‖_1

Table 4.1: Lipschitz constants for various functions used in a neural network. Here, W_j denotes the
jth row of a weight matrix W.
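The constants for the linear map in Table 4.1 can be verified empirically. The following sketch (my illustration, using a hypothetical random weight matrix) checks that ‖Wx − Wy‖_p ≤ K_p ‖x − y‖_p for each of the three table entries:

```python
import random

def p_norm(v, p):
    """The l_p norm of a vector, with p = float('inf') for the max norm."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def table_constant(W, p):
    """Table 4.1 bound for f(x) = Wx, with W_j the j-th row of W."""
    if p == 1:
        return sum(max(abs(w) for w in row) for row in W)
    if p == 2:
        return sum(sum(w * w for w in row) for row in W) ** 0.5  # Frobenius norm
    return max(sum(abs(w) for w in row) for row in W)            # p = infinity

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
for p in (1, 2, float("inf")):
    K = table_constant(W, p)
    for _ in range(1000):
        x = [rng.uniform(-1, 1) for _ in range(4)]
        y = [rng.uniform(-1, 1) for _ in range(4)]
        lhs = p_norm([a - b for a, b in zip(matvec(W, x), matvec(W, y))], p)
        assert lhs <= K * p_norm([a - b for a, b in zip(x, y)], p) + 1e-9
```

By Lemma 4.4.2, multiplying these per-layer constants yields a bound on the constant of the whole network.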
We can derive the same bound for V_{\widehat{T}}(s) − V_T(s) using the fact that the Wasserstein distance is
a metric, and therefore symmetric, thereby completing the proof.
Regarding the tightness of these bounds, I can show that when the transition model is deterministic
and linear, Theorem 4.4.3 provides a tight bound. Moreover, if the reward function is also linear,
the bound provided by Theorem 4.5.1 is tight.
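As a numerical sanity check on Theorem 4.5.1 (my addition, with arbitrary example constants), the closed-form bound γK_R∆/((1 − γ)(1 − γK)) can be compared against a truncation of the series it was derived from:

```python
def value_error_bound(gamma, K_R, delta, K):
    """Closed form of Theorem 4.5.1 (requires K < 1/gamma)."""
    assert 0 <= K < 1.0 / gamma
    return gamma * K_R * delta / ((1 - gamma) * (1 - gamma * K))

def value_error_series(gamma, K_R, delta, K, terms=10_000):
    """Truncated series K_R * delta * sum_n gamma^n (1 - K^n) / (1 - K)."""
    return K_R * delta * sum(
        gamma**n * (1 - K**n) / (1 - K) for n in range(terms)
    )

closed = value_error_bound(0.95, 2.0, 0.1, 0.9)
series = value_error_series(0.95, 2.0, 0.1, 0.9)
assert abs(closed - series) < 1e-6
```

The agreement confirms the geometric-series step at the end of the proof.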
4.6 Experiments
My first goal in this section is to compare TV, KL, and Wasserstein in terms of their ability to
quantify the error of an imperfect model, and to do so in a simple and clear setting. To this end, I built
finite MRPs with random transitions, |S| = 10 states, and γ = 0.95. In the first case the reward
signal is randomly sampled from [0, 10]; in the second case the reward of a state is the index
of that state, so a small Euclidean distance between two states indicates similar values. For
10^5 trials, I generated an MRP and a random model, and then computed the model error and planning
error (Figure 4.3). A good metric is one whose model error correlates highly with value error.
I show these correlations for different values of γ in Figure 4.4.
Figure 4.3: Value error (x axis) and model error (y axis). When the reward is the index of the state
(right), correlation between Wasserstein error and value-prediction error is high. This highlights the
fact that when closeness in the state space is an indication of similar values, Wasserstein can be a
powerful metric for model-based RL. Note that Wasserstein provides no advantage given random
rewards (left).
Figure 4.4: Correlation between value-prediction error and model error for the three metrics using
random rewards (left) and index rewards (right). Given a useful notion of state similarity, low
Wasserstein error is a better indication of planning error.
It is known that controlling the Lipschitz constant of neural nets can help improve generalization
error, owing to a lower Rademacher complexity [81, 17]. It then follows from
Theorems 4.4.3 and 4.5.1 that controlling the Lipschitz constant of a learned transition model can
achieve better error bounds for multi-step and value predictions. To enforce this constraint during
learning, we bound the Lipschitz constant of the various operations used in building a neural network;
the bound on the constant of the entire network then follows from Lemma 4.4.2. In Table 4.1,
we provide Lipschitz constants for the operations used in our experiments, quantified for
different p-norms ‖·‖_p.
Given these simple methods for enforcing Lipschitz continuity, I performed empirical evaluations to
understand the impact of Lipschitz continuity of transition models, specifically when the transition
model is used to perform multi-step state predictions and policy improvements. I chose two standard
domains: Cart Pole and Pendulum. In Cart Pole, I trained a network on a dataset of 15 × 10^3 tuples
⟨s, a, s'⟩. During training, I ensured that the weights of the network are smaller than k. For each k, I
performed 20 independent model estimations, and chose the model with median cross-validation error.
Using the learned model, along with the actual reward signal of the environment, I then performed
stochastic actor-critic RL [20, 105]. This required an interaction between the policy and the learned
model for relatively long trajectories. To measure the usefulness of the model, I then tested the
learned policy on the actual domain. I repeated this experiment on Pendulum. To train the neural
transition model for this domain, we used 10^4 samples. Notably, I used deterministic policy gradient
[97] for training the policy network, with the hyperparameters suggested by [65]. I report these
results in Figure 4.5.
Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory,
controlling the Lipschitz constant in practice can combat the compounding errors and can help in
Figure 4.5: Impact of controlling the Lipschitz constant of learned models in Cart Pole (left) and
Pendulum (right). An intermediate value of k (Lipschitz constant) yields the best performance.
the value estimation problem. This ultimately results in learning a better policy.
I next examined whether these benefits carry over to stochastic settings. To capture stochasticity, we
need an algorithm to learn a Lipschitz model class (Definition 4.2.1). I used an EM algorithm to jointly
learn a set of functions f, parameterized by θ = {θ_f : f ∈ F_g}, and a distribution over functions g.
Note that in practice our dataset consists only of samples ⟨s, a, s'⟩ and does not include the
function each sample is drawn from, so I treat the function identity as a latent variable z. As is standard
with EM, we start with the log-likelihood objective (for simplicity of presentation, I assume a single
action in the derivation):
\begin{align*}
L(\theta) &= \sum_{i=1}^{N} \log p(s_i, s'_i; \theta) \\
&= \sum_{i=1}^{N} \log \sum_{f} p(z_i = f, s_i, s'_i; \theta) \\
&= \sum_{i=1}^{N} \log \sum_{f} q(z_i = f \mid s_i, s'_i)\, \frac{p(z_i = f, s_i, s'_i; \theta)}{q(z_i = f \mid s_i, s'_i)} \\
&\ge \sum_{i=1}^{N} \sum_{f} q(z_i = f \mid s_i, s'_i) \log \frac{p(z_i = f, s_i, s'_i; \theta)}{q(z_i = f \mid s_i, s'_i)}\,,
\end{align*}
where I used Jensen’s inequality and concavity of log in the last line. This derivation leads to the
following EM algorithm.
In the M step, find θ^t by solving:
\[
\arg\max_{\theta} \sum_{i=1}^{N} \sum_{f} q^{t-1}(z_i = f \mid s_i, s'_i) \log \frac{p(z_i = f, s_i, s'_i; \theta)}{q^{t-1}(z_i = f \mid s_i, s'_i)}\,.
\]
In the E step, compute the posteriors:
\[
q^{t}(z_i = f \mid s_i, s'_i) = \frac{p(s_i, s'_i \mid z_i = f;\, \theta^{t}_f)\, g(z_i = f;\, \theta^{t})}{\sum_{f} p(s_i, s'_i \mid z_i = f;\, \theta^{t}_f)\, g(z_i = f;\, \theta^{t})}\,.
\]
Note that we assume each point is drawn from a neural network f with probability:
\[
p(s_i, s'_i \mid z_i = f;\, \theta^{t}_f) = \mathcal{N}\big(s'_i - f(s_i; \theta^{t}_f);\ 0,\ \sigma^{2}\big)\,,
\]
with a fixed variance σ² tuned as a hyperparameter.
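The E and M steps above can be sketched end to end. The toy below (my illustration, not the thesis implementation) replaces the neural networks in F_g with two linear functions of a scalar state, so the M step reduces to weighted least squares; all names and constants are hypothetical:

```python
import math
import random

rng = random.Random(0)

# Synthetic transitions from two unknown linear "functions" (stand-ins
# for the neural networks f in F_g), corrupted by Gaussian noise.
true_slopes = [2.0, -2.0]
data = []
for _ in range(400):
    s = rng.uniform(-1, 1)
    k = rng.randrange(2)
    data.append((s, true_slopes[k] * s + rng.gauss(0, 0.05)))

sigma2 = 0.05**2
theta = [1.0, -1.0]  # initial slope of each learned function
g = [0.5, 0.5]       # prior over functions, g(f)

for _ in range(30):
    # E step: posterior q(z_i = f | s_i, s'_i) under the Gaussian likelihood
    q = []
    for s, s_next in data:
        scores = [g[f] * math.exp(-0.5 * (s_next - theta[f] * s) ** 2 / sigma2)
                  for f in range(2)]
        total = sum(scores)
        q.append([sc / total for sc in scores])
    # M step: weighted least squares for each slope, then update the prior g
    for f in range(2):
        num = sum(qi[f] * s * s_next for qi, (s, s_next) in zip(q, data))
        den = sum(qi[f] * s * s for qi, (s, s_next) in zip(q, data))
        theta[f] = num / den
    g = [sum(qi[f] for qi in q) / len(q) for f in range(2)]

# The learned slopes should recover the two generating functions.
assert abs(sorted(theta)[0] + 2.0) < 0.1
assert abs(sorted(theta)[1] - 2.0) < 0.1
```

The structure mirrors the derivation exactly: the E step normalizes the per-function likelihoods weighted by g, and the M step maximizes the resulting lower bound separately for each θ_f.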
I first used a supervised-learning domain to evaluate the EM algorithm in a simple setting. I
generated 30 points from the following 5 functions:
f_0(x) = tanh(x) + 3
f_1(x) = x · x
f_2(x) = sin(x) − 5
f_3(x) = sin(x) − 3
f_4(x) = sin(x) · sin(x) ,
and trained 5 neural networks to fit these points. The iterations of a single run are shown in Figure 4.6,
and a summary of results is presented in Figure 4.7. Observe that the EM algorithm is effective,
and that controlling the Lipschitz constant is again useful.
Figure 4.6: A stochastic problem solved by training a Lipschitz model class using EM. The top left
figure shows the functions before any training (iteration 0), and the bottom right figure shows the
final results (iteration 50).
I next applied EM to train a transition model for an RL setting, namely the gridworld domain from
Moerland et al. [74]. Here, a useful model needs to capture the stochastic behavior of the two ghosts.
I modify the reward to be −1 whenever the agent is in the same cell as either one of the ghosts and
0 otherwise. I performed environmental interactions for 1000 time-steps and measured the return. I
compared against standard tabular methods [103], and against a deterministic model that predicts the
expected next state [106, 82]. In all cases I used value iteration for planning.
Results in Figure 4.8 show that tabular models fail because they do not generalize, and expected
models fail because the ghosts do not move in expectation, a prediction that is not useful for the
planner. Performing value iteration with a Lipschitz model class outperforms the baselines.
4.7 Conclusion
In this chapter, we took an important step towards understanding the effects of smoothness in
model-based RL. I showed that Lipschitz continuity of an estimated model plays a central role in
multi-step prediction error, and in value-estimation error. I also showed the benefits of employing
Wasserstein for model-based RL.
Figure 4.7: Impact of controlling the Lipschitz constant in the supervised-learning domain. Notice
the U-shape of the final Wasserstein loss with respect to the Lipschitz constant k.
Figure 4.8: Performance of a Lipschitz model class on the gridworld domain. I show model test
accuracy (left) and the quality of the policy found using the model (right). Notice the poor
performance of the tabular and expected models.
Chapter 5
Smoothness in Continuous Control
In this chapter I focus on RL in continuous action spaces. As I showed in the background chapter,
a core operation in RL is finding an action that is optimal with respect to the Q function. This
operation, max_{a∈A} Q(s, a), is challenging when A is a continuous space.
I introduce deep RBF value functions: Q functions learned using a deep neural network with a radial-
basis-function (RBF) output layer. I show that the optimal action with respect to a deep RBF Q
function can be easily approximated up to any desired accuracy. Moreover, deep RBF Q functions
can represent any true value function up to any desired accuracy, owing to their support for universal
function approximation. I show the theoretical and practical benefits of controlling the smoothness
of deep RBF Q functions. By learning a deep RBF value function, I extend the standard DQN
algorithm to continuous control, and demonstrate that the resultant agent, RBF–DQN, outperforms
standard baselines on a set of continuous-action RL problems.
5.1 Introduction
In RL the Q function quantifies the expected return for taking action a in state s. Many RL
algorithms, notably Q-learning, learn an approximation of the Q function from environmental in-
teractions. When using function approximation with Q-learning, the agent has a parameterized
function class, and learning consists of finding a parameter setting θ for the approximate value func-
tion Q(s, a; θ) that accurately represents the true Q function.
A core operation in many RL algorithms is finding an optimal action with respect to the value
function, arg maxa∈A Q(s, a; θ), or finding the highest action-value maxa∈A Q(s, a; θ). The need for
performing these operations arises not just when computing a behavior policy for action selection,
but also when learning Q itself using bootstrapping techniques [104].
The optimization problem maxa∈A Q(s, a; θ) is generally challenging if A is continuous, in contrast
to the discrete case where the operation is trivial if the number of discrete actions is not enormous.
The challenge stems from the observation that the surface of the function f_s(a; θ) := Q(s, a; θ)
could have many local maxima and saddle points; therefore, naïve approaches such as finding the
maximum through gradient ascent can lead to inaccurate answers [94]. In light of this technical
challenge, recent work on solving continuous control problems has instead embraced policy-gradient
algorithms, which typically compute ∇aQ(s, a; θ), rather than solving maxa∈A Q(s, a; θ), and follow
the ascent direction to move an explicitly maintained policy towards actions with higher Q [98].
However, policy-gradient algorithms have their own weaknesses, particularly in settings with sparse
rewards where computing an accurate estimate of the gradient requires an unreasonable number of
environmental interactions [56, 72]. Rather than adopting a policy-gradient approach, I focus on
tackling the problem of efficiently computing maxa∈A Q(s, a; θ) for value-function-based RL.
Previous work on value-function-based algorithms for continuous control has shown the benefits of
using function classes that are conducive to efficient action maximization. For example, Gu et al.
[47] explored function classes that can capture an arbitrary dependence on the state, but only a
quadratic dependence on the action. Given a function class with a quadratic dependence on the
action, Gu et al. [47] showed how to compute maxa∈A Q(s, a; θ) quickly and in constant time. A
more general idea is to use input–convex neural networks [4] that restrict Q(s, a; θ) to functions that
are convex (or concave) with respect to the input a, so that for any fixed state s the optimization
problem maxa∈A Q(s, a; θ) can be solved efficiently using convex-optimization techniques [29]. These
solutions trade the expressiveness of the function class for easy action maximization.
While restricting the function class can enable easy maximization, it can be problematic if no
member of the restricted class has low approximation error relative to the true Q function [67].
More concretely, when the agent cannot possibly learn an accurate Q, the error |maxaQπ(s, a) −
maxa Q(s, a; θ)| could be significant even if the agent can solve maxa Q(s, a; θ) exactly. In the case of
input–convex neural networks, for example, high error can occur if f_s(a; θ) := Q(s, a; θ) is completely
non-convex. Thus, it is desirable to ensure that, for any true Q function, there exists a member of
the function class that approximates Q up to any desired accuracy. Such a function class is said
to be capable of universal function approximation (UFA) [51, 25, 48]. A function class that is both
conducive to efficient action maximization and also capable of UFA would be ideal.
In this chapter I introduce deep RBF value functions, which approximate Q with a standard deep
neural network equipped with an RBF output layer. I show that deep RBF value functions have the
two desired properties outlined above: First, using deep RBF value functions enables us to approximate
the optimal action up to any desired accuracy. Second, deep RBF value functions support
universal function approximation.
Prior work in RL used RBF networks for learning value functions in problems with discrete action
spaces (see Section 9.5.5 of Sutton & Barto [104] for a discussion). That said, to the best of my
knowledge, my discovery of the action-maximization property of RBF networks is novel, and there
has been no application of deep RBF networks to continuous control. I combine deep RBF networks
with DQN [73], a standard deep RL algorithm originally proposed for discrete actions, to produce
a new algorithm called RBF–DQN. I evaluate RBF–DQN on a large set of continuous-action RL
problems, and demonstrate its superior performance relative to standard deep-RL baselines.
5.2 Deep RBF Value Functions
Deep RBF value functions combine the practical advantages of deep networks [44] with the theoreti-
cal advantages of radial-basis functions (RBFs) [88]. A deep RBF network comprises a number
of arbitrary hidden layers, followed by an RBF output layer, defined next. The RBF output layer,
first introduced in a seminal paper by Broomhead & Lowe [32], is sometimes used as a standalone
single-layer function approximator, referred to as a (shallow) RBF network. We use an RBF network
as the final, or output, layer of a deep network.
For a given input a, the RBF layer f(a) is defined as:
\[
f(a) := \sum_{i=1}^{N} g(a - a_i)\, v_i\,, \tag{5.1}
\]
where each ai represents a centroid location, vi is the value of the centroid ai, N is the number of
centroids, and g is an RBF. A commonly used RBF is the negative exponential:
\[
g(a - a_i) := e^{-\beta \| a - a_i \|}\,, \tag{5.2}
\]
equipped with a smoothing parameter β ≥ 0. (See Karayiannis [57] for a thorough treatment of
other RBFs.) Formulation (5.1) could be thought of as an interpolation based on the value and
the weights of all centroids, where the weight of each centroid is determined by its proximity to the
input. Proximity here is quantified by the RBF g, in this case the negative exponential (5.2).
As will be clear momentarily, it is theoretically useful to normalize centroid weights to ensure that
they sum to 1 so that f implements a weighted average. This weighted average is sometimes referred
to as a normalized Gaussian RBF layer [76, 34]:
\[
f_\beta(a) := \frac{\sum_{i=1}^{N} e^{-\beta \| a - a_i \|}\, v_i}{\sum_{i=1}^{N} e^{-\beta \| a - a_i \|}}\,. \tag{5.3}
\]
As the smoothing parameter β → ∞, the function implements a winner-take-all case where the value
of the function at a given input is determined only by the value of the closest centroid location,
nearest-neighbor style. This limiting case is sometimes referred to as a Voronoi decomposition [12].
Conversely, as β approaches 0, f converges to the mean of the centroid values regardless of the
input a; that is, $\forall a\ \lim_{\beta \to 0} f_\beta(a) = \frac{\sum_{i=1}^{N} v_i}{N}$. Since an RBF layer is differentiable, it can be used in
conjunction with (stochastic) gradient descent and backprop to learn the centroid locations and their
values by optimizing a loss function. Note that formulation (5.3) differs from the Boltzmann softmax
operator [6, 100], where the weights are determined not by an RBF, but by the action values.
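A minimal sketch of the normalized Gaussian RBF layer (5.3) for a one-dimensional action, with hypothetical centroids and values (my illustration), confirming both limiting behaviors of β discussed above:

```python
import math

def rbf_layer(a, centroids, values, beta):
    """Normalized Gaussian RBF layer, Equation (5.3), for scalar actions."""
    weights = [math.exp(-beta * abs(a - ai)) for ai in centroids]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

centroids = [-1.0, 0.0, 2.0]
values = [3.0, -1.0, 5.0]

# beta -> infinity: winner-take-all (value of the nearest centroid)
assert abs(rbf_layer(1.9, centroids, values, beta=200.0) - 5.0) < 1e-6
# beta -> 0: the output approaches the mean of the centroid values
mean_v = sum(values) / len(values)
assert abs(rbf_layer(1.9, centroids, values, beta=1e-6) - mean_v) < 1e-3
```

In the deep RBF Q function below, the scalar centroids and values are replaced by the learned, state-dependent mappings a_i(s; θ) and v_i(s; θ).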
Finally, to represent the Q function for RL, we use the following formulation:
\[
Q_\beta(s, a; \theta) := \frac{\sum_{i=1}^{N} e^{-\beta \| a - a_i(s; \theta) \|}\, v_i(s; \theta)}{\sum_{i=1}^{N} e^{-\beta \| a - a_i(s; \theta) \|}}\,. \tag{5.4}
\]
A deep RBF Q function (5.4) internally learns two mappings: a state-dependent set of centroid
locations a_i(s; θ) and state-dependent centroid values v_i(s; θ). The role of the RBF output layer,
then, is to use these learned mappings to form the output of the entire deep RBF Q function. We
illustrate the architecture of a deep RBF Q function in Figure 5.1. In the experimental section, we
demonstrate how to learn the parameters θ.
I now show that deep RBF Q functions have the first desired property for value-function-based RL,
namely that they enable easy action maximization.
In light of the RBF formulation, it is easy to find the value of the deep RBF Q function at each
centroid location a_i, that is, to compute Q_β(s, a_i; θ). Note that Q_β(s, a_i; θ) ≠ v_i(s; θ) in general for
a finite β, because the other centroids a_j, j ∈ {1, ..., N} \ {i}, may have non-zero weights at a_i. In
other words, the action-value function at a centroid a_i can in general differ from the centroid's value
v_i. Therefore, to compute Q_β(s, a_i; θ), we query the centroid location a_i(s; θ), then evaluate (5.4)
at a_i. Once we have Q_β(s, a_i; θ) for every a_i, we can trivially find the highest-valued centroid, or
its corresponding value:
\[
\max_{i \in [1, N]} Q_\beta(s, a_i; \theta)\,.
\]
Figure 5.1: The architecture of a deep RBF value (Q) function. A deep RBF Q function could
be thought of as an RBF output layer added to an otherwise standard deep state-action value (Q)
function. In this sense, any kind of standard layer (dense, convolutional, recurrent, etc.) could be
used as a hidden layer. All operations of the final RBF layer are differentiable, and therefore, the
parameters of the hidden layers θ, which represent the mappings a_i(s; θ) and v_i(s; θ), can be learned
using standard gradient-based optimization techniques.
While in general there may be a gap between the global maximum max_{a∈A} Q_β(s, a; θ) and its
easy-to-compute approximation max_{i∈[1,N]} Q_β(s, a_i; θ), the following theorem shows that this gap
is zero in one-dimensional action spaces. More importantly, Theorem 5.2.1 guarantees that in action
spaces with an arbitrary number of dimensions, the gap shrinks exponentially as the smoothing
parameter β increases, allowing us to reduce the gap very quickly and up to any desired accuracy
by simply increasing β.
Theorem 5.2.1. Let Q_β be a member of the class of normalized Gaussian RBF value functions.
For a one-dimensional action space A = R:
\[
\max_{a \in A} Q_\beta(s, a; \theta) = \max_{i \in [1, N]} Q_\beta(s, a_i; \theta)\,.
\]
For A = R^d, ∀d ≥ 1:
\[
0 \le \max_{a \in A} Q_\beta(s, a; \theta) - \max_{i \in [1, N]} Q_\beta(s, a_i; \theta) \le O(e^{-\beta})\,.
\]
Proof. I begin by proving the first result. For an arbitrary action a, we can write:
\[
Q_\beta(s, a; \theta) = w_1 v_1(s; \theta) + \ldots + w_N v_N(s; \theta)\,,
\]
where each weight w_i is determined via softmax. Without loss of generality, we sort all centroids so
that a_i < a_{i+1} for all i. Take two neighboring centroids a_L and a_R with a ∈ [a_L, a_R], and notice
that:
\[
\forall i < L: \quad \frac{w_L}{w_i} = \frac{e^{-\beta|a - a_L|}}{e^{-\beta|a - a_i|}} = \frac{e^{-\beta(a - a_L)}}{e^{-\beta(a - a_i)}} = e^{\beta(a_L - a_i)} \overset{\text{def}}{=} \frac{1}{c_i} \implies w_i = w_L c_i\,.
\]
In the above, I used the fact that all such a_i are to the left of a and a_L. Similarly, I can argue that
∀i > R: w_i = w_R c_i. Intuitively, as long as the action is between a_L and a_R, the ratio of the weight
of a centroid to the left of a_L, over the weight of a_L itself, remains constant and does not change
with a. The same holds for the centroids to the right of a_R. In light of the above result, by renaming
some variables we can now write:
\begin{align*}
Q_\beta(s, a; \theta) &= w_1 v_1(s; \theta) + \ldots + w_L v_L(s; \theta) + w_R v_R(s; \theta) + \ldots + w_K v_K(s; \theta) \\
&= w_L c_1 v_1(s; \theta) + \ldots + w_L v_L(s; \theta) + w_R v_R(s; \theta) + \ldots + w_R c_K v_K(s; \theta) \\
&= w_L \big(c_1 v_1(s; \theta) + \ldots + v_L(s; \theta)\big) + w_R \big(v_R(s; \theta) + \ldots + c_K v_K(s; \theta)\big)\,.
\end{align*}
Moreover, note that the weights need to sum to 1:
\[
w_L (c_1 + \ldots + 1) + w_R (1 + \ldots + c_K) = 1\,,
\]
and w_L is at its peak when we choose a = a_L and at its smallest value when we choose a = a_R. The
converse is true of w_R. Moreover, the weights increase and decrease monotonically as we move the
input a. We call the endpoints of this range w_min and w_max. As such, the problem
max_{a∈[a_L,a_R]} Q_β(s, a; θ) can be written as this linear program:
\begin{align*}
\max_{w_L, w_R} \quad & w_L \big(c_1 v_1(s; \theta) + \ldots + v_L(s; \theta)\big) + w_R \big(v_R(s; \theta) + \ldots + c_K v_K(s; \theta)\big) \\
\text{s.t.} \quad & w_L (c_1 + \ldots + 1) + w_R (1 + \ldots + c_K) = 1 \\
& w_L, w_R \ge w_{\min} \\
& w_L, w_R \le w_{\max}
\end{align*}
A standard result in linear programming is that every linear program has an extreme point that
is an optimal solution [29]. Therefore, at least one of the points (w_L = w_min, w_R = w_max) or
(w_L = w_max, w_R = w_min) is an optimal solution. In light of the monotonicity property, it is easy
to see that there is a one-to-one mapping between a and (w_L, w_R). As a result, the first point
corresponds to the unique value a = a_R(s), and the second corresponds to the unique value a = a_L(s).
Since no point between two neighboring centroids can have a higher value than both surrounding
centroids, at least one of the centroids is a globally optimal solution in the range [a_1(s), a_N(s)],
that is,
\[
\max_{a \in [a_1(s; \theta),\, a_N(s; \theta)]} Q_\beta(s, a; \theta) = \max_{a_i} Q_\beta(s, a_i; \theta)\,.
\]
To finish the proof, I can show that ∀a < a_1, Q_β(s, a; θ) = Q_β(s, a_1; θ); the proof that ∀a > a_N,
Q_β(s, a; θ) = Q_β(s, a_N; θ) follows similar steps. Writing a = a_1 − c for some c > 0:
\begin{align*}
\forall a < a_1 \quad Q_\beta(s, a; \theta) &= \frac{\sum_{i=1}^{N} e^{-\beta|a - a_i(s)|}\, v_i(s)}{\sum_{i=1}^{N} e^{-\beta|a - a_i(s)|}}
= \frac{\sum_{i=1}^{N} e^{-\beta|a_1 - c - a_i(s)|}\, v_i(s)}{\sum_{i=1}^{N} e^{-\beta|a_1 - c - a_i(s)|}} \\
&= \frac{\sum_{i=1}^{N} e^{\beta(a_1 - c - a_i(s))}\, v_i(s)}{\sum_{i=1}^{N} e^{\beta(a_1 - c - a_i(s))}}
= \frac{e^{-\beta c} \sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}\, v_i(s)}{e^{-\beta c} \sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}} \\
&= \frac{\sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}\, v_i(s)}{\sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}} = Q_\beta(s, a_1; \theta)\,,
\end{align*}
which concludes the proof of the first part.
I now move to the more general case with A = R^d. Let a_max denote the centroid with the highest
value; without loss of generality, assume it is the first centroid, that is, a_max = a_1 and
v_max(s; θ) = v_1(s; θ). We can write:
\begin{align*}
\max_{a} Q_\beta(s, a; \theta) - \max_{i \in \{1:N\}} Q_\beta(s, a_i; \theta)
&\le v_{\max}(s; \theta) - \max_{i \in \{1:N\}} Q_\beta(s, a_i; \theta) \\
&\le v_{\max}(s; \theta) - Q_\beta(s, a_{\max}; \theta)\,.
\end{align*}
Letting ∆_q denote an upper bound on v_1(s) − v_i(s), we can bound the right-hand side and conclude
the proof. Note that a related result was shown recently [100]:
\begin{align*}
v_{\max}(s) - Q_\beta(s, a_{\max}; \theta)
&= v_1(s) - \frac{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|}\, v_i(s)}{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&= \frac{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|} \big(v_1(s) - v_i(s)\big)}{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&= \frac{\sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|} \big(v_1(s) - v_i(s)\big)}{1 + \sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&\le \Delta_q\, \frac{\sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|}}{1 + \sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&\le \Delta_q \sum_{i=2}^{N} \frac{e^{-\beta \| a_1 - a_i(s) \|}}{1 + e^{-\beta \| a_1 - a_i(s) \|}}
= \Delta_q \sum_{i=2}^{N} \frac{1}{1 + e^{\beta \| a_1 - a_i(s) \|}} = O(e^{-\beta})\,.
\end{align*}
Figure 5.2 shows an example of the output of an RBF Q function where there exists a gap between
maxa∈A Qβ(s, a; θ) and maxi∈[1,N ] Qβ(s, ai; θ) for small values of β. Note also that, consistent with
Theorem 5.2.1, we can quickly decrease this gap by increasing the value of β.
(a) β = 0.25 (b) β = 1
(c) β = 1.5 (d) β = 2
Figure 5.2: An RBF Q function with 3 fixed centroid locations and centroid values (shown as black
dots), but different settings of the smoothing parameter β, on a 2-dimensional action space. Green
regions highlight the set of actions a for which Q_β(s, a; θ) is extremely close to the global maximum,
or more formally the set $\{a \in A \mid \max_{a \in A} Q_\beta(s, a; \theta) - Q_\beta(s, a; \theta) < 0.02\}$. Observe the fast reduction
of the gap between max_{a∈A} Q_β(s, a; θ) and max_{i∈[1,N]} Q_β(s, a_i; θ) as β increases, as guaranteed by
Theorem 5.2.1. Specifically, max_{a∈A} Q_β(s, a; θ) − max_{i∈[1,N]} Q_β(s, a_i; θ) < 0.02 for any β ≥ 1.5.
Also, observe that the function becomes less smooth as we increase β.
In light of the above theoretical result, to approximate maxa∈A Qβ(s, a; θ) we simply compute maxi∈[1,N] Qβ(s, ai; θ). If the approximation needs to be more accurate, one can always increase the smoothing parameter β to quickly reach the desired accuracy.
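As a numerical sanity check, the following sketch compares the maximum of Qβ over a dense action grid with its maximum over the centroids, and verifies the bound Δq · Σ_{i≥2} 1/(1 + e^{β‖a1−ai‖}) from the proof of Theorem 5.2.1. The centroid locations and values are made up, in the spirit of Figure 5.2:

```python
import numpy as np

def q_beta(A, centroids, values, beta):
    # Normalized RBF value function over a 2-D action space.
    # A: (M, 2) query actions; centroids: (N, 2); values: (N,).
    d = np.linalg.norm(A[:, None, :] - centroids[None, :, :], axis=-1)
    w = np.exp(-beta * d)
    return (w * values[None, :]).sum(axis=1) / w.sum(axis=1)

centroids = np.array([[-1.0, 0.5], [0.8, -0.3], [0.2, 1.2]])
values = np.array([1.0, 0.4, -0.5])

# Dense grid over the action space as a stand-in for the exact maximization.
g = np.linspace(-2.0, 2.0, 201)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

delta_q = values.max() - values.min()
a1 = centroids[values.argmax()]                  # centroid with the highest value
dist = np.linalg.norm(a1 - centroids, axis=1)

gaps, bounds = [], []
for beta in [0.25, 1.0, 2.0, 4.0]:
    gap = q_beta(grid, centroids, values, beta).max() \
        - q_beta(centroids, centroids, values, beta).max()
    # Theorem 5.2.1's bound: Delta_q * sum_{i>=2} 1 / (1 + e^{beta ||a_1 - a_i||}).
    bound = delta_q * np.sum(1.0 / (1.0 + np.exp(beta * dist[dist > 0])))
    gaps.append(gap)
    bounds.append(bound)
print([round(g_, 4) for g_ in gaps])
```

For every β the observed gap stays below the theorem's bound, and for large β both shrink at the rate O(e^{-β}).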
Notice that this result holds for normalized Gaussian RBF networks (hence adding the normalization
step), but not necessarily for the unnormalized case or for other types of RBFs. I believe that this
observation is an interesting result in and of itself, regardless of its connection to value-function-
based RL.
I now move to the second desired property of RBF Q networks, namely that these networks are in
fact capable of universal function approximation (UFA).
Theorem 5.2.2. Consider any state–action value function Qπ(s, a) defined on a closed action space A. Assume that Qπ(s, a) is a continuous function. For a fixed state s and for any ε > 0, there exists a deep RBF value function Qβ(s, a; θ) and a setting of the smoothing parameter β0 for which:
\[
\forall a \in A\ \ \forall \beta \ge \beta_0 \qquad |Q^\pi(s,a) - Q_\beta(s,a;\theta)| \le \varepsilon\,.
\]
Proof. Since Qπ(s, ·) is continuous on the closed action space A, I further assume it is Lipschitz with some constant L; writing f(·) := Qπ(s, ·):
\[
\forall a_0, a_1 \quad |f(a_1) - f(a_0)| \le L\, \|a_1 - a_0\|\,.
\]
As such, whenever ‖a1 − a0‖ ≤ ε/(4L), we have that
\[
|f(a_1) - f(a_0)| \le \frac{\varepsilon}{4}\,. \tag{5.5}
\]
Consider a set of centroids {c1, c2, ..., cN} and define cell(j) as:
\[
\mathrm{cell}(j) = \{a \in A \mid \|a - c_j\| = \min_z \|a - c_z\|\}\,,
\]
and the radius Rad(j, A) as:
\[
\mathrm{Rad}(j, A) := \sup_{x \in \mathrm{cell}(j)} \|x - c_j\|\,.
\]
Since A is closed and bounded, there always exists a finite set of centroids {c1, c2, ..., cN} for which Rad(j, A) ≤ ε/(4L) for every j. Now consider the following functional form:
\[
Q_\beta(s,a) := \sum_{j=1}^{N} Q^\pi(s, c_j)\, w_j\,, \qquad \text{where } w_j = \frac{e^{-\beta\|a - c_j\|}}{\sum_{z=1}^{N} e^{-\beta\|a - c_z\|}}\,.
\]
Now suppose a lies in a subset of cells, called the central cells C:
\[
C := \{j \mid a \in \mathrm{cell}(j)\}\,.
\]
We define a second set of neighboring cells:
\[
\mathcal{N} := \big\{j \mid \mathrm{cell}(j) \cap \big(\textstyle\bigcup_{i\in C} \mathrm{cell}(i)\big) \ne \emptyset\big\} \setminus C\,,
\]
and a third set of far cells:
\[
F := \{j \mid j \notin C \ \&\ j \notin \mathcal{N}\}\,.
\]
We now have:
\begin{align*}
|Q^\pi(s,a) - Q_\beta(s,a;\theta)|
&= \Big|\sum_{j=1}^{N} \big(Q^\pi(s,a) - Q^\pi(s,c_j)\big)\, w_j\Big| \\
&\le \sum_{j=1}^{N} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j \\
&= \sum_{j\in C} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j
 + \sum_{j\in \mathcal{N}} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j
 + \sum_{j\in F} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j\,.
\end{align*}
We now bound each of the three sums above. Starting with the first sum, it is easy to see that |Qπ(s, a) − Qπ(s, cj)| ≤ ε/4, simply because a ∈ cell(j) and hence ‖a − cj‖ ≤ Rad(j, A) ≤ ε/(4L). As for the second sum, since cj is the centroid of a neighboring cell, using a central cell i we can write:
\[
\|a - c_j\| = \|a - c_i + c_i - c_j\| \le \|a - c_i\| + \|c_i - c_j\| \le \frac{\varepsilon}{4L} + \frac{\varepsilon}{4L} = \frac{\varepsilon}{2L}\,,
\]
and so in this case |Qπ(s, a) − Qπ(s, cj)| ≤ ε/2. In the third case, with the set of far cells F, observe that for a far cell j and a central cell i we have:
\[
\frac{w_j}{w_i} = \frac{e^{-\beta\|a - c_j\|}}{e^{-\beta\|a - c_i\|}} \;\Rightarrow\; w_j = w_i\, e^{-\beta(\|a - c_j\| - \|a - c_i\|)} \le w_i\, e^{-\beta\mu} \le e^{-\beta\mu}\,,
\]
for some μ > 0. Here I used the fact that ‖a − cj‖ − ‖a − ci‖ > 0 always holds for a far cell j.
Putting it all together, we have:
\begin{align*}
|Q^\pi(s,a) - Q_\beta(s,a)|
&= \sum_{j\in C} \underbrace{\big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|}_{\le\, \varepsilon/4}\, \underbrace{w_j}_{\le 1}
 + \sum_{j\in \mathcal{N}} \underbrace{\big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|}_{\le\, \varepsilon/2}\, \underbrace{w_j}_{\le 1}
 + \sum_{j\in F} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, \underbrace{w_j}_{\le\, e^{-\beta\mu}} \\
&\le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} + \sum_{j\in F} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, e^{-\beta\mu} \\
&\le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} + 2N \sup_a |Q^\pi(s,a)|\, e^{-\beta\mu}\,.
\end{align*}
In order to have 2N supa |Qπ(s, a)| e−βμ ≤ ε/4, it suffices to choose
\[
\beta \;\ge\; -\frac{1}{\mu}\,\log\!\Big(\frac{\varepsilon}{8N \sup_a |Q^\pi(s,a)|}\Big) \;=:\; \beta_0\,.
\]
To conclude the proof:
|Qπ(s, a)− Qβ(s, a; θ)| ≤ ε ∀ β ≥ β0 .
For a similar proof, see [25].
Collectively, Theorems 5.2.1 and 5.2.2 guarantee that deep RBF Q functions preserve the desired
UFA property while ensuring accurate and efficient action maximization. This combination of prop-
erties stands in contrast with prior work that used function classes that enable easy action maxi-
mization but lack the UFA property [47, 4], as well as prior work that preserved the UFA property
but did not guarantee arbitrarily low accuracy when performing the maximization step [67, 94]. The
only important assumption in Theorem 5.2.2 is that the true value function Qπ(s, a) is continuous,
which is a standard assumption in the UFA literature [51] and in RL [8].
I note that, while using a large value of β makes it theoretically possible to approximate any function up to any desired accuracy, there is a downside to using large β values. Specifically, very large values of β result in extremely local approximations, which ultimately increases sample complexity because experience is not generalized from centroid to centroid. The bias–variance tension between large β values, which allow for greater accuracy, and smaller β values, which reduce sample complexity, makes intermediate values of β work best. This property could be examined formally through the lens of regularization [18].
As for scalability to large action spaces, note that the RBF formulation scales naturally owing to its
freedom to come up with centroids that best minimize the loss function. As a thought experiment,
suppose that some region of the action space has a high value, so an agent with greedy action
selection frequently chooses actions from that region. The deep RBF Q function would then move
more centroids to the region, because the region heavily contributes to the loss function. It is
unnecessary, then, to initialize centroid locations carefully, or to uniformly cover the action space
a priori. In our RL experiments in Section 5.4, we achieved reasonable results with the number of
centroids fixed across every problem, indicating that we need not rapidly increase the number of
centroids as the action dimension increases.
5.3 Experiments: Continuous Optimization
To demonstrate the operation of an RBF network in the simplest and clearest setting, I start with
a single-input continuous optimization problem, where the agent lacks access to the true reward
function but can sample input–output pairs 〈a, r〉. This setting is akin to the action maximization
step in RL for a single state or, stated differently, a continuous bandit problem. We are interested in
evaluating approaches that use tuples of experience 〈a, r〉 to learn the surface of the reward function,
and then optimize the learned function.
To this end, I chose the reward function:
\[
r(a) = \|a\|_2 \sin(a_0) + \sin^2(a_1)\,. \tag{5.6}
\]
Figure 5.3 (left) shows the surface of this function. It is clearly non-convex and includes several
local maxima (and minima). We are interested in two cases, first the problem where the goal is to
find maxa r(a), and the converse problem where we desire to find mina r(a).
Exploration is challenging in this setting [62]. Here, our focus is not to find the most effective exploration policy, but to evaluate different approaches on how effectively they represent and optimize a learned reward function r(a; θ). So, in the interest of fairness, we adopt the same random action-selection strategy for all approaches.
More concretely, we sampled 500 actions uniformly at random from A = [−3, 3]² and provided the
agent with the reward associated with the actions according to (5.6). We then used this dataset for
training. When learning ended, we computed the action that maximized (or minimized) the learned
r(a; θ). Details of the function classes used in each case, as well as how to perform maxa r(a; θ) and
mina r(a; θ) will now be presented below for each individual approach.
Figure 5.3: Left: Surface of the true reward function. Right: The surface learned by the RBF reward network in a sample run. Black dots represent the centroids.
For our first baseline, I discretized each action dimension into 7 bins, resulting in 49 cells that uniformly covered the two dimensions of the input space. For each cell, we averaged the rewards over ⟨a, r⟩ pairs whose sampled action a belonged to that cell. Once we had a learned r(a; θ), which in this case was just a 7 × 7 table, I performed maxa r(a; θ) and mina r(a; θ) by a simple table lookup. Discretization clearly fails to scale to problems with higher dimensionality, and I have included this baseline solely for completeness.
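The steps above can be sketched as follows. Note that the reward function used here is my reconstruction of (5.6), and the random seed is arbitrary, so both are assumptions rather than the thesis's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Assumed reconstruction of Eq. (5.6): r(a) = ||a||_2 sin(a_0) + sin^2(a_1).
    a = np.atleast_2d(a)
    return np.linalg.norm(a, axis=1) * np.sin(a[:, 0]) + np.sin(a[:, 1]) ** 2

# 500 actions sampled uniformly at random from A = [-3, 3]^2.
actions = rng.uniform(-3.0, 3.0, size=(500, 2))
rewards = reward(actions)

# 7 bins per dimension -> 49 cells; average the observed rewards per cell.
edges = np.linspace(-3.0, 3.0, 8)
ix = np.clip(np.digitize(actions[:, 0], edges) - 1, 0, 6)
iy = np.clip(np.digitize(actions[:, 1], edges) - 1, 0, 6)
table = np.full((7, 7), np.nan)
for i in range(7):
    for j in range(7):
        mask = (ix == i) & (iy == j)
        if mask.any():
            table[i, j] = rewards[mask].mean()

# Maximization is a simple table lookup; report the cell center.
best = np.unravel_index(np.nanargmax(table), table.shape)
centers = (edges[:-1] + edges[1:]) / 2
a_max = np.array([centers[best[0]], centers[best[1]]])
print(a_max, reward(a_max)[0])
```

The minimization case is identical with `np.nanargmin`.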
Our second baseline used the input-convex neural network architecture [4], where the neural network is constrained so that the learned reward function r(a; θ) is convex. Learning was performed by RMSProp optimization [44] with a mean-squared loss. Once r(a; θ) was learned, I used gradient ascent for finding the maximum, and gradient descent for finding the minimum. Note that this input-convex approach subsumes the quadratic case proposed by Gu et al. [47], because quadratic functions are just a special case of convex functions, but the converse is not necessarily true [29].
Our next baseline was the wire-fitting method proposed by Baird & Klopf [15]. This method is similar to RBF networks in that it also learns a set of centroids. As in the previous case, I used the RMSProp optimizer and a mean-squared loss, and finally returned the centroid with the lowest (or highest) value according to the learned r(a; θ).
Figure 5.4: Mean and standard deviation of performance for various methods on the continuous optimization task. Here, r(a) denotes the true reward associated with the action a found by each method. Top: Results for action minimization (lower is better). Bottom: Results for action maximization (higher is better). Results are averaged over 30 independent runs. The RBF architecture outperforms the alternatives in both cases.
As the last baseline, I used a standard feed-forward neural network architecture with two hidden layers to learn r(a; θ). It is well-known that this function class is capable of UFA [51] and so can accurately learn the reward function in principle. However, once learning ends, we face a non-convex optimization problem when computing maxa r(a; θ) (or the corresponding minimum). We simply initialized gradient ascent (descent) at a point chosen uniformly at random, and followed the corresponding direction until convergence.
To learn an RBF reward function, I used N = 50 centroids and β = 0.5. I again used RMSProp and mean-squared loss minimization. Recall that Theorem 5.2.1 showed that with an RBF network the following approximations are well-justified in theory: maxa∈A r(a) ≈ maxi∈[1,50] r(ai) and mina∈A r(a) ≈ mini∈[1,50] r(ai). As such, when the learning of r(a; θ) ends, I output the centroid values with the highest and lowest reward.
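A simplified sketch of this RBF reward learner follows. For brevity it fixes random centroid locations and fits only the centroid values, by least squares, rather than training both with RMSProp as the thesis does; the reward function is again my assumed reconstruction of (5.6):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Assumed reconstruction of Eq. (5.6): r(a) = ||a||_2 sin(a_0) + sin^2(a_1).
    a = np.atleast_2d(a)
    return np.linalg.norm(a, axis=1) * np.sin(a[:, 0]) + np.sin(a[:, 1]) ** 2

actions = rng.uniform(-3.0, 3.0, size=(500, 2))
rewards = reward(actions)

# N = 50 centroids and beta = 0.5, as in the text.
N, beta = 50, 0.5
centroids = rng.uniform(-3.0, 3.0, size=(N, 2))
d = np.linalg.norm(actions[:, None, :] - centroids[None, :, :], axis=-1)
w = np.exp(-beta * d)
w /= w.sum(axis=1, keepdims=True)      # normalized RBF features, shape (500, 50)
values, *_ = np.linalg.lstsq(w, rewards, rcond=None)

# Per Theorem 5.2.1, the max/min over A are approximated at the centroids.
a_max = centroids[np.argmax(values)]
a_min = centroids[np.argmin(values)]
print(reward(a_max)[0], reward(a_min)[0])
```

Even this stripped-down variant recovers a high-reward and a low-reward centroid from the same 500 samples used by the baselines.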
For each individual case, I ran the corresponding experimental pipeline with 30 different random seeds. The solution found by each learner was fed to the true reward function (5.6) to obtain the true quality of that solution. I report the average reward achieved by each function class in Figure 5.4. The RBF learner outperforms all baselines on both the maximization and the minimization problem. We further show the function learned by a sample run of RBF on the right side of Figure 5.3, which is an almost perfect approximation of the true reward function.
5.4 Experiments: Continuous Control
We now use deep Q functions for solving continuous-action RL problems. To this end, we learn a
deep RBF Q function using a learning algorithm similar to that of DQN [73], but extended to the
continuous-action case. DQN uses the following loss function for learning a deep state–action value
function:
\[
L(\theta) := \mathbb{E}_{s,a,r,s'}\Big[\big(r + \gamma \max_{a' \in A} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]\,.
\]
DQN adds tuples of experience ⟨s, a, r, s′⟩ to a buffer, and later samples a minibatch of tuples to compute ∇θL(θ). DQN maintains a second network parameterized by weights θ−. This second network, denoted Q(·, ·; θ−) and referred to as the target network, is periodically synchronized with the online network Q(·, ·; θ).
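The loss above can be sketched for a generic candidate action set over which the max is taken (the whole finite action set in DQN, the centroids in RBF–DQN). The linear Q function in the usage example is purely illustrative:

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    # Mean-squared TD loss of DQN for one minibatch. `candidates` is the
    # finite set over which the max is taken.
    s, a, r, s_next, done, candidates = batch
    q_next = np.stack([q_target(s_next, c) for c in candidates])  # (|A|, B)
    target = r + gamma * (1.0 - done) * q_next.max(axis=0)
    return np.mean((target - q_online(s, a)) ** 2)

# Toy check with a hypothetical linear value function Q(s, a) = s + a.
q = lambda s, a: s + a
batch = (np.array([0.0, 1.0]),   # s
         np.array([1.0, 0.0]),   # a
         np.array([1.0, 1.0]),   # r
         np.array([1.0, 2.0]),   # s'
         np.array([0.0, 1.0]),   # done flags
         [0.0, 1.0])             # candidate actions a'
print(dqn_loss(q, q, batch))     # ((1 + 0.99*2 - 1)^2 + (1 - 1)^2) / 2 = 1.9602
```

Note that the `done` flag zeroes out the bootstrap term, exactly as in DQN.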
RBF–DQN uses the same loss function, but modifies the function class of DQN. Concretely, DQN learns a deep network that outputs one scalar action value per action, exploiting the discrete and finite nature of the action space. By contrast, RBF–DQN takes a state vector and an action vector as input, and outputs a single scalar using a deep RBF Q function. Note that every operation in a deep RBF Q function is differentiable, so the gradient of the loss function with respect to the parameters θ can be computed using standard deep-learning libraries. Specifically, we used the TensorFlow library [1] for Python, with Keras [35] as its interface.
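A minimal, framework-free sketch of the forward pass is below. The thesis uses a deep TensorFlow/Keras network; here a single linear map stands in for the state torso, so all weight shapes and values are placeholders:

```python
import numpy as np

def rbf_q_forward(state_feats, action, params, beta):
    # Deep RBF Q function forward pass (sketch): the state is mapped to
    # N centroid locations a_i(s; theta) and N centroid values v_i(s; theta);
    # the output is the normalized RBF combination at the query action.
    W_a, W_v = params                                # (N*m, S) and (N, S)
    N, m = W_v.shape[0], action.shape[0]
    centroids = (W_a @ state_feats).reshape(N, m)    # a_i(s; theta)
    values = W_v @ state_feats                       # v_i(s; theta)
    w = np.exp(-beta * np.linalg.norm(action - centroids, axis=1))
    w /= w.sum()                                     # normalization step
    return float(w @ values), centroids, values

rng = np.random.default_rng(0)
S, N, m, beta = 4, 5, 2, 1.0                         # toy sizes
params = (rng.normal(size=(N * m, S)), rng.normal(size=(N, S)))
s = rng.normal(size=S)
q, centroids, values = rbf_q_forward(s, np.zeros(m), params, beta)
print(q)
```

Because the output is a convex combination of the centroid values, the scalar q always lies between the smallest and largest v_i(s; θ), the property exploited by Theorem 5.2.1.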
In terms of action selection, with probability ε, DQN chooses a random action, and with probability 1 − ε it chooses an action with the highest value. The value of ε is annealed so that the agent becomes more greedy as learning proceeds. To define an analog of this so-called ε-greedy policy for RBF–DQN, I sample an action uniformly at random with probability ε, and take arg maxi∈[1,N] Qβ(s, ai; θ) with probability 1 − ε. I annealed the ε parameter, similar to DQN.
Additionally, I made a minor change to the original DQN algorithm in terms of updating θ−, the weights of the target network. Concretely, I update θ− using an exponential moving average of all the previous θ values, as suggested by Lillicrap et al. [66]: θ− ← (1 − αθ−)θ− + αθ− θ, which differs from the occasional periodic updates θ− ← θ of the original DQN agent. I observed a significant performance increase with this simple modification.
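This update is a one-liner applied layer-wise; the sketch below uses toy weights and an arbitrary step size purely for illustration:

```python
import numpy as np

def update_target(theta_target, theta_online, alpha):
    # Exponential moving average of the online weights, per layer:
    # theta^- <- (1 - alpha) * theta^- + alpha * theta
    return [(1.0 - alpha) * t + alpha * o
            for t, o in zip(theta_target, theta_online)]

# Toy weights: the target network drifts toward the online network.
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(3):
    target = update_target(target, online, alpha=0.5)
print(target[0])  # every entry equals 1 - 0.5**3 = 0.875
```

With a small α (the thesis anneals nothing here; α is a fixed hyperparameter), the target lags smoothly behind the online network instead of jumping at periodic synchronizations.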
For completeness, I provide pseudocode for RBF–DQN in Algorithm 3, and have released an open repository1.
1You can find the repository github.com/kavosh8/RBFDQN.
I compared RBF–DQN's performance to other deep-RL baselines on a large set of standard continuous-action RL domains from Gym [31]. These domains range from simple tasks such as Inverted Pendulum, with a one-dimensional action space, to more complicated domains such as Ant, with a 7-dimensional action space. I used the same number of centroids, N = 30, for learning the deep RBF Q function in all domains. I found the performance of RBF–DQN to be most sensitive to two hyperparameters, namely RMSProp's learning rate αθ and the RBF smoothing parameter β. We tuned these two parameters via grid search [44] for each individual domain, while all other hyperparameters were fixed across domains.
For a meaningful comparison, I performed roughly similar numbers of gradient-based updates for
RBF–DQN and the baselines. Specifically, in all domains, I performed M = 100 updates per episode
on RBF–DQN’s network parameters θ. I used the same number of updates per episode for other
value-function-based baselines, such as input-convex neural networks [4]. Moreover, in the case of
policy-gradient baseline DDPG [66], I performed 100 value-function updates and 100 policy updates
per episode. This number of updates gave reasonable results in terms of data efficiency, and it also
helped run all experiments on modern CPUs.
In choosing baselines, my main goal was to compare RBF–DQN to other value-function-based deep-RL baselines that explicitly perform the maximization step. Thus I did not perform comparisons with an exhaustive set of existing policy-gradient methods from the literature, since they work fundamentally differently from RBF–DQN and circumvent the action-maximization step. That said, deep deterministic policy gradient (DDPG) [66] and its more advanced variant, TD3 [43], are two very common baselines in continuous control, so I included them for completeness.
Moreover, in light of recent concerns about reproducibility in RL [49], I ran each algorithm with 10 fixed random seeds and report average performance. Other than the input-convex neural network baseline [4], for which the authors released TensorFlow code, we implemented RBF–DQN and all other baselines ourselves in TensorFlow. This choice reflected a concern that comparing results across different deep-learning libraries is extremely difficult.
It is clear from Figure 5.5 that RBF–DQN is competitive with all baselines, both in terms of data efficiency and final performance.
5.5 Conclusion
I proposed, analyzed, and exhibited the strengths of deep RBF value functions in continuous control.
These value functions facilitate easy action maximization, support universal function approximation,
and scale to large continuous action spaces. Deep RBF value functions are thus an appealing choice
for value function approximation in continuous control. Controlling the smoothness parameter of
deep RBF value functions played a critical role in achieving state-of-the-art performance.
Algorithm 3 Pseudocode for RBF–DQN

Initialize: RBF network architecture with N centroids; RBF smoothing parameter β; online network parameters θ and target network parameters θ−; optimizer learning rate αθ; target-network learning rate αθ−; total training episodes E; minibatch size M; discount rate γ; replay buffer B; decay rate µ
for episode ∈ [1, E] do
  s ← env.reset(), done ← False, ε ← (1 + episode)−µ
  while done == False do
    a ← ε-greedy(Qβ(·, ·; θ), s, ε)
    s′, r, done ← env.step(s, a)
    add ⟨s, a, r, s′, done⟩ to B
    s ← s′
  end while
  for each of M minibatches ⟨s, a, r, s′, done⟩ sampled from B do
    ∆θm ← (r − Qβ(s, a; θ)) ∇θQβ(s, a; θ)
    if done == False then
      get centroids a′i(s′; θ−) ∀i ∈ [1, N]
      ∆θm += γ maxa′i Qβ(s′, a′i; θ−) ∇θQβ(s, a; θ)
    end if
  end for
  ∆θ ← (1/M) ∑m=1..M ∆θm
  θ ← RMSProp(θ, ∆θ, αθ)
  θ− ← (1 − αθ−)θ− + αθ− θ
end for

function ε-greedy(Qβ, s, ε):
  temp ∼ uniformly from [0, 1]
  if temp ≤ ε then
    a ∼ uniformly from A
    return a
  else
    get centroids ai(s; θ) ∀i ∈ [1, N]
    return arg maxai Qβ(s, ai; θ)
  end if
Figure 5.5: A comparison between RBF–DQN and baselines on continuous-action RL domains. The y-axis shows the sum of rewards across timesteps in each episode, so higher is better.
Chapter 6
Proposed Work: Safe
Reinforcement Learning with
Smooth Policies
In the context of policy search, the agent maintains an explicit policy π(a|s; θ) denoting the proba-
bility of taking action a in state s under the policy π parameterized by θ. Note that for each state,
the policy outputs a probability distribution over actions: π : S → P(A).
The agent's goal is to find a policy that maximizes the expected return, so we define an objective function J that allows us to score an arbitrary policy parameter θ:
\[
J(\theta) = \mathbb{E}_{\tau \sim \Pr(\tau|\theta)}[G(\tau)] = \sum_{\tau} \Pr(\tau|\theta)\, G(\tau)\,,
\]
where τ denotes a trajectory and G(τ) its return.
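As a toy illustration of estimating J(θ) by sampling trajectories, consider a one-step MDP (a two-armed bandit) with a softmax policy; everything here, including the payoffs and parameterization, is a made-up stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta):
    # pi(a; theta) over two actions; a deliberately tiny stand-in
    # for the state-conditional policy pi(a|s; theta).
    e = np.exp(theta - theta.max())
    return e / e.sum()

def estimate_J(theta, n=20000):
    # Monte Carlo estimate of J(theta) = E_{tau ~ Pr(tau|theta)}[G(tau)]:
    # sample n one-step trajectories and average their returns.
    p = softmax_policy(theta)
    actions = rng.choice(2, size=n, p=p)
    return np.where(actions == 0, 1.0, 0.0).mean()  # arm 0 pays 1, arm 1 pays 0

theta = np.array([1.0, 0.0])
est = estimate_J(theta)
print(est)  # close to pi(0) = e / (1 + e), roughly 0.731
```

Policy-search methods ascend exactly this kind of sampled objective, which is why its sample-based fluctuations, quantified next via Rademacher complexity, matter for safety.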
6.1 Rademacher Complexity
We use Rademacher complexity for sample-complexity analysis. We define this measure here; for details see Bartlett and Mendelson [19] or Mohri et al. [75]. Also, see Jiang et al. [53] and Lehnert et al. [63] for previous applications of Rademacher complexity in reinforcement learning.
Definition 6.1.1. Consider f : S → [−1, 1], and a set of such functions F. The Rademacher complexity of this set, Rad(F), is defined as:
\[
\mathrm{Rad}(\mathcal{F}) := \mathbb{E}_{s_j, \sigma_j} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} \sigma_j f(s_j)\,,
\]
where the σj, referred to as Rademacher random variables, are drawn uniformly at random from {±1}.
The Rademacher variables can be thought of as independent and identically distributed noise. Under this view, the average (1/n) ∑j=1..n σj f(sj) quantifies the extent to which f(·) matches the noise. A complicated hypothesis space that can accurately match noise has a high Rademacher complexity; conversely, a simple hypothesis space has a low Rademacher complexity.
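The intuition above can be checked with a Monte Carlo estimate of the (empirical, fixed-sample) Rademacher complexity of two toy function sets; both sets are made up solely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, trials=500):
    # Monte Carlo estimate of the empirical Rademacher complexity of a
    # finite function set: each row of F_values holds f(s_1), ..., f(s_n)
    # for one f in the set (a toy stand-in for a neural hypothesis space).
    n = F_values.shape[1]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(F_values @ sigma) / n
    return total / trials

n = 50
simple = np.zeros((1, n))                       # a single constant function
rich = rng.choice([-1.0, 1.0], size=(1024, n))  # many arbitrary sign patterns

r_simple = empirical_rademacher(simple)
r_rich = empirical_rademacher(rich)
print(r_simple, r_rich)
```

The trivial one-function set cannot correlate with the noise at all, while the rich set of sign patterns matches it substantially, mirroring the complicated-versus-simple contrast in the text.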
To apply Rademacher complexity to the model-based setting, where the output of the model is a vector, we extend Definition 6.1.1 to vector-valued functions that map to [−1, 1]^d. Consider a function g := ⟨f1, ..., fd⟩ where fi ∈ F for all i, and define the set of such functions G. Then:
\[
\mathrm{Rad}(\mathcal{G}) := \mathbb{E}_{s_j, \sigma_{ji}} \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{d} \sigma_{ji}\, g(s_j)_i\,,
\]
where the σji are drawn uniformly at random from {±1}.
6.2 A Generalization Bound

For a given policy with parameters θ′ and the function class G, Rademacher complexity gives us a generalization bound on the performance of the policy induced by θ′: with probability at least 1 − δ,
\[
\mathbb{E}[J(\theta')] \;\ge\; \widehat{J}(\theta') - 2\,\mathrm{Rad}(\mathcal{G}) - \sqrt{\frac{\ln(1/\delta)}{m}}\,, \tag{6.1}
\]
where Ĵ(θ′) denotes the empirical average of the policy's performance over m samples. This quantity is unavailable in the absence of environmental interaction with the policy πθ′. However, we do have access to Ĵ(θ) for all θ parameters seen before. Suppose that the following Lipschitz property holds:
\[
|\widehat{J}(\theta) - \widehat{J}(\theta')| \le K\, \|\theta - \theta'\|\,,
\]
then we can lower-bound the right-hand side of (6.1) with high probability. Moreover, Rad(G) can be bounded and quantified for Lipschitz neural networks [7]. This allows us to certify a lower bound on performance before updating the policy, thereby meeting the safety requirement.
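Under these assumptions, the certified lower bound can be computed directly; all the numeric inputs below are hypothetical, and the Rademacher bound is used in its lower-bound direction:

```python
import numpy as np

def certified_lower_bound(J_hat, K, dist, rad_G, delta, m):
    # Lower bound on E[J(theta')] obtained by chaining the Lipschitz
    # assumption |J_hat(theta) - J_hat(theta')| <= K ||theta - theta'||
    # with the Rademacher generalization bound (6.1).
    # J_hat: empirical return of the current policy theta (assumed known),
    # dist: ||theta - theta'||, rad_G: a bound on Rad(G),
    # delta: confidence level, m: sample size.
    j_hat_new = J_hat - K * dist              # worst case for theta'
    return j_hat_new - 2.0 * rad_G - np.sqrt(np.log(1.0 / delta) / m)

lb = certified_lower_bound(J_hat=0.8, K=2.0, dist=0.05,
                           rad_G=0.1, delta=0.05, m=1000)
print(lb)
```

A proposed policy update to θ′ would be accepted only if this certified lower bound clears the required safety threshold.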
6.3 Timeline
I expect to complete this chapter by the end of summer 2020. I hope to send the findings of this chapter to AAAI 2021 (submission deadline: September 5th, 2020), and later defend my thesis in October 2020.
Bibliography
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A
system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems
Design and Implementation, pages 265–283, 2016.
[2] Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement
learning. In Proceedings of the 23rd international conference on Machine learning, pages 1–8.
ACM, 2006.
[3] Alejandro Agostini and Enric Celaya. Reinforcement learning with a gaussian mixture model.
In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8. IEEE,
2010.
[4] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In Proceedings of
the 34th International Conference on Machine Learning, pages 146–155, 2017.
[5] Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L Littman. Equivalence between
wasserstein and value-aware model-based reinforcement learning. In FAIM Workshop on Pre-
diction and Generative Modeling in Reinforcement Learning, volume 3, 2018.
[6] Kavosh Asadi and Michael L. Littman. An alternative softmax operator for reinforcement
learning. In Proceedings of the 34th International Conference on Machine Learning, pages
243–252, 2017.
[7] Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michael L Littman. Combating the
compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320, 2019.
[8] Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based
reinforcement learning. In Proceedings of the 35th International Conference on Machine Learn-
ing, pages 264–273, 2018.
[9] Kavosh Asadi, Ronald E Parr, George D Konidaris, and Michael L Littman. Deep rbf value
functions for continuous control. arXiv preprint arXiv:2002.01883, 2020.
[10] Kavosh Asadi and Jason D Williams. Sample-efficient deep reinforcement learning for dialog
control. arXiv preprint arXiv:1612.06000, 2016.
[11] Kavosh Asadi Atui. Strengths, Weaknesses, and Combinations of Model-based and Model-free
Reinforcement Learning. PhD thesis, University of Alberta, 2015.
[12] Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geometric data structure.
ACM Computing Surveys, 23(3):345–405, 1991.
[13] J Andrew Bagnell and Jeff G Schneider. Autonomous helicopter control using reinforcement
learning policy search methods. In Robotics and Automation, 2001. Proceedings 2001 ICRA.
IEEE International Conference on, volume 2, pages 1615–1620. IEEE, 2001.
[14] Leemon Baird and Andrew W Moore. Gradient descent for general reinforcement learning. In
Advances in Neural Information Processing Systems, pages 968–974, 1999.
[15] Leemon C Baird and A Harry Klopf. Reinforcement learning with high-dimensional, continuous
actions. Technical report, Wright Laboratory, 1993.
[16] Chris L Baker, Joshua B Tenenbaum, and Rebecca R Saxe. Goal inference as inverse planning.
In Proceedings of the 29th Annual Meeting of the Cognitive Science Society, 2007.
[17] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[18] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[19] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. JMLR, 2002.
[20] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements
that can solve difficult learning control problems. IEEE transactions on systems, man, and
cybernetics, pages 834–846, 1983.
[21] Gleb Beliakov, Humberto Bustince Sola, and Tomasa Calvo Sanchez. A Practical Guide to
Averaging Functions. Springer, 2016.
[22] Marc G Bellemare, Will Dabney, and Remi Munos. A distributional perspective on reinforce-
ment learning. In International Conference on Machine Learning, pages 449–458, 2017.
[23] Richard Bellman. On the theory of dynamic programming. Proceedings of the National
Academy of Sciences of the United States of America, 38(8):716, 1952.
[24] Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics,
pages 679–684, 1957.
[25] Michel Benaim. On functional approximation with normalized Gaussian units. Neural Com-
putation, 6(2):319–333, 1994.
[26] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-
based reinforcement learning with stability guarantees. In Advances in Neural Information
Processing Systems, pages 908–919, 2017.
[27] D Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Trans-
actions on Automatic Control, 20(3):415–419, 1975.
[28] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In
Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1, pages
560–564. IEEE, 1995.
[29] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press,
2004.
[30] Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013.
[31] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[32] David S Broomhead and David Lowe. Radial basis functions, multi-variable functional inter-
polation and adaptive networks. Technical report, Royal Signals and Radar Establishment
Malvern (United Kingdom), 1988.
[33] Sebastien Bubeck, Remi Munos, Gilles Stoltz, and Csaba Szepesvari. X-armed bandits. Journal
of Machine Learning Research, 12(May):1655–1695, 2011.
[34] Guido Bugmann. Normalized Gaussian radial basis function networks. Neurocomputing, 20(1-
3):97–110, 1998.
[35] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
[36] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley and Sons, 2006.
[37] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Fifteenth Na-
tional Conference on Artificial Intelligence (AAAI), pages 761–768, 1998.
[38] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete
data via the em algorithm. Journal of the royal statistical society. Series B (methodological),
pages 1–38, 1977.
[39] Amir-Massoud Farahmand, Andre Barreto, and Daniel Nikovski. Value-Aware Loss Function
for Model-based Reinforcement Learning. In Proceedings of the 20th International Conference
on Artificial Intelligence and Statistics, pages 1486–1494, 2017.
[40] Amir-Massoud Farahmand, Andre Barreto, and Daniel Nikovski. Value-Aware Loss Function
for Model-based Reinforcement Learning. In Proceedings of the 20th International Conference
on Artificial Intelligence and Statistics, pages 1486–1494, 2017.
[41] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite markov decision
processes. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages
162–169. AUAI Press, 2004.
[42] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via
soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial
Intelligence, pages 202–211. AUAI Press, 2016.
[43] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in
actor-critic methods. In International Conference on Machine Learning, pages 1587–1596,
2018.
[44] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[45] Geoffrey J. Gordon. Reinforcement learning with function approximation converges to a region,
2001. Unpublished.
[46] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Fi-
nale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare.
Nat Med, 25(1):16–18, 2019.
[47] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-
learning with model-based acceleration. In International Conference on Machine Learning,
pages 2829–2838, 2016.
[48] Barbara Hammer and Kai Gersmann. A note on the universal approximation capability of
support vector machines. Neural Processing Letters, 17(1):43–53, 2003.
[49] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David
Meger. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[50] Karl Hinderer. Lipschitz continuity of value functions in Markovian decision processes. Math-
ematical Methods of Operations Research, 62(1):3–22, 2005.
[51] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
[52] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective
planning horizon on model accuracy. In Proceedings of AAMAS, pages 1181–1189, 2015.
[53] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective
planning horizon on model accuracy. In Proceedings of the 2015 International Conference on
Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for
Autonomous Agents and Multiagent Systems, 2015.
[54] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A
survey. Journal of artificial intelligence research, 4:237–285, 1996.
[55] Sham Kakade, Michael J Kearns, and John Langford. Exploration in metric state spaces.
In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages
306–312, 2003.
[56] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning.
In Proceedings of the International Conference on Machine Learning, pages 267–274, 2002.
[57] Nicolaos B Karayiannis. Reformulated radial basis neural networks trained by gradient descent.
IEEE Transactions on Neural Networks, 10(3):657–671, 1999.
[58] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.
[59] Seungchan Kim, Kavosh Asadi, Michael Littman, and George Konidaris. Deepmellow: remov-
ing the need for a target network in deep q-learning. In Proceedings of the 28th International
Joint Conference on Artificial Intelligence, pages 2733–2739. AAAI Press, 2019.
[60] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In
Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681–690.
ACM, 2008.
[61] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.
The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[62] Tor Lattimore and Csaba Szepesvari. Bandit algorithms. preprint, 2018.
[63] Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of
long horizon problems. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[64] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441,
2017.
[65] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
arXiv preprint arXiv:1509.02971, 2015.
[66] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
arXiv preprint arXiv:1509.02971, 2015.
[67] Sungsu Lim, Ajin Joseph, Lei Le, Yangchen Pan, and Martha White. Actor-expert: A
framework for using action-value methods in continuous action spaces. arXiv preprint
arXiv:1810.09103, 2018.
[68] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence
and applications. In Proceedings of the 13th International Conference on Machine
Learning, pages 310–318, 1996.
[69] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence
and applications. In Lorenza Saitta, editor, Proceedings of the Thirteenth International
Conference on Machine Learning, pages 310–318, 1996.
[70] Michael Lederman Littman. Algorithms for Sequential Decision Making. PhD thesis, Depart-
ment of Computer Science, Brown University, February 1996. Also Technical Report CS-96-09.
[71] Edward Lorenz. Predictability: Does the flap of a butterfly’s wing in Brazil set off a tornado
in Texas? Presented at the meeting of the American Association for the Advancement of
Science, 1972.
[72] Guillaume Matheron, Nicolas Perrin, and Olivier Sigaud. The problem with DDPG: un-
derstanding failures in deterministic environments with sparse rewards. arXiv preprint
arXiv:1911.11679, 2019.
[73] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[74] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Learning multimodal tran-
sition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470,
2017.
[75] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learn-
ing. MIT Press, 2012.
[76] John Moody and Christian J Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation, 1(2):281–294, 1989.
[77] Alfred Müller. Optimal selection from distributions with unknown parameters: Robustness of
Bayesian models. Mathematical Methods of Operations Research, 44(3):371–386, 1996.
[78] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap
between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892,
2017.
[79] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypoth-
esis. In Advances in Neural Information Processing Systems, pages 1786–1794, 2010.
[80] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized
Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
[81] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in
neural networks. In Proceedings of The 28th Conference on Learning Theory, pages 1376–
1401, 2015.
[82] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L Littman.
An analysis of linear models, linear value-function approximation, and feature selection for
reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning,
pages 752–759. ACM, 2008.
[83] Jason Pazis and Ronald Parr. PAC optimal exploration in continuous space Markov decision
processes. In AAAI, 2013.
[84] Theodore J Perkins and Doina Precup. A convergent form of approximate policy iteration. In
Advances in Neural Information Processing Systems, pages 1595–1602, 2002.
[85] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI,
Atlanta, 2010.
[86] Bernardo Avila Pires and Csaba Szepesvári. Policy error bounds for model-based reinforcement
learning with factored linear models. In Conference on Learning Theory, pages 121–151, 2016.
[87] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov
decision processes. Machine Learning, 100(2-3):255–283, 2015.
[88] Michael JD Powell. Radial basis functions for multivariable interpolation: a review. Algorithms
for approximation, 1987.
[89] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, 2014.
[90] Emmanuel Rachelson and Michail G. Lagoudakis. On the locality of action domination in
sequential decision making. In International Symposium on Artificial Intelligence and Mathe-
matics, ISAIM 2010, Fort Lauderdale, Florida, USA, January 6-8, 2010, 2010.
[91] Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in MDPs.
In Decision Making with Imperfect Decision Makers, pages 57–74. Springer, 2012.
[92] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical
Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[93] Stuart J Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[94] Moonkyung Ryu, Yinlam Chow, Ross Anderson, Christian Tjandraatmadja, and Craig
Boutilier. CAQL: Continuous action Q-learning. arXiv preprint arXiv:1909.12397, 2019.
[95] Aysel Safak. Statistical analysis of the power sum of multiple correlated log-normal compo-
nents. IEEE Transactions on Vehicular Technology, 42(1):58–61, 1993.
[96] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep rein-
forcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76,
2017.
[97] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Ried-
miller. Deterministic policy gradient algorithms. In ICML, 2014.
[98] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Ried-
miller. Deterministic policy gradient algorithms. In International Conference on Machine
Learning, pages 387–395, 2014.
[99] Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence
results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 39:287–
308, 2000.
[100] Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New
benefits and new perspective. In International Conference on Machine Learning, pages 5916–
5925, 2019.
[101] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs:
PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
[102] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on
approximating dynamic programming. In Proceedings of the Seventh International Conference
on Machine Learning, pages 216–224, Austin, TX, 1990. Morgan Kaufmann.
[103] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT
Press, 1998.
[104] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,
2018.
[105] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradi-
ent methods for reinforcement learning with function approximation. In Advances in Neural
Information Processing Systems, pages 1057–1063, 2000.
[106] Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael H. Bowling. Dyna-style
planning with linear function approximation and prioritized sweeping. In UAI 2008,
Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland,
July 9-12, 2008, pages 528–536, 2008.
[107] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning, 4(1):1–103, 2010.
[108] Erik Talvitie. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth
Conference on Uncertainty in Artificial Intelligence, pages 780–789. AUAI Press, 2014.
[109] Erik Talvitie. Self-correcting models for model-based reinforcement learning. In Proceedings
of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San
Francisco, California, USA, 2017.
[110] Sebastian B. Thrun. The role of exploration in learning control. In David A. White and
Donald A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Ap-
proaches, pages 527–559. Van Nostrand Reinhold, New York, NY, 1992.
[111] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.
[112] Emanuel Todorov. Linearly-solvable Markov decision problems. In NIPS, pages 1369–1376,
2006.
[113] Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical
and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning, pages 177–184. IEEE, 2009.
[114] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of
learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015.
[115] Christopher Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
[116] Jason D Williams, Kavosh Asadi Atui, and Geoffrey Zweig. Hybrid code networks: practical
and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceed-
ings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 665–677, 2017.
[117] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein-
forcement learning. Machine Learning, 8(3):229–256, 1992.