Smoothness in Reinforcement Learning with Large State and Action Spaces
by
Kavosh Asadi
B. E., University of Tehran, 2013
M. Sc., University of Alberta, 2015
A dissertation submitted in partial fulfillment of the
requirements for the Degree of Doctor of Philosophy
in the Department of Computer Science at Brown University
Providence, Rhode Island
March 31st, 2020
© Copyright 2020 by Kavosh Asadi
This dissertation by Kavosh Asadi is accepted in its present form by
the Department of Computer Science as satisfying the dissertation requirement
for the degree of Doctor of Philosophy.
Date    Michael L. Littman, Director
Recommended to the Graduate Council
Date    George D. Konidaris, Reader
Date    Ronald E. Parr, Reader
(Duke University)
Approved by the Graduate Council
Date    Andrew G. Campbell
Dean of the Graduate School
Vita
Kavosh was born and raised in Chalous, a small, green, and beautiful town in Northern Iran. As a
child, he spent an excessive amount of time playing video games, which prompted his deep interest
in minds, specifically the artificial ones. In 2008, he moved to Iran’s capital, Tehran, to complete
his undergraduate degree in computer engineering at the University of Tehran. In 2015, he obtained
his Master’s degree from the University of Alberta in Canada. Since then, he has been pursuing a
PhD degree at the Department of Computer Science at Brown University in the United States.
Preface
Writing this thesis would not have been fun without the collaboration with several amazing people.
Special credit goes to my wonderful mentor, Michael Littman, with whom I spent a lot of time
thinking about this thesis. Michael also provided me with remarkable emotional support, which I
could definitely use in light of the political climate. I am equally thankful to my mentor, George
Konidaris, who has guided me as if I were his younger brother. It has also been a great pleasure to
work with Ron Parr. Ron is a deep thinker. Many parts of this thesis are shaped by my experience
working with Rich Sutton at Alberta. Rich believed in my potential when some did not, and for
that I will always remain grateful. I appreciated my collaborations with Dipendra Misra, Seungchan
Kim, and Evan Cater, with whom I coauthored papers that form the building blocks of this thesis.
Specifically, Chapter 3 is based on a paper that I co-authored with Michael [6], and a follow-up paper
that I co-authored with Seungchan, George, and Michael [59]. Chapter 4 is based on two papers: a
first paper that I co-authored with Dipendra and Michael [8], and another paper that I co-authored
with Evan, Dipendra, and Michael [5]. Lastly, Chapter 5 is based on a paper that I co-authored with
Ron, George, and Michael [9].
I also want to acknowledge the invaluable role of several people who definitely shaped my ideas
throughout the PhD journey: Ben Abbatematteo, David Abel, Cameron Allen, Barrett Ames, Dilip
Arumugam, Akhil Bagaria, Sina Ghiassian, Chris Grimm, Yuu Jinnai, John Langford, Erwan Lecarpentier, Lucas Lehnert, Lihong Li, Sam Lobel, Robert Loftin, Marlos Machado, Rupam Mahmood,
Joseph Modayil, Melrose Roderick, Eric Rosen, Sam Saarinen, Saket Tiwari, Eli Upfal, Hado van
Hasselt, Harm van Seijen, Jason Williams, and Geoffrey Zweig.
Contents
List of Tables
List of Figures

1 Introduction
2 Background
  2.1 The Reinforcement Learning Problem
  2.2 Function Approximation
  2.3 Smoothness
3 A Smooth Approximation of Max for Convergent Reinforcement Learning
  3.1 Introduction
  3.2 Boltzmann Misbehaves
  3.3 Boltzmann Has Multiple Fixed Points
  3.4 Mellowmax and Its Properties
    3.4.1 Mellowmax is a Non-Expansion
    3.4.2 Maximization
    3.4.3 Derivatives
    3.4.4 Averaging
  3.5 Maximum Entropy Mellowmax Policy
  3.6 Experiments in the Tabular Setting
    3.6.1 Two-state MDP
    3.6.2 Random MDPs
    3.6.3 Multi-passenger Taxi Domain
  3.7 Experiments in the Function Approximation Setting
    3.7.1 DeepMellow vs DQN without a Target Network
    3.7.2 DeepMellow vs DQN with a Target Network
  3.8 Conclusion
4 Smoothness in Model-based Reinforcement Learning
  4.1 Introduction
  4.2 Lipschitz Model Class
  4.3 On the Choice of Probability Metric
    4.3.1 Value-Aware Model Learning (VAML) Loss
    4.3.2 Lipschitz Generalized Value Iteration
    4.3.3 Equivalence Between VAML and Wasserstein
  4.4 Understanding the Compounding Error Phenomenon
  4.5 Value Error with Lipschitz Models
  4.6 Experiments
  4.7 Conclusion
5 Smoothness in Continuous Control
  5.1 Introduction
  5.2 Deep RBF Value Functions
  5.3 Experiments: Continuous Optimization
  5.4 Experiments: Continuous Control
  5.5 Conclusion
6 Proposed Work: Safe Reinforcement Learning with Smooth Policies
  6.1 Rademacher Complexity
  6.2 A Generalization Bound
  6.3 Timeline
List of Tables
3.1 A comparison between Mellowmax and Boltzmann in terms of convergence to a unique fixed point.
3.2 Experimental details for evaluating DeepMellow for each domain.
4.1 Lipschitz constant for various functions used in a neural network. Here, Wj denotes the jth row of a weight matrix W.
List of Figures
2.1 An illustration of the reinforcement learning problem.
2.2 An illustration of Lipschitz continuity.
3.1 An MDP in which Boltzmann softmax misbehaves.
3.2 Oscillating values under the Boltzmann softmax policy.
3.3 GVI with Boltzmann could have multiple fixed points.
3.4 GVI with Boltzmann misbehaves.
3.5 GVI with mellowmax exhibits a sound behavior.
3.6 Number of iterations before termination of GVI on the example MDP. GVI under mmω outperforms the alternatives.
3.7 An illustration of the multi-passenger Taxi domain.
3.8 Evaluation of the maximum-entropy mellowmax policy on Taxi.
3.9 Comparison between DeepMellow and DQN in the absence of a target network.
3.10 Comparison between DeepMellow and DQN with a target network.
4.1 An example of a Lipschitz model class.
4.2 An example of a case where the accuracy of a stochastic model is best quantified using Wasserstein.
4.3 A comparison between KL, TV, and Wasserstein in terms of capturing the effectiveness of transition models for planning.
4.4 Correlation between value-prediction error and model error.
4.5 Impact of controlling the Lipschitz constant of learned models.
4.6 A stochastic problem solved by training a Lipschitz model class using EM.
4.7 Impact of controlling the Lipschitz constant in the supervised-learning domain.
4.8 Performance of a Lipschitz model class on the gridworld domain.
5.1 The architecture of a deep RBF value (Q) function.
5.2 An RBF Q function for different smoothing parameters.
5.3 Surface of a true reward function, and its approximation using an RBF reward network.
5.4 Mean and standard deviation of performance for various methods on the continuous optimization task.
5.5 A comparison between RBF–DQN and standard deep-RL baselines.
Abstract
Reinforcement learning (RL) is the study of the interaction between an environment and an agent
that learns to achieve a goal through trial-and-error. Owing to its generality, RL has successfully been
applied to numerous domains, including those with enormous state and action spaces. In light of
the curse of dimensionality in large settings, a fundamental question in RL is how to design algorithms
that are compatible with function approximation but also capable of tackling longstanding challenges
including convergence guarantees, exploration-exploitation, planning, and safety. In this thesis I
study approximate RL through the lens of smoothness formally defined using Lipschitz continuity.
RL algorithms are typically composed of several key ingredients such as value functions, models,
operators, and policies. I present theoretical results showing an essential role for the smoothness
of these ingredients in stability and convergence of RL, effective model learning and planning, and
state-of-the-art continuous control. Through many examples and experiments, I also demonstrate
how to adjust the smoothness of these ingredients to improve the performance of approximate RL
in large problems.
Thesis Statement
Smoothness, defined by Lipschitz continuity, is essential for approximate reinforcement learning in
terms of convergence to a unique fixed-point, near-accurate planning, state-of-the-art continuous
control, and safety.
Chapter 1
Introduction
Reinforcement learning (RL) is the study of the interaction between an environment and an agent
that learns to achieve a goal. Owing to its generality, RL has many emerging applications including
in settings with large (and sometimes infinite) state and action spaces [73, 61, 116, 10, 46, 96, 64].
Successfully applying RL in these settings hinges on developing learning algorithms that are
compatible with function approximation, but also capable of addressing longstanding challenges in
RL such as exploration-exploitation, generalization, planning, and safety. RL algorithms are typ-
ically composed of key ingredients including operators, value functions, models, and policies. In
this thesis I develop a theory that studies these ingredients through the lens of smoothness and in
presence of function approximation.
I define a smooth function as any function with the following property: For any two distinct points
on its domain, the line that connects the surface of the function at the two points should have a
finite slope. The Lipschitz constant of the function is defined as the maximum value of this slope
for any pair of points. A function with a Lipschitz constant K is said to be K-Lipschitz.
My first result pertains to convergence of RL to a unique fixed point, which hinges on using K-
Lipschitz operators with K ≤ 1. A Lipschitz operator with this property is known as a non-expansion.
I show that the Boltzmann softmax operator, commonly used for addressing the exploration–exploitation
trade-off in large RL problems, is not a non-expansion and is prone to misbehav-
ior. I then introduce an alternative softmax operator which, among other nice properties, has the
non-expansion property. The new operator exhibits convergent behavior, and is also useful in large
domains such as Atari.
Second, I focus on model-based reinforcement learning, where I show that in the absence of model
smoothness even a near-perfect model, defined as a model with bounded one-step error, can have
arbitrarily large multi-step errors. I show novel bounds on multi-step prediction error and value-
estimation error of one-step smooth models. I then make the case for controlling the Lipschitz
constant of learned models for effective planning.
Third, I move to the setting with continuous states and actions, where finding an action that is
optimal with respect to a learned state–action value function can be difficult. I introduce deep RBF
value functions: state–action value functions learned using a deep neural network with a radial-basis
function (RBF) output layer. I show that the optimal action with respect to a deep RBF value
function can be easily approximated up to any desired accuracy, and that deep RBF value functions
can support universal function approximation. I show that controlling the smoothness of deep RBF
value functions yields state-of-the-art performance in continuous-action RL problems.
Finally, I describe proposed future work on RL safety, which demands that, each time the learning
algorithm changes its policy, the new policy retains a performance lower bound with high
probability. I show how we may meet this safety requirement by controlling the
Lipschitz constant of the agent’s policy, and I also show how to learn such a safe policy using neural
networks.
Chapter 2
Background
This thesis builds on background material that I present in this chapter. I start by explaining the
reinforcement learning problem, then introduce the function approximation setting, and end the
chapter by formulating the notion of smoothness.
2.1 The Reinforcement Learning Problem
Reinforcement learning (RL) is an area of artificial intelligence (AI) that revolves around the in-
teraction between an environment and an agent that seeks to maximize reward. This problem is
illustrated in Figure 2.1. At each timestep t, the RL agent, in a state St = s, takes an action At = a.
The agent then receives a scalar reward signal Rt = r and moves to a new state St+1 = s′. The goal of the RL
agent is to maximize the sum of rewards across future timesteps by learning a good action-selection
strategy from environmental interaction.
Figure 2.1: An illustration of the reinforcement learning problem.
The RL problem is typically formulated using Markov Decision Processes (MDPs) [89]. An MDP
is usually specified by a tuple: 〈S,A, T,R, γ〉. Here S and A denote the state space and the action
space of the MDP. In the most basic case S and A are discrete, but more generally one or both
of them can be continuous. The MDP model consists of two functions, namely the transition
model T : S ×A×S → [0, 1], and the reward model R : S ×A → R. The discount factor, γ ∈ [0, 1),
determines the importance of immediate reward as opposed to rewards received in the future. The
goal of an RL agent is to find a policy π : S → A that collects a high sum of discounted rewards
across timesteps.
For a state s ∈ S, an action a ∈ A, and a policy π, we define the state–action value function:
$$Q^{\pi}(s,a) := \mathbb{E}_{\pi}\big[G_t \mid S_t = s, A_t = a\big],$$
where $G_t := \sum_{i=t}^{\infty} \gamma^{i-t} R_i$ is called the return at timestep t. We define Q∗ as the maximum Q value
of a state–action pair among all policies:
$$Q^{*}(s,a) := \max_{\pi} Q^{\pi}(s,a)\,.$$
Under a discrete state space, the quantity Q∗, known as the optimal state–action value function,
can be written recursively [23]:
$$Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s')\, \max_{a'} Q^{*}(s',a')\,. \tag{2.1}$$
If the model of the MDP 〈R, T 〉 is available, standard dynamic programming (DP) approaches find
Q∗ by solving for the fixed point of (2.1), known as the Bellman equation. A notable example of
these DP algorithms is Value Iteration, which starts by a randomly initialized Q, and proceeds by
repeatedly performing the following update:
$$Q(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s')\, \max_{a'} Q(s',a')\,. \tag{2.2}$$
In the absence of a model, there are two classes of RL algorithms: model-based and model-free.
Model-based algorithms estimate a model $\langle \widehat{T}, \widehat{R} \rangle$ of the MDP using supervised learning:
$$\widehat{T} \approx T\,, \qquad \widehat{R} \approx R\,.$$
Once a model is learned, it can be used, for example, in conjunction with Value Iteration to solve
the MDP.
The second class of algorithms solves for the fixed point of the Bellman equation using environmental
interaction and without learning a model. Q-learning [115], a notable example of these so-called
model-free algorithms, learns an approximation of Q∗, denoted $\widehat{Q}$, as follows:
$$\widehat{Q}(s,a) \leftarrow \widehat{Q}(s,a) + \alpha\big(r + \gamma \max_{a' \in \mathcal{A}} \widehat{Q}(s',a') - \widehat{Q}(s,a)\big)\,. \tag{2.3}$$
It can be shown that, under mild conditions, Value Iteration, model-based reinforcement learning,
and model-free Q-learning converge to the optimal state–action value function Q∗.
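The tabular update (2.3) is short enough to sketch end to end; the one-state environment below is a hypothetical toy with a known fixed point, not a domain from this thesis:

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # One application of the tabular Q-learning update (2.3).
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Hypothetical toy environment: a single state 0 where action 1 yields
# reward 1 and action 0 yields reward 0; the agent always stays in state 0.
# With gamma = 0.5, the fixed point is Q*(0,1) = 2 and Q*(0,0) = 1.
Q = defaultdict(float)
actions = [0, 1]
for _ in range(2000):
    a = random.choice(actions)          # uniform exploration
    r = 1.0 if a == 1 else 0.0
    q_learning_update(Q, 0, a, r, 0, actions, alpha=0.1, gamma=0.5)
```

Each update contracts the estimate toward the bootstrapped target, so the estimates approach the fixed point of the Bellman equation.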
2.2 Function Approximation
In many interesting RL applications, the state space or the action space of the MDP is enormous,
or sometimes even infinite. In such settings, I assume that the state space and the action space are
metric spaces, each endowed with a distance metric:
$$(\mathcal{S}, d_{\mathcal{S}}) \quad \text{and} \quad (\mathcal{A}, d_{\mathcal{A}})\,.$$
The metric can be defined flexibly and in light of the specific problem we desire to solve. Under an
infinite state space or action space, it is simply infeasible to learn about each state–action pair
individually, whether the goal is to learn a value function, a policy, or a model. The idea behind
function approximation is to represent these ingredients using a parameterized function class.
As a concrete example, when combined with function approximation, Q-learning updates its parameters θ as follows:
$$\theta \leftarrow \theta + \alpha\big(r + \gamma \max_{a' \in \mathcal{A}} \widehat{Q}(s',a';\theta) - \widehat{Q}(s,a;\theta)\big)\, \nabla_{\theta} \widehat{Q}(s,a;\theta)\,. \tag{2.4}$$
Note that Q-learning's update rule (2.4) is agnostic to the choice of function class, so in principle
any differentiable parameterized function class can be used in conjunction with the above update to
learn the parameters θ.
2.3 Smoothness
Figure 2.2: An illustration of Lipschitz continuity. Pictorially, Lipschitz continuity ensures that f lies in between the two affine functions (colored in blue) with slopes K and −K. In this one-dimensional example, the slope is equal to the maximum value of the gradient of the function on its domain.
In this thesis I leverage the “smoothness” of various functions (operators, value functions, models,
policies), which I formulate using the mathematical notion of Lipschitz continuity defined momen-
tarily. Intuitively, I understand a smooth function to have the following property: For any two
distinct points on the domain of the function, the line that connects the surface of the function at
the two points should have a finite slope. The Lipschitz constant of the function is defined as the
maximum value of this slope for any pair of points. A function with a Lipschitz constant K is said
to be K-Lipschitz. I now define a smooth (Lipschitz) function more formally:
Definition 2.3.1. Given two metric spaces (M1, d1) and (M2, d2), each consisting of a space and a
distance metric, a function f : M1 → M2 is Lipschitz continuous (sometimes simply Lipschitz) if the
Lipschitz constant, defined as
$$K_{d_1,d_2}(f) := \sup_{s_1 \in M_1,\, s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}\,, \tag{2.5}$$
is finite.

Equivalently, for a Lipschitz f,
$$\forall s_1, \forall s_2 \quad d_2\big(f(s_1), f(s_2)\big) \le K_{d_1,d_2}(f)\, d_1(s_1, s_2)\,.$$
The concept of Lipschitz continuity is visualized in Figure 2.2.
A Lipschitz function f is called a non-expansion when $K_{d_1,d_2}(f) \le 1$ and a contraction when
$K_{d_1,d_2}(f) < 1$. Lipschitz continuity, in one form or another, has been a key tool in the theory of
RL [27, 28, 68, 77, 41, 50, 90, 107, 83, 87, 86, 26, 22] and bandits [60, 33]. Below, I also define
Lipschitz continuity over a subset of inputs.
Definition 2.3.2. A function f : M1 × A → M2 is uniformly Lipschitz continuous in A if
$$K^{\mathcal{A}}_{d_1,d_2}(f) := \sup_{a \in \mathcal{A}}\ \sup_{s_1, s_2} \frac{d_2\big(f(s_1,a), f(s_2,a)\big)}{d_1(s_1, s_2)}\,, \tag{2.6}$$
is finite.
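A quick way to probe Definition 2.3.1 numerically is to take the supremum in (2.5) over sampled pairs of points; this only lower-bounds the true Lipschitz constant. The helper below is a hypothetical sketch, demonstrated on a function whose constant is known exactly:

```python
import random

def lipschitz_lower_bound(f, samples, d1, d2):
    # Empirical version of (2.5): the maximum of d2(f(x), f(y)) / d1(x, y)
    # over sampled pairs. This is a lower bound on K_{d1,d2}(f); the true
    # constant is a supremum over all pairs in the domain.
    best = 0.0
    for x in samples:
        for y in samples:
            if x != y:
                best = max(best, d2(f(x), f(y)) / d1(x, y))
    return best

# f(x) = 3x is 3-Lipschitz when both metrics are the absolute difference.
pts = [random.uniform(-1.0, 1.0) for _ in range(200)]
K = lipschitz_lower_bound(lambda x: 3.0 * x,
                          pts,
                          lambda a, b: abs(a - b),
                          lambda a, b: abs(a - b))
```

For a linear function the sampled ratio is constant, so the estimate matches the true Lipschitz constant; for general functions it only approaches it as sampling becomes dense.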
Chapter 3
A Smooth Approximation of Max
for Convergent Reinforcement
Learning
In RL problems, it is typical to use a soft approximation of maximum whenever it is necessary to
maximize utility but also to hedge against problems that arise from putting all of one’s weight behind
a single maximum-utility decision. These so-called softmax operators, applied to a set of values, act
somewhat like the maximization function and somewhat like an average. The Boltzmann softmax
operator is the most commonly used softmax operator in this setting, but I show that this operator
is prone to misbehavior. In this chapter, I study an alternative softmax operator that, among other
properties, is a non-expansion (has a Lipschitz constant of at most 1), ensuring convergent
behavior in learning and planning. I introduce a variant of the SARSA algorithm that, by utilizing the
new operator, computes a Boltzmann policy with a state-dependent temperature parameter. I show
that the algorithm is convergent and that it performs favorably in practice.
3.1 Introduction
There is a fundamental tension in decision making between choosing the action that has the highest
expected utility and avoiding “starving” the other actions. The issue arises in the context of the
exploration–exploitation dilemma [110], non-stationary decision problems [102], and when interpret-
ing observed decisions [16].
In reinforcement learning, an approach to addressing the tension is the use of softmax operators
for value-function optimization, and softmax policies for action selection. Examples include value-
based methods such as SARSA [92] or expected SARSA [103, 113], and policy-search methods such
as REINFORCE [117].
An ideal softmax operator is a parameterized set of operators that:
1. has parameter settings that allow it to approximate maximization arbitrarily accurately to
perform reward-seeking behavior;
2. is a non-expansion for all parameter settings ensuring convergence to a unique fixed point;
3. is differentiable to make it possible to improve via gradient-based optimization; and
4. avoids the starvation of non-maximizing actions.
Let $\mathbf{X} = x_1, \ldots, x_n$ be a vector of values. We define the following operators:

• $\max(\mathbf{X}) = \max_{i \in \{1,\ldots,n\}} x_i$,

• $\text{mean}(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n} x_i$,

• $\text{eps}_{\epsilon}(\mathbf{X}) = \epsilon\, \text{mean}(\mathbf{X}) + (1-\epsilon) \max(\mathbf{X})$,

• $\text{boltz}_{\beta}(\mathbf{X}) = \dfrac{\sum_{i=1}^{n} x_i\, e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}}$.
The first operator, max(X), is known to be a non-expansion [69]. However, it is non-differentiable
(Property 3), and ignores non-maximizing selections (Property 4).
The next operator, mean(X), computes the average of its inputs. It is differentiable and, like any
operator that takes a fixed convex combination of its inputs, is a non-expansion. However, it does
not allow for maximization (Property 1).
The third operator, epsε(X), commonly referred to as epsilon-greedy [103], interpolates between max
and mean. The operator is a non-expansion, because it is a convex combination of two non-expansion
operators. But it is non-differentiable (Property 3).
The Boltzmann operator boltzβ(X) is differentiable. It also approximates max as β →∞, and mean
as β → 0. However, it is not a non-expansion (Property 2), and therefore, prone to misbehavior as
will be shown in the next section.
In the following section, we provide a simple example illustrating why the non-expansion property
is important, especially in the context of planning and on-policy learning. We then present a new
softmax operator that is similar to the Boltzmann operator yet is a non-expansion. We prove several
critical properties of this new operator, introduce a new softmax policy, and present empirical re-
sults.
3.2 Boltzmann Misbehaves
I first show that boltzβ can lead to problematic behavior. To this end, I ran SARSA with a Boltzmann
softmax policy (Algorithm 1) on the MDP shown in Figure 3.1. The edges are labeled with a
transition probability (unsigned) and a reward number (signed). Also, state s2 is a terminal state,
so I only consider two action values, namely Q(s1, a) and Q(s2, b). Recall that the Boltzmann
softmax policy assigns the following probability to each action:
$$\pi(a \mid s) = \frac{e^{\beta Q(s,a)}}{\sum_{a} e^{\beta Q(s,a)}}\,.$$
Figure 3.1: A simple MDP with two states, two actions, and γ = 0.98. The use of a Boltzmann softmax policy is not sound in this simple domain.
In Figure 3.2, I plot state–action value estimates at the end of each episode of a single run (smoothed
by averaging over ten consecutive points). I set α = .1 and β = 16.55. The value estimates are
unstable.
SARSA is known to converge in the tabular setting using ε-greedy exploration [69], under decreasing
exploration [99], and to a region in the function-approximation setting [45]. There are also variants
of the SARSA update rule that converge more generally [84, 14, 113]. However, this example is the
first, to our knowledge, to show that SARSA fails to converge in the tabular setting with a Boltzmann
policy. The next section provides background for our analysis of the example.
Algorithm 1 SARSA with Boltzmann softmax policy

Input: initial Q(s, a) ∀s ∈ S, ∀a ∈ A, α, and β
for each episode do
    Initialize s
    a ∼ Boltzmann with parameter β
    repeat
        Take action a, observe r, s′
        a′ ∼ Boltzmann with parameter β
        Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]
        s ← s′, a ← a′
    until s is terminal
end for
Figure 3.2: Values estimated by SARSA with Boltzmann softmax (x-axis: episode number). The algorithm never achieves stable values.
3.3 Boltzmann Has Multiple Fixed Points
Although it has been known for a long time that the Boltzmann operator is not a non-expansion [70],
I am not aware of a published example of an MDP for which two distinct fixed points exist. The
MDP presented in Figure 3.1 is the first example where, as shown in Figure 3.3, generalized value iteration (GVI) under boltzβ
has two distinct fixed points. I also show, in Figure 3.4, a vector field visualizing GVI updates
under boltzβ=16.55. The updates can move the current estimates farther from the fixed points. The
behavior of SARSA (Figure 3.2) results from the algorithm stochastically bouncing back and forth
between the two fixed points. When the learning algorithm performs a sequence of noisy updates,
it moves from one fixed point to the other. As I will show later, planning will also progress extremely
slowly near the fixed points. The lack of the non-expansion property leads to multiple fixed points
and ultimately a misbehavior in learning and planning.
3.4 Mellowmax and Its Properties
I advocate for an alternative softmax operator defined as follows:
$$\text{mm}_{\omega}(\mathbf{X}) = \frac{\log\big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega}\,,$$
Figure 3.3: Fixed points of GVI under boltzβ for varying β. Two distinct fixed points (red and blue) co-exist for a range of β.

Figure 3.4: A vector field showing GVI updates under boltzβ=16.55. Fixed points are marked in black. For some points, such as the large blue point, updates can move the current estimates farther from the fixed points. Also, for points that lie in between the two fixed points, progress is extremely slow.
which can be viewed as a particular instantiation of the quasi-arithmetic mean [21]. It can also be
derived from information theoretical principles as a way of regularizing policies with a cost function
defined by KL divergence [112, 91, 42]. Note that the operator has previously been utilized in other
areas, such as power engineering [95].
I show that mmω, which we refer to as mellowmax, has the desired properties and that it compares
quite favorably to boltzβ in practice.
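When computing mmω in practice, the exponentials can overflow for large ω·xi; a log-sum-exp rearrangement avoids this. The implementation below is a sketch using that standard trick, not code from the thesis:

```python
import math

def mellowmax(x, omega):
    # mm_omega(X) = log( (1/n) sum_i e^{omega x_i} ) / omega, for omega != 0.
    # Factoring out e^{omega m}, with m chosen so every remaining exponent
    # is <= 0, keeps the sum in a safe floating-point range.
    n = len(x)
    m = max(x) if omega > 0 else min(x)
    s = sum(math.exp(omega * (xi - m)) for xi in x)
    return m + (math.log(s) - math.log(n)) / omega
```

The rearrangement is exact: log((1/n) Σ e^{ωxᵢ})/ω = m + (log Σ e^{ω(xᵢ−m)} − log n)/ω for any nonzero ω.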
3.4.1 Mellowmax is a Non-Expansion
I prove that mmω is a non-expansion (Property 2), and therefore, GVI and SARSA under mmω are
guaranteed to converge to a unique fixed point.
Let $\mathbf{X} = x_1, \ldots, x_n$ and $\mathbf{Y} = y_1, \ldots, y_n$ be two vectors of values. Let $\Delta_i = |x_i - y_i|$ for $i \in \{1, \ldots, n\}$
be the difference of the ith components of the two vectors. Also, let i∗ be the index with the
maximum component-wise difference, $i^{*} = \operatorname{argmax}_i \Delta_i$. For simplicity, we assume that i∗ is unique
and ω > 0. Also, without loss of generality, we assume that xi∗ − yi∗ ≥ 0. It follows that:
$$
\begin{aligned}
\big|\text{mm}_{\omega}(\mathbf{X}) - \text{mm}_{\omega}(\mathbf{Y})\big|
&= \Big|\, \log\Big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\Big)\big/\omega \;-\; \log\Big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega y_i}\Big)\big/\omega \,\Big| \\
&= \Big|\, \log \frac{\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}}{\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&= \Big|\, \log \frac{\sum_{i=1}^{n} e^{\omega(y_i + \Delta_i)}}{\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&\le \Big|\, \log \frac{\sum_{i=1}^{n} e^{\omega(y_i + \Delta_{i^{*}})}}{\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&= \Big|\, \log \frac{e^{\omega \Delta_{i^{*}}} \sum_{i=1}^{n} e^{\omega y_i}}{\sum_{i=1}^{n} e^{\omega y_i}} \,\Big/\, \omega \,\Big| \\
&= \big|\, \log\big(e^{\omega \Delta_{i^{*}}}\big)\big/\omega \,\big| = \big|\Delta_{i^{*}}\big| = \max_i \big|x_i - y_i\big|\,,
\end{aligned}
$$
allowing us to conclude that mellowmax is a non-expansion under the infinity norm.
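The non-expansion bound just derived can be sanity-checked numerically on random vectors; this is a standalone sketch, with mellowmax re-implemented inline:

```python
import math
import random

def mellowmax(x, omega):
    # mm_omega(X), computed with the max subtracted for numerical stability.
    m = max(x)
    s = sum(math.exp(omega * (xi - m)) for xi in x)
    return m + (math.log(s) - math.log(len(x))) / omega

random.seed(0)
omega = 5.0
violations = 0
for _ in range(1000):
    X = [random.uniform(-10.0, 10.0) for _ in range(4)]
    Y = [random.uniform(-10.0, 10.0) for _ in range(4)]
    gap = abs(mellowmax(X, omega) - mellowmax(Y, omega))
    # Non-expansion under the infinity norm: gap <= max_i |x_i - y_i|.
    if gap > max(abs(a - b) for a, b in zip(X, Y)) + 1e-12:
        violations += 1
```

By the proof above, no violation should ever occur, for any ω and any pair of vectors.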
3.4.2 Maximization
Mellowmax includes parameter settings that allow for maximization (Property 1) as well as for min-
imization. In particular, as ω goes to infinity, mmω acts like max.
Let $m = \max(\mathbf{X})$ and let $W = |\{x_i = m \mid i \in \{1,\ldots,n\}\}|$. Note that $W \ge 1$ is the number of
maximum values ("winners") in $\mathbf{X}$. Then:
$$
\begin{aligned}
\lim_{\omega \to \infty} \text{mm}_{\omega}(\mathbf{X})
&= \lim_{\omega \to \infty} \frac{\log\big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega} \\
&= \lim_{\omega \to \infty} \frac{\log\big(\tfrac{1}{n}\, e^{\omega m} \sum_{i=1}^{n} e^{\omega(x_i - m)}\big)}{\omega} \\
&= \lim_{\omega \to \infty} \frac{\log\big(\tfrac{1}{n}\, e^{\omega m}\, W\big)}{\omega} \\
&= \lim_{\omega \to \infty} \frac{\log(e^{\omega m}) - \log(n) + \log(W)}{\omega} \\
&= m + \lim_{\omega \to \infty} \frac{-\log(n) + \log(W)}{\omega} = m = \max(\mathbf{X})\,.
\end{aligned}
$$
That is, the operator acts more and more like pure maximization as the value of ω is increased.
Conversely, as ω goes to −∞, the operator approaches the minimum.
3.4.3 Derivatives
We can take the derivative of mellowmax with respect to each one of the arguments $x_i$, for any
non-zero ω:
$$\frac{\partial\, \text{mm}_{\omega}(\mathbf{X})}{\partial x_i} = \frac{e^{\omega x_i}}{\sum_{i=1}^{n} e^{\omega x_i}} \ge 0\,.$$
Note that the operator is non-decreasing in each component of $\mathbf{X}$.

Moreover, we can take the derivative of mellowmax with respect to ω. We define $n_{\omega}(\mathbf{X}) = \log\big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)$ and $d_{\omega}(\mathbf{X}) = \omega$. Then:
$$\frac{\partial n_{\omega}(\mathbf{X})}{\partial \omega} = \frac{\sum_{i=1}^{n} x_i\, e^{\omega x_i}}{\sum_{i=1}^{n} e^{\omega x_i}} \quad \text{and} \quad \frac{\partial d_{\omega}(\mathbf{X})}{\partial \omega} = 1\,,$$
and so:
$$\frac{\partial\, \text{mm}_{\omega}(\mathbf{X})}{\partial \omega} = \frac{\frac{\partial n_{\omega}(\mathbf{X})}{\partial \omega}\, d_{\omega}(\mathbf{X}) - n_{\omega}(\mathbf{X})\, \frac{\partial d_{\omega}(\mathbf{X})}{\partial \omega}}{d_{\omega}(\mathbf{X})^{2}}\,,$$
ensuring differentiability of the operator (Property 3).
3.4.4 Averaging
Because of the division by ω in the definition of mmω, the parameter ω cannot be set to zero.
However, we can examine the behavior of mmω as ω approaches zero and show that the operator
computes an average in the limit.
Since both the numerator and denominator go to zero as ω goes to zero, we will use L'Hôpital's rule
and the derivative given in the previous section to derive the value in the limit:
$$
\lim_{\omega \to 0} \text{mm}_{\omega}(\mathbf{X})
= \lim_{\omega \to 0} \frac{\log\big(\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega}
\overset{\text{L'H\^opital}}{=} \lim_{\omega \to 0} \frac{\tfrac{1}{n}\sum_{i=1}^{n} x_i\, e^{\omega x_i}}{\tfrac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}}
= \frac{1}{n}\sum_{i=1}^{n} x_i = \text{mean}(\mathbf{X})\,.
$$
That is, as ω gets closer to zero, mmω(X) approaches the mean of the values in X.
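Both limiting behaviors, max as ω → ∞ and mean as ω → 0, can be observed numerically; this is again a standalone sketch with mellowmax defined inline:

```python
import math

def mellowmax(x, omega):
    # mm_omega(X), computed with the max subtracted for numerical stability.
    m = max(x)
    s = sum(math.exp(omega * (xi - m)) for xi in x)
    return m + (math.log(s) - math.log(len(x))) / omega

x = [1.0, 2.0, 4.0]
large = mellowmax(x, 200.0)   # approaches max(x) as omega grows
small = mellowmax(x, 1e-6)    # approaches mean(x) as omega shrinks
```

For any finite ω > 0, mmω(X) sits strictly between the mean and the maximum, approaching the two endpoints in the respective limits.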
3.5 Maximum Entropy Mellowmax Policy
As described, mmω computes a value for a list of numbers somewhere between its minimum and
maximum. However, it is often useful to actually provide a probability distribution over the actions
such that (1) a non-zero probability mass is assigned to each action, and (2) the resulting expected
value equals the computed value. Such a probability distribution can then be used for action selec-
tion in algorithms such as SARSA.
In this section, I address the problem of identifying such a probability distribution as a maximum
entropy problem—over all distributions that satisfy the properties above, pick the one that maximizes
information entropy [36, 85]. I formally define the maximum entropy mellowmax policy of a state s
as:
\[
\pi_{\text{mm}}(s) = \operatorname*{argmin}_{\pi} \sum_{a\in A} \pi(a|s)\, \log\big(\pi(a|s)\big) \tag{3.1}
\]
subject to:
\[
\sum_{a\in A}\pi(a|s)\, Q(s,a) = \text{mm}_\omega\big(Q(s,\cdot)\big)\, ,
\qquad \pi(a|s) \ge 0\, ,
\qquad \sum_{a\in A}\pi(a|s) = 1\, .
\]
Note that this optimization problem is convex and can be solved reliably using any numerical convex
optimization library.
One way of finding the solution, which leads to an interesting policy form, is to use the method of
Lagrange multipliers. Here, the Lagrangian is:
\[
\mathcal{L}(\pi,\lambda_1,\lambda_2) = \sum_{a\in A}\pi(a|s)\, \log\big(\pi(a|s)\big)
- \lambda_1\Big(\sum_{a\in A}\pi(a|s) - 1\Big)
- \lambda_2\Big(\sum_{a\in A}\pi(a|s)\, Q(s,a) - \text{mm}_\omega\big(Q(s,\cdot)\big)\Big)\, .
\]
Taking the partial derivative of the Lagrangian with respect to each π(a|s) and setting them to zero,
we obtain:
\[
\frac{\partial \mathcal{L}}{\partial \pi(a|s)} = \log\big(\pi(a|s)\big) + 1 - \lambda_1 - \lambda_2\, Q(s,a) = 0 \quad \forall\, a \in A\, .
\]
These |A| equations, together with the two linear constraints in (3.1), form |A| + 2 equations that
constrain the |A| + 2 variables π(a|s) ∀a ∈ A and the two Lagrange multipliers λ1 and λ2.
Solving this system of equations, the probability of taking an action under the maximum entropy
mellowmax policy has the form:
\[
\pi_{\text{mm}}(a|s) = \frac{e^{\beta Q(s,a)}}{\sum_{a'\in A} e^{\beta Q(s,a')}} \quad \forall\, a \in A\, ,
\]
where β is a value for which:
\[
\sum_{a\in A} e^{\beta\big(Q(s,a) - \text{mm}_\omega(Q(s,\cdot))\big)} \Big(Q(s,a) - \text{mm}_\omega\big(Q(s,\cdot)\big)\Big) = 0\, .
\]
The argument for the existence of a unique root is simple. As β → ∞, the term corresponding
to the best action dominates, and so the function is positive. Conversely, as β → −∞, the term
corresponding to the action with lowest utility dominates, and so the function is negative. Finally,
by taking the derivative, it is clear that the function is monotonically increasing, allowing
us to conclude that there exists only a single root. Therefore, we can find β easily using any root-
finding algorithm. In particular, I use Brent’s method [30], available in the SciPy library of Python.
This policy has the same form as Boltzmann softmax, but with a parameter β whose value depends
indirectly on ω. This mathematical form arose not from the structure of mmω, but from maxi-
mizing the entropy. One way to view the use of the mellowmax operator, then, is as a form of
Boltzmann policy with a temperature parameter chosen adaptively in each state to ensure that the
non-expansion property holds.
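A minimal sketch of this procedure is given below. It finds β by simple bisection (standing in for Brent's method; the monotonicity argument above guarantees a unique root) and then checks that the resulting policy satisfies the first constraint of (3.1). All names and constants are illustrative:

```python
import numpy as np

def mellowmax(q, omega):
    q = np.asarray(q, dtype=float)
    return np.log(np.mean(np.exp(omega * q))) / omega

def mm_policy(q, omega, lo=-100.0, hi=100.0, iters=200):
    """Maximum-entropy mellowmax policy: a softmax whose inverse temperature
    beta is chosen so the policy's expected value equals mm_omega(q)."""
    q = np.asarray(q, dtype=float)
    adv = q - mellowmax(q, omega)           # advantages relative to mm_omega
    f = lambda b: np.sum(np.exp(b * adv) * adv)
    for _ in range(iters):                  # f is monotonically increasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    beta = 0.5 * (lo + hi)
    w = np.exp(beta * (q - q.max()))        # shift for numerical stability
    return w / w.sum()

q = np.array([0.3, 0.1, 0.25])
pi = mm_policy(q, omega=5.0)
# The policy's expected value matches the mellowmax value (first constraint).
assert abs(pi @ q - mellowmax(q, 5.0)) < 1e-6
```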
Finally, note that the SARSA update under the maximum entropy mellowmax policy could be
thought of as a stochastic implementation of the GVI update under the mmω operator:
\[
\mathbb{E}_{\pi_{\text{mm}}}\big[r + \gamma Q(s',a') \,\big|\, s,a\big]
= \sum_{s'\in S} P(s,a,s') \Big( R(s,a,s') + \gamma \underbrace{\sum_{a'\in A} \pi_{\text{mm}}(a'|s')\, Q(s',a')}_{\text{mm}_\omega(Q(s',\cdot))} \Big)
\]
due to the first constraint of the convex optimization problem (3.1). Because mellowmax is a
non-expansion, SARSA with the maximum entropy mellowmax policy is guaranteed to converge to
a unique fixed point. Note also that, similar to other variants of SARSA, the algorithm simply
bootstraps using the value of the next state while implementing the new policy.
3.6 Experiments in the Tabular Setting
Before presenting experiments, I note that in practice computing mellowmax can yield overflow if
the exponentiated values are large. In this case, we can safely shift the values by a constant before
exponentiating them due to the following equality:
\[
\frac{\log\!\big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\big)}{\omega}
= c + \frac{\log\!\big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega (x_i - c)}\big)}{\omega}\, .
\]
A value of \(c = \max_i x_i\) usually avoids overflow.
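A minimal implementation of this shifted computation might look as follows; the input values are chosen so that naive exponentiation would overflow:

```python
import numpy as np

def mellowmax_stable(x, omega):
    """Shift by c = max(x) before exponentiating to avoid overflow."""
    x = np.asarray(x, dtype=float)
    c = x.max()
    return c + np.log(np.mean(np.exp(omega * (x - c)))) / omega

x = np.array([500.0, 400.0, 100.0])
# exp(5 * 500) would overflow a double; the shifted form stays finite.
print(mellowmax_stable(x, omega=5.0))  # close to max(x) = 500
```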
3.6.1 Two-state MDP
I repeat the experiment from Figure 3.4 for mellowmax with ω = 16.55 to get a vector field. The
result, presented in Figure 3.5, shows a rapid and steady convergence towards the unique fixed point.
As a result, GVI under mmω can terminate significantly faster than GVI under boltzβ , as illustrated
in Figure 3.6.
Figure 3.5: GVI updates under mmω=16.55. The fixed point is unique, and all updates move quickly toward the fixed point.
Figure 3.6: Number of iterations before termination of GVI on the example MDP. GVI under mmω
outperforms the alternatives.
3.6.2 Random MDPs
The example in Figure 3.1 was contrived. It is interesting to know whether such examples are likely
to be encountered naturally. To this end, I constructed 200 MDPs as follows: I sampled |S| from
{2, 3, ..., 10} and |A| from {2, 3, 4, 5} uniformly at random. I initialized the transition probabilities
by sampling uniformly from [0, .01]. I then added to each entry, with probability 0.5, Gaussian noise
with mean 1 and variance 0.1. I next added, with probability 0.1, Gaussian noise with mean 100
and variance 1. Finally, I normalized the raw values to ensure a valid transition matrix. I followed
a similar process for rewards, with the difference that I divided each entry by the maximum entry
and multiplied by 0.5 to ensure that Rmax = 0.5.
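The construction above can be sketched as follows. This is an illustrative reconstruction of the sampling recipe, not the original experimental code; in particular, taking absolute values before normalizing is an added assumption to keep the raw entries nonnegative:

```python
import numpy as np

def random_mdp(rng):
    """Sample an MDP following the recipe in the text (an illustrative sketch)."""
    n_s = int(rng.integers(2, 11))          # |S| uniform over {2, ..., 10}
    n_a = int(rng.integers(2, 6))           # |A| uniform over {2, ..., 5}
    raw = rng.uniform(0.0, 0.01, size=(n_s, n_a, n_s))
    mask = rng.random(raw.shape) < 0.5      # with prob. 0.5: add N(1, var 0.1)
    raw[mask] += rng.normal(1.0, np.sqrt(0.1), size=int(mask.sum()))
    mask = rng.random(raw.shape) < 0.1      # with prob. 0.1: add N(100, var 1)
    raw[mask] += rng.normal(100.0, 1.0, size=int(mask.sum()))
    raw = np.abs(raw)                       # assumption: keep entries nonnegative
    T = raw / raw.sum(axis=-1, keepdims=True)   # normalize rows into probabilities
    R = rng.uniform(0.0, 0.01, size=(n_s, n_a, n_s))
    R = 0.5 * R / R.max()                   # rescale so that Rmax = 0.5
    return T, R

T, R = random_mdp(np.random.default_rng(0))
```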
I measured the failure rate of GVI under boltzβ and mmω by stopping GVI when it did not terminate
in 1000 iterations. I also computed the average number of iterations needed before termination. A
summary of results is presented in the table below. Mellowmax outperforms Boltzmann based on
the three measures provided below.
         | MDPs, no termination | MDPs, > 1 fixed point | average iterations
boltzβ   | 8 of 200             | 3 of 200              | 231.65
mmω      | 0 of 200             | 0 of 200              | 201.32

Table 3.1: A comparison between mellowmax and Boltzmann in terms of convergence to a unique fixed point.
Figure 3.7: Multi-passenger taxi domain. The discount rate γ is 0.99. Reward is +1 for delivering one passenger, +3 for two passengers, and +15 for three passengers. Reward is zero for all other transitions. Here F, S, and D denote passengers, the start state, and the destination, respectively.
3.6.3 Multi-passenger Taxi Domain
I evaluated SARSA on the multi-passenger taxi domain introduced by Dearden et al. [37]. (See
Figure 3.7.)
One challenging aspect of this domain is that it admits many locally optimal policies. Exploration
needs to be set carefully to avoid either over-exploring or under-exploring the state space. Note
also that Boltzmann softmax performs remarkably well on this domain, outperforming sophisticated
Bayesian reinforcement-learning algorithms [37].
Figure 3.8: Comparison on the multi-passenger taxi domain. Results are shown for different values of ε, β, and ω. For each setting, the learning rate is optimized. Results are averaged over 25 independent runs, each consisting of 300,000 time steps.
As shown in Figure 3.8, SARSA with the epsilon-greedy policy performs poorly. In fact, in our experiment, the algorithm rarely was able to deliver all
the passengers. However, SARSA with Boltzmann softmax and SARSA with the maximum entropy
mellowmax policy achieved significantly higher average reward. The maximum entropy mellowmax policy
is no worse than Boltzmann softmax here, suggesting that the greater stability does not come at
the expense of less effective exploration.
Parameters        | Acrobot | Lunar Lander | Atari
learning rate     | 10−3    | 10−4         | 0.00025
neural network    | MLP     | MLP          | CNN
layers            | 3       | 3            | 4
neurons per layer | 300     | 500          | –
update frequency  | 100     | 200          | 10000
number of runs    | 100     | 50           | 5
processing unit   | CPU     | GPU          | GPU

Table 3.2: Experimental details for evaluating DeepMellow for each domain.
3.7 Experiments in the Function Approximation Setting
I now move to the function approximation case where we use mellowmax as the bootstrapping
operator in Deep Q Networks (DQN) [73], hence the name DeepMellow. We simply change the
bootstrapping operator of DQN from max to mellowmax.
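The change from DQN to DeepMellow can be sketched as a one-line swap in the computation of TD targets. The sketch below is illustrative only (NumPy rather than a deep-learning framework, and all names are assumptions):

```python
import numpy as np

def mellowmax(q, omega):
    """Row-wise mellowmax over actions, with a max-shift for stability."""
    c = q.max(axis=1, keepdims=True)
    return c[:, 0] + np.log(np.mean(np.exp(omega * (q - c)), axis=1)) / omega

def td_targets(rewards, next_q, dones, gamma=0.99, omega=None):
    """DQN targets use max_a' Q(s', a'); DeepMellow swaps in mellowmax."""
    boot = next_q.max(axis=1) if omega is None else mellowmax(next_q, omega)
    return rewards + gamma * (1.0 - dones) * boot

next_q = np.array([[1.0, 2.0], [0.5, 0.25]])
r = np.array([0.0, 1.0])
d = np.array([0.0, 1.0])  # second transition is terminal
print(td_targets(r, next_q, d))             # DQN-style targets
print(td_targets(r, next_q, d, omega=5.0))  # DeepMellow targets (bounded above by DQN's)
```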
We tested DeepMellow in two control domains (Acrobot, Lunar Lander) and two Atari games (Break-
out, Seaquest). We used multilayer perceptrons (MLPs) for the control domains, and convolutional
neural networks (CNNs) for Atari games. The parameters and neural network architectures for each
domain are summarized in Table 3.2. For the Atari game domains, we used the same CNN as [73].
Target network update frequency is a crucial factor in our experiments. In DQN, while the real
action-value function is updated every iteration, the target network is only updated every C iterations—
we call C the target network update frequency.1 When C > 1, the target network is updated with a
delay. On the other hand, setting C = 1 means that, after every update, the target network is copied
from the real action-value function immediately; we will use C = 1 as a synonym for eliminating the
separate target network.
Choice of Temperature Parameter ω
The temperature parameter ω of DeepMellow is an additional parameter that should be tuned.
Both too large and too small ω values are bad—as ω increases, Mellowmax behaves more like a
max operator, so there is no advantage to using it. With an ω close to zero, Mellowmax behaves
like a mean operator, resulting in persistent random behaviors. Intermediate values yield better
performances than the two extremes, but the beginning and end of the “reasonable performance
range” of the parameter vary with domain. To find the optimal ranges of the ω parameter for each
domain, we used a grid search method, as we did for other hyperparameters. We empirically found
that simpler domains (Acrobot, Lunar Lander) require relatively smaller ω values while large-scale
Atari domains require larger values.
1Though we use the term “frequency”, “period” might be more apt.
Figure 3.9: The performance of DeepMellow (no target network) and DQN (no target network) in control domains (left) and Atari games (right). DeepMellow outperforms DQN in all domains, in the absence of a target network. Note that the best-performing temperature ω values vary across domains.
For Acrobot and Lunar Lander, our parameter search set was ω ∈ {1, 2, 5, 10}. For Breakout and Seaquest, we tested ω ∈ {100, 250, 1000, 2000, 3000, 5000} and
ω ∈ {10, 20, 30, 40, 50, 100, 200}, respectively. An adaptive approach to choosing this parameter can
benefit DeepMellow, but we leave this direction for future work.
3.7.1 DeepMellow vs DQN without a Target Network
We first compared DeepMellow and DQN in the absence of a target network (i.e., target network
update frequency C = 1). The control domain results are shown in Figure 3.9 (left). In Acrobot,
DeepMellow achieves more stable learning than DQN—without a target network, the learning
curve of DQN goes upward fast, but soon starts fluctuating and fails to improve towards the end.
By contrast, DeepMellow (especially with temperature parameter ω = 1) succeeds. Similar results
are observed in Lunar Lander. DeepMellow (ω ∈ {1, 2}) achieves more stable learning and higher
average returns than DQN without a target network.
In order to quantify the performances in each domain, we computed the areas under the curves
for DeepMellow and DQN (for the first 1000 episodes; the y-axis lower bound is −500). Normalizing
the areas under the curves of DeepMellow (ω = 1) to 100%, the areas under the curves of DeepMellow
(ω = 2) and DQN were 88.9% and 78.7%, respectively, in Acrobot. Similarly, in the Lunar Lander
domain, the areas under the curves of DeepMellow (ω = 2) and DQN were 93.4% and 79.2% of
that of DeepMellow (ω = 1). In both domains, DeepMellow (ω = 1) performed best, followed by
DeepMellow (ω = 2), and then by DQN.
Figure 3.10: Performances of DeepMellow (no target network) and DQN (with a target network). If tuned with an optimal temperature parameter ω value, DeepMellow learns faster than DQN with a target network.
We also compared the performances of DeepMellow and DQN in two Atari games, Breakout and
Seaquest. We chose these two domains because the effects of having a target network are known to
be different in each domain [73]. In Breakout, the performance of DQN does not differ significantly
with and without a target network. On the other hand, Seaquest is a domain that shows a significant
performance drop when the target network is absent. Thus, these two domains are two contrasting
examples for us to see whether DeepMellow obviates the need for a target network.
Figure 3.9 (right) shows the performances of DeepMellow and DQN in these games. We used similar
methods to quantify their performances (computing the areas under the curves), and the results were
as follows: in Breakout, compared with the area under the curve of DeepMellow (ω = 1000), those
of DeepMellow (ω = 250, 3000, 5000) and DQN were 93.1%, 68.5%, 67.1%, and 64.5%, respectively.
In Seaquest, the performance gaps widened as expected: the areas under the curves of DeepMellow
(ω = 20, 40) and DQN were 69.5%, 31.1%, and 14.9% of that of DeepMellow (ω = 30).
DeepMellow performed better than DQN without a target network in both Breakout and Seaquest;
especially in Seaquest, the performance gap was substantial. Also, note that there are intermediate
ω values that yield the best performance of DeepMellow in each domain (ω = 30 is better than ω = 20
or ω = 40 in Seaquest; ω = 1000 is better than ω = 250, 3000, or 5000 in Breakout). These
results are consistent with the property of the mellowmax operator that both too-large (behaving like
the max operator) and too-small (entailing random action selection) ω values degrade performance.
3.7.2 DeepMellow vs DQN with a Target Network
In the previous section, we showed that DeepMellow outperforms DQN without a target network.
The next question that naturally arises is whether DeepMellow without a target network performs
even better than DQN with a target network.
To see if DeepMellow has an advantage over DQN with a target network, we compared their perfor-
mances, focusing on their learning speed. Our prediction is that DeepMellow will learn faster than
DQN, because DQN’s updates are delayed (C > 1), and DeepMellow is likely to react faster to the
environment.
As shown in Figure 3.10, DeepMellow does learn faster than DQN in the Lunar Lander, Breakout, and
Seaquest domains. In Acrobot (not shown), there was no significant difference, because both algorithms
learned very quickly. In Lunar Lander, DeepMellow reaches a score of 0 at episode 517 on
average, while DQN reaches the same point around episode 561 on average. (DeepMellow is 8%
faster than DQN in reaching the zero score.)
In Breakout and Seaquest, DeepMellow learns faster than DQN and achieves higher performance
if tuned with an optimal ω parameter. In Breakout, DeepMellow (ω = 1000) reaches scores of
15 and 20 at timesteps 55 × 10⁴ and 95 × 10⁴, while DQN reaches them at timesteps 153 × 10⁴ and
503 × 10⁴. Similarly, in Seaquest, DeepMellow (ω = 30) reaches scores of 600 and 800 at timesteps
140 × 10⁴ and 193 × 10⁴, while DQN reaches them at 296 × 10⁴ and 406 × 10⁴. We also observed
that, in both Atari games, DeepMellow achieves higher scores than DQN. In Breakout, at timestep
1000 × 10⁴, the scores of DeepMellow (ω = 1000) and DQN are 38 and 26, respectively, which means
that the score of DeepMellow at this timepoint is 42.6% higher than DQN's. Similarly, in Seaquest,
at timestep 800 × 10⁴, the scores of DeepMellow (ω = 30) and DQN are 1205 and 962, respectively.
At this timepoint, DeepMellow achieves a score that is 25.2% higher than DQN's.
3.8 Conclusion
In this chapter, I proposed and evaluated the mellowmax operator as an appealing smooth ap-
proximation of max. I showed that mellowmax has several desirable properties and that it works
favorably in practice, including on large Atari games. Arguably, mellowmax could be used in place
of Boltzmann throughout reinforcement-learning research.
Chapter 4
Smoothness in Model-based
Reinforcement Learning
When could a learned model be useful for planning? In this chapter I answer this question by
examining the impact of learning Lipschitz continuous models. I make the case for a new loss
function for model-based RL based on the Wasserstein metric. By formalizing a good one-step
model as a Lipschitz model with bounded one-step Wasserstein error, I provide a novel bound on
multi-step prediction error of Lipschitz models, as well as on value-estimation error of the one-step
model. I conclude with empirical results that show the benefits of controlling the Lipschitz constant
of transition models when they are represented using neural networks.
4.1 Introduction
The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of
the environment to plan and make high-quality decisions [54, 103, 11]. Although the behavior of
model-based algorithms in tabular environments is well understood and can be effective [103], scaling
up to the approximate setting can cause instabilities. Even small model errors can be magnified by
the planning process resulting in poor performance [108].
In this chapter, I study model-based RL through the lens of Lipschitz continuity. I show that the
ability of a model to make accurate multi-step predictions hinges on not just the model’s one-step
accuracy, but also the magnitude of the Lipschitz constant (smoothness) of the model. I further
show that the dependence on the Lipschitz constant carries over to the value-prediction problem,
ultimately influencing the quality of the policy found by planning.
I consider a setting with continuous state spaces and stochastic transitions where I quantify the
distance between distributions using the Wasserstein metric. I introduce a novel characterization
of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set
of component deterministic functions. This allows us to study any such stochastic dynamics through the
Lipschitz continuity of the component deterministic functions. To learn a Lipschitz model class in
continuous state spaces, I provide an Expectation-Maximization algorithm [38].
One promising direction for mitigating the effects of inaccurate models is the idea of limiting the
complexity of the learned models or reducing the horizon of planning [52]. Doing so can sometimes
make models more useful, much as regularization in supervised learning can improve generalization
performance [111]. I examine a type of regularization that comes from controlling the Lipschitz
constant of models. This regularization technique can be applied efficiently, as I will show, when we
represent the transition model by neural networks.
4.2 Lipschitz Model Class
We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a
set of deterministic components.
Definition 4.2.1. Given a metric state space (S, dS) and an action space A, we define Fg as a
collection of functions Fg = {f : S → S} distributed according to g(f | a), where a ∈ A. We say
that Fg is a Lipschitz model class if
\[
K_F := \sup_{f \in F_g} K_{d_S,d_S}(f)
\]
is finite.
27
Figure 4.1: An example of a Lipschitz model class in a gridworld environment [93]. The dynamics are such that any action choice results in an attempted transition in the corresponding direction with probability 0.8 and in the neighboring directions with probabilities 0.1 and 0.1. We can define Fg = {fup, fright, fdown, fleft}, where each f outputs a deterministic next position in the grid (factoring in obstacles). For a = up, we have: g(fup | a = up) = 0.8, g(fright | a = up) = g(fleft | a = up) = 0.1, and g(fdown | a = up) = 0. Defining distances between states as their Manhattan distance in the grid, we have ∀f: sup_{s1,s2} d(f(s1), f(s2))/d(s1, s2) = 2, and so KF = 2. So, the four functions and g comprise a Lipschitz model class.
Our definition captures a subset of stochastic transitions, namely ones that can be represented as
a state-independent distribution over deterministic transitions. An example is provided in Figure 4.1.
Associated with a Lipschitz model class is a transition function given by:
\[
T(s' \mid s, a) = \sum_{f} \mathbb{1}\big(f(s) = s'\big)\, g(f \mid a)\, .
\]
Given a state distribution µ(s), I also define a generalized notion of transition function TG(· | µ, a)
given by:
\[
T_G(s' \mid \mu, a) = \int_{s} \underbrace{\sum_{f} \mathbb{1}\big(f(s) = s'\big)\, g(f \mid a)}_{T(s' \mid s, a)}\; \mu(s)\, ds\, .
\]
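The gridworld example of Figure 4.1 can be sketched in code. The sketch below uses a simplified, obstacle-free 3×3 grid, so the enumerated Lipschitz constant is 1 rather than the 2 reported for the grid with obstacles in the figure; function and variable names are illustrative:

```python
import itertools

# A simplified, obstacle-free 3x3 gridworld version of Figure 4.1 (a sketch).
SIZE = 3
def clip(p):
    return (min(max(p[0], 0), SIZE - 1), min(max(p[1], 0), SIZE - 1))

F = {  # component deterministic functions, one per direction
    "up":    lambda s: clip((s[0] - 1, s[1])),
    "down":  lambda s: clip((s[0] + 1, s[1])),
    "left":  lambda s: clip((s[0], s[1] - 1)),
    "right": lambda s: clip((s[0], s[1] + 1)),
}
g = {"up": 0.8, "left": 0.1, "right": 0.1, "down": 0.0}  # g(f | a = up)

def manhattan(s1, s2):
    return abs(s1[0] - s2[0]) + abs(s1[1] - s2[1])

# K_F = sup_f sup_{s1 != s2} d(f(s1), f(s2)) / d(s1, s2), here by enumeration.
states = list(itertools.product(range(SIZE), repeat=2))
KF = max(manhattan(f(s1), f(s2)) / manhattan(s1, s2)
         for f in F.values() for s1, s2 in itertools.permutations(states, 2))

# Induced transition function T(s' | s, a=up) = sum_f 1(f(s) = s') g(f | a).
def T(s):
    out = {}
    for name, f in F.items():
        out[f(s)] = out.get(f(s), 0.0) + g[name]
    return out

probs = T((1, 1))  # e.g., moves up with probability 0.8
```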
We are primarily interested in \(K^A_{d,d}(T_G)\), the Lipschitz constant of TG. However, since TG takes
a probability distribution as input and also outputs a probability distribution, we require a notion of
distance between two distributions. This notion is quantified using the Wasserstein metric and is justified in
the next section.
4.3 On the Choice of Probability Metric
I consider the stochastic model-based setting and show through an example that the Wasserstein
metric is a reasonable choice compared to other common options.
Figure 4.2: A state distribution µ(s) (top), a stochastic environment that randomly adds or subtracts c1 (middle), and an approximate transition model that randomly adds or subtracts a second scalar c2 (bottom).
Consider a uniform distribution over states µ(s) as shown in black in Figure 4.2 (top). Take a
transition function TG in the environment that, given an action a, uniformly randomly adds or
subtracts a scalar c1. The distribution of states after one transition is shown in red in Figure 4.2
(middle). Now, consider a transition model \(\hat{T}_G\) that approximates TG by uniformly randomly adding
or subtracting the scalar c2. The distribution over states after one transition using this imperfect
model is shown in blue in Figure 4.2 (bottom). We desire a metric that captures the similarity
between the outputs of the two transition functions. I first consider the Kullback-Leibler (KL) divergence
and observe that:
\[
KL\big(T_G(\cdot \mid \mu, a)\,\|\,\hat{T}_G(\cdot \mid \mu, a)\big)
:= \int T_G(s' \mid \mu, a)\, \log\frac{T_G(s' \mid \mu, a)}{\hat{T}_G(s' \mid \mu, a)}\, ds' = \infty\, ,
\]
unless the two constants are exactly the same.
The next possible choice is Total Variation (TV), defined as:
\[
TV\big(T_G(\cdot \mid \mu, a),\, \hat{T}_G(\cdot \mid \mu, a)\big)
:= \frac{1}{2} \int \big| T_G(s' \mid \mu, a) - \hat{T}_G(s' \mid \mu, a) \big|\, ds' = 1\, ,
\]
if the two distributions have disjoint supports, regardless of how far apart the supports are.
In contrast, Wasserstein is sensitive to how far apart the constants are:
\[
W\big(T_G(\cdot \mid \mu, a),\, \hat{T}_G(\cdot \mid \mu, a)\big) = |c_1 - c_2|\, .
\]
It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely \(\hat{T}_G\)
approximates TG. This is particularly important in high-dimensional spaces, where the true distribution
is known to usually lie on low-dimensional manifolds [79].
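The example can also be checked numerically. In the sketch below, the one-dimensional Wasserstein-1 distance between two empirical samples is the mean gap between their sorted values; KL and TV are not computed because, as argued above, they are ∞ and 1 for these disjoint supports. Constants and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.uniform(0.0, 1.0, size=100_000)      # start-state distribution mu(s)
c1, c2 = 2.0, 2.5                             # environment and model offsets
sign = rng.choice([-1.0, 1.0], size=mu.size)  # shared coin flip for both
true_next = mu + sign * c1                    # environment: add/subtract c1
model_next = mu + sign * c2                   # model: add/subtract c2 instead

# 1-D Wasserstein-1 between empirical samples: mean gap of sorted values.
# (KL is infinite and TV is 1 here, since the supports are disjoint.)
w = np.mean(np.abs(np.sort(true_next) - np.sort(model_next)))
print(w)  # close to |c1 - c2| = 0.5
```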
In the next section, I present a theoretical argument for the choice of Wasserstein.
4.3.1 Value-Aware Model Learning (VAML) Loss
In model-based RL, it is very common to have a model-learning process that is agnostic to the specific
planning process. In such cases, the usefulness of the model for the specific planning procedure comes
as an afterthought. In contrast, the basic idea behind VAML [39] is to learn a model tailored to the
planning algorithm that intends to use it. To illustrate this idea, consider Bellman equations [24]
which are at the core of many planning and RL algorithms [103]:
\[
Q(s,a) = R(s,a) + \gamma \int T(s' \mid s,a)\, f\big(Q(s',\cdot)\big)\, ds'\, ,
\]
where f can generally be any arbitrary operator [68], such as max. We also define:
\[
v(s') := f\big(Q(s',\cdot)\big)\, .
\]
A good model \(\hat{T}\) could then be thought of as one that minimizes the error:
\begin{align*}
l(\hat{T}, T)(s,a) &= R(s,a) + \gamma \int T(s' \mid s,a)\, v(s')\, ds'
- R(s,a) - \gamma \int \hat{T}(s' \mid s,a)\, v(s')\, ds' \\
&= \gamma \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds'\, .
\end{align*}
Note that minimizing this objective requires access to the value function in the first place, but we
can obviate this need by leveraging Hölder's inequality:
\[
l(\hat{T}, T)(s,a) = \gamma \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds'
\le \gamma \big\| T(\cdot \mid s,a) - \hat{T}(\cdot \mid s,a) \big\|_1 \, \|v\|_\infty\, .
\]
Further, we can use Pinsker's inequality to write:
\[
\big\| T(\cdot \mid s,a) - \hat{T}(\cdot \mid s,a) \big\|_1
\le \sqrt{2\, KL\big(T(\cdot \mid s,a)\,\|\,\hat{T}(\cdot \mid s,a)\big)}\, .
\]
This justifies the use of maximum likelihood estimation for model learning, a common practice in
model-based RL [13, 2, 3], since maximum likelihood estimation is equivalent to empirical KL minimization.
However, there exists a major drawback with the KL objective, namely that it ignores the structure
of the value function during model learning. As a simple example, if the value function is constant
across the state space, any randomly chosen model \(\hat{T}\) will, in fact, yield zero Bellman error. However,
a model-learning algorithm that ignores the structure of the value function can potentially require
many samples to provide any guarantee about the performance of the learned policy.
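The constant-value-function remark can be verified in a few lines; this is a toy illustration with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
T_true = rng.random(n); T_true /= T_true.sum()   # T(. | s, a)
T_hat = rng.random(n);  T_hat /= T_hat.sum()     # an arbitrary, unrelated model
v_const = np.full(n, 3.7)                        # a constant value function

# l = gamma * sum_{s'} (T - T_hat) v(s'): zero for any model when v is
# constant, because both T and T_hat sum to one.
gamma = 0.99
loss = gamma * np.sum((T_true - T_hat) * v_const)
assert abs(loss) < 1e-12
```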
Consider the objective function \(l(\hat{T}, T)\), and notice again that v itself is not known, so we cannot
directly optimize this objective. Farahmand et al. [40] proposed to search for a model that
results in the lowest error over all possible value functions belonging to a specific class:
\[
L(\hat{T}, T)(s,a) = \sup_{v \in \mathcal{F}} \Big| \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds' \Big|^2 \tag{4.1}
\]
Note that minimizing this objective is shown to be tractable if, for example, F is restricted to the
class of exponential functions. Observe that the VAML objective (4.1) is similar to the dual form of
Wasserstein:
\[
W(\mu_1, \mu_2) = \sup_{f:\, K_{d,d_R}(f) \le 1} \int \big(\mu_1(s) - \mu_2(s)\big)\, f(s)\, ds\, ,
\]
but the difference is in the space of value functions. In the next section, we show that, under certain
conditions, even the spaces of value functions are the same.
4.3.2 Lipschitz Generalized Value Iteration
We show that solving for a class of Bellman equations yields a Lipschitz value function. Our proof
is in the context of GVI [68], which defines Value Iteration [24] with arbitrary backup operators.
We make use of the following lemmas.
Lemma 4.3.1. Given a non-expansion f : S → R:
\[
K^{A}_{d_S,d_R}\Big( \int T(s' \mid s,a)\, f(s')\, ds' \Big) \le K^{A}_{d_S,W}(T)\, .
\]
Proof. Starting from the definition, we write:
\begin{align*}
K^{A}_{d_S,d_R}\Big(\int T(s' \mid s,a)\, f(s')\, ds'\Big)
&= \sup_a \sup_{s_1,s_2} \frac{\big|\int \big(T(s' \mid s_1,a) - T(s' \mid s_2,a)\big)\, f(s')\, ds'\big|}{d(s_1,s_2)} \\
&\le \sup_a \sup_{s_1,s_2} \frac{\big|\sup_{g} \int \big(T(s' \mid s_1,a) - T(s' \mid s_2,a)\big)\, g(s')\, ds'\big|}{d(s_1,s_2)}
\quad \text{(where } K_{d_S,d_R}(g) \le 1\text{)} \\
&= \sup_a \sup_{s_1,s_2} \frac{\sup_{g} \int \big(T(s' \mid s_1,a) - T(s' \mid s_2,a)\big)\, g(s')\, ds'}{d(s_1,s_2)} \\
&= \sup_a \sup_{s_1,s_2} \frac{W\big(T(\cdot \mid s_1,a),\, T(\cdot \mid s_2,a)\big)}{d(s_1,s_2)} = K^{A}_{d_S,W}(T)\, .
\end{align*}
Lemma 4.3.2. The following operators are non-expansions (\(K_{\|\cdot\|_\infty,d_R}(\cdot) = 1\)):
1. max(x), mean(x)
2. ε-greedy(x) := ε mean(x) + (1 − ε) max(x)
3. \(\text{mm}_\beta(x) := \frac{\log\!\big(\frac{1}{n}\sum_i e^{\beta x_i}\big)}{\beta}\)
Proof. 1 is proven by Littman & Szepesvári [68]. 2 follows from 1 (metrics not shown for brevity):
\begin{align*}
K\big(\text{ε-greedy}(x)\big) &= K\big(\varepsilon\, \text{mean}(x) + (1-\varepsilon)\max(x)\big) \\
&\le \varepsilon\, K\big(\text{mean}(x)\big) + (1-\varepsilon)\, K\big(\max(x)\big) = 1\, .
\end{align*}
Finally, 3 is proven multiple times in the literature [6, 78, 80].
Algorithm 2 GVI algorithm
Input: initial Q(s, a), δ, and a choice of operator f
repeat
  diff ← 0
  for each s ∈ S do
    for each a ∈ A do
      Qcopy ← Q(s, a)
      Q(s, a) ← R(s, a) + γ ∫ T(s′ | s, a) f(Q(s′, ·)) ds′
      diff ← max{diff, |Qcopy − Q(s, a)|}
    end for
  end for
until diff < δ
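For a finite MDP, Algorithm 2 can be implemented in a few lines. The sketch below uses synchronous sweeps rather than the in-place updates of the pseudocode, which does not affect the fixed point; names and the toy MDP are illustrative:

```python
import numpy as np

def gvi(T, R, gamma, f, delta=1e-8):
    """Generalized Value Iteration for a finite MDP (synchronous sweeps).
    T: (S, A, S) transition tensor, R: (S, A) rewards, f: backup operator
    mapping a vector of action values to a scalar (e.g. np.max or np.mean)."""
    Q = np.zeros(R.shape)
    while True:
        v = np.array([f(q) for q in Q])        # v(s') = f(Q(s', .))
        Q_new = R + gamma * (T @ v)            # batched Bellman backup
        if np.max(np.abs(Q_new - Q)) < delta:  # diff < delta: terminate
            return Q_new
        Q = Q_new

# Toy chain: s0 -> s1 (reward 1), s1 -> s1 (reward 0), one action.
T = np.zeros((2, 1, 2))
T[0, 0, 1] = T[1, 0, 1] = 1.0
R = np.array([[1.0], [0.0]])
print(gvi(T, R, gamma=0.5, f=np.max))  # Q(s0) = 1.0, Q(s1) = 0.0
```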
I now present the main result of the section.
Theorem 4.3.3. For any choice of backup operator f outlined in Lemma 4.3.2, GVI computes a
value function with a Lipschitz constant bounded by
\(\frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)}\), provided that \(\gamma K^{A}_{d_S,W}(T) < 1\).
Proof. From Algorithm 2, in the nth round of GVI updates we have:
\[
Q_{n+1}(s,a) \leftarrow R(s,a) + \gamma \int T(s' \mid s, a)\, f\big(Q_n(s',\cdot)\big)\, ds'\, .
\]
First observe that:
\begin{align*}
K^{A}_{d_S,d_R}(Q_{n+1})
&\le K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,d_R}\Big(\int T(s' \mid s, a)\, f\big(Q_n(s',\cdot)\big)\, ds'\Big) \\
&\le K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,W}(T)\, K_{d_S,d_R}\big(f(Q_n(s,\cdot))\big)
&& \text{(due to Lemma 4.3.1)} \\
&\le K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,W}(T)\, K_{\|\cdot\|_\infty,d_R}(f)\, K^{A}_{d_S,d_R}(Q_n)
&& \text{(due to the Composition Lemma)} \\
&= K^{A}_{d_S,d_R}(R) + \gamma\, K^{A}_{d_S,W}(T)\, K^{A}_{d_S,d_R}(Q_n)
&& \text{(due to Lemma 4.3.2, the non-expansion property of } f\text{)}
\end{align*}
Equivalently:
\[
K^{A}_{d_S,d_R}(Q_{n+1}) \le K^{A}_{d_S,d_R}(R) \sum_{i=0}^{n} \big(\gamma K^{A}_{d_S,W}(T)\big)^i
+ \big(\gamma K^{A}_{d_S,W}(T)\big)^{n+1} K^{A}_{d_S,d_R}(Q_0)\, .
\]
By computing the limit of both sides, we get:
\[
\lim_{n\to\infty} K^{A}_{d_S,d_R}(Q_n) \le \frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)} + 0\, ,
\]
where we used the fact that \(\lim_{n\to\infty} \big(\gamma K^{A}_{d_S,W}(T)\big)^n = 0\) since
\(\gamma K^{A}_{d_S,W}(T) < 1\). This concludes the proof.
Now notice that, as defined earlier:
\[
V_n(s) := f\big(Q_n(s,\cdot)\big)\, ,
\]
so as a relevant corollary of our theorem we get:
\[
K_{d_S,d_R}\big(v(s)\big) = \lim_{n\to\infty} K_{d_S,d_R}(V_n)
= \lim_{n\to\infty} K_{d_S,d_R}\big(f(Q_n(s,\cdot))\big)
\le \lim_{n\to\infty} K^{A}_{d_S,d_R}(Q_n)
\le \frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)}\, .
\]
That is, solving for the fixed point of this general class of Bellman equations results in a Lipschitz
state-value function.
4.3.3 Equivalence Between VAML and Wasserstein
We now show the main claim of this section, namely that minimizing the VAML objective is the
same as minimizing the Wasserstein metric.
Consider again the VAML objective:
\[
L(\hat{T}, T)(s,a) = \sup_{v \in \mathcal{F}} \Big| \int \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, v(s')\, ds' \Big|^2\, ,
\]
where F can generally be any class of functions. From our theorem, however, the space of value
functions F should be restricted to Lipschitz functions. Moreover, it is easy to design an MDP and
a policy such that a desired Lipschitz value function is attained.
This space LC can then be defined as follows:
\[
L_C = \{f : K_{d_S,d_R}(f) \le C\}\, ,
\quad\text{where}\quad
C = \frac{K^{A}_{d_S,d_R}(R)}{1-\gamma K^{A}_{d_S,W}(T)}\, .
\]
So we can rewrite the VAML objective L as follows:
\begin{align*}
L(\hat{T}, T)(s,a)
&= \sup_{f \in L_C} \Big| \int f(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2 \\
&= \sup_{f \in L_C} \Big| \int C\, \frac{f(s')}{C} \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2 \\
&= C^2 \sup_{g \in L_1} \Big| \int g(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2\, .
\end{align*}
It is clear that a function g that maximizes the Kantorovich-Rubinstein dual form:
\[
\sup_{g \in L_1} \int g(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds'
:= W\big(T(\cdot \mid s,a),\, \hat{T}(\cdot \mid s,a)\big)\, ,
\]
will also maximize:
\[
L(\hat{T}, T)(s,a) = \Big| \int g(s') \big(T(s' \mid s,a) - \hat{T}(s' \mid s,a)\big)\, ds' \Big|^2\, .
\]
This is due to the fact that g ∈ L1 implies −g ∈ L1, and so taking the absolute value or squaring
the term does not change the arg max in this case.
As a result:
\[
L(\hat{T}, T)(s,a) = \Big( C\, W\big(T(\cdot \mid s,a),\, \hat{T}(\cdot \mid s,a)\big) \Big)^2\, .
\]
This highlights a nice property of Wasserstein, namely that minimizing this metric yields a value-
aware model. Therefore, the strong theoretical properties shown for the value-aware loss [39] further
justify our choice of Wasserstein, assuming that the Lipschitz assumption holds.
4.4 Understanding the Compounding Error Phenomenon
To extract a prediction with a horizon n > 1, model-based algorithms typically apply the model for n
steps by taking the state input in step t to be the state output from the step t−1. Previous work has
shown that model error can result in poor long-horizon predictions and ineffective planning [108, 109].
Observed even beyond reinforcement learning [71, 114], this is referred to as the compounding error
phenomenon. The goal of this section is to provide a bound on multi-step prediction error of a
model. In light of the previous section, I formalize the notion of model accuracy below:
Definition 4.4.1. Given an MDP with a transition function T, we identify a Lipschitz model class Fg
as ∆-accurate if its induced \(\hat{T}\) satisfies:
\[
\forall s\ \forall a \quad W\big(\hat{T}(\cdot \mid s,a),\, T(\cdot \mid s,a)\big) \le \Delta\, .
\]
We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error
and the Lipschitz constant of the transition function TG . I provide a bound on the Lipschitz constant
of TG using the following lemma:
Lemma 4.4.1. A generalized transition function TG induced by a Lipschitz model class Fg is Lipschitz
with a constant:
\[
K^{A}_{W,W}(T_G) := \sup_a \sup_{\mu_1,\mu_2} \frac{W\big(T_G(\cdot \mid \mu_1, a),\, T_G(\cdot \mid \mu_2, a)\big)}{W(\mu_1,\mu_2)} \le K_F\, .
\]
Proof.
\begin{align*}
W\big(T_G(\cdot \mid \mu_1, a),\, T_G(\cdot \mid \mu_2, a)\big)
&:= \inf_{j} \int_{s'_1}\!\int_{s'_2} j(s'_1, s'_2)\, d(s'_1, s'_2)\, ds'_1\, ds'_2 \\
&= \inf_{j} \int_{s_1}\!\int_{s_2}\!\int_{s'_1}\!\int_{s'_2} \sum_f \mathbb{1}\big(f(s_1)=s'_1 \wedge f(s_2)=s'_2\big)\, j(s_1,s_2,f)\, d(s'_1,s'_2)\, ds'_1\, ds'_2\, ds_1\, ds_2 \\
&= \inf_{j} \int_{s_1}\!\int_{s_2} \sum_f j(s_1,s_2,f)\, d\big(f(s_1), f(s_2)\big)\, ds_1\, ds_2 \\
&\le K_F \inf_{j} \int_{s_1}\!\int_{s_2} \sum_f g(f \mid a)\, j(s_1,s_2)\, d(s_1,s_2)\, ds_1\, ds_2 \\
&= K_F \sum_f g(f \mid a) \inf_{j} \int_{s_1}\!\int_{s_2} j(s_1,s_2)\, d(s_1,s_2)\, ds_1\, ds_2 \\
&= K_F \sum_f g(f \mid a)\, W(\mu_1,\mu_2) = K_F\, W(\mu_1,\mu_2)\, .
\end{align*}
Intuitively, Lemma 4.4.1 states that, if the two input distributions are similar, then for any action
the output distributions given by TG are also similar up to a KF factor.
Given the one-step error (Definition 4.4.1), a start-state distribution µ, and a fixed sequence of actions
a0, ..., an−1, we desire a bound on the n-step error:
\[
\delta(n) := W\big(\hat{T}^{\,n}_G(\cdot \mid \mu),\, T^n_G(\cdot \mid \mu)\big)\, ,
\]
where \(T^n_G(\cdot \mid \mu) := \underbrace{T_G\big(\cdot \mid T_G(\cdot \mid \cdots T_G(\cdot \mid \mu, a_0) \cdots, a_{n-2}),\, a_{n-1}\big)}_{n \text{ recursive calls}}\)
and \(\hat{T}^{\,n}_G(\cdot \mid \mu)\) is defined similarly. I provide
a useful lemma followed by the theorem.
Lemma 4.4.2. (Composition Lemma) Consider three metric spaces (M1, d1), (M2, d2), and (M3, d3),
and Lipschitz functions f : M2 → M3 and g : M1 → M2 with constants Kd2,d3(f) and Kd1,d2(g).
Then h = f ◦ g : M1 → M3 is Lipschitz with constant Kd1,d3(h) ≤ Kd2,d3(f) Kd1,d2(g).
Proof.
\begin{align*}
K_{d_1,d_3}(h) &= \sup_{s_1, s_2} \frac{d_3\big(f(g(s_1)),\, f(g(s_2))\big)}{d_1(s_1, s_2)} \\
&= \sup_{s_1, s_2} \frac{d_2\big(g(s_1), g(s_2)\big)}{d_1(s_1, s_2)} \cdot \frac{d_3\big(f(g(s_1)),\, f(g(s_2))\big)}{d_2\big(g(s_1), g(s_2)\big)} \\
&\le \sup_{s_1, s_2} \frac{d_2\big(g(s_1), g(s_2)\big)}{d_1(s_1, s_2)} \cdot \sup_{s_1, s_2} \frac{d_3\big(f(s_1), f(s_2)\big)}{d_2(s_1, s_2)} \\
&= K_{d_1,d_2}(g)\, K_{d_2,d_3}(f)\,.
\end{align*}
Similar to composition, we can show that summation preserves Lipschitz continuity, with a constant
bounded by the sum of the Lipschitz constants of the two functions. We omit this result for brevity.
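As an illustration (my addition, not part of the original analysis), the composition bound of Lemma 4.4.2 can be checked numerically. The sketch below uses two hypothetical one-dimensional functions with known Lipschitz constants; `lipschitz_estimate` is an illustrative helper that lower-bounds the true constant by sampling difference quotients:

```python
import math
import random

def lipschitz_estimate(fn, samples=10_000, lo=-5.0, hi=5.0):
    """Lower-bound sup_{x != y} |fn(x) - fn(y)| / |x - y| by sampling pairs."""
    rng = random.Random(0)
    best = 0.0
    for _ in range(samples):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x != y:
            best = max(best, abs(fn(x) - fn(y)) / abs(x - y))
    return best

g = lambda x: 0.5 * math.sin(x)   # Lipschitz with constant K_g = 0.5
f = lambda x: 2.0 * math.tanh(x)  # Lipschitz with constant K_f = 2.0
h = lambda x: f(g(x))             # composition h = f o g

K_h = lipschitz_estimate(h)
# Lemma 4.4.2 predicts K_h <= K_f * K_g = 1.0; the sampled estimate
# can only fall below the true constant, so it must respect the bound.
assert K_h <= 2.0 * 0.5 + 1e-12
```

The same sampling idea extends to the summation case mentioned above, where the constant is bounded by the sum rather than the product.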
Theorem 4.4.3. Consider a ∆-accurate \widehat{T}_G with Lipschitz constant K_F and an MDP with a
Lipschitz transition function T_G with constant K_T. Let K = min{K_F, K_T}. Then ∀n ≥ 1:
\[
\delta(n) := W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} K^{i}\,.
\]
Proof. We construct a proof by induction. Using Kantorovich–Rubinstein duality (the Lipschitz
property of f is not shown, for brevity) we first prove the base of the induction:
\begin{align*}
\delta(1) &:= W\big(\widehat{T}_G(\cdot \mid \mu, a_0),\, T_G(\cdot \mid \mu, a_0)\big) \\
&:= \sup_{f} \int \! \int \big(\widehat{T}(s' \mid s, a_0) - T(s' \mid s, a_0)\big) f(s')\, \mu(s)\, ds\, ds' \\
&\le \int \underbrace{\sup_{f} \int \big(\widehat{T}(s' \mid s, a_0) - T(s' \mid s, a_0)\big) f(s')\, ds'}_{=\, W(\widehat{T}(\cdot \mid s, a_0),\, T(\cdot \mid s, a_0)) \text{ due to duality (4.3.1)}} \mu(s)\, ds \\
&= \int \underbrace{W\big(\widehat{T}(\cdot \mid s, a_0),\, T(\cdot \mid s, a_0)\big)}_{\le\, \Delta \text{ due to Definition 4.4.1}} \mu(s)\, ds \\
&\le \int \Delta\, \mu(s)\, ds = \Delta\,.
\end{align*}
We now prove the inductive step. Assuming $\delta(n-1) := W\big(\widehat{T}^{n-1}_G(\cdot \mid \mu),\, T^{n-1}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-2} (K_F)^{i}$,
we can write:
\begin{align*}
\delta(n) &:= W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big) \\
&\le W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, \widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big)
+ W\big(\widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, T^{n}_G(\cdot \mid \mu)\big) \quad \text{(triangle inequality)} \\
&= W\big(\widehat{T}_G(\cdot \mid \widehat{T}^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, \widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big)
+ W\big(\widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, T_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big)
\end{align*}
We now use Lemma 4.4.1 and Definition 4.4.1 to upper bound the first and the second term of the
last line respectively.
\begin{align*}
\delta(n) &\le K_F\, W\big(\widehat{T}^{n-1}_G(\cdot \mid \mu),\, T^{n-1}_G(\cdot \mid \mu)\big) + \Delta \\
&= K_F\, \delta(n-1) + \Delta \le \Delta \sum_{i=0}^{n-1} (K_F)^{i}\,. \tag{4.2}
\end{align*}
Note that in the triangle inequality we may replace the intermediate distribution
$\widehat{T}_G\big(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}\big)$ with $T_G\big(\cdot \mid \widehat{T}^{n-1}_G(\cdot \mid \mu), a_{n-1}\big)$ and
follow the same basic steps to get:
\[
W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} (K_T)^{i}\,. \tag{4.3}
\]
Combining (4.2) and (4.3) allows us to write:
\begin{align*}
\delta(n) = W\big(\widehat{T}^{n}_G(\cdot \mid \mu),\, T^{n}_G(\cdot \mid \mu)\big)
\le \min \Big\{ \Delta \sum_{i=0}^{n-1} (K_T)^{i},\ \Delta \sum_{i=0}^{n-1} (K_F)^{i} \Big\}
= \Delta \sum_{i=0}^{n-1} K^{i}\,,
\end{align*}
which concludes the proof.
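To make the role of K concrete, here is a small sketch (my addition, with arbitrary example constants) that evaluates the bound of Theorem 4.4.3: for K < 1 the multi-step error bound saturates at ∆/(1 − K), while for K > 1 it grows geometrically with the horizon n:

```python
def multi_step_error_bound(delta, K, n):
    """Upper bound of Theorem 4.4.3: delta * sum_{i=0}^{n-1} K^i."""
    return delta * sum(K**i for i in range(n))

# With K < 1 the bound saturates at delta / (1 - K); with K > 1 it
# grows geometrically with n, illustrating compounding model error.
contractive = multi_step_error_bound(0.1, 0.5, 50)  # stays below 0.1 / (1 - 0.5) = 0.2
expansive = multi_step_error_bound(0.1, 1.5, 50)    # explodes with the horizon
assert contractive < 0.2
assert expansive > 1e5
```

This is why controlling the Lipschitz constant of the learned model matters: it determines whether one-step errors compound or stay bounded over long rollouts.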
There exist similar results in the literature relating one-step transition error to multi-step transi-
tion error and to sub-optimality bounds for planning with an approximate model. The Simulation
Lemma [58, 101] is for discrete-state MDPs and relates error in the one-step model to the value
obtained by using it for planning. A related result for continuous state spaces [55] bounds the error
in estimating the probability of a trajectory using total variation. A second related result [114]
provides a slightly looser bound for prediction error in the deterministic case; Theorem 4.4.3 can be
thought of as a generalization of that result to the probabilistic case.
4.5 Value Error with Lipschitz Models
We next investigate the error in the state-value function induced by a Lipschitz model class. To
answer this question, we consider an MRP M_1 denoted by ⟨S, A, T, R, γ⟩ and a second MRP M_2,
⟨S, A, \widehat{T}, R, γ⟩, that differs from the first only in its transition function. Let A = {a} be an
action set with a single action a. We further assume that the reward function depends only on
state. We first express the state-value function for a start state s with respect to the two transition
functions. By δ_s below, we mean a Dirac delta function denoting a distribution with probability 1
at state s:
\[
V_T(s) := \sum_{n=0}^{\infty} \gamma^{n} \int T^{n}_G(s' \mid \delta_s)\, R(s')\, ds'\,, \qquad
V_{\widehat{T}}(s) := \sum_{n=0}^{\infty} \gamma^{n} \int \widehat{T}^{n}_G(s' \mid \delta_s)\, R(s')\, ds'\,.
\]
Next we derive a bound on $\big|V_T(s) - V_{\widehat{T}}(s)\big|$ for all s.
Theorem 4.5.1. Assume a Lipschitz model class F_g with a ∆-accurate \widehat{T}, and let K = min{K_F, K_T}.
Further, assume a Lipschitz reward function with constant K_R = K_{d_S,R}(R). Then ∀s ∈ S and
K ∈ [0, 1/γ):
\[
\big|V_T(s) - V_{\widehat{T}}(s)\big| \le \frac{\gamma K_R \Delta}{(1 - \gamma)(1 - \gamma K)}\,.
\]
Proof. We first define the function f(s) = R(s)/K_R. It can be observed that K_{d_S,R}(f) = 1. We now
write:
\begin{align*}
V_T(s) - V_{\widehat{T}}(s)
&= \sum_{n=0}^{\infty} \gamma^{n} \int R(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds' \\
&= K_R \sum_{n=0}^{\infty} \gamma^{n} \int f(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds'\,.
\end{align*}
Let F = {h : K_{d_S,R}(h) ≤ 1}. Then, given f ∈ F:
\begin{align*}
K_R \sum_{n=0}^{\infty} \gamma^{n} \int f(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds'
&\le K_R \sum_{n=0}^{\infty} \gamma^{n} \underbrace{\sup_{f \in F} \int f(s') \big(T^{n}_G(s' \mid \delta_s) - \widehat{T}^{n}_G(s' \mid \delta_s)\big)\, ds'}_{:=\, W(T^{n}_G(\cdot \mid \delta_s),\, \widehat{T}^{n}_G(\cdot \mid \delta_s)) \text{ due to duality (4.3.1)}} \\
&= K_R \sum_{n=0}^{\infty} \gamma^{n} \underbrace{W\big(T^{n}_G(\cdot \mid \delta_s),\, \widehat{T}^{n}_G(\cdot \mid \delta_s)\big)}_{\le\, \sum_{i=0}^{n-1} \Delta K^{i} \text{ due to Theorem 4.4.3}} \\
&\le K_R \sum_{n=0}^{\infty} \gamma^{n} \sum_{i=0}^{n-1} \Delta K^{i} \\
&= K_R \Delta \sum_{n=0}^{\infty} \gamma^{n}\, \frac{1 - K^{n}}{1 - K} = \frac{\gamma K_R \Delta}{(1 - \gamma)(1 - \gamma K)}\,.
\end{align*}
Function f                          Definition                    Lipschitz constant K_{‖·‖_p,‖·‖_p}(f)
                                                                  p = 1            p = 2                p = ∞
ReLU : R^n → R^n                    ReLU(x)_i := max{0, x_i}      1                1                    1
(+b) : R^n → R^n, ∀b ∈ R^n          (+b)(x) := x + b              1                1                    1
(×W) : R^n → R^m, ∀W ∈ R^{m×n}      (×W)(x) := Wx                 Σ_j ‖W_j‖_∞      √(Σ_j ‖W_j‖²_2)      sup_j ‖W_j‖_1

Table 4.1: Lipschitz constants for various functions used in a neural network. Here, W_j denotes the
jth row of a weight matrix W.
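The constants for the linear map in Table 4.1 can be verified empirically. The following sketch (my illustration, using a hypothetical random weight matrix) checks that ‖Wx − Wy‖_p ≤ K_p ‖x − y‖_p for each of the three table entries:

```python
import random

def p_norm(v, p):
    """The l_p norm of a vector, with p = float('inf') for the max norm."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def table_constant(W, p):
    """Table 4.1 bound for f(x) = Wx, with W_j the j-th row of W."""
    if p == 1:
        return sum(max(abs(w) for w in row) for row in W)
    if p == 2:
        return sum(sum(w * w for w in row) for row in W) ** 0.5  # Frobenius norm
    return max(sum(abs(w) for w in row) for row in W)            # p = infinity

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
for p in (1, 2, float("inf")):
    K = table_constant(W, p)
    for _ in range(1000):
        x = [rng.uniform(-1, 1) for _ in range(4)]
        y = [rng.uniform(-1, 1) for _ in range(4)]
        lhs = p_norm([a - b for a, b in zip(matvec(W, x), matvec(W, y))], p)
        assert lhs <= K * p_norm([a - b for a, b in zip(x, y)], p) + 1e-9
```

By Lemma 4.4.2, multiplying these per-layer constants yields a bound on the constant of the whole network.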
We can derive the same bound for V_{\widehat{T}}(s) − V_T(s) using the fact that the Wasserstein distance is
a metric, and therefore symmetric, thereby completing the proof.
Regarding the tightness of these bounds, I can show that when the transition model is deterministic
and linear, Theorem 4.4.3 provides a tight bound. Moreover, if the reward function is also linear,
the bound provided by Theorem 4.5.1 is tight.
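As a numerical sanity check on Theorem 4.5.1 (my addition, with arbitrary example constants), the closed-form bound γK_R∆/((1 − γ)(1 − γK)) can be compared against a truncation of the series it was derived from:

```python
def value_error_bound(gamma, K_R, delta, K):
    """Closed form of Theorem 4.5.1 (requires K < 1/gamma)."""
    assert 0 <= K < 1.0 / gamma
    return gamma * K_R * delta / ((1 - gamma) * (1 - gamma * K))

def value_error_series(gamma, K_R, delta, K, terms=10_000):
    """Truncated series K_R * delta * sum_n gamma^n (1 - K^n) / (1 - K)."""
    return K_R * delta * sum(
        gamma**n * (1 - K**n) / (1 - K) for n in range(terms)
    )

closed = value_error_bound(0.95, 2.0, 0.1, 0.9)
series = value_error_series(0.95, 2.0, 0.1, 0.9)
assert abs(closed - series) < 1e-6
```

The agreement confirms the geometric-series step at the end of the proof.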
4.6 Experiments
My first goal in this section is to compare TV, KL, and Wasserstein in terms of their ability to
quantify the error of an imperfect model, and to do so in a simple and clear setting. To this end, I built
finite MRPs with random transitions, |S| = 10 states, and γ = 0.95. In the first case the reward
signal is randomly sampled from [0, 10]; in the second case the reward of a state is the index
of that state, so a small Euclidean distance between two states indicates similar values. For
10^5 trials, I generated an MRP and a random model, and then computed the model error and planning
error (Figure 4.3). A good metric is one whose model error correlates highly with value error.
I show these correlations for different values of γ in Figure 4.4.
Figure 4.3: Value error (x axis) and model error (y axis). When the reward is the index of the state
(right), correlation between Wasserstein error and value-prediction error is high. This highlights the
fact that when closeness in the state space is an indication of similar values, Wasserstein can be a
powerful metric for model-based RL. Note that Wasserstein provides no advantage given random
rewards (left).
Figure 4.4: Correlation between value-prediction error and model error for the three metrics using
random rewards (left) and index rewards (right). Given a useful notion of state similarity, low
Wasserstein error is a better indication of planning error.
It is known that controlling the Lipschitz constant of neural nets can help improve generalization
error, owing to a lower Rademacher complexity [81, 17]. It then follows from
Theorems 4.4.3 and 4.5.1 that controlling the Lipschitz constant of a learned transition model can
achieve better error bounds for multi-step and value predictions. To enforce this constraint during
learning, we bound the Lipschitz constant of the various operations used in building a neural network;
the bound on the constant of the entire network then follows from Lemma 4.4.2. In Table 4.1,
we provide Lipschitz constants for the operations used in our experiments, quantified for
different p-norms ‖·‖_p.
Given these simple methods for enforcing Lipschitz continuity, I performed empirical evaluations to
understand the impact of Lipschitz continuity of transition models, specifically when the transition
model is used to perform multi-step state predictions and policy improvements. I chose two standard
domains: Cart Pole and Pendulum. In Cart Pole, I trained a network on a dataset of 15 × 10^3 tuples
⟨s, a, s'⟩. During training, I ensured that the weights of the network are smaller than k. For each k, I
performed 20 independent model estimations, and chose the model with median cross-validation error.
Using the learned model, along with the actual reward signal of the environment, I then performed
stochastic actor-critic RL [20, 105]. This required an interaction between the policy and the learned
model for relatively long trajectories. To measure the usefulness of the model, I then tested the
learned policy on the actual domain. I repeated this experiment on Pendulum. To train the neural
transition model for this domain, we used 10^4 samples. Notably, I used deterministic policy gradient
[97] for training the policy network, with the hyperparameters suggested by [65]. I report these
results in Figure 4.5.
Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory,
controlling the Lipschitz constant in practice can combat the compounding errors and can help in
Figure 4.5: Impact of controlling the Lipschitz constant of learned models in Cart Pole (left) and
Pendulum (right). An intermediate value of k (Lipschitz constant) yields the best performance.
the value estimation problem. This ultimately results in learning a better policy.
I next examined whether these benefits carry over to stochastic settings. To capture stochasticity, we
need an algorithm to learn a Lipschitz model class (Definition 4.2.1). I used an EM algorithm to jointly
learn a set of functions f, parameterized by θ = {θ_f : f ∈ F_g}, and a distribution over functions g.
Note that in practice our dataset consists only of samples ⟨s, a, s'⟩ and does not include the
function each sample is drawn from, so I treat the function identity as a latent variable z. As is standard
with EM, we start with the log-likelihood objective (for simplicity of presentation, I assume a single
action in the derivation):
\begin{align*}
L(\theta) &= \sum_{i=1}^{N} \log p(s_i, s'_i; \theta) \\
&= \sum_{i=1}^{N} \log \sum_{f} p(z_i = f, s_i, s'_i; \theta) \\
&= \sum_{i=1}^{N} \log \sum_{f} q(z_i = f \mid s_i, s'_i)\, \frac{p(z_i = f, s_i, s'_i; \theta)}{q(z_i = f \mid s_i, s'_i)} \\
&\ge \sum_{i=1}^{N} \sum_{f} q(z_i = f \mid s_i, s'_i) \log \frac{p(z_i = f, s_i, s'_i; \theta)}{q(z_i = f \mid s_i, s'_i)}\,,
\end{align*}
where I used Jensen’s inequality and concavity of log in the last line. This derivation leads to the
following EM algorithm.
In the M step, find θ^t by solving:
\[
\arg\max_{\theta} \sum_{i=1}^{N} \sum_{f} q^{t-1}(z_i = f \mid s_i, s'_i) \log \frac{p(z_i = f, s_i, s'_i; \theta)}{q^{t-1}(z_i = f \mid s_i, s'_i)}\,.
\]
In the E step, compute the posteriors:
\[
q^{t}(z_i = f \mid s_i, s'_i) = \frac{p(s_i, s'_i \mid z_i = f;\, \theta^{t}_f)\, g(z_i = f;\, \theta^{t})}{\sum_{f} p(s_i, s'_i \mid z_i = f;\, \theta^{t}_f)\, g(z_i = f;\, \theta^{t})}\,.
\]
Note that we assume each point is drawn from a neural network f with probability:
\[
p(s_i, s'_i \mid z_i = f;\, \theta^{t}_f) = \mathcal{N}\big(s'_i - f(s_i; \theta^{t}_f);\ 0,\ \sigma^{2}\big)\,,
\]
with a fixed variance σ² tuned as a hyperparameter.
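The E and M steps above can be sketched end to end. The toy below (my illustration, not the thesis implementation) replaces the neural networks in F_g with two linear functions of a scalar state, so the M step reduces to weighted least squares; all names and constants are hypothetical:

```python
import math
import random

rng = random.Random(0)

# Synthetic transitions from two unknown linear "functions" (stand-ins
# for the neural networks f in F_g), corrupted by Gaussian noise.
true_slopes = [2.0, -2.0]
data = []
for _ in range(400):
    s = rng.uniform(-1, 1)
    k = rng.randrange(2)
    data.append((s, true_slopes[k] * s + rng.gauss(0, 0.05)))

sigma2 = 0.05**2
theta = [1.0, -1.0]  # initial slope of each learned function
g = [0.5, 0.5]       # prior over functions, g(f)

for _ in range(30):
    # E step: posterior q(z_i = f | s_i, s'_i) under the Gaussian likelihood
    q = []
    for s, s_next in data:
        scores = [g[f] * math.exp(-0.5 * (s_next - theta[f] * s) ** 2 / sigma2)
                  for f in range(2)]
        total = sum(scores)
        q.append([sc / total for sc in scores])
    # M step: weighted least squares for each slope, then update the prior g
    for f in range(2):
        num = sum(qi[f] * s * s_next for qi, (s, s_next) in zip(q, data))
        den = sum(qi[f] * s * s for qi, (s, s_next) in zip(q, data))
        theta[f] = num / den
    g = [sum(qi[f] for qi in q) / len(q) for f in range(2)]

# The learned slopes should recover the two generating functions.
assert abs(sorted(theta)[0] + 2.0) < 0.1
assert abs(sorted(theta)[1] - 2.0) < 0.1
```

The structure mirrors the derivation exactly: the E step normalizes the per-function likelihoods weighted by g, and the M step maximizes the resulting lower bound separately for each θ_f.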
I first used a supervised-learning domain to evaluate the EM algorithm in a simple setting. I
generated 30 points from the following 5 functions:
f_0(x) = tanh(x) + 3
f_1(x) = x · x
f_2(x) = sin(x) − 5
f_3(x) = sin(x) − 3
f_4(x) = sin(x) · sin(x) ,
and trained 5 neural networks to fit these points. The iterations of a single run are shown in Figure 4.6,
and a summary of results is presented in Figure 4.7. Observe that the EM algorithm is effective,
and that controlling the Lipschitz constant is again useful.
Figure 4.6: A stochastic problem solved by training a Lipschitz model class using EM. The top left
figure shows the functions before any training (iteration 0), and the bottom right figure shows the
final results (iteration 50).
I next applied EM to train a transition model for an RL setting, namely the gridworld domain from
Moerland et al. [74]. Here, a useful model needs to capture the stochastic behavior of the two ghosts.
I modify the reward to be −1 whenever the agent is in the same cell as either one of the ghosts and
0 otherwise. I performed environmental interactions for 1000 time-steps and measured the return. I
compared against standard tabular methods [103], and against a deterministic model that predicts the
expected next state [106, 82]. In all cases I used value iteration for planning.
Results in Figure 4.8 show that tabular models fail because they do not generalize, and expected
models fail because the ghosts do not move in expectation, a prediction that is not useful for the
planner. Performing value iteration with a Lipschitz model class outperforms the baselines.
4.7 Conclusion
In this chapter, we took an important step towards understanding the effects of smoothness in
model-based RL. I showed that Lipschitz continuity of an estimated model plays a central role in
multi-step prediction error, and in value-estimation error. I also showed the benefits of employing
Wasserstein for model-based RL.
Figure 4.7: Impact of controlling the Lipschitz constant in the supervised-learning domain. Notice
the U-shape of the final Wasserstein loss with respect to the Lipschitz constant k.
Figure 4.8: Performance of a Lipschitz model class on the gridworld domain. I show model test
accuracy (left) and the quality of the policy found using the model (right). Notice the poor
performance of the tabular and expected models.
Chapter 5
Smoothness in Continuous Control
In this chapter I focus on RL in continuous action spaces. As I showed in the background chapter,
a core operation in RL is finding an action that is optimal with respect to the Q function. This
operation, max_{a∈A} Q(s, a), is challenging when A is a continuous space.
I introduce deep RBF value functions: Q functions learned using a deep neural network with a radial-
basis-function (RBF) output layer. I show that the optimal action with respect to a deep RBF Q
function can be easily approximated up to any desired accuracy. Moreover, deep RBF Q functions
can represent any true value function up to any desired accuracy, owing to their support for universal
function approximation. I show the theoretical and practical benefits of controlling the smoothness
of deep RBF Q functions. By learning a deep RBF value function, I extend the standard DQN
algorithm to continuous control, and demonstrate that the resultant agent, RBF–DQN, outperforms
standard baselines on a set of continuous-action RL problems.
5.1 Introduction
In RL the Q function quantifies the expected return for taking action a in state s. Many RL
algorithms, notably Q-learning, learn an approximation of the Q function from environmental in-
teractions. When using function approximation with Q-learning, the agent has a parameterized
function class, and learning consists of finding a parameter setting θ for the approximate value func-
tion Q(s, a; θ) that accurately represents the true Q function.
A core operation in many RL algorithms is finding an optimal action with respect to the value
function, arg maxa∈A Q(s, a; θ), or finding the highest action-value maxa∈A Q(s, a; θ). The need for
performing these operations arises not just when computing a behavior policy for action selection,
but also when learning Q itself using bootstrapping techniques [104].
The optimization problem maxa∈A Q(s, a; θ) is generally challenging if A is continuous, in contrast
to the discrete case where the operation is trivial if the number of discrete actions is not enormous.
The challenge stems from the observation that the surface of the function f_s(a; θ) := Q(s, a; θ)
could have many local maxima and saddle points; therefore, naïve approaches such as finding the
maximum through gradient ascent can lead to inaccurate answers [94]. In light of this technical
challenge, recent work on solving continuous control problems has instead embraced policy-gradient
algorithms, which typically compute ∇aQ(s, a; θ), rather than solving maxa∈A Q(s, a; θ), and follow
the ascent direction to move an explicitly maintained policy towards actions with higher Q [98].
However, policy-gradient algorithms have their own weaknesses, particularly in settings with sparse
rewards where computing an accurate estimate of the gradient requires an unreasonable number of
environmental interactions [56, 72]. Rather than adopting a policy-gradient approach, I focus on
tackling the problem of efficiently computing maxa∈A Q(s, a; θ) for value-function-based RL.
Previous work on value-function-based algorithms for continuous control has shown the benefits of
using function classes that are conducive to efficient action maximization. For example, Gu et al.
[47] explored function classes that can capture an arbitrary dependence on the state, but only a
quadratic dependence on the action. Given a function class with a quadratic dependence on the
action, Gu et al. [47] showed how to compute maxa∈A Q(s, a; θ) quickly and in constant time. A
more general idea is to use input–convex neural networks [4] that restrict Q(s, a; θ) to functions that
are convex (or concave) with respect to the input a, so that for any fixed state s the optimization
problem maxa∈A Q(s, a; θ) can be solved efficiently using convex-optimization techniques [29]. These
solutions trade the expressiveness of the function class for easy action maximization.
While restricting the function class can enable easy maximization, it can be problematic if no
member of the restricted class has low approximation error relative to the true Q function [67].
More concretely, when the agent cannot possibly learn an accurate Q, the error |maxaQπ(s, a) −
maxa Q(s, a; θ)| could be significant even if the agent can solve maxa Q(s, a; θ) exactly. In the case of
input–convex neural networks, for example, high error can occur if f_s(a; θ) := Q(s, a; θ) is completely
non-convex. Thus, it is desirable to ensure that, for any true Q function, there exists a member of
the function class that approximates Q up to any desired accuracy. Such a function class is said
to be capable of universal function approximation (UFA) [51, 25, 48]. A function class that is both
conducive to efficient action maximization and also capable of UFA would be ideal.
In this chapter I introduce deep RBF value functions, which approximate Q with a standard deep
neural network equipped with an RBF output layer. I show that deep RBF value functions have the
two desired properties outlined above: First, using deep RBF value functions enables us to approximate
the optimal action up to any desired accuracy. Second, deep RBF value functions support
universal function approximation.
Prior work in RL used RBF networks for learning value functions in problems with discrete action
spaces (see Section 9.5.5 of Sutton & Barto [104] for a discussion). That said, to the best of my
knowledge, my discovery of the action-maximization property of RBF networks is novel, and there
has been no application of deep RBF networks to continuous control. I combine deep RBF networks
with DQN [73], a standard deep RL algorithm originally proposed for discrete actions, to produce
a new algorithm called RBF–DQN. I evaluate RBF–DQN on a large set of continuous-action RL
problems, and demonstrate its superior performance relative to standard deep-RL baselines.
5.2 Deep RBF Value Functions
Deep RBF value functions combine the practical advantages of deep networks [44] with the theoreti-
cal advantages of radial-basis functions (RBFs) [88]. A deep RBF network comprises a number
of arbitrary hidden layers, followed by an RBF output layer, defined next. The RBF output layer,
first introduced in a seminal paper by Broomhead & Lowe [32], is sometimes used as a standalone
single-layer function approximator, referred to as a (shallow) RBF network. We use an RBF network
as the final, or output, layer of a deep network.
For a given input a, the RBF layer f(a) is defined as:
\[
f(a) := \sum_{i=1}^{N} g(a - a_i)\, v_i\,, \tag{5.1}
\]
where each ai represents a centroid location, vi is the value of the centroid ai, N is the number of
centroids, and g is an RBF. A commonly used RBF is the negative exponential:
\[
g(a - a_i) := e^{-\beta \| a - a_i \|}\,, \tag{5.2}
\]
equipped with a smoothing parameter β ≥ 0. (See Karayiannis [57] for a thorough treatment of
other RBFs.) Formulation (5.1) could be thought of as an interpolation based on the value and
the weights of all centroids, where the weight of each centroid is determined by its proximity to the
input. Proximity here is quantified by the RBF g, in this case the negative exponential (5.2).
As will be clear momentarily, it is theoretically useful to normalize centroid weights to ensure that
they sum to 1 so that f implements a weighted average. This weighted average is sometimes referred
to as a normalized Gaussian RBF layer [76, 34]:
\[
f_\beta(a) := \frac{\sum_{i=1}^{N} e^{-\beta \| a - a_i \|}\, v_i}{\sum_{i=1}^{N} e^{-\beta \| a - a_i \|}}\,. \tag{5.3}
\]
As the smoothing parameter β → ∞, the function implements a winner-take-all case where the value
of the function at a given input is determined only by the value of the closest centroid location,
nearest-neighbor style. This limiting case is sometimes referred to as a Voronoi decomposition [12].
Conversely, as β approaches 0, f converges to the mean of the centroid values regardless of the
input a; that is, $\forall a\ \lim_{\beta \to 0} f_\beta(a) = \frac{\sum_{i=1}^{N} v_i}{N}$. Since an RBF layer is differentiable, it can be used in
conjunction with (stochastic) gradient descent and backprop to learn the centroid locations and their
values by optimizing a loss function. Note that formulation (5.3) differs from the Boltzmann softmax
operator [6, 100], where the weights are determined not by an RBF, but by the action values.
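A minimal sketch of the normalized Gaussian RBF layer (5.3) for a one-dimensional action, with hypothetical centroids and values (my illustration), confirming both limiting behaviors of β discussed above:

```python
import math

def rbf_layer(a, centroids, values, beta):
    """Normalized Gaussian RBF layer, Equation (5.3), for scalar actions."""
    weights = [math.exp(-beta * abs(a - ai)) for ai in centroids]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

centroids = [-1.0, 0.0, 2.0]
values = [3.0, -1.0, 5.0]

# beta -> infinity: winner-take-all (value of the nearest centroid)
assert abs(rbf_layer(1.9, centroids, values, beta=200.0) - 5.0) < 1e-6
# beta -> 0: the output approaches the mean of the centroid values
mean_v = sum(values) / len(values)
assert abs(rbf_layer(1.9, centroids, values, beta=1e-6) - mean_v) < 1e-3
```

In the deep RBF Q function below, the scalar centroids and values are replaced by the learned, state-dependent mappings a_i(s; θ) and v_i(s; θ).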
Finally, to represent the Q function for RL, we use the following formulation:
\[
Q_\beta(s, a; \theta) := \frac{\sum_{i=1}^{N} e^{-\beta \| a - a_i(s; \theta) \|}\, v_i(s; \theta)}{\sum_{i=1}^{N} e^{-\beta \| a - a_i(s; \theta) \|}}\,. \tag{5.4}
\]
A deep RBF Q function (5.4) internally learns two mappings: a state-dependent set of centroid
locations a_i(s; θ) and state-dependent centroid values v_i(s; θ). The role of the RBF output layer,
then, is to use these learned mappings to form the output of the entire deep RBF Q function. We
illustrate the architecture of a deep RBF Q function in Figure 5.1. In the experimental section, we
demonstrate how to learn the parameters θ.
I now show that deep RBF Q functions have the first desired property for value-function-based RL,
namely that they enable easy action maximization.
In light of the RBF formulation, it is easy to find the value of the deep RBF Q function at each
centroid location a_i, that is, to compute Q_β(s, a_i; θ). Note that Q_β(s, a_i; θ) ≠ v_i(s; θ) in general for
a finite β, because the other centroids a_j, j ∈ {1, ..., N} \ {i}, may have non-zero weights at a_i. In
other words, the action-value function at a centroid a_i can in general differ from the centroid's value
v_i. Therefore, to compute Q_β(s, a_i; θ), we query the centroid location a_i(s; θ), then evaluate (5.4)
at a_i. Once we have Q_β(s, a_i; θ) for every a_i, we can trivially find the highest-valued centroid, or
its corresponding value:
\[
\max_{i \in [1, N]} Q_\beta(s, a_i; \theta)\,.
\]
Figure 5.1: The architecture of a deep RBF value (Q) function. A deep RBF Q function could
be thought of as an RBF output layer added to an otherwise standard deep state-action value (Q)
function. In this sense, any kind of standard layer (dense, convolutional, recurrent, etc.) could be
used as a hidden layer. All operations of the final RBF layer are differentiable, and therefore, the
parameters of the hidden layers θ, which represent the mappings a_i(s; θ) and v_i(s; θ), can be learned
using standard gradient-based optimization techniques.
While in general there may be a gap between the global maximum max_{a∈A} Q_β(s, a; θ) and its
easy-to-compute approximation max_{i∈[1,N]} Q_β(s, a_i; θ), the following theorem shows that this gap
is zero in one-dimensional action spaces. More importantly, Theorem 5.2.1 guarantees that in action
spaces with an arbitrary number of dimensions, the gap shrinks exponentially as the smoothing
parameter β increases, allowing us to reduce the gap very quickly and up to any desired accuracy
by simply increasing β.
Theorem 5.2.1. Let Q_β be a member of the class of normalized Gaussian RBF value functions.
For a one-dimensional action space A = R:
\[
\max_{a \in A} Q_\beta(s, a; \theta) = \max_{i \in [1, N]} Q_\beta(s, a_i; \theta)\,.
\]
For A = R^d, ∀d ≥ 1:
\[
0 \le \max_{a \in A} Q_\beta(s, a; \theta) - \max_{i \in [1, N]} Q_\beta(s, a_i; \theta) \le O(e^{-\beta})\,.
\]
Proof. I begin by proving the first result. For an arbitrary action a, we can write:
\[
Q_\beta(s, a; \theta) = w_1 v_1(s; \theta) + \ldots + w_N v_N(s; \theta)\,,
\]
where each weight w_i is determined via softmax. Without loss of generality, we sort all centroids so
that a_i < a_{i+1} for all i. Take two neighboring centroids a_L and a_R with a ∈ [a_L, a_R], and notice
that:
\[
\forall i < L: \quad \frac{w_L}{w_i} = \frac{e^{-\beta|a - a_L|}}{e^{-\beta|a - a_i|}} = \frac{e^{-\beta(a - a_L)}}{e^{-\beta(a - a_i)}} = e^{\beta(a_L - a_i)} \overset{\text{def}}{=} \frac{1}{c_i} \implies w_i = w_L c_i\,.
\]
In the above, I used the fact that all such a_i are to the left of a and a_L. Similarly, I can argue that
∀i > R: w_i = w_R c_i. Intuitively, as long as the action is between a_L and a_R, the ratio of the weight
of a centroid to the left of a_L, over the weight of a_L itself, remains constant and does not change
with a. The same holds for the centroids to the right of a_R. In light of the above result, by renaming
some variables we can now write:
\begin{align*}
Q_\beta(s, a; \theta) &= w_1 v_1(s; \theta) + \ldots + w_L v_L(s; \theta) + w_R v_R(s; \theta) + \ldots + w_K v_K(s; \theta) \\
&= w_L c_1 v_1(s; \theta) + \ldots + w_L v_L(s; \theta) + w_R v_R(s; \theta) + \ldots + w_R c_K v_K(s; \theta) \\
&= w_L \big(c_1 v_1(s; \theta) + \ldots + v_L(s; \theta)\big) + w_R \big(v_R(s; \theta) + \ldots + c_K v_K(s; \theta)\big)\,.
\end{align*}
Moreover, note that the weights need to sum to 1:
\[
w_L (c_1 + \ldots + 1) + w_R (1 + \ldots + c_K) = 1\,,
\]
and w_L is at its peak when we choose a = a_L and at its smallest value when we choose a = a_R. The
converse is true of w_R. Moreover, the weights increase and decrease monotonically as we move the
input a. We call the endpoints of this range w_min and w_max. As such, the problem
max_{a∈[a_L,a_R]} Q_β(s, a; θ) can be written as this linear program:
\begin{align*}
\max_{w_L, w_R} \quad & w_L \big(c_1 v_1(s; \theta) + \ldots + v_L(s; \theta)\big) + w_R \big(v_R(s; \theta) + \ldots + c_K v_K(s; \theta)\big) \\
\text{s.t.} \quad & w_L (c_1 + \ldots + 1) + w_R (1 + \ldots + c_K) = 1 \\
& w_L, w_R \ge w_{\min} \\
& w_L, w_R \le w_{\max}
\end{align*}
A standard result in linear programming is that every linear program has an extreme point that
is an optimal solution [29]. Therefore, at least one of the points (w_L = w_min, w_R = w_max) or
(w_L = w_max, w_R = w_min) is an optimal solution. In light of the monotonicity property, it is easy
to see that there is a one-to-one mapping between a and (w_L, w_R). As a result, the first point
corresponds to the unique value a = a_R(s), and the second corresponds to the unique value a = a_L(s).
Since no point between two neighboring centroids can have a higher value than both surrounding
centroids, at least one of the centroids is a globally optimal solution in the range [a_1(s), a_N(s)],
that is,
\[
\max_{a \in [a_1(s; \theta),\, a_N(s; \theta)]} Q_\beta(s, a; \theta) = \max_{a_i} Q_\beta(s, a_i; \theta)\,.
\]
To finish the proof, I can show that ∀a < a_1, Q_β(s, a; θ) = Q_β(s, a_1; θ); the proof that ∀a > a_N,
Q_β(s, a; θ) = Q_β(s, a_N; θ) follows similar steps. Writing a = a_1 − c for some c > 0:
\begin{align*}
\forall a < a_1 \quad Q_\beta(s, a; \theta) &= \frac{\sum_{i=1}^{N} e^{-\beta|a - a_i(s)|}\, v_i(s)}{\sum_{i=1}^{N} e^{-\beta|a - a_i(s)|}}
= \frac{\sum_{i=1}^{N} e^{-\beta|a_1 - c - a_i(s)|}\, v_i(s)}{\sum_{i=1}^{N} e^{-\beta|a_1 - c - a_i(s)|}} \\
&= \frac{\sum_{i=1}^{N} e^{\beta(a_1 - c - a_i(s))}\, v_i(s)}{\sum_{i=1}^{N} e^{\beta(a_1 - c - a_i(s))}}
= \frac{e^{-\beta c} \sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}\, v_i(s)}{e^{-\beta c} \sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}} \\
&= \frac{\sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}\, v_i(s)}{\sum_{i=1}^{N} e^{\beta(a_1 - a_i(s))}} = Q_\beta(s, a_1; \theta)\,,
\end{align*}
which concludes the proof of the first part.
I now move to the more general case with A = R^d. Let a_max denote the centroid with the highest
value; without loss of generality, assume it is the first centroid, that is, a_max = a_1 and
v_max(s; θ) = v_1(s; θ). We can write:
\begin{align*}
\max_{a} Q_\beta(s, a; \theta) - \max_{i \in \{1:N\}} Q_\beta(s, a_i; \theta)
&\le v_{\max}(s; \theta) - \max_{i \in \{1:N\}} Q_\beta(s, a_i; \theta) \\
&\le v_{\max}(s; \theta) - Q_\beta(s, a_{\max}; \theta)\,.
\end{align*}
Letting ∆_q denote an upper bound on v_1(s) − v_i(s), we can bound the right-hand side and conclude
the proof. Note that a related result was shown recently [100]:
\begin{align*}
v_{\max}(s) - Q_\beta(s, a_{\max}; \theta)
&= v_1(s) - \frac{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|}\, v_i(s)}{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&= \frac{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|} \big(v_1(s) - v_i(s)\big)}{\sum_{i=1}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&= \frac{\sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|} \big(v_1(s) - v_i(s)\big)}{1 + \sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&\le \Delta_q\, \frac{\sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|}}{1 + \sum_{i=2}^{N} e^{-\beta \| a_1 - a_i(s) \|}} \\
&\le \Delta_q \sum_{i=2}^{N} \frac{e^{-\beta \| a_1 - a_i(s) \|}}{1 + e^{-\beta \| a_1 - a_i(s) \|}}
= \Delta_q \sum_{i=2}^{N} \frac{1}{1 + e^{\beta \| a_1 - a_i(s) \|}} = O(e^{-\beta})\,.
\end{align*}
Figure 5.2 shows an example of the output of an RBF Q function where there exists a gap between
maxa∈A Qβ(s, a; θ) and maxi∈[1,N ] Qβ(s, ai; θ) for small values of β. Note also that, consistent with
Theorem 5.2.1, we can quickly decrease this gap by increasing the value of β.
(a) β = 0.25 (b) β = 1
(c) β = 1.5 (d) β = 2
Figure 5.2: An RBF Q function with 3 fixed centroid locations and centroid values (shown as black
dots), but different settings of the smoothing parameter β, on a 2-dimensional action space. Green
regions highlight the set of actions a for which Q_β(s, a; θ) is extremely close to the global maximum,
or more formally the set $\{a \in A \mid \max_{a \in A} Q_\beta(s, a; \theta) - Q_\beta(s, a; \theta) < 0.02\}$. Observe the fast reduction
of the gap between max_{a∈A} Q_β(s, a; θ) and max_{i∈[1,N]} Q_β(s, a_i; θ) as β increases, as guaranteed by
Theorem 5.2.1. Specifically, max_{a∈A} Q_β(s, a; θ) − max_{i∈[1,N]} Q_β(s, a_i; θ) < 0.02 for any β ≥ 1.5.
Also, observe that the function becomes less smooth as we increase β.
In light of the above theoretical result, to approximate maxa∈A Qβ(s, a; θ) we simply compute maxi∈[1,N] Qβ(s, ai; θ). If the approximation needs to be more accurate, one can always increase the smoothing parameter β to quickly reach the desired accuracy.
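As a numerical sanity check, the following sketch compares the maximum of Qβ over a dense action grid with its maximum over the centroids, and verifies the bound Δq · Σ_{i≥2} 1/(1 + e^{β‖a1−ai‖}) from the proof of Theorem 5.2.1. The centroid locations and values are made up, in the spirit of Figure 5.2:

```python
import numpy as np

def q_beta(A, centroids, values, beta):
    # Normalized RBF value function over a 2-D action space.
    # A: (M, 2) query actions; centroids: (N, 2); values: (N,).
    d = np.linalg.norm(A[:, None, :] - centroids[None, :, :], axis=-1)
    w = np.exp(-beta * d)
    return (w * values[None, :]).sum(axis=1) / w.sum(axis=1)

centroids = np.array([[-1.0, 0.5], [0.8, -0.3], [0.2, 1.2]])
values = np.array([1.0, 0.4, -0.5])

# Dense grid over the action space as a stand-in for the exact maximization.
g = np.linspace(-2.0, 2.0, 201)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

delta_q = values.max() - values.min()
a1 = centroids[values.argmax()]                  # centroid with the highest value
dist = np.linalg.norm(a1 - centroids, axis=1)

gaps, bounds = [], []
for beta in [0.25, 1.0, 2.0, 4.0]:
    gap = q_beta(grid, centroids, values, beta).max() \
        - q_beta(centroids, centroids, values, beta).max()
    # Theorem 5.2.1's bound: Delta_q * sum_{i>=2} 1 / (1 + e^{beta ||a_1 - a_i||}).
    bound = delta_q * np.sum(1.0 / (1.0 + np.exp(beta * dist[dist > 0])))
    gaps.append(gap)
    bounds.append(bound)
print([round(g_, 4) for g_ in gaps])
```

For every β the observed gap stays below the theorem's bound, and for large β both shrink at the rate O(e^{-β}).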
Notice that this result holds for normalized Gaussian RBF networks (hence adding the normalization
step), but not necessarily for the unnormalized case or for other types of RBFs. I believe that this
observation is an interesting result in and of itself, regardless of its connection to value-function-
based RL.
I now move to the second desired property of RBF Q networks, namely that these networks are in
fact capable of universal function approximation (UFA).
Theorem 5.2.2. Consider any state–action value function Qπ(s, a) defined on a closed action space A. Assume that Qπ(s, a) is a continuous function. For a fixed state s and for any ε > 0, there exists a deep RBF value function Qβ(s, a; θ) and a setting of the smoothing parameter β0 for which:
\[
\forall a \in A\ \ \forall \beta \ge \beta_0 \qquad |Q^\pi(s,a) - Q_\beta(s,a;\theta)| \le \varepsilon\,.
\]
Proof. Since Qπ(s, ·) is continuous on the closed action space A, I further assume it is Lipschitz with some constant L; writing f(·) := Qπ(s, ·):
\[
\forall a_0, a_1 \quad |f(a_1) - f(a_0)| \le L\, \|a_1 - a_0\|\,.
\]
As such, whenever ‖a1 − a0‖ ≤ ε/(4L), we have that
\[
|f(a_1) - f(a_0)| \le \frac{\varepsilon}{4}\,. \tag{5.5}
\]
Consider a set of centroids {c1, c2, ..., cN} and define cell(j) as:
\[
\mathrm{cell}(j) = \{a \in A \mid \|a - c_j\| = \min_z \|a - c_z\|\}\,,
\]
and the radius Rad(j, A) as:
\[
\mathrm{Rad}(j, A) := \sup_{x \in \mathrm{cell}(j)} \|x - c_j\|\,.
\]
Since A is closed and bounded, there always exists a finite set of centroids {c1, c2, ..., cN} for which Rad(j, A) ≤ ε/(4L) for every j. Now consider the following functional form:
\[
Q_\beta(s,a) := \sum_{j=1}^{N} Q^\pi(s, c_j)\, w_j\,, \qquad \text{where } w_j = \frac{e^{-\beta\|a - c_j\|}}{\sum_{z=1}^{N} e^{-\beta\|a - c_z\|}}\,.
\]
Now suppose a lies in a subset of cells, called the central cells C:
\[
C := \{j \mid a \in \mathrm{cell}(j)\}\,.
\]
We define a second set of neighboring cells:
\[
\mathcal{N} := \big\{j \mid \mathrm{cell}(j) \cap \big(\textstyle\bigcup_{i\in C} \mathrm{cell}(i)\big) \ne \emptyset\big\} \setminus C\,,
\]
and a third set of far cells:
\[
F := \{j \mid j \notin C \ \&\ j \notin \mathcal{N}\}\,.
\]
We now have:
\begin{align*}
|Q^\pi(s,a) - Q_\beta(s,a;\theta)|
&= \Big|\sum_{j=1}^{N} \big(Q^\pi(s,a) - Q^\pi(s,c_j)\big)\, w_j\Big| \\
&\le \sum_{j=1}^{N} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j \\
&= \sum_{j\in C} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j
 + \sum_{j\in \mathcal{N}} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j
 + \sum_{j\in F} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, w_j\,.
\end{align*}
We now bound each of the three sums above. Starting with the first sum, it is easy to see that |Qπ(s, a) − Qπ(s, cj)| ≤ ε/4, simply because a ∈ cell(j) and hence ‖a − cj‖ ≤ Rad(j, A) ≤ ε/(4L). As for the second sum, since cj is the centroid of a neighboring cell, using a central cell i we can write:
\[
\|a - c_j\| = \|a - c_i + c_i - c_j\| \le \|a - c_i\| + \|c_i - c_j\| \le \frac{\varepsilon}{4L} + \frac{\varepsilon}{4L} = \frac{\varepsilon}{2L}\,,
\]
and so in this case |Qπ(s, a) − Qπ(s, cj)| ≤ ε/2. In the third case, with the set of far cells F, observe that for a far cell j and a central cell i we have:
\[
\frac{w_j}{w_i} = \frac{e^{-\beta\|a - c_j\|}}{e^{-\beta\|a - c_i\|}} \;\Rightarrow\; w_j = w_i\, e^{-\beta(\|a - c_j\| - \|a - c_i\|)} \le w_i\, e^{-\beta\mu} \le e^{-\beta\mu}\,,
\]
for some μ > 0. Here I used the fact that ‖a − cj‖ − ‖a − ci‖ > 0 always holds for a far cell j.
Putting it all together, we have:
\begin{align*}
|Q^\pi(s,a) - Q_\beta(s,a)|
&= \sum_{j\in C} \underbrace{\big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|}_{\le\, \varepsilon/4}\, \underbrace{w_j}_{\le 1}
 + \sum_{j\in \mathcal{N}} \underbrace{\big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|}_{\le\, \varepsilon/2}\, \underbrace{w_j}_{\le 1}
 + \sum_{j\in F} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, \underbrace{w_j}_{\le\, e^{-\beta\mu}} \\
&\le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} + \sum_{j\in F} \big|Q^\pi(s,a) - Q^\pi(s,c_j)\big|\, e^{-\beta\mu} \\
&\le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} + 2N \sup_a |Q^\pi(s,a)|\, e^{-\beta\mu}\,.
\end{align*}
In order to have 2N supa |Qπ(s, a)| e−βμ ≤ ε/4, it suffices to choose
\[
\beta \;\ge\; -\frac{1}{\mu}\,\log\!\Big(\frac{\varepsilon}{8N \sup_a |Q^\pi(s,a)|}\Big) \;=:\; \beta_0\,.
\]
To conclude the proof:
|Qπ(s, a)− Qβ(s, a; θ)| ≤ ε ∀ β ≥ β0 .
For a similar proof, see [25].
Collectively, Theorems 5.2.1 and 5.2.2 guarantee that deep RBF Q functions preserve the desired
UFA property while ensuring accurate and efficient action maximization. This combination of prop-
erties stands in contrast with prior work that used function classes that enable easy action maxi-
mization but lack the UFA property [47, 4], as well as prior work that preserved the UFA property
but did not guarantee arbitrarily low accuracy when performing the maximization step [67, 94]. The
only important assumption in Theorem 5.2.2 is that the true value function Qπ(s, a) is continuous,
which is a standard assumption in the UFA literature [51] and in RL [8].
I note that, while using a large value of β makes it theoretically possible to approximate any function up to any desired accuracy, there is a downside to using large β values. Specifically, very large values of β result in extremely local approximations, which ultimately increases sample complexity because experience is not generalized from centroid to centroid. The bias–variance tension between large β values, which allow for greater accuracy, and smaller β values, which reduce sample complexity, makes intermediate values of β work best. This property could be examined formally through the lens of regularization [18].
As for scalability to large action spaces, note that the RBF formulation scales naturally owing to its
freedom to come up with centroids that best minimize the loss function. As a thought experiment,
suppose that some region of the action space has a high value, so an agent with greedy action
selection frequently chooses actions from that region. The deep RBF Q function would then move
more centroids to the region, because the region heavily contributes to the loss function. It is
unnecessary, then, to initialize centroid locations carefully, or to uniformly cover the action space
a priori. In our RL experiments in Section 5.4, we achieved reasonable results with the number of
centroids fixed across every problem, indicating that we need not rapidly increase the number of
centroids as the action dimension increases.
5.3 Experiments: Continuous Optimization
To demonstrate the operation of an RBF network in the simplest and clearest setting, I start with
a single-input continuous optimization problem, where the agent lacks access to the true reward
function but can sample input–output pairs 〈a, r〉. This setting is akin to the action maximization
step in RL for a single state or, stated differently, a continuous bandit problem. We are interested in
evaluating approaches that use tuples of experience 〈a, r〉 to learn the surface of the reward function,
and then optimize the learned function.
To this end, I chose the reward function:
\[
r(a) = \|a\|_2 \sin(a_0) + \sin^2(a_1)\,. \tag{5.6}
\]
Figure 5.3 (left) shows the surface of this function. It is clearly non-convex and includes several
local maxima (and minima). We are interested in two cases, first the problem where the goal is to
find maxa r(a), and the converse problem where we desire to find mina r(a).
Exploration is challenging in this setting [62]. Here, our focus is not to find the most effective exploration policy, but to evaluate different approaches on how effectively they represent and optimize a learned reward function r(a; θ). So, in the interest of fairness, we adopt the same random action-selection strategy for all approaches.
More concretely, we sampled 500 actions uniformly at random from A = [−3, 3]² and provided the
agent with the reward associated with the actions according to (5.6). We then used this dataset for
training. When learning ended, we computed the action that maximized (or minimized) the learned
r(a; θ). Details of the function classes used in each case, as well as how to perform maxa r(a; θ) and
mina r(a; θ) will now be presented below for each individual approach.
Figure 5.3: Left: Surface of the true reward function. Right: The surface learned by the RBF reward network in a sample run. Black dots represent the centroids.
For our first baseline, I discretized each action dimension into 7 bins, resulting in 49 cells that uniformly covered the two dimensions of the input space. For each cell, we averaged the rewards over ⟨a, r⟩ pairs whose sampled action a belonged to that cell. Once we had a learned r(a; θ), which in this case was just a 7 × 7 table, I performed maxa r(a; θ) and mina r(a; θ) by a simple table lookup. Discretization clearly fails to scale to problems with higher dimensionality, and I have included this baseline solely for completeness.
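The steps above can be sketched as follows. Note that the reward function used here is my reconstruction of (5.6), and the random seed is arbitrary, so both are assumptions rather than the thesis's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Assumed reconstruction of Eq. (5.6): r(a) = ||a||_2 sin(a_0) + sin^2(a_1).
    a = np.atleast_2d(a)
    return np.linalg.norm(a, axis=1) * np.sin(a[:, 0]) + np.sin(a[:, 1]) ** 2

# 500 actions sampled uniformly at random from A = [-3, 3]^2.
actions = rng.uniform(-3.0, 3.0, size=(500, 2))
rewards = reward(actions)

# 7 bins per dimension -> 49 cells; average the observed rewards per cell.
edges = np.linspace(-3.0, 3.0, 8)
ix = np.clip(np.digitize(actions[:, 0], edges) - 1, 0, 6)
iy = np.clip(np.digitize(actions[:, 1], edges) - 1, 0, 6)
table = np.full((7, 7), np.nan)
for i in range(7):
    for j in range(7):
        mask = (ix == i) & (iy == j)
        if mask.any():
            table[i, j] = rewards[mask].mean()

# Maximization is a simple table lookup; report the cell center.
best = np.unravel_index(np.nanargmax(table), table.shape)
centers = (edges[:-1] + edges[1:]) / 2
a_max = np.array([centers[best[0]], centers[best[1]]])
print(a_max, reward(a_max)[0])
```

The minimization case is identical with `np.nanargmin`.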
Our second baseline used the input-convex neural network architecture [4], where the neural network is constrained so that the learned reward function r(a; θ) is convex. Learning was performed by RMSProp optimization [44] with a mean-squared loss. Once r(a; θ) was learned, I used gradient ascent for finding the maximum, and gradient descent for finding the minimum. Note that this input-convex approach subsumes the quadratic case proposed by Gu et al. [47], because quadratic functions are just a special case of convex functions, but the converse is not necessarily true [29].
Our next baseline was the wire-fitting method proposed by Baird & Klopf [15]. This method is similar to RBF networks in that it also learns a set of centroids. As in the previous case, I used the RMSProp optimizer and a mean-squared loss, and finally returned the centroid with the lowest (or highest) value according to the learned r(a; θ).
Figure 5.4: Mean and standard deviation of performance for various methods on the continuous optimization task. Here, r(a) denotes the true reward associated with the action a found by each method. Top: Results for action minimization (lower is better). Bottom: Results for action maximization (higher is better). Results are averaged over 30 independent runs. The RBF architecture outperforms the alternatives in both cases.
As the last baseline, I used a standard feed-forward neural network architecture with two hidden layers to learn r(a; θ). It is well-known that this function class is capable of UFA [51] and so can accurately learn the reward function in principle. However, once learning ends, we face a non-convex optimization problem when computing maxa r(a; θ) (or the corresponding minimum). We simply initialized gradient ascent (descent) at a point chosen uniformly at random, and followed the corresponding direction until convergence.
To learn an RBF reward function, I used N = 50 centroids and β = 0.5. I again used RMSProp and mean-squared loss minimization. Recall that Theorem 5.2.1 showed that with an RBF network the following approximations are well-justified in theory: maxa∈A r(a) ≈ maxi∈[1,50] r(ai) and mina∈A r(a) ≈ mini∈[1,50] r(ai). As such, when the learning of r(a; θ) ends, I output the centroid values with the highest and lowest reward.
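A simplified sketch of this RBF reward learner follows. For brevity it fixes random centroid locations and fits only the centroid values, by least squares, rather than training both with RMSProp as the thesis does; the reward function is again my assumed reconstruction of (5.6):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Assumed reconstruction of Eq. (5.6): r(a) = ||a||_2 sin(a_0) + sin^2(a_1).
    a = np.atleast_2d(a)
    return np.linalg.norm(a, axis=1) * np.sin(a[:, 0]) + np.sin(a[:, 1]) ** 2

actions = rng.uniform(-3.0, 3.0, size=(500, 2))
rewards = reward(actions)

# N = 50 centroids and beta = 0.5, as in the text.
N, beta = 50, 0.5
centroids = rng.uniform(-3.0, 3.0, size=(N, 2))
d = np.linalg.norm(actions[:, None, :] - centroids[None, :, :], axis=-1)
w = np.exp(-beta * d)
w /= w.sum(axis=1, keepdims=True)      # normalized RBF features, shape (500, 50)
values, *_ = np.linalg.lstsq(w, rewards, rcond=None)

# Per Theorem 5.2.1, the max/min over A are approximated at the centroids.
a_max = centroids[np.argmax(values)]
a_min = centroids[np.argmin(values)]
print(reward(a_max)[0], reward(a_min)[0])
```

Even this stripped-down variant recovers a high-reward and a low-reward centroid from the same 500 samples used by the baselines.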
For each individual case, I ran the corresponding experimental pipeline with 30 different random seeds. The solution found by each learner was fed to the true reward function (5.6) to obtain the true quality of that solution. I report the average reward achieved by each function class in Figure 5.4. The RBF learner outperforms all baselines on both the maximization and the minimization problem. We further show the function learned by a sample run of RBF on the right side of Figure 5.3, which is an almost perfect approximation of the true reward function.
5.4 Experiments: Continuous Control
We now use deep Q functions for solving continuous-action RL problems. To this end, we learn a
deep RBF Q function using a learning algorithm similar to that of DQN [73], but extended to the
continuous-action case. DQN uses the following loss function for learning a deep state–action value
function:
\[
L(\theta) := \mathbb{E}_{s,a,r,s'}\Big[\big(r + \gamma \max_{a' \in A} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]\,.
\]
DQN adds tuples of experience ⟨s, a, r, s′⟩ to a buffer, and later samples a minibatch of tuples to compute ∇θL(θ). DQN maintains a second network parameterized by weights θ−. This second network, denoted Q(·, ·; θ−) and referred to as the target network, is periodically synchronized with the online network Q(·, ·; θ).
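The loss above can be sketched for a generic candidate action set over which the max is taken (the whole finite action set in DQN, the centroids in RBF–DQN). The linear Q function in the usage example is purely illustrative:

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    # Mean-squared TD loss of DQN for one minibatch. `candidates` is the
    # finite set over which the max is taken.
    s, a, r, s_next, done, candidates = batch
    q_next = np.stack([q_target(s_next, c) for c in candidates])  # (|A|, B)
    target = r + gamma * (1.0 - done) * q_next.max(axis=0)
    return np.mean((target - q_online(s, a)) ** 2)

# Toy check with a hypothetical linear value function Q(s, a) = s + a.
q = lambda s, a: s + a
batch = (np.array([0.0, 1.0]),   # s
         np.array([1.0, 0.0]),   # a
         np.array([1.0, 1.0]),   # r
         np.array([1.0, 2.0]),   # s'
         np.array([0.0, 1.0]),   # done flags
         [0.0, 1.0])             # candidate actions a'
print(dqn_loss(q, q, batch))     # ((1 + 0.99*2 - 1)^2 + (1 - 1)^2) / 2 = 1.9602
```

Note that the `done` flag zeroes out the bootstrap term, exactly as in DQN.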
RBF–DQN uses the same loss function, but modifies the function class of DQN. Concretely, DQN learns a deep network that outputs one scalar action value per action, exploiting the discrete and finite nature of the action space. By contrast, RBF–DQN takes a state vector and an action vector as input, and outputs a single scalar using a deep RBF Q function. Note that every operation in a deep RBF Q function is differentiable, so the gradient of the loss function with respect to the parameters θ can be computed using standard deep-learning libraries. Specifically, we used the TensorFlow library [1] for Python, with Keras [35] as its interface.
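A minimal, framework-free sketch of the forward pass is below. The thesis uses a deep TensorFlow/Keras network; here a single linear map stands in for the state torso, so all weight shapes and values are placeholders:

```python
import numpy as np

def rbf_q_forward(state_feats, action, params, beta):
    # Deep RBF Q function forward pass (sketch): the state is mapped to
    # N centroid locations a_i(s; theta) and N centroid values v_i(s; theta);
    # the output is the normalized RBF combination at the query action.
    W_a, W_v = params                                # (N*m, S) and (N, S)
    N, m = W_v.shape[0], action.shape[0]
    centroids = (W_a @ state_feats).reshape(N, m)    # a_i(s; theta)
    values = W_v @ state_feats                       # v_i(s; theta)
    w = np.exp(-beta * np.linalg.norm(action - centroids, axis=1))
    w /= w.sum()                                     # normalization step
    return float(w @ values), centroids, values

rng = np.random.default_rng(0)
S, N, m, beta = 4, 5, 2, 1.0                         # toy sizes
params = (rng.normal(size=(N * m, S)), rng.normal(size=(N, S)))
s = rng.normal(size=S)
q, centroids, values = rbf_q_forward(s, np.zeros(m), params, beta)
print(q)
```

Because the output is a convex combination of the centroid values, the scalar q always lies between the smallest and largest v_i(s; θ), the property exploited by Theorem 5.2.1.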
In terms of action selection, with probability ε, DQN chooses a random action, and with probability 1 − ε it chooses an action with the highest value. The value of ε is annealed so that the agent becomes more greedy as learning proceeds. To define an analog of this so-called ε-greedy policy for RBF–DQN, I sample an action uniformly at random with probability ε, and take arg maxi∈[1,N] Qβ(s, ai; θ) with probability 1 − ε. I annealed the ε parameter, similar to DQN.
Additionally, I made a minor change to the original DQN algorithm in terms of updating θ−, the weights of the target network. Concretely, I update θ− using an exponential moving average of all the previous θ values, as suggested by Lillicrap et al. [66]: θ− ← (1 − αθ−)θ− + αθ− θ, which differs from the occasional periodic updates θ− ← θ of the original DQN agent. I observed a significant performance increase with this simple modification.
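This update is a one-liner applied layer-wise; the sketch below uses toy weights and an arbitrary step size purely for illustration:

```python
import numpy as np

def update_target(theta_target, theta_online, alpha):
    # Exponential moving average of the online weights, per layer:
    # theta^- <- (1 - alpha) * theta^- + alpha * theta
    return [(1.0 - alpha) * t + alpha * o
            for t, o in zip(theta_target, theta_online)]

# Toy weights: the target network drifts toward the online network.
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(3):
    target = update_target(target, online, alpha=0.5)
print(target[0])  # every entry equals 1 - 0.5**3 = 0.875
```

With a small α (the thesis anneals nothing here; α is a fixed hyperparameter), the target lags smoothly behind the online network instead of jumping at periodic synchronizations.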
For completeness, I provide pseudocode for RBF–DQN in Algorithm 3, and have released an open repository1.
1You can find the repository github.com/kavosh8/RBFDQN.
I compared RBF–DQN's performance to other deep-RL baselines on a large set of standard continuous-action RL domains from Gym [31]. These domains range from simple tasks such as Inverted Pendulum, with a one-dimensional action space, to more complicated domains such as Ant, with a 7-dimensional action space. I used the same number of centroids, N = 30, for learning the deep RBF Q function in all domains. I found the performance of RBF–DQN to be most sensitive to two hyperparameters, namely RMSProp's learning rate αθ and the RBF smoothing parameter β. We tuned these two parameters via grid search [44] for each individual domain, while all other hyperparameters were fixed across domains.
For a meaningful comparison, I performed roughly similar numbers of gradient-based updates for
RBF–DQN and the baselines. Specifically, in all domains, I performed M = 100 updates per episode
on RBF–DQN’s network parameters θ. I used the same number of updates per episode for other
value-function-based baselines, such as input-convex neural networks [4]. Moreover, in the case of
policy-gradient baseline DDPG [66], I performed 100 value-function updates and 100 policy updates
per episode. This number of updates gave reasonable results in terms of data efficiency, and it also
helped run all experiments on modern CPUs.
In choosing baselines, my main goal was to compare RBF–DQN to other value-function-based deep-RL baselines that explicitly perform the maximization step. Thus I did not perform comparisons with an exhaustive set of existing policy-gradient methods from the literature, since they work fundamentally differently from RBF–DQN and circumvent the action-maximization step. That said, deep deterministic policy gradient (DDPG) [66] and its more advanced variant, TD3 [43], are two very common baselines in continuous control, so I included them for completeness.
Moreover, in light of recent concerns about reproducibility in RL [49], I ran each algorithm with 10 fixed random seeds and report average performance. Other than the input-convex neural network baseline [4], for which the authors released TensorFlow code, we implemented RBF–DQN and all other baselines ourselves in TensorFlow. This choice reflected a concern that comparing results across different deep-learning libraries is extremely difficult.
It is clear from Figure 5.5 that RBF–DQN is competitive with all baselines, both in terms of data efficiency and final performance.
5.5 Conclusion
I proposed, analyzed, and exhibited the strengths of deep RBF value functions in continuous control.
These value functions facilitate easy action maximization, support universal function approximation,
and scale to large continuous action spaces. Deep RBF value functions are thus an appealing choice
for value function approximation in continuous control. Controlling the smoothness parameter of
deep RBF value functions played a critical role in achieving state-of-the-art performance.
Algorithm 3 Pseudocode for RBF–DQN

Initialize: RBF network architecture with N centroids; RBF smoothing parameter β; online network parameters θ and target network parameters θ−; optimizer learning rate αθ; target-network learning rate αθ−; total training episodes E; minibatch size M; discount rate γ; replay buffer B; decay rate µ
for episode ∈ [1, E] do
  s ← env.reset(), done ← False, ε ← (1 + episode)−µ
  while done == False do
    a ← ε-greedy(Qβ(·, ·; θ), s, ε)
    s′, r, done ← env.step(s, a)
    add ⟨s, a, r, s′, done⟩ to B
    s ← s′
  end while
  for each of M minibatches ⟨s, a, r, s′, done⟩ sampled from B do
    ∆θm ← (r − Qβ(s, a; θ)) ∇θQβ(s, a; θ)
    if done == False then
      get centroids a′i(s′; θ−) ∀i ∈ [1, N]
      ∆θm += γ maxa′i Qβ(s′, a′i; θ−) ∇θQβ(s, a; θ)
    end if
  end for
  ∆θ ← (1/M) ∑m=1..M ∆θm
  θ ← RMSProp(θ, ∆θ, αθ)
  θ− ← (1 − αθ−)θ− + αθ− θ
end for

function ε-greedy(Qβ, s, ε):
  temp ∼ uniformly from [0, 1]
  if temp ≤ ε then
    a ∼ uniformly from A
    return a
  else
    get centroids ai(s; θ) ∀i ∈ [1, N]
    return arg maxai Qβ(s, ai; θ)
  end if
Figure 5.5: A comparison between RBF–DQN and baselines on continuous-action RL domains. The y-axis shows the sum of rewards across timesteps in each episode, so higher is better.
Chapter 6
Proposed Work: Safe
Reinforcement Learning with
Smooth Policies
In the context of policy search, the agent maintains an explicit policy π(a|s; θ) denoting the proba-
bility of taking action a in state s under the policy π parameterized by θ. Note that for each state,
the policy outputs a probability distribution over actions: π : S → P(A).
The agent's goal is to find a policy that maximizes the expected return, so we define an objective function J that allows us to score an arbitrary policy parameter θ:
\[
J(\theta) = \mathbb{E}_{\tau \sim \Pr(\tau|\theta)}[G(\tau)] = \sum_{\tau} \Pr(\tau|\theta)\, G(\tau)\,,
\]
where τ denotes a trajectory and G(τ) its return.
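As a toy illustration of estimating J(θ) by sampling trajectories, consider a one-step MDP (a two-armed bandit) with a softmax policy; everything here, including the payoffs and parameterization, is a made-up stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta):
    # pi(a; theta) over two actions; a deliberately tiny stand-in
    # for the state-conditional policy pi(a|s; theta).
    e = np.exp(theta - theta.max())
    return e / e.sum()

def estimate_J(theta, n=20000):
    # Monte Carlo estimate of J(theta) = E_{tau ~ Pr(tau|theta)}[G(tau)]:
    # sample n one-step trajectories and average their returns.
    p = softmax_policy(theta)
    actions = rng.choice(2, size=n, p=p)
    return np.where(actions == 0, 1.0, 0.0).mean()  # arm 0 pays 1, arm 1 pays 0

theta = np.array([1.0, 0.0])
est = estimate_J(theta)
print(est)  # close to pi(0) = e / (1 + e), roughly 0.731
```

Policy-search methods ascend exactly this kind of sampled objective, which is why its sample-based fluctuations, quantified next via Rademacher complexity, matter for safety.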
6.1 Rademacher Complexity
We use Rademacher complexity for sample-complexity analysis. We define this measure here; for details see Bartlett and Mendelson [19] or Mohri et al. [75]. Also, see Jiang et al. [53] and Lehnert et al. [63] for previous applications of Rademacher complexity in reinforcement learning.
Definition 6.1.1. Consider f : S → [−1, 1], and a set of such functions F. The Rademacher complexity of this set, Rad(F), is defined as:
\[
\mathrm{Rad}(\mathcal{F}) := \mathbb{E}_{s_j, \sigma_j} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} \sigma_j f(s_j)\,,
\]
where the σj, referred to as Rademacher random variables, are drawn uniformly at random from {±1}.
The Rademacher variables can be thought of as independent and identically distributed noise. Under this view, the average (1/n) ∑j=1..n σj f(sj) quantifies the extent to which f(·) matches the noise. A complicated hypothesis space that can accurately match noise has a high Rademacher complexity; conversely, a simple hypothesis space has a low Rademacher complexity.
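The intuition above can be checked with a Monte Carlo estimate of the (empirical, fixed-sample) Rademacher complexity of two toy function sets; both sets are made up solely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, trials=500):
    # Monte Carlo estimate of the empirical Rademacher complexity of a
    # finite function set: each row of F_values holds f(s_1), ..., f(s_n)
    # for one f in the set (a toy stand-in for a neural hypothesis space).
    n = F_values.shape[1]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(F_values @ sigma) / n
    return total / trials

n = 50
simple = np.zeros((1, n))                       # a single constant function
rich = rng.choice([-1.0, 1.0], size=(1024, n))  # many arbitrary sign patterns

r_simple = empirical_rademacher(simple)
r_rich = empirical_rademacher(rich)
print(r_simple, r_rich)
```

The trivial one-function set cannot correlate with the noise at all, while the rich set of sign patterns matches it substantially, mirroring the complicated-versus-simple contrast in the text.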
To apply Rademacher complexity to the model-based setting, where the output of the model is a vector, we extend Definition 6.1.1 to vector-valued functions that map to [−1, 1]^d. Consider a function g := ⟨f1, ..., fd⟩ where fi ∈ F for all i, and define the set of such functions G. Then:
\[
\mathrm{Rad}(\mathcal{G}) := \mathbb{E}_{s_j, \sigma_{ji}} \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{d} \sigma_{ji}\, g(s_j)_i\,,
\]
where the σji are drawn uniformly at random from {±1}.
6.2 A Generalization Bound

For a given policy with parameters θ′ and the function class G, Rademacher complexity gives us a generalization bound on the performance of the policy induced by θ′: with probability at least 1 − δ,
\[
\mathbb{E}[J(\theta')] \;\ge\; \widehat{J}(\theta') - 2\,\mathrm{Rad}(\mathcal{G}) - \sqrt{\frac{\ln(1/\delta)}{m}}\,, \tag{6.1}
\]
where Ĵ(θ′) denotes the empirical average of the policy's performance over m samples. This quantity is unavailable in the absence of environmental interaction with the policy πθ′. However, we do have access to Ĵ(θ) for all θ parameters seen before. Suppose that the following Lipschitz property holds:
\[
|\widehat{J}(\theta) - \widehat{J}(\theta')| \le K\, \|\theta - \theta'\|\,,
\]
then we can lower-bound the right-hand side of (6.1) with high probability. Moreover, Rad(G) can be bounded and quantified for Lipschitz neural networks [7]. This allows us to certify a lower bound on performance before updating the policy, thereby meeting the safety requirement.
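Under these assumptions, the certified lower bound can be computed directly; all the numeric inputs below are hypothetical, and the Rademacher bound is used in its lower-bound direction:

```python
import numpy as np

def certified_lower_bound(J_hat, K, dist, rad_G, delta, m):
    # Lower bound on E[J(theta')] obtained by chaining the Lipschitz
    # assumption |J_hat(theta) - J_hat(theta')| <= K ||theta - theta'||
    # with the Rademacher generalization bound (6.1).
    # J_hat: empirical return of the current policy theta (assumed known),
    # dist: ||theta - theta'||, rad_G: a bound on Rad(G),
    # delta: confidence level, m: sample size.
    j_hat_new = J_hat - K * dist              # worst case for theta'
    return j_hat_new - 2.0 * rad_G - np.sqrt(np.log(1.0 / delta) / m)

lb = certified_lower_bound(J_hat=0.8, K=2.0, dist=0.05,
                           rad_G=0.1, delta=0.05, m=1000)
print(lb)
```

A proposed policy update to θ′ would be accepted only if this certified lower bound clears the required safety threshold.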
6.3 Timeline
I expect to complete this chapter by the end of summer 2020. I hope to send the findings of this chapter to AAAI 2021 (submission deadline: September 5th, 2020), and later defend my thesis in October 2020.
Bibliography
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A
system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems
Design and Implementation, pages 265–283, 2016.
[2] Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement
learning. In Proceedings of the 23rd international conference on Machine learning, pages 1–8.
ACM, 2006.
[3] Alejandro Agostini and Enric Celaya. Reinforcement learning with a gaussian mixture model.
In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8. IEEE,
2010.
[4] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In Proceedings of
the 34th International Conference on Machine Learning, pages 146–155, 2017.
[5] Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L Littman. Equivalence between
wasserstein and value-aware model-based reinforcement learning. In FAIM Workshop on Pre-
diction and Generative Modeling in Reinforcement Learning, volume 3, 2018.
[6] Kavosh Asadi and Michael L. Littman. An alternative softmax operator for reinforcement
learning. In Proceedings of the 34th International Conference on Machine Learning, pages
243–252, 2017.
[7] Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michael L Littman. Combating the
compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320, 2019.
[8] Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based
reinforcement learning. In Proceedings of the 35th International Conference on Machine Learn-
ing, pages 264–273, 2018.
[9] Kavosh Asadi, Ronald E Parr, George D Konidaris, and Michael L Littman. Deep rbf value
functions for continuous control. arXiv preprint arXiv:2002.01883, 2020.
[10] Kavosh Asadi and Jason D Williams. Sample-efficient deep reinforcement learning for dialog
control. arXiv preprint arXiv:1612.06000, 2016.
[11] Kavosh Asadi Atui. Strengths, Weaknesses, and Combinations of Model-based and Model-free
Reinforcement Learning. PhD thesis, University of Alberta, 2015.
[12] Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geometric data structure.
ACM Computing Surveys, 23(3):345–405, 1991.
[13] J Andrew Bagnell and Jeff G Schneider. Autonomous helicopter control using reinforcement
learning policy search methods. In Robotics and Automation, 2001. Proceedings 2001 ICRA.
IEEE International Conference on, volume 2, pages 1615–1620. IEEE, 2001.
[14] Leemon Baird and Andrew W Moore. Gradient descent for general reinforcement learning. In
Advances in Neural Information Processing Systems, pages 968–974, 1999.
[15] Leemon C Baird and A Harry Klopf. Reinforcement learning with high-dimensional, continuous
actions. Technical report, Wright Laboratory, 1993.
[16] Chris L Baker, Joshua B Tenenbaum, and Rebecca R Saxe. Goal inference as inverse planning.
In Proceedings of the 29th Annual Meeting of the Cognitive Science Society, 2007.
[17] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[18] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[19] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. JMLR, 2002.
[20] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements
that can solve difficult learning control problems. IEEE transactions on systems, man, and
cybernetics, pages 834–846, 1983.
[21] Gleb Beliakov, Humberto Bustince Sola, and Tomasa Calvo Sanchez. A Practical Guide to
Averaging Functions. Springer, 2016.
[22] Marc G Bellemare, Will Dabney, and Remi Munos. A distributional perspective on reinforce-
ment learning. In International Conference on Machine Learning, pages 449–458, 2017.
[23] Richard Bellman. On the theory of dynamic programming. Proceedings of the National
Academy of Sciences of the United States of America, 38(8):716, 1952.
[24] Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics,
pages 679–684, 1957.
[25] Michel Benaim. On functional approximation with normalized Gaussian units. Neural Com-
putation, 6(2):319–333, 1994.
[26] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-
based reinforcement learning with stability guarantees. In Advances in Neural Information
Processing Systems, pages 908–919, 2017.
[27] D Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Trans-
actions on Automatic Control, 20(3):415–419, 1975.
[28] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In
Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1, pages
560–564. IEEE, 1995.
[29] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press,
2004.
[30] Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013.
[31] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[32] David S Broomhead and David Lowe. Radial basis functions, multi-variable functional inter-
polation and adaptive networks. Technical report, Royal Signals and Radar Establishment
Malvern (United Kingdom), 1988.
[33] Sebastien Bubeck, Remi Munos, Gilles Stoltz, and Csaba Szepesvari. X-armed bandits. Journal
of Machine Learning Research, 12(May):1655–1695, 2011.
[34] Guido Bugmann. Normalized Gaussian radial basis function networks. Neurocomputing, 20(1-
3):97–110, 1998.
[35] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
[36] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley and Sons, 2006.
[37] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Fifteenth Na-
tional Conference on Artificial Intelligence (AAAI), pages 761–768, 1998.
[38] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete
data via the em algorithm. Journal of the royal statistical society. Series B (methodological),
pages 1–38, 1977.
[39] Amir-Massoud Farahmand, Andre Barreto, and Daniel Nikovski. Value-Aware Loss Function
for Model-based Reinforcement Learning. In Proceedings of the 20th International Conference
on Artificial Intelligence and Statistics, pages 1486–1494, 2017.
[40] Amir-Massoud Farahmand, Andre Barreto, and Daniel Nikovski. Value-Aware Loss Function
for Model-based Reinforcement Learning. In Proceedings of the 20th International Conference
on Artificial Intelligence and Statistics, pages 1486–1494, 2017.
[41] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite markov decision
processes. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages
162–169. AUAI Press, 2004.
[42] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via
soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial
Intelligence, pages 202–211. AUAI Press, 2016.
[43] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in
actor-critic methods. In International Conference on Machine Learning, pages 1587–1596,
2018.
[44] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[45] Geoffrey J. Gordon. Reinforcement learning with function approximation converges to a region,
2001. Unpublished.
[46] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Fi-
nale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare.
Nat Med, 25(1):16–18, 2019.
[47] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-
learning with model-based acceleration. In International Conference on Machine Learning,
pages 2829–2838, 2016.
[48] Barbara Hammer and Kai Gersmann. A note on the universal approximation capability of
support vector machines. Neural Processing Letters, 17(1):43–53, 2003.
[49] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David
Meger. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[50] Karl Hinderer. Lipschitz continuity of value functions in Markovian decision processes. Math-
ematical Methods of Operations Research, 62(1):3–22, 2005.
[51] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
[52] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective
planning horizon on model accuracy. In Proceedings of AAMAS, pages 1181–1189, 2015.
[53] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective
planning horizon on model accuracy. In Proceedings of the 2015 International Conference on
Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for
Autonomous Agents and Multiagent Systems, 2015.
[54] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A
survey. Journal of artificial intelligence research, 4:237–285, 1996.
[55] Sham Kakade, Michael J Kearns, and John Langford. Exploration in metric state spaces.
In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages
306–312, 2003.
[56] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning.
In Proceedings of the International Conference on Machine Learning, pages 267–274, 2002.
[57] Nicolaos B Karayiannis. Reformulated radial basis neural networks trained by gradient descent.
IEEE Transactions on Neural Networks, 10(3):657–671, 1999.
[58] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.
[59] Seungchan Kim, Kavosh Asadi, Michael Littman, and George Konidaris. Deepmellow: remov-
ing the need for a target network in deep q-learning. In Proceedings of the 28th International
Joint Conference on Artificial Intelligence, pages 2733–2739. AAAI Press, 2019.
[60] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In
Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681–690.
ACM, 2008.
[61] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.
The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[62] Tor Lattimore and Csaba Szepesvari. Bandit algorithms. preprint, 2018.
[63] Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of
long horizon problems. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[64] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441,
2017.
[65] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
arXiv preprint arXiv:1509.02971, 2015.
[66] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
arXiv preprint arXiv:1509.02971, 2015.
[67] Sungsu Lim, Ajin Joseph, Lei Le, Yangchen Pan, and Martha White. Actor-expert: A
framework for using action-value methods in continuous action spaces. arXiv preprint
arXiv:1810.09103, 2018.
[68] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence
and applications. In Proceedings of the 13th International Conference on Machine
Learning, pages 310–318, 1996.
[69] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence
and applications. In Lorenza Saitta, editor, Proceedings of the Thirteenth International
Conference on Machine Learning, pages 310–318, 1996.
[70] Michael Lederman Littman. Algorithms for Sequential Decision Making. PhD thesis, Depart-
ment of Computer Science, Brown University, February 1996. Also Technical Report CS-96-09.
[71] Edward Lorenz. Predictability: Does the flap of a butterfly’s wing in Brazil set off a tornado
in Texas? Presented at the meeting of the American Association for the Advancement of
Science, 1972.
[72] Guillaume Matheron, Nicolas Perrin, and Olivier Sigaud. The problem with DDPG: un-
derstanding failures in deterministic environments with sparse rewards. arXiv preprint
arXiv:1911.11679, 2019.
[73] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[74] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Learning multimodal tran-
sition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470,
2017.
[75] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learn-
ing. MIT Press, 2012.
[76] John Moody and Christian J Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation, 1(2):281–294, 1989.
[77] Alfred Müller. Optimal selection from distributions with unknown parameters: Robustness of
Bayesian models. Mathematical Methods of Operations Research, 44(3):371–386, 1996.
[78] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap
between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892,
2017.
[79] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypoth-
esis. In Advances in Neural Information Processing Systems, pages 1786–1794, 2010.
[80] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized
Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
[81] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in
neural networks. In Proceedings of The 28th Conference on Learning Theory, pages 1376–
1401, 2015.
[82] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L Littman.
An analysis of linear models, linear value-function approximation, and feature selection for
reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning,
pages 752–759. ACM, 2008.
[83] Jason Pazis and Ronald Parr. PAC optimal exploration in continuous space Markov decision
processes. In AAAI, 2013.
[84] Theodore J Perkins and Doina Precup. A convergent form of approximate policy iteration. In
Advances in Neural Information Processing Systems, pages 1595–1602, 2002.
[85] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI,
Atlanta, 2010.
[86] Bernardo Avila Pires and Csaba Szepesvári. Policy error bounds for model-based reinforcement
learning with factored linear models. In Conference on Learning Theory, pages 121–151, 2016.
[87] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov
decision processes. Machine Learning, 100(2-3):255–283, 2015.
[88] Michael JD Powell. Radial basis functions for multivariable interpolation: a review. Algorithms
for approximation, 1987.
[89] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, 2014.
[90] Emmanuel Rachelson and Michail G. Lagoudakis. On the locality of action domination in
sequential decision making. In International Symposium on Artificial Intelligence and Mathe-
matics, ISAIM 2010, Fort Lauderdale, Florida, USA, January 6-8, 2010, 2010.
[91] Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in MDPs.
In Decision Making with Imperfect Decision Makers, pages 57–74. Springer, 2012.
[92] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical
Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[93] Stuart J Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[94] Moonkyung Ryu, Yinlam Chow, Ross Anderson, Christian Tjandraatmadja, and Craig
Boutilier. CAQL: Continuous action Q-learning. arXiv preprint arXiv:1909.12397, 2019.
[95] Aysel Safak. Statistical analysis of the power sum of multiple correlated log-normal compo-
nents. IEEE Transactions on Vehicular Technology, 42(1):58–61, 1993.
[96] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep rein-
forcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76,
2017.
[97] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Ried-
miller. Deterministic policy gradient algorithms. In ICML, 2014.
[98] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Ried-
miller. Deterministic policy gradient algorithms. In International Conference on Machine
Learning, pages 387–395, 2014.
[99] Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence
results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 39:287–
308, 2000.
[100] Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New
benefits and new perspective. In International Conference on Machine Learning, pages 5916–
5925, 2019.
[101] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs:
PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
[102] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on
approximating dynamic programming. In Proceedings of the Seventh International Conference
on Machine Learning, pages 216–224, Austin, TX, 1990. Morgan Kaufmann.
[103] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT
Press, 1998.
[104] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,
2018.
[105] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradi-
ent methods for reinforcement learning with function approximation. In Advances in Neural
Information Processing Systems, pages 1057–1063, 2000.
[106] Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael H. Bowling. Dyna-style
planning with linear function approximation and prioritized sweeping. In UAI 2008,
Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland,
July 9-12, 2008, pages 528–536, 2008.
[107] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning, 4(1):1–103, 2010.
[108] Erik Talvitie. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth
Conference on Uncertainty in Artificial Intelligence, pages 780–789. AUAI Press, 2014.
[109] Erik Talvitie. Self-correcting models for model-based reinforcement learning. In Proceedings
of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San
Francisco, California, USA, 2017.
[110] Sebastian B. Thrun. The role of exploration in learning control. In David A. White and
Donald A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Ap-
proaches, pages 527–559. Van Nostrand Reinhold, New York, NY, 1992.
[111] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.
[112] Emanuel Todorov. Linearly-solvable Markov decision problems. In NIPS, pages 1369–1376,
2006.
[113] Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical
and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning, pages 177–184. IEEE, 2009.
[114] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of
learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015.
[115] Christopher Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
[116] Jason D Williams, Kavosh Asadi Atui, and Geoffrey Zweig. Hybrid code networks: practical
and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceed-
ings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 665–677, 2017.
[117] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein-
forcement learning. Machine Learning, 8(3):229–256, 1992.