

Final Master’s Degree Thesis

Learning Recursive Goal Proposal
A Hierarchical Reinforcement Learning approach

Master’s Degree in Artificial Intelligence

Author: Rafel Palliser Sans

Supervisor: Mario Martin Muñoz

April 2021


Rafel Palliser Sans: Learning Recursive Goal Proposal. A Hierarchical Reinforcement Learning approach ©, April 2021.

A Final Master's Degree Thesis submitted to the Facultat d'Informàtica de Barcelona (FIB) - Universitat Politècnica de Catalunya (UPC) - Barcelona Tech, the Facultat de Matemàtiques de Barcelona - Universitat de Barcelona (UB) and the Escola Tècnica Superior d'Enginyeria - Universitat Rovira i Virgili (URV) in partial fulfillment of the requirements for the Master's Degree in Artificial Intelligence.

Thesis produced at the Facultat d'Informàtica de Barcelona under the supervision of Prof. Mario Martin.

Author: Rafel Palliser Sans

Supervisor: Mario Martin Muñoz

Location: Menorca, Spain


What we want is a machine that can learn from experience.

— Alan Turing


Abstract

Reinforcement Learning's unique way of learning has led to remarkable successes like AlphaZero [29] or AlphaGo [28], mastering the games of Chess and Go and being able to beat the respective World Champions. Notwithstanding, Reinforcement Learning algorithms are underused in real-world applications compared to other techniques such as Supervised or Unsupervised Learning.

One of the most significant problems limiting the applicability of Reinforcement Learning in other areas is its sample inefficiency, i.e., its need for vast amounts of interactions to obtain good behaviors.

Off-policy algorithms, those that can learn a policy different from the one they use for exploration, take advantage of Replay Buffers to store the gathered knowledge and reuse it, which is already a step towards sample efficiency. However, in complex tasks, they still need many interactions with the environment to explore and learn good policies.

This master thesis presents Learning Recursive Goal Proposal (LRGP), a new hierarchical algorithm based on two levels, in which the higher one serves as a goal proposal for the lower one, which interacts with the environment following the proposed goals. The main idea of this novel method is to break a task into two parts, speeding up and easing the learning process.

In addition, LRGP implements a new reward system that takes advantage of non-sparse rewards to increase sample efficiency by generating more transitions per episode, which are stored and reused thanks to Experience Replay.

LRGP, which has the flexibility to be used with a wide variety of Reinforcement Learning algorithms in environments of different nature, obtains State-of-the-Art results both in performance and efficiency when compared to methods such as Double DQN [14] or Soft Actor Critic (SAC) [12, 13] in the Simple MiniGrid and Pendulum environments.

Keywords: Reinforcement Learning · Hierarchical RL · Sample Efficiency · Recursivity · Subgoal Proposal · Actor Critic · LRGP · DDQN · HER · SAC · UMDP · UVFA · AI.


Acknowledgments

First of all, I would like to show my greatest gratitude to my thesis supervisor, Professor Mario Martin. The interest and enthusiasm he demonstrated for the project were crucial for developing this new algorithm. Thanks for our uncountable meetings and endless fruitful discussions that made our first idea see the light in this project.

Moreover, I also need to thank the Barcelona Supercomputing Center (BSC) for providing me with the necessary computational resources for the project presented in this thesis.

I also wish to thank my MAI friends, who, aside from exchanging ideas for our research, have motivated and supported me during this project and the whole master's.

Finally, I would like to thank my family and friends here in Menorca for being my main support and providing laughs and distraction when needed during all these last months.


Contents

1 Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Overview

2 Reinforcement Learning
  2.1 Reinforcement Learning
  2.2 Model-free Reinforcement Learning
    2.2.1 Value-based algorithms
    2.2.2 Policy gradient algorithms
    2.2.3 Actor Critic Methods
  2.3 Universal Markov Decision Processes (UMDP)
    2.3.1 Universal Value Function Approximators (UVFA)
    2.3.2 Hindsight Experience Replay (HER)

3 Meta Reinforcement Learning and Multi-Task Learning
  3.1 The concept
  3.2 Related Work

4 Learning Recursive Goal Proposal
  4.1 Idea and motivation
  4.2 Algorithm Overview
  4.3 Main components
    4.3.1 The low hierarchy
    4.3.2 The high hierarchy
    4.3.3 Is reachable function
    4.3.4 Incomplete goal spaces

5 Experiments and Results
  5.1 Environments
    5.1.1 Simple MiniGrid
    5.1.2 Pendulum
  5.2 Preliminary studies
  5.3 Main Results

6 Conclusions

Bibliography

Appendix A Implementation Details


List of Figures

2.1 Diagram of the Reinforcement Learning paradigm
2.2 Actor Critic general architecture
2.3 Universal Value Function Approximators' architectures

4.1 Diagram to illustrate high hierarchy transition generation
4.2 Diagram to illustrate is reachable data gathering

5.1 Original MiniGrid Empty 6x6 Environment
5.2 Simple MiniGrid Environments
5.3 Pendulum Environment
5.4 Algorithm comparison for the high hierarchy
5.5 Comparison between estimation techniques
5.6 Study on the sample efficiency
5.7 Analysis of our solution to incomplete goal spaces
5.8 LRGP's learning process
5.9 State-of-the-Art comparison in Simple MiniGrid Empty 15x15
5.10 State-of-the-Art comparison in Simple MiniGrid FourRooms 15x15
5.11 State-of-the-Art comparison in Pendulum

A.1 ε-greedy exploration

List of Tables

A.1 DDQN default hyperparameters for Simple MiniGrid environments
A.2 SAC default hyperparameters for Simple MiniGrid environments
A.3 Learning parameters for Simple MiniGrid environments
A.4 SAC default hyperparameters for the Pendulum environment
A.5 Learning parameters for the Pendulum environment


Glossary

LRGP: Learning Recursive Goal Proposal

AI: Artificial Intelligence

DDQN: Double Deep Q-Network

DRL: Deep Reinforcement Learning

HER: Hindsight Experience Replay

MRL: Meta Reinforcement Learning

RL: Reinforcement Learning

SAC: Soft Actor Critic

UMDP: Universal Markov Decision Processes

UVFA: Universal Value Function Approximators


1. Introduction

This final master's degree thesis presents our new algorithm, Learning Recursive Goal Proposal, which is designed to solve Reinforcement Learning environments more efficiently by splitting the problem into two hierarchical, more manageable sub-tasks; by using Replay Buffers to leverage sample efficiency; and by making use of recursion to save computational resources.

1.1 Motivation

Reinforcement Learning (RL) is quite different from other Artificial Intelligence learning techniques like Supervised and Unsupervised Learning, as it does not require a dataset. In contrast to these well-known learning paradigms, RL uses an agent that learns to solve a task by trial and error through interaction with an environment or world. Although this process mimics human learning quite closely, it is usually inefficient, as it may require millions of interactions to learn a specific task. When thinking about real applications, these inefficiencies often translate into costs both in time and resources. For all these reasons, it is crucial to seek every kind of efficiency when creating RL algorithms, so they become more readily applicable to real-world problems.

First, in the field of Reinforcement Learning, Hierarchical approaches apply a "divide and conquer" strategy by splitting the task into multiple, less complex pieces, which often have smaller search spaces and are therefore faster to solve. Furthermore, splitting a task into sub-problems makes each of them more reusable, an essential quality for efficiency. Imagine, for example, that we want to teach a robot how to walk from a certain point A to any other point B in a city. If we could somehow decouple the learning process of walking from the learning process of navigating the city optimally, we could reuse the latter for another agent, such as a self-driving car. At the same time, learning each sub-task independently would be easier and faster.

Second, as opposed to on-policy methods, which require fresh and updated data to learn (and therefore constant interaction with the environment), off-policy methods have the ability to use "old" data. The agent can thus accumulate experience from environment interactions and use it as many times as required. The capability to reuse gathered information makes off-policy algorithms much more sample efficient, less reliant on environment interactions, and therefore less costly for real-world applications.

Third, modern Reinforcement Learning algorithms rely on Deep Learning and Neural Networks, which are highly powerful function approximators yet costly in computational resources. Making clever use of them is vital for a proper methodology.

This work aims to design, test, and analyze a Reinforcement Learning algorithm that combines all the previously mentioned concepts into a pipeline capable of solving varied tasks efficiently.


1.2 Contribution

This project's main contribution is the proposal of a new Reinforcement Learning algorithm based on a two-level hierarchy where the high level guides the low one by recursively proposing intermediate subgoals to be followed until arriving at the final solution. To the best of our knowledge, this is the first algorithm that merges recursive and hierarchical solutions to solve RL environments.

Alongside this, and no less important, the high level of our hierarchy is trained with a new non-sparse reward system that mixes concepts from Monte Carlo estimation and Experience Replay to achieve high sample efficiency.

Together, all the pieces constitute an algorithm that is at once novel, powerful, sample efficient, and cost-effective in terms of computational resources.

1.3 Overview

Apart from this first chapter, the thesis is divided into five different parts.

Chapter 2 contains all the theoretical background needed to understand this project. More specifically, Section 2.1 presents the problem of Reinforcement Learning, while Section 2.2 introduces the family of Model-free RL methods, along with the State-of-the-Art algorithms in this category. Finally, Section 2.3 includes an introduction to Universal Markov Decision Processes and the techniques used to adapt RL algorithms to this setting.

Next, Chapter 3 presents the concepts of Meta Reinforcement Learning and Multi-task Learning, closely related to this project. Additionally, it presents some methods that apply these techniques with objectives similar to those of this work.

After the introduction of the background and related work, Chapter 4 presents our new algorithm. In particular, Section 4.1 explains the idea from which we built the method, Section 4.2 follows with an overview of the algorithm, and Section 4.3 includes a detailed description of the algorithm's main components.

Later, Chapter 5 includes all the experiments performed and their results, used to understand how our algorithm learns, assess its performance and efficiency, and compare it to State-of-the-Art techniques.

Finally, the last part of this thesis, Chapter 6, draws some conclusions about this work and presents ideas for future work.


2. Reinforcement Learning

This chapter covers all the theoretical background used in this project, ranging from Reinforcement Learning basics to more specific concepts related to Model-free Reinforcement Learning and Universal Markov Decision Processes. All the material has been extracted from Professor Martin's RL Course Notes [18] and the Reinforcement Learning book [30] by Sutton and Barto.

2.1 Reinforcement Learning

As introduced in Chapter 1, Reinforcement Learning (RL) is different from other learning paradigms in the sense that it neither uses a dataset nor intends to learn a mapping into classes. Instead, in RL, we have an agent whose goal is to learn a behavior that helps it achieve a goal while interacting with an environment.

The state of the agent in the environment at a certain discrete time-step is entirely and uniquely defined by s_t ∈ S, which can be multidimensional. In this state, the agent performs an action a_t ∈ A, also multidimensional, which leads to a new state of the environment s_{t+1} ∈ S. Additionally, the agent receives feedback in the form of a scalar reward r_{t+1} ∈ ℝ. Figure 2.1 shows a diagram of the RL framework.

Figure 2.1: Diagram of the Reinforcement Learning paradigm. The agent performs an action a_t on the environment, which returns the next state s_{t+1} and a reward r_{t+1}.

It is important to note that the feedback does not correspond to the optimal action, as it would in Supervised Learning. In fact, in many RL environments, the optimal behavior is unknown a priori and discovered through optimization, so it is impossible to provide immediate feedback on the agent's actions. Therefore, the reward is commonly sparse and delayed until the agent reaches the goal, making learning more challenging. Solutions to reward sparsity are presented in Section 2.3.2.

Under this framework, the goal is to learn a behavior or policy that decides which action to take in every state in order to maximize the long-term reward received by the agent.


Formal Definition. More formally, the RL problem can be formulated as a Markov Decision Process (MDP), that is, a tuple ⟨S, A, R, P⟩, in which:

• S denotes a finite set of states

• A denotes a finite set of actions

• R(s, a, s') = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'] is a reward function

• P(s, a, s') = Pr{s_{t+1} = s' | s_t = s, a_t = a} is a transition probability function for which the Markov property holds, that is, the probability of the next state s_{t+1} = s' only depends on the current state s_t = s and not on older ones.

With this formulation, we can now define a policy π : S → A as a mapping from states to actions, and the goal becomes choosing the actions that obtain the maximum long-term reward, defined as the sum of immediate rewards. From a certain time-step t, the long-term reward R_t is defined as

R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots = \sum_{k=0}^{\infty} r_{t+1+k}    (2.1)

Note that if the long-term reward does not have a horizon, this sum is unbounded. For this reason, a discount factor γ ∈ [0, 1] is often introduced for future rewards.

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (2.2)

In this project we will use finite horizons H, so there is no need to introduce a discount. Our long-term reward will be calculated as

R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots = \sum_{k=0}^{H} r_{t+1+k}    (2.3)
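As a concrete example of these definitions, the following short Python sketch (the reward values and the horizon are hypothetical, chosen only for illustration) computes the discounted return of Equation 2.2 and the undiscounted finite-horizon return of Equation 2.3 from a list of immediate rewards collected after time-step t.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted long-term reward: R_t = sum_k gamma^k * r_{t+1+k} (Eq. 2.2)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def finite_horizon_return(rewards, horizon):
    """Undiscounted long-term reward over a finite horizon H (Eq. 2.3)."""
    return sum(rewards[:horizon + 1])

# Hypothetical immediate rewards r_{t+1}, r_{t+2}, ...
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
print(discounted_return(rewards))         # gamma^2 + gamma^4 with gamma = 0.99
print(finite_horizon_return(rewards, 3))  # only the first H + 1 rewards are summed
```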

Value Functions. Value functions are introduced to evaluate the goodness of states by estimating the long-term reward that an agent would receive from a specific state if following a particular policy.

Concretely, the state-value function (also called V-Value function) V^π(s) for state s following policy π is defined as

V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]    (2.4)

Besides, the action-value function (also called Q-Value function) estimates the long-term reward obtained from a state s when taking action a and then following the policy π, and is defined as

Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]    (2.5)


Bellman Equations. The Bellman equation for the V-Value function can easily be obtained by decomposing Equation 2.4 and regrouping, as shown in Equation 2.6. The V-Value of a certain state is formed by two parts: the immediate reward r_{t+1}, and the discounted value of the next state, obtained following the policy π. Additionally, we can recursively decompose any V-Value into further time-steps, which will be crucial to learn optimal policies.

V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s]
         = \mathbb{E}_\pi[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s]
         = \mathbb{E}_\pi[r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \dots) \mid s_t = s]
         = \mathbb{E}_\pi[r_{t+1} + \gamma R_{t+1} \mid s_t = s]
         = \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s]    (2.6)

Similarly, one can obtain the Bellman equation for the Q-Value function as

Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a]
            = \mathbb{E}_\pi[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s, a_t = a]
            = \mathbb{E}_\pi[r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \dots) \mid s_t = s, a_t = a]
            = \mathbb{E}_\pi[r_{t+1} + \gamma R_{t+1} \mid s_t = s, a_t = a]
            = \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a]    (2.7)

Note that the immediate reward r_{t+1} is affected by the taken action a_t = a, but the rest depends on the policy π, which is why we can convert the second part of the equation into a V-Value. Also, taking into account that the difference between a V-Value and a Q-Value lies only in the first action, if that action is in fact the one chosen by the policy, both value functions match: Q^π(s, π(s)) = V^π(s). Therefore, we can rewrite Equation 2.7 as follows, making it recursive too.

Q^\pi(s, a) = \mathbb{E}_\pi[r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1})) \mid s_t = s, a_t = a]    (2.8)

Optimal policies. An optimal policy π* is a policy that, when followed, obtains for every possible state of the environment a value at least as good as the one that could have been obtained with any other policy. Formally,

V^{\pi^*}(s) \geq V^\pi(s) \quad \forall s \in S    (2.9)

From another point of view, an optimal policy must be capable of selecting the best action to take from any state to maximize the long-term reward. Otherwise, there would exist another policy that took a better action in that state while behaving identically in the others, and which would therefore be better.

\pi^*(s) = \arg\max_{a \in A} \mathbb{E}_{\pi^*}[R_{t+1}] = \arg\max_{a \in A} Q^{\pi^*}(s, a) \quad \forall s \in S    (2.10)

This idea of improving a policy by taking a better action in a certain state is often called "greedification", and it is the core of the most basic algorithms to find optimal policies: Policy Iteration (PI) and Value Iteration (VI). The former iteratively evaluates a policy by computing the value of all states and then updates the policy by making it greedy. The latter, instead, directly uses the Bellman equations to update the values, obtaining a greedy policy once the values have converged.

Both algorithms are proven to converge to the optimal policy. However, they only work under critical assumptions that come from the fact that they are defined on top of a Markov Decision Process (MDP). As explained, an MDP is constituted by a 4-tuple, which includes a state space, an action space, the reward function, and the transition function between states. Therefore, to apply these algorithms one must know the complete dynamics of the environment as well as the rewards for each state and action beforehand. This type of Reinforcement Learning is called Model-based RL, and it needs a model of the environment (its dynamics) in order to learn value functions or optimal policies. On the other side, Model-free RL is another group of algorithms in which the agent explores to gather information at the same time that it improves its policy. This project is based on Model-free algorithms.

2.2 Model-free Reinforcement Learning

The family of Model-free RL algorithms is itself divided into several groups of algorithms, which are explained throughout this section, as Learning Recursive Goal Proposal uses several of them.

2.2.1 Value-based algorithms

This sub-family of methods relies on the computation of value functions as an estimation of long-term rewards. If we had a model of the world, an MDP, we could calculate the V-Values or Q-Values using the reward function and the transition probability function by computing the expected long-term reward as in Equation 2.4 or Equation 2.5. However, as this information is not available in Model-free RL, these estimations are computed through an average of the empirical long-term reward received from each state.

In Model-free RL the agent interacts with the environment generating episodes or trials τ = (s_0, a_0, s_1, r_1, a_1, s_2, r_2, ...) following its current policy. From any time-step we can calculate the empirical long-term reward R_t as in Equations 2.1 to 2.3. Finally, we can average all long-term rewards obtained from a certain state s when taking action a to estimate Q(s, a).

Monte Carlo policy learning, one of the best-known Model-free Value-based algorithms, takes this idea but applies a clever approximation of the averaging step that does not need to store millions of trials before being able to compute an average, while also permitting the policy to be updated to gather new information. To do so, the values of Q(s, a) are initialized randomly, and any time the agent visits state s and takes action a, the following update rule is applied:

Q(s, a) \leftarrow Q(s, a) + \alpha \left( R_t - Q(s, a) \right)    (2.11)

Note that instead of having to store all empirical returns R_t, we only need one entry for each Q(s, a), which is softly updated depending on the parameter α. When doing this iteratively, Q(s, a) converges to an approximation of the expected R_t.

Q-learning, another well-known method, uses the same idea but substitutes R_t using the Bellman equation:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r_{t+1} + \gamma Q(s', \pi(s')) - Q(s, a) \right]
Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \left[ r_{t+1} + \gamma Q(s', \pi(s')) \right]    (2.12)

Q-learning is also known as a Temporal Difference (TD) method, as it uses the Q-Values of the next state to update the current one. This process of using value estimates to learn other value estimates is called bootstrapping.

The difference between Monte Carlo and Q-learning is that the former uses the long-term reward, while the latter uses immediate rewards and bootstrapping, which allows it to update the Q-Values at every step instead of at every trial (MC needs to finish the trial to compute R_t).
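As a minimal illustration of the two update rules, the tabular sketch below (dictionary-based Q-table and a hypothetical discrete action encoding) contrasts the Monte Carlo update of Equation 2.11, which needs the empirical return of a finished trial, with the Q-learning update of Equation 2.12, which bootstraps from the next state after every single step.

```python
from collections import defaultdict

Q = defaultdict(float)      # Q[(s, a)], unseen pairs default to 0
alpha, gamma = 0.1, 0.99
actions = [0, 1, 2, 3]      # hypothetical discrete action space

def mc_update(episode):
    """episode: list of (s, a, r) tuples; applies Eq. 2.11 once the trial has ended."""
    R = 0.0
    for s, a, r in reversed(episode):           # accumulate the return backwards
        R = r + gamma * R
        Q[(s, a)] += alpha * (R - Q[(s, a)])    # soft update towards the empirical return

def q_learning_update(s, a, r, s_next):
    """Applies Eq. 2.12 after every environment step, bootstrapping on the next state."""
    greedy_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * greedy_next - Q[(s, a)])
```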


Note, nevertheless, that these algorithms have a problem. As mentioned in the previous section in Equation 2.10, the optimal policy takes the greedy action, the one with the highest Q-Value. Therefore, it could happen that after a few iterations the agent always took the same actions, looking for the largest reward, which would prevent it from exploring possibly even better behaviors. For this reason, whenever generating new trials the agent uses an exploratory policy. One of the most used methods to address this Exploitation vs. Exploration problem is the ε-greedy exploration policy, which takes a random action with probability ε, and the optimal one otherwise.
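A minimal sketch of ε-greedy action selection over such a tabular Q function (using the same hypothetical Q-table layout as above) could look as follows:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Take a random action with probability epsilon, the greedy action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```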

At this point it is worth mentioning that Model-free methods can also be classified into two families: On-policy and Off-policy algorithms. The former optimize the policy they are using to generate the trials, while the latter are able to learn a policy different from the one used to explore. Monte Carlo is an on-policy method because all the information used to update the Q-Values, R_t, is obtained empirically, so the policy gets updated with information gathered by the same policy. On the other hand, Q-learning is an off-policy method because it gathers information using an exploratory policy like ε-greedy, but the update rule seen in Equation 2.12 also uses information about the policy π, which is purely greedy and different from the one used to explore.

Although on-policy methods tend to be more stable and converge faster, off-policy algorithms can take advantage of learning a different policy than the one used to explore, as we will see in the following.

Deep Q-learning. All the methods presented up to this point have some significant drawbacks. On the one hand, they work in a tabular manner, only applicable to discrete state and action spaces. This problem could be solved by discretizing continuous environments, but there is a more critical shortcoming: the curse of dimensionality.

The number of combinations of states and actions grows exponentially with the dimensionality of their spaces. Therefore, storing huge Q-Value tables for every state and action can become prohibitive. Even worse, obtaining a reasonably good estimate for each entry implies visiting each state and taking each action from it several times.

For this reason, the use of function approximators is crucial: methods that can approximate a full Q-table with a limited amount of parameters, independently of the dimensionality of the state and action spaces, besides having the ability to generalize to unseen states.

In order to update the parameters, gradient descent is used along with an error function that compares the estimation with the "true" value in a supervised way. As there is no known true value, bootstrapping is often used. However, the use of gradient descent in the online incremental learning scheme (where we generate trials in an online fashion) comes with convergence problems due to the data not being i.i.d., as there is high correlation among consecutive states. In addition, it would be inefficient to compute the gradient for each data point. To solve this, gradient batch methods are introduced.

Gradient batch methods take advantage of off-policy algorithms to learn using data from another policy, and instead of training their function approximators on online data (just gathered from experience), they create a dataset D of past experiences coded as tuples (s, a, r, s'). Using this idea, they can decouple the exploration step, in which the dataset is filled, from the learning stage, also called Experience Replay, in which batches of transitions are sampled from D. Using batches alleviates the i.i.d. problem, as the transitions in a batch are selected randomly, and, equally important, it brings the possibility of using Stochastic Gradient Descent.


Moreover, the use of a dataset provides the option to reuse data, which makes these methods much more sample efficient.
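The dataset D described above is usually implemented as a bounded FIFO buffer from which mini-batches are sampled uniformly at random. The sketch below is a minimal, framework-agnostic version (the capacity and field layout are illustrative assumptions, not the buffer used in this thesis).

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s') and samples random mini-batches from them."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are discarded first

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)   # random picks break correlation
        s, a, r, s_next = zip(*batch)
        return list(s), list(a), list(r), list(s_next)

    def __len__(self):
        return len(self.buffer)
```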

The first method to appear and one of the best known is Deep Q-Network (DQN) [20], presented in 2015 by Mnih et al. Their algorithm was one of the first to use Neural Networks (NN) as function approximators for Q-Values, introducing Deep Learning into the field of Reinforcement Learning, a combination known as Deep Reinforcement Learning (DRL). Additionally, DQN uses the Bellman equation and takes advantage of batch computing and off-policy learning.

More specifically, the mentioned algorithm uses a NN with parameters θ to approximate Q_θ(s, a), and for each batch of samples (s, a, r, s'), computes the "true" value Q(s, a) by bootstrapping:

Q(s, a) = r + \gamma \max_{a'} Q_{\theta'}(s', a')    (2.13)

Note that according to the Bellman equation (see Equation 2.8), we should select an action a' = π(s'). Taking into account that this is a Value-based method whose goal is to estimate the Q-Values, the policy is obtained by "greedification" over these estimates. Therefore, following the idea in Equation 2.10, the "true" value is computed with the action the optimal policy would select, the greedy one.

Also note that if the "true" value were calculated with the same parameters θ that are being updated at each iteration, DQN would suffer from a problem called moving target, in which the value we are trying to estimate, the "true" one, changes over time. To solve this, Mnih et al. use the trick of a target network, another NN parametrized by θ', which is updated at a much lower pace, helping the convergence of the algorithm.

Once we have calculated this "true" or target value, we can compute the loss, also called TD error, by comparing it with the current estimate. Afterwards, the gradient of this loss w.r.t. the parameters θ can be computed to update them using SGD.

\text{TD-error} = Q_\theta(s, a) - Q(s, a) = Q_\theta(s, a) - \left[ r + \gamma \max_{a'} Q_{\theta'}(s', a') \right]    (2.14)

This update rule, however, has an inconvenience. When using a random initialization of Q, we can overestimate the real Q-Values by iteratively applying r + γ max_{a'} Q_{θ'}(s', a') on the same network, propagating this overestimation to other states and making the problem even worse.

Hasselt et al. recognized this problem and presented Double Q-learning [14], also called Double DQN (DDQN). In this algorithm, they introduced the idea of having a set of two different networks A and B to estimate Q(s, a). Using one to estimate the values of the other should compensate mistakes, as the networks would be independent. Nonetheless, taking into account that DQN already used two networks (the value network to estimate and the target network to compute the target values), they proposed using the former to select the best action and the latter to evaluate it, calculating the "true" or target value as

Q(s, a) = r + \gamma \, Q_{\theta'}\!\left( s', \arg\max_{a'} Q_\theta(s', a') \right)    (2.15)

Note that a' is selected using Q_θ, the value network, but the estimation of Q(s', a') is taken from Q_{θ'}, the target network. As the target network is updated asynchronously, both networks will have different values and overestimations will mutually mitigate each other. The complete DDQN method is shown in Algorithm 2.1.


Apart from DDQN, the literature contains various publications that make incremental contributions to DQN, improving its performance. Prioritized Experience Replay [26], presented by Schaul et al., implements a method to sample transitions from the buffer more cleverly. Dueling Network Architectures [31], published by Wang et al., add new NNs to the algorithm to compute a new value function called Advantage, which improves convergence. Finally, Hessel et al. presented Rainbow [15], which merged all these ideas into the same algorithm to obtain State-of-the-Art results back in 2017.

Algorithm 2.1 DDQN [14]

1: Initialize weights θ for the value network
2: Initialize target network as θ' ← θ
3: Initialize a buffer D
4: s ← initial state
5: repeat
6:     Choose a from s using an exploratory policy derived from Q_θ
7:     Execute action a, observe r, s' and add (s, a, r, s') to dataset D
8:     Sample a mini-batch B from D
9:     Update parameters θ using
       θ ← θ − α Σ_{t∈B} ∂/∂θ ( Q_θ(s_t, a_t) − [ r_{t+1} + γ Q_{θ'}(s_{t+1}, argmax_{a'} Q_θ(s_{t+1}, a')) ] )
10:    if iteration mod N == 0 then
11:        Update target network θ' ← θ
12:    end if
13: until max iterations
14: return π based on greedy evaluation of Q_θ
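To make steps 8-9 of Algorithm 2.1 concrete, the PyTorch sketch below computes the DDQN target of Equation 2.15 and the corresponding mean squared TD error for a mini-batch. The networks q_net and q_target, the batch layout, and the terminal-state mask done are illustrative assumptions, not the implementation used later in this thesis.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_net, q_target, batch, gamma=0.99):
    """batch: tensors s (B, d_s), a (B,), r (B,), s_next (B, d_s), done (B,)."""
    s, a, r, s_next, done = batch

    # Q_theta(s, a) for the actions actually taken in the buffer
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Action selection with the value network ...
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)
        # ... evaluation with the target network (Eq. 2.15)
        q_next = q_target(s_next).gather(1, a_next).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next

    # Mean squared TD error (Eq. 2.14) to be minimized with SGD
    return F.mse_loss(q_sa, target)
```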

2.2.2 Policy gradient algorithms

The idea of Policy gradient methods is to use a function approximator to directly learn a policy π_θ with parameters θ without having to estimate value functions. The motivation behind this approach is that Value-based algorithms use greedy updates, which can make the learning process unstable due to large policy jumps (argmax is not smooth). Using a function approximator to directly estimate a parametrized policy makes it possible to compute gradient updates (which are soft and stable) directly on the policy.

Moreover, as there is no need to estimate an explicit Q-Value for all states and actions, Policy gradient methods can be used in high-dimensional or even continuous action spaces, in which Value-based algorithms need a discretization step.

To have a soft policy that can be differentiated, these algorithms use stochastic policies. Thus, π(a|s) is a conditional probability distribution over A, indicating the probability of taking each action a ∈ A from state s.

Two more ingredients are needed for these methods to work: a function that defines the goodness of a policy, and a way to compute its gradient. The first part is solved by defining some value functions. One option is computing the value of the starting state of an episode, following the policy (see Equation 2.16). An alternative is computing the average value over states, as in Equation 2.17. Note that d^{π_θ} is the stationary distribution of the Markov chain induced by π_θ, indicating the probability of being in each state when following the policy. Finally, a third way of measuring the quality of a policy is to compute an average over state values while also taking into account the immediate rewards that the agent receives in each state depending on the action it takes. This idea is presented in Equation 2.18, as the function J_{avR}.

J_1(\theta) = V^{\pi_\theta}(s_1)    (2.16)

J_{av}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)    (2.17)

J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a|s) \, r(s, a)    (2.18)

On the other side, Sutton and Barto, in their Reinforcement Learning book [30], proved the Policy Gradient Theorem, presented in Equation 2.19, which shows how to compute the gradient of any of the defined objective functions J_1, J_{av}, J_{avR}.

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s, a) \right]    (2.19)

Knowing that R_t is an unbiased sample of the estimator Q^{π_θ}, REINFORCE, also called Monte Carlo Policy Gradient, uses the theorem to build an online Policy Gradient method, which is shown in Algorithm 2.2.

Algorithm 2.2 REINFORCE

1: Initialize weights θ for the policy network
2: repeat
3:     Generate episode τ = (s_1, a_1, r_2, s_2, a_2, ...) following π_θ
4:     for all t do
5:         R_t ← long-term return from step t
6:         Update parameters θ ← θ + α ∇_θ log π_θ(a_t|s_t) R_t
7:     end for
8: until convergence
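A minimal PyTorch sketch of line 6 of Algorithm 2.2 for a discrete stochastic policy is shown below; the policy network, the optimizer and the episode format are assumptions, and the real algorithm simply repeats this update for every generated episode.

```python
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=1.0):
    """episode: list of (state_tensor, action_index, reward) tuples from one trial."""
    # Long-term return R_t from every step, accumulated backwards
    returns, R = [], 0.0
    for _, _, r in reversed(episode):
        R = r + gamma * R
        returns.insert(0, R)

    # theta <- theta + alpha * grad log pi_theta(a_t|s_t) * R_t, implemented as
    # gradient descent on the negative objective
    loss = torch.tensor(0.0)
    for (s, a, _), R_t in zip(episode, returns):
        log_prob = torch.log_softmax(policy_net(s), dim=-1)[a]
        loss = loss - log_prob * R_t

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```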

2.2.3 Actor Critic Methods

Actor Critic methods appear as an improvement over REINFORCE. Monte Carlo-based algorithms, those that use empirical long-term rewards R_t, tend to have high variance due to the implicit variance of R_t, which is the result of taking several actions in the environment. In order to reduce this variance, a baseline b(s_t) is subtracted from R_t to obtain the value (R_t − b(s_t)). Choosing a clever estimate for the baseline, such as b(s_t) = V^{π_θ}(s_t), reduces the variance of the algorithm, making it more stable. However, a new set of parameters has to be introduced to estimate the V-Values. The resulting algorithm is called REINFORCE with Baseline or Monte Carlo Actor Critic.

Actor Critic (AC) methods are formed by an Actor, which selects actions and is trained following the idea of Policy Gradient algorithms, and by a Critic, which estimates a value function that was originally introduced as a baseline to reduce the variance. Therefore, AC algorithms merge the two worlds of Value-based and Policy Gradient methods.

Similarly to the relation between Monte Carlo and Q-learning, there is a version of REINFORCE with Baseline that uses bootstrapping on the Bellman equation to train the critic. Additionally, the same TD error, which is a difference between two values, plays the role of the baselined return and serves to train the policy using the Policy Gradient Theorem. Figure 2.2 shows the general structure of this method, called One-step Actor Critic (QAC).


Figure 2.2: Actor Critic general architecture. The actor (policy) selects the action a, the environment returns the state s and the reward r, and the critic (value function) produces the TD error signal used for learning.

DDPG. In 2016, Lillicrap et al. presented Deep Deterministic Policy Gradient (DDPG) [17], a method that combines ideas from both DQN and One-step Actor Critic (QAC). On the one hand, DDPG is an AC algorithm because it uses both policy and value networks, which allows it to be used in continuous action spaces. On the other hand, as an off-policy method, it takes advantage of Experience Replay, being much more sample efficient. In addition, DDPG uses the Bellman equation and bootstrapping, which are features common to DQN and QAC.

Moreover, DDPG comes with some novel features. Firstly, it uses deterministic policies, as opposed to the previously mentioned AC algorithms; secondly, it proposes a new exploration method based on adding Gaussian noise to the action selected by the policy, instead of using other techniques such as ε-greedy exploration, which selects a completely random action with probability ε.

Algorithm 2.3 shows the main pipeline of DDPG. Note first that it uses target networks for both the actor and the critic to improve convergence. Furthermore, line 8 shows the addition of Gaussian noise with standard deviation σ, which is usually reduced during training and set to 0 when testing. It should also be remarked that DDPG uses soft updates for its target networks. With this technique, instead of directly copying the parameters of the trained networks every N iterations, the target networks are updated every iteration using small steps (a common value is τ ≈ 0.005). This usually improves stability compared to delayed hard updates.
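The soft (Polyak) update mentioned above (line 14 of Algorithm 2.3) can be written in a few lines of PyTorch; net and target_net are assumed to be two networks with identical architectures.

```python
import torch

@torch.no_grad()
def soft_update(net, target_net, tau=0.005):
    """theta' <- (1 - tau) * theta' + tau * theta, applied at every iteration."""
    for p, p_target in zip(net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```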

TD3. Two years later, in 2018, Twin Delayed DDPG (TD3) [10] was presented by Fujimoto et al., proposing four incremental contributions to DDPG. Firstly, the authors noted a small problem in the noise addition. Due to the Gaussian distribution, there is a low (but not zero) probability of sampling extreme values, which can imply large and unstable updates. For this reason, they clip the noise before adding it to the exploratory action, improving stability.

The second contribution is to also add noise when updating the critic, and not only when exploring. Therefore, when computing the target Q-Value (see line 11 in Algorithm 2.3), instead of selecting a' = π_{θ'_a}(s'), they add Gaussian noise as in the exploratory action.


Algorithm 2.3 DDPG [17]

1: Initialize weights θ_c for the critic network Q_{θ_c}(s, a)
2: Initialize weights θ_a for the actor network π_{θ_a}(s)
3: Initialize target networks as θ'_c ← θ_c, θ'_a ← θ_a
4: Initialize a buffer D
5: s ← initial state
6: repeat
7:     Select action a = π_{θ_a}(s)
8:     Add noise a ← a + N(0, σ)
9:     Execute action a, observe r, s' and add (s, a, r, s') to dataset D
10:    Sample a mini-batch B from D
11:    Compute the target Q-Value using the Bellman equation and the target networks:
       Q(s, a) = r + γ Q_{θ'_c}(s', π_{θ'_a}(s'))
12:    Update the critic using the MSE loss on the target Q-Values:
       θ_c ← θ_c − α_c Σ_{t∈B} ∂/∂θ_c ( Q_{θ_c}(s_t, a_t) − Q(s_t, a_t) )²
13:    Update the actor using the critic's valuation of the actions it selects:
       θ_a ← θ_a + α_a Σ_{t∈B} ∂/∂θ_a Q_{θ_c}(s_t, π_{θ_a}(s_t))
14:    Soft-update the target networks:
       θ'_c ← (1 − τ) θ'_c + τ θ_c
       θ'_a ← (1 − τ) θ'_a + τ θ_a
15: until max iterations

The third improvement is reminiscent of DDQN. To fight possible overestimation, TD3 introduces two critics and uses the most pessimistic one to build the target Q-Values from which the loss or error is calculated.

Finally, the last contribution is to delay the updates of the actor by several iterations. In other words, updating the critic more frequently than the policy network brings better convergence.

All in all, TD3 is one of the State-of-the-Art Reinforcement Learning algorithms nowadays.
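The sketch below illustrates how the first three TD3 modifications change DDPG's target computation: clipped Gaussian noise is added to the target action, and the most pessimistic of two target critics is used. The network interfaces, action bounds and noise parameters are illustrative assumptions, not the thesis' implementation.

```python
import torch

@torch.no_grad()
def td3_target(critic1_t, critic2_t, actor_t, r, s_next, done,
               gamma=0.99, sigma=0.2, noise_clip=0.5, a_max=1.0):
    """Target Q-Value with clipped target-policy noise and twin target critics."""
    a_next = actor_t(s_next)
    noise = (torch.randn_like(a_next) * sigma).clamp(-noise_clip, noise_clip)
    a_next = (a_next + noise).clamp(-a_max, a_max)

    # Pessimistic minimum over the two target critics mitigates overestimation
    q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```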

SAC. Also in 2018, Haarnoja et al. presented Soft Actor Critic (SAC) [12], an Actor Critic method that works with entropy-regularized policies. The idea behind this approach is that if a stochastic policy has maximum entropy, the probabilities of taking each action become closer to each other, so the policy automatically explores. Therefore, when learning a maximum entropy policy, there is no need to add noise or other exploration techniques.

The entropy of a stochastic policy can be defined as

H(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi(s)}\left[ -\log \pi(a|s) \right]    (2.20)

Using this definition, the goal is no longer to maximize the long-term reward alone, but to maximize a combination of the reward and the entropy of the policy, using the following new value functions:

V^\pi(s) = \mathbb{E}_{\tau \sim \pi(s)}\left[ \sum_{k=0}^{\infty} \gamma^k \left( r_{t+k+1} + \alpha H(\pi(\cdot|s_{t+k})) \right) \,\middle|\, s_t = s \right]    (2.21)

Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi(s)}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} + \alpha \sum_{k=1}^{\infty} \gamma^k H(\pi(\cdot|s_{t+k})) \,\middle|\, s_t = s, a_t = a \right]    (2.22)


Note that the entropy does not appear in the first time-step of the Q-Value, as the immediate reward is used directly. Also, the parameter α controls the trade-off between exploiting the rewards and giving more importance to the entropy, and thus to exploration. This parameter is usually decreased during training and set to 0 when testing. Nonetheless, in later work [13], Haarnoja et al. provide a method to automatically and dynamically tune this parameter during the training stage.

With these entropy-aware definitions of the value functions, we can rewrite the Bellman equations as

V^\pi(s) = \mathbb{E}_{\tau \sim \pi(s)}\left[ Q^\pi(s, a) + \alpha H(\pi(\cdot|s)) \right]    (2.23)

Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi(s)}\left[ r + \gamma \left( Q^\pi(s', a') + \alpha H(\pi(\cdot|s')) \right) \right] = \mathbb{E}_{\tau \sim \pi(s)}\left[ r + \gamma V^\pi(s') \right]    (2.24)

Taking into account that both V-Values and Q-Values appear in the two Bellman equations, SAC uses the following seven function approximators:

• Twin Q-networks (like in TD3) with their respective target networks

• A V-network and its target one

• A policy network for the actor

Both Q-networks are trained using the MSE loss against a target Q-Value calculated as in Equation 2.24, which is computed using the target V-network. On the other side, the V-network is also trained using an MSE loss against a target V-Value computed as in Equation 2.23, using the most pessimistic target Q-network.

Finally, the objective of the policy network is to maximize the soft (entropy-regularized) value of the visited states, that is

\max_\pi \; \mathbb{E}_{\tau \sim \pi(s)}\left[ Q^\pi(s, a) - \alpha \log \pi(a|s) \right]    (2.25)

Following the same methodology used for the other networks, we would like to compute the gradient of this objective to optimize it using gradient ascent. However, this is not directly possible, because gradients cannot be computed with respect to π when the expectation is taken over the policy's own stochasticity. For this reason, and taking advantage of π being stochastic, the authors apply the Reparametrization trick. Firstly, they define the policy as Gaussian, so that given a state s there is a Gaussian probability distribution over the action space A. Then, instead of sampling the action directly, the network outputs the mean and standard deviation of this distribution. Using a new independent variable ξ, an action can be sampled from the distribution as

a_\theta = \mu_\theta + \sigma_\theta \odot \xi    (2.26)

where θ are the parameters of the policy network and ξ ∼ N(0, 1). Note that, as the stochasticity has been moved into ξ (μ and σ are two variables but do not "decide" the action taken), the expectation now depends on ξ and not on the parameters θ with respect to which we need to compute the gradient.
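A minimal PyTorch sketch of a Gaussian policy head that applies the reparametrization of Equation 2.26 is shown below. The layer sizes and the log-standard-deviation clamp are assumptions, and the tanh squashing used by most SAC implementations is omitted for brevity.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs mu and sigma; the action is sampled as a = mu + sigma * xi (Eq. 2.26)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        mu = self.mu_head(h)
        std = self.log_std_head(h).clamp(-20, 2).exp()
        xi = torch.randn_like(mu)          # xi ~ N(0, 1) carries all the stochasticity
        a = mu + std * xi                  # differentiable w.r.t. mu and std
        log_prob = torch.distributions.Normal(mu, std).log_prob(a).sum(-1)
        return a, log_prob
```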

Alongside TD3, SAC obtains State-of-the-Art results in almost all Reinforcement Learning tasks nowadays.


2.3 Universal Markov Decision Processes (UMDP)

Universal Markov Decision Processes (UMDP) are a generalization of Markov Decision Processes defined by the tuple ⟨S, A, G, R, P⟩ in which:

• S denotes a multidimensional state space

• A refers to a multidimensional action space

• G is a goal space, also multidimensional

• R : S × G × A × S → ℝ is a reward function

• P : S × A → S denotes a probability transition function

Note that the definition is almost identical to that of MDPs, exposed in Section 2.1, but in UMDPs we have a goal space G. In UMDP environments, for each episode a goal g is drawn from G in addition to an initial state. The objective of the agent is to reach goal g, and it receives a reward r for every action taken, depending on the state s, the goal g and the state s' it arrives at after performing the action.

The goal space G is related to the state space S through a function called the state-goal mapper, defined so that there is an injection from space S to space G:

sgm : s \longmapsto g, \qquad s \in S, \; g \in G    (2.27)

Common state-goal mappers are the identity (G ≡ S) or those that make the goal space a subspace of the state space (G ⊆ S).
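As a toy example (this particular mapping is an illustrative assumption, not the one used later in the thesis), in a grid environment where the state is (x, y, orientation) and goals only constrain the position, the state-goal mapper simply drops the orientation:

```python
def sgm(state):
    """State-goal mapper for G being the (x, y) subspace of S."""
    x, y, orientation = state
    return (x, y)
```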

The V-Value and Q-Value functions maintain the same definitions as in MDPs, but they are now indexed by the goal g, because the goodness of a state also depends on the goal of the agent:

V^\pi(s, g) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, g_t = g \right]    (2.28)

Q^\pi(s, a, g) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a, g_t = g \right]    (2.29)

Finally, a policy π : S × G → A is a mapping from a certain state s and goal g to an action a. Thus, the behavior of an agent from the same state can differ depending on the goal it is trying to reach. Naturally, the objective is to learn an optimal policy π*(s, g) that maximizes the value functions defined in Equation 2.28 and Equation 2.29 for the initial states and goals.

Note that, while a policy learned under an MDP only serves to accomplish one goal, a goal-conditioned policy learned in a UMDP can learn a general behavior useful to achieve any possible goal g ∈ G, thus being much more powerful. A practical example would be a robot trained for a "pick and place" task, which has to position each type of object in a different bin, i.e., a different goal position.


2.3.1 Universal Value Function Approximators (UVFA)

The introduction of function approximators in Reinforcement Learning was key for several reasons. Among them, and probably the most crucial one, it brought the ability to approximate a value function for an arbitrarily large number of states with a limited amount of parameters. As a consequence, approximators can generalize over similar states and even provide an approximate value for unseen ones.

Universal Value Function Approximators (UVFA) [25], presented in 2015 by Schaul et al., studies how to obtain function approximators that work for universal value functions such as the ones described in Equation 2.28 and Equation 2.29, with the objective of generalizing not only over states but also over goals.

Figure 2.3 shows the two options Schaul et al. propose. The first one, called "concatenate", concatenates the state and goal vectors to create an extended state representation, which can then be fed into any function approximator F, such as a Multi-Layer Perceptron.

The second option, also known as "two-stream", assumes that there is some possible factorization between states and goals. Therefore, each of them is fed into a different function approximator to extract a feature vector encoding it. Afterwards, both feature vectors are combined by a third approximator that outputs a scalar estimating the value for that state and goal.

Figure 2.3: Universal Value Function Approximators’ architectures. Image taken from [25].
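A minimal PyTorch sketch of the "concatenate" architecture for a discrete-action Q-Value UVFA is shown below (layer sizes are illustrative; the "two-stream" variant would instead encode s and g with two separate networks before combining them).

```python
import torch
import torch.nn as nn

class ConcatUVFA(nn.Module):
    """Approximates Q(s, a, g): the input is [s ; g], the output is one value per action."""
    def __init__(self, state_dim, goal_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))
```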

2.3.2 Hindsight Experience Replay (HER)

Most Reinforcement Learning tasks use sparse rewards, providing feedback only at the time-step when the task is achieved, if at all. This happens because, in most cases, the only alternative is to do reward engineering, that is, spending time and resources studying the environment to create a function able to provide a more informative signal so the agent learns faster. However, an engineered reward is commonly environment-specific (it only works on a specific task) and requires expert knowledge. Thus, it becomes an inefficient choice.

In environments with large state spaces, algorithms suffer from only receiving negative signals due to not solving the task. If an agent only gathers immediate rewards with, say, the value zero, the estimates for the long-term reward will be zero for all states and actions. Only when the agent achieves the task by chance (due to exploration) does it receive a positive reward, say one. Only in that case is the agent able to propagate a different value and update its estimates.


In environments where only specific sequences of actions can lead to solving the task, for example MountainCar from OpenAI Gym [2], it is almost impossible for the agent to reach its goal, even with heavy exploration. Due to sparse rewards, such an agent will never receive a positive signal, so it will not be able to learn.

Hindsight Experience Replay (HER) [1], presented by Andrychowicz et al. in 2017, is a data augmentation technique that helps agents learn under sparse rewards by extending their tasks to Universal Markov Decision Processes (UMDP) and using Universal Value Function Approximators (UVFA) to solve them.

The idea behind their work is to take advantage of all the information gathered by the agent, even from negative rewards. As an example, if an agent has moved from state s_1 to state s_2 and not solved the task, it should at least use this information to learn how to go from s_1 to s_2.

To do so, they use the idea of UMDPs, in which for each episode the agent has an explicit goal g. If the original environment is an MDP without goals, the states where the task is considered solved become the goal. Otherwise, for goal-based environments, a possibly new goal can be drawn each episode.

Under these circumstances, after finishing an episode of T time-steps receiving a binary sparse reward r ∈ {−, +} (any pair of values is possible), the agent will have collected T tuples of the form (s, a, r, s', g), where the goal g is the environment goal and the reward r is negative most of the time. The proposal of HER is to add to the Experience Replay buffer D extra transitions in which the goal g is replaced by one of the states the agent has visited during the episode, mapped to the goal space G: g'_t ∈ {sgm(s'_1), sgm(s'_2), ..., sgm(s'_T)}. Therefore, in these new transitions, if s'_t achieves g'_t, the reward is positive, and we are introducing relevant information on how to go from s_t to s'_t. UVFA does the rest, learning a goal-conditioned policy capable not only of achieving the task goal, but also, once optimized, of going from any state to any goal.

In their work, Andrychowicz et al. study four strategies to create these new transitions, called hindsight transitions. Among them, the "future" strategy outperforms all the others. In this strategy, for each transition (time-step), a random future transition from the same episode is chosen, and the projection of its next state into the goal space G, sgm(s'), is used as the goal. The reward is calculated as if the transition had happened in the environment: a positive reward if the state achieved by the agent corresponds to the goal, and a negative one otherwise. Algorithm 2.4 shows how HER operates under the "future" strategy.

From the algorithm it can be seen that, with this strategy, we create one HER transition for each original transition (other rates are possible under other strategies), thus augmenting the dataset and being, in this case, twice as sample efficient.

On the other side, at least one of the T HER transitions, the last one, will have a positive reward. Note that for the last time-step, the only "future" transition is itself, so g' becomes sgm(s'), which is obviously achieved by s'.

Regarding the accomplishment of a goal g by a state s, a function has to be defined of the form

\text{achieved} : S \times G \longrightarrow \{\text{True}, \text{False}\}    (2.30)


Algorithm 2.4 HER (future strategy)

1: Initialize buffer D
2: repeat
3:     Initialize array tmp ← [ ]
4:     s ← initial state
5:     g ← episode goal
6:     for step t = 1 : T do                          ▷ Do an episode and store original transitions
7:         Select action a according to policy
8:         Execute action a, observe r, s' and add (s, a, r, s', g) to dataset D and to tmp
9:     end for
10:    for step t = 1 : T do                          ▷ Create and insert HER transitions
11:        Get the original transition: (s_t, a_t, r_t, s'_t, g_t) ← tmp[t]
12:        Select a future time-step: f ← randInt(t, T)
13:        Get the future transition: (s_f, a_f, r_f, s'_f, g_f) ← tmp[f]
14:        Hindsight goal: g'_t ← sgm(s'_f)
15:        Hindsight reward: r'_t ← compute_reward(s'_t, g'_t)
16:        Add the hindsight transition (s_t, a_t, r'_t, s'_t, g'_t) to dataset D
17:    end for
18: until max episodes

19: function compute_reward(s, g)
20:    if s achieves g then
21:        return Positive reward
22:    else
23:        return Negative reward
24:    end if
25: end function
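The hindsight loop of Algorithm 2.4 (lines 10-17) can be condensed into a few lines of Python; sgm and compute_reward are the state-goal mapper and reward function described in the text, and buffer is assumed to be any replay buffer storing (s, a, r, s', g) tuples. This is an illustrative sketch, not the thesis' implementation.

```python
import random

def add_her_transitions(buffer, episode, sgm, compute_reward):
    """episode: list of (s, a, r, s_next, g) tuples from one finished trial."""
    T = len(episode)
    for t in range(T):
        s, a, _, s_next, _ = episode[t]
        f = random.randint(t, T - 1)              # random future step of the same episode
        _, _, _, s_future_next, _ = episode[f]
        g_new = sgm(s_future_next)                # hindsight goal
        r_new = compute_reward(s_next, g_new)     # recomputed as if g_new were the real goal
        buffer.add((s, a, r_new, s_next, g_new))  # one extra transition per original one
```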

In discrete environments, the most used function checks whether the projection of the state into the goal space G matches the goal:

\text{achieved}(s, g) = \begin{cases} \text{True} & \text{if } sgm(s) \equiv g \\ \text{False} & \text{otherwise} \end{cases}    (2.31)

In continuous environments, asking for a perfect match is too restrictive, so tolerances are often used:

\text{achieved}(s, g) = \begin{cases} \text{True} & \text{if } |sgm(s) - g| \leq \epsilon \\ \text{False} & \text{otherwise} \end{cases}    (2.32)
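A short sketch of the two achieved predicates of Equations 2.31 and 2.32, with a hypothetical tolerance ε, could be:

```python
import numpy as np

def achieved_discrete(s, g, sgm):
    return sgm(s) == g                        # Eq. 2.31: exact match in goal space

def achieved_continuous(s, g, sgm, eps=0.05):
    diff = np.abs(np.asarray(sgm(s)) - np.asarray(g))
    return bool(np.all(diff <= eps))          # Eq. 2.32: match within tolerance eps
```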

To conclude, HER is an incredibly powerful tool to speed up the learning process of an agent facing sparse rewards. It is implemented today alongside the most important Reinforcement Learning algorithms and obtains State-of-the-Art results while making the learning process much easier and more efficient.


3. Meta Reinforcement Learning and Multi-task Learning

The goal of this chapter is to introduce Meta Reinforcement Learning and Multi-task Learning, on which this project is based. More specifically, the chapter contains a brief introduction to Meta-Learning, the ideas behind its extension to Reinforcement Learning, and some literature associated with the field, including a short description of several algorithms and pipelines related to this project.

3.1 The concept

Following the definitions given by Professor Finn in her Stanford CS-330 course [6], Meta-Learning is a field of Artificial Intelligence that studies methodologies to improve learning and make it more efficient. These algorithms work on top of base learners to speed up their process, simplify it, or prepare them to generalize over new tasks. In the real world, a teacher could be seen as a meta-learner, as they process the information and present it to the student, the base learner, in a way that lets her learn faster than she would on her own. Following this idea, Meta Reinforcement Learning (MRL) is a set of techniques that apply Meta-Learning to Reinforcement Learning to improve its performance and efficiency.

On the other hand, Multi-task Learning is a set of algorithms that take advantage of the provided goal space to learn many tasks simultaneously. Methods that learn goal-conditioned policies could be included in this category, as once optimized they are able to solve different tasks in the same environment.

The border between these two sets of techniques is incredibly fuzzy and often confused in the literature. Setting a boundary is complex because learning multiple tasks simultaneously usually leads to better efficiency when adapting to new tasks, while a method able to adapt quickly and easily becomes more reusable and thus more efficient. Whichever is the correct or stricter classification, the following section focuses on presenting methods whose goal is to improve the efficiency of Reinforcement Learning by tackling some of its weak points.

First, Reinforcement Learning—and more specifically, Model-free methods—are known to require many interactions with the environment. Additionally, a learned policy only works in a particular environment and task. These two drawbacks together make these algorithms somewhat inefficient. To improve this key feature, there is a group of algorithms whose objective is to learn multi-task policies: behaviors that can be used for different tasks in the same environment.

Second, as mentioned in Chapter 2, RL algorithms have problems learning in large state spaces under sparse rewards. A family of algorithms called Curriculum Learning tries to solve this problem by having a meta-learner agent that proposes goals to the base agent. The meta-learner starts by presenting easy goals and, once the learner knows how to achieve them, it increases the difficulty until the agent can solve the whole task. Instead of leaving the agent to discover the entire state space, the meta-learner guides it through the learning process. Curriculum Learning follows the idea of “Learning to Learn”, which is a colloquial definition of Meta-Learning.

Third, Hierarchical methods tackle the same problem with a different philosophy. Opposite to Curriculum Learning techniques, which work at training time guiding the learning process, Hierarchical methods help the base agent accomplish each episode by guiding it through intermediate states (or goals) that are more easily achievable. These methods follow the “Divide and Conquer” idea by splitting the task’s complexity into several parts, each of them solved more efficiently by a different agent. Commonly, the low hierarchy, or base learner, has to learn the environment’s basic actions while trying to achieve the subgoals proposed by higher agents, while these learn how to propose intermediate subgoals that make the whole process easier and faster.

Finally, there is also a group of methods that focus on exploring the environment in an unsupervised way, to then be able to solve any task using the gathered knowledge.

3.2 Related Work

Gradient-based algorithms Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) [7], presented by Finn et al. in 2017, is a generic Meta-Learning approach that works for any task and model that uses gradient descent as its optimization technique. Its goal is to obtain a generic model that can directly solve a group of tasks or, at least, can do so after very short fine-tuning on each of them.

To accomplish this, given a set of parameters θ for the learner (the policy in RL), MAML computes K updates for each task independently (from the same initial parameters θ). Each of these updated models is then evaluated using test data belonging to the respective task, computing a loss or error. Finally, all per-task losses are combined, and a gradient step is made from the original parameters θ using this final loss. This methodology ensures that the gradient step goes in a direction where most tasks are improved, obtaining a general model able to solve several tasks. Moreover, this pipeline ensures that the parameters are set so that the model can be easily fine-tuned for each specific task with very few steps.
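As a rough illustration of this inner/outer loop, the toy sketch below applies a first-order simplification of the MAML update to quadratic losses; the tasks, learning rates and the first-order approximation are assumptions made for the example and do not correspond to the original second-order algorithm or to any implementation used in this thesis.

```python
import numpy as np

def task_grad(theta, c):
    # Toy per-task loss L_i(theta) = 0.5 * ||theta - c_i||^2, whose gradient is (theta - c_i)
    return theta - c

targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
theta = np.zeros(2)                    # shared initial parameters
alpha, beta, K = 0.1, 0.05, 3          # inner lr, outer (meta) lr, inner steps per task

for _ in range(200):
    meta_grad = np.zeros_like(theta)
    for c in targets:
        phi = theta.copy()
        for _ in range(K):             # K task-specific updates from the same theta
            phi -= alpha * task_grad(phi, c)
        meta_grad += task_grad(phi, c) # first-order approximation: gradient at adapted params
    theta -= beta * meta_grad / len(targets)   # single outer step combining all per-task losses
```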

Although it is quite an impressive theoretical idea and well founded mathematically, it is pretty inefficient in practice. On the one hand, the whole pipeline requires creating a copy of the model for each task at every iteration. On the other hand, it is a second-order method, as it has to compute gradients over already-computed per-task gradients, which is computationally costly. Additionally, the RL-adapted version (the original algorithm is generic, not RL-based) uses the empirical long-term reward Rt as estimator, similar to Monte Carlo methods, thus being on-policy and sample-inefficient.

Proximal Meta-Policy Search (ProMP) [24], published in 2018 by Rothfuss et al., and later Curriculum in Gradient-Based Meta-Reinforcement Learning (Meta-ADR) [19] by Mehta et al. proposed incremental contributions over MAML, improving its sample efficiency and its performance on unseen tasks, respectively.


Curriculum Learning With a completely different idea, Florensa et al. presented, in 2018, Automatic Goal Generation for Reinforcement Learning Agents (AGG) [8], a Curriculum Learning algorithm for RL tasks. In this work, they propose the concept of “GOID: Goals of Intermediate Difficulty”, and define a goal g to have this property if it was achieved between a minimum and a maximum percentage of the time in the previous iteration. Using this information, they train a Generative Adversarial Network (GAN) [11] to propose goals that are of intermediate difficulty given the current agent’s knowledge and ability. They show that using this technique, the agent is always “motivated” to learn because its goals are not too difficult, while at the same time it is “challenged” as they are not too easy, being able to efficiently learn to solve any goal from a given initial state in an organized way.

Note that this algorithm runs at training time, providing a way of learning the whole environment. However, for a specific task, not all possible goals need to be learned; in such cases, AGG can be too costly. All in all, it is a quite clever way of exploring and learning.

Exploration-based algorithms Continuing with exploration approaches, Diversity Is All You Need (DIAYN) [5], presented in 2019 by Eysenbach et al., proposes an unsupervised methodology to explore all the space by learning skills, which can be interpreted—simplifying—as goal regions. Therefore, its purpose is to learn diverse and distinguishable skills that cover all the space. The implementation is based on a discriminator that uses information theory to ensure that the learned skills are as different and varied as possible. After discovering skills in an environment during the pre-training stage, they apply fine-tuning to several downstream tasks.

The method’s idea is to obtain a deep knowledge of the environment before learning any possible task with zero or few iterations. This is pretty useful when extreme multi-task knowledge is needed. However, exploring all the space can be hugely inefficient if the subspace of valid states is not the entire state space, or in cases where the task’s goal space is much smaller than the state space.

A year later, Sekar et al. presented Planning to Explore via Self-Supervised World Models (Plan2Explore) [27], another exploration technique whose goal is to obtain a world model and later learn new tasks efficiently by zero-shot or few-shot fine-tuning, using Model-based sample-efficient algorithms on the model. Again, a rather interesting alternative if massive multi-tasking is required.

Latent task representations Back to purely multi-task approaches, although not gradient-based, Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL) [23] was published in 2019 by Rakelly et al. In their work, they propose an alternative for solving multi-task environments without the use of UMDPs and an explicit definition of goals. Instead, for each episode, they gather information about the environment by collecting tuples (s, a, r, s′), and use an Inference Neural Network to estimate a probabilistic latent representation z of the task from that data.

Afterwards, taking inspiration from goal-conditioned policies, they condition the policy on the obtained task latent representation z. They show that this idea, which is directly designed for Reinforcement Learning, outperforms methods such as MAML [7] or ProMP [24] both in performance and sample efficiency, which is expected as PEARL is an off-policy method that uses Experience Replay.


Hierarchical methods In the field of hierarchical methods, we find Learning Multi-Level Hierarchies with Hindsight (HAC) [16], which Levy et al. presented in 2017. HAC builds a structure of two or more hierarchies, where the lowest one interacts with the environment and the rest work as goal proposers.

More specifically, each hierarchy is an independent agent, with its own learning algorithm and UMDP. The highest level receives the environment goal and has to propose a subgoal for the immediately lower hierarchy. The last agent, the lowest one, is a base learner that interacts with the environment but, instead of following the environment’s goals, receives proposals from the agent just on top of it, which are easier to achieve.

With this idea, the base learner does not need to learn the whole space but only those goals proposed by the hierarchical parent. Similarly, the highest level does not need to know how to interact with the environment, but only how to propose goals that make the task easier to solve.

To implement this concept, HAC uses DDPG [17], HER [1], and UVFA’s [25] “concatenate” strategy for each agent. Note that while the lowest hierarchy’s action space matches the environment’s, the other levels have to propose goals, so their action space is actually the environment’s goal space. Following this structure, the highest level starts by proposing a goal for the immediately lower one, which, from the same state and the new goal, proposes a new goal to the following one. The last agent, the base learner, has a limited horizon of time-steps to try to achieve the goal it has received and, after reaching the goal or the time horizon H, it sends its final state to the agent immediately on top, which recognizes it as its own next state. After all agents have used their time horizons or the environment’s goal has been achieved, each agent updates its own policy using its replay buffer to learn better proposals or actions, eventually easing the learning process by dividing it into several smaller problems.

HAC is an astonishingly interesting idea that can be applied as a complement to HER to make learning more efficient. However, the introduction of new agents makes the algorithm more computationally costly.

Finally, Meta Goal-generation for Hierarchical Reinforcement Learning (MGHRL) [9], presented by Fu et al. in 2020, combines a two-level hierarchy similar to the one in HAC [16] with latent representations of tasks as in PEARL [23] to build an algorithm that is both efficient at learning and flexible in adapting to new tasks, two of the main purposes of Meta Reinforcement Learning.


4. Learning Recursive Goal Proposal

4.1 Idea and motivation

As discussed in Chapter 3, Reinforcement Learning has some weak points that make it less efficient, and therefore less usable in the real world. Taking this into consideration, this project’s intention has been to design a new algorithm that is as powerful at solving complex environments as it is efficient, in every sense of the word. First, sample efficiency is vital in RL as it brings the ability to learn a task with fewer interactions with the environment. Another quality of an algorithm that can reduce the number of interactions is the ability to learn faster and more easily, lowering the required training time and number of iterations. Finally, the design of the algorithm itself also has an impact on the computational cost.

With all these concepts in mind, our idea has been to build a two-level hierarchical method. The high level works as a subgoal proposer, while the low one only interacts with the environment following easy goals proposed by the higher hierarchy. As explained in the previous chapter, this allows the lower agent to learn faster by having a smaller search space, while the high hierarchy only needs to learn how to propose goals, not how to act in the environment.

Opposite to HAC [16], which uses several goal-proposing agents to obtain better and more refined proposals with each of them, our design decision has been to encapsulate all these different policies into a single one. By using recursivity on the high level, we can ask for closer and more reachable goal proposals without having to use multiple agents, thus being more computationally efficient.

This idea has been inspired by how shortest paths can be found after applying Dijkstra’s algorithm [4]. The algorithm—applied to a set of nodes in a graph—stores, for each target node, the node that should be visited just before it when traveling optimally from a source node. Therefore, if the source and target nodes are not directly connected, recursive calls have to be made to recover the shortest path.
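The analogy can be made concrete in a few lines of Python: once Dijkstra’s algorithm has filled a predecessor map, the shortest path is recovered recursively, much like LRGP recursively unrolls goal proposals. The toy graph and names below are purely illustrative.

```python
def recover_path(prev, source, target):
    # prev[n] is the node visited just before n on the optimal path from source
    if target == source:
        return [source]
    return recover_path(prev, source, prev[target]) + [target]

# Example predecessor map as produced by Dijkstra on a toy graph (illustrative only)
prev = {"B": "A", "C": "B", "D": "C"}
print(recover_path(prev, "A", "D"))   # ['A', 'B', 'C', 'D']
```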

4.2 Algorithm Overview

Learning Recursive Goal Proposal is formed by a goal stack, an infinite loop and two differentiated parts, one for each hierarchy level. An overview of the method is shown in Algorithm 4.1.

As can be seen, each episode starts with an initial state and goal drawn from the environment. Additionally, a stack is used to store all goals and subgoals, which the low agent will have to achieve in strict last-in-first-out order. The final objective is to empty the stack of goals, which implies that the first inserted item, the episode’s goal, has been accomplished.


Algorithm 4.1 LRGP

Require: high agent, low agent, is reachable(s, g) function, low horizon Hl

1: s ← initial state
2: g ← episode goal
3: goal_stack ← [g]                         ▷ Initialize stack with goal g
4: while True do
5:     g ← goal_stack.top()
6:     reachable ← is_reachable(s, g)        ▷ Boolean
7:     if not reachable then                 ▷ High hierarchy section
8:         g′ ← high.propose_goal(s, g)
9:         Add g′ to goal_stack
10:     else                                  ▷ Low hierarchy section
11:         for low_step = 1 : Hl do
12:             a ← low.select_action(s, g)
13:             Apply action a and observe next state s′
14:             s ← s′
15:             if s′ achieves g then
16:                 Remove g from goal_stack and break for loop
17:             end if
18:         end for
19:         if goal_stack.is_empty() then
20:             Episode is completed. Break.
21:         end if
22:     end if
23: end while

After the initialization, the algorithm contains a loop in which the last goal introduced in the stack is selected. If that goal is not reachable by the low agent in Hl or fewer steps, the goal proposer—the higher hierarchy—is asked to generate a new goal taking into account the current state-goal pair.

If, otherwise, the last goal introduced in the stack is reachable by the low agent within its step horizon Hl, a maximum of Hl steps are taken in the environment guided by the low level’s policy. If during these interactions the agent reaches the goal it is pursuing, that goal is removed from the stack and the following one is used for the next iteration.
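For readers who prefer code, the control loop of Algorithm 4.1 can be transcribed into Python as sketched below; `env`, `high`, `low`, `is_reachable` and `achieved` are placeholder objects with the interfaces assumed by the pseudocode (e.g., a reset that returns the initial state and the episode goal), not an actual implementation.

```python
def lrgp_episode(env, high, low, is_reachable, achieved, H_l):
    s, g = env.reset()                      # initial state and episode goal (assumed interface)
    goal_stack = [g]
    while True:
        g = goal_stack[-1]                  # goal on top of the stack
        if not is_reachable(s, g):          # high hierarchy: ask for a closer subgoal
            goal_stack.append(high.propose_goal(s, g))
        else:                               # low hierarchy: run of at most H_l environment steps
            for _ in range(H_l):
                a = low.select_action(s, g)
                s, _, _, _ = env.step(a)    # gym-style step, assumed
                if achieved(s, g):
                    goal_stack.pop()
                    break
            if not goal_stack:              # the episode goal itself has been reached
                return s
```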

The coming sections describe how the two hierarchies are implemented, as well as the is reachable function, which estimates whether a goal is reachable from a state in no more than Hl steps.

4.3 Main components

4.3.1 The low hierarchy

The low hierarchy can be seen as a UMDP (see Section 2.3) whose state space S, action space A, and transition probability function P are the environment’s original ones. The goal space G is also the environment’s one, with the main difference that goals do not come directly from it, but from the high hierarchy. Therefore, although the space G is identical, the low agent will receive fewer state and goal combinations, so the perceived space to explore becomes smaller.

Regarding the reward function R, goal environments tend to use binary sparse rewards with a positive or negative value depending on whether the agent’s state has achieved the episode’s goal. In our hierarchical algorithm, we overwrite the environment’s rewards with another sparse and binary reward function that acts on the goal that is currently pursued, the one on top of the goal stack (which in the general case is different from the episode’s goal, the one at the bottom of the stack). More specifically, we use a value of −1 if the goal is not achieved, and 0 otherwise. As the long-term reward is upper-bounded by 0, there is no need to use a discounted version of the long-term return, so we use γ = 1.

In order to be more sample efficient and learn faster under sparse rewards, we use HER [1] with the “future” strategy, presented in Section 2.3.2 and Algorithm 2.4.

Additionally, it is worth mentioning that Learning Recursive Goal Proposal (LRGP) is defined for any type of environment, with no restriction on the nature of the state or action spaces. Moreover, it is also flexible enough to use any Value-based or Actor Critic algorithm as base learner. For our experiments we have tested DDQN [14] for discrete environments and DDPG [17], TD3 [10] and SAC [12, 13] for continuous ones.

4.3.2 The high hierarchy

The high hierarchy’s task can also be defined as a UMDP in which the state space S and the goal space G are the environment’s original ones. However, this time the action space A becomes the environment’s goal space, as the high agent works as a goal proposer. Therefore, its outputs, actions, become goals.

Regarding the probability transition function P, it now depends on the low-level policy, because the low agent is the only one able to change the state of the environment. In fact, the states perceived by the high hierarchy are only the first and last states of each run of at most Hl steps by the lower one. In other words, how the lower agent has moved from one state to the other is a black box for the high level.

Concerning the reward function, we have used a non-sparse definition that lets us be more sample efficient and speed up the learning process. Moreover, we have studied two different estimation techniques that are explained below. Both methodologies can be implemented in any off-policy learning algorithm that uses Q-Values: either Value-based methods or Actor Critic learners. In this work, we have experimented with DDQN [14] to propose goals from a discrete space and DDPG [17], TD3 [10], and SAC [12, 13] for continuous goal spaces. Furthermore, we have shown that the continuous methods also work for discrete goal spaces by first proposing a continuous goal and then discretizing it. This idea leads to an improvement in generalization power and faster convergence.

Reward function

The task of proposing subgoals to help the base learner reach a final goal g from a current state s can be seen as a planning algorithm in which the objective is to find a sequence of intermediate milestones that the base learner can easily achieve. In our context, a goal is easily achievable if it can be reached in no more than Hl environment steps, which we will call a run.

Taking this idea into account, we define the short-term reward to be −1 for each required goal proposal. Therefore, it is trivial to see that the long-term reward will be the negative of the number of subgoals that have been required to solve an episode.


From the definition of Q-Value functions in UMDPs (see Equation 2.29), the Q-Value for a state s, subgoal proposal a and goal g, Q(s, a, g), will be the negative of the expected number of subgoals required to reach g from s after having proposed a. Thus, the optimal policy will have to learn which subgoal proposal a to choose in order to minimize the expected number of proposals that will be required from s to g.

Note that minimizing the number of subgoal proposals implies minimizing the number of low hierarchy runs, which in turn implies minimizing the number of steps the agent performs in the environment (as each run is bounded by Hl). If subgoal proposals are too close to the current state, the episode will require more proposals, being suboptimal. On the other hand, if subgoal proposals are too far, the low agent will not be able to reach them in Hl steps, and an extra proposal will be required by the algorithm, thus becoming suboptimal again. Therefore, this reward function definition ensures both a minimal use of the high hierarchy and solving the episodes as fast as possible.

Similarly to the low agent’s reward, this long-term reward is also upper-bounded by 0. Therefore, we use γ = 1 for the higher hierarchy too.

Long-term reward estimation

To estimate the number of subgoal proposals or runs required from a state s to a goal g, we propose two alternatives. The first one is inspired by Bellman’s equation and the use of the short-term reward, while the second one is based on Monte Carlo’s idea of using the empirical long-term reward as the estimate. However, in both cases, we take advantage of the reward function’s definition to increase our sample efficiency by generating more transitions per episode.

To illustrate how we generate these transitions we will make use of the diagram shown in Figure 4.1, which represents a sequence of 3 runs with their initial and final states A, B, C, D. Note that arrows represent runs, i.e., sequences of at most Hl environment steps. Therefore, A −→ B could have two compositions: either Hl steps towards a goal that was not achieved by the low agent before reaching its horizon, or any number of steps (≤ Hl) towards B until achieving it. Either way, it is a black box for the high hierarchy and only the initial and final states A and B are needed.

To compute transitions, we use the idea of Hindsight Action Transitions (HAT) presented in HAC [16], in which transitions for the higher hierarchies are computed as if the lower ones already had an optimal policy. This is key to stabilize the learning process of the high hierarchy, because it depends on the low level’s policy, which changes over time when learning simultaneously.

We know that when the low agent acts optimally, it always achieves the last goal in the stack if it is reachable. Therefore, to mimic an optimal behavior, we force the final step of each run to accomplish the goal in the transition. Following HAT’s idea, we use hindsight goals, i.e., we use the empirical final state of each run—mapped into the goal space G—instead of the goal that was truly proposed, regardless of whether it was accomplished. Therefore, to create transitions we do not need the goals that were actually proposed, but only the sequence of initial and final states of each episode run.

Continuing with the example in Figure 4.1, both states and goals belong to R2. As they have the exact same representation, the state-goal mapper becomes g = sgm(s) = s. For this reason, in this example, we will refer to nodes A, B, C, D as either states or goals.


Figure 4.1: Diagram to illustrate high hierarchy transition generation (a chain of three runs A −→ B −→ C −→ D).

Our Bellman-based way of estimating the long-term reward makes use of Bellman’s equation to generate transitions of the form:

Q(s, a, g) = r + \max_{a'} Q(s', a', g) \quad (4.1)

This should be read as “the cost of going from s to g through the intermediate milestone or subgoal a is r, which encapsulates the cost of going from s to sgm(s′), plus the cost of going from s′ to g optimally”. Following this idea, we could generate the following transitions for the diagram:

• Q(A, B, B) = −1 + 0. The cost of going from A to B is −1 (one run). The second part of the equation is the cost of going from B to B, which is obviously 0.

• Similarly, Q(B, C, C) = −1 + 0, and Q(C, D, D) = −1 + 0.

• Q(A, B, C) = −1 + max_{a′} Q(B, a′, C). The cost of going from A to C proposing B is the cost of going from A to B (−1) plus the best cost from B to C in the most efficient way.

• Similarly, Q(B, C, D) = −1 + max_{a′} Q(C, a′, D), and Q(A, B, D) = −1 + max_{a′} Q(B, a′, D).

• Finally, we can also skip time-steps by doing Q(A, C, D) = −2 + max_{a′} Q(C, a′, D). In this case we are encapsulating the cost from A to C in the short-term reward, which now becomes −2, as two runs or proposals are needed to reach C under this trial A, B, C, D.

Note that, thanks to the definition of the reward function, we are able to generate more transitions than would be created under sparse rewards. In this toy example the episode contains three runs, but we generated 7 transitions or samples. Every one of these data points would be stored in the replay buffer in the form (s, a, r, s′, g) and then sampled, applying Equation 4.1 to compute the target values for the learners as exposed in Chapter 2. Moreover, it can easily be seen that the number of computed transitions increases rapidly with the number of runs the episode contains, becoming even more sample efficient.
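As an illustration of this Bellman-style generation, the sketch below enumerates transitions from the endpoints of an episode’s runs, using the hindsight convention that the action is the state actually reached; the function name and tuple layout are illustrative, and skipping variants such as Q(A, C, D) can be added analogously.

```python
def bellman_high_transitions(endpoints, sgm):
    """endpoints: [s_0, s_1, ..., s_n] initial/final states of the n runs of an episode.
    Generates (s, a, r, s_next, g) tuples to be trained with the target
    r + max_a' Q(s_next, a', g) of Equation 4.1. Transitions that skip endpoints
    (e.g. Q(A, C, D) with short-term reward -2) can be added in the same fashion."""
    transitions = []
    n = len(endpoints) - 1
    for i in range(n):                        # start of a run
        s, s_next = endpoints[i], endpoints[i + 1]
        a = sgm(s_next)                       # hindsight action: the endpoint actually reached
        for k in range(i + 1, n + 1):         # every later endpoint can act as the goal
            transitions.append((s, a, -1.0, s_next, sgm(endpoints[k])))
    return transitions
```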

On the other hand, we propose another technique inspired by Monte Carlo estimation. In this case, we use the observed long-term return R, which corresponds to the number of proposals required from a certain state s to goal g in the empirical trial.

Using this idea, read as “from s to g through a, R proposals or runs were required”, we can generate the following transitions:

• Q(A, B, B) = −1. The cost of going from A to B through B is −1. One proposal (B) or run was required.


• Similarly, Q(B, C, C) = −1, and Q(C, D, D) = −1.

• Q(A, B, C) = −2. The cost of going from A to C through B was −2, as it required 2 runs.

• Similarly, Q(B, C, D) = −2.

• Following the same idea and extending the span of the transitions, Q(A, B, D) = −3, and Q(A, C, D) = −3.

Note that with these transitions we always use the observed information, i.e., what happened during the episode. Under this technique, we store tuples of the form (s, a, g, R), and then ask the Q-Value network or critic to predict Q(s, a, g) = R.
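One possible enumeration of such Monte Carlo transitions from the same run endpoints is sketched below; it spans every (start, proposal, goal) combination, of which the bullets above list representative examples. Names and the tuple layout are illustrative.

```python
def monte_carlo_high_transitions(endpoints, sgm):
    """Stores (s, a, g, R) tuples where R is the (negative) number of runs empirically
    observed from s to g during this episode, e.g. Q(A, B, D) = -3 above.
    The critic is then trained to predict Q(s, a, g) = R directly."""
    transitions = []
    n = len(endpoints) - 1
    for i in range(n):                            # start state of the span
        for j in range(i + 1, n + 1):             # hindsight proposal: an intermediate endpoint
            for k in range(j, n + 1):             # goal: any endpoint at or after the proposal
                R = -(k - i)                      # runs actually used from endpoints[i] to endpoints[k]
                transitions.append((endpoints[i], sgm(endpoints[j]), sgm(endpoints[k]), R))
    return transitions
```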

Comparison between our two approaches To better see the difference between our two methodologies, we will focus on the estimation of Q(A, B, D).

Additionally, we will use the concept of the triangle inequality, which expresses that between three states B − C − D, the following relation holds:

d(B, D) \leq d(B, C) + d(C, D) \quad (4.2)

If we generalize this to Q-Values, we obtain:

\max_{a'} Q(B, a', D) \geq \max_{a''} Q(B, a'', C) + \max_{a'''} Q(C, a''', D) \quad (4.3)

Note how the inequality sign flips because the original definition is in terms of distance (which is usually minimized), while Q-Values are to be maximized. In this case, the inequality expresses that when introducing intermediate milestones (e.g., C), the Q-Value cannot improve.

In our Bellman approach we use Q(A, B, D) = −1 + max_{a′} Q(B, a′, D); therefore, we are taking an optimistic view of the triangle inequality, as we use the minimum cost from B to D, i.e., the left side of the Q-Value inequality (see Equation 4.3).

On the other hand, when using the Monte Carlo approach we obtain Q(A, B, D) = −3, which implies a cost from B to D of exactly −2. We call this an empirical view of the triangle inequality, as opposed to our first method, because it does not use the best case but the observed one.

Special Cases In order to improve the stability, robustness and convergence of our learning stage, we introduce some penalizations in the following cases:

• When the policy proposes g′ = g, where g is the goal that was being pursued. If g could not be reached and another, closer goal was required, it makes no sense to propose the same one.

• When the policy proposes a goal g′ that is directly achieved by the current state s. In this case the proposed goal does not help the agent reach its final objective.

During the training stage we introduce a maximum number of goal proposals Hh per episode, to avoid getting stuck in some environments. Taking this into account, a correctly completed episode can have a sequence of at most Hh runs, so the values of R can go down to −Hh in a completed episode. To make the punishment work, we use Q(s, a, g) = −(Hh + 1) in the mentioned cases, for both of our long-term reward estimation strategies.


4.3.3 Is reachable function

The is reachable function is the third main component of our algorithm, and its goal is to predict if a goal g is reachable from a state s, to decide whether to ask for a closer subgoal or attempt a run towards g. Multiple alternatives exist to implement it.

Data gathering To train this predictor we first have to gather data and store it in a Replay Buffer. It is important to note that the gathered information changes over time during the training phase, in which the agents’ policies are being optimized.

Imagine a low hierarchy agent with Hl = 3 that performs the following run of 3 steps over environment states A, B, C, D, as seen in Figure 4.2. Thanks to the triangle inequality, we can ensure that if the agent has been able to move from A to D in Hl steps, B and C are also reachable from A. In fact, we can generate all the following state-goal pairs: reachable = {(A, sgm(B)), (A, sgm(C)), (A, sgm(D)), (B, sgm(C)), (B, sgm(D)), (C, sgm(D))}.

On the contrary, if the goal was to reach G, we can only affirm that G is not reachable from A using this policy. As the triangle inequality is indeed an inequality, we cannot extract any information about B, C or D. Additionally, note that with an optimal policy, G might be reachable from A, unlike the “positive” pairs, which hold regardless of the policy’s optimality.

Figure 4.2: Diagram to illustrate is reachable data gathering (a run A −→ B −→ C −→ D towards an unreached goal G).
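The “positive” pairs described above can be generated with a small helper like the following; the function name and the assumption that states are hashable are illustrative.

```python
def reachable_pairs_from_run(states, sgm):
    """states: [A, B, C, D], the states visited in a single run of at most H_l steps.
    Every later state was reached from every earlier one within the horizon,
    so all such (state, goal) pairs can be stored as reachable."""
    pairs = set()
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            pairs.add((states[i], sgm(states[j])))
    return pairs
```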

Possible Implementations In this work we have tested two different implementations of the is reachable function.

The first idea is to use a basic Neural Network such as a Multi Layer Perceptron (MLP), trained on the positive and negative gathered data as a binary classifier. After each episode, some mini-batches of the buffer’s data would be sampled and used to update the network’s parameters. This concept has some advantages and drawbacks.

Firstly, as a function approximator it is able to generalize over states and goals, learning faster. This, however, could also be a drawback because, in a hard-boundary problem like this one, two consecutive states can have completely different outcomes.

Moreover, there is a notable class imbalance in the data, as we are only able to obtain one “negative” example for each non-achieved run, while multiple “positive” samples are generated. This fact could hinder the learning process.

Finally, taking into account that the error tends to be larger in the first epochs, the most important gradient steps would be computed using transient data gathered in the first episodes, where the agents do not act close to optimally, possibly moving the parameters far from the real optimum.

The second approach to implement this function is a tabular method. Using this system, a certain state-goal pair is tagged as reachable if it has been visited before during training.


Therefore, the prediction becomes a look-up operation in the buffer data, and there is no need to gather “negative” examples.

A clear drawback of this methodology is its difficulty generalizing over states and goals, which makes the learning process slower and requires more exploration. Along the same lines, environments with continuous states and goals become a problem because it is nearly impossible to find an exact match of a state-goal pair in the buffer. To solve this, tolerances are used as in Equation 2.32.

On the other hand, this method needs fewer computational resources since, using the same buffer as the MLP version, it only requires a look-up function instead of a Neural Network. Nonetheless, the look-up alternative could become slower if the buffer grows fast, which should be taken into account.

Finally, a deterministic look-up function is always accurate and able to differentiate among similar states and goals, which suits the problem of having a hard boundary between reachable and non-reachable state-goal pairs. Moreover, this method is always up to date with the most recent data, while an NN-based version can need several iterations to adapt to new data, making convergence slightly more difficult.

In this project, the tabular method has produced dramatically better results than a neural implementation. Nevertheless, how to implement it more efficiently remains a field of research.

With either implementation we use an ε-greedy technique that returns True with probability ε, independently of the trained predictor. This ensures correct exploration, especially during the first episodes of the training stage.
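A minimal sketch of the tabular variant with the ε-greedy override, assuming discrete, hashable state-goal pairs (for continuous spaces the exact membership test would be replaced by a tolerance-based search, as in Equation 2.32); class and method names are illustrative.

```python
import random

class TabularIsReachable:
    """Predicts a state-goal pair as reachable if it has been seen before during training."""
    def __init__(self, eps=0.1):
        self.buffer = set()          # stores only "positive" (state, goal) pairs
        self.eps = eps

    def add_pairs(self, pairs):
        self.buffer |= set(pairs)    # e.g. the output of reachable_pairs_from_run

    def __call__(self, s, g):
        if random.random() < self.eps:   # epsilon-greedy: sometimes force a run attempt
            return True
        return (s, g) in self.buffer
```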

4.3.4 Incomplete goal spaces

The problem With the introduction of hierarchies and subgoal proposers, a new problematic paradigm arises.

Any Reinforcement Learning algorithm is defined for either a discrete or a continuous action space. If discrete, actions are usually encoded as consecutive integers starting at zero. In the continuous case, actions are defined by their lower and upper limits, and any value in between is possible.

However, this is not always the case for state spaces. As an example, an angle θ is usually encoded into a two-dimensional vector (cos(θ), sin(θ)) in order to have a continuous cyclic representation. This state space is continuous and both its dimensions are bounded in [−1, 1]. Nevertheless, note that not all 2-dimensional vectors with values within the bounds work. In fact, only combinations satisfying cos²(θ) + sin²(θ) = 1 are valid.

Usually this is not a problem, because the environment is in charge of generating the observations. Notwithstanding, in our hierarchy the high level has to propose goals, so its action space A becomes the goal space G, which is related to the state space S through a state-goal mapping function.

For this reason, if not all the possible continuous vectors within the goal space’s bounds are valid, this becomes problematic because not all of the high agent’s outputs are allowed. As an example, if goals were represented as (cos(θ), sin(θ)), it would be nearly impossible for our high agent to produce outputs where the condition cos²(θ) + sin²(θ) = 1 is met. Therefore, in most cases our agent would be proposing subgoals that would never be reachable, hindering the learning process. In fact, those would be forbidden goal representations.

In most cases this problem can be solved by defining a goal space G that is continuous and complete (all the possible vectors within its bounds are valid). In the example, although the state space S is represented using the cosine and sine functions, the goal space G could be defined as the angle θ, using sgm(s) = atan2(sin(θ), cos(θ)).

However, this is not always possible when working with obstacles. Imagine now that, for some reason, angles θ ∈ [30◦, 45◦] ∪ [135◦, 150◦] are never reachable due to obstacles. It is impossible to create a representation for the goal space that is continuous and in which all possible values inside certain bounds, say [−180◦, 180◦), are valid.

In such a case, our high hierarchy could propose a prohibited angle like 35◦, which would never be reached. The first problem is that the episode would become impossible to achieve. The second is that, under our reward estimation system, this proposal would not be penalized, as we use Hindsight Action Transitions (HAT), taking final states as goals. Therefore, we need a system to identify forbidden goals in environments where this problem occurs, in order to provide the appropriate feedback to the goal proposer, as well as to avoid getting stuck when solving a task.

The solution The proposed solution is to build a history of all goals that have been visited, and to check whether a goal is forbidden just after proposing it. Taking into account that prohibited goals are specific subspaces and that we require high precision to detect them, a tabular method like the one used for the is reachable function works best.

To build the history, all the states visited by the agent during the training stage are converted into the goal space G using the state-goal mapping function and added to a new buffer.

Just after proposing a goal g, a look-up call is made to check if the goal is in the history. If that is the case, the method follows as shown in Algorithm 4.1. Otherwise, the goal is not added to the goal stack and a new one is drawn using some noise to avoid repetition. Furthermore, a new transition is created such that the proposal receives the maximum penalization, −(Hh + 1).

To avoid exploitation and ensure exploration, an ε-greedy technique is used in the look-up call, which returns True with probability ε without taking the history into account.
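The sketch below shows one way this filter could wrap the goal proposal step, combining the history look-up, the ε-greedy override, the penalty transition and the noisy re-draw; all names, the noise scale and the retry limit are illustrative assumptions rather than the thesis’ exact code.

```python
import numpy as np

def propose_valid_goal(high, s, g, goal_history, H_h, buffer, eps=0.1, noise_std=0.05, max_tries=20):
    """Rejects proposals that were never visited during training, penalizes them with
    -(H_h + 1) stored as an (s, a, g, R) tuple, and re-draws a noisy proposal."""
    g_prop = high.propose_goal(s, g)
    for _ in range(max_tries):
        # epsilon-greedy look-up: accept with probability eps regardless of the history
        if np.random.rand() < eps or goal_history.contains(g_prop):
            return g_prop                              # valid: will be pushed onto the goal stack
        buffer.append((s, g_prop, g, -(H_h + 1)))      # maximum penalization for the forbidden proposal
        g_prop = high.propose_goal(s, g) + np.random.normal(0.0, noise_std, np.shape(g_prop))
    return g_prop
```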


5. Experiments and Results

This chapter contains an introduction to the environments used in the project, as well as the results of the most important experiments and studies that have been conducted to design, test, and analyze our algorithm.

5.1 Environments

The design and selection of environments are crucial for the development and analysis of a new algorithm. For this project, we have used Simple MiniGrid Empty, Simple MiniGrid FourRooms, and Pendulum.

5.1.1 Simple MiniGrid

Simple MiniGrid is our own modification of the MiniGrid [3] environments, published by Chevalier-Boisvert et al. in 2018, which were presented as a third-party library but are now integrated into gym [2] from OpenAI.

Figure 5.1: Original MiniGrid Empty 6x6 Environment.

MiniGrid, as seen in Figure 5.1, is a grid-based discrete environment with an agent that has a limited viewing distance, shown with a light grey transparency in the image. In the original implementation, the state of the agent is formed by a 3-dimensional tensor in which, for each tile in the agent’s sight, a 3-dimensional vector encodes the type of object that the tile contains, along with some of its attributes. The episode’s goal and the outer walls are considered objects.

As this state definition did not suit our requirements for an environment to serve as a test bench for our new algorithm, we changed the whole state representation along with the internal logic. More specifically, instead of basing the state on a partial observation of the space, we defined it as the global position of the agent in the environment, given by its x and y coordinates, as well as its orientation.

Regarding the goal space, it has been defined as a two-dimensional space given by the x and y coordinates. Note that the orientation does not matter when deciding whether a certain state has reached a goal.

Concerning the actions, three discrete ones have been defined: moving forward, rotating left and rotating right. Note that we could also have used an agent without orientation, with a two-dimensional state space and four discrete movement actions. Although learning would have been easier, we wanted our testing environment to include a differentiation between state and goal spaces, just to be more generic.

The transition probability function is deterministic and the dynamics work as one would expect: if “moving forward” is chosen, the agent will move to the tile in front of it (depending on its orientation) if and only if there is no obstacle in that tile. With the other two actions, the agent will always make a 90-degree turn in the respective direction without changing its position.

Finally, the reward function is sparse. The agent receives −1 by default and 0 when it performs an action that leads to a state position (not including orientation) that matches the goal location.

To sum up, for an environment of H rows and W columns of tiles:

• S = {0, 1, . . . , H − 1} × {0, 1, . . . , W − 1} × {0, 1, 2, 3}

• G = {0, 1, . . . , H − 1} × {0, 1, . . . , W − 1}

• A = {forward, left, right}

• g = sgm(s) = (s[0], s[1])

• r(s, g) = 0 if g ≡ sgm(s), and −1 otherwise
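A minimal, illustrative skeleton of this redefined environment is sketched below; the class name, the orientation encoding and the wall handling are assumptions made for the example and not the actual implementation.

```python
class SimpleMiniGridSketch:
    """State: (x, y, orientation); goal: (x, y); actions: 0 = forward, 1 = left, 2 = right."""
    MOVES = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}   # orientation -> (dx, dy), illustrative

    def __init__(self, height, width, walls=()):
        self.h, self.w, self.walls = height, width, set(walls)

    def sgm(self, s):
        return s[:2]                                         # drop the orientation

    def reward(self, s, g):
        return 0.0 if self.sgm(s) == tuple(g) else -1.0      # sparse reward of Section 5.1.1

    def step(self, s, a):
        x, y, o = s
        if a == 0:                                           # move forward only if the tile is free
            dx, dy = self.MOVES[o]
            nx, ny = x + dx, y + dy
            if 0 <= nx < self.h and 0 <= ny < self.w and (nx, ny) not in self.walls:
                x, y = nx, ny
        else:                                                # rotate left (1) or right (2) by 90 degrees
            o = (o + (3 if a == 1 else 1)) % 4
        return (x, y, o)
```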

Apart from this redefinition of the environment, we have included both the logic and the visualization to use multiple goals, which are represented in the order they have been inserted in the goal stack using a color gradient, as seen in Figure 5.2a.

Despite having a large number of state-goal-action combinations, this Empty version of our Simple MiniGrid environment is quite easy and mostly used for development. For this reason, we have also used a variation called FourRooms which, as seen in Figure 5.2b, includes walls that split the space into four rooms connected by tiny “doors”. Despite looking quite similar, this is an incredibly difficult environment for Reinforcement Learning algorithms because it requires a very specific sequence of actions to travel from one room to another. In fact, the only possibility is to reach the tile in front of the door, change the orientation towards it and make two forward steps.

Besides this exploration problem, FourRooms introduces obstacles (walls), which serve to test our solution to the “incomplete goal space” problem introduced in Section 4.3.4.

In all Simple MiniGrid environments, a different random initial state and goal are selected for each episode.


(a) 10x10 Empty environment. (b) 10x10 FourRooms environment.

Figure 5.2: Simple MiniGrid Environments. Lighter subgoals need to be reached before darker ones.

5.1.2 Pendulum

Pendulum [22] is a well-known environment from the Classic Control package [21] of gym [2], and we have selected it because both its state space and action space are continuous, unlike in the Simple MiniGrid environments. Therefore, this environment will help us understand the applicability of our algorithm in environments of a different nature.

The Pendulum environment is, in fact, a pendulum that oscillates around a fixed rotation point. To make the task slightly more difficult, we have modified the initial state space so each episode starts in a random position close to its stable point (free end facing down) with zero velocity. The goal of the environment is always the same: reaching the opposite position (free end facing up), also with zero velocity.

The action space A is a one-dimensional continuous value that indicates the applied torque, bounded to values that make it impossible to go from the initial state to the goal directly. As the actuator does not have enough power, some prior oscillations are required to gain velocity and reach the final goal.

Additionally, we have added the logic and the visualization for multiple goals, with a color gradient that indicates the goal stack position of each of them, as seen in Figure 5.3.

Internally, the state of the environment is encoded as the angle of the pendulum θ and its angular velocity θ̇. The angle starts at the top and increases counter-clockwise. However, this is not a cyclic representation of the state, because an angle of θ = π could be considered different from θ = −π. For this reason, after applying the dynamics of the system, the representation is converted into a vector (cos(θ), sin(θ), θ̇), which is continuous and cyclic.

We have defined the goal space G in the same way the internal state is encoded, (θ, θ̇), for one main reason. In a continuous environment, we have to define when a goal is achieved by the use of certain tolerances. If a fixed tolerance were used on the cosine and sine values, it would lead to different angle tolerances, as these functions are not linear. This would be an unexpected behavior, as we would like to treat every angle the same way. For this reason we have used the following state-goal mapping function:


Figure 5.3: Pendulum Environment. The goal proposals ask the agent to first move left (orange subgoal) in order to gain velocity and then move to the right sector towards the red milestone.

g = sgm(s) = sgm((\cos\theta, \sin\theta, \dot{\theta})) = (\mathrm{atan2}(\sin\theta, \cos\theta), \dot{\theta}) = (\theta, \dot{\theta}) \quad (5.1)

With this definition, the goal of the environment becomes (θ = 0, θ̇ = 0).

Finally, we have modified this environment to use a sparse reward of −1 whenever the episode’s goal is not achieved, and 0 otherwise. To consider a goal achieved, we use the following definition, where (θ_s, θ̇_s) = sgm(s) and g = (θ_g, θ̇_g):

achieved(s, g) = \begin{cases} \text{True} & \text{if } |\mathrm{normalize}(\theta_s - \theta_g)| < \varepsilon_{\theta} \,\wedge\, |\dot{\theta}_s - \dot{\theta}_g| < \varepsilon_{\dot{\theta}} \\ \text{False} & \text{otherwise} \end{cases} \quad (5.2)

Note that ε_θ and ε_θ̇ are two tolerances, for the angle and the angular velocity respectively, and the normalize function maps any angle into [−π, π).
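The mapping of Equation 5.1 and the check of Equation 5.2 translate directly into code; the sketch below assumes the state layout (cos θ, sin θ, θ̇) and uses illustrative tolerance values.

```python
import numpy as np

def sgm_pendulum(s):
    # s = (cos(theta), sin(theta), theta_dot)  ->  g = (theta, theta_dot), as in Equation 5.1
    return np.array([np.arctan2(s[1], s[0]), s[2]])

def achieved_pendulum(s, g, eps_theta=0.1, eps_theta_dot=0.2):
    # Equation 5.2; tolerance values are illustrative
    theta_s, theta_dot_s = sgm_pendulum(s)
    angle_diff = (theta_s - g[0] + np.pi) % (2 * np.pi) - np.pi   # normalize into [-pi, pi)
    return abs(angle_diff) < eps_theta and abs(theta_dot_s - g[1]) < eps_theta_dot
```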

5.2 Preliminary studies

This section contains several experiments and studies that were performed with the main objective of obtaining a better understanding of our new algorithm, its components, and how each of them can affect its performance and convergence.

Base learner comparison

Learning Recursive Goal Proposal (LRGP) is able to work with any Value-based or Actor Critic Reinforcement Learning algorithm. In this experiment, performed on a 15x15 version of our Simple MiniGrid Empty environment, we have compared the performance of DDQN [14], DDPG [17], TD3 [10] and SAC [12, 13] when used in our high hierarchy.

Note that the Simple MiniGrid environment has a set of three discrete actions. To match the nature of that space, we use DDQN for the low hierarchy, as it is specifically designed for discrete action spaces.

Regarding the goal space, which becomes the action space of our high hierarchy, it is also discrete. Therefore, the straightforward implementation would be to use DDQN, treating each position of the grid independently.


Nonetheless, there is a correlation among those actions or goals, as they are spatially distributed in a two-dimensional grid. For this reason, we have implemented a discretization step in order to be able to use algorithms that are defined for continuous action spaces and are therefore capable of generalizing over them. To be more specific, we use two-dimensional continuous actions and divide the output range of each dimension into 15 equal bins (the width and height of the environment), which are used to discretize the output and obtain the x and y coordinates of a goal proposal.
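A minimal sketch of this discretization step, assuming the actor’s output is bounded (e.g., by a tanh) to a symmetric range; the bounds and names are illustrative.

```python
import numpy as np

def discretize_goal(continuous_action, grid_size=15, low=-1.0, high=1.0):
    """Maps a 2-D continuous proposal in [low, high] per dimension into integer (x, y)
    grid coordinates by splitting each dimension into grid_size equal bins."""
    a = np.clip(np.asarray(continuous_action, dtype=float), low, high)
    bins = np.floor((a - low) / (high - low) * grid_size).astype(int)
    return tuple(np.minimum(bins, grid_size - 1))   # fold the upper edge into the last bin
```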

Figure 5.4 shows the learning curves for these experiments, expressed as the success ratio achieved when testing the learned policy. The most remarkable result is that DDQN hugely underperforms the other methods. This is expected because it treats each grid position independently, thus being unable to generalize.

Figure 5.4: Algorithm comparison for the high hierarchy.

Among the other algorithms, TD3 performs similarly to DDPG, while SAC slightly outperforms both of them, learning faster and achieving a 100% success ratio more consistently.

To conclude the analysis, Learning Recursive Goal Proposal’s flexibility to use different learners enables choosing the most appropriate algorithm depending on the task. Moreover, this experiment showed that when discrete actions are correlated, it is better to use algorithms defined for continuous spaces, as generalization speeds up the learning process.

Estimation technique comparison

In this experiment we compare our two long-term reward estimation techniques introduced in Section 4.3.2. We refer to them as Monte Carlo for the version that uses empirical long-term returns, and Bellman for the version that is based on Bellman’s recursive equation.

Figure 5.5 shows the learning curves for the best configuration of our agents—using DDQN [14] for the low hierarchy and SAC [12, 13] for the higher one—when using one or the other estimation technique.


Figure 5.5: Comparison between Monte Carlo-inspired and Bellman-based estimation techniques.

Note how, although both should theoretically converge to the same policy, the Monte Carlo-inspired version clearly outperforms the Bellman-based one. A possible explanation for these results is that the Bellman version—opposite to Monte Carlo’s—relies on target networks to compute the “target” or “true” values used to train the Q-networks or critics. These target networks are updated at a significantly slower pace than the value ones, in order to overcome the moving target problem. This could hinder the learning speed of the high level agent, which in turn could snowball into worse learning dynamics for the low one.

All the following experiments have been made using the Monte Carlo version of long-term return estimation.

Study on the sample efficiency

In Section 4.3.2 we showed that our way of generating transitions for the high hierarchy is quite sample efficient, because it creates more samples than the number of runs that the episode contains. The main objective of this study is to illustrate how sample efficient our system is in comparison with different methodologies.

Figure 5.6 shows this comparison by presenting the number of transitions that are created as a function of the number of runs an episode has had.

Without any special technique, a “plain” or common agent that interacts with the environment obtaining sparse rewards would create one data point per interaction. By making use of HER [1], this ratio would increase, as Hindsight Experience Replay adds additional transitions to the buffer. The most used ratio for HER is one-to-one, therefore creating a hindsight transition for each original one, leading to a total of two transitions per interaction.

Some algorithms may use higher HER ratios like two-to-one, which we also included. However, much higher ratios usually lead to instability and convergence problems.


Finally, the number of transitions generated by our method grows much faster than linearly. In fact, it can be seen that for four runs we have fewer than 4² = 16 transitions, but for six runs we have more than 6² = 36. Therefore, the growth of this function becomes more than quadratic.

To conclude, the reward function definition and the methodology used to generate transitions for our high hierarchy are among the key points of this project, leading to remarkable sample efficiency.

Figure 5.6: Study on the sample efficiency.

Solution to incomplete goal spaces

The intention of this study is to analyze the ability of our solution to solve the problem of incomplete goal spaces presented in Section 4.3.4.

To test our method we now use a 15x15 version of our Simple MiniGrid FourRooms environment, which, as already mentioned, includes some inner walls or obstacles. These are placed in “valid” positions with regard to the goal space. However, these positions are not reachable by the agent, which can never stand on walls.

Therefore, if the high hierarchy proposes a goal located on a wall, which is within the bounds of its output range, the lower agent will never be able to reach it and that episode will never be completed.

Figure 5.7 shows a comparison between the learning processes of our algorithm when using our solution to this problem and without any treatment of these forbidden goals. It can easily be seen that without handling the problem, the algorithm is not able to solve the environment (at least in 25000 training episodes), while when using our solution the agent learns much faster and achieves a better success ratio from the very beginning of the training stage.


On the one hand, this analysis proves the importance of detecting this problem and, on the other, the validity of our solution, which improves both the performance in the environment and the learning speed.

Figure 5.7: Analysis of our solution to incomplete goal spaces.

Analysis of the learning process

Training multiple agents simultaneously is not a straightforward matter, as the interactions between their suboptimal policies can lead to instabilities and convergence problems. The goal of this study is to analyze how the whole algorithm, including the two agents and the is reachable function, learns simultaneously.

In order to provide more insight into this process, Figure 5.8 shows different metrics of our algorithm learning to solve a 15x15 Simple MiniGrid Empty environment with low horizons of Hl = 4 and Hl = 8.

Figure 5.8a presents the number of subgoals requested during the testing episodes interleaved with the training process, which has a direct implication on the convergence of the high policy. Note how it starts at 15, which is a hard limit that we established in order to avoid getting stuck when agents act far from optimally. In the following episodes the curves go down to close to 4, and later converge to approximately 3. When using a shorter low horizon Hl, more goal proposals are needed, as expected.

Figure 5.8c shows the success rate of the low agent. In other words, it presents the ratio of runs in which it could achieve its goal in no more than Hl steps. It can be seen that when the low agent starts learning how to achieve “reachable” goals, the high hierarchy starts to converge. Therefore, both learning processes are connected. Note how larger horizons need slightly more time to converge, which is expected because in that case the agent receives more state-goal combinations to learn.


Figure 5.8: LRGP’s learning process. (a) Subgoal proposals. (b) High hierarchy buffer occupancy. (c) Low agent success rate. (d) Low hierarchy buffer occupancy. (e) Success rate. (f) is reachable buffer occupancy.

Putting it all together, Figure 5.8e shows the environment’s success rate. It can easily be seen that the whole algorithm’s convergence is strictly tied to that of each of its agents.

On the other hand, Figure 5.8b shows the number of transitions inserted in the high hierarchy replay buffer, which has a size of 10⁶. Note how its occupancy grows faster in the first episodes. The reason behind this is the greater sample efficiency obtained in episodes with a larger number of runs or goal proposals. Furthermore, with a larger low horizon Hl, the number of subgoal proposals decreases and the buffer fills more slowly.


Figure 5.8d shows the same information for the low-level replay buffer. In this case, however, it grows at a slower pace because the low agent uses HER [1], which is less sample efficient. Moreover, note how both curves look the same, because the total number of environment steps is independent of how many subgoal proposals are being used (as long as the proposals are meaningful).

Finally, Figure 5.8f shows the number of unique transitions inserted in the buffer used for the tabular implementation of is reachable. It is important to see that its convergence is strongly related to the agents’. Additionally, lower horizons Hl converge to lower values because there exist fewer reachable state-goal combinations (note that we only store “positive” reachable pairs).

To conclude, it can clearly be seen that the convergence of the algorithm is strongly correlated with the convergence of each of its three main components: the high hierarchy, the low agent and the is_reachable function. Therefore, it is crucial to find good hyperparameters for each of them, as a delay in one component's convergence has a huge impact on the learning speed of the whole algorithm.

5.3 Main Results

In this section we present the best results achieved by Learning Recursive Goal Proposal (LRGP) in each environment, along with a comparison to other non-hierarchical State-of-the-Art algorithms.

For all experiments, the hyperparameters have been kept constant across all algorithms in order to provide the fairest analysis. Appendix A shows detailed information about them.

Simple MiniGrid Empty Environment

Figure 5.9 presents the learning curves for DDQN [14], DDQN with HER [1, 17], and Learning Recursive Goal Proposal (LRGP) with the best configuration obtained in the preliminary studies (see Section 5.2).

It can be seen that, although this is a fairly easy environment, a "plain" agent struggles to learn under sparse rewards, requiring almost 5000 episodes to obtain a perfect success ratio. With the addition of HER, DDQN already learns considerably faster thanks to the hindsight positive rewards.

Regarding LRGP, the plot shows the learning curve for a configuration in which the goal proposer uses SAC [12, 13]. Note that the success rate starts growing slightly earlier than for DDQN + HER, which is expected since the low agent is in fact DDQN + HER, but with the advantage that it does not need to learn the whole state-goal space, only the goals proposed by the high hierarchy. As seen in the preliminary studies, the high hierarchy requires a few more episodes to converge and stabilize close to a 100% success rate.

To conclude, in the Empty version of the Simple MiniGrid environment, Learning Recursive Goal Proposal obtains results comparable to the current State-of-the-Art, with the added difficulty of training two agents simultaneously.



Figure 5.9: State-of-the-Art comparison in Simple MiniGrid Empty 15x15 environment.

Simple MiniGrid FourRooms Environment

Although visually quite similar, the FourRooms version of the Simple MiniGrid environment is significantly more difficult. In fact, the first thing that can be seen in Figure 5.10, which presents the learning curves of the different algorithms in this environment, is that they need at least 10 times more episodes to reach good performance.

Starting with DDQN, we can see that this well-known algorithm is not able to solve the environment within 25000 episodes; in particular, it never reaches a 25% success rate during the whole training stage.

The addition of HER helps DDQN reach between 30% and 40% success rate between episodes 1000 and 20000. During this stage, DDQN + HER seems consistently capable of solving some of the generated episodes, but it fails at many others. A possible reason is its inability to cross doors and change rooms, which requires substantial exploration.

In fact, episode 20000 appears to mark a turning point, after which DDQN + HER increases its success rate rapidly.

Finally, LRGP reaches similar success rates as early as episode 10000, learning more than twice as fast as the non-hierarchical solutions. In this environment, where learning a good goal-conditioned policy that can guide the agent from any state to any goal is quite complex, our hierarchical solution shines, proving its sample efficiency and the advantages of dividing the search space between different agents.

In contrast to the Empty environment, which was fairly easy, the FourRooms environment presents a favorable trade-off between the complexity of training two agents simultaneously and the benefits of having such a hierarchy.



Figure 5.10: State-of-the-Art comparison in Simple MiniGrid FourRooms 15x15 environment.

Pendulum Environment

As mentioned in Section 5.1.2, Pendulum [22] is an environment with both continuous state and action spaces. For this reason, in this experiment we have compared our algorithm to two State-of-the-Art methods for environments of this nature: Soft Actor Critic (SAC) [12, 13] and SAC + HER, which improves learning under sparse reward schemes.

Moreover, we have used SAC both for our low hierarchy, which has to perform a continuous action, and for the high hierarchy, which has to propose a continuous goal. Figure 5.11 shows the results of this experiment, in which all algorithms perform quite similarly.

Firstly, it is worth mentioning that all methods show a 0% success rate during a number of episodes. The reason is that in the Pendulum environment the final goal is always to reach the upward position with zero velocity, so it is almost impossible to complete an episode with a suboptimal policy, as opposed to the MiniGrid environments, where some easy generations with a close initial state and goal were possible.

Second, note how SAC is able to solve the environment on its own quite efficiently, and the addition of HER does not make a significant difference. This could be because SAC's implicit exploration is enough to solve the task, and learning a goal-conditioned policy is not required since the goal is always the same.

On the other hand, it can be seen that LRGP is also able to solve the task in approximately the same number of episodes. This shows that the algorithm can work in environments of any nature, as well as with different base learners for each of its hierarchy levels.

To conclude, Pendulum has been a good environment to demonstrate the ability to work on continuous tasks, but it may be too easy to show the power of the hierarchy, since a non-hierarchical State-of-the-Art method is already able to solve it straightforwardly. Regardless, LRGP has proved that it can obtain results at least comparable to State-of-the-Art methods while training two agents simultaneously.



Figure 5.11: State-of-the-Art comparison in Pendulum Environment.


6. Conclusions

Discussion

In this thesis, we have presented our algorithm Learning Recursive Goal Proposal (LRGP), a new Reinforcement Learning solution formed by a two-level hierarchy in which the base learner follows the subgoals proposed by the higher agent.

To the best of our knowledge, LRGP is the first hierarchical algorithm to use recursivity, being able to propose closer and closer subgoals with the same policy applied recursively.

Moreover, LRGP can use any Value-based or Actor Critic algorithm for both the low and high hierarchies, including State-of-the-Art methods like Soft Actor Critic (SAC) [12, 13]. This leads to exceptional flexibility: it can adapt to any environment and use the most appropriate algorithm for it.

Regarding the results, LRGP shows promising performance and efficiency, obtaining results at least comparable to other State-of-the-Art algorithms. Furthermore, it shines in complicated environments, where it requires much less time to converge than State-of-the-Art methods. This superiority can be attributed to two causes. First, the hierarchy becomes highly useful as it breaks the problem into two subparts, each of them easier to solve. Second, LRGP has shown remarkable sample efficiency, being able to gather much more information per episode.

Finally, LRGP proved able to converge under simultaneous learning, which is always challenging due to interference among suboptimal policies. However, it also showed to be notably sensitive to the convergence of each of its parts. Therefore, if any of its three main components has suboptimal hyperparameters, the whole learning procedure becomes slower and less efficient. This overhead of training multiple components simultaneously may have prevented our algorithm from showing superior performance and efficiency in more manageable environments, where drawing significant benefits from hierarchies is more complicated.

For this reason, although LRGP shows promising results, it has just left the nest, and much more work could be done to improve its performance, efficiency and robustness.

Future Work

The introduction of recursivity and the collapse of multiple levels of subgoal proposers into a single policy imply the need to know when to ask for a closer subgoal. For this reason, the is_reachable function is a crucial part of our project. Nonetheless, we have used a tabular version of it, which lacks generalization capabilities and is slower to converge. Therefore, research into a better implementation of this component would be essential for the efficiency of the whole algorithm and for even better subgoal proposals.


Moreover, such high sample efficiency may also have a negative implication. Under these circumstances, the amount of data stored in the replay buffer becomes very large, so the probability of sampling each individual transition becomes correspondingly smaller. This dynamic could affect the learning process if penalizations, which apply to very particular situations, are never sampled. For this reason, it would be worthwhile to study cleverer sampling schemes, such as Prioritized Experience Replay [26] or Combined Experience Replay [32].
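As a simple illustration of the second option, the following is a minimal sketch of Combined Experience Replay in the spirit of [32]: a uniform replay buffer whose sampled batches always include the most recent transition. The class and method names are our own illustrative choices, not part of the thesis implementation.

    import random

    class CombinedReplayBuffer:
        """Uniform replay whose batches always contain the newest transition."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.storage = []
            self.pos = 0
            self.latest = None

        def push(self, transition):
            if len(self.storage) < self.capacity:
                self.storage.append(transition)
            else:
                self.storage[self.pos] = transition   # overwrite the oldest slot
            self.pos = (self.pos + 1) % self.capacity
            self.latest = transition

        def sample(self, batch_size):
            # Uniformly sample batch_size - 1 transitions and append the latest one,
            # so fresh experience is never diluted by a very large buffer.
            batch = random.sample(self.storage, min(batch_size - 1, len(self.storage)))
            batch.append(self.latest)
            return batch

Prioritized Experience Replay [26] would instead sample transitions with probability proportional to their TD error, which could help the rarely occurring penalization transitions get replayed.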

Additionally, the environments used in this work, especially MiniGrid, have been tailored for developing and analyzing the algorithm's behavior. They have been crucial in the process of creating an entirely new algorithm that uses a novel reward system. However, it would be valuable to assess its performance on more sophisticated tasks, where the use of hierarchies should be more relevant.

Finally, the hyperparameter search space grows exponentially with three different components that must learn simultaneously. A deeper study of the impact of each parameter on the whole algorithm's convergence would lead to a better understanding of simultaneous learning, and possibly to better configurations, improving performance and efficiency.


Bibliography

[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay, 2018.

[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016. arXiv:1606.01540.

[3] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

[4] Edsger W. Dijkstra. Dijkstra's algorithm, 1956.

[5] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function, 2018.

[6] Chelsea Finn. Stanford CS-330 course: Deep multi-task and meta learning. https://cs330.stanford.edu/, 2020.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks, 2017.

[8] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents, 2018.

[9] Haotian Fu, Hongyao Tang, Jianye Hao, Wulong Liu, and Chen Chen. MGHRL: Meta goal-generation for hierarchical reinforcement learning, 2020.

[10] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods, 2018.

[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.

[12] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.

[13] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications, 2019.

[14] H. V. Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.

[15] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning, 2017.


[16] Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierarchies with hindsight, 2019.

[17] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning, 2019.

[18] Mario Martin. Reinforcement Learning course notes. Master's degree in Artificial Intelligence, UPC-BarcelonaTECH. https://www.cs.upc.edu/~mmartin/url-RL.html, 2020.

[19] Bhairav Mehta, Tristan Deleu, Sharath Chandra Raparthy, Chris J. Pal, and Liam Paull. Curriculum in gradient-based meta-reinforcement learning, 2020.

[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[21] OpenAI. Classic control environments. https://gym.openai.com/envs/#classic_control.

[22] OpenAI. Pendulum environment. https://gym.openai.com/envs/Pendulum-v0/.

[23] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables, 2019.

[24] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search, 2018.

[25] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR.

[26] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay, 2016.

[27] Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models, 2020.

[28] David Silver, Aja Huang, Christopher Maddison, Arthur Guez, Laurent Sifre, George Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 01 2016.

[29] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.

[30] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.


[31] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning, 2016.

[32] Shangtong Zhang and Richard S. Sutton. A deeper look at experience replay, 2018.


Appendix A. Implementation Details

Simple MiniGrid Environments

The non-hierarchical agents have been implemented using DDQN [14] with the hyperparameters shown in Table A.1. Furthermore, the low hierarchy has also been implemented with DDQN and the same parameters; a brief sketch of this setup follows Table A.1.

Hyperparameter              Value
Discount factor γ           1
Soft-update parameter τ     0.005
Hidden dimensions           (128, 128, 128, 128)
Optimizer                   AdamW with default parameters
Learning rate               3e-4
Buffer size                 5e5

Table A.1: DDQN default hyperparameters for Simple MiniGrid environments.
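For reference, the following minimal sketch shows how the hyperparameters in Table A.1 could be instantiated, assuming a PyTorch implementation; the input and output dimensions (a flattened state-goal vector and the number of discrete actions) are illustrative assumptions, not values taken from the thesis.

    import torch
    import torch.nn as nn

    def build_q_network(input_dim, n_actions):
        hidden = (128, 128, 128, 128)            # hidden dimensions from Table A.1
        layers, prev = [], input_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers.append(nn.Linear(prev, n_actions))
        return nn.Sequential(*layers)

    q_net = build_q_network(input_dim=6, n_actions=3)      # dimensions are assumptions
    target_net = build_q_network(input_dim=6, n_actions=3)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.AdamW(q_net.parameters(), lr=3e-4)  # AdamW and lr from Table A.1

    def soft_update(target, online, tau=0.005):
        # Polyak averaging of the target network with tau from Table A.1
        for tp, op in zip(target.parameters(), online.parameters()):
            tp.data.mul_(1 - tau).add_(tau * op.data)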

Regarding the high hierarchy of Learning Recursive Goal Proposal, it has been implemented using SAC [12, 13] and the hyperparameters presented in Table A.2. Furthermore, the selected continuous action has been discretized using the floor function to obtain a discrete coordinate (see the sketch after Table A.2).

Hyperparameter                          Value
Discount factor γ                       1
Soft-update parameter τ                 0.005
Entropy weight α                        1
Hidden dimensions (for all networks)    (128, 128, 128, 128)
Optimizer (for all networks)            AdamW with default parameters
Learning rate (for all networks)        3e-4
Buffer size                             1e6

Table A.2: SAC default hyperparameters for Simple MiniGrid environments.
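The discretization of the proposed goal is straightforward; below is a minimal sketch, assuming the SAC policy outputs a tanh-squashed 2D action in [-1, 1] that is rescaled to grid units before applying the floor function. The select_action method and the rescaling are illustrative assumptions, not the exact code used in this work.

    import numpy as np

    def propose_discrete_goal(sac_policy, state, final_goal, grid_size=15):
        # High-level SAC outputs a continuous 2D action (assumed in [-1, 1]);
        # rescale it to grid units and floor it to obtain a discrete cell.
        a = sac_policy.select_action(np.concatenate([state, final_goal]))
        cell = np.floor((a + 1.0) / 2.0 * grid_size).astype(int)
        return np.clip(cell, 0, grid_size - 1)   # keep the subgoal inside the grid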

The is_reachable function has been implemented with a buffer of size 1e5 and an exact look-up call. For the FourRooms environment, we have implemented our solution to the incomplete goal space problem using a buffer of 200 dimensions, which is exactly the number of non-forbidden states in the 15x15 version of the environment.
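As an illustration, the following is a minimal sketch of one possible tabular implementation of such an exact look-up, storing only the "positive" state-goal pairs observed to be reachable within the low horizon; the class and method names are ours, not the thesis code.

    class TabularIsReachable:
        """Exact look-up over discrete state-goal pairs stored as a set."""
        def __init__(self, capacity=int(1e5)):
            self.capacity = capacity
            self.pairs = set()

        def add(self, state, goal):
            # state and goal are assumed to be discrete grid coordinates, e.g. (x, y)
            if len(self.pairs) < self.capacity:
                self.pairs.add((tuple(state), tuple(goal)))

        def __call__(self, state, goal):
            return (tuple(state), tuple(goal)) in self.pairs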

Finally, we have used the learning parameters shown in Table A.3 to train all our policies.


The length of the training stage has been 5000 episodes for the Empty version of the environment and 25000 for the FourRooms variant.

Hyperparameter                          Value
LRGP's low horizon Hl                   4
LRGP's limit of subgoal proposals Hh    15
Network updates per episode             5
Batch size                              256
ε-greedy (DDQN and is_reachable)        as in Figure A.1

Table A.3: Learning parameters for Simple MiniGrid environments.


Figure A.1: ε-greedy exploration.


Pendulum Environment

In the Pendulum environment we have used SAC for all our experiments, with the hyperparameters shown in Table A.4.

Hyperparameter                            Value
Discount factor γ                         0.2
Soft-update parameter τ                   0.005
Entropy weight α                          1
Hidden dimensions (for all networks)      (128, 128, 128, 128)
Optimizer (for all networks)              AdamW with default parameters
Learning rate (for all networks)          3e-4
Buffer size (non-hierarchical and low)    5e5
Buffer size (high)                        1e6

Table A.4: SAC default hyperparameters for the Pendulum environment.

Additionally, we have used the learning parameters presented in Table A.5.

Hyperparameter                            Value
Episodes                                  10000
LRGP's low horizon Hl                     15
LRGP's limit of subgoal proposals Hh      15
Tolerance on the angle ε_θ                0.15 rad
Tolerance on the angular velocity ε_θ̇     1 rad/s
Network updates per episode               5
Batch size                                256
ε-greedy (is_reachable)                   as in Figure A.1

Table A.5: Learning parameters for the Pendulum environment.

Finally, the is_reachable function has been implemented using a buffer of capacity 1e5 and a look-up call that checks whether there is any state-goal pair in the buffer such that both the state and the goal are close to the query.

To implement this comparison, let $(\theta_s^q, \dot\theta_s^q, \theta_g^q, \dot\theta_g^q)$ be the queried state-goal pair (with the state already converted into the goal space), and $(\theta_s^i, \dot\theta_s^i, \theta_g^i, \dot\theta_g^i)$ the $i$-th entry in the buffer. Then, using a normalization function that maps any angle to the range $[-\pi, \pi)$, is_reachable is defined as in Equation A.1, using the tolerances presented in Table A.5.

\[
\text{is\_reachable} =
\begin{cases}
\text{True}, & \text{if } \exists\, i \text{ s.t. } |\mathrm{normalize}(\theta_s^q - \theta_s^i)| < \epsilon_\theta \;\wedge\; |\dot\theta_s^q - \dot\theta_s^i| < \epsilon_{\dot\theta} \\
& \quad\;\;\wedge\; |\mathrm{normalize}(\theta_g^q - \theta_g^i)| < \epsilon_\theta \;\wedge\; |\dot\theta_g^q - \dot\theta_g^i| < \epsilon_{\dot\theta} \\
\text{False}, & \text{otherwise}
\end{cases}
\tag{A.1}
\]
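For concreteness, the following is a minimal Python sketch of this tolerance-based look-up, directly following Equation A.1. The buffer is assumed to be an iterable of (θ_s, θ̇_s, θ_g, θ̇_g) tuples, and the function names are illustrative.

    import math

    EPS_THETA = 0.15       # angle tolerance in rad (Table A.5)
    EPS_THETA_DOT = 1.0    # angular-velocity tolerance in rad/s (Table A.5)

    def normalize(angle):
        """Map any angle to the range [-pi, pi)."""
        return (angle + math.pi) % (2 * math.pi) - math.pi

    def is_reachable(query, buffer):
        # query = (theta_s, theta_dot_s, theta_g, theta_dot_g), with the state
        # already mapped into the goal space, as in Equation A.1.
        ts_q, tds_q, tg_q, tdg_q = query
        for ts_i, tds_i, tg_i, tdg_i in buffer:
            if (abs(normalize(ts_q - ts_i)) < EPS_THETA
                    and abs(tds_q - tds_i) < EPS_THETA_DOT
                    and abs(normalize(tg_q - tg_i)) < EPS_THETA
                    and abs(tdg_q - tdg_i) < EPS_THETA_DOT):
                return True
        return False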