
Generating Automatic Curricula via Self-Supervised Active Domain Randomization

Sharath Chandra Raparthy*, Mila, Université de Montréal

Bhairav Mehta, Mila, Université de Montréal

Florian Golemo, Mila, Université de Montréal; Element AI

Liam Paull, Université de Montréal, Mila; CIFAR AI Chair

Abstract: Goal-directed Reinforcement Learning (RL) traditionally considers an agent interacting with an environment, prescribing a real-valued reward to the agent proportional to the completion of some goal. Goal-directed RL has seen large gains in sample efficiency, due to the ease of reusing or generating new experience by proposing goals. One approach, self-play, allows an agent to "play" against itself by alternately setting and accomplishing goals, creating a learned curriculum through which an agent can learn to accomplish progressively more difficult goals. However, self-play has been limited to goal curriculum learning, that is, learning progressively harder goals within a single environment. Recent work on robotic agents has shown that varying the environment during training, for example with domain randomization, leads to more robust transfer. As a result, we extend the self-play framework to jointly learn a goal and environment curriculum, leading to an approach that learns the most fruitful domain randomization strategy with self-play. Our method, Self-Supervised Active Domain Randomization (SS-ADR), generates a coupled goal-task curriculum, where agents learn through progressively more difficult tasks and environment variations. By encouraging the agent to try tasks that are just outside of its current capabilities, SS-ADR builds a domain randomization curriculum that enables state-of-the-art results on various sim2real transfer tasks. Our results show that a curriculum co-evolving the environment difficulty together with the difficulty of goals set in each environment provides practical benefits in the goal-directed tasks tested.

1 Introduction

Reinforcement learning offers a way for autonomous agents to learn behaviors via interaction with an environment. A central, often practical, problem of reinforcement learning is reward engineering: the process of creating reward functions that, when used during optimization, generate desirable behaviors. Reward design is an arduous, trial-and-error process, and provides experimenters little insight into how a particular reward generated certain behaviors. The reward design process stands in the way of reinforcement learning becoming a realistic contender for true artificial intelligence, with fundamental issues spawning entire new fields of study such as reward hacking [1], AI safety [2], and AI alignment [3].

Reward design can also induce training difficulties, as certain reward functions may be too difficult for agents to optimize [4]. Curriculum learning [5] seeks to ease the optimization process by allowing agents to solve progressively more difficult variants of a task. However, traditional curriculum learning often requires heuristics that control when to advance agents between tasks, simply moving the reward design issue up a layer of abstraction.

An alternative to the standard reward design and curriculum learning paradigms is to have agents reward themselves. Numerous works explore the field of intrinsic motivation, using metrics such as surprisal [6, 7], Bayesian information gain [8], and curiosity [9] to improve agent exploration. Automatic curricula naturally arise in this scenario, as an agent is required to push its own frontiers in order to generate further intrinsic rewards.

* Correspondence to [email protected]

arXiv:2002.07911v2 [cs.LG] 26 Oct 2020

Figure 1: Self-Supervised Active Domain Randomization (SS-ADR) learns robust policies (h) via self-play by co-evolving a goal curriculum, set by Alice (e), alongside an environment curriculum, set by the SVPG particles (j). The randomized environments (c) and goals (g) slowly increase in difficulty, leading to strong zero-shot transfer on all environments tested.

Self-play is a popular intrinsic motivation method that allows for training of agents without rewards, particularly in goal-directed reinforcement learning. Self-play pits two agents against each other in a standard Markov Decision Process (MDP) framework. These two agents are often time-separated "copies" of the same agent. Self-play rewards an agent for making progress against the other, generating a game-theoretic, automatic curriculum as each agent tries to get the upper hand.

While self-play has proved itself in non-equilibrium, two-player games [10, 11], one underexplored application of self-play is task selection, i.e., curriculum learning in the task space rather than the goal space. Many problems in reinforcement learning need not focus on such a problem, as task curriculum design is only relevant in transfer learning scenarios.

One of the most exciting applications of transfer learning is the sim2real problem in robotic learning. In robotic RL, policies trained purely in simulation have proved difficult to transfer to the real world, a problem known as the "reality gap" [12]. One leading approach for this sim2real transfer is Domain Randomization (DR) [13], where a simulator's parameters are perturbed, generating a space of similar yet distinct environments, all of which an agent tries to solve before transferring to a real robot. Similar to the issue of which goals to show an agent in goal curriculum learning, the question once again becomes which environments to show the agent. Recently, Mehta et al. [14] empirically showed that not all generated environments are equally useful for learning, leading to Active Domain Randomization (ADR). ADR defines a curriculum learning problem in the environment randomization space, using learned rewards to search for an optimal curriculum.

In this work, we propose the combination of ADR and asymmetric self-play. We formulate our problem as a bilevel optimization: in the inner loop, we maximize the expected return over the given environment-goal pairs, and in the outer loop, we maximize a self-play-based metric to enforce a naturally growing environment-goal curriculum. We show that we can co-evolve curricula in both goal and environment spaces, using only a single self-supervised reward. This bilevel formulation induces robustness in policy performance, further reducing variance compared to other curriculum learning methods. We show that this coupling generates strong robotic policies in all environments tested, even across multiple real-world settings.
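To make the structure of this bilevel problem concrete, one illustrative way to write it (a sketch in our own notation, not a formal statement taken from the paper) is

$$\max_{\phi,\,\pi_a}\; \mathbb{E}_{\xi \sim \mu_\phi,\; g \sim \pi_a}\big[\, r_a(g, \xi; \pi_b^*) \,\big] \quad \text{s.t.} \quad \pi_b^* \in \arg\max_{\pi_b}\; \mathbb{E}_{\xi \sim \mu_\phi,\; g \sim \pi_a}\big[\, R(\pi_b; g, \xi) \,\big],$$

where the inner problem trains the goal-conditioned agent (Bob, $\pi_b$) on the sampled environment-goal pairs, and the outer problem adapts the curriculum generators (the environment sampler $\mu_\phi$ and the goal setter Alice, $\pi_a$) to maximize the self-play metric $r_a$ introduced in Section 2.2.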

2 Background

2.1 Reinforcement Learning

We consider a Markov Decision Process (MDP), M, defined by (S, A, T, R, γ), where S is the state space, A is the action space, T : S × A → S is the transition function, R : S × A → R is the reward function, and γ is the discount factor. Formally, the agent receives a state st ∈ S at timestep t and takes an action at based on the policy πθ. The environment provides a reward rt and the agent transitions to the next state st+1. The goal of RL is to find a policy πθ which maximizes the expected return from each state st, where the return is given by $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. Goal-directed RL often appends a goal (some g in a goal space G) to the state, and requires the goal when evaluating the reward function (i.e., R : S × G × A → R).
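For concreteness, here is a generic Python sketch (not taken from the paper's codebase) of the discounted return defined above, computed backwards over a finite trajectory:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return R_t = sum_k gamma^k * r_{t+k} of a finite reward sequence."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# A reward received two steps ahead is discounted twice: 0.99^2 ~= 0.9801.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.99))
```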

2.2 Self-Play

We consider the self-play framework proposed by Sukhbaatar et al. [11], which proposes an unsupervised way of learning to explore the environment. In this method, the agent has two brains: Alice, which sets a task, and Bob, which finishes the assigned task. This dichotomy is quantified in Equations 1 and 2:

ra = υ · max(0, tb − ta)    (1)

rb = −υ · tb    (2)

where ta is the time taken by Alice to set a task, tb is the time taken by Bob to finish the task set by Alice, and υ is a scaling factor. This dual-reward design, ra for Alice and rb for Bob, allows self-regulating feedback between both agents, as Alice is rewarded most heavily for picking tasks that are just beyond Bob's horizon: Alice tries to propose tasks that are easy for her, yet difficult for Bob. This evolution of tasks forces the two agents to automatically construct a curriculum for exploration.
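Equations (1) and (2) are simple enough to state directly in code. The sketch below is illustrative only; the default scaling factor matches the υ = 0.2 listed in the appendix, but the function itself is not the authors' implementation:

```python
def self_play_rewards(t_a, t_b, upsilon=0.2):
    """Asymmetric self-play rewards of Eq. (1)-(2).

    Alice is rewarded only when Bob needs more time than she did (t_b > t_a);
    Bob is penalized for every step he takes, pushing him to finish quickly."""
    r_a = upsilon * max(0.0, t_b - t_a)   # Eq. (1)
    r_b = -upsilon * t_b                  # Eq. (2)
    return r_a, r_b

# Example: Alice finished in 10 steps, Bob needed 35,
# so Alice gets 0.2 * 25 and Bob gets -0.2 * 35.
print(self_play_rewards(10, 35))
```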

2.3 Domain Randomization

Domain randomization [15, 13] is a technique popular in robotic domain transfer, particularly in the zero-shot transfer² problem setup, allowing robotic agents to train entirely in simulation. Domain randomization randomizes numerical parameters of a simulated robotic task (e.g., friction, gravity) during training such that the agent cannot exploit the approximate dynamics within a simulator. The agent, trained on a host of randomized, simulated environments, ideally learns a policy that fits only the task, not the environment, enabling it to generalize well when it is transferred to the real robot.

Domain randomization requires the explicit definition of a set of Nrand simulation parameters to vary, as well as their bounded domain (denoted as a randomization space Ξ ⊂ R^Nrand). During every episode, a set of parameters ξ ∈ Ξ is sampled to generate a new MDP when passed through the simulator S³. In the standard domain randomization formulation, the parameters at each episode are sampled uniformly from the randomization space throughout training.
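In code, standard (uniform) domain randomization amounts to drawing ξ uniformly from Ξ once per episode. The sketch below uses made-up parameter names and bounds rather than the environments' actual randomization ranges:

```python
import numpy as np

# Hypothetical bounds for a randomization space with N_rand = 2 parameters.
RANDOMIZATION_BOUNDS = {
    "puck_friction": (0.1, 1.0),
    "joint_torque":  (0.5, 2.0),
}

def sample_udr(bounds, rng):
    """Uniform domain randomization: xi ~ U(low, high), independently per parameter."""
    return {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}

rng = np.random.default_rng(0)
xi = sample_udr(RANDOMIZATION_BOUNDS, rng)
# xi would then be passed to the simulator S(xi) to instantiate a new MDP for the episode.
```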

2.4 Active Domain Randomization

ADR [14] builds off the assumption that DR leads to inefficient learning when sampling uniformly from the randomization space. ADR poses the environment curriculum learning problem, the search over the randomization space, as a reinforcement learning problem for the most informative environment instances. In ADR, the environment sampler (the policy) is parameterized with Stein Variational Policy Gradient (SVPG) [16]. ADR learns to control a set of N particles {µφi}, which directly correspond to which environments are shown to the agent. Each particle has its own set of parameters, φi, which are trained with the update described in Equation 3.

The use of SVPG allows the particles to undergo interacting updates, which include a term that maximizes return and a term that induces diversity between particles (and, correspondingly, between environments shown to the robotic agent). The full update can be written as:

$$\mu_{\phi_i} \leftarrow \mu_{\phi_i} + \frac{\epsilon}{N} \sum_{j=1}^{N} \Big[ \nabla_{\mu_{\phi_j}} J(\mu_{\phi_j})\, k(\mu_{\phi_i}, \mu_{\phi_j}) + \alpha \nabla_{\mu_{\phi_j}} k(\mu_{\phi_i}, \mu_{\phi_j}) \Big], \qquad (3)$$

where $J(\mu_{\phi_i})$ denotes the sampled return from particle i, $k(\cdot, \cdot)$ is a kernel that calculates similarity between two particles and forces similar points in the parameter space away from each other [17], and the learning rate ε and temperature α are hyperparameters.
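The sketch below spells out one interacting update of Equation (3) for a small set of particles. It assumes an RBF kernel (a common choice in SVGD-style methods [17]) and externally supplied return-gradient estimates; it is illustrative, not the authors' implementation:

```python
import numpy as np

def rbf_kernel(x, y, h=1.0):
    """Similarity kernel k(.,.) between two particle positions (RBF, bandwidth h)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * h ** 2))

def svpg_step(particles, grad_J, eps=0.03, alpha=0.1, h=1.0):
    """One interacting update of Eq. (3).

    particles: list of particle positions mu_phi_i (numpy arrays).
    grad_J:    list of estimates of grad_{mu_phi_j} J(mu_phi_j), same shapes."""
    n = len(particles)
    new_particles = []
    for i in range(n):
        update = np.zeros_like(particles[i])
        for j in range(n):
            k = rbf_kernel(particles[i], particles[j], h)
            # Gradient of the RBF kernel w.r.t. mu_phi_j: k * (mu_i - mu_j) / h^2
            # (this is the repulsive, diversity-inducing term).
            grad_k = k * (particles[i] - particles[j]) / (h ** 2)
            update += grad_J[j] * k + alpha * grad_k
        new_particles.append(particles[i] + (eps / n) * update)
    return new_particles
```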

The particles navigate the randomization space, looking for the simulation parameters that may generate the most utility. The particles are trained using learned discriminator-based rewards rD, similar to the formulation in [18]. The discriminator D (with trainable parameters ψ) attempts to measure the discrepancies between trajectories from the reference environment Eref and randomized environment instances Ei (corresponding to the i-th particle's parameter proposal ξi).

² Zero-shot transfer does not allow an agent to take extra optimization steps in the testing distribution.
³ Domain randomization generally changes the transition function T by editing dynamics parameters of the simulation.

Figure 2: (a, b) ErgoReacher is a 4-DoF robotic arm, shown in both simulation and real-world environments. The goal is to move the end effector to several imaginary goals (pink dot) as fast as possible, actuated with the four motors. (c, d) ErgoPusher is a 3-DoF robotic arm, with the goal of bringing a secondary object to an imaginary goal (pink dot).

The reward coming from the discriminator for the SVPG particles is defined as:

rD = log Dψ(y|τi ∼ π(·;Ei)) (4)

where y is a boolean variable denoting the source (randomized or reference) and τi is the trajectory generated inside the randomized instance Ei.

This reward formulation drives ADR to find environments which are difficult for the current agent policy to solve, as measured via learnable discrepancies between trajectories generated by the agent policy in a reference (and generally easier) environment and in a proposed randomized instance.
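A minimal sketch of this learned reward is given below. The discriminator architecture, the trajectory featurization, and the helper names are assumptions made for illustration; only the reward definition itself mirrors Equation (4):

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Sketch of D_psi: maps a (flattened) trajectory feature to the probability
    that the trajectory came from a randomized environment instance."""
    def __init__(self, traj_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, traj):
        return torch.sigmoid(self.net(traj))

def discriminator_reward(disc, traj):
    """Eq. (4): r_D = log D_psi(y = randomized | tau_i); larger when the randomized
    trajectory is easy to tell apart from reference-environment behavior."""
    with torch.no_grad():
        p_randomized = disc(traj).clamp_min(1e-8)
    return torch.log(p_randomized)

# Example with a made-up 64-dimensional trajectory feature:
disc = TrajectoryDiscriminator(traj_dim=64)
tau = torch.randn(64)
print(discriminator_reward(disc, tau))
```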

3 Method

ADR allows for curriculum learning in an environment space: given some black-box agent, trajectories are used to differentiate between the difficulty of environments, regardless of the goal set in the particular environment instance. In goal-directed RL, the goal itself may be the difference between a useful episode and a useless one. In particular, certain goals within the same environment instance may vary in difficulty; on the other hand, the same goal may vary in terms of reachability in different environments. ADR provides a curriculum in environment space, but with goal-directed environments, we have a new potential curriculum to consider: the goal curriculum.

In order to build proficient, generalizable agents, we need to evolve a curriculum in goal space alongside a curriculum in environment space; evolving them independently may lead to degenerate solutions: the algorithm may learn to propose impossible goals with any environment, or impossible environments (i.e., with unrealistic physical parameters) with any goal. As shown in [11], self-play provides a way for policies to learn without environment rewards, but when used only for goal curricula, it requires interleaving self-play trajectories alongside reward-evaluated rollouts for best performance. Our work investigates the following question:

Can we co-evolve a goal and environment curriculum, both with the same self-supervised learning signal?

To this end, we propose Self-Supervised Active Domain Randomization (SS-ADR), summarized in Algorithm 1. SS-ADR learns a curriculum in the joint goal-environment space using only a single self-supervised reward signal, producing strong and generalizable policies. SS-ADR can be seen in its entirety in Algorithms 1 and 2 and Figure 1, and is described qualitatively below.

SS-ADR learns two additional policies compared to the standard ADR formulation: Alice and Bob. Alice, the goal setter, operates in the environment and eventually signals a STOP action.


Algorithm 1: Self-Supervised ADR
Input: Ξ: randomization space, S: simulator (S : Ξ → E), ξref: reference parameters
Initialize πa: Alice's acting policy, πsa: Alice's stopping policy, πb: Bob's acting policy, µφ: SVPG particles
for Tmax timesteps do
    πa ← old Bob policy πb
    Eref ← S(ξref)
    Observe the initial state s0
    ta, tb = Self-Play(πa, πsa, πb, µφ)
    Compute Alice's reward ra using Eq. (1)
    Update πsa with ra (goal curriculum)
    Update the SVPG particles with ra using Eq. (3) (environment curriculum)
    Update πa and πb using environment rewards
end for

Algorithm 2: Self-Play
Input: πa, πsa, πb, µφ
ta, tb ← 0
while the stop action a_s is not STOP do
    ta ← ta + 1
    Observe the current state s_ta
    a_s ← πsa(s0, s_ta; Eref)
    a_ta ∼ πa(s0, s_ta; Eref)
end while
Bob's target state: s* ← s_ta
Sample environment ξrand ∼ µφ(·)
Erand ← S(ξrand)
while Bob not done do
    tb ← tb + 1
    Observe the current state s_tb
    a_tb ∼ πb(s_tb, s*; Erand)
end while
Return: ta, tb

The environment is then reset to the starting state, and Bob uses his policy to attempt to achieve the goal Alice has set. Bob sees Alice's goal state appended to the current state, while Alice sees the current state appended to the initial state. Alice and Bob's acting policies are trained via DDPG [19] and environment rewards, while Alice's STOP-signalling policy is trained with vanilla policy gradients using Equation 1.
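A vanilla policy-gradient (REINFORCE-style) update for the stopping policy can be sketched as follows, where every STOP/continue decision in the episode is reinforced by the single terminal reward ra from Equation (1); this is an illustrative sketch, not the authors' training code:

```python
import torch

def update_stopping_policy(optimizer, log_probs, r_a):
    """One vanilla policy-gradient step for Alice's stopping policy.

    log_probs: list of log pi_sa(a_t | s_t) tensors collected while Alice acted.
    r_a:       scalar self-play reward from Eq. (1)."""
    loss = -r_a * torch.stack(log_probs).sum()  # maximize r_a * sum of log-probs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```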

To co-evolve the environment curriculum alongside the goal curriculum, we introduce the randomization aspects of our approach. Before Bob operates in the environment, the environment is randomized (e.g., object frictions are perturbed or robot torques are changed) according to the values set by the ADR environment sampler. Alice, who operates in the reference environment Eref (an environment given as the "default"), tries to find goals that are easy in the reference environment (Eref) but difficult in the randomized ones (Erand).

To enforce the co-evolution, we train the ADR particles with Alice's reward (i.e., Equation 1 is evaluated separately for each randomization prescribed by the individual ADR particles). As this reward depends on the time spent by both Alice and Bob (denoted by ta and tb in Equation 1), the curricula in both goal and environment space evolve in difficulty simultaneously.
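Conceptually, the same scalar reward is reused for both curricula. The sketch below (with assumed helper objects and `update(...)` methods, not the released code) shows Equation (1) being evaluated once per particle-proposed environment and then driving both the stopping-policy update and the SVPG update:

```python
def co_evolve_step(stop_policy, svpg_particles, t_a, bob_times_per_particle, upsilon=0.2):
    """t_a: Alice's completion time in the reference environment.
    bob_times_per_particle: Bob's completion times, one per particle's randomized environment."""
    particle_rewards = [upsilon * max(0.0, t_b - t_a)        # Eq. (1), per randomization
                        for t_b in bob_times_per_particle]
    # Aggregating across particles for the stopping policy is an assumption of this sketch.
    stop_policy.update(sum(particle_rewards) / len(particle_rewards))  # goal curriculum
    svpg_particles.update(particle_rewards)                            # environment curriculum, Eq. (3)
    return particle_rewards
```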

The reward structure forces Alice to focus on horizons: her reward is maximized when she can do something quickly that Bob cannot do at all. Considering the synchrony of policy updates for each agent, we presume that the goal set by Alice is not far out of Bob's current reach. In addition, as Bob's policy is trained with the environment reward but operates in a randomized environment, Alice's acting policy is a time-delayed copy of Bob's policy. This, along with the co-evolution of the curricula, greatly improves robotic transfer.

4 Results

In order to evaluate our method, we perform various experiments on continuous control robotic tasks, both in simulation and the real world. We use the following environments from [20] and [14]:

• ErgoReacher: A 4DoF robot where the end-effector has to reach the goal (Figure 2(a, b))

• ErgoPusher: A similar 3DoF robot that has to push the puck to the goal (Figure 2(c, d))

For the sim-to-real experiments, we recreate the simulation environment on the real Poppy Ergo Jr. robots [21] shown in Figure 2. All simulation experiments are run across 4 random seeds. We evaluate the policy on (a) the default environment and (b) an intuitively hard environment which lies outside the training domain, every 5000 timesteps, resulting in 200 evaluations in total over 1 million timesteps.


Figure 3: (a) On the default (in-distribution) environment, both the self-play method, shown as Unsupervised-Default, and SS-ADR show strong performance. Even on this easier task, we see issues with UDR, which is unstable in both performance and convergence throughout training. (b) On the intuitively hard environment, only SS-ADR converges with low variance and strong performance, while the other baselines struggle in terms of both variance and performance. Shown is final distance to goal; lower is better.

We compare our method against two different baselines:

• Uniform Domain Randomization (UDR): As our first baseline, we use UDR, which generates a multitude of tasks by uniformly sampling parameters from a given range. The environment space generated by UDR is unstructured and there is no intuitive curriculum. The goal stays constant throughout episodes.

• Unsupervised Default: As our second baseline, we use the self-play framework to generate a naturally growing curriculum of goals. Here, only the goal curriculum (and not the coupled environment-goal curriculum) is considered.

4.1 Simulation Experiments

We evaluate SS-ADR's performance on the ErgoPusher and ErgoReacher tasks. In the ErgoPusher task, we vary the puck friction (Nrand = 1). In order to create an intuitively hard environment, we lower the value of this parameter, which creates an "icy" surface, ensuring that the puck needs to be hit carefully to complete the difficult task.

For the ErgoReacher task, we increase the randomization dimensions (Nrand = 8), making it hard to intuitively infer the environment complexity. However, for demonstration purposes, we create an intuitively hard environment by assigning extremely low torques and gains to each joint. We adapt the parameter ranges from Mehta et al. [14].

From Figures 3(a) and 4(a), we can see that both Unsupervised-Default and SS-ADR significantly outperform UDR, both in terms of variance and average final distance. This highlights that the uniform sampling in UDR can lead to unpredictable and inconsistent behaviour. To see the benefits of the coupled environment-goal curriculum over a goal-only curriculum, we evaluate on the intuitively hard environments (outside of the training parameter distribution, as described above). From Figures 3(b) and 4(b), we can see that our method, SS-ADR, which co-evolves the environment and goal curricula, outperforms Unsupervised-Default. This shows that the coupled curriculum enables stronger generalization than the standard self-play formulation.

Figure 4: (a) In the ErgoPusher environment, we see the same narrative as in Figure 3: UDR struggles even in the easy, in-distribution environment, while both self-play methods converge quickly with low variance. (b) Both self-play methods show higher variance on the intuitively hard environment, although SS-ADR has better overall performance. UDR, as expected, still struggles on the held-out environment. Shown is final distance to goal; lower is better.

4.2 Sim-to-Real Transfer Experiments

In this section, we explore the zero-shot transfer performance of the policies trained in the simulator. To test our policies on real robots, we take the four independently trained policies of both ErgoReacher and ErgoPusher and deploy them on the real robots without any fine-tuning. We roll out each policy per seed for 25 independent trials and calculate the average final distance across these 25 trials.


To evaluate generalization, we change the task definitions (and therefore the MDPs): the puck friction (across low, high, and standard frictions in a box-pushing environment) in the case of ErgoPusher, and the joint torques (across a wide spectrum of choices) on ErgoReacher. In general, lower values in both settings correspond to harder tasks, due to the construction of the robot and the intrinsic difficulty of the task itself.

From Figures 5(a) and 5(b), we see that SS-ADR outperforms both baselines in terms of accuracy and consistency, leading to robust performance across all environment variants tested. Zero-shot policy transfer is a difficult and dangerous task, meaning that low spread (i.e., consistent performance) is required for deployed robotic RL agents. As we can see in the plots, simulation alone is not the answer (leading to the poor performance of UDR), while self-play alone also sometimes fails to generate curricula that allow for strong, generalizable policies. However, by utilizing both methods together and co-evolving the two curriculum spaces, we see multiplicative benefits over using curriculum learning in each space separately.

Figure 5: (a) On various instantiations of the real robot (parameterized by motor torques), SS-ADR outperforms UDR in terms of performance (lower is better) and spread, and is consistently on par with or better than Unsupervised-Default. (b) The differences between the methods are clearest in the Pusher environment, where SS-ADR outperforms all other baselines. Lower is better.

4.3 Self-calibration of SS-ADR

In domain randomization, picking the ranges within which to randomize the parameters is often a trial-and-error process. These ranges play an important role in policy optimization and, if not chosen properly, can lead to optimization difficulties. In this section, we discuss the benefits of SS-ADR as they relate to self-calibration.

We train ErgoPusher on calibrated and uncalibrated parameter ranges (the latter including impossible-to-solve MDPs) and plot how environments are sampled. In Figure 6(a), we see that with a calibrated range, where the bounds of the randomization space were carefully chosen by iterative adjustment, both algorithms sample approximately equally in the "harder" task ranges (as seen in the inset plot), although SS-ADR, unlike UDR, learns a curriculum. In Figure 6(b), we see the benefits of SS-ADR over UDR when using uncalibrated ranges. Here, UDR samples values that generate physically unstable environments (in this experiment, approximately any environment with a randomization coefficient less than 0.05), while SS-ADR is able to adapt and stay away from the unsolvable tasks at the extremely low end of the randomization spectrum.

Figure 6: While UDR and SS-ADR both do well with calibrated ranges (a), with uncalibrated ranges (b), SS-ADR is the only algorithm that stays away from the physically unstable environments at the lower end of the randomization coefficient. UDR, as a static sampling algorithm, mixes samples from these environments with the rest of the range, potentially hindering training.

5 Related Work

Curriculum Learning: The idea of curriculum learning was first proposed by Elman [22], who showed that a curriculum of tasks is beneficial in language processing. Later, Bengio et al. [5] extended this idea to various vision and language tasks, showing faster learning and better convergence. While many of these approaches require some human specification, automatic task generation has recently gained interest in the RL community. This body of work includes automatic curricula produced by adversarial training [23], reverse curriculum generation [24, 25], and teacher-student curriculum learning [26, 27]. However, many papers focus on distinct tasks rather than continuous task spaces and use state- or reward-based "progress" heuristics. In this work, we focus on learning the curriculum in a continuous joint task space (environments and goals) simultaneously.

Self-Play: Curriculum learning has also been studied through the lens of self-play. Self-play has been successfully applied to many games such as checkers [28] and Go [10]. Recently, an interesting asymmetric self-play strategy has been proposed [11], which models a game between two variants of the same agent, Alice and Bob, enabling exploration of the environment without requiring any extrinsic reward. However, in this work, we use the self-play framework for learning a curriculum of goals, rather than for its traditional exploration-driven use case.

Sim2Real Transfer: Despite the successes of deep RL, training RL algorithms on physical robots remains a difficult problem and is often impractical due to safety concerns. Simulators have played a huge role in transferring policies to real robots safely, and many different methods have been proposed to this end [20, 29, 30]. DR [13] is one of the most popular methods, generating a multitude of environment instances by uniformly sampling the environment parameters from a fixed range. However, [14] showed that DR suffers from high variance due to an unstructured task space, and instead proposed a novel algorithm that learns to sample the most informative environment instances.

6 Conclusion

In this work, we proposed Self-Supervised Active Domain Randomization (SS-ADR), which co-evolves curricula in a joint goal-environment space to create strong, robust policies that transfer zero-shot onto real-world robots. Our method depends solely on a single self-supervised reward signal, obtained through self-play, to learn this joint curriculum. SS-ADR is a feasible approach for training new policies in goal-directed RL settings, and it outperforms all baselines with low variance in both the simulated and real variants tested.

7 Acknowledgements

The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Québec (FQRNT), Calcul Québec, Compute Canada, the Canada Research Chairs, the Canadian Institute for Advanced Research (CIFAR), and Nvidia for donating a DGX-1 for computation. BM would like to thank IVADO for financial support. FG would like to thank MITACS for their funding and support.

References

[1] J. Clark and D. Amodei. Faulty reward functions, 2020. URL https://openai.com/blog/faulty-reward-functions/.
[2] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016. URL http://arxiv.org/abs/1606.06565.
[3] I. Gabriel. Artificial intelligence, values and alignment, 2020.
[4] T. Schaul, D. Borsa, J. Modayil, and R. Pascanu. Ray interference: a source of plateaus in deep reinforcement learning, 2019.
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi:10.1145/1553374.1553380. URL http://doi.acm.org/10.1145/1553374.1553380.
[6] A. Barto, M. Mirolli, and G. Baldassarre. Novelty or surprise? Frontiers in Psychology, 4:907, 2013. ISSN 1664-1078. doi:10.3389/fpsyg.2013.00907. URL https://www.frontiersin.org/article/10.3389/fpsyg.2013.00907.
[7] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation, 2016.
[8] P. Shyam, W. Jaskowski, and F. Gomez. Model-based active exploration, 2018.
[9] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning, 2018.
[10] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), Jan. 2016. doi:10.1038/nature16961.
[11] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. CoRR, abs/1703.05407, 2017. URL http://arxiv.org/abs/1703.05407.
[12] N. Jakobi, P. Husbands, and I. Harvey. Noise and the reality gap: The use of simulation in evolutionary robotics. In European Conference on Artificial Life. Springer, 1995.
[13] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017.
[14] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull. Active domain randomization. In L. P. Kaelbling, D. Kragic, and K. Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 1162–1176. PMLR, 30 Oct–01 Nov 2020. URL http://proceedings.mlr.press/v100/mehta20a.html.
[15] F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. Robotics: Science and Systems XIII, Jul 2017. doi:10.15607/rss.2017.xiii.034. URL http://dx.doi.org/10.15607/RSS.2017.XIII.034.
[16] Y. Liu, P. Ramachandran, Q. Liu, and J. Peng. Stein variational policy gradient. CoRR, abs/1704.02399, 2017. URL http://arxiv.org/abs/1704.02399.
[17] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 2378–2386, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
[18] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. CoRR, abs/1802.06070, 2018. URL http://arxiv.org/abs/1802.06070.
[19] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/silver14.html.
[20] F. Golemo, A. A. Taiga, A. Courville, and P.-Y. Oudeyer. Sim-to-real transfer with neural-augmented robot simulation. In A. Billard, A. Dragan, J. Peters, and J. Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research. PMLR, 29–31 Oct 2018. URL http://proceedings.mlr.press/v87/golemo18a.html.
[21] M. Lapeyre. Poppy: open-source, 3D printed and fully-modular robotic platform for science, art and education. Nov. 2014.
[22] J. L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1), 1993.
[23] D. Held, X. Geng, C. Florensa, and P. Abbeel. Automatic goal generation for reinforcement learning agents. CoRR, abs/1705.06366, 2017. URL http://arxiv.org/abs/1705.06366.
[24] C. Florensa, D. Held, M. Wulfmeier, and P. Abbeel. Reverse curriculum generation for reinforcement learning. CoRR, abs/1707.05300, 2017. URL http://arxiv.org/abs/1707.05300.
[25] S. Forestier, Y. Mollard, and P. Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. CoRR, abs/1708.02190, 2017. URL http://arxiv.org/abs/1708.02190.
[26] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman. Teacher-student curriculum learning. CoRR, abs/1707.00183, 2017. URL http://arxiv.org/abs/1707.00183.
[27] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. CoRR, abs/1704.03003, 2017. URL http://arxiv.org/abs/1704.03003.
[28] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 1959.
[29] A. Prakash, S. Boochoon, M. Brophy, D. Acuna, E. Cameracci, G. State, O. Shapira, and S. Birchfield. Structured domain randomization: Bridging the reality gap by context-aware synthetic data. CoRR, abs/1810.10093, 2018. URL http://arxiv.org/abs/1810.10093.
[30] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. D. Ratliff, and D. Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. CoRR, abs/1810.05687, 2018. URL http://arxiv.org/abs/1810.05687.
[31] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. CoRR, abs/1802.09477, 2018. URL http://arxiv.org/abs/1802.09477.
[32] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, Dec. 2014.

A Variance Reduction with Self-Play - ADR vs SS-ADR

To investigate the stability of SS-ADR, we benchmark Active Domain Randomization in the ErgoReacher environment and plot how the policy performance (the final distance to the goal) evolves over time. In Fig. 7, we can see that SS-ADR is more consistent and shows lower variance across seeds compared to ADR.

Figure 7: Differences in policy performance between ADR and SS-ADR for different fixed seeds.

B Real robot comparison with ADR

We also compared the real-robot results of SS-ADR to the results reported in [14]. We could not perfectly replicate the conditions of [14], as the authors used techniques similar to the real-robot evaluations of [13]: low friction consisted of manually applying lubricant to a table, and high friction consisted of wrapping an object with paper. We were unable to enter the lab to re-benchmark the two algorithms more consistently, so we report our preliminary results in Fig. 8.

Figure 8: (a) On various instantiations of the real robot (parameterized by motor torques), SS-ADR outperforms UDR in terms of performance (lower is better) and spread, and is consistently on par with or better than Unsupervised-Default. (b) SS-ADR outperforms all baselines save for ADR.

C Implementation Details

Across all experiments, all networks share the same architecture and hyperparameters. For both the Alice and Bob acting policies, we use Deep Deterministic Policy Gradients [19], using the OurDDPG.py implementation from the open-source repository of [31]. Each actor and the critic have two hidden layers with 400 and 300 neurons, respectively, and use ReLU activations. For Alice's stopping policy (which signals the STOP action), we use a multi-layer perceptron with two hidden layers of 300 neurons each. For the SVPG particles, we use the same architecture and hyperparameters as described in Mehta et al. [14]. All networks use the Adam optimizer [32] with standard hyperparameters from the PyTorch implementation (https://pytorch.org/). The hyperparameters are summarized below:

Hyperparameter                   Value
Discount factor γ                0.99
Reward scaling factor υ          0.2
Actor learning rate              0.001
Critic learning rate             0.001
Batch size                       100
Maximum episode timesteps        100
Nrand for ErgoPusher             1
Nrand for ErgoReacher            8
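For concreteness, the actor described above can be written in PyTorch as follows (a sketch in the style of the OurDDPG.py implementation of [31]; the state and action dimensions in the usage lines are placeholders, not the environments' true dimensions):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """DDPG actor with two hidden layers (400 and 300 units) and ReLU activations,
    matching the architecture described above; illustrative, not the exact training code."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        x = torch.relu(self.l1(state))
        x = torch.relu(self.l2(x))
        return self.max_action * torch.tanh(self.l3(x))

# Placeholder dimensions; the optimizer uses the actor learning rate from the table above.
actor = Actor(state_dim=12, action_dim=4)
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
```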
