

Reinforcement Learning with Adaptive Curriculum Dynamics Randomization for Fault-Tolerant Robot Control

Wataru Okamoto
Graduate School of Science and Engineering, Chiba University, Chiba, Japan

Hiroshi Kera
Graduate School of Engineering, Chiba University, Chiba, Japan

Kazuhiko Kawamoto
Graduate School of Engineering, Chiba University, Chiba, Japan
e-mail: [email protected]

arXiv:2111.10005v1 [cs.RO] 19 Nov 2021

Abstract—This study is aimed at addressing the problem of fault tolerance of quadruped robots to actuator failure, which is critical for robots operating in remote or extreme environments. In particular, an adaptive curriculum reinforcement learning algorithm with dynamics randomization (ACDR) is established. The ACDR algorithm can adaptively train a quadruped robot in random actuator failure conditions and formulate a single robust policy for fault-tolerant robot control. It is noted that the hard2easy curriculum is more effective than the easy2hard curriculum for quadruped robot locomotion. The ACDR algorithm can be used to build a robot system that does not require additional modules for detecting actuator failures and switching policies. Experimental results show that the ACDR algorithm outperforms conventional algorithms in terms of the average reward and walking distance.

I. INTRODUCTION

Robots are being increasingly applied in a wide range of fields such as factory automation and disaster assessment and rescue. In the existing approaches, the control laws for robots are manually defined. However, in the presence of significant uncertainties in operating environments, the control law and states to be considered may not be easily identifiable, rendering such manual description challenging. To address these issues, reinforcement learning has attracted attention as a promising method that can automatically establish the control law without a clear description of how to realize a task of interest. A reinforcement learning framework learns, by trial and error, a policy that maximizes the reward, which is a measure of the goodness of the behavior in a certain environment. To realize reinforcement learning, it is only necessary to specify a task to be performed and design an appropriate reward function for the task.

Notably, to implement reinforcement learning in the real world, trial-and-error processes must be performed in real time. However, this approach may pose safety risks for the robot and the surrounding environment. To ensure safety and cost effectiveness, robots are often trained in simulation environments before being applied in the real world (Sim2Real). Reinforcement learning in simulation environments has achieved human-level performance on the Atari benchmark [1] and in robot control [2], [3]. However, Sim2Real is often ineffective owing to the gap between the simulation and the real world, called the reality gap, which can be attributed to differences in physical quantities such as friction and to inaccurate physical modeling. A straightforward approach to solving this problem is system identification, in which mathematical models or computer simulators of dynamical systems are established from measured data. However, the development of a realistic simulator is extremely costly. Even if system identification can be realized, many real-world phenomena associated with failure, wear, and fluid dynamics cannot be reproduced by such simulators. A typical approach to bridging the reality gap is domain randomization [4]. Domain randomization randomizes the parameters of the simulators to expose a model of interest to various environments. The model is expected to learn a policy that is effective in various environments, thereby narrowing the reality gap. Domain randomization has been used to successfully apply Sim2Real for the visual servo control of robot arms [4], [5] and the locomotion control of quadruped robots [6].

Notably, reinforcement learning involves certain limitations. Figure 1 shows that even though a quadruped robot can walk owing to reinforcement learning (top row), the robot may turn over owing to the actuator failure of one leg (bottom row). In this context, fault-tolerant control is critical for robots working in remote or extreme environments such as disaster sites [7] because such robots often cannot be repaired onsite. Several researchers have focused on fault-tolerant robot control based on reinforcement learning [8], [9]. However, these studies assume that the robots can detect internal failures and switch their policies according to the failures. Such robots require an additional module for diagnosing internal failures along with a module for control. In this context, robot systems designed for detecting all possible failures may be prohibitively complex [10]. To address this aspect, in this study, we establish a reinforcement learning algorithm for fault-tolerant control of quadruped robots against actuator failure. In this framework, the control law can ensure the locomotion of quadruped robots even if several actuators in the robot legs fail.



Fig. 1. Illustration of quadruped robot locomotion with actuator failure. (Top) A quadruped robot trained by basic reinforcement learning can exhibit locomotion. (Bottom) The quadruped robot turns over if the actuator of a leg (red) fails.

Furthermore, we develop a fault-tolerant control system that does not include any self-diagnostic modules for actuator failures. To this end, an adaptive curriculum dynamics randomization (ACDR) algorithm is proposed that can learn a single robust control policy against actuator failures by randomizing the dynamics of quadruped robots with actuator failures. In addition, we develop a curriculum learning algorithm that adaptively trains the quadruped robots to achieve an enhanced walking ability. Notably, the hard2easy curriculum, in which training is initiated in difficult conditions that progressively become easier, is more effective than the commonly used easy2hard curriculum. Experiments are conducted to demonstrate the effectiveness of the ACDR algorithm in comparison with conventional algorithms.

The key contributions of this study can be summarized as follows:

• We establish a reinforcement learning algorithm to address the problem of fault tolerance of quadruped robots to actuator failures and realize fault-tolerant control.

• The ACDR algorithm is proposed to effectively learn a single robust control policy against actuator failures. The ACDR algorithm does not require any self-diagnosis modules for detecting internal failures, which must be implemented in conventional systems.

• We demonstrate that the ACDR algorithm outperforms conventional algorithms in terms of the average reward and walking distance.

• For fault-tolerant control of quadruped robots, we find that the hard2easy curriculum is more effective than the easy2hard curriculum.

II. RELATED WORK

Domain randomization [4], [11] is aimed at enhancing the generalization performance of a policy identified in simulation. Domain randomization requires a prescribed set of N-dimensional simulation parameters ξ ∈ Ξ ⊂ R^N to randomize, together with the corresponding sampling intervals. The i-th element of ξ is sampled from the uniform distribution Uni(L, U) on the closed interval [L, U]. Using a set of parameters {ξ}, we can realize robot learning in various simulation environments. This randomization is expected to bridge the reality gap between the simulation and the real world. Tobin et al. [4] successfully applied Sim2Real for visual robot control by randomizing the visual rendering of the environments. Moreover, Sim2Real has been successfully applied to control a robot arm [5] and a quadruped robot [6] by randomizing physical parameters such as the robot mass and floor friction coefficient. In particular, we refer to the physical parameter randomization of [5], [6] as dynamics randomization, thereby distinguishing it from visual domain randomization [4]. To implement dynamics randomization, the interval [L, U] must be carefully determined, as an inappropriate interval is likely to result in failure to learn [12], a policy with considerable variance [13], or a conservative policy [14].
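As a minimal illustration of this sampling scheme (not taken from the paper; the parameter names and bounds below are hypothetical), each simulation parameter can be drawn from its own uniform interval at the start of every training episode:

import numpy as np

def sample_dynamics_parameters(bounds, rng):
    # bounds: dict mapping parameter name -> closed interval (L, U).
    # Each element of xi is drawn from Uni(L, U), as described above.
    return {name: rng.uniform(L, U) for name, (L, U) in bounds.items()}

rng = np.random.default_rng(0)
bounds = {"floor_friction": (0.5, 1.5), "torso_mass_scale": (0.8, 1.2)}
xi = sample_dynamics_parameters(bounds, rng)
# xi would then be written into the simulator before each training episode.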

For reinforcement learning, starting with an easy task is effective for eventually succeeding at a difficult task. Curriculum learning [15] is a learning process in which the difficulty of a task is gradually increased during training. Curriculum learning has mainly been applied to navigation [16] and pickup tasks involving robot arms [17]. Luo et al. [18] proposed a precision-based continuous curriculum learning (PCCL) strategy that adjusts the required accuracy in a sequential manner during learning for a robot arm and can accelerate learning and increase the task accuracy. In general, it is difficult to determine the appropriate level of difficulty of a task in curriculum learning. Consequently, approaches that adaptively adjust the difficulty level for the agent through the learning process are attracting attention. Wang et al. [19] proposed a paired open-ended trailblazer (POET) strategy, which ensures an appropriate level of difficulty by co-evolving the environment and policies, and enables walking in difficult terrain. Moreover, curriculum learning in which the randomization interval is adaptively varied can help realize domain randomization [13], [20]. Xie et al. [20] considered a stepping-stone walking task with constrained scaffolding. The authors estimated the agent performance on each grid of the random parameters and adaptively varied the difficulty by decreasing and increasing the probability of occurrence of grids with high and low performance values, respectively. This algorithm can enhance the performance and robustness of the model. In general, the design of the environment and curriculum considerably influences learning performance [21]. To realize natural walking behavior, it is necessary to limit the torque. To this end, a curriculum that starts with a higher torque limit, such as 1.6 times the target value, in the early stages of learning and gradually reduces the torque to the target value is considered to be effective [14].

To realize fault-tolerant robot control, Yang et al. [8] considered the joint damage problem in manipulation and walking tasks. The authors established a method to model several possible fault states in advance and switch among them during training. When the system encountered a failure, the method increased the robustness against joint damage by switching the policy. Kume et al. [9] proposed a method for storing failure policies in a multidimensional discrete map in advance. When the robot recognized a failure, it retrieved and applied these policies, thereby exhibiting a certain adaptability to robot failure. Notably, these existing failure-aware learning methods assume that the robot can recognize its damage state. It remains challenging to develop a system that can achieve high performance with a single policy, regardless of whether the robot is in a normal or failed state, even if the robot does not recognize the exact damage state.

III. METHOD

To realize fault-tolerant robot control without additional modules, we propose an ACDR algorithm for reinforcement learning. The objective is to formulate a single robust policy that can enable a robot to perform a walking task even if actuator failure occurs. We assume that the quadruped robot cannot recognize its damage state. The ACDR algorithm can adaptively change the interval of the random dynamics parameter to efficiently train the robot.

A. Reinforcement Learning

We focus on the standard reinforcement learning problem [22], in which an agent interacts with its environment according to a policy to maximize a reward. The state space and action space are expressed as S and A, respectively. For state s_t ∈ S and action a_t ∈ A at timestep t, the reward function r : S × A → R provides a real number that represents the desirability of performing a certain action in a certain state. The goal of the agent is to maximize the multistep return R_t = Σ_{t′=t}^{T} γ^{t′−t} r(s_{t′}, a_{t′}) after T steps, where γ ∈ [0, 1] is the discount coefficient, which indicates the importance of the future reward relative to the current reward. The agent decides its action based on the policy π_θ : S × A → [0, 1]. The policy is usually modeled using a parameterized function with respect to θ. Thus, the objective of learning is to identify the optimal θ* as

θ* = argmax_θ J(θ),    (1)

where J(θ) is the expected return defined as

J(θ) = E_{τ∼p_θ(τ)} [ Σ_{t=0}^{T−1} r(s_t, a_t) ],    (2)

where

p_θ(τ) = p(s_0) ∏_{t=0}^{T−1} p(s_{t+1} | s_t, a_t) π_θ(s_t, a_t)    (3)

represents the joint probability distribution over the series of states and actions τ = (s_0, a_0, s_1, ..., a_{T−1}, s_T) under policy π_θ.

Algorithm 1 Adaptive curriculum dynamics randomization
Require: L, U
Require: D: performance data buffer
Require: m, g_th: thresholds
Require: ∆L, ∆U: update step sizes
  θ ← random weights
  D ← ∅
  while not done do
    k ∼ Uni(L, U)
    l ∼ Uni{0, 1, 2, 3}   {choose a leg}
    o′_{l,t} ← k o_{l,t}   {update the l-th actuator output by k}
    Run policy π_θ in the environment with k
    Optimize θ with PPO
    Generate rollout τ = (s_0, a_0, ..., s_T)
    g ← 0
    for s_t, a_t in τ do
      g ← g + r(s_t, a_t)
    end for
    D ← D ∪ {g}   {add performance to buffer}
    if |D| ≥ m then
      g ← average of D
      if g ≥ g_th then
        if curriculum is easy2hard then
          U ← U − ∆U
          L ← L − ∆L
        else if curriculum is hard2easy then
          U ← U + ∆U
          L ← L + ∆L
        end if
        g_th ← g
      end if
      D ← ∅
    end if
  end while
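For concreteness, the following short Python sketch (illustrative only, not from the paper) computes the discounted return R_t defined above for every timestep of a finished rollout, together with the undiscounted rollout return g accumulated in Algorithm 1.

def discounted_returns(rewards, gamma=0.99):
    # R_t = r_t + gamma * R_{t+1}, computed backwards over the rollout
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def rollout_return(rewards):
    # g = sum_t r(s_t, a_t), as accumulated in Algorithm 1
    return sum(rewards)

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))  # approx. [2.71, 1.9, 1.0]
print(rollout_return([1.0, 1.0, 1.0]))                 # 3.0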

B. Definition of Actuator Failure

The actuator failure is represented through a variation in the torque characteristics of the joint actuators. Denoting o_t ∈ R as the actuator output at timestep t, we simply represent the output value in the failure state, o′_t, as

o′_t = k o_t,    (4)

where k ∈ [0, 1] is a failure coefficient. Failure coefficients of k = 1.0 and k ≠ 1.0 indicate that the actuator is in the normal or failed state, respectively. A larger deviation of k from 1.0 corresponds to a larger degree of robot failure, and k = 0.0 means that the actuator cannot be controlled. Figure 1 illustrates an example of failure of an 8-DOF robot that had learned to walk.


Fig. 2. Time update of the intervals [L, U] of the failure coefficient k in curriculum learning. The intervals are represented by the colored regions. ACDR e2h and ACDR h2e represent the easy2hard and hard2easy curricula for ACDR, respectively. LCDR e2h and LCDR h2e represent the linear easy2hard and hard2easy curricula, respectively.

The top row in Fig. 1 shows a normal robot with k = 1.0, and the bottom row shows a failed robot (red leg damaged) with k = 0.0. Because conventional reinforcement learning does not take into account environmental changes owing to robot failures, the robot falls in the middle of the task and cannot continue walking, as shown in the bottom row of Fig. 1.
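As a concrete illustration of Eq. (4), the following Python sketch scales the action components of one leg by a failure coefficient k using a Gym action wrapper. This is not the authors' implementation: the mapping from leg indices to the eight Ant-v2 actuators is a hypothetical assumption and should be checked against the actual actuator ordering.

import gym
import numpy as np

class ActuatorFailureWrapper(gym.ActionWrapper):
    # Sketch of Eq. (4): o'_t = k * o_t for the actuators of one (broken) leg.
    # Hypothetical leg-to-actuator mapping for Ant-v2 (8 actuators, 2 per leg).
    LEG_JOINTS = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}

    def __init__(self, env, leg=0, k=0.0):
        super().__init__(env)
        self.leg, self.k = leg, k

    def action(self, action):
        action = np.array(action, dtype=np.float32, copy=True)
        action[self.LEG_JOINTS[self.leg]] *= self.k  # attenuate the broken leg only
        return action

# Example: leg 2 loses half of its torque output.
# env = ActuatorFailureWrapper(gym.make("Ant-v2"), leg=2, k=0.5)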

C. Adaptive Curriculum Dynamics Randomization

To implement fault-tolerant robot control, the ACDR algorithm employs dynamics randomization by randomly sampling the failure coefficient k, and curriculum learning by adaptively changing the interval of k. The process flow of the ACDR algorithm is presented as Algorithm 1 and can be described as follows.

For dynamics randomization, one leg l of the quadruped robot is randomly chosen as

l ∼ Uni{0, 1, 2, 3},    (5)

where Uni{0, 1, 2, 3} is the discrete uniform distribution on the leg index set {0, 1, 2, 3} of the quadruped robot. The leg l is broken at the beginning of each training episode, and the failure coefficient k is randomly chosen as

k ∼ Uni(L, U),    (6)

where Uni(L, U) is the continuous uniform distribution on the interval [L, U]. According to Eq. (4), the broken actuator of leg l outputs o′_{l,t} = k o_{l,t} in each training episode. As the actuator output changes, the dynamics model p(s_{t+1} | s_t, a_t) also changes during training. This dynamics randomization means that a policy is learned under varying dynamics models instead of for a specific dynamics model. Under dynamics randomization, the expected return is obtained by taking the expectation of Eq. (2) over k:

E_{k∼Uni(L,U)} [ E_{τ∼p(τ|k)} [ Σ_{t=0}^{T−1} r(s_t, a_t) ] ] → max.    (7)

By maximizing this expected return, the agent can learn a policy that adapts to the dynamics changes caused by robot failures.
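To make this per-episode randomization concrete, the following sketch (illustrative only, reusing the hypothetical ActuatorFailureWrapper from Section III-B) draws a leg l from Uni{0, 1, 2, 3} and a failure coefficient k from Uni(L, U) at the start of each training episode, as in Eqs. (5) and (6).

import numpy as np

def make_randomized_episode_env(base_env_fn, L, U, rng):
    leg = int(rng.integers(0, 4))   # l ~ Uni{0, 1, 2, 3}, Eq. (5)
    k = float(rng.uniform(L, U))    # k ~ Uni(L, U), Eq. (6)
    return ActuatorFailureWrapper(base_env_fn(), leg=leg, k=k)

# Example (hypothetical): one randomized episode with the current interval [L, U].
# env = make_randomized_episode_env(lambda: gym.make("Ant-v2"), L=0.0, U=0.3,
#                                   rng=np.random.default_rng())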

For curriculum learning, the interval [L, U] in Eq. (6) is adaptively updated according to the return earned by the robot in training. As mentioned previously, a large interval [L, U] is likely to result in a policy with large variance [13] or a conservative policy [14]. Thus, a small interval [L, U] is assigned to Algorithm 1 as the initial value. The interval [L, U] is updated as follows. First, a trajectory of states and actions τ = (s_0, a_0, ..., s_T) is generated based on a policy π_θ, where θ is the policy parameter optimized using the proximal policy optimization (PPO) algorithm [23]. Subsequently, the return g of the trajectory τ is calculated by accumulating the rewards as

g = Σ_{t=0}^{T−1} r(s_t, a_t).    (8)

By repeating the parameter optimization and return calculation m times, the average value of the returns g can be determined. This value is used to evaluate the performance of the robot and to determine whether the interval [L, U] must be updated. If the average g exceeds a threshold g_th, the interval [L, U] is updated, i.e., the robot is trained with the updated failure coefficients.

In this study, we consider two methods to update the interval [L, U]. The manner in which the interval is updated during curriculum learning is illustrated in Fig. 2. In the first method, the upper and lower bounds of the interval [L, U] gradually decrease as

U ← U − ∆U,  L ← L − ∆L,    (9)

given the initial interval [1.5, 1.5]. In all experiments, we set ∆U = ∆L = 0.01. This decrease rule represents an easy2hard curriculum because smaller values of L and U correspond to a higher walking difficulty; for example, when k = 0, the leg does not move. The second method pertains to a hard2easy curriculum, i.e.,

U ← U + ∆U,  L ← L + ∆L.    (10)

We set the initial interval as [0.0, 0.0]. In this state, the robot cannot move the broken leg, and its ability to address this failure is gradually enhanced. Notably, although the easy2hard curriculum is commonly used in curriculum learning, the hard2easy curriculum is more effective than the easy2hard configuration in accomplishing the task. The difference between the two frameworks is discussed in Section IV-D.
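The following Python skeleton illustrates this adaptive update; it is a sketch under assumptions, not the authors' code. Episode returns are buffered, and once m returns have been collected their average is compared with the threshold g_th; if performance has improved, the interval [L, U] is shifted by (∆L, ∆U) in the hard2easy or easy2hard direction and the threshold is raised, as in Algorithm 1. Only ∆ = 0.01 and the initial intervals are taken from the paper; the buffer size m is not reported and is set to an arbitrary value here.

class AdaptiveInterval:
    def __init__(self, hard2easy=True, dL=0.01, dU=0.01, m=10, g_th=0.0):
        # hard2easy starts at [0.0, 0.0]; easy2hard starts at [1.5, 1.5]
        self.L, self.U = (0.0, 0.0) if hard2easy else (1.5, 1.5)
        self.hard2easy, self.dL, self.dU = hard2easy, dL, dU
        self.m, self.g_th = m, g_th
        self.buffer = []

    def record(self, episode_return):
        self.buffer.append(episode_return)
        if len(self.buffer) < self.m:
            return
        g = sum(self.buffer) / len(self.buffer)    # average performance over m episodes
        if g >= self.g_th:
            step = 1.0 if self.hard2easy else -1.0
            self.L += step * self.dL
            self.U += step * self.dU
            self.g_th = g                          # raise the threshold to the new average
        self.buffer.clear()

# Usage: after each training episode, call interval.record(g) and sample the next
# failure coefficient as k ~ Uni(interval.L, interval.U).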


Fig. 3. Average reward for the plain (blue) and broken (green) quadruped tasks. Error bars indicate the standard error.

Fig. 4. Average progress for the plain (blue) and broken (green) quadruped tasks. Error bars indicate the standard error.

D. Reward function

In the analysis, we introduce a slightly modified reward function instead of the commonly applied function described in [24], because the existing reward function [24] is likely to result in a conservative policy in which the robot does not walk but remains in place. The existing reward function [24] is defined as

r = v_fwd − 10^{−6} ‖u‖^2 − 10^{−3} ‖f_impact‖^2 + 1,    (11)

where v_fwd = (x′ − x)/∆t is the walking speed, u is the vector of joint torques, and f_impact is the contact force. The reward function in Eq. (11) yields a reward of 1 even when the robot falls or remains stationary without walking. Therefore, this reward function is likely to lead to a conservative policy. To ensure active walking of the robot, we modify the reward function as shown in Eq. (12).

TABLE I
HYPERPARAMETERS FOR EXPERIMENTS

Parameter               Value
Learning rate           0.00022
Horizon                 128
Minibatch size          4
Epochs                  4
Clipping parameter ε    0.2
Discount γ              0.99
GAE parameter λ         0.95
VF coefficient          0.5
Entropy coefficient     0.01

The modified reward function is

r = v_fwd − 10^{−6} ‖u‖^2 − 10^{−3} ‖f_impact‖^2 + s,    (12)

where

s = 0 if the robot is falling, and 1 otherwise.    (13)

Preliminary experiments demonstrate that the modified reward function in Eq. (12) can ensure more active walking of the robots.
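A direct transcription of Eqs. (12) and (13) is sketched below; the fall test (torso height below a threshold) is an assumption, as the paper does not specify how falling is detected.

import numpy as np

def modified_reward(v_fwd, torques, contact_forces, torso_height, fall_height=0.3):
    s = 0.0 if torso_height < fall_height else 1.0      # Eq. (13), assumed fall test
    return (v_fwd
            - 1e-6 * np.sum(np.square(torques))          # torque penalty, 10^-6 ||u||^2
            - 1e-3 * np.sum(np.square(contact_forces))   # contact penalty, 10^-3 ||f_impact||^2
            + s)                                         # Eq. (12)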

IV. EXPERIMENTS

We conduct experiments in the Ant-v2 environment provided by OpenAI Gym [25]. This environment runs on the MuJoCo physics simulator [26], and the task is aimed at realizing rapid forward locomotion of the quadruped robot, as shown in Fig. 1. For reinforcement learning, we use the PPO algorithm [23], which is suitable for high-dimensional continuous action environments.

A. Experiment setup

For reinforcement learning, we employ actor and critic networks consisting of two hidden layers of 64 and 64 neurons with tanh activation. All networks use the Adam optimizer [27]. The related hyperparameters are listed in Table I.
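The paper does not state which PPO implementation was used. Purely as an illustration, the hyperparameters in Table I could be mapped onto stable-baselines3 (with a classic-Gym-compatible version) roughly as follows; note that "Minibatch size 4" is mirrored literally here, although in some codebases this entry denotes the number of minibatches.

import gym
import torch
from stable_baselines3 import PPO

env = gym.make("Ant-v2")
model = PPO(
    "MlpPolicy", env,
    learning_rate=0.00022,
    n_steps=128,          # horizon
    batch_size=4,         # minibatch size (see caveat above)
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    vf_coef=0.5,
    ent_coef=0.01,
    # two hidden layers of 64 units with tanh activation; Adam is the default optimizer
    policy_kwargs=dict(net_arch=[64, 64], activation_fn=torch.nn.Tanh),
)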

The ACDR algorithm is compared with the following three algorithms. Baseline: a basic reinforcement learning algorithm for normal robots that implements neither dynamics randomization nor curriculum learning. Uniform Dynamics Randomization (UDR): a dynamics randomization algorithm that is based on the uniform distribution Uni(0.0, 1.5) and does not implement curriculum learning. Linear Curriculum Dynamics Randomization (LCDR): an algorithm in which the interval [L, U] is linearly updated; details of the LCDR can be found in Appendix A. Similar to the ACDR, the LCDR uses the easy2hard and hard2easy curricula. The progress of the two curricula, labeled "LCDR e2h" and "LCDR h2e", is shown in Fig. 2.

We evaluate the algorithms in terms of the average reward and average walking distance under two conditions: plain and broken. In the plain condition, the robots function normally, i.e., the failure coefficient is fixed at k = 1.0. In the broken condition, the robots are damaged, and k randomly takes a value in [0.0, 0.5]. We implement each algorithm under the two conditions, with five random seeds for each framework. The average reward and average walking distance are calculated as the average over 100 trials for each seed. The average walking distance is a metric suitable for evaluating the walking ability of the robot.
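A rough sketch of this evaluation protocol is given below. make_env (wrapping the environment with the hypothetical failure wrapper) and the distance readout via get_body_com on the classic mujoco-py Ant-v2 are assumptions, not details given in the paper.

import numpy as np

def evaluate(policy, make_env, n_trials=100, broken=False,
             rng=np.random.default_rng()):
    rewards, distances = [], []
    for _ in range(n_trials):
        k = rng.uniform(0.0, 0.5) if broken else 1.0   # broken vs. plain condition
        leg = int(rng.integers(0, 4))
        env = make_env(leg=leg, k=k)                    # e.g., ActuatorFailureWrapper(...)
        obs, done, total = env.reset(), False, 0.0
        x0 = env.unwrapped.get_body_com("torso")[0]     # assumed torso x readout
        while not done:
            obs, r, done, _ = env.step(policy(obs))
            total += r
        rewards.append(total)
        distances.append(env.unwrapped.get_body_com("torso")[0] - x0)
    return float(np.mean(rewards)), float(np.mean(distances))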

B. Comparative evaluation with conventional algorithms

The average reward and average walking distance for all algorithms are shown in Fig. 3 and Fig. 4, respectively. In both the plain and broken conditions, UDR is inferior to the Baseline, which does not consider robot failures. This result indicates that simple UDR cannot adapt to robot failures. The hard2easy curriculum of LCDR, denoted by LCDR h2e, outperforms the Baseline in terms of the average reward, as shown in Fig. 3. However, LCDR h2e exhibits an inferior walking ability compared with the Baseline in both the plain and broken conditions, as shown in Fig. 4. This result indicates that LCDR h2e implements a conservative policy that does not promote walking as actively as the Baseline. For both LCDR and ACDR, the hard2easy curriculum outperforms the easy2hard curriculum, as discussed in Section IV-D. Moreover, ACDR h2e achieves the highest performance in all cases. This finding indicates that ACDR h2e can avoid the formulation of a conservative policy by providing tasks with difficulty levels that match the progress of the robot.

Figure 5 shows the average reward for each failure coefficient k in the interval [0, 1]. ACDR h2e earns the highest average reward for every k, i.e., it achieves the highest performance in any failure state. Moreover, the performance of UDR, LCDR e2h, and ACDR e2h deteriorates as k increases. This result indicates that robots implementing these strategies cannot learn to walk: even when k increases and it is easy to move the leg, the robots fall and do not earn rewards.

C. Comparative evaluation with non-curriculum learning

The experimental results show that the hard2easy curriculum is effective against robot failures. To evaluate whether the hard2easy curriculum contributes to the enhanced performance, we compare this framework with a non-curriculum algorithm in which the robot is trained with a fixed failure coefficient k. The values of k are varied over {0.0, 0.2, 0.4, 0.6, 0.8, 1.2, 1.4}. For each k, Fig. 6 shows the average reward earned by the robot over the interval [0, 1] of the failure coefficient. ACDR h2e still achieves the highest performance over the interval [0, 1]. Therefore, the hard2easy curriculum enhances the robustness against robot failures.

D. Comparative evaluation between easy2hard and hard2easy curricula

The hard2easy curriculum gradually decreases the degree of robot failure and makes it easier for the robot to walk. In contrast, the easy2hard curriculum gradually increases the degree of difficulty, and eventually, the leg stops moving. Figures 3 and 4 show that the hard2easy curriculum is more effective than the easy2hard curriculum. We discuss the reason below.

Fig. 5. Average reward of all algorithms over various failure coefficients k ∈ [0, 1].

Fig. 6. Average reward of the non-curriculum algorithm with fixed failure coefficient k over various failure coefficients k ∈ [0, 1].

Figure 6 shows that the average reward tends to increase as k increases and to deteriorate as k decreases. For example, k = 1.4 corresponds to the highest average reward over the interval [0, 1], whereas k = 0.0 or 0.2 yields low rewards. This result suggests that it is preferable to avoid training at k = 0.0 to allow the robot to walk even under failure. Hence, we can hypothesize that the easy2hard curriculum, in which training is performed at k = 0.0 at the end of training, deteriorates the robot's ability to walk. To verify this hypothesis, we perform the same comparison as described in Section IV-B for robots trained on the interval k ∈ [0.5, 1.5]; in particular, in this framework, the robots are not trained at k = 0. Figures 7 and 8 show the average reward and average walking distance, respectively, and Fig. 9 shows the average reward for each k over the interval [0, 1].

Fig. 7. Average reward for the plain (blue) and broken (green) quadruped tasks. Error bars indicate the standard error. The comparison algorithm is trained on the interval k ∈ [0.5, 1.5].

Fig. 8. Average progress for the plain (blue) and broken (green) quadruped tasks. Error bars indicate the standard error. The comparison algorithm is trained on the interval k ∈ [0.5, 1.5].

Comparing these results with those presented in Figs. 3, 4, and 5 demonstrates that, by avoiding training in the proximity of k = 0, the average rewards of LCDR e2h and ACDR e2h increase. In summary, the easy2hard curriculum deteriorates the learned policy owing to the strong degree of robot failure introduced at the end of training, whereas the hard2easy curriculum does not.

Furthermore, the average rewards of UDR, LCDR h2e, and ACDR h2e increase, as shown in Figs. 7, 8, and 9. These results indicate that training in the proximity of k = 0 is not necessary to achieve fault-tolerant robot control. Thus, interestingly, the robot can walk even when one of the legs cannot be effectively moved, without being trained under that condition.

Fig. 9. Average reward of all algorithms across various failure coefficients k ∈ [0.5, 1.5].

V. CONCLUSION

This study was aimed at realizing fault-tolerant control of quadruped robots against actuator failures. To solve this problem, we proposed a reinforcement learning algorithm with adaptive curriculum dynamics randomization, abbreviated as ACDR. We applied domain randomization to fault-tolerant robot control toward Sim2Real in robotics. Furthermore, we developed an adaptive curriculum learning framework to enhance the effectiveness of domain randomization. The ACDR algorithm could formulate a single robust policy to realize fault-tolerant robot control. Notably, the ACDR algorithm can facilitate the development of a robot system that does not require any self-diagnostic modules for detecting actuator failures.

Experimental results demonstrate that the combination of curriculum learning and dynamics randomization is more effective than non-curriculum learning or non-randomized dynamics. In addition, the hard2easy curriculum is noted to be more effective than the easy2hard curriculum. This finding suggests that it is preferable to train quadruped robots in the order of difficulty opposite to that implemented in standard curriculum learning. In future work, there are several possible research directions for improving the ACDR algorithm; a combination of automatic curriculum design (e.g., [28]) and domain randomization is a promising direction.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number JP19K12039.

REFERENCES

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[2] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
[3] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[4] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
[5] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
[6] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018.
[7] Ivana Kruijff-Korbayova, Francis Colas, Mario Gianni, Fiora Pirri, Joachim de Greeff, Koen Hindriks, Mark Neerincx, Petter Ogren, Tomas Svoboda, and Rainer Worst. TRADR project: Long-term human-robot teaming for robot assisted disaster response. KI-Künstliche Intelligenz, 29(2):193–201, 2015.
[8] Fan Yang, Chao Yang, Di Guo, Huaping Liu, and Fuchun Sun. Fault-aware robust control via adversarial reinforcement learning. arXiv preprint arXiv:2011.08728, 2020.
[9] Ayaka Kume, Eiichi Matsumoto, Kuniyuki Takahashi, Wilson Ko, and Jethro Tan. Map-based multi-policy reinforcement learning: Enhancing adaptability of robots by deep reinforcement learning. arXiv preprint arXiv:1710.06117, 2017.
[10] Muhamad Azhar, A. Alobaidy, Jassim Abdul-Jabbar, and Saad Saeed. Faults diagnosis in robot systems: A review. Al-Rafidain Engineering Journal (AREJ), 25:164–175, December 2020.
[11] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, SM Eslami, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
[12] Wataru Okamoto and Kazuhiko Kawamoto. Reinforcement learning with randomized physical parameters for fault-tolerant robots. In 2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems (SCIS-ISIS), pages 1–4, 2020.
[13] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. Active domain randomization. In Conference on Robot Learning, pages 1162–1176. PMLR, 2020.
[14] Farzad Abdolhosseini. Learning locomotion: Symmetry and torque limit considerations. PhD thesis, University of British Columbia, 2019.
[15] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
[16] Diego Rodriguez and Sven Behnke. DeepWalk: Omnidirectional bipedal gait by deep reinforcement learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 3033–3039, 2021.
[17] Ozsel Kilinc and G. Montana. Follow the object: Curriculum learning for manipulation tasks with imagined goals. arXiv preprint arXiv:2008.02066, 2020.
[18] Sha Luo, Hamidreza Kasaei, and Lambert Schomaker. Accelerating reinforcement learning for reaching using continuous curriculum learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
[19] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753, 2019.
[20] Zhaoming Xie, Hung Yu Ling, Nam Hee Kim, and Michiel van de Panne. ALLSTEPS: Curriculum-driven learning of stepping stone skills. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation, 2020.
[21] Daniele Reda, Tianxin Tao, and Michiel van de Panne. Learning to locomote: Understanding how environment design matters for deep reinforcement learning. arXiv preprint arXiv:2010.04304, 2020.
[22] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[24] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[25] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[26] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7–9, 2015.
[28] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised environment design. In Advances in Neural Information Processing Systems, volume 33, pages 13049–13061. Curran Associates, Inc., 2020.

APPENDIX A
LINEAR CURRICULUM DYNAMICS RANDOMIZATION

The LCDR updates the interval [L, U] at the discrete timesteps t = ∆t, 2∆t, ..., (N − 1)∆t, where ∆t is defined as

∆t = ⌊T/N⌋,    (14)

where T is the predetermined total learning time and ⌊·⌋ is the floor function. We set N = 11 in the experiments. Similar to the ACDR, the LCDR implements two curricula: easy2hard and hard2easy. Given the initial interval [1.5, 1.5], the easy2hard curriculum updates the interval at each timestep t as

L ← L − ∆,  U ← U − ∆,    (15)

where the update amount ∆ is defined as

∆ = k_max / (N − 1),    (16)

where k_max = 1.5 is the maximum value of the failure coefficient. In addition, the hard2easy curriculum updates the interval as

L ← L + ∆,  U ← U + ∆,    (17)

given the initial interval [0.0, 0.0].
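As a check of this schedule, the following sketch (the helper name and arguments are illustrative) computes the LCDR interval as a function of the elapsed learning time, mirroring Eqs. (14)-(17) with N = 11 and k_max = 1.5.

def lcdr_interval(step, total_steps, N=11, k_max=1.5, hard2easy=True):
    dt = total_steps // N                  # Eq. (14): update period
    delta = k_max / (N - 1)                # Eq. (16): per-update shift
    n_updates = min(step // dt, N - 1)     # updates applied so far
    if hard2easy:                          # Eq. (17): start at [0.0, 0.0], move upward
        L = U = n_updates * delta
    else:                                  # Eq. (15): start at [1.5, 1.5], move downward
        L = U = k_max - n_updates * delta
    return L, U

# Example: halfway through training, the hard2easy interval has moved to about [0.75, 0.75].
# print(lcdr_interval(step=500_000, total_steps=1_000_000))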