
Preprint, submitted to IEEE Robotics and Automation Letters (RA-L) 2021 with International Conference on Robotics and Automation Conference Option (ICRA) 2021

Reinforcement Learning with Time-dependent Goals for Robotic Musicians

Thilo Fryen1∗, Manfred Eppe1∗, Phuong D.H. Nguyen1, Timo Gerkmann2, Stefan Wermter1

Abstract— Reinforcement learning is a promising method to accomplish robotic control tasks. The task of playing musical instruments is, however, largely unexplored because it involves the challenge of achieving sequential goals – melodies – that have a temporal dimension. In this paper, we address robotic musicianship by introducing a temporal extension to goal-conditioned reinforcement learning: time-dependent goals. We demonstrate that these can be used to train a robotic musician to play the theremin instrument. We train the robotic agent in simulation and transfer the acquired policy to a real-world robotic thereminist. Supplemental video: https://youtu.be/jvC9mPzdQN4

Index Terms— Art and Entertainment Robotics, Reinforcement Learning, Robotic Musicians

I. INTRODUCTION

Most contemporary musical performance robots rely on hard-coded motion trajectories to play musical instruments. For example, the robotic metal band Compressorhead consists of five humanoid robots that play a guitar, a bass guitar, and drums using sophisticated pneumatic actuators (see Fig. 1). A problem with this approach is that engineers and artists need to calibrate and program the specific motion trajectories required to play the instruments manually. How can we realize a computational architecture that allows robotic musicians to learn to play their instruments so that artists and engineers do not need to program them manually?

A currently very prominent learning-based approach for robotic control that can potentially overcome this issue is reinforcement learning (RL). However, existing approaches have not been able to realize RL-driven robotic musicians because music is time-dependent: to play a melody, a robotic musician must hit the desired pitch of each note in a temporal sequence. A problem is that current reinforcement learning methods consider static goals or static reward functions that are independent of the temporal dimension of an episode. This renders them inappropriate for learning the temporal dynamics involved in controlling musical instruments.

As our novel scientific contribution, we address this problem by extending static goal-conditioned reinforcement learning methods with time-dependent goals (TDGs). Herein, we extend the static goal representation provided to the agent with a dynamically changing goal representation. We hypothesize that time-dependent goals enable robots to play musical instruments.

∗ Equal contribution. 1 Thilo Fryen, Manfred Eppe, Phuong D.H. Nguyen, and Stefan Wermter are with the Knowledge Technology group, Department of Informatics, University of Hamburg, Hamburg, Germany. 2 Timo Gerkmann is with the Signal Processing group, Department of Informatics, University of Hamburg, Hamburg, Germany. {6fryen, eppe, pnguyen, gerkmann, wermter}@informatik.uni-hamburg.de

Photo by Torsten Maue, licensed under the terms of the cc-by-sa-2.0.

Fig. 1: The guitarist and bassist of the robotic metal band Compressorhead.

We verify our hypothesis for the case of playing the theremin (Fig. 2), and we provide a proof of concept by applying the method to a robotic theremin player. We also provide an empirical evaluation to show that the method is robust to acoustic noise, and we compare different variations of the auditory preprocessing and control mechanisms.

The remainder of this paper is structured as follows. We provide a brief background and related work for musical robots and goal-conditioned reinforcement learning in section II. In section III, we introduce the time-dependent goal and describe our theremin tone simulation. Then we describe our experiments in both the simulated and the real-world environment in section IV. We illustrate our architecture and the reinforcement learning components. Our results section V is divided into the comparative evaluation of different time- to frequency-domain transforms, action spaces, and robustness to different noise intensities. We also investigate hindsight experience replay to alleviate reward sparsity. Furthermore, we provide a real-world proof of concept and transfer the approach to a real-world theremin-playing robot. We conclude in section VI.

II. BACKGROUND AND RELATED WORK

We combine robotic control through reinforcement learning with musical robots. There exist works that compose music with reinforcement learning methods [1, 2], but to the best of our knowledge, there exists no prior work that uses reinforcement learning to control a physical robotic musician.



A. Musical Robots

The design of musical robots and the development of the corresponding algorithms constitute an interesting challenge because of the many aspects that music incorporates. Existing robotic musicians include the flute and saxophonist robots from Waseda University [3], the marimba-playing robot Shimon from the Georgia Institute of Technology [4], and the violinist robot from Ryukoku University [5]. These and other prominent projects, like Compressorhead [6] and Automatica [7], concentrate on the artistic and the engineering aspects of robotic musicianship. This leads to impressive performances and relatively precise control of non-trivial instruments. However, large parts of the behavior of these robots are pre-programmed and depend on hand-engineered domain knowledge. A learning-based approach can simulate the individuality of human musicians better because the robots are not explicitly programmed but rather learn their individual way of playing the instrument. A learning-based approach also eliminates the need for background knowledge and the manual calibration of control mechanisms. However, the temporal dimension inherent in musical melodies and harmonies makes it difficult to apply existing learning-based methods, especially for instruments that are difficult to play. So how can we realize a learning-based approach that is suitable for learning a sequence of different pitches while also considering rhythm and timing? How can we avoid the manual calibration of robotic musicians?

To address these challenging questions, we propose here a proof of concept with a theremin-playing robot. The theremin (see Fig. 2) is an analog electronic instrument that emits a sine-like tone, where the pitch can be controlled by changing the distance of the musician's hand to the theremin's antenna. The simplicity of this design allows us to develop and to evaluate a temporally extended actor-critic reinforcement learning method in conjunction with domain randomization [8]. This enables the robot to learn to play temporally extended melodies and to quickly adapt to the actual properties of the environment without requiring a calibration phase. Previous robotic thereminists [9, 10] use feed-forward control and therefore need a calibration phase to determine the distance-to-frequency function before the robot can play the theremin.

B. Actor-Critic and Goal-conditioned Reinforcement Learning

In reinforcement learning, an agent interacts with an environment and learns to solve a problem. At each time step, the agent receives a representation of the environment's state s and carries out an action a according to its policy π(s) = a. The environment defines a reward function r(s) that informs the agent about the success of its policy. To be able to learn the optimal policy and maximize the rewards, the agent needs to know the best action for each state. It approximates an action-value function Q(s, a), which learns the expected reward from taking action a in state s. The action-value function generalizes over the set of all states so that it can handle previously unseen states. The reinforcement learning agent can be subdivided into an actor, which learns the policy, and a critic, which learns the action-value function. The actor learns from the feedback of the critic, and the critic learns from the feedback of the environment, the reward.

Fig. 2: The physical theremin we used for our experiments. The pitch is controlled by adjusting one's distance to the theremin's antenna.

Deep Deterministic Policy Gradient (DDPG) [11] is an actor-critic RL algorithm for continuous actions and is therefore useful for robotic control. It combines deep Q-networks [12] and deterministic policy gradient algorithms [13]. DDPG trains the actor and critic neural networks off-policy and stores the collected experience in a replay buffer.

The actor-critic approach is often applied in a multi-goal framework, where the agent learns a policy that is conditioned on a goal variable g [11, 14, 15, 16]. This allows for specifying different goals for the same task. For example, g can specify the desired goal coordinates of a robot's end-effector [17], but it can also specify the desired note for a note-playing task. Universal value function approximators (UVFAs) [16] learn a value function Q(s, a, g) that considers this goal variable, computing the expected value of taking action a in state s given the goal g. Consequently, universal value functions not only generalize over continuous states but also over continuous goal representations.

A general problem of reinforcement learning is sample efficiency. Hindsight experience replay (HER) [18] increases the sample efficiency of goal-conditioned RL by enabling the agent to learn from unsuccessful episodes. By substituting the goal in the replay buffer with the goal that was actually achieved, the agent pretends that the achieved state was the desired goal. It then receives a reward for its actions and learns how to reach the achieved state, pretending in hindsight that this was the goal. This is especially helpful in situations where the rewards are sparse and where it is unlikely for the agent to receive a reward during exploration. In this work, we extend HER to temporally extended goals.
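To make the relabeling concrete, the following is a minimal sketch of HER's "future" strategy for goal substitution. The function name, the episode layout (a list of (state, action, goal, achieved_state) tuples), and the reward callable are our own illustrative assumptions, not the authors' implementation.

```python
import random

def relabel_with_her(episode, compute_reward, k=4):
    """Sketch of HER with the 'future' strategy: for every transition, add k
    copies whose goal is replaced by a state achieved later in the episode."""
    relabeled = []
    for t, (state, action, goal, achieved) in enumerate(episode):
        relabeled.append((state, action, goal, compute_reward(achieved, goal), achieved))
        for _ in range(k):
            # substitute the goal with an achieved state from a future time step
            future = random.randint(t, len(episode) - 1)
            substitute_goal = episode[future][3]
            reward = compute_reward(achieved, substitute_goal)
            relabeled.append((state, action, substitute_goal, reward, achieved))
    return relabeled
```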

Reinforcement learning has been used for the synthesis of music-like signals by Doya and Sejnowski [19]. The authors modeled song-learning by birds with a reinforcement learning model that chooses the parameters for a synthesizer to replicate recordings of birdsongs. Their work has many similarities to our approach to operating musical robots: they train their model only with auditory feedback and evaluate the agent's actions one syllable at a time. In our work, we extend these ideas and, rather than using a synthesizer, let the agent interact physically with the environment. Moreover, we extend UVFAs towards temporally extended goal variables so that the agent also learns to generalize over environment states and temporally extended goals.

III. METHODOLOGY

A. Time-dependent Goals

Classical UVFAs enable an agent to learn to play a single note on an instrument. However, such an agent is not designed to learn to play sequences of notes because it only considers one goal per episode. Our proposed time-dependent goal enhances UVFAs by defining a possibly different goal g_t for every time step t. This is necessary to describe the task of playing music because melodies are a sequence of notes that have to be played at the right time. To integrate our proposed time-dependent goal (TDG) with actor-critic RL, we make several adjustments. We start by changing the signature of the reward function from r(s_t, g) to r(s_t, g_t). This results in the following modified Bellman equation (1) for the optimal action-value function:

Q*(s_t, a_t, g_t) = r(s_{t+1}, g_{t+1}) + γ · max_a Q*(s_{t+1}, a_{t+1}, g_{t+1})    (1)

Herein, Q*(s_t, a_t, g_t) is the optimal action-value function and γ is the discount factor for future rewards. In our work, we implement the actor as an artificial neural network π with weights θ, and we implement the critic Q as another network with weights φ. Hence, we rewrite the reinforcement learning problem with TDGs as the following optimization problem (2) for the critic's weights:

argmin_φ [ r(s_t, g_t) + γ · Q_φ^{π_θ}(s_t, a_t, g_t) − Q_φ^{π_θ}(s_{t−1}, a_{t−1}, g_{t−1}) ]    (2)

Accordingly, we optimize the actor's weights with the output of the critic as follows (3):

argmin_θ [ Q_φ^{π_θ}(s_t, π_θ(s_t, g_t), g_t) ]    (3)

Since we use an off-policy approach, we also store the TDG in the replay buffer. Therefore, the stored transitions are tuples (s_t, a_t, g_t, s_{t+1}, g_{t+1}). In our experiments, we show that these enhancements to goal-conditioned RL enable agents to learn to achieve sequences of time-dependent goals that encode a desired sequence of simulated theremin tones.
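As an illustration of how the time-dependent goal travels through the learning loop, the sketch below stores transitions together with the current and next goal and forms a one-step target in the spirit of equation (1). The class and function names, the buffer capacity, and the callable interfaces for actor, critic, and reward are illustrative assumptions rather than the authors' code.

```python
import random
from collections import deque

class TDGReplayBuffer:
    """Replay buffer that keeps the time-dependent goal with every transition,
    i.e. tuples (s_t, a_t, g_t, s_{t+1}, g_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s_t, a_t, g_t, s_next, g_next):
        self.buffer.append((s_t, a_t, g_t, s_next, g_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def critic_target(transition, reward_fn, critic, actor, gamma=0.995):
    """One-step target following equation (1): the reward of the next state
    under the *next* goal plus the discounted value of the next state-goal pair."""
    s_t, a_t, g_t, s_next, g_next = transition
    a_next = actor(s_next, g_next)  # action the current policy would take next
    return reward_fn(s_next, g_next) + gamma * critic(s_next, a_next, g_next)
```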

B. Theremin Tone Simulation

In our simulation, we approximate the real-world theremin properties and represent its sound in the frequency domain. The pitch of the theremin is calculated from the distance of the arm's tip to the theremin antenna with a formula that approximates a real-world theremin distance-to-pitch relation (see Fig. 3). With the pitch frequency, we compute a 50 millisecond-long theremin tone x(n) with base frequency f that mimics a real-world theremin sound, described by the following equation (4):

x(n) = Σ_{i=0}^{8} 0.4 · (1/4)^i · sin(2πf(i+1)n)    (4)

The theremin sound is composed of eight sine waves, such that their frequencies are multiples of the base frequency (harmonics). The amplitude of each sine is scaled by 0.4 and by a factor that exponentially decreases the amplitude of each harmonic the higher its frequency is. We determined these parameters by manual numerical optimization. We only compute the first seven harmonics because higher harmonics have a relatively low amplitude; therefore, the effect of higher harmonics is negligible.
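A minimal numpy sketch of the tone synthesis in equation (4), treating n as time in seconds; the sample rate and tone duration are our assumptions, and the summation range follows the equation as printed.

```python
import numpy as np

def theremin_tone(f, sr=44100, duration=0.05):
    """Approximate theremin tone per equation (4): the base frequency f plus
    harmonics at multiples of f, each scaled by 0.4 * (1/4)**i."""
    n = np.arange(int(sr * duration)) / sr          # time axis in seconds
    x = np.zeros_like(n)
    for i in range(9):                              # i = 0..8 as printed in Eq. (4)
        x += 0.4 * (1 / 4) ** i * np.sin(2 * np.pi * f * (i + 1) * n)
    return x

tone_a4 = theremin_tone(440.0)                      # 50 ms tone at A4
```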

We add pink noise with a signal-to-noise ratio (SNR) of 38 dB to the simulated theremin signal to improve the robustness of the agent. We transform the tone into the frequency domain and compare the following three methods: constant-Q transform (CQT) [20], short-time discrete Fourier transform (STFT), and STFT with mel scaling. We use a Hann window, discard the phase for all transformations, and provide the transformed signals as perceptive input to the agent.

[Fig. 3 diagram: a goal note g ∈ {A4 = 440 Hz, ..., Ab5 = 830 Hz} is selected every 25 time steps; the agent outputs either inverse kinematics actions (Δx, Δy, Δz) or direct joint actions (Δj1, ..., Δj6); pink noise is added to the simulated tone, and state and goal are transformed with the CQT, the STFT, or the STFT with mel scaling; the reward is 0 if the goal-state difference is below a threshold and -1 otherwise.]

Fig. 3: Schematic depiction of the agent-environment interaction loop. The simulated 6 DOF robotic arm and the theremin antenna are illustrated on the left. State and goal are transformed to the frequency domain either by the CQT, the STFT, or the STFT with mel scaling. The robot is actuated with inverse kinematics (blue) or direct manipulation of the joints (green).
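The following numpy sketch shows one way to add approximately pink (1/f-power) noise at a chosen SNR and to compute a Hann-windowed magnitude spectrum as perceptive input; the noise shaping and all parameter values are our assumptions, not the exact preprocessing used in the paper.

```python
import numpy as np

def add_pink_noise(x, snr_db=38.0, rng=None):
    """Add approximately pink (1/f power) noise to x at the given SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(len(x))
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(len(x))
    spectrum[1:] /= np.sqrt(freqs[1:])       # 1/sqrt(f) amplitude -> 1/f power
    pink = np.fft.irfft(spectrum, n=len(x))
    # scale the noise so that 10 * log10(P_signal / P_noise) equals snr_db
    pink *= np.sqrt(np.mean(x**2) / (np.mean(pink**2) * 10 ** (snr_db / 10)))
    return x + pink

def magnitude_spectrum(x):
    """Hann-windowed magnitude spectrum; the phase is discarded."""
    return np.abs(np.fft.rfft(x * np.hanning(len(x))))

tone = np.sin(2 * np.pi * 440.0 * np.arange(2205) / 44100)   # plain 50 ms A4 test tone
observation = magnitude_spectrum(add_pink_noise(tone))
```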

IV. EXPERIMENTS

We utilize the time-dependent goal for training a thereminist robot in a simulation and transfer a policy to the real world. In both the simulation and the real-world scenario, the agent's task is to play a sequence of eight notes with the right duration on the theremin.

A. Simulation

We simulate a 6 DOF robotic arm and a theremin antenna in CoppeliaSim [21] and use PyRep [22] to actuate the robotic arm. Fig. 3 illustrates the agent-environment interaction loop. At each time step, two theremin tones are computed and provided to the agent in their frequency-domain representation. One is the observation, whose frequency is determined by the distance between the robot's tip and the theremin antenna. The other one is the time-dependent goal.

One episode consists of 200 time steps, and the time-dependent goal changes every 25 time steps. The goal notes are randomly selected between A4 and Ab5. The agent tries to achieve these goals by actuating the robotic arm either with inverse kinematics or by direct manipulation of the joints. The sparse reward is computed with spectrographic template matching as described in the following equation (5), where g is the goal, s is the achieved state, and n_bins is the number of bins of the used transform (constant-Q transform / short-time discrete Fourier transform (STFT) / mel-scaled STFT):

r(s, g) = 0 if Σ_{i=0}^{n_bins} |g_i − s_i| < ε, and −1 otherwise    (5)

The difference between the goal g and the achieved state s at time t has to be smaller than a threshold ε for the reward to be 0; otherwise, the reward is -1. We choose ε so that the relative difference between the base frequency of the goal note and the achieved goal note is smaller than 0.7% of the goal note's frequency, which is larger than the difference trained humans can notice but smaller than the difference untrained humans notice [23]. At the start of each episode, the robot's position is reset and the position of the theremin antenna is randomized in the y-direction, which means that it is moved sideways.
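A one-line sketch of the sparse reward in equation (5); the function name and the array interface are illustrative.

```python
import numpy as np

def sparse_reward(achieved_spec, goal_spec, eps):
    """Equation (5): 0 if the summed absolute spectral difference between goal
    and achieved state is below the threshold eps, -1 otherwise."""
    return 0.0 if np.sum(np.abs(goal_spec - achieved_spec)) < eps else -1.0
```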

B. Real World

The real-world robot is a 1 DOF mobile robot that can move back and forth, thereby changing its distance to the theremin and therefore the pitch. We trained an agent in the simulation and then deployed it on the Lego Mindstorms NXT robot shown in Fig. 4.

Fig. 4: Simulation (top) and real-world (bottom) environment for the theremin-playing robot. The robot's movement is constrained by two blocks. We also trained a 6 DOF robotic arm (see Fig. 3) but did not have access to its physical version due to the COVID-19 pandemic.

The real-world sound is recorded by a condenser microphone and transformed to the frequency domain with the CQT. The action is a value that specifies the desired rotation of the robot's motor. To account for the fluctuating amplitude of the real-world theremin and the limited accuracy of the NXT robot, we change the reward computation during the simulated training as described in the following equation (6):

r(s, g) = 0 if argmax CQT(g) = argmax CQT(s), and −1 otherwise    (6)

The reward is now 0 if the CQT bin with the largest value corresponds to the goal note, and -1 otherwise. This effectively increases the allowed tolerance when computing the reward. In contrast to the 6 DOF robot simulation, the position of the robot is not reset at the start of every episode. Instead, it simply starts from the position it occupied at the end of the last episode. This automatically makes the agent generalize over different starting positions.
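A corresponding sketch of the coarser reward in equation (6), again with illustrative names; the CQT vectors would come from whichever transform implementation is used.

```python
import numpy as np

def argmax_reward(achieved_cqt, goal_cqt):
    """Equation (6): 0 if the largest-magnitude CQT bin of the achieved state
    matches that of the goal, -1 otherwise."""
    return 0.0 if np.argmax(achieved_cqt) == np.argmax(goal_cqt) else -1.0
```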

C. Agent Neural Network Architecture

The agent architecture is a DDPG actor-critic network. Actor and critic both have three dense layers with 64 neurons and ReLU activation each, as illustrated in Fig. 5. We further increase the sample efficiency with hindsight experience replay [18], with a manipulated-to-original experience ratio of 4:1 and the replay strategy "future", which means that HER goals are replaced by an achieved state from a future time step. During training, we randomize 30% of the actions and add 20% action noise to the non-random actions to facilitate exploration. We use the Adam optimizer [24] and a learning rate of 0.001. The discount factor γ for future rewards is 0.995, and the loss for the critic is the mean-squared Bellman error. One episode consists of 200 steps, and the time-dependent goal changes every 25 steps. Our implementation builds upon the OpenAI Baselines implementation of DDPG with hindsight experience replay [25].
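For orientation, a minimal PyTorch sketch of networks with this shape and the reported hyperparameters follows. The paper's implementation builds on OpenAI Baselines (TensorFlow), so the framework choice, the input dimensions, and the optimizer wiring here are illustrative assumptions (cf. Fig. 5 for the sizes actually used).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three dense layers of 64 ReLU units, tanh output for bounded actions."""
    def __init__(self, obs_dim, goal_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

class Critic(nn.Module):
    """Three dense layers of 64 ReLU units, scalar Q-value output."""
    def __init__(self, obs_dim, goal_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, goal, action):
        return self.net(torch.cat([obs, goal, action], dim=-1))

# Reported hyperparameters; dimensions are placeholders (e.g. CQT input, IK actions).
GAMMA = 0.995
actor = Actor(obs_dim=120, goal_dim=120, action_dim=3)
critic = Critic(obs_dim=120, goal_dim=120, action_dim=3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
critic_loss_fn = nn.MSELoss()  # mean-squared Bellman error
```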

V. RESULTS AND DISCUSSION

Both the simulated and the real-world robot can successfully play the theremin. We evaluate the accuracy of the simulated robot by caching the base frequencies of the goals and achieved states and using them to calculate the number of time steps at which the goal has been reached, so that the transform type does not influence the evaluation.

[Fig. 5 diagram: actor and critic each consist of three dense layers with 64 units and ReLU activations; the actor output uses tanh (size 3 for inverse kinematics, 6 for joint manipulation) and the critic outputs a single Q-value; the input size is 120 (CQT), 256 (mel-scaled STFT), or 2050 (STFT); transitions (s_t, a_t, g_t, r_t, s_{t+1}) are stored in a replay buffer with hindsight experience replay.]

Fig. 5: Agent architecture. Blue boxes denote neural network layers. Sizes of input and output layers depend on the transform type (CQT / mel-scaled STFT / STFT) and the actuation type (inverse kinematics / joint manipulation). The left dotted line is used for the actor update and the right dotted line for the critic update. g'_t is the goal chosen by HER and r'_t is the recomputed reward for the substituted goal. The red arrows indicate the values that we used as feedback for the networks.

Our baseline configuration uses the CQT, inverse kinematics, a signal-to-noise ratio of 38 dB, and HER. In our experiments, we alter these settings to determine the best configuration. We train each configuration seven times for 30 epochs, which consist of 25 training and 10 test episodes each. The graphs show the median and the second and third quartiles of the training runs.
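The per-episode success count used in this evaluation can be sketched as follows; applying the 0.7% relative tolerance from section IV-A to the cached base frequencies is our assumption about how "goal reached" is decided here.

```python
def successful_time_steps(goal_freqs, achieved_freqs, tolerance=0.007):
    """Count the time steps at which the achieved base frequency lies within
    the relative tolerance of the goal note's base frequency."""
    return sum(
        abs(achieved - goal) / goal < tolerance
        for goal, achieved in zip(goal_freqs, achieved_freqs)
    )
```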

A. Transforms

We compare the three different time- to frequency-domain transforms CQT, STFT, and STFT with mel scaling. Fig. 6 depicts the results. The learning converges after approximately 300-500 training episodes, depending on the transformation method used. CQT performs slightly better than the other methods because the spacing of its center frequencies is better suited for musical notes. The spacing is logarithmic with a constant factor of Q = 1/12 and therefore corresponds to the typical chromatic note frequencies of Western music that we used.

Fig. 6: Performance of the agent using the different time- to frequency-domain transforms: constant-Q transform, short-time discrete Fourier transform, and STFT with mel scaling.

For all transformation methods, it is not possible to achieve 100% accuracy. The successful time steps converge to around 175 after 300-600 episodes and never reach the full 200 successful time steps per episode because the agent has to move between the notes. While still in the process of moving, it has not yet hit the correct pitch.

B. Action Spaces

We also compare two action spaces of the robot's end-effector: first, we control the robot in Cartesian space with inverse kinematics, and, second, we perform direct control over the joints. Fig. 7 shows that the agent performs better when it uses inverse kinematics. This was expected because the Cartesian space is smaller, it directly correlates with the theremin's pitch, and the joint movements are not independent of one another.

C. Need for Hindsight Experience Replay

We also scrutinize the benefits of using hindsight experience replay. Our results (Fig. 8) show that HER improves the training speed. However, the benefits of HER compared to DDPG alone are not as dramatic as expected. We hypothesize that the agent is likely to hit the right notes by accident, generating sufficiently many successful training samples, so that DDPG without HER already performs well.

D. Robustness and Generalization

Fig. 9 shows that different signal-to-noise ratios (SNRs) affect the performance of the agent, but the agent is rather robust to auditory noise. The agent performs well at SNRs up to 16 dB and still learns something when the noise is as loud as the signal. A large drop in performance only appears at SNRs below 8 dB.

Fig. 7: Performance of the agent using either inverse kinematics or direct joint manipulation for actuating the robot.

Fig. 8: Influence of HER on the performance of the agent.

Fig. 9: Performance of the agent at different signal-to-noise ratios.

In addition to this empirical evaluation, we perform two qualitative tests to further investigate the robustness of our approach. First, we evaluate the agent with tones that it was not trained on, i.e., with tones outside of the discrete chromatic musical scale. In this case, the agent was still able to hit the correct frequencies. Second, we test a changed distance-to-pitch computation. This also does not hinder the agent from achieving the goals. The successful deployment of a policy to the real world shows the robustness of the agent as well.

E. Real World

To test the transfer to real-world robots, we train the NXT robot in simulation and deploy the learned policy on the physical robot. The NXT is physically not able to play realistic melodies because it lacks the speed and accuracy to achieve an appropriate timing performance. However, as illustrated in our supplemental video¹, the NXT robot can play a slow series of notes on a real theremin. This is possible even though the NXT's motor noise is relatively loud. Furthermore, we observed that the robot can also react to and compensate for external disturbances, e.g., when we increase the theremin's pitch by moving a hand closer to the antenna. In this case, the agent drives backward to compensate for the disturbance.

VI. CONCLUSIONS

We introduced the concept of time-dependent goals that enable robots to play musical instruments. We provided a proof of concept for the case of actor-critic reinforcement learning, but the method is equally applicable to other goal-conditioned reinforcement learning methods that are based on UVFAs. We demonstrated that a robotic agent can successfully learn to play the theremin in simulation and that the learned behavior can also be transferred to a physical robot. Our future work will involve more sophisticated physical robots and more complex instruments, such as a xylophone or a keyboard. In addition, we will extend the time-dependent goal method on a theoretical level, so that it can also account for future goals. This will enable the robotic agent to prepare for playing a note beforehand, so that it can start the movement required to hit the note in time.

¹ https://youtu.be/jvC9mPzdQN4

ACKNOWLEDGMENT

Manfred Eppe, Stefan Wermter, and Phuong Nguyen were supported by the German Research Foundation (DFG) through the IDEAS and LeCAREbot projects. The authors gratefully acknowledge partial support from the German Research Foundation (DFG) under project Crossmodal Learning (TRR-169).

REFERENCES

[1] N. Collins, “Reinforcement learning for live musical agents,” Proceedings of the International Computer Music Conference (ICMC), 2008. [Online]. Available: https://quod.lib.umich.edu/i/icmc/bbp2372.2008.005/1

[2] S. Le Groux and P. F. Verschure, “Towards adaptive music generation by reinforcement learning of musical tension,” Proceedings of the Sound and Music Computing Conference (SMC), pp. 71–77, 2010.

[3] J. Solis, K. Petersen, T. Ninomiya, M. Takeuchi, and A. Takanishi, “Development of anthropomorphic musical performance robots: From understanding the nature of music performance to its application to entertainment robotics,” in Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 2009. [Online]. Available: https://doi.org/10.1109/IROS.2009.5354547

[4] G. Hoffman and G. Weinberg, “Shimon: An interactive improvisational robotic marimba player,” in Proceedings of the Computer-Human Interaction Conference (CHI), 2010. [Online]. Available: https://doi.org/10.1145/1753846.1753925

[5] K. Shibuya, H. Ideguchi, and K. Ikushima, “Volume Control by Adjusting Wrist Moment of Violin-Playing Robot,” International Journal of Synthetic Emotions (IJSE), vol. 3, no. 2, pp. 31–47, 2012.


[6] F. Barnes, “Compressorhead,” 2013. [Online]. Available: http://compressorhead.org/

[7] N. Stanford, “Automatica,” 2017. [Online]. Available: https://nigelstanford.com/Automatica/

[8] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, “Solving Rubik’s Cube with a robot hand,” 2019. [Online]. Available: https://openai.com/blog/solving-rubiks-cube/

[9] T. Mizumoto, H. Tsujino, T. Takahashi, T. Ogata, and H. G. Okuno, “Thereminist robot: Development of a robot theremin player with feedforward and feedback arm control based on a theremin’s pitch model,” Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 2009. [Online]. Available: https://doi.org/10.1109/IROS.2009.5354473

[10] Y. Wu, P. Kuvinichkul, P. Y. Cheung, and Y. Demiris, “Towards anthropomorphic robot thereminist,” Proceedings of the International Conference on Robotics and Biomimetics (ROBIO), pp. 235–240, 2010.

[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous Control with Deep Reinforcement Learning,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016. [Online]. Available: https://arxiv.org/pdf/1509.02971v2.pdf

[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[13] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” Proceedings of the International Conference on Machine Learning (ICML), pp. 605–619, 2014.

[14] M. Eppe, S. Magg, and S. Wermter, “Curriculum Goal Masking for Continuous Deep Reinforcement Learning,” in International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2019, pp. 183–188.

[15] F. Roder, M. Eppe, P. D. H. Nguyen, and S. Wermter, “Curious Hierarchical Actor-Critic Reinforcement Learning,” in International Conference on Artificial Neural Networks (ICANN, to appear), 2020.

[16] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal Value Function Approximators,” in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 1312–1320.

[17] M. Eppe, P. D. H. Nguyen, and S. Wermter, “From Semantics to Execution: Integrating Action Planning with Reinforcement Learning for Robotic Causal Problem-solving,” Frontiers in Robotics and AI, vol. 6, 2019.

[18] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight Experience Replay,” in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2017, pp. 5048–5058.

[19] K. Doya and T. Sejnowski, “A novel reinforcement model of birdsong vocalization learning,” Advances in Neural Information Processing Systems, vol. 7, pp. 101–108, 1994.

[20] J. C. Brown, “Calculation of a constant Q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.

[21] E. Rohmer, S. P. Singh, and M. Freese, “V-REP: A versatile and scalable robot simulation framework,” Proceedings of the International Conference on Intelligent Robots and Systems (IROS), pp. 1321–1326, 2013.

[22] S. James, M. Freese, and A. J. Davison, “PyRep: Bringing V-REP to Deep Robot Learning,” arXiv preprint arXiv:1906.11176, 2019. [Online]. Available: http://arxiv.org/abs/1906.11176

[23] C. Micheyl, K. Delhommeau, X. Perrot, and A. J. Oxenham, “Influence of musical and psychoacoustical training on pitch discrimination,” Hearing Research, vol. 219, no. 1-2, pp. 36–47, 2006.

[24] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[25] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “OpenAI Baselines,” 2017. [Online]. Available: https://github.com/openai/baselines