
Page 1

Maximum Entropy-Regularized Multi-Goal Reinforcement Learning

Rui Zhao*, Xudong Sun, Volker Tresp
Siemens AG & Ludwig Maximilian University of Munich | June 2019 | ICML 2019

Page 2

Introduction

In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects trajectories into a replay buffer, and these trajectories are later sampled uniformly at random for replay.
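A minimal Python sketch of this standard collect-and-replay setup, assuming a simple trajectory container; the `Trajectory` and `ReplayBuffer` classes below are illustrative stand-ins, not the authors' implementation:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One episode: states, actions, the achieved goals g_s, and the desired goal g_e."""
    states: list
    actions: list
    achieved_goals: list   # tau^g = (g_{s_0}, ..., g_{s_T})
    desired_goal: object

@dataclass
class ReplayBuffer:
    capacity: int = 10_000
    trajectories: list = field(default_factory=list)

    def store(self, traj: Trajectory) -> None:
        if len(self.trajectories) >= self.capacity:
            self.trajectories.pop(0)          # drop the oldest episode
        self.trajectories.append(traj)

    def sample(self, batch_size: int) -> list:
        # Plain multi-goal RL replays stored trajectories uniformly at random;
        # MEP later replaces this step with entropy-based prioritization.
        return random.choices(self.trajectories, k=batch_size)
```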

OpenAI Gym Robotic Simulations

Page 3

Motivation

- We observed that the achieved goals in the replay buffer are often biased towards the behavior policies.

- From a Bayesian perspective (Murphy, 2012), when there is no prior knowledge of the target goal distribution, the agent should learn uniformly from diverse achieved goals.

- We want to encourage the agent to achieve a diverse set of goals while maximizing the expected return.

Page 4

Contributions

- First, we propose a novel multi-goal RL objective based on weighted entropy, which is essentially a reward-weighted entropy objective.

- Second, we derive a safe surrogate objective, that is, a lower bound of the original objective, to achieve stable optimization.

- Third, we develop a Maximum Entropy-based Prioritization (MEP) framework to optimize the derived surrogate objective.

- We evaluate the proposed method in the OpenAI Gym robotic simulations.

Page 5

A Novel Multi-Goal RL Objective Based on Weighted Entropy

Guiasu [1971] proposed weighted entropy, an extension of Shannon entropy. The weighted entropy is defined as

$$\mathcal{H}^{\mathbf{w}}_{\mathbf{p}} = -\sum_{k=1}^{K} w_k\, p_k \log p_k,$$

where $w_k$ is the weight of the $k$-th event and $p_k$ is the probability of the $k$-th event.

Our multi-goal RL objective is the weighted entropy of the achieved-goal trajectories, with the accumulated reward as the weight:

$$\eta^{\mathcal{H}}(\theta) = \mathcal{H}^{\mathbf{w}}_{\mathbf{p}}(\mathcal{T}^{g}) = \mathbb{E}_{p}\!\left[\log\frac{1}{p(\tau^{g})}\sum_{t=1}^{T} r(S_t, G^{e}) \,\middle|\, \theta\right]$$

This objective encourages the agent to maximize the expected return as well as to achieve more diverse goals.

* We use $\tau^{g}$ to denote all the achieved goals in the trajectory $\tau$, i.e., $\tau^{g} = (g_{s_0}, \ldots, g_{s_T})$.
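As a small, self-contained illustration (not taken from the released code), the weighted entropy can be computed directly from its definition; with uniform weights it reduces to Shannon entropy, and larger weights on rare but rewarding events raise the objective:

```python
import numpy as np

def weighted_entropy(weights: np.ndarray, probs: np.ndarray) -> float:
    """Guiasu's weighted entropy: H^w_p = -sum_k w_k * p_k * log(p_k)."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mask = probs > 0                      # treat 0 * log(0) as 0
    return float(-np.sum(weights[mask] * probs[mask] * np.log(probs[mask])))

p = np.array([0.5, 0.25, 0.25])
# With uniform weights w_k = 1 this is the Shannon entropy (about 1.04 nats).
print(weighted_entropy(np.ones_like(p), p))
# Up-weighting the rarer events increases the objective, mirroring how the
# accumulated reward acts as the weight in the paper's objective.
print(weighted_entropy(np.array([1.0, 2.0, 2.0]), p))
```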

Page 6

A Safe Surrogate Objective

The surrogate $\eta^{\mathcal{L}}(\theta)$ is a lower bound of the objective function, i.e., $\eta^{\mathcal{L}}(\theta) < \eta^{\mathcal{H}}(\theta)$, where

$$\eta^{\mathcal{H}}(\theta) = \mathcal{H}^{\mathbf{w}}_{\mathbf{p}}(\mathcal{T}^{g}) = \mathbb{E}_{p}\!\left[\log\frac{1}{p(\tau^{g})}\sum_{t=1}^{T} r(S_t, G^{e}) \,\middle|\, \theta\right]$$

$$\eta^{\mathcal{L}}(\theta) = Z \cdot \mathbb{E}_{q}\!\left[\sum_{t=1}^{T} r(S_t, G^{e}) \,\middle|\, \theta\right], \qquad q(\tau^{g}) = \frac{1}{Z}\, p(\tau^{g})\left(1 - p(\tau^{g})\right)$$

Here, $Z$ is the normalization factor for $q(\tau^{g})$, and $\mathcal{H}^{\mathbf{w}}_{\mathbf{p}}(\mathcal{T}^{g})$ is the weighted entropy (Guiasu, 1971; Kelbert et al., 2017), where, in our case, the weight is the accumulated reward $\sum_{t=1}^{T} r(S_t, G^{e})$.
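To make the bound concrete, here is a toy numerical check on a hand-made discrete example; the probabilities and returns below are invented for illustration and the returns are assumed non-negative. The inequality follows from $\log(1/p) \geq 1 - p$, which is why the surrogate never exceeds $\eta^{\mathcal{H}}(\theta)$:

```python
import numpy as np

# p: probabilities of achieved-goal trajectories tau^g (a small discrete toy case),
# R: their accumulated rewards sum_t r(S_t, G^e). All values are made up.
p = np.array([0.6, 0.3, 0.1])
R = np.array([1.0, 3.0, 5.0])

eta_H = np.sum(p * np.log(1.0 / p) * R)   # reward-weighted entropy objective
w = p * (1.0 - p)                         # unnormalized proposal q(tau^g)
Z = w.sum()                               # normalization factor Z
q = w / Z
eta_L = Z * np.sum(q * R)                 # = sum_k p_k (1 - p_k) R_k

print(eta_H, eta_L, eta_L <= eta_H)       # the surrogate stays below eta^H
```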

Page 7

Maximum Entropy-based Prioritization (MEP)

MEP algorithm: we update the density model to construct a higher-entropy distribution of achieved goals, and then update the agent with this more diversified training distribution.
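The loop below is only a high-level sketch of how such a prioritized update could look; `env`, `agent`, `buffer`, and `density_model` are hypothetical duck-typed stand-ins rather than the released DDPG+HER code, and the density estimates are assumed to be normalized to [0, 1]. Trajectories are re-weighted with the proposal distribution $q(\tau^{g})$ from the surrogate objective:

```python
import numpy as np

def train_mep(env, agent, buffer, density_model, epochs, episodes_per_epoch):
    """Sketch of a Maximum Entropy-based Prioritization training loop."""
    for epoch in range(epochs):
        for _ in range(episodes_per_epoch):
            traj = agent.rollout(env)        # collect one episode with the current policy
            buffer.store(traj)

        goals = [t.achieved_goals for t in buffer.trajectories]

        # 1) Update the density model of achieved-goal trajectories p(tau^g).
        density_model.fit(goals)
        p_hat = np.asarray(density_model.evaluate(goals))  # assumed to lie in [0, 1]

        # 2) Prioritize trajectories via q(tau^g) proportional to p(tau^g)(1 - p(tau^g)),
        #    which shifts replay toward a higher-entropy distribution of achieved goals.
        w = p_hat * (1.0 - p_hat)
        probs = w / w.sum()

        # 3) Update the agent with the more diversified training distribution.
        idx = np.random.choice(len(probs), size=agent.batch_size, p=probs)
        batch = [buffer.trajectories[i] for i in idx]
        agent.update(batch)
```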

Page 8

Mean success rate and training time

Page 9

Entropy of achieved goals versus training epoch

No MEP: 5.13 ± 0.33 With MEP: 5.59 ± 0.34

No MEP: 5.73 ± 0.33 With MEP: 5.81 ± 0.30

No MEP: 5.78 ± 0.21 With MEP: 5.81 ± 0.18

Page 10

Summary and Take-home Message

- Our approach improves performance by nine percentage points and sample efficiency by a factor of two while keeping computational time under control.

- Training the agent with many different kinds of goals, i.e., a higher entropy goal distribution, helps the agent to learn.

- The code is available on GitHub: https://github.com/ruizhaogit/mep

- Poster: 06:30 -- 09:00 PM @ Pacific Ballroom #32

Thank you!