Transfer with Action Embeddings for Deep Reinforcement Learning

Yu Chen1*, Yingfeng Chen1*, Zhipeng Hu2, Tianpei Yang3, Changjie Fan1, Yang Yu4, Jianye Hao3

1 NetEase Fuxi AI Lab
2 Zhejiang University
3 College of Intelligence and Computing, Tianjin University
4 National Key Laboratory for Novel Software Technology, Nanjing University

{chenyu5,chenyingfeng1,fanchangjie}@corp.netease.com, [email protected], {tpyang,jianye.hao}@tju.edu.cn, [email protected]

Abstract

Transfer learning (TL) is a promising way to improve the sample efficiency of reinforcement learning. However, efficiently transferring knowledge across tasks with different state-action spaces is still at an early stage of investigation. Most previous studies only addressed the inconsistency across different state spaces by learning a common feature space, without considering that similar actions in the action spaces of related tasks share similar semantics. In this paper, we propose a method to learn action embeddings that leverages this idea, and a framework that learns both state embeddings and action embeddings to transfer policy across tasks with different state and action spaces. Our experimental results on various tasks show that the proposed method can not only learn informative action embeddings but also accelerate policy learning.

1 Introduction

Deep reinforcement learning (DRL), which combines reinforcement learning algorithms and deep neural networks, has achieved great success in many domains, such as playing Atari games [Mnih et al., 2015], playing the game of Go [Silver et al., 2016] and robotic control [Levine et al., 2016]. Although DRL is viewed as one of the most promising paths toward General Artificial Intelligence (GAI), it is still criticized for its data inefficiency: training an agent from scratch requires a considerable number of interactions with the environment for each specific task. One approach to this problem is Transfer Learning (TL) [Taylor and Stone, 2009], which makes use of prior knowledge from relevant tasks to reduce the consumption of samples and improve performance on the target task.

Most current TL methods essentially share the knowledge contained in the parameters of neural networks. However, this kind of knowledge cannot be directly transferred in the cross-domain setting, where the source and target tasks have different state-action spaces.

∗Equal contribution

To overcome the domain gap, some studies focus on mapping state spaces into a common feature space, for example using manifold alignment [Ammar et al., 2015; Gupta et al., 2017], mutual information [Wan et al., 2020] and domain adaptation [Carr et al., 2019]. However, none of them considers that similar actions in the action spaces of related tasks share similar semantics. To illustrate this insight, we take the massively multiplayer online role-playing game (MMORPG) as an example. An MMORPG usually features a variety of roles, each equipped with a set of unique skills. These skills nevertheless share some similarities, since some of them cause similar effects, such as 'Damage Skill', 'Control Skill', 'Evasion Skill' and so on. Several works have studied action representations and used them to improve function approximation [Tennenholtz and Mannor, 2019], generalization [Chandak et al., 2019; Jain et al., 2020] or transfer between tasks with the same state-action spaces [Whitney et al., 2020]. In contrast to these studies, we investigate the feasibility of leveraging action embeddings to transfer policy across tasks with different action spaces. Intuitively, similar actions will be taken when performing tasks with the same goal. Hence, if the semantics of actions is learned explicitly, there is a chance to utilize this semantic information to transfer the policy.

The main challenge in action-based transfer is how to learn meaningful action embeddings that capture the semantics of actions. Our key insight is that the semantics of actions can be reflected by their effects, which in RL problems are defined by state transitions. Thus, we learn action embeddings through a transition model, which predicts the next state from the current state and the action embedding. Another challenge is how to transfer a policy across tasks with different state and action spaces. To this end, we propose a novel framework named TRansfer with ACtion Embeddings (TRACE) for DRL, which leverages both state embeddings and action embeddings for policy transfer. When transferring to the target task, we transfer both the policy model and the transition model learned from the source task. The nearest-neighbor algorithm used by the policy model selects the most similar action in the embedding space, while the transition model helps to align the action embeddings of the two tasks. The main contributions are summarized as follows:


• We propose a method to learn action embeddings, which can capture the semantics of the actions.

• We propose a novel framework TRACE which learns both state embeddings and action embeddings to transfer policy across tasks with different state and action spaces.

• Our experimental results show that TRACE can a) learn informative action embeddings; b) effectively improve sample efficiency compared with vanilla DRL and state-of-the-art transfer algorithms.

2 Related Work

2.1 Transfer in Reinforcement Learning

Transfer learning is considered an important and challenging direction in reinforcement learning and has become increasingly important. [Teh et al., 2017] proposes Distral, a method that uses a shared distilled policy for joint training of multiple tasks. [Finn et al., 2017] introduces a general meta-learning framework that achieves fast adaptation. Successor features and generalized policy improvement are also applied to transfer knowledge [Barreto et al., 2019; Ma et al., 2018]. [Yang et al., 2020] proposes a Policy Transfer Framework (PTF) which can effectively select the optimal source policy and accelerate learning. All these methods focus on tasks that only differ in reward functions or transition functions.

To transfer across tasks with different state and action spaces, many studies attempt to map state spaces into a common feature space. [Gupta et al., 2017] learns common invariant features between different agents from a proxy skill and uses them as an auxiliary reward; however, it requires corresponding state pairs of the two tasks, which may be difficult to acquire. Adversarial training [Wulfmeier et al., 2017; Carr et al., 2019] has also been used to align the representations of source and target tasks. Additionally, MIKT [Wan et al., 2020] learns an embedding space with the same dimension as the source task and uses lateral connections to transfer knowledge from teacher networks, which is similar to [Liu et al., 2019]. These methods focus on the connection between state spaces but ignore the relationship between action spaces. Recently, OpenAI Five [Raiman et al., 2019] introduces a technique named Surgery to determine which sections of the model should be retrained when the architecture changes. The method enables continual learning but is not suitable for tasks with totally different action spaces.

Inter-task mapping, which considers both state and action spaces, describes the relationship between tasks through explicit task mappings. [Taylor et al., 2007] manually constructs an inter-task mapping and builds a transferable action-value function based on it. Furthermore, Unsupervised Manifold Alignment (UMA) is used to learn an inter-task mapping autonomously from trajectories of the source and target tasks [Ammar et al., 2015]. [Zhang et al., 2021] learns state and action correspondences across domains using a cycle-consistency constraint to achieve policy transfer. The main difference from our work is that we embed actions into a common space instead of learning a direct mapping.

2.2 Action Embedding

Action embedding is first studied by [Dulac-Arnold et al., 2015], aiming to handle the explosion of action spaces in RL; however, the action embeddings are assumed to be given as a prior. Act2Vec is introduced by [Tennenholtz and Mannor, 2019], in which a skip-gram model is used to learn action representations from expert demonstrations. [Chandak et al., 2019] learns a latent space of actions by modeling the inverse dynamics of the environment, and a function is learned to map the embeddings to discrete actions. In this paper, by contrast, we learn representations of discrete actions through a forward transition model. Similarly, [Whitney et al., 2020] simultaneously learns embeddings of states and action sequences that capture the environment's dynamics to improve sample efficiency; however, they focus on continuous control tasks and consider the effects of action sequences. Action representations have also been used to enable generalization to unseen actions [Jain et al., 2020]. By contrast, we use them to transfer policy across different tasks.

3 Problem Definition

RL problems are often modeled as Markov Decision Processes (MDPs), defined as a tuple M = (S, A, T, R, γ), where S and A are the sets of states and actions. In this work, we restrict our method to discrete action spaces, and |A| denotes the size of the action set. T : S × A × S → [0, 1] is a state transition probability function, which can also be represented as the distribution of resulting states p(s_{t+1} | s_t, a_t) at time step t. R : S × A → R is a reward function measuring the performance of agents, and γ is a discount factor for future rewards. Additionally, a policy π : S × A → [0, 1] is defined as a conditional distribution over actions for each state. Given an MDP M, the goal of the agent is to find an optimal policy π* that maximizes the expected discounted return R = ∑_{t=0}^{∞} γ^t r_t.

In this paper, we consider the transfer problem between a source MDP M_S = (S_S, A_S, T_S, R_S, γ_S) and a target MDP M_T = (S_T, A_T, T_T, R_T, γ_T). We assume that the state and action spaces of the two MDPs are different, while there are some similarities in both the reward functions and the transition functions. For example, in one of our experimental tasks, M_S and M_T correspond to two different roles in an MMORPG fighting against an enemy. While the dimensions of the actions (skills) and states are completely different, both agents need to defeat the enemy, with a reward that depends on the final result: win, lose or tie. Besides, though the actions (skills) of the two roles are different, their effects can be similar; for example, both agents may choose a damage skill that causes similar damage to the enemy.
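As a concrete illustration of the return objective above, the following Python sketch (our own illustration; the function name and example rewards are assumptions) computes the discounted return of a finite trajectory, using the reward scheme of the gridworld task in Section 5.1.

def discounted_return(rewards, gamma=0.99):
    # R = sum_t gamma^t * r_t for a finite trajectory of rewards.
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# Example: two -0.05 step penalties followed by the +10 goal reward.
print(discounted_return([-0.05, -0.05, 10.0], gamma=0.99))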

4 Transfer with Action Embeddings

In this section, we introduce the TRACE framework. We first discuss how to learn meaningful action embeddings, and then describe how the action embeddings can be combined with RL algorithms to facilitate policy transfer.

Figure 1: Illustration of the process of learning action embeddings. Given a state transition tuple, we first obtain the action embedding e(a_t) from W_ae. The transition model is used to predict the next state s_{t+1}. The red arrow denotes the gradients of the loss, and the grey grid denotes input variables.

4.1 Learning Action Embeddings

To learn action embeddings that capture the semantics of actions, our main insight is that the semantics of actions can be reflected by their effects on the environment, which are measured by the state transition probability in RL. Thus, we aim to learn an action embedding e(a) ∈ R^d with dimension d for each a ∈ A, which should satisfy two properties: (1) the distance between action embeddings is small if the actions have similar effects on the environment; (2) the embeddings should be sufficient, so that the distribution conditioned on the embeddings approximates the one conditioned on the actions, p(s_{t+1} | s_t, a_t) ≈ p(s_{t+1} | s_t, e(a_t)). We approximate this by learning a transition model f_{θ_D} with parameters θ_D, which predicts the next state s_{t+1} from the current state s_t and action a_t, to capture the dynamics of the environment.

Figure 1 illustrates the learning process of the action embeddings. At the beginning, an embedding matrix W_ae ∈ R^{|A|×d} is instantiated, in which the i-th row of the matrix is the embedding vector e(a_i). Given a state transition tuple (s_t, a_t, s_{t+1}), we first obtain the action embedding e(a_t) by a lookup in the current matrix W_ae. Then a latent variable z_t is sampled from z_t ∼ N(µ_t, σ_t), where [µ_t, σ_t] = f_{θ_D1}(s_t, e(a_t)), as in a variational autoencoder (VAE) [Kingma and Welling, 2013]. The latent variable introduces stochasticity to cope with stochastic environments, similar to [Goyal et al., 2017]. Further, the model predicts the next state ŝ_{t+1} with a multi-layered feed-forward neural network conditioned on z_t, s_t, and e(a_t), i.e. ŝ_{t+1} = f_{θ_D2}(s_t, e(a_t), z_t). The transition model and action embeddings are optimized to minimize the prediction error:

L(θ_D, W_ae) = E_{s_t, a_t, s_{t+1}} [ ||ŝ_{t+1} − s_{t+1}||²₂ + β D_KL( N(µ_t, σ_t) || N(0, I) ) ]    (1)

where θ_D = {θ_D1, θ_D2} and β is a scaling factor. Note that if the transition dynamics of the tasks are non-Markovian, we can use a recurrent transition model (such as an LSTM [Hochreiter and Schmidhuber, 1997]) to learn the dynamics more accurately.
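The following PyTorch sketch shows one way the embedding matrix W_ae and the transition model of Equation (1) could be implemented; the class name, layer sizes and the use of nn.Embedding are our own assumptions and not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionModel(nn.Module):
    # Predicts s_{t+1} from (s_t, e(a_t)) through a VAE-style latent z_t, as in Equation (1).
    def __init__(self, num_actions, state_dim, embed_dim=4, z_dim=8, hidden=64):
        super().__init__()
        # W_ae: one learnable d-dimensional embedding per discrete action.
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        # f_{theta_D1}: outputs the mean and log-variance of the latent z_t.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim))
        # f_{theta_D2}: predicts the next state from (s_t, e(a_t), z_t).
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + embed_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def loss(self, s_t, a_t, s_next, beta=1e-2):
        e_a = self.action_embed(a_t)                                    # lookup e(a_t) from W_ae
        mu, log_var = self.encoder(torch.cat([s_t, e_a], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)        # reparameterized z_t
        s_pred = self.decoder(torch.cat([s_t, e_a, z], -1))             # predicted next state
        recon = F.mse_loss(s_pred, s_next)                              # ||s_pred - s_{t+1}||^2 (averaged)
        kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp()) # KL(N(mu, sigma) || N(0, I))
        return recon + beta * kl

Minimizing this loss by gradient descent jointly updates θ_D and W_ae, as in Lines 11-12 of Algorithm 1.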

Algorithm 1 TRACE training algorithm on source task

1: Randomly initialize the policy f_θπ, state embedding f_θse, transition model f_θD, and action embeddings W_ae
   /* State embedding is optional. In same-domain transfer, we set f_θse(s) = s */
2: Initialize replay buffer B
3: for episode = 1 to L do
4:    Receive initial state s_1 from the environment
5:    for timestep t = 1 to T do
6:       Select proto-action â_t = f_θπ(f_θse(s_t)) and action a_t = g(â_t) according to the current policy and action embeddings
7:       Execute action a_t, receive reward r_t, and observe the new state s_{t+1}
8:       Add tuple (s_t, a_t, r_t, s_{t+1}) to B
9:       Sample a random batch B' i.i.d. from B
10:      Update θπ and θse according to the SAC training loss
11:      Sample a random batch B'' i.i.d. from B and compute the state embedding f_θse(s) for each state in the batch
12:      Update θ_D and W_ae according to Equation (A.1)
13:   end for
14: end for
15: return θ_D, θπ, θse, W_ae

Algorithm 2 TRACE transfer algorithm on target task
Input: Parameters θ_DS, θ_πS from the source task

1: Initialize the policy f_θπ and transition model f_θD with θπ = θ_πS and θ_D = θ_DS, and randomly initialize the state embedding f_θse and action embeddings W_ae
   /* In same-domain transfer, we set f_θse(s) = s */
2: Train the model according to Lines 2-15 in Algorithm 1
3: return θ_D, θπ, θse, W_ae

4.2 Policy Training and Transfer

Train Policy with Action Embeddings
In this section, we describe the training process of TRACE combined with SAC. Note that TRACE is not limited to SAC; it can be extended to any other RL algorithm with appropriate adaptations.

Algorithm 1 outlines the training process of TRACE on the source task (the state embedding is introduced in Section Cross-Domain Transfer). First, we initialize the action embeddings and the network parameters of the policy model and the transition model (Line 1). During training, to select an action at each timestep, the output of the policy model (in the continuous embedding space) must be mapped to the original discrete action space. In this paper, we use a standard nearest-neighbor algorithm to perform the mapping (Line 6) [Dulac-Arnold et al., 2015]. Specifically, the policy parameterized by θπ outputs a proto-action â = f_θπ(s) ∈ R^d for a given state s. The action actually executed is then chosen by a nearest-neighbor search over the learned action embeddings:

g(â) = argmin_{a∈A} ||â − e(a)||₂

where g(·) is a mapping from the continuous space to the discrete space: it returns the action in A whose embedding is closest to the proto-action â in terms of L2 distance.

Figure 2: The architecture of TRACE for tasks with different state spaces. When transferring to the target task, the parameters in grey grids are transferred as initialization. The brown dashes mean that the data is used by the other part, but the gradient does not propagate across the module. The number of circles denotes the size of the vector, and the subscripts S and T denote the source and target task, respectively.

The agent executes action a_t = g(â_t), receives reward r_t, observes the next state s_{t+1}, and stores the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer B (Lines 7-8). It then updates the policy model, the action embeddings and the transition model accordingly, following the SAC loss [Haarnoja et al., 2018] and Equation (A.1) (Lines 9-12).
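A minimal sketch of the nearest-neighbor mapping g(·) used above (our own illustration; the function and variable names are assumptions):

import torch

def nearest_action(proto_action, action_embeddings):
    # g(a_hat): index of the action whose embedding is closest to the proto-action in L2 distance.
    # proto_action: tensor of shape (d,); action_embeddings: tensor of shape (|A|, d), the rows of W_ae.
    dists = torch.norm(action_embeddings - proto_action, dim=-1)
    return torch.argmin(dists).item()

# Example: with four actions embedded in 2-D, the proto-action is snapped to the closest row (index 0).
W_ae = torch.tensor([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
a = nearest_action(torch.tensor([0.9, 0.1]), W_ae)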

Same-Domain Transfer
With a trained source policy, we now describe how to transfer the policy to the target task. For better understanding, we start from a simple setting where the state spaces of the source and target tasks remain the same while the action spaces are different; we call this same-domain transfer. For example, a character in a game carries different sets of skills to perform the same task, and each set differs in the number and type of skills. In this setting, we can directly transfer the policy f_θπS and fine-tune it on the target task. The agent should behave similarly when facing the same state, and the nearest-neighbor algorithm will find the most relevant skills in the target task if the action embeddings of the source and target tasks are well aligned. To achieve this, we transfer the transition model's parameters θ_DS learned from the source task and freeze them. Then the action embeddings of the target task W_aeT ∈ R^{|A_T|×d} are optimized according to Equation (A.1). In this way, we can align the action embeddings of the two tasks, which is also validated in our experiments.

Cross-Domain Transfer
In cross-domain transfer, where tasks differ in both state and action spaces, the transition model cannot be reused directly since the state dimensions differ between the source and target tasks. The premise of reusing the transition model is that states can be embedded into the same or a similar space of the same size. Thus, the input of the transition model becomes a tuple (f_θse(s_t), a_t, f_θse(s_{t+1})), where f_θse(·) denotes a non-linear function with parameters θse that maps the original state space into a common space, called the state embedding. In this work, we train the state embedding along with the policy. Note that the two modules (policy model and transition model) become interdependent: the transition model needs state embeddings as training input, and the policy requires action embeddings to select actions.

Figure 3: The learned embeddings projected into 2D space via PCA. Each dot in (a) represents an action embedding in gridworld; we show the action effect of each group. (b) Action embeddings of the source task (2-step gridworld) and the target task (1-step gridworld).

Therefore, we train the two modules together, which also increases data utilization. The architecture is shown in Figure 2.

When training on the target task, we initialize the RL policy f_θπT and the transition model f_θDT with the parameters θ_πS and θ_DS from the source task. At the same time, the state embedding f_θseT and the action embeddings W_aeT are randomly reinitialized. Then we train the network as on the source task (Lines 2-15 in Algorithm 1). Algorithm 2 outlines the transfer process. Unlike the same-domain transfer, the transition model parameters are not frozen but fine-tuned. This is because the inconsistency in the transition model increases when the state dimensions are different, which leads to unstable training. We also verify this in our experiments shown in the appendix.
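The parameter-transfer step of Algorithm 2 can be sketched as follows. This is our own illustration that reuses the TransitionModel sketch from Section 4.1; the helper name, module attributes and the use of deepcopy are assumptions.

import copy
import torch.nn as nn

def init_target(policy_src, transition_src, target_num_actions, embed_dim, same_domain=True):
    # Transfer the policy model (theta_pi_T <- theta_pi_S) and transition model (theta_D_T <- theta_D_S).
    policy_tgt = copy.deepcopy(policy_src)
    transition_tgt = copy.deepcopy(transition_src)
    # Randomly re-initialize the target action embeddings W_ae_T (and, in cross-domain
    # transfer, also the state embedding f_{theta_se_T}, omitted here).
    transition_tgt.action_embed = nn.Embedding(target_num_actions, embed_dim)
    if same_domain:
        # Same-domain transfer: freeze theta_D so that optimizing W_ae_T aligns the
        # target embeddings with the source embedding space.
        for name, p in transition_tgt.named_parameters():
            if not name.startswith("action_embed"):
                p.requires_grad = False
    # Cross-domain transfer: keep theta_D trainable and fine-tune it along with W_ae_T.
    return policy_tgt, transition_tgt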

5 Experiments

In this section, we empirically investigate the feasibility of learning action embeddings with the transition model and assess the effectiveness of TRACE.

We compare our method, abbreviated as TRACE-PT (transferring both the Policy model (P) and the Transition model (T)), with three baselines: a) SAC [Haarnoja et al., 2018], which learns from scratch on the target tasks; b) a basic transfer (BT) strategy, which replaces the input and output layers of the neural network learned on the source task with new learnable layers matching the dimensions of the target task and fine-tunes the whole network; and c) MIKT [Wan et al., 2020], which leverages policies learned from source tasks as teachers to enhance learning on the target tasks.

Figure 4: Experiment results on gridworld environments: (a) task n = 1 (source task n = 3); (b) task n = 2 (source task n = 1); (c) task n = 3 (source task n = 2). Methods: TRACE(no transfer), TRACE-PT, TRACE-P, TRACE-T, BT, MIKT, SAC. The solid lines denote our method, and the dashed lines represent the compared methods. The shaded areas are bootstrapped 95% confidence intervals.

Note that BT can be seen as an ablation of our method that does not use action embeddings. We also report the result of our method trained from scratch on the target task, denoted by TRACE(no transfer). All the methods are based on SAC, and the results are averaged over 10 individual runs.

Appendix A provides detailed descriptions of the environments and hyperparameters used in the experiments, and more experimental results are available in Appendix B.

5.1 N-Step Gridworld

We first validate our method in an 11 × 11 gridworld, in which an agent needs to reach a randomly assigned goal position. The agent can perform 4 atomic actions at each step: Up, Down, Left, and Right. Once we consider combo moves of n successive steps, the number of actions becomes 4^n. We conduct experiments on three settings with n ∈ {1, 2, 3}; the sizes of the action spaces are 4, 16 and 64, respectively. The state consists of the current position (x, y) and the goal position (x, y). The agent receives a -0.05 reward for each step and a +10 reward when it reaches the goal.

To investigate whether the action embeddings capture the semantics of actions, we set n = 3 and the action embedding dimension d = 4, and sample 10,000 transitions with a random policy. Figure 3(a) shows the Principal Component Analysis (PCA) projections of the resulting embeddings, where different colors denote different action effects. For example, there are 9 blue dots (↑), including ↑↑↓, ←↑→, etc. We see that actions with the same effect are positioned closely and clustered into 16 separate groups. Moreover, the clusters show near-perfect symmetry along the four directions of the gridworld, which means our method effectively captures the semantics of actions. In word embeddings, relationships between words are often discussed, such as Paris - France + Italy = Rome [Mikolov et al., 2013]. In our action embeddings we observe a similar property, such as e(↑↑←) + e(↑←→) − e(←→←) ≈ e(↑↑↑).
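A sketch of how this analysis can be reproduced from a trained embedding matrix (the embeddings array and labels list are assumed to be available from training; the helper names are ours):

import numpy as np
from sklearn.decomposition import PCA

def project_2d(embeddings):
    # Project the (|A|, d) embedding matrix onto its first two principal components.
    return PCA(n_components=2).fit_transform(embeddings)

def nearest_label(query, embeddings, labels):
    # Return the label of the embedding closest to the query vector in L2 distance.
    dists = np.linalg.norm(embeddings - query, axis=1)
    return labels[int(np.argmin(dists))]

# Checking the arithmetic property amounts to a nearest-neighbor query, e.g.
#   q = e("up-up-left") + e("up-left-right") - e("left-right-left")
#   nearest_label(q, embeddings, labels)   # expected: "up-up-up"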

Further, we evaluate the policy transfer performance on the tasks n ∈ {1, 2, 3}. Note that the state spaces of these tasks are the same, so we freeze the parameters θ_D of the transferred transition model as described in Section 4.2. As seen in Figure 4, TRACE-PT learns faster on the target tasks than all the other methods in all tasks. MIKT learns faster than SAC, but slower than BT, because reusing previous parameters is more efficient than distillation in the same-domain setting. Besides, we find that SAC performs better than TRACE(no transfer) in all tasks because we map discrete actions into a continuous space, which makes the policy harder to learn, especially when the number of actions is small. Note that there are no jump-starts on these curves because the action embeddings of the target tasks are randomly initialized; however, the action embeddings adapt quickly, which results in fast transfer. Moreover, the action embeddings of the target task should align with those of the source task so that the policy can perform well on the target task. To verify this, we project the embeddings of both the source task n = 2 and the target task n = 1 into 2D space. As shown in Figure 3(b), the embeddings of the source task and the target task are well aligned, and we even observe that e(↑) ≈ 0.5 ∗ e(↑↑) + 0.5 ∗ e(↑↓).

5.2 Mujoco and Roboschool

Next, we consider a more difficult cross-domain transfer setting, in which the state spaces as well as the action spaces are different. We conduct experiments on four environments: InvertedPendulum and InvertedDoublePendulum in Mujoco [Todorov et al., 2012] and in Roboschool, denoted by mP (mujoco Pendulum), mDP (mujoco Double Pendulum), rP (roboschool Pendulum) and rDP (roboschool Double Pendulum) for short. Currently, our method is only suitable for discrete action spaces, so we discretize the original m-dimensional continuous control action space into k equally spaced values in each dimension, resulting in a discrete action space with |A| = k^m actions. The detailed configurations and descriptions of the environments are summarized in Appendix A.3, and the learned action embeddings of the environments are shown in Figure 11 in the Appendix.
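The discretization described above is simply the Cartesian product of k equally spaced values per action dimension; a minimal sketch (our own, with assumed names):

import itertools
import numpy as np

def discretize_action_space(low, high, k):
    # Turn an m-dimensional box [low, high] into k**m discrete actions.
    # Row i of the result is the continuous action that discrete action i maps back to.
    bins = [np.linspace(l, h, k) for l, h in zip(low, high)]   # k values per dimension
    return np.array(list(itertools.product(*bins)))            # shape (k**m, m)

# Example: mP has a 1-D action in [-3, 3] discretized into 101 values (see Table 2 in Appendix A).
actions_mP = discretize_action_space([-3.0], [3.0], 101)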

In this experiment, our method is evaluated on four transfer tasks. We first transfer the policy from mDP to mP and from rDP to rP, where the source and target tasks run in the same underlying physics engine and the target tasks are easier than the source ones. We then transfer the policy from mP to rDP and from rP to mDP, where the source and target tasks run in different physics engines and the target tasks are more challenging than the source tasks. Figure 5 depicts the results of cross-domain transfer. Overall, TRACE-PT still learns faster than SAC and BT in all tasks, and MIKT performs slightly better than SAC. As shown in Figures 5(a) and 5(b), BT has a promising performance in the easier settings, but it fails to learn the target task in the more challenging transfer tasks and even exhibits negative transfer, as shown in Figures 5(c) and 5(d).

Figure 5: Experiment results on Mujoco and Roboschool tasks: (a) mP (source task mDP); (b) rP (source task rDP); (c) mDP (source task rP); (d) rDP (source task mP). Methods: TRACE(no transfer), TRACE-PT, TRACE-P, TRACE-T, BT, MIKT, SAC. The solid lines denote our method, and the dashed lines represent the compared methods. The shaded areas are bootstrapped 95% confidence intervals.

According to our extended experiments shown in Figure 10 in Appendix B, BT only performs well in the simplest cases; in most cases, it leads to negative transfer. In contrast, TRACE-PT can handle all the transfer tasks and accelerate learning. This indicates that our approach is superior to BT, and that its advantages are more apparent in more challenging transfer tasks.

5.3 Combat Tasks in a Commercial Game

To inspect the potential of our method in more practical problems, we validate it on a one-versus-one combat scenario in a commercial game, where the agent can carry different sets of skills to fight against a built-in opponent. In this experiment, we select two classes named She Shou (SS) and Fang Shi (FS). The state representations are extracted manually and consist of information about the controlled agent and the built-in opponent, forming 48- and 60-dimensional vectors, respectively. The size of the action space is 10 for both classes, containing their unique skills and standard operations such as move and attack. The agent receives positive rewards for dealing damage and winning, and negative rewards for taking damage and losing. Besides, the agent is punished for choosing skills that are not ready.

We first verify whether our method can learn reasonable action embeddings in complex games.

Figure 6: PCA projections of learned action embeddings. Each dot represents the action embedding of a skill in the game.

We randomly sample 5 out of 15 skills and collect transition data; Table 4 of Appendix A lists the skill descriptions. We sample 50,000 transitions to train action embeddings with d = 6. Figure 6 plots the result. We notice that the skills with special effects, such as Silence and Stun, are clearly distinguished, and that damage skills are closer to each other. As annotated in the figure, DoT Damage and Instant Damage are recognized as well.

For policy transfer, we first train a policy individually for each of the two roles and then transfer to the other. The performance is measured by the average winning rate over the most recent 100 episodes, as plotted in Figure 7. For clarity, the ablation results are shown in Appendix B.

Figure 7: Experiment results on combat tasks: (a) SS (source task FS); (b) FS (source task SS).

We can see that TRACE-PT achieves better sample efficiency than MIKT, while BT may lead to negative transfer. This shows that our method can be applied to more practical problems.

5.4 Ablations

To better understand our method, we analyze the contributions of the transition model and the policy model to the performance improvement. The ablation experiments are designed as follows:

• TRACE-P: Transfer policy model only, and randomly initialize the transition model.

• TRACE-T: Transfer transition model only, and train the policy from scratch.

• BT: Transfer policy, and do not use action embedding.

For same-domain transfer (Figure 4), we find that TRACE-P can also accelerate learning significantly. However, there is still a gap between TRACE-PT and TRACE-P, because TRACE-P needs to learn the transition model and action embeddings anew, so the action embeddings of the target task may not align with those of the source task and the policy requires more time to adapt. In the cross-domain transfer (Figure 5), however, TRACE-P achieves a performance competitive with TRACE-PT, especially in Figure 5(b). This is understandable: the transition model and the action embeddings are learned from state embeddings, which are retrained on the target task, and this limits the benefit of transferring them. Besides, in both same-domain and cross-domain transfer, TRACE-T performs similarly to TRACE(no transfer), which may indicate that the cost of training mainly lies in policy training and that action representations can boost policy generalization. Comparing TRACE-PT and BT, we find that the proposed action embeddings indeed facilitate policy transfer.

6 Conclusion

In this paper, we study how to leverage action embeddings to transfer across tasks with different action spaces and/or state spaces. We propose a method to effectively learn meaningful action embeddings by training a transition model. Further, we train RL policies with action embeddings by using a nearest-neighbor search in the embedding space. The policy and transition model are transferred to the target task, leading to quick adaptation of the policy. Extensive experiments demonstrate that our approach significantly improves sample efficiency, even across different state and action spaces. In the future, we will try to extend our method to continuous action spaces and align the state embeddings with additional constraints.

References

[Ammar et al., 2015] Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, and Matthew E Taylor. Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[Barreto et al., 2019] Andre Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Zidek, and Remi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. arXiv preprint arXiv:1901.10964, 2019.

[Carr et al., 2019] Thomas Carr, Maria Chli, and George Vogiatzis. Domain adaptation for reinforcement learning on the Atari. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 1859–1861, 2019.

[Chandak et al., 2019] Yash Chandak, Georgios Theocharous, James Kostas, Scott M. Jordan, and Philip S. Thomas. Learning action representations for reinforcement learning. In ICML, pages 941–950, 2019.

[Dulac-Arnold et al., 2015] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

[Finn et al., 2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.

[Goyal et al., 2017] Anirudh Goyal Alias Parth Goyal, Alessandro Sordoni, Marc-Alexandre Cote, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pages 6713–6723, 2017.

[Gupta et al., 2017] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949, 2017.

[Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Jain et al., 2020] Ayush Jain, Andrew Szot, and Joseph Lim. Generalization to new actions in reinforcement learning. In International Conference on Machine Learning, pages 4661–4672. PMLR, 2020.

[Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.

[Liu et al., 2019] Iou Jen Liu, Jian Peng, and Alexander G Schwing. Knowledge flow: Improve upon your teachers. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

[Ma et al., 2018] Chen Ma, Junfeng Wen, and Yoshua Bengio. Universal successor representations for transfer reinforcement learning. arXiv preprint arXiv:1804.03758, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[Raiman et al., 2019] Jonathan Raiman, Susan Zhang, and Christy Dennison. Neural network surgery with sets. arXiv preprint arXiv:1912.06719, 2019.

[Silver et al., 2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[Taylor and Stone, 2009] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. JMLR, 10(Jul):1633–1685, 2009.

[Taylor et al., 2007] Matthew E Taylor, Peter Stone, and Yaxin Liu. Transfer learning via inter-task mappings for temporal difference learning. JMLR, 8(Sep):2125–2167, 2007.

[Teh et al., 2017] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.

[Tennenholtz and Mannor, 2019] Guy Tennenholtz and Shie Mannor. The natural language of actions. In International Conference on Machine Learning, pages 6196–6205, 2019.

[Todorov et al., 2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[Wan et al., 2020] Michael Wan, Tanmay Gangwani, and Jian Peng. Mutual information based knowledge transfer under state-action dimension mismatch. In Conference on Uncertainty in Artificial Intelligence, pages 1218–1227. PMLR, 2020.

[Whitney et al., 2020] William Whitney, Rajat Agarwal, Kyunghyun Cho, and Abhinav Gupta. Dynamics-aware embeddings. In International Conference on Learning Representations, 2020.

[Wulfmeier et al., 2017] Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.

[Yang et al., 2020] Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan, Weixun Wang, Zhaodong Wang, and Jiajie Peng. Efficient deep reinforcement learning through policy transfer. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 2053–2055, 2020.

[Zhang et al., 2021] Qiang Zhang, Tete Xiao, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Learning cross-domain correspondence for control with dynamics cycle-consistency. ICLR, 2021.

A Environment Details and Hyperparameters

A.1 Experiment Settings

In our experiments, the n-step gridworld and Mujoco environments are deterministic. Hence, the latent variable z_t is not necessary in these environments, and it is adopted only in the combat tasks. Without z_t, the loss function of the transition model reduces to an MSE loss:

L(θ_D, W_ae) = E_{s_t, a_t, s_{t+1}} [ ||ŝ_{t+1} − s_{t+1}||²₂ ]

A.2 Gridworld

The gridworld environment consists of an agent and a goal, as shown in Figure 8(a). At each episode, the agent and the goal are spawned at random positions. The objective of the agent is to reach the goal.

States: The state is a 4-dimensional vector consisting of the current position (x, y) and the goal position (x, y).

Actions: An action of the agent in n-step gridworld indicates n consecutive moves in the four directions. Taking 2-step gridworld as an example, the action space has 16 actions: { ↑↑, ↑↓, ↑←, ↑→, ←↑, ←↓, ←←, ←→, →↑, →↓, →←, →→, ↓↑, ↓↓, ↓←, ↓→ }. Once the agent selects an action, it executes the moves step by step. If the agent hits the boundary, it stays in its current position.
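A minimal sketch of how the combo action set and its step-by-step execution could be implemented (our own illustration; the names are assumptions):

import itertools

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def combo_actions(n):
    # All 4**n combo actions of the n-step gridworld, e.g. ("up", "up", "left") for n = 3.
    return list(itertools.product(MOVES, repeat=n))

def apply_combo(pos, combo, size=11):
    # Execute a combo move step by step; a step that would leave the grid is ignored.
    x, y = pos
    for step in combo:
        dx, dy = MOVES[step]
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            x, y = nx, ny
    return (x, y)

assert len(combo_actions(2)) == 16                   # 2-step gridworld has 16 actions
assert apply_combo((5, 5), ("up", "down")) == (5, 5)  # moving up then down returns to the start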

Rewards: The agent receives a -0.05 reward each move and a +10 reward when the agent reaches the goal.

Termination: Each game is terminated when the agent reachesthe goal, or the agent has taken 20 actions.

The hyperparameters of the experiment are available in Table 1.

Table 1: Parameter settings in gridworld experiment

SAC
  state embed dim: -
  state embed hiddens: -
  ac hiddens: [200, 100]
  actor lr: 1e-5
  critic lr: 1e-3
  τ: 0.999
  α: 0.2
  γ: 0.99

Transition Model
  action embed dim: 2
  hiddens: [64, 32]
  lr: 1e-3

A.3 Mujoco and Roboschool

We conduct experiments on the InvertedPendulum and InvertedDoublePendulum tasks in two simulators, Mujoco and Roboschool, denoted by mP, rP, mDP and rDP, respectively. In all tasks, the agent needs to apply a force to the cart to keep the poles balanced upright. Figure 8(b) presents screenshots of the tasks.

States: The state of mP is a 4-dimensional vector representing the position and velocity of the cart and the pole. For rP, the state is a 5-dimensional vector, including the position and velocity of the cart and the joint angle with its sine and cosine. The state vector of mDP consists of the cart's position, the sine and cosine of the two joint angles, the velocities of the position and the angles, and the constraint forces on the position and the angles, with a total length of 11. For rDP, the state is described by the position and velocity of the cart, the pole's position, the two angles, and the sine and cosine of the angles.

Actions: We discretize the original continuous action spaces into equally spaced values. We summarize the detailed configurations in Table 2.

Rewards: In mP and rP, the agent gets a +1 reward for keeping the pole upright. For mDP and rDP, the agent receives a +10 reward each step and a negative reward for high velocity and drifting from the center point.

Termination: Each game is terminated when the agent has taken100 actions, or the poles are too low.

Table 3 shows the hyperparameters in this experiment.

Table 2: Configurations of the four environments

Task   State Dim   Act Range   Discretized Act Dim
mP     4           [-3, 3]     101
mDP    11          [-1, 1]     51
rP     5           [-1, 1]     91
rDP    9           [-1, 1]     71

Table 3: Parameter settings in Mujoco and Roboschool experiment

SAC
  state embed dim: 5
  state embed hiddens: [200, 100]
  ac hiddens: [200, 100]
  actor lr: 1e-5
  critic lr: 1e-3
  τ: 0.999
  α: 0.2
  γ: 0.99

Transition Model
  action embed dim: 3
  hiddens: [64, 32]
  lr: 1e-3

A.4 Combat Tasks in the Commercial Game

In this scenario, the agent needs to move and use its skills reasonably to defeat an opponent controlled by rules. Figure 8(c) displays a screenshot of the scenario. To aid understanding of the learned action embeddings in the paper, we summarize the effects of She Shou's skills in Table 4.

States: The state consists of the HPs, attack, defense, and buffs of the agent and the opponent, and whether each skill is available. Thus, different classes have different state vector sizes.

Actions: The action space contains six unique skills of the classand some common operations, such as move, attack, and dummy.

Figure 8: (a) The n-step gridworld environment (1-step, 2-step and 3-step). (b) Pendulum and DoublePendulum tasks. (c) Combat tasks in the commercial game.

Figure 9: The experiment results on gridworld: (a) task n = 1 (source task n = 2); (b) task n = 2 (source task n = 3); (c) task n = 3 (source task n = 1). Methods: AE-SAC(no transfer), AE-SAC-PT, BT, MIKT, SAC. The solid lines denote our method, and the dashed lines represent the compared methods. The shaded areas are bootstrapped 95% confidence intervals.

Table 4: Skill descriptions of She Shou

Skill ID   Effect
1    Cause damage and fire DoT damage for 15 seconds
2    Cause damage and poisoning damage for 7 seconds
3    Cause damage and poisoning damage for 7 seconds
4    Set a trap and cause damage every second
5    Cause damage and fire DoT damage for 15 seconds
6    Cause damage and poisoning damage for 7 seconds
7    Cause damage
8    Stun enemy for 2.5 seconds
9    Summon an arrow tower, which causes damage every second
10   Cause damage
11   Cause damage
12   Knock back enemy
13   Cause damage and silence enemy for 6 seconds
14   Cause a large amount of damage and create an aura

Rewards: At each step, the reward is calculated from the HP changes of the agent and the opponent. Besides, a negative reward is given if the agent selects a skill that is not ready. A final reward of +10 is given if the agent wins, and -10 otherwise.
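A hedged sketch of how such a shaped reward could be composed from the quantities described above; the coefficients and the function signature are our own assumptions, not the game's actual values.

def combat_reward(agent_hp_delta, opponent_hp_delta, chose_unready_skill, done=False, won=False):
    # HP deltas are negative when HP is lost, so damage dealt is rewarded and damage taken is punished.
    r = -opponent_hp_delta + agent_hp_delta
    if chose_unready_skill:
        r -= 1.0                      # assumed penalty magnitude for selecting a skill that is not ready
    if done:
        r += 10.0 if won else -10.0   # final +10 / -10 reward described above
    return r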

Termination: Each game is terminated when the agent has taken 100 actions, or either side is dead.

The hyperparameters used in this experiment are available in Table 5.

B Extended Results

B.1 Action Embedding

In the paper, we have shown the learned embeddings of the gridworld and combat tasks. Here, we exhibit action embeddings learned on top of state embeddings in the Mujoco and Roboschool tasks.

Table 5: Parameter settings in combat tasks experiment

SAC
  state embed dim: 25
  state embed hiddens: [200, 100]
  ac hiddens: [300, 200]
  actor lr: 1e-5
  critic lr: 1e-3
  τ: 0.999
  α: 0.2
  γ: 0.99

Transition Model
  action embed dim: 6
  hiddens: [128, 64]
  z dim: 8
  z hiddens: [32,]
  β: 1e-2
  lr: 1e-3

Figure 10: The experiment results on Mujoco and Roboschool: (a) mP (source task rP); (b) mP (source task rDP); (c) rP (source task mP); (d) rP (source task mDP); (e) mDP (source task mP); (f) mDP (source task rDP); (g) rDP (source task rP); (h) rDP (source task mDP). Methods: AE-SAC(no transfer), AE-SAC-PT, BT, MIKT, SAC. The solid lines denote our method, and the dashed lines represent the compared methods.

Figure 11: Action embeddings (projected via PCA) of the source task (mDP, colored in orange) and the target task (mP, colored in blue). The x-axis denotes the index of the discretized actions, and the y-axis is the corresponding projected value.

The relation between actions should be linear in these tasks since they are discretized from a continuous action space. Figure 11 plots the 1-dimensional PCA projections of the embeddings. Although there are some local oscillations, the overall trend of the curves is linear. This shows that our method can still learn meaningful latent action representations based on state embeddings.

B.2 Policy Transfer

For each set of environments, there are many different combinations of source and target tasks. Here we report the remaining transfer results of our method from source tasks that are not included in the paper.

Figure 9 and Figure 10 show the results on these tasks. As shown in the figures, AE-SAC-PT outperforms BT and MIKT in all transfer tasks and exhibits improved sample efficiency compared to SAC. Moreover, we can see that BT leads to negative transfer in most cases. In contrast, AE-SAC-PT handles all the transfer tasks and accelerates learning.

Figure 12 depicts the full experiment results on the combat tasks.

Figure 12: Experiment results on combat tasks: (a) SS (source task FS); (b) FS (source task SS). The legend refers to the paper.

C Impact of Dimension of Action Embeddings

In this section, we analyze the impact of the dimension of action embeddings on the semantics of the learned embeddings and on transfer performance. We conduct experiments on gridworld navigation tasks with action embedding dimensions d ∈ {1, ..., 6}.

Figure 13 plots the visualizations of the learned embeddings. Generally, as the dimension of the action embeddings increases, the result becomes harder to interpret because additional dimensions introduce more noise.

Figure 14 reports the training and transfer performance with different action embedding dimensions. First, we notice that training fails when d = 1 because one dimension is not enough to capture the action semantics. Figure 14(a) shows that the training speed decreases as the dimension d increases, because the embedding space gets larger. However, the transfer performance is only slightly influenced, as shown in Figure 14(b) and (c). The optimal setting of d depends on the dynamics of the environment and the action semantics.

D Freeze Parameters of the Transition Model

In this section, we compare the performance of TRACE with and without freezing the transition model's parameters in cross-domain transfer. Figure 15 depicts the results.

Figure 13: Visualizations of learned action embeddings with different dimensions: (a) d = 2; (b) d = 3; (c) d = 4; (d) d = 5; (e) d = 6.

Figure 14: The experiment results with different action embedding dimensions (d = 1 to 6) on 2-step gridworld: (a) without transfer; (b) transfer from 1-step gridworld; (c) transfer from 3-step gridworld.

Figure 15: Transfer results on (a) mDP (source task rP) and (b) rDP (source task rP).

We can see that freezing the parameters of the transition model in cross-domain transfer results in unstable training and negative transfer.