
Scalable Lifelong Reinforcement Learning

Yusen Zhan

The School of Electrical Engineering and Computer Science, Washington State University,

Pullman, WA 99163, USA

Haitham Bou Ammar

Prowler.io, Cambridge, United Kingdom

Matthew E. Taylor∗

The School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99163, USA

Abstract

Lifelong reinforcement learning provides a successful framework for agents to learn multiple consecutive tasks sequentially. Current methods, however, suffer from scalability issues when the agent has to solve a large number of tasks. In this paper, we remedy the above drawbacks and propose a novel scalable technique for lifelong reinforcement learning. We derive an algorithm which assumes the availability of multiple processing units and computes shared repositories and local policies using only local information exchange. We then show an improvement to reach a linear convergence rate compared to current lifelong policy search methods. Finally, we evaluate our technique on a set of benchmark dynamical systems and demonstrate learning speed-ups and reduced running times.

Keywords: Reinforcement Learning, Lifelong Learning, Distributed Optimization, Transfer Learning

∗Corresponding author
Email addresses: [email protected] (Yusen Zhan), [email protected] (Haitham Bou Ammar), [email protected] (Matthew E. Taylor)

Preprint submitted to Journal of LaTeX Templates, July 23, 2017


1. Introduction

Reinforcement learning (RL) provides the ability to solve sequential decision-

making problems with limited feedback. Applications with these characteristics

range from robotics control [1] to personalized medicine [2, 3]. Though suc-

cessful, typical RL methods require a substantial amount of experience before

acquiring acceptable behavior. The cost of obtaining such experience, however,

can be prohibitively expensive in terms of time and data [4]. Also, these costs

only worsen when considering multiple tasks1. Transfer learning2 and multi-task

learning [7, 8] have been developed to remedy these problems by allowing agents

to reuse knowledge from other tasks. Unfortunately, both these techniques suf-

fer from scalability problems as the number of tasks considered grows large. In

a recent attempt to target these issues, Bou Ammar et al. proposed PG-ELLA,

an online hybrid of the above two paradigms—transfer and multi-task learn-

ing [9]. PG-ELLA decomposes a task’s policy parameters into a shared latent

repository L and task-specific coefficients s_t, one for each task. Consequently,

it allows for knowledge transfer between tasks (using the latent repository L)

while scaling multi-task learning by streaming problems online. In the original

paper, the authors show a closed form solution to the shared repository.

In a lifelong setting, the agent deals with a sequence of tasks, potentially lead-

ing to computational intractability when the agent observes and learns many
tasks (e.g., in a big data scenario, the agent may observe many tasks in its life).

Unfortunately, PG-ELLA suffers when considering a large number of tasks or

dimensions — determining the shared knowledge-base L becomes intractable —

due to two inefficiencies. The first inefficiency is that computing the expansion’s

operating point amounts to solving a local RL problem described by the cur-

rent task’s observed trajectories. The second inefficiency reducing PG-ELLA’s

1Multi-task learning aims to learn a single policy or strategy that handles a set of tasks. See [5] for details.
2The idea of transfer learning is to reuse previously trained information to speed up the learning process. See [6] for details.


scalability arises when updating the shared repository L.

In this paper, we remedy the above drawbacks of lifelong policy search and

propose a scalable method capable of handling large number of tasks and high-

dimensional policies, while ensuring meaningful parameters at each iteration of

the algorithm (i.e., each step conveys more information than a single gradient

step, as done in PG-ELLA). Our method assumes multiple processing units

and derives a novel algorithm for lifelong policy search. Such an assumption

leads us to better convergence rates and more accurate local policies. Using

augmented Lagrangian methods, we propose two algorithms for computing the

local operating point and the shared repository, while allowing for scalable solu-

tions. We show that such updates can be computed locally by each processing

unit in closed form under broad assumptions. We theoretically and empirically

analyze the performance of these new techniques. On the theoretical side, we

demonstrate superiority to PG-ELLA by proving dimensionality-free linear con-

vergence in both determining local policies and updating the shared repository.

On the empirical side, we show our algorithm outperforms existing algorithms

on a set of 5 domains with 50 variations, including the control of a simulated

helicopter. In a final set of experiments, we also demonstrate the improved

generalization capabilities of this new method on unobserved tasks and report

a decrease in learning time, relative to current techniques. The contributions

of this paper can be summarized as: i) deriving a novel scalable algorithm for

lifelong policy search, ii) acquiring linear convergence rate in both steps of the

method, iii) demonstrating the effectiveness of our technique on 5 dynamical

systems with 50 variations each, and iv) demonstrating learning speed-ups on

unobserved tasks.

The rest of the paper is organized as follows: Section 2 introduces the background; Section 3 discusses related work; Section 4 defines our lifelong policy search problem; Section 5 proposes the scalable lifelong policy search method; Section 6 presents the experimental results; and Section 7 concludes the paper.


2. Reinforcement Learning

In reinforcement learning (RL) an agent must sequentially select actions to

maximize its total expected pay-off. These problems are typically formalized as

Markov decision processes (MDPs) ⟨X, U, P, R, γ⟩, where X ⊆ R^d and U ⊆ R^m

denote the state and action spaces. P : X × U × X → [0, 1] represents the

transition probability governing the dynamics of the system, R : X × U → R is

the reward function quantifying the performance of the agent and γ ∈ [0, 1) is a

discount factor specifying the degree to which rewards are discounted over time.

At each time step m, the agent is in state x_m ∈ X and must choose an action u_m ∈ U, transitioning it to a successor state x_{m+1} ∼ p(x_{m+1} | x_m, u_m) as given by P and yielding a reward r_{m+1} = R(x_m, u_m). A policy π : X × U → [0, 1] is defined as a probability distribution over state-action pairs, where π(u_m | x_m) denotes the probability of choosing action u_m in state x_m.

Policy gradients [1, 4] are a class of reinforcement learning algorithms that

have shown successes in solving complex robotic problems [1]. Such methods

represent the policy πθ(um|xm) by an unknown vector of parameters θ ∈ Rd.

The goal is to determine the optimal parameters θ? that maximize the expected

average pay-off:

\[ \mathcal{J}(\theta) = \int_{\tau} p_{\theta}(\tau)\, \mathcal{R}(\tau)\, d\tau, \]

where τ = [x0:T ,u0:T ] denotes a trajectory over a possibly infinite horizon T .

The probability of acquiring a trajectory, pθ(τ ), under the policy parameteri-

zation πθ(·) and average per-time-step return R(τ ) is given by:

\[ p_{\theta}(\tau) = p_0(x_0) \prod_{m=0}^{M-1} p\left(x_{m+1} \mid x_m, u_m\right) \pi_{\theta}\left(u_m \mid x_m\right), \qquad \mathcal{R}(\tau) = \frac{1}{M} \sum_{m=0}^{M} r_{m+1}, \]

with an initial state distribution p0 : X → [0, 1]. Policy gradient methods, such

as episodic REINFORCE [10] and Natural Actor Critic [11, 12], typically em-

ploy a lower-bound on the expected return J (θ) for fitting the unknown policy

parameters θ. To achieve this, such algorithms generate trajectories using the

current policy θ, and then compare performance with a new parameterization θ̃.


As detailed in [1], the lower bound on the expected return can be attained

using Jensen’s inequality and the concavity of the logarithm:

\[ \log \mathcal{J}(\tilde{\theta}) = \log \int_{\tau} p_{\tilde{\theta}}(\tau)\, \mathcal{R}(\tau)\, d\tau \;\geq\; \int_{\tau} p_{\theta}(\tau)\, \mathcal{R}(\tau) \log \frac{p_{\tilde{\theta}}(\tau)}{p_{\theta}(\tau)}\, d\tau + \text{const.} \;\propto\; -\mathrm{KL}\!\left(p_{\theta}(\tau)\mathcal{R}(\tau) \,\middle\|\, p_{\tilde{\theta}}(\tau)\right) = \mathcal{J}_{L,\theta}\big(\tilde{\theta}\big), \]

where $\mathrm{KL}(p(\tau)\,\|\,q(\tau)) = \int_{\tau} p(\tau) \log \frac{p(\tau)}{q(\tau)}\, d\tau$. Thus, we can directly optimize the lower bound instead of the original objective.
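
As a concrete illustration of the quantities above, the following sketch estimates the episodic policy gradient for a linear-Gaussian policy $u_m \sim \mathcal{N}(\Phi(x_m)\theta, \Sigma)$ from sampled trajectories. It is a minimal sketch rather than the paper's implementation; the names `Phi`, `Sigma_inv`, and the trajectory container are illustrative assumptions.

```python
import numpy as np

def policy_gradient_estimate(trajectories, theta, Phi, Sigma_inv):
    """Monte-Carlo estimate of the policy gradient for a linear-Gaussian
    policy u ~ N(Phi(x) @ theta, Sigma).  `trajectories` is assumed to be a
    list of (states, actions, rewards) tuples."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R = np.mean(rewards)                  # average per-step return R(tau)
        for x, u in zip(states, actions):
            F = Phi(x)                        # feature matrix, shape (m, d)
            # gradient of log N(u | F @ theta, Sigma) with respect to theta
            grad += R * F.T @ Sigma_inv @ (u - F @ theta)
    return grad / len(trajectories)
```

A gradient-ascent step on θ with such an estimate is the basic building block that lifelong policy search methods reuse per task.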

3. Related Work

Transfer learning aims to improve the learning speed of an agent on a new

target task by transferring knowledge learned from one or more previous source

tasks [6]. However, most current transfer learning methods only consider the

target tasks instead of all previously visited tasks. In contrast, multi-task learn-

ing (MTL) methods optimize performance over all tasks by training models over

all possible tasks [7, 8], but are computationally expensive in the lifelong learning setting since the agent must learn a large number of tasks over time [13, 14].

Many works consider parallelizing reinforcement learning. For example, Caarls and Schuitema apply parallel online temporal difference (TD) learning to motor control [15]. Gu et al. proposed an asynchronous parallel method for deep reinforcement learning [16]. Yahya et al. introduce a distributed asynchronous guided policy search method for deep reinforcement learning [17]. Levine et al. also employed Bregman ADMM, a centralized ADMM method, in the guided policy search setting [18]. However, all of the above methods are based on standard reinforcement learning rather than the lifelong learning setting.

PG-ELLA is a recent lifelong policy gradient RL algorithm that can effi-

ciently learn multiple tasks consecutively while sharing knowledge between task

policies to accelerate learning [9]. In fact, PG-ELLA, which is a centralized

method, can be viewed as the single-agent (one-node) case of the algorithm we propose

in this paper. MTL for policy gradients has also been explored by Deisenroth


et al. [19] through customizing a single parameterized controller to individual

tasks that differ only in the reward function. Another closely related work is

on hierarchical Bayesian MTL [20], which can learn RL tasks consecutively, but

unlike our approach, requires discrete states and actions. Snel and Whiteson’s

representation learning approach is also related, but limited to a potential func-

tion representation [21]. GO-MTL is a related multi-task learning algorithm, but it focuses on the supervised learning setting [22]. All of these MTL methods suffer

from scalability issues due to their dependence on centralized methods.

4. Lifelong Policy Search

We adopt the lifelong learning framework previously introduced in [9]. An

agent has to master multiple RL tasks while supporting transfer to improve

learning. Let T denote the set of tasks in which each element is an MDP. At

any time, the learner may face any of the previously seen tasks — it must max-

imize its performance across all tasks. The goal is to learn optimal policies $\pi^\star_{\theta^\star_1}, \ldots, \pi^\star_{\theta^\star_{|\mathcal{T}|}}$ for all tasks, with corresponding parameters $\theta^\star_1, \ldots, \theta^\star_{|\mathcal{T}|}$, where $\theta^\star_j \in \mathbb{R}^d$ parameterizes the policy for task $j$. For each task $j$, the learner samples a set of $n_j$ trajectories $\mathbb{T}^{(j)} = \{\tau^{(1)}_j, \ldots, \tau^{(n_j)}_j\}$ from the environment using predefined policies, and each trajectory has length $M^{(j)}$. To transfer knowl-

edge between tasks, we adopt the shared knowledge-base representation, where

the policy parameters for each task are represented as a linear combination of

a shared latent basis L ∈ Rd×p and coefficient vectors sj ∈ Rp, i.e., θj = Lsj .

Here, each column of L encodes a set of transferable knowledge that is tailored

to suit the peculiarities of each task using sj . Such a decomposition has been

widely used in the multi-task learning literature [9, 13, 22, 23]. Hence, the

optimization problem for a total of t tasks can be written as:

\[ \min_{L,\, s_{1:t}}\; \frac{1}{t} \sum_{j=1}^{t} \Big( -\mathcal{J}_j(L s_j) + \mu_1 \|s_j\|_1 \Big) + \mu_2 \|L\|_F^2, \tag{1} \]

where ||v||1 and ||Z||F denote the L1 norm of v ∈ Rp and the Frobenius norm

of Z ∈ Rd×p, respectively. µ1 and µ2 denote regularization parameters, and


Jj(Lsj) is the total pay-off for a task j:

\[ \mathcal{J}_j(L s_j) = \int_{\tau^{(j)} \in \mathbb{T}^{(j)}} p^{(j)}_{L s_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)\, d\tau^{(j)}. \]

Here, $p^{(j)}_{L s_j}(\tau^{(j)})$ and $\mathcal{R}^{(j)}(\tau^{(j)})$ denote the task-specific trajectory probability and expected total reward:
\[ p^{(j)}_{L s_j}\big(\tau^{(j)}\big) = p^{(j)}_0\big(x^{(j)}_0\big) \prod_{m=0}^{M^{(j)}-1} p^{(j)}\big(x^{(j)}_{m+1} \mid x^{(j)}_m, u^{(j)}_m\big)\, \pi^{(j)}_{L s_j}\big(u^{(j)}_m \mid x^{(j)}_m\big), \]
\[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) = \frac{1}{M^{(j)}} \sum_{m=0}^{M^{(j)}} \mathcal{R}^{(j)}\big(x^{(j)}_m, u^{(j)}_m\big). \]

As noted in [9], Equation (1) describes batch multi-task learning as opposed to the online lifelong learning setting. This is due to the dependency of the above optimization problem on all trajectories from all tasks. To remedy this problem, the authors employed a local second-order Taylor expansion of the lower bound to $\mathcal{J}_t(\theta_t)$ around $\theta^\star_t = \arg\min_{\theta} -\mathcal{J}_{L,\theta_t}(\theta)$, where $-\mathcal{J}_{L,\theta_t}(\theta) = \mathrm{KL}\big(p_{\theta_t}(\tau)\mathcal{R}(\tau)\,\|\,p_{\theta}(\tau)\big)$, leading to:

\[ \min_{L,\, s_{1:t}}\; \frac{1}{t} \sum_{j=1}^{t} \Big( \big\|\theta^\star_j - L s_j\big\|^2_{H^{(j)}} + \mu_1 \|s_j\|_1 \Big) + \mu_2 \|L\|_F^2, \tag{2} \]

where $j$ denotes task $j$, $\|v\|^2_{H^{(j)}} = v^{\mathsf{T}} H^{(j)} v$, and $H^{(j)}$ denotes the Hessian defined as:
\[ H^{(j)} = -\mathbb{E}_{\tau^{(j)} \sim p_{\theta^\star_j}(\tau^{(j)})} \left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=1}^{M^{(j)}} \nabla^2_{\theta_j, \theta_j} \log \pi_{\theta_j}\big(u^{(j)}_m \mid x^{(j)}_m\big) \right]. \]

The problems of PG-ELLA. Although successful, as shown in [9, 23], PG-ELLA

is unscalable to a large number of tasks, which is typical for the lifelong learning

setting. Here, we differentiate the two inefficiency sources of PG-ELLA. First,

we note that although the dependency on all tasks has been remedied using a

second-order expansion, computing the operating point θ⋆_j can be expensive. In essence, determining θ⋆_j equates to solving a local MDP (i.e., the jth) defined

over T(j). This led the authors to propose a one-step gradient update given

a small set of observed trajectories for a task j. Apart from the reduction in


the number of available trajectories, gradient methods are known to exhibit

slow convergence rates and are prone to local minima. Consequently, a PG-

ELLA update can lead to a low-quality latent repository L due to the error in

approximating θ⋆_j.

The second inefficiency reducing PG-ELLA's scalability arises when updat-

ing the shared repository L. On one hand, the update involves an inversion of

a dp × dp matrix, leading to a complexity of $\mathcal{O}(d^3p^2)$ in the best case. Consequently, such a method does not scale to high-dimensional policy parameterizations or large latent dimensions p. Moreover, as mentioned in [23], gradient-based methods share the problems mentioned above. On the other hand, as the number of tasks grows large, these centralized methods (updating L or computing θ⋆_j) become intractable due to computational and memory constraints.

5. Scalable Lifelong Policy Search

Our strategy to remedy the two problems above consists of both scaling

PG-ELLA to multiple trajectories and multiple tasks through more processing

units, as well as generalizing gradient methods to algorithms with faster con-

vergence rates. We assume the existence of multiple processors defined through an undirected graph G = (V, E), with V denoting the set of nodes and E the set

of edges. We assume a natural ordering among the nodes from 1, . . . , |V|, and e_ij ∈ E denotes the edge between nodes i and j with i < j. We define a block matrix A with |E| block rows and |V| block columns. Suppose edge e_ij is the k-th edge in E, associated with the k-th block row of A; then the block (k, i) of A equals the identity matrix I ∈ R^{d×d}, i.e., A(k, i) = I, the block (k, j) equals −I, i.e., A(k, j) = −I, and all other blocks are set to the zero matrix. A(I) denotes the general edge-node incidence matrix built with the identity matrix I.

The intuition is that we distribute all the tasks among the nodes in the graph, and each node computes its local variables using only the local variables of the nodes it is connected to by edges in E. One potential drawback is the communication cost between nodes.
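
For concreteness, the sketch below builds the block edge-node incidence matrix A(I) described above for a small graph; the function name and edge encoding are illustrative assumptions.

```python
import numpy as np

def edge_node_incidence(edges, n_nodes, d):
    """Block edge-node incidence matrix A(I): for the k-th edge e_ij (i < j),
    block (k, i) is +I_d and block (k, j) is -I_d; all other blocks are zero."""
    A = np.zeros((len(edges) * d, n_nodes * d))
    I = np.eye(d)
    for k, (i, j) in enumerate(edges):        # 0-indexed node pairs with i < j
        A[k * d:(k + 1) * d, i * d:(i + 1) * d] = I
        A[k * d:(k + 1) * d, j * d:(j + 1) * d] = -I
    return A

# Example: a 3-node chain; A @ theta_stack = 0 enforces agreement across nodes.
A = edge_node_incidence([(0, 1), (1, 2)], n_nodes=3, d=2)
```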


5.1. Scaling the Computation of θ⋆_j

As mentioned earlier, the first problem of current lifelong policy search algo-

rithms relates to the scalability in determining the local operating point θ⋆_j for the second-order Taylor expansion in Equation (2). Given a set of trajectories $\mathbb{T}^{(j)}$ for a task j, θ⋆_j is determined as the solution to:
\begin{align*} \theta^\star_j &= \arg\min_{\theta}\, -\mathcal{J}^{(j)}_{L,\theta_j}(\theta) = \arg\min_{\theta}\, \mathrm{KL}\Big(p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \,\Big\|\, p^{(j)}_{\theta}\big(\tau^{(j)}\big)\Big) \\ &= \arg\min_{\theta} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}} p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta}\big(\tau^{(j)}\big)}\, d\tau^{(j)}, \end{align*}
where $\theta_j$ denotes the parameters of the current sampling policy for task $j$.

Given the above processing units, we assume a random split of the set $\mathbb{T}^{(j)}$ among the nodes $1, \ldots, |\mathcal{V}|$ such that $\mathbb{T}^{(j)} = \cup_{i=1}^{|\mathcal{V}|} \mathbb{T}^{(j)}_i$, and thus rewrite the optimization problem of θ⋆_j as:
\begin{align*} \min_{\theta^{(j)}_1, \ldots, \theta^{(j)}_{|\mathcal{V}|}} \; & \sum_{i=1}^{|\mathcal{V}|} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} \tag{3} \\ \text{s.t. } & \theta^{(j)}_i = \theta^{(j)}_l, \text{ for all } e_{il} \in \mathcal{E} \text{ and } i < l, \end{align*}

where the constraints $\theta^{(j)}_i = \theta^{(j)}_l$, for all $e_{il} \in \mathcal{E}$ and $i < l$, have been added so as to ensure agreement across the different processing units in $\mathcal{G}$. To rewrite the constraints using a compact representation, we define the following: a vector $\theta^{(j)}_{:} = \big[\theta^{\mathsf{T},(j)}_1, \ldots, \theta^{\mathsf{T},(j)}_{|\mathcal{V}|}\big]^{\mathsf{T}} \in \mathbb{R}^{d|\mathcal{V}|}$ and a matrix $A = A(I_{d\times d}) \in \mathbb{R}^{d|\mathcal{E}| \times d|\mathcal{V}|}$, where $I_{d\times d}$ is the $d$-dimensional identity matrix. Using the above notation, we reformulate the problem in Equation (3) as:
\begin{align*} \min_{\theta^{(j)}_1, \ldots, \theta^{(j)}_{|\mathcal{V}|}} \; & \sum_{i=1}^{|\mathcal{V}|} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} \\ \text{s.t. } & A\theta^{(j)}_{:} = 0. \end{align*}

The above is a constrained optimization problem that we aim to solve efficiently. Generally, this problem can be intractable due to the non-convexity of each of the local objectives. Under the broad class of exponential family policies, however,


each of the local objectives is convex in $\theta^{(j)}_i$ and can thus be solved optimally and efficiently. Our technique for determining the solution relies on a blend between an augmented Lagrangian and a decomposition-coordination procedure similar to the alternating direction method of multipliers (ADMM) [24]. In contrast to standard ADMM, our method is required to determine the feasible point based exclusively on local interactions among the computational nodes. This setting is well known in the distributed optimization literature [25]. Unfortunately, current methods for distributed ADMM mostly focus on the univariate case (as detailed in [25]). The univariate case is limited by its one-dimensional setting — we generalize a distributed version of ADMM to our multi-dimensional setting to propose a scalable and efficient solver for lifelong policy search. Let $\lambda \in \mathbb{R}^{d|\mathcal{E}|}$ be a vector of Lagrange multipliers associated with the constraint $A\theta^{(j)}_{:} = 0$, and let $\rho > 0$ be a constant weighting the penalty term $\big\|A\theta^{(j)}_{:}\big\|^2_2$. We apply augmented Lagrangian methods (similar to existing work [24, 25]), so that the Lagrangian function becomes:
\[ \mathcal{L}_{\mathrm{Aug}}\big(\theta^{(j)}_{:}, \lambda\big) = \sum_{i=1}^{|\mathcal{V}|} \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} - \lambda^{\mathsf{T}} A\theta^{(j)}_{:} + \frac{\rho}{2} \big\|A\theta^{(j)}_{:}\big\|^2_2. \]

At this stage, we could use standard ADMM for acquiring primal-dual updates and thus determining the solution $\theta^{\star,(j)}_{:}$. Standard ADMM, however, is not readily applicable to our setting due to the need for a distributed solution in which each node performs parameter updates based only on local information exchange with its neighbors. To achieve such a distributed solution, we generalize the approach in [25] to the multi-dimensional setting and associate a dual variable with the constraint on each edge. Each processor $i$ keeps a local decision estimate $\theta^{(j)}_i$ and a vector of dual variables $\lambda_{ki}$ with $k < i$. We define two sets of neighbors of a node $i$, denoted by $\mathcal{P}_i$ and $\mathcal{S}_i$. Here, $\mathcal{P}_i$ collects all nodes having an index lower than $i$, i.e., $\mathcal{P}_i = \{l \mid e_{li} \in \mathcal{E},\, l < i\}$, and $\mathcal{S}_i$ all nodes with an index higher than $i$, i.e., $\mathcal{S}_i = \{l \mid e_{il} \in \mathcal{E},\, i < l\}$.


Algorithm 1 Scalable Operating Point Computation

1: Initialize $\theta^{(j)}_i$, $\forall i \in \{1, \ldots, |\mathcal{V}|\}$, and set $\rho > 0$.
2: for $k = 0, \ldots, K$ do
3: Each agent $i$ updates $\theta^{(j)}_i$ in sequential order from $i = 1, \ldots, |\mathcal{V}|$ using:
\[ \theta^{(k+1),(j)}_i = \arg\min_{\theta^{(j)}_i} \Bigg[ \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} + \frac{\rho}{2} \sum_{l \in \mathcal{P}_i} \bigg\| \theta^{(k+1),(j)}_l - \theta^{(j)}_i - \frac{1}{\rho}\lambda^{(k)}_{li} \bigg\|^2_2 + \frac{\rho}{2} \sum_{l \in \mathcal{S}_i} \bigg\| \theta^{(j)}_i - \theta^{(k),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{il} \bigg\|^2_2 \Bigg]. \]
4: Each agent updates $\lambda_{li}$ for $l \in \mathcal{P}_i$ as:
\[ \lambda^{(k+1)}_{li} = \lambda^{(k)}_{li} - \rho\big(\theta^{(k+1),(j)}_l - \theta^{(k+1),(j)}_i\big). \]
5: end for

Our algorithm to determine θ⋆_j is summarized by the set of instructions in Algorithm 1. Algorithm 1 consists of two steps. First, primal updates are performed by each node in G (line 3). Second, given the updated primal, the dual variables are computed (line 4).
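
The overall structure of Algorithm 1 can be sketched as follows; `solve_node` stands in for the per-node subproblem on line 3 (for instance, the closed-form Gaussian update given below), and all argument names are illustrative assumptions rather than the paper's interface.

```python
import numpy as np

def distributed_operating_point(solve_node, preds, succs, d, n_nodes,
                                rho=1.0, n_iters=50):
    """Skeleton of Algorithm 1: a sequential primal sweep over the nodes,
    followed by a dual update on every edge (l, i) with l < i."""
    theta = [np.zeros(d) for _ in range(n_nodes)]
    lam = {(l, i): np.zeros(d) for i in range(n_nodes) for l in preds[i]}
    for _ in range(n_iters):
        for i in range(n_nodes):                       # line 3: primal updates
            pred_msgs = [(theta[l], lam[(l, i)]) for l in preds[i]]
            succ_msgs = [(theta[l], lam[(i, l)]) for l in succs[i]]
            theta[i] = solve_node(i, pred_msgs, succ_msgs)
        for (l, i) in lam:                             # line 4: dual updates
            lam[(l, i)] = lam[(l, i)] - rho * (theta[l] - theta[i])
    return theta
```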

Note that in the case of Gaussian policies, which can also be generalized to any policy from the exponential family, the primal updates on line 3 of Algorithm 1 can be computed in closed form (see Appendix B for the derivation):

\begin{align*} \theta^{(k+1),(j)}_i = & \left[ \frac{-\rho\, \mathrm{deg}_i}{2} I_{d\times d} + \mathbb{E}_{\tau^{(j)} \sim p_{\theta^{(j)}_i}(\cdot)}\!\left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m) \right] \right]^{-1} \\ & \times \Bigg[ \mathbb{E}_{\tau^{(j)} \sim p_{\theta^{(j)}_i}(\cdot)}\!\left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} u^{(j)}_m \right] - \frac{\rho}{2} \left( \sum_{l \in \mathcal{S}_i} \left[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \right] + \sum_{l \in \mathcal{P}_i} \left[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \right] \right) \Bigg], \end{align*}

where $\mathrm{deg}_i$ is the degree of processing unit $i$ (i.e., the size of its set of neighbors) and $\Phi(\cdot)$ is a feature map of the system's state space. Having solved the problem of computing the linearization operating point, the second step needed for scaling lifelong policy search is handling the inefficiency in updating the shared repository. We detail the solution to this step in the next section.
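
Before moving on, the snippet below sketches one way to implement the Gaussian-policy primal step from sampled trajectories: it solves node i's penalized quadratic objective directly by setting its gradient to zero. It is a sketch under the linear-Gaussian assumption, not a transcription of the closed-form expression above, so constant factors may differ; `Phi`, `Sigma_inv`, and the message containers follow the Algorithm 1 sketch and are assumptions.

```python
import numpy as np

def gaussian_primal_step(trajs, Phi, Sigma_inv, rho, pred_msgs, succ_msgs, d):
    """Closed-form solve of node i's penalized objective for a policy
    u ~ N(Phi(x) @ theta, Sigma), estimated from the node's local trajectories."""
    M = rho * (len(pred_msgs) + len(succ_msgs)) * np.eye(d)
    b = np.zeros(d)
    for states, actions, rewards in trajs:
        R = np.mean(rewards)                           # average per-step return
        for x, u in zip(states, actions):
            F = Phi(x)                                 # feature matrix, shape (m, d)
            M += R * F.T @ Sigma_inv @ F / len(trajs)
            b += R * F.T @ Sigma_inv @ u / len(trajs)
    for theta_l, lam in pred_msgs:                     # neighbors updated this sweep
        b += rho * (theta_l - lam / rho)
    for theta_l, lam in succ_msgs:                     # neighbors from previous sweep
        b += rho * (theta_l + lam / rho)
    return np.linalg.solve(M, b)
```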

Theoretical Guarantees: We now state a theorem establishing a linear convergence rate for our proposed algorithm.

Theorem 1. The sequence $\big\{\theta^{(k),(j)}_{:}, \lambda^{(k)}\big\}_{k \geq 0}$ consists of
\[ \theta^{(k),(j)}_{:} = \big[\theta^{\mathsf{T},(j)}_1, \ldots, \theta^{\mathsf{T},(j)}_{|\mathcal{V}|}\big]^{\mathsf{T}} \quad \text{and} \quad \lambda^{(k)} = \big[\lambda^{(k),\mathsf{T}}_1, \ldots, \lambda^{(k),\mathsf{T}}_{|\mathcal{E}|}\big]^{\mathsf{T}}, \]
where $\lambda^{(k)}_i \in \mathbb{R}^d$ is a dual vector. Following Algorithm 1, the sequence converges linearly with a rate given by $\mathcal{O}(1/k)$.

(The interested reader can find the proof in Appendix A.)

5.2. Scaling the Model Update

The lifelong policy search problem is framed as the solution to the opti-

mization problem in Equation (2). It is easy to see that such a problem is not

jointly convex in the shared repository L and the coefficient vectors s_j. Holding one variable fixed, however, leads to a convex problem in the other. Consequently, following such an alternating procedure, one can solve a sequence of convex optimization problems iteratively. The computations over the s_j vectors are relatively simple to scale. The scalability problem arises when trying to update the latent repository L due to its dependency on all tasks observed thus far. To scale the computation of the shared repository L, we follow a similar strategy to that of the previous section. Holding s_j fixed, to compute L we need to solve:

\[ \min_{L}\; \frac{1}{t} \sum_{j=1}^{t} \big\|\theta^\star_j - L s_j\big\|^2_{H_j} + \mu_2 \|L\|^2_F. \]

First, we introduce the $\mathrm{vec}(Z)$ notation, which vectorizes a matrix $Z \in \mathbb{R}^{d\times p}$ into a vector in $\mathbb{R}^{dp}$, i.e.,
\[ \mathrm{vec}(Z) = [a_{1,1}, \ldots, a_{d,1}, a_{1,2}, \ldots, a_{d,2}, \ldots, a_{1,p}, \ldots, a_{d,p}]^{\mathsf{T}}, \]


where $a_{i,j}$ denotes the $(i, j)$-th element of the matrix $Z$. Consequently, we can write $\mathrm{vec}(L s_j) = \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\, \mathrm{vec}(L)$, with $\otimes$ being the Kronecker product and $I_{d\times d}$ the $d$-dimensional identity matrix. Hence, we can rewrite the optimization problem in the following equivalent form:

\[ \min_{\mathrm{vec}(L)}\; \frac{1}{t} \sum_{j=1}^{t} \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\, \mathrm{vec}(L)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L)\|^2_2, \]
where $\|L\|^2_F = \|\mathrm{vec}(L)\|^2_2$. Splitting this problem across $\mathcal{G}$ leads us to the following:
\begin{align*} \min_{\mathrm{vec}(L_1), \ldots, \mathrm{vec}(L_{|\mathcal{V}|})}\; & \sum_{i=1}^{|\mathcal{V}|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\, \mathrm{vec}(L_i)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|^2_2 \Big) \\ \text{s.t. } & \mathrm{vec}(L_l) = \mathrm{vec}(L_i) \text{ for all } e_{li} \in \mathcal{E} \text{ and } l < i, \end{align*}
where $\mathrm{vec}(L_i)$ is the chunk of the shared repository to be learned by processing unit $i$, and $t_i$ is the total number of tasks assigned to that processing unit. Similarly, we can rewrite the constraints compactly in the form $A\,\overline{\mathrm{vec}}(L) = 0$, where $\overline{\mathrm{vec}}(L) = \big[\mathrm{vec}(L_1)^{\mathsf{T}}, \ldots, \mathrm{vec}(L_{|\mathcal{V}|})^{\mathsf{T}}\big]^{\mathsf{T}}$. Thus, we can write the augmented Lagrangian as:
\[ \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda_L\big) = \sum_{i=1}^{|\mathcal{V}|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\, \mathrm{vec}(L_i)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|^2_2 \Big) - \lambda_L^{\mathsf{T}} A\, \overline{\mathrm{vec}}(L) + \frac{\rho}{2} \big\|A\, \overline{\mathrm{vec}}(L)\big\|^2_2, \]

where ρ > 0 is the penalty parameter and λL is a vector of Lagrange mul-

tipliers. At this stage, we can follow a similar strategy to that of the previous

section to determine the optimal solution $\overline{\mathrm{vec}}(L)^\star$. To update the primal vari-

ables, each processing unit in G solves the following problem:

\begin{align*} \mathrm{vec}\big(L^{(k+1)}_i\big) = \arg\min_{\mathrm{vec}(L_i)} \Bigg[ & \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\, \mathrm{vec}(L_i)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|^2_2 \Big) \\ & + \frac{\rho}{2} \sum_{l \in \mathcal{P}_i} \bigg\| \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}(L_i) - \frac{1}{\rho}\lambda^{(k)}_{L,li} \bigg\|^2_2 + \frac{\rho}{2} \sum_{l \in \mathcal{S}_i} \bigg\| \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k)}_l\big) - \frac{1}{\rho}\lambda^{(k)}_{L,il} \bigg\|^2_2 \Bigg], \tag{4} \end{align*}


Algorithm 2 Scalable Operating Point Computation for L

1: Initialize $\mathrm{vec}\big(L^{(0)}_i\big)$, $\forall i \in \{1, \ldots, |\mathcal{V}|\}$, and set $\rho > 0$.
2: for $k = 0, \ldots, K$ do
3: Each agent $i$ updates $\mathrm{vec}\big(L^{(k+1)}_i\big)$ in sequential order from $i = 1, \ldots, |\mathcal{V}|$ using Equation (4).
4: Each agent updates $\lambda_{L,li}$ for $l \in \mathcal{P}_i$ as:
\[ \lambda^{(k+1)}_{L,li} = \lambda^{(k)}_{L,li} - \rho\Big(\mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big)\Big). \]
5: end for

where $k$ is the iteration number. Given an updated primal, the dual variables $\lambda_L$ are then computed according to:
\[ \lambda^{(k+1)}_{L,li} = \lambda^{(k)}_{L,li} - \rho\Big(\mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big)\Big). \tag{5} \]
Equation (5) is similar to step 4 in Algorithm 1. We provide the pseudocode in Algorithm 2. Note that the update rule can be obtained in closed form under the Gaussian policy assumption, leading to the following (see Appendix B for details):

\begin{align*} \mathrm{vec}\big(L^{(k+1)}_i\big) = & \left[ \frac{1}{t_i} \sum_{j=1}^{t_i} \Gamma_j^{\mathsf{T}} H_j \Gamma_j + \frac{\rho\, \mathrm{deg}_i}{2} I_{dp\times dp} \right]^{-1} \Bigg( \frac{1}{t_i} \sum_{j=1}^{t_i} \Gamma_j^{\mathsf{T}} H_j \theta^\star_j \\ & + \frac{\rho}{2} \left( \sum_{l \in \mathcal{S}_i} \left[ \mathrm{vec}\big(L^{(k)}_l\big) + \frac{1}{\rho}\lambda^{(k)}_{L,il} \right] + \sum_{l \in \mathcal{P}_i} \left[ \mathrm{vec}\big(L^{(k+1)}_l\big) - \frac{1}{\rho}\lambda^{(k)}_{L,li} \right] \right) \Bigg), \end{align*}

with $\Gamma_j = s_j^{\mathsf{T}} \otimes I_{d\times d}$. Given the solution for the shared repository, the task-specific coefficients are computed locally using a LASSO [26] step for each of the $t_i$ tasks.
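
A per-node solve of the quadratic subproblem in Equation (4) can be sketched analogously; the snippet below sets the gradient of the node-i objective to zero and solves the resulting linear system. The Kronecker construction of Γ_j and the message containers mirror the earlier sketches and are assumptions, and constant factors may differ from the displayed closed form.

```python
import numpy as np

def repository_primal_step(theta_star, H, S, mu2, rho, pred_msgs, succ_msgs, d, p):
    """Node-i update of vec(L_i): minimize the local quadratic from Eq. (4).
    theta_star, H, S hold the per-task operating points, Hessians, and
    coefficient vectors s_j assigned to this node."""
    t_i = len(theta_star)
    dim = d * p
    M = rho * (len(pred_msgs) + len(succ_msgs)) * np.eye(dim)
    b = np.zeros(dim)
    for th, Hj, sj in zip(theta_star, H, S):
        Gamma = np.kron(sj[None, :], np.eye(d))        # vec(L s_j) = Gamma @ vec(L)
        M += 2.0 / t_i * (Gamma.T @ Hj @ Gamma + mu2 * np.eye(dim))
        b += 2.0 / t_i * (Gamma.T @ Hj @ th)
    for vec_L_l, lam in pred_msgs:                     # predecessors: current sweep
        b += rho * (vec_L_l - lam / rho)
    for vec_L_l, lam in succ_msgs:                     # successors: previous sweep
        b += rho * (vec_L_l + lam / rho)
    return np.linalg.solve(M, b)
```

The returned vector can be reshaped back into L_i with `reshape(d, p, order='F')`, since the Kronecker identity assumes column-major vectorization.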

Theoretical Guarantees: Our next theorem derives a linear convergence rate for our proposed algorithm.

Theorem 2. The sequence $\big\{\overline{\mathrm{vec}}\big(L^{(k)}\big), \lambda^{(k)}_L\big\}_{k\geq 0}$ consists of
\[ \overline{\mathrm{vec}}\big(L^{(k)}\big) = \big[\mathrm{vec}(L_1)^{(k),\mathsf{T}}, \ldots, \mathrm{vec}(L_{|\mathcal{V}|})^{(k),\mathsf{T}}\big]^{\mathsf{T}} \quad \text{and} \quad \lambda^{(k)}_L = \big[\lambda^{(k),\mathsf{T}}_{L,1}, \ldots, \lambda^{(k),\mathsf{T}}_{L,|\mathcal{E}|}\big]^{\mathsf{T}}, \]
where $\lambda^{(k)}_{L,i} \in \mathbb{R}^{dp}$ is a dual vector. Following Equations (4) and (5), the sequence converges linearly with a rate given by $\mathcal{O}(1/k)$.

(The interested reader can find the proof in Appendix A.)

Theorems 1 and 2 show that our technique improves over PG-ELLA in both the policy gradient update step and the computation of the shared repository. This improvement is significant, as it shows linear convergence in both steps.

6. Experimental Results

Next, we present an empirical validation on five benchmark dynamical systems introduced in [4, 23].

Cart Pole: The cart pole (CP) system is controlled by the cart's mass m_c in kg, the pole's mass m_p in kg, and the pole's length l in meters. The state is given by the cart's position and velocity v, as well as the pole's angle θ and angular velocity θ̇. The goal is to keep the pole in an upright position.

Double Inverted Pendulum: The double inverted pendulum (DIP) is an extension of the cart pole system. It has one cart with mass m_0 in kg and two poles with lengths l_1 and l_2 in meters. We assume the poles are massless and that two masses, m_1 and m_2 in kg, sit on top of each pole. The state consists of the cart's position x_1 and velocity v_1, the lower pole's angle θ_1 and angular velocity θ̇_1, as well as the upper pole's angle θ_2 and angular velocity θ̇_2. The goal is to learn a policy that drives the two poles to a specific state.


Helicopter: This linearized model of a CH-47 (HC) tandem-rotor helicopter assumes horizontal motion at 40 knots. It is characterized by two matrices A ∈ R^{4×4} and B ∈ R^{4×2}. The main goal is to stabilize the helicopter by controlling the collective and differential rotor thrust.

Simple Mass: The simple mass (SM) system is characterized by the spring constant k in N/m, the damping constant d in Ns/m, and the mass m in kg. The system's state is given by the position x and velocity v of the mass. The goal is to train a policy that guides the mass to a specific state.

Double Mass: The double mass (DM) system is an extension of the simple mass system. It has two masses m_1, m_2 in kg and two springs with spring constants k_1 and k_2 in N/m, as well as damping constants d_1 and d_2 in Ns/m. The state consists of the big mass's position x_1 and velocity v_1, as well as the small mass's position x_2 and velocity v_2. The goal is to learn a policy that drives the two masses to a specific state.

Table 1: Parameter ranges used in the experiments

CP:  m_c, m_p ∈ [0, 1];  l ∈ [0.2, 0.8]
DIP: m_0 ∈ [1.5, 3.5];  m_1, m_2 ∈ [0.055, 0.1];  l_1 ∈ [0.4, 0.8];  l_2 ∈ [0.5, 0.9]
HC:  A with random entries A(i, j) ∈ [−32, 3];  B with random entries B(i, j) ∈ [−9, 1]
SM:  m ∈ [3, 5];  k ∈ [1, 7]
DM:  m_1 ∈ [1, 7];  m_2 ∈ [1, 10];  k_1 ∈ [1, 5];  k_2 ∈ [1, 7];  d_1, d_2 ∈ [0.01, 0.1]

We generated 50 tasks for each domain by varying the dynamical parameters of each of the domains (250 tasks in total). For reproducibility, we summarize these ranges in Table 1. The reward was given by $-\sqrt{\|x_m - x\|^2_2} - \sqrt{u_m^{\mathsf{T}} u_m}$, where $x$ is the goal state, $x_m$ is the current state, and $u_m$ is the action the agent has chosen. We ran each task for a total of 200 iterations. At each iteration, the learner observed a task through 50 trajectories of 150 steps and performed algorithmic updates.
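
For reference, a sketch of this reward, transcribed from the formula above:

```python
import numpy as np

def reward(x_m, u_m, x_goal):
    """Per-step reward used in the experiments: distance-to-goal and
    action-magnitude penalties."""
    return -np.sqrt(np.sum((x_m - x_goal) ** 2)) - np.sqrt(u_m @ u_m)
```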


[Figure 1 (plots not reproduced). Panels: (a) SM, (b) DM, (c) CP, (d) DCP, (e) HC, each plotting Average Reward versus Iterations for ePG-ELLA, PG-ELLA, GO-MTL, and Standard PG; (f) Agreement, plotting Consensus Error versus Iterations for Cart-Pole, Double Mass, and Simple Mass.]

Figure 1: Figures (a)-(e) report average reward versus iterations and demonstrate that ePG-ELLA is capable of outperforming other methods. Figure (f) shows the agreement error across processors on 3 sample systems.

We used eNAC [27], a standard PG algorithm, as the base learner. We compare our method (ePG-ELLA) to standard PG, PG-ELLA [9], and GO-MTL [22].


[Figure 2 (plots not reproduced). Panels: (a) Jump-Start and (b) Asymptotic Results, bar charts of % improvement over the SM, CP, DCP, HC, and DM domains for PG-ELLA, GO-MTL, and ePG-ELLA; (c) Time to Convergence, total running time in seconds for the same methods and domains; (d) Novel Domain, Average Reward versus number of iterations for PG, ePG-ELLA, PG-ELLA, and GO-MTL.]

Figure 2: Figures (a) and (b) show that ePG-ELLA is capable of outperforming the others in both jump-start and asymptotic performance. Figure (c) shows that ePG-ELLA takes less time to optimize the objective function. Finally, Figure (d) demonstrates the generalization capability of our method in terms of average reward on unobserved tasks in the HC domain.

To distribute our computations, we made use of MATLAB's parallel pool running on 10 nodes. For distributed computation in ePG-ELLA, we equally assigned the 50 tasks to 10 agents; that is, each agent learns 5 tasks in parallel. The edges between agents were generated randomly without overlapping, and the agents are able to exchange information via the graph network. To make sure every node in the graph has at least one predecessor and at least one successor, we connect node 1 to node 2, node 2 to node 3, . . . , and node 9 to node 10. We then randomly assigned 20 edges among the nodes to ensure a connected graph.


mark in terms of the number of iterations. In all these systems, our method

is capable of outperforming the others in terms of total reward. Figure 1f

shows that our method is capable of acquiring feasible solutions by reporting

the consensus error on 3 different systems. Figures 2a and 2b demonstrate that

our algorithm is capable of outperforming PG-ELLA and GO-MTL in terms230

of jump-starts and asymptotic performance. Figure 2c demonstrates that our

method takes less time to optimize the objective function in terms of time con-

sumption.

Transfer provides significant advantages when the lifelong learning agent faces a novel task domain. To evaluate this, we chose the most complex of the task domains (HC) and trained the lifelong learner on 49 tasks to yield an effective shared knowledge base for each of the algorithms. Then, we evaluated the algorithms' ability to learn a new (unseen) task from the helicopter domain, comparing the benefits of transfer for ePG-ELLA (from L_ePG-ELLA), PG-ELLA (from L_PG-ELLA), GO-MTL (from L_GO-MTL), and PG. Figure 2d depicts the result of learning on a novel domain, averaged over HC tasks, showing that ePG-ELLA converges fastest in this scenario in terms of average reward.

Negative transfer may occur during training with a small probability. However, we ran the experiments over 200 iterations and averaged the results to smooth them. Any cases where transfer was negative were statistically overwhelmed by the positive results in our setting.

7. Conclusion and Future Work

We proposed a scalable and efficient lifelong learner for policy gradient methods. Our approach achieves linear convergence rates both in acquiring linearizing operating points and in updating shared knowledge bases. We have shown that our new technique can outperform state-of-the-art methods on a variety of benchmark control systems. Future work will include targeting the more general lifelong learning setting in different domains and running on real-world applications such as robotics.


8. Acknowledgements

We thank the anonymous reviewers for their useful suggestions and comments. This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL's support includes NASA NNX16CD07C, NSF IIS-1149917, NSF IIS-1643614, and USDA 2014-67021-22174.

Appendix A. Convergence and Convergence Rate

In this section, we will show the proof of Theorem 2. Theorem 1 can be

proved in a similar way. To show that the sequence $\big\{\overline{\mathrm{vec}}\big(L^{(k)}\big), \lambda^{(k)}_L\big\}_{k\geq 0}$ converges with rate $\mathcal{O}\big(\tfrac{1}{k}\big)$, define the function
\[ F\big(\overline{\mathrm{vec}}(L)\big) = \sum_{i=1}^{|\mathcal{V}|} f_i\big(\mathrm{vec}(L_i)\big) = \sum_{i=1}^{|\mathcal{V}|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\,\mathrm{vec}(L_i)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|^2_2 \Big) \]
and the Lagrangian function
\[ \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda_L\big) = \sum_{i=1}^{|\mathcal{V}|} \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\,\mathrm{vec}(L_i)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|^2_2 \Big) - \lambda_L^{\mathsf{T}} A\, \overline{\mathrm{vec}}(L) + \frac{\rho}{2} \big\|A\, \overline{\mathrm{vec}}(L)\big\|^2_2, \tag{A.1} \]

where
\[ \overline{\mathrm{vec}}\big(L^{(k)}\big) = \Big[\mathrm{vec}\big(L^{(k)}_1\big)^{\mathsf{T}}, \ldots, \mathrm{vec}\big(L^{(k)}_{|\mathcal{V}|}\big)^{\mathsf{T}}\Big]^{\mathsf{T}} \quad \text{and} \quad \lambda^{(k)}_L = \big[\lambda^{(k),\mathsf{T}}_{L,1}, \ldots, \lambda^{(k),\mathsf{T}}_{L,|\mathcal{E}|}\big]^{\mathsf{T}}, \]
with $\lambda^{(k)}_{L,i} \in \mathbb{R}^{dp}$ being a dual vector. It is easy to verify that each cost function
\[ f_i\big(\mathrm{vec}(L_i)\big) = \frac{1}{t_i} \sum_{j=1}^{t_i} \Big( \big\|\theta^\star_j - \big(s_j^{\mathsf{T}} \otimes I_{d\times d}\big)\,\mathrm{vec}(L_i)\big\|^2_{H_j} + \mu_2 \|\mathrm{vec}(L_i)\|^2_2 \Big) \]
is convex. We also need the following assumptions.


Assumption 1 (Existence of a Saddle Point). The Lagrangian function $\mathcal{L}^{L}_{\mathrm{Aug}}$ has a saddle point; that is, there is a solution $\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big)$ such that
\[ \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L^*), \lambda_L\big) \leq \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big) \leq \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda^*_L\big) \tag{A.2} \]
for all $\overline{\mathrm{vec}}(L) \in \mathbb{R}^{dp|\mathcal{V}|}$ and $\lambda_L \in \mathbb{R}^{dp}$.

Assumption 2. The penalty term $\big\|A\,\overline{\mathrm{vec}}(L)\big\|^2_2$ is bounded by a constant $\gamma > 0$.

We extend Wei and Ozdaglar's results [25] to multiple dimensions. Before we provide the proofs, we need the following lemmas.

Lemma 1. Consider the sequence $\big\{\overline{\mathrm{vec}}\big(L^{(k)}\big), \lambda^{(k)}_L\big\}_{k\geq 0}$ defined as above, where $\lambda^{(k)}_{L,i} \in \mathbb{R}^{dp}$ is a dual vector. Then, for all $k \geq 0$, we have the following relation:
\begin{align*} & F\big(\overline{\mathrm{vec}}(L)\big) - F\big(\overline{\mathrm{vec}}\big(L^{(k+1)}\big)\big) + \big(\overline{\mathrm{vec}}(L) - \overline{\mathrm{vec}}\big(L^{(k+1)}\big)\big)^{\mathsf{T}} \Big( -A^{\mathsf{T}}\lambda^{(k+1)}_L - \rho A^{\mathsf{T}} B\big(\overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}\big(L^{(k)}\big)\big) \\ & \qquad + \rho B^{\mathsf{T}} B\big(\overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}\big(L^{(k)}\big)\big) \Big) \geq 0 \end{align*}
for all $\overline{\mathrm{vec}}(L) \in \mathbb{R}^{dp|\mathcal{V}|}$, where the matrix $A$ is the general edge-node incidence matrix and $B = \min\{0, A\}$, with the min taken element-wise.

Proof. Let
\[ g^{(k)}_i\big(\mathrm{vec}(L_i)\big) = \frac{\rho}{2} \sum_{l \in \mathcal{P}_i} \Big\| \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}(L_i) - \tfrac{1}{\rho}\lambda^{(k)}_{L,li} \Big\|^2_2 + \frac{\rho}{2} \sum_{l \in \mathcal{S}_i} \Big\| \mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k)}_l\big) - \tfrac{1}{\rho}\lambda^{(k)}_{L,il} \Big\|^2_2. \]
By Equation (4), the function $g^{(k)}_i + f_i$ attains its optimum at $\mathrm{vec}\big(L^{(k+1)}_i\big)$, which implies that there exists a subgradient $sg\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) \in \partial f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big)$ such that $sg\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \nabla g^{(k)}_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) = 0$, and we obtain
\[ \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \Big( sg\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \nabla g^{(k)}_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) \Big) = 0 \tag{A.3} \]
for all $\mathrm{vec}(L_i) \in \mathbb{R}^{dp}$. By the definition of the subgradient, we have
\[ f_i\big(\mathrm{vec}(L_i)\big) \geq f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} sg\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big). \tag{A.4} \]
Combining Equations (A.3) and (A.4) yields
\[ f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \nabla g^{(k)}_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) \geq 0. \tag{A.5} \]
The gradient of $g^{(k)}_i$ at $\mathrm{vec}\big(L^{(k+1)}_i\big)$ is
\[ \nabla g^{(k)}_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) = -\rho \sum_{l \in \mathcal{P}_i} \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big) - \tfrac{1}{\rho}\lambda^{(k)}_{L,li} \Big) + \rho \sum_{l \in \mathcal{S}_i} \Big( \mathrm{vec}\big(L^{(k+1)}_i\big) - \mathrm{vec}\big(L^{(k)}_l\big) - \tfrac{1}{\rho}\lambda^{(k)}_{L,il} \Big). \]
Substituting this gradient into Equation (A.5), we have
\begin{align*} & f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \Big( -\rho \sum_{l \in \mathcal{P}_i} \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k+1)}_i\big) - \tfrac{1}{\rho}\lambda^{(k)}_{L,li} \Big) \\ & \qquad + \rho \sum_{l \in \mathcal{S}_i} \Big( \mathrm{vec}\big(L^{(k+1)}_i\big) - \mathrm{vec}\big(L^{(k)}_l\big) - \tfrac{1}{\rho}\lambda^{(k)}_{L,il} \Big) \Big) \geq 0. \end{align*}
Since the dual variables $\lambda^{(k)}_{L,li}$ satisfy the relation in Equation (5), this becomes
\[ f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \Big( \sum_{l \in \mathcal{P}_i} \lambda^{(k+1)}_{L,li} - \sum_{l \in \mathcal{S}_i} \lambda^{(k+1)}_{L,il} + \sum_{l \in \mathcal{S}_i} \rho\, \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k)}_l\big) \Big) \Big) \geq 0. \]
Then, by the definition of the matrix $A$, we can rewrite the above inequality as
\[ f_i\big(\mathrm{vec}(L_i)\big) - f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) + \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \Big( -A(\cdot,i)^{\mathsf{T}} \lambda^{(k+1)}_L + \sum_{l \in \mathcal{S}_i} \rho\, \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k)}_l\big) \Big) \Big) \geq 0, \]
where $A(\cdot,i)$ denotes the $i$-th block column of the matrix $A$. Summing the above inequality over $i = 1, \ldots, |\mathcal{V}|$ gives
\begin{align*} & \sum_{i=1}^{|\mathcal{V}|} f_i\big(\mathrm{vec}(L_i)\big) - \sum_{i=1}^{|\mathcal{V}|} f_i\big(\mathrm{vec}\big(L^{(k+1)}_i\big)\big) \\ & \quad + \sum_{i=1}^{|\mathcal{V}|} \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \Big( -A(\cdot,i)^{\mathsf{T}} \lambda^{(k+1)}_L + \sum_{l \in \mathcal{S}_i} \rho\, \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k)}_l\big) \Big) \Big) \geq 0. \tag{A.6} \end{align*}
By the definition of the matrices $A$ and $B$, we obtain
\[ -\sum_{i=1}^{|\mathcal{V}|} \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} A(\cdot,i)^{\mathsf{T}} \lambda^{(k+1)}_L = -\big[\overline{\mathrm{vec}}(L) - \overline{\mathrm{vec}}\big(L^{(k+1)}\big)\big]^{\mathsf{T}} A^{\mathsf{T}} \lambda^{(k+1)}_L \]
and
\[ \sum_{i=1}^{|\mathcal{V}|} \big[\mathrm{vec}(L_i) - \mathrm{vec}\big(L^{(k+1)}_i\big)\big]^{\mathsf{T}} \sum_{l \in \mathcal{S}_i} \rho\, \Big( \mathrm{vec}\big(L^{(k+1)}_l\big) - \mathrm{vec}\big(L^{(k)}_l\big) \Big) = \rho\, \big[\overline{\mathrm{vec}}(L) - \overline{\mathrm{vec}}\big(L^{(k+1)}\big)\big]^{\mathsf{T}} \big(-A + B\big)^{\mathsf{T}} B \Big( \overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}\big(L^{(k)}\big) \Big). \]
The result follows by plugging the preceding two equations into Equation (A.6).

Lemma 2. Consider the sequence $\big\{\overline{\mathrm{vec}}\big(L^{(k)}\big), \lambda^{(k)}_L\big\}_{k\geq 0}$ defined as above, where $\lambda^{(k)}_{L,i} \in \mathbb{R}^{dp}$ is a dual vector. Assume that $\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big)$ is the solution of problem (A.1). Let $A$ be the general edge-node incidence matrix and $B = \min\{0, A\}$, with the min taken element-wise. Then, we have the following relation:
\begin{align*} & \overline{\mathrm{vec}}\big(L^{(k+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} \big(\lambda^{(k+1)}_L - \lambda^*_L\big) + \rho\, \overline{\mathrm{vec}}\big(L^{(k+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} B \Big( \overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}\big(L^{(k)}\big) \Big) \\ & \quad + \rho \Big( \overline{\mathrm{vec}}(L^*) - \overline{\mathrm{vec}}\big(L^{(k+1)}\big) \Big)^{\mathsf{T}} B^{\mathsf{T}} B \Big( \overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}\big(L^{(k)}\big) \Big) \\ & = \frac{1}{2\rho} \Big( \big\|\lambda^{(k)}_L - \lambda^*_L\big\|^2 - \big\|\lambda^{(k+1)}_L - \lambda^*_L\big\|^2 \Big) + \frac{\rho}{2} \Big( \big\|B\big(\overline{\mathrm{vec}}\big(L^{(k)}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 - \big\|B\big(\overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 \Big) \\ & \quad - \frac{\rho}{2} \Big\|B\Big(\overline{\mathrm{vec}}\big(L^{(k+1)}\big) - \overline{\mathrm{vec}}\big(L^{(k)}\big)\Big) - A\,\overline{\mathrm{vec}}\big(L^{(k+1)}\big)\Big\|^2. \end{align*}


Proof. See the proof of Lemma 4.2 in [25].

Lemma 3. Let $\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big)$ be a saddle point of Equation (A.1). Then we have $A\,\overline{\mathrm{vec}}(L^*) = 0$.

Proof. The result follows from Assumption 1.

With the aforementioned lemmas, we can now show the main theoretical result of the paper.

Proof of Theorem 2. To show convergence with rate $\mathcal{O}\big(\tfrac{1}{k}\big)$, we introduce the additional variable $y^k = \frac{1}{k}\sum_{s=0}^{k-1} \overline{\mathrm{vec}}\big(L^{(s+1)}\big)$, the average of the iterates up to time $k$. We need to bound
\[ \mathcal{L}^{L}_{\mathrm{Aug}}\big(y^k, \lambda^*_L\big) - \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big). \]

First, $\mathcal{L}^{L}_{\mathrm{Aug}}\big(y^k, \lambda^*_L\big) - \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big) \geq 0$, since $\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big)$ is a saddle point of Equation (A.1). Next, we derive an upper bound for this quantity. Setting $\overline{\mathrm{vec}}(L) = \overline{\mathrm{vec}}(L^*)$ in Lemma 1, we have the relation
\begin{align*} & F\big(\overline{\mathrm{vec}}(L^*)\big) - F\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big) + \big(\overline{\mathrm{vec}}(L^*) - \overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big)^{\mathsf{T}} \Big( -A^{\mathsf{T}}\lambda^{(s+1)}_L - \rho A^{\mathsf{T}} B\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big)\big) \\ & \qquad + \rho B^{\mathsf{T}} B\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big)\big) \Big) \geq 0. \end{align*}
We have $A\,\overline{\mathrm{vec}}(L^*) = 0$ from Lemma 3, so $\overline{\mathrm{vec}}(L^*)^{\mathsf{T}} A^{\mathsf{T}} = 0$. Hence, the above inequality reduces to
\begin{align*} & F\big(\overline{\mathrm{vec}}(L^*)\big) - F\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big) + \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} \lambda^{(s+1)}_L \\ & \quad + \Big( \rho\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} B + \rho \big( \overline{\mathrm{vec}}(L^*) - \overline{\mathrm{vec}}\big(L^{(s+1)}\big) \big)^{\mathsf{T}} B^{\mathsf{T}} B \Big) \Big( \overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big) \Big) \geq 0. \end{align*}
Adding and subtracting $\overline{\mathrm{vec}}\big(L^{(s+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} \lambda^*_L$, we obtain
\begin{align*} & F\big(\overline{\mathrm{vec}}(L^*)\big) - F\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big) + \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} \lambda^*_L + \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} \big(\lambda^{(s+1)}_L - \lambda^*_L\big) \\ & \quad + \rho\, \overline{\mathrm{vec}}\big(L^{(s+1)}\big)^{\mathsf{T}} A^{\mathsf{T}} B \Big( \overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big) \Big) + \rho \big( \overline{\mathrm{vec}}(L^*) - \overline{\mathrm{vec}}\big(L^{(s+1)}\big) \big)^{\mathsf{T}} B^{\mathsf{T}} B \Big( \overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big) \Big) \geq 0. \end{align*}


Using Lemmas 2 and 3, this yields
\begin{align*} & F\big(\overline{\mathrm{vec}}(L^*)\big) - F\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big) + \big(\lambda^*_L\big)^{\mathsf{T}} A\,\overline{\mathrm{vec}}\big(L^{(s+1)}\big) \\ & \quad + \frac{1}{2\rho}\Big( \big\|\lambda^{(s)}_L - \lambda^*_L\big\|^2 - \big\|\lambda^{(s+1)}_L - \lambda^*_L\big\|^2 \Big) + \frac{\rho}{2}\Big( \big\|B\big(\overline{\mathrm{vec}}\big(L^{(s)}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 - \big\|B\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 \Big) \\ & \quad - \frac{\rho}{2}\Big\|B\Big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big)\Big) - A\,\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\Big\|^2 \geq 0. \end{align*}
Summing this inequality over $s = 0, \ldots, k-1$ and telescoping, we have
\begin{align*} & k\, F\big(\overline{\mathrm{vec}}(L^*)\big) - \sum_{s=0}^{k-1} F\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big) + \big(\lambda^*_L\big)^{\mathsf{T}} A \sum_{s=0}^{k-1} \overline{\mathrm{vec}}\big(L^{(s+1)}\big) + \frac{1}{2\rho}\big\|\lambda^{0}_L - \lambda^*_L\big\|^2 + \frac{\rho}{2}\big\|B\big(\overline{\mathrm{vec}}\big(L^{0}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 \\ & \geq \frac{1}{2\rho}\big\|\lambda^{k}_L - \lambda^*_L\big\|^2 + \frac{\rho}{2}\big\|B\big(\overline{\mathrm{vec}}\big(L^{k}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 + \sum_{s=0}^{k-1} \frac{\rho}{2}\Big\|B\Big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big) - \overline{\mathrm{vec}}\big(L^{(s)}\big)\Big) - A\,\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\Big\|^2 \geq 0. \end{align*}
Since $F$ is convex (due to the convexity of each $f_i$), we have $\sum_{s=0}^{k-1} F\big(\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big) \geq k\, F(y^k)$, and with the definition of $y^k = \frac{1}{k}\sum_{s=0}^{k-1}\overline{\mathrm{vec}}\big(L^{(s+1)}\big)$,
\[ k\, F\big(\overline{\mathrm{vec}}(L^*)\big) - k\, F(y^k) + k\, \big(\lambda^*_L\big)^{\mathsf{T}} A\, y^k + \frac{1}{2\rho}\big\|\lambda^{0}_L - \lambda^*_L\big\|^2 + \frac{\rho}{2}\big\|B\big(\overline{\mathrm{vec}}\big(L^{0}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 \geq 0. \]
Multiplying by $-\tfrac{1}{k}$ and using $\big(\lambda^*_L\big)^{\mathsf{T}} A\, \overline{\mathrm{vec}}(L^*) = 0$ (Lemma 3), we obtain
\[ F(y^k) - \big(\lambda^*_L\big)^{\mathsf{T}} A\, y^k - F\big(\overline{\mathrm{vec}}(L^*)\big) + \big(\lambda^*_L\big)^{\mathsf{T}} A\, \overline{\mathrm{vec}}(L^*) \leq \frac{1}{k}\left( \frac{1}{2\rho}\big\|\lambda^{0}_L - \lambda^*_L\big\|^2 + \frac{\rho}{2}\big\|B\big(\overline{\mathrm{vec}}\big(L^{0}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 \right). \]
By Assumption 2 and the definition of $y^k$,
\[ \frac{\rho}{2}\big\|A y^k\big\|^2_2 \leq \frac{\rho}{2k} \sum_{s=0}^{k-1} \big\|A\,\overline{\mathrm{vec}}\big(L^{(s+1)}\big)\big\|^2_2 \leq \frac{\rho}{2k}\, k\gamma = \frac{\gamma\rho}{2}. \tag{A.7} \]


Then,
\begin{align*} & F(y^k) - \big(\lambda^*_L\big)^{\mathsf{T}} A\, y^k + \frac{\rho}{2}\big\|A y^k\big\|^2_2 - F\big(\overline{\mathrm{vec}}(L^*)\big) + \big(\lambda^*_L\big)^{\mathsf{T}} A\, \overline{\mathrm{vec}}(L^*) - \frac{\rho}{2}\big\|A\,\overline{\mathrm{vec}}(L^*)\big\|^2_2 \\ & \quad \leq \frac{1}{k}\left( \frac{1}{2\rho}\big\|\lambda^{0}_L - \lambda^*_L\big\|^2 + \frac{\rho}{2}\big\|B\big(\overline{\mathrm{vec}}\big(L^{0}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 + \frac{\gamma\rho}{2} \right). \tag{A.8} \end{align*}

Combining Equations (A.8) and (A.7), we have
\[ \mathcal{L}^{L}_{\mathrm{Aug}}\big(y^k, \lambda^*_L\big) - \mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L^*), \lambda^*_L\big) \leq \frac{1}{k}\left( \frac{1}{2\rho}\big\|\lambda^{0}_L - \lambda^*_L\big\|^2 + \frac{\rho}{2}\big\|B\big(\overline{\mathrm{vec}}\big(L^{0}\big) - \overline{\mathrm{vec}}(L^*)\big)\big\|^2 + \frac{\gamma\rho}{2} \right). \]
Due to the convexity of the function $F$, the linearity of $(\lambda^*_L)^{\mathsf{T}} A\,\overline{\mathrm{vec}}(L)$, and the convexity of $\frac{\rho}{2}\big\|A\,\overline{\mathrm{vec}}(L)\big\|^2_2$, the function $\mathcal{L}^{L}_{\mathrm{Aug}}\big(\overline{\mathrm{vec}}(L), \lambda^*_L\big)$ is strictly convex and has a unique minimizer, i.e., $\overline{\mathrm{vec}}(L^*)$. Hence, as $k \to \infty$, the sequence $\big\{\overline{\mathrm{vec}}\big(L^{(k)}\big), \lambda^{(k)}_L\big\}_{k\geq 0}$ converges to the optimal minimizer $\overline{\mathrm{vec}}(L^*)$. In practice, we can use the averaged variable $y^k = \frac{1}{k}\sum_{s=0}^{k-1}\overline{\mathrm{vec}}\big(L^{(s+1)}\big)$ to achieve the $\mathcal{O}\big(\tfrac{1}{k}\big)$ convergence rate.

Appendix B. Update Rules

In this section, we provide the derivation of the update rules in Section 5.1. The derivation for Section 5.2 can be obtained by similar methods. Using the Leibniz integration rule, the gradient can be computed as:

\begin{align*} & \nabla_{\theta^{(j)}_i} \Bigg[ \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} + \frac{\rho}{2} \sum_{l \in \mathcal{P}_i} \bigg\| \theta^{(k+1),(j)}_l - \theta^{(j)}_i - \frac{1}{\rho}\lambda^{(k)}_{li} \bigg\|^2_2 + \frac{\rho}{2} \sum_{l \in \mathcal{S}_i} \bigg\| \theta^{(j)}_i - \theta^{(k),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{il} \bigg\|^2_2 \Bigg] \\ & = \underbrace{\int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} \nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)}}_{f_{\mathrm{RL}}{}^{(j)}_i} + \rho\, \mathrm{deg}_i\, \theta^{(j)}_i - \rho \left( \sum_{l \in \mathcal{S}_i} \left[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \right] + \sum_{l \in \mathcal{P}_i} \left[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \right] \right), \end{align*}


where $\mathrm{deg}_i$ is the degree of processing unit $i$ (i.e., the size of its set of neighbors). Assuming a Gaussian policy for the $i$-th split of task $j$, we can write $f_{\mathrm{RL}}{}^{(j)}_i$ as:
\begin{align*} f_{\mathrm{RL}}{}^{(j)}_i &= p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \Big[ p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big) \Big] - \text{const.} \\ &= p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} -\frac{1}{2} \Big[ u^{(j)}_m - \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Big]^{\mathsf{T}} \Sigma^{-1} \Big[ u^{(j)}_m - \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Big], \end{align*}

with $\Phi^{(j)}(x_m) \in \mathbb{R}^{m\times d}$ being a feature representation of the state and $\Sigma \in \mathbb{R}^{m\times m}$ being the covariance matrix of the Gaussian policy. Hence:
\begin{align*} & \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} \nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} \\ &= -\nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \Bigg[ \sum_{m=0}^{M^{(j)}-1} u^{(j),\mathsf{T}}_m \Sigma^{-1} u^{(j)}_m - 2 \sum_{m=0}^{M^{(j)}-1} u^{(j),\mathsf{T}}_m \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i + \sum_{m=0}^{M^{(j)}-1} \theta^{(j),\mathsf{T}}_i \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Bigg] \\ &= -p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \Bigg[ \sum_{m=0}^{M^{(j)}-1} \nabla_{\theta^{(j)}_i}\, u^{(j),\mathsf{T}}_m \Sigma^{-1} u^{(j)}_m - 2 \sum_{m=0}^{M^{(j)}-1} \nabla_{\theta^{(j)}_i}\, u^{(j),\mathsf{T}}_m \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i + \sum_{m=0}^{M^{(j)}-1} \nabla_{\theta^{(j)}_i}\, \theta^{(j),\mathsf{T}}_i \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Bigg]. \end{align*}

At this stage, we handle each of the three gradients separately. The first is zero, as the expression does not depend on $\theta^{(j)}_i$. For the second, we have:
\[ \nabla_{\theta^{(j)}_i}\, u^{(j),\mathsf{T}}_m \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i = \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} u^{(j)}_m. \]
Furthermore, we have:
\[ \nabla_{\theta^{(j)}_i}\, \theta^{(j),\mathsf{T}}_i \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i = \big(A + A^{\mathsf{T}}\big)\, \theta^{(j)}_i, \]
with $A = \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)$. Since $A = A^{\mathsf{T}}$, we have:
\[ \nabla_{\theta^{(j)}_i}\, \theta^{(j),\mathsf{T}}_i \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i = 2A\, \theta^{(j)}_i = 2\, \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i. \]

Plugging these results back into the original gradient equation yields:
\begin{align*} & \int_{\tau^{(j)} \in \mathbb{T}^{(j)}_i} \nabla_{\theta^{(j)}_i}\, p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \log \frac{p^{(j)}_{\theta_j}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big)}{p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)}\, d\tau^{(j)} \\ &= -p^{(j)}_{\theta^{(j)}_i}\big(\tau^{(j)}\big)\, \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \Bigg[ -2 \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} u^{(j)}_m + 2 \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Bigg] \\ &= 2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta^{(j)}_i}(\cdot)} \Bigg[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} u^{(j)}_m \Bigg] - 2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta^{(j)}_i}(\cdot)} \Bigg[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Bigg]. \end{align*}

Considering the penalty terms of ADMM and setting the above to zero, we can update $\theta^{(j)}_i$ in closed form according to:
\begin{align*} & -2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta^{(j)}_i}(\cdot)} \Bigg[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} \Phi^{(j)}(x_m)\, \theta^{(j)}_i \Bigg] + 2\, \mathbb{E}_{\tau^{(j)} \sim p^{(j)}_{\theta^{(j)}_i}(\cdot)} \Bigg[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} u^{(j)}_m \Bigg] \\ & \quad + \rho\, \mathrm{deg}_i\, \theta^{(j)}_i - \rho \left( \sum_{l \in \mathcal{S}_i} \left[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \right] + \sum_{l \in \mathcal{P}_i} \left[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \right] \right) = 0. \end{align*}
Solving the above yields:
\begin{align*} \theta^{(k+1),(j)}_i = & \left[ \frac{-\rho\, \mathrm{deg}_i}{2} I_{d\times d} + \mathbb{E}_{\tau^{(j)} \sim p_{\theta^{(j)}_i}(\cdot)}\!\left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1}\, \Phi^{(j)}(x_m) \right] \right]^{-1} \\ & \times \Bigg[ \mathbb{E}_{\tau^{(j)} \sim p_{\theta^{(j)}_i}(\cdot)}\!\left[ \mathcal{R}^{(j)}\big(\tau^{(j)}\big) \sum_{m=0}^{M^{(j)}-1} \Phi^{(j),\mathsf{T}}(x_m)\, \Sigma^{-1} u^{(j)}_m \right] - \frac{\rho}{2} \left( \sum_{l \in \mathcal{S}_i} \left[ \theta^{(k),(j)}_l + \frac{1}{\rho}\lambda^{(k)}_{il} \right] + \sum_{l \in \mathcal{P}_i} \left[ \theta^{(k+1),(j)}_l - \frac{1}{\rho}\lambda^{(k)}_{li} \right] \right) \Bigg]. \end{align*}

Similar steps can be performed to determine the update equation for the

shared repository by taking the gradient with respect to vec(L).


[1] J. Kober, J. R. Peters, Policy search for motor primitives in robotics, in:

Advances in neural information processing systems, 2009, pp. 849–856.

[2] S. A. Murphy, D. W. Oslin, A. J. Rush, J. Zhu, Methodological challenges in

constructing effective treatment sequences for chronic psychiatric disorders,

Neuropsychopharmacology 32 (2007) 257–262.

[3] J. Pineau, M. G. Bellemare, A. J. Rush, A. Ghizaru, S. A. Murphy, Con-

structing evidence-based treatment strategies using methods from com-

puter science, Drug and alcohol dependence 88 (2007) S52–S60.

[4] R. S. Sutton, A. G. Barto, Introduction to Reinforcement Learning, 1st

ed., MIT Press, Cambridge, MA, USA, 1998.

[5] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning:

a hierarchical bayesian approach, in: Proceedings of the 24th international

conference on Machine learning, ACM, 2007, pp. 1015–1022.

[6] M. E. Taylor, P. Stone, Transfer Learning for Reinforcement Learning

Domains: A Survey, Journal of Machine Learning Research 10 (2009)

1633–1685.

[7] A. Lazaric, M. Ghavamzadeh, Bayesian multi-task reinforcement learning,

in: Proceedings of the 27th International Conference on Machine Learning,

2010.

[8] H. Li, X. Liao, L. Carin, Multi-task reinforcement learning in partially

observable stochastic environments, The Journal of Machine Learning Re-

search 10 (2009) 1131–1186.

[9] H. Bou-Ammar, E. Eaton, P. Ruvolo, M. Taylor, Online Multi-task Learn-

ing for Policy Gradient Methods, in: T. Jebara, E. P. Xing (Eds.), Pro-

ceedings of the 31st International Conference on Machine Learning, JMLR

Workshop and Conference Proceedings, 2014.


[10] R. J. Williams, Simple statistical gradient-following algorithms for connec-

tionist reinforcement learning, Machine learning 8 (1992) 229–256.

[11] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, M. Lee, Natural actor–critic

algorithms, Automatica 45 (2009) 2471–2482.

[12] J. Peters, S. Schaal, Natural actor-critic, Neurocomputing 71 (2008) 1180–

1190.

[13] P. Ruvolo, E. Eaton, ELLA: An efficient lifelong learning algorithm, in:

Proceedings of the 30th International Conference on Machine Learning,

2013.

[14] S. Thrun, J. O’Sullivan, Discovering Structure in Multiple Learning Tasks:

The TC Algorithm, in: International Conference on Machine Learning,

Morgan Kaufmann, 1996.

[15] W. Caarls, E. Schuitema, Parallel online temporal difference learning for

motor control, IEEE transactions on neural networks and learning systems

27 (2016) 1457–1468.

[16] S. Gu, E. Holly, T. Lillicrap, S. Levine, Deep reinforcement learning for

robotic manipulation with asynchronous off-policy updates, arXiv preprint

arXiv:1610.00633 (2016).

[17] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, S. Levine, Collective robot

reinforcement learning with distributed asynchronous guided policy search,

arXiv preprint arXiv:1610.00673 (2016).

[18] S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep

visuomotor policies, Journal of Machine Learning Research 17 (2016) 1–

40.

[19] M. P. Deisenroth, P. Englert, J. Peters, D. Fox, Multi-task policy search for

robotics, in: Robotics and Automation (ICRA), 2014 IEEE International

Conference on, 2014, pp. 3876–3881.


[20] A. Wilson, A. Fern, S. Ray, P. Tadepalli, Multi-task reinforcement learning: a hierarchical Bayesian approach, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, 2007.

[21] M. Snel, S. Whiteson, Learning potential functions and their representa-

tions for multi-task reinforcement learning, Autonomous agents and multi-

agent systems 28 (2014) 637–681.

[22] A. Kumar, H. Daume, Learning task grouping and overlap in multi-task

learning, in: Proceedings of the 29th International Conference on Machine

Learning, 2012, pp. 1383–1390.

[23] H. Bou Ammar, E. Eaton, J. M. Luna, P. Ruvolo, Autonomous cross-

domain knowledge transfer in lifelong policy gradient reinforcement learn-

ing, in: Proceedings of the International Joint Conference on Artificial

Intelligence, 2015.

[24] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed opti-

mization and statistical learning via the alternating direction method of

multipliers, Foundations and Trends in Machine Learning 3 (2011) 1–122.

[25] E. Wei, A. Ozdaglar, Distributed alternating direction method of multi-

pliers, in: Decision and Control, 2012 IEEE 51st Annual Conference on,

IEEE, 2012, pp. 5445–5450.

[26] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58 (1996) 267–288.

[27] J. Peters, S. Schaal, Natural Actor-Critic, Neurocomputing 71 (2008).
