
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja 1 Aurick Zhou 1 Pieter Abbeel 1 Sergey Levine 1

Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy: that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.

1. Introduction

Model-free deep reinforcement learning (RL) algorithms have been applied in a range of challenging domains, from games (Mnih et al., 2013; Silver et al., 2016) to robotic control (Schulman et al., 2015). The combination of RL and high-capacity function approximators such as neural networks holds the promise of automating a wide range of decision making and control tasks, but widespread adoption of these methods in real-world domains has been hampered by two major challenges.

1 Berkeley Artificial Intelligence Research, University of California, Berkeley, USA. Correspondence to: Tuomas Haarnoja <[email protected]>.

First, model-free deep RL methods are notoriously expensive in terms of their sample complexity. Even relatively simple tasks can require millions of steps of data collection, and complex behaviors with high-dimensional observations might need substantially more. Second, these methods are often brittle with respect to their hyperparameters: learning rates, exploration constants, and other settings must be set carefully for different problem settings to achieve good results. Both of these challenges severely limit the applicability of model-free deep RL to real-world tasks.

One cause for the poor sample efficiency of deep RL methods is on-policy learning: some of the most commonly used deep RL algorithms, such as TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017b) or A3C (Mnih et al., 2016), require new samples to be collected for each gradient step. This quickly becomes extravagantly expensive, as the number of gradient steps and samples per step needed to learn an effective policy increases with task complexity. Off-policy algorithms aim to reuse past experience. This is not directly feasible with conventional policy gradient formulations, but is relatively straightforward for Q-learning based methods (Mnih et al., 2015). Unfortunately, the combination of off-policy learning and high-dimensional, nonlinear function approximation with neural networks presents a major challenge for stability and convergence (Bhatnagar et al., 2009). This challenge is further exacerbated in continuous state and action spaces, where a separate actor network is often used to perform the maximization in Q-learning. A commonly used algorithm in such settings, deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), provides for sample-efficient learning but is notoriously challenging to use due to its extreme brittleness and hyperparameter sensitivity (Duan et al., 2016; Henderson et al., 2017).

We explore how to design an efficient and stable model-free deep RL algorithm for continuous state and action spaces. To that end, we draw on the maximum entropy framework, which augments the standard maximum reward reinforcement learning objective with an entropy maximization term (Ziebart et al., 2008; Toussaint, 2009; Rawlik et al., 2012; Fox et al., 2016; Haarnoja et al., 2017).

Maximum entropy reinforcement learning alters the RL objective, though the original objective can be recovered using a temperature parameter (Haarnoja et al., 2017). More importantly, the maximum entropy formulation provides a substantial improvement in exploration and robustness: as discussed by Ziebart (2010), maximum entropy policies are robust in the face of model and estimation errors, and as demonstrated by Haarnoja et al. (2017), they improve exploration by acquiring diverse behaviors. Prior work has proposed model-free deep RL algorithms that perform on-policy learning with entropy maximization (O'Donoghue et al., 2016), as well as off-policy methods based on soft Q-learning and its variants (Schulman et al., 2017a; Nachum et al., 2017a; Haarnoja et al., 2017). However, the on-policy variants suffer from poor sample complexity for the reasons discussed above, while the off-policy variants require complex approximate inference procedures in continuous action spaces.

In this paper, we demonstrate that we can devise an off-policy maximum entropy actor-critic algorithm, which we call soft actor-critic (SAC), that provides for both sample-efficient learning and stability. This algorithm extends readily to very complex, high-dimensional tasks, such as the Humanoid benchmark (Duan et al., 2016) with 21 action dimensions, where off-policy methods such as DDPG typically struggle to obtain good results (Gu et al., 2016). SAC also avoids the complexity and potential instability associated with approximate inference in prior off-policy maximum entropy algorithms based on soft Q-learning (Haarnoja et al., 2017). We present a convergence proof for policy iteration in the maximum entropy framework, and then introduce a new algorithm based on an approximation to this procedure that can be practically implemented with deep neural networks, which we call soft actor-critic. We present empirical results showing that soft actor-critic attains a substantial improvement in both performance and sample efficiency over both off-policy and on-policy prior methods. We also compare to the twin delayed deep deterministic (TD3) policy gradient algorithm (Fujimoto et al., 2018), a concurrent work that proposes a deterministic algorithm that substantially improves on DDPG.

2. Related Work

Our soft actor-critic algorithm incorporates three key ingredients: an actor-critic architecture with separate policy and value function networks, an off-policy formulation that enables reuse of previously collected data for efficiency, and entropy maximization to enable stability and exploration. We review prior works that draw on some of these ideas in this section. Actor-critic algorithms are typically derived starting from policy iteration, which alternates between policy evaluation (computing the value function for a policy) and policy improvement (using the value function to obtain a better policy) (Barto et al., 1983; Sutton & Barto, 1998). In large-scale reinforcement learning problems, it is typically impractical to run either of these steps to convergence, and instead the value function and policy are optimized jointly. In this case, the policy is referred to as the actor, and the value function as the critic. Many actor-critic algorithms build on the standard, on-policy policy gradient formulation to update the actor (Peters & Schaal, 2008), and many of them also consider the entropy of the policy, but instead of maximizing the entropy, they use it as a regularizer (Schulman et al., 2017b; 2015; Mnih et al., 2016; Gruslys et al., 2017). On-policy training tends to improve stability but results in poor sample complexity.

There have been efforts to increase the sample efficiency while retaining robustness by incorporating off-policy samples and by using higher order variance reduction techniques (O'Donoghue et al., 2016; Gu et al., 2016). However, fully off-policy algorithms still attain better efficiency. A particularly popular off-policy actor-critic method, DDPG (Lillicrap et al., 2015), which is a deep variant of the deterministic policy gradient (Silver et al., 2014) algorithm, uses a Q-function estimator to enable off-policy learning, and a deterministic actor that maximizes this Q-function. As such, this method can be viewed both as a deterministic actor-critic algorithm and an approximate Q-learning algorithm. Unfortunately, the interplay between the deterministic actor network and the Q-function typically makes DDPG extremely difficult to stabilize and brittle to hyperparameter settings (Duan et al., 2016; Henderson et al., 2017). As a consequence, it is difficult to extend DDPG to complex, high-dimensional tasks, and on-policy policy gradient methods still tend to produce the best results in such settings (Gu et al., 2016). Our method instead combines off-policy actor-critic training with a stochastic actor, and further aims to maximize the entropy of this actor with an entropy maximization objective. We find that this actually results in a considerably more stable and scalable algorithm that, in practice, exceeds both the efficiency and final performance of DDPG. A similar method can be derived as a zero-step special case of stochastic value gradients (SVG(0)) (Heess et al., 2015). However, SVG(0) differs from our method in that it optimizes the standard maximum expected return objective, and it does not make use of a separate value network, which we found to make training more stable.

Maximum entropy reinforcement learning optimizes policies to maximize both the expected return and the expected entropy of the policy. This framework has been used in many contexts, from inverse reinforcement learning (Ziebart et al., 2008) to optimal control (Todorov, 2008; Toussaint, 2009; Rawlik et al., 2012). In guided policy search (Levine & Koltun, 2013; Levine et al., 2016), the maximum entropy distribution is used to guide policy learning towards high-reward regions.

More recently, several papers have noted the connection between Q-learning and policy gradient methods in the framework of maximum entropy learning (O'Donoghue et al., 2016; Haarnoja et al., 2017; Nachum et al., 2017a; Schulman et al., 2017a). While most of the prior model-free works assume a discrete action space, Nachum et al. (2017b) approximate the maximum entropy distribution with a Gaussian, and Haarnoja et al. (2017) with a sampling network trained to draw samples from the optimal policy. Although the soft Q-learning algorithm proposed by Haarnoja et al. (2017) has a value function and actor network, it is not a true actor-critic algorithm: the Q-function estimates the optimal Q-function, and the actor does not directly affect the Q-function except through the data distribution. Hence, Haarnoja et al. (2017) motivate the actor network as an approximate sampler, rather than the actor in an actor-critic algorithm. Crucially, the convergence of this method hinges on how well this sampler approximates the true posterior. In contrast, we prove that our method converges to the optimal policy from a given policy class, regardless of the policy parameterization. Furthermore, these prior maximum entropy methods generally do not exceed the performance of state-of-the-art off-policy algorithms, such as DDPG, when learning from scratch, though they may have other benefits, such as improved exploration and ease of fine-tuning. In our experiments, we demonstrate that our soft actor-critic algorithm does in fact exceed the performance of prior state-of-the-art off-policy deep RL methods by a wide margin.

3. Preliminaries

We first introduce notation and summarize the standard and maximum entropy reinforcement learning frameworks.

3.1. Notation

We address policy learning in continuous action spaces. We consider an infinite-horizon Markov decision process (MDP), defined by the tuple (S, A, p, r), where the state space S and the action space A are continuous, and the unknown state transition probability p : S × S × A → [0, ∞) represents the probability density of the next state s_{t+1} ∈ S given the current state s_t ∈ S and action a_t ∈ A. The environment emits a bounded reward r : S × A → [r_min, r_max] on each transition. We will use ρ_π(s_t) and ρ_π(s_t, a_t) to denote the state and state-action marginals of the trajectory distribution induced by a policy π(a_t|s_t).

3.2. Maximum Entropy Reinforcement Learning

Standard RL maximizes the expected sum of rewards Σ_t E_{(s_t,a_t)∼ρ_π}[r(s_t, a_t)]. We will consider a more general maximum entropy objective (see e.g. Ziebart (2010)), which favors stochastic policies by augmenting the objective with the expected entropy of the policy over ρ_π(s_t):

J(π) = Σ_{t=0}^{T} E_{(s_t,a_t)∼ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ].    (1)

The temperature parameter α determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy. The maximum entropy objective differs from the standard maximum expected reward objective used in conventional reinforcement learning, though the conventional objective can be recovered in the limit as α → 0. For the rest of this paper, we will omit writing the temperature explicitly, as it can always be subsumed into the reward by scaling it by α^{−1}.

This objective has a number of conceptual and practical advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. Lastly, prior work has observed improved exploration with this objective (Haarnoja et al., 2017; Schulman et al., 2017a), and in our experiments, we observe that it considerably improves learning speed over state-of-the-art methods that optimize the conventional RL objective function. We can extend the objective to infinite horizon problems by introducing a discount factor γ to ensure that the sum of expected rewards and entropies is finite. Writing down the maximum entropy objective for the infinite horizon discounted case is more involved (Thomas, 2014) and is deferred to Appendix A.
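To make the objective concrete, the following is a small, hedged sketch (not taken from the paper) of a Monte-Carlo estimate of the discounted entropy-augmented return for a single rollout; the reward and entropy values below are made-up illustrations, and the temperature α is written explicitly.

# Illustrative sketch: discounted estimate of the maximum entropy objective in
# Equation 1 for one sampled trajectory. The inputs are assumed to be per-step
# rewards and per-state policy entropies collected during the rollout.
def max_entropy_return(rewards, entropies, alpha=0.2, gamma=0.99):
    ret = 0.0
    for t, (r, h) in enumerate(zip(rewards, entropies)):
        ret += gamma**t * (r + alpha * h)
    return ret

# Hypothetical three-step rollout.
print(max_entropy_return(rewards=[1.0, 0.5, 2.0], entropies=[1.4, 1.1, 0.9]))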

Prior methods have proposed directly solving for the optimal Q-function, from which the optimal policy can be recovered (Ziebart et al., 2008; Fox et al., 2016; Haarnoja et al., 2017). We will discuss how we can devise a soft actor-critic algorithm through a policy iteration formulation, where we instead evaluate the Q-function of the current policy and update the policy through an off-policy gradient update. Though such algorithms have previously been proposed for conventional reinforcement learning, our method is, to our knowledge, the first off-policy actor-critic method in the maximum entropy reinforcement learning framework.

4. From Soft Policy Iteration to Soft Actor-Critic

Our off-policy soft actor-critic algorithm can be derived starting from a maximum entropy variant of the policy iteration method. We will first present this derivation, verify that the corresponding algorithm converges to the optimal policy from its density class, and then present a practical deep reinforcement learning algorithm based on this theory.

4.1. Derivation of Soft Policy Iteration

We will begin by deriving soft policy iteration, a general algorithm for learning optimal maximum entropy policies that alternates between policy evaluation and policy improvement in the maximum entropy framework. Our derivation is based on a tabular setting, to enable theoretical analysis and convergence guarantees, and we extend this method into the general continuous setting in the next section. We will show that soft policy iteration converges to the optimal policy within a set of policies which might correspond, for instance, to a set of parameterized densities.

In the policy evaluation step of soft policy iteration, we wish to compute the value of a policy π according to the maximum entropy objective in Equation 1. For a fixed policy, the soft Q-value can be computed iteratively, starting from any function Q : S × A → R and repeatedly applying a modified Bellman backup operator T^π given by

T^π Q(s_t, a_t) ≜ r(s_t, a_t) + γ E_{s_{t+1}∼p}[ V(s_{t+1}) ],    (2)

where

V(s_t) = E_{a_t∼π}[ Q(s_t, a_t) − log π(a_t|s_t) ]    (3)

is the soft state value function. We can obtain the soft value function for any policy π by repeatedly applying T^π, as formalized below.

Lemma 1 (Soft Policy Evaluation). Consider the soft Bellman backup operator T^π in Equation 2 and a mapping Q^0 : S × A → R with |A| < ∞, and define Q^{k+1} = T^π Q^k. Then the sequence Q^k will converge to the soft Q-value of π as k → ∞.

Proof. See Appendix B.1.
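As a concrete illustration of Lemma 1, a minimal tabular sketch (hypothetical variable names, not the authors' code) that repeatedly applies the backup in Equations 2 and 3 might look as follows; P, R, and pi are assumed transition probabilities, rewards, and a fixed stochastic policy.

import numpy as np

# Illustrative sketch of soft policy evaluation in a small tabular MDP.
# P has shape (n_states, n_actions, n_states), R has shape (n_states, n_actions),
# and pi has shape (n_states, n_actions) with rows summing to one.
def soft_policy_evaluation(P, R, pi, gamma=0.99, iters=1000):
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Soft state value, Equation 3: V(s) = E_{a~pi}[Q(s,a) - log pi(a|s)]
        V = np.sum(pi * (Q - np.log(pi + 1e-12)), axis=1)
        # Soft Bellman backup, Equation 2: Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
    return Q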

In the policy improvement step, we update the policy towards the exponential of the new Q-function. This particular choice of update can be guaranteed to result in an improved policy in terms of its soft value. Since in practice we prefer policies that are tractable, we will additionally restrict the policy to some set of policies Π, which can correspond, for example, to a parameterized family of distributions such as Gaussians. To account for the constraint that π ∈ Π, we project the improved policy into the desired set of policies. While in principle we could choose any projection, it will turn out to be convenient to use the information projection defined in terms of the Kullback-Leibler divergence. In other words, in the policy improvement step, for each state, we update the policy according to

π_new = arg min_{π'∈Π} D_KL( π'(·|s_t) ∥ exp(Q^{π_old}(s_t, ·)) / Z^{π_old}(s_t) ).    (4)

The partition function Z^{π_old}(s_t) normalizes the distribution, and while it is intractable in general, it does not contribute to the gradient with respect to the new policy and can thus be ignored, as noted in the next section. For this projection, we can show that the new, projected policy has a higher value than the old policy with respect to the objective in Equation 1. We formalize this result in Lemma 2.

Lemma 2 (Soft Policy Improvement). Let π_old ∈ Π and let π_new be the optimizer of the minimization problem defined in Equation 4. Then Q^{π_new}(s_t, a_t) ≥ Q^{π_old}(s_t, a_t) for all (s_t, a_t) ∈ S × A with |A| < ∞.

Proof. See Appendix B.2.
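When the policy set Π is unrestricted, the minimizer of the KL divergence in Equation 4 is simply the Boltzmann distribution over the soft Q-values; the following tabular sketch covers only this special case and is illustrative, not the projection onto a parameterized family used later by SAC.

import numpy as np

# Illustrative sketch: unrestricted policy improvement step of Equation 4, which
# reduces to a per-state softmax over soft Q-values. Q is a hypothetical tabular
# soft Q-function of shape (n_states, n_actions); alpha is the temperature.
def soft_policy_improvement(Q, alpha=1.0):
    logits = Q / alpha
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    pi_new = np.exp(logits)
    pi_new /= pi_new.sum(axis=1, keepdims=True)      # divide by the partition function Z(s)
    return pi_new                                    # pi_new(a|s) proportional to exp(Q(s,a)/alpha)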

The full soft policy iteration algorithm alternates between the soft policy evaluation and the soft policy improvement steps, and it will provably converge to the optimal maximum entropy policy among the policies in Π (Theorem 1). Although this algorithm will provably find the optimal solution, we can perform it in its exact form only in the tabular case. Therefore, we will next approximate the algorithm for continuous domains, where we need to rely on a function approximator to represent the Q-values, and running the two steps until convergence would be computationally too expensive. The approximation gives rise to a new practical algorithm, called soft actor-critic.

Theorem 1 (Soft Policy Iteration). Repeated application of soft policy evaluation and soft policy improvement from any π ∈ Π converges to a policy π* such that Q^{π*}(s_t, a_t) ≥ Q^π(s_t, a_t) for all π ∈ Π and (s_t, a_t) ∈ S × A, assuming |A| < ∞.

Proof. See Appendix B.3.

4.2. Soft Actor-Critic

As discussed above, large continuous domains require us to derive a practical approximation to soft policy iteration. To that end, we will use function approximators for both the Q-function and the policy, and instead of running evaluation and improvement to convergence, alternate between optimizing both networks with stochastic gradient descent. We will consider a parameterized state value function V_ψ(s_t), soft Q-function Q_θ(s_t, a_t), and a tractable policy π_φ(a_t|s_t). The parameters of these networks are ψ, θ, and φ. For example, the value functions can be modeled as expressive neural networks, and the policy as a Gaussian with mean and covariance given by neural networks. We will next derive update rules for these parameter vectors.

The state value function approximates the soft value. There is no need in principle to include a separate function approximator for the state value, since it is related to the Q-function and policy according to Equation 3.

This quantity can be estimated from a single action sample from the current policy without introducing a bias, but in practice, including a separate function approximator for the soft value can stabilize training and is convenient to train simultaneously with the other networks. The soft value function is trained to minimize the squared residual error

J_V(ψ) = E_{s_t∼D} [ ½ ( V_ψ(s_t) − E_{a_t∼π_φ}[ Q_θ(s_t, a_t) − log π_φ(a_t|s_t) ] )² ],    (5)

where D is the distribution of previously sampled states and actions, or a replay buffer. The gradient of Equation 5 can be estimated with an unbiased estimator

∇_ψ J_V(ψ) = ∇_ψ V_ψ(s_t) ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t|s_t) ),    (6)

where the actions are sampled according to the current policy, instead of the replay buffer. The soft Q-function parameters can be trained to minimize the soft Bellman residual

J_Q(θ) = E_{(s_t,a_t)∼D} [ ½ ( Q_θ(s_t, a_t) − Q̂(s_t, a_t) )² ],    (7)

with

Q̂(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼p}[ V_ψ̄(s_{t+1}) ],    (8)

which again can be optimized with stochastic gradients

∇_θ J_Q(θ) = ∇_θ Q_θ(s_t, a_t) ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ V_ψ̄(s_{t+1}) ).    (9)
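As a concrete illustration, a hedged PyTorch-style sketch of the objectives in Equations 5–8 is given below; value_net, target_value_net, q_net, and policy are assumed placeholder modules (with policy.sample returning an action and its log-probability), not the authors' released implementation.

import torch
import torch.nn.functional as F

# Illustrative sketch of the value loss (Eq. 5) and soft Bellman residual (Eqs. 7-8).
def value_and_q_losses(batch, value_net, target_value_net, q_net, policy, gamma=0.99):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer D

    # Soft value target, estimated with a single action sample from the current policy.
    a_pi, log_pi = policy.sample(s)
    v_target = q_net(s, a_pi) - log_pi
    value_loss = 0.5 * F.mse_loss(value_net(s), v_target.detach())

    # Soft Q-function target uses the (slowly updated) target value network.
    with torch.no_grad():
        q_target = r + gamma * target_value_net(s_next)
    q_loss = 0.5 * F.mse_loss(q_net(s, a), q_target)

    return value_loss, q_loss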

The update makes use of a target value network V_ψ̄, where ψ̄ can be an exponentially moving average of the value network weights, which has been shown to stabilize training (Mnih et al., 2015). Alternatively, we can update the target weights to match the current value function weights periodically (see Appendix E). Finally, the policy parameters can be learned by directly minimizing the expected KL-divergence in Equation 4:

J_π(φ) = E_{s_t∼D} [ D_KL( π_φ(·|s_t) ∥ exp(Q_θ(s_t, ·)) / Z_θ(s_t) ) ].    (10)

There are several options for minimizing J_π. A typical solution for policy gradient methods is to use the likelihood ratio gradient estimator (Williams, 1992), which does not require backpropagating the gradient through the policy and the target density networks. However, in our case, the target density is the Q-function, which is represented by a neural network and can be differentiated, and it is thus convenient to apply the reparameterization trick instead, resulting in a lower variance estimator. To that end, we reparameterize the policy using a neural network transformation

a_t = f_φ(ε_t; s_t),    (11)

Algorithm 1 Soft Actor-Critic
  Initialize parameter vectors ψ, ψ̄, θ, φ.
  for each iteration do
    for each environment step do
      a_t ∼ π_φ(a_t|s_t)
      s_{t+1} ∼ p(s_{t+1}|s_t, a_t)
      D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
    end for
    for each gradient step do
      ψ ← ψ − λ_V ∇_ψ J_V(ψ)
      θ_i ← θ_i − λ_Q ∇_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
      φ ← φ − λ_π ∇_φ J_π(φ)
      ψ̄ ← τψ + (1 − τ)ψ̄
    end for
  end for

where ε_t is an input noise vector, sampled from some fixed distribution, such as a spherical Gaussian. We can now rewrite the objective in Equation 10 as

J_π(φ) = E_{s_t∼D, ε_t∼N} [ log π_φ(f_φ(ε_t; s_t)|s_t) − Q_θ(s_t, f_φ(ε_t; s_t)) ],    (12)

where π_φ is defined implicitly in terms of f_φ, and we have noted that the partition function is independent of φ and can thus be omitted. We can approximate the gradient of Equation 12 with

∇_φ J_π(φ) = ∇_φ log π_φ(a_t|s_t) + ( ∇_{a_t} log π_φ(a_t|s_t) − ∇_{a_t} Q(s_t, a_t) ) ∇_φ f_φ(ε_t; s_t),    (13)

where a_t is evaluated at f_φ(ε_t; s_t). This unbiased gradient estimator extends the DDPG-style policy gradients (Lillicrap et al., 2015) to any tractable stochastic policy.
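A hedged sketch of the reparameterized policy objective in Equation 12, assuming a policy module whose rsample method returns an action a_t = f_φ(ε_t; s_t) together with its log-probability (the module names are hypothetical):

# Illustrative sketch of the policy loss in Eq. 12; minimizing log pi - Q pushes
# the policy toward exp(Q)/Z while keeping the estimator differentiable through
# the reparameterized action sample.
def policy_loss(states, policy, q_net):
    a, log_pi = policy.rsample(states)          # differentiable action sample
    return (log_pi - q_net(states, a)).mean()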

Our algorithm also makes use of two Q-functions to mitigate positive bias in the policy improvement step that is known to degrade the performance of value-based methods (Hasselt, 2010; Fujimoto et al., 2018). In particular, we parameterize two Q-functions, with parameters θ_i, and train them independently to optimize J_Q(θ_i). We then use the minimum of the Q-functions for the value gradient in Equation 6 and the policy gradient in Equation 13, as proposed by Fujimoto et al. (2018). Although our algorithm can learn challenging tasks, including a 21-dimensional Humanoid, using just a single Q-function, we found that two Q-functions significantly speed up training, especially on harder tasks. The complete algorithm is described in Algorithm 1. The method alternates between collecting experience from the environment with the current policy and updating the function approximators using stochastic gradients computed from batches sampled from a replay buffer. In practice, we take a single environment step followed by one or several gradient steps (see Appendix D for all hyperparameters).
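For instance, using the minimum of the two Q-functions when forming the value target and the policy loss can be sketched as follows (hypothetical module names, not the released code):

import torch

# Illustrative sketch of the clipped double-Q trick in the value target (Eq. 5)
# and the reparameterized policy loss (Eq. 12).
def min_q_targets(states, policy, q1, q2):
    a, log_pi = policy.rsample(states)
    q_min = torch.min(q1(states, a), q2(states, a))   # mitigates positive bias
    v_target = (q_min - log_pi).detach()              # regression target for the value network
    pi_loss = (log_pi - q_min).mean()                 # policy loss using the minimum Q
    return v_target, pi_loss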

[Figure 1: six panels of training curves for (a) Hopper-v1, (b) Walker2d-v1, (c) HalfCheetah-v1, (d) Ant-v1, (e) Humanoid-v1, and (f) Humanoid (rllab), plotting average return against million steps for SAC, DDPG, PPO, SQL, and TD3 (concurrent).]

Figure 1. Training curves on continuous control benchmarks. Soft actor-critic (yellow) performs consistently across all tasks, outperforming both on-policy and off-policy methods on the most challenging tasks.

Using off-policy data from a replay buffer is feasible because both value estimators and the policy can be trained entirely on off-policy data. The algorithm is agnostic to the parameterization of the policy, as long as it can be evaluated for any arbitrary state-action tuple.

5. Experiments

The goal of our experimental evaluation is to understand how the sample complexity and stability of our method compares with prior off-policy and on-policy deep reinforcement learning algorithms. We compare our method to prior techniques on a range of challenging continuous control tasks from the OpenAI gym benchmark suite (Brockman et al., 2016) and also on the rllab implementation of the Humanoid task (Duan et al., 2016). Although the easier tasks can be solved by a wide range of different algorithms, the more complex benchmarks, such as the 21-dimensional Humanoid (rllab), are exceptionally difficult to solve with off-policy algorithms (Duan et al., 2016). The stability of the algorithm also plays a large role in performance: easier tasks make it more practical to tune hyperparameters to achieve good results, while the already narrow basins of effective hyperparameters become prohibitively small for the more sensitive algorithms on the hardest benchmarks, leading to poor performance (Gu et al., 2016).

We compare our method to deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), an algorithm that is regarded as one of the more efficient off-policy deep RL methods (Duan et al., 2016); proximal policy optimization (PPO) (Schulman et al., 2017b), a stable and effective on-policy policy gradient algorithm; and soft Q-learning (SQL) (Haarnoja et al., 2017), a recent off-policy algorithm for learning maximum entropy policies. Our SQL implementation also includes two Q-functions, which we found to improve its performance in most environments. We additionally compare to the twin delayed deep deterministic policy gradient (TD3) algorithm (Fujimoto et al., 2018), using the author-provided implementation. This is an extension to DDPG, proposed concurrently to our method, that first applied the double Q-learning trick to continuous control along with other improvements. We have included trust region path consistency learning (Trust-PCL) (Nachum et al., 2017b) and two other variants of SAC in Appendix E. We turned off the exploration noise for evaluation for DDPG and PPO. For maximum entropy algorithms, which do not explicitly inject exploration noise, we either evaluated with the exploration noise (SQL) or used the mean action (SAC). The source code of our SAC implementation (github.com/haarnoja/sac) and videos (sites.google.com/view/soft-actor-critic) are available online.

5.1. Comparative Evaluation

Figure 1 shows the total average return of evaluation rollouts during training for DDPG, PPO, and TD3. We train five different instances of each algorithm with different random seeds, with each performing one evaluation rollout every 1000 environment steps. The solid curves correspond to the mean and the shaded region to the minimum and maximum returns over the five trials.

The results show that, overall, SAC performs comparably to the baseline methods on the easier tasks and outperforms them on the harder tasks by a large margin, both in terms of learning speed and final performance. For example, DDPG fails to make any progress on Ant-v1, Humanoid-v1, and Humanoid (rllab), a result that is corroborated by prior work (Gu et al., 2016; Duan et al., 2016). SAC also learns considerably faster than PPO as a consequence of the large batch sizes PPO needs to learn stably on more high-dimensional and complex tasks. Another maximum entropy RL algorithm, SQL, can also learn all of the tasks, but it is slower than SAC and has worse asymptotic performance. The quantitative results attained by SAC in our experiments also compare very favorably to results reported by other methods in prior work (Duan et al., 2016; Gu et al., 2016; Henderson et al., 2017), indicating that both the sample efficiency and final performance of SAC on these benchmark tasks exceed the state of the art. All hyperparameters used in this experiment for SAC are listed in Appendix D.

5.2. Ablation Study

The results in the previous section suggest that algorithms based on the maximum entropy principle can outperform conventional RL methods on challenging tasks such as the humanoid tasks. In this section, we further examine which particular components of SAC are important for good performance. We also examine how sensitive SAC is to some of the most important hyperparameters, namely reward scaling and the target value update smoothing constant.

Stochastic vs. deterministic policy. Soft actor-critic learns stochastic policies via a maximum entropy objective. The entropy appears in both the policy and the value function. In the policy, it prevents premature convergence of the policy variance (Equation 10). In the value function, it encourages exploration by increasing the value of regions of state space that lead to high-entropy behavior (Equation 5). To compare how the stochasticity of the policy and entropy maximization affect performance, we compare to a deterministic variant of SAC that does not maximize the entropy and that closely resembles DDPG, with the exception of having two Q-functions, using hard target updates, not having a separate target actor, and using fixed rather than learned exploration noise. Figure 2 compares five individual runs with both variants, initialized with different random seeds.

[Figure 2: average return against million steps on Humanoid (rllab) for individual runs of the stochastic policy (SAC) and the deterministic variant.]

Figure 2. Comparison of SAC (blue) and a deterministic variant of SAC (red) in terms of the stability of individual random seeds on the Humanoid (rllab) benchmark. The comparison indicates that stochasticity can stabilize training, as the variability between the seeds becomes much higher with a deterministic policy.

Soft actor-critic performs much more consistently, while the deterministic variant exhibits very high variability across seeds, indicating substantially worse stability. As evident from the figure, learning a stochastic policy with entropy maximization can drastically stabilize training. This becomes especially important with harder tasks, where tuning hyperparameters is challenging. In this comparison, we updated the target value network weights with hard updates, by periodically overwriting the target network parameters to match the current value network (see Appendix E for a comparison of average performance on all benchmark tasks).

Policy evaluation. Since SAC converges to stochastic policies, it is often beneficial to make the final policy deterministic at the end for best performance. For evaluation, we approximate the maximum a posteriori action by choosing the mean of the policy distribution. Figure 3(a) compares training returns to evaluation returns obtained with this strategy, indicating that deterministic evaluation can yield better performance. It should be noted that all of the training curves depict the sum of rewards, which is different from the objective optimized by SAC and other maximum entropy RL algorithms, including SQL and Trust-PCL, which also maximize the entropy of the policy.

Reward scale. Soft actor-critic is particularly sensitive to the scaling of the reward signal, because it serves the role of the temperature of the energy-based optimal policy and thus controls its stochasticity. Larger reward magnitudes correspond to lower entropies. Figure 3(b) shows how learning performance changes when the reward scale is varied: for small reward magnitudes, the policy becomes nearly uniform and consequently fails to exploit the reward signal, resulting in substantial degradation of performance. For large reward magnitudes, the model learns quickly at first, but the policy then becomes nearly deterministic, leading to poor local minima due to lack of adequate exploration.

[Figure 3: three panels of average return against million steps on Ant-v1 for (a) deterministic vs. stochastic evaluation, (b) reward scales 1, 3, 10, 30, and 100, and (c) target smoothing coefficients τ = 0.0001, 0.001, 0.01, and 0.1.]

Figure 3. Sensitivity of soft actor-critic to selected hyperparameters on the Ant-v1 task. (a) Evaluating the policy using the mean action generally results in a higher return. Note that the policy is trained to also maximize the entropy, and the mean action does not, in general, correspond to the optimal action for the maximum return objective. (b) Soft actor-critic is sensitive to reward scaling, since it is related to the temperature of the optimal policy. The optimal reward scale varies between environments and should be tuned for each task separately. (c) The target value smoothing coefficient τ is used to stabilize training. A fast-moving target (large τ) can result in instabilities (red), whereas a slow-moving target (small τ) makes training slower (blue).

With the right reward scaling, the model balances exploration and exploitation, leading to faster learning and better asymptotic performance. In practice, we found the reward scale to be the only hyperparameter that requires tuning, and its natural interpretation as the inverse of the temperature in the maximum entropy framework provides good intuition for how to adjust this parameter.

Target network update. It is common to use a separate target value network that slowly tracks the actual value function to improve stability. We use an exponentially moving average, with a smoothing constant τ, to update the target value network weights, as is common in prior work (Lillicrap et al., 2015; Mnih et al., 2015). A value of one corresponds to a hard update, where the weights are copied directly at every iteration, and zero to not updating the target at all. In Figure 3(c), we compare the performance of SAC as τ varies. Large τ can lead to instabilities, while small τ can make training slower. However, we found the range of suitable values of τ to be relatively wide, and we used the same value (0.005) across all of the tasks. In Figure 4 (Appendix E) we also compare to another variant of SAC, where instead of using an exponentially moving average, we copy the current network weights directly into the target network every 1000 gradient steps. We found this variant to benefit from taking more than one gradient step between environment steps, which can improve performance but also increases the computational cost.
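A hedged sketch contrasting the two target-update schemes discussed above (target_net and net are assumed to be torch.nn.Module instances; the names are placeholders):

# Exponentially moving average of the weights (soft update, tau = 0.005 by default).
def soft_update(target_net, net, tau=0.005):
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)

# Direct copy of the weights (hard update, done every 1000 gradient steps in the variant).
def hard_update(target_net, net):
    target_net.load_state_dict(net.state_dict())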

6. Conclusion

We present soft actor-critic (SAC), an off-policy maximum entropy deep reinforcement learning algorithm that provides sample-efficient learning while retaining the benefits of entropy maximization and stability. Our theoretical results derive soft policy iteration, which we show to converge to the optimal policy. From this result, we can formulate a soft actor-critic algorithm, and we empirically show that it outperforms state-of-the-art model-free deep RL methods, including the off-policy DDPG algorithm and the on-policy PPO algorithm. In fact, the sample efficiency of this approach actually exceeds that of DDPG by a substantial margin. Our results suggest that stochastic, entropy maximizing reinforcement learning algorithms can provide a promising avenue for improved robustness and stability, and further exploration of maximum entropy methods, including methods that incorporate second order information (e.g., trust regions (Schulman et al., 2015)) or more expressive policy classes, is an exciting avenue for future work.

Acknowledgments

We would like to thank Vitchyr Pong for insightful discussions and help in implementing our algorithm, as well as providing the DDPG baseline code; Ofir Nachum for offering support in running Trust-PCL experiments; and George Tucker for his valuable feedback on an early version of this paper. This work was supported by Siemens and Berkeley DeepDrive.

References

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pp. 834–846, 1983.

Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., and Szepesvari, C. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems (NIPS), pp. 1204–1212, 2009.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), pp. 1352–1361, 2017.

Hasselt, H. V. Double Q-learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2613–2621, 2010.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), pp. 2944–2952, 2015.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1–9, 2013.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2772–2782, 2017a.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017b.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Robotics: Science and Systems (RSS), 2012.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.

Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Thomas, P. Bias in natural actor-critic algorithms. In International Conference on Machine Learning (ICML), pp. 441–448, 2014.

Todorov, E. General duality between optimal control and estimation. In IEEE Conference on Decision and Control (CDC), pp. 4286–4292. IEEE, 2008.

Toussaint, M. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), pp. 1049–1056. ACM, 2009.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1433–1438, 2008.

A. Maximum Entropy Objective

The exact definition of the discounted maximum entropy objective is complicated by the fact that, when using a discount factor for policy gradient methods, we typically do not discount the state distribution, only the rewards. In that sense, discounted policy gradients typically do not optimize the true discounted objective. Instead, they optimize average reward, with the discount serving to reduce variance, as discussed by Thomas (2014). However, we can define the objective that is optimized under a discount factor as

J(π) = Σ_{t=0}^{∞} E_{(s_t,a_t)∼ρ_π} [ Σ_{l=t}^{∞} γ^{l−t} E_{s_l∼p, a_l∼π} [ r(s_t, a_t) + α H(π(·|s_t)) | s_t, a_t ] ].    (14)

This objective corresponds to maximizing the discounted expected reward and entropy for future states originating fromevery state-action tuple (st,at) weighted by its probability ρπ under the current policy.

B. Proofs

B.1. Lemma 1

Lemma 1 (Soft Policy Evaluation). Consider the soft Bellman backup operator T^π in Equation 2 and a mapping Q^0 : S × A → R with |A| < ∞, and define Q^{k+1} = T^π Q^k. Then the sequence Q^k will converge to the soft Q-value of π as k → ∞.

Proof. Define the entropy-augmented reward as r_π(s_t, a_t) ≜ r(s_t, a_t) + E_{s_{t+1}∼p}[ H(π(·|s_{t+1})) ] and rewrite the update rule as

Q(s_t, a_t) ← r_π(s_t, a_t) + γ E_{s_{t+1}∼p, a_{t+1}∼π}[ Q(s_{t+1}, a_{t+1}) ]    (15)

and apply the standard convergence results for policy evaluation (Sutton & Barto, 1998). The assumption |A| < ∞ is required to guarantee that the entropy-augmented reward is bounded.

B.2. Lemma 2

Lemma 2 (Soft Policy Improvement). Let π_old ∈ Π and let π_new be the optimizer of the minimization problem defined in Equation 4. Then Q^{π_new}(s_t, a_t) ≥ Q^{π_old}(s_t, a_t) for all (s_t, a_t) ∈ S × A with |A| < ∞.

Proof. Let π_old ∈ Π and let Q^{π_old} and V^{π_old} be the corresponding soft state-action value and soft state value, and let π_new be defined as

π_new(·|s_t) = arg min_{π'∈Π} D_KL( π'(·|s_t) ∥ exp(Q^{π_old}(s_t, ·) − log Z^{π_old}(s_t)) )
             = arg min_{π'∈Π} J_{π_old}(π'(·|s_t)).    (16)

It must be the case that J_{π_old}(π_new(·|s_t)) ≤ J_{π_old}(π_old(·|s_t)), since we can always choose π_new = π_old ∈ Π. Hence

E_{a_t∼π_new}[ log π_new(a_t|s_t) − Q^{π_old}(s_t, a_t) + log Z^{π_old}(s_t) ] ≤ E_{a_t∼π_old}[ log π_old(a_t|s_t) − Q^{π_old}(s_t, a_t) + log Z^{π_old}(s_t) ],    (17)

and since the partition function Z^{π_old} depends only on the state, the inequality reduces to

E_{a_t∼π_new}[ Q^{π_old}(s_t, a_t) − log π_new(a_t|s_t) ] ≥ V^{π_old}(s_t).    (18)

Next, consider the soft Bellman equation:

Q^{π_old}(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼p}[ V^{π_old}(s_{t+1}) ]
                    ≤ r(s_t, a_t) + γ E_{s_{t+1}∼p}[ E_{a_{t+1}∼π_new}[ Q^{π_old}(s_{t+1}, a_{t+1}) − log π_new(a_{t+1}|s_{t+1}) ] ]
                    ⋮
                    ≤ Q^{π_new}(s_t, a_t),    (19)

where we have repeatedly expanded Q^{π_old} on the right-hand side by applying the soft Bellman equation and the bound in Equation 18. Convergence to Q^{π_new} follows from Lemma 1.

B.3. Theorem 1

Theorem 1 (Soft Policy Iteration). Repeated application of soft policy evaluation and soft policy improvement to any π ∈ Π converges to a policy π* such that Q^{π*}(s_t, a_t) ≥ Q^π(s_t, a_t) for all π ∈ Π and (s_t, a_t) ∈ S × A, assuming |A| < ∞.

Proof. Let π_i be the policy at iteration i. By Lemma 2, the sequence Q^{π_i} is monotonically increasing. Since Q^π is bounded above for π ∈ Π (both the reward and entropy are bounded), the sequence converges to some π*. We still need to show that π* is indeed optimal. At convergence, it must be the case that J_{π*}(π*(·|s_t)) < J_{π*}(π(·|s_t)) for all π ∈ Π, π ≠ π*. Using the same iterative argument as in the proof of Lemma 2, we get Q^{π*}(s_t, a_t) > Q^π(s_t, a_t) for all (s_t, a_t) ∈ S × A, that is, the soft value of any other policy in Π is lower than that of the converged policy. Hence π* is optimal in Π.

C. Enforcing Action Bounds

We use an unbounded Gaussian as the action distribution. However, in practice, the actions need to be bounded to a finite interval. To that end, we apply an invertible squashing function (tanh) to the Gaussian samples, and employ the change-of-variables formula to compute the likelihoods of the bounded actions. In other words, let u ∈ R^D be a random variable and µ(u|s) the corresponding density with infinite support. Then a = tanh(u), where tanh is applied elementwise, is a random variable with support in (−1, 1) with a density given by

π(a|s) = µ(u|s) |det(da/du)|^{−1}.    (20)

Since the Jacobian da/du = diag(1 − tanh²(u)) is diagonal, the log-likelihood has a simple form,

log π(a|s) = log µ(u|s) − Σ_{i=1}^{D} log(1 − tanh²(u_i)),    (21)

where u_i is the i-th element of u.
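A hedged sketch of this change of variables for a diagonal Gaussian policy, assuming mu and std are outputs of the policy network (the function name is illustrative):

import torch

# Illustrative sketch of Eqs. 20-21: sample u from a Gaussian, squash with tanh,
# and correct the log-likelihood by the log-determinant of the Jacobian.
def squashed_sample_and_log_prob(mu, std, eps=1e-6):
    normal = torch.distributions.Normal(mu, std)
    u = normal.rsample()                           # reparameterized Gaussian sample
    a = torch.tanh(u)                              # bounded action in (-1, 1)
    log_prob = normal.log_prob(u).sum(dim=-1)      # log mu(u|s) for the diagonal Gaussian
    log_prob -= torch.log(1.0 - a.pow(2) + eps).sum(dim=-1)
    return a, log_prob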

D. Hyperparameters

Table 1 lists the common SAC parameters used in the comparative evaluation in Figure 1 and Figure 4. Table 2 lists the reward scale parameter that was tuned for each environment.

Table 1. SAC Hyperparameters

Parameter                                     Value

Shared
  optimizer                                   Adam (Kingma & Ba, 2015)
  learning rate                               3 · 10^−4
  discount (γ)                                0.99
  replay buffer size                          10^6
  number of hidden layers (all networks)      2
  number of hidden units per layer            256
  number of samples per minibatch             256
  nonlinearity                                ReLU

SAC
  target smoothing coefficient (τ)            0.005
  target update interval                      1
  gradient steps                              1

SAC (hard target update)
  target smoothing coefficient (τ)            1
  target update interval                      1000
  gradient steps (except humanoids)           4
  gradient steps (humanoids)                  1

Table 2. SAC Environment-Specific Parameters

Environment         Action Dimensions    Reward Scale
Hopper-v1           3                    5
Walker2d-v1         6                    5
HalfCheetah-v1      6                    5
Ant-v1              8                    5
Humanoid-v1         17                   20
Humanoid (rllab)    21                   10

E. Additional Baseline Results

Figure 4 compares SAC to Trust-PCL. Trust-PCL fails to solve most of the tasks within the given number of environment steps, although it can eventually solve the easier tasks (Nachum et al., 2017b) if run longer. The figure also includes two variants of SAC: a variant that periodically copies the target value network weights directly instead of using an exponentially moving average, and a deterministic ablation which assumes a deterministic policy in the value update (Equation 6) and the policy update (Equation 13), and thus strongly resembles DDPG with the exception of having two Q-functions, using hard target updates, not having a separate target actor, and using fixed rather than learned exploration noise. Both of these methods can learn all of the tasks and they perform comparably to SAC on all but the Humanoid (rllab) task, on which SAC is the fastest.

[Figure 4: six panels of training curves for (a) Hopper-v1, (b) Walker2d-v1, (c) HalfCheetah-v1, (d) Ant-v1, (e) Humanoid-v1, and (f) Humanoid (rllab), plotting average return against million steps for SAC, SAC (hard target update), SAC (hard target update, deterministic), and Trust-PCL.]

Figure 4. Training curves for an additional baseline (Trust-PCL) and for two SAC variants. Soft actor-critic with hard target updates (blue) differs from standard SAC in that it copies the value function network weights directly every 1000 iterations, instead of using an exponentially smoothed average of the weights. The deterministic ablation (red) uses a deterministic policy with fixed Gaussian exploration noise, does not use a value function, drops the entropy terms in the actor and critic function updates, and uses hard target updates for the target Q-functions. It is equivalent to DDPG that uses two Q-functions, hard target updates, and removes the target actor.