laprop: a better way to combine momentum with adaptive gradientanderson/cs445/notebooks/... ·...

18
LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu Ziyin 1 Zhikang T.Wang 1 Masahito Ueda 12 Abstract Identifying a divergence problem in ADAM, we propose a new optimizer, LAPROP, which belongs to the family of adaptive gradient descent meth- ods. This method allows for greater flexibility in choosing its hyperparameters, mitigates the ef- fort of fine tuning, and permits straightforward interpolation between the signed gradient meth- ods and the adaptive gradient methods. We bound the regret of LAPROP on a convex problem and show that our bound differs from the previous methods by a key factor, which demonstrates its advantage. We experimentally show that LAPROP outperforms the previous methods on a toy task with noisy gradients, optimization of extremely deep fully-connected networks, neural art style transfer, natural language processing using trans- formers, and reinforcement learning with deep- Q networks. The performance improvement of LAPROP is shown to be consistent, sometimes dramatic and qualitative. 1. Introduction Modern deep learning research and application has become increasingly time consuming due to the need to train and evaluate large models on large-scale problems, where the training may take weeks or even months to complete. This makes the study of optimization of neural networks one of the most important fields of research (Ruder, 2016; Sun, 2019). At the core of the research lies the study of optimiz- ers, i.e., the algorithms by which the parameters of neural networks are updated. Especially, since the widely used opti- mization algorithm ADAM was proposed and became widely used, various modifications of ADAM have been proposed to overcome the difficulties encountered by ADAM in spe- cific cases (Reddi et al., 2018; Liu et al., 2019a; Loshchilov & Hutter, 2017; Luo et al., 2019). Nevertheless, none of them have shown consistent improvement over ADAM on 1 Department of Physics & Institute for Physics of Intelli- gence, University of Tokyo 2 RIKEN CEMS. Correspondence to: Liu Ziyin <[email protected]>. Our implemen- tation is provided at https://github.com/Z-T-WANG/ LaProp-Optimizer Figure 1. Divergence of Adam on a two-layer ReLU network on MNIST with μ = 0.9= 0.7. In contrast, LAPROP continues to work stably for a much larger number of updates. all tasks, and the mechanism behind the optimizers remains vague. In this paper, we propose a new optimizer, LAPROP. We find that on a variety of tasks, LAPROP consistently performs better or at least comparably when compared with ADAM, and especially, LAPROP performs better when the task is noisier or more unstable. Such improvement comes almost for free: LAPROP is closely related to ADAM, and it has exactly the same number of hyperparameters as ADAM; the hyperparameter settings of ADAM can be immediately carried over to LAPROP. Moreover, LAPROP is more stable and it allows for a wider range of hyperparameter choice for which ADAM would diverge, which makes LAPROP able to reach a higher performance over its hyperparameter space, which is demonstrated in our experiments. We hope our proposed optimizer benefits future study of deep learning research and industrial applications. Organization In this section, we give an overview of the up-to-date optimization algorithms for neural networks. In section 2, we identify the key mechanism that is behind the dynamics of adaptive optimizers and point out a key failure of Adam in this formulation, and in section 3, we introduce our optimizer LAPROP as a remedy. 
Section 4 presents our experiments. In section 5, we prove some mathematical properties of our method, and bound the regret of our method by a factor of order O( T ) Contribution This work has three main contributions: (1) proposing a new adaptive optimization algorithm that has considerably better flexibility in hyperparameters, and con- firming that such flexibility can indeed translate to wider ap- plicability and better performance on tasks that are relevant arXiv:2002.04839v1 [cs.LG] 12 Feb 2020

Upload: others

Post on 06-Aug-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

Liu Ziyin 1 Zhikang T.Wang 1 Masahito Ueda 1 2

AbstractIdentifying a divergence problem in ADAM, wepropose a new optimizer, LAPROP, which belongsto the family of adaptive gradient descent meth-ods. This method allows for greater flexibilityin choosing its hyperparameters, mitigates the ef-fort of fine tuning, and permits straightforwardinterpolation between the signed gradient meth-ods and the adaptive gradient methods. We boundthe regret of LAPROP on a convex problem andshow that our bound differs from the previousmethods by a key factor, which demonstrates itsadvantage. We experimentally show that LAPROPoutperforms the previous methods on a toy taskwith noisy gradients, optimization of extremelydeep fully-connected networks, neural art styletransfer, natural language processing using trans-formers, and reinforcement learning with deep-Q networks. The performance improvement ofLAPROP is shown to be consistent, sometimesdramatic and qualitative.

1. IntroductionModern deep learning research and application has becomeincreasingly time consuming due to the need to train andevaluate large models on large-scale problems, where thetraining may take weeks or even months to complete. Thismakes the study of optimization of neural networks one ofthe most important fields of research (Ruder, 2016; Sun,2019). At the core of the research lies the study of optimiz-ers, i.e., the algorithms by which the parameters of neuralnetworks are updated. Especially, since the widely used opti-mization algorithm ADAM was proposed and became widelyused, various modifications of ADAM have been proposedto overcome the difficulties encountered by ADAM in spe-cific cases (Reddi et al., 2018; Liu et al., 2019a; Loshchilov& Hutter, 2017; Luo et al., 2019). Nevertheless, none ofthem have shown consistent improvement over ADAM on

1Department of Physics & Institute for Physics of Intelli-gence, University of Tokyo 2RIKEN CEMS. Correspondenceto: Liu Ziyin <[email protected]>. Our implemen-tation is provided at https://github.com/Z-T-WANG/LaProp-Optimizer

Figure 1. Divergence of Adam on a two-layer ReLU network onMNIST with µ = 0.9, ν = 0.7. In contrast, LAPROP continues towork stably for a much larger number of updates.

all tasks, and the mechanism behind the optimizers remainsvague.

In this paper, we propose a new optimizer, LAPROP. We findthat on a variety of tasks, LAPROP consistently performsbetter or at least comparably when compared with ADAM,and especially, LAPROP performs better when the task isnoisier or more unstable. Such improvement comes almostfor free: LAPROP is closely related to ADAM, and it hasexactly the same number of hyperparameters as ADAM;the hyperparameter settings of ADAM can be immediatelycarried over to LAPROP. Moreover, LAPROP is more stableand it allows for a wider range of hyperparameter choice forwhich ADAM would diverge, which makes LAPROP able toreach a higher performance over its hyperparameter space,which is demonstrated in our experiments. We hope ourproposed optimizer benefits future study of deep learningresearch and industrial applications.

Organization In this section, we give an overview of theup-to-date optimization algorithms for neural networks. Insection 2, we identify the key mechanism that is behindthe dynamics of adaptive optimizers and point out a keyfailure of Adam in this formulation, and in section 3, weintroduce our optimizer LAPROP as a remedy. Section 4presents our experiments. In section 5, we prove somemathematical properties of our method, and bound the regretof our method by a factor of order O(

√T )

Contribution This work has three main contributions: (1)proposing a new adaptive optimization algorithm that hasconsiderably better flexibility in hyperparameters, and con-firming that such flexibility can indeed translate to wider ap-plicability and better performance on tasks that are relevant

arX

iv:2

002.

0483

9v1

[cs

.LG

] 1

2 Fe

b 20

20

Page 2: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

to modern industrial applications and academic interests; (2)conceptually, establishing a new framework for understand-ing different optimizers, as none of the previously proposedframeworks includes our method as a subclass; (3) theoreti-cally, we extend the convergence proofs and show that ourmethod is different from the previous ones, i.e. ADAM andAMSGRAD, by a key factor that has limited their flexibityand often leads to worse performance.

1.1. Gradient Descent FormalismLet `(θ) be the loss function we want to minimize, and θthe model parameter(s). When `(θ) is differentiable, whichis the case in modern deep learning, the popular choice is toperform a stochastic gradient descent step with step size λ.One also speeds up the training procedure by introducingmomentum, which is an acceleration term added to theupdate rule:

gt = ∇θ`(θt−1), (gradient) (1)mt = µmt−1 + gt, (momentum) (2)θt = θt−1 − λmt, (update) (3)

where µ is the momentum hyperparameter. Momentum (insome special but similar form) has been proven to accelerateconvergence (Nesterov, 1983), and it is an experimental factthat optimization of neural networks benefits from momen-tum (Sutskever et al., 2013). When mini-batch training isused, gt is computed as an average over the mini-batch ofdata. It has recently been shown that stochastic gradientdescent can finds the global minimum for a neural networkdespite the fact that no line search is performed over theupdate step size (Du et al., 2018; Zou et al., 2018).

1.2. The Adaptive Gradient FamilyThe adaptive gradient methods have emerged as the mostpopular tool for training deep neural networks over the lastfew years, and they have been of great use both industriallyand academically. The adaptive gradient family divides anupdate step by the running root mean square (RMS) of thegradients (Duchi et al., 2011; Tieleman & Hinton, 2012;Kingma & Ba, 2014), which speeds up training by effec-tively rescaling the update to the order of λO(1) throughoutthe training trajectory. The most popular method among thisfamily is the ADAM algorithm (Kingma & Ba, 2014). whichcomputes the accumulated values, i.e. the momentum andthe preconditioner, as an exponential averaging with decayhyperparameter µ, ν (also referred to as β1, β2 in litera-ture), with bias correction terms cn(t), cm(t) to correct thebias in the initialization:

gt = ∇θ`(θt−1), (4)

nt = νnt−1 + (1 − ν)g2t , (preconditioner) (5)

mt = µmt−1 + (1 − µ)gt, (6)

θt = θt−1 − λmt

cm√nt/cn

, (7)

param. ν → 0 ν → 1

ADAMµ→ 0 SSG RMSPROP2

µ→ 1 divergence ADAM(µ, ν)

LAPROPµ→ 0 SSG RMSPROPµ→ 1 SSG-M ADAM(µ, ν)

Table 1. Sign-variance interpolation and the relationship of ADAM

and LAPROP to other algorithms. In all four limiting cases of thehyperparameters, LAPROP asymptotes to reasonable algorithms,while ADAM does not if we reduce the adaptivity when the mo-mentum is present, i.e., ν → 0 and µ > 0, suggesting that theway ADAM computes momentum may be problematic. This tablealso shows that the signed gradient family of optimizers are spe-cial cases of LAPROP. We emphasize that although ADAM andLAPROP become asymptotically equivalent at the limit of ν → 1,they practically show different behaviours even at ν = 0.999.

where ADAM sets cn(t) = 1 − νt and cm(t) = 1 − µt. Sev-eral variants of ADAM also exist (Reddi et al., 2018; Liuet al., 2019a; Loshchilov & Hutter, 2017; Luo et al., 2019).However, it remains inconclusive which method is the bestand the various “fixes" of ADAM do not show consistentbetter performance.

1.3. Signed Gradient FamilyAnother important family of neural network optimizationis the signed stochastic gradient descent (SSG) (Bernsteinet al., 2018). This method simply uses the signs of thegradients, and, when desired, the momentum term can beadded in as usual for further improvement (SSG-M)1:

gt = sgn[∇θ`(θt−1)], (8)mt = µmt−1 + (1 − µ)gt, (9)θt = θt−1 − λmt. (10)

The original motivation is that, since the gradients of neuralnetwork parameters appears noisy and random, the directionof the update for each parameter may be more importantthan the size of the update. The same paper shows the ef-fectiveness of the signed gradient descent method when thesignal-to-noise ratio (SNR) is high. Our work, in essence,proposes a new method as an interpolation between thesigned method and the adaptive gradient method.

2. The Sign-Variance Mechanism of AdamThe discussion in this section develops upon the insight-ful studies in Bernstein et al. (2018) and Balles & Hennig(2017). It is shown in Balles & Hennig (2017) that thereare essentially two different kinds of mechanisms that drivethe adaptive gradient methods: the sign and the varianceof the gradient. Setting µ = 0 and cm = cn = 1 and from

1We find that though SSG-M works well, it has not been pro-posed in exiting literature. Notably, Ref. (Bernstein et al., 2018)proposed a scheme that applies the sign function after the momen-tum, rather than before.

Page 3: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

equation 7, we can rewrite the update term of ADAM as

1

λ(θt−1 − θt) =

mt√nt

=¿ÁÁÀ 1

1 − ν + ν nt−1

g2t

× sgn(gt). (11)

Qualitatively, we may assume that the gradients are drawnfrom a distribution with a variance of σ2 which, in turn, isestimated by nt. We can also define the estimated relativevariance ηt ∶= σ2

t /g2t ≥ 0. Therefore we have have two

terms:

mt√nt

=√

1

1 − ν + νη2t

´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶adaptivity component

× sgn(gt)´¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¶

sign component

. (12)

Thus, ADAM can be interpreted as a combination of thesigned gradient methods and the adaptive methods. Thesign component updates the parameters following the signsof the gradients, and the adaptivity component reweights theupdate according to the estimated variance. An interpolationbetween the signed and the full adaptive methods can beachieved by setting ν in the two limits (µ = 0):

mt√nt

=⎧⎪⎪⎨⎪⎪⎩

sgn(gt) for limν → 0;gtσt

for limν → 1.(13)

Moreover, Balles & Hennig (2017) have shown that it is thesign component that can explain most behaviors of ADAM,such as the improved training speed on difficult tasks.

Since the parameter µ controls how the momentum is ac-cumulated and ν controls the adaptivity, one expects that,at non-vanishing µ, we can make ν → 0 to “turn off" theadaptivity and interpolate to SSG with momentum. How-ever, this is not the case. Reducing ν to 0 at non-zero µcauses instability and the optimization can diverge; this isalso reflected in the convergence theorems of ADAM whereν is required to be at least of the same order as µ both forthe finite step size dynamics (Reddi et al., 2018) and in thecontinuous time limit (da Silva & Gazeau, 2018). To illus-trate, we may assume that gt is i.i.d. and obeys a standardGaussian distribution, and note that

ut ∶= limν→0

mt√nt

= ∑ti(1 − µ)µt−igi

∣gt∣, (14)

Var[ut] ≥ Var [(1 − µ)µgt−1

gt] =∞ for t > 1, (15)

where the divergence follows from the fact that g2t−1/g2

t

obeys a F (1,1) distribution with divergent mean for µ > 0.This divergence can even be seen when training on MNIST

2In this paper, we refer to the bias-corrected version of RM-SPROP, which differs from the standard RMSPROP by the correc-tion factor cn.

using a simple 2−layer network. See figure 1. We arguethat ADAM fails to interpolate because it does not correctlycompute the momentum. We thus propose the LAPROPalgorithm that allows for a straightforward interpolation.This problem-free interpolation capability gives LAPROPfar more flexibity than ADAM, leading to reduced instabilityand better performance, as shown in the section 4. In table 1,we summarize the behavior of ADAM and LAPROP for somelimiting values of µ and ν.

3. LAPROP

As a solution to the interpolation problem posed in theprevious section, we propose LAPROP, which efficientlyinterpolates between the signed gradient and the adaptivegradient family.

The divergence of ADAM can be understood in the followingway. One can rewrite the ADAM update as follows:

∆θt ∼mt√nt

= gt√nt

+ µ gt−1√nt

+ ... + µt g1√nt. (16)

If we interpret√nt as an estimate for the standard devi-

ation, σt, of the gradient at time t, then gi/σt normalizesthe gradient at time i by the variance at time t. However,since a training trajectory can often be divided into differ-ent regimes, it is reasonable to say that the distribution ofgradients over the trajectory can have a non-trivial timedependence. This has a destabilizing effect for i≪ t, espe-cially if the distribution of the gradient changes throughouttraining. If µ < √

ν, the effect of any past gradient can beupper bounded (see Proposition 5.2), and the destabilizingeffect is counteracted by the decaying factor of µt. How-ever, when µ > √

ν, µt decays too slow and may eventuallyresults in divergence. This shows that the effect of µ is twofold; it controls how the momentum is accumulated, and, italso alleviates the instability in the past gradients.

To resolve this problem, we argue that one should onlyreweight the local gradient by its local variance, and thisonly involves changing the term gi/σt → gi/σi, and theupdate becomes

∆θt ∼gt√nt

+ µ gt−1√nt−1

+ ... + µt g1√n1. (17)

We show in section 5.2 that this simple modification re-moves the divergence problem caused by the improper

Algorithm 1 LAPROP

Input: x1 ∈ Rd, learning rate {ηt}Tt=1, decay parameters0 ≤ µ < 1, 0 ≤ ν < 1, ε≪ 1. Set m0 = 0, n0 = 0.gt = ∇θ`(θt−1)nt = νnt−1 + (1 − ν)g2

t

mt = µmt−1 + (1 − µ) gt√nt/cn+ε

θt+1 = θt − λtmt/cm

Page 4: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

reweighting rule of ADAM, with equal computation com-plexity. This corrected update rule constitutes the core ofLAPROP, of which the algorithm is given as Algorithm 1.In essence, this decouples the momentum from adaptivity.Now µ and ν can be chosen to be independent of each other,and, in section 4, we will show that such decoupling leads towider applicability and better performance of the optimizer.The default hyperparameters we recommend are λ = 4E −4,µ = 0.8 - 0.9, ν = 0.95 - 0.999 and ε = 10−15, but in general,hyperparameter settings of ADAM always work for LAPROPas well, and one may choose the default settings of ADAM.

3.1. A Conceptual FrameworkMany attempts at presenting all adaptive optimizers withinthe same framework exist, such as in (Reddi et al., 2018;Chen et al., 2018; Luo et al., 2019). Likely due to thepopularization of ADAM, these frameworks share the struc-ture of the ADAM algorithm presented in equations 4-7,where the momentum mt and the preconditioner nt are sep-arate, and the update is given as mt/

√nt. We call it the

ADAM style of momentum computation. Although theseexisting frameworks succeed in including various adaptiveoptimizer, they seem to be conceptually biased towards theADAM style momentum, taking the update rule mt/

√nt for

granted. We, however, argue that mt/√nt is a sub-optimal

way of incorporating momentum into ADAM, and that thepreconditioner should be applied before we apply the mo-mentum. In this case, none of the existing frameworksproposed seems to include our algorithm. Here, we developa new framework of adaptive optimizers by extending theframework in Ref. (Wilson et al., 2017) to include our algo-rithm, LAPROP, which also reveals the difference betweenADAM and LAPROP.

We note that the simple gradient descent can be written as

θt+1 = θt − λt∇`(θt), (18)

where each term is defined as above. When combined withthe momentum (be it the Heavy-Ball momentum or Nes-terov), we can write the update using just one line:

θt+1 = θt − λt∇`[θt − γt(θt − θt−1)] + βt(θt − θt−1), (19)

where βt controls how we decay the momentum; the Heavy-Ball momentum has γ = 0, and the Nesterov momentum hasγt = βt. It suffices to set γ = 0 in our discussion.

This framework can be further extended by multiplying theupdates by preconditioning matrices Gt,Kt to include theadaptive gradient family (and also the second-order meth-ods):

θt+1 = θt − λtGt∇`(θt) + βtKt(θt − θt−1) (20)

and the non-adaptive methods corresponds to setting the

preconditioners to the identity, i.e. Gt =Kt = I . Let

Ht = diag( 1 − ν1 − νt

t

∑i=0νt−igt ○ gt)

12

, (21)

where gt ○ gt represents the element-wise product. TheADAM algorithm takes Gt =H−1

t and Kt =H−1t Ht−1:

θt+1 = θt − λtH−1t ∇`(θt) + βt H−1

t Ht−1 (θt − θt−1),(22)

while the LAPROP algorithm can be written as

θt+1 = θt − λtH−1t ∇`(θt) + βt(θt − θt−1), (23)

where the difference is enclosed in the rectangular boxin equation 22. ADAM uses the current preconditionerto reweight the accumulated momentum, while LAPROPleaves the accumulated momentum intact, bringing it closerto the original momentum in non-adaptive methods. In ta-ble 2, we compare different diagonal optimizers under theframework of equation 20.

Intuitively, the term H−1t Ht−1 in ADAM seems to have neg-

ative effects in learning. The term H−1t Ht−1 can be under-

stood as a reweighting of the momentum term βt(θt − θt−1)by referring to the current gradient. Since the term is smallwhen the current gradient is large, it seems to be problem-atic. In fact, when the current gradient is pathologicallylarge, e.g. due to noise or out-of-distribution data points, themomentum in ADAM immediately vanishes. On the con-trary, for LAPROP, the effect of an extremely large gradienton the accumulated moemtum is upper bounded by 1−µ√

1−ν byconstruction (Proposition 5.1). This suggests that LAPROPputs more importance on the past momentum than ADAMdoes, and therefore, we hypothesize that LAPROP has abetter ability to overcome barriers and escape local minimathan ADAM. This also suggests that LAPROP enjoys betterstability in training.

4. Experiment4.1. Learning with Strong Noise in the GradientWe now experimentally show the improved performance ofLAPROP compared with the previous methods. First, weconsider the optimization of the Rosenbrock loss function(Rosenbrock, 1960), for which the optimal solution is knownexactly. For a two parameter system (x, y), we define noisythe Rosenbrock loss function to be

`(x, y) = (1 − x + ε1)2 + 100(x − y + ε2)2, (24)

where x, y are initialized to be 0, and the optimal solutionlies at (1,1), ε1 and ε2 are independently sampled fromUniform(−σ,σ). See Figure 2. We see that ADAM, AMS-GRAD and LAPROP show different behaviors. While ADAM

Page 5: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

non-adaptive adaptive adaptive & momentumSGD SGD-M RMSProp ADAM LAPROP

gradient computation λt λ λ λ λ 1−µ1−µt λ 1−µ

1−µt

Gt I I H−1t H−1

t H−1t

momentum computation βt 0 µ 0 µ µKt − I − H−1

t Ht−1 I

Table 2. Classification of different optimization algorithms according to Equation 20, with I denoting the identity matrix. LAPROP differsfrom ADAM by how the momentum is computed, and differs from the SGD-M by the way the gradient is handled. This table also showsthat the way ADAM computes its momentum is not quite in line with other methods.

(a) σ = 0.04 (b) σ = 0.10 (c) σ = 0.12 (d) LAPROP with σ ≥ 0.12

Figure 2. Time it takes to converge on Noisy-Rosenbrock vs. ν, with noise level σ (the lower the better). We note that LAPROP doesexhibit much stronger robustness against noise in the gradient. (a) When the noise level is small, we see that the optimization speed ofLAPROP is almost invariant w.r.t. the choice of ν, demonstrating its stronger flexibility in choosing hyperparameters. (c, d) For σ ≥ 0.12,only LAPROP succeeds in optimizing the problem even if we lengthen the experiment to 10000 steps (notice the difference in the y−axis),and smaller ν offers better performance.

converges as fast as LAPROP for most of ν, it is unstablewhen ν is small. AMSGRAD, on the other hand, seemsmuch more stable, but its convergence slows down greatlyin the presence of noise. For LAPROP, when the noise issmall, it is (one of) the fastest and is more stable againsta change of ν; when the noise is large (σ ≥ 0.12), onlyLAPROP converges for a suitable range of ν.

4.2. Initiating Training of Extremely Deep FullyConnected Networks

We demonstrate that LAPROP has stronger than existingmethods to initiated training of hard problems. For exam-ple, we show that LAPROP can be used to train extremelydeep fully-connected (FC) networks on MNIST with theReLU activation function. The network has input dimension784, followed by d layers all of which include w hiddenneurons, and followed by a output layer of dimension 10.The networks are initialized with the Kaiming uniform ini-tialization. It is shown in Ref. (Hanin & Rolnick, 2018)that it is very hard to initiate the training of FC networksat extreme depth due to the gradient explosion or gradientvanishing problem, where SGD, ADAM, AMSGRAD allfail. See Figure 3. We see that LAPROP is able to initi-ate training of neural networks with depth up to 5000 (thedeepest the our GPU memory can afford). On the otherhand, to our best try, neither of ADAM or AMSGRAD or

(a) w = 256 (b) LAPROP only

Figure 3. LAPROP: Training of very deep fully-connected neuralnetwork networks. We note that only LAPROP succeeds in trainingdeeper than 500 layers.

SGD can optimize FC neural networks of this depth (weplot ADAM at d = 500 for illustration), and this is a task thatonly LAPROP can achieve. We also run with different widthof the network w ∈ {32,256}. We apply smoothing overconsecutive 50 updates. The training batch size is 1024, andthe experiment includes 400 epochs. For all the tasks, weset µ = 0.8, ν = 0.96, ε = 10−26, and we only tune for aproper learning rate, ranging from λ = 4e − 4 to 1e − 5.

Page 6: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

(a) r = 0.0, no label noise (b) r = 0.1 (c) r = 0.2 (d) r = 0.3

Figure 4. Generalization error on label-corrupted IMDB vs. ν, with corruption probability r. We note that tuning the parameter ν onLAPROP offers stronger robustness against label noise in the gradient. We test on corruption rate r = {0,0.1,0.2,0.3}. For all fourcorruption rates, the behavior of LAPROP is similar. Lower ν gives smaller generalization error and better performance. For ADAM, theperformance becomes worse for a smaller ν due to divergence and instability. For AMSGRAD, the divergence problem is fixed, and thelearning is more stable than ADAM; however, its failure to decouple µ and ν is shown to have a negative effect on performance.

4.3. Achieving Better Generalization PerformanceWhile our analysis has focused on optimization, we devotethis section to show how ν might be tuned to improve thegeneralization of models. The IMDB dataset is a binaryclassification task whose goal is to identify the sentimentof speakers. We use LSTM with attention (Bahdanau et al.,2014) with pretrained GloVe word embedding (Penningtonet al., 2014). Also, we study how label noise affects thedifferent optimizers in this task. We randomly flip labels ofevery data point with probability r ∈ {0,0.1,0.2,0.3}.

For all data points, the training batchsize is 128 and thelearning rate is set to 2 × 10−3 with µ = 0.9. All the mod-els are trained for 15 epochs on this task. See Figure 4.In summary, we find that (1) LAPROP with ν = 0 alwaysachieves the best generalization performance both with andwithout label noise; for example, at zero noise rate, LAPROPachieves 10% accuracy improvement over the best hyper-parameter setting of ADAM and AMSGRAD; (2) ADAMand AMSGRAD are not very stable w.r.t the changes in ν,while LAPROP responds to the changes in ν in a stable andpredictable way, implying that LAPROP’s hyperparameter νis easier to tune in practice.

4.4. Neural Style TransferIn this section, we experiment on the task of generating animage with a given content and a given style. The task isdescribed in detail in (Gatys et al., 2015). It is purely opti-mizational, and we see the advantage of being able to tuneν. The trend we observe agrees well with the experiment insection 4.1. See Figure 5, where we plot the average regretR(T )/T at T = 1000. This is a theoretical quantity oftenstudied in literature, and lower R(T )/T corresponds to abetter convergence rate (Kingma & Ba, 2014; Reddi et al.,2018). We note that, on this task also, lower ν offers betterperformance for LAPROP and worse performance for AMS-GRAD. ADAM features a sweet spot in the intermediaterange. On the right we show examples of training curves.

(a) Different ν (b) Training Trajectories

Figure 5. Neural Style Transfer with different adaptive optimizers.

We note that ADAM has unstable and oscillatory behaviorswhen ν is small, while LAPROP achieves fastest speed atthe same ν.

4.5. Translation and Pretraining with TransformersThe Transformer is known to be difficult to train as a re-cently proposed modern architecture (Vaswani et al., 2017).Especially, it requires a warm-up schedule of learning ratesif a large learning rate is to be used, and it is typically trainedusing ADAM with a ν smaller than 0.999. We show that theperformance of LAPROP parallels or outperforms that ofADAM on the tasks of IWSLT14 German to English trans-lation (Cettolo et al., 2014) and the relatively large-scaleRoBERTa(base) pretraining on the full English Wikipedia(Liu et al., 2019b). The results are shown in Figure 6. Webasically follow the prescriptions given by Ref. (Ott et al.,2019) and used the recommended network architectures andtraining settings thereof. Details are given in the figure cap-tion and Appendix. As demonstrated in (Liu et al., 2019a),without a warmup, ADAM does not converge to a low loss onthe IWSLT14 de-en (German to English) dataset as shownin Fig. 6a. However, LAPROP does not get trapped by thelocal minimum and continues looking for a better solution,though the loss is still higher than that with a warmup. Whenthere is a warmup, LAPROP and ADAM perform compara-bly. In RoBERTa, the default ν is 0.98, but we find that we

Page 7: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

0 10 20 30 40 50 60×103 of updates

4

6

8

10

12lo

ss= 0.98= 0.98 w/o warmup

adamlaprop

(a) training of IWSLT14 de-en

0.0 2.5 5.0 7.5 10.0 12.5 15.0×103 of updates

4

6

8

10

12

14

16

loss

= 0.98= 0.999= 0.999 w/o warmup

adamlaprop

(b) training of RoBERTa on the full English Wikipedia

Figure 6. Learning curves of the transformer tasks. The trainingloss is plotted for every update. When there is a warmup, thelearning rate linearly increases from zero to its maximum and thenlinearly decreases; otherwise it is initialized to be the maximumand decreases. In (a), the warmup includes the first 2×103 updatesand the learning rate vanishes at the 60× 103-th update. In (b), thewarmup includes 10 × 103 updates and the learning rate vanishesat the 125 × 103-th update. The maximum learning rates are 3E-4and 1E-4 for (a) and (b) and µ = 0.9. Smoothing is applied. SeeAppendix for details. The commonly reported perplexity value isequal to 2loss above.

can use ν = 0.999 in our case, which results in a faster con-vergence after the initial stage, and moreover, we can evenignore the warmup schedule and use the maximum learningrate from the beginning, in which case LAPROP turns out tooutperform ADAM by a larger margin. We conjecture thatthe unnecessity of a warmup is due to the small learningrate and the relatively small variance of the data. Note thatADAM with a large ε (10−6) is used by default to improvethe stability during learning, while LAPROP does not needsuch enhancement and always uses a small ε.

4.6. Reinforcement LearningWe compare the performance of LAPROP and ADAM on amodern reinforcement learning task, i.e. to play a Atari2600game, Breakout, and our training procedure mainly followsthe Rainbow DQN (Hessel et al., 2018). To accelerate thetraining, we have changed a few minor settings of Rainbowas detailed in Appendix. The result of test performanceis shown in Fig. 7. We see that LAPROP starts learningearlier than ADAM and achieves high performances fasterthan ADAM. The performance of LAPROP improves with

0 2 4 6 8 10 12millions of frames

0

100

200

300

400

500

rewa

rd

adamlaprop

Figure 7. Testing performance on the Atari2600 game Breakout.The model is saved every 0.2 million frames and each saved modelis evaluated for the given 5 lives in the game. Each model isevaluated 20 times, and one standard deviation of the evaluation isplotted. Because the performance fluctuates, Gaussian smoothingwith a standard deviation of 1 is applied to the data points.

fluctuation just like ADAM, but its overall performance isbetter. The training does not converge at the 12 million-thframe and the performances slowly improve. It is importantto understand that the fluctuation of the performances duringtraining is a result of DQN and not simply due to noise.During the training, the maximum performance achievedby LAPROP (607.3 ± 47.6) is higher than that of ADAM(522.9 ± 40.6), and we find that LAPROP is generally moreaggressive than ADAM in learning and exploration.

4.7. Better Hyperparameter FlexibilityWe train deep residual networks on the CIFAR10 image clas-sification task using LAPROP and ADAM (He et al., 2016;Krizhevsky et al., 2009), implementing the weight decayfollowing the suggestions in (Loshchilov & Hutter, 2017).The network architecture is the standard Resnet-20 3, andwe use the same hyperparameters for ADAM and LAPROPexcept for ε, and we perform a grid search on µ and ν.The results are shown in Fig. 8. We find that LAPROP andADAM perform comparably around the region of commonhyperparameter choices, and when ν gets smaller ADAMoccasionally diverges, and if µ is much larger than ν thenADAM diverges. We believe that it is the residual connec-tions that have made the model stable and robust againstdivergence. Concerning the loss, we find ADAM tends tooverfit when ν is small while LAPROP does not. The com-plete data and the settings are provided in Appendix.

5. Mathematical Properties and ConvergenceAnalysis

The most important and insightful property of the LAPROPalgorithm is that its update can always to be upper bounded

3We use the implementation in https://github.com/bearpaw/pytorch-classification

Page 8: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

0 0.1 0.2 0.3 0.4 0.6 0.8 0.99

0.80

0.85

0.90

0.95

0.99

nan 91.3 91.4 91.1 91.5 91.4 91.2 91.4

nan 91.2 91.5 91.5 91.8 91.6 91.6 91.2

nan 91.3 91.7 10.0 91.6 91.7 91.3 91.4

nan nan 91.4 91.4 91.3 91.3 91.5 91.1

nan nan nan nan 10.0 nan 10.0 91.3

adam

0 0.1 0.2 0.3 0.4 0.6 0.8 0.99

0.80

0.85

0.90

0.95

0.99

90.8 91.1 90.9 91.0 91.6 91.3 91.3 91.4

91.0 90.8 90.8 90.7 91.3 91.0 90.9 91.4

90.9 91.0 91.1 91.6 91.1 91.3 91.4 91.2

90.9 90.8 90.9 90.8 91.3 91.2 91.3 91.6

90.6 91.0 91.0 90.9 90.8 91.3 91.1 91.1

lapr

op

Figure 8. Test accuracy (%) of Resnet-20 trained on CIFAR10. Wetrain for 164 epochs following the standard training procedure.

by a constant for any µ, ν ∈ [0,1).

Proposition 5.1. Bound for LAPROP update. Letmt, cn bedefined as in Algorithm 1, and set cn = 1 − νt, cm = 1 − µt.Then the magnitude of the update can be bounded fromabove as ∣mt

cm∣ ≤ 1√

1−ν .

Another important feature of this bound is that it only de-pends on ν. This is in sharp contrast with the analysis forthe variants of ADAM.

Proposition 5.2. Bound for Adam update. Let mt, cn, cmbe defined as in Equation 7, and set cn = 1−νt, cm = 1−µt.Assume µ < √

ν. Then the magnitude of the update isbounded from above as ∣ cnmt

cm√nt

∣ ≤ 11−γ , where γ = µ√

ν.

Notice that there are two key differences: (1) the bounddepends on the the ratio µ√

ν, suggesting that the momentum

hyperparameter µ and the adaptivity parameter ν are cou-pled in a complicated way for ADAM; (2) the bound onlyapplies when µ > √

ν, suggesting that the range of choicefor the hyperparameters of Adam is limited, while LAPROPis more flexible because µ and ν are decoupled. The mostpopular choice for Adam is (µ = 0.9, ν = 0.999), whichis well within the range of the bound, but our experimentalsection has demonstrated that much more could potentiallybe achieved if such a restriction is removed.

Now, we present the convergence theorem of LAPROPin a convex setting. The proof follows closely in lineRef. (Kingma & Ba, 2014; Reddi et al., 2018). Note thata rigorous proof of convergence for the adaptive gradientfamily has been a major challenge in the field, with variousbounds and rates present, and, in this section, we do notaim at solving this problem, but only at providing a bound

whose terms are qualitatively insightful.

Theorem 1. (Regret bound for convex problem) Let the lossfunction be convex and the gradient be bounded at all timesteps, with ∣∣∇`t(θ)∣∣∞ ≤ G∞ for all θ ∈ Rd, and let thedistance between any θt learned by LAPROP be bounded,with ∣∣θt1 − θt2 ∣∣ ≤D, and let µ, ν ∈ [0,1). Let the learningrate and the momentum decay as λt = λ/

√t, µt = µζt for

ζ ∈ (0,1). Then the regret for LAPROP can be boundedfrom above as

R(T ) ≤ D2√T

2λ(1 − µ)d

∑i=1

√nT,i +

µD2G∞2(1 − µ)(1 − ζ)2

(25)

+ λ√

1 + logT

(1 − µ)√

1 − ν(1 − ν)

d

∑i=1

∣∣g1∶T,i∣∣2, (26)

where, in the worst case, ∑di=1∑Tt=0√

νt,it

≤ dG∞√T .

We see that the difference between this bound and that of(Reddi et al., 2018) lies in the third term, where LAPROPreplaces the factor 1

1−γ by 11−ν . This is the key difference

between LAPROP and the ADAM style optimizers. Theproposed method converges for any ν ∈ [0,1), while thatof ADAM depends on the relation between µ and ν, i.e.1 > ν > µ2.

From the above theorem, one see that the average regret ofLAPROP converges at the same asymptotic rate as SGD andother adaptive optimizers such as ADAM and AMSGRAD.

Corollary 1. For LAPROP, R(T )T

= O( 1√T).

Like ADAM and AMSGRAD, the above bound can be con-siderably better than O(

√dT ) when the gradient is sparse,

as shown in (Duchi et al., 2011). The experimental fact thatthe adaptive gradient family is faster at training neural net-works suggests that the nature of gradient in a deep neuralnetwork is sparse.

6. ConclusionBased on and motivated by a series of previous works, wehave proposed an effective and strong optimization methodfor modern neural networks on various tasks. While theproposed method does outperform and show better flexibil-ity than other members in the adaptive gradient family, weremark that the understanding of its true advantage needs tobe tested with more experiments at the industrial level.

Page 9: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

ReferencesBahdanau, D., Cho, K., and Bengio, Y. Neural machine

translation by jointly learning to align and translate. arXivpreprint arXiv:1409.0473, 2014.

Balles, L. and Hennig, P. Dissecting adam: The sign, magni-tude and variance of stochastic gradients. arXiv preprintarXiv:1705.07774, 2017.

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anand-kumar, A. signsgd: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434,2018.

Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Fed-erico, M. Report on the 11th iwslt evaluation campaign,iwslt 2014. In Proceedings of the International Workshopon Spoken Language Translation, Hanoi, Vietnam, pp.57, 2014.

Chen, X., Liu, S., Sun, R., and Hong, M. On the conver-gence of a class of adam-type algorithms for non-convexoptimization. arXiv preprint arXiv:1808.02941, 2018.

da Silva, A. B. and Gazeau, M. A general system of differ-ential equations to model first order adaptive algorithms.arXiv preprint arXiv:1810.13108, 2018.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert:Pre-training of deep bidirectional transformers for lan-guage understanding. arXiv preprint arXiv:1810.04805,2018.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradientdescent finds global minima of deep neural networks.arXiv preprint arXiv:1811.03804, 2018.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgra-dient methods for online learning and stochastic opti-mization. J. Mach. Learn. Res., 12:2121–2159, July2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1953048.2021068.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algo-rithm of artistic style. arXiv preprint arXiv:1508.06576,2015.

Guo, J., He, H., He, T., Lausen, L., Li, M., Lin, H., Shi, X.,Wang, C., Xie, J., Zha, S., Zhang, A., Zhang, H., Zhang,Z., Zhang, Z., and Zheng, S. Gluoncv and gluonnlp:Deep learning in computer vision and natural languageprocessing. arXiv preprint arXiv:1907.04433, 2019.

Hanin, B. and Rolnick, D. How to start training: The effectof initialization and architecture. In Advances in NeuralInformation Processing Systems, pp. 571–581, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In Proceedings of the IEEEconference on computer vision and pattern recognition,pp. 770–778, 2016.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostro-vski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., andSilver, D. Rainbow: Combining improvements in deep re-inforcement learning. In Thirty-Second AAAI Conferenceon Artificial Intelligence, 2018.

Kingma, D. P. and Ba, J. Adam: A methodfor stochastic optimization. CoRR, abs/1412.6980,2014. URL http://dblp.uni-trier.de/db/journals/corr/corr1412.html#KingmaB14.

Krizhevsky, A., Hinton, G., et al. Learning multiple layersof features from tiny images. 2009.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., andHan, J. On the variance of the adaptive learning rate andbeyond. arXiv preprint arXiv:1908.03265, 2019a.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019b.

Loshchilov, I. and Hutter, F. Fixing weight decay regular-ization in adam. arXiv preprint arXiv:1711.05101, 2017.

Luo, L., Xiong, Y., Liu, Y., and Sun, X. Adaptive gradientmethods with dynamic bound of learning rate. In Pro-ceedings of the 7th International Conference on LearningRepresentations, New Orleans, Louisiana, May 2019.

Nesterov, Y. E. A method of solving a convex program-ming problem with convergence rate o(k2). In Dok-lady Akademii Nauk, volume 269, pp. 543–547. RussianAcademy of Sciences, 1983.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng,N., Grangier, D., and Auli, M. fairseq: A fast, extensibletoolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Pennington, J., Socher, R., and Manning, C. D. Glove:Global vectors for word representation. In Proceedingsof the 2014 conference on empirical methods in naturallanguage processing (EMNLP), pp. 1532–1543, 2014.

Reddi, S., Kale, S., and Kumar, S. On the convergenceof adam and beyond. In International Conference onLearning Representations, 2018.

Rosenbrock, H. An automatic method for finding the great-est or least value of a function. The Computer Journal, 3(3):175–184, 1960.

Page 10: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

Ruder, S. An overview of gradient descent optimizationalgorithms. arXiv preprint arXiv:1609.04747, 2016.

Sun, R. Optimization for deep learning: theory and algo-rithms. arXiv preprint arXiv:1912.08957, 2019.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On theimportance of initialization and momentum in deep learn-ing. In International conference on machine learning, pp.1139–1147, 2013.

Tieleman, T. and Hinton, G. Lecture 6.5—RmsProp: Dividethe gradient by a running average of its recent magnitude.COURSERA: Neural Networks for Machine Learning,2012.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Atten-tion is all you need. In Advances in neural informationprocessing systems, pp. 5998–6008, 2017.

Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht,B. The marginal value of adaptive gradient methods inmachine learning. In Advances in Neural InformationProcessing Systems, pp. 4148–4158, 2017.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradientdescent optimizes over-parameterized deep relu networks.arXiv preprint arXiv:1811.08888, 2018.

Page 11: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

A. Appendix

B. Proof of Proposition 5.1Expand the update:

∣mt∣ = µ∣mt−1∣ + (1 − µ) ∣gt∣√nt/cn(t)

(27)

= µ∣mt−1∣ + (1 − µ) ∣gt∣√[(1 − ν)g2

t + νnt−1]/cn(t)(28)

≤ µ∣mt−1∣ + (1 − µ) ∣gt∣√[(1 − ν)g2

t ]/cn(t)(29)

≤ µ∣mt−1∣ +1 − µ√1 − ν

(30)

and this defines a recurrence relation that solves to

∣mt∣ ≤1 − µt√

1 − ν(31)

whereby

∣mt∣cm(t) ≤ 1√

1 − ν. (32)

C. Proof of Proposition 5.2Expand the term:

∣mt∣√nt

= µ ∣mt−1∣√nt

+ (1 − µ) ∣gt∣√nt/cn(t)

(33)

≤ 1 − µ√1 − ν

T

∑t=0

µT−t

νT−t2

(34)

and if µ < √ν, then

∣mt∣√nt

≤ 1 − µ√1 − ν

1

1 − γ (35)

where γ = µ√ν

. Putting in the bias corrections, we obtain that

∣mt∣/cm(t)√nt/cn(t)

≤ 1√1 − ν

1

1 − γ , (36)

and we are done.

D. Theorem 1 ProofBy convexity,

`(θt) − `(θ∗) ≤d

∑i=1gt,i(θt,i − θ∗i ) (37)

We focus on a single compoment with index i and bound the term gt,i(θt,i − θ∗i ). Plug in the LAPROP update rule to obtain

θt+1 = θt − λtmt

cm(38)

= θt −λt

1 − µt (µtmt−1 + (1 − µt)gtst

) (39)

Page 12: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

where we defined st ∶=√nt/cn. We now subtract θ∗, and square both sides to obtain

(θt+1 − θ∗)2 = (θt − θ∗)2 + 2λt

1 − µt (µtmt−1 + (1 − µt)gtst

)(θ∗ − θt) +λ2

(1 − µt)2m2t . (40)

Rearrange term, and, in the second line, apply the inequality ab < a2+b22

gt(θt − θ∗) =1 − µt

2λt(1 − µt)[(θt − θ∗)2 − (θt+1 − θ∗)2]st +

µt1 − µt

(θ∗ − θt)√vtmt−1 +

λt2(1 − µt)(1 − µt)

stm2t (41)

≤ 1

2λt(1 − µ)[(θt − θ∗)2 − (θt+1 − θ∗)2]st +

λt2(1 − µt)(1 − µt)

stm2t (42)

+ µt2λt−1(1 − µt)

(θ∗ − θt)2st +λtµ

2(1 − µt)stm

2t−1 (43)

Now apply Proposition 5.1 to bound mt,

gt(θt − θ∗) ≤1

2λt(1 − µ)[(θt − θ∗)2 − (θt+1 − θ∗)2]st +

λt2(1 − µt)(1 − ν)

st (44)

+ µt2λt−1(1 − µt)

(θ∗ − θt)2st +λtµ

2(1 − µt)(1 − ν)st (45)

≤ 1

2λt(1 − µ)[(θt − θ∗)2 − (θt+1 − θ∗)2]st +

µt2λt−1(1 − µt)

(θ∗ − θt)2st +λt

(1 − µ)(1 − ν)st (46)

Now we are ready to bound R(T ), to do this, we sum up over the index i of the parameters and the times steps from 1 to T(recall that each m, θ, s carries an index i with them):

R(T ) ≤d

∑i=0

T

∑t=1

{ 1

2λt(1 − µ)[(θt − θ∗)2 − (θt+1 − θ∗)2]st +

µt2λt−1(1 − µt)

(θ∗ − θt)2st +λt

(1 − µ)(1 − ν)st} (47)

≤d

∑i=1

1

2λt(1 − µ)(θ1,i − θ∗i )2s1,i +

d

∑i=1

T

∑t=2

1

2(1 − µ)(θt,i − θ∗)2 (st,i

λt− st−1,i

λt)

´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶B

(48)

+d

∑i=0

T

∑t=1

µtst,i

2λt(1 − µ)(θ∗i − θt,i)2 +

d

∑i=

T

∑t=1

λt(1 − µ)(1 − ν)st,i (49)

≤ D2√T

2λ(1 − µ)d

∑i=0

√sT,i +

µD2G∞2(1 − µ)(1 − ζ)2

+ λ

(1 − µ)(1 − ν)d

∑i=

T

∑t=1

st,i√t

´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶A

. (50)

As a side note, one important term in the derivation is term B. This term appears in the original proof for the convergence ofADAM (Kingma & Ba, 2014), and, in fact, appears in any known regret bound for adaptive optimizers (Chen et al., 2018),and is shown to cause divergence when B < 0 in (Reddi et al., 2018). To deal with this problem, we simply assume that weare in a problem where B > 0 by definition, while noting that this problem can be avoided theoretically using two ways. Oneway is to increase ν gradually towards 1 during training, as is proven in (Reddi et al., 2018) .Another way to deal with thisproblem is to define an AMSGRAD-ized version of the adaptive algorithm by substituting in st,i = max(

√nt,i

cn,√

nt−1,i

cn).

However, this “fix" is shown in the main text to be quite harmful to the performance, and so we disencourage using aAMSGRAD-dized LAPROPunless the problem really demands so.

We now try to bound the term A in the last line above. The worst case bound can be found simply by noting that st,i ≤ G∞and∑Tt 1/

√t <

√T +1. Thus the worst case bound for term A is A ≤ dG∞

√T , which is the same as other adaptive gradient

Page 13: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

methods. Another bound, which is tighter when the gradient is sparse, we note

d

∑i=1

T

∑t=1

st,i√t≤

d

∑i=1

√∑t

nt,i

cn

√∑t

1

t(51)

≤√

1 + logTd

∑i=1

√∑t

nt,i

cn(52)

=√

1 + logTd

∑i=1

¿ÁÁÀ T

∑t

∑tj(1 − ν)νt−jg2j,i

cn(53)

≤√

1 + logT

1 − νd

∑i=1

∣∣g1∶T,i∣∣ (54)

and so we obtain the desired bound

R(T ) ≤≤ D2√T

2λ(1 − µ)d

∑i=0

√sT,i +

µD2G∞2(1 − µ)(1 − ζ)2

+√

1 + logT

(1 − µ)(1 − ν)√

1 − ν

d

∑i=1

∣∣g1∶T,i∣∣ (55)

E. Practical Concerns and Various ExtensionsE.1. Weight Decay for LAPROP: LAPROPW

As suggested in Ref. (Loshchilov & Hutter, 2017), we implement the weight decay separately, which may also be called theLAPROPW algorithm. The algorithm is given by Algorithm 2.

Algorithm 2 LAPROPW

Input: x1 ∈ Rd, learning rate {ηt}Tt=1, weight decay {wt}Tt=1, decay parameters 0 ≤ µ < 1, 0 ≤ ν < 1, ε≪ 1. Set m0 = 0,n0 = 0.gt = ∇θ`(θt−1)nt = νnt−1 + (1 − ν)g2

t

mt = µmt−1 + (1 − µ) gt√nt/cn+ε

θt+1 = (θt − λtmt/cm) × (1 −wt)

E.2. Tuning ν

The tuning of ν is nontrivial for each task and it is basically decided by trial and error. Roughly speaking, a smaller νmakes LAPROP closer to the signed gradient methods and the maximum update made by LAPROP is smaller, which may bebeneficial in noisy settings, and a larger ν makes nt change slowly, which may benefit fine-tuning of the model and theconvergence. From a statistical viewpoint, ν is used to compute the second moment nt, and therefore when the trainedmodel changes quickly, a large ν may result in a larger bias in second-moment estimation. In our experiments, we find that asmaller ν improves the loss faster at the initial updates, while a larger ν improves faster at a later stage, such as in Fig. 9b,and for MNIST and CIFAR10 a small ν improves faster only for the initial hundreds of updates, which may be due to thesimplicity of the tasks. We believe tuning ν for large-scale and difficult tasks can bring better results and we leave it forfuture research.

E.3. AmsProp

In Ref. (Reddi et al., 2018), the authors proposed AMSGRAD as a variant of ADAM that has a monotonically increasing ntto guarantee its convergence. This idea may be applied to LAPROP similarly, which produces an algorithm that may becalled AmsProp, as shown in Algorithm 3. It should be noted that this algorithm subtly differs from the original AMSGRAD.In practical cases, we have not observed the advantage of using such a variant, because a large ν can often do well enough byapproximately producing a constant nt at convergence and therefore achieve a good performance. Nevertheless, AmsPropmight be useful in special and complicated cases and we leave this for future research.

Page 14: LaProp: a Better Way to Combine Momentum with Adaptive Gradientanderson/cs445/notebooks/... · 2020. 2. 20. · LaProp: a Better Way to Combine Momentum with Adaptive Gradient Liu

LaProp: a Better Way to Combine Momentum with Adaptive Gradient

Algorithm 3 AmsProp

Input: x1 ∈ Rd, learning rate {ηt}Tt=1, decay parameters 0 ≤ µ < 1, 0 ≤ ν < 1, ε≪ 1. Set m0 = 0, n0 = 0, n0 = 0.gt = ∇θ`(θt−1)nt = νnt−1 + (1 − ν)g2

t

nt = max(nt−1, nt)mt = µmt−1 + (1 − µ) gt√

nt/cn+εθt+1 = θt − λtmt/cm

E.4. Centered LAPROP

Following the suggestion in Ref. (Tieleman & Hinton, 2012), we also propose a centered version of LAPROP, which usesthe estimation of the centered second moment rather than the non-centered one, which is a more aggressive optimizationstrategy that would diverge at the presence of a constant gradient. The algorithm is give by Algorithm 4. As the estimationof the centered momentum is unstable at the initial few iterations, we recommend to use the original LAPROP for a fewupdates and then change to the centered one. Also, one may even combine the centered LAPROP and AMSGRAD, using themaximal centered second moment in the learning trajectory, but we have not known the advantage of such a combination.

Algorithm 4 Centered LAPROP

Input: x1 ∈ Rd, learning rate {ηt}Tt=1, decay parameters 0 ≤ µ < 1, 0 ≤ ν < 1, ε≪ 1. Set m0 = 0, n0 = 0, n0 = 0.gt = ∇θ`(θt−1)nt = νnt−1 + (1 − ν)g2

t

nt = νnt−1 + (1 − ν)gtmt = µmt−1 + (1 − µ) gt√(nt−n2

t )/cn+εθt+1 = θt − λtmt/cm

E.5. Tuning ε

The tuning of ε is found to be important for the stability of ADAM in some difficult tasks, such as in Rainbow DQN andTransformer training (Hessel et al., 2018; Devlin et al., 2018), and a small ε may result in an unstable learning trajectory.However, we find that this is not the case for LAPROP, and LAPROP often works better with a small ε, and we have notfound any case where a too small ε results in worse performance. The ε can be freely set to be 1E-8, 1E-15 or 1E-20, aslong as it is within machine precision, which is drastically different from ADAM. Similarly, gradient clipping seems to beunnecessary.

F. Additional Experiments and Experimental DetailsOur code is going to be released on Github.

F.1. RoBERTa Pretraining

We find that pretraining of RoBERTa is stable if the dataset only contains English Wikipedia, and in this case, a warmupschedule of the learning rate is not necessary and we can use larger ν also. We present the learning curves of different ν andusing or not using warmup in Fig. 9, training only for the first 20×103 updates, which is actually less than one epoch. If onezooms in Fig. 9a, it can also be observed that LAPROP marginally outperforms ADAM from an early stage. In Fig. 9b, it canbe seen that a smaller ν accelerate the learning at an early stage, while a larger ν converges better at a later stage. We haveapplied a Gaussian smoothing with a standard deviation σ = 6 to the nearby data points in the plots, and for the IWSLT14task we have used σ = 70.

The English Wikipedia dataset is prepared following the instructions in the GitHub repository of GluonNLP4 (Guo et al.,2019). We used the latest English Wikipedia dump file and cleaned the text5, and we encode and preprocess the text

4https://github.com/dmlc/gluon-nlp/issues/6415using scripts on https://github.com/eric-haibin-lin/text-proc


[Figure 9: two panels of pretraining loss vs. number of updates (×10³). (a) ADAM vs. LAPROP with ν = 0.98, ν = 0.999, and ν = 0.999 without warmup. (b) LAPROP with ν = 0.9, 0.98, and 0.999.]

Figure 9. RoBERTa pretraining on the full English Wikipedia. See the text for details.

following the standard RoBERTa procedure⁶. We use learning-rate schedules with linear increase/decrease, with the learning rate reaching 0 at the default 125000th update. When there is warmup, the learning rate is increased from 0 to its maximum at the 10000th update and is then decreased; when there is no warmup, it starts at its maximum and is then decreased. Due to a limited computational budget, our training batch size is 120 and we use a maximum learning rate of 1E-4; the maximum sequence length of the data is kept at the default of 512. The trained model is Bert-base (Devlin et al., 2018), and we use the implementation provided by the Fairseq library (Ott et al., 2019). We also enable the option mask-whole-words to make the pretraining task more realistic. Other hyperparameter settings are identical to the defaults if not mentioned.
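For concreteness, the learning-rate schedule described above can be sketched as a simple function of the update count; the function name is ours, and the handling of the no-warmup runs is our reading of the text rather than the exact training script.

def linear_lr(step, peak_lr=1e-4, warmup_steps=10000, total_steps=125000, warmup=True):
    # Linear warmup to peak_lr (if enabled), then linear decay to 0 at total_steps.
    if warmup and step < warmup_steps:
        return peak_lr * step / warmup_steps
    start = warmup_steps if warmup else 0
    return peak_lr * max(total_steps - step, 0) / (total_steps - start)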

F.2. Details of Reinforcement Learning Experiments

To obtain better training results, compared with Rainbow DQN (Hessel et al., 2018), we do not use multi-step learning, and we use a combined ε-greedy and noisy-network strategy. We also change the update period of the target network to 10000, and we enlarge the replay buffer to contain 2 million transitions, adopting a random-replacement strategy when it is full. The dueling network structure is not used. Concerning the game, because Breakout may loop endlessly, it automatically stops if the player makes no progress for about 10000 frames. Therefore, we avoid the automatic stop by forcing the agent to lose a life by taking random actions once 10000 frames have passed. The same is done when obtaining the test performances. For the ε-greedy strategy, we set the minimum value of ε to 0.005 for better performance, which is smaller than the common 0.01.
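A minimal sketch of the random-replacement replay buffer described above is given here; the class and method names are ours, with only the 2-million-transition capacity taken from the text.

import random

class RandomReplacementBuffer:
    # Replay buffer that overwrites a uniformly random slot once it is full.
    def __init__(self, capacity=2_000_000):
        self.capacity = capacity
        self.data = []

    def add(self, transition):
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            # Random replacement instead of the usual FIFO eviction.
            self.data[random.randrange(self.capacity)] = transition

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)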

Concerning the optimization algorithms, we use the same hyperparameter settings as those in (Hessel et al., 2018) for both ADAM and LAPROP, except for the ε of LAPROP, which we still set to its default of 1E-15.

F.3. The Grid Search of Resnet-20 on CIFAR10

We use a learning rate of 1E-3 for both ADAM and LAPROP together with a weight decay of 1E-4 (Loshchilov & Hutter, 2017); the learning rate is reduced by a factor of 10 at epochs 80 and 120, we train for 164 epochs, and we report the average over the last 5 epochs. The default ε of ADAM and LAPROP are used, and the other settings follow Ref. (He et al., 2016). The complete learning curves of the training and test loss are given in Figs. 12 and 13. The test accuracies are given in Fig. 10, and the final training and test losses are given in Fig. 11.
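In PyTorch, the step schedule described above corresponds to a MultiStepLR with milestones at epochs 80 and 120. The snippet below is only a sketch: the linear layer is a placeholder standing in for Resnet-20, and torch.optim.AdamW (decoupled weight decay) stands in for the released LAPROP optimizer so that the example remains self-contained.

import torch

model = torch.nn.Linear(32 * 32 * 3, 10)   # placeholder for the Resnet-20 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(164):
    # ... one training epoch over CIFAR10 and its evaluation would go here ...
    scheduler.step()   # reduces the learning rate by a factor of 10 after epochs 80 and 120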

⁶ Described in https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md


Final test accuracy (%) of ADAM:

µ \ ν    0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
0.80     nan    91.3   91.3   91.1   91.5   91.6   91.4   91.2   91.3   91.5   91.4
0.85     nan    91.1   91.5   91.6   91.8   91.7   91.6   91.4   91.6   91.4   91.2
0.90     nan    91.3   91.7   10.0   91.6   90.9   91.7   91.2   91.4   91.5   91.4
0.95     nan    nan    91.4   91.4   91.3   91.4   91.3   91.3   91.4   91.5   91.1
0.99     nan    nan    nan    nan    10.0   nan    nan    10.0   10.0   91.2   91.3

Final test accuracy (%) of LAPROP:

µ \ ν    0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
0.80     90.8   91.1   90.9   91.0   91.5   91.0   91.3   91.2   91.3   91.5   91.4
0.85     91.0   90.8   90.9   90.7   91.3   91.2   91.0   91.2   91.0   91.6   91.5
0.90     90.9   91.0   91.1   91.6   91.1   91.0   91.3   91.1   91.4   91.5   91.2
0.95     90.9   90.8   90.9   90.8   91.3   91.0   91.2   91.6   91.3   91.4   91.6
0.99     90.6   91.0   91.0   90.9   90.8   91.2   91.3   90.7   91.1   91.0   91.2

Figure 10. The complete results of Fig. 8 in the main text, i.e., the final test accuracies corresponding to the grid-search models in Figs. 12 and 13. "nan" is an abbreviation for "not a number", meaning that the run encountered numerical problems, such as overflow, and could not compute a number.

(a) The final training loss on CIFAR10 of ADAM and LAPROP:

ADAM:
µ \ ν    0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
0.80     NAN    0.028  0.027  0.03   0.029  0.028  0.026  0.029  0.028  0.026  0.026
0.85     NAN    0.031  0.027  0.028  0.025  0.027  0.026  0.028  0.029  0.027  0.026
0.90     NAN    0.026  0.027  2.3    0.028  0.027  0.027  0.027  0.028  0.026  0.026
0.95     NAN    NAN    0.026  0.027  0.026  0.029  0.027  0.029  0.028  0.028  0.027
0.99     NAN    NAN    NAN    NAN    >100   NAN    NAN    >100   >100   0.032  0.028

LAPROP:
µ \ ν    0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
0.80     0.077  0.059  0.055  0.052  0.045  0.044  0.042  0.038  0.032  0.029  0.027
0.85     0.075  0.058  0.056  0.055  0.048  0.046  0.042  0.037  0.035  0.03   0.027
0.90     0.073  0.061  0.055  0.051  0.048  0.046  0.039  0.04   0.033  0.03   0.028
0.95     0.078  0.059  0.06   0.053  0.05   0.049  0.046  0.038  0.036  0.029  0.027
0.99     0.072  0.065  0.061  0.054  0.055  0.051  0.046  0.045  0.038  0.034  0.03

(b) The final test loss on CIFAR10 of ADAM and LAPROP:

ADAM:
µ \ ν    0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
0.80     NAN    0.38   0.37   0.37   0.36   0.38   0.38   0.38   0.36   0.35   0.35
0.85     NAN    0.38   0.38   0.37   0.36   0.36   0.35   0.36   0.36   0.36   0.36
0.90     NAN    0.39   0.37   2.3    0.36   0.38   0.36   0.35   0.36   0.36   0.35
0.95     NAN    NAN    0.39   0.38   0.37   0.36   0.36   0.36   0.36   0.34   0.37
0.99     NAN    NAN    NAN    NAN    >100   NAN    NAN    3.2    >100   0.35   0.35

LAPROP:
µ \ ν    0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
0.80     0.35   0.35   0.35   0.35   0.35   0.36   0.35   0.36   0.34   0.35   0.35
0.85     0.35   0.35   0.35   0.37   0.34   0.35   0.36   0.35   0.37   0.34   0.34
0.90     0.35   0.35   0.35   0.34   0.36   0.35   0.35   0.36   0.36   0.34   0.34
0.95     0.35   0.35   0.35   0.35   0.35   0.36   0.34   0.34   0.34   0.35   0.33
0.99     0.36   0.34   0.35   0.35   0.35   0.33   0.34   0.36   0.36   0.36   0.36

Figure 11. A summary of the final training and test loss in Figs. 12 and 13. From the trend, we can clearly see that ADAM tends to overfit when it is closer to divergence, while LAPROP is not clearly affected. Surprisingly, we find that the higher test loss of ADAM does not necessarily lead to a worse test accuracy on this task.


[Figure 12: a grid of panels showing log-scale training and test loss vs. epoch for ADAM and LAPROP, for µ = 0.99 and µ = 0.95 with ν ∈ {0, 0.1, 0.2, ..., 0.9, 0.99}.]

Figure 12. The learning curves of the Resnet-20 grid search on CIFAR10; for the rest, see Fig. 13. The training loss and test loss are plotted for ADAM and LAPROP, and the meaning of the curves is shown in the legend of the first plot. If ADAM diverges, its curves become absent from the plots. In the figures above, we see that when ADAM diverges, the smaller ν is, the earlier the divergence occurs. However, divergence sometimes also occurs accidentally, as in the case of µ = 0.9, ν = 0.3. Notably, if ADAM does not diverge, it reaches a low training loss in all cases irrespective of µ and ν, whereas the training loss of LAPROP is clearly affected by µ and ν. However, the test loss of LAPROP is almost unaffected, and it is often lower than the test loss of ADAM, as shown for µ ≤ 0.95 and 0.1 ≤ ν ≤ 0.4. This is an example where LAPROP generalizes better than ADAM.


[Figure 13: the corresponding grid of panels of log-scale training and test loss vs. epoch, for µ = 0.9, 0.85, and 0.8 with ν ∈ {0, 0.1, 0.2, ..., 0.9, 0.99}.]

Figure 13. The training curves of the Resnet-20 grid search on CIFAR10. See Fig. 12.