
Published as a conference paper at ICLR 2021

ON THE THEORY OF IMPLICIT DEEP LEARNING: GLOBAL CONVERGENCE WITH IMPLICIT LAYERS

Kenji Kawaguchi
Harvard University
Cambridge, MA 02138, [email protected]

ABSTRACT

A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and non-convex objective functions of weights for regression and classification. Despite non-convexity, convergence to global optimum at a linear rate is guaranteed without any assumption on the width of the models, allowing the width to be smaller than the output dimension and the number of data points. Moreover, we prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of trust region Newton method of a shallow explicit layer. This mathematically proven relation along with our numerical observation suggests the importance of understanding implicit bias of implicit layers and an open problem on the topic. Our proofs deal with implicit layers, weight tying and nonlinearity on weights, and differ from those in the related literature.

1 INTRODUCTION

A feedforward deep neural network consists of a stack of $H$ layers, where $H$ is the depth of the network. The value for the depth $H$ is typically a hyperparameter and is chosen by network designers (e.g., ResNet-101 in He et al. 2016). Each layer computes some transformation of the output of the previous layer. Surprisingly, several recent studies achieved results competitive with the state-of-the-art performances by using the same transformation for each layer with weight tying (Dabre & Fujita, 2019; Bai et al., 2019b; Dehghani et al., 2019). In general terms, the output of the $l$-th layer with weight tying can be written as

$$z^{(l)} = h(z^{(l-1)}; x, \theta) \quad \text{for } l = 1, 2, \dots, H - 1, \tag{1}$$

where $x$ is the input to the neural network, $z^{(l)}$ is the output of the $l$-th layer (with $z^{(0)} = x$), $\theta$ represents the trainable parameters that are shared among different layers (i.e., weight tying), and $z^{(l-1)} \mapsto h(z^{(l-1)}; x, \theta)$ is some continuous function that transforms $z^{(l-1)}$ given $x$ and $\theta$. With weight tying, the memory requirement does not increase as the depth $H$ increases in the forward pass. However, an efficient backward pass to compute gradients for training the network usually requires storing the values of the intermediate layers. Accordingly, the overall computational requirement typically increases as the finite depth $H$ increases even with weight tying.

Instead of using a finite depth $H$, Bai et al. (2019a) recently introduced the deep equilibrium model that is equivalent to running an infinitely deep feedforward network with weight tying. Instead of running the layer-by-layer computation in equation (1), the deep equilibrium model uses root-finding to directly compute a fixed point $z^* = \lim_{l\to\infty} z^{(l)}$, where the limit can be ensured to exist by a choice of $h$. We can train the deep equilibrium model with gradient-based optimization by analytically backpropagating through the fixed point using implicit differentiation (e.g., Griewank & Walther, 2008; Bell & Burke, 2008; Christianson, 1994). With numerical experiments, Bai et al. (2019a) showed that the deep equilibrium model can improve performance over previous state-of-the-art models while significantly reducing memory consumption.



Despite the remarkable performances of deep equilibrium models, our theoretical understanding of their properties is still limited. Indeed, immense efforts are still underway to mathematically understand deep linear networks, which have finite values for the depth $H$ without weight tying (Saxe et al., 2014; Kawaguchi, 2016; Hardt & Ma, 2017; Laurent & Brecht, 2018; Arora et al., 2018; Bartlett et al., 2019; Du & Hu, 2019; Arora et al., 2019a; Zou et al., 2020b). In deep linear networks, the function $h$ at each layer is linear in $\theta$ and linear in $x$; i.e., the map $(x, \theta) \mapsto h(z^{(l-1)}; x, \theta)$ is bilinear. Despite this linearity, several key properties of deep learning are still present in deep linear networks. For example, the gradient dynamics is nonlinear and the objective function is non-convex. Accordingly, understanding gradient dynamics of deep linear networks is considered to be a valuable step towards the mathematical understanding of deep neural networks (Saxe et al., 2014; Arora et al., 2018; 2019a).

In this paper, inspired by the previous studies of deep linear networks, we initiate a theoretical study of gradient dynamics of deep equilibrium linear models as a step towards theoretically understanding general deep equilibrium models. As we shall see in Section 2, the function $h$ at each layer is nonlinear in $\theta$ for deep equilibrium linear models, whereas it is linear for deep linear networks. This additional nonlinearity is essential to enforce the existence of the fixed point $z^*$. The additional nonlinearity, the infinite depth, and weight tying are the three key properties of deep equilibrium linear models that are absent in deep linear networks. Because of these three differences, we cannot rely on the previous proofs and results in the literature of deep linear networks. Furthermore, we analyze gradient dynamics, whereas Kawaguchi (2016); Hardt & Ma (2017); Laurent & Brecht (2018) studied the loss landscape of deep linear networks. We also consider a general class of loss functions for both regression and classification, whereas Saxe et al. (2014); Arora et al. (2018); Bartlett et al. (2019); Arora et al. (2019a); Zou et al. (2020b) analyzed gradient dynamics of deep linear networks in the setting of the square loss.

Accordingly, we employ different approaches in our analysis and derive qualitatively and quantitatively different results when compared with previous studies. In Section 2, we provide theoretical and numerical observations that further motivate us to study deep equilibrium linear models. In Section 3, we mathematically prove convergence of gradient dynamics to global minima and the exact relationship between the gradient dynamics of deep equilibrium linear models and that of the adaptive trust region method. Section 5 gives a review of related literature, which strengthens the main motivation of this paper along with the above discussion (in Section 1). Finally, Section 6 presents concluding remarks on our results, the limitation of this study, and future research directions.

2 PRELIMINARIES

We begin by defining the notation. We are given a training dataset $((x_i, y_i))_{i=1}^n$ of $n$ samples, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{m_x}$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}^{m_y}$ are the $i$-th input and the $i$-th target output, respectively. We would like to learn a hypothesis (or predictor) from a parametric family $\mathcal{H} = \{f_\theta : \mathbb{R}^{m_x} \to \mathbb{R}^{m_y} \mid \theta \in \Theta\}$ by minimizing the objective function $L$ (called the empirical loss) over $\theta \in \Theta$: $L(\theta) = \sum_{i=1}^n \ell(f_\theta(x_i), y_i)$, where $\theta$ is the parameter vector and $\ell : \mathbb{R}^{m_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is the loss function that measures the difference between the prediction $f_\theta(x_i)$ and the target $y_i$ for each sample. For example, when the parametric family of interest is the class of linear models $\mathcal{H} = \{x \mapsto W\phi(x) \mid W \in \mathbb{R}^{m_y \times m}\}$, the objective function $L$ can be rewritten as:

$$L_0(W) = \sum_{i=1}^n \ell(W\phi(x_i), y_i), \tag{2}$$

where the feature map $\phi$ is an arbitrary fixed function that is allowed to be nonlinear and is chosen by model designers to transform an input $x \in \mathbb{R}^{m_x}$ into the desired features $\phi(x) \in \mathbb{R}^m$. We use $\mathrm{vec}(W) \in \mathbb{R}^{m_y m}$ to represent the standard vectorization of a matrix $W \in \mathbb{R}^{m_y \times m}$.

Instead of linear models, our interest in this paper lies in deep equilibrium models. The output $z^*$ of the last hidden layer of a deep equilibrium model is defined by

$$z^* = \lim_{l\to\infty} z^{(l)} = \lim_{l\to\infty} h(z^{(l-1)}; x, \theta) = h(z^*; x, \theta), \tag{3}$$

where the last equality follows from the continuity of $z \mapsto h(z; x, \theta)$ (i.e., the limit commutes with the continuous function). Thus, $z^*$ can be computed by solving the equation $z^* = h(z^*; x, \theta)$ without running the infinitely deep layer-by-layer computation. The gradients with respect to parameters are computed analytically via backpropagation through $z^*$ using implicit differentiation.
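As a sketch of what implicit differentiation amounts to here, assume (for this illustration only) that $h$ is continuously differentiable and that $I - \partial h / \partial z$ is invertible at $z^*$. Differentiating both sides of the fixed-point equation $z^* = h(z^*; x, \theta)$ with respect to $\theta$ and solving for the derivative of $z^*$ gives:

```latex
\frac{\partial z^*}{\partial \theta}
  = \frac{\partial h}{\partial z}(z^*; x, \theta)\,\frac{\partial z^*}{\partial \theta}
  + \frac{\partial h}{\partial \theta}(z^*; x, \theta)
\quad\Longrightarrow\quad
\frac{\partial z^*}{\partial \theta}
  = \Bigl(I - \frac{\partial h}{\partial z}(z^*; x, \theta)\Bigr)^{-1}
    \frac{\partial h}{\partial \theta}(z^*; x, \theta).
```

The derivative depends only on quantities evaluated at $z^*$, so under these assumptions no intermediate layer values need to be stored.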



2.1 DEEP EQUILIBRIUM LINEAR MODELS

A deep equilibrium linear model is an instance of the family of deep equilibrium models and is defined by setting the function $h$ at each layer as follows:

$$h(z^{(l-1)}; x, \theta) = \gamma\sigma(A) z^{(l-1)} + \phi(x), \tag{4}$$

where $\theta = (A, B)$ with two trainable parameter matrices $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{m_y \times m}$. Along with a positive real number $\gamma \in (0, 1)$, the nonlinear function $\sigma$ is used to ensure the existence of the fixed point and is defined by $\sigma(A)_{ij} = \frac{\exp(A_{ij})}{\sum_{k=1}^m \exp(A_{kj})}$. The class of deep equilibrium linear models is given by $\mathcal{H} = \{x \mapsto B(\lim_{l\to\infty} z^{(l)}(x, A)) \mid A \in \mathbb{R}^{m \times m}, B \in \mathbb{R}^{m_y \times m}\}$, where $z^{(l)}(x, A) = \gamma\sigma(A) z^{(l-1)} + \phi(x)$. Therefore, the objective function for deep equilibrium linear models can be written as

$$L(A, B) = \sum_{i=1}^n \ell\Bigl(B\Bigl(\lim_{l\to\infty} z^{(l)}(x_i, A)\Bigr), y_i\Bigr). \tag{5}$$

The outputs of deep equilibrium linear models $f_\theta(x) = B(\lim_{l\to\infty} z^{(l)}(x, A))$ are nonlinear and non-multilinear in the optimization variable $A$. This is in contrast to linear models and deep linear networks. From the optimization viewpoint, linear models $W\phi(x)$ are called linear because they are linear in the optimization variables $W$. Deep linear networks $W^{(H)} W^{(H-1)} \cdots W^{(1)} x$ are multilinear in the optimization variables $(W^{(1)}, W^{(2)}, \dots, W^{(H)})$ (this holds also when we replace $x$ by $\phi(x)$). This difference creates a challenge in the analysis of deep equilibrium linear models.

Following previous works on gradient dynamics of different machine learning models (Saxe et al., 2014; Ji & Telgarsky, 2020), we consider the process of learning deep equilibrium linear models via gradient flow:

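$$\frac{d}{dt} A_t = -\frac{\partial L}{\partial A}(A_t, B_t), \qquad \frac{d}{dt} B_t = -\frac{\partial L}{\partial B}(A_t, B_t), \qquad \forall t \ge 0, \tag{6}$$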

where $(A_t, B_t)$ represents the model parameters at time $t$ with an arbitrary initialization $(A_0, B_0)$. Throughout this paper, a feature map $\phi$ and a real number $\gamma \in (0, 1)$ are given and arbitrary (except in experimental observations) and we omit their universal quantifiers for the purpose of brevity.
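For readers who think in terms of iterative training, the gradient flow can be related to gradient descent through a forward-Euler discretization. The step size $\eta > 0$ and the iteration index $k$ below are hypothetical symbols introduced only for this sketch:

```latex
A_{k+1} = A_k - \eta\,\frac{\partial L}{\partial A}(A_k, B_k), \qquad
B_{k+1} = B_k - \eta\,\frac{\partial L}{\partial B}(A_k, B_k), \qquad k = 0, 1, 2, \dots
```

Here the iterate with index $k$ plays the role of the flow at time $t = k\eta$, and the flow is recovered formally as $\eta \to 0$. The analysis is stated for the continuous-time flow, so this discretization should be read as an interpretation aid rather than as the setting of the results.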

2.2 PRELIMINARY OBSERVATION FOR ADDITIONAL MOTIVATION

Our analysis is chiefly motivated as a step towards mathematically understanding general deep equilibrium models (as discussed in Sections 1 and 5). In addition to the main motivation, this section provides supplementary motivations through theoretical and numerical preliminary observations.

In general deep equilibrium models, the limit, $\lim_{l\to\infty} z^{(l)}$, is not ensured to exist (see Appendix C). In this view, the class of deep equilibrium linear models is one instance where the limit is guaranteed to exist for any values of model parameters as stated in Proposition 1:

Proposition 1. Given any $(x, A)$, the sequence $(z^{(l)}(x, A))_l$ in Euclidean space $\mathbb{R}^m$ converges.

Proof. We use the nonlinearity $\sigma$ to ensure the convergence in our proof in Appendix A.5.

Proposition 1 shows that we can indeed define the deep equilibrium linear model with $\lim_{l\to\infty} z^{(l)} = z^*(x, A)$. Therefore, understanding this model is a sensible starting point for theory of general deep equilibrium models.
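The convergence in Proposition 1 can be observed numerically on a small example. The sketch below is an illustration under assumptions, not the construction used in Appendix A.5: the matrix `A`, the vector `phi_x`, the value `gamma = 0.8`, the starting point, and all function names are hypothetical. It relies on the fact that each column of $\sigma(A)$ is a softmax and therefore sums to one, so the map $z \mapsto \gamma\sigma(A)z + \phi(x)$ contracts distances in the vector 1-norm by a factor of at most $\gamma$.

```python
import math

def column_softmax(A):
    """sigma(A)_{ij} = exp(A_ij) / sum_k exp(A_kj): a softmax over each column."""
    m = len(A)
    S = [[0.0] * m for _ in range(m)]
    for j in range(m):
        col_max = max(A[i][j] for i in range(m))  # shift by the max for stability
        exps = [math.exp(A[i][j] - col_max) for i in range(m)]
        total = sum(exps)
        for i in range(m):
            S[i][j] = exps[i] / total
    return S

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def fixed_point(A, phi_x, gamma, tol=1e-12, max_iter=10_000):
    """Iterate z <- gamma * sigma(A) z + phi(x); the start z = phi(x) is an assumption."""
    S = column_softmax(A)
    z = list(phi_x)
    for _ in range(max_iter):
        z_next = [gamma * s + p for s, p in zip(matvec(S, z), phi_x)]
        if sum(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    return z

# Hypothetical 3x3 example values.
A = [[0.3, -1.2, 2.0], [1.5, 0.0, -0.7], [-0.4, 0.9, 0.1]]
phi_x = [1.0, -2.0, 0.5]
gamma = 0.8

S = column_softmax(A)
z_star = fixed_point(A, phi_x, gamma)
# Residual of the fixed-point equation z* = gamma * sigma(A) z* + phi(x).
residual = max(
    abs(z - (gamma * s + p)) for z, s, p in zip(z_star, matvec(S, z_star), phi_x)
)
```

Because the columns of $\sigma(A)$ sum to one, summing the entries of the fixed-point equation gives $\sum_k z^*_k = \sum_k \phi(x)_k / (1 - \gamma)$, which offers a quick sanity check on the computed `z_star`.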

As our analysis has been mainly motivated for theory, it would be of additional value to discuss whether the model would also make sense in practice, at least potentially in the future. Consider an (unknown) underlying data distribution $P(x, y) = P(y|x)P(x)$. Intuitively, if the mean of $P(y|x)$ is approximately given by a (true unknown) deep equilibrium linear model, then it would make sense to use the parametric family of deep equilibrium linear models to have the inductive bias in practice. To confirm this intuition, we conducted numerical simulations. To generate datasets, we first drew uniformly at random 200 input images for input data points $x_i$ from a standard image dataset: CIFAR-10, CIFAR-100 or Kuzushiji-MNIST (Krizhevsky & Hinton, 2009; Clanuwat et al., 2019). We then generated targets as $y_i = B^*(\lim_{l\to\infty} z^{(l)}(x_i, A^*)) + \delta_i$ where $\delta_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Each entry of the true (unknown) matrices $A^*$ and $B^*$ was independently drawn from the standard normal distribution. For each dataset generated in this way, we used stochastic gradient descent (SGD) to train linear models, fully-connected feedforward deep neural networks with ReLU nonlinearity (DNNs), and deep equilibrium linear models. For all models, we fixed $\phi(x) = x$. See Appendix D for more details of the experimental settings.

[Figure 1 plots: each panel shows test loss and train loss (log scale) versus epoch (0 to 5000). Panels: (a) Modified Kuzushiji-MNIST: Linear vs. DELM; (b) Modified Kuzushiji-MNIST: DNN vs. DELM; (c) Modified CIFAR-10: Linear vs. DELM; (d) Modified CIFAR-10: DNN vs. DELM; (e) Modified CIFAR-100: Linear vs. DELM; (f) Modified CIFAR-100: DNN vs. DELM. Legend entries: Linear (best), Linear (worst), DNN (H=2), DNN (H=3), DNN (H=4), DELM (best), DELM (worst).]

Figure 1: Preliminary observations for additional motivation to theoretically understand deep equilibrium linear models. The figure shows test and train losses versus the number of epochs for linear models, deep equilibrium linear models (DELMs), and deep neural networks with ReLU (DNNs).

The results of this numerical test are presented in Figure 1. In the figure, the plotted lines indicate the mean values over five random trials whereas the shaded regions represent error bars with one standard deviation. The plots for linear models and deep equilibrium linear models are shown with the best and worst learning rates (separately for each model in terms of the final test errors at epoch = 5000) from the set of learning rates $S_{\mathrm{LR}} = \{0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005\}$. The plots for DNNs are shown with the best learning rates (separately for each depth $H$) from the set $S_{\mathrm{LR}}$. As can be seen, all models performed approximately the same at initial points, but deep equilibrium linear models outperformed both linear models and DNNs in test errors after training, confirming our intuition above. Moreover, we confirmed qualitatively the same behaviors with four more datasets as well as for DNNs with and without bias terms in Appendix D. These observations additionally motivated us to study deep equilibrium linear models to obtain our main results in the next section. The purpose of these experiments is to provide a secondary motivation for our theoretical analyses.

3 MAIN RESULTS

In this section, we establish mathematical properties of gradient dynamics for deep equilibrium linear models by directly analyzing their trajectories. We prove linear convergence to the global minimum in Section 3.1 and further analyze the dynamics from the viewpoint of trust region in Section 3.2.

3.1 CONVERGENCE ANALYSIS

We begin in Section 3.1.1 with a presentation of the concept of the Polyak-Łojasiewicz (PL) inequality and additional notation. The PL inequality is used to regularize the choice of the loss functions $\ell$ in our main convergence theorem for a general class of losses in Section 3.1.2. We conclude in Section 3.1.3 by providing concrete examples of the convergence theorem with the square loss and the logistic loss, where the PL inequality no longer needs to be assumed because it is proven to be satisfied by these loss functions.

3.1.1 THE POLYAK-ŁOJASIEWICZ INEQUALITY AND ADDITIONAL NOTATION

In our context, the notion of the PL inequality is formally defined as follows:



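Definition 1. The function $L_0$ is said to satisfy the Polyak-Łojasiewicz (PL) inequality with radius $R \in (0, \infty]$ and parameter $\kappa > 0$ if $\frac{1}{2}\|\nabla L_0^{\mathrm{vec}}(\mathrm{vec}(W))\|_2^2 \ge \kappa\bigl(L_0^{\mathrm{vec}}(\mathrm{vec}(W)) - L^*_{0,R}\bigr)$ for all $\|W\|_1 < R$, where $L_0^{\mathrm{vec}}(\mathrm{vec}(\cdot)) := L_0(\cdot)$ and $L^*_{0,R} := \inf_{W : \|W\|_1 < R} L_0(W)$.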

With any radius $R > 0$ sufficiently large (such that it covers the domain of $L_0$), Definition 1 becomes equivalent to the definition of the PL inequality in the optimization literature (e.g., Polyak, 1963; Karimi et al., 2016). See Appendix C for additional explanations on the equivalence. In general, the non-convex objective function $L$ of deep equilibrium linear models does not satisfy the PL inequality. Therefore, we cannot assume the inequality on $L$. However, in order to obtain linear convergence for a general class of the loss functions $\ell$, we need some assumption on $\ell$: otherwise, we can choose a loss $\ell$ to violate the convergence. Accordingly, we will regularize the choice of the loss $\ell$ through the PL inequality on the function $L_0 : W \mapsto \sum_{i=1}^n \ell(W\phi(x_i), y_i)$.
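To make Definition 1 concrete, consider a one-dimensional illustration. The setting is an assumption chosen for simplicity: $n = m = m_y = 1$, a single feature value $\phi(x_1) = a \neq 0$, the square loss, and $R = \infty$. Then $L_0(w) = (wa - y_1)^2$ attains its infimum $L^*_{0,\infty} = 0$ at $w = y_1 / a$, and a direct computation shows that the PL inequality holds with equality:

```latex
\nabla L_0(w) = 2a\,(wa - y_1), \qquad
\tfrac{1}{2}\,\bigl|\nabla L_0(w)\bigr|^2
  = 2a^2\,(wa - y_1)^2
  = 2a^2\,\bigl(L_0(w) - L^*_{0,\infty}\bigr).
```

In this toy case the PL parameter is therefore $\kappa = 2a^2$, which grows with the scale of the feature.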

The PL inequality with a radius $R \in (0, \infty]$ (Definition 1) leads to the notion of the global minimum value in the domain corresponding to the radius in our analysis: $L^*_R = \inf_{A \in \mathbb{R}^{m \times m}, B \in \mathcal{B}_R} L(A, B)$, where $\mathcal{B}_R = \{B \in \mathbb{R}^{m_y \times m} \mid \|B\|_1 < (1 - \gamma)R\}$. With $R = \infty$, this recovers the global minimum value $L^*$ in the unconstrained domain as $L^*_R = L^* := \inf_{A \in \mathbb{R}^{m \times m}, B \in \mathbb{R}^{m_y \times m}} L(A, B)$. Furthermore, if a global minimum $(A^*, B^*) \in \mathbb{R}^{m \times m} \times \mathbb{R}^{m_y \times m}$ exists, there exists $\bar{R} < \infty$ such that for any $R \in [\bar{R}, \infty)$, we have $B^* \in \mathcal{B}_R$ and thus $L^*_R = L^*$. In other words, if a global minimum exists, using a (sufficiently large) finite radius $R < \infty$ suffices to obtain $L^*_R = L^*$.

We close this subsection by introducing additional notation. For a real symmetric matrix $M$, we use $\lambda_{\min}(M)$ to represent its smallest eigenvalue. For an arbitrary matrix $M \in \mathbb{R}^{d \times d'}$, we let $\mathrm{rank}(M)$ be its rank, $\|M\|_p$ be its matrix norm induced by the vector $p$-norm, $\sigma_{\min}(M)$ be its smallest singular value (i.e., the $\min(d, d')$-th largest singular value), $M_{*j}$ be its $j$-th column vector in $\mathbb{R}^d$, and $M_{i*}$ be its $i$-th row vector in $\mathbb{R}^{d'}$. For $d \in \mathbb{N}_{>0}$, we denote by $I_d$ the identity matrix in $\mathbb{R}^{d \times d}$. We define the Jacobian matrix $J_{k,t} \in \mathbb{R}^{m \times m}$ of the vector-valued function $A_{*k} \mapsto \sigma(A)_{*k}$ by $(J_{k,t})_{ij} = \frac{\partial \sigma(A)_{ik}}{\partial A_{jk}}\big|_{A = A_t}$ for all $t \ge 0$ and $k = 1, \dots, m$. Finally, we define the feature matrix $\Phi \in \mathbb{R}^{m \times n}$ by $\Phi_{ki} = \phi(x_i)_k$ for $k = 1, \dots, m$ and $i = 1, \dots, n$.
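Since each column of $\sigma(A)$ is a softmax of the corresponding column of $A$, the Jacobian of one column has the standard closed form $J_{ij} = s_i(\delta_{ij} - s_j)$, where $s$ is the softmax output. The sketch below checks this formula against central finite differences; the function names and the example column are hypothetical and chosen only for illustration.

```python
import math

def softmax(a):
    """Softmax of a single column, shifted by its max for numerical stability."""
    top = max(a)
    exps = [math.exp(v - top) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(a):
    """Closed form: J[i][j] = d softmax(a)_i / d a_j = s_i * (delta_ij - s_j)."""
    s = softmax(a)
    m = len(a)
    return [
        [s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(m)]
        for i in range(m)
    ]

def finite_difference_jacobian(a, eps=1e-6):
    """Central-difference approximation of the same Jacobian."""
    m = len(a)
    J = [[0.0] * m for _ in range(m)]
    for j in range(m):
        up, down = list(a), list(a)
        up[j] += eps
        down[j] -= eps
        s_up, s_down = softmax(up), softmax(down)
        for i in range(m):
            J[i][j] = (s_up[i] - s_down[i]) / (2 * eps)
    return J

column = [0.3, 1.5, -0.4]  # hypothetical column A_{*k}
J = softmax_jacobian(column)
J_fd = finite_difference_jacobian(column)
max_gap = max(abs(J[i][j] - J_fd[i][j]) for i in range(3) for j in range(3))
```

Each column of this Jacobian sums to zero, reflecting that the softmax outputs always sum to one no matter how the inputs are perturbed.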

3.1.2 MAIN CONVERGENCE THEOREM

Using the PL inequality only on the loss function $\ell$ through $L_0$ (Definition 1), we present our main theorem, a guarantee on linear convergence to global minimum for the gradient dynamics of the non-convex objective $L$ for deep equilibrium linear models:

Theorem 1. Let $\ell : \mathbb{R}^{m_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be arbitrary such that the function $q \mapsto \ell(q, y_i)$ is differentiable for any $i \in \{1, \dots, n\}$ (with an arbitrary $m_y \in \mathbb{N}_{>0}$ and an arbitrary $\mathcal{Y}$). Then, for any $T > 0$, $R \in (0, \infty]$ and $\kappa > 0$ such that $\|B_t\|_1 < (1 - \gamma)R$ for all $t \in [0, T]$ and $L_0$ satisfies the PL inequality with the radius $R$ and the parameter $\kappa$, the following holds:

$$L(A_T, B_T) \le L^*_R + \bigl(L(A_0, B_0) - L^*_{0,R}\bigr) e^{-2\kappa\lambda_T T}, \tag{7}$$

where $\lambda_T := \inf_{t \in [0, T]} \lambda_{\min}(D_t) > 0$ and $D_t$ is a positive definite matrix defined by

$$D_t := \sum_{k=1}^m \Bigl[(U_t^{-\top})_{*k} (U_t^{-1})_{k*} \otimes \bigl(I_{m_y} + \gamma^2 B_t U_t^{-1} J_{k,t} J_{k,t}^\top U_t^{-\top} B_t^\top\bigr)\Bigr], \tag{8}$$

with $U_t := I_m - \gamma\sigma(A_t)$. Furthermore, $\lambda_T \ge \frac{1}{m(1+\gamma)^2}$ for any $T \ge 0$ ($\lim_{T\to\infty} \lambda_T \ge \frac{1}{m(1+\gamma)^2}$).

Proof. The additional nonlinearity $\sigma$ creates a complex interaction among $m$ hidden neurons. This interaction is difficult to factorize out of the gradients of $L$ with respect to $A$. This is different from but analogous to the challenge of dealing with nonlinear activations in the loss landscape of (non-overparameterized) deep nonlinear networks, for which previous works have made assumptions of sparse connections to factorize the interaction (Kawaguchi et al., 2019). In contrast, we do not rely on sparse connections. Instead, we observe that although it is difficult to factorize this complex interaction (due to the nonlinearity $\sigma$) in the space of loss landscape, we can factorize it in the space of gradient dynamics. See Appendix A.1 for the proof overview and the complete proof.

Theorem 1 shows that in the worst case for $\lambda_T$, the optimality gap decreases exponentially towards zero as $L(A_T, B_T) - L^*_R \le C_0 e^{-\frac{2\kappa}{m(1+\gamma)^2} T}$, where $C_0 = L(A_0, B_0) - L^*_{0,R}$. Therefore, for any desired accuracy $\epsilon > 0$, setting $C_0 e^{-\frac{2\kappa}{m(1+\gamma)^2} T} \le \epsilon$ and solving for $T$ yield that

$$L(A_T, B_T) - L^*_R \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{2\kappa} \log\frac{L(A_0, B_0) - L^*_{0,R}}{\epsilon}. \tag{9}$$
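The threshold in equation (9) is straightforward to evaluate. The helper below is a small sketch whose function name, argument names, and example values are hypothetical; it returns the right-hand side of (9) and clips it at zero when the initial gap is already below the target accuracy.

```python
import math

def time_to_accuracy(m, gamma, kappa, initial_gap, eps):
    """Worst-case time from equation (9):
    T >= m (1 + gamma)^2 / (2 kappa) * log(initial_gap / eps),
    where initial_gap stands for C_0 = L(A_0, B_0) - L*_{0,R}.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if m <= 0 or kappa <= 0 or initial_gap <= 0 or eps <= 0:
        raise ValueError("m, kappa, initial_gap and eps must be positive")
    bound = m * (1.0 + gamma) ** 2 / (2.0 * kappa) * math.log(initial_gap / eps)
    return max(0.0, bound)  # no waiting needed if the gap is already below eps

# Hypothetical values, chosen only to exercise the formula.
m, gamma, kappa, initial_gap, eps = 10, 0.5, 1.0, 100.0, 1.0
T = time_to_accuracy(m, gamma, kappa, initial_gap, eps)
# Plugging T back into the worst-case bound C_0 * exp(-2 kappa T / (m (1+gamma)^2)).
gap_at_T = initial_gap * math.exp(-2.0 * kappa * T / (m * (1.0 + gamma) ** 2))
```

The bound grows linearly in the width $m$ and only logarithmically in $1/\epsilon$, which is the sense in which the convergence is linear.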

Theorem 1 also states that the rate of convergence improves further depending on the quality of the matrix $D_t$ (defined in equation (8)) in terms of its smallest eigenvalue over the particular trajectory $(A_t, B_t)$ up to the specific time $t \le T$; i.e., $\lambda_T = \inf_{t \in [0, T]} \lambda_{\min}(D_t)$. This opens up the direction of future work for further improvement of the convergence rate through the design of initialization $(A_0, B_0)$ to maximize $\lambda_T$ for trajectories generated from a specific initialization scheme.

3.1.3 EXAMPLES: SQUARE LOSS AND LOGISTIC LOSS

The main convergence theorem in the previous subsection is stated for any radius $R \in (0, \infty]$ and parameter $\kappa > 0$ that satisfy the conditions on $\|B_t\|_1$ and the PL inequality (see Theorem 1). The values of these variables are not completely specified there as they depend on the choice of the loss functions $\ell$. In this subsection, we show that these values can be specified further and the condition on PL inequality can be discarded by considering a specific choice of loss functions $\ell$.

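In particular, by using the square loss for $\ell$, we prove that we can set $R = \infty$ and $\kappa = 2\sigma_{\min}(\Phi)^2$:

Corollary 1. Let $\ell(q, y_i) = \|q - y_i\|_2^2$ where $y_i \in \mathbb{R}^{m_y}$ for $i = 1, 2, \dots, n$ (with an arbitrary $m_y \in \mathbb{N}_{>0}$). Assume that $\mathrm{rank}(\Phi) = \min(n, m)$. Then for any $T > 0$,

$$L(A_T, B_T) \le L^* + \bigl(L(A_0, B_0) - L^*_0\bigr) e^{-4\sigma_{\min}(\Phi)^2 \lambda_T T},$$

where $\sigma_{\min}(\Phi) > 0$, $L^*_0 := \inf_{W \in \mathbb{R}^{m_y \times m}} L_0(W)$, and $\lambda_T := \inf_{t \in [0, T]} \lambda_{\min}(D_t) \ge \frac{1}{m(1+\gamma)^2}$.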

Proof. This statement follows from Theorem 1. The conditions on $\|B_t\|_1$ and the PL inequality (in Theorem 1) are now discarded by using the property of the square loss $\ell$. See Appendix A.3 for the complete proof.

In Corollary 1, the global linear convergence is established for the square loss without the notion of the radius $R$ as we set $R = \infty$. Even with the square loss, the objective function $L$ is non-convex. Despite the non-convexity, Corollary 1 shows that for any desired accuracy $\epsilon > 0$,

$$L(A_T, B_T) - L^* \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{4\sigma_{\min}(\Phi)^2} \log\frac{L(A_0, B_0) - L^*_0}{\epsilon}. \tag{10}$$

Corollary 1 allows both cases of $m \le n$ and $m > n$. In the case of over-parameterization $m > n$, the covariance matrix $\Phi\Phi^\top \in \mathbb{R}^{m \times m}$ (or $XX^\top$ with $\phi(x) = x$) is always rank deficient because $\mathrm{rank}(\Phi\Phi^\top) = \mathrm{rank}(\Phi) \le n < m$. This implies that the Hessian of $L_0$ is always rank deficient, because the Hessian of $L_0$ is $\nabla^2 L_0^{\mathrm{vec}}(\mathrm{vec}(W)) = 2[\Phi\Phi^\top \otimes I_{m_y}] \in \mathbb{R}^{m_y m \times m_y m}$ (see Appendix A.3 for its derivation) and because $\mathrm{rank}([\Phi\Phi^\top \otimes I_{m_y}]) = \mathrm{rank}(\Phi\Phi^\top)\,\mathrm{rank}(I_{m_y}) \le m_y n < m_y m$. Since the strong convexity on a twice differentiable function requires its Hessian to be of full rank, this means that the objective $L_0$ for linear models is not strongly convex in the case of over-parameterization $m > n$. Nevertheless, we establish the linear convergence to global minimum for deep equilibrium linear models in Corollary 1 for both cases of $m > n$ and $m \le n$ by using Theorem 1.

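For the logistic loss for $\ell$, the following corollary proves the global convergence at a linear rate:

Corollary 2. Let $\ell(q, y_i) = -y_i \log\bigl(\frac{1}{1+e^{-q}}\bigr) - (1 - y_i)\log\bigl(1 - \frac{1}{1+e^{-q}}\bigr) + \tau\|q\|_2^2$ with an arbitrary $\tau \ge 0$ where $y_i \in \{0, 1\}$ for $i = 1, 2, \dots, n$. Assume that $\mathrm{rank}(\Phi) = m$. Then for any $T > 0$ and $R \in (0, \infty]$ such that $\|B_t\|_1 < (1 - \gamma)R$ for all $t \in [0, T]$, the following holds:

$$L(A_T, B_T) \le L^*_R + \bigl(L(A_0, B_0) - L^*_{0,R}\bigr) e^{-2(2\tau + \rho(R))\sigma_{\min}(\Phi)^2 \lambda_T T},$$

where $\sigma_{\min}(\Phi) > 0$, $\lambda_T := \inf_{t \in [0, T]} \lambda_{\min}(D_t) \ge \frac{1}{m(1+\gamma)^2}$, and

$$\rho(R) := \inf_{W : \|W\|_1 < R,\; i \in \{1, \dots, n\}} \Bigl(\frac{1}{1 + e^{-W\phi(x_i)}}\Bigr)\Bigl(1 - \frac{1}{1 + e^{-W\phi(x_i)}}\Bigr) \ge 0.$$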

Proof. This statement follows from Theorem 1 by proving that the condition on PL inequality is satisfied with the parameter $\kappa = (2\tau + \rho(R))\sigma_{\min}(\Phi)^2$. See Appendix A.4 for the complete proof.



In Corollary 2, we can also set $R = \infty$ to remove the notion of the radius $R$ from the statement of the global convergence for the logistic loss. By setting $R = \infty$, Corollary 2 states that for any $T > 0$,

$$L(A_T, B_T) \le L^* + \bigl(L(A_0, B_0) - L^*_0\bigr) e^{-4\tau\sigma_{\min}(\Phi)^2 \lambda_T T},$$

for the logistic loss. For any $\tau > 0$, this implies that for any desired accuracy $\epsilon > 0$,

$$L(A_T, B_T) - L^* \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{4\tau\sigma_{\min}(\Phi)^2} \log\frac{L(A_0, B_0) - L^*_0}{\epsilon}. \tag{11}$$

In practice, we may want to set $\tau > 0$ to regularize the parameters (for generalization) and to ensure the existence of global minima (for optimization and identifiability). That is, if we set $\tau = 0$ instead, the global minima may not exist in any bounded space, due to the property of the logistic loss. This is consistent with Corollary 2 in that if $\tau = 0$, equation (11) does not hold and we must consider the convergence to the global minimum value $L^*_R$ defined in a bounded domain with a radius $R < \infty$. In the case of $\tau = 0$ and $R < \infty$, Corollary 2 implies that for any desired accuracy $\epsilon > 0$,

$$L(A_T, B_T) - L^*_R \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{2\rho(R)\sigma_{\min}(\Phi)^2} \log\frac{L(A_0, B_0) - L^*_{0,R}}{\epsilon}, \tag{12}$$

where we have $\rho(R) > 0$ because $R < \infty$. Therefore, Corollary 2 establishes the linear convergence to global minimum in both cases of $\tau > 0$ and $\tau = 0$ for the logistic loss.
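The role of the radius can be seen from the scalar quantity inside $\rho(R)$. With $s(u) = 1/(1 + e^{-u})$, the product $s(u)(1 - s(u))$ is the second derivative of the unregularized logistic loss with respect to the logit $u$. It is strictly positive for every finite $u$ but tends to zero as $|u|$ grows. The sketch below tabulates it for a few hypothetical logit magnitudes; the names `sigmoid` and `curvature` are illustrative.

```python
import math

def sigmoid(u):
    """Logistic function s(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def curvature(u):
    """s(u) * (1 - s(u)): the factor whose infimum defines rho(R)."""
    s = sigmoid(u)
    return s * (1.0 - s)

# Hypothetical logit magnitudes; larger values correspond to larger allowed weights.
bounds = [0.0, 1.0, 5.0, 10.0, 20.0]
values = [curvature(c) for c in bounds]
```

Bounding the weights by a finite $R$ keeps the logits $W\phi(x_i)$ bounded on the finite training set, so the infimum stays strictly positive. When the logits are allowed to grow without bound, the factor approaches zero, which matches the need for $\tau > 0$ in equation (11).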

3.2 UNDERSTANDING DYNAMICS THROUGH TRUST REGION NEWTON METHOD

In this subsection, we analyze the dynamics of deep equilibrium linear models in the space of the hypothesis, $f_{\theta_t} : x \mapsto B_t(\lim_{l\to\infty} z^{(l)}(x, A_t))$. For any functions $g$ and $\bar{g}$ with a domain $\mathcal{X} \subseteq \mathbb{R}^{m_x}$, we write $g = \bar{g}$ if $g(x) = \bar{g}(x)$ for all $x \in \mathcal{X}$.

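The following theorem shows that the dynamics of deep equilibrium linear models $f_{\theta_t}$ can be written as $\frac{d}{dt} f_{\theta_t} = \frac{1}{\delta_t} V_t \phi$ where $\frac{1}{\delta_t}$ is scalar and $V_t$ follows the dynamics of a trust region Newton method of shallow models with the (non-standard) adaptive trust region $\mathcal{V}_t$. This suggests potential benefits of deep equilibrium linear models in two aspects: when compared to shallow models, it can sometimes accelerate optimization via the effect of the implicit trust region method (but not necessarily, as the trust region method does not necessarily accelerate optimization) and it induces novel implicit bias for generalization via the non-standard implicit trust region $\mathcal{V}_t$.

Theorem 2. Let $\ell : \mathbb{R}^{m_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be arbitrary such that the function $q \mapsto \ell(q, y_i)$ is differentiable for any $i \in \{1, \dots, n\}$ with $m_y = 1$ (and an arbitrary $\mathcal{Y}$). Then for any time $t \ge 0$, there exists a real number $\bar{\delta}_t > 0$ such that for any $\delta_t \in (0, \bar{\delta}_t]$,

$$\frac{d}{dt} f_{\theta_t} = \frac{1}{\delta_t} V_t \phi, \qquad \mathrm{vec}(V_t) \in \operatorname*{argmin}_{v \in \mathcal{V}_t} L_0^t(v), \tag{13}$$

where $\mathcal{V}_t := \{v \in \mathbb{R}^m : \|v\|_{G_t} \le \delta_t \|\frac{d}{dt}\mathrm{vec}(B_t U_t^{-1})\|_{G_t}\}$, $G_t := U_t\bigl(S_t^{-1} - \delta_t F_t\bigr)U_t^\top \succ 0$, and

$$L_0^t(v) := L_0^{\mathrm{vec}}(\mathrm{vec}(B_t U_t^{-1})) + \nabla L_0^{\mathrm{vec}}(\mathrm{vec}(B_t U_t^{-1}))^\top v + \frac{1}{2} v^\top \nabla^2 L_0^{\mathrm{vec}}(\mathrm{vec}(B_t U_t^{-1}))\, v.$$

Here, $F_t := \sum_{i=1}^n \nabla^2 \ell_i(f_{\theta_t}(x_i)) \bigl(\lim_{l\to\infty} z^{(l)}(x_i, A_t)\bigr)\bigl(\lim_{l\to\infty} z^{(l)}(x_i, A_t)\bigr)^\top$ with $\ell_i(q) := \ell(q, y_i)$ and $S_t := I_m + \gamma^2 \mathrm{diag}(v_t^S)$ with $v_t^S \in \mathbb{R}^m$ and $(v_t^S)_k := \|J_{k,t}^\top (B_t U_t^{-1})^\top\|_2^2$ for all $k$.

Proof. This is proven with the Karush–Kuhn–Tucker (KKT) conditions for the constrained optimization problem: $\operatorname{minimize}_{v \in \mathcal{V}_t} L_0^t(v)$. See Appendix A.2.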

When many global minima exist, a difference in the gradient dynamics can lead to a significant discrepancy in the learned models: i.e., two different gradient dynamics can find significantly different global minima with different behaviors for generalization and test accuracies (Kawaguchi et al., 2017). In machine learning, this is an important phenomenon called implicit bias (inductive bias induced implicitly through gradient dynamics) and is the subject of an emerging active research area (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018; Woodworth et al., 2020; Moroshko et al., 2020).



As can be seen in Theorem 2, the gradient dynamics of deep equilibrium linear models $f_{\theta_t}$ differs from that of linear models $W_t\phi$ with any adaptive learning rates, fixed preconditioners, and existing variants of Newton methods. This is consistent with our experiments in Section 2.2 and Appendix D where the dynamics of deep equilibrium linear models resulted in the learned predictors with higher test accuracies, when compared to linear models with any learning rates. In this regard, Theorem 2 provides a partial explanation (and a starting point of the theory) for the observed generalization behaviors, whereas Theorem 1 (with Corollaries 1 and 2) provides the theory for the global convergence observed in the experiments.

Theorem 2, along with our experimental results, suggests the importance of theoretically understanding implicit bias of the dynamics with the time-dependent trust region. In Appendix B, we show that Theorem 2 suggests a new type of implicit bias towards a simple function as a result of infinite depth, whereas understanding this implicit bias in more detail is left as an open problem for future work.

4 EXPERIMENTS

In this section, we conduct experiments to further verify and demonstrate our theory. To compare with the previous findings, we use the same synthetic data as that in the previous work (Zou et al., 2020b): i.e., we randomly generate $x_i \in \mathbb{R}^{10}$ from the standard normal distribution and set $y_i = -x_i + 0.1\varsigma_i$ for all $i \in \{1, 2, \dots, n\}$ with $n = 1000$, where $\varsigma_i$ is independently generated by the standard normal distribution. We set $\phi(x) = x$ and use the square loss $\ell(q, y_i) = \|q - y_i\|_2^2$. As in the previous work, we consider random initialization and identity initialization (Zou et al., 2020b) and report the results in Figure 2 (a). As can be seen in the figure, deep equilibrium linear models converge to the global minimum value with all initializations and random trials, whereas linear ResNet converges to a suboptimal value with identity initialization. This is consistent with our theory for deep equilibrium linear models and the previous work for ResNet (Zou et al., 2020b).
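The synthetic data described above can be generated in a few lines. The following is a sketch using only the Python standard library, not the code used for the reported experiments; the function names, the seed, and the two reference matrices are hypothetical.

```python
import random

def make_synthetic_data(n=1000, dim=10, noise_scale=0.1, seed=0):
    """Draw x_i ~ N(0, I_dim) and set y_i = -x_i + noise_scale * noise_i."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = [-xk + noise_scale * rng.gauss(0.0, 1.0) for xk in x]
        xs.append(x)
        ys.append(y)
    return xs, ys

def square_loss(W, xs, ys):
    """L0(W) = sum_i ||W x_i - y_i||_2^2 with the feature map phi(x) = x."""
    total = 0.0
    for x, y in zip(xs, ys):
        for row, yk in zip(W, y):
            pred = sum(w * xk for w, xk in zip(row, x))
            total += (pred - yk) ** 2
    return total

xs, ys = make_synthetic_data()
dim = len(xs[0])
# Two reference predictors: the noiseless generating map W = -I and W = 0.
minus_identity = [[-1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
zero = [[0.0] * dim for _ in range(dim)]
loss_minus_identity = square_loss(minus_identity, xs, ys)
loss_zero = square_loss(zero, xs, ys)
```

Evaluating the loss at $W = -I$ gives a convenient reference level for training curves, since the only error at that point comes from the added noise.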

We repeated the same experiment by generating $(x_i)_k$ independently from the uniform distribution on the interval $[-1, 1]$ instead, for all $i \in \{1, \dots, n\}$ and $k \in \{1, \dots, m\}$ with $n = 1000$ and $m = 10$. Figure 2 (b) shows the results of this experiment with the uniform distribution and confirms the global convergence of deep equilibrium linear models again with all initializations and random trials. In this case, linear ResNet with identity initialization also converged to the global minimum value. These observations are consistent with Corollary 1 where deep equilibrium linear models are guaranteed to converge to the global minimum value without any condition on the initialization.

We now consider the rate of the global convergence. In Corollary 1, we can set $\lambda_T = \frac{1}{m(1+\gamma)^2}$ to get a guarantee for the global linear convergence rate for all initializations in theory. However, in practice, this is a pessimistic convergence rate and we may want to choose $\lambda_T$ depending on an initialization. To demonstrate this, using the same data as that in Figure 2 (a), Figure 2 (c) reports the numerical training trajectory along with theoretical upper bounds with initialization-independent $\lambda_T = \frac{1}{m(1+\gamma)^2}$ and initialization-dependent $\lambda_T = \inf_{t \in [0, T]} \lambda_{\min}(D_t)$. As can be seen in Figure 2 (c), the theoretical bound with initialization-dependent $\lambda_T$ demonstrates a faster and more accurate convergence rate. Qualitatively the same observation is reported for the logistic loss in Appendix D.2.

[Figure 2 plots: (a) Gaussian data and (b) Uniform data show train loss versus number of iterations (0 to 2000), with legend entries DELM: identity, DELM: random 1, DELM: random 2, DELM: random 3, Linear ResNet, and global optima. (c) Theoretical bounds shows train loss versus number of iterations on log-log axes, with legend entries training trajectory, bound: init-ind $\lambda_T$, bound: init-dep $\lambda_T$, and global optima.]

Figure 2: (a)-(b): Convergence performances for deep equilibrium linear models (DELMs) with identity initialization and random initialization of three random trials, and linear ResNet with identity initialization. (c): The numerical training trajectory of DELMs with random initialization along with theoretical upper bounds with initialization-independent $\lambda_T$ and initialization-dependent $\lambda_T$.



5 RELATED WORK

The theoretical study of gradient dynamics of deep networks with some linearized component is a highly active area of research. Recently, Bartlett et al. (2019); Du & Hu (2019); Arora et al. (2019a); Zou et al. (2020b) analyzed gradient dynamics of deep linear networks and proved global convergence rates for the square loss under certain assumptions on the dataset, initialization, and network structures. For example, the dataset is assumed to be whitened (i.e., $\Phi\Phi^\top = I_m$ or $XX^\top = I_{m_x}$) and the initial loss is assumed to be smaller than the loss of any rank-deficient solution by Arora et al. (2019a); the input and output layers are assumed to represent special transformations and are fixed during training by Zou et al. (2020b).

Deep networks are also linearized implicitly in the neural tangent kernel (NTK) regime with significant over-parameterization $m \gg n$ (Yehudai & Shamir, 2019; Lee et al., 2019). By significantly increasing the number of model parameters (or, more concretely, the width $m$), we can ensure that the deep features or the corresponding NTK stay nearly the same during training. In other words, deep networks in this regime are approximately linear models with random features corresponding to the NTK at random initialization. Because of this implicit linearization, deep networks in the NTK regime are shown to achieve globally minimum training errors by interpolating all training data points (Zou et al., 2020a; Li & Liang, 2018; Jacot et al., 2018; Du et al., 2019; 2018; Chizat et al., 2019; Arora et al., 2019b; Allen-Zhu et al., 2019; Lee et al., 2019; Fang et al., 2020; Montanari & Zhong, 2020).

These previous studies have significantly advanced our theoretical understanding of deep learning through the study of deep linear networks and implicitly linearized deep networks in the NTK regime. In this context, this paper is expected to contribute to the theoretical advancement through the study of a new and significantly different type of deep model: deep equilibrium linear models. In deep equilibrium linear models, the function at each layer $A \mapsto h(z^{(l-1)}; x, \theta)$ is nonlinear due to the additional nonlinearity $\sigma$: $A \mapsto h(z^{(l-1)}; x, \theta) := \gamma\sigma(A)z^{(l-1)} + \phi(x)$. In contrast, for deep linear networks, the function at each layer $W^{(l)} \mapsto h^{(l)}(z^{(l-1)}; x, W^{(l)}) := W^{(l)}z^{(l-1)}$ is linear (it is also linear with skip connections). Furthermore, the nonlinearity $\sigma$ is not an element-wise function, which poses an additional challenge in the mathematical analysis of deep equilibrium linear models. The nonlinearity $\sigma$, the infinite depth, and weight tying in deep equilibrium linear models required us to develop new approaches in our proofs. The differences in the models and proofs naturally led to qualitatively and quantitatively different results. For example, we require neither over-parameterization $m \gg n$, nor interpolation of all training data points, nor any of the assumptions mentioned above for deep linear networks.

Unlike previous papers, we also related the dynamics of deep equilibrium linear models to that of a trust region Newton method of shallow models with the $G_t$-quadratic norm. This suggested potential benefits of deep equilibrium linear models. Our theory is consistent with our numerical observations.

6 CONCLUSION

For deep equilibrium linear models, despite the non-convexity, we have rigorously proven convergence of gradient dynamics to global minima at a linear rate for a general class of loss functions, including the square loss and logistic loss. Moreover, we have proven the relationship between the gradient dynamics of deep equilibrium linear models and that of the adaptive trust region method. These results apply to models with any configuration of the width of hidden layers, the number of data points, and input/output dimensions, allowing rank-deficient covariance matrices as well as both under-parameterization and over-parameterization.

The crucial assumption for our analysis is the differentiability of the function $q \mapsto \ell(q, y_i)$, which is satisfied by standard loss functions, such as the square loss, the logistic loss, and the smoothed hinge loss $\ell(q, y_i) = (\max\{0, 1 - y_i q\})^k$ with $k \ge 2$. However, it is not satisfied by the (non-smoothed) hinge loss $\ell(q, y_i) = \max\{0, 1 - y_i q\}$, the treatment of which is left to future work. Future work also includes corresponding theoretical analyses with stochastic gradient descent.

Our theoretical results (in Section 3) and numerical observations (in Section 2.2 and Appendix D) uncover the special properties of deep equilibrium linear models, providing a basis for future theoretical studies of implicit bias and for further empirical investigations of deep equilibrium models. In our proofs, the treatments of the additional nonlinearity $\sigma$, the infinite depth, and weight tying are especially unique, and we expect our new proof techniques to prove useful in further studies of gradient dynamics for deep models.



REFERENCES

J Harold Ahlberg and Edwin N Nilson. Convergence properties of the spline fit. Journal of the Society for Industrial and Applied Mathematics, 11(1):95–104, 1963.

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6158–6169, 2019.

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, 2018.

Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations, 2019a.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 690–701, 2019a.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. In International Conference on Learning Representations, 2019b.

Randal J Barnes. Matrix differentiation. Springs Journal, pp. 1–9, 2006.

Peter L Bartlett, David P Helmbold, and Philip M Long. Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural Computation, 31(3):477–502, 2019.

Bradley M Bell and James V Burke. Algorithmic differentiation of implicit functions and optimal values. In Advances in Automatic Differentiation, pp. 67–77. Springer, 2008.

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pp. 2937–2947, 2019.

Bruce Christianson. Reverse accumulation and attractive fixed points. Optimization Methods and Software, 3(4):311–326, 1994.

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. In NeurIPS Creativity Workshop 2019, 2019.

Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6292–6299, 2019.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019.

Simon Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. In International Conference on Machine Learning, pp. 1655–1664, 2019.

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685, 2019.

Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

Cong Fang, Jason D Lee, Pengkun Yang, and Tong Zhang. Modeling from features: a mean-field framework for over-parameterized deep neural networks. arXiv preprint arXiv:2007.01452, 2020.



Georg Frobenius. Über Matrizen aus nicht negativen Elementen. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften, pp. 456–477, 1912.

Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6151–6159, 2017.

Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pp. 9461–9471, 2018.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Ameya D Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Locally adaptive activation functions with slope recovery for deep and physics-informed neural networks. Proceedings of the Royal Society A, 476(2239):20200334, 2020a.

Ameya D Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020b.

Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. arXiv preprint arXiv:2006.06657, 2020.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.

Kenji Kawaguchi and Yoshua Bengio. Depth with nonlinearity creates no bad local minima in resnets. Neural Networks, 118:167–174, 2019.

Kenji Kawaguchi and Jiaoyang Huang. Gradient descent finds global minima for generalizable deep neural networks of practical sizes. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 92–99. IEEE, 2019.

Kenji Kawaguchi and Leslie Kaelbling. Elimination of all bad local minima in deep learning. In International Conference on Artificial Intelligence and Statistics, pp. 853–863. PMLR, 2020.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and width on local minima in deep learning. Neural Computation, 31(7):1462–1498, 2019.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pp. 2902–2907. PMLR, 2018.



Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pp. 8572–8583, 2019.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.

Shiyu Liang, Ruoyu Sun, Jason D Lee, and R Srikant. Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems, 2018.

Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: memorization and generalization under lazy training. arXiv preprint arXiv:2007.12826, 2020.

Nenad Moraca. Bounds for norms of the matrix inverse and the smallest singular value. Linear Algebra and its Applications, 429(10):2589–2601, 2008.

Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D Lee, Nathan Srebro, and Daniel Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. arXiv preprint arXiv:2007.06738, 2020.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Quynh Nguyen. On connected sublevel sets in deep learning. In International Conference on Machine Learning, pp. 4790–4799. PMLR, 2019.

Quynh Nguyen. A note on connectivity of sublevel sets in deep learning. arXiv preprint arXiv:2101.08576, 2021.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.

Oskar Perron. Zur Theorie der Matrices. Mathematische Annalen, 64(2):248–263, 1907.

Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, and Hrushikesh Mhaskar. Theory of deep learning iii: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173, 2017.

Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.

Ohad Shamir. Are ResNets provably better than linear predictors? In Advances in Neural Information Processing Systems, to appear, 2018.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

B Tactile Srl and Italy Brescia. Semeion handwritten digit data set. Semeion Research Center of Sciences of Communication, Rome, Italy, 1994.

James M Varah. A lower bound for the smallest singular value of a matrix. Linear Algebra and its Applications, 11(1):3–5, 1975.



Vikas Verma, Meng Qu, Kenji Kawaguchi, Alex Lamb, Yoshua Bengio, Juho Kannala, and Jian Tang. Graphmix: Regularized training of graph neural networks for semi-supervised learning. arXiv preprint arXiv:1909.11715, 2019.

Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. arXiv preprint arXiv:2002.09277, 2020.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, pp. 6598–6608, 2019.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109(3):467–492, 2020a.

Difan Zou, Philip M Long, and Quanquan Gu. On the global convergence of training deep linear resnets. In International Conference on Learning Representations, 2020b.



A PROOFS

In this appendix, we complete the proofs of our theoretical results. We present the proofs of Theorem 1 in Appendix A.1, Theorem 2 in Appendix A.2, Corollary 1 in Appendix A.3, Corollary 2 in Appendix A.4, and Proposition 1 in Appendix A.5. We also provide a proof overview of Theorem 1 at the beginning of Appendix A.1.

Before starting our proofs, we first introduce additional notation used in the proofs and then discuss alternative proofs using the implicit function theorem to avoid relying on the convergence of the Neumann series.

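Additional notation. Given a scalar-valued function $a \in \mathbb{R}$ and a matrix $M \in \mathbb{R}^{d \times d'}$, we write

\[
\frac{\partial a}{\partial M} =
\begin{bmatrix}
\frac{\partial a}{\partial M_{11}} & \cdots & \frac{\partial a}{\partial M_{1d'}} \\
\vdots & \ddots & \vdots \\
\frac{\partial a}{\partial M_{d1}} & \cdots & \frac{\partial a}{\partial M_{dd'}}
\end{bmatrix}
\in \mathbb{R}^{d \times d'},
\]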

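where $M_{ij}$ represents the $(i, j)$-th entry of the matrix $M$. Given a vector-valued function $a \in \mathbb{R}^{d}$ and a column vector $b \in \mathbb{R}^{d'}$, we write

\[
\frac{\partial a}{\partial b} =
\begin{bmatrix}
\frac{\partial a_1}{\partial b_1} & \cdots & \frac{\partial a_1}{\partial b_{d'}} \\
\vdots & \ddots & \vdots \\
\frac{\partial a_d}{\partial b_1} & \cdots & \frac{\partial a_d}{\partial b_{d'}}
\end{bmatrix}
\in \mathbb{R}^{d \times d'},
\]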

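where $b_i$ represents the $i$-th entry of the column vector $b$. Similarly, given a vector-valued function $a \in \mathbb{R}^{d}$ and a row vector $b \in \mathbb{R}^{1 \times d'}$, we write

\[
\frac{\partial a}{\partial b} =
\begin{bmatrix}
\frac{\partial a_1}{\partial b_{11}} & \cdots & \frac{\partial a_1}{\partial b_{1d'}} \\
\vdots & \ddots & \vdots \\
\frac{\partial a_d}{\partial b_{11}} & \cdots & \frac{\partial a_d}{\partial b_{1d'}}
\end{bmatrix}
\in \mathbb{R}^{d \times d'},
\]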

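where $b_{1i}$ represents the $i$-th entry of the row vector $b$. We use $\nabla_A L$ to represent the map $(A, B) \mapsto \frac{\partial L}{\partial A}(A, B)$ (without the usual transpose used in vector calculus). Given a matrix $M$ and a function $\varphi$, we define $\nabla_M \varphi$ similarly as the map $M \mapsto \frac{\partial \varphi}{\partial M}(M)$. Our proofs also use the indicator function:

\[
\mathbb{1}\{i = k\} =
\begin{cases}
1 & \text{if } i = k \\
0 & \text{if } i \neq k
\end{cases}
\]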

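Finally, we recall the definition of the Kronecker product of two matrices: for matrices $M \in \mathbb{R}^{d_M \times d'_M}$ and $\bar{M} \in \mathbb{R}^{d_{\bar{M}} \times d'_{\bar{M}}}$,

\[
M \otimes \bar{M} =
\begin{bmatrix}
M_{11}\bar{M} & \cdots & M_{1 d'_M}\bar{M} \\
\vdots & \ddots & \vdots \\
M_{d_M 1}\bar{M} & \cdots & M_{d_M d'_M}\bar{M}
\end{bmatrix}
\in \mathbb{R}^{d_M d_{\bar{M}} \times d'_M d'_{\bar{M}}}.
\]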
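As a small illustration of this definition, the sketch below uses NumPy's `np.kron`, which follows the same block layout. The example matrices and the helper name `block` are arbitrary choices for illustration. The last lines also check the identity $[z^\top \otimes I_m]\,\mathrm{vec}[S] = Sz$ with column-stacking $\mathrm{vec}$, which is the form in which the Kronecker product is used in equation (16).

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # 3 x 2
M_bar = np.array([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0]])   # 2 x 3

K = np.kron(M, M_bar)                 # (3*2) x (2*3) block matrix

def block(K, i, j, shape):
    """Return the (i, j)-th block of K, where each block has the given shape."""
    r, c = shape
    return K[i * r:(i + 1) * r, j * c:(j + 1) * c]

# Each block equals M[i, j] * M_bar.
print(np.array_equal(block(K, 2, 1, M_bar.shape), M[2, 1] * M_bar))

# vec identity with column-major (column-stacking) vectorization.
m = 3
rng = np.random.default_rng(0)
S = rng.standard_normal((m, m))
z = rng.standard_normal(m)
lhs = np.kron(z.reshape(1, -1), np.eye(m)) @ S.reshape(-1, order="F")
print(np.allclose(lhs, S @ z))
```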

On alternative proofs using the implicit function theorem. In our default proofs, we utilize the Neumann series $\sum_{k=0}^{\infty} \gamma^k \sigma(A)^k$ when deriving the formula of the gradients with respect to $A$. Instead of using the Neumann series, we can alternatively use the implicit function theorem to derive the formula of the gradients with respect to $A$. Specifically, in this alternative proof, we apply the implicit function theorem to the function $\psi$ defined by

\[
\psi(\operatorname{vec}[A], z) = z - \gamma\sigma(A)z - \phi(x),
\]

where $\operatorname{vec}[A]$ and $z \in \mathbb{R}^m$ are independent variables of the function $\psi$: i.e., $\psi(\operatorname{vec}[A], z)$ is allowed to be nonzero. On the other hand, the vector $z$ satisfying $\psi(\operatorname{vec}[A], z) = 0$ is the fixed point $z^* = \lim_{l \to \infty} z^{(l)}$ based on equation (3). Therefore, by applying the implicit function theorem to the function $\psi$, it holds that if the Jacobian matrix $\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial z}\right|_{z=z^*}$ is invertible, then

\[
\frac{\partial z^*}{\partial \operatorname{vec}[A]}
= -\left(\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial z}\right|_{z=z^*}\right)^{-1}
\left(\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial \operatorname{vec}[A]}\right|_{z=z^*}\right). \tag{14}
\]

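Since $\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial z}\right|_{z=z^*} = I - \gamma\sigma(A)$ is invertible, it holds that

\[
\frac{\partial z^*}{\partial \operatorname{vec}[A]}
= -\left(I - \gamma\sigma(A)\right)^{-1}
\left(\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial \operatorname{vec}[A]}\right|_{z=z^*}\right). \tag{15}
\]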

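Moreover, since $\sigma(A)z \in \mathbb{R}^m$ is a column vector,

\[
\begin{aligned}
\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial \operatorname{vec}[A]}\right|_{z=z^*}
&= -\gamma \left.\frac{\partial \sigma(A)z}{\partial \operatorname{vec}[A]}\right|_{z=z^*}
= -\gamma \left.\frac{\partial \operatorname{vec}[\sigma(A)z]}{\partial \operatorname{vec}[A]}\right|_{z=z^*} \\
&= -\gamma \left.\frac{\partial [z^\top \otimes I_m]\operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]}\right|_{z=z^*}
= -\gamma [(z^*)^\top \otimes I_m]\frac{\partial \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]}.
\end{aligned} \tag{16}
\]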

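Combining equations (15) and (16), we have

\[
\frac{\partial z^*}{\partial \operatorname{vec}[A]}
= \gamma \left(I - \gamma\sigma(A)\right)^{-1} [(z^*)^\top \otimes I_m]\frac{\partial \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]}. \tag{17}
\]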
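Equation (17) can be checked numerically. The following is a minimal sketch, not part of the original proofs: it assumes that $\sigma$ is the column-wise softmax used in Appendix A.1.1, that $\operatorname{vec}$ stacks columns, and that $\phi(x)$ is replaced by an arbitrary vector `phi`. All function names are illustrative. It compares the closed-form Jacobian of equation (17) with central finite differences of the fixed point $z^* = (I - \gamma\sigma(A))^{-1}\phi(x)$.

```python
import numpy as np

def col_softmax(A):
    """Column-wise softmax: sigma(A)_ij = exp(A_ij) / sum_t exp(A_tj)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def softmax_vec_jacobian(A):
    """d vec[sigma(A)] / d vec[A]: block diagonal, one m x m block per column."""
    m = A.shape[0]
    S = col_softmax(A)
    J = np.zeros((m * m, m * m))
    for l in range(m):
        s = S[:, l]
        J[l * m:(l + 1) * m, l * m:(l + 1) * m] = np.diag(s) - np.outer(s, s)
    return J

def fixed_point(A, phi, gamma):
    """z* solving z = gamma * sigma(A) z + phi."""
    m = A.shape[0]
    return np.linalg.solve(np.eye(m) - gamma * col_softmax(A), phi)

def ift_jacobian(A, phi, gamma):
    """Right-hand side of equation (17), an m x m^2 matrix."""
    m = A.shape[0]
    U = np.eye(m) - gamma * col_softmax(A)
    z_star = np.linalg.solve(U, phi)
    K = np.kron(z_star.reshape(1, -1), np.eye(m))   # (z*)^T kron I_m
    return gamma * np.linalg.solve(U, K @ softmax_vec_jacobian(A))

def fd_jacobian(A, phi, gamma, eps=1e-6):
    """Central finite differences of z* with respect to vec[A] (column-major)."""
    m = A.shape[0]
    out = np.zeros((m, m * m))
    for idx in range(m * m):
        d = np.zeros(m * m)
        d[idx] = eps
        dA = d.reshape(m, m, order="F")
        out[:, idx] = (fixed_point(A + dA, phi, gamma)
                       - fixed_point(A - dA, phi, gamma)) / (2 * eps)
    return out

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
phi = rng.standard_normal(4)
gap = np.abs(ift_jacobian(A, phi, 0.8) - fd_jacobian(A, phi, 0.8)).max()
print("max abs difference:", gap)
```

The two Jacobians should agree up to finite-difference error, which supports using equation (17) in place of the Neumann-series argument.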

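In our proofs, whenever we require the gradients with respect to $A$, we can directly use equation (17) instead of relying on the convergence of the Neumann series. For example, equation (21) in the proof of Theorem 1 is identical to equation (17) with additional multiplication by $B_{q*}$: i.e., for the left-hand side,

\[
B_{q*}\frac{\partial z^*}{\partial \operatorname{vec}[A]}
= \frac{\partial B_{q*} z^*}{\partial \operatorname{vec}[A]}
= \frac{\partial B_{q*} U^{-1}\phi(x)}{\partial A},
\]

and for the right-hand side,

\[
\begin{aligned}
&\gamma B_{q*}\left(I - \gamma\sigma(A)\right)^{-1} [(z^*)^\top \otimes I_m]\frac{\partial \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]} \\
&\quad = \gamma
\begin{bmatrix}
\left(\frac{\partial \sigma(A)_{*1}}{\partial A_{*1}}\right)^\top (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*1}
& \cdots &
\left(\frac{\partial \sigma(A)_{*m}}{\partial A_{*m}}\right)^\top (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*m}
\end{bmatrix}.
\end{aligned}
\]

A.1 PROOF OF THEOREM 1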
]A.1 PROOF OF THEOREM 1

We begin with a proof overview of Theorem 1. We first compute the derivatives of the output of deep equilibrium linear models with respect to the parameters $A$ in Appendix A.1.1. Then, using the derivatives, we rearrange the formula of $\nabla_A L$ such that it is related to the formula of $\nabla L_0$ in Appendices A.1.1–A.1.3. Intuitively, we then want to understand $\nabla_A L$ through the property of $\nabla L_0$, similarly to the landscape analyses of deep linear networks by Kawaguchi (2016). However, we note that the additional nonlinearity $\sigma$ creates a complex interaction over the dimension $m$, which prevents us from using such a proof approach. Instead, using the proven relation of $\nabla_A L$ and $\nabla L_0$ from Appendices A.1.1–A.1.3, we directly analyze the trajectories of the dynamics over time $t$ in Appendices A.1.4–A.1.5, which results in a partial factorization of the interaction over the dimension $m$. Using such a partial factorization, we derive the linear convergence rate in Appendices A.1.6–A.1.7 by using the PL inequality and the properties of induced norms.

Before getting into the details of the proof, we briefly discuss the tightness of a bound used in our proof. In the condition $\|B_t\|_1 < (1 - \gamma)R$ in the statement of Theorem 1, the quantity $(1 - \gamma)$ comes from the proof in Appendix A.1.7: i.e., it is the reciprocal of the quantity $\frac{1}{1-\gamma}$ in the upper bound $\|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}$. Therefore, a natural question is whether or not we can improve this bound further. This bound turns out to be tight based on the following lower bound. The matrix $I_m - \gamma\sigma(A)$ is a Z-matrix since its off-diagonal entries are less than or equal to zero. Furthermore, $I_m - \gamma\sigma(A)$ is an M-matrix since the eigenvalues of $I_m - \gamma\sigma(A)$ are the eigenvalues of $I_m - \gamma\sigma(A)^\top$ and the eigenvalues of $I_m - \gamma\sigma(A)^\top$ have real parts lower bounded by $1 - \gamma > 0$. This is because $\sigma(A)^\top$ is a stochastic matrix with the largest eigenvalue being one. Moreover, in the proof in Appendix A.1.7, we show that $|I - \gamma\sigma(A)|_{jj} - \sum_{i \neq j} |I - \gamma\sigma(A)|_{ij} = 1 - \gamma$ for all $j$. Therefore, using the lower bound by Moraca (2008), we have

\[
\|(I - \gamma\sigma(A))^{-1}\|_1 \ge \frac{1}{\max_j \left(|I - \gamma\sigma(A)|_{jj} - \sum_{i \neq j} |I - \gamma\sigma(A)|_{ij}\right)} = \frac{1}{1 - \gamma},
\]

which matches the upper bound $\|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}$. Therefore, we cannot further improve our bound on $\|B_t\|_1$ in general without making some additional assumption.
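The tightness claim can be illustrated numerically. The sketch below is an illustration under the assumption that $\sigma$ is the column-wise softmax of Appendix A.1.1; the function names are hypothetical. It computes the induced 1-norm of $(I - \gamma\sigma(A))^{-1}$ and the column-wise gaps $|I - \gamma\sigma(A)|_{jj} - \sum_{i \neq j}|I - \gamma\sigma(A)|_{ij}$ for a random $A$.

```python
import numpy as np

def col_softmax(A):
    """Column-wise softmax, so every column of sigma(A) sums to one."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def one_norm_of_inverse(A, gamma):
    """Induced 1-norm (maximum absolute column sum) of (I - gamma * sigma(A))^{-1}."""
    m = A.shape[0]
    U = np.eye(m) - gamma * col_softmax(A)
    return np.linalg.norm(np.linalg.inv(U), ord=1)

def dominance_gaps(A, gamma):
    """|U|_jj - sum_{i != j} |U|_ij for each column j of U = I - gamma * sigma(A)."""
    m = A.shape[0]
    abs_U = np.abs(np.eye(m) - gamma * col_softmax(A))
    diag = np.diag(abs_U)
    return diag - (abs_U.sum(axis=0) - diag)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
for gamma in (0.1, 0.5, 0.9):
    print(gamma, one_norm_of_inverse(A, gamma), 1.0 / (1.0 - gamma),
          dominance_gaps(A, gamma))
```

For each $\gamma$, the computed norm should coincide with $\frac{1}{1-\gamma}$ and every gap with $1 - \gamma$, up to floating-point error.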

A.1.1 REARRANGING THE FORMULA OF ∇AL

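We will use the following facts for matrix calculus (that can be derived by using the definition of derivatives: e.g., see Barnes, 2006):

\[
\frac{\partial M^{-1}}{\partial a} = -M^{-1}\frac{\partial M}{\partial a}M^{-1}
\]

\[
\frac{\partial a^\top M^{-1} b}{\partial M} = -M^{-\top} a b^\top M^{-\top}
\]

\[
\frac{\partial g(M)}{\partial a} = \sum_i \sum_j \frac{\partial g(M)}{\partial M_{ij}}\frac{\partial M_{ij}}{\partial a}
\]

\[
\frac{\partial g(a)}{\partial M} = \frac{\partial g(a)}{\partial a}\frac{\partial a}{\partial M}
\]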

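Recall that $U = I - \gamma\sigma(A)$. From the above facts, given a function $g$, we have

\[
\begin{aligned}
\frac{\partial g(U)}{\partial A_{kl}}
&= \sum_{i=1}^m \sum_{j=1}^m \frac{\partial g(U)}{\partial U_{ij}}\frac{\partial U_{ij}}{\partial A_{kl}} \\
&= \sum_{i=1}^m \sum_{j=1}^m \frac{\partial g(U)}{\partial U_{ij}}\frac{\partial U_{ij}}{\partial \sigma(A)_{ij}}\frac{\partial \sigma(A)_{ij}}{\partial A_{kl}} \\
&= -\gamma \sum_{i=1}^m \sum_{j=1}^m \frac{\partial g(U)}{\partial U_{ij}}\frac{\partial \sigma(A)_{ij}}{\partial A_{kl}}.
\end{aligned} \tag{18}
\]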

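Using the quotient rule,

\[
\begin{aligned}
\frac{\partial \sigma(A)_{ij}}{\partial A_{kl}}
&= \frac{\partial}{\partial A_{kl}} \frac{\exp(A_{ij})}{\sum_t \exp(A_{tj})} \\
&= \frac{\left(\frac{\partial \exp(A_{ij})}{\partial A_{kl}}\right)\left(\sum_t \exp(A_{tj})\right) - \exp(A_{ij})\left(\frac{\partial \sum_t \exp(A_{tj})}{\partial A_{kl}}\right)}{\left(\sum_t \exp(A_{tj})\right)^2} \\
&= \frac{\mathbb{1}\{i = k\}\mathbb{1}\{j = l\}\exp(A_{ij})\left(\sum_t \exp(A_{tj})\right) - \mathbb{1}\{j = l\}\exp(A_{ij})\exp(A_{kj})}{\left(\sum_t \exp(A_{tj})\right)^2} \\
&= \frac{\mathbb{1}\{i = k\}\mathbb{1}\{j = l\}\exp(A_{ij})\sum_t \exp(A_{tj}) - \mathbb{1}\{j = l\}\exp(A_{ij})\exp(A_{kj})}{\left(\sum_t \exp(A_{tj})\right)^2} \\
&= \mathbb{1}\{i = k\}\mathbb{1}\{j = l\}\sigma(A)_{ij} - \mathbb{1}\{j = l\}\frac{\exp(A_{ij})}{\sum_t \exp(A_{tj})}\frac{\exp(A_{kj})}{\sum_t \exp(A_{tj})} \\
&= \mathbb{1}\{j = l\}\mathbb{1}\{i = k\}\sigma(A)_{ij} - \mathbb{1}\{j = l\}\sigma(A)_{ij}\sigma(A)_{kj}.
\end{aligned} \tag{19}
\]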
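Equation (19) is straightforward to test against finite differences. The sketch below is illustrative only; the helper names are assumptions, and $\sigma$ is implemented as the column-wise softmax appearing in the first line of the derivation.

```python
import numpy as np

def col_softmax(A):
    """sigma(A)_ij = exp(A_ij) / sum_t exp(A_tj)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def dsigma_formula(A, i, j, k, l):
    """Right-hand side of equation (19) for d sigma(A)_ij / d A_kl."""
    S = col_softmax(A)
    return float(j == l) * (float(i == k) * S[i, j] - S[i, j] * S[k, j])

def dsigma_fd(A, i, j, k, l, eps=1e-6):
    """Central finite-difference approximation of the same partial derivative."""
    A_plus, A_minus = A.copy(), A.copy()
    A_plus[k, l] += eps
    A_minus[k, l] -= eps
    return (col_softmax(A_plus)[i, j] - col_softmax(A_minus)[i, j]) / (2 * eps)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
worst = max(abs(dsigma_formula(A, i, j, k, l) - dsigma_fd(A, i, j, k, l))
            for i in range(3) for j in range(3)
            for k in range(3) for l in range(3))
print("largest discrepancy over all index combinations:", worst)
```

In matrix form, the formula says that the Jacobian of the $l$-th column of $\sigma(A)$ with respect to the $l$-th column of $A$ is $\operatorname{diag}(s) - ss^\top$ with $s = \sigma(A)_{*l}$, and that different columns do not interact.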

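Thus,

\[
\frac{\partial g(U)}{\partial A_{kl}}
= -\gamma \sum_{i=1}^m \frac{\partial g(U)}{\partial U_{il}}\frac{\partial \sigma(A)_{il}}{\partial A_{kl}}
= -\gamma \left(\frac{\partial g(U)}{\partial U_{*l}}\right)^\top \frac{\partial \sigma(A)_{*l}}{\partial A_{kl}} \in \mathbb{R},
\]

where $\frac{\partial g(U)}{\partial U_{*l}} \in \mathbb{R}^{m \times 1}$ and $\frac{\partial \sigma(A)_{*l}}{\partial A_{kl}} \in \mathbb{R}^{m \times 1}$. This yields

\[
\frac{\partial g(U)}{\partial A_{*l}}
= -\gamma \left(\frac{\partial g(U)}{\partial U_{*l}}\right)^\top \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \in \mathbb{R}^{1 \times m},
\]

where $\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \in \mathbb{R}^{m \times m}$.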

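Now we want to set $g(U)$ to be the output of deep equilibrium linear models as $g(U) = B_{q*}\left(\lim_{l \to \infty} z^{(l)}(x, A)\right)$ for each $q \in \{1, \dots, m_y\}$. To do this, we first simplify the formula of the output $B_{q*}\left(\lim_{l \to \infty} z^{(l)}(x, A)\right)$ using the following:

\[
\begin{aligned}
(I_m - \gamma\sigma(A))\left(\sum_{k=0}^{l} \gamma^k \sigma(A)^k\right)
&= I_m - \gamma\sigma(A) + \gamma\sigma(A) - (\gamma\sigma(A))^2 + (\gamma\sigma(A))^2 - (\gamma\sigma(A))^3 + \cdots - (\gamma\sigma(A))^{l+1} \\
&= I - (\gamma\sigma(A))^{l+1}.
\end{aligned}
\]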

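Therefore,

\[
\begin{aligned}
(I_m - \gamma\sigma(A))\left(\lim_{l \to \infty} z^{(l)}(x, A)\right)
&= \lim_{l \to \infty} (I_m - \gamma\sigma(A))\left(\sum_{k=0}^{l} \gamma^k \sigma(A)^k \phi(x)\right) \\
&= \left(I_m - \lim_{l \to \infty} (\gamma\sigma(A))^{l+1}\right)\phi(x) \\
&= \phi(x),
\end{aligned}
\]

where the first line, the second line, and the last line used the fact that $\gamma\sigma(A)_{ij} \ge 0$,

\[
\|\sigma(A)\|_1 = \max_j \sum_i |\sigma(A)_{ij}| = 1,
\]

and hence $\|\gamma\sigma(A)\|_1 < 1$ for $\gamma \in (0, 1)$. This shows that $B\left(\lim_{l \to \infty} z^{(l)}(x, A)\right) = BU^{-1}\phi(x)$, where the inverse $U^{-1}$ exists because the corresponding Neumann series $\sum_{k=0}^{\infty} \gamma^k \sigma(A)^k$ converges since $\|\gamma\sigma(A)\|_1 < 1$. Therefore, we can now set $g(U) = B_{q*}\left(\lim_{l \to \infty} z^{(l)}(x, A)\right) = B_{q*}U^{-1}\phi(x)$.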
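The convergence argument above can be illustrated by iterating the layer map directly. This is a minimal sketch under stated assumptions: $\sigma$ is the column-wise softmax, the iteration starts from $z^{(0)} = \phi(x)$ (the $k = 0$ term of the partial sums), and `phi` is an arbitrary vector standing in for $\phi(x)$. The function names are illustrative. The error after $l$ layers is compared with the geometric bound $\frac{\gamma^{l+1}}{1-\gamma}\|\phi(x)\|_1$, which follows from $\|\gamma\sigma(A)\|_1 = \gamma$.

```python
import numpy as np

def col_softmax(A):
    """Column-wise softmax, so that ||sigma(A)||_1 = 1."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def iterate_layers(A, phi, gamma, num_layers):
    """z^(l) = gamma * sigma(A) z^(l-1) + phi, i.e. the partial Neumann sum applied to phi."""
    S = col_softmax(A)
    z = phi.copy()                      # z^(0) = phi
    for _ in range(num_layers):
        z = gamma * S @ z + phi
    return z

def equilibrium(A, phi, gamma):
    """Fixed point U^{-1} phi with U = I - gamma * sigma(A), found by a linear solve."""
    m = A.shape[0]
    return np.linalg.solve(np.eye(m) - gamma * col_softmax(A), phi)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
phi = rng.standard_normal(6)
gamma = 0.9
z_star = equilibrium(A, phi, gamma)
for l in (1, 10, 100):
    err = np.abs(iterate_layers(A, phi, gamma, l) - z_star).sum()
    bound = gamma ** (l + 1) / (1.0 - gamma) * np.abs(phi).sum()
    print(l, err, bound)
```

The linear solve returns the same point that the infinitely deep weight-tied stack converges to, without ever unrolling the layers.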

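Then, using $\frac{\partial a^\top M^{-1} b}{\partial M} = -M^{-\top} a b^\top M^{-\top}$,

\[
\frac{\partial g(U)}{\partial U} = \frac{\partial B_{q*}U^{-1}\phi(x)}{\partial U} = -U^{-\top}(B_{q*})^\top \phi(x)^\top U^{-\top},
\]

which implies that

\[
\begin{aligned}
\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial U_{*l}}
&= -\left(U^{-\top}(B_{q*})^\top \phi(x)^\top U^{-\top}\right)_{*l} \\
&= -U^{-\top}(B_{q*})^\top \phi(x)^\top (U^{-\top})_{*l} \in \mathbb{R}^{m \times 1}.
\end{aligned} \tag{20}
\]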

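Combining (18) and (20),

\[
\begin{aligned}
\frac{\partial g(U)}{\partial A_{*l}}
= \frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A_{*l}}
&= -\gamma \left(\frac{\partial g(U)}{\partial U_{*l}}\right)^\top \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \\
&= \gamma \left(U^{-\top}(B_{q*})^\top \phi(x)^\top (U^{-\top})_{*l}\right)^\top \left(\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}}\right) \\
&= \gamma \left((U^{-\top})_{*l}\right)^\top \phi(x) B_{q*}U^{-1}\left(\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}}\right) \\
&= \gamma (U^{-1})_{l*}\phi(x) B_{q*}U^{-1}\left(\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}}\right) \in \mathbb{R}^{1 \times m},
\end{aligned}
\]

where we used $(U^{-\top})_{*l} = ((U^{-1})^\top)_{*l} = ((U^{-1})_{l*})^\top$ and $((U^{-\top})_{*l})^\top = (((U^{-1})_{l*})^\top)^\top = (U^{-1})_{l*}$. By taking the transpose,

\[
\left(\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A}\right)_{*l}
= \gamma \left(\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}}\right)^\top (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*l} \in \mathbb{R}^{m \times 1}.
\]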

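By rearranging this into the matrix form,

\[
\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A}
= \gamma
\begin{bmatrix}
\left(\frac{\partial \sigma(A)_{*1}}{\partial A_{*1}}\right)^\top (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*1}
& \cdots &
\left(\frac{\partial \sigma(A)_{*m}}{\partial A_{*m}}\right)^\top (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*m}
\end{bmatrix}, \tag{21}
\]

where $\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A} \in \mathbb{R}^{m \times m}$. Each entry of this matrix represents the derivative of the model output with respect to the corresponding entry of the parameters $A$. We now use this to rearrange $\nabla_A L(A, B) := \frac{\partial L(A, B)}{\partial A}$. We set $\hat{y}_{iq} = B_{q*}U^{-1}\phi(x_i)$ and $\hat{y}_i = BU^{-1}\phi(x_i)$ and define

\[
J_k := \frac{\partial \sigma(A)_{*k}}{\partial A_{*k}} \in \mathbb{R}^{m \times m},
\]

and

\[
Q := \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^\top \phi(x_i)^\top \in \mathbb{R}^{m_y \times m}.
\]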

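Then, using the chain rule and the above formula of $\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A}$,

\[
\begin{aligned}
\frac{\partial L(A, B)}{\partial A}
&= \sum_{i=1}^n \sum_{q=1}^{m_y} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\frac{\partial \hat{y}_{iq}}{\partial A} \\
&= \gamma \sum_{i=1}^n \sum_{q=1}^{m_y} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}
\begin{bmatrix}
J_1^\top (B_{q*}U^{-1})^\top \phi(x_i)^\top (U^{-\top})_{*1} & \cdots & J_m^\top (B_{q*}U^{-1})^\top \phi(x_i)^\top (U^{-\top})_{*m}
\end{bmatrix} \\
&= \gamma \sum_{i=1}^n
\begin{bmatrix}
J_1^\top \left(\sum_{q=1}^{m_y} (B_{q*}U^{-1})^\top \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right)\phi(x_i)^\top (U^{-\top})_{*1}
& \cdots &
J_m^\top \left(\sum_{q=1}^{m_y} (B_{q*}U^{-1})^\top \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right)\phi(x_i)^\top (U^{-\top})_{*m}
\end{bmatrix} \\
&= \gamma \sum_{i=1}^n
\begin{bmatrix}
J_1^\top \left(\sum_{q=1}^{m_y} ((BU^{-1})^\top)_{*q} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right)\phi(x_i)^\top (U^{-\top})_{*1}
& \cdots &
J_m^\top \left(\sum_{q=1}^{m_y} ((BU^{-1})^\top)_{*q} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right)\phi(x_i)^\top (U^{-\top})_{*m}
\end{bmatrix} \\
&= \gamma \sum_{i=1}^n
\begin{bmatrix}
J_1^\top (BU^{-1})^\top \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^\top \phi(x_i)^\top (U^{-\top})_{*1}
& \cdots &
J_m^\top (BU^{-1})^\top \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^\top \phi(x_i)^\top (U^{-\top})_{*m}
\end{bmatrix} \\
&= \gamma
\begin{bmatrix}
\left(\frac{\partial \sigma(A)_{*1}}{\partial A_{*1}}\right)^\top (BU^{-1})^\top Q (U^{-\top})_{*1}
& \cdots &
\left(\frac{\partial \sigma(A)_{*m}}{\partial A_{*m}}\right)^\top (BU^{-1})^\top Q (U^{-\top})_{*m}
\end{bmatrix}.
\end{aligned}
\]

Summarizing the above, we have that

\[
\nabla_A L(A, B) := \frac{\partial L(A, B)}{\partial A}
= \gamma
\begin{bmatrix}
J_1^\top (BU^{-1})^\top Q (U^{-\top})_{*1} & \cdots & J_m^\top (BU^{-1})^\top Q (U^{-\top})_{*m}
\end{bmatrix}, \tag{22}
\]

where $\nabla_A L(A, B) \in \mathbb{R}^{m \times m}$.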
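Equation (22) can be sanity-checked against finite differences of the loss. The following sketch is an illustration under assumptions that are not part of the proof: the square loss $\ell(q, y_i) = \|q - y_i\|_2^2$ (so that $\frac{\partial \ell}{\partial \hat{y}_i} = 2(\hat{y}_i - y_i)^\top$), the column-wise softmax for $\sigma$, random data stored column-wise in `Phi` and `Y`, and illustrative function names.

```python
import numpy as np

def col_softmax(A):
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def loss(A, B, Phi, Y, gamma):
    """L(A, B) = sum_i ||B U^{-1} phi(x_i) - y_i||^2; columns of Phi and Y index samples."""
    m = A.shape[0]
    U = np.eye(m) - gamma * col_softmax(A)
    pred = B @ np.linalg.solve(U, Phi)
    return float(np.sum((pred - Y) ** 2))

def grad_A_formula(A, B, Phi, Y, gamma):
    """Column-by-column evaluation of equation (22) for the square loss."""
    m = A.shape[0]
    S = col_softmax(A)
    U_inv = np.linalg.inv(np.eye(m) - gamma * S)
    Z = B @ U_inv                                   # B U^{-1}
    Q = 2.0 * (Z @ Phi - Y) @ Phi.T                 # sum_i (dl/dyhat_i)^T phi(x_i)^T
    G = np.zeros((m, m))
    for k in range(m):
        J_k = np.diag(S[:, k]) - np.outer(S[:, k], S[:, k])
        G[:, k] = gamma * J_k.T @ Z.T @ Q @ U_inv.T[:, k]
    return G

def grad_A_fd(A, B, Phi, Y, gamma, eps=1e-6):
    """Central finite differences of the loss with respect to each entry of A."""
    G = np.zeros_like(A)
    for k in range(A.shape[0]):
        for l in range(A.shape[1]):
            E = np.zeros_like(A)
            E[k, l] = eps
            G[k, l] = (loss(A + E, B, Phi, Y, gamma)
                       - loss(A - E, B, Phi, Y, gamma)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
m, m_y, n = 4, 3, 20
A, B = rng.standard_normal((m, m)), rng.standard_normal((m_y, m))
Phi, Y = rng.standard_normal((m, n)), rng.standard_normal((m_y, n))
diff = np.abs(grad_A_formula(A, B, Phi, Y, 0.7) - grad_A_fd(A, B, Phi, Y, 0.7)).max()
print("max abs difference:", diff)
```

The loop over $k$ makes the lack of factorization visible: each column of the gradient involves its own Jacobian $J_k$.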



A.1.2 REARRANGING THE FORMULA OF ∇L0

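In order to relate $L_0$ to the gradient dynamics of $L$, we now rearrange the formula of $\nabla L_0$. We set $\hat{y}_{iq} = W_{q*}\phi(x_i) \in \mathbb{R}$ and $\hat{y}_i = W\phi(x_i) \in \mathbb{R}^{m_y}$ for linear models. Then, by the chain rule,

\[
\frac{\partial L_0(W)}{\partial W} = \sum_{i=1}^n \sum_{q=1}^{m_y} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\frac{\partial \hat{y}_{iq}}{\partial W}.
\]

Since

\[
\frac{\partial \hat{y}_{iq}}{\partial W_{k*}} = \mathbb{1}\{k = q\}\phi(x_i)^\top,
\]

we have

\[
\frac{\partial L_0(W)}{\partial W_{k*}}
= \sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{ik}}\frac{\partial \hat{y}_{ik}}{\partial W_{k*}}
= \sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{ik}}\phi(x_i)^\top.
\]

By rearranging this into the matrix form,

\[
\begin{aligned}
\frac{\partial L_0(W)}{\partial W}
&= \begin{bmatrix}
\sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i1}}\phi(x_i)^\top \\
\vdots \\
\sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i m_y}}\phi(x_i)^\top
\end{bmatrix}
= \sum_{i=1}^n \begin{bmatrix}
\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i1}}\phi(x_i)^\top \\
\vdots \\
\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i m_y}}\phi(x_i)^\top
\end{bmatrix} \\
&= \sum_{i=1}^n \begin{bmatrix}
\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i1}} \\
\vdots \\
\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i m_y}}
\end{bmatrix}\phi(x_i)^\top
= \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^\top \phi(x_i)^\top \in \mathbb{R}^{m_y \times m},
\end{aligned}
\]

where $\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i} \in \mathbb{R}^{1 \times m_y}$. Thus,

\[
\nabla L_0(W) := \frac{\partial L_0(W)}{\partial W} = \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^\top \phi(x_i)^\top \in \mathbb{R}^{m_y \times m}. \tag{23}
\]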

A.1.3 COMBINING THE FORMULA OF ∇AL AND ∇L0

Combining (22) and (23) by resolving the different definitions of $\hat{y}_i$ yields that

\[
\nabla_A L(A, B)
= \gamma
\begin{bmatrix}
J_1^\top (BU^{-1})^\top \nabla L_0(BU^{-1})(U^{-\top})_{*1} & \cdots & J_m^\top (BU^{-1})^\top \nabla L_0(BU^{-1})(U^{-\top})_{*m}
\end{bmatrix}. \tag{24}
\]

Here, if there is no additional nonlinearity $\sigma$, the matrices $J_k = \frac{\partial \sigma(A)_{*k}}{\partial A_{*k}}$ become the identity for all $k$. In that case, $\nabla_A L(A, B)$ can be further simplified and factorized over $m$, which is desired for the analysis of gradient dynamics. However, due to the additional nonlinearity, we cannot factorize $\nabla_A L(A, B)$ over $m$. One of the key techniques in our analysis is to keep this un-factorized $\nabla_A L(A, B)$ and find a way to factorize it during the update of the parameters $(A_t, B_t)$ in the gradient dynamics, as shown later in this proof. To do so, we now start considering the dynamics over time $t$.

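A.1.4 ANALYSING $\left(\lim_{l \to \infty} z^{(l)}(x, A_t)\right)$

Now let us temporarily consider a gradient dynamics discretized by the Euler method as

\[
A_{t+1} = A_t - \alpha\nabla_A L(A_t, B_t),
\]

with some step size $\alpha > 0$. Then,

\[
\begin{aligned}
\left(\lim_{l \to \infty} z^{(l)}(x, A_{t+1})\right)
&= (I_m - \gamma\sigma(A_{t+1}))^{-1}\phi(x) \\
&= (I_m - \gamma\sigma(A_t - \alpha\nabla_A L(A_t, B_t)))^{-1}\phi(x),
\end{aligned}
\]

where we used $\left(\lim_{l \to \infty} z^{(l)}(x, A_t)\right) = U_t^{-1}\phi(x)$ from Section A.1.1. By setting $\varphi_{ij}(\alpha) = \sigma(A - \alpha\nabla_A L(A, B))_{ij} \in \mathbb{R}$,

\[
\sigma(A - \alpha\nabla_A L(A, B))_{ij} = \varphi_{ij}(\alpha) = \varphi_{ij}(0) + \frac{\partial \varphi_{ij}(0)}{\partial \alpha}\alpha + O(\alpha^2).
\]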

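By using the chain rule and setting $M = A - \alpha\nabla_A L(A, B) \in \mathbb{R}^{m \times m}$,

\[
\begin{aligned}
\frac{\partial \varphi_{ij}(\alpha)}{\partial \alpha}
&= \sum_{k=1}^m \sum_{l=1}^m \frac{\partial \sigma(M)_{ij}}{\partial M_{kl}}\frac{\partial M_{kl}}{\partial \alpha} \\
&= -\sum_{k=1}^m \sum_{l=1}^m \left[\mathbb{1}\{j = l\}\mathbb{1}\{i = k\}\sigma(M)_{ij} - \mathbb{1}\{j = l\}\sigma(M)_{ij}\sigma(M)_{kj}\right]\nabla_A L(A, B)_{kl} \\
&= -\sum_{k=1}^m \left[\mathbb{1}\{i = k\}\sigma(M)_{ij} - \sigma(M)_{ij}\sigma(M)_{kj}\right]\nabla_A L(A, B)_{kj}.
\end{aligned}
\]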

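Therefore,

\[
\begin{aligned}
\frac{\partial \varphi_{ij}(0)}{\partial \alpha}
&= -\sum_{k=1}^m \left[\mathbb{1}\{i = k\}\sigma(A)_{ij} - \sigma(A)_{ij}\sigma(A)_{kj}\right]\nabla_A L(A, B)_{kj} \\
&= -\sum_{k=1}^m \frac{\partial \sigma(A)_{ij}}{\partial A_{kj}}\nabla_A L(A, B)_{kj} \\
&= -\frac{\partial \sigma(A)_{ij}}{\partial A_{*j}}\nabla_A L(A, B)_{*j} \in \mathbb{R}.
\end{aligned}
\]

Recalling the definition of $J_k := \frac{\partial \sigma(A)_{*k}}{\partial A_{*k}} \in \mathbb{R}^{m \times m}$,

\[
\frac{\partial \varphi_{*j}(0)}{\partial \alpha} = -\frac{\partial \sigma(A)_{*j}}{\partial A_{*j}}\nabla_A L(A, B)_{*j} = -J_j\nabla_A L(A, B)_{*j} \in \mathbb{R}^{m \times 1}.
\]

Rearranging it into the matrix form,

\[
\frac{\partial \varphi(0)}{\partial \alpha} = -\begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix} \in \mathbb{R}^{m \times m}.
\]

Putting the above equations together,

\[
\begin{aligned}
\sigma(A - \alpha\nabla_A L(A, B))
&= \varphi(0) + \alpha\frac{\partial \varphi(0)}{\partial \alpha} + O(\alpha^2) \\
&= \sigma(A) - \alpha\begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2).
\end{aligned}
\]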

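Thus,

\[
\begin{aligned}
\left[I_m - \gamma\sigma(A - \alpha\nabla_A L(A, B))\right]^{-1}
&= \left[I_m - \gamma\left[\sigma(A) - \alpha\begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2)\right]\right]^{-1} \\
&= \left[I_m - \gamma\sigma(A) + \gamma\alpha\begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2)\right]^{-1} \\
&= \left[U + \alpha\gamma\begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2)\right]^{-1}.
\end{aligned}
\]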

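By setting $M = \begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix}$ and $\varphi(\alpha) = [U + \alpha\gamma M + O(\alpha^2)]^{-1}$ and by using $\frac{\partial M^{-1}}{\partial a} = -M^{-1}\frac{\partial M}{\partial a}M^{-1}$,

\[
\begin{aligned}
\left[I - \gamma\sigma(A - \alpha\nabla_A L(A, B))\right]^{-1}
&= [U + \alpha\gamma M + O(\alpha^2)]^{-1} \\
&= \varphi(\alpha) \\
&= \varphi(0) + \frac{\partial \varphi(0)}{\partial \alpha}\alpha + O(\alpha^2) \\
&= U^{-1} - \alpha\gamma U^{-1}MU^{-1} + 2\alpha O(\alpha) + O(\alpha^2) \\
&= U^{-1} - \alpha\gamma U^{-1}MU^{-1} + O(\alpha^2).
\end{aligned}
\]

Summarizing the above,

\[
\left[I_m - \gamma\sigma(A - \alpha\nabla_A L(A, B))\right]^{-1}
= U^{-1} - \alpha\gamma U^{-1}\begin{bmatrix} J_1\nabla_A L(A, B)_{*1} & \cdots & J_m\nabla_A L(A, B)_{*m} \end{bmatrix}U^{-1} + O(\alpha^2). \tag{25}
\]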
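The $O(\alpha^2)$ remainder in equation (25) can be observed numerically: shrinking $\alpha$ by a factor of 10 should shrink the approximation error by roughly a factor of 100. The sketch below is only an illustration. It assumes the column-wise softmax for $\sigma$ and uses an arbitrary matrix `G` in place of $\nabla_A L(A, B)$, since the expansion does not depend on where the update direction comes from. The function names are illustrative.

```python
import numpy as np

def col_softmax(A):
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def exact_inverse(A, G, gamma, alpha):
    """[I - gamma * sigma(A - alpha * G)]^{-1}, computed directly."""
    m = A.shape[0]
    return np.linalg.inv(np.eye(m) - gamma * col_softmax(A - alpha * G))

def first_order_inverse(A, G, gamma, alpha):
    """U^{-1} - alpha * gamma * U^{-1} [J_1 G_{*1} ... J_m G_{*m}] U^{-1}, as in equation (25)."""
    m = A.shape[0]
    S = col_softmax(A)
    U_inv = np.linalg.inv(np.eye(m) - gamma * S)
    M = np.column_stack([
        (np.diag(S[:, k]) - np.outer(S[:, k], S[:, k])) @ G[:, k]
        for k in range(m)
    ])
    return U_inv - alpha * gamma * U_inv @ M @ U_inv

def expansion_error(A, G, gamma, alpha):
    """Frobenius norm of the remainder term."""
    return np.linalg.norm(exact_inverse(A, G, gamma, alpha)
                          - first_order_inverse(A, G, gamma, alpha))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
G = rng.standard_normal((4, 4))   # stand-in for the gradient with respect to A
for alpha in (1e-2, 1e-3, 1e-4):
    print(alpha, expansion_error(A, G, 0.8, alpha))
```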

A.1.5 PUTTING RESULTS TOGETHER FOR INDUCED DYNAMICS

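We now consider the dynamics of $Z_t := B_tU_t^{-1}$ in $\mathbb{R}^{m_y \times m}$ that is induced by the gradient dynamics of $(A_t, B_t)$:

\[
\frac{d}{dt}A_t = -\frac{\partial L}{\partial A}(A_t, B_t), \qquad \frac{d}{dt}B_t = -\frac{\partial L}{\partial B}(A_t, B_t), \qquad \forall t \ge 0.
\]

Continuing the previous subsection, we first consider the dynamics discretized by the Euler method:

\[
Z_{t+1} := B_{t+1}U_{t+1}^{-1} = [B_t - \alpha\nabla_B L(A_t, B_t)][I_m - \gamma\sigma(A_t - \alpha\nabla_A L(A_t, B_t))]^{-1},
\]

where $\alpha > 0$. Then, substituting (25) into the right-hand side of this equation,

\[
\begin{aligned}
Z_{t+1}
&= B_{t+1}\left[U_t^{-1} - \alpha\gamma U_t^{-1}\begin{bmatrix} J_{1,t}\nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t}\nabla_A L(A_t, B_t)_{*m} \end{bmatrix}U_t^{-1} + O(\alpha^2)\right] \\
&= B_{t+1}U_t^{-1} - \alpha\gamma B_{t+1}U_t^{-1}\begin{bmatrix} J_{1,t}\nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t}\nabla_A L(A_t, B_t)_{*m} \end{bmatrix}U_t^{-1} + O(\alpha^2).
\end{aligned}
\]

Using $B_{t+1} = [B_t - \alpha\nabla_B L(A_t, B_t)]$, we have

\[
\begin{aligned}
&\alpha\gamma B_{t+1}U_t^{-1}\begin{bmatrix} J_{1,t}\nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t}\nabla_A L(A_t, B_t)_{*m} \end{bmatrix}U_t^{-1} \\
&\quad = \alpha\gamma B_tU_t^{-1}\begin{bmatrix} J_{1,t}\nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t}\nabla_A L(A_t, B_t)_{*m} \end{bmatrix}U_t^{-1} + O(\alpha^2) \\
&\quad = \alpha\gamma Z_t\begin{bmatrix} J_{1,t}\nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t}\nabla_A L(A_t, B_t)_{*m} \end{bmatrix}U_t^{-1} + O(\alpha^2).
\end{aligned}
\]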

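Since $(U^{-\top})_{*k} = ((U^{-1})^\top)_{*k} = ((U^{-1})_{k*})^\top$, $U^{-1} = \begin{bmatrix} (U^{-1})_{1*} \\ \vdots \\ (U^{-1})_{m*} \end{bmatrix}$, and $\nabla_A L(A, B)_{*k} = \left(\frac{\partial L(A, B)}{\partial A}\right)_{*k} = \gamma J_k^\top (BU^{-1})^\top \nabla L_0(BU^{-1})(U^{-\top})_{*k}$ from (24), we have that

\[
\begin{aligned}
&Z_t\begin{bmatrix} J_{1,t}\nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t}\nabla_A L(A_t, B_t)_{*m} \end{bmatrix}U_t^{-1} \\
&\quad = \gamma Z_t\begin{bmatrix} J_{1,t}J_{1,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*1} & \cdots & J_{m,t}J_{m,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*m} \end{bmatrix}U_t^{-1} \\
&\quad = \gamma Z_t\begin{bmatrix} J_{1,t}J_{1,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*1} & \cdots & J_{m,t}J_{m,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*m} \end{bmatrix}
\begin{bmatrix} (U_t^{-1})_{1*} \\ \vdots \\ (U_t^{-1})_{m*} \end{bmatrix} \\
&\quad = \gamma\sum_{k=1}^m Z_tJ_{k,t}J_{k,t}^\top Z_t^\top \nabla L_0(Z_t)((U_t^{-1})_{k*})^\top (U_t^{-1})_{k*} \\
&\quad = \gamma\sum_{k=1}^m Z_tJ_{k,t}J_{k,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*k}(U_t^{-1})_{k*},
\end{aligned}
\]

where the factor $\gamma$ from (24) is carried through the last two lines so that, together with the factor $\alpha\gamma$ above, it gives the coefficient $\alpha\gamma^2$ used below.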

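On the other hand, using $B_{t+1} = [B_t - \alpha\nabla_B L(A_t, B_t)]$ and $\nabla_B L(A, B) := \frac{\partial L(A, B)}{\partial B} = \left(\sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^\top \phi(x_i)^\top\right)U^{-\top} = \nabla L_0(BU^{-1})U^{-\top}$, we have $B_{t+1}U_t^{-1} = Z_t - \alpha\nabla L_0(Z_t)U_t^{-\top}U_t^{-1}$. Summarizing these equations by noticing $U^{-\top}U^{-1} = \sum_{k=1}^m (U^{-\top})_{*k}(U^{-1})_{k*}$ yields that

\[
\begin{aligned}
Z_{t+1}
&= Z_t - \alpha\nabla L_0(Z_t)\sum_{k=1}^m (U_t^{-\top})_{*k}(U_t^{-1})_{k*} - \alpha\gamma^2\sum_{k=1}^m Z_tJ_{k,t}J_{k,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*k}(U_t^{-1})_{k*} + O(\alpha^2) \\
&= Z_t - \alpha\left(\sum_{k=1}^m \nabla L_0(Z_t)(U_t^{-\top})_{*k}(U_t^{-1})_{k*} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top \nabla L_0(Z_t)(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\right) + O(\alpha^2) \\
&= Z_t - \alpha\left(\sum_{k=1}^m (I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top)\nabla L_0(Z_t)(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\right) + O(\alpha^2).
\end{aligned}
\]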

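By vectorizing both sides,

$$\begin{aligned}
\operatorname{vec}(Z_{t+1}) &= \operatorname{vec}(Z_t) - \alpha\left(\sum_{k=1}^m \operatorname{vec}\!\left((I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top)\nabla L_0(Z_t)(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\right)\right) + O(\alpha^2) \\
&= \operatorname{vec}(Z_t) - \alpha\left(\sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\otimes(I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top)\right]\right)\operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha^2).
\end{aligned}$$

By defining

$$D_t := \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\otimes(I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top)\right],$$

we have

$$\operatorname{vec}(Z_{t+1}) = \operatorname{vec}(Z_t) - \alpha D_t\operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha^2).$$

This implies that

$$\frac{\operatorname{vec}(Z_{t+1}) - \operatorname{vec}(Z_t)}{\alpha} = -D_t\operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha),$$

where α > 0. By recalling the definition of the Euler method and defining $Z(t) = Z_t$, we can rewrite this as

$$\frac{\operatorname{vec}(Z(t+\alpha)) - \operatorname{vec}(Z(t))}{\alpha} = -D_t\operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha).$$

By taking the limit for α → 0 and going back to continuous-time dynamics, this implies that

$$\frac{d}{dt}\operatorname{vec}(Z_t) = -D_t\operatorname{vec}(\nabla L_0(Z_t)). \tag{26}$$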
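The vectorization step above relies on the standard identity vec(PGN) = (N^T ⊗ P) vec(G) for column-stacking vec. The following is a minimal numerical sketch of that identity; the matrices `P`, `G`, and `r` are random stand-ins introduced here for illustration (they are not the paper's quantities), and NumPy is an assumed tool.

```python
import numpy as np

def vec(M):
    # Column-stacking vectorization (Fortran order).
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
m_y, m = 2, 4

# Random stand-ins (assumptions for illustration only):
P = rng.standard_normal((m_y, m_y))  # plays the role of I + gamma^2 Z J_k J_k^T Z^T
G = rng.standard_normal((m_y, m))    # plays the role of grad L0(Z)
r = rng.standard_normal(m)           # plays the role of row k of U^{-1}
N = np.outer(r, r)                   # (U^{-T})_{*k} (U^{-1})_{k*}, a symmetric rank-one matrix

lhs = vec(P @ G @ N)
rhs = np.kron(N, P) @ vec(G)         # N is symmetric, so N^T ⊗ P equals N ⊗ P
identity_error = np.max(np.abs(lhs - rhs))
print(identity_error)
```

Because the rank-one factor is symmetric, the transpose in the general identity does not appear in the definition of D_t.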

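Here, we note that the complex interaction over m due to the nonlinearity is factorized out into the matrix $D_t$. Furthermore, the interaction within the matrix $D_t$ has more structure when compared with that in the gradients themselves from (24). For example, unlike the gradients, the interaction over m even within $D_t$ can be factorized out in the case of $m_y = 1$ as:

$$\begin{aligned}
D &= \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes(I_{m_y} + \gamma^2 ZJ_kJ_k^\top Z^\top)\right] \\
&= \sum_{k=1}^m (1 + \gamma^2 ZJ_kJ_k^\top Z^\top)\,(U^{-\top})_{*k}(U^{-1})_{k*} \\
&= U^{-\top}\operatorname{diag}\!\left(\begin{bmatrix}1 + \gamma^2 ZJ_1J_1^\top Z^\top \\ \vdots \\ 1 + \gamma^2 ZJ_mJ_m^\top Z^\top\end{bmatrix}\right)U^{-1} \\
&= U^{-\top}\left(I_m + \operatorname{diag}\!\left(\begin{bmatrix}\gamma^2 ZJ_1J_1^\top Z^\top \\ \vdots \\ \gamma^2 ZJ_mJ_m^\top Z^\top\end{bmatrix}\right)\right)U^{-1}.
\end{aligned}$$

Although we do not assume $m_y = 1$, this illustrates the additional structure well.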


A.1.6 ANALYSIS OF THE MATRIX Dt

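From the definition of $D_t$, we have that

$$\begin{aligned}
D_t &= \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\otimes\left(I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right)\right] \\
&= \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right] \\
&= \left[\sum_{k=1}^m (U^{-\top})_{*k}(U^{-1})_{k*}\otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right] \\
&= \left[U^{-\top}U^{-1}\otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right].
\end{aligned} \tag{27}$$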

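Since $U^{-\top}U^{-1}$ is positive definite, $I_{m_y}$ is positive definite, and a Kronecker product of two positive definite matrices is positive definite (since the eigenvalues of a Kronecker product are the products of the eigenvalues of the two matrices), we have

$$\left[U^{-\top}U^{-1}\otimes I_{m_y}\right] \succ 0. \tag{28}$$

Since $(U^{-\top})_{*k}(U^{-1})_{k*}$ is positive semidefinite, $\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top$ is positive semidefinite, and a Kronecker product of two positive semidefinite matrices is positive semidefinite (since the eigenvalues of a Kronecker product are the products of the eigenvalues of the two matrices), we have

$$\left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right] \succeq 0.$$

Since a sum of positive semidefinite matrices is positive semidefinite (from the definition of positive semidefiniteness: $x^\top M_kx \ge 0 \Rightarrow x^\top(\sum_k M_k)x = \sum_k x^\top M_kx \ge 0$),

$$\sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right] \succeq 0. \tag{29}$$

Since a sum of a positive definite matrix and a positive semidefinite matrix is positive definite (from the definitions of positive definiteness and positive semidefiniteness: $(x^\top M_1x > 0 \wedge x^\top M_2x \ge 0) \Rightarrow x^\top(M_1 + M_2)x = x^\top M_1x + x^\top M_2x > 0$),

$$D_t = \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\left(I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right)\right] = \left[U^{-\top}U^{-1}\otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes\gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top\right] \succ 0.$$

Therefore, $D_t$ is a positive definite matrix for any t and hence

$$\lambda_T := \inf_{t\in[0,T]}\lambda_{\min}(D_t) > 0. \tag{30}$$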

A.1.7 CONVERGENCE RATE VIA POLYAK-ŁOJASIEWICZ INEQUALITY AND NORM BOUNDS

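Let R ∈ (0,∞] and T > 0 be arbitrary. By taking the derivative of $L_0(Z_t) - L^*_{0,R}$ with respect to time t with $Z_t := B_tU_t^{-1}$,

$$\begin{aligned}
\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) &= \sum_{i=1}^{m_y}\sum_{j=1}^m \left(\frac{dL_0}{dW}(Z_t)\right)_{ij}\left(\frac{d}{dt}(Z_t)\right)_{ij} - \frac{d}{dt}L^*_{0,R} \\
&= \sum_{i=1}^{m_y}\sum_{j=1}^m \left(\frac{dL_0}{dW}(Z_t)\right)_{ij}\left(\frac{d}{dt}(Z_t)\right)_{ij},
\end{aligned}$$

where we used the chain rule and the fact that $\frac{d}{dt}L^*_{0,R} = 0$. By using the vectorization notation with $\nabla L_0(Z_t) = \frac{dL_0}{dW}(Z_t)$,

$$\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) = \operatorname{vec}[\nabla L_0(Z_t)]^\top\operatorname{vec}\!\left[\frac{d}{dt}(Z_t)\right].$$

By using (26) for $\operatorname{vec}\!\left[\frac{d}{dt}(Z_t)\right]$,

$$\begin{aligned}
\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) &= -\operatorname{vec}[\nabla L_0(Z_t)]^\top D_t\operatorname{vec}[\nabla L_0(Z_t)] \\
&\le -\lambda_{\min}(D_t)\,\|\operatorname{vec}[\nabla L_0(Z_t)]\|_2^2 = -\lambda_{\min}(D_t)\,\|\nabla L_0(Z_t)\|_F^2.
\end{aligned}$$

Using the condition that $L_0$ satisfies the Polyak-Łojasiewicz inequality with radius R, if $\|Z_t\|_1 < R$, then we have that for all t ∈ [0, T],

$$\begin{aligned}
\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) &\le -2\kappa\lambda_{\min}(D_t)\left(L_0(Z_t) - L^*_{0,R}\right) \\
&\le -2\kappa\lambda_T\left(L_0(Z_t) - L^*_{0,R}\right).
\end{aligned}$$

By solving the differential equation, this implies that if $\|Z_t\|_1 < R$,

$$L_0(Z_T) - L^*_{0,R} \le \left(L_0(Z_0) - L^*_{0,R}\right)e^{-2\kappa\lambda_TT}.$$

Since $L(A_t,B_t) = L_0(Z_t)$, if $\|Z_t\|_1 < R$,

$$L(A_T,B_T) \le L^*_{0,R} + \left(L(A_0,B_0) - L^*_{0,R}\right)e^{-2\kappa\lambda_TT}. \tag{31}$$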
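The step "by solving the differential equation" can be spelled out with an integrating factor. As a sketch, write $g(t) := L_0(Z_t) - L^*_{0,R}$ and $c := 2\kappa\lambda_T$ (shorthand introduced here only for this remark), and suppose the differential inequality $\frac{d}{dt}g(t) \le -c\,g(t)$ holds on [0, T]. Then

$$\frac{d}{dt}\left(e^{ct}g(t)\right) = e^{ct}\left(\frac{d}{dt}g(t) + c\,g(t)\right) \le 0 \quad\Longrightarrow\quad e^{cT}g(T) \le g(0) \quad\Longrightarrow\quad g(T) \le g(0)\,e^{-cT},$$

which is the bound stated before (31).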

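We now complete the proof of the first part of the desired statement by showing that $\|B\|_1 < (1-\gamma)R$ implies $\|Z_t\|_1 < R$. With $Z = BU^{-1}$, since any induced operator norm is a submultiplicative matrix norm,

$$\|Z\|_1 = \|B(I_m - \gamma\sigma(A))^{-1}\|_1 \le \|B\|_1\,\|(I_m - \gamma\sigma(A))^{-1}\|_1.$$

We can then rewrite

$$\|(I_m - \gamma\sigma(A))^{-1}\|_1 = \|((I_m - \gamma\sigma(A))^{-1})^\top\|_\infty = \|(I_m - \gamma\sigma(A)^\top)^{-1}\|_\infty.$$

Here, the matrix $I_m - \gamma\sigma(A)^\top$ is strictly diagonally dominant: i.e., $|I_m - \gamma\sigma(A)^\top|_{ii} > \sum_{j\ne i}|I_m - \gamma\sigma(A)^\top|_{ij}$ for any i. This can be shown as follows: for any j,

$$\begin{aligned}
1 > \gamma &\iff 1 > \gamma\sum_i \sigma(A)_{ij} \\
&\iff 1 > \gamma\sigma(A)_{jj} + \sum_{i\ne j}\gamma\sigma(A)_{ij} \\
&\iff 1 - \gamma\sigma(A)_{jj} > \sum_{i\ne j}\gamma\sigma(A)_{ij} \\
&\iff |I_m - \gamma\sigma(A)|_{jj} > \sum_{i\ne j}|-\gamma\sigma(A)|_{ij} \\
&\iff |I_m - \gamma\sigma(A)|_{jj} > \sum_{i\ne j}|I_m - \gamma\sigma(A)|_{ij} \\
&\iff |I_m - \gamma\sigma(A)^\top|_{jj} > \sum_{i\ne j}|I_m - \gamma\sigma(A)^\top|_{ji}.
\end{aligned}$$

This calculation also shows that $|I_m - \gamma\sigma(A)|_{jj} - \sum_{i\ne j}|I_m - \gamma\sigma(A)|_{ij} = 1 - \gamma$ for all j. Thus, using the Ahlberg–Nilson–Varah bound for the strictly diagonally dominant matrix (Ahlberg & Nilson, 1963; Varah, 1975; Moraca, 2008), we have

$$\|(I_m - \gamma\sigma(A)^\top)^{-1}\|_\infty \le \frac{1}{\min_j\left(|I_m - \gamma\sigma(A)|_{jj} - \sum_{i\ne j}|I_m - \gamma\sigma(A)|_{ij}\right)} = \frac{1}{1-\gamma}.$$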


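By taking transpose,

$$\|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}.$$

Summarizing above,

$$\|Z\|_1 = \|B(I_m - \gamma\sigma(A))^{-1}\|_1 \le \|B\|_1\frac{1}{1-\gamma}.$$

Therefore, if $\|B\|_1 < R(1-\gamma)$, then $\|Z\|_1 = \|B(I_m - \gamma\sigma(A))^{-1}\|_1 \le \|B\|_1\frac{1}{1-\gamma} < R$, as desired.

Combining this with (31) implies that if $\|B\|_1 < R(1-\gamma)$,

$$L(A_T,B_T) \le L^*_{0,R} + \left(L(A_0,B_0) - L^*_{0,R}\right)e^{-2\kappa\lambda_TT}.$$

Recall that $L^*_{0,R} = \inf_{W:\|W\|_1<R}L_0(W)$ and $L^*_R = \inf_{A\in\mathbb{R}^{m\times m},\,B\in\mathcal{B}_R}L(A,B)$ where $\mathcal{B}_R = \{B\in\mathbb{R}^{m_y\times m} \mid \|B\|_1 < (1-\gamma)R\}$. Here, $B\in\mathcal{B}_R$ implies that $\|Z\|_1 = \|BU^{-1}\|_1 \le \|B\|_1\|U^{-1}\|_1 < (1-\gamma)R\,\|U^{-1}\|_1 \le R$, using the above upper bound $\|U^{-1}\|_1 = \|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}$. Since $L(A,B) = L_0(Z)$ with $Z = BU^{-1}$, this implies that $L^*_{0,R} \le L^*_R$ and thus

$$L(A_T,B_T) \le L^*_{0,R} + \left(L(A_0,B_0) - L^*_{0,R}\right)e^{-2\kappa\lambda_TT} \le L^*_R + \left(L(A_0,B_0) - L^*_{0,R}\right)e^{-2\kappa\lambda_TT}.$$

This completes the first part of the desired statement of Theorem 1.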
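The norm bound above can be checked numerically. The sketch below assumes that σ is a column-wise softmax, which is consistent with the properties used in this proof (nonnegative entries, columns summing to one) but is stated here as an assumption; the helper names `column_softmax` and `one_norm`, the sizes, and the value R = 5 are illustrative choices, not part of the paper.

```python
import numpy as np

def column_softmax(A):
    # Assumed form of sigma: softmax over each column, so every column sums to one.
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def one_norm(M):
    # Induced 1-norm: maximum absolute column sum.
    return np.abs(M).sum(axis=0).max()

rng = np.random.default_rng(0)
m, m_y, gamma, R = 6, 3, 0.8, 5.0

A = rng.standard_normal((m, m))
U = np.eye(m) - gamma * column_softmax(A)
U_inv = np.linalg.inv(U)

B = rng.standard_normal((m_y, m))
B *= 0.99 * (1.0 - gamma) * R / one_norm(B)  # rescale so that ||B||_1 < (1 - gamma) R

Z = B @ U_inv
inv_norm = one_norm(U_inv)   # should not exceed 1 / (1 - gamma)
z_norm = one_norm(Z)         # should stay below R
print(inv_norm, 1.0 / (1.0 - gamma), z_norm)
```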

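The remaining task is to lower bound $\lambda_T$, which is completed as follows: for any (A,B),

$$\begin{aligned}
\lambda_{\min}(D) &= \min_{v:\|v\|=1} v^\top Dv \\
&= \min_{v:\|v\|=1} v^\top\left(\sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes(I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top)\right]\right)v \\
&\ge \min_{v:\|v\|=1} v^\top\left[U^{-\top}U^{-1}\otimes I_{m_y}\right]v \\
&= \lambda_{\min}\!\left(\left[U^{-\top}U^{-1}\otimes I_{m_y}\right]\right) \\
&= \lambda_{\min}(U^{-\top}U^{-1}) \\
&= \sigma_{\min}^2(U^{-1}) = \frac{1}{\|U\|_2^2} \ge \frac{1}{m\|U\|_1^2},
\end{aligned} \tag{32}$$

where the third line follows from (27)–(29), the fifth line follows from the property of the Kronecker product (the eigenvalues of a Kronecker product of two matrices are the products of the eigenvalues of the two matrices), and the last inequality follows from the relation between the spectral norm and the norm $\|\cdot\|_1$. We now compute $\|U\|_1$ as: for any (A,B),

$$\begin{aligned}
\|U\|_1 = \|I_m - \gamma\sigma(A)\|_1 &= \max_j \sum_i |(I_m - \gamma\sigma(A))_{ij}| \\
&= \max_j \sum_i |(I_m)_{ij} - \gamma\sigma(A)_{ij}| \\
&= \max_j \left(|(I_m)_{jj} - \gamma\sigma(A)_{jj}| + \sum_{i\ne j}|(I_m)_{ij} - \gamma\sigma(A)_{ij}|\right) \\
&= \max_j \left(|1 - \gamma\sigma(A)_{jj}| + \sum_{i\ne j}|-\gamma\sigma(A)_{ij}|\right) \\
&= \max_j \left(1 - \gamma\sigma(A)_{jj} + \sum_{i\ne j}\gamma\sigma(A)_{ij}\right) \\
&= \max_j \left(1 + \gamma\left(\sum_{i\ne j}\sigma(A)_{ij} - \sigma(A)_{jj}\right)\right) \\
&\le 1 + \gamma.
\end{aligned}$$

By substituting this into (32), we have that for any (A,B) (and hence for any t),

$$\lambda_{\min}(D) \ge \frac{1}{m(1+\gamma)^2}. \tag{33}$$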

This completes the proof for both the first and second parts of the statement of Theorem 1.
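As a sanity check on (33), the following sketch assembles D from its definition and compares its smallest eigenvalue with 1/(m(1+γ)²). It assumes σ is a column-wise softmax, and it replaces each $J_k$ by a random matrix: the bound only uses that $J_kJ_k^\top$ is positive semidefinite, so these stand-ins are a hypothetical simplification rather than the paper's Jacobians.

```python
import numpy as np

def column_softmax(A):
    # Assumed form of sigma: softmax over each column.
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
m, m_y, gamma = 5, 2, 0.8

A = rng.standard_normal((m, m))
B = rng.standard_normal((m_y, m))
U_inv = np.linalg.inv(np.eye(m) - gamma * column_softmax(A))
Z = B @ U_inv

D = np.zeros((m * m_y, m * m_y))
for k in range(m):
    J_k = rng.standard_normal((m, m))   # hypothetical stand-in for J_k
    row = U_inv[k, :]                   # (U^{-1})_{k*}
    left = np.outer(row, row)           # (U^{-T})_{*k} (U^{-1})_{k*}
    right = np.eye(m_y) + gamma**2 * Z @ J_k @ J_k.T @ Z.T
    D += np.kron(left, right)

lam_min = np.linalg.eigvalsh(D).min()   # D is symmetric by construction
lower_bound = 1.0 / (m * (1.0 + gamma) ** 2)
print(lam_min, lower_bound)
```

In such runs the observed smallest eigenvalue is typically well above the bound, in line with the later remark that the initialization-independent rate is pessimistic.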

A.2 PROOF OF THEOREM 2

We first show that with $\delta_t > 0$ sufficiently small, we have $G_t \succ 0$. Recall that

$$G_t = U_t\left(S_t^{-1} - \delta_tF_t\right)U_t^\top = U_tS_t^{-1}U_t^\top - \delta_tU_tF_tU_t^\top.$$

Thus, with $\delta_t > 0$ sufficiently small, for any v ≠ 0,

$$v^\top G_tv = v^\top U_tS_t^{-1}U_t^\top v - \delta_tv^\top U_tF_tU_t^\top v,$$

which is dominated by the first term $v^\top U_tS_t^{-1}U_t^\top v$ if the matrix $U_tS_t^{-1}U_t^\top$ is positive definite. Since $S_t := I_m + \gamma^2\operatorname{diag}(v_t^S)$ with $v_t^S\in\mathbb{R}^m$ and $(v_t^S)_k := \|J_{k,t}^\top(B_tU_t^{-1})^\top\|_2^2$ for k = 1, 2, . . . , m, the matrix $U_tS_t^{-1}U_t^\top$ is positive definite. Thus, with $\delta_t > 0$ sufficiently small, $v^\top G_tv$ is dominated by the first term, which is positive (since $U_tS_t^{-1}U_t^\top$ is positive definite), and thus we have $G_t \succ 0$.

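Then we observe that the output of $\operatorname{argmin}_{v:\,\|v\|_{G_t}\le\delta_t\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\|_{G_t}}L_0^t(v)$ is the set of solutions of the following constrained optimization problem:

$$\underset{v}{\text{minimize}}\ L_0^t(v) \quad\text{s.t.}\quad \|v\|_{G_t}^2 - \delta_t^2\left\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right\|_{G_t}^2 \le 0.$$

Since this optimization problem is convex, one of the sufficient conditions for global optimality is the KKT condition with a multiplier µ ∈ R:

$$\begin{aligned}
&\nabla L_0^t(v) + 2\mu G_tv = 0, \\
&\mu \ge 0, \\
&\mu\left(\|v\|_{G_t}^2 - \delta_t^2\left\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right\|_{G_t}^2\right) = 0.
\end{aligned}$$

Therefore, the desired statement is obtained if the above KKT condition is satisfied by $v = \delta_t\left(\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right)$ with some multiplier µ. The rest of this proof shows that the KKT condition is satisfied by setting $v = \delta_t\left(\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right)$ and $\mu = \frac{1}{2\delta_t}$. With this choice, the last two conditions of the KKT condition hold, since

$$\mu = \frac{1}{2\delta_t} \ge 0,$$

and

$$\|v\|_{G_t}^2 - \delta_t^2\left\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right\|_{G_t}^2 = \delta_t^2\left\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right\|_{G_t}^2 - \delta_t^2\left\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right\|_{G_t}^2 = 0.$$

The remaining task is to show that $\nabla L_0^t(v) + 2\mu G_tv = 0$ with $v = \delta_t\left(\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right)$ and $\mu = \frac{1}{2\delta_t}$. From the definition of $L_0^t$,

$$\nabla L_0^t(v) + 2\mu G_tv = \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) + \nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1}))\,v + 2\mu G_tv. \tag{34}$$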

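We now compute $\nabla L_0^{\mathrm{vec}}$ and $\nabla^2L_0^{\mathrm{vec}}$. Since $\nabla L_0(W) := \frac{\partial L_0(W)}{\partial W} = \sum_{i=1}^n\left(\frac{\partial\ell(\hat y_i,y_i)}{\partial\hat y_i}\right)^\top\phi(x_i)^\top$,

$$\operatorname{vec}(\nabla L_0(W)) = \sum_{i=1}^n \operatorname{vec}\!\left(I_{m_y}\left(\frac{\partial\ell(\hat y_i,y_i)}{\partial\hat y_i}\right)^\top\phi(x_i)^\top\right) = \sum_{i=1}^n [\phi(x_i)\otimes I_{m_y}]\left(\frac{\partial\ell(\hat y_i,y_i)}{\partial\hat y_i}\right)^\top,$$

where

$$\hat y_i := W\phi(x_i) = [\phi(x_i)^\top\otimes I_{m_y}]\operatorname{vec}[W].$$

Therefore,

$$\nabla L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = \operatorname{vec}(\nabla L_0(W)) = \sum_{i=1}^n [\phi(x_i)\otimes I_{m_y}]\left(\frac{\partial\ell(\hat y_i,y_i)}{\partial\hat y_i}\right)^\top.$$

For the Hessian,

$$\begin{aligned}
\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W)) &= \frac{\partial}{\partial\operatorname{vec}(W)}\nabla L_0^{\mathrm{vec}}(\operatorname{vec}(W)) \\
&= \sum_{i=1}^n [\phi(x_i)\otimes I_{m_y}]\left(\frac{\partial}{\partial\hat y_i}\left(\frac{\partial\ell(\hat y_i,y_i)}{\partial\hat y_i}\right)^\top\right)\frac{\partial\hat y_i}{\partial\operatorname{vec}(W)} \\
&= \sum_{i=1}^n [\phi(x_i)\otimes I_{m_y}]\left(\frac{\partial}{\partial\hat y_i}\left(\frac{\partial\ell(\hat y_i,y_i)}{\partial\hat y_i}\right)^\top\right)[\phi(x_i)^\top\otimes I_{m_y}].
\end{aligned}$$

By defining $\ell_i(z) = \ell(z,y_i)$ and $\nabla^2\ell_i(z) = \frac{\partial}{\partial z}\left(\frac{\partial\ell_i(z)}{\partial z}\right)^\top$,

$$\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = \sum_{i=1}^n [\phi(x_i)\otimes I_{m_y}]\,\nabla^2\ell_i(W\phi(x_i))\,[\phi(x_i)^\top\otimes I_{m_y}]. \tag{35}$$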

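Since we have that

$$\begin{aligned}
(I_m - \gamma\sigma(A))\left(\sum_{k=0}^l \gamma^k\sigma(A)^k\right) &= I_m - \gamma\sigma(A) + \gamma\sigma(A) - (\gamma\sigma(A))^2 + (\gamma\sigma(A))^2 - (\gamma\sigma(A))^3 + \cdots - (\gamma\sigma(A))^{l+1} \\
&= I_m - (\gamma\sigma(A))^{l+1},
\end{aligned}$$

we can write:

$$\begin{aligned}
(I_m - \gamma\sigma(A))\left(\lim_{l\to\infty}z^{(l)}(x,A)\right) &= \lim_{l\to\infty}(I_m - \gamma\sigma(A))\left(\sum_{k=0}^l \gamma^k\sigma(A)^k\phi(x)\right) \\
&= \left(I_m - \lim_{l\to\infty}(\gamma\sigma(A))^{l+1}\right)\phi(x) \\
&= \phi(x),
\end{aligned}$$

where we used the fact that $\gamma\sigma(A)_{ij} \ge 0$, $\|\sigma(A)\|_1 = \max_j\sum_i|\sigma(A)_{ij}| = 1$, and thus $\|\gamma\sigma(A)\|_1 < 1$ for any γ ∈ (0, 1). This shows that $\left(\lim_{l\to\infty}z^{(l)}(x,A)\right) = z^*(x,A) = U^{-1}\phi(x)$, from which we have $\phi(x_i) = Uz^*(x_i,A)$.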

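Substituting $\phi(x_i) = Uz^*(x_i,A)$ into (35),

$$\begin{aligned}
\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W)) &= \sum_{i=1}^n [Uz^*(x_i,A)\otimes I_{m_y}]\,\nabla^2\ell_i(W\phi(x_i))\,[z^*(x_i,A)^\top U^\top\otimes I_{m_y}] \\
&= \sum_{i=1}^n [U\otimes I_{m_y}][z^*(x_i,A)\otimes I_{m_y}]\,\nabla^2\ell_i(W\phi(x_i))\,[z^*(x_i,A)^\top\otimes I_{m_y}][U^\top\otimes I_{m_y}] \\
&= [U\otimes I_{m_y}]\left(\sum_{i=1}^n [z^*(x_i,A)\otimes I_{m_y}]\,\nabla^2\ell_i(W\phi(x_i))\,[z^*(x_i,A)^\top\otimes I_{m_y}]\right)[U^\top\otimes I_{m_y}].
\end{aligned}$$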


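In the case of $m_y = 1$, since $I_{m_y} = 1$, $\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W))$ is further simplified to:

$$\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = U\left(\sum_{i=1}^n \nabla^2\ell_i(W\phi(x_i))\,z^*(x_i,A)z^*(x_i,A)^\top\right)U^\top.$$

Therefore,

$$\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) = U_tF_tU_t^\top,$$

where

$$F_t = \sum_{i=1}^n \nabla^2\ell_i(B_tU_t^{-1}\phi(x_i))\,z^*(x_i,A_t)z^*(x_i,A_t)^\top.$$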

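By plugging $\mu = \frac{1}{2\delta_t}$ and $\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) = U_tF_tU_t^\top$ into (34),

$$\begin{aligned}
\nabla L_0^t(v) + 2\mu G_tv &= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) + U_tF_tU_t^\top v + \frac{1}{\delta_t}U_t\left(S_t^{-1} - \delta_tF_t\right)U_t^\top v \\
&= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) + \frac{1}{\delta_t}U_tS_t^{-1}U_t^\top v.
\end{aligned}$$

By using $v = \delta_t\left(\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right)$,

$$\nabla L_0^t(v) + 2\mu G_tv = \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) + U_tS_t^{-1}U_t^\top\left(\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right).$$

By plugging (26) into $\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})$ with $Z_t = B_tU_t^{-1}$,

$$\nabla L_0^t(v) + 2\mu G_tv = \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) - U_tS_t^{-1}U_t^\top D_t\operatorname{vec}(\nabla L_0(Z_t)).$$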

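Recall that

$$D_t = \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*}\otimes(I_{m_y} + \gamma^2 Z_tJ_{k,t}J_{k,t}^\top Z_t^\top)\right].$$

In the case of $m_y = 1$, the matrix $D_t$ can be simplified as:

$$\begin{aligned}
D &= \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*}\otimes(I_{m_y} + \gamma^2 ZJ_kJ_k^\top Z^\top)\right] \\
&= \sum_{k=1}^m (1 + \gamma^2 ZJ_kJ_k^\top Z^\top)\,(U^{-\top})_{*k}(U^{-1})_{k*} \\
&= U^{-\top}\operatorname{diag}\!\left(\begin{bmatrix}1 + \gamma^2 ZJ_1J_1^\top Z^\top \\ \vdots \\ 1 + \gamma^2 ZJ_mJ_m^\top Z^\top\end{bmatrix}\right)U^{-1} \\
&= U^{-\top}SU^{-1}.
\end{aligned}$$

Plugging this into the above equation for $D_t$,

$$\begin{aligned}
\nabla L_0^t(v) + 2\mu G_tv &= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) - U_tS_t^{-1}U_t^\top U_t^{-\top}S_tU_t^{-1}\operatorname{vec}(\nabla L_0(Z_t)) \\
&= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) - \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_tU_t^{-1})) = 0.
\end{aligned}$$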

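Therefore, the constrained optimization problem at time t is solved by

$$v = \delta_t\left(\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\right),$$

which implies that

$$\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1}) = \frac{1}{\delta_t}\operatorname{vec}(V_t), \qquad \operatorname{vec}(V_t) \in \underset{v:\,\|v\|_{G_t}\le\delta_t\|\frac{d}{dt}\operatorname{vec}(B_tU_t^{-1})\|_{G_t}}{\operatorname{argmin}}\ L_0^t(v).$$

By multiplying $\phi(x)^\top\otimes I_{m_y}$ to each side of the equation, we have

$$\frac{d}{dt}[\phi(x)^\top\otimes I_{m_y}]\operatorname{vec}(B_tU_t^{-1}) = \frac{d}{dt}B_t\left(\lim_{l\to\infty}z^{(l)}(x,A_t)\right),$$

$$\frac{1}{\delta_t}[\phi(x)^\top\otimes I_{m_y}]\operatorname{vec}(V_t) = \frac{1}{\delta_t}V_t\phi(x),$$

yielding that

$$\frac{d}{dt}B_t\left(\lim_{l\to\infty}z^{(l)}(x,A_t)\right) = \frac{1}{\delta_t}V_t\phi(x).$$

This proves the desired statement of Theorem 2.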

A.3 PROOF OF COROLLARY 1

The assumption rank(Φ) = min(n,m) implies that $\sigma_{\min}(\Phi) > 0$. Moreover, the square loss ℓ satisfies the assumption of the differentiability. Thus, Theorem 1 implies the statement of this corollary if $L_0$ with the square loss satisfies the Polyak-Łojasiewicz inequality for any $W\in\mathbb{R}^{m_y\times m}$ with parameter $\kappa = 2\sigma_{\min}(\Phi)^2$. This is to be shown in the rest of this proof. By setting $\varphi = L_0$ in Definition 1, we have $\|\nabla\varphi^{\mathrm{vec}}(\operatorname{vec}(q))\|_2^2 = \|\nabla L_0(W)\|_F^2$. With the square loss, we can write $L_0(W) = \sum_{i=1}^n\|W\phi(x_i) - y_i\|_2^2 = \|W\Phi - Y\|_F^2$ where $\Phi\in\mathbb{R}^{m\times n}$ and $Y\in\mathbb{R}^{m_y\times n}$ with $\Phi_{ji} = \phi(x_i)_j$ and $Y_{ji} = (y_i)_j$. Thus,

$$\nabla L_0(W) = 2(W\Phi - Y)\Phi^\top \in \mathbb{R}^{m_y\times m}.$$

We first consider the case of m ≤ n. In this case, we consider the vectorization $L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = L_0(W)$ and derive the gradient with respect to vec(W):

$$\nabla L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = 2\operatorname{vec}((W\Phi - Y)\Phi^\top) = 2[\Phi\otimes I_{m_y}]\operatorname{vec}(W\Phi - Y).$$

Then, the Hessian can be easily computed as

$$\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = 2[\Phi\otimes I_{m_y}][\Phi^\top\otimes I_{m_y}] = 2[\Phi\Phi^\top\otimes I_{m_y}],$$

where $I_{m_y}$ is the identity matrix of size $m_y$ by $m_y$. Since the singular values of a Kronecker product of two matrices are the products of the singular values of each matrix, we have

$$\nabla^2L_0^{\mathrm{vec}}(\operatorname{vec}(W)) \succeq 2\sigma_{\min}(\Phi)^2I_{m_ym},$$

where we used the fact that m ≤ n in this case. Since W is arbitrary, this implies that $L_0^{\mathrm{vec}}$ is strongly convex with parameter $2\sigma_{\min}(\Phi)^2 > 0$ in $\mathbb{R}^{m_y\times m}$. Since a strongly convex function with some parameter satisfies the Polyak-Łojasiewicz inequality with the same parameter (Karimi et al., 2016), this implies that $L_0^{\mathrm{vec}}$ (and hence $L_0$) satisfies the Polyak-Łojasiewicz inequality with parameter $2\sigma_{\min}(\Phi)^2 > 0$ in $\mathbb{R}^{m_y\times m}$ in the case of m ≤ n.

We now consider the remaining case of m > n. In this case, using the singular value decomposition of $\Phi = U\Sigma V^\top$,

$$\begin{aligned}
\frac{1}{2}\|\nabla L_0(W)\|_F^2 &= 2\|\Phi(W\Phi - Y)^\top\|_F^2 \\
&= 2\|U\Sigma V^\top(W\Phi - Y)^\top\|_F^2 \\
&= 2\|\Sigma V^\top(W\Phi - Y)^\top\|_F^2 \\
&\ge 2\sigma_{\min}(\Phi)^2\|V^\top(W\Phi - Y)^\top\|_F^2 \\
&= 2\sigma_{\min}(\Phi)^2L_0(W) \\
&\ge 2\sigma_{\min}(\Phi)^2\left(L_0(W) - L_0^{**}\right)
\end{aligned}$$

for any $L_0^{**} \ge 0$, where the first line uses $\|q\|_F^2 = \|q^\top\|_F^2$, the second line uses the singular value decomposition, and the third and fourth lines use the fact that U and V are orthonormal matrices. The fourth line uses the fact that m > n in this case. Therefore, since W is arbitrary, we have shown that $L_0$ satisfies the Polyak-Łojasiewicz inequality for any $W\in\mathbb{R}^{m_y\times m}$ with parameter $\kappa = 2\sigma_{\min}(\Phi)^2$ in both cases of m > n and m ≤ n.
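The inequality proved above, ½‖∇L0(W)‖²_F ≥ 2σ_min(Φ)²(L0(W) − L0*), can be checked on random instances in both regimes. The sketch below is illustrative only: the function name `pl_gap`, the Gaussian data, and the use of a least-squares solve to obtain the minimum value are assumptions made for this example.

```python
import numpy as np

def pl_gap(m, n, m_y=3, seed=0):
    """Return 0.5*||grad L0(W)||_F^2 - 2*sigma_min(Phi)^2*(L0(W) - min L0) on random data."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((m, n))   # features, one column per data point
    Y = rng.standard_normal((m_y, n))   # targets
    W = rng.standard_normal((m_y, m))   # arbitrary evaluation point

    def loss(W_):
        return np.sum((W_ @ Phi - Y) ** 2)

    grad = 2.0 * (W @ Phi - Y) @ Phi.T

    # Minimum of the square loss via least squares on Phi^T W^T = Y^T.
    W_star = np.linalg.lstsq(Phi.T, Y.T, rcond=None)[0].T

    # Smallest of the min(m, n) singular values of Phi.
    kappa = 2.0 * np.linalg.svd(Phi, compute_uv=False).min() ** 2

    return 0.5 * np.sum(grad ** 2) - kappa * (loss(W) - loss(W_star))

gap_m_le_n = pl_gap(m=4, n=10)   # case m <= n
gap_m_gt_n = pl_gap(m=10, n=4)   # case m > n
print(gap_m_le_n, gap_m_gt_n)
```

A nonnegative gap in both calls is what the corollary predicts for full-rank Φ.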



A.4 PROOF OF COROLLARY 2

The assumption rank(Φ) = m implies that $\sigma_{\min}(\Phi) > 0$. Moreover, the logistic loss ℓ satisfies the assumption of the differentiability. Thus, Theorem 1 implies the statement of this corollary if $L_0$ with the logistic loss satisfies the Polyak-Łojasiewicz inequality with the given radius R ∈ (0,∞] and the parameter $\kappa = (2\tau + \rho(R))\sigma_{\min}(\Phi)^2$ where ρ(R) depends on R. Let R ∈ (0,∞] be given. Note that we have ρ(R) > 0 if R < ∞, and ρ(R) = 0 if R = ∞. If (R, τ) = (∞, 0), then 2τ + ρ(R) = 0, for which the statement of this corollary trivially holds (since the bound does not decrease). Therefore, we focus on the remaining case of (R, τ) ≠ (∞, 0). Since (R, τ) ≠ (∞, 0), we have 2τ + ρ(R) > 0. We first compute the Hessian with respect to W as:

$$\nabla^2L_0(W) = \sum_{i=1}^n p_i(1-p_i)\phi(x_i)\phi(x_i)^\top + 2\tau\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top,$$

where $p_i = \frac{1}{1+e^{-W\phi(x_i)}}$. Therefore, for any v ≠ 0,

$$\begin{aligned}
v^\top\nabla^2L_0(W)v &= \sum_{i=1}^n p_i(1-p_i)v^\top\phi(x_i)\phi(x_i)^\top v + 2\tau v^\top\left(\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top\right)v \\
&\ge \rho(R)\sum_{i=1}^n v^\top\phi(x_i)\phi(x_i)^\top v + 2\tau v^\top\left(\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top\right)v \\
&= (2\tau + \rho(R))\,v^\top\Phi\Phi^\top v \\
&\ge (2\tau + \rho(R))\,\sigma_{\min}(\Phi)^2\|v\|_2^2 > 0.
\end{aligned}$$

Therefore, $L_0$ is strongly convex with parameter $(2\tau + \rho(R))\sigma_{\min}(\Phi)^2 > 0$. Since a strongly convex function with a parameter satisfies the Polyak-Łojasiewicz inequality with the same parameter (Karimi et al., 2016), this implies that $L_0$ satisfies the Polyak-Łojasiewicz inequality with the given radius R ∈ (0,∞] and parameter $(2\tau + \rho(R))\sigma_{\min}(\Phi)^2 > 0$. Since R ∈ (0,∞] is arbitrary, this implies the statement of this corollary by Theorem 1.

A.5 PROOF OF PROPOSITION 1

Let (x,A) be given. By repeatedly applying the definition of $z^{(l)}(x,A)$, we obtain

$$\begin{aligned}
z^{(l)}(x,A) &= \gamma\sigma(A)z^{(l-1)} + \phi(x) \\
&= \gamma\sigma(A)\left(\gamma\sigma(A)z^{(l-2)} + \phi(x)\right) + \phi(x) \\
&= \sum_{k=0}^l \gamma^k\sigma(A)^k\phi(x),
\end{aligned} \tag{36}$$

where $\sigma(A)^k$ represents the matrix multiplication of k copies of the matrix σ(A) with $\sigma(A)^0 = I_m$. In general, if σ is identity, this sequence does not converge. However, with our definition of σ, we have

$$\|\sigma(A)\|_1 = \max_j\sum_i|\sigma(A)_{ij}| = 1.$$

Therefore, $\|\gamma\sigma(A)\|_1 = \gamma\|\sigma(A)\|_1 = \gamma < 1$ for any γ ∈ (0, 1). Since an induced matrix norm is sub-multiplicative, this implies that

$$\|\gamma^k\sigma(A)^k\|_1 \le \prod_{i=1}^k\|\gamma\sigma(A)\|_1 = \gamma^k.$$

In other words, each term $\|\gamma^k\sigma(A)^k\|_1$ in the series $\sum_{k=0}^\infty\|\gamma^k\sigma(A)^k\|_1$ is bounded by $\gamma^k$. Since the series $\sum_{k=0}^\infty\gamma^k$ converges in R with γ ∈ (0, 1), this implies, by the comparison test, that the series $\sum_{k=0}^\infty\|\gamma^k\sigma(A)^k\|_1$ converges in R. Thus, the sequence $(\sum_{k=0}^l\|\gamma^k\sigma(A)^k\|_1)_l$ is a Cauchy sequence. By defining $s_l = \sum_{k=0}^l\gamma^k\sigma(A)^k$, we have $\|s_l - s_{l'}\|_1 = \|\sum_{k=l'+1}^l\gamma^k\sigma(A)^k\|_1 \le \sum_{k=l'+1}^l\|\gamma^k\sigma(A)^k\|_1$ for any l > l′ by the triangle inequality of the (matrix) norm. Since $(\sum_{k=0}^l\|\gamma^k\sigma(A)^k\|_1)_l$ is a Cauchy sequence, this inequality implies that $(\sum_{k=0}^l\gamma^k\sigma(A)^k)_l$ is a Cauchy sequence (in a Banach space $(\mathbb{R}^{m\times m}, \|\cdot\|_1)$, which is isometric to $\mathbb{R}^{mm}$ under $\|\cdot\|_1$). Thus, the sequence $(\sum_{k=0}^l\gamma^k\sigma(A)^k)_l$ converges. From (36), this implies that the sequence $(z^{(l)}(x,A))_l$ converges.
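The convergence argument can be illustrated by running the recursion in (36) and comparing the result with the equilibrium obtained from a direct linear solve. This is a minimal sketch under the assumption that σ is a column-wise softmax; the sizes, the random inputs, and the iteration count of 300 are arbitrary illustrative choices.

```python
import numpy as np

def column_softmax(A):
    # Assumed form of sigma: softmax over each column, so ||sigma(A)||_1 = 1.
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
m, gamma = 6, 0.8
A = rng.standard_normal((m, m))
phi_x = rng.standard_normal(m)      # stands in for phi(x)
S = column_softmax(A)

# Explicit "infinite depth" computation, truncated at 300 layers.
z = phi_x.copy()                    # z^(0) = phi(x), the k = 0 term of (36)
for _ in range(300):
    z = gamma * S @ z + phi_x       # z^(l) = gamma * sigma(A) z^(l-1) + phi(x)

# Equilibrium point found directly: (I - gamma * sigma(A)) z* = phi(x).
z_star = np.linalg.solve(np.eye(m) - gamma * S, phi_x)

fixed_point_error = np.max(np.abs(z - z_star))
print(fixed_point_error)
```

The error contracts at least like γ^l in the 1-norm, which is why a few hundred iterations are far more than enough at γ = 0.8.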

B ON THE IMPLICIT BIAS

In this section, we show that Theorem 2 suggests an implicit bias towards a simple function as a result of infinite depth, whereas understanding this bias in more detail is left as an open problem. This section focuses on the case of the square loss $\ell(q,y_i) = \|q - y_i\|_2^2$ with $m_y = 1$. By solving $\operatorname{vec}(V_t) \in \operatorname{argmin}_{v\in\mathcal{V}}L_0^t(v)$ in Theorem 2 for the direction of the Newton method, Theorem 2 implies that

$$V_t = -r_t^\top(I_m - \gamma\sigma(A_t))^{-1} \in \mathbb{R}^{1\times m}, \tag{37}$$

where $r_t\in\mathbb{R}^m$ is an error vector with each entry being a function of the residuals $f_{\theta_t}(x_i) - y_i$. Since $\sigma(A_t)$ is a positive matrix and is a left stochastic matrix due to the nonlinearity σ, the Perron–Frobenius theorem (Perron, 1907; Frobenius, 1912) ensures that the largest eigenvalue of $\sigma(A_t)$ is one, any other eigenvalue in absolute value is strictly smaller than one, and any left eigenvector corresponding to the largest eigenvalue is the vector $\eta\mathbf{1} = \eta[1, 1, \ldots, 1]^\top\in\mathbb{R}^m$ where η ∈ R is some scalar. Thus, the largest eigenvalue of the matrix $(I_m - \gamma\sigma(A_t))^{-1}$ is $\frac{1}{1-\gamma}$, any other eigenvalue is in the form of $\frac{1}{1-\lambda_k\gamma}$ with $|\lambda_k| < 1$, and any left eigenvector corresponding to the largest eigenvalue is $\eta\mathbf{1}\in\mathbb{R}^m$. By decomposing the error vector as $r_t = P_{\mathbf{1}}r_t + (1 - P_{\mathbf{1}})r_t$, this implies that $\operatorname{vec}(V_t) = V_t^\top = \frac{1}{1-\gamma}P_{\mathbf{1}}r_t + g_\gamma((1 - P_{\mathbf{1}})r_t)\in\mathbb{R}^m$, where $P_{\mathbf{1}} = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ is a projection onto the column space of $\mathbf{1} = [1, 1, \ldots, 1]^\top\in\mathbb{R}^m$, and $g_\gamma$ is a function such that for any q in its domain, $\|g_\gamma(q)\| < c$ for all γ ∈ (0, 1) with some constant c independent of γ.

In other words, $\operatorname{vec}(V_t)$ in Theorem 2 can be decomposed into the two terms: $\frac{1}{1-\gamma}P_{\mathbf{1}}r_t$ (the projection of the error vector onto the column space of $\mathbf{1}$) and $g_\gamma((1 - P_{\mathbf{1}})r_t)$ (a function of the projection of the error vector onto the null space of $\mathbf{1}$). Here, as γ → 1, $\|\frac{1}{1-\gamma}P_{\mathbf{1}}r_t\| \to \infty$ and $\|g_\gamma((1 - P_{\mathbf{1}})r_t)\| < c$. This implies that with γ < 1 sufficiently large, the first term $\frac{1}{1-\gamma}P_{\mathbf{1}}r_t$ dominates the second term $g_\gamma((1 - P_{\mathbf{1}})r_t)$.
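The spectral facts used in this decomposition can be observed numerically. The sketch below assumes σ is a column-wise softmax (a positive, left stochastic matrix) and checks that the all-ones vector is a left eigenvector of (I − γσ(A))⁻¹ with eigenvalue 1/(1 − γ), and that this is the eigenvalue of largest modulus. The size, seed, and value of γ are arbitrary choices for illustration.

```python
import numpy as np

def column_softmax(A):
    # Assumed form of sigma: positive entries, each column sums to one.
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
m, gamma = 6, 0.9
S = column_softmax(rng.standard_normal((m, m)))
U_inv = np.linalg.inv(np.eye(m) - gamma * S)

ones = np.ones(m)
left_action = ones @ U_inv                       # 1^T (I - gamma * sigma(A))^{-1}
top_modulus = np.abs(np.linalg.eigvals(U_inv)).max()

print(left_action)                               # each entry is close to 1 / (1 - gamma)
print(top_modulus, 1.0 / (1.0 - gamma))
```

Increasing `gamma` towards one makes 1/(1 − γ) grow without bound while the remaining eigenvalues stay bounded, which is the mechanism behind the dominance of the averaging component described above.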

Since $(P_{\mathbf{1}}r_t)^\top\phi(x) = \frac{\mu_t}{m}\mathbf{1}^\top\phi(x)$ with $\mu_t = \sum_{k=1}^m(r_t)_k\in\mathbb{R}$, this implies that with γ < 1 sufficiently large, the dynamics of deep equilibrium linear models $\frac{d}{dt}f_{\theta_t} = \frac{1}{\delta_t}V_t\phi$ learns a simple shallow function $\frac{\mu_T}{m}\mathbf{1}^\top\phi(x)$ first before learning more complicated components of the functions through $g_\gamma((1 - P_{\mathbf{1}})r_t)$, where $\mu_T = \int_0^T\mu_t\,dt\in\mathbb{R}$. Here, $\frac{\mu_T}{m}\mathbf{1}^\top\phi(x)$ is a simple average model that averages over the features, $\frac{1}{m}\mathbf{1}^\top\phi(x)$, and multiplies the average by a scalar $\mu_t$. Moreover, large γ < 1 means that we have large effective depth or large weighting for deeper layers since we have a shallow model with γ = 0 and γ is a discount factor of the infinite depth.

C ADDITIONAL DISCUSSION

On existence of the limit. When Bai et al. (2019a) introduced general deep equilibrium models, they hypothesized that the limit $\lim_{l\to\infty}z^{(l)}$ exists for several choices of h, and provided numerical results to support this hypothesis. In general deep equilibrium models, depending on the values of model parameters, the limit is not ensured to exist. For example, when h increases the norm of the output at every layer, then it is easy to see that the sequence diverges or explodes. This is also true when we set h to be that of deep equilibrium models without the nonlinearity σ (or equivalently redefining σ to be the identity function): if the operator norms on A are not bounded by one, then the sequence can diverge in general. In other words, in general, some trajectory of gradient dynamics may potentially violate the assumption of the existence of the limit when learning models. In this view, the class of deep equilibrium linear models is one instance of general deep equilibrium models where the limit is guaranteed to exist for any values of model parameters as stated in Proposition 1.

On the PL inequality. With R sufficiently large, the definition of the PL inequality in this paper is simply a rearrangement in the form where $L_0$ is allowed to take matrices as its inputs through $L_0^{\mathrm{vec}}(\operatorname{vec}(\cdot)) = L_0(\cdot)$, where vec(M) represents the standard vectorization of a matrix M. See (Polyak, 1963; Karimi et al., 2016) for more detailed explanations of the PL inequality.

On the radius R for the logistic loss. As shown in Section 3.1.3, we can use R = ∞ for the square loss and the logistic loss, in order to get an a priori guarantee for the global linear convergence in theory. In practice, for the logistic loss, we may want to choose R depending on the scenario, because of the following observation. For the logistic loss, we would like to set the radius R to be large so that the trajectory on B is bounded as $\|B_t\|_1 < (1-\gamma)R$ for all t ∈ [0, T] and the global minimum value on the constrained domain decreases: i.e., $L^*_R \to L^*$ as R → ∞. However, unlike in the case of the square loss, the convergence rate decreases as we increase R in the case of the logistic loss, because ρ(R) decreases as R increases. This does not pose an issue because we can always pick R < ∞ so that for any t > 0 and T > 0, we have $\rho(R) > c_\rho$ for some constant $c_\rho > 0$. Moreover, this tradeoff does not appear for the square loss: i.e., we can set R = ∞ for the square loss without decreasing the convergence rate. We can also avoid this tradeoff for the logistic loss by simply setting R = ∞ and τ > 0.

On previous work without implicit linearization. In Section 5, we discussed the previous work on deep neural networks with implicit linearization via significant over-parameterization. Kawaguchi & Huang (2019) observed that we can also use the implicit linearization with mild over-parameterization by controlling learning rates to guarantee global convergence and generalization performances at the same time. On the other hand, there is another line of previous work where deep nonlinear neural networks are studied without any (implicit or explicit) linearization and without any strong assumptions; e.g., see the previous work by Shamir (2018); Liang et al. (2018); Nguyen (2019); Kawaguchi & Bengio (2019); Kawaguchi & Kaelbling (2020); Nguyen (2021). Whereas the conclusions of these previous studies without strong assumptions can be directly applicable to practical settings, their conclusions are not as strong as those of previous studies with strong assumptions (e.g., implicit linearization via significant over-parameterization) as expected. The direct practical applicability, however, comes with the benefit of being able to assist the progress of practical methods (Verma et al., 2019; Jagtap et al., 2020b;a).

D EXPERIMENTS

The purpose of our experiments is to provide a secondary motivation for our theoretical analyses, instead of claiming the immediate benefits of using deep equilibrium linear models.

D.1 EXPERIMENTAL SETUP

For data generation and all models, we set φ(x) = x. Therefore, we have m = mx.

Data. To generate datasets, we first drew uniformly at random 200 input images from a standard image dataset (CIFAR-10 (Krizhevsky & Hinton, 2009), CIFAR-100 (Krizhevsky & Hinton, 2009) or Kuzushiji-MNIST (Clanuwat et al., 2019)) as pre-input data points $x_i^{\mathrm{pre}}\in\mathbb{R}^{m'_x}$. Out of the 200 images, 100 images to be used for training were drawn from a train dataset and the 100 other images to be used for testing were drawn from the corresponding test dataset. Then, the input data points $x_i\in\mathbb{R}^{m_x}$ with $m_x = 150$ were generated as $x_i = Rx_i^{\mathrm{pre}}$ where each entry of a matrix $R\in\mathbb{R}^{m_x\times m'_x}$ was set to $\delta/\sqrt{m_x}$ with $\delta\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1)$ and was fixed over the indices i. We then generated the targets as $y_i = B^*(\lim_{l\to\infty}z^{(l)}(x_i,A^*)) + \delta_i\in\mathbb{R}$ with γ = 0.8 where $\delta_i\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1)$. Each entry of the true (unknown) matrices $A^*\in\mathbb{R}^{m\times m}$ and $B^*\in\mathbb{R}^{1\times m}$ was independently drawn from the standard normal distribution.
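The data-generating process described above can be sketched as follows. This is not the authors' code: the pre-input images are replaced by uniform random vectors (a placeholder assumption, with a hypothetical flattened size of 3072), σ is assumed to be a column-wise softmax, and the equilibrium is computed by a linear solve rather than by iterating layers.

```python
import numpy as np

def column_softmax(A):
    # Assumed form of sigma: softmax over each column.
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
n, m_x_pre, m_x, gamma = 100, 3072, 150, 0.8   # 3072 is a hypothetical flattened image size

# Placeholder for the sampled images x_i^pre (assumption: random pixels instead of real data).
x_pre = rng.random((n, m_x_pre))

# Random projection R with entries delta / sqrt(m_x), shared across all data points.
R = rng.standard_normal((m_x, m_x_pre)) / np.sqrt(m_x)
X = x_pre @ R.T                                 # row i is x_i = R x_i^pre, with phi(x) = x

# Ground-truth parameters with standard normal entries.
A_star = rng.standard_normal((m_x, m_x))
B_star = rng.standard_normal((1, m_x))

# Equilibrium z*(x_i, A*) solves (I - gamma * sigma(A*)) z = x_i; columns of Z_eq index i.
U_star = np.eye(m_x) - gamma * column_softmax(A_star)
Z_eq = np.linalg.solve(U_star, X.T)

# Targets y_i = B* z*(x_i, A*) + delta_i with standard normal noise.
y = (B_star @ Z_eq).ravel() + rng.standard_normal(n)
print(X.shape, y.shape)
```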

Model. For DNNs, we used ReLU activation and $W^{(l)}\in\mathbb{R}^{m\times m}$ for l = 1, 2, . . . , H − 1 ($W^{(H)}\in\mathbb{R}^{1\times m}$). Each entry of the weight matrices $W^{(l)}$ for DNNs was initialized to $\delta/\sqrt{m}$ where $\delta\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1)$ for all l = 1, 2, . . . , H. Similarly, for deep equilibrium linear models, each entry of A and B was initialized to $\delta/\sqrt{m}$ where $\delta\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1)$. Linear models were initialized to represent the exact same functions as those of initial deep equilibrium linear models: i.e., $W_0 = B_0U_0^{-1}$. We used γ = 0.8 for deep equilibrium linear models.

Training. For each dataset, we used stochastic gradient descent (SGD) to train linear models, deep equilibrium linear models, and fully-connected feedforward deep neural networks (DNNs). We used the square loss $\ell(q,y_i) = \|q - y_i\|_2^2$. We fixed the mini-batch size to be 64 and the momentum coefficient to be 0.8. Under this setting, linear models are known to find a minimum norm solution (with extra elements from initialization) (Gunasekar et al., 2017; Poggio et al., 2017). Similarly, DNNs have been empirically observed to have implicit regularization effects (although the most well studied setting is with the loss functions with exponential tails) (e.g., see discussions in Poggio et al., 2017; Moroshko et al., 2020; Woodworth et al., 2020). In order to minimize the effect of learning rates on our conclusion, we conducted experiments with all the values of learning rates from the choices of 0.01, 0.005, 0.001, 0.0005, 0.0001 and 0.00005, and reported the results with both the worst cases and the best cases separately for each model (and each depth H for DNNs).

All experiments were implemented in PyTorch (Paszke et al., 2019).

D.2 ADDITIONAL EXPERIMENTS

In this subsection, we report additional experimental results.

Additional datasets. We repeated the same experiments as those for Figure 1 with four additional datasets: modified MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), SEMEION (Srl & Brescia, 1994), and Fashion-MNIST (Xiao et al., 2017). We report the result of this experiment in Figure 4. As can be seen from Figure 4, we confirmed qualitatively the same observations as in Figure 1: i.e., all models performed approximately the same at initial points, but deep equilibrium linear models outperformed both linear models and nonlinear DNNs in test errors after training.

DNNs without bias terms. In Figures 1–4, the results of DNNs are reported with bias terms. To consider the effect of discarding bias terms, we also repeated the same experiments with DNNs without bias terms and reported the results in Figure 5. As can be seen from Figure 5, we confirmed qualitatively the same observations: i.e., deep equilibrium linear models outperformed nonlinear DNNs in test errors.

DNNs with deeper networks. To consider the effect of deeper networks, we also repeated the same experiments with deeper DNNs with depth H = 10, 100 and 200, and we reported the results in Figures 6–7. As can be seen from Figures 6–7, we again confirmed qualitatively the same observations: i.e., deep equilibrium linear models outperformed nonlinear DNNs in test errors, although DNNs can reduce training errors faster than deep equilibrium linear models. We experienced gradient explosion and gradient vanishing for DNNs with depth H = 100 and H = 200.

Larger datasets. In Figure 1, we used only 200 data points so that we can observe the effect of inductive bias and overfitting phenomena under a small number of data points. If we use a large number of data points, it is expected that the benefit of the inductive bias with deep equilibrium linear models tends to become less noticeable because using a large number of data points can reduce the degree of overfitting for all models, including linear models and DNNs. Nevertheless, we repeated the same experiments with all data points of each dataset: for example, we use 60000 training data points and 10000 test data points for MNIST. Figure 8 reports the results where the values are shown with the best learning rates for each model from the set of learning rates $S_{\mathrm{LR}} = \{0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005\}$ (in terms of the final test errors at epoch = 100). As can be seen in the figure, deep equilibrium linear models outperformed both linear models and nonlinear DNNs in test errors.

Logistic loss and theoretical bounds. In Corollary 2, we can set $\lambda_T = \frac{1}{m(1+\gamma)^2}$ to get a guarantee for the global linear convergence rate for any initialization in theory. However, in practice, this is a pessimistic convergence rate and we may want to choose $\lambda_T$ depending on initializations. To demonstrate this, Figure 3 reports the numerical training trajectory along with theoretical upper bounds with initialization-independent $\lambda_T = \frac{1}{m(1+\gamma)^2}$ and initialization-dependent $\lambda_T = \inf_{t\in[0,T]}\lambda_{\min}(D_t)$. As can be seen in Figure 3, the theoretical upper bound with initialization-dependent $\lambda_T$ demonstrates a faster convergence rate.

[Figure 3 plot: train loss versus # of iterations (log scales). Curves: training trajectory; bound with initialization-independent λT; bound with initialization-dependent λT; optimal loss.]

Figure 3: Logistic loss and theoretical bounds with initialization-independent λT and initialization-dependent λT.


0 1000 2000 3000 4000 5000epoch

101

102

56789

20

30

405060708090

test

loss


0 1000 2000 3000 4000 5000epoch

10 10

10 8

10 6

10 4

10 2

100

102

train

loss

(a) Modified MNIST: Linear v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

456789

20

30

405060708090

test

loss


0 1000 2000 3000 4000 5000epoch

10 12

10 10

10 8

10 6

10 4

10 2

100

102

train

loss

(b) Modified MNIST: DNN v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

56789

20

30405060708090

200

300

test

loss


0 1000 2000 3000 4000 5000epoch

10 1

100

101

102

train

loss

(c) Modified Fashion-MNIST: Linear v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

56789

20

30405060708090

200

300

test

loss


0 1000 2000 3000 4000 5000epoch

10 12

10 9

10 6

10 3

100

103

train

loss

(d) Modified Fashion-MNIST: DNN v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

6789

20

30

40

5060708090

test

loss


0 1000 2000 3000 4000 5000epoch

10 2

10 1

100

101

102

train

loss

(e) Modified SEMEION: Linear v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

6789

20

30

40

5060708090

test

loss


0 1000 2000 3000 4000 5000epoch

10 9

10 7

10 5

10 3

10 1

101

train

loss

(f) Modified SEMEION: DNN v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

56789

20

30

40

5060708090

test

loss


0 1000 2000 3000 4000 5000epoch

10 1

100

101

102

train

loss

(g) Modified SVHN: Linear v.s. DELM

0 1000 2000 3000 4000 5000epoch

101

102

56789

20

30

405060708090

test

loss


0 1000 2000 3000 4000 5000epoch

10 5

10 4

10 3

10 2

10 1

100

101

102

train

loss

(h) Modified SVHN: DNN v.s. DELM

Figure 4: Test and train losses (in log scales) versus the number of epochs for linear models, deepequilibrium linear models (DELMs), and deep neural networks with ReLU (DNNs). The plottedlines indicate the mean values over five random trials whereas the shaded regions represent errorbars with one standard deviations for both test and train losses. The plots for linear models andDELMs are shown with the best and worst learning rates (in terms of the final test errors at epoch =5000). The plots for DNNs are shown with the best learning rates for each depth H = 2, 3, and 4(in terms of the final test errors at epoch = 5000).



[Figure 5, panels (a) to (g): test loss and train loss (log scale) versus epoch (0 to 5000). Legend: DNN-NB (H=2), DNN-NB (H=3), DNN-NB (H=4), DELM (best), DELM (worst).]

(a) Modified Kuzushiji-MNIST: NN-NB vs. DELM
(b) Modified CIFAR-10: DNN-NB vs. DELM
(c) Modified CIFAR-100: DNN-NB vs. DELM
(d) Modified Fashion-MNIST: DNN-NB vs. DELM
(e) Modified MNIST: DNN-NB vs. DELM
(f) Modified SEMEION: DNN-NB vs. DELM
(g) Modified SVHN: DNN-NB vs. DELM

Figure 5: Test and train losses (on a log scale) versus the number of epochs for the deep equilibrium linear model (DELM) and deep neural networks with no bias term (DNN-NB). The plotted lines indicate the mean values over five random trials, whereas the shaded regions represent error bars of one standard deviation for both test and train losses. The plots for DELM are shown with the best and worst learning rates (in terms of the final test errors at epoch = 5000). The plots for DNN-NB are shown with the best learning rates for each depth H = 2, 3, and 4 (in terms of the final test errors at epoch = 5000).



[Figure 6: seven panels of test loss and train loss (log scale) versus epoch (0 to 5000).]

(a) Modified Kuzushiji-MNIST: DNN-NB vs. DELM


Figure 6: Test and train losses (on a log scale) versus the number of epochs for the deep equilibrium linear model (DELM) and deeper neural networks with no bias term (DNN-NB). The legend is the same for all subplots and is shown in subplot (a). The plotted lines indicate the mean values over five random trials, whereas the shaded regions represent error bars of one standard deviation for both test and train losses. The plots for DELM are shown with the best and worst learning rates (in terms of the final test errors at epoch = 5000). The plots for DNN-NB are shown with the best learning rates for each depth H = 10, 100, and 200 (in terms of the final test errors at epoch = 5000). The values for DNNs with depth H = 100 and 200 coincide in the plots.



[Figure 7, panels (a) to (g): test loss and train loss (log scale) versus epoch (0 to 5000).]

(a) Modified Kuzushiji-MNIST: DNN vs. DELM
(b) Modified CIFAR-10: DNN vs. DELM
(c) Modified CIFAR-100: DNN vs. DELM
(d) Modified Fashion-MNIST: DNN vs. DELM
(e) Modified MNIST: DNN vs. DELM
(f) Modified SEMEION: DNN vs. DELM
(g) Modified SVHN: DNN vs. DELM

Figure 7: Test and train losses (on a log scale) versus the number of epochs for the deep equilibrium linear model (DELM) and deeper neural networks with ReLU (DNNs). The legend is the same for all subplots and is shown in subplot (a). The plotted lines indicate the mean values over five random trials, whereas the shaded regions represent error bars of one standard deviation for both test and train losses. The plots for DELM are shown with the best and worst learning rates (in terms of the final test errors at epoch = 5000). The plots for DNNs are shown with the best learning rates for each depth H = 10, 100, and 200 (in terms of the final test errors at epoch = 5000). The values for DNNs with depth H = 100 and 200 coincide in the plots.



[Figure 8: seven panels of test loss and train loss (log scale) versus epoch (0 to 100). Legend: DNN (H=2), DNN (H=4), DNN (H=10), Linear, DELM.]

(a) Modified Kuzushiji-MNIST: NN-NB vs. DELM


Figure 8: Test and train losses (on a log scale) versus the number of epochs for linear models, deep equilibrium linear models (DELMs), and deep neural networks with ReLU (DNNs). The legend is the same for all subplots and is shown in subplot (a). The plotted lines indicate the mean values over three random trials, whereas the shaded regions represent error bars of one standard deviation.
