

CBMM Memo No. 100 August 17, 2019

Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization

Tomaso Poggio1, Andrzej Banburski1, Qianli Liao1

1 Center for Brains, Minds, and Machines, MIT


This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.


Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization

Tomaso Poggio a,1, Andrzej Banburski a, and Qianli Liao a

a Center for Brains, Minds and Machines, MIT

This manuscript was compiled on August 17, 2019

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A satisfactory theoretical characterization of deep learning, however, is beginning to emerge. It covers the following questions: 1) the representation power of deep networks; 2) the optimization of the empirical risk; 3) the generalization properties of gradient descent techniques: why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized? In this review we discuss recent advances in the three areas. In approximation theory, both shallow and deep networks have been shown to approximate any continuous function on a bounded domain at the expense of an exponential number of parameters (exponential in the dimensionality of the function). However, for a subset of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In optimization we discuss the loss landscape for the exponential loss function. It turns out that global minima at infinity are completely degenerate. The other critical points of the gradient are less degenerate, with at least one, and typically more, nonzero eigenvalues. This suggests that stochastic gradient descent will find the global minima with high probability. To address the question of generalization for classification tasks, we use classical uniform convergence results to justify minimizing a surrogate exponential-type loss function under a unit norm constraint on the weight matrix at each layer. It is an interesting side remark that such minimization for (homogeneous) ReLU deep networks implies maximization of the margin. The resulting constrained gradient system turns out to be identical to the well-known weight normalization technique, which was originally motivated in a rather different way. We also show that standard gradient descent contains an implicit L2 unit norm constraint, in the sense that it solves the same constrained minimization problem with the same critical points (but a different dynamics). Our approach, which is supported by several independent new results (1–4), offers a solution to the puzzle about the generalization performance of deep overparametrized ReLU networks, uncovering the origin of the underlying hidden complexity control in the case of deep networks.


Machine Learning | Deep learning | Approximation | Optimization | Generalization

1. Introduction

In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus our ability to develop principled improvements, has lagged behind. A satisfactory theoretical characterization of deep learning is emerging. It covers the following areas: 1) approximation properties of deep networks; 2) optimization of the empirical risk; 3) generalization properties of gradient descent techniques: why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized?

A. When Can Deep Networks Avoid the Curse of Dimensionality?

We start with the first set of questions, summarizing results in (5–7) and (8, 9). The main result is that deep networks have a theoretical guarantee, which shallow networks do not have: they can avoid the curse of dimensionality for an important class of problems, corresponding to compositional functions, that is, functions of functions. An especially interesting subset of such compositional functions are hierarchically local compositional functions, in which all the constituent functions are local in the sense of having bounded, small dimensionality. The deep networks that can approximate them without the curse of dimensionality are of the deep convolutional type (though, importantly, weight sharing is not necessary).

Implications of the theorems likely to be relevant in practice are: a) deep convolutional architectures have the theoretical guarantee that they can be much better than one-layer architectures such as kernel machines for certain classes of problems; b) the problems for which certain deep networks are guaranteed to avoid the curse of dimensionality (see (10) for a nice review) correspond to input-output mappings that are compositional with local constituent functions; c) the key aspect of convolutional networks that can give them an exponential advantage is not weight sharing but locality at each level of the hierarchy.

Significance Statement

In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus our ability to develop principled improvements, has lagged behind. A theoretical characterization of deep learning is now beginning to emerge. It covers the following questions: 1) representation power of deep networks; 2) optimization of the empirical risk; 3) generalization properties of gradient descent techniques: how can deep networks generalize despite being overparametrized (more weights than training data) in the absence of any explicit regularization? We review progress on all three areas, showing that 1) for the class of compositional functions, deep networks of the convolutional type are exponentially better approximators than shallow networks; 2) only global minima are effectively found by stochastic gradient descent for overparametrized networks; 3) there is a hidden norm control in the minimization of cross-entropy by gradient descent that allows generalization despite overparametrization.

T.P. designed research; T.P., A.B., and Q.L. performed research; and T.P. and A.B. wrote the paper.

The authors declare no conflict of interest.

1 To whom correspondence should be addressed. E-mail: [email protected]


Fig. 1. The top graphs are associated to functions; each of the bottom diagrams depicts the ideal network approximating the function above. In a) a shallow universal network in 8 variables and $N$ units approximates a generic function of 8 variables $f(x_1, \dots, x_8)$. Inset b) shows a hierarchical network at the bottom in $n = 8$ variables, which approximates well functions of the form $f(x_1, \dots, x_8) = h_3(h_{21}(h_{11}(x_1, x_2), h_{12}(x_3, x_4)), h_{22}(h_{13}(x_5, x_6), h_{14}(x_7, x_8)))$ as represented by the binary graph above. In the approximating network each of the $n - 1$ nodes in the graph of the function corresponds to a set of $Q = \frac{N}{n-1}$ ReLU units computing the ridge function $\sum_{i=1}^{Q} a_i(\langle v_i, x\rangle + t_i)_+$, with $v_i, x \in \mathbb{R}^2$, $a_i, t_i \in \mathbb{R}$. Each term in the ridge function corresponds to a unit in the node (this is somewhat different from today's deep networks, but equivalent to them (25)). Similar to the shallow network, a hierarchical network is universal, that is, it can approximate any continuous function; the text proves that it can approximate a compositional function exponentially better than a shallow network. Redrawn from (9).


B. Related Work. Several papers in the '80s focused on the approximation power and learning properties of one-hidden-layer networks (called shallow networks here). Very little appeared on multilayer networks (but see (11–15)). By now, several papers (16–18) have appeared. (8, 19–22) derive new upper bounds for the approximation by deep networks of certain important classes of functions which avoid the curse of dimensionality. The upper bound for the approximation by shallow networks of general functions was well known to be exponential. It seems natural to assume that, since there is no general way for shallow networks to exploit a compositional prior, lower bounds for the approximation by shallow networks of compositional functions should also be exponential. In fact, examples of specific functions that cannot be represented efficiently by shallow networks have been given, for instance in (23–25). An interesting review of the approximation of univariate functions by deep networks has recently appeared (26).

C. Degree of approximation. The general paradigm is as follows. We are interested in determining how complex a network ought to be to theoretically guarantee approximation of an unknown target function $f$ up to a given accuracy $\epsilon > 0$. To measure the accuracy, we need a norm $\|\cdot\|$ on some normed linear space $X$. As we will see, the norm used in the results of this paper is the sup norm, in keeping with the standard choice in approximation theory. As it turns out, the results of this section require the sup norm in order to be independent from the unknown distribution of the input data.

Let $V_N$ be the set of all networks of a given kind with $N$ units (which we take to be our measure of the complexity of the approximant network). The degree of approximation is defined by $\mathrm{dist}(f, V_N) = \inf_{P \in V_N} \|f - P\|$. For example, if $\mathrm{dist}(f, V_N) = O(N^{-\gamma})$ for some $\gamma > 0$, then a network with complexity $N = O(\epsilon^{-1/\gamma})$ will be sufficient to guarantee an approximation with accuracy at least $\epsilon$. The only a priori information on the class of target functions $f$ is codified by the statement that $f \in W$ for some subspace $W \subseteq X$. This subspace is a smoothness and compositional class, characterized by the parameters $m$ and $d$ ($d = 2$ in the example of Figure 1; it is the size of the kernel in a convolutional network).

D. Shallow and deep networks. This section characterizes conditions under which deep networks are "better" than shallow networks in approximating functions. Thus we compare shallow (one-hidden-layer) networks with deep networks, as shown in Figure 1. Both types of networks use the same small set of operations: dot products, linear combinations, a fixed nonlinear function of one variable, possibly convolution and pooling. Each node in the networks corresponds to a node in the graph of the function to be approximated, as shown in the figure. A unit is a neuron which computes $(\langle x, w\rangle + b)_+$, where $w$ is the vector of weights on the vector input $x$. Both $w$ and the real number $b$ are parameters tuned by learning. We assume here that each node in the networks computes the linear combination of $r$ such units, $\sum_{i=1}^{r} c_i (\langle x, w_i\rangle + b_i)_+$. Notice that in our main example of a network corresponding to a function with a binary tree graph, the resulting architecture is an idealized version of the deep convolutional neural networks described in the literature. In particular, it has only one output at the top, unlike most deep architectures, which have many channels and many top-level outputs. Correspondingly, each node computes a single value instead of multiple channels, using the combination of several units. However, our results hold also for these more complex networks (see (25)). A minimal code sketch of this idealized binary-tree architecture is given after the list below.

The sequence of results is as follows.

• Both shallow (a) and deep (b) networks are universal, that is, they can approximate arbitrarily well any continuous function of $n$ variables on a compact domain. The result for shallow networks is classical.

• We consider a special class of functions of $n$ variables on a compact domain that are hierarchical compositions of local functions, such as $f(x_1, \dots, x_8) = h_3(h_{21}(h_{11}(x_1, x_2), h_{12}(x_3, x_4)),\ h_{22}(h_{13}(x_5, x_6), h_{14}(x_7, x_8)))$. The structure of the function in Figure 1 b) is represented by a graph of the binary tree type, reflecting dimensionality $d = 2$ for the constituent functions $h$. In general, $d$ is arbitrary but fixed and independent of the dimensionality $n$ of the compositional function $f$. (25) formalizes the more general compositional case using directed acyclic graphs.

• The approximation of functions with a compositional structure can be achieved with the same degree of accuracy by deep and shallow networks, but the number of parameters is much smaller for the deep networks than for the shallow network with equivalent approximation accuracy.
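As a concrete illustration of the idealized binary-tree architecture of Figure 1 b), here is a small sketch of ours (not code from the references; the number of units per node and the random weights are arbitrary), in which each constituent function of $d = 2$ variables is represented by a ridge-function block of $Q$ ReLU units, $\sum_{i=1}^{Q} a_i(\langle v_i, x\rangle + t_i)_+$.

```python
# A minimal sketch (ours, not from the paper) of the binary-tree network of
# Fig. 1b: each node sees only d = 2 inputs and is a small ridge function
# sum_i a_i * relu(<v_i, x> + t_i) with Q ReLU units.
import numpy as np

rng = np.random.default_rng(0)
Q = 16  # ReLU units per node; N = (n - 1) * Q units in total for n = 8 inputs

class RidgeNode:
    """One constituent function h(x1, x2) approximated by Q ReLU units."""
    def __init__(self, q=Q):
        self.V = rng.standard_normal((q, 2))   # v_i in R^2
        self.t = rng.standard_normal(q)        # offsets t_i
        self.a = rng.standard_normal(q) / q    # output weights a_i

    def __call__(self, x1, x2):
        pre = self.V @ np.array([x1, x2]) + self.t
        return float(self.a @ np.maximum(pre, 0.0))

# n - 1 = 7 nodes arranged as a binary tree on n = 8 inputs
h11, h12, h13, h14 = RidgeNode(), RidgeNode(), RidgeNode(), RidgeNode()
h21, h22 = RidgeNode(), RidgeNode()
h3 = RidgeNode()

def deep_binary_tree(x):
    """f(x1..x8) = h3(h21(h11(x1,x2), h12(x3,x4)), h22(h13(x5,x6), h14(x7,x8)))."""
    return h3(h21(h11(x[0], x[1]), h12(x[2], x[3])),
              h22(h13(x[4], x[5]), h14(x[6], x[7])))

x = rng.standard_normal(8)
print("output of the idealized deep compositional network:", deep_binary_tree(x))
```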

We approximate functions with networks in which the activation nonlinearity is a smoothed version of the so-called ReLU, originally called ramp by Breiman and given by $\sigma(x) = x_+ = \max(0, x)$. The architecture of the deep networks reflects the function graph, with each node $h_i$ being a ridge function comprising one or more neurons.

Let $I^n = [-1, 1]^n$ and let $X = C(I^n)$ be the space of all continuous functions on $I^n$, with $\|f\| = \max_{x \in I^n} |f(x)|$. Let $S_{N,n}$ denote the class of all shallow networks with $N$ units of the form

$$ x \mapsto \sum_{k=1}^{N} a_k\, \sigma(\langle w_k, x\rangle + b_k), $$

where $w_k \in \mathbb{R}^n$ and $b_k, a_k \in \mathbb{R}$. The number of trainable parameters here is $(n + 2)N$. Let $m \ge 1$ be an integer, and let $W_m^n$ be the set of all functions of $n$ variables with continuous partial derivatives of orders up to $m < \infty$ such that $\|f\| + \sum_{1 \le |k|_1 \le m} \|D^k f\| \le 1$, where $D^k$ denotes the partial derivative indicated by the multi-integer $k \ge 1$, and $|k|_1$ is the sum of the components of $k$.

For the hierarchical binary tree network, the analogous spaces are defined by considering the compact set $W_m^{n,2}$ to be the class of all compositional functions $f$ of $n$ variables with a binary tree architecture and constituent functions $h$ in $W_m^2$. We define the corresponding class of deep networks $D_{N,2}$ to be the set of all deep networks with a binary tree architecture, where each of the constituent nodes is in $S_{M,2}$, with $N = |V| M$, $V$ being the set of non-leaf vertices of the tree. We note that in the case when $n$ is an integer power of 2, the total number of parameters involved in a deep network in $D_{N,2}$ is $4N$.

The first theorem is about shallow networks.

Theorem 1. Let $\sigma: \mathbb{R} \to \mathbb{R}$ be infinitely differentiable and not a polynomial. For $f \in W_m^n$ the complexity of shallow networks that provide accuracy at least $\epsilon$ is

$$ N = O(\epsilon^{-n/m}) \quad \text{and is the best possible.} \tag{1} $$

The estimate of Theorem 1 is the best possible if the only a priori information we are allowed to assume is that the target function belongs to $W_m^n$. The exponential dependence on the dimension $n$ of the number $\epsilon^{-n/m}$ of parameters needed to obtain an accuracy $O(\epsilon)$ is known as the curse of dimensionality. Note that the constants involved in the $O$ notation in the theorems will depend upon the norms of the derivatives of $f$ as well as of $\sigma$.

Our second and main theorem is about deep networks with smooth activations (preliminary versions appeared in (6–8)). We formulate it in the binary tree case for simplicity, but it extends immediately to functions that are compositions of constituent functions of a fixed number of variables $d$ (in convolutional networks $d$ corresponds to the size of the kernel).

Theorem 2. For $f \in W_m^{n,2}$ consider a deep network with the same compositional architecture and with an activation function $\sigma: \mathbb{R} \to \mathbb{R}$ which is infinitely differentiable and not a polynomial. The complexity of the network to provide approximation with accuracy at least $\epsilon$ is

$$ N = O((n-1)\,\epsilon^{-2/m}). \tag{2} $$

The proof is in (25). The assumptions on $\sigma$ in the theorems are not satisfied by the ReLU function $x \mapsto x_+$, but they are satisfied by smoothing the function in an arbitrarily small interval around the origin. The result of the theorem can be extended to the non-smooth ReLU (25).

In summary, when the only a priori assumption on the target function is about the number of derivatives, then to guarantee an accuracy of $\epsilon$ we need a shallow network with $O(\epsilon^{-n/m})$ trainable parameters. If we assume a hierarchical structure on the target function, as in Theorem 2, then the corresponding deep network yields a guaranteed accuracy of $\epsilon$ with $O(\epsilon^{-2/m})$ trainable parameters. Note that Theorem 2 applies to all $f$ with a compositional architecture given by a graph which corresponds to, or is a subgraph of, the graph associated with the deep network, in this case the graph corresponding to $W_m^{n,d}$.
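To make the gap concrete, a back-of-the-envelope comparison of the two bounds (ours; the constants hidden in the $O(\cdot)$ notation are ignored) can be written in a few lines:

```python
# Rough numerical comparison (ours; constants hidden in the O(.) are ignored)
# of the bounds of Theorems 1 and 2: a shallow network needs N ~ eps^(-n/m)
# units, a deep binary-tree network needs N ~ (n - 1) * eps^(-2/m).
eps = 0.1   # target accuracy
m = 2       # smoothness of the (constituent) functions

for n in (4, 8, 16, 32):
    shallow = eps ** (-n / m)             # curse of dimensionality
    deep = (n - 1) * eps ** (-2 / m)      # linear dependence on n
    print(f"n = {n:2d}:  shallow ~ {shallow:.2e}   deep ~ {deep:.2e}")
```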

2. The Optimization Landscape of Deep Nets with Smooth Activation Function

The main question in the optimization of deep networks is to characterize the landscape of the empirical loss in terms of its global minima and the local critical points of the gradient.

A. Related work. There are many recent papers studying optimization in deep learning. For optimization we mention work based on the idea that noisy gradient descent (27–30) can find a global minimum. More recently, several authors have studied the dynamics of gradient descent for deep networks with assumptions about the input distribution or about how the labels are generated. They obtain global convergence for some shallow neural networks (31–36). Some local convergence results have also been proved (37–39). The most interesting such approach is (36), which focuses on minimizing the training loss and proving that randomly initialized gradient descent can achieve zero training loss (see also (40–42)). In summary, there is by now an extensive literature on optimization that formalizes and refines, for different special cases and for the discrete domain, our results of (43, 44).

B. Degeneracy of global and local minima under the exponential loss. The first part of the argument of this section relies on the obvious fact (see (1)) that for ReLU networks, under the hypothesis of an exponential-type loss function, there are no local minima that separate the data; the only critical points of the gradient that separate the data are the global minima.

Notice that the global minima are at $\rho = \infty$, when the exponential is zero. As a consequence, the Hessian is identically zero, with all eigenvalues being zero. On the other hand, any point of the loss at finite $\rho$ has a nonzero Hessian: for instance, in the linear case the Hessian is proportional to $\sum_{n=1}^{N} x_n x_n^T$. The local minima which are not global minima must misclassify. How degenerate are they?

Simple arguments (1) suggest that the critical points which are not global minima cannot be completely degenerate. We thus have the following

Property 1. Under the exponential loss, global minima are completely degenerate, with all eigenvalues of the Hessian ($W$ of them, with $W$ being the number of parameters in the network) being zero. The other critical points of the gradient are less degenerate, with at least one, and typically $N$, nonzero eigenvalues.

Fig. 2. Stochastic Gradient Descent and Langevin Stochastic Gradient Descent (SGDL) on the 2D potential function shown above lead to an asymptotic distribution with the histograms shown on the left. As expected from the form of the Boltzmann distribution, both dynamics prefer degenerate minima to non-degenerate minima of the same depth. From (1).

For the general case of non-exponential loss and smooth nonlinearities instead of the ReLU, the following conjecture has been proposed (1):

Conjecture 1. For appropriate overparametrization, there are a large number of global zero-error minimizers which are degenerate; the other critical points (saddles and local minima) are generically (that is, with probability one) degenerate on a set of much lower dimensionality.

C. SGD and Boltzmann Equation. The second part of our argument (in (44)) is that SGD concentrates in probability on the most degenerate minima. The argument is based on the similarity between a Langevin equation and SGD, and on the fact that the Boltzmann distribution is exactly the asymptotic "solution" of the stochastic differential Langevin equation and also of SGDL, defined as SGD with added white noise (see for instance (45)). The Boltzmann distribution is

$$ p(f) = \frac{1}{Z}\, e^{-\frac{L}{T}}, \tag{3} $$

where $Z$ is a normalization constant, $L(f)$ is the loss and $T$ reflects the noise power. The equation implies that SGDL prefers degenerate minima relative to non-degenerate ones of the same depth. In addition, among two minimum basins of equal depth, the one with the larger volume is much more likely in high dimensions, as shown by the simulations in (44). Taken together, these two facts suggest that SGD selects degenerate minimizers corresponding to larger isotropic flat regions of the loss. Then SGDL shows concentration, because of the high dimensionality, of its asymptotic distribution in Equation 3.

Together, (43) and (1) suggest the following

Conjecture 2. For appropriate overparametrization of the deep network, SGD selects with high probability the global minimizers of the empirical loss, which are highly degenerate.
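The preference of SGDL for degenerate minima can be illustrated with a toy one-dimensional simulation in the spirit of Figure 2. This is our own sketch, not the simulation of (44): the potential, step size and noise level are arbitrary, but the potential has two zero-loss minima of equal depth, one with nonzero curvature and one completely flat (degenerate).

```python
# Toy Langevin / SGDL simulation in the spirit of Fig. 2 (our sketch, not the
# simulation of (44)).  The 1D loss has two zero-loss minima of equal depth:
# a non-degenerate one at x = -1 (curvature 8) and a degenerate, flat one at
# x = +1 (zero curvature).  SGDL = gradient descent plus white noise.
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    return 0.25 * (x + 1) ** 2 * (x - 1) ** 4

def grad(x):
    return 0.25 * (2 * (x + 1) * (x - 1) ** 4 + 4 * (x + 1) ** 2 * (x - 1) ** 3)

eta, T = 5e-3, 0.2                 # step size and "temperature" (noise power)
steps, burn_in = 300_000, 30_000
x, samples = 0.0, []

for t in range(steps):
    x = x - eta * grad(x) + np.sqrt(2 * eta * T) * rng.standard_normal()
    if t >= burn_in:
        samples.append(x)

samples = np.array(samples)
print("fraction of time near the sharp minimum (x < 0):", np.mean(samples < 0))
print("fraction of time near the flat  minimum (x > 0):", np.mean(samples > 0))
# Qualitatively, the flat (degenerate) minimum dominates, as expected from the
# Boltzmann distribution p ~ exp(-L/T): same depth, but much larger volume.
```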

3. Generalization

Recent results by (2) illuminate the apparent absence of "overfitting" (see Figure 4) in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions and without explicit regularization. Here we discuss the case of nonlinear multilayer DNNs under exponential-type losses, for several variations of the basic gradient descent algorithm. The main results are:

• classical uniform convergence bounds for generalization suggest a form of complexity control on the dynamics of the weight directions $V_k$: minimize a surrogate loss subject to a unit $L_p$ norm constraint;

• gradient descent on the exponential loss with an explicit $L_2$ unit norm constraint is equivalent to a well-known gradient descent algorithm, weight normalization, which is closely related to batch normalization;

• unconstrained gradient descent on the exponential loss yields a dynamics with the same critical points as weight normalization: the dynamics seems to implicitly enforce an $L_2$ unit norm constraint on the directions of the weights $V_k$.

We observe that several of these results directly apply to kernel machines for the exponential loss under the separability/interpolation assumption, because kernel machines are one-homogeneous.

A. Related work. A number of papers have studied gradient descent for deep networks (46–48). Close to the approach summarized here (details are in (1)) is the paper (49). Its authors study generalization assuming a regularizer because they are, like us, interested in the normalized margin. Unlike their assumption of an explicit regularization, we show here that commonly used techniques, such as weight and batch normalization, in fact minimize the surrogate loss margin while controlling the complexity of the classifier without the need to add a regularizer or to use weight decay. Surprisingly, we will show that even standard gradient descent on the weights implicitly controls the complexity through an "implicit" unit $L_2$ norm constraint. Two very recent papers ((4) and (3)) develop an elegant but complicated margin-maximization-based approach which leads to some of the same results of this section (and many more). The important question of which conditions are necessary for gradient descent to converge to the maximum margin of $f$ is studied by (4) and (3). Our approach does not need the notion of maximum margin, but our Theorem 3 establishes a connection with it and thus with the results of (4) and (3). Our main goal here (and in (1)) is to achieve a simple understanding of where the complexity control underlying generalization is hiding in the training of deep networks.

B. Deep networks: definitions and properties. We define a deep network with $K$ layers with the usual coordinate-wise scalar activation functions $\sigma(z): \mathbb{R} \to \mathbb{R}$ as the set of functions $f(W; x) = \sigma(W^K \sigma(W^{K-1} \cdots \sigma(W^1 x)))$, where the input is $x \in \mathbb{R}^d$ and the weights are given by the matrices $W^k$, one per layer, with matching dimensions. We sometimes use the symbol $W$ as a shorthand for the set of $W^k$ matrices, $k = 1, \dots, K$. For simplicity we consider here the case of binary classification in which $f$ takes scalar values, implying that the last layer matrix is $W^K \in \mathbb{R}^{1 \times h_{K-1}}$. The labels are $y_n \in \{-1, 1\}$. The weights of hidden layer $l$ are collected in a matrix of size $h_l \times h_{l-1}$. There are no biases apart from the input layer, where the bias is instantiated by one of the input dimensions being a constant. The activation function in this section is the ReLU activation.

For ReLU activations the following important positive one-homogeneity property holds: $\sigma(z) = \frac{\partial \sigma(z)}{\partial z} z$. A consequence of one-homogeneity is a structural lemma (Lemma 2.1 of (50)): $\sum_{i,j} W_k^{i,j} \frac{\partial f(x)}{\partial W_k^{i,j}} = f(x)$, where $W_k$ is here the vectorized representation of the weight matrices $W_k$ for each of the different layers (each matrix is a vector).

For the network, homogeneity implies $f(W; x) = \prod_{k=1}^{K} \rho_k\, \bar f(V_1, \dots, V_K; x_n)$, where $W_k = \rho_k V_k$ with the matrix norm $\|V_k\|_p = 1$. Another property of the Rademacher complexity of ReLU networks that follows from homogeneity is $R_N(\mathbb{F}) = \rho\, R_N(\bar{\mathbb{F}})$, where $\rho = \prod_{k=1}^{K} \rho_k$, $\mathbb{F}$ is the class of neural networks described above and $\bar{\mathbb{F}}$ is the associated class of normalized neural networks.

We define $f = \rho \bar f$ (we call $\bar f(V; x) = \bar f(x)$, with the understanding that $f(x) = f(W; x)$). Note that $\frac{\partial f}{\partial \rho_k} = \frac{\rho}{\rho_k} \bar f$, and that the definitions of $\rho_k$, $V_k$ and $\bar f$ all depend on the choice of the norm used in normalization.

In the case of training data that can be separated by the networks, $f(x_n) y_n > 0\ \forall n = 1, \dots, N$. We will sometimes write $f(x_n)$ as a shorthand for $y_n f(x_n)$.
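The structural lemma is straightforward to check numerically. The following small script is our own illustration (arbitrary layer sizes, finite-difference gradients), not code from the references: for a bias-free ReLU network, the weighted sum of the gradient entries of any single layer reproduces the network output.

```python
# A small numerical check (ours, arbitrary sizes) of the structural lemma
# sum_{i,j} W_k^{ij} * df/dW_k^{ij} = f(x), a consequence of one-homogeneity
# of a bias-free ReLU network in each layer's weight matrix.
import numpy as np

rng = np.random.default_rng(0)

def forward(weights, x):
    """Bias-free ReLU network with a linear scalar output layer."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return (weights[-1] @ h).item()

sizes = [10, 8, 6, 1]                      # 10 -> 8 -> 6 -> 1
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])
eps = 1e-6

for k, W in enumerate(weights):
    grad = np.zeros_like(W)                # central finite differences, df/dW_k
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (forward(weights[:k] + [Wp] + weights[k+1:], x)
                     - forward(weights[:k] + [Wm] + weights[k+1:], x)) / (2 * eps)
    print(f"layer {k}:  sum(W * df/dW) = {np.sum(W * grad):+.6f}"
          f"   f(x) = {forward(weights, x):+.6f}")
```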

C. Uniform convergence bounds: minimizing a surrogate loss under norm constraint. Classical generalization bounds for regression (51) suggest that minimizing the empirical loss of a loss function such as the cross-entropy, subject to constrained complexity of the minimizer, is a way to attain generalization, that is, an expected loss which is close to the empirical loss:

Proposition 1. The following generalization bounds apply to all $f \in \mathbb{F}$ with probability at least $1 - \delta$:

$$ L(f) \le \hat L(f) + c_1 R_N(\mathbb{F}) + c_2 \sqrt{\frac{\ln(1/\delta)}{2N}}, \tag{4} $$

where $L(f) = \mathbb{E}[\ell(f(x), y)]$ is the expected loss, $\hat L(f)$ is the empirical loss, $R_N(\mathbb{F})$ is the empirical Rademacher average of the class of functions $\mathbb{F}$, measuring its complexity; $c_1, c_2$ are constants that depend on properties of the Lipschitz constant of the loss function and on the architecture of the network.

Thus minimizing, under a constraint on the Rademacher complexity, a surrogate function such as the cross-entropy (which becomes the logistic loss in the binary classification case) will minimize an upper bound on the expected classification error, because such surrogate functions are upper bounds on the 0-1 loss. We can choose a class of functions $\bar{\mathbb{F}}$ with normalized weights and write $f(x) = \rho \bar f(x)$ and $R_N(\mathbb{F}) = \rho R_N(\bar{\mathbb{F}})$. One can choose any fixed $\rho$ as an (Ivanov) regularization-type tradeoff.

In summary, the problem of generalization may be approached by minimizing the exponential loss (more generally, an exponential-type loss, such as the logistic and the cross-entropy) under a unit norm constraint on the weight matrices:

$$ \lim_{\rho \to \infty}\ \arg\min_{\|V_k\| = 1,\ \forall k} L(\rho \bar f), \tag{5} $$

where we write $f(W) = \rho \bar f(V)$ using the homogeneity of the network. As will become clear later, gradient descent techniques on the exponential loss automatically increase $\rho$ to infinity. We will typically consider the sequence of minimizations over $V_k$ for a sequence of increasing $\rho$. The key quantity for us is $\bar f$ and the associated weights $V_k$; $\rho$ is in a certain sense an auxiliary variable, a constraint that is progressively relaxed.

In the following we explore the implications for deep networks of this classical approach to generalization.
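The normalization entering Equation 5 (and Figure 3 below) simply divides each weight matrix by its norm. A minimal numerical check of the underlying factorization $f(W; x) = \left(\prod_k \rho_k\right) \bar f(V; x)$, ours and not from the paper, with arbitrary layer sizes and Frobenius normalization:

```python
# Sketch (ours): normalize each layer of a bias-free ReLU net by its Frobenius
# norm and check the factorization f(W; x) = (prod_k rho_k) * f(V; x).
import numpy as np

rng = np.random.default_rng(1)

def forward(weights, x):
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return (weights[-1] @ h).item()

weights = [rng.standard_normal((8, 10)), rng.standard_normal((6, 8)),
           rng.standard_normal((1, 6))]
x = rng.standard_normal(10)

rho = [np.linalg.norm(W) for W in weights]          # rho_k = ||W_k||_F
V = [W / r for W, r in zip(weights, rho)]           # V_k = W_k / rho_k

print("f(W; x)              =", forward(weights, x))
print("prod(rho) * f(V; x)  =", np.prod(rho) * forward(V, x))
```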

C.1. Remark: minimization of an exponential-type loss implies margin maximization. Though not critical for our approach to the question of generalization in deep networks, it is interesting that constrained minimization of the exponential loss implies margin maximization. This property relates our approach to the results of several recent papers (2–4). Notice that our Theorem 3, as in (52), is a sufficient condition for margin maximization. Necessity is not true for general loss functions.

To state the margin property more formally, we adapt to our setting a different result due to (52) (they consider for a linear network a vanishing $\lambda$ regularization term, whereas we have for nonlinear networks a set of unit norm constraints). First we recall the definition of the empirical loss $L(f) = \sum_{n=1}^{N} \ell(y_n f(x_n))$ with an exponential loss function $\ell(yf) = e^{-yf}$. We define $\eta(f)$ as the margin of $f$, that is, $\eta(f) = \min_n f(x_n)$.

Then our margin maximization theorem (proved in (1)) takes the form

Theorem 3. Consider the set of $V_k$, $k = 1, \dots, K$ corresponding to

$$ \min_{\|V_k\| = 1} L(f(\rho_k, V_k)), \tag{6} $$

where the norm $\|V_k\|$ is a chosen $L_p$ norm and $L(f(\rho_k, V_k)) = \sum_n \ell(y_n \rho \bar f(V; x_n))$ is the empirical exponential loss. For each layer consider a sequence of increasing $\rho_k$. Then the associated sequence of $V_k$ defined by Equation 6 converges for $\rho \to \infty$ to the maximum margin of $\bar f$, that is, to $\max_{\|V_k\| \le 1} \eta(\bar f)$.

D. Minimization under unit norm constraint: weight normalization. The approach is then to minimize the loss function $L(f(W)) = \sum_{n=1}^{N} e^{-f(W; x_n) y_n} = \sum_{n=1}^{N} e^{-\rho \bar f(V; x_n) y_n}$, with $\rho = \prod_k \rho_k$, subject to $\|V_k\|_p^p = 1\ \forall k$, that is, under a unit norm constraint for the weight matrix at each layer (if $p = 2$ then $\sum_{i,j} (V_k)_{i,j}^2 = 1$ is the Frobenius norm). The minimization is understood as a sequence of minimizations for a sequence of increasing $\rho_k$. Clearly these constraints imply the constraint on the norm of the product of weight matrices for any $p$-norm (because any induced operator norm is a sub-multiplicative matrix norm). The standard choice for a loss function is an exponential-type loss such as the cross-entropy, which for binary classification becomes the logistic function. We study here the exponential because it is simpler and retains all the basic properties.

There are several gradient descent techniques that, given the unconstrained optimization problem, transform it into a constrained gradient descent problem. To provide the background, let us formulate the standard unconstrained gradient descent problem for the exponential loss as it is used in practical training of deep networks:

$$ \dot W_k^{i,j} = -\frac{\partial L}{\partial W_k^{i,j}} = \sum_{n=1}^{N} y_n \frac{\partial f(x_n; W)}{\partial W_k^{i,j}}\, e^{-y_n f(x_n; W)}, \tag{7} $$

where $W_k$ is the weight matrix of layer $k$. Notice that, since the structural property implies that at a critical point we have $\sum_{n=1}^{N} y_n f(x_n; W)\, e^{-y_n f(x_n; W)} = 0$, the only critical points of this dynamics that separate the data (i.e., $y_n f(x_n; W) > 0\ \forall n$) are global minima at infinity. Of course for separable data, while the loss decreases asymptotically to zero, the norm of the weights $\rho_k$ increases to infinity, as we will see later. Equation 7 defines a dynamical system in terms of the gradient of the exponential loss $L$.

The set of gradient-based algorithms enforcing a unit-norm constraint (53) comprises several techniques that are equivalent for small values of the step size. They are all good approximations of the true gradient method. One of them is the Lagrange multiplier method; another is the tangent gradient method, based on the following theorem:

Theorem 4 (53). Let $\|u\|_p$ denote a vector norm that is differentiable with respect to the elements of $u$ and let $g(t)$ be any vector function with finite $L_2$ norm. Then, calling $\nu(t) = \left.\frac{\partial \|u\|_p}{\partial u}\right|_{u = u(t)}$, the equation

$$ \dot u = h_g(t) = S g(t) = \left(I - \frac{\nu \nu^T}{\|\nu\|_2^2}\right) g(t), \tag{8} $$

with $\|u(0)\| = 1$, describes the flow of a vector $u$ that satisfies $\|u(t)\|_p = 1$ for all $t \ge 0$.

In particular, a form for $g$ is $g(t) = \mu(t) \nabla_u L$, the gradient update in a gradient descent algorithm. We call $S g(t)$ the tangent gradient transformation of $g$. In the case of $p = 2$ we replace $\nu$ in Equation 8 with $u$, because $\nu(t) = \frac{\partial \|u\|_2}{\partial u} = u$. This gives $S = I - \frac{u u^T}{\|u\|_2^2}$ and $\dot u = S g(t)$.
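For $p = 2$ the tangent gradient transformation is just the projection onto the tangent space of the unit sphere. The following toy integration is our own check (arbitrary quadratic loss and step size): the projected update is orthogonal to $u$, and a small-step Euler discretization keeps $\|u\|_2$ very close to 1.

```python
# Sketch (ours; arbitrary quadratic loss and step size) of the tangent gradient
# method of Theorem 4 for p = 2: the flow u_dot = -S g, S = I - u u^T / ||u||^2,
# keeps ||u(t)||_2 = 1 (up to the O(eta^2) drift of the Euler discretization).
import numpy as np

rng = np.random.default_rng(0)
d = 20
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                        # ||u(0)||_2 = 1

A = rng.standard_normal((d, d))
def grad(u):                                  # gradient of the arbitrary loss u^T A u
    return (A + A.T) @ u

eta = 1e-4
for _ in range(5_000):
    g = grad(u)
    S_g = g - u * (u @ g) / (u @ u)           # tangent gradient transformation S g
    assert abs(u @ S_g) < 1e-8                # S g is orthogonal to u
    u = u - eta * S_g

print("||u||_2 after integration:", np.linalg.norm(u))   # stays ~1
```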

Consider now the empirical loss $L$ written in terms of $V_k$ and $\rho_k$ instead of $W_k$, using the change of variables defined by $W_k = \rho_k V_k$ but without imposing a unit norm constraint on $V_k$. The flows in $\rho_k, V_k$ can be computed as $\dot\rho_k = \frac{\partial W_k}{\partial \rho_k} \frac{\partial L}{\partial W_k} = V_k^T \frac{\partial L}{\partial W_k}$ and $\dot V_k = \frac{\partial W_k}{\partial V_k} \frac{\partial L}{\partial W_k} = \rho_k \frac{\partial L}{\partial W_k}$, with $\frac{\partial L}{\partial W_k}$ given by Equation 7.

We now enforce the unit norm constraint on $V_k$ by using the tangent gradient transform on the $V_k$ flow. This yields

$$ \dot\rho_k = V_k^T \frac{\partial L}{\partial W_k}, \qquad \dot V_k = S_k\, \rho_k\, \frac{\partial L}{\partial W_k}. \tag{9} $$

Notice that the dynamics above follows from the classical approach of controlling the Rademacher complexity of $\bar f$ during optimization (suggested by bounds such as Equation 4). The approach, and the resulting dynamics for the directions of the weights, would seem different from the standard unconstrained approach used in training deep networks. It turns out, however, that the dynamics described by Equation 9 is the same as the dynamics of weight normalization.

The technique of weight normalization (54) was originally proposed as a small improvement on standard gradient descent "to reduce covariate shifts." It was defined for each layer in terms of $w = g \frac{v}{\|v\|}$ as

$$ \dot g = \frac{v^T}{\|v\|} \frac{\partial L}{\partial w}, \qquad \dot v = \frac{g}{\|v\|}\, S\, \frac{\partial L}{\partial w}, \tag{10} $$

with $S = I - \frac{v v^T}{\|v\|^2}$.

It is easy to see that Equations 9 are the same as the weight normalization Equations 10 if $\|v\|_2 = 1$. We now observe, multiplying the $\dot v$ equation in 10 by $v^T$, that $v^T \dot v = 0$ because $v^T S = 0$, implying that $\|v\|_2$ is constant in time, with a constant that can be taken to be 1. Thus the two dynamics are the same.
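For a single linear unit the weight normalization updates of Equation 10 can be written out directly. The sketch below is ours (toy separable data, arbitrary step size and initialization): it runs the two updates of Equation 10 with the exponential loss and checks that $v^T \dot v = 0$ at every step, so that $\|v\|_2$ stays (up to discretization error) constant while $g$, the analogue of $\rho$, keeps growing.

```python
# Sketch (ours; toy data, step size and initialization are arbitrary) of the
# weight normalization dynamics of Eq. 10 for one linear unit f(x) = <w, x>,
# w = g v / ||v||, trained with the exponential loss on separable data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                         # separable labels (first coordinate)

def dL_dw(w):                                # gradient of L(w) = sum_n exp(-y_n <w, x_n>)
    return -(y * np.exp(-y * (X @ w))) @ X

g, v = 0.1, np.ones(d)
v0 = np.linalg.norm(v)
lr = 1e-4

for _ in range(50_000):
    w = g * v / np.linalg.norm(v)
    gw = dL_dw(w)
    S_gw = gw - v * (v @ gw) / (v @ v)       # S = I - v v^T / ||v||^2
    assert abs(v @ S_gw) < 1e-8              # v^T v_dot = 0: ||v|| is conserved by the flow
    g -= lr * (v @ gw) / np.linalg.norm(v)   # g update of Eq. 10
    v -= lr * (g / np.linalg.norm(v)) * S_gw # v update of Eq. 10

# ||v|| is exactly constant for the continuous flow; any tiny change here is Euler drift.
print("||v||: initial %.3f, final %.3f" % (v0, np.linalg.norm(v)))
print("g (the rho of the text) grew from 0.1 to %.2f" % g)
```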

E. Generalization with hidden complexity control. Empirically it appears that GD and SGD converge to solutions that can generalize even without batch or weight normalization. Convergence may be difficult for quite deep networks, and generalization may not be as good as with batch normalization, but it still occurs. How is this possible?

We study the dynamical system $\dot W_k^{i,j}$ under the reparametrization $W_k^{i,j} = \rho_k V_k^{i,j}$ with $\|V_k\|_2 = 1$. We consider for each weight matrix $W_k$ the corresponding "vectorized" representation, in which each matrix $W_k$ is treated as a vector. We use the following definitions and properties (for a vector $w$):

• Define $\frac{w}{\|w\|_2} = \bar w$; thus $w = \|w\|_2\, \bar w$ with $\|\bar w\|_2 = 1$. Also define $S = I - \bar w \bar w^T = I - \frac{w w^T}{\|w\|_2^2}$.

• The following relations are easy to check:

1. $\frac{\partial \|w\|_2}{\partial w} = \bar w$

2. $\frac{\partial \bar w}{\partial w} = \frac{S}{\|w\|_2}$

3. $S w = S \bar w = 0$

4. $S^2 = S$

The gradient descent dynamical system used in training deep networks for the exponential loss is given by Equation 7. Following the chain rule for the time derivatives, the dynamics for $W_k$ is exactly (see (1)) equivalent to the following dynamics for $\|W_k\| = \rho_k$ and $V_k$:

$$ \dot\rho_k = \frac{\partial \|W_k\|}{\partial W_k} \frac{\partial W_k}{\partial t} = V_k^T \dot W_k \tag{11} $$

and

$$ \dot V_k = \frac{\partial V_k}{\partial W_k} \frac{\partial W_k}{\partial t} = \frac{S_k}{\rho_k} \dot W_k, \tag{12} $$

where $S_k = I - V_k V_k^T$. We used property 1 for Equation 11 and property 2 for Equation 12.

The key point here is that the dynamics of $V_k$ includes a unit $L_2$ norm constraint: using the tangent gradient transform will not change the equation, because $S^2 = S$.

As separate remarks, notice that if for $t > t_0$, $f$ separates all the data, then $\frac{d}{dt}\rho_k > 0$, that is, $\rho$ diverges to $\infty$ with $\lim_{t \to \infty} \dot\rho = 0$. In the one-layer network case the dynamics yields $\rho \approx \log t$ asymptotically. For deeper networks, this is different: (1) shows (for one support vector) that the product of weights at each layer diverges faster than logarithmically, but each individual layer diverges more slowly than in the one-layer case. The norms of the layers grow at the same rate: $\frac{d}{dt}\rho_k^2$ is independent of $k$. The $V_k$ dynamics has stationary or critical points given by

$$ \sum_n \alpha_n\, \rho(t) \left(\frac{\partial \bar f(x_n)}{\partial V_k^{i,j}} - V_k^{i,j}\, \bar f(x_n)\right) = 0, \tag{13} $$

where $\alpha_n = e^{-y_n \rho(t) \bar f(x_n)}$. We examine later the linear one-layer case $f(x) = v^T x$, in which case the stationary points of the gradient are given by $\sum_n \alpha_n\, \rho(t)\, (x_n - v v^T x_n) = 0$. In the linear case the critical point is unique and corresponds to a hyperbolic minimum. In the general case the critical points correspond for $\rho \to \infty$ to degenerate zero "asymptotic minima" of the loss.




Fig. 3. The top left graph shows testing vs training cross-entropy loss for networks, each trained on the same data set (CIFAR-10) but with a different initialization, yielding zero classification error on the training set but different test errors. The top right graph shows the same data, that is, testing vs training loss for the same networks, now normalized by dividing each weight by the Frobenius norm of its layer. Notice that all points have zero classification error at training. The red point on the top right refers to a network trained on the same CIFAR-10 data set but with randomized labels. It shows zero classification error at training and test error at chance level. The top line is a square-loss regression of slope 1 with positive intercept. The bottom line is the diagonal at which training and test loss are equal. The networks are 3-layer convolutional networks. The left graph can be considered as a visualization of Equation 4 when the Rademacher complexity is not controlled. The right-hand side is a visualization of the same relation for normalized networks, that is, $L(\bar f) \le \hat L(\bar f) + c_1 R_N(\bar{\mathbb{F}}) + c_2 \sqrt{\frac{\ln(1/\delta)}{2N}}$. Under our conditions for $N$ and for the architecture of the network, the terms $c_1 R_N(\bar{\mathbb{F}}) + c_2 \sqrt{\frac{\ln(1/\delta)}{2N}}$ represent a small offset. From (55).

To understand whether there exists a hidden complexity control in standard gradient descent, we check whether there exists an $L_p$ norm for which unconstrained normalization is equivalent to constrained normalization.

From Theorem 4 we expect the constrained case to be given by the action of the following projector onto the tangent space:

$$ S_p = I - \frac{\nu \nu^T}{\|\nu\|_2^2} \quad \text{with} \quad \nu_i = \frac{\partial \|w\|_p}{\partial w_i} = \mathrm{sign}(w_i) \left(\frac{|w_i|}{\|w\|_p}\right)^{p-1}. \tag{14} $$

The constrained gradient descent is then

$$ \dot\rho_k = V_k^T \dot W_k, \qquad \dot V_k = \rho_k\, S_p\, \dot W_k. \tag{15} $$

On the other hand, reparametrization of the unconstrained dynamics in the $p$-norm gives (following Equations 11 and 12)

$$ \dot\rho_k = \frac{\partial \|W_k\|_p}{\partial W_k} \frac{\partial W_k}{\partial t} = \mathrm{sign}(W_k) \circ \left(\frac{|W_k|}{\|W_k\|_p}\right)^{p-1} \cdot \dot W_k, \qquad \dot V_k = \frac{\partial V_k}{\partial W_k} \frac{\partial W_k}{\partial t} = \frac{I - \mathrm{sign}(W_k) \circ \left(\frac{|W_k|}{\|W_k\|_p}\right)^{p-1} W_k^T}{\|W_k\|_p^{\,p-1}}\, \dot W_k. \tag{16} $$

These two dynamical systems are clearly different for generic $p$, reflecting the presence or absence of a regularization-like constraint on the dynamics of $V_k$.

As we have seen, however, for $p = 2$ the one-layer dynamical system obtained by minimizing $L$ in $\rho_k$ and $V_k$ with $W_k = \rho_k V_k$ under the constraint $\|V_k\|_2 = 1$ is the weight normalization dynamics

$$ \dot\rho_k = V_k^T \dot W_k, \qquad \dot V_k = S\, \rho_k\, \dot W_k, \tag{17} $$

which is quite similar to the standard gradient equations

$$ \dot\rho_k = V_k^T \dot W_k, \qquad \dot V_k = \frac{S}{\rho_k}\, \dot W_k. \tag{18} $$

The two dynamical systems differ only by a $\rho_k^2$ factor in the $\dot V_k$ equations. However, the critical points of the gradient for the $V_k$ flow, that is, the points for which $\dot V_k = 0$, are the same in both cases, since for any $t > 0$, $\rho_k(t) > 0$, and thus $\dot V_k = 0$ is equivalent to $S \dot W_k = 0$. Hence, gradient descent with unit $L_p$-norm constraint is equivalent to the standard, unconstrained gradient descent, but only when $p = 2$. Thus

Fact 1. The standard dynamical system used in deep learning, defined by $\dot W_k = -\frac{\partial L}{\partial W_k}$, implicitly enforces a unit $L_2$ norm constraint on $V_k$ with $\rho_k V_k = W_k$. Thus, under an exponential loss, if the dynamics converges, the $V_k$ represent the minimizer under the $L_2$ unit norm constraint.

Thus standard GD implicitly enforces the $L_2$ norm constraint on $V_k = \frac{W_k}{\|W_k\|_2}$, consistently with Srebro's results on the implicit bias of GD. Other minimization techniques, such as coordinate descent, may be biased towards different norm constraints.

F. Linear networks and rates of convergence. The linear network case ($f(x) = \rho v^T x$) (2) is an interesting example of our analysis in terms of the $\rho$ and $v$ dynamics. We start with unconstrained gradient descent, that is, with the dynamical system

$$ \dot\rho = \frac{1}{\rho} \sum_{n=1}^{N} e^{-\rho v^T x_n}\, v^T x_n, \qquad \dot v = \frac{1}{\rho} \sum_{n=1}^{N} e^{-\rho v^T x_n}\, (x_n - v v^T x_n). \tag{19} $$

If gradient descent in $v$ converges to $\dot v = 0$ at finite time, $v$ satisfies $v v^T \bar x = \bar x$, where $\bar x = \sum_{j=1}^{C} \alpha_j x_j$ with positive coefficients $\alpha_j$, and the $x_j$ are the $C$ support vectors (see (1)). A solution $v^T = \|\bar x\|\, \bar x^\dagger$ then exists ($\bar x^\dagger$, the pseudoinverse of $\bar x$, since $\bar x$ is a vector, is given by $\bar x^\dagger = \frac{\bar x^T}{\|\bar x\|^2}$). On the other hand, the operator $T$ in $v(t+1) = T v(t)$ associated with Equation 19 is non-expanding, because $\|v\| = 1,\ \forall t$. Thus there is a fixed point $v \propto \bar x$ which is independent of initial conditions (56).

The rates of convergence of the solutions $\rho(t)$ and $v(t)$, derived in a different way in (2), may be read out from the equations for $\rho$ and $v$. It is easy to check that a general solution for $\rho$ is of the form $\rho \propto C \log t$. A similar estimate for the exponential term gives $e^{-\rho v^T x_n} \propto \frac{1}{t}$. Assume for simplicity a single support vector $\bar x$. We claim that a solution for the error $\varepsilon = v - \bar x$, since $v$ converges to $\bar x$, behaves as $\frac{1}{\log t}$. In fact we write $v = \bar x + \varepsilon$ and plug it into the equation for $v$ in Equation 19. We obtain (assuming normalized input, $\|\bar x\| = 1$)

$$ \dot\varepsilon = \frac{1}{\rho} e^{-\rho v^T \bar x} \left(\bar x - (\bar x + \varepsilon)(\bar x + \varepsilon)^T \bar x\right) \approx \frac{1}{\rho} e^{-\rho v^T \bar x} \left(\bar x - \bar x - \bar x \varepsilon^T \bar x - \varepsilon \bar x^T \bar x\right), \tag{20} $$

which has the form $\dot\varepsilon = -\frac{1}{t \log t} (2 \bar x \varepsilon^T \bar x)$. Assuming $\varepsilon$ of the form $\varepsilon \propto \frac{1}{\log t}$, we obtain $-\frac{1}{t \log^2 t} = -B \frac{1}{t \log^2 t}$. Thus the error indeed converges as $\varepsilon \propto \frac{1}{\log t}$.

A similar analysis for the weight normalization Equations 17 considers the same dynamical system with a change in the equation for $v$, which becomes

$$ \dot v \propto e^{-\rho}\, \rho\, (I - v v^T)\, \bar x. \tag{21} $$

This equation differs by a factor $\rho^2$ from Equation 20. As a consequence, Equation 21 is of the form $\dot\varepsilon = -\frac{\log t}{t} \varepsilon$, with a general solution of the form $\varepsilon \propto t^{-\frac{1}{2} \log t}$. In summary, GD with weight normalization converges faster to the same equilibrium than standard gradient descent: the rate for $\varepsilon = v - \bar x$ is $t^{-\frac{1}{2} \log t}$ vs $\frac{1}{\log t}$.
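These rates are easy to see numerically. The sketch below is ours (a single normalized "support vector", arbitrary step size): it integrates the unconstrained dynamics for $f(x) = \rho v^T x$ and prints $\rho(t)$ together with the directional error, showing the logarithmic growth of $\rho$ (roughly $\log 10 \approx 2.3$ per decade) and the slow, roughly $1/\rho \sim 1/\log t$, decay of $\|v - \bar x\|$.

```python
# Sketch (ours; one "support vector" x_bar, arbitrary step size) of the linear
# dynamics: rho = ||w|| grows like log t while the direction error ||v - x_bar||
# decays only like ~1 / log t under plain gradient descent on exp(-<w, x_bar>).
import numpy as np

rng = np.random.default_rng(0)
d = 10
x_bar = rng.standard_normal(d)
x_bar /= np.linalg.norm(x_bar)               # normalized input, ||x_bar|| = 1

w = rng.standard_normal(d)                   # unconstrained gradient descent on
lr = 0.01                                    # L(w) = exp(-<w, x_bar>)
for t in range(1, 1_000_001):
    w += lr * x_bar * np.exp(-(w @ x_bar))   # w_dot = -dL/dw
    if t in (10**2, 10**3, 10**4, 10**5, 10**6):
        rho = np.linalg.norm(w)
        err = np.linalg.norm(w / rho - x_bar)
        print(f"t = {t:8d}:  rho = {rho:6.2f}   ||v - x_bar|| = {err:.4f}")
# rho increases by roughly log(10) ~ 2.3 per decade; the error shrinks ~ 1/rho.
```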

Our goal was to find $\lim_{\rho \to \infty} \arg\min_{\|V_k\| = 1,\ \forall k} L(\rho \bar f)$. We have seen that various forms of gradient descent enforce different paths in increasing $\rho$ that empirically have different effects on the convergence rate. It is an interesting theoretical and practical challenge to find the optimal way, in terms of generalization and convergence rate, to grow $\rho \to \infty$.

Our analysis of simplified batch normalization (1) suggests that several of the same considerations that we used for weight normalization should apply (in the linear one-layer case BN is identical to WN). However, BN differs from WN in the multilayer case in several ways, in addition to weight normalization: it has, for instance, separate normalization for each unit, that is, for each row of the weight matrix at each layer.

4. Discussion

A main difference between shallow and deep networks is in terms of approximation power or, in equivalent words, of the ability to learn good representations from data based on the compositional structure of certain tasks. Unlike shallow networks, deep local networks, in particular convolutional networks, can avoid the curse of dimensionality in approximating the class of hierarchically local compositional functions. This means that for such a class of functions, deep local networks represent an appropriate hypothesis class that allows good approximation with a minimum number of parameters. It is not clear, of course, why many problems encountered in practice should match the class of compositional functions. Though we and others have argued that the explanation may be in either the physics or the neuroscience of the brain, these arguments are not rigorous.


Fig. 4. Empirical and expected error on CIFAR-10 as a function of the number of neurons in a 5-layer convolutional network (training data size: 50,000). The expected classification error does not increase when increasing the number of parameters beyond the size of the training set in the range we tested.

Our conjecture at present is that compositionality is imposed by the wiring of our cortex and, critically, is reflected in language. Thus the compositionality of some of the most common visual tasks may simply reflect the way our brain works.

Optimization turns out to be surprisingly easy to perform for overparametrized deep networks, because SGD will converge with high probability to global minima that are typically much more degenerate for the exponential loss than other local critical points.

More surprisingly, gradient descent yields generalization in classification performance, despite overparametrization and even in the absence of explicit norm control or regularization, because standard gradient descent in the weights is subject to an implicit unit ($L_2$) norm constraint on the directions of the weights in the case of exponential-type losses for classification tasks.

In summary, it is tempting to conclude that the practical success of deep learning has its roots in the almost magic synergy of unexpected and elegant theoretical properties of several aspects of the technique: the deep convolutional network architecture itself, its overparametrization, the use of stochastic gradient descent, the exponential loss, and the homogeneity of the ReLU units and of the resulting networks.

Of course, many problems remain open on the way to developing a full theory and, especially, in translating it to new architectures. More detailed results are needed in approximation theory, especially for densely connected networks. Our framework for optimization is missing at present a full classification of local minima and their dependence on overparametrization. The analysis of generalization should include an analysis of convergence of the weights for multilayer networks (see (4) and (3)). A full theory would also require an analysis of the tradeoff for deep networks between approximation and estimation error, relaxing the separability assumption.

ACKNOWLEDGMENTS. We are grateful to Sasha Rakhlin and Nate Srebro for useful suggestions about the structural lemma and about separating critical points. Part of the funding is from the Center for Brains, Minds and Machines (CBMM), funded by NSF
STC award CCF-1231216, and part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

1. Banburski A, et al. (2019) Theory of deep learning III: Dynamics and generalization in deep networks. CBMM Memo No. 090.
2. Soudry D, Hoffer E, Srebro N (2017) The implicit bias of gradient descent on separable data. arXiv e-prints.
3. Lyu K, Li J (2019) Gradient descent maximizes the margin of homogeneous neural networks. CoRR abs/1906.05890.
4. Shpigel Nacson M, Gunasekar S, Lee JD, Srebro N, Soudry D (2019) Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. arXiv e-prints, arXiv:1905.07325.
5. Anselmi F, Rosasco L, Tan C, Poggio T (2015) Deep convolutional networks are hierarchical kernel machines. Center for Brains, Minds and Machines (CBMM) Memo No. 35, also in arXiv.
6. Poggio T, Rosasco L, Shashua A, Cohen N, Anselmi F (2015) Notes on hierarchical splines, dclns and i-theory, (MIT Computer Science and Artificial Intelligence Laboratory), Technical report.
7. Poggio T, Anselmi F, Rosasco L (2015) I-theory on depth vs width: hierarchical function composition. CBMM Memo No. 041.
8. Mhaskar H, Liao Q, Poggio T (2016) Learning real and boolean functions: When is deep better than shallow? Center for Brains, Minds and Machines (CBMM) Memo No. 45, also in arXiv.
9. Mhaskar H, Poggio T (2016) Deep versus shallow networks: an approximation theory perspective. Center for Brains, Minds and Machines (CBMM) Memo No. 54, also in arXiv.
10. Donoho DL (2000) High-dimensional data analysis: The curses and blessings of dimensionality in AMS Conference on Math Challenges of the 21st Century.
11. Mhaskar H (1993) Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics pp. 61–80.
12. Mhaskar HN (1993) Neural networks for localized approximation of real functions in Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop. (IEEE), pp. 190–196.
13. Chui C, Li X, Mhaskar H (1994) Neural networks for localized approximation. Mathematics of Computation 63(208):607–623.
14. Chui CK, Li X, Mhaskar HN (1996) Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics 5(1):233–243.
15. Pinkus A (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8:143–195.
16. Poggio T, Smale S (2003) The mathematics of learning: Dealing with data. Notices of the American Mathematical Society (AMS) 50(5):537–544.
17. Montufar GF, Pascanu R, Cho K, Bengio Y (2014) On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems 27:2924–2932.
18. Livni R, Shalev-Shwartz S, Shamir O (2013) A provably efficient algorithm for training deep networks. CoRR abs/1304.7045.
19. Anselmi F, et al. (2014) Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? Center for Brains, Minds and Machines (CBMM) Memo No. 1, arXiv:1311.4158v5.
20. Anselmi F, et al. (2015) Unsupervised learning of invariant representations. Theoretical Computer Science.
21. Poggio T, Rosasco L, Shashua A, Cohen N, Anselmi F (2015) Notes on hierarchical splines, dclns and i-theory. CBMM Memo No. 037.
22. Liao Q, Poggio T (2016) Bridging the gap between residual learning, recurrent neural networks and visual cortex. Center for Brains, Minds and Machines (CBMM) Memo No. 47, also in arXiv.
23. Telgarsky M (2015) Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101v2 [cs.LG] 29 Sep 2015.
24. Safran I, Shamir O (2016) Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv:1610.09887v1.
25. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q (2016) Theory I: Why and when can deep - but not shallow - networks avoid the curse of dimensionality, (CBMM Memo No. 058, MIT Center for Brains, Minds and Machines), Technical report.
26. Daubechies I, DeVore R, Foucart S, Hanin B, Petrova G (2019) Nonlinear approximation and (deep) ReLU networks. arXiv e-prints, arXiv:1905.02199.
27. Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. CoRR abs/1703.00887.
28. Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points - online stochastic gradient for tensor decomposition. CoRR abs/1503.02101.
29. Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers in 29th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, eds. Feldman V, Rakhlin A, Shamir O. (PMLR, Columbia University, New York, New York, USA), Vol. 49, pp. 1246–1257.
30. Du SS, Lee JD, Tian Y (2018) When is a convolutional filter easy to learn? in International Conference on Learning Representations.
31. Tian Y (2017) An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17. (JMLR.org), pp. 3404–3413.
32. Soltanolkotabi M, Javanmard A, Lee JD (2019) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory 65(2):742–769.
33. Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17. (Curran Associates Inc., USA), pp. 597–607.
34. Brutzkus A, Globerson A (2017) Globally optimal gradient descent for a convnet with Gaussian inputs in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. pp. 605–614.
35. Du S, Lee J, Tian Y, Singh A, Poczos B (2018) Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, eds. Dy J, Krause A. (PMLR, Stockholmsmässan, Stockholm Sweden), Vol. 80, pp. 1339–1348.
36. Du SS, Lee JD, Li H, Wang L, Zhai X (2018) Gradient descent finds global minima of deep neural networks. CoRR abs/1811.03804.
37. Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS (2017) Recovery guarantees for one-hidden-layer neural networks in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17. (JMLR.org), pp. 4140–4149.
38. Zhong K, Song Z, Dhillon IS (2017) Learning non-overlapping convolutional neural networks with multiple kernels. CoRR abs/1711.03440.
39. Zhang X, Yu Y, Wang L, Gu Q (2018) Learning one-hidden-layer ReLU networks via gradient descent. arXiv e-prints.
40. Li Y, Liang Y (2018) Learning overparameterized neural networks via stochastic gradient descent on structured data in Advances in Neural Information Processing Systems 31, eds. Bengio S, et al. (Curran Associates, Inc.), pp. 8157–8166.
41. Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks in International Conference on Learning Representations.
42. Zou D, Cao Y, Zhou D, Gu Q (2018) Stochastic gradient descent optimizes over-parameterized deep ReLU networks. CoRR abs/1811.08888.
43. Poggio T, Liao Q (2017) Theory II: Landscape of the empirical risk in deep learning. arXiv:1703.09833, CBMM Memo No. 066.
44. Zhang C, et al. (2017) Theory of deep learning IIb: Optimization properties of SGD. CBMM Memo No. 072.
45. Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. arXiv:180.3251 [cs, math].
46. Daniely A (2017) SGD learns the conjugate kernel class of the network in Advances in Neural Information Processing Systems 30, eds. Guyon I, et al. (Curran Associates, Inc.), pp. 2422–2430.
47. Allen-Zhu Z, Li Y, Liang Y (2018) Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR abs/1811.04918.
48. Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR abs/1901.08584.
49. Wei C, Lee JD, Liu Q, Ma T (2018) On the margin theory of feedforward neural networks. CoRR abs/1810.05369.
50. Liang T, Poggio T, Rakhlin A, Stokes J (2017) Fisher-Rao metric, geometry, and complexity of neural networks. CoRR abs/1711.01530.
51. Bousquet O, Boucheron S, Lugosi G (2003) Introduction to statistical learning theory. pp. 169–207.
52. Rosset S, Zhu J, Hastie T (2003) Margin maximizing loss functions in Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada]. pp. 1237–1244.
53. Douglas SC, Amari S, Kung SY (2000) On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal Processing 48(6):1843–1847.
54. Salimans T, Kingma DP (2016) Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems.
55. Liao Q, Miranda B, Banburski A, Hidary J, Poggio TA (2018) A surprising linear relationship predicts test performance in deep networks. CoRR abs/1807.09659.
56. Ferreira PJSG (1996) The existence and uniqueness of the minimum norm solution to certain linear and nonlinear problems. Signal Processing 55:137–139.
