
Noname manuscript No. (will be inserted by the editor)

Best k-layer neural network approximations

Lek-Heng Lim · Mateusz Michałek · Yang Qi

Received: date / Accepted: date

Abstract We show that the empirical risk minimization (ERM) problem for neural networks has no solution in general. Given a training set s_1, . . . , s_n ∈ R^p with corresponding responses t_1, . . . , t_n ∈ R^q, fitting a k-layer neural network ν_θ : R^p → R^q involves estimation of the weights θ ∈ R^m via an ERM:

inf_{θ∈R^m} ∑_{i=1}^n ‖t_i − ν_θ(s_i)‖_2^2.

We show that even for k = 2, this infimum is not attainable in general for common activations like ReLU, hyperbolic tangent, and sigmoid functions. In addition, we deduce that if one attempts to minimize such a loss function in the event when its infimum is not attainable, it necessarily results in values of θ diverging to ±∞. We will show that for smooth activations σ(x) = 1/(1 + exp(−x)) and σ(x) = tanh(x), such failure to attain an infimum can happen on a positive-measured subset of responses. For the ReLU activation σ(x) = max(0, x), we completely classify cases where the ERM for a best two-layer neural network approximation attains its infimum. In recent applications of neural networks, where overfitting is commonplace, the failure to attain an infimum is avoided by ensuring that the system of equations t_i = ν_θ(s_i), i = 1, . . . , n, has a solution, i.e., the ERM is fitted perfectly. For a two-layer ReLU-activated network, we will show when such a system of equations has a solution generically.

L.-H. Lim
Department of Statistics, University of Chicago, Chicago, IL 60637
E-mail: [email protected]

M. Michałek
Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany
E-mail: [email protected]

Y. Qi
INRIA Saclay–Île-de-France, CMAP, École Polytechnique, IP Paris, CNRS, 91128 Palaiseau Cedex, France
E-mail: [email protected]


Keywords neural network · best approximation · join loci · secant loci

Mathematics Subject Classification (2010) 92B20 · 41A50 · 41A30

1 Introduction

Let α_i : R^{d_i} → R^{d_{i+1}}, x ↦ A_i x + b_i be an affine function with A_i ∈ R^{d_{i+1}×d_i} and b_i ∈ R^{d_{i+1}}, i = 1, . . . , k. Given any fixed activation function σ : R → R, we will abuse notation slightly by also writing σ : R^d → R^d for the function where σ is applied coordinatewise, i.e., σ(x_1, . . . , x_d) = (σ(x_1), . . . , σ(x_d)), for any d ∈ N. Consider a k-layer neural network ν : R^p → R^q,

ν = α_k ∘ σ ∘ α_{k−1} ∘ · · · ∘ σ ∘ α_2 ∘ σ ∘ α_1,     (1)

obtained from alternately composing σ with affine functions k times. Note that such a function ν is parameterized (and completely determined) by its weights θ := (A_k, b_k, . . . , A_1, b_1) in

Θ := (R^{d_{k+1}×d_k} × R^{d_{k+1}}) × · · · × (R^{d_2×d_1} × R^{d_2}) ≅ R^m.     (2)

Here and throughout this article,

m := ∑_{i=1}^k (d_i + 1) d_{i+1}     (3)

will always denote the number of weights that parameterize ν. In neural networks lingo, the dimension of the ith layer d_i is also called the number of neurons in the ith layer. Whenever it is necessary to emphasize the dependence of ν on θ, we will write ν_θ for a k-layer neural network parameterized by θ ∈ Θ.
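To make (1)–(3) concrete, here is a minimal NumPy sketch (ours, not from the paper) that evaluates ν_θ for a list of weight pairs (A_i, b_i) and counts the weights m; the helper names nu, relu, and num_weights are our own choices.

```python
import numpy as np

def relu(x):
    # coordinatewise ReLU activation sigma(x) = max(0, x)
    return np.maximum(x, 0.0)

def nu(theta, s, sigma=relu):
    # theta = [(A_1, b_1), ..., (A_k, b_k)]; apply sigma after every affine layer
    # except the last, following (1)
    x = np.asarray(s, dtype=float)
    for i, (A, b) in enumerate(theta):
        x = A @ x + b
        if i < len(theta) - 1:
            x = sigma(x)
    return x

def num_weights(theta):
    # m = sum_i (d_i + 1) d_{i+1} as in (3): each layer contributes d_i*d_{i+1} + d_{i+1} entries
    return sum(A.size + b.size for A, b in theta)

# a small 2-layer example with p = 2, d_2 = 3, q = 2
rng = np.random.default_rng(0)
theta = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
         (rng.standard_normal((2, 3)), rng.standard_normal(2))]
print(nu(theta, [1.0, -1.0]), num_weights(theta))  # m = (2+1)*3 + (3+1)*2 = 17
```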

Consider the function approximation problem with¹ d_{k+1} = 1, i.e., given Ω ⊆ R^{d_1} and a target function f : Ω → R in some Banach space B = L^p(Ω), W^{k,p}(Ω), BMO(Ω), etc., how well can f be approximated by a neural network ν_θ : Ω → R in the Banach space norm ‖ · ‖_B? In other words, one is interested in the problem

inf_{θ∈Θ} ‖f − ν_θ‖_B.     (4)

The most celebrated results along these lines are the universal approximation theorems of Cybenko [5], for sigmoidal activation and L^1-norm, as well as those of Hornik et al. [9,8], for more general activations such as ReLU and L^p-norms, 1 ≤ p ≤ ∞. These results essentially say that the infimum in (4) is zero as long as k is at least two (but with no bound on d_2).

¹ Results may be extended to d_{k+1} > 1 by applying them coordinatewise, i.e., with B ⊗ R^{d_{k+1}} in place of B.

Page 3: Best k-layer neural network approximationslekheng/work/neural.pdf · k from a dictionary D, inf ’ i2D kf ’ 1 ’ 2 ’ kk B: If we denote the layers of a neural network by ’

Best k-layer neural network approximations 3

In this article, we will focus on the Hilbert space case p = 2 for simplicity, which frees up the letter p for alternative use — henceforth we denote the dimensions of the first and last layers by

p := d_1 and q := d_{k+1}

to avoid the clutter of subscripts.

Traditional studies of neural networks in approximation theory [5,7,9,8] invariably assume that Ω, the domain of the target function f in the problem (4), is an open subset of R^p. Nevertheless, any actual training of a neural network involves only finitely many points s_1, . . . , s_n ∈ Ω and the values of f on these points: f(s_1) = t_1, . . . , f(s_n) = t_n. Therefore, in reality, one only solves problem (4) for a finite Ω = {s_1, . . . , s_n} and this becomes a parameter estimation problem commonly called the empirical risk minimization problem. Let s_1, . . . , s_n ∈ Ω ⊆ R^p be a sample of n independent, identically distributed observations with corresponding responses t_1, . . . , t_n ∈ R^q. The main computational problem in supervised learning with neural networks is to fit the training set {(s_i, t_i) ∈ R^p × R^q : i = 1, . . . , n} with a k-layer neural network ν_θ : Ω → R^q so that

t_i ≈ ν_θ(s_i),  i = 1, . . . , n,

often in the least-squares sense

inf_{θ∈Θ} ∑_{i=1}^n ‖t_i − ν_θ(s_i)‖_2^2.     (5)

The responses are regarded as values of the unknown function f : Ω → R^q to be learned, i.e., t_i = f(s_i), i = 1, . . . , n. The hope is that by solving (5) for θ* ∈ R^m, the neural network obtained ν_{θ*} will approximate f well in the sense of having small generalization errors, i.e., f(s) ≈ ν_{θ*}(s) for s ∉ {s_1, . . . , s_n}. This hope has been borne out empirically in spectacular ways [10,12,16].
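As a small illustration of the least-squares objective (5), the hypothetical helper below (reusing nu and relu from the sketch above) evaluates the empirical risk for given weights; minimizing it over θ is exactly problem (5).

```python
import numpy as np  # nu() and relu() are assumed from the earlier sketch

def empirical_risk(theta, samples, responses, sigma=relu):
    # sum_i ||t_i - nu_theta(s_i)||_2^2, the objective of (5)
    return sum(np.sum((np.asarray(t, dtype=float) - nu(theta, s, sigma)) ** 2)
               for s, t in zip(samples, responses))
```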

Observe that the empirical risk estimation problem (5) is simply the function approximation problem (4) for the case when Ω is a finite set equipped with the counting measure. The problem (4) asks how well a given target function can be approximated by a given function class, in this case the class of k-layer σ-activated neural networks. This is an infinite-dimensional problem when Ω is infinite and is usually studied using functional analytic techniques. On the other hand (5) asks how well the approximation is at finitely many sample points, a finite-dimensional problem, and is therefore amenable to techniques in algebraic and differential geometry. We would like to emphasize that the finite-dimensional problem (5) is not any easier than the infinite-dimensional problem (4); they simply require different techniques. In particular, our results do not follow from the results in [3,7,14] for infinite-dimensional spaces — we will have more to say about this in Section 2.

There is a parallel with [15], where we applied methods from algebraic and differential geometry to study the empirical risk minimization problem corresponding to nonlinear approximation, i.e., where one seeks to approximate a target function by a sum of k atoms ϕ_1, . . . , ϕ_k from a dictionary D,

inf_{ϕ_i∈D} ‖f − ϕ_1 − ϕ_2 − · · · − ϕ_k‖_B.

If we denote the layers of a neural network by ϕ_i ∈ L, then (4) may be written in a form that parallels the above:

inf_{ϕ_i∈L} ‖f − ϕ_k ∘ ϕ_{k−1} ∘ · · · ∘ ϕ_1‖_B.

Again our goal is to study the corresponding empirical risk minimization problem, i.e., the approximation problem (5). The first surprise is that this may not always have a solution. For example, take n = 6 and p = q = 2 with

s_1 = [−2, 0]^T, s_2 = [−1, 0]^T, s_3 = [0, 0]^T, s_4 = [1, 0]^T, s_5 = [2, 0]^T, s_6 = [1, 1]^T,

t_1 = [2, 0]^T, t_2 = [1, 0]^T, t_3 = [0, 0]^T, t_4 = [−2, 0]^T, t_5 = [−4, 0]^T, t_6 = [0, 1]^T.

For a ReLU-activated two-layer neural network, the approximation problem (5) seeks weights θ = (A, b, B, c) that attain the infimum over all A, B ∈ R^{2×2}, b, c ∈ R^2 of the loss function

‖[2, 0]^T − (B max(A[−2, 0]^T + b, 0) + c)‖^2 + ‖[1, 0]^T − (B max(A[−1, 0]^T + b, 0) + c)‖^2
+ ‖[0, 0]^T − (B max(A[0, 0]^T + b, 0) + c)‖^2 + ‖[−2, 0]^T − (B max(A[1, 0]^T + b, 0) + c)‖^2
+ ‖[−4, 0]^T − (B max(A[2, 0]^T + b, 0) + c)‖^2 + ‖[0, 1]^T − (B max(A[1, 1]^T + b, 0) + c)‖^2,

where the maximum is taken coordinatewise. We will see in the proof of Theorem 1 that this has no solution. Any sequence of θ = (A, b, B, c) chosen so that the loss function converges to its infimum will have ‖θ‖^2 = ‖A‖_F^2 + ‖b‖_2^2 + ‖B‖_F^2 + ‖c‖_2^2 becoming unbounded — the entries of θ will diverge to ±∞ in such a way that keeps the loss function bounded and in fact convergent to its infimum.²

With a smooth activation like hyperbolic tangent in place of ReLU, we can establish a stronger nonexistence result: In Theorem 5, we show that there is a positive-measured set U ⊆ (R^q)^n such that for any target values (t_1, . . . , t_n) ∈ U, the empirical risk estimation problem (5) has no solution.

Whenever one attempts to minimize a function that does not have a minimum (i.e., the infimum is not attained), one runs into serious numerical issues. We establish this formally in Proposition 1, showing that the parameters θ must necessarily diverge to ±∞ when one attempts to minimize (5) in the event when its infimum is not attainable.

² We assume the Euclidean norm ‖θ‖^2 := ∑_{i=1}^k ‖A_i‖_F^2 + ‖b_i‖_2^2 on our parameter space Θ, but results in this article are independent of the choice of norms as all norms are equivalent on finite-dimensional spaces, another departure from the infinite-dimensional case in [7,14].


Our study here may thus shed some light on a key feature of modern deep neural networks, made possible by the abundance of computing power not available to early adopters like the authors of [5,7,9,8]. Deep neural networks are, almost by definition and certainly in practice, heavily overparameterized. In this case, any training data (s_1, t_1), . . . , (s_n, t_n) may be perfectly fitted and in essence one solves the system of neural network equations

t_i = ν_θ(s_i),  i = 1, . . . , n,     (6)

for a solution θ ∈ Θ, thereby circumventing the issue of whether the infimum in (5) can be attained. This, in our view, is the reason ill-posedness issues did not prevent the recent success of neural networks. For a two-layer ReLU-activated neural network, we will address the question of whether (6) has a solution generically in Section 5.

A word about our use of the term “ill-posedness” is in order: Recall that a problem is said to be ill-posed if a solution either (i) does not exist, (ii) exists but is not unique, or (iii) exists and is unique but does not depend continuously on the input parameters of the problem. When we claim that the problem (5) is ill-posed, it is in the sense of (i), clearly the most serious issue of the three — (ii) and (iii) may be ameliorated with regularization or other strategies but not (i). We use “ill-posedness” in the sense of (i) throughout our article. The work in [13], for example, is about ill-posedness in the sense of (ii).

2 Geometry of empirical risk minimization for neural networks

Given that we are interested in the behavior of ν_θ as a function of the weights θ, we will rephrase (5) to put it on more relevant footing. Let the sample s_1, . . . , s_n ∈ R^p and responses t_1, . . . , t_n ∈ R^q be arbitrary but fixed. Henceforth we will assemble the sample into a design matrix,

S := [s_1^T, . . . , s_n^T]^T ∈ R^{n×p},     (7)

and the corresponding responses into a response matrix,

T := [t_1^T, . . . , t_n^T]^T ∈ R^{n×q}.     (8)

Here and for the rest of this article, we use the following numerical linear algebra conventions:

• a vector a ∈ R^n will always be regarded as a column vector;
• a row vector will always be denoted a^T for some column vector a;
• a matrix A ∈ R^{n×p} may be denoted either
  – as a list of its column vectors A = [a_1, . . . , a_p], i.e., a_1, . . . , a_p ∈ R^n are columns of A;
  – or as a list of its row vectors A = [α_1^T, . . . , α_n^T]^T, i.e., α_1, . . . , α_n ∈ R^p are rows of A.

We will also adopt the convention that treats a direct sum of p subspaces (resp. cones) in R^n as a subspace (resp. cone) of R^{n×p}: If V_1, . . . , V_p ⊆ R^n are subspaces (resp. cones), then

V_1 ⊕ · · · ⊕ V_p := {[v_1, . . . , v_p] ∈ R^{n×p} : v_1 ∈ V_1, . . . , v_p ∈ V_p}.     (9)

Let ν_θ : R^p → R^q be a k-layer neural network, k ≥ 2. We define the weights map ψ_k : Θ → R^{n×q} by

ψ_k(θ) = [ν_θ(s_1)^T, ν_θ(s_2)^T, . . . , ν_θ(s_n)^T]^T ∈ R^{n×q}.

In other words, for a fixed sample, ψ_k is ν_θ regarded as a function of the weights θ. The empirical risk minimization problem is (5) rewritten as

inf_{θ∈Θ} ‖T − ψ_k(θ)‖_F,     (10)

where ‖ · ‖_F denotes the Frobenius norm. We may view (10) as a matrix approximation problem — finding a matrix in

ψ_k(Θ) = {ψ_k(θ) ∈ R^{n×q} : θ ∈ Θ}     (11)

that is nearest to a given matrix T ∈ R^{n×q}.

Definition 1 We will call the set ψ_k(Θ) the image of weights of the k-layer neural network ν_θ and the corresponding problem (10) a best k-layer neural network approximation problem.

As we noted in (2), the space of all weights Θ is essentially the Euclidean space R^m and uninteresting geometrically, but the image of weights ψ_k(Θ), as we will see in this article, has complicated geometry (e.g., for k = 2 and ReLU activation, it is the join locus of a line and the secant locus of a cone — see Theorem 2). In fact, the geometry of the neural network is the geometry of ψ_k(Θ). We expect that it will be pertinent to understand this geometry if one wants to understand neural networks at a deeper level. For one, the nontrivial geometry of ψ_k(Θ) is the reason that the best k-layer neural network approximation problem, which is to find a point in ψ_k(Θ) closest to a given T ∈ R^{n×q}, lacks a solution in general.

Indeed, the most immediate mathematical issues with the approximation problem (10) are the existence and uniqueness of solutions:


(i) a nearest point may not exist since the set ψ_k(Θ) may not be a closed subset of R^{n×q}, i.e., the infimum in (10) may not be attainable;
(ii) even if it exists, the nearest point may not be unique, i.e., the infimum in (10) may be attained by two or more points in ψ_k(Θ).

As a reminder, a problem is said to be ill-posed if it lacks existence and uniqueness guarantees. Ill-posedness creates numerous difficulties both logical (what does it mean to find a solution when it does not exist?) and practical (which solution do we find when there is more than one?). In addition, a well-posed problem near an ill-posed one is the very definition of an ill-conditioned problem [4], which presents its own set of difficulties. In general, ill-posed problems are not only to be avoided but also delineated to reveal the region of ill-conditioning.

For the function approximation problem (4) with Ω an open subset of R^p, the nonexistence issue of a best neural network approximant is very well-known, dating back to [7], with a series of follow-up works, e.g., [3,14]. But for the case that actually occurs in the training of a neural network, i.e., where Ω is a finite set, (4) becomes (5) or (10) and its well-posedness has never been studied, to the best of our knowledge. Our article seeks to address this gap. The geometry of the set in (11) will play an important role in studying these problems, much like the role played by the geometry of rank-k tensors in [15].

We will show that for many networks, the problem (10) is ill-posed. We have already mentioned an explicit example at the end of Section 1 for the ReLU activation:

σ_max(x) := max(0, x),     (12)

where (10) lacks a solution; we will discuss this in detail in Section 4. Perhaps more surprisingly, for the sigmoidal and hyperbolic tangent activation functions:

σ_exp(x) := 1/(1 + exp(−x))  and  σ_tanh(x) := tanh(x),

we will see in Section 6 that (10) lacks a solution with positive probability, i.e., there exists an open set U ⊆ R^{n×q} such that for any T ∈ U there is no nearest point in ψ_k(Θ). A similar phenomenon is known for real tensors [6, Section 8].

For neural networks with ReLU activation, we are unable to establish similar “failure with positive probability” results, but the geometry of the problem (10) is actually simpler in this case. For a two-layer ReLU-activated network, we can completely characterize the geometry of ψ_2(Θ), which provides us with greater insights as to why (10) generally lacks a solution. We can also determine the dimension of ψ_2(Θ) in many instances. These will be discussed in Section 5.

The following map will play a key role in this article and we give it a name to facilitate exposition.

Definition 2 (ReLU projection) The map σ_max : R^d → R^d where the ReLU activation (12) is applied coordinatewise will be called a ReLU projection. For any Ω ⊆ R^d, σ_max(Ω) ⊆ R^d will be called a ReLU projection of Ω.


Note that a ReLU projection is a linear projection when restricted to any orthant of R^d.

3 Geometry of a “one-layer neural network”

We start by studying the ‘first part’ of a two-layer ReLU-activated neural network:

R^p --α--> R^q --σ_max--> R^q

and, slightly abusing terminology, call this a one-layer ReLU-activated neural network. Note that the weights here are θ = (A, b) ∈ R^{q×(p+1)} = Θ with A ∈ R^{q×p}, b ∈ R^q that define the affine map α(x) = Ax + b.

Let the sample S = [s_1^T, . . . , s_n^T]^T ∈ R^{n×p} be fixed. Define the weight map ψ_1 : Θ → R^{n×q} by

ψ_1(A, b) = σ_max([As_1 + b, . . . , As_n + b]^T),     (13)

where σ_max is applied coordinatewise.

Recall that a cone C ⊆ R^d is simply a set invariant under scaling by positive scalars, i.e., if x ∈ C, then λx ∈ C for all λ > 0. Note that we do not assume that a cone has to be convex; in particular, in our article the dimension of a cone C refers to its dimension as a semialgebraic set.

Definition 3 (ReLU cone) Let S = [s_1^T, . . . , s_n^T]^T ∈ R^{n×p}. The ReLU cone of S is the set

C_max(S) := {σ_max([a^T s_1 + b, . . . , a^T s_n + b]^T) ∈ R^n : a ∈ R^p, b ∈ R}.

The ReLU cone is clearly a cone. Such cones will form the building blocks for the image of weights ψ_k(Θ). In fact, it is easy to see that C_max(S) is exactly ψ_1(Θ) in the case when q = 1. The next lemma describes the geometry of ReLU cones in greater detail.

Lemma 1 Let S = [s_1^T, . . . , s_n^T]^T ∈ R^{n×p}.

(i) The set C_max(S) is always a closed pointed cone of dimension

dim C_max(S) = rank[S, 1].     (14)

Here [S, 1] ∈ R^{n×(p+1)} is S augmented with an extra column 1 := [1, . . . , 1]^T ∈ R^n, the vector of all ones.

(ii) A set C ⊆ R^n is a ReLU cone if and only if it is a ReLU projection of some linear subspace in R^n containing the vector 1.


Proof Consider the map in (13) with q = 1. Then ψ_1 : Θ → R^n is given by a composition of the linear map

Θ → R^n,  (a, b) ↦ [s_1^T a + b, . . . , s_n^T a + b]^T = [S, 1][a^T, b]^T,

whose image is a linear space L of dimension rank[S, 1] containing 1, and the ReLU projection σ_max : R^n → R^n. Since C_max(S) = ψ_1(Θ), it is a ReLU projection of a linear subspace in R^n, which is clearly a closed pointed cone.

For its dimension, note that a ReLU projection is a linear projection on each orthant and thus cannot increase dimension. On the other hand, since 1 ∈ L, we know that L intersects the interior of the nonnegative orthant, on which σ_max is the identity; thus σ_max preserves dimension and we have (14).

Conversely, a ReLU projection of any linear space L may be realized as C_max(S) for some choice of S — just choose S so that the image of the matrix [S, 1] is L. ⊓⊔

It follows from Lemma 1 that the image of the weights of a one-layer ReLU neural network has the geometry of a direct sum of q closed pointed cones. Recall our convention for direct sums in (9).

Corollary 1 Consider the one-layer ReLU-activated neural network

R^p --α_1--> R^q --σ_max--> R^q.

Let S ∈ R^{n×p}. Then ψ_1(Θ) ⊆ R^{n×q} has the structure of a direct sum of q copies of C_max(S) ⊆ R^n. More precisely,

ψ_1(Θ) = {[v_1, . . . , v_q] ∈ R^{n×q} : v_1, . . . , v_q ∈ C_max(S)}.     (15)

In particular, ψ_1(Θ) is a closed pointed cone of dimension q · rank[S, 1] in R^{n×q}.

Proof Each row of the matrix representing α_1 can be identified with an affine map as in Lemma 1. The conclusion then follows from Lemma 1. ⊓⊔

Given Corollary 1, one might perhaps think that a two-layer neural network

R^{d_1} --α_1--> R^{d_2} --σ_max--> R^{d_2} --α_2--> R^{d_3}

would also have a closed image of weights ψ_2(Θ). This turns out to be false. We will show that the image ψ_2(Θ) may not be closed.

As a side remark, note that Definition 3 and Lemma 1 are peculiar to the ReLU activation. For smooth activations like σ_exp and σ_tanh, the image of weights ψ_k(Θ) is almost never a cone and Lemma 1 does not hold in multiple ways.


4 Ill-posedness of best k-layer neural network approximation

The k = 2 case is the simplest and yet already nontrivial in that it has the universal approximation property, as we mentioned earlier. The main content of the next theorem is in its proof, which is constructive and furnishes an explicit example of a function that does not have a best approximation by a two-layer neural network. The reader is reminded that even training a two-layer neural network is already an NP-hard problem [1,2].

Theorem 1 (Ill-posedness of neural network approximation I) The best two-layer ReLU-activated neural network approximation problem

inf_{θ∈Θ} ‖T − ψ_2(θ)‖_F     (16)

is ill-posed, i.e., the infimum in (16) cannot be attained in general.

Proof We will construct an explicit two-layer ReLU-activated network whose image of weights is not closed. Let d_1 = d_2 = d_3 = 2. For the two-layer ReLU-activated network

R^2 --α_1--> R^2 --σ_max--> R^2 --α_2--> R^2,

the weights take the form

θ = (A_1, b_1, A_2, b_2) ∈ R^{2×2} × R^2 × R^{2×2} × R^2 = Θ ≅ R^{12},

i.e., m = 12. Consider a sample S of size n = 6 given by

s_i = [i − 3, 0]^T,  i = 1, . . . , 5,  s_6 = [1, 1]^T,

that is

S = [ −2  0
      −1  0
       0  0
       1  0
       2  0
       1  1 ] ∈ R^{6×2}.

Thus the weight map ψ_2 : Θ → R^{6×2}, or, more precisely,

ψ_2 : R^{2×2} × R^2 × R^{2×2} × R^2 → R^{6×2},

is given by

ψ_2(θ) = [ν_θ(s_1)^T, . . . , ν_θ(s_6)^T]^T = [(A_2 max(A_1 s_1 + b_1, 0) + b_2)^T, . . . , (A_2 max(A_1 s_6 + b_1, 0) + b_2)^T]^T ∈ R^{6×2}.


We claim that the image of weights ψ_2(Θ) is not closed — the point

T = [  2  0
       1  0
       0  0
      −2  0
      −4  0
       0  1 ] ∈ R^{6×2}     (17)

is in the closure of ψ_2(Θ) but not in ψ_2(Θ). Therefore for this choice of T, the infimum in (16) is zero but is never attainable by any point in ψ_2(Θ).

We will first prove that T is in the closure of ψ_2(Θ): Consider a sequence of affine transformations α_1^{(k)}, k = 1, 2, . . . , defined by

α_1^{(k)}(s_i) = (i − 3)[−1, 1]^T,  i = 1, . . . , 5,  α_1^{(k)}(s_6) = [2k, k]^T,

and set

α_2^{(k)}([x, y]^T) = [x − 2y, y/k]^T.

The sequence of two-layer neural networks

ν^{(k)} = α_2^{(k)} ∘ σ_max ∘ α_1^{(k)},  k ∈ N,

has weights given by

θ_k = ( [ −1  2k+1 ; 1  k−1 ],  [0, 0]^T,  [ 1  −2 ; 0  1/k ],  [0, 0]^T ),     (18)

where a semicolon separates the rows of a matrix, and

lim_{k→∞} ψ_2(θ_k) = T.

This shows that T in (17) is indeed in the closure of ψ_2(Θ).

We next show by contradiction that T ∉ ψ_2(Θ). Write T = [t_1^T, . . . , t_6^T]^T ∈ R^{6×2} where t_1, . . . , t_6 ∈ R^2 are as in (17). Suppose T ∈ ψ_2(Θ). Then there exist some affine maps β_1, β_2 : R^2 → R^2 such that

t_i = β_2 ∘ σ_max ∘ β_1(s_i),  i = 1, . . . , 6.

As t_1, t_3, t_6 are affinely independent, β_2 has to be an affine isomorphism. Hence the five points

σ_max(β_1(s_1)), . . . , σ_max(β_1(s_5))     (19)

have to lie on a line in R^2, being the images of the collinear points t_1, . . . , t_5 under the affine isomorphism β_2^{−1}. Also, note that

β_1(s_1), . . . , β_1(s_5)     (20)

lie on a (different) line in R^2 since β_1 is an affine map. The five points in (20) have to be in the same quadrant, otherwise the points in (19) could not lie on a line or have successive distances δ, δ, 2δ, 2δ for some δ > 0.


Note that σ_max : R^2 → R^2 is the identity in the first quadrant, the projection to the y-axis in the second, the projection to the point (0, 0) in the third, and the projection to the x-axis in the fourth. So σ_max takes points with equal successive distances in the first, second, or fourth quadrant to points with equal successive distances in the same quadrant; and it takes all points in the third quadrant to the origin. Hence σ_max cannot take the five collinear points in (20), with equal successive distances, to the five collinear points in (19), with successive distances δ, δ, 2δ, 2δ. This yields the required contradiction. ⊓⊔
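As a numerical sanity check (ours, not part of the paper), the NumPy sketch below evaluates the squared loss ‖T − ψ_2(θ_k)‖_F^2 along the weight sequence (18) for the sample and target in (17): the loss equals 5/k^2 and tends to 0, while the entries of θ_k blow up, which is exactly the behavior described above and formalized in Proposition 1 below.

```python
import numpy as np

S = np.array([[-2, 0], [-1, 0], [0, 0], [1, 0], [2, 0], [1, 1]], dtype=float)
T = np.array([[2, 0], [1, 0], [0, 0], [-2, 0], [-4, 0], [0, 1]], dtype=float)

def psi2(A1, b1, A2, b2, S):
    # rows are nu_theta(s_i)^T = (A2 max(A1 s_i + b1, 0) + b2)^T
    H = np.maximum(S @ A1.T + b1, 0.0)
    return H @ A2.T + b2

for k in [1, 10, 100, 1000]:
    A1 = np.array([[-1.0, 2 * k + 1], [1.0, k - 1.0]])   # the weights theta_k of (18)
    A2 = np.array([[1.0, -2.0], [0.0, 1.0 / k]])
    b1 = b2 = np.zeros(2)
    loss = np.sum((T - psi2(A1, b1, A2, b2, S)) ** 2)
    size = np.sqrt(np.sum(A1 ** 2) + np.sum(A2 ** 2))
    print(k, loss, size)   # loss = 5/k^2 -> 0 while ||theta_k|| grows like k
```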

We would like to emphasize that the example constructed in the proof of Theorem 1, which is about the nonclosedness of the image of weights within a finite-dimensional space of response matrices, differs from the examples in [7,14], which are about the nonclosedness of the class of neural networks within an infinite-dimensional space of target functions. Another difference is that here we have considered the ReLU activation σ_max as opposed to the sigmoidal activation σ_exp in [7,14]. The ‘neural networks’ studied in [3] differ from standard usage [5,7,9,8,14] in the sense of (1); they are defined as a linear combination of neural networks whose weights are fixed in advance, and approximation is in the sense of finding the coefficients of such linear combinations. As such, the results in [3] are irrelevant to our discussion.

As we pointed out at the end of Section 1, and as the reader might also have observed in the proof of Theorem 1, the sequence of weights θ_j in (18) contains entries that become unbounded as j → ∞. This is not peculiar to the sequence we chose in (18); by the same discussion in [6, Section 4.3], this will always be the case:

Proposition 1 If the infimum in (10) is not attainable, then any sequence of weights θ_j ∈ Θ with

lim_{j→∞} ‖T − ψ_k(θ_j)‖_F = inf_{θ∈Θ} ‖T − ψ_k(θ)‖_F

must be unbounded, i.e.,

lim sup_{j→∞} ‖θ_j‖ = ∞.

This holds regardless of the choice of activation and the number of layers.

The implication of Proposition 1 is that if one attempts to forcibly fit a neural network to target values T where (16) does not have a solution, then it simply results in the parameters θ blowing up to ±∞.

5 Generic solvability of the neural network equations

As we mentioned at the end of Section 1, modern deep neural networks avoid the issue that (10) may not attain its infimum by overfitting. In that case the relevant problem is no longer one of approximation but becomes one of solving a system of what we called neural network equations (6), rewritten here as

ψ_k(θ) = T.     (21)


Whether (21) has a solution is not determined by dim Θ but by dim ψ_k(Θ). If the dimension of the image of weights equals nq, i.e., the dimension of the ambient space R^{n×q} in which T lies, then (21) has a solution generically. In this section, we will provide various expressions for dim ψ_k(Θ) for k = 2 and the ReLU activation.

There is a deeper geometrical explanation behind Theorem 1. Let d ∈ N. The join locus of X_1, . . . , X_r ⊆ R^d is the set

Join(X_1, . . . , X_r) := {λ_1 x_1 + · · · + λ_r x_r ∈ R^d : x_i ∈ X_i, λ_i ∈ R, i = 1, . . . , r}.     (22)

A special case is when X_1 = · · · = X_r = X, in which case the join locus is called the rth secant locus

Σ_r(X) = {λ_1 x_1 + · · · + λ_r x_r ∈ R^d : x_i ∈ X, λ_i ∈ R, i = 1, . . . , r}.

An example of a join locus is the set of “sparse-plus-low-rank” matrices [15, Section 8.1]; an example of an rth secant locus is the set of rank-r tensors [15, Section 7].

From a geometrical perspective, we will next show that for k = 2 and q = 1 the set ψ_2(Θ) has the structure of a join locus. Join loci are known in general to be nonclosed [18]. For this reason, the ill-posedness of the best k-layer neural network problem is not unlike that of the best rank-r approximation problem for tensors, which is a consequence of the nonclosedness of the secant loci of the Segre variety [6].

We shall begin with the case of a two-layer neural network with one-dimensional output, i.e., q = 1. In this case, ψ_2(Θ) ⊆ R^n and we can describe its geometry very precisely. The more general case where q > 1 will be in Theorem 4.

Theorem 2 (Geometry of two-layer neural network I) Consider the two-layer ReLU-activated network with p-dimensional inputs and d neurons in the hidden layer:

R^p --α_1--> R^d --σ_max--> R^d --α_2--> R.     (23)

The image of weights is given by

ψ_2(Θ) = Join(Σ_d(C_max(S)), span{1}),

where

Σ_d(C_max(S)) := ⋃_{y_1, . . . , y_d ∈ C_max(S)} span{y_1, . . . , y_d} ⊆ R^n

is the dth secant locus of the ReLU cone and span{1} is the one-dimensional subspace spanned by 1 ∈ R^n.


Proof Let α_1(z) = A_1 z + b, where

A_1 = [a_1^T, . . . , a_d^T]^T ∈ R^{d×p}  and  b = [b_1, . . . , b_d]^T ∈ R^d.

Let α_2(z) = A_2^T z + λ, where A_2^T = (c_1, . . . , c_d) ∈ R^d. Then x = (x_1, . . . , x_n)^T ∈ ψ_2(Θ) if and only if

x_1 = c_1 σ(s_1^T a_1 + b_1) + · · · + c_d σ(s_1^T a_d + b_d) + λ,
⋮
x_n = c_1 σ(s_n^T a_1 + b_1) + · · · + c_d σ(s_n^T a_d + b_d) + λ.     (24)

For i = 1, . . . , d, define the vector y_i = (σ(s_1^T a_i + b_i), . . . , σ(s_n^T a_i + b_i))^T ∈ R^n, which belongs to C_max(S) by Corollary 1. Thus, (24) is equivalent to

x = c_1 y_1 + · · · + c_d y_d + λ1     (25)

for some c_1, . . . , c_d ∈ R. By the definition of the secant locus, (25) is equivalent to the statement

x ∈ Σ_d(C_max(S)) + λ1,

which completes the proof. ⊓⊔

From a practical point of view, we are most interested in basic topological issues like whether the image of weights ψ_k(Θ) is closed or not, since this affects the solvability of (16). However, the geometrical description of ψ_2(Θ) in Theorem 2 will allow us to deduce bounds on its dimension. Note that the dimension of the space of weights Θ as in (2) is just m as in (3), but this is not the true dimension of the neural network, which should instead be that of the image of weights ψ_k(Θ).

In general, even for k = 2, it will be difficult to obtain the exact dimension of ψ_2(Θ) for an arbitrary two-layer network (23). In the next corollary, we deduce from Theorem 2 an upper bound dependent on the sample S ∈ R^{n×p} and another independent of it.

Corollary 2 For the two-layer ReLU-activated network (23), we have

dim ψ_2(Θ) ≤ d · rank[S, 1] + 1,

and in particular,

dim ψ_2(Θ) ≤ min(d(min(p, n) + 1) + 1, pn).

When n is sufficiently large and the observations s_1, . . . , s_n are sufficiently general,³ we may deduce a more precise value of dim ψ_2(Θ). Before describing our results, we introduce several notations. For any index set I ⊆ {1, . . . , n} and sample S ∈ R^{n×p}, we write

R^n_I := {x ∈ R^n : x_i = 0 if i ∉ I and x_j > 0 if j ∈ I},  F_I(S) := C_max(S) ∩ R^n_I.

³ Here and in Lemma 2 and Corollary 3, ‘general’ is used in the sense of algebraic geometry: A property is general if the set of points that does not have it is contained in a Zariski closed subset that is not the whole space.

Note that F_∅(S) = {0}, 1 ∈ F_{1,...,n}(S), and C_max(S) may be expressed as

C_max(S) = F_{I_1}(S) ∪ · · · ∪ F_{I_ℓ}(S),     (26)

for some index sets I_1, . . . , I_ℓ ⊆ {1, . . . , n} and ℓ ∈ N minimum.

Lemma 2 Given a general x ∈ R^n and any integer k, 1 ≤ k ≤ n, there is a k-element subset I ⊆ {1, . . . , n} and a λ ∈ R such that σ(λ1 + x) ∈ R^n_I.

Proof Let x = (x_1, . . . , x_n)^T ∈ R^n be general. Without loss of generality, we may assume its coordinates are in ascending order x_1 < · · · < x_n. For any k with 1 ≤ k ≤ n, choose λ so that x_{n−k} < λ < x_{n−k+1}, where we set x_0 = −∞. Then σ(x − λ1) ∈ R^n_{n−k+1,...,n}. ⊓⊔

Lemma 3 Let n ≥ p + 1. There is a nonempty open subset of vectors v_1, . . . , v_p ∈ R^n such that for any p + 1 ≤ k ≤ n, there are a k-element subset I ⊆ {1, . . . , n} and λ_1, . . . , λ_p, µ ∈ R where

σ(λ_1 1 + v_1), . . . , σ(λ_p 1 + v_p), σ(µ1 + v_1) ∈ R^n_I

are linearly independent.

Proof For each i = 1, . . . , p, we choose a general v_i = (v_{i,1}, . . . , v_{i,n})^T ∈ R^n so that

v_{i,1} < · · · < v_{i,n}.

For any fixed k with p + 1 ≤ k ≤ n, by Lemma 2, we can find λ_1, . . . , λ_p, µ ∈ R such that σ(v_i − λ_i 1) ∈ R^n_{n−k+1,...,n}, i = 1, . . . , p, and σ(v_1 − µ1) ∈ R^n_{n−k+1,...,n}. By the generality of the v_i’s, the vectors σ(v_1 − λ_1 1), . . . , σ(v_p − λ_p 1), σ(v_1 − µ1) are linearly independent. ⊓⊔

We are now ready to state our main result on the dimension of the image of weights of a two-layer ReLU-activated neural network.

Theorem 3 (Dimension of two-layer neural network I) Let n ≥ d(p + 1) + 1, where p is the dimension of the input and d is the dimension of the hidden layer. Then there is a nonempty open subset of samples S ∈ R^{n×p} such that the image of weights for the two-layer ReLU-activated network (23) has dimension

dim ψ_2(Θ) = d(p + 1) + 1.     (27)


Proof The rows of S = [s_1^T, . . . , s_n^T]^T ∈ R^{n×p} are the n samples s_1, . . . , s_n ∈ R^p. In this case it will be more convenient to consider the columns of S, which we will denote by v_1, . . . , v_p ∈ R^n. Denote the coordinates by v_i = (v_{i,1}, . . . , v_{i,n})^T, i = 1, . . . , p. Consider the nonempty open subset

U := {S = [v_1, . . . , v_p] ∈ R^{n×p} : v_{i,1} < · · · < v_{i,n}, i = 1, . . . , p}.     (28)

Define the index sets J_i ⊆ {1, . . . , n} by

J_i := {n − i(p + 1) + 1, . . . , n},  i = 1, . . . , d.

By Lemma 3,

dim F_{J_i}(S) = rank[S, 1] = p + 1,  i = 1, . . . , d.

When S ∈ U is sufficiently general,

span F_{J_1}(S) + · · · + span F_{J_d}(S) = span F_{J_1}(S) ⊕ · · · ⊕ span F_{J_d}(S).     (29)

Given any I ⊆ {1, . . . , n} with F_I(S) ≠ ∅, we have that for any x, y ∈ F_I(S) and any a, b > 0, ax + by ∈ F_I(S). This implies that

dim F_I(S) = dim span F_I(S) = dim Σ_r(F_I(S))     (30)

for any r ∈ N. Let I_1, . . . , I_ℓ ⊆ {1, . . . , n} and ℓ ∈ N be chosen as in (26). Then

Σ_d(C_max(S)) = ⋃_{1 ≤ i_1 ≤ · · · ≤ i_d ≤ ℓ} Join(F_{I_{i_1}}(S), . . . , F_{I_{i_d}}(S)).

Now choose I_{i_1} = J_1, . . . , I_{i_d} = J_d. By (29) and (30),

dim Join(F_{J_1}(S), . . . , F_{J_d}(S)) = ∑_{i=1}^d dim F_{J_i}(S).

Therefore

dim ψ_2(Θ) = dim Join(F_{J_1}(S), . . . , F_{J_d}(S)) + dim span{1} = d · rank[S, 1] + 1,

which gives us (27). ⊓⊔

A consequence of Theorem 3 is that the dimension formula (27) holds for any general sample s_1, . . . , s_n ∈ R^p when n is sufficiently large.

Corollary 3 (Dimension of two-layer neural network II) Let n ≫ pd. Then for a general S ∈ R^{n×p}, the image of weights for the two-layer ReLU-activated network (23) has dimension

dim ψ_2(Θ) = d(p + 1) + 1.


Proof Let the notation be as in the proof of Theorem 3. When n is sufficiently large, we can find a subset

I = {i_1, . . . , i_{d(p+1)+1}} ⊆ {1, . . . , n}

such that either

v_{j,i_1} < · · · < v_{j,i_{d(p+1)+1}}  or  v_{j,i_1} > · · · > v_{j,i_{d(p+1)+1}}

for each j = 1, . . . , p. The conclusion then follows from Theorem 3. ⊓⊔

For deeper networks one may have m ≫ dim ψ_k(Θ) even for n ≫ 0. Consider a k-layer network with one neuron in every layer, i.e.,

d_1 = d_2 = · · · = d_k = d_{k+1} = 1.

For any samples s_1, . . . , s_n ∈ R, we may assume s_1 ≤ · · · ≤ s_n without loss of generality. Then the image of weights ψ_k(Θ) ⊆ R^n may be described as follows: a point x = (x_1, . . . , x_n)^T ∈ ψ_k(Θ) if and only if

x_1 = · · · = x_ℓ ≤ · · · ≤ x_{ℓ+ℓ′} = · · · = x_n

and

x_{ℓ+1}, . . . , x_{ℓ+ℓ′−1} are the affine images of s_{ℓ+1}, . . . , s_{ℓ+ℓ′−1}.

In particular, as soon as k ≥ 3 the image of weights ψ_k(Θ) does not change and its dimension remains constant for any n ≥ 6.

We next address the case where q > 1. One might think that by the q = 1 case in Theorem 2 and the “one-layer” case in Corollary 1, the image of weights ψ_2(Θ) ⊆ R^{n×q} in this case is simply the direct sum of q copies of Join(Σ_d(C_max(S)), span{1}). It is in fact only a subset of that, i.e.,

ψ_2(Θ) ⊆ {[x_1, . . . , x_q] ∈ R^{n×q} : x_1, . . . , x_q ∈ Join(Σ_d(C_max(S)), span{1})},

but equality does not in general hold.

Theorem 4 (Geometry of two-layer neural network II) Consider the two-layer ReLU-activated network with p-dimensional inputs, d neurons in the hidden layer, and q-dimensional outputs:

R^p --α_1--> R^d --σ_max--> R^d --α_2--> R^q.     (31)

The image of weights is given by

ψ_2(Θ) = {[x_1, . . . , x_q] ∈ R^{n×q} : there exist y_1, . . . , y_d ∈ C_max(S) such that x_i ∈ span{1, y_1, . . . , y_d}, i = 1, . . . , q}.


Proof Let X = [x_1, . . . , x_q] ∈ ψ_2(Θ) ⊆ R^{n×q}. Suppose that the affine map α_2 : R^d → R^q is given by α_2(x) = A^T x + b where A = [a_1, . . . , a_q] ∈ R^{d×q} and b = (b_1, . . . , b_q)^T ∈ R^q. Then each x_i is realized as in (25) in the proof of Theorem 2. Therefore we conclude that [x_1, . . . , x_q] ∈ ψ_2(Θ) if and only if there exist y_1, . . . , y_d ∈ C_max(S) with

x_i = b_i 1 + ∑_{j=1}^d a_{ij} y_j,  i = 1, . . . , q,

for some b_i, a_{ij} ∈ R, i = 1, . . . , q, j = 1, . . . , d. ⊓⊔

With Theorem 4, we may deduce analogues of (part of) Theorem 3 and Corollary 3 for the case q > 1. The proofs are similar to those of Theorem 3 and Corollary 3.

Corollary 4 (Dimension of two-layer neural network III) The image of weights of the two-layer ReLU-activated network (31) with p-dimensional inputs, d neurons in the hidden layer, and q-dimensional outputs has dimension

dim ψ_2(Θ) = (q + rank[S, 1])d + q.

If the sample size n is sufficiently large, then for a general S ∈ R^{n×p}, the dimension is

dim ψ_2(Θ) = (p + q + 1)d + q.

Note that by (3),

dim Θ = (p + 1)d + (d + 1)q = dim ψ_2(Θ)

in the latter case of Corollary 4, as we expect.

Note that when dim ψ_2(Θ) = dim R^{n×q}, ψ_2(Θ) becomes a dense subset of R^{n×q}. Thus the expressions for dim ψ_2(Θ) in Theorem 3, Corollaries 3 and 4 allow one to ascertain when the neural network equations (21) have a solution generically, namely, when dim ψ_2(Θ) = nq.

6 Smooth activations

For smooth activations like the sigmoidal and hyperbolic tangent functions, we expect the geometry of the image of weights to be considerably more difficult to describe. Nevertheless, when it comes to the ill-posedness of the best k-layer neural network problem (10), it is easy to deduce not only that there is a T ∈ R^{n×q} such that (10) does not attain its infimum, but that there is a positive-measured set of such T’s.

The phenomenon is already visible in the one-dimensional case p = q = 1 and can be readily extended to arbitrary p and q. Take the sigmoidal activation σ_exp(x) = 1/(1 + exp(−x)). Let n = 1. So the sample S and response matrix


T are both in R^{1×1} = R. Suppose S ≠ 0. Then for a σ_exp-activated k-layer neural network of arbitrary k ∈ N,

ψ_k(Θ) = (0, 1).

Therefore any T ≥ 1 or T ≤ 0 will not have a best approximation by points in ψ_k(Θ). The same argument works for the hyperbolic tangent activation σ_tanh(x) = tanh(x) or indeed any activation σ whose range is a proper open interval. In this sense, the ReLU activation σ_max is special in that its range is not an open interval.

To show that the n = 1 assumption above is not the cause of the ill-posedness, we provide a more complicated example with n = 3. Again we will keep p = q = 1 and let

s_1 = 0, s_2 = 1, s_3 = 2;  t_1 = 0, t_2 = 2, t_3 = 1.

Consider a k = 2 layer neural network with hyperbolic tangent activation

R --α_1--> R --σ_tanh--> R --α_2--> R.     (32)

Note that its weights take the form

θ = (a, b, c, d) ∈ R × R × R × R ≅ R^4,

and thus Θ = R^4. It is also straightforward to see that

ψ_2(Θ) = { ( c(e^b − e^{−b})/(e^b + e^{−b}) + d,  c(e^{a+b} − e^{−a−b})/(e^{a+b} + e^{−a−b}) + d,  c(e^{2a+b} − e^{−2a−b})/(e^{2a+b} + e^{−2a−b}) + d )^T ∈ R^3 : a, b, c, d ∈ R },

i.e., ψ_2(θ) = (c tanh(b) + d, c tanh(a + b) + d, c tanh(2a + b) + d)^T.

For ε > 0, consider the open set of response matrices

U(ε) := { (t′_1, t′_2, t′_3)^T ∈ R^3 : |t_1 − t′_1| < ε, |t_2 − t′_2| < ε, |t_3 − t′_3| < ε }.

We claim that for ε small enough, any response matrix T′ = (t′_1, t′_2, t′_3)^T ∈ U(ε) will not have a best approximation in ψ_2(Θ).

Any best approximation of T = (0, 2, 1)^T in the closure of ψ_2(Θ) must take the form (0, y, y)^T for some y ∈ [1, 2]. On the other hand, (0, y, y)^T ∉ ψ_2(Θ) for any y ∈ [1, 2] and thus T does not have a best approximation in ψ_2(Θ). Similarly, for small ε > 0 and T′ = (t′_1, t′_2, t′_3)^T ∈ U(ε), a best approximation of T′ in the closure of ψ_2(Θ) must take the form (t′_1, y, y)^T for some y ∈ [t′_3, t′_2]. Since (t′_1, y, y)^T ∉ ψ_2(Θ) for any y ∈ [t′_3, t′_2], T′ has no best approximation in ψ_2(Θ). Thus for small enough ε > 0, the infimum in (10) is unattainable for any T ∈ U(ε), a nonempty open set.

We summarize the conclusion of the above discussion in the following theorem.


Theorem 5 (Ill-posedness of neural network approximation II) There exists a positive-measured set U ⊆ R^{n×q} and some S ∈ R^{n×p} such that the best k-layer neural network approximation problem (10) with hyperbolic tangent activation σ_tanh does not attain its infimum for any T ∈ U.

We leave open the question as to whether Theorem 5 holds for the ReLU activation σ_max. Despite our best efforts, we are unable to construct such an example or to show that one cannot possibly exist.

7 Concluding remarks

This article studies the best k-layer neural network approximation from the perspective of our earlier work [15], where we studied similar issues for the best k-term approximation. An important departure from [15] is that a neural network is not an algebraic object because the most common activation functions σ_max, σ_tanh, σ_exp are not polynomials; thus the algebraic techniques in [15] do not apply in our study here and are relevant at best only through analogy.

Nevertheless, by the Stone–Weierstrass theorem, continuous functions may be uniformly approximated by polynomials. This suggests that it might perhaps be fruitful to study “algebraic neural networks,” i.e., where the activation function σ is a polynomial function. This will allow us to apply the full machinery of algebraic geometry to deduce information about the image of weights ψ_k(Θ) on the one hand and to extend the field of interest from R to C on the other. In fact, one of the consequences of our results in [15] is that for an algebraic neural network over C, i.e., Θ = C^m, any response matrix T ∈ C^{n×q} will almost always have a unique best approximation in ψ_k(C^m), i.e., the approximation problem (10) attains its infimum with probability one.

Furthermore, from our perspective, the most basic questions about neural network approximations are the ones that we studied in this article, but questions like the following also arise naturally:

generic dimension for neural networks: for a general S ∈ R^{n×p} with n ≫ 0, what is the dimension of ψ_k(Θ)?

generic rank for neural networks: what is the smallest value of k ∈ N such that ψ_k(Θ) is a dense set in R^{n×q}?

These are certainly the questions that one would first try to answer about various types of tensor ranks [11] or tensor networks [17], but as far as we know, they have never been studied for neural networks. We leave these as directions for potential future work.

Acknowledgements The work of YQ and LHL is supported by DARPA D15AP00109 and NSF IIS 1546413. In addition, LHL acknowledges support from a DARPA Director’s Fellowship and the Eckhardt Faculty Fund. MM would like to thank Guido Montufar for helpful discussions. Special thanks go to the anonymous referees for their careful reading and numerous invaluable suggestions that significantly enriched the content of this article.


References

1. Blum, A., Rivest, R.L.: Training a 3-node neural network is NP-complete (extended abstract). In: Proceedings of the 1988 Workshop on Computational Learning Theory (Cambridge, MA, 1988), pp. 9–18. Morgan Kaufmann, San Mateo, CA (1989)

2. Blum, A.L., Rivest, R.L.: Training a 3-node neural network is NP-complete. In: Machine learning: from theory to applications, Lecture Notes in Comput. Sci., vol. 661, pp. 9–28. Springer, Berlin (1993). doi: 10.1007/3-540-56483-7_20

3. Burger, M., Engl, H.W.: Training neural networks with noisy data as an ill-posed problem. Adv. Comput. Math. 13(4), 335–354 (2000). doi: 10.1023/A:1016641629556

4. Bürgisser, P., Cucker, F.: Condition: The geometry of numerical algorithms, Grundlehren der Mathematischen Wissenschaften, vol. 349. Springer, Heidelberg (2013). doi: 10.1007/978-3-642-38896-5

5. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2(4), 303–314 (1989). doi: 10.1007/BF02551274

6. De Silva, V., Lim, L.H.: Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. 30(3), 1084–1127 (2008). doi: 10.1137/06066518X

7. Girosi, F., Poggio, T.: Networks and the best approximation property. Biol. Cybernet. 63(3), 169–176 (1990). doi: 10.1007/BF00195855

8. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991). doi: 10.1016/0893-6080(91)90009-T

9. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989). doi: 10.1016/0893-6080(89)90020-8

10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). doi: 10.1145/3065386

11. Landsberg, J.M.: Tensors: geometry and applications, Graduate Studies in Mathematics, vol. 128. American Mathematical Society, Providence, RI (2012)

12. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015). doi: 10.1038/nature14539

13. O'Leary-Roseberry, T., Ghattas, O.: Ill-posedness and optimization geometry for nonlinear neural network training. Preprint (2020). URL https://arxiv.org/abs/2002.02882

14. Petersen, P., Raslan, M., Voigtlaender, F.: Topological properties of the set of functions generated by neural networks of fixed size. Preprint (2018). URL https://arxiv.org/abs/1806.08459

15. Qi, Y., Michałek, M., Lim, L.H.: Complex best r-term approximations almost always exist in finite dimensions. Appl. Comput. Harmon. Anal. 49(1), 180–207 (2020). doi: 10.1016/j.acha.2018.12.003

16. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). doi: 10.1038/nature16961

17. Ye, K., Lim, L.H.: Tensor network ranks. Preprint (2018). URL https://arxiv.org/abs/1801.02662

18. Zak, F.L.: Tangents and secants of algebraic varieties, Translations of Mathematical Monographs, vol. 127. American Mathematical Society, Providence, RI (1993). Translated from the Russian manuscript by the author