
Sharp Representation Theorems for ReLU Networks with Precise Dependence on Depth

Guy Bresler
Department of EECS
MIT
Cambridge, MA 02139
[email protected]

Dheeraj Nagaraj
Department of EECS
MIT
Cambridge, MA 02139
[email protected]

Abstract

We prove sharp dimension-free representation results for neural networks with D ReLU layers under square loss for a class of functions G_D defined in the paper. These results capture the precise benefits of depth in the following sense:

1. The rates for representing the class of functions G_D via D ReLU layers are sharp up to constants, as shown by matching lower bounds.

2. For each D, G_D ⊆ G_{D+1}, and as D grows the class of functions G_D contains progressively less smooth functions.

3. If D' < D, then the approximation rate for the class G_D achieved by depth-D' networks is strictly worse than that achieved by depth-D networks.

This constitutes a fine-grained characterization of the representation power of feedforward networks of arbitrary depth D and number of neurons N, in contrast to existing representation results, which either require D growing quickly with N or assume that the function being represented is highly smooth. In the latter case similar rates can be obtained with a single nonlinear layer. Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions, and indeed, the main technical novelty is to fully exploit the fact that deep networks can produce highly oscillatory functions with few activation functions.

1 Introduction

Deep neural networks are the workhorse of modern machine learning [1]. An important reason for this is the universal approximation property of deep networks, which allows them to represent any continuous real-valued function with arbitrary accuracy. Various representation theorems establishing the universal approximation property of neural networks have been shown [2, 3, 4, 5, 6]. Under regularity conditions on the functions, a long line of work gives rates of approximation in terms of the number of neurons [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. By now the case of a single layer of nonlinearities is fairly well understood, while the corresponding theory for deep networks is lacking.

Deep networks have been shown empirically to significantly outperform their shallow counterparts, and a flurry of theoretical papers has aimed to understand this. For instance, [26] shows that letting depth scale with the number of samples gives minimax optimal error rates for non-parametric regression tasks. [17] considers hierarchical learning in deep networks, where training with SGD yields layers that successively construct more complex features to represent the function. While an understanding of the generalization performance of neural networks trained on data is a holy grail, the expressivity of the network determines the fundamental barriers under arbitrary optimization procedures, and so in this paper we focus on the more basic question of representation.

Preprint. Under review.

arXiv:2006.04048v1 [stat.ML] 7 Jun 2020


A body of work on depth separation attempts to gain insight into the benefits of depth by constructing functions which can be efficiently represented by networks of a certain depth but cannot be represented by shallower networks unless their width is very large [18, 19, 12, 20, 21, 22, 23, 24]. For instance, [22] shows the existence of radial functions which can be easily approximated by two nonlinear layers but cannot be approximated by one nonlinear layer, and [23] shows the existence of oscillatory functions which can be approximated easily by networks with D^3 nonlinear layers but cannot be approximated by 2^D-width networks of D nonlinear layers. In the different setting of representing probability distributions with sum-product networks, [25] shows strong D versus D+1 separation results. All of these results show the existence of a function requiring a certain depth, but do not attempt to characterize the class of functions approximable by networks of a given depth.

For neural networks with N nonlinear units in a single layer, classical results obtained via a law-of-large-numbers type argument yield a 1/N rate of decay for the square loss [7, 8]. Several papers suggest a benefit of increased depth [9, 10, 12, 13] by implementing a Taylor series approximation to show that deep ReLU or RePU neural networks can achieve rates faster than 1/N when the function being represented is very smooth and the depth is allowed to grow as the loss tends to 0. However, it was shown recently in [16] that when such additional smoothness is assumed, a single layer of nonlinearities suffices to achieve similar error decay rates. Therefore, the benefits of depth are not captured by these results.

The work most related to ours is [14], which considers representation of functions of a given modulus of continuity under the sup norm. When the depth D scales linearly with the total number of activation functions N, the rate of error decay is shown to be strictly better than when D is held fixed. This does indicate that depth is fundamentally beneficial in representation, but the rates are dimension-dependent and hence, as will become clear, the results are far from sharply characterizing the exact benefits of depth.

In this paper we give a fine-grained characterization of the role of depth in the representation power of ReLU networks. Given a network with D ReLU layers and input dimension d, we define a class G_D of real-valued functions characterized by the decay of their Fourier transforms, similar to the class considered in classical works such as [7]. As D increases, the tails of the Fourier transforms are allowed to be fatter, thereby capturing a broader class of functions. Note that decay of a function's Fourier transform is well known to be related to its smoothness (cf. [27]). Our results, stated in Section 4, show that a network with N ReLU units in D layers can achieve rates of the order N^{-1} for functions in the class G_D, whereas networks with D' < D ReLU layers must suffer from slower rates of order N^{-D'/D}. All of these rates are optimal up to constant factors. As explained in Section 3, we prove these results by utilizing the compositional structure of deep networks to systematically produce highly oscillatory functions which are hard to produce using shallow networks.

Organization. The paper is organized as follows. In Section 2, we introduce notation and define the problem. In Section 3, we overview the main idea behind our results, which are then stated formally in Section 4. Sections 5, 6, and 7 prove these results.

2 Notation, Problem Setup, and Fourier Norms

Notation. For t ∈ R let ReLU(t) = max(0, t). In this work, the depth D refers to the number of ReLU layers. Let the input dimension be d. Given n_1, n_2, ..., n_D ∈ N, we let d_i = n_{i-1} when i ≥ 2 and d_1 = d. Let f_i : R^{d_i} → R^{n_i} be defined by
$$f_i(x) = \sum_{j=1}^{n_i} \mathrm{ReLU}\big(\langle x, W_{ij}\rangle - T_{ij}\big)\, e_j,$$
where e_j are the standard basis vectors, W_{ij} ∈ R^{d_i}, and T_{ij} ∈ R. For a ∈ R^{n_D}, we define the ReLU network corresponding to the parameters d, D, n_1, ..., n_D, W_{ij}, T_{ij} and a to be f(x) = ⟨a, f_D ∘ f_{D-1} ∘ ··· ∘ f_1(x)⟩. The number of ReLU units in this network is N = Σ_{i=1}^{D} n_i.
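To make this concrete, here is a minimal NumPy sketch of the network just defined; the widths, weights, and thresholds below are arbitrary illustrative values, not anything prescribed in the paper.

```python
import numpy as np

def relu_layer(x, W, T):
    # One layer f_i(x) = sum_j ReLU(<x, W_ij> - T_ij) e_j:
    # W has shape (n_i, d_i), T has shape (n_i,), x has shape (d_i,).
    return np.maximum(W @ x - T, 0.0)

def relu_network(x, layers, a):
    # layers is a list of (W_i, T_i) pairs; the output is <a, f_D o ... o f_1(x)>.
    h = x
    for W, T in layers:
        h = relu_layer(h, W, T)
    return a @ h

# Example with d = 3 and D = 2 ReLU layers of widths n_1 = 5, n_2 = 4 (so N = 9 ReLU units).
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 3)), rng.normal(size=5)),
          (rng.normal(size=(4, 5)), rng.normal(size=4))]
a = rng.normal(size=4)
print(relu_network(rng.normal(size=3), layers, a))
```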

The representation problem. Consider a function f : R^d → R. Given any probability measure µ over B_d(r) := {x ∈ R^d : ‖x‖_2 ≤ r}, we want to understand how many ReLU units are necessary and sufficient in a neural network of depth D such that its output f̂ has square loss bounded as
$$\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) \le \varepsilon.$$


Fourier norms. Suppose f has Fourier transform F, which for f ∈ L^1(R^d) is given by
$$F(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{i\langle \xi, x\rangle}\, dx.$$

The Fourier transform is well defined also for larger classes of functions than L^1(R^d) [27]. If F is a function, then we assume F ∈ L^1(R^d), but we also allow it to be a finite complex measure, and integration with respect to F(ξ)dξ is understood to be integration with respect to this measure. Under these conditions we have the Fourier inversion formula

$$f(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} F(\xi)\, e^{-i\langle \xi, x\rangle}\, d\xi. \qquad (1)$$

For α ≥ 0, define the Fourier norms as in [7],

$$C_f^{\alpha} = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} |F(\xi)| \cdot \|\xi\|^{\alpha}\, d\xi. \qquad (2)$$

We define a sequence of function spaces G_K such that G_K ⊆ G_{K+1} for K ∈ N:
$$G_K := \big\{ f : \mathbb{R}^d \to \mathbb{R} \;:\; C_f^0 + C_f^{1/K} < \infty \big\}.$$

The domain R^d is usually implicit, but we occasionally write G_K(R^d) or G_K(R) for clarity. Since decay of the Fourier transform is related to smoothness of the function, the sequence of function spaces (G_K) adds functions with less smoothness as K increases. It is hard to exactly characterize this class of functions, but we note that a variety of function classes, when modified to decay to 0 outside of B_d(r) (by multiplying with a suitable 'bump function'), are included in G_K for all K large enough. These include polynomials and trigonometric polynomials (for all K ≥ 1), and any ReLU network of any depth (when K ≥ 2).

Theorems 1 and 2 show that the quantity C_f^0 (C_f^{1/K} + C_f^0) effectively controls the rate at which f can be approximated by a depth-K ReLU network (at least in a ball of radius r = 1, with suitable modification for arbitrary r). As K increases, the class of functions for which C_f^0 (C_f^{1/K} + C_f^0) ≤ 1 grows to include less and less smooth functions. The following two examples illustrate this behavior:

Example 1. Consider the Gaussian function f : R^d → R given by f(x) = e^{-‖x‖_2^2/2}. Its Fourier transform is F(ξ) = (2π)^{d/2} e^{-‖ξ‖_2^2/2}. A simple calculation shows that C_f^0 (C_f^{1/K} + C_f^0) ∼ d^{1/2K} + 1. Thus, when K ≳ log d, the quantity C_f^0 (C_f^{1/K} + C_f^0) remains bounded for any dimension d.

Example 2. Let n ∈ N be large and consider the function f : R → R given by f(x) = cos(nx)/n^α. As α decreases, the oscillations grow in magnitude, and one can check that C_f^0 (C_f^{1/K} + C_f^0) = n^{1/K - 2α} + n^{-2α}. When K ≥ 1/(2α), the rates are essentially independent of n.
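For instance, the quantities in Example 2 follow from a short computation with the (distributional) Fourier transform convention above:
$$F(\xi) = \frac{\pi}{n^{\alpha}}\big[\delta(\xi - n) + \delta(\xi + n)\big], \qquad C_f^0 = \frac{1}{2\pi}\int_{\mathbb{R}} |F(\xi)|\, d\xi = n^{-\alpha}, \qquad C_f^{1/K} = \frac{1}{2\pi}\int_{\mathbb{R}} |F(\xi)|\,|\xi|^{1/K}\, d\xi = n^{1/K-\alpha},$$
so that C_f^0 (C_f^{1/K} + C_f^0) = n^{1/K-2α} + n^{-2α}, as claimed.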

3 Using Depth to Improve Representation

Before stating our representation theorems in the next section, we now briefly explain the core ideas:

1. Following [7], we use the inverse Fourier transform to represent f(x) as an expectation of A cos(⟨ξ, x⟩ + θ(ξ)) for some random variable ξ and then implement cos(⟨ξ, x⟩ + θ(ξ)) using ReLU units.

2. We use an idea similar to the one in [23] to implement a triangle waveform T_k with 2k peaks using ∼ k^{1/D} ReLU units arranged in a network of depth D.

3. Composition of low-frequency cosines with triangle waveforms is then used to efficiently approximate high-frequency cosines of the form cos(⟨ξ, x⟩ + θ(ξ)) via ReLU units.

Suppose that we want to approximate the function f(t) = cos(ωt) for t in the interval [−1, 1] using a ReLU network with a single hidden layer (as in Item 1 above). Because the interval [−1, 1] contains Ω(ω) periods of the function, effectively tracking it requires Ω(ω) ReLU units. It turns out that this dependence can be significantly improved if we allow two nonlinear layers.


[Figure 1. (a) Frequency Multiplication: the cosine waveform cos(3x), the triangle waveform T_k(x), and their composition cos(3 T_k(x)). (b) Random ReLU Approximation of sin(·): sin(2x) and Γ^sin(x; 1/4, 2).]

Figure 1: Conceptually, our representation result is shown by a combination of composition of ReLU with sinusoids to boost frequencies as depicted in (a), and using ReLUs to approximate sinusoids which appear in the Fourier transform of a function as depicted in (b).

The first layer is used to implement the triangle waveform T_k on [−1, 1] for k = Θ(√ω), which oscillates at frequency √ω and uses O(√ω) ReLU units. Then the second layer is used to implement cos(√ω t), again with O(√ω) ReLU units. The output of the two layers combined is cos(√ω T_k(t)), and since cos(√ω t) and T_k(t) each oscillate with frequency √ω, it follows that their composition cos(√ω T_k(t)) oscillates at the frequency ω; one can show more specifically that we obtain the output cos(ωt). We check that the network requires only O(√ω) ReLU units. A similar argument shows that networks of depth D require only O(ω^{1/D}) ReLU units. Surprisingly, this simple idea yields optimal dependence of representation power on depth. We illustrate this in Figure 1a.
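The following is a minimal NumPy sketch of this two-layer frequency-multiplication trick for ω = 400 on [−1, 1]; the parameters follow the choices in Lemma 4 below with r = 1, and the second ReLU layer, which the paper realizes (in expectation) with the random ReLU features Γ^cos_n of Section 5.2, is replaced here by an exact cosine for brevity.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def T(t, alpha, beta):
    # One triangle of eq. (4): width 2*alpha, height alpha*beta, built from 3 ReLUs.
    return beta * (relu(t) - 2 * relu(t - alpha) + relu(t - 2 * alpha))

def Tk(t, alpha, beta, k):
    # Triangle waveform of eq. (5): 2k peaks, supported on [-2*k*alpha, 2*k*alpha].
    return sum(T(t - 2 * alpha * (i - 1), alpha, beta) for i in range(-k + 1, k + 1))

omega = 400.0                            # target frequency of cos(omega * t) on [-1, 1]
w = np.sqrt(omega)                       # each layer only oscillates at frequency ~ sqrt(omega)
n = int(np.ceil(w / (2 * np.pi)))        # as in Lemma 4 with r = 1
alpha = (2 * n + 1) * np.pi / omega      # w * (alpha * beta) = (2n+1)*pi, so phases match across segments
beta = w
k = int(np.ceil(1.0 / (2 * alpha))) + 1  # ~ sqrt(omega) peaks cover [-1, 1]

t = np.linspace(-1.0, 1.0, 20000)
composed = np.cos(w * Tk(t, alpha, beta, k))          # second layer applied to the triangle waveform
print(np.max(np.abs(composed - np.cos(omega * t))))   # matches cos(omega * t) up to floating point
```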

4 Main Results

Theorem 1. Suppose f : R^d → R is such that f ∈ G_K for some K ≥ 1. Let µ be any probability measure over B_d(r) and let D ∈ N be such that D ≤ K. There exists a ReLU network with D ReLU layers and at most N_0 ReLU units in total such that its output f̂ satisfies
$$\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) \;\le\; A_0 \Big(\frac{D}{N_0}\Big)^{D/K} \Big(r^{1/K} C_f^{1/K} C_f^0 + (C_f^0)^2\Big), \qquad (3)$$
where A_0 is a universal constant given in the proof.

We give the proof in Section 6.

Remark 1. Theorem 1 requires D ≤ K, but can be applied also in the case D > K as follows. If D > K and f ∈ G_K, then f ∈ G_D since C_f^{1/D} + C_f^0 ≤ 2(C_f^{1/K} + C_f^0). By an application of Theorem 1 for f ∈ G_D, we obtain an upper bound on the square loss of O((D/N_0)(r^{1/D} C_f^{1/D} C_f^0 + (C_f^0)^2)), which is of the order 1/N_0. Thus, our upper bound for the class G_K becomes better as the depth D increases and saturates at D = K, giving an upper bound of the order 1/N_0.

Remark 2. Getting rates faster than 1/N_0 for f ∈ G_K using D > K layers may be possible using ideas from [16]. We intend to address this problem in future work.

Remark 3. When D = K = 1, Theorem 1 captures the O(1/N_0) rate achieved by [8], which is shown under the assumption that C_f^2 < ∞ instead of C_f^1 + C_f^0 < ∞ as given here.

We next give a matching lower bound in Theorem 2, which also shows depth separation between depth-D and depth-(D+1) networks for arbitrary D.

Theorem 2. Let K ≥ 1 be fixed, D ≤ K, and r > 0. Let µ be the uniform measure over [−r, r]. There exists f : R → R in G_K(R) and a universal constant B_0 such that for any f̂ : R → R given by the output of a D-layer ReLU network with at most N_0 nonlinear units,
$$\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) \;\ge\; B_0 \Big(r^{1/K} C_f^{1/K} C_f^0 + (C_f^0)^2\Big)\Big(\frac{D}{N_0}\Big)^{D/K}.$$


The proof, given in Section 7, is based on the fact that the output of a D-layer ReLU network with N_0 units can oscillate at most ∼ (N_0/D)^D times ([23]). Therefore, such a network cannot capture the oscillations in cos(ωx) whenever ω > (N_0/D)^D.

Remark 4. Theorems 1 and 2 together recover the depth separation results of [23] for the case of ReLU networks.

5 Technical Results

Before delving into the proofs of Theorems 1 and 2, we require some technical lemmas.

5.1 Triangle Waveforms

Consider the triangle function T(·; α, β) : R → R, parametrized by α, β > 0, defined by
$$T(t; \alpha, \beta) = \begin{cases} \beta t & \text{if } t \in [0, \alpha] \\ 2\alpha\beta - \beta t & \text{if } t \in (\alpha, 2\alpha] \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$
We construct the triangle waveform, parametrized by α, β > 0 and k ∈ N, defined as
$$T_k(t; \alpha, \beta) = \sum_{i=-k+1}^{k} T\big(t - 2\alpha(i-1); \alpha, \beta\big). \qquad (5)$$

The triangle waveform T_k(·; α, β) is supported over [−2αk, 2αk] and can be implemented with 4k + 1 ReLUs in a single layer. We state the following basic results and refer to Appendix A.1 for their proofs.

Lemma 1 (Symmetry). The triangle waveform T_k satisfies the following symmetry properties:

1. Let t ∈ [2mα, 2(m+1)α] for some −k ≤ m ≤ k − 1. Then T_k(t; α, β) = T_k(t − 2mα; α, β) = T(t − 2mα; α, β).

2. Let t ∈ [0, 2kα]. Then T_k(2kα − t; α, β) = T_k(t; α, β).

Lemma 2 (Composition). If αβ = 2al, then for every t ∈ R, T_l(T_k(t; α, β); a, b) = T_{2kl}(t; a/β, bβ).

Lemma 2 shows that when we compose two triangle waveforms of the right height and width, their frequencies multiply. For instance, this implies that T_{2kl} can be implemented with O(k + l) ReLUs in two layers instead of O(kl) ReLUs in a single layer.
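A quick numerical sanity check of Lemma 2 (a sketch only; the parameter values below are arbitrary apart from the requirement αβ = 2al):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def T(t, alpha, beta):
    # eq. (4), written with 3 ReLUs
    return beta * (relu(t) - 2 * relu(t - alpha) + relu(t - 2 * alpha))

def Tk(t, alpha, beta, k):
    # eq. (5)
    return sum(T(t - 2 * alpha * (i - 1), alpha, beta) for i in range(-k + 1, k + 1))

k, l, alpha, beta, a, b = 3, 2, 1.0, 4.0, 1.0, 0.5   # alpha * beta = 4 = 2 * a * l
t = np.linspace(-2 * k * alpha, 2 * k * alpha, 5001)
lhs = Tk(Tk(t, alpha, beta, k), a, b, l)             # T_l(T_k(t; alpha, beta); a, b)
rhs = Tk(t, a / beta, b * beta, 2 * k * l)           # T_{2kl}(t; a/beta, b*beta)
print(np.max(np.abs(lhs - rhs)))                     # ~0 (up to floating point)
```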

5.2 Representing Sinusoids with Random ReLUs

Define
$$R_4(t; S, \omega) := \mathrm{ReLU}(t) + \mathrm{ReLU}\Big(t - \tfrac{\pi}{\omega}\Big) - \mathrm{ReLU}\Big(t - \tfrac{\pi S}{\omega}\Big) - \mathrm{ReLU}\Big(t - \tfrac{\pi(1-S)}{\omega}\Big)$$
for t ∈ R and S ∈ [0, 1]. Henceforth let S ∼ Unif[0, 1]. We let
$$\Gamma^{\sin}(t; S, \omega) := \frac{\pi\omega}{2}\sin(\pi S)\,\big[R_4(t; S, \omega) - R_4\big(t - \tfrac{\pi}{\omega}; S, \omega\big)\big]$$
and
$$\Gamma^{\cos}_n(t; S, \omega) := \sum_{i=-n-1}^{n} \Gamma^{\sin}\Big(t - \tfrac{2\pi i}{\omega} + \tfrac{\pi}{2\omega}; S, \omega\Big)$$
for some n ∈ N to be set later. We refer to Figure 1b for an illustration of the function Γ^sin(·; S, ω).

Lemma 3. Γ^cos_n(t; S, ω) satisfies the following properties:

1. For every −π/ω − 2πn/ω ≤ t ≤ 2πn/ω + π/ω, E Γ^cos_n(t; S, ω) = cos(ωt).

2. For every t ∈ R, |Γ^cos_n(t; S, ω)| ≤ π²/4 almost surely.

3. Γ^cos_n(·; S, ω) can be implemented using 16n + 16 ReLU units in a single layer. (With a bit more care this can be improved to 8n + 10, but we ignore this for the sake of simplicity.)

The proof, deferred to Appendix A.2, is based on a simple application of integration by parts.
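The following sketch checks Item 1 of Lemma 3 numerically by integrating out S ∼ Unif[0, 1] on a fine grid; the choices ω = 3 and n = 5 are arbitrary.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def R4(t, S, w):
    return (relu(t) + relu(t - np.pi / w)
            - relu(t - np.pi * S / w) - relu(t - np.pi * (1 - S) / w))

def gamma_sin(t, S, w):
    return (np.pi * w / 2) * np.sin(np.pi * S) * (R4(t, S, w) - R4(t - np.pi / w, S, w))

def gamma_cos(t, S, w, n):
    return sum(gamma_sin(t - 2 * np.pi * i / w + np.pi / (2 * w), S, w)
               for i in range(-n - 1, n + 1))

w, n = 3.0, 5
t = np.linspace(-np.pi / w - 2 * np.pi * n / w, 2 * np.pi * n / w + np.pi / w, 200)
S = (np.arange(20000) + 0.5) / 20000                      # midpoint rule over S ~ Unif[0, 1]
expectation = np.array([gamma_cos(ti, S, w, n).mean() for ti in t])
print(np.max(np.abs(expectation - np.cos(w * t))))        # small (quadrature error only)
```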


5.3 Frequency Multipliers

The lemma below combines the considerations from Section 5.1 to show that composing a ReLU estimator for a low-frequency cosine function with a low-frequency triangle waveform gives an estimator for a high-frequency cosine function, as described in Section 3.

Lemma 4. Recall the triangle waveform T_k(·; α, β) defined above. Fix β > 1 and ω > 1. Set α = (2n+1)π/(βω) and k = ⌈(r + π/(βω))/2α⌉. For any θ ∈ [−π, π] and t ∈ [−r, r] we have that
$$\mathbb{E}\,\Gamma^{\cos}_n\Big(T_k\big(t + \tfrac{\theta}{\beta\omega};\, \alpha, \beta\big); S, \omega\Big) = \cos(\beta\omega t + \theta).$$
The proof of this lemma follows from Lemma 3, using the fact that the triangle waveform T_k makes the waveform repeat multiple times. The formal proof is given in Appendix A.2.

6 Proof of Theorem 1

We want to approximate f ∈ G_K using a depth-D neural network with D ≤ K. Let F be the Fourier transform of f, where F(ξ) = |F(ξ)| e^{-iθ(ξ)} for some θ(ξ) ∈ [−π, π] (such a choice exists via Jordan decomposition and the Radon-Nikodym Theorem). Then, by the Fourier inversion formula,
$$f(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} F(\xi)\, e^{-i\langle \xi, x\rangle}\, d\xi = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} |F(\xi)|\, e^{-i[\langle \xi, x\rangle + \theta(\xi)]}\, d\xi = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} |F(\xi)| \cos\big(\langle \xi, x\rangle + \theta(\xi)\big)\, d\xi = C_f^0\, \mathbb{E}_{\xi\sim\nu_f} \cos\big(\langle \xi, x\rangle + \theta(\xi)\big). \qquad (6)$$
Here ν_f is the probability measure given by ν_f(dξ) = |F(ξ)| dξ / ((2π)^d C_f^0). We start with the case D = 1.

6.1 D = 1

Cosine as an expectation over ReLUs. Suppose ξ ≠ 0. Since x ∈ B_d(r), we know that ⟨x, ξ⟩/‖ξ‖ ∈ [−r, r]. Let n = ⌈‖ξ‖r/2π⌉, ω = ‖ξ‖, and t = ⟨ξ, x⟩/‖ξ‖ + θ(ξ)/‖ξ‖ in Item 1 of Lemma 3 to conclude that
$$\cos(\langle \xi, x\rangle + \theta(\xi)) = \cos\Big(\|\xi\|\, \tfrac{\langle \xi, x\rangle}{\|\xi\|} + \theta(\xi)\Big) = \mathbb{E}_{S\sim \mathrm{Unif}[0,1]}\, \Gamma^{\cos}_n\Big(\tfrac{\langle \xi, x\rangle}{\|\xi\|} + \tfrac{\theta(\xi)}{\|\xi\|};\, S, \|\xi\|\Big). \qquad (7)$$
By Item 3 of Lemma 3, the unbiased estimator Γ^cos_n(⟨ξ,x⟩/‖ξ‖ + θ(ξ)/‖ξ‖; S, ‖ξ‖) for cos(⟨ξ, x⟩ + θ(ξ)) can be implemented using 16n + 16 ReLUs.

If ξ = 0, then cos(⟨ξ, x⟩ + θ(ξ)) is a constant and 2 ReLUs suffice to implement this function over [−r, r]. Thus, for any value of ξ we can implement an unbiased estimator for cos(⟨ξ, x⟩ + θ(ξ)) using 16n + 16 ReLUs (note that this quantity depends on ‖ξ‖ through n). We call this unbiased estimator Γ(x; ξ, S). By Item 2 of Lemma 3, |Γ(x; ξ, S)| ≤ π²/4 almost surely.

Obtaining an estimator via random sampling. Let (ξ, S) ∼ ν_f × Unif([0, 1]). For every x ∈ B_d(r), (6) and (7) together imply that f(x) = C_f^0 E Γ(x; ξ, S). We construct an empirical estimate by sampling (ξ_j, S_j) ∼ ν_f × Unif([0, 1]) i.i.d. for j = 1, 2, ..., l, yielding
$$\hat f(x) = \frac{1}{l}\sum_{j=1}^{l} C_f^0\, \Gamma(x; \xi_j, S_j). \qquad (8)$$
By Fubini's Theorem and the fact that the Γ(x; ξ_j, S_j) are i.i.d., for any probability measure µ we have
$$\mathbb{E}\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) = \int \mathbb{E}\big(f(x) - \hat f(x)\big)^2 \mu(dx) \le (C_f^0)^2\, \frac{\sup_x \mathbb{E}|\Gamma(x; \xi_j, S_j)|^2}{l} \le \frac{(C_f^0)^2 \pi^4}{16\, l}.$$
In the last step we have used the fact that |Γ(x; ξ, S)| ≤ π²/4 almost surely. By Markov's inequality, with probability at least 2/3 (over the randomness in (ξ_j, S_j)_{j=1}^{l}) we have
$$\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) \le \frac{3 (C_f^0)^2 \pi^4}{16\, l}. \qquad (9)$$


Let the number of ReLU units used to implement Γ(·; ξ_j, S_j) be N_j. Note that N_j is a random variable bounded in terms of the random ξ_j according to N_j ≤ 8‖ξ_j‖r/π + 32. Therefore the total number of ReLUs used is N = Σ_{j=1}^{l} N_j ≤ Σ_{j=1}^{l} (8‖ξ_j‖r/π + 32). By Minkowski's inequality, for every K ≥ 1, N^{1/K} ≤ Σ_{j=1}^{l} [(8‖ξ_j‖r/π)^{1/K} + (32)^{1/K}], so by definition of ν_f and the Fourier norms,
$$\mathbb{E}\, N^{1/K} \le l\left[\Big(\frac{8r}{\pi}\Big)^{1/K} \frac{C_f^{1/K}}{C_f^0} + (32)^{1/K}\right].$$
Markov's inequality now implies that with probability at least 1/2, the number of ReLUs used is bounded as
$$N^{1/K} \le 2l\left[\Big(\frac{8r}{\pi}\Big)^{1/K} \frac{C_f^{1/K}}{C_f^0} + (32)^{1/K}\right] := l \cdot D_0. \qquad (10)$$

Combining the two bounds. Let D_0 be as defined in (10) just above. Given N_0 ∈ N such that N_0 ≥ D_0^K, we take l = ⌊N_0^{1/K}/D_0⌋. By the union bound, both Equations (9) and (10) must hold with positive probability, so there exists a configuration with N ReLUs such that N ≤ N_0 and
$$\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) \le \frac{1}{N_0^{1/K}} \Big(6\pi^4 C_f^0 C_f^{1/K} r^{1/K} + 8\pi^4 (C_f^0)^2\Big).$$
If N_0 ≤ D_0^K we just use a network that always outputs 0. From Equation (6), we see that |f(x)| ≤ C_f^0 for every x, so the last displayed equation holds (up to a constant factor) also in this case.
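The D = 1 construction can be exercised end to end for a single explicit target. The sketch below does so for f(x) = cos(3x) on [−1, 1], for which ν_f places mass 1/2 on each of ξ = ±3 and θ ≡ 0, so C_f^0 = 1; the square loss should decay roughly like 1/l, in line with (9). Sample sizes and the evaluation grid are illustrative choices.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def R4(t, S, w):
    return (relu(t) + relu(t - np.pi / w)
            - relu(t - np.pi * S / w) - relu(t - np.pi * (1 - S) / w))

def gamma_sin(t, S, w):
    return (np.pi * w / 2) * np.sin(np.pi * S) * (R4(t, S, w) - R4(t - np.pi / w, S, w))

def gamma_cos(t, S, w, n):
    return sum(gamma_sin(t - 2 * np.pi * i / w + np.pi / (2 * w), S, w)
               for i in range(-n - 1, n + 1))

rng = np.random.default_rng(0)
r, w = 1.0, 3.0
n = int(np.ceil(w * r / (2 * np.pi)))
x = np.linspace(-r, r, 2000)                 # evaluation grid for mu = Unif[-r, r]

for l in [10, 100, 1000]:
    xi = rng.choice([-w, w], size=l)         # xi_j ~ nu_f
    S = rng.uniform(0.0, 1.0, size=l)        # S_j ~ Unif[0, 1]
    fhat = np.mean([gamma_cos(np.sign(xi_j) * x, S_j, w, n) for xi_j, S_j in zip(xi, S)], axis=0)
    print(l, np.mean((np.cos(w * x) - fhat) ** 2))   # square loss decays roughly like 1/l
```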

6.2 D > 1

We follow the same overall procedure as the D = 1 case, but we will use the frequency multiplication technique to implement the cosine function with fewer ReLU units. For each ξ, we want an unbiased estimator for cos(‖ξ‖t + θ(ξ)) for t ∈ [−r, r] and θ(ξ) ∈ [−π, π]. Assume ξ ≠ 0.

Triangle waveform. In Lemma 4 we take ω = ‖ξ‖^{1/D}, n = ⌈(‖ξ‖r)^{1/D}/2π⌉, β = ‖ξ‖^{1−1/D}, and let α and k be determined as given in the statement of the lemma, i.e., α = (2n+1)π/‖ξ‖ and k ≥ ⌈(r + π/‖ξ‖)/2α⌉. Note that Γ^cos_n can be implemented by the D-th ReLU layer, and therefore it is sufficient to implement T_k(·; α, ‖ξ‖^{1−1/D}) using the previous D − 1 ReLU layers as follows.

Let l := ⌈(1/2)(r‖ξ‖)^{1/D}⌉ and γ := ‖ξ‖^{1/D}. Let α_1 := (2l)^{D−2}(2n+1)π/‖ξ‖ and, for i = 2, ..., D−1, α_i := α_1 (γ/2l)^{i−1}. For i = 1, ..., D − 1, we define f_i : R → R as f_i(t) = T_l(t; α_i, γ). Clearly, f_{D−1} ∘ ··· ∘ f_1(t) can be implemented by a depth-(D−1) ReLU network by implementing the function f_i using the i-th ReLU layer. It now follows by a straightforward induction using Lemma 2 that
$$f_{D-1} \circ \cdots \circ f_1(t) = T_k\big(t; \alpha, \|\xi\|^{1-1/D}\big),$$
where k = 2^{D-2} l^{D-1} ≥ ⌈(r + π/‖ξ‖)/2α⌉.

Each of the first D − 1 layers requires 4l + 1 ReLU units and, by Item 3 of Lemma 3, the D-th layer requires 16n + 16 ReLU units. If the number of ReLUs used is N(ξ), then
$$N(\xi) \le (D-1)(4l+1) + 16n + 16 \le \Big(\tfrac{8}{\pi} + 2D - 2\Big)(r\|\xi\|)^{1/D} + 5D + 27.$$

Estimator via sampling. As in the D = 1 case, we form the estimator in (8) by sampling (ξ_j, S_j) i.i.d. from the distribution ν_f × Unif([0, 1]), but we will now implement the cosines using a D-layer ReLU network as described above. This uses N ≤ Σ_{j=1}^{l} N(ξ_j) nonlinear units. Let K ≥ D. Applying Minkowski's inequality to N^{D/K} as in the D = 1 case, we obtain that
$$\mathbb{E}\, N^{D/K} \le l\left[\Big(\tfrac{8}{\pi} + 2D - 2\Big)^{D/K} r^{1/K}\, \frac{C_f^{1/K}}{C_f^0} + (5D + 27)^{D/K}\right].$$

Equation (9) remains the same in this case too. Therefore, using the union bound just like in the case D = 1, we conclude that there exists a configuration of weights for a D-layer ReLU network with N ≤ N_0 ReLU units for any N_0 ∈ N such that
$$\int \big(f(x) - \hat f(x)\big)^2 \mu(dx) \le \frac{3\pi^4 (2D+1)^{D/K}}{4\, N_0^{D/K}}\, r^{1/K} C_f^{1/K} C_f^0 + \frac{3\pi^4 (5D+27)^{D/K}}{4\, N_0^{D/K}}\, (C_f^0)^2.$$


7 Proof of Theorem 2

We will exhibit a function f which is challenging to estimate over [−r, r] by f̂ with respect to square loss over the uniform measure µ.

We use the idea of crossing numbers from [23] for the lower bounds. Given a continuous function f̂ : R → R, let I_f̂ denote the partition of R into intervals on which 1(f̂ ≥ 1/2) is constant, and define Cr(f̂) = |I_f̂|. Let the set of endpoints of intervals in I_f̂ be denoted by N_f̂. Clearly, |N_f̂| ≤ Cr(f̂).
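As an illustration of the definition, the sketch below approximates the crossing number on a finite grid over [−r, r] and applies it to the scaled target ω^α f introduced in the next paragraph (rather than to a network output f̂); the values of r, L, and α are illustrative.

```python
import numpy as np

def crossing_number_on_grid(fvals):
    # Number of maximal runs of grid points on which the indicator 1(f >= 1/2) is constant.
    ind = fvals >= 0.5
    return 1 + int(np.sum(ind[1:] != ind[:-1]))

r, L, alpha = 1.0, 8, 0.25
omega = 2 * np.pi * L
x = np.linspace(-r, r, 200001)
f = (1.0 + np.cos(omega * x / r)) / (2.0 * omega ** alpha)
print(crossing_number_on_grid(omega ** alpha * f))   # 33 = 4L + 1: the target oscillates far more than 2L times
```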

Consider the function f : R → R given by f(x) = (1 + cos(ωx/r))/(2ω^α) for some α > 0 and ω > 1. The Fourier transform of f is the measure (π/(2ω^α))[2δ(ξ) + δ(ξ − ω/r) + δ(ξ + ω/r)]. Integrating yields C_f^0 = 1/ω^α and C_f^{1/K} = (1/2) r^{-1/K} ω^{1/K − α}. We will later choose α = 1/(2K), for which it follows that 1/2 ≤ C_f^0 C_f^{1/K} r^{1/K} + (C_f^0)^2 ≤ 3/2. We first show the following basic lemma.

Lemma 5. Let ω = 2πL for some L ∈ N. If Cr(ω^α f̂) < 2L, then
$$\frac{1}{2r}\int_{-r}^{r} \big(f(x) - \hat f(x)\big)^2 dx \;\ge\; \frac{\pi}{16}\, \frac{2L - \mathrm{Cr}(\omega^{\alpha} \hat f)}{\omega^{2\alpha+1}}.$$

Proof. Partition the interval [−r, r) into 2L subintervals of the form [ir/L, r(i+1)/L) for −L ≤ i ≤ L − 1. By a simple counting argument, there exist at least 2L − |N_{ω^α f̂}| ≥ 2L − Cr(ω^α f̂) such intervals which do not contain any point from the set N_{ω^α f̂}. Let [ir/L, r(i+1)/L) be such an interval. Then either (1) f̂(x) ≥ 1/(2ω^α) for every x ∈ [ir/L, r(i+1)/L), or (2) f̂(x) < 1/(2ω^α) for every x ∈ [ir/L, r(i+1)/L). Without loss of generality, suppose that the first of these is true. Then
$$\frac{1}{2r}\int_{ri/L}^{r(i+1)/L} \big(f(x) - \hat f(x)\big)^2 dx = \frac{1}{2r}\int_{ri/L}^{r(i+1)/L} \Big(\frac{1 + \cos(2\pi L x/r)}{2\omega^{\alpha}} - \hat f(x)\Big)^2 dx$$
$$\ge \frac{1}{2r}\int_{(4i+1)r/4L}^{(4i+3)r/4L} \Big(\frac{\cos(2\pi L x/r)}{2\omega^{\alpha}} + \frac{1}{2\omega^{\alpha}} - \hat f(x)\Big)^2 dx \;\ge\; \frac{1}{2r}\int_{(4i+1)r/4L}^{(4i+3)r/4L} \Big(\frac{\cos(2\pi L x/r)}{2\omega^{\alpha}}\Big)^2 dx$$
$$= \frac{1}{8r\omega^{2\alpha}}\int_{(4i+1)r/4L}^{(4i+3)r/4L} \cos^2(2\pi L x/r)\, dx = \frac{\pi}{16\,\omega^{2\alpha+1}}. \qquad (11)$$
In the second step we have used the fact that cos(2πLx/r) ≤ 0 for ir/L + r/4L ≤ x ≤ ir/L + 3r/4L. Adding the contributions to the integral in the statement of the lemma over the collection of such intervals which do not contain any point from N_{ω^α f̂} yields the result.

We now refer to Lemma 3.2 in [23], which states that because f̂ is the output of a D-layer ReLU network with at most N_0 ReLUs, its crossing number is bounded as Cr(f̂) ≤ 2(2N_0/D)^D. Taking L = ⌈2(2N_0/D)^D⌉ in Lemma 5 and recalling that ω = 2πL, we obtain
$$\frac{1}{2r}\int_{-r}^{r}\big(f(x) - \hat f(x)\big)^2 dx \;\ge\; \frac{\pi}{32\,\omega^{2\alpha}} = \frac{\pi}{32\,(2\pi)^{2\alpha} L^{2\alpha}} \;\ge\; \frac{\pi\, D^{2\alpha D}}{32\,(6\pi)^{2\alpha}(2N_0)^{2\alpha D}}.$$

Recall from just before Lemma 5 that for α = 1/(2K) we have 1/2 ≤ C_f^0 C_f^{1/K} r^{1/K} + (C_f^0)^2 ≤ 3/2. Thus, from the last displayed equation we conclude that there is a universal constant B_0 such that for any f̂ that is the output of a D-layer, N_0-unit ReLU network, the squared loss in representing f is
$$\frac{1}{2r}\int_{-r}^{r}\big(f(x) - \hat f(x)\big)^2 dx \;\ge\; B_0 \Big(C_f^0 C_f^{1/K} r^{1/K} + (C_f^0)^2\Big)\Big(\frac{D}{N_0}\Big)^{D/K}.$$

This completes the proof of Theorem 2.

Acknowledgments and Disclosure of Funding

This work was supported in part by MIT-IBM Watson AI Lab and NSF CAREER award CCF-1940205.


References

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[2] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.

[3] K.-I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, 1989.

[4] B. Hanin and M. Sellke, “Approximating continuous functions by ReLU nets of minimal width,” arXiv preprint arXiv:1710.11278, 2017.

[5] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[6] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks: A view from the width,” in Advances in Neural Information Processing Systems, pp. 6231–6239, 2017.

[7] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.

[8] J. M. Klusowski and A. R. Barron, “Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls,” IEEE Transactions on Information Theory, vol. 64, no. 12, pp. 7649–7656, 2018.

[9] B. Li, S. Tang, and H. Yu, “Better approximations of high dimensional smooth functions by deep neural networks with rectified power units,” arXiv preprint arXiv:1903.05858, 2019.

[10] S. Liang and R. Srikant, “Why deep neural networks for function approximation?,” arXiv preprint arXiv:1610.04161, 2016.

[11] C. Ma and L. Wu, “Barron spaces and the compositional function spaces for neural network models,” arXiv preprint arXiv:1906.08039, 2019.

[12] I. Safran and O. Shamir, “Depth-width tradeoffs in approximating natural functions with neural networks,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2979–2987, JMLR.org, 2017.

[13] D. Yarotsky, “Error bounds for approximations with deep ReLU networks,” Neural Networks, vol. 94, pp. 103–114, 2017.

[14] D. Yarotsky, “Optimal approximation of continuous functions by very deep ReLU networks,” in Conference on Learning Theory, pp. 639–649, 2018.

[15] J. Schmidt-Hieber, “Deep ReLU network approximation of functions on a manifold,” arXiv preprint arXiv:1908.00695, 2019.

[16] G. Bresler and D. Nagaraj, “A corrective view of neural networks: Representation, memorization and learning,” arXiv preprint arXiv:2002.00274, 2020.

[17] Z. Allen-Zhu and Y. Li, “Backward feature correction: How deep learning performs deep learning,” arXiv preprint arXiv:2001.04413, 2020.

[18] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao, “Why and when can deep-but-not-shallow networks avoid the curse of dimensionality: A review,” International Journal of Automation and Computing, vol. 14, no. 5, pp. 503–519, 2017.

[19] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, “Exponential expressivity in deep neural networks through transient chaos,” in Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

[20] A. Daniely, “Depth separation for neural networks,” in Conference on Learning Theory, pp. 690–696, 2017.

[21] O. Delalleau and Y. Bengio, “Shallow vs. deep sum-product networks,” in Advances in Neural Information Processing Systems, pp. 666–674, 2011.

[22] R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in Conference on Learning Theory, pp. 907–940, 2016.

[23] M. Telgarsky, “Benefits of depth in neural networks,” in Conference on Learning Theory, pp. 1517–1539, 2016.

[24] N. Cohen, O. Sharir, and A. Shashua, “On the expressive power of deep learning: A tensor analysis,” in Conference on Learning Theory, pp. 698–728, 2016.

[25] J. Martens and V. Medabalimi, “On the expressive efficiency of sum product networks,” arXiv preprint arXiv:1411.7717, 2014.

[26] J. Schmidt-Hieber, “Nonparametric regression using deep neural networks with ReLU activation function,” arXiv preprint arXiv:1708.06633, 2017.

[27] F. G. Friedlander and M. C. Joshi, Introduction to the Theory of Distributions. Cambridge University Press, 1998.


A Supplementary Material

A.1 Frequency Multipliers - Proofs

We skip the proof of Lemma 1 since it is elementary. We give a proof of Lemma 2 as follows.

Proof of Lemma 2. T_k(t; α, β) has 4k straight line segments which either increase from 0 to αβ = 2al or decrease from αβ to 0. For each of these line segments, the entire set of values of T_l(·; a, b) on [0, 2al] is repeated once. This gives us 4kl triangles. The height of these triangles is the same as that of T_l(·; a, b), which is ab. The domain of the triangle waveform is the same as that of T_k(·; α, β), which is [−2kα, 2kα]. From this we conclude the statement of the lemma.

A.2 ReLU Representation for Sinusoids - Proofs

Let ω > 0. We want to represent t ↦ sin(ωt) for t ∈ [0, π/ω] in terms of ReLU functions. The first part of the argument entails manipulation of an integral, and then the resulting identity will be applied to obtain the proofs of Lemmas 3 and 4.

To start, integration by parts yields
$$\int_0^{\pi/\omega} \omega^2 \sin(\omega T)\,\mathrm{ReLU}(t - T)\, dT = \omega t - \sin(\omega t).$$
Replacing t with π/ω − t, we have
$$\int_0^{\pi/\omega} \omega^2 \sin(\omega T)\,\mathrm{ReLU}(\pi/\omega - t - T)\, dT = \pi - \omega t - \sin(\omega t),$$
and adding the last two equations gives
$$\int_0^{\pi/\omega} \omega^2 \sin(\omega T)\,\big[\mathrm{ReLU}(\pi/\omega - t - T) + \mathrm{ReLU}(t - T)\big]\, dT = \pi - 2\sin(\omega t).$$
From the case t = 0 in the last equation, we conclude that
$$\pi = \int_0^{\pi/\omega} \omega^2 \sin(\omega T)\,\big[\pi/\omega - T\big]\, dT.$$
Combining the last two equations, we obtain the identity
$$\sin(\omega t) = \frac{1}{2}\int_0^{\pi/\omega} \omega^2 \sin(\omega T)\,\big[\pi/\omega - T - \mathrm{ReLU}(\pi/\omega - t - T) - \mathrm{ReLU}(t - T)\big]\, dT.$$
Making the transformation S = Tω/π, the integral can be rewritten as
$$\sin(\omega t) = \frac{\pi}{2}\int_0^1 \omega \sin(\pi S)\Big[\frac{\pi}{\omega}(1 - S) - \mathrm{ReLU}\Big(\frac{\pi}{\omega}(1 - S) - t\Big) - \mathrm{ReLU}\Big(t - \frac{\pi S}{\omega}\Big)\Big]\, dS. \qquad (12)$$

Now recall the function R_4(·; S, ω) as defined in Section 5.2. A simple calculation shows that
$$R_4(t; S, \omega) = \begin{cases} 0 & \text{if } t \notin [0, \tfrac{\pi}{\omega}] \\[2pt] \tfrac{\pi}{\omega}(1-S) - \mathrm{ReLU}\big(\tfrac{\pi}{\omega}(1-S) - t\big) - \mathrm{ReLU}\big(t - \tfrac{\pi S}{\omega}\big) & \text{if } t \in [0, \tfrac{\pi}{\omega}], \end{cases}$$

so if we let S be a random variable with S ∼ Unif([0, 1]), we can rewrite (12) as
$$\mathbb{E}\,\frac{\pi\omega}{2}\sin(\pi S)\, R_4(t; S, \omega) = \begin{cases} 0 & \text{if } t \notin [0, \tfrac{\pi}{\omega}] \\ \sin(\omega t) & \text{if } t \in [0, \tfrac{\pi}{\omega}]. \end{cases}$$

It then follows that
$$\mathbb{E}\,\frac{\pi\omega}{2}\sin(\pi S)\big[R_4(t; S, \omega) - R_4\big(t - \tfrac{\pi}{\omega}; S, \omega\big)\big] = \begin{cases} 0 & \text{if } t \notin [0, \tfrac{2\pi}{\omega}] \\ \sin(\omega t) & \text{if } t \in [0, \tfrac{2\pi}{\omega}]. \end{cases} \qquad (13)$$


Proof of Lemma 3. The first item follows from the basic trigonometric identity cos(x) = sin(x + π/2) and Equation (13).

For Item 2, note that because Γ^cos_n(·; S, ω) is a sum of shifted versions of Γ^sin(·; S, ω) such that the interiors of the shifted versions' supports are all disjoint, it is sufficient to upper bound the values of Γ^sin(·; S, ω). Indeed, inspection of the form of R_4 shows that |R_4(t; S, ω)| ≤ (π/ω) min(S, 1 − S) ≤ π/(2ω). Since (πω/2)|sin(πS)| ≤ πω/2, the bound follows.

Finally, Item 3 follows because Γ^cos_n(·; S, ω) is implemented via summation of 4(n+1) shifted versions of the function R_4(·; S, ω). Since R_4(·; S, ω) by definition can be implemented via 4 ReLU functions, we conclude the result.

Proof of Lemma 4. It is sufficient to show that for t ∈ [−r − π/(βω), r + π/(βω)],
$$\mathbb{E}\,\Gamma^{\cos}_n\big(T_k(t; \alpha, \beta); S, \omega\big) = \cos(\beta\omega t).$$
Fix t ∈ [−r − π/(βω), r + π/(βω)]. By definition, T_k(·; α, β) is supported in [−2kα, 2kα]. By our choice of k, we have [−r − π/(βω), r + π/(βω)] ⊆ [−2kα, 2kα]. Let t ∈ [2mα, 2(m+1)α] for some m ∈ Z such that −k ≤ m ≤ k − 1. We invoke Item 1 of Lemma 1 to show that T_k(t; α, β) = T(t − 2mα; α, β). Now, T(t − 2mα; α, β) ∈ [0, αβ] = [0, (2n+1)π/ω]. Therefore, by Item 1 of Lemma 3,

It is now sufficient to show that cos(ωT (t− 2mα;α, β)) = cos(βωt). We consider two cases:

1) If t− 2mα ∈ [0, α], then T (t− 2mα;α, β) = βt− 2mαβ. The LHS of Equation (14) becomes

cos(ωβt− 2mαβω) = cos(ωβt− 2m(2n+ 1)π) = cos(ωβt) .

2) If t − 2mα ∈ (α, 2α], then T (t − 2mα;α, β) = (2m + 2)αβ − βt and hence the LHS ofEquation (14) becomes

cos(−ωβt+ (2m+ 2)αβω) = cos(−ωβt+ (2m+ 2)(2n+ 1)π) = cos(ωβt) .

This completes the proof.
