GANNs: A New Theory of Representation for Nonlinear Bounded Operators (William Guss, Machine Learning at Berkeley, April 22, 2016)


  • GANNs: A New Theory of Representation for Nonlinear Bounded Operators

    William Guss

    Machine Learning at Berkeley

    April 22, 2016

  • Introduction: What's up with continuous data?

    All of the data we deal with is discrete, thanks to Turing.

    But most of it models a continuous process.

    Examples:
    Audio: We take > 100k samples of something we could describe with f : ℝ → ℝ! Trick question: which is easier to use, (a) v ∈ ℝ^100000 or (b) f?
    Images: We take 100k × 100k samples of something we could describe with f : ℝ² → ℝ.

    Why do we use discrete data? No computer known can really store f. End of story.

  • Introduction: Abusing continuity

    f can't be that bad. Can it?

    If f is smooth it's easy to draw:

    [Plot: f(x) = x² for −10 ≤ x ≤ 10]

    I can even name f most of the time: f : x ↦ x², or even super precisely g : x ↦ \sum_{n=0}^{\infty} a_n x^n.

    Moral: smooth functions are mostly very manageable.

  • Introduction: Abusing continuity

    So why do we do this:

    To classify this:

  • The Core Idea: Let neural networks abuse continuity and smoothness.

  • Artificial Neural Networks

    Definition
    We say N : ℝ^n → ℝ^m is a feed-forward neural network if, for an input vector x,

        N : \sigma^{(l+1)}_j = g\Big( \sum_{i \in Z^{(l)}} w^{(l)}_{ij}\, \sigma^{(l)}_i + \beta^{(l)} \Big), \qquad \sigma^{(0)}_i = x_i,    (1)

    where 1 ≤ l ≤ L − 1. Furthermore, we say {N} is the set of all neural networks.
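
    A minimal sketch of Eq. (1) in NumPy (not code from the talk); the layer sizes, the logistic choice of g, and the helper name feed_forward are illustrative assumptions.

        import numpy as np

        def g(x):
            # Logistic sigmoid playing the role of the activation "bump" g.
            return 1.0 / (1.0 + np.exp(-x))

        def feed_forward(x, weights, biases):
            # Eq. (1): sigma^(l+1)_j = g( sum_i w^(l)_ij sigma^(l)_i + beta^(l) ), sigma^(0) = x.
            sigma = x
            for W, b in zip(weights, biases):
                sigma = g(W @ sigma + b)   # one layer: linear map, then the bump
            return sigma

        # Example: a network R^4 -> R^3 -> R^2 with random weights and scalar biases beta^(l).
        rng = np.random.default_rng(0)
        weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
        biases = [0.1, -0.2]
        print(feed_forward(rng.standard_normal(4), weights, biases))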

  • Operator Neural Networks

    Let's get rid of ℝ^100000 and use f.

    Definition
    We call O : L^p(X) → L^1(Y) an operator neural network if

        O : \sigma^{(l+1)}(j) = g\Big( \int_{R^{(l)}} \sigma^{(l)}(i)\, w^{(l)}(i, j)\, di \Big), \qquad \sigma^{(0)}(j) = f(j).    (2)

    Furthermore, let {O} denote the set of all operator neural networks.

    Well, that was easy. In fact {O} ⊃ {N}. These definitions look really similar; is there some more general category or structure containing them?
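
    A minimal sketch of one layer of Eq. (2), with the integral over R^(l) approximated by a Riemann sum on a grid; the grids and the weight kernel here are assumptions for illustration, not part of the talk.

        import numpy as np

        def g(x):
            return 1.0 / (1.0 + np.exp(-x))

        def operator_layer(sigma, i_grid, j_grid, w):
            # Eq. (2): sigma^(l+1)(j) = g( ∫ sigma^(l)(i) w(i, j) di ), integral as a Riemann sum.
            di = i_grid[1] - i_grid[0]
            I, J = np.meshgrid(i_grid, j_grid, indexing="ij")
            pre = (sigma(i_grid)[:, None] * w(I, J)).sum(axis=0) * di
            return g(pre)

        # Example: input function f(i) = sin(i) and an arbitrary smooth kernel w(i, j).
        i_grid = np.linspace(0.0, 1.0, 200)
        j_grid = np.linspace(0.0, 1.0, 50)
        out = operator_layer(np.sin, i_grid, j_grid, lambda i, j: np.exp(-(i - j) ** 2))
        print(out.shape)   # one value of sigma^(1)(j) per grid point j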

  • Generalized Artificial Neural Networks

    Definition
    If A, B are (possibly distinct) Banach spaces over a field F, we say G : A → B is a generalized neural network if and only if

        G : \sigma^{(l+1)} = g\big( T_l[\sigma^{(l)}] + \beta^{(l)} \big), \qquad \sigma^{(0)} = \xi,    (3)

    for some input ξ ∈ A and a linear form T_l.

    Claim: "Neural networks" are powerful because they can move bumps anywhere!
    How? T_l is a linear form: it can move σ^{(l)} anywhere, and g is a bump of some sort.
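
    As a sketch of Eq. (3) (my own illustration, not code from the talk), a generalized layer can take the linear form T_l as a pluggable callable; the discrete and operator layers are then just two choices of T.

        import numpy as np

        def g(x):
            return 1.0 / (1.0 + np.exp(-x))

        def gann_layer(sigma, T, beta):
            # Eq. (3): sigma^(l+1) = g( T_l[sigma^(l)] + beta^(l) ) for any linear form T.
            return g(T(sigma) + beta)

        # Discrete choice of T: an ordinary matrix multiply.
        W = np.array([[1.0, -1.0], [0.5, 2.0]])
        print(gann_layer(np.array([0.3, -0.7]), lambda s: W @ s, beta=0.0))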

  • Moving bumps around

    The sigmoid function

        g(x) = \frac{1}{1 + e^{-x}}    (4)

    is a bump that we can move around with weights!
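
    A small illustration (mine, not from the slides): scaling and shifting the sigmoid's argument with a weight and bias moves the transition of the bump wherever we want on the real line.

        import numpy as np

        g = lambda x: 1.0 / (1.0 + np.exp(-x))

        x = np.linspace(-10, 10, 5)
        for w, b in [(1.0, 0.0), (1.0, -5.0), (3.0, 6.0)]:
            # g(w*x + b) centres the transition at x = -b/w and sharpens it by |w|.
            print(f"w={w}, b={b}:", np.round(g(w * x + b), 3))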

  • T_l as the layer type

    Definition
    We suggest several classes of T_l, as follows.

    T_l is said to be o, operational, if and only if

        T_l = o : L^p(R^{(l)}) \to L^1(R^{(l+1)}), \qquad \sigma \mapsto \int_{R^{(l)}} \sigma(i)\, w^{(l)}(i, j)\, di.    (5)

    T_l is said to be n, discrete, if and only if

        T_l = n : \mathbb{R}^n \to \mathbb{R}^m, \qquad \vec{\sigma} \mapsto \sum_{j}^{m} \vec{e}_j \sum_{i}^{n} \sigma_i\, w^{(l)}_{ij},    (6)

    where \vec{e}_j denotes the jth basis vector in ℝ^m.
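
    A sketch (with illustrative grids and kernels of my own) of the operational layer o and the discrete layer n as plain functions; the integral in Eq. (5) is again a Riemann sum.

        import numpy as np

        def o_layer(sigma_vals, i_grid, j_grid, w):
            # Operational T_l (Eq. 5): sigma |-> ∫ sigma(i) w(i, j) di on a quadrature grid.
            di = i_grid[1] - i_grid[0]
            I, J = np.meshgrid(i_grid, j_grid, indexing="ij")
            return (sigma_vals[:, None] * w(I, J)).sum(axis=0) * di

        def n_layer(sigma_vec, W):
            # Discrete T_l (Eq. 6): sigma |-> sum_j e_j sum_i sigma_i w_ij, i.e. a matrix product.
            return W.T @ sigma_vec          # W[i, j] holds w_ij

        i_grid = np.linspace(0, 1, 100)
        print(o_layer(np.sin(2 * np.pi * i_grid), i_grid, np.linspace(0, 1, 5),
                      lambda i, j: np.cos(i * j)))
        print(n_layer(np.ones(3), np.arange(6.0).reshape(3, 2)))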

  • T_l as the layer type

    Definition
    T_l is said to be n_1, transitional, if and only if

        T_l = n_1 : \mathbb{R}^n \to L^q(R^{(l+1)}), \qquad \vec{\sigma} \mapsto \sum_{i}^{n} \sigma_i\, w^{(l)}_i(j).    (7)

    T_l is said to be n_2, transitional, if and only if

        T_l = n_2 : L^p(R^{(l)}) \to \mathbb{R}^m, \qquad \sigma \mapsto \sum_{j}^{m} \vec{e}_j \int_{R^{(l)}} \sigma(i)\, w^{(l)}_j(i)\, di.    (8)
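
    Continuing the same sketch (again with illustrative grids and weight functions), the two transitional layers: n_1 turns a finite vector into a function and n_2 turns a function into a finite vector.

        import numpy as np

        def n1_layer(sigma_vec, weight_fns):
            # Transitional n_1 (Eq. 7): vector -> function, j |-> sum_i sigma_i w_i(j).
            return lambda j: sum(s * w(j) for s, w in zip(sigma_vec, weight_fns))

        def n2_layer(sigma_vals, i_grid, weight_fns):
            # Transitional n_2 (Eq. 8): function -> vector, component j = ∫ sigma(i) w_j(i) di.
            di = i_grid[1] - i_grid[0]
            return np.array([(sigma_vals * w(i_grid)).sum() * di for w in weight_fns])

        i_grid = np.linspace(0, 1, 200)
        f_of_j = n1_layer(np.array([1.0, -2.0]), [np.sin, np.cos])      # a function of j
        vec = n2_layer(np.sin(2 * np.pi * i_grid), i_grid, [lambda i: i, lambda i: i ** 2])
        print(f_of_j(0.5), vec)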

  • Neural networks as diagrams!

    This generalization is nice from a creative standpoint: I can come up with new sorts of "classifiers" on the fly. Examples:

    A three layer neural network is just

        N_3 : \mathbb{R}^{10000} \xrightarrow{g \circ n} \mathbb{R}^{30} \xrightarrow{g \circ n} \mathbb{R}^{3}.    (9)

    A three layer operator network is simply

        O_3 : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} C(\mathbb{R}).    (10)

    We can even classify functions!

        C : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} \cdots \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^n.    (11)
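
    A sketch of diagram (11) with the chain shortened to one g∘o step and one g∘n_2 step, mapping a function to a label vector; the grids, the kernel, and the two-layer depth are assumptions made for illustration.

        import numpy as np

        g = lambda x: 1.0 / (1.0 + np.exp(-x))

        def classify_function(f, i_grid, j_grid, w_o, w_n2_fns):
            # Diagram (11), shortened: L^p(R) --g∘o--> L^1(R) --g∘n_2--> R^n.
            di, dj = i_grid[1] - i_grid[0], j_grid[1] - j_grid[0]
            I, J = np.meshgrid(i_grid, j_grid, indexing="ij")
            hidden = g((f(i_grid)[:, None] * w_o(I, J)).sum(axis=0) * di)           # g∘o
            logits = np.array([(hidden * w(j_grid)).sum() * dj for w in w_n2_fns])  # n_2
            return g(logits)

        i_grid = np.linspace(0, 1, 300)
        j_grid = np.linspace(0, 1, 100)
        scores = classify_function(np.sin, i_grid, j_grid,
                                   lambda i, j: np.exp(-10 * (i - j) ** 2),
                                   [lambda j: j, lambda j: 1 - j, np.cos])
        print(scores)   # one score per "class"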

  • Results: Did abusing continuity help?

    For every layer, o has weights given by a polynomial

        w^{(l)}(i, j) = \sum_{b}^{Z^{(l)}_Y} \sum_{a}^{Z^{(l)}_X} k^{(l)}_{a,b}\, i^a j^b.    (12)

    Theorem
    Let C be a GANN with only one n_2 transitional layer whose weight polynomial has O(1) coefficients. Suppose a continuous function f(t) is sampled uniformly from t = 0 to t = N, so that x_n = f(n), and C takes as input the piecewise-linear interpolant of those samples,

        \xi(z) = (x_{n+1} - x_n)(z - n) + x_n, \qquad n \le z < n + 1.    (13)

    Then there exists some discrete neural network N, with O(N^2) weights, such that C(ξ) = N(x).
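
    To make the weight count concrete, here is a sketch (mine, with an assumed 2×2 coefficient grid) of the n_2 layer from Eq. (12) applied to the piecewise-linear input ξ of Eq. (13): the only parameters are the polynomial coefficients k_{a,b}, however finely the signal is sampled.

        import numpy as np

        def poly_weight(k):
            # Eq. (12): w(i, j) = sum_{a,b} k[a, b] * i^a * j^b.
            return lambda i, j: sum(k[a, b] * i ** a * j ** b
                                    for a in range(k.shape[0]) for b in range(k.shape[1]))

        def n2_on_interpolant(samples, k, j):
            # Component j of the n_2 layer applied to the interpolant xi of Eq. (13).
            N = len(samples) - 1
            z = np.linspace(0, N, 20 * N + 1)
            dz = z[1] - z[0]
            xi = np.interp(z, np.arange(N + 1), samples)   # xi(z) = (x_{n+1} - x_n)(z - n) + x_n
            return (xi * poly_weight(k)(z, j)).sum() * dz

        k = np.array([[0.1, 0.0], [0.0, -0.2]])             # the only weights: O(1) of them
        for N in (10, 100, 1000):                           # more and more samples...
            samples = np.sin(0.3 * np.arange(N + 1))
            print(N, k.size, round(n2_on_interpolant(samples, k, j=0.5), 4))   # ...same 4 weights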

  • Results: Did abusing continuity help?

    WHAT!?!? How did C reduce the number of weights from O(N^2) to O(1)?

    The infinite-dimensional versions of N, in particular O and C, are invariant to input quality. This takes the idea behind ConvNets to an extreme!

    This is easy to see.

  • Results: Representation Theory

    How good are continuous classifier networks, {C}, as algorithms?

    Theorem
    Let X be a compact Hausdorff space. For every ε > 0 and every continuous bounded functional f on L^q(X), there exists a two layer continuous classifier

        C : L^q(X) \xrightarrow{g \circ n_2} \mathbb{R}^m \xrightarrow{n} \mathbb{R}^n    (14)

    such that

        \| f - C \| < \varepsilon.    (15)

  • Results: Representation Theory

    How good are operator networks and GANNs as algorithms? They should be able to approximate the important operators, e.g. the Fourier transform, the Laplace transform, differentiation, etc.

    Theorem
    Given an operator neural network O and some layer l ∈ O, let K : C(R^{(l)}) → C(R^{(l)}) be a bounded linear operator. If we denote the operation of layer l on layer l − 1 as σ^{(l+1)} = g(Σ_{l+1} σ^{(l)}), then for every ε > 0 there exists a weight polynomial w^{(l)}(i, j) such that, in the supremum norm over R^{(l)},

        \big\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \big\|_\infty < \varepsilon.    (16)

    Proof.
    See paper. Nice!

  • Results: Stronger Representation Theory

    We want to show the following better theorem.

    Theorem
    Given an operator neural network O and some layer l ∈ O, let K : C(R^{(l)}) → C(R^{(l)}) be a bounded continuous operator. If we denote the operation of layer l on layer l − 1 as σ^{(l+1)} = g(Σ_{l+1} σ^{(l)}), then for every ε > 0 there exists a weight polynomial w^{(l)}(i, j) such that, in the supremum norm over R^{(l)},

        \big\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \big\|_\infty < \varepsilon.    (17)

    But how? Dirac spikes!


  • Results: Stronger Representation Theory

    [Diagram: under K and O, ξ ∈ L^p(ℝ) maps to f ∈ L^1(ℝ); pointwise at each j ∈ ℝ, under K_j and C_j, ξ maps to f(j).]

    Proof.
    Fix ε > 0. Given K : ξ ↦ f, let K_j : ξ ↦ f(j) be a functional on L^q.

    We can find a C_j : L^q(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^{m(j)} \xrightarrow{n} \mathbb{R}^1 so that, for all ξ,

        |C_j(\xi) - K_j(\xi)| = |C_j(\xi) - f(j)| < \varepsilon/2.


  • Results: Stronger Representation Theory

    Proof (continued).
    We know that

        C_j(\xi) = \sum_{k=1}^{m(j)} a_{jk}\, g\Big( \int_{\mathbb{R}} \xi(i)\, w_{kj}(i)\, d\mu(i) \Big).

  • Results: Stronger Representation Theory

    [Figure: the weight functions w_{1j}, w_{2j}, w_{3j}, w_{4j}.]

    Proof (continued).
    We wish to turn C_j into a two layer O. Let

        w^{(0)}(i, \ell) =
        \begin{cases}
          w_{kj}(i), & \text{if } \ell = j + k,\ k \in \{1, \dots, m(j)\}, \\
          0, & \text{otherwise.}
        \end{cases}

  • Results: Stronger Representation Theory

    Proof (continued).
    Then

        C_j(\xi) = \sum_{k=1}^{m} a_{jk}\, (g \circ o[\xi])(k + j).

    How do we turn this finite sum into an integral? Dirac time!


  • Results: Stronger Representation Theory

    Proof (continued).
    We define a Dirac spike as follows, for every n:

        \delta_{nkj}(\ell) = c_n \exp\big( -b\, n^2\, |\ell - (j + k)|^2 \big),

    where c, b are set so that \int_{\mathbb{R}} \delta_{nkj} = 1.
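
    A quick numerical sketch (my own) of the spike: a Gaussian normalized to unit mass that concentrates at ℓ = j + k as n grows and so, when integrated against a continuous function, picks out its value at that point.

        import numpy as np

        def dirac_spike(ell, n, center, b=1.0):
            # delta_n(ell) = c_n * exp(-b * n^2 * |ell - center|^2), with c_n chosen so ∫ delta_n = 1.
            c_n = n * np.sqrt(b / np.pi)
            return c_n * np.exp(-b * n ** 2 * (ell - center) ** 2)

        ell = np.linspace(-10, 10, 20001)
        d_ell = ell[1] - ell[0]
        h = np.cos                                    # any continuous test function
        for n in (1, 4, 16, 64):
            spike = dirac_spike(ell, n, center=2.0)
            mass = spike.sum() * d_ell                # -> 1
            picked = (spike * h(ell)).sum() * d_ell   # -> h(2.0) = cos(2) ≈ -0.416
            print(n, round(mass, 4), round(picked, 4))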

  • Results: Stronger Representation Theory

    Proof (continued).
    Now let the second weight function be

        w^{(1)}_n(\ell, j) = \sum_{k=1}^{m} a_{jk}\, \delta_{nkj}(\ell).

  • Results: Stronger Representation Theory

    Proof (continued).
    Putting everything together, for every n let O_n : L^p(\mathbb{R}) \to L^1([0, 1]) be given by

        O_n : \xi \mapsto \int_{\mathbb{R}} w^{(1)}_n(\ell, j)\, (g \circ o[\xi])(\ell)\, d\mu(\ell).

    Clearly O_n[\xi](j) \to \sum_{k=1}^{m} a_{jk}\, (g \circ o[\xi])(k + j) = C_j(\xi) as n \to \infty.
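
    A numerical sketch of this limiting step (with an arbitrary smooth stand-in for g∘o[ξ] and made-up coefficients a_{jk}, so purely illustrative): integrating the spike-built weight w^{(1)}_n against a continuous function converges to the finite sum Σ_k a_{jk} h(j + k) as n grows.

        import numpy as np

        def spike(ell, n, center, b=1.0):
            return n * np.sqrt(b / np.pi) * np.exp(-b * n ** 2 * (ell - center) ** 2)

        j = 0.0
        a = np.array([0.7, -1.3, 0.4])                    # illustrative coefficients a_{j1}, ..., a_{jm}
        h = lambda ell: np.tanh(np.sin(ell))              # stand-in for (g∘o[xi])(ell)

        ell = np.linspace(-20, 20, 200001)
        d_ell = ell[1] - ell[0]
        target = sum(a[k] * h(j + (k + 1)) for k in range(len(a)))              # C_j(xi)
        for n in (1, 4, 16, 64):
            w1 = sum(a[k] * spike(ell, n, j + (k + 1)) for k in range(len(a)))  # w^(1)_n(ell, j)
            approx = (w1 * h(ell)).sum() * d_ell                                # O_n[xi](j)
            print(n, round(approx, 5), "-> target", round(float(target), 5))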

  • Results: Stronger Representation Theory

    Proof (continued).
    Therefore, for every ε > 0 there exists an N such that for all n > N, for all ξ, and for all j,

        |O_n[\xi](j) - C_j[\xi]| \le \| O_n[\,\cdot\,](j) - C_j[\,\cdot\,] \| < \varepsilon/2.

    Recall that for every j, \| K_j - C_j \| < \varepsilon/2.


  • Results: Stronger Representation Theory

    Proof (continued).
    By the triangle inequality we have that, for all j,

        \| K_j - O_n(j) \| = \| K_j - O_n(j) + C_j - C_j \| \le \| K_j - C_j \| + \| O_n(j) - C_j \| < \varepsilon.

    Therefore \| K - O \| < \varepsilon.

