GANNs: A New Theory of Representation for Nonlinear Bounded Operators
TRANSCRIPT
-
GANNs: A New Theory of Representation for Nonlinear Bounded Operators
William Guss
Machine Learning at Berkeley
April 22, 2016
-
Introduction: What’s up with continuous data?
All of the data we deal with is discrete thanks to Turing.
But most of it models a continuous process.
Examples:
Audio: we take > 100k samples of something we could describe with $f : \mathbb{R} \to \mathbb{R}$! Trick question: which is easier to use, (a) $v \in \mathbb{R}^{100000}$ or (b) $f$?
Images: we take $100\text{k} \times 100\text{k}$ samples of something we could describe with $f : \mathbb{R}^2 \to \mathbb{R}$.
Why do we use discrete data? No computer known can really store $f$. End of story.
-
Introduction: Abusing continuity
f can’t be that bad. Can it?
If f is smooth it’s easy to draw:
[Plot: the graph of $f(x) = x^2$ on $[-10, 10]$.]
I can even name $f$ most of the time: $f : x \mapsto x^2$, or even super precisely $g : x \mapsto \sum_{n=0}^{\infty} a_n x^n$.
Moral: Smooth functions are mostly very manageable.
-
Introduction: Abusing continuity
So why do we do this:
To classify this:
-
The Core Idea: Let neural networks abuse continuity and smoothness.
-
Artificial Neural Networks
Definition
We say $N : \mathbb{R}^n \to \mathbb{R}^m$ is a feed-forward neural network if, for an input vector $x$,
$$N : \quad \sigma^{(l+1)}_j = g\left( \sum_{i \in Z^{(l)}} w^{(l)}_{ij} \sigma^{(l)}_i + \beta^{(l)} \right), \qquad \sigma^{(0)}_i = x_i, \tag{1}$$
where $1 \le l \le L - 1$. Furthermore we say $\{N\}$ is the set of all neural networks.
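As a sanity check, here is a minimal NumPy sketch of equation (1); the sigmoid nonlinearity, layer sizes, and random weights are my own illustrative choices, not part of the definition.

```python
import numpy as np

def g(x):
    # sigmoid nonlinearity (the "bump" used later in the talk)
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, weights, biases):
    # Equation (1): sigma^(l+1)_j = g( sum_i w^(l)_ij sigma^(l)_i + beta^(l) ), sigma^(0) = x.
    sigma = x
    for W, beta in zip(weights, biases):
        sigma = g(sigma @ W + beta)
    return sigma

# Example network N : R^4 -> R^3 -> R^2 with random weights and scalar biases.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
biases = [0.1, -0.2]
print(feedforward(rng.standard_normal(4), weights, biases))
```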
-
Operator Neural Networks
Let's get rid of $\mathbb{R}^{100000}$ and use $f$.
Definition
We call $O : L^p(X) \to L^1(Y)$ an operator neural network if
$$O : \quad \sigma^{(l+1)}(j) = g\left( \int_{R^{(l)}} \sigma^{(l)}(i)\, w^{(l)}(i, j)\, di \right), \qquad \sigma^{(0)}(j) = f(j). \tag{2}$$
Furthermore let $\{O\}$ denote the set of all operator neural networks.
Well that was easy. In fact $\{O\} \supset \{N\}$.
These definitions look really similar. Is there some more general category or structure containing them?
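The integral in equation (2) cannot be stored exactly on a computer, but a quadrature sketch conveys the idea; the grids, the Gaussian kernel, the input $\sin$, and the trapezoid rule below are illustrative assumptions.

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def operator_layer(sigma, w, in_grid, out_grid):
    # Equation (2): sigma^(l+1)(j) = g( int sigma^(l)(i) w(i, j) di ), via the trapezoid rule.
    return g(np.array([np.trapz(sigma(in_grid) * w(in_grid, j), in_grid) for j in out_grid]))

# Illustrative choices: input function sin, Gaussian kernel w(i, j) = exp(-(i - j)^2).
in_grid = np.linspace(-5.0, 5.0, 1001)
out_grid = np.linspace(-2.0, 2.0, 9)
print(operator_layer(np.sin, lambda i, j: np.exp(-(i - j) ** 2), in_grid, out_grid))
```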
-
Generalized Artificial Neural Networks
Definition
If $A, B$ are (possibly distinct) Banach spaces over a field $\mathbb{F}$, we say $G : A \to B$ is a generalized neural network if and only if
$$G : \quad \sigma^{(l+1)} = g\left( T_l\left[\sigma^{(l)}\right] + \beta^{(l)} \right), \qquad \sigma^{(0)} = \xi, \tag{3}$$
for some input $\xi \in A$ and linear forms $T_l$.
Claim: "Neural networks" are powerful because they can move bumps anywhere!
How? $T_l$ is a linear form. It can move $\sigma^{(l)}$ anywhere, and $g$ is a bump of some sort.
-
Moving bumps around
The sigmoid function
$$g(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
is a bump that we can move around with weights!
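A small numerical sketch of this claim (my own illustration, not from the paper): a weight and bias move the sigmoid's transition point, and a difference of two shifted sigmoids gives a localized bump.

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10.0, 10.0, 9)
# A weight w and bias b move the sigmoid's transition to x = -b / w.
w, b = 4.0, -12.0                      # transition near x = 3
print(g(w * x + b))
# A difference of two shifted sigmoids is a bump between the two transitions.
print(g(10 * (x - 2)) - g(10 * (x - 4)))
```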
-
$T_l$ as the layer type.
Definition
We suggest several classes of $T_l$ as follows.
$T_l$ is said to be $o$, operational, if and only if
$$T_l = o : L^p(R^{(l)}) \to L^1(R^{(l+1)}), \qquad \sigma \mapsto \int_{R^{(l)}} \sigma(i)\, w^{(l)}(i, j)\, di. \tag{5}$$
$T_l$ is said to be $n$, discrete, if and only if
$$T_l = n : \mathbb{R}^n \to \mathbb{R}^m, \qquad \vec{\sigma} \mapsto \sum_{j}^{m} \vec{e}_j \sum_{i}^{n} \sigma_i w^{(l)}_{ij}, \tag{6}$$
where $\vec{e}_j$ denotes the $j$th basis vector in $\mathbb{R}^m$.
-
$T_l$ as the layer type.
Definition
$T_l$ is said to be $n_1$, transitional, if and only if
$$T_l = n_1 : \mathbb{R}^n \to L^q(R^{(l+1)}), \qquad \vec{\sigma} \mapsto \sum_{i}^{n} \sigma_i w^{(l)}_i(j). \tag{7}$$
$T_l$ is said to be $n_2$, transitional, if and only if
$$T_l = n_2 : L^p(R^{(l)}) \to \mathbb{R}^m, \qquad \sigma(i) \mapsto \sum_{j}^{m} \vec{e}_j \int_{R^{(l)}} \sigma(i)\, w^{(l)}_j(i)\, di. \tag{8}$$
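To make the $n_2$ transitional layer concrete, here is a quadrature sketch of equation (8) mapping a function to a vector in $\mathbb{R}^3$; the grid and the Gaussian weight functions are illustrative assumptions.

```python
import numpy as np

def n2_layer(sigma, weight_fns, grid):
    # Equation (8): component j of the output is int sigma(i) w_j(i) di (trapezoid rule),
    # so a whole function sigma is collapsed into a vector in R^m.
    return np.array([np.trapz(sigma(grid) * w_j(grid), grid) for w_j in weight_fns])

# Illustrative weight functions: narrow Gaussians that read off local averages of sigma.
grid = np.linspace(0.0, 1.0, 501)
weight_fns = [lambda i, c=c: np.exp(-200.0 * (i - c) ** 2) for c in (0.25, 0.5, 0.75)]
print(n2_layer(np.cos, weight_fns, grid))   # a vector in R^3
```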
-
Neural networks as diagrams!
This generalization is nice from a creative standpoint. I can come up with new sorts of "classifiers" on the fly. Examples:
A three layer neural network is just
$$N_3 : \mathbb{R}^{10000} \xrightarrow{g \circ n} \mathbb{R}^{30} \xrightarrow{g \circ n} \mathbb{R}^3. \tag{9}$$
A three layer operator network is simply
$$O_3 : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} C(\mathbb{R}). \tag{10}$$
We can even classify functions!
$$C : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} \cdots \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^n. \tag{11}$$
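A sketch of diagram (11) as plain function composition, with one operational layer followed by an $n_2$ transition; the grid, kernels, and input function are assumed purely for illustration.

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

grid = np.linspace(0.0, 1.0, 401)

def o_layer(sigma_vals, kernel):
    # operational layer on a fixed grid: (o sigma)(j) = int sigma(i) kernel(i, j) di
    return np.array([np.trapz(sigma_vals * kernel(grid, j), grid) for j in grid])

def n2_layer(sigma_vals, centers):
    # transitional layer: function values -> R^m via narrow Gaussian weight functions
    return np.array([np.trapz(sigma_vals * np.exp(-200.0 * (grid - c) ** 2), grid)
                     for c in centers])

# C : L^p(R) --g.o--> L^1(R) --g.n2--> R^3, applied to the input function xi = sin(2 pi z).
xi = np.sin(2 * np.pi * grid)
hidden = g(o_layer(xi, lambda i, j: np.exp(-50.0 * (i - j) ** 2)))
print(g(n2_layer(hidden, (0.25, 0.5, 0.75))))
```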
-
Results: Did abusing continuity help?
For every layer, the operational map $o$ has weights
$$w^{(l)}(i, j) = \sum_{b}^{Z^{(l)}_Y} \sum_{a}^{Z^{(l)}_X} k^{(l)}_{a,b}\, i^a j^b. \tag{12}$$
Theorem
Let $C$ be a GANN with only one $n_2$ transitional layer with an $O(1)$ weight polynomial. Suppose a continuous function, say $f(t)$, is sampled uniformly from $t = 0$ to $t = N$, such that $x_n = f(n)$, and suppose $C$ has as input the piecewise linear function
$$\xi = (x_{n+1} - x_n)(z - n) + x_n \tag{13}$$
for $n \le z < n + 1$. Then there exists some discrete neural network $N$ with $O(N^2)$ weights such that $C(\xi) = N(x)$.
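To unpack the theorem's setup, here is a sketch (with an arbitrary sampled function and made-up polynomial coefficients) of the piecewise linear input $\xi$ from (13) and a weight polynomial of the form (12).

```python
import numpy as np

# Samples x_n = f(n) of a continuous f on t = 0..N (f = cos(t/2) is a stand-in).
N = 10
x = np.cos(np.arange(N + 1) / 2.0)

def xi(z):
    # Equation (13): the piecewise linear interpolant of the samples, n <= z < n + 1.
    n = np.clip(np.floor(z).astype(int), 0, N - 1)
    return (x[n + 1] - x[n]) * (z - n) + x[n]

def w(i, j, k):
    # Equation (12): a weight polynomial w(i, j) = sum_{a,b} k_{a,b} i^a j^b.
    A, B = k.shape
    return sum(k[a, b] * i ** a * j ** b for a in range(A) for b in range(B))

k = np.array([[0.5, -0.1], [0.2, 0.05]])   # O(1) coefficients, independent of N
print(xi(np.array([0.5, 3.7])), w(1.0, 2.0, k))
```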
-
Results: Did abusing continuity help?
WHAT!?!? How did $C$ reduce the number of weights from $O(N^2)$ to $O(1)$?
The infinite dimensional versions of $N$, in particular $O$ and $C$, are invariant to input quality. This takes the idea behind ConvNets to an extreme!
This is easy to see.
-
Results: Representation Theory
How good are Continuous Classifier Networks, $\{C\}$, as algorithms?
Theorem
Let $X$ be a compact Hausdorff space. For every $\epsilon > 0$ and every continuous bounded functional on $L^q(X)$, say $f$, there exists a two layer continuous classifier
$$C : L^q(X) \xrightarrow{g \circ n_2} \mathbb{R}^m \xrightarrow{n} \mathbb{R}^n \tag{14}$$
such that
$$\|f - C\| < \epsilon. \tag{15}$$
-
Results: Representation Theory
How good are Operator Networks and GANNs as algorithms? They should be able to approximate the important operators, e.g. the Fourier transform, the Laplace transform, differentiation, etc.
Theorem
Given an operator neural network $O$ and some layer $l \in O$, let $K : C(R^{(l)}) \to C(R^{(l)})$ be a bounded linear operator. If we denote the operation of layer $l$ on layer $l - 1$ as $\sigma^{(l+1)} = g(\Sigma_{l+1}\sigma^{(l)})$, then for every $\epsilon > 0$ there exists a weight polynomial $w^{(l)}(i, j)$ such that, in the supremum norm over $R^{(l)}$,
$$\left\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \right\|_{\infty} < \epsilon. \tag{16}$$
Proof.
See paper. Nice!
-
Results: Stronger Representation Theory
We want to show the following stronger theorem.
Theorem
Given an operator neural network $O$ and some layer $l \in O$, let $K : C(R^{(l)}) \to C(R^{(l)})$ be a bounded continuous operator. If we denote the operation of layer $l$ on layer $l - 1$ as $\sigma^{(l+1)} = g(\Sigma_{l+1}\sigma^{(l)})$, then for every $\epsilon > 0$ there exists a weight polynomial $w^{(l)}(i, j)$ such that, in the supremum norm over $R^{(l)}$,
$$\left\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \right\|_{\infty} < \epsilon. \tag{17}$$
But how? Dirac spikes!
-
Results: Stronger Representation Theory
[Commutative diagram: $K$ and $O$ map $\xi \in L^p(\mathbb{R})$ to $f \in L^1(\mathbb{R})$; evaluation at $j \in \mathbb{R}$ gives $f(j)$, so each point $j$ defines a functional $K_j$, to be matched by a classifier $C_j$.]
Proof.
-
Results: Stronger Representation Theory
Proof.
Fix $\epsilon > 0$. Given $K : \xi \mapsto f$, let $K_j : \xi \mapsto f(j)$ be a functional on $L^q$.
We can find a $C_j : L^q(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^{m(j)} \xrightarrow{n} \mathbb{R}^1$ so that for all $\xi$,
$$|C_j(\xi) - K_j(\xi)| = |C_j(\xi) - f(j)| < \epsilon/2.$$
-
Results: Stronger Representation Theory
Proof.
We know that
$$C_j(\xi) = \sum_{k=1}^{m(j)} a_{jk}\, g\left( \int_{\mathbb{R}} \xi(i)\, w_{kj}(i)\, d\mu(i) \right).$$
-
Results: Stronger Representation Theory
[Figure: the weight functions $w_{1j}, w_{2j}, w_{3j}, w_{4j}$ placed side by side along the first layer's output domain.]
Proof.
We wish to turn $C_j$ into a two layer $O$. Let
$$w^{(0)}(i, \ell) = \begin{cases} w_{kj}(i) & \text{if } \ell = j + k,\ k \in \{1, \dots, m(j)\}, \\ 0 & \text{otherwise.} \end{cases}$$
-
Results: Stronger Representation Theory
Proof.
Then
$$C_j(\xi) = \sum_{k=1}^{m} a_{jk}\, g \circ o[\xi](k + j).$$
How do we turn this finite sum into an integral? Dirac time!
-
Results: Stronger Representation Theory
Proof.
We define a Dirac spike as follows, for every $n$:
$$\delta_{nkj}(\ell) = c_n \exp\left( -b n^2 |\ell - (j + k)|^2 \right),$$
where $c_n, b$ are set so that $\int_{\mathbb{R}} \delta_{nkj} = 1$.
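A numerical sketch of the spikes (with $b$ chosen arbitrarily): normalization fixes $c_n$, and as $n$ grows, integrating $\delta_{nkj}$ against a continuous function picks out that function's value at $\ell = j + k$.

```python
import numpy as np

def dirac_spike(ell, n, j, k, b=1.0):
    # delta_nkj(ell) = c_n exp(-b n^2 |ell - (j + k)|^2), with c_n fixed by int delta = 1.
    c_n = n * np.sqrt(b / np.pi)
    return c_n * np.exp(-b * n ** 2 * (ell - (j + k)) ** 2)

ell = np.linspace(-20.0, 20.0, 200001)
h = np.cos                                   # any continuous test function
for n in (1, 5, 50):
    d = dirac_spike(ell, n, j=2.0, k=1.0)
    # The mass stays ~1; the weighted integral converges to h(j + k) = cos(3).
    print(np.trapz(d, ell), np.trapz(d * h(ell), ell), h(3.0))
```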
-
Results: Stronger Representation Theory
Proof.
Now let the second weight function be
$$w^{(1)}_n(\ell, j) = \sum_{k=1}^{m} a_{jk}\, \delta_{nkj}(\ell).$$
-
Results: Stronger Representation Theory
Proof.
Putting everything together, for every $n$ let $O_n : L^p(\mathbb{R}) \to L^1([0, 1])$,
$$O_n : \xi \mapsto \int_{\mathbb{R}} w^{(1)}_n(\ell, j)\, g \circ o[\xi](\ell)\, d\mu(\ell).$$
Clearly $O_n \to \sum_{k=1}^{m} a_{jk}\, g \circ o[\xi](k + j)$ as $n \to \infty$.
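A quick numerical check of this limit, with $g \circ o[\xi]$ replaced by an arbitrary continuous stand-in $h$ and made-up coefficients $a_{jk}$.

```python
import numpy as np

def dirac_spike(ell, n, j, k, b=1.0):
    return n * np.sqrt(b / np.pi) * np.exp(-b * n ** 2 * (ell - (j + k)) ** 2)

ell = np.linspace(-30.0, 30.0, 400001)
h = np.sin                                    # stand-in for g . o[xi]
a = np.array([0.3, -1.2, 0.7])                # made-up coefficients a_jk, m = 3
j = 1.0

target = sum(a[k - 1] * h(j + k) for k in (1, 2, 3))
for n in (1, 5, 50):
    w1 = sum(a[k - 1] * dirac_spike(ell, n, j, k) for k in (1, 2, 3))
    # int w^(1)_n(ell, j) h(ell) d ell  ->  sum_k a_jk h(j + k) as n grows
    print(np.trapz(w1 * h(ell), ell), target)
```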
-
Results: Stronger Representation Theory
Proof.
Therefore for every $\epsilon > 0$ there exists an $N$ such that for all $n > N$, for all $\xi$, and for all $j$,
$$|O_n[\xi](j) - C_j[\xi]| \le \|O_n[\cdot](j) - C_j[\cdot]\| < \epsilon/2.$$
Recall that for every $j$, $\|K_j - C_j\| < \epsilon/2$.
-
Results: Stronger Representation Theory
Proof.
By the triangle inequality we have that for all $j$,
$$\|K_j - O_n(j)\| = \|K_j - O_n(j) + C_j - C_j\| \le \|K_j - C_j\| + \|O_n(j) - C_j\| < \epsilon.$$
Therefore $\|K - O\| < \epsilon$.