GANNs: A New Theory of Representation for Nonlinear Bounded Operators (William Guss, Machine Learning at Berkeley, April 22, 2016)


  • GANNs: A New Theory of Representation for Nonlinear Bounded Operators

    William Guss

    Machine Learning at Berkeley

    April 22, 2016

  • Introduction: What's up with continuous data?

    All of the data we deal with is discrete, thanks to Turing.

    But most of it models a continuous process.

    Examples:
    Audio: We take > 100k samples of something we could describe with f : ℝ → ℝ! Trick question: which is easier to use, (a) v ∈ ℝ^100000 or (b) f?
    Images: We take 100k × 100k samples of something we could describe with f : ℝ² → ℝ.

    Why do we use discrete data? No computer known can really store f. End of story.

  • Introduction: Abusing continuity

    f can't be that bad. Can it?

    If f is smooth it's easy to draw:

    [Plot: f(x) = x² for −10 ≤ x ≤ 10]

    I can even name f most of the time: f : x ↦ x², or even super precisely g : x ↦ \sum_{n=0}^{\infty} a_n x^n.

    Moral: smooth functions are mostly very manageable.

  • Introduction: Abusing continuity

    So why do we do this:

    To classify this:

  • The Core Idea: Let neural networks abuse continuity and smoothness.

  • Artificial Neural Networks

    Definition
    We say N : ℝ^n → ℝ^m is a feed-forward neural network if, for an input vector x,

        N : \sigma^{(l+1)}_j = g\Big( \sum_{i \in Z^{(l)}} w^{(l)}_{ij}\, \sigma^{(l)}_i + \beta^{(l)} \Big), \qquad \sigma^{(0)}_i = x_i,    (1)

    where 1 ≤ l ≤ L − 1. Furthermore, we say {N} is the set of all neural networks.
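
    A minimal sketch of Eq. (1) in NumPy (not code from the talk); the layer sizes, the logistic choice of g, and the helper name feed_forward are illustrative assumptions.

        import numpy as np

        def g(x):
            # Logistic sigmoid playing the role of the activation "bump" g.
            return 1.0 / (1.0 + np.exp(-x))

        def feed_forward(x, weights, biases):
            # Eq. (1): sigma^(l+1)_j = g( sum_i w^(l)_ij sigma^(l)_i + beta^(l) ), sigma^(0) = x.
            sigma = x
            for W, b in zip(weights, biases):
                sigma = g(W @ sigma + b)   # one layer: linear map, then the bump
            return sigma

        # Example: a network R^4 -> R^3 -> R^2 with random weights and scalar biases beta^(l).
        rng = np.random.default_rng(0)
        weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
        biases = [0.1, -0.2]
        print(feed_forward(rng.standard_normal(4), weights, biases))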

  • Operator Neural Networks

    Let's get rid of ℝ^100000 and use f.

    Definition
    We call O : L^p(X) → L^1(Y) an operator neural network if

        O : \sigma^{(l+1)}(j) = g\Big( \int_{R^{(l)}} \sigma^{(l)}(i)\, w^{(l)}(i, j)\, di \Big), \qquad \sigma^{(0)}(j) = f(j).    (2)

    Furthermore, let {O} denote the set of all operator neural networks.

    Well, that was easy. In fact {O} ⊃ {N}. These definitions look really similar; is there some more general category or structure containing them?
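
    A minimal sketch of one layer of Eq. (2), with the integral over R^(l) approximated by a Riemann sum on a grid; the grids and the weight kernel here are assumptions for illustration, not part of the talk.

        import numpy as np

        def g(x):
            return 1.0 / (1.0 + np.exp(-x))

        def operator_layer(sigma, i_grid, j_grid, w):
            # Eq. (2): sigma^(l+1)(j) = g( ∫ sigma^(l)(i) w(i, j) di ), integral as a Riemann sum.
            di = i_grid[1] - i_grid[0]
            I, J = np.meshgrid(i_grid, j_grid, indexing="ij")
            pre = (sigma(i_grid)[:, None] * w(I, J)).sum(axis=0) * di
            return g(pre)

        # Example: input function f(i) = sin(i) and an arbitrary smooth kernel w(i, j).
        i_grid = np.linspace(0.0, 1.0, 200)
        j_grid = np.linspace(0.0, 1.0, 50)
        out = operator_layer(np.sin, i_grid, j_grid, lambda i, j: np.exp(-(i - j) ** 2))
        print(out.shape)   # one value of sigma^(1)(j) per grid point j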

  • Generalized Artificial Neural Networks

    Definition
    If A, B are (possibly distinct) Banach spaces over a field F, we say G : A → B is a generalized neural network if and only if

        G : \sigma^{(l+1)} = g\big( T_l[\sigma^{(l)}] + \beta^{(l)} \big), \qquad \sigma^{(0)} = \xi,    (3)

    for some input ξ ∈ A and a linear form T_l.

    Claim: "Neural networks" are powerful because they can move bumps anywhere!
    How? T_l is a linear form: it can move σ^{(l)} anywhere, and g is a bump of some sort.
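
    As a sketch of Eq. (3) (my own illustration, not code from the talk), a generalized layer can take the linear form T_l as a pluggable callable; the discrete and operator layers are then just two choices of T.

        import numpy as np

        def g(x):
            return 1.0 / (1.0 + np.exp(-x))

        def gann_layer(sigma, T, beta):
            # Eq. (3): sigma^(l+1) = g( T_l[sigma^(l)] + beta^(l) ) for any linear form T.
            return g(T(sigma) + beta)

        # Discrete choice of T: an ordinary matrix multiply.
        W = np.array([[1.0, -1.0], [0.5, 2.0]])
        print(gann_layer(np.array([0.3, -0.7]), lambda s: W @ s, beta=0.0))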

  • Moving bumps around

    The sigmoid function

        g(x) = \frac{1}{1 + e^{-x}}    (4)

    is a bump that we can move around with weights!
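
    A small illustration (mine, not from the slides): scaling and shifting the sigmoid's argument with a weight and bias moves the transition of the bump wherever we want on the real line.

        import numpy as np

        g = lambda x: 1.0 / (1.0 + np.exp(-x))

        x = np.linspace(-10, 10, 5)
        for w, b in [(1.0, 0.0), (1.0, -5.0), (3.0, 6.0)]:
            # g(w*x + b) centres the transition at x = -b/w and sharpens it by |w|.
            print(f"w={w}, b={b}:", np.round(g(w * x + b), 3))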

  • T_l as the layer type

    Definition
    We suggest several classes of T_l, as follows.

    T_l is said to be o, operational, if and only if

        T_l = o : L^p(R^{(l)}) \to L^1(R^{(l+1)}), \qquad \sigma \mapsto \int_{R^{(l)}} \sigma(i)\, w^{(l)}(i, j)\, di.    (5)

    T_l is said to be n, discrete, if and only if

        T_l = n : \mathbb{R}^n \to \mathbb{R}^m, \qquad \vec{\sigma} \mapsto \sum_{j}^{m} \vec{e}_j \sum_{i}^{n} \sigma_i\, w^{(l)}_{ij},    (6)

    where \vec{e}_j denotes the jth basis vector in ℝ^m.
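
    A sketch (with illustrative grids and kernels of my own) of the operational layer o and the discrete layer n as plain functions; the integral in Eq. (5) is again a Riemann sum.

        import numpy as np

        def o_layer(sigma_vals, i_grid, j_grid, w):
            # Operational T_l (Eq. 5): sigma |-> ∫ sigma(i) w(i, j) di on a quadrature grid.
            di = i_grid[1] - i_grid[0]
            I, J = np.meshgrid(i_grid, j_grid, indexing="ij")
            return (sigma_vals[:, None] * w(I, J)).sum(axis=0) * di

        def n_layer(sigma_vec, W):
            # Discrete T_l (Eq. 6): sigma |-> sum_j e_j sum_i sigma_i w_ij, i.e. a matrix product.
            return W.T @ sigma_vec          # W[i, j] holds w_ij

        i_grid = np.linspace(0, 1, 100)
        print(o_layer(np.sin(2 * np.pi * i_grid), i_grid, np.linspace(0, 1, 5),
                      lambda i, j: np.cos(i * j)))
        print(n_layer(np.ones(3), np.arange(6.0).reshape(3, 2)))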

  • T_l as the layer type

    Definition
    T_l is said to be n_1, transitional, if and only if

        T_l = n_1 : \mathbb{R}^n \to L^q(R^{(l+1)}), \qquad \vec{\sigma} \mapsto \sum_{i}^{n} \sigma_i\, w^{(l)}_i(j).    (7)

    T_l is said to be n_2, transitional, if and only if

        T_l = n_2 : L^p(R^{(l)}) \to \mathbb{R}^m, \qquad \sigma \mapsto \sum_{j}^{m} \vec{e}_j \int_{R^{(l)}} \sigma(i)\, w^{(l)}_j(i)\, di.    (8)
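
    Continuing the same sketch (again with illustrative grids and weight functions), the two transitional layers: n_1 turns a finite vector into a function and n_2 turns a function into a finite vector.

        import numpy as np

        def n1_layer(sigma_vec, weight_fns):
            # Transitional n_1 (Eq. 7): vector -> function, j |-> sum_i sigma_i w_i(j).
            return lambda j: sum(s * w(j) for s, w in zip(sigma_vec, weight_fns))

        def n2_layer(sigma_vals, i_grid, weight_fns):
            # Transitional n_2 (Eq. 8): function -> vector, component j = ∫ sigma(i) w_j(i) di.
            di = i_grid[1] - i_grid[0]
            return np.array([(sigma_vals * w(i_grid)).sum() * di for w in weight_fns])

        i_grid = np.linspace(0, 1, 200)
        f_of_j = n1_layer(np.array([1.0, -2.0]), [np.sin, np.cos])      # a function of j
        vec = n2_layer(np.sin(2 * np.pi * i_grid), i_grid, [lambda i: i, lambda i: i ** 2])
        print(f_of_j(0.5), vec)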

  • Neural networks as diagrams!

    This generalization is nice from a creative standpoint: I can come up with new sorts of "classifiers" on the fly. Examples:

    A three layer neural network is just

        N_3 : \mathbb{R}^{10000} \xrightarrow{g \circ n} \mathbb{R}^{30} \xrightarrow{g \circ n} \mathbb{R}^{3}.    (9)

    A three layer operator network is simply

        O_3 : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} C(\mathbb{R}).    (10)

    We can even classify functions!

        C : L^p(\mathbb{R}) \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ o} \cdots \xrightarrow{g \circ o} L^1(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^n.    (11)
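
    A sketch of diagram (11) with the chain shortened to one g∘o step and one g∘n_2 step, mapping a function to a label vector; the grids, the kernel, and the two-layer depth are assumptions made for illustration.

        import numpy as np

        g = lambda x: 1.0 / (1.0 + np.exp(-x))

        def classify_function(f, i_grid, j_grid, w_o, w_n2_fns):
            # Diagram (11), shortened: L^p(R) --g∘o--> L^1(R) --g∘n_2--> R^n.
            di, dj = i_grid[1] - i_grid[0], j_grid[1] - j_grid[0]
            I, J = np.meshgrid(i_grid, j_grid, indexing="ij")
            hidden = g((f(i_grid)[:, None] * w_o(I, J)).sum(axis=0) * di)           # g∘o
            logits = np.array([(hidden * w(j_grid)).sum() * dj for w in w_n2_fns])  # n_2
            return g(logits)

        i_grid = np.linspace(0, 1, 300)
        j_grid = np.linspace(0, 1, 100)
        scores = classify_function(np.sin, i_grid, j_grid,
                                   lambda i, j: np.exp(-10 * (i - j) ** 2),
                                   [lambda j: j, lambda j: 1 - j, np.cos])
        print(scores)   # one score per "class"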

  • Results: Did abusing continuity help?

    For every layer, o has weights given by a polynomial

        w^{(l)}(i, j) = \sum_{b}^{Z^{(l)}_Y} \sum_{a}^{Z^{(l)}_X} k^{(l)}_{a,b}\, i^a j^b.    (12)

    Theorem
    Let C be a GANN with only one n_2 transitional layer whose weight polynomial has O(1) coefficients. Suppose a continuous function f(t) is sampled uniformly from t = 0 to t = N, so that x_n = f(n), and C takes as input the piecewise-linear interpolant of those samples,

        \xi(z) = (x_{n+1} - x_n)(z - n) + x_n, \qquad n \le z < n + 1.    (13)

    Then there exists some discrete neural network N, with O(N^2) weights, such that C(ξ) = N(x).
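
    To make the weight count concrete, here is a sketch (mine, with an assumed 2×2 coefficient grid) of the n_2 layer from Eq. (12) applied to the piecewise-linear input ξ of Eq. (13): the only parameters are the polynomial coefficients k_{a,b}, however finely the signal is sampled.

        import numpy as np

        def poly_weight(k):
            # Eq. (12): w(i, j) = sum_{a,b} k[a, b] * i^a * j^b.
            return lambda i, j: sum(k[a, b] * i ** a * j ** b
                                    for a in range(k.shape[0]) for b in range(k.shape[1]))

        def n2_on_interpolant(samples, k, j):
            # Component j of the n_2 layer applied to the interpolant xi of Eq. (13).
            N = len(samples) - 1
            z = np.linspace(0, N, 20 * N + 1)
            dz = z[1] - z[0]
            xi = np.interp(z, np.arange(N + 1), samples)   # xi(z) = (x_{n+1} - x_n)(z - n) + x_n
            return (xi * poly_weight(k)(z, j)).sum() * dz

        k = np.array([[0.1, 0.0], [0.0, -0.2]])             # the only weights: O(1) of them
        for N in (10, 100, 1000):                           # more and more samples...
            samples = np.sin(0.3 * np.arange(N + 1))
            print(N, k.size, round(n2_on_interpolant(samples, k, j=0.5), 4))   # ...same 4 weights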

  • Results: Did abusing continuity help?

    WHAT!?!? How did C reduce the number of weights from O(N^2) to O(1)?

    The infinite-dimensional versions of N, in particular O and C, are invariant to input quality. This takes the idea behind ConvNets to an extreme!

    This is easy to see.

  • Results: Representation Theory

    How good are continuous classifier networks, {C}, as algorithms?

    Theorem
    Let X be a compact Hausdorff space. For every ε > 0 and every continuous bounded functional f on L^q(X), there exists a two layer continuous classifier

        C : L^q(X) \xrightarrow{g \circ n_2} \mathbb{R}^m \xrightarrow{n} \mathbb{R}^n    (14)

    such that

        \| f - C \| < \varepsilon.    (15)

  • Results: Representation Theory

    How good are operator networks and GANNs as algorithms? They should be able to approximate the important operators, e.g. the Fourier transform, the Laplace transform, differentiation, etc.

    Theorem
    Given an operator neural network O and some layer l ∈ O, let K : C(R^{(l)}) → C(R^{(l)}) be a bounded linear operator. If we denote the operation of layer l on layer l − 1 as σ^{(l+1)} = g(Σ_{l+1} σ^{(l)}), then for every ε > 0 there exists a weight polynomial w^{(l)}(i, j) such that, in the supremum norm over R^{(l)},

        \big\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \big\|_\infty < \varepsilon.    (16)

    Proof.
    See paper. Nice!

  • Results: Stronger Representation Theory

    We want to show the following better theorem.

    Theorem
    Given an operator neural network O and some layer l ∈ O, let K : C(R^{(l)}) → C(R^{(l)}) be a bounded continuous operator. If we denote the operation of layer l on layer l − 1 as σ^{(l+1)} = g(Σ_{l+1} σ^{(l)}), then for every ε > 0 there exists a weight polynomial w^{(l)}(i, j) such that, in the supremum norm over R^{(l)},

        \big\| K\sigma^{(l)} - \Sigma_{l+1}\sigma^{(l)} \big\|_\infty < \varepsilon.    (17)

    But how? Dirac spikes!


  • Results: Stronger Representation Theory

    [Diagram: under K and O, ξ ∈ L^p(ℝ) maps to f ∈ L^1(ℝ); pointwise at each j ∈ ℝ, under K_j and C_j, ξ maps to f(j).]

    Proof.
    Fix ε > 0. Given K : ξ ↦ f, let K_j : ξ ↦ f(j) be a functional on L^q.

    We can find a C_j : L^q(\mathbb{R}) \xrightarrow{g \circ n_2} \mathbb{R}^{m(j)} \xrightarrow{n} \mathbb{R}^1 so that, for all ξ,

        |C_j(\xi) - K_j(\xi)| = |C_j(\xi) - f(j)| < \varepsilon/2.


  • Results: Stronger Representation Theory

    Proof (continued).
    We know that

        C_j(\xi) = \sum_{k=1}^{m(j)} a_{jk}\, g\Big( \int_{\mathbb{R}} \xi(i)\, w_{kj}(i)\, d\mu(i) \Big).

  • Results: Stronger Representation Theory

    [Figure: the weight functions w_{1j}, w_{2j}, w_{3j}, w_{4j}.]

    Proof (continued).
    We wish to turn C_j into a two layer O. Let

        w^{(0)}(i, \ell) =
        \begin{cases}
          w_{kj}(i), & \text{if } \ell = j + k,\ k \in \{1, \dots, m(j)\}, \\
          0, & \text{otherwise.}
        \end{cases}

  • Results: Stronger Representation Theory

    Proof (continued).
    Then

        C_j(\xi) = \sum_{k=1}^{m} a_{jk}\, (g \circ o[\xi])(k + j).

    How do we turn this finite sum into an integral? Dirac time!


  • Results: Stronger Representation Theory

    Proof (continued).
    We define a Dirac spike as follows, for every n:

        \delta_{nkj}(\ell) = c_n \exp\big( -b\, n^2\, |\ell - (j + k)|^2 \big),

    where c, b are set so that \int_{\mathbb{R}} \delta_{nkj} = 1.
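
    A quick numerical sketch (my own) of the spike: a Gaussian normalized to unit mass that concentrates at ℓ = j + k as n grows and so, when integrated against a continuous function, picks out its value at that point.

        import numpy as np

        def dirac_spike(ell, n, center, b=1.0):
            # delta_n(ell) = c_n * exp(-b * n^2 * |ell - center|^2), with c_n chosen so ∫ delta_n = 1.
            c_n = n * np.sqrt(b / np.pi)
            return c_n * np.exp(-b * n ** 2 * (ell - center) ** 2)

        ell = np.linspace(-10, 10, 20001)
        d_ell = ell[1] - ell[0]
        h = np.cos                                    # any continuous test function
        for n in (1, 4, 16, 64):
            spike = dirac_spike(ell, n, center=2.0)
            mass = spike.sum() * d_ell                # -> 1
            picked = (spike * h(ell)).sum() * d_ell   # -> h(2.0) = cos(2) ≈ -0.416
            print(n, round(mass, 4), round(picked, 4))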

  • Results: Stronger Representation Theory

    Proof (continued).
    Now let the second weight function be

        w^{(1)}_n(\ell, j) = \sum_{k=1}^{m} a_{jk}\, \delta_{nkj}(\ell).

  • Results: Stronger Representation Theory

    Proof (continued).
    Putting everything together, for every n let O_n : L^p(\mathbb{R}) \to L^1([0, 1]) be given by

        O_n : \xi \mapsto \int_{\mathbb{R}} w^{(1)}_n(\ell, j)\, (g \circ o[\xi])(\ell)\, d\mu(\ell).

    Clearly O_n[\xi](j) \to \sum_{k=1}^{m} a_{jk}\, (g \circ o[\xi])(k + j) = C_j(\xi) as n \to \infty.
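
    A numerical sketch of this limiting step (with an arbitrary smooth stand-in for g∘o[ξ] and made-up coefficients a_{jk}, so purely illustrative): integrating the spike-built weight w^{(1)}_n against a continuous function converges to the finite sum Σ_k a_{jk} h(j + k) as n grows.

        import numpy as np

        def spike(ell, n, center, b=1.0):
            return n * np.sqrt(b / np.pi) * np.exp(-b * n ** 2 * (ell - center) ** 2)

        j = 0.0
        a = np.array([0.7, -1.3, 0.4])                    # illustrative coefficients a_{j1}, ..., a_{jm}
        h = lambda ell: np.tanh(np.sin(ell))              # stand-in for (g∘o[xi])(ell)

        ell = np.linspace(-20, 20, 200001)
        d_ell = ell[1] - ell[0]
        target = sum(a[k] * h(j + (k + 1)) for k in range(len(a)))              # C_j(xi)
        for n in (1, 4, 16, 64):
            w1 = sum(a[k] * spike(ell, n, j + (k + 1)) for k in range(len(a)))  # w^(1)_n(ell, j)
            approx = (w1 * h(ell)).sum() * d_ell                                # O_n[xi](j)
            print(n, round(approx, 5), "-> target", round(float(target), 5))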

  • Results: Stronger Representation Theory

    Proof (continued).
    Therefore, for every ε > 0 there exists an N such that for all n > N, for all ξ, and for all j,

        |O_n[\xi](j) - C_j[\xi]| \le \| O_n[\,\cdot\,](j) - C_j[\,\cdot\,] \| < \varepsilon/2.

    Recall that for every j, \| K_j - C_j \| < \varepsilon/2.


  • Results: Stronger Representation Theory

    Proof (continued).
    By the triangle inequality we have that, for all j,

        \| K_j - O_n(j) \| = \| K_j - O_n(j) + C_j - C_j \| \le \| K_j - C_j \| + \| O_n(j) - C_j \| < \varepsilon.

    Therefore \| K - O \| < \varepsilon.

