Artificial Neural Net Basics


TRANSCRIPT

Slide 1/45

    Problem Solving in Hyperspace

    or

    Artificial Neural Net Basics

    Tim Hare

Slide 2/45

    Some History

    In the 1960s, much interest in artificial neural networks (ANNs)

Rosenblatt (1962) proves an important theorem regarding perceptron (single learning layer) network learning

Widrow, Angell, Hoff (1960-1962): demonstrations of perceptron learning

Minsky (1969) kills the party: analyzes with great rigor and finds perceptrons have restrictions on what they can learn, and since multilayer network training approaches were not defined, the world lost interest.

Work in the field slows for a decade, but Widrow is defiant and establishes training algorithms for multilayer perceptrons.

The party starts again in the '80s

Slide 3/45

    Why are ANNs important?

Ability to automatically create complex non-linear functions from simpler linear functions by composing the individual pieces of the network into a meta-function

The process learns from data; a priori knowledge is not needed

Not a black-box result: one can discern (and we'll go through this) the specifics of the model, one can adjust the model, and one can embed the final model in other applications

The process can be made continuously adaptive: it continues to modify itself as the data set changes

Alternative to traditional modeling techniques such as ANOVA and multiple regression

In more advanced forms, continues to be a means to explore the underpinnings of the organic intelligence that evolved on this planet

Slide 4/45

    Biological Neurons

Slide 5/45

The sort of ANN architecture we'll be playing with today:

Two processing layers, each with their own weights.

Information flows from left to right during execution of the network (forward propagation), and from right to left during the weight adjustment cycle (backward propagation).

[Figure: network diagram. Input(1) and Input(2) feed the Layer-1 neurons (Neuron 1 and Neuron 2) through weights W(1,1), W(1,2), W(2,1), W(2,2); their outputs feed the Layer-2 neuron (Neuron 3) through weights W(1,3) and W(2,3), producing the network output, which is compared to the target to give the error.]

Slide 6/45

Each neuron is in effect a summation operator. That is, per below, NET(i) is a summation (Σ) of all X(m)*AB(m,i), where m = input number and i = neuron number.

[Figure: inputs X(1) and X(2) feed Net(1) and Net(2) through weights AB(1,1), AB(1,2), AB(2,1), AB(2,2).]

AB(1,1)X1 + AB(1,2)X2 = NET(1)
AB(2,1)X1 + AB(2,2)X2 = NET(2)

Slide 7/45

Or, equivalently, a vector-matrix product of the input vector (X) and the weight matrix (AB), to produce the vector NET:

X x [AB] = NET

[X(1) X(2)]  x  | AB(1,1)  AB(1,2) |  =  [NET(1) NET(2)]
                | AB(2,1)  AB(2,2) |

X (1x2)          AB (2x2)                NET (1x2)
Input Vector     Weight Matrix           NET Vector

Σ_m AB(m,i)X(m) = NET(i)
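To make the two notations above concrete, here is a minimal NumPy sketch (the deck's own demos are in Excel/VB, so this is only an illustration; the input and weight values are made up):

    import numpy as np

    X = np.array([0.5, -0.25])        # input vector [X(1), X(2)] (made-up values)
    AB = np.array([[1.0, 2.0],        # made-up weights; row i holds the weights feeding neuron i
                   [3.0, 4.0]])

    # Per-neuron summation, exactly as in the expanded equations
    NET1 = AB[0, 0] * X[0] + AB[0, 1] * X[1]
    NET2 = AB[1, 0] * X[0] + AB[1, 1] * X[1]

    # The same thing as a single matrix product (row/column orientation is just bookkeeping)
    NET = AB @ X

    print(NET1, NET2)   # 0.0 0.5
    print(NET)          # [0.  0.5]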

Slide 8/45

In fact, our artificial neuron definition actually also includes a sigmoid function, TanH(x):

TanH(NET) = OUT, where NET = W x X

or, if you like...

Slide 9/45

The TanH(x) activation (or transfer) function allows gain control (squashing) over the value of each neuron. Large neuron values (or large weights) won't be amplified downstream, which would otherwise lead to noise saturation and distortion in network learning.

My impression in testing is that if you don't use the sigmoid transfer function you run the risk of creating a feed-forward loop that runs the weights to large values, and while it is possible to get training, many times the net explodes into huge neuron values, leading to overflow errors.

[Figure: the TanH curve, flattening toward -1 and +1 on either side of 0, with the NET distribution along the horizontal axis.]

OUT = TanH(NET), where NET = W x X
TanH(x) = [exp(x) - exp(-x)] / [exp(x) + exp(-x)]
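A quick numerical illustration of that squashing behavior; a minimal Python sketch (not the deck's Excel/VB code):

    import numpy as np

    for net in [-10.0, -2.0, 0.0, 2.0, 10.0]:
        out = np.tanh(net)    # [exp(x) - exp(-x)] / [exp(x) + exp(-x)]
        print(net, round(out, 4))

    # Outputs: -1.0, -0.964, 0.0, 0.964, 1.0 -- always inside (-1, 1),
    # no matter how large NET gets.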

Slide 10/45

The networks we'll test today will have one or two neurons in Layer-1.

[Figure: Layer-1 / Layer-2 diagram. Inputs X(1), X(2) feed the Layer-1 neurons through weights AB(1,1), AB(1,2), AB(2,1), AB(2,2); the Layer-1 outputs feed the single Layer-2 neuron through weights BC(1), BC(2).]

AB(1,1)X(1) + AB(1,2)X(2) = NET(1),  F(Net(1)) = OUT(1)
AB(2,1)X(1) + AB(2,2)X(2) = NET(2),  F(Net(2)) = OUT(2)
Layer-2: weights BC(1), BC(2) feed Net(3),  F(Net(3)) = OUT(3)

where F(X) = TanH(X)
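Composing the pieces, the forward pass of this 2-2-1 network is just two tanh stages. A minimal NumPy sketch with made-up weights (again, the deck's actual demo is an Excel workbook):

    import numpy as np

    def forward(X, AB, BC):
        # Forward propagation through the 2-2-1 network described above
        out1 = np.tanh(AB @ X)        # Layer-1: OUT(1), OUT(2)
        return np.tanh(BC @ out1)     # Layer-2: OUT(3), the network output

    AB = np.array([[0.5, -0.3],       # made-up Layer-1 weights
                   [0.2,  0.8]])
    BC = np.array([1.0, -1.0])        # made-up Layer-2 weights

    print(forward(np.array([1.0, -1.0]), AB, BC))   # approximately 0.83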

Slide 11/45

TRAINING: we'll need the derivative of our chosen sigmoid function. This allows us to adjust the weight space error by establishing a relationship to the training error.

[Figure: the TanH curve with a tangent line at NET(i*); the slope of the tangent is TanH'(NET(i*)). NET = W x X; NET* = W x X*.]

OUT = TanH(NET)
TanH(x) = [exp(x) - exp(-x)] / [exp(x) + exp(-x)]
TanH'(x) = [1 - TanH(x)][1 + TanH(x)]

WAF = TanH'(NET(i*))(d* - OUT*) for a particular position on the sigmoid.

During forward propagation, NET* is fed into the sigmoid function, and OUT* is produced.

During backward propagation, a deltaOUT (d* - OUT*) is fed into the linearization around OUT*, and a deltaNET (our weight adjustment factor, WAF) is produced.

* = a particular value
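A quick check of the derivative identity and the weight adjustment factor above; a small Python sketch in which NET* and d* are made-up example values:

    import numpy as np

    def tanh_prime(x):
        t = np.tanh(x)
        return (1.0 - t) * (1.0 + t)     # TanH'(x) = [1 - TanH(x)][1 + TanH(x)]

    NET_star = 0.7                       # made-up NET* value
    d_star = 1.0                         # made-up desired output
    OUT_star = np.tanh(NET_star)

    WAF = tanh_prime(NET_star) * (d_star - OUT_star)
    print(OUT_star, WAF)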

Slide 12/45

    The network is a META function

The network is a meta function: a functional composition of the more primitive functions in each node.

We use the WAF iteratively to minimize the total error in this meta function with respect to the entire pattern set, on average.

WAF is used in conjunction with coefficients to tailor training: NewWeight = (LR)(OldWeight) + (MO)(WAF), where LR = learning rate and MO = momentum (transcribed in code below).

LR and MO refine the adjustment; they are chosen empirically, vary according to each problem's data set, and can vary as a function of training results if encoded to do so.

Despite all this, we can still get caught in local minima as we attempt to reduce the error in weight space.
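For concreteness, here is a literal transcription of the update rule as written on this slide; how LR and MO are combined differs between implementations, so treat this exact form as the deck's own convention, and note that all the numbers are made-up placeholders:

    LR = 0.9           # learning rate (chosen empirically)
    MO = 0.3           # momentum coefficient (chosen empirically)

    old_weight = 1.25  # made-up current weight
    WAF = 0.18         # made-up weight adjustment factor from back-propagation

    # NewWeight = (LR)(OldWeight) + (MO)(WAF), as given on the slide
    new_weight = LR * old_weight + MO * WAF
    print(new_weight)  # 1.179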

Slide 13/45

In training, we want to minimize the error (cost function) on the network (meta function) output.

X = [X1, X2] = our input vector

D = our desired output for X

META(X) = network output

E(X) = cost function = AVG(ABS(D - META(X)))

The cost function is minimized across X vectors for the entire training set, iteratively, as the weights in META(X) are adjusted.
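That cost function is just the mean absolute error over the training set. A minimal self-contained sketch (META here is the same hypothetical 2-2-1 forward pass sketched earlier, and the weights are made-up, untrained values):

    import numpy as np

    def META(X, AB, BC):
        # Network output: the meta function (2-2-1 tanh network)
        return np.tanh(BC @ np.tanh(AB @ X))

    def cost(training_set, AB, BC):
        # E = AVG(ABS(D - META(X))) across all training pairs
        return float(np.mean([abs(D - META(X, AB, BC)) for X, D in training_set]))

    # Scaled XOR training pairs (these appear on the data slides later in the deck)
    patterns = [(np.array([ 1.0,  1.0]), -1.0),
                (np.array([ 1.0, -1.0]),  1.0),
                (np.array([-1.0,  1.0]),  1.0),
                (np.array([-1.0, -1.0]), -1.0)]

    AB = np.array([[0.1, -0.2], [0.3, 0.4]])   # made-up, untrained weights
    BC = np.array([0.5, -0.5])
    print(cost(patterns, AB, BC))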

Slide 14/45

Pseudo-code training algorithm for back-propagation of error using gradient descent:

For each example in the training set:

Calculate the error (d - OUT)

Compute delta-WX for all weights from the layer-1 neurons to the j layer-2 (output) neurons: E2(j)

Compute delta-WX for all weights from the X(m) inputs to the i layer-1 neurons: E1(i). This value is based upon W x E2(j), since there is no training pair for the layer-1 neurons

Use E2(j) to update the weights leading back from each of the j layer-2 neurons to the layer-1 neurons

Use E1(i) to update the weights leading back from each of the i layer-1 neurons to the m inputs

Next example

(do while not meeting some stop criterion, such as a low average absolute error across all patterns in one epoch of training)
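A compact runnable sketch of this loop for the 2-2-1 tanh network with bias weights, assuming plain per-example gradient descent with a single learning rate (no momentum term); the initialization and learning rate are my own choices, not values from the deck, and depending on the random start it can land in a local minimum, exactly as the later slides warn:

    import numpy as np

    rng = np.random.default_rng(0)
    AB = rng.uniform(-0.5, 0.5, size=(2, 2))   # input -> layer-1 weights (made-up init)
    b1 = rng.uniform(-0.5, 0.5, size=2)        # layer-1 bias weights (bias input = 1)
    BC = rng.uniform(-0.5, 0.5, size=2)        # layer-1 -> layer-2 weights
    b2 = rng.uniform(-0.5, 0.5)                # layer-2 bias weight
    LR = 0.1                                   # learning rate (chosen empirically)

    patterns = [(np.array([ 1.0,  1.0]), -1.0),   # scaled XOR training pairs
                (np.array([ 1.0, -1.0]),  1.0),
                (np.array([-1.0,  1.0]),  1.0),
                (np.array([-1.0, -1.0]), -1.0)]

    def tanh_prime(x):
        t = np.tanh(x)
        return (1.0 - t) * (1.0 + t)

    for epoch in range(10000):
        epoch_error = 0.0
        for X, d in patterns:
            # forward propagation
            net1 = AB @ X + b1
            out1 = np.tanh(net1)
            net2 = BC @ out1 + b2
            out = np.tanh(net2)
            # backward propagation of the error
            E2 = tanh_prime(net2) * (d - out)    # WAF at the output neuron
            E1 = tanh_prime(net1) * (BC * E2)    # WAFs at the layer-1 neurons
            BC += LR * E2 * out1                 # update layer-1 -> layer-2 weights
            b2 += LR * E2
            AB += LR * np.outer(E1, X)           # update input -> layer-1 weights
            b1 += LR * E1
            epoch_error += abs(d - out)
        if epoch_error / len(patterns) < 0.05:   # stop criterion: low average absolute error
            break

    print("stopped after epoch", epoch, "avg abs error", epoch_error / len(patterns))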

Slide 15/45

Error hyper-surface minimization: 2D pictured, but our error is in the WEIGHT space and therefore in a much higher dimension.

The weight space is the space over which we must minimize the error. This is distinct from the dimensionality of the input space X, the neuron space, or the output space. The high-dimensional weight-space surface is what we move down, to a (hopefully) global minimum.

Slide 16/45

This is not a 1D weight space graph, but our cost function, or error function: the overall network error for one epoch (one pass through the patterns), Error = AVG(ABS(d - OUT)), is gradually minimized and reflects our weight space error reduction process during training.

[Figure: AVG(ABS(d - OUT)) plotted against training epoch, descending past local minima toward the (hopefully) global minimum.]

Slide 17/45

Decision surfaces: some 2D open sets. These lines would be higher dimensional linear equations if more than two inputs were specified. Some minimum number of linear equations will be needed to solve each type of problem.

[Figure: panels in the X(1)-X(2) plane showing linear decision lines at level K.]

One Neuron: AB(1,1)X(1) + AB(1,2)X(2) = NET(1) = K. One decision surface is only good for simple classifications such as above.

Two surfaces (two neurons, or two equations) are needed for more complex problems.

Two Neurons: AB(1,1)X(1) + AB(1,2)X(2) = NET(1) = K, plus a second such equation for NET(2).
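To make the idea concrete: a single neuron's weighted sum, compared against a level K, splits the input plane into two half-planes. A tiny Python sketch; the weights are borrowed from the trained one-neuron example later in the deck, and the level K = 0 is a made-up choice:

    AB11, AB12 = 1.77, 1.51   # weights from the trained one-neuron example later on
    K = 0.0                   # made-up decision level

    for X1, X2 in [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]:
        NET1 = AB11 * X1 + AB12 * X2
        side = "above the line" if NET1 > K else "on/below the line"
        print((X1, X2), round(NET1, 2), side)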

Slide 18/45

An open convex set that classifies A as above the lower line, and below the upper line. The weights that feed downstream neurons from each of these two neurons (linear equations) will establish the cutoff by virtue of their interpretation by the down-stream neuron.

[Figure: region A bounded by the two decision lines in the X(1)-X(2) plane, alongside the two-input, two-neuron network that produces them (inputs X(1), X(2); weights W(1,1), W(1,2), W(2,1), W(2,2); neurons Net(1), Net(2)).]

W(1,1)X(1) + W(2,1)X(2) = NET(1)
W(1,2)X(1) + W(2,2)X(2) = NET(2)

We have two neurons (or two decision lines) and two inputs, hence the form of the equations.

Slide 19/45

More on decision surfaces: again, in 2D, the network can create closed convex sets.

[Figure: three panels of closed convex regions in the X(1)-X(2) plane.]

We are STILL somewhat limited in that we can't enclose any arbitrary shape (concave not possible) in a single class using convex objects made from layer-1 neurons.

Slide 20/45

A single additional computational layer (e.g. between layer-1 and layer-2) adds the capacity to make concave sets.

[Figure: in the X(1)-X(2) plane, region A with region B cut out; "A not B" gives concavity.]

Slide 21/45

    In summary: 1 neuron = 1 linear equation = 1 decision surface

Each neuron represents a line (in 2D input space), a plane (in 3D input space), or a hyper-plane (in higher dimensions).

All these are linear decision objects/surfaces regardless of the dimension of the vector X.

The dimension of the space in which the decision surfaces exist is determined by the dimension of X, the input vector, whose dimension depends upon the number of inputs we feed into the network (X[1,2] = line, X[1,2,3] = plane, X[1,2,3,4...n] = hyper-plane).

Additional network layers beyond two provide logical operations through the weights that connect the previous layer's neurons (objects) to the next, allowing concave sets.

Slide 22/45

XOR training data format. Two inputs coupled with our intended (d = desired) classification, by which the network will learn to group patterns (the data rows), and a total of four patterns.

X(1)  X(2)  d      (one such row per pattern, four patterns in total)

Slide 23/45

Each row has an input (X) vector, and an output vector, or desired vector, d. In this case D is a 1-dimensional vector (a single output neuron), however we could specify as many as we like, and so have higher dimensional vectors in both cases. The vectors are our training pairs, that make up a single row or record, each of which is submitted to the net during training, one at a time.

X(1)  X(2)  d
 1     1    0
 1     0    1
 0     1    1
 0     0    0

Slide 24/45

It should be clear that our intended classification is column 3, and we have two classes we want the net to separate. While we encode them in binary when we feed them to the network during training, we'll reference them on the graphs to come as classes A and B.

X(1)  X(2)  class
 1     1    B
 1     0    A
 0     1    A
 0     0    B

Slide 25/45

Finally, the actual data we sent to the net has been scaled for our preferred transfer function TanH(x), which will categorize our input patterns as either 1 or -1. While TanH(x) has domain (-inf, +inf) and range [-1, 1], we'll want to scale and restrict our inputs for a variety of reasons. In a real-world problem, we'd also likely scale our outputs to a smaller dynamic range of TanH(x), say [-0.75, 0.75], for optimal training times. We'll use the below for clarity, though.

X(1)  X(2)  d
 1     1    -1
 1    -1     1
-1     1     1
-1    -1    -1
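For the experiments below it is handy to have this table in code; a tiny Python sketch of the scaled XOR training set (the array names are my own):

    import numpy as np

    # Scaled XOR patterns: columns X(1), X(2) and the desired output d
    X = np.array([[ 1.0,  1.0],
                  [ 1.0, -1.0],
                  [-1.0,  1.0],
                  [-1.0, -1.0]])
    d = np.array([-1.0, 1.0, 1.0, -1.0])

    # In a real-world problem we might also compress the targets, e.g. to +/-0.75:
    d_scaled = 0.75 * d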

Slide 26/45

Here is a program I'm developing to analyze patterns in data. If we have time, I'll demo this later to show you more of a typical work flow...

Slide 27/45

For now, we'll use the hard-coded two-layer ANN software I wrote for you in Excel.

Slide 28/45

Before we tackle the XOR problem, let's try a simpler problem:

Two classes: A = (1,1) and B = (-1,-1)

Two inputs, X1 and X2

Let's try a single layer-1 neuron and see if we can solve it

Slide 29/45

Here's our result (or close enough)... but are we unduly biased?

Uh, wait... we specified only ONE layer-1 neuron, so why the extra connections???

Slide 30/45

The equations that result from training our network on a simple two-class pattern using one neuron and two inputs:

1.77X(1) + 1.51X(2) + 0.01X(3) = NET(1)

We can safely ignore the input-level bias, X(3), since its weight is near ZERO. So, 1.77X(1) + 1.51X(2) = NET(1)

0.03X(1) - 0.52X(2) + 0.49X(3) = NET(2)

The value of NET(2) is always 1, since it is in fact a bias neuron itself, therefore its weights don't count. Therefore, 1 = NET(2) (always).

F(NET(1))*(-2.43) + F(NET(2))*(0.01) = NET(3)

As well, the weights coming out of NET(2) are small, so we'll ignore them. So, F(NET(1))*(-2.43) = NET(3), therefore

OUT = F(NET(3)) = F(F(1.77X(1) + 1.51X(2))*(-2.43)), where F is, again, TanH(x)
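As a sanity check we can plug the two training classes from the earlier slide, A = (1,1) and B = (-1,-1), into this simplified model; a small Python sketch (the exact weight values would vary from one training run to another):

    import numpy as np

    def simplified_net(x1, x2):
        # OUT = F(F(1.77*X(1) + 1.51*X(2)) * (-2.43)), with F = TanH
        return np.tanh(np.tanh(1.77 * x1 + 1.51 * x2) * -2.43)

    print(simplified_net( 1.0,  1.0))   # one class lands near -0.98
    print(simplified_net(-1.0, -1.0))   # the other lands near +0.98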

Slide 31/45

    Initially, what looks pretty complicated.

[Figure: the full trained network. Inputs X(1), X(2) and bias X(3) feed F(Net(1)) with weights 1.77, 1.52, 0.01 and F(Net(2)) with weights 0.03, -0.52, 0.49; F(Net(1)) and F(Net(2)) feed the output neuron with weights -2.43 and 0.01, producing OUT.]

Slide 32/45

    Can be simplified.

[Figure: the same diagram with the near-zero weights set to 0 and the bias values shown as 1. X(1) and X(2) feed F(Net(1)) with weights 1.77 and 1.52, F(Net(1)) feeds the output with weight -2.43, and the remaining connections (0, -0.52, 0.49, 0) belong to the bias neuron and the zeroed paths.]

Slide 33/45

    And further simplified.

[Figure: the pruned network. X(1) and X(2) feed F(Net(1)) with weights 1.77 and 1.52; F(Net(1)) feeds the output neuron with weight -2.43, producing OUT.]

Resulting in a single 2D decision line, or one linear equation in two dimensions (X(1) & X(2)):

OUT = F(NET(3)) = F(F(1.77X(1)+1.51X(2))*(-2.43)) where F is, again, TanH(x)

In fact, the weight from NET(1) to NET(2) has evolved just to scale the value of NET(2) for TanH(x) so that it produces an OUT value close to +1 or -1.

Slide 34/45

So, we can do this classification with a single layer-1 neuron. In fact, each hidden layer neuron is tantamount to a linear equation that is a line, surface, or hypersurface, depending on the number of inputs to layer-1. We needed at least one solution line to solve this problem.

[Figure: the X(1)-X(2) plane with the points (0,0) = A and (1,1) = B separated by the decision line W(1,1)X(1) + W(2,1)X(2) = NET(1).]

Slide 35/45

What if we don't know the minimum number of neurons and use more?

We created more decision lines than we needed and made the interpretation of the equations, the relationships between the inputs, difficult to understand.

10 neurons used!!!

Slide 36/45

    To get a simple set of equations and a good model

Use as few neurons as you can, to simplify the interpretation of the equations.

Start with some minimum number that works, and work backwards to the true minimum (though with more complex, noisy data sets, this optimum may be hard to assess).

Also, generalization (averaging across noise within a class) is impaired when one uses too many neurons. A less-than-robust network results, which can't generalize to patterns it has never seen and is over-fitted to noise.

Slide 37/45

Let's try and solve the XOR problem. How many linear equations (layer-1 neurons) will we need for XOR?

Let's first try with 1 neuron.

Slide 38/45

Why can't a 2-layer net with a single neuron in layer-1 (one linear equation) solve XOR?

[Figure: the XOR points in the X(1)-X(2) plane, (0,0) = B, (1,1) = B, (0,1) = A, (1,0) = A, with a single decision line W(1,1)X1 + W(2,1)X2 = NET(1) that cannot separate them.]

As we saw, a network with a single layer-1 neuron can't separate class A from class B, and so the net iterates but does not regress to a solution.

Slide 39/45

So, a single layer-1 neuron was not sufficient, and we know why, but what about two layer-1 neurons (two equations, or decision lines)?

We'll try with 2 neurons

Then we'll take a look at the equations

Then we'll prune neurons if they don't contribute, to make the equations and variable relationships clearer

We'll remove clear ZERO weights and test those we suspect may not be impacting the result

We'll then check for generalization across input vectors that the net has not seen

Slide 40/45

Here's a result with XOR and two neurons

Slide 41/45

Again, with XOR, things initially look pretty complicated.

[Figure: the trained XOR network before pruning. Inputs X(1), X(2) and a bias X(3) feed F(Net(1)) and F(Net(2)); these, plus a bias neuron F(Net(3)), feed the output neuron F(Net(4)), producing OUT. The surviving weights are 1.15, 1.16, -0.81, -1.14, -1.13, -0.82, -2.57, -2.57 and -1.75; the remaining weights (0.81, 0.03, 0.04) are pruned on the next slides.]

Slide 42/45

...but it can be made less complicated by eliminating clear zero inputs, pruning inputs that don't contribute, and labeling our bias neurons.

[Figure: the same XOR network with bias values shown as 1 and the pruned weights set to 0, leaving 1.15, 1.16, -0.81, -0.82, -2.57, -2.57, -1.14, -1.13 and -1.75.]

Slide 43/45

...until things are somewhat more clear.

[Figure: the pruned XOR network. X(1), X(2) and a bias of 1 feed F(Net(1)) with weights 1.15, 1.16, -0.81 and F(Net(2)) with weights -1.14, -1.13, -0.82; F(Net(1)), F(Net(2)) and a bias of 1 feed the output neuron F(Net(4)) with weights -2.57, -2.57 and -1.75, producing OUT.]

Slide 44/45

The equations that result from training our network on the XOR problem, using two layer-1 neurons, two inputs, and bias terms:

1) 1.15X(1) + 1.16X(2) - 0.81 = NET(1)

2) -1.14X(1) - 1.13X(2) - 0.82 = NET(2)

3) -1.75 = NET(3)

4) F(NET(1))*(-2.57) + F(NET(2))*(-2.57) - 1.75 = NET(4)

5) OUT = F(NET(4))

...where F(x) = TanH(x).
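We can check that these equations reproduce XOR on the scaled patterns; a short Python sketch (the helper name is mine):

    import numpy as np

    def xor_net(x1, x2):
        F = np.tanh
        net1 =  1.15 * x1 + 1.16 * x2 - 0.81
        net2 = -1.14 * x1 - 1.13 * x2 - 0.82
        net4 = F(net1) * -2.57 + F(net2) * -2.57 - 1.75
        return F(net4)

    for x1, x2, d in [(1, 1, -1), (1, -1, 1), (-1, 1, 1), (-1, -1, -1)]:
        print((x1, x2), "desired", d, "got", round(xor_net(x1, x2), 2))
    # Each output lands near its desired target (roughly +/-0.9).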

Slide 45/45

    Miscellany and discussion

    Linking up pre-trained networks as input to a training network

Linking up the above, but using the hidden layer of the trained network as input to the training network

Depending on where you start, you may never get out of a local minimum, or you may fall into one after making progress.

    Demo of VB software as needed to illustrate 3-layer networks