Artificial Neural Networks Introduction


Page 1:

Artificial Neural Networks

Introduction

Page 2:

Biological neural networks

• There are about 10^11 neurons in humans, of more than 100 different kinds, forming more than 10^14 connections.

• When activated (i.e., when the stimuli received through the dendrites exceed a threshold), the impulses generated by the excitation of the soma propagate through the axon to the other neurons.

[Figure: anatomy of a biological neuron, showing the nucleus, the cell body (soma), the dendrites, and the axon]

Page 3:

• The contact points between neurons are called synapses; they can be either inhibitory or excitatory, according to the effect that a stimulus received through them has on the neuron.

• At the synaptic level the terminals of two connected neurons are separated by a gap, across which the nerve impulse train is transmitted electro-chemically (by releasing a substance, acetylcholine, which acts as a synaptic mediator).

• The maximum frequency of the impulses is ≤ 1 kHz, therefore information transmission/processing is rather slow. The power of the nervous system is hypothesized to be related to its parallel, distributed processing of information.

Biological neural networks

Page 4:

Composed of two cascaded stages:

• A linear adder, which produces the so-called net input:
  net = Σ_k w_k i_k ,   where w_k is the weight associated with the k-th input i_k

• A non-linear, threshold-like activation function: o = f(net)

Artificial neuron
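A minimal sketch of this two-stage neuron in Python (the function name and the choice of a step activation are illustrative, not part of the original slides):

```python
import numpy as np

def artificial_neuron(inputs, weights, f):
    """Two-stage artificial neuron: linear adder followed by an activation function f."""
    net = np.dot(weights, inputs)    # net = sum_k w_k * i_k
    return f(net)                    # o = f(net)

# Example usage with a step activation; the last input is fixed to 1 and acts as the bias input
step = lambda net: 1.0 if net > 0 else 0.0
inputs = np.array([0.5, -1.0, 1.0])
weights = np.array([0.8, 0.2, -0.3])
print(artificial_neuron(inputs, weights, step))
```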

Page 5:

Possible activation functions (a step function, or continuous approximations of it):

• Step function:        o = 1 if (x − θ) > 0,  0 if (x − θ) ≤ 0
• Bipolar step:         o = +1 if (x − θ) > 0, −1 if (x − θ) ≤ 0
• Sigmoid/logistic:     o = 1 / (1 + e^{−(x − θ)})
• Hyperbolic tangent:   o = tanh(x − θ)

The sigmoid and the hyperbolic tangent are continuous approximations of the step function.

θ is a constant (bias) that acts as a threshold (it shifts f along the x axis). It is equivalent to a weight associated with a constant input whose value is 1.

Artificial neuron
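The same four activation functions written as plain NumPy functions (a sketch; θ is passed explicitly as the threshold/bias):

```python
import numpy as np

def step(x, theta=0.0):
    return np.where(x - theta > 0, 1.0, 0.0)

def bipolar_step(x, theta=0.0):
    return np.where(x - theta > 0, 1.0, -1.0)

def sigmoid(x, theta=0.0):
    return 1.0 / (1.0 + np.exp(-(x - theta)))

def hyperbolic_tangent(x, theta=0.0):
    return np.tanh(x - theta)
```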

Page 6:

Multi-layer architecture:

• Input layer
• Hidden layer(s) (not directly accessible from the 'outside')
• Output layer

A layer is a set of topologically equivalent neurons (i.e., all of them are connected to neurons which are in turn topologically equivalent).

[Figure: a multi-layer perceptron, with an input (IN) layer, hidden layers, and an output (OUT) layer]

Artificial neural network (example: multi-layer perceptron)

Page 7:

A weight w_ij is associated with each connection between neuron i and neuron j, and is used in the first-stage adder of the neuron that receives data through that connection.

The 'behavior' of a neural network therefore depends on:

• the number of neurons
• the topology
• the values of the weights associated with its connections

Artificial neural network

Page 8:

[Figure: problems that can be solved by different net configurations (R. P. Lippmann, "An introduction to computing with neural nets", IEEE, 1987)]

Artificial neural networks: computability

Page 9:

Supervised learning:

[Figure: block diagram of supervised learning. Each example provides an input pattern and a teaching input; the input pattern is fed to the neural network, a comparator compares the network output with the teaching input, and the resulting error drives the weight adaptation.]

Artificial neural networks: training

Page 10:

Unsupervised learning:

[Figure: block diagram of unsupervised learning. Each example provides only an input pattern, which is fed to the neural network; the weight adaptation is driven by the input alone, with no teaching input.]

Artificial neural networks: training

Page 11:

Supervised Learning

Page 12:

Supervised learning: training can be turned into an optimization task:

Minimize a function with a high-dimensional domain (as many dimensions as the weights in the network) which measures the difference between the teaching inputs in the training set and the actual outputs of the network (the error function).

An iterative procedure (gradient descent) moves the network weights along the direction of the negative gradient of the error function, thus guaranteeing that, at each step, the value of the error function decreases, until a local minimum is reached.

Gradient descent is one of the so-called trajectory-based optimization methods: a single point moves within the input space, and each step aims at moving it to a better location.
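As a concrete illustration of this trajectory-based view, here is a generic gradient-descent step on an arbitrary differentiable error function (a sketch; the function names, the stopping tolerance and the example function are illustrative assumptions):

```python
import numpy as np

def gradient_descent(w0, error_gradient, epsilon=0.01, max_steps=10_000, tol=1e-6):
    """Move a single point w through the search space along the negative gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_steps):
        g = error_gradient(w)
        if np.linalg.norm(g) < tol:     # (almost) at a local minimum
            break
        w = w - epsilon * g             # each step moves w to a better location
    return w

# Example: minimizing E(w) = ||w||^2, whose gradient is 2w
print(gradient_descent([3.0, -2.0], lambda w: 2 * w))
```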

Page 13:

[Figure: notation used for a pattern P propagating through the network. The inputs i_1 … i_M enter the input neurons n_1 … n_M; a generic neuron n_j receives the outputs o_{P,i}, …, o_{P,i+k} of the neurons n_i, …, n_{i+k} through the weights w_{i,j}, …, w_{i+k,j}, computes net_{P,j}, and emits the output o_{P,j}; the network output for pattern P is the vector O_P = (o_{P,1}, …, o_{P,N}).]

Page 14:

Given:

• a single-layer network with linear activations:  o_j(W) = Σ_i w_ij i_i
• a training set T = { (x_p, t_p) : p = 1, …, P },  where P is the number of examples
• a squared error function computed on the p-th pattern:
  E_p = (1/2) Σ_{j=1..N} (t_pj − o_pj)^2 ,  where N is the number of output neurons and
  o_pj, t_pj are the output / teaching input for neuron j
• a global error function E = Σ_{p=1..P} E_p = E(W),  where W is the matrix of the weights w_ij
  associated with the connections from neuron i to neuron j,

E can be minimized using gradient descent, converging to the local minimum of E which is closest to the initial conditions (i.e., to the values at which the weights are initialized).

Supervised learning: Delta Rule (Widrow & Hoff's rule)
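These definitions, written out in NumPy (a sketch; the array shapes and function names are assumptions made for illustration):

```python
import numpy as np

def outputs(W, x):
    """Single-layer linear network: o_j = sum_i w_ij * i_i  (W has shape (n_inputs, n_outputs))."""
    return W.T @ x

def pattern_error(W, x_p, t_p):
    """Squared error on the p-th pattern: E_p = 1/2 * sum_j (t_pj - o_pj)^2."""
    o_p = outputs(W, x_p)
    return 0.5 * np.sum((t_p - o_p) ** 2)

def global_error(W, X, T):
    """Global error: E = sum_p E_p over the whole training set."""
    return sum(pattern_error(W, x_p, t_p) for x_p, t_p in zip(X, T))
```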

Page 15:

[Figure: a one-dimensional error function with two local minima, x and y; the basin of attraction of each minimum is the set of starting points from which gradient descent converges to that minimum.]

Supervised learning: gradient descent

Page 16:

The gradient of E has the partial derivatives ∂E/∂w_ij as components:

∂E/∂w_ij = Σ_p ∂E_p/∂w_ij

From the differentiation rule for composite functions (E = E(O(W))):

∂E_p/∂w_ij = ∂E_p/∂o_pj · ∂o_pj/∂w_ij

From the definition of the error function:

−∂E_p/∂o_pj = (t_pj − o_pj) ≝ δ_pj   (the error produced by neuron j when pattern p is input)

Since the neuron activations are linear (o_pj = Σ_i w_ij i_pi):

∂o_pj/∂w_ij = i_pi ,  therefore  ∂E_p/∂w_ij = −δ_pj i_pi

and thus  ∂E/∂w_ij = −Σ_p δ_pj i_pi

Supervised learning: Delta Rule (Widrow & Hoff's rule)

Page 17:

If we apply gradient descent:

Δw_ij = −ε ∂E/∂w_ij = −Σ_p ε ∂E_p/∂w_ij = ε Σ_p δ_pj i_pi ≝ Σ_p Δ_p w_ij

If ε is small enough, one can modify the weights after each single pattern according to the rule:

Δ_p w_ij = ε δ_pj i_pi

NB: all these quantities are easily computed.

Supervised learning: Delta Rule (Widrow & Hoff's rule)
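A per-pattern delta-rule update for the single-layer linear network, following the formula above (a sketch; the learning-rate value and array shapes are illustrative):

```python
import numpy as np

def delta_rule_step(W, x_p, t_p, epsilon=0.05):
    """Apply Delta_p w_ij = epsilon * delta_pj * i_pi for one training pattern."""
    o_p = W.T @ x_p                          # linear outputs o_pj
    delta_p = t_p - o_p                      # delta_pj = t_pj - o_pj
    W += epsilon * np.outer(x_p, delta_p)    # the outer product gives all Delta_p w_ij at once
    return W
```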

Page 18:

Supervised learning: gradient descent

I. Initialize the weights
II. Repeat
    For each pattern in the training set:
    a. compute the output produced by the present configuration of the network;
    b. compute the error function;
    c. modify the weights by moving them in the direction opposite to the gradient of the error function with respect to the weights;
    until the error function reaches a pre-set value OR a pre-set maximum number of iterations is reached.
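The same procedure as a training-loop skeleton (a sketch that reuses the `delta_rule_step` and `global_error` functions assumed in the earlier sketches; the stopping thresholds and initialization range are placeholders):

```python
import numpy as np

def train(X, T, n_inputs, n_outputs, epsilon=0.05,
          target_error=1e-3, max_iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    # I. initialize the weights (small random values)
    W = rng.uniform(0.05, 0.1, size=(n_inputs, n_outputs))
    # II. repeat over the training set
    for _ in range(max_iterations):
        for x_p, t_p in zip(X, T):                     # for each pattern:
            W = delta_rule_step(W, x_p, t_p, epsilon)  # a.-c. output, error, weight update
        if global_error(W, X, T) < target_error:       # error function below the pre-set value
            break
    return W
```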

Page 19:

Problems

• It is not possible, in general, to compute the gradient of the error function with respect to all the weights for any network configuration. However, some configurations (see the following slides) do allow gradients to be computed or approximated.

• Even when this is possible, gradient descent will find a local minimum, which may be very far from the global one.

Supervised learning: gradient descent

Page 20:

The delta rule can be applied only to a particular neural net (single-layer, with linear activation functions).

The delta rule can be generalized and applied to multi-layer networks with non-linear activation functions.

This is possible for feedforward networks (aka Multi-layer Perceptrons, or MLPs), in which a topological ordering of the neurons can be defined, as well as a sequential order of activation.

The activation function f(net) of all neurons must be continuous, differentiable and non-decreasing.

net_pj = Σ_i w_ij o_pi   for a multi-layer network

(i = index of a neuron forward-connected to j, i.e., i < j if neurons are ordered starting from the input layer)

Supervised learning: Generalized Delta Rule (Backpropagation)

Page 21:

We want to apply gradient descent to minimize the total squared error (this holds for any topology):

Δ_p w_ij = −ε ∂E_p/∂w_ij

Notice that E_p = E_p(O_p) = E_p(f(net_p(W))).

Applying the differentiation rule for composite functions:

∂E_p/∂w_ij = ∂E_p/∂o_pj · ∂o_pj/∂net_pj · ∂net_pj/∂w_ij

(the first two factors together form ∂E_p/∂net_pj)

∂net_pj/∂w_ij = ∂/∂w_ij (Σ_k w_kj o_pk) = o_pi

Supervised learning: Generalized Delta Rule (Backpropagation)

Page 22:

If we define δ_pj = −∂E_p/∂net_pj, we obtain:

Δ_p w_ij = ε δ_pj o_pi

(the same formulation as for the Delta rule)

We know o_pi; we need to compute δ_pj (defined as for the delta rule: in fact, for a linear neuron, o_pj = net_pj).

Supervised learning: Generalized Delta Rule (Backpropagation)

Page 23:

δ_pj = −∂E_p/∂net_pj = −∂E_p/∂o_pj · ∂o_pj/∂net_pj

Notice that o_pj = f_j(net_pj), and f depends on a single variable, thus

∂o_pj/∂net_pj = do_pj/dnet_pj = f'_j(net_pj)

If the j-th neuron is an output neuron, then

∂E_p/∂o_pj = −(t_pj − o_pj)

Therefore, for such neurons, δ_pj = (t_pj − o_pj) f'_j(net_pj)

Supervised learning: Generalized Delta Rule (Backpropagation)

Page 24:

Instead, if the j-th neuron is a hidden neuron (here is where topology becomes important):

∂E_p/∂o_pj = Σ_k ∂E_p/∂net_pk · ∂net_pk/∂o_pj          (with k > j)
           = Σ_k ∂E_p/∂net_pk · ∂/∂o_pj (Σ_i w_ik o_pi)
           = Σ_k ∂E_p/∂net_pk · w_jk = −Σ_k δ_pk w_jk

This means that, for the hidden neurons,

δ_pj = f'_j(net_pj) · Σ_k δ_pk w_jk          (with k > j)

Therefore, for the hidden neurons, δ_pj can be computed recursively starting from the output neurons (error backpropagation).

In an MLP we know the values of all these terms!

Supervised learning: Generalized Delta Rule (Backpropagation)

Page 25:

1. The weights are initialized.
2. The p-th pattern is given as input:
   • the corresponding network outputs o_pj are computed;
   • the deltas are computed for the output layer:
     δ_pj = (t_pj − o_pj) f'_j(net_pj)          (for the output layer)
   • the deltas for the hidden layers,
     δ_pj = f'_j(net_pj) · Σ_k δ_pk w_jk ,
     are computed iteratively, starting from the hidden layer closest to the outputs.
3. The weights are modified as follows:  Δ_p w_ij = ε δ_pj o_pi
4. Steps 2 and 3 are repeated until convergence.

Supervised learning: Backpropagation algorithm
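A compact sketch of this algorithm for a network with one hidden layer, using the sigmoid activation (so that f'(net) = o(1 − o)); the layer sizes, learning rate, toy data and variable names are illustrative assumptions:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_epoch(W_hidden, W_out, X, T, epsilon=0.1):
    """One pass over all patterns, with online (per-pattern) weight updates."""
    for x_p, t_p in zip(X, T):
        # 2. forward pass: compute the network outputs
        o_hidden = sigmoid(W_hidden.T @ x_p)     # hidden-layer outputs
        o_out = sigmoid(W_out.T @ o_hidden)      # output-layer outputs

        # output-layer deltas: delta_pj = (t_pj - o_pj) * f'(net_pj)
        delta_out = (t_p - o_out) * o_out * (1.0 - o_out)
        # hidden-layer deltas: delta_pj = f'(net_pj) * sum_k delta_pk * w_jk
        delta_hidden = o_hidden * (1.0 - o_hidden) * (W_out @ delta_out)

        # 3. weight updates: Delta_p w_ij = epsilon * delta_pj * o_pi
        W_out += epsilon * np.outer(o_hidden, delta_out)
        W_hidden += epsilon * np.outer(x_p, delta_hidden)
    return W_hidden, W_out

# 1. initialize the weights; 4. repeat until convergence (a fixed number of epochs here)
rng = np.random.default_rng(0)
W_hidden = rng.uniform(0.05, 0.1, size=(2, 3))   # 2 inputs -> 3 hidden neurons
W_out = rng.uniform(0.05, 0.1, size=(3, 1))      # 3 hidden neurons -> 1 output neuron
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy inputs, for illustration only
T = np.array([[0.], [1.], [1.], [0.]])                   # toy teaching inputs
for epoch in range(1000):
    W_hidden, W_out = backprop_epoch(W_hidden, W_out, X, T)
```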

Page 26:

Weights should be updated after processing all training patterns (batch learning). In fact:  Δw_ij = Σ_p Δ_p w_ij = ε Σ_p δ_pj o_pi

If ε is small, (approximately) the same result is obtained by updating the weights after processing each pattern (online learning).

If the weights are initialized to 0, convergence problems occur: usually different small positive random values (in a range such as 0.05-0.1) are used for weight initialization.

For this procedure to be an actual gradient descent ε should be infinitesimal, but the smaller ε, the slower the convergence. However, if ε is too large one might 'fly over' a minimum. We may want this to happen, if it is a local minimum (see the next slides).

Gradient descent is not computationally efficient.

Supervised learning: Backpropagation
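The difference between the two update schemes, in sketch form for the single-layer linear case (the toy data and the helper `delta_p_w` are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                  # 4 toy patterns with 3 inputs each
T = rng.normal(size=(4, 2))                  # teaching inputs for 2 output neurons
W = rng.uniform(0.05, 0.1, size=(3, 2))      # weight matrix w_ij
epsilon = 0.05

def delta_p_w(W, x_p, t_p):
    """Per-pattern weight change for a single-layer linear net: delta_pj * i_pi."""
    delta_p = t_p - W.T @ x_p
    return np.outer(x_p, delta_p)

# Online learning: update the weights immediately after each pattern
for x_p, t_p in zip(X, T):
    W += epsilon * delta_p_w(W, x_p, t_p)

# Batch learning: accumulate the changes over all patterns, then apply a single update
W += epsilon * sum(delta_p_w(W, x_p, t_p) for x_p, t_p in zip(X, T))
```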

Page 27:

• Setting the network structure: how many neurons in how many layers? The only constraint is imposed on the number of neurons in the input and in the output layers.

Usually, the layers closer to the input are larger (they include more neurons), to allow enough degrees of freedom to recombine the inputs appropriately; the following layers are generally smaller, to favor generalization and limit overfitting.

Incremental algorithms exist that start from very few neurons and add them step by step until good performance is reached, or that start from a large network and then 'prune' it by removing the connections associated with weights that are smaller than a threshold (a minimal sketch of such a pruning step is given below).

Supervised learning: Practical problems when using the MLP
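The pruning step mentioned above, sketched in NumPy (the threshold value is an arbitrary placeholder; here pruned connections are simply zeroed out and tracked with a mask):

```python
import numpy as np

def prune_weights(W, threshold=0.01):
    """Return a pruned copy of W and a boolean mask of the surviving connections."""
    mask = np.abs(W) >= threshold     # keep only connections with 'large enough' weights
    return W * mask, mask
```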

Page 28:

• Setting the learning rate: a large learning rate makes it easier for the network to escape local minima (which often have a narrow 'basin of attraction') and jump into others; a small learning rate produces a more literal gradient descent into the closest local minimum. Usually we want to first explore the space (large ε) and then, once we have reached a good point, refine the search (small ε); a sketch of a decaying learning rate follows.

Supervised learning: Practical problems when using the MLP
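One common way to realize this 'large ε first, small ε later' strategy is a decaying learning rate; the exponential schedule below is an illustrative choice, not one prescribed by the slides:

```python
import math

def learning_rate(step, eps_start=0.5, eps_end=0.01, decay=0.001):
    """Decay epsilon exponentially from eps_start towards eps_end as training proceeds."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay * step)
```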

Page 29:

• Setting a stopping condition: typically, a maximum number of iterations must be set. Restarting from the same weight configuration is always possible; going back is not!

Good advice is to save the network configuration whenever a new minimum is reached (NB: you need a validation set here!) and to finally stop the algorithm when overfitting appears (a sketch of this policy follows).

In fact, usually, during the first (NB: possibly many!) iterations one can observe a decrease in the error over both the training and the validation set. After a certain number of iterations the network will start overfitting the training data, i.e. the error will still be decreasing on the training set, but will start increasing on the validation set.

Supervised learning: Practical problems when using the MLP
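A sketch of this 'save the best configuration, stop when overfitting appears' policy (the patience counter and the caller-supplied functions are assumptions, not part of the slides):

```python
import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              max_epochs=1000, patience=20):
    """`train_one_epoch(weights)` updates and returns the weights;
    `validation_error(weights)` returns the error on the validation set."""
    best_error = float("inf")
    best_weights = copy.deepcopy(weights)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)
        val_error = validation_error(weights)
        if val_error < best_error:                  # a new minimum: save this configuration
            best_error, best_weights = val_error, copy.deepcopy(weights)
            epochs_without_improvement = 0
        else:                                       # validation error no longer decreasing:
            epochs_without_improvement += 1         # overfitting may be starting
            if epochs_without_improvement >= patience:
                break
    return best_weights, best_error
```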

Page 30:

• Gradient descent is a deterministic algorithm, but weight initialization is a random process, which means that the results obtained by running the network several times on the same data with different random sequences are instances of a random variable, and must be treated as such! Hence, it makes no sense to compare different network configurations based on a single run of each. Appropriate statistics must be collected by running the BP algorithm several times for each configuration, and appropriate statistical tests must be applied to show that the distributions of the results, and especially their mean values, are significantly different (a sketch of such a comparison follows).

Never forget that drawing the training/validation/test sets from a data set is already enough to make the whole training procedure stochastic.

Supervised learning: Practical problems when using the MLP
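One way such a comparison might be carried out: run backpropagation several times per configuration with different random seeds, then apply a statistical test to the resulting error distributions (the helper `train_and_evaluate` is hypothetical, and the Mann-Whitney U test is one reasonable choice among several):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_configurations(train_and_evaluate, config_a, config_b, n_runs=30):
    """`train_and_evaluate(config, seed)` trains a network with the given configuration
    and returns its error on held-out data (caller-supplied, assumed here)."""
    errors_a = [train_and_evaluate(config_a, seed) for seed in range(n_runs)]
    errors_b = [train_and_evaluate(config_b, seed) for seed in range(n_runs)]
    stat, p_value = mannwhitneyu(errors_a, errors_b)   # non-parametric two-sample test
    print(f"mean A = {np.mean(errors_a):.4f}, mean B = {np.mean(errors_b):.4f}, p = {p_value:.4f}")
    return p_value
```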