Back Propagation and Representation in PDP Networks
Psychology 209, February 6, 2013
Homework 4

• Part 1 due Feb 13
– Complete Exercises 5.1 and 5.2.
– It may be helpful to carry out some explorations of parameters, as suggested in Exercise 5.3. This may help you achieve a solution in the last part of the homework, below. However, no write-up is required for this.
• Part 2 due Feb 20
– Consult Chapter 8 of the PDP Book by Rumelhart, Hinton, and Williams (in the readings directory for Feb 6). Consider the problems described there that were solved using back propagation, and choose one; or create a problem of your own to investigate with back propagation.
– Carry out Exercise 5.4, creating your own network, template, pattern, and startup file (similar to bpxor.m), and answer question 5.4.1.
The Perceptron
For input pattern p, with teacher $t_p$ and output $o_p$, change the threshold and weights as follows:

Note: including a bias $= -\theta$ in the net input and using a threshold of 0, then treating the bias as a weight from a unit that is always on, is equivalent.
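The update equations on this slide are images that did not survive in the transcript. The standard perceptron rule the bullets describe, with learning rate $\varepsilon$, is (a reconstruction, not copied from the slide):

```latex
\Delta \theta = -\varepsilon\,(t_p - o_p), \qquad
\Delta w_i = \varepsilon\,(t_p - o_p)\, i_{pi}
```

Under the note's equivalent formulation, the threshold update is simply the weight update for a bias unit whose activation is always 1.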
AND, OR, XOR

(AND and OR are linearly separable, so a single perceptron unit can compute them; XOR is not, which motivates the next slide.)
Adding a unit to make XOR solvable
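The slide's network diagram is not reproduced in the transcript. The idea it illustrates is that XOR becomes linearly separable once a third input carrying the conjunction of the two originals is added. A minimal check in Python (the particular weights and threshold here are one illustrative choice, not taken from the slide):

```python
# XOR computed by a single threshold unit, given an extra unit that
# carries the AND of the two inputs: output = [x1 + x2 - 2*(x1 AND x2) > 0.5]
def xor_via_extra_unit(x1, x2):
    h = x1 and x2                      # the added unit computes AND
    net = 1.0 * x1 + 1.0 * x2 - 2.0 * h
    return 1 if net > 0.5 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', xor_via_extra_unit(x1, x2))
# prints 0, 1, 1, 0: the augmented problem is linearly separable
```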
Gradient Descent Learning in the ‘LMS’ Associator
Output is a linear function of inputs and weights:
Find a learning rule to minimize the summed squared error:
Taking derivatives, we find:
Consider the policy:
This breaks down into the sum over patterns of terms of the form:
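The five equations on this slide are images not captured in the transcript. In the standard formulation of the LMS (delta) rule, which is what the bullets describe, they are, for pattern p with inputs $i_{pi}$, output $o_p$, and target $t_p$:

```latex
o_p = \sum_i w_i\, i_{pi}
E = \tfrac{1}{2} \sum_p (t_p - o_p)^2
\frac{\partial E}{\partial w_i} = -\sum_p (t_p - o_p)\, i_{pi}
\Delta w_i = -\varepsilon\, \frac{\partial E}{\partial w_i}
\Delta_p w_i = \varepsilon\,(t_p - o_p)\, i_{pi}
```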
Error Surface for OR function in LMS Associator
What if we want to learn how to solve XOR?

We need to figure out how to adjust the weights into the ‘hidden’ unit, following the principle of gradient descent:
We start with an even simpler problem

[Slide figure: a chain of three units, 0 → 1 → 2, with weight $w_{10}$ from unit 0 to unit 1 and weight $w_{21}$ from unit 1 to unit 2.]

Assume units are linear ($a_i = \mathrm{net}_i$), both weights $= 0.5$, input $i = 1$, and target $t = 1$. The error is

$$E = \tfrac{1}{2}(t - a_2)^2, \qquad \mathrm{net}_2 = w_{21}\, a_1, \qquad \mathrm{net}_1 = w_{10}\, a_0$$

Weight changes should follow the gradient:

$$\Delta w_{rs} = -\varepsilon\, \frac{\partial E}{\partial w_{rs}}$$

We use the chain rule to calculate $\partial E/\partial w$ for each weight. First we unpack the chain, then we calculate the elements of it:

$$\frac{\partial E}{\partial w_{21}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial w_{21}} = -(t - a_2)\cdot 1 \cdot a_1$$

$$\frac{\partial E}{\partial w_{10}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial a_1}\,\frac{\partial a_1}{\partial \mathrm{net}_1}\,\frac{\partial \mathrm{net}_1}{\partial w_{10}} = -(t - a_2)\cdot 1 \cdot w_{21}\cdot 1 \cdot a_0$$
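With the stated values the arithmetic is straightforward (this worked computation is added here; it is not on the slide):

```latex
a_1 = 0.5 \cdot 1 = 0.5, \qquad a_2 = 0.5 \cdot 0.5 = 0.25, \qquad E = \tfrac{1}{2}(1 - 0.25)^2 = 0.28125
\frac{\partial E}{\partial w_{21}} = -(0.75)(0.5) = -0.375, \qquad
\frac{\partial E}{\partial w_{10}} = -(0.75)(0.5)(1) = -0.375
```

So gradient descent increases both weights, as it should to move $a_2$ toward the target of 1.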
Including a non-linear activation function
• Let

$$a_i = f(\mathrm{net}_i) = \frac{1}{1 + e^{-\mathrm{net}_i}}$$

• Then

$$\frac{\partial a_i}{\partial \mathrm{net}_i} = f'(\mathrm{net}_i) = a_i\,(1 - a_i)$$

• So our chains from before become:

$$\frac{\partial E}{\partial w_{21}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial w_{21}} = -(t - a_2)\, a_2(1 - a_2)\, a_1$$

$$\frac{\partial E}{\partial w_{10}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial a_1}\,\frac{\partial a_1}{\partial \mathrm{net}_1}\,\frac{\partial \mathrm{net}_1}{\partial w_{10}} = -(t - a_2)\, a_2(1 - a_2)\, w_{21}\, a_1(1 - a_1)\, a_0$$
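A quick way to convince yourself these derivatives are right is to compare them against numerical finite-difference estimates. A small Python sketch, written for these notes rather than taken from the slides (unit and weight names follow the deck's 0 → 1 → 2 chain):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w10, w21, a0=1.0, t=1.0):
    """Run the two-weight chain and return (E, a1, a2) with E = 0.5*(t - a2)**2."""
    a1 = sigmoid(w10 * a0)
    a2 = sigmoid(w21 * a1)
    return 0.5 * (t - a2) ** 2, a1, a2

w10, w21, a0, t = 0.5, 0.5, 1.0, 1.0
E, a1, a2 = forward(w10, w21, a0, t)

# Analytic gradients from the chain rule above
dE_dw21 = -(t - a2) * a2 * (1 - a2) * a1
dE_dw10 = -(t - a2) * a2 * (1 - a2) * w21 * a1 * (1 - a1) * a0

# Finite-difference estimates of the same gradients
eps = 1e-6
num_w21 = (forward(w10, w21 + eps)[0] - forward(w10, w21 - eps)[0]) / (2 * eps)
num_w10 = (forward(w10 + eps, w21)[0] - forward(w10 - eps, w21)[0]) / (2 * eps)

print(dE_dw21, num_w21)   # the two columns should agree to ~6 decimal places
print(dE_dw10, num_w10)
```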
Including the activation function in the chain rule and including more than one output unit leads to the formulation below, in which we use $\delta_i$ to represent $-\partial E/\partial \mathrm{net}_i$ (defined with the minus sign so that the rule below descends the error gradient).

The weight change rule at every layer is:

$$\Delta w_{rs} = \varepsilon\, \delta_r\, a_s$$

Calculating the $\delta$ term for output unit $i$:

$$\delta_i = (t_i - a_i)\, f'(\mathrm{net}_i)$$

And the $\delta$ term for hidden unit $j$:

$$\delta_j = f'(\mathrm{net}_j) \sum_i \delta_i\, w_{ij}$$

We can continue this back indefinitely:

$$\delta_s = f'(\mathrm{net}_s) \sum_r \delta_r\, w_{rs}$$
[Slide figure: units i, j, k in successive layers, with error propagating back from i through j to k.]
Back propagation algorithm

• Propagate activation forward
– Activation can only flow from lower-numbered units to higher-numbered units
• Propagate “error” backward
– Error flows from higher-numbered units back to lower-numbered units
• Calculate ‘weight error derivative’ terms $= \delta_r a_s$
• One can change weights after processing a single pattern or accumulate weight error derivatives over a batch of patterns before changing the weights.
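For concreteness, here is a minimal NumPy sketch of the algorithm just described, applied to XOR. It is written for these notes rather than taken from the course software (it is not the bpxor.m program), and the layer sizes, learning rate, initialization, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training patterns: inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def f(net):                                  # logistic activation
    return 1.0 / (1.0 + np.exp(-net))

# One hidden layer of 2 units; each weight matrix includes a bias column
W1 = rng.uniform(-0.5, 0.5, size=(2, 3))    # hidden <- (input + bias)
W2 = rng.uniform(-0.5, 0.5, size=(1, 3))    # output <- (hidden + bias)

eps = 0.5                                    # learning rate
for epoch in range(20000):
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in zip(X, T):
        # Forward pass: activation flows from lower to higher layers
        a0 = np.append(x, 1.0)               # input plus always-on bias unit
        a1 = np.append(f(W1 @ a0), 1.0)      # hidden activations plus bias
        a2 = f(W2 @ a1)                      # output activation
        # Backward pass: delta terms flow from higher to lower layers
        d2 = (t - a2) * a2 * (1 - a2)                      # output delta
        d1 = (W2[:, :2].T @ d2) * a1[:2] * (1 - a1[:2])    # hidden deltas
        # Accumulate weight error derivatives over the batch of patterns
        dW2 += np.outer(d2, a1)
        dW1 += np.outer(d1, a0)
    W1 += eps * dW1                          # change weights after the batch
    W2 += eps * dW2

for x in X:
    a1 = np.append(f(W1 @ np.append(x, 1.0)), 1.0)
    print(x, '->', f(W2 @ a1)[0])            # should approach 0, 1, 1, 0
    # (results vary with initialization; see the local-minima caveat later)
```

This version accumulates weight error derivatives over all four patterns before updating, i.e. the batch option mentioned in the last bullet; updating after each pattern just means applying the accumulated change inside the inner loop.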
Variants/Embellishments to back propagation
• Full “batch mode” (epoch-wise) learning rule with weight decay and momentum:

$$\Delta w_{rs} = \varepsilon \sum_p \delta_{rp}\, a_{sp} \;-\; \omega\, w_{rs} \;+\; \alpha\, \Delta w_{rs}(\mathrm{prev})$$
• Weights can alternatively be updated after each pattern or after every k patterns.
• An alternative error measure has both conceptual and practical advantages:
$$CE_p = -\sum_i \left[\, t_{ip} \log(a_{ip}) + (1 - t_{ip}) \log(1 - a_{ip}) \,\right]$$
• If targets are actually probabilistic, minimizing CEp maximizes the probability of the observed target values.
• This also eliminates the ‘pinned output unit’ problem: under squared error, an output unit pinned at the wrong extreme has $f'(\mathrm{net}) \approx 0$, so its error signal, and hence learning, all but vanishes.
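The practical advantage can be seen by differentiating. For a logistic output unit, the two error measures give (a standard derivation, added here for reference):

```latex
\text{Squared error:}\quad -\frac{\partial E_p}{\partial \mathrm{net}_i} = (t_{ip} - a_{ip})\, a_{ip}(1 - a_{ip})
\text{Cross-entropy:}\quad -\frac{\partial CE_p}{\partial \mathrm{net}_i} = t_{ip} - a_{ip}
```

The $a(1-a)$ factor is what pins a unit at the wrong extreme under squared error; under cross-entropy it cancels, so the error signal stays proportional to $(t - a)$.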
Why is back propagation important?
• Provides a procedure that allows networks to learn weights that can solve any deterministic input-output problem.
– Contrary to expectation, it does not get stuck in local minima except in cases where the network is exceptionally tightly constrained.
• Allows networks to learn how to represent information as well as how to use it.
• Raises questions about the nature of representations and of what must be specified in order to learn them.
Is Backprop biologically plausible?
• Neurons do not send error signals backward across their weights through a chain of neurons, as far as anyone can tell.
• But we shouldn’t be too literal-minded about the actual biological implementation of the learning rule.
• Some neurons appear to use error signals, and there are ways to use differences between activation signals to carry error information.