Back Propagation and Representation in PDP Networks
Psychology 209, February 6, 2013
Homework 4

• Part 1 due Feb 13
– Complete Exercises 5.1 and 5.2.
– It may be helpful to carry out some explorations of parameters, as suggested in Exercise 5.3. This may help you achieve a solution in the last part of the homework, below. However, no write-up is required for this.
• Part 2 due Feb 20
– Consult Chapter 8 of the PDP Book by Rumelhart, Hinton, and Williams (in the readings directory for Feb 6). Consider the problems described there that were solved using back propagation, and choose one; or create a problem of your own to investigate with back propagation.
– Carry out Exercise 5.4, creating your own network, template, pattern, and startup file (similar to bpxor.m), and answer question 5.4.1.
The Perceptron
For input pattern p, with teacher $t_p$ and output $o_p$, change the threshold and weights as follows:

Note: including a bias $= -\theta$ in the net input and using a threshold of 0, then treating the bias as a weight from a unit that is always on, is equivalent.
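The update equations on this slide are images that did not survive in the transcript. The standard perceptron rule the bullets describe, with learning rate $\varepsilon$, is (a reconstruction, not copied from the slide):

```latex
\Delta \theta = -\varepsilon\,(t_p - o_p), \qquad
\Delta w_i = \varepsilon\,(t_p - o_p)\, i_{pi}
```

Under the note's equivalent formulation, the threshold update is simply the weight update for a bias unit whose activation is always 1.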
AND, OR, XOR

(AND and OR are linearly separable, so a single perceptron unit can compute them; XOR is not, which motivates the next slide.)
Adding a unit to make XOR solvable
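The slide's network diagram is not reproduced in the transcript. The idea it illustrates is that XOR becomes linearly separable once a third input carrying the conjunction of the two originals is added. A minimal check in Python (the particular weights and threshold here are one illustrative choice, not taken from the slide):

```python
# XOR computed by a single threshold unit, given an extra unit that
# carries the AND of the two inputs: output = [x1 + x2 - 2*(x1 AND x2) > 0.5]
def xor_via_extra_unit(x1, x2):
    h = x1 and x2                      # the added unit computes AND
    net = 1.0 * x1 + 1.0 * x2 - 2.0 * h
    return 1 if net > 0.5 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', xor_via_extra_unit(x1, x2))
# prints 0, 1, 1, 0: the augmented problem is linearly separable
```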
Gradient Descent Learning in the ‘LMS’ Associator
Output is a linear function of inputs and weights:
Find a learning rule to minimize the summed squared error:
Taking derivatives, we find:
Consider the policy:
This breaks down into the sum over patterns of terms of the form:
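The five equations on this slide are images not captured in the transcript. In the standard formulation of the LMS (delta) rule, which is what the bullets describe, they are, for pattern p with inputs $i_{pi}$, output $o_p$, and target $t_p$:

```latex
o_p = \sum_i w_i\, i_{pi}
E = \tfrac{1}{2} \sum_p (t_p - o_p)^2
\frac{\partial E}{\partial w_i} = -\sum_p (t_p - o_p)\, i_{pi}
\Delta w_i = -\varepsilon\, \frac{\partial E}{\partial w_i}
\Delta_p w_i = \varepsilon\,(t_p - o_p)\, i_{pi}
```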
Error Surface for OR function in LMS Associator
What if we want to learn how to solve XOR?

We need to figure out how to adjust the weights into the ‘hidden’ unit, following the principle of gradient descent:
We start with an even simpler problem

[Slide figure: a chain of three units, 0 → 1 → 2, with weight $w_{10}$ from unit 0 to unit 1 and weight $w_{21}$ from unit 1 to unit 2.]

Assume units are linear ($a_i = \mathrm{net}_i$), both weights $= 0.5$, input $i = 1$, and target $t = 1$. The error is

$$E = \tfrac{1}{2}(t - a_2)^2, \qquad \mathrm{net}_2 = w_{21}\, a_1, \qquad \mathrm{net}_1 = w_{10}\, a_0$$

Weight changes should follow the gradient:

$$\Delta w_{rs} = -\varepsilon\, \frac{\partial E}{\partial w_{rs}}$$

We use the chain rule to calculate $\partial E/\partial w$ for each weight. First we unpack the chain, then we calculate the elements of it:

$$\frac{\partial E}{\partial w_{21}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial w_{21}} = -(t - a_2)\cdot 1 \cdot a_1$$

$$\frac{\partial E}{\partial w_{10}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial a_1}\,\frac{\partial a_1}{\partial \mathrm{net}_1}\,\frac{\partial \mathrm{net}_1}{\partial w_{10}} = -(t - a_2)\cdot 1 \cdot w_{21}\cdot 1 \cdot a_0$$
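With the stated values the arithmetic is straightforward (this worked computation is added here; it is not on the slide):

```latex
a_1 = 0.5 \cdot 1 = 0.5, \qquad a_2 = 0.5 \cdot 0.5 = 0.25, \qquad E = \tfrac{1}{2}(1 - 0.25)^2 = 0.28125
\frac{\partial E}{\partial w_{21}} = -(0.75)(0.5) = -0.375, \qquad
\frac{\partial E}{\partial w_{10}} = -(0.75)(0.5)(1) = -0.375
```

So gradient descent increases both weights, as it should to move $a_2$ toward the target of 1.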
Including a non-linear activation function
• Let

$$a_i = f(\mathrm{net}_i) = \frac{1}{1 + e^{-\mathrm{net}_i}}$$

• Then

$$\frac{\partial a_i}{\partial \mathrm{net}_i} = f'(\mathrm{net}_i) = a_i\,(1 - a_i)$$

• So our chains from before become:

$$\frac{\partial E}{\partial w_{21}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial w_{21}} = -(t - a_2)\, a_2(1 - a_2)\, a_1$$

$$\frac{\partial E}{\partial w_{10}} = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial \mathrm{net}_2}\,\frac{\partial \mathrm{net}_2}{\partial a_1}\,\frac{\partial a_1}{\partial \mathrm{net}_1}\,\frac{\partial \mathrm{net}_1}{\partial w_{10}} = -(t - a_2)\, a_2(1 - a_2)\, w_{21}\, a_1(1 - a_1)\, a_0$$
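A quick way to convince yourself these derivatives are right is to compare them against numerical finite-difference estimates. A small Python sketch, written for these notes rather than taken from the slides (unit and weight names follow the deck's 0 → 1 → 2 chain):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w10, w21, a0=1.0, t=1.0):
    """Run the two-weight chain and return (E, a1, a2) with E = 0.5*(t - a2)**2."""
    a1 = sigmoid(w10 * a0)
    a2 = sigmoid(w21 * a1)
    return 0.5 * (t - a2) ** 2, a1, a2

w10, w21, a0, t = 0.5, 0.5, 1.0, 1.0
E, a1, a2 = forward(w10, w21, a0, t)

# Analytic gradients from the chain rule above
dE_dw21 = -(t - a2) * a2 * (1 - a2) * a1
dE_dw10 = -(t - a2) * a2 * (1 - a2) * w21 * a1 * (1 - a1) * a0

# Finite-difference estimates of the same gradients
eps = 1e-6
num_w21 = (forward(w10, w21 + eps)[0] - forward(w10, w21 - eps)[0]) / (2 * eps)
num_w10 = (forward(w10 + eps, w21)[0] - forward(w10 - eps, w21)[0]) / (2 * eps)

print(dE_dw21, num_w21)   # the two columns should agree to ~6 decimal places
print(dE_dw10, num_w10)
```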
Including the activation function in the chain rule and including more than one output unit leads to the formulation below, in which we use $\delta_i$ to represent $-\partial E/\partial \mathrm{net}_i$ (defined with the minus sign so that the rule below descends the error gradient).

The weight change rule at every layer is:

$$\Delta w_{rs} = \varepsilon\, \delta_r\, a_s$$

Calculating the $\delta$ term for output unit $i$:

$$\delta_i = (t_i - a_i)\, f'(\mathrm{net}_i)$$

And the $\delta$ term for hidden unit $j$:

$$\delta_j = f'(\mathrm{net}_j) \sum_i \delta_i\, w_{ij}$$

We can continue this back indefinitely:

$$\delta_s = f'(\mathrm{net}_s) \sum_r \delta_r\, w_{rs}$$
[Slide figure: units i, j, k in successive layers, with error propagating back from i through j to k.]
Back propagation algorithm

• Propagate activation forward
– Activation can only flow from lower-numbered units to higher-numbered units
• Propagate “error” backward
– Error flows from higher-numbered units back to lower-numbered units
• Calculate ‘weight error derivative’ terms $= \delta_r a_s$
• One can change weights after processing a single pattern or accumulate weight error derivatives over a batch of patterns before changing the weights.
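For concreteness, here is a minimal NumPy sketch of the algorithm just described, applied to XOR. It is written for these notes rather than taken from the course software (it is not the bpxor.m program), and the layer sizes, learning rate, initialization, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training patterns: inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def f(net):                                  # logistic activation
    return 1.0 / (1.0 + np.exp(-net))

# One hidden layer of 2 units; each weight matrix includes a bias column
W1 = rng.uniform(-0.5, 0.5, size=(2, 3))    # hidden <- (input + bias)
W2 = rng.uniform(-0.5, 0.5, size=(1, 3))    # output <- (hidden + bias)

eps = 0.5                                    # learning rate
for epoch in range(20000):
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in zip(X, T):
        # Forward pass: activation flows from lower to higher layers
        a0 = np.append(x, 1.0)               # input plus always-on bias unit
        a1 = np.append(f(W1 @ a0), 1.0)      # hidden activations plus bias
        a2 = f(W2 @ a1)                      # output activation
        # Backward pass: delta terms flow from higher to lower layers
        d2 = (t - a2) * a2 * (1 - a2)                      # output delta
        d1 = (W2[:, :2].T @ d2) * a1[:2] * (1 - a1[:2])    # hidden deltas
        # Accumulate weight error derivatives over the batch of patterns
        dW2 += np.outer(d2, a1)
        dW1 += np.outer(d1, a0)
    W1 += eps * dW1                          # change weights after the batch
    W2 += eps * dW2

for x in X:
    a1 = np.append(f(W1 @ np.append(x, 1.0)), 1.0)
    print(x, '->', f(W2 @ a1)[0])            # should approach 0, 1, 1, 0
    # (results vary with initialization; see the local-minima caveat later)
```

This version accumulates weight error derivatives over all four patterns before updating, i.e. the batch option mentioned in the last bullet; updating after each pattern just means applying the accumulated change inside the inner loop.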
Variants/Embellishments to back propagation
• Full “batch mode” (epoch-wise) learning rule with weight decay and momentum:

$$\Delta w_{rs} = \varepsilon \sum_p \delta_{rp}\, a_{sp} \;-\; \omega\, w_{rs} \;+\; \alpha\, \Delta w_{rs}(\mathrm{prev})$$
• Weights can alternatively be updated after each pattern or after every k patterns.
• An alternative error measure has both conceptual and practical advantages:
$$CE_p = -\sum_i \left[\, t_{ip} \log(a_{ip}) + (1 - t_{ip}) \log(1 - a_{ip}) \,\right]$$
• If targets are actually probabilistic, minimizing CEp maximizes the probability of the observed target values.
• This also eliminates the ‘pinned output unit’ problem: under squared error, an output unit pinned at the wrong extreme has $f'(\mathrm{net}) \approx 0$, so its error signal, and hence learning, all but vanishes.
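The practical advantage can be seen by differentiating. For a logistic output unit, the two error measures give (a standard derivation, added here for reference):

```latex
\text{Squared error:}\quad -\frac{\partial E_p}{\partial \mathrm{net}_i} = (t_{ip} - a_{ip})\, a_{ip}(1 - a_{ip})
\text{Cross-entropy:}\quad -\frac{\partial CE_p}{\partial \mathrm{net}_i} = t_{ip} - a_{ip}
```

The $a(1-a)$ factor is what pins a unit at the wrong extreme under squared error; under cross-entropy it cancels, so the error signal stays proportional to $(t - a)$.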
Why is back propagation important?
• Provides a procedure that allows networks to learn weights that can solve any deterministic input-output problem.
– Contrary to expectation, it does not get stuck in local minima except in cases where the network is exceptionally tightly constrained.
• Allows networks to learn how to represent information as well as how to use it.
• Raises questions about the nature of representations and of what must be specified in order to learn them.
Is Backprop biologically plausible?
• Neurons do not send error signals backward across their weights through a chain of neurons, as far as anyone can tell.
• But we shouldn’t be too literal-minded about the actual biological implementation of the learning rule.
• Some neurons appear to use error signals, and there are ways to use differences between activation signals to carry error information.