Learning in Recurrent Networks
Psychology 209, February 25, 2013

TRANSCRIPT
Outline

• Back propagation through time
• Alternatives that can teach networks to settle to fixed points
• Learning conditional distributions
• An application
  – Collaboration of hippocampus & cortex in learning new associations
Back Propagation Through Time

The error at each unit is the injected error (arrows) plus the back-propagated error; these are summed and scaled by the derivative of the activation function to calculate the deltas.
Continuous back propagation through time as implemented in rbp

• Time is viewed as consisting of "intervals" of length Δt, running from 0 to tmax.
• Inputs are typically clamped from t = 0 for 2-3 intervals.
• Activation equation (for t = Δt : Δt : tmax):
  net_i(t) = Δt ( Σ_j a_j(t−Δt) w_ij + b_i ) + (1 − Δt) net_i(t−Δt)
• Calculation of deltas (for t = tmax : −Δt : Δt):
  δ_j(t) = Δt f′(net_j(t)) ∂E/∂a_j(t) + (1 − Δt) δ_j(t+Δt)
  where δ_j(tmax+Δt) = 0 for all j, and
  ∂E/∂a_j(t) = Σ_k w_kj δ_k(t+Δt) + (t_j(t) − a_j(t))
• Targets are usually provided over the last 2-3 intervals.
• Then change the weights using: ∂E/∂w_ij = Σ_{t=Δt:Δt:tmax} a_j(t−Δt) δ_i(t)
• Include momentum and weight decay if desired.
• Use CE instead of E if desired: CE = −Σ_i [ t_i log(a_i) + (1−t_i) log(1−a_i) ]
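The interval-based equations above can be sketched in Python. This is a minimal illustration, not the actual rbp code; the names (W, b, x, dt, nsteps) and the choice of a constant external input are assumptions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def bptt_gradient(W, b, x, target, dt=0.25, nsteps=12, target_steps=3):
    """Continuous-BPTT sketch for a fully recurrent logistic network.
    x is a constant external input to every unit's net input; targets
    are scored over the last `target_steps` intervals."""
    n = W.shape[0]
    net = np.zeros((nsteps + 1, n))
    a = np.zeros((nsteps + 1, n))
    a[0] = logistic(net[0])
    # Forward: net(t) = dt*(W a(t-dt) + b + x) + (1 - dt)*net(t-dt)
    for t in range(1, nsteps + 1):
        net[t] = dt * (W @ a[t - 1] + b + x) + (1 - dt) * net[t - 1]
        a[t] = logistic(net[t])
    # Backward: delta(t) = dt*f'(net(t))*dE/da(t) + (1 - dt)*delta(t+dt),
    # with delta(tmax + dt) = 0 for all units.
    delta = np.zeros(n)
    dW = np.zeros_like(W)
    for t in range(nsteps, 0, -1):
        dEda = W.T @ delta                    # back-propagated error
        if t > nsteps - target_steps:
            dEda += target - a[t]             # injected error, last intervals
        delta = dt * a[t] * (1 - a[t]) * dEda + (1 - dt) * delta
        dW += np.outer(delta, a[t - 1])       # dE/dw_ij += delta_i(t) a_j(t-dt)
    return dW
```

Because the injected error is (t − a), adding a small multiple of the returned matrix to W moves the activations toward the targets over the scored intervals.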
Recurrent Network Used in the Rogers et al. Semantic Network Simulation
Pluses and Minuses of BPTT

• Can learn arbitrary trajectories through state space (figure eights, etc.).
• Works very reliably in training networks to settle to desired target states.
• Biologically implausible.
• The gradient gets very thin (tends to vanish) over many time steps.
Several Variants and Alternative Algorithms (all relevant to networks that settle to a fixed point)

• Almeida/Pineda algorithm
  – Discussed in the Williams and Zipser reading, along with many other variants of back propagation through time
• Recirculation and GeneRec
  – Discussed in the O'Reilly reading
• Contrastive Hebbian learning
  – Discussed in the Movellan and McClelland reading
Almeida-Pineda Algorithm (notation from O'Reilly, 1996)

Update the net inputs (h) until they stop changing (s(.) is the logistic function); then update the deltas (y) until they stop changing. J represents the external error to the unit, if any. Adjust the weights using the delta rule.

[Update equations shown on slide.]
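The slide's update equations appeared as images, but the standard Almeida-Pineda fixed-point computation can be sketched as follows. This is a reconstruction under the usual formulation (logistic units, squared error at the output units); all names are illustrative.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def almeida_pineda_grad(W, I, target, out_mask, n_settle=200):
    """Almeida-Pineda sketch: settle the activations to a fixed point,
    settle the error variables to a fixed point, then form the gradient.
    Nothing is unrolled through time; only the two settling processes
    are iterated."""
    n = W.shape[0]
    a = np.full(n, 0.5)
    for _ in range(n_settle):                 # settle: a = s(W a + I)
        a = logistic(W @ a + I)
    sp = a * (1.0 - a)                        # s'(h) for logistic units
    g = np.where(out_mask, a - target, 0.0)   # J: external error (dE/da)
    z = np.zeros(n)
    for _ in range(n_settle):                 # settle: z = g + W^T (s'(h) z)
        z = g + W.T @ (sp * z)
    return np.outer(sp * z, a)                # dE/dw_ij = s'(h_i) z_i a_j
```

Both settling loops converge when the recurrent weights are small enough that each update is a contraction; the returned matrix is the exact gradient of the error at the activation fixed point.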
Assuming symmetric connections, only activation is propagated; the time difference of activation reflects the error signal. Maybe this is more biologically plausible than explicit backpropagation of error?
Generalized Recirculation (GeneRec), O'Reilly, 1996

Minus phase: Present the input, feed activation forward, compute the output, let it feed back, and let the network settle.

Plus phase: Then clamp both the input and output units into the desired state, and let the network settle again.*

*The equations neglect the component of the net input at the hidden layer that comes from the input layer.

[Figure labels: inputs s_i, hidden units h_j / y_j, targets t_k.]
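The two phases can be sketched for a simple input-hidden-output network with symmetric hidden-to-output weights, using the GeneRec update Δw_ij = ε · pre_i⁻ · (post_j⁺ − post_j⁻). This is a minimal illustration, not O'Reilly's code; all names and sizes are assumptions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def generec_step(W_ih, W_ho, s, t, lr=0.1, n_settle=50):
    """One GeneRec training step for one input/target pair. The hidden
    units get feedforward input from s plus feedback from the output
    layer through the transposed (symmetric) output weights."""
    # Minus phase: clamp input s, let hidden and output settle together.
    h = np.zeros(W_ih.shape[1]); o = np.zeros(W_ho.shape[1])
    for _ in range(n_settle):
        h = logistic(s @ W_ih + o @ W_ho.T)   # feedback via symmetric weights
        o = logistic(h @ W_ho)
    h_minus, o_minus = h, o
    # Plus phase: clamp input AND output to the target; let hidden settle.
    o = t.copy()
    for _ in range(n_settle):
        h = logistic(s @ W_ih + o @ W_ho.T)
    h_plus = h
    # GeneRec: dw = lr * pre_minus * (post_plus - post_minus)
    W_ih += lr * np.outer(s, h_plus - h_minus)
    W_ho += lr * np.outer(h_minus, t - o_minus)
    return W_ih, W_ho
```

Note that only activations are exchanged between the phases; the plus-minus difference in each unit's activation plays the role of the backpropagated error signal.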
A problem for backprop and approximations to it: the average of two solutions may not be a solution
Network Must Be Stochastic

• Boltzmann machine: P(a = 1) = logistic(net/T)
• Continuous diffusion network (with gain g = 1/T)

[Diffusion network activation equation shown on slide.]
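The Boltzmann machine rule on this slide can be sketched as a Gibbs sweep: each binary unit is resampled in turn, turning on with probability logistic(net/T). The names (W, b, T) are illustrative; W is assumed symmetric with a zero diagonal.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(W, b, a, T=1.0, rng=None):
    """One Gibbs sweep of a Boltzmann machine: each binary unit turns on
    with probability logistic(net/T), per the rule on the slide."""
    rng = np.random.default_rng() if rng is None else rng
    for i in range(len(a)):
        net = W[i] @ a + b[i]                 # net input from all other units
        a[i] = 1.0 if rng.random() < logistic(net / T) else 0.0
    return a
```

Repeated sweeps sample from the network's equilibrium (Boltzmann) distribution at temperature T; raising T flattens the distribution, which is what makes the network's settling stochastic rather than deterministic.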
Contrastive Hebbian Learning Rule

• Present the input only ('minus phase'); settle to equilibrium (change still occurs, but the distribution stops changing).
  – Do this several times to sample the distribution of states at equilibrium.
  – Collect 'coproducts' a_i⁻ a_j⁻; average: ⟨a_i⁻ a_j⁻⟩.
• Present the input and targets ('plus phase').
  – Collect 'coproducts' a_i⁺ a_j⁺; average: ⟨a_i⁺ a_j⁺⟩.
• Change the weights according to:
  Δw_ij = ε ( ⟨a_i⁺ a_j⁺⟩ − ⟨a_i⁻ a_j⁻⟩ )
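Given equilibrium states already sampled in each phase, the update above is just a difference of averaged coproduct matrices. A minimal sketch (the sampling itself, e.g. by Gibbs sweeps, is assumed to have happened elsewhere; names are illustrative):

```python
import numpy as np

def chl_update(minus_states, plus_states, lr=0.05):
    """Contrastive Hebbian weight update from sampled equilibrium states.
    minus_states, plus_states: arrays of shape (n_samples, n_units) holding
    unit activations collected at equilibrium in each phase.
    Returns dw with dw[i, j] = lr * (<a_i+ a_j+> - <a_i- a_j->)."""
    co_plus = plus_states.T @ plus_states / len(plus_states)    # <a+ a+>
    co_minus = minus_states.T @ minus_states / len(minus_states)  # <a- a->
    return lr * (co_plus - co_minus)
```

The rule is purely Hebbian within each phase; the "contrast" between phases is what carries the error information.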
The contrastive Hebbian learning rule minimizes the divergence between the probability distributions over all possible states s of the output units for the desired (plus) and obtained (minus) phases:

G = ∫ p⁺(s) ln [ p⁺(s) / p⁻(s) ] ds
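For a network with a finite set of states, the integral becomes a sum, and G is the familiar Kullback-Leibler divergence. A discrete sketch (the function name is illustrative):

```python
import numpy as np

def divergence_G(p_plus, p_minus):
    """Discrete version of the objective above:
    G = sum_s p+(s) * ln(p+(s)/p-(s)).
    G >= 0, and G = 0 exactly when the two distributions match."""
    p_plus = np.asarray(p_plus, dtype=float)
    p_minus = np.asarray(p_minus, dtype=float)
    mask = p_plus > 0                 # terms with p+(s) = 0 contribute 0
    return float(np.sum(p_plus[mask] * np.log(p_plus[mask] / p_minus[mask])))
```

Because G is zero only when the obtained distribution matches the desired one, driving G down with the contrastive Hebbian rule teaches the network the whole distribution of target states, not just a single point.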
In a continuous diffusion network, probability flows over time until it reaches an equilibrium distribution
Patterns and Distributions

[Figure: the desired distribution and the obtained results for each pattern.]
Problems and Solutions

• Stochastic neural networks are VERY slow to train: settling (which takes many time steps) must be repeated many times in each of the plus and minus phases to collect adequate statistics.
• Perhaps RBMs and deep networks can help here?
Collaboration of Hippocampus and Neocortex

• The effects of prior association strength on memory in both amnesic and control subjects are consistent with the idea that the hippocampus and neocortex work synergistically, rather than simply providing two different sources of correct performance.
• Even a damaged hippocampus can be helpful when the prior association is very strong.
Performance of Control and Amnesic Patients in Learning Word Pairs with Prior Associations

Cutting (1978), Expt. 1

[Bar chart: percent correct (scale −20 to 100) by category of ease of association (Very Easy, Easy, Fairly Easy, Hard, Very Hard), for Control (Expt), Amnesic (Expt), and base rates. Example pairs: man:woman (very easy), hungry:thin, city:ostrich (very hard).]
Kwok & McClelland Model

• The model includes a slow-learning cortical system representing the content of an association and the context.
• Hidden units in the neocortex mediate associative learning.
• The cortical network is pre-trained with several cue-relation-response triples for each of 20 different cues.
• When tested with just the cue as a probe, it tends to produce different targets with different probabilities:
  – Dog (chews) bone (~.30)
  – Dog (chases) cat (~.05)
• Then the network is shown cue-response-context triples. The hippocampus learns fast and the cortex learns (very) slowly.
• The hippocampal and cortical networks work together at recall, so that even weak hippocampal learning can increase the probability of settling to a very strong pre-existing association.

[Network diagram: Context, Relation, and Cue units connect to a Response layer via Neocortex and Hippocampus components.]
Data with Simulation Results From the K&M Model

Cutting (1978), Expt. 1

[Bar chart: percent correct by category of ease of association (Very Easy through Very Hard), for Control (Model), Amnesic (Model), Control (Expt), and Amnesic (Expt).]