Slide 1
Goal-Directed Feature and Memory Learning
Cornelius Weber
Frankfurt Institute for Advanced Studies (FIAS)
Sheffield, 3rd November 2009
Collaborators: Sohrab Saeb and Jochen Triesch
Slide 2
for taking action, we need only the relevant features
Slide 3

unsupervised learning in cortex (state space)
reinforcement learning in basal ganglia (actor)
(Doya, 1999)
Slide 4

background:
- gradient descent methods generalize RL to several layers: Sutton & Barto, RL book (1998); Tesauro (1992; 1995)
- reward-modulated Hebbian learning: Triesch, Neur Comp 19, 885-909 (2007); Roelfsema & van Ooyen, Neur Comp 17, 2176-214 (2005); Franz & Triesch, ICDL (2007)
- reward-modulated activity leads to input selection: Nakahara, Neur Comp 14, 819-44 (2002)
- reward-modulated STDP: Izhikevich, Cereb Cortex 17, 2443-52 (2007); Florian, Neur Comp 19(6), 1468-502 (2007); Farries & Fairhall, J Neurophysiol 98, 3648-65 (2007); ...
- RL models learn a partitioning of the input space: e.g. McCallum, PhD thesis, Rochester, NY, USA (1996)
Slide 5
reinforcement learning
go up? go right? go down? go left?
Slide 6
reinforcement learning
input s
action a
weights
Slide 7

reinforcement learning

q(s,a): value of a state-action pair (coded in the weights)

minimizing the value estimation error:
Δq(s,a) ≈ 0.9 q(s',a') - q(s,a)   (moving target value)
Δq(s,a) ≈ 1 - q(s,a)   (fixed at the goal)

repeated running to the goal: in state s, the agent performs the best action a (with some randomness), yielding s' and a'
--> values and action choices converge

(diagram: input s --> weights --> action a)
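As a concrete illustration, here is a minimal tabular SARSA sketch of this update (the task size, the reward of 1 at the goal, the learning rate, and the ε-greedy action choice are assumptions for the example):

```python
import numpy as np

# Tabular SARSA sketch; gamma = 0.9 matches the slide's update rule.
n_states, n_actions = 16, 4            # assumed task size
gamma, alpha, epsilon = 0.9, 0.1, 0.1  # discount, learning rate, exploration
q = np.zeros((n_states, n_actions))    # q(s, a), here a table rather than weights

def choose_action(s):
    """Best action with some randomness (epsilon-greedy)."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q[s]))

def sarsa_update(s, a, s_next, a_next, at_goal):
    """One value update; delta is the value estimation error."""
    if at_goal:
        delta = 1.0 - q[s, a]                        # fixed target at the goal
    else:
        delta = gamma * q[s_next, a_next] - q[s, a]  # moving target value
    q[s, a] += alpha * delta
```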
Slide 8

reinforcement learning

actor over the input (state space)

simple input: "go right!"
complex input: "go right? go left? ... can't handle this!"
Slide 9
sensory input
reward
action
complex input
scenario: bars controlled by the actions 'up', 'down', 'left' and 'right';
reward is given when the horizontal bar is at a specific position
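A minimal sketch of such a bars world (the grid size, wrap-around moves, and the rewarded row are assumptions; the talk's exact environment may differ):

```python
import numpy as np

class BarsEnv:
    """One horizontal bar (relevant) and one vertical distractor bar."""
    def __init__(self, size=12, reward_row=0):
        self.size, self.reward_row = size, reward_row
        self.row = np.random.randint(size)   # horizontal bar position
        self.col = np.random.randint(size)   # vertical bar position

    def observe(self):
        image = np.zeros((self.size, self.size))
        image[self.row, :] = 1.0             # horizontal bar
        image[:, self.col] = 1.0             # vertical bar
        return image.ravel()                 # input vector I

    def step(self, action):
        """Actions 0..3 = up, down, left, right; returns the reward."""
        if action == 0: self.row = (self.row - 1) % self.size
        elif action == 1: self.row = (self.row + 1) % self.size
        elif action == 2: self.col = (self.col - 1) % self.size
        elif action == 3: self.col = (self.col + 1) % self.size
        return 1.0 if self.row == self.reward_row else 0.0
```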
Slide 10

need another layer (or layers) to pre-process the complex data

network definition:
s = softmax(W I)          (feature detection)
P(a = 1) = softmax(Q s)   (action selection)
q = a Q s

I: input; W: weight matrix (feature detector); s: state; Q: weight matrix (encodes q); a: action

the state layer represents the position of the relevant bar
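A sketch of this forward pass (the matrix shapes and the one-hot action coding are assumptions: W is features × inputs, Q is actions × features):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

def forward(I, W, Q):
    s = softmax(W @ I)                           # s = softmax(W I): state
    p_a = softmax(Q @ s)                         # P(a = 1) = softmax(Q s)
    a = np.zeros_like(p_a)
    a[np.random.choice(len(p_a), p=p_a)] = 1.0   # sample a one-hot action
    q_val = a @ Q @ s                            # q = a Q s
    return s, a, q_val
```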
Slide 11

network training (minimize the error w.r.t. the current target):
E = (0.9 q(s',a') - q(s,a))² = δ²
ΔQ ≈ dE/dQ = δ a s         (reinforcement learning)
ΔW ≈ dE/dW = δ Q s I + ε   (δ-modulated unsupervised learning)

a: action; s: state; I: input; W: feature weight matrix; Q: action weight matrix

(feature detection in the lower layer, action selection in the upper layer)
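A sketch of one training step, building on the forward pass above (the learning rates and noise scale are assumptions, and the feature-weight rule is one plausible reading of the compact ΔW expression: the δ signal back-projected through Q and gated by s):

```python
def train_step(I, s, a, q_sa, q_next, W, Q, at_goal, eta_q=0.1, eta_w=0.01):
    # value estimation error delta, with gamma = 0.9 and reward 1 at the goal
    delta = (1.0 - q_sa) if at_goal else (0.9 * q_next - q_sa)
    # reinforcement learning on the action weights: delta * a * s
    Q += eta_q * delta * np.outer(a, s)
    # delta-modulated unsupervised learning on the feature weights, plus noise eps
    W += eta_w * delta * np.outer((Q.T @ a) * s, I)
    W += 1e-4 * np.random.randn(*W.shape)
    # non-negativity constraint on the weights
    np.clip(Q, 0.0, None, out=Q)
    np.clip(W, 0.0, None, out=W)
    return delta
```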
Slide 12

note: non-negativity constraint on the weights

details: network training minimizes the error w.r.t. the target Vπ

identities used: (not shown)
Slide 13

SARSA with a WTA input layer
(the v on the slide should read q)
Slide 14

learning the 'short bars' data (figure: data samples, learned feature weights, and RL action weights; reward and action indicated)
Slide 15

short bars in a 12×12 input; average number of steps to the goal: 11
Slide 16

learning the 'long bars' data (figure: data with input and reward, learned feature weights, and RL action weights; 2 actions not shown)
Slide 17

figure panels: WTA, non-negative weights; SoftMax, non-negative weights; SoftMax, no weight constraints
Slide 18
extension to memory ...
Slide 19

if feature detection sometimes fails ...
... it would be good to have memory or a forward model
(the grey bars are invisible to the network)
Slide 20

memory extension: the network additionally receives s(t-1) and a(t-1) as inputs

network training by gradient descent as before; softmax function used, no weight constraint

a: action; s: state; I: input; W: feature weights; Q: action weights
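A hedged sketch of the memory extension, reusing softmax from the earlier sketch (the names M and V for the memory weights, and the additive way they enter the state activation, are assumptions; the slides specify only that s(t-1) and a(t-1) feed in):

```python
def forward_with_memory(I, s_prev, a_prev, W, M, V, Q):
    # state now also depends on the previous state and action (memory weights M, V)
    s = softmax(W @ I + M @ s_prev + V @ a_prev)   # softmax, no weight constraint
    p_a = softmax(Q @ s)
    a = np.zeros_like(p_a)
    a[np.random.choice(len(p_a), p=p_a)] = 1.0
    return s, a, a @ Q @ s                         # state, action, q-value
```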
Slide 21
learnt feature detectors
Slide 22
the network updates its trajectory internally
Slide 23
network performance
Slide 24

discussion

- two-layer SARSA RL performs gradient descent on the value estimation error
- the approximation with winner-take-all leads to a local rule with δ-feedback
- the network learns only action-relevant features
- non-negative coding aids feature extraction
- the memory weights develop into a forward model
- a link between unsupervised and reinforcement learning
- a demonstration with more realistic data is still needed
Slide 25
video
Slide 26

Slide 27

Thank you!

Collaborators: Sohrab Saeb and Jochen Triesch

Sponsors:
- Bernstein Focus Neurotechnology, BMBF grant 01GQ0840
- EU project 231722 "IM-CLeVeR", call FP7-ICT-2007-3
- Frankfurt Institute for Advanced Studies (FIAS)