TRANSCRIPT
Reasons to be careful about reward
• A flow (policy) cannot be specified with a scalar function of states: the fundamental theorem of vector calculus – aka the Helmholtz decomposition
• Any (curl free) flow specified with reward can only have a fixed point attractor: reward cannot specify itinerant movement or policies
• Value is produced by flow – not its cause: reward is a consequence of (defined by) behaviour not its cause
The inherent tautology of reward: explaining behaviour in terms of maximising reward is like explaining the evolution of the eye by saying it maximises adaptive value
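The Helmholtz-decomposition point can be illustrated numerically. For a linear flow ẋ = Ax, only the symmetric part of A can arise as the gradient of a scalar potential; any antisymmetric (divergence-free, rotational) component is invisible to a value function. A minimal sketch, with an illustrative choice of A:

```python
# A gradient (curl-free) flow x' = -grad V(x) with quadratic V has a symmetric
# Jacobian. Any antisymmetric (divergence-free, rotational) component is
# therefore invisible to a scalar value function: the Helmholtz-decomposition
# argument in the bullet above. The matrix A is an illustrative assumption.
import numpy as np

A = np.array([[-0.1, -1.0],
              [ 1.0, -0.1]])   # damped rotation: a spiralling flow

S = 0.5 * (A + A.T)            # symmetric part: expressible as a gradient flow
Q = 0.5 * (A - A.T)            # antisymmetric part: rotational, not a gradient

# the curl of the 2-D linear flow f(x) = A x is the off-diagonal asymmetry
curl = A[1, 0] - A[0, 1]
print("antisymmetric part:\n", Q)
print("curl:", curl)           # non-zero, so no scalar V can generate this flow
```

The non-zero curl means no reward or value function V with ẋ = −∇V can reproduce this flow, even though the flow has a perfectly well-defined fixed point.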
Unresolved questions in motor control: A UCL-JHU workshop
A Physicist, An Engineer, An Economist
ẋ = f(x, a) + ω
s = g(x, a) + ω
a = arg min_a F(s, μ)
F(s, μ) = −ln p(s | m) + D[q(x | μ) || p(x | s, m)]
F(t) ≥ −ln p(s(t) | m)
ṗ(x | m) = 0 ⇒ p(x | m) is the ergodic (equilibrium) density implied by the flow f
ẋ = f(x, a) + ω
ẋ = f(x, u) + ω
a(t) = u(x(t))   (action plays the part of the control law u)

A random dynamical system comprises a probability space (Ω, F, P, θ(t)) and a flow φ(t, ω) such that

x(t) = φ(t, ω)(x(0))
φ(t, ω)(A(ω)) = A(θ(t)ω)

where A(ω) is a random attractor with basin B.
Random dynamical systems
Random attractors with small measure
H(X | m) ≈ ln λ(A(ω)): the entropy of states scales with the (log) measure of the random attractor A
Kolmogorov forward equation
Free-energy formulation
H(X | m) = H[p(x | m)] ≤ lim_{T→∞} (1/T) ∫_0^T F(t) dt   (ergodic theorem: entropy is the long-run time average of surprise, which free energy bounds)

p(x | m) = exp(V(x))
V̇(x(t)) = ∇V(x)·f = R(x)
V(x(t)) = ∫_t R(x(τ)) dτ
f = ∇×W + Q∇V
Helmholtz decomposition
F(t) ≥ −V(x(t))
Value and reward
Free energy upper bounds expected cost
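The ergodic claim above can be checked in a toy case. The model below is an illustrative assumption, not from the talk: for an Ornstein-Uhlenbeck process dx = −x dt + √2 dW, the stationary density is N(0, 1), so the long-run time average of surprise −ln p(x(t) | m) should approach the entropy of that density, H = ½ ln(2πe) ≈ 1.419.

```python
# Ergodic-theorem sketch (toy assumption): time-averaged surprise of an
# Ornstein-Uhlenbeck process approaches the entropy of its stationary density.
import numpy as np

rng = np.random.default_rng(0)
dt, T = 1e-3, 200.0
n = int(T / dt)
noise = rng.standard_normal(n) * np.sqrt(2 * dt)   # Euler-Maruyama increments

x = 0.0
surprise = np.empty(n)
for i in range(n):
    x += -x * dt + noise[i]                        # dx = -x dt + sqrt(2) dW
    surprise[i] = 0.5 * x**2 + 0.5 * np.log(2 * np.pi)   # -ln N(x; 0, 1)

H = 0.5 * np.log(2 * np.pi * np.e)
print(surprise.mean(), H)    # time average should lie close to H ≈ 1.419
```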
ẋ = f(x, a) + ω
ẋ = f(x, u) + ω
a(t) = u(x(t))
Value and reward
f = ∇×W + Q∇V
Helmholtz decomposition
0 = max_u [ R(x, u) + ∇V(x)·f(x, u) ]
Optimal control theory
V(x(t)) = ∫_t R(x(τ)) dτ
V̇(x(t)) = ∇V(x)·f = R(x)
p(x | m) = exp(V(x))
Value and reward
V(x(t)) = E_t[ ∫_t R(x(τ)) dτ ]   (expected path integral of the reward R(x))
p(x | m) = exp(V(x))
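The Hamilton-Jacobi-Bellman relation from optimal control theory can be verified in a toy scalar case. The system and reward below are illustrative assumptions (f(x, u) = u and R(x, u) = −(qx² + ru²), in the reward-maximising convention): the ansatz V(x) = −√(qr) x² with policy u*(x) = −√(q/r) x solves 0 = max_u [R(x, u) + V′(x) f(x, u)] exactly.

```python
# Toy check of the HJB equation 0 = max_u [ R(x,u) + V'(x) f(x,u) ]
# for an assumed scalar linear system: f(x,u) = u, R(x,u) = -(q x^2 + r u^2).
import numpy as np

q, r = 2.0, 0.5
a = np.sqrt(q * r)                      # value ansatz V(x) = -a x^2

def V_prime(x):
    return -2 * a * x

def u_star(x):
    return -np.sqrt(q / r) * x          # maximiser of R + V' f (concave in u)

x = np.linspace(-3, 3, 7)
u = u_star(x)
hjb_residual = -(q * x**2 + r * u**2) + V_prime(x) * u
print(hjb_residual)                     # zero everywhere, up to float error
```

Setting the derivative with respect to u to zero gives u = V′(x)/(2r) = −√(q/r) x, and substituting back makes the residual vanish identically.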
Forward models in motor control
[Schematic: optimal control u = arg min_u ∫ c(x̂, u) dt computes motor commands in an intrinsic frame of reference and sends an efference copy to a forward model; state estimation x̂̇ = f(x̂, u) + K(s − g(x̂)) inverts the sensory mapping s = g(x) + ω (extrinsic frame of reference) to estimate hidden states x̂; plant kinetics ẋ = f(x, u); cost function c(x, u).]
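The state-estimation loop in this schematic can be sketched with a simple observer. The plant, sensory mapping and gain below are illustrative assumptions: the estimator x̂̇ = f(x̂, u) + K(s − g(x̂)) is driven by an efference copy of the motor command and converges on the hidden state.

```python
# Minimal sketch of the forward-model/state-estimation loop:
# plant x' = f(x,u), sensory mapping s = g(x), and an observer
# xh' = f(xh,u) + K*(s - g(xh)) driven by the efference copy of u.
# The dynamics and gain are illustrative assumptions.
import math

def f(x, u):
    return -x + u                   # plant kinetics

def g(x):
    return x                        # sensory mapping (noise-free here)

dt, K = 1e-3, 2.0
x, xh = 1.0, -1.0                   # estimator starts far from the plant
for i in range(int(10.0 / dt)):
    u = math.sin(i * dt)            # motor command (and efference copy)
    s = g(x)                        # sensation
    x  += dt * f(x, u)
    xh += dt * (f(xh, u) + K * (s - g(xh)))

print(abs(x - xh))                  # estimation error has decayed
```

Here the error e = x − x̂ obeys ė = −(1 + K)e, so the estimate tracks the plant regardless of the motor command.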
Predictive coding in motor control
[Schematic: as above, optimal control u = arg min_u ∫ c(x̂, u) dt with cost function c(x, u), sensory mapping s = g(x) and plant kinetics ẋ = f(x, u); here the forward model x̂̇ = f(x̂, u) + Kε sends top-down predictions and receives bottom-up prediction errors ε_e = s_e − g_e(x̂) (exteroceptive) and ε_p = s_p − g_p(x̂) (proprioceptive).]
Active inference
[Schematic: prior beliefs v(t) enter a forward model x̂̇ = f(x̂, v) + Kε, which issues descending proprioceptive predictions; a classical reflex suppresses proprioceptive prediction error, a = arg min_a ε_pᵀ ε_p, so that action a(t) (with corollary discharge) drives the plant kinetics ẋ = f(x, a); sensations s = g(x) return as bottom-up prediction error ε_p, and movements follow.]
[Schematic: a two-joint arm with joint angles x1, x2 and Jacobians J1, J2, hand position measured from the origin (0, 0); proprioceptive input s_x and visual input s_V = (v1, v2, v3).]
Action with point attractors (cf. the equilibrium-point hypothesis)
[Descending proprioceptive predictions specify a target v^(1); action performs a gradient descent on the prediction error ε_v^(1), ȧ = −∂_a s_vᵀ ε_v^(1), fulfilling the exteroceptive predictions as the arm moves.]
[Plots: 'action' and 'observation' panels showing movement trajectories in position (x) versus position (y).]
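The point-attractor scheme can be sketched in one dimension. Everything below is an illustrative assumption (a first-order plant, unit proprioceptive mapping, and a fixed gain): a prior belief v specifies the predicted posture, and action descends the proprioceptive prediction error until the plant settles at the prior, as in the equilibrium-point hypothesis.

```python
# Sketch of action with a point attractor: a prior belief v specifies a
# proprioceptive prediction; action a descends the prediction error
# eps_p = s_p - v, pulling the plant to the prior. Plant and gains are
# illustrative assumptions.
dt, kappa, v = 1e-3, 0.5, 1.2       # prior belief: "the limb is at 1.2"
x, a = 0.0, 0.0
for _ in range(int(30.0 / dt)):
    s_p = x                         # proprioceptive sensation
    eps_p = s_p - v                 # bottom-up prediction error
    x += dt * (a - x)               # plant kinetics x' = f(x, a)
    a += dt * (-kappa * eps_p)      # reflex: action suppresses the error
print(abs(x - v))                   # the limb settles at the predicted posture
```

Note that nothing here maximises a reward: the fixed point is specified by a prediction, and action merely cancels the error.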
Action with heteroclinic cycles
[Schematic: as above, but the descending proprioceptive predictions x^(1) follow a heteroclinic cycle, with ȧ = −∂_a s_vᵀ ε_v^(1), producing itinerant, sequential movement.]
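A heteroclinic cycle is exactly the sort of itinerant dynamics that no fixed-point (reward-gradient) flow can produce. As an illustrative stand-in for the predictions above, the May-Leonard Lotka-Volterra system (a standard textbook example, not from the talk) has an attracting heteroclinic cycle: the state visits each single-species saddle in turn instead of settling at a fixed point.

```python
# Itinerant dynamics: the May-Leonard system x_i' = x_i(1 - x_i - a x_j - b x_k)
# with a < 1 < b and a + b > 2 has an attracting heteroclinic cycle. The small
# floor on the state acts as a noise floor that keeps the cycle itinerant.
# Parameter values are an illustrative standard choice.
import numpy as np

a, b = 0.5, 1.6
dt, T = 2e-3, 300.0
x = np.array([0.5, 0.3, 0.2])
dominant = []                        # which species currently dominates
for _ in range(int(T / dt)):
    x1, x2, x3 = x
    growth = 1.0 - np.array([x1 + a * x2 + b * x3,
                             x2 + a * x3 + b * x1,
                             x3 + a * x1 + b * x2])
    x = np.clip(x + dt * x * growth, 1e-4, None)
    dominant.append(int(np.argmax(x)))

visited_late = set(dominant[len(dominant) // 2:])
print(sorted(set(dominant)), sorted(visited_late))   # dominance keeps rotating
```

Because the trajectory never stops at any saddle, no scalar reward function whose gradient the flow ascends could specify this behaviour, which is the point of the bullet on itinerant policies.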
The inherent tautology of reward: explaining behaviour in terms of maximising reward is like explaining the evolution of the eye by saying it maximises adaptive value (cf. intelligent design).