TRANSCRIPT
Reasons to be careful about reward
• A flow (policy) cannot be specified with a scalar function of states: the fundamental theorem of vector calculus – aka the Helmholtz decomposition
• Any (curl free) flow specified with reward can only have a fixed point attractor: reward cannot specify itinerant movement or policies
• Value is produced by flow – not its cause: reward is a consequence of (defined by) behaviour not its cause
The inherent tautology of reward: explaining behaviour in terms of maximising reward is like explaining the evolution of the eye by saying it maximises adaptive value
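The Helmholtz-decomposition point can be illustrated numerically. For a linear flow ẋ = Ax, only the symmetric part of A can arise as the gradient of a scalar potential; any antisymmetric (divergence-free, rotational) component is invisible to a value function. A minimal sketch, with an illustrative choice of A:

```python
# A gradient (curl-free) flow x' = -grad V(x) with quadratic V has a symmetric
# Jacobian. Any antisymmetric (divergence-free, rotational) component is
# therefore invisible to a scalar value function: the Helmholtz-decomposition
# argument in the bullet above. The matrix A is an illustrative assumption.
import numpy as np

A = np.array([[-0.1, -1.0],
              [ 1.0, -0.1]])   # damped rotation: a spiralling flow

S = 0.5 * (A + A.T)            # symmetric part: expressible as a gradient flow
Q = 0.5 * (A - A.T)            # antisymmetric part: rotational, not a gradient

# the curl of the 2-D linear flow f(x) = A x is the off-diagonal asymmetry
curl = A[1, 0] - A[0, 1]
print("antisymmetric part:\n", Q)
print("curl:", curl)           # non-zero, so no scalar V can generate this flow
```

The non-zero curl means no reward or value function V with ẋ = −∇V can reproduce this flow, even though the flow has a perfectly well-defined fixed point.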
Unresolved questions in motor control: A UCL-JHU workshop
A Physicist, An Engineer, An Economist
ẋ = f(x, a) + ω
s = g(x, a) + ω
a = arg min_a F(s, μ)
F(s, μ) = −ln p(s | m) + D[q(x | μ) || p(x | s, m)]
F(t) ≥ −ln p(s(t) | m)
ṗ(x | m) = 0 ⇒ p(x | m) is the ergodic (equilibrium) density implied by the flow f
ẋ = f(x, a) + ω
ẋ = f(x, u) + ω
a(t) = u(x(t))   (action plays the part of the control law u)

A random dynamical system comprises a probability space (Ω, F, P, θ(t)) and a flow φ(t, ω) such that

x(t) = φ(t, ω)(x(0))
φ(t, ω)(A(ω)) = A(θ(t)ω)

where A(ω) is a random attractor with basin B.
Random dynamical systems
Random attractors with small measure
H(X | m) ≈ ln λ(A(ω)): the entropy of states scales with the (log) measure of the random attractor A
Kolmogorov forward equation
Free-energy formulation
H(X | m) = H[p(x | m)] ≤ lim_{T→∞} (1/T) ∫_0^T F(t) dt   (ergodic theorem: entropy is the long-run time average of surprise, which free energy bounds)

p(x | m) = exp(V(x))
V̇(x(t)) = ∇V(x)·f = R(x)
V(x(t)) = ∫_t R(x(τ)) dτ
f = ∇×W + Q∇V
Helmholtz decomposition
F(t) ≥ −V(x(t))
Value and reward
Free energy upper bounds expected cost
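The ergodic claim above can be checked in a toy case. The model below is an illustrative assumption, not from the talk: for an Ornstein-Uhlenbeck process dx = −x dt + √2 dW, the stationary density is N(0, 1), so the long-run time average of surprise −ln p(x(t) | m) should approach the entropy of that density, H = ½ ln(2πe) ≈ 1.419.

```python
# Ergodic-theorem sketch (toy assumption): time-averaged surprise of an
# Ornstein-Uhlenbeck process approaches the entropy of its stationary density.
import numpy as np

rng = np.random.default_rng(0)
dt, T = 1e-3, 200.0
n = int(T / dt)
noise = rng.standard_normal(n) * np.sqrt(2 * dt)   # Euler-Maruyama increments

x = 0.0
surprise = np.empty(n)
for i in range(n):
    x += -x * dt + noise[i]                        # dx = -x dt + sqrt(2) dW
    surprise[i] = 0.5 * x**2 + 0.5 * np.log(2 * np.pi)   # -ln N(x; 0, 1)

H = 0.5 * np.log(2 * np.pi * np.e)
print(surprise.mean(), H)    # time average should lie close to H ≈ 1.419
```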
ẋ = f(x, a) + ω
ẋ = f(x, u) + ω
a(t) = u(x(t))
Value and reward
f = ∇×W + Q∇V
Helmholtz decomposition
0 = max_u [ R(x, u) + ∇V(x)·f(x, u) ]
Optimal control theory
V(x(t)) = ∫_t R(x(τ)) dτ
V̇(x(t)) = ∇V(x)·f = R(x)
p(x | m) = exp(V(x))
Value and reward
V(x(t)) = E_t[ ∫_t R(x(τ)) dτ ]   (expected path integral of the reward R(x))
p(x | m) = exp(V(x))
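The Hamilton-Jacobi-Bellman relation from optimal control theory can be verified in a toy scalar case. The system and reward below are illustrative assumptions (f(x, u) = u and R(x, u) = −(qx² + ru²), in the reward-maximising convention): the ansatz V(x) = −√(qr) x² with policy u*(x) = −√(q/r) x solves 0 = max_u [R(x, u) + V′(x) f(x, u)] exactly.

```python
# Toy check of the HJB equation 0 = max_u [ R(x,u) + V'(x) f(x,u) ]
# for an assumed scalar linear system: f(x,u) = u, R(x,u) = -(q x^2 + r u^2).
import numpy as np

q, r = 2.0, 0.5
a = np.sqrt(q * r)                      # value ansatz V(x) = -a x^2

def V_prime(x):
    return -2 * a * x

def u_star(x):
    return -np.sqrt(q / r) * x          # maximiser of R + V' f (concave in u)

x = np.linspace(-3, 3, 7)
u = u_star(x)
hjb_residual = -(q * x**2 + r * u**2) + V_prime(x) * u
print(hjb_residual)                     # zero everywhere, up to float error
```

Setting the derivative with respect to u to zero gives u = V′(x)/(2r) = −√(q/r) x, and substituting back makes the residual vanish identically.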
Forward models in motor control
[Schematic: optimal control u = arg min_u ∫ c(x̂, u) dt computes motor commands in an intrinsic frame of reference and sends an efference copy to a forward model; state estimation x̂̇ = f(x̂, u) + K(s − g(x̂)) inverts the sensory mapping s = g(x) + ω (extrinsic frame of reference) to estimate hidden states x̂; plant kinetics ẋ = f(x, u); cost function c(x, u).]
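The state-estimation loop in this schematic can be sketched with a simple observer. The plant, sensory mapping and gain below are illustrative assumptions: the estimator x̂̇ = f(x̂, u) + K(s − g(x̂)) is driven by an efference copy of the motor command and converges on the hidden state.

```python
# Minimal sketch of the forward-model/state-estimation loop:
# plant x' = f(x,u), sensory mapping s = g(x), and an observer
# xh' = f(xh,u) + K*(s - g(xh)) driven by the efference copy of u.
# The dynamics and gain are illustrative assumptions.
import math

def f(x, u):
    return -x + u                   # plant kinetics

def g(x):
    return x                        # sensory mapping (noise-free here)

dt, K = 1e-3, 2.0
x, xh = 1.0, -1.0                   # estimator starts far from the plant
for i in range(int(10.0 / dt)):
    u = math.sin(i * dt)            # motor command (and efference copy)
    s = g(x)                        # sensation
    x  += dt * f(x, u)
    xh += dt * (f(xh, u) + K * (s - g(xh)))

print(abs(x - xh))                  # estimation error has decayed
```

Here the error e = x − x̂ obeys ė = −(1 + K)e, so the estimate tracks the plant regardless of the motor command.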
Predictive coding in motor control
[Schematic: as above, optimal control u = arg min_u ∫ c(x̂, u) dt with cost function c(x, u), sensory mapping s = g(x) and plant kinetics ẋ = f(x, u); here the forward model x̂̇ = f(x̂, u) + Kε sends top-down predictions and receives bottom-up prediction errors ε_e = s_e − g_e(x̂) (exteroceptive) and ε_p = s_p − g_p(x̂) (proprioceptive).]
Active inference
[Schematic: prior beliefs v(t) enter a forward model x̂̇ = f(x̂, v) + Kε, which issues descending proprioceptive predictions; a classical reflex suppresses proprioceptive prediction error, a = arg min_a ε_pᵀ ε_p, so that action a(t) (with corollary discharge) drives the plant kinetics ẋ = f(x, a); sensations s = g(x) return as bottom-up prediction error ε_p, and movements follow.]
[Schematic: a two-joint arm with joint angles x1, x2 and Jacobians J1, J2, hand position measured from the origin (0, 0); proprioceptive input s_x and visual input s_V = (v1, v2, v3).]
Action with point attractors (cf. the equilibrium-point hypothesis)
[Descending proprioceptive predictions specify a target v^(1); action performs a gradient descent on the prediction error ε_v^(1), ȧ = −∂_a s_vᵀ ε_v^(1), fulfilling the exteroceptive predictions as the arm moves.]
[Plots: 'action' and 'observation' panels showing movement trajectories in position (x) versus position (y).]
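The point-attractor scheme can be sketched in one dimension. Everything below is an illustrative assumption (a first-order plant, unit proprioceptive mapping, and a fixed gain): a prior belief v specifies the predicted posture, and action descends the proprioceptive prediction error until the plant settles at the prior, as in the equilibrium-point hypothesis.

```python
# Sketch of action with a point attractor: a prior belief v specifies a
# proprioceptive prediction; action a descends the prediction error
# eps_p = s_p - v, pulling the plant to the prior. Plant and gains are
# illustrative assumptions.
dt, kappa, v = 1e-3, 0.5, 1.2       # prior belief: "the limb is at 1.2"
x, a = 0.0, 0.0
for _ in range(int(30.0 / dt)):
    s_p = x                         # proprioceptive sensation
    eps_p = s_p - v                 # bottom-up prediction error
    x += dt * (a - x)               # plant kinetics x' = f(x, a)
    a += dt * (-kappa * eps_p)      # reflex: action suppresses the error
print(abs(x - v))                   # the limb settles at the predicted posture
```

Note that nothing here maximises a reward: the fixed point is specified by a prediction, and action merely cancels the error.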
Action with heteroclinic cycles
[Schematic: as above, but the descending proprioceptive predictions x^(1) follow a heteroclinic cycle, with ȧ = −∂_a s_vᵀ ε_v^(1), producing itinerant, sequential movement.]
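A heteroclinic cycle is exactly the sort of itinerant dynamics that no fixed-point (reward-gradient) flow can produce. As an illustrative stand-in for the predictions above, the May-Leonard Lotka-Volterra system (a standard textbook example, not from the talk) has an attracting heteroclinic cycle: the state visits each single-species saddle in turn instead of settling at a fixed point.

```python
# Itinerant dynamics: the May-Leonard system x_i' = x_i(1 - x_i - a x_j - b x_k)
# with a < 1 < b and a + b > 2 has an attracting heteroclinic cycle. The small
# floor on the state acts as a noise floor that keeps the cycle itinerant.
# Parameter values are an illustrative standard choice.
import numpy as np

a, b = 0.5, 1.6
dt, T = 2e-3, 300.0
x = np.array([0.5, 0.3, 0.2])
dominant = []                        # which species currently dominates
for _ in range(int(T / dt)):
    x1, x2, x3 = x
    growth = 1.0 - np.array([x1 + a * x2 + b * x3,
                             x2 + a * x3 + b * x1,
                             x3 + a * x1 + b * x2])
    x = np.clip(x + dt * x * growth, 1e-4, None)
    dominant.append(int(np.argmax(x)))

visited_late = set(dominant[len(dominant) // 2:])
print(sorted(set(dominant)), sorted(visited_late))   # dominance keeps rotating
```

Because the trajectory never stops at any saddle, no scalar reward function whose gradient the flow ascends could specify this behaviour, which is the point of the bullet on itinerant policies.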
The inherent tautology of reward: explaining behaviour in terms of maximising reward is like explaining the evolution of the eye by saying it maximises adaptive value (cf. intelligent design).