Decision making: how does the brain learn the values?
Post on 19-Dec-2015
The computational problem

The value of the state S1 depends on the policy. If the animal chooses ‘right’ at S1,
V(S_1) = r_{ice cream} + V(S_2)
How to find the optimal policy in a complicated world?
• If the values of the different states are known, then this task is easy:
V(S_t) = r_t + V(S_{t+1})
How can the values of the different states be learned?
V(S_t) = r_t + V(S_{t+1})
where
V(S_t) = the value of the state at time t
r_t = the (average) reward delivered at time t
V(S_{t+1}) = the value of the state at time t+1
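Since each value is defined in terms of the next state's value, the values can be computed by a single backward pass from the end of the trial. A minimal sketch in Python (the reward sequence is illustrative):

```python
# V(S_t) = r_t + V(S_{t+1}): with the value after the last state fixed to 0,
# a single backward pass yields every state's value.
def values_from_rewards(rewards):
    V = [0.0] * (len(rewards) + 1)  # V[-1] is the state after the trial ends
    for t in reversed(range(len(rewards))):
        V[t] = rewards[t] + V[t + 1]
    return V

# Illustrative trial: a single reward of size 1 delivered at the fourth step.
print(values_from_rewards([0, 0, 0, 1, 0]))  # → [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

Every state visited before the reward inherits the value of the upcoming reward.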
The TD (temporal difference) learning algorithm

V(S_t) ← V(S_t) + α·δ_t
where
δ_t = r_t + V(S_{t+1}) − V(S_t)
is the TD error.
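A minimal sketch of a single TD(0) update in Python; the learning rate (`alpha`, here 0.1) and the two-state example are illustrative:

```python
# One step of TD learning: compute the TD error and nudge V(S_t) toward
# r_t + V(S_{t+1}).
def td_update(V, s, s_next, r, alpha=0.1):
    delta = r + V[s_next] - V[s]  # TD error: delta_t = r_t + V(S_{t+1}) - V(S_t)
    V[s] += alpha * delta         # V(S_t) <- V(S_t) + alpha * delta_t
    return delta

V = {"S1": 0.0, "S2": 0.0}
delta = td_update(V, "S1", "S2", r=1.0)  # reward of size 1 arrives in S1
print(delta, V["S1"])  # → 1.0 0.1
```

The value moves only a fraction α of the way toward its target, which is why the worked trials below build up the values gradually.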
12
Dopamine is good
• Dopamine is released by rewarding experiences, e.g., sex, food
• Cocaine, nicotine and amphetamine directly or indirectly lead to an increase of dopamine release
• Neutral stimuli that are associated with rewarding experiences result in a release of dopamine
• Drugs that reduce dopamine activity reduce motivation and cause anhedonia (inability to experience pleasure)
• Long-term use of such drugs may result in dyskinesia (diminished voluntary movements and the presence of involuntary movements)
No dopamine is bad (Parkinson’s disease)
• Bradykinesia – slowness in voluntary movement such as standing up, walking, and sitting down. This may lead to difficulty initiating walking, and when more severe can cause “freezing episodes” once walking has begun.
• Tremors – often occur in the hands, fingers, forearms, feet, mouth, or chin. Typically, tremors take place when the limbs are at rest as opposed to when there is movement.
• Rigidity – otherwise known as stiff muscles; often produces muscle pain that is increased during movement.
• Poor balance – happens because of the loss of reflexes that help maintain posture. This causes unsteady balance, which often leads to falls.
[Figure: a trial timeline over states 1–9, with the CS followed by a reward delivered in state 8]

Before trial 1:
V(S_1) = V(S_2) = … = V(S_9) = 0

In trial 1:
• no reward in states 1–7:
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
V(S_t) ← V(S_t) + α·δ_t (unchanged)
• reward of size 1 in state 8:
δ_t = r_t + V(S_9) − V(S_8) = 1
V(S_8) ← V(S_8) + α·δ_t = α
Before trial 2:
V(S_1) = V(S_2) = … = V(S_7) = V(S_9) = 0, V(S_8) = α

In trial 2, for states 1–6:
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
V(S_t) ← V(S_t) + α·δ_t (unchanged)

For state 7:
δ_t = r_t + V(S_8) − V(S_7) = α
V(S_7) ← V(S_7) + α·δ_t = α²
For state 8:
δ_t = r_t + V(S_9) − V(S_8) = 1 − α
V(S_8) ← V(S_8) + α·δ_t = α + α(1 − α) = 2α − α²
Before trial 3:
V(S_1) = V(S_2) = … = V(S_6) = V(S_9) = 0, V(S_7) = α², V(S_8) = 2α − α²

In trial 3, for states 1–5:
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
V(S_t) ← V(S_t) + α·δ_t (unchanged)

For state 6:
δ_t = r_t + V(S_7) − V(S_6) = α²
V(S_6) ← V(S_6) + α·δ_t = α³
For state 7:
δ_t = r_t + V(S_8) − V(S_7) = (2α − α²) − α² = 2α − 2α²
V(S_7) ← V(S_7) + α·δ_t = α² + α(2α − 2α²) = 3α² − 2α³

For state 8:
δ_t = r_t + V(S_9) − V(S_8) = 1 − (2α − α²) = (1 − α)²
V(S_8) ← V(S_8) + α·δ_t = (2α − α²) + α(1 − α)² = 3α − 3α² + α³ = 1 − (1 − α)³
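The trial-by-trial arithmetic above can be checked with a short simulation. A sketch, assuming the states are visited in order within each trial and updated with the TD rule (0-indexed, so V[6] is V(S_7) and V[7] is V(S_8)):

```python
# TD(0) on the 9-state chain from the slides: a reward of size 1 is
# delivered in state 8; V[9] is the value after the trial ends (always 0).
def run_trials(n_trials, alpha):
    V = [0.0] * 10  # V[0..8] hold V(S_1)..V(S_9)
    for _ in range(n_trials):
        for s in range(9):
            r = 1.0 if s == 7 else 0.0   # reward delivered in state 8
            delta = r + V[s + 1] - V[s]  # TD error
            V[s] += alpha * delta
    return V

a = 0.1
V = run_trials(3, alpha=a)
print(abs(V[6] - (3*a**2 - 2*a**3)) < 1e-12)     # V(S_7) = 3α² − 2α³ → True
print(abs(V[7] - (3*a - 3*a**2 + a**3)) < 1e-12)  # V(S_8) = 3α − 3α² + α³ → True
```

Because the states are visited in ascending order, each update uses the successor's value from the previous trial, exactly as in the hand derivation.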
After many trials:
V(S_1) = … = V(S_8) = 1, V(S_9) = 0
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
except at the CS, whose time of occurrence is unknown.
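Running the same updates for many trials shows the stated limit; a sketch with an illustrative learning rate of 0.1 (0-indexed: V[0..8] hold V(S_1)..V(S_9)):

```python
# After many TD(0) trials, V(S_1)..V(S_8) approach 1 and V(S_9) stays 0,
# so the TD error vanishes everywhere along the chain.
V = [0.0] * 10  # V[9] is the value after the trial ends
alpha = 0.1
for _ in range(2000):
    for s in range(9):
        r = 1.0 if s == 7 else 0.0  # reward of size 1 in state 8
        V[s] += alpha * (r + V[s + 1] - V[s])
print([round(v, 3) for v in V])  # → [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

Each earlier state takes one more trial to start learning, but all of them eventually inherit the full value of the reward.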
Bayer and Glimcher, 2005
“We found that these neurons encoded the difference between the current reward and a weighted average of previous rewards, a reward prediction error, but only for outcomes that were better than expected”.