Dopamine, Uncertainty and TD Learning
CoSyNe’04
Yael Niv, Michael Duff, Peter Dayan
What does Dopamine encode?
• Important neuromodulator
  − Neurological/psychiatric disorders
  − Drug addiction/self-stimulation
• Fundamental role in RL
  − Classical/Pavlovian conditioning
  − Instrumental/operant conditioning
• DA neurons respond to:
  − Unexpected (appetitive) rewards
  − Stimuli predicting (appetitive) rewards
  − Withdrawal of expected rewards
  − Novel/salient stimuli
What does Dopamine encode?
DA represents some aspect of reward, but not rewards as such.
The TD Hypothesis of Dopamine
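$$V(t) = E\Big[\sum_{\tau > t} r(\tau)\Big]$$
$$V(t) = E\big[\, r(t+1) + V(t+1) \,\big]$$
$$\delta(t) = r(t+1) + \hat{V}(t+1) - \hat{V}(t)$$
$$\Delta \hat{V}(t) \propto \delta(t)$$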
DA encodes the reward prediction error
[Figure: DA firing and δ(t) at the times of stimulus and reward, three panels]
• Precise theory for the generation of DA firing patterns
• Compelling account for the role of DA in classical conditioning
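The TD account above can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: the function name `run_trial`, the 20 time steps, the learning rate of 0.5, the unit reward, and the deterministic reward schedule are all assumptions. It shows the prediction error at the time of reward shrinking with learning, while the value predicted at stimulus onset grows toward the reward size.

```python
def run_trial(values, reward, alpha):
    """Run one trial of online TD learning and return the TD errors.

    values[t] is the predicted future reward at step t after the stimulus;
    the reward (if any) arrives on the transition out of the last step.
    """
    n_steps = len(values)
    deltas = []
    for t in range(n_steps):
        next_value = values[t + 1] if t + 1 < n_steps else 0.0
        next_reward = reward if t == n_steps - 1 else 0.0
        # delta(t) = r(t+1) + V(t+1) - V(t)
        delta = next_reward + next_value - values[t]
        values[t] += alpha * delta  # value update proportional to the error
        deltas.append(delta)
    return deltas


values = [0.0] * 20  # assumed: 20 time steps between stimulus and reward
first = run_trial(values, reward=1.0, alpha=0.5)
for _ in range(500):
    last = run_trial(values, reward=1.0, alpha=0.5)

# Assumed: nothing is predicted before the stimulus, so the error at
# stimulus onset is simply the value of the first post-stimulus step.
stimulus_error = values[0] - 0.0
print(first[-1], last[-1], stimulus_error)
```

On the first trial the full error sits at the reward; after training it sits at the stimulus, which is the transfer the TD hypothesis attributes to DA firing.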
But: Fiorillo, Tobler & Schultz 2003
• Introduce inherent uncertainty into the classical conditioning paradigm
• Five visual stimuli indicating different reward probabilities: P=0,¼,½,¾,1
CS = 2 sec visual stimulus
US (probabilistic) = drops of juice
Fiorillo, Tobler & Schultz 2003
• At stimulus time: DA represents mean expected reward
• Interesting: a ramp in activity up to the time of reward (highest for p=½)
• Hypothesis: DA ramp encodes uncertainty in reward
Dopamine: Uncertainty or TD error?
• No apparent reason for ramp
• The ramp is predictable from the stimulus
• TD predicts away predictable quantities
-> contradiction!
• Side issue: the ramp is like a constantly surprising reward -- it can’t influence action choice
At time of reward:
• Prediction errors result from uncertainty
• Crucially: Positive and negative errors cancel out
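The cancellation can be checked with exact arithmetic. The sketch below assumes a unit reward and a learned prediction just before the reward that equals the reward probability p; the variable names are illustrative only.

```python
from fractions import Fraction

probabilities = [Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)]
mean_errors = {}
for p in probabilities:
    error_if_rewarded = 1 - p  # reward of 1 arrives, prediction was p
    error_if_omitted = 0 - p   # no reward arrives, prediction was p
    # Weight each error by how often it occurs.
    mean_errors[p] = p * error_if_rewarded + (1 - p) * error_if_omitted
    print(p, error_if_rewarded, error_if_omitted, mean_errors[p])
```

With symmetric coding of positive and negative errors, the mean error at the time of reward is zero for every probability.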
A closer look at FTS’s results:
[Figure panels: p = 0.5 and p = 0.75]
• TD error δ(t) can be positive or negative
• Neuronal firing rate is only positive (negative values are coded relative to base firing rate)
But:
• DA base firing rate is low
-> asymmetric encoding of δ(t)
A closer look at FTS’s results:
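[Figure: δ(t) against DA firing; labels 55% and 270%]
[Figure: model schematic with inputs x(1), x(2), …, values V(1) … V(20), reward r(t), and error δ(t)]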
• Tapped delay line
• Standard online TD learning
• Fixed learning rate
• Negative δ(t) scaled by d=1/6 prior to PSTH
Modeling TD with asymmetric errors
Learning proceeds normally (without scaling)
− Necessary to produce the right predictions
− Can be biologically plausible
TD learning with asymmetric prediction errors replicates the recorded data accurately.
Ramps result from asymmetrically coded prediction errors propagating back to stimulus
Artifact of summing PSTHs over nonstationary recent reward histories
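A minimal simulation of this model, as a sketch under stated assumptions rather than the presenters' implementation: the function name `simulate_psth`, the learning rate of 0.1, the unit reward, the trial counts, and the random seed are all hypothetical. What follows the slides is the tapped delay line with 20 values, online TD learning with a fixed learning rate, learning from the unscaled error, and scaling of negative errors by d = 1/6 only when averaging across trials.

```python
import random


def simulate_psth(p, alpha=0.1, d=1/6, n_steps=20,
                  burn_in=2000, n_trials=5000, seed=0):
    """Average asymmetrically coded TD errors over trials, per time step."""
    rng = random.Random(seed)
    values = [0.0] * n_steps
    coded_sum = [0.0] * n_steps
    for trial in range(burn_in + n_trials):
        reward = 1.0 if rng.random() < p else 0.0  # probabilistic reward
        for t in range(n_steps):
            next_value = values[t + 1] if t + 1 < n_steps else 0.0
            next_reward = reward if t == n_steps - 1 else 0.0
            delta = next_reward + next_value - values[t]
            values[t] += alpha * delta  # learning uses the unscaled error
            if trial >= burn_in:
                # Only the recorded signal is asymmetric: negative errors
                # are scaled by d before entering the average.
                coded_sum[t] += delta if delta >= 0 else d * delta
    return [s / n_trials for s in coded_sum]


psth = simulate_psth(p=0.5)
print([round(x, 3) for x in psth])
```

Because the learning rate is fixed, the values keep fluctuating with the recent reward history, so individual trials carry positive and negative errors that no longer cancel once the negative ones are scaled down. The averaged trace is positive throughout and largest at the time of reward.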
Modeling TD with asymmetric errors
DA: Uncertainty or Temporal Difference?
Analytically deriving the maximum error at the time of the reward, we get:
$$\langle \delta(T) \rangle = p\,(1-p)\,(1-d)$$
=> the ramp is indeed highest for p=½
But:
• DA encodes nothing but temporal difference error!
• Experimental test: is the ramp a within-trial or a between-trial phenomenon?
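The expression for the maximum error at the time of the reward, T, can be recovered in a few lines. The derivation below is a sketch that assumes a unit reward, a learned prediction just before the reward equal to p, and negative errors scaled by d as in the model.

```latex
\begin{align*}
\delta(T) &=
  \begin{cases}
    1 - p & \text{with probability } p \quad \text{(reward delivered)} \\
    -p    & \text{with probability } 1 - p \quad \text{(reward omitted)}
  \end{cases} \\
\langle \delta(T) \rangle
  &= p\,(1 - p) + (1 - p)\,d\,(-p)
   = p\,(1 - p)\,(1 - d) \\
\frac{\partial}{\partial p} \langle \delta(T) \rangle
  &= (1 - 2p)\,(1 - d) = 0
  \;\Rightarrow\; p = \tfrac{1}{2},
  \qquad
  \langle \delta(T) \rangle_{\max} = \frac{1 - d}{4}
\end{align*}
```

With d = 1/6 the maximum is 5/24, and with symmetric coding (d = 1) the mean error vanishes for every p.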
Trace conditioning: A puzzle and its resolution
• Same (if not more) uncertainty, but… no DA ramping! (Fiorillo et al.; Morris, Arkadir, Nevet, Vaadia & Bergman)
• Resolution: lower learning rate in trace conditioning eliminates ramp
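The effect of the learning rate can be explored with the same kind of simulation. This is a sketch only: the function name `mean_coded_error_before_reward`, the two learning rates, the trial counts, and the seed are assumptions, and it does not model the trace period itself. It compares the averaged, asymmetrically coded error over the steps preceding the reward for a higher and a lower fixed learning rate.

```python
import random


def mean_coded_error_before_reward(alpha, p=0.5, d=1/6, n_steps=20,
                                   burn_in=5000, n_trials=20000, seed=1):
    """Mean asymmetrically coded TD error over the steps before the reward."""
    rng = random.Random(seed)
    values = [0.0] * n_steps
    total = 0.0
    for trial in range(burn_in + n_trials):
        reward = 1.0 if rng.random() < p else 0.0
        for t in range(n_steps):
            next_value = values[t + 1] if t + 1 < n_steps else 0.0
            next_reward = reward if t == n_steps - 1 else 0.0
            delta = next_reward + next_value - values[t]
            values[t] += alpha * delta  # learning uses the unscaled error
            # Skip the final step: its error is driven by reward delivery.
            if trial >= burn_in and t < n_steps - 1:
                total += delta if delta >= 0 else d * delta
    return total / (n_trials * (n_steps - 1))


fast = mean_coded_error_before_reward(alpha=0.1)
slow = mean_coded_error_before_reward(alpha=0.01)
print(fast, slow)
```

A lower learning rate makes the values fluctuate less from trial to trial, so less asymmetric error propagates back from the reward and the averaged pre-reward signal shrinks.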
CS = short visual stimulus
Trace period
US (probabilistic) = drops of juice
Conclusions
Preserve the TD hypothesis of Dopamine:
− No explicit coding of uncertainty
− Ramping explained by neural constraints
− Explains the disappearance of the ramp in trace conditioning
Important challenges to the TD hypothesis
− Conditioned inhibition
− Effects of timing