dopamine and prediction error
[Figure: dopamine neuron recordings (Schultz 1997) – panels: no prediction; prediction, reward; prediction, no reward]
The RL account: the dopamine response tracks the TD error $\delta(t) = r_t + V_{t+1} - V_t$, where $R$ is the reward and $V_t$ the value prediction.
Schultz 1997
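A minimal tabular sketch of this TD rule (assuming $\gamma = 1$ and an illustrative learning rate; all names here are ours, not from the slides):

```python
import numpy as np

def td_update(V, t, r, alpha=0.1):
    """One TD(0) update: delta(t) = r_t + V[t+1] - V[t]."""
    delta = r + V[t + 1] - V[t]   # prediction error
    V[t] += alpha * delta         # move the prediction toward its target
    return delta

# Toy Pavlovian trial: cue at step 0, reward delivered at step 4.
V = np.zeros(6)                   # value predictions over time steps in a trial
for _ in range(500):
    for t in range(5):
        td_update(V, t, 1.0 if t == 4 else 0.0)
# After learning, the error at the (now-predicted) reward shrinks toward zero,
# and omitting the reward yields a negative delta at t = 4 -- matching the
# predicted-reward and omitted-reward conditions in the Schultz recordings.
```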
humans are no different
• dorsomedial striatum / PFC – goal-directed control
• dorsolateral striatum – habitual control
• ventral striatum – Pavlovian control; value signals
• dopamine...
in humans…
[Task timeline: stimulus < 1 sec; response 0.5 sec; outcome “You won 40 cents”; 5 sec ISI; 2–5 sec ITI]
19 subjects (dropped 3 non-learners, N = 16); 3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single-stimulus; randomly ordered and counterbalanced
5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢
what would a prediction error look like (in BOLD)?
prediction errors in NAC
unbiased anatomical ROI in nucleus accumbens (marked per subject*)
* thanks to Laura deSouza
raw BOLD (avg. over all subjects)
can actually decide between different neuroeconomic models of risk
Polar Exploration
Peter Dayan
Nathaniel Daw, John O’Doherty, Ray Dolan
Exploration vs. exploitation
Classic dilemma in learned decision making
For unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained
![Page 8: 1 dopamine and prediction error no predictionprediction, rewardprediction, no reward TD error VtVt R RL Schultz 1997](https://reader035.vdocument.in/reader035/viewer/2022062313/56649ca75503460f9496a386/html5/thumbnails/8.jpg)
Exploration vs. exploitation
• Exploitation
– Choose the action expected to be best
– May never discover something better
• Exploration
– Choose an action expected to be worse
– If it is worse, go back to the original; if it turns out better, exploit it in the future
– Balanced by the long-term gain if it turns out better (even for risk- or ambiguity-averse subjects)
– NB: learning is non-trivial when outcomes are noisy or changing
[Figure: reward vs. time under each strategy]
Bayesian analysis (Gittins 1972)
• Tractable dynamic program in a restricted class of problems – the “n-armed bandit”
• Solution requires balancing
– Expected outcome values
– Uncertainty (need for exploration)
– Horizon/discounting (time to exploit)
• Optimal policy: explore systematically
– Choose the best sum of value plus bonus; the bonus increases with uncertainty (in symbols below)
• Intractable in the general setting
– Various heuristics used in practice
[Figure: value per action]
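Schematically, the policy the slide describes has the shape below (a sketch of the idea only; the exact Gittins-index computation is more involved):

$$a^* = \arg\max_a \big[\, \hat\mu_a + \mathrm{bonus}(\hat\sigma_a,\ \mathrm{horizon}) \,\big], \qquad \text{with bonus increasing in } \hat\sigma_a .$$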
Experiment
• How do humans handle the tradeoff?
• Computation: which strategies fit behavior?
– Several popular approximations
• Difference: what information influences exploration?
• Neural substrate: what systems are involved?
– PFC, high-level control
– Competitive decision systems (Daw et al. 2005)
– Neuromodulators
• dopamine (Kakade & Dayan 2002)
• norepinephrine (Usher et al. 1999)
Task design
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in the scanner.
• Trial onset: slots revealed
• +~430 ms: subject makes a choice; the chosen slot spins
• +~3000 ms: outcome – payoff revealed (“obtained 57 points”)
• +~1000 ms: screen cleared; trial ends
Payoff structure
• Noisy, to require integration of data – subjects learn about payoffs only by sampling them
• Nonstationary, to encourage ongoing exploration (Gaussian drift with decay; simulated in the sketch below)
[Figure: payoff over trials for each slot]
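A minimal simulation of such a payoff structure, assuming a decaying Gaussian random walk; the parameter values are illustrative, not those used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_slots, n_trials = 4, 300
lam, theta = 0.98, 50.0       # decay rate and long-run mean (illustrative)
sigma_d, sigma_o = 3.0, 4.0   # drift and payoff noise (illustrative)

mu = np.full(n_slots, theta)            # latent payoff means
means = np.zeros((n_trials, n_slots))
for t in range(n_trials):
    means[t] = mu
    # decaying Gaussian random walk: drift toward theta, plus diffusion noise
    mu = lam * mu + (1 - lam) * theta + rng.normal(0.0, sigma_d, n_slots)

def payoff(slot, t):
    """Noisy sample of a slot's latent mean -- all a subject ever observes."""
    return rng.normal(means[t, slot], sigma_o)
```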
Analysis strategy
• Behavior: fit an RL model to choices
– Find best-fitting parameters
– Compare different exploration models
• Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.)
– Use these as regressors for the fMRI signal (a generic sketch follows)
– After Sugrue et al.
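A generic sketch of how a model-derived quantity becomes an fMRI regressor (the study itself used SPM2; the HRF below is a standard double-gamma approximation, and all names are ours):

```python
import numpy as np
from math import factorial

def canonical_hrf(dt=0.1, length=30.0):
    """Double-gamma HRF approximation: peak near 5 s, undershoot near 15 s."""
    t = np.arange(0.0, length, dt)
    return t**5 * np.exp(-t) / factorial(5) - t**15 * np.exp(-t) / (6 * factorial(15))

def model_regressor(event_times, event_values, n_scans, tr=3.24, dt=0.1):
    """Impulses at event times, scaled by a model-estimated signal (e.g.
    trial-wise TD error), convolved with the HRF, sampled at scan times."""
    ts = np.zeros(int(np.ceil(n_scans * tr / dt)))
    for time, val in zip(event_times, event_values):
        ts[int(time / dt)] += val                    # scaled stick function
    conv = np.convolve(ts, canonical_hrf(dt))[:len(ts)]
    scan_idx = (np.arange(n_scans) * tr / dt).astype(int)
    return conv[scan_idx]
```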
Behavior
Behavior model
1. Estimate payoffs
– Kalman filter: error update (like TD), exact inference
– Yields per-slot means $\mu_{green}, \mu_{red}$, etc. and uncertainties $\sigma_{green}, \sigma_{red}$, etc.
[Figure: payoff observations (×) on trials t, t+1; the estimated mean and its uncertainty band are updated after each observation]
2. Derive choice probabilities
– $P_{green}, P_{red}$, etc.; choose randomly according to these

Kalman update for the chosen slot (e.g. red):
$\hat\mu_{red} \leftarrow \hat\mu_{red} + \kappa\,\delta$
$\hat\sigma^2_{red} \leftarrow (1-\kappa)\,\hat\sigma^2_{red}$, with gain $\kappa = \hat\sigma^2_{red}/(\hat\sigma^2_{red} + \sigma^2_o)$ and drift variance $\sigma^2_d$ added between trials
(cf. Behrens et al. – volatility)
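A runnable sketch of this estimation step, implementing the Kalman equations above with the same illustrative parameters as the payoff simulation (function and variable names are ours):

```python
import numpy as np

def kalman_bandit_update(mu, var, choice, obtained,
                         lam=0.98, theta=50.0, sigma_d=3.0, sigma_o=4.0):
    """One trial of the payoff tracker: error-driven (TD-like) update of the
    chosen slot, then drift of every slot's estimate between trials."""
    mu, var = mu.copy(), var.copy()
    kappa = var[choice] / (var[choice] + sigma_o**2)   # Kalman gain
    delta = obtained - mu[choice]                      # prediction error
    mu[choice] += kappa * delta
    var[choice] *= 1.0 - kappa                         # uncertainty shrinks
    # Between trials, estimates decay toward theta and uncertainty grows.
    mu = lam * mu + (1.0 - lam) * theta
    var = lam**2 * var + sigma_d**2
    return mu, var
```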
Behavior model
2. Derive choice probabilities from $\mu_{green}, \mu_{red}$, etc. and $\sigma_{green}, \sigma_{red}$, etc.
Compare rules – how is exploration directed? (from dumber to smarter; a code sketch follows)

• Randomly: “ε-greedy”
$P_{red} = \begin{cases} 1-\epsilon & \text{if } \mu_{red} = \max(\text{all } \mu) \\ \epsilon/3 & \text{otherwise} \end{cases}$
• By value: “softmax”
$P_{red} \propto \exp(\beta\,\mu_{red})$
• By value and uncertainty: “uncertainty bonuses”
$P_{red} \propto \exp(\beta\,[\mu_{red} + \varphi\,\sigma_{red}])$
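The three rules as code, assuming the μ and σ estimates from the Kalman sketch above; eps, beta, and phi are free parameters to be fit, and the defaults here are placeholders:

```python
import numpy as np

def choice_probs(mu, sigma, rule, eps=0.1, beta=0.3, phi=0.0):
    """Choice probabilities under the three rules on the slide.
    eps/(n-1) generalizes the slide's eps/3 (n = 4 slots)."""
    n = len(mu)
    if rule == "e-greedy":                 # explore uniformly at random
        p = np.full(n, eps / (n - 1))
        p[np.argmax(mu)] = 1.0 - eps
    elif rule == "softmax":                # explore in proportion to value
        z = beta * mu
        e = np.exp(z - z.max())
        p = e / e.sum()
    elif rule == "uncertainty-bonus":      # value plus phi * uncertainty
        z = beta * (mu + phi * sigma)
        e = np.exp(z - z.max())
        p = e / e.sum()
    else:
        raise ValueError(rule)
    return p
```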
Model comparison
• Assess models based on the likelihood of actual choices (sketched below)
– Product over subjects and trials of the modeled probability of each choice
– Find maximum-likelihood parameters
• Inference parameters, choice parameters
• Parameters yoked between subjects
– (… except choice noisiness, to model all heterogeneity)
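A single-subject sketch of the fit, reusing kalman_bandit_update and choice_probs from the sketches above; the actual analysis fits all subjects jointly with most parameters yoked, so this only illustrates the likelihood computation:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, payoffs):
    """-log of the product over trials of the modeled choice probabilities."""
    beta = params[0]
    mu = np.full(4, 50.0)     # prior mean (illustrative)
    var = np.full(4, 25.0)    # prior variance (illustrative)
    nll = 0.0
    for c, r in zip(choices, payoffs):
        p = choice_probs(mu, np.sqrt(var), "softmax", beta=beta)
        nll -= np.log(p[c] + 1e-12)            # likelihood of observed choice
        mu, var = kalman_bandit_update(mu, var, c, r)
    return nll

# e.g. fit = minimize(neg_log_likelihood, x0=[0.1],
#                     args=(choices, payoffs), method="Nelder-Mead")
```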
Behavioral results
• Strong evidence for exploration directed by value
• No evidence for direction by uncertainty
– Tried several variations

| model | ε-greedy | softmax | uncertainty bonuses |
| --- | --- | --- | --- |
| −log likelihood (smaller is better) | 4208.3 | 3972.1 | 3972.1 |
| # parameters | 19 | 19 | 20 |
Imaging methods
• 1.5 T Siemens Sonata scanner
• Sequence optimized for OFC (Deichmann et al. 2003)
• 2 × 385 volumes; 36 slices; 3 mm thickness
• TR = 3.24 s
• SPM2 random-effects model
• Regressors generated using the fit model and the trial-by-trial sequence of actual choices/payoffs
Imaging results
• TD error: dopamine targets (dorsal and ventral striatum)
• Replicates previous studies, but weakish
– Graded payoffs?
[Maps: vStr at x,y,z = 9,12,−9; dStr at x,y,z = 9,0,18; thresholds p < 0.01 and p < 0.001]
Value-related correlates
• Probability (or expected value) of the chosen action: vmPFC (x,y,z = −3,45,−18)
• Payoff amount: mOFC (x,y,z = 3,30,−21)
[Maps thresholded at p < 0.01 and p < 0.001; % signal change plotted against choice probability and payoff]
Exploration
• Non-greedy > greedy choices: exploration
• Frontopolar cortex, bilaterally (LFP, rFP; x,y,z = −27,48,4; 27,57,6)
• Survives whole-brain correction
[Maps thresholded at p < 0.01 and p < 0.001]
Timecourses: frontal pole, IPS
Checks
• Do other factors explain the differential BOLD activity better?
– Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more
– Only explore/exploit is significant
– (But 5 additional putative explore areas were eliminated)
• Individual subjects: BOLD differences are stronger for better behavioral fit
Frontal poles
• Imaging: high-level control
– Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
– Mediating cognitive processes (Ramnani & Owen 2004)
– Nothing this computationally specific
• Lesions: task switching (Burgess et al. 2000)
– More generic: perseveration
• “One of the least well understood regions of the human brain”
• No cortical connections outside PFC (“PFC for PFC”)
• Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)
Interpretation
• Cognitive decision to explore overrides habit circuitry? Via parietal?
– Higher FP response when exploration is chosen most against the odds
– Explore RTs are longer
• Exploration and exploitation are neurally distinct
– Computationally surprising, especially bad for uncertainty-bonus schemes: proper exploration requires computational integration, and there is no behavioral evidence either
• Why softmax? It can misexplore
– Deterministic bonus schemes are bad in adversarial/multi-agent settings
– Dynamic temperature control? (norepinephrine; Usher et al.; Doya)
Conclusions
• Subjects direct exploration by value but not by uncertainty
• Cortical regions differentially implicated in exploration
– Computational consequences
• Integrative approach: computation, behavior, imaging
– Quantitatively assess & constrain models using raw behavior
– Infer subjective states using the model; study their neural correlates
Open Issues
• Model-based vs. model-free vs. Pavlovian control
– Environmental priors vs. naive optimism vs. neophilic compulsion?
• Environmental priors and generalization
– Curiosity / ‘intrinsic motivation’ from expected future reward