Introduction to reinforcement learning
Pantelis P. Analytis
March 12, 2018
1 Introduction
2 Classical and operant conditioning
3 Modeling human learning
4 Ideas for semester projects
What’s reinforcement learning?
Classical conditioning
Conditioned stimulus (e.g. a sound), unconditioned stimulus (e.g. the taste of food), unconditioned response (unlearned behavior such as salivation).
Behaviorism in psychology
Psychology was under the grip of behaviorism from the 1920s to the 1960s.
Focus on expressed behavior rather than on psychological processes.
The Rescorla-Wagner model
$$\Delta V_X^{n+1} = \alpha_X \beta (\lambda - V_{tot})$$
$$V_X^{n+1} = V_X^{n} + \Delta V_X^{n+1}$$
$\Delta V_X$ is the change in the strength, on a single trial, of the association between the CS labelled "X" and the US
$\alpha_X$ is the salience of X (bounded by 0 and 1)
$\beta$ is the rate parameter for the US (bounded by 0 and 1), sometimes called its association value
$\lambda$ is the maximum conditioning possible for the US
$V_X$ is the current associative strength of X
$V_{tot}$ is the total associative strength of all stimuli present, that is, X plus any others
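The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: the function names, the dictionary layout, and the parameter values (salience 0.5, rate 0.5, 30 trials) are assumptions chosen for the example. Omitting the US is modelled by setting λ to 0.

```python
def rescorla_wagner_trial(V, present, alpha, beta, lam):
    """Apply one Rescorla-Wagner trial and return the updated strengths.

    V       -- dict mapping cue name to current associative strength
    present -- list of cues presented on this trial
    alpha   -- dict mapping cue name to its salience (between 0 and 1)
    beta    -- rate parameter for the US (between 0 and 1)
    lam     -- maximum conditioning; 0 when the US is omitted
    """
    v_tot = sum(V[c] for c in present)  # total strength of all cues present
    new_V = dict(V)
    for c in present:
        # Every present cue shares the same prediction error (lam - v_tot).
        new_V[c] = V[c] + alpha[c] * beta * (lam - v_tot)
    return new_V


def run(V, present, alpha, beta, lam, n_trials):
    """Repeat the same trial type n_trials times."""
    for _ in range(n_trials):
        V = rescorla_wagner_trial(V, present, alpha, beta, lam)
    return V


# Illustrative values (assumptions, not taken from the slides).
alpha = {"X": 0.5}
V_start = {"X": 0.0}
V_acq = run(V_start, ["X"], alpha, beta=0.5, lam=1.0, n_trials=30)  # acquisition
V_ext = run(V_acq, ["X"], alpha, beta=0.5, lam=0.0, n_trials=30)    # extinction
print(V_acq["X"], V_ext["X"])
```

With these values the strength of X climbs towards λ during acquisition and falls back towards 0 once the US is withheld.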
The Rescorla-Wagner model: predictions
The model captures acquisition and extinction of associations through a process of surprise. First model to incorporate several cues.
Importantly, the model captures interactions between cues. One cue may block the association of another with the US. Extinction might not occur if an inhibitor is present.
Over time the model converges to the optimal least-squares weights.
Examples: blocking, overshadowing and weakening of stimuli.
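Blocking follows directly from the shared error term. As a worked sketch, assume cue A has been trained alone until $V_A \approx \lambda$, and a new cue B with $V_B = 0$ is then presented together with A (the cue labels A and B are illustrative):

$$\Delta V_B = \alpha_B \beta (\lambda - V_{tot}) = \alpha_B \beta \bigl(\lambda - (V_A + V_B)\bigr) \approx \alpha_B \beta (\lambda - \lambda - 0) = 0$$

Because A already predicts the US, there is no surprise left on the compound trials, so B gains almost no associative strength whatever its salience.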
The first learning experiments
Thorndike studied the time that animals took to escape from his famous puzzle box.
Thorndike’s law of effect
Thorndike's law of effect: Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond (Thorndike, 1911, p. 244).
Learned helplessness
The organisms learn that it is impossible to escape, and even when the hindrance is removed they do not attempt to escape.
The first learning experiments
Operant conditioning can be described as a process that attempts to modify behavior through the use of positive and negative reinforcement. Through operant conditioning, an individual makes an association between a particular behavior and a consequence (Skinner, 1938).
Tolman’s cognitive maps
3 groups of rats, running in a maze for 17 days.
One group got a reward, the second got no reward, and the third got a reward starting on the 11th day.
Implicit learning
The group that was rewarded only from the 11th day onward improved rapidly and surpassed the performance of the group that was rewarded from the beginning.
There are two strategies to solve RL problems. An organism can memorize rewards, or it can construct a contingency map and plan its behavior ahead.
The Iowa gambling task (Bechara et al., 1997)
Participants are presented with 4 decks on the computer and are told that each deck will reward or penalize them. There are 100 trials in total, unbeknownst to the participants. The participants start with $2000 and are asked to maximize their returns.
The Iowa gambling task
Participants are presented with 4 decks on the computer and are told that each deck will reward or penalize them. Decks A and B bring higher immediate rewards but have negative expected value, while decks C and D have lower immediate rewards but positive expected value.
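The contrast between immediate reward and expected value can be made concrete with a small calculation. The payoff schedule below is hypothetical: the amounts are illustrative numbers chosen so that A and B pay more per card yet lose money on average, and they are not taken from the slides. The names `DECKS`, `BLOCK` and `expected_value_per_card` are likewise assumptions.

```python
# Hypothetical payoff schedule per block of 10 cards (illustrative only).
# Every card pays "gain"; the listed losses occur somewhere within the block.
DECKS = {
    "A": {"gain": 100, "losses": [150, 200, 250, 300, 350]},
    "B": {"gain": 100, "losses": [1250]},
    "C": {"gain": 50, "losses": [25, 50, 50, 50, 75]},
    "D": {"gain": 50, "losses": [250]},
}
BLOCK = 10  # number of cards per block


def expected_value_per_card(deck):
    """Average net payoff per card over one block."""
    total = deck["gain"] * BLOCK - sum(deck["losses"])
    return total / BLOCK


for name, deck in DECKS.items():
    print(name, "gain per card:", deck["gain"],
          "expected value per card:", expected_value_per_card(deck))
```

Under this assumed schedule a participant who tracks only the per-card gain prefers A and B, while one who tracks the long-run average prefers C and D.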
Modeling human learning: expectation
The delta rule is a popular model-free learning rule:
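$$E_j(t) = E_j(t-1) + \delta_j(t)\,\eta\,[R_j(t) - E_j(t-1)],$$
where $E_j(t)$ is the expected reward of alternative $j$ after trial $t$, $R_j(t)$ is the reward received from it, and $\delta_j(t)$ is an indicator variable, being 1 if alternative $j$ was chosen on trial $t$, and 0 otherwise. We opted for a simple fixed learning rate, $\eta \geq 0$.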
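A minimal sketch of the delta rule, assuming expectations are stored in a list indexed by alternative (the function name `delta_update` and the example values are illustrative):

```python
def delta_update(E, chosen, reward, eta):
    """Return the expectations after one trial of the delta rule.

    Only the chosen alternative moves, by a fraction eta of its
    prediction error; unchosen alternatives keep their old value.
    """
    return [
        e + eta * (reward - e) if j == chosen else e
        for j, e in enumerate(E)
    ]


# Alternative 0 is chosen three times and pays 1.0 each time.
E = [0.0, 0.0]
for r in [1.0, 1.0, 1.0]:
    E = delta_update(E, chosen=0, reward=r, eta=0.5)
print(E)  # [0.875, 0.0]
```

Each update halves the remaining gap to the observed reward (0.5, 0.75, 0.875), which is the same error-correcting logic as in the Rescorla-Wagner model.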
Modeling human learning: expectation
The decay rule is another popular model-free learning rule, according to which expected values of the unchosen alternatives decay towards 0 (e.g. Erev and Roth, 1998):
$$E_j(t) = \eta E_j(t-1) + \delta_j(t) R_j(t),$$
with decay parameter $0 \leq \eta \leq 1$.
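The decay rule can be sketched the same way (the function name `decay_update` and the example values are assumptions for illustration):

```python
def decay_update(E, chosen, reward, eta):
    """Return the expectations after one trial of the decay rule.

    Every expectation is multiplied by eta; the reward is then added
    to the chosen alternative only.
    """
    return [
        eta * e + (reward if j == chosen else 0.0)
        for j, e in enumerate(E)
    ]


E = [1.0, 1.0]
E = decay_update(E, chosen=0, reward=2.0, eta=0.5)
print(E)  # [2.5, 0.5]
E = decay_update(E, chosen=0, reward=2.0, eta=0.5)
print(E)  # [3.25, 0.25]
```

Unlike the delta rule, the chosen alternative accumulates reward rather than tracking its average, while the unchosen alternative shrinks towards 0 on every trial.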
Modeling human learning: choice rules
ε-greedy rule
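$$P(C(t) = j) = \begin{cases} (1 - \varepsilon)/K_{max} & \text{if } E_j(t) \geq E_k(t), \ \forall k \neq j \\ \varepsilon/(K - K_{max}) & \text{otherwise} \end{cases}$$
where $K$ is the number of arms and $K_{max}$ is the number of arms with the same maximum value.
Softmax
$$P(C(t) = j) = \frac{\exp(\theta E_j(t))}{\sum_{k=1}^{K} \exp(\theta E_k(t))}$$
where $\theta$ is a sensitivity parameter: larger values make choices more deterministic.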
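Both choice rules can be sketched as functions that turn a list of expectations into choice probabilities. The function names are illustrative. One case is handled by assumption: when all arms are tied, the ε-greedy formula divides by zero, so the sketch falls back to a uniform choice.

```python
import math
import random


def epsilon_greedy_probs(E, epsilon):
    """Choice probabilities under the epsilon-greedy rule."""
    best = max(E)
    k = len(E)
    k_max = sum(1 for e in E if e == best)  # arms sharing the maximum value
    if k_max == k:
        # Assumption: with all arms tied, choose uniformly at random.
        return [1.0 / k] * k
    return [
        (1 - epsilon) / k_max if e == best else epsilon / (k - k_max)
        for e in E
    ]


def softmax_probs(E, theta):
    """Choice probabilities under the softmax rule."""
    m = max(theta * e for e in E)  # subtract the max for numerical stability
    weights = [math.exp(theta * e - m) for e in E]
    total = sum(weights)
    return [w / total for w in weights]


def choose(probs, rng=random):
    """Sample the index of one arm according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


E = [1.0, 3.0, 3.0, 0.0]
print(epsilon_greedy_probs(E, epsilon=0.1))
print(softmax_probs(E, theta=1.0))
print(choose(softmax_probs(E, theta=1.0)))
```

ε-greedy ignores how much worse the non-maximal arms are, whereas softmax grades the probabilities by the size of the differences in expectation.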
The Iowa gambling task: behavioral results
Participants are presented with 4 decks on the computer and are told that each deck will reward or penalize them.
Decks A and B bring higher immediate rewards but have negative expected value, while decks C and D have lower immediate rewards but positive expected value.
The Iowa gambling task: simulating models
The models were fitted on human data using maximum likelihood estimation.
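As a rough sketch of what such a fit involves, the code below computes the negative log-likelihood of a choice sequence under the delta rule combined with softmax, and minimizes it by grid search. Everything here is an assumption made for illustration: the data are synthetic (generated from the model itself, not human data), the grid search stands in for whatever optimizer was actually used, and the choice on trial t is assumed to depend on the expectations formed up to trial t - 1.

```python
import math
import random


def neg_log_likelihood(choices, rewards, eta, theta, n_arms):
    """Negative log-likelihood of the choices under delta rule + softmax."""
    E = [0.0] * n_arms
    nll = 0.0
    for c, r in zip(choices, rewards):
        m = max(theta * e for e in E)
        log_z = m + math.log(sum(math.exp(theta * e - m) for e in E))
        nll -= theta * E[c] - log_z  # log-probability of the observed choice
        E[c] += eta * (r - E[c])     # delta rule update for the chosen arm
    return nll


def fit_grid(choices, rewards, n_arms, etas, thetas):
    """Return (nll, eta, theta) for the grid point with the lowest nll."""
    return min(
        (neg_log_likelihood(choices, rewards, eta, theta, n_arms), eta, theta)
        for eta in etas
        for theta in thetas
    )


# Synthetic two-armed task with hypothetical parameters.
rng = random.Random(0)
true_eta, true_theta = 0.3, 3.0
reward_prob = [0.2, 0.8]
E = [0.0, 0.0]
choices, rewards = [], []
for _ in range(500):
    weights = [math.exp(true_theta * e) for e in E]
    c = rng.choices([0, 1], weights=weights, k=1)[0]
    r = 1.0 if rng.random() < reward_prob[c] else 0.0
    choices.append(c)
    rewards.append(r)
    E[c] += true_eta * (r - E[c])

etas = [i / 10 for i in range(1, 10)]
thetas = [0.5 * i for i in range(13)]
nll, eta_hat, theta_hat = fit_grid(choices, rewards, 2, etas, thetas)
print("best eta:", eta_hat, "best theta:", theta_hat, "nll:", nll)
```

Setting θ = 0 gives random choice, so the fitted likelihood can be compared against that baseline to check that the model explains the choices better than chance.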
Prediction competitions
Replicating well-known findings
Studying widely used websites
Can you develop a model of likes and comments on Instagram or Twitter?
How does attention interact with liking on websites like Facebook?
Using big data from KDD competitions
KDD regularly organizes competitions. Data from past events are available online.
Dataset repositories