![Page 1: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/1.jpg)
Human and Optimal Exploration and Exploitation in Bandit Problems
Department of Cognitive Sciences, University of California.
A Bayesian analysis of human decision-making on bandit problems: Journal of Mathematical Psychology 53 (2009) 168-179.
Presenter: Juan Wang
Date: 18/5/2015
![Page 2: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/2.jpg)
Bandit problem
• A decision-maker must choose one of several alternatives on each trial in a short sequence of trials (for example, different treatments).
• Each alternative has a fixed reward rate, but the decision-maker is not told what the rates are (for example, the success rate after accepting a treatment).
• The resulting dilemma between exploration and exploitation is evident in many real-world decision-making situations, e.g. as shown in the figure below.
![Page 3: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/3.jpg)
Which alternative should be chosen on the 11th trial?
The first alternative shows more failures and fewer successes, but its reward rate is moderate and better known than that of the second. Choosing the second alternative, however, explores the possibility that it may be the more rewarding one. This is the dilemma between exploration and exploitation.
Acquiring knowledge about each alternative is exploration, and using that knowledge to make a choice is exploitation.
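One way to make the dilemma concrete is to summarize what is known about each alternative with a Beta posterior. The sketch below is illustrative only: the success and failure counts are hypothetical (they are not read from the figure), and the uniform Beta(1, 1) prior is an assumption.

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return (mean, variance) of the Beta posterior over a reward rate.

    prior_a and prior_b are an assumed uniform prior, not taken from the slides.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# Hypothetical counts after 10 trials: one alternative tried 8 times, the other twice.
well_known = beta_posterior(successes=4, failures=4)
little_known = beta_posterior(successes=1, failures=1)

print("well-known alternative   (mean, variance):", well_known)
print("little-known alternative (mean, variance):", little_known)
```

With these counts both alternatives have the same estimated reward rate, but the estimate for the little-known alternative is far less certain, so choosing it buys information (exploration) while choosing the other relies on what is already known (exploitation).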
![Page 4: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/4.jpg)
Therefore, decision-makers need good ways to learn about the alternatives, deciding when to explore and when to exploit, while simultaneously attaining more rewards.
![Page 5: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/5.jpg)
Background
Human performance on bandit problems has been a topic of interest in a variety of fields, such as economics and cognitive neuroscience.
Most studies have focused on problems with a large number of trials (long-horizon bandit problems), which say little about whether people switch flexibly between exploration and exploitation when only a small number of trials is available (short-horizon bandit problems).
![Page 6: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/6.jpg)
Objective
To determine whether people switch flexibly between exploration and exploitation in short-horizon bandit problems, and to better understand how they switch in one situation of special interest: a well-understood but only moderately rewarding alternative compared with a less well-understood but possibly more rewarding alternative.
![Page 7: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/7.jpg)
In this paper, the authors developed and evaluated a probabilistic model that assumes different latent states guide decision-making in short-horizon bandit problems (a search/exploration state and a stand/exploitation state).
![Page 8: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/8.jpg)
Assumption of three different situations
![Page 9: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/9.jpg)
The Probabilistic Model
![Page 10: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/10.jpg)
Experiment
• Conditions: six types of bandit problem, formed by combining two trial sizes (8 trials and 16 trials) with three environmental distributions (Beta distributions whose two parameters correspond to prior successes and prior failures).
• 50 problems for each condition (300 problems in total)
• Data: collected from 10 naïve participants (6 males, 4 females)
• All problems within the conditions were randomized for each participant at each trial.
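The design above can be sketched as a small problem generator. This is a minimal illustration, not the authors' procedure: the Beta parameters, the environment names, and the number of alternatives per problem are all assumptions, since the slides do not state them.

```python
import random

TRIAL_SIZES = (8, 16)            # from the slides
PROBLEMS_PER_CONDITION = 50      # from the slides
# Hypothetical (prior successes, prior failures) pairs; the slides do not give the values.
ENVIRONMENTS = {"env_a": (4, 2), "env_b": (2, 2), "env_c": (2, 4)}
N_ALTERNATIVES = 4               # assumption; not stated in the slides


def make_problems(seed=0):
    """Draw a reward rate for every alternative of every problem in all six conditions."""
    rng = random.Random(seed)
    problems = []
    for n_trials in TRIAL_SIZES:
        for env_name, (a, b) in ENVIRONMENTS.items():
            for _ in range(PROBLEMS_PER_CONDITION):
                rates = [rng.betavariate(a, b) for _ in range(N_ALTERNATIVES)]
                problems.append(
                    {"n_trials": n_trials, "environment": env_name, "rates": rates}
                )
    return problems


problems = make_problems()
print(len(problems), "problems generated")
```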
![Page 11: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/11.jpg)
Optimal Performance and Model analysis
1) Calculated optimal decision-making behavior for all of the problems completed by the 10 participants, using a recursive approach from the reinforcement learning literature (e.g., Kaelbling et al., 1996).
I did not fully understand this recursive approach, which is described in Kaelbling's paper. In any case, it finds the optimal decision-making process for a bandit problem once the environmental distribution and the trial size are given.
2) Applied the graphical model in Figure 2 to all of the optimal and human decision data (training data), for all six bandit problem conditions. For each data set, estimated the parameters from 1000 posterior samples.
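The recursive approach in step 1 is, in general terms, a backward recursion over the remaining trials. The sketch below shows one standard way to write such a recursion for a Bernoulli bandit with a Beta prior. It is an illustration under stated assumptions, not the exact procedure used in the paper: the function names are hypothetical and the prior of one success and one failure is assumed.

```python
from functools import lru_cache

PRIOR_SUCCESSES = 1  # assumed environment parameter
PRIOR_FAILURES = 1   # assumed environment parameter


def posterior_mean(successes, failures):
    """Expected reward rate of an alternative given its observed counts."""
    return (PRIOR_SUCCESSES + successes) / (
        PRIOR_SUCCESSES + PRIOR_FAILURES + successes + failures
    )


@lru_cache(maxsize=None)
def q_values(counts, remaining):
    """Expected total reward of choosing each alternative now, then acting optimally.

    counts is a tuple of (successes, failures) pairs, one per alternative;
    remaining is the number of trials left (at least 1).
    """
    qs = []
    for arm, (s, f) in enumerate(counts):
        p = posterior_mean(s, f)
        win = counts[:arm] + ((s + 1, f),) + counts[arm + 1:]
        lose = counts[:arm] + ((s, f + 1),) + counts[arm + 1:]
        qs.append(
            p * (1.0 + state_value(win, remaining - 1))
            + (1.0 - p) * state_value(lose, remaining - 1)
        )
    return tuple(qs)


def state_value(counts, remaining):
    """Expected total reward under optimal play from this state."""
    if remaining == 0:
        return 0.0
    return max(q_values(counts, remaining))


start = ((0, 0), (0, 0))  # two alternatives, nothing observed yet
print("expected reward over 8 trials:", state_value(start, 8))
```

The optimal decision on any trial is the alternative with the largest entry in `q_values(counts, remaining)`; ties mean that several choices are equally good.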
![Page 12: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/12.jpg)
Test whether the latent state model fits the observed data reasonably well
• Compared its predicted decisions at its best-fitting parameterization (estimator) to all of the human and optimal decision-making data.
• Calculated the proportion of agreement between the two.
The model generally fits well, only slightly less well for participant AH.
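The proportion of agreement is simply the fraction of trials on which the model's predicted decision matches the observed decision. A minimal sketch, with a hypothetical function name and made-up example decisions:

```python
def proportion_agreement(predicted, observed):
    """Fraction of trials where the predicted decision equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one decision is required")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Made-up decisions (index of the chosen alternative on each trial).
example = proportion_agreement([0, 1, 1, 0], [0, 1, 0, 0])
print(example)
```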
![Page 13: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/13.jpg)
Check the descriptive adequacy of the latent state model
The parameter Zi inferred in the model is an indicator variable that determines whether the i-th trial uses the search state or the stand state.
Descriptive adequacy is shown in the figure on the next slide.
The posterior probability that the i-th trial uses the stand state is approximated by the posterior mean of the Zi indicator variable.
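Given posterior samples of the indicators, this approximation amounts to averaging each Zi across samples. The sketch below assumes the samples are coded 1 for the stand state and 0 for the search state; that coding, the function name, and the example samples are all hypothetical.

```python
def stand_probability(z_samples):
    """Per-trial posterior probability of the stand state.

    z_samples is a list of posterior samples; each sample is a list with one
    0/1 indicator per trial (assumed coding: 1 = stand, 0 = search).
    """
    n_samples = len(z_samples)
    n_trials = len(z_samples[0])
    return [
        sum(sample[i] for sample in z_samples) / n_samples
        for i in range(n_trials)
    ]


# Four made-up posterior samples for a three-trial problem.
probs = stand_probability([[0, 0, 1], [0, 1, 1], [0, 1, 1], [0, 0, 1]])
print(probs)
```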
![Page 14: Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human](https://reader035.vdocument.in/reader035/viewer/2022062809/5697c0111a28abf838ccb7be/html5/thumbnails/14.jpg)