Download - Human and Optimal Exploration and Exploitation in Bandit Problems Department of Cognitive Sciences, University of California. A Bayesian analysis of human

Human and Optimal Exploration and Exploitation in Bandit Problems

Department of Cognitive Sciences, University of California.

A Bayesian analysis of human decision-making on bandit problems: Journal of Mathematical Psychology 53 (2009) 168-179.

Presenter: Juan Wang

Date: 18/5/2015

Bandit problem

• A decision-maker must choose one of several alternatives on each trial of a short sequence of trials (for example, different treatments).

• Each alternative has a fixed reward rate, but the decision-maker is not told what the rates are (for example, the success rate of a given treatment).

• The dilemma between exploration and exploitation arises in many real-world decision-making situations, e.g. as shown in the figure below.

Which alternative should be chosen on the 11th trial?

The first alternative has produced more failures and fewer successes, but its reward rate is moderate and better known than that of the second. Choosing the second alternative, however, explores the possibility that it may be the more rewarding one. This is the dilemma between exploration and exploitation.

Acquiring knowledge about each alternative is exploration, and using that knowledge to choose an option is exploitation.

Therefore, decision-makers need good ways to learn about the alternatives, judging when exploration is required and when exploitation is required, while still attaining as many rewards as possible.
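To make the dilemma concrete, here is a minimal sketch of how a Bayesian observer could summarize what is known about each alternative. The counts are hypothetical (the figure's values are not reproduced here), and the uniform Beta(1, 1) prior and the name `beta_posterior_summary` are assumptions for illustration only.

```python
def beta_posterior_summary(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of a reward rate under a Beta prior."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# Hypothetical counts after 10 trials: 8 on one alternative, 2 on the other.
well_known = beta_posterior_summary(successes=4, failures=4)
less_known = beta_posterior_summary(successes=1, failures=1)

print("well-known alternative: mean=%.3f variance=%.4f" % well_known)
print("less-known alternative: mean=%.3f variance=%.4f" % less_known)
```

With these hypothetical counts both alternatives have the same posterior mean, but the less-sampled one has the larger variance, so it could still turn out to be much better (or much worse).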

Background

Human performance on bandit problems has been a topic of interest in a variety of fields, such as economics and cognitive neuroscience.

Most studies have focused on problems with a large number of trials (long-horizon bandit problems). They say less about whether people switch flexibly between exploration and exploitation when only a small number of trials is available (short-horizon bandit problems).

Objective

To determine whether people switch flexibly between exploration and exploitation in short-horizon bandit problems, and to better understand how they switch in a situation of special interest: a well-understood but only moderately rewarding alternative compared with a less well-understood but possibly more rewarding alternative.

In this paper, the authors developed and evaluated a probabilistic model that assumes different latent states guide decision-making in short-horizon bandit problems (a search/exploration state and a stand/exploitation state).

Assumption of three different situations

The Probabilistic Model

Experiment

• Conditions: six types of bandit problem, combining two horizon lengths (8 trials and 16 trials) with three environmental distributions (Beta distributions whose two parameters correspond to prior successes and prior failures).

• 50 problems for each condition (300 problems in total).

• Data: collected from 10 naïve participants (6 males, 4 females).

• All problems within the conditions were randomized for each participant.
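The design above can be sketched as a small generator of bandit problems. The slides do not give the Beta parameters of the three environments or the number of alternatives per problem, so the values in `ENVIRONMENTS` and the default `n_alternatives=4` are placeholders, and all function names are hypothetical.

```python
import random

# Placeholder (alpha, beta) pairs: "prior successes" and "prior failures".
# These are NOT taken from the slides; they only illustrate the structure.
ENVIRONMENTS = {"env_a": (4.0, 2.0), "env_b": (1.0, 1.0), "env_c": (2.0, 4.0)}
HORIZONS = (8, 16)


def make_problem(alpha, beta, n_alternatives, rng):
    """Draw one fixed reward rate per alternative from Beta(alpha, beta)."""
    return [rng.betavariate(alpha, beta) for _ in range(n_alternatives)]


def make_design(n_problems=50, n_alternatives=4, seed=0):
    """Build every problem for the 3 environments x 2 horizons design."""
    rng = random.Random(seed)
    design = []
    for env_name, (alpha, beta) in ENVIRONMENTS.items():
        for horizon in HORIZONS:
            for _ in range(n_problems):
                design.append({
                    "environment": env_name,
                    "horizon": horizon,
                    "rates": make_problem(alpha, beta, n_alternatives, rng),
                })
    return design


design = make_design()
print(len(design), "problems in total")
```

Each participant would then see these problems in an individually randomized order, for example by shuffling a copy of the list per participant.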

Optimal Performance and Model analysis

1) Calculate optimal decision-making behavior for all of the problems completed by the 10 participants, using a recursive approach from the reinforcement learning literature (e.g., Kaelbling et al., 1996).

I did not fully understand this recursive approach, which is discussed in Kaelbling's paper. In any case, it finds the optimal decision-making process for a bandit problem once the environmental distribution and the number of trials are given.

2) Apply the graphical model in Figure 2 to all of the optimal and human decision data (training data), for all six bandit problem conditions. For each data set, parameters are estimated from 1000 posterior samples.
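One standard way to read "recursive approach" is backward induction (dynamic programming) over the counts of successes and failures observed so far. The sketch below follows that reading; it is not taken from the paper, and the function name, the state encoding, and the default uniform prior are assumptions.

```python
from functools import lru_cache


def optimal_value_and_choice(counts, trials_left, alpha=1.0, beta=1.0):
    """Expected future reward and best alternative under optimal play.

    counts: tuple of (successes, failures) pairs, one per alternative.
    alpha, beta: parameters of the Beta environment (assumed known).
    """

    @lru_cache(maxsize=None)
    def value(state, remaining):
        if remaining == 0:
            return 0.0, None
        best_value, best_arm = -1.0, None
        for arm, (s, f) in enumerate(state):
            # Posterior mean reward rate of this alternative.
            p = (alpha + s) / (alpha + beta + s + f)
            win = state[:arm] + ((s + 1, f),) + state[arm + 1:]
            lose = state[:arm] + ((s, f + 1),) + state[arm + 1:]
            # Immediate expected reward plus the value of playing on optimally.
            v = (p * (1.0 + value(win, remaining - 1)[0])
                 + (1.0 - p) * value(lose, remaining - 1)[0])
            if v > best_value + 1e-12:  # ties go to the lower index
                best_value, best_arm = v, arm
        return best_value, best_arm

    return value(tuple(counts), trials_left)


# Two untried alternatives, two trials left, uniform prior.
print(optimal_value_and_choice(((0, 0), (0, 0)), 2))
```

The recursion terminates because each call reduces the number of remaining trials by one; memoization keeps it tractable for horizons as short as 8 or 16 trials with a handful of alternatives.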

Test how well the latent state model fits the observed data

• Compared its predicted decisions at its best-fitting parameterization (estimator) to all of the human and optimal decision-making data.

• Calculated the proportion of agreement between the two.

The model generally fits well, and only slightly less well for participant AH.
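The proportion of agreement is simply the fraction of trials on which the model's predicted decision matches the observed decision. A minimal sketch, with hypothetical choice sequences and a hypothetical function name:

```python
def proportion_of_agreement(predicted, observed):
    """Fraction of trials on which predicted and observed choices match."""
    if len(predicted) != len(observed):
        raise ValueError("choice sequences must have the same length")
    if not predicted:
        raise ValueError("choice sequences must not be empty")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(predicted)


# Hypothetical example: indices of the chosen alternative on four trials.
model_choices = [0, 1, 1, 0]
human_choices = [0, 1, 0, 0]
print(proportion_of_agreement(model_choices, human_choices))
```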

Check the descriptive adequacy of the latent state model

The Zi parameter inferred in the model is an indicator variable that determines whether the i-th trial is in the search state or the stand state.

Descriptive adequacy is shown in the figure on the next slide.

The posterior probability that the i-th trial uses the stand state is approximated by the posterior of the Zi indicator variables.
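In practice, this approximation amounts to averaging the sampled indicator values for each trial. The sketch below assumes the samples are stored as one list of 0/1 values per posterior sample, with 1 coding the stand state; that coding, the toy samples, and the function name are assumptions.

```python
def stand_probability(z_samples):
    """Per-trial mean of the sampled indicators (1 = stand, 0 = search)."""
    if not z_samples:
        raise ValueError("need at least one posterior sample")
    n_samples = len(z_samples)
    n_trials = len(z_samples[0])
    return [sum(sample[i] for sample in z_samples) / n_samples
            for i in range(n_trials)]


# Four hypothetical posterior samples for a three-trial problem.
samples = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]
print(stand_probability(samples))
```

With the 1000 posterior samples mentioned above, the same average would be taken over 1000 rows instead of four.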
