
Meta-Learning Effective Exploration Strategies for Contextual Bandits

Amr Sharaf,¹ Hal Daumé III¹,²

¹ University of Maryland  ² Microsoft Research

sharaf@umd.edu, me@hal3.name

Abstract

In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, MELEE, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. MELEE uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate MELEE on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. MELEE outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.

1 Introduction

In a contextual bandit problem, an agent attempts to optimize its behavior over a sequence of rounds based on limited feedback (Kaelbling 1994; Auer 2003; Langford and Zhang 2008). In each round, the agent chooses an action based on a context (features) for that round, and observes a reward for that action but no others (§2). The feedback is partial: the agent only observes the reward for the action it selected, and for no other actions. This is strictly harder than supervised learning, in which the agent observes the reward for all available actions. Contextual bandit problems arise in many real-world settings like learning to rank for information retrieval, online recommendations, and personalized medicine. As in reinforcement learning, the agent must learn to balance exploitation (taking actions that, based on past experience, it believes will lead to high instantaneous reward) and exploration (trying new actions). However, contextual bandit learning is easier than reinforcement learning: the agent needs to only take one action, not a sequence of actions, and therefore does not face a credit assignment problem.

In this paper, we present a meta-learning approach to automatically learn a good exploration strategy from data. To achieve this, we use synthetic supervised learning datasets on which we can simulate contextual bandit tasks in an offline setting. Based on these simulations, our algorithm, MELEE (MEta LEarner for Exploration), learns a good heuristic exploration strategy that generalizes to future contextual bandit problems. MELEE contrasts with more classical approaches to exploration (like ε-greedy or LinUCB), in which exploration strategies are constructed by expert algorithm designers. These approaches often achieve provably good exploration strategies in the worst case, but are potentially overly pessimistic and are sometimes computationally intractable.

MELEE is an example of meta-learning in which we replace a hand-crafted learning algorithm with a learned learning algorithm. At training time (§3), MELEE simulates many contextual bandit problems from fully labeled synthetic data. Using this data, in each round, MELEE is able to counterfactually simulate what would happen under all possible action choices. We can then use this information to compute regret estimates for each action, which can be optimized using the AggreVaTe imitation learning algorithm (Ross and Bagnell 2014).

Our imitation learning strategy mirrors the meta-learning approach of Bachman, Sordoni, and Trischler (2017) in the active learning setting. We present a simplified, stylized analysis of the behavior of MELEE to ensure that our cost function encourages good behavior (§4), and show that MELEE enjoys the no-regret guarantees of the AggreVaTe imitation learning algorithm.

Empirically, we use MELEE to train an exploration policy on only synthetic datasets and evaluate this policy on both a contextual bandit task based on a natural learning to rank dataset as well as three hundred simulated contextual bandit tasks (§5). We compare the trained policy to a number of alternative exploration algorithms, and show that MELEE outperforms alternative exploration strategies in most settings.

2 Preliminaries: Contextual Bandits and Policy Optimization

Contextual bandits is a model of interaction in which an agent chooses actions (based on contexts) and receives immediate rewards for that action alone. For example, in a simplified news personalization setting, at each time step t, a user arrives and the system must choose a news article to display to them. Each possible news article corresponds to an action a, and the user corresponds to a context x_t. After the system chooses an article a_t to display, it can observe, for instance, the amount of time that the user spends reading that article, which it can use as a reward r_t(a_t).

Formally, we largely follow the setup and notation of Agarwal et al. (2014). Let X be an input space of contexts (users) and [K] = {1, ..., K} be a finite action space (articles). We consider the statistical setting in which there exists a fixed but unknown distribution D over pairs (x, r) ∈ X × [0, 1]^K, where r is a vector of rewards (for convenience, we assume all rewards are bounded in [0, 1]). In this setting, the world operates iteratively over rounds t = 1, 2, .... Each round t:

1. The world draws (x_t, r_t) ∼ D and reveals context x_t.
2. The agent (randomly) chooses action a_t ∈ [K] based on x_t, and observes reward r_t(a_t).

The goal of an algorithm is to maximize the cumulative sum of rewards over time. Typically the primary quantity considered is the average regret of a sequence of actions a_1, ..., a_T to the behavior of the best possible function in a prespecified class F:

$$\mathrm{Reg}(a_1, \ldots, a_T) = \max_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \Big[ r_t(f(x_t)) - r_t(a_t) \Big] \qquad (1)$$

A no-regret agent has a zero average regret in the limit of large T.

To produce a good agent for interacting with the world, we assume access to a function class F and to an oracle policy optimizer for that function class, POLOPT. For example, F may be a set of single-layer neural networks mapping user features x ∈ X to predicted rewards for actions a ∈ [K]. Formally, the observable record of interaction resulting from round t is the tuple (x_t, a_t, r_t(a_t), p_t(a_t)) ∈ X × [K] × [0, 1] × [0, 1], where p_t(a_t) is the probability that the agent chose action a_t, and the full history of interaction is h_t = ⟨(x_i, a_i, r_i(a_i), p_i(a_i))⟩_{i=1}^t. The oracle policy optimizer, POLOPT, takes as input a history of user interactions and outputs an f ∈ F with low expected regret.

An example of the oracle policy optimizer POLOPT is to combine inverse propensity scaling (IPS) with a regression algorithm (Horvitz and Thompson 1952). Here, given a history h, each tuple (x, a, r, p) in that history is mapped to a multiple-output regression example. The input for this regression example is the same x; the output is a vector of K costs, all of which are zero except the a-th component, which takes value r/p. This mapping is done for all tuples in the history, and a supervised learning algorithm on the function class F is used to produce a low-regret regression function f. This is the function returned by the oracle policy optimizer POLOPT. IPS has the property of being unbiased; however, it often suffers from large variance. The direct method (DM) (Dudik et al. 2011) is another kind of oracle policy optimizer POLOPT that has lower variance than IPS. The direct method estimates the reward function directly from the history h without importance sampling, and uses this estimate to learn a low-regret function f. In our experiments, we use the direct method, largely for its low variance and simplicity. However, MELEE is agnostic to the type of estimator used by the oracle policy optimizer POLOPT.
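As a rough illustration of how such an oracle might be implemented, the sketch below builds both an IPS-based and a direct-method POLOPT from a history of (x, a, r, p) tuples; the per-action SGDRegressor reduction and the function signatures are our own assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a POLOPT oracle built from bandit history tuples
# (x, a, r, p). IPS fills a K-dimensional target vector with r/p at the chosen
# action (unbiased but high variance); the direct method (DM) regresses the
# observed rewards per action. One SGDRegressor per action is our own choice.
import numpy as np
from sklearn.linear_model import SGDRegressor

def polopt_ips(history, K):
    """history: list of (x, a, r, p) tuples; returns f(x) -> predicted rewards."""
    if not history:
        return lambda x: np.zeros(K)               # default predictor for an empty history
    X = np.array([x for x, _, _, _ in history])
    Y = np.zeros((len(history), K))
    for i, (_, a, r, p) in enumerate(history):
        Y[i, a] = r / p                            # inverse propensity scaled reward
    regs = [SGDRegressor().fit(X, Y[:, k]) for k in range(K)]
    return lambda x: np.array([reg.predict(x.reshape(1, -1))[0] for reg in regs])

def polopt_dm(history, K):
    """Direct method: fit one regressor per action on the rewards actually observed."""
    if not history:
        return lambda x: np.zeros(K)
    regs = []
    for k in range(K):
        Xk = np.array([x for x, a, r, _ in history if a == k])
        yk = np.array([r for _, a, r, _ in history if a == k])
        regs.append(SGDRegressor().fit(Xk, yk) if len(yk) > 0 else None)
    return lambda x: np.array([reg.predict(x.reshape(1, -1))[0] if reg is not None else 0.0
                               for reg in regs])
```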

3 Approach: Learning an Effective Exploration Strategy

In order to have an effective approach to the contextual bandit problem, one must be able to both optimize a policy based on historic data and make decisions about how to explore. The exploration/exploitation dilemma is fundamentally about long-term payoffs: is it worth trying something potentially suboptimal now in order to learn how to behave better in the future? A particularly simple and effective form of exploration is ε-greedy: given a function f output by POLOPT, act according to f(x) with probability (1 − ε) and act uniformly at random with probability ε.
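For concreteness, a minimal ε-greedy rule over a reward predictor f can be written as follows (our own sketch, not code from the paper):

```python
# Minimal sketch of epsilon-greedy exploration around a reward predictor f.
import numpy as np

def epsilon_greedy(f, x, K, epsilon, rng=np.random.default_rng()):
    """With probability epsilon act uniformly at random, otherwise act greedily on f."""
    if rng.random() < epsilon:
        return int(rng.integers(K))      # explore: uniform over the K actions
    return int(np.argmax(f(x)))          # exploit: action with highest predicted reward
```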

Intuitively, one would hope to improve on a strategy like ε-greedy by taking more (any!) information into account; for instance, basing the probability of exploration on f's uncertainty. In this section, we describe MELEE, first by showing how it operates in a Markov Decision Process (§3), and then by showing how to train it using synthetic simulated contextual bandit problems based on imitation learning (§3).

Markov Decision Process Formulation

We model the exploration/exploitation task as a Markov Decision Process (MDP). Given a context vector x and a function f output by POLOPT on rounds 1 ... t−1, the agent learns an exploration policy π to decide whether to exploit by acting according to f(x), or explore by choosing a different action a ≠ f(x). We model this as an MDP, represented as a tuple ⟨A, S, s_0, T, R⟩, where A = {a} is the set of all actions, S = {s} is the space of all possible states, s_0 is a starting state (or initial distribution), R(s, a) is the reward function, and T(s′|s, a) is the transition function. We describe these below.

States  A state s in our MDP represents all past information seen in the world (there are a very large number of states). At time t, the state s_t includes: all past experiences (x_i, a_i, r_i(a_i))_{i=1}^{t−1}, the current context x_t, and the current function f_t computed by running POLOPT on the past experiences.

For the purposes of learning, this state space is far too large to be practical, so we model each state using a set of exploration features. The exploration policy π is trained based on exploration features Φ (Alg 1, line 12). These features are allowed to depend on the current classifier f_t, and on any part of the history except the inputs x_t, in order to maintain task independence. We additionally ensure that its features are independent of the dimensionality of the inputs, so that π can generalize to datasets of arbitrary dimensions. The specific features we use are listed below; these are largely inspired by Konyushkova, Sznitman, and Fua (2017) but adapted to our setting.

Importantly, we wish to train π using one set of tasks (for which we have fully supervised data on which to run simulations) and apply it to wholly different tasks (for which we only have bandit feedback). To achieve this, we allow π to depend representationally on f_t in arbitrary ways: for instance, it might use features that capture f_t's uncertainty on the current example. We additionally allow π to depend in a task-independent manner on the history (for instance, which actions have not yet been tried): it can use features of the actions, rewards and probabilities in the history but not depend directly on the contexts x. This is to ensure that π only learns to explore and not also to solve the underlying task-dependent classification problem. Because π needs to learn to be task independent, we found that if f_t's predictions were uncalibrated, it was very difficult for π to generalize well to unseen tasks. Therefore, we additionally allow π to depend on a very small amount of fully labeled data from the task at hand, which we use to allow π to calibrate f_t's predictions. In our experiments we use only 30 fully labeled examples, but alternative approaches to calibrating f_t that do not require this data would be preferable. The features of f_t that we use are: a) predicted probability p(a_t|f_t, x_t) (we use a softmax over the predicted rewards from f_t to convert them to probabilities); b) entropy of the predicted probability distribution; c) a one-hot encoding for the predicted action f_t(x_t).

The features of h_{t−1} that we use are: a) the current time step t; b) normalized counts for all previous actions predicted so far; c) the average observed reward for each action; d) the empirical variance of the observed rewards for each action in the history.
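A possible rendering of the feature extractor Φ described above is sketched below; the exact normalizations, feature ordering, and interface are our own choices rather than the authors' implementation.

```python
# Sketch of the exploration-feature extractor Phi: task-independent features of
# the classifier f_t and of the history (actions, rewards, probabilities only).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def phi(f_t, x_t, history, K, t):
    """history: list of (x, a, r, p) seen so far; returns a fixed-length feature vector."""
    probs = softmax(f_t(x_t))                          # a) predicted action probabilities
    entropy = -np.sum(probs * np.log(probs + 1e-12))   # b) entropy of that distribution
    one_hot = np.eye(K)[int(np.argmax(probs))]         # c) one-hot of the greedy action
    actions = np.array([a for _, a, _, _ in history], dtype=int)
    rewards = np.array([r for _, _, r, _ in history])
    counts = np.bincount(actions, minlength=K) / max(len(actions), 1)  # normalized counts
    avg_r = np.array([rewards[actions == k].mean() if (actions == k).any() else 0.0
                      for k in range(K)])              # average observed reward per action
    var_r = np.array([rewards[actions == k].var() if (actions == k).any() else 0.0
                      for k in range(K)])              # empirical variance per action
    return np.concatenate([probs, [entropy], one_hot, [t], counts, avg_r, var_r])
```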

We use Platt's scaling (Platt 1999; Lin, Lin, and Weng 2007) to calibrate the predicted probabilities. Platt's scaling works by fitting a logistic regression model to the classifier's predicted scores.
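A minimal sketch of this calibration step, assuming a small labeled set of (score, correctness) pairs is available, is:

```python
# Sketch of Platt's scaling: fit a logistic regression on raw classifier scores
# using a small labeled calibration set (about 30 examples in our experiments).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """scores: (n,) raw scores; labels: (n,) binary correctness indicators."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores).reshape(-1, 1), np.asarray(labels))
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]
```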

Initial State  The initial state distribution is formed by drawing a new contextual bandit task at random, setting the history to the empty set, and initializing the first context x_1 as the first example in that contextual bandit task.

Actions  At each state, our learned exploration policy π must take an input state s_t (described above) and make a decision. Its action space A is the same action space as that of the contextual bandit problem it is trying to solve. If π chooses to take the same action as f, then we interpret this as an "exploitation" step, and if it takes another action, we interpret this as an "exploration" step.

Transitions  Each episode starts off with a new contextual bandit task and an empty history h_0 = {}. The subsequent steps in the episode involve observing context vectors x_1, ..., x_T from the new contextual bandit task. A single transition in the episode consists of the exploration policy π being given the state s containing information about the current context vector x_t and the history h_{t−1}, using which the exploration policy π chooses the next action a. The transition function T(s′|s, a) incorporates the action a chosen by the exploration policy in state s along with the features representing the current state s, and produces the next state s′ that represents a new feature vector x_{t+1}. The episode terminates whenever all the context vectors in the contextual bandit task have been exhausted. During the test phase, each contextual bandit task is handled only once, in a single episode.

More formally, let π be the exploration policy we are learning, which takes two inputs: a function f ∈ F and a context x, and outputs an action. In our example, f will be the output of the policy optimizer on all historic data, and x will be the current user. This is used to produce an agent which interacts with the world, maintaining an initially empty history buffer h, as:

1. The world draws (x_t, r_t) ∼ D and reveals context x_t.
2. The agent computes f_t ← POLOPT(h_t) and a greedy action ã_t = π(f_t, x_t).
3. The agent plays a_t = ã_t with probability (1 − µ), and a_t uniformly at random otherwise.
4. The agent observes r_t(a_t).
5. The agent appends (x_t, a_t, r_t(a_t), p_t) to the history h_t, where p_t = µ/K if a_t ≠ ã_t, and p_t = 1 − µ + µ/K if a_t = ã_t.

Here, f_t is the function optimized on the historical data, and π uses it and x_t to choose an action. Intuitively, π might choose to use the prediction f_t(x_t) most of the time, unless f_t is quite uncertain on this example, in which case π might choose to return the second (or third) most likely action according to f_t. The agent then performs a small amount of additional µ-greedy-style exploration: most of the time it acts according to π but occasionally it explores some more. In practice (§5), we find that setting µ = 0 is optimal in aggregate, but non-zero µ is necessary for our theory (§4).
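The five steps above can be sketched as a single loop; the interfaces for POLOPT and π below are our own assumptions, and POLOPT is assumed to cope with an initially empty history.

```python
# Sketch of the test-time loop: pi proposes a greedy action from (f_t, x_t) and a
# small amount of extra mu-greedy exploration is layered on top.
import numpy as np

def run_agent(stream, pi, polopt, K, mu, rng=np.random.default_rng()):
    """stream yields (x_t, r_t) pairs; only r_t[a_t] is revealed to the learner."""
    history, total_reward = [], 0.0
    for x_t, r_t in stream:
        f_t = polopt(history, K)                  # step 2: optimize on historic data
        a_greedy = pi(f_t, x_t)                   # step 2: greedy action from pi
        a_t = int(rng.integers(K)) if rng.random() < mu else a_greedy   # step 3
        p_t = 1 - mu + mu / K if a_t == a_greedy else mu / K            # step 5
        total_reward += r_t[a_t]                  # step 4: observe chosen reward only
        history.append((x_t, a_t, r_t[a_t], p_t))
    return total_reward / max(len(history), 1), history
```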

Rewards  The reward function is chosen so that reward maximization by the learned policy is equivalent to low regret in the contextual bandit problem. Formally, at state s_t, let r_t(·) be the reward function for the contextual bandit task at that state. The reward function is: R(s, a) = r_t(a).

Training MELEE by Imitation Learning

The meta-learning challenge is: how do we learn a good exploration policy π? We assume we have access to fully labeled data on which we can train π; this data must include context/reward pairs, where the reward for all actions is known. This is a weak assumption: in practice, we use purely synthetic data as this training data; one could alternatively use any fully labeled classification dataset as in (Beygelzimer and Langford 2009). Under this assumption about the data, and with our model of behavior as an MDP, a natural class of learning algorithms to consider are imitation learning algorithms (Daume, Langford, and Marcu 2009; Ross, Gordon, and Bagnell 2011; Ross and Bagnell 2014; Chang et al. 2015). In other work on meta-learning, such problems are often cast as full reinforcement-learning problems. We opt for imitation learning instead because it is computationally attractive and effective when a simulator exists.

Informally, at training time, MELEE will treat one of these synthetic datasets as if it were a contextual bandit dataset. At each time step t, it will compute f_t by running POLOPT on the historical data, and then consider, for each action, what the long-term reward would look like if it were to take that action. Because the training data for MELEE is fully labeled, this can be evaluated for each possible action, and a policy π can be learned to maximize these rewards.


Algorithm 1 MELEE (supervised training sets {S_m}, hypothesis class F, exploration rate µ, number of validation examples N_Val, feature extractor Φ)

1: initialize meta-dataset D = {}
2: for episode n = 1, 2, ..., N do
3:   choose S at random from {S_m}, and set history h_0 = {}
4:   partition and permute S randomly into train Tr and validation Val, where |Val| = N_Val
5:   for round t = 1, 2, ..., |Tr| do
6:     let (x_t, r_t) = Tr_t and s_t = ⟨(x_i, a_i, r_i(a_i))_{i=1}^{t−1}, x_t, f_t⟩
7:     for each action a = 1, ..., K do
8:       f_{t,a} = POLOPT(F, h_{t−1} ⊕ (x_t, a, r_t(a), 1 − ((K−1)/K) µ)) on the augmented history
9:       roll-out: estimate Q*(s_t, a), the cost-to-go of a, using r_t(a) and a roll-out policy π^out on f_{t,a}
10:    end for
11:    compute f_t = POLOPT(F, h_{t−1})
12:    D ← D ⊕ (Φ(s_t), ⟨Q*(s_t, 1), ..., Q*(s_t, K)⟩)
13:    roll-in: a_t ∼ (µ/K) 1_K + (1 − µ) π_{n−1}(f_t, x_t) with probability p_t, where 1_K is the ones vector
14:    append history h_t ← h_{t−1} ⊕ (x_t, a_t, r_t(a_t), p_t)
15:  end for
16:  update π_n = LEARN(D)
17: end for
18: return best policy in {π_n}_{n=1}^N
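To make the pseudocode concrete, here is a compressed Python rendering of Algorithm 1. It is a sketch under our own interface assumptions (polopt, phi, and rollout_value are placeholders, and π also receives the history so it can compute its exploration features); it omits the train/validation split used for calibration and returns the final policy rather than the best one over episodes.

```python
# Compressed sketch of the MELEE training loop (Algorithm 1): simulate bandit
# episodes on fully labeled sets, estimate the cost-to-go of every action, and
# train pi by cost-sensitive regression on the aggregated meta-dataset D.
import numpy as np
from sklearn.linear_model import SGDRegressor

def train_melee(train_sets, polopt, phi, rollout_value, K, mu, n_episodes,
                rng=np.random.default_rng()):
    D_feats, D_costs = [], []
    regs = None                                            # regressors behind pi_{n-1}

    def pi(f_t, x_t, history, t):
        if regs is None:                                   # pi_0: act greedily on f_t
            return int(np.argmax(f_t(x_t)))
        feats = phi(f_t, x_t, history, K, t).reshape(1, -1)
        return int(np.argmin([r.predict(feats)[0] for r in regs]))

    for _ in range(n_episodes):                            # line 2
        S = train_sets[rng.integers(len(train_sets))]      # line 3: random training set
        history = []
        for t, (x_t, r_t) in enumerate(S):                 # line 5
            f_t = polopt(history, K)                       # line 11
            costs = []
            for a in range(K):                             # lines 7-10: try every action
                aug = history + [(x_t, a, r_t[a], 1 - (K - 1) * mu / K)]
                f_ta = polopt(aug, K)                      # line 8: augmented history
                costs.append(rollout_value(f_ta, r_t, a))  # line 9: cost-to-go estimate
            D_feats.append(phi(f_t, x_t, history, K, t))   # line 12: new meta-example
            D_costs.append(costs)
            a_star = pi(f_t, x_t, history, t)              # line 13: roll-in action
            a_t = int(rng.integers(K)) if rng.random() < mu else a_star
            p_t = 1 - mu + mu / K if a_t == a_star else mu / K
            history.append((x_t, a_t, r_t[a_t], p_t))      # line 14
        X, C = np.array(D_feats), np.array(D_costs)        # line 16: LEARN by regression
        regs = [SGDRegressor().fit(X, C[:, a]) for a in range(K)]
    return pi
```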

More formally, in imitation learning, we assume training-time access to an expert, π*, whose behavior we wish to learn to imitate at test-time. Because we can train on fully supervised training sets, we can easily define an optimal reference policy π*, which "cheats" at training time by looking at the true labels: in particular, π* can always pick the correct action (i.e., the action that maximizes future rewards) at any given state. The learning problem is then to estimate π to have behavior as similar to π* as possible, but without access to those labels.

Suppose we wish to learn an exploration policy π for a contextual bandit problem with K actions. We assume access to M supervised learning datasets S_1, ..., S_M, where each S_m = {(x_1, r_1), ..., (x_{N_m}, r_{N_m})} is of size N_m, where each x_n is from a (possibly different) input space X_m and the reward vectors are all in [0, 1]^K. In particular, multi-class classification problems are modeled by setting the reward for the correct label to be one and the reward for all other labels to be zero.
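Under this construction, a multi-class dataset yields one-hot reward vectors, as in the following small sketch:

```python
# Sketch: convert a multi-class classification dataset into contextual-bandit
# training data by giving reward 1 to the correct label and 0 to all others.
import numpy as np

def classification_to_bandit(X, y, K):
    """X: (n, d) features; y: (n,) integer labels in {0, ..., K-1}."""
    rewards = np.zeros((len(y), K))
    rewards[np.arange(len(y)), y] = 1.0
    return list(zip(X, rewards))        # list of (x_n, r_n) pairs with full reward vectors
```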

The imitation learning algorithm we use is AggreVaTe (Ross and Bagnell 2014) (closely related to DAgger (Ross, Gordon, and Bagnell 2011)), and is instantiated for the contextual bandits meta-learning problem in Alg 1. AggreVaTe learns to choose actions to minimize the cost-to-go of the expert rather than the zero-one classification loss of mimicking its actions. On the first iteration, AggreVaTe collects data by observing the expert perform the task, and in each trajectory, at time t, explores an action a in state s, and observes the cost-to-go Q*_t(s, a) of the expert after performing this action, defined as:

$$Q^*_t(s, a) = r_t(a) + \mathbb{E}_{s' \sim T(\cdot \mid s, a),\, a' \sim \pi^*(\cdot \mid s')}\big[Q^*_{t+1}(s', a')\big] \qquad (2)$$

where the expectation is taken over the randomness of the policy π* and the MDP.

Each such step generates a cost-weighted training example (s, t, a, Q*), and AggreVaTe trains a policy π_1 to minimize the expected cost-to-go on this dataset. At each following iteration n, AggreVaTe collects data through interaction with the learner as follows: for each trajectory, begin by using the current learner's policy π_n to perform the task, interrupt at time t, explore a roll-in action a in the current state s, after which control is provided back to the expert to continue up to the time horizon T. This results in new examples of the cost-to-go (roll-out value) of the expert, (s, t, a, Q*), under the distribution of states visited by the current policy π_n. This new data is aggregated with all previous data to train the next policy π_{n+1}; more generally, this data can be used by a no-regret online learner to update the policy and obtain π_{n+1}. This is iterated for some number of iterations N and the best policy found is returned.

Following the AggreVaTe template, MELEE operates in an iterative fashion, starting with an arbitrary π and improving it through interaction with an expert. Over N episodes, MELEE selects random training sets and simulates the test-time behavior on that training set. The core functionality is to generate a number of states (f_t, x_t) on which to train π, and to use the supervised data to estimate the value of every action from those states. MELEE achieves this by sampling a random supervised training set and setting aside some validation data from it (line 4). It then simulates a contextual bandit problem on this training data; at each time step t, it tries all actions and "pretends" like they were appended to the current history (line 8), on which it trains a new policy and evaluates its roll-out value (line 9). This yields, for each t, a new training example for π, which is added to π's training set (line 12); the features for this example are features of the classifier based on the true history (line 11) (and possibly statistics of the history itself), with a label that gives, for each action, the corresponding cost-to-go of that action (the Q*'s computed in line 9). MELEE then must commit to a roll-in action to actually take; it chooses this according to a roll-in policy (line 13). MELEE has no explicit "exploitation policy": exploitation happens when π chooses the same action as f_t, while exploration happens when it chooses a different action. In learning to explore, MELEE simultaneously learns when to exploit.

Roll-in actions. The distribution over states visited by MELEE depends on the actions taken, and in general it is good to have that distribution match what is seen at test time. This distribution is determined by a roll-in policy (line 13), controlled in MELEE by the exploration parameter µ ∈ [0, 1]. As µ → 1, the roll-in policy approaches a uniform random policy; as µ → 0, the roll-in policy becomes deterministic. When the roll-in policy does not explore, it acts according to π(f_t, ·).

Roll-out values. The ideal value to assign to an action (from the perspective of the imitation learning procedure) is the total reward (or advantage) that would be achieved in the long run if we took this action and then behaved according to our final learned policy. Unfortunately, during training, we do not yet know the final learned policy. Thus, a surrogate roll-out policy π^out is used instead. A convenient, and often computationally efficient, alternative is to evaluate the value assuming all future actions were taken by the expert (Langford and Zadrozny 2005; Daume, Langford, and Marcu 2009; Ross and Bagnell 2014). In our setting, at any time step t, the expert has access to the fully supervised reward vector r_t for the context x_t. When estimating the roll-out value for an action a, the expert returns the true reward value for this action, r_t(a), and we use this as our estimate of the roll-out value.

4 Theoretical Guarantees

We analyze MELEE, showing that the no-regret property of AGGREVATE can be leveraged in our meta-learning setting for learning contextual bandit exploration. In particular, we first relate the regret of the learner in line 16 to the overall regret of π. This will show that, if the underlying classifier improves sufficiently quickly, MELEE will achieve sublinear regret. We then show that for a specific choice of underlying classifier (BANDITRON), this is achieved. MELEE is an instantiation of AGGREVATE (Ross and Bagnell 2014); as such, it inherits AGGREVATE's regret guarantees.

Theorem 1  After N episodes, if LEARN (line 16) is a no-regret algorithm, then as N → ∞, with probability 1, it holds that J(π̄) ≥ J(π*) − 2T√K ε_class(T), where J(·) is the reward of the exploration policy, π̄ is the average policy returned, and ε_class(T) is the average regression regret for each π_n accurately predicting Q*, where

$$\epsilon_{\mathrm{class}}(T) = \min_{\pi \in \Pi} \frac{1}{N}\, \mathbb{E} \sum_{i=1}^{N} \Big[ Q^*_{T-t+1}(s, \pi) - \min_a Q^*_{T-t+1}(s, a) \Big]$$

is the empirical minimum expected cost-sensitive classification regret achieved by policies in the class Π on all the data over the N iterations of training when compared to the Bayes-optimal regressor, for t ∼ U(T) and s ∼ d^t_{π_i}, where U(T) is the uniform distribution over {1, ..., T}, d^t_π is the distribution of states at time t induced by executing policy π, and Q* is the cost-to-go of the expert.

Thus, achieving low regret at the problem of learning π on the training data it observes ("D" in MELEE), i.e., ε_class(T) is small, translates into low regret in the contextual-bandit setting. At first glance this bound looks like it may scale linearly with T. However, the bound in Theorem 1 is dependent on ε_class(T). Note, however, that s is a combination of the context vector x_t and the classification function f_t. As T → ∞, one would hope that f_t improves significantly and ε_class(T) decays quickly. Thus, sublinear regret may still be achievable when f learns sufficiently quickly as a function of T. For instance, if f is optimizing a strongly convex loss function, online gradient descent achieves a regret guarantee of O((log T)/T) (Hazan et al. 2016, Theorem 3.3), potentially leading to a regret for MELEE of O(√((log T)/T)).

The above statement is informal (it does not take into account the interaction between learning f and π). However, we can show a specific concrete example: we analyze MELEE's test-time behavior when the underlying learning algorithm is BANDITRON. BANDITRON is a variant of the multiclass Perceptron that operates under bandit feedback. Details of this analysis (and proofs, which directly follow the original BANDITRON analysis) are given in the appendix.

5 Experimental Setup and Results

Using a collection of synthetically generated classification problems, we train an exploration policy π using MELEE (Alg 1). This exploration policy learns to explore on the basis of calibrated probabilistic predictions from f together with a predefined set of exploration features (§3). Once π is learned and fixed, we follow the test-time behavior described in §3 to evaluate π on a set of contextual bandit problems. We evaluate MELEE on a natural learning to rank task (§5). To ensure that the performance of MELEE generalizes beyond this single learning to rank task, we additionally perform a thorough evaluation on 300 "simulated" contextual bandit problems, derived from standard classification tasks.

Figure 1: Win/Loss counts for all pairs of algorithms over 16 random shuffles for the MSLR-10K dataset.

In all cases, the underlying classifier f is a linear model trained with an optimizer that runs stochastic gradient descent. We seek to answer two questions experimentally:

1. How does MELEE compare empirically to alternative (expert designed) exploration strategies?

2. How important are the additional features used by MELEE in comparison to using calibrated probability predictions from f as features?

Figure 2: Learning curve on the MSLR-10K dataset: the x-axis shows the number of queries observed, and the y-axis shows the progressive reward.

Training Datasets

In our experiments, we follow Konyushkova, Sznitman, and Fua (2017) (and also Peters et al. (2014), in a different setting) and train the exploration policy π only on synthetic data. This is possible because the exploration policy π never makes use of x explicitly and instead only accesses it via f_t's behavior on it. We generate datasets with uniformly distributed class-conditional distributions. The datasets are always two-dimensional.
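The exact generating process matters less than the fact that π never sees x directly; one possible way to produce such two-dimensional tasks (our own choice of uniform class-conditional boxes, offered purely as an illustration) is:

```python
# Illustrative sketch of a synthetic 2D training task: each class draws its
# features uniformly from its own random box in [0, 1]^2 (one possible reading
# of "uniformly distributed class-conditional distributions").
import numpy as np

def make_synthetic_task(K, n_per_class, rng=np.random.default_rng()):
    X, y = [], []
    for k in range(K):
        lo = rng.uniform(0.0, 0.5, size=2)          # random lower corner of the box
        hi = lo + rng.uniform(0.2, 0.5, size=2)     # random upper corner
        X.append(rng.uniform(lo, hi, size=(n_per_class, 2)))
        y.append(np.full(n_per_class, k))
    X, y = np.vstack(X), np.concatenate(y)
    order = rng.permutation(len(y))                 # shuffle examples into a stream
    return X[order], y[order]
```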

Evaluation Methodology

For evaluation, we use progressive validation (Blum, Kalai, and Langford 1999), which exactly computes the reward of the algorithm. Specifically, to evaluate the performance of an exploration algorithm A on a dataset S of size n, we compute the progressive validation return

$$G(A) = \frac{1}{n} \sum_{t=1}^{n} r_t(a_t)$$

as the average reward up to n, where a_t is the action chosen by the algorithm A and r_t is the true reward. Progressive validation is particularly suitable for measuring the effectiveness of the exploration algorithm, since the decision on whether to exploit or explore at earlier time steps will affect the performance on the observed examples in the future.

Because our evaluation is over 300 datasets, we report aggregate results in terms of Win/Loss statistics: we compare two exploration methods by counting the number of statistically significant wins and losses. An exploration algorithm A wins over another algorithm B if the progressive validation return G(A) is statistically significantly larger than B's return G(B) at the 0.01 level using a paired sample t-test.
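Concretely, a win can be decided with a paired-sample t-test over matched runs (e.g., the same shuffles or datasets for both algorithms); the sketch below uses scipy and assumes the per-run returns are available:

```python
# Sketch: decide a statistically significant win of algorithm A over algorithm B
# using a paired-sample t-test at the 0.01 level on matched progressive returns.
import numpy as np
from scipy.stats import ttest_rel

def wins(returns_a, returns_b, alpha=0.01):
    """returns_a, returns_b: paired per-run returns G(A), G(B) of equal length."""
    stat, pval = ttest_rel(returns_a, returns_b)
    return bool(pval < alpha and np.mean(returns_a) > np.mean(returns_b))
```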

Figure 3: Behavior of MELEE in comparison to baseline and state-of-the-art exploration algorithms. A representative learning curve on dataset #1144.

Experimental Results

Learning to Rank  We evaluate MELEE on a natural learning to rank dataset. The dataset we consider is the Microsoft Learning to Rank dataset, variant MSLR-10K from (Qin and Liu 2013). The dataset consists of feature vectors extracted from query-url pairs along with relevance judgment labels. The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values from 0 (irrelevant) to 4 (perfectly relevant). In our experiments, we limit the labels to the two extremes, 0 and 4, and we drop the queries not labeled with either of the two extremes. A query-url pair is represented by a 136-dimensional feature vector. The dataset is highly imbalanced, as the number of irrelevant queries is much larger than the number of relevant ones. To address this, we subsample the irrelevant queries to match the number of relevant ones. To avoid correlations between the observed query-url pairs, we group the queries by query ID and sample a single query from each group. We convert relevance scores to losses, with 0 indicating a perfectly relevant document and 1 an irrelevant one.
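This preprocessing can be sketched as follows; the file path is a placeholder, taking the first pair per query id stands in for the random per-group sample, and we assume the irrelevant pairs outnumber the relevant ones:

```python
# Sketch of the MSLR-10K preprocessing: keep only relevance labels 0 and 4, take
# one query-url pair per query id, balance the two classes, and map relevance to
# losses (0 = perfectly relevant, 1 = irrelevant).
import numpy as np
from sklearn.datasets import load_svmlight_file

def load_mslr(path="mslr10k_fold1_train.txt", rng=np.random.default_rng(0)):
    X, y, qid = load_svmlight_file(path, query_id=True)    # placeholder path
    keep = np.isin(y, [0, 4])                               # two extreme labels only
    X, y, qid = X[keep], y[keep], qid[keep]
    first = np.unique(qid, return_index=True)[1]            # one pair per query id
    X, y = X[first], y[first]
    rel, irrel = np.where(y == 4)[0], np.where(y == 0)[0]
    irrel = rng.choice(irrel, size=len(rel), replace=False) # balance the classes
    idx = rng.permutation(np.concatenate([rel, irrel]))
    loss = (y[idx] == 0).astype(float)                      # 0 relevant, 1 irrelevant
    return X[idx], loss
```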

Figure 2 shows the evaluation results on a subset of the MSLR-10K dataset. Since performance is closely matched between the different exploration algorithms, we repeat the experiment 16 times with randomly shuffled permutations of the MSLR-10K dataset. Figure 2 shows the learning curve of the trained policy π as well as the baselines. Here, we see that MELEE quickly achieves high reward; after about 100 examples the two strongest baselines catch up, and by 200 examples all approaches have asymptoted. We exclude LinUCB from these runs because the required matrix inversions made it too computationally expensive.¹ Figure 1 shows statistically significant win/loss differences for each of the algorithms across these 16 shuffles. Each row/column entry shows the number of times the row algorithm won against the column, minus the number of losses. MELEE is the only algorithm that always wins more than it loses against the other algorithms, and it outperforms the nearest competition (ε-decreasing) by 3 points.

¹ In a single run of LinUCB we observed that its performance is on par with ε-greedy.


Figure 4: Behavior of MELEE in comparison to baseline and state-of-the-art exploration algorithms. Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses. MELEE outperforms the nearest competition (ε-decreasing) by 23.

Simulated Contextual Bandit Tasks  We perform an exhaustive evaluation on simulated contextual bandit tasks to ensure that the performance of MELEE generalizes beyond learning to rank. Following Bietti, Agarwal, and Langford (2018), we use a collection of 300 binary classification datasets from openml.org for evaluation. These datasets cover a variety of different domains including text & image processing, medical, and sensory data. We convert classification datasets into cost-sensitive classification problems by using a 0/1 encoding. Given these fully supervised cost-sensitive multi-class datasets, we simulate the contextual bandit setting by only revealing the reward for the selected actions.

In Figure 3, we show a representative learning curve. Here, we see that as more data becomes available, all the approaches improve (except τ-first, which has ceased to learn after 2% of the data). MELEE, in particular, is able to very quickly achieve near-optimal performance (in around 40 examples) in comparison to the best baseline, which takes at least 200.

In Figure 4, we show statistically significant win/loss differences for each of the algorithms. Here, each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses. MELEE is the only algorithm that always wins more than it loses against other algorithms, and outperforms the nearest competition (ε-decreasing) by 23 points.

To understand more directly how MELEE compares to ε-decreasing, in Figure 5 we show a scatter plot of rewards achieved by MELEE (x-axis) and ε-decreasing (y-axis) on each of the 300 datasets, with statistically significant differences highlighted in red and insignificant differences in blue. Points below the diagonal line correspond to better performance by MELEE (147 datasets) and points above to ε-decreasing (124 datasets). The remaining 29 had no statistically significant difference.

Figure 5: MELEE vs ε-decreasing; every point represents one dataset; the x-axis shows the reward of MELEE, the y-axis shows ε-decreasing, and red dots represent statistically significant runs.

Finally, we consider the effect that the additional features have on MELEE's performance. In particular, we compare the version of MELEE with all features (the version used in all other experiments) against an ablated version that only has access to the (calibrated) probabilities of each action from the underlying classifier f. The comparison is shown as a scatter plot in Figure 6. Here, we can see that the full feature set does provide lift over just the calibrated probabilities, with a win-minus-loss improvement of 24 from adding the additional features from which to learn to explore.

Figure 6: MELEE vs MELEE using only the calibrated prediction probabilities (x-axis). MELEE gets additional leverage when using all the features.


Broader Impacts

The motivation of this work is to give machine learning practitioners another tool in their toolbox to learn better exploration strategies for contextual bandits. Our primary target stakeholder population is such machine learning practitioners and data scientists. Secondarily, as that primary stakeholder population builds and deploys algorithms, those who are impacted by those algorithms through direct or indirect use will, we hope, benefit from a better exploration algorithm. A possible risk of our algorithm, if deployed, is how the explored actions would affect the fairness of the learned model with respect to different demographic groups. On the positive side, we provide theoretical guarantees for the regret bounds achieved by MELEE in a simplified setting. Overall, while there are real concerns about how this technology might be deployed, our hope is that the positive impacts outweigh the negatives, specifically because standard best-use practices should mitigate most of the risks.

Acknowledgments

We thank members of the CLIP lab for reviewing earlier versions of this work. This material is based upon work supported by the National Science Foundation under Grant No. 1618193. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. E. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1638–1646.
Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 3981–3989.
Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research 3: 397–422.
Bachman, P.; Sordoni, A.; and Trischler, A. 2017. Learning algorithms for active learning. In ICML.
Beygelzimer, A.; and Langford, J. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 129–138. ACM.
Bietti, A.; Agarwal, A.; and Langford, J. 2018. A Contextual Bandit Bake-off. Working paper or preprint.
Blum, A.; Kalai, A.; and Langford, J. 1999. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 203–208. ACM.
Breiman, L. 2001. Random Forests. Machine Learning 45(1): 5–32. ISSN 0885-6125. doi:10.1023/A:1010933404324.
Chang, K.; Krishnamurthy, A.; Agarwal, A.; Daumé, III, H.; and Langford, J. 2015. Learning to Search Better Than Your Teacher. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, 2058–2066. JMLR.org.
Daumé, III, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3): 297–325. ISSN 1573-0565. doi:10.1007/s10994-009-5106-x.
Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
Fang, M.; Li, Y.; and Cohn, T. 2017. Learning how to Active Learn: A Deep Reinforcement Learning Approach. In EMNLP.
Feraud, R.; Allesiardo, R.; Urvoy, T.; and Clérot, F. 2016. Random Forest for the Contextual Bandit Problem. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, 93–101. Cadiz, Spain: PMLR.
Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-Reinforcement Learning of Structured Exploration Strategies. arXiv preprint arXiv:1802.07245.
Hazan, E.; et al. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3-4): 157–325.
Horvitz, D. G.; and Thompson, D. J. 1952. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47(260): 663–685. doi:10.1080/01621459.1952.10483446.
Kaelbling, L. P. 1994. Associative reinforcement learning: Functions in k-DNF. Machine Learning 15(3): 279–298.
Karnin, Z. S.; and Anava, O. 2016. Multi-armed Bandits: Competing with Optimal Sequences. In Advances in Neural Information Processing Systems 29, 199–207. Curran Associates, Inc.
Konyushkova, K.; Sznitman, R.; and Fua, P. 2017. Learning Active Learning from Data. In Advances in Neural Information Processing Systems.
Langford, J.; and Zadrozny, B. 2005. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, 473–480. ACM.
Langford, J.; and Zhang, T. 2008. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In Advances in Neural Information Processing Systems 20, 817–824. Curran Associates, Inc.
Li, K.; and Malik, J. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010a. A Contextual-bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, 661–670. New York, NY, USA: ACM. ISBN 978-1-60558-799-8. doi:10.1145/1772690.1772758.
Li, W.; Wang, X.; Zhang, R.; Cui, Y.; Mao, J.; and Jin, R. 2010b. Exploitation and Exploration in a Performance Based Contextual Advertising System. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, 27–36. New York, NY, USA: ACM.
Lin, H.-T.; Lin, C.-J.; and Weng, R. C. 2007. A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68(3): 267–276. ISSN 1573-0565. doi:10.1007/s10994-007-5018-6.
Maes, F.; Wehenkel, L.; and Ernst, D. 2012. Meta-learning of exploration/exploitation strategies: The multi-armed bandit case. In International Conference on Agents and Artificial Intelligence, 100–115. Springer.
Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems 29, 4026–4034. Curran Associates, Inc.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Peters, J.; Mooij, J. M.; Janzing, D.; and Schölkopf, B. 2014. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research 15(1): 2009–2053.
Platt, J. C. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, 61–74. MIT Press.
Qin, T.; and Liu, T. 2013. Introducing LETOR 4.0 Datasets. CoRR abs/1306.2597. URL http://arxiv.org/abs/1306.2597.
Ross, S.; and Bagnell, J. A. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.
Ross, S.; Gordon, G.; and Bagnell, J. A. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, 627–635. Fort Lauderdale, FL, USA: PMLR.
Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1038–1044.
Sutton, R. S.; and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition. ISBN 0262193981.
Woodward, M.; and Finn, C. 2017. Active one-shot learning. arXiv preprint arXiv:1702.06559.
Xu, T.; Liu, Q.; Zhao, L.; Xu, W.; and Peng, J. 2018. Learning to Explore with Meta-Policy Gradient. arXiv preprint arXiv:1803.05044.
Zoph, B.; and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.