
Distributed Online Learning via Cooperative Contextual Bandits

Cem Tekin*, Member, IEEE, Mihaela van der Schaar, Fellow, IEEE
Electrical Engineering Department, University of California, Los Angeles

Email: [email protected], [email protected]

Abstract—In this paper we propose a novel framework for decentralized, online learning by many learners. At each moment of time, an instance characterized by a certain context may arrive to each learner; based on the context, the learner can select one of its own actions (which gives a reward and provides information) or request assistance from another learner. In the latter case, the requester pays a cost and receives the reward but the provider learns the information. In our framework, learners are modeled as cooperative contextual bandits. Each learner seeks to maximize the expected reward from its arrivals, which involves trading off the reward received from its own actions, the information learned from its own actions, the reward received from the actions requested of others and the cost paid for these actions - taking into account what it has learned about the value of assistance from each other learner. We develop distributed online learning algorithms and provide analytic bounds to compare the efficiency of these algorithms with the complete knowledge (oracle) benchmark (in which the expected reward of every action in every context is known by every learner). Our estimates show that regret - the loss incurred by the algorithm - is sublinear in time. Our theoretical framework can be used in many practical applications including Big Data mining, event detection in surveillance sensor networks and distributed online recommendation systems.

Index Terms—Online learning, distributed learning, multi-user learning, cooperative learning, contextual bandits, multi-user bandits.

I. INTRODUCTION

In this paper we propose a novel framework for online learning by multiple cooperative and decentralized learners. We assume that an instance (a data unit), characterized by context (side) information, arrives at a learner (processor) which needs to process it either by using one of its own processing functions or by requesting another learner (processor) to process it. The learner's goal is to learn online what is the best processing function which it should use such that it maximizes its total expected reward for that instance. A data stream is an ordered sequence of instances that can be read only once or a small number of times using limited computing and storage capabilities [1]. For example, in a stream mining application, an instance can be the data unit extracted by a sensor or camera; in a wireless communication application, an instance can be a packet that needs to be transmitted. The context can be anything that provides information about the rewards to the learners. For example, in stream mining, the context can be the type of the extracted instance; in wireless

A preliminary version of this work appeared in Allerton 2013. The work is partially supported by the grants NSF CNS 1016081 and AFOSR DDDAS.

communications, the context can be the channel Signal to Noise Ratio (SNR). The processing functions in the stream mining application can be the various classification functions, while in wireless communications they can be the transmission strategies for sending the packet (note that the selection of the processing functions by the learners can be performed based on the context and not necessarily the instance). The rewards in stream mining can be the accuracy associated with the selected classification function, and in wireless communication they can be the resulting goodput and expended energy associated with a selected transmission strategy.

To solve such distributed online learning problems, we define a new class of multi-armed bandit solutions, which we refer to as cooperative contextual bandits. In the considered scenario, there is a set of cooperative learners, each equipped with a set of processing functions (arms^1) which can be used to process the instance. We assume a discrete time model t = 1, 2, . . ., where different instances and associated context information arrive to a learner.^2 Upon the arrival of an instance, a learner needs to select either one of its arms to process the instance, or it can call another learner which can select one of its own arms to process the instance and incur a cost (e.g., delay cost, communication cost, processing cost, money). Based on the selected arm, the learner receives a random reward, which is drawn from some unknown distribution that depends on the context information characterizing the instance. The goal of a learner is to maximize its total undiscounted reward up to any time horizon T. A learner does not know the expected reward (as a function of the context) of its own arms or of the other learners' arms. In fact, we go one step further and assume that a learner does not know anything about the set of arms available to other learners except an upper bound on the number of their arms. The learners are cooperative because they obtain mutual benefits from cooperation - a learner's benefit from calling another learner may be an increased reward as compared to the case when it uses solely its own arms; the benefit of the learner asked to perform the processing by another learner is that it can learn about the performance of its own arm based on its reward

^1 We use the terms action and arm interchangeably.

^2 Assuming synchronous agents/learners is common in the decentralized multi-armed bandit literature [2], [3]. Although our formulation is for synchronous learners, our results directly apply to asynchronous learners, where the times of instance and context arrivals can be different. A learner may not receive an instance and context at every time slot t. Then, instead of the final time T, our performance bounds for learner i will depend on the total number of arrivals to learner i by time T.


for the calling learner. This is especially beneficial when certain instances and associated contexts are less frequent, or when gathering labels (observing the reward) is costly.

The problem defined in this paper is a generalization of the well-known contextual bandit problem [4]–[9], in which there is a single learner who has access to all the arms. However, the considered distributed online learning problem is significantly more challenging because a learner cannot observe the arms of other learners and cannot directly estimate the expected rewards of those arms. Moreover, the heterogeneous contexts arriving at each learner lead to different learning rates for the various learners. We design distributed online learning algorithms whose long-term average rewards converge to the best distributed solution which can be obtained if we assumed complete knowledge of the expected arm rewards of each learner for each context.

To rigorously quantify the learning performance, we define the regret of an online learning algorithm for a learner as the difference between the expected total reward of the best decentralized arm selection scheme given complete knowledge about the expected arm rewards of all learners and the expected total reward of the algorithm used by the learner. Simply, the regret of a learner is the loss incurred due to the unknown system dynamics compared to the complete knowledge benchmark. We prove a sublinear upper bound on the regret, which implies that the average reward converges to the optimal average reward. The upper bound on regret gives a lower bound on the convergence rate to the optimal average reward. We show that when the contexts arriving to a learner are uniformly distributed over the context space, the regret depends on the dimension of the context space, while when the contexts arriving to the same learner are concentrated in a small region of the context space, the regret is independent of the dimension of the context space.

The proposed framework can be used in numerous applications, including the ones given below.

Example 1: Consider a wireless sensor surveillance network, where different types of sensors located in different locations collect different information (instances) about the same event, as well as the context information associated with these instances. To detect the event, each sensor may run different classification algorithms. The goal of each sensor i is to maximize its detection accuracy based on its context information xi by using either its own classification algorithm or by requesting a prediction from another sensor j by sending it the context information xi. We assume that the accuracies of the classification algorithms are not known a priori by the sensors, and these accuracies may even change over time. To understand the importance of cooperation among sensors, consider the following example. Let x∗ be a side information/context (e.g., soil moisture, temperature) which indicates that an event E′ had happened with a very high probability. Initially i and j are unaware of this implication. By time t, context x∗ may have been frequently observed by sensor j while rarely observed by sensor i. Therefore, j may know that x∗ implies event E′, and i may not know this since it had few observations. If by time t, i learns that j's prediction is accurate for context x∗, it can use j to predict what will happen when it observes the context x∗. This cooperative decentralized sensor event detection problem can be modeled and solved using the proposed cooperative contextual bandits.

Example 2: Consider a network security scenario in which autonomous systems (ASs) collaborate with each other to detect cyber-attacks [10]. Each AS has a set of security solutions which it can use to detect attacks. The contexts are the characteristics of the data traffic in each AS. These contexts can provide valuable information about the occurrence of cyber-attacks. Since the nature of the attacks is dynamic, non-stochastic and context dependent, the efficiencies of the various security solutions are dynamically varying, context dependent and unknown a priori. Based on the extracted contexts (e.g., key properties of its traffic, the originator of the traffic, etc.), an AS i may route its incoming data stream (or only the context information) to another AS j, and if AS j detects a malicious activity based on its own security solutions, it warns AS i. Due to privacy or security concerns, AS i may not know what security applications AS j is running. This problem can be modeled as a cooperative contextual bandit problem in which the various ASs cooperate with each other to learn online which actions they should take, or which other ASs they should request to take actions, in order to accurately detect attacks (e.g., minimize the mis-detection probability of cyber-attacks).

The remainder of the paper is organized as follows. In Section II we describe the related work and highlight the differences from our work. In Section III we describe the choices of learners, rewards, the complete knowledge benchmark, and define the regret of a learning algorithm. A cooperative contextual learning algorithm that uses a non-adaptive partition of the context space is proposed and a sublinear bound on its regret is derived in Section IV. Another learning algorithm that adaptively partitions the context space of each learner is proposed in Section V, and its regret is bounded for different types of context arrivals. In Section VI we discuss the necessity of the training phase, which is a property of both algorithms, and compare them. Finally, the concluding remarks are given in Section VII.

II. RELATED WORK

Contextual bandits have been studied before in [6]–[9] in a single agent setting, where the agent sequentially chooses from a set of arms with unknown rewards, and the rewards depend on the context information provided to the agent at each time slot. The goal of the agent is to maximize its reward by balancing exploration of arms with uncertain rewards and exploitation of the arm with the highest estimated reward. The algorithms proposed in these works are shown to achieve sublinear-in-time regret with respect to the complete knowledge benchmark, and the sublinear regret bounds are proved to match lower bounds on the regret up to logarithmic factors. In all the prior work, the context space is assumed to be large and a known similarity metric over the contexts is exploited by the algorithms to estimate arm rewards together for groups of similar contexts. Groups of contexts are created by partitioning the context space. For example, [8] proposed an epoch-based uniform partition of the context space, while [6] proposed a non-uniform adaptive


partition. In [11], contextual bandit methods are developed for personalized news article recommendation, and a variant of the UCB algorithm [12] is designed for linear payoffs. In [13], contextual bandit methods are developed for data mining, and a perceptron-based algorithm that achieves sublinear regret when the instances are chosen by an adversary is proposed. To the best of our knowledge, our work is the first to provide rigorous solutions for online learning by multiple cooperative learners when context information is present, and to propose a novel framework for cooperative contextual bandits to solve this problem.

Another line of work [4], [5] considers a single agent with a large set of arms (often uncountable). Given a similarity structure on the arm space, they propose online learning algorithms that adaptively partition the arm space to get sublinear regret bounds. The algorithms we design in this paper also exploit the similarity information, but in the context space rather than the action space, to create a partition and learn through the partition. However, the distributed problem formulation, the creation of the partitions and how learning is performed are very different from related prior work [4]–[9].

Previously, distributed multi-user learning was only considered for multi-armed bandits with a finite number of arms and no context. In [2], [14] distributed online learning algorithms that converge to the optimal allocation with logarithmic regret are proposed for the i.i.d. arm reward model, given that the optimal allocation is an orthogonal allocation in which each user selects a different arm. Considering a similar model but with Markov arm rewards, logarithmic regret algorithms are proposed in [15], [16], where the regret is with respect to the best static policy, which is not generally optimal for Markov rewards. This is generalized in [3] to dynamic resource sharing problems and logarithmic regret results are also proved for this case.

We provide a detailed comparison between our work and related work in multi-armed bandit learning in Table I. Our cooperative contextual learning framework can be seen as an important extension of the centralized contextual bandit framework [4]–[9]. The main differences are: (i) the training phase, which is required due to the informational asymmetries between learners, (ii) the separation of exploration and exploitation over time, instead of using an index for each arm to balance them, resulting in three-phase learning algorithms with training, exploration and exploitation phases, and (iii) coordinated context space partitioning in order to balance the differences in reward estimation due to heterogeneous context arrivals to the learners. Although we consider a three-phase learning structure, our learning framework can work together with index-based policies such as the ones proposed in [6], by restricting the index updates to time slots that are not in the training phase. Our three-phase learning structure separates exploration and exploitation into distinct time slots, while they take place concurrently for an index-based policy. We will discuss the differences between these methods in Section VI. We will also show in Section VI that the training phase is necessary for the learners to form correct estimates about each other's rewards in cooperative contextual bandits.

                                  [6]–[9]     [3], [14], [20]   This work
Multi-user                        no          yes               yes
Cooperative                       N/A         yes               yes
Contextual                        yes         no                yes
Context arrival process           arbitrary   N/A               arbitrary
Synchronous (syn)/
asynchronous (asn)                N/A         syn               both
Regret                            sublinear   logarithmic       sublinear

TABLE I
COMPARISON WITH RELATED WORK IN MULTI-ARMED BANDITS

Different from our work, distributed learning is also considered in the online convex optimization setting [17]–[19]. In all of these works, local learners choose their actions (parameter vectors) to minimize the global total loss by exchanging messages with their neighbors and performing subgradient descent. In contrast to these works, in which learners share information about their actions, the learners in our model do not share any information about their own actions. The information shared in our model is the context information of the calling learner and the reward generated by the arm of the called learner. However, this information is not shared at every time slot, and the rate of information sharing between learners who cannot help each other gain higher rewards goes to zero asymptotically.

III. PROBLEM FORMULATION

The system model is shown in Fig. 1. There are M learners which are indexed by the set M := {1, 2, . . . , M}. Let M−i := M − {i} be the set of learners learner i can choose from to receive a reward. Let Fi denote the set of arms of learner i. Let F := ∪j∈M Fj denote the set of all arms. Let Ki := Fi ∪ M−i. We call Ki the set of choices for learner i. We use index k to denote any choice in Ki, f to denote arms of the learners, and j to denote other learners in M−i.

These learners work in a discrete time setting t = 1, 2, . . . , T, where the following events happen sequentially in each time slot: (i) an instance with context xi(t) arrives to each learner i ∈ M; (ii) based on xi(t), learner i either chooses one of its arms f ∈ Fi or calls another learner and sends xi(t);^3 (iii) for each learner who called learner i at time t, learner i chooses one of its arms f ∈ Fi; (iv) learner i observes the rewards of all the arms f ∈ Fi it had chosen, both for its own contexts and for other learners; (v) learner i either obtains directly the reward of its own arm it had chosen, or a reward that is passed from the learner that it had called for its own context.^4

The contexts xi(t) come from a bounded D-dimensional space X, which is taken to be [0, 1]^D without loss of generality. When selected, an arm f ∈ F generates a random reward sampled from an unknown, context dependent distribution Gf(x) with support in [0, 1].^5 The expected reward of arm f ∈ F for context x ∈ X is denoted by πf(x). Learner i incurs a known deterministic and fixed cost dik for selecting

^3 An alternative formulation is that learner i selects multiple choices from Ki at each time slot, and receives the sum of the rewards of the selected choices. All of the ideas/results in this paper can be extended to this case as well.

^4 Although in our problem description the learners are synchronized, our model also works for the case where instances/contexts arrive asynchronously to each learner. We discuss more about this in [10].

^5 Our results can be generalized to rewards with bounded support [b1, b2] for −∞ < b1 < b2 < ∞. This will only scale our performance bounds by a constant factor.


Fig. 1. System model from the viewpoint of learners i and j. Here i exploits j to obtain a high reward while helping j to learn about the reward of its own arm.

choice k ∈ Ki.^6 For example, for k ∈ Fi, dik can represent the cost of activating arm k, while for k ∈ M−i, dik can represent the cost of communicating with learner k and/or the payment made to learner k. Although in our system model we assume that each learner i can directly call another learner j, our model can be generalized to learners over a network, where calling learners that are far away from learner i has a higher cost for learner i. Learner i knows the set of other learners M−i and the costs of calling them, i.e., dij, j ∈ M−i, but it does not know the set of arms Fj, j ∈ M−i; it only knows an upper bound Fmax on the number of arms that each learner has, i.e., Fmax ≥ |Fj|,^7 j ∈ M−i. Since the costs are bounded, without loss of generality we assume that the costs are normalized, i.e., dik ∈ [0, 1] for k ∈ Ki, i ∈ M. The net reward of learner i from a choice is equal to the obtained reward minus the cost of selecting the choice. The net reward of a learner is always in [−1, 1].
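To make this model concrete, the following Python sketch simulates a single time slot of the interaction under assumed reward distributions; the Bernoulli rewards, the hidden weight vectors and the numerical costs are illustrative assumptions, not part of the formal model.

import numpy as np

rng = np.random.default_rng(0)

M, D = 3, 2                                    # number of learners, context dimension
ARMS = {i: [(i, a) for a in range(2)] for i in range(M)}       # F_i (illustrative arm ids)
THETA = {f: rng.random(D) for i in range(M) for f in ARMS[i]}  # hidden per-arm parameters
ARM_COST, CALL_COST = 0.0, 0.1                 # assumed costs d_ik

def expected_reward(f, x):
    # Assumed context-dependent expected reward pi_f(x), always in [0, 1].
    return 0.5 + 0.5 * np.cos(np.pi * np.dot(THETA[f], x) / D)

def pull(f, x):
    # Random reward with support in [0, 1]; a Bernoulli draw is one possible choice of G_f(x).
    return float(rng.random() < expected_reward(f, x))

# One time slot for learner i = 0: either use one of its own arms (pay ARM_COST),
# or call learner j = 1, which picks one of its own arms for i's context (pay CALL_COST).
x = rng.random(D)                              # context x_i(t) in [0, 1]^D
net_own = pull(ARMS[0][0], x) - ARM_COST       # net reward from an own arm
net_call = pull(ARMS[1][0], x) - CALL_COST     # net reward when calling learner 1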

The learners are cooperative, which implies that when called by learner i, learner j will choose one of its own arms which it believes to yield the highest expected reward given the context of learner i.

The expected reward of an arm is similar for similar contexts, which we formalize in terms of a Lipschitz condition given in the following assumption.

Assumption 1: For each f ∈ F, there exist L > 0, α > 0 such that for all x, x′ ∈ X, we have |πf(x) − πf(x′)| ≤ L||x − x′||^α, where ||·|| denotes the Euclidean norm in R^D. We assume that α is known by the learners.

^6 Alternatively, we can assume that the costs are random variables with bounded support whose distribution is unknown. In this case, the learners will not learn the reward but they will learn the reward minus the cost, which is essentially the same thing.

7For a set A, let |A| denote the cardinality of that set.

In the contextual bandit literature, this is referred to as similarity information [6], [21]. Different from prior works on contextual bandits, we do not require L to be known by the learners. However, L will appear in our performance bounds.
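To see how this similarity information is exploited later, the short derivation below (our own worked consequence of Assumption 1, anticipating the uniform partition with slicing parameter m_T introduced in Section IV) bounds how much an expected arm reward can vary inside one partition cell:
$$x, x' \in p,\ p \text{ a hypercube with side length } 1/m_T \ \Rightarrow\ \|x - x'\| \leq \frac{\sqrt{D}}{m_T} \ \Rightarrow\ |\pi_f(x) - \pi_f(x')| \leq L\left(\frac{\sqrt{D}}{m_T}\right)^{\alpha}.$$
This dispersion term $L(\sqrt{D}/m_T)^{\alpha}$ is exactly the quantity that appears in the regret analysis of Section IV.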

The goal of learner i is to maximize its total expected reward. In order to do this, it needs to learn the rewards of its choices. Thus, learner i should concurrently explore the choices in Ki to learn their expected rewards, and exploit the best believed choice for its contexts, which maximizes the reward minus cost. In the next subsection we formally define the complete knowledge benchmark. Then, we define the regret, which is the performance loss due to uncertainty about arm rewards.

A. Optimal Arm Selection Policy Under Complete Information

We define learner j's expected reward for context x as πj(x) := max_{f∈Fj} πf(x). This is the maximum expected reward learner j can provide when called by a learner with context x. For learner i, µik(x) := πk(x) − dik denotes the net reward of choice k ∈ Ki for context x. Our benchmark when evaluating the performance of the learning algorithms is the optimal solution, which selects the choice with the highest expected net reward for learner i for its context x. This is given by
$$k_i^*(x) := \arg\max_{k \in \mathcal{K}_i} \mu^i_k(x), \quad \forall x \in \mathcal{X}. \qquad (1)$$
Since knowing µij(x) requires knowing πf(x) for f ∈ Fj, knowing the optimal solution means that learner i knows the arm in F that yields the highest expected reward for each x ∈ X.
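A minimal sketch of this complete-knowledge benchmark, assuming the expected rewards pi_f(x) and the costs d_ik are known; expected_reward, arms, others and cost are illustrative placeholders rather than objects defined in the paper.

def oracle_choice(i, x, arms, others, expected_reward, cost):
    # Return k_i^*(x) = argmax_{k in K_i} mu_k^i(x), with mu_k^i(x) = pi_k(x) - d_ik (Eq. (1)).
    best_k, best_mu = None, -float("inf")
    for f in arms[i]:                      # own arms: pi_f(x) minus the activation cost
        mu = expected_reward(f, x) - cost[(i, f)]
        if mu > best_mu:
            best_k, best_mu = f, mu
    for j in others[i]:                    # other learners: pi_j(x) = max over learner j's arms
        mu = max(expected_reward(f, x) for f in arms[j]) - cost[(i, j)]
        if mu > best_mu:
            best_k, best_mu = j, mu
    return best_k, best_mu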

B. The Regret of Learning

Let ai(t) be the choice selected by learner i at time t. Since learner i has no a priori information, this choice is based only on the past history of selections and reward observations of learner i. The rule that maps the history of learner i to its choices is called the learning algorithm of learner i. Let a(t) := (a1(t), . . . , aM(t)) be the choice vector at time t. We let bi,j(t) denote the arm selected by learner i when it is called by learner j at time t. If j does not call i at time t, then bi,j(t) = ∅. Let bi(t) = {bi,j(t)}_{j∈M−i} and b(t) = {bi(t)}_{i∈M}. The regret of learner i with respect to the complete knowledge benchmark k∗i(xi(t)) given in (1) is given by
$$R_i(T) := \sum_{t=1}^{T}\left(\pi_{k_i^*(x_i(t))}(x_i(t)) - d^i_{k_i^*(x_i(t))}\right) - \mathrm{E}\left[\sum_{t=1}^{T}\left(r^i_{a_i(t)}(x_i(t), t) - d^i_{a_i(t)}\right)\right],$$
where r^i_{a_i(t)}(xi(t), t) denotes the random reward of choice ai(t) ∈ Ki for context xi(t) at time t for learner i, and the expectation is taken with respect to the selections made by the distributed algorithm of the learners and the statistics of the rewards. For example, when ai(t) = j and bj,i(t) = f ∈ Fj, this random reward is sampled from the distribution of arm f at context xi(t).
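In a simulation, this regret can be estimated by comparing the benchmark value of each context with the net rewards actually collected; the sketch below assumes access to an oracle_value function (e.g., built from the hypothetical oracle_choice above) and yields a Monte-Carlo estimate rather than the exact expectation.

def empirical_regret(contexts, realized_net_rewards, oracle_value):
    # contexts: the sequence x_i(1), ..., x_i(T) observed by learner i
    # realized_net_rewards: r_{a_i(t)}(x_i(t), t) - d_{a_i(t)} collected in one run
    # oracle_value: function mapping x to max_k (pi_k(x) - d_ik)
    benchmark = sum(oracle_value(x) for x in contexts)
    achieved = sum(realized_net_rewards)
    return benchmark - achieved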


Regret gives the convergence rate of the total expected reward of the learning algorithm to the value of the optimal solution given in (1). Any algorithm whose regret is sublinear, i.e., Ri(T) = O(T^γ) with γ < 1, will converge to the optimal solution in terms of the average reward. In the subsequent sections we will propose two different distributed learning algorithms with sublinear regret.

IV. A DISTRIBUTED UNIFORM CONTEXT PARTITIONING ALGORITHM

The algorithm we consider in this section forms, at the beginning, a uniform partition of the context space for each learner. Each learner estimates its choice rewards based on the past history of arrivals to each set in the partition, independently from the other sets in the partition. This distributed learning algorithm is called Contextual Learning with Uniform Partition (CLUP) and its pseudocode is given in Fig. 2, Fig. 3 and Fig. 4. For learner i, CLUP is composed of two parts. The first part is the maximization part (see Fig. 3), which is used by learner i to maximize its reward from its own contexts. The second part is the cooperation part (see Fig. 4), which is used by learner i to help other learners maximize their rewards for their own contexts.

Let mT be the slicing parameter of CLUP that determines the number of sets in the partition of the context space X. When mT is small, the number of sets in the partition is small, hence the number of contexts from past observations which can be used to form reward estimates in each set is large. However, when mT is small, the size of each set is large, hence the variation of the expected choice rewards over each set is high. First, we will analyze the regret of CLUP for a fixed mT and then optimize over it to balance the aforementioned tradeoff. CLUP forms a partition of [0, 1]^D consisting of (mT)^D sets, where each set is a D-dimensional hypercube with dimensions 1/mT × 1/mT × . . . × 1/mT. We denote this partition by PT and use index p to denote a set in PT. For learner i, let pi(t) be the set in PT to which xi(t) belongs.^8
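A small sketch of this uniform partition, mapping a context x ∈ [0, 1]^D to the cell it falls in; here boundary contexts are simply clipped to the last cell instead of being randomized as in footnote 8.

import math

def hypercube_index(x, m_T):
    # Map x in [0,1]^D to a tuple of per-dimension indices in {0, ..., m_T - 1};
    # along each dimension the cell is [idx / m_T, (idx + 1) / m_T).
    return tuple(min(int(math.floor(x_d * m_T)), m_T - 1) for x_d in x)

# Example: with m_T = 4 and D = 2, the context (0.30, 0.99) lies in cell (1, 3).
assert hypercube_index((0.30, 0.99), 4) == (1, 3)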

First, we will describe the maximization part of CLUP. At time slot t, learner i can be in one of three phases: the training phase, in which learner i calls another learner with its context such that when the reward is received, the called learner can update the estimated reward of its selected arm (but i does not update the estimated reward of the selected learner); the exploration phase, in which learner i selects a choice in Ki and updates its estimated reward; and the exploitation phase, in which learner i selects the choice with the highest estimated net reward.

Recall that the learners are cooperative. Hence, when called by another learner, learner i will choose its arm with the highest estimated reward for the calling learner's context. To gain the highest possible reward in exploitations, learner i must have an accurate estimate of the other learners' expected rewards without observing the arms selected by them. In order to do this, before forming estimates about the expected reward of learner j, learner i needs to make sure that j will almost always select its best arm when called by i.

^8 If xi(t) is an element of the boundary of multiple sets, then it is randomly assigned to one of these sets.

CLUP for learner i:
1: Input: D1(t), D2(t), D3(t), T, mT.
2: Initialize sets: create the partition PT of [0, 1]^D into (mT)^D identical hypercubes.
3: Initialize counters: N^i_p = 0, ∀p ∈ PT; N^i_{k,p} = 0, ∀k ∈ Ki, p ∈ PT; N^{tr,i}_{j,p} = 0, ∀j ∈ M−i, p ∈ PT.
4: Initialize estimates: r̄^i_{k,p} = 0, ∀k ∈ Ki, p ∈ PT.
5: while t ≥ 1 do
6:   Run CLUPmax to get the choice ai, p = pi(t) and train.
7:   If ai ∈ M−i, call learner ai and pass xi(t).
8:   Receive Ci(t), the set of learners who called i, and their contexts.
9:   if Ci(t) ≠ ∅ then
10:    Run CLUPcoop to get the arms to be selected, bi := {bi,j}_{j∈Ci(t)}, and the sets that the contexts lie in, pi := {pi,j}_{j∈Ci(t)}.
11:  end if
12:  if ai ∈ Fi then
13:    Pay cost d^i_{ai}, receive random reward r drawn from G_{ai}(xi(t)).
14:  else
15:    Pay cost d^i_{ai}, receive random reward r drawn from G_{b_{ai,i}}(xi(t)).
16:  end if
17:  if train = 1 then
18:    N^{tr,i}_{ai,p} ++.
19:  else
20:    r̄^i_{ai,p} = (r̄^i_{ai,p} N^i_{ai,p} + r)/(N^i_{ai,p} + 1).
21:    N^i_p ++, N^i_{ai,p} ++.
22:  end if
23:  if Ci(t) ≠ ∅ then
24:    for j ∈ Ci(t) do
25:      Observe random reward r drawn from G_{bi,j}(xj(t)).
26:      r̄^i_{bi,j,pi,j} = (r̄^i_{bi,j,pi,j} N^i_{bi,j,pi,j} + r)/(N^i_{bi,j,pi,j} + 1).
27:      N^i_{pi,j} ++, N^i_{bi,j,pi,j} ++.
28:    end for
29:  end if
30:  t = t + 1
31: end while

Fig. 2. Pseudocode for the CLUP algorithm.

Thus, the training phase of learner i helps other learners build accurate estimates about the rewards of their arms before i uses any rewards from these learners to form reward estimates about them. In contrast, the exploration phase of learner i helps it build accurate estimates about the rewards of its choices. These two phases indirectly help learner i to maximize its total expected reward in the long run.

Next, we define the counters learner i keeps for each set in PT and each choice in Ki, which are used to decide its current phase. Let N^i_p(t) be the number of context arrivals to learner i in p ∈ PT by time t (its own arrivals and arrivals to other learners who call learner i), excluding the training phases of learner i. For f ∈ Fi, let N^i_{f,p}(t) be the number of times arm f is selected in response to a context arriving to set p by learner i by time t (including times other learners select learner i for their contexts in set p). In addition, learner i keeps two counters for each other learner and each set in the partition, which it uses to decide between training, exploration and exploitation. The first one, N^{tr,i}_{j,p}(t), is an estimate of the number of context arrivals to learner j from all learners, excluding the training phases of learner j and the exploration and exploitation phases of learner i. This is an estimate because learner i updates this counter only when it needs to train learner j. The second one, N^i_{j,p}(t), counts the number of context arrivals to learner j coming only from the contexts of learner i in set p, at times learner i selected learner j in its exploration and exploitation phases, by time t. Based on the values of these counters at time t, learner i either trains, explores or exploits a choice in Ki. This three-phase learning structure is one of the major components of our learning algorithm and distinguishes it from the algorithms proposed for contextual bandits in the literature, which assign an index to each choice and select the choice with the highest index.

CLUPmax (maximization part of CLUP) for learner i:
1: train = 0.
2: Find the set in PT that xi(t) belongs to, i.e., pi(t).
3: Let p = pi(t).
4: Compute the set of under-explored arms F^{ue}_{i,p}(t) given in (2).
5: if F^{ue}_{i,p}(t) ≠ ∅ then
6:   Select ai randomly from F^{ue}_{i,p}(t).
7: else
8:   Compute the set of training candidates M^{ct}_{i,p}(t) given in (3).
9:   // Update the counters of the training candidates.
10:  for j ∈ M^{ct}_{i,p}(t) do
11:    Obtain N^j_p from learner j, set N^{tr,i}_{j,p} = N^j_p − N^i_{j,p}.
12:  end for
13:  Compute the set of under-trained learners M^{ut}_{i,p}(t) given in (4).
14:  Compute the set of under-explored learners M^{ue}_{i,p}(t) given in (5).
15:  if M^{ut}_{i,p}(t) ≠ ∅ then
16:    Select ai randomly from M^{ut}_{i,p}(t), train = 1.
17:  else if M^{ue}_{i,p}(t) ≠ ∅ then
18:    Select ai randomly from M^{ue}_{i,p}(t).
19:  else
20:    Select ai randomly from arg max_{k∈Ki} r̄^i_{k,p} − d^i_k.
21:  end if
22: end if

Fig. 3. Pseudocode for the maximization part of the CLUP algorithm.

CLUPcoop (cooperation part of CLUP) for learner i:
1: for j ∈ Ci(t) do
2:   Find the set in PT that xj(t) belongs to, i.e., pi,j.
3:   Compute the set of under-explored arms F^{ue}_{i,pi,j}(t) given in (2).
4:   if F^{ue}_{i,pi,j}(t) ≠ ∅ then
5:     Select bi,j randomly from F^{ue}_{i,pi,j}(t).
6:   else
7:     bi,j = arg max_{f∈Fi} r̄^i_{f,pi,j}.
8:   end if
9: end for

Fig. 4. Pseudocode for the cooperation part of the CLUP algorithm.
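The counters and sample-mean estimates defined above map naturally onto dictionaries keyed by (choice, cell); the following sketch shows one possible in-memory layout for a learner (the field names are ours, not the paper's).

from collections import defaultdict

class CLUPState:
    # Per-learner bookkeeping for CLUP (illustrative layout).
    def __init__(self):
        self.N_p = defaultdict(int)     # N^i_p(t): non-training context arrivals to cell p
        self.N_kp = defaultdict(int)    # N^i_{k,p}(t): times choice k was used for cell p
        self.N_tr = defaultdict(int)    # N^{tr,i}_{j,p}(t): training counter for learner j, cell p
        self.rbar = defaultdict(float)  # sample-mean reward estimate of choice k in cell p

state = CLUPState()
state.N_kp[("arm0", (1, 3))] += 1       # e.g., own arm "arm0" used once for cell (1, 3)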

At each time slot t, learner i first identifies pi(t). Then, it chooses its phase at time t by giving the highest priority to exploration of its own arms, the second highest priority to training of other learners, the third highest priority to exploration of other learners, and the lowest priority to exploitation. The reason that exploration of its own arms has a higher priority than training of other learners is that it can reduce the number of trainings required by other learners, as we describe below.

First, learner i identifies its set of under-explored arms:
$$\mathcal{F}^{ue}_{i,p}(t) := \{f \in \mathcal{F}_i : N^i_{f,p}(t) \leq D_1(t)\}, \qquad (2)$$

where D1(t) is a deterministic, increasing function of t which is called a control function. We will specify this function later, when analyzing the regret of CLUP. The accuracy of the reward estimates of learner i for its own arms increases with D1(t), hence it should be selected to balance the tradeoff between accuracy and the number of explorations. If this set is non-empty, learner i enters the exploration phase and randomly selects an arm in this set to explore. Otherwise, learner i identifies the set of training candidates:
$$\mathcal{M}^{ct}_{i,p}(t) := \{j \in \mathcal{M}_{-i} : N^{tr,i}_{j,p}(t) \leq D_2(t)\}, \qquad (3)$$

where D2(t) is a control function similar to D1(t). The accuracy of the other learners' reward estimates of their own arms increases with D2(t), hence it should be selected to balance the possible reward gain of learner i due to this increase with the reward loss of learner i due to the number of trainings. If this set is non-empty, learner i asks the learners j ∈ M^{ct}_{i,p}(t) to report N^j_p(t). Based on the reported values, it recomputes N^{tr,i}_{j,p}(t) as N^{tr,i}_{j,p}(t) = N^j_p(t) − N^i_{j,p}(t). Using the updated values, learner i identifies the set of under-trained learners:
$$\mathcal{M}^{ut}_{i,p}(t) := \{j \in \mathcal{M}_{-i} : N^{tr,i}_{j,p}(t) \leq D_2(t)\}. \qquad (4)$$

If this set is non-empty, learner i enters the training phase and randomly selects a learner in this set to train it.^9 When M^{ct}_{i,p}(t) or M^{ut}_{i,p}(t) is empty, this implies that there is no under-trained learner, hence learner i checks if there is an under-explored choice. The set of learners that are under-explored by learner i is given by
$$\mathcal{M}^{ue}_{i,p}(t) := \{j \in \mathcal{M}_{-i} : N^i_{j,p}(t) \leq D_3(t)\}, \qquad (5)$$

where D3(t) is also a control function similar to D1(t). If this set is non-empty, learner i enters the exploration phase and randomly selects a choice in this set to explore. Otherwise, learner i enters the exploitation phase, in which it selects the choice with the highest estimated net reward, i.e.,
$$a_i(t) \in \arg\max_{k \in \mathcal{K}_i} \bar{r}^i_{k,p}(t) - d^i_k, \qquad (6)$$

where r̄^i_{k,p}(t) is the sample mean estimate of the rewards learner i observed (not only collected) from choice k by time t, which is computed as follows. For j ∈ M−i, let E^i_{j,p}(t) be the set of rewards collected by learner i at times it selected learner j while learner i's context was in set p, in its exploration and exploitation phases, by time t. For estimating the rewards of its own arms, learner i can also use the rewards obtained by other learners at times they called learner i. In order to take this into account, for f ∈ Fi, let E^i_{f,p}(t) be the set of rewards collected by learner i at times it selected its arm f for its own contexts in set p, union the set of rewards observed by learner i when it selected its arm f for other learners calling it with contexts in set p, by time t. Therefore, the sample mean

^9 Most of the regret bounds proposed in this paper can also be achieved by setting N^{tr,i}_{j,p}(t) to be the number of times learner i trains learner j by time t, without considering other context observations of learner j. However, by recomputing N^{tr,i}_{j,p}(t), learner i can avoid many unnecessary trainings, especially when learner j's own context arrivals are adequate for it to form accurate estimates about its arms for set p, or when learners other than i have already helped learner j build accurate estimates for its arms in set p.


reward of choice k ∈ Ki in set p for learner i is defined as
$$\bar{r}^i_{k,p}(t) = \frac{\sum_{r \in \mathcal{E}^i_{k,p}(t)} r}{|\mathcal{E}^i_{k,p}(t)|}.$$
An important observation is that the computation of r̄^i_{k,p}(t) does not take into account the costs related to selecting choice k. The reward generated by an arm depends only on the context it is selected at, not on the identity of the learner for whom that arm is selected. However, the costs incurred depend on the identity of the learner. Let µ̂^i_{k,p}(t) := r̄^i_{k,p}(t) − d^i_k be the estimated net reward of choice k for set p. Of note, when there is more than one maximizer of (6), one of them is randomly selected. In order to run CLUP, learner i does not need to keep the sets E^i_{k,p}(t) in its memory: r̄^i_{k,p}(t) can be computed by using only r̄^i_{k,p}(t − 1) and the reward at time t.
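The running-average computation mentioned above takes one line per observation; this sketch assumes the illustrative CLUPState layout introduced earlier.

def update_estimate(state, k, p, r):
    # Fold a newly observed reward r into the sample mean of choice k for cell p
    # without storing the reward set itself.
    n = state.N_kp[(k, p)]
    state.rbar[(k, p)] = (state.rbar[(k, p)] * n + r) / (n + 1)
    state.N_kp[(k, p)] = n + 1

def estimated_net_reward(state, k, p, cost):
    # Estimated net reward of choice k for cell p: sample-mean reward minus selection cost.
    return state.rbar[(k, p)] - cost[k]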

The cooperation part of CLUP operates as follows. Let Ci(t) be the set of learners who call learner i at time t. For each j ∈ Ci(t), learner i first checks if it has any under-explored arm f for pj(t), i.e., an f such that N^i_{f,pj(t)}(t) ≤ D1(t). If so, it randomly selects one of its under-explored arms and provides its reward to learner j. Otherwise, it exploits its arm with the highest estimated reward for learner j's context, i.e.,
$$b_{i,j}(t) \in \arg\max_{f \in \mathcal{F}_i} \bar{r}^i_{f,p_j(t)}(t). \qquad (7)$$
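Putting (2)-(7) together, the selection logic of CLUPmax and CLUPcoop can be sketched as follows; D1, D2, D3 are the control functions, cost maps each choice to its d_ik, and the CLUPState fields are the illustrative ones above. The refresh of the training counters from the reported N^j_p values is omitted here for brevity.

import random

def clup_max(state, t, p, own_arms, other_learners, cost, D1, D2, D3):
    # Return (choice, train_flag) for learner i's own context falling in cell p.
    under_explored_arms = [f for f in own_arms if state.N_kp[(f, p)] <= D1(t)]
    if under_explored_arms:                                   # Eq. (2): explore an own arm
        return random.choice(under_explored_arms), False
    under_trained = [j for j in other_learners if state.N_tr[(j, p)] <= D2(t)]
    if under_trained:                                         # Eqs. (3)-(4): train a learner
        return random.choice(under_trained), True
    under_explored = [j for j in other_learners if state.N_kp[(j, p)] <= D3(t)]
    if under_explored:                                        # Eq. (5): explore a learner
        return random.choice(under_explored), False
    choices = list(own_arms) + list(other_learners)           # Eq. (6): exploit
    return max(choices, key=lambda k: state.rbar[(k, p)] - cost[k]), False

def clup_coop(state, t, p, own_arms, D1):
    # Arm selected for a calling learner whose context lies in cell p (Eq. (7)).
    under_explored = [f for f in own_arms if state.N_kp[(f, p)] <= D1(t)]
    if under_explored:
        return random.choice(under_explored)
    return max(own_arms, key=lambda f: state.rbar[(f, p)])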

A. Analysis of the Regret of CLUP

Let $\beta_a := \sum_{t=1}^{\infty} 1/t^a$, and let $\log(\cdot)$ denote the logarithm in base $e$. For each set (hypercube) $p \in \mathcal{P}_T$, let $\overline{\pi}_{f,p} := \sup_{x \in p} \pi_f(x)$ and $\underline{\pi}_{f,p} := \inf_{x \in p} \pi_f(x)$ for $f \in \mathcal{F}$, and $\overline{\mu}^i_{k,p} := \sup_{x \in p} \mu^i_k(x)$ and $\underline{\mu}^i_{k,p} := \inf_{x \in p} \mu^i_k(x)$ for $k \in \mathcal{K}_i$. Let $x^*_p$ be the context at the center (center of symmetry) of the hypercube $p$. We define the optimal choice of learner $i$ for set $p$ as $k^*_i(p) := \arg\max_{k \in \mathcal{K}_i} \mu^i_k(x^*_p)$. When the set $p$ is clear from the context, we will simply denote the optimal choice for set $p$ by $k^*_i$. Let
$$\mathcal{L}^i_p(t) := \left\{ k \in \mathcal{K}_i : \underline{\mu}^i_{k^*_i(p),p} - \overline{\mu}^i_{k,p} > A t^{\theta} \right\}$$
be the set of suboptimal choices for learner $i$ at time $t$, where $\theta < 0$ and $A > 0$ are parameters that are used only in the analysis of the regret and do not need to be known by the learners. First, we will give regret bounds that depend on the values of $\theta$ and $A$, and then we will optimize over these values to find the best bound. Similarly, let
$$\mathcal{F}^j_p(t) := \left\{ f \in \mathcal{F}_j : \underline{\pi}_{f^*_j(p),p} - \overline{\pi}_{f,p} > A t^{\theta} \right\}$$
be the set of suboptimal arms of learner $j$ at time $t$, where $f^*_j(p) := \arg\max_{f \in \mathcal{F}_j} \pi_f(x^*_p)$. When the set $p$ is clear from the context, we will simply write $f^*_j$. The arms in $\mathcal{F}^j_p(t)$ are the ones that learner $j$ should not select when called by another learner.

The regret given in (1) can be written as a sum of three components: $R_i(T) = \mathrm{E}[R^e_i(T)] + \mathrm{E}[R^s_i(T)] + \mathrm{E}[R^n_i(T)]$, where $R^e_i(T)$ is the regret due to trainings and explorations by time $T$, $R^s_i(T)$ is the regret due to suboptimal choice selections in exploitations by time $T$, and $R^n_i(T)$ is the regret due to near optimal choice selections in exploitations by time $T$, all of which are random variables. In the following lemmas we will bound each of these terms separately. The following lemma bounds $\mathrm{E}[R^e_i(T)]$.

Lemma 1: When CLUP is run by all learners with parameters $D_1(t) = t^z \log t$, $D_2(t) = F_{\max} t^z \log t$, $D_3(t) = t^z \log t$ and $m_T = \lceil T^{\gamma} \rceil$,^10 where $0 < z < 1$ and $0 < \gamma < 1/D$, we have
$$\mathrm{E}[R^e_i(T)] \leq 2 \sum_{p=1}^{(m_T)^D} \left(|\mathcal{F}_i| + (M-1)(F_{\max}+1)\right) T^z \log T + 2\left(|\mathcal{F}_i| + 2(M-1)\right)(m_T)^D$$
$$\leq 2^{D+1}\left(|\mathcal{F}_i| + (M-1)(F_{\max}+1)\right) T^{z+\gamma D} \log T + 2^{D+1}\left(|\mathcal{F}_i| + 2(M-1)\right) T^{\gamma D}.$$

Proof: Time slot $t$ is a training or an exploration slot for learner $i$ if and only if $\mathcal{M}^{ut}_{i,p_i(t)}(t) \cup \mathcal{M}^{ue}_{i,p_i(t)}(t) \cup \mathcal{F}^{ue}_{i,p_i(t)}(t) \neq \emptyset$. Hence, for each set in $\mathcal{P}_T$, up to time $T$ there can be at most $\lceil T^z \log T \rceil$ exploration slots in which an arm $f \in \mathcal{F}_i$ is selected by learner $i$, $\lceil F_{\max} T^z \log T \rceil$ training slots in which learner $i$ selects a learner $j \in \mathcal{M}_{-i}$, and $\lceil T^z \log T \rceil$ exploration slots in which learner $i$ selects a learner $j \in \mathcal{M}_{-i}$. The result follows from summing these terms and the fact that $(m_T)^D \leq 2^D T^{\gamma D}$ for any $T \geq 1$. The additional factor of 2 comes from the fact that the realized regret at any time slot can be at most 2.

From Lemma 1, we see that the regret due to explorations is linear in the number of hypercubes $(m_T)^D$, hence exponential in the parameters $\gamma$ and $z$. We conclude that $z$ and $\gamma$ should be small enough to achieve sublinear regret in the exploration steps.

For any $k \in \mathcal{K}_i$ and $p \in \mathcal{P}_T$, the sample mean $\bar{r}^i_{k,p}(t)$ represents a random variable which is the average of the independent samples in the set $\mathcal{E}^i_{k,p}(t)$. Different from classical finite-armed bandit theory [12], these samples are not identically distributed. In order to facilitate our analysis of the regret, we generate two different artificial i.i.d. processes to bound the probabilities related to $\hat{\mu}^i_{k,p}(t) = \bar{r}^i_{k,p}(t) - d^i_k$, $k \in \mathcal{K}_i$. The first one is the best process for learner $i$, in which the rewards are generated according to a bounded i.i.d. process with expected reward $\overline{\mu}^i_{k,p}$; the other one is the worst process for learner $i$, in which the rewards are generated according to a bounded i.i.d. process with expected reward $\underline{\mu}^i_{k,p}$. Let $\mu^{b,i}_{k,p}(z)$ denote the sample mean of the $z$ samples from the best process and $\mu^{w,i}_{k,p}(z)$ denote the sample mean of the $z$ samples from the worst process for learner $i$. We will bound the terms $\mathrm{E}[R^n_i(T)]$ and $\mathrm{E}[R^s_i(T)]$ by using these artificial processes along with the similarity information given in Assumption 1.

Let $\Xi^i_{j,p}(t)$ be the event that a suboptimal arm $f \in \mathcal{F}_j$ is selected by learner $j \in \mathcal{M}_{-i}$ when it is called by learner $i$ for a context in set $p$ for the $t$th time in the exploitation phases of learner $i$. Let $X^i_{j,p}(t)$ denote the random variable which is the number of times learner $j$ selects a suboptimal arm when called by learner $i$ in exploitation slots of learner $i$ when the context is in set $p \in \mathcal{P}_T$, by time $t$. Clearly, we have
$$X^i_{j,p}(t) = \sum_{t'=1}^{|\mathcal{E}^i_{j,p}(t)|} I\big(\Xi^i_{j,p}(t')\big), \qquad (8)$$
where $I(\cdot)$ is the indicator function, which is equal to 1 if the event inside is true and 0 otherwise.

^10 For a number $r \in \mathbb{R}$, let $\lceil r \rceil$ denote the smallest integer that is greater than or equal to $r$.


The following lemma bounds $\mathrm{E}[R^s_i(T)]$.

Lemma 2: When CLUP is run by all learners with parameters $D_1(t) = t^z \log t$, $D_2(t) = F_{\max} t^z \log t$, $D_3(t) = t^z \log t$ and $m_T = \lceil T^{\gamma} \rceil$, where $0 < z < 1$ and $0 < \gamma < 1/D$, given that $2L(\sqrt{D})^{\alpha} t^{-\gamma\alpha} + 6t^{-z/2} \leq At^{\theta}$, we have
$$\mathrm{E}[R^s_i(T)] \leq 2^{D+1}\big(M - 1 + |\mathcal{F}_i|\big)\beta_2 T^{\gamma D} + 2^{D+2}(M-1) F_{\max} \beta_2 T^{\gamma D + z/2}/z.$$

Proof: Consider time $t$. For simplicity of notation, let $p = p_i(t)$. Let
$$\mathcal{W}^i(t) := \left\{ \mathcal{M}^{ut}_{i,p_i(t)}(t) \cup \mathcal{M}^{ue}_{i,p_i(t)}(t) \cup \mathcal{F}^{ue}_{i,p_i(t)}(t) = \emptyset \right\}$$
be the event that learner $i$ exploits at time $t$.

First, we will bound the probability that learner $i$ selects a suboptimal choice in an exploitation slot. Then, using this, we will bound the expected number of times a suboptimal choice is selected by learner $i$ in exploitation slots. Note that every time a suboptimal choice is selected by learner $i$, since $\mu^i_k(x) = \pi_k(x) - d^i_k \in [-1, 1]$ for all $k \in \mathcal{K}_i$, the realized (hence expected) loss is bounded above by 2. Therefore, 2 times the expected number of times a suboptimal choice is chosen in an exploitation slot bounds the regret due to suboptimal choices in exploitation slots. Let $\mathcal{V}^i_{k,p}(t)$ be the event that the suboptimal choice $k \in \mathcal{K}_i$ is chosen at time $t$ by learner $i$ for $p = p_i(t)$. We have
$$R^s_i(T) \leq \sum_{p \in \mathcal{P}_T} \sum_{t=1}^{T} \sum_{k \in \mathcal{L}^i_p(t)} I\big(\mathcal{V}^i_{k,p}(t), \mathcal{W}^i(t)\big).$$
Adopting standard probabilistic notation, for two events $E_1$ and $E_2$, $I(E_1, E_2)$ is equal to $I(E_1 \cap E_2)$. Taking the expectation,
$$\mathrm{E}[R^s_i(T)] \leq \sum_{p \in \mathcal{P}_T} \sum_{t=1}^{T} \sum_{k \in \mathcal{L}^i_p(t)} \mathrm{P}\big(\mathcal{V}^i_{k,p}(t), \mathcal{W}^i(t)\big). \qquad (9)$$

Let $\mathcal{B}^i_{j,p}(t)$ be the event that at most $t^{\phi}$ samples in $\mathcal{E}^i_{j,p}(t)$ are collected from suboptimal arms of learner $j$. For $f \in \mathcal{F}_i$, $\mathcal{B}^i_{f,p}(t)$ is just the universal set. For a set $A$, let $A^c$ denote the complement of that set. For any $k \in \mathcal{K}_i$, we have
$$\begin{aligned}
\{\mathcal{V}^i_{k,p}(t), \mathcal{W}^i(t)\} \subset\;& \left\{\hat{\mu}^i_{k,p}(t) \geq \hat{\mu}^i_{k^*_i,p}(t), \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\right\} \cup \left\{\hat{\mu}^i_{k,p}(t) \geq \hat{\mu}^i_{k^*_i,p}(t), \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)^c\right\} \\
\subset\;& \left\{\hat{\mu}^i_{k,p}(t) \geq \overline{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\right\} \cup \left\{\hat{\mu}^i_{k^*_i,p}(t) \leq \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\right\} \\
&\cup \left\{\hat{\mu}^i_{k,p}(t) \geq \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \overline{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\right\} \cup \mathcal{B}^i_{k,p}(t)^c, \qquad (10)
\end{aligned}$$
for some $H_t > 0$. This implies that

$$\begin{aligned}
\mathrm{P}\big(\mathcal{V}^i_{k,p}(t), \mathcal{W}^i(t)\big) \leq\;& \mathrm{P}\big(\hat{\mu}^i_{k,p}(t) \geq \overline{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) \\
&+ \mathrm{P}\big(\hat{\mu}^i_{k^*_i,p}(t) \leq \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) + \mathrm{P}\big(\mathcal{B}^i_{k,p}(t)^c\big) \\
&+ \mathrm{P}\big(\hat{\mu}^i_{k,p}(t) \geq \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \overline{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big). \qquad (11)
\end{aligned}$$

We have, for any suboptimal choice $k \in \mathcal{L}^i_p(t)$,
$$\begin{aligned}
&\mathrm{P}\big(\hat{\mu}^i_{k,p}(t) \geq \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \overline{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) \\
&\leq \mathrm{P}\Big(\mu^{b,i}_{k,p}\big(|\mathcal{E}^i_{k,p}(t)|\big) \geq \mu^{w,i}_{k^*_i,p}\big(|\mathcal{E}^i_{k^*_i,p}(t)|\big) - t^{\phi-1},\ \mu^{b,i}_{k,p}\big(|\mathcal{E}^i_{k,p}(t)|\big) < \overline{\mu}^i_{k,p} + L\big(\sqrt{D}/m_T\big)^{\alpha} + H_t + t^{\phi-1}, \\
&\qquad\quad \mu^{w,i}_{k^*_i,p}\big(|\mathcal{E}^i_{k^*_i,p}(t)|\big) > \underline{\mu}^i_{k^*_i,p} - L\big(\sqrt{D}/m_T\big)^{\alpha} - H_t,\ \mathcal{W}^i(t)\Big).
\end{aligned}$$

For $k \in \mathcal{L}^i_p(t)$, when
$$2L\big(\sqrt{D}/m_T\big)^{\alpha} + 2H_t + 2t^{\phi-1} - At^{\theta} \leq 0, \qquad (12)$$

the three inequalities given below,
$$\underline{\mu}^i_{k^*_i,p} - \overline{\mu}^i_{k,p} > At^{\theta},$$
$$\mu^{b,i}_{k,p}\big(|\mathcal{E}^i_{k,p}(t)|\big) < \overline{\mu}^i_{k,p} + L\big(\sqrt{D}/m_T\big)^{\alpha} + H_t + t^{\phi-1},$$
$$\mu^{w,i}_{k^*_i,p}\big(|\mathcal{E}^i_{k^*_i,p}(t)|\big) > \underline{\mu}^i_{k^*_i,p} - L\big(\sqrt{D}/m_T\big)^{\alpha} - H_t,$$
together imply that $\mu^{b,i}_{k,p}\big(|\mathcal{E}^i_{k,p}(t)|\big) < \mu^{w,i}_{k^*_i,p}\big(|\mathcal{E}^i_{k^*_i,p}(t)|\big) - t^{\phi-1}$, which implies that
$$\mathrm{P}\big(\hat{\mu}^i_{k,p}(t) \geq \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \overline{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) = 0. \qquad (13)$$

Let $H_t = 2t^{\phi-1}$. A sufficient condition that implies (12) is
$$2L(\sqrt{D})^{\alpha} t^{-\gamma\alpha} + 6t^{\phi-1} \leq At^{\theta}. \qquad (14)$$

Assume that (14) holds for all $t \geq 1$. Using a Chernoff–Hoeffding bound, for any $k \in \mathcal{L}^i_p(t)$, since on the event $\mathcal{W}^i(t)$ we have $|\mathcal{E}^i_{k,p}(t)| \geq t^z \log t$, we obtain
$$\mathrm{P}\big(\hat{\mu}^i_{k,p}(t) \geq \overline{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) \leq \mathrm{P}\big(\mu^{b,i}_{k,p}(|\mathcal{E}^i_{k,p}(t)|) \geq \overline{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t)\big) \leq e^{-2(H_t)^2 t^z \log t} = e^{-8 t^{2\phi-2} t^z \log t}, \qquad (15)$$
and
$$\mathrm{P}\big(\hat{\mu}^i_{k^*_i,p}(t) \leq \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) \leq \mathrm{P}\big(\mu^{w,i}_{k^*_i,p}(|\mathcal{E}^i_{k^*_i,p}(t)|) \leq \underline{\mu}^i_{k^*_i,p} - H_t + t^{\phi-1}, \mathcal{W}^i(t)\big) \leq e^{-2(H_t - t^{\phi-1})^2 t^z \log t} = e^{-2 t^{2\phi-2} t^z \log t}. \qquad (16)$$

In order to bound the regret, we will sum (15) and (16) for all $t$ up to $T$. For the regret to be small, we want the sum to be sublinear in $T$. This holds when $2\phi - 2 + z \geq 0$. We want $z$ to be small since the regret due to explorations increases with $z$, and we also want $\phi$ to be small since we will show that our regret bound increases with $\phi$. Therefore, we set $2\phi - 2 + z = 0$, hence
$$\phi = 1 - z/2. \qquad (17)$$


When (17) holds, we have
$$\mathrm{P}\big(\hat{\mu}^i_{k,p}(t) \geq \overline{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) \leq \frac{1}{t^2}, \qquad (18)$$
and
$$\mathrm{P}\big(\hat{\mu}^i_{k^*_i,p}(t) \leq \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i_{k,p}(t)\big) \leq \frac{1}{t^2}. \qquad (19)$$

Finally, for $f \in \mathcal{F}_i$ we obviously have $\mathrm{P}(\mathcal{B}^i_{f,p}(t)^c) = 0$. We have $\{\mathcal{B}^i_{j,p}(t)^c, \mathcal{W}^i(t)\} = \{X^i_{j,p}(t) \geq t^{\phi}\}$ (recall $X^i_{j,p}(t)$ from (8)). Applying the Markov inequality, we have $\mathrm{P}(\mathcal{B}^i_{j,p}(t)^c, \mathcal{W}^i(t)) \leq \mathrm{E}[X^i_{j,p}(t)]/t^{\phi}$. Recall that $X^i_{j,p}(t) = \sum_{t'=1}^{|\mathcal{E}^i_{j,p}(t)|} I(\Xi^i_{j,p}(t'))$, and
$$\begin{aligned}
\mathrm{P}\big(\Xi^i_{j,p}(t)\big) \leq\;& \sum_{m \in \mathcal{F}^j_p(t)} \mathrm{P}\big(\bar{r}^j_{m,p}(t) \geq \bar{r}^j_{f^*_j,p}(t)\big) \\
\leq\;& \sum_{m \in \mathcal{F}^j_p(t)} \Big( \mathrm{P}\big(\bar{r}^j_{m,p}(t) \geq \overline{\pi}_{m,p} + H_t, \mathcal{W}^i(t)\big) + \mathrm{P}\big(\bar{r}^j_{f^*_j,p}(t) \leq \underline{\pi}_{f^*_j,p} - H_t, \mathcal{W}^i(t)\big) \\
&\quad + \mathrm{P}\big(\bar{r}^j_{m,p}(t) \geq \bar{r}^j_{f^*_j,p}(t),\ \bar{r}^j_{m,p}(t) < \overline{\pi}_{m,p} + H_t,\ \bar{r}^j_{f^*_j,p}(t) > \underline{\pi}_{f^*_j,p} - H_t, \mathcal{W}^i(t)\big) \Big).
\end{aligned}$$

When (14) holds, since $\phi = 1 - z/2$, the last probability in the sum above is equal to zero, while the first two probabilities are upper bounded by $e^{-2(H_t)^2 t^z \log t}$. This is due to the training phase of CLUP, by which it is guaranteed that every learner samples each of its own arms at least $t^z \log t$ times before learner $i$ starts forming estimates about learner $j$. Therefore, we have
$$\mathrm{P}\big(\Xi^i_{j,p}(t)\big) \leq \sum_{m \in \mathcal{F}^j_p(t)} 2 e^{-2(H_t)^2 t^z \log t} \leq 2|\mathcal{F}_j|/t^2.$$
These together imply that $\mathrm{E}[X^i_{j,p}(t)] \leq \sum_{t'=1}^{\infty} \mathrm{P}(\Xi^i_{j,p}(t')) \leq 2|\mathcal{F}_j| \sum_{t'=1}^{\infty} 1/t^2$. Therefore, from the Markov inequality we get
$$\mathrm{P}\big(\mathcal{B}^i_{j,p}(t)^c, \mathcal{W}^i(t)\big) = \mathrm{P}\big(X^i_{j,p}(t) \geq t^{\phi}\big) \leq \frac{2|\mathcal{F}_j|\beta_2}{t^{1-z/2}}. \qquad (20)$$
Then, using (13), (18), (19) and (20), we have $\mathrm{P}\big(\mathcal{V}^i_{j,p}(t), \mathcal{W}^i(t)\big) \leq 2/t^2 + (2|\mathcal{F}_j|\beta_2)/t^{1-z/2}$ for any $j \in \mathcal{M}_{-i}$, and $\mathrm{P}\big(\mathcal{V}^i_{f,p}(t), \mathcal{W}^i(t)\big) \leq 2/t^2$ for any $f \in \mathcal{F}_i$. By (9), and by the result of Appendix A, we get the stated bound for $\mathrm{E}[R^s_i(T)]$.

From Lemma 2, we see that the regret increases exponentially with the parameters $\gamma$ and $z$, similar to the result of Lemma 1. These two lemmas suggest that $\gamma$ and $z$ should be as small as possible, provided that the condition $2L(\sqrt{D})^{\alpha} t^{-\gamma\alpha} + 6t^{-z/2} \leq At^{\theta}$ is satisfied.

Each time learner $i$ calls learner $j$, learner $j$ selects one of its own arms in $\mathcal{F}_j$. There is a positive probability that learner $j$ will select one of its suboptimal arms, which implies that even if learner $j$ is near optimal for learner $i$, selecting learner $j$ may not yield a near optimal outcome. We need to take this into account in order to bound $\mathrm{E}[R^n_i(T)]$.

The next lemma bounds the expected number of such events.

Lemma 3: When CLUP is run by all learners with parameters $D_1(t) = t^z \log t$, $D_2(t) = F_{\max} t^z \log t$, $D_3(t) = t^z \log t$ and $m_T = \lceil T^{\gamma} \rceil$, where $0 < z < 1$ and $0 < \gamma < 1/D$, given that $2L(\sqrt{D})^{\alpha} t^{-\gamma\alpha} + 6t^{-z/2} \leq At^{\theta}$, we have
$$\mathrm{E}[X^i_{j,p}(t)] \leq 2F_{\max}\beta_2,$$
for $j \in \mathcal{M}_{-i}$.

Proof: We have $X^i_{j,p}(t) = \sum_{t'=1}^{|\mathcal{E}^i_{j,p}(t)|} I(\Xi^i_{j,p}(t'))$, and
$$\begin{aligned}
\mathrm{P}\big(\Xi^i_{j,p}(t)\big) \leq\;& \sum_{m \in \mathcal{F}^j_p(t)} \mathrm{P}\big(\bar{r}^j_{m,p}(t) \geq \bar{r}^j_{f^*_j,p}(t)\big) \\
\leq\;& \sum_{m \in \mathcal{F}^j_p(t)} \Big( \mathrm{P}\big(\bar{r}^j_{m,p}(t) \geq \overline{\pi}_{m,p} + H_t, \mathcal{W}^i(t)\big) + \mathrm{P}\big(\bar{r}^j_{f^*_j,p}(t) \leq \underline{\pi}_{f^*_j,p} - H_t, \mathcal{W}^i(t)\big) \\
&\quad + \mathrm{P}\big(\bar{r}^j_{m,p}(t) \geq \bar{r}^j_{f^*_j,p}(t),\ \bar{r}^j_{m,p}(t) < \overline{\pi}_{m,p} + H_t,\ \bar{r}^j_{f^*_j,p}(t) > \underline{\pi}_{f^*_j,p} - H_t, \mathcal{W}^i(t)\big) \Big).
\end{aligned}$$
Let $H_t = 2t^{-z/2}$. Similar to the proof of Lemma 2, the last probability in the sum above is equal to zero, while the first two probabilities are upper bounded by $e^{-2(H_t)^2 t^z \log t}$. Therefore, we have $\mathrm{P}\big(\Xi^i_{j,p}(t)\big) \leq \sum_{m \in \mathcal{F}^j_p(t)} 2 e^{-2(H_t)^2 t^z \log t} \leq 2|\mathcal{F}_j|/t^2$. These together imply that $\mathrm{E}[X^i_{j,p}(t)] \leq \sum_{t'=1}^{\infty} \mathrm{P}(\Xi^i_{j,p}(t')) \leq 2|\mathcal{F}_j| \sum_{t'=1}^{\infty} 1/t^2 = 2|\mathcal{F}_j|\beta_2 \leq 2F_{\max}\beta_2$.

We will use Lemma 3 in the following lemma to bound $\mathrm{E}[R^n_i(T)]$.

Lemma 4: When CLUP is run by all learners with parameters $D_1(t) = t^z \log t$, $D_2(t) = F_{\max} t^z \log t$, $D_3(t) = t^z \log t$ and $m_T = \lceil T^{\gamma} \rceil$, where $0 < z < 1$ and $0 < \gamma < 1/D$, given that $2L(\sqrt{D})^{\alpha} t^{-\gamma\alpha} + 6t^{-z/2} \leq At^{\theta}$, we have
$$\mathrm{E}[R^n_i(T)] \leq \frac{2AT^{1+\theta}}{1+\theta} + 4(M-1)F_{\max}\beta_2.$$

Proof: If a near optimal arm of learner $i$ (i.e., an arm in $\mathcal{F}_i$ that is not in $\mathcal{L}^i_p(t)$) is chosen by learner $i$ at time $t$, the contribution to the regret is at most $At^{\theta}$. If a near optimal learner $j \in \mathcal{M}_{-i}$ (i.e., one not in $\mathcal{L}^i_p(t)$) is called by learner $i$ at time $t$, and if learner $j$ selects one of its near optimal arms (i.e., an arm not in $\mathcal{F}^j_p(t)$), then the contribution to the regret is at most $2At^{\theta}$. Therefore, the total regret due to near optimal choices of learner $i$ by time $T$ is upper bounded by
$$2A\sum_{t=1}^{T} t^{\theta} \leq \frac{2AT^{1+\theta}}{1+\theta},$$
using the result in Appendix A. Each time a near optimal learner $j \in \mathcal{M}_{-i}$ is called in an exploitation step, there is a small probability that the arm selected by learner $j$ is a suboptimal one. By Lemma 3, the expected number of times a suboptimal arm is chosen by learner $j$ for learner $i$ is bounded by $2|\mathcal{F}_j|\beta_2$. For each such choice, the one-slot regret of learner $i$ can be at most 2.

From Lemma 4, we see that the regret due to near optimal choices depends exponentially on θ, which is related to the negatives of γ and z. Therefore, γ and z should be chosen as large as possible to minimize the regret due to near optimal arms.

In the next theorem we bound the regret of learner i by combining the above lemmas.

Theorem 1: When CLUP is run by all learners with parameters $D_1(t) = t^{2\alpha/(3\alpha+D)} \log t$, $D_2(t) = F_{\max} t^{2\alpha/(3\alpha+D)} \log t$, $D_3(t) = t^{2\alpha/(3\alpha+D)} \log t$ and $m_T = \lceil T^{1/(3\alpha+D)} \rceil$, we have
\begin{align*}
R_i(T) \leq\ & T^{\frac{2\alpha+D}{3\alpha+D}} \left( \frac{2(2LD^{\alpha/2}+6)}{(2\alpha+D)/(3\alpha+D)} + 2^{D+1} Z_i \log T \right) \\
& + T^{\frac{\alpha+D}{3\alpha+D}} \frac{2^{D+2}(M-1)F_{\max}\beta_2}{2\alpha/(3\alpha+D)} \\
& + T^{\frac{D}{3\alpha+D}} 2^{D+1}\big(2Z_i\beta_2 + |\mathcal{K}_i|\big) + 4(M-1)F_{\max}\beta_2,
\end{align*}
i.e., $R_i(T) = O\big(MF_{\max}T^{\frac{2\alpha+D}{3\alpha+D}}\big)$, where $Z_i = |\mathcal{F}_i| + (M-1)(F_{\max}+1)$.

Proof: The highest orders of regret come from explorations and near optimal arms, which are $O(T^{\gamma D + z})$ and $O(T^{1+\theta})$, respectively. We need to optimize them with respect to the constraint $2LD^{\alpha/2}t^{-\gamma\alpha} + 6t^{-z/2} \leq At^{\theta}$, which is assumed in Lemmas 2 and 4. The values that minimize the regret for which this constraint holds are $\theta = -z/2$, $\gamma = z/(2\alpha)$, $A = 2LD^{\alpha/2}+6$ and $z = 2\alpha/(3\alpha+D)$. The result follows from summing the bounds in Lemmas 1, 2 and 4.

Remark 1: Although the parameter $m_T$ of CLUP depends on T, and hence we require T as an input to the algorithm, we can make CLUP run independently of the final time T and achieve the same regret bound by using the well-known doubling trick (see, e.g., [6]). Consider phases $\tau \in \{1, 2, \ldots\}$, where each phase has length $2^{\tau}$. We run a new instance of CLUP at the beginning of each phase with time parameter $2^{\tau}$. Then, the regret of this algorithm up to any time T will be $O(T^{(2\alpha+D)/(3\alpha+D)})$. Although the doubling trick works well in theory, CLUP can suffer from cold-start problems. The algorithm we define in the next section will not require T as an input parameter.
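The doubling trick itself is easy to wrap around any horizon-dependent learner. The sketch below is a minimal illustration of Remark 1, assuming a hypothetical `CLUP` implementation exposed through a factory that takes a horizon and an object with a per-slot `step()` method; these names are not from the paper.

```python
# Minimal sketch of the doubling trick in Remark 1 (illustrative only).
# `clup_factory(horizon=...)` and `learner.step()` are assumed interfaces,
# not part of the paper's pseudocode.

def run_with_doubling(clup_factory, total_slots):
    """Restart a fresh CLUP instance at the start of each phase tau,
    where phase tau has length 2**tau, so the final horizon T is never needed."""
    t, tau = 0, 1
    while t < total_slots:
        phase_len = 2 ** tau
        learner = clup_factory(horizon=phase_len)  # new instance with m_T set from 2**tau
        for _ in range(min(phase_len, total_slots - t)):
            learner.step()  # one arrival: training / exploration / exploitation decision
            t += 1
        tau += 1
```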

The regret bound proved in Theorem 1 is sublinear in time, which guarantees convergence in terms of the average reward, i.e., $\lim_{T \to \infty} E[R_i(T)]/T = 0$. For a fixed α, the regret becomes linear in the limit as D goes to infinity. On the contrary, when D is fixed, the regret decreases and, in the limit as α goes to infinity, it becomes $O(T^{2/3})$. This is intuitive, since increasing D means that the dimension of the context increases, and therefore the number of hypercubes to explore increases, while increasing α means that the level of similarity between any two pairs of contexts increases, i.e., knowing the expected reward of arm f in one context yields more information about its accuracy in another context.
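To make the dependence of the time exponent on α and D concrete, the snippet below evaluates $(2\alpha+D)/(3\alpha+D)$ for a few illustrative values; it is a numerical illustration of the discussion above, nothing more.

```python
# Numerical illustration of the Theorem 1 time exponent (2*alpha + D)/(3*alpha + D).

def clup_exponent(alpha, D):
    return (2 * alpha + D) / (3 * alpha + D)

for alpha, D in [(1, 1), (1, 10), (1, 100), (10, 1), (100, 1)]:
    print(f"alpha={alpha:>3}, D={D:>3} -> exponent={clup_exponent(alpha, D):.3f}")
# As D grows with alpha fixed the exponent approaches 1 (linear regret);
# as alpha grows with D fixed it approaches 2/3.
```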

B. Computational complexity of CLUP

For each set $p \in \mathcal{P}_T$, learner i keeps the sample mean of rewards from $|\mathcal{F}_i| + M - 1$ choices, while for a centralized bandit algorithm, the sample mean of the rewards of $|\cup_{j \in \mathcal{M}} \mathcal{F}_j|$ arms needs to be kept in memory. Since the number of sets in $\mathcal{P}_T$ is upper bounded by $2^D T^{D/(3\alpha+D)}$, the memory requirement is upper bounded by $(|\mathcal{F}_i| + M - 1) 2^D T^{D/(3\alpha+D)}$. This means that the memory requirement increases sublinearly in T, and thus, in the limit $T \to \infty$, the required memory goes to infinity. However, CLUP can be modified so that the available memory provides an upper bound on $m_T$; in this case the regret bound given in Theorem 1 may not hold. Also, the actual number of hypercubes with at least one context arrival depends on the context arrival process, and hence can be very small compared to the worst-case scenario. In that case, it is enough to keep the reward estimates for these hypercubes. The following example illustrates that, for a practically reasonable time frame, the memory requirement of a learner is not very high compared to a non-contextual centralized implementation (that uses the partition $\{\mathcal{X}\}$). For example, for $\alpha = 1$, $D = 1$, we have $2^D T^{D/(3\alpha+D)} = 2T^{1/4}$. If learner i learned through $T = 10^8$ samples, and if $M = 100$ and $|\mathcal{F}_j| = 100$ for all $j \in \mathcal{M}$, learner i using CLUP only needs to store at most 40000 sample mean estimates, while a standard bandit algorithm which does not exploit any context information needs to keep 10000 sample mean estimates. Although the memory requirement is 4 times higher than that of a standard bandit algorithm, CLUP is suitable for a distributed implementation: learner i does not require any knowledge about the arms of other learners (except an upper bound on the number of arms), and it is shown to converge to the best distributed solution.
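As a quick sanity check of the numbers in this example, the following snippet reproduces the memory counts for α = 1, D = 1, T = 10^8, M = 100 and |F_j| = 100; it is only illustrative arithmetic.

```python
# Reproduces the memory-requirement arithmetic in the example above.
alpha, D, T, M, F = 1, 1, 10**8, 100, 100

hypercubes = 2**D * round(T ** (D / (3 * alpha + D)))  # 2 * T^{1/4} = 200 sets in P_T
choices_per_set = F + (M - 1)                          # |F_i| + M - 1 = 199 choices
clup_estimates = hypercubes * choices_per_set          # 39800, i.e., at most ~40000
centralized_estimates = M * F                          # 10000 arms, one estimate each

print(clup_estimates, centralized_estimates)
```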

V. A DISTRIBUTED ADAPTIVE CONTEXT PARTITIONING ALGORITHM

Intuitively, the loss due to selecting a suboptimal choice for a context can be further minimized if the learners inspect the regions of $\mathcal{X}$ with a large number of context arrivals more carefully, instead of using a uniform partition of $\mathcal{X}$. We do this by introducing the Distributed Context Zooming Algorithm (DCZA).

A. The DCZA algorithm

In the previous section, the partition $\mathcal{P}_T$ is formed by CLUP at the beginning, by choosing the slicing parameter $m_T$. In contrast, DCZA adaptively generates the partition based on how contexts arrive. Similar to CLUP, using DCZA a learner forms reward estimates for each set in its partition based only on the history related to that set. Let $\mathcal{P}_i(t)$ be learner i's partition of $\mathcal{X}$ at time t, and let $p_i(t)$ denote the set in $\mathcal{P}_i(t)$ that contains $x_i(t)$. Using DCZA, learner i starts with $\mathcal{P}_i(1) = \{\mathcal{X}\}$, then divides $\mathcal{X}$ into sets with smaller sizes as time goes on and more contexts arrive. Hence the cardinality of $\mathcal{P}_i(t)$ increases with t. This division is done in a systematic way to ensure that the tradeoff between the variation of expected choice rewards inside each set and the number of past observations that are used in reward estimation for each set is balanced. As a result, the regions of the context space with many context arrivals are covered with sets of smaller sizes than regions of the context space with few context arrivals. In other words, DCZA zooms into the regions of the context space with a large number of arrivals. An illustration showing how the partitions of CLUP and DCZA differ is given in Fig. 5 for D = 1. As we discussed in Section II, the zooming idea has been used in a variety of multi-armed bandit problems [4]–[9], but there are differences in the problem structure and in how the zooming is done.

The sets in the adaptive partition of each learner are chosen from hypercubes with edge lengths coming from the set $\{1, 1/2, 1/2^2, \ldots\}$.^{11} We call a D-dimensional hypercube which has edges of length $2^{-l}$ a level l hypercube (or level l set). For a hypercube p, let $l(p)$ denote its level. Different from CLUP, the partition of each learner in DCZA can be different, since the context arrivals to the learners can be different.

^{11} Hypercubes have advantages in cooperative contextual bandits because they are disjoint, and a learner can pass information to another learner about its partition by only passing the center and edge length of its hypercubes.


In order to help each other, learners should know about each other's partitions. For this, whenever a new set of hypercubes is activated by learner i, learner i communicates this by sending the center and edge length of one of the hypercubes in the new set to the other learners. Based on this information, the other learners update their partition of learner i. Thus, at any time slot t all learners know $\mathcal{P}_i(t)$. This does not require a learner to keep M different partitions. It is enough for each learner to keep $\mathcal{P}(t) := \bigcup_{i \in \mathcal{M}} \mathcal{P}_i(t)$, which is the set of hypercubes that are active for at least one learner at time t. For $p \in \mathcal{P}(t)$, let $\tau(p)$ be the first time p is activated by one of the learners, and for $p \in \mathcal{P}_i(t)$, let $\tau_i(p)$ be the first time p is activated for learner i's partition. We will describe the activation process later, after defining the counters of DCZA, which are initialized and updated differently from CLUP.

$N^i_p(t)$, $p \in \mathcal{P}_i(t)$, counts the number of context arrivals to set p of learner i (from its own contexts) from times $\{\tau_i(p), \ldots, t-1\}$. For $f \in \mathcal{F}_i$, $N^i_{f,p}(t)$ counts the number of times arm f is selected in response to contexts arriving to set $p \in \mathcal{P}(t)$ (from learner i's own contexts or the contexts of calling learners) from times $\{\tau(p), \ldots, t-1\}$. Similarly, $N^{\mathrm{tr},i}_{j,p}(t)$, $p \in \mathcal{P}_i(t)$, is an estimate of the context arrivals to learner j in set p from all learners, except the training phases of learner j and the exploration and exploitation phases of learner i, from times $\{\tau(p), \ldots, t-1\}$. Finally, $N^i_{j,p}(t)$ counts the number of context arrivals to learner j from the exploration and exploitation phases of learner i from times $\{\tau_i(p), \ldots, t-1\}$. Let $\mathcal{E}^i_{f,p}(t)$, $f \in \mathcal{F}_i$, be the set of rewards (received or observed) by learner i at times that contribute to the increase of counter $N^i_{f,p}(t)$, and let $\mathcal{E}^i_{j,p}(t)$, $j \in \mathcal{M}_{-i}$, be the set of rewards received by learner i at times that contribute to the increase of counter $N^i_{j,p}(t)$. We have $\bar{r}^i_{k,p}(t) = \big(\sum_{r \in \mathcal{E}^i_{k,p}(t)} r\big)/|\mathcal{E}^i_{k,p}(t)|$ for $k \in \mathcal{K}_i$. The control functions $D_1(t)$, $D_2(t)$ and $D_3(t)$ used by DCZA are exactly the same as CLUP's control functions.

Learner i updates its partition $\mathcal{P}_i(t)$ as follows. At the end of each time slot t, learner i checks if $N^i_{p_i(t)}(t+1)$ exceeds a threshold $B 2^{\rho l(p_i(t))}$, where B and ρ are parameters of DCZA that are common to all learners. If $N^i_{p_i(t)}(t+1) \geq B 2^{\rho l(p_i(t))}$, learner i divides $p_i(t)$ into $2^D$ level $l(p_i(t))+1$ hypercubes and notifies the other learners about its new partition $\mathcal{P}_i(t+1)$. With this division, $p_i(t)$ is de-activated for learner i's partition. For a set p, let $\tau^{\mathrm{fin}}_i(p)$ be the time it is de-activated for learner i's partition.
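The splitting rule above is simple to implement. The sketch below maintains a learner's partition as a list of hypercubes, each identified by its level and lower corner, and splits a hypercube into its $2^D$ children once its arrival counter reaches $B\,2^{\rho l}$. It is a minimal illustration of the rule, not the paper's pseudocode; the class and function names are our own.

```python
import itertools

class Hypercube:
    """A level-l hypercube of edge length 2**-l, identified by its lower corner."""
    def __init__(self, level, corner):
        self.level = level
        self.corner = tuple(corner)   # lower corner coordinates in [0, 1]^D
        self.count = 0                # N^i_p: context arrivals since activation

    def contains(self, x):
        edge = 2.0 ** -self.level
        return all(c <= xi < c + edge for c, xi in zip(self.corner, x))

    def children(self):
        """Split into 2**D hypercubes of level l+1."""
        half = 2.0 ** -(self.level + 1)
        return [Hypercube(self.level + 1,
                          [c + o * half for c, o in zip(self.corner, offs)])
                for offs in itertools.product([0, 1], repeat=len(self.corner))]

def update_partition(partition, x, B=1.0, rho=2.0):
    """Record context x and apply the DCZA-style splitting rule (illustrative)."""
    p = next(h for h in partition if h.contains(x))
    p.count += 1
    if p.count >= B * 2 ** (rho * p.level):      # threshold B * 2^{rho * l(p)}
        partition.remove(p)                      # de-activate p ...
        partition.extend(p.children())           # ... and activate its 2^D children
    return partition
```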

Similar to CLUP, DCZA also has maximization and cooperation parts. The maximization part of DCZA is the same as that of CLUP, with training, exploration and exploitation phases. The only differences are that the phase to enter is determined by comparing the counters defined above with the control functions, and that in the exploitation phase the best choice is selected based on the sample mean estimates defined above. In the cooperation part at time t, learner i explores one of its under-explored arms or chooses its best arm for $p_j(t)$ for each learner $j \in \mathcal{C}_i(t)$, using the counters and sample mean estimates defined above. Since the operation of DCZA is the same as CLUP except for the differences mentioned in this section, we omit its pseudocode to avoid repetition.

Fig. 5. An illustration showing how the partition of DCZA differs from the partition of CLUP for D = 1. As contexts arrive, DCZA zooms into regions with a high number of context arrivals.

B. Analysis of the regret of DCZA

Our analysis for CLUP in Section IV was for worst-case context arrivals. This means that the bound in Theorem 1 holds even when other learners never call learner i to train it, or when other learners never learn by themselves. In this section, we analyze the regret of DCZA under different types of context arrivals, which are given in the following definition.

Definition 1: We call the context arrival process $x_i(1), \ldots, x_i(T)$ uniform arrivals if the minimum distance between any two contexts for learner i is $T^{-1/D}$, and dense arrivals if all contexts for learner i lie in the same level $\lceil (\log_2 T)/\rho \rceil$ hypercube (the highest possible level of a DCZA hypercube at time T; see Lemma 5). We call the context arrival process solo arrivals if contexts only arrive to learner i, and identical arrivals if $x_i(t) = x_j(t)$ for all $i, j \in \mathcal{M}$, $t = 1, \ldots, T$. We define the following four cases to capture the extreme points of operation of DCZA:
• C1: uniform and solo arrivals to learner i.
• C2: uniform and identical arrivals.
• C3: dense and solo arrivals to learner i.
• C4: dense and identical arrivals.
We start with a simple lemma which gives an upper bound on the level of the highest level hypercube that is active at any time t.

Lemma 5: All the active hypercubes $p \in \mathcal{P}(t)$ at time t have at most a level of $(\log_2 t)/\rho + 1$.

Proof: Let $l' + 1$ be the level of the highest level active hypercube. We must have $B \sum_{l=0}^{l'} 2^{\rho l} < t$; otherwise the highest level active hypercube's level would be less than $l' + 1$. For $t/B > 1$, we have
\begin{equation*}
B \frac{2^{\rho(l'+1)} - 1}{2^{\rho} - 1} < t \Rightarrow 2^{\rho l'} < \frac{t}{B} \Rightarrow l' < \frac{\log_2 t}{\rho}.
\end{equation*}
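As a quick numerical check of Lemma 5, the snippet below computes, for illustrative values of B and ρ (not prescribed by the paper), the minimum number of arrivals needed to activate a level-l hypercube (every ancestor must first collect its $B\,2^{\rho k}$ arrivals) and verifies that this level never exceeds $(\log_2 t)/\rho + 1$.

```python
import math

# Numerical check of Lemma 5: a level-l hypercube can only become active after
# its ancestors at levels 0,...,l-1 have each collected B * 2^{rho*k} arrivals.
B, rho = 1.0, 2.0   # illustrative parameter values

for level in range(1, 15):
    t_min = sum(B * 2 ** (rho * k) for k in range(level))  # arrivals needed to reach `level`
    bound = math.log2(t_min) / rho + 1
    assert level <= bound, (level, bound)
print("Lemma 5 bound holds for all tested levels")
```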

In order to analyze the regret of DCZA, we first bound the regret in each level l hypercube. We do this for the solo and identical context arrival cases separately. The following lemma bounds the regret due to trainings and explorations in a level l hypercube.

Lemma 6: When DCZA is run by all learners with parameters $D_1(t) = D_3(t) = t^z \log t$ and $D_2(t) = F_{\max} t^z \log t$, for any level l hypercube the regret of learner i due to trainings and explorations by time t is bounded above by (i) $2(|\mathcal{F}_i| + (M-1)(F_{\max}+1))(t^z \log t + 1)$ for solo context arrivals, and (ii) $2(|\mathcal{F}_i| + (M-1))(t^z \log t + 1)$ for identical context arrivals (given $|\mathcal{F}_i| \geq |\mathcal{F}_j|$, $j \in \mathcal{M}_{-i}$).

Proof: The proof is similar to that of Lemma 1. Note that when the context arriving to each learner is the same and $|\mathcal{F}_i| \geq |\mathcal{F}_j|$, $j \in \mathcal{M}_{-i}$, we have $N^{\mathrm{tr},i}_{j,p}(t) > D_2(t)$ for all $j \in \mathcal{M}_{-i}$ whenever $N^i_{f,p}(t) > D_1(t)$ for all $f \in \mathcal{F}_i$. The multiplicative factor of 2 comes from the bounded rewards and costs.

From Lemma 6, it can be seen that the regret due to explorations increases exponentially with z for each hypercube. We define the set of suboptimal choices and arms for learner i in DCZA a little differently than in CLUP (suboptimality depends on the level of the hypercube but not on time), using the same notation as in the analysis of CLUP. Let
\begin{equation}
\mathcal{L}^i_p := \left\{ k \in \mathcal{K}_i : \mu_{k^*_i(p),p} - \mu_{k,p} > ALD^{\alpha/2}2^{-l(p)\alpha} \right\}, \tag{21}
\end{equation}
be the set of suboptimal choices of learner i for a hypercube p, and
\begin{equation}
\mathcal{F}^j_p := \left\{ f \in \mathcal{F}_j : \pi_{f^*_j(p),p} - \pi_{f,p} > ALD^{\alpha/2}2^{-l(p)\alpha} \right\}, \tag{22}
\end{equation}
be the set of suboptimal arms of learner j for hypercube p, where $A > 0$.

In the next lemma, we bound the regret due to choosing a suboptimal action in the exploitation steps in a level l hypercube.

Lemma 7: Let $A = 12/(LD^{\alpha/2}2^{-\alpha}) + 2$ in (21) and (22). When DCZA is run by all learners with parameters $\rho > 0$, $2\alpha/\rho \leq z < 1$, $D_1(t) = D_3(t) = t^z \log t$ and $D_2(t) = F_{\max} t^z \log t$, for any hypercube p, the regret of learner i from selecting suboptimal choices in its exploitation phases for contexts in p at times $\tau_i(p), \ldots, \tau^{\mathrm{fin}}_i(p)$, i.e., $E[R^s_{i,p}(T)]$, is bounded above by $4\beta_2|\mathcal{F}_i| + 8(M-1)F_{\max}\beta_2 T^{z/2}/z$.

Proof: The proof of this lemma is similar to the proof of Lemma 2, thus some steps are omitted. Similar to CLUP, in DCZA the event that learner i exploits at time t is given by $W^i(t) := \{\mathcal{M}^{\mathrm{ut}}_{i,p_i(t)}(t) \cup \mathcal{M}^{\mathrm{ue}}_{i,p_i(t)}(t) \cup \mathcal{F}^{\mathrm{ue}}_{i,p_i(t)}(t) = \emptyset\}$. First, we will bound the probability that learner i selects a suboptimal choice in $\mathcal{L}^i_{p_i(t)}$ in an exploitation slot at time t, and then use this to bound the expected number of times a suboptimal choice is selected. Recall that the loss in every step can be at most 2. When the time is clear from the context, we use p to denote $p_i(t)$. Denoting by $V^i_{k,p}(t)$ the event that suboptimal choice k is selected by learner i at time t, we have $E[R^s_{i,p}(T)] \leq \sum_{t=1}^{T} \sum_{k \in \mathcal{L}^i_p} P(V^i_{k,p}(t), W^i(t))$. Let $B^i_{j,p}(t)$ be the event that at most $t^{\phi}$ samples in $\mathcal{E}^i_{j,p}(t)$ are collected from suboptimal arms of learner j in $\mathcal{F}^j_p$. For $f \in \mathcal{F}_i$, let $B^i_{f,p}(t)$ be an event that holds with probability 1. We have
\begin{align}
P(V^i_{k,p}(t), W^i(t)) &\leq P\big(\bar{\mu}^{b,i}_{k,p}(N^i_{k,p}(t)) \geq \mu^i_{k,p} + H_t, W^i(t)\big) \nonumber \\
&\quad + P\Big(\bar{\mu}^{b,i}_{k,p}(N^i_{k,p}(t)) \geq \bar{\mu}^{w,i}_{k^*_i,p}(N^i_{k^*_i,p}(t)) - 2t^{\phi-1},\ \bar{\mu}^{b,i}_{k,p}(N^i_{k,p}(t)) < \mu^i_{k,p} + LD^{\alpha/2}2^{-l(p)\alpha} + H_t + 2t^{\phi-1}, \nonumber \\
&\qquad\quad \bar{\mu}^{w,i}_{k^*_i,p}(N^i_{k^*_i,p}(t)) > \mu^i_{k^*_i,p} - LD^{\alpha/2}2^{-l(p)\alpha} - H_t,\ W^i(t)\Big) \tag{23} \\
&\quad + P\big(\bar{\mu}^{w,i}_{k^*_i,p}(N^i_{k^*_i,p}(t)) \leq \mu^i_{k^*_i,p} - H_t + 2t^{\phi-1}, W^i(t)\big) + P\big((B^i_{k,p}(t))^c\big), \nonumber
\end{align}
where $H_t > 0$. In order to make the probability in (23) equal to 0, we need
\begin{equation}
4t^{\phi-1} + 2H_t \leq (A-2)LD^{\alpha/2}2^{-l(p)\alpha}. \tag{24}
\end{equation}
By Lemma 5, (24) holds when
\begin{equation}
4t^{\phi-1} + 2H_t \leq (A-2)LD^{\alpha/2}2^{-\alpha}t^{-\alpha/\rho}. \tag{25}
\end{equation}
For $H_t = 4t^{\phi-1}$, $\phi = 1 - z/2$, $z \geq 2\alpha/\rho$ and $A = 12/(LD^{\alpha/2}2^{-\alpha}) + 2$, (25) holds, by which (23) is equal to zero. Also, by using a Chernoff-Hoeffding bound, we can show that $P\big(\bar{\mu}^{b,i}_{k,p}(N^i_{k,p}(t)) \geq \mu^i_{k,p} + H_t, W^i(t)\big) \leq 1/t^2$, and $P\big(\bar{\mu}^{w,i}_{k^*_i,p}(N^i_{k^*_i,p}(t)) \leq \mu^i_{k^*_i,p} - H_t + 2t^{\phi-1}, W^i(t)\big) \leq e^{-2(4\log t)} \leq 1/t^2$. We also have $P(B^i_{f,p}(t)^c) = 0$ for $f \in \mathcal{F}_i$ and $P(B^i_{j,p}(t)^c) \leq E[X^i_{j,p}(t)]/t^{\phi} \leq 2F_{\max}\beta_2 t^{z/2-1}$ for $j \in \mathcal{M}_{-i}$. Combining all of these, we get $P(V^i_{f,p}(t), W^i(t)) \leq 2/t^2$ for $f \in \mathcal{F}_i$, and $P(V^i_{j,p}(t), W^i(t)) \leq 2/t^2 + 2F_{\max}\beta_2 t^{z/2-1}$ for $j \in \mathcal{M}_{-i}$. These together imply that $E[R^s_{i,p}(T)] \leq 4\beta_2|\mathcal{F}_i| + 8(M-1)F_{\max}\beta_2 T^{z/2}/z$.

From Lemma 7, we see that the regret due to suboptimal choice selections in exploitation steps increases with z for each hypercube. In the next lemma, we bound the regret of learner i due to selecting near optimal choices in a hypercube.

Lemma 8: Let $A = 12/(LD^{\alpha/2}2^{-\alpha}) + 2$ in (21) and (22). When DCZA is run by all learners with parameters $\rho > 0$, $2\alpha/\rho \leq z < 1$, $D_1(t) = D_3(t) = t^z \log t$ and $D_2(t) = F_{\max} t^z \log t$, for any hypercube p, the regret of learner i due to selecting near optimal choices in exploitation phases at times $\tau_i(p), \ldots, \tau^{\mathrm{fin}}_i(p)$, i.e., $E[R^n_{i,p}(T)]$, is bounded above by
\begin{equation*}
2BALD^{\alpha/2}2^{(\rho-\alpha)l(p)} + 2(M-1)F_{\max}\beta_2.
\end{equation*}
Proof: Consider hypercube p. Similar to the proof of Lemma 3, we have $E[X^i_{j,p}(t)] \leq 2F_{\max}\beta_2$. Thus, when a near optimal learner $j \in \mathcal{M}_{-i}$ is called, the contribution to the regret from suboptimal arms of j is bounded by $4F_{\max}\beta_2$. The one-slot regret of any near optimal arm of any near optimal learner $j \in \mathcal{M}_{-i}$ is bounded by $2ALD^{\alpha/2}2^{-l(p)\alpha}$. The one-slot regret of any near optimal arm $f \in \mathcal{F}_i$ is bounded by $ALD^{\alpha/2}2^{-l(p)\alpha}$. Since p remains active for learner i's partition for at most $B2^{\rho l(p)}$ context arrivals to p, we have $E[R^n_{i,p}(T)] \leq 2BALD^{\alpha/2}2^{(\rho-\alpha)l(p)} + 2(M-1)F_{\max}\beta_2$.

From Lemma 8, we see that the time order of the regret due to choosing near optimal choices in each hypercube increases with the parameter ρ, which determines how long the hypercube remains active, and decreases with α.

Next, we combine the results from Lemmas 6, 7 and 8 to obtain our regret bounds for the cases given in Definition 1. All these lemmas bound the regret for a single hypercube. The bounds in Lemmas 6 and 7 are independent of the level of the hypercube, while the bound in Lemma 8 depends on the level of the hypercube. We can also derive a level-independent bound for $E[R^n_{i,p}(T)]$, but we get a tighter regret bound by using the level-dependent bound. In order to get the desired regret bound, we need to consider how many hypercubes of each level are formed by DCZA up to time T. The number of such hypercubes explicitly depends on the context arrival process. Therefore, different from the analysis of CLUP, we examine the regret of DCZA under different assumptions on the context arrivals and on the correlations between the contexts of different learners.

Theorem 2: Let $A = 12/(LD^{\alpha/2}2^{-\alpha}) + 2$ in (21) and (22), and let DCZA be run by all learners with parameters $\rho = \frac{3\alpha + \sqrt{9\alpha^2 + 8\alpha D}}{2}$, $z = 2\alpha/\rho < 1$, $D_1(t) = D_3(t) = t^z \log t$ and $D_2(t) = F_{\max} t^z \log t$. Then, for C1 (given in Definition 1), we have
\begin{align*}
R_i(T) \leq\ & T^{f_1(\alpha,D)} 2\left(2ABLD^{\alpha/2}2^{D+\rho-\alpha} + 2^{2D} Z_i \log T\right) \\
& + T^{f_2(\alpha,D)} 2^{2D+4}(M-1)F_{\max}\beta_2 \\
& + T^{f_3(\alpha,D)} 2^{2D+1}\left(2(M-1)F_{\max}\beta_2 + Z_i + 4\beta_2|\mathcal{F}_i|\right),
\end{align*}
i.e., $R_i(T) = O\big(MF_{\max}T^{f_1(\alpha,D)}\big)$; for C2, we have
\begin{align*}
R_i(T) \leq\ & T^{f_1(\alpha,D)} 2\left(2ABLD^{\alpha/2}2^{D+\rho-\alpha} + 2^{2D}|\mathcal{K}_i| \log T\right) \\
& + T^{f_2(\alpha,D)} 2^{2D+4}(M-1)F_{\max}\beta_2 \\
& + T^{f_3(\alpha,D)} 2^{2D+1}\left(2(M-1)F_{\max}\beta_2 + |\mathcal{K}_i| + 4\beta_2|\mathcal{F}_i|\right),
\end{align*}
i.e., $R_i(T) = O\big(|\mathcal{K}_i|T^{f_1(\alpha,D)}\big)$; for C3, we have
\begin{align*}
R_i(T) \leq\ & T^{2/3} 2\left(Z_i \log T \frac{\log_2 T}{\rho} + 2ABLD^{\alpha/2}\frac{2^{2(\rho-\alpha)}}{2^{\rho-\alpha}-1}\right) \\
& + T^{1/3} 2^4 (M-1)F_{\max}\beta_2 \left((\log_2 T)/\rho + 1\right) \\
& + \left(Z_i + 4\beta_2|\mathcal{F}_i| + 2(M-1)F_{\max}\beta_2\right)\left((\log_2 T)/\rho + 1\right),
\end{align*}
i.e., $R_i(T) = O\big(MF_{\max}T^{2/3}\big)$; and for C4, we have
\begin{align*}
R_i(T) \leq\ & T^{2/3} 2\left(|\mathcal{K}_i| \log T \frac{\log_2 T}{\rho} + 2ABLD^{\alpha/2}\frac{2^{2(\rho-\alpha)}}{2^{\rho-\alpha}-1}\right) \\
& + T^{1/3} 2^4 (M-1)F_{\max}\beta_2 \left((\log_2 T)/\rho + 1\right) \\
& + \left(|\mathcal{K}_i| + 4\beta_2|\mathcal{F}_i| + 2(M-1)F_{\max}\beta_2\right)\left((\log_2 T)/\rho + 1\right),
\end{align*}
i.e., $R_i(T) = O\big(|\mathcal{K}_i|T^{2/3}\big)$, where
\begin{align*}
Z_i &= |\mathcal{F}_i| + (M-1)(F_{\max}+1), \\
f_1(\alpha,D) &= \frac{D + \frac{\alpha + \sqrt{9\alpha^2+8\alpha D}}{2}}{D + \frac{3\alpha + \sqrt{9\alpha^2+8\alpha D}}{2}}, \\
f_2(\alpha,D) &= \frac{D}{D + \frac{3\alpha+\sqrt{9\alpha^2+8\alpha D}}{2}} + \frac{2\alpha}{3\alpha+\sqrt{9\alpha^2+8\alpha D}}, \\
f_3(\alpha,D) &= \frac{D}{D + \frac{3\alpha+\sqrt{9\alpha^2+8\alpha D}}{2}}.
\end{align*}
For any $\alpha > 0$ and $D \geq 1$, we have $f_1(\alpha,D) > f_2(\alpha,D) > f_3(\alpha,D)$.
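The ordering $f_1 > f_2 > f_3$ claimed at the end of Theorem 2 can be spot-checked numerically; the snippet below evaluates the three exponents on a grid of $(\alpha, D)$ values using the expressions above. This is a numerical illustration only, not a proof.

```python
import math

def exponents(alpha, D):
    """Evaluate f1, f2, f3 from Theorem 2 for given alpha > 0 and D >= 1."""
    s = math.sqrt(9 * alpha**2 + 8 * alpha * D)
    rho = (3 * alpha + s) / 2
    f3 = D / (D + rho)
    f2 = f3 + 2 * alpha / (3 * alpha + s)   # = f3 + z/2, with z = 2*alpha/rho
    f1 = f3 + 4 * alpha / (3 * alpha + s)   # = f3 + z
    return f1, f2, f3

for alpha in (0.5, 1, 2, 5):
    for D in (1, 2, 5, 20):
        f1, f2, f3 = exponents(alpha, D)
        assert f1 > f2 > f3 and f1 < 1, (alpha, D, f1, f2, f3)
print("f1 > f2 > f3 (and f1 < 1) on all tested (alpha, D) pairs")
```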

Proof: Consider C1. It can be shown that in the worst case the highest level hypercube has level at most $1 + \log_{2^{\rho+D}} T$. The total number of hypercubes is bounded by $\sum_{l=0}^{1+\log_{2^{\rho+D}} T} 2^{Dl} \leq 2^{2D} T^{\frac{D}{D+\rho}}$. Using the result in Lemma 8, we calculate the regret from choosing near optimal arms as
\begin{align*}
E[R^n_i(T)] &\leq 2^{2D} T^{\frac{D}{D+\rho}} \big(2(M-1)F_{\max}\beta_2\big) + 2ABLD^{\alpha/2} \sum_{l=0}^{1+\log_{2^{\rho+D}} T} 2^{(D+\rho-\alpha)l} \\
&\leq 2^{2D} T^{\frac{D}{D+\rho}} \big(2(M-1)F_{\max}\beta_2\big) + 2ABLD^{\alpha/2} 2^{2(D+\rho-\alpha)} T^{\frac{D+\rho-\alpha}{D+\rho}}.
\end{align*}
Since the number of hypercubes is $O(T^{\frac{D}{D+\rho}})$, the regret due to explorations is $O(T^{\frac{D}{D+\rho}+z} \log T)$, while the regret due to suboptimal arm selections is $O(T^{\frac{D}{D+\rho}+z})$, for $z \geq 2\alpha/\rho$. These three terms are balanced when $z = 2\alpha/\rho$ and $\frac{D+\rho-\alpha}{D+\rho} = \frac{D}{D+\rho} + z$. Solving for ρ, we get $\rho = (3\alpha + \sqrt{9\alpha^2+8\alpha D})/2$. Substituting these parameters and summing up all the terms, we get the regret bound $O(T^{f_1(\alpha,D)})$. The only difference of C2 from C1 is that the exploration regret is reduced to $|\mathcal{K}_i|(T^z \log T + 1)2^{2D} T^{\frac{D}{D+\rho}}$.

Consider C3. By Lemma 5, the number of activated hypercubes is upper bounded by $\log_2 T/\rho + 1$, and by the property of the context arrivals, all the activated hypercubes have different levels. Using the result in Lemma 8, we calculate the regret from choosing near optimal arms as
\begin{align*}
E[R^n_i(T)] &\leq \big((\log_2 T)/\rho + 1\big)\big(2(M-1)F_{\max}\beta_2\big) + 2ABLD^{\alpha/2} \sum_{l=0}^{\log_2 T/\rho + 1} 2^{(\rho-\alpha)l} \\
&\leq \big((\log_2 T)/\rho + 1\big)\big(2(M-1)F_{\max}\beta_2\big) + 2ABLD^{\alpha/2} 2^{2(\rho-\alpha)} T^{\frac{\rho-\alpha}{\rho}}/(2^{\rho-\alpha}-1).
\end{align*}
Since there are at most $\log_2 T/\rho + 1$ hypercubes by time T, the regret due to trainings, explorations and suboptimal choice selections is bounded by multiplying the regret results in Lemmas 6 and 7 by $\log_2 T/\rho + 1$. All these terms are balanced by setting $z = 2\alpha/\rho$, $\rho = 3\alpha$. Summing all the terms, we get the desired regret bound. The only difference of C4 from C3 is that the exploration regret is reduced to $|\mathcal{K}_i|(T^z \log T + 1)(\log_2 T/\rho + 1)$.

Remark 2: C1–C4 represent the four extreme cases of spatio-temporal dependencies between the contexts. It is guaranteed that the regret bound for C1 holds for every context arrival process.

VI. DISCUSSION

A. Necessity of the Training Phase

In this subsection, we prove that the training phase is necessary to achieve sublinear regret for the cooperative contextual bandit problem. In order to show this, we consider a special case of expected arm rewards and context arrivals and show that, independent of the rate of exploration, the regret of any learning algorithm that separates exploration and exploitation (without the training phase) is linear in time for any exploration control function $D_i(t)$^{12} for learner i (the exploration functions of the learners can be different). Although our proof does not consider index-based learning algorithms, we believe that, similar to our construction in Theorem 3, problem instances which give linear regret can be constructed for any type of index policy without the training phase.

^{12} Here $D_i(t)$ is the control function that controls when to explore or exploit the choices in $\mathcal{K}_i$ for learner i. Here we consider CLUP and DCZA without the training phase, while everything else remains the same.


Theorem 3: Without the training phase, the regret of any algorithm that uses the separation of exploration and exploitation, in which a time slot is either an exploration slot or an exploitation slot, is linear in time.

Proof: We will construct a problem instance for which the statement of the theorem holds. Assume that all costs $d^i_k$, $k \in \mathcal{K}_i$, $i \in \mathcal{M}$, are zero. Let $M = 2$. Consider a hypercube p of either CLUP or DCZA. We assume that at all time slots a context $x^* \in p$ arrives to learner 1, and that all the contexts arriving to learner 2 are outside p. Learner 1 has only a single arm m, and learner 2 has two arms b and g. With an abuse of notation, we denote the expected reward of an arm $f \in \{m, b, g\}$ at context $x^*$ by $\pi_f$. Assume that the arm rewards are drawn from $\{0, 1\}$ and that the following holds for the expected arm rewards:
\begin{equation}
\pi_b + C_K\delta < \pi_m < \pi_g - \delta < \pi_m + \delta, \tag{26}
\end{equation}
for some $\delta > 0$, $C_K > 0$, where the value of $C_K$ will be specified later. Assume that learner 1's exploration control function is $D_1(t) = t^z \log t$, and learner 2's exploration control function is $D_2(t) = t^z \log t / K$ for some $K \geq 1$, $0 < z < 1$.

When $K = 1$, when called by learner 1 in its explorations, learner 2 may always choose its suboptimal arm b, since it is under-explored for learner 2. If this happens, then in its exploitations learner 1 will almost always choose its own arm instead of learner 2, because it has estimated the accuracy of learner 2 for $x^*$ incorrectly, as the random rewards in the explorations of learner 2 came from b. By letting $K \geq 1$, we also consider cases where only a fraction of the reward samples of learner 2 for learner 1 comes from the suboptimal arm b. We will show that for any value of $K \geq 1$, there exists a problem instance of the form given in (26) such that learner 1's regret is linear in time. Let $E_t$ be the event that time t is an exploitation slot for learner 1. Let $\bar{\pi}_m(t)$, $\bar{\pi}_2(t)$ be the sample mean reward of arm m and of learner 2 for learner 1 by time t, respectively. Let $\xi_{\tau}$ be the event that learner 1 exploits for the τth time by choosing its own arm. Denote the time of the τth exploitation of learner 1 by $\tau(t)$. We will show that for any finite τ, $P(\xi_{\tau}, \ldots, \xi_1) \geq 1/2$. By the chain rule, we have
\begin{equation}
P(\xi_{\tau}, \ldots, \xi_1) = P(\xi_{\tau}|\xi_{\tau-1}, \ldots, \xi_1) P(\xi_{\tau-1}|\xi_{\tau-2}, \ldots, \xi_1) \cdots P(\xi_1). \tag{27}
\end{equation}

We will continue by bounding $P(\xi_{\tau}|\xi_{\tau-1}, \ldots, \xi_1)$. When the event $E_{\tau(t)} \cap \xi_{\tau-1} \cap \ldots \cap \xi_1$ happens, we know that at least $\lceil \tau(t)^z \log \tau(t) / K \rceil$ of the $\lceil \tau(t)^z \log \tau(t) \rceil$ reward samples of learner 2 for learner 1 come from b. Let $A_t := \{\bar{\pi}_m(t) > \pi_m - \epsilon_1\}$, $B_t := \{\bar{\pi}_2(t) < \pi_g - \epsilon_2\}$ and $C_t := \{\bar{\pi}_2(t) < \bar{\pi}_m(t)\}$, for $\epsilon_1 > 0$, $\epsilon_2 > 0$. Given $\epsilon_2 \geq \epsilon_1 + 2\delta$, we have $(A_t \cap B_t) \subset C_t$. Consider the event $\{A^C_t, E_t\}$. Since on $E_t$ learner 1 has selected m at least $t^z \log t$ times (given that z is large enough such that the reward estimate of learner 1's own arm is accurate), we have $P(A^C_t, E_t) \leq 1/(2t^2)$, using a Chernoff bound. Let $N_g(t)$ ($N_b(t)$) be the number of times learner 2 has chosen arm g (b) when called by learner 1 by time t. Let $r_g(t)$ ($r_b(t)$) be the random reward of arm g (b) when it is chosen for the tth time by learner 2. For $\eta_1 > 0$, $\eta_2 > 0$, let $Z_1(t) := \{(\sum_{t'=1}^{N_g(t)} r_g(t'))/N_g(t) < \pi_g + \eta_1\}$ and $Z_2(t) := \{(\sum_{t'=1}^{N_b(t)} r_b(t'))/N_b(t) < \pi_b + \eta_2\}$. On the event $E_{\tau(t)} \cap \xi_{\tau-1} \cap \ldots \cap \xi_1$, we have $N_g(\tau(t))/N_b(\tau(t)) \leq K$. Since $\bar{\pi}_2(t) = \big(\sum_{t'=1}^{N_b(t)} r_b(t') + \sum_{t'=1}^{N_g(t)} r_g(t')\big)/(N_b(t) + N_g(t))$, we have
\begin{equation}
Z_1(t) \cap Z_2(t) \Rightarrow \bar{\pi}_2(t) < \frac{N_g(t)\pi_g + N_b(t)\pi_b + \eta_1 N_g(t) + \eta_2 N_b(t)}{N_b(t) + N_g(t)}. \tag{28}
\end{equation}
If
\begin{equation}
\pi_g - \pi_b > \frac{N_g(t)}{N_b(t)}(\eta_1 + \epsilon_2) + (\eta_2 + \epsilon_2), \tag{29}
\end{equation}
then it can be shown that the right-hand side of (28) is less than $\pi_g - \epsilon_2$. Thus, given that (29) holds, we have $Z_1(t) \cap Z_2(t) \subset B_t$. But on the event $E_{\tau(t)} \cap \xi_{\tau-1} \cap \ldots \cap \xi_1$, (29) holds at $\tau(t)$ when
\begin{equation*}
\pi_g - \pi_b > K(\eta_1 + \epsilon_2) + (\eta_2 + \epsilon_2).
\end{equation*}
Note that if we take $\epsilon_1 = \eta_1 = \eta_2 = \delta/2$ and $\epsilon_2 = \epsilon_1 + 2\delta = 5\delta/2$, the statement above holds for a problem instance with $C_K > 3K + 2$. Since at any exploitation slot t, at least $\lceil t^z \log t / K \rceil$ samples have been taken by learner 2 from both arms b and g, we have $P(Z_1(\tau(t))^C) \leq 1/(4\tau(t)^2)$ and $P(Z_2(\tau(t))^C) \leq 1/(4\tau(t)^2)$ by a Chernoff bound (again for z large enough, as in the proofs of Theorems 1 and 2). Thus, $P(B_{\tau(t)}^C) \leq P(Z_1(\tau(t))^C) + P(Z_2(\tau(t))^C) \leq 1/(2\tau(t)^2)$. Hence $P(C^C_{\tau(t)}) \leq P(A^C_{\tau(t)}) + P(B^C_{\tau(t)}) \leq 1/\tau(t)^2$, and $P(C_{\tau(t)}) > 1 - 1/\tau(t)^2$. Continuing from (27), we have
\begin{equation}
P(\xi_{\tau}, \ldots, \xi_1) = \big(1 - 1/\tau(t)^2\big)\big(1 - 1/(\tau-1)(t)^2\big) \cdots \big(1 - 1/1(t)^2\big) \geq \prod_{t'=2}^{\tau(t)} \big(1 - 1/t'^2\big) > 1/2, \tag{30}
\end{equation}
for all τ. This result implies that, with probability greater than one half, learner 1 chooses its own arm at all of its exploitation slots, resulting in an expected per-slot regret of $\pi_g - \pi_m > \delta$. Hence the regret is linear in time.
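The last inequality in (30) relies on the elementary fact that the partial products of $\prod_{t' \geq 2}(1 - 1/t'^2)$ telescope to $(n+1)/(2n)$ and hence always exceed $1/2$. A quick numerical confirmation (illustrative only):

```python
# Numerical check of the product bound used in (30):
# prod_{t'=2}^{n} (1 - 1/t'^2) = (n + 1) / (2n) > 1/2 for every finite n.
for n in (2, 5, 10, 100, 10**4):
    prod = 1.0
    for t in range(2, n + 1):
        prod *= 1.0 - 1.0 / t**2
    closed_form = (n + 1) / (2 * n)
    assert abs(prod - closed_form) < 1e-9 and prod > 0.5, (n, prod)
print("Partial products stay above 1/2, approaching 1/2 as n grows")
```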

B. Comparison of CLUP and DCZA

In Table II, we summarize the regret results of CLUP and DCZA for the different cases^{13} given in Definition 1. For the uniform arrivals C1 and C2, the time exponent of the regret approaches one (i.e., linear regret) as D increases. This is intuitive, since the gains of zooming diminish when the context is not concentrated in a region of the space or in a lower dimensional subspace. The time exponent of the regret of DCZA is
\begin{equation*}
\frac{D + \alpha/2 + \sqrt{9\alpha^2+8\alpha D}/2}{D + 3\alpha/2 + \sqrt{9\alpha^2+8\alpha D}/2},
\end{equation*}
while the time exponent of the regret of CLUP is
\begin{equation*}
\frac{D+2\alpha}{D+3\alpha} = \frac{D + \alpha/2 + \sqrt{9\alpha^2}/2}{D + 3\alpha/2 + \sqrt{9\alpha^2}/2}.
\end{equation*}
This is intuitive, since CLUP is designed to capture the worst-case (uniform) arrival process by forming a uniform partition over the context space, while DCZA adaptively learns over time that the best partition over the context space is uniform.

^{13} Due to space limitations, in Theorem 1 we only give the regret bound of CLUP for C1. Bounds for the other cases can be derived similarly to the derivation of the bounds for DCZA.


TABLE II
COMPARISON OF REGRETS OF CLUP AND DCZA

Case  CLUP                              DCZA
C1    $O(MF_{\max}T^{\frac{2\alpha+D}{3\alpha+D}})$   $O(MF_{\max}T^{f_1(\alpha,D)})$
C2    $O(|\mathcal{K}_i|T^{\frac{2\alpha+D}{3\alpha+D}})$   $O(|\mathcal{K}_i|T^{f_1(\alpha,D)})$
C3    $O(MF_{\max}T^{\frac{2\alpha}{3\alpha+D}})$   $O(MF_{\max}T^{2/3})$
C4    $O(|\mathcal{K}_i|T^{\frac{2\alpha}{3\alpha+D}})$   $O(|\mathcal{K}_i|T^{2/3})$

The difference in the regret order between DCZA and CLUP is due to the fact that DCZA starts with a single hypercube and splits it over time to reach the uniform partition, which is optimal for uniform arrivals, while CLUP starts with the uniform partition from the beginning. Note that for α/D small, the regret order of DCZA is very close to the regret order of CLUP. For C2, which is the identical arrivals case, the constant that multiplies the highest order term is $|\mathcal{K}_i|$, while for C1, which is the solo arrivals case, this constant is $|\mathcal{F}_i| + (M-1)(F_{\max}+1)$, which is much larger. For the dense arrivals C3 and C4, the time exponent of the regret of DCZA does not depend on the dimension D of the context space. The difference between the regret terms of C3 and C4 is similar to the difference between C1 and C2.

Next, we assess the computation and memory requirements of DCZA and compare them with those of CLUP. DCZA needs to keep the sample mean reward estimates of the $|\mathcal{K}_i|$ choices for each active hypercube. A level l active hypercube becomes inactive once the number of context arrivals to it exceeds $B2^{\rho l}$. Because of this, the number of active hypercubes at any time T may be much smaller than the number of hypercubes activated by time T. For cases C1 and C2, the maximum number of activated hypercubes for a learner using DCZA is $O\big(T^{D/(D + (3\alpha+\sqrt{9\alpha^2+8\alpha D})/2)}\big)$, while for any D and α, the number of hypercubes of CLUP is upper bounded by $O(T^{D/(D+3\alpha)})$. Also note that DCZA only has to keep the reward estimates for the currently active hypercubes, not for all activated hypercubes; hence, in practice the memory requirement of DCZA can be much smaller than that of CLUP, which keeps the estimates for every hypercube at all times. Under the dense arrivals given in C3 and C4, at any time slot there is only a single active hypercube. Therefore, the memory requirement of DCZA for learner i at any time slot is only $O(|\mathcal{K}_i|)$. Finally, DCZA does not require the final time T as an input, while CLUP requires it. Although CLUP can be combined with the doubling trick to make it independent of T, this makes the constants that multiply the time order of the regret large.

C. Comparison of separation of exploration and exploitation with index-based algorithms

Our algorithms separate training, exploration and exploitation into different time slots. We proved in Theorem 3 that the separation of training, hence not using the rewards obtained during training to update accuracy estimates, is necessary to achieve sublinear regret. However, separating exploration and exploitation into different time slots is not necessary for our results to hold. Indeed, in most of the previous work on multi-armed bandit problems [12] and contextual bandits [4]–[9], index-based algorithms are used to balance exploration and exploitation. These algorithms assign an index to each arm, based only on the history of observations of that arm, and choose the arm with the highest index at each time slot. The index of an arm is usually given as the sum of two terms: the first term is the sample mean of the rewards obtained from the arm, and the second term is an inflation term, which increases with the learner's relative uncertainty about the reward of the arm. When all arms have been observed sufficiently many times, the dominant term becomes the sample mean; hence, at such time slots the arm with the highest estimated reward is selected.

Let us call the algorithms which separate exploration and exploitation separation algorithms. Previously, we considered separation algorithms for learning in multi-user multi-armed bandits [3], and showed that they can achieve the same time order of regret as index-based algorithms. To the best of our knowledge, separation algorithms have not been applied to learning in contextual bandits before.

There are several (practical) advantages of separation algorithms compared to index-based algorithms. Using separation algorithms, a learner can explicitly decide when to explore or exploit, which can be very beneficial in a variety of applications. For example, one application of cooperative contextual bandits is stream mining with active learning [22], in which a learner equipped with a set of classifiers (arms) with unknown accuracies (expected rewards) wants to learn which classifier to select to classify its data stream. The random reward is equal to 1 if the prediction is correct and 0 otherwise. However, checking whether the prediction is correct requires obtaining the label (truth), which may not always be possible or which may be very costly. For instance, if the classifiers are labeling images, then the true label of an image can only be given with the help of a human expert, and calling such an expert is costly to the learner. In this case, the learner must minimize the number of times it obtains the label while almost always choosing the optimal classifier in exploitations. Using separation algorithms, the learner can maximize its total reward without even observing the rewards in exploitation slots. Index-based algorithms do not give such flexibility to the learner.

Another problem where separation algorithms can be useful is stream mining with varying error tolerance. The data instance that comes to learner i at time t can have an error tolerance parameter $err_i(t)$ in addition to its context $x_i(t)$. For an instance with $err_i(t) = 1$ the cost of misclassification can be very high, while for an instance with $err_i(t) = 0$ it can be low. Then, ideally, the learner will try to shift its explorations to instances with $err_i(t) = 0$. Of course, our regret bounds may not directly hold for such a case, and any regret bound is expected to depend on the arrival distribution of instances with different types of error tolerance. We leave this analysis as a future research topic.

D. Non-deterministic control functions

CLUP and DCZA use deterministic control functions to select the phase of a learner. These ensure that all of learner i's arms have been explored at least $D_1(t)$ times, and that all of the other learners have been explored at least $D_3(t)$ times. Intuitively, the number of explorations can be reduced if a learner's exploration rate depends on the estimated suboptimality of a choice. Let $a^*_i(t) := \arg\max_{k \in \mathcal{K}_i}(\bar{r}^i_{k,p_i(t)}(t) - d^i_k)$ be the estimated optimal choice at time t. The estimated suboptimality of a choice $k \in \mathcal{K}_i - a^*_i(t)$ is defined as
\begin{equation*}
\Delta^i_k(t) := \big(\bar{r}^i_{a^*_i(t),p_i(t)}(t) - d^i_{a^*_i(t)}\big) - \big(\bar{r}^i_{k,p_i(t)}(t) - d^i_k\big).
\end{equation*}
Then, one example of a set of non-deterministic control functions is $D^i_{k,p_i(t)}(t) = t^z \log t / \Delta^i_k(t)$ for $k \in \mathcal{K}_i - a^*_i(t)$ and $D^i_{a^*_i(t),p_i(t)}(t) = t^z \log t$. CLUP and DCZA can compute the set of under-explored choices by comparing their counters with these control functions. Whether the time order of the control function, i.e., z, can be made smaller than the value in Theorems 1 and 2 is an open question that is out of the scope of this paper. Note that even though it is possible to reduce the number of explorations by using non-deterministic control functions, using non-deterministic control functions to control training appears to be more complicated, because the rewards coming from a learner j in the training phase of learner i are not reliable for forming estimates about j's rewards.

VII. CONCLUSION

In this paper we proposed a novel framework for decentralized, online learning by many learners. We developed two novel online learning algorithms for this problem and proved sublinear regret bounds for them. We discussed implementation issues such as complexity and the memory requirement under different instance and context arrivals. Our theoretical framework can be applied to many practical settings, including distributed online learning in Big Data mining, recommendation systems and surveillance applications. Cooperative contextual bandits open a new research direction in online learning and raise many interesting questions: What are the lower bounds on the regret? Is there a gap in the time order of the lower bound compared to centralized contextual bandits due to informational asymmetries? Can regret bounds be proved when the cost of calling learner j is controlled by learner j? In other words, what happens when a learner wants to maximize both the total reward from its own contexts and the total reward from the calls of other learners?

APPENDIX A
A BOUND ON DIVERGENT SERIES

For $\rho > 0$, $\rho \neq 1$, $\sum_{t=1}^{T} 1/t^{\rho} \leq 1 + (T^{1-\rho}-1)/(1-\rho)$.

Proof: See [23].
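A quick numerical check of this bound (illustrative only):

```python
# Numerical check of Appendix A: sum_{t=1}^{T} t^{-rho} <= 1 + (T^{1-rho} - 1)/(1 - rho).
for rho in (0.5, 0.75, 1.5, 2.0):       # rho > 0, rho != 1
    for T in (10, 100, 10_000):
        partial_sum = sum(t**-rho for t in range(1, T + 1))
        bound = 1 + (T**(1 - rho) - 1) / (1 - rho)
        assert partial_sum <= bound + 1e-12, (rho, T, partial_sum, bound)
print("Bound holds for all tested (rho, T)")
```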

APPENDIX B
FREQUENTLY USED EXPRESSIONS

Mathematical operators
• $O(\cdot)$: Big O notation.
• $\tilde{O}(\cdot)$: Big O notation with logarithmic terms hidden.
• $I(A)$: indicator function of event A.
• $A^c$ or $A^C$: complement of set A.

Notation related to the underlying system
• $\mathcal{M}$: Set of learners.
• $\mathcal{F}_i$: Set of arms of learner i.
• $\mathcal{M}_{-i}$: Set of learners except i.
• $\mathcal{K}_i$: Set of choices of learner i.
• $\mathcal{F}$: Set of all arms.
• $\mathcal{X} = [0,1]^D$: Context space.
• $D$: Dimension of the context space.
• $\pi_f(x)$: Expected reward of arm $f \in \mathcal{F}$ for context x.
• $\pi_j(x)$: Expected reward of learner j's best arm for context x.
• $d^i_k$: Cost of selecting choice $k \in \mathcal{K}_i$ for learner i.
• $\mu^i_k(x) = \pi_k(x) - d^i_k$: Expected net reward of learner i from choice k for context x.
• $k^*_i(x)$: Best choice (highest expected net reward) for learner i for context x.
• $f^*_j(x)$: Best arm (highest expected reward) of learner j for context x.
• $L$: Hölder constant.
• $\alpha$: Hölder exponent.

Notation related to the algorithms
• $D_1(t)$, $D_2(t)$, $D_3(t)$: Control functions.
• $p$: Index for a set of contexts (hypercube).
• $m_T$: Number of slices for each dimension of the context space for CLUP.
• $\mathcal{P}_T$: Uniform partition of $\mathcal{X}$ into $(m_T)^D$ hypercubes for CLUP.
• $\mathcal{P}_i(t)$: Learner i's adaptive partition of $\mathcal{X}$ at time t for DCZA.
• $\mathcal{P}(t)$: Union of the partitions of $\mathcal{X}$ of all learners for DCZA.
• $p_i(t)$: The set in $\mathcal{P}_i(t)$ that contains $x_i(t)$.
• $\mathcal{M}^{\mathrm{uc}}_{i,p}(t)$: Set of learners who are training candidates of learner i at time t for set p of learner i's partition.
• $\mathcal{M}^{\mathrm{ut}}_{i,p}(t)$: Set of learners who are under-trained by learner i at time t for set p of learner i's partition.
• $\mathcal{M}^{\mathrm{ue}}_{i,p}(t)$: Set of learners who are under-explored by learner i at time t for set p of learner i's partition.

REFERENCES

[1] Wikipedia. Data stream mining. [Online]. Available: http://en.wikipedia.org/wiki/Data_stream_mining
[2] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," Signal Processing, IEEE Transactions on, vol. 58, no. 11, pp. 5667–5681, 2010.
[3] C. Tekin and M. Liu, "Online learning in decentralized multi-user spectrum access with synchronized explorations," in Proc. of IEEE MILCOM, 2012.
[4] R. Kleinberg, A. Slivkins, and E. Upfal, "Multi-armed bandits in metric spaces," in Proceedings of the 40th Annual ACM Symposium on Theory of Computing. ACM, 2008, pp. 681–690.
[5] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari, "X-armed bandits," The Journal of Machine Learning Research, vol. 12, pp. 1655–1695, 2011.
[6] A. Slivkins, "Contextual bandits with similarity information," in 24th Annual Conference on Learning Theory (COLT), 2011.
[7] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, "Efficient optimal learning for contextual bandits," arXiv preprint arXiv:1106.2369, 2011.
[8] J. Langford and T. Zhang, "The epoch-greedy algorithm for contextual multi-armed bandits," Advances in Neural Information Processing Systems, vol. 20, pp. 1096–1103, 2007.
[9] W. Chu, L. Li, L. Reyzin, and R. E. Schapire, "Contextual bandits with linear payoff functions," in Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[10] C. Tekin and M. van der Schaar, "Decentralized online big data classification - a bandit framework," Preprint: arXiv:1308.4565, 2013.
[11] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. of the 19th International Conference on World Wide Web. ACM, 2010, pp. 661–670.
[12] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, pp. 235–256, 2002.
[13] K. Crammer and C. Gentile, "Multiclass classification with bandit feedback using adaptive regularization," 2011.
[14] A. Anandkumar, N. Michael, and A. Tang, "Opportunistic spectrum access with multiple players: Learning under competition," in Proc. of IEEE INFOCOM, March 2010.
[15] C. Tekin and M. Liu, "Online learning of rested and restless bandits," Information Theory, IEEE Transactions on, vol. 58, no. 8, pp. 5588–5611, 2012.
[16] H. Liu, K. Liu, and Q. Zhao, "Learning in a changing world: Restless multiarmed bandit with unknown dynamics," Information Theory, IEEE Transactions on, vol. 59, no. 3, pp. 1902–1916, 2013.
[17] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
[18] F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi, "Distributed autonomous online learning: regrets and intrinsic privacy-preserving properties," Knowledge and Data Engineering, IEEE Transactions on, vol. 25, no. 11, pp. 2483–2493, 2013.
[19] M. Raginsky, N. Kiarashi, and R. Willett, "Decentralized online convex programming with local information," in American Control Conference (ACC), 2011. IEEE, 2011, pp. 5363–5369.
[20] H. Liu, K. Liu, and Q. Zhao, "Learning in a changing world: Non-Bayesian restless multi-armed bandit," Technical Report, UC Davis, October 2010.
[21] R. Ortner, "Exploiting similarity information in reinforcement learning," Proceedings 2nd ICAART, pp. 203–210, 2010.
[22] B. Settles, "Active learning literature survey," University of Wisconsin, Madison, vol. 52, pp. 55–66, 2010.
[23] E. Chlebus, "An approximate formula for a partial sum of the divergent p-series," Applied Mathematics Letters, vol. 22, no. 5, pp. 732–737, 2009.