Budget-Constrained Multi-Armed Bandits with Multiple Plays
Datong P. Zhou, Claire J. Tomlin
AAAI-2018
[datong.zhou, tomlin]@berkeley.edu
February 5, 2018
Table of Contents
1 Background: The Multi-Armed Bandit (MAB) Problem
  - Problem Formulation
  - Contributions
2 Stochastic MAB with Multiple Plays and Budget Constraints
  - Setup
  - Algorithm UCB-MB
3 Adversarial MAB with Multiple Plays and Budget Constraints
  - Upper Bounds on the Regret
  - Lower Bounds on the Regret
  - High Probability Bounds on the Regret
4 Conclusion
5 References
Background: The Multi-Armed Bandit (MAB) Problem · Problem Formulation
Background

Examples
- Product recommendation to maximize sales
- Ad placement to maximize click-through rate

Classical Problem and Objective
- Given N arms with unknown reward distributions
- Pull arms sequentially to maximize the total expected reward over a given horizon T
- At each round t ∈ [T]:
  - The player selects exactly one action a_t
  - The player observes the gain r_{a_t, t}
- Goal: minimize the cumulative regret

  R_A(T) = \max_{i \in [N]} \mathbb{E}\Big[ \sum_{t=1}^{T} r_{i,t} \Big] - \mathbb{E}\Big[ \sum_{t=1}^{T} r_{a_t,t} \Big]

- Stochastic generation of rewards vs. the adversarial setting
- Recent developments: costs c_{i,t} and multiple plays K
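To make the round-by-round protocol concrete, here is a minimal simulation sketch (Python). The Bernoulli arms and the epsilon-greedy player are illustrative assumptions, not the paper's method; the sketch only exercises the protocol and estimates the regret R_A(T) defined above from a single run:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 10_000, 5
means = rng.uniform(0.2, 0.8, size=N)      # Bernoulli means, unknown to the player

eps = 0.1                                  # epsilon-greedy, purely illustrative
counts, sums, gain = np.zeros(N), np.zeros(N), 0.0
for t in range(T):
    if counts.min() == 0 or rng.random() < eps:
        a = int(rng.integers(N))           # explore a random arm
    else:
        a = int(np.argmax(sums / counts))  # exploit the best empirical mean
    r = float(rng.random() < means[a])     # observe the gain r_{a_t, t}
    counts[a] += 1
    sums[a] += r
    gain += r

# One-run estimate of R_A(T) = max_i E[sum_t r_{i,t}] - E[sum_t r_{a_t,t}]
print(f"estimated regret after {T} rounds: {T * means.max() - gain:.1f}")
```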
Background: The Multi-Armed Bandit (MAB) Problem · Contributions
Contributions

Algorithm   | Upper Bound                                                              | Lower Bound              | Setting        | Authors
Exp3        | O(√(NT log N))                                                           | Ω(√(NT))                 | Fixed T, K = 1 | [1]
Exp3.M      | O(√(NTK log(N/K)))                                                       | Ω((1 − K/N)² √(NT))      | Fixed T, K ≥ 1 | [2]
Exp3.M.B    | O(√(NB log(N/K)))                                                        | Ω((1 − K/N)² √(NB/K))    | B > 0, K ≥ 1   | This paper
Exp3.P      | O(√(NT log(NT/δ)) + log(NT/δ))                                           |                          | Fixed T, K = 1 | [3]
Exp3.P.M    | O(K² √(NT ((N−K)/(N−1)) log(NT/δ)) + ((N−K)/(N−1)) log(NT/δ))            |                          | Fixed T, K ≥ 1 | This paper
Exp3.P.M.B  | O(K² √((NB/K) ((N−K)/(N−1)) log(NB/(Kδ))) + ((N−K)/(N−1)) log(NB/(Kδ)))  |                          | B > 0, K ≥ 1   | This paper
UCB1        | O(N log T)                                                               |                          | Fixed T, K = 1 | [4]
LLR         | O(NK⁴ log T)                                                             |                          | Fixed T, K ≥ 1 | [5]
UCB-BV      | O(N log B)                                                               |                          | B > 0, K = 1   | [6]
UCB-MB      | O(NK⁴ log B)                                                             |                          | B > 0, K ≥ 1   | This paper

[1] P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
[2] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
[3] P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
[5] Y. Gai, B. Krishnamachari, and R. Jain. "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations". IEEE/ACM Transactions on Networking 20.5 (2012), pp. 1466–1478.
[6] W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
Stochastic MAB with Multiple Plays and Budget Constraints · Setup
Setup

Problem Description
- Bandit with N distinct arms
- Each arm i has unknown cost and reward distributions with means 0 < µ_i^r ≤ 1 and 0 < c_min ≤ µ_i^c ≤ 1
- Realizations of costs c_{i,t} ∈ [c_min, 1] and rewards r_{i,t} ∈ [0, 1] are i.i.d.
- An initial budget B > 0 pays for the materialized costs
- At each round t = 1, ..., τ_A(B) (see the protocol sketch after this slide):
  - Select exactly 1 ≤ K ≤ N arms into the action set a_t
  - Observe the individual costs {c_{i,t} : i ∈ a_t} and rewards {r_{i,t} : i ∈ a_t}
  - Terminate the game if ∑_{i ∈ a_t} c_{i,t} exceeds the remaining budget

Goal
- Minimize the expected regret

  R_A(B) = \mathbb{E}[G_{A^*}(B)] - \mathbb{E}[G_A(B)]

- Use a modified UCB algorithm with upper confidence bounds U_{i,t} = \hat{\mu}_{i,t} + e_{i,t}
- At each round, play the K arms with the K largest U_{i,t}
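A minimal sketch of this interaction protocol (Python). The uniformly random placeholder policy and the cost/reward distributions are illustrative assumptions; only the budget accounting and the termination rule follow the setup above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, B, c_min = 8, 3, 500.0, 0.1
mu_r = rng.uniform(0.1, 1.0, size=N)           # mean rewards, unknown to the player
mu_c = rng.uniform(c_min, 1.0, size=N)         # mean costs in [c_min, 1]

budget, t, gain = B, 0, 0.0
while True:
    t += 1                                     # round t = 1, ..., tau_A(B)
    a_t = rng.choice(N, size=K, replace=False) # placeholder policy: K random arms
    c = np.clip(rng.normal(mu_c[a_t], 0.05), c_min, 1.0)  # realized costs
    r = (rng.random(K) < mu_r[a_t]).astype(float)         # realized rewards
    if c.sum() > budget:                       # costs exceed the remaining budget:
        break                                  # the game stops, tau_A(B) = t
    budget -= c.sum()
    gain += r.sum()

print(f"tau_A(B) = {t}, total gain G_A(B) = {gain:.1f}")
```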
Stochastic MAB with Multiple Plays and Budget Constraints · Algorithm UCB-MB
Algorithm UCB-MB

The Algorithm
- Play all arms once to initialize the bang-per-buck ratio estimate of each arm
- While B is not exhausted, select the K arms with the K largest U_{i,t} (a sketch of this selection rule follows below)

Theorem (Upper Bound on R_A(B) for Algorithm UCB-MB)
With the confidence bounds defined as

  U_{i,t} = \hat{\mu}_{i,t} + \frac{\sqrt{(K+1)\log t / n_{i,t}}\;\big(1 + 1/c_{\min}\big)}{c_{\min} - \sqrt{(K+1)\log t / n_{i,t}}},

Algorithm UCB-MB achieves expected regret R_A(B) = O(NK^4 \log B).

Proof Idea
- Step 1: Upper-bound the number of times a non-optimal selection of arms is made up to a fixed stopping time τ_A(B):

  \#\text{ suboptimal choices} = O\big( NK^3 \log \tau_A(B) \big)

- Step 2: Relate Algorithm UCB-MB and the budget B to the stopping time τ_A(B)
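A minimal sketch of the index computation and selection (Python), assuming mu_hat holds the empirical bang-per-buck estimates and n the play counts; treating the bound as vacuous (infinite) when its denominator is nonpositive is my own guard, not part of the paper:

```python
import numpy as np

def ucb_mb_select(mu_hat, n, t, K, c_min):
    """Pick the K arms with the largest indices U_{i,t} from the theorem."""
    e = np.sqrt((K + 1) * np.log(t) / n)            # sqrt((K+1) log t / n_{i,t})
    U = mu_hat + e * (1 + 1 / c_min) / (c_min - e)  # confidence bound U_{i,t}
    U[c_min - e <= 0] = np.inf                      # guard: force more exploration
    return np.argsort(U)[-K:]                       # K largest indices
```

While the budget lasts, the player pays the realized costs of the selected arms and updates mu_hat and n for every i ∈ a_t.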
Adversarial MAB with Multiple Plays and Budget Constraints · Upper Bounds on the Regret
Upper Bound on the Regret

Setup
- Oblivious adversary ⇒ no assumptions on the reward or cost distributions except boundedness

Algorithm Exp3.M.B (a one-round sketch follows below)
- Initialize weights w_i(1) = 1 for i = 1, ..., N
- For each round t = 1, ..., τ_A(B):
  - Cap weights that are "too large": set w_i(t) = v(t) for i ∈ S(t) = {i ∈ [N] : w_i(t) > v(t)}, where v(t) solves

    \frac{v(t)\,(1 - \gamma)}{\sum_{i=1}^{N} \big[ v(t)\,\mathbb{1}(w_i(t) \ge v(t)) + w_i(t)\,\mathbb{1}(w_i(t) < v(t)) \big]} = \frac{1}{K} - \frac{\gamma}{N}

  - Calculate the probabilities

    p_i(t) = K \left( (1 - \gamma)\, \frac{w_i(t)}{\sum_{j=1}^{N} w_j(t)} + \frac{\gamma}{N} \right)

  - Play a set a_t of K distinct arms drawn according to p_1(t), ..., p_N(t)
  - Update the estimates and weights (capped arms keep their weights):

    \hat{r}_i(t) = \frac{r_i(t)}{p_i(t)}\,\mathbb{1}(i \in a_t), \qquad \hat{c}_i(t) = \frac{c_i(t)}{p_i(t)}\,\mathbb{1}(i \in a_t),

    w_i(t+1) = w_i(t)\, \exp\!\left[ \frac{K\gamma}{N} \big( \hat{r}_i(t) - \hat{c}_i(t) \big)\, \mathbb{1}(i \notin S(t)) \right]
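One round of Exp3.M.B as a sketch (Python). The bisection used to find v(t) and the sample_k routine (which must draw K distinct arms with marginal probabilities p_i(t), e.g. the dependent-rounding sampler used by Exp3.M) are assumptions of this sketch:

```python
import numpy as np

def exp3mb_round(w, K, gamma, observe, sample_k):
    """One round of Exp3.M.B (sketch). observe(i) returns the realized
    (reward, cost) of a played arm; sample_k(p, K) returns K distinct arms
    with marginal probabilities p_i (assumed given, e.g. DepRound)."""
    N = len(w)
    p = K * ((1 - gamma) * w / w.sum() + gamma / N)
    S = np.zeros(N, dtype=bool)
    if p.max() > 1:                      # cap weights that are "too large"
        target = 1 / K - gamma / N
        lo, hi = 0.0, w.max()
        for _ in range(60):              # bisect: v(1-g) / sum_i min(w_i, v) = target
            v = 0.5 * (lo + hi)
            if v * (1 - gamma) / np.minimum(w, v).sum() < target:
                lo = v
            else:
                hi = v
        S = w >= v                       # capped set S(t)
        w[S] = v                         # set w_i(t) = v(t) on S(t)
        p = K * ((1 - gamma) * w / w.sum() + gamma / N)
    a_t = sample_k(p, K)
    for i in a_t:
        r_i, c_i = observe(i)            # r_hat, c_hat are zero off a_t
        if not S[i]:                     # capped arms keep their weights
            w[i] *= np.exp(K * gamma / N * (r_i - c_i) / p[i])
    return w, p, a_t
```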
Adversarial MAB with Multiple Plays and Budget Constraints · Upper Bounds on the Regret
Upper Bound on the Regret (cont'd.)

Analysis of Algorithm Exp3.M.B

Theorem (Upper Bound on R_A(B) for Algorithm Exp3.M.B)
Algorithm Exp3.M.B achieves cumulative regret R_A(B) = O(\sqrt{BN \log(N/K)}).

Proof Idea
Modify existing proof techniques [7]:
- Step 1: Assume a fixed time horizon T = max(τ_A(B), τ_{A^*}(B))
- Step 2: Relate T to the budget B

Remarks
Our bound recovers previous findings in the following special cases:
- The O(\sqrt{BN \log N}) bound for K = 1 [8]
- The O(\sqrt{TN \log N}) bound for fixed T, K = 1, and no costs or budget [9]

[7] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389; P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
[8] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
[9] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
Adversarial MAB with Multiple Plays and Budget Constraints · Lower Bounds on the Regret
Lower Bound on the Regret

Theorem (Lower Bound on R_A(B) for Algorithm Exp3.M.B)
The weak regret of Algorithm Exp3.M.B is at least

  R \ge \min\left( \frac{c_{\min}^{3/2}\,(1 - K/N)^2}{8\sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}},\;\; \frac{B\,(1 - K/N)}{8} \right).

This bound is of order \Omega\big( (1 - K/N)^2 \sqrt{NB/K} \big).

Proof Idea
Step 1: Derive an auxiliary lemma. Select K out of N arms at random to be "good" arms with r_i(t) ~ Bern(1/2 + ε); c_i(t) = c_min w.p. 1/2 + ε and c_i(t) = 1 w.p. 1/2 − ε.

Lemma
Let f : (\{0,1\} \times \{c_{\min}, 1\})^{\tau_{\max}} \to [0, M] be any function defined on reward and cost sequences r, c of length at most \tau_{\max} = B/(K c_{\min}). Then, for the best action set a^*:

  \mathbb{E}_{a^*}[f(\mathbf{r}, \mathbf{c})] \le \mathbb{E}_u[f(\mathbf{r}, \mathbf{c})] + \frac{B\, c_{\min}^{-3/2}}{2} \sqrt{ -\mathbb{E}_u[N_{a^*}] \log(1 - 4\varepsilon^2) }.
Adversarial MAB with Multiple Plays and Budget Constraints · Lower Bounds on the Regret
Lower Bound on the Regret (cont'd.)

Proof Idea (cont'd.)
Step 2: Note that there exist \binom{N}{K} unique K-element subsets of the arms. Let C([N], K) denote the set of all such subsets [10], and let \mathbb{E}_*[\cdot] denote the expectation w.r.t. the uniform random assignment of "good" arms. Then

  \mathbb{E}_*[G_{\max}] = \Big( \frac{1}{2} + \varepsilon \Big) K\, \mathbb{E}_*[\tau_A(B)],

  \mathbb{E}_{a^*}[G_A] = \frac{1}{2} K\, \mathbb{E}_{a^*}[\tau_A(B)] + \varepsilon\, \mathbb{E}_{a^*}[N_{a^*}],

  \mathbb{E}_*[G_A] = \binom{N}{K}^{-1} \sum_{a^* \in C([N],K)} \mathbb{E}_{a^*}[G_A] = \frac{1}{2} K\, \mathbb{E}_*[\tau_A(B)] + \binom{N}{K}^{-1} \varepsilon \sum_{a^* \in C([N],K)} \mathbb{E}_{a^*}[N_{a^*}].

Step 3: Use the previous lemma to bound \mathbb{E}_*[G_{\max} - G_A]:

  \mathbb{E}_*[G_{\max} - G_A] \ge \varepsilon B \Big( 1 - \frac{K}{N} \Big) - \frac{2\varepsilon^2 B}{c_{\min}^{3/2}} \sqrt{ \frac{BK}{N} \log(4/3) }.

Step 4: Tune ε to optimize the bound (a worked version of this optimization follows below):

  \varepsilon = \min\left( \frac{1}{4},\;\; \frac{c_{\min}^{3/2}\,(1 - K/N)}{4} \sqrt{ \frac{N}{BK \log(4/3)} } \right).

[10] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
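Step 4 is a standard quadratic maximization; a worked version (LaTeX), assuming the Step 3 bound has the form L(ε) = Aε − Cε² with A = B(1 − K/N) and C = (2B/c_min^{3/2}) √((BK/N) log(4/3)):

```latex
\[
  L'(\varepsilon^\star) = A - 2C\varepsilon^\star = 0
  \;\Longrightarrow\;
  \varepsilon^\star = \frac{A}{2C}
  = \frac{c_{\min}^{3/2}\,(1 - K/N)}{4} \sqrt{\frac{N}{BK\log(4/3)}},
\]
\[
  L(\varepsilon^\star) = \frac{A^2}{4C}
  = \frac{c_{\min}^{3/2}\,(1 - K/N)^2}{8\sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}},
\]
% which is the first argument of the min in the theorem; the cap
% epsilon <= 1/4 keeps the Bernoulli parameters in [0, 1] and
% is consistent with the second argument, B(1 - K/N)/8.
```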
Adversarial MAB with Multiple Plays and Budget Constraints · High Probability Bounds on the Regret
High Probability Bound on the Regret

Algorithm Exp3.P.M.B
Modification of Algorithm Exp3.M.B (a sketch of the changed pieces follows below):
- Initialize the parameter

  \alpha = 2\sqrt{6}\, \sqrt{ \frac{N-K}{N-1}\, \log\!\Big( \frac{NB}{K c_{\min} \delta} \Big) }.

- Initialize the weights w_i(1) = \exp\!\Big( \frac{\alpha \gamma K^2}{3} \sqrt{ \frac{B}{N K c_{\min}} } \Big) for i ∈ [N].
- Update the weights for i ∈ [N] as follows:

  w_i(t+1) = w_i(t)\, \exp\!\left[ \mathbb{1}(i \notin S(t))\, \frac{\gamma K}{3N} \left( \hat{r}_i(t) - \hat{c}_i(t) + \frac{\alpha \sqrt{K c_{\min}}}{p_i(t) \sqrt{NB}} \right) \right].

Theorem (High Probability Upper Bound on R_A(B) for Algorithm Exp3.P.M.B)
For the multiple-play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following bound on the regret holds with probability at least 1 − δ:

  R \le 2\sqrt{3}\, \sqrt{ \frac{NB (1 - c_{\min})}{c_{\min}} \log\frac{N}{K} }
      + 4\sqrt{6}\, \frac{N-K}{N-1} \log\!\Big( \frac{NB}{K c_{\min} \delta} \Big)
      + 2\sqrt{6}\, (1 + K^2) \sqrt{ \frac{N-K}{N-1}\, \frac{NB}{K c_{\min}}\, \log\!\Big( \frac{NB}{K c_{\min} \delta} \Big) }

    = O\left( K^2 \sqrt{ \frac{NB}{K}\, \frac{N-K}{N-1}\, \log\!\Big( \frac{NB}{K\delta} \Big) } + \frac{N-K}{N-1} \log\!\Big( \frac{NB}{K\delta} \Big) \right).
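The high-probability variant changes only the initialization and the weight update relative to Exp3.M.B; a sketch of exactly those two pieces (Python), following the formulas above. The array/dict bookkeeping is an assumption of the sketch; capping, probability computation, and sampling carry over unchanged from Exp3.M.B:

```python
import numpy as np

def exp3pmb_init(N, K, B, c_min, delta, gamma):
    """alpha and the optimistic uniform initial weights of Exp3.P.M.B."""
    alpha = 2 * np.sqrt(6) * np.sqrt(
        (N - K) / (N - 1) * np.log(N * B / (K * c_min * delta)))
    w0 = np.exp(alpha * gamma * K**2 / 3 * np.sqrt(B / (N * K * c_min)))
    return alpha, np.full(N, w0)

def exp3pmb_update(w, p, S, a_t, r, c, alpha, gamma, K, B, c_min):
    """Weight update: uncapped arms get the importance-weighted gain plus
    the exploration bonus alpha*sqrt(K*c_min)/(p_i*sqrt(N*B))."""
    N = len(w)
    for i in range(N):
        if S[i]:                                    # i in S(t): capped, not updated
            continue
        r_hat = r[i] / p[i] if i in a_t else 0.0    # importance-weighted reward
        c_hat = c[i] / p[i] if i in a_t else 0.0    # importance-weighted cost
        bonus = alpha * np.sqrt(K * c_min) / (p[i] * np.sqrt(N * B))
        w[i] *= np.exp(gamma * K / (3 * N) * (r_hat - c_hat + bonus))
    return w
```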
Adversarial MAB with Multiple Plays and Budget Constraints · High Probability Bounds on the Regret
High Probability Bound on the Regret (cont'd.)

Theorem (High Probability Upper Bound on R_A(B) for Algorithm Exp3.P.M.B)
For the multiple-play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following bound on the regret holds with probability at least 1 − δ:

  R = O\left( K^2 \sqrt{ \frac{NB}{K}\, \frac{N-K}{N-1}\, \log\!\Big( \frac{NB}{K\delta} \Big) } + \frac{N-K}{N-1} \log\!\Big( \frac{NB}{K\delta} \Big) \right)

Remark
- Recovers the O(\sqrt{NT \log(NT/\delta)} + \log(NT/\delta)) bound [11] for K = 1 and no costs

Proof Idea
- Step 1: Derive an upper confidence bound U on G_max that holds w.h.p.
- Step 2: Lower-bound G_Exp3.P.M.B as a function of U
- Step 3: Combine to obtain an upper bound on G_max − G_Exp3.P.M.B

[11] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
Adversarial MAB with Multiple Plays and Budget Constraints · High Probability Bounds on the Regret
High Probability Bound on the Regret (cont'd.)

Proof Idea (cont'd.)
Step 1: Derive an upper confidence bound U on G_max that holds w.h.p. Define the upper confidence bound

  U^* = \sum_{i \in a^*} \left( \alpha \sigma_i + \sum_{t=1}^{\tau_{a^*}(B)} \big( \hat{r}_i(t) - \hat{c}_i(t) \big) \right).

For 2\sqrt{6}\, \sqrt{ \frac{N-K}{N-1} \log\frac{NB}{K c_{\min} \delta} } \le \alpha \le 12 \sqrt{ \frac{NB}{K c_{\min}} }, we can show that P\big( U^* > G_{\max} - B \big) \ge 1 - \delta.

Step 2: Lower-bound G_Exp3.P.M.B as a function of U. For \alpha \le 2\sqrt{ \frac{NB}{K c_{\min}} }, the gain of Algorithm Exp3.P.M.B is bounded below as follows:

  G_{\text{Exp3.P.M.B}} \ge \Big( 1 - \gamma - \frac{2\gamma}{3}\, \frac{1 - c_{\min}}{c_{\min}} \Big) U^* - \frac{3N}{\gamma} \log\frac{N}{K} - 2\alpha^2 - \alpha (1 + K^2) \sqrt{ \frac{BN}{K c_{\min}} }.

Step 3: Combine to obtain an upper bound on G_max − G_Exp3.P.M.B, and tune γ (a worked version follows below):

  \gamma = \min\left( \Big( 1 + \frac{2}{3}\, \frac{1 - c_{\min}}{c_{\min}} \Big)^{-1},\;\; \left( \frac{3N \log(N/K)}{ (G_{\max} - B)\, \big( 1 + 2(1 - c_{\min})/(3 c_{\min}) \big) } \right)^{1/2} \right).
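The tuning of γ mirrors the ε step earlier: a worked version (LaTeX), under my assumption that the combined bound from Steps 1 and 2 takes the form Aγ + C/γ with A = (G_max − B)(1 + 2(1 − c_min)/(3c_min)) and C = 3N log(N/K):

```latex
\[
  \frac{d}{d\gamma}\Big( A\gamma + \frac{C}{\gamma} \Big)
  = A - \frac{C}{\gamma^2} = 0
  \;\Longrightarrow\;
  \gamma^\star = \sqrt{\frac{C}{A}}
  = \left( \frac{3N\log(N/K)}
                {(G_{\max} - B)\,\big(1 + 2(1 - c_{\min})/(3c_{\min})\big)}
    \right)^{1/2},
\]
% capped by the first argument of the min so that the factor
% 1 - gamma - (2*gamma/3)(1 - c_min)/c_min in Step 2 stays nonnegative.
```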
Conclusion
Conclusion and Outlook

Summary of Contributions
- Analysis of budget-constrained MABs with multiple plays
- Stochastic setting: Algorithm UCB-MB, based on UCB1 [12] and UCB-BV [13], with an upper bound on the regret
- Adversarial setting:
  - Algorithm Exp3.M.B with an upper bound on the regret
  - A modification of the weight updates in Exp3.M.B to lower-bound the regret
  - Algorithm Exp3.P.M.B for a high-probability upper bound

Future Work
- Simulations with data
- Connect to Badanidiyuru et al. [14] to recover the O(√B) adversarial bound with a null arm
- Explore connections to e-commerce (mechanism design): strategic considerations of agents in repeated buyer-seller interactions; a pool of buyers vs. a single buyer

[12] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
[13] W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
[14] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. "Bandits with Knapsacks". Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (2013), pp. 207–216.
References
References I
- P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
- R. Agrawal, M. V. Hegde, and D. Teneketzis. "Multi-Armed Bandits with Multiple Plays and Switching Cost". Stochastics and Stochastic Reports 29 (1990), pp. 437–459.
- V. Anantharam, P. Varaiya, and J. Walrand. "Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem, Part I: IID Rewards". IEEE Transactions on Automatic Control 32 (1986), pp. 968–976.
- P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
- A. Badanidiyuru, R. Kleinberg, and A. Slivkins. "Bandits with Knapsacks". Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (2013), pp. 207–216.
- W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
- Y. Gai, B. Krishnamachari, and R. Jain. "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations". IEEE/ACM Transactions on Networking 20.5 (2012), pp. 1466–1478.
References II
- L. Tran-Thanh et al. "Epsilon-First Policies for Budget-Limited Multi-Armed Bandits". Twenty-Fourth AAAI Conference on Artificial Intelligence (2010), pp. 1211–1216.
- L. Tran-Thanh et al. "Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits". Twenty-Sixth AAAI Conference on Artificial Intelligence (2012), pp. 1134–1140.
- T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
- Y. Xia et al. "Budgeted Multi-Armed Bandits with Multiple Plays". Proceedings of the 25th International Joint Conference on Artificial Intelligence (2016), pp. 2210–2216.
THANK YOU!
QUESTIONS?