
Budget-Constrained Multi-Armed Bandits with Multiple Plays

Datong P. Zhou, Claire J. Tomlin

AAAI-2018

[datong.zhou, tomlin]@berkeley.edu

February 5, 2018

Table of Contents

1 Background: The Multi-Armed Bandit (MAB) Problem
   Problem Formulation
   Contributions

2 Stochastic MAB with Multiple Plays and Budget Constraints
   Setup
   Algorithm UCB-MB

3 Adversarial MAB with Multiple Plays and Budget Constraints
   Upper Bounds on the Regret
   Lower Bounds on the Regret
   High Probability Bounds on the Regret

4 Conclusion

5 References


Background

Examples

Product recommendation to maximize sales

Ad placement to maximize click-through rate

Classical Problem and Objective

Given N arms with unknown reward distributions

Pull arms sequentially to maximize total expected reward over a certain horizon T

At each round t ∈ [T]:
  Player selects exactly one action a_t
  Player observes gain r_{a_t, t}

Goal: Minimize cumulative regret

  R_A(T) = \max_{i \in [N]} \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{a_t,t}\right]

Stochastic generation of rewards vs. adversarial setting

Recent developments: Costs c_{i,t} and multiple plays K
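To make the regret definition concrete, here is a minimal simulation sketch (a toy Bernoulli instance of our own; the arm means, horizon, and the uniform-random baseline policy are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: N Bernoulli arms with unknown means, horizon T.
N, T = 5, 10_000
means = rng.uniform(0.1, 0.9, size=N)      # hidden from the player

def uniform_random_policy() -> float:
    """Baseline: pick one arm uniformly at random each round; return total gain."""
    gain = 0.0
    for _ in range(T):
        arm = rng.integers(N)                      # action a_t
        gain += float(rng.random() < means[arm])   # realized reward r_{a_t, t}
    return gain

# Cumulative regret R_A(T): best single arm in expectation vs. the policy's gain.
regret = means.max() * T - uniform_random_policy()
print(f"regret of the uniform policy: {regret:.0f}")
```

Any policy that concentrates its pulls on the empirically best arm drives this quantity sublinear in T, which is exactly what the algorithms below aim for.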


Contributions

Algorithm   | Upper Bound                                                              | Lower Bound               | Setting         | Authors
------------|--------------------------------------------------------------------------|---------------------------|-----------------|-----------
Exp3        | O(√(NT log N))                                                           | Ω(√(NT))                  | Fixed T, K = 1  | [1]
Exp3.M      | O(√(NTK log(N/K)))                                                       | Ω((1 − K/N)² √(NT))       | Fixed T, K ≥ 1  | [2]
Exp3.M.B    | O(√(NB log(N/K)))                                                        | Ω((1 − K/N)² √(NB/K))     | B > 0, K ≥ 1    | This paper
Exp3.P      | O(√(NT log(NT/δ)) + log(NT/δ))                                           |                           | Fixed T, K = 1  | [3]
Exp3.P.M    | O(K²√(NT · (N−K)/(N−1) · log(NT/δ)) + (N−K)/(N−1) · log(NT/δ))           |                           | Fixed T, K ≥ 1  | This paper
Exp3.P.M.B  | O(K²√(NB/K · (N−K)/(N−1) · log(NB/(Kδ))) + (N−K)/(N−1) · log(NB/(Kδ)))   |                           | B > 0, K ≥ 1    | This paper
UCB1        | O(N log T)                                                               |                           | Fixed T, K = 1  | [4]
LLR         | O(NK⁴ log T)                                                             |                           | Fixed T, K ≥ 1  | [5]
UCB-BV      | O(N log B)                                                               |                           | B > 0, K = 1    | [6]
UCB-MB      | O(NK⁴ log B)                                                             |                           | B > 0, K ≥ 1    | This paper

[1] P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". In: SIAM Journal on Computing 32 (2002), pp. 48–77.
[2] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". In: International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
[3] P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". In: SIAM Journal on Computing 32 (2002), pp. 48–77.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". In: Machine Learning 47 (2002), pp. 235–256.
[5] Y. Gai, B. Krishnamachari, and R. Jain. "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations". In: IEEE/ACM Transactions on Networking 20.5 (2012), pp. 1466–1478.
[6] W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).


Setup

Problem Description

Bandit with N distinct arms

Each arm i has unknown cost and reward distributions with means 0 < μ_i^r ≤ 1 and 0 < c_min ≤ μ_i^c ≤ 1

Realizations of costs c_{i,t} ∈ [c_min, 1] and rewards r_{i,t} ∈ [0, 1] are i.i.d.

Initial budget B > 0 to pay for the materialized costs

At each round t = 1, ..., τ_A(B):
  Select exactly 1 ≤ K ≤ N arms into a_t
  Observe individual costs {c_{i,t} | i ∈ a_t} and rewards {r_{i,t} | i ∈ a_t}
  Terminate the game if Σ_{i ∈ a_t} c_{i,t} exceeds the remaining budget

Goal

Minimize expected regret

  R_A(B) = \mathbb{E}[G_{A^*}(B)] - \mathbb{E}[G_A(B)]

Utilize a modified UCB algorithm with upper confidence bounds U_{i,t} = μ_{i,t} + e_{i,t}

At each round, play the K arms with the K largest U_{i,t}
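A minimal sketch of this interaction protocol (our own toy instance; the truncated-Gaussian cost and reward realizations and the random placeholder policy are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy budgeted multiple-play instance.
N, K, B, c_min = 6, 2, 50.0, 0.2
mu_r = rng.uniform(0.2, 1.0, size=N)     # unknown reward means mu_i^r
mu_c = rng.uniform(c_min, 1.0, size=N)   # unknown cost means mu_i^c

budget, t, gain = B, 0, 0.0
while True:
    t += 1
    a_t = rng.choice(N, size=K, replace=False)              # placeholder policy
    c = np.clip(rng.normal(mu_c[a_t], 0.05), c_min, 1.0)    # costs in [c_min, 1]
    r = np.clip(rng.normal(mu_r[a_t], 0.05), 0.0, 1.0)      # rewards in [0, 1]
    if c.sum() > budget:                 # stopping time tau_A(B) reached
        break
    budget -= c.sum()
    gain += r.sum()
print(f"tau_A(B) = {t}, total gain = {gain:.1f}")
```

Because every round consumes at least K·c_min of the budget, the game lasts at most B/(K·c_min) rounds; this is the τ_max that reappears in the lower-bound lemma later.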


Algorithm UCB-MB

The Algorithm

Play all arms once to initialize bang-per-buck ratios for each arm

While B is not exhausted, select the K arms with the K largest U_{i,t}

Theorem (Upper Bound on R_A(B) for Algorithm UCB-MB)

For the confidence bounds defined as

  U_{i,t} = \mu_{i,t} + \frac{\sqrt{(K+1)\log t / n_{i,t}}\,\left(1 + 1/c_{\min}\right)}{c_{\min} - \sqrt{(K+1)\log t / n_{i,t}}},

Algorithm UCB-MB achieves expected regret R_A(B) = O(NK⁴ log B).

Proof Idea

Step 1: Upper bound the number of times a non-optimal selection of arms is made up to a fixed stopping time τ_A(B):

  # suboptimal choices = O(NK³ log τ_A(B))

Step 2: Relate Algorithm UCB-MB and B to the stopping time τ_A(B)
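A minimal sketch of the selection rule from the theorem (our own rendering; `mu_hat` stands for the empirical bang-per-buck estimates and is an assumed interface, not the authors' reference code):

```python
import numpy as np

def ucb_mb_select(mu_hat: np.ndarray, n: np.ndarray, t: int,
                  K: int, c_min: float) -> np.ndarray:
    """Return the K arms with the largest UCB-MB indices U_{i,t}.

    mu_hat: empirical estimates per arm (maintained by the caller)
    n:      play counts per arm (each arm played once at initialization)
    """
    e = np.sqrt((K + 1) * np.log(t) / n)            # exploration radius
    # Index from the theorem; only meaningful while e < c_min,
    # which holds once every arm has been played often enough.
    U = mu_hat + e * (1 + 1 / c_min) / (c_min - e)
    return np.argsort(U)[-K:]                       # K largest indices
```

The exploration radius shrinks like √(log t / n_{i,t}), so arms that look suboptimal are revisited only logarithmically often, which is where the log B in the regret bound comes from.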


Upper Bound on the Regret

Setup

Oblivious adversary ⇒ no assumptions on reward or cost distributions except boundedness

Algorithm Exp3.M.B

Initialize weights w_i = 1 for i = 1, ..., N

For each round t = 1, ..., τ_A(B):

  Cap weights that are "too large": w_i(t) = v(t) for i ∈ S(t) = {i ∈ [N] | w_i(t) ≥ v(t)}, where v(t) solves

    \frac{(1-\gamma)\, v(t)}{\sum_{i=1}^{N} \left[ v(t)\,\mathbf{1}(w_i(t) \ge v(t)) + w_i(t)\,\mathbf{1}(w_i(t) < v(t)) \right]} = \frac{1}{K} - \frac{\gamma}{N}

  Calculate probabilities

    p_i(t) = K\left((1-\gamma)\,\frac{w_i(t)}{\sum_{j=1}^{N} w_j(t)} + \frac{\gamma}{N}\right)

  Play arms a_t ∼ p_1, ..., p_N

  Update weights:

    \hat r_i(t) = r_i(t)/p_i(t) \cdot \mathbf{1}(i \in a_t)
    \hat c_i(t) = c_i(t)/p_i(t) \cdot \mathbf{1}(i \in a_t)
    w_i(t+1) = w_i(t)\,\exp\left[\frac{K\gamma}{N}\,\big(\hat r_i(t) - \hat c_i(t)\big)\,\mathbf{1}(i \notin S(t))\right]
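A sketch of one Exp3.M.B round under the slide's equations (our own simplification: the final `choice` call is a stand-in sampler that only approximately matches the marginals p_i; the paper-style guarantee requires a dependent-rounding scheme that draws exactly K distinct arms with those marginals):

```python
import numpy as np

def exp3mb_round(w, K, gamma, reward_fn, cost_fn, rng):
    """One round of Exp3.M.B: cap weights, mix, play K arms, update weights."""
    N = len(w)
    theta = 1.0 / K - gamma / N
    w = w.astype(float).copy()
    capped = np.zeros(N, dtype=bool)
    # Cap weights that are "too large" so that no p_i exceeds 1.
    if (1 - gamma) * w.max() / w.sum() >= theta:
        order = np.argsort(w)[::-1]          # weights in descending order
        ws = w[order]
        for m in range(1, N + 1):            # try capping the m largest weights
            v = theta * ws[m:].sum() / (1 - gamma - theta * m)
            if v <= ws[m - 1] and (m == N or v > ws[m]):
                w[order[:m]] = v
                capped[order[:m]] = True
                break
    p = K * ((1 - gamma) * w / w.sum() + gamma / N)      # sums to K
    a_t = rng.choice(N, size=K, replace=False, p=p / K)  # stand-in sampler
    r_hat = np.zeros(N)
    c_hat = np.zeros(N)
    r_hat[a_t] = reward_fn(a_t) / p[a_t]   # importance-weighted reward estimate
    c_hat[a_t] = cost_fn(a_t) / p[a_t]     # importance-weighted cost estimate
    # Capped arms (i in S(t)) keep weight v(t); the rest update exponentially.
    w_new = np.where(capped, w, w * np.exp(K * gamma / N * (r_hat - c_hat)))
    return w_new, a_t, p
```

The capping step is what keeps every p_i ≤ 1 when a few weights dominate; without it the multiple-play mixture K·((1−γ)w_i/Σw + γ/N) can exceed a valid probability.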


Upper Bound on the Regret (cont'd.)

Analysis of Algorithm Exp3.M.B

Theorem (Upper Bound on R_A(B) for Algorithm Exp3.M.B)

Algorithm Exp3.M.B achieves cumulative regret R_A(B) = O(√(NB log(N/K))).

Proof Idea

Modify existing proof techniques [7]:
Step 1: Assume a fixed time horizon T = max(τ_A(B), τ_{A*}(B))
Step 2: Relate T to the budget B

Remarks

Our bound recovers previous findings for the following special cases with fixed T:
Recovers the O(√(BN log N)) bound for K = 1 [8]
Recovers the O(√(TN log N)) bound for K = 1, no costs / budget [9]

[7] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". In: International Conference on Algorithmic Learning Theory (2010), pp. 375–389; P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". In: SIAM Journal on Computing 32 (2002), pp. 48–77.
[8] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". In: International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
[9] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". In: Machine Learning 47 (2002), pp. 235–256.


Lower Bound on the Regret

Theorem (Lower Bound on R_A(B) for Algorithm Exp3.M.B)

The weak regret of Algorithm Exp3.M.B is at least

  R \ge \min\left( \frac{c_{\min}^{3/2}\,(1 - K/N)^2}{8\sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}},\; \frac{B\,(1 - K/N)}{8} \right).

This bound is of order Ω((1 − K/N)² √(NB/K)).

Proof Idea

Step 1: Derive an auxiliary lemma. Select K out of N arms at random to be "good" arms with r_i(t) ∼ Bern(1/2 + ε); c_i(t) = c_min w.p. 1/2 + ε, c_i(t) = 1 w.p. 1/2 − ε

Lemma

Let f : ({0, 1} × {c_min, 1})^{τ_max} → [0, M] be any function defined on reward and cost sequences r, c of length at most τ_max = B/(K c_min). Then for the best action set a*:

  \mathbb{E}_{a^*}[f(\mathbf{r}, \mathbf{c})] \le \mathbb{E}_u[f(\mathbf{r}, \mathbf{c})] + \frac{B\, c_{\min}^{-3/2}}{2} \sqrt{-\mathbb{E}_u[N_{a^*}]\,\log(1 - 4\varepsilon^2)},


Lower Bound on the Regret (cont'd.)

Proof Idea (cont'd.)

Step 2: Notice there exist \binom{N}{K} unique combinations of K-tuples.
Let C([N], K) denote the set of all such subsets [10]
Let \mathbb{E}_*[\cdot] denote the expected value w.r.t. the uniform assignment of "good" arms.

  \mathbb{E}_*[G_{\max}] = \left(\tfrac{1}{2} + \varepsilon\right) K\, \mathbb{E}_*[\tau_A(B)],
  \mathbb{E}_{a^*}[G_A] = \tfrac{1}{2} K\, \mathbb{E}_{a^*}[\tau_A(B)] + \varepsilon\, \mathbb{E}_{a^*}[N_{a^*}],
  \mathbb{E}_*[G_A] = \frac{1}{\binom{N}{K}} \sum_{a^* \in C([N],K)} \mathbb{E}_{a^*}[G_A] = \tfrac{1}{2} K\, \mathbb{E}_*[\tau_A(B)] + \frac{\varepsilon}{\binom{N}{K}} \sum_{a^* \in C([N],K)} \mathbb{E}_{a^*}[N_{a^*}].

Step 3: Use the previous lemma to bound \mathbb{E}_*[G_{\max} - G_A]:

  \mathbb{E}_*[G_{\max} - G_A] \ge \varepsilon B\left(1 - \frac{K}{N}\right) - \frac{2\varepsilon^2 B}{c_{\min}^{3/2}} \sqrt{\frac{BK}{N}\,\log(4/3)}.

Step 4: Tune ε to optimize the bound:

  \varepsilon = \min\left( \frac{1}{4},\; \frac{c_{\min}^{3/2}}{4\sqrt{\log(4/3)}}\left(1 - \frac{K}{N}\right) \sqrt{\frac{N}{BK}} \right).

[10] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". In: International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
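As a consistency check between Steps 3 and 4 (our own verification, not on the slides): the Step 3 bound is quadratic in ε, so maximizing it in closed form recovers both the tuned ε and the first argument of the min in the theorem:

```latex
g(\varepsilon)
  = \varepsilon B\Big(1-\tfrac{K}{N}\Big)
    - \frac{2\varepsilon^{2} B}{c_{\min}^{3/2}}\sqrt{\tfrac{BK}{N}\log\tfrac{4}{3}}
\;\Longrightarrow\;
\varepsilon^{\star}
  = \frac{c_{\min}^{3/2}\,(1-K/N)}{4\sqrt{\log(4/3)}}\sqrt{\frac{N}{BK}},
\qquad
g(\varepsilon^{\star})
  = \frac{\varepsilon^{\star} B}{2}\Big(1-\frac{K}{N}\Big)
  = \frac{c_{\min}^{3/2}\,(1-K/N)^{2}}{8\sqrt{\log(4/3)}}\sqrt{\frac{NB}{K}}.
```

When the cap ε = 1/4 binds instead (i.e., ε* ≥ 1/4), evaluating g(1/4) gives at least B(1 − K/N)/8, the second argument of the min.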


High Probability Bound on the Regret

Algorithm Exp3.P.M.B

Modification of Algorithm Exp3.M.B:

Initialize parameter

  \alpha = 2\sqrt{6}\,\sqrt{\frac{N-K}{N-1}\,\log\!\left(\frac{NB}{K c_{\min} \delta}\right)}.

Initialize weights for i ∈ [N]:

  w_i(1) = \exp\!\left(\frac{\alpha\gamma K^2}{3}\sqrt{B/(NK c_{\min})}\right).

Update weights for i ∈ [N] as follows:

  w_i(t+1) = w_i(t)\,\exp\left[\mathbf{1}(i \notin S(t))\,\frac{\gamma K}{3N}\left(\hat r_i(t) - \hat c_i(t) + \frac{\alpha\sqrt{K c_{\min}}}{p_i(t)\sqrt{NB}}\right)\right].

Theorem (High Probability Upper Bound on R_A(B) for Algorithm Exp3.P.M.B)

For the multiple play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following bound on the regret holds with probability at least 1 − δ:

  R \le 2\sqrt{3}\,\sqrt{\frac{NB\,(1 - c_{\min})}{c_{\min}}\,\log\frac{N}{K}}
      + 4\sqrt{6}\,\frac{N-K}{N-1}\,\log\!\left(\frac{NB}{K c_{\min}\delta}\right)
      + 2\sqrt{6}\,(1 + K^2)\,\sqrt{\frac{N-K}{N-1}\,\frac{NB}{K c_{\min}}\,\log\!\left(\frac{NB}{K c_{\min}\delta}\right)}

    = O\left(K^2\sqrt{\frac{NB}{K}\,\frac{N-K}{N-1}\,\log\!\left(\frac{NB}{K\delta}\right)} + \frac{N-K}{N-1}\,\log\!\left(\frac{NB}{K\delta}\right)\right)
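A sketch of the two modified ingredients, mirroring the displayed formulas (our own rendering; function names and the argument interface are assumptions, and this is an untested illustration rather than the authors' reference code):

```python
import numpy as np

def exp3pmb_init(N, K, B, c_min, gamma, delta):
    """Initial alpha and weights w_i(1) for Exp3.P.M.B (formulas from the slide)."""
    alpha = 2 * np.sqrt(6) * np.sqrt(
        (N - K) / (N - 1) * np.log(N * B / (K * c_min * delta)))
    w1 = np.exp(alpha * gamma * K**2 / 3 * np.sqrt(B / (N * K * c_min)))
    return alpha, np.full(N, w1)

def exp3pmb_update(w, p, played, capped, r, c, alpha, gamma, K, B, c_min):
    """One weight update: every uncapped arm gets the estimate plus the alpha bonus."""
    N = len(w)
    r_hat = np.where(played, r / p, 0.0)     # importance-weighted reward
    c_hat = np.where(played, c / p, 0.0)     # importance-weighted cost
    bonus = alpha * np.sqrt(K * c_min) / (p * np.sqrt(N * B))   # applies to all arms
    growth = np.exp(gamma * K / (3 * N) * (r_hat - c_hat + bonus))
    return np.where(capped, w, w * growth)   # capped arms (i in S(t)) keep v(t)
```

The α/(p_i √(NB)) bonus inflates the weight of rarely played arms, which is what converts the expected-regret analysis into one that holds with probability 1 − δ.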


High Probability Bound on the Regret (cont'd.)

Theorem (High Probability Upper Bound on R_A(B) for Algorithm Exp3.P.M.B)

For the multiple play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following bound on the regret holds with probability at least 1 − δ:

  R = O\left(K^2\sqrt{\frac{NB}{K}\,\frac{N-K}{N-1}\,\log\!\left(\frac{NB}{K\delta}\right)} + \frac{N-K}{N-1}\,\log\!\left(\frac{NB}{K\delta}\right)\right)

Remark

Recovers the O(√(NT log(NT/δ)) + log(NT/δ)) bound [11] for K = 1, no costs

Proof Idea

Step 1: Derive an upper confidence bound U on G_max that holds w.h.p.
Step 2: Lower bound G_Exp3.P.M.B as a function of U
Step 3: Combine to obtain an upper bound on G_max − G_Exp3.P.M.B

[11] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". In: Machine Learning 47 (2002), pp. 235–256.


Page 55: Budget-Constrained Multi-Armed Bandits with Multiple Plays · The Nonstochastic Multi-Armed Bandit Problem".In: SIAM Journal on Computing 32 (2002), pp. 48{77. 2T. Uchiya, A. Nakamura,

Adversarial MAB with Multiple Play and Budget Constraints High Probability Bounds on the Regret

High Probability Bound on the Regret (cont’d.)

Theorem (High Probability Upper Bound on RA(B) for Algorithm Exp3.P.M.B)

For the multiple play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following boundon the regret holds with probability at least 1− δ:

R = O

(K 2

√NB

K

N − K

N − 1log

(NB

)+

N − K

N − 1log

(NB

))

Remark

Recovers O(√

NT log(NT/δ) + log(NT/δ)) bound11 for K = 1, no costs

Proof Idea

Step 1: Derive upper confidence bound U on Gmax that holds w .h.p.

Step 2: Lower bound GExp3.P.M.B as function of U

Step 3: Combine to obtain upper bound on Gmax − GExp3.P.M.B

11P. Auer, N. Cesa-Bianchi, and P. Fischer. “Finite-Time Analysis of the Multiarmed Bandit Problem”. In: Machine Learning 47 (2002), pp. 235–256.

15 / 22

Page 56: Budget-Constrained Multi-Armed Bandits with Multiple Plays · The Nonstochastic Multi-Armed Bandit Problem".In: SIAM Journal on Computing 32 (2002), pp. 48{77. 2T. Uchiya, A. Nakamura,

Adversarial MAB with Multiple Play and Budget Constraints High Probability Bounds on the Regret

High Probability on the Regret (cont’d.)

Proof Idea (cont’d.)

Step 1: Derive upper confidence bound U on Gmax that holds w .h.p.:Define upper confidence bound:

U∗ =∑i∈a∗

ασi +

τa∗ (B)∑t=1

(ri (t)− ci (t))

For 2

√6√

N−KN−1

log NBKcminδ

≤ α ≤ 12√

NBKcmin

, we can show P(U∗ > Gmax − B

)≥ 1− δ

Step 2: Lower bound GExp3.P.M.B as function of U:

For α ≤ 2√

NBKcmin

, the gain of Algorithm Exp3.P.M.B is bounded below as follows:

GExp3.P.M.B ≥(

1− γ −2γ

3

1− cmin

cmin

)U∗ −

3N

γlog

N

K− 2α2 − α(1 + K2)

BN

Kcmin.

Step 3: Combine to obtain an upper bound on $G_{\max} - G_{\text{Exp3.P.M.B}}$ and tune γ:

\[ \gamma = \min\left\{ \left( 1 + \frac{2}{3} \cdot \frac{1 - c_{\min}}{c_{\min}} \right)^{-1},\; \left( \frac{3N \log(N/K)}{(G_{\max} - B)\big( 1 + 2(1 - c_{\min})/(3 c_{\min}) \big)} \right)^{1/2} \right\} \]

16 / 22
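The interplay of Steps 1–3 can be exercised numerically. The sketch below applies the Step-3 tuning rule for γ and plugs the result into the Step-2 lower bound. All inputs are illustrative assumptions; since $U^*$ and $G_{\max} - B$ are unobservable in practice, crude stand-ins are used for both:

```python
# Numeric sketch of Steps 2-3 (illustrative values only; U_star and
# G_max - B are unknown in practice and replaced by crude stand-ins).
import math

def tune_gamma(N, K, c_min, G_max_minus_B):
    """Step 3: gamma = min{ (1 + (2/3)(1-c_min)/c_min)^(-1),
    sqrt( 3N log(N/K) / ((G_max - B)(1 + 2(1-c_min)/(3 c_min))) ) }."""
    cost_factor = 1 + 2 * (1 - c_min) / (3 * c_min)
    return min(1 / cost_factor,
               math.sqrt(3 * N * math.log(N / K) / (G_max_minus_B * cost_factor)))

def gain_lower_bound(U_star, N, K, B, c_min, alpha, gamma):
    """Step 2's lower bound on the gain of Exp3.P.M.B."""
    lead = 1 - gamma - (2 * gamma / 3) * (1 - c_min) / c_min
    return (lead * U_star
            - (3 * N / gamma) * math.log(N / K)
            - 2 * alpha ** 2
            - alpha * (1 + K ** 2) * math.sqrt(B * N / (K * c_min)))

N, K, B, c_min, delta = 10, 3, 1_000_000.0, 0.2, 0.05
# alpha at the lower end of Step 1's admissible range.
alpha = 2 * math.sqrt(6) * math.sqrt((N - K) / (N - 1)
                                     * math.log(N * B / (K * c_min * delta)))
gamma = tune_gamma(N, K, c_min, G_max_minus_B=B)   # stand-in: G_max - B ~ B
print(f"gamma = {gamma:.5f}")
print(f"G_Exp3.P.M.B >= {gain_lower_bound(1.2 * B, N, K, B, c_min, alpha, gamma):,.0f}")
```

For these stand-ins the lower bound is positive, i.e. non-vacuous; for small budgets the additive penalty terms dominate and the guarantee is trivially true but uninformative.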


Conclusion

Table of Contents

1 Background: The Multi-Armed Bandit (MAB) Problem
      Problem Formulation
      Contributions

2 Stochastic MAB with Multiple Play and Budget Constraints
      Setup
      Algorithm UCB-MB

3 Adversarial MAB with Multiple Play and Budget Constraints
      Upper Bounds on the Regret
      Lower Bounds on the Regret
      High Probability Bounds on the Regret

4 Conclusion

5 References

17 / 22


Conclusion

Conclusion and Outlook

Summary of Contributions

Analysis of budget-constrained MABs with multiple plays

Stochastic Setting
      Algorithm UCB-MB, based on UCB1¹² and UCB-BV¹³, to upper-bound the regret

Adversarial Setting
      Algorithm Exp3.M.B to upper-bound the regret
      Modification of the weight updates in Exp3.M.B to lower-bound the regret
      Algorithm Exp3.P.M.B for a high-probability upper bound

Future Work

Simulations with data

Connect to Badanidiyuru et al.¹⁴ to recover the O(√B) adversarial bound with a null arm

Explore connection to e-commerce (mechanism design)
      Strategic considerations of agents in repeated interactions between buyer and seller
      Pool of buyers vs. single buyer

¹²P. Auer, N. Cesa-Bianchi, and P. Fischer. “Finite-Time Analysis of the Multiarmed Bandit Problem”. In: Machine Learning 47 (2002), pp. 235–256.
¹³W. Ding et al. “Multi-Armed Bandit with Budget Constraint and Variable Costs”. In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
¹⁴A. Badanidiyuru, R. Kleinberg, and A. Slivkins. “Bandits with Knapsacks”. In: Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (2013), pp. 207–216.

18 / 22
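For readers unfamiliar with the exponential-weighting machinery behind Exp3.M.B and Exp3.P.M.B, the sketch below shows a schematic single-play Exp3-style loop with per-round costs and a budget stop. It is not the authors' algorithm (Exp3.M.B plays K arms per round via dependent rounding, and Exp3.P.M.B additionally mixes confidence terms into the weights); it only illustrates the weight update that those modifications build on, and the toy environment at the bottom is entirely hypothetical:

```python
# Schematic single-play Exp3-style loop with per-round costs and a budget
# stop. NOT the authors' Exp3.M.B / Exp3.P.M.B; a minimal illustration of
# the exponential weight update that the budgeted multiple-play variants extend.
import math
import random

def exp3_budgeted(pull, N, B, gamma=0.05):
    """pull(i) -> (reward, cost) with reward in [0, 1] and cost in (0, 1]."""
    w = [1.0] * N                      # one weight per arm
    budget, gain = B, 0.0
    while budget > 0:
        total = sum(w)
        # Mix the weight distribution with uniform exploration.
        p = [(1 - gamma) * wi / total + gamma / N for wi in w]
        i = random.choices(range(N), weights=p)[0]
        r, c = pull(i)
        budget -= c                    # the pull's cost consumes budget
        gain += r
        # Importance-weighted reward estimate keeps the update unbiased.
        w[i] *= math.exp(gamma * (r / p[i]) / N)
    return gain

# Toy usage: Bernoulli rewards and fixed costs per arm (hypothetical environment).
rewards, costs = [0.2, 0.5, 0.8], [0.4, 0.5, 0.6]
pull = lambda i: (float(random.random() < rewards[i]), costs[i])
print(exp3_budgeted(pull, N=3, B=100.0))
```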


References

Table of Contents

1 Background: The Multi-Armed Bandit (MAB) Problem
      Problem Formulation
      Contributions

2 Stochastic MAB with Multiple Play and Budget Constraints
      Setup
      Algorithm UCB-MB

3 Adversarial MAB with Multiple Play and Budget Constraints
      Upper Bounds on the Regret
      Lower Bounds on the Regret
      High Probability Bounds on the Regret

4 Conclusion

5 References

19 / 22


References

References I

P. Auer, N. Cesa-Bianchi, and P. Fischer. “Finite-Time Analysis of the Multiarmed Bandit Problem”. In: Machine Learning 47 (2002), pp. 235–256.

R. Agrawal, M. V. Hegde, and D. Teneketzis. “Multi-Armed Bandits with Multiple Plays and Switching Cost”. In: Stochastics and Stochastic Reports 29 (1990), pp. 437–459.

V. Anantharam, P. Varaiya, and J. Walrand. “Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays - Part I: IID Rewards”. In: IEEE Transactions on Automatic Control 32 (1987), pp. 968–976.

P. Auer et al. “The Nonstochastic Multi-Armed Bandit Problem”. In: SIAM Journal on Computing 32 (2002), pp. 48–77.

A. Badanidiyuru, R. Kleinberg, and A. Slivkins. “Bandits with Knapsacks”. In: Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (2013), pp. 207–216.

W. Ding et al. “Multi-Armed Bandit with Budget Constraint and Variable Costs”. In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).

Y. Gai, B. Krishnamachari, and R. Jain. “Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations”. In: IEEE/ACM Transactions on Networking 20.5 (2012), pp. 1466–1478.

20 / 22


References

References II

L. Tran-Thanh et al. “Epsilon-First Policies for Budget-Limited Multi-Armed Bandits”. In: Twenty-Fourth AAAI Conference on Artificial Intelligence (2010), pp. 1211–1216.

L. Tran-Thanh et al. “Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits”. In: Twenty-Sixth AAAI Conference on Artificial Intelligence (2012), pp. 1134–1140.

T. Uchiya, A. Nakamura, and M. Kudo. “Algorithms for Adversarial Bandit Problems with Multiple Plays”. In: International Conference on Algorithmic Learning Theory (2010), pp. 375–389.

Y. Xia et al. “Budgeted Multi-Armed Bandits with Multiple Plays”. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence (2016), pp. 2210–2216.

21 / 22


References

THANK YOU!

QUESTIONS?

22 / 22