Budget-Constrained Multi-Armed Bandits with Multiple Plays
Datong P. Zhou, Claire J. Tomlin
AAAI-2018
[datong.zhou, tomlin]@berkeley.edu
February 5, 2018
Table of Contents
1 Background: The Multi-Armed Bandit (MAB) Problem
  - Problem Formulation
  - Contributions
2 Stochastic MAB with Multiple Plays and Budget Constraints
  - Setup
  - Algorithm UCB-MB
3 Adversarial MAB with Multiple Plays and Budget Constraints
  - Upper Bounds on the Regret
  - Lower Bounds on the Regret
  - High Probability Bounds on the Regret
4 Conclusion
5 References
Background: The Multi-Armed Bandit (MAB) Problem · Problem Formulation
Background

Examples
- Product recommendation to maximize sales
- Ad placement to maximize click-through rate

Classical Problem and Objective
- Given N arms with unknown reward distributions
- Pull arms sequentially to maximize the total expected reward over a given horizon T
- At each round t ∈ [T]:
  - The player selects exactly one action a_t
  - The player observes the gain r_{a_t, t}
- Goal: minimize the cumulative regret

  R_A(T) = \max_{i \in [N]} \mathbb{E}\Big[ \sum_{t=1}^{T} r_{i,t} \Big] - \mathbb{E}\Big[ \sum_{t=1}^{T} r_{a_t,t} \Big]

- Stochastic generation of rewards vs. the adversarial setting
- Recent developments: costs c_{i,t} and multiple plays K
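To make the round-by-round protocol concrete, here is a minimal simulation sketch (Python). The Bernoulli arms and the epsilon-greedy player are illustrative assumptions, not the paper's method; the sketch only exercises the protocol and estimates the regret R_A(T) defined above from a single run:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 10_000, 5
means = rng.uniform(0.2, 0.8, size=N)      # Bernoulli means, unknown to the player

eps = 0.1                                  # epsilon-greedy, purely illustrative
counts, sums, gain = np.zeros(N), np.zeros(N), 0.0
for t in range(T):
    if counts.min() == 0 or rng.random() < eps:
        a = int(rng.integers(N))           # explore a random arm
    else:
        a = int(np.argmax(sums / counts))  # exploit the best empirical mean
    r = float(rng.random() < means[a])     # observe the gain r_{a_t, t}
    counts[a] += 1
    sums[a] += r
    gain += r

# One-run estimate of R_A(T) = max_i E[sum_t r_{i,t}] - E[sum_t r_{a_t,t}]
print(f"estimated regret after {T} rounds: {T * means.max() - gain:.1f}")
```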
Background: The Multi-Armed Bandit (MAB) Problem · Contributions
Contributions

Algorithm   | Upper Bound                                                              | Lower Bound              | Setting        | Authors
Exp3        | O(√(NT log N))                                                           | Ω(√(NT))                 | Fixed T, K = 1 | [1]
Exp3.M      | O(√(NTK log(N/K)))                                                       | Ω((1 − K/N)² √(NT))      | Fixed T, K ≥ 1 | [2]
Exp3.M.B    | O(√(NB log(N/K)))                                                        | Ω((1 − K/N)² √(NB/K))    | B > 0, K ≥ 1   | This paper
Exp3.P      | O(√(NT log(NT/δ)) + log(NT/δ))                                           |                          | Fixed T, K = 1 | [3]
Exp3.P.M    | O(K² √(NT ((N−K)/(N−1)) log(NT/δ)) + ((N−K)/(N−1)) log(NT/δ))            |                          | Fixed T, K ≥ 1 | This paper
Exp3.P.M.B  | O(K² √((NB/K) ((N−K)/(N−1)) log(NB/(Kδ))) + ((N−K)/(N−1)) log(NB/(Kδ)))  |                          | B > 0, K ≥ 1   | This paper
UCB1        | O(N log T)                                                               |                          | Fixed T, K = 1 | [4]
LLR         | O(NK⁴ log T)                                                             |                          | Fixed T, K ≥ 1 | [5]
UCB-BV      | O(N log B)                                                               |                          | B > 0, K = 1   | [6]
UCB-MB      | O(NK⁴ log B)                                                             |                          | B > 0, K ≥ 1   | This paper

[1] P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
[2] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
[3] P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
[5] Y. Gai, B. Krishnamachari, and R. Jain. "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations". IEEE/ACM Transactions on Networking 20.5 (2012), pp. 1466–1478.
[6] W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
Stochastic MAB with Multiple Plays and Budget Constraints · Setup
Setup

Problem Description
- Bandit with N distinct arms
- Each arm i has unknown cost and reward distributions with means 0 < µ_i^r ≤ 1 and 0 < c_min ≤ µ_i^c ≤ 1
- Realizations of costs c_{i,t} ∈ [c_min, 1] and rewards r_{i,t} ∈ [0, 1] are i.i.d.
- An initial budget B > 0 pays for the materialized costs
- At each round t = 1, ..., τ_A(B) (see the protocol sketch after this slide):
  - Select exactly 1 ≤ K ≤ N arms into the action set a_t
  - Observe the individual costs {c_{i,t} : i ∈ a_t} and rewards {r_{i,t} : i ∈ a_t}
  - Terminate the game if ∑_{i ∈ a_t} c_{i,t} exceeds the remaining budget

Goal
- Minimize the expected regret

  R_A(B) = \mathbb{E}[G_{A^*}(B)] - \mathbb{E}[G_A(B)]

- Use a modified UCB algorithm with upper confidence bounds U_{i,t} = \hat{\mu}_{i,t} + e_{i,t}
- At each round, play the K arms with the K largest U_{i,t}
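A minimal sketch of this interaction protocol (Python). The uniformly random placeholder policy and the cost/reward distributions are illustrative assumptions; only the budget accounting and the termination rule follow the setup above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, B, c_min = 8, 3, 500.0, 0.1
mu_r = rng.uniform(0.1, 1.0, size=N)           # mean rewards, unknown to the player
mu_c = rng.uniform(c_min, 1.0, size=N)         # mean costs in [c_min, 1]

budget, t, gain = B, 0, 0.0
while True:
    t += 1                                     # round t = 1, ..., tau_A(B)
    a_t = rng.choice(N, size=K, replace=False) # placeholder policy: K random arms
    c = np.clip(rng.normal(mu_c[a_t], 0.05), c_min, 1.0)  # realized costs
    r = (rng.random(K) < mu_r[a_t]).astype(float)         # realized rewards
    if c.sum() > budget:                       # costs exceed the remaining budget:
        break                                  # the game stops, tau_A(B) = t
    budget -= c.sum()
    gain += r.sum()

print(f"tau_A(B) = {t}, total gain G_A(B) = {gain:.1f}")
```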
Stochastic MAB with Multiple Plays and Budget Constraints · Algorithm UCB-MB
Algorithm UCB-MB

The Algorithm
- Play all arms once to initialize the bang-per-buck ratio estimate of each arm
- While B is not exhausted, select the K arms with the K largest U_{i,t} (a sketch of this selection rule follows below)

Theorem (Upper Bound on R_A(B) for Algorithm UCB-MB)
With the confidence bounds defined as

  U_{i,t} = \hat{\mu}_{i,t} + \frac{\sqrt{(K+1)\log t / n_{i,t}}\;\big(1 + 1/c_{\min}\big)}{c_{\min} - \sqrt{(K+1)\log t / n_{i,t}}},

Algorithm UCB-MB achieves expected regret R_A(B) = O(NK^4 \log B).

Proof Idea
- Step 1: Upper-bound the number of times a non-optimal selection of arms is made up to a fixed stopping time τ_A(B):

  \#\text{ suboptimal choices} = O\big( NK^3 \log \tau_A(B) \big)

- Step 2: Relate Algorithm UCB-MB and the budget B to the stopping time τ_A(B)
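A minimal sketch of the index computation and selection (Python), assuming mu_hat holds the empirical bang-per-buck estimates and n the play counts; treating the bound as vacuous (infinite) when its denominator is nonpositive is my own guard, not part of the paper:

```python
import numpy as np

def ucb_mb_select(mu_hat, n, t, K, c_min):
    """Pick the K arms with the largest indices U_{i,t} from the theorem."""
    e = np.sqrt((K + 1) * np.log(t) / n)            # sqrt((K+1) log t / n_{i,t})
    U = mu_hat + e * (1 + 1 / c_min) / (c_min - e)  # confidence bound U_{i,t}
    U[c_min - e <= 0] = np.inf                      # guard: force more exploration
    return np.argsort(U)[-K:]                       # K largest indices
```

While the budget lasts, the player pays the realized costs of the selected arms and updates mu_hat and n for every i ∈ a_t.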
Adversarial MAB with Multiple Plays and Budget Constraints · Upper Bounds on the Regret
Upper Bound on the Regret

Setup
- Oblivious adversary ⇒ no assumptions on the reward or cost distributions except boundedness

Algorithm Exp3.M.B (a one-round sketch follows below)
- Initialize weights w_i(1) = 1 for i = 1, ..., N
- For each round t = 1, ..., τ_A(B):
  - Cap weights that are "too large": set w_i(t) = v(t) for i ∈ S(t) = {i ∈ [N] : w_i(t) > v(t)}, where v(t) solves

    \frac{v(t)\,(1 - \gamma)}{\sum_{i=1}^{N} \big[ v(t)\,\mathbb{1}(w_i(t) \ge v(t)) + w_i(t)\,\mathbb{1}(w_i(t) < v(t)) \big]} = \frac{1}{K} - \frac{\gamma}{N}

  - Calculate the probabilities

    p_i(t) = K \left( (1 - \gamma)\, \frac{w_i(t)}{\sum_{j=1}^{N} w_j(t)} + \frac{\gamma}{N} \right)

  - Play a set a_t of K distinct arms drawn according to p_1(t), ..., p_N(t)
  - Update the estimates and weights (capped arms keep their weights):

    \hat{r}_i(t) = \frac{r_i(t)}{p_i(t)}\,\mathbb{1}(i \in a_t), \qquad \hat{c}_i(t) = \frac{c_i(t)}{p_i(t)}\,\mathbb{1}(i \in a_t),

    w_i(t+1) = w_i(t)\, \exp\!\left[ \frac{K\gamma}{N} \big( \hat{r}_i(t) - \hat{c}_i(t) \big)\, \mathbb{1}(i \notin S(t)) \right]
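One round of Exp3.M.B as a sketch (Python). The bisection used to find v(t) and the sample_k routine (which must draw K distinct arms with marginal probabilities p_i(t), e.g. the dependent-rounding sampler used by Exp3.M) are assumptions of this sketch:

```python
import numpy as np

def exp3mb_round(w, K, gamma, observe, sample_k):
    """One round of Exp3.M.B (sketch). observe(i) returns the realized
    (reward, cost) of a played arm; sample_k(p, K) returns K distinct arms
    with marginal probabilities p_i (assumed given, e.g. DepRound)."""
    N = len(w)
    p = K * ((1 - gamma) * w / w.sum() + gamma / N)
    S = np.zeros(N, dtype=bool)
    if p.max() > 1:                      # cap weights that are "too large"
        target = 1 / K - gamma / N
        lo, hi = 0.0, w.max()
        for _ in range(60):              # bisect: v(1-g) / sum_i min(w_i, v) = target
            v = 0.5 * (lo + hi)
            if v * (1 - gamma) / np.minimum(w, v).sum() < target:
                lo = v
            else:
                hi = v
        S = w >= v                       # capped set S(t)
        w[S] = v                         # set w_i(t) = v(t) on S(t)
        p = K * ((1 - gamma) * w / w.sum() + gamma / N)
    a_t = sample_k(p, K)
    for i in a_t:
        r_i, c_i = observe(i)            # r_hat, c_hat are zero off a_t
        if not S[i]:                     # capped arms keep their weights
            w[i] *= np.exp(K * gamma / N * (r_i - c_i) / p[i])
    return w, p, a_t
```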
Adversarial MAB with Multiple Plays and Budget Constraints · Upper Bounds on the Regret
Upper Bound on the Regret (cont'd.)

Analysis of Algorithm Exp3.M.B

Theorem (Upper Bound on R_A(B) for Algorithm Exp3.M.B)
Algorithm Exp3.M.B achieves cumulative regret R_A(B) = O(\sqrt{BN \log(N/K)}).

Proof Idea
Modify existing proof techniques [7]:
- Step 1: Assume a fixed time horizon T = max(τ_A(B), τ_{A^*}(B))
- Step 2: Relate T to the budget B

Remarks
Our bound recovers previous findings in the following special cases:
- The O(\sqrt{BN \log N}) bound for K = 1 [8]
- The O(\sqrt{TN \log N}) bound for fixed T, K = 1, and no costs or budget [9]

[7] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389; P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
[8] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
[9] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
Adversarial MAB with Multiple Plays and Budget Constraints · Lower Bounds on the Regret
Lower Bound on the Regret

Theorem (Lower Bound on R_A(B) for Algorithm Exp3.M.B)
The weak regret of Algorithm Exp3.M.B is at least

  R \ge \min\left( \frac{c_{\min}^{3/2}\,(1 - K/N)^2}{8\sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}},\;\; \frac{B\,(1 - K/N)}{8} \right).

This bound is of order \Omega\big( (1 - K/N)^2 \sqrt{NB/K} \big).

Proof Idea
Step 1: Derive an auxiliary lemma. Select K out of N arms at random to be "good" arms with r_i(t) ~ Bern(1/2 + ε); c_i(t) = c_min w.p. 1/2 + ε and c_i(t) = 1 w.p. 1/2 − ε.

Lemma
Let f : (\{0,1\} \times \{c_{\min}, 1\})^{\tau_{\max}} \to [0, M] be any function defined on reward and cost sequences r, c of length at most \tau_{\max} = B/(K c_{\min}). Then, for the best action set a^*:

  \mathbb{E}_{a^*}[f(\mathbf{r}, \mathbf{c})] \le \mathbb{E}_u[f(\mathbf{r}, \mathbf{c})] + \frac{B\, c_{\min}^{-3/2}}{2} \sqrt{ -\mathbb{E}_u[N_{a^*}] \log(1 - 4\varepsilon^2) }.
Adversarial MAB with Multiple Plays and Budget Constraints · Lower Bounds on the Regret
Lower Bound on the Regret (cont'd.)

Proof Idea (cont'd.)
Step 2: Note that there exist \binom{N}{K} unique K-element subsets of the arms. Let C([N], K) denote the set of all such subsets [10], and let \mathbb{E}_*[\cdot] denote the expectation w.r.t. the uniform random assignment of "good" arms. Then

  \mathbb{E}_*[G_{\max}] = \Big( \frac{1}{2} + \varepsilon \Big) K\, \mathbb{E}_*[\tau_A(B)],

  \mathbb{E}_{a^*}[G_A] = \frac{1}{2} K\, \mathbb{E}_{a^*}[\tau_A(B)] + \varepsilon\, \mathbb{E}_{a^*}[N_{a^*}],

  \mathbb{E}_*[G_A] = \binom{N}{K}^{-1} \sum_{a^* \in C([N],K)} \mathbb{E}_{a^*}[G_A] = \frac{1}{2} K\, \mathbb{E}_*[\tau_A(B)] + \binom{N}{K}^{-1} \varepsilon \sum_{a^* \in C([N],K)} \mathbb{E}_{a^*}[N_{a^*}].

Step 3: Use the previous lemma to bound \mathbb{E}_*[G_{\max} - G_A]:

  \mathbb{E}_*[G_{\max} - G_A] \ge \varepsilon B \Big( 1 - \frac{K}{N} \Big) - \frac{2\varepsilon^2 B}{c_{\min}^{3/2}} \sqrt{ \frac{BK}{N} \log(4/3) }.

Step 4: Tune ε to optimize the bound (a worked version of this optimization follows below):

  \varepsilon = \min\left( \frac{1}{4},\;\; \frac{c_{\min}^{3/2}\,(1 - K/N)}{4} \sqrt{ \frac{N}{BK \log(4/3)} } \right).

[10] T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
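Step 4 is a standard quadratic maximization; a worked version (LaTeX), assuming the Step 3 bound has the form L(ε) = Aε − Cε² with A = B(1 − K/N) and C = (2B/c_min^{3/2}) √((BK/N) log(4/3)):

```latex
\[
  L'(\varepsilon^\star) = A - 2C\varepsilon^\star = 0
  \;\Longrightarrow\;
  \varepsilon^\star = \frac{A}{2C}
  = \frac{c_{\min}^{3/2}\,(1 - K/N)}{4} \sqrt{\frac{N}{BK\log(4/3)}},
\]
\[
  L(\varepsilon^\star) = \frac{A^2}{4C}
  = \frac{c_{\min}^{3/2}\,(1 - K/N)^2}{8\sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}},
\]
% which is the first argument of the min in the theorem; the cap
% epsilon <= 1/4 keeps the Bernoulli parameters in [0, 1] and
% is consistent with the second argument, B(1 - K/N)/8.
```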
Adversarial MAB with Multiple Plays and Budget Constraints · High Probability Bounds on the Regret
High Probability Bound on the Regret

Algorithm Exp3.P.M.B
Modification of Algorithm Exp3.M.B (a sketch of the changed pieces follows below):
- Initialize the parameter

  \alpha = 2\sqrt{6}\, \sqrt{ \frac{N-K}{N-1}\, \log\!\Big( \frac{NB}{K c_{\min} \delta} \Big) }.

- Initialize the weights w_i(1) = \exp\!\Big( \frac{\alpha \gamma K^2}{3} \sqrt{ \frac{B}{N K c_{\min}} } \Big) for i ∈ [N].
- Update the weights for i ∈ [N] as follows:

  w_i(t+1) = w_i(t)\, \exp\!\left[ \mathbb{1}(i \notin S(t))\, \frac{\gamma K}{3N} \left( \hat{r}_i(t) - \hat{c}_i(t) + \frac{\alpha \sqrt{K c_{\min}}}{p_i(t) \sqrt{NB}} \right) \right].

Theorem (High Probability Upper Bound on R_A(B) for Algorithm Exp3.P.M.B)
For the multiple-play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following bound on the regret holds with probability at least 1 − δ:

  R \le 2\sqrt{3}\, \sqrt{ \frac{NB (1 - c_{\min})}{c_{\min}} \log\frac{N}{K} }
      + 4\sqrt{6}\, \frac{N-K}{N-1} \log\!\Big( \frac{NB}{K c_{\min} \delta} \Big)
      + 2\sqrt{6}\, (1 + K^2) \sqrt{ \frac{N-K}{N-1}\, \frac{NB}{K c_{\min}}\, \log\!\Big( \frac{NB}{K c_{\min} \delta} \Big) }

    = O\left( K^2 \sqrt{ \frac{NB}{K}\, \frac{N-K}{N-1}\, \log\!\Big( \frac{NB}{K\delta} \Big) } + \frac{N-K}{N-1} \log\!\Big( \frac{NB}{K\delta} \Big) \right).
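The high-probability variant changes only the initialization and the weight update relative to Exp3.M.B; a sketch of exactly those two pieces (Python), following the formulas above. The array/dict bookkeeping is an assumption of the sketch; capping, probability computation, and sampling carry over unchanged from Exp3.M.B:

```python
import numpy as np

def exp3pmb_init(N, K, B, c_min, delta, gamma):
    """alpha and the optimistic uniform initial weights of Exp3.P.M.B."""
    alpha = 2 * np.sqrt(6) * np.sqrt(
        (N - K) / (N - 1) * np.log(N * B / (K * c_min * delta)))
    w0 = np.exp(alpha * gamma * K**2 / 3 * np.sqrt(B / (N * K * c_min)))
    return alpha, np.full(N, w0)

def exp3pmb_update(w, p, S, a_t, r, c, alpha, gamma, K, B, c_min):
    """Weight update: uncapped arms get the importance-weighted gain plus
    the exploration bonus alpha*sqrt(K*c_min)/(p_i*sqrt(N*B))."""
    N = len(w)
    for i in range(N):
        if S[i]:                                    # i in S(t): capped, not updated
            continue
        r_hat = r[i] / p[i] if i in a_t else 0.0    # importance-weighted reward
        c_hat = c[i] / p[i] if i in a_t else 0.0    # importance-weighted cost
        bonus = alpha * np.sqrt(K * c_min) / (p[i] * np.sqrt(N * B))
        w[i] *= np.exp(gamma * K / (3 * N) * (r_hat - c_hat + bonus))
    return w
```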
Adversarial MAB with Multiple Plays and Budget Constraints · High Probability Bounds on the Regret
High Probability Bound on the Regret (cont'd.)

Theorem (High Probability Upper Bound on R_A(B) for Algorithm Exp3.P.M.B)
For the multiple-play algorithm (1 ≤ K ≤ N) and the budget B > 0, the following bound on the regret holds with probability at least 1 − δ:

  R = O\left( K^2 \sqrt{ \frac{NB}{K}\, \frac{N-K}{N-1}\, \log\!\Big( \frac{NB}{K\delta} \Big) } + \frac{N-K}{N-1} \log\!\Big( \frac{NB}{K\delta} \Big) \right)

Remark
- Recovers the O(\sqrt{NT \log(NT/\delta)} + \log(NT/\delta)) bound [11] for K = 1 and no costs

Proof Idea
- Step 1: Derive an upper confidence bound U on G_max that holds w.h.p.
- Step 2: Lower-bound G_Exp3.P.M.B as a function of U
- Step 3: Combine to obtain an upper bound on G_max − G_Exp3.P.M.B

[11] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
Adversarial MAB with Multiple Plays and Budget Constraints · High Probability Bounds on the Regret
High Probability Bound on the Regret (cont'd.)

Proof Idea (cont'd.)
Step 1: Derive an upper confidence bound U on G_max that holds w.h.p. Define the upper confidence bound

  U^* = \sum_{i \in a^*} \left( \alpha \sigma_i + \sum_{t=1}^{\tau_{a^*}(B)} \big( \hat{r}_i(t) - \hat{c}_i(t) \big) \right).

For 2\sqrt{6}\, \sqrt{ \frac{N-K}{N-1} \log\frac{NB}{K c_{\min} \delta} } \le \alpha \le 12 \sqrt{ \frac{NB}{K c_{\min}} }, we can show that P\big( U^* > G_{\max} - B \big) \ge 1 - \delta.

Step 2: Lower-bound G_Exp3.P.M.B as a function of U. For \alpha \le 2\sqrt{ \frac{NB}{K c_{\min}} }, the gain of Algorithm Exp3.P.M.B is bounded below as follows:

  G_{\text{Exp3.P.M.B}} \ge \Big( 1 - \gamma - \frac{2\gamma}{3}\, \frac{1 - c_{\min}}{c_{\min}} \Big) U^* - \frac{3N}{\gamma} \log\frac{N}{K} - 2\alpha^2 - \alpha (1 + K^2) \sqrt{ \frac{BN}{K c_{\min}} }.

Step 3: Combine to obtain an upper bound on G_max − G_Exp3.P.M.B, and tune γ (a worked version follows below):

  \gamma = \min\left( \Big( 1 + \frac{2}{3}\, \frac{1 - c_{\min}}{c_{\min}} \Big)^{-1},\;\; \left( \frac{3N \log(N/K)}{ (G_{\max} - B)\, \big( 1 + 2(1 - c_{\min})/(3 c_{\min}) \big) } \right)^{1/2} \right).
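The tuning of γ mirrors the ε step earlier: a worked version (LaTeX), under my assumption that the combined bound from Steps 1 and 2 takes the form Aγ + C/γ with A = (G_max − B)(1 + 2(1 − c_min)/(3c_min)) and C = 3N log(N/K):

```latex
\[
  \frac{d}{d\gamma}\Big( A\gamma + \frac{C}{\gamma} \Big)
  = A - \frac{C}{\gamma^2} = 0
  \;\Longrightarrow\;
  \gamma^\star = \sqrt{\frac{C}{A}}
  = \left( \frac{3N\log(N/K)}
                {(G_{\max} - B)\,\big(1 + 2(1 - c_{\min})/(3c_{\min})\big)}
    \right)^{1/2},
\]
% capped by the first argument of the min so that the factor
% 1 - gamma - (2*gamma/3)(1 - c_min)/c_min in Step 2 stays nonnegative.
```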
Conclusion
Conclusion and Outlook

Summary of Contributions
- Analysis of budget-constrained MABs with multiple plays
- Stochastic setting: Algorithm UCB-MB, based on UCB1 [12] and UCB-BV [13], with an upper bound on the regret
- Adversarial setting:
  - Algorithm Exp3.M.B with an upper bound on the regret
  - A modification of the weight updates in Exp3.M.B to lower-bound the regret
  - Algorithm Exp3.P.M.B for a high-probability upper bound

Future Work
- Simulations with data
- Connect to Badanidiyuru et al. [14] to recover the O(√B) adversarial bound with a null arm
- Explore connections to e-commerce (mechanism design): strategic considerations of agents in repeated buyer-seller interactions; a pool of buyers vs. a single buyer

[12] P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
[13] W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
[14] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. "Bandits with Knapsacks". Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (2013), pp. 207–216.
References
References I
- P. Auer, N. Cesa-Bianchi, and P. Fischer. "Finite-Time Analysis of the Multiarmed Bandit Problem". Machine Learning 47 (2002), pp. 235–256.
- R. Agrawal, M. V. Hegde, and D. Teneketzis. "Multi-Armed Bandits with Multiple Plays and Switching Cost". Stochastics and Stochastic Reports 29 (1990), pp. 437–459.
- V. Anantharam, P. Varaiya, and J. Walrand. "Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem, Part I: IID Rewards". IEEE Transactions on Automatic Control 32 (1986), pp. 968–976.
- P. Auer et al. "The Nonstochastic Multi-Armed Bandit Problem". SIAM Journal on Computing 32 (2002), pp. 48–77.
- A. Badanidiyuru, R. Kleinberg, and A. Slivkins. "Bandits with Knapsacks". Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (2013), pp. 207–216.
- W. Ding et al. "Multi-Armed Bandit with Budget Constraint and Variable Costs". Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
- Y. Gai, B. Krishnamachari, and R. Jain. "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations". IEEE/ACM Transactions on Networking 20.5 (2012), pp. 1466–1478.
References II
- L. Tran-Thanh et al. "Epsilon-First Policies for Budget-Limited Multi-Armed Bandits". Twenty-Fourth AAAI Conference on Artificial Intelligence (2010), pp. 1211–1216.
- L. Tran-Thanh et al. "Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits". Twenty-Sixth AAAI Conference on Artificial Intelligence (2012), pp. 1134–1140.
- T. Uchiya, A. Nakamura, and M. Kudo. "Algorithms for Adversarial Bandit Problems with Multiple Plays". International Conference on Algorithmic Learning Theory (2010), pp. 375–389.
- Y. Xia et al. "Budgeted Multi-Armed Bandits with Multiple Plays". Proceedings of the 25th International Joint Conference on Artificial Intelligence (2016), pp. 2210–2216.
THANK YOU!
QUESTIONS?