SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits
Etienne Boursier†, Vianney Perchet*†
†ENS Paris-Saclay, CMLA    *Criteo AI
[email protected], [email protected]
Summary
• Synchronisation ⇒ communication through collisions in MMAB
• Contradicts previous lower bounds
• Synchronisation is unrealistic and a loophole in the model
• More realistic dynamic model: first logarithmic-regret algorithm
Multiplayer bandits model
Motivation: cognitive radio networks.
Framework: K arms, M ≤ K players. At each round t ≤ T, player j pulls arm π_j(t).
Reward: r_j(t) := X_{π_j(t)}(t) (1 − η_{π_j(t)}(t)), where
• η_k(t) := 1 if several players pull arm k, 0 otherwise (collision indicator)
• X_k(t) ∈ [0, 1] i.i.d. with E[X_k] = µ_k
Decentralized: no information exchange between players.
Observations:
• Collision sensing: player j observes r_j(t) and η_{π_j(t)}(t)
• No sensing: player j observes r_j(t) only
Synchronisation models:
• Static: every player is active for t = 1, …, T
• Dynamic: player j is active for t = 1 + τ_j, …, T
Goal: with M(t) the set of active players at time t, minimize the regret
R_T := ∑_{t=1}^{T} ∑_{k=1}^{#M(t)} µ_(k) − E_µ[ ∑_{t=1}^{T} ∑_{j∈M(t)} r_j(t) ]
Notations: µ_(1) ≥ … ≥ µ_(K) and ∆̄(M) := min_{i=1,…,M} (µ_(i) − µ_(i+1))
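As a minimal sketch of this collision model (Bernoulli rewards and all function names are illustrative assumptions, not part of the poster):

```python
import random

def play_round(mu, pulls):
    """One round of the multiplayer model: pulls[j] is the arm chosen by
    player j; a player earns the arm's draw only when it pulls alone."""
    counts = {}
    for arm in pulls:
        counts[arm] = counts.get(arm, 0) + 1
    rewards = []
    for arm in pulls:
        eta = 1 if counts[arm] > 1 else 0              # collision indicator eta_k(t)
        x = 1.0 if random.random() < mu[arm] else 0.0  # X_k(t) with E[X_k] = mu_k
        rewards.append(x * (1 - eta))                  # r_j(t) = X (1 - eta)
    return rewards

# Players 0 and 1 collide on arm 0 and both earn 0; player 2 is alone on arm 1.
rewards = play_round([0.9, 0.5, 0.2], [0, 0, 1])
```

Collisions wipe out the reward regardless of the arm's draw, which is exactly what makes η_k(t) usable as a communication channel later on.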
State of the art bounds
| Setting | Prior knowledge | Asympt. upper bound |
|---|---|---|
| Centralized multiplayer | M | ∑_{k>M} log(T)/(µ_(M)−µ_(k))  [3] |
| Decentralized, col. sensing | µ_(M)−µ_(M+1) | MK log(T)/(µ_(M)−µ_(M+1))²  [4] |
| Decentralized, col. sensing (ours) | – | ∑_{k>M} log(T)/(µ_(M)−µ_(k)) + MK log(T) |
| Dec., col. sensing, dynamic | ∆̄(M) | M √(K log(T) T / ∆̄²(M))  [4] |
| Dec., no sensing, dynamic (ours) | – | MK log(T)/∆̄²(M) + M²K log(T)/µ_(M) |

Our results (shown in red on the poster) are marked "(ours)".
SIC-MMAB: static with collision sensing
Observation: the collision indicator η_k(t) ∈ {0, 1} is a bit sent from one player to another
⇒ communication between players is possible
Algorithm 1: SIC-MMAB
Initialization phase: estimate M and the player's rank j in time O(K log(T))
for p = 1, 2, … do
    Exploration phase: explore each active arm 2^p times without collision
    Communication phase: players exchange arm statistics using collisions
    Eliminate suboptimal arms
    Assign optimal arms to players (who enter the exploitation phase)
end
Exploitation phase: pull the assigned arm until T
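The elimination step can be sketched as follows (the confidence radius √(2 log(T)/n) and the function name are assumptions for illustration, not the paper's exact constants):

```python
import math

def eliminate(sums, n_pulls, M, T):
    """One SIC-MMAB-style elimination step: an active arm survives only if
    its upper confidence bound reaches the lower confidence bound of the
    M-th best empirical arm."""
    radius = math.sqrt(2 * math.log(T) / n_pulls)
    means = {k: s / n_pulls for k, s in sums.items()}
    # lower confidence bound of the M-th best empirical mean
    lcb_M = sorted((m - radius for m in means.values()), reverse=True)[M - 1]
    return [k for k, m in means.items() if m + radius >= lcb_M]

# Arms 0 and 1 (empirical means 0.9 and 0.85) survive; arm 2 (mean 0.1) is
# eliminated once the confidence intervals separate.
active = eliminate({0: 360, 1: 340, 2: 40}, 400, 2, 1000)
```

Because exploration doubles each phase (2^p pulls per arm), the radius shrinks geometrically and each suboptimal arm is eliminated after roughly log(T)/(µ_(M)−µ_(k))² total pulls.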
Communication protocol
When player i sends the statistics of arm k to player j:
• pull arm j to send a 1 bit (collision)
• pull arm i to send a 0 bit (no collision)
• the quantized empirical mean is sent in binary
After 2^p exploration rounds, the empirical mean is quantized and sent with p bits → sublogarithmic communication regret
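A sketch of this bit protocol (the MSB-first rounding scheme and the function names are illustrative assumptions):

```python
def quantize(mean, p):
    """Quantize an empirical mean in [0, 1] to p bits, MSB first."""
    level = min(int(mean * (1 << p)), (1 << p) - 1)
    return [(level >> i) & 1 for i in range(p - 1, -1, -1)]

def send_bit(bit, sender_arm, receiver_arm):
    """The sender pulls the receiver's arm to create a collision (bit 1) or
    its own arm (bit 0); the receiver, pulling its own arm, observes the
    collision indicator eta = 1 exactly when the sender collides with it."""
    sender_pull = receiver_arm if bit == 1 else sender_arm
    return 1 if sender_pull == receiver_arm else 0

def decode(bits):
    level = 0
    for b in bits:
        level = (level << 1) | b
    return level / (1 << len(bits))

# Player i (on arm 3) sends a quantized mean to player j (on arm 5):
bits = quantize(0.8125, 4)                       # 13/16 -> [1, 1, 0, 1]
received = [send_bit(b, 3, 5) for b in bits]
assert decode(received) == 0.8125
```

With 2^p exploration samples the quantization error is of order 2^{-p}, matching the statistical error, which is why p bits per statistic suffice.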
Regret of SIC-MMAB
E[R_T] ≲ ∑_{k>M} log(T)/(µ_(M) − µ_(k))            (exploration phases)
       + MK log(T)                                  (initialization phase)
       + M³K log₂( log(T)/(µ_(M) − µ_(M+1))² )      (communication phases)
Contradiction with the lower bound
Centralized lower bound [1]: ∑_{k>M} log(T)/(µ_(M) − µ_(k))
Decentralized lower bound [2]? M ∑_{k>M} log(T)/(µ_(M) − µ_(k)) — NO: disproved by SIC-MMAB
⇒ Decentralized ∼ Centralized
Toward a realistic model
Unrealistic communication protocols → loophole in the current model. Which assumption went wrong?
• Collision sensing? No: a bit can still be sent in log(T)/µ_(K) rounds without it
• Synchronisation? Yes! Communication is possible because players know when to talk to each other
Synchronisation is an unrealistic assumption leading to hacking algorithms. Let's remove it → dynamic model
DYN-MMAB: dynamic no sensing
Algorithm 2: DYN-MMAB
Exploration phase: pull k ∼ U([K])
    Update occupied arms and the queue of arms to exploit
    if r_k(t) > 0 and k = arm to exploit then
        enter the exploitation phase
    end
Exploitation phase: pull the exploited arm until T
How does it work?
Do not estimate µ_k but γµ_k, where γ ≈ P(no collision); this still distinguishes optimal from suboptimal arms.
What if another player exploits arm k?
• either r̂_k → 0 quickly and arm k becomes suboptimal,
• or too many 0s in a row ⇒ k is occupied
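The second case can be sketched as a simple streak test (the constant 2 in the threshold and the function name are assumptions, not the paper's exact values):

```python
import math

def occupied(consecutive_zeros, mean_lower_bound, T):
    """DYN-MMAB-style occupancy test: an unoccupied arm whose mean is at
    least mean_lower_bound returns a nonzero reward within roughly
    log(T)/mean rounds with high probability, so a longer run of zero
    rewards flags the arm as occupied by an exploiting player."""
    threshold = 2 * math.log(T) / mean_lower_bound
    return consecutive_zeros > threshold

# With T = 10**4 and a mean of at least 0.5, the threshold is
# 2*log(T)/0.5 ~ 36.8: 40 zeros in a row flag the arm, 10 do not.
```

This test only needs the observed rewards, which is why it works in the no-sensing setting.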
Regret of DYN-MMAB
E[R_T] ≲ M²K log(T)/µ_(M) + MK log(T)/∆̄²(M)
References
[1] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: i.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976.
[2] Besson, L. and Kaufmann, E. (2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory, Lanzarote, Spain.
[3] Komiyama, J., Honda, J., and Nakagawa, H. (2015). Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pages 1152–1161.
[4] Rosenski, J., Shamir, O., and Szlak, L. (2016). Multi-player bandits – a musical chairs approach. In International Conference on Machine Learning, pages 155–163.