Mean Field Equilibria of Multi-Armed Bandit Games

Ramki Gummadi (Stanford)
Joint work with: Ramesh Johari (Stanford), Jia Yuan Yu (IBM Research, Dublin)

Page 1:

Mean Field Equilibria of Multi-Armed Bandit Games

Ramki Gummadi (Stanford) Joint work with:

Ramesh Johari (Stanford), Jia Yuan Yu (IBM Research, Dublin)

Page 2:

Motivation

• Classical MAB models have a single agent.

• What happens when other agents influence arm rewards?

• Do standard learning algorithms lead to any equilibrium?

Page 3:

Examples

• Wireless transmitters learning unknown channels with interference.

• Sellers learning about product categories, e.g., on eBay.

• Positive externalities: social gaming.

Page 4:

Example: Wireless Transmitters

[Diagram: a transmitter choosing between Channel A (success probability 0.8) and Channel B (success probability 0.6).]

Page 5:

Example: Wireless Transmitters

[Diagram: the same choice once other transmitters interfere; each channel now shows two values (Channel A: 0.8; 0.9 and Channel B: 0.6; 0.1), i.e., the effective channel quality depends on the other agents.]

Page 6:

Modeling the Bandit Game

• Perfect Bayesian equilibrium: implausibly complex agent behavior.

• Mean field model: agents behave under an assumption of stationarity.

Page 7:

Outline

• Model
• The equilibrium concept
• Existence
• Dynamics
• Uniqueness and convergence
• From finite system to limit model
• Conclusion

Page 8:

Mean Field Model of MAB Games

• Discrete time; a finite set of arms; Bernoulli rewards.

• An agent at any time has a state s (its per-arm reward history) and a type θ.

• Agents ‘regenerate’ once every τ time slots:
– θ is re-sampled i.i.d. from a fixed distribution Γ;
– s is reset to the zero vector.

Page 9:

Mean Field Model of MAB Games

• Policy σ: maps the state s to a (randomized) arm. E.g., UCB, Gittins index.

• Population profile f: the distribution of agents across arms.

• Reward distribution: Bernoulli with mean μ(a, θ, f), which depends on the chosen arm a, the agent’s type θ, and the population profile f.
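To make these objects concrete, here is a minimal sketch in Python; the arm count K = 3, the encoding of the state as per-arm pull/success counts, and the choice of UCB1 are our illustrative assumptions, not fixed by the slides.

```python
import math

K = 3  # number of arms (illustrative)

def ucb_policy(pulls, successes):
    """UCB1 computed from the agent's own state: per-arm pull and success counts."""
    t = sum(pulls)
    for a in range(K):
        if pulls[a] == 0:
            return a  # try every arm once before using confidence bounds
    return max(range(K),
               key=lambda a: successes[a] / pulls[a]
                             + math.sqrt(2 * math.log(t) / pulls[a]))
```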

Page 10:

A Single Agent’s Evolution

• Current state: s
• Current type: θ
• Agent picks an arm a = σ(s)
• Population profile: f
• Transitions to a new state s′ where arm a’s history records:

a success, with probability μ(a, θ, f); a failure, with probability 1 − μ(a, θ, f).
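The same transition, written as a short sketch (the (pulls, successes) state encoding and all names are ours, carried over from the snippet above):

```python
import random

def step(pulls, successes, theta, f, policy, mu, rng=random):
    """One slot of a single agent's evolution against a fixed profile f."""
    a = policy(pulls, successes)        # arm chosen from the agent's own state
    pulls[a] += 1
    if rng.random() < mu(a, theta, f):  # Bernoulli reward with mean mu(a, theta, f)
        successes[a] += 1
    return a
```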

Page 11:

Examples of Reward Functions

• Negative externality: the mean reward on an arm decreases with the mass of agents on it. E.g., interference between transmitters.

• Positive externality: the mean reward increases with the mass of agents on the same arm. E.g., social gaming.

• Non-separable rewards: the mean reward on an arm depends on the whole population profile, not just the mass on that arm.
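The exact example formulas on this slide did not survive extraction, so the following are generic stand-ins that exhibit each shape:

```python
def mu_negative(a, theta, f):
    # Negative externality: mean reward falls as the mass f[a] on arm a grows.
    return theta[a] * (1.0 - f[a])

def mu_positive(a, theta, f):
    # Positive externality: mean reward rises with the mass on the same arm.
    return theta[a] * f[a]

def mu_nonseparable(a, theta, f):
    # Non-separable: arm a's reward depends on the whole profile f,
    # here through congestion relative to the most crowded arm.
    return theta[a] * (1.0 - f[a] * max(f))
```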

Page 12:

The Equilibrium Concept

• What constitutes an MFE?
1. A joint distribution π over (state, type) pairs.
2. A population profile f.
3. A policy σ that maps state to arm choice.

• Equilibrium conditions:
1. π has to be the unique invariant distribution of the single-agent dynamics under σ, with the population profile fixed at f.
2. f arises from π when agents adopt policy σ.
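One way to make the consistency conditions concrete: consider the map that sends a conjectured profile f to the profile actually generated when agents learn against f; an MFE profile is a fixed point of this map. A Monte Carlo sketch, with a deterministic lifetime and a uniform type prior as our simplifying assumptions:

```python
import random

def phi(f, mu, policy, K=3, n_agents=2000, lifetime=50, rng=random):
    """Simulate n_agents independent agents, each over one lifetime against the
    FIXED profile f, and return the empirical distribution of their arm pulls.
    An MFE profile satisfies f* = phi(f*)."""
    counts = [0] * K
    for _ in range(n_agents):
        theta = [rng.random() for _ in range(K)]  # type: illustrative uniform prior
        pulls, successes = [0] * K, [0] * K
        for _ in range(lifetime):                 # each agent starts fresh (regeneration)
            a = policy(pulls, successes)
            pulls[a] += 1
            if rng.random() < mu(a, theta, f):
                successes[a] += 1
            counts[a] += 1
    total = sum(counts)
    return [c / total for c in counts]
```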

Page 13:

Optimality in Equilibrium

• In an MFE, the population profile f doesn’t change over time.

• So σ can be any policy that is “optimal” for learning in an i.i.d. reward environment.

Page 14:

Existence of MFE

Theorem: At least one MFE exists if μ(a, θ, f) is continuous in f for every arm a and type θ.

• Proved using Brouwer’s fixed point theorem.

Page 15:

Beyond Existence

• MFE exists, but when is it unique?

• Even when an MFE is unique, can agent dynamics actually find it?

• How does the mean field model approximate a system with finitely many agents?

Page 16–19:

Dynamics

[Diagram, built up across these four slides: a population of agents spread over arms 1, 2, 3, …, i, …, n; each agent picks an arm via the policy; the choices induce the population profile 𝒇_t; a transition kernel (determined by the policy and 𝒇_t) maps the current distribution of agent states to the next one.]

Page 20:

Dynamics

Theorem: Let Φ denote the map from f_t to f_{t+1}.

Assume μ(a, θ, f) is L-Lipschitz in f for every θ. Then Φ is a contraction map (in total variation) when L is sufficiently small.

• The proof uses a coupling argument on the bandit process.
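A rough numerical probe of the contraction property, reusing phi, mu_negative, and ucb_policy from the sketches above; Monte Carlo noise makes this indicative only:

```python
def tv(p, q):
    """Total variation distance between two arm profiles."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

f1, f2 = [0.8, 0.1, 0.1], [0.2, 0.3, 0.5]
ratio = (tv(phi(f1, mu_negative, ucb_policy),
            phi(f2, mu_negative, ucb_policy)) / tv(f1, f2))
print(f"empirical contraction ratio: {ratio:.2f}")  # below 1 is consistent with contraction
```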

Page 21:

Uniqueness and Convergence

1. Fixed points of Φ are the MFE.

2. For an arbitrary initial profile f_0, the mean field evolution is f_{t+1} = Φ(f_t).

When Φ is a contraction (w.r.t. total variation):
1. There exists a unique MFE.
2. The mean field trajectory of measures converges to it.
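The convergence statement suggests the obvious computational recipe: iterate Φ from any starting profile. A sketch using phi and tv from above (the tolerance is loose because phi is a Monte Carlo estimate):

```python
f = [1.0 / 3] * 3                       # arbitrary initial profile
for t in range(50):
    f_next = phi(f, mu_negative, ucb_policy)
    if tv(f, f_next) < 1e-2:            # stop once successive profiles nearly agree
        break
    f = f_next
print("approximate MFE profile:", [round(x, 3) for x in f])
```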

Page 22:

Finite Systems to Limit Model

• Rewards depend on f_t^(N), the empirical population profile of the N agents.

• f_t^(N) is a random probability measure on the (state, type) space.

• (In what sense) does f_t^(N) → f_t as N → ∞? I.e., could the trajectories diverge after a long time, even for large N?

Page 23:

Approximation Property

Theorem: f_t^(N) → f_t as N → ∞, uniformly in t, when Φ is a contraction.

• The proof uses an artificial “auxiliary” system whose rewards are based on the mean field profile.

• A coupling of the transitions provides the bridge from the finite system to the mean field limit via this auxiliary system.
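For side-by-side comparison with the mean-field iteration above, here is a sketch of the finite system itself, where rewards depend on the empirical profile of the previous slot (again with our assumed ingredients from earlier snippets; regeneration is omitted purely to keep the illustration short):

```python
import random

def finite_system(N=500, K=3, horizon=100, mu=mu_negative,
                  policy=ucb_policy, rng=random):
    """Simulate N interacting agents; return the trajectory of empirical
    profiles f_t^(N), to be compared against the mean field f_t."""
    agents = [([0] * K, [0] * K, [rng.random() for _ in range(K)])
              for _ in range(N)]
    f_emp = [1.0 / K] * K               # initial empirical profile
    trajectory = []
    for _ in range(horizon):
        counts = [0] * K
        for pulls, successes, theta in agents:
            a = policy(pulls, successes)
            pulls[a] += 1
            if rng.random() < mu(a, theta, f_emp):
                successes[a] += 1
            counts[a] += 1
        f_emp = [c / N for c in counts]
        trajectory.append(f_emp)
    return trajectory
```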

Page 24:

Conclusion

• Agent populations converge to a mean field equilibrium using classical bandit algorithms.

• A large agent population effectively mitigates non-stationarity in MAB games.

• Interesting theoretical results beyond existence: uniqueness, convergence and approximation.

• The insights hold more generally than the theorem conditions strictly imply.