
Reinforcement Learning for Mean Field Games with Strategic Complementarities

Kiyeob Lee* Desik Rengarajan* Dileep Kalathil Srinivas Shakkottai
Texas A&M University

{kiyeoblee,desik,dileep.kalathil,sshakkot}@tamu.edu

Abstract

Mean Field Games (MFG) are the class of games with a very large number of agents, and the standard equilibrium concept is a Mean Field Equilibrium (MFE). Algorithms for learning MFE in dynamic MFGs are unknown in general. Our focus is on an important subclass that possesses a monotonicity property called Strategic Complementarities (MFG-SC). We introduce a natural refinement to the equilibrium concept that we call Trembling-Hand-Perfect MFE (T-MFE), which allows agents to employ a measure of randomization while accounting for the impact of such randomization on their payoffs. We propose a simple algorithm for computing T-MFE under a known model. We also introduce a model-free and a model-based approach to learning T-MFE and provide sample complexities of both algorithms. We also develop a fully online learning scheme that obviates the need for a simulator. Finally, we empirically evaluate the performance of the proposed algorithms via examples motivated by real-world applications.

1 Introduction

Strategic complementarities refers to a well-established strategic game structure wherein the marginal returns to increasing one's strategy rise with increases in the competitors' strategies, and many qualitative results of the dynamic systems are based on these properties (Milgrom and Roberts, 1990; Vives, 2009; Adlakha and Johari, 2013).

*Equal Contribution

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Many practical scenarios with this property have been identified, including pricing in oligopolistic markets, adoption of network technology and standards, deposits and withdrawals in banking, weapons purchasing in arms races, etc. Our running example is that of adopting actions to prevent the spread of a computer or human virus, wherein stronger actions towards maintaining health (e.g., installing patches, or wearing masks) by members enhance the returns (e.g., system reliability or economic value) to a particular individual following suit.

The above examples are characterized by a large number of agents following the dynamics of repeated action, reward, and state transition that is a characteristic of stochastic games. However, analytical complexity implies that most work has focused on the static scenario with a small number of agents (see Milgrom and Roberts (1990)). Recently, there have been attempts to utilize the information structure of a mean field game (MFG) (Lasry and Lions, 2007; Tembine et al., 2009; Adlakha and Johari, 2013; Iyer et al., 2014; Li et al., 2016, 2018) to design algorithms to compute equilibria under a known model in the large population setting (Adlakha and Johari, 2013). Here, each agent assumes that the states of all others are drawn in an i.i.d. manner from a common mean field belief distribution, and optimizes accordingly. However, the model is non-stationary due to the change in the mean field distribution at each time step, and provably convergent learning algorithms for identifying equilibrium strategies are currently unavailable.

Main Contributions: We study the problem of learning under an unknown model in stochastic games with strategic complementarities under the mean field setting. Our main contributions are:

(i) We introduce the notion of trembling-hand perfection to the context of mean field games, under which a known randomness is introduced into all strategies. Unlike an ε-greedy policy, in which randomization is added as an afterthought to the optimal action, under trembling-hand perfection the optimal value is computed consistently while accounting for this randomness.

(ii) We describe TQ-value iteration, based on generalized value iteration, as a means of computing trembling-hand consistent TQ-values. We introduce a notion of equilibrium that we refer to as trembling-hand-perfect mean field equilibrium (T-MFE), and show existence of T-MFE and a globally convergent computational method in games with strategic complementarities.

(iii) We propose two learning algorithms (model-free and model-based) for learning T-MFE under an unknown model. The algorithms follow a structure of identifying TQ-values to find a candidate strategy, and then taking only a one-step McKean-Vlasov update of the mean field distribution. We show convergence and determine their sample complexity bounds.

(iv) Finally, to the best of our knowledge, ours is the first work that develops a fully online learning scheme that utilizes the large population of agents to concurrently sample the real world. Our algorithm only needs a one-step McKean-Vlasov mean field update under each strategy, which automatically happens via agents applying the current strategy for one step. This obviates the need for a multi-step simulator required by typical RL approaches.

Related Work: There has recently been much interest in the intersection of machine learning and collective behavior, often under the large population regime. Much of this work focuses on specific classes of stochastic games that possess verifiable properties on information, payoffs, and preferences that provide structure to the problem. In line with this approach is learning in structured MFGs, such as in linear-quadratic, oscillator, or potential game settings (Kizilkale and Caines, 2012; Yin et al., 2013; Cardaliaguet and Hadikhanloo, 2017; Carmona et al., 2019). Other approaches provide structure to the problem by considering localized effects, such as local interactions (Yang et al., 2018), local convergence (Mguni et al., 2018), or a local version of Nash equilibrium (Subramanian and Mahajan, 2019). While we are not aware of any work that considers learning in games with strategic complementarities, a survey on the algorithmic aspects of multi-agent RL, including a review of existing work in the mean field domain, is available in Zhang et al. (2019).

Many of the issues faced in simultaneous learning and decision making in games with large populations are contained in Yang et al. (2018); Guo et al. (2019); Subramanian and Mahajan (2019), which are the closest to our work. Yang et al. (2018) considers the scenario wherein interactions are local in that each agent is impacted only by its set of neighbors, and so sampling only among them is sufficient to obtain a mean field estimate. This, however, requires a specific structure in which the Q-function of agents can be decomposed into such local versions. Subramanian and Mahajan (2019) proposed a simulator-based (not fully online) policy gradient algorithm for learning the mean field equilibrium in stationary mean field games. However, the convergence guarantees are limited to only a local version of Nash equilibrium. Finally, Guo et al. (2019) presents an existence result and a simulator-based model-free learning algorithm for MFE without structural assumptions on the game. Instead, there are contraction assumptions imposed on the trajectories of state and action distributions over time, which, however, may not be verifiable for a given game in a straightforward manner.

Our work is distinguished from existing approaches in several ways. We introduce the concept of a strategically consistent approach to learning via the trembling-hand idea, rather than arbitrarily adding a modicum of exploration to best responses as do most existing works. We provide a structured application scenario to apply this idea in the form of games with strategic complementarities. In turn, this allows us to explore both model-free and model-based methods to compute and learn optimal trembling-hand-perfect MFE, including showing global convergence and determining their sample complexities. Perhaps most importantly, we exploit the large population setting to obtain samples without the aid of a simulator, which in turn enables learning directly from the real system.

2 Mean Field Games and Trembling-Hand Perfection

Mean Field Games: An N-agent stochastic dynamic game is represented as (S, A, P, (r^i)_{i=1}^N, γ), where S and A are the state and action spaces, respectively, both assumed to be finite. At time k, agent i has state s^i_k ∈ S, takes action a^i_k ∈ A, and receives a reward r^i(s_k, a_k). Here, s_k = (s^i_k)_{i=1}^N is the system state and a_k = (a^i_k)_{i=1}^N is the joint action. The system state evolves according to the transition kernel s_{k+1} ∼ P(·|s_k, a_k). Each agent aims to maximize the infinite-horizon cumulative discounted reward E[∑_{k=0}^∞ γ^k r^i(s_k, a_k)], with discount factor γ ∈ (0, 1).

Identification of a best response is computationally hard under a Bayesian framework, and a more realistic approach, aligned with a typical agent's computational capabilities, is to reduce the information state of each agent to the so-called mean field state distribution (Lasry and Lions, 2007; Tembine et al., 2009; Adlakha and Johari, 2013; Iyer et al., 2014; Li et al., 2016, 2018). The mean field z_k at time k is defined as z_k(s) = (1/N) ∑_{i=1}^N 1{s^i_k = s}, and is the empirical distribution of the states of all agents.


It represents the agent's belief that the states of all others will be drawn in an i.i.d. manner from z_k. Agent i at time k receives a reward r^i(s^i_k, a^i_k, z_k), and its state evolves according to s^i_{k+1} ∼ P(·|s^i_k, a^i_k, z_k). The mean field approximation is accurate under structural assumptions on correlation decay across agent states as the number of agents becomes asymptotically large (Graham and Meleard, 1994; Iyer et al., 2014).

We represent an MFG as Γ = (S, A, P, r, γ), and restrict our attention to stationary MFGs with a homogeneous reward function. Here, all agents follow the same stationary strategy µ : S → P(A), where P(A) is the set of probability distributions over the action space, and the reward function satisfies r^i = r for all i. We assume that |r(s, a, z)| ≤ 1. The mean field z_k evolves following the discrete time McKean-Vlasov equation

z_{k+1} = Φ(z_k, µ), where z_{k+1}(s′) = Φ(z_k, µ)(s′)   (1)
        = ∑_{s∈S} ∑_{a∈A} z_k(s) µ(s, a) P(s′|s, a, z_k).   (2)

The value function V_{µ,z} corresponding to the strategy µ and the mean field z is defined as V_{µ,z}(s) = E[∑_{k=0}^∞ γ^k r(s_k, a_k, z) | s_0 = s], with a_k ∼ µ(s_k, ·), s_{k+1} ∼ P(·|s_k, a_k, z).
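As a concrete reference for the one-step dynamics (1)-(2), the following minimal sketch computes Φ(z, µ) for finite S and A; the function name mckean_vlasov_step, the array-based interface, and the normalization guard are our illustrative assumptions rather than part of the paper.

```python
import numpy as np

def mckean_vlasov_step(z, mu, P):
    """One-step McKean-Vlasov update z' = Phi(z, mu), following (1)-(2).

    z  : (|S|,) current mean field over states
    mu : (|S|, |A|) stationary strategy, mu[s, a] = probability of action a in state s
    P  : (|S|, |A|, |S|) transition kernel P[s, a, s'] already evaluated at the mean field z
    """
    # z'(s') = sum_s sum_a z(s) mu(s, a) P(s'|s, a, z)
    z_next = np.einsum('s,sa,sap->p', z, mu, P)
    return z_next / z_next.sum()  # re-normalize to guard against numerical drift
```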

Mean Field Games with Strategic Complementarities (MFG-SC): Structural assumptions on the nature of the MFG are needed in order to show existence of an equilibrium, and to identify convergent dynamics. Our focus is on a particular structure called strategic complementarities, which aligns increases of an agent's strategy with increases in the competitors' strategies (Nowak, 2007; Vives, 2009; Adlakha and Johari, 2013).

We introduce some concepts before defining MFG-SC. The partially ordered set (X, ⪰) is called a lattice if for all x, y ∈ X, the elements sup{x, y} and inf{x, y} are in X. X is a complete lattice if for any S ⊂ X, both sup S and inf S are in X. A function f : X → R is said to be supermodular if f(sup{x, x′}) + f(inf{x, x′}) ≥ f(x) + f(x′) for any x, x′ ∈ X. Given lattices X, Y, a function f : X × Y → R is said to have increasing differences in x and y if for all x′ ⪰ x, y′ ⪰ y, f(x′, y′) − f(x′, y) ≥ f(x, y′) − f(x, y). A correspondence T : X → Y is non-decreasing if x′ ⪰ x, y ∈ T(x), and y′ ∈ T(x′) implies that sup{y, y′} ∈ T(x′) and inf{y, y′} ∈ T(x).

For probability distributions p, p′ ∈ P(X), we say p stochastically dominates p′, denoted as p ⪰_SD p′, if ∑_{x∈X} f(x) p(x) ≥ ∑_{x∈X} f(x) p′(x) for any bounded non-decreasing function f. The conditional distribution p(·|y) is stochastically non-decreasing in y if for all y′ ⪰ y, we have p(·|y′) ⪰_SD p(·|y). Finally, the conditional distribution p(·|y, z) has stochastically increasing differences in y and z if ∑_x f(x) p(x|y, z) has increasing differences in y and z for any bounded non-decreasing function f.

We now give the formal definition of mean field games with strategic complementarities (Adlakha and Johari, 2013).

Definition 1 (MFG-SC). Let Γ be a stationary mean field game. We say that Γ is a mean field game with strategic complementarities if it has the following properties:
(i) Reward function: r(s, a, z) is non-decreasing in s, supermodular in (s, a), and has increasing differences in (s, a) and z. Also, max_a r(s, a, z) is non-decreasing in s for all fixed z.
(ii) Transition probability: P(·|s, a, z) is stochastically supermodular in (s, a), has stochastically increasing differences in (s, a) and z, and is stochastically non-decreasing in each of s, a, and z.

MFG-SC Example - Infection Spread: We assume a large but fixed population of agents (computers or humans). At any time step, an agent may leave the system (network or town) with probability ζ, and is immediately replaced with a new agent. The state of an agent s ∈ Z_+ is its health level, and the agent can take action a ∈ {1, 2, 3, ..., |A|} (installing security patches, wearing a mask, etc.) to stay healthy. The susceptibility of each agent, p(s), is a decreasing function in the state (higher health implies lower susceptibility). At each time step, the agent interacts with the ensemble of agents, who have a mean field state distribution z. We define the infection intensity i_z = c_f p(∑_{s∈S} s z(s)), which can be interpreted as the probability of getting infected via interaction with the population, where c_f is the infection intensity constant.

Given the current state-action pair (s, a), the next state s′ is given by

s′ = (s + a − w_1)^+ 1{E_1} + (s + a) 1{E_2} + w_2 1{E_3},

where (x)^+ = max{0, x} (the state is non-negative), w_1, w_2 ∈ Z_+ are realizations of non-negative random variables, and the E_i are mutually exclusive events with probabilities i_z(1 − ζ), (1 − i_z)(1 − ζ), and ζ, respectively. Events E_1 and E_2 correspond to the agent remaining in the system, and being infected (health may deteriorate) or not infected, respectively. E_3 is the event that the agent leaves and is replaced with an agent with a random state (regeneration). An agent receives a reward that depends on its own immunity 1 − p(s) as well as that of the population (i.e., the system value increases with immunity), but pays for its action. Hence,

r(s, a, z) = δ_1 (1 − p(s)) + δ_2 ∑_{s∈S} z(s)(1 − p(s)) − δ_3 a,

where the δ_i are positive constants.
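For concreteness, a minimal simulation sketch of one agent transition and reward in this model is given below. The susceptibility p(s) = 1/(1 + s), the Poisson health-loss noise, the clipping of states to {0, ..., 24} (matching the finite state space used in Section 6), and the constants are illustrative assumptions chosen only to respect the monotonicity described above; they are not the parameters used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S_MAX = 24                          # states {0,...,24}, actions {0,...,4} as in Section 6
zeta, c_f = 0.05, 0.1               # regeneration probability and infection intensity constant
d1, d2, d3 = 1.0, 1.0, 0.1          # reward weights delta_1, delta_2, delta_3 (illustrative)

def p_susc(s):
    # susceptibility: decreasing in the health level s (illustrative choice)
    return 1.0 / (1.0 + s)

def infection_step(s, a, z):
    """Sample the next state and reward of one agent given (s, a) and mean field z."""
    i_z = c_f * p_susc(np.dot(np.arange(S_MAX + 1), z))   # infection intensity i_z
    u = rng.random()
    if u < zeta:                                          # E3: leave and regenerate
        s_next = rng.integers(0, S_MAX + 1)
    elif u < zeta + (1.0 - zeta) * i_z:                   # E1: stay and get infected
        w1 = rng.poisson(2)                               # health loss w1 (illustrative)
        s_next = int(np.clip(s + a - w1, 0, S_MAX))
    else:                                                 # E2: stay, not infected
        s_next = min(s + a, S_MAX)
    # r(s, a, z) = d1 (1 - p(s)) + d2 sum_s' z(s')(1 - p(s')) - d3 a
    r = d1 * (1 - p_susc(s)) + d2 * np.dot(z, 1 - p_susc(np.arange(S_MAX + 1))) - d3 * a
    return s_next, r
```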


It is easy to verify that the model has non-decreasing differences in the transition matrix. Also, it encourages crowd-seeking behavior, where the mean field positively affects the rewards obtained. Showing that it has strategic complementarities is straightforward. We validate our algorithms via simulations on this model in Section 6.

MFG-SC Example - Amazon Mechanical Turk (MTurk) and other Gig Economy Marketplaces: We consider Amazon Mechanical Turk (MTurk), a crowd sourcing market wherein human workers (called Turkers) are recruited to perform so-called Human Intelligence Tasks (HITs). There is a natural alignment of effort employed by Turkers in MTurk, since higher efforts translate into more HITs done right, which results in a higher quality of work distribution, which results in firms willing to spend more on HITs. Note that free riding is difficult, since poor quality work results in payments being withheld and reputation loss. This notion of incentive alignment applies to essentially all Gig economy marketplaces such as Uber and Airbnb: the reputation of the agent directly enhances its reward, while the reputation of the marketplace as a whole (i.e., its mean field) draws customers willing to pay into the system, and so enhances the reward of the agent. A more detailed description and numerical simulations are provided in the supplementary material.

Trembling-Hand-Perfect Mean Field Equilibrium: The notion of trembling-hand perfection is a means of refining the Nash equilibrium concept to account for the fact that equilibria that naturally occur are often those that are optimal when a known amount of randomness is introduced into the strategies employed, ensuring that they are totally mixed, i.e., all actions will be played with some (however small) probability (Bielefeld, 1988). Thus, the agent is restricted to playing only such mixed (randomized) strategies, but maintains strategic consistency in that it accounts for the probability of playing each action while calculating the expected payoff of such a totally mixed strategy.

Formalizing the above thoughts, we denote the set of trembling-hand strategies as Π_ε. A trembling-hand strategy µ ∈ Π_ε is a mapping µ : S → P_ε(A), where P_ε(A) is the set of ε-randomized probability vectors over A. Any probability vector in P_ε(A) has the value (1 − ε) for one element and the value ε/(|A| − 1) for all the other elements. Thus, any µ ∈ Π_ε has the following form: µ(s, a) = (1 − ε) for a = a_s for some a_s, and µ(s, a) = ε/(|A| − 1) for all other a ∈ A. In standard reinforcement learning parlance, Π_ε is essentially the set of all ε-greedy policies.

The main difference between a trembling-hand strategy and an ε-greedy policy lies in the value function. Recall that under the ε-greedy idea, the agent computes the pure (deterministic) best response policy, and then arbitrarily adds randomization. However, under the strategic game setting, choosing a strategy that could result in an arbitrary loss of value is impermissible. Rather, the agent must compute the best trembling-hand strategy, i.e., it must account for the impact on value of the ε randomness. Formally, we first define the optimal trembling-hand value function V*_z and the optimal trembling-hand strategy µ*_z corresponding to the mean field z as

V*_z = max_{µ∈Π_ε} V_{µ,z},   (3)

µ*_z ∈ Ψ(z), where Ψ(z) = arg max_{µ∈Π_ε} V_{µ,z}.   (4)

We define a trembling-hand-perfect mean field equilibrium (T-MFE) in terms of a trembling-hand strategy µ* and a mean field distribution z* that must jointly satisfy (i) optimality: the strategy µ* must be superior to all other strategies, given the belief z*; and (ii) consistency: given a candidate mean field distribution z*, the strategy µ* must regenerate z* under the McKean-Vlasov dynamics (1).

Definition 2 (T-MFE). Let Γ be a stationary mean field game. A trembling-hand-perfect mean field equilibrium (µ*, z*) of Γ is a strategy µ* ∈ Π_ε and a mean field z* such that (optimality condition) µ* ∈ Ψ(z*), and (consistency condition) z* = Φ(z*, µ*).

3 Existence and Computation of T-MFE

Existence of T-MFE: We first introduce a method to compute the optimal trembling-hand value function V*_z for a given mean field z. Note that classical value iteration for any given finite MDP will converge, but to deterministic policies (pure strategies), something that is not possible under our restriction to the trembling-hand strategies Π_ε, which allows only totally mixed strategies. We overcome this issue by using a generalized value iteration approach (Szepesvari and Littman, 1996). Here, rather than the value function, we compute the Q-value function, which for a strategy µ and a given mean field z is defined as Q_{µ,z}(s, a) = E[∑_{t=0}^∞ γ^t r(s_t, a_t, z) | s_0 = s, a_0 = a], with a_t ∼ µ(s_t, ·), s_{t+1} ∼ P(·|s_t, a_t, z). The optimal trembling-hand Q-value function (TQ-value function) for a given mean field z is then defined as Q*_z = max_{µ∈Π_ε} Q_{µ,z}. The optimal trembling-hand strategy µ*_z for a given mean field z can then be computed as µ*_z = π^ε_{Q*_z}.

For any given Q-value function, define the trembling-hand strategy π^ε_Q and the function G(Q) as

π^ε_Q(s, a) = (1 − ε) for a = arg max_b Q(s, b), and π^ε_Q(s, a) = ε/(|A| − 1) for a ≠ arg max_b Q(s, b),

G(Q)(s) = ∑_{a∈A} π^ε_Q(s, a) Q(s, a).

Note that π^ε_Q is the usual ε-greedy policy with respect to Q. Using the above notation, we define the TQ-value operator F_z for a given mean field z as

F_z(Q)(s, a) = r(s, a, z) + γ ∑_{s′} P(s′|s, a, z) G(Q)(s′).   (5)

The TQ-value operator F_z has properties similar to the standard Bellman operator. In particular, we show below that F_z is a contraction with Q*_z as its unique fixed point.

Proposition 1. (i) F_z is a contraction mapping in the sup norm for all z ∈ P(S). More precisely, ‖F_z(Q_1) − F_z(Q_2)‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞ for any Q_1, Q_2, and for all z ∈ P(S).
(ii) The optimal trembling-hand Q-value function Q*_z for a given mean field z is the unique fixed point of F_z, i.e., F_z(Q*_z) = Q*_z.
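The key step behind Proposition 1(i) is that the summary operator G is a sup-norm non-expansion; the following is only a sketch of that argument, under the (implicit) assumption ε < (|A| − 1)/|A| so that the ε-greedy vector π^ε_Q(s, ·) attains the maximum over P_ε(A):

```latex
% Sketch of the non-expansion argument behind Proposition 1(i).
% Assumes \epsilon < (|A|-1)/|A|, so that \pi^{\epsilon}_{Q}(s,\cdot) attains the maximum over P_\epsilon(A).
\begin{align*}
G(Q)(s) &= \sum_{a \in A} \pi^{\epsilon}_{Q}(s,a)\, Q(s,a)
         = \max_{p \in P_{\epsilon}(A)} \sum_{a \in A} p(a)\, Q(s,a), \\
\big|G(Q_1)(s) - G(Q_2)(s)\big|
        &\le \max_{p \in P_{\epsilon}(A)} \Big| \sum_{a \in A} p(a)\,\big(Q_1(s,a) - Q_2(s,a)\big) \Big|
         \le \|Q_1 - Q_2\|_{\infty}, \\
\big|F_z(Q_1)(s,a) - F_z(Q_2)(s,a)\big|
        &\le \gamma \sum_{s'} P(s'|s,a,z)\,\big|G(Q_1)(s') - G(Q_2)(s')\big|
         \le \gamma\, \|Q_1 - Q_2\|_{\infty}.
\end{align*}
```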

The contraction property of F_z implies that the iteration Q_{m+1,z} = F_z(Q_{m,z}) will converge to the unique fixed point of F_z, i.e., Q_{m,z} → Q*_z. We call this procedure TQ-value iteration.
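A minimal tabular sketch of TQ-value iteration for a fixed mean field z is given below; the array shapes, function names, and stopping tolerance are illustrative assumptions, and r_z, P_z denote the reward and kernel already evaluated at z.

```python
import numpy as np

def trembling_hand_policy(Q, eps):
    """pi^eps_Q: probability (1 - eps) on argmax_a Q(s, a) and eps/(|A|-1) on every other action."""
    S, A = Q.shape
    mu = np.full((S, A), eps / (A - 1))
    mu[np.arange(S), Q.argmax(axis=1)] = 1.0 - eps
    return mu

def G(Q, eps):
    """Summary operator: G(Q)(s) = sum_a pi^eps_Q(s, a) Q(s, a)."""
    return (trembling_hand_policy(Q, eps) * Q).sum(axis=1)

def tq_value_iteration(r_z, P_z, gamma, eps, tol=1e-8, max_iter=10_000):
    """Iterate Q <- F_z(Q), eq. (5), until it settles at the fixed point Q*_z.

    r_z : (|S|, |A|) rewards r(s, a, z);  P_z : (|S|, |A|, |S|) kernel P(s'|s, a, z).
    """
    Q = np.zeros_like(r_z, dtype=float)
    for _ in range(max_iter):
        Q_new = r_z + gamma * P_z @ G(Q, eps)   # F_z(Q)(s, a) = r(s, a, z) + gamma * E_{s'}[G(Q)(s')]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```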

From the above result, we can compute the optimal trembling-hand strategy for a given mean field z. However, it is not clear if there exists a T-MFE (z*, µ*) that simultaneously satisfies the optimality condition and the consistency condition. We answer this question affirmatively below.

Theorem 1. Let Γ be a stationary mean field game with strategic complementarities. Then, there exists a trembling-hand-perfect mean field equilibrium for Γ.

The proof follows from the monotonicity properties of MFG-SC. Thus, given this game structure, no additional conditions are needed to show existence.

Computing T-MFE: Given that a T-MFE exists, the next goal is to devise an algorithm to compute one. A natural approach is to use a form of best-response dynamics as follows. Given the candidate mean field z_k, the trembling best-response strategy µ_k can be computed as µ_k ∈ Ψ(z_k). While the typical approach would be to then compute the stationary distribution under µ_k, we simply update the next mean field z_{k+1} using just the one-step McKean-Vlasov dynamics, z_{k+1} = Φ(z_k, µ_k), and the cycle continues. While this approach is intuitive and reminiscent of the best-response dynamics proposed in Adlakha and Johari (2013), it is not clear that it will converge to any equilibrium. We show that in mean field games with strategic complementarities, such a trembling best-response (T-BR) process converges to a T-MFE. Our computation algorithm, which we call the T-BR algorithm, is presented in Algorithm 1.

Algorithm 1: T-BR Algorithm
1: Initialization: Initial mean field z_0
2: for k = 0, 1, 2, ... do
3:    For the mean field z_k, compute the optimal TQ-value function Q*_{z_k} using the TQ-value iteration Q_{m+1,z_k} = F_{z_k}(Q_{m,z_k})
4:    Compute the strategy µ_k = Ψ(z_k) as the trembling-hand strategy w.r.t. Q*_{z_k}, i.e., µ_k = π^ε_{Q*_{z_k}}
5:    Compute the next mean field z_{k+1} = Φ(z_k, µ_k)
6: end for
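A sketch of the T-BR loop is shown below, reusing mckean_vlasov_step, tq_value_iteration, and trembling_hand_policy from the earlier sketches; the model interface (reward_fn(z) and kernel_fn(z) returning r(·,·,z) and P(·|·,·,z) as arrays) and the stopping rule are our assumptions for illustration.

```python
import numpy as np

def t_br(reward_fn, kernel_fn, z0, gamma, eps, K=500, tol=1e-6):
    """Trembling best-response (Algorithm 1): TQ-value iteration plus one-step MF updates."""
    z = np.asarray(z0, dtype=float)
    mu = None
    for _ in range(K):
        r_z, P_z = reward_fn(z), kernel_fn(z)            # known model evaluated at the current mean field
        Q = tq_value_iteration(r_z, P_z, gamma, eps)     # step 3: Q*_{z_k}
        mu = trembling_hand_policy(Q, eps)               # step 4: mu_k = pi^eps w.r.t. Q*_{z_k}
        z_new = mckean_vlasov_step(z, mu, P_z)           # step 5: z_{k+1} = Phi(z_k, mu_k)
        if np.abs(z_new - z).sum() < tol:                # stop once the mean field has settled
            return mu, z_new
        z = z_new
    return mu, z
```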

Theorem 2. Let Γ be a stationary mean field game with strategic complementarities. Let {z_k} and {µ_k} be the sequences of mean fields and strategies generated according to Algorithm 1. Then µ_k → µ* and z_k → z* as k → ∞, where (µ*, z*) constitutes a T-MFE of Γ.

Remark 1. Consider the sequence of mean fields {z_k} generated by the T-BR algorithm. According to Theorem 2, there exists a finite k_0 = k_0(ε) such that ‖z_k − z*‖ ≤ ε for all k ≥ k_0, where z* is the T-MFE. A precise characterization of k_0 is difficult because the convergence of the T-BR algorithm is based on monotonicity properties of MFG-SC, rather than on contraction arguments.

4 Learning T-MFE

We address the problem of learning T-MFE when the model is unknown. In this section, we assume the availability of a simulator which, given the current state s, current action a, and current mean field z, can generate the next state s′ ∼ P(·|s, a, z). We discuss how to learn directly from real-world samples without a simulator in the next section. We also assume that the reward function is known, as is common in the literature. We now introduce two reinforcement learning algorithms, a model-free algorithm and a model-based algorithm, for learning T-MFE.

4.1 Model-Free TMFQ-Learning Algorithm for learning T-MFE

We first describe the TMFQ-learning algorithm, which builds on the T-BR algorithm. Recall that the T-BR algorithm uses knowledge of the model in two locations.


The first is that at each step k, for a given mean field z_k, the optimal TQ-value function Q*_{z_k} is computed using TQ-value iteration. In the learning approach, we use the generalized Q-learning framework (Szepesvari and Littman, 1996) as the basis for the model-free TQ-learning algorithm as follows:

Q_{t+1,z_k}(s_t, a_t) = (1 − α_t) Q_{t,z_k}(s_t, a_t) + α_t (r(s_t, a_t, z_k) + γ G(Q_{t,z_k})(s_{t+1})),   (6)

where α_t is the appropriate learning rate. Here, the state sequence {s_t} is generated using the simulator by fixing the mean field z_k, i.e., s_{t+1} ∼ P(·|s_t, a_t, z_k) for all t. Using the properties of the generalized Q-learning formulation (Szepesvari and Littman, 1996), it can be shown that Q_{t,z_k} → Q*_{z_k} as t → ∞.
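One tabular TQ-learning update (6) can be sketched as follows, assuming Q is a NumPy array of shape (|S|, |A|); the in-place interface and a fixed step size passed in are illustrative (the analysis uses a learning-rate schedule α_t from the generalized Q-learning framework).

```python
def tq_learning_update(Q, s, a, r, s_next, alpha, gamma, eps):
    """One TQ-learning step, eq. (6): move Q(s, a) toward r + gamma * G(Q)(s_next)."""
    A = Q.shape[1]
    q_next = Q[s_next]
    # G(Q)(s') = (1 - eps) * max_b Q(s', b) + eps/(|A|-1) * (sum of the remaining Q(s', b))
    g = (1.0 - eps) * q_next.max() + eps / (A - 1) * (q_next.sum() - q_next.max())
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * (r + gamma * g)
    return Q
```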

The second location where the model is needed in T-BR is for the one-step McKean-Vlasov update. In the learning approach, given the mean field z_k and the strategy µ_k, the next mean field z_{k+1} can be estimated to a desired accuracy using the simulator. The precise numerical approach is presented as the Next-MF scheme described in the supplementary material. We can now combine these steps to obtain the TMFQ-learning algorithm presented in Algorithm 2.

Algorithm 2: TMFQ-Learning Algorithm for T-MFE
1: Initialization: Initial mean field z_0
2: for k = 0, 1, 2, . . . do
3:    Initialize time step t ← 0. Initialize s_0, Q_{0,z_k}
4:    repeat
5:       Take action a_t ∼ π^ε_{Q_{t,z_k}}(s_t), observe the reward and the next state s_{t+1} ∼ P(·|s_t, a_t, z_k)
6:       Update Q_{t,z_k} according to TQ-learning (6)
7:       t ← t + 1
8:    until ‖Q_{t,z_k} − Q_{t−1,z_k}‖_∞ < ε_1
9:    Let Q_{z_k} = Q_{t,z_k} and the strategy µ_k = π^ε_{Q_{z_k}}
10:   z_{k+1} = Next-MF(z_k, µ_k)
11: end for
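The precise Next-MF scheme appears in the supplementary material; as a placeholder, a simple Monte Carlo version consistent with its role in Algorithm 2 (estimating Φ(z_k, µ_k) from one-step simulator calls) might look as follows. The simulator interface sim_next_state(s, a, z) and the sample budget are assumptions.

```python
import numpy as np

def next_mf_estimate(z, mu, sim_next_state, n_samples=10_000, seed=None):
    """Monte Carlo estimate of Phi(z, mu) using a one-step simulator (illustrative Next-MF stand-in)."""
    rng = np.random.default_rng(seed)
    S, A = mu.shape
    counts = np.zeros(S)
    for _ in range(n_samples):
        s = rng.choice(S, p=z)                    # draw a state from the current mean field
        a = rng.choice(A, p=mu[s])                # draw an action from the trembling-hand strategy
        counts[sim_next_state(s, a, z)] += 1      # simulator returns s' ~ P(.|s, a, z)
    return counts / n_samples
```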

We first present an asymptotic convergence result for TMFQ-learning based on a perfect accuracy assumption on the TQ-learning and Next-MF steps, i.e., they are run to convergence. A more complex, but more accurate, analysis using two-timescale stochastic approximation is also possible. Instead, we will remove this assumption, and provide a PAC-type result further below.

Theorem 3. Let Γ be a stationary mean field game with strategic complementarities. Let {z_k} and {µ_k} be the sequences of mean fields and policies generated by Algorithm 2. Then µ_k → µ* and z_k → z* as k → ∞, where (µ*, z*) is a T-MFE of Γ.

In a practical implementation, we may only run the TQ-learning step and the Next-MF step for a finite number of iterations. Hence, we develop a sample complexity bound under which TQ-learning and Next-MF provide an appropriate accuracy. We wish to compare TMFQ-learning with T-BR after k_0 iterations, where k_0 is the number of iterations of T-BR that yields a mean field that is ε-close to the T-MFE z*. We make some assumptions that are required for such a characterization.

Assumption 1. (i) There exists C_1 > 0 such that ‖r(·, ·, z) − r(·, ·, z′)‖_1 ≤ C_1 ‖z − z′‖_1 for all z, z′ ∈ P(S).
(ii) There exists C_2 > 0 such that ‖P(·|·, ·, z) − P(·|·, ·, z′)‖_1 ≤ C_2 ‖z − z′‖_1 for all z, z′ ∈ P(S).
(iii) Let P_{z,µ}(s′|s) = ∑_a µ(s, a) P(s′|s, a, z). Let µ_1, µ_2 be the trembling-hand policies corresponding to Q_1, Q_2, i.e., µ_1 = π^ε_{Q_1}, µ_2 = π^ε_{Q_2}. Then there exists C_3 > 0 such that ‖P_{z,µ_1} − P_{z,µ_2}‖_1 ≤ C_3 ‖Q_1 − Q_2‖_∞ for all z ∈ P(S) and for any given Q_1, Q_2.

Assumptions 1.(i) and 1.(ii) indicate that the reward function and transition kernel are Lipschitz with respect to the mean field, while Assumption 1.(iii) indicates that the distance between the Markov chains induced by two policies on the same transition kernel is upper bounded by a constant times the distance between their respective Q-functions. We then have the following.

Theorem 4. Let Assumption 1 hold. For any 0 ≤ ε, δ < 1, let k_0 = k_0(ε). In Algorithm 2, for each k ≤ k_0, assume that the TQ-learning update (according to (6)) is performed T_0 times, where T_0 is given as

T_0 = O( ( (B^2 L^{1+3w} V_max^2)/(β^2 ε^2) · ln( 2 B k_0 |S||A| V_max / (δ β ε) ) )^{1/w} + ( (L/β) ln( B V_max / ε ) )^{1/(1−w)} ),   (7)

where B = (1 + C_2 + C_3 D)^{k_0+1} (C_3 + 1), D = (C_1 + γ C_2)/(1 − γ), V_max = 1/(1 − γ), β = (1 − γ)/2, L is an upper bound on the covering time^1, and w ∈ (1/2, 1). Then,

P( ‖z_{k_0} − z*‖ ≤ 2ε ) ≥ 1 − δ.

We may also eliminate the dependence of the constant term B on k_0 under a contraction assumption on the McKean-Vlasov dynamics Φ (for instance, following conditions similar to Borkar and Sundaresan (2013)).

^1 The covering time of a state-action pair sequence is the number of steps needed to visit all state-action pairs starting from any arbitrary state-action pair (Even-Dar and Mansour, 2003).


Assumption 2. Let Q_1, Q_2 be two arbitrary Q-value functions and let µ_1 = π^ε_{Q_1}, µ_2 = π^ε_{Q_2}. Let z_1, z_2 be two arbitrary mean fields. Then there exist positive constants C_4 and C_5 such that ‖Φ(z_1, µ_1) − Φ(z_2, µ_2)‖_1 ≤ C_4 ‖z_1 − z_2‖_1 + C_5 ‖Q_1 − Q_2‖_∞. Also assume that (C_4 + C_5 D) < 1, where D = (C_1 + γ C_2)/(1 − γ).

Corollary 1. Let Assumption 1 and Assumption 2 hold. Then, we obtain the bound on T_0 as in (7) with B = (C_5 + 1)/(1 − (C_4 + C_5 D)), which does not depend on k_0.

4.2 Generative Model-Based Reinforcement Learning for T-MFE

A model-based variant of the T-BR algorithm is also straightforward to construct. We note that the non-stationarity in our system (the model changes at each step) implies that there is no single model. Hence, our approach generates a new model each time the mean field evolves under a T-BR-like procedure. Full details are presented in the supplementary material.

5 Online Learning of T-MFE

The availability of a large number of agents that explore via trembling-hand strategies suggests that we can do away with a system simulator via an online algorithm that simply aggregates these concurrently generated samples and computes a new strategy that is then pushed to all agents. Typically, the assumption in such large population scenarios is that the system size is fixed, but any agent may leave the system at time step t with probability ζ ∈ (0, 1) and be immediately replaced by a new agent with a random state, referred to as a regeneration event.

A generic RL approach would require that, given the current strategy µ_k and mean field z_k, we compute the stationary distribution of the model P_{z_k, µ_k} and set it as the next mean field. This would preclude learning without a simulator, since running one step in the real world would immediately cause a mean field update to z_{k+1}, and induce a new model P_{z_{k+1}, ·}, making the system non-stationary.

Since our RL algorithms only need a one-step McKean-Vlasov update under each strategy, an online learning approach at time k, when the underlying state distribution is z_k, would be to apply µ_{k−1} to the system, with the resultant state distribution being z_{k+1}. However, we face the issue that the samples obtained pertain to P_{z_k, µ_{k−1}}, whereas the system model is now P_{z_{k+1}, ·}, and so an online sample-based trembling-hand strategy µ_k will lag the current system model by one step. Fortunately, convergence of the TMFQ-learning approach is robust to this lag, and we present its online version in Algorithm 3.

Algorithm 3: Online TMFQ-Learning Algorithm for T-MFE
1: Initialize mean field z_0 and strategy µ_0
2: for k = 1, 2, . . . do
3:    Reset memory buffer
4:    for agents i = 1, 2, . . . , I do
5:       Given current state s^i_k, take action a^i_k ∼ µ_{k−1}(s^i_k), get reward r^i_k, observe the next state s^i_{k+1}
6:       Add the sample (s^i_k, a^i_k, r^i_k, s^i_{k+1}) to the memory buffer
7:    end for
8:    Perform TQ-learning on the memory buffer to obtain Q_{z_k} and µ_k = π^ε_{Q_{z_k}}
9: end for
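A sketch of one outer iteration (one k) of Algorithm 3 is given below, reusing tq_learning_update and trembling_hand_policy from the earlier sketches. The real-world interface env_step(s, a) (which implicitly samples from the current population model), the fixed step size, and the number of replay passes over the buffer are illustrative assumptions.

```python
import numpy as np

def online_tmfq_iteration(states, mu_prev, env_step, Q, alpha, gamma, eps, passes=10, seed=None):
    """One iteration of online TMFQ-learning: every agent acts once, then TQ-learn on the shared buffer."""
    rng = np.random.default_rng(seed)
    buffer, next_states = [], []
    for s in states:                                      # lines 4-7: all agents apply mu_{k-1} for one step
        a = rng.choice(len(mu_prev[s]), p=mu_prev[s])
        r, s_next = env_step(s, a)
        buffer.append((s, a, r, s_next))
        next_states.append(s_next)
    for _ in range(passes):                               # line 8: off-policy TQ-learning on the buffer
        for (s, a, r, s_next) in buffer:
            Q = tq_learning_update(Q, s, a, r, s_next, alpha, gamma, eps)
    mu_next = trembling_hand_policy(Q, eps)               # push mu_k = pi^eps_{Q_{z_k}} back to the agents
    return next_states, mu_next, Q
```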

Algorithm 3 aggregates the samples generated by executing strategy µ_{k−1} to estimate the TQ-value function, and then passes µ_k back to the agents. We note that regeneration of a fraction of agents and execution of trembling-hand strategies ensure that we have sufficient samples of each state-action pair in the large population regime, so that off-policy learning such as TQ-learning on the memory buffer converges to the optimal TQ-value function. In order to characterize the sample complexity of Algorithm 3, we employ bounds pertaining to synchronous Q-learning (Even-Dar and Mansour, 2003), using the guarantee that all state-action pairs are sampled a desired number of times. The PAC result is similar to that for TMFQ-learning, with accuracy increasing in the number of agents rather than in the number of samples as in Theorem 4.

Theorem 5. Let Assumption 1 hold. For any 0 ≤ ε, δ < 1, let k_0 = k_0(ε). In Algorithm 3, for each k ≤ k_0, assume that there are a total of I = I_0 agents, where I_0 is given as

I_0 = O( (|S||A|/(ε ζ)) [ ( (B^2 V_max^2)/(β^2 ε^2) · ln( 2 B k_0 |S||A| V_max / (δ β ε) ) )^{1/w} + ( (1/β) ln( B V_max / ε ) )^{1/(1−w)} ] ),

where B = ((1 − ζ)2A)^{k_0} (C_3 ε) / ((1 − ζ)(A − 1)), A = max{1 + C_2, C_3 D}, D = (C_1 + γ C_2)/(1 − γ), V_max = 1/(1 − γ), β = (1 − γ)/2, w ∈ (1/2, 1), ζ is the regeneration probability, and ε is the trembling-hand strategy randomization. Then,

P( ‖z_{k_0} − z*‖ ≤ 2ε ) ≥ 1 − δ.

Note that a result similar to that of Corollary 1 can easily be shown here as well under the same assumption. We omit the details due to page limitations.


[Figure 1: T-MFE with c_f = 0.1. CDF over states for Online TMFQ-Learning, GMBL, and TMFQ-Learning.]

[Figure 2: CDF of states for different algorithms: c_f = 0.05 (O-TMFQL, MFQ, IQL).]

[Figure 3: Mean states for different algorithms: c_f = 0.05. Average state of the population vs. iteration k (O-TMFQL, MFQ, IQL).]

[Figure 4: Convergence of O-TMFQ-Learning: c_f = 0.05. Normalized ‖z_k − z*‖_1 vs. k, for 100, 500, and 1000 agents.]

[Figure 5: Mean state of O-TMFQ-Learning over time, for c_f = 0.05, 0.3, 0.6.]

[Figure 6: CDF of O-TMFQ-Learning, for c_f = 0.05, 0.3, 0.6.]

6 Experiments

We consider the Infection Spread example described in Section 2. The state space is S = {0, . . . , 24}, with 0 being the lowest health level. The action space is A = {0, . . . , 4}, with 4 being the strongest preventive action, and c_f is the infection intensity constant. Details of the parameters and more experiments are presented in the supplementary material. We evaluate the performance of our algorithms, TMFQ-Learning (TMFQ), GMBL, and Online TMFQ-Learning (O-TMFQ), on this model, along with comparisons with Independent Q-Learning (IQL) and Mean Field Q-Learning (MFQ) (Yang et al., 2018). In IQL, each agent ignores other agents, maintains an individual Q-function, and performs TQ-learning independently. We implement a variant of MFQ, where each agent maintains a Q-function parameterized by the average states of a subset of the population and performs TQ-learning. We average over 20 runs in each experiment, and the dashed line and band in the figures show the average and standard deviation, respectively.

Figure 1 shows the final mean field distribution obtained by each of our algorithms, simulated with 1000 agents. Note that all of them converge to the same T-MFE, indicating the accuracy of O-TMFQ. We next compare the performance of O-TMFQ with IQL and MFQ. Figure 2 shows that the final equilibrium distributions of IQL and MFQ are inaccurate (not the true T-MFE), and that O-TMFQ results in higher states (health levels). Figure 3 shows the evolution of the mean health of the population, ∑_s s z_k(s), with iteration number k. The mean health of the population quickly converges to a higher value under O-TMFQ, while the corresponding value is lower under MFQ and IQL.

Figure 4 shows the rate of convergence of the mean field under different numbers of agents. Here, z* is the final mean field obtained by O-TMFQ. We see that the mean field approximation, which is asymptotically accurate, becomes increasingly correct even with a relatively small number of agents, such as 500 or 1000. We show the evolution of the average health level of the population for O-TMFQ in Figure 5 for different values of c_f. This plot indicates that the convergence of the algorithm is fairly fast. Finally, Figure 6 explores the impact of the mean field on the model and its associated equilibrium via c_f. As expected, more agents are in lower health states for larger c_f.

7 Conclusions

We introduced the notion of trembling-hand perfection to MFGs as a means of providing strategically consistent exploration. We showed existence of T-MFE in MFGs with strategic complementarities, and developed an algorithm for its computation. Based on this algorithm, we developed model-free, model-based, and fully online learning algorithms, and provided PAC bounds on their performance. Experiments illustrated the accuracy and good convergence properties of our algorithms.


Acknowledgement

This work was supported in part by NSF grants CRII-CPS-1850206, ECCS-EPCN-1839616, CNS-1955696, ECCS-1839816, CNS-1719384, CPS-2038963, and ARO grant W911NF-19-1-0367.

References

S. Adlakha and R. Johari. Mean field equilibrium in dynamic games with strategic complementarities. Operations Research, 61(4):971–989, 2013.

R. S. Bielefeld. Reexamination of the perfectness concept for equilibrium points in extensive games. In Models of Strategic Rationality, pages 1–31. Springer, 1988.

V. S. Borkar and R. Sundaresan. Asymptotics of the invariant measure in mean field models with jumps. Stochastic Systems, 2(2):322–380, 2013.

P. Cardaliaguet and S. Hadikhanloo. Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591, 2017.

R. Carmona, M. Lauriere, and Z. Tan. Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv preprint arXiv:1910.04295, 2019.

E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.

C. Graham and S. Meleard. Chaos hypothesis for a system interacting through shared resources. Probability Theory and Related Fields, 100(2):157–174, 1994.

X. Guo, A. Hu, R. Xu, and J. Zhang. Learning mean-field games. In Advances in Neural Information Processing Systems, pages 4967–4977, 2019.

K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970, 2014.

A. C. Kizilkale and P. E. Caines. Mean field stochastic adaptive control. IEEE Transactions on Automatic Control, 58(4):905–920, 2012.

J.-M. Lasry and P.-L. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.

J. Li, R. Bhattacharyya, S. Paul, S. Shakkottai, and V. Subramanian. Incentivizing sharing in realtime D2D streaming networks: A mean field game perspective. IEEE/ACM Transactions on Networking, 25(1):3–17, 2016.

J. Li, B. Xia, X. Geng, H. Ming, S. Shakkottai, V. Subramanian, and L. Xie. Mean field games in nudge systems for societal networks. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), 3(4):15, 2018.

D. Mguni, J. Jennings, and E. M. de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

P. Milgrom and J. Roberts. Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica: Journal of the Econometric Society, pages 1255–1277, 1990.

A. S. Nowak. On stochastic games in economics. Mathematical Methods of Operations Research, 66(3):513–530, 2007.

J. Subramanian and A. Mahajan. Reinforcement learning in stationary mean-field games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 251–259. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

C. Szepesvari and M. L. Littman. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. In Proceedings of the International Conference on Machine Learning, volume 96, 1996.

H. Tembine, J.-Y. Le Boudec, R. El-Azouzi, and E. Altman. Mean field asymptotics of Markov decision evolutionary games and teams. In 2009 International Conference on Game Theory for Networks, pages 140–150. IEEE, 2009.

X. Vives. Strategic complementarity in multi-stage games. Economic Theory, 40(1):151–171, 2009.

Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. In 35th International Conference on Machine Learning, ICML 2018, volume 80, pages 5571–5580. PMLR, 2018.

H. Yin, P. G. Mehta, S. P. Meyn, and U. V. Shanbhag. Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629–644, 2013.

K. Zhang, Z. Yang, and T. Basar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.