
arXiv:1606.08718v4 [cs.GT] 6 Mar 2017

Learning Nash Equilibrium for General-Sum Markov Games from Batch Data

Julien Pérolat (1) [email protected]
Florian Strub (1) [email protected]
Bilal Piot (1,3) [email protected]
Olivier Pietquin (1,2,3) [email protected]

(1) Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL, F-59000 Lille, France
(2) Institut Universitaire de France (IUF), France
(3) DeepMind, UK

Abstract

This paper addresses the problem of learning a Nash equilibrium in γ-discounted multiplayer general-sum Markov Games (MGs) in a batch setting. As the number of players increases in an MG, the agents may either collaborate or compete to increase their final rewards. One solution to address this problem is to look for a Nash equilibrium. Although several techniques exist for the subcase of two-player zero-sum MGs, they fail to find a Nash equilibrium in general-sum MGs. In this paper, we introduce a new definition of ε-Nash equilibrium in MGs which captures the quality of a strategy in multiplayer games. We prove that minimizing the norm of two Bellman-like residuals implies learning such an ε-Nash equilibrium. We then show that minimizing an empirical estimate of the Lp norm of these Bellman-like residuals allows learning such an equilibrium in general-sum games within the batch setting. Finally, we introduce a neural network architecture that successfully learns a Nash equilibrium in generic multiplayer general-sum turn-based MGs.

1 Introduction

A Markov Game (MG) is a model for Multi-Agent Reinforcement Learning (MARL) [Littman, 1994]. Chess, robotics or human-machine dialogues are a few examples of the myriad of applications that can be modeled by an MG. At each state of an MG, all agents have to pick an action simultaneously. As a result of this mutual action, the game moves to another state and each agent receives a reward. In this paper, we focus on what is presently the most generic MG framework with perfect information: general-sum multiplayer MGs. This framework includes MDPs (which are single-agent MGs), zero-sum two-player games, normal form games, extensive form games and other derivatives [Nisan et al., 2007]. For instance, general-sum multiplayer MGs have neither a specific reward structure, as opposed to zero-sum two-player MGs [Shapley, 1953, Perolat et al., 2015], nor state space constraints, such as the tree structure of extensive form games [Nisan et al., 2007]. More importantly, when it comes to MARL, using classic models of reinforcement learning (MDP or Partially Observable MDP) assumes that the other players are part of the environment. Under this assumption, the other players have a stationary strategy and do not adapt it to the current player. General-sum multiplayer MGs remove this assumption by allowing the agents to dynamically re-adapt their strategy according to the other players. Therefore, they may either cooperate or confront each other in different states, which entails joint strategies between the players.

Appearing in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the authors.

In this paper, agents are not allowed to directly settle a common strategy beforehand. They have to choose their strategy secretly and independently to maximize their cumulative reward. In game theory, this concept is traditionally addressed by searching for a Nash equilibrium [Filar and Vrieze, 2012]. Note that different Nash equilibria may co-exist in a single MG. Computing a Nash equilibrium in MGs requires a perfect knowledge of the dynamics and the rewards of the game. In most practical situations, the dynamics and the rewards are unknown. To overcome this difficulty, two approaches are possible: either learn the game by interacting with other agents, or extract information from the game through a finite record of interactions between the agents. The former is known as the online scenario while the latter is called the batch scenario. This paper focuses on the batch scenario. Furthermore, when a player learns a strategy, he faces a representation problem. In every state of the game (or configuration), he has to store a strategy corresponding to his goal. However, this approach does not scale with the number of states and it prevents generalizing a strategy from one state to another. Thus, function approximation is introduced to estimate and generalize the strategy.

First, the wider family of batch algorithms applied to MDPs builds upon approximate value iteration [Riedmiller, 2005, Munos and Szepesvari, 2008] or approximate policy iteration [Lagoudakis and Parr, 2003]. Another family relies on the direct minimization of the optimal Bellman residual [Baird et al., 1995, Piot et al., 2014b]. These techniques can handle value function approximation and have proven their efficacy in large MDPs, especially when combined with neural networks [Riedmiller, 2005]. The batch scenario is also well studied for the particular case of zero-sum two-player MGs. It can be solved with techniques analogous to the ones for MDPs [Lagoudakis and Parr, 2002, Perolat et al., 2015, Perolat et al., 2016]. All the previously referenced algorithms use the value function or the state-action value function to compute the optimal strategy. However, [Zinkevich et al., 2006] demonstrates that no algorithm based on the value function or on the state-action value function can be applied to find a stationary Nash equilibrium in general-sum MGs. Therefore, recent approaches to find a Nash equilibrium consider an optimization problem on the strategy of each player [Prasad et al., 2015] in addition to the value functions. Finding a Nash equilibrium is well studied when the model of the MG is known [Prasad et al., 2015] or estimated [Akchurina, 2010], or in the online scenario [Littman, 2001, Prasad et al., 2015]. However, none of the aforementioned algorithms approximates the value functions or the strategies. To our knowledge, using function approximation to find a Nash equilibrium in MGs has not been attempted before this paper.

The key contribution of this paper is to introduce a novel approach to learn an ε-Nash equilibrium in a general-sum multiplayer MG in the batch setting. This contribution relies on two ideas: the minimization of two Bellman-like residuals and the definition of the notion of weak ε-Nash equilibrium. Again, our approach is the first work that considers function approximation to find an ε-Nash equilibrium. As a second contribution, we empirically evaluate our idea on randomly generated deterministic MGs by designing a neural network architecture. This network minimizes the Bellman-like residuals by approximating the strategy and the state-action value function of each player.

These contributions are structured as follows: first, we recall the definitions of an MG and of a Nash equilibrium, and we define a weaker notion of an ε-Nash equilibrium. Then, we show that controlling the sum of several Bellman residuals allows us to learn an ε-Nash equilibrium. Later, we explain that controlling the empirical norm of those Bellman residuals addresses the batch scenario. At this point, we provide a description of the NashNetwork. Finally, we empirically evaluate our method on randomly generated MGs (Garnets). Using classic games would have limited the scope of the evaluation. Garnets enable setting up a myriad of relevant configurations and avoid drawing conclusions about our approach from specific games.

2 Background

In an N-player MG with a finite state space S, all players take actions in finite sets $((A^i(s))_{s\in S})_{i\in\{1,\dots,N\}}$ depending on the player and on the state. In every state s, each player has to take an action within his action space $a^i \in A^i(s)$ and receives a reward signal $r^i(s, a^1, \dots, a^N)$ which depends on the joint action of all players in state s. Then, the state changes according to a transition kernel $p(s'|s, a^1, \dots, a^N)$. All quantities indexed by -i refer to all players except player i (i.e. $a^{-i} = (a^1, \dots, a^{i-1}, a^{i+1}, \dots, a^N)$). For the sake of simplicity, we write $a = (a^1, \dots, a^N) = (a^i, a^{-i})$, $p(s'|s, a^1, \dots, a^N) = p(s'|s, a) = p(s'|s, a^i, a^{-i})$ and $r^i(s, a^1, \dots, a^N) = r^i(s, a) = r^i(s, a^i, a^{-i})$. The constant $\gamma \in [0, 1)$ is the discount factor.

In a MARL problem, the goal is typically to find a strategy for each player (a policy in the MDP literature). A stationary strategy $\pi^i$ maps each state to a distribution over the action space $A^i(s)$; $\pi^i(.|s)$ denotes that distribution. As previously, we write $\pi = (\pi^1, \dots, \pi^N) = (\pi^i, \pi^{-i})$ the product distribution of those strategies (i.e. $\pi^{-i} = (\pi^1, \dots, \pi^{i-1}, \pi^{i+1}, \dots, \pi^N)$). The stochastic kernel defining the dynamics of the Markov chain when all players follow their strategy $\pi$ is $P_\pi(s'|s) = \mathbb{E}_{a\sim\pi}[p(s'|s,a)]$, and the one defining the MDP when all players but player i follow their strategy $\pi^{-i}$ is $P_{\pi^{-i}}(s'|s,a^i) = \mathbb{E}_{a^{-i}\sim\pi^{-i}}[p(s'|s,a)]$. The kernel $P_\pi$ can be seen as a square matrix of size $|S|$. Identically, we can define the reward averaged over all strategies, $r^i_\pi(s) = \mathbb{E}_{a\sim\pi}[r^i(s,a)]$, and the reward averaged over all players but player i, $r^i_{\pi^{-i}}(s,a^i) = \mathbb{E}_{a^{-i}\sim\pi^{-i}}[r^i(s,a)]$. Again, the reward $r^i_\pi$ can be seen as a vector of size $|S|$. Finally, the identity matrix is noted I.
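As an illustration of these induced quantities, the sketch below (our own, not from the paper) computes $P_\pi$ and $r^i_\pi$ for a two-player tabular MG with NumPy, assuming the model and strategies are given as dense arrays:

```python
import numpy as np

def joint_kernel_and_reward(p, r, policies):
    """Compute P_pi(s'|s) and r^i_pi(s) for a tabular 2-player MG.

    p        : transition kernel, shape [S, A1, A2, S]
    r        : per-player rewards, shape [N, S, A1, A2]
    policies : list of strategies, policies[i] has shape [S, A_i]
    """
    pi1, pi2 = policies
    # Joint strategy pi(a1, a2 | s) as a product distribution.
    joint = np.einsum('sa,sb->sab', pi1, pi2)          # [S, A1, A2]
    # P_pi(s'|s) = E_{a~pi}[ p(s'|s, a) ]
    P_pi = np.einsum('sab,sabt->st', joint, p)         # [S, S]
    # r^i_pi(s) = E_{a~pi}[ r^i(s, a) ]
    r_pi = np.einsum('sab,isab->is', joint, r)         # [N, S]
    return P_pi, r_pi
```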

Value of the joint strategy: For each state s, the expected return considering each player plays his part of the joint strategy π is the γ-discounted sum of his rewards [Prasad et al., 2015, Filar and Vrieze, 2012]:

$$v^i_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t r^i_\pi(s_t) \,\Big|\, s_0 = s,\ s_{t+1} \sim P_\pi(\cdot|s_t)\Big] = \big[(I - \gamma P_\pi)^{-1} r^i_\pi\big](s).$$

The joint value of each player is the following: $v_\pi = (v^1_\pi, \dots, v^N_\pi)$. In MG theory, two Bellman operators can be defined on the value function with respect to player i. The first one is $\mathcal{T}^i_a v^i(s) = r^i(s,a) + \gamma\sum_{s'\in S} p(s'|s,a)\, v^i(s')$. It represents the expected value player i will get in state s if all players play the joint action a and if the value for player i after one transition is $v^i$. If we consider that, instead of playing action a, every player plays according to a joint strategy π, the Bellman operator to consider is $\mathcal{T}^i_\pi v^i(s) = \mathbb{E}_{a\sim\pi}[\mathcal{T}^i_a v^i(s)] = r^i_\pi(s) + \gamma\sum_{s'\in S} P_\pi(s'|s)\, v^i(s')$. As in MDP theory, this operator is a γ-contraction in $L_{+\infty}$-norm [Puterman, 1994]. Thus, it admits a unique fixed point which is the value for player i of the joint strategy π: $v^i_\pi$.
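In the tabular case, $v^i_\pi$ can thus be obtained by solving the linear system above directly; a minimal NumPy sketch (ours, with assumed array shapes) is:

```python
import numpy as np

def joint_strategy_values(P_pi, r_pi, gamma=0.9):
    """Value of the joint strategy: v^i_pi = (I - gamma * P_pi)^{-1} r^i_pi.

    P_pi : [S, S] kernel induced by the joint strategy
    r_pi : [N, S] averaged reward of each player
    """
    S = P_pi.shape[0]
    A = np.eye(S) - gamma * P_pi
    # Solve one linear system per player instead of inverting explicitly.
    return np.linalg.solve(A, r_pi.T).T               # [N, S]
```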

Value of the best response: When the strategy $\pi^{-i}$ of all players but player i is fixed, the problem is reduced to an MDP of kernel $P_{\pi^{-i}}$ and reward function $r^i_{\pi^{-i}}$. In this underlying MDP, the optimal Bellman operator is $\mathcal{T}^{*i}_{\pi^{-i}} v^i(s) = \max_{a^i} \mathbb{E}_{a^{-i}\sim\pi^{-i}}\big[\mathcal{T}^i_{a^i,a^{-i}} v^i(s)\big] = \max_{a^i}\big[r^i_{\pi^{-i}}(s,a^i) + \gamma\sum_{s'\in S} P_{\pi^{-i}}(s'|s,a^i)\, v^i(s')\big]$. The fixed point of this operator is the value of a best response of player i when all other players play $\pi^{-i}$, and it will be written $v^{*i}_{\pi^{-i}} = \max_{\pi^i} v^i_{\pi^i,\pi^{-i}}$.

3 Nash, ε-Nash and Weak ε-Nash Equilibrium

A Nash equilibrium is a solution concept well defined in game theory. It states that a player cannot improve his own value by switching his strategy if the other players do not vary their own [Filar and Vrieze, 2012]. The goal of this paper is to find a strategy for the players which is as close as possible to a Nash equilibrium.

Definition 1. In an MG, a strategy π is a Nash equilibrium if: $\forall i \in \{1,\dots,N\},\ v^i_\pi = v^{*i}_{\pi^{-i}}$.

This definition can be rewritten with Bellman operators:

Definition 2. In an MG, a strategy π is a Nash equilibrium if $\exists v$ such that $\forall i \in \{1,\dots,N\},\ \mathcal{T}^i_\pi v^i = v^i$ and $\mathcal{T}^{*i}_{\pi^{-i}} v^i = v^i$.

Proof. The proof is left in appendix A.

One can notice that, in the case of a single-player MG (or MDP), a Nash equilibrium is simply the optimal strategy. An ε-Nash equilibrium is a relaxed solution concept in game theory. When all players play an ε-Nash equilibrium, the value they receive is at most ε sub-optimal compared to a best response. Formally [Filar and Vrieze, 2012]:

Definition 3. In an MG, a strategy π is an ε-Nash equilibrium if:

$$\forall i \in \{1,\dots,N\},\ v^i_\pi + \epsilon \geq v^{*i}_{\pi^{-i}} \quad\text{or}\quad \forall i \in \{1,\dots,N\},\ v^{*i}_{\pi^{-i}} - v^i_\pi \leq \epsilon,$$

which is equivalent to:

$$\Big\| \big\| v^{*i}_{\pi^{-i}} - v^i_\pi \big\|_{s,\infty} \Big\|_{i,\infty} \leq \epsilon.$$

Interestingly, when considering an MDP, the definition of an ε-Nash equilibrium reduces to controlling the $L_{+\infty}$-norm between the value of the player's strategy and the optimal value. However, it is known that approximate dynamic programming algorithms do not control an $L_{+\infty}$-norm but rather an $L_p$-norm [Munos and Szepesvari, 2008] (we take the definition of the $L_p$-norm of [Piot et al., 2014b]). Using an $L_p$-norm is necessary for approximate dynamic programming algorithms to use complexity bounds from learning theory [Piot et al., 2014b]. The convergence of these algorithms was analyzed using supervised learning bounds in $L_p$-norm, and thus guarantees are given in $L_p$-norm [Scherrer et al., 2012]. In addition, Bellman residual approaches on MDPs also give guarantees in $L_p$-norm [Maillard et al., 2010, Piot et al., 2014b]. Thus, we define a natural relaxation of the previous definition of the ε-Nash equilibrium in $L_p$-norm which is consistent with the existing work on MDPs.

Definition 4. In an MG, π is a weak ε-Nash equilibrium if:

$$\Big\| \big\| v^{*i}_{\pi^{-i}} - v^i_\pi \big\|_{\mu(s),p} \Big\|_{\rho(i),p} \leq \epsilon.$$

One should notice that an ε-Nash equilibrium is a weak ε-Nash equilibrium. Conversely, a weak ε-Nash equilibrium is not always an ε-Nash equilibrium. Furthermore, both ε need not be equal. The notion of weak ε-Nash equilibrium defines a performance criterion to evaluate a strategy while seeking a Nash equilibrium. Thus, this definition has the great advantage of providing a convincing way to evaluate the final strategy. In the case of an MDP, a weak ε-Nash equilibrium only consists in controlling the difference, in $L_p$-norm, between the optimal value and the value of the learned strategy. For an MG, this criterion is an $L_p$-norm over players ($i \in \{1,\dots,N\}$) of an $L_p$-norm over states ($s \in S$) of the difference between the value of the joint strategy π for player i in state s and that of his best response against the joint strategy $\pi^{-i}$. In the following, we consider that learning a Nash equilibrium should result in minimizing the loss $\big\| \| v^{*i}_{\pi^{-i}} - v^i_\pi \|_{\mu(s),p} \big\|_{\rho(i),p}$. For each player, we want to minimize the difference between the value of his strategy and a best response, considering the strategy of the other players fixed. However, a direct minimization of that norm is not possible in the batch setting, even for MDPs. Indeed, $v^{*i}_{\pi^{-i}}$ cannot be directly observed and used as a target. A common strategy to alleviate this problem is to minimize a surrogate loss. In MDPs, a possible surrogate loss is the optimal Bellman residual $\|v - \mathcal{T}^* v\|_{\mu,p}$ (where μ is a distribution over states and $\mathcal{T}^*$ is the optimal Bellman operator for MDPs) [Piot et al., 2014b, Baird et al., 1995]. The optimal policy is then extracted from the learnt optimal value (or Q-value in general). In the following section, we extend this Bellman residual approach to MGs.

4 Bellman Residual Minimization in MGs

Optimal Bellman residual minimization is not straightforwardly extensible to MGs because multiplayer strategies cannot be directly extracted from value functions, as shown in [Zinkevich et al., 2006]. Yet, from Definition 2, we know that a joint strategy π is a Nash equilibrium if there exists v such that, for any player i, $v^i$ is the value of the joint strategy π for player i (i.e. $\mathcal{T}^i_\pi v^i = v^i$) and $v^i$ is the value of the best response player i can achieve regarding the opponents' strategy $\pi^{-i}$ (i.e. $\mathcal{T}^{*i}_{\pi^{-i}} v^i = v^i$). We thus propose to build a second Bellman-like residual, optimizing over the set of strategies, so as to directly learn an ε-Nash equilibrium. The first (traditional) residual (i.e. $\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\|_{\rho(i),p}$) forces the value of each player to be close to his best response against every other player, while the second residual (i.e. $\|\mathcal{T}^i_\pi v^i - v^i\|_{\rho(i),p}$) forces every player to play the strategy corresponding to that value.

One can thus wonder how close to a Nash equilibrium π would be if there existed v such that $\mathcal{T}^{*i}_{\pi^{-i}} v^i \approx v^i$ and $\mathcal{T}^i_\pi v^i \approx v^i$. In this section, we prove that, if we are able to control over (v, π) a sum of the $L_p$-norms of the associated Bellman residuals ($\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\|_{\mu,p}$ and $\|\mathcal{T}^i_\pi v^i - v^i\|_{\mu,p}$), then we are able to control $\big\| \| v^{*i}_{\pi^{-i}} - v^i_\pi \|_{\mu(s),p} \big\|_{\rho(i),p}$.

Theorem 1. For all p, p' positive reals such that $\frac{1}{p} + \frac{1}{p'} = 1$ and $\forall (v^1, \dots, v^N)$:

$$\Big\| \big\| v^{*i}_{\pi^{-i}} - v^i_\pi \big\|_{\mu(s),p} \Big\|_{\rho(i),p} \leq \frac{2^{\frac{1}{p'}}\, C_\infty(\mu,\nu)^{\frac{1}{p}}}{1-\gamma} \left[ \sum_{i=1}^{N} \rho(i)\Big( \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|^p_{\nu,p} + \big\|\mathcal{T}^i_\pi v^i - v^i\big\|^p_{\nu,p} \Big) \right]^{\frac{1}{p}},$$

with the following concentrability coefficient (the norm of a Radon-Nikodym derivative):

$$C_\infty(\mu,\nu,\pi^i,\pi^{-i}) = \left\| \frac{\partial \mu^T (1-\gamma)(I - \gamma P_{\pi^i,\pi^{-i}})^{-1}}{\partial \nu^T} \right\|_{\nu,\infty} \quad\text{and}\quad C_\infty(\mu,\nu) = \sup_{\pi}\, C_\infty(\mu,\nu,\pi^i,\pi^{-i}).$$

Proof. The proof is left in appendix B.

This theorem shows that an ε-Nash equilibrium can be controlled by the sum, over the players, of the norms of two Bellman-like residuals: the Bellman residual of the best response of each player and the Bellman residual of the joint strategy. If the residual of the best response of player i ($\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\|^p_{\nu,p}$) is small, then the value $v^i$ is close to the value of the best response $v^{*i}_{\pi^{-i}}$; and if the Bellman residual of the joint strategy $\|\mathcal{T}^i_\pi v^i - v^i\|^p_{\nu,p}$ for player i is small, then $v^i$ is close to $v^i_\pi$. In the end, if all those residuals are small, the joint strategy is an ε-Nash equilibrium with ε small, since $v^{*i}_{\pi^{-i}} \simeq v^i \simeq v^i_\pi$.

Theorem 1 also emphasizes the necessity of a weakened notion of ε-Nash equilibrium: it is much easier to control an $L_p$-norm than an $L_\infty$-norm with samples. In the following, the weighted sum of the norms of the two Bellman residuals will be noted as:

$$f_{\nu,\rho,p}(\pi, v) = \sum_{i=1}^{N} \rho(i)\Big( \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|^p_{\nu,p} + \big\|\mathcal{T}^i_\pi v^i - v^i\big\|^p_{\nu,p} \Big).$$

Finding a Nash equilibrium is then reduced to a non-convex optimization problem. If we can find a (π, v) such that $f_{\nu,\rho,p}(\pi, v) = 0$, then the joint strategy π is a Nash equilibrium. This procedure relies on a search over the joint value function space and the joint strategy space. Besides, if the state space or the number of joint actions is too large, the search over values and strategies might be intractable. We address this issue by making use of approximate value functions and strategies. Actually, Theorem 1 can be extended with function approximation. A good joint strategy $\pi_\theta$ within an approximate strategy space Π can still be found by computing $(\pi_\theta, v_\eta) \in \operatorname{argmin}_{\theta,\eta} f_{\nu,\rho,p}(\pi_\theta, v_\eta)$ (where θ and η respectively parameterize $\pi_\theta$ and $v_\eta$). Even with function approximation, the learned joint strategy $\pi_\theta$ would be at least a weak ε-Nash equilibrium (with $\epsilon \leq \frac{2^{1/p'} C_\infty(\mu,\nu)^{1/p}}{1-\gamma} f_{\nu,\rho,p}(\pi_\theta, v_\eta)$). This is, to our knowledge, the first approach to solve MGs within an approximate strategy space and an approximate value function space.

5 The Batch Scenario

In this section, we explain how to learn a weak ε-Nash equilibrium from Theorem 1 with approximate strategies and approximate value functions. As said previously, we focus on the batch setting where only a finite record of data sampled from an MG is available.

In the batch scenario, it is common to work on state-action value functions (also named Q-functions). In MDPs, the Q-function is defined on the state and the action of one agent. In MGs, the Q-function has to be extended over the joint action of the agents [Perolat et al., 2015, Hu and Wellman, 2003]. Thus, the Q-function in MGs for a fixed joint strategy π is the expected γ-discounted sum of rewards, considering that the players first pick the joint action a and then follow the joint strategy π. Formally, the Q-function is defined as $Q^i_\pi(s,a) = \mathcal{T}^i_a v^i_\pi(s)$. Moreover, one can define two Bellman operators analogous to the ones defined for the value function:

$$\mathcal{B}^i_\pi Q(s,a) = r^i(s,a) + \gamma\sum_{s'\in S} p(s'|s,a)\, \mathbb{E}_{b\sim\pi}[Q(s',b)],$$

$$\mathcal{B}^{*i}_\pi Q(s,a) = r^i(s,a) + \gamma\sum_{s'\in S} p(s'|s,a)\, \max_{b^i} \big[\mathbb{E}_{b^{-i}\sim\pi^{-i}}[Q(s', b^i, b^{-i})]\big].$$

As for the value function, $Q^i_\pi$ is the fixed point of $\mathcal{B}^i_\pi$ and $Q^{*i}_{\pi^{-i}}$ is the fixed point of $\mathcal{B}^{*i}_\pi$ (where $Q^{*i}_{\pi^{-i}} = \max_{\pi^i} Q^i_\pi$). The extension of Theorem 1 to Q-functions instead of value functions is straightforward. Thus, we will have to minimize the following function of the strategies and Q-functions:

$$f(Q,\pi) = \sum_{i=1}^{N} \rho(i)\Big( \big\|\mathcal{B}^{*i}_{\pi^{-i}} Q^i - Q^i\big\|^p_{\nu,p} + \big\|\mathcal{B}^i_\pi Q^i - Q^i\big\|^p_{\nu,p} \Big). \qquad (1)$$

The batch scenario consists in having a set of k samples $(s_j, (a^1_j, \dots, a^N_j), (r^1_j, \dots, r^N_j), s'_j)_{j\in\{1,\dots,k\}}$, where $r^i_j = r^i(s_j, a^1_j, \dots, a^N_j)$ and where the next state is sampled according to $p(\cdot|s_j, a^1_j, \dots, a^N_j)$. From Equation (1), we can minimize the empirical norm by using the k samples to obtain the empirical estimator of the Bellman residual error:

$$f_k(Q,\pi) = \sum_{j=1}^{k} \sum_{i=1}^{N} \rho(i)\Big[ \big|\mathcal{B}^{*i}_{\pi^{-i}} Q^i(s_j,a_j) - Q^i(s_j,a_j)\big|^p + \big|\mathcal{B}^i_\pi Q^i(s_j,a_j) - Q^i(s_j,a_j)\big|^p \Big]. \qquad (2)$$

For more details, an extensive analysis of the minimization of the Bellman residual in MDPs can be found in [Piot et al., 2014b]. In the following, we discuss the estimation of the two Bellman residuals $\big|\mathcal{B}^{*i}_{\pi^{-i}} Q^i(s_j,a_j) - Q^i(s_j,a_j)\big|^p$ and $\big|\mathcal{B}^i_\pi Q^i(s_j,a_j) - Q^i(s_j,a_j)\big|^p$ in different cases.

Deterministic Dynamics: With deterministic dynamics, the estimation is straightforward. We estimate $\mathcal{B}^i_\pi Q^i(s_j,a_j)$ with $r^i_j + \gamma\, \mathbb{E}_{b\sim\pi}[Q^i(s'_j,b)]$ and $\mathcal{B}^{*i}_{\pi^{-i}} Q^i(s_j,a_j)$ with $r^i_j + \gamma \max_{b^i}\big[\mathbb{E}_{b^{-i}\sim\pi^{-i}}[Q^i(s'_j, b^i, b^{-i})]\big]$, where the expectations are:

$$\mathbb{E}_{b\sim\pi}[Q^i(s',b)] = \sum_{b^1\in A^1} \cdots \sum_{b^N\in A^N} \pi^1(b^1|s') \cdots \pi^N(b^N|s')\, Q^i(s', b^1, \dots, b^N)$$

and

$$\mathbb{E}_{b^{-i}\sim\pi^{-i}}[Q^i(s', b^i, b^{-i})] = \sum_{b^1\in A^1} \cdots \sum_{b^{i-1}\in A^{i-1}} \sum_{b^{i+1}\in A^{i+1}} \cdots \sum_{b^N\in A^N} \pi^1(b^1|s') \cdots \pi^{i-1}(b^{i-1}|s')\, \pi^{i+1}(b^{i+1}|s') \cdots \pi^N(b^N|s')\, Q^i(s', b^1, \dots, b^N).$$

Please note that this computation can be turned into tensor operations as described in Appendix C.
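For instance, in the two-player case these expectations reduce to simple tensor contractions. The following sketch is our own illustration (not the code released with the paper), assuming a Q-tensor of shape [S, A1, A2] and tabular strategies:

```python
import numpy as np

def expected_q(Q_i, pi1, pi2, s_next):
    """E_{b~pi}[Q^i(s', b)] for a given next state s' (2-player case)."""
    # Contract Q^i(s', b1, b2) against pi^1(b1|s') and pi^2(b2|s').
    return np.einsum('ab,a,b->', Q_i[s_next], pi1[s_next], pi2[s_next])

def best_response_q(Q_i, pi2, s_next):
    """max_{b1} E_{b2~pi^2}[Q^i(s', b1, b2)]: best response of player 1."""
    # Average over player 2's strategy only, then maximize over player 1's action.
    return np.max(np.einsum('ab,b->a', Q_i[s_next], pi2[s_next]))
```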

Stochastic Dynamics: In the case of stochastic dynamics, the previous estimator is known to be biased [Maillard et al., 2010, Piot et al., 2014b]. If the dynamics are known, one can use the following unbiased estimator: $\mathcal{B}^i_\pi Q^i(s_j,a_j)$ is estimated with $r^i_j + \gamma\sum_{s'\in S} p(s'|s_j,a_j)\, \mathbb{E}_{b\sim\pi}[Q^i(s',b)]$ and $\mathcal{B}^{*i}_{\pi^{-i}} Q^i(s_j,a_j)$ with $r^i_j + \gamma\sum_{s'\in S} p(s'|s_j,a_j)\, \max_{b^i}\big[\mathbb{E}_{b^{-i}\sim\pi^{-i}}[Q^i(s', b^i, b^{-i})]\big]$.

If the dynamics of the game are not known (e.g. the batch scenario), the unbiased estimator cannot be used since the kernel of the MG is required, and other techniques must be applied. One idea would be to first learn an approximation of this kernel. For instance, one could extend MDP techniques to MGs, such as embedding a kernel in a Reproducing Kernel Hilbert Space (RKHS) [Grunewalder et al., 2012, Piot et al., 2014a, Piot et al., 2014b] or using kernel estimators [Taylor and Parr, 2012, Piot et al., 2014b]. In that case, $p(s'|s_j,a_j)$ would be replaced by an estimator of the dynamics $\hat{p}(s'|s_j,a_j)$ in the previous equation. If a generative model is available, the issue can be addressed with double sampling as discussed in [Maillard et al., 2010, Piot et al., 2014b].

6 Neural Network architecture

Minimizing the sum of Bellman residuals is a challenging problem as the objective function is not convex. It is all the more difficult as both the Q-value function and the strategy of every player must be learned independently. Nevertheless, neural networks have been able to provide good solutions to problems that require minimizing non-convex objectives, such as image classification or speech recognition [LeCun et al., 2015]. Furthermore, neural networks were successfully applied to reinforcement learning to approximate the Q-function [Mnih et al., 2015] in single-agent MGs with eclectic state representations.

Here, we introduce a novel neural architecture: the NashNetwork¹. For every player, a two-fold network is defined: a Q-network that learns a Q-value function and a π-network that learns the stochastic strategy of the player. The Q-network is a multilayer perceptron which takes the state representation as input. It outputs the predicted Q-values of the individual actions, such as the network used by [Mnih et al., 2015]. Identically, the π-network is also a multilayer perceptron which takes the state representation as input. It outputs a probability distribution over the action space by using a softmax. We then compute the two Bellman residuals for every player following Equation (2). Finally, we back-propagate the error by using classic gradient descent operations.

In our experiments, we focus on deterministic turn-based games. This entails a specific neural architecture that is fully described in Figure 6. During the training phase, all the Q-networks and the π-networks of the players are used to minimize the Bellman residual. Once the training is over, only the π-networks are kept to retrieve the strategy of each player. Note that this architecture differs from classic actor-critic networks [Lillicrap et al., 2016] for several reasons. Although the Q-network is an intermediate support for the computation of the π-network, neither policy gradients nor advantage functions are used in our model. Besides, the Q-network is simply discarded at the end of the training. The NashNetwork directly searches in the strategy space by minimizing the Bellman residual.
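To make the training step concrete, here is a rough sketch of the per-batch loss for the deterministic turn-based case. It is our own reimplementation in PyTorch (the released code uses TensorFlow), with assumed shapes: states are encoded as vectors, `next_player[j]` indicates who controls $s'_j$, and ρ(i) is taken uniform with a batch mean instead of a sum.

```python
import torch
import torch.nn as nn

class PlayerNets(nn.Module):
    """One Q-network and one pi-network per player (hypothetical layer sizes)."""
    def __init__(self, n_states, n_actions, hidden=80):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_actions))
        self.pi = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))

def nash_residual_loss(players, batch, gamma=0.9, p=2):
    """Empirical sum of the two Bellman-like residuals (Eq. (2)),
    sketched for a deterministic turn-based MG."""
    s, a, r, s_next, next_player = batch          # tensors: states, joint action, rewards
    loss = 0.0
    for i, nets_i in enumerate(players):
        q_sa = nets_i.q(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q^i(s_j, a_j)
        q_next = nets_i.q(s_next)                                    # Q^i(s'_j, .)
        # Strategy of the player controlling s'_j (turn-based game).
        pi_next = torch.stack([players[k].pi(s_next) for k in range(len(players))])
        pi_ctrl = pi_next[next_player, torch.arange(len(next_player))]
        exp_q = (pi_ctrl * q_next).sum(dim=1)                        # E_{b~pi}[Q^i(s', b)]
        best_q = q_next.max(dim=1).values                            # max_b Q^i(s', b)
        # Best-response target: max if player i controls s', expectation otherwise.
        target_star = torch.where(next_player == i, best_q, exp_q)
        res_joint = r[:, i] + gamma * exp_q - q_sa
        res_star = r[:, i] + gamma * target_star - q_sa
        loss = loss + (res_joint.abs() ** p).mean() + (res_star.abs() ** p).mean()
    return loss
```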

7 Experiments

In this section, we report an empirical evaluation of our method on randomly generated MGs. This class of problems was first described by [Archibald et al., 1995] for MDPs and has been studied for the minimization of the optimal Bellman residual [Piot et al., 2014b] and in zero-sum two-player MGs [Perolat et al., 2016]. First, we extend the class of randomly generated MDPs to general-sum MGs, then we describe the training setting, and finally we analyze the results. Without loss of generality, we focus on deterministic turn-based MGs for practical reasons. Indeed, in simultaneous games the complexity of the state-action space grows exponentially with the number of players, whereas in turn-based MGs it only grows linearly. Besides, as in the case of simultaneous actions, a Nash equilibrium in a turn-based MG (even deterministic) might be a stochastic strategy [Zinkevich et al., 2006]. In a turn-based MG, only one player can choose an action in each state. Finally, we run our experiments on deterministic MGs to avoid bias in the estimator, as discussed in Sec. 5.

¹ https://github.com/fstrub95/nash_network

To sum up, we use turn-based games to avoid the exponential growth of the number of actions and deterministic games to avoid bias in the estimator. The techniques described in Sec. 5 could be implemented with a slight modification of the architecture described in Sec. 6.

Dataset: We chose to use artificially generated MGs (named Garnets) to evaluate our method as they are not problem dependent. They have the great advantage of allowing to easily change the state space (and its underlying structure), the action space and the number of players. Furthermore, Garnets have the property of presenting all the characteristics of a complete MG. Standard games such as "rock-paper-scissors" (no states and zero-sum), or other games such as chess or checkers (deterministic and zero-sum), always rely on a limited number of MG properties. Using classic games would thus have hidden the generality of our approach even though it would have been more appealing. Finally, the underlying model of the dynamics of the Garnet is fully known. Thus, it is possible to investigate whether an ε-Nash equilibrium has been reached during the training and to quantify the ε. It would have been impossible to do so with more complex games.

[Figure 1 — panels: "1 Player - Ns:100 Na:5" and "5 Players - Ns:100 Na:5"; y-axes: % Error vs Best Response and Bellman residual error (log10); x-axis: epoch; curves: Network policy vs Random policy, Train vs Test dataset.]

Figure 1: (top) Evolution of the Error vs Best Response during training for a given Garnet. In both cases, players manage to iteratively improve their strategies. A strong drop may occur when one of the players finds an important move. (middle) Empirical Bellman residual on the training and testing datasets. It highlights the link between minimizing the Bellman residual and the quality of the learned strategy. (bottom) Error vs Best Response averaged over all players, Garnets and batches. It highlights the robustness of minimizing the Bellman residual over several games. Experiments are run on 5 Garnets, each re-sampled 5 times to average the metrics.

An N-player Garnet is described by a tuple $(N_S, N_A)$. The constant $N_S$ is the number of states and $N_A$ is the number of actions of the MG. For each state and action (s, a), we randomly pick the next state s' in the neighborhood of the index of s, where s' follows a rounded normal distribution $\mathcal{N}(s, \sigma_{s'})$. We then build the transition matrix such that $p(s'|s,a) = 1$. Next, we enforce an arbitrary reward structure on the Garnet to enable the emergence of a Nash equilibrium. Each player i has a random critical state $s_i$ and the reward of this player linearly decreases with the distance (encoded by indices) to his critical state. Therefore, the goal of the player is to get as close as possible to his critical state. Some players may have close critical states and may act together, while other players have to follow opposite directions. A Gaussian noise of standard deviation $\sigma_{noise}$ is added to the reward, and we then sparsify it with a ratio $m_{noise}$ to harden the task. Finally, a player is selected uniformly over {1, ..., N} for every state, once and for all. The resulting vector $v_c$ encodes which player plays in each state, as it is a turn-based game. Concretely, a Garnet works as follows: given a state s, a player is first selected according to the vector $v_c(s)$. This player then chooses an action according to his strategy $\pi^i$. Once the action is picked, every player i receives his individual reward $r^i(s,a)$. Finally, the game moves to a new state according to $p(s'|s,a)$, with a new state's player $v_c(s')$. Again, the goal of each player is to move as close as possible to his critical state. Thus, players benefit from choosing the action that maximizes their cumulative reward and leads to a state controlled by a non-adversarial player.

Finally, samples $(s, (a^1,\dots,a^N), (r^1,\dots,r^N), s')$ are generated by selecting a state and an action uniformly at random. The reward, the next state, and the next player are selected according to the model described above.
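A rough sketch of such a Garnet generator (our own reading of the description above; the exact reward shaping and sampling details of the released code may differ) could look like:

```python
import numpy as np

def generate_garnet(n_states=100, n_actions=5, n_players=2,
                    sigma_next=1.0, sigma_noise=0.05, m_noise=0.5, seed=0):
    """Sketch of a deterministic turn-based Garnet.

    Rewards here simply decrease linearly with the circular distance to each
    player's critical state (an assumption on the exact shaping)."""
    rng = np.random.RandomState(seed)
    # Deterministic transitions: p(s'|s,a) = 1 for one neighbour of s.
    offsets = np.round(rng.normal(0.0, sigma_next, (n_states, n_actions))).astype(int)
    next_state = (np.arange(n_states)[:, None] + offsets) % n_states       # [S, A]
    # Turn-based structure: one controlling player per state (vector v_c).
    controller = rng.randint(n_players, size=n_states)
    # Critical states and circular-distance-based rewards (shared by all actions).
    critical = rng.randint(n_states, size=n_players)
    idx = np.arange(n_states)
    dist = np.abs(idx[None, :] - critical[:, None])
    circ = np.minimum(dist, n_states - dist)
    reward = 1.0 - 2.0 * circ / n_states                                   # [N, S]
    # Additive Gaussian noise, then sparsification to harden the task.
    reward = reward + rng.normal(0.0, sigma_noise, reward.shape)
    reward *= (rng.rand(*reward.shape) > m_noise)
    return next_state, controller, reward
```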

Evaluation: Our goal is to verify whether the joint strategy of the players is an ε-Nash equilibrium. To do so, we first retrieve the strategy of every player by evaluating the $\pi^i$-networks over the state space. Then, given π, we can exactly evaluate the value of the joint strategy $v^i_\pi$ for each player i and the value of the best response to the strategy of the others, $v^{*i}_{\pi^{-i}}$. The value $v^i_\pi$ is computed by inverting the linear system $v^i_\pi = (I - \gamma P_\pi)^{-1} r^i_\pi$. The value $v^{*i}_{\pi^{-i}}$ is computed with the policy iteration algorithm. Finally, we compute the Error vs Best Response for every player, defined as $\frac{\| v^i_\pi - v^{*i}_{\pi^{-i}} \|_2}{\| v^{*i}_{\pi^{-i}} \|_2}$. If this metric is close to zero for all players, the players have reached a weak ε-Nash equilibrium with ε close to zero. Actually, the Error vs Best Response is a normalized quantification of how suboptimal a player's strategy is compared to his best response. It indicates by which margin a player would have been able to increase his cumulative rewards by improving his strategy while the other players keep playing the same strategies. If this metric is close to zero for all players, then they have little incentive to switch from their current strategy: it is an ε-Nash equilibrium. In addition, we keep track of the empirical norm of the Bellman residual on both the training and test datasets, as it is our training objective.

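For reference, the Error vs Best Response can be computed exactly in this tabular setting; the sketch below (our code, hypothetical variable names) evaluates $v^i_\pi$ by solving the linear system and the best response by policy iteration on the induced MDP:

```python
import numpy as np

def error_vs_best_response(P_mdp, r_mdp, P_pi, r_pi, gamma=0.9):
    """Error vs Best Response for player i.

    P_mdp : [S, A, S] kernel of the MDP induced when all players but i follow pi^{-i}
    r_mdp : [S, A]    corresponding averaged reward r^i_{pi^{-i}}
    P_pi  : [S, S]    kernel induced by the full joint strategy pi
    r_pi  : [S]       averaged reward r^i_pi
    """
    S, A = r_mdp.shape
    I = np.eye(S)
    # Value of the learned joint strategy: v^i_pi = (I - gamma P_pi)^{-1} r^i_pi.
    v_pi = np.linalg.solve(I - gamma * P_pi, r_pi)
    # Best response via policy iteration on the induced MDP.
    policy = np.zeros(S, dtype=int)
    for _ in range(1000):
        P_a = P_mdp[np.arange(S), policy]               # [S, S]
        r_a = r_mdp[np.arange(S), policy]               # [S]
        v_star = np.linalg.solve(I - gamma * P_a, r_a)  # evaluate current policy
        greedy = (r_mdp + gamma * P_mdp @ v_star).argmax(axis=1)
        if np.array_equal(greedy, policy):
            break                                       # policy iteration converged
        policy = greedy
    return np.linalg.norm(v_pi - v_star) / np.linalg.norm(v_star)
```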

Training parameters: We use N-player Garnets with 1, 2 or 5 players. The state space and the action space are respectively of size 100 and 5. The state is encoded by a binary vector. The transition kernel is built with a standard deviation $\sigma_{s'}$ of 1. The reward function is $r^i(s) = \frac{2\min(|s-s_i|,\, N_S-|s-s_i|) - |s-s_i|}{N_S}$ (it is a circular reward function). The reward sparsity $m_{noise}$ is set to 0.5 and the reward white noise $\sigma_{noise}$ has a standard deviation of 0.05. The discount factor γ is set to 0.9. The Q-networks and π-networks have one hidden layer of size 80 with ReLUs [Goodfellow et al., 2016]. Q-networks have no output transfer function while π-networks have a softmax. The gradient descent is performed with AdamGrad [Goodfellow et al., 2016] with an initial learning rate of 1e-3 for the Q-networks and 5e-5 for the π-networks. We use a weight decay of 1e-6. The training set is composed of $5 N_S N_A$ samples split into random minibatches of size 20, while the test set contains $N_S N_A$ samples. The neural network is implemented using the Python framework TensorFlow [Abadi et al., 2015]. The source code is available on GitHub¹ to run the experiments.

Results: Results for 1 player (MDP) and 5 players are reported in Figure 1. Additional settings, such as the two-player case and the tabular case, are reported in Appendix C. In those scenarios, the quality of the learned strategies converges in parallel with the empirical Bellman residual. Once the training is over, the players can increase their cumulative reward by no more than 8% on average over the state space. Therefore, neural networks succeed in learning a weak ε-Nash equilibrium. Note that it is impossible to reach a zero error as (i) we are in the batch setting, (ii) we use function approximation and (iii) we only control a weak ε-Nash equilibrium. Moreover, the quality of the strategies is well-balanced among the players, as the standard deviation is below 5 points.

Our neural architecture has good scaling properties. First, scaling from 2 players to 5 results in the same strategy quality. Furthermore, it can be adapted to a wide variety of problems by only changing the bottom of each network to fit the state representation of the problem.

Discussions: The neural architecture faces some over-fitting issues. It requires a high number of samples to converge, as described in Figure 2. [Lillicrap et al., 2016] introduces several tricks that may improve the training. Furthermore, we ran additional experiments on non-deterministic Garnets but the results remain less conclusive. Indeed, the estimators of the Bellman residuals are biased for stochastic dynamics. As discussed, embedding the kernel or using kernel estimators may help to properly estimate the cost function (Equation (2)). Finally, our algorithm only seeks a single Nash equilibrium while several equilibria might exist. Finding a specific equilibrium among others is out of the scope of this paper.

[Figure 2 — panel: "2 Players - Ns:100 Na:5"; x-axis: sampling coefficient α; y-axis: % Error vs Best Response; curves: Average Network vs Average Random.]

Figure 2: Impact of the number of samples on the quality of the learned strategy. The number of samples per batch is computed by $N_{samples} = \alpha N_A N_S$. Experiments are run on 3 different Garnets, each re-sampled 3 times to average the metrics.

8 Conclusion

In this paper, we present a novel approach to learn a Nash equilibrium in MGs. The contributions of this paper are both theoretical and empirical. First, we define a new (weaker) concept of an ε-Nash equilibrium. We prove that minimizing the sum of different Bellman residuals is sufficient to learn a weak ε-Nash equilibrium. From this result, we provide empirical estimators of these Bellman residuals from batch data. Finally, we describe a novel neural network architecture, called NashNetwork, to learn a Nash equilibrium from batch data. This architecture does not rely on a specific MG. It also scales to a high number of players. Thus it can be applied to a trove of applications.

As future work, the NashNetwork could be extended to more eclectic games such as simultaneous games (e.g. Alesia [Perolat et al., 2015] or a stick-together game such as in [Prasad et al., 2015]) or Atari games with several players such as Pong or Bomber. Additional optimization methods can be studied for this specific class of neural networks to increase the quality of learning. Moreover, the Q-function and the strategy could be parametrised with other classes of function approximation such as trees. Yet, this requires studying functional gradient descent on the loss function. Finally, future work could explore the use of the projected Bellman residual, as studied in MDPs and two-player zero-sum MGs [Perolat et al., 2016].


Acknowledgments

This work has received partial funding from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020, the European Commission H2020 framework program under the research Grant number 687831 (BabyRobot) and the Project CHIST-ERA IGLU.

References

[Abadi et al., 2015] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.

[Akchurina, 2010] Akchurina, N. (2010). Multi-Agent Reinforcement Learning Algorithms. PhD thesis, Paderborn, Univ., Diss., 2010.

[Archibald et al., 1995] Archibald, T., McKinnon, K., and Thomas, L. (1995). On the Generation of Markov Decision Processes. Journal of the Operational Research Society, 46:354-361.

[Baird et al., 1995] Baird, L. et al. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Proc. of ICML.

[Filar and Vrieze, 2012] Filar, J. and Vrieze, K. (2012). Competitive Markov Decision Processes. Springer Science & Business Media.

[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. Book in preparation for MIT Press.

[Grunewalder et al., 2012] Grunewalder, S., Lever, G., Baldassarre, L., Pontil, M., and Gretton, A. (2012). Modelling Transition Dynamics in MDPs With RKHS Embeddings. In Proc. of ICML.

[Hu and Wellman, 2003] Hu, J. and Wellman, M. P. (2003). Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research, 4:1039-1069.

[Lagoudakis and Parr, 2002] Lagoudakis, M. G. and Parr, R. (2002). Value Function Approximation in Zero-Sum Markov Games. In Proc. of UAI.

[Lagoudakis and Parr, 2003] Lagoudakis, M. G. and Parr, R. (2003). Reinforcement Learning as Classification: Leveraging Modern Classifiers. In Proc. of ICML.

[LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep Learning. Nature, 521:436-444.

[Lillicrap et al., 2016] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. In Proc. of ICLR.

[Littman, 1994] Littman, M. L. (1994). Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proc. of ICML.

[Littman, 2001] Littman, M. L. (2001). Friend-or-Foe Q-Learning in General-Sum Games. In Proc. of ICML.

[Maillard et al., 2010] Maillard, O.-A., Munos, R., Lazaric, A., and Ghavamzadeh, M. (2010). Finite-Sample Analysis of Bellman Residual Minimization. In Proc. of ACML.

[Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-Level Control Through Deep Reinforcement Learning. Nature, 518:529-533.

[Munos and Szepesvari, 2008] Munos, R. and Szepesvari, C. (2008). Finite-Time Bounds for Fitted Value Iteration. The Journal of Machine Learning Research, 9:815-857.

[Nisan et al., 2007] Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V. (2007). Algorithmic Game Theory, volume 1. Cambridge University Press, Cambridge.

[Perolat et al., 2016] Perolat, J., Piot, B., Geist, M., Scherrer, B., and Pietquin, O. (2016). Softened Approximate Policy Iteration for Markov Games. In Proc. of ICML.

[Perolat et al., 2016] Perolat, J., Piot, B., Scherrer, B., and Pietquin, O. (2016). On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games. In Proc. of AISTATS.

[Perolat et al., 2015] Perolat, J., Scherrer, B., Piot, B., and Pietquin, O. (2015). Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games. In Proc. of ICML.

[Piot et al., 2014a] Piot, B., Geist, M., and Pietquin, O. (2014a). Boosted Bellman Residual Minimization Handling Expert Demonstrations. In Proc. of ECML.

[Piot et al., 2014b] Piot, B., Geist, M., and Pietquin, O. (2014b). Difference of Convex Functions Programming for Reinforcement Learning. In Proc. of NIPS.

[Prasad et al., 2015] Prasad, H., LA, P., and Bhatnagar, S. (2015). Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games. In Proc. of AAMAS.

[Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

[Riedmiller, 2005] Riedmiller, M. (2005). Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Proc. of ECML.

[Scherrer et al., 2012] Scherrer, B., Ghavamzadeh, M., Gabillon, V., and Geist, M. (2012). Approximate Modified Policy Iteration. In Proc. of ICML.

[Shapley, 1953] Shapley, L. S. (1953). Stochastic Games. In Proc. of the National Academy of Sciences of the United States of America.

[Taylor and Parr, 2012] Taylor, G. and Parr, R. (2012). Value Function Approximation in Noisy Environments Using Locally Smoothed Regularized Approximate Linear Programs. In Proc. of UAI.

[Zinkevich et al., 2006] Zinkevich, M., Greenwald, A., and Littman, M. (2006). Cyclic Equilibria in Markov Games. In Proc. of NIPS.


A Proof of the Equivalence of Definitions 1 and 2

Proof of the equivalence between Definition 1 and Definition 2.

(2) ⇒ (1): If $\exists v$ such that $\forall i \in \{1,\dots,N\}$, $\mathcal{T}^i_\pi v^i = v^i$ and $\mathcal{T}^{*i}_{\pi^{-i}} v^i = v^i$, then $\forall i \in \{1,\dots,N\}$, $v^i = v^i_{\pi^i,\pi^{-i}}$ and $v^i = \max_{\pi^i} v^i_{\pi^i,\pi^{-i}}$.

(1) ⇒ (2): If $\forall i \in \{1,\dots,N\}$, $v^i_{\pi^i,\pi^{-i}} = \max_{\pi^i} v^i_{\pi^i,\pi^{-i}}$, then $\forall i \in \{1,\dots,N\}$, the value $v^i = v^i_{\pi^i,\pi^{-i}}$ is such that $\mathcal{T}^i_\pi v^i = v^i$ and $\mathcal{T}^{*i}_{\pi^{-i}} v^i = v^i$.

B Proof of Theorem 1

First we will prove the following lemma. The proof is strongly inspired by previous work on the minimization of the Bellman residual for MDPs [Piot et al., 2014b].

Lemma 1. Let p and p' be real numbers such that $\frac{1}{p} + \frac{1}{p'} = 1$. Then $\forall v, \pi$ and $\forall i \in \{1,\dots,N\}$:

$$\big\| v^i_{\pi^i_*,\pi^{-i}} - v^i_{\pi^i,\pi^{-i}} \big\|_{\mu,p} \leq \frac{1}{1-\gamma}\Big( C_\infty(\mu,\nu,\pi^i_*,\pi^{-i})^{\frac{p'}{p}} + C_\infty(\mu,\nu,\pi^i,\pi^{-i})^{\frac{p'}{p}} \Big)^{\frac{1}{p'}} \Big[ \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|^p_{\nu,p} + \big\|\mathcal{T}^i_\pi v^i - v^i\big\|^p_{\nu,p} \Big]^{\frac{1}{p}},$$

where $\pi^i_*$ is the best response to $\pi^{-i}$, meaning that $v^i_{\pi^i_*,\pi^{-i}}$ is the fixed point of $\mathcal{T}^{*i}_{\pi^{-i}}$, and with the following concentrability coefficient $C_\infty(\mu,\nu,\pi^i,\pi^{-i}) = \Big\| \frac{\partial \mu^T (1-\gamma)(I-\gamma P_{\pi^i,\pi^{-i}})^{-1}}{\partial \nu^T} \Big\|_{\nu,\infty}$.

Proof. The proof uses similar techniques as in [Piot et al., 2014b]. First we have:

$$v^i_{\pi^i,\pi^{-i}} - v^i = (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}\big(r^i_{\pi^i,\pi^{-i}} - (I - \gamma P_{\pi^i,\pi^{-i}})\, v^i\big) = (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}\big(\mathcal{T}^i_{\pi^i,\pi^{-i}} v^i - v^i\big).$$

But we also have:

$$v^i_{\pi^i_*,\pi^{-i}} - v^i = (I - \gamma P_{\pi^i_*,\pi^{-i}})^{-1}\big(\mathcal{T}^i_{\pi^i_*,\pi^{-i}} v^i - v^i\big),$$

then:

$$v^i_{\pi^i_*,\pi^{-i}} - v^i_{\pi^i,\pi^{-i}} = v^i_{\pi^i_*,\pi^{-i}} - v^i + v^i - v^i_{\pi^i,\pi^{-i}}$$
$$= (I - \gamma P_{\pi^i_*,\pi^{-i}})^{-1}\big(\mathcal{T}^i_{\pi^i_*,\pi^{-i}} v^i - v^i\big) - (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}\big(\mathcal{T}^i_{\pi^i,\pi^{-i}} v^i - v^i\big)$$
$$\leq (I - \gamma P_{\pi^i_*,\pi^{-i}})^{-1}\big(\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big) - (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}\big(\mathcal{T}^i_{\pi^i,\pi^{-i}} v^i - v^i\big)$$
$$\leq (I - \gamma P_{\pi^i_*,\pi^{-i}})^{-1}\big|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big| + (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}\big|\mathcal{T}^i_{\pi^i,\pi^{-i}} v^i - v^i\big|.$$

Finally, using the same technique as the one in [Piot et al., 2014b], we get:

$$\big\| v^i_{\pi^i_*,\pi^{-i}} - v^i_{\pi^i,\pi^{-i}} \big\|_{\mu,p} \leq \Big\| (I - \gamma P_{\pi^i_*,\pi^{-i}})^{-1}\big|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big| \Big\|_{\mu,p} + \Big\| (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}\big|\mathcal{T}^i_{\pi^i,\pi^{-i}} v^i - v^i\big| \Big\|_{\mu,p}$$
$$\leq \frac{1}{1-\gamma}\Big[ C_\infty(\mu,\nu,\pi^i_*,\pi^{-i})^{\frac{1}{p}} \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|_{\nu,p} + C_\infty(\mu,\nu,\pi^i,\pi^{-i})^{\frac{1}{p}} \big\|\mathcal{T}^i_{\pi^i,\pi^{-i}} v^i - v^i\big\|_{\nu,p} \Big]$$
$$\leq \frac{1}{1-\gamma}\Big( C_\infty(\mu,\nu,\pi^i_*,\pi^{-i})^{\frac{p'}{p}} + C_\infty(\mu,\nu,\pi^i,\pi^{-i})^{\frac{p'}{p}} \Big)^{\frac{1}{p'}} \Big[ \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|^p_{\nu,p} + \big\|\mathcal{T}^i_\pi v^i - v^i\big\|^p_{\nu,p} \Big]^{\frac{1}{p}}.$$

The proof of Theorem 1 then proceeds in two steps:

$$\Big\| \big\| \max_{\pi^i} v^i_{\pi^i,\pi^{-i}} - v^i_\pi \big\|_{\mu(s),p} \Big\|_{\rho(i),p} \leq \frac{1}{1-\gamma}\Big[ \max_{i\in\{1,\dots,N\}}\Big( C_\infty(\mu,\nu,\pi^i_*,\pi^{-i})^{\frac{p'}{p}} + C_\infty(\mu,\nu,\pi^i,\pi^{-i})^{\frac{p'}{p}} \Big)^{\frac{1}{p'}} \Big] \times \Big[ \sum_{i=1}^{N} \rho(i)\Big( \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|^p_{\nu,p} + \big\|\mathcal{T}^i_\pi v^i - v^i\big\|^p_{\nu,p} \Big) \Big]^{\frac{1}{p}}$$
$$\leq \frac{2^{\frac{1}{p'}}\, C_\infty(\mu,\nu)^{\frac{1}{p}}}{1-\gamma}\Big[ \sum_{i=1}^{N} \rho(i)\Big( \big\|\mathcal{T}^{*i}_{\pi^{-i}} v^i - v^i\big\|^p_{\nu,p} + \big\|\mathcal{T}^i_\pi v^i - v^i\big\|^p_{\nu,p} \Big) \Big]^{\frac{1}{p}},$$

with $C_\infty(\mu,\nu) = \sup_{\pi^i,\pi^{-i}} C_\infty(\mu,\nu,\pi^i,\pi^{-i})$.

The first inequality is proven using Lemma 1 and Hölder's inequality. The second inequality follows by noticing that $\forall \pi^i,\pi^{-i},\ C_\infty(\mu,\nu,\pi^i,\pi^{-i}) \leq \sup_{\pi^i,\pi^{-i}} C_\infty(\mu,\nu,\pi^i,\pi^{-i})$.

C Additional curves

This section provides additional curves regarding the training of the NashNetwork.

Figure 3: (top) Evolution of the Error vs Best Response during the training for a given Garnet. (middle) Empirical Bellman residual on the training and testing datasets. (bottom) Error vs Best Response averaged over all players, Garnets and batches. Experiments are run on 5 Garnets, each re-sampled 5 times to average the metrics.

[Figure 4 — panels: "1 Player - Ns:100 Na:5", "2 Players - Ns:100 Na:5", "5 Players - Ns:100 Na:5"; y-axes: % Error vs Best Response and Bellman residual error (log10); x-axis: epoch; curves: Network policy vs Random policy, Train vs Test dataset.]

Figure 4: Tabular basis. One may notice that the tabular case works well for a 1-player game (MDP). Yet, the more players there are, the worse it performs. Experiments are run on 5 Garnets, each re-sampled 5 times to average the metrics.

[Figure 5 — x-axis: epoch; left panel y-axis: % Error vs Best Response (Network policy vs Random policy); right panel y-axis: distribution over actions (Action 0 to Action 4).]

Figure 5: (left) Evolution of the Error vs Best Response during the training for a given Garnet. When the Garnet has a complex structure or the batch is badly distributed, one player sometimes fails to learn a good strategy. (right) Distribution of actions in the strategy among the players with the highest probability. This plot highlights that the π-networks do modify the strategy during the training.

[Figure 6 — architecture diagram; blocks labelled (A.0), (A.1), (B), (C.0), (C.1), (C.2), (D), (E), (F), (G), (H); legend: shared weights, dot product, sum and product terms by terms, multiplexer; inputs: state, action, reward, next state, next player.]

Figure 6: NashNetwork. This scheme summarizes the neural architecture that learns a Nash equilibrium by minimizing the Bellman residual for turn-based games. Each player has two networks: a Q-network that learns a state-action value function and a π-network that learns the strategy of the player. The final loss is an empirical estimate of the sum of Bellman residuals given a batch of data of the shape $(s, (a^1,\dots,a^N), (r^1,\dots,r^N), s')$. Equation (2) can be divided into the key operations described below. For each player i: first of all, in (A.0) we compute $Q^i(s, a^1,\dots,a^N)$ by computing the tensor $Q^i(s,a)$ and then by selecting the value of $Q^i(s,a)$ (A.1) given the action of the batch. Step (B) computes the tensor $Q^i(s',b)$ and step (C.0) computes the strategy $\pi^i(.|s')$. In all Bellman residuals we need to average $Q^i(s',b)$ over the strategy of the players. Since we focus on turn-based MGs, we only average over the strategy of the player controlling the next state s' (in the following, this player is called the next player). In (C.1) we select the strategy $\pi(.|s')$ of the next player given the batch. In (C.2) we duplicate the strategy of the next player for all other players. In (D) we compute the dot product between $Q^i(s',b)$ and the strategy of the next player to obtain $\mathbb{E}_{b\sim\pi}[Q^i(s',b)]$, and in (E) we pick the highest expected reward to obtain $\max_b Q^i(s',b)$. Step (F) aims at computing $\max_{b^i}\big[\mathbb{E}_{b^{-i}\sim\pi^{-i}}[Q^i(s',b^i,b^{-i})]\big]$ and, since we deal with a turn-based MG, we need to select between the output of (D) or (E) according to the next player: for all i, we select the one coming from (E) if the next player is i, or the one from (D) otherwise. In (G) we compute the error between the reward $r^i$ and $Q^i(s,a) - \gamma\max_{b^i}\big[\mathbb{E}_{b^{-i}\sim\pi^{-i}}[Q^i(s',b^i,b^{-i})]\big]$, and in (H) between $r^i$ and $\gamma\,\mathbb{E}_{b\sim\pi}[Q^i(s',b)] - Q^i(s,a)$. The final loss is the sum of all the residuals.