TRANSCRIPT
Pre-Bayesian Games
Moshe Tennenholtz
Technion—Israel Institute of Technology
Acknowledgements
• Based on joint work with Itai Ashlagi, Ronen Brafman and Dov Monderer.
GT with CS flavor
• Program equilibrium / strong mediated equilibrium
• Ranking systems
• Non-cooperative computing
• Pre-Bayesian games
• Distributed Games
• Recommender systems for GT
• …
Modeling Uncertainty
• In game theory and economics the Bayesian approach is mainly used.
• Work in computer science frequently uses non-probabilistic models.
• Work on Pre-Bayesian games incorporates game-theoretic reasoning into non-Bayesian decision-making settings.
Pre-Bayesian Games
• Modeling and solution concepts in Pre-Bayesian games.
• Applications: congestion games with incomplete information.
• Pre-Bayesian repeated/stochastic games as a framework for multi-agent learning.
Games with Incomplete Information
[Slide figure: a two-player game with incomplete information, shown as four 2x2 payoff matrices, one per profile of the players' types (row types t1, t2; column types s1, s2; the four matrices labeled p11, p12, p21, p22): (4,4 0,5 / 5,0 1,1), (1,1 5,0 / 0,5 4,4), (0,5 4,4 / 1,1 5,0), (5,0 1,1 / 4,4 0,5).]
Model
Model (cont.)
Flexibility of the Model
[Slide figure: a 2x2 payoff matrix with row types t1, t2 and column types s1, s2 (1,1 5,0 / 0,5 4,4), shown next to a larger, non-square payoff matrix, illustrating that the action sets may differ across players and types.]
Solution Concepts in Pre-Bayesian Games
Dominant Strategies
[Slide figure: the same four 2x2 payoff matrices as above (row types t1, t2; column types s1, s2), used to illustrate dominant strategies.]
Ex-Post Equilibrium
A profile of type-contingent strategies that forms a Nash equilibrium for every possible realization of the types.
[Slide figure: four 2x2 payoff matrices (row types t1, t2; column types s1, s2): (5,0 2,4 / 4,2 3,3), (2,8 5,1 / 1,5 6,4), (0,2 5,2 / 1,1 6,0), (0,5 3,6 / 7,2 1,4).]
Safety-Level Equilibrium
For every type, play a strategy that maximizes the worst-case payoff given the other players' strategies.
The worst case is taken over the set of possible states!
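As a concrete illustration (not part of the original slides), the following minimal Python sketch computes a safety-level best response for one type of a player: the worst case is taken over the possible states (here, the opponent's types), given the opponent's type-contingent mixed strategy. The payoff numbers, type names, and function names are all hypothetical.

```python
# Hypothetical 2x2 example: player 1 must pick an action maximizing its
# worst-case expected payoff, where the worst case ranges over the possible
# states (the opponent's unknown type s1 or s2).
payoff = {  # payoff[(state, my_action, opponent_action)] -> player 1's payoff
    ("s1", "a", "a"): 4, ("s1", "a", "b"): 0, ("s1", "b", "a"): 5, ("s1", "b", "b"): 1,
    ("s2", "a", "a"): 1, ("s2", "a", "b"): 5, ("s2", "b", "a"): 0, ("s2", "b", "b"): 4,
}
states = ["s1", "s2"]                      # possible types of the opponent
opponent_strategy = {                      # opponent's mixed action for each type
    "s1": {"a": 0.5, "b": 0.5},
    "s2": {"a": 0.2, "b": 0.8},
}
my_actions = ["a", "b"]

def expected_payoff(my_action, state):
    """Expected payoff of player 1 in a given state against the opponent's mix."""
    mix = opponent_strategy[state]
    return sum(p * payoff[(state, my_action, opp)] for opp, p in mix.items())

def safety_level_response():
    """Pick the action whose worst-case (over states) expected payoff is largest."""
    return max(my_actions, key=lambda a: min(expected_payoff(a, s) for s in states))

print(safety_level_response())
```

Iterating such best responses over all players and types is one brute-force way to look for a (pure) safety-level equilibrium in small games; the general existence result below concerns mixed strategies.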
Safety-Level Equilibrium
[Slide figure: four 2x2 payoff matrices (row types t1, t2; column types s1, s2): (5,0 5,4 / 4,2 3,3), (2,8 5,9 / 1,5 6,4), (0,2 5,2 / 1,1 6,0), (0,5 3,6 / 7,2 4,4), annotated with mixed-strategy probabilities p, 1-p and q, 1-q for the row player's types and w, 1-w and z, 1-z for the column player's types.]
Safety-Level Equilibrium (cont.)
Other Non-Bayesian Solution Concepts
• Minimax-Regret equilibrium (Hyafil and Boutilier 2004)
• Competitive-Ratio equilibrium
Existence in Mixed Strategies
Theorem: Safety-level, minimax-regret, and competitive-ratio equilibria exist in every concave pre-Bayesian game.
A concave pre-Bayesian game:
• For every player and every type, the set of possible actions is compact and convex.
• ui(·,·) is a concave function for every player i.
The proof follows by applying Kakutani's fixed-point theorem.
Related Work on Non-Bayesian Solutions
Safety-level equilibria
• Aghassi and Bertsimas (2004)
• Levin and Ozdenoren (2004)
Pure safety-level equilibria
• Shoham and Tennenholtz (1992), Moses and Tennenholtz (1992), Tennenholtz (1991)
Axiomatic foundations
• Brafman and Tennenholtz (1996)
The main goal – analysis!!
Beyond Existence
Modeling Congestion Settings
Examples:
• Transportation engineering (Wardrop 1952, Beckmann et al. 1956)
• Congestion games (Rosenthal 1973)
• Potential games (Monderer and Shapley 1996)
• Price of anarchy (Papadimitriou 1999, Tardos and Roughgarden 2001)
• Resource selection games with player-specific cost functions (Milchtaich 1996)
• Local effect games (Leyton-Brown and Tennenholtz 2003)
• …
Where are we heading to?
Our goal: incorporate incomplete information into congestion settings.
Types of uncertainty:
• number of players
• job sizes
• network structure
• players' cost functions
• …
Resource Selection Games with Unknown Number of Players
Resource Selection Games
Symmetric Equilibrium
Theorem:
Every resource selection game with increasing resource cost functions has a unique symmetric equilibrium.
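For intuition only, here is a rough numerical sketch (not from the talk) of how such a symmetric equilibrium can be approximated in a tiny example: an equalization iteration shifts probability toward the cheaper resource until expected costs roughly balance. The cost functions, the number of players, and the update rule are illustrative assumptions, and convergence is not claimed in general.

```python
import math

# Hypothetical resource selection game: n players each pick one of two resources;
# the cost of a resource is increasing in the number of players using it.
n = 5
costs = [lambda k: 1 + 2 * k,   # resource 0
         lambda k: 2 + 1 * k]   # resource 1

def binom_pmf(n_, k_, p):
    return math.comb(n_, k_) * p**k_ * (1 - p)**(n_ - k_)

def expected_cost(j, prob):
    """Expected cost of choosing resource j when each of the other n-1 players
    independently picks resource j with probability prob[j]."""
    return sum(binom_pmf(n - 1, k, prob[j]) * costs[j](k + 1) for k in range(n))

def symmetric_equilibrium(iters=5000, step=0.005):
    """Heuristic equalization dynamics: move probability toward the cheaper resource."""
    p = [0.5, 0.5]
    for _ in range(iters):
        c0, c1 = expected_cost(0, p), expected_cost(1, p)
        p[0] = min(1.0, max(0.0, p[0] + step * (c1 - c0)))
        p[1] = 1.0 - p[0]
    return p

print(symmetric_equilibrium())   # probabilities under which expected costs are (nearly) equal
```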
Resource Selection Games with Unknown Number of Players
Uniqueness of Symmetric Safety Level Equilibrium
Notation: a game with complete information and the corresponding game with incomplete information.
Theorem: Let a resource selection system with increasing resource cost functions be given.
• The game with incomplete information has a unique symmetric safety-level equilibrium.
• The symmetric safety-level equilibrium profile is the unique symmetric equilibrium of the corresponding game with complete information.
Is Ignorance Bad?
K – the real state, |K| = k, k < n.
• Known number of players – cost of every player
• Unknown number of players – cost of every player
Which cost is lower?
Main Theorem: Let a linear resource selection system with increasing resource cost functions be given. There exists an integer such that for every larger k:
1. Every player's cost when the number of players is unknown is no greater than the cost when it is known.
2. All the inequalities above are strict if and only if there exists a resource j such that wj(k) = wj(1) + (k-1)dj.
Is Ignorance Bad?
Where is this Useful?
Example: Mechanism design.
The organizer knows the exact number of active players.
Since it wishes to maximize social surplus, it will not reveal this information.
More Detailed Analysis
Theorem: Let a linear resource selection system with increasing resource cost functions be given. There exists an integer L such that for every k > L, the minimal social cost attainable with symmetric mixed-action profiles is attained at n = 2k-1; consequently, the social cost is minimized at n = 2k-1.
Further Research
• Routing games with unknown number of players
  - extension to general networks
  - a unique symmetric equilibrium exists in a model where an agent's job can be split
  - ignorance helps as long as n < k^2
• Routing games with unknown job sizes
  - extension to variable job sizes
  - uncertainty about job sizes does not change the surplus in several general settings
  - ignorance helps when there is uncertainty about both the number of participants and the job sizes
• Minimax-regret equilibria in the different congestion settings
• Non-Bayesian equilibria in social choice settings
• …
Conclusions so far
• Non-Bayesian Equilibria exist in pre-Bayesian Games.
• Players are better off with common lack of knowledge about the number of participants.
• More generally, we show illuminating results using non-Bayesian solution concepts in pre-Bayesian games.
Non-Bayesian solutions for repeated (and stochastic) games with incomplete information: efficient learning equilibrium.
Learning in multi-agent systems
• Multi-Agent Learning lies in the intersection of Machine Learning/Artificial Intelligence and Game Theory.
• Basic settings:
  - A repeated game where the game (payoff functions) is initially unknown, but may be learned based on observed history.
  - A stochastic game where both the stage games and the transition probabilities are initially unknown.
• What can be observed following an action is part of the problem specification.
• No Bayesian assumptions!
The objective of learning in multi-agent systems
• Descriptive objective: how do people behave/adapt their behavior in (e.g. repeated) games?
• Normative objective: can we provide the agents with advice on how they should behave, such that "rational" agents will follow it and it will also lead to a good social outcome?
Learning in games: an existing perspective
• Most work on learning in games (in machine learning/AI, extending upon work in game theory) deals with the search for learning algorithms that, if adopted by all agents, will lead to equilibrium.
(Another approach, regret minimization, will be discussed and compared to later.)
Re-Considering Learning in Games
• But why should the agents adopt these learning algorithms?
This seems to contradict the whole idea of self-motivated agents (which is what led to considering equilibrium concepts).
Re-Considering Learning in Games
• (New) Normative answer: The learning algorithms themselves should be in equilibrium!
• We call this form of equilibrium: Learning Equilibrium, and in particular we consider Efficient Learning Equilibrium (ELE).
• Remark: In this talk we refer to optimal ELE (extending upon the basic ELE we introduced) but use the term ELE.
Efficient Learning Equilibrium: "Informal Definition"
• The learning algorithms themselves are in equilibrium.
It is irrational for an agent to deviate from its algorithm assuming that the others stick to their algorithms, regardless of the nature of the (actual) game that is being played.
• If the agents follow the provided learning algorithms then they will obtain a value that is close to the value obtained in an optimal (or Pareto-optimal) Nash equilibrium (of the actual game) after polynomially many iterations.
• It is irrational to deviate from the learning algorithm. Moreover, the irrationality of deviation is manifested within a polynomial number of iterations.
Efficient Learning Equilibrium is a form of ex-post equilibrium in Pre-Bayesian repeated games
Basic Definitions
Game: G = <N = {1,…,n}, {S1,…,Sn}, {U1,…,Un}>
Ui : S1 × … × Sn → R – utility function of player i
Δ(Si) – mixed strategies of player i.
A tuple of (mixed) strategies t = (t1,…,tn) is a Nash equilibrium if for every i ∈ N, Ui(t) ≥ Ui(t1,…,ti-1,t',ti+1,…,tn) for every t' ∈ Si.
Optimal Nash equilibrium – a Nash equilibrium that maximizes the social surplus (the sum of the agents' payoffs).
val(t,i,g) – the minimal expected payoff that may be obtained by i when employing t in the game g.
A strategy t' ∈ Δ(Si) for which val(·,i,g) is maximized is a safety-level strategy (or probabilistic maximin strategy), and its value is the safety-level value.
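The probabilistic maximin (safety-level) strategy defined above can be computed by linear programming. The sketch below uses scipy's linprog on an arbitrary payoff matrix; the matrix and the helper name are illustrative, not taken from the talk.

```python
import numpy as np
from scipy.optimize import linprog

def safety_level_strategy(U):
    """Safety-level (probabilistic maximin) strategy for the row player of U.

    Maximize v subject to: for every column j, sum_i x_i * U[i, j] >= v,
    with sum_i x_i = 1 and x_i >= 0.  Decision variables are (x_1,...,x_m, v).
    """
    U = np.asarray(U, dtype=float)
    m, k = U.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # linprog minimizes, so minimize -v
    A_ub = np.hstack([-U.T, np.ones((k, 1))])     # v - sum_i x_i U[i, j] <= 0 for all j
    b_ub = np.zeros(k)
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                   # mixed strategy and its safety-level value

# Illustrative use: for [[5, 0], [4, 3]] the safety-level strategy is the pure
# second row, with safety-level value 3.
print(safety_level_strategy([[5, 0], [4, 3]]))
```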
Basic Definitions
• R(G) – the repeated game with respect to a (one-shot) game G.
• History of player i after t iterations of R(G):
Perfect monitoring – H_t^i = ((a_1^j,…,a_n^j), (p_1^j,…,p_n^j))_{j=1..t} – a player can observe all previously chosen actions and payoffs.
Imperfect monitoring – H_t^i = ((a_1^j,…,a_n^j), p_i^j)_{j=1..t} – a player can observe the previously chosen actions (of all players) and the payoffs of i.
Strictly imperfect monitoring – H_t^i = (a_i^j, p_i^j)_{j=1..t} – a player can observe only its own actions and payoffs.
Possible histories for agent i: H_i = ∪_{t=1}^∞ H_t^i
Policy for agent i: a mapping from H_i to Δ(Si).
Remark: in the game-theory literature the term perfect monitoring is used to refer to the concept called imperfect monitoring above.
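To make the three monitoring regimes concrete, here is a small typed sketch (field names are mine, not the paper's) of what a single iteration's record looks like in each case, together with the shape of a policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Action = str

@dataclass
class PerfectRecord:                 # perfect monitoring
    joint_action: Tuple[Action, ...]   # all players' actions
    payoffs: Tuple[float, ...]         # all players' payoffs

@dataclass
class ImperfectRecord:               # imperfect monitoring
    joint_action: Tuple[Action, ...]   # all players' actions
    own_payoff: float                  # only player i's payoff is observed

@dataclass
class StrictlyImperfectRecord:       # strictly imperfect monitoring
    own_action: Action                 # only player i's own action...
    own_payoff: float                  # ...and payoff are observed

# A policy maps a finite history to a mixed action (action -> probability).
def uniform_policy(history: List[PerfectRecord], actions: List[Action]) -> Dict[Action, float]:
    return {a: 1.0 / len(actions) for a in actions}
```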
Basic Definitions
Let G be a (one-shot) game, let M = R(G) be the corresponding repeated game, and let n(G) be an optimal Nash equilibrium of G. Denote the expected payoff of agent i in that equilibrium by NVi(n(G)).
Given M = R(G) and a natural number T, we denote the expected T-step undiscounted average reward of player i when the players follow the policy profile (π1,…,πi,…,πn) by Ui(M, π1,…,πi,…,πn, T).
Ui(M, π1,…,πn) = liminf_{T→∞} Ui(M, π1,…,πn, T)
Definition: (Optimal) ELE (in a 2-person repeated game)
(π, ρ) is an efficient learning equilibrium with respect to a class of games (where each one-shot game has k actions) if for every ε > 0 and 0 < δ < 1 there exists some T > 0, where T is polynomial in 1/ε, 1/δ, and k, such that with probability of at least 1−δ:
(1) If player 1 (resp. 2) deviates from π to π' (resp. from ρ to ρ') in iteration l, then U1(M, (π',ρ), l+t) ≤ U1(M, (π,ρ), l+t) + ε (resp. U2(M, (π,ρ'), l+t) ≤ U2(M, (π,ρ), l+t) + ε) for every t ≥ T and for every repeated game M = R(G) in the class.
(2) For every t ≥ T and for every repeated game M = R(G) in the class, U1(M, (π,ρ), t) + U2(M, (π,ρ), t) ≥ NV1(n(G)) + NV2(n(G)) − ε for an optimal (surplus-maximizing) Nash equilibrium n(G).
The Existence of ELE
Theorem: Let M be a class of repeated games. Then, there exists an ELE w.r.t. M given perfect monitoring.
The proof is constructive and uses ideas from our R-max algorithm (the first near-optimal polynomial-time algorithm for reinforcement learning in stochastic games) together with the folk theorem from economics.
The ELE algorithm
For ease of presentation assume that the payoff functions are non-negative and are bounded by Rmax.
• Player 1 performs action ai k times in a row, for each i = 1, 2, ..., k.
• In parallel, player 2 performs the sequence of actions (a1, …, ak) k times.
• If both players behaved according to the above, then an optimal Nash equilibrium of the corresponding (revealed) game is computed, and the players behave according to the corresponding strategies from that point on. If several such Nash equilibria exist, one is selected based on a pre-determined arrangement.
• If one of the players deviated from the above, we shall call this player the adversary and the other player the agent, and do the following:
• Let G be the Rmax-sum game in which the adversary's payoff is identical to his payoff in the original game, and where the agent's payoff is Rmax minus the adversary's payoff. Let M denote the corresponding repeated game. Thus, G is a constant-sum game where the agent's goal is to minimize the adversary's payoff. Notice that some of these payoffs will be unknown (because the adversary did not cooperate in the exploration phase). The agent now plays according to the following:
The ELE algorithm (cont.)
• Initialize: Construct the following model M' of the repeated game M, where the game G is replaced by a game G' in which all the entries of the game matrix are assigned the rewards (Rmax, 0) (we assume w.l.o.g. positive payoffs, and also assume the maximal possible reward Rmax is known).
We associate a boolean-valued variable with each joint action: {assumed, known}. This variable is initialized to the value assumed.
Repeat:
• Compute and Act: Compute the optimal probabilistic maximin of G' and execute it.
• Observe and update: Following each joint action do as follows:
Let a be the action the agent performed and let a' be the adversary's action. If (a, a') is performed for the first time, update the reward associated with (a, a') in G', as observed, and mark it known.
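A compressed Python sketch of the flow just described, for two players with k actions each under perfect monitoring; the Nash-selection step is omitted and all identifiers are illustrative, so this is one reading of the slides rather than the authors' implementation.

```python
def exploration_schedule(k):
    """Exploration phase: player 1 plays each action k times in a row while
    player 2 cycles through all k actions, so every joint action is visited once."""
    player1 = [i for i in range(k) for _ in range(k)]
    player2 = [j for _ in range(k) for j in range(k)]
    return list(zip(player1, player2))

def punishment_game(adversary_payoffs, r_max):
    """Rmax-sum punishment game G': the agent's payoff is Rmax minus the
    adversary's payoff; entries never observed (None) are assigned (Rmax, 0),
    i.e. the adversary's payoff there is optimistically taken to be 0."""
    rows, cols = len(adversary_payoffs), len(adversary_payoffs[0])
    agent_payoffs = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            adv = adversary_payoffs[i][j]
            agent_payoffs[i][j] = r_max - (adv if adv is not None else 0)
    return agent_payoffs

# Illustrative use after a deviation in a 2x2 game with some payoffs unobserved.
print(exploration_schedule(2))                      # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(punishment_game([[3, None], [None, 1]], 10))  # [[7, 10], [10, 9]]
```

The agent would then repeatedly play the probabilistic maximin of this punishment game, updating entries as they are revealed, as in the Compute-and-Act / Observe-and-Update loop above.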
Imperfect Monitoring
Theorem: There exist classes of games for which an ELE does not exist given imperfect monitoring.
The proof is based on showing that you cannot get the values obtained in the Nash equilibria of the following games when you do not initially know which game is being played and cannot observe the other agent's payoff:
Game 1:
      f        s
f   8,0     0,100
s   5,-100  1,500

Game 2:
      f      s
f   8,9    0,1
s   5,11   1,10
The Existence of ELE for Imperfect Monitoring Settings
Theorem: Let M be a class of repeated symmetric games. Then, there exists an ELE w.r.t. M given imperfect monitoring.
The Existence of ELE for Imperfect Monitoring Settings:
Proof Idea
Agents are instructed to explore the game matrix.
If it has been done without deviations, action profiles (s,t) and (t,s) with optimal surplus are selected and played indefinitely, with (s,t) played on odd iterations and (t,s) on even iterations.
If there has been a deviation then we remain with the problem of effective and efficient punishment. Notice that here an agent does not learn another agent’s payoff in an entry once it is played!
The Existence of ELE for Imperfect Monitoring Settings:
Proof Idea (cont.)
Assume the row agent is about to punish the column agent.
We say that a column associated with action s is known if the row agent knows its payoff for every pair (t,s). Notice that at each point the square sub-matrix which corresponds to the actions associated with known columns has the property that the row agent knows all payoffs of both agents in it.
With some small probability the row agent plays a random action, and otherwise it plays the probabilistic maximin associated with the above (known) square sub-matrix, where its payoffs are taken to be the complement to 0 of the column agent's payoffs.
Many missing details and computations….
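Among those missing details, the punishment step itself can be sketched roughly as follows. A helper maximin_strategy (for instance, an LP-based routine as sketched earlier) and the data structures are assumptions of mine, not the authors' code.

```python
import random

def punish_move(column_payoffs, known_columns, actions, maximin_strategy, explore_prob=0.1):
    """One move of the row agent while punishing under imperfect monitoring.

    column_payoffs[(t, s)] is the column agent's payoff at (row action t, column
    action s) for observed entries; known_columns is the set of column actions s
    whose whole column has been observed.
    """
    if not known_columns or random.random() < explore_prob:
        return random.choice(actions)            # small-probability random exploration
    labels = sorted(known_columns)               # square sub-matrix over the known columns
    # Punish: the row agent's payoff is the complement to 0 of the column agent's payoff.
    sub_matrix = [[-column_payoffs[(t, s)] for s in labels] for t in labels]
    mix, _ = maximin_strategy(sub_matrix)        # probabilistic maximin on that sub-matrix
    return random.choices(labels, weights=list(mix), k=1)[0]
```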
Extensions
• The results are extended to n-person games and stochastic games, providing a general solution to the normative problem of multi-agent learning.
ELE and Efficiency
Our results for symmetric games imply that we can get the optimal social surplus as a result of the learning process, where the learning algorithms are in equilibrium!
This is impossible in general games without having side payments as part of the policies, which leads to another version of the ELE concept.
Pareto ELE
Given a 2-person game G, a pair (a,b) of strategies is (economically) efficient if U1(a,b) + U2(a,b) = max_{s∈S1, t∈S2} (U1(s,t) + U2(s,t)).
Obtaining economically efficient outcomes is in general impossible without side payments (the probabilistic maximin value for i may be higher than what he gets in the economically efficient tuple).
Side payments: an agent may be asked to pay the other as part of its policy. If its payoff at a particular point is pi and the agent pays ci, then the actual payoff/utility is pi − ci.
Pareto ELE is defined similarly to (Nash) ELE with the following distinctions:
1. The agents should obtain an average total reward close to the sum of their rewards in an efficient outcome.
2. Side payments are allowed as part of the agent's policy.
Pareto ELE
Theorem: Let M be a class of repeated games. Then, there exists a Pareto ELE w.r.t. M given perfect monitoring.
Theorem: There exist classes of games for which a Pareto ELE does not exist given imperfect monitoring.
Common Interest Games
A game is called a common-interest game if for every joint action all agents receive the same reward.
Theorem: Let Mc be the class of common-interest repeated games in which the number of actions each agent has is a. There exists an ELE for Mc under strictly imperfect monitoring.
The above result is obtained for the general case where there are no a-priori conventions on agents’ ordering or strategies’ ordering.
Efficient Learning Equilibrium and Regret Minimization
• The literature on regret minimization attempts to find a best response to "arbitrary" action sequences of an opponent.
• Notice that in general an agent cannot devise a best response against an adversary whose action selection depends on the agent's previous actions.
• In such situations it is hard to avoid equilibrium concepts.
• Efficient Learning Equilibrium requires that deviations be irrational for every game in a given set of games, and therefore has the flavor of ex-post equilibrium.
Stochastic Game
[Slide figure: a two-player stochastic game with several states. Each state is a 2x2 stage game (agent actions a1, a2 vs. adversary actions a1, a2) with payoff pairs such as (3,2), (4,1), (-2,7), (8,-3), and joint actions induce probabilistic transitions between the states (probabilities such as 0.3, 0.7, 0.6, 0.4, 0.5, 1).]
SGs Are An Expressive Model
• SGs are more general than Markov decision processes and repeated games:
  - Markov decision process: the adversary has a single action
  - Repeated game: a unique stage game
Extending ELE to stochastic games
Let M be a stochastic game and let ε > 0, 0 < δ < 1.
Let vi(M, ε) be the ε-return mixing time of a probabilistic maximin (safety-level) strategy for agent i.
Consider a stochastic game Mi which is identical to M except that the payoffs of player i are taken as the complement to Rmax of the other player's payoff. Let vi'(Mi, ε) be the ε-return mixing time of an optimal policy (safety-level strategy) of i in that game.
Consider also the game M', where M' is a Markov decision process which is isomorphic to M, but where the (single) player's reward for the action (a,b) in state s is the sum of the players' rewards in M. Let Opt(M') be the value of an optimal policy in M'. Let vc(M, ε) be the ε-return mixing time of that optimal policy (in M').
Let v(M, ε) = max(v1(M, ε), v1'(M, ε), v2(M, ε), v2'(M, ε), vc(M, ε))
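For readers unfamiliar with the term, one common formulation of the ε-return mixing time (paraphrased from the reinforcement-learning literature, not quoted from the slide) is

$$ v_\pi(\varepsilon) \;=\; \min\Big\{\, T \;:\; \text{for all } t \ge T,\;\; \mathbb{E}\Big[\tfrac{1}{t}\sum_{j=1}^{t} r_j \,\Big|\, \pi \Big] \;\ge\; U_\pi - \varepsilon \,\Big\}, $$

where $r_j$ is the reward obtained at step $j$ and $U_\pi$ is the long-run average return of policy $\pi$.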
Extending ELE to stochastic games
A policy profile (π, ρ) is a Pareto efficient learning equilibrium w.r.t. the class M of stochastic games if for every ε > 0, 0 < δ < 1, and every M in the class, there exists some T > 0, where T is polynomial in 1/ε, 1/δ, the size of M, and v(M, ε), such that with probability of at least 1−δ:
(1) for every t ≥ T, U1(M, π, ρ, t) + U2(M, π, ρ, t) ≥ (1−ε)·Opt(M') − ε
(2) if player 1 (resp. 2) deviates from π to π' (resp. from ρ to ρ') in iteration l, then U1(M, π', ρ, l+t) ≤ U1(M, π, ρ, l+t) + ε (resp. U2(M, π, ρ', l+t) ≤ U2(M, π, ρ, l+t) + ε)
Theorem: Given a perfect monitoring setting for stochastic games, there always exists a Pareto ELE.
The R-max Algorithm
R-max is the first near-optimal efficient reinforcement learning algorithm for stochastic games. In particular, it is applicable to (efficiently) obtaining the safety-level value in stochastic games where the stage games and transition probabilities are initially unknown.
Therefore, when adopted by all agents, R-max determines an ELE in zero-sum stochastic games.
Efficiency is measured as a function of the mixing time of the optimal policy in the known model.
The R-max Algorithm
A model-based learning algorithm utilizing an optimistic, fictitious model.
Model initialization:
• States: original states + 1 fictitious state
• All game-matrix entries are marked unknown
• All joint actions lead to the fictitious state with probability 1
• The agent's payoff is Rmax everywhere (the adversary's payoff plays no role, 0 is fine)
Initial Model
[Slide figure: the initial model. Every entry of every real stage game and of the fictitious stage game holds the payoff pair (Rmax, 0) and is marked unknown; all transitions lead to the fictitious stage game with probability 1.]
The Algorithm (cont.)
Repeat:
• Compute optimal policy
• Execute current policy
• Update model
Model Update
Occurs after we play a joint action corresponding to an unknown entry:
1. Record the payoff in the matrix (once only)
2. Record the observed transition
3. Once enough transitions from this entry are recorded:
   a. Update the transition model based on the observed frequencies
   b. Mark the entry as known
   c. Recompute the policy
The Algorithm (cont.)
Repeat:
• Compute optimal T-step policy
• Execute current policy
• Update model – an entry is known once it has been visited sufficiently many times (a threshold determined by N, k, T, Rmax and the desired error bounds).
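A minimal sketch of the bookkeeping just described (optimistic initial model, rewards recorded once, entries marked known after enough visits). The class layout, the handling of the fictitious state, and the threshold parameter are simplifications, not the paper's exact construction.

```python
from collections import defaultdict

class RmaxModel:
    """Optimistic fictitious model of a 2-player stochastic game (rough sketch)."""

    def __init__(self, states, r_max, visit_threshold):
        self.r_max = r_max
        self.threshold = visit_threshold          # visits needed before an entry is known
        self.fictitious = "s_fict"                # the extra fictitious state
        self.states = list(states) + [self.fictitious]
        self.known = set()                        # (state, joint_action) pairs already known
        self.reward = defaultdict(lambda: (r_max, 0.0))           # (agent, adversary) payoffs
        self.transition = defaultdict(lambda: {self.fictitious: 1.0})
        self.visits = defaultdict(dict)           # (state, joint_action) -> {next_state: count}

    def update(self, state, joint_action, observed_reward, next_state):
        """Called after playing an entry that is not yet known; returns True when
        the entry becomes known and the T-step policy should be recomputed."""
        key = (state, joint_action)
        if key in self.known:
            return False
        if not self.visits[key]:
            self.reward[key] = observed_reward    # the payoff is recorded once only
        self.visits[key][next_state] = self.visits[key].get(next_state, 0) + 1
        total = sum(self.visits[key].values())
        if total >= self.threshold:
            self.transition[key] = {s: c / total for s, c in self.visits[key].items()}
            self.known.add(key)
            return True
        return False
```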
Main Theorem
Let M be an SG with N states and k actions. Let ε > 0 and 0 < δ < 1 be constants denoting the desired error bounds. Consider the policies for M whose ε-return mixing time is T, and denote the optimal expected return achievable by such policies by OptM(ε,T) (i.e., the best value of a policy that mixes in time T).
Then, with probability no less than 1−δ, the R-max algorithm will attain an actual average return of no less than OptM(ε,T) − ε within a number of steps polynomial in N, T, k, 1/ε, and 1/δ.
Main Theorem (cont.)
Main Technical Contribution: Implicit Explore or Exploit (IEE)
• R-max either explores efficiently or exploits efficiently:
  - The adversary can influence whether we exploit efficiently or explore
  - But it cannot prevent us from doing one of the two
Conclusion (ELE)
• ELE captures the requirement that the learning algorithms themselves should be in equilibrium.
• Somewhat surprisingly, (optimal) ELE exists for large classes of games. The proofs are constructive.
• ELE can be viewed as ex-post equilibrium in repeated pre-Bayesian games with (initial) strict uncertainty about payoffs.
• The results can be extended to stochastic games (more complicated, and need to refer to mixing time of policies in the definition of efficiency).
Conclusion
• Pre-Bayesian games are a natural setting for the study of multi-agent interactions with incomplete information, where there is no exact probabilistic information about the environment.
• Natural solution concepts such as ex-post equilibrium can be extended to non-Bayesian equilibria (such as safety-level equilibria), which always exist.
• The study of non-Bayesian equilibrium leads to illuminating results in areas connecting CS and GT.
Conclusion (cont.)
• There are tight connections between Pre-Bayesian repeated games and multi-agent learning.
• Equilibria of learning algorithms can be shown to exist in rich settings. ELE is a notion of ex-post equilibrium in Pre-Bayesian repeated games.
• The study of Pre-Bayesian games is a rich, attractive, and illuminating research direction!
Our research agenda: GT with CS flavor
• Program equilibrium
• Ranking systems
• Non-cooperative computing
• Pre-Bayesian games
• Distributed Games
• Recommender systems for GT
• …
GT with CS flavor: re-visiting equilibrium analysis
• Program equilibrium
CS brings the idea that strategies can be of low capability (resource bounds), but also of high capability: programs can serve both as data and as a set of instructions. This makes it possible to obtain phenomena observed in repeated games in the context of one-shot games.
GT with CS flavor: re-visiting social choice
• Ranking systems
The Internet suggests the need to extend the theory of social choice to the context where the set of players and the set of alternatives coincide and transitive effects are taken into account. This allows us to treat the foundations of page-ranking systems and of reputation systems (e.g., an axiomatization of Google's PageRank).
GT with CS flavor: re-visiting mechanism design
• Non-cooperative computing
Informational mechanism design, where goals are informational states and agents' payoffs are determined by informational states, is essential in order to deal with distributed computing with selfish participants. This allows us to answer the question of which functions can be jointly computed by self-motivated participants.
GT with CS flavor: action prediction in one-shot games
• Recommender systems for GT
Find correlations between agents' behaviors in different games, in order to try to predict an agent's behavior in a game (which he has not played yet) based on his behavior in other games. This is a useful technique when, e.g., selling books on Amazon, and here it is suggested for action prediction in games, with surprisingly great initial success (an experimental study).
GT with CS flavor: incorporating distributed systems features into game-theoretic models
• Distributed Games
  - The effects of asynchronous interactions
  - The effects of message syntax and the communication structure on implementation
  - The effects of failures
GT with CS flavor: revisiting uncertainty in games and learning
• Pre-Bayesian games
This talk.