EQUILIBRIUM AND CONTROL IN COMPLEX
INTERCONNECTED SYSTEMS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Sachin Adlakha
August 2010
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/kx619tt4623
© 2010 by Sachin Adlakha. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andrea Goldsmith, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Ramesh Johari
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Sanjay Lall
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Large-scale complex systems such as power grids, transportation systems, and social
networks are reshaping every aspect of modern society. Despite their ubiquitous nature, the design and understanding of such complex networks is still very challenging.
Decision making in such systems is complicated by the fact that an agent’s optimal
choice depends on the choices made by other agents in the system. In a smart grid,
the power consumption of an individual user could depend on the demand profile of
other users, some of whom may be physically far away. An investment decision by an agent in an online auction is affected by the strategic choices of other agents participating in the auction. Thus, a node’s decision is affected by the presence and the actions of other nodes in the system. The multitude of dependencies arising in such environments leads to an extremely complicated decision making process for a single agent.
Often in complex systems, the decision maker has partial information about the
state of the system. For example, a centralized load balancer in a server farm obtains
the state of the queues via a communication network. This network introduces delays
and losses which result in partial information at the decision maker. This further
complicates the decision making process.
In this thesis, we study equilibrium and control in complex interconnected systems.
In the first part of the thesis, we investigate centralized decision making in a networked
system in presence of delays. Specifically, we show that even in the presence of delays,
a centralized decision maker can make optimal decisions with only a subset of the
past history of the system. This history depends on the structure of the system as well
as the associated delay pattern. From a practical point of view, these results show
that one can make optimal decisions with only finite memory about the past, thus
eliminating the need to store the entire history. Thus, for example, a centralized load
balancer in a server farm can use algorithms based on only a finite past to evenly
distribute load across multiple servers.
In the second part of the thesis, we look at decentralized decision making in a
reactive environment. We describe a mean field approach to decision making in large-scale systems. The basic premise of this approach is to treat other agents as a single
entity with some aggregate behavior. We develop a unified framework to study mean
field equilibrium in large-scale stochastic games. Under a set of simple assumptions,
we prove the existence of a mean field equilibrium. A key insight from our result is that existence is closely related to how well a mean field equilibrium approximates the actual behavior of agents. Thus, a single agent can make near-optimal
decisions based only on aggregate behavior of other agents.
We conclude the thesis with various interesting extensions and open challenges in
the design and understanding of complex interconnected systems.
Acknowledgments
This thesis is a culmination of a journey that started at Stanford about five years ago.
I was fortunate to have the guidance and friendship of several people who made this
journey enjoyable. First and foremost, I thank my adviser, Prof. Andrea Goldsmith,
for being a wonderful adviser and a mentor. She took a leap of faith by agreeing
to fund me from the first day, thus enabling me to come to Stanford. Without her
confidence and trust, I would not have been at Stanford, much less have written this thesis.
She has always been very encouraging and supportive of me as I built collaborations
with different people and developed my research interests. She also made the Wireless
Systems Laboratory feel more like home. She was very generous in inviting us to
various parties at her home. It was a joy to meet her family - discussions with
Arturo sharpened my thought process and Daniel and Nicole’s company was always
a welcome change from the daily grind of research. For that, I sincerely thank her
and her family.
Early on in my research career, I was fortunate to work and interact with Prof.
Sanjay Lall. His breadth and depth of knowledge constantly amazed me and made
me realize how little I know. Besides being a great researcher, he is also a wonderful
and a very generous teacher. It is from him that I learned the art of learning. He
personally spent countless hours teaching me everything I know about control systems
and Markov decision processes. He also mentored me and taught me how to write
good papers and give good talks. Every paper I ever write and every talk I give in the future
will always bear his signature. For all his time and efforts, I will always be thankful
to him.
My sincere thanks are also due to my co-adviser, Prof. Ramesh Johari. Ramesh’s
enthusiasm for research, his drive for perfection, and his sheer ability to work hard
constantly amazed me. During my entire graduate school career, he was the benchmark I strove to achieve. He constantly challenged my limits and helped me realize
my potential. Besides working on research, he spent a lot of time giving me guidance
and career advice. His guidance allowed me to understand my strengths, realize my
weaknesses, and helped me push my limits. My experience at Stanford would have
never been the same, had I not had the pleasure of working with him. For the countless hours he spent thinking about my work, for genuinely caring about my work and
my career, and for making me realize my potential, I shall forever be indebted to him.
A significant portion of this thesis is based on the work of Prof. Gabriel Weintraub
of Columbia University. He not only provided the seeds of this work, but also guided me through it. Gabriel comes from a very different background and has a unique perspective on research. He was generous with his time and shared his ideas
with me. For his guidance and help at every step of my work, I express my deepest
gratitude.
My Ph.D. at Stanford might never have happened had it not been for one person, who had more faith in me than I had in myself. Convinced that Stanford was
beyond my reach, I had almost decided not to apply. It was only at the urging of Ram
that I finally decided to take the chance. He even promised to pay the application
fee which he still owes me. But what I owe him can never be repaid. He has been a
true friend, believing in me more than I ever believed in myself. During all the ups
and downs of this grueling journey, he was a constant source of encouragement and
support. Mere words of thanks can never do justice to all that he has done for me.
My stay at Stanford gave me an opportunity to make some wonderful friends. I
would like to thank Mayank Jain for being extraordinarily helpful at every step of
the way. He was always generous with his time - spending several hours going over the
details of my work with me. He is also the reason that I survived Stanford without
ever owning a car. His generosity will always be remembered. Part of this research
started as a course project that was jointly done with Vineet Abhishek. Even though
he left Stanford for greener pastures, the seeds we jointly sowed as a course project flourished as part of my thesis. The joint project also provided an opportunity to
know him better and to grow as friends. Several arguments and discussions over tea,
and our regular dinners at “Treehouse” shall always be fondly remembered.
Life at Stanford would have never been the same without the company of several
friends. Forum Parmar - whose infectious laughter lightened the mood of even the most serious conversations, Mridul Agarwal - who kept me company on the various hiking trips we made, Dinkar Gupta - whose extraordinary culinary skills and stimulating
company provided for some wonderful dinner nights, Saurabh Jain - who dazzled us
with some wonderful desserts, and Kannan, Kadambari and Abhay - who provided
wonderful company, made these past five years worth living and remembering. During the last few years, I also had the pleasure of knowing Suchitra Vijayan, first as Ram’s wife
and then as a very caring friend. The times spent complaining about Ram, discussing
politics, and exchanging recipes will always be fondly remembered.
The daily grind of school was made more bearable by the presence and company
of Vinay Majjigi who provided me company every time I needed a break. He was
very generous in sharing with us his Mom’s food which made me miss home a tad
bit less. Dan O’Neill offered me some very valuable advice and shared his years of
experience and his unique perspective on life. It would not be an exaggeration to say
that various discussions with him will certainly have an impact on whatever future
career I pursue.
The members of the Wireless Systems Laboratory provided a very intellectually
stimulating environment. Ivana Maric was a very patient and adjusting office mate
who suffered as we converged on the right temperature in our office. Bruno Sinopoli
jump started my research career as soon as I joined Stanford. Various members of
the Wireless Systems Laboratory (both past and present) made this a fun place to
be.
Special thanks are due to Maria Kazandjieva for being my running buddy and
for her wonderful company, to Michelle Hewlett, Samar Fahmy, Sara Lefort, Hattie
Dong and Thomas John for providing a reason (other than work) to come to office
every day, to Sophie and Jonathan for being friends from the first day I came to the
US, to Patrick Burke for helping me with every computer related issue, and to Pat
Oshiro, Bernadette Aguiao and to Joice DeBolt for making bureaucratic work less of
a hassle.
During the last five years, Sanjay Bhal and his family opened the doors of their
heart and provided me a home away from home. Poorvi Bhabhi ensured that I never
missed home cooked food. Kuhoo’s angelic face and innocent remarks made me forget
the stress of work and life. Kaustubh and Prisha never made me miss my nephew
and niece in India. Their friendship and warm hospitality made even the hardest
periods of life bearable.
Last, but not least, my deepest gratitude is for my family, who were always
there to support me through various challenges of my life. My sisters (Pooja, Prerna,
and Kashika) and my brothers (Ashish and Ketan), who endured my absence as I
focused more on my work, were always very supportive. My parents endured several
hardships so that I could follow my passion – they always supported me at every stage
of my life, and sacrificed their dreams so that I could achieve mine. This journey would
never have been possible without their love, support and encouragement. This thesis
is dedicated to them!
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Networked Control Systems with Delayed Information
1.1.1 Prior Work
1.2 Managing Interactions via a Mean Field Approach
2 Networked Markov Decision Processes
2.1 Markov Decision Processes
2.2 Partially Observed Markov Decision Processes
2.2.1 Information State for POMDPs
2.3 Networked Markov Decision Processes
2.3.1 Networked MDP as a POMDP
3 Information State for Networked MDPs
3.1 Networked MDP with Action Delays
3.2 Discussion
3.2.1 Single System with Delayed State Observations and Action Delays
3.3 Numerical Examples
3.3.1 Linear Systems with Delays
3.3.2 Controller Design for Finite State Systems
4 A Bayesian Network Approach to Network MDPs
4.1 Bayesian Networks
4.2 Networked MDPs as Bayesian Networks
4.3 Alternate Proof of the Information State for Networked MDPs
5 A Mean Field Approach to Studying Large Systems
5.1 Stochastic Game Model
5.2 Markov Perfect Equilibrium (MPE)
5.3 Mean Field Equilibrium (MFE)
6 MFE as an Approximation to MPE
6.1 The Asymptotic Markov Equilibrium (AME) Property
7 Existence of Mean Field Equilibrium
7.1 Closed Graph
7.2 Convexity
7.3 Compactness
8 Conclusions and Future Work
A Proofs
A.1 Preliminary Lemmas
A.2 Proof of AME
A.3 Compactness: Proof
Bibliography
List of Figures
2.1 A network of interconnected subsystems with delays. Subsystem i is denoted by Si, the network propagation delay from Si to Sj is denoted by Mij, and the measurement delay from Si to the controller is denoted by Ni.
2.2 Directed graph for the network of Figure 2.1.
3.1 A networked Markov decision process with action delays. The control action delay to subsystem Si is denoted by Pi.
3.2 A network of two interconnected subsystems with delays. Here the control input is only applied to subsystem 1.
3.3 A system of two interacting queues. Here the solid line represents jobs of type R which enter system 1 and are then transported to system 2 after a delay of M12. Similarly, the dashed line represents jobs of type B which enter system 2 and are transported to system 1 after a delay of M21. Of the two queues at each system, the top queue is the high-priority queue.
3.4 Infinite horizon discounted action cost Ja (averaged over all initial states) vs. the infinite horizon discounted state cost Js. The curve is plotted by varying the weighting factor α.
4.1 A Bayesian network with 6 variables.
4.2 A network of two interconnected subsystems with delays. Subsystem i is denoted by Si, the network propagation delay from Si to Sj is denoted by Mij, and the measurement delay from Si to the controller is denoted by Ni.
4.3 The Bayesian network associated with the 2-subsystem networked MDP of Figure 4.2. Here the circle represents the state of the two subsystems and the square represents the control input. For this Bayesian network, we chose M21 = 1 and M12 = 2. The edges from state variables to control inputs have been omitted for visual clarity.
Chapter 1
Introduction
Complex networks pervade every aspect of modern society. Nearly any large scale
system can be modeled as a network of interconnected agents or subsystems dynam-
ically evolving over time: the power grid, social networks, modern economies and
financial systems, automated transportation systems, and “smart” environments all
exhibit this characteristic. A defining feature of a complex network is a high degree
of interconnectedness. This connectivity, which is one of the most attractive features
of complex networks, is also a source of its biggest challenge. Two key characteris-
tics of complex networks are: (a) the environment in which they operate is reactive,
and (b) each node in a complex network usually has partial information about its
environment. Understanding these features is important to the design of complex
networks.
• Reactive Environment: Imagine a power grid consisting of a variety of generators and storage devices catering to the needs of many users, including a
substantial fraction with plug-in vehicles. A user’s decision to charge a hybrid
vehicle could depend upon other vehicles being charged in the same area. Similarly, the operating point of a generator could depend upon the presence of other
energy storage or generating devices. Thus, any node’s decision is affected by
the presence and the actions of other nodes in the system. The environment
in which a node operates consists not only of the dynamics associated with
the physical environment but also the dynamics associated with the actions of
other nodes. From the point of view of a single node, the environment is reactive: other nodes will respond and react to this node’s behavior. A similar
phenomenon arises in wireless systems where the transmit power of a device
depends upon the channel characteristics as well as the transmit power of other
devices operating in the region. The multitude of dependencies arising in such
environments leads to an extremely complicated decision making process for a
single agent.
• Partial Information: Decision making in any complex system is also made
difficult by the lack of complete information available to any decision maker.
Imagine a server farm with thousands of servers that handle incoming requests.
A centralized decision maker aims to allocate the incoming requests to these
servers to effectively balance the load in the server farm. In order to do this,
the decision maker requires the current state of the queue at each server. This
information is usually transmitted to the decision maker via a control network.
Because of the inherent delays and losses in the network, it is possible that the
decision maker would not have the current state of each server. In other circumstances, noise in the feedback process can corrupt the information available to
the decision maker. Such artifacts of networks further complicate the decision
making in complex systems.
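To make the effect of stale information concrete, the following sketch (all names, rates, and the routing rule are illustrative assumptions, not part of this thesis) simulates a balancer that routes each arriving job to the server whose queue appeared shortest a fixed number of steps ago:

```python
import random

def simulate(num_servers=4, horizon=1000, delay=5, seed=0):
    """Toy server farm: route each arriving job to the server whose queue
    looked shortest `delay` steps ago, i.e. under stale information."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    history = [list(queues)]          # snapshots; the balancer sees old ones
    for t in range(horizon):
        snapshot = history[max(0, t - delay)]     # delayed view of the queues
        target = min(range(num_servers), key=lambda i: snapshot[i])
        queues[target] += 1                       # route the arriving job
        for i in range(num_servers):              # each server may finish a job
            if queues[i] > 0 and rng.random() < 0.3:
                queues[i] -= 1
        history.append(list(queues))
    return max(queues) - min(queues)              # final queue imbalance
```

Running this with a larger delay typically yields a larger imbalance, which is precisely why the decision maker's partial, delayed view of the system matters.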
In this thesis, we focus on the design and understanding of complex networks
from the viewpoint of a single agent. A common theme of this research is the following question: how does a single agent make an optimal
decision in the presence of a reactive environment and/or with partial
information about the state of the environment? The thesis is divided into two fairly distinct parts that address this question. In the first part of the thesis,
we study the problem of decision making in the presence of partial information and
the latter part addresses the issues of reactive environments and their role in decision
making for complex networks.
1.1 Networked Control Systems with Delayed Information
To better understand the design of complex systems, it is important to understand how the presence of delay affects decision making. To gain this understanding, we model a complex network as an interconnected network of subsystems. Each subsystem is modeled as a Markov decision process (MDP), and this network of MDPs is referred to as a networked Markov decision process. Such a network of Markov decision processes is a common model for a variety of distributed control problems, such as distributed vehicle coordination [22], communication systems, queuing networks [41, 9], and distributed scheduling over multiple servers [5], [4], [11]. The subsystems are coupled to each other via communication links that are noise-free, pure delay lines. Thus, in our network model we do not consider packet
losses or noisy observations. We also assume that the delays are fixed but may be
different for each interconnection. We assume that each subsystem has a finite state
space and its state evolution is affected by the delayed state of its neighbors. A
centralized controller receives delayed state measurements from each subsystem and
computes an optimal control action to be applied to each subsystem. The control
action applied to each subsystem takes effect after a certain delay.
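The delayed couplings just described can be sketched in code. In the following toy simulator (an illustration, not the formal model of Chapter 2), subsystem i's next state depends on its own current state and on each neighbor j's state delayed by the propagation delay on the link from j to i:

```python
def step_network(trajectories, transition, delays, rng):
    """Advance a toy networked MDP by one step. trajectories[i] holds the
    full state history of subsystem i; delays[(j, i)] is the propagation
    delay from subsystem j to subsystem i. transition(own_state,
    delayed_neighbor_states, rng) returns subsystem i's next state."""
    t = len(trajectories[0]) - 1
    next_states = []
    for i, hist in enumerate(trajectories):
        delayed = {j: trajectories[j][max(0, t - d)]
                   for (j, k), d in delays.items() if k == i}
        next_states.append(transition(hist[-1], delayed, rng))
    for hist, s in zip(trajectories, next_states):  # update all subsystems simultaneously
        hist.append(s)
    return next_states
```

For instance, with two binary subsystems and delays {(0, 1): 1, (1, 0): 2}, subsystem 1 reacts to subsystem 0's state one step late, mirroring the role of the delays Mij in the model.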
Although the controller receives state information from each subsystem, each of
these states is delayed by a different amount. Since the current state of each subsystem
is not available to the controller, this system can be represented as a partially observed
Markov decision process (POMDP). Optimal control design for POMDPs has been
studied extensively in the literature [12, 40, 53]. There are two standard approaches
to optimal control of POMDPs. The first approach generates a policy which is a
function of the entire history of observations; this history is called an information state. As time increases, the number of observations increases and hence the information state grows without bound. In the second approach, the controller is a
function of the posterior distribution of the current state of the system conditioned on
the entire observation history. This distribution is called the belief state and finding
an optimal controller requires solving a dynamic program on a space of belief states.
This posterior distribution or the belief state is infinite dimensional and hence the
computation of a controller is very challenging [53]. Thus, in general, the computation required to obtain an optimal controller is prohibitively large. We are therefore
motivated to find a representation of the belief state that is as small as possible.
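The belief-state recursion underlying the second approach is a standard Bayes filter; for a finite state space it can be written in a few lines (the dictionary-based representation below is an illustrative sketch, not the notation used in this thesis):

```python
def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step for a finite POMDP. belief maps state -> prob;
    T[(s, a)] maps next state s2 -> prob; O[(s2, a)] maps observation -> prob.
    Returns the posterior over next states given the action and observation."""
    post = {}
    for s2 in {s2 for s in belief for s2 in T[(s, action)]}:
        # predict: push the current belief through the transition kernel
        predicted = sum(belief[s] * T[(s, action)].get(s2, 0.0) for s in belief)
        # correct: weight by the likelihood of the received observation
        post[s2] = O[(s2, action)].get(observation, 0.0) * predicted
    norm = sum(post.values())
    if norm == 0.0:
        raise ValueError("observation has zero likelihood under this belief")
    return {s2: p / norm for s2, p in post.items()}
```

The difficulty noted above is visible here: the belief is a point in the probability simplex over the state space, so planning must be carried out over a continuous, and for networked systems very high-dimensional, object.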
Our main contribution in this section of the thesis is to show that for networked
Markov decision processes, the optimal controller is a stateless function of a finite
number of past observations. Thus, networked MDPs can be reduced to an MDP
with a sufficient information state that does not grow with time. This sufficient
information is a subset of the entire information state and it captures all relevant
information required for optimal control. This significantly reduces the computational complexity associated with obtaining an optimal controller for networked MDPs. More interestingly, we show that the amount of information required to make this optimal decision is related to the concept of a Markov blanket in Bayesian networks. This
interesting connection between networked control systems and Bayesian networks is
an exciting new common thread between these fields, and it opens doors to using
ideas from Bayesian networks to better understand networked systems with delayed
information. The bound on the number of past observations required to compute an
optimal controller is shown to be tight. In an extension of standard terminology for
linear systems, we refer to these numbers as bands and call the controller with finite
memory banded. We show that for networked MDPs, the bands depend only upon the
network structure and the associated delays. From a practical point of view, these
results show that one can make optimal decisions with only finite memory about the
past, thus eliminating the need to store the entire history. Thus, for example, a centralized load balancer in a server farm can use algorithms based on only a finite past
to evenly distribute load across multiple servers.
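In code, a banded controller of the kind described above amounts to a policy that consults only a bounded window of recent observations. The class below is a hypothetical caricature: the window size and the lookup-table policy are assumptions for illustration, not the construction developed in Chapter 3:

```python
from collections import deque

class BandedController:
    """Finite-memory controller: the action depends only on the last `band`
    observations, never on the full history. policy maps a tuple of recent
    observations to an action; unseen windows fall back to default_action."""
    def __init__(self, band, policy, default_action=None):
        self.window = deque(maxlen=band)   # older observations fall off the end
        self.policy = policy
        self.default_action = default_action

    def act(self, observation):
        self.window.append(observation)
        return self.policy.get(tuple(self.window), self.default_action)
```

The memory requirement is fixed by the band; in the networked-MDP setting, the analogue of this window size is determined by the network structure and its delays rather than chosen by hand.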
1.1.1 Prior Work
The optimal control of Markov decision processes (MDPs) originates with the work
of Bellman [17]. The term MDP refers to optimal control problems in which the
system state is available for measurement, and such systems are analyzed in [48].
Control problems for which the full state is not observable are called partially observed
Markov decision processes (POMDP), and are considered in [40]. The separation
structure and dynamic programming recursion for such problems are well-known [12,
53]. For finite-state systems on a finite horizon, the optimal state-feedback controller
is memoryless. However, when only partial observations of the state are available,
the optimal controller has a separation structure, and is a function of the posterior
distribution of the current state given all the observations [40, 12, 53]. In the special
case where the system dynamics are linear and the noise is Gaussian, the optimal
controller can be computed recursively in an efficient manner, leading to the classical
linear quadratic Gaussian (LQG) control formulation.
Markov decision processes with delays have also been studied extensively in the
literature. Altman and Nain [8] consider an MDP with delayed state availability and
use it to solve a communication network design problem. Bander and White [13]
extended this result to the case where a partial state observation is available after
a delay. Markov decision processes with control action delays are considered in [6].
In [39], the authors unified these results by considering an MDP with observation
delays, action delays as well as cost delays. They also extended the result to the case
of random delays. Optimal control for linear systems with control action delay was
also considered in [16].
The previous works considered a single Markov decision process with delays in
observations and actions. Among the earliest works on distributed systems with delays
is [54] where a separation structure for the one-step delay sharing pattern for a system
with general nonlinear dynamics was obtained. Algorithms to compute the optimal
controller for such a system were obtained in [33, 34] by essentially reducing the
problem to a centralized control problem. An optimal controller is then synthesized
using standard algorithms. More general decentralized control of MDPs has been
shown to be intractable in [18]. An optimal controller for the one-step delay sharing
pattern for LQG was obtained in [50] and [42]. Optimal control of linear systems with
one-step delay sharing was also studied in [55] in an input-output framework. More
generally, in [31, 32] it was shown that for dynamic LQG teams with a partially nested
information structure, the optimal controller at any time is linear. See [15, 14] for a
more complete bibliography on results in team theory. Markov decision processes with
delayed feedback have been used to study flow control in queuing networks [41, 9].
Stochastic games with one-step delayed sharing information pattern have also been
used to study distributed power control in networks [7].
1.2 Managing Interactions via a Mean Field Approach
In several complex systems, a large number of agents interact with each other without
the presence of a central decision maker (or a controller). In applications as diverse
as power control in wireless networks [28], competition among firms in a market [27]
or non-cooperative control systems [35], one often finds a large number of agents
interacting with each other. In many of these applications, the agents have conflicting
objectives and they interact with each other in a stochastic dynamic environment.
In the absence of a centralized authority, a natural mathematical framework for studying such complex systems is that of stochastic games.
Stochastic games [51] are dynamic games with probabilistic transitions played by
one or more players. In a stochastic game, players compete with each other in a
number of time steps. Each player has a state which describes all parameters of
interest to the player. The state of a player evolves according to a stochastic process.
Such games provide a framework to model dynamic behavior of agents in a stochastic
environment without the presence of a centralized authority or a controller.
Markov perfect equilibrium (MPE) is a commonly used equilibrium concept for stochastic games [51]. In an MPE, the strategies of players depend only on the current state of all
players, and not on the past history of the game. In general, finding an MPE is
analytically intractable; MPE is typically obtained numerically using dynamic programming (DP) algorithms [46]. As a result, the complexity associated with MPE
computation increases rapidly with the number of players, the size of the state space,
and the size of the action sets [25]. This limits its application to problems with small
dimensions. Several techniques have been proposed in the literature to deal with the
complexity of large scale systems [47, 10, 52, 36].
Recently, a scheme for approximating MPE for such large scale games has been
proposed in different application domains via a solution concept we call mean field
equilibrium, or MFE [38, 35, 56, 1, 2, 43, 26, 23]. Mean field equilibrium is also
referred to as “oblivious equilibrium” in [56], and as “Nash certainty equivalence
control” in [35]. In mean field equilibrium, a player optimizes given only the long-run average statistics of other players, rather than the entire instantaneous vector
of its competitors’ states. MFE resolves the computational difficulties associated
with MPE: in MFE, a player is reacting to far simpler aggregate statistics of the
behavior of other players. Further, MFE computation is significantly simpler than
MPE computation, since each player only needs to solve a one-dimensional dynamic
program; thus MFE is appealing from a rationality standpoint as well, as it does
not require agents to track a complex state vector in a system with many interacting
players.
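The computation just described can be sketched as a fixed-point iteration: conjecture the aggregate behavior, compute a single agent's best response to it, and update the conjecture with the behavior that best response induces. In the toy solver below, the aggregate is simply a scalar mean state; the function names, the scalar aggregate, and the damping scheme are all illustrative assumptions rather than the formulation analyzed later in the thesis:

```python
def mfe_iterate(states, actions, reward, trans, beta=0.9, damping=0.5,
                iters=200, tol=1e-6):
    """Illustrative mean field fixed-point iteration. reward(s, a, m) may
    depend on the conjectured population mean state m; trans(s, a) returns
    a dict s2 -> probability. Alternates best response and aggregate update."""
    m = sum(states) / len(states)                  # initial conjecture
    policy = {}
    for _ in range(iters):
        # 1. Value iteration: single-agent best response to the conjecture m.
        V = {s: 0.0 for s in states}
        for _ in range(300):
            V = {s: max(reward(s, a, m) +
                        beta * sum(p * V[s2] for s2, p in trans(s, a).items())
                        for a in actions)
                 for s in states}
        policy = {s: max(actions,
                         key=lambda a: reward(s, a, m) +
                         beta * sum(p * V[s2] for s2, p in trans(s, a).items()))
                  for s in states}
        # 2. Stationary distribution of the chain the policy induces.
        dist = {s: 1.0 / len(states) for s in states}
        for _ in range(300):
            nxt = {s: 0.0 for s in states}
            for s, p in dist.items():
                for s2, q in trans(s, policy[s]).items():
                    nxt[s2] += p * q
            dist = nxt
        # 3. Damped update of the conjectured mean; stop at a fixed point.
        new_m = sum(s * p for s, p in dist.items())
        if abs(new_m - m) < tol:
            break
        m = (1 - damping) * m + damping * new_m
    return m, policy
```

When the damped update stops moving, the conjectured aggregate reproduces itself under the agents' best responses, which is exactly the consistency requirement defining a mean field equilibrium.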
The notion of mean field equilibrium provides a simple approach to understanding
behavior in large population stochastic dynamic games. However, this notion is not
very meaningful unless we can guarantee that a mean field equilibrium exists in a
wide variety of stochastic games. Even if a mean field equilibrium were to exist in a
particular game of interest, it is natural to wonder whether such an equilibrium is a
good approximation to Markov perfect equilibrium in finite games. MFE is unlikely
to be useful in practice without conditions that guarantee it approximates equilibria
in finite systems well. Below we address these two fundamental questions on this
topic: the existence of MFE and whether it provides any meaningful approximation
to MPE.
As we shall show below, an important contribution of our work is to relate approximation to the existence of MFE. The approximation theorem we provide requires
continuity assumptions on the model primitives; as we demonstrate in the second
part of this thesis, these same continuity conditions are required (together with con-
vexity and compactness conditions) to ensure an MFE actually exists. Thus we obtain
the valuable insight that approximation is essentially a corollary of existence. This is practically valuable: establishing that an MFE is a good approximation is effectively a free
byproduct, once the conditions ensuring its existence have been verified.
The mean field approach developed in our research has implications for a variety
of engineering applications such as cognitive radio networks, smart grids, etc. Imag-
ine a network of cognitive radios looking for spectrum holes. In order to construct
algorithms for effective spectrum search, a radio must also take into account the pres-
ence of other cognitive devices. Our research suggests design directions for simple
spectrum search algorithms that might only react to average statistics of the envi-
ronment, rather than finely detailed knowledge of other nodes’ behavior. Similarly,
imagine a network of electric cars that are all trying to charge their batteries from the
power grid. In order to construct simple recharging algorithms for electric cars, each
car might base its recharging schedule on some average load profile of the electric
grid. These policies in turn could help prevent the occurrence of load spikes on the
electric grid, since the equilibrium condition requires that these policies give rise to
the initially conjectured average load profile.
The rest of the thesis is organized as follows. In Chapter 2, we describe our model
for networked control systems. As discussed above, we model networked control sys-
tems as a networked Markov decision process. We begin this chapter by formally
defining Markov decision processes (MDPs), partially observed Markov decision pro-
cesses (POMDPs), and networked Markov decision processes (N-MDPs). We also
formally define the information state for POMDPs and show that in general net-
worked MDPs can be written as POMDPs. The first main result, which computes the sufficient information state for networked MDPs and shows that it consists of a finite window of the past history, is proved in Chapter 3. We also show that our model and results encompass previously known results, and we study two numerical examples in which we compute an actual controller for networked MDPs with delayed state information. In
order to gain more insight into our results, we study a special case of our main result
(particularly, we look at finite time horizon networked MDPs) in Chapter 4. In this
chapter, we provide an alternate proof of our main result using ideas from Bayesian
networks. This alternate approach provides additional insights into the finite mem-
ory of the controllers for networked MDPs. It shows that the finiteness of the bands
occurs because given the finite history of states and actions, the current state of the
system is independent of the remaining states and actions.
In Chapter 5, we describe our model of stochastic games and define the notions of Markov perfect equilibrium and mean field equilibrium. In the next chapter, we
define our notion of approximation to MPE and prove that as the number of players in
a game becomes large, the mean field equilibrium is a good approximation to Markov
perfect equilibrium. In Chapter 7, we prove that under a simple set of assumptions,
a mean field equilibrium exists in a wide variety of stochastic games. The existence
result is based on Kakutani’s theorem and we check each of the three conditions of
the theorem in several sections of this chapter. The following chapter then concludes
the thesis and provides a list of interesting and open challenges that are pertinent to
the understanding and design of complex systems.
Chapter 2
Networked Markov Decision Processes
In this chapter, we present our model of networked Markov decision processes. The
subsystems in a networked MDP are coupled to each other via communication links
that are noise-free, pure-delay lines. We also assume that the delays are fixed but
may be different for each interconnection. A centralized controller receives delayed
state measurements from each subsystem and computes an optimal control action to
be applied to each subsystem. The control action applied to each subsystem takes
effect after a certain delay. As mentioned before, each subsystem is modeled as a
Markov decision process. Before we formally define a Markov decision process, we
formalize our notation.
Notation. In Chapters 2–4, we use the following notation. We use superscripts to denote particular subsystems and subscripts for the time index. Thus $x^i_t$ denotes the state of subsystem $i$ at time $t$. For simplicity, we omit the superscript 1 if there is only one subsystem. Similarly, we denote by $y^i_t$ the observation received from subsystem $i$ at time $t$ and by $u^i_t$ the control input applied to subsystem $i$ at time $t$. We also denote by $z$, $s$ and $a$ the realizations of the state $x$, the observation $y$ and the control action $u$, respectively. We define $x^i_{t_1:t_2} := (x^i_{t_1}, \ldots, x^i_{t_2})$ to refer to the list of variables corresponding to subsystem $i$ from time $t_1$ to $t_2$. If $t_2 < t_1$, we interpret this as an empty list. The
notation $x_{0:t} = z_{0:t}$ is interpreted as an element-wise equality, i.e., $x_0 = z_0$, $x_1 = z_1$, etc. To denote the list of variables corresponding to all subsystems, we define $x_t := (x^1_t, \ldots, x^n_t)$. Similarly, we denote by $u_t := (u^1_t, \ldots, u^n_t)$ the control action applied to all subsystems at time $t$. We define $A^i_{0:t}$ to be the product of the variables corresponding to times $0, \ldots, t$, that is, $A^i_{0:t} := A^i_0 A^i_1 \cdots A^i_t$. We will see below that each $A^i_t$ is a function of several variables, and the product $A^i_0 A^i_1 \cdots A^i_t$ is interpreted as a product of several functions. Similarly, we define the product $A^1_{0:t} A^2_{0:t} \cdots A^n_{0:t}$ as the product of functions $A^1_0 A^1_1 \cdots A^1_t\, A^2_0 A^2_1 \cdots A^2_t \cdots A^n_0 A^n_1 \cdots A^n_t$. For a set $\mathcal{X}$, we denote by $\mathcal{X}^n$ the $n$-fold Cartesian product of the set, that is, $\mathcal{X}^n = \mathcal{X} \times \cdots \times \mathcal{X}$ ($n$ times). We write $\mathbb{Z}_+$ for the set of non-negative integers and $\mathbb{Z}_{++}$ for the set of positive integers.
2.1 Markov Decision Processes
A Markov decision process provides a framework for sequential decision making in
a stochastic environment. The decision (also known as the action) taken at time t
affects the future evolution of the system. The goal of the decision-maker is to choose
a sequence of actions to optimize a predetermined criterion. We assume that the
decisions are made at discrete times t ∈ Z+.
At each decision time $t$, the system occupies a state. We denote the set of all possible states by a finite set $\mathcal{X}$. At each time $t$, the decision-maker chooses a decision from a finite set denoted by $\mathcal{U}$. Formally,
Definition 1 (MDP). A Markov decision process is a tuple $(A, g)$ where:

1. $A$ is a sequence $A_0, A_1, \ldots$ with $A_0 : \mathcal{X} \to [0, 1]$, such that $A_0(z) \ge 0$ for all $z \in \mathcal{X}$ and $\sum_{z} A_0(z) = 1$. For $t \ge 1$, we have $A_t : \mathcal{X} \times \mathcal{X} \times \mathcal{U} \to [0, 1]$, such that
$$A_t(z_1, z_2, a) \ge 0 \quad \forall\, z_1, z_2 \in \mathcal{X},\ a \in \mathcal{U}, \qquad \sum_{z_1} A_t(z_1, z_2, a) = 1 \quad \forall\, z_2 \in \mathcal{X},\ a \in \mathcal{U}.$$

2. $g$ is a sequence $g_0, g_1, \ldots$ with $g_t : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$.
Roughly speaking, A0 is the distribution of the initial state, At is the transition
kernel at time t and gt is the cost at time t. As an example of an MDP, consider a
discrete time dynamic system, where the state of the system at time t ≥ 0 is denoted
by $x_t$. The system dynamics are
$$x_{t+1} = f(x_t, u_t, w_t). \qquad (2.1)$$
Here $u_t$ is the control action or the decision taken at time $t$. The random variables $w_t$ for $t \ge 0$ are independent noise processes. The initial state $x_0$ is chosen to be independent of the noise process $w_t$. Associated with this dynamic system is an MDP $(A, g)$ defined as follows. For all $p \in \mathcal{X}$, let $A_0(p) = \mathrm{Prob}(x_0 = p)$ be the probability mass function of the initial state of the system. For $t > 0$, let
$$A_t(z_t, z_{t-1}, a_{t-1}) = \mathrm{Prob}(x_t = z_t \mid x_{t-1} = z_{t-1},\ u_{t-1} = a_{t-1}) \qquad (2.2)$$
be the conditional probability of the state $x_t$ given the previous state $x_{t-1}$ and the applied input $u_{t-1}$. It is easy to verify that the sequence $A$ satisfies all the properties given in Definition 1. The sequence $g_t(x_t, u_t)$ represents the cost at time $t$; it depends on the current state $x_t$ of the system as well as the action $u_t$ taken at time $t$. Note that the dynamic system in equation (2.1) and the transition matrix in equation (2.2) are both canonical representations of a Markov decision process. In fact, given any MDP, one can represent it either as a dynamic system or as a transition matrix [48, 20, 21, 4].
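The passage from the dynamic-system form (2.1) to the transition-matrix form (2.2) can be carried out mechanically when the state, action and noise sets are finite: enumerate the noise and accumulate probability mass. The sketch below does this for a hypothetical instance (the sets and the function f are made up for illustration, not taken from this thesis).

```python
import itertools
import numpy as np

# Hypothetical instance: X = {0, 1, 2}, U = {0, 1}, noise w uniform on {-1, 0, 1}
X, U, W = [0, 1, 2], [0, 1], [-1, 0, 1]

def f(x, u, w):
    # an illustrative x_{t+1} = f(x_t, u_t, w_t), clipped to the state space
    return min(max(x + u - w, 0), 2)

# A[z1, z2, a] = Prob(x_t = z1 | x_{t-1} = z2, u_{t-1} = a), as in eq. (2.2)
A = np.zeros((len(X), len(X), len(U)))
for z2, a, w in itertools.product(X, U, W):
    A[f(z2, a, w), z2, a] += 1.0 / len(W)
```

Each column of the resulting kernel sums to one, which is exactly the stochasticity requirement of Definition 1.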
As mentioned before, the decision-maker (i.e., the controller) needs to choose an action $u_t$ at time $t$. This action is chosen based upon the information available to the controller at that time. We define $h^{\mathrm{mdp}}_t$ to be the information available to the controller at time $t$, given by
$$h^{\mathrm{mdp}}_t = (u_{0:t-1},\, x_{0:t}).$$
We will also use $i^{\mathrm{mdp}}_t$ to denote a realization of $h^{\mathrm{mdp}}_t$ as
$$i^{\mathrm{mdp}}_t = (a_{0:t-1},\, z_{0:t}).$$
Here the sequences z and a specify the values of a realization of x and u, respectively.
An MDP policy (also known as the control policy) specifies the decision or control
action to be taken at each time t.
Definition 2 (MDP Policy). An MDP policy is a sequence $K = (K_0, K_1, \ldots)$ where $K_0 : \mathcal{U} \times \mathcal{X} \to [0, 1]$ and $K_t : \mathcal{U} \times \mathcal{X}^{t+1} \times \mathcal{U}^t \to [0, 1]$ for all $t \in \mathbb{Z}_{++}$, such that
$$K_0(a, z) \ge 0 \quad \forall\, a \in \mathcal{U},\ z \in \mathcal{X}, \qquad \sum_{a} K_0(a, z) = 1 \quad \forall\, z \in \mathcal{X},$$
and for all $t \in \mathbb{Z}_{++}$ we have
$$K_t(a_1, z, a_2) \ge 0 \quad \forall\, a_1 \in \mathcal{U},\ z \in \mathcal{X}^{t+1},\ a_2 \in \mathcal{U}^t, \qquad \sum_{a_1} K_t(a_1, z, a_2) = 1 \quad \forall\, z \in \mathcal{X}^{t+1},\ a_2 \in \mathcal{U}^t.$$
Thus, $K_t$ is a history-dependent randomized policy, which maps the history of the MDP to an action at time $t$. For the discrete time dynamic system given in equation (2.1), we can interpret the MDP policy as
$$K_t(a_t, i_t) = \mathrm{Prob}\big(u_t = a_t \mid h^{\mathrm{mdp}}_t = i_t\big).$$
The MDP policies described above are called mixed policies, since the decision at time $t$ is specified by a probability distribution that is a function of the information available to the controller.
Stochastic Process Generated by an MDP. Consider an MDP (A, g) and an
MDP policy K. Associated with (A, g) and K is a stochastic process that is induced
by the MDP and its policy. For MDPs evolving over a finite time horizon T , we can
CHAPTER 2. NETWORKED MARKOV DECISION PROCESSES 14
define the sample space of the stochastic process as
Ω = X × U × X . . .U × X = X × UT−1 ×X .
A typical element ω ∈ Ω is given by a sequence of states and actions. For example,
for an infinite horizon model, a typical sample path would be given as
ω = z0, a0, z1, a1, . . . .
Definition 3 (MDP Stochastic Process). Suppose $(A, g)$ is an MDP and $K$ is an MDP policy. Define the state process $x_t(\omega)$ and the action process $u_t(\omega)$ by
$$\mathrm{Prob}(x_{0:t} = z_{0:t},\ u_{0:t} = a_{0:t}) = A_0(z_0) \prod_{k=1}^{t} A_k(z_k, z_{k-1}, a_{k-1}) \prod_{k=0}^{t} K_k(a_k, z_{0:k}, a_{0:k-1}). \qquad (2.3)$$
Note that this implies that for all $t$ we have
$$\mathrm{Prob}(x_t = z_t \mid x_{0:t-1} = z_{0:t-1},\ u_{0:t-1} = a_{0:t-1}) = \mathrm{Prob}(x_t = z_t \mid x_{t-1} = z_{t-1},\ u_{t-1} = a_{t-1}),$$
$$\mathrm{Prob}(x_t = z_t \mid x_{t-1} = z_{t-1},\ u_{t-1} = a_{t-1}) = A_t(z_t, z_{t-1}, a_{t-1}),$$
$$\mathrm{Prob}(u_t = a_t \mid x_{0:t} = z_{0:t},\ u_{0:t-1} = a_{0:t-1}) = K_t(a_t, z_{0:t}, a_{0:t-1}).$$
The above equations show that the state $x_t$ is conditionally independent of the earlier states and actions given the previous state $x_{t-1}$ and the previous action $u_{t-1}$. Thus, given the policy, the state evolution is Markov, justifying the name.
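The factorization in equation (2.3) doubles as a simulation recipe: sample $x_0$ from $A_0$, then alternate a policy draw and a kernel draw. A minimal sketch on a made-up two-state MDP (the kernel, policy, and cost below are illustrative choices, not objects from this thesis):

```python
import random

X, U = [0, 1], [0, 1]
A0 = [0.5, 0.5]                       # initial state distribution

def A(z1, z2, a):                     # time-invariant transition kernel
    return 0.8 if z1 == a else 0.2    # action a steers toward state a

def K(a, zs, as_):                    # history-dependent randomized policy
    return 0.9 if a == zs[-1] else 0.1   # mostly repeat the current state

def g(z, a):                          # per-stage cost
    return float(z == 1)

def sample_path(T, rng):
    zs = [rng.choices(X, weights=A0)[0]]
    as_ = []
    for t in range(T):
        as_.append(rng.choices(U, weights=[K(a, zs, as_) for a in U])[0])
        zs.append(rng.choices(X, weights=[A(z, zs[-1], as_[-1]) for z in X])[0])
    return zs, as_

# Monte Carlo estimate of the cost J_K over horizon T = 20
rng = random.Random(0)
est = sum(sum(map(g, *sample_path(20, rng))) for _ in range(1000)) / 1000
```

Averaging the accumulated cost over sampled paths, as in the last line, is one way to approximate the expectation that defines $J_K(A, g)$ below.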
As mentioned before, the goal of the Markov decision process formulation is to
make sequential decisions in a stochastic environment. The controller’s objective is
to choose an MDP policy $K$ so as to minimize a cost function. Typically, the cost function has the form
$$J_K(A, g) \triangleq \mathbb{E}\left[\sum_{t=0}^{T} g_t(x_t, u_t)\right].$$
Here, the expectation is taken over the noise processes and is with respect to the probability measure defined in equation (2.3). The notation $J_K(A, g)$ represents the cost of an MDP $(A, g)$ under an MDP policy $K$. In this sense, the sequence $g$ represents the cost function or the objective that the decision-maker wishes to minimize.
2.2 Partially Observed Markov Decision Processes
A POMDP is an extension of an MDP, where the state of the system is not fully
observable [40, 45]. Thus, the decision-maker needs to make the decision with only
partial knowledge of the state of the system. The set of all possible observations as
seen by the decision-maker is denoted by a finite set Y .
Definition 4 (POMDP). A partially observed Markov decision process is a tuple $(A, C, g)$ where:

1. $(A, g)$ is a Markov decision process.

2. $C$ is a sequence $C_0, C_1, \ldots$ with $C_t : \mathcal{Y} \times \mathcal{X} \to [0, 1]$, such that
$$C_t(s, z) \ge 0 \quad \forall\, s \in \mathcal{Y},\ z \in \mathcal{X}, \qquad \sum_{s} C_t(s, z) = 1 \quad \forall\, z \in \mathcal{X}.$$
Intuitively, $C_t$ governs the observation received by the controller at time $t$. Akin to MDPs, the decision in a POMDP is made based on the information available to the decision-maker. We define $h^{\mathrm{pomdp}}_t$ to be the information available to the controller at time $t$, given by
$$h^{\mathrm{pomdp}}_t = (u_{0:t-1},\, y_{0:t}).$$
Also, we use $i^{\mathrm{pomdp}}_t$ to denote a realization of $h^{\mathrm{pomdp}}_t$ as
$$i^{\mathrm{pomdp}}_t = (a_{0:t-1},\, s_{0:t}).$$
Definition 5 (POMDP Policy). A POMDP policy is a sequence $K = (K_0, K_1, \ldots)$ where $K_0 : \mathcal{U} \times \mathcal{Y} \to [0, 1]$ and $K_t : \mathcal{U} \times \mathcal{Y}^{t+1} \times \mathcal{U}^t \to [0, 1]$ for all $t \in \mathbb{Z}_{++}$, such that
$$K_0(a, s) \ge 0 \quad \forall\, a \in \mathcal{U},\ s \in \mathcal{Y}, \qquad \sum_{a} K_0(a, s) = 1 \quad \forall\, s \in \mathcal{Y},$$
and for all $t \in \mathbb{Z}_{++}$ we have
$$K_t(a_1, s, a_2) \ge 0 \quad \forall\, a_1 \in \mathcal{U},\ s \in \mathcal{Y}^{t+1},\ a_2 \in \mathcal{U}^t, \qquad \sum_{a_1} K_t(a_1, s, a_2) = 1 \quad \forall\, s \in \mathcal{Y}^{t+1},\ a_2 \in \mathcal{U}^t.$$
For partially observed discrete time dynamic processes, the POMDP policy gives
a probability distribution over possible actions or controls as a function of the infor-
mation available to the decision-maker. That is,
$$K_t(a_t, i_t) = \mathrm{Prob}\big(u_t = a_t \mid h^{\mathrm{pomdp}}_t = i_t\big).$$
Stochastic Process Generated by a POMDP. Consider a POMDP (A,C, g)
and a POMDP policy K. Associated with every (A,C, g) and K is a stochastic
process that is induced by it. For a POMDP evolving over a finite horizon $T$, we can define the sample space of the stochastic process as
$$\Omega^{\mathrm{pomdp}} = \mathcal{X} \times \mathcal{Y} \times \mathcal{U} \times \mathcal{X} \times \cdots \times \mathcal{Y} \times \mathcal{U} \times \mathcal{X} = (\mathcal{X} \times \mathcal{Y} \times \mathcal{U})^{T-1} \times \mathcal{X}.$$
A typical sample path for the infinite horizon POMDP would be given as
$$\omega = z_0, s_0, a_0, z_1, s_1, a_1, \ldots.$$
Definition 6 (POMDP Stochastic Process). Consider a POMDP $(A, C, g)$ along with a POMDP policy $K$. Define the state process $x_t(\omega)$, the observation process $y_t(\omega)$ and the action process $u_t(\omega)$ by
$$\mathrm{Prob}(x_{0:t} = z_{0:t},\ y_{0:t} = s_{0:t},\ u_{0:t} = a_{0:t}) = A_{0:t}\, C_{0:t}\, K_{0:t}. \qquad (2.4)$$
Here we have suppressed the arguments for notational compactness. Note that this implies that for all $t$ we have
$$\mathrm{Prob}(x_t = z_t \mid x_{0:t-1} = z_{0:t-1},\ u_{0:t-1} = a_{0:t-1}) = \mathrm{Prob}(x_t = z_t \mid x_{t-1} = z_{t-1},\ u_{t-1} = a_{t-1}),$$
$$\mathrm{Prob}(x_t = z_t \mid x_{t-1} = z_{t-1},\ u_{t-1} = a_{t-1}) = A_t(z_t, z_{t-1}, a_{t-1}),$$
$$\mathrm{Prob}(y_t = s_t \mid x_t = z_t) = C_t(s_t, z_t),$$
$$\mathrm{Prob}(u_t = a_t \mid y_{0:t} = s_{0:t},\ u_{0:t-1} = a_{0:t-1}) = K_t(a_t, s_{0:t}, a_{0:t-1}).$$
Similar to MDPs, given a fixed policy, the state process $x_t$ and the observation process $y_t$ are both Markov. The POMDP policy depends only on the observation vector $y$ and not on the actual state vector $x$, justifying the "partially observed" part of the name.
The cost function for POMDPs is given as
$$J_K(A, C, g) = \mathbb{E}\left[\sum_{t=0}^{T} g_t(x_t, u_t)\right],$$
where the expectation is taken with respect to the marginal probability measure derived from equation (2.4). Here $J_K(A, C, g)$ represents the cost of a POMDP $(A, C, g)$ under a POMDP policy $K$. The objective of the decision-maker is to find a POMDP policy which minimizes the expected cost.
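Equation (2.4) factorizes a sample path into alternating kernel, sensor, and policy draws, which again gives a direct simulation recipe. A minimal sketch with made-up two-state kernels (a noisy sensor over a controlled chain; all numbers illustrative):

```python
import random

X, Y, U = [0, 1], [0, 1], [0, 1]
A0 = [0.5, 0.5]

def A(z1, z2, a):    # state kernel: action a steers toward state a
    return 0.8 if z1 == a else 0.2

def C(s, z):         # observation kernel: sensor correct w.p. 0.85
    return 0.85 if s == z else 0.15

def K(a, ss, as_):   # the policy reads observations only, never the state
    return 0.9 if a == ss[-1] else 0.1

def sample_path(T, rng):
    zs = [rng.choices(X, weights=A0)[0]]
    ss, as_ = [], []
    for t in range(T):
        ss.append(rng.choices(Y, weights=[C(s, zs[-1]) for s in Y])[0])
        as_.append(rng.choices(U, weights=[K(a, ss, as_) for a in U])[0])
        zs.append(rng.choices(X, weights=[A(z, zs[-1], as_[-1]) for z in X])[0])
    return zs, ss, as_
```

Note the information pattern: the state list `zs` is produced by the simulator but is never passed to the policy `K`, which sees only `ss` and `as_`.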
2.2.1 Information State for POMDPs
An information state for a POMDP represents all the information about the history
of the POMDP that is relevant to the selection of an optimal control. A POMDP
can be reformulated as an MDP using the information state. For a POMDP, the
information state consists of either a complete history of observations and actions or their corresponding sufficient statistics [40]. We define the term sufficient information state to mean a function of the past observations of the POMDP that is detailed enough to permit an optimal controller to use the history processed through this function as its only input. Using the sufficient information state, a POMDP can be converted into an MDP with an observable state such that the optimal controller for this MDP also minimizes the cost function for the original POMDP.
Definition 7. Suppose $(A, C, g)$ is a POMDP and define a sequence of functions
$$\gamma_t : \mathcal{U}^t \times \mathcal{Y}^{t+1} \to \mathcal{Q}.$$
Let $\xi_t = \gamma_t(u_{0:t-1}, y_{0:t})$. Then $\xi_t$ is called a sufficient information state for the POMDP if there exists an MDP $(\bar{A}, \bar{g})$ over the state space $\mathcal{Q}$ and the action space $\mathcal{U}$ such that, for all POMDP policies $K$, we have:

1. $\bar{A}$ is a sequence such that
$$\bar{A}_{t+1}(q_{t+1}, q_t, a_t) = \mathrm{Prob}(\xi_{t+1} = q_{t+1} \mid \xi_{0:t} = q_{0:t},\ u_{0:t} = a_{0:t}). \qquad (2.5)$$

2. $\bar{g}$ is a sequence $\bar{g}_0, \bar{g}_1, \ldots$ such that
$$\bar{g}_t(q_t, a_t) = \mathbb{E}\big(g_t(x_t, a_t) \mid \xi_t = q_t,\ u_t = a_t\big). \qquad (2.6)$$

3. For all $t \ge 0$, we have
$$\mathrm{Prob}\big(x_t = z_t \mid \xi_t = \gamma_t(a_{0:t-1}, s_{0:t}), \ldots, \xi_0 = \gamma_0(s_0),\ u_{0:t-1} = a_{0:t-1}\big) = \mathrm{Prob}\big(x_t = z_t \mid y_{0:t} = s_{0:t},\ u_{0:t-1} = a_{0:t-1}\big). \qquad (2.7)$$
Note that $\bar{A}$ in equation (2.5), $\bar{g}_t$ in equation (2.6) and the conditional probability in equation (2.7) are independent of the POMDP policy $K$. Furthermore, equation (2.5) shows that, given the action sequence or the policy, the evolution of $\xi_t$ is Markov.
From the above definition, it is clear that associated with any POMDP is a sufficient information state MDP $(\bar{A}, \bar{g})$. Let $h^{\text{i-mdp}}_t$ be the history of the sufficient information state MDP at time $t$. Then, we have
$$h^{\text{i-mdp}}_t = (u_{0:t-1},\, \xi_{0:t}).$$
We will use $i^{\text{i-mdp}}_t$ to denote a realization of $h^{\text{i-mdp}}_t$ as
$$i^{\text{i-mdp}}_t = (a_{0:t-1},\, q_{0:t}).$$
As before, we define a sufficient information state MDP policy as a mapping from the history of the information state MDP to an action at time $t$. Let $K_t$ be a sufficient information state MDP policy. As before, we can interpret $K_t$ as
$$K_t(a_t, i_t) = \mathrm{Prob}\big(u_t = a_t \mid h^{\text{i-mdp}}_t = i_t\big).$$
The following theorem shows that we can find an optimal POMDP policy by considering the associated MDP over the sufficient information state.
Theorem 8. Consider a POMDP $(A, C, g)$ and let $\mathcal{P}^{\mathrm{pomdp}}$ be the set of all POMDP policies. Let $(\bar{A}, \bar{g})$ be the sufficient information state MDP associated with the given POMDP and let $\mathcal{P}^{\text{i-mdp}}$ be the set of all sufficient information state MDP policies. Then, for any $T$, we have
$$\min_{\substack{K_t \in \mathcal{P}^{\mathrm{pomdp}} \\ t = 0, 1, \ldots, T}} \sum_{t=0}^{T} \mathbb{E}\big[g_t(z_t, a_t)\big] = \min_{\substack{K_t \in \mathcal{P}^{\text{i-mdp}} \\ t = 0, 1, \ldots, T}} \sum_{t=0}^{T} \mathbb{E}\big[\bar{g}_t(q_t, a_t)\big].$$

Proof. The proof follows from standard dynamic programming techniques as given in Chapter 6 of [40].
From the above theorem, it is clear that one can find an optimal policy for a POMDP by transforming it into a sufficient information state MDP. Given an optimal sufficient information state policy $K^{\mathrm{opt}}$, one may immediately compute the optimal POMDP policy by composing $K^{\mathrm{opt}}$ with $\gamma$. The optimal sufficient information state policy $K^{\mathrm{opt}}$ may be found using the standard dynamic programming recursion. From [48], we know that the optimal policy for an MDP is a function of its current state. In other words, the optimal policy for a POMDP is just a function of its sufficient information state $\xi_t$. One such sufficient information state is the entire history of the POMDP, where $\gamma_t$ is the identity function [40]. As we show below, for a certain class of POMDPs (in particular, for networked MDPs), the sufficient information state includes only a finite past history of observations and control actions. In other words, for this class of POMDPs, the function $\gamma_t$ is a projection operator. Also note that the above theorem can easily be extended to the infinite horizon case (both average cost and discounted cost), as long as the limiting value of the sum of the costs is well defined. For the discounted infinite horizon case, we can incorporate the discount factor into the time-dependent cost function.
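For a general POMDP, the classical sufficient statistic mentioned above (the one cited to [40]) is the conditional distribution of the state given the history, i.e., the belief, which can be maintained recursively instead of storing the whole history. A minimal sketch of that recursion, using hypothetical time-invariant two-state kernels in the style of Definitions 1 and 4:

```python
def belief_update(b, a, s, A, C, X):
    """One step of the belief recursion: prior belief b over X, action a,
    then observation s; returns Prob(x_t = . | y_{0:t}, u_{0:t-1})."""
    pred = {z1: sum(A(z1, z2, a) * b[z2] for z2 in X) for z1 in X}  # time update
    post = {z: C(s, z) * pred[z] for z in X}                        # measurement update
    norm = sum(post.values())
    return {z: p / norm for z, p in post.items()}

# illustrative kernels: action a steers toward state a; sensor correct w.p. 0.85
A = lambda z1, z2, a: 0.8 if z1 == a else 0.2
C = lambda s, z: 0.85 if s == z else 0.15
b1 = belief_update({0: 0.5, 1: 0.5}, 1, 1, A, C, [0, 1])
```

The belief is a continuous-valued statistic; the point of the networked-MDP result below is that, for that class, a finite window of raw observations and actions suffices instead.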
2.3 Networked Markov Decision Processes
A networked Markov decision process (N-MDP) is a weighted directed graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \ldots, n\}$ is a finite set of vertices and $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ is a set of edges. Each vertex $i \in \mathcal{V}$ represents a Markov decision process. An edge $(i, j) \in \mathcal{E}$ if the MDP at vertex $i$ directly affects the MDP at vertex $j$. Associated with each edge $(i, j) \in \mathcal{E}$ is a non-negative integer weight $M_{ij}$, which specifies the delay for the dynamics of vertex $i$ to propagate to vertex $j$. We assume without loss of generality that $(i, i) \notin \mathcal{E}$.

Associated with each $j \in \mathcal{V}$, let $\mathrm{Pa}_j$ be the set of all vertices with an incoming edge to vertex $j$, specifically
$$\mathrm{Pa}_j = \{\, i \in \mathcal{V} \mid (i, j) \in \mathcal{E} \,\}.$$
Similarly, for each $j \in \mathcal{V}$, let $\mathrm{Ch}_j$ be the set of all vertices connected by an edge outgoing from vertex $j$, specifically
$$\mathrm{Ch}_j = \{\, i \in \mathcal{V} \mid (j, i) \in \mathcal{E} \,\}.$$
Thus, $\mathrm{Pa}_j$ is the set of vertices that affect the system at node $j$, and $\mathrm{Ch}_j$ is the set of vertices that are affected by the system at node $j$. At each time $t$, the state of the MDP at vertex $i$ belongs to a finite set $\mathcal{X}^i$. The decision or control action taken at vertex $i$ is drawn from a finite set $\mathcal{U}^i$.
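The parent and child sets fall straight out of the edge list. A small sketch, using the edge set of the four-vertex network in Figure 2.2 (the delay values $M_{ij}$ attached to the edges are made up for illustration):

```python
# edges carry the propagation delays M_ij; delay values are hypothetical
V = {1, 2, 3, 4}
M = {(1, 2): 1, (2, 1): 1, (2, 3): 2, (4, 2): 1, (3, 4): 1, (4, 3): 2}

Pa = {j: {i for i in V if (i, j) in M} for j in V}   # incoming edges into j
Ch = {j: {i for i in V if (j, i) in M} for j in V}   # outgoing edges from j
```

For instance, vertex 2 is affected by vertices 1 and 4 (its parents) and in turn affects vertices 1 and 3 (its children).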
Remark. In the remainder of this section, we denote $\mathcal{X}^{-i} = \prod_{j \in \mathrm{Pa}_i} \mathcal{X}^j$. Also denote by $\mathcal{X}^{(n)} = \prod_{i=1}^{n} \mathcal{X}^i$ the Cartesian product of the state spaces corresponding to all vertices. Similarly, let $\mathcal{U}^{(n)} = \prod_{i=1}^{n} \mathcal{U}^i$.
Definition 9. A networked Markov decision process is a tuple $(A, g)$ where:

1. $A$ is a set of transition matrices $\{A^i_t,\ t \ge 0 \mid i \in \mathcal{V}\}$ with $A^i_0 : \mathcal{X}^i \to [0, 1]$ for all $i \in \mathcal{V}$, such that for all $z \in \mathcal{X}^i$ we have
$$A^i_0(z) \ge 0 \quad\text{and}\quad \sum_{z} A^i_0(z) = 1.$$
For $t \ge 1$, we have $A^i_t : \mathcal{X}^i \times \mathcal{X}^i \times \mathcal{X}^{-i} \times \mathcal{U}^i \to [0, 1]$ such that, for all $i \in \mathcal{V}$, all $a \in \mathcal{U}^i$ and all $z \in \mathcal{X}^{-i}$, we have
$$A^i_t(z_1, z_2, z, a) \ge 0 \quad \forall\, z_1, z_2 \in \mathcal{X}^i, \qquad \sum_{z_1} A^i_t(z_1, z_2, z, a) = 1 \quad \forall\, z_2 \in \mathcal{X}^i.$$

2. $g$ is a sequence $g_0, g_1, \ldots$ with $g_t : \mathcal{X}^{(n)} \times \mathcal{U}^{(n)} \to \mathbb{R}$.
As an example of a networked Markov decision process, consider a networked system consisting of four subsystems, as shown in Figure 2.1. The system dynamics are
$$x^i_{t+1} = f^i\big(x^i_t,\ \{x^j_{t-M_{ji}} \mid j \in \mathrm{Pa}_i\},\ u^i_t,\ w^i_t\big), \qquad (2.8)$$
for all $i \in \mathcal{V}$. Here $u^i_t \in \mathcal{U}^i$ is the control action applied to subsystem $i$ at time $t$. The random variables $x^i_0$, $w^i_t$ for $t \ge 0$ and $i \in \mathcal{V}$ are independent, i.e., the noise processes are independent across both time and subsystems. The directed graph corresponding to this networked MDP is shown in Figure 2.2.
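The only subtlety in simulating equation (2.8) is bookkeeping: subsystem $i$ reads its neighbour $j$'s state from $M_{ji}$ steps in the past. A sketch for a hypothetical two-subsystem network (the delays, binary state sets, and update rule $f^i$ are all illustrative):

```python
import random

M = {(1, 2): 1, (2, 1): 2}        # M_ji: delay from j into i's dynamics
Pa = {1: [2], 2: [1]}
rng = random.Random(0)
hist = {1: [0], 2: [0]}           # hist[i][t] = x^i_t, with x^i_0 = 0

def step(t, u):
    for i in (1, 2):
        # neighbour states arrive only after the propagation delay M_ji
        delayed = [hist[j][t - M[(j, i)]] for j in Pa[i] if t - M[(j, i)] >= 0]
        w = rng.choice([0, 1])    # noise, i.i.d. across time and subsystems
        # illustrative f^i: parity of own state, delayed neighbours, input, noise
        hist[i].append((hist[i][t] + sum(delayed) + u[i] + w) % 2)

for t in range(5):
    step(t, {1: 0, 2: 1})
```

Keeping the full history per subsystem, as done here, is wasteful; the information-state result of Chapter 3 bounds how much of it is actually needed.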
Figure 2.1: A network of interconnected subsystems with delays. Subsystem $i$ is denoted by $S_i$, the network propagation delay from $S_i$ to $S_j$ is denoted by $M_{ij}$, and the measurement delay from $S_i$ to the controller is denoted by $N_i$.
Associated with this system is a networked MDP $(A, g)$ as defined below. For $p \in \mathcal{X}^i$, let $A^i_0(p) = \mathrm{Prob}(x^i_0 = p)$ be the probability mass function of the initial state of subsystem $i \in \mathcal{V}$. The initial states $x^1_0, \ldots, x^n_0$ are chosen independently. For $t > 0$, let
$$A^i_t(z, p, q, a) = \mathrm{Prob}\big(x^i_t = z \mid x^i_{t-1} = p,\ \{x^j_{t-1-M_{ji}} = q_j \mid j \in \mathrm{Pa}_i\},\ u^i_{t-1} = a\big) \qquad (2.9)$$
be the conditional probability mass function of the state $x^i_t$ given the previous states $x^i_{t-1}$ and $\{x^j_{t-1-M_{ji}} \mid j \in \mathrm{Pa}_i\}$ and the applied input $u^i_{t-1}$. It is easy to verify that the sequence $A$ satisfies the properties in Definition 9. The sequence $g_t(x_t, u_t)$ represents the cost at time $t$ and depends on the state of the system $x_t = (x^1_t, \ldots, x^n_t)$ as well as the action $u_t = (u^1_t, \ldots, u^n_t)$ applied at time $t$.
In a networked MDP, the controller needs to choose a control action corresponding
to each vertex i ∈ V. The actions are chosen based on the information available to
Figure 2.2: Directed graph for the network of Figure 2.1.
the controller at time $t$. Associated with each vertex $i \in \mathcal{V}$ of a networked MDP, we have a non-negative integer $N_i$ which specifies the delay in receiving the state measurement from subsystem $i$. We define $h^{\text{n-mdp}}_t$ to be the information available to the decision-maker at time $t$, given by
$$h^{\text{n-mdp}}_t = \big(x^1_{0:t-N_1},\, u^1_{0:t-1},\, \ldots,\, x^n_{0:t-N_n},\, u^n_{0:t-1}\big).$$
Also define $i^{\text{n-mdp}}_t$ to be a realization of $h^{\text{n-mdp}}_t$ as
$$i^{\text{n-mdp}}_t = \big(z^1_{0:t-N_1},\, a^1_{0:t-1},\, \ldots,\, z^n_{0:t-N_n},\, a^n_{0:t-1}\big).$$
Thus, the observations received by the decision-maker at time $t$ consist of the state of each subsystem $i$ delayed by $N_i$ time steps. A networked MDP policy specifies the decisions taken at time $t$.
Definition 10 (Networked MDP Policy). A networked MDP policy is a sequence $K = (K_0, K_1, \ldots)$ where
$$K_0 : \mathcal{U}^{(n)} \times \prod_{i=1}^{n} (\mathcal{X}^i)^{1-N_i} \to [0, 1]$$
and
$$K_t : \mathcal{U}^{(n)} \times \prod_{i=1}^{n} (\mathcal{X}^i)^{t+1-N_i} \times \prod_{i=1}^{n} (\mathcal{U}^i)^{t} \to [0, 1]$$
for all $t \in \mathbb{Z}_{++}$, such that
$$K_0(a, z) \ge 0 \quad \forall\, a \in \mathcal{U}^{(n)},\ z \in \prod_{i=1}^{n} (\mathcal{X}^i)^{1-N_i}, \qquad \sum_{a} K_0(a, z) = 1 \quad \forall\, z \in \prod_{i=1}^{n} (\mathcal{X}^i)^{1-N_i},$$
and for all $t \in \mathbb{Z}_{++}$, for all $a_1 \in \mathcal{U}^{(n)}$, $z \in \prod_{i=1}^{n} (\mathcal{X}^i)^{t+1-N_i}$ and $a_2 \in \prod_{i=1}^{n} (\mathcal{U}^i)^{t}$, we have
$$K_t(a_1, z, a_2) \ge 0, \qquad \sum_{a_1} K_t(a_1, z, a_2) = 1.$$
Note that for all times $t$, the product $\prod_{i=1}^{n} (\mathcal{X}^i)^{t+1-N_i}$ in the above definition is taken over those $i$ for which $t + 1 - N_i$ is strictly positive. For the networked systems given in equation (2.8), a general mixed control policy is defined as a sequence of transition matrices $\{K_t,\ t \ge 0\}$ given by
$$K_t(a_t, i_t) = \mathrm{Prob}\big(u_t = a_t \mid h^{\text{n-mdp}}_t = i_t\big).$$
2.3.1 Networked MDP as a POMDP
In networked MDPs, although the controller receives state information from the subsystems, these states are delayed by different amounts. Thus, a networked MDP can be written as a POMDP. Consider a networked MDP as given in Definition 9. Let us define a new state $\tilde{x}_t = \{x^i_{t-b':t} \mid i \in \mathcal{V}\}$, where we choose $b' = \max_{i,j \in \mathcal{V}} M_{ij} + \max_{i \in \mathcal{V}} N_i$. The state $\tilde{x}$ is chosen such that in the resulting system the observation at time $t$ is only a function of the current state at time $t$. It is easy to check that there exists a function $\tilde{f}$ such that
$$\tilde{x}_{t+1} = \tilde{f}(\tilde{x}_t, u_t, w_t).$$
Associated with this function is a transition probability mass function $\tilde{A}_t(\tilde{z}_{t+1}, \tilde{z}_t, a_t)$, where $\tilde{z}_t$ is the realization of the state $\tilde{x}_t$. The observation at any time $t$ is given as
$$y_t = h(\tilde{x}_t).$$
Corresponding to this observation process is a probability mass function $\tilde{C}_t(s_t, \tilde{z}_t)$, where $s_t$ is the realization of the observation $y_t$ and is given as
$$s_t = \{\, z^i_{t-N_i} \mid i \in \mathcal{V} \,\}.$$
The cost function is given as
$$\tilde{g}_t(\tilde{x}_t, u_t) = g_t(x_t, u_t). \qquad (2.10)$$
It is easy to check that the functions $\tilde{A}_t$, $\tilde{C}_t$ and $\tilde{g}_t$ satisfy the properties given in Definition 4. The networked MDP can thus be written as a POMDP $(\tilde{A}, \tilde{C}, \tilde{g})$.
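The augmentation above can be made concrete with a small sketch: stack the last $b' + 1$ states of every subsystem, and the delayed observation becomes a deterministic read-off from the current augmented state. The delays and history values below are illustrative, not from the thesis.

```python
M = {(1, 2): 1, (2, 1): 2}     # hypothetical propagation delays M_ij
N = {1: 1, 2: 2}               # hypothetical measurement delays N_i
b_prime = max(M.values()) + max(N.values())          # b' = 4

def augment(hist, t):
    """Augmented state: the window x^i_{t-b':t} for each subsystem i."""
    return {i: tuple(hist[i][max(t - b_prime, 0): t + 1]) for i in hist}

def observe(xbar):
    """Observation y_t = {x^i_{t-N_i}}, read off the augmented state alone."""
    return {i: xbar[i][-1 - N[i]] for i in xbar}

# made-up state histories hist[i][k] = x^i_k, k = 0, ..., 5
hist = {1: [10, 11, 12, 13, 14, 15], 2: [20, 21, 22, 23, 24, 25]}
xbar = augment(hist, 5)
```

Because every delayed measurement $x^i_{t-N_i}$ (and every delayed neighbour state used by the dynamics) lies inside the window, `observe` needs no argument other than the augmented state, which is precisely the POMDP property required above.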
As shown in the above subsection, we can write any networked MDP as a POMDP.
In the next chapter, we compute the sufficient information state (as defined in Defi-
nition 7) for networked MDPs.
Chapter 3
Information State for Networked MDPs
In this chapter, we establish the main result associated with the information state for networked MDPs. This result establishes that the sufficient information state for networked Markov decision processes consists only of a finite number of past observations. As we will see, these finite numbers, or bands, depend only on the network structure and the associated delays. We begin by making the following definitions.
Definition 11. Let
$$d_i = \max\Big\{ N_i,\ \max_{k \in \mathrm{Pa}_i} (N_k - M_{ki} - 1) \Big\}, \qquad (3.1)$$
and define the integers $b_i$ by
$$b_i = \max\Big\{ d_i,\ \max_{k \in \mathrm{Ch}_i} (d_k + M_{ik}) \Big\} - N_i. \qquad (3.2)$$
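The bands of Definition 11 are pure graph quantities, so they can be computed directly from the delay data before any control problem is solved. A sketch for a hypothetical two-vertex network (all delay values made up):

```python
V = {1, 2}
M = {(1, 2): 1, (2, 1): 2}     # propagation delays M_ij
N = {1: 1, 2: 3}               # measurement delays N_i
Pa = {j: {i for i in V if (i, j) in M} for j in V}
Ch = {j: {i for i in V if (j, i) in M} for j in V}

# eq. (3.1): how far back the kernels A^i_t can reach into the history
d = {i: max([N[i]] + [N[k] - M[(k, i)] - 1 for k in Pa[i]]) for i in V}
# eq. (3.2): how many extra past states of subsystem i must be retained
b = {i: max([d[i]] + [d[k] + M[(i, k)] for k in Ch[i]]) - N[i] for i in V}
```

In this instance $d = \{1{:}\,1,\ 2{:}\,3\}$ and $b = \{1{:}\,3,\ 2{:}\,0\}$: the controller must keep a longer window of subsystem 1's states because subsystem 2's large measurement delay makes old values of $x^1$ relevant to predicting what it has not yet observed.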
Remark. In the remainder of this chapter, we use the following additional notation. We define a new function $P_t$ for $t \ge 0$ by
$$P_t = A^1_{0:t} A^2_{0:t} \cdots A^n_{0:t}.$$
Define
$$\alpha_t = \{\, z^i_{0:t-N_i},\ a^i_{0:t-1} \mid i \in \mathcal{V} \,\}, \qquad \beta_t = \{\, z^i_{t-N_i-b_i:t-N_i},\ a^i_{t-d_i:t} \mid i \in \mathcal{V} \,\}.$$
Furthermore, the notation $z \notin \alpha_t$ means the set
$$\{\, z \mid z \notin \alpha_t \,\} = \{\, z^i_{t-N_i+1:t} \mid i \in \mathcal{V} \,\},$$
and the notations $z \notin \beta_t$ and $a \notin \beta_t$ mean the sets
$$\{\, z \mid z \notin \beta_t \,\} = \{\, z^i_{0:t-N_i-b_i-1} \mid i \in \mathcal{V} \,\}, \qquad \{\, a \mid a \notin \beta_t \,\} = \{\, a^i_{0:t-d_i-1} \mid i \in \mathcal{V} \,\}.$$
Recall that any list of variables xt1:t2 with t2 < t1 is interpreted as empty.
The following theorem is the main result for networked MDPs. It defines a sufficient information state for a networked Markov decision process. It shows that a networked MDP can be converted into a fully observable MDP with a state that is bounded and does not grow with time. Note that a networked MDP can be written as a POMDP $(\tilde{A}, \tilde{C}, \tilde{g})$ with state $\tilde{x}$.

Theorem 12. Consider a networked Markov decision process. Then
$$\xi_t = \big\{\, u^i_{t-d_i:t-1},\ x^i_{t-N_i-b_i:t-N_i} \mid i \in \mathcal{V} \,\big\} \qquad (3.3)$$
is a sufficient information state for the networked MDP.
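In words, equation (3.3) says the controller may discard everything except, per subsystem, the last $d_i$ actions and a window of $b_i + 1$ delayed states. A sketch of that projection from the full history, using made-up bands (these $N_i$, $d_i$, $b_i$ are illustrative values consistent with Definition 11 for some delay graph):

```python
N, d, b = {1: 1, 2: 3}, {1: 1, 2: 3}, {1: 3, 2: 0}

def xi(t, states, actions):
    """xi_t = {u^i_{t-d_i:t-1}, x^i_{t-N_i-b_i:t-N_i}} for each subsystem i."""
    out = {}
    for i in states:
        u_win = tuple(actions[i][max(t - d[i], 0): t])
        x_win = tuple(states[i][max(t - N[i] - b[i], 0): t - N[i] + 1])
        out[i] = (u_win, x_win)
    return out

# made-up full history: states[i][k] = x^i_k, actions[i][k] = u^i_k
states = {1: list(range(10)), 2: list(range(100, 110))}
actions = {1: [0] * 9, 2: [1] * 9}
q = xi(8, states, actions)
```

The window sizes are fixed by the graph, so the information state has bounded dimension no matter how long the system has been running; this is the $\gamma_t$-as-projection claim made at the end of Chapter 2.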
To prove this theorem, we check the conditions of a sufficient information state
as given in Definition 7. The following key lemma shows that $\xi_t$ as defined in equation (3.3) satisfies the first condition of a sufficient information state as given in equation (2.5).
Lemma 13. Consider a networked Markov decision process $(A, g)$ and a networked MDP policy $K$. Define
$$\bar{A}_{t+1}(q_{t+1}, q_t, a_t) \triangleq \mathrm{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_t = q_t,\ u_t = a_t\big).$$
Then $\xi_t$ satisfies the following Markov property:
$$\bar{A}_{t+1}(q_{t+1}, q_t, a_t) = \mathrm{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_{0:t} = q_{0:t},\ u_{0:t} = a_{0:t}\big),$$
and $\bar{A}$ is independent of the policy $K$.

Proof. Using Bayes' rule, we can write
$$L = \mathrm{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_{0:t} = q_{0:t},\ u_{0:t} = a_{0:t}\big) = \frac{\mathrm{Prob}\big(\xi_{0:t+1} = q_{0:t+1},\ u_{0:t} = a_{0:t}\big)}{\mathrm{Prob}\big(\xi_{0:t} = q_{0:t},\ u_{0:t} = a_{0:t}\big)}. \qquad (3.4)$$
Note that the above conditional probability is well defined when the sequence of random variables $(\xi_{0:t} = q_{0:t},\ u_{0:t} = a_{0:t})$ has non-zero probability. For the event where this sequence has zero probability, we define the conditional probability to be zero. Also note that the sequence $\xi_{0:t}$ consists of the variables $\{x^i_{0:t-N_i},\ u^i_{0:t-1} \mid i \in \mathcal{V}\}$. Let us denote the denominator of equation (3.4) by $L_{\mathrm{den}}$. Then,
$$L_{\mathrm{den}} = \sum_{z \notin \alpha_t} P_t K_{0:t}, \qquad (3.5)$$
where we have used the definition of $\xi_{0:t}$ and the notation $P_t = A^1_{0:t} \cdots A^n_{0:t}$. Note that the transition kernel $A^i_t$ has arguments
$$z^i_t,\ z^i_{t-1},\ a^i_{t-1},\ \{\, z^k_{t-1-M_{ki}} \mid k \in \mathrm{Pa}_i \,\}.$$
We first show that some of the $A^i_t$'s are independent of the variables being summed over. Consider an arbitrary $s \ge 0$, and suppose $A^i_{t-s}$ depends upon at least one
of $z^1_{t-N_1+1:t}, \ldots, z^n_{t-N_n+1:t}$. Then, we must have
$$t - N_i + 1 \le t - s, \quad\text{or}\quad t - N_i + 1 \le t - s - 1, \quad\text{or}\quad t - N_k + 1 \le t - s - 1 - M_{ki} \ \text{ for some } k \in \mathrm{Pa}_i,$$
where each inequality arises from the corresponding argument of $A^i_{t-s}$. This implies that
$$s \le N_i - 1 \quad\text{or}\quad s \le \max\{\, N_k - M_{ki} - 1 \mid k \in \mathrm{Pa}_i \,\} - 1.$$
Hence for each $i$, the largest such $s$ is exactly equal to $d_i - 1$, where $d_i$ is defined by equation (3.1). Thus if $s \ge d_i$ then $A^i_{t-s}$ does not depend on any of $z^1_{t-N_1+1:t}, \ldots, z^n_{t-N_n+1:t}$. In other words, the $A^i_{0:t-d_i}$ are independent of all the variables of summation. Furthermore, note that $K_{0:t}$ only depends on the variables in $\alpha_t$ and hence is independent of the variables of summation. Thus, we can write the denominator of equation (3.4) as
$$L_{\mathrm{den}} = A^1_{0:t-d_1} \cdots A^n_{0:t-d_n}\, K_{0:t} \sum_{z \notin \alpha_t} A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}. \qquad (3.6)$$
Let us denote the numerator of equation (3.4) by $L_{\mathrm{num}}$. Then,
$$L_{\mathrm{num}} = \sum_{z \notin \alpha_{t+1}} P_{t+1} K_{0:t}. \qquad (3.7)$$
Following the same argument as above, it is easy to verify that if $s \ge d_i - 1$, then $A^i_{t-s}$ does not depend on any of $z^1_{t-N_1+2:t+1}, \ldots, z^n_{t-N_n+2:t+1}$. Thus, the $A^i_{0:t-d_i+1}$ are independent of the variables of summation of $L_{\mathrm{num}}$. We can thus write $L_{\mathrm{num}}$ as
$$L_{\mathrm{num}} = A^1_{0:t-d_1} \cdots A^n_{0:t-d_n}\, K_{0:t} \sum_{z \notin \alpha_{t+1}} A^1_{t-d_1+1:t+1} \cdots A^n_{t-d_n+1:t+1}.$$
Canceling the common factors from the numerator and denominator gives
$$L = \frac{\displaystyle\sum_{z \notin \alpha_{t+1}} A^1_{t-d_1+1:t+1} \cdots A^n_{t-d_n+1:t+1}}{\displaystyle\sum_{z \notin \alpha_t} A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}}. \qquad (3.8)$$
Using Bayes' rule, we can write
$$R = \mathrm{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_t = q_t,\ u_t = a_t\big) = \frac{\mathrm{Prob}\big(\xi_{t+1} = q_{t+1},\ \xi_t = q_t,\ u_t = a_t\big)}{\mathrm{Prob}\big(\xi_t = q_t,\ u_t = a_t\big)}. \qquad (3.9)$$
As before, if the denominator is zero, we define the conditional probability to be zero; in that case, the lemma is trivially true. Let $R_{\mathrm{den}}$ denote the denominator of equation (3.9). Using the definition of $\xi_t$, we can write the denominator as
$$R_{\mathrm{den}} = \sum_{a \notin \beta_t} \sum_{z \notin \beta_t} \sum_{z \notin \alpha_t} P_t K_{0:t}.$$
As before, $A^i_{0:t-d_i}$ and $K_{0:t}$ are independent of the variables of summation $z \notin \alpha_t$, and hence we can write $R_{\mathrm{den}}$ as
$$R_{\mathrm{den}} = \sum_{a \notin \beta_t} \sum_{z \notin \beta_t} A^1_{0:t-d_1} \cdots A^n_{0:t-d_n}\, K_{0:t} \times \underbrace{\sum_{z \notin \alpha_t} A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}}_{\overline{R}_{\mathrm{den}}}.$$
Let us determine explicitly what variables Rden depends on. For notational conve-
nience, let us denote
T = A1t−d1+1:t . . . A
nt−dn+1:t.
If $T$ depends on $z^i_s$ then we must have
$$t - d_i \leq s \quad\text{or}\quad t - d_k - M_{ik} \leq s \text{ for some } k \in \mathrm{Ch}^i.$$
The first inequality holds if $z^i_s$ occurs in $A^i_{t-d_i+1:t}$ and the second holds if it occurs in $A^k_{t-d_k+1:t}$. If $\bar{R}_{\mathrm{den}}$ depends on $z^i_{t-N_i-r}$ then
$$t - d_i \leq t - N_i - r \quad\text{or}\quad t - d_k - M_{ik} \leq t - N_i - r \text{ for some } k \in \mathrm{Ch}^i,$$
and these conditions imply that
$$r \leq d_i - N_i \quad\text{or}\quad r \leq \max\{d_k + M_{ik} \mid k \in \mathrm{Ch}^i\} - N_i.$$
Using the definition of $b_i$ in equation (3.2), these two inequalities imply that $r \leq b_i$. Thus $\bar{R}_{\mathrm{den}}$ depends only on $\{a^i_{t-d_i:t-1} \mid i \in V\}$ and $\{z^i_{t-N_i-b_i:t-N_i} \mid i \in V\}$, and hence is independent of the variables $a \notin \beta_t$ and $z \notin \beta_t$. Thus, we can write
$$R_{\mathrm{den}} = \left(\sum_{a \notin \beta_t} \sum_{z \notin \beta_t} A^1_{0:t-d_1} \cdots A^n_{0:t-d_n} K_{0:t}\right) \left(\sum_{z \notin \alpha_t} A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}\right). \tag{3.10}$$
Let $R_{\mathrm{num}}$ denote the numerator of equation (3.9). Then,
$$R_{\mathrm{num}} = \sum_{a \notin \beta_t} \sum_{z \notin \beta_t} \sum_{z \notin \alpha_{t+1}} P_{t+1} K_{0:t}.$$
Using the same argument as above, we can write the numerator as
$$R_{\mathrm{num}} = \left(\sum_{a \notin \beta_t} \sum_{z \notin \beta_t} A^1_{0:t-d_1} \cdots A^n_{0:t-d_n} K_{0:t}\right) \left(\sum_{z \notin \alpha_{t+1}} A^1_{t-d_1+1:t+1} \cdots A^n_{t-d_n+1:t+1}\right). \tag{3.11}$$
From equations (3.10) and (3.11) we have
$$R = \frac{\sum_{z \notin \alpha_{t+1}} A^1_{t-d_1+1:t+1} \cdots A^n_{t-d_n+1:t+1}}{\sum_{z \notin \alpha_t} A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}}. \tag{3.12}$$
The result follows from equations (3.8) and (3.12).
The next lemma evaluates the cost function $g_t$ for the induced MDP and shows that it is independent of the POMDP policy.

Lemma 14. The cost function as defined in equation (2.6) is independent of the POMDP policy $K$.

Proof. From equation (2.6), we have
$$g_t(q_t, a_t) = \mathbb{E}\big(g_t(x_t, a_t) \mid \xi_t = q_t, u_t = a_t\big).$$
Using the definition of $g_t$ from equation (2.10), we get
$$g_t(q_t, a_t) = \sum_{z_t} g_t(z_t, a_t)\, \frac{\mathrm{Prob}(z_t, q_t, a_t)}{\mathrm{Prob}(q_t, a_t)}.$$
Using the definition of $\xi_t$, we get
$$\mathrm{Prob}(q_t, a_t) = \sum_{a \notin \beta_t} \sum_{z \notin \beta_t} \sum_{z \notin \alpha_t} P_t K_{0:t}.$$
Thus, $g_t$ is given by
$$g_t(q_t, a_t) = \frac{\sum_{a \notin \beta_t} \sum_{z \notin \beta_t} \sum_{z \notin \alpha_t} g_t(z_t, a_t)\, P_t K_{0:t}}{\sum_{a \notin \beta_t} \sum_{z \notin \beta_t} \sum_{z \notin \alpha_t} P_t K_{0:t}} = \frac{\sum_{z \notin \alpha_t} g_t(z_t, a_t)\, A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}}{\sum_{z \notin \alpha_t} A^1_{t-d_1+1:t} \cdots A^n_{t-d_n+1:t}},$$
where the last equality follows from an argument similar to that for equation (3.10). Thus the cost function is independent of the POMDP policy $K$.
The following lemma shows that the conditional probability density function for the state at time $t$ is the same for the induced MDP and the original POMDP.

Lemma 15. For all $t \geq 0$, we have
$$\mathrm{Prob}\big(x_t = z_t \mid \xi_{0:t} = q_{0:t}, u_{0:t-1} = a_{0:t-1}\big) = \mathrm{Prob}\big(x_t = z_t \mid y_{0:t} = s_{0:t}, u_{0:t-1} = a_{0:t-1}\big), \tag{3.13}$$
where we have used the notation $\gamma_t(s_{0:t}, a_{0:t-1}) = q_t$.

Proof. Note that the sequence $\xi_{0:t}$ consists of the variables $\{x^i_{0:t-N_i}, u^i_{0:t-1} \mid i \in V\}$. Also, from Section 2.3.1, we know that $y_t = \{x^i_{t-N_i} \mid i \in V\}$. The lemma follows directly from these two facts.

Proof of Theorem 12. From Lemmas 13, 14, and 15, we get that $\xi_t$ as defined in equation (3.3) is a sufficient information state for a networked MDP.
Figure 3.1: A networked Markov decision process with action delays. The control action delay to subsystem $S_i$ is denoted by $P_i$.
3.1 Networked MDP with Action Delays

In this section, we extend our result to the case where the control action does not take effect immediately. Consider a networked Markov decision process as shown in Figure 3.1. The system dynamics are
$$x^i_{t+1} = f^i\big(x^i_t, \{x^j_{t-M_{ji}} \mid j \in \mathrm{Pa}^i\}, u^i_{t-P_i}, w^i_t\big)$$
for all $i \in V$. Here $u^i_{t-P_i}$ is the control action applied to subsystem $i$ at time $t - P_i$.

To obtain a sufficient information state for a networked MDP with action delays, we convert this system into a networked MDP with no action delays. To do this, let us define a new state $\bar{x}^i_t = (x^i_t, u^i_{t-P_i:t-1})$ for all $i \in V$. As before, if any $P_i = 0$, we interpret the list $u^i_{t-P_i:t-1}$ as empty and thus $\bar{x}^i_t = x^i_t$. This new state is chosen such that the state evolution of each subsystem at time $t+1$ depends on the current state and action at time $t$. Thus, a networked MDP with action delays can be reformulated as a networked MDP with no action delays, with system dynamics given as
$$\bar{x}^i_{t+1} = \bar{f}^i\big(\bar{x}^i_t, \{\bar{x}^j_{t-M_{ji}} \mid j \in \mathrm{Pa}^i\}, u^i_t, w^i_t\big)$$
for all $i \in V$. Using Theorem 12, we know that a sufficient information state for this new system consists of past states $\bar{x}^i_{t-b_i-N_i:t-N_i}$ and past control actions $u^i_{t-d_i:t-1}$ for all $i \in V$. Let us define a new band $\bar{d}_i$ as
$$\bar{d}_i = \begin{cases} d_i & \text{if } P_i = 0, \\ b_i + N_i + P_i & \text{otherwise.} \end{cases} \tag{3.14}$$
Using this definition, it is easy to check that a sufficient information state for a networked MDP with action delays consists of past states $x^i_{t-b_i-N_i:t-N_i}$ and past control actions $u^i_{t-\bar{d}_i:t-1}$ for all $i \in V$. This gives us the following theorem.
Theorem 16. Consider a networked Markov decision process with action delays. Then,
$$\xi_t = \big\{u^i_{t-\bar{d}_i:t-1},\ x^i_{t-N_i-b_i:t-N_i} \mid i \in V\big\}$$
is a sufficient information state for a networked MDP with action delays.
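The case split in equation (3.14) is a one-liner; the following is a hedged sketch (the function name is ours, not from the text):

```python
# Band d-bar_i from equation (3.14): with an action delay of P_i steps,
# the controller must remember b_i + N_i + P_i past actions for
# subsystem i; with no action delay, the original band d_i suffices.
def augmented_band(d_i, b_i, N_i, P_i):
    return d_i if P_i == 0 else b_i + N_i + P_i

print(augmented_band(d_i=2, b_i=1, N_i=2, P_i=0))  # 2
print(augmented_band(d_i=2, b_i=1, N_i=2, P_i=3))  # 6
```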
3.2 Discussion
From Theorem 12, we note that every networked MDP has a sufficient information state $\xi_t$, given by equation (3.3), which depends only on a finite history of the states and control actions. Thus, from Definition 7, associated with every networked MDP is a tuple $(A, g)$, where $A_t$ is the transition matrix given by
$$A_{t+1}(q_{t+1}, q_t, a_t) = \mathrm{Prob}\big(\xi_{t+1} = q_{t+1} \mid \xi_t = q_t, u_t = a_t\big),$$
and $g_t$ is the cost function associated with this new MDP, given by equation (2.6). From Theorem 8, we note that an optimal controller for the original POMDP can be found by considering the associated sufficient-information-state MDP. An optimal controller can be found using dynamic programming [48, 20] over the state space $Q$ generated by $\xi_t$. This holds for finite horizon, infinite horizon average cost, and infinite horizon discounted cost models. In the next subsection we show that the previously known results on single systems with delayed state observations [8] can be obtained as a special case of our main result.
3.2.1 Single System with Delayed State Observations and Action Delays

We consider control of a single system with a delayed state measurement. This is precisely the information pattern considered in [8], and we show that in this case the above results imply those of [8]. We have dynamics
$$x_{t+1} = f(x_t, u_t, w_t),$$
where, since the system is composed of exactly one subsystem, we have $x^1_t = x_t$. The controller must choose $u_t$ at time $t$, when it has access to $u_0, \ldots, u_{t-1}$ and $x_0, x_1, \ldots, x_{t-N_1}$. Then, from Definition 11 we have
$$d_1 = N_1 \quad\text{and}\quad b_1 = 0,$$
and so the optimal control action $u_t$ is a memoryless function of $u_{t-N_1}, \ldots, u_{t-1}$ and $x_{t-N_1}$. Thus, the optimal controller applied at time $t$ is a function of the last observed state and the previous $N_1$ actions, which is exactly the result of [8].

A single system with both observation and action delays was analyzed in [39]. Consider a single system with both a delayed state measurement of $N_1$ steps and a delay of $P_1$ steps in the control action. From Theorem 16, we know that the control action at time $t$ is a function of $u_{t-N_1-P_1}, \ldots, u_{t-1}$ and the state $x_{t-N_1}$, which is exactly the result obtained in [39].
Figure 3.2: A network of two interconnected subsystems with delays. Here the control input is only applied to subsystem 1.
3.3 Numerical Examples
In this section we consider two numerical examples in which we compute the optimal controller for networked Markov decision processes. In the first example, we study linear scalar systems with delays. For a special class of such systems, one can compute controllers using an approach based on the Youla parametrization, in combination with convex optimization, as in [24]. We observe that for a certain class of systems, the optimal controller uses exactly the amount of past history given in Theorem 12. This shows that the bands computed in the main theorem are tight, in the sense that there are systems where using any less information would yield sub-optimal controllers. As a second example, we study controller design for two interacting queues. Using the bands computed from Theorem 12, we use dynamic programming to explicitly compute the optimal controller. Knowledge of the bands greatly simplifies this computation.
3.3.1 Linear Systems with Delays
As a first example, we compute an optimal controller for the special case of a linear scalar system with delays. For simplicity, we consider a two-system case as shown in Figure 3.2. Note that the control action is only applied to subsystem 1. For this system, the controller is only required to store $b_i + 1$ values of the state of system $i$ and $d_1$ values of the past inputs, where
$$b_1 = \max\{0,\ N_2 + M_{12} - N_1\}, \quad b_2 = \max\{0,\ N_1 + M_{21} - N_2\}, \quad d_1 = \max\{N_1,\ N_2 - M_{21} - 1\}. \tag{3.15}$$
The system dynamics are given by
$$x^1_{t+1} = f^1(x^1_t, u_t, x^2_{t-M_{21}}, w^1_t), \qquad x^2_{t+1} = f^2(x^2_t, x^1_{t-M_{12}}, w^2_t). \tag{3.16}$$
The information available to the controller at time $t$ is
$$y_t = (a_{0:t-1},\ z^1_{0:t-N_1},\ z^2_{0:t-N_2}).$$
The system under consideration has a continuous state space, and the results presented above may be extended to this scenario under appropriate technical assumptions on the probability measures. Specifically, we consider system dynamics which are a special case of those in equation (3.16), given by
$$x^1_{t+1} = x^1_t + 0.25\,x^2_{t-2} + u_t + w^1_t, \qquad x^2_{t+1} = 0.25\,x^1_{t-2} + x^2_t + w^2_t.$$
The noise processes $w^1_t$ and $w^2_t$ are zero-mean, unit-variance white Gaussian noise processes. The initial states $x^1_0$ and $x^2_0$ are independent of each other and are normally distributed with variance $10^{-5}$. The objective is to minimize the cost
$$J = \mathbb{E}\left(\sum_{t=0}^{T-1} \big(\|x_t\|^2 + \|u_t\|^2\big) + \|x_T\|^2\right),$$
which is a standard quadratic cost. We will use a time horizon of $T = 10$. The propagation and measurement delays are
$$M_{12} = 2, \quad M_{21} = 2, \quad\text{and}\quad N_1 = 0, \quad N_2 = 1,$$
so that the controller receives the observations from subsystem 2 after a single time-step delay. For this system, equation (3.15) gives the memory requirements of the optimal controller as
$$b_1 = 3, \quad b_2 = 1, \quad\text{and}\quad d_1 = 0.$$
Therefore at each time $t$ the optimal input $u_t$ is given by a memoryless function of $y^{\mathrm{mem}}_t$, that is, of the data $x^1_{t-3}, x^1_{t-2}, x^1_{t-1}, x^1_t, x^2_{t-2}, x^2_{t-1}$.
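These memory requirements follow directly from equation (3.15); a quick check (the function name is ours, not from the text):

```python
# Memory bands of equation (3.15) for the two-subsystem example,
# where the control is applied only to subsystem 1.
def bands_two_subsystems(N1, N2, M12, M21):
    b1 = max(0, N2 + M12 - N1)   # stored states of subsystem 1: b1 + 1
    b2 = max(0, N1 + M21 - N2)   # stored states of subsystem 2: b2 + 1
    d1 = max(N1, N2 - M21 - 1)   # stored past inputs
    return b1, b2, d1

# Delays used in the numerical example:
print(bands_two_subsystems(N1=0, N2=1, M12=2, M21=2))  # (3, 1, 0)
```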
To compute the optimal controller for this problem, we use an approach based on the Youla parametrization, in combination with convex optimization, as in [24]. A similar approach is used in [49] to compute optimal decentralized controllers. The optimal controller for this problem is
$$\begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_9 \end{bmatrix} = -F \begin{bmatrix} x^1_0 \\ x^1_1 \\ \vdots \\ x^1_9 \end{bmatrix} - G \begin{bmatrix} x^2_0 \\ x^2_1 \\ \vdots \\ x^2_9 \end{bmatrix},$$
where $F$ and $G$ are $10 \times 10$ lower-triangular gain matrices (with entries on the order of $1/10$) whose nonzero entries in each row are confined to the memory bands computed above: row $t$ of $F$ has nonzero entries only in columns $t-3$ through $t$, and row $t$ of $G$ only in columns $t-2$ and $t-1$.
Hence we have
$$\mu(y^{\mathrm{mem}}_t) = -\sum_{s=0}^{T-1} F_{ts}\, x^1_s - \sum_{s=0}^{T-1} G_{ts}\, x^2_s.$$
It is apparent from the structure of these matrices that the control input at time $t$ depends only on the past history of $x$ according to the memory limits $b_1$, $b_2$, and $d_1$ given by equations (3.1) and (3.2).
Figure 3.3: A system of two interacting queues. Here the solid line represents jobs of type R, which enter system 1 and are then transported to system 2 after a delay of $M_{12}$. Similarly, the dashed line represents jobs of type B, which enter system 2 and are transported to system 1 after a delay of $M_{21}$. Of the two queues at each system, the top queue is the high-priority queue.
3.3.2 Controller Design for Finite State Systems
We consider a network of two interconnected queues as shown in Figure 3.3. Our example is inspired by the model of interacting queues studied in [30]; however, our objective here is to illustrate the computation of an optimal controller for such systems. As opposed to previous works, we introduce delays between queues as well as delays in receiving queue state information at a centralized controller. We assume, however, that any control inputs have immediate effect.

Informally, the system description is as follows. Jobs of type R arrive at system 1, while jobs of type B arrive at system 2. The arrival process at each system is independent and identically distributed over time. Furthermore, we assume that the arrival processes at the two systems are independent of each other. At each system, the server maintains two queues, a high-priority queue and a low-priority queue. Jobs of type R are placed in the high-priority queue at system 1, where they are processed and then moved to system 2 after a delay of $M_{12}$ time units. At system 2, these jobs are placed in the low-priority queue. On the other hand, jobs of type B enter the high-priority queue at system 2 and, after being processed at system 2, are moved to system 1 after a delay of $M_{21}$ time units. At system 1, these jobs are placed in the low-priority queue. At each system, if a queue is full, the incoming jobs are dropped.
The server at each system has two modes of operation, a slow mode and a fast mode. In the slow mode, in each time unit, the server serves one job from the high-priority queue (provided the queue is non-empty). In the fast mode, as long as the queues are non-empty, the server serves one job from each of the high- and low-priority queues. After being processed, the high-priority jobs are moved to the other system, while the low-priority jobs exit the system. A centralized controller receives delayed information about the total number of jobs in each queue and decides in which mode each server should operate. At each time step, a cost depending on the number of jobs in each queue and the mode of operation of each server is incurred.
To describe the above system mathematically, we let $x^i_R(t)$ be the number of R jobs in queue $i$ at time $t$. Similarly, we let $x^i_B(t)$ be the number of B jobs in queue $i$ at time $t$. Here both $x^i_R(t)$ and $x^i_B(t)$ are in the set $\{0, 1, 2, \ldots, Q\}$, where $Q$ is the queue length at each system. For simplicity, we assume that all the queues are of the same length. The control action is $u^i(t) \in \{0, 1\}$ for $i = 1, 2$, where $u^i(t) = 0$ represents the slow mode of the server. The system dynamics are
$$x^1_R(t+1) = \max\big\{\min\{x^1_R(t) + w^1(t),\ Q\} - 1,\ 0\big\}$$
$$x^1_B(t+1) = \max\big\{\min\{x^1_B(t) + \mathbf{1}_{x^2_B(t-M_{21}) > 0},\ Q\} - u^1(t),\ 0\big\}$$
$$x^2_B(t+1) = \max\big\{\min\{x^2_B(t) + w^2(t),\ Q\} - 1,\ 0\big\}$$
$$x^2_R(t+1) = \max\big\{\min\{x^2_R(t) + \mathbf{1}_{x^1_R(t-M_{12}) > 0},\ Q\} - u^2(t),\ 0\big\}$$
where $\mathbf{1}_{x > 0}$ is the indicator function. Here $w^i(t)$, $i = 1, 2$, is the number of jobs that arrive at queue $i$ at time $t$. We let the state of each system be the number of R and B jobs at each time, i.e., $x^i(t) = \big(x^i_R(t), x^i_B(t)\big)$ for $i \in \{1, 2\}$. It is easy to check that there exist functions $f^1$ and $f^2$ such that the state dynamics are given by
$$x^1(t+1) = f^1\big(x^1(t),\ x^2(t-M_{21}),\ u^1(t),\ w^1(t)\big),$$
$$x^2(t+1) = f^2\big(x^2(t),\ x^1(t-M_{12}),\ u^2(t),\ w^2(t)\big).$$
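The update above can be transcribed directly into code. The helper below is a sketch (argument names are ours); the delayed terms $x^2_B(t - M_{21})$ and $x^1_R(t - M_{12})$ must be supplied by the caller from stored history:

```python
# One-step update of the two interacting queues. x1 = (x1R, x1B) and
# x2 = (x2R, x2B) are the current states; x2B_del and x1R_del are the
# delayed values x2_B(t - M21) and x1_R(t - M12).
def queue_step(x1, x2, x2B_del, x1R_del, u1, u2, w1, w2, Q):
    x1R, x1B = x1
    x2R, x2B = x2
    # High-priority queues: arrivals capped at Q, one job always served.
    new_x1R = max(min(x1R + w1, Q) - 1, 0)
    new_x2B = max(min(x2B + w2, Q) - 1, 0)
    # Low-priority queues: one arrival if the remote high-priority queue
    # was non-empty M steps ago; served only in fast mode (u = 1).
    new_x1B = max(min(x1B + (1 if x2B_del > 0 else 0), Q) - u1, 0)
    new_x2R = max(min(x2R + (1 if x1R_del > 0 else 0), Q) - u2, 0)
    return (new_x1R, new_x1B), (new_x2R, new_x2B)

print(queue_step((1, 0), (0, 1), 1, 0, 1, 0, 1, 0, 1))  # ((0, 0), (0, 0))
```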
Let $g_s(x^1(t), x^2(t))$ be the cost associated with the state and $g_a(u^1(t), u^2(t))$ be the cost associated with the actions. We assume that the state cost is
$$g_s(x^1(t), x^2(t)) = \big(x^1_R(t) + x^1_B(t) + x^2_R(t) + x^2_B(t)\big)^2.$$
The action cost is
$$g_a(u^1(t), u^2(t)) = \big(u^1(t) + 1 + u^2(t) + 1\big)^2,$$
where we assume that for $u^i(t) = 0$ the cost incurred is 1 unit. The total cost at time $t$ is thus
$$g\big(x^1(t), x^2(t), u^1(t), u^2(t)\big) = (1 - \alpha)\,g_s(x^1(t), x^2(t)) + \alpha\,g_a(u^1(t), u^2(t)),$$
where $\alpha$ is the weighting factor. The objective is to minimize the infinite horizon discounted cost
$$J = \mathbb{E}\left(\sum_{t=0}^{\infty} \beta^t\, g\big(x^1(t), x^2(t), u^1(t), u^2(t)\big)\right) = \mathbb{E}\left(\sum_{t=0}^{\infty} \beta^t \big((1-\alpha)g_s + \alpha g_a\big)\right) = (1-\alpha)J_s + \alpha J_a,$$
where $\beta$ is the discount factor. Here $J_s$ and $J_a$ are the infinite horizon discounted costs associated with the state and action, respectively.
For purposes of numerical computation, we let $Q = 1$ in our specific example. The state space of each system consists of $\{(0, 0), (0, 1), (1, 0), (1, 1)\}$, where the first element in the tuple represents the number of R jobs. The arrival process at both systems is assumed to be Bernoulli, with probability of arrival at the first system $\mathrm{Prob}(w^1(t) = 1) = 0.1$ and probability of arrival at the second system $\mathrm{Prob}(w^2(t) = 1) = 0.3$. The inter-subsystem propagation delays and observation delays are chosen to be
$$M_{12} = 2, \quad M_{21} = 1, \quad\text{and}\quad N_1 = 2, \quad N_2 = 1.$$
Figure 3.4: Infinite horizon discounted action cost $J_a$ (averaged over all initial states) vs. the infinite horizon discounted state cost $J_s$. The curve is plotted by varying the weighting factor $\alpha$.
Using equations (3.1) and (3.2), we find
$$b_1 = 1, \quad b_2 = 2 \quad\text{and}\quad d_1 = 2, \quad d_2 = 1.$$
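A small sketch can reproduce these numbers. The recursions below are our reconstruction of equations (3.1) and (3.2) from the inequalities used in the proofs above (the equations themselves are stated earlier in the chapter), so treat the function as illustrative:

```python
# Bands for a general delay graph. Pa[i] lists the parents of subsystem
# i, Ch[i] its children; M[(j, i)] is the propagation delay from j to i
# and N[i] the observation delay of subsystem i.
def bands(V, Pa, Ch, M, N):
    d = {i: max([N[i]] + [N[k] - M[(k, i)] - 1 for k in Pa[i]]) for i in V}
    b = {i: max([d[i] - N[i]] + [d[k] + M[(i, k)] - N[i] for k in Ch[i]])
         for i in V}
    return d, b

# Two-queue example: M12 = 2, M21 = 1, N1 = 2, N2 = 1.
V = [1, 2]
Pa = {1: [2], 2: [1]}
Ch = {1: [2], 2: [1]}
M = {(1, 2): 2, (2, 1): 1}
N = {1: 2, 2: 1}
d, b = bands(V, Pa, Ch, M, N)
print(d, b)  # {1: 2, 2: 1} {1: 1, 2: 2}
```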
For the discount factor $\beta = 0.75$, Figure 3.4 shows the tradeoff curve of $J_a$ vs. $J_s$. This curve shows the tradeoff between the action cost and the state cost for different values of the weighting factor $\alpha$.
This example illustrates that knowledge of the bands simplifies the computation of the optimal controller. Without knowledge of these bands, one would compute the optimal controller by treating this networked MDP as a POMDP and using dynamic programming over the belief state. Knowledge of the bands allows us to write this networked MDP as a fully observed MDP over the sufficient information state, which greatly simplifies the computation of the optimal controller.
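The dynamic-programming step over the sufficient information state can be sketched as plain value iteration over a finite state space. The following is a generic illustration (the data structures and names are ours, and the one-state toy model used for the sanity check is not the queue example):

```python
# Discounted value iteration over a finite (information-)state space.
# P[s][a] maps next states to probabilities, g[s][a] is the stage cost,
# and beta is the discount factor.
def value_iteration(P, g, beta, tol=1e-9):
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: min(g[s][a]
                        + beta * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in P[s])
                 for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

# Sanity check on a single-state chain: V = g / (1 - beta).
P = {0: {0: {0: 1.0}}}
g = {0: {0: 1.0}}
print(round(value_iteration(P, g, beta=0.75)[0], 6))  # 4.0
```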
As shown in Theorem 12, the sufficient information state for networked MDPs depends only on a finite past history of the observations. This finite history, or the bands, depends only on the network structure and the associated delays. In the next chapter, we look at a special case of networked MDPs over a finite time horizon. Based on ideas from Bayesian networks, we provide an alternate proof of Theorem 12. This alternate proof provides an intuitive explanation for the bands given in Definition 11. In particular, it shows that the bands are finite because, given the finite history of states and actions, the current state of the system is independent of the remaining states and actions.
Chapter 4
A Bayesian Network Approach to
Network MDPs
In this chapter, we restrict our attention to networked MDPs over a finite time horizon. For this special class of networked MDPs, we provide an alternate proof of Theorem 12 based on ideas from Bayesian networks. We show that the finite history of states and actions obtained in the previous chapter is exactly the information required to estimate the current state of the system. This, along with the separation principle, provides an alternate proof and additional insight into the finite memory of the controllers for networked MDPs. It shows that the bands are finite because, given the finite history of states and actions, the current state of the system is independent of the remaining states and actions. We begin by describing concepts from Bayesian networks.

4.1 Bayesian Networks

A Bayesian network [37], $N_b = \{G_b, \mathcal{P}_b\}$, consists of

• a directed acyclic graph $G_b = (V_b, E_b)$, and

• a set of conditional probability distributions $\mathcal{P}_b$.
Here the subscript $b$ stands for Bayesian and is used to distinguish the Bayesian network graph from the networked MDP graph $G$ defined in the previous section. Associated with each vertex $v \in V_b$ of the graph $G_b$ is a random variable $X_v$ taking values in a particular set. A directed edge $e \in E_b$ between vertices describes the conditional dependence between the random variables corresponding to the vertices. If there is a directed edge from a vertex $v_1$ to $v_2$, we say that $v_2$ is a child of $v_1$ and that $v_1$ is a parent of $v_2$. The set of parent vertices of a vertex $v$ is denoted by $\mathrm{parents}(v)$.

The set of probability distributions $\mathcal{P}_b$ contains one distribution $P\big(X_v \mid X_{\mathrm{parents}(v)}\big)$ for every $v \in V_b$. The joint distribution of all the variables $X_k$, $k = 1, \ldots, n$, is given as
$$\mathrm{Prob}\big(X_1, \ldots, X_n\big) = \prod_{k=1}^{n} \mathrm{Prob}\big(X_k \mid \mathrm{parents}(X_k)\big).$$
An example of a Bayesian network is shown in Figure 4.1. Here the graph $G_b$ consists of vertices $\{A, B, C, D, E, F\}$ and edges $\{A \to C,\ B \to C,\ C \to D,\ C \to E,\ D \to F\}$. The set of probabilities is given as
$$\mathcal{P}_b = \{P(A),\ P(B),\ P(C|A,B),\ P(D|C),\ P(E|C),\ P(F|D)\}.$$
Note that since the variables $A$ and $B$ have no parents, the probability set contains their unconditional probabilities.
Figure 4.1: A Bayesian network with 6 variables.
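The factorization above can be illustrated numerically. In the sketch below the conditional probability tables are invented (uniform over binary values); only the product structure mirrors the text:

```python
from math import prod
from itertools import product as cart

# Parent map of the network in Figure 4.1.
parents = {"A": [], "B": [], "C": ["A", "B"],
           "D": ["C"], "E": ["C"], "F": ["D"]}

def joint(assignment, cpt):
    # cpt[v] maps (value of v, values of parents of v) -> probability.
    return prod(cpt[v][(assignment[v],)
                       + tuple(assignment[p] for p in parents[v])]
                for v in parents)

# Toy CPTs: every conditional probability is 0.5 over binary values.
cpt = {v: {vals: 0.5 for vals in cart((0, 1), repeat=1 + len(parents[v]))}
       for v in parents}
x = {v: 1 for v in parents}
print(joint(x, cpt))  # 0.015625, i.e. 0.5 ** 6
```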
d-Separation. As mentioned before, the graph $G_b$ encodes the conditional dependencies between the variables. Conditional independence between variables is determined by the property of d-separation: if two variables $X$ and $Y$ are d-separated in the graph by a third variable $Z$, then $X$ and $Y$ are conditionally independent given $Z$.

Definition 17. A path $\pi$ in the graph $G_b = \{V_b, E_b\}$ is said to be d-separated by a set of nodes $Z \subseteq V_b$ if and only if one of the following holds:

• $\pi$ contains a chain $i \to z \to j$ such that $i, j \in \pi$ and $z \in Z$,

• $\pi$ contains a fork $i \leftarrow z \to j$ such that $i, j \in \pi$ and $z \in Z$, or

• $\pi$ contains an inverted fork (or collider) $i \to z \leftarrow j$ such that $i, j \in \pi$ and neither $z$ nor any of its descendants is in $Z$.
The concept of d-separation is closely tied to that of a Markov blanket. Before we define the Markov blanket, we introduce some notation.

Remark: Consider a set of variables $X = \{X_1, \ldots, X_n\}$. Denote by $\mathcal{P}(X)$ the set consisting of all parents of variables in the set $X$, not including the variables themselves. Similarly, we denote by $\mathcal{CH}(X)$ (and $\mathcal{PCH}(X)$) the set consisting of all children (and parents of children) of variables in the set $X$, not including the variables themselves.

Definition 18 (Markov Blanket). The Markov blanket of a set of variables $X = \{X_1, \ldots, X_n\}$ (denoted by $\mathrm{MB}(X)$) is given as
$$\mathrm{MB}(X) = \mathcal{P}(X) \cup \mathcal{CH}(X) \cup \mathcal{PCH}(X). \tag{4.1}$$
The following theorem (see [37] for the proof) states that the variables in the set $X$ are independent of the rest of the graph given their Markov blanket.

Theorem 19. Given a finite Bayesian network and two distinct variables $X$ and $Y \notin \mathrm{MB}(X)$, we have
$$\mathrm{Prob}\big(X \mid \mathrm{MB}(X), Y\big) = \mathrm{Prob}\big(X \mid \mathrm{MB}(X)\big).$$
The Markov blanket of a set of variables shields those variables from the rest of the graph. Thus, the Markov blanket is the only knowledge required to predict the value of the variables. Furthermore, if all the variables in the Markov blanket of $X$ are known, then $X$ is d-separated from the rest of the graph [37].
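Equation (4.1) translates directly into code. The sketch below represents the DAG by a parents map (our representation) and recovers the blanket of $\{C\}$ in the network of Figure 4.1:

```python
# Markov blanket of a set of variables per equation (4.1): parents,
# children, and parents of children, excluding the set itself.
def markov_blanket(parents, X):
    X = set(X)
    P = set().union(*(parents[v] for v in X))            # parents
    CH = {v for v, ps in parents.items() if set(ps) & X}  # children
    PCH = set().union(*(set(parents[c]) for c in CH)) if CH else set()
    return (P | CH | PCH) - X

# Figure 4.1 network: MB({C}) = {A, B, D, E}.
parents = {"A": [], "B": [], "C": ["A", "B"],
           "D": ["C"], "E": ["C"], "F": ["D"]}
print(sorted(markov_blanket(parents, {"C"})))  # ['A', 'B', 'D', 'E']
```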
4.2 Networked MDPs as Bayesian Networks
In this section, we model networked Markov decision processes as Bayesian networks in a natural way. Consider a networked MDP given by a graph $G = \{V, E\}$, where we let $V = \{1, \ldots, n\}$. As before, for each $i \in V$ we have $x^i_t \in X^i$. For the remainder of this chapter, we consider the evolution of the networked MDP over a finite horizon $T$. Associated with this networked MDP, we can construct a finite Bayesian network $N_b = \{G_b, \mathcal{P}_b\}$. The vertex set $V_b$ is given as
$$V_b = \{v^{\mathrm{state}}_{i,t} \mid i \in V,\ t = 0, 1, \ldots, T\} \cup \{v^{\mathrm{action}}_{i,t} \mid i \in V,\ t = 0, 1, \ldots, T-1\}.$$
Associated with a vertex $v^{\mathrm{state}}_{i,t}$ is the random variable $x^i_t$, taking values in the finite set $X^i$, that corresponds to the state of subsystem $i$ at time $t$. Similarly, associated with a vertex $v^{\mathrm{action}}_{i,t}$ is the random variable $u^i_t$, taking values in the finite set $U^i$, that corresponds to the control action applied to subsystem $i$ at time $t$. The edge set $E_b$ consists of the following edges:
$$E_b = \big\{v^{\mathrm{state}}_{i,t} \to v^{\mathrm{state}}_{i,t+1},\ v^{\mathrm{state}}_{j,t-M_{ji}} \to v^{\mathrm{state}}_{i,t+1},\ v^{\mathrm{action}}_{i,t} \to v^{\mathrm{state}}_{i,t+1},\ v^{\mathrm{state}}_{i,0:t-N_i} \to v^{\mathrm{action}}_{k,t},\ v^{\mathrm{action}}_{i,0:t-1} \to v^{\mathrm{action}}_{k,t} \mid j \in I^i,\ i, k \in V,\ t \in \mathbb{N}\big\}.$$
Here $v^{\mathrm{state}}_{i,0:t-N_i} \to v^{\mathrm{action}}_{k,t}$ is interpreted as a directed edge $v^{\mathrm{state}}_{i,\tau} \to v^{\mathrm{action}}_{k,t}$ for every $\tau = 0, \ldots, t - N_i$. An edge $v^{\mathrm{state}}_{j,t-M_{ji}} \to v^{\mathrm{state}}_{i,t+1}$ means that the random variable $x^j_{t-M_{ji}}$ affects the random variable $x^i_{t+1}$. Similar interpretations hold for the other edges in the edge set $E_b$. The set of conditional probability densities $\mathcal{P}_b$ consists of all the transition probabilities, that is,
$$\mathcal{P}_b = \{A^i_t \mid i \in V,\ t = 0, \ldots, T\} \cup \{K_t \mid t = 0, \ldots, T-1\}.$$
For a finite time horizon $T$, let $S_T$ be the set of random variables given as
$$S_T = \{x^i_t \mid i \in V,\ t = 0, 1, \ldots, T\} \cup \{u^i_t \mid i \in V,\ t = 0, 1, \ldots, T-1\}.$$
The joint probability density function of all the variables in the set $S_T$ can then be written as
$$\mathrm{Prob}(S_T) = A^1_{0:T}\, A^2_{0:T} \cdots A^n_{0:T}\, K_{0:T-1}.$$
Figure 4.2: A network of two interconnected subsystems with delays. Subsystem $i$ is denoted by $S_i$, the network propagation delay from $S_i$ to $S_j$ is denoted by $M_{ij}$, and the measurement delay from $S_i$ to the controller is denoted by $N_i$.
As an example, consider the networked system of Figure 4.2. The system dynamics are given as
$$x^1_{t+1} = f^1(x^1_t,\ x^2_{t-M_{21}},\ u^1_t,\ w^1_t), \qquad x^2_{t+1} = f^2(x^2_t,\ x^1_{t-M_{12}},\ u^2_t,\ w^2_t). \tag{4.2}$$
For the purpose of this example, we choose $M_{12} = 2$ and $M_{21} = 1$. Thus, the transition probability matrices are given as
$$A^1_t\big(z^1_t, z^1_{t-1}, z^2_{t-2}, a^1_{t-1}\big) = \mathrm{Prob}\big(x^1_t = z^1_t \mid x^1_{t-1} = z^1_{t-1},\ x^2_{t-2} = z^2_{t-2},\ u^1_{t-1} = a^1_{t-1}\big) \tag{4.3}$$
and
$$A^2_t\big(z^2_t, z^2_{t-1}, z^1_{t-3}, a^2_{t-1}\big) = \mathrm{Prob}\big(x^2_t = z^2_t \mid x^2_{t-1} = z^2_{t-1},\ x^1_{t-3} = z^1_{t-3},\ u^2_{t-1} = a^2_{t-1}\big). \tag{4.4}$$
Associated with this networked control system is the Bayesian network shown in Figure 4.3. The directed acyclic graph $G_b$ consists of a vertex for each state of the two subsystems and for each of the two control actions applied at time $t$. A directed edge between two vertices $v_1$ and $v_2$ exists if the variable corresponding to vertex $v_1$ affects the variable corresponding to vertex $v_2$. For example, a directed edge exists between the vertex corresponding to $x^2_{t-2}$ and the vertex corresponding to $x^1_t$. Similarly, a directed edge exists between the vertex corresponding to the control action $u^2_{t-1}$ and the vertex corresponding to $x^2_t$. The set of probability distributions $\mathcal{P}_b$ consists of the transition probabilities $A^1_t$, $A^2_t$, and $K_t$ for all $t \geq 0$.
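The parent structure of a state vertex can be generated mechanically from the delays. The sketch below (names are ours) reproduces the dependence of $x^1_t$ on $x^2_{t-2}$ implied by the transition matrix (4.3):

```python
# Parents of the state vertex x^i_{t+1} in the Bayesian network built
# from a networked MDP: own state and action at time t, plus each
# parent subsystem's state delayed by M[(j, i)].
def state_parents(i, t_plus_1, Pa, M):
    t = t_plus_1 - 1
    return ([("x", i, t), ("u", i, t)]
            + [("x", j, t - M[(j, i)]) for j in Pa[i]])

# Two-subsystem example with M12 = 2, M21 = 1.
Pa = {1: [2], 2: [1]}
M = {(2, 1): 1, (1, 2): 2}
print(state_parents(1, 5, Pa, M))  # [('x', 1, 4), ('u', 1, 4), ('x', 2, 3)]
```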
4.3 Alternate Proof of the Information State for Networked MDPs

In this section, we provide an alternate proof of the finiteness of the information state for networked MDPs. We start with the following definition.

Definition 20. Define
$$h^{\mathrm{mem}}_t = \big(x^1_{t-N_1-b_1:t-N_1},\ u^1_{t-d_1:t-1},\ \ldots,\ x^n_{t-N_n-b_n:t-N_n},\ u^n_{t-d_n:t-1}\big) \tag{4.5}$$
to be the finite history of observations at time $t$, and denote by
$$i^{\mathrm{mem}}_t = \big(z^1_{t-N_1-b_1:t-N_1},\ a^1_{t-d_1:t-1},\ \ldots,\ z^n_{t-N_n-b_n:t-N_n},\ a^n_{t-d_n:t-1}\big)$$
a realization of $h^{\mathrm{mem}}_t$. Further, define the set $H^{\mathrm{mem}}_t$ as
$$H^{\mathrm{mem}}_t = \prod_{i=1}^{n} (X^i)^{b_i+1} \times \prod_{i=1}^{n} (U^i)^{d_i}.$$
Figure 4.3: The Bayesian network associated with the 2-subsystem networked MDP of Figure 4.2. Here the circles represent the states of the two subsystems and the squares represent the control inputs. For this Bayesian network, we chose $M_{21} = 1$ and $M_{12} = 2$. The edges from state variables to control inputs have been omitted for visual clarity.
From the separation principle [12], we know that the optimal control action is a function of the belief state. We define the set of belief states at time $t$ as follows.

Definition 21. Let $M_t$ be the set defined as
$$M_t = \Big\{\Lambda_t : X^{(n)} \times H_t \to [0, 1] \,\Big|\, \Lambda_t(z_t, i_t) \geq 0,\ \sum_{z_t} \Lambda_t(z_t, i_t) = 1\Big\},$$
where we denote by $X^{(n)} = \prod_{i=1}^{n} X^i$ the Cartesian product of the state spaces corresponding to all vertices.

Here, $\Lambda_t(z_t, i_t)$ is interpreted as the conditional probability density of the current state of the system given the entire observation history at time $t$. That is,
$$\Lambda_t(z_t, i_t) = \mathrm{Prob}(x_t = z_t \mid h_t = i_t).$$
Let $F_t : H_t \to M_t$ be the operator that maps the entire observation history at time $t$ to an element of $M_t$; that is, the operator $F_t$ maps the observation history to a belief state. Furthermore, let $T_t : M_t \to A$ be the operator that maps the belief state to a control action. From the separation principle [12], we know that the optimal control $K^*_t$, as a function of the observation history $i_t$, is given as
$$K^*_t = T_t \circ F_t.$$
That is, $K^*_t(a_t, i_t) = T_t(a_t, \Lambda_t(\cdot, i_t))$.
To prove the main theorem, we show that for networked MDPs there exists an optimal controller that depends only on $i^{\mathrm{mem}}_t$. Let $P : H_t \to H^{\mathrm{mem}}_t$ be the projection operator that projects the entire observation history onto the truncated history defined in equation (4.5). The following theorem shows that there exists an operator $F^{\mathrm{mem}}_t : H^{\mathrm{mem}}_t \to M_t$ such that
$$F_t = F^{\mathrm{mem}}_t \circ P.$$

Theorem 22. For a networked Markov decision process, there exist $\Lambda^*_0, \ldots, \Lambda^*_T$ such that
$$\Lambda_t(z_t, i_t) = \Lambda^*_t(z_t, i^{\mathrm{mem}}_t) \quad \forall\, t = 0, 1, \ldots, T. \tag{4.6}$$
Thus, there exists an optimal controller $K^*_0, \ldots, K^*_{T-1}$ such that
$$K^*_t(a_t, i_t) = T_t\big(a_t, \Lambda^*_t(\cdot, i^{\mathrm{mem}}_t)\big) = \bar{K}_t(a_t, i^{\mathrm{mem}}_t) \quad \forall\, t = 0, 1, \ldots, T-1. \tag{4.7}$$
Thus, the $b_i$'s are the bounds on the length of the observation history that an optimal estimator needs to maintain beyond its current observation.
Before we present the proof of Theorem 22, we first prove a key lemma.

Lemma 23. Suppose there exist optimal $K^*_j$, $j = t+1, \ldots, T-1$, such that
$$K^*_j(a_j, i_j) = \bar{K}_j\big(a_j, i^{\mathrm{mem}}_j\big)$$
for all $a_j$. Then
$$K^*_t(a_t, i_t) = \bar{K}_t(a_t, i^{\mathrm{mem}}_t)$$
for all $a_t$.
Proof. From the separation principle [12], we know that
$$K^*_t(a_t, i_t) = T_t\big(a_t, \Lambda_t(\cdot, i_t)\big).$$
Thus, to prove the lemma it suffices to show that $\Lambda_t(z_t, i_t) = \Lambda^*_t(z_t, i^{\mathrm{mem}}_t)$. At time $t$, the controller knows $i_t = \{z^i_{0:t-N_i}, a^i_{0:t-1} \mid i \in V\}$. Let
$$S^u_t = \big(x^1_{t-N_1+1:t}, \ldots, x^n_{t-N_n+1:t}\big)$$
be the states that are unknown to the controller at time $t$. Here the superscript $u$ indicates that these states are unknown to the controller at time $t$. Note that the states of subsystem $i$ are part of $S^u_t$ if and only if $N_i \geq 1$: if $N_i = 0$, then the current state of subsystem $i$ is known to the controller. Let
$$Z^u_t = \big(z^1_{t-N_1+1:t}, \ldots, z^n_{t-N_n+1:t}\big)$$
be a realization of $S^u_t$. Let $L_t(Z^u_t, i_t)$ be the joint conditional probability of the variables in the set $S^u_t$ given $i_t$. That is,
$$L_t(Z^u_t, i_t) = \mathrm{Prob}\big(S^u_t = Z^u_t \mid h_t = i_t\big).$$
Define
$$L^*_t(Z^u_t, i^{\mathrm{mem}}_t) = \mathrm{Prob}\big(S^u_t = Z^u_t \mid h^{\mathrm{mem}}_t = i^{\mathrm{mem}}_t\big).$$
If we can show that there exists an $L^*_t$ such that
$$L_t(Z^u_t, i_t) = L^*_t(Z^u_t, i^{\mathrm{mem}}_t), \tag{4.8}$$
then it follows that
$$\Lambda_t(z_t, i_t) = \sum_{\{z^i_{t-N_i+1:t-1} \mid i \in V\}} L_t(Z^u_t, i_t) = \sum_{\{z^i_{t-N_i+1:t-1} \mid i \in V\}} L^*_t(Z^u_t, i^{\mathrm{mem}}_t) = \Lambda^*_t(z_t, i^{\mathrm{mem}}_t). \tag{4.9}$$
Thus, to prove the lemma it suffices to find an $L^*_t$ satisfying equation (4.8). To prove the existence of such an $L^*_t$, we show that the Markov blanket of the set $S^u_t$ consists of the variables in $i^{\mathrm{mem}}_t$; Theorem 19 then proves the existence of $L^*_t$.

Note that $S^u_t$ contains $x^j_{t-\tau_j}$ for $\tau_j = 0, 1, \ldots, N_j - 1$ and $j = 1, 2, \ldots, n$. From equation (4.1), we know that the Markov blanket of $S^u_t$ consists of the parents, the children, and the parents of children of the variables in the set $S^u_t$. We focus on a single variable $x^j_{t-\tau_j}$ and find its parents, its children, and all the parents of its children.
To find the parents of xjt−τj
, we look at the transition probability of this variable.
From equation (2.9), we note that xjt−τj
depends on
P(xj
t−τj
)=
xjt−τj−1, u
jt−τj−1, x
st−(τj+1+Msj)
| s ∈ Ij
, (4.10)
and hence these variables are the parents of xjt−τj
.
To find the children of $x^j_{t-\tau_j}$, consider the set $O_j$ of outgoing vertices of subsystem $j$ and let $p \in O_j$. Consider $A^p_{t-t'}$ and note that this transition probability contains $x^j_{t-t'-1-M_{jp}}$. Thus, $x^j_{t-\tau_j}$ is a parent of $x^p_{t-t'}$ for all $p \in O_j$ if $t - t' - 1 - M_{jp} = t - \tau_j$, which gives $t' = \tau_j - 1 - M_{jp}$.
Note that the children of $x^j_{t-\tau_j}$ also consist of all the control variables that depend on $x^j_{t-\tau_j}$. From the assumption in the lemma, we know that $K^*_{t+1:T-1}$ are only a function of the finite past history of states given by $i^{mem}$. Thus, a directed edge exists between $x^j_{t-\tau_j}$ and $u_{t-t'}$ for all $t' = \tau_j - N_j - b_j : \tau_j - N_j$. Thus, the children of $x^j_{t-\tau_j}$ consist of
$$CH\left(x^j_{t-\tau_j}\right) = \left\{x^j_{t-\tau_j+1},\, x^p_{t-\tau_j+M_{jp}+1} \mid p \in O_j\right\} \cup \left\{u^k_{t-\tau_j+N_j\,:\,t-\tau_j+N_j+b_j} \mid k \in \mathcal{V}\right\}. \tag{4.11}$$
To find the parents of children of $x^j_{t-\tau_j}$, we find the parents of the variables given in equation (4.11). From the transition probability equation (2.9), we note that the parents of $x^p_{t-\tau_j+M_{jp}+1}$ include
$$\left\{x^p_{t-\tau_j+M_{jp}},\, u^p_{t-\tau_j+M_{jp}},\, x^r_{t-\tau_j+M_{jp}-M_{rp}} \mid r \in I_p\right\}.$$
To find the parents of $\{u^k_{t-\tau_j+N_j\,:\,t-\tau_j+N_j+b_j} \mid k \in \mathcal{V}\}$, we note from the assumption in the lemma that these control inputs depend only on $i^{mem}_t$. Thus, the parents of $u^k_{t-\tau_j+N_j\,:\,t-\tau_j+N_j+b_j}$, $k \in \mathcal{V}$, consist of
$$\left\{x^i_{t-\tau_j+N_j-b_i-N_i\,:\,t-\tau_j+N_j+b_j-N_i},\, u^i_{t-\tau_j+N_j-d_i\,:\,t-\tau_j+N_j+b_j-1} \mid i \in \mathcal{V}\right\}.$$
Thus we have
$$PCH\left(x^j_{t-\tau_j}\right) = \left\{x^s_{t-\tau_j-M_{sj}},\, u^j_{t-\tau_j},\, x^p_{t-\tau_j+M_{jp}},\, u^p_{t-\tau_j+M_{jp}},\, x^r_{t-\tau_j+M_{jp}-M_{rp}} \mid s \in I_j,\, r \in I_p,\, p \in O_j\right\} \cup \left\{x^i_{t-\tau_j+N_j-b_i-N_i\,:\,t-\tau_j+N_j+b_j-N_i},\, u^i_{t-\tau_j+N_j-d_i\,:\,t-\tau_j+N_j+b_j-1} \mid i \in \mathcal{V}\right\}. \tag{4.12}$$
Let us denote the set of parents, the children, and the parents of children of $x^j_{t-N_j+1:t}$ by $M_j$. From equations (4.10), (4.11), and (4.12), we get that the set $M_j$ contains
$$M_j = \left\{x^j_{t-N_j:t+1},\, x^s_{t-(N_j+M_{sj}):t-M_{sj}},\, x^p_{t-(N_j-1-M_{jp}):t+M_{jp}+1},\, x^r_{t-(N_j-1-M_{jp}+M_{rp}):t-(M_{rp}-M_{jp})},\, x^i_{t-N_i-b_i+1:t-N_i+b_j+N_j} \mid s \in I_j,\, p \in O_j,\, r \in I_p,\, i \in \mathcal{V}\right\} \cup \left\{u^j_{t-N_j:t},\, u^k_{t+1:t+N_j+b_j},\, u^p_{t-(N_j-1-M_{jp}):t+M_{jp}},\, u^i_{t-(d_i-1):t+N_j+b_j-1} \mid p \in O_j,\; k, i \in \mathcal{V}\right\}.$$
Let us denote $M = \cup_{j \in \mathcal{V}} M_j$. Note that $u^k_{t-s_k} \in M$ if $s_k \ge N_k$, or $s_k \ge N_j - 1 - M_{jk}$ for all $j \in I_k$, or $s_k \ge d_k - 1$. From Definition 11, this implies that
$$s_k = \max\left\{N_k,\, d_k - 1,\, N_j - M_{jk} - 1 \mid j \in I_k\right\} = d_k.$$
Similarly, $x^k_{t-q_k} \in S$ if and only if $x^k_{t-q_k} \in M$. This happens if one of the following conditions holds.

1. $q_k \ge N_k$.

2. $q_k \ge N_j + M_{kj}$ such that $k \in I_j$ for some $j \in \mathcal{V}$. This happens for all $j \in O_k$.

3. $q_k \ge N_j - 1 - M_{jk}$ such that $k \in O_j$ for some $j \in \mathcal{V}$. That is, if $q_k = N_j - 1 - M_{jk}$ for all $j \in I_k$.

4. For the last term, we need to find all $j \in \mathcal{V}$ such that for all $p \in O_j$ we have $k \in I_p$. This happens for all $j \in I_p$ such that $p \in O_k$. Thus we have $q_k \ge N_j - 1 - M_{jp} + M_{kp}$ for all $p \in O_k$ and all $j \in I_p$.

5. $q_k \ge b_k + N_k - 1$.
Thus, we get that
$$q_k = \max\left\{N_k,\, N_s + M_{ks},\, N_r - 1 - M_{rk},\, N_p - 1 - M_{ps} + M_{ks},\, b_k + N_k - 1 \mid p \in I_s,\, s \in O_k,\, r \in I_k\right\}.$$
Using the definitions of $b_k$ and $d_k$, it is easy to verify that $q_k = b_k + N_k$. This proves that the Markov blanket of the variables $S^u_t$ consists of only $i^{mem}_t$. Thus, there exists $L^*_t$ such that equation (4.8) is satisfied. The lemma then follows from equation (4.9).
Proof of Theorem 22. To prove the main theorem, we first show that at time $T-1$, the belief state is only a function of $i^{mem}_{T-1}$. To see this, note that at time $T-1$, the set of unknown states at the controller, $S^u_{T-1}$, has no children. Thus, using a simplified version of the argument given in the proof of Lemma 23, it is easy to verify that there exists $\Lambda^*_{T-1}$ such that
$$\Lambda_{T-1}(a_{T-1}, i_{T-1}) = \Lambda^*_{T-1}\left(a_{T-1}, i^{mem}_{T-1}\right).$$
Thus, there exists an optimal controller $K^*_{T-1}$ such that
$$K^*_{T-1}(a_{T-1}, i_{T-1}) = T\left(a_{T-1}, \Lambda^*_{T-1}\left(\cdot, i^{mem}_{T-1}\right)\right) = K_{T-1}\left(a_{T-1}, i^{mem}_{T-1}\right).$$
The proof of the theorem then follows by induction using Lemma 23.
In previous chapters, we studied networked Markov decision processes with delays between subsystems. We showed that for networked MDPs, a sufficient information state is a function of a finite number of past system states and past controller inputs. The number of past states and inputs depends only on the underlying graph structure of the networked Markov decision process and the associated delays. We also gave explicit bounds on the number of past states and inputs required to compute an optimal control action for networked MDPs with delays, and showed that this bound has interesting connections to the Markov blanket in Bayesian networks. This allows us to look at complex networked systems from the viewpoint of Bayesian networks and provides additional insight into how the delays between subsystems affect overall controller performance.
The results of the previous chapters allow us to study complex interconnected systems that have a centralized controller or decision maker. In several systems of interest, a centralized decision maker is infeasible. Even when one can envision a centralized decision maker, it might be costly for every subsystem to transmit its state to the decision maker. In the next chapter, we look at a stochastic game model of complex interacting systems. In such models, each system makes optimal decisions in a decentralized manner. We study a new notion of equilibrium in such systems that allows us to compute decentralized policies or strategies for systems with a large number of players.
Chapter 5
A Mean Field Approach to
Studying Large Systems
In several complex systems, a large number of agents interact with each other without the presence of a centralized authority. Even in systems where a centralized authority may be present, it might be costly for each player or subsystem to transmit its state to that authority. Imagine a wireless network where a large number of devices perform power control to maximize their capacity. Even if there is a central base station, it is costly for each device to continuously report its channel state or queue backlog so that the base station can perform power control. Thus, in such scenarios, each agent (or player) interacts with the other agents in a decentralized manner to achieve its own objectives. A natural framework to study such systems is that of stochastic games. Stochastic games [51] have been used to study interactions between players in stochastic dynamic environments. However, such games can typically be solved only for a very small number of players, since the computational complexity of finding equilibrium policies is very large [25]. This limits their application to models of small dimension.
In Chapters 5 – 7, we study a mean field approach to understanding systems with a large number of interacting players [38, 35, 56, 1, 2, 43, 26, 23]. Mean field theory has been used in statistical physics to deal with the combinatorial complexity of large interactions. The basic idea is to treat the other particles or agents as a single entity
with some average behavior. Applied to engineering problems, this greatly simplifies decision making by a single agent: an agent can make its decision based on the average behavior of the other agents. The equilibrium consistency condition requires that this average behavior in fact arise from the individual trajectories. Just as in statistical physics, the mean field approach allows us to decouple the interactions between agents and leads to simple decision-making policies.

In this and subsequent chapters, we develop a unified framework to study the mean field equilibrium behavior of large-scale stochastic games. In particular, we prove that under a set of simple assumptions on the model, a mean field equilibrium always exists. Furthermore, as a simple consequence of this existence theorem, we show that from the viewpoint of a single agent, a near optimal decision-making policy is one that reacts only to the average behavior of its environment. This result unifies previously known results on mean field equilibrium in large-scale systems. In developing this unified framework, we isolate and highlight the key modeling features that make the mean field approach feasible. As a first step in studying the mean field approach, we begin by defining our model for stochastic games.
5.1 Stochastic Game Model
In this section, we describe our stochastic game model. Compared to standard
stochastic games in the literature [51], in our model, every player has an individual
state. Players are coupled through their payoffs and state transitions. A stochastic
game has the following elements:
Time. The game is played in discrete time. We index time periods by $t = 0, 1, 2, \ldots$.

Players. There are $m$ players in the game; we use $i$ to denote a particular player.

State. The state of player $i$ at time $t$ is denoted by $x_{i,t} \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{Z}^d$ is a subset of the $d$-dimensional integer lattice. We use $x_{-i,t}$ to denote the states of all players except player $i$ at time $t$.

Action. The action taken by player $i$ at time $t$ is denoted by $a_{i,t} \in \mathcal{A}$, where $\mathcal{A} \subseteq \mathbb{R}^q$ is a subset of $q$-dimensional Euclidean space.
CHAPTER 5. A MEAN FIELD APPROACH TO STUDYING LARGE SYSTEMS61
Transition Probabilities. The state of a player evolves in a Markov fashion. Formally, let $h_t = \{x_0, a_0, \ldots, x_{t-1}, a_{t-1}\}$ denote the history up to time $t$. Conditional on $h_t$, players' states at time $t$ are independent of each other. Player $i$'s state $x_{i,t}$ at time $t$ depends on the past history $h_t$ only through the state of player $i$ at time $t-1$, $x_{i,t-1}$; the states of the other players at time $t-1$, $x_{-i,t-1}$; and the action taken by player $i$ at time $t-1$, $a_{i,t-1}$. We represent the distribution of the next state as a transition kernel $P$, where:
$$P(x'_i \mid x_i, a_i, x_{-i}) = \mathrm{Prob}\left(x_{i,t+1} = x'_i \mid x_{i,t} = x_i,\, a_{i,t} = a_i,\, x_{-i,t} = x_{-i}\right). \tag{5.1}$$
Note that the evolution of players' states may be coupled: in general, the next state of player $i$ depends on the current state not only of player $i$, but also of the players other than $i$.
Payoff. In a given time period, if the state of player $i$ is $x_i$, the state of the other players is $x_{-i}$, and the action taken by player $i$ is $a_i$, then the single-period payoff to player $i$ is $\pi(x_i, a_i, x_{-i}) \in \mathbb{R}$. Note that the players are coupled via their payoff function, since the payoff to player $i$ depends on the state of every other player.

Discount Factor. The players discount their future payoffs by a discount factor $0 < \beta < 1$. Thus player $i$'s infinite horizon payoff is given by:
$$\sum_{t=0}^{\infty} \beta^t \pi\left(x_{i,t}, a_{i,t}, x_{-i,t}\right).$$
In the model described above, each player's payoff function and transition kernel depend on the states of all players. In a variety of games, this coupling between players is independent of the identity of the players. The notion of anonymity captures scenarios where the interaction between players is via aggregate information about the state. Let $f^{(m)}_{-i,t}(y)$ denote the fraction of players (excluding player $i$) whose state is $y$ at time $t$, i.e.:
$$f^{(m)}_{-i,t}(y) = \frac{1}{m-1} \sum_{j \ne i} \mathbf{1}\{x_{j,t} = y\}, \tag{5.2}$$
where $\mathbf{1}\{x_{j,t} = y\}$ is the indicator that the state of player $j$ at time $t$ is $y$. We refer to $f^{(m)}_{-i,t}$ as the population state at time $t$ (from player $i$'s point of view).
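For concreteness, the empirical population state of equation (5.2) can be computed directly from a list of player states; the states below are hypothetical.

```python
from collections import Counter

def population_state(states, i):
    """f^{(m)}_{-i}: the fraction of players other than i at each state,
    as in equation (5.2)."""
    others = [x for j, x in enumerate(states) if j != i]
    return {y: c / len(others) for y, c in Counter(others).items()}

# Hypothetical scalar states of m = 5 players
states = [0, 1, 1, 2, 1]
print(population_state(states, i=0))  # {1: 0.75, 2: 0.25}
```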
Definition 24 (Anonymous Stochastic Game). A stochastic game is called an anonymous stochastic game if the payoff function $\pi(x_{i,t}, a_{i,t}, x_{-i,t})$ and transition kernel $P(x'_{i,t} \mid x_{i,t}, a_{i,t}, x_{-i,t})$ depend on $x_{-i,t}$ only through $f^{(m)}_{-i,t}$. In an abuse of notation, we write $\pi\left(x_{i,t}, a_{i,t}, f^{(m)}_{-i,t}\right)$ for the payoff to player $i$, and $P(x'_{i,t} \mid x_{i,t}, a_{i,t}, f^{(m)}_{-i,t})$ for the transition kernel of player $i$.
For the remainder of the chapters, we focus our attention on anonymous stochastic games. For ease of notation, we often drop the subscripts $i$ and $t$ and denote a generic transition kernel by $P(\cdot \mid x, a, f)$ and a generic payoff function by $\pi(x, a, f)$, where $f$ represents the population state of the players other than the player under consideration.
Our results require a topology on population states; we consider the topology induced by the 1-$p$ norm. Given $p > 0$, the 1-$p$ norm of a function $f : \mathcal{X} \to \mathbb{R}$ is given by:
$$\|f\|_{1\text{-}p} = \sum_{x \in \mathcal{X}} \|x\|_p^p\, |f(x)|,$$
where $\|x\|_p$ is the usual $p$-norm of a vector. When $\mathcal{X}$ is finite, $\|f\|_{1\text{-}p}$ induces the same topology as the standard Euclidean norm. However, when $\mathcal{X}$ is infinite, the 1-$p$ norm weights larger states more heavily than smaller states. In many applications, other players at larger states have a greater impact on the payoff; in such settings, continuity of the payoff in $f$ in the 1-$p$ norm naturally controls for this effect.
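As an illustration, the 1-$p$ norm can be evaluated directly for a population state on a one-dimensional state space (so that $\|x\|_p^p = |x|^p$); the distribution below is hypothetical.

```python
def norm_1p(f, p):
    """1-p norm of f on a scalar state space: sum_x |x|^p * |f(x)|."""
    return sum(abs(x) ** p * abs(fx) for x, fx in f.items())

f = {0: 0.5, 1: 0.25, 2: 0.25}   # hypothetical population state
print(norm_1p(f, p=1))           # 0.75
print(norm_1p(f, p=2))           # 1.25
```

Note how the mass at state 2 contributes four times as much under $p = 2$ as the mass at state 1, reflecting the heavier weighting of larger states.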
Formally, let $\mathcal{F}$ be the set of all possible population states on $\mathcal{X}$ with finite 1-$p$ norm, i.e.:
$$\mathcal{F} = \left\{f : \mathcal{X} \to [0, 1] \;\middle|\; f(x) \ge 0,\ \sum_{x \in \mathcal{X}} f(x) = 1,\ \|f\|_{1\text{-}p} < \infty\right\}. \tag{5.3}$$
In addition, we let $\mathcal{F}^{(m)}$ denote the set of all population states in $\mathcal{F}$ over $m-1$ players, i.e.:
$$\mathcal{F}^{(m)} = \left\{f \in \mathcal{F} : \text{there exists } x \in \mathcal{X}^{m-1} \text{ with } f(y) = \frac{1}{m-1} \sum_j \mathbf{1}\{x_j = y\}\right\}.$$
5.2 Markov Perfect Equilibrium (MPE)
In studying stochastic games, attention is typically focused on a smaller class of
Markov strategy spaces, where the action of a player at each time is a function of
only the current state of every player [29]. In the context of anonymous stochastic
games, a Markov strategy depends on the current state of the player as well as the
current population state. Because a player using such a strategy tracks the evolution
of the other players, we refer to such strategies in our context as cognizant strategies.
Definition 25. Let $\mathcal{M}$ be the set of cognizant strategies available to a player. That is,
$$\mathcal{M} = \left\{\mu \mid \mu : \mathcal{X} \times \mathcal{F} \to \mathcal{A}\right\}. \tag{5.4}$$
Consider an $m$-player anonymous stochastic game. At every time $t$, player $i$ chooses an action $a_{i,t}$ that depends on its current state and on the current population state $f^{(m)}_{-i,t} \in \mathcal{F}^{(m)}$. Letting $\mu_i \in \mathcal{M}$ denote the cognizant strategy used by player $i$, we have $a_{i,t} = \mu_i(x_{i,t}, f^{(m)}_{-i,t})$. The next state of player $i$ is randomly drawn according to the kernel $P$:
$$x_{i,t+1} \sim P\left(\cdot \;\middle|\; x_{i,t},\, \mu_i(x_{i,t}, f^{(m)}_{-i,t}),\, f^{(m)}_{-i,t}\right). \tag{5.5}$$
We let $\mu$ denote the vector of strategies chosen by the players. We also let $\mu^{(m)}$ denote the strategy vector where every player has chosen strategy $\mu$.

Let $V^{(m)}\left(x, f \mid \mu', \mu^{(m-1)}\right)$ be the expected net present value for a player with initial state $x$ and initial population state $f \in \mathcal{F}^{(m)}$, given that the player follows strategy $\mu'$ and every other player follows strategy $\mu$. In particular, we have
$$V^{(m)}\left(x, f \mid \mu', \mu^{(m-1)}\right) \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \beta^t \pi\left(x_{i,t}, a_{i,t}, f^{(m)}_{-i,t}\right) \;\middle|\; x_{i,0} = x,\, f^{(m)}_{-i,0} = f;\ \mu_i = \mu',\, \mu_{-i} = \mu^{(m-1)}\right]. \tag{5.6}$$
Note that the state sequence $x_{i,t}$ and the population state sequence $f^{(m)}_{-i,t}$ evolve according to the dynamics (5.5).
We focus our attention on symmetric Markov perfect equilibrium (MPE), where all players use the same cognizant strategy $\mu$. In an abuse of notation, we write $V^{(m)}\left(x, f \mid \mu^{(m)}\right)$ to refer to the expected discounted value given in equation (5.6) when every player follows the same cognizant strategy $\mu$.

Definition 26 (Markov Perfect Equilibrium). The vector of cognizant strategies $\mu^{(m)} \in \mathcal{M}$ is a symmetric Markov perfect equilibrium (MPE) if for all initial states $x \in \mathcal{X}$ and population states $f \in \mathcal{F}^{(m)}$ we have
$$\sup_{\mu' \in \mathcal{M}} V^{(m)}\left(x, f \mid \mu', \mu^{(m-1)}\right) = V^{(m)}\left(x, f \mid \mu^{(m)}\right).$$
Thus, a Markov perfect equilibrium is a profile of cognizant strategies that simultaneously maximizes the expected discounted payoff of every player, given the strategies of the other players. It is well known that computing a Markov perfect equilibrium for a stochastic game is computationally challenging in general [25], because to find an optimal cognizant strategy each player must track and forecast the exact evolution of the entire population state. In certain scenarios, it might be infeasible to exchange or learn this information at every step because of limited communication capacity between players or limited cognitive ability. In the next section, we describe a recently proposed scheme for approximating Markov perfect equilibrium.
5.3 Mean Field Equilibrium (MFE)
In a game with a large number of players, we might expect that fluctuations of players’
states “average out” and hence the actual population state remains roughly constant
over time. Because the effect of other players on a single player’s payoff and transition
probabilities is only via the population state, it is intuitive that, as the number of
players increases, a single player has negligible effect on the outcome of the game.
Based on this intuition, a scheme for approximating MPE has been proposed via a
solution concept we call mean field equilibrium, or MFE [38, 35, 56, 1, 2, 43, 26, 23].
Mean field equilibrium is also referred to as “oblivious equilibrium” by [56] or as
“Nash certainty equivalence control” by [35].
In MFE, each player optimizes its payoff based only on the long-run average population state. Thus, rather than keeping track of the exact population state, a single player's immediate action depends only on its own current state. We call such players oblivious, and refer to their strategies as oblivious strategies. Formally, we let $\mathcal{M}_O$ denote the set of (stationary, nonrandomized) oblivious strategies, defined as follows.
Definition 27. Let $\mathcal{M}_O$ be the set of oblivious strategies available to a player. That is,
$$\mathcal{M}_O = \left\{\mu \mid \mu : \mathcal{X} \to \mathcal{A}\right\}. \tag{5.7}$$
Given a strategy $\mu \in \mathcal{M}_O$, an oblivious player $i$ takes the action $a_{i,t} = \mu(x_{i,t})$ at time $t$; as before, the next state of the player is randomly distributed according to the transition kernel $P$:
$$x_{i,t+1} \sim P(\cdot \mid x_{i,t}, a_{i,t}, f), \quad \text{where } a_{i,t} = \mu(x_{i,t}). \tag{5.8}$$
Note that because we are considering a mean field model, the player's state evolves according to the transition kernel evaluated at the population state $f$.
We define the oblivious value function $V(x \mid \mu, f)$ to be the expected net present value of an oblivious player with initial state $x$, when the long-run average population state is $f$ and the player uses an oblivious strategy $\mu$. We have
$$V(x \mid \mu, f) \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \beta^t \pi\left(x_{i,t}, a_{i,t}, f\right) \;\middle|\; x_{i,0} = x;\ \mu\right]. \tag{5.9}$$
Note that the state sequence $x_{i,t}$ is determined by the strategy $\mu$ according to the dynamics (5.8).
We define the optimal oblivious value function $V^*(x \mid f)$ as
$$V^*(x \mid f) = \sup_{\mu \in \mathcal{M}_O} V(x \mid \mu, f).$$
Given a population state $f$, an oblivious player computes an optimal strategy by maximizing its oblivious value function. Note that because an oblivious player does not track the evolution of the population state, under reasonable assumptions its optimal strategy is only a function of its current state, i.e., it must be oblivious even if optimizing over cognizant strategies. We capture this optimization step via the correspondence $\mathcal{P}$ defined next.

Definition 28. The correspondence $\mathcal{P} : \mathcal{F} \to \mathcal{M}_O$ maps a distribution $f \in \mathcal{F}$ to the set of optimal oblivious strategies for a player. That is, $\mu \in \mathcal{P}(f)$ if and only if $V(x \mid \mu, f) = V^*(x \mid f)$ for all $x$, where $V$ is the oblivious value function given by equation (5.9).
Note that P maps a distribution to a stationary, nonrandomized oblivious strategy.
This is typically without loss of generality, since in most models of interest there
always exists such an optimal strategy. We later establish under our assumptions
that P(f) is nonempty.
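On a finite state and action space, an optimal oblivious strategy can be computed by standard value iteration against the fixed population state. The sketch below uses a hypothetical two-state model whose kernel happens to ignore f; only the signatures mirror the text, and the model is not one from this thesis.

```python
def optimal_oblivious_strategy(X, A, P, pi, f, beta, iters=500):
    """Value iteration for V*(x | f); returns (V, mu) with mu optimal
    given f.  P(x, a, f) -> dict of next-state probabilities;
    pi(x, a, f) -> single-period payoff."""
    V = {x: 0.0 for x in X}
    q = lambda x, a: pi(x, a, f) + beta * sum(p * V[y] for y, p in P(x, a, f).items())
    for _ in range(iters):
        V = {x: max(q(x, a) for a in A) for x in X}   # Bellman update
    mu = {x: max(A, key=lambda a: q(x, a)) for x in X}
    return V, mu

# Hypothetical model: action a steers the next state toward a w.p. 0.9;
# state 1 pays 1, action 1 costs 0.1.  The toy kernel ignores f.
X, A = [0, 1], [0, 1]
P = lambda x, a, f: {a: 0.9, 1 - a: 0.1}
pi = lambda x, a, f: x - 0.1 * a
V, mu = optimal_oblivious_strategy(X, A, P, pi, f=None, beta=0.9)
print(mu)  # {0: 1, 1: 1}: steering toward state 1 is worth the cost
```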
Now suppose that the population state is $f$, and all players are oblivious and play a stationary strategy $\mu$. We expect that the long-run population state should in fact be an invariant distribution of the Markov process with transition kernel (5.8). We capture this relationship via the correspondence $\mathcal{D}$, defined next.

Definition 29. The correspondence $\mathcal{D} : \mathcal{M}_O \times \mathcal{F} \to \mathcal{F}$ maps the oblivious strategy $\mu$ and population state $f$ to the set of invariant distributions $\mathcal{D}(\mu, f)$ associated with the dynamics (5.8).
Note that the image of the correspondence $\mathcal{D}$ is empty if the strategy does not result in an invariant distribution. We later establish conditions under which $\mathcal{D}(\mu, f)$ is nonempty.
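On a finite state space, an invariant distribution of the dynamics (5.8) can be approximated by pushing a distribution forward through the kernel until it stabilizes. The reflecting random walk below is a hypothetical example; its toy kernel ignores the action and the population state f.

```python
def invariant_distribution(X, P, mu, f, iters=2000):
    """Approximate an invariant distribution of x' ~ P(. | x, mu(x), f)
    by forward iteration of the state distribution."""
    g = {x: 0.0 for x in X}
    g[X[0]] = 1.0                        # arbitrary starting distribution
    for _ in range(iters):
        nxt = {x: 0.0 for x in X}
        for x in X:
            for y, p in P(x, mu[x], f).items():
                nxt[y] += g[x] * p
        g = nxt
    return g

# Hypothetical kernel: symmetric reflecting random walk on {0, 1, 2}
def kernel(x, a, f):
    moves = {x: 0.5}
    for step in (-1, 1):
        y = min(max(x + step, 0), 2)
        moves[y] = moves.get(y, 0.0) + 0.25
    return moves

g = invariant_distribution([0, 1, 2], kernel, mu={x: 0 for x in range(3)}, f=None)
print(g)  # approximately uniform: each state holds mass near 1/3
```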
We can now define mean field equilibrium. If every agent conjectures that $f$ is the long-run population state, then every agent would prefer to play an optimal oblivious strategy $\mu$. On the other hand, if every agent plays $\mu$ and the population state is in fact $f$, then we should expect the long-run population state of all players to be an invariant distribution of (5.8). Mean field equilibrium requires a consistency condition: the equilibrium population state $f$ must in fact be an invariant distribution of the dynamics (5.8) under the strategy $\mu$ and the same population state $f$.

Definition 30 (Mean Field Equilibrium). An oblivious strategy $\mu \in \mathcal{M}_O$ and a distribution $f \in \mathcal{F}$ constitute a mean field equilibrium if $\mu \in \mathcal{P}(f)$ and $f \in \mathcal{D}(\mu, f)$.
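The consistency condition suggests a natural heuristic computation: alternate between a best response to the conjectured population state and the invariant distribution it induces, until the population state reproduces itself. The two-state congestion model below is hypothetical and not from this thesis, and convergence of this naive iteration is not guaranteed in general.

```python
def best_response(X, A, P, pi, f, beta, iters=300):
    # Value iteration for an optimal oblivious strategy given f.
    V = {x: 0.0 for x in X}
    q = lambda x, a: pi(x, a, f) + beta * sum(p * V[y] for y, p in P(x, a, f).items())
    for _ in range(iters):
        V = {x: max(q(x, a) for a in A) for x in X}
    return {x: max(A, key=lambda a: q(x, a)) for x in X}

def stationary(X, P, mu, f, iters=300):
    # Invariant distribution of x' ~ P(. | x, mu(x), f).
    g = {x: 1.0 / len(X) for x in X}
    for _ in range(iters):
        nxt = {x: 0.0 for x in X}
        for x in X:
            for y, p in P(x, mu[x], f).items():
                nxt[y] += g[x] * p
        g = nxt
    return g

# Hypothetical congestion model: state 1 pays 1 - f(1) (crowding),
# action a (cost 0.05a) steers the next state toward a w.p. 0.9.
X, A, beta = [0, 1], [0, 1], 0.9
P = lambda x, a, f: {a: 0.9, 1 - a: 0.1}
pi = lambda x, a, f: x * (1.0 - f[1]) - 0.05 * a

f = {0: 0.5, 1: 0.5}                     # initial conjecture
for _ in range(20):                      # iterate best response / invariance
    mu = best_response(X, A, P, pi, f, beta)
    f = stationary(X, P, mu, f)
print(mu, f)  # mu = {0: 1, 1: 1}, f near {0: 0.1, 1: 0.9}
```

At the resulting pair, $\mu$ is optimal against $f$ and $f$ is invariant under $\mu$, which is exactly the consistency required by Definition 30.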
The notion of mean field equilibrium provides a simple approach to understanding behavior in large-population stochastic dynamic games. However, this notion is not very meaningful unless we can guarantee that a mean field equilibrium exists in a wide variety of stochastic games. Even if a mean field equilibrium were to exist in a particular game of interest, it is natural to wonder whether such an equilibrium is a good approximation to Markov perfect equilibrium in games with finitely many players. MFE is unlikely to be useful in practice without conditions that guarantee that it approximates equilibria in finite systems well. Below we address these two fundamental questions: the existence of MFE and whether it provides a meaningful approximation to MPE.
As we shall show below, an important contribution of our thesis is to relate approximation to existence of MFE. The approximation theorem we provide requires continuity assumptions on the model primitives; as we demonstrate later, these same continuity conditions are required (together with convexity and compactness conditions) to ensure an MFE actually exists. Thus we obtain the valuable insight that approximation is essentially a corollary of existence. This is practically valuable: establishing that MFE is a good approximation is effectively a free byproduct, once the conditions ensuring its existence have been verified.
We begin by studying the approximation result. We first define the appropriate
notion of approximation and show that under very mild assumptions a mean field
equilibrium (if it exists) approximates Markov perfect equilibrium as the number of
players in the game becomes large.
Chapter 6
MFE as an Approximation to MPE
As discussed in the previous chapter, a mean field equilibrium is of practical value only
if it approximates equilibria in finite systems well. In this chapter, we establish one of
our main results: under a parsimonious set of assumptions, a mean field equilibrium is
a good approximation to Markov perfect equilibrium as the number of players grows
large.
6.1 The Asymptotic Markov Equilibrium (AME) Property
We begin by formalizing the approximation property of interest, referred to as the
asymptotic Markov equilibrium (AME) property. Intuitively, this property requires
that a mean field equilibrium strategy is approximately optimal even when compared
against Markov strategies, as the number of players grows large.
Definition 31 (Asymptotic Markov Equilibrium). A mean field equilibrium $(\mu, f)$ possesses the asymptotic Markov equilibrium (AME) property if for all states $x$ and sequences of cognizant strategies $\mu_m \in \mathcal{M}$, we have:
$$\limsup_{m \to \infty}\ V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) - V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right) \le 0, \tag{6.1}$$
almost surely, where the initial population state $f^{(m)}$ is derived by sampling each other player's initial state independently from the probability mass function $f$.

Note that $V^{(m)}\left(x, f^{(m)} \mid \mu', \mu^{(m-1)}\right)$ is the actual value function of a player, as defined in equation (5.6), when the player uses a cognizant strategy $\mu'$ and every other player plays the oblivious strategy $\mu$. In particular, we have
$$V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \beta^t \pi\left(x_{i,t}, a_{i,t}, f^{(m)}_{-i,t}\right) \;\middle|\; x_{i,0} = x,\, f^{(m)}_{-i,0} = f^{(m)};\ \mu_i = \mu_m,\, \mu_{-i} = \mu^{(m-1)}\right],$$
where the state evolution of the players is given by:
$$x_{i,t+1} \sim P\left(\cdot \mid x_{i,t},\, \mu_m(x_{i,t}, f^{(m)}_{-i,t}),\, f^{(m)}_{-i,t}\right), \qquad x_{j,t+1} \sim P\left(\cdot \mid x_{j,t},\, \mu(x_{j,t}),\, f^{(m)}_{-i,t}\right) \quad \forall j \ne i.$$
Similarly, $V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right)$ is the actual value function of a player, as defined in equation (5.6), when every player plays the oblivious strategy $\mu$. AME requires that the error from using the MFE strategy approach zero almost surely with respect to the randomness in the initial population state. This definition can be shown to be stronger than the definition considered by [56], where AME is defined only in expectation with respect to the randomness in the initial population state.¹
We emphasize that the AME property is essentially a continuity property in the population state $f$. Under reasonable assumptions, we show that the time-$t$ population state in the system with $m$ players, $f^{(m)}_{-i,t}$, approaches $f$ almost surely for all $t$ as $m \to \infty$. Therefore, informally, if the payoffs satisfy an appropriate continuity property in $f$, we should expect the AME property to hold. This observation is significant because, as noted above, continuity is also an essential prerequisite for existence. It is for this reason that, under fairly general assumptions, the AME property is essentially a corollary of existence.
¹ Under our assumptions on the model, convergence in expectation can be established via an application of the bounded convergence theorem. In particular, by Lemma 41 it follows that $|V^{(m)}(x, f \mid \mu', \mu)| \le C(x, 0) < \infty$ for all $f$, $\mu'$, and $\mu$.
Before proceeding, we require some additional notation. Without loss of generality, we can view the state Markov process in terms of increments from the current state. Specifically, if the current state is $x$ and action $a$ is taken, we can write:
$$x_{i,t+1} = x_{i,t} + \xi_{i,t}, \tag{6.2}$$
where $\xi_{i,t}$ is a random increment distributed according to the probability mass function $Q(\cdot \mid x, a, f)$, where
$$Q(z' \mid x, a, f) = P(x + z' \mid x, a, f).$$
Note that $Q(z' \mid x, a, f)$ is positive only for those $z'$ such that $x + z' \in \mathcal{X}$.
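The change of variables from $P$ to the increment kernel $Q$ in equation (6.2) is mechanical; a sketch, with a hypothetical kernel on the integers, is:

```python
def increment_kernel(P, x, a, f):
    """Q(. | x, a, f): the distribution of z = x' - x induced by P."""
    return {xp - x: prob for xp, prob in P(x, a, f).items()}

# Hypothetical kernel: stay w.p. 0.6, step down w.p. 0.4 (increments
# are bounded in the sense of Assumption 1, with M = 1)
P = lambda x, a, f: {x: 0.6, x - 1: 0.4}
print(increment_kernel(P, 3, None, None))  # {0: 0.6, -1: 0.4}
```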
We make the following assumptions on the model primitives; these ensure the model is appropriately "continuous" in the limit.

Assumption 1 (Continuity).

1. Compact action set. The set of feasible actions for a player, denoted by $\mathcal{A}$, is compact.

2. Bounded increments. There exists $M \ge 0$ such that, for all $z$ with $\|z\|_\infty > M$, $Q(z \mid x, a, f) = 0$ for all $x$, $a$, and $f$.

3. Payoff and kernel continuity. The payoff $\pi(x, a, f)$ is jointly continuous in $a \in \mathcal{A}$ and $f \in \mathcal{F}$ for fixed $x \in \mathcal{X}$ (where $\mathcal{F}$ is endowed with the 1-$p$ norm), and the kernel $P(x' \mid x, a, f)$ is jointly continuous in $a \in \mathcal{A}$ and $f \in \mathcal{F}$ for each $x, x' \in \mathcal{X}$ (where $\mathcal{F}$ is endowed with the 1-$p$ norm).²

4. Growth rate bound. There exist constants $K$ and $n \in \mathbb{Z}_+$ such that
$$\sup_{a \in \mathcal{A},\, f \in \mathcal{F}} |\pi(x, a, f)| \le K(1 + \|x\|_\infty)^n$$
for every $x \in \mathcal{X}$, where $\|\cdot\|_\infty$ is the sup norm.
² Here we view $P(x' \mid x, a, f)$ as a real-valued function of $a$ and $f$, for fixed $x, x'$; note that since we have also assumed bounded increments, this notion of continuity is equivalent to assuming that $P(\cdot \mid x, a, f)$ is jointly continuous in $a$ and $f$ with respect to the topology of weak convergence on distributions over $\mathcal{X}$.
The most consequential of these assumptions are that the model exhibits bounded increments and that the payoff growth rate can be bounded. These are not particularly severe restrictions; for a wide range of economic models of interest, it is reasonable to assume increments are bounded. Further, the polynomial growth rate bound on the payoff is quite weak, and serves to exclude the possibility of strategies that yield infinite expected discounted payoff.
Theorem 32 (AME). Let $(\mu, f)$ be a mean field equilibrium with $f \in \mathcal{F}$, and suppose Assumption 1 holds. Then the AME property holds for $(\mu, f)$.

The proof of the AME property exploits the fact that the 1-$p$ norm of $f$ must be finite (since $f \in \mathcal{F}$) to show that $\left\|f^{(m)}_{-i,t} - f\right\|_{1\text{-}p} \to 0$ almost surely as $m \to \infty$; i.e., the population state of the other players approaches $f$ almost surely. Continuity of the payoff $\pi$ in $f$, together with the growth rate bounds in Assumption 1, yields the desired result. The proof of the AME property is provided in the appendix.
In the next chapter we establish the existence of MFE. The existence result uses the same continuity assumptions (along with additional assumptions) required for the AME property. This shows that the approximation result is a corollary of the existence result.
Chapter 7
Existence of Mean Field
Equilibrium
The notion of mean field equilibrium allows us to approximate Markov perfect equilibrium in large stochastic dynamic games. This notion is vacuous unless we can guarantee that a mean field equilibrium exists in a wide variety of games. In this chapter, we study the existence of mean field equilibria. From Definition 30, we observe that $(\mu, f)$ is a mean field equilibrium if and only if $f$ is a fixed point of $\Phi(f) = \mathcal{D}(\mathcal{P}(f), f)$ and $\mu \in \mathcal{P}(f)$. Thus our approach is to find conditions under which the correspondence $\Phi$ has a fixed point; in particular, we aim to apply Kakutani's fixed point theorem to $\Phi$ to find an MFE.
Kakutani's fixed point theorem requires three essential pieces: (1) compactness of the range of $\Phi$; (2) convexity both of the domain of $\Phi$ and of $\Phi(f)$ for each $f$; and (3) appropriate continuity properties of the operator $\Phi$. As emphasized in the last chapter, a central technical observation is that the same continuity properties needed to establish the AME property are essential to proving existence of an MFE.
We start with the following restatement of Kakutani’s theorem.
Theorem 33 (Kakutani). Suppose there exists a set $\mathcal{F}_C \subseteq \mathcal{F}$ such that:

1. $\mathcal{F}_C$ is convex and compact (in the 1-$p$ norm), with $\Phi(\mathcal{F}_C) \subset \mathcal{F}_C$;

2. $\Phi(f)$ is convex and nonempty for every $f \in \mathcal{F}_C$; and

3. $\Phi$ has a closed graph on $\mathcal{F}_C$.

Then there exists a mean field equilibrium $(\mu, f)$ with $f \in \mathcal{F}_C$.
In the remainder of this section, we find exogenous conditions on the model primitives that ensure these requirements are met. We tackle them in reverse order. We first show that under Assumption 1, $\Phi$ has a closed graph. Next, we study conditions under which $\Phi(f)$ can be guaranteed to be convex. Finally, we provide conditions on the model primitives under which there exists a compact, convex set $\mathcal{F}_C$ with $\Phi(\mathcal{F}_C) \subset \mathcal{F}_C$. The conditions we provide are mild, and yet also suffice to guarantee that $\Phi(f)$ is nonempty.
7.1 Closed Graph
In this section we establish that exactly the same the continuity assumptions embod-
ied in Assumption 1 also suffice to ensure that Φ has a closed graph. We begin with
the following lemma.
Lemma 34. For each f , P(f) is compact; further, the correspondence P is upper
hemicontinuous on F.
Proof. By Assumption 1, $\pi(x, a, f)$ is jointly continuous in $a$ and $f$. Lemma 42 establishes that the optimal oblivious value function $V^*(x \mid f)$ is continuous in $f$, and so, as in the proof of that lemma, it follows that for a fixed state $x$, $\pi(x, a, f) + \beta \sum_{x'} V^*(x' \mid f) P(x' \mid x, a, f)$ is finite and jointly continuous in $a$ and $f$. Define the set $\mathcal{P}_x(f) \subset \mathcal{A}$ as the set of actions that achieve the maximum on the right-hand side of (A.3); this is nonempty, as $\mathcal{A}$ is compact (Assumption 1) and the right-hand side is continuous in $a$. By Berge's maximum theorem, for each $x$ the correspondence $\mathcal{P}_x$ is upper hemicontinuous with compact values [3].

By Lemma 42, $\mu \in \mathcal{P}(f)$ if and only if $\mu(x) \in \mathcal{P}_x(f)$ for each $x$. Note that we have endowed the set of strategies with the topology of pointwise convergence. The range space of $\mathcal{P}$ is an infinite product of the compact action space $\mathcal{A}$ (Assumption 1) over the countable state space. Hence, by Tychonoff's theorem [3], the range space of $\mathcal{P}$ is compact. Further, since $\mathcal{P}_x$ is compact-valued, it follows that $\mathcal{P}$ is compact-valued. Since $\mathcal{P}_x(f)$ is compact-valued and upper hemicontinuous, the Closed Graph Theorem ensures that $\mathcal{P}_x$ has a closed graph [3]. This in turn ensures that $\mathcal{P}$ has a closed graph; again by the Closed Graph Theorem, we conclude that $\mathcal{P}$ is upper hemicontinuous.
Proposition 35. Suppose that Assumption 1 holds. Then Φ has a closed graph on F; i.e., the set {(f, g) : g ∈ Φ(f)} ⊂ F × F is closed (where F is endowed with the 1-p norm).
Proof. Suppose fk → f in the 1-p norm, and that gk → g in the 1-p norm, where
gk ∈ Φ(fk) for all k. We must show that g ∈ Φ(f). For each k, let µk ∈ P(fk) be an
optimal oblivious strategy such that gk ∈ D(µk, fk). As in the proof of Lemma 34,
the range space of P is compact in the topology of pointwise convergence; therefore,
taking subsequences if necessary, we can assume without loss of generality that µk
converges to some strategy µ ∈MO pointwise. By upper hemicontinuity of P (Lemma
34), we have µ ∈ P(f).
By definition of D, it follows that for all x:

gk(x) = ∑_{x′} gk(x′) P(x | x′, µk(x′), fk).    (7.1)
Since P(x|x′, a, f) is jointly continuous in action and population state (Assumption
1), it follows that for all x and x′:
P(x|x′, µk(x′), fk)→ P(x|x′, µ(x′), f)
as k → ∞. Further, if gk → g in the 1-p norm, then in particular, gk(x) → g(x) for
all x. Finally, observe that for all a and f , we have P(x|x′, a, f) = 0 for all states x′
such that ‖x′ − x‖∞ > M , since increments are bounded (Assumption 1). Thus:
∑_{x′} gk(x′) P(x | x′, µk(x′), fk) → ∑_{x′} g(x′) P(x | x′, µ(x′), f)
as k → ∞. Taking the limit as k → ∞ on both sides of (7.1) yields:

g(x) = ∑_{x′} g(x′) P(x | x′, µ(x′), f),    (7.2)
which establishes that g ∈ D(µ, f). Since we had µ ∈ P(f), we conclude g ∈ Φ(f),
as required.
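Condition (7.2) characterizes g as an invariant distribution of the kernel induced by µ and f. On a finite state space this is a linear system, which the sketch below solves numerically; the 3-state transition matrix is hypothetical and not taken from any model in this thesis.

```python
import numpy as np

# Hypothetical 3-state transition matrix P[x_prev, x_next] induced by a fixed
# strategy mu and population state f; each row sums to 1.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
])

def invariant_distribution(P):
    """Solve g = g P with sum(g) = 1, i.e. equation (7.2), as a linear system."""
    n = P.shape[0]
    # Stack (P^T - I) g = 0 with the normalization row 1^T g = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    g, *_ = np.linalg.lstsq(A, b, rcond=None)
    return g

g = invariant_distribution(P)
assert np.allclose(g, g @ P)        # g satisfies the invariance equation
assert np.isclose(g.sum(), 1.0)
```

For an irreducible chain the overdetermined system has an exact solution, so least squares recovers it with zero residual.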
7.2 Convexity
Next, we develop conditions to ensure that Φ(f) is nonempty and convex. We start
by considering a simple model, where the action set A is the simplex of randomized
actions on a base set of pure actions. Formally, we have the following definition.
Definition 36. An anonymous stochastic game has a finite action space if there exists
a finite set S such that the following three conditions hold:
1. A consists of all probability distributions over S: A = {a ≥ 0 : ∑_s a(s) = 1}.

2. π(x, a, f) = ∑_s a(s) π(x, s, f), where π(x, s, f) is the payoff evaluated at state x, population state f, and pure action s.

3. P(x′ | x, a, f) = ∑_s a(s) P(x′ | x, s, f), where P(x′ | x, s, f) is the kernel evaluated at states x′ and x, population state f, and pure action s.
Essentially, the preceding definition allows inclusion of randomized strategies in
our search for mean field equilibrium. This model inherits Nash’s original approach
to establishing existence of an equilibrium for static games, where randomization
induces convexity on the strategy space. We show next that in any game with finite
action spaces, the set Φ(f) is always convex.
Proposition 37. Suppose Assumption 1 holds. In any anonymous stochastic game
with a finite action space, Φ(f) is convex for all f ∈ F.
Proof. Fix f ∈ F, and let g1, g2 be elements of Φ(f). Let µ1, µ2 ∈ P(f) be strategies such that gi ∈ D(µi, f), i = 1, 2. Then for i = 1, 2 and all x′ ∈ X, we have:

gi(x′) = ∑_x gi(x) P(x′ | x, µi(x), f).
Fix δ, 0 ≤ δ ≤ 1, and for each x, define g(x) by:
g(x) = δg1(x) + (1− δ)g2(x).
We must show g ∈ Φ(f). Define a new strategy µ as follows: for each x such that g(x) > 0,

µ(x) = [ δ g1(x) µ1(x) + (1 − δ) g2(x) µ2(x) ] / g(x).
For each x such that g(x) = 0, let µ(x) = µ1(x).
We claim that µ ∈ P(f), i.e., µ is an optimal oblivious strategy given f ; and that
g ∈ D(µ, f), i.e., that g is an invariant distribution given strategy µ and population
state f . This suffices to establish that g ∈ Φ(f).
To establish the claim, first observe that under Definition 36, the right hand side
of (A.3) is linear in a. Thus any convex combination of two optimal actions is also
an optimal action. This establishes that for every x, µ(x) achieves the maximum on
the right hand side of (A.3); so we conclude µ ∈ P(f).
Let T = {x : g(x) > 0}. Then:

g(x′) = δ g1(x′) + (1 − δ) g2(x′)
      = ∑_x [ δ g1(x) P(x′ | x, µ1(x), f) + (1 − δ) g2(x) P(x′ | x, µ2(x), f) ]
      = ∑_x ∑_s ( δ g1(x) µ1(x)(s) + (1 − δ) g2(x) µ2(x)(s) ) P(x′ | x, s, f)
      = ∑_{x∈T} ∑_s g(x) µ(x)(s) P(x′ | x, s, f).
The first equality is the definition of g(x′), and the second equality follows by expanding the invariant distribution equations for g1 and g2. The third equality follows by expanding the sum over pure actions s. Finally, in the last equality, we substitute the definitions of g(x) and µ(x), and we also observe that for x ∉ T, g(x) = 0, and therefore g1(x) = g2(x) = 0. Since g(x) = 0 for x ∉ T, it follows that:

∑_{x∉T} ∑_s g(x) µ(x)(s) P(x′ | x, s, f) = 0.
It follows that:
g(x′) = ∑_x g(x) P(x′ | x, µ(x), f),
as required.
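The construction in this proof can be checked numerically. The sketch below builds the mixed strategy µ from two strategies µ1, µ2 on a hypothetical two-state, two-action chain (all kernels invented for illustration) and verifies that g = δg1 + (1 − δ)g2 is invariant under µ:

```python
import numpy as np

# Hypothetical kernels P_pure[s, x, x'] for pure actions s = 0, 1.
P_pure = np.array([
    [[0.9, 0.1], [0.6, 0.4]],   # action 0
    [[0.3, 0.7], [0.2, 0.8]],   # action 1
])

def kernel(mu):
    """Transition matrix when state x plays the randomized action mu[x]."""
    return np.einsum('xs,sxy->xy', mu, P_pure)

def invariant(P):
    """Invariant distribution g = g P, sum(g) = 1."""
    A = np.vstack([P.T - np.eye(2), np.ones(2)])
    g, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)
    return g

mu1 = np.array([[1.0, 0.0], [0.0, 1.0]])   # two pure strategies
mu2 = np.array([[0.0, 1.0], [1.0, 0.0]])
g1, g2 = invariant(kernel(mu1)), invariant(kernel(mu2))

delta = 0.3
g = delta * g1 + (1 - delta) * g2
# The mixed strategy from the proof:
# mu(x) = [d g1(x) mu1(x) + (1-d) g2(x) mu2(x)] / g(x).
mu = (delta * g1[:, None] * mu1 + (1 - delta) * g2[:, None] * mu2) / g[:, None]

assert np.allclose(g @ kernel(mu), g)   # g is invariant under mu
```

The assertion succeeds for any δ in [0, 1] here, mirroring the algebra in the proof.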
The preceding result ensures that if randomization is allowed over a set of finite
actions, then the map Φ is convex-valued. On the other hand, many relevant appli-
cations typically require existence of equilibria in pure strategies. For this purpose,
we present an alternate result, in which concavity assumptions on model primitives
guarantee convexity of Φ(f). Before proceeding, we require some additional terminology.
Let S ⊂ Rn. We say that a function g : S → R is nondecreasing if g(x′) ≥ g(x)
whenever x′ ≥ x (where we write x′ ≥ x if x′ is at least as large as x in every
component). Let Pθ be a family of probability distributions on X indexed by θ ∈
S. We say that Pθ is stochastically nondecreasing in the parameter θ, if for every
nondecreasing function u : X → R, and for θ1 ≥ θ2, there holds Eθ1 [u] ≥ Eθ2 [u]
wherever the expectations are both well defined. (When the preceding condition
holds, we say Pθ1 stochastically dominates Pθ2 .) We say that Pθ is stochastically
concave in the parameter θ if, for every nondecreasing function u : X → R, Eθ[u] = ∑_x u(x) Pθ(x) is a concave function of θ wherever the expectation is well defined.
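For distributions on Z+, stochastic dominance is equivalent to a pointwise comparison of survival functions, which is easy to test numerically. A minimal sketch with hypothetical distributions (the equivalence as coded applies to the one-dimensional case):

```python
import numpy as np

def dominates(p1, p2):
    """True iff p1 stochastically dominates p2 (distributions on {0,...,n-1}).
    Equivalent (in one dimension) to E_{p1}[u] >= E_{p2}[u] for every
    nondecreasing u: check that the survival function of p1 is pointwise
    at least that of p2."""
    s1 = np.cumsum(p1[::-1])[::-1]   # tail sums P(X >= y)
    s2 = np.cumsum(p2[::-1])[::-1]
    return bool(np.all(s1 >= s2 - 1e-12))

p_low  = np.array([0.5, 0.3, 0.2])   # hypothetical example distributions
p_high = np.array([0.2, 0.3, 0.5])
assert dominates(p_high, p_low)
assert not dominates(p_low, p_high)
```

The small tolerance guards against floating-point noise in the cumulative sums.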
We have the following assumption.
Assumption 2.

1. The action set A is convex.
2. The payoff π(x, a, f) is nondecreasing in x for fixed a and f , and the kernel
P(· | x, a, f) is stochastically nondecreasing in x for fixed a and f .
3. The payoff is concave in a for fixed x and f, and the kernel is stochastically concave in a for fixed x and f, with at least one of the two strictly concave in a.¹
In the following proposition, we show that the preceding conditions on model primitives ensure the optimal oblivious strategy is unique, and therefore that Φ(f) is convex for all f.
Proposition 38. Suppose Assumptions 1 and 2 hold. Then P(f) is a singleton, and
Φ(f) is convex for all f ∈ F.
Proof. From Lemma 43, the conditions of the proposition guarantee a unique optimal solution on the right hand side of (A.3), for every x ∈ X. Thus the optimal oblivious strategy given f is unique, i.e., P(f) is a singleton. It is straightforward to check that D(µ, f) is convex for each fixed µ and f (the set of invariant distributions given µ and f is the solution set of a linear system), so it follows that Φ(f) is convex for each f.
7.3 Compactness
In this section, we provide conditions under which we can guarantee the existence of a compact, convex, nonempty set FC such that Φ(FC) ⊂ FC. The
assumptions we make are closely related to those needed to ensure that Φ(f) is
nonempty. To see the relationship between these results, observe that in Lemma 42,
we showed that under Assumption 1 an optimal oblivious strategy always exists for
any f ∈ F. Thus to ensure that Φ(f) is nonempty, it suffices to show that there
exists at least one ergodic strategy in P(f)—i.e., at least one strategy that possesses
an invariant distribution. Our approach to demonstrating existence of an invariant
distribution is to use a Foster-Lyapunov argument. This same argument also allows
us to bound the moments of the invariant distribution—precisely what is needed to
find the desired set FC that is compact in the 1-p norm.
¹Strict stochastic concavity of the kernel requires that the expectation against any nondecreasing function is strictly concave.
One simple condition under which Φ(f) is nonempty is that the state space is
finite; in this case any policy is ergodic, since any Markov chain on a finite state
space possesses at least one positive recurrent class. Moreover, the entire set F is compact in the 1-p norm. Thus we have the following result.
Proposition 39. Suppose Assumption 1 holds, and that the state space is finite.
Then Φ(f) is nonempty for all f ∈ F, and F is compact in the 1-p norm.
We now turn our attention to the setting where the state space may be unbounded.
In this case, we must make additional assumptions to ensure the optimal strategy
does not allow the state to become transient, and to bound moments of the invariant
distribution of any optimal oblivious strategy.
We endow F with the stochastic dominance ordering (i.e., f′ ⪰ f if f′ stochastically dominates f). If X and Y are partially ordered sets with orders ⪰ and ⊒, respectively, we say that a function F : X × Y → R has decreasing differences in x and y if for all x, x′, y, y′ such that x′ ⪰ x and y′ ⊒ y, there holds:
F (x′, y′)− F (x, y′) ≤ F (x′, y)− F (x, y).
In other words, increasing the parameter y reduces the marginal return to higher
values of x.
Assumption 3.

1. The state space is X = Z+.
2. The payoff function π(x, a, f) has decreasing differences in x ∈ X , a ∈ A, and
f ∈ F.
3. For all ∆ ∈ Z+, a ∈ A, and f ∈ F, as x→∞ there holds:
π(x + ∆, a, f)− π(x, a, f)→ 0.
4. Any action is costly: there exists a minimal action a̲ ∈ A such that a ≥ a̲ for all a ∈ A, and the payoff satisfies π(x, a, f) < π(x, a̲, f) for all x, f, and a ≠ a̲.
5. The increment kernel Q(· | x, a, f) is stochastically nonincreasing in x ∈ X for
each a ∈ A and f ∈ F, and stochastically nonincreasing in f ∈ F for each
x ∈ X and a ∈ A.
6. Given any f ∈ F, the drift is eventually negative at a̲: there exists a state χ such that for all x with x ≥ χ,

∑_z z Q(z | x, a̲, f) < 0.
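Condition 6 is a Foster–Lyapunov drift condition, and the threshold χ can be located numerically once the increment kernel is specified. A sketch with a hypothetical increment kernel on Z+ (not the thesis model):

```python
# Hypothetical increment kernel under the minimal action (not the thesis model):
# from state x, the increment is +1 w.p. 1/(1+x), -1 w.p. 0.5 (only if x > 0),
# and 0 otherwise; the probabilities sum to at most 1 at every state.
def mean_drift(x):
    """sum_z z * Q(z | x) for the kernel above."""
    up = 1.0 / (1 + x)
    down = 0.5 if x > 0 else 0.0
    return up - down

# The drift is decreasing in x here, so the smallest state with negative drift
# is a valid threshold chi for condition 6.
chi = next(x for x in range(1000) if mean_drift(x) < 0)
assert all(mean_drift(x) < 0 for x in range(chi, chi + 500))
```

For this particular kernel the drift is +1 at 0, exactly 0 at state 1, and negative from state 2 on, so χ = 2.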
Under the preceding assumptions we have the following result.
Proposition 40. Suppose Assumptions 1, 2, and 3 hold. Then Φ(f) is nonempty for all f ∈ F, and there exists a compact, convex, nonempty set FC such that Φ(FC) ⊂ FC.
The proof of this proposition is provided in the Appendix. Combining Propositions 35, 38, and 40 and applying Kakutani's fixed point theorem, the existence of a mean field equilibrium follows.
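On a small finite model, the fixed point asserted by Kakutani's theorem can also be found by direct iteration of Φ: alternate a best-response computation (value iteration on the Bellman equation (A.3)) with recomputation of the induced invariant distribution, damping the update. The toy congestion-style game below is our own construction, purely for illustration; damped iteration is not guaranteed to converge in general, but it does here.

```python
import numpy as np

# Toy anonymous stochastic game (our own construction, not from the thesis):
# states 0..4, actions 0/1. Higher states pay more, damped by the population's
# mean state (a congestion effect); action 1 climbs faster but costs 0.02.
X, A, beta = 5, 2, 0.8

def payoff(x, a, f):
    return x * (1.0 - f @ np.arange(X) / (X - 1)) - 0.02 * a

def make_kernel(a):
    """P[x, x']: up w.p. 0.2 + 0.5a, down w.p. 0.3, else stay."""
    P = np.zeros((X, X))
    up = 0.2 + 0.5 * a
    for x in range(X):
        P[x, min(x + 1, X - 1)] += up
        P[x, max(x - 1, 0)] += 0.3
        P[x, x] += 0.7 - up
    return P

K = [make_kernel(a) for a in range(A)]

def best_response(f, iters=200):
    """Value-iterate the Bellman equation at population state f."""
    V = np.zeros(X)
    for _ in range(iters):
        Q = np.array([[payoff(x, a, f) + beta * K[a][x] @ V for a in range(A)]
                      for x in range(X)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)              # an optimal oblivious strategy

def invariant(mu):
    P = np.array([K[mu[x]][x] for x in range(X)])
    M = np.vstack([P.T - np.eye(X), np.ones(X)])
    g, *_ = np.linalg.lstsq(M, np.r_[np.zeros(X), 1.0], rcond=None)
    g = np.clip(g, 0, None)
    return g / g.sum()

f = np.ones(X) / X                        # initial population state
for _ in range(150):                      # damped iteration of Phi
    f = 0.9 * f + 0.1 * invariant(best_response(f))

mu = best_response(f)
assert np.linalg.norm(invariant(mu) - f, 1) < 1e-3   # (mu, f) approximates an MFE
```

At the fixed point, µ is a best response to f and f is the invariant distribution induced by µ, which is exactly the mean field equilibrium condition.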
Chapter 8
Conclusions and Future Work
Complex networks are becoming more pervasive in our lives. Yet the design and
understanding of these networks is still a challenging task. In this thesis, we looked at two particular challenges in the design of such networks: the reactive environment in which these networks operate, and the lack of complete information available to any decision maker.
To understand the effect of delay on decision making, we modeled complex systems as networks of interconnected Markov decision processes (MDPs), with subsystems connected to each other via delay lines. We considered a scenario where a centralized decision maker receives delayed state feedback from
each subsystem. Our main theorem shows that the central decision maker can make
optimal decisions based only on a subset of past information available to it. In other
words, beyond a certain history, the past is irrelevant to future decision making. We
also explored a connection between networked control systems and Bayesian networks
where we show that the amount of information required to make optimal decisions in
a networked MDP is related to the concept of a Markov blanket in Bayesian networks.
These results allow us to compute optimal controllers for networked MDPs in the presence of delays.
To cope with the reactive environment present in complex networks, we use the mean field equilibrium approach. This approach, motivated by statistical physics, deals with the complexity of interactions in complex systems. As part of
our research, we have developed a unified framework to study mean field equilibrium
behavior of large scale dynamical stochastic systems. In particular, we proved that
under a set of simple assumptions on the model, a mean field equilibrium always exists.
Furthermore, as a simple consequence of this existence theorem, we show that from the
viewpoint of a single agent, a near optimal decision making policy is one that reacts
only to the average behavior of its environment. This result unifies previously known results on mean field equilibrium in large-scale dynamical systems. In developing
this unified framework, we isolate and highlight the key modeling parameters which
make this mean field approach feasible. The mean field approach provides a low
complexity solution to a single agent’s decision making problem.
Although the issues addressed in this thesis are important to the design and un-
derstanding of complex systems, several important questions still remain unanswered.
Below we highlight some important questions that are pertinent to the understanding
of complex systems.
• Large Scale Systems with Local Interactions: The mean field analysis
serves well when there are a large number of agents. However, in many large
complex networks, an agent may interact with only a finite subpopulation of
an infinite collection of agents. For example, consider a network of electric cars
that are all trying to charge their batteries from the power grid. An electric vehicle's recharging schedule would depend mostly on other electric vehicles in
its immediate neighborhood. In such scenarios, we can use mean field analysis
as a starting point to design algorithms for large scale systems with local in-
teractions. Quantifying the scale at which mean field analysis becomes useful
would also help us better understand these systems.
• Distributed Control in the Presence of Delays: The current model of
networked MDPs assumes that there is a single decision maker or controller that
receives delayed state information from every subsystem. In several systems,
each subsystem has its own controller. These controllers may receive delayed
state information from only a small number of subsystems. Characterizing the information that is sufficient for each controller remains a challenge.
• Understanding the Price of Delay: Any networked system must address
decision making in the presence of delay. However, the effect of delay on overall
performance is very poorly understood. How does the optimal cost in a system
increase as more delay is incurred in receiving the information? Understanding
the price of delay would enable us to design approximate decision making rules
with a clear understanding of the gap to optimality. In scenarios where the delay
in receiving information can be managed or spread among different agents, this
price of delay would allow us to better understand the design of networked
systems with delays.
• Value of Information in Decision Making: In any complex system, agents
usually have partial information about the environment as well as about each
other. Imagine a transportation system where a particular vehicle needs to decide on a minimum-delay route to its destination. The vehicle makes its decision based on partial information about the congestion in the transportation
network. How does the overall delay in the system decrease as the vehicle
receives more information about the congestion in the network? What informa-
tion would the vehicle like to receive in order to minimize its delay? Should it
be congestion on its nearest links, or the vehicular traffic on the most congested
link? Understanding the value of information would enable us to understand
the right kind of information required to make a near optimal decision.
These and several other questions are of immense importance in the design of
complex systems. We believe that the tools and methodologies developed in this
thesis would provide a stepping stone in answering some or all of these questions.
Appendix A
Proofs
A.1 Preliminary Lemmas
In this section, we prove some preliminary lemmas that are used both in the AME proof and in the proof of existence of an MFE. We begin with the following lemma.
Lemma 41. Suppose Assumption 1 holds. Let x0 = x. Let at ∈ A be any sequence of (possibly history dependent) actions, and let ft ∈ F be any sequence of (possibly history dependent) population states. Let xt be the state sequence generated, i.e., xt ∼ P(· | xt−1, at−1, ft−1). Then for all T ≥ 0, there exists C(x, T) < ∞ such that:

E[ ∑_{t=T}^∞ β^t |π(xt, at, ft)| | x0 = x ] ≤ C(x, T).

Further, C(x, T) → 0 as T → ∞.
Proof. Observe that by Assumption 1, the increments are bounded. Thus starting from state x, we have ‖xt‖∞ ≤ ‖x‖∞ + tM. Again by Assumption 1, |π(xt, at, ft)| ≤ K(1 + ‖xt‖∞)^n. Therefore:

E[ ∑_{t=T}^∞ β^t |π(xt, at, ft)| | x0 = x ] ≤ K ∑_{t=T}^∞ β^t (1 + ‖x‖∞ + tM)^n.
We define C(x, 0) as the right hand side above when T = 0:

C(x, 0) = K ∑_{t=0}^∞ β^t (1 + ‖x‖∞ + tM)^n.

Observe that C(x, 0) < ∞.
We now reason as follows for T ≥ 1:

K ∑_{t=T}^∞ β^t (1 + ‖x‖∞ + tM)^n = K β^T ∑_{t=0}^∞ β^t (1 + ‖x‖∞ + tM + TM)^n
  = K β^T ∑_{t=0}^∞ β^t ∑_{j=0}^n (n choose j) (1 + ‖x‖∞ + tM)^j (TM)^{n−j}
  ≤ K β^T ∑_{t=0}^∞ β^t ∑_{j=0}^n (n choose j) (1 + ‖x‖∞ + tM)^n (TM)^n
  = K β^T 2^n (TM)^n ∑_{t=0}^∞ β^t (1 + ‖x‖∞ + tM)^n
  = C(x, 0) β^T (2MT)^n.

Here the inequality holds because 1 + ‖x‖∞ + tM ≥ 1, M ≥ 0, and T ≥ 1. So for T ≥ 1, define:

C(x, T) = C(x, 0) β^T (2MT)^n.    (A.1)
Then C(x, T )→ 0 as T →∞, as required.
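The bound (A.1) can be sanity-checked numerically: for sample parameter values (K, M, n, β, ‖x‖∞ all hypothetical), C(x, 0) β^T (2MT)^n dominates the truncated tail sum and tends to 0 as T grows:

```python
beta, K, M, n, xnorm = 0.9, 1.0, 2.0, 2, 3.0   # hypothetical sample parameters

def tail(T, horizon=2000):
    """K * sum_{t=T..horizon} beta^t (1 + ||x||_inf + t M)^n  (truncated tail)."""
    return K * sum(beta**t * (1.0 + xnorm + t * M)**n for t in range(T, horizon))

C0 = tail(0)                                    # C(x, 0)

def C(T):
    """Equation (A.1), valid for T >= 1."""
    return C0 * beta**T * (2.0 * M * T)**n

for T in range(1, 30):
    assert tail(T) <= C(T)                      # the derived bound holds
assert C(100) < C(50) < C(10)                   # and C(x, T) -> 0 as T grows
```

The horizon of 2000 is far enough into the geometric decay that the truncation error is negligible at these parameter values.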
We now show that the Bellman equation holds for the dynamic program solved by a single agent given a population state f. Our proof involves the use of a weighted sup norm, defined as follows. For each x ∈ X, let W(x) = (1 + ‖x‖∞)^n. For a function F : X → R, define:

‖F‖_{W-∞} = sup_{x∈X} | F(x) / W(x) |.

This is the weighted sup norm with weight function W. We let B(X) denote the set of all functions F : X → R such that ‖F‖_{W-∞} < ∞.
Let Tf denote the dynamic programming operator with population state f: given a function F : X → R, we have:

(Tf F)(x) = sup_{a∈A} { π(x, a, f) + β ∑_{x′∈X} F(x′) P(x′ | x, a, f) }.

We define Tf^k to be the composition of the mapping Tf with itself k times.
Lemma 42. Suppose Assumption 1 holds. For all f ∈ F, if F ∈ B(X) then Tf F ∈ B(X). Further, there exist k, ρ independent of f with 0 < ρ < 1 such that Tf is a k-stage ρ-contraction on B(X); i.e., if F, F′ ∈ B(X), then for all f:

‖Tf^k F − Tf^k F′‖_{W-∞} ≤ ρ ‖F − F′‖_{W-∞}.    (A.2)

In particular, value iteration converges to V∗(· | f) ∈ B(X) from any initial value function in B(X), and for all f ∈ F and x ∈ X, the Bellman equation holds:

V∗(x | f) = sup_{a∈A} { π(x, a, f) + β ∑_{x′∈X} V∗(x′ | f) P(x′ | x, a, f) }.    (A.3)

Further, V∗(x | f) is continuous in f.

Finally, there exists at least one optimal oblivious strategy among all (possibly history-dependent, possibly randomized) strategies; i.e., P(f) is nonempty. An oblivious strategy µ ∈ MO is optimal given f if and only if µ(x) achieves the maximum on the right hand side of (A.3) for every x ∈ X.
Proof. We have the following three properties:
1. By the growth rate bound in Assumption 1 we have supa |π(x, a, f)|/W (x) ≤ K
for all x.
2. We have:

W̄(x) = sup_{a∈A} ∑_{x′} P(x′ | x, a, f) W(x′) ≤ (1 + ‖x‖∞ + M)^n,

since the increments are bounded (Assumption 1). Thus W̄(x)/W(x) ≤ (1 + M)^n for all x.
3. Finally, fix ρ such that 0 < ρ < 1 and let:

W_k(x) = sup_{µ∈MO} E[ W(xk) | x0 = x, µ ],

where the state evolves according to xt+1 ∼ P(· | xt, µ(xt), f). By bounded increments in Assumption 1, we have:

β^k W_k(x) ≤ β^k (1 + ‖x‖∞ + kM)^n ≤ β^k (1 + kM)^n W(x).

By choosing k sufficiently large so that β^k (1 + kM)^n < ρ, we have:

β^k W_k(x) ≤ ρ W(x).
Given (1)-(3), by standard arguments (see, e.g., [19]), it follows that Tf is a k-
stage ρ-contraction with respect to the weighted sup norm, value iteration converges to
V ∗(· | f), the Bellman equation holds, and any (stationary, nonrandomized) oblivious
strategy that maximizes the right hand side in (A.3) for each x ∈ X is optimal.
Observe that since V ∗(· | f) ∈ B(X ) for any f , it follows that V ∗(x | f) <∞ for all
x. In fact, by Lemma 41, |V ∗(x | f)| ≤ C(x, 0) for all x.
Next we show that V∗(x | f) is continuous in f. Define Z(x) = 0 for all x, and let V_f^(ℓ) = Tf^ℓ Z. We first show that V_f^(ℓ)(x) is continuous in f. To see this, we proceed by induction. The result is trivially true at ℓ = 0. Next, observe that π(x, a, f) is jointly continuous in a and f for each fixed x by Assumption 1. Suppose V_f^(ℓ)(x) is continuous in f for each x; then V_f^(ℓ)(x′) P(x′ | x, a, f) is jointly continuous in a and f for each fixed x, x′. Since the kernel has bounded increments from Assumption 1, we conclude that ∑_{x′} V_f^(ℓ)(x′) P(x′ | x, a, f) is jointly continuous in a and f for each fixed x. It follows by Berge's maximum theorem [3] that V_f^(ℓ+1)(x) is continuous in f.
Fix ǫ > 0. Since Tf is a k-stage ρ-contraction in the weighted sup norm for every f, it follows that for all sufficiently large ℓ, for every f there holds:

|V_f^(ℓ)(x) − V∗(x | f)| ≤ W(x) ǫ.

So now suppose that fn → f in the 1-p norm. Since V_f^(ℓ)(x) is continuous in f, for all sufficiently large n there holds:

|V_{fn}^(ℓ)(x) − V_f^(ℓ)(x)| ≤ ǫ.

Thus using the triangle inequality, for all sufficiently large n we have:

|V∗(x | f) − V∗(x | fn)| ≤ (2W(x) + 1) ǫ.
Since ǫ was arbitrary it follows that the left hand side approaches zero as n→∞, as
required.
Finally, observe that by a similar argument as above, ∑_{x′} V∗(x′ | f) P(x′ | x, a, f) is a continuous function of a for each fixed x and f; since π(x, a, f) is also continuous
in a for each fixed f , the right hand side of (A.3) is continuous in a for each fixed
f . Since A is compact, it follows that there exists an optimal action at each state x,
and thus there exists an optimal strategy given f .
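The weighted sup norm machinery of the lemma can be observed numerically on a truncated chain standing in for Z+ (all payoffs and kernels hypothetical): value-iteration errors measured in ‖·‖_{W-∞} decay geometrically toward V∗.

```python
import numpy as np

# Hypothetical truncated model standing in for Z+ (states 0..199, two actions);
# the truncation at a large state is a numerical stand-in for the countable space.
N, beta = 200, 0.9
W = 1.0 + np.arange(N)                      # weight W(x) = (1 + x)^n with n = 1

def make_kernel(a):
    """Bounded-increment kernel: up w.p. 0.2 + 0.3a, down w.p. 0.4, else stay."""
    P = np.zeros((N, N))
    up = 0.2 + 0.3 * a
    for x in range(N):
        P[x, min(x + 1, N - 1)] += up
        P[x, max(x - 1, 0)] += 0.4
        P[x, x] += 0.6 - up
    return P

kernels = [make_kernel(0), make_kernel(1)]
pay = [np.sqrt(1.0 + np.arange(N)) - 0.1 * a for a in range(2)]  # growth within K(1+x)^n

def T(V):
    """The dynamic programming operator T_f of Lemma 42 (f is fixed here)."""
    return np.max([pay[a] + beta * kernels[a] @ V for a in range(2)], axis=0)

V = np.zeros(N)
for _ in range(400):                        # value iteration to (near) V*
    V = T(V)
Vstar = V

V = 50.0 * np.ones(N)                       # restart from a different point
errs = []
for _ in range(120):
    V = T(V)
    errs.append(np.max(np.abs(V - Vstar) / W))   # error in the weighted sup norm
assert errs[-1] < 1e-2 * errs[0]            # geometric decay toward V*
```

Because the payoff here is bounded on the truncated space, the operator contracts at rate β per application, and the weighted error inherits that decay.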
Lemma 43. Suppose Assumptions 1 and 2 hold. Then the right hand side of (A.3)
is nondecreasing in x and strictly concave in a.
Proof. Define Z(x) = 0 for all x, and let V_f^(ℓ) = Tf^ℓ Z. Observe that if V_f^(ℓ) is nondecreasing, then under the conditions of Proposition 38, it follows that V_f^(ℓ+1) will be nondecreasing. Taking the limit as ℓ → ∞, we conclude (from convergence of value iteration) that V∗(· | f) is nondecreasing, and thus the right hand side of (A.3) is nondecreasing in x.
Since V ∗(· | f) is nondecreasing, π(x, a, f) is concave in a, and the kernel is
stochastically concave in a, with at least one of the last two strictly concave, it
follows that the right hand side of (A.3) is strictly concave in a.
A.2 Proof of AME
In this section, we provide the proof of Theorem 32. Throughout this section, we
suppose Assumption 1 holds. We begin by defining the following sets.
Definition 44. For every x ∈ X, define

Xx = { z ∈ X : P(x | z, a, f) > 0 for some a ∈ A and for some f ∈ F }.    (A.4)

Also define Xx,t as

Xx,t = { z ∈ X : ‖z‖∞ ≤ ‖x‖∞ + tM }.    (A.5)
Thus, Xx is the set of all states from which x can be reached in a single transition.
Since the increments are bounded (Assumption 1), for every x ∈ X , the set Xx is
finite. The set Xx,t is a superset of all possible states that can be reached at time t
starting from state x (since the increments are uniformly bounded over action a and
distribution f); note that Xx,t is finite as well.
Lemma 45. Let (µ, f) be a mean field equilibrium. Consider an m-player game. Let x_{i,0}^(m) = x0 and suppose the initial state of every player (other than player i) is independently sampled from the distribution f. That is, suppose x_{j,0}^(m) ∼ f for all j ≠ i; let f^(m) ∈ F^(m) denote the initial population state. Let a_{i,t}^(m) be any sequence of (possibly random, possibly history dependent) actions. Suppose players' states evolve as:

x_{j,t+1}^(m) ∼ P(· | x_{j,t}^(m), µ(x_{j,t}^(m)), f_{−j,t}^(m))    for all j = 1, 2, . . . , m, j ≠ i,
x_{i,t+1}^(m) ∼ P(· | x_{i,t}^(m), a_{i,t}^(m), f_{−i,t}^(m)).

Then, for every initial state x0, for all times t, ‖f_{−i,t}^(m) − f‖_{1-p} → 0 almost surely as m → ∞.
Proof. Note that f ∈ F and hence ‖f‖_{1-p} < ∞. Thus, given any ǫ > 0, there exists a finite set Cǫ,f such that:

∑_{x∉Cǫ,f} ‖x‖_p^p f(x) < ǫ.    (A.6)
At t = 0, we have

f_{−i,0}^(m)(x) = (1/(m−1)) ∑_{j=1}^{m−1} 1{Xj,0 = x},

where the Xj,0 are i.i.d. random variables distributed according to the distribution f. Define:

Yj = ‖Xj,0‖_p^p 1{Xj,0 ∉ Cǫ,f}.

Note that the Yj are i.i.d. random variables, with:

E[Yj] = ∑_{x∉Cǫ,f} ‖x‖_p^p f(x).
Further, observe that:

∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,0}^(m)(x) = (1/(m−1)) ∑_{j=1}^{m−1} Yj.

Thus by the strong law of large numbers, almost surely as m → ∞,

∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,0}^(m)(x) → ∑_{x∉Cǫ,f} ‖x‖_p^p f(x) < ǫ.
Now observe that:

‖f_{−i,0}^(m) − f‖_{1-p} ≤ ∑_{x∈Cǫ,f} ‖x‖_p^p |f_{−i,0}^(m)(x) − f(x)| + ∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,0}^(m)(x) + ∑_{x∉Cǫ,f} ‖x‖_p^p f(x).

Each of the second and third terms on the right hand side is almost surely less than ǫ for sufficiently large m. For the first term, observe that |f_{−i,0}^(m)(x) − f(x)| → 0 almost surely, again by the strong law of large numbers (since f_{−i,0}^(m)(x) is the sample average of m − 1 Bernoulli random variables with parameter f(x)). Thus the first term approaches zero almost surely as m → ∞ by the bounded convergence theorem. Since ǫ was arbitrary, this proves that ‖f_{−i,0}^(m) − f‖_{1-p} → 0 almost surely as m → ∞.
We now use an induction argument: let us assume that ‖f_{−i,τ}^(m) − f‖_{1-p} → 0 almost surely as m → ∞ for all times τ ≤ t. From the definition of f_{−i,t+1}^(m) we have:

f_{−i,t+1}^(m)(y) = (1/(m−1)) ∑_{j≠i} 1{x_{j,t+1}^(m) = y},

where x_{j,t+1}^(m) ∼ P(· | x_{j,t}^(m), µ(x_{j,t}^(m)), f_{−j,t}^(m)) for all j ≠ i. Note that if two players have the same state, then the population state from their viewpoint is identical. That is, if x_{j,t}^(m) = x_{k,t}^(m), then f_{−j,t}^(m)(y) = f_{−k,t}^(m)(y) for all y ∈ X. We can thus redefine the population state from the viewpoint of a player at a particular state. Let f_t^(x,m) be the population state at time t from the viewpoint of a player at state x. Then, if x_{j,t}^(m) = x_{k,t}^(m) = x, then for all y ∈ X, f_{−j,t}^(m)(y) = f_{−k,t}^(m)(y) = f_t^(x,m)(y). Without loss of generality, we assume m > 1. Let η_{−i,t}^(m)(x) be the total number of players (excluding player i) whose state at time t is x, i.e., η_{−i,t}^(m)(x) = (m − 1) f_{−i,t}^(m)(x). Note that η_{−i,t}^(m)(x) = 0 if and only if f_{−i,t}^(m)(x) = 0. We can now write f_{−i,t+1}^(m)(y) as:
f_{−i,t+1}^(m)(y) = (1/(m−1)) ∑_{x∈X} ∑_{j=1}^{η_{−i,t}^(m)(x)} 1{Y_{j,x,t}^(m) = y}
  = ∑_{x∈X} f_{−i,t}^(m)(x) [ (1/η_{−i,t}^(m)(x)) ∑_{j=1}^{η_{−i,t}^(m)(x)} 1{Y_{j,x,t}^(m) = y} ]
  = ∑_{x∈Xy} f_{−i,t}^(m)(x) [ (1/η_{−i,t}^(m)(x)) ∑_{j=1}^{η_{−i,t}^(m)(x)} 1{Y_{j,x,t}^(m) = y} ],    (A.7)

where the last equality follows from Definition 44. Here, the Y_{j,x,t}^(m) are random variables that are independently drawn according to the transition kernel P(· | x, µ(x), f_t^(x,m)). Note that if η_{−i,t}^(m)(x) = 0, we interpret the term inside the brackets as zero.
Let us now look at f_t^(x,m). We have

f_t^(x,m)(z) = f_{−i,t}^(m)(z) + (1/(m−1)) 1{x_{i,t}^(m) = z} − (1/(m−1)) 1{z = x}.

Consider ‖f_t^(x,m) − f‖_{1-p}. We have:

‖f_t^(x,m) − f‖_{1-p} = ∑_{z∈X} ‖z‖_p^p |f_t^(x,m)(z) − f(z)|
  = ∑_{z∈X} ‖z‖_p^p |f_{−i,t}^(m)(z) + (1/(m−1)) 1{x_{i,t}^(m) = z} − (1/(m−1)) 1{z = x} − f(z)|
  ≤ ∑_{z∈X} ‖z‖_p^p |f_{−i,t}^(m)(z) − f(z)| + (1/(m−1)) ∑_{z∈X} ‖z‖_p^p 1{x_{i,t}^(m) = z} + (1/(m−1)) ∑_{z∈X} ‖z‖_p^p 1{z = x}
  = ‖f_{−i,t}^(m) − f‖_{1-p} + (1/(m−1)) ∑_{z∈X} ‖z‖_p^p 1{x_{i,t}^(m) = z} + (1/(m−1)) ∑_{z∈X} ‖z‖_p^p 1{z = x}.
From the induction hypothesis, we have ‖f_{−i,t}^(m) − f‖_{1-p} → 0 almost surely as m → ∞.
Note that at time t, x_{i,t}^(m) ∈ X_{x0,t} by equation (A.5), and X_{x0,t} is finite. Thus,

sup_m ∑_{z∈X} ‖z‖_p^p 1{x_{i,t}^(m) = z} < ∞.

This implies that for all states x ∈ X, ‖f_t^(x,m) − f‖_{1-p} → 0 almost surely as m → ∞. From Assumption 1, we know that the transition kernel is continuous in the population state f (where F is endowed with the 1-p norm). Thus for every x ∈ X, we have almost surely:

P(· | x, µ(x), f_t^(x,m)) → P(· | x, µ(x), f),    (A.8)

as m → ∞.
Next, we show that f_{−i,t+1}^(m)(y) → f(y) almost surely as m → ∞, for all y. We leverage equation (A.7). Observe that the set of points x ∈ X where ‖x‖_p ≤ 1 is finite, since X is a subset of an integer lattice. From the induction hypothesis, since ∑_{x∈X} ‖x‖_p^p |f_{−i,t}^(m)(x) − f(x)| → 0 almost surely as m → ∞, it follows that f_{−i,t}^(m)(x) → f(x) almost surely for all x ∈ X as m → ∞.
Suppose that x ∈ Xy and f(x) > 0. Since f_{−i,t}^(m)(x) → f(x), it follows that η_{−i,t}^(m)(x) → ∞ as m → ∞, almost surely. Note that the Y_{j,x,t}^(m) are random variables that are independently drawn according to the transition kernel P(· | x, µ(x), f_t^(x,m)). From equation (A.8) and Lemma 46, we get that for every x, y ∈ X, there holds

(1/η_{−i,t}^(m)(x)) ∑_{j=1}^{η_{−i,t}^(m)(x)} 1{Y_{j,x,t}^(m) = y} → P(y | x, µ(x), f),

almost surely as m → ∞.
On the other hand, suppose x ∈ Xy and f(x) = 0. Again, since f_{−i,t}^(m)(x) → f(x) as m → ∞, it follows that as m → ∞, almost surely:

f_{−i,t}^(m)(x) [ (1/η_{−i,t}^(m)(x)) ∑_{j=1}^{η_{−i,t}^(m)(x)} 1{Y_{j,x,t}^(m) = y} ] → 0,

since the term in brackets is nonnegative and bounded. (Recall we interpret the term in brackets as zero if f_{−i,t}^(m)(x) = 0.)
We conclude that, almost surely, as m → ∞:

f_{−i,t+1}^(m)(y) = ∑_{x∈Xy} f_{−i,t}^(m)(x) [ (1/η_{−i,t}^(m)(x)) ∑_{j=1}^{η_{−i,t}^(m)(x)} 1{Y_{j,x,t}^(m) = y} ] → ∑_{x∈Xy} f(x) P(y | x, µ(x), f) = f(y).
To complete the proof, we need to show that ‖f_{−i,t+1}^(m) − f‖_{1-p} → 0 almost surely as m → ∞. Since f_{−i,t}^(m)(x) → f(x) almost surely, for all ǫ > 0 we have:

∑_{x∈Cǫ,f} ‖x‖_p^p f_{−i,t}^(m)(x) → ∑_{x∈Cǫ,f} ‖x‖_p^p f(x).

This together with the fact that ‖f_{−i,t}^(m) − f‖_{1-p} → 0 implies that, almost surely:

lim sup_{m→∞} ∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,t}^(m)(x) < ǫ.    (A.9)
Now at time t + 1, we have

∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,t+1}^(m)(x) = ∑_{x∉Cǫ,f} ∑_{ℓ=1}^d |xℓ|^p f_{−i,t+1}^(m)(x)
  ≤ ∑_{x∉Cǫ,f} ∑_{ℓ=1}^d (|xℓ| + M)^p f_{−i,t}^(m)(x),    (A.10)

where the equality follows because X is a subset of the d-dimensional integer lattice, and the inequality follows from the fact that the increments are bounded (Assumption 1). Without loss of generality, assume that |xℓ| ≥ 1 and that M ≥ 1. Then we have:
(|xℓ| + M)^p = ∑_{j=0}^p (p choose j) |xℓ|^j M^{p−j}
  ≤ ∑_{j=0}^p (p choose j) |xℓ|^p M^p
  = 2^p M^p |xℓ|^p = K1 |xℓ|^p,

where we let K1 = (2M)^p. Substituting in equation (A.10), we have, almost surely,
lim sup_{m→∞} ∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,t+1}^(m)(x) ≤ lim sup_{m→∞} ∑_{x∉Cǫ,f} ∑_{ℓ=1}^d K1 |xℓ|^p f_{−i,t}^(m)(x)
  = lim sup_{m→∞} K1 ∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,t}^(m)(x)
  < K1 ǫ,

where the last inequality follows from equation (A.9). Now observe that:
‖f_{−i,t+1}^(m) − f‖_{1-p} ≤ ∑_{x∈Cǫ,f} ‖x‖_p^p |f_{−i,t+1}^(m)(x) − f(x)| + ∑_{x∉Cǫ,f} ‖x‖_p^p f_{−i,t+1}^(m)(x) + ∑_{x∉Cǫ,f} ‖x‖_p^p f(x).
Taking a lim sup on the left hand side, the second term on the right hand side is almost surely less than K1ǫ. From the definition of Cǫ,f and equation (A.6), the third term on the right hand side is also less than ǫ. Finally, since for every x, |f_{−i,t+1}^(m)(x) − f(x)| → 0 almost surely as m → ∞, and Cǫ,f is finite, the first term approaches zero almost surely as m → ∞ by the bounded convergence theorem. Since ǫ was arbitrary, this proves the induction step and hence the lemma.
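Lemma 45 can be illustrated by simulation. The sketch below uses a hypothetical population-independent kernel (a special case of the lemma's setting) whose invariant distribution is f, samples m players from f, evolves them, and measures the weighted distance ∑_x ‖x‖_p^p |f_emp(x) − f(x)| used in the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state kernel (independent of the population state, a special
# case of the lemma's setting); f below is its invariant distribution.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
A = np.vstack([P.T - np.eye(3), np.ones(3)])
f, *_ = np.linalg.lstsq(A, np.r_[np.zeros(3), 1.0], rcond=None)
f = np.clip(f, 0, None)
f /= f.sum()

def weighted_gap(m, t=5, p=1):
    """Simulate m players for t steps from f and return the weighted distance
    sum_x ||x||_p^p |f_emp(x) - f(x)| used in the proof."""
    states = rng.choice(3, size=m, p=f)        # x_{j,0} ~ f
    for _ in range(t):                         # evolve each player by the kernel
        u = rng.random(m)
        nxt = np.empty(m, dtype=int)
        for s in range(3):
            idx = states == s
            nxt[idx] = np.searchsorted(np.cumsum(P[s]), u[idx])
        states = nxt
    emp = np.bincount(states, minlength=3) / m
    return float(np.sum(np.arange(3) ** p * np.abs(emp - f)))

gaps = [weighted_gap(m) for m in (100, 10_000, 1_000_000)]
assert gaps[2] < gaps[0] and gaps[2] < 0.01   # the empirical state approaches f
```

Since each player's marginal stays at f (the chain starts at its invariant distribution), the gap here is pure sampling error, shrinking at the usual 1/√m rate.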
The preceding proof uses the following refinement of the strong law of large numbers.
Lemma 46. Suppose 0 ≤ pk ≤ 1 for all k, and that pk → p as k → ∞. For each k, let Y_1^(k), . . . , Y_k^(k) be i.i.d. Bernoulli random variables with parameter pk. Then almost surely:

lim_{k→∞} (1/k) ∑_{i=1}^k Y_i^(k) = p.
Proof. Let ǫ > 0. By Hoeffding's inequality, we have:

Prob( | (1/k) ∑_{i=1}^k Y_i^(k) − pk | > ǫ ) ≤ 2e^{−kǫ²},

since 0 ≤ Y_i^(k) ≤ 1 for all i, k. Let ǫℓ = 1/ℓ; for each ℓ the bound above is summable in k, so by the Borel–Cantelli lemma, the event on the left hand side in the preceding expression occurs for only finitely many k, almost surely. In other words, almost surely:

lim_{k→∞} [ pk − (1/k) ∑_{i=1}^k Y_i^(k) ] = 0.
The result follows.
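Lemma 46 is easy to illustrate by simulation with a hypothetical sequence pk → p:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3                                   # hypothetical limit of the p_k

def row_average(k):
    """Average of row k of the triangular array: k i.i.d. Bernoulli(p_k)
    draws, with the (hypothetical) sequence p_k = p + 1/k -> p."""
    return float((rng.random(k) < p + 1.0 / k).mean())

avgs = {k: row_average(k) for k in (10, 1_000, 1_000_000)}
assert abs(avgs[1_000_000] - p) < 0.01    # the row averages converge to p
```

Unlike the ordinary strong law, each row here uses a fresh parameter p_k, which is exactly the triangular-array setting needed in the proof of Lemma 45.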
Before we prove the AME property, we need some additional notation. Let (µ, f) be a mean field equilibrium. Consider again an m-player game and focus on player i. Let x_{i,0}^(m) = x0 and assume that player i uses a cognizant strategy µm. The initial state of every other player j ≠ i is independently drawn from the distribution f; that is, x_{j,0}^(m) ∼ f. Denote the initial distribution of all m − 1 players (excluding player i) by f^(m) ∈ F^(m). The state evolution of player i is given by
x(m)i,t+1 ∼ P
(
· | x(m)i,t , a
(m)i,t ,f
(m)−i,t
)
, (A.11)
where a(m)i,t = µm
(x
(m)i,t ,f
(m)−i,t
)and f
(m)−i,t is the actual population distribution. Here
the superscript m on the state variable represents the fact that we are considering
an m player stochastic game. Let every other player j use the oblivious strategy µ
APPENDIX A. PROOFS 98
and thus their state evolution is given by
x(m)j,t+1 ∼ P
(
· | x(m)j,t , µ
(x
(m)j,t
), f
(m)−j,t
)
. (A.12)
Define $V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right)$ to be the actual value function of player $i$, with initial state $x$ and initial distribution of the rest of the population $f^{(m)} \in \mathcal{F}^{(m)}$, when the player uses a cognizant strategy $\mu_m$ and every other player uses the oblivious strategy $\mu$. We have
\[
V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \pi\left(x_{i,t}, a_{i,t}, f^{(m)}_{-i,t}\right) \,\Big|\, x_{i,0} = x,\ f^{(m)}_{-i,0} = f^{(m)};\ \mu_i = \mu_m,\ \mu_{-i} = \mu^{(m-1)} \right]. \tag{A.13}
\]
We define a new player that is coupled to player $i$ in the $m$ player stochastic game defined above. We call this player the coupled player. Let $\tilde{x}^{(m)}_{i,t}$ be the state of this coupled player at time $t$. The subscript $i$ and the superscript $m$ reflect the fact that this player is coupled to player $i$ in an $m$ player stochastic game. We assume that the state evolution of this player is given by:
\[
\tilde{x}^{(m)}_{i,t+1} \sim P\left( \cdot \mid \tilde{x}^{(m)}_{i,t}, \tilde{a}^{(m)}_{i,t}, f \right), \tag{A.14}
\]
where $\tilde{a}^{(m)}_{i,t} = a^{(m)}_{i,t} = \mu_m\left(x^{(m)}_{i,t}, f^{(m)}_{-i,t}\right)$. In other words, the coupled player takes the same action as player $i$ at every time $t$, and this action depends on the actual population state of the $m-1$ players. However, note that the coupled player's state evolution depends only on the mean field distribution $f$. Let us define
\[
\tilde{V}^{(m)}\left(x \mid f;\ \mu_m, \mu^{(m-1)}\right) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \pi\left(\tilde{x}^{(m)}_{i,t}, \tilde{a}^{(m)}_{i,t}, f\right) \,\Big|\, \tilde{x}^{(m)}_{i,0} = x,\ \tilde{a}^{(m)}_{i,t} = \mu_m\left(x^{(m)}_{i,t}, f^{(m)}_{-i,t}\right);\ \mu^{(m-1)} \right]. \tag{A.15}
\]
Thus, $\tilde{V}^{(m)}\left(x \mid f;\ \mu_m, \mu^{(m-1)}\right)$ is the expected net present value of the coupled player when its initial state is $x$ and the long run average population state is $f$. Observe that
\[
\tilde{V}^{(m)}\left(x \mid f;\ \mu_m, \mu^{(m-1)}\right) \le \sup_{\mu' \in \mathcal{M}} \tilde{V}^{(m)}\left(x \mid f;\ \mu', \mu^{(m-1)}\right) = \sup_{\mu' \in \mathcal{M}_O} \tilde{V}^{(m)}\left(x \mid f;\ \mu', \mu^{(m-1)}\right) = V^*(x \mid f) = V(x \mid \mu, f). \tag{A.16}
\]
Here, the first equality follows from Lemma 42, which implies that the supremum over all cognizant strategies equals the supremum over oblivious strategies (since the state evolution of the other players does not affect the payoff of the coupled player), and the last equality follows since $\mu \in P(f)$.
Lemma 47. Let $(\mu, f)$ be a mean field equilibrium and consider an $m$ player game. Let the initial state of player $i$ be $x^{(m)}_{i,0} = x$, and let $f^{(m)} \in \mathcal{F}^{(m)}$ be the initial population state of the $m-1$ players whose initial states are sampled independently from the distribution $f$. Assume that player $i$ uses a cognizant strategy $\mu_m$ and every other player uses the oblivious strategy $\mu$; their state evolutions are given by equations (A.11) and (A.12). Also define a coupled player with initial state $\tilde{x}^{(m)}_{i,0} = x$ and let its state evolution be given by equation (A.14).

Then, for all times $t$ and for every $y \in \mathcal{X}$, we have
\[
\left| \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right| \to 0,
\]
almost surely as $m \to \infty$.
Proof. The lemma is trivially true for $t = 0$. Assume that it holds for all times $\tau = 0, 1, \ldots, t-1$. Then we have
\[
\mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) = \sum_{z \in \mathcal{X}_y} \mathrm{Prob}\left(x^{(m)}_{i,t-1} = z\right) P\left(y \mid z, \mu_m\left(z, f^{(m)}_{-i,t-1}\right), f^{(m)}_{-i,t-1}\right),
\]
\[
\mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) = \sum_{z \in \mathcal{X}_y} \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t-1} = z\right) P\left(y \mid z, \mu_m\left(z, f^{(m)}_{-i,t-1}\right), f\right).
\]
Here we use the fact that the coupled player takes the same action as player $i$ and that the state evolution of the coupled player is given by equation (A.14). Note that the summation is over all states in the finite set $\mathcal{X}_y$, where $\mathcal{X}_y$ is defined as in equation (A.4).

From Lemma 45, we know that for all times $t$, $\left\| f^{(m)}_{-i,t} - f \right\|_{1\text{-}p} \to 0$ almost surely as $m \to \infty$. From Assumption 1, the transition kernel is jointly continuous in the action $a$ and the distribution $f$ (where the set of distributions $\mathcal{F}$ is endowed with the 1-$p$ norm). Since the action set is compact, this implies that for all $y, z \in \mathcal{X}$,
\[
\lim_{m \to \infty} \sup_{a \in \mathcal{A}} \left| P\left(y \mid z, a, f^{(m)}_{-i,t-1}\right) - P\left(y \mid z, a, f\right) \right| = 0.
\]
It follows that for every $y, z \in \mathcal{X}$,
\[
\lim_{m \to \infty} \left| P\left(y \mid z, \mu_m\left(z, f^{(m)}_{-i,t-1}\right), f^{(m)}_{-i,t-1}\right) - P\left(y \mid z, \mu_m\left(z, f^{(m)}_{-i,t-1}\right), f\right) \right| = 0
\]
almost surely. From the induction hypothesis, we know that for every $z \in \mathcal{X}$,
\[
\left| \mathrm{Prob}\left(x^{(m)}_{i,t-1} = z\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t-1} = z\right) \right| \to 0
\]
almost surely as $m \to \infty$. Together with the finiteness of the set $\mathcal{X}_y$, this gives that for every $y \in \mathcal{X}$,
\[
\left| \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right| \to 0
\]
almost surely as $m \to \infty$. This proves the lemma.
Lemma 48. Let $(\mu, f)$ be a mean field equilibrium and consider an $m$ player game. Let the initial state of player $i$ be $x^{(m)}_{i,0} = x$, and let $f^{(m)} \in \mathcal{F}^{(m)}$ be the initial population state of the $m-1$ players whose initial states are sampled independently from the distribution $f$. Assume that player $i$ uses a cognizant strategy $\mu_m$ and every other player uses the oblivious strategy $\mu$; their state evolutions are given by equations (A.11) and (A.12). Also define a coupled player with initial state $\tilde{x}^{(m)}_{i,0} = x$ and let its state evolution be given by equation (A.14).

Then, for all times $t$, we have
\[
\limsup_{m \to \infty} \mathbb{E}\left[ \pi\left(x^{(m)}_{i,t}, \mu_m\left(x^{(m)}_{i,t}, f^{(m)}_{-i,t}\right), f^{(m)}_{-i,t}\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, \mu_m\left(x^{(m)}_{i,t}, f^{(m)}_{-i,t}\right), f\right) \right] \le 0,
\]
almost surely.
Proof. Write $a^{(m)}_{i,t} = \mu_m\left(x^{(m)}_{i,t}, f^{(m)}_{-i,t}\right)$. We have
\[
\Delta^{(m)}_{i,t} = \mathbb{E}\left[ \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f^{(m)}_{-i,t}\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, a^{(m)}_{i,t}, f\right) \right]
= \mathbb{E}\left[ \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f^{(m)}_{-i,t}\right) - \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f\right) \right] + \mathbb{E}\left[ \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, a^{(m)}_{i,t}, f\right) \right]
\triangleq T^{(m)}_{1,t} + T^{(m)}_{2,t}.
\]
Consider the first term. We have
\[
T^{(m)}_{1,t} = \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \left( \pi\left(y, a^{(m)}_{i,t}, f^{(m)}_{-i,t}\right) - \pi\left(y, a^{(m)}_{i,t}, f\right) \right)
\le \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \sup_{a \in \mathcal{A}} \left| \pi\left(y, a, f^{(m)}_{-i,t}\right) - \pi\left(y, a, f\right) \right|
= \sum_{y \in \mathcal{X}_{x,t}} \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \sup_{a \in \mathcal{A}} \left| \pi\left(y, a, f^{(m)}_{-i,t}\right) - \pi\left(y, a, f\right) \right|,
\]
where the last equality follows from the fact that $x^{(m)}_{i,0} = x$ and from equation (A.5). From Assumption 1, the payoff is jointly continuous in the action $a$ and the distribution $f$ (with the set of distributions $\mathcal{F}$ endowed with the 1-$p$ norm), and the set $\mathcal{A}$ is compact. Thus, for every $y \in \mathcal{X}$, we have
\[
\sup_{a \in \mathcal{A}} \left| \pi\left(y, a, f^{(m)}_{-i,t}\right) - \pi\left(y, a, f\right) \right| \to 0,
\]
almost surely as $m \to \infty$. Together with the finiteness of $\mathcal{X}_{x,t}$, this shows that $\limsup_{m \to \infty} T^{(m)}_{1,t} \le 0$ almost surely.

Now consider the second term. We have
\[
T^{(m)}_{2,t} = \mathbb{E}\left[ \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, a^{(m)}_{i,t}, f\right) \right]
= \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \pi\left(y, a^{(m)}_{i,t}, f\right) - \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \pi\left(y, a^{(m)}_{i,t}, f\right)
= \sum_{y \in \mathcal{X}} \left( \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right) \pi\left(y, a^{(m)}_{i,t}, f\right)
\]
\[
\le \sum_{y \in \mathcal{X}} \left| \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right| \left| \pi\left(y, a^{(m)}_{i,t}, f\right) \right|
\le \sum_{y \in \mathcal{X}} \left| \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right| \sup_{a \in \mathcal{A}} \left| \pi(y, a, f) \right|
= \sum_{y \in \mathcal{X}_{x,t}} \left| \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right| \sup_{a \in \mathcal{A}} \left| \pi(y, a, f) \right|,
\]
where the last equality follows from the fact that $x^{(m)}_{i,0} = \tilde{x}^{(m)}_{i,0} = x$ and from Definition 44. From Lemma 47, we know that for every $y \in \mathcal{X}$,
\[
\left| \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) - \mathrm{Prob}\left(\tilde{x}^{(m)}_{i,t} = y\right) \right| \to 0
\]
almost surely as $m \to \infty$. Since $\mathcal{X}_{x,t}$ is finite for every fixed $x \in \mathcal{X}$ and every time $t$, this implies that $\limsup_{m \to \infty} T^{(m)}_{2,t} \le 0$ almost surely. This proves the lemma.
Before we proceed further, we need one additional piece of notation. Once again let $(\mu, f)$ be a mean field equilibrium and consider an oblivious player. Let $x_t$ be the state of this oblivious player at time $t$. We assume that $x_0 = x$ and, since the player uses the oblivious strategy $\mu$, the state evolution of this player is given by
\[
x_{t+1} \sim P\left( \cdot \mid x_t, a_t, f \right), \tag{A.17}
\]
where $a_t = \mu(x_t)$. Define $V\left(x \mid \mu, f\right)$ to be the oblivious value function for this player starting from state $x$. That is,
\[
V\left(x \mid \mu, f\right) \triangleq \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \pi\left(x_t, a_t, f\right) \,\Big|\, x_0 = x;\ \mu \right]. \tag{A.18}
\]
Also, consider an $m$ player game and focus on player $i$. We represent the state of player $i$ at time $t$ by $x^{(m)}_{i,t}$; as before, the superscript $m$ indicates that we are considering an $m$ player stochastic game. Let $x^{(m)}_{i,0} = x$ and let player $i$ also use the oblivious strategy $\mu$. The initial state of every other player $j \neq i$ is drawn independently from the distribution $f$, that is, $x^{(m)}_{j,0} \sim f$. Denote the initial distribution of all $m-1$ players (excluding player $i$) by $f^{(m)} \in \mathcal{F}^{(m)}$. The state evolution of player $i$ is then given by
\[
x^{(m)}_{i,t+1} \sim P\left( \cdot \mid x^{(m)}_{i,t}, a^{(m)}_{i,t}, f^{(m)}_{-i,t} \right), \tag{A.19}
\]
where $a^{(m)}_{i,t} = \mu\left(x^{(m)}_{i,t}\right)$. Note that even though the player uses an oblivious strategy, its state evolution is affected by the actual population state. Let every other player $j$ also use the oblivious strategy $\mu$ and let their state evolution be given by
\[
x^{(m)}_{j,t+1} \sim P\left( \cdot \mid x^{(m)}_{j,t}, \mu\left(x^{(m)}_{j,t}\right), f^{(m)}_{-j,t} \right). \tag{A.20}
\]
Define $V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right)$ to be the actual value function of the player when the initial state of the player is $x$, the initial population distribution is $f^{(m)}$, and every player uses the oblivious strategy $\mu$. That is,
\[
V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \pi\left(x_{i,t}, a_{i,t}, f^{(m)}_{-i,t}\right) \,\Big|\, x_{i,0} = x,\ f^{(m)}_{-i,0} = f^{(m)};\ \mu_i = \mu,\ \mu_{-i} = \mu^{(m-1)} \right]. \tag{A.21}
\]
Lemma 49. Let $(\mu, f)$ be a mean field equilibrium and consider an $m$ player stochastic game. Let $x^{(m)}_{i,0} = x$, and let $f^{(m)} \in \mathcal{F}^{(m)}$ be the initial population state of the $m-1$ players whose initial states are sampled independently from $f$. Assume that every player uses the oblivious strategy $\mu$, so that the state evolutions are given by equations (A.19) and (A.20). Also, consider an oblivious player with $x_0 = x$ and let its state evolution be given by equation (A.17).

Then, for every time $t$ and for all $y \in \mathcal{X}$, we have
\[
\left| \mathrm{Prob}\left(x_t = y\right) - \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \right| \to 0, \tag{A.22}
\]
almost surely as $m \to \infty$.
Proof. The lemma is trivially true for $t = 0$. Assume that it holds for all times $\tau = 0, 1, \ldots, t-1$. Then we have
\[
\mathrm{Prob}\left(x_t = y\right) = \sum_{z \in \mathcal{X}_y} \mathrm{Prob}\left(x_{t-1} = z\right) P\left(y \mid z, \mu(z), f\right),
\]
\[
\mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) = \sum_{z \in \mathcal{X}_y} \mathrm{Prob}\left(x^{(m)}_{i,t-1} = z\right) P\left(y \mid z, \mu(z), f^{(m)}_{-i,t-1}\right).
\]
Note that the summation above is over all states in the finite set $\mathcal{X}_y$ (as defined in Definition 44).

From Lemma 45, we know that for all times $t$, $\left\| f^{(m)}_{-i,t} - f \right\|_{1\text{-}p} \to 0$ almost surely as $m \to \infty$. From Assumption 1, the transition kernel is continuous in the distribution (where the set of distributions $\mathcal{F}$ is endowed with the 1-$p$ norm), so for every $y, z \in \mathcal{X}$,
\[
\left| P\left(y \mid z, \mu(z), f^{(m)}_{-i,t-1}\right) - P\left(y \mid z, \mu(z), f\right) \right| \to 0
\]
almost surely. From the induction hypothesis, we know that
\[
\left| \mathrm{Prob}\left(x_{t-1} = z\right) - \mathrm{Prob}\left(x^{(m)}_{i,t-1} = z\right) \right| \to 0.
\]
Together with the finiteness of the set $\mathcal{X}_y$, this gives that for every $y \in \mathcal{X}$,
\[
\left| \mathrm{Prob}\left(x_t = y\right) - \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \right| \to 0
\]
almost surely as $m \to \infty$. This proves the lemma.
Lemma 50. Let $(\mu, f)$ be a mean field equilibrium and consider an $m$ player stochastic game. Let $x^{(m)}_{i,0} = x$, and let $f^{(m)} \in \mathcal{F}^{(m)}$ be the initial population state of the $m-1$ players whose initial states are sampled independently from $f$. Assume that every player uses the oblivious strategy $\mu$, so that the state evolutions are given by equations (A.19) and (A.20). Also, consider an oblivious player with $x_0 = x$ and let its state evolution be given by equation (A.17).

Then for all times $t$, we have
\[
\mathbb{E}\left[ \pi\left(x_t, \mu(x_t), f\right) - \pi\left(x^{(m)}_{i,t}, \mu\left(x^{(m)}_{i,t}\right), f^{(m)}_{-i,t}\right) \right] \to 0,
\]
almost surely as $m \to \infty$.
Proof. Define $\Delta^{(m)}_{i,t}$ as
\[
\Delta^{(m)}_{i,t} = \mathbb{E}\left[ \pi\left(x_t, \mu(x_t), f\right) - \pi\left(x^{(m)}_{i,t}, \mu\left(x^{(m)}_{i,t}\right), f^{(m)}_{-i,t}\right) \right]
= \mathbb{E}\left[ \pi\left(x_t, \mu(x_t), f\right) - \pi\left(x_t, \mu(x_t), f^{(m)}_{-i,t}\right) \right] + \mathbb{E}\left[ \pi\left(x_t, \mu(x_t), f^{(m)}_{-i,t}\right) - \pi\left(x^{(m)}_{i,t}, \mu\left(x^{(m)}_{i,t}\right), f^{(m)}_{-i,t}\right) \right]
\triangleq T^{(m)}_{1,t} + T^{(m)}_{2,t}.
\]
Note that from Lemma 45, we have $\left\| f^{(m)}_{-i,t} - f \right\|_{1\text{-}p} \to 0$ almost surely as $m \to \infty$. From Assumption 1, the payoff is continuous in the distribution, where the set of distributions $\mathcal{F}$ is endowed with the 1-$p$ norm. Thus, for every $y$ and $a$, we have
\[
\left| \pi(y, a, f) - \pi\left(y, a, f^{(m)}_{-i,t}\right) \right| \to 0, \tag{A.23}
\]
almost surely as $m \to \infty$. Consider the first term. We have:
\[
\left| T^{(m)}_{1,t} \right| \le \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(x_t = y\right) \left| \pi(y, \mu(y), f) - \pi\left(y, \mu(y), f^{(m)}_{-i,t}\right) \right|
= \sum_{y \in \mathcal{X}_{x,t}} \mathrm{Prob}\left(x_t = y\right) \left| \pi(y, \mu(y), f) - \pi\left(y, \mu(y), f^{(m)}_{-i,t}\right) \right|,
\]
where the last equality follows from the fact that $x_0 = x$ and from Definition 44. Since $\mathcal{X}_{x,t}$ is a finite set for every initial state $x \in \mathcal{X}$ and every time $t$, we get that $T^{(m)}_{1,t} \to 0$ almost surely as $m \to \infty$.

Consider now the second term. We have:
\[
T^{(m)}_{2,t} = \mathbb{E}\left[ \pi\left(x_t, \mu(x_t), f^{(m)}_{-i,t}\right) - \pi\left(x^{(m)}_{i,t}, \mu\left(x^{(m)}_{i,t}\right), f^{(m)}_{-i,t}\right) \right]
= \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(x_t = y\right) \pi\left(y, \mu(y), f^{(m)}_{-i,t}\right) - \sum_{y \in \mathcal{X}} \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \pi\left(y, \mu(y), f^{(m)}_{-i,t}\right)
= \sum_{y \in \mathcal{X}_{x,t}} \left( \mathrm{Prob}\left(x_t = y\right) - \mathrm{Prob}\left(x^{(m)}_{i,t} = y\right) \right) \pi\left(y, \mu(y), f^{(m)}_{-i,t}\right).
\]
From Lemma 49, equation (A.23), and the finiteness of $\mathcal{X}_{x,t}$, we get that $T^{(m)}_{2,t} \to 0$ almost surely. This proves the lemma.
Proof. [Proof of Theorem 32] Let us define
\[
\Delta V^{(m)}\left(x, f^{(m)}\right) \triangleq V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) - V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right).
\]
We need to show that for all $x$, $\limsup_{m \to \infty} \Delta V^{(m)}\left(x, f^{(m)}\right) \le 0$ almost surely. We can write
\[
\Delta V^{(m)}\left(x, f^{(m)}\right) = V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) - V(x \mid \mu, f) + V(x \mid \mu, f) - V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right)
\]
\[
\le \left[ V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) - \tilde{V}^{(m)}\left(x \mid f;\ \mu_m, \mu^{(m-1)}\right) \right] + \left[ V(x \mid \mu, f) - V^{(m)}\left(x, f^{(m)} \mid \mu^{(m)}\right) \right]
\triangleq T^{(m)}_1 + T^{(m)}_2.
\]
Here the inequality follows from equation (A.16). Consider the term $T^{(m)}_1$. We have
\[
T^{(m)}_1 = V^{(m)}\left(x, f^{(m)} \mid \mu_m, \mu^{(m-1)}\right) - \tilde{V}^{(m)}\left(x \mid f;\ \mu_m, \mu^{(m-1)}\right)
= \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \left( \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f^{(m)}_{-i,t}\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, \tilde{a}^{(m)}_{i,t}, f\right) \right) \right],
\]
where the last equality follows from equations (A.13) and (A.15). Note that $x^{(m)}_{i,0} = \tilde{x}^{(m)}_{i,0} = x$, that $a^{(m)}_{i,t} = \tilde{a}^{(m)}_{i,t} = \mu_m\left(x^{(m)}_{i,t}, f^{(m)}_{-i,t}\right)$, and that the state transitions of the players are given by equations (A.11), (A.12), and (A.14). From Lemma 48, we have
\[
\limsup_{m \to \infty} \mathbb{E}\left[ \sum_{t=0}^{T-1} \beta^t \left( \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f^{(m)}_{-i,t}\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, \tilde{a}^{(m)}_{i,t}, f\right) \right) \right] \le 0,
\]
almost surely for any finite time $T$. From Lemma 41, we have, almost surely,
\[
\mathbb{E}\left[ \sum_{t=T}^{\infty} \beta^t \left( \pi\left(x^{(m)}_{i,t}, a^{(m)}_{i,t}, f^{(m)}_{-i,t}\right) - \pi\left(\tilde{x}^{(m)}_{i,t}, \tilde{a}^{(m)}_{i,t}, f\right) \right) \right] \le 2C(x, T),
\]
which goes to zero as $T \to \infty$. This proves that $\limsup_{m \to \infty} T^{(m)}_1 \le 0$ almost surely. A similar analysis (with an application of Lemma 50) shows that $\limsup_{m \to \infty} T^{(m)}_2 \le 0$ almost surely, yielding the result.
A.3 Compactness: Proof
In this section, we provide the proof of Proposition 40. Throughout this subsection
we suppose Assumptions 1, 2, and 3 are in effect.
Lemma 51. Given $x' \ge x$, $a \in \mathcal{A}$, and $f \in \mathcal{F}$, there exists a probability space with random variables $\xi' \sim Q(\cdot \mid x', a, f)$ and $\xi \sim Q(\cdot \mid x, a, f)$ such that $\xi' \le \xi$ almost surely and $x' + \xi' \ge x + \xi$ almost surely.

Proof. The proof uses a standard coupling argument. Let $U$ be a uniform random variable on $[0, 1]$. Let $F$ (resp., $F'$) be the cumulative distribution function of $Q(\cdot \mid x, a, f)$ (resp., $Q(\cdot \mid x', a, f)$), and let $G$ (resp., $G'$) be the cumulative distribution function of $P(\cdot \mid x, a, f)$ (resp., $P(\cdot \mid x', a, f)$). By Assumption 2, $P(\cdot \mid x, a, f)$ is stochastically nondecreasing in $x$, and by Assumption 3, $Q(\cdot \mid x, a, f)$ is stochastically nonincreasing in $x$. Thus for all $z$, $F(z) \le F'(z)$, while for all $y$, $G(y) \ge G'(y)$; further, $G(y) = F(y - x)$ (and $G'(y) = F'(y - x')$). Let $\xi = \inf\{z : F(z) \ge U\}$ and $\xi' = \inf\{z : F'(z) \ge U\}$. Then $\xi \ge \xi'$. Rewriting the definitions, we also have $x + \xi = \inf\{y : F(y - x) \ge U\}$ and $x' + \xi' = \inf\{y : F'(y - x') \ge U\}$, i.e., $x + \xi = \inf\{y : G(y) \ge U\}$ and $x' + \xi' = \inf\{y : G'(y) \ge U\}$. Since $G \ge G'$ pointwise, it follows that $x' + \xi' \ge x + \xi$, as required.
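The inverse-CDF construction in the proof can be sketched concretely. The snippet below is a hedged illustration with made-up increment laws, not the model's actual kernels: two stochastically ordered distributions on $\{-1, 0, 1\}$ stand in for $Q(\cdot \mid x, a, f)$ and $Q(\cdot \mid x', a, f)$ with $x' = x + 1$, and both quantile maps are driven by a common uniform draw, so the coupled samples satisfy $\xi' \le \xi$ while the next states remain ordered the other way.

```python
import bisect

def quantile(support, cdf, u):
    # inf{z : F(z) >= u}, computed by binary search over a sorted finite support
    return support[bisect.bisect_left(cdf, u)]

def cdf(pmf):
    out, s = [], 0.0
    for q in pmf:
        s += q
        out.append(s)
    out[-1] = 1.0  # guard against floating-point shortfall at the top
    return out

# hypothetical increment laws: the law at the larger state x' = x + 1 sits
# stochastically below the law at x, mimicking Assumption 3
support = [-1, 0, 1]
F  = cdf([0.2, 0.3, 0.5])   # increments from state x
Fp = cdf([0.5, 0.3, 0.2])   # increments from state x' = x + 1
x, xp = 0, 1

for i in range(1, 1000):
    u = i / 1000.0              # the common uniform U driving both quantile maps
    xi, xi_p = quantile(support, F, u), quantile(support, Fp, u)
    assert xi_p <= xi           # coupled increments are ordered: xi' <= xi
    assert xp + xi_p >= x + xi  # yet the next states are ordered the other way
```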
Lemma 52. Fix $\Delta \in \mathbb{Z}$, $\Delta \ge 0$. Then as $x \to \infty$,
\[
\sup_{f \in \mathcal{F}} \left( V^*(x + \Delta \mid f) - V^*(x \mid f) \right) \to 0.
\]
Proof. Fix $f \in \mathcal{F}$ and fix $x$. Let $x_0 = x$ and $x'_0 = x + \Delta$. Let $\mu$ be an optimal oblivious strategy given $f$, i.e., $\mu \in P(f)$; such a strategy exists by Lemma 42. Let $x'_t$ and $a'_t$ denote the state and action sequences realized under the kernel $P(\cdot \mid x, a, f)$ when $a'_t = \mu(x'_t)$, starting from $x'_0$; and let $x_t$ denote the state sequence realized using the same action sequence $a'_t$, starting from $x_0$.

We use a coupling argument to study the difference between $V^*(x + \Delta \mid f)$ and $V^*(x \mid f)$. It follows from Lemma 51 that there exists a probability space with random variables $\xi_0, \xi'_0$ such that $\xi_0 \sim Q(\cdot \mid x_0, a'_0, f)$ and $\xi'_0 \sim Q(\cdot \mid x'_0, a'_0, f)$, $\xi_0 \ge \xi'_0$ almost surely, and yet $x_0 + \xi_0 \le x'_0 + \xi'_0$ almost surely; this ensures that $\xi_0 - \xi'_0 \le \Delta$. Proceeding inductively, there exists a joint probability measure under which
\[
0 \le x'_t - x_t \le \Delta
\]
for all $t \ge 0$.

We now have the following sequence of inequalities:
\[
V^*(x + \Delta \mid f) - V^*(x \mid f)
\le \mathbb{E}\left[ \sum_t \beta^t \left( \pi(x'_t, a'_t, f) - \pi(x_t, a'_t, f) \right) \,\Big|\, x'_0 = x + \Delta,\ x_0 = x,\ a'_t = \mu(x'_t) \right]
\]
\[
\le \mathbb{E}\left[ \sum_t \beta^t \sup_{0 \le \delta \le \Delta} \left( \pi(x_t + \delta, a'_t, f) - \pi(x_t, a'_t, f) \right) \,\Big|\, x'_0 = x + \Delta,\ x_0 = x,\ a'_t = \mu(x'_t) \right]
\]
\[
\le \mathbb{E}\left[ \sum_t \beta^t \sup_{0 \le \delta \le \Delta} \left( \pi(x_t + \delta, \bar{a}, \underline{f}) - \pi(x_t, \bar{a}, \underline{f}) \right) \,\Big|\, x'_0 = x + \Delta,\ x_0 = x,\ a'_t = \mu(x'_t) \right],
\]
where $\underline{f}$ denotes the smallest distribution in $\mathcal{F}$, i.e., the one that places all its mass on state 0. The first inequality follows because the payoff received under the action sequence $a'_t$ starting from $x$ cannot be larger than $V^*(x \mid f)$. The second inequality follows by taking a supremum over the difference in state, and the fact that (almost surely) $0 \le x'_t - x_t \le \Delta$ for all $t$. The third inequality follows because $\pi$ has decreasing differences in $x$, $a$, and $f$, and because $\delta \ge 0$ (Assumption 3).

Now recall that increments are bounded (Assumption 3). Thus in time $t$, the maximum distance the state could have moved from the initial state $x$ is $tM$. Thus if $x_0 = x$, then:
\[
\sup_{0 \le \delta \le \Delta} \left( \pi(x_t + \delta, \bar{a}, \underline{f}) - \pi(x_t, \bar{a}, \underline{f}) \right) \le \sup_{0 \le \delta \le \Delta,\ |\epsilon| \le tM} \left( \pi(x + \epsilon + \delta, \bar{a}, \underline{f}) - \pi(x + \epsilon, \bar{a}, \underline{f}) \right).
\]
Let $A_{x,t}$ denote the right hand side of the preceding inequality; note that this is a deterministic quantity. Since the supremum is over a finite set, it follows from Assumption 3 that $A_{x,t} \to 0$ as $x \to \infty$, for every fixed $t$.

Finally, observe that since $\pi(x + \delta, \bar{a}, \underline{f}) - \pi(x, \bar{a}, \underline{f}) \to 0$ as $x \to \infty$, it follows that:
\[
D \triangleq \sup_{0 \le \delta \le \Delta,\ y \in \mathbb{Z}_+} \left( \pi(y + \delta, \bar{a}, \underline{f}) - \pi(y, \bar{a}, \underline{f}) \right) < \infty.
\]
Combining our arguments, we have:
\[
V^*(x + \Delta \mid f) - V^*(x \mid f) \le \sum_{t=0}^{T} \beta^t A_{x,t} + \frac{\beta^T D}{1 - \beta}. \tag{A.24}
\]
First taking the limit of the right hand side as $x \to \infty$, then taking the limit as $T \to \infty$, shows that $V^*(x + \Delta \mid f) - V^*(x \mid f) \to 0$ as $x \to \infty$ for every fixed $f$. Since the right hand side of (A.24) is independent of $f$, the convergence to zero is uniform in $f$, as required.
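The two-stage limit at the end of the proof can be sketched numerically. Below, A(x, t) is a hypothetical stand-in for $A_{x,t}$ (vanishing in $x$ for each fixed $t$, as Assumption 3 guarantees), and rhs(x, T) is the bound of (A.24): sending $x \to \infty$ kills the finite sum, after which sending $T \to \infty$ removes the geometric tail.

```python
beta, D = 0.9, 2.0   # hypothetical discount factor and uniform difference bound

def A(x, t):
    # stand-in for A_{x,t}: deterministic, and -> 0 as x -> infinity for fixed t
    return (1.0 + t) / (1.0 + x)

def rhs(x, T):
    # the bound of equation (A.24): finite discounted sum plus geometric tail
    return sum(beta ** t * A(x, t) for t in range(T + 1)) + beta ** T * D / (1 - beta)

# for fixed T, growing x shrinks the finite sum but leaves the tail term
assert rhs(10 ** 6, 50) < rhs(10, 50)
# growing T afterwards removes the tail: the iterated limit is zero
assert rhs(10 ** 9, 500) < 1e-4
```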
Lemma 53. Let $\mu_f$ be the unique optimal oblivious strategy given $f$, cf. Lemma 42 and Proposition 38. Then there exist $\epsilon > 0$ and $\bar{x}$ such that for all $x \ge \bar{x}$,
\[
\sup_f \sum_z z\, Q(z \mid x, \mu_f(x), f) < -\epsilon.
\]
Proof. We first show that as $x \to \infty$,
\[
\sup_f \left\| \mu_f(x) - \bar{a} \right\|_\infty \to 0.
\]
Suppose the preceding statement fails; then there exist $r > 0$ and a sequence $\{f_n, x_n\}$, with $x_n \to \infty$, such that $\|\mu_{f_n}(x_n) - \bar{a}\|_\infty \ge r$ for all $n$. We first use this fact to bound $\pi(x_n, \mu_{f_n}(x_n), f_n)$ away from $\pi(x_n, \bar{a}, f_n)$.

Fix $R > 0$, and let $\Gamma(R)$ be the optimal objective value of the following optimization problem:
\[
\begin{array}{ll}
\text{maximize} & \pi(0, a, \underline{f}) \\
\text{subject to} & \|a - \bar{a}\|_\infty \ge R, \\
& a \in \mathcal{A},
\end{array}
\]
where $\underline{f}$ denotes the smallest distribution in $\mathcal{F}$, i.e., the one that places all its mass on state 0. By Assumption 3, it follows that $\Gamma(R) \le \pi(0, \bar{a}, \underline{f})$. We claim that in fact $\Gamma(R) < \pi(0, \bar{a}, \underline{f})$. Suppose not; then $\Gamma(R) = \pi(0, \bar{a}, \underline{f})$. Further, observe that the objective function is continuous in $a$ and the feasible region is compact, so at least one optimal solution exists, say $a^*$. But then at $a^*$, $\pi(0, a^*, \underline{f}) = \pi(0, \bar{a}, \underline{f})$, contradicting Assumption 3.

So now we have:
\[
\pi(x_n, \mu_{f_n}(x_n), f_n) - \pi(x_n, \bar{a}, f_n) \le \pi(0, \mu_{f_n}(x_n), \underline{f}) - \pi(0, \bar{a}, \underline{f}) \le \Gamma(r) - \pi(0, \bar{a}, \underline{f}) < 0.
\]
The first inequality follows by decreasing differences (Assumption 3), and the second by the definition of $\Gamma(r)$. Importantly, note that the bound on the right hand side is independent of $n$.

On the other hand, we have:
\[
\sum_{x'} V^*(x' \mid f_n) P(x' \mid x_n, \bar{a}, f_n) - \sum_{x'} V^*(x' \mid f_n) P(x' \mid x_n, \mu_{f_n}(x_n), f_n)
\ge V^*(x_n - M \mid f_n) - V^*(x_n + M \mid f_n)
\ge -\sup_f \left( V^*(x_n + M \mid f) - V^*(x_n - M \mid f) \right).
\]
Here the first inequality follows because $V^*(x \mid f)$ is nondecreasing in $x$ (Lemma 43), and because increments are bounded (Assumption 3). By Lemma 52, the right hand side approaches zero as $n \to \infty$.

Combining our observations, for all sufficiently large $n$ we have:
\[
\pi(x_n, \mu_{f_n}(x_n), f_n) + \beta \sum_{x'} V^*(x' \mid f_n) P(x' \mid x_n, \mu_{f_n}(x_n), f_n) < \pi(x_n, \bar{a}, f_n) + \beta \sum_{x'} V^*(x' \mid f_n) P(x' \mid x_n, \bar{a}, f_n),
\]
contradicting the fact that $\mu_{f_n}$ is an optimal oblivious strategy (since Bellman's optimality condition fails; see Lemma 42). We conclude that $\sup_f \|\mu_f(x) - \bar{a}\|_\infty \to 0$ as $x \to \infty$.

Next, observe that $\sum_z z\, Q(z \mid x, a, f)$ is continuous in $a$, as the kernel $P$ is continuous in $a$ (Assumption 1) and increments are bounded (Assumption 3). Further, we know that $\sum_z z\, Q(z \mid x, \bar{a}, \underline{f}) < 0$ for all sufficiently large $x$, by Assumption 3. Since the increment kernel is stochastically nonincreasing in $x$, it follows that $\sum_z z\, Q(z \mid x, a, \underline{f})$ is nonincreasing in $x$ for fixed $a$. Further, since the increment kernel is stochastically nonincreasing in $f$, it follows that for any $f$, $\sum_z z\, Q(z \mid x, a, f) \le \sum_z z\, Q(z \mid x, a, \underline{f})$. Combining these observations, we conclude that there exist $\epsilon > 0$ and $\delta > 0$ such that if $\|a - \bar{a}\|_\infty < \delta$, then for all $f$, $\sum_z z\, Q(z \mid x, a, f) < -\epsilon$ for all sufficiently large $x$. The desired result now follows since $\sup_f \|\mu_f(x) - \bar{a}\|_\infty \to 0$ as $x \to \infty$.
Lemma 54. For every $f$, $\Phi(f)$ is nonempty.

Proof. As described in the discussion of Section 7.3, it suffices to show that the state Markov chain induced by an optimal oblivious strategy possesses at least one invariant distribution, i.e., that $\mathcal{D}(\mu, f)$ is nonempty, where $\mu$ is the unique optimal oblivious strategy given $f$.

We use a Foster-Lyapunov argument. Let $U(x) = x$. Then $\{x \in \mathcal{X} : U(x) \le K\}$ is finite for all $K$, and by Lemma 53, for $x \ge \bar{x}$,
\[
\sum_{x'} U(x') P(x' \mid x, \mu(x), f) \le U(x) - \epsilon.
\]
Since increments are bounded (Assumption 3), it is immediate that:
\[
\sup_{0 \le x \le \bar{x}} \left( \sum_{x'} U(x') P(x' \mid x, \mu(x), f) - U(x) \right) < \infty.
\]
It follows from the Foster-Lyapunov criterion that the resulting chain is positive recurrent, as required [44].
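The drift condition driving this argument can be checked by simulation. The chain below is a hypothetical bounded-increment chain on the nonnegative integers (with $M = 1$), not the model's actual kernel: it drifts upward below a threshold playing the role of $\bar{x}$ and downward above it, matching the shape of the condition in Lemma 53. The Monte Carlo drift estimate is negative beyond the threshold, and a long trajectory spends almost all of its time near the origin, the signature of positive recurrence.

```python
import random

XBAR = 5  # hypothetical threshold playing the role of x-bar in Lemma 53

def step(x, rng):
    # bounded-increment kernel (M = 1): upward drift below XBAR, downward above
    p_up = 0.7 if x < XBAR else 0.3
    return max(0, x + (1 if rng.random() < p_up else -1))

def drift(x, n, rng):
    # Monte Carlo estimate of E[U(X_{t+1}) - U(X_t) | X_t = x] with U(x) = x
    return sum(step(x, rng) - x for _ in range(n)) / n

rng = random.Random(2)
assert drift(10, 20000, rng) < -0.2   # negative drift beyond the threshold

x, time_low, T = 0, 0, 50000
for _ in range(T):
    x = step(x, rng)
    time_low += (x <= 10)
assert time_low / T > 0.9   # the trajectory concentrates near the origin
```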
Lemma 55. For every $\eta \in \mathbb{Z}_+$,
\[
\sup_f \sup_{\phi \in \Phi(f)} \sum_x x^\eta \phi(x) < \infty.
\]
Proof. We again use a Foster-Lyapunov argument, proceeding by induction. The claim is clearly true for $\eta = 0$. So assume the claim holds up to $\eta - 1$; in particular, define
\[
\alpha_k = \sup_f \sup_{\phi \in \Phi(f)} \sum_x x^k \phi(x)
\]
for $k = 0, \ldots, \eta - 1$. Fix $f$, and let $\mu \in P(f)$ be the unique optimal oblivious strategy given $f$. The preceding lemma establishes that the Markov chain induced by $\mu$ is positive recurrent. Let $U(x) = x^{\eta+1}$. Then:
\[
\sum_{x'} U(x') P(x' \mid x, \mu(x), f) = \sum_z (x + z)^{\eta+1} Q(z \mid x, \mu(x), f)
= \sum_z \sum_{k=0}^{\eta+1} \binom{\eta+1}{k} x^k z^{\eta+1-k} Q(z \mid x, \mu(x), f)
\]
\[
= U(x) + (\eta + 1) x^\eta \sum_z z\, Q(z \mid x, \mu(x), f) + \sum_z \sum_{k=0}^{\eta-1} \binom{\eta+1}{k} x^k z^{\eta+1-k} Q(z \mid x, \mu(x), f).
\]
Define $g(x)$ as:
\[
g(x) = \sum_{k=0}^{\eta-1} \binom{\eta+1}{k} M^{\eta+1-k} x^k.
\]
By the inductive hypothesis,
\[
\gamma \triangleq \sup_f \sup_{\phi \in \Phi(f)} \sum_x g(x) \phi(x) < \infty.
\]
Further, by Lemma 53, for all $x \ge \bar{x}$, we have:
\[
\sum_z z\, Q(z \mid x, \mu(x), f) < -\epsilon.
\]
Define $h(x)$ as:
\[
h(x) =
\begin{cases}
-(\eta + 1) M x^\eta, & \text{if } x \le \bar{x}; \\
\epsilon (\eta + 1) x^\eta, & \text{if } x > \bar{x}.
\end{cases}
\]
It follows that:
\[
\sum_{x'} U(x') P(x' \mid x, \mu(x), f) - U(x) \le -h(x) + g(x),
\]
so by the Foster-Lyapunov criterion [44] it follows that:
\[
\sum_x h(x) \phi(x) \le \sum_x g(x) \phi(x) \le \gamma.
\]
Rearranging terms, we conclude that:
\[
\sum_{x > \bar{x}} x^\eta \phi(x) \le \frac{\gamma}{\epsilon (\eta + 1)} + \frac{M \bar{x}^\eta}{\epsilon}.
\]
Thus:
\[
\sum_x x^\eta \phi(x) \le \frac{\gamma}{\epsilon (\eta + 1)} + \left( \frac{M}{\epsilon} + 1 \right) \bar{x}^\eta.
\]
(Recall that the sum is only over nonnegative $x$.) Since the right hand side is finite and independent of $f$ and $\phi$, the result follows.
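The algebraic core of this computation is the binomial expansion of $(x + z)^{\eta+1}$; the following quick check (illustrative only) confirms the identity used to split the Lyapunov drift into the three terms above.

```python
from math import comb

def expand(x, z, eta):
    # binomial expansion used for U(x) = x^(eta + 1) in the drift computation
    return sum(comb(eta + 1, k) * x ** k * z ** (eta + 1 - k) for k in range(eta + 2))

for x in range(6):
    for z in (-2, -1, 0, 1, 2):   # bounded increments, here |z| <= M = 2
        for eta in range(5):
            assert (x + z) ** (eta + 1) == expand(x, z, eta)
```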
Proof. [Proof of Proposition 40] We have already established in Lemma 54 that $\Phi(f)$ is nonempty. Define $B$ as:
\[
B = \sup_f \sup_{\phi \in \Phi(f)} \sum_x x^{p+1} \phi(x) < \infty,
\]
where the inequality is the result of Lemma 55. We define the set $\mathcal{F}_C$ as:
\[
\mathcal{F}_C = \left\{ f \in \mathcal{F} : \sum_x x^{p+1} f(x) \le B \right\}.
\]
By definition, $\Phi(\mathcal{F}) \subset \mathcal{F}_C$. It is clear that $\mathcal{F}_C$ is nonempty and convex. It remains to show that $\mathcal{F}_C$ is compact in the 1-$p$ norm. It is straightforward to check that $\mathcal{F}_C$ is complete; we show that $\mathcal{F}_C$ is totally bounded, thus establishing compactness.

Fix $\epsilon > 0$. Choose $K_\epsilon$ so that $B / K_\epsilon < \epsilon$. Then for all $f \in \mathcal{F}_C$:
\[
\sum_{x \ge K_\epsilon} x^p f(x) \le \frac{B}{K_\epsilon} < \epsilon. \tag{A.25}
\]
Let $S_\epsilon$ be the projection of $\mathcal{F}_C$ onto the first $K_\epsilon$ components; i.e.,
\[
S_\epsilon = \left\{ g \in \mathbb{R}^{K_\epsilon} : \exists\, f \in \mathcal{F}_C \text{ with } g(x) = f(x)\ \forall\, x \le K_\epsilon \right\}.
\]
It is straightforward to check that $S_\epsilon$ is a compact subset of $\mathbb{R}^{K_\epsilon}$; so let $f_1, \ldots, f_\ell \in S_\epsilon$ be an $\epsilon$-cover of $S_\epsilon$ (i.e., $S_\epsilon$ is covered by the balls of radius $\epsilon$ around $f_1, \ldots, f_\ell$ in the 1-$p$ norm). It then follows that $f_1, \ldots, f_\ell$ is a $2\epsilon$-cover of $\mathcal{F}_C$, since (A.25) bounds the tail of any $f \in \mathcal{F}_C$ by $\epsilon$. This establishes that $\mathcal{F}_C$ is totally bounded in the 1-$p$ norm, as required.
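The Markov-type tail estimate (A.25), where a uniform $(p+1)$-st moment bound forces a uniformly small $p$-weighted tail, can be checked directly. The distribution below is a hypothetical light-tailed $f$ on the nonnegative integers, not one produced by the model.

```python
def moment(f, r):
    # r-th moment of a distribution given as a dict {state: probability}
    return sum(x ** r * w for x, w in f.items())

def weighted_tail(f, p, K):
    # sum_{x >= K} x^p f(x), the tail quantity controlled in equation (A.25)
    return sum(x ** p * w for x, w in f.items() if x >= K)

f = {x: 0.6 * 0.4 ** x for x in range(60)}   # hypothetical light-tailed f on Z_+
p = 2
B = moment(f, p + 1)   # plays the role of the uniform bound defining F_C

for K in (2, 5, 10, 20):
    # for x >= K, x^p <= x^(p+1) / K, so the weighted tail is at most B / K
    assert weighted_tail(f, p, K) <= B / K + 1e-12
```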
Bibliography
[1] V. Abhishek, S. Adlakha, R. Johari, and G. Weintraub. Oblivious equilibrium
for general stochastic games with many players. Proceedings of the Allerton
Conference on Communication, Control and Computing, pages 892–896, 2007.
[2] S. Adlakha, R. Johari, G. Weintraub, and A. Goldsmith. Oblivious equilibrium
for large-scale stochastic games with unbounded costs. Proceedings of the IEEE
Conference on Decision and Control, 2008.
[3] C.D. Aliprantis and K.C. Border. Infinite dimensional analysis: a hitchhiker’s
guide. Springer Verlag, 2006.
[4] E. Altman. Applications of Markov decision processes in communication net-
works. Handbook of Markov Decision Processes: Methods and Applications, page
489, 2002.
[5] E. Altman and T. Basar. Optimal rate control for high speed telecommunication
networks. Proceedings of the IEEE Conference on Decision and Control, pages
1389–1394, 1995.
[6] E. Altman, T. Basar, and R. Srikant. Congestion control as a stochastic control
problem with action delays. Automatica, 12:1937–1950, 1999.
[7] E. Altman, V. Kambley, and A. Silva. Stochastic games with one step delay
sharing information pattern with application to power control. Proceedings of the
International Conference on Game Theory for Networks, pages 124–129, May
2009.
[8] E. Altman and P. Nain. Closed-loop control with delayed information. Perfor-
mance Evaluation Review, 20:193–204, 1992.
[9] E. Altman and S. Stidham. Optimality of monotonic policies for two-action
Markovian decision processes, with applications to control of queues with delayed
information. Queueing Systems, 21(3):267–291, 1995.
[10] A. C. Antoulas and D. C. Sorensen. Approximation of large-scale dynamical sys-
tems: An overview. International Journal of Applied Mathematics and Computer
Sciences, 11(5):1093–1121, 2001.
[11] D. Artiges. Optimal routing into two heterogeneous service stations with delayed
information. IEEE Transactions on Automatic Control, 40(7):1234–1236, 1995.
[12] K. J. Astrom. Optimal control of Markov processes with incomplete state esti-
mation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
[13] J. L. Bander and C. C. White III. Markov decision processes with noise-corrupted
and delayed state observations. Journal of Operational Research Society, 50:660–
668, 1999.
[14] T. Basar and N. Bansal. The theory of teams: a selective annotated bibliography.
Lecture Notes in Control and Information Sciences, 119:186–201, 1989.
[15] T. Basar and J. B. Cruz. Concepts and methods in multi-person coordination
and control. Optimization and Control of Dynamic Operational Research Models,
pages 351–387, 1982.
[16] Michael Basin, Jesus Rodriguez-Gonzalez, and Rodolfo Martinez-Zuniga. Op-
timal control for linear systems with time delay in control input based on the
duality principle. Proceedings of the American Control Conference, pages 2144–
2148, 2003.
[17] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[18] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of
decentralized control of Markov decision processes. Mathematics of Operations
Research, 27(4):819–840, 2002.
[19] D. P. Bertsekas. Dynamic Programming and Optimal Control (Vol. 2). Athena
Scientific, Nashua, New Hampshire, 2001.
[20] D.P. Bertsekas. Dynamic Programming and optimal control, volume 1. Athena
Scientific, 1995.
[21] D.P. Bertsekas. Dynamic Programming and optimal control, volume 2. Athena
Scientific, 1995.
[22] D. J. Bertsimas and G. van Ryzin. Stochastic and dynamic vehicle routing in
the Euclidean plane with multiple capacitated vehicles. Operations Research,
41(1):60–76, 1993.
[23] Aaron Bodoh-Creed. The simple behavior of large mechanisms. Under submis-
sion, 2010.
[24] S. P. Boyd and C. H. Barratt. Linear controller design. Prentice Hall, 1991.
[25] U. Doraszelski and A. Pakes. A framework for applied dynamic analysis in IO.
Handbook of Industrial Organization, Volume 3, 2007.
[26] Darrell Duffie, Semyon Malamud, and Gustavo Manso. Information percolation
with equilibrium search dynamics. Econometrica, 77(5):1513–1574, 2009.
[27] R. Ericson and A. Pakes. Markov-perfect industry dynamics: A framework for
empirical work. Review of Economic Studies, 62(1):53–82, 1995.
[28] D. Famolari, N. Mandayam, D. Goodman, and V. Shah. A new framework for
power control in wireless data networks: Games, utility and pricing. In Proceed-
ings of the Allerton Conference on Communication, Control and Computing,
volume 36, pages 546–555. Springer, 1998.
[29] D. Fudenberg and J. Tirole. Game Theory. The MIT Press, 1991.
[30] B. Hajek. Optimal control of two interacting service stations. IEEE Transactions
on Automatic Control, 29(6):491–499, 1984.
[31] Y. C. Ho and K. C. Chu. Team decision theory and information structures in
optimal control problems – Part I. IEEE Transactions on Automatic Control,
17:15–22, 1972.
[32] Y.C. Ho. Team decision theory and information structures. Proceedings of the
IEEE, 68(6):644–654, 1980.
[33] K. Hsu and H. I. Marcus. Decentralized control of finite state Markov processes.
Proceedings of the IEEE Conference on Decision and Control including the Sym-
posium on Adaptive Processes, 19:143–148, 1980.
[34] K. Hsu and H. I. Marcus. Decentralized control of finite state Markov processes.
IEEE Transactions on Automatic Control, 2:426–431, 1982.
[35] M. Huang, P. E. Caines, and R. P. Malhame. Large-population cost-coupled LQG
problems with nonuniform agents: Individual-mass behavior and decentralized
ǫ-Nash equilibria. IEEE Transactions on Automatic Control, 52(9):1560–1571,
2007.
[36] M. Huang, R. P. Malhame, and P. E. Caines. Nash equilibria for large-population
linear stochastic systems of weakly coupled agents. Analysis, Control and Opti-
mization of Complex Dynamical Systems, pages 215–252, 2005.
[37] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.
[38] B. Jovanovic and R.W. Rosenthal. Anonymous sequential games. Journal of
Mathematical Economics, 17:77–87, 1988.
[39] K. V. Katsikopoulos and S. E. Engelbrecht. Markov decision processes with de-
lays and asynchronous cost collection. IEEE Transactions on Automatic Control,
48(4):568–574, 2003.
[40] P. R. Kumar and P. Varaiya. Stochastic Systems: Estimation, Identification and
Adaptive Control. Prentice Hall, 1986.
[41] J. Kuri and A. Kumar. Optimal control of arrivals to queues with delayed queue
length information. IEEE Transactions on Automatic Control, 40(8):1444–1450,
1995.
[42] B. Kurtaran and R. Sivan. Linear-Quadratic-Gaussian control with one-step-
delay sharing pattern. IEEE Transactions on Automatic Control, 19(5):571–574,
1974.
[43] J.-M. Lasry and P.-L. Lions. Mean field games. Japanese Journal of Mathematics,
2(1):229–260, 2007.
[44] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-
Verlag, 1993.
[45] George E. Monahan. A survey of partially observable Markov decision processes:
Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
[46] A. Pakes and P. McGuire. Computing Markov-perfect Nash equilibria: Numer-
ical implications of a dynamic differentiated product model. RAND Journal of
Economics, 25(4):555–589, 1994.
[47] A. Pakes and P. McGuire. Stochastic algorithms, symmetric Markov perfect
equilibrium, and the curse of dimensionality. Econometrica, 69(5):1261–1281,
2001.
[48] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. Wiley, 1994.
[49] M. Rotkowitz and S. Lall. A characterization of convex problems in decentralized
control. IEEE Transactions on Automatic Control, 51(2):274–286, 2006.
[50] N. R. Sandell and M. Athans. Solution of some nonclassical LQG stochastic
decision problems. IEEE Transactions on Automatic Control, 19:108–116, 1974.
[51] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences,
39:1095–1100, 1953.
[52] D. D. Siljak. Decentralized Control of Complex Systems. Academic Press, 1991.
[53] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable
Markov processes over a finite horizon. Operations Research, 21(5):1071–1088,
1973.
[54] P. Varaiya and J. Walrand. On delayed sharing patterns. IEEE Transactions on
Automatic Control, 23:443–445, 1978.
[55] P. G. Voulgaris. Optimal control of systems with delayed observation sharing
patterns via input-output methods. Proceedings of the IEEE Conference on
Decision and Control, pages 2311–2316, 2000.
[56] G. Y. Weintraub, C. L. Benkard, and B. Van Roy. Markov perfect industry
dynamics with many firms. Econometrica, 76(6):1375–1411, 2008.