Derivative Action Learning in Games
Review of: J. Shamma and G. Arslan, “Dynamic Fictitious Play, Dynamic Gradient Play, and Distributed Convergence to Nash Equilibria,” IEEE Transactions on Automatic Control, Vol. 50, no. 3, pp. 312-327, March 2005
Overview
• The authors propose an extension of fictitious play (FP) and gradient play (GP) in which strategy adjustment is a function of both the estimated opponent strategy and its time derivative
• They demonstrate that when the learning rules are well-calibrated, convergence (or near-convergence) to Nash equilibria and asymptotic stability in the vicinity of equilibria can be achieved in games where static FP and GP fail to do so
Game Setup
This paper addresses a class of two-player games in which, at each stage of the game, player i selects an action a_i from a finite set according to his mixed strategy p_i and receives utility U_i(p_1, p_2) equal to his expected payoff plus an additional entropy term that rewards playing a mixed strategy. The purpose of the entropy term is not discussed by the authors, but it may be there to avoid converging to inferior local maxima in the utility function.
Actual payoff depends on the combined player actions a1 and a2, each randomly selected according to the mixed strategies p1 and p2.
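\[
\begin{aligned}
U_i(p_i(k), p_{-i}(k)) &= p_i^T(k)\, M_i\, p_{-i}(k) + \tau H(p_i(k)), \qquad \tau \ge 0 \\
\text{alternatively,}\quad U_i(p_i(k), p_{-i}(k)) &= E[\text{Payoff}] + \tau H(p_i(k)) = E\big[a_i^T(k)\, M_i\, a_{-i}(k)\big] + \tau H(p_i(k))
\end{aligned}
\]
The entropy term \( H(s) = -s^T \log(s) \) encourages mixed strategy selection.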
[Figure: the entropy function H(•) rewards mixed strategies. Axis: probability of selecting a1 in a 2-dimensional strategy space.]
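As a rough illustration of the utility defined on this slide, the sketch below evaluates the expected payoff plus the entropy term for a pure and a mixed strategy. The function names, the 2×2 payoff matrix, and the value of τ are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def entropy(p):
    """H(p) = -sum_n p_n * log(p_n), taking 0 * log(0) as 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def utility(p_i, p_other, M_i, tau):
    """U_i = p_i^T M_i p_other + tau * H(p_i)."""
    expected_payoff = sum(
        p_i[a] * M_i[a][b] * p_other[b]
        for a in range(len(p_i))
        for b in range(len(p_other))
    )
    return expected_payoff + tau * entropy(p_i)

# Hypothetical 2x2 payoff matrix and opponent strategy (illustration only).
M = [[1.0, 0.0], [0.0, 1.0]]
opponent = [0.5, 0.5]

pure = utility([1.0, 0.0], opponent, M, tau=0.1)   # entropy term is zero
mixed = utility([0.5, 0.5], opponent, M, tau=0.1)  # entropy term is tau * log(2)
```

Against this opponent both strategies earn the same expected payoff, so the entropy term alone makes the mixed strategy preferable whenever τ > 0.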
Empirical Estimation and Best Response
Player i's strategy p_i is, in general, mixed. It lies in the simplex in m_i-dimensional space, where m_i is the number of actions available to player i and the vertices of the simplex correspond to those actions.
Further, he adjusts his strategy by observing his opponent’s actions, formulating an empirical estimate of his opponent’s strategy q-i, and calculating the best mixed strategy in response. The adjusted strategy then will direct his next move.
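\[
\begin{aligned}
q_{-i}(k+1) &= \frac{k}{k+1}\, q_{-i}(k) + \frac{1}{k+1}\, a_{-i}(k) \\
p_i(k) &= \beta_i\big(q_{-i}(k)\big) \rightarrow a_i(k)
\end{aligned}
\]
where the arrow denotes random selection of the action a_i(k) according to the distribution p_i(k).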
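The running-average update of the empirical frequency can be sketched as follows. This is a minimal illustration, assuming actions are represented by their index (equivalently, by unit vectors); the helper names and the fixed opponent strategy are hypothetical.

```python
import random

def update_frequency(q, action_index, k):
    """q(k+1) = k/(k+1) * q(k) + 1/(k+1) * a(k), with a(k) a unit vector."""
    return [
        (k * q_n + (1.0 if n == action_index else 0.0)) / (k + 1)
        for n, q_n in enumerate(q)
    ]

def sample_action(p, rng):
    """Draw an action index at random according to the mixed strategy p."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]

rng = random.Random(0)
opponent_strategy = [0.2, 0.3, 0.5]  # hypothetical stationary opponent
q = [1 / 3, 1 / 3, 1 / 3]            # initial estimate, treated as q(1)
for k in range(1, 20001):
    q = update_frequency(q, sample_action(opponent_strategy, rng), k)
```

Against a stationary opponent the estimate q approaches the opponent's true mixed strategy; in fictitious play proper, both players adapt at once, which is what makes convergence nontrivial.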
Best Response Function
The best response is defined by the authors to be the mixed strategy that maximizes the utility U_i against the estimated opponent strategy. The authors claim (without proof) that, for τ > 0, the utility-maximizing function is the logit function.
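\[
\begin{aligned}
p_i(k) = \beta_i\big(q_{-i}(k)\big) &= \arg\max_{p_i} U_i\big(p_i, q_{-i}(k)\big) \\
\text{for } \tau > 0:\quad \big[\beta_i\big(q_{-i}(k)\big)\big]_n &= \frac{e^{\left[\frac{M_i q_{-i}(k)}{\tau}\right]_n}}{e^{\left[\frac{M_i q_{-i}(k)}{\tau}\right]_1} + \cdots + e^{\left[\frac{M_i q_{-i}(k)}{\tau}\right]_N}}
\end{aligned}
\]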
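The logit best response is a softmax over the expected payoffs scaled by 1/τ. The sketch below is one way to compute it; the function name and the cyclic 3×3 payoff matrix are assumptions used only for illustration, and the maximum payoff is subtracted before exponentiating purely for numerical stability.

```python
import math

def logit_best_response(M_i, q_other, tau):
    """Return the softmax of (M_i q_other) / tau as a list of probabilities."""
    payoffs = [sum(m * q for m, q in zip(row, q_other)) for row in M_i]
    top = max(payoffs)  # shifting by the maximum leaves the softmax unchanged
    weights = [math.exp((v - top) / tau) for v in payoffs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical cyclic payoff matrix (illustration only).
M = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

smooth = logit_best_response(M, [1 / 3, 1 / 3, 1 / 3], tau=0.1)
sharp = logit_best_response(M, [1.0, 0.0, 0.0], tau=0.01)
```

When all actions earn the same expected payoff the response is uniform; as τ shrinks, the response concentrates on the payoff-maximizing action, here the third one.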
FP in Continuous Time
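\[
\begin{aligned}
p_i(t) = \beta_i\big(q_{-i}(t)\big) &= \arg\max_{p_i} U_i\big(p_i, q_{-i}(t)\big) \\
\text{for } \tau > 0:\quad \big[\beta_i\big(q_{-i}(t)\big)\big]_n &= \frac{e^{\left[\frac{M_i q_{-i}(t)}{\tau}\right]_n}}{e^{\left[\frac{M_i q_{-i}(t)}{\tau}\right]_1} + \cdots + e^{\left[\frac{M_i q_{-i}(t)}{\tau}\right]_N}} \\
\dot q_i(t) &= \beta_i\big(q_{-i}(t)\big) - q_i(t)
\end{aligned}
\]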
The remaining discussion of Fictitious Play is conducted in the continuous time domain. This allows the authors to describe the system dynamics in terms of smooth differential equations, and each player's action is identified with his mixed strategy.
The discrete-time dynamics are then interpreted as stochastic approximations of continuous-time solutions to the differential equations. This transformation is discussed in [Benaim, Hofbauer and Sorin 2003] and, presumably, in [Benaim and Hirsch 1996], though I have not seen the latter myself.
Achieving Nash Equilibrium
Nash equilibria are reached at fixed points of the Best Response function. Convergence to fixed points occurs as the empirical frequency estimates converge to the actual strategies played.
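\[
\begin{aligned}
\dot q_i(t) &= \beta_i\big(q_{-i}(t)\big) - q_i(t) \rightarrow 0 \\
p_i(t) = \beta_i\big(q_{-i}(t)\big) &= \arg\max_{p_i} U_i\big(p_i, q_{-i}(t)\big) \rightarrow p_i^* = \arg\max_{p_i} U_i\big(p_i, p_{-i}^*\big)
\end{aligned}
\]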
Derivative Action FP (DAFP): Idealized Case – Exact DAFP
Exact DAFP uses a directly measured first-order forecast of the opponent's strategy, in addition to the observed empirical frequency, to calculate the Best Response.
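\[
\begin{aligned}
p_i(t) &= \beta_i\big(q_{-i}(t) + \gamma\, \dot q_{-i}(t)\big) \\
\dot q_i(t) &= \beta_i\big(q_{-i}(t) + \gamma\, \dot q_{-i}(t)\big) - q_i(t)
\end{aligned}
\]
γ is the “derivative gain” and 1 is the “proportional gain” in the PD-controller, where the input is the empirical frequency, q(t), and the output is the play strategy, p(t).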
Derivative Action FP (DAFP): Approximate DAFP
Approximate DAFP uses an estimated first-order forecast of the opponent's strategy, in addition to the observed empirical frequency, to calculate the Best Response.
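\[
\begin{aligned}
p_i(t) &= \beta_i\big(q_{-i}(t) + \gamma\, \dot r_{-i}(t)\big) \\
\dot q_i(t) &= \beta_i\big(q_{-i}(t) + \gamma\, \dot r_{-i}(t)\big) - q_i(t) \\
\dot r_i(t) &= \lambda\big(q_i(t) - r_i(t)\big)
\end{aligned}
\]
r_i(t) is a filtered approximation of q_i(t), and \( \dot r_i(t) \rightarrow \dot q_i(t) \) with large λ (and provided the Best Response function β_i is sufficiently continuous).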
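To make the Approximate DAFP equations concrete, here is a minimal forward-Euler sketch of the coupled dynamics for two players. It is not the authors' simulation code: the function names, the cyclic 3×3 payoff matrix, the initial conditions, and the values of γ, λ, τ and the step size are all assumptions chosen for illustration. The step size must satisfy dt·λ ≤ 1 for the filter update to remain a convex combination.

```python
import math

def softmax_response(M, q_other, tau):
    """Logit best response: softmax of (M q_other) / tau."""
    payoffs = [sum(m * x for m, x in zip(row, q_other)) for row in M]
    top = max(payoffs)
    w = [math.exp((v - top) / tau) for v in payoffs]
    s = sum(w)
    return [x / s for x in w]

def simulate_approx_dafp(M1, M2, q1, q2, gamma, lam, tau, dt, steps):
    """Forward-Euler integration of q_i' = beta_i(q_-i + gamma r_-i') - q_i,
    with filter r_i' = lam * (q_i - r_i)."""
    r1, r2 = list(q1), list(q2)  # filter states start at the frequencies
    for _ in range(steps):
        r1_dot = [lam * (q - r) for q, r in zip(q1, r1)]
        r2_dot = [lam * (q - r) for q, r in zip(q2, r2)]
        # Each player responds to the opponent's frequency plus its forecast.
        b1 = softmax_response(M1, [q + gamma * d for q, d in zip(q2, r2_dot)], tau)
        b2 = softmax_response(M2, [q + gamma * d for q, d in zip(q1, r1_dot)], tau)
        q1 = [q + dt * (b - q) for q, b in zip(q1, b1)]
        q2 = [q + dt * (b - q) for q, b in zip(q2, b2)]
        r1 = [r + dt * d for r, d in zip(r1, r1_dot)]
        r2 = [r + dt * d for r, d in zip(r2, r2_dot)]
    return q1, q2

# Hypothetical cyclic payoff matrix and parameters (illustration only).
M = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
q1_final, q2_final = simulate_approx_dafp(
    M, M, [0.6, 0.3, 0.1], [0.1, 0.6, 0.3],
    gamma=1.0, lam=10.0, tau=0.1, dt=0.01, steps=5000,
)
```

Because each Euler step is a convex combination of the current frequency and a best response, the frequencies stay in the simplex; whether they settle near an equilibrium depends on γ, λ and τ, which is the subject of the convergence slides.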
Exact DAFP, Special Case: γ = 1
System Inversion – Each player seeks to play best response against current opponent strategy
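\[
\begin{aligned}
p_i(t) &= \beta_i\big(q_{-i}(t) + \dot q_{-i}(t)\big) \\
&\overset{\text{Laplace Transform}}{\Longrightarrow}\quad p_i(s) = \beta_i\big(q_{-i}(s) + s\, q_{-i}(s)\big) = \beta_i\big((s+1)\, q_{-i}(s)\big) \\
\dot q_{-i}(t) &= \beta_{-i}\big(q_i(t) + \dot q_i(t)\big) - q_{-i}(t) = p_{-i}(t) - q_{-i}(t) \\
&\overset{\text{Laplace Transform}}{\Longrightarrow}\quad s\, q_{-i}(s) = p_{-i}(s) - q_{-i}(s) \;\Rightarrow\; q_{-i}(s) = \frac{1}{s+1}\, p_{-i}(s)
\end{aligned}
\]
Substituting the second result into the first gives \( p_i(s) = \beta_i\big(p_{-i}(s)\big) \): the derivative action inverts the first-order lag between p_{-i} and q_{-i}, so each player responds to the opponent's current strategy rather than to its running average.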
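Convergence with Exact DAFP in Special Case (γ = 1)
Recall
\[
q_i(t) + \dot q_i(t) = \beta_i\big(q_{-i}(t) + \dot q_{-i}(t)\big)
\]
Now let
\[
z(t) = \begin{bmatrix} z_1(t) \\ z_2(t) \end{bmatrix} = \begin{bmatrix} q_1(t) + \dot q_1(t) \\ q_2(t) + \dot q_2(t) \end{bmatrix}
\]
and let \( T: \Re^{m_1} \times \Re^{m_2} \rightarrow \Delta(m_1) \times \Delta(m_2) \) be the mapping
\[
T(z) = \begin{bmatrix} \beta_1(z_2) \\ \beta_2(z_1) \end{bmatrix}
\]
Then the DAFP equations can be written as
\[
z = T(z)
\]
⇒ DAFP dynamics follow fixed points of T, which are Nash equilibria of the original game.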
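The fixed-point condition z = T(z) can be checked numerically for a candidate z. The sketch below assumes the logit best response for β_i and a cyclic 3×3 payoff matrix shared by both players; the names and values are hypothetical and serve only to illustrate the check.

```python
import math

def logit_response(M, z_other, tau):
    """Logit best response: softmax of (M z_other) / tau."""
    payoffs = [sum(m * x for m, x in zip(row, z_other)) for row in M]
    top = max(payoffs)
    w = [math.exp((v - top) / tau) for v in payoffs]
    s = sum(w)
    return [x / s for x in w]

def T(z1, z2, M1, M2, tau):
    """T(z) = (beta_1(z_2), beta_2(z_1))."""
    return logit_response(M1, z2, tau), logit_response(M2, z1, tau)

# Hypothetical cyclic payoff matrix (illustration only).
M = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

uniform = [1 / 3, 1 / 3, 1 / 3]
t1, t2 = T(uniform, uniform, M, M, tau=0.1)
fixed_point_gap = max(abs(a - b) for a, b in zip(t1 + t2, uniform + uniform))

corner = [1.0, 0.0, 0.0]
c1, c2 = T(corner, corner, M, M, tau=0.1)
corner_gap = max(abs(a - b) for a, b in zip(c1 + c2, corner + corner))
```

For this matrix the uniform profile is mapped to itself, so the gap is zero up to rounding, while a pure-strategy profile is moved far away by T and is therefore not a fixed point.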
Convergence with Noisy Exact DAFP in Special Case (γ = 1)
Suppose
\[
q_i(t) + \dot q_i(t) = \beta_i\big(q_{-i}(t) + \dot q_{-i}(t) + e_{-i}(t)\big)
\]
In words, the derivative of empirical frequencies is measurable to within some error.
The authors prove that for any arbitrarily small ε > 0, there exists a δ > 0 such that if the measurement error (e1, e2) eventually remains within a δ-neighborhood of the origin, then the empirical frequencies (q1, q2) will eventually remain within an ε-neighborhood of a Nash equilibrium.
This suggests that, if a sufficiently accurate approximation of the derivative of the empirical frequency can be constructed, Approximate DAFP will converge to an arbitrarily small neighborhood of the Nash equilibria.
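Convergence with Approximate DAFP in Special Case (γ = 1)
Recall
\[
\begin{aligned}
p_i(t) &= \beta_i\big(q_{-i}(t) + \gamma\, \dot r_{-i}(t)\big) \\
\dot q_i(t) &= \beta_i\big(q_{-i}(t) + \gamma\, \dot r_{-i}(t)\big) - q_i(t) \\
\dot r_i(t) &= \lambda\big(q_i(t) - r_i(t)\big)
\end{aligned}
\]
The authors prove that since \( q_i(t) \) evolves in the strategy simplex, so that \( q_i(t) \) and \( \dot q_i(t) \) are uniformly bounded, the filtered versions \( r_i(t) \) and \( \dot r_i(t) \) are also uniformly bounded, since for any λ, with associated solutions \( q_i^\lambda(t) \) and \( r_i^\lambda(t) \),
\[
\big\| q_i^\lambda(t) - r_i^\lambda(t) \big\| \le e^{-\lambda t}\, \big\| q_i^\lambda(0) - r_i^\lambda(0) \big\| + \frac{1}{\lambda} \sup_{\tau \ge 0} \big\| \dot q_i^\lambda(\tau) \big\|,
\]
reminiscent of the Lyapunov boundedness assurance for LTI systems with bounded input.
Furthermore, the authors prove that for arbitrarily increasing λ,
\[
\frac{1}{\lambda}\, \ddot r_i^\lambda(t) = \dot q_i^\lambda(t) - \dot r_i^\lambda(t) \rightarrow 0 \quad \text{for } t \in [T_1, T_2],\; T_1 > 0.
\]
However, in order to show that Approximate DAFP converges to an arbitrary neighborhood of Nash equilibria, it is necessary to show that \( \dot r_i^\lambda(t) \) and \( \dot q_i^\lambda(t) \) converge to the same value.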
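The boundedness claim for the filter on this slide can be reproduced with a standard variation-of-constants argument. The derivation below is my own sketch, not taken from the paper; the error variable \( e_i \) is introduced here for convenience and is unrelated to the measurement error of the noisy case.

```latex
\begin{align*}
e_i(t) &:= q_i(t) - r_i(t), \\
\dot e_i(t) &= \dot q_i(t) - \dot r_i(t) = \dot q_i(t) - \lambda\, e_i(t), \\
e_i(t) &= e^{-\lambda t}\, e_i(0) + \int_0^t e^{-\lambda (t-s)}\, \dot q_i(s)\, ds, \\
\| e_i(t) \| &\le e^{-\lambda t}\, \| e_i(0) \|
  + \sup_{s \ge 0} \| \dot q_i(s) \| \int_0^t e^{-\lambda (t-s)}\, ds \\
&= e^{-\lambda t}\, \| e_i(0) \|
  + \frac{1 - e^{-\lambda t}}{\lambda}\, \sup_{s \ge 0} \| \dot q_i(s) \| \\
&\le e^{-\lambda t}\, \| e_i(0) \|
  + \frac{1}{\lambda}\, \sup_{s \ge 0} \| \dot q_i(s) \|.
\end{align*}
```

The first term decays with the initial filter mismatch and the second shrinks as λ grows, which is why a larger λ yields a tighter approximation of the empirical frequency.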
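Convergence with Approximate DAFP in Special Case (γ = 1) (CONTINUED)
Let \( q_i^\lambda(t) \) and \( \dot q_i^\lambda(t) \) converge to \( q_i(t) \) and \( \dot q_i(t) \), respectively, as \( \lambda \rightarrow \infty \), and define
\[
b_i(t) = q_i(t) + \dot q_i(t).
\]
If
\[
b_i(t) = \beta_i\big(q_{-i}(t) + \dot q_{-i}(t)\big),
\]
then \( q_i(t) \) and \( \dot q_i(t) \) solve the Exact DAFP dynamic equations, and the pair \( (b_1(t), b_2(t)) \) is a fixed point of the T mapping defined earlier.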
Convergence with Approximate DAFP in Special Case (γ = 1) (CONTINUED)
Note that the equality condition given above guarantees convergence to a fixed point (Nash equilibrium) almost by definition, but it requires that the function β_i be weakly continuous.
In practice, Approximate DAFP with γ = 1 converges to an arbitrary neighborhood about a Nash equilibrium and is stable in the sense of Lyapunov.
Asymptotic stability is achievable with non-unity derivative gain, and the authors illustrate a method for gain setting by linearizing the strategy evolution about a Nash equilibrium and using a Routh-Hurwitz procedure to ensure that all eigenvalues of the linearized system matrix are in the left half-plane (for continuous-time systems).
Simulation Demonstration: Shapley Game
Consider the 2-player 3×3 game invented by Lloyd Shapley to illustrate non-convergence of fictitious play in general.
\[
M_1 = M_2 = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}
\]
Standard FP in Discrete Time (top) and Continuous Time (bottom)
Simulation Demonstration: Shapley Game (CONTINUED)
Shapley Game with Approximate DAFP in Continuous Time with increasing λ: 1 (top), 10 (middle), 100 (bottom)
Another interesting thing here is that the players enter a correlated equilibrium, and their average payoff is higher than the expected Nash payoff.
For the “modified” game, where player utility matrices are not identical, the strategies converge to theoretically unsupported values, illustrating a violation of the weak continuity requirement for βi. This steady-state error can be corrected by setting the derivative gain according to the linearization-Routh-Hurwitz procedure noted earlier.
GP Review for 2-player Games
Gradient Play: Player i adjusts his strategy by observing his own empirical action frequency and adding the gradient of his Utility, as determined by his opponent’s empirical action frequency
GP in Discrete Time
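\[
\begin{aligned}
U_i\big(p_i(k), p_{-i}(k)\big) &= p_i^T(k)\, M_i\, p_{-i}(k) \\
\nabla_{p_i} U_i\big(p_i(k), p_{-i}(k)\big) &= M_i\, p_{-i}(k) \\
p_i(k) &= \Pi_\Delta\big[q_i(k) + M_i\, q_{-i}(k)\big] \\
q_{-i}(k+1) &= \frac{k}{k+1}\, q_{-i}(k) + \frac{1}{k+1}\, a_{-i}(k)
\end{aligned}
\]
\( \Pi_\Delta \) denotes projection onto the simplex.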
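The projection step in gradient play can be sketched with the usual sort-based Euclidean projection onto the probability simplex. The slides do not say which projection algorithm is used, so this choice is an assumption, as are the function names and the cyclic 3×3 payoff matrix.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0.0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]

def gp_strategy(q_i, q_other, M_i):
    """p_i = Proj_simplex[q_i + M_i q_other]."""
    gradient = [sum(m * x for m, x in zip(row, q_other)) for row in M_i]
    return project_to_simplex([q + g for q, g in zip(q_i, gradient)])

# Hypothetical cyclic payoff matrix (illustration only).
M = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

inside = project_to_simplex([0.2, 0.3, 0.5])   # already in the simplex
clipped = project_to_simplex([2.0, 0.0])       # pushed to a vertex
p = gp_strategy([1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0], M)
```

A point already in the simplex is left unchanged, while a large gradient step is clipped to the nearest point of the simplex, which may be a pure strategy.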
GP in Continuous Time
\[
\begin{aligned}
p_i(t) &= \Pi_\Delta\big[q_i(t) + M_i\, q_{-i}(t)\big] \\
\dot q_i(t) &= \Pi_\Delta\big[q_i(t) + M_i\, q_{-i}(t)\big] - q_i(t)
\end{aligned}
\]
Achieving Nash Equilibrium
Gradient Play:
\[
\begin{aligned}
p_i^*(t) &= \Pi_\Delta\big[q_i^*(t) + M_i\, q_{-i}^*(t)\big] \\
\dot q_i^*(t) &= \Pi_\Delta\big[q_i^*(t) + M_i\, q_{-i}^*(t)\big] - q_i^*(t) = 0
\end{aligned}
\]
Derivative Action Gradient Play
Standard GP cannot converge asymptotically to completely mixed Nash equilibria because the linearized dynamics are unstable at mixed equilibria.
Exact DAGP always enables asymptotic stability at mixed equilibria with proper selection of derivative gain. Under some conditions, Approximate DAGP also enables asymptotic stability near mixed equilibria.
Approximate DAGP always ensures asymptotic stability in the vicinity of strict equilibria.
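\[
\begin{aligned}
\text{Exact:}\quad \dot q_i(t) &= \Pi_\Delta\big[q_i(t) + M_i\big(q_{-i}(t) + \gamma\, \dot q_{-i}(t)\big)\big] - q_i(t) \\
\text{Approximate:}\quad \dot q_i(t) &= \Pi_\Delta\big[q_i(t) + M_i\big(q_{-i}(t) + \gamma\, \dot r_{-i}(t)\big)\big] - q_i(t) \\
\dot r_i(t) &= \lambda\big(q_i(t) - r_i(t)\big)
\end{aligned}
\]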
DAGP Simulation: Modified Shapley Game
Multiplayer Games
Consider the 3-player Jordan game:
1 |  1   2
--+--------
1 |  a   0
2 |  0   a

2 |  1   2
--+--------
1 |  0   a
2 |  a   0
The authors demonstrate that DAGP converges to the mixed Nash equilibrium.
Jordan Game Demonstration