Empirical Game-Theoretic Analysis for Practical Strategic Reasoning
DESCRIPTION
Keynote by Michael Wellman at the 15th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA-2012), during Knowledge Technology Week (KTW2012). September 3-7, 2012. Kuching, Sarawak, Malaysia.
TRANSCRIPT
Empirical Game-Theoretic Analysis for Practical Strategic Reasoning
Michael P. Wellman University of Michigan
Planning in Strategic Environments
• Planning problem
  – find agent behavior satisfying/optimizing objectives wrt environment
  – strives for rationality
• When environment contains other agents
  – model them as rational planners as well
  – problem is a game
  – search now multi-dimensional, different (global) objective
[Diagram: Agent1 and Agent2 interacting within a shared environment]
Real-World Games
• complex dynamics and uncertainty
• rich strategy space
  – strategy: obs* × time → action
• severely incomplete information
  – interdependent types (signals)
  – info partially revealed over time
➙ analytic game-theoretic solutions few and far between

Two approaches:
1. analyze (stylized) approximations
   – one-shot, complete info, …
2. simulation-based methods
   – search
   – empirical: statistics, machine learning, …
Empirical Game-Theoretic Analysis (EGTA)
• Game described procedurally, no directly usable analytical form
• Parametrize strategy space based on agent architecture
• Selectively explore strategy/profile space
• Induce game model (payoff function) from simulation data → empirical game
EGTA Process
[Diagram: Strategy Set → Profile Space → Simulator → Profile Payoff Data → Empirical Game → Game Analysis (NE); example payoff entries: 6,8 0,2 5,1 …]
1. Parametrize strategy space
2. Estimate empirical game
3. Solve empirical game
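The three steps above can be made concrete with a small sketch. The following Python is a hypothetical, toy instance of the loop (the simulator, strategies, and payoffs are stand-ins, not the games discussed in this talk): estimate a symmetric empirical game by sampling each profile, then scan the estimated game for approximate pure-strategy equilibria.

```python
# Minimal EGTA sketch (toy, hypothetical names): estimate a symmetric empirical
# game from a stand-in simulator, then search the estimated game for
# approximate pure-strategy Nash equilibria.
import itertools
import random
import statistics

STRATEGIES = ["A", "B", "C"]          # parametrized strategy set
N_PLAYERS = 3
SAMPLES_PER_PROFILE = 20

def simulate(profile):
    """Stand-in simulator: returns one noisy payoff per player for a profile."""
    base = {"A": 1.0, "B": 1.2, "C": 0.9}
    counts = {s: profile.count(s) for s in STRATEGIES}
    # payoff falls when a strategy is crowded (toy congestion effect)
    return [base[s] / counts[s] + random.gauss(0, 0.05) for s in profile]

# 1-2. Estimate the empirical game: mean payoff per (profile, player slot).
payoffs = {}
for profile in itertools.combinations_with_replacement(STRATEGIES, N_PLAYERS):
    runs = [simulate(list(profile)) for _ in range(SAMPLES_PER_PROFILE)]
    payoffs[profile] = [statistics.mean(r[i] for r in runs) for i in range(N_PLAYERS)]

def payoff_of(strategy, others):
    """Estimated payoff of `strategy` when the other players play `others`."""
    profile = tuple(sorted([strategy, *others]))
    return payoffs[profile][profile.index(strategy)]

# 3. Solve the estimated game: report profiles with (near-)zero regret.
for profile in payoffs:
    regret = max(
        payoff_of(dev, [s for j, s in enumerate(profile) if j != i]) -
        payoff_of(profile[i], [s for j, s in enumerate(profile) if j != i])
        for i in range(N_PLAYERS) for dev in STRATEGIES
    )
    if regret <= 0.05:
        print("approximate pure NE:", profile, "regret", round(regret, 4))
```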
Simulation-Based Game Modeling
TAC Supply Chain Mgmt Game
[Diagram: suppliers (Pintel, IMD, Basus, Macrostar, Mec, Queenmax, Watergate, Mintor) provide components (CPU, Motherboard, Memory, Hard Disk) to six manufacturer agents (Manufacturer 1-6), who sell PCs to the customer; message flows: component RFQs, supplier offers, component orders, PC RFQs, PC bids, PC orders]
10 component types · 16 PC types · 220 simulation days · 15 seconds per day
Two-Strategy Game (Unpreempted)
Three-Strategy Game: Deviations
Ranking Strategies
• Often want to know: which is "better", strategy A or strategy B?
• Problem:
  – Depends on what other agents do
  – Cannot evaluate independently of strategic context
• Which context?
  – Self-play
  – Fixed proportions of other agents
  – Equilibrium (NE regret) (see the sketch below)
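To illustrate the equilibrium (NE regret) context, a minimal sketch: rank each strategy by how much expected payoff it gives up relative to a best response against a fixed equilibrium mixture. The strategy names echo the CDA strategies discussed later, but the payoff numbers and mixture below are invented for illustration.

```python
# Ranking strategies by NE regret: given a (toy) pairwise payoff function and
# an equilibrium mixture over strategies, rank each strategy by how much it
# loses relative to the best response against that equilibrium.
STRATEGIES = ["GDX", "GD", "ZIP", "ZI"]

# payoff_vs[s][o]: payoff to strategy s against an opponent playing o (toy values)
payoff_vs = {
    "GDX": {"GDX": 248, "GD": 249, "ZIP": 250, "ZI": 252},
    "GD":  {"GDX": 247, "GD": 249, "ZIP": 249, "ZI": 251},
    "ZIP": {"GDX": 244, "GD": 246, "ZIP": 248, "ZI": 250},
    "ZI":  {"GDX": 240, "GD": 242, "ZIP": 244, "ZI": 248},
}
equilibrium = {"GDX": 0.7, "GD": 0.3, "ZIP": 0.0, "ZI": 0.0}  # mixture over strategies

def expected_payoff(s, mixture):
    """Expected payoff of s against an opponent drawn from the mixture."""
    return sum(p * payoff_vs[s][o] for o, p in mixture.items())

best = max(expected_payoff(s, equilibrium) for s in STRATEGIES)
for s in sorted(STRATEGIES, key=lambda s: best - expected_payoff(s, equilibrium)):
    print(f"{s}: NE regret = {best - expected_payoff(s, equilibrium):.2f}")
```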
Ranking Strategies: TAC/SCM-07
[Figure: strategy rankings by deviation gain, SCM-07 EGTA vs. SCM-07 Tournament]
(from P.R. Jordan, PhD thesis, 2009)
Strategy Ranking (TAC Travel)
Strategies ranked with respect to the final equilibrium context
(from L.J. Schvartzman, PhD thesis, 2009)
Strategy Ranking (CDA)
strategy   NE1 regret   NE2 regret   symm. profile payoff
GDX         0            1.32        247.98
GD          0.49         3.26        248.57
RB          2.20         8.64        248.08
ZIP         2.90         9.86        247.95
Kaplan      4.56        24.55          2.02
ZIbtq      14.67        17.44        247.45
ZI         16.42        16.82        248.07
Strategy Ranking (SimSPSB)
[Figure: NE regret of candidate bidding strategies (AverageMU, BidEvaluator, BidXEvaluator, LocalBidSearch, and their self-confirming "SC" variants) across five environments: U[6,4], U[5,5], U[5,8], H[5,3], H[5,5]]
Bottom line: SC Local most effective overall, robust across environments
SC Local: heuristic search for optimal bid in response to self-confirming prices
Scaling #Players
[Figure: log(|G|) (game size) vs. #players, for varying strategic complexity]
Improving Scalability
• Exploit locality of interaction
  – graphical games, MAIDs, action-graph games, …
• Aggregate agents
  – hierarchical reduction (Wellman et al., AAAI-05)
  – clustering (Ficici et al., UAI-08)
  – deviation-preserving reduction (Wiedenbeck & Wellman, 2012)
Game Size
A "profile" assigns a strategy to each of the N players.
Number of profiles (symmetric game with strategy set S): $\binom{N + |S| - 1}{N}$
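For a sense of how quickly the profile space blows up, a few-line computation of the count above (assuming a symmetric game):

```python
# Profile count for a symmetric game: C(N + |S| - 1, N).
# Quick illustration of how the profile space grows with #players.
from math import comb

for n_players in (4, 8, 16, 32, 64):
    for n_strategies in (2, 5, 10):
        print(n_players, n_strategies, comb(n_players + n_strategies - 1, n_players))
```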
Player Reduction
Hierarchical Reduction [Wellman et al., 2005]
Twins Reduction [Ficici et al., 2008]
Deviation-Preserving Reduction
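A hedged sketch of the DPR payoff mapping for a symmetric game, under the simplifying assumption that the N-1 opponents split evenly among the k-1 other reduced players (the published method handles the general case); the full-game payoff function here is a toy stand-in.

```python
# Hedged sketch of deviation-preserving reduction (DPR) for a symmetric game:
# the reduced-game payoff to one player using strategy s is taken from the
# full game, with that player kept as a single "deviating" agent and each
# other reduced player standing in for (N-1)/(k-1) full-game agents.
# Assumes (N - 1) divides evenly by (k - 1).
from collections import Counter

def dpr_payoff(full_game_payoff, N, k, strategy, reduced_opponents):
    """
    full_game_payoff(strategy, opponent_counts) -> payoff in the N-player game
    reduced_opponents: list of k-1 strategies played by the other reduced players
    """
    assert (N - 1) % (k - 1) == 0, "sketch assumes an even split"
    scale = (N - 1) // (k - 1)
    # Each reduced opponent represents `scale` full-game opponents.
    opponent_counts = Counter()
    for s in reduced_opponents:
        opponent_counts[s] += scale
    return full_game_payoff(strategy, opponent_counts)

# Toy full game: payoff decreases with the number of opponents playing the same strategy.
def toy_payoff(strategy, opponent_counts):
    return 10.0 - opponent_counts.get(strategy, 0)

# 13-player full game reduced to 4 players: each reduced opponent stands for 4 agents.
print(dpr_payoff(toy_payoff, N=13, k=4, strategy="A", reduced_opponents=["A", "B", "B"]))
```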
[Figure: comparison of hierarchical reduction (HR), deviation-preserving reduction (DPR), and twins reduction (TR) on four games: 12p-6s Congestion Game, 100p-2s Congestion Game, 12p-6s Local Effect Game, 12p-6s Network Formation Game]
Role-Symmetric Games
Hierarchical Reduction
Deviation-Preserving Reduction
Iterative EGTA Process
[Diagram: Strategy Set → Profile Space → Select profiles (Sampling Control Problem) → Simulator → Profile Payoff Data → induce Empirical Game (Game Model Induction Problem) → Game Analysis (NE) → Refine strategy space? If no: End; if yes: add strategies (Strategy Exploration Problem) or gather more samples, then repeat]
Sampling Control Problem
• Revealed payoff model
  – sample provides exact payoff
  – minimum-regret-first search (MRFS): attempts to refute best current candidate (sketch below)
• Noisy payoff model
  – sample drawn from payoff distribution
  – information gain search (IGS): sample profile maximizing entropy difference wrt probability of being the min-regret profile
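A minimal sketch of MRFS on the two-player example that follows: maintain a regret lower bound (ε-bound) for each evaluated profile, repeatedly evaluate an unexplored deviation from the current minimum-regret candidate, and stop when some candidate has all deviations evaluated. Tie-breaking and the choice of which deviation to sample are simplified relative to the talk.

```python
# Hedged sketch of minimum-regret-first search (MRFS) over a two-player game
# with revealed payoffs: evaluating a profile reveals its exact payoffs;
# ε-bounds are computed only from evaluated deviations.
import random

# Payoffs from the walkthrough slides: payoffs[(r, c)] = (row payoff, col payoff)
payoffs = {
    ("r1","c1"): (9,5), ("r1","c2"): (3,3), ("r1","c3"): (2,5), ("r1","c4"): (4,8),
    ("r2","c1"): (6,4), ("r2","c2"): (8,8), ("r2","c3"): (3,0), ("r2","c4"): (5,3),
    ("r3","c1"): (2,2), ("r3","c2"): (2,1), ("r3","c3"): (3,2), ("r3","c4"): (4,6),
    ("r4","c1"): (4,4), ("r4","c2"): (2,0), ("r4","c3"): (2,2), ("r4","c4"): (9,3),
}
rows = ["r1", "r2", "r3", "r4"]
cols = ["c1", "c2", "c3", "c4"]

def deviations(profile):
    r, c = profile
    return [(r2, c) for r2 in rows if r2 != r] + [(r, c2) for c2 in cols if c2 != c]

evaluated = {("r1", "c1")}          # start from an arbitrary profile
regret_bound = {("r1", "c1"): 0.0}  # ε-bound: regret lower bound from evaluated deviations

while True:
    best = min(regret_bound, key=regret_bound.get)          # current min-regret candidate
    unexplored = [d for d in deviations(best) if d not in evaluated]
    if not unexplored:
        tag = "confirmed pure NE" if regret_bound[best] == 0 else "fully evaluated candidate"
        print(tag, best, "regret", regret_bound[best])
        break
    dev = random.choice(unexplored)                          # evaluate one deviation
    evaluated.add(dev)
    regret_bound.setdefault(dev, 0.0)
    # update ε-bounds of every evaluated profile using its evaluated deviations
    for p in list(regret_bound):
        for d in deviations(p):
            if d in evaluated:
                idx = 0 if d[1] == p[1] else 1   # deviator: row (0) or column (1) player
                gain = payoffs[d][idx] - payoffs[p][idx]
                regret_bound[p] = max(regret_bound[p], gain)
```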
Min-Regret-First Search
      c1    c2    c3    c4
r1   9,5   3,3   2,5   4,8
r2   6,4   8,8   3,0   5,3
r3   2,2   2,1   3,2   4,6
r4   4,4   2,0   2,2   9,3
start (arbitrary)
ε-bounds: (r1,c1) 0
Min-Regret Search: select random deviation from current best profile
ε-bounds: (r1,c1) 0, (r1,c2) 2
Min-Regret Search
ε-bounds: (r1,c1) 0, (r1,c2) 2, (r2,c1) 3
Min-Regret Search
ε-bounds: (r1,c1) 0, (r1,c2) 2, (r2,c1) 3, (r3,c1) 7
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 3, (r3,c1) 7, (r1,c4) 0
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 3, (r3,c1) 7, (r1,c4) 1, (r2,c4) 1
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0, (r2,c3) 8
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0, (r2,c3) 8, (r3,c2) 6
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0*, (r2,c3) 8, (r3,c2) 6, (r4,c2) 6
NE confirmed!
Finding Approximate PSNE
Iterative EGTA Process (diagram repeated from above)
Construct Empirical Game
• Simplest approach: direct estimation
  – employ control variates and other variance reduction techniques
[Diagram: payoff data from selected profiles, (s1, u(s1)), …, (sL, u(sL)) → estimate payoff function u(·) → empirical game]
Payoff Function Regression: FPSB2 Example, S_i = [0,1]
[Figure: generate data (simulations) → learn regression over the payoff function → solve learned game; eq = (0.32, 0.32); example payoff entries: 3,3 1,0 1,1 / 4,1 2,2 4,1 / 1,1 1,4 3,3]
Vorobeychik et al., ML 2007
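A minimal sketch in the spirit of the FPSB2 example, with a made-up simulator rather than the actual first-price auction model: sample profiles over the continuous strategy space [0,1], fit a simple regression to the payoff data, then grid-search the learned game for an approximate symmetric equilibrium.

```python
# Payoff-function regression for a symmetric 2-player game with continuous
# strategies in [0,1]: sample profiles, fit a quadratic payoff model
# u_hat(s, o) for own strategy s vs. opponent strategy o, then grid-search the
# learned game for a symmetric profile with minimal regret.
import numpy as np

rng = np.random.default_rng(0)

def simulate_payoff(s, o):
    """Stand-in noisy simulator (not the real FPSB payoff)."""
    return (0.5 - s) * (s > o) + rng.normal(0, 0.01)

# 1. Generate data from sampled profiles.
S = rng.uniform(0, 1, 400)
O = rng.uniform(0, 1, 400)
U = np.array([simulate_payoff(s, o) for s, o in zip(S, O)])

# 2. Learn a regression: quadratic features of (s, o).
X = np.column_stack([np.ones_like(S), S, O, S * S, O * O, S * O])
coef, *_ = np.linalg.lstsq(X, U, rcond=None)

def u_hat(s, o):
    return coef @ np.array([1, s, o, s * s, o * o, s * o])

# 3. Solve the learned game: symmetric strategy minimizing best-response gain.
grid = np.linspace(0, 1, 101)
regret = [max(u_hat(d, s) for d in grid) - u_hat(s, s) for s in grid]
print("approx. symmetric equilibrium of learned game:", grid[int(np.argmin(regret))])
```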
Generalization Risk Approach
• Model variations
  – functional forms, relationship structures, parameters
  – strategy granularity
• Approach:
  – Treat candidate game model as a predictor for payoff data
  – Adopt loss function for predictor
  – Select model candidate minimizing expected loss
Cross Validation
[Diagram: observation data split into folds (Fold 1, Fold 2, Fold 3); train on some folds, validate on the held-out fold]
(Jordan et al., AAMAS-09)
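A hedged sketch of the generalization-risk idea: compare candidate game models (here, payoff regressions of different polynomial degree, purely illustrative) by k-fold cross-validated prediction loss on payoff observations, and keep the model with the lowest validation error.

```python
# Select among candidate game models by cross-validated prediction loss.
# Data and model families are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(0, 1, 300)                              # own strategy parameter
O = rng.uniform(0, 1, 300)                              # opponent strategy parameter
U = np.sin(3 * S) - S * O + rng.normal(0, 0.05, 300)    # observed payoffs

def features(s, o, degree):
    """Polynomial features of total degree <= degree."""
    return np.column_stack([s**i * o**j for i in range(degree + 1)
                            for j in range(degree + 1 - i)])

def cv_loss(degree, k=3):
    idx = rng.permutation(len(U))
    folds = np.array_split(idx, k)
    losses = []
    for f in range(k):
        val = folds[f]
        trn = np.concatenate([folds[g] for g in range(k) if g != f])
        X_trn, X_val = features(S[trn], O[trn], degree), features(S[val], O[val], degree)
        coef, *_ = np.linalg.lstsq(X_trn, U[trn], rcond=None)
        losses.append(np.mean((X_val @ coef - U[val]) ** 2))
    return float(np.mean(losses))

for degree in (1, 2, 3, 4):
    print(f"degree {degree}: CV loss {cv_loss(degree):.5f}")
```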
Iterative EGTA Process (diagram repeated from above)
Learning New Strategies: EGTA+RL
[Diagram: the iterative EGTA loop (Strategy Set, Profile Space, Select, Simulator, Profile Payoff Data, Empirical Game, Game Analysis (NE)) augmented with online learning: RL learns a best response to the current NE; if the learned strategy deviates profitably, add it as a new strategy and re-analyze; otherwise try to improve the RL model or end]
CDA Learning Problem Setup
State space
• History of recent trades
  – H1: Moving average
  – H2: Frequency-weighted ratio, threshold = V
  – H3: Frequency-weighted ratio, threshold = A
• Quotes
  – Q1: Opposite role
  – Q2: Same role
• Time
  – T1: Total
  – T2: Since last trade
• Pending trades
  – U: Number of trades left
  – V: Value of next unit to be traded
Actions
• A: Offset from V
Rewards
• R: Difference between unit valuation and trade price
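A hedged sketch of the interleaved EGTA+RL loop described above: solve the current empirical game for an equilibrium, learn a response to it, and add the learned strategy if it deviates profitably. The "learning" step here is a crude parameter search standing in for the RL used in the CDA study, and the simulator is a toy.

```python
# EGTA+RL loop sketch: solve the current empirical game for a (symmetric, pure)
# equilibrium, learn a response to it, and add the learned strategy to the
# strategy set if it deviates profitably.
import random

def payoff(own, opp):
    """Stand-in simulator: strategies are bid offsets in [0, 1]."""
    return (1.0 - own) * (1.0 / (1.0 + abs(own - opp))) + random.gauss(0, 0.001)

def mean_payoff(own, opp, n=200):
    return sum(payoff(own, opp) for _ in range(n)) / n

def solve_symmetric(strategies):
    """Pure symmetric 'equilibrium': strategy with minimal regret within the set."""
    return min(strategies,
               key=lambda s: max(mean_payoff(d, s) for d in strategies) - mean_payoff(s, s))

def learn_best_response(opponent, candidates=None):
    """Stand-in for RL: grid search over the parametrized strategy family."""
    candidates = candidates or [i / 50 for i in range(51)]
    return max(candidates, key=lambda s: mean_payoff(s, opponent))

strategy_set = [0.9]                       # initial hand-crafted strategy
for round_num in range(1, 6):
    ne = solve_symmetric(strategy_set)
    br = learn_best_response(ne)
    gain = mean_payoff(br, ne) - mean_payoff(ne, ne)
    print(f"round {round_num}: NE={ne:.2f}, learned={br:.2f}, deviation gain={gain:.3f}")
    if gain <= 1e-3:                       # learned strategy does not deviate: stop
        break
    strategy_set.append(br)                # otherwise add it and re-analyze
```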
EGTA/RL Round 1
Strategies            Payoff   NE          Learning strategy   Dev. payoff
Kaplan, ZI, ZIbtq     248.1    1.000 ZI    L1                  268.7
+ L1                  242.5    1.000 L1
EGTA/RL Round 2
Strategies            Payoff   NE                   Learning strategy   Dev. payoff
Kaplan, ZI, ZIbtq     248.1    1.000 ZI             L1                  268.7
+ L1                  242.5    1.000 L1
+ ZIP                 248.0    1.000 ZIP
+ GD                  248.6    1.000 GD             L2-L8; L9           ---; 251.8
+ L9                  246.1    0.531 GD, 0.469 L9   L10                 252.1
EGTA/RL Rounds 3+
Strategies   Payoff   NE                     Learning strategy   Dev. payoff
…            …        …                      …                   …
+ L10        248.0    0.191 GD, 0.809 L10    L11                 251.0
+ L11        246.2    1.000 L11
+ GDX        245.8    0.192 GDX, 0.808 L11   L12                 248.3
+ L12        245.8    0.049 L11, 0.951 L12   L13                 245.9
+ L13        245.6    0.872 L12, 0.128 L13   L14                 245.6
+ RB         245.6    0.872 L12, 0.128 L13
Final champion
Strategy Exploration Problem
• Premise:
  – Limited ability to cover profile space
  – Expectation to reasonably evaluate all considered strategies
• Need deliberate policy to decide which strategies to introduce
• RL for strategy exploration
  – attempt at best response to current equilibrium
  – is this a good heuristic (even assuming ideal BR calculation)?
Example
      A1    A2    A3    A4
A1   1,1   1,2   1,3   1,4
A2   2,1   2,2   2,3   2,6
A3   3,1   3,2   3,3   3,8
A4   4,1   6,2   8,3   4,4

Strategy Set       Candidate Eq.   Regret wrt True Game
{A1}               (A1,A1)         3
{A1,A2}            (A2,A2)         4
{A1,A2,A3}         (A3,A3)         5
{A1,A2,A3,A4}      (A4,A4)         0

Introduce strategies in order: A1, A2, A3, A4
Regret may increase over subsequent steps!
FPSB2 Regret Surface
[Figure: regret ε(k_j) as a function of strategy parameters k_i and k_j, with BR, E(DEV), and DEV [MESH] marked]
Exploration Policies (a sketch of several of these rules follows)
• RND: Random (uniform) selection
• Deviation-based
  – DEV: Uniform among strategies that deviate from current equilibrium
  – BR: Best response to current equilibrium
  – BR+DEV: Alternate on successive iterations
  – ST(t): Softmax selection among deviators, proportional to gain
• MEMT: Select strategy that maximizes the gain (regret) from deviating to a strategy outside the set, over any mixture over the set.
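A small sketch of several of these selection rules (RND, DEV, BR, ST(t)), taking as input each candidate strategy's estimated gain from deviating against the current equilibrium; the gains below are illustrative placeholders, and MEMT is omitted since it requires searching over mixtures.

```python
# Strategy-exploration policies over a candidate pool: given each candidate's
# estimated gain from deviating against the current equilibrium, pick which
# strategy to add next.
import math
import random

candidate_gains = {"L11": 2.9, "L12": 0.0, "GDX": 1.4, "RB": -0.5, "ZIP": 0.2}

def rnd(gains):
    return random.choice(list(gains))                       # uniform over all candidates

def dev(gains):
    deviators = [s for s, g in gains.items() if g > 0]      # only profitable deviators
    return random.choice(deviators) if deviators else rnd(gains)

def br(gains):
    return max(gains, key=gains.get)                        # best response to equilibrium

def st(gains, t=1.0):
    deviators = {s: g for s, g in gains.items() if g > 0}   # softmax among deviators
    if not deviators:
        return rnd(gains)
    weights = [math.exp(g / t) for g in deviators.values()]
    return random.choices(list(deviators), weights=weights, k=1)[0]

for policy in (rnd, dev, br, st):
    print(policy.__name__, "->", policy(candidate_gains))
```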
[Figure: CDA (4 players), expected regret (log scale) vs. exploration step for policies MEMT, DEV, RND, BR, ST10, ST1, ST0.1]
EGTA Applications
• Market games
  – TAC: Travel, Supply Chain, Ad Auction
  – Canonical auctions: SimAAs, CDAs, SimSPSBs, …
  – Equity premium in financial trading
• Other domains
  – Privacy: information sharing attacks
  – Networking: routing, wireless AP selection
  – Credit network formation
• Mechanism design
Conclusion: EGTA Methodology
• Extends scope of GT to procedurally defined scenarios
• Embraces statistical underpinnings of strategic reasoning
• Search process:
  – GT for establishing salient strategic context
  – Strategy exploration: e.g., RL to search for best response to that context
→ Principled approach to evaluate complex strategy spaces
• Growing toolbox of EGTA techniques