Empirical Game-Theoretic Analysis for Practical Strategic Reasoning
DESCRIPTION
Keynote by Michael Wellman at the 15th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA-2012), during Knowledge Technology Week (KTW2012). September 3-7, 2012. Kuching, Sarawak, Malaysia.
TRANSCRIPT
Empirical Game-Theoretic Analysis for Practical Strategic Reasoning
Michael P. Wellman University of Michigan
Planning in Strategic Environments
• Planning problem
  – find agent behavior satisfying/optimizing objectives wrt environment
  – strives for rationality
• When environment contains other agents
  – model them as rational planners as well
  – problem is a game
  – search now multi-dimensional, different (global) objective
[Diagram: Agent1 and Agent2 interacting within a shared environment]
Real-World Games
• complex dynamics and uncertainty
• rich strategy space
  – strategy: obs* × time → action
• severely incomplete information
  – interdependent types (signals)
  – info partially revealed over time
➙ analytic game-theoretic solutions few and far between

Two approaches:
1. analyze (stylized) approximations
   – one-shot, complete info, …
2. simulation-based methods
   – search
   – empirical: statistics, machine learning, …
Empirical Game-Theoretic Analysis (EGTA)
• Game described procedurally, no directly usable analytical form
• Parametrize strategy space based on agent architecture
• Selectively explore strategy/profile space
• Induce game model (payoff function) from simulation data → empirical game
EGTA Process
[Diagram: Strategy Set → Profile Space → Simulator → Profile Payoff Data → Empirical Game → Game Analysis (NE); example payoff entries: 6,8 0,2 5,1 …]
1. Parametrize strategy space
2. Estimate empirical game
3. Solve empirical game
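The three steps above can be made concrete with a small sketch. The following Python is a hypothetical, toy instance of the loop (the simulator, strategies, and payoffs are stand-ins, not the games discussed in this talk): estimate a symmetric empirical game by sampling each profile, then scan the estimated game for approximate pure-strategy equilibria.

```python
# Minimal EGTA sketch (toy, hypothetical names): estimate a symmetric empirical
# game from a stand-in simulator, then search the estimated game for
# approximate pure-strategy Nash equilibria.
import itertools
import random
import statistics

STRATEGIES = ["A", "B", "C"]          # parametrized strategy set
N_PLAYERS = 3
SAMPLES_PER_PROFILE = 20

def simulate(profile):
    """Stand-in simulator: returns one noisy payoff per player for a profile."""
    base = {"A": 1.0, "B": 1.2, "C": 0.9}
    counts = {s: profile.count(s) for s in STRATEGIES}
    # payoff falls when a strategy is crowded (toy congestion effect)
    return [base[s] / counts[s] + random.gauss(0, 0.05) for s in profile]

# 1-2. Estimate the empirical game: mean payoff per (profile, player slot).
payoffs = {}
for profile in itertools.combinations_with_replacement(STRATEGIES, N_PLAYERS):
    runs = [simulate(list(profile)) for _ in range(SAMPLES_PER_PROFILE)]
    payoffs[profile] = [statistics.mean(r[i] for r in runs) for i in range(N_PLAYERS)]

def payoff_of(strategy, others):
    """Estimated payoff of `strategy` when the other players play `others`."""
    profile = tuple(sorted([strategy, *others]))
    return payoffs[profile][profile.index(strategy)]

# 3. Solve the estimated game: report profiles with (near-)zero regret.
for profile in payoffs:
    regret = max(
        payoff_of(dev, [s for j, s in enumerate(profile) if j != i]) -
        payoff_of(profile[i], [s for j, s in enumerate(profile) if j != i])
        for i in range(N_PLAYERS) for dev in STRATEGIES
    )
    if regret <= 0.05:
        print("approximate pure NE:", profile, "regret", round(regret, 4))
```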
Simulation-Based Game Modeling
TAC Supply Chain Mgmt Game
[Diagram: suppliers (Pintel, IMD, Basus, Macrostar, Mec, Queenmax, Watergate, Mintor) provide components (CPU, Motherboard, Memory, Hard Disk) to six manufacturer agents (Manufacturer 1-6), who sell PCs to the customer; message flows: component RFQs, supplier offers, component orders, PC RFQs, PC bids, PC orders]
10 component types · 16 PC types · 220 simulation days · 15 seconds per day
Two-Strategy Game (Unpreempted)
Three-Strategy Game: Deviations
Ranking Strategies
• Often want to know: which is "better", strategy A or strategy B?
• Problem:
  – Depends on what other agents do
  – Cannot evaluate independently of strategic context
• Which context?
  – Self-play
  – Fixed proportions of other agents
  – Equilibrium (NE regret) (see the sketch below)
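To illustrate the equilibrium (NE regret) context, a minimal sketch: rank each strategy by how much expected payoff it gives up relative to a best response against a fixed equilibrium mixture. The strategy names echo the CDA strategies discussed later, but the payoff numbers and mixture below are invented for illustration.

```python
# Ranking strategies by NE regret: given a (toy) pairwise payoff function and
# an equilibrium mixture over strategies, rank each strategy by how much it
# loses relative to the best response against that equilibrium.
STRATEGIES = ["GDX", "GD", "ZIP", "ZI"]

# payoff_vs[s][o]: payoff to strategy s against an opponent playing o (toy values)
payoff_vs = {
    "GDX": {"GDX": 248, "GD": 249, "ZIP": 250, "ZI": 252},
    "GD":  {"GDX": 247, "GD": 249, "ZIP": 249, "ZI": 251},
    "ZIP": {"GDX": 244, "GD": 246, "ZIP": 248, "ZI": 250},
    "ZI":  {"GDX": 240, "GD": 242, "ZIP": 244, "ZI": 248},
}
equilibrium = {"GDX": 0.7, "GD": 0.3, "ZIP": 0.0, "ZI": 0.0}  # mixture over strategies

def expected_payoff(s, mixture):
    """Expected payoff of s against an opponent drawn from the mixture."""
    return sum(p * payoff_vs[s][o] for o, p in mixture.items())

best = max(expected_payoff(s, equilibrium) for s in STRATEGIES)
for s in sorted(STRATEGIES, key=lambda s: best - expected_payoff(s, equilibrium)):
    print(f"{s}: NE regret = {best - expected_payoff(s, equilibrium):.2f}")
```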
Ranking Strategies: TAC/SCM-07
[Figure: strategy rankings by deviation gain, SCM-07 EGTA vs. SCM-07 Tournament]
(from P.R. Jordan, PhD thesis, 2009)
Strategy Ranking (TAC Travel)
Strategies ranked with respect to the final equilibrium context
(from L.J. Schvartzman, PhD thesis, 2009)
Strategy Ranking (CDA)
strategy   NE1 regret   NE2 regret   symm. profile payoff
GDX         0            1.32        247.98
GD          0.49         3.26        248.57
RB          2.20         8.64        248.08
ZIP         2.90         9.86        247.95
Kaplan      4.56        24.55          2.02
ZIbtq      14.67        17.44        247.45
ZI         16.42        16.82        248.07
Strategy Ranking (SimSPSB)
[Figure: NE regret of candidate bidding strategies (AverageMU, BidEvaluator, BidXEvaluator, LocalBidSearch, and their self-confirming "SC" variants) across five environments: U[6,4], U[5,5], U[5,8], H[5,3], H[5,5]]
Bottom line: SC Local most effective overall, robust across environments
SC Local: heuristic search for optimal bid in response to self-confirming prices
Scaling #Players
[Figure: log(|G|) (game size) vs. #players, for varying strategic complexity]
Improving Scalability
• Exploit locality of interaction
  – graphical games, MAIDs, action-graph games, …
• Aggregate agents
  – hierarchical reduction (Wellman et al., AAAI-05)
  – clustering (Ficici et al., UAI-08)
  – deviation-preserving reduction (Wiedenbeck & Wellman, 2012)
Game Size
A "profile" assigns a strategy to each of the N players.
Number of profiles (symmetric game with strategy set S): $\binom{N + |S| - 1}{N}$
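For a sense of how quickly the profile space blows up, a few-line computation of the count above (assuming a symmetric game):

```python
# Profile count for a symmetric game: C(N + |S| - 1, N).
# Quick illustration of how the profile space grows with #players.
from math import comb

for n_players in (4, 8, 16, 32, 64):
    for n_strategies in (2, 5, 10):
        print(n_players, n_strategies, comb(n_players + n_strategies - 1, n_players))
```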
Player Reduction
Hierarchical Reduction [Wellman et al., 2005]
Twins Reduction [Ficici et al., 2008]
Deviation-Preserving Reduction
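A hedged sketch of the DPR payoff mapping for a symmetric game, under the simplifying assumption that the N-1 opponents split evenly among the k-1 other reduced players (the published method handles the general case); the full-game payoff function here is a toy stand-in.

```python
# Hedged sketch of deviation-preserving reduction (DPR) for a symmetric game:
# the reduced-game payoff to one player using strategy s is taken from the
# full game, with that player kept as a single "deviating" agent and each
# other reduced player standing in for (N-1)/(k-1) full-game agents.
# Assumes (N - 1) divides evenly by (k - 1).
from collections import Counter

def dpr_payoff(full_game_payoff, N, k, strategy, reduced_opponents):
    """
    full_game_payoff(strategy, opponent_counts) -> payoff in the N-player game
    reduced_opponents: list of k-1 strategies played by the other reduced players
    """
    assert (N - 1) % (k - 1) == 0, "sketch assumes an even split"
    scale = (N - 1) // (k - 1)
    # Each reduced opponent represents `scale` full-game opponents.
    opponent_counts = Counter()
    for s in reduced_opponents:
        opponent_counts[s] += scale
    return full_game_payoff(strategy, opponent_counts)

# Toy full game: payoff decreases with the number of opponents playing the same strategy.
def toy_payoff(strategy, opponent_counts):
    return 10.0 - opponent_counts.get(strategy, 0)

# 13-player full game reduced to 4 players: each reduced opponent stands for 4 agents.
print(dpr_payoff(toy_payoff, N=13, k=4, strategy="A", reduced_opponents=["A", "B", "B"]))
```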
[Figure: comparison of hierarchical reduction (HR), deviation-preserving reduction (DPR), and twins reduction (TR) on four games: 12p-6s Congestion Game, 100p-2s Congestion Game, 12p-6s Local Effect Game, 12p-6s Network Formation Game]
Role-Symmetric Games
Hierarchical Reduction
Deviation-Preserving Reduction
Iterative EGTA Process
[Diagram: Strategy Set → Profile Space → Select profiles (Sampling Control Problem) → Simulator → Profile Payoff Data → induce Empirical Game (Game Model Induction Problem) → Game Analysis (NE) → Refine strategy space? If no: End; if yes: add strategies (Strategy Exploration Problem) or gather more samples, then repeat]
Sampling Control Problem
• Revealed payoff model
  – sample provides exact payoff
  – minimum-regret-first search (MRFS): attempts to refute best current candidate (sketch below)
• Noisy payoff model
  – sample drawn from payoff distribution
  – information gain search (IGS): sample profile maximizing entropy difference wrt probability of being the min-regret profile
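A minimal sketch of MRFS on the two-player example that follows: maintain a regret lower bound (ε-bound) for each evaluated profile, repeatedly evaluate an unexplored deviation from the current minimum-regret candidate, and stop when some candidate has all deviations evaluated. Tie-breaking and the choice of which deviation to sample are simplified relative to the talk.

```python
# Hedged sketch of minimum-regret-first search (MRFS) over a two-player game
# with revealed payoffs: evaluating a profile reveals its exact payoffs;
# ε-bounds are computed only from evaluated deviations.
import random

# Payoffs from the walkthrough slides: payoffs[(r, c)] = (row payoff, col payoff)
payoffs = {
    ("r1","c1"): (9,5), ("r1","c2"): (3,3), ("r1","c3"): (2,5), ("r1","c4"): (4,8),
    ("r2","c1"): (6,4), ("r2","c2"): (8,8), ("r2","c3"): (3,0), ("r2","c4"): (5,3),
    ("r3","c1"): (2,2), ("r3","c2"): (2,1), ("r3","c3"): (3,2), ("r3","c4"): (4,6),
    ("r4","c1"): (4,4), ("r4","c2"): (2,0), ("r4","c3"): (2,2), ("r4","c4"): (9,3),
}
rows = ["r1", "r2", "r3", "r4"]
cols = ["c1", "c2", "c3", "c4"]

def deviations(profile):
    r, c = profile
    return [(r2, c) for r2 in rows if r2 != r] + [(r, c2) for c2 in cols if c2 != c]

evaluated = {("r1", "c1")}          # start from an arbitrary profile
regret_bound = {("r1", "c1"): 0.0}  # ε-bound: regret lower bound from evaluated deviations

while True:
    best = min(regret_bound, key=regret_bound.get)          # current min-regret candidate
    unexplored = [d for d in deviations(best) if d not in evaluated]
    if not unexplored:
        tag = "confirmed pure NE" if regret_bound[best] == 0 else "fully evaluated candidate"
        print(tag, best, "regret", regret_bound[best])
        break
    dev = random.choice(unexplored)                          # evaluate one deviation
    evaluated.add(dev)
    regret_bound.setdefault(dev, 0.0)
    # update ε-bounds of every evaluated profile using its evaluated deviations
    for p in list(regret_bound):
        for d in deviations(p):
            if d in evaluated:
                idx = 0 if d[1] == p[1] else 1   # deviator: row (0) or column (1) player
                gain = payoffs[d][idx] - payoffs[p][idx]
                regret_bound[p] = max(regret_bound[p], gain)
```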
Min-Regret-First Search
      c1    c2    c3    c4
r1   9,5   3,3   2,5   4,8
r2   6,4   8,8   3,0   5,3
r3   2,2   2,1   3,2   4,6
r4   4,4   2,0   2,2   9,3
start (arbitrary)
ε-bounds: (r1,c1) 0
Min-Regret Search: select random deviation from current best profile
ε-bounds: (r1,c1) 0, (r1,c2) 2
Min-Regret Search
ε-bounds: (r1,c1) 0, (r1,c2) 2, (r2,c1) 3
Min-Regret Search
ε-bounds: (r1,c1) 0, (r1,c2) 2, (r2,c1) 3, (r3,c1) 7
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 3, (r3,c1) 7, (r1,c4) 0
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 3, (r3,c1) 7, (r1,c4) 1, (r2,c4) 1
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0, (r2,c3) 8
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0, (r2,c3) 8, (r3,c2) 6
Min-Regret Search
ε-bounds: (r1,c1) 3, (r1,c2) 5, (r2,c1) 4, (r3,c1) 7, (r1,c4) 1, (r2,c4) 5, (r2,c2) 0*, (r2,c3) 8, (r3,c2) 6, (r4,c2) 6
NE confirmed!
Finding Approximate PSNE
Iterative EGTA Process (diagram repeated from above)
Construct Empirical Game
• Simplest approach: direct estimation
  – employ control variates and other variance reduction techniques
[Diagram: payoff data from selected profiles, (s1, u(s1)), …, (sL, u(sL)) → estimate payoff function u(·) → empirical game]
Payoff Function Regression: FPSB2 Example, S_i = [0,1]
[Figure: generate data (simulations) → learn regression over the payoff function → solve learned game; eq = (0.32, 0.32); example payoff entries: 3,3 1,0 1,1 / 4,1 2,2 4,1 / 1,1 1,4 3,3]
Vorobeychik et al., ML 2007
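A minimal sketch in the spirit of the FPSB2 example, with a made-up simulator rather than the actual first-price auction model: sample profiles over the continuous strategy space [0,1], fit a simple regression to the payoff data, then grid-search the learned game for an approximate symmetric equilibrium.

```python
# Payoff-function regression for a symmetric 2-player game with continuous
# strategies in [0,1]: sample profiles, fit a quadratic payoff model
# u_hat(s, o) for own strategy s vs. opponent strategy o, then grid-search the
# learned game for a symmetric profile with minimal regret.
import numpy as np

rng = np.random.default_rng(0)

def simulate_payoff(s, o):
    """Stand-in noisy simulator (not the real FPSB payoff)."""
    return (0.5 - s) * (s > o) + rng.normal(0, 0.01)

# 1. Generate data from sampled profiles.
S = rng.uniform(0, 1, 400)
O = rng.uniform(0, 1, 400)
U = np.array([simulate_payoff(s, o) for s, o in zip(S, O)])

# 2. Learn a regression: quadratic features of (s, o).
X = np.column_stack([np.ones_like(S), S, O, S * S, O * O, S * O])
coef, *_ = np.linalg.lstsq(X, U, rcond=None)

def u_hat(s, o):
    return coef @ np.array([1, s, o, s * s, o * o, s * o])

# 3. Solve the learned game: symmetric strategy minimizing best-response gain.
grid = np.linspace(0, 1, 101)
regret = [max(u_hat(d, s) for d in grid) - u_hat(s, s) for s in grid]
print("approx. symmetric equilibrium of learned game:", grid[int(np.argmin(regret))])
```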
Generalization Risk Approach
• Model variations
  – functional forms, relationship structures, parameters
  – strategy granularity
• Approach:
  – Treat candidate game model as a predictor for payoff data
  – Adopt loss function for predictor
  – Select model candidate minimizing expected loss
Cross Validation
[Diagram: observation data split into folds (Fold 1, Fold 2, Fold 3); train on some folds, validate on the held-out fold]
(Jordan et al., AAMAS-09)
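A hedged sketch of the generalization-risk idea: compare candidate game models (here, payoff regressions of different polynomial degree, purely illustrative) by k-fold cross-validated prediction loss on payoff observations, and keep the model with the lowest validation error.

```python
# Select among candidate game models by cross-validated prediction loss.
# Data and model families are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(0, 1, 300)                              # own strategy parameter
O = rng.uniform(0, 1, 300)                              # opponent strategy parameter
U = np.sin(3 * S) - S * O + rng.normal(0, 0.05, 300)    # observed payoffs

def features(s, o, degree):
    """Polynomial features of total degree <= degree."""
    return np.column_stack([s**i * o**j for i in range(degree + 1)
                            for j in range(degree + 1 - i)])

def cv_loss(degree, k=3):
    idx = rng.permutation(len(U))
    folds = np.array_split(idx, k)
    losses = []
    for f in range(k):
        val = folds[f]
        trn = np.concatenate([folds[g] for g in range(k) if g != f])
        X_trn, X_val = features(S[trn], O[trn], degree), features(S[val], O[val], degree)
        coef, *_ = np.linalg.lstsq(X_trn, U[trn], rcond=None)
        losses.append(np.mean((X_val @ coef - U[val]) ** 2))
    return float(np.mean(losses))

for degree in (1, 2, 3, 4):
    print(f"degree {degree}: CV loss {cv_loss(degree):.5f}")
```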
Iterative EGTA Process (diagram repeated from above)
Learning New Strategies: EGTA+RL
[Diagram: the iterative EGTA loop (Strategy Set, Profile Space, Select, Simulator, Profile Payoff Data, Empirical Game, Game Analysis (NE)) augmented with online learning: RL learns a best response to the current NE; if the learned strategy deviates profitably, add it as a new strategy and re-analyze; otherwise try to improve the RL model or end]
CDA Learning Problem Setup
State space
• History of recent trades
  – H1: Moving average
  – H2: Frequency-weighted ratio, threshold = V
  – H3: Frequency-weighted ratio, threshold = A
• Quotes
  – Q1: Opposite role
  – Q2: Same role
• Time
  – T1: Total
  – T2: Since last trade
• Pending trades
  – U: Number of trades left
  – V: Value of next unit to be traded
Actions
• A: Offset from V
Rewards
• R: Difference between unit valuation and trade price
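A hedged sketch of the interleaved EGTA+RL loop described above: solve the current empirical game for an equilibrium, learn a response to it, and add the learned strategy if it deviates profitably. The "learning" step here is a crude parameter search standing in for the RL used in the CDA study, and the simulator is a toy.

```python
# EGTA+RL loop sketch: solve the current empirical game for a (symmetric, pure)
# equilibrium, learn a response to it, and add the learned strategy to the
# strategy set if it deviates profitably.
import random

def payoff(own, opp):
    """Stand-in simulator: strategies are bid offsets in [0, 1]."""
    return (1.0 - own) * (1.0 / (1.0 + abs(own - opp))) + random.gauss(0, 0.001)

def mean_payoff(own, opp, n=200):
    return sum(payoff(own, opp) for _ in range(n)) / n

def solve_symmetric(strategies):
    """Pure symmetric 'equilibrium': strategy with minimal regret within the set."""
    return min(strategies,
               key=lambda s: max(mean_payoff(d, s) for d in strategies) - mean_payoff(s, s))

def learn_best_response(opponent, candidates=None):
    """Stand-in for RL: grid search over the parametrized strategy family."""
    candidates = candidates or [i / 50 for i in range(51)]
    return max(candidates, key=lambda s: mean_payoff(s, opponent))

strategy_set = [0.9]                       # initial hand-crafted strategy
for round_num in range(1, 6):
    ne = solve_symmetric(strategy_set)
    br = learn_best_response(ne)
    gain = mean_payoff(br, ne) - mean_payoff(ne, ne)
    print(f"round {round_num}: NE={ne:.2f}, learned={br:.2f}, deviation gain={gain:.3f}")
    if gain <= 1e-3:                       # learned strategy does not deviate: stop
        break
    strategy_set.append(br)                # otherwise add it and re-analyze
```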
EGTA/RL Round 1
Strategies            Payoff   NE          Learning strategy   Dev. payoff
Kaplan, ZI, ZIbtq     248.1    1.000 ZI    L1                  268.7
+ L1                  242.5    1.000 L1
EGTA/RL Round 2
Strategies            Payoff   NE                   Learning strategy   Dev. payoff
Kaplan, ZI, ZIbtq     248.1    1.000 ZI             L1                  268.7
+ L1                  242.5    1.000 L1
+ ZIP                 248.0    1.000 ZIP
+ GD                  248.6    1.000 GD             L2-L8; L9           ---; 251.8
+ L9                  246.1    0.531 GD, 0.469 L9   L10                 252.1
EGTA/RL Rounds 3+
Strategies   Payoff   NE                     Learning strategy   Dev. payoff
…            …        …                      …                   …
+ L10        248.0    0.191 GD, 0.809 L10    L11                 251.0
+ L11        246.2    1.000 L11
+ GDX        245.8    0.192 GDX, 0.808 L11   L12                 248.3
+ L12        245.8    0.049 L11, 0.951 L12   L13                 245.9
+ L13        245.6    0.872 L12, 0.128 L13   L14                 245.6
+ RB         245.6    0.872 L12, 0.128 L13
Final champion
Strategy Exploration Problem
• Premise:
  – Limited ability to cover profile space
  – Expectation to reasonably evaluate all considered strategies
• Need deliberate policy to decide which strategies to introduce
• RL for strategy exploration
  – attempt at best response to current equilibrium
  – is this a good heuristic (even assuming ideal BR calculation)?
Example
      A1    A2    A3    A4
A1   1,1   1,2   1,3   1,4
A2   2,1   2,2   2,3   2,6
A3   3,1   3,2   3,3   3,8
A4   4,1   6,2   8,3   4,4

Strategy Set       Candidate Eq.   Regret wrt True Game
{A1}               (A1,A1)         3
{A1,A2}            (A2,A2)         4
{A1,A2,A3}         (A3,A3)         5
{A1,A2,A3,A4}      (A4,A4)         0

Introduce strategies in order: A1, A2, A3, A4
Regret may increase over subsequent steps!
FPSB2 Regret Surface
[Figure: regret ε(k_j) as a function of strategy parameters k_i and k_j, with BR, E(DEV), and DEV [MESH] marked]
Exploration Policies (a sketch of several of these rules follows)
• RND: Random (uniform) selection
• Deviation-based
  – DEV: Uniform among strategies that deviate from current equilibrium
  – BR: Best response to current equilibrium
  – BR+DEV: Alternate on successive iterations
  – ST(t): Softmax selection among deviators, proportional to gain
• MEMT: Select strategy that maximizes the gain (regret) from deviating to a strategy outside the set, over any mixture over the set.
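A small sketch of several of these selection rules (RND, DEV, BR, ST(t)), taking as input each candidate strategy's estimated gain from deviating against the current equilibrium; the gains below are illustrative placeholders, and MEMT is omitted since it requires searching over mixtures.

```python
# Strategy-exploration policies over a candidate pool: given each candidate's
# estimated gain from deviating against the current equilibrium, pick which
# strategy to add next.
import math
import random

candidate_gains = {"L11": 2.9, "L12": 0.0, "GDX": 1.4, "RB": -0.5, "ZIP": 0.2}

def rnd(gains):
    return random.choice(list(gains))                       # uniform over all candidates

def dev(gains):
    deviators = [s for s, g in gains.items() if g > 0]      # only profitable deviators
    return random.choice(deviators) if deviators else rnd(gains)

def br(gains):
    return max(gains, key=gains.get)                        # best response to equilibrium

def st(gains, t=1.0):
    deviators = {s: g for s, g in gains.items() if g > 0}   # softmax among deviators
    if not deviators:
        return rnd(gains)
    weights = [math.exp(g / t) for g in deviators.values()]
    return random.choices(list(deviators), weights=weights, k=1)[0]

for policy in (rnd, dev, br, st):
    print(policy.__name__, "->", policy(candidate_gains))
```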
[Figure: CDA (4 players), expected regret (log scale) vs. exploration step for policies MEMT, DEV, RND, BR, ST10, ST1, ST0.1]
EGTA Applications
• Market games
  – TAC: Travel, Supply Chain, Ad Auction
  – Canonical auctions: SimAAs, CDAs, SimSPSBs, …
  – Equity premium in financial trading
• Other domains
  – Privacy: information sharing attacks
  – Networking: routing, wireless AP selection
  – Credit network formation
• Mechanism design
Conclusion: EGTA Methodology
• Extends scope of GT to procedurally defined scenarios
• Embraces statistical underpinnings of strategic reasoning
• Search process:
  – GT for establishing salient strategic context
  – Strategy exploration: e.g., RL to search for best response to that context
→ Principled approach to evaluate complex strategy spaces
• Growing toolbox of EGTA techniques