Adaptive Regret Minimization in Bounded Memory Games
Jeremiah Blocki, Nicolas Christin, Anupam Datta, Arunesh Sinha
GameSec 2013 – Invited Paper
Motivating Example: Cheating Game
Semester 1 Semester 2 Semester 3
Motivating Example: Speeding Game
Week 1 Week 2 Week 3
Motivating Example: Speeding Game

Actions: High Inspection, Low Inspection
Outcomes: Speed, Behave

Questions: What is the appropriate game model for this interaction? What are good defender strategies?
Game Elements
- Repeated interaction
- Two players: Defender and Adversary
- Imperfect information: the Defender only observes the outcome
- Short-term adversaries
- Adversary incentives unknown to the Defender
  - Last presentation! [JNTP 13]
  - Adversary may be uninformed/irrational

Repeated game model? Stackelberg?
Additional Game Elements
- History-dependent actions
  - Adversary adapts behavior following an unknown strategy
  - How should the defender respond?
- History-dependent rewards
  - Point system
  - Reputation of the defender depends both on its history and on the current outcome

Standard regret minimization? Repeated game model?
Outline
- Motivation
- Background
  - Standard definition of regret
  - Regret minimization algorithms
  - Limitations
- Our contributions
  - Bounded memory games
  - Adaptive regret
  - Results
Speeding Game: Repeated Game Model
Defender's (D) expected utility:

|                 | Speed | Behave |
|-----------------|-------|--------|
| High Inspection | 0.19  | 0.7    |
| Low Inspection  | 0.2   | 1      |
Regret Minimization Example

Experts: Low Inspection, High Inspection

What should I do?
Regret Minimization Example

|           | Day 1 | Day 2  | Day 3  |
|-----------|-------|--------|--------|
| Adversary | Speed | Behave | Behave |
| Aristotle | High  | Low    | High   |
| Plato     | Low   | Low    | Low    |

Defender's utility (payoffs as in the speeding game):
- Aristotle: 0.19 + 1 + 0.7 = 1.89
- Plato: 0.2 + 1 + 1 = 2.2
- Defender: 1.59
Regret Minimization Example

Utilities: Aristotle 1.89, Plato 2.2, Defender 1.59

Regret = best expert's utility − defender's utility = 2.2 − 1.59 = 0.61
Regret Minimization Example

Regret Minimization Algorithm (A):

$$\lim_{T\to\infty} \frac{\mathrm{Regret}(A, \mathrm{Experts}, T)}{T} \le 0$$
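The regret computation above can be sketched directly; a minimal Python example where the payoff table is the speeding game's and the expert advice sequences (Aristotle, Plato) come from the example slides. All function names here are illustrative, not from the paper:

```python
# Defender's payoffs in the speeding game: (defender action, adversary action).
PAYOFF = {
    ("high", "speed"): 0.19, ("high", "behave"): 0.7,
    ("low", "speed"): 0.2,   ("low", "behave"): 1.0,
}

def total_utility(defender_actions, adversary_actions):
    return sum(PAYOFF[(d, a)] for d, a in zip(defender_actions, adversary_actions))

def regret(defender_actions, experts, adversary_actions):
    """Regret = best expert's total utility minus the defender's."""
    mine = total_utility(defender_actions, adversary_actions)
    best = max(total_utility(e, adversary_actions) for e in experts.values())
    return best - mine

adversary = ["speed", "behave", "behave"]       # days 1-3
experts = {
    "aristotle": ["high", "low", "high"],       # earns 0.19 + 1 + 0.7 = 1.89
    "plato":     ["low", "low", "low"],         # earns 0.2 + 1 + 1 = 2.2
}
defender = ["high", "high", "high"]             # earns 0.19 + 0.7 + 0.7 = 1.59
print(round(regret(defender, experts, adversary), 2))  # → 0.61
```

Dividing this quantity by T and letting T grow gives the average regret the definition above bounds.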
Regret Minimization: Basic Idea
Weights: Low Inspection 1.0, High Inspection 1.0

Choose an action probabilistically based on the weights.
Regret Minimization: Basic Idea
Updated weights: Low Inspection 1.5, High Inspection 0.5
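The weight-update idea can be sketched as a multiplicative-weights loop. This is a standard construction, not the paper's algorithm: the learning rate `eta` and the full-information feedback (updating every action's weight by the reward it would have earned) are simplifying assumptions, whereas the paper's setting only observes outcomes:

```python
import random

# Defender's payoffs in the speeding game.
PAYOFF = {
    ("high", "speed"): 0.19, ("high", "behave"): 0.7,
    ("low", "speed"): 0.2,   ("low", "behave"): 1.0,
}

def multiplicative_weights(adversary_plays, eta=0.5):
    """Keep a weight per action, sample proportionally, grow each weight
    multiplicatively with the reward that action would have earned."""
    actions = ["high", "low"]
    weights = {a: 1.0 for a in actions}
    total = 0.0
    for adv in adversary_plays:
        z = sum(weights.values())
        choice = random.choices(actions, weights=[weights[a] / z for a in actions])[0]
        total += PAYOFF[(choice, adv)]
        for a in actions:                          # multiplicative update
            weights[a] *= (1 + eta) ** PAYOFF[(a, adv)]
    return weights, total

weights, _ = multiplicative_weights(["behave"] * 20)
print(weights["low"] > weights["high"])   # → True
```

Against an adversary who always behaves, Low Inspection earns 1 per round versus 0.7 for High, so its weight grows faster, matching the drift shown on the slides.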
Speeding Game

Defender's strategy:
- Nash equilibrium: Low Inspection
- Regret minimization: Low Inspection (the dominant strategy)

Weights: Low Inspection 1.5, High Inspection 0.5
Speeding Game

Defender's strategy:
- Nash equilibrium: Low Inspection
- Regret minimization: Low Inspection (the dominant strategy)

Weights: Low Inspection 1.7, High Inspection 0.3
Speeding Game

Defender's strategy:
- Nash equilibrium: Low Inspection
- Regret minimization: Low Inspection (the dominant strategy)

Weights: Low Inspection 1.9, High Inspection 0.1
Philosophical Argument
"See! My advice was better!" "We need a better game model!"
Unmodeled Game Elements
- Adversary incentives unknown to the Defender
  - Last presentation! [JNTP 13]
  - Adversary may be uninformed/irrational
- History-dependent rewards
  - Point system
  - Reputation of the defender depends both on its history and on the current outcome
- History-dependent actions
  - Adversary adapts behavior following an unknown strategy
  - How should the defender respond?
Outline
- Motivation
- Background
- Our contributions
  - Bounded memory games
  - Adaptive regret
  - Results
Bounded Memory Games
State s encodes the last m outcomes, so states can capture history-dependent rewards. Outcome O_i moves the game from state (O_{i−m}, …, O_{i−2}, O_{i−1}) to state (O_{i−m+1}, …, O_{i−1}, O_i). The defender's payoff depends on the actions d, a played and the current state s.
Bounded Memory Games
State s encodes the last m outcomes. The current outcome depends only on the current actions; the defender's payoff depends on the actions d, a played and the current state s.
Bounded Memory Games - Experts
Expert advice may depend on the last m outcomes. A fixed defender strategy maps each state to an action. Example: "If no violations have been detected in the last m rounds then play High Inspection, otherwise Low Inspection."
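The bounded-memory state and a state-dependent expert can be sketched in a few lines. The encoding below (state = last m outcomes, with illustrative outcome names) is a hypothetical realization of the slide's description:

```python
from collections import deque

class BoundedMemoryState:
    """State of a bounded memory-m game: the last m outcomes."""
    def __init__(self, m):
        self.m = m
        self.outcomes = deque(["no_violation"] * m, maxlen=m)  # initial state
    def update(self, outcome):
        self.outcomes.append(outcome)   # maxlen drops the oldest outcome

def inspection_expert(state):
    """Fixed defender strategy from the slide: High Inspection iff no
    violation was detected in the last m rounds."""
    if "violation" in state.outcomes:
        return "low"
    return "high"

s = BoundedMemoryState(m=3)
print(inspection_expert(s))     # → high
s.update("violation")
print(inspection_expert(s))     # → low
```

Because the state is a fixed-size window, a fixed strategy is just a finite table from states to actions.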
Outline
- Motivation
- Background
- Our contributions
  - Bounded memory games
  - Adaptive regret
  - Results
k-Adaptive Strategy
The adversary commits to a decision tree for the next k rounds: at each node it plays an action (Speed or Behave) and branches on the outcome it observes, over Day 1, Day 2, Day 3, …
k-Adaptive Strategy
A decision tree for the next k rounds. Examples of k-adaptive adversary strategies:
- "I will never speed while I am on vacation."
- "I will speed until I get caught. If I ever get a ticket then I will stop."
- "I will keep speeding until I get two tickets. If I ever get two tickets then I will stop."

Example defender expert: "If violations have been detected in the last 7 rounds then play High Inspection, otherwise Low Inspection."
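A k-adaptive strategy is a depth-k decision tree from observed outcomes to the next action. Two of the slide's example strategies can be sketched as such trees; the outcome names (`ticket` / `no_ticket`) are illustrative:

```python
def speed_until_caught(history):
    """'I will speed until I get caught': history is the list of
    outcomes observed so far."""
    if "ticket" in history:
        return "behave"          # got a ticket once: stop speeding
    return "speed"

def two_ticket_strategy(history):
    """'I will keep speeding until I get two tickets.'"""
    if history.count("ticket") >= 2:
        return "behave"          # stops only after the second ticket
    return "speed"

print(speed_until_caught([]))                        # → speed
print(speed_until_caught(["no_ticket", "ticket"]))   # → behave
print(two_ticket_strategy(["ticket"]))               # → speed
```

Each function implicitly describes a decision tree: the branch taken at round i depends only on the outcomes of rounds 1, …, i−1.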
k-Adaptive Regret
$$\mathrm{Regret}(D, \mathrm{Expert}, T) = \sum_{i=1}^{T} (r_i' - r_i)$$

Both runs start from the same initial state (…, O_{-1}, O_0):

- Defender: actions (a_1,d_1), (a_2,d_2), …, (a_{k+1},d_{k+1}); outcomes O_1, O_2, …, O_{k+1}; rewards r_1, r_2, …, r_{k+1}
- Expert: actions (a_1,d_1'), (a_2',d_2'), …, (a_{k+1}',d_{k+1}'); outcomes O_1', O_2', …, O_{k+1}'; rewards r_1', r_2', …, r_{k+1}'

The adversary is adaptive, so after the first round it may respond differently to the expert (a_2' vs. a_2), and the primed outcomes and rewards can diverge.
k-Adaptive Regret Minimization
Definition: An algorithm D is a γ-approximate k-adaptive regret minimization algorithm if, for any bounded memory-m game and any fixed set of experts EXP,

$$\lim_{T\to\infty}\left(\max_{E\in \mathrm{EXP}} \frac{\mathrm{Regret}(D, E, T)}{T}\right) \le \gamma.$$
Outline
- Motivation
- Background
- Bounded memory games
- Adaptive regret
- Results
k-Adaptive Regret Minimization
Recall: an algorithm D is a γ-approximate k-adaptive regret minimization algorithm if, for any bounded memory-m game and any fixed set of experts EXP,

$$\lim_{T\to\infty}\left(\max_{E\in \mathrm{EXP}} \frac{\mathrm{Regret}(D, E, T)}{T}\right) \le \gamma.$$

Theorem: For any γ > 0 there is an (inefficient) γ-approximate k-adaptive regret minimization algorithm.
Inefficient Regret Minimization Algorithm
- Use a standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05].
- Each fixed strategy f1, f2, … of the bounded memory-m game is treated as a single action of a repeated game.
Inefficient Regret Minimization Algorithm
- Use a standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05].
- The reward of a fixed strategy f2 in the repeated game is the expected reward in the original game given that:
  1. the defender follows fixed strategy f2 for the next m·k·t rounds of the original game, and
  2. the defender faces the corresponding sequence of k-adaptive adversaries.
Inefficient Regret Minimization Algorithm

Each stage i lasts m·k·t rounds. Because the current outcome depends only on the current actions, the real game and the simulated repeated game see the same outcomes O_1, …, O_m, … in each stage from the start state.
Inefficient Regret Minimization Algorithm

After m rounds in stage i, the two views (real game and simulated repeated game) must converge to the same state, since a state encodes only the last m outcomes.
Inefficient Regret Minimization Algorithm
- Use a standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05].
- Standard regret minimization algorithms maintain a weight for each expert, i.e., for each fixed strategy of the bounded memory-m game.
- Inefficient: there are exponentially many fixed strategies!
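The source of the inefficiency can be made concrete by enumerating the experts of the reduction: every map from states to defender actions is one fixed strategy, and their number is exponential in the number of states. A small sketch (state and action names are illustrative):

```python
from itertools import product

def all_fixed_strategies(states, actions):
    """Enumerate every map state -> action; there are |actions|**|states|
    of them, one weight per map in the reduction."""
    for choice in product(actions, repeat=len(states)):
        yield dict(zip(states, choice))

# With m = 3 binary outcomes there are 2**3 = 8 states, hence
# 2**8 = 256 fixed strategies for a 2-action defender; in general the
# count is |A| ** (|O| ** m).
states = ["".join(bits) for bits in product("01", repeat=3)]
strategies = list(all_fixed_strategies(states, ["high", "low"]))
print(len(states), len(strategies))   # → 8 256
```

Even for modest m the weight vector is astronomically large, which is why the paper's efficient algorithms need an implicit weight representation instead.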
Summary of Technical Results
|                       | Imperfect Information             | Perfect Information |
|-----------------------|-----------------------------------|---------------------|
| Oblivious Regret      | Hard (Theorem 1), APX (Theorem 5) | APX (Theorem 4)     |
| k-Adaptive Regret     | Hard (Theorem 1)                  | Hard (Remark 2)     |
| Fully Adaptive Regret | X (Theorem 6)                     | X (Theorem 6)       |

Perfect information is the easier setting. X – no regret minimization algorithm exists. Hard – no efficient regret minimization algorithm (unless RP = NP). APX – efficient approximate regret minimization algorithm.
Summary of Technical Results
|                       | Imperfect Information             | Perfect Information         |
|-----------------------|-----------------------------------|-----------------------------|
| Oblivious Regret      | Hard (Theorem 1), APX (Theorem 5) | APX (Theorem 4)             |
| k-Adaptive Regret     | Hard (Theorem 1), APX (New!)      | Hard (Remark 2), APX (New!) |
| Fully Adaptive Regret | X (Theorem 6)                     | X (Theorem 6)               |

Perfect information is the easier setting. X – no regret minimization algorithm exists. Hard – no efficient regret minimization algorithm (unless RP = NP). APX – efficient approximate regret minimization algorithm in n, k.
Summary of Technical Results
Ideas: implicit weight representation + dynamic programming.

Warning! f(k) is a very large constant!
Implicit Weights: Outcome Tree
The outcome tree branches on the outcome of each round (Behave/Speed) and has $O(\ln n / \gamma)$ nodes; each edge (u, v) carries a weight $w_{uv}$. How often is edge (u, v) relevant?
Implicit Weights: Outcome Tree
The weight of an expert E is represented implicitly by summing along the outcome tree:

$$w_E = \sum_{(u,v)\in E} R_{uv}\, w_{uv}$$

over the $O(\ln n / \gamma)$ nodes.
Open Questions
- Perfect information: is there an efficient γ-approximate k-adaptive regret minimization algorithm when k = 0 and γ = 0?
- Is there a γ-approximate k-adaptive regret minimization algorithm with a more efficient running time?

|                       | Imperfect Information             | Perfect Information  |
|-----------------------|-----------------------------------|----------------------|
| Oblivious Regret      | Hard (Theorem 1), APX (Theorem 5) | APX (Theorem 4)      |
| k-Adaptive Regret     | Hard (Theorem 1), APX             | Hard (Remark 2), APX |
| Fully Adaptive Regret | X (Theorem 6)                     | X (Theorem 6)        |
Thanks for Listening!
THEOREM 3

Unless RP = NP, there is no efficient regret minimization algorithm for bounded memory games, even against an oblivious adversary.

Reduction from MAX 3-SAT(7/8 + ε) [Hastad01]; similar to the reduction in [EKM05] for MDPs.
THEOREM 3: SETUP
- Defender actions A: {0,1} × {0,1}
- m = O(log n)
- States: two states for each variable, S0 = {s1, …, sn} and S1 = {s'1, …, s'n}
- Intuition: a fixed strategy corresponds to a variable assignment
THEOREM 3: OVERVIEW

- The adversary picks a clause uniformly at random for the next n rounds.
- The defender can earn reward 1 by satisfying this unknown clause within the next n rounds.
- The game will "remember" whether a reward has already been given, so the defender cannot earn a reward multiple times during the n rounds.
THEOREM 3: STATE TRANSITIONS
- Adversary actions B: {0,1} × {0,1,2,3}, written b = (b1, b2); defender actions a = (a1, a2)
- g(a, b) = b1
- f(a, b) = S1 if a2 = 1 or b2 = a1 (reward already given); S0 otherwise (no reward given)
THEOREM 3: REWARDS
With b = (b1, b2):

r(a, b, s) =
- 1 if s ∈ S0 and a1 = b2
- −5 if s ∈ S1, f(a, b) = S0, and b2 ≠ 3
- 0 otherwise

In particular, there is no reward whenever the adversary plays b2 = 2, and no reward whenever s ∈ S1.
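The transition and reward rules read off the slides can be sketched in code. This is a hedged reconstruction: the slide's garbled "a = b2" is read as a1 = b2, matching the transition rule's "reward already given" condition, and the state names are illustrative:

```python
def f(a, b):
    """Which half of the state space the game moves to."""
    a1, a2 = a
    b1, b2 = b
    if a2 == 1 or b2 == a1:
        return "S1"        # reward already given
    return "S0"            # no reward given yet

def r(a, b, s):
    """Defender reward for actions a, b at state s."""
    a1, _ = a
    _, b2 = b
    if s == "S0" and a1 == b2:
        return 1           # matched the hidden clause's literal
    if s == "S1" and f(a, b) == "S0" and b2 != 3:
        return -5          # punished for leaving S1 early
    return 0

print(r((1, 0), (0, 1), "S0"))   # → 1
print(r((0, 0), (0, 1), "S1"))   # → -5
```

Note that earning the reward (a1 = b2) forces f(a, b) = S1, which is exactly why a reward cannot be collected twice within the n rounds.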
THEOREM 3: OBLIVIOUS ADVERSARY

Let (d1, …, dn) be a binary De Bruijn sequence of order n.

1. Pick a clause C uniformly at random.
2. For i = 1, …, n play b = (di, b2), where b2 = 1 if xi appears in C, 0 if ¬xi appears in C, 3 if i = n, and 2 otherwise.
3. Repeat step 1.
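The rule for b2 can be sketched as follows. The clause encoding (a set of signed literal indices) is a hypothetical choice, and the negated-literal case is a reconstruction of the garbled slide text:

```python
def b2_for_round(i, n, clause):
    """Adversary's second coordinate in round i of the n-round phase.
    clause is a set of signed literal indices, e.g. {1, -3} for (x1 OR NOT x3)."""
    if i == n:
        return 3           # final round: defender may leave S1 unpunished
    if i in clause:        # x_i appears positively in C
        return 1
    if -i in clause:       # x_i appears negated in C
        return 0
    return 2               # x_i not in C: no reward possible this round

clause = {1, -3}           # (x1 OR NOT x3), n = 4 variables
print([b2_for_round(i, 4, clause) for i in range(1, 5)])   # → [1, 2, 0, 3]
```

The defender earns the reward exactly when its assignment bit a1 for x_i agrees with the clause's literal, i.e., when the assignment satisfies the clause.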
ANALYSIS

- The defender can never be rewarded from s ∈ S1.
- Getting a reward forces a transition to s ∈ S1.
- The defender is punished for leaving S1, unless the adversary plays b2 = 3 (i.e., when i = n).

Recall: f(a, b) = S1 if a2 = 1 or b2 = a1, and S0 otherwise; r(a, b, s) = 1 if s ∈ S0 and a1 = b2, −5 if s ∈ S1, f(a, b) = S0, and b2 ≠ 3, and 0 otherwise.
THEOREM 3: ANALYSIS

- φ – an assignment satisfying a ρ fraction of clauses; the corresponding fixed strategy fφ has average score ρ/n.
- Claim: no strategy (fixed or adaptive) can obtain an average expected score better than ρ*/n, where ρ* is the maximum fraction of simultaneously satisfiable clauses.
- Regret minimization algorithm: run until the expected average regret is < ε/n; then the expected average score is > (ρ* − ε)/n.