Theory and Applications of a Repeated Game Playing Algorithm
Theory and Applications of a Repeated Game Playing Algorithm
Rob Schapire
Princeton University [currently visiting Yahoo! Research]
Learning Is (Often) Just a Game
• some learning problems:
  • learn from training examples to detect faces in images
  • sequentially predict what decision a person will make next (e.g., what link a user will click on)
  • learn to drive toy car by observing how human does it
• these appear to be very different
• this talk:
  • can all be viewed as game-playing problems
  • can all be handled as applications of a single algorithm for playing repeated games
• take-home message:
  • game-theoretic algorithms have wide applications in learning
  • reveals deep connections between seemingly unrelated learning problems
Outline
• how to play repeated games: theory and an algorithm
• applications:
  • boosting
  • on-line learning
  • apprenticeship learning
How to Play Repeated Games
[with Freund]
Games
• game defined by matrix M:
              Rock   Paper  Scissors
    Rock      1/2    1      0
    Paper     0      1/2    1
    Scissors  1      0      1/2
• row player (“Mindy”) chooses row i
• column player (“Max”) chooses column j (simultaneously)
• Mindy’s goal: minimize her loss M(i , j)
• assume (wlog) all entries in [0, 1]
Randomized Play
• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns (simultaneously)
• Mindy’s (expected) loss = Σ_{i,j} P(i) M(i, j) Q(j) = PᵀMQ ≡ M(P, Q)
• i, j = “pure” strategies
• P, Q = “mixed” strategies
• m = # rows of M
• also write M(i, Q) and M(P, j) when one side plays pure and other plays mixed
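As a small sketch (using NumPy, with the Rock-Paper-Scissors loss matrix from the earlier slide), the expected loss under mixed strategies is just the bilinear form PᵀMQ:

```python
import numpy as np

# Loss matrix for Rock-Paper-Scissors from the earlier slide
# (rows and columns ordered Rock, Paper, Scissors; entries in [0, 1]).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

def expected_loss(P, M, Q):
    """Mindy's expected loss M(P, Q) = sum_{i,j} P(i) M(i, j) Q(j) = P^T M Q."""
    return float(P @ M @ Q)

uniform = np.ones(3) / 3
print(expected_loss(uniform, M, uniform))   # 0.5, the value of the game
```

With i, j as indicator vectors, the same function also gives the pure-vs-mixed losses M(i, Q) and M(P, j).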
Sequential Play
• say Mindy plays before Max
• if Mindy chooses P then Max will pick Q to maximize M(P, Q) ⇒ loss will be
    L(P) ≡ max_Q M(P, Q) = max_j M(P, j)
  [note: maximum realized at pure strategy]
• so Mindy should pick P to minimize L(P) ⇒ loss will be
    min_P L(P) = min_P max_Q M(P, Q)
• similarly, if Max plays first, loss will be
    max_Q min_P M(P, Q)
Minmax Theorem
• playing second (with knowledge of other player’s move) cannot be worse than playing first, so:
    min_P max_Q M(P, Q)   ≥   max_Q min_P M(P, Q)
    [Mindy plays first]       [Mindy plays second]
• von Neumann’s minmax theorem:
    min_P max_Q M(P, Q) = max_Q min_P M(P, Q)
• in words: no advantage to playing second
Optimal Play
• minmax theorem:
    min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = value v of game
• optimal strategies:
  • P∗ = argmin_P max_Q M(P, Q) = minmax strategy
  • Q∗ = argmax_Q min_P M(P, Q) = maxmin strategy
• in words:
  • Mindy’s minmax strategy P∗ guarantees loss ≤ v (regardless of Max’s play)
  • optimal because Max has maxmin strategy Q∗ that can force loss ≥ v (regardless of Mindy’s play)
• e.g.: in RPS, P∗ = Q∗ = uniform
• solving game = finding minmax/maxmin strategies
Weaknesses of Classical Theory
• seems to fully answer how to play games — just compute minmax strategy (e.g., using linear programming)
• weaknesses:
  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
• may be possible to do better than value v
• e.g.:
  Lisa (thinks): Poor predictable Bart, always takes Rock.
  Bart (thinks): Good old Rock, nothing beats that.
Repeated Play
• if only playing once, hopeless to overcome ignorance of game M or opponent
• but if game played repeatedly, may be possible to learn to play well
• goal: play (almost) as well as if knew game and how opponent would play ahead of time
Repeated Play (cont.)
• M unknown
• for t = 1, . . . , T:
  • Mindy chooses P_t
  • Max chooses Q_t (possibly depending on P_t)
  • Mindy’s loss = M(P_t, Q_t)
  • Mindy observes loss M(i, Q_t) of each pure strategy i
• want:
    (1/T) Σ_{t=1}^{T} M(P_t, Q_t)   ≤   min_P (1/T) Σ_{t=1}^{T} M(P, Q_t)  +  [“small amount”]
    [actual average loss]               [best loss (in hindsight)]
Multiplicative-weights Algorithm (MW)
• choose η > 0
• initialize: P1 = uniform
• on round t:
    P_{t+1}(i) = P_t(i) exp(−η M(i, Q_t)) / normalization
• idea: decrease weight of strategies suffering the most loss
• directly generalizes [Littlestone & Warmuth]
• other algorithms:
  • [Hannan ’57]
  • [Blackwell ’56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]
  • ...
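The update rule above can be sketched in a few lines of NumPy. This is an illustrative run against a fixed sequence of opponent pure plays (the function name and interface are mine; MW’s guarantee holds against any opponent):

```python
import numpy as np

def mw(M, columns, eta):
    """One run of MW on loss matrix M (entries in [0,1]) against an
    arbitrary fixed sequence of opponent pure plays `columns`.
    Returns (actual average loss, best fixed-row average loss in hindsight)."""
    m = M.shape[0]
    P = np.ones(m) / m                    # P_1 = uniform
    total = 0.0
    cum = np.zeros(m)                     # cumulative loss of each pure row
    for j in columns:
        total += P @ M[:, j]              # Mindy's loss M(P_t, Q_t)
        cum += M[:, j]                    # observed losses M(i, Q_t)
        P = P * np.exp(-eta * M[:, j])    # downweight high-loss strategies
        P /= P.sum()                      # normalization
    T = len(columns)
    return total / T, cum.min() / T
```

Against predictable Bart (always Rock), MW quickly shifts nearly all weight to Paper, so the actual average loss approaches the best-in-hindsight loss of 0.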
Analysis
• Theorem: can choose η so that, for any game M with m rows, and any opponent,
    (1/T) Σ_{t=1}^{T} M(P_t, Q_t)   ≤   min_P (1/T) Σ_{t=1}^{T} M(P, Q_t)  +  Δ_T
    [actual average loss]               [best average loss (≤ v)]
  where Δ_T = O(√(ln m / T)) → 0
• regret Δ_T is:
  • logarithmic in # rows m
  • independent of # columns
• running time (of MW) is:
  • linear in # rows m
  • independent of # columns
• therefore, can use when working with very large games
Can Prove Minmax Theorem as Corollary
• want to prove:
    min_P max_Q M(P, Q) ≤ max_Q min_P M(P, Q)
  (already argued “≥”)
• game M played repeatedly
• Mindy plays using MW
• on round t, Max chooses best response:
    Q_t = argmax_Q M(P_t, Q)
  [note: Q_t can be pure]
• let
    P̄ = (1/T) Σ_{t=1}^{T} P_t ,   Q̄ = (1/T) Σ_{t=1}^{T} Q_t
Proof of Minmax Theorem
    min_P max_Q PᵀMQ ≤ max_Q P̄ᵀMQ
                     = max_Q (1/T) Σ_{t=1}^{T} P_tᵀMQ
                     ≤ (1/T) Σ_{t=1}^{T} max_Q P_tᵀMQ
                     = (1/T) Σ_{t=1}^{T} P_tᵀMQ_t
                     ≤ min_P (1/T) Σ_{t=1}^{T} PᵀMQ_t + Δ_T
                     = min_P PᵀMQ̄ + Δ_T
                     ≤ max_Q min_P PᵀMQ + Δ_T
Solving a Game
• derivation shows that:
    max_Q M(P̄, Q) ≤ v + Δ_T   and   min_P M(P, Q̄) ≥ v − Δ_T
• so: P̄ and Q̄ are Δ_T-approximate minmax and maxmin strategies
• further: can choose Q_t’s to be pure
  ⇒ Q̄ = (1/T) Σ_t Q_t will be sparse (≤ T non-zero entries)
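The recipe above — run MW for Mindy, let Max play a pure best response each round, average both sides — can be sketched as follows (the defaults for η and T are illustrative, not tuned; `np.argmax` gives a pure Q_t, so Q̄ has at most T non-zero entries):

```python
import numpy as np

def solve_game(M, T=1000, eta=0.1):
    """Approximately solve a matrix game with MW: return (P_bar, Q_bar),
    which are Delta_T-approximate minmax / maxmin strategies."""
    m, n = M.shape
    P = np.ones(m) / m
    P_sum = np.zeros(m)
    Q_sum = np.zeros(n)
    for t in range(T):
        j = int(np.argmax(P @ M))        # Max's pure best response Q_t
        P_sum += P
        Q_sum[j] += 1.0
        P = P * np.exp(-eta * M[:, j])   # MW update
        P /= P.sum()
    return P_sum / T, Q_sum / T
```

On Rock-Paper-Scissors (value 1/2), the averages come out close to the uniform minmax/maxmin strategies: P̄ guarantees loss not much above 1/2, and Q̄ forces loss not much below 1/2.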
Summary of Game Playing
• presented MW algorithm for repeatedly playing matrix games
• plays almost as well as best fixed strategy in hindsight
• very efficient — independent of # columns
• proved minmax theorem
• MW can also be used to approximately solve a game
Application to Boosting
[with Freund]
Example: Spam Filtering
• problem: filter out spam (junk email)
• gather large collection of examples of spam and non-spam:
    From: yoav@ucsd.edu        Rob, can you review a paper...         non-spam
    From: xa412@hotmail.com    Earn money without working!!!! ...     spam
    ...
• goal: have computer learn from examples to distinguish spam from non-spam
• main observation:
• easy to find “rules of thumb” that are “often” correct
• If ‘viagra’ occurs in message, then predict ‘spam’
• hard to find single rule that is very highly accurate
The Boosting Approach
• given:
  • training examples
  • access to weak learning algorithm for finding weak classifiers (rules of thumb)
• repeat:
  • choose distribution on training examples
    • intuition: put most weight on “hardest” examples
  • train weak learner on distribution to get weak classifier
• combine weak classifiers into final classifier
  • use (weighted) majority vote
• weak learning assumption: weak classifiers slightly but consistently better than random
• want: final classifier to be very accurate
Boosting as a Game
• Mindy (row player) ↔ booster
• Max (column player) ↔ weak learner
• matrix M:
  • row ↔ training example
  • column ↔ weak classifier
  • M(i, j) = 1 if j-th weak classifier correct on i-th training example, 0 else
  • encodes which weak classifiers correct on which examples
  • huge # of columns — one for every possible weak classifier
Boosting and the Minmax Theorem
• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ
• equivalent to:
    (value of game M) ≥ 1/2 + γ
• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples
  • further, weights are given by maxmin strategy of game M
Idea for Boosting
• maxmin strategy of M has perfect (training) accuracy
• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M
• yields (variant of) AdaBoost
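A toy sketch of this idea (the function names and the tiny 0/1 “correct” matrix interface are my illustrative assumptions; this is the game-playing view of boosting, not AdaBoost itself, and a real booster would call a weak learner rather than scan columns):

```python
import numpy as np

def boost(correct, T, eta):
    """Boosting as MW on the slide's game.
    correct: (n_examples, n_weak) 0/1 matrix; correct[i, j] = 1 iff
             weak classifier j is correct on example i (the matrix M).
    Booster = row player: keeps a distribution over examples and, per MW,
    downweights examples the chosen weak classifier got right (so weight
    piles up on the "hardest" examples).  Weak learner = column player:
    best response = most accurate classifier under current distribution.
    Returns the indices of the chosen weak classifiers."""
    n, _ = correct.shape
    P = np.ones(n) / n
    chosen = []
    for t in range(T):
        j = int(np.argmax(P @ correct))       # weak learner's best response
        chosen.append(j)
        P = P * np.exp(-eta * correct[:, j])  # keep weight on hard examples
        P /= P.sum()
    return chosen

def majority_correct(correct, chosen):
    """Fraction of examples the majority vote of the chosen classifiers gets right."""
    votes = correct[:, chosen].mean(axis=1)
    return float(np.mean(votes > 0.5))
```

On a tiny dataset where every weak classifier is only 2/3 accurate (satisfying γ-weak learning with γ = 1/6), the majority vote over the chosen classifiers is perfect on the training set, as the minmax argument predicts.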
AdaBoost and Game Theory
• AdaBoost can be derived as special case of general game-playing algorithm (MW)
• idea of boosting intimately tied to minmax theorem
• game-theoretic interpretation can further be used to analyze:
  • AdaBoost’s convergence properties
  • AdaBoost as a “large-margin” classifier
    (margin = measure of confidence, useful for analyzing generalization accuracy)
Application: Detecting Faces [Viola & Jones]
• problem: find faces in photograph or movie
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
Application to On-line Learning
[with Freund]
Example: Predicting Stock Market
• want to predict daily if stock market will go up or down
• every morning:
  • listen to predictions of other forecasters (based on market conditions)
  • formulate own prediction
• every evening:
  • find out if market actually went up or down
• want own predictions to be almost as good as best forecaster
On-line Learning
• given access to pool of predictors
  • could be: people, simple fixed functions, other learning algorithms, etc.
  • for our purposes, think of as fixed functions of observable “context”
• on each round:
  • get predictions of all predictors, given current context
  • make own prediction
  • find out correct outcome
• want: # mistakes close to that of best predictor
• training and testing occur simultaneously
• no statistical assumptions — examples chosen by adversary
On-line Learning as a Game
• Mindy (row player) ↔ learner
• Max (column player) ↔ adversary
• matrix M:
  • row ↔ predictor
  • column ↔ context
  • M(i, j) = 1 if i-th predictor wrong on j-th context, 0 else
• actually, the same game as in boosting, but with roles of players reversed
Applying MW
• can apply MW to game M
→ weighted majority algorithm [Littlestone & Warmuth]
• MW analysis gives:
    (# mistakes of MW) ≤ (# mistakes of best predictor) + [“small amount”]
• regret only logarithmic in # predictors
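In this setting MW becomes the randomized weighted majority algorithm: rows are predictors and the loss M(i, j) is 1 exactly when predictor i errs. A sketch (the array-based interface, with all predictions and outcomes given up front, is my illustrative assumption):

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, eta=0.5):
    """Randomized weighted majority [Littlestone & Warmuth] as MW.
    expert_preds: (T, N) 0/1 predictions of N predictors over T rounds.
    outcomes:     length-T 0/1 correct answers.
    Returns (expected # mistakes of the algorithm, # mistakes of best predictor)."""
    T, N = expert_preds.shape
    w = np.ones(N)
    alg_loss = 0.0
    expert_loss = np.zeros(N)
    for t in range(T):
        P = w / w.sum()
        wrong = (expert_preds[t] != outcomes[t]).astype(float)
        alg_loss += P @ wrong           # prob. a predictor sampled from P errs
        expert_loss += wrong
        w = w * np.exp(-eta * wrong)    # MW update on the mistaken predictors
    return alg_loss, expert_loss.min()
```

With one perfect predictor in the pool, the algorithm’s mistake count stays within a small additive regret of zero, logarithmic in the number of predictors.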
Example: Mind-reading Game [with Freund & Doshi]
• can use to play penny-matching:
  • human and computer each choose a bit
  • if match, computer wins
  • else human wins
• random play wins with 50% probability
• very hard for humans to play randomly
• can do better if can predict what opponent will do
• algorithm: MW with many fixed predictors
  • each prediction based on recent history of plays
  • e.g.: if human played 0 on last two rounds, then predict next play will be 1; else predict next play will be 0
• play at: seed.ucsd.edu/~mindreader
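The pool of fixed history-based predictors can be sketched like this (the window size k = 2 and the default-0 prediction before enough history has accumulated are my assumptions for illustration):

```python
import itertools

def context_predictors(k=2):
    """Pool of fixed predictors for penny-matching: one predictor per
    mapping from the human's last k plays to a predicted next bit,
    giving 2^(2^k) predictors in all (16 for k = 2)."""
    histories = list(itertools.product([0, 1], repeat=k))
    tables = itertools.product([0, 1], repeat=len(histories))
    return [dict(zip(histories, t)) for t in tables]

def predict(predictor, history, k=2):
    """Predictor's guess for the human's next bit given the play history
    (defaults to 0 until k plays have been observed)."""
    if len(history) < k:
        return 0
    return predictor[tuple(history[-k:])]
```

Running MW (as above) over this pool then drives the mistake count toward that of the best single table, which is exactly what punishes rhythmic human play: an alternating human, for instance, is matched perfectly by one of the 16 tables.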
Histogram of Scores
[histogram omitted: number of games vs. score, from −100 to 100, with “human wins” and “computer wins” regions marked]
• score = (rounds won by human) − (rounds won by computer)
• based on 11,882 games
• computer won 86.6% of games
• average score = −41.0
Slower is Better (for People)
[plot omitted: average game duration (seconds) vs. score, from −100 to 50]
• the faster people play, the lower their score
• when play fast, probably fall into rhythmic pattern
  • easy for computer to detect/predict
Application to Apprenticeship Learning
[with Syed]
Learning How to Behave in Complex Environments
• example: learning to “drive” in toy world
  • states: positions of cars
  • actions: steer left/right; speed up; slow down
  • reward: for going fast without crashing or going off road
• in general, called Markov decision process (MDP)
• goal: find policy that maximizes expected long-term reward
  • policy = mapping from states to actions
Unknown Reward Function [Abbeel & Ng]
• when real-valued reward function known, optimal policy can be found using well-established techniques
• however, often reward function not known
• in driving example:
• may know reward “features”:
  • faster is better than slower
  • collisions are bad
  • going off road is bad
• but may not know how to combine these into single real-valued function
• we assume reward is unknown convex combination of known real-valued features
Apprenticeship Learning
• also assume get to observe behavior of “expert”
• goal:
  • imitate expert
  • improve on expert, if possible
Game-theoretic Approach
• want to find policy such that improvement over expert is as large as possible, for all possible reward functions
• can formulate as game:
  • rows ↔ features
  • columns ↔ policies (note: extremely large number of columns)
• optimal (randomized) policy, under our criterion, is exactly maxmin of game
• can solve (approximately) using MW:
  • to pick “best” column on each round, use methods for standard MDPs with known reward
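Under the (big) simplifying assumption that the candidate policies can be enumerated and summarized by their expected feature values, the MW loop looks like the sketch below. A real system would call an MDP planner to find the best-response column; here the code just scans a small given set. The interface and rescaling are mine, for illustration only:

```python
import numpy as np

def apprenticeship_mw(policy_features, expert_features, T=200, eta=0.3):
    """Sketch of the game-theoretic approach with a tiny enumerable policy set.
    policy_features: (k_features, n_policies) expected feature values per policy.
    expert_features: length-k vector of the expert's expected feature values.
    MW runs over rows (features): weight shifts toward features on which the
    chosen policies improve least over the expert; each round's best-response
    column is the policy maximizing the weighted advantage.
    Returns the mixing weights of the resulting mixed policy."""
    G = policy_features - expert_features[:, None]   # advantage over expert
    k, n = G.shape
    lo, hi = G.min(), G.max()                        # rescale losses into [0, 1]
    M = (G - lo) / (hi - lo) if hi > lo else np.zeros_like(G)
    w = np.ones(k) / k
    mix = np.zeros(n)
    for t in range(T):
        j = int(np.argmax(w @ M))        # best-response "policy" (planner stub)
        mix[j] += 1.0
        w = w * np.exp(-eta * M[:, j])   # downweight features already improved
        w /= w.sum()
    return mix / T
```

On a toy instance with one policy that beats the expert on every feature, the returned mixed policy concentrates on it, matching the maxmin of the game.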
Conclusions
• presented MW algorithm for playing repeated games
  • used to prove minmax theorem
  • can use to approximately solve games efficiently
• many diverse applications and insights:
  • boosting: key concepts and algorithms are all game-theoretic
  • on-line learning: in essence, the same problem as boosting
  • apprenticeship learning: new and richer problem, but solvable as a game
References
• Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Ninth Annual Conference on Computational Learning Theory, 1996.
• Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
• Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.
• Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Coming soon (hopefully):
• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2010(?).