

THE LEARNING OF STRATEGIES IN A SIMPLE, TWO-PERSON ZERO-SUM GAME WITHOUT SADDLEPOINT

by John Fox*

University of Michigan, Ann Arbor

Subjects play a 2 × 2 zero-sum game without saddlepoint against a computer program opponent; the computer program either follows its minimax mixed strategy or adopts a (predefined) nonrational mixed strategy. It is found that there is a significant trend in the strategy choice behavior of subjects playing against a rational opponent such that these subjects tend to approach their optimal strategy mixture. Since subjects playing opposite a rational opponent cannot affect the expected outcome of the game, the relationship between subjects' play and the variance and skewness of payoffs (interpreted as components of the riskiness of the game) is explored. Subjects whose opponent is a computer program which plays nonoptimally appear to be able to learn to exploit their opponent's departure from rational play. There is some evidence that subjects may respond to random fluctuations in the computer's play.


INTRODUCTION

The study of conflict of interest has traditionally been of concern to social scientists. The relatively recent mathematical treatment of conflict embodied in the theory of games has provided social scientists with a language for thinking about, and a framework within which to investigate, the properties and determinants of social conflict and cooperation.

It is, therefore, somewhat enigmatic that relatively little experimentation has been done involving perhaps the most basic of conflict situations: the two-person, zero-sum game. Two-person games may be divided into two classes: (1) zero-sum and (2) nonzero-sum games. In a zero-sum game one player's losses are the other player's gains, and vice versa; hence, zero-sum games may be thought to define pure conflict-of-interest situations. In contrast, nonzero-sum (or mixed-motive) games may embody aspects of both conflict and cooperation since payoffs to the players are not constrained by the (zero-sum) principle that one player's gain is the other's loss.

Two-person zero-sum games may be further differentiated into (1) games with saddlepoints and (2) games without saddlepoints.

* I should like to express my gratitude to Dr. Melvin Guyer of the Mental Health Research Institute, and to Prof. Gudmund Iversen of the Department of Sociology, University of Michigan, for their help and advice in the preparation of the present paper. I am also grateful to the University of Michigan Psychology Department Subject Pool for furnishing experimental subjects for this study.



Let G be a matrix representing a two-person zero-sum game, and let the entry g_ij be the payoff to the row chooser R when he chooses row i and his opponent C chooses column j. If there exists a g_ij such that g_ij is simultaneously the minimum entry in row i and the maximum entry in column j, then g_ij is a saddlepoint of game G. If a game G has a saddlepoint g_ij, then R's rational strategy is to choose row i and C's rational strategy is to choose column j.

When a two-person, zero-sum game contains no saddlepoint, it can be shown that a player can best insure himself against (expected) loss by playing a particular random mixture of strategy choices. Such an optimal mixed strategy, often termed a minimax mixed strategy, has properties analogous to those of optimal strategies in strictly determined games (games with saddlepoints).
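As a concrete illustration (a minimal sketch, not part of the original study), the following checks a 2 × 2 payoff matrix for a saddlepoint and, when none exists, computes the row player's minimax mixture and the value of the game from the standard closed-form expressions for 2 × 2 games, the same expressions derived in the Appendix; the function names and the example matrix are illustrative assumptions.

```python
# Minimal sketch: saddlepoint check and minimax mixture for a 2 x 2 zero-sum game.
# Payoffs are to the row player; entries follow the Appendix layout:
#   [[a, b],
#    [c, d]]  with a = payoff for (row 1, col 1), b = (row 1, col 2), etc.

def saddlepoint(g):
    """Return (i, j) of a saddlepoint, or None if the game has none."""
    for i in range(2):
        for j in range(2):
            row_min = min(g[i])
            col_max = max(g[0][j], g[1][j])
            if g[i][j] == row_min == col_max:
                return (i, j)
    return None

def minimax_mixture(g):
    """Row player's optimal mixture (p_row1, p_row2) and game value,
    valid when the 2 x 2 game has no saddlepoint."""
    (a, b), (c, d) = g
    denom = (a + d) - (b + c)
    p_row1 = (d - c) / denom          # probability of choosing row 1
    value = (a * d - b * c) / denom   # expected payoff to the row player
    return (p_row1, 1 - p_row1), value

if __name__ == "__main__":
    example = [[2, -3], [-1, 4]]      # illustrative matrix with no saddlepoint
    print(saddlepoint(example))       # -> None
    print(minimax_mixture(example))   # -> ((0.5, 0.5), 0.5)
```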

It has been shown experimentally (Lieberman, 1960; Brayer, 1964) that subjects playing two-person, zero-sum games with saddlepoints learn to play their minimax strategies when their opponents are playing rationally. How people behave when confronted with two-person, zero-sum games without saddlepoints is somewhat less clear; and it is to this latter subject that the present investigation is addressed.




[Figure: payoff matrix for Game 1; entries not recoverable in this copy.]
GAME 1. E = experimenter; S = subject. Payoffs are what S receives from E. E's minimax mixed strategy is (p(e1) = .25, p(e2) = .75); S's minimax mixed strategy is (p(s1) = .75, p(s2) = .25); E's nonrational play is (.5, .5). [From Lieberman (1962)]


People presumably learn to alter their behavior when rewards are differentially contingent upon behavior. In studying how people learn to play games, game theory would appear to be of value insofar as it enables us to deduce propositions about how the outcome of a game is dependent upon the players' behavior. Hence, for example, we may note that, in a 2 × 2 zero-sum game without saddlepoint, a person playing opposite a rational opponent cannot expect to influence the outcome of the game; this observation leaves us with little basis for predicting how people behave in such situations. More specifically, Lieberman (1962) has discovered that subjects playing a 2 × 2 game (Game 1) against a rational opponent do not tend to approach their optimal mixed strategy. In contrast, subjects playing the same game against an opponent who plays nonoptimally tend to approach the mixed strategy prescribed to them by game theory. This last result is somewhat difficult to interpret: because of the manner in which Lieberman's (1962) study was designed, to approach rationality (i.e., strategy (p(s1) = .75, p(s2) = .25)) from indifference (.5, .5) is also to approach maximal exploitation (1.0, 0.0) of the opponent's departure from his optimal play. It is unclear why a subject playing against a nonrational opponent should prefer his minimax strategy to one which would allow him to exploit his opponent's weakness (although, of course, by departing from minimax a subject exposes himself to possible exploitation by his opponent).

In a study of the behavior of subjects playing a 3 × 3 zero-sum game without saddlepoint against an opponent who plays his minimax strategy, Messick (1967) found that subjects generally did not adopt their optimal mixed strategy. These findings seem to be consistent with Lieberman's (1962) observations.

STUDY DESIGN

Purpose. The present experiment was designed to provide answers to two central questions:

(1) How do people behave when their expected payoff in a game situation is independent of their strategy choice behavior (i.e., when they play a 2 × 2 zero-sum game without saddlepoint against a rational opponent)?2

(2) Do people playing opposite a nonrational opponent learn to exploit their opponent's weakness?3

Subjects. Subjects were 32 University of Michigan undergraduates registered in introductory psychology courses, who, in order to fulfill a course requirement, volunteered to participate as unpaid subjects in psychological experiments.

Procedure. Subjects played a 2 × 2 zero-sum game without saddlepoint against a computer program opponent. Subjects were instructed that success in the game was an index of intelligence and that the computer was programmed to try to make them lose as much as possible. Payoffs were represented in imaginary dollars and each subject began the game with a "bankroll" of $100. Subjects recorded their choices and received information at a teletype connected to the computer. Each subject played 200 trials, a typical session taking under one hour. Subjects learned the outcome of each play immediately after choosing and, after every 20 plays, received information about their performance over the last 20 trials.

2 See discussion above, Lieberman (1962), and Messick (1967).

3 We should point out the fact that the experiment has been designed so that to exploit the computer's departure from rational play, subjects must move away from their minimax strategy; i.e., the subjects' optimal exploiting strategy and their minimax strategy are divergent. Hence we may interpret movement toward exploitation (or, toward minimax) in a relatively unambiguous manner. In Lieberman's (1962) experiment, movement toward minimax and movement toward maximal exploitation are largely confounded.





Experimental conditions. Subjects were assigned to one of four experimental conditions, eight subjects in each condition: (1) Rational opponent, Low feedback (RL); (2) Rational opponent, High feedback (RH); (3) Nonrational opponent, Low feedback (NL); and (4) Nonrational opponent, High feedback (NH). In the Rational opponent conditions (RL and RH), subjects played against a computer program which played its minimax strategy. Subjects in the Nonrational opponent conditions (NL and NH) played against a computer program which departed from rational play. Low feedback subjects (RL and NL) were told every 20 trials how much money they won (or lost) over the last 20 trials. In the High feedback conditions (RH and NH), subjects were additionally told every 20 trials how many times they chose each alternative over the past 20 plays.

Game. The game played by the subjects is shown below as Game 2.

PLAYING AGAINST A RATIONAL OPPONENT

Results

(1) Check on randomness of computer's play. The series of pseudorandom numbers used by the computer program to make its choices was checked for apparent randomness before the experiment was conducted.4 We can, perhaps more subtly, check the apparent randomness of the computer's play by regressing subjects' winnings over 20-trial blocks on the proportion of b1 choices (i.e., the subjects' strategies) computed on the same units (20-trial blocks). We should expect there to be no (statistically significant) relationship between a subject's play and his winnings; additionally, the regression line should intersect the Y-axis at a value not significantly different from the per-trial value of the game times the blocking factor (-$0.2857 × 20 = -$5.714). These expectations are, indeed, borne out by the data.5


4 See Naylor, Balintfy, Burdick, and Chu (1966) for a detailed description of methods of checking pseudorandom number generators for apparent randomness.

5 There is no significant relationship between strategy and payoff for RL and RH group subjects: r = .0617, p = .41. Additionally, the regression line intersects the Y-axis at -10.4554, which does not differ significantly from the expected a of -5.714 (p = .35).
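A minimal sketch of this check, assuming the per-block winnings and per-block proportions of b1 choices are available as arrays; the variable names and the numbers below are placeholders, not the experimental data.

```python
# Sketch of the randomness check described above: regress winnings per 20-trial
# block on the proportion of b1 choices in the same block.  Against a minimax
# opponent we expect slope ~ 0 and intercept ~ 20 * (game value) = -5.714.
import numpy as np
from scipy import stats

# Illustrative placeholder data (one row per 20-trial block, pooled over subjects).
p_b1     = np.array([0.30, 0.25, 0.20, 0.15, 0.25, 0.20, 0.10, 0.15, 0.20, 0.25])
winnings = np.array([-7.0, -4.0, -6.0, -5.0, -8.0, -3.0, -6.0, -5.0, -7.0, -4.0])

fit = stats.linregress(p_b1, winnings)
print(f"slope = {fit.slope:.3f} (expect ~0), p = {fit.pvalue:.2f}")
print(f"intercept = {fit.intercept:.3f} (expect ~ -5.714)")
```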

[Figure: payoff matrix for Game 2; entries not legible in this copy.]
GAME 2. A = computer; B = subject. Payoffs are what B receives from A. A's minimax mixed strategy is (p(a1) = .42857, p(a2) = .57143); B's minimax mixed strategy is (p(b1) = .21429, p(b2) = .78571); A's nonrational play is (.6, .4).
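The payoff entries of Game 2 are not legible in this copy. The entries used below are a reconstruction inferred from the Appendix (the squared entries 36, 25, 4, and 1 in equation (8)) together with the stated minimax mixtures and the per-trial value of -$0.2857, and should be read as an inference rather than as figures taken from the original matrix. The short check below confirms that the reconstructed entries reproduce the quantities stated in the caption.

```python
# Reconstructed Game 2 (payoffs to the subject B, in imaginary dollars), inferred
# from the Appendix rather than copied from the original figure:
#            a1   a2
#   b1        6   -5
#   b2       -2    1
a, b, c, d = 6, -5, -2, 1
denom = (a + d) - (b + c)                 # = 14

p_a1 = (d - b) / denom                    # computer's minimax p(a1)
p_b1 = (d - c) / denom                    # subject's minimax p(b1)
value = (a * d - b * c) / denom           # expected payoff to B per trial

print(p_a1, p_b1, value)                  # 0.42857..., 0.21428..., -0.28571...
```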


(2) Subjects' Play. Rapoport and Orwant (1962) have indicated that available data show "that subjects do not perceive and/or play a mixed minimax strategy in 2-person games with no saddle point." We have already noted that Lieberman (1962) found that subjects in his experiment who played against a rational opponent failed to approach their minimax strategy with repeated plays of the game. It is therefore quite surprising that our data show a significant trend toward decreased selection of alternative b1 as the game progresses: i.e., a trend toward playing closer to the subjects' optimal mixed strategy (see Fig. 1 and Table 1).

Although the main effect of time and the linear trend over time are both significant at the .01 level, there is no significant difference between the two feedback groups (RL and RH); further, there is no significant interaction between time and level of feedback.
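One simple way to check for such a trend (a cruder test than the repeated-measures decomposition reported in Table 1, offered only as an illustration) is to fit each subject's proportion of b1 choices against block number and test whether the mean slope differs from zero; the data below are placeholders.

```python
# Simple linear-trend check: one slope per subject, then a one-sample t test.
# 'choices' is a (subjects x 10 blocks) array of per-block p(b1); placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
blocks = np.arange(1, 11)
choices = np.clip(0.35 - 0.015 * blocks + rng.normal(0, 0.05, (16, 10)), 0, 1)

slopes = np.array([np.polyfit(blocks, subj, 1)[0] for subj in choices])
t, p = stats.ttest_1samp(slopes, 0.0)
print(f"mean slope = {slopes.mean():.4f}, t = {t:.2f}, p = {p:.4f}")
```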

Discussion




[Fig. 1: proportion of b1 choices for RL and RH group subjects over time (blocked at 20 trials), plotted with the computer's p(a1) and the subjects' minimax strategy for reference. Plotted values not recoverable in this copy.]

The finding that subjects in the RL and RH conditions tended to play closer to their rational strategy as the game progressed is quite puzzling. In light of previous experimental findings (Lieberman, 1962; Messick, 1967) it seems likely that this trend is an artifact of the present experimental situation.6 On the other hand, it is difficult to isolate an aspect of the present experiment that would give rise to such behavior.

Let us begin by asking what parameters, other than expected payoff, characterize a game situation and proceed to discuss how these are related to subjects' strategies. Coombs and Pruitt (1960) have discussed three components of risk in decision making problems under risk: (1) expected value; (2) variance (dispersion of outcomes); and (3) skewness (odds for and against). Although discussions of decision making and risk are typically cast in the context of games against nature (Edwards, 1954; Luce & Raiffa, 1957; Coombs & Pruitt, 1960; Coombs & Meyer, 1969), as, for example, in a coin toss game, the concept of risk (at least as a partial function of variance of outcomes) would appear to be applicable to games against an opponent who makes strategic choices.

6 See Lieberman (1962) and our subsequent discussion of subjects' responsiveness to random variations in the computer's play.



In the Appendix we demonstrate that although a person playing a 2 × 2 zero-sum game against a rational opponent cannot influence the expected outcome of the game, there exists a linear relationship between the variance of the probability distribution over possible outcomes and the probability that one of the alternatives (say, b1) is chosen. Hence (if the slope of this line is not zero), in order to minimize this variance it is necessary to choose b1 either all of the time, or none of the time.




TABLE 1
ANALYSIS OF VARIANCE WITH TEST FOR TREND, FOR RL AND RH GROUP SUBJECTS (REPEATED MEASURES MODEL)

Source                                    df    Mean Square     F        p <
Between subjects                          15
  F*                                       1    .04389062       1.135    ns
  Subjects within groups                  14    .03865848
Within subjects                          144
  T†                                       9    .06257808       5.288    .01
  T (linear)                               1    .2829204       15.068    .01
  FT                                       9    .00711979       0.602    ns
  T × subjects within groups             126    .01183306
  T × subjects within groups (linear)     14    .01877566

* Feedback level.
† Time (blocked at 20 trials).

In the present case we should expect a positive linear relationship between p(b1) and the degree of dispersion of actual outcomes (see Appendix). Looking at Table 2 we discover that the observed relationship between strategy and dispersal of outcomes does not depart significantly from our expectations. It should, as a cautionary note, be added that the relationship between strategy and outcome variance is not very strong (r = .1481) and, in fact, is not significantly different from zero (.10 > p > .05). Furthermore, in order to explain the trend in subjects' strategic behavior in terms of risk reduction we should need to assume that subjects were generally motivated to minimize (or, at least, to reduce) risk, a not altogether tenable position to adopt, since it has been shown (Edwards, 1954; Coombs & Pruitt, 1960; Coombs & Meyer, 1969) that levels of risk preference are idiosyncratic traits.8 Zero-sum games against rational opponents might, however, be fruitful contexts within which to investigate the effects of risk on strategic behavior.

We may view Game 2 (for the RL and RH conditions) as a choice between two bets with the same expected value. We have noted that bet b1 is associated with greater variance of outcomes than bet b2 and, as a consequence, choice of b1 involves greater risk. It should be mentioned, if only in passing, that the odds for a positive payoff are more favorable in b2 than in b1 when the computer plays minimax; hence we discover that choice of b2 minimizes risk according to the skewness criterion (Coombs & Pruitt, 1960).
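Using the reconstructed Game 2 entries introduced earlier (an inference from the Appendix, not values taken from the original figure), the variance and the probability of a positive payoff for the two pure choices, when the computer plays its minimax mixture, work out as follows.

```python
# Risk components of the two pure bets in (reconstructed) Game 2 when the
# computer plays its minimax mixture p(a1) = 6/14.  b1 pays 6 or -5; b2 pays -2 or 1.
p_a1 = 6 / 14

def bet_stats(win_if_a1, win_if_a2):
    mean = p_a1 * win_if_a1 + (1 - p_a1) * win_if_a2
    var = p_a1 * win_if_a1**2 + (1 - p_a1) * win_if_a2**2 - mean**2
    p_positive = p_a1 * (win_if_a1 > 0) + (1 - p_a1) * (win_if_a2 > 0)
    return mean, var, p_positive

print(bet_stats(6, -5))   # b1: mean -0.286, variance ~29.63, P(positive payoff) ~0.429
print(bet_stats(-2, 1))   # b2: mean -0.286, variance ~ 2.20, P(positive payoff) ~0.571
```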

TABLE 2
DISPERSION OF OUTCOMES* AS A FUNCTION OF STRATEGY† FOR RL AND RH GROUP SUBJECTS

                    Observed    Expected‡    t        p      t (vs. 0)    p
b (slope)            449.309     548.571    -.4159    .67     1.8827      .06
a (intercept)         85.960      44.082     .3664    .71      .7521      .45

r = .1481

* Dispersion of outcomes measured by the squared residuals from the regression of payoff on strategy (see Table 1).
† p(b1) computed over 20 trial blocks.
‡ See Appendix.

We should consider the possibility that differences between the experimental procedures used by Lieberman (1962) and those employed in the present study contributed to differences in results. Since all between-experiment differences are completely confounded with each other, any attempt to ascribe dissimilar results to design features remains speculative.

One of the most striking differences between the two experimental procedures was the manner in which subjects were furnished information about their performance during the course of the game. In Lieberman's (1962) study, subjects received or surrendered chips (representing money) after each play of the game, and could, therefore, if they wished, keep track of their winnings or losses. No other systematic information about the course of the game was provided by the experimenter. In contrast, in the present experiment, subjects had a written record of past plays (to which many referred), cumulative reminders of their winnings and losses (every 20 plays), and, for RH subjects, information about their mixture of strategy choices. (Since level of feedback has been shown to have no significant effect, we may discount this last factor.) It is conceivable that the richer informational situation and relative standardization of feedback which characterize the present experiment are reflected in the trend we have noted in the subjects' strategy choice behavior.


8 It would, perhaps, be reasonable to speculate that subjects were motivated to decrease risk because they were losing, i.e., the value of the game to subjects is negative.





PLAYING AGAINST A NONRATIONAL OPPONENT

Subjects in the NL and NH experimental groups played Game 2 against a computer program employing the nonoptimal mixed strategy (p(a1) = .6, p(a2) = .4). In order to maximally exploit their opponent's departure from rational play, subjects should have adopted the strategy (p(b1) = 1.0, p(b2) = 0.0). Therefore, any movement from indifference between the two alternatives (i.e., strategy (.5, .5)) toward maximal exploitation is opposite in direction to movement toward the subjects' minimax strategy (.21429, .78571). This opposition becomes, perhaps, more meaningful when we recall that RL and RH group subjects tended to play choice b1 less frequently as the game progressed.
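The claim that all-b1 play is maximally exploiting can be checked directly: against the fixed mixture (.6, .4) the expected payoff is linear in p(b1), so the pure choice with the higher expectation should be played all of the time. With the reconstructed entries used earlier (again an inference from the Appendix, not values taken from the original figure):

```python
# Expected per-trial payoff of each pure choice against the nonrational mixture (.6, .4),
# using the reconstructed Game 2 entries (b1: 6 or -5; b2: -2 or 1).
p_a1 = 0.6
ev_b1 = p_a1 * 6 + (1 - p_a1) * (-5)    # = 1.6
ev_b2 = p_a1 * (-2) + (1 - p_a1) * 1    # = -0.8

# The expected payoff of a mixture x = p(b1) is x*ev_b1 + (1-x)*ev_b2, increasing in x,
# so x = 1.0 (always b1) is the maximally exploiting strategy.
print(ev_b1, ev_b2)
```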


Results

(1) Check on randomness of computer's play.

The assumption that the computer plays the random strategy (.6, .4) allows us to express the expected winnings of a subject (over a fixed number of trials) as a linear function of his strategy, i.e., the proportion of the time he plays one alternative or the other. If we calculate expected winnings (for 20-trial blocks, for all NL and NH group subjects) and regress this variable on actual winnings (again, computed for 20-trial block units) we should anticipate that, first, the slope of the regression line, b, should not differ significantly from 1.0, and, secondly, the y-intercept, a, of this regression line should not be significantly different from 0.0. The data confirm these expectations, allowing us to state with some confidence that subjects were unable to exploit any perceived (or, perhaps we should wish to say, imagined) patterns in the computer's play.9
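A minimal sketch of this second check, assuming per-block arrays are at hand; the variable names and numbers are placeholders, and the per-trial expectations come from the reconstructed entries used earlier.

```python
# For each 20-trial block, expected winnings are linear in that block's p(b1):
# 20 * (p_b1 * ev_b1 + (1 - p_b1) * ev_b2), with ev_b1 and ev_b2 as computed above.
# Regressing expected winnings on actual winnings (as described in the text)
# should give slope ~1 and intercept ~0.
import numpy as np
from scipy import stats

ev_b1, ev_b2 = 1.6, -0.8                          # reconstructed per-trial expectations
p_b1   = np.array([0.45, 0.55, 0.60, 0.70, 0.75, 0.80, 0.85, 0.80, 0.75, 0.70])
actual = np.array([ 8.0, 12.0, 15.0, 19.0, 22.0, 25.0, 28.0, 24.0, 21.0, 18.0])

expected = 20 * (p_b1 * ev_b1 + (1 - p_b1) * ev_b2)
fit = stats.linregress(actual, expected)          # expected regressed on actual
print(f"b = {fit.slope:.3f} (expect ~1), a = {fit.intercept:.3f} (expect ~0)")
```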

[Fig. 2. Performance of NL and NH group subjects: proportion of b1 choices over time (blocked at 20 trials). Plotted values not recoverable in this copy.]




TABLE 3
ANALYSIS OF VARIANCE WITH TESTS FOR TRENDS, FOR NL AND NH GROUP SUBJECTS (REPEATED MEASURES MODEL)

Source                                        df    Mean Square     F         p <
Between subjects                              15
  F*                                           1    .08556253        0.338    ns
  Subjects within groups                      14    .2530982
Within subjects                              144
  T†                                           9    .1061528        10.854    .01
  T (linear)                                   1    .7116837        22.656    .01
  T (quadratic)                                1    .1559766        24.426    .01
  FT                                           9    .02809722        2.873    .01
  FT (linear)                                  1    .01092658        1.303    ns
  FT (quadratic)                               1    .08316304        9.891    .01
  T × subjects within groups                 126    .009779736
  T × subjects within groups (linear)         14    .03141211
  T × subjects within groups (quadratic)      14    .006385788

* Feedback level.
† Time (blocked at 20 trials).


(2) Subjects' performance. As the game progressed, subjects in the NL and NH groups tended generally to play choice b1 an increasing proportion of the time (see Fig. 2 and Table 3).

As we have previously observed for RL and RH group subjects, the main effect of feedback condition is not statistically significant. The main effect of time and the linear trend over time are both significant at the .01 level. The entire story is, however, more complex. First, there is a significant interaction between time and level of feedback. Secondly, there is a statistically significant quadratic trend over time, beyond the simpler linear effect. We may note, further, that the time × feedback interaction reflects differences in the quadratic profiles of the two feedback groups (NL and NH) while the linear profiles do not differ significantly.

Discussion

The finding that NL and NH group subjects tended to play alternative b1 more frequently as the game progressed (increasingly exploiting their opponent's departure from minimax) is all the more striking in light of the behavior of the RL and RH group subjects, who, it will be remembered, tended to play b1 less frequently with time.

9 The observed b value is .9687; the observed a is -.6114. Neither coefficient deviates significantly from its predicted value (p = .08 and p = .31, respectively).


We should not, however, fail to note that the performance of the NH group tended to fall off somewhat toward the end of the game. In contrast, the trend in performance of the NL group was more nearly monotonic. These factors give rise to the finding that although there was no significant main effect associated with level of feedback, the quadratic profiles of the NL and NH groups were different. The following conjecture is offered as an explanation of these phenomena: By the sixth 20-trial block, the high feedback (NH) subjects were playing alternative b1 nearly 85 percent of the time and, consequently, their expected winnings were quite high. It is likely that these subjects, at this point in the game, were able to feel that they had achieved a satisfactory (if not optimal) mastery of the task. Boredom over the remaining 80 trials might have given rise to the worsening of these subjects' performance. The low feedback (NL) subjects, on the other hand, never reached the level of performance that the high feedback subjects attained by the sixth 20-trial block. Accordingly, the performance of the NL subjects shows no regular falloff toward the end of the game.

Although the results of the foregoing analysis of variance appear to be largely consistent with the hypothesis that subjects are able to learn to exploit the weakness of a nonrational opponent, it would seem to be desirable to put this hypothesis to a more direct test. If subjects' learning is indeed motivated by the statistical relationship of strategy to payoff, it would seem to follow that the strength of this association should be related to extent of learning.10

10 Put another way, since the relationship between strategy (p(b1)) and payoff is probabilistic in nature, we should expect variance in the subject-to-subject strengths of association between these two variables. Pursuing this line of reasoning, the stronger the relationship between the manner in which a subject plays and the magnitude of his payoffs, the more he should be influenced to alter his strategic behavior so as to increase his expected winnings.




The within-subject correlation coefficient, r_sp, has been employed as an index of the degree of association between strategy, i.e., p(b1), and payoff; the extent of a subject's learning over the course of the game was measured by the slope of the (within-subject) regression line of strategy on time, b_st; i.e., the steeper the slope of this regression line, the greater the improvement in the subject's performance.11 These two variables, r_sp and b_st, are significantly related (r = .5743, p < .05), lending further credibility to the hypothesis that learning has taken place. We note that introducing level of feedback into the regression equation results in an increase in predictive power (R = .6519, p < .05).
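A sketch of how the two per-subject indices might be computed from block-level records; the arrays are placeholders, and the labels r_sp and b_st follow the reading adopted above (the original subscripts are partly illegible).

```python
# Per-subject indices: r_sp = within-subject correlation between block p(b1) and
# block payoff; b_st = slope of block p(b1) regressed on block number.
# 'p_b1' and 'payoff' are (subjects x 10 blocks) arrays; placeholder data below.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
blocks = np.arange(1, 11)
p_b1   = np.clip(0.5 + 0.03 * blocks + rng.normal(0, 0.05, (16, 10)), 0, 1)
payoff = 20 * (2.4 * p_b1 - 0.8) + rng.normal(0, 3, (16, 10))

r_sp = np.array([stats.pearsonr(s, w)[0] for s, w in zip(p_b1, payoff)])
b_st = np.array([np.polyfit(blocks, s, 1)[0] for s in p_b1])

# Across subjects: does strength of the strategy-payoff association predict learning?
print(stats.pearsonr(r_sp, b_st))
```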

SUBJECTS' RESPONSIVENESS TO RANDOM FLUCTUATIONS IN THE COMPUTER'S PLAY

Lieberman (1962) has advanced the hypothesis that subjects playing a zero-sum game without saddlepoint against an opponent who employs a randomized strategy will respond to chance fluctuations in their opponent's play. An examination of Lieberman's data tends to confirm this responsiveness hypothesis; additionally, Lieberman discovered that subjects playing against an opponent who plays nonoptimally seem to be more responsive to variations in their opponent's play than those subjects who play against an opponent who follows a minimax mixed strategy.

In Figs. 1 and 2, describing the behavior of subjects in the present study, the computer's play is graphed along with subjects' performance. In order to test the utility of the responsiveness hypothesis in explaining the performance of subjects in the present experiment, the relationship between the computer's play, i.e., p(a1), and subjects' play (p(b1)) was calculated: (1) over 20-trial blocks t = 1 to t = 10, measuring both p(a1) and p(b1) at time t; and (2) over blocks t = 2 to t = 10, measuring p(a1) at time t - 1 and p(b1) at time t (so as to discover any delayed subject response to fluctuations in the computer's play). In general, the data do not seem to lend strong confirmation to the responsiveness hypothesis (see Table 4). However, subjects in the RL group seem to respond somewhat to variations in the computer's play, and subjects in the NH condition appear to manifest a delayed response effect.

11 This measure, i.e., b_st, is highly related (r = .9366) to a measure of improvement over time defined by the difference between the mean (intrasubject) strategy for the second half of the game and that for the first half of the game.

TABLE 4
RESPONSIVENESS OF SUBJECTS TO RANDOM VARIATIONS IN THE COMPUTER'S PLAY

Group    r (same block)*    p      r (lagged)†    p
RL            .3214         .01       .0071       ns
RH            .1975         ns        .1429       ns
NL           -.1200         ns       -.0897       ns
NH           -.1140         ns        .2981       .05

* Correlation between computer's strategy (p(a1)) and subjects' strategy (p(b1)) over 20-trial blocks.
† Correlation between computer's strategy at block t - 1 and subjects' strategy at block t, over 20-trial blocks 2 to 10.
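A sketch of the two correlations reported in Table 4, computed from block-level series for one group; the series below are placeholders, not the experimental data.

```python
# Same-block and lagged correlations between the computer's p(a1) and the
# subjects' p(b1), per 20-trial block, as in Table 4.  Placeholder series.
import numpy as np
from scipy import stats

computer_p_a1 = np.array([0.45, 0.40, 0.50, 0.35, 0.45, 0.40, 0.50, 0.45, 0.40, 0.45])
subject_p_b1  = np.array([0.30, 0.25, 0.30, 0.20, 0.25, 0.20, 0.25, 0.20, 0.15, 0.20])

same_block = stats.pearsonr(computer_p_a1, subject_p_b1)          # blocks 1 to 10
lagged     = stats.pearsonr(computer_p_a1[:-1], subject_p_b1[1:]) # block t-1 vs block t
print(same_block, lagged)
```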


It is probably reasonable to conclude that some of the variance in subjects' strategic behavior can be ascribed to responsiveness to chance fluctuations in their opponent's play. This responsiveness explanation and the hypothesis that subjects can learn to exploit their opponent's departure from rationality (or that subjects may act so as to decrease risk when they play opposite a rational opponent) do not, however, appear to be mutually exclusive. Indeed, it would probably be worthwhile to design future experiments so as to be able to separate the effects of learning and responsiveness to random fluctuations.

SUMMARY AND CONCLUSIONS

Experimental evidence supporting the hypothesis that subjects can learn (over repeated plays of a simple, two-person, zero-sum game without saddlepoint) to exploit an opponent's departure from rationality has been presented. The behavior of these subjects has been shown to generally conform to what is usually meant by learning, i.e., a modification of behavior contingent on differential rewards.

It is more difficult to interpret or explain the behavior of subjects playing against an opponent who follows an optimal mixed strategy. Over repeated plays, these subjects came to play closer to their minimax strategy; that is, they played alternative b1 a decreasing proportion of the time. This last finding runs counter to the results of previous studies (Lieberman, 1962; Messick, 1967). Since subjects in the RL and RH conditions could not influence their expected payoff, the effects of subjects' strategy choice behavior on the variance and skewness of outcomes (interpreted as components of risk) were explored; it was suggested that in playing choice b1 a decreasing proportion of the time, subjects in the RL and RH groups may have been motivated by a desire to reduce the riskiness of the game.





Finally, the hypothesis that subjects (playing a zero-sum game without saddlepoint against an opponent who adopts a mixed strategy) will respond to chance fluctuations in their opponent's play has been tested for all experimental groups. The data give some support to this responsiveness hypothesis.

APPENDIX

In this Appendix we derive an expression for the variance of payoffs to a player (in a 2 × 2 zero-sum game without saddlepoint) when the player's opponent plays his minimax mixed strategy. We find that this expression for the variance of a player's payoffs is a linear function of the proportion of the time he chooses one alternative (or the other).

Let us begin by demonstrating that when a player’s opponent plays rationally, the expected outcome of the game is independent of the player’s strategy choice behavior; we derive an expression for the expected value of a game in the following manner:

Let Game 3 represent the generalized 2 × 2 zero-sum game, and let P(b1) = x and P(a1) = y. Then A's rational strategy is y = (d - b)/[(a + d) - (b + c)].12 And the value of the game,

(1)  E($) = x((d - b)/[(a + d) - (b + c)])a + x(1 - (d - b)/[(a + d) - (b + c)])b + (1 - x)((d - b)/[(a + d) - (b + c)])c + (1 - x)(1 - (d - b)/[(a + d) - (b + c)])d

(2)  = x(a(d - b) + b(a - c))/[(a + d) - (b + c)] + (1 - x)(c(d - b) + d(a - c))/[(a + d) - (b + c)]

(3)  = x(ad - bc)/[(a + d) - (b + c)] + (1 - x)(ad - bc)/[(a + d) - (b + c)]

(4)  = (ad - bc)/[(a + d) - (b + c)]

Hence, no matter how B plays, E($) is unchanged. Let us proceed to derive an expression for the variance of outcomes in Game 3:

(5)  V($) = E($²) - [E($)]²

12 See Rapoport (1966, pp. 78-81) for derivation.

GAME 3. Generalized 2 × 2 zero-sum game. The payoff to B is a when (b1, a1), b when (b1, a2), c when (b2, a1), and d when (b2, a2).

(6)  = xy·a² + x(1 - y)·b² + (1 - x)y·c² + (1 - x)(1 - y)·d² - [E($)]²

(7)  = x[y(a² - b² - c² + d²) - d² + b²] + y(c² - d²) + d² - [E($)]²

Since we assume that A plays his minimax strategy, y is a constant; further, when A plays rationally E($) is independent of x, according to (4) above. Hence, for a given game, (7) defines a linear equation in x.

For Game 2, (7) becomes:

(8)  V($) = 20x[(6/14)(36 - 25 - 4 + 1) - 1 + 25] + 20[(6/14)(4 - 1) + 1 - (-.2857)²]

(9)  = 548.571x + 44.082

when outcomes are blocked at 20 plays.
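A symbolic check of the Appendix's two central claims, that E($) does not depend on x when y is the minimax mixture and that V($) is linear in x, followed by substitution of the reconstructed Game 2 entries (again an inference from equation (8), not values taken from the original figure):

```python
# Symbolic verification of (4) and (7), then substitution of the reconstructed
# Game 2 entries (a, b, c, d) = (6, -5, -2, 1) to recover the 20-trial block line (9).
import sympy as sp

a, b, c, d, x = sp.symbols('a b c d x')
y = (d - b) / ((a + d) - (b + c))                        # A's minimax p(a1)

E = x*y*a + x*(1 - y)*b + (1 - x)*y*c + (1 - x)*(1 - y)*d
print(sp.simplify(E))                                    # (a*d - b*c)/(a + d - b - c): no x remains

V = x*y*a**2 + x*(1 - y)*b**2 + (1 - x)*y*c**2 + (1 - x)*(1 - y)*d**2 - E**2
V_game2 = sp.expand(20 * V.subs({a: 6, b: -5, c: -2, d: 1}))
print(V_game2)                                           # 3840*x/7 + 2160/49, i.e., about 548.571x + 44.082
```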

REFERENCES

Brayer, A. R. An experimental analysis of some variables of minimax theory. Behav. Sci., 1964, 9, 33-44.

Coombs, C. H., & Meyer, D. E. Risk-preference in coin-toss games. J. math. Psychol., 1969, 6, 514-527.

Coombs, C. H., & Pruitt, D. G. Components of risk in decision making: Probability and variance preferences. J. exper. Psychol., 1960, 60, 265-277.

Edwards, W. Variance preferences in gambling. Amer. J. Psychol., 1954, 67, 441-452.

Lieberman, B. Human behavior in a strictly determined 3 × 3 matrix game. Behav. Sci., 1960, 5, 317-322.

Lieberman, B. Experimental studies of conflict in some two and three person games. In J. H. Criswell, H. Solomon, & P. Suppes (Eds.), Mathematical methods in small group processes. Stanford: Stanford Univ. Press, 1962, pp. 203-220.

Luce, R. D., & Raiffa, H. Games and decisions: Introduction and critical survey. New York: Wiley, 1957.

Messick, D. M. Interdependent decision strategies in zero-sum games: A computer-controlled study. Behav. Sci., 1967, 12, 33-48.

Naylor, T. H., Balintfy, J. L., Burdick, D. S., & Chu, K. Computer simulation techniques. New York: Wiley, 1966.

Rapoport, A. Two-person game theory: The essential ideas. Ann Arbor, Mich.: Univ. Michigan Press, 1966.

Rapoport, A., & Orwant, C. Experimental games: A review. Behav. Sci., 1962, 7, 1-37.

(Manuscript received January 4, 1972)