
Computational & Mathematical Organization Theory 2:2 (1996): 139-162 © 1996 Kluwer Academic Publishers, Manufactured in the Netherlands

The Principal-Agent Problem with Evolutionary Learning 1

DAVID ROSE Defense Resource Management Institute, Naval Postgraduate School, Monterey, CA 93943-5000, email: [email protected]

THOMAS R. WILLEMAIN Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180, email: [email protected]

Abstract

We extend agency theory to incorporate bounded rationality of both principals and agents. In this study we define a simple version of the principal-agent game and examine it using object-oriented computer simulation. Player learning is simulated with a genetic algorithm model. Our results show that players of incentive games in highly uncertain environments may take on defensive strategies. These defensive strategies lead to equilibria which are inferior to Nash equilibria. If agents are risk averse, the principal may not be able to provide enough monetary compensation to encourage them to take risks. But principals may be able to improve system performance by identifying good performers and facilitating information exchange among agents.

Keywords: Agency theory; bounded rationality; computer simulation; incentives; game theory; genetic algorithms.

1 Introduction

The principal-agent model has become a widely accepted model of incentives as a means of organizational control (Eisenhardt 1985, 1989; Coleman 1990). In this model, the principal contracts with an agent to perform a certain task. The principal pays the agent based on the outcome of the task. This outcome depends both on the agent's effort and a random variable that represents a noisy environment.

A key assumption in the traditional principal-agent literature is player rationality. Players are assumed to know their own and each others' utilities and the probabilities of all states of nature. Given this knowledge, the principal can formulate and solve optimization models yielding the least-cost contract to induce cooperation from the agent.

In this study we assume that players of an incentive game are boundedly rational. They have no immediate knowledge of each other or the environment; instead, they must gain such information through repeated play. We assume that players are members of a larger population and learn through an evolutionary process. This is similar to viewing organizations as ecosystems (Morgan 1986) or open systems (Scott 1981).

1The authors would like to thank the anonymous referees for their helpful suggestions.


Under this assumption, we show that when the environment is highly uncertain, it is difficult to induce cooperation in risk averse agents using strictly monetary means. However, the principal may be able to improve system performance by using an artificial selection procedure to identify high performing agents.

The Story in Brief

Section 2.0 gives a short description of the principal-agent problem. The analysis assumes both principal and agents know all utility functions and relevant probabilities on all states of nature. With these assumptions, the principal can design an incentive scheme by formulating and solving a mathematical optimization problem. The resulting incentive scheme should induce agents to perform at the desired effort level. In reality, incentive designers do not have this knowledge. They wish to improve the quality of output of their agents, but do not have the information to perform a meaningful optimization process. Instead, the principals design their incentives and agents choose their actions under conditions of bounded rationality.

The section goes on to describe the literature on evolutionary models of organizational adaptation. In this literature, organizations or firms are portrayed as members of a larger ecosystem. Organizations introduce variation into their operating routines by innovation. High-performing routines survive by achieving a good fit to the environment; this is similar to selection in biological evolution. The high-performing routines are then copied, either by the organization making copies of itself or by being copied by low-performing firms, as a form of preservation.

In section 3.0 we describe a simple incentive game and analyze it using standard decision-theoretic and game-theoretic tools. But such analyses assume that the players directly know the game structure. We are interested in the case in which they must induce the structure and the correct strategies through repeated play and learning. We model player learning using computer simulation. This relates our study to other work modeling organizational change and learning through computer simulation (Axelrod 1984; Lant and Mezias 1990; Crowston 1994; Glance and Huberman 1994). While that work centers on change that is internally driven or spontaneous, our work examines change induced and encouraged by an external administrator or organization through the use of incentives.

In section 4.0 we describe the detailed learning model used for our players, based on the genetic algorithm (Holland 1975). In section 5.0 we describe results of simulation runs under a variety of conditions. In the simulations, player behaviors are what might be expected from theoretical analysis except for the situation of risk averse agents in a high uncertainty environment. Agents in such situations take on a defensive strategy, responding with low effort to any incentive structure offered. This can be remedied by the principal identifying high-performing agents and facilitating information transfer among agents; this might be considered an artificial selection procedure on the part of the principal. In section 6.0 we outline our conclusions and promising avenues for future research.


2.0 Background

2.1 Incentives and the Principal-Agent Problem

The basic behavioral assumptions in the principal-agent problem are (Baiman 1982; Eisenhardt 1989)2:

• The agent's actions are not programmable or observable. This means the contract must be based on outcomes and not on the agent's actions. Outcomes depend both on the agent's actions and random environmental events. This complicates the design of the payment scheme. Uncertainty makes it difficult for the principal to infer agent actions based solely on outcomes. Environmental uncertainty also introduces the problem of risk.

• Outcomes are observable and verifiable. It is difficult to base an enforceable contract on outcomes which are difficult to witness or subject to interpretation.

• The agent (and perhaps the principal) is risk averse. This means the contract must take into account the amount of risk the agent is willing to bear.

• The agent has goal conflicts with the principal. This means the contract must align the agent's goals with the principal's. We assume that both principal and agent act in self-interest. Each player maximizes his own expected utility regardless of the consequence to the other player. Each player expects the other to act in self-interest as well.

• Both principal and agent are rational. Each chooses behavior to maximize expected utility. Each knows his own and the other player's utilities on outcomes. Both players know all possible states of nature and know the same (perhaps subjective) probabilities of those states. Both can costlessly perform all necessary computations for the maximization.

A solution to the basic agency model consists of:

• An employment contract, which is the principal's choice of payment schedule along with the agent's specification of how he will act (what effort level he will choose).

• The agent's actual action. Under optimal conditions the principal will choose the payment schedule in such a way that the agent's specified action and the agent's actual action are the same.

The incentive problem as described above fits the definition of a game. The Nash equilibrium solution of the game can be computed by solving the principal's problem (Holmstrom 1979).

2To clarify the exposition, we will refer to the principal as a "her" and the agent as a "him" for the remainder of the paper.


Let

x = agent's output
I(·) = the payment schedule to the agent
e = agent's effort
f(x|e) = the probability density of output x given that the agent has chosen effort level e
G(s) = principal's utility for net income s
U(s) = agent's utility for income s
V(e) = agent's disutility for choosing effort level e
K = agent's reservation utility (the utility he can get by working elsewhere)

Based on the assumptions given above, the principal's problem becomes

$$\max_{I(\cdot),\,e} \int G\bigl(x - I(x)\bigr)\, f(x \mid e)\, dx \qquad (1)$$

subject to

$$E(U - V) = \int U\bigl(I(x)\bigr)\, f(x \mid e)\, dx - V(e) \;\ge\; K \qquad (2)$$

$$e = \operatorname*{argmax}_{e'} \left\{ \int U\bigl(I(x)\bigr)\, f(x \mid e')\, dx - V(e') \right\} \qquad (3)$$

This says (1) the principal wants to choose I(x) to maximize her expected utility, subject to (2) the agent's expected utility must be greater than his reservation utility, and (3) the agent will choose his effort level to maximize his own expected utility. Since the Nash equilibrium solution is such that no single player would unilaterally change his or her Nash strategy, the contract derived from the solution to (1)-(3) is considered to be self-enforcing.

The assumption of rational behavior is a typical simplifying assumption which social scientists employ in their models. Under this assumption, players maximize utility, and have all necessary information to do this. In the principal-agent problem, it is assumed that the principal knows the agent's utility function as well as her own. By knowing the agent's utility function, the principal is able to produce an incentive scheme that virtually guarantees the desired behavior from the agent. Both players know all possible states of nature, and share the same (perhaps subjective) probability distribution on those states of nature. This means the agent has relatively complete information on how successful he is in turning effort into output. It is assumed that the agent has better knowledge of this information than the principal, which makes the principal-agent problem a game with asymmetric information.

In reality, the players of this game know much less than is assumed. The principal does not know the agent's utility function and may have only a vague articulation of her own. This means the principal cannot predict with certainty the action of the agent based on the incentive scheme--she may have to learn this from a trial-and-error process. Likewise, the agent may not know the most efficient way to convert effort into output at the beginning of play; this also might be learned through trial and error. The players of this game cannot immediately choose actions to maximize their expected utilities. Thus, the game has moved from one of simple asymmetric information to one of bounded rationality.


Bounded rationality takes into account knowledge and computational limitations of decision makers (Simon 1969). These are especially important considerations in game theory, where what is unknown is not only the environment, but also the desires and strategies of the other players (Kreps 1990; Rosenthal 1993).

We wish to examine models in which players of the incentive game are boundedly rational and must learn about their environment. But such assumptions complicate the modeling process. To quote Holland and Miller (1991): "Usually, there is only one way to be fully rational, but there are many ways to be less rational." This makes it important to examine the learning process from several perspectives. In previous work (Rose and Willemain 1996) we regarded organizations as brains (Morgan 1986) or goal-driven rational systems (Scott 1981) which learn through individual exploration and adaptation. In this paper, we think of organizations as ecosystems (Morgan 1986) or open systems (Scott 1981) which adapt to their environment through evolution.

2.2 Organizations as Ecosystems

Evolutionary models of organizational adaptation have been developed by both economists (Alchian 1950; Hirshleifer 1977; Boulding 1981; Nelson and Winter 1982) and organizational theorists (Hannan and Freeman 1977; Carroll 1988). These models regard change as occurring through processes of variation, selection, and retention of organizational routines.

Variation may come intentionally through innovation or unintentionally through failure in organizational control. The variations in organizational routines spread through the system through a diffusion process; poorly-performing organizations can copy the routines of better-performing ones, or personnel may move from one organization to the other.

Selection occurs based on the fit between the organization and the environment. Firms which fit well with their environments are more likely to survive than firms which are poorer fits. This goodness of fit may occur through skill, luck, or a combination of the two. Organizational ecology and microeconomics both stress tight linkages between organizations and their environments, and both emphasize the role of competition for scarce resources. However, organizational ecologists assume the selection process is imperfect and filled with delays. While economists focus on the equilibrium of the system, organizational ecologists are also interested in the time path leading up to equilibrium.

Retention involves the maintenance of the organizational routines in the population. This might be viewed as bureaucratic inertia--firms with successful strategies are unlikely to change them (Hannan and Freeman 1977). It also might be viewed as multiplication--firms with successful strategies will end up with more profits and more resources to control. They are more likely to produce copies of themselves or be copied by less successful firms (Hirshleifer 1977; Nelson and Winter 1982).

The key is that it is the rules of successful firms which are copied. At the same time, failing firms will contract. This means that as organizational rules (and thus the organizations that go with them) succeed in the environment, they and their variants become a larger proportion of the rules in use. This is the inspiration behind using the genetic algorithm model of learning described in section 4.0.


3.0 The Incentive Game

3.1 Assumptions

The goal of our model is to combine the setting of the principal-agent problem with the ideas of bounded rationality and see what incentive design guidelines emerge. Our entire set of assumptions is:

• The agents' actions are not programmable or observable.
• Outcomes are observable and verifiable.
• The agent has goal conflicts with the principal.
• Decision makers partition outcomes into a small number of classes. This is consistent with theories of decision making in complex or uncertain environments (Cyert and March 1963; Simon 1969; March 1988). In our game, decisions and outcomes will be represented by small numbers of discrete possibilities.
• Players are boundedly rational:
  a) The principal knows nothing about the agents' utilities.
  b) Neither the principal nor the agents know the relationship between effort and output.
• For reasons of fairness, the principal must offer the same incentive schedule to all agents (Mantrala et al. 1994; Stanton and Buskirk 1987; Tyagi 1990).

We now describe a simple incentive game with these characteristics and outline methods for analyzing that game.

3.2 Game Description

A principal is trying to determine a reward scheme for a group of agents. The principal and agents play this game repeatedly. The principal's goal is to have the agents produce the most output while receiving as small a reward as possible. Each agent can produce one of two outputs: Low and High. The reward scheme in this game awards each agent a dollar amount for each level of output.

After the scheme is announced, each agent chooses one of two effort levels: low and high. The probability the agent produces output level x, given that he has chosen effort level e, may be represented as:

                        Outcome x
                        Low       High
Effort e    Low         pLL       pLH
            High        pHL       pHH

The p's represent the degree of environmental uncertainty. If the difference between pLH and pHH is large, then the agent is very likely to perform better by choosing high effort over low effort, which means the environmental uncertainty is low. In contrast, if the difference between pLH and pHH is small, the agent is less likely to perform better by choosing high effort, which means the environmental uncertainty is higher.


The agent's cost structure is

Effort    Cost
Low       0
High      c > 0

After the agent chooses his effort level, the output is realized and the agent receives his reward. The agent's profit is his reward minus his cost of effort. He chooses his effort level to increase his utility of profit. The agent may be either risk neutral or risk averse. If he is risk neutral, his utility of profit P is

W(P) = P

If he is risk averse, his utility of profit is

$$W(P) = 100\bigl(1 - e^{-AP}\bigr)$$

where A is a measure of the agent's risk aversion. This is the constant absolute risk aversion utility function (Hey 1979). This particular utility function is used for exposition purposes here; other utility functions (such as constant relative risk aversion and logarithmic utility) which induce risk aversion in the agents were used with similar results.
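As a quick numerical illustration (our own, not from the original text), take the risk aversion level A = 0.04 used later in the experiments. Then

$$W(40) = 100\bigl(1 - e^{-0.04 \cdot 40}\bigr) \approx 79.8, \qquad W(80) = 100\bigl(1 - e^{-0.04 \cdot 80}\bigr) \approx 95.9,$$

so doubling the profit yields much less than double the utility. This diminishing marginal utility is what makes purely monetary rewards relatively weak motivators for risk averse agents.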

3.3 Theoretical Analysis of the Incentive Game

The principal wants to maximize output at the least cost. In the game described above, this reduces to the principal offering the least cost schedule which induces the agent to choose high effort. If the principal knew the information given above before playing, she could determine the least cost schedule to offer the agent by solving the following mathematical programs:

For the risk neutral agent:

$$\begin{aligned}
\min_{I_L,\,I_H}\quad & p_{HL} I_L + p_{HH} I_H \\
\text{s.t.}\quad & p_{HL} I_L + p_{HH} I_H - c \;\ge\; p_{LL} I_L + p_{LH} I_H \\
& p_{HL} I_L + p_{HH} I_H - c \;\ge\; 0 \\
& I_L,\, I_H \;\ge\; 0
\end{aligned} \qquad \text{(N)}$$

For the risk averse agent:

$$\begin{aligned}
\min_{I_L,\,I_H}\quad & p_{HL} I_L + p_{HH} I_H \\
\text{s.t.}\quad & p_{HL}\,100\bigl(1 - e^{-A(I_L - c)}\bigr) + p_{HH}\,100\bigl(1 - e^{-A(I_H - c)}\bigr) \;\ge\; p_{LL}\,100\bigl(1 - e^{-A I_L}\bigr) + p_{LH}\,100\bigl(1 - e^{-A I_H}\bigr) \\
& p_{HL}\,100\bigl(1 - e^{-A(I_L - c)}\bigr) + p_{HH}\,100\bigl(1 - e^{-A(I_H - c)}\bigr) \;\ge\; 0 \\
& I_L,\, I_H \;\ge\; 0
\end{aligned} \qquad \text{(A)}$$

In both these programs, the risk neutral principal wishes to choose a payment scheme which minimizes her expected payout to the agent, subject to two constraints:

• Under the optimal schedule the agent should prefer playing high effort over playing low effort.
• Under the optimal schedule the agent should achieve a minimum threshold of utility (in this case, 0).

Solutions to these programs for A = 0.04, c = 20, and two levels of environmental uncertainty are given in Table 1.

Table 1. Solutions to programs (N) and (A).

Risk tolerance    pHH = pLL    IL    IH
Neutral           .95          0     23
Averse            .95          0     23
Neutral           .70          0     50
Averse            .70          0     92

Notice that as the environmental uncertainty increases, the principal must pay more to induce an agent to choose high effort. Also, she must pay a higher risk premium to a risk averse agent to induce him to choose high effort.
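For readers who want to reproduce the risk neutral rows of Table 1, the following is a minimal sketch (our own illustration, not the authors' code) that solves program (N) as a linear program. The symmetric uncertainty setting pLL = pHH and the cost c = 20 come from the text; the use of scipy.optimize.linprog is our assumption.

```python
# Solve program (N): the least-cost schedule inducing high effort from a risk neutral agent.
from scipy.optimize import linprog

def least_cost_schedule(p_hh, c=20.0):
    """Return (I_L, I_H) minimizing the principal's expected payout subject to the
    incentive-compatibility and participation constraints of program (N)."""
    p_hl = 1.0 - p_hh          # P(Low output | high effort)
    p_ll = p_hh                # symmetric uncertainty, as in Table 1
    p_lh = 1.0 - p_ll          # P(High output | low effort)

    cost = [p_hl, p_hh]        # objective: expected payment given high effort
    # Rewrite the two ">=" constraints in the "A_ub x <= b_ub" form expected by linprog.
    A_ub = [[-(p_hl - p_ll), -(p_hh - p_lh)],   # incentive compatibility
            [-p_hl,          -p_hh         ]]   # participation (reservation utility 0)
    b_ub = [-c, -c]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    return res.x

print(least_cost_schedule(0.95))   # roughly (0, 22.2); compare Table 1's reported value of 23
print(least_cost_schedule(0.70))   # roughly (0, 50)
```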

We wish to design a game in which both the principal and the agents have discrete choices. The agent's choices are as described above. At the beginning of each period, the principal chooses one payment scheme from the following choices:

Scheme    Payment for Low    Payment for High
A         0                  10
B         0                  60
C         0                  120

These reward schedules are based on the optimal schedules stated in Table 1 and are chosen to produce different behaviors in the players under different environmental conditions and risk preferences. Based on these payment schemes, we can compute the agent's best response for each schedule given his risk preference and the environmental uncertainty level.


Call this agent's strategy Sa. If the principal knows the agent's responses for each schedule, she can then offer the schedule that induces consistent high effort at the lowest expected cost. Call this principal's strategy Sp. The Nash equilibrium solution is for the principal to play her strategy Sp and the agent to respond with strategy Sa.

Given the following values of pHH and agents' risk preference, Table 2 lists the corresponding Nash equilibria. The agent's strategy Sa is given in the form [a, b, c], where a is the response to schedule A, b is the response to schedule B, and c is the response to schedule C.

Table 2. Nash equilibrium strategies for each experimental setting.

Agent's risk preference    pLL = pHH    Principal's Nash strategy    Agent's Nash strategy
Neutral                    0.95         B                            [low, high, high]
Averse                     0.95         B                            [low, high, high]
Neutral                    0.70         B                            [low, high, high]
Averse                     0.70         C                            [low, low, high]
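The agent side of Table 2 can be checked with a short best-response calculation. The sketch below is our own illustration (not the authors' code); it evaluates the expected utility of low versus high effort for each schedule, using the payment schemes above, the effort cost c = 20, and the risk aversion level A = 0.04.

```python
import math

SCHEDULES = {"A": 10, "B": 60, "C": 120}   # payment for High output; Low output pays 0
C_EFFORT = 20                              # cost of high effort
A_RISK = 0.04                              # risk aversion level used in the experiments

def utility(profit, risk_averse):
    return 100.0 * (1.0 - math.exp(-A_RISK * profit)) if risk_averse else profit

def expected_utility(effort, pay_high, p_hh, risk_averse):
    # Probability of the High outcome given the chosen effort (symmetric setting pLL = pHH).
    p_high = p_hh if effort == "high" else 1.0 - p_hh
    cost = C_EFFORT if effort == "high" else 0.0
    return (p_high * utility(pay_high - cost, risk_averse)
            + (1.0 - p_high) * utility(0.0 - cost, risk_averse))

def best_response(p_hh, risk_averse):
    return [max(("low", "high"),
                key=lambda e: expected_utility(e, pay, p_hh, risk_averse))
            for pay in SCHEDULES.values()]

for averse in (False, True):
    for p in (0.95, 0.70):
        print(("averse" if averse else "neutral"), p, best_response(p, averse))
```

Running this reproduces the agent strategies listed in Table 2 for the four experimental settings.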

4.0 The Genetic Algorithm Model

The equilibria in Table 2 were computed assuming the principal knows the agent's profits for various outcomes, and that the principal and agent know the probabilities of those outcomes. The question we will examine with our simulations is: Will we see the same equilibria if the principal and agent have to learn about their environment?

In our model, players learn by playing the game. We model the players using object-oriented computer simulations. We define the principal and each agent as software objects. Within each object are rules for interacting with all other objects in the simulation. Player learning is modeled with a genetic algorithm. This particular method was chosen because

• It has been used in other studies of games and organizations (Axelrod 1987; Miller 1996; Kollman et al. 1992; Marimon et al. 1990; Bruderer 1993; Arifovic 1994; Crowston 1994).

• It is similar in structure to the ecosystem model of organizational learning described in section 2.2.

Genetic algorithms were formally developed by Holland (1975). Their usefulness in economic and social simulations is argued by Holland and Miller (1991). In this model, many economic actors choose from several actions based on sets of decision rules. These decision rules would be analogous to genes in a chromosome. The rewards from the actions are then realized. Based on the decisions chosen, some actors will have received higher rewards than others. The actors who are performing poorly will steal and combine parts of the decision rules from the actors who are performing well. This learning model is comparable to the population ecology model of bounded rationality described in section 2.2. The comparison was made by Goldberg (1989) as an illustration in his well-known text on genetic algorithms:


At a widget conference . . . various widget experts from around the world gather to discuss the latest in widget technology stories. Well-known widget experts, of course, are in greater demand and exchange more ideas, thoughts, and notions with their lesser known widget colleagues. When the show ends, the widget people return to their widget laboratories to try out a surfeit of widget innovations.

In a genetic algorithm, the economic actors are represented by a fixed-length character string. The character strings will represent decision rules to choose actions based on the outcomes of previous periods.

A popular model of evolution in the biological world is that individuals reproduce according to their fitness in the environment. The higher an individual's fitness level, the more offspring he or she will produce. In genetic algorithms, fitness is a measure of how well a string of decision rules solves a particular problem. The fitness measure of any agent is his profitability. The higher an agent's profits are, the more likely it is that his decision rules will be copied by other agents. At the end of each iteration (or after a certain number of iterations) the lower-profit agents will choose higher-profit agents to copy. The higher-profit agents will be copied in proportion to their relative profitability. When the decision rules are copied, some specific parts may be miscopied or modified by the copying agent. An algorithm for a simple genetic algorithm is (Goldberg 1989):

1. Create an initial population of individual fixed-length character strings. This is typically done by creating the strings at random.
2. In each iteration:
   a) Evaluate the fitness (profitability) of each individual in the population.
   b) Create a new population of strings by applying the following operations:
      i) Copy existing individual strings to the new population (reproduction).
      ii) Create two new strings by genetically recombining randomly chosen substrings from two existing strings (crossover).
      iii) Create a new string from an existing string by randomly changing the character at one position of the string (mutation).

4.1 Detailed Genetic Algorithm Description

This part of the study borrows from the work of Miller (1996). In his study of the evolution of strategies for iterated prisoners' dilemma, Miller represents each player as a finite-state automaton. A population of automata is generated at random; each member plays a repeated prisoners' dilemma game with all other members. A new population of automata is then created from the first using a genetic algorithm. We use the same approach in our simulation of strategic evolution in the repeated incentive game.

Finite automaton representation. Each player is modeled as a finite-state machine. The machine is described by four elements: its internal states, its starting state, its actions for each state, and the transitions to other states depending on other players' actions. Using automata to represent players' strategies has been analyzed by Rubinstein (Rubinstein 1986; Abreu and Rubinstein 1988; Piccione and Rubinstein 1994).


This representation of the players' strategies allows a variety of behaviors based on game histories to emerge. For example, a principal's strategy may be to offer reward schedule C until she detects two consecutive low outputs from the agent; from that point on, she will offer reward schedule A. Such a strategy can be represented by the appropriate finite-state automaton.

As an example, Figure 1a) shows the optimal machine representation for a risk neutral agent. The agent sees the reward schedule offered by the principal. He then transitions to another state (or perhaps stays in the same state) and plays the effort level designated by that state. The agent starts in state L. If reward schedule A is offered, the agent stays in state L and plays low effort. Otherwise, he moves to state H and plays high effort. If the agent is in state H and schedule A is offered, he moves to state L and plays low effort. Otherwise, he stays in state H and plays high effort.

A bit string to represent the machine described above is shown in Figure 1b). 0 represents state L or low effort. 1 represents state H or high effort. The first bit indicates the starting state of the machine. The next four bits represent the decision rules in state L. The first three bits represent the transition states for the various reward schedules which might be offered, and the fourth bit represents the effort the agent should play when ending in state L. The next four bits perform a similar function for state H.

Figure 1. Agent automata structure: a) the optimal two-state machine for a risk neutral agent, with transition states for schedules A, B, and C and an action for each state; b) the bit-string encoding of that machine (start state, transition states, and action bits); c) the general bit-string structure for a 16-state machine (states 0 through 15).


In the general case, each player's strategy is expressed as a 16-state machine. Sixteen states were chosen as a compromise between the potential for complex behavior versus a possible simple description of that behavior. This finite-state machine can be represented by a more complex bit string. The string's structure is represented in Figure 1c). Each # represents a bit (# ∈ {0, 1}). The first four bits represent the starting state of the agent. This is followed by sixteen 13-bit packets. Packet i represents the behavior of the agent when he is in state i, i ∈ {0, ..., 15}. Each packet consists of three four-bit words and an action bit. Each four-bit word corresponds to a pay schedule; the word represents the state the agent moves to if that particular schedule is offered by the principal. The action bit in packet i represents the action the agent will take when state i is reached. The principal is represented in a similar way.
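To make the encoding concrete, here is a minimal sketch (our own reconstruction from the description above, not the authors' code) of how such a 212-bit agent string could be decoded and stepped through one round of play:

```python
def decode_agent(bits):
    """bits: string of 212 characters '0'/'1' (4 start-state bits + 16 packets of 13 bits)."""
    assert len(bits) == 4 + 16 * 13
    start_state = int(bits[:4], 2)
    states = []
    for i in range(16):
        packet = bits[4 + 13 * i : 4 + 13 * (i + 1)]
        # Three 4-bit transition words, one per reward schedule A, B, C.
        transitions = {sched: int(packet[4 * j : 4 * j + 4], 2)
                       for j, sched in enumerate("ABC")}
        action = "high" if packet[12] == "1" else "low"   # action bit of this state
        states.append({"transitions": transitions, "action": action})
    return start_state, states

def play_round(state, schedule, states):
    """Transition on the offered schedule and play the effort of the new state."""
    new_state = states[state]["transitions"][schedule]
    return new_state, states[new_state]["action"]
```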

Genetic algorithm learning. There will be a population of principals of size n. Each principal will be in charge of a population of agents. Within each generation, the principal and agents will play the incentive game for 80 iterations, following the strategies dictated by their finite automata structures. The number of iterations was chosen after preliminary experiments as a reasonable trade between having enough iterations to observe convergence and having so many iterations that simulation time became long. After 80 iterations, the next generation of players will be constructed using genetic algorithms. We will describe this in detail for the principal; the method for the agents is very similar.

All principals will be sorted in order of accumulated utility. The principals with the top k utilities (k < n) will be copied into the next generation intact; this is called elitist selection, guaranteeing the top-performing rules survive into the next round of play. This leaves the remaining (n − k) population slots open. They will be filled by children of the current principals. To fill two slots, we select two principals to act as parents. Each parent is selected based on its utility relative to the population total:

$$p(\text{principal } i \text{ selected}) = \frac{\text{utility}_i}{\sum_{j=1}^{n} \text{utility}_j}$$

The bit strings of the parents are copied into the children, with random crossover and mutation altering the children so they will not be identical to the parents. Then the children are copied into the next generation. This process is repeated until all slots in the new population are filled.

We show an example of this process in Figure 2. In this example, n = 10 and k = 6. This means we copy into the new generation the top 6 performers from the old genera- tion of 10. We now sample with replacement from the old population to find parents for slots 7 and 8. In this case, the square in slot 1 and the triangle in slot 3 are chosen. With probability Pc, the two parents are crossed to form their children. In this case crossover does not happen, so the parents are copied directly to their children and put into the new generation. Now we sample to find parents for slots 9 and 10. The circle in slot 2 and the pencil in slot 7 are chosen. Again, with probability Pc the parents are crossed to form their children. In this case crossover does happen, so each child shares information from both parents.
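The following is a minimal sketch (our own illustration, not the authors' implementation) of this generation-construction step: elitist copying of the top k strings, roulette-wheel selection of parents in proportion to utility, single-point crossover applied with probability Pc, and bitwise mutation. The crossover and mutation rates shown are placeholders, since the text does not report the values used.

```python
import random

def next_generation(population, fitness, k, p_c=0.7, p_m=0.01):
    """population: list of bit strings; fitness: parallel list of accumulated utilities.
    p_c and p_m are placeholder crossover and mutation probabilities."""
    n = len(population)
    ranked = sorted(range(n), key=lambda i: fitness[i], reverse=True)
    new_pop = [population[i] for i in ranked[:k]]          # elitist selection

    total = sum(fitness)
    def pick_parent():
        # Roulette-wheel selection in proportion to utility (assumes non-negative fitness).
        r = random.uniform(0, total)
        acc = 0.0
        for s, f in zip(population, fitness):
            acc += f
            if acc >= r:
                return s
        return population[-1]

    def mutate(s):
        # Flip each bit independently with probability p_m.
        return "".join(b if random.random() > p_m else ("1" if b == "0" else "0") for b in s)

    while len(new_pop) < n:
        p1, p2 = pick_parent(), pick_parent()
        if random.random() < p_c:                          # single-point crossover
            cut = random.randrange(1, len(p1))
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        new_pop.extend([mutate(p1), mutate(p2)][: n - len(new_pop)])
    return new_pop
```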


Figure 2. Forming the new generation: members of the old generation, ranked by performance (slots 1 through 10), are copied or recombined into the new generation.

5.0 Results from Virtual Experiments

We conduct experiments involving variations of the following factors:

• Risk aversion: Agents' risk aversion will modulate the motivation provided by a particular incentive schedule. Will the principal discover the risk aversion of the agents and change the incentive schedule appropriately?

• Environmental uncertainty: As environmental uncertainty increases, outcome incentives become more costly and less effective. Is it still possible to design incentive schedules to guide agents toward improving output?

To keep the analysis simple, we vary agents' risk preferences between two levels (risk neutral and risk averse) and environmental uncertainty between two levels (low and high) for a resulting 2 × 2 experimental design.

Output analysis will focus on system equilibrium: Do equilibria under bounded rationality conditions approximate the Nash equilibria predicted by the traditional literature? Do certain incentive schedules result in higher equilibrium output than others?

We start with a population of Np principals generated at random. Each principal regulates a population of Na agents initially generated at random. Each principal plays T iterations of an incentive game with her population of agents. In each iteration the agents choose effort levels and produce outputs according to the rules dictated by their definitions. After the iterations are finished, new populations of principals and agents are created from the old populations.


For the principals, the top np performers from the old population are copied to the new one. Then Np − np more principals are created from the old population using the genetic operations of selection, crossover, and mutation. For the agents, the top na performers from the old populations are copied, while Na − na new agents are created using genetic operations. This is done for G generations.

Table 3 summarizes the parameters described in our model along with their experimental settings.

Table 3. Model parameters and experimental settings.

Parameter    Meaning                                                            Experimental setting
Np           Size of principal population                                       30
np           Number of top principal performers copied into next generation     20
Na           Size of agent population                                           100
na           Number of top agent performers copied into next generation         50
G            Number of generations                                              50
T            Number of game iterations in each generation                       80
A            Level of risk aversion in risk-averse agents                       0.04

These parameter values were chosen because they worked well in preliminary experiments and because they matched experimental conditions of earlier work (Miller 1996; Rose and Willemain 1996).
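Putting the pieces together, the overall virtual experiment can be outlined as below. This is our own sketch, not the authors' code: random_principal, random_agent, and play_game are hypothetical helpers (the last returning accumulated utilities after T game iterations), and next_generation is the elitist genetic step sketched earlier. Parameter values follow Table 3.

```python
N_P, n_p = 30, 20      # principal population size, elitist count
N_A, n_a = 100, 50     # agent population size, elitist count
G, T = 50, 80          # generations, game iterations per generation

def run_experiment(random_principal, random_agent, play_game):
    principals = [random_principal() for _ in range(N_P)]
    agent_pops = [[random_agent() for _ in range(N_A)] for _ in range(N_P)]
    for _ in range(G):
        principal_fit = []
        for i, principal in enumerate(principals):
            # T iterations of the incentive game between this principal and her agents;
            # returns the principal's accumulated utility and one utility per agent.
            p_util, agent_utils = play_game(principal, agent_pops[i], T)
            principal_fit.append(p_util)
            agent_pops[i] = next_generation(agent_pops[i], agent_utils, k=n_a)
        # (for simplicity, each slot keeps its associated agent population across generations)
        principals = next_generation(principals, principal_fit, k=n_p)
    return principals, agent_pops
```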

Since the genetic algorithm is evolutionary in nature, we look primarily for evolutionary phenomena. We ran four experimental blocks, playing risk neutral and risk averse agents in low and high uncertainty environments. Each block consisted of five simulation runs. We recorded the dominant strategies for principals and agents in the last generation. The results are summarized in Table 4.

Table 4. Summary of experimental results.

Agents' risk preference    Environmental uncertainty    Result
Risk neutral               Low                          Nash equilibrium
Risk averse                Low                          Nash equilibrium
Risk neutral               High                         Nash equilibrium
Risk averse                High                         Defensive play

Notice that we observed the expected Nash equilibrium in all situations except that of risk averse agents playing in a highly uncertain environment. In this situation, the agents play a defensive strategy--they choose low effort regardless of the payment schedule offered by the principal.

In Figure 3, we see the diffusion of both principal and agent strategy choices for risk neutral agents in a high uncertainty environment. We see in Figure 3a) that the principals prefer schedule A to the other reward schedules in the early generations.

Figure 3. Diffusion of strategies; risk neutral agents, high uncertainty. a) Principals offering schedule A and agent responses. b) Principals offering schedule B and agent responses. c) Principals offering schedule C and agent responses.


The reason is indicated in the figure: the agents respond in roughly the same way to all reward schedules in the first two or three generations, so the principals prefer the cheapest schedule.

Figure 4. Diffusion of strategies; risk averse agents, high uncertainty. a) Principals offering schedule A and agent responses. b) Principals offering schedule B and agent responses. c) Principals offering schedule C and agent responses.


By generation 20, however, the number of agents playing high for schedule A has significantly decreased. Meanwhile, the number of agents playing high for schedule B has remained roughly the same. Thus, Figure 3b) shows that the principals begin offering reward schedule B more consistently. More agents respond with high effort, and the players converge to the Nash equilibrium predicted in Table 2.

In Figure 4, we look at the same diffusion process for risk averse agents. Notice the same initial preference by the principals for schedule A in Figure 4a). But in the risk averse case, there is a rapid decline in the number of agents playing high effort for all of the reward schedules. Consequently, the principals consistently play the least expensive schedule, and the players converge to a suboptimal equilibrium.

Why do risk-averse agents have such difficulty learning the correct strategy in the high uncertainty environment? Genetic algorithm researchers (Goldberg and Segrest 1987; Horn 1993; Fogel 1994; Rudolph 1994) have suggested that a key component of genetic algorithm convergence is the fitness ratio:

$$r = \frac{\text{expected fitness of the optimal strategy}}{\text{expected fitness of any other strategy}}$$

As r increases, the probability that a player using the optimal strategy will be chosen as a parent for the next generation of players increases. This increases the likelihood that the optimal strategy will be the dominant strategy for the population after some fixed number of generations.

Figure 5 shows the expected utilities for risk averse agents in the low uncertainty environment. Even if the principal offers a fairly moderate reward, the agents can improve their expected utility tremendously by playing high effort. For schedules B or C, the expected fitness (measured in agent utility) for high effort is many times that of low effort. Thus, if the principal offers schedules B or C, the agent strategy of responding with high effort will propagate quickly through the agent population.
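As a rough check (our own back-of-the-envelope calculation, not reported in the original text), take schedule B (payments 0 and 60), c = 20, A = 0.04, and pLL = pHH = 0.95. Then

$$E[W \mid \text{high}] = 0.95\,W(40) + 0.05\,W(-20) \approx 75.8 - 6.1 \approx 69.7, \qquad E[W \mid \text{low}] = 0.05\,W(60) + 0.95\,W(0) \approx 4.5,$$

giving a fitness ratio r of roughly 15. Repeating the calculation with pLL = pHH = 0.70 (the high uncertainty case of Figure 6) gives roughly 19 for high effort versus 27 for low effort under schedule B, and about 32 versus 30 under schedule C, so the advantage of the better response shrinks to a ratio near 1.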

Figure 5. Risk-averse agents' expected utilities in low uncertainty environment (expected utility of low and high agent effort as a function of the payment for high output, with schedules A, B, and C marked).

In contrast, Figure 6 shows expected utilities for risk averse agents in a high uncertainty environment.

Figure 6. Risk-averse agents' expected utilities in high uncertainty environment (expected utility of low and high agent effort as a function of the payment for high output, with the reward schedules marked).

Here the difference in expected utility between choosing low or high effort is small, and does not improve greatly even for very high rewards. Thus, agents who play high consistently will show only a small proportionate advantage over agents who play low consistently. Since individual agents will be chosen for the next generation based on their proportionate fitness, the number of high-playing agents in the next generation will increase only slightly. Before any perceptible increase in agent output occurs, the principals give up and start offering schedule A.

Why does the number of agents playing high effort for schedule C decrease? The answer can be inferred from Figures 4 and 6. We see that in the first few generations, the number of times schedule C is offered decreases. Also, the utility increase an agent receives from playing high effort is imperceptible. Meanwhile, the number of times schedule A is offered increases. The agent's utility increase for playing low is significant for schedule A: he is better off playing low for schedules A and B, and is essentially indifferent for schedule C. The agents have learned a defensive strategy, which frustrates the principal's goal of fostering high outcomes through high effort. Is there anything the principal can do to help the agents avoid this?

Improving agent performance through artificial selection. The genetic algorithm used for risk averse agents in high environmental uncertainty leads to a premature suboptimal convergence, as seen in Figure 4. We tried several of the standard techniques from the genetic algorithm literature (Michalewicz 1994), including rank-order selection and changing the principal's fitness function over the generations to make her more generous in the beginning of play. Neither of these improved player performance.

We also tried offering higher rewards and penalties for low performance. But Figure 6 indicates that simply offering a higher reward for high output will not help agents learn their best response to a high reward schedule. Including a penalty does not help either-- this merely shifts both utility curves nearer to the horizontal axis.

Part of the problem may become part of the solution. Since this is a game, there are two performance measures at work. The principal's performance measure increases as output increases. But the agents select decision rules to copy based on their own utility of profit.

Suppose the principal could bias the agent selection process in her favor--in other words, the principal could coax the agents to select decision rules that result in high outputs.


She might do this by offering favorable recognition for high outputs (Tyagi 1990; Brooks 1994; Troy 1993), or benchmarking and cataloging best practices among the agents (Brewer 1993; Hayes et al. 1988; Sherman and Ladino 1995). At any rate, assume the principal has some mechanism which alters agent population selection properties to favor agents which produce high output. In keeping with the biological metaphor of genetic algorithms, this might be considered "artificial" selection.

We model a process of artificial selection by modifying the agent's fitness function:

$$f_i = W(\text{profit}_i) + Y_i O_i \bigl\{ R_1 + R_2 \bigl[ W(\text{profit}_i)\, R_1 \bigr] \bigr\}$$

where

f_i = fitness of agent i
W(profit_i) = agent's utility of profit (as defined in section 3.0)
Y_i = 1 if W(profit_i) > 0, and 0 otherwise
O_i = 1 if output = High, and 0 if output = Low
R1 = direct influence of principal on agent selection, 0 ≤ R1 ≤ 1
R2 = effect of the interaction of principal influence and profit utility, −1 ≤ R2 ≤ 1

This model says that the principal has some influence R1 on the selection process of agents, which she exerts with agents who produce High outputs. If R1 is near 0, the principal has little influence over the agent selection process; values of R1 near 1 indicate that the principal has considerable influence. However, this influence does not extend to agents who make losses.

There is also an interaction between utility of profit and the principal's selection influence, the magnitude of which is determined by R2. If R2 < 0, this means that profit and influence are substitutes; as agent profit increases, the influence of the principal is weakened and vice versa. If R2 > 0, agent profit and principal influence are complements; as agent profit increases, the influence of the principal is strengthened and vice versa.
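The following is a minimal sketch (our own reading of the fitness expression above, not the authors' code) of how the artificial selection bonus could be computed for a single agent:

```python
def artificial_fitness(profit_utility, output_high, r1, r2):
    """profit_utility: W(profit_i); output_high: True if the agent produced High output;
    r1, r2: the principal's direct influence and the interaction effect."""
    y = 1.0 if profit_utility > 0 else 0.0   # the bonus does not extend to agents with losses
    o = 1.0 if output_high else 0.0          # ... or to agents who produced Low output
    return profit_utility + y * o * (r1 + r2 * (profit_utility * r1))
```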

We ran simulation experiments for various values of R1 and R2. We then used the experimental results to estimate response surfaces as functions of R1 and R2; details are included in Rose (1995). One of the responses we recorded and estimated was the number of principals offering each schedule in the final generation of play. Figure 7 shows the R1 × R2 parameter space partitioned by the dominant response of the principals. We see that principals preferred schedule B except for very small values of R1, where they preferred schedule A. Thus, we focus our analysis on system behavior under schedule B.


Figure 7. Parameter space partitioned by dominant response (R1 on the horizontal axis).

Figure 8 shows the proportion of principals offering schedule B in the last generation as a function of R1 and R2. We see that the proportion offering schedule B increases as R1 increases. Furthermore, the proportion is fairly insensitive to values of R2.

Figure 8. Proportion of principals offering schedule B in last generation, as a function of R1 and R2.

Figure 9 shows the proportion of agents playing high in response to being offered schedule B in the last generation. Proportions less than or equal to 0.5 are shown in black; the black region indicates where the proportion of agents responding with high effort decreases from the 0.5 proportion generated at random in the first generation. Figure 9 indicates that the principal can increase the number of agent high efforts when her influence parameter R1 ≥ 0.5. This might be interpreted as meaning that the principal can improve system performance if her selection influence is significant, even if her influence is somewhat less than that of agents' utility of profit (e.g., R1 = 0.6). The diagram also indicates that this system improvement will be sensitive to values of R2.


Figure 9. Proportion of agents playing high when schedule B is offered in last generation, as a function of R1 and R2.

6.0 Conclusions and Discussion

In this work we consider incentive systems in which the principal must learn which incentive schedule to offer and agents must learn how to respond to these schedules. Player learning is modeled in computer simulations by a genetic algorithm.

Theory suggests the speed at which agents learn to perform an appropriate action depends on the ratio of the fitness gained from performing that action over the fitness gained from performing any other action. This idea is confirmed by our virtual experiments. For risk averse agents whose marginal utility for monetary reward declines rapidly, the utility gained from high efforts may be insignificantly greater than the utility from lower efforts if environmental uncertainty is high. This leads agents to play a defensive strategy--always exert low effort, no matter what the reward structure.

This is consistent with results on cooperation under uncertainty. For example, in repeated prisoners' dilemma games, simple strategies are capable of fostering cooperation when there is no environmental uncertainty. As uncertainty increases, simple cooperative strategies increase the variation in reward as opposed to more complex or less cooperative strategies (Bendor 1987; Mueller 1987). This is especially important if the players are assumed to be risk averse; such players will trade a strategy which yields a minor improvement in reward for one which gives a significant reduction in variation.

In cases where agents are risk averse and the environment is uncertain, increasing monetary reward may not encourage agents to exert more effort; thus, the principal cannot use this strategy to improve system performance. But in this model, the key is to aid agent learning, not to reward or punish per se. Agent learning is aided by adjusting the agent selection process in favor of what the principal wants--high output. Our experiments show that if the principal has a significant influence over how the agents choose decision rules, she can use this influence to overcome agent defensiveness. In doing so, both principal and agents will learn strategies that leave both parties better off than did the defensive strategies. This adds to the role traditionally given to the principal of gathering information strictly for her own use.

Some limitations of the genetic algorithm model arise from representational issues. For example, utility functions of particular forms were used to induce risk aversion in agents. Using several functional forms of the utility function shows a robustness in our conclusions. However, the effect of the form on the likelihood that a particular agent would be chosen as a parent for creation of the next generation of agents bears scrutiny. We also chose a particular representation of player decision rules, that of a finite automaton. This was chosen because the finite automaton can represent many interesting strategies. But the behavior of a genetic algorithm may be sensitive to the representation of the solution space (Michalewicz 1994). These issues warrant further consideration in future research.

References

Abreu, D. and A. Rubinstein (1988), "The Structure of Nash Equilibrium in Repeated Games with Finite Automata," Econometrica 56, 1259-1281.
Alchian, A. (1950), "Uncertainty, Evolution, and Economic Theory," Journal of Political Economy 58, 211-221.
Arifovic, J. (1994), "Genetic Algorithm Learning and the Cobweb Model," Journal of Economic Dynamics and Control 18, 3-28.
Axelrod, R. (1987), "The Evolution of Strategies in the Iterated Prisoner's Dilemma," in L. Davis (Ed.) Genetic Algorithms and Simulated Annealing, London: Pitman.
Baiman, S. (1982), "Agency Research in Managerial Accounting: A Survey," in J. Bell (Ed.) Accounting Control Systems: A Behavioral and Technical Integration, New York: Markus Wiener.
Bendor, J. (1987), "In Good Times and Bad: Reciprocity in an Uncertain World," American Journal of Political Science 31, 531-558.
Boulding, K. (1981), Evolutionary Economics. Beverly Hills, CA: Sage.
Brewer, G. (1993), "Ford," Incentive 167(2), 22-23.
Brooks, S. (1994), "Noncash Ways to Compensate Employees," HRMagazine 39(4), 38-43.
Bruderer, E. (1993), "How Strategies Are Learned." Unpublished doctoral dissertation, University of Michigan.
Carroll, G. (Ed.) (1988), Ecological Models of Organizations. Cambridge, MA: Ballinger.
Coleman, J. (1990), Foundations of Social Theory. Cambridge, MA: Belknap Press.
Crowston, K. (1994), "Evolving Novel Organizational Forms," in K. Carley and M. Prietula (Eds.) Computational Organization Theory, Hillsdale, NJ: Lawrence Erlbaum Associates.
Cyert, R. and J. March (1963), A Behavioral Theory of the Firm. Englewood Cliffs, NJ: Prentice-Hall.
Eisenhardt, K. (1989), "Agency Theory: An Assessment and Review," Academy of Management Review 14, 57-74.
Eisenhardt, K. (1985), "Control: Organizational and Economic Approaches," Management Science 31, 134-149.
Fogel, D. (1994), "Asymptotic Convergence Properties of Genetic Algorithms and Evolutionary Programming: Analysis and Experiments," Cybernetics and Systems 25, 389-407.
Goldberg, D. (1989), Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
Goldberg, D. and P. Segrest (1987), "Finite Markov Chain Analysis of Genetic Algorithms," Proceedings of the Second International Conference on Genetic Algorithms.
Hannan, M. and J. Freeman (1977), "The Population Ecology of Organizations," American Journal of Sociology 82, 929-964.
Hayes, R., S. Wheelwright, and K. Clark (1988), Dynamic Manufacturing: Creating the Learning Organization. New York: Free Press.
Hey, J. (1979), Uncertainty in Economics. New York: New York University Press.
Hirshleifer, J. (1977), "Economics from a Biological Viewpoint," Journal of Law and Economics 20, 1-54.
Holland, J. (1975), Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
Holland, J. and J. Miller (1991), "Artificial Adaptive Agents in Economic Theory," American Economic Review Papers and Proceedings 81, 365-370.
Holmstrom, B. (1979), "Moral Hazard and Observability," Bell Journal of Economics 10, 74-91.
Holmstrom, B. and P. Milgrom (1987), "Aggregation and Linearity in the Provision of Intertemporal Incentives," Econometrica 55, 303-328.
Horn, J. (1993), "Finite Markov Chain Analysis of Genetic Algorithms with Niching," in S. Forrest (Ed.) Proceedings of the Fifth International Conference on Genetic Algorithms, San Mateo, CA: Morgan Kaufmann.
Kollman, K., J. Miller, and S. Page (1992), "Adaptive Parties in Spatial Elections," American Political Science Review 86, 929-937.
Kreps, D. (1990), Game Theory and Economic Modelling. Oxford, UK: Clarendon Press.
Lant, T. and S. Mezias (1990), "Managing Discontinuous Change: A Simulation Study of Organizational Learning and Entrepreneurial Strategies," Strategic Management Journal 11, 147-179.
Mantrala, M., P. Sinha and A. Zoltners (1994), "Structuring a Multiproduct Sales Quota-Bonus Plan for a Heterogeneous Sales Force: A Practical Model-Based Approach," Marketing Science 13, 121-144.
March, J. (1988), "Variable Risk Preference and Adaptive Aspirations," Journal of Economic Behavior and Organization 9, 5-24.
Marimon, R., E. McGrattan, and T. Sargent (1990), "Money As a Medium of Exchange in an Economy with Artificially Intelligent Agents," Journal of Economic Dynamics and Control 14, 329-373.
Michalewicz, Z. (1994), Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer-Verlag.
Miller, J. (1996), "The Co-Evolution of Automata in the Repeated Prisoner's Dilemma," Journal of Economic Behavior and Organization 29, 87-112.
Morgan, G. (1986), Images of Organization. Beverly Hills, CA: Sage.
Mueller, U. (1987), "Optimal Retaliation for Optimal Cooperation," Journal of Conflict Resolution 31, 692-724.
Nelson, R. and S. Winter (1982), An Evolutionary Theory of Economic Change. Cambridge, MA: Belknap Press.
Piccione, M. and A. Rubinstein (1994), "Finite Automata Play a Repeated Extensive Game," Journal of Economic Theory 61, 160-168.
Rose, D. (1995), "Designing Incentives Under Conditions of Bounded Rationality." Unpublished doctoral dissertation, Rensselaer Polytechnic Institute, Troy, NY.
Rose, D. and T. Willemain (1996), "The Principal-Agent Problem with Adaptive Players," Computational and Mathematical Organization Theory 1, 157-182.
Rosenthal, R. (1993), "Rules of Thumb in Games," Journal of Economic Behavior and Organization 22, 1-13.
Rubinstein, A. (1986), "Finite Automata Play the Repeated Prisoner's Dilemma," Journal of Economic Theory 39, 83-96.
Rudolph, G. (1994), "Convergence Analysis of Canonical Genetic Algorithms," IEEE Transactions on Neural Networks 5, 96-101.
Scott, W. R. (1981), Organizations: Rational, Natural and Open Systems. Englewood Cliffs, NJ: Prentice Hall.
Sherman, H. D. and G. Ladino (1995), "Managing Bank Productivity Using Data Envelopment Analysis (DEA)," Interfaces 25, 60-73.
Simon, H. (1969), The Sciences of the Artificial. Cambridge, MA: MIT Press.
Stanton, W. and R. Buskirk (1987), Management of the Sales Force. 7th ed. Homewood, IL: Irwin.
Troy, K. (1993), "Recognize Quality Achievement with Noncash Awards," Personnel Journal 72, 111-117.
Tyagi, R. (1990), "Inequities in Organizations, Salesperson Motivation, and Job Satisfaction," International Journal of Research in Marketing 7, 135-148.

David Rose is Assistant Professor of Decision Science in the Defense Resources Man- agement Institute of the Naval Postgraduate School. David received his Ph.D. in Decision Science and Engineering Systems from Rensselaer Polytechnic Institute in 1995. He also holds two degrees in mathematics: an MA from the State University of New York at Buf- falo and a BS from Southwest Missouri State University. His research interests include the economic theory of incentive systems and the effect of learning in decision making.

Thomas R. Willemain is Associate Professor, Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY USA 12180-3590. He received the BSE from Princeton University and the SM and Ph.D. from Massachusetts Institute of Technology, all in Electrical Engineering.