
Electronic copy available at: https://ssrn.com/abstract=2916929

Bayesian Instinct∗

Etan A. Green1 and David P. Daniels2

1University of Pennsylvania 2Stanford University

April 3, 2017†

Abstract

Do experts form rational beliefs when making split-second, sophisticated judgments? A long literature suggests not: individuals often form prior beliefs from biased sampling and update those beliefs by improperly weighting new information. This paper studies belief formation by professional umpires in Major League Baseball. We show that the decisions of umpires reflect an accurate, probabilistic, and state-specific understanding of their rational expectations, as well as an ability to integrate those prior beliefs in a manner that approximates Bayes rule. Given that umpires have barely a second to form beliefs and make a decision, we conclude that the instincts of professional umpires mimic a sophisticated level of rationality remarkably well.

∗Please direct correspondence to: [email protected]. We thank Jim Andreoni, Dorothy Kronick, Cade Massey, Ken Moon, Alex Rees-Jones, Uri Simonsohn, Joe Simmons, and Dan Stone for comments on this paper. We also thank seminar participants at Stanford, Wharton, Vanderbilt, UCSD, the 2014 Society for Judgment and Decision Making conference, and the 2014 Behavioral Decision Research in Management conference, including, and in addition to, Doug Bernheim, Gabriel Carroll, Isa Chaves, Scott Ganz, Nir Halevy, Sergiu Hart, David Hausman, Jonathan Levav, Neil Malhotra, Max Mishkin, Muriel Niederle, Roger Noll, Justin Rao, Peter Reiss, Al Roth, Charlie Sprenger, Francisco Toro, and Russ VerSteeg, for comments on a related, and now retired, working paper (Green and Daniels, 2015). All errors are our own.

†For the most recent version, see: ssrn.com/abstract=2916929



1 Introduction

Experts routinely make split-second judgments. Soldiers evaluate whether passersby are

combatants. Police evaluate whether suspects pose threats. Paramedics evaluate whether

injuries need treatment at the scene. Defense analysts evaluate whether radar blips indicate

an attack. And traders evaluate a security’s expected returns after unexpected news. Do

they form rational beliefs? That is, do they begin with rational expectations—i.e., the share

of combatants in the population, the rate at which suspects threaten officers, how often

injuries require immediate attention, how frequently radar noise implies an attack, or the

prior distribution of expected returns? And if so, do they appropriately weigh those prior

beliefs against new information—e.g., the pedestrian pushing buttons on a cell phone, the

suspect reaching into a pocket, the vital signs of the injured, the pattern on the radar, or

the content of the news?

A long literature in behavioral economics and psychology suggests not. Individuals often

form prior beliefs from biased sampling and update those beliefs by improperly weighting

new information.1 Controlled experiments—in which beliefs are elicited through surveys, and

the distribution of the signal is specified by the experimenter—largely find that lay subjects

violate Bayes rule (Grether, 1980; El-Gamal and Grether, 1995; Eil and Rao, 2011; Mobius,

Niederle, Niehaus and Rosenblat, 2011), although different elicitation methods can produce

different conclusions (e.g., Andreoni and Mylovanov, 2012). In natural contexts, however,

beliefs and signals are typically unobservable. As a result, tests of rational belief formation

by experts have relied on indirect evidence. An interesting question in finance, for example,

is whether investors overreact to news. The empirical difficulty is that whereas prices are

1 A vast body of research has documented biases in belief formation (for a review, see Camerer, Loewenstein and Rabin, 2004). On the one hand, individuals often underweight new information, particularly when overconfident (e.g. Barber and Odean, 2001) or when the information contradicts their prior beliefs (Nickerson, 1998). On the other hand, individuals often overweight new observations when they are salient (Kahneman and Tversky, 1972), even when they are obviously uninformative, as when people anchor on the spin of a wheel (Tversky and Kahneman, 1974), the roll of a die (Mussweiler, 2001), or the last two digits of a social security number (Ariely, Loewenstein and Prelec, 2003). People also see patterns in randomness (Chen, Moskowitz and Shue, 2016), make strong inferences from few observations (Tversky and Kahneman, 1971), neglect selection bias in sample observations (Koehler and Mercer, 2009), and are only attentive to salient information (e.g. Lacetera, Pope and Sydnor, 2012). Whether people favor their prior beliefs or new information can depend on what they want to believe. Those with incentives to change their beliefs seek out contravening information, whereas those with incentives to reaffirm their beliefs resist such information, even when it is free (Ambuehl, 2016). Individuals who manage to form beliefs absent any bias or corruption may nonetheless succumb to the difficulty of explicitly calculating a posterior belief via Bayes rule, especially if they are among the majority of educated subjects who confuse probabilities and proportions (Lipkus et al., 2001). As a result, models of non-Bayesian updating have proliferated (e.g. Akerlof and Dickens, 1982; Barberis, Shleifer and Vishny, 1998; Rabin and Schrag, 1999; Rabin, 2002; Brunnermeier and Parker, 2005).



observable, true valuations are not. Consequently, tests of rational belief formation among

investors require the analyst to specify an equilibrium model of asset pricing, and such tests

may be invalid if the model is misspecified (cf. De Bondt and Thaler, 1985).2

This paper exploits an unusual natural setting, in which both rational expectations and

signals are observable, to directly evaluate the extent to which experienced agents form

rational beliefs. We study professional umpires in Major League Baseball. In baseball, when

the pitcher makes a pitch and the batter chooses not to swing, the umpire immediately

makes a call—either a ball or a strike. By rule, the umpire is supposed to call a strike

when the pitch intersects a bounded region defined by the official strike zone, and a ball

otherwise. The umpire’s difficulty is that he imperfectly observes the location of the pitch:

pitches travel fast, and the boundaries of the official strike zone are invisible. An umpire with

imperfect vision can improve the accuracy of his calls by incorporating rational expectations

about where the pitch will be thrown. Using data from stereoscopic cameras that record the

precise location of each pitch, we measure rational expectations as the empirical distribution

of pitch locations when the batter does not swing. These expectations represent the umpire’s

best guess of where the pitch was thrown, absent observing the pitch itself. Our data also

allow us to specify the signal, under the innocuous assumption that the umpire’s observation

is drawn from a distribution centered at the true location of the pitch.

Baseball enthusiasts and academic researchers have noted an intriguing pattern in umpire

decisions. Whereas MLB instructs umpires to call balls and strikes based solely on the

location of the pitch, umpires make different calls in different counts, or game states, for

pitches at the same location (Moskowitz and Wertheim, 2011; Green and Daniels, 2014;

Mills, 2014). When the pitch is obviously inside the strike zone or obviously outside the

strike zone, the umpire almost always makes the correct call regardless of the count. But for

pitches near the boundary of the official strike zone, the probability of a strike call can vary

between counts by more than 60 percentage points.

We show that a simple Bayesian model predicts these patterns. In our model, the umpire’s

observation of the pitch provides a noisy signal of its location, which he integrates with his

rational expectations to form posterior beliefs. If according to those posterior beliefs, the

location of the pitch is more likely than not to be inside the strike zone, the umpire calls a

strike; otherwise, he calls a ball. The rational-expectations prior shifts the umpire’s posterior

beliefs towards locations that are a priori more expected. For observed locations that are

obviously inside the strike zone or obviously outside the strike zone, these shifts do not

2 For recent examples of indirect tests in other contexts, see Zafar (2011) and Stone (2013).


change the umpire’s call. But for observed locations just inside the strike zone, expectations

that place greater weight just outside the strike zone change the umpire’s call from strike to

ball. Likewise, for observed locations just outside the strike zone, expectations that place

greater weight just inside the strike zone change the umpire’s call from ball to strike.
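The decision rule described above can be illustrated with a minimal one-dimensional sketch. It is not the paper's estimated model: the grid, the 17-inch zone, the shapes of the two priors, and the noise level (`noise_sd`) are illustrative assumptions, and every function name is hypothetical. The umpire multiplies a prior over pitch locations by a Gaussian likelihood centered on each candidate location and calls a strike when more than half of the posterior mass lies inside the zone.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def discretized_prior(grid, center, spread):
    """Hypothetical prior: a normal shape on the grid, normalized to sum to 1."""
    weights = [normal_pdf(x, center, spread) for x in grid]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_strike_prob(signal, grid, prior, zone, noise_sd):
    """Posterior probability that the true location lies inside the zone."""
    lo, hi = zone
    # Bayes rule on a grid: posterior weight = prior * likelihood of the signal.
    weights = [p * normal_pdf(signal, x, noise_sd) for x, p in zip(grid, prior)]
    inside = sum(w for x, w in zip(grid, weights) if lo <= x <= hi)
    return inside / sum(weights)


def call(signal, grid, prior, zone, noise_sd):
    """Call a strike when the posterior favors a location inside the zone."""
    p = posterior_strike_prob(signal, grid, prior, zone, noise_sd)
    return "strike" if p > 0.5 else "ball"


# Illustrative setup (all numbers are assumptions, in inches).
grid = [i * 0.5 for i in range(-40, 41)]   # candidate locations from -20 to 20
zone = (-8.5, 8.5)                         # a 17-inch-wide zone centered at 0
noise_sd = 2.0                             # assumed perceptual noise
prior_batter_ahead = discretized_prior(grid, center=0.0, spread=6.0)    # mass inside
prior_pitcher_ahead = discretized_prior(grid, center=12.0, spread=6.0)  # mass outside

borderline = 9.0  # observed just outside the edge of the zone
call_batter_ahead = call(borderline, grid, prior_batter_ahead, zone, noise_sd)
call_pitcher_ahead = call(borderline, grid, prior_pitcher_ahead, zone, noise_sd)
```

Under these assumed numbers, the same borderline observation is called a strike with the first prior and a ball with the second, while observations far from the edge (for example at 0 or at 16 inches) receive the same call under both priors.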

Count-based variation in rational expectations generates count-based variation in calls.

When the count favors the batter, the pitcher needs to throw a strike, and the batter can

afford not to swing. As a result, the umpire expects to make calls on pitches inside the

official strike zone. If he is unsure whether the pitch is a ball or a strike, he calls what he

expects: a strike. By contrast, when the count favors the pitcher, the pitcher can afford

to throw a ball, and the batter cannot risk letting a strike go by. As a result, the umpire

expects to make calls on pitches outside the official strike zone. If he is unsure whether the

pitch is a ball or a strike, he calls what he expects: a ball.3

These predictions match the data almost exactly. The enforced strike zone, or the region

in which umpires empirically call strikes in expectation, shrinks and expands across counts.

When the count most favors the batter, the enforced strike zone is 58% larger than when the

count most favors the pitcher, with the change concentrated near the top of the official strike

zone. We use the model to predict the region in which umpires call strikes in expectation,

and we find that in every count, this predicted strike zone is nearly coincident with the

enforced strike zone. Not only does the model reproduce the expansion and contraction of the

enforced strike zone—it also reproduces the spatial pattern in which calls on pitches near

the top of the official strike zone are most affected by the count. Our model is accurate and

parsimonious. Changes in the predicted strike zone with the count follow from the data—

i.e., from differences in rational expectations across counts. Just 2 parameters govern the

variance of the signal, which modulates the responsiveness of the posterior to the rational-

expectations prior.

Conventional wisdom among baseball enthusiasts, as well as previous academic research,

has attributed count-based differences in called-strike rates to a preference for avoiding game-

changing calls—i.e., by favoring the pitcher when the count favors the batter, and vice versa

(Moskowitz and Wertheim, 2011; Mills, 2014; Green and Daniels, 2015). We show that this

taste-based account both fails to predict other key patterns in the data and predicts patterns

that do not materialize. An aversion to game-changing, or pivotal, calls predicts that for

3 Our model ignores possible strategic behavior by the pitcher and batter in response to count-based deviations in the umpire’s calls. Folk theorems suggest that in the repeated game of an at-bat, a game, a season, or a career, almost any behavior can be rationalized. Invoking Occam’s razor, we posit a simple, testable model that, as we show, rationalizes the puzzles in the data.


decisions in which neither option is pivotal, called-strike rates will not vary. In fact, called-

strike rates do vary between counts in which both a ball and a strike would be non-pivotal,

and they do so in a manner that is predicted by our Bayesian model. The taste-based account

also predicts that non-count factors which amplify asymmetries in game impact (e.g., score

differential between teams) will amplify differences in called-strike rates. However, we show

that called-strike rates do not meaningfully depend on these factors. Instead, we show how a

rational process generates behaviors that appear anomalous (e.g. Backus, Blake and Tadelis,

2015; Miller and Sanjurjo, 2016).

Our results provide striking evidence of rational updating: a simple Bayesian model reproduces

otherwise puzzling empirical patterns. Though the model is straightforward, the

behavior it predicts requires considerable sophistication. Umpires’ rational expectations differ

by pitch location and by count, and the umpire has barely a second to appropriately

weigh these expectations against his noisy perception of the location of the pitch. While

dominant accounts predict that lay judgments will violate rationality under such conditions

(for a review, see Kahneman, 2011), whether the heuristics of experienced agents fail in similar

ways is less clear. Some have speculated that with repetition and feedback, heuristics

can be refined to approximate complex calculations (Kahneman and Klein, 2009). However,

evidence is limited and often finds that experts exhibit the same instinctive biases as inexperienced

subjects and to similar degrees (e.g. Haigh and List, 2005; Larson, List and Metcalfe,

2016). We find evidence of sophisticated, heuristic judgment by experts: the decisions of

umpires reflect an accurate, probabilistic, and count-specific understanding of the distribution

of called pitches, as well as an ability to integrate those prior beliefs in a manner that

approximates Bayes rule.

These skills likely result from some combination of instruction, selection, and feedback.

Professional umpires are trained to make accurate calls, and they are promoted and rewarded

based on performance. They also receive unparalleled feedback—MLB uses the same pitch

location data that we analyze to depict each umpire’s mistakes after every game—which

we suspect further hones instincts that mimic a complex Bayesian logic. The tradition of

bounded rationality presumes that individuals often lack the cognitive resources to behave

optimally (e.g. Simon, 1982). Our results suggest that highly rational behavior can be learned

intuitively, and hence, that feedback and repetition, rather than intelligence, can underpin

rationality.

The remainder of the paper is organized as follows. Section 2 describes the context and the

data. Section 3 shows how the enforced strike zone changes with the count. Section 4 details


our Bayesian model, illustrates its dynamics with examples, and evaluates its predictions

under different types of prior beliefs. In Section 5, we compare the model’s predictions

under a sophisticated, rational-expectations prior to the empirical patterns from Section 3.

Section 6 concludes.

2 Context and Data

Most plays in baseball begin with the pitcher throwing a pitch towards home plate, an

irregular pentagon in the ground. As shown in Figure 1, the batter stands beside home

plate, and the umpire looks over the catcher from behind home plate. If the batter chooses

not to swing, the home plate umpire then makes a call: either a ball or a strike. MLB

defines the official strike zone as “that area over home plate the upper limit of which is a

horizontal line at the midpoint between the top of the [batter’s] shoulders and the top of the

uniform pants, and the lower level is a line at the hollow beneath the kneecap.”4 Pitches

that intersect the official strike zone should be called strikes; pitches that do not intersect

the official strike zone should be called balls.

Figure 1: The batter stands beside home plate, and the umpire looks over the catcher from behind home plate. Major League Baseball defines the official strike zone, which is rendered on each image, as the three-dimensional region over home plate between the bottom of the batter’s kneecaps and the midline of his chest.

Umpires have not always adhered closely to the official strike zone. As recently as the

1990s, pitches far beyond the side of home plate—that hitters would have to lunge for—were

often called strikes, while high strikes—over the plate and between the hitter’s belt and

the midline of his chest—were almost always called balls. Major League Baseball could not

4http://mlb.mlb.com/mlb/official_info/umpires/strike_zone.jsp


remedy the problem by rewarding the least egregious violators, because the umpires union

mandated that umpires evenly split postseason assignments and the accompanying bonuses

(Callahan, 1998).

In 1999, MLB initiated three small measures aimed at reducing discrepancies between

enforced strike zones and the official strike zone: first, reminding all umpires of the definition

of the official strike zone; second, instructing team officials to monitor each umpire’s enforced

strike zone; and third, suspending an umpire who physically confronted a player—the first

suspension ever given to an umpire. A clumsy response by the umpires union paved the way

for baseball to strengthen the formal incentives faced by umpires. First, the union authorized

a strike. Then, upon realizing that its contract with MLB forbade a strike, the union tried

to dissolve itself—convincing 57 of the 66 union umpires to resign—so as to negotiate a new

contract. When a federal court ruled the dissolution null and void, Major League Baseball

accepted the resignations of 22 umpires and hired 30 new umpires (Callan, 2012). In an

instant, MLB reconstituted its pool of about 70 full-time umpires (O’Connell, 2007). To

become a full-time umpire today, a candidate must graduate in the top fifth of his class from

umpire school, rise through four levels of the minor leagues, be chosen to fill in for an umpire

on vacation, and then be chosen to replace a retiring umpire (Caple, 2011).

Home plate umpires in Major League Baseball now operate under a high degree of monitoring

and performance-based incentives. MLB uses pitch-tracking technology to evaluate

the calls of home plate umpires. In the early 2000s, MLB installed the QuesTec system in

half of its stadiums. Mills (2015) shows that these cameras increased umpires’ adherence to

the official strike zone. Prior to the 2008 season, MLB installed the more accurate PITCH

f/x system in every park. After each game, the home-plate umpire receives a breakdown of

his performance, including a score that measures the accuracy of his calls. Umpires are evaluated

twice each season, with evaluations based on reports from league-employed observers

and analysis of the camera data. MLB rewards good performance with lucrative assignments

to officiate the All-Star Game and the playoffs (Drellich, 2012).

We measure umpires’ adherence to their directive using pitch location data from the

PITCH f/x cameras. Stereoscopic cameras record the trajectory of each pitch at over a

dozen points. An algorithm then fits the observed coordinates to a parabola and infers the

2-dimensional location where the pitch intersects the plane that rises from the front of home

plate. MLB reports 4 measures of each pitch: that 2-dimensional location, the speed, the

lateral movement, and the maximal vertical deviation from the straight-line path. MLB also

labels each pitch with the categorical pitch type (e.g., four-seam fastball) from the pitcher’s


known repertoire that is best predicted by the observed speed and movement.

A pitch should be called a strike if its trajectory intersects the extruded pentagonal prism

of the official strike zone. Since MLB does not report the full trajectory of each pitch, we

instead compare the location of the pitch on the plane that rises from the front of home

plate to the rectangular slice of the official strike zone defined on that plane. The validity

of this comparison depends on two assumptions: first, that the parabolic fit correctly infers

the true location of the pitch on that plane, and second, that this location fully captures

the pitch’s trajectory through the space defined by the official strike zone. We make these

assumptions credible by restricting our sample to four-seam fastballs, which more than any

other pitch type follow a straight-line path from the pitcher’s hand to the catcher’s glove.5

This restriction allows us to disregard the handedness of the pitcher and the skill of the

catcher at framing pitches, both of which affect ball and strike calls significantly for off-

speed pitches but only minimally for fastballs (Deshpande and Wyner, 2017).

One issue remains when comparing 2-dimensional pitch locations to the official strike

zone. Whereas the horizontal dimension of the official strike zone is fixed to the width of

home plate, the vertical dimension of the official strike zone is determined by the batter’s

stance prior to the pitch. Hence, the absolute location of the pitch does not alone determine

whether a pitch is a ball or a strike. To address this issue, a human operator using a

separate camera (in center field) records the top and bottom boundaries of the official strike

zone from the batter’s stance prior to each pitch. Using these measures, we normalize the

vertical location by fixing the vertical height of the official strike zone to its average height.6
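The normalization of the vertical location can be sketched in a few lines. This is a minimal illustration of the rule described in the text and its footnote (recenter on the midline of the batter-specific zone, then rescale by the ratio of the average zone height to the zone height for that pitch). The function name and the input values in the example are hypothetical; the 22-inch default is the average zone height stated in the footnote.

```python
def normalize_vertical(pitch_height, zone_top, zone_bottom, average_zone_height=22.0):
    """Vertical location relative to the zone midline, rescaled to the average zone.

    All arguments are heights above the ground in inches. The result is the
    signed distance from the midline of a zone with the average height.
    """
    midline = (zone_top + zone_bottom) / 2.0
    zone_height = zone_top - zone_bottom
    return (pitch_height - midline) * average_zone_height / zone_height


# Hypothetical batter whose zone runs from 18 to 42 inches (height 24, midline 30).
at_midline = normalize_vertical(30.0, zone_top=42.0, zone_bottom=18.0)
at_top_edge = normalize_vertical(42.0, zone_top=42.0, zone_bottom=18.0)
at_bottom_edge = normalize_vertical(18.0, zone_top=42.0, zone_bottom=18.0)
```

A useful property of this rescaling is that a pitch on the batter's own top or bottom boundary always lands on the boundary of the average zone, here 11 inches above or below the midline, so pitches to batters of different heights become comparable.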

A key explanatory variable in our analysis is the count, which tracks the number of balls

and strikes in the at-bat, or the sequence of pitches between a pitcher and a batter. At the

beginning of an at-bat, the count has 0 balls and 0 strikes. When the batter does not swing

and the umpire calls a ball, the number of balls increases by one. When the batter swings,

or when he doesn’t and the umpire calls a strike, the number of strikes increases by one.7

5 On average, the maximal vertical deviation between a four-seam fastball’s trajectory and a straight-line path is 4.2 inches, compared to 7.9 inches for all other pitch types ($t = 1.8 \times 10^{3}$).

6 Specifically, we redefine the vertical location of each pitch in relation to the midline of the top and bottom boundaries for that pitch and then multiply this redefined location by the ratio of the average height of the official strike zone to the height for the current pitch. For example, a pitch observed 40 inches above the ground with top and bottom boundaries 47 and 21 inches above the ground, respectively, will first be redefined as having crossed 6 inches above the midline. Given that the average height of the official strike zone is 22 inches, our normalization places the pitch 5.5 inches above the midline.

7 The at-bat ends when the batter hits the ball in the field of play. If the batter hits the ball outside the field of play (a “foul ball”), the number of strikes in the count increases by one, and the at-bat continues. An exception is when the count has two strikes, in which case the count remains at two strikes. An exception to that exception is a bunt foul, which results in a strikeout when the count previously had two strikes.


Four balls end the at-bat with a walk, which favors the batting team; three strikes end the

at-bat with a strikeout, which favors the pitching team. Counts are denoted balls before

strikes. For instance, a 2-1 count means that the count has two balls and one strike.
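The counting rules above, including the foul-ball exceptions from the footnote, can be written as a small state update. This is an illustrative sketch only: the function name and the event labels (`"called_ball"`, `"swinging_strike"`, and so on) are hypothetical and do not correspond to fields in the PITCH f/x data.

```python
def update_count(balls, strikes, event):
    """Advance the count by one pitch.

    Returns (balls, strikes, outcome), where outcome is None while the
    at-bat continues, or "walk", "strikeout", or "in play" when it ends.
    """
    if event == "called_ball":
        balls += 1
    elif event in ("called_strike", "swinging_strike"):
        strikes += 1
    elif event == "foul":
        # A foul ball adds a strike unless the count already has two strikes.
        if strikes < 2:
            strikes += 1
    elif event == "bunt_foul":
        # A bunt foul adds a strike even when the count has two strikes.
        strikes += 1
    elif event == "in_play":
        return balls, strikes, "in play"
    else:
        raise ValueError(f"unknown event: {event}")

    if balls == 4:
        return balls, strikes, "walk"
    if strikes == 3:
        return balls, strikes, "strikeout"
    return balls, strikes, None


# Replay a hypothetical at-bat, recording the count after each pitch.
history = []
state = (0, 0, None)
for event in ["called_ball", "called_strike", "foul", "foul", "bunt_foul"]:
    state = update_count(state[0], state[1], event)
    history.append(state)
```

In this replay the count moves from 1-0 to 1-1 to 1-2, stays at 1-2 after the second foul, and ends in a strikeout on the bunt foul.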

Count   Pitches     Share   P(¬swing)   Calls       Share   P(strike)

0-0     526,357     0.28    0.72        378,463     0.37    0.46
0-1     207,615     0.11    0.52        108,744     0.11    0.23
0-2     111,115     0.06    0.51        56,523      0.06    0.09
1-0     204,572     0.11    0.57        117,167     0.12    0.43
1-1     169,137     0.09    0.45        76,146      0.07    0.25
1-2     146,846     0.08    0.42        61,365      0.06    0.13
2-0     85,915      0.05    0.58        49,940      0.05    0.48
2-1     107,119     0.06    0.39        42,260      0.04    0.28
2-2     132,848     0.07    0.33        44,453      0.04    0.17
3-0     35,531      0.02    0.93        32,934      0.03    0.64
3-1     58,078      0.03    0.44        25,369      0.02    0.38
3-2     101,698     0.05    0.25        25,143      0.02    0.17

Total   1,886,831   1.00    0.54        1,018,507   1.00    0.35

Table 1: Summary statistics by count, for four-seam fastballs.

The PITCH f/x cameras recorded 5.7 million pitches over the 2008 through 2015 regular

seasons,8 of which 34% are categorized as four-seam fastballs. Table 1 tabulates summary

statistics by count for these 1,886,831 four-seam fastballs. Batters tend not to swing when

the count is in their favor, even when the pitch will likely be called a strike. For instance,

batters choose not to swing at 93% of four-seam fastballs in 3-0 counts, even though 64%

of these pitches are called strikes. When the count favors the pitcher, batters swing more

aggressively. In 0-2 counts, for example, 49% of four-seam fastballs are swung at, and the

remainder are called strikes just 9% of the time. Overall, batters choose not to swing on 54%

of four-seam fastballs, yielding a sample of 1,018,507 calls, 35% of which are called strikes.

8 Fewer than 1% of pitches are not captured by the cameras. We also exclude the 2.8% of pitches for which the top and bottom strike zone measurements are likely erroneous, with a strike zone height greater than 28 inches or less than 18 inches.


3 Stylized Facts

Umpires deviate from the official strike zone by enforcing a strike zone that varies with the

count. In this section, we visualize these changes, showing that they are universal, an order

of magnitude larger than other systematic deviations among umpires, and inconsistent with

previous interpretations.

3.1 Changes in the enforced strike zone with the count

Figure 2: Probability of a strike call when the batter chooses not to swing, for listed counts.

(a) P(strike | 0-0)

(b) P(strike | 3-0) − P(strike | 0-2)

[Contour plots omitted. In both panels, the horizontal axis (in) and the vertical axis (in) each run from −18 to 18. The scale runs from 0 to 1 in panel (a) and from 0 to 0.7 in panel (b).]

Note: we estimate called strike probabilities in each count via kernel regression with a bivariate Gaussian kernel and Silverman’s rule-of-thumb bandwidth in each dimension. For the difference plot (b), we use the larger of the two bandwidths in each dimension.

Figure 2 shows the plane that rises from the front of home plate. The umpire looks

through this plane over the catcher and towards the pitcher. A right-handed batter stands

on the left, and a left-handed batter stands on the right. The dotted box bounds the official

strike zone—the width of home plate on the horizontal axis, and the average distance between

the bottom of the batter’s kneecaps and the midline of his chest on the vertical axis. The

contour lines in Figure 2a denote the rates at which umpires call strikes when the batter

chooses not to swing; here, we restrict to the first pitch of the at-bat, when the count has

0 balls and 0 strikes; we estimate these probabilities via kernel regression with a bivariate


Gaussian kernel and Silverman’s rule-of-thumb bandwidth in each dimension.9 Umpires

reliably make obvious calls. Pitches that intersect the center of the official strike zone are

almost always called strikes, and pitches that intersect the plane far outside the official

strike zone are almost always called balls. But in between, pitches at the same location are

sometimes called strikes and sometimes called balls. A band 6 to 8 inches wide separates

locations where strikes are called 90% of the time from locations where strikes are called just

10% of the time.
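The kernel regression behind these contours can be sketched as a Nadaraya-Watson estimator with a product Gaussian kernel. This is a minimal illustration on made-up data, not the paper's code. The bandwidth function uses one common multivariate form of Silverman's rule of thumb, h = sd · (4 / ((d + 2) n))^(1 / (d + 4)); that this is the exact variant used in the paper is an assumption, and the function names are hypothetical.

```python
import math
import statistics


def silverman_bandwidth(values, n_dims=2):
    """Rule-of-thumb bandwidth for one dimension of an n_dims-dimensional Gaussian kernel."""
    n = len(values)
    return statistics.stdev(values) * (4.0 / ((n_dims + 2) * n)) ** (1.0 / (n_dims + 4))


def strike_probability(x0, y0, pitches, calls, hx, hy):
    """Kernel-weighted share of strike calls around the location (x0, y0)."""
    numerator = 0.0
    denominator = 0.0
    for (x, y), called_strike in zip(pitches, calls):
        # Product Gaussian kernel: nearby pitches receive more weight.
        weight = math.exp(-0.5 * (((x - x0) / hx) ** 2 + ((y - y0) / hy) ** 2))
        numerator += weight * called_strike
        denominator += weight
    return numerator / denominator


# Toy data (inches): three called strikes near the origin, three called balls far away.
pitches = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
calls = [1, 1, 1, 0, 0, 0]

hx = silverman_bandwidth([x for x, _ in pitches])
hy = silverman_bandwidth([y for _, y in pitches])

p_near_strikes = strike_probability(0.0, 0.0, pitches, calls, hx, hy)
p_near_balls = strike_probability(10.0, 10.0, pitches, calls, hx, hy)
```

Evaluating `strike_probability` over a grid of locations and tracing the 0.1, 0.5, and 0.9 level sets would give contours of the kind described in the text. With only six toy observations the bandwidths here are far larger than the sub-inch values reported in the footnote.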

For pitches in this band, the probability of a strike call varies with factors other than

pitch location—in particular, the count. Figure 2b shows differences in the probability of a

strike call between the two counts for which the differences are greatest—counts with 3 balls

and 0 strikes and counts with 0 balls and 2 strikes.10 Holding pitch location fixed, a strike

call is more likely in a 3-0 count than in an 0-2 count at nearly every location. The difference

peaks near the boundary of the official strike zone and particularly at the top, where pitches

are as much as 60 percentage points more likely to be called strikes when the count favors

the batter than when it favors the pitcher.

Figure 3: Enforced strike zone in 3-0, 0-0, and 0-2 counts.

[Plot omitted. The horizontal axis (in) and the vertical axis (in) each run from −18 to 18; one boundary is drawn for each of the 3-0, 0-0, and 0-2 counts.]

Note: we estimate the enforced strike zone in each count using a support vector machine with a Gaussian kernel.

Umpires enforce different strike zones in different counts. Figure 3 shows the enforced

9 These bandwidths are 0.88 inches in the horizontal dimension and 0.86 inches in the vertical dimension.

10 For this difference plot, we estimate called strike probabilities separately in each count using the larger of the bandwidths in each dimension. These bandwidths are 1.7 and 1.8 inches in the horizontal and vertical dimensions, respectively.


strike zone, or the boundary that best separates ball and strike calls, for three different

counts; we estimate each enforced strike zone using a support vector machine with a Gaussian

kernel (cf. Friedman, Hastie and Tibshirani, 2001). In each of these counts, as in every other,

the enforced strike zone is wider than the official strike zone. It is also asymmetric on the

horizontal axis, wider to the left of home plate than to the right, owing to different enforced

strike zones for left- and right-handed batters, as shown in Figure A1 in the Appendix.11

Across counts, the size of the enforced strike zone varies. The enforced strike zone is 357

square inches in 0-2 counts and 565 square inches in 3-0 counts, an expansion of 58%.

The count-based deviations in the enforced strike zone appear to be universal. We reestimate

the enforced strike zone in Figure 3 for each of the 60 umpires who make at least

10,000 calls on four-seam fastballs.12 Every one of these umpires enforces a smaller strike

zone in 0-2 counts than in 3-0 counts. Among them, the smallest difference in area between

the enforced strike zone in these counts is 141 square inches, the maximum difference is 356

square inches, and the median difference is 218 square inches.

These changes are not limited to the most imbalanced counts. We estimate the probability

of a strike call in each count using a linear probability model with fixed effects by location

and count:

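$$
y_i = \sum_{z} \mathbb{1}\{x_i \in z\} \cdot \alpha_z + \sum_{b=0}^{3} \sum_{s=0}^{2} \mathbb{1}\{\text{balls}_i = b,\ \text{strikes}_i = s\} \cdot \beta_{b,s} + \varepsilon_i, \tag{1}
$$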

where $y_i$ is an indicator for a strike call, $x_i$ is the location of the pitch, $z$ indexes square-inch

regions on the plane that rises from home plate (each with its own fixed effect $\alpha_z$), $b$ and $s$ index

the number of balls and strikes in the count, and $\varepsilon_i$ is a mean-zero error term. Given that the

deviations are greatest where the correct call is uncertain, we restrict the sample to the 323,560 calls

for which the pitch is within 3 inches (about the diameter of a baseball) of the boundary of the official

strike zone. We hold out $\beta_{0,0}$, implying that $\beta_{b,s}$ represents the change in the probability of a strike

call from a 0-0 count in a count with $b$ balls and $s$ strikes, for a fixed pitch location and for

locations close to the boundary of the official strike zone.
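To make the role of the two sets of fixed effects in Equation 1 concrete, the sketch below fits a model of the same form on toy data. It is an illustration under stated assumptions rather than the paper's estimation code: it uses alternating within-group averaging (backfitting), which reaches the least-squares fit for this additive structure, instead of a regression package; it computes no standard errors, clustered or otherwise; and the function name, the cell labels, and the data are hypothetical.

```python
from collections import defaultdict


def fit_count_effects(cells, counts, calls, baseline=(0, 0), n_iter=100):
    """Fit call = alpha[cell] + beta[count] + error, with beta[baseline] held at 0."""
    alpha = defaultdict(float)
    beta = defaultdict(float)
    for _ in range(n_iter):
        # Location effects: mean of (call - count effect) within each cell.
        total, n = defaultdict(float), defaultdict(int)
        for z, c, y in zip(cells, counts, calls):
            total[z] += y - beta[c]
            n[z] += 1
        for z in total:
            alpha[z] = total[z] / n[z]
        # Count effects: mean of (call - location effect) within each count.
        total, n = defaultdict(float), defaultdict(int)
        for z, c, y in zip(cells, counts, calls):
            total[c] += y - alpha[z]
            n[c] += 1
        for c in total:
            beta[c] = total[c] / n[c]
        # Hold out the baseline count, as the text does for the 0-0 count.
        shift = beta[baseline]
        for c in beta:
            beta[c] -= shift
        for z in alpha:
            alpha[z] += shift
    return dict(alpha), dict(beta)


# Toy data: strikes out of 10 calls for each (location cell, count) pair.
strikes_out_of_10 = {
    ("A", (0, 0)): 5, ("A", (3, 0)): 6, ("A", (0, 2)): 3,
    ("B", (0, 0)): 3, ("B", (3, 0)): 4, ("B", (0, 2)): 1,
}
cells, counts, calls = [], [], []
for (cell, count), k in strikes_out_of_10.items():
    for i in range(10):
        cells.append(cell)
        counts.append(count)
        calls.append(1 if i < k else 0)

alpha, beta = fit_count_effects(cells, counts, calls)
```

In this toy data the called-strike rate is 10 percentage points higher in 3-0 counts and 20 points lower in 0-2 counts than in 0-0 counts at both locations, so the fitted count effects are 0.1 and −0.2 relative to the held-out 0-0 count.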

Figure 4 shows estimates of $\beta_{b,s}$ with 95% confidence intervals from standard errors

11 For right-handed batters, the enforced strike zone is close to symmetric across the midline of home plate. For left-handed batters, however, the enforced strike zone is asymmetric: pitches left of home plate and away from the batter are called strikes at higher rates than pitches a similar distance right of home plate and near the batter. An amateur umpire we spoke with attributes this phenomenon to differences in where the umpire stands behind the catcher for left- and right-handed batters.

12 This threshold roughly separates umpires who worked full-time over the entire observation window from those who were hired full-time after 2008 or retired prior to 2015.

11

Figure 4: Estimates of βb,s, the count-specific change in the probability of a called strike from a 0-0 count, from Equation 1. 95% confidence intervals reflect standard errors clustered by umpire.

clustered by umpire. The probability of a strike differs across counts. In 3-0 counts, a strike

is 7.0 percentage points more likely on average than in 0-0 counts. In 0-2 counts, a strike

is 14.3 percentage points less likely on average than in 0-0 counts. The probability of a

strike also differs significantly from 0-0 counts in less imbalanced counts. In 2-0 counts, for

instance, a strike is 4.3 percentage points more likely on average than in 0-0 counts, and in

0-1 counts, a strike is on average 9.7 percentage points less likely. Adding a strike to the

count always decreases the probability of a strike call, and adding a ball to the count always

increases the probability of a strike call. These estimates strongly reject the null hypothesis

that calls do not depend on the count, or that the βb,s terms are equivalent; F(11,117) =

210.63, p < 10^{-3}.

The deviations appear to be independent of Major League experience. We reestimate

Equation 1 with an additional set of count fixed effects, and we interact this set with an

indicator for whether the umpire is one of the 60 who makes at least 10,000 calls on fastballs

over our observation window. These parameters measure count-by-count differences in the

probability of a strike call between experienced and inexperienced umpires. An F-test cannot

reject the null hypothesis that each of these 12 parameters is zero; F(12,117) = 0.93, p = 0.52.

Even the least experienced Major League umpires have been professional umpires for a

decade or more. Perhaps as a result, Major League experience does not appear to influence

the degree to which umpires change the enforced strike zone with the count.


3.2 Comparison with other deviations

                          (1)                (2)                (3)
Strike on last pitch      -0.019 (0.0034)
Home team batting                            -0.0046 (0.0017)
3 balls & runner on 1st                                         -0.0027 (0.0051)
2 strikes & 2 outs                                              0.0073 (0.0041)
Pitching team ahead                                             0.0039 (0.0021)
Batting team ahead                                              0.0015 (0.0022)
F-test: βb,s              207.11             209.96             210.51
Calls                     323560             323560             323560

Table 2: Estimates of the linear probability model in Equation 1 with covariates. Parameter estimates represent the change in the probability of a strike call for locations close to the boundary of the official strike zone, holding fixed the location of the pitch and the count. The sample is restricted to the 323,560 calls for which the pitch intersected the plane that rises from the front of home plate within 3 inches of the official strike zone. The dependent variable is an indicator for a strike call, and all models include fixed effects for 1) the square-inch location of the pitch, and 2) the count. F-test: βb,s reports an F-statistic for equivalence among the count fixed effects βb,s. Standard errors (in parentheses) are clustered by umpire.

Changes in the called-strike rates with the count are an order of magnitude larger than

other deviations among umpires. Chen et al. (2016) find that umpires make negatively correlated

judgments: the probability of a strike call declines when the previous pitch was called a

strike. Other work has shown that soccer referees favor the home team (Sutter and Kocher,

2004; Garicano, Palacios-Huerta and Prendergast, 2005). We estimate Equation 1 with covariates

that measure these biases. Table 2 lists the corresponding parameter estimates as

well as an F-statistic for equivalence among the βb,s terms. Model 1 tests for negative autocorrelation

by including an indicator for whether the last pitch in the at-bat was called


a strike.13 The probability of a strike call decreases by 1.9 percentage points when the last

pitch in the at-bat was called a strike (|t| = 5.68), consistent with negative autocorrelation

in calls. Model 2 tests for home-team bias by including an indicator for whether the home

team is batting. The probability of a strike call decreases by 0.46 percentage points when

the home team is batting (|t| = 2.69), consistent with favoritism of the home team. In

comparison, a strike is 21.2 percentage points more likely in a 3-0 count than in an 0-2 count

(|t| = 38.56).

The Appendix visualizes these results. Figure A2 shows the enforced strike zone for calls

for which the previous pitch in the at-bat was called a strike and for calls for which the previous

pitch in the at-bat was not called a strike, for each count with at least one strike. In each

count, the enforced strike zone depends minimally, if at all, on whether the previous pitch in the

at-bat was called a strike. Figure A3 shows the difference in the probability of a strike call

between calls when the home team is pitching and when the away team is pitching. The

contour lines trace a flat plane at zero, suggesting minimal, if any, distortion.

3.3 Alternative interpretation

Changes in the enforced strike zone with the count have been interpreted as a form of status

quo bias (cf. Samuelson and Zeckhauser, 1988) in which umpires are averse to game-changing

calls—i.e., to a fourth ball or a third strike, both of which end the at-bat (Moskowitz and

Wertheim, 2011; Mills, 2014; Green and Daniels, 2015). This interpretation is inconsistent

with two patterns in the data. First, an aversion to game-changing calls predicts that umpires

will avoid calling balls principally when the count has 3 balls, and that they will avoid calling

strikes principally when the count has 2 strikes. But as Figure 4 shows, the probability of a

strike call varies smoothly with the number of balls and strikes in the count.

Second, a bias towards the status quo predicts that non-count factors which amplify the

impact of one call—such as men on base, number of outs, or the score differential—will

amplify the umpire’s preference for the other call. However, we find that the enforced strike

zone does not meaningfully depend on any of these factors. Model 3 of Table 2 includes

covariates that proxy for asymmetries in game impact. We include an indicator for a count

with 3 balls and a runner on first base; presumably, a walk is more beneficial to the batting

team when it would move a runner into scoring position. We also include an indicator for a

count with 2 strikes and 2 outs in the inning; presumably, a strikeout is more beneficial to the

13 The left-out condition comprises calls made on the first pitch of the at-bat, calls for which the previous pitch was called a ball, and calls for which the batter swung at the previous pitch.

pitching team when it ends the inning. Finally, we include indicators for whether the pitching

team is ahead and whether the batting team is ahead; presumably, an umpire who prefers

evenness would favor the team that is behind. None of these variables meaningfully predicts

the call, and their inclusion does not diminish the significance of the count indicators.14

4 Theoretical Framework

We show that a simple Bayesian model can account for the observed changes in the enforced

strike zone with the count. After writing down the model, we illustrate its dynamics through

examples and compare its predictions under different types of prior beliefs.

4.1 Bayesian model

We consider an umpire who holds prior beliefs about where the pitch is likely to be thrown,

conditional on the batter choosing not to swing. The umpire also conditions his prior beliefs

on the game state S—e.g., the count—as different states may imply different beliefs about

pitcher and batter behavior. These prior beliefs represent the umpire’s best guess about

where the pitch will be thrown, prior to observing the pitch itself. We denote the umpire’s

prior beliefs as P (x0|S,¬swing), where x0 is the true location of the pitch.

When the pitch is thrown, the umpire observes a noisy signal of its true location. In

particular, he observes the location xu, which is drawn from a distribution, Fxu|x0 , centered

at the true location and known to the umpire. Hence, the umpire expects to observe the

pitch at its true location, but acknowledging the difficulty of observing the exact location

at which a fastball intersects the plane that rises from the front of home plate, he places

14 In a previous working paper, we interpreted the count-based distortions as an aversion to making the more pivotal call (Green and Daniels, 2015). We provided two pieces of evidence in support of this interpretation. First, we showed that holding the count fixed, the call depends on other factors that make one call more pivotal than the other, which we termed the differential impact. Second, we showed that this dependence becomes more extreme in high-visibility games, namely during national weekend telecasts. For both analyses, we calculated the differential impact in the following manner. First, we used the count, number of outs, and indicators for men on base to estimate the number of runs that a team in that situation could expect to score over the remainder of the half inning. Second, we calculated the differential impact of a strike as the change in the expected number of runs from calling a strike minus the change in the expected number of runs from calling a ball. We then estimated the bias, defined as the change in the probability of a called strike from game states with a differential impact of 0, as a smooth function of the differential impact (holding constant the location of the pitch). For both analyses, our specification forced the estimate through the origin, thereby requiring that for pitches in which a ball and a strike were assessed to be similarly (non-)pivotal, umpires would call strikes at the same rate. Specifications that relax this restriction show no dependence on factors other than the count, nor do they show any differences between national weekend telecasts and other games.

positive probability on nearby locations. The signal is simply fxu|x0 , which we denote as

P (xu|x0). Note that this signal depends only on the umpire’s eyesight, which we assume to

be independent of the game state and whether or not the batter swings.15

Given prior beliefs and the signal, posterior beliefs follow from Bayes’ rule:

$$P(x_0 \mid x_u, S, \neg\text{swing}) \propto P(x_u \mid x_0)\, P(x_0 \mid S, \neg\text{swing}) \tag{2}$$

The umpire’s prior beliefs about where the pitch will be thrown shift his posterior beliefs

about where the pitch was thrown towards locations that are a priori more expected in the

current state.

After integrating his prior beliefs with the signal, the umpire compares his posterior

beliefs to a fixed rectangle, which we denote Z. He then makes the call that is more likely

to be correct according to his posterior beliefs. If the majority of the posterior density lies

inside Z, he calls a strike; if the majority of the posterior density lies outside Z, he calls a

ball. Let θS(xu) equal 1 if the umpire calls a strike, and 0 otherwise. Then,

$$\theta_S(x_u) = 1 \iff \int_Z P(z \mid x_u, S, \neg\text{swing})\, dz \ge \frac{1}{2} \tag{3}$$

This Bayesian decision rule is deterministic. An observation xu and a state S jointly determine

the umpire’s posterior beliefs and, along with the boundaries of Z, the call he makes.
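The update in Equation 2 and the decision rule in Equation 3 can be sketched numerically on a discrete grid. The code below is a minimal illustration under stated assumptions: the half-inch grid, the rectangle `zone`, the signal scale `sigma`, and the center-weighted prior are all hypothetical choices, not values taken from the paper.

```python
import numpy as np

def posterior_on_grid(prior, gx, gy, xu, sigma):
    """Equation 2 on a grid: posterior is proportional to likelihood times prior."""
    # Gaussian likelihood (zero covariance) of observing xu at each candidate true location.
    loglik = -0.5 * (((xu[0] - gx) / sigma[0]) ** 2 + ((xu[1] - gy) / sigma[1]) ** 2)
    post = prior * np.exp(loglik)
    return post / post.sum()

def calls_strike(post, gx, gy, zone):
    """Equation 3: strike if at least half of the posterior mass lies inside Z."""
    left, right, bottom, top = zone
    inside = (gx >= left) & (gx <= right) & (gy >= bottom) & (gy <= top)
    return bool(post[inside].sum() >= 0.5)

# Hypothetical setup: half-inch grid, rectangle Z, and a 3-inch signal scale.
xs = np.linspace(-18.0, 18.0, 73)
gx, gy = np.meshgrid(xs, xs, indexing="ij")
zone = (-8.5, 8.5, -10.0, 10.0)
sigma = (3.0, 3.0)
xu = (9.0, 0.0)                                   # observed just outside the right edge

flat_prior = np.ones_like(gx)                     # all locations equally likely
center_prior = np.exp(-0.5 * (gx**2 + gy**2) / 6.0**2)  # mass concentrated near the center

flat_call = calls_strike(posterior_on_grid(flat_prior, gx, gy, xu, sigma), gx, gy, zone)
center_call = calls_strike(posterior_on_grid(center_prior, gx, gy, xu, sigma), gx, gy, zone)
print(flat_call, center_call)
```

In this setup the same observation yields a ball under the flat prior and a strike under the center-weighted prior, because the prior pulls the posterior toward locations it considers more likely.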

Whereas this decision rule makes a binary prediction for xu, it makes a probabilistic

prediction for x0. This probability is simply the expected call, with the expectation taken

over the distribution of the umpire’s observation:

$$P(\text{strike} \mid x_0, S) = E_{x_u \mid x_0}[\theta_S(x_u)] \tag{4}$$

The probability of a strike call for a pitch at x0 is the share of draws xu for which a majority

of the umpire’s posterior beliefs are bounded by Z.

This probabilistic prediction defines the predicted strike zone, or the set of true locations

at which the umpire calls strikes more often than balls. Let yS(x0) equal 1 if a pitch at x0

is called a strike in expectation, and 0 otherwise. Hence,

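$$y_S(x_0) = \mathbb{1}\left\{ P(\text{strike} \mid x_0, S) \ge \frac{1}{2} \right\} = \mathbb{1}\left\{ E_{x_u \mid x_0}[\theta_S(x_u)] \ge \frac{1}{2} \right\} \tag{5}$$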

15 Assuming independence between the likelihood and the batter’s swing decision is innocuous as the umpire only makes a call when the batter chooses not to swing.

The model predicts that a pitch observed by the cameras at x0 will be called a strike on

average if, for the majority of nature’s draws of xu, the majority of the umpire’s posterior

beliefs are bounded by Z.

4.2 Examples

Prior beliefs that coincide with rational expectations over the location of the pitch can explain

changes in the enforced strike zone with the count. When the count favors the batter, the

pitcher needs to throw a strike, and the batter can afford not to swing. Hence, umpires

expect to make calls on pitches inside the official strike zone, and they shift their posterior

beliefs towards the official strike zone. When the count favors the pitcher, the pitcher can

afford to throw a ball, and the batter cannot risk letting a close pitch go by. Hence, umpires

expect to make calls on pitches outside the official strike zone, and they shift their posterior

beliefs about the true location of the pitch away from the official strike zone.16

Figure 5 illustrates these patterns. The first column shows the empirical distribution

of pitch locations. Overall (5a), pitches concentrate at the center of the plane, with 42%

thrown in the official strike zone. In 3-0 counts (5d), pitchers need to throw a strike. As a

result, the distribution concentrates at the center, with 51% of pitches thrown in the official

strike zone. In 0-2 counts (5g), pitchers need not throw a strike. As a result, the distribution

is flatter, with only 24% of pitches thrown in the official strike zone.

The second column shows count-specific rates at which batters choose not to swing.

Overall (5b), batters discriminate by location, swinging at pitches at the center of the plane,

and taking pitches on the periphery. In 3-0 counts (5e), batters have the luxury of not

swinging at a strike. As a result, they rarely swing, regardless of the location of the pitch.

In 0-2 counts (5h), batters cannot afford a called third strike. As a result, they swing at

most pitches near the official strike zone and rarely choose not to swing at a pitch that is

clearly inside of it.

These tendencies underlie vastly different distributions of pitch locations for calls, which

constitute the umpire’s rational expectations about where the pitch will be thrown, conditional

on the batter choosing not to swing. These expectations are shown in the third

column. Overall (5c), umpires disproportionately make calls on the left and right edges of

the official strike zone, with 25% of called pitches intersecting the official strike zone. In

16 We are indebted to Jim Andreoni for proposing this logic to us in 2015. To our knowledge, the first public mention of this explanation was a 2016 blog post on Baseball Prospectus: http://www.baseballprospectus.com/article.php?articleid=28513.

Figure 5: Pitch density, rate of non-swings, and pitch density conditional on not swinging, overall (a-c) and in 3-0 (d-f) and 0-2 (g-i) counts.

Panels: (a) P(x0); (b) P(¬swing|x0); (c) P(x0|¬swing); (d) P(x0|3-0); (e) P(¬swing|x0, 3-0); (f) P(x0|¬swing, 3-0); (g) P(x0|0-2); (h) P(¬swing|x0, 0-2); (i) P(x0|¬swing, 0-2). Each panel is a contour plot over the horizontal and vertical axes, in inches from -18 to 18.

Note: Pitch densities (a,c,d,f,g,i) are kernel density estimates. Non-swing rates (b,e,h) are kernel regression estimates. All estimates use a bivariate Gaussian kernel with Silverman’s rule-of-thumb bandwidth in each dimension.

3-0 counts (5f), umpires disproportionately make calls on pitches near the side and bottom

boundaries of the official strike zone, and 49% of called pitches intersect the official strike

zone. And in 0-2 counts (5i), umpires disproportionately make calls on pitches above and

left and right of the official strike zone, with just 6% of called pitches intersecting the official

strike zone.

Given a prior and a signal, a Bayesian actor forms a posterior and makes a decision. In our

framework, the prior coincides with the umpire’s rational expectations and can be estimated

from the data, as in Figure 5. The signal, however, is unobserved to the analyst. So far, we

have placed minimal structure on the signal, assuming only that the umpire’s observation is

unbiased. Here, we further assume that the signal is drawn from a bivariate normal distribution centered at the true location and with zero covariance, i.e., $x_u \sim N\left(x_0, \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}\right)$.

For the examples in this section, we set $\sigma_1^2 = \sigma_2^2 = 3^2$ inches. This implies that 20%, 59%,

and 86% of observations are within 2, 4, and 6 inches of the true location, respectively.
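The quoted shares can be checked with a short calculation. When both dimensions have the same standard deviation and zero covariance, the distance between the observed and true locations follows a Rayleigh distribution, so the share within radius r is 1 − exp(−r²/(2σ²)). The sketch below assumes a standard deviation of 3 inches in each dimension; the function name is hypothetical.

```python
import math

def share_within(radius, sigma):
    """P(distance <= radius) for a circular bivariate normal with common SD sigma."""
    return 1.0 - math.exp(-radius**2 / (2.0 * sigma**2))

# Shares of observations within 2, 4, and 6 inches of the true location, in percent.
shares = [round(100 * share_within(r, 3.0)) for r in (2, 4, 6)]
print(shares)  # [20, 59, 86]
```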

Figure 6: P (x0|S,¬swing): Posterior beliefs under a flat prior (a), under a rational-expectations prior (b), and under a count-specific rational-expectations prior for a 3-0 count (c) and an 0-2 count (d). Contour lines show the posterior distribution, the X marks the observed location, the dotted box denotes the official strike zone, and the black curve denotes the boundary between ball and strike calls.

Panels: (a) Flat prior; (b) RE prior; (c) RE in 3-0 count; (d) RE in 0-2 count. Each panel shows the horizontal and vertical axes, in inches from 0 to 15.

To see how the different prior expectations affect the umpire’s posterior beliefs, consider

a pitch observed to pass through the top-right corner of the official strike zone. The contour

lines in Figure 6a show the posterior beliefs held by an umpire with a flat prior—i.e., who

expects all locations to be equally likely. Here, we assume that Z coincides with the official

strike zone, which is depicted on the figure by a dotted box. The observed location in

Figure 6a is borderline: given a flat prior, about half of the posterior density lies inside the

official strike zone. Borderline observations intersect the decision boundary, which is traced

by the black curve. Under a flat prior, the decision boundary coincides with the official


strike zone along its edges but has rounded corners, implying that pitches observed in the

corners of the official strike zone will be called balls. For observations along the edges, half

of the posterior density lies in the official strike zone. But for observations on the corners, only

one-quarter of the posterior density lies in the official strike zone.
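The one-half and one-quarter figures follow from the product structure of the posterior. Under a flat prior on an unbounded plane (an assumption; the paper works on a bounded grid), the posterior is a normal distribution centered at the observation, so its mass inside a rectangle is a product of two one-dimensional normal probabilities. The rectangle and scale below are hypothetical values chosen for illustration.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass_in_zone(xu, sigma, zone):
    """Posterior mass inside a rectangle when the posterior is N(xu, sigma^2 I)."""
    left, right, bottom, top = zone
    px = norm_cdf((right - xu[0]) / sigma) - norm_cdf((left - xu[0]) / sigma)
    py = norm_cdf((top - xu[1]) / sigma) - norm_cdf((bottom - xu[1]) / sigma)
    return px * py

zone = (-8.5, 8.5, -10.0, 10.0)   # hypothetical rectangle, inches from the midlines
sigma = 3.0                       # assumed signal scale, in inches

edge = mass_in_zone((8.5, 0.0), sigma, zone)     # observation on the middle of the right edge
corner = mass_in_zone((8.5, 10.0), sigma, zone)  # observation on the top-right corner
print(round(edge, 3), round(corner, 3))
```

On an edge only one dimension is split in half, while on a corner both are, which is why the corner observation falls well short of the one-half threshold and the decision boundary is rounded there.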

The umpire’s posterior beliefs, and thus the decision boundary, change with his prior

beliefs. Figure 6b shows the umpire’s posterior beliefs under the unconditional rational-

expectations prior shown in Figure 5c. Near the observed location, the gradient of the

umpire’s prior beliefs is flat—on average, locations around the top-right corner of the official

strike zone are similarly likely. As a result, the umpire’s posterior beliefs under this prior

replicate his beliefs under the flat prior for an observation at this location.

Figure 6c shows the umpire’s posterior beliefs in a 3-0 count under the count-specific

rational-expectations prior shown in Figure 5f. Near the observed location, points closer

to the center are more expected than those farther away—in 3-0 counts, locations inside

the official strike zone are more likely than locations outside—shifting the umpire’s posterior

beliefs inward. A majority of the posterior density now lies inside the official strike zone, and

the umpire calls a strike as a result. The decision boundary expands outward to encompass

the observed location.

Figure 6d shows the umpire’s posterior beliefs in a 0-2 count under the count-specific

rational-expectations prior shown in Figure 5i. Near the observed location, points away

from the center are more expected than those towards the center—in 0-2 counts, locations

outside the official strike zone are more likely than locations inside—shifting the umpire’s

posterior beliefs outward. A majority of the posterior density now lies outside the official

strike zone, and the umpire calls a ball as a result. The decision boundary shrinks inward,

excluding the observed location.

In both counts, the umpire observes the pitch at the same location, but he makes opposite

calls. When the count favors the batter, he calls a larger strike zone, and when the count

favors the pitcher, he calls a smaller strike zone.

4.3 Accuracy

For the umpire with imperfect eyesight, incorporating count-specific expectations over the

location of the pitch improves the accuracy of his calls. The intuition becomes apparent in

the extreme: a blind umpire would guess correctly more often if he assumes that the pitch is

drawn from the true empirical distribution. This logic also applies when the umpire’s vision

is imperfect: the umpire will make the correct call more often if he shifts his beliefs towards


locations that are more likely.

Let the error rate at location x0 in state S be the absolute difference between an indicator

for whether x0 is in the official strike zone and the predicted probability of a strike in state S

from Equation 4, defining the states as the 12 counts. If at a location inside the official strike

zone, the umpire calls a strike with probability 70%, then the error rate at that location is

30%.

To compute the predicted strike probability at a given location, we first discretize the

plane that rises from home plate into a grid of one-inch squares bounded at 18 inches from

the midline on both dimensions. In each square, we predict the probability of a strike by

independently drawing Nsim = 1000 observed locations, which we denote xu, from Fxu|x0 .

As in the previous section, we assume that Fxu|x0 follows a bivariate normal distribution with

variance terms $\sigma_1^2$ and $\sigma_2^2$ and no covariance. For each simulated observation, we integrate

the signal with the prior according to Bayes’ rule, and sum the posterior density over the

official strike zone. If at least half of the posterior density is inside the official strike zone,

then $\theta_S(x_{u,i}; \sigma_1^2, \sigma_2^2) = 1$, and the umpire calls a strike; otherwise, $\theta_S(x_{u,i}; \sigma_1^2, \sigma_2^2) = 0$, and the umpire calls a ball.

The probability of a strike is the average call over the 1000 simulations:

$$P(\text{strike} \mid x_0, S; \sigma_1^2, \sigma_2^2) = \frac{1}{N_{sim}} \sum_{i=1}^{N_{sim}} \theta_S(x_{u,i}; \sigma_1^2, \sigma_2^2) \tag{6}$$

And the error rate is the absolute difference between that probability and an indicator for

whether the true location is in the official strike zone:

$$\text{Error}(x_0, S; \sigma_1^2, \sigma_2^2) = \left| \mathbb{1}\{x_0 \in Z\} - P(\text{strike} \mid x_0, S; \sigma_1^2, \sigma_2^2) \right|$$

We then compute the average error rate, λ, weighting by P (x0, S), the joint probability of

observing a call at x0 in count S:

$$\lambda = \frac{\sum_{x_0} \sum_{S} \text{Error}(x_0, S; \sigma_1^2, \sigma_2^2) \cdot P(x_0, S)}{\sum_{x_0} \sum_{S} P(x_0, S)}$$

For a given pair of variance terms, we perform this computation once for each of the three

types of prior beliefs from the previous section.
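The simulation in Equation 6 and the weighted error rate λ can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a single state, a flat prior (so the posterior mass inside the rectangle has a closed form), a hypothetical rectangle `ZONE`, three hand-picked true locations, and equal placeholder weights in place of P(x0, S).

```python
import math
import numpy as np

rng = np.random.default_rng(1)
ZONE = (-8.5, 8.5, -10.0, 10.0)   # hypothetical rectangle Z, in inches
SIGMA = 3.0                       # assumed signal scale in each dimension
N_SIM = 1000                      # simulated observations per true location

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def theta_flat(xu):
    """Call under a flat prior: strike if >= half of N(xu, SIGMA^2 I) lies inside ZONE."""
    left, right, bottom, top = ZONE
    px = norm_cdf((right - xu[0]) / SIGMA) - norm_cdf((left - xu[0]) / SIGMA)
    py = norm_cdf((top - xu[1]) / SIGMA) - norm_cdf((bottom - xu[1]) / SIGMA)
    return 1.0 if px * py >= 0.5 else 0.0

def strike_probability(x0, theta):
    """Equation 6: average call over N_SIM observations drawn around the true location."""
    draws = rng.normal(loc=x0, scale=SIGMA, size=(N_SIM, 2))
    return float(np.mean([theta(xu) for xu in draws]))

def in_zone(x0):
    left, right, bottom, top = ZONE
    return float(left <= x0[0] <= right and bottom <= x0[1] <= top)

def average_error(locations, weights, theta):
    """Weighted mean of |1{x0 in Z} - P(strike | x0)| over the given true locations."""
    errors = [abs(in_zone(x0) - strike_probability(x0, theta)) for x0 in locations]
    return float(np.average(errors, weights=weights))

locations = [(0.0, 0.0), (8.0, 0.0), (20.0, 0.0)]   # center, near the edge, far outside
weights = [1.0, 1.0, 1.0]                           # placeholder for P(x0, S)
lam = average_error(locations, weights, theta_flat)
print(round(lam, 3))
```

Swapping `theta_flat` for a decision function built on a grid posterior with an empirical prior would give the analogous error rate under the other prior types discussed in the text.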

Figure 7a shows the error rate under the flat prior, for a range of variance terms. For

near-zero values of σ1 and σ2, the umpire has near-perfect eyesight, and he is near-certain

Figure 7: Error rates under a flat prior (a), along with the percent improvement in the error rate under an unconditional rational-expectations prior (b) and under a count-specific rational-expectations prior (c), for different values of the scale parameters $\sigma_1^2$ and $\sigma_2^2$.

Panels: (a) λflat: flat prior; (b) 1 − λRE/λflat; (c) 1 − λS/λRE. Each panel is a contour plot with both axes running from 1 to 6.

that the observed location is the true location. Given a degenerate and accurate signal, the

prior is irrelevant, and the error rate is zero under either prior. As the variance grows, the

umpire’s observations are drawn from a wider distribution—i.e., his eyesight worsens—and

the error rate increases.

Rational expectations alone do not improve error rates. Figure 7b shows the percent

improvement in the error rate under the unconditional rational-expectations prior, relative

to the flat prior. For reasonable values of the variance terms (i.e., those for which the umpire’s

eyesight is not profoundly worse in one direction than the other), error rates are comparable

under the two priors. Umpires experience minimal, if any, benefits from shifting their beliefs

towards locations that are more expected on average.

The Bayesian umpire lowers his error rate not by holding unconditional rational expectations,

but by conditioning those expectations on the count. Figure 7c shows the percent improvement

under the count-specific prior, relative to the unconditional rational-expectations

prior. Except for umpires with near-perfect vision, umpires benefit from shifting their beliefs

towards locations that, given the count, are more expected. At σ1 = σ2 = 3 inches, for

instance, λS is about 2% lower than λRE.

This improvement is greatest in counts for which the count-specific empirical distribution

of called pitches is most different from the overall distribution. For σ1 = σ2 = 3 inches, λS is

11% lower than λRE in 0-2 counts, 9% lower in 1-2 counts, 7% lower in 3-2 and 2-2 counts,

and 5% lower in 3-0 counts. The distribution of called pitches varies considerably between


counts, as Figures 5f and 5i show. Approximating these distributions by taking their average,

as in Figure 5c, ignores this variation and worsens performance.

5 Model estimates

Does a Bayesian umpire reproduce the observed changes in the enforced strike zone with

the count, as depicted in Section 3? We calibrate the model and show that under the

count-specific rational-expectations prior, it predicts these changes almost exactly.

5.1 Estimation steps

Assuming, as before, that the umpire’s observation is drawn from a normal distribution with

zero covariance, our model is characterized by six parameters: the variance terms, $\sigma_1^2$ and

$\sigma_2^2$, as well as the four boundaries of Z. Previously, we chose the variance terms arbitrarily

or estimated the model for a range of values, and we assumed that the boundaries coincide

with the official strike zone. Here, we find the parameters that best predict observed calls.17

Specifically, we use non-linear least squares to find the parameter values that minimize

the average area between the count-specific enforced and predicted strike zones, weighted by

the number of observed calls in each count:

$$\arg\min_{\sigma_1^2, \sigma_2^2, Z} \sum_{S} \frac{N_S}{\sum_{S} N_S} \sum_{x_0} \left( y_S(x_0) - y_S(x_0; \sigma_1^2, \sigma_2^2, Z) \right)^2 \tag{7}$$

Here, yS(x0) defines the enforced strike zone in count S, equaling 1 when a pitch at the

true location x0 is more often called a strike than a ball in the data, and 0 otherwise. As

in Equation 5, $y_S(x_0; \sigma_1^2, \sigma_2^2, Z)$ is the predicted strike zone in count S, equaling 1 when a

pitch at the true location x0 is called a strike in expectation by our Bayesian umpire, and 0

otherwise. As before, we discretize the plane into a square-inch grid bounded at 18 inches

from the midlines.

We estimate the enforced strike zone using a support vector machine with a Gaussian

kernel, as in Figure 3. The SVM separates locations at which strikes are called a majority of

the time from locations at which balls are called a majority of the time. The solid boundaries

in Figure 10 below show the enforced strike zone in each count. We estimate the predicted

17 We relax the boundaries of Z to account for deviations that are beyond the scope of this paper, in particular, that umpires call strikes in expectation to the left and right of the official strike zone but not above or below, as shown in Figure 3.

strike zone as in the previous section. For each true location, we calculate the probability of a

strike call from 1000 simulated calls under the current parameter estimates, as in Equation 6,

classifying the location as inside the predicted strike zone if the probability of a strike call

at that location is at least 0.5.

The objective function in Equation 7 counts discrepancies between the enforced and

predicted strike zones across the grid. We take a weighted average over the counts, weighting

each count by NS, the number of calls observed in that count. Thus, the value of the objective

function is the average area, in square inches, over which the enforced and predicted strike

zones disagree. We find the parameters that minimize Equation 7 using a genetic algorithm.18
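The objective in Equation 7 can be sketched as a function of 0/1 strike-zone masks on the one-inch grid. The function and the toy masks below are hypothetical and stand in for the SVM-estimated enforced zones and the simulated predicted zones; the search over parameters (a genetic algorithm in the paper) is not shown.

```python
import numpy as np

def objective(enforced, predicted, n_calls):
    """Call-weighted average number of grid cells where the two zones disagree.

    enforced, predicted: dicts mapping a count (balls, strikes) to a 0/1 array on the grid.
    n_calls: dict mapping each count to its number of observed calls, N_S.
    """
    total = sum(n_calls.values())
    return float(sum(
        n_calls[s] / total * np.sum((enforced[s] - predicted[s]) ** 2)
        for s in enforced
    ))

def box(rows, cols, shape=(36, 36)):
    """Toy strike-zone mask: 1 inside the given row and column ranges, 0 elsewhere."""
    mask = np.zeros(shape)
    mask[rows[0]:rows[1], cols[0]:cols[1]] = 1.0
    return mask

# Toy example with two counts: the zones differ by a 20-cell strip in the 3-0 count only.
enforced = {(3, 0): box((8, 28), (6, 30)), (0, 2): box((10, 26), (8, 28))}
predicted = {(3, 0): box((8, 28), (7, 30)), (0, 2): box((10, 26), (8, 28))}
n_calls = {(3, 0): 100, (0, 2): 300}

value = objective(enforced, predicted, n_calls)
print(value)  # 5.0
```

Because each grid cell covers one square inch, the returned value can be read as an area in square inches. The objective is piecewise constant in the parameters, which is consistent with the text's choice of a derivative-free search.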

5.2 Estimates

We estimate the model separately for the three types of prior beliefs discussed previously:

a flat prior in which all locations on the grid are viewed as equally likely, regardless of

the count; a rational-expectations prior, measured as the empirical density of called pitches,

as shown in Figure 5c; and a count-specific rational-expectations prior, measured as the

count-specific empirical density of called pitches and rendered in Figure 8.19

Estimates of the boundaries are nearly identical under each prior. The top and bottom

boundaries coincide with those of the official strike zone (under the flat prior, the bottom

boundary is an inch lower). However, the sides are wider than the official strike zone, and

this deviation is asymmetric. Whereas the right edge is about 2 inches wider than the official

strike zone, the left edge is about 4 inches wider, consistent with the estimates in Figures 2

and 3. The asymmetry can be attributed to differences in the enforced strike zone for left-

and right-handed batters, as we discuss in Footnote 11.

18 Genetic algorithms address two difficulties in minimizing the objective function in Equation 7. First, the objective function is composed of indicator functions, precluding the use of derivative-based algorithms. Genetic algorithms evaluate the function, not its derivatives. Second, the objective function may have many local minima, and they may be far apart in the parameter space. Though genetic algorithms offer no guarantee of finding the global minimum, they are designed to search efficiently over a multidimensional space. We constrain the search by bounding σ1² and σ2² to each be less than 6 inches and by restricting the four boundaries of Z to integers between 7 and 15 inches from the midline on each dimension (inclusive), given that the discrete grid has a coarseness of 1 inch. The algorithm generates 60 candidate parameter vectors in each population and stops only when the objective function value associated with the best vector is unchanged for 15 successive generations.
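The search procedure in Footnote 18 can be illustrated with a toy genetic algorithm. This is a sketch only: the selection, crossover, and mutation steps, the function `genetic_minimize`, and the stand-in objective `toy_objective` are assumptions made for illustration, not the authors' implementation. For simplicity it bounds the two dispersion parameters directly rather than their squares. The population size of 60 and the 15-generation stopping rule follow the footnote.

```python
import numpy as np

def genetic_minimize(objective, lower, upper, is_integer,
                     pop_size=60, stall_generations=15, seed=0):
    """Toy genetic algorithm: selection, uniform crossover, mutation.

    Stops when the best objective value has not improved for
    `stall_generations` successive generations.
    """
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    is_integer = np.asarray(is_integer, bool)

    def repair(pop):
        # Keep candidates inside the bounds; round the integer parameters.
        pop = np.clip(pop, lower, upper)
        pop[:, is_integer] = np.round(pop[:, is_integer])
        return pop

    pop = repair(rng.uniform(lower, upper, size=(pop_size, lower.size)))
    best_x, best_f, stall = None, np.inf, 0
    while stall < stall_generations:
        fitness = np.array([objective(x) for x in pop])
        order = np.argsort(fitness)
        if fitness[order[0]] < best_f:
            best_x, best_f, stall = pop[order[0]].copy(), fitness[order[0]], 0
        else:
            stall += 1
        parents = pop[order[: pop_size // 2]]             # keep the better half
        mothers = parents[rng.integers(len(parents), size=pop_size)]
        fathers = parents[rng.integers(len(parents), size=pop_size)]
        mask = rng.random(mothers.shape) < 0.5             # uniform crossover
        children = np.where(mask, mothers, fathers)
        children += rng.normal(0.0, 0.3, children.shape)   # mutation
        pop = repair(children)
        pop[0] = best_x                                    # elitism
    return best_x, best_f

# Stand-in objective: integer-valued and piecewise constant, so it has no
# useful gradient. The target values are made up.
target = np.array([10, 10, 9, 11, 3.1, 4.1])

def toy_objective(x):
    return float(np.sum(np.abs(x[:4] - target[:4]))
                 + np.sum(np.floor(5 * np.abs(x[4:] - target[4:]))))

lower = [7, 7, 7, 7, 0.1, 0.1]
upper = [15, 15, 15, 15, 6, 6]
is_integer = [True] * 4 + [False] * 2
best_x, best_f = genetic_minimize(toy_objective, lower, upper, is_integer)
```

The algorithm only ever evaluates the objective, never its derivatives, and it carries no guarantee of reaching the global minimum.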

19 We estimate the density in each count using a bivariate Gaussian kernel with Silverman’s rule-of-thumb bandwidth along each dimension. Counts with more observations have lower bandwidths. This imbues the umpire’s beliefs with a precision that increases with his past exposure. Along the horizontal dimension, the bandwidth varies from 0.88 inches in 0-0 counts to 1.81 inches in 3-2 counts. Along the vertical dimension, the bandwidth varies from 0.86 inches in 0-0 counts to 1.91 inches in 3-2 counts.


Figure 8: P(x0|¬swing, S): the empirical density of called pitches in each count.

[Twelve contour plots, one per count: (a) 0 balls, 0 strikes; (b) 0 balls, 1 strike; (c) 0 balls, 2 strikes; (d) 1 ball, 0 strikes; (e) 1 ball, 1 strike; (f) 1 ball, 2 strikes; (g) 2 balls, 0 strikes; (h) 2 balls, 1 strike; (i) 2 balls, 2 strikes; (j) 3 balls, 0 strikes; (k) 3 balls, 1 strike; (l) 3 balls, 2 strikes. The horizontal axis (in) and vertical axis (in) each run from -18 to 18.]


Figure 9: The objective function in Equation 7 in terms of σ1 and σ2, holding constant Z at its estimated boundaries.

[Three contour plots: (a) Flat prior; (b) Unconditional rational expectations; (c) Count-specific rational expectations. Both axes run from 1 to 6.]

For each of the priors, the variance terms are identified by different aspects of the data.

For the flat prior, increasing the variance rounds the corners of the predicted strike zone.

Hence, the parameters are identified by the average roundness of the enforced strike zone,

which is not very informative. Figure 9a shows the objective function under the flat prior in

terms of σ1 and σ2, with the boundaries fixed at their estimated values. The minimum is not

well identified, as values of the function are similar for a wide range of parameter estimates.

For either of the informative priors, increasing the variance makes the predicted strike

zone more responsive to the prior. Hence, the variance terms are identified by whether the

prior shapes the predicted strike zone to match the empirical strike zone. Figure 9b shows

the objective function under the unconditional rational-expectations prior. The minimum

is better identified than under the flat prior, as the function converges to a smaller basin.

Figure 9c shows the objective function under the count-specific rational-expectations prior.

The minimum is well identified, as the function converges sharply to a narrow basin.

The optimal variance terms under this prior appear reasonable. Our estimates of σ1 = 3.1

inches and σ2 = 4.1 inches imply that 15%, 46%, and 75% of observations are within 2, 4,

and 6 inches of their true location. These estimates are precise, with standard errors of 0.14

inches for σ1 and 0.12 inches for σ2.20 The larger estimate on the vertical dimension indicates

20 The standard errors are difficult to compute. Since the objective function is composed of indicators, the Hessian is undefined. Bootstrapping is computationally infeasible, as the optimizer takes about 36 hours to converge when running on 64 high-performance cores. Given these constraints, we approximate the Hessian for the variance terms in the following manner. Holding the bounds of Z at their estimated values, we compute the objective function value at 8 equally-spaced points on a circle of radius 0.1 inches centered at


that the predicted strike zone expands and contracts more vertically than horizontally, as is

true of the enforced strike zone. This asymmetry may also reflect the difficulty of observing

the location of the pitch in relation to the midline of the batter’s chest or the bottom of

the kneecaps. By comparison, the umpire can more easily tell whether the pitch passes over

home plate, which provides a white backdrop for the pitch from the umpire’s vantage point.
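The shares of observations quoted above for σ1 = 3.1 and σ2 = 4.1 inches can be checked by simulation. This is a sketch under two assumptions that the text does not spell out: that "within r inches" refers to Euclidean distance from the true location, and that the signal is bivariate Gaussian with zero covariance. The variable names are illustrative.

```python
import numpy as np

sigma_h, sigma_v = 3.1, 4.1   # estimated standard deviations, in inches
rng = np.random.default_rng(0)
n = 400_000

# Draw signal errors around the true location (zero covariance).
dx = rng.normal(0.0, sigma_h, n)
dy = rng.normal(0.0, sigma_v, n)
dist = np.hypot(dx, dy)       # Euclidean distance from the true location

# Share of simulated observations within 2, 4, and 6 inches.
shares = {r: float(np.mean(dist <= r)) for r in (2, 4, 6)}
```

Under these assumptions the simulated shares should land close to the 15%, 46%, and 75% reported in the text.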

Our Bayesian model fits the data closely when the umpire is imbued with count-specific

rational expectations. The minimum objective function value, or the area over which the

enforced and predicted strike zones disagree, is 27.4 square inches under the count-specific

prior—an area equivalent to just 7% of the official strike zone. By contrast, the area of

disagreement is 50.3 square inches under the unconditional rational-expectations prior and

51.2 square inches under the flat prior. Refining the umpire’s prior beliefs does not provide

additional degrees of freedom and thus does not necessarily improve the model’s fit. For

instance, prior beliefs that shift posterior beliefs towards the periphery when the count favors

the batter would worsen the fit. Instead, the close fit that our model achieves under the count-

specific rational expectations prior implies that a Bayesian model with this sophisticated

prior fits the data remarkably well.

Figure 10 visualizes the fit between the model and the data in each count. The dashed

irregular ellipse bounds the predicted strike zone, yS(x0), under count-specific rational expectations; the solid irregular ellipse bounds the enforced strike zone, yS(x0); and the dotted

box denotes the estimated boundaries of Z. In each count, the predicted and enforced strike

zones coincide, or nearly coincide. When the predicted strike zone is large, as when the

count favors the batter, so too is the enforced strike zone. And when the predicted strike

zone is small, as when the count favors the pitcher, so too is the enforced strike zone. The

enforced strike zone even responds to asymmetries in how the predicted strike zone changes

across counts. As the number of strikes increases, the contraction in both the predicted and

enforced strike zones is concentrated at the top.

We also use the model to predict the probability of a strike call under the count-specific

rational-expectations prior, as in Equation 6. Figure 11a shows the probability of a strike

call across the plane when the count has 0 balls and 0 strikes—the theoretical analogue to

the empirical estimate in Figure 2a. As is the case for actual umpires, the Bayesian umpire

reliably makes obvious calls: pitches in the center of the official strike zone are almost

always called strikes, and pitches on the periphery are almost always called balls. However,

(σ1, σ2). We then use method of moments to fit those objective function values to a parabola of the form a x1² + b x2² + c x1 x2 + 27.4, where the constant is the objective function value at the estimated variance terms. The approximate matrix of second derivatives is thus [a, c; c, b].


Figure 10: The enforced strike zone (solid irregular ellipse) and predicted strike zone (dashed irregular ellipse) in each count, as well as the estimated boundaries of Z (dotted box).

[Twelve panels, (a) through (l), one per count from 0 balls, 0 strikes to 3 balls, 2 strikes. The horizontal axis (in) and vertical axis (in) each run from -18 to 18.]


Figure 11: Predicted probability of a strike call, for listed counts.

[Two contour plots: (a) P(strike|0-0), with contour lines at 0.1, 0.5, and 0.9; (b) P(strike|3-0) − P(strike|0-2), with contour lines at 0.1, 0.2, and 0.3. The horizontal axis (in) and vertical axis (in) each run from -18 to 18.]

the region in which the umpire sometimes calls balls and sometimes calls strikes for pitches

at the same location is slightly wider in the prediction than in the empirical estimate. In

the prediction, the region between the 10% and 90% contour lines is 8 to 10 inches wide,

compared to a width of 6 to 8 inches in the empirical estimate. Reducing the variance of

the signal improves the fit. In particular, variance terms of σ1² = σ2² = 3² inches yield a

prediction that closely mimics the empirical estimate in Figure 2a. Of course, this comes at

the expense of agreement between the predicted and empirical strike zones, as implied by

the plot of the objective function in Figure 9c.

Differences in the probability of a strike call between counts are reproduced in shape,

though not in magnitude. Figure 11b shows the difference in the predicted probability of a

strike call between 3-0 and 0-2 counts—the theoretical analogue to the empirical estimate

in Figure 2b. As is true of actual umpires, the Bayesian umpire is more likely to call strikes

in 3-0 counts than in 0-2 counts, particularly at the top of the official strike zone. However,

the difference is only about half as large in the prediction as in the empirical estimate. For

pitches at the top of the official strike zone, the predicted probability of a strike changes

by more than 30 percentage points between the most imbalanced counts; in the data, the

difference is more than 60 percentage points.

The parameters which minimize the distance between the predicted and enforced strike

zones do not minimize the distance between predicted and empirical probabilities of a strike

call. One reason may be that the signal does not follow a Gaussian distribution. Among


symmetric distributions, there is no theoretical reason to prefer the Gaussian, and another

distribution may yield a precise fit both when predicting balls and strikes in expectation

and when predicting probabilities.21 A second possibility is that the distribution of the

signal varies with pitch characteristics. Fastballs are thrown faster in counts that favor the

pitcher.22 If the variance of the signal increases with pitch speed—i.e., if faster fastballs are

harder to observe—then a Bayesian model would predict tighter contour lines in Figure 11a

(since fastballs are relatively slow in 0-0 counts) and larger differences in Figure 11b (since

fastballs are much slower in 3-0 counts than in 0-2 counts).

5.3 Blurred rational expectations

A Bayesian model in which umpires combine rational expectations with noisy observations

predicts observed decisions well. To what degree does this performance depend on our

estimates of the count-specific rational-expectations prior? In particular, would a model

underpinned by less precise estimates predict observed decisions just as well? Here, we

consider a model with blurred rational expectations, or over-smoothed versions of the kernel

density estimates in Figure 8. These prior beliefs preserve the gross features of the count-

specific rational expectations without preserving the details. That is, they represent the

prior beliefs of a Bayesian umpire whose rational expectations are constructed from a random

subset of the million observations at our disposal.

We refit the model under different degrees of blurred rational expectations. Specifically,

we over-smooth the kernel density estimates in Figure 8 by multiplying Silverman’s rule-

of-thumb bandwidth in both dimensions by a given factor. Figure 12 shows the area of

disagreement between the enforced and predicted strike zone for different bandwidth factors.

A factor of 1 reproduces our estimates from the previous section, yielding 27.35 square inches

of disagreement. A factor of 2 (equivalent to reducing the number of observations by a factor of 2^5 = 32) fares marginally worse, yielding 28.10 square inches of disagreement, an increase of 3%. A bandwidth factor of 3 (equivalent to reducing the number of observations by a factor of 3^5 = 243) fares considerably worse, yielding 31.30 square inches of disagreement,

an increase of 14%.23
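The equivalence between a bandwidth factor and a reduction in sample size follows if the rule-of-thumb bandwidth scales with n^(-1/5), as in the common univariate form of Silverman's rule. The sketch below assumes that form, 0.9 · min(sd, IQR/1.34) · n^(-1/5). The paper does not state the exact formula in this section, so `silverman_bandwidth` is an illustration rather than the authors' estimator. The areas of disagreement are taken from the text; the simulated sample is made up.

```python
import numpy as np

def silverman_bandwidth(x):
    """Univariate rule of thumb: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)."""
    x = np.asarray(x, float)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * x.size ** (-1 / 5)

def equivalent_sample_reduction(factor):
    """Factor by which n must shrink for the bandwidth to grow by `factor`."""
    return factor ** 5

# Areas of disagreement (square inches) reported for each bandwidth factor.
areas = {1: 27.35, 2: 28.10, 3: 31.30}
pct_increase = {k: 100 * (a / areas[1] - 1) for k, a in areas.items()}

# Illustration with synthetic data: 32 times fewer observations should
# roughly double the bandwidth, up to sampling noise in the spread estimate.
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 6.0, 3200)
h_full = silverman_bandwidth(sample)
h_sub = silverman_bandwidth(sample[:100])
ratio = h_sub / h_full
```

The fifth-power relationship is why a modest-looking factor of 2 or 3 corresponds to discarding most of the data.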

21 We also estimate the model with a signal that follows a bivariate t distribution. The t distribution is a generalization of the normal. An extra parameter, the degrees of freedom, modulates the proportion of mass in the tails. We estimate a degrees of freedom parameter of 13, which effectively reduces the t distribution to a normal.

22 This is likely a product of selection. Pitchers who throw harder reach pitcher-friendly counts more often.

23 The estimated bounds of Z do not change for bandwidth factors of 1.5 and 2; for factors of 2 and 2.5, the top boundary drops from 10 inches above the midline to 9 inches. Across the bandwidth factors, σ1


Figure 12: Area of disagreement (in square inches) between the enforced and predicted strike zones, for different degrees of over-smoothed prior beliefs.

[Line plot. Horizontal axis: bandwidth factor (1, 1.5, 2, 2.5, 3). Vertical axis: area of disagreement (sq. inches), from 27 to 32.]

Note: we over-smooth the umpire’s prior beliefs by reestimating the kernel density estimates in Figure 8 with excessive bandwidths, using Silverman’s rule-of-thumb bandwidth in each dimension multiplied by the given bandwidth factor.

Over-smoothing the umpire’s rational expectations worsens model fit. Hence, observed

calls more closely correspond to a model in which the umpire holds precise rational expectations rather than a blurred set that over-smooths its features. However, the model performs

only marginally worse when the umpire forms his rational expectations from an order-of-

magnitude fewer observations than are available to us, implying that the gross features of

the empirical density of called pitches are sufficient, or almost sufficient, for predicting the

enforced strike zone.

5.4 Weighting the signal

Umpires make calls as if applying a consistent decision rule to posterior beliefs formed from

rational priors and imperfect signals. Do umpires combine the prior and signal optimally?

The weight placed on the prior relative to the signal should reflect the quality of the umpire’s

eyesight: an umpire with perfect vision should place all the weight on the signal, and a blind

umpire should place all the weight on the prior.

In our model, the variance of the signal represents the umpire’s eyesight and determines

the weight placed on the prior. A higher variance implies that the umpire is less likely to

ranges from 2.6 to 3.2 inches, and σ2 ranges from 3.7 to 4.2 inches.


observe the pitch near its true location. As such, he places more weight on the prior. Since we

cannot measure eyesight directly, we instead estimate the variance terms by finding the values

that best predict observed calls. The values we estimate appear reasonable, suggesting that

updating by umpires approximates Bayes rule. However, our model assumes that umpires

know the true distribution of the signal—i.e., umpires form posterior beliefs from the same

distribution as their observations are drawn from. If umpires are overconfident about their

eyesight and thereby underestimate the variance of the signal, they will underweight the

prior. Likewise, if umpires are underconfident about their eyesight and thereby overestimate

the variance of the signal, they will overweight the prior.
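The link between the perceived signal variance and the weight on the prior can be illustrated in one dimension. This is a minimal sketch, assuming a discretized prior over locations and a Gaussian signal. The prior shape, the grid, the observed location, and the function name `posterior` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def posterior(grid, prior, signal, perceived_sd):
    """Bayes rule on a grid: posterior is proportional to the prior times
    the Gaussian likelihood of the observed signal at each location."""
    likelihood = np.exp(-0.5 * ((signal - grid) / perceived_sd) ** 2)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

grid = np.arange(-18, 19)                    # locations, in inches
prior = np.exp(-0.5 * (grid / 5.0) ** 2)     # made-up prior centered at 0
prior /= prior.sum()

signal = 10.0                                # observed location, in inches
means = {}
for sd in (1.0, 3.0, 6.0):
    post = posterior(grid, prior, signal, sd)
    means[sd] = float(np.sum(grid * post))   # posterior mean location
```

As the perceived standard deviation rises, the posterior mean moves away from the observed location and toward the center of the prior, which is the sense in which an underconfident umpire overweights the prior.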

Here, we allow umpires to misperceive the distribution of the signal. Specifically, we

simulate observed locations from one distribution, and we allow umpires to form beliefs

from a signal with a different distribution. As before, we assume that both distributions

are centered at the true location and have zero covariance. However, we now estimate four

variance parameters, one in each dimension for each distribution.
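The separation between the distribution that generates observations and the distribution the umpire believes in can be sketched as follows, again in one dimension for brevity. Several pieces are assumptions made for illustration: the decision rule (call a strike when the posterior mass inside the zone is at least 0.5), the prior, the zone boundaries, and the name `strike_probability`. The 1000 simulated signals per location mirror the simulation size mentioned earlier in the text.

```python
import numpy as np

def strike_probability(true_loc, grid, prior, in_zone,
                       true_sd, perceived_sd, n_sims=1000, seed=0):
    """Share of simulated signals that lead to a strike call at `true_loc`."""
    rng = np.random.default_rng(seed)
    # Signals are drawn from the true distribution of the umpire's eyesight.
    signals = rng.normal(true_loc, true_sd, n_sims)
    # Beliefs are formed with the perceived standard deviation instead.
    like = np.exp(-0.5 * ((signals[:, None] - grid[None, :]) / perceived_sd) ** 2)
    post = like * prior[None, :]
    post /= post.sum(axis=1, keepdims=True)
    # Assumed decision rule: strike if posterior mass inside the zone >= 0.5.
    return float(np.mean(post[:, in_zone].sum(axis=1) >= 0.5))

grid = np.arange(-18.0, 19.0)
prior = np.exp(-0.5 * (grid / 6.0) ** 2)     # made-up prior
prior /= prior.sum()
in_zone = np.abs(grid) <= 10                 # hypothetical zone boundaries

p_center = strike_probability(0.0, grid, prior, in_zone, 4.0, 4.0)
p_far = strike_probability(18.0, grid, prior, in_zone, 4.0, 4.0)
p_edge_calibrated = strike_probability(10.0, grid, prior, in_zone, 4.0, 4.0)
p_edge_overconfident = strike_probability(10.0, grid, prior, in_zone, 4.0, 2.0)
```

In this toy setup, an overconfident umpire (perceived standard deviation below the true one) leans less on the prior, so fewer borderline pitches are pulled into the zone than under correct perception.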

Different elements of the data identify the parameters of each distribution. Increasing the

variance of the distribution from which observations are drawn worsens the umpire’s vision,

which makes the predicted strike zone round; hence, these variance parameters are identified

by the roundness of the enforced strike zone. The variance of the umpire’s perception of

the signal also modulates responsiveness to the prior; hence, these variance parameters are

partially identified by the extent to which the prior can shape the predicted strike zone to

match the enforced strike zone. We suggest caution in interpreting the estimates that follow.

Both distributions determine the roundness of the predicted strike zone; hence, both sets

of variance terms are identified by the roundness of the enforced strike zone. Moreover,

roundness in the enforced strike zone is amplified by pooling umpires with even subtly

different tendencies. Finally, no feature of the enforced strike zone provides a direct measure

of the quality of umpires’ eyesight.

Our results provide mixed evidence on optimal weighting. The model fit improves (the area of disagreement shrinks from 27.4 square inches to 23.8 square inches), owing to the extra parameters and to differences in the variance estimates between the distributions.24

We estimate that observations are drawn from a distribution with a standard deviation of

4.3 inches along the horizontal dimension and 4.2 inches along the vertical dimension. The

symmetry of these estimates, which proxy for eyesight quality, suggests that vision is no

better (or worse) in one direction. For the umpire’s perception of the signal, we estimate

24 We estimate the same bounds of Z as under the count-specific rational-expectations prior in Section 5.2.


a standard deviation of 2.1 inches along the horizontal dimension and 4.6 inches along the

vertical dimension. These estimates confirm previous findings of greater responsiveness along

the vertical dimension. They also suggest some misperception of the signal: umpires appear

to underweight the prior along the horizontal dimension and approximate Bayes rule along

the vertical dimension.

6 Discussion

Major League Baseball directs umpires to make ball and strike calls based solely on the

location of the pitch. Yet umpires make different calls in different counts for pitches at the

same location. Previous research has argued that umpires employ different decision rules in

different counts (Moskowitz and Wertheim, 2011; Mills, 2014; Green and Daniels, 2015). We

show that these patterns can be rationalized by a simple Bayesian model in which umpires

apply a consistent decision rule to beliefs that vary rationally across counts.

Though the model is straightforward, the behavior it predicts requires considerable sophistication. The umpire must instantaneously weigh his imperfect observation of the location of the pitch against count-specific rational expectations that differ by location. We find

that decisions by professional umpires reflect a precise understanding of what they should

rationally expect, as well as an ability to instantaneously integrate those prior beliefs in a

manner that approximates Bayes rule.

6.1 Intuition and rationality

Departures from rational models of belief formation are typically attributed to heuristic rules

and other intuitive processes that are unknowingly misapplied (Simon, 1982; Kahneman,

2011). Judging probabilities from salient characteristics, for instance, may be advisable when

determining whether an animal with sharp incisors is a predator. However, applying those

same heuristics in new environments can lead to systematic errors. When asked whether a

woman described as “deeply concerned with issues of discrimination and social justice” is

more likely to be 1) a bank teller, or 2) a bank teller and a feminist, more respondents choose

the latter, even though feminist bank tellers are clearly a subset of bank tellers (Tversky

and Kahneman, 1983). When applied out of context, intuitions can be very bad.

When applied appropriately, however, intuitions should be capable of being very good.

Psychologists have speculated that rational benchmarks are most achievable in contexts

where outcomes are predictable and practice provides feedback (Kahneman and Klein, 2009).


However, tests of this proposition are rare, owing to the difficulty of measuring performance

by experts in the field against predictions from classical models. Notable exceptions include

evidence of minimax play by professional tennis players when choosing whether to serve

outside or down the line (Walker and Wooders, 2001) and by professional soccer players

when choosing whether to kick a penalty left or right (Chiappori, Levitt and Groseclose, 2002; Palacios-Huerta, 2003). In both cases, the rational benchmark requires at best a moderate level of

sophistication—choosing left or right at some fixed rate—and players face no time pressure.

Umpires achieve a far more sophisticated benchmark. Their split-second decisions approximate Bayesian updating from rational expectations that vary in a complex manner with both the location of the pitch and the count. Calling balls and strikes should be amenable to intuitive expertise: observable cues point to the correct call, which Major League umpires learn through decades of training and experience, as well as feedback on hundreds of

decisions after each game. Nonetheless, their sophistication is remarkable. The psychologist

Amos Tversky once remarked, “All your economic models are premised on people being

smart and rational, and yet all the people you know are idiots.”25 Our results suggest that

highly rational behavior can be learned intuitively, and hence, that feedback and repetition,

rather than intelligence, can underpin rationality.26

6.2 Statistical discrimination

Our results can also be understood as evidence of statistical discrimination, or the use of

disallowed information to make more correct decisions. Umpires face a tradeoff between

procedural fairness and accuracy. MLB instructs umpires to make ball and strike calls based

solely on the location of the pitch but evaluates umpires based on the accuracy of their calls.

When the location of the pitch is unclear, discriminating along other variables that predict

pitch location—i.e., the count—helps umpires make more correct calls. Umpires statistically

discriminate: they make different calls in different counts for pitches at the same location,

and they do so in a manner that improves their accuracy.

Statistical discrimination is not unique to umpires. The same elements that make statistical discrimination appealing to umpires (incentives to make correct decisions and disallowed evidence that can help make more correct decisions on average) exist in many other important settings, from the courtroom to the classroom to the interview room to the immigration

desk. Our setting is unique in that statistical discrimination among umpires is directly

25 Quoted in Lewis (2016).

26 For a longer discussion on this point, and for additional evidence, see Green, Rao and Rothschild (2016).


observable. In most settings, both the arbitrator and the analyst face uncertainty about

whether the allowed evidence implies guilt or innocence. In baseball, only the umpire is

uncertain whether a pitch is truly a ball or a strike—the analyst knows the truth.

To our knowledge, our results are the first to provide direct field evidence of statistical discrimination. Evidence of discrimination can rarely differentiate between statistical discrimination and taste-based discrimination, in which the use of disallowed information provides direct utility irrespective of whether doing so improves accuracy (e.g. Bertrand and Mullainathan, 2004).27 In cases where the mechanism can be parsed, taste-based discrimination has proven more readily identifiable, either by observing animus directly (e.g. Stephens-Davidowitz, 2014) or by observing patterns of discrimination that cannot plausibly be efficient, as with favoritism of same-race players by referees (Price and Wolfers,

2010; Parsons, Sulaeman, Yates and Hamermesh, 2011). The few field studies that claim

to identify statistical discrimination do so indirectly, either by testing auxiliary predictions

(Altonji and Pierret, 2001; Ondrich, Ross and Yinger, 2003; Levitt, 2004; Ewens, Tomlin and

Wang, 2014) or by conducting laboratory-style experiments in the field (List, 2004).28 In

contrast, we estimate a simple model of statistical discrimination using the observed choices

of professional arbitrators and direct estimates of rational expectations, and we find that

this model almost entirely accounts for otherwise puzzling empirical patterns.

27 Others have proposed a second distinction: whether discrimination is intentional or implicit (Bertrand, Chugh and Mullainathan, 2005). We do not take a stand on whether statistical discrimination by umpires is deliberate.

28 In a framed field experiment, List (2004) shows that professional sports-card traders quote higher prices to women and minorities than to white men. He finds that traders do not discriminate by race or gender when paired with potential customers in the dictator game, suggesting that the discrimination is not taste based. And he shows that these traders perform better than random guessing when asked to choose which of two unlabeled distributions of reservation values was elicited from white men rather than from women and minorities, suggesting that the discrimination is statistical in nature.


References

Akerlof, George A and William T Dickens (1982) “The economic consequences of cognitive dissonance,” The American Economic Review, Vol. 72, pp. 307–319.

Altonji, Joseph G and Charles R Pierret (2001) “Employer Learning and Statistical Discrimination,” The Quarterly Journal of Economics, Vol. 116, pp. 313–350.

Ambuehl, Sandro (2016) “An offer you can’t refuse: Incentives change what we believe,” Available at SSRN 2830171.

Andreoni, James and Tymofiy Mylovanov (2012) “Diverging opinions,” American Economic Journal: Microeconomics, Vol. 4, pp. 209–232.

Ariely, Dan, George Loewenstein, and Drazen Prelec (2003) “Coherent arbitrariness: Stable demand curves without stable preferences,” The Quarterly Journal of Economics, Vol. 118, pp. 73–106.

Backus, Matthew, Tom Blake, and Steven Tadelis (2015) “Cheap talk, round numbers, and the economics of negotiation,” Technical report, National Bureau of Economic Research.

Barber, Brad M and Terrance Odean (2001) “Boys will be boys: Gender, overconfidence, and common stock investment,” The Quarterly Journal of Economics, Vol. 116, pp. 261–292.

Barberis, Nicholas, Andrei Shleifer, and Robert Vishny (1998) “A model of investor sentiment,” Journal of Financial Economics, Vol. 49, pp. 307–343.

Bertrand, Marianne, Dolly Chugh, and Sendhil Mullainathan (2005) “Implicit discrimination,” American Economic Review, pp. 94–98.

Bertrand, Marianne and Sendhil Mullainathan (2004) “Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination,” The American Economic Review, Vol. 94, pp. 991–1013.

Brunnermeier, Markus K and Jonathan A Parker (2005) “Optimal expectations,” The American Economic Review, Vol. 95, pp. 1092–1118.

Callahan, Gerry (1998) “Moody Blues,” Sports Illustrated, pp. 42–47.

Callan, Matthew (2012) “Called Out: The Forgotten Baseball Umpires Strike of 1999,” The Classical.

Camerer, Colin F, George Loewenstein, and Matthew Rabin (2004) Advances in Behavioral Economics: Princeton University Press.

Caple, Jim (2011) “Humbled by umpire school,” ESPN.com.

Chen, Daniel, Tobias J Moskowitz, and Kelly Shue (2016) “Decision-Making Under the Gambler’s Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires,” The Quarterly Journal of Economics, p. qjw017.

Chiappori, P-A, Steven Levitt, and Timothy Groseclose (2002) “Testing mixed-strategy equilibria when players are heterogeneous: The case of penalty kicks in soccer,” American Economic Review, pp. 1138–1151.

De Bondt, Werner F M and Richard Thaler (1985) “Does the Stock Market Overreact?” The Journal of Finance, Vol. 40, pp. 793–805.

Deshpande, Sameer K and Abraham Wyner (2017) “A Bayesian Hierarchical Model To Estimate the Framing Ability of Major League Baseball Catchers,” Working Paper.

Drellich, Evan (2012) “Complex system in place to evaluate umpires,” MLB.com.

Eil, David and Justin M Rao (2011) “The good news-bad news effect: asymmetric processing of objective information about yourself,” American Economic Journal: Microeconomics, Vol. 3, pp. 114–138.

El-Gamal, Mahmoud A and David M Grether (1995) “Are people Bayesian? Uncovering behavioral strategies,” Journal of the American Statistical Association, Vol. 90, pp. 1137–1145.

Ewens, Michael, Bryan Tomlin, and Liang Choon Wang (2014) “Statistical discrimination or prejudice? A large sample field experiment,” Review of Economics and Statistics, Vol. 96, pp. 119–134.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2001) The elements of statistical learning, Vol. 1: Springer series in statistics, Springer, Berlin.

Garicano, Luis, Ignacio Palacios-Huerta, and Canice Prendergast (2005) “Favoritism under social pressure,” Review of Economics and Statistics, Vol. 87, pp. 208–216.

Green, Etan A, Justin M Rao, and David M Rothschild (2016) “A Sharp Test of the Portability of Expertise,” Available at SSRN 2656268.

Green, Etan and David P Daniels (2014) “What does it take to call a strike? Three biases in umpire decision making,” in 2014 MIT Sloan Sports Analytics Conference.

Green, Etan and David P Daniels (2015) “Impact Aversion in Arbitrator Decisions,” Available at SSRN 2391558.

Grether, David M (1980) “Bayes rule as a descriptive model: The representativeness heuristic,” The Quarterly Journal of Economics, Vol. 95, pp. 537–557.

Haigh, Michael S and John A List (2005) “Do professional traders exhibit myopic loss aversion? An experimental analysis,” The Journal of Finance, Vol. 60, pp. 523–534.

Kahneman, Daniel (2011) Thinking, fast and slow: Macmillan.

Kahneman, Daniel and Gary Klein (2009) “Conditions for intuitive expertise: a failure to disagree,” American Psychologist, Vol. 64, p. 515.

Kahneman, Daniel and Amos Tversky (1972) “Subjective probability: A judgment of representativeness,” Cognitive Psychology, Vol. 3, pp. 430–454.

Koehler, Jonathan J and Molly Mercer (2009) “Selection neglect in mutual fund advertisements,” Management Science, Vol. 55, pp. 1107–1121.

Lacetera, Nicola, Devin G Pope, and Justin R Sydnor (2012) “Heuristic thinking and limited attention in the car market,” The American Economic Review, Vol. 102, pp. 2206–2236.

Larson, Francis, John A. List, and Robert D. Metcalfe (2016) “Can Myopic Loss Aversion Explain the Equity Premium Puzzle? Evidence from a Natural Field Experiment with Professional Traders,” Working Paper 22605, National Bureau of Economic Research.

Levitt, Steven D (2004) “Testing theories of discrimination: evidence from Weakest Link,” The Journal of Law and Economics, Vol. 47, pp. 431–452.

Lewis, M. (2016) The Undoing Project: A Friendship That Changed Our Minds: W. W. Norton.

Lipkus, Isaac M, Greg Samsa, and Barbara K Rimer (2001) “General performance on a numeracy scale among highly educated samples,” Medical Decision Making, Vol. 21, pp. 37–44.

List, John A (2004) “The nature and extent of discrimination in the marketplace: Evidence from the field,” The Quarterly Journal of Economics, pp. 49–89.

Miller, Joshua B and Adam Sanjurjo (2016) “Surprised by the gambler’s and hot hand fallacies? A truth in the law of small numbers,” Available at SSRN 2627354.

Mills, Brian M (2014) “Social pressure at the plate: Inequality aversion, status, and mere exposure,” Managerial and Decision Economics, Vol. 35, pp. 387–403.

Mills, Brian M (2015) “Technological Innovations in Monitoring and Evaluation: Evidence of Performance Impacts among Major League Baseball Umpires,” Working Paper.

Mobius, Markus M, Muriel Niederle, Paul Niehaus, and Tanya S Rosenblat (2011) “Managing self-confidence: Theory and experimental evidence,” National Bureau of Economic Research.

Moskowitz, Tobias and L Jon Wertheim (2011) Scorecasting: The hidden influences behind how sports are played and games are won: Crown Archetype.

Mussweiler, Thomas (2001) “Sentencing Under Uncertainty: Anchoring Effects in the Courtroom,” Journal of Applied Social Psychology, Vol. 31, pp. 1535–1551.

Nickerson, Raymond S (1998) “Confirmation bias: A ubiquitous phenomenon in many guises,” Review of General Psychology, Vol. 2, p. 175.

O’Connell, Jack (2007) “Much required to become MLB umpire,” MLB.com.

Ondrich, Jan, Stephen Ross, and John Yinger (2003) “Now you see it, now you don’t: Why do real estate agents withhold available houses from black customers?” Review of Economics and Statistics, Vol. 85, pp. 854–873.

Palacios-Huerta, Ignacio (2003) “Professionals play minimax,” The Review of Economic Studies, Vol. 70, pp. 395–415.

Parsons, Christopher A, Johan Sulaeman, Michael C Yates, and Daniel S Hamermesh (2011) “Strike three: Discrimination, incentives, and evaluation,” The American Economic Review, Vol. 101, pp. 1410–1435.

Price, Joseph and Justin Wolfers (2010) “Racial Discrimination Among NBA Referees,” The Quarterly Journal of Economics, Vol. 125, pp. 1859–1887.

Rabin, Matthew (2002) “Inference by believers in the law of small numbers,” The Quarterly Journal of Economics, Vol. 117, pp. 775–816.

Rabin, Matthew and Joel L Schrag (1999) “First impressions matter: A model of confirmatory bias,” The Quarterly Journal of Economics, Vol. 114, pp. 37–82.

Samuelson, William and Richard Zeckhauser (1988) “Status quo bias in decision making,” Journal of Risk and Uncertainty, Vol. 1, pp. 7–59.

Simon, Herbert Alexander (1982) Models of bounded rationality: Empirically grounded economic reason, Vol. 3: MIT Press.

Stephens-Davidowitz, Seth (2014) “The cost of racial animus on a black candidate: Evidence using Google search data,” Journal of Public Economics, Vol. 118, pp. 26–40.

Stone, Daniel F (2013) “Testing Bayesian Updating with the Associated Press Top 25,” Economic Inquiry, Vol. 51, pp. 1457–1474.

Sutter, Matthias and Martin G Kocher (2004) “Favoritism of agents–the case of referees’ home bias,” Journal of Economic Psychology, Vol. 25, pp. 461–469.

Tversky, Amos and Daniel Kahneman (1971) “Belief in the law of small numbers,” Psychological Bulletin, Vol. 76, p. 105.

Tversky, Amos and Daniel Kahneman (1974) “Judgment under Uncertainty: Heuristics and Biases,” Science, Vol. 185, pp. 1124–1131.

Tversky, Amos and Daniel Kahneman (1983) “Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment,” Psychological Review, Vol. 90, p. 293.

Walker, Mark and John Wooders (2001) “Minimax play at Wimbledon,” The American Economic Review, Vol. 91, pp. 1521–1538.

Zafar, Basit (2011) “How do college students form expectations?” Journal of Labor Economics, Vol. 29, pp. 301–348.

Figure A1: The enforced strike zone for left- and right-handed batters, in counts with 0 balls and 0 strikes.

[Plot: horizontal axis (in) and vertical axis (in), each spanning -18 to 18; boundaries labeled R and L.]

Figure A2: The enforced strike zone in the listed counts, separately for calls following a called strike on the previous pitch in the at-bat (solid boundary) and calls not following a called strike on the previous pitch in the at-bat (dot-dashed boundary).

[Panels: (a) 0 balls, 1 strike; (b) 0 balls, 2 strikes; (c) 1 ball, 1 strike; (d) 1 ball, 2 strikes; (e) 2 balls, 1 strike; (f) 2 balls, 2 strikes; (g) 3 balls, 1 strike; (h) 3 balls, 2 strikes. Each panel plots horizontal axis (in) against vertical axis (in), each spanning -18 to 18.]

Figure A3: P(strike | home batting) − P(strike | away batting): difference in the probability of a called strike between the bottom half of the inning, when the home team bats, and the top half of the inning, when the away team bats.

[Plot: horizontal axis (in) and vertical axis (in), each spanning -18 to 18; contour lines labeled 0; color scale from -0.4 to 0.4.]
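
As an illustration of the quantity plotted in Figure A3, the sketch below computes P(strike | home batting) − P(strike | away batting) within square location bins. It is a simple binned comparison written for exposition, not the estimation procedure used in the paper. The record layout (`x_in`, `y_in`, `home_batting`, `called_strike`), the function name, and the 3-inch bin width are all assumptions made for this example.

```python
from collections import defaultdict


def strike_rate_difference(pitches, bin_width=3.0):
    """Map each location bin to P(strike | home batting) - P(strike | away batting).

    `pitches` is an iterable of (x_in, y_in, home_batting, called_strike) tuples.
    This record layout is hypothetical and chosen only for illustration.
    """
    # counts[bin][home_batting] = [called strikes, total calls]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for x, y, home_batting, called_strike in pitches:
        key = (int(x // bin_width), int(y // bin_width))
        cell = counts[key][bool(home_batting)]
        cell[0] += int(called_strike)
        cell[1] += 1

    diff = {}
    for key, by_half in counts.items():
        home_k, home_n = by_half[True]
        away_k, away_n = by_half[False]
        # Only bins with calls in both halves of the inning are comparable.
        if home_n and away_n:
            diff[key] = home_k / home_n - away_k / away_n
    return diff


# Toy data: one bin has calls in both halves, another has home calls only.
toy_pitches = [
    (1.0, 1.0, True, True),
    (1.5, 2.0, True, False),
    (2.0, 0.5, False, True),
    (2.5, 2.5, False, True),
    (-4.0, 1.0, True, True),
]
toy_diff = strike_rate_difference(toy_pitches)
```

A value near zero in every bin, as the contour labels in Figure A3 suggest, would indicate that the probability of a called strike does not differ between the home and away halves of the inning.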
