Electronic copy available at: https://ssrn.com/abstract=2916929
Bayesian Instinct∗
Etan A. Green1 and David P. Daniels2
1University of Pennsylvania2Stanford University
April 3, 2017†
Abstract
Do experts form rational beliefs when making split-second, sophisticated judgments? Along literature suggests not: individuals often form prior beliefs from biased samplingand update those beliefs by improperly weighting new information. This paper studiesbelief formation by professional umpires in Major League Baseball. We show that thedecisions of umpires reflect an accurate, probabilistic, and state-specific understandingof their rational expectations—as well as an ability to integrate those prior beliefs ina manner that approximates Bayes rule. Given that umpires have barely a second toform beliefs and make a decision, we conclude that the instincts of professional umpiresmimic a sophisticated level of rationality remarkably well.
∗Please direct correspondence to: [email protected]. We thank Jim Andreoni, Dorothy Kronick,Cade Massey, Ken Moon, Alex Rees-Jones, Uri Simonsohn, Joe Simmons, and Dan Stone for comments onthis paper. We also thank seminar participants at Stanford, Wharton, Vanderbilt, UCSD, the 2014 Societyfor Judgment and Decision Making conference, and the 2014 Behavioral Decision Research in Managementconference, including, and in addition to, Doug Bernheim, Gabriel Carroll, Isa Chaves, Scott Ganz, NirHalevy, Sergiu Hart, David Hausman, Jonathan Levav, Neil Malhotra, Max Mishkin, Muriel Niederle, RogerNoll, Justin Rao, Peter Reiss, Al Roth, Charlie Sprenger, Francisco Toro, and Russ VerSteeg, for commentson a related, and now retired, working paper (Green and Daniels, 2015). All errors are our own.†For the most recent version, see: ssrn.com/abstract=2916929
Electronic copy available at: https://ssrn.com/abstract=2916929
1 Introduction
Experts routinely make split-second judgments. Soldiers evaluate whether passersby are
combatants. Police evaluate whether suspects pose threats. Paramedics evaluate whether
injuries need treatment at the scene. Defense analysts evaluate whether radar blips indicate
an attack. And traders evaluate a security’s expected returns after unexpected news. Do
they form rational beliefs? That is, do they begin with rational expectations—i.e., the share
of combatants in the population, the rate at which suspects threaten officers, how often
injuries require immediate attention, how frequently radar noise implies an attack, or the
prior distribution of expected returns? And if so, do they appropriately weigh those prior
beliefs against new information—e.g., the pedestrian pushing buttons on a cell phone, the
suspect reaching into a pocket, the vital signs of the injured, the pattern on the radar, or
the content of the news?
A long literature in behavioral economics and psychology suggests not. Individuals often
form prior beliefs from biased sampling and update those beliefs by improperly weighting
new information.1 Controlled experiments—in which beliefs are elicited through surveys, and
the distribution of the signal is specified by the experimenter—largely find that lay subjects
violate Bayes rule (Grether, 1980; El-Gamal and Grether, 1995; Eil and Rao, 2011; Mobius,
Niederle, Niehaus and Rosenblat, 2011), although different elicitation methods can produce
different conclusions (e.g., Andreoni and Mylovanov, 2012). In natural contexts, however,
beliefs and signals are typically unobservable. As a result, tests of rational belief formation
by experts have relied on indirect evidence. An interesting question in finance, for example,
is whether investors overreact to news. The empirical difficulty is that whereas prices are
1A vast body of research has documented biases in belief formation (for a review, see Camerer, Loewen-stein and Rabin, 2004). On the one hand, individuals often underweight new information, particularlywhen overconfident (e.g. Barber and Odean, 2001) or when the information contradicts their prior beliefs(Nickerson, 1998). On the other hand, individuals often overweight new observations when they are salient(Kahneman and Tversky, 1972)—even when they are obviously uninformative, as when people anchor onthe spin of a wheel (Tversky and Kahneman, 1974), the roll of a die (Mussweiler, 2001), or the last two digitsof a social security number (Ariely, Loewenstein and Prelec, 2003). People also see patterns in randomness(Chen, Moskowitz and Shue, 2016), make strong inferences from few observations (Tversky and Kahneman,1971), neglect selection bias in sample observations (Koehler and Mercer, 2009), and are only attentive tosalient information (e.g. Lacetera, Pope and Sydnor, 2012). Whether people favor their prior beliefs or newinformation can depend on what they want to believe. Those with incentives to change their beliefs seek outcontravening information, whereas those with incentives to reaffirm their beliefs resist such information, evenwhen it is free (Ambuehl, 2016). Individuals who manage to form beliefs absent any bias or corruption maynonetheless succumb to the difficulty of explicitly calculating a posterior belief via Bayes rule, especially ifthey are among the majority of educated subjects who confuse probabilities and proportions (Lipkus et al.,2001). As a result, models of non-Bayesian updating have proliferated (e.g. Akerlof and Dickens, 1982;Barberis, Shleifer and Vishny, 1998; Rabin and Schrag, 1999; Rabin, 2002; Brunnermeier and Parker, 2005).
1
Electronic copy available at: https://ssrn.com/abstract=2916929
observable, true valuations are not. Consequently, tests of rational belief formation among
investors require the analyst to specify an equilibrium model of asset pricing, and such tests
may be invalid if the model is misspecified (cf. De Bondt and Thaler, 1985).2
This paper exploits an unusual natural setting, in which both rational expectations and
signals are observable, to directly evaluate the extent to which experienced agents form
rational beliefs. We study professional umpires in Major League Baseball. In baseball, when
the pitcher makes a pitch and the batter chooses not to swing, the umpire immediately
makes a call—either a ball or a strike. By rule, the umpire is supposed to call a strike
when the pitch intersects a bounded region defined by the official strike zone, and a ball
otherwise. The umpire’s difficulty is that he imperfectly observes the location of the pitch:
pitches travel fast, and the boundaries of the official strike zone are invisible. An umpire with
imperfect vision can improve the accuracy of his calls by incorporating rational expectations
about where the pitch will be thrown. Using data from stereoscopic cameras that record the
precise location of each pitch, we measure rational expectations as the empirical distribution
of pitch locations when the batter does not swing. These expectations represent the umpire’s
best guess of where the pitch was thrown, absent observing the pitch itself. Our data also
allow us to specify the signal, under the innocuous assumption that the umpire’s observation
is drawn from a distribution centered at the true location of the pitch.
Baseball enthusiasts and academic researchers have noted an intriguing pattern in umpire
decisions. Whereas MLB instructs umpires to call balls and strikes based solely on the
location of the pitch, umpires make different calls in different counts, or game states, for
pitches at the same location (Moskowitz and Wertheim, 2011; Green and Daniels, 2014;
Mills, 2014). When the pitch is obviously inside the strike zone or obviously outside the
strike zone, the umpire almost always makes the correct call regardless of the count. But for
pitches near the boundary of the official strike zone, the probability of a strike call can vary
between counts by more than 60 percentage points.
We show that a simple Bayesian model predicts these patterns. In our model, the umpire’s
observation of the pitch provides a noisy signal of its location, which he integrates with his
rational expectations to form posterior beliefs. If according to those posterior beliefs, the
location of the pitch is more likely than not to be inside the strike zone, the umpire calls a
strike; otherwise, he calls a ball. The rational-expectations prior shifts the umpire’s posterior
beliefs towards locations that are a priori more expected. For observed locations that are
obviously inside the strike zone or obviously outside the strike zone, these shifts do not
2For recent examples of indirect tests in other contexts, see Zafar (2011) and Stone (2013).
2
change the umpire’s call. But for observed locations just inside the strike zone, expectations
that place greater weight just outside the strike zone change the umpire’s call from strike to
ball. Likewise, for observed locations just outside the strike zone, expectations that place
greater weight just inside the strike zone change the umpire’s call from ball to strike.
Count-based variation in rational expectations generates count-based variation in calls.
When the count favors the batter, the pitcher needs to throw a strike, and the batter can
afford not to swing. As a result, the umpire expects to make calls on pitches inside the
official strike zone. If he is unsure whether the pitch is a ball or a strike, he calls what he
expects: a strike. By contrast, when the count favors the pitcher, the pitcher can afford
to throw a ball, and the batter cannot risk letting a strike go by. As a result, the umpire
expects to make calls on pitches outside the official strike zone. If he is unsure whether the
pitch is a ball or a strike, he calls what he expects: a ball.3
These predictions match the data almost exactly. The enforced strike zone, or the region
in which umpires empirically call strikes in expectation, shrinks and expands across counts.
When the count most favors the batter, the enforced strike zone is 58% larger than when the
count most favors the pitcher, with the change concentrated near the top of the official strike
zone. We use the model to predict the region in which umpires call strikes in expectation,
and we find that in every count, this predicted strike zone is nearly coincident with the
enforced strike zone. Not only does the model reproduce the expansion and contraction of
enforced strike zone—it also reproduces the spatial pattern in which calls on pitches near
the top of the official strike zone are most affected by the count. Our model is accurate and
parsimonious. Changes in the predicted strike zone with the count follow from the data—
i.e., from differences in rational expectations across counts. Just 2 parameters govern the
variance of the signal, which modulates the responsiveness of the posterior to the rational-
expectations prior.
Conventional wisdom among baseball enthusiasts, as well as previous academic research,
has attributed count-based differences in called-strike rates to a preference for avoiding game-
changing calls—i.e., by favoring the pitcher when the count favors the batter, and vice versa
(Moskowitz and Wertheim, 2011; Mills, 2014; Green and Daniels, 2015). We show that this
taste-based account both fails to predict other key patterns in the data and predicts patterns
that do not materialize. An aversion to game-changing, or pivotal, calls predicts that for
3Our model ignores possible strategic behavior by the pitcher and batter in response to count-baseddeviations in the umpire’s calls. Folk theorems suggest that in the repeated game of an at-bat, a game, aseason, or a career, almost any behavior can be rationalized. Invoking Occam’s razor, we posit a simple,testable model that as we show, rationalizes the puzzles in the data.
3
decisions in which neither option is pivotal, called-strike rates will not vary. In fact, called-
strike rates do vary between counts in which both a ball and a strike would be non-pivotal,
and they do so in a manner that is predicted by our Bayesian model. The taste-based account
also predicts that non-count factors which amplify asymmetries in game impact (e.g., score
differential between teams) will amplify differences in called-strike rates. However, we show
that called-strike rates do not meaningfully depend on these factors. Instead, we show how a
rational process generates behaviors that appear anomalous (e.g. Backus, Blake and Tadelis,
2015; Miller and Sanjurjo, 2016).
Our results provide striking evidence of rational updating: a simple Bayesian model re-
produces otherwise puzzling empirical patterns. Though the model is straightforward, the
behavior it predicts requires considerable sophistication. Umpires’ rational expectations dif-
fer by pitch location and by count, and the umpire has barely a second to appropriately
weigh these expectations against his noisy perception of the location of the pitch. While
dominant accounts predict that lay judgments will violate rationality under such conditions
(for a review, see Kahneman, 2011), whether the heuristics of experienced agents fail in sim-
ilar ways is less clear. Some have speculated that with repetition and feedback, heuristics
can be refined to approximate complex calculations (Kahneman and Klein, 2009). However,
evidence is limited and often finds that experts exhibit the same instinctive biases as inexpe-
rienced subjects and to similar degrees (e.g. Haigh and List, 2005; Larson, List and Metcalfe,
2016). We find evidence of sophisticated, heuristic judgment by experts: the decisions of
umpires reflect an accurate, probabilistic, and count-specific understanding of the distribu-
tion of called pitches—as well as an ability to integrate those prior beliefs in a manner that
approximates Bayes rule.
These skills likely result from some combination of instruction, selection, and feedback.
Professional umpires are trained to make accurate calls, and they are promoted and rewarded
based on performance. They also receive unparalleled feedback—MLB uses the same pitch
location data that we analyze to depict each umpire’s mistakes after every game—which
we suspect further hones instincts that mimic a complex Bayesian logic. The tradition of
bounded rationality presumes that individuals often lack the cognitive resources to behave
optimally (e.g. Simon, 1982). Our results suggest that highly rational behavior can be learned
intuitively, and hence, that feedback and repetition, rather than intelligence, can underpin
rationality.
The remainder of the paper is organized as follows. Section 2 describes the context and the
data. Section 3 shows how the enforced strike zone changes with the count. Section 4 details
4
our Bayesian model, illustrates its dynamics with examples, and evaluates its predictions
under different types of prior beliefs. In Section 5, we compare the model’s predictions
under a sophisticated, rational-expectations prior to the empirical patterns from Section 3.
Section 6 concludes.
2 Context and Data
Most plays in baseball begin with the pitcher throwing a pitch towards home plate, an
irregular pentagon in the ground. As shown in Figure 1, the batter stands beside home
plate, and the umpire looks over the catcher from behind home plate. If the batter chooses
not to swing, the home plate umpire then makes a call : either a ball or a strike. MLB
defines the official strike zone as “that area over home plate the upper limit of which is a
horizontal line at the midpoint between the top of the [batter’s] shoulders and the top of the
uniform pants, and the lower level is a line at the hollow beneath the kneecap.”4 Pitches
that intersect the official strike zone should be called strikes; pitches that do not intersect
the official strike zone should be called balls.
Figure 1: The batter stands beside home plate, and the umpire looks over the catcher frombehind home plate. Major league baseball defines the official strike zone, which are renderedon each image, as the three-dimensional region over home plate between the bottom of thebatter’s kneecaps and the midline of his chest.
Umpires have not always adhered closely to the official strike zone. As recently as the
1990s, pitches far beyond the side of home plate—that hitters would have to lunge for—were
often called strikes, while high strikes—over the plate and between the hitter’s belt and
the midline of his chest—were almost always called balls. Major League Baseball could not
4http://mlb.mlb.com/mlb/official_info/umpires/strike_zone.jsp
5
remedy the problem by rewarding the least egregious violators, because the umpires union
mandated that umpires evenly split postseason assignments and the accompanying bonuses
(Callahan, 1998).
In 1999, MLB initiated three small measures aimed at reducing discrepancies between
enforced strike zones and the official strike zone: first, reminding all umpires of the definition
of the official strike zone; second, instructing team officials to monitor each umpire’s enforced
strike zone; and third, suspending an umpire who physically confronted a player—the first
suspension ever given to an umpire. A clumsy response by the umpires union paved the way
for baseball to strengthen the formal incentives faced by umpires. First, the union authorized
a strike. Then, upon realizing that its contract with MLB forbade a strike, the union tried
to dissolve itself—convincing 57 of the 66 union umpires to resign—so as to negotiate a new
contract. When a federal court ruled the dissolution null and void, Major League Baseball
accepted the resignations of 22 umpires and hired 30 new umpires (Callan, 2012). In an
instant, MLB reconstituted its pool of about 70 full-time umpires (O’Connell, 2007). To
become a full-time umpire today, a candidate must graduate in the top fifth of his class from
umpire school, rise through four levels of the minor leagues, be chosen to fill in for an umpire
on vacation, and then be chosen to replace a retiring umpire (Caple, 2011).
Home plate umpires in Major League Baseball now operate under a high degree of mon-
itoring and performance-based incentives. MLB uses pitch-tracking technology to evaluate
the calls of home plate umpires. In the early 2000s, MLB installed the QuesTec system in
half of its stadiums. Mills (2015) shows that these cameras increased umpires’ adherence to
the official strike zone. Prior to the 2008 season, MLB installed the more accurate PITCH
f/x system in every park. After each game, the home-plate umpire receives a breakdown of
his performance, including a score that measures the accuracy of his calls. Umpires are eval-
uated twice each season, with evaluations based on reports from league-employed observers
and analysis of the camera data. MLB rewards good performance with lucrative assignments
to officiate the All-Star Game and the playoffs (Drellich, 2012).
We measure umpires’ adherence to their directive using pitch location data from the
PITCH f/x cameras. Stereoscopic cameras record the trajectory of each pitch at over a
dozen points. An algorithm then fits the observed coordinates to a parabola and infers the
2-dimension location where the pitch intersects the plane that rises from the front of home
plate. MLB reports 4 measures of each pitch: that 2-dimensional location, the speed, the
lateral movement, and the maximal vertical deviation from the straight-line path. MLB also
labels each pitch with the categorical pitch type (e.g., four-seam fastball) from the pitcher’s
6
known repertoire that is best predicted by the observed speed and movement.
A pitch should be called a strike if its trajectory intersects the extruded pentagonal prism
of the official strike zone. Since MLB does not report the full trajectory of each pitch, we
instead compare the location of the pitch on the plane that rises from the front of home
plate to the rectangular slice of the official strike zone defined on that plane. The validity
of this comparison depends on two assumptions: first, that the parabolic fit correctly infers
the true location of the pitch on that plane, and second, that this location fully captures
the pitch’s trajectory through the space defined by the official strike zone. We make these
assumptions credible by restricting our sample to four-seam fastballs, which more than any
other pitch type follow a straight-line path from the pitcher’s hand to the catcher’s glove.5
This restriction allows us to disregard the handedness of the pitcher and the skill of the
catcher at framing pitches, both of which affect ball and strike calls significantly for off-
speed pitches but only minimally for fastballs (Deshpande and Wyner, 2017).
One issue remains when comparing 2-dimensional pitch locations to the official strike
zone. Whereas the horizontal dimension of the official strike zone is fixed to the width of
home plate, the vertical dimension of the official strike zone is determined by the batter’s
stance prior to the pitch. Hence, the absolute location of the pitch does not alone determine
whether a pitch is a ball or a strike. To address this issue, a human operator using a
separate camera (in center field) records the top and bottom boundaries of the official strike
zone from the batter’s stance prior to each pitch. Using these measures, we normalize the
vertical location by fixing the vertical height of the official strike zone to its average height.6
A key explanatory variable in our analysis is the count, which tracks the number of balls
and strikes in the at-bat, or the sequence of pitches between a pitcher and a batter. At the
beginning of an at-bat, the count has 0 balls and 0 strikes. When the batter does not swing
and the umpire calls a ball, the number of balls increases by one. When the batter swings,
or when he doesn’t and the umpire calls a strike, the number of strikes increases by one.7
5On average, the maximal vertical deviation between a four-seam fastball’s trajectory and a straight-linepath is 4.2 inches, compared to 7.9 inches for all other pitch types (t = 1.8× 103).
6Specifically, we redefine the vertical location of each pitch in relation to the midline of the top andbottom boundaries for that pitch and then multiply this redefined location by the ratio of the average heightof the official strike zone to the height for the current pitch. For example, a pitch observed 40 inches abovethe ground with top and bottom boundaries 47 and 21 inches above the ground, respectively will first beredefined as having crossed 6 inches above the midline. Given that the average height of the official strikezone is 22 inches, our normalization places the pitch 5.5 inches above the midline.
7The at-bat ends when the batter hits the ball in the field of play. If the batter hits the ball outside thefield of play (a “foul ball”), the number of strikes in the count increases by one, and the at-bat continues. Anexception is when the count has two strikes, in which case the count remains at two strikes. An exceptionto that exception is a bunt foul, which results in a strikeout when the count previously had two strikes.
7
Four balls end the at-bat with a walk, which favors the batting team; three strikes end the
at-bat with a strikeout, which favors the pitching team. Counts are denoted balls before
strikes. For instance, a 2-1 count means that the count has two balls and one strike.
Count Pitches (%) P (¬swing) Calls (%) P (strike)
0-0 526,357 0.28 0.72 378,463 0.37 0.460-1 207,615 0.11 0.52 108,744 0.11 0.230-2 111,115 0.06 0.51 56,523 0.06 0.091-0 204,572 0.11 0.57 117,167 0.12 0.431-1 169,137 0.09 0.45 76,146 0.07 0.251-2 146,846 0.08 0.42 61,365 0.06 0.132-0 85,915 0.05 0.58 49,940 0.05 0.482-1 107,119 0.06 0.39 42,260 0.04 0.282-2 132,848 0.07 0.33 44,453 0.04 0.173-0 35,531 0.02 0.93 32,934 0.03 0.643-1 58,078 0.03 0.44 25,369 0.02 0.383-2 101,698 0.05 0.25 25,143 0.02 0.17
Total 1,886,831 1.00 0.54 1,018,507 1.00 0.35
Table 1: Summary statistics by count, for four-seam fastballs.
The PITCH f/x cameras recorded 5.7 million pitches over the 2008 though 2015 regular
seasons,8 of which 34%, are categorized as four-seam fastballs. Table 1 tabulates summary
statistics by count for these 1, 886, 831 four-seam fastballs. Batters tend not to swing when
the count is in their favor, even when the pitch will likely be called a strike. For instance,
batters choose not to swing at 93% of four-seam fastballs in 3-0 counts, even though 64%
of these pitches are called strikes. When the count favors the pitcher, batters swing more
aggressively. In 0-2 counts, for example, 51% of four-seam fastballs are swung at, and the
remainder are called strikes just 9% of the time. Overall, batters choose not to swing on 54%
of four-seam fastballs, yielding a sample of 1,018,507 calls, 35% of which are called strikes.
8Fewer than 1% of pitches are not captured by the cameras. We also exclude the 2.8% of pitches forwhich the top and bottom strike zone measurements are likely erroneous, with a strike zone height greaterthan 28 inches or less than 18 inches.
8
3 Stylized Facts
Umpires deviate from the official strike zone by enforcing a strike zone that varies with the
count. In this section, we visualize these changes, showing that they are universal, an order
of magnitude larger than other systematic deviations among umpires, and inconsistent with
previous interpretations.
3.1 Changes in the enforced strike zone with the count
Figure 2: Probability of a strike call when the batter chooses not to swing, for listed counts.
(a) P (strike|0-0)
0.10.1
0.1
0.10.1
0.1
0.1
0.9
0.9
0.9
0.9
0.5
0.5
0.5
0.5
0.5
0.5
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
(b) P (strike|3-0)− P (strike|0-2)
0
0.1
0.1
0.1 0.1
0.1
0.1
0.1
0.1
0.2
0.2
0.2
0.2
0.2
0.20.2
0.30.
3
0.3
0.3
0.3
0.3
0.4
0.4
0.4
0.4
0.50.
5 0.6
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Note: we estimate called strike probabilities in each count via kernel regression with abivariate Gaussian kernel and Silverman’s rule-of-thumb bandwidth in each dimension. Forthe difference plot (b), we use the larger of the two bandwidths in each dimension.
Figure 2 shows the plane that rises from the front of home plate. The umpire looks
through this plane over the catcher and towards the pitcher. A right-handed batter stands
on the left, and a left-handed batter stands on the right. The dotted box bounds the official
strike zone—the width of home plate on the horizontal axis, and the average distance between
the bottom of the batter’s kneecaps and the midline of his chest on the vertical axis. The
contour lines in Figure 2a denote the rates at which umpires call strikes when the batter
chooses not to swing; here, we restrict to the first pitch of the at-bat, when the count has
0 balls and 0 strikes; we estimate these probabilities via kernel regression with a bivariate
9
Gaussian kernel and Silverman’s rule-of-thumb bandwidth in each dimension.9 Umpires
reliably make obvious calls. Pitches that intersect the center of the official strike zone are
almost always called strikes, and pitches that intersect the plane far outside the official
strike zone are almost always called balls. But in between, pitches at the same location are
sometimes called strikes and sometimes called balls. A band 6 to 8 inches wide separates
locations where strikes are called 90% of the time from locations where strikes are called just
10% of the time.
For pitches in this band, the probability of a strike call varies with factors other than
pitch location—in particular, the count. Figure 2b shows differences in the probability of a
strike call between the two counts for which the differences are greatest—counts with 3 balls
and 0 strikes and counts with 0 balls and 2 strikes.10 Holding pitch location fixed, a strike
call is more likely in a 3-0 count than in an 0-2 count at nearly every location. The difference
peaks near the boundary of the official strike zone and particularly at the top, where pitches
are as much as 60 percentage points more likely when the count favors the batter as when
it favors the pitcher.
Figure 3: Enforced strike zone in 3-0, 0-0, and 0-2 counts.
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
3-0
0-0
0-2
Note: we estimate the enforced strike zone in each count using a support vector machinewith a Gaussian kernel.
Umpires enforce different strike zones in different counts. Figure 3 shows the enforced
9These bandwidths are 0.88 inches in the horizontal dimension and 0.86 inches in the vertical dimension.10For this difference plot, we estimate called strike probabilities separately in each count using the larger
of the bandwidths in each dimension. These bandwidths are 1.7 and 1.8 inches in the horizontal and verticaldimensions, respectively.
10
strike zone, or the boundary that best separates ball and strike calls, for three different
counts; we estimate each enforced strike zone using a support vector machine with a Gaussian
kernel (cf. Friedman, Hastie and Tibshirani, 2001). In each of these counts, as in every other,
the enforced strike zone is wider than the official strike zone. It is also asymmetric on the
horizontal axis, wider to the left of home plate than to the right, owing to different enforced
strike zones for left- and right-handed batters, as shown in Figure A1 in the Appendix.11
Across counts, the size of the enforced strike zone varies. The enforced strike zone is 357
square inches in 0-2 counts and 565 square inches in 3-0 counts, an expansion of 58%.
The count-based deviations in the enforced strike zone appear to be universal. We rees-
timate the enforced strike zone in Figure 3 for each of the 60 umpires who make at least
10,000 calls on four-seam fastballs.12 Every one of these umpires enforces a smaller strike
zone in 0-2 counts than in 3-0 counts. Among them, the smallest difference in area between
the enforced strike zone in these counts is 141 square inches, the maximum difference is 356
square inches, and the median difference is 218 square inches.
These changes are not limited to the most imbalanced counts. We estimate the probability
of a strike call in each count using a linear probability model with fixed effects by location
and count:
yi =∑z
1{xi ∈ z} · αz +3∑b=0
2∑s=0
1{ballsi = b, strikesi = s} · βb,s + εi, (1)
where yi is an indicator for a strike call, xi is the location of the pitch, z indexes square-inch
regions on the plane that rises from home plate, b and s index the number of balls and strikes
in the count, and εi is a mean-zero error term. Given that the deviations are greatest where
the correct call is uncertain, we restrict the sample to the 323,560 calls for which the pitch
is within 3 inches—about the diameter of a baseball—of the boundary of the official strike
zone. We hold out β0,0, implying that βb,s represents the change in the probability of a strike
call from a 0-0 count in a count with b balls and s strikes, for a fixed pitch location and for
locations close to the boundary of the official strike zone.
Figure 4 shows estimates of βb,s with 95% confidence intervals from standard errors
11For right-handed batters, the enforced strike zone is close to symmetric across the midline of home plate.For left-handed batters, however, the enforced strike zone is asymmetric: pitches left of home plate and awayfrom the batter are called strikes at higher rates than pitches a similar distance right of home plate andnear the batter. An amateur umpire we spoke with attributes this phenomenon to differences in where theumpire stands behind the catcher for left- and right-handed batters.
12This threshold roughly separates umpires who worked full-time over the entire observation from thosewho were hired full time after 2008 or retired prior to 2015.
11
Figure 4: Estimates of βb,s, the count-specific change in the probability of a called strikefrom a 0-0 count, from Equation 1. 95% confidence intervals reflect standard errors clusteredby umpire.
clustered by umpire. The probability of a strike differs across counts. In 3-0 counts, a strike
is 7.0 percentage points more likely on average than in 0-0 counts. In 0-2 counts, a strike
is 14.3 percentage points less likely on average than in 0-0 counts. The probability of a
strike also differs significantly from 0-0 counts in less imbalanced counts. In 2-0 counts, for
instance, a strike is 4.3 percentage points more likely on average than in 0-0 counts, and in
0-1 counts, a strike is on average 9.7 percentage points less likely. Adding a strike to the
count always decreases the probability of a strike call, and adding a ball to the count always
increases the probability of a strike call. These estimates strongly reject the null hypothesis
that calls do not depend on the count, or that the βb,s terms are equivalent; F(11,117) =
210.63, p < 10−3.
The deviations appear to be independent of Major League experience. We reestimate
Equation 1 with an additional set of count fixed effects, and we interact this set with an
indicator for whether the umpire is one of the 60 who makes at least 10,000 calls on fastballs
over our observation window. These parameters measure count-by-count differences in the
probability of a strike call between experienced and inexperienced umpires. An F-test cannot
reject the null hypothesis that each of these 12 parameters is zero; F(12,117) = 0.93, p = 0.52.
Even the least experienced Major League umpires have been professional umpires for a
decade or more. Perhaps as a result, Major League experience does not appear to influence
the degree to which umpires change the enforced strike zone with the count.
12
3.2 Comparison with other deviations
(1) (2) (3)
Strike on last pitch -0.019(0.0034)
Home team batting -0.0046(0.0017)
3 balls & runner on 1st -0.0027(0.0051)
2 strikes & 2 outs 0.0073(0.0041)
Pitching team ahead 0.0039(0.0021)
Batting team ahead 0.0015(0.0022)
F-test: βb,s 207.11 209.96 210.51Calls 323560 323560 323560
Table 2: Estimates of the linear probability model in Equation 1 with covariates. Parameterestimates represent the change in the probability of a strike call for locations close to theboundary of the official strike zone, holding fixed the location of the pitch and the count.The sample is restricted to the 323,560 calls for which the pitch intersected the plane thatrises from the front of home plate within 3 inches of the official strike zone. The dependentvariable is an indicator for a strike call, and all models include fixed effects for 1) the square-inch location of the pitch, and 2) the count. F-test: βb,s reports an F-statistic for equivalenceamong the count fixed effects βb,s. Standard errors (in parentheses) are clustered by umpire.
Changes in the called-strike rates with the count are an order of magnitude larger than
other deviations among umpires. Chen et al. (2016) find that umpires make negatively corre-
lated judgments: the probability of a strike call declines when the previous pitch was called a
strike. Other work has shown that soccer referees favor the home team (Sutter and Kocher,
2004; Garicano, Palacios-Huerta and Prendergast, 2005). We estimate Equation 1 with co-
variates that measure these biases. Table 2 lists the corresponding parameter estimates as
well as an F-statistic for equivalence among the βb,s terms. Model 1 tests for negative au-
tocorrelation by including an indicator for whether the last pitch in the at-bat was called
13
a strike.13 The probability of a strike call decreases by 1.9 percentage points when the last
pitch in the at-bat was called a strike (|t| = 5.68), consistent with negative autocorrelation
in calls. Model 2 tests for home-team bias by including an indicator for whether the home
team is batting. The probability of a strike call decreases by 0.46 percentage points when
the home team is batting (|t| = 2.69), consistent with favoritism of the home team. In
comparison, a strike is 21.2 percentage points more likely in a 3-0 count than in an 0-2 count
(|t| = 38.56).
The Appendix visualizes these results. Figure A2 shows the enforced strike zone for calls
for which the previous pitch in the at-bat was called a strike and for called which the previous
pitch in the at-bat was not called a strike, for each count with at least one strike. In each
count, enforced strike zone depends minimally, if at all, on whether the previous pitch in the
at-bat was called a strike. Figure A3 shows the difference in the probability of a strike call
between calls when the home team is pitching and when the away team is pitching. The
contour lines trace a flat plane at zero, suggesting minimal, if any, distortion.
3.3 Alternative interpretation
Changes in the enforced strike zone with the count have been interpreted as form of status
quo bias (cf. Samuelson and Zeckhauser, 1988) in which umpires are averse to game-changing
calls—i.e., to a fourth ball or a third strike, both of which end the at-bat (Moskowitz and
Wertheim, 2011; Mills, 2014; Green and Daniels, 2015). This interpretation is inconsistent
with two patterns in the data. First, an aversion to game-changing calls predicts that umpires
will avoid calling balls principally when the count has 3 balls, and that they will avoid calling
strikes principally when the count has 2 strikes. But as Figure 4 shows, the probability of a
strike call varies smoothly with the number of balls and strikes in the count.
Second, a bias towards the status quo predicts that non-count factors which amplify the
impact of one call—such as men on base, number of outs, or the score differential—will
amplify the umpire’s preference for the other call. However, we find that the enforced strike
zone does not meaningfully depend on any of these factors. Model 3 of Table 2 includes
covariates that proxy for asymmetries in game impact. We include an indicator for a count
with 3 balls and a runner on first base; presumably, a walk is more beneficial to the batting
team when it would move a runner into scoring position. We also include an indicator for a
count with 2 strikes and 2 outs in the inning; presumably, a strikeout is more beneficial to the
13The left-out condition comprises calls made on the first pitch of the at-bat, calls for which the previouspitch was called a ball, and calls for which the batter swung at the previous pitch.
14
pitching team when it ends the inning. Finally, we include indicators for whether the pitching
team is ahead and whether the batting team is ahead; presumably, an umpire who prefers
evenness would favor the team that is behind. None of these variables meaningfully predicts
the call, and their inclusion does not diminish the significance of the count indicators.14
4 Theoretical Framework
We show that a simple Bayesian model can account for the observed changes in the enforced
strike zone with the count. After writing down the model, we illustrate its dynamics through
examples and compare its predictions under different types of prior beliefs.
4.1 Bayesian model
We consider an umpire who holds prior beliefs about where the pitch is likely to be thrown,
conditional on the batter choosing not to swing. The umpire also conditions his prior beliefs
on the game state S—e.g., the count—as different states may imply different beliefs about
pitcher and batter behavior. These prior beliefs represent the umpire’s best guess about
where the pitch will be thrown, prior to observing the pitch itself. We denote the umpire’s
prior beliefs as P (x0|S,¬swing), where x0 is the true location of the pitch.
When the pitch is thrown, the umpire observes a noisy signal of its true location. In
particular, he observes the location xu, which is drawn from a distribution, Fxu|x0 , centered
at the true location and known to the umpire. Hence, the umpire expects to observe the
pitch at its true location, but acknowledging the difficulty of observing the exact location
at which a fastball intersects the plane that rises from the front of home plate, he places
14In a previous working paper, we interpreted the count-based distortions as an aversion to making the morepivotal call (Green and Daniels, 2015). We provided two pieces of evidence in support of this interpretation.First, we showed that holding the count fixed, the call depends on other factors that make one call morepivotal than the other, which we termed the differential impact. Second, we showed that this dependencebecomes more extreme in high-visibility games, namely during national weekend telecasts. For both analyses,we calculated the differential impact in the following manner. First, we used the count, number of outs, andindicators for men on base to estimate the number of runs that a team in that situation could expect to scoreover the remainder of the half inning. Second, we calculated the differential impact of a strike as the changein the expected number of runs from calling a strike minus the change in the expected number of runs fromcalling a ball. We then estimated the bias, defined as the change in the probability of a called strike fromgame states with a differential impact of 0, as a smooth function of the differential impact (holding constantthe location of the pitch). For both analyses, our specification forced the estimate through the origin, therebyrequiring that for pitches in which a ball and a strike were assessed to be similarly (non-)pivotal, umpireswould call strikes at the same rate. Specifications that relax this restriction show no dependence on factorsother than the count, nor do they show any differences between national weekend telecasts and other games.
15
positive probability on nearby locations. The signal is simply fxu|x0 , which we denote as
P (xu|x0). Note that this signal depends only on the umpire’s eyesight, which we assume to
be independent of the game state and whether or not the batter swings.15
Given prior beliefs and the signal, posterior beliefs follow from Bayes’ rule:
P (x0|xu, S,¬swing) ∝ P (xu|x0)P (x0|S,¬swing) (2)
The umpire’s prior beliefs about where the pitch will be thrown shift his posterior beliefs
about where the pitch was thrown towards locations that are a priori more expected in the
current state.
After integrating his prior beliefs with the signal, the umpire compares his posterior
beliefs to a fixed rectangle, which we denote Z. He then makes the call that is more likely
to be correct according to his posterior beliefs. If the majority of the posterior density lies
inside Z, he calls a strike; if the majority of the posterior density lies outside Z, he calls a
ball. Let θS(xu) equal 1 if the umpire calls a strike, and 0 otherwise. Then,
θS(xu) = 1 ⇐⇒∫ZP (z|xu, S,¬swing)dz ≥ 1
2(3)
This Bayesian decision rule is deterministic. An observation xu and a state S jointly deter-
mine the umpire’s posterior beliefs and, along with the boundaries of Z, the call he makes.
Whereas this decision rule makes a binary prediction for xu, it makes a probabilistic
prediction for x0. This probability is simply the expected call, with the expectation taken
over the distribution of the umpire’s observation:
P (strike|x0, S) = Exu|x0 [θS(xu)] (4)
The probability of a strike call for a pitch at x0 is the share of draws xu for which a majority
of the umpire’s posterior beliefs are bounded by Z.
This probabilistic prediction defines the predicted strike zone, or the set of true locations
at which the umpire calls strikes more often than balls. Let yS(x0) equal 1 if a pitch at x0
is called a strike in expectation, and 0 otherwise. Hence,
yS(x0) = 1
{P (strike|x0, S) ≥ 1
2
}= 1
{Exu|x0 [θS(xu)] >
1
2
}(5)
15Assuming independence between the likelihood and the batter’s swing decision is innocuous as the umpireonly makes a call when the batter chooses not to swing.
16
The model predicts that a pitch observed by the cameras at x0 will be called a strike on
average if for the majority of the nature’s draws of xu, the majority of the umpire’s posterior
beliefs are bounded by Z.
4.2 Examples
Prior beliefs that coincide with rational expectations over the location of the pitch can explain
changes in the enforced strike zone with the count. When the count favors the batter, the
pitcher needs to throw a strike, and the batter can afford not to swing. Hence, umpires
expect to make calls on pitches inside the official strike zone, and they shift their posterior
beliefs towards the official strike zone. When the count favors the pitcher, the pitcher can
afford to throw a ball, and the batter cannot risk letting a close pitch go by. Hence, umpires
expects to make calls on pitches outside the official strike zone, and they shift their posterior
beliefs about the true location of the pitch away from the official strike zone.16
Figure 5 illustrates these patterns. The first column shows the empirical distribution
of pitch locations. Overall (5a), pitches concentrate at the center of the plane, with 42%
thrown in the official strike zone. In 3-0 counts (5d), pitchers need to throw a strike. As a
result, the distribution concentrates at the center, with 51% of pitches thrown in the official
strike zone. In 0-2 counts (5g), pitchers need not throw a strike. As a result, the distribution
is flatter, with only 24% of pitches thrown in the official strike zone.
The second column shows count-specific rates at which batters choose not to swing.
Overall (5b), batters discriminate by location, swinging at pitches at the center of the plane,
and taking pitches on the periphery. In 3-0 counts (5e), batters have the luxury of not
swinging at a strike. As a result, they rarely swing, regardless of the location of the pitch.
In 0-2 counts (5h), batters cannot afford a called third strike. As a result, they swing at
most pitches near the official strike zone and rarely choose not to swing at a pitch that is
clearly inside of it.
These tendencies underlie vastly different distributions of pitch locations for calls, which
constitute the umpire’s rational expectations about where the pitch will be thrown, con-
ditional on the batter choosing not to swing. These expectations are shown in the third
column. Overall (5c), umpires disproportionately make calls on the left and right edges of
the official strike zone, with 25% of called pitches intersecting the official strike zone. In
16We are indebted to Jim Andreoni for proposing this logic to us in 2015. To our knowledge, thefirst public mention of this explanation was a 2016 blog post on Baseball Prospectus: http://www.
baseballprospectus.com/article.php?articleid=28513.
17
Figure 5: Pitch density, rate of non-swings, and pitch density conditional on not swinging,overall (a-c) and in 3-0 (d-f) and 0-2 (g-i) counts.
(a) P (x0)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.0003
0.00030.0003
0.0
003
0.0003
0.0003
0.0006
0.0
006
0.0006
0.0006
0.0006
0.0
006
0.0009
0.0
009
0.0009
0.0009
0.0009 0.00
120.0012
0.0
012 0.0015
(b) P (¬swing|x0)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.3
0.3
0.3
0.3
0.5
0.5
0.5
0.5
0.5
0.5
0.7
0.7
0.70.7
0.7
0.7
0.9
0.9
0.9
0.9
0.9
(c) P (x0|¬swing)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.00030.0003
0.0003
0.0006
0.0006 0.0006
0.0
006
0.0006
0.0006
0.0006
0.00
090.0
009
(d) P (x0|3-0)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.0
003
0.00030.0003
0.00
03
0.0
003
0.00030.0003
0.00060.0
006
0.0006
0.0006
0.0
006
0.00060.0009
0.0
009
0.0009
0.00
09
0.0009
0.0012
0.0012
0.0012
0.0
012
0.0015
0.0015
0.0
015
0.00
18
0.0018
0.0021
(e) P (¬swing|x0, 3-0)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.9
0.9
0.90.
9
0.9
(f) P (x0|¬swing, 3-0)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.0
003
0.00030.0003
0.0
003
0.0
003
0.0003
0.00
03
0.0006
0.0
006
0.0006
0.0006
0.0
006
0.0006
0.0009
0.0009
0.0009
0.0
009
0.0009 0.0012
0.0012
0.0012
0.0
012
0.0015
0.0015
0.0
015
0.00
18
0.00
18
(g) P (x0|0-2)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.0003 0.00030.0003
0.00
06
0.0006 0.00060.0006
0.0
006
0.0
009
0.0009
0.0009(h) P (¬swing|x0, 0-2)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.1
0.1
0.1
0.1
0.3
0.3
0.30.3
0.3
0.3
0.5
0.5
0.50.5
0.5
0.5
0.7
0.7
0.70.7
0.7
0.7
0.9
0.9
(i) P (x0|¬swing, 0-2)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
0.0003
0.0
003
0.0003 0.0003
0.0
003
0.0003
0.0
006
0.00
06
0.0
006
0.0006
Note: Pitch densities (a,c,d,f,g,i) are kernel density estimates. Non-swing rates (b,e,h) arekernel regression estimates. All estimates use a bivariate Gaussian kernel with Silverman’srule-of-thumb bandwidth in each dimension.
18
3-0 counts (5f), umpires disproportionately make calls on pitches near the side and bottom
boundaries of the official strike zone, and 49% of called pitches intersect the official strike
zone. And in 0-2 counts (5i), umpires disproportionately make calls on pitches above and
left and right of the official strike zone, with just 6% of called pitches intersecting the official
strike zone.
Given a prior and a signal, a Bayesian actor forms a posterior and makes a decision. In our
framework, the prior coincides with the umpire’s rational expectations and can be estimated
from the data, as in Figure 5. The signal, however, is unobserved to the analyst. So far, we
have placed minimal structure on the signal, assuming only that the umpire’s observation is
unbiased. Here, we further assume that the signal is drawn from a bivariate normal distri-
bution centered at the true location and with zero covariance—i.e., xu ∼ N(x0,[σ21 0
0 σ22
] ).
For the examples in this section, we set σ21 = σ2
2 = 32 inches. This implies that 20%, 59%,
and 86% of observations are within 2, 4, and 6 inches of the true location, respectively.
Figure 6: P (x0|S,¬swing): Posterior beliefs under a flat prior (a), under a rational-expectations prior (b), and under a count-specific rational-expectations prior for a 3-0 count(c) and an 0-2 count (d). Contour lines show the posterior distribution, the X marks theobserved location, the dotted box denotes the official strike zone, and the black curve denotesthe boundary between ball and strike calls.
(a) Flat prior
0 3 6 9 12 15
Horizonal axis (in)
0
3
6
9
12
15
Ve
rtic
al a
xis
(in
)
(b) RE prior
0 3 6 9 12 15
Horizonal axis (in)
0
3
6
9
12
15
Ve
rtic
al a
xis
(in
)
(c) RE in 3-0 count
0 3 6 9 12 15
Horizonal axis (in)
0
3
6
9
12
15
Ve
rtic
al a
xis
(in
)
(d) RE in 0-2 count
0 3 6 9 12 15
Horizonal axis (in)
0
3
6
9
12
15
Ve
rtic
al a
xis
(in
)
To see how the different prior expectations affect the umpire’s posterior beliefs, consider
a pitch observed to pass through the top-right corner of the official strike zone. The contour
lines in Figure 6a show the posterior beliefs held by an umpire with a flat prior—i.e., who
expects all locations to be equally likely. Here, we assume that Z coincides with the official
strike zone, which is depicted on the figure by a dotted box. The observed location in
Figure 6a is borderline: given a flat prior, about half of the posterior density lies inside the
official strike zone. Borderline observations intersect the decision boundary, which is traced
by the black curve. Under a flat prior, the decision boundary coincides with the official
19
strike zone along its edges but has rounded corners, implying that pitches observed in the
corners of the official strike zone will be called balls. For observations along the edges, half
of the posterior density lies in official strike zone. But for observations on the corners, only
one-quarter of the posterior density lies in the official strike zone.
The umpire’s posterior beliefs, and thus the decision boundary, change with his prior
beliefs. Figure 6b shows the umpire’s posterior beliefs under the unconditional rational-
expectations prior shown in Figure 5c. Near the observed location, the gradient of the
umpire’s prior beliefs is flat—on average, locations around the top-right corner of the official
strike zone are similarly likely. As a result, the umpire’s posterior beliefs under this prior
replicate his beliefs under the flat prior for an observation at this location.
Figure 6c shows the umpire’s posterior beliefs in a 3-0 count under the count-specific
rational-expectations prior shown in Figure 5f. Near the observed location, points closer
to the center are more expected than those farther away—in 3-0 counts, locations inside
the official strike zone are more likely than locations outside—shifting the umpire’s posterior
beliefs inward. A majority of the posterior density now lies inside the official strike zone, and
the umpire calls a strike as a result. The decision boundary expands outward to encompass
the observed location.
Figure 6d shows the umpire’s posterior beliefs in a 0-2 count under the count-specific
rational-expectations prior shown in Figure 5i. Near the observed location, points away
from the center are more expected than those towards the center—in 0-2 counts, locations
outside the official strike zone are more likely than locations inside—shifting the umpire’s
posterior beliefs outward. A majority of the posterior density now lies outside the official
strike zone, and the umpire calls a ball as a result. The decision boundary shrinks inward,
excluding the observed location.
In both counts, the umpire observes the pitch at the same location, but he makes opposite
calls. When the count favors the batter, he calls a larger strike zone, and when the count
favors the pitcher, he calls a smaller strike zone.
4.3 Accuracy
For the umpire with imperfect eyesight, incorporating count-specific expectations over the
location of the pitch improves the accuracy of his calls. The intuition becomes apparent in
the extreme: a blind umpire would guess correctly more often if he assumes that the pitch is
drawn from the true empirical distribution. This logic also applies when the umpire’s vision
is imperfect: the umpire will make the correct call more often if he shifts his beliefs towards
20
locations that are more likely.
Let the error rate at location x0 in state S be the absolute difference between an indicator
for whether x0 is in the official strike zone and the predicted probability of a strike in state S
from Equation 4, defining the states as the 12 counts. If at a location inside the official strike
zone, the umpire calls a strike with probability 70%, then the error rate at that location is
30%.
To compute the predicted strike probability at a given location, we first discretize the
plane that rises from home plate into a grid of one-inch squares bounded at 18 inches from
the midline on both dimensions. In each square, we predict the probability of a strike by
independently drawing Nsim = 1000 observed locations, which we denote xu, from Fxu|x0 .
As in the previous section, we assume that Fxu|x0 follows a bivariate normal distribution with
variance terms σ21 and σ2
2 and no covariance. For each simulated observation, we integrate
the signal with the prior according to Bayes rule, and sum the posterior density over the
official strike zone. If at least half of the posterior density is inside the official strike zone,
then θS(xu,i;σ21, σ
22) = 1, and the umpire calls a strike; otherwise, θS(xu,i;σ
21, σ
22) = 0, and
the umpire calls a ball.
The probability of a strike is the average call over the 1000 simulations:
P (strike|x0, S;σ21, σ
22) =
1
Nsim
Nsim∑i=1
θS(xu,i;σ21, σ
22) (6)
And the error rate is the absolute difference between that probability and an indicator for
whether the true location is in the official strike zone:
Error(x0, S;σ21, σ
22) =
∣∣∣∣1{x0 ∈ Z} − P (strike|x0, S;σ21, σ
22)
∣∣∣∣We then compute average error rate, λ, weighting by P (x0, S), the joint probability of
observing a call at x0 in count S:
λ =∑x0
∑S
Error(x0, S;σ21, σ
22) · P (x0, S)∑
x0
∑S P (x0, S)
For a given pair of variance terms, we perform this computation once for each the three
types of prior beliefs from the previous section.
Figure 7a shows the error rate under the flat prior, for a range of variance terms. For
near-zero values of σ1 and σ2, the umpire has near-perfect eyesight, and he is near-certain
21
Figure 7: Error rates under a flat prior (a), along with the percent improvement in theerror rate under an unconditional rational-expectations prior (b) and under a count-specificrational-expectations prior, for different values of the scale parameters σ2
1 and σ22.
(a) λflat: flat prior
1 2 3 4 5 6
1
2
3
4
5
6
0.0
50.1
0.1
0.1
5
0.1
5
0.1
5
0.2
0.2
(b) 1− λRE/λflat
1 2 3 4 5 6
1
2
3
4
5
6
0 0
0
0
0
0
0 0
0
0 0
00
0
0
0.01
0.01
0.010.0
1
0.02
0.02
0.03
(c) 1− λS/λRE
1 2 3 4 5 6
1
2
3
4
5
6
0
0.01
0.0
1
0.0
2
0.02
0.0
2
0.0
20.0
3
0.0
3
0.0
3
0.0
4
0.0
4
that the observed location is the true location. Given a degenerate and accurate signal, the
prior is irrelevant, and the error rate is zero under either prior. As the variance grows, the
umpire’s observations are drawn from a wider distribution—i.e., his eyesight worsens—and
the error rate increases.
Rational expectations alone do not improve error rates. Figure 7b shows the percent
improvement in the error rate under the unconditional rational-expectations prior, relative
to the flat prior. For reasonable values of the variance terms—i.e., in which the umpire’s
eyesight is not profoundly worse in one direction that the other—error rates are comparable
under the two priors. Umpires experience minimal, if any, benefits from shifting their beliefs
towards locations that are more expected on average.
The Bayesian umpire lowers his error rate not by holding unconditional rational expecta-
tions, but by conditioning those expectations on the count. Figure 7c shows the percent im-
provement under the count-specific prior, relative to the unconditional rational-expectations
prior. Except for umpires with near-perfect vision, umpires benefit from shifting their be-
liefs towards locations that, given the count, are more expected. At σ1 = σ2 = 3 inches, for
instance, λS is about 2% lower than λRE.
This improvement is greatest in counts for which the count-specific empirical distribution
of called pitches is most different from the overall distribution. For σ1 = σ2 = 3 inches, λS is
11% lower than λRE in 0-2 counts, 9% lower in 1-2 counts, 7% lower in 3-2 and 2-2 counts,
and 5% lower in 3-0 counts. The distribution of called pitches varies considerably between
22
counts, as Figures 5f and 5i show. Approximating these distributions by taking their average,
as in Figure 5c, ignores this variation and worsens performance.
5 Model estimates
Does a Bayesian umpire reproduce the observed changes in the enforced strike zone with
the count, as depicted in Section 3? We calibrate the model and show that under the
count-specific rational-expectations prior, it predicts these changes almost exactly.
5.1 Estimation steps
Assuming, as before, that the umpire’s observation is drawn from a normal distribution with
zero covariance, our model is characterized by six parameters: the variance terms, σ21 and
σ22, as well as the four boundaries of Z. Previously, we chose the variance terms arbitrarily
or estimated the model for a range of values, and we assumed that the boundaries coincide
with the official strike zone. Here, we find the parameters that best predict observed calls.17
Specifically, we use non-linear least squares to find the parameter values that minimize
the average area between the count-specific enforced and predicted strike zones, weighted by
the number of observed calls in each count:
arg minσ21 ,σ
22 ,Z
∑S
NS∑S NS
∑x0
(yS(x0)− yS(x0;σ2
1, σ22,Z)
)2(7)
Here, yS(x0) defines the enforced strike zone in count S, equaling 1 when a pitch at the
true location x0 is more often called a strike than a ball in the data, and 0 otherwise. As
in Equation 5, yS(x0;σ21, σ
22,Z) is the predicted strike zone in count S, equaling 1 when a
pitch at the true location x0 is called a strike in expectation by our Bayesian umpire, and 0
otherwise. As before, we discretize the plane into a square-inch grid bounded at 18 inches
from the midlines.
We estimate the enforced strike zone using a support vector machine with a Gaussian
kernel, as in Figure 3. The SVM separates locations at which strikes are called a majority of
the time from locations at which balls are called a majority of the time. The solid boundaries
in Figure 10 below show the enforced strike zone in each count. We estimate the predicted
17We relax the boundaries of Z to account for deviations that are beyond the scope of this paper—inparticular, that umpires call strikes in expectation to the left and right of the official strike zone but notabove or below, as shown in Figure 3.
23
strike zone as in the previous section. For each true location, we calculate the probability of a
strike call from 1000 simulated calls under the current parameter estimates, as in Equation 6,
classifying the location as inside the predicted strike zone if the probability of a strike call
at that location is at least 0.5.
The objective function in Equation 7 counts discrepancies between the enforced and
predicted strike zones across the grid. We take a weighted average over the counts, weighting
each count by NS, the number of calls observed in that count. Thus, the value of the objective
function is the average area, in square inches, over which the enforced and predicted strike
zones disagree. We find the parameters that minimize Equation 7 using a genetic algorithm.18
5.2 Estimates
We estimate the model separately for the three types of prior beliefs discussed previously:
a flat prior in which all locations are on the grid are viewed as equally likely, regardless of
the count; a rational-expectations prior, measured as the empirical density of called pitches,
as shown in Figure 5c; and a count-specific rational-expectations prior, measured as the
count-specific empirical density of called pitches and rendered in Figure 8.19
Estimates of the boundaries are nearly identical under each prior. The top and bottom
boundaries coincide with those of the official strike zone (under the flat prior, the bottom
boundary is an inch lower). However, the sides are wider than the official strike zone, and
this deviation is asymmetric. Whereas the right edge is about 2 inches wider than the official
strike zone, the left edge is about 4 inches wider, consistent with the estimates in Figures 2
and 3. The asymmetry can be attributed to differences in the enforced strike zone for left-
and right-handed batters, as we discuss in Footnote 11.
18Genetic algorithms address two difficulties in minimizing the objective function in Equation 7. First,the objective function is composed of indicator functions, precluding the use of derivative-based algorithms.Genetic algorithms evaluate the function, not its derivatives. Second, the objective function may havemany local minima, and they may be far apart in the parameter space. Though genetic algorithms offerno guarantee of finding the global minimum, they are designed to search efficiently over a multidimensionalspace. We constrain the search by bounding σ2
1 and σ22 to each be less than 6 inches and by restricting the
four boundaries of Z to integers between 7 and 15 inches from the midline on each dimension (inclusive),given that the discrete grid has a coarseness of 1 inch. The algorithm generates 60 candidate parametervectors in each population and stops only when the objective function value associated with the best vectoris unchanged for 15 successive generations.
19We estimate the density in each count using a bivariate Gaussian kernel with Silverman’s rule-of-thumbbandwidth along each dimension. Counts with more observations have lower bandwidths. This imbues theumpire’s beliefs with a precision that increases with his past exposure. Along the horizontal dimension, thebandwidth varies from 0.88 inches in 0-0 counts to 1.81 inches in 3-2 counts. Along the vertical dimension,the bandwidth varies from 0.86 inches in 0-0 counts to 1.91 inches in 3-2 counts.
24
Figure 8: P (x0|¬swing, S): the empirical density of called pitches in each count.
(a) 0 balls, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.00025
0.00025
0.0005
0.0005 0.0005
0.0
005
0.0005
0.0005
0.00
05
0.00075
0.0
0075
0.000750.00075
0.0
0075
0.001
0.0010.001
0.001
0
0.2
0.4
0.6
0.8
1
1.2
10-3
(b) 0 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.0
0025
0.0
0025
0.0005 0.0005 0.0005
0.0005
0.00
05
0.0005
0.0
005
0.00075
0.0
0075
0.00
075
0.00075
0.0
0075
0.0
010
.001
0
0.2
0.4
0.6
0.8
1
1.2
10-3
(c) 0 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.00025
0.0
0025
0.00025 0.0
0025
0.0
0025
0.00025
0.0
005
0.0005
0.00050.0
005
0.0005
0
1
2
3
4
5
6
7
10-4
(d) 1 ball, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.00025
0.0005
0.0005
0.00
05
0.0
005
0.00050.0005
0.0005
0.00075
0.0
0075
0.00075
0.00075
0.0
00
75
0.00075
0.001
0.0010.0
01
0.0
01
0
0.2
0.4
0.6
0.8
1
10-3
(e) 1 ball, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.0
0025
0.0
0025
0.00050.0005 0.0005
0.0005
0.00
05
0.0005
0.0005
0.0
0075
0.0
0075
0.00
075
0.00075
0.0
0075
0.00
1
0.0
01
0
0.2
0.4
0.6
0.8
1
10-3
(f) 1 ball, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)0.00025 0.00025
0.0
0025
0.00
025
0.00025
0.00
025
0.0
005
0.0
005
0.00050.0
005
0.0
005
0.0005
0.0
0075
0.00
075
0.00075
0
1
2
3
4
5
6
7
8
9
10-4
(g) 2 balls, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.00025
0.0005
0.0005 0.0005
0.0
005
0.0005
0.0005
0.00
05
0.00075
0.0
0075
0.00075
0.00075
0.0
0075
0.00075
0.0
01
0.001
0.001
0.001
0
0.2
0.4
0.6
0.8
1
10-3
(h) 2 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
) 0.0
0025 0
.00025
0.00050.0005
0.00
05
0.0005
0.0
005
0.0005
0.0
00
5
0.0005
0.00075
0.00075 0.00075
0.0
0075
0.00
075
0.00075
0.0
0075
0.00
1
0
0.2
0.4
0.6
0.8
1
10-3
(i) 2 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.00025
0.00
025
0.00025
0.0005 0.0005 0.0005
0.0005
0.0
005
0.0005
0.0
005
0.0005
0.0
0075
0.00075
0.0
0075
0
1
2
3
4
5
6
7
8
9
10-4
(j) 3 balls, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.0005
0.0
005
0.00050.0005
0.0
005
0.0005
0.001
0.001
0.001
0.0
01
0.0015
0.0015
0.0
015
0.002
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
10-3
(k) 3 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.00025
0.0005
0.0005 0.0005
0.0
005
0.0005
0.0005
0.0005
0.00075
0.0
0075
0.00075
0.00075
0.0
0075
0.0
0075
0.00075
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
10-3
(l) 3 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
) 0.0
0025
0.00025
0.0
0025
0.0005
0.0005
0.00050.0005
0.0
005
0.00075
0.0
0075
0.00075
0.000750.00075
0.0
0075
0.0
0075
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
10-3
25
Figure 9: The objective function in Equation 7 in terms of σ1 and σ2, holding constant Zat its estimated boundaries.
(a) Flat prior
1 2 3 4 5 6
1
2
3
4
5
6
70
70
70
70
70
70
70
9090
90
90
90
90
90
110
130
150
(b)Unconditional
rational expectations
1 2 3 4 5 6
1
2
3
4
5
6
70
70
70
7070
70
70
9090
90
90
90
110
(c)Count-specific
rational expectations
1 2 3 4 5 6
1
2
3
4
5
6
30
305
0
50
50
5050
7070
70
70
70
70
9090
90
90
110
For each of the priors, the variance terms are identified by different aspects of the data.
For the flat prior, increasing the variance rounds the corners of the predicted strike zone.
Hence, the parameters are identified by the average roundness of the enforced strike zone,
which is not very informative. Figure 9a shows the objective function under the flat prior in
terms of σ1 and σ2, with the boundaries fixed at their estimated values. The minimum is not
well identified, as values of the function are similar for a wide range of parameter estimates.
For either of the informative priors, increasing the variance makes the predicted strike
zone more responsive to the prior. Hence, the variance terms are identified by whether the
prior shapes the predicted strike zone to match the empirical strike zone. Figure 9b shows
the objective function under the unconditional rational-expectations prior. The minimum
is better identified than under the flat prior, as the function converges to a smaller basin.
Figure 9c shows the objective function under the count-specific rational-expectations prior.
The minimum is well identified, as the function converges sharply to a narrow basin.
The optimal variance terms under this prior appear reasonable. Our estimates of σ1 = 3.1
inches and σ2 = 4.1 inches imply that 15%, 46%, and 75% of observations are within 2, 4,
and 6 inches of their true location. These estimates are precise, with standard errors of 0.14
inches for σ1 and 0.12 inches for σ2.20 The larger estimate on the vertical dimension indicates
20The standard errors are difficult to compute. Since the objective function is composed of indicators, theHessian is undefined. Bootstrapping is computationally infeasible, as the optimizer takes about 36 hours toconverge when running on 64 high-performance cores. Given these constraints, we approximate the Hessianfor the variance terms in the following manner. Holding the bounds of Z at their estimated values, wecompute the objective function value at 8 equally-spaced points on a circle of radius 0.1 inches centered at
26
that the predicted strike zone expands and contracts more vertically than horizontally, as is
true of the enforced strike zone. This asymmetry may also reflect the difficulty of observing
the location of the pitch in relation to the midline of the batter’s chest or the bottom of
the kneecaps. By comparison, the umpire can more easily tell whether the pitch passes over
home plate, which provides a white backdrop for the pitch from the umpire’s vantage point.
Our Bayesian model fits the data closely when the umpire is imbued with count-specific
rational expectations. The minimum objective function value, or the area over which the
enforced and predicted strike zones disagree, is 27.4 square inches under the count-specific
prior—an area equivalent to just 7% of the official strike zone. By contrast, the area of
disagreement is 50.3 square inches under the unconditional rational-expectations prior and
51.2 square inches under the flat prior. Refining the umpire’s prior beliefs does not provide
additional degrees of freedom and thus does not necessarily improve the model’s fit. For
instance, prior beliefs that shift posterior beliefs towards the periphery when the count favors
the batter would worsen the fit. Instead, the close fit that our model achieves under the count-
specific rational expectations prior implies that a Bayesian model with this sophisticated
prior fits the data remarkably well.
Figure 10 visualizes the fit between the model and the data in each count. The dashed
irregular ellipse bounds the predicted strike zone, yS(x0), under count-specific rational ex-
pectations; the solid irregular ellipse bounds the enforced strike zone, yS(x0); and the dotted
box denotes the estimated boundaries of Z. In each count, the predicted and enforced strike
zones coincide, or nearly coincide. When the predicted strike zone is large, as when the
count favors the batter, so too is the enforced strike zone. And when the predicted strike
zone is small, as when the count favors the pitcher, so too is the enforced strike zone. The
enforced strike zone even responds to asymmetries in how the predicted strike zone changes
across counts. As the number of strikes increases, the contraction in both the predicted and
enforced strike zones is concentrated at the top.
We also use the model to predict the probability of a strike call under the count-specific
rational-expectations prior, as in Equation 6. Figure 11a shows the probability of a strike
call across the plane when the count has 0 balls and 0 strikes—the theoretical analogue to
the empirical estimate in Figure 2a. As is the case for actual umpires, the Bayesian umpire
reliably makes obvious calls: pitches in the center of the official strike zone are almost
always called strikes, and pitches on the periphery are almost always called balls. However,
(σ1, σ2). We then use method of moments to fit those objective function values to a parabola of the formax21 + bx22 + cx1x2 + 27.4, where the constant is the objective function value at the estimated variance terms.The approximate matrix of second derivatives is thus [ a c
c b ].
27
Figure 10: The enforced strike zone (solid irregular ellipse) and predicted strike zone(dashed irregular ellipse) in each count, as well as the estimated boundaries of Z (dottedbox).
(a) 0 balls, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(b) 0 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(c) 0 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(d) 1 ball, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(e) 1 ball, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(f) 1 ball, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(g) 2 balls, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(h) 2 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(i) 2 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(j) 3 balls, 0 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(k) 3 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(l) 3 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
28
Figure 11: Predicted probability of a strike call, for listed counts.
(a) P (strike|0-0)
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0.1
0.1 0.1
0.1
0.1
0.10.1
0.1
0.9
0.9
0.9
0.9
0.5
0.5
0.5
0.5
0.5
0.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
(b) P (strike|3-0)− P (strike|0-2)
0.1
0.1
0.1
0.10.1
0.1
0.1
0.1
0.1
0.1
0.1
0.2
0.2 0.2
0.2
0.2
0.20.2
0.2
0.2
0.3
0.30.3
0.3
0.3
0.3
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
the region in which the umpire sometimes calls balls and sometimes calls strikes for pitches
at the same location, is slightly wider in the prediction than in the empirical estimate. In
the prediction, the region between the 10% and 90% contour lines is 8 to 10 inches wide,
compared to a width of 6 to 8 inches in the empirical estimate. Reducing the variance of
the signal improves the fit. In particular, variance terms of σ21 = σ2
2 = 32 inches yield a
prediction that closely mimics the empirical estimate in Figure 2a. Of course, this comes at
the expense of agreement between the predicted and empirical strike zones, as implied by
the plot of the objective function in Figure 9c.
Differences in the probability of a strike call between counts are reproduced in shape,
though not in magnitude. Figure 11b shows the difference in the predicted probability of a
strike call between 3-0 and 0-2 counts—the theoretical analogue to the empirical estimate
in Figure 2b. As is true of actual umpires, the Bayesian umpire is more likely to call strikes
in 3-0 counts than in 0-2 counts, particularly at the top of the official strike zone. However,
the difference is only about half as large in the prediction as in the empirical estimate. For
pitches at the top of the official strike zone, the predicted probability of a strike changes
by more than 30 percentage points between the most imbalanced counts; in the data, the
difference is more than 60 percentage points.
The parameters which minimize the distance between the predicted and enforced strike
zones do not minimize the distance between predicted and empirical probabilities of a strike
call. One reason may be that the signal does not follow a Gaussian distribution. Among
29
symmetric distributions, there is no theoretical reason to prefer the Gaussian, and another
distribution may yield a precise fit both when predicting balls and strikes in expectation
and when predicting probabilities.21 A second possibility is that the distribution of the
signal varies with pitch characteristics. Fastballs are thrown faster in counts that favor the
pitcher.22 If the variance of the signal increases with pitch speed—i.e., if faster fastballs are
harder to observe—then a Bayesian model would predict tighter contour lines in Figure 11a
(since fastballs are relatively slow in 0-0 counts) and larger differences in Figure 11b (since
fastballs are much slower in 3-0 counts than in 0-2 counts).
5.3 Blurred rational expectations
A Bayesian model in which umpires combine rational expectations with noisy observations
predicts observed decisions well. To what degree does this performance depend on our
estimates of the count-specific rational-expectations prior? In particular, would a model
underpinned by less precise estimates predict observed decisions just as well? Here, we
consider a model with blurred rational expectations, or over-smoothed versions of the kernel
density estimates in Figure 8. These prior beliefs preserve the gross features of the count-
specific rational expectations without preserving the details. That is, they represent the
prior beliefs of a Bayesian umpire whose rational expectations are constructed from a random
subset of the million observations at our disposal.
We refit the model under different degrees of blurred rational expectations. Specifically,
we over-smooth the kernel density estimates in Figure 8 by multiplying Silverman’s rule-
of-thumb bandwidth in both dimensions by a given factor. Figure 12 shows the area of
disagreement between the enforced and predicted strike zone for different bandwidth factors.
A factor of 1 reproduces our estimates from the previous section, yielding 27.35 square inches
of disagreement. A factor of 2—equivalent to reducing the number of observations by a factor
of 25 = 32—fares marginally worse, yielding 28.10 square inches of disagreement, an increase
of 3%. A bandwidth factor of 3—equivalent to reducing the number of observations by a
factor of 35 = 243—fares considerably worse, yielding 31.30 square inches of disagreement,
an increase of 14%.23
21We also estimate the model with a signal that follows a bivariate t distribution. The t distribution is ageneralization of the normal. An extra parameter, the degrees of freedom, modulates the proportion of massin the tails. We estimate a degrees of freedom parameter of 13, which effectively reduces the t-distributionto a normal.
22This is likely a product of selection. Pitchers who throw harder reach pitcher-friendly counts more often.23The estimated bounds of Z do not change for bandwidth factors of 1.5 and 2; for factors of 2 and 2.5,
the top boundary drops from 10 inches above the midline to 9 inches. Across the bandwidth factors, σ1
30
Figure 12: Area of disagreement (in square inches) between the enforced and predictedstrike zones, for different degrees of over-smoothed prior beliefs.
1 1.5 2 2.5 3
Bandwidth factor
27
28
29
30
31
32
Are
a o
f d
isa
gre
em
en
t (s
q.
inch
es)
Note: we over-smooth the umpire’s prior beliefs by reestimating the kernel density estimatesin Figure 8 with excessive bandwidths, using Silverman’s rule-of-thumb bandwidth in eachdimension multiplied by the given bandwidth factor.
Over-smoothing the umpire’s rational expectations worsens model fit. Hence, observed
calls more closely correspond to a model in which the umpire holds precise rational expecta-
tions rather than a blurred set that over-smooths its features. However, the model performs
only marginally worse when the umpire forms his rational expectations from an order-of-
magnitude fewer observations than are available to us, implying that the gross features of
the empirical density of called pitches are sufficient, or almost sufficient, for predicting the
enforced strike zone.
5.4 Weighting the signal
Umpires make calls as if applying a consistent decision rule to posterior beliefs formed from
rational priors and imperfect signals. Do umpires combine the prior and signal optimally?
The weight placed on the prior relative to the signal should reflect the quality of the umpire’s
eyesight: an umpire with perfect vision should place all the weight on the signal, and a blind
umpire should place all the weight on the prior.
In our model, the variance of the signal represents the umpire’s eyesight and determines
the weight placed on the prior. A higher variance implies that the umpire is less likely to
ranges from 2.6 to 3.2 inches, and σ2 ranges from 3.7 to 4.2 inches.
31
observe the pitch near its true location. As such, he places more weight on the prior. Since we
cannot measure eyesight directly, we instead estimate the variance terms by finding the values
that best predict observed calls. The values we estimate appear reasonable, suggesting that
updating by umpires approximates Bayes rule. However, our model assumes that umpires
know the true distribution of the signal—i.e., umpires form posterior beliefs from the same
distribution as their observations are drawn from. If umpires are overconfident about their
eyesight and thereby underestimate the variance of the signal, they will underweight the
prior. Likewise, if umpires are underconfident about their eyesight and thereby overestimate
the variance of the signal, they will overweight the prior.
Here, we allow umpires to misperceive the distribution of the signal. Specifically, we
simulate observed locations from one distribution, and we allow umpires to form beliefs
from a signal with a different distribution. As before, we assume that both distributions
are centered at the true location and have zero covariance. However, we now estimate four
variance parameters, one in each dimension for each distribution.
Different elements of the data identify the parameters of each distribution. Increasing the
variance of the distribution from which observations are drawn worsens the umpire’s vision,
which makes the predicted strike zone round; hence, these variance parameters are identified
by the roundness of the enforced strike zone. The variance of the umpire’s perception of
the signal also modulates responsiveness to the prior; hence, these variance parameters are
partially identified by the extent to which the prior can shape the predicted strike zone to
match the enforced strike zone. We suggest caution in interpreting the estimates that follow.
Both distributions determine the roundness of the predicted strike zone; hence, both sets
of variance terms are identified by the roundness of the enforced strike zone. Moreover,
roundness in the enforced strike zone is amplified by pooling umpires with even subtly
different tendencies. Finally, no feature of the enforced strike zone provides a direct measure
of the quality of umpires’ eyesight.
Our results provide mixed evidence on optimal weighting. The model fit improves—
the area of disagreement shrinks from 27.4 square inches to 23.8 square inches—owing the
extra parameters and to differences in the variance estimates between the distributions.24
We estimate that observations are drawn from a distribution with a standard deviation of
4.3 inches along the horizontal dimension and 4.2 inches along the vertical dimension. The
symmetry of these estimates, which proxy for eyesight quality, suggest that vision is no
better (or worse) in one direction. For the umpire’s perception of the signal, we estimate
24We estimate the same bounds of Z as under the count-specific rational-expectations prior in Section 5.2.
32
a standard deviation of 2.1 inches along the horizontal dimension and 4.6 inches along the
vertical dimension. These estimates confirm previous findings of greater responsiveness along
the vertical dimension. They also suggest some misperception of the signal: umpires appear
to underweight the prior along the horizontal dimension and approximate Bayes rule along
the vertical dimension.
6 Discussion
Major League Baseball directs umpires to make ball and strike calls based solely on the
location of the pitch. Yet umpires make different calls in different counts for pitches at the
same location. Previous research has argued that umpires employ different decision rules in
different counts (Moskowitz and Wertheim, 2011; Mills, 2014; Green and Daniels, 2015). We
show that these patterns can be rationalized by a simple Bayesian model in which umpires
apply a consistent decision rule to beliefs that vary rationally across counts.
Though the model is straightforward, the behavior it predicts requires considerable so-
phistication. The umpire must instantaneously weigh his imperfect observation of the loca-
tion of the pitch against count-specific rational expectations that differ by location. We find
that decisions by professional umpires reflect a precise understanding of what they should
rationally expect, as well as an ability to instantaneously integrate those prior beliefs in a
manner that approximates Bayes rule.
6.1 Intuition and rationality
Departures from rational models of belief formation are typically attributed to heuristic rules
and other intuitive processes that are unknowingly misapplied (Simon, 1982; Kahneman,
2011). Judging probabilities from salient characteristics, for instance, may be advisable when
determining whether an animal with sharp incisors is a predator. However, applying those
same heuristics in new environments can lead to systematic errors. When asked whether a
woman described as “deeply concerned with issues of discrimination and social justice” is
more likely to be 1) a bank teller, or 2) a bank teller and a feminist, more respondents choose
the latter, even though feminist bank tellers are clearly a subset of bank tellers (Tversky
and Kahneman, 1983). When applied out of context, intuitions can be very bad.
When applied appropriately, however, intuitions should be capable of being very good.
Psychologists have speculated that rational benchmarks are most achievable in contexts
where outcomes are predictable and practice provides feedback (Kahneman and Klein, 2009).
33
However, tests of this proposition are rare, owing to the difficulty of measuring performance
by experts in the field against predictions from classical models. Notable exceptions include
evidence of minimax play by professional tennis players when choosing whether to serve
outside or down the line (Walker and Wooders, 2001) and by professional soccer players
when choose to kick a penalty left or right (Chiappori, Levitt and Groseclose, 2002; Palacios-
Huerta, 2003). In both cases, the rational benchmark requires at best a moderate level of
sophistication—choosing left or right at some fixed rate—and players face no time pressure.
Umpires achieve a far more sophisticated benchmark. Their split-second decisions ap-
proximate Bayesian updating from rational expectations that vary in a complex manner with
both the location of the pitch and the count. Calling balls and strikes should be amenable
to intuitive expertise: observable cues point to the correct call, which Major League um-
pires learn through decades of training and experience, as well as feedback on hundreds of
decisions after each game. Nonetheless, their sophistication is remarkable. The psychologist
Amos Tversky once remarked, “All your economic models are premised on people being
smart and rational, and yet all the people you know are idiots.”25 Our results suggest that
highly rational behavior can be learned intuitively, and hence, that feedback and repetition,
rather than intelligence, can underpin rationality.26
6.2 Statistical discrimination
Our results can also be understood as evidence of statistical discrimination, or the use of
disallowed information to make more correct decisions. Umpires face a tradeoff between
procedural fairness and accuracy. MLB instructs umpires to make ball and strike calls based
solely on the location of the pitch but evaluates umpires based on the accuracy of their calls.
When the location of the pitch is unclear, discriminating along other variables that predict
pitch location—i.e., the count—helps umpires make more correct calls. Umpires statistically
discriminate: they make different calls in different counts for pitches at the same location,
and they do so in a manner that improves their accuracy.
Statistical discrimination is not unique to umpires. The same elements that make statisti-
cal discrimination appealing to umpires—incentives to make correct decisions and disallowed
evidence that can help make more correct decisions on average—exist in many other impor-
tant settings, from the courtroom to the classroom to the interview room to the immigration
desk. Our setting is unique in that statistical discrimination among umpires is directly
25Quoted in Lewis (2016).26For a longer discussion on this point, and for additional evidence, see Green, Rao and Rothschild (2016).
34
observable. In most settings, both the arbitrator and the analyst face uncertainty about
whether the allowed evidence implies guilt or innocence. In baseball, only the umpire is
uncertain whether a pitch is truly a ball or a strike—the analyst knows the truth.
To our knowledge, our results are the first to provide direct field evidence of statisti-
cal discrimination. Evidence of discrimination can rarely differentiate between statistical
discrimination and taste-based discrimination, in which the use of disallowed information
provides direct utility irrespective of whether doing so improves accuracy (e.g. Bertrand
and Mullainathan, 2004).27 In cases where the mechanism can be parsed, taste-based dis-
crimination has proven more readily identifiable, either by observing animus directly (e.g.
Stephens-Davidowitz, 2014) or by observing patterns of discrimination that cannot plau-
sibly be efficient, as with favoritism of same-race players by referees (Price and Wolfers,
2010; Parsons, Sulaeman, Yates and Hamermesh, 2011). The few field studies that claim
to identify statistical discrimination do so indirectly, either by testing auxiliary predictions
(Altonji and Pierret, 2001; Ondrich, Ross and Yinger, 2003; Levitt, 2004; Ewens, Tomlin and
Wang, 2014) or by conducting laboratory-style experiments in the field (List, 2004).28 In
contrast, we estimate a simple model of statistical discrimination using the observed choices
of professional arbitrators and direct estimates of rational expectations, and we find that
this model almost entirely accounts for otherwise puzzling empirical patterns.
27Others have proposed a second distinction: whether discrimination is intentional or implicit (Bertrand,Chugh and Mullainathan, 2005). We do not take a stand on whether statistical discrimination by umpiresis deliberate.
28In a framed field experiment, List (2004) shows that professional sports-card traders quote higher pricesto women and minorities than to white men. He finds that traders do not discriminate by race or genderwhen paired with potential customers in the dictator game, suggesting that the discrimination is not tastebased. And he shows that these traders perform better than random guessing when asked to choose whichof two unlabeled distributions of reservation values was elicited from white men rather than from womenand minorities, suggesting that the discrimination is statistical in nature.
35
References
Akerlof, George A and William T Dickens (1982) “The economic consequences of cognitivedissonance,” The American economic review, Vol. 72, pp. 307–319.
Altonji, Joseph G and Charles R Pierret (2001) “Employer Learning and Statistical Discrim-ination,” The Quarterly Journal of Economics, Vol. 116, pp. 313–350.
Ambuehl, Sandro (2016) “An offer you can’t refuse: Incentives change what we believe,”Available at SSRN 2830171.
Andreoni, James and Tymofiy Mylovanov (2012) “Diverging opinions,” American EconomicJournal: Microeconomics, Vol. 4, pp. 209–232.
Ariely, Dan, George Loewenstein, and Drazen Prelec (2003) “Coherent arbitrariness: Stabledemand curves without stable preferences,” The Quarterly Journal of Economics, Vol.118, pp. 73–106.
Backus, Matthew, Tom Blake, and Steven Tadelis (2015) “Cheap talk, round numbers, andthe economics of negotiation,”Technical report, National Bureau of Economic Research.
Barber, Brad M and Terrance Odean (2001) “Boys will be boys: Gender, overconfidence, andcommon stock investment,” The quarterly journal of economics, Vol. 116, pp. 261–292.
Barberis, Nicholas, Andrei Shleifer, and Robert Vishny (1998) “A model of investor senti-ment,” Journal of financial economics, Vol. 49, pp. 307–343.
Bertrand, Marianne, Dolly Chugh, and Sendhil Mullainathan (2005) “Implicit discrimina-tion,” American Economic Review, pp. 94–98.
Bertrand, Marianne and Sendhil Mullainathan (2004) “Are Emily and Greg more employ-able than Lakisha and Jamal? A field experiment on labor market discrimination,” TheAmerican Economic Review, Vol. 94, pp. 991–1013.
Brunnermeier, Markus K and Jonathan A Parker (2005) “Optimal expectations,” The Amer-ican Economic Review, Vol. 95, pp. 1092–1118.
Callahan, Gerry (1998) “Moody Blues,” Sports Illustrated, pp. 42–47.
Callan, Matthew (2012) “Called Out: The Forgotten Baseball Umpires Strike of 1999,” TheClassical.
Camerer, Colin F, George Loewenstein, and Matthew Rabin (2004) Advances in BehavioralEconomics: Princeton University Press.
Caple, Jim (2011) “Humbled by umpire school,” ESPN.com.
36
Chen, Daniel, Tobias J Moskowitz, and Kelly Shue (2016) “Decision-Making Under theGambler’s Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires,”The Quarterly Journal of Economics, p. qjw017.
Chiappori, P-A, Steven Levitt, and Timothy Groseclose (2002) “Testing mixed-strategyequilibria when players are heterogeneous: The case of penalty kicks in soccer,” AmericanEconomic Review, pp. 1138–1151.
De Bondt, Werner F M and Richard Thaler (1985) “Does the Stock Market Overreact?”The Journal of Finance, Vol. 40, pp. 793–805.
Deshpande, Sameer K and Abraham Wyner (2017) “A Bayesian Hierarchical Model ToEstimate the Framing Ability of Major League Baseball Catchers,” Working Paper.
Drellich, Evan (2012) “Complex system in place to evaluate umpires,” MLB.com.
Eil, David and Justin M Rao (2011) “The good news-bad news effect: asymmetric processingof objective information about yourself,” American Economic Journal: Microeconomics,Vol. 3, pp. 114–138.
El-Gamal, Mahmoud A and David M Grether (1995) “Are people Bayesian? Uncoveringbehavioral strategies,” Journal of the American statistical Association, Vol. 90, pp. 1137–1145.
Ewens, Michael, Bryan Tomlin, and Liang Choon Wang (2014) “Statistical discriminationor prejudice? A large sample field experiment,” Review of Economics and Statistics, Vol.96, pp. 119–134.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2001) The elements of statisticallearning, Vol. 1: Springer series in statistics Springer, Berlin.
Garicano, Luis, Ignacio Palacios-Huerta, and Canice Prendergast (2005) “Favoritism undersocial pressure,” Review of Economics and Statistics, Vol. 87, pp. 208–216.
Green, Etan A, Justin M Rao, and David M Rothschild (2016) “A Sharp Test of the Porta-bility of Expertise,” Available at SSRN 2656268.
Green, Etan and David P Daniels (2014) “What does it take to call a strike? three biasesin umpire decision making,” in 2014 MIT Sloan Sports Analytics Conference.
(2015) “Impact Aversion in Arbitrator Decisions,” Available at SSRN 2391558.
Grether, David M (1980) “Bayes rule as a descriptive model: The representativeness heuris-tic,” The Quarterly Journal of Economics, Vol. 95, pp. 537–557.
Haigh, Michael S and John A List (2005) “Do professional traders exhibit myopic loss aver-sion? An experimental analysis,” The Journal of Finance, Vol. 60, pp. 523–534.
37
Kahneman, Daniel (2011) Thinking, fast and slow: Macmillan.
Kahneman, Daniel and Gary Klein (2009) “Conditions for intuitive expertise: a failure todisagree.,” American psychologist, Vol. 64, p. 515.
Kahneman, Daniel and Amos Tversky (1972) “Subjective probability: A judgment of repre-sentativeness,” Cognitive Psychology, Vol. 3, pp. 430–454.
Koehler, Jonathan J and Molly Mercer (2009) “Selection neglect in mutual fund advertise-ments,” Management Science, Vol. 55, pp. 1107–1121.
Lacetera, Nicola, Devin G Pope, and Justin R Sydnor (2012) “Heuristic thinking and limitedattention in the car market,” The American Economic Review, Vol. 102, pp. 2206–2236.
Larson, Francis, John A. List, and Robert D. Metcalfe (2016) “Can Myopic Loss AversionExplain the Equity Premium Puzzle? Evidence from a Natural Field Experiment withProfessional Traders,” Working Paper 22605, National Bureau of Economic Research.
Levitt, Steven D (2004) “Testing theories of discrimination: evidence from Weakest Link,”The Journal of Law and Economics, Vol. 47, pp. 431–452.
Lewis, M. (2016) The Undoing Project: A Friendship That Changed Our Minds: W. W.Norton.
Lipkus, Isaac M, Greg Samsa, and Barbara K Rimer (2001) “General performance on anumeracy scale among highly educated samples,” Medical decision making, Vol. 21, pp.37–44.
List, John A (2004) “The nature and extent of discrimination in the marketplace: Evidencefrom the field,” The Quarterly Journal of Economics, pp. 49–89.
Miller, Joshua B and Adam Sanjurjo (2016) “Surprised by the gambler’s and hot handfallacies? A truth in the law of small numbers,” Available at SSRN 2627354.
Mills, Brian M (2014) “Social pressure at the plate: Inequality aversion, status, and mereexposure,” Managerial and Decision Economics, Vol. 35, pp. 387–403.
(2015) “Technological Innovations in Monitoring and Evaluation: Evidence of Per-formance Impacts among Major League Baseball Umpires,” Working Paper.
Mobius, Markus M, Muriel Niederle, Paul Niehaus, and Tanya S Rosenblat (2011) “Man-aging self-confidence: Theory and experimental evidence,” National Bureau of EconomicResearch.
Moskowitz, Tobias and L Jon Wertheim (2011) Scorecasting: The hidden influences behindhow sports are played and games are won: Crown Archetype.
38
Mussweiler, Thomas (2001) “Sentencing Under Uncertainty: Anchoring Effects in the Court-room,” Journal of Applied Social Psychology, Vol. 31, pp. 1535–1551.
Nickerson, Raymond S (1998) “Confirmation bias: A ubiquitous phenomenon in manyguises.,” Review of general psychology, Vol. 2, p. 175.
O’Connell, Jack (2007) “Much required to become MLB umpire,” MLB.com.
Ondrich, Jan, Stephen Ross, and John Yinger (2003) “Now you see it, now you don’t:Why do real estate agents withhold available houses from black customers?” Review ofEconomics and Statistics, Vol. 85, pp. 854–873.
Palacios-Huerta, Ignacio (2003) “Professionals play minimax,” The Review of EconomicStudies, Vol. 70, pp. 395–415.
Parsons, Christopher A, Johan Sulaeman, Michael C Yates, and Daniel S Hamermesh (2011)“Strike three: Discrimination, incentives, and evaluation,” The American Economic Re-view, Vol. 101, pp. 1410–1435.
Price, Joseph and Justin Wolfers (2010) “Racial Discrimination Among NBA Referees,” TheQuarterly Journal of Economics, Vol. 125, pp. 1859–1887.
Rabin, Matthew (2002) “Inference by believers in the law of small numbers,” The QuarterlyJournal of Economics, Vol. 117, pp. 775–816.
Rabin, Matthew and Joel L Schrag (1999) “First impressions matter: A model of confirma-tory bias,” The Quarterly Journal of Economics, Vol. 114, pp. 37–82.
Samuelson, William and Richard Zeckhauser (1988) “Status quo bias in decision making,”Journal of risk and uncertainty, Vol. 1, pp. 7–59.
Simon, Herbert Alexander (1982) Models of bounded rationality: Empirically grounded eco-nomic reason, Vol. 3: MIT press.
Stephens-Davidowitz, Seth (2014) “The cost of racial animus on a black candidate: Evidenceusing Google search data,” Journal of Public Economics, Vol. 118, pp. 26–40.
Stone, Daniel F (2013) “Testing Bayesian Updating with the Associated Press Top 25,”Economic Inquiry, Vol. 51, pp. 1457–1474.
Sutter, Matthias and Martin G Kocher (2004) “Favoritism of agents–the case of referees’home bias,” Journal of Economic Psychology, Vol. 25, pp. 461–469.
Tversky, Amos and Daniel Kahneman (1971) “Belief in the law of small numbers.,” Psycho-logical bulletin, Vol. 76, p. 105.
(1974) “Judgment under Uncertainty: Heuristics and Biases,” Science, Vol. 185, pp.1124–1131.
39
(1983) “Extensional versus intuitive reasoning: The conjunction fallacy in proba-bility judgment.,” Psychological review, Vol. 90, p. 293.
Walker, Mark and John Wooders (2001) “Minimax play at Wimbledon,” The AmericanEconomic Review, Vol. 91, pp. 1521–1538.
Zafar, Basit (2011) “How do college students form expectations?” Journal of Labor Eco-nomics, Vol. 29, pp. 301–348.
40
Figure A1: The enforced strike zone for left- and right-handed batters, in counts with 0balls and 0 strikes.
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
R
L
41
Figure A2: The enforced strike zone, separately for calls following a called strike on theprevious pitch in the at-bat (solid boundary) and calls not following a called strike on theprevious pitch in the at-bat, in listed counts (dot-dashed boundary).
(a) 0 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(b) 0 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(c) 1 ball, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(d) 1 ball, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(e) 2 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(f) 2 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(g) 3 balls, 1 strike
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
(h) 3 balls, 2 strikes
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Ve
rtic
al a
xis
(in
)
42
Figure A3: P (strike|home batting)− P (strike|away batting): difference in the probabilityof a called strike between the bottom half of the inning, when the home team bats, and thetop half of the inning, when the away team bats.
00 0
000
00
0
00
0
0
0
0 00
0
0
0
0
0
0
0
0
00
0
0
-18 -12 -6 0 6 12 18
Horizonal axis (in)
-18
-12
-6
0
6
12
18
Vert
ical axis
(in
)
-0.4
-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
43