Next: An Improved Method for Identifying Impacts in Regression Discontinuity Designs
Mark C. Long University of Washington
Box 353055 Seattle, WA 98195-3055
[email protected] (corresponding author)
Jordan Rooklyn
University of Washington [email protected]
Working Paper, August 31, 2016
Abstract: This paper develops and advocates for a data-driven algorithm that simultaneously selects the polynomial specification and bandwidth combination that minimizes the predicted mean squared error at the threshold of a discontinuity. It achieves this selection by evaluating the combinations of specification and bandwidth that perform best in estimating the next point in the observed sequence on each side of the discontinuity. We illustrate this method by applying it to data with a simulated treatment effect to show its efficacy for regression discontinuity designs and re-examine the results of papers in the literature.
Keywords: Program evaluation; regression discontinuity
Acknowledgments: Brian Dillon, Dan Goldhaber, Jon Smith, Jake Vigdor, Ted Westling, and University of Washington seminar audience members provided helpful feedback on the paper. This research was partially supported by a grant from the U.S. Department of Education’s Institute of Education Sciences (R305A140380).
I. Introduction and Literature Review
Regression discontinuity (RD) designs have become a very popular method for
identifying the local average treatment effect of a program. In many policy contexts, estimating
treatment effects via social experiments is not feasible due to either cost or ethical
considerations. Furthermore, in many contexts, allocating a treatment on the basis of some score
(often a score that illustrates the individual’s worthiness of receiving the treatment) seems
natural. RD holds the promise of having some of the advantages of random treatment allocation
(assuming that being just above or just below the threshold score for receiving the treatment is
effectively random) without the adverse complications of full-blown randomized experiments.
However, RD designs present a challenge for researchers: how to identify the predicted value of
the outcome (Y) as the score (X) approaches the threshold (T) from both the left and right hand
side of that threshold.
A number of guides to standard practice have been written during the past ten years; the
highly cited guide by Lee and Lemieux (2010) provides the following guidance:1
“When the analyst chooses a parametric functional form (say, a low-order
polynomial) that is incorrect, the resulting estimator will, in general, be biased.
When the analyst uses a nonparametric procedure such as local linear
regression—essentially running a regression using only data points ‘close’ to the
cutoff—there will also be bias….Our main suggestion in estimation is to not rely
on one particular method or specification” (p. 284)
To illustrate this point, Lee and Lemieux reanalyze the data from Lee (2008) who evaluated the
impact of party incumbency on the probability that the incumbent party will retain the district’s
seat in the next election for the U.S. House of Representatives. In this analysis, X is defined as
the Democratic vote share in year t minus the vote share of the “Democrat’s strongest opponent
(virtually always a Republican)” (Lee, 2008, p. 686). Lee and Lemieux estimate the treatment
effect by using polynomials ranging from order zero (i.e., the average of prior values) up to a 6th-
order polynomial, with the same order polynomial estimated for both sides of the discontinuity
1 For other discussions of standard methods, see Imbens and Lemieux (2008), DiNardo and Lee (2010), Jacob et al. (2012), and Van Der Klaauw (2013).
and with bandwidths ranging from 1% to 100% (i.e., using all of the data). For each bandwidth,
they identify the “optimal order of the polynomial” by selecting the one with the lowest value of
the Akaike information criterion (AIC). And, they identify an optimal bandwidth “by
choosing the value of h that minimizes the mean square of the difference between the predicted
and actual value of Y” (p. 321). As shown in Table 2 of their paper, using the optimal
bandwidth, which is roughly 5%, and the optimal order of the polynomial for this bandwidth
(quadratic), the estimated effect of incumbency on the Democratic party’s vote share in year t+1
is 0.100 (s.e. = 0.029).
While this model selection procedure has the nice feature of selecting the specification
and bandwidth “optimally”, it has two limitations: (1) it suggests that a particular order of the
polynomial and bandwidth be used on both sides of the discontinuity, and (2) the AIC evaluates
the fit of the polynomial at all values of X, and doesn’t attempt to evaluate the fit of the
polynomial as X approaches the threshold, which is more appropriate for the RD treatment effect
estimation.
Gelman and Imbens (2014) argue against using high order polynomial regressions to
estimate treatment effects in an RD context and instead “recommend that researchers … control
for local linear or quadratic polynomials or other smooth functions” (p. 2). We focus here on
their second critique:
“Results based on high order polynomial regressions are sensitive to the order of
the polynomial. Moreover, we do not have good methods for choosing that order
in a way that is optimal for the objective of a good estimator for the causal effect
of interest. Often researchers choose the order by optimizing some global
goodness of fit measure, but that is not closely related to the research objective of
causal inference” (p. 2).
The goal of our paper is to provide an optimal method for choosing the polynomial order (as well
as the bandwidth) that Gelman and Imbens (2014) note is currently lacking in the literature.
Gelman and Zelizer (2015) illustrate the challenges that could come from using a higher-
order polynomial by critiquing a prominent paper by Chen, Ebenstein, Greenstone, and Li
(2013), described in greater detail below, which examines the effect of an air pollution
policy on life expectancy. Gelman and Zelizer note:
“[Chen et al.’s] cubic adjustment gave an estimated effect of 5.5 years with
standard error 2.4. A linear adjustment gave an estimate of 1.6 years with standard
error 1.7. The large, statistically significant estimated treatment effect at the
discontinuity depends on the functional form employed. …the headline claim, and
its statistical significance, is highly dependent on a model choice that may have a
data-analytic purpose, but which has no particular scientific basis” (pp.3-4).
Gelman and Zelizer conclude that:
“…we are not recommending global linear adjustments as an alternative. In some
settings a linear relationship can make sense …. What we are warning against is
the appealing but misguided view that users can correct for arbitrary dependence
on the forcing variable by simply including several polynomial terms in a
regression” (p. 6).
In the case study in Section 3.3 of this paper, we re-examine the Chen et al. results using our
method. We show that Gelman and Zelizer’s concerns are well founded; our method shows that
the estimated effect of pollution on life expectancy is much smaller.
In addition to finding the most appropriate form for the specification, researchers also
face the challenge of deciding whether to estimate the selected specification over the whole
range of X (that is a “global” estimate of Y=f0(X) and Y=f1(X) where f0(.) and f1(.) reflect the
function on the left and right sides of the threshold) or to estimate the selected specification over
a narrower range of X near T, a “local” approach.
Imbens and Kalyanaraman (2012) argue for using a local approach and develop a
technique for finding the optimal bandwidth. The Imbens and Kalyanaraman bandwidth
selection method is devised for the estimation of separate local linear regressions on each side of
the threshold. They note that “ad hoc approaches for bandwidth choice, such as standard plug-in
and cross-validation methods … are typically based on objective functions which take into
account the performance of the estimator of the regression function over the entire support and
do not yield optimal bandwidths for the problem at hand” (p. 934). Their method, in contrast,
finds the bandwidth that minimizes mean squared error at the threshold. Imbens and
Kalyanaraman caution that their method, which we henceforth label IK, “gives a convenient
starting point and benchmark for doing a sensitivity analyses regarding bandwidth choice” (p.
940) and thus they remind the user to examine the results using other bandwidths.
While the IK method greatly helps researchers by providing a data-generated method for
choosing the optimal bandwidth, it does so by assuming that the researcher is using a local linear
regression on both sides of the threshold. This can introduce substantial bias if (1) a linear
regression is the incorrect functional form or (2) the treatment changes the relationship
between X and Y. Our method, thus, simultaneously selects the optimal polynomial order and the
optimal bandwidth for each side of the discontinuity. We achieve this result by evaluating the
performance of various combinations of order and bandwidth with performance measured as
mean squared error in predicting the observed values of Y as X approaches the threshold (from
either side); estimating the mean squared error at the threshold as a weighted average of prior
mean squared errors with greater weight on mean squared errors close to the threshold; and
identifying the specification/bandwidth combination that has the lowest predicted mean squared
error at the threshold.
We show that our method does modestly better than the IK method when applied to real data
with a simulated treatment effect. We then apply our method to data from two prominent papers
(Lee (2008) and Chen et al. (2013)) and we document the extent to which our method produces
different results.
2. Method
The goal of RD studies is to estimate the local average treatment effect defined as the
expected change in the outcome for those whose score is at the threshold:
τ = E[Yi(1) − Yi(0) | Xi = T],
where Yi(0) is the value of Y if observation i is untreated and Yi(1) is the value of Y if the treatment is received.
Assume that treatment occurs when X ≥ T.2 Assume that there is a smooth and
continuous relationship between X and Y in the range T − ∆0 ≤ X < T and that this
relationship can be expressed as E[Y | X] = f0(X). Likewise, assume that there is a smooth
2 Note that our method is designed for “sharp” regression discontinuities, where treatment is received by all those who are on one side of a threshold and not received by anyone on the other side of the threshold. In “fuzzy” contexts, where there is a discontinuity in the probability of receiving treatment at the threshold, one can obtain estimates of the local effect of the treatment on the treated by computing the ratio of the discontinuity in the outcome at the threshold and the discontinuity in the probability of receiving treatment at the threshold. When applied in the context of fuzzy RDs, our method will identify the intent-to-treat estimate for those at the threshold, but will not yield an estimate of the local average treatment on the treated effect.
and continuous relationship between X and Y in the range T ≤ X ≤ T + ∆1 and that this
relationship can be expressed as E[Y | X] = f1(X). Assuming that the only discontinuity in the
relationship between X and Y at X = T is due to the impact of the treatment, the estimand, τ,
is defined as the difference of the two estimated functions evaluated at the threshold:
τ = f1(T) − f0(T).
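To make the estimand concrete, the sketch below (ours, not code from the paper) fits separate polynomials on each side of an assumed threshold T and takes τ̂ as the gap between the two fits at T. The function name, the fixed orders and bandwidths, and the toy data are all illustrative assumptions; the paper's contribution is precisely how those orders and bandwidths should be chosen.

```python
import numpy as np

def rd_jump(x, y, T, p_left=1, p_right=1, bw_left=np.inf, bw_right=np.inf):
    """Estimate tau = f1(T) - f0(T) by fitting separate polynomials of
    order p on each side of the threshold T, using only observations
    within the chosen bandwidth on that side."""
    left = (x < T) & (x >= T - bw_left)
    right = (x >= T) & (x <= T + bw_right)
    f0 = np.polynomial.polynomial.Polynomial.fit(x[left], y[left], p_left)
    f1 = np.polynomial.polynomial.Polynomial.fit(x[right], y[right], p_right)
    return f1(T) - f0(T)

# Toy check: a known jump of 2.0 at T = 0 on a linear baseline.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = 0.5 * x + 2.0 * (x >= 0) + rng.normal(0, 0.1, x.size)
tau_hat = rd_jump(x, y, T=0.0)
```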
Define mean squared prediction error (MSPE) as follows: MSPE = E[(τ̂ − τ)²].
Our goal is to select the bandwidths (∆0 and ∆1) and orders of the polynomials (p0 and p1) for
estimating f0 and f1 such that MSPE is minimized:3
(1) (∆̂0, p̂0, ∆̂1, p̂1) = argmin{∆0, p0, ∆1, p1} E[(τ̂ − τ)²]
= argmin{∆0, p0, ∆1, p1} E[((f̂1(T) − f̂0(T)) − (f1(T) − f0(T)))²]
= argmin{∆0, p0, ∆1, p1} E[(f̂1(T) − f1(T))²] + E[(f̂0(T) − f0(T))²]
− 2∙E[(f̂1(T) − f1(T))∙(f̂0(T) − f0(T))]
To this point, the minimization problem is unconstrained and standard. Imbens and
Kalyanaraman (2012) add the following constraints to this problem: ∆0 = ∆1 = ∆ and
p0 = p1 = 1. That is, they assume linear relationships between X and Y in the ranges T − ∆ ≤ X < T and
T ≤ X ≤ T + ∆, with the treatment effect, τ, being identified as the jump between those two linear
functions at X = T.
3 Note that choosing a higher bandwidth allows for more data to be used in estimating f(.), which reduces the variance of the estimated parameters. But, a larger bandwidth increases the chance that f(.) is not constant and smooth within the range in which it is estimated. A higher polynomial order can improve the fit of the function f(.) to the observed distribution of X and Y, and thus lowers the bias. But, a higher polynomial order leads to increased variance of the prediction, particularly in the tails of the distribution (e.g., at X = T). By minimizing MSPE through the choice of these parameters, we balance between our desires for low bias and low variance.
We take a different approach, which involves a different set of simplifying assumptions.
First, unlike IK, our approach allows the treatment to more flexibly change the functional
relationship between X and Y, as we do not assume linear functions on either side of the
discontinuity. Our method has f0(.) estimated solely on data where T − ∆0 ≤ X < T, and f1(.)
estimated solely on data where T ≤ X ≤ T + ∆1. This approach is akin to the common practice
in RD studies of estimating one regression which fully interacts the polynomial terms with
an indicator for X ≥ T.4
Second, we simplify the minimization problem considerably by dropping the last term
(i.e., −2∙E[(f̂1(T) − f1(T))∙(f̂0(T) − f0(T))]). Here is our justification for doing so. Suppose
that for a given choice of ∆0 and p0, the prediction error on the left side of the threshold is positive
(i.e., f̂0(T) − f0(T) > 0). One could attempt to select ∆1 and p1 such that the prediction error
on the right side of the threshold is also positive (i.e., f̂1(T) − f1(T) > 0) and equal to the bias
on the left so as to cancel it. In fact, one could carry this further and select ∆1 and p1 such that
the error on the right side of the threshold is as positive as possible, thus making the last term
as negative as possible (a point Imbens and Kalyanaraman note as well). However, doing so
comes at a penalty of increasing the square of the prediction error on the right side (i.e.,
E[(f̂1(T) − f1(T))²]) and thereby results in a higher MSPE. Thus, there is little to be gained by
selecting ∆1 and p1 on the basis of the last term in Equation 1. If we can ignore this term, we
substantially simplify the task by breaking it into two separate problems:
(2) (∆̂0, p̂0) = argmin{∆0, p0} E[(f̂0(T) − f0(T))²]
(∆̂1, p̂1) = argmin{∆1, p1} E[(f̂1(T) − f1(T))²].
The advantage of our approach is that we can directly evaluate how different choices of ∆0
and p0 perform in predicting observed outcomes before one reaches the threshold, and pick values
of ∆0 and p0 that have demonstrated strong performance in terms of their mean squared prediction
errors for observed values. Our key insight is that by focusing on data from one side of the
4 Note that in such models f0(.) and f1(.) are in effect estimated solely based on data from their respective sides of the threshold, as the same coefficients could be obtained by separate polynomial regressions on each side of the threshold. Put differently, no information from the right hand side is being used to estimate the coefficients on the left hand side and vice-versa.
threshold only, we can use that observed data to calculate a series of MSPEs and then predict
MSPE0(T) (and MSPE1(T)) as weighted averages of observed MSPEs (and confidence intervals
around the weighted averages of observed MSPEs). We recognize, however, that if the treatment
does not affect the functional relationship between X and Y (e.g., f0(.) = f1(.)), then our
method would be inefficient (but unbiased), as one would gain power to estimate the common
slope parameters of f0(.) and f1(.) by using data on both sides of the threshold.
Index the observed distinct values of X < T as j = 1 to J. Define MSPEj as equal
to (Ŷj − Yj)², where Ŷj is the prediction from a polynomial of order p that is estimated over the interval
from j − n to j − 1 using the observed distributions of X and Y in this interval, n reflects the
number of prior observations that are used to estimate the polynomial, and Yj is the observed
value of Y when X = Xj. Note that this formula uses an adaptive bandwidth that is a function of
n (i.e., ∆j = Xj−1 − Xj−n) to accommodate areas where the data are thin.
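The construction of this MSPE series can be sketched as follows; this is our minimal reading of the definition above (the function name and interface are hypothetical), fitting a polynomial of order p to the n observations immediately preceding each index j and recording the squared one-step-ahead prediction error.

```python
import numpy as np

def mspe_series(x, y, n, p):
    """For each index j with at least n prior points, fit a polynomial of
    order p to the n observations preceding j and record the squared
    error of its prediction at x[j]. Order p = 0 uses the mean of the
    n prior values of y, following the paper's convention."""
    out = []
    for j in range(n, len(x)):
        xs, ys = x[j - n:j], y[j - n:j]
        if p == 0:
            y_hat = ys.mean()
        else:
            fit = np.polynomial.polynomial.Polynomial.fit(xs, ys, p)
            y_hat = fit(x[j])
        out.append((y_hat - y[j]) ** 2)
    return np.array(out)

# On an exactly linear sequence, a linear fit on any n >= 2 prior points
# predicts the next value perfectly, so every MSPE_j is zero.
x = np.arange(10.0)
y = 3.0 - 2.0 * x
errs = mspe_series(x, y, n=2, p=1)
```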
Suppose that we estimated MSPE(T) as a straight average of these calculated values of
MSPEj (i.e., the simple mean of the calculated MSPEj values), and then selected the parameters n
and p that minimized this straight average. One disadvantage of doing so would be that it
would ignore variance across the values of MSPEj and would not consider the number of
observations of MSPEj used to compute this average.5 Less confidence should be placed in
estimates of MSPE(T) that rely on fewer or more variable observations of MSPEj. Thus, rather
than select the parameters n and p that minimize the average, we select parameters n and
p that minimize the upper bound of an 80% confidence interval around the average (i.e., such
that there is only a 10% chance that the true, unknown, mean value of the broader distribution
from which our observations are drawn is greater than this upper bound).6
A second disadvantage of a straight average is that it places equal weight on the
calculated values of MSPEj regardless of how far Xj is from the threshold. So as to place more
weight on the calculated values of MSPEj for which Xj is close to T, we estimate MSPE(T) as a
weighted average of the calculated values of MSPEj:
(3) MSPE(T) = ∑j wj ∙ MSPEj,
5 The number of observations of MSPEj declines by 1 for each unit increase in either n or p. 6 As with all confidence intervals, the choice of 80% is arbitrary. Different values can be set by the user of our Stata program for executing this method (Long and Rooklyn, 2016).
where w(.) is a kernel function (defined below). We then find the parameters that
solve argmin{n0, p0} {MSPE(T) + 1.28∙σ̂}, where σ̂ is the estimated standard error of
MSPE(T). To find these parameters, we compute MSPE(T) for all combinations of n and p
subject to the following constraints: n and p are integers; n ∈ {max(nmin,
p + 1) . . min(nmax, J − 1)}, where nmin and nmax are the minimum and maximum number of prior
observations the researcher is willing to allow to be used in computing Ŷj; and p ∈
{pmin . . pmax}, where pmin and pmax are the minimum and maximum polynomial orders the
researcher is willing to consider, pmin ≥ 0, and when p = 0, Ŷj is defined as the average of
the n prior values of Y. We select the combination of n and p (among those that are
considered) that minimizes this upper bound.7
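A minimal sketch of this selection rule, assuming the upper bound is the weighted mean plus 1.282 standard errors (the normal critical value implied by an 80% two-sided interval). The standard-error formula shown here for a weighted mean (a Kish effective-sample-size adjustment) is our assumption, not necessarily what the authors' Stata program computes, and the function names are hypothetical.

```python
import numpy as np

def upper_bound(mspes, weights, z=1.282):
    """Weighted mean of the MSPE series plus z times an estimated
    standard error of that mean (z = 1.282 gives the upper bound of a
    two-sided 80% confidence interval)."""
    w = weights / weights.sum()
    mean = np.sum(w * mspes)
    var = np.sum(w * (mspes - mean) ** 2)   # weighted sample variance
    n_eff = 1.0 / np.sum(w ** 2)            # Kish effective sample size
    return mean + z * np.sqrt(var / max(n_eff - 1.0, 1.0))

def select_n_p(x, y, candidates, weight_fn):
    """Pick the (n, p) pair whose MSPE upper bound is smallest."""
    best, best_val = None, np.inf
    for n, p in candidates:
        errs = []
        for j in range(n, len(x)):
            fit = np.polynomial.polynomial.Polynomial.fit(x[j-n:j], y[j-n:j], p)
            errs.append((fit(x[j]) - y[j]) ** 2)
        errs = np.array(errs)
        val = upper_bound(errs, weight_fn(len(errs)))
        if val < best_val:
            best, best_val = (n, p), val
    return best

# Uniform weights; on noisy linear data a linear fit should beat order 0.
rng = np.random.default_rng(1)
x = np.arange(30.0)
y = 2.0 * x + rng.normal(0, 0.5, 30)
choice = select_n_p(x, y, [(5, 0), (5, 1)], lambda m: np.ones(m))
```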
In our empirical investigations below, we use an exponential kernel, defined as follows:
(4) wj = b^(j/J) / ∑k b^(k/J).
b is the base weight, and we alternately explore base weights equal to 1, 10³, 10⁶, and 10¹⁰.
When b = 1, w(.) is the uniform kernel, which gives uniform weight to each value of MSPEj
when estimating MSPE(T). When b = 10³ (10⁶) [10¹⁰], while all MSPEs get some positive weight,
50% (75%) [90%] of the weight is placed on the last 10% of MSPEs that are closest to the
threshold. That is, a higher value of b gives more emphasis to MSPEs closer to the threshold
than to MSPEs further away.
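Under our reading of the exponential kernel described above, the weight on the j-th MSPE is proportional to b^(j/J). The snippet below checks the stated property that a base weight of 10³ puts roughly half of the total weight on the last 10% of MSPEs (those closest to the threshold).

```python
import numpy as np

def exp_kernel_weights(J, b):
    """Exponential kernel: weight on the j-th MSPE (j = 1..J, with J
    closest to the threshold) proportional to b**(j/J). b = 1 reduces
    to the uniform kernel."""
    j = np.arange(1, J + 1)
    w = b ** (j / J)
    return w / w.sum()

# With b = 10**3, roughly half of the total weight should fall on the
# last 10% of MSPEs, as the text states for this base weight.
w = exp_kernel_weights(1000, 1e3)
share_last_10pct = w[-100:].sum()
```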
We repeat this process to estimate the parameters that solve argmin{n1, p1} {MSPE(T) +
1.28∙σ̂}, with the only difference being that we index the observed distinct values of
X ≥ T starting from the extreme right so that the analysis moves in towards T.
7 Note that if using a linear specification with only the last two data points (or any polynomial of order p using the last p + 1 data points) there will be no variance in the estimate of Y at the threshold. If this occurs for both sides of the discontinuity, there would be no variance to the estimate of the jump at the threshold. Such a lack of variance of the difference at the discontinuity would disallow hypothesis testing (or, conclude that there is an infinite t-statistic). As this result is unsatisfactory in most contexts, the reader may want to disallow such specification/bandwidth combinations.
To illustrate our method, suppose we had a series of six data points with (X, Y)
coordinates (1,12), (2,15), (3,16), (4,13), (5,10), and (6,7), and we would like to use this
information to estimate the next value of Y (when X=7). These six points are shown in Panel A
of Figure 1. Our task is to find the specification that generally performs well in predicting the
next value of Y, and more specifically, as discussed above, has a low MSPE for X = 7.
The argument for imposing a limited bandwidth, and not using all of the data points to
predict the next value of Y, is a presumption that there has been a change in the underlying
relationship between Y and X; for example a discrete jump in the value of Y (perhaps unrelated to
X), or a change in the function defining the relationship of Y=f(X). If such a change occurred,
then limiting the bandwidth would (ideally) constrain the analysis to the range in which f(X) is
steady. In the example discussed above, there does appear to be a change in f(X) as the function
appears to become linear after X=3. Of course, this apparent change could be a mirage and the
underlying relationship could in fact be quadratic with no change. If there is no change in the
relationship between Y and X, then one would generally want to use all available data points to
best estimate f(X).
Our method for adjudicating between these specification and bandwidth choices is to
compare all possibilities based on MSPE(T) (and the upper bound of its confidence interval).
Panels B through F of Figure 1 show the performance of possible candidate estimators. The
corresponding Table 1 illustrates our method, where Panel A gives the predicted values based on
polynomial orders in the range p ∈ {0, 1, 2}, and Panel B gives the calculations of each MSPEj
for the feasible combinations of n and p. Note that since the last four observations happen to
be on a line (i.e., (3,16), (4,13), (5,10), and (6,7)), the linear specification using two prior data
points has no error in predicting the values of Y when X equals 5 or 6, and the same is true for
either the linear or quadratic specifications using three prior values for predicting the value of Y
when X equals 6.
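The six-point example can be verified directly. The code below reproduces the zero prediction errors noted above for the linear specification using two prior data points and, as an arithmetic consequence of the four collinear points (not a value stated in the text), extrapolates Ŷ = 4 at X = 7.

```python
import numpy as np

# The six example points from the text.
pts = np.array([(1, 12), (2, 15), (3, 16), (4, 13), (5, 10), (6, 7)], dtype=float)
x, y = pts[:, 0], pts[:, 1]

def predict_next(xs, ys, x_next, p):
    """Fit an order-p polynomial and evaluate it at x_next."""
    fit = np.polynomial.polynomial.Polynomial.fit(xs, ys, p)
    return fit(x_next)

# Linear specification using the two prior points: zero prediction error
# for Y at X = 5 and X = 6, as noted in the text.
err5 = (predict_next(x[2:4], y[2:4], 5, p=1) - 10) ** 2
err6 = (predict_next(x[3:5], y[3:5], 6, p=1) - 7) ** 2

# Extrapolating the same specification to the threshold point X = 7.
y7_hat = predict_next(x[4:6], y[4:6], 7, p=1)
```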
[Insert Figure 1]
[Insert Table 1]
Panel C of Table 1 shows the weighted averages using various kernels. A linear
specification using two prior data points has the lowest weighted average MSPE using all four
base weights, as is indicated by the bolded numbers.8 This result is not surprising given the
perfect linear relation of Y and X for the last four data points. As one can see, as the base weight
increases, the weighted average approaches the value of the last MSPE in the series.
There is clearly a trade-off involved here. With greater weight placed on the last MSPEs in the
series, one gets less bias in the estimate of MSPE at the threshold, as less weight is placed on
MSPEs far away from the threshold. However, relying solely on the last MSPE (i.e., MSPEJ)
could invite error – a particular specification might “accidentally” produce a near-perfect
prediction for the last values of Y before the threshold and thus have a lower MSPEJ, but
incorrectly predict the unknown value of Y at the threshold.
Panel D of Table 1 presents the upper bound of the 80% confidence interval around
MSPE(T). Note that the linear specification using two prior data points has the lowest upper bound
for three of the four base weights (with the exception being the uniform weight). Since high
base weights produce wider confidence intervals, as they increase the sample standard deviation
of the weighted average MSPE, using this upper bound of the confidence interval helps avoid
“unhappy accidents” that could occur when using only MSPEJ. When we apply our method to
simulated data, we find that the performance is relatively insensitive to the base weight, although
we favor b = 10³ given its strong performance documented below.
Our Stata program for executing this method (Long and Rooklyn, 2016) allows the user
to (a) select the minimum number of MSPEj values that must be included in the analysis (≥2),
excluding from consideration combinations of bandwidth and polynomial order that result in
few observations of MSPEj, and thus to avoid “unhappy accidents”; (b) select the minimum and
maximum order of the polynomial that the user is willing to consider; (c) select the minimum
number of prior observations the researcher is willing to allow to be used to estimate the next
observation; and (d) select the desired confidence interval for MSPE(T). For the rest of the paper
(excluding Section 3.3), we set the minimum number of MSPEs to five, the minimum and
maximum polynomial orders to zero and five, the minimum number of observations to five, and
the confidence interval to 80%.
8 If there are ties for the lowest upper bound, which did not occur in Table 1, we select the specification with the lowest order polynomial (and ties for a given specification are adjudicated by selecting the smaller bandwidth). We make these choices given the preference in the literature for narrower bandwidths and lower-order polynomials.
In the next section, we illustrate the method by applying it to simulated data and use the
method to re-evaluate examples from the existing literature.
3. Case Studies That Illustrate the Method
3.1 Case Study 1: Method Applied to Jacob et al. (2012) with a Simulated Treatment
Jacob, Zhu, Somers, and Bloom (2012) provide a primer on how to use RD methods.
They illustrate contemporary methods using a real data set with a simulated treatment effect,
described as follows:
“The simulated data set is constructed using actual student test scores on a
seventh-grade math assessment. From the full data set, we selected two waves of
student test scores and used those two test scores as the basis for the simulated
data set. One test score (the pretest) was used as the rating variable and the other
(the posttest) was used as the outcome. … We picked the median of the pretest (=
215) as the cut-point (so that we would have a balanced ratio between the
treatment and control units) and added a treatment effect of 10 scale score points
to the posttest score of everyone whose pretest score fell below the median.” (pp.
7-8).
We utilize these data provided by Jacob et al. to illustrate the efficacy of our method. Since the
test scores are given in integers, and since the number of students located at each value of the
pretest scores differs, we add a frequency weight to the regressions in constructing our predicted
values, and the weight for computing the weighted average MSPE becomes wj ∙ Nj (normalized to
sum to one), where Nj is the number of observations that have that value of X.
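A small sketch of this frequency-weighted average, where the symbol Nj for the count of students at each distinct score value is our notation and the numbers are made up for illustration:

```python
import numpy as np

def weighted_mspe(mspes, kernel_w, counts):
    """Weighted average MSPE where each distinct X value's kernel weight
    is scaled by N_j, the number of observations at that value."""
    w = kernel_w * counts
    return np.sum(w * mspes) / w.sum()

mspes = np.array([4.0, 1.0, 0.0])
kernel_w = np.array([0.2, 0.3, 0.5])     # more weight near the threshold
counts = np.array([10, 20, 10])          # N_j: students per score value
avg = weighted_mspe(mspes, kernel_w, counts)
```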
In the first panel of Table 2, we estimate the simulated treatment effect (which should be
-10 by construction) with the threshold at 215. Our method selects a linear specification using
23 data points for the left hand side and a quadratic specification with 33 data points for the right
hand side (these selections are not sensitive to the base weight). Compared to the IK method,
which selects a bandwidth of 6.3 for both sides, our method selected a much larger bandwidth.9
9 To estimate these IK bandwidths and resulting treatment effect estimations, we use the “rd” command for Stata that was developed by Nichols (2011) and use local linear regressions (using triangular (“edge”) kernel weights) within the selected bandwidth. We also find nearly identical results using the “rdob” program for Stata written by Fuji, Imbens, and Kalyanaraman (2009).
Our method outperforms IK with a slightly better estimate of the treatment effect (-9.36 versus -
10.68) and smaller standard errors (0.73 versus 1.27). The much smaller standard error provides
our method more power than IK to correctly identify smaller treatment effects.
[Insert Table 2]
The second and third panels of Table 2 reset the threshold for the simulated effect to 205
and 225, which are respectively at the 19th and 77th percentiles of the distribution of X. With the
threshold at 205, our model produces estimates of the simulated treatment effect in the range of
-9.96 to -10.09 with base weights of 1 to 10⁶ and -8.60 with a base weight of 10¹⁰. Regardless of
the base weight, our method selects a quadratic specification using the first 47 observations on
the right side of the discontinuity. In contrast, the IK method uses a bandwidth of only 7.3 on
both sides of the discontinuity and yields an inferior estimate of the treatment effect (-8.25)
with a higher standard error.
Our method and the IK method produce comparable estimates of the treatment effect
when the threshold is set at 225 (-11.67 to -11.78 for our method versus -11.74 for IK), yet our
method again has smaller standard errors due to more precision in the estimates of the regression
line. Figure 3 illustrates our preferred specifications and bandwidths for these three thresholds
using 10³ as the base weight.
[Insert Figure 3]
The next analysis, which is shown in Table 3, evaluates how our method performs when
there is a zero simulated treatment effect. We restore the Jacob et al. data to have no effect and
then estimate placebo treatment effects with the threshold set at 200, 205, …, 230. We are
testing whether our method generates false positives: apparent evidence of a treatment effect
when there is no treatment. Our model yields estimated treatment effects that are generally small
and lie in the range of -1.67 to 0.64. The bad news is that 2 of the 7 estimates are significant at
the 10% level (1 at the 5% level). Thus, a researcher who uses our method would be more likely
to incorrectly claim a small estimated treatment effect to be significant. The IK method does
better at not finding significant placebo effects in the Jacob et al. (2012) data (none of the IK
estimates are significant). However, the IK estimates have a broader range of -2.27 to 1.75.
Thus the researcher using the IK method would be more inclined to incorrectly conclude that the
treatment had a sizable effect even when the policy had no effect. The mean absolute error for
this set of estimates is 0.76 using our method versus 0.97 using the IK method. The only reason
that our method is more likely to incorrectly find significant effects is our lower standard errors,
which lie in the range of 0.68 to 1.10 versus the IK standard errors, which lie in the range of 1.22
to 1.89. Thus, we conclude that our higher rate of incorrectly finding significant effects is not a
bug but a feature. The researcher who uses our method and finds an insignificant effect can
argue that it’s a “well estimated zero”, while that advantage is less likely to be present using IK.
[Insert Table 3]
To further investigate the efficacy of our method and to compare it to IK’s method, we
augment the Jacob et al. (2012) data by altering the outcome as follows:
posttest_augmented = posttest + 5 + (X − 200) − 0.1(X − 200)² + 0.0015(X − 200)³.
This cubic augmentation increases posttest up to a local maximum of 7.7
points at X = 206, then declines to a local minimum of -19.1 at X = 239, and then
curves upward again. We then estimate simulated treatment effects of 10 points for those below
various thresholds, alternatively set at 200, 205, …, 230. This simulated treatment effect added
to an underlying cubic relation between X and posttest_augmented should be harder to
identify using the IK method, as it relies on an assumption of local linear relations. We
furthermore evaluate our method relative to IK where the augmentation of posttest only occurs
on the left or right side of the threshold. Note that since a treatment could have heterogeneous
effects, and thus larger or smaller effects away from the threshold, it is possible for the treatment
to not only have a level effect at the threshold, but also alter the relationship between the
outcome (Y) and the score (X).10 Our method should have a better ability to handle such cases,
and to thus derive a better estimate of the local effect at the threshold.
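The cubic augmentation can be checked numerically. Assuming the form posttest + 5 + (X − 200) − 0.1(X − 200)² + 0.0015(X − 200)³ (our sign reconstruction, chosen to match the stated turning points), evaluating it over the relevant score range recovers the local maximum of ≈7.7 near X = 206 and the local minimum of ≈ -19.1 near X = 239:

```python
import numpy as np

def augmentation(x):
    """Cubic shift added to the posttest score (reconstructed form; the
    signs are inferred so that the stated turning points are matched)."""
    u = x - 200.0
    return 5.0 + u - 0.1 * u**2 + 0.0015 * u**3

x = np.arange(200, 251)      # score range covering both turning points
a = augmentation(x)
x_max, x_min = x[a.argmax()], x[a.argmin()]
```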
The results are shown in Table 4 and the corresponding graphical representations are
shown in Figure 4. In Panel A of Table 4, we show the results with the cubic augmentation
applied to both sides of threshold. Across the seven estimations, our method produces an
average absolute error of 0.94, which is a 7% improvement on the absolute error found using the
IK method, where the average absolute error was 1.00. In Panels B and C of Table 4, we show
the results with the cubic augmentation applied to the left and right sides of threshold,
10 When we add the augmentation to the left hand side only, we level-shift the right hand side up or down so that there is a simulated effect of -10 points at the threshold, and vice-versa.
respectively. Our method is particularly advantageous when the augmentation is applied to the
right side – for these estimations, our method produces an average absolute error that is 30%
lower than the average absolute error using the IK method. As shown in Figure 4, the principal
advantage of our method is the adaptability of the bandwidth and curvature given the available
evidence on each side of the threshold.
[Insert Table 4]
[Insert Figure 4]
Having now (hopefully) established the utility of our method, in the next two sections we
apply the method to two prominent papers in the RD literature.
3.3 Case Study 2: Method Applied to Data from Lee (2008)
Our second case study applies our method to re-estimate findings in Lee (2008) discussed
in Section 1. First, we re-examine the result shown in Lee’s Figure 2a. Y is an indicator variable
that equals 1 if the Democratic Party won the election in that district in year t+1. The key
identifying assumption is that there is a modest random component to the final vote share (e.g.,
rain on Election Day) that cannot be fully controlled by the candidates and that, effectively,
"whether the Democrats win in a closely contested election is...determined as if by a flip of a
coin" (p. 684). Lee’s data comes from U.S. Congressional election returns from 1946 to 1998
(see Lee (2008) for full description of the data).11
The Lee data present a practical challenge for our method. It contains 4,341 and 5,332
distinct values of X on the left and right sides of the discontinuity. Using every possible number
of prior values of X to predict Y at all distinct values of X, while possible, requires substantial
computer processing time. To reduce our processing time, we compute the average value of X
and Y within 200 bins on each side of the discontinuity, with each bin having a width of 0.5%
(since X ranges from -100% to +100% with the discontinuity at 0%). Binning the data as such
has the disadvantage of throwing out some information (i.e., the upwards or downwards sloping
relationship between X and Y within the bin); yet, for most practical applications this information
loss is minor if the bins are kept narrow.
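The binning step can be sketched as follows; the function name and the synthetic data are ours, for illustration only.

```python
import numpy as np

def bin_data(x, y, width=0.5, lo=-100.0, hi=100.0):
    """Average X and Y within fixed-width bins, as described in the text:
    0.5%-wide bins over [-100%, +100%] with an edge at the 0% cutoff, so
    no bin straddles the discontinuity. Empty bins are dropped."""
    edges = np.arange(lo, hi + width, width)
    idx = np.digitize(x, edges) - 1  # bin index for each observation
    bx, by = [], []
    for b in np.unique(idx):
        m = idx == b
        bx.append(x[m].mean())
        by.append(y[m].mean())
    return np.array(bx), np.array(by)

# Hypothetical data standing in for the Lee margins and binary outcomes.
rng = np.random.default_rng(0)
x = rng.uniform(-100, 100, 5000)
y = (x + rng.normal(0.0, 30.0, 5000) > 0).astype(float)

bx, by = bin_data(x, y)  # at most 400 bin means, ordered left to right
```

Placing a bin edge exactly at the cutoff is the design choice that matters here: it guarantees every bin mean belongs wholly to one side of the discontinuity.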
11 We obtained these data on January 2, 2015 from http://economics.mit.edu/faculty/angrist/data1/mhe/lee.
To estimate the treatment effect, Lee applies “a logit with a 4th order polynomial in the
margin of victory, separately, for the winners and the losers” (Lee, 2001, p. 14) using all of the
data on both sides of the discontinuity. Given that our binning results in fractional values that lie
in the interval from 0% to 100%, we use a generalized linear model using a logit link function as
recommended by Papke and Wooldridge (1996) for modeling proportions.12
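A minimal sketch of that fractional-logit approach, implemented directly as a quasi-likelihood Newton iteration rather than with any packaged routine (the simulated data below are hypothetical):

```python
import numpy as np

def fractional_logit(X, y, iters=100, tol=1e-10):
    """Quasi-MLE for a fractional outcome in [0, 1] with a logit link:
    Newton/IRLS steps on the Bernoulli score, the estimator analyzed by
    Papke and Wooldridge (1996). A sketch, not the paper's actual code."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = mu * (1.0 - mu)                       # IRLS weights
        step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical binned proportions: a smooth logistic function of the margin.
rng = np.random.default_rng(1)
x = rng.uniform(-0.5, 0.5, 400)
p = 1.0 / (1.0 + np.exp(-(0.2 + 3.0 * x)))
y = np.clip(p + rng.normal(0.0, 0.03, 400), 0.0, 1.0)

X = np.column_stack([np.ones_like(x), x])
beta = fractional_logit(X, y)  # recovers roughly (0.2, 3.0)
```

In practice one would fit this separately on each side of the discontinuity, exactly as with the linear specifications elsewhere in the paper.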
We find that a linear specification using less than half of the data points is best for X'β on both the left and the right sides (64 and 28 values on the left and right sides, respectively, with the corresponding bandwidth range for the assignment variable being -32.0% to 13.0%).13
We estimate that the Democratic Party has a 15.3% chance of winning the next election if it was barely below 50% in the prior election, and a 57.7% chance of winning the next election if it is just to the right of the discontinuity. Figure 5 shows the estimated curves. Our estimate of the treatment effect (i.e., of barely winning the prior election) is 42.3% (s.e. = 3.5%), which is smaller than Lee's estimate, found in Lee (2001): 45.0% (s.e. = 3.1%).
[Insert Figure 5]
Next, we re-examine the result shown in Lee’s Figure 4a, where Y is now defined as the
Democratic Party’s vote share in year t+1. Lee (2008) used a 4th order polynomial in X for each
side of the discontinuity and concluded that the impact of incumbency on vote share was 0.077
(s.e. = 0.011). That is, being the incumbent raised the expected vote share in the next election by
7.7 percentage points. Applying our method (as shown in Figure 6), we find that the best
specification/bandwidth choice uses a quadratic specification based on the last 171 observations
on the left hand side and a 5th order polynomial based on the 188 observations to the right of the
discontinuity (with the corresponding bandwidth range for the assignment variable being -94.8%
to 93.7%). Our estimated treatment effect is smaller than Lee’s and has a smaller standard error:
0.057 (s.e. = 0.003).
[Insert Figure 6]
Lee’s (2008) study was also reexamined by Lee and Lemieux (2010) and Imbens and
Kalyanaraman (2011). We noted in Section 1 that according to Lee and Lemieux’s analysis, the
12 See also Baum (2008).
13 After binning the data, we end up with 145 distinct values of X on the left side as some bins have no data (i.e., no elections in which the Democratic vote share in year t minus the strongest opponent's share fell in that range).
optimal bandwidth/specification resulted in a larger estimate of the effect of incumbency (0.100)
and a larger standard error (0.029). Scanning across their Table 2, the smallest estimated effect
that they found was 0.048. Thus, our estimate is not outside of the range of their estimates.
Nonetheless, our estimate is smaller than what would be selected using Lee and Lemieux’s two-
step method for selecting the optimal bandwidth and then optimal specification for that
bandwidth. Imbens and Kalyanaraman found that the optimal bandwidth for a linear specification on both sides was 0.29; using this bandwidth/specification produced an estimate of the treatment effect of 0.080 (s.e. = 0.008). Again, their preferred estimate is somewhat larger than the estimate found using our method and has a higher standard error.14
3.4 Case Study 3: Method Applied to Data from Chen, Ebenstein, Greenstone, & Li (2013)
Our final case study is a replication of a prominent paper by Chen et al. (2013) that
alarmingly concludes that “an arbitrary Chinese policy that greatly increases total suspended
particulates (TSPs) air pollution is causing the 500 million residents of Northern China to lose
more than 2.5 billion life years of life expectancy” (p. 12936). This policy established free coal
to aid winter heating of homes north of the Huai River and Qinling Mountain range. Chen et al.
used the distance from this boundary as the assignment variable with the treatment discontinuity
being the border itself.
As shown in the first column of our Figure 7 (which reprints their Figures 2 and 3), Chen
et al. estimate that being north of the boundary significantly raises TSP by 248 points and
significantly lowers life expectancy by 5.04 years. These estimates are also shown in Panel A of
Table 5.
[Insert Figure 7]
[Insert Table 5]
We have attempted to replicate these results. Unfortunately, the primary data are
proprietary and not easy to obtain; permission for their use can only be granted by the Chinese
14 Note, however, that when we apply the Stata programs written by Fuji, Imbens, and Kalyanaraman (2009) and Nichols (2011) that produce treatment estimates using the Imbens and Kalyanaraman (2011) method, we find the optimal bandwidth for a linear specification on both sides was 0.11; using this bandwidth/specification produced an estimate of the treatment effect of 0.059 (s.e. = 0.002), which is quite similar to our estimates.
Center for Disease Control.15 Rather than use the underlying primary data, we are treating the
data shown in their Figures 2 and 3 as if it were the actual data. To do so, we have manually
measured the X and Y coordinates of each data point in these figures as well as the diameter of
each circle (where the circle’s area is proportional to the population of localities represented in
the bin).16 The middle column of Figure 7 and Panel B of Table 5 present our replication
applying their specification (a global cubic polynomial in latitude with a treatment jump at the
discontinuity) to these data. We obtain similar results, although the magnitudes are smaller and
less significant; our replication of their specification produces estimates that being north of the
boundary raises TSP by 178 points (p-value 0.069) and insignificantly lowers life expectancy by
3.94 years (p-value 0.389). Comparing the first and second columns of Figure 7, note that the
shapes of the estimated polynomial specifications are generally similar with the modest
discrepancies showing that there is a bit of information lost by binning the data.
In Panel C of Table 5, we apply our method to estimate these treatment effects.17 We
find significant effects on TSP, with TSP rising significantly by 146 points (using IK's method, TSP
is found to rise significantly by 197 points). Thus, Chen et al.’s conclusion that TSP rises
significantly appears to be reasonable and robust to alternative specifications.
However, as shown in the second column in Panel D of Table 5, the estimated treatment
impact on Life Expectancy is much smaller; we estimate that being north of the boundary
significantly lowers life expectancy by 0.40 years, which is roughly one-tenth the effect size we
estimated using their global cubic polynomial specification. The fragility of these results should
not be surprising given a visual inspection of the scatterplot, which does not reveal a clear
pattern to the naked eye. In fact, for the right hand side of the threshold for Life Expectancy, we find that a simple averaging of the 8 data points to the right of the threshold gives the best prediction at the threshold.
We agree with Gelman and Zelizer’s (2015) critique that the result
15 Personal communication with Michael Greenstone, March 16, 2015.
16 We have taken two separate measurements for each figure and use the average of these two measurements for the X and Y coordinates and the median of our four measurements of the diameter of each circle.
17 Given the small number of observations on each side of the discontinuity, we placed no constraint on the minimum number of observations or the minimum number of MSPEs that are required to be included. We considered polynomials of order 0 to 5.
“indicates to us that neither the linear nor the cubic nor any other polynomial
model is appropriate here. Instead, there are other variables not included in the
model which distinguish the circles in the graph” (p.4).
4. Conclusion
While regression discontinuity design has a history of over 50 years as a method for estimating treatment impacts (going back to Thistlethwaite and Campbell (1960)), the appropriate method for selecting
the specification and bandwidth to implement the estimation has yet to be settled. This paper’s
contribution is the provision of a method for optimally and simultaneously selecting a bandwidth
and polynomial order for both sides of a discontinuity. We identify the combination that
minimizes the estimated mean squared predicted error at the threshold of a discontinuity. Our
paper builds on Imbens and Kalyanaraman (2012) but differs from their approach, which solves for the optimal bandwidth assuming that a linear specification will be used on both sides
of the discontinuity. Our insight is that one can use the information on each side of the
discontinuity to see what bandwidth/polynomial-order combinations do well in predicting the
next data point as one moves closer and closer to the discontinuity. We apply our method to reexamine several notable papers in the literature. While some of these papers' results are shown
to be robust, others are shown to be more fragile, suggesting the importance of using optimal
methods for specification and bandwidth selection.
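The core selection rule can be sketched as follows, under simplifying assumptions that are ours: uniform weighting of the squared prediction errors rather than the paper's base-weight scheme, and a plain grid over polynomial orders and bandwidths.

```python
import numpy as np

def select_spec_bandwidth(x, y, max_order=3):
    """Score each (polynomial order p, bandwidth n) pair on one side of a
    discontinuity -- x sorted so the threshold lies just past x[-1] -- by
    its error in predicting each successive "next" point, and keep the
    pair with the smallest mean squared prediction error. A simplified
    sketch of the paper's procedure: uniform MSPE weights are used here
    instead of the base-weight scheme discussed in the text."""
    best_mspe, best_p, best_n = np.inf, None, None
    for p in range(max_order + 1):
        for n in range(p + 1, len(x)):     # need >= p+1 points to fit order p
            errs = []
            for t in range(n, len(x)):     # predict y[t] from the n prior points
                coefs = np.polyfit(x[t - n:t], y[t - n:t], p)
                errs.append((np.polyval(coefs, x[t]) - y[t]) ** 2)
            mspe = float(np.mean(errs))
            if mspe < best_mspe:
                best_mspe, best_p, best_n = mspe, p, n
    return best_mspe, best_p, best_n

# On noiseless linear data the winning specification should be (at least)
# linear with essentially zero prediction error.
x = np.arange(20.0)
y = 2.0 * x + 1.0
mspe, p, n = select_spec_bandwidth(x, y)
```

The chosen (p, n) pair would then be used to fit the final polynomial on each side and take the difference of the two fitted values at the threshold as the treatment-effect estimate.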
References
Baum, C.F. (2008). Modeling proportions. Stata Journal 8: 299–303.
Chen, Y., Ebenstein, A., Greenstone, M., and Li, H. (2013). Evidence on the impact of sustained
exposure to air pollution on life expectancy from China’s Huai River policy. Proceedings of the National Academy of Sciences 110, 12936–12941.
DiNardo, J., and Lee, D. (2010). Program evaluation and research designs. In Ashenfelter and Card (eds.), Handbook of Labor Economics, Vol. 4.
Fuji, D., Imbens, G. and Kalyanaraman, K. (2009). Notes for Matlab and Stata regression discontinuity software. https://www.researchgate.net/publication/228912658_Notes_for_Matlab_and_Stata_regression_discontinuity_software. Software downloaded on July 2, 2015 from http://faculty-gsb.stanford.edu/imbens/RegressionDiscontinuity.html.
Gelman, A., and Imbens, G. (2014). Why high-order polynomials should not be used in regression discontinuity designs. National Bureau of Economic Research, Working Paper 20405, http://www.nber.org/papers/w20405.
Gelman, A., and Zelizer, A. (2015). Evidence on the deleterious impact of sustained use of polynomial regression on causal inference. Research & Politics, 2(1), 1-7.
Imbens, G., and Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79, 933–959.
Imbens, G., and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics 142, 615–635.
Jacob, R., Zhu, P., Somers, M., and Bloom, H. (2012). A practical guide to regression discontinuity, MDRC, Accessed via http://www.mdrc.org/sites/default/files/regression_discontinuity_full.pdf.
Lee, D.S. (2001). The electoral advantage to incumbency and voters' valuation of politicians' experience: A regression discontinuity analysis of elections to the U.S. House. National Bureau of Economic Research, Working Paper 8441.
Lee, D.S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142, 675-697.
Lee, D. S., and Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature 48, 281-355.
Long, M.C., and Rooklyn, J. (2016). Next: A Stata program for regression discontinuity. University of Washington.
Nichols, A. (2011). rd 2.0: Revised Stata module for regression discontinuity estimation. http://ideas.repec.org/c/boc/bocode/s456888.html
Papke, L.E., and Wooldridge, J. (1996). Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11: 619–632.
Thistlethwaite, D., and Campbell, D. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology 51(6): 309–317.
Van Der Klaauw, W. (2008). Regression-discontinuity analysis: A survey of recent developments in economics. Labour 22, 219–245.
Figure 1: Predicting the next value after six observed data points
Panel A: Data available to predict next Y
Panel B: Predicting Y given X ≥ 2 using prior value of X
Panel C: Predicting Y given X ≥ 3 using prior two values of X
Panel D: Predicting Y given X ≥ 4 using prior three values of X
Panel E: Predicting Y given X ≥ 5 using prior four values of X
Panel F: Predicting Y given X = 6 using prior five values of X
Table 1: Computing Mean Squared Prediction Error (MSPE) and Selecting the Optimal Specification and Bandwidth

Polynomial order:            0      0      0      0      0      1      1      1      1      2      2      2
Number of prior data points: 1      2      3      4      5      2      3      4      5      3      4      5

Panel A: Prediction of Y
 X    Y
 1   12
 2   15   12.0
 3   16   15.0   13.5                        18.0
 4   13   16.0   15.5   14.3                 17.0   18.3                 15.0
 5   10   13.0   14.5   14.7   14.0          10.0   12.7   15.0           6.0    7.5
 6    7   10.0   11.5   13.0   13.5   13.2    7.0    7.0    9.0   11.4    7.0    4.0    3.4

Panel B: Error Squared
 X
 2         9.0
 3         1.0    6.3                         4.0
 4         9.0    6.3    1.8                 16.0   28.4                  4.0
 5         9.0   20.3   21.8   16.0           0.0    7.1   25.0          16.0    6.3
 6         9.0   20.3   36.0   42.3   38.4    0.0    0.0    4.0   19.4    0.0    9.0   13.0

Panel C: Predicted Value of MSPE given X = Threshold (i.e., Weighted Average of MSPEs)
 Base Wgt. = 1 (Uniform)   7.4     13.3   19.9   29.1   38.4   5.0    11.9   14.5   19.4   6.7   7.6    13.0
 Base Wgt. = 10^3          8.9     19.4   31.6   37.0   38.4   0.8     2.7    8.2   19.4   3.2   8.4    13.0
 Base Wgt. = 10^6          8.998   20.2   35.0   40.7   38.4   0.1     0.5    5.2   19.4   1.0   8.8    13.0
 Base Wgt. = 10^10         8.99999 20.25  35.9   42.0   38.4   0.002   0.1    4.2   19.4   0.2   8.97   13.0

Panel D: Upper Bound of 80% Confidence Interval Around Predicted Value of MSPE given X = Threshold
(blank where only one prediction error is available)
 Base Wgt. = 1 (Uniform)   9.9    19.9   38.6   69.5         11.2   28.0   46.8          15.7  11.9
 Base Wgt. = 10^3         13.2    29.7   57.1   84.1         11.2   26.2   47.6          17.5  13.9
 Base Wgt. = 10^6         14.1    32.6   65.5   94.5         12.1   27.6   49.6          16.5  14.8
 Base Wgt. = 10^10        14.4    33.4   68.0   98.6         12.4   27.9   49.8          15.9  15.0
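The mechanics of Panels A and B can be reproduced from the six toy observations of Figure 1 (Y = 12, 15, 16, 13, 10, 7 at X = 1, ..., 6); the helper function below is ours.

```python
import numpy as np

# The six toy observations from Figure 1.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([12.0, 15.0, 16.0, 13.0, 10.0, 7.0])

def predict_next(x, y, p, n, x_next):
    """Fit an order-p polynomial to the last n observed points and
    extrapolate to x_next (one cell of Table 1, Panel A)."""
    coefs = np.polyfit(x[-n:], y[-n:], p)
    return np.polyval(coefs, x_next)

# Predictions of Y at X = 6 from the five earlier points, matching the
# bottom row of Panel A:
pred_p0_n1 = predict_next(X[:5], Y[:5], 0, 1, 6.0)  # mean of last point: 10.0
pred_p1_n2 = predict_next(X[:5], Y[:5], 1, 2, 6.0)  # line through last two: 7.0
pred_p2_n3 = predict_next(X[:5], Y[:5], 2, 3, 6.0)  # quadratic, last three: 7.0

# Panel B's entries are the squared errors against the realized Y = 7:
sq_err = (pred_p1_n2 - Y[5]) ** 2                   # 0.0 in Table 1
```

Repeating this for every (order, prior-points) pair and every feasible X fills in Panels A and B; the weighted averages in Panel C then summarize each column.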
Table 2: Estimating a Simulated Treatment Effect of -10 with Jacob et al. (2012) Data

Simulated Treatment Effect = -10

                               Threshold = 215                       Threshold = 205                       Threshold = 225
Base Weight                 1      1,000   10^6    10^10          1      1,000   10^6    10^10          1      1,000   10^6    10^10

Left Side of Threshold
 Optimal Specification    Linear Linear Linear Linear         Linear Linear Linear Linear         Cubic  Cubic  Cubic  Cubic
 Optimal # Prior Obs.       23     23     23     23             19     19     17      6             44     44     44     33
 Total # Prior Obs.         42     42     42     42             32     32     32     32             52     52     52     52

Right Side of Threshold
 Optimal Specification    Quadratic Quadratic Quadratic Quadratic   Quadratic Quadratic Quadratic Quadratic   Linear Linear Linear Linear
 Optimal # Prior Obs.       33     33     33     33             47     47     47     47             20     20     20     20
 Total # Prior Obs.         42     42     42     42             52     52     52     52             32     32     32     32

Our Estimate of Treatment Effect
 Estimate                 -9.36  -9.36  -9.36  -9.36          -9.96  -9.96  -10.09 -8.60          -11.67 -11.67 -11.67 -11.78
 s.e. (Estimate)          (0.73) (0.73) (0.73) (0.73)         (0.93) (0.93) (0.95) (1.38)         (0.98) (0.98) (0.98) (1.07)

Using Imbens and Kalyanaraman's (2012) Optimal Bandwidth for Linear Specification
 Bandwidth                  6.3                                 7.3                                 7.3
 Estimate                -10.68                                -8.25                              -11.74
 s.e. (Estimate)          (1.27)                               (1.50)                             (1.44)
Figure 2: Selection of Specification and Bandwidth Using Data from Jacob et al. (2012) With Simulated Treatment Effect of -10 at Various Thresholds

Simulated Threshold = 205: Estimated Treatment Effect = -9.39 (s.e. = 0.24)
Simulated Threshold = 215: Estimated Treatment Effect = -9.96 (s.e. = 0.18)
Simulated Threshold = 225: Estimated Treatment Effect = -11.67 (s.e. = 0.14)

[Each panel plots posttest (180 to 280) against pretest (160 to 260).]
Table 3: Estimating a Placebo Treatment Effect with Jacob et al. (2012) Data

Simulated Treatment Effect = 0

Threshold                    200       205       210       215       220       225       230       Average

Left Side of Threshold
 Optimal Specification     Linear    Linear    Linear    Linear    Linear    Cubic     Quadratic
 Optimal # Prior Obs.        19        19        32        23        40        44        39
 Total # Prior Obs.          27        32        37        42        47        52        57

Right Side of Threshold
 Optimal Specification     Quadratic Quadratic Linear    Quadratic Quadratic Linear    Linear
 Optimal # Prior Obs.        47        47        29        33        24        20        20
 Total # Prior Obs.          57        52        47        42        37        32        27

Our Estimate of Treatment Effect
 Estimate                  -0.90      0.04     -1.39      0.64      0.50     -1.67     -0.20
 s.e. (Estimate)           (1.10)    (0.93)    (0.68)    (0.73)    (0.77)    (0.98)    (1.03)
 Significance                                   **                            *

Using Imbens and Kalyanaraman's (2012) Optimal Bandwidth for Linear Specification
 Bandwidth                   6.7       7.3       6.3       6.3       7.3       7.3       8.1
 Estimate                    0.01      1.75     -2.27     -0.68      0.22     -1.74      0.08
 s.e. (Estimate)            (1.89)    (1.50)    (1.50)    (1.27)    (1.22)    (1.44)    (1.58)
 Significance

Absolute Error
 Long & Rooklyn              0.90      0.04      1.39      0.64      0.50      1.67      0.20      0.76
 Imbens & Kalyanaraman       0.01      1.75      2.27      0.68      0.22      1.74      0.08      0.97
 Better Performance:         IK        LR        LR        LR        IK        LR        IK        LR

Note: ***, **, and * denote two-tailed significance at the 1%, 5%, or 10% level.
Table 4: Estimating a Simulated Treatment Effect with Jacob et al. (2012) Data, Augmented with Cubic Function of Pretest (X) Added to Posttest (Y)

Threshold                    200       205       210       215       220       225       230       Average

Panel A: Augmentation Applied to Both Sides of Threshold
Our Estimate of Treatment Effect
 Estimate                 -10.02     -8.14    -11.63     -9.35    -10.26    -11.27    -10.86
 s.e. (Estimate)           (1.59)    (1.23)    (1.40)    (1.04)    (0.91)    (1.11)    (1.50)
Using Imbens and Kalyanaraman's (2012) Optimal Bandwidth for Linear Specification
 Estimate                 -10.23     -8.64    -12.58    -10.89     -9.95    -11.78     -9.88
 s.e. (Estimate)           (1.94)    (1.55)    (1.72)    (1.24)    (1.21)    (1.41)    (1.55)
Absolute Error
 Long & Rooklyn              0.02      1.86      1.63      0.65      0.26      1.27      0.86      0.94
 Imbens & Kalyanaraman       0.23      1.36      2.58      0.89      0.05      1.78      0.12      1.00
 Better Performance:         LR        IK        LR        LR        IK        LR        IK        LR

Panel B: Augmentation Applied to Left Side of Threshold
Our Estimate of Treatment Effect
 Estimate                 -10.27     -8.09    -11.32     -9.38    -10.13    -11.67    -10.82
 s.e. (Estimate)           (1.51)    (1.16)    (1.44)    (0.99)    (1.05)    (0.98)    (1.22)
Using Imbens and Kalyanaraman's (2012) Optimal Bandwidth for Linear Specification
 Estimate                 -10.10     -8.74    -12.72    -10.94     -9.95    -11.73     -9.69
 s.e. (Estimate)           (2.06)    (1.62)    (1.42)    (1.22)    (1.21)    (1.43)    (1.60)
Absolute Error
 Long & Rooklyn              0.27      1.91      1.32      0.62      0.13      1.67      0.82      0.96
 Imbens & Kalyanaraman       0.10      1.26      2.72      0.94      0.05      1.73      0.31      1.02
 Better Performance:         IK        IK        LR        LR        IK        LR        IK        LR

Panel C: Augmentation Applied to Right Side of Threshold
Our Estimate of Treatment Effect
 Estimate                 -10.65    -10.01    -11.70     -9.338    -9.63    -11.27    -10.25
 s.e. (Estimate)           (1.21)    (1.01)    (0.59)    (0.80)    (0.53)    (1.11)    (1.35)
Using Imbens and Kalyanaraman's (2012) Optimal Bandwidth for Linear Specification
 Estimate                 -10.02     -7.97    -12.16    -10.664    -9.78    -11.79    -10.11
 s.e. (Estimate)           (1.75)    (1.46)    (1.57)    (1.30)    (1.21)    (1.41)    (1.55)
Absolute Error
 Long & Rooklyn              0.65      0.01      1.70      0.662     0.37      1.27      0.25      0.70
 Imbens & Kalyanaraman       0.02      2.03      2.16      0.664     0.22      1.79      0.11      1.00
 Better Performance:         IK        LR        LR        LR        IK        LR        IK        LR
Figure 4: Selection of Specification and Bandwidth Using Data from Jacob et al. (2012) Augmented with Cubic Function of Pretest (X) Added to Posttest (Y) and With Simulated Treatment Effect of -10 at Various Thresholds

Columns: Panel A: Augmentation Applied to Both Sides of Threshold; Panel B: Augmentation Applied to Left Side of Threshold; Panel C: Augmentation Applied to Right Side of Threshold
Rows: Thresholds = 200, 205, 210, 215

(Figure 4 Continued on Next Page)
Figure 4 Continued

Columns: Panel A: Augmentation Applied to Both Sides of Threshold; Panel B: Augmentation Applied to Left Side of Threshold; Panel C: Augmentation Applied to Right Side of Threshold
Rows: Thresholds = 220, 225, 230
Figure 5: Replication of Lee’s (2008) Figure 2a Using Our Specification/Bandwidth Selection Method
Figure 6: Replication of Lee’s (2008) Figure 4a Using Our Specification/Bandwidth Selection Method
[Axes: Vote Share, Election t+1 (0 to 1) against Democratic Vote Share Margin of Victory, Election t (-1 to +1).]
Figure 7: Replication of Chen et al.’s (2013) Figures 2 & 3
Figures 2 and 3 reprinted from Chen et al. (2013)
Replication using their specification applied to data inferred from figures
Replication using our method applied to data inferred from figures
[Axes: TSP (0 to 800) and Life Expectancy (Years) (65 to 95) against Degrees North of the Huai River Boundary (-20 to +20).]
Table 5: Estimating Treatment Effects using Data Inferred from Chen et al. (2013)