IDENTIFYING EDUCATION’S POLITICAL EFFECTS WITH INCOMPLETE DATA: INSTRUMENTAL VARIABLE
ESTIMATES COMBINING TWO DATASETS
JOHN MARSHALL∗
MAY 2014
Political scientists are increasingly using instrumental variable (IV) methods, but are often faced with datasets that lack key variables or which only provide coarse variable codings. While completely missing a key variable typically causes projects to be abandoned, coarsening a treatment variable with multiple intensities (e.g. creating a binary treatment indicator) can substantially upwardly bias IV estimates. This bias arises where the coarsening causes the first stage to only capture part of the instrument’s effect. Two-sample IV methods offer a powerful solution to both problems: imputing values for the missing or coarsened variable using a separate dataset drawn from the same population with richer measurement of the treatment consistently estimates the weighted average per-unit treatment effect. Applying this approach in a fuzzy regression discontinuity setting in Great Britain, I show that an additional year of schooling substantially increases the probability of voting Conservative later in life. The estimate for completing high school, however, is upwardly biased by between two and six times.
∗PhD candidate, Department of Government, Harvard University. [email protected]. I thank Anthony Fowler and Horacio Larreguy for illuminating discussions.
1 Introduction
Instrumental variable (IV) techniques are now a standard component of the political scientist’s
methodological toolkit. Sovey and Green’s (2011) meta-analysis identifies more than one hundred
articles published in three top journals over two decades using IV techniques. It is easy to under-
stand why. Interpreted in the heterogeneous potential outcomes framework (Imbens and Angrist
1994; Angrist, Imbens and Rubin 1996), IV approaches promise to identify the average causal
effect of a treatment for the population of units that would not have received the treatment without
the intervention of the instrumental variable.
The prevalence of IV techniques has warranted increased scrutiny. Sovey and Green’s (2011)
review highlights six key concerns in IV analyses, and identifies the types of evidence and argu-
ment required to justify the assumptions underpinning the IV framework. Angrist and Pischke
(2008) also provide clear advice on using IV methods in practice.
However, this article highlights an important additional concern: using a binary (or coarsened)
treatment variable when the true treatment has multiple intensities can substantially upwardly bias
IV estimates. An important example is where a dummy variable for completing high school is
used because years of schooling is not measured. This missing data issue has been ignored by
both political scientists and economists, but frequently arises in empirical applications. I explain
how two-sample IV methods can alleviate the problem when, as is often the case, the data needed
to correct the bias is not available in the original sample. I then use the two-sample approach to
identify the causal effect of schooling on political preferences, using fuzzy regression discontinuity
methods to show that an additional year of high school substantially increases Conservative voting
in Great Britain.
This article first shows how coarsening a multi-valued (or interval) treatment variable intro-
duces upward bias. The reduced form captures the impact of an instrument on an outcome for ev-
ery individual regardless of their (coarsened) treatment intensity. However, the first stage—which
re-weights the reduced form coefficient—underestimates the effect of the instrument on the coars-
ened treatment by failing to recognize that the treatment intensity increases for some individuals
without passing the threshold required to be designated a new coarsened treatment intensity value.
In the case of schooling, the first stage for compulsory schooling laws (CSLs)—a popular instru-
ment for completing high school (e.g. Dee 2004; Lochner and Moretti 2004; Milligan, Moretti and
Oreopoulos 2004)—only captures the individuals CSLs push to complete high school, neglect-
ing any increase in schooling which does not result in completing high school. However, since
the reduced form includes the effects for individuals who experienced greater schooling without
completing high school, this can substantially upwardly bias IV estimates.
The bias is most severe when there is a large first stage for neighboring intensities with large
causal effects. The political effects of completing high school may thus be substantially biased
if each additional year of schooling has a significant causal effect on political preferences and
the instrument increases schooling for many students without inducing them to complete high
school. When the true causal effect is not highly discontinuous at a known point, estimating the
weighted average per-unit treatment effect for an interval treatment intensity is more appropriate.
I will show that IV techniques provide a consistent estimate of this quantity of interest, even when
some categories of the underlying treatment intensity are unobserved. Unlike the case where
a treatment intensity is discretized, there is also a clear and conceptually-appealing counterfactual
interpretation for this estimate.
While a treatment intensity variable can be incorrectly discretized or “miscoded” through the
choice of a researcher, a common problem is that more granular data is not available. Using a
dummy for completing high school, for example, is often necessitated because datasets such as
the American National Election Studies and British Social Attitudes Survey only provide relatively
coarse measures of education. In such cases, any IV estimate of schooling’s political effects may
be significantly biased.
The two-sample IV techniques pioneered by Angrist and Krueger (1992, 1995) can substan-
tially alleviate or solve this problem. Conceptually, these methods estimate the reduced form in
a sample containing data on only the outcome and the instrument, and the first stage in a sample
containing data on only the treatment and the instrument, before combining the two to produce
an IV estimate. The sample used for the first stage effectively serves as a means of imputing the
missing treatment variable. In this sense, two-sample IV methods behave like multiple imputation
techniques (e.g. Honaker and King 2010; King et al. 2001).1 Beyond the standard IV assump-
tions, both samples must be random draws from the same population. I show that the two-sample
2SLS (TS2SLS) estimator first proposed by Angrist and Krueger (1995)—which can accommodate
both overidentification and additional covariates—is consistent, and I also extend Inoue and Solon
(2010) to derive the associated cluster-robust variance matrix which corrects for finite-sample dif-
ferences between samples 1 and 2. Despite their merits, two-sample IV methods have not yet been
used in political science.
Finally, I show how using Britain’s compulsory schooling reforms as instruments for dis-
cretized measures of high school education can significantly upwardly bias estimates of school-
ing’s effects on political preferences. Using Britain’s two major reforms, the upward bias can be
cleanly decomposed: using a dummy for high school completion, instead of the true linear effect
of an additional year of late high school, upwardly biases estimates of schooling’s effect on voting
Conservative by between two and six times. The two-sample IV approach instead finds that an
additional year of late high school increases the probability that a voter votes Conservative in later
life by around 10-15 percentage points.
This substantial difference, which is reiterated by the reduced form estimates, raises a dilemma
for left-of-center parties—like Labour and more recently the Liberals—which have championed
inclusive educational policies at the expense of electoral success. Although it is beyond this article’s
scope to evaluate the mechanisms underpinning these large effects, the analysis preliminarily suggests
that education’s political effects are driven by income-based concerns rather than by the socially liberal
attitudes that would be expected to cause voters to support the Labour or Liberal parties.
1 While multiple imputation involves imputing missing data using other variables within a given sample, two-sample IV imputes all observations for a given variable using a second sample. Unlike multiple imputation programs like Amelia II, the methods used here have analytic solutions.
This paper is organized as follows. Section 2 demonstrates analytically the extent of the bias
and discusses the implications for applied empirical work. Section 3 explains how two-sample IV
techniques can alleviate the missing data problem. Section 4 applies these methods to identify the
effect of schooling on voting preferences in Great Britain. Section 5 concludes.
2 IV’s upward bias with coarsened treatments
2.1 Characterizing the bias
To illustrate the upward bias of coarsening a treatment intensity, consider the simplest case where
there is a single randomly assigned binary instrument.2 Denote this instrument for each observation
$i \in \mathcal{N} \equiv \{1, \ldots, n\}$ as $Z_i \in \{0,1\}$. The observed treatment intensity $T_i \in \{1, \ldots, J\}$ assumes
one of $J$ ordered values, where $T_{iz} \equiv T_i(Z_i = z)$ denotes the potential treatment intensity of unit $i$
under the instrument assignment $Z_i = z$. $Y_i$ is $i$’s observed outcome of interest, with potential
outcomes $Y_{it} \equiv Y_i(T_i = t)$ corresponding to $i$’s treatment assignment $T_i = t$. To illustrate the problem,
let us assume that the instrumental variables assumptions of monotonicity and the exclusion
restriction hold (see Imbens and Angrist 1994; Angrist, Imbens and Rubin 1996); formal definitions
are given below.
The researcher, whether by choice or necessity, decides to coarsen the treatment intensity. In
particular, in the hope of identifying the effect of experiencing $T_i = k > 1$, they partition $T$ by
defining the indicator $D_{ik} \equiv 1(T_i \geq k)$.3 Crucially, the researcher interested in identifying the
effect of obtaining $T_i = k$ is only interested in estimating $\beta_k \equiv E[Y_{ik} - Y_{i,k-1} \mid T_{i1} \geq k > T_{i0}]$. This
quantifies the local average treatment effect (LATE) of obtaining intensity $k$ beyond only obtaining
the preceding level $k-1$ for instrument compliers. In the case of schooling, this could be the effect
of completing high school (12th grade) beyond completing 11th grade. In many applications, this
counterfactual is not clearly specified.4
2 The results presented here extend easily to the cases of multi-valued instruments and to the inclusion of control variables.
3 If multiple instruments are available, the coarsening need not be binary.
This approach yields the following system of IV equations to be estimated:
$$Y_i = \tilde{\beta}_k D_{ik} + u_i, \tag{1}$$
$$D_{ik} = \gamma Z_i + \varepsilon_i. \tag{2}$$
Equation (1) is the structural model defining the relationship between the binary treatment and the
outcome, while equation (2) is the first stage regression of the binary treatment on the instrument.
The true causal effect of obtaining a treatment intensity of k for instrument compliers is βk, while
β̃k represents the population average effect that IV approaches typically cannot identify.5
Angrist and Imbens (1995) show that the Wald estimator $\beta^W_k$ for this system of equations can
be expressed as the weighted sum of the causal effect for compliers moving from intensity $t-1$ to
$t$ for each such interval:
$$\beta^W_k \equiv \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_{ik} \mid Z_i = 1] - E[D_{ik} \mid Z_i = 0]} = \frac{\sum_{t=1}^{J} p_{it}\beta_t}{p_{ik}}, \tag{3}$$
where $p_{it} \equiv \Pr(T_{i1} \geq t > T_{i0})$ denotes the probability that $i$ only reaches category $T_i = t$ because
they received the instrument $Z_i = 1$, and thus represents the proportion of compliers at treatment
intensity $t$ in the population. $p_{ik}$ therefore represents the relevant first stage for ascertaining the
treatment intensity $k$. $\beta_t \equiv E[Y_{it} - Y_{i,t-1} \mid T_{i1} \geq t > T_{i0}]$ is the LATE for compliers moving from
treatment intensity $t-1$ to treatment intensity $t$.
4 When the treatment is truly binary, the interpretation is clear. However, if the latent treatment is multi-valued, the researcher implicitly argues for the difference between some kind of average of values contained within each discretized treatment condition.
5 As Oreopoulos (2006) shows, as the number of compliers increases the local average treatment effect converges toward the population average treatment effect.
The following proposition extends Angrist and Imbens (1995) to demonstrate the inconsistency
(and thus a bias that persists even as the sample size grows large) associated with the Wald estimator seeking to
identify $\beta_k$.6
Proposition 1. Suppose the following assumptions hold:
A1. Exclusion restriction: $(T_{i0}, T_{i1}, \{Y_{it}\}_{t=1}^{J})$ are jointly independent of $Z_i$, for all $i \in \mathcal{N}$.
A2. Monotonicity: $T_{i1} - T_{i0} \geq 0$ or $T_{i1} - T_{i0} \leq 0$, for all $i \in \mathcal{N}$.
Then the dummy variable Wald estimator $\beta^W_k$ of equations (1) and (2) satisfies:
$$\beta^W_k - \beta_k = \frac{\sum_{t \neq k} p_{it}\beta_t}{p_{ik}}. \tag{4}$$
Provided $\mathrm{sign}(\beta_k) = \mathrm{sign}(\beta_t)$ for all $t \neq k$ where $p_{it} > 0$, the dummy variable Wald estimator
accentuates the true causal effect: $|\beta_k| \leq |\beta^W_k|$.
All proofs are provided in the Appendix.
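As a brief sketch of why equation (4) follows from equation (3), and without reproducing the full proof in the Appendix, the $t = k$ term can be separated from the sum in the numerator of (3):

```latex
\begin{align*}
\beta^W_k
  = \frac{\sum_{t=1}^{J} p_{it}\beta_t}{p_{ik}}
  = \frac{p_{ik}\beta_k + \sum_{t \neq k} p_{it}\beta_t}{p_{ik}}
  = \beta_k + \frac{\sum_{t \neq k} p_{it}\beta_t}{p_{ik}} .
\end{align*}
```

Taking the monotonicity direction $T_{i1} \geq T_{i0}$, every $p_{it}$ is non-negative, so whenever each $\beta_t$ shares the sign of $\beta_k$ the second term moves the estimate further from zero.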
This result establishes that the Wald estimator generally over-estimates the true LATE of obtaining
intensity $k$. The estimator is consistent only in two special cases. First, when the instrument
only affects reaching intensity $k$, i.e. $p_{it} = 0$ for all $t \neq k$. Second, when the causal effect for all intervals
other than $k$ is zero, i.e. $\beta_t = 0$ for all $t \neq k$. Otherwise, the inconsistency of the estimator is increasing
in both $p_{it}/p_{ik}$ and $\beta_t$ for any $t \neq k$.
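The two special cases, and the size of the inconsistency otherwise, can be checked numerically. The following is a minimal sketch that simply evaluates equations (3) and (4) for given complier shares and interval effects. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def dummy_wald_limit(p, beta, k):
    """Large-sample value of the dummy-variable Wald estimator, equation (3).

    p    : complier shares p_t = Pr(T_1 >= t > T_0) for t = 1, ..., J
    beta : interval effects beta_t for t = 1, ..., J
    k    : 1-based threshold used to define D = 1(T >= k)
    """
    p = np.asarray(p, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if p[k - 1] <= 0:
        raise ValueError("the first stage p_k must be positive")
    return float(p @ beta / p[k - 1])

def inconsistency(p, beta, k):
    """Equation (4): the Wald limit minus the target beta_k."""
    return dummy_wald_limit(p, beta, k) - float(np.asarray(beta, dtype=float)[k - 1])

# Hypothetical shares: the instrument moves 30% of units across interval 2
# but only 10% across the threshold of interest, k = 3.
p = [0.0, 0.30, 0.10]
beta = [0.05, 0.05, 0.05]          # equal effect at every interval
print(dummy_wald_limit(p, beta, k=3), inconsistency(p, beta, k=3))
```

With these illustrative numbers the dummy-variable estimator converges to roughly four times the true effect of 0.05. Setting either the other complier shares or the other interval effects to zero removes the inconsistency, matching the two special cases above.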
Our education example clearly illustrates the bias. Consider a compulsory schooling law requiring
that students remain in school until age 15 in a country like Britain where high school is
completed at age 16.7 For many students who would have dropped out before age 15 without
the law, the law may not induce the completion of high school. Many likely only stay until 15,
although some may go on to complete high school. This implies that there is a significant first
stage, $p_{it} > 0$, for levels of schooling below high school. The IV bias, however, only arises if
an additional year of schooling before the completion of high school matters for the outcome of
interest. For outcomes like income, where either human capital or signaling may matter for labor
market returns (e.g. Becker 1993; Mincer 1974; Spence 1973), it is easy to believe that $\beta_t > 0$.
Similarly, if income maps to political preferences (e.g. Marshall 2014), or remaining in high school
imparts politically-relevant norms, then political outcomes are also likely to suffer from bias.
6 In general, IV estimators are biased but consistent (see Bound, Jaeger and Baker 1995). The term bias is used somewhat loosely here to mean the deviation between the inconsistent and consistent estimators.
7 The U.S. is also a good example, where high school is completed at age 18 but the school leaving age is (or has been) 16 for many states.
2.2 When is the bias severe?
Proposition 1 demonstrated that the extent of bias depends upon the first stage and the LATE at
different treatment intensities. This analytical insight permits precise description of the extent of
bias in terms of a weighted causal response function (CRF). The CRF provides the causal effect of
the treatment at each intensity.
2.2.1 Sharp jumps in the CRF
When the CRF exhibits sharp discontinuities, as exemplified in Figure 1, the dummy approach
can be most appropriate. If the researcher’s understanding of the problem is strong, then correctly
identifying intensity k—the only point at which there is a (positive) causal effect in the figure—as
the key jump will yield a consistent estimate of βk, provided a suitable instrument exists to ensure
$p_{ik} > 0$. This works well because $\beta_t = 0$ for all $t \neq k$. Therefore, the Wald
estimator is consistent regardless of whether $p_{it} > 0$ for some other $t \neq k$.
Since the true CRF is unobserved, it is hard for researchers to know in practice whether k is the
correct cutoff to use when defining their dummy variable. In general, tipping point equilibria that
lack clear institutional definition may not be straightforward to theorize about. Experiments, on
the other hand, are not subject to these concerns if subjects cannot be partially treated.
Figure 1: Discontinuous causal response function
[Plot not reproduced: outcome Y (levels Y0 and Y1) against treatment intensity T (points k and k+1).]
If the researcher incorrectly surmises that $k+1$ is the correct threshold, at best they fail to
detect the existence of the causal effect of intensity $k$ but correctly identify no effect at $k+1$. In
the example of Figure 1, the researcher concludes that $\beta_{k+1} = 0$ provided that their instrument
does not induce subjects to reach intensity $k$ and $\beta_t = 0$ for all $t \neq k$. In other words, $p_{ik} = 0$ ensures a
correct causal estimate of a quantity that was probably not of primary interest. When $p_{ik} > 0$, the
Wald estimator will produce an inconsistent estimate of the LATE at intensity $k+1$ given by:
$$\beta^W_{k+1} - \beta_{k+1} = \frac{p_{ik}\beta_k}{p_{i,k+1}} > 0. \tag{5}$$
Although this estimate is approximately right in the sense that there is a causal effect nearby, it
both wrongly attributes the effect to intensity $k+1$ and does not even provide a consistent estimate
of $\beta_k$ unless $p_{i,k+1} = p_{ik}$.
2.2.2 Linear CRFs
The bias associated with using a dummy variable can be particularly large when the true CRF is
linear. Letting the causal effect associated with each interval be $\beta_t = \tau \neq 0$, the dummy variable
Wald estimator yields:
$$\beta^W_k - \beta_k = \frac{\sum_{t \neq k} p_{it}}{p_{ik}}\,\tau. \tag{6}$$
More than one half of all compliers must therefore achieve intensity $k$ for the inconsistency to be
smaller than the true coefficient $\tau$, i.e. for the estimate to be less than double the true coefficient.8 This concern increases with how close
the treatment intensity categories are to one another (i.e. increases in $J$), because it becomes
increasingly implausible that any instrument could induce all $i$ to receive exactly $T_{i1} = k$.
When the causal response is linear, an alternative Wald estimator—also proposed by Angrist
and Imbens (1995)—estimating the weighted average per-unit treatment effect (WAPTE) is more
appropriate:
$$\beta^W_{WAPTE} \equiv \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[T_i \mid Z_i = 1] - E[T_i \mid Z_i = 0]} = \frac{\sum_{t=1}^{J} p_{it}\beta_t}{\sum_{t=1}^{J} p_{it}}. \tag{7}$$
When the true causal effect is $\tau$ at each interval, it is exactly recovered by $\beta^W_{WAPTE}$. When the
causal effect is not exactly linear, the estimator disproportionately weights the intervals with most
compliers.
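The contrast between the dummy-variable Wald estimator and the WAPTE under a linear CRF can be seen in a small simulation. This is a sketch under stated assumptions rather than a replication of any result in the paper: the schooling distribution, the reform (a minimum of 11 years, with a fifth of affected students continuing to 12), the continuous outcome, and the per-year effect `tau` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 400_000, 0.05           # sample size and assumed linear per-year effect

z = rng.integers(0, 2, size=n)   # randomly assigned binary instrument
t0 = rng.choice([9, 10, 11, 12], size=n, p=[0.3, 0.3, 0.2, 0.2])  # schooling without the law
# Under the law everyone reaches at least 11 years; 20% of those moved go on to 12.
moved = t0 < 11
continues = moved & (rng.random(n) < 0.2)
t1 = np.where(continues, 12, np.maximum(t0, 11))   # monotone: t1 >= t0
t = np.where(z == 1, t1, t0)                       # observed treatment intensity
y = tau * t + rng.normal(0.0, 0.5, size=n)         # linear CRF plus noise
d = (t >= 12).astype(float)                        # coarsened "completed high school" dummy

def diff_in_means(v):
    """Difference in means between instrument groups (a Wald numerator or denominator)."""
    return v[z == 1].mean() - v[z == 0].mean()

reduced_form = diff_in_means(y)
wald_dummy = reduced_form / diff_in_means(d)            # sample analogue of equation (3)
wapte = reduced_form / diff_in_means(t.astype(float))   # sample analogue of equation (7)
print(f"dummy Wald: {wald_dummy:.3f}, WAPTE: {wapte:.3f}, true per-year effect: {tau}")
```

In this design only a small share of compliers cross the completion threshold, so the dummy first stage is far smaller than the first stage in years and the dummy estimate is several times the per-year effect, while the WAPTE stays close to `tau`.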
It is easy to show that the dummy variable approach yields a coefficient at least as large as the
WAPTE when the instrument satisfies monotonicity (Angrist and Imbens 1995).9 Consequently,
if the CRF is that in Figure 1, then the linear approach underestimates the true causal effect at
intensity $k$ by fitting a complier-weighted linear form. In the special case where the instrument
only affects the first stage of interest, i.e. $p_{it} = 0$ for all $t \neq k$, the WAPTE estimator yields an identical
estimate to the discretized Wald estimator. To the extent that a more conservative estimate is
desired when the CRF is uncertain, the linearization may therefore be preferred.
8 To see this, note that $\frac{\sum_{t \neq k} p_{it}}{p_{ik}} = \frac{p - p_{ik}}{p_{ik}} < 1$ only when $p_{ik} > p/2$, where $p \equiv \sum_{j} p_{ij}$.
Furthermore, the linear approach may be robust even when not all categories are observed. If
the J observed categories represent a coarsening of the true intervals (e.g. because T is continuous),
the linear causal effect can still be recovered provided the intervals are equally spaced.10
Proposition 2. Suppose assumptions A1 and A2 in Proposition 1 hold. Let only $J$ equally-spaced
categories of $T_i$ be observed when there are in fact $\alpha J$ equally-spaced categories, where $\alpha > 1$ is
finite and $\alpha J$ is an integer. Denote $\beta^{W,J}_{WAPTE}$ and $\beta^{W,\alpha J}_{WAPTE}$ respectively as the Wald estimators in the
observed sample (denoted by superscript $J$) and unobserved sample (denoted by superscript $\alpha J$).
If the effect of $T_i$ is linear and $\beta^J_j = \tau$ for all intervals $j$, then $\beta^{W,J}_{WAPTE} = \alpha\,\beta^{W,\alpha J}_{WAPTE}$.
This result suggests that any linear relationship can be accurately estimated with the WAPTE es-
timator, even when all intervals cannot be observed in practice. Obtaining the coefficient for the
quantity of interest only requires an adjustment by factor α to provide the average linear causal
effect at the desired unit interval.
2.3 Implications for applied research
The analysis here demonstrates that the shape of the CRF is critical for ascertaining the bias of the
Wald estimator with a binary treatment. Unless the instrument is very specific in inducing subjects
to only reach treatment intensity $k$ or the causal response is non-zero only at that particular point,
the Wald estimator can be severely biased. If the CRF is instead approximately linear in form, it is
more appropriate to estimate the WAPTE.
9 Comparison of the denominators shows that $\sum_{t=1}^{J} p_{it} \geq p_{ik}$ if $\mathrm{sign}(p_{it}) = \mathrm{sign}(p_{ik})$ for all $t$.
10 More generally, even if the spacing is uneven the true causal effect could be identified if the spacing is proportional to the causal effects at each observed intensity.
Although researchers may in some cases have strong prior beliefs over the shape of the CRF,
and thus the most appropriate empirical strategy, definitive evidence is hard to produce. For ex-
ample, it is far from clear whether it is the additional learning imparted every day that students
remain in high school or simply obtaining the diploma that should matter for how an individual
votes. Unfortunately, the researcher must rely on evidence and intuition—including the reduced
form relationship, separate first stage regressions and the dummied-out OLS relationship—in order
to determine the appropriate variable specification when only a single instrument is available.
However, when multiple instruments are available, a sharper empirical assessment is possible.
With p > 1 instruments, p intervals of the CRF can be estimated by instrumenting for p binary
indicators of different treatment levels.11 Under the assumption that different instruments do not
affect different types of compliers differently, this permits the researcher to estimate βt for com-
pliers at the intervals where the researcher believes the per-unit causal effect is likely to be largest.
Large causal effects at $t \neq k$ provide strong evidence against the kind of CRF required to use $\beta^W_k$
as a consistent estimator for $\beta_k$. Applying this approach, section 4 shows that IV estimates for
completing high school can substantially over-estimate education’s political effects.
Carefully examining the effects of different levels of a treatment intensity in a single dataset is
often not possible. As noted above, researchers often only resort to using dummy variables when
better measures are not available. I now show how two-sample IV methods can be used to address
this missing data problem.
11 The above analysis can be generalized to the case of multiple instruments.
3 Using two samples to address missing data
This section shows how two-sample IV techniques—a method yet to be employed in political sci-
ence, as far as I am aware—can solve the problem that the researcher is forced to use a dummy
because an insufficient number of categories are measured in their dataset. Of course, when all
categories are available, the researcher is free to re-specify their treatment intensity variable how-
ever they deem fit. The two-sample method can similarly address the problem that the treatment
variable is completely unobserved.
The key idea underpinning two-sample IV techniques is that the reduced form and first stage
can be estimated in separate samples. Conceptually, we can then combine these estimates by
respectively replacing the numerator and denominator of the WAPTE estimator in equation (7) or
the Wald estimator in equation (3). Accordingly, two datasets are needed—one which includes
Zi and Yi, and a second which includes Zi and Ti. If covariates Wi are desired, they must also be
observed in both samples. If these datasets are both random draws from the same population, then
the relationship between Zi and Ti in the first stage sample should be equivalent to that which would
have been measured in the reduced form sample had good measures of Ti been available. Under
these conditions, which are formalized below, it is reasonable to effectively impute values
of Ti using our second dataset.
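The idea of combining a reduced form from one sample with a first stage from another can be sketched for the simplest case of a single binary instrument. The data-generating process below is hypothetical (including the `ability` confounder, the reform enforcing 11 years, and the effect `tau`), and each sample deliberately discards the variable it is assumed not to observe.

```python
import numpy as np

def draw_sample(rng, n, tau=0.05):
    """Draw (z, t, y) from one hypothetical population with a linear effect tau."""
    z = rng.integers(0, 2, size=n)
    ability = rng.normal(size=n)                      # unobserved confounder
    t0 = np.clip(np.round(10.5 + ability), 9, 13)     # schooling without the instrument
    t = np.where(z == 1, np.maximum(t0, 11), t0)      # instrument enforces 11 years
    y = tau * t + 0.3 * ability + rng.normal(0, 0.5, size=n)
    return z, t, y

def diff_in_means(v, z):
    """Difference in the mean of v between instrument groups."""
    return v[z == 1].mean() - v[z == 0].mean()

rng = np.random.default_rng(1)
z1, _, y1 = draw_sample(rng, 300_000)   # sample 1: the treatment is treated as unobserved
z2, t2, _ = draw_sample(rng, 300_000)   # sample 2: the outcome is treated as unobserved

# Reduced form from sample 1 divided by the first stage from sample 2.
two_sample_wald = diff_in_means(y1, z1) / diff_in_means(t2, z2)
print(f"two-sample Wald estimate: {two_sample_wald:.3f}")
```

Because both samples are independent draws from the same simulated population, the ratio should land close to the assumed per-year effect even though no single sample contains both the outcome and the treatment.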
3.1 Estimation
The goal is to estimate the following system of IV equations:
$$Y_i = T_i\beta_T + W_i\beta_{-T} + u_i = X_i\beta + u_i, \tag{8}$$
$$T_i = Z_i\Pi + \varepsilon_i, \tag{9}$$
where $Z_i$ includes $W_i$ and $q$ excluded instruments. Identification requires $p \leq q$: no more treatment
variables $T_i$ can be instrumented for than there are excluded instruments.
Two methods have been proposed for IV estimation with two samples. Angrist and Krueger
(1992) propose a Wald-style estimator where the reduced form estimates are divided by their first
stage counterparts, which can be generalized to the overidentified case where the number of in-
struments outnumber the number of endogenous variables. Inoue and Solon (2010) show that this
estimator is less efficient than the 2SLS counterpart—proposed by Angrist and Krueger (1995)
for splitting a sample—that will be used in the empirical application here. The advantage of this
estimator is that it corrects for finite-sample differences between the two samples.12 Furthermore,
its extension to multiple instruments and multiple endogenous variables is straightforward, both
of which are important in many empirical applications, including the analysis in this paper.
In matrix form (stacking over $i$), the TS2SLS estimator is:
$$\hat{\beta}^{TS2SLS} = (\hat{X}_1'\hat{X}_1)^{-1}\hat{X}_1'Y_1, \tag{10}$$
where $\hat{X}_1 = (\hat{T}_1, W_1)$ is the matrix of predicted values in sample 1. The OLS regression coefficients
generating $\hat{T}_1$ are based on $p$ first stage regressions estimated in sample 2:
$$\hat{X}_1 = Z_1\hat{\Pi} = Z_1(Z_2'Z_2)^{-1}Z_2'X_2. \tag{11}$$
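Equations (10) and (11) translate directly into two least-squares steps. The following is a minimal sketch of the point estimator only (no standard errors), using NumPy; the function name `ts2sls`, the helper `simulate`, and the simulated population with a constant as the only covariate are hypothetical choices made for illustration.

```python
import numpy as np

def ts2sls(y1, z1, z2, x2):
    """Two-sample 2SLS point estimates following equations (10) and (11).

    y1 : (n1,) outcome, observed in sample 1 only
    z1 : (n1, q + r) excluded instruments and covariates W in sample 1
    z2 : (n2, q + r) the same columns, in the same order, in sample 2
    x2 : (n2, p + r) treatments followed by W, observed in sample 2
    """
    pi_hat, *_ = np.linalg.lstsq(z2, x2, rcond=None)        # first stage in sample 2
    x1_hat = z1 @ pi_hat                                     # imputed regressors in sample 1
    beta_hat, *_ = np.linalg.lstsq(x1_hat, y1, rcond=None)   # second stage in sample 1
    return beta_hat, x1_hat, pi_hat

rng = np.random.default_rng(2)

def simulate(n, tau=0.05):
    """Hypothetical population: one binary instrument, one treatment, a constant as W."""
    z = rng.integers(0, 2, size=n).astype(float)
    ability = rng.normal(size=n)                             # unobserved confounder
    t = 10.0 + 0.8 * z + 0.7 * ability + rng.normal(0, 0.5, size=n)
    y = tau * t + 0.3 * ability + rng.normal(0, 0.5, size=n)
    const = np.ones(n)
    return y, np.column_stack([t, const]), np.column_stack([z, const])

y1, _, z1 = simulate(200_000)       # sample 1: outcome and instrument only
_, x2, z2 = simulate(200_000)       # sample 2: treatment and instrument only
beta_hat, x1_hat, pi_hat = ts2sls(y1, z1, z2, x2)
print("TS2SLS coefficient on the treatment:", beta_hat[0])
```

Because the covariates appear in both `z2` and `x2`, the first stage reproduces them exactly, so the imputed matrix has the form $(\hat{T}_1, W_1)$. In this just-identified case with a binary instrument, the coefficient coincides with the ratio of the sample 1 reduced form to the sample 2 first stage.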
3.2 Properties of TS2SLS
The following assumptions are required to ensure the consistency of the TS2SLS estimator:
1. Random sampling from the same population: $\{Y_{1i}, Z_{1i}\}_{i=1}^{n_1}$ and $\{T_{2i}, Z_{2i}\}_{i=1}^{n_2}$ are independently
and identically distributed draws of size $n_1$ and $n_2$ from the same population with
finite second moments.
2. Exclusion restriction: $E[Z_{1i}'u_{1i}] = 0$.
3. Instrument exogeneity: $E[Z_{1i}'\varepsilon_{1i}] = E[Z_{2i}'\varepsilon_{2i}] = 0$.
4. Rank conditions: (a) $Z_{1i}'Z_{1i}$ and $Z_{2i}'Z_{2i}$ have full rank, (b) $X_{1i}'Z_{2i}$ and $X_{2i}'Z_{2i}$ have full rank.
5. Interchangeable sample moments: (a) $E[Z_{1i}'X_{1i}] = E[Z_{2i}'X_{2i}]$, (b) $E[Z_{1i}'Z_{1i}] = E[Z_{2i}'Z_{2i}]$.
12 Inoue and Solon (2005) show that the TS2SLS estimator remains consistent even when differences in the sampling rates vary with some of the instrumental variables.
Assumption 1 says that the samples must draw from the same population. Assumption 2 is im-
plied by the exclusion restriction above, but is written in terms of expectations. Assumption 3
requires that the instrument be exogenous in the first stage. Assumption 4 is a standard rank condition
required for matrix invertibility. Assumption 5 requires that crucial sample moments can
be interchanged, thereby permitting substitution between samples. As n1 and n2 converge to the
population size, Assumption 5 necessarily holds.
Proposition 3 demonstrates the consistency of TS2SLS, while the proof illustrates the use of
the assumptions above.13
Proposition 3. Under Assumptions 1-5, $\hat{\beta}^{TS2SLS}$ is an $n_1$-consistent estimator of $\beta$.
Correctly calculating the TS2SLS standard errors is not obvious. Calculating the standard errors
from a regression of $Y_1$ on $\hat{X}_1$ neglects the uncertainty in the first stage, in addition to distributional
differences between the first stage and reduced form samples. The Murphy and Topel (1985)
two-stage framework for understanding “generated regressors”, which accounts for the uncertainty
introduced where a variable is estimated as a proxy to enter a separate regression, incorporates such
estimation uncertainty.14 Proposition 4 derives the homoskedastic and cluster-robust variance matrices,
of which the robust variance is the particular case of $G_1 = n_1$ and $G_2 = n_2$ clusters. ($i$ is
dropped to facilitate exposition.)
13 Angrist and Krueger’s (1995) proof rests on showing that the TS2SLS estimator converges to the consistent Angrist and Krueger (1992) estimator, because of Assumption 5. The proof provided here instead demonstrates the consistency of TS2SLS directly.
14 Inoue and Solon (2010) acknowledge this approach but derive homoskedastic and heteroskedastic variance matrices in an alternative way, and do not provide a cluster-robust variance estimate.
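Proposition 4. The asymptotic variance of the TS2SLS estimator, $V[\hat{\beta}^{TS2SLS}]$, is
$$\left[\sigma_u^2 + \frac{n_1}{n_2}\,\hat{\beta}_S^{TS2SLS\,\prime}\,\Omega\,\hat{\beta}_S^{TS2SLS}\right] E[\hat{X}_1'\hat{X}_1]^{-1}, \qquad \Omega = E[\varepsilon'\varepsilon \mid \hat{X}_1] = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1,p} \\ \vdots & \ddots & \vdots \\ \sigma_{p,1} & \cdots & \sigma_p^2 \end{pmatrix} \tag{12}$$
when the reduced form squared error $\sigma_u^2 = E[u^2 \mid \hat{X}_1]$ and the error covariances $\Omega$ of the $p$ first
stage regressions are homoskedastic; when the reduced form and first stage errors are grouped
into $G_1$ and $G_2$ clusters respectively, the cluster-robust variance is
$$E[\hat{X}_1'\hat{X}_1]^{-1}\left[V[\hat{\beta}^{TS2SLS}] + \frac{n_1}{n_2}\,E[\hat{X}_1'(\hat{\beta}_S^{TS2SLS\,\prime} \otimes Z_1)]\,V(\hat{\Pi})\,E[(\hat{\beta}_S^{TS2SLS\,\prime} \otimes Z_1)'\hat{X}_1]\right] E[\hat{X}_1'\hat{X}_1]^{-1}, \tag{13}$$
where $\hat{\beta}_S^{TS2SLS}$ is the vector of coefficients on $p$ endogenous variables, the uncorrected TS2SLS
variance is given by $V[\hat{\beta}^{TS2SLS}] = \frac{G_1}{G_1-1}\sum_{g=1}^{G_1} E[\hat{X}_{1g}'\hat{u}_{1g}\hat{u}_{1g}'\hat{X}_{1g}]$ and the variances from $m$ first-stage
regressions are $V(\hat{\Pi}) = \frac{G_2}{G_2-1}\,\Phi \otimes E[Z_2'Z_2]^{-1}$, where
$$\Phi = \begin{pmatrix} E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat{\varepsilon}_{2g1}\hat{\varepsilon}_{2g1}'Z_{2g}] & \cdots & E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat{\varepsilon}_{2g1}\hat{\varepsilon}_{2gp}'Z_{2g}] \\ \vdots & \ddots & \vdots \\ E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat{\varepsilon}_{2gp}\hat{\varepsilon}_{2g1}'Z_{2g}] & \cdots & E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat{\varepsilon}_{2gp}\hat{\varepsilon}_{2gp}'Z_{2g}] \end{pmatrix}. \tag{14}$$
Standard errors are given by the square roots of the diagonal elements of $V[\hat{\beta}^{TS2SLS}]/n_1$. Using
the analogy principle, expectations can be replaced by sample moments.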
In the case of a single endogenous regressor, V(Π̂) is simply the standard cluster-robust vari-
ance matrix for the first stage:
$$E[Z_2'Z_2]^{-1}\left[\frac{G_2}{G_2-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat{\varepsilon}_{2g}\hat{\varepsilon}_{2g}'Z_{2g}]\right] E[Z_2'Z_2]^{-1}. \tag{15}$$
When there are multiple endogenous variables, the first stage estimates may be correlated across
models. This requires the more complex formulation in Proposition 4.
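For the single-regressor case, a sample analogue of equation (15) can be sketched as the familiar cluster-robust sandwich. Replacing expectations with unscaled sample sums, as done below, is an assumption about how the analogy principle would be applied, not a formula taken from the paper; the function name and the simulated clustered data are likewise hypothetical.

```python
import numpy as np

def cluster_robust_first_stage_variance(z2, t2, clusters):
    """Sample analogue of equation (15) for a single endogenous regressor.

    z2       : (n2, q) instrument matrix (including any covariates) in sample 2
    t2       : (n2,) treatment in sample 2
    clusters : (n2,) cluster label of each observation
    """
    z2 = np.asarray(z2, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    clusters = np.asarray(clusters)
    pi_hat, *_ = np.linalg.lstsq(z2, t2, rcond=None)   # first-stage OLS
    resid = t2 - z2 @ pi_hat
    bread = np.linalg.inv(z2.T @ z2)
    labels = np.unique(clusters)
    n_clusters = labels.size
    meat = np.zeros((z2.shape[1], z2.shape[1]))
    for label in labels:
        rows = clusters == label
        score = z2[rows].T @ resid[rows]               # Z_g' e_g for cluster g
        meat += np.outer(score, score)
    correction = n_clusters / (n_clusters - 1)         # G2 / (G2 - 1)
    return pi_hat, correction * bread @ meat @ bread

# Hypothetical clustered data: 40 clusters with a shared cluster-level shock.
rng = np.random.default_rng(3)
n2, n_groups = 2_000, 40
clusters = rng.integers(0, n_groups, size=n2)
instrument = rng.integers(0, 2, size=n2).astype(float)
shock = rng.normal(size=n_groups)[clusters]
t2 = 10.0 + 0.8 * instrument + shock + rng.normal(size=n2)
z2 = np.column_stack([instrument, np.ones(n2)])

pi_hat, v_pi = cluster_robust_first_stage_variance(z2, t2, clusters)
print("first-stage coefficient:", pi_hat[0], "clustered s.e.:", np.sqrt(v_pi[0, 0]))
```

Passing a distinct label for every observation treats each unit as its own cluster, which corresponds to the robust case $G_2 = n_2$ described above.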
4 High school education and political preferences
I use the two-sample IV methods expounded above to answer an important question about political
behavior: how does high school affect who citizens vote for? Despite widespread interest in the
causal effects of education on political participation (see Sondheimer and Green 2010), education’s
partisan bias has received limited attention from scholars seeking to move beyond survey corre-
lations. Furthermore, identifying the political effects of schooling is challenging because many
surveys provide insufficiently granular measures of education.
There are various ways in which education could affect political preferences. One of the most
robust correlations in political surveys in developed democracies is the link between income and
support for right-wing political parties (e.g. Gelman et al. 2010; Thomassen 2005). If educa-
tion increases income, as human capital theory strongly suggests (e.g. Acemoglu and Angrist
2000; Becker 1993), then additional high school may well increase support for right-wing parties
proposing lower taxes (Meltzer and Richard 1981).15
However, education is also associated with socially liberal attitudes. This link has also been
widely documented in survey research (Dee 2004; Gerber et al. 2010; Schoon et al. 2010), although
it is particularly strong at the university rather than high school level. Rather than supporting right-wing
parties, this impetus generally pushes voters toward left-wing parties who are more likely
to support post-materialist and socially liberal policies (e.g. Heath et al. 1985; Inglehart 1981).
In the United Kingdom, the Labour and Liberal Democrat parties are regarded as more socially
progressive.
15 This relationship could similarly work through changing demand for social insurance (Iversen and Soskice 2001; Moene and Wallerstein 2001). In the U.S., Marshall (2014) finds that high school education predominantly works through tax policy preferences.
Given the formative role of education, there are many other channels through which schooling
could affect political behavior.16 This paper does not seek to disentangle the mechanisms underpin-
ning the relationship, but rather to demonstrate that high school education has important political
implications for a large proportion of voters. Identifying schooling’s causal effects is challenging
because which individuals receive greater education is very unlikely to be random, even after var-
ious observables are controlled for or matched upon (e.g. Kam and Palmer 2008). I use Britain’s
compulsory schooling reforms as an instrument for schooling to identify high school’s political effects.
Britain represents a particularly important case because, unlike the U.S., the reforms affected
a substantial proportion of the population. With a large proportion of compliers, the estimates for
compliers approach the population average treatment effect (see Oreopoulos 2006).
4.1 Compulsory schooling laws in Britain
Great Britain’s education laws define the maximum age by which students must start school and the
minimum age at which students can leave school. To identify the effect of high school education,
I exploit two landmark reforms of the minimum leaving age that came into force in 1947 and
1972. First, Winston Churchill’s wartime coalition government passed the Education Act 1944,
which increased the leaving age from 14 to 15 in England and Wales. The Education (Scotland)
Act 1945 enacted the same reform in Scotland. The new leaving age, which had repeatedly failed
to pass in the 1920s and 1930s due to financial constraints (Gillard 2011), came into force 1st
April 1947 after several years of intensive preparation. Second, Parliament passed the Education
Act 1962 raising the school leaving age to 16, although it was Conservative Edward Heath who
finalized the extension under Statutory Instrument 444 (1972). Like the 1947 reform, Labour had
consistently pushed for the increase,17 while education was widely seen as an economically and
socially beneficial investment at the time (Woodin, McCulloch and Cowan 2013). This second
reform came into force in England, Scotland and Wales on 1st September 1972. Northern Ireland,
which experienced different education reforms (Oreopoulos 2006), is excluded from the analysis.
16 For example, education could alter the political composition of social networks (Abrams, Iversen and Soskice 2010), induce politically-biased participation, or teaching could instill different values and norms (Bowles and Gintis 1976).
The reforms substantially altered the education profile of Britain’s students. As Figure 2 shows,
relative to the immediately prior academic cohorts, both reforms induced a large fraction of stu-
dents to remain in school for an additional year. Unlike compulsory schooling reforms in Canada
and the U.S., which affected a small and somewhat idiosyncratic set of students (Clark and Royer
2013; Goldin and Katz 2008; Oreopoulos 2006), Britain’s reforms affected a large proportion of
the population. Almost half of students remained in school one year longer following the 1947 reform,
while a quarter remained in school because of the 1972 reform. While the 1947 reform
also increased the proportion staying in school until 16, the 1972 reform did not affect schooling
beyond the high school level.
Although the number of students in school rose considerably, the education system itself did
not greatly change. Prior to the 1947 reform, the government had engaged in a major expansion
effort to increase the number of teachers, buildings and classroom materials. In both cases, the
additional year of schooling was primarily intended to ensure students grasped all the material
they had previously been taught (see Clark and Royer 2013).
Britain’s education reforms have proved popular instruments among labor economists. The
discontinuities in schooling laws have been used to identify positive effects of an additional year
of schooling on income (Devereux and Hart 2010; Grenet 2013; Harmon and Walker 1995; Oreopoulos
2006), and also used to demonstrate that schooling does not affect mortality rates (Clark
and Royer 2013).18 However, the potential political effects of these reforms have not received
attention.
17 Under Labour Prime Minister Gordon Brown, Parliament passed the Education and Skills Act 2008, raising the education leaving age to 18 by 2015.
18 There also exists a large literature exploring the economic effects of U.S. compulsory schooling reforms (see Acemoglu and Angrist 2000; Angrist and Krueger 1991; Goldin and Katz 2008). These studies differ in that they exploit cross-state differences using difference-in-differences type strategies.
[Figure 2: two panels. Left panel, "1947 reform": proportion leaving school (y-axis) by cohort: year aged 14 (x-axis, 1940-1970), with series "Leave before 15" and "Leave before 16". Right panel, "1972 reform": proportion leaving school (y-axis) by cohort: year aged 15 (x-axis, 1950-2000), with series "Leave before 16" and "Leave before 17".]
Figure 2: Compulsory schooling reforms and staying in school by cohort
Notes: Data based on the Labour Force Survey data used in the empirical analysis below. Black lines represent fractional polynomial regression line fits. Grey dots are birth-year cohort averages.
4.2 Data
In order to test the political implications of these reforms, I use the British Social Attitudes Survey
(BSAS). The BSAS, which randomly samples a nationally-representative cross-section of adults
(aged 18 or above) with postal addresses in Great Britain, has been conducted in the summer of
every year since 1983 except in 1988 and 1992. In ten of the 28 available surveys,19 respondents
were asked which party they voted for in the most recent general election. In the sample used in
this analysis, 34% of respondents reported voting Conservative, while 45% and 16% respectively
voted Labour and Liberal.20 Given the theoretical claims outlined above, the analysis focuses on a
dummy for voting Conservative as the main dependent variable.
I operationalize whether a student is affected by the reform by coding two indicators for the
minimum school leaving age affecting individuals in cohort $c$:
$\mathbf{1}(CSL_c = 15) = \mathbf{1}(\text{birth year} + 14 \in [1947, 1972])$ and
$\mathbf{1}(CSL_c = 16) = \mathbf{1}(\text{birth year} + 15 \geq 1972)$. The residual category is below
15. Although month of birth is not available in the BSAS, respondents can be mapped on the basis
of their year of birth (determined by age in years at the date of the survey).21 Whether an individual
was affected by the reform is thus assigned by academic cohort, defined by the year aged 14 and
15, such that $\mathbf{1}(CSL_c = 15) = 1$ for those aged 14 in any year between 1947 and 1972.22
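The cohort coding just described can be sketched in a few lines of Python. This is only an illustration, not the author's code: the function name `csl_indicators` is hypothetical, and the sketch assumes that cohorts bound by the 1972 reform are removed from the CSL = 15 group so that the two indicators never overlap (the intervals as printed would otherwise both apply to the two birth years aged 14 in 1971 and 1972).

```python
def csl_indicators(birth_year):
    """Return (csl15, csl16) as 0/1 indicators for a birth-year cohort."""
    # 1972 reform: cohorts aged 15 in 1972 or later face a leaving age of 16.
    csl16 = int(birth_year + 15 >= 1972)
    # 1947 reform: cohorts aged 14 in 1947 or later face a leaving age of 15.
    # Assumption (not stated in the text): cohorts already bound by the 1972
    # reform are excluded here so the categories are mutually exclusive.
    csl15 = int(birth_year + 14 >= 1947 and not csl16)
    return csl15, csl16


if __name__ == "__main__":
    for year in (1932, 1933, 1956, 1957, 1987):
        print(year, csl_indicators(year))
```

Under this reading, the residual category (leaving age below 15) is simply the case where both indicators equal zero.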
19 These surveys were conducted in: 1987, 1994, 1995, 1996, 1999, 2001, 2003, 2005, 2008 and 2010.
20 The Conservative vote share, the main dependent variable in this paper, closely reflects the survey-weighted average of 36% of votes received by the Conservatives across the period under study. The difference is even smaller in the raw data; as explained below, the TS2SLS approach necessitates removing some observations.
21 My estimates of the effects of the reforms on schooling outcomes are very similar to Clark and Royer (2013), who can perfectly assign the instruments using month of birth data. This, in combination with the clear graphical discontinuities shown below, strongly suggests that lacking month of birth is not significantly affecting the results.
22 Although Scottish students faced a weaker law between 1972 and 1976, they are coded identically to England/Wales because a similarly large drop in the proportion leaving occurred. Results are robust to excluding Scottish students aged 15, 1972-76.
However, the BSAS measures of education are problematic. Educational attainment is measured
using six categories, ranging from no qualification to university degree.23 Completing high
school is captured by the second lowest category, which specifies that a respondent has a certifi-
cate of secondary education (CSE) or equivalent. At the end of high school (at age 15 or 16), or a
student’s 11th year of formal schooling, students take CSE exams in a variety of subjects. Given
only 2-3% of students fail any particular CSE exam, obtaining a CSE is an almost perfect proxy for
completing high school. An indicator measuring this is used to examine the results when schooling
is dichotomized at a theoretically appealing point. Although the BSAS also asks respondents what
age they left school, nearly half of the surveys did not allow respondents to answer that they left
school below age 15, and thus cannot differentiate the effect of the 1947 reform from the number
of years of schooling.24
Using only the BSAS sample to identify the effect of years of schooling would require either
coarsening the treatment or substantially reducing the sample size. However, collecting a second
sample containing common basic demographic variables and the age at which an individual left
school can solve this problem. Accordingly, I use Labour Force Survey (LFS) data—an annual
and more recently quarterly household survey—from each year in which an election occurred to
collect a pooled sample of 747,851 voting age respondents.25 Years of schooling is defined by
the age at which a respondent left continuous full-time education minus five, and an upper bound of 13
years of state-supported education is applied.26 Before 2003, the LFS collected both month and
year of birth, and therefore permitted perfect instrument assignment; from 2003 onward, the instruments
were assigned as in the BSAS.
23 Respondents with foreign qualifications were excluded.
24 This bottom coding is clearly still relevant in the twenty-first century because many of those aged 14 or above in 1947 are still alive. Nevertheless, similar estimates are obtained when restricting the analysis to the years for which age left school could be used as the endogenous variable. For many studies, however, the loss of precision necessitates using a separate sample.
25 Only the July-September sample was used since the LFS became quarterly to avoid replication, given that respondents are then surveyed for five consecutive quarters, and to approximate the period when the BSAS survey was conducted. Observations from Northern Ireland and respondents below the age of 18 were excluded to match the BSAS sample.
26 After age 18, continuing students bifurcate into university or vocational programs. Given the difficulties of classifying these programs, the upper bound on state-supported schooling is most appropriate. Since the CSL reforms did not affect higher education, this choice is inconsequential for the results.
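As a small worked example of the years-of-schooling definition above, the rule can be written as a one-line function. The name `years_of_schooling` and its `cap` parameter are illustrative choices rather than anything taken from the LFS documentation.

```python
def years_of_schooling(age_left_education, cap=13):
    """Age at leaving continuous full-time education minus five, top-coded at `cap`."""
    return min(age_left_education - 5, cap)


if __name__ == "__main__":
    for age in (14, 15, 16, 18, 22):
        print(age, years_of_schooling(age))
```

Under this rule a respondent leaving at 15 has 10 years of schooling and one leaving at 16 has 11, which lines up with the 10-year and 11-or-more-year categories used in the results.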
The two-sample approach is only valid if both samples randomly draw from the same pop-
ulation. Given that the BSAS and LFS are random samples from the population of those with
available addresses,27 both samples are drawn from essentially identical populations. Neverthe-
less, imbalances could remain due to chance, different survey sizes and any differential response
characteristics. To redress the concern that the TS2SLS assumptions are not satisfied, I then chose
a random subsample of the LFS sample to match the BSAS sample distribution in terms of year
of birth, gender, ethnicity, and survey year by randomly choosing observations from within these
blocks.28 This reduced the final sample size to 47,552.29 The summary statistics in Table 1 show
that the first and second moments on the common variables match very well. In combination with
the random sampling from the same adult population, both samples effectively draw from the same
population.
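The block-matching step described above might be implemented along the following lines. This is a sketch under stated assumptions rather than the procedure actually used: the function name, the dictionary-based row format, and the rule of drawing the same whole-number multiple of observations from every block (so that block shares match the target sample exactly) are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def block_matched_subsample(target_rows, donor_rows, keys, seed=0):
    """Randomly draw donor rows within blocks so block shares mirror the target."""
    rng = random.Random(seed)

    def block(row):
        return tuple(row[k] for k in keys)

    # Count how many target observations fall in each block.
    target_counts = Counter(block(r) for r in target_rows)

    # Group donor observations by the same blocking variables.
    donors = defaultdict(list)
    for r in donor_rows:
        donors[block(r)].append(r)

    # Assumption: use the largest whole multiple of the target counts that
    # every block can supply, so the matched shares are exact.
    multiple = min(len(donors[b]) // n for b, n in target_counts.items())

    sample = []
    for b, n in target_counts.items():
        sample.extend(rng.sample(donors[b], multiple * n))
    return sample


if __name__ == "__main__":
    target = [{"birth_year": 1950, "male": 1}] * 2 + [{"birth_year": 1960, "male": 0}]
    donor = [{"birth_year": 1950, "male": 1, "id": i} for i in range(7)]
    donor += [{"birth_year": 1960, "male": 0, "id": i} for i in range(7, 12)]
    matched = block_matched_subsample(target, donor, keys=("birth_year", "male"))
    print(len(matched))
```

If any target block has no donor observations the multiple is zero and the result is empty, which is one reason cohorts that are thin in the donor survey would need to be dropped from both samples first.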
4.3 Empirical strategy
To identify the effect of late high school education on political preferences, I use Britain’s com-
pulsory schooling reforms as instruments for the level of schooling an individual receives. These
reforms have been widely used as instruments, most convincingly in regression discontinuity (RD)
designs (see Clark and Royer 2013), because of the sharp change in educational attainment across
cohorts. Although further-apart cohorts could differ systematically, it is hard to see why cohorts
born just before and just after the reform would systematically differ in their political preferences.
Accordingly, this study also employs an RD design where the running variable determining the
treatment is birth year cohort.
27 More precisely, the BSAS uses a multi-stage design where Britain is broken up into sectors defined by postcode, from which households are randomly chosen from the address book. Respondents aged 18 or above within a household are then randomly chosen. The LFS has been an unclustered (“simple”) random sample from the address roll since 1992, having earlier employed a clustered approach from the Valuation Roll and Post Office Address File.
28 Due to a lack of observations in the LFS, the final samples used for both datasets excluded respondents aged above 74 and those born before 1922 or after 1987.
29 Where sample size concerns are more salient (the first stage is very strong here), another option would be to weight observations in the first stage sample to replicate the reduced form sample distribution. Such a procedure is likely to be more efficient.

Table 1: Summary statistics: BSAS and LFS samples

                                          BSAS                                          LFS
                          Obs.     Mean      Std. dev.  Min.   Max.     Obs.     Mean      Std. dev.  Min.   Max.
Dependent variable
  Conservative vote       15,934   0.34      0.47       0      1
Endogenous variables
  Schooling                                                             47,552   11.14     1.42       0      13
  High school             15,934   0.73      0.44       0      1
Excluded instruments
  CSL=15                  15,934   0.50      0.50       0      1        47,552   0.50      0.50       0      1
  CSL=16                  15,934   0.39      0.49       0      1        47,552   0.39      0.49       0      1
Pre-treatment covariates
  Birth year              15,934   1951.90   14.68      1922   1987     47,552   1961.66   14.64      1922   1987
  Age                     15,934   47.12     14.07      18     73       47,552   46.96     14.38      18     73
  Male                    15,934   0.44      0.50       0      1        47,552   0.45      0.50       0      1
  White                   15,934   0.95      0.21       0      1        47,552   0.95      0.22       0      1
  Asian                   15,934   0.02      0.15       0      1        47,552   0.02      0.15       0      1
  Black                   15,934   0.02      0.13       0      1        47,552   0.02      0.13       0      1
  Survey                  15,934   1999.02   5.91       1987   2010     47,552   1998.90   6.33       1987   2010
The key RD identifying assumption is that partisan preferences are continuous in all covari-
ates other than school leaving age at the reform discontinuity. Given the difficulty of identifying
education’s causal effects using observational data, the RD strategy’s weak assumptions are par-
ticularly appealing. The greatest issue for RD designs is the “sorting” concern that another key
variable simultaneously changes at the discontinuity. Given that cultural shifts are very unlikely to
have affected 15 year olds but not 14 year olds, the most plausible concerns relate to demographic,
socio-economic and labor market characteristics. Figure 3 shows that trends in various proxies for
these variables are essentially continuous through both discontinuities.
I first estimate the effects, $\delta_1$ and $\delta_2$, of the schooling reforms themselves. This entails
estimating reduced form OLS regressions of the following form in the BSAS sample:

$$Y_{ict} = \delta_1 \mathbf{1}(CSL_c = 15) + \delta_2 \mathbf{1}(CSL_c = 16) + f(\text{birth year}_c) + W_{it}\gamma + \eta_t + \varepsilon_{it}, \tag{16}$$

where $\mathbf{1}(CSL_c < 15)$ is the residual category, $W_{it}$ includes a gender dummy, standardized age
polynomials,30 and dummies for white, black and (south and east) Asian ethnicities, and $\eta_t$ is
a survey fixed effect. The dependent variable $Y_{ict}$ is voting Conservative. $f$ is a flexible global
polynomial function of the running variable designed to capture general trends away from the
reform discontinuities.31 I estimate a variety of specifications for $f$, ranging from including no
birth year trends to fifth-order polynomial trends to demonstrate the robustness of the relationships.
All specifications report standard errors clustered by cohort.
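To make the polynomial controls concrete, the sketch below builds standardized trend terms of a chosen order for a running variable such as birth year. It reflects one plausible reading of "standardized" (center and scale the variable, then raise it to successive powers); the function name and that ordering of operations are assumptions, not details given in the text.

```python
from statistics import mean, pstdev


def standardized_polynomial(values, order):
    """Return one row per value: [z, z**2, ..., z**order], z = (value - mean) / sd."""
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation of the running variable
    rows = []
    for v in values:
        z = (v - mu) / sd
        rows.append([z ** p for p in range(1, order + 1)])
    return rows


if __name__ == "__main__":
    for row in standardized_polynomial([1930, 1940, 1950], order=3):
        print(row)
```

Standardizing before taking powers keeps the higher-order terms on a comparable scale, which tends to help numerical stability when fifth-order trends are included alongside dummies.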
30 For simplicity, the age polynomials are assigned the same polynomial order as f.
31 To fully assess the implications of dummying-out high school, it is necessary to include both reforms in the same specification. Consequently, it is imperative to show that the results are robust to highly flexible global polynomial trends.
[Figure 3: six panels, each spanning 1940-2000. Panel A: Male (proportion, by cohort: year aged 14). Panel B: White (proportion, by cohort: year aged 14). Panel C: Father manual or unskilled worker (proportion, by cohort: year aged 14). Panel D: Father voted Conservative (proportion, by cohort: year aged 14). Panel E: Unemployment (rate in %, by year). Panel F: Average annual earnings (index, 2000=100, by year).]
Figure 3: Trends in demographic, socio-economic and labor market variables
Notes: The data in Panels A and B is from the LFS. The data in Panels C and D is from the British Election Survey 1979-2010 (because such variables were not widely available in the BSAS), which is used as a robustness check below. The data in Panels E and F is from the Bank of England “UK Economic Data 1700-2009” dataset.
The principal quantity of interest in this paper is the effect of schooling. To estimate the effects
of different measures of schooling, Si, I use Britain’s reform cutoffs as instruments. Since the
reforms do not perfectly determine an individual’s level of schooling, the assignment of Si is fuzzy.
I thus employ a fuzzy RD approach; like standard IV approaches, this additionally requires
monotonicity and the exclusion restriction to hold. Given the large increase in school attendance
following each reform, and the fact that very few students failed to comply with the new leaving
ages, monotonicity is strongly supported. Given the close proximity of the reforms to schooling
choices, there is very limited scope for the reforms to violate the exclusion restriction by affecting
an individual’s political preferences through other channels.
The fuzzy RD entails estimating the following structural equation:

$$Y_{ict} = \beta S_i + f(\text{birth year}_c) + W_i\phi + \eta_t + \varepsilon_{ict}, \tag{17}$$

where $S_i$ will be either a dummy for completing high school, years of schooling, or two dummy
variables for staying in school for 10 years and for 11 or more years. The first stage regression generating
variation in $S_i$ is given by:

$$S_i = \alpha_1 \mathbf{1}(CSL_c = 15) + \alpha_2 \mathbf{1}(CSL_c = 16) + f(\text{birth year}_c) + W_i\psi + \eta_t + \varepsilon_{ict}. \tag{18}$$

A strong first stage, which is required to minimize the bias of IV estimates in finite samples (Bound,
Jaeger and Baker 1995; Staiger and Stock 1997), implies that $\alpha_1$ and $\alpha_2$ are significantly different
from zero.
In the case of the dummy for completing high school, equation (17) can simply be estimated
with 2SLS using only BSAS data. Given that years of schooling comes from the LFS, the effects
of years of schooling are instead estimated using TS2SLS where the LFS first stage and BSAS
reduced form are efficiently combined as above with cohort-clustered standard errors computed as
in Proposition 4.
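For intuition about how a first stage from one sample and a reduced form from another are combined, the following sketch computes a two-sample point estimate for a single endogenous variable from coefficient vectors alone. It relies on the standard algebraic form of the estimator, beta = (alpha' W delta) / (alpha' W alpha), where W stands for the cross-product matrix of the (covariate-adjusted) instruments in the reduced form sample. The function name and all input numbers are hypothetical, and the sketch does not attempt the standard errors of Proposition 4.

```python
def ts2sls_point_estimate(delta, alpha, weight):
    """Two-sample IV point estimate for one endogenous variable.

    delta  -- reduced form coefficients on the instruments (outcome sample)
    alpha  -- first stage coefficients on the instruments (treatment sample)
    weight -- square matrix playing the role of Z'Z in the outcome sample
    """
    k = len(alpha)
    numerator = sum(alpha[i] * weight[i][j] * delta[j]
                    for i in range(k) for j in range(k))
    denominator = sum(alpha[i] * weight[i][j] * alpha[j]
                      for i in range(k) for j in range(k))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical inputs: with one instrument the estimate is the Wald ratio.
    print(ts2sls_point_estimate([0.06], [0.4], [[1.0]]))
    # Hypothetical two-instrument case with an identity weight matrix.
    print(ts2sls_point_estimate([0.06, 0.09], [0.4, 0.6], [[1.0, 0.0], [0.0, 1.0]]))
```

With a single instrument the weight cancels and the expression collapses to the reduced form coefficient divided by the first stage coefficient; with several instruments the weight matrix determines how the instrument-specific ratios are averaged.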
4.4 Results
4.4.1 The effect of compulsory schooling reforms on schooling and political preferences
Figures 4 and 5 plot the first stage and reduced form graphically. The left-hand graph in Figure 4
shows a large increase in the average number of years of schooling per cohort following the 1947
reform. This reflects the 40% of students who stayed in school for another year shown in Figure
2. The right-hand graph shows that the 1972 reform also substantially increased the average years
of schooling, although the magnitude of the reform was much smaller. This reflects the fact that
by 1972 students were generally remaining in school longer.
Although the cohort averages are noisier, the reduced form plots in Figure 5 suggest that around
the reforms voters differ systematically in their political preferences. Especially following the
1947 reform, there is an upward shift in support for the Conservative party by cohort. The graphs
indicate that cohorts affected by the reform are approximately three percentage points more Con-
servative.32 Given that the 1972 reform affected fewer students, the difference at the discontinuity
is smaller. Although the difference is less clear, the chart also suggests an increase in support for
the Conservatives. The fact that both reforms reverse the trend against the Conservatives—which
is a function of both declining support over time (in the surveys used here) and younger voters
being more left-wing—further suggests that the posited relationship is not being driven by cohort
trends.
Table 2 presents the reduced form and first stage estimates using a simple linear cohort trend.
Although Figures 4 and 5 indicate that trends in both years of schooling and Conservative support
are approximately linear,33 the results—as will be demonstrated below—are not sensitive to this
choice. The first column provides the reduced form estimates, the second column estimates the
first stage in the BSAS sample, and the third column estimates the years of schooling first stage in
the LFS sample.
32 A linear trend, which fits similarly well, indicates an even larger five percentage point effect.
33 Note that there is very little data for cohorts born in the 1930s.
[Figure 4: two panels. Left panel, "1947 reform": years of schooling (y-axis, 9 to 12) by cohort: year aged 14 (x-axis, 1940-1970). Right panel, "1972 reform": years of schooling (y-axis, 9 to 12) by cohort: year aged 15 (x-axis, 1950-2000).]
Figure 4: Average years of schooling by birth year cohort (LFS data)
Notes: Black lines represent fractional polynomial regression line fits. Grey dots are cohort averages.
The reduced form shows that the reforms induced a large and statistically significant increase
in support for the Conservative party. Cohorts affected by the 1947 reform are six percentage points more
likely to vote Conservative, while the 1972 reform, which affected fewer students, increased
Conservative voting by a further 2.5 percentage points. Such large shifts for affected cohorts
imply that the reforms substantially altered national politics, and could easily have altered the
outcomes of the close 1970s and 2000s elections. Figure 6 demonstrates that the reduced form
coefficients are consistent across specifications using higher-order polynomial terms to account for
more complex trends in Conservative support. However, by averaging across all individuals, these
estimates underestimate the impact on individuals who only remained in school because of the
reforms. To calculate the effects for such compliers, I turn to the fuzzy RD estimates.
[Figure 5: two panels. Left panel, "1947 reform": Conservative vote share (y-axis, .1 to .5) by cohort: year aged 14 (x-axis, 1940-1970). Right panel, "1972 reform": Conservative vote share (y-axis, .1 to .5) by cohort: year aged 15 (x-axis, 1950-2000).]
Figure 5: Proportion Conservative by birth year cohort (BSAS data)
Notes: Black lines represent fractional polynomial regression line fits. Grey dots are cohort averages.
4.4.2 The effect of schooling on political preferences
The first stage estimates confirm that both reforms substantially increased schooling. Looking at
the dummy for completing high school in the BSAS sample, column (2) shows that both the 1947
and 1972 reforms significantly increased the probability of completing high school. Column (3)
instead examines years of schooling in the LFS, and similarly shows that the 1947 reform was
especially effective at keeping students in school. In both cases, the large F statistic, which tests the
relevance of the reform dummies, indicates a strong first stage. Table 3 presents the fuzzy RD
results, instrumenting for schooling with the compulsory schooling reforms.
[Figure 6: five panels (Linear, Quadratic, Cubic, Quartic, Quintic), each plotting the reduced form coefficients on CSL=15 and CSL=16 on a y-axis running from -.05 to .2.]
Figure 6: Reduced form estimates using higher-order polynomial controls
Notes: Higher-order polynomial specifications include standardized global birth year trends of order p and standardized age trends of order p (excluding linear age because it is perfectly collinear with linear birth year).

Table 2: The effect of CSLs on schooling and voting for the Conservative party

                        Vote Con     High school   Schooling
                        OLS          OLS           OLS
                        (1)          (2)           (3)
1947 reform             0.061***     0.176***      0.384***
                        (0.020)      (0.024)       (0.036)
1972 reform             0.085***     0.275***      0.604***
                        (0.033)      (0.043)       (0.074)
Sample                  BSAS         BSAS          LFS
Observations            15,934       15,934        47,552
First stage F test                   27.6          56.9

Notes: All specifications include a linear birth year term, male, white, black and south Asian dummies, and survey year fixed effects. Standard errors clustered by cohort. * denotes p < 0.1, ** denotes p < 0.05, *** denotes p < 0.01.
I first examine the 2SLS estimates where schooling is discretized. Column (1) shows the Wald
estimate, and suggests that voters induced to complete high school by the reform are 33 percentage
points more likely to vote Conservative in later life. The estimated effect is very large by almost any
standard, but particularly when considering that a large segment of the population are compliers.
This estimate, however, could suffer from the bias established above: given Table 2 showed a
significant reduced form effect for the 1947 reform, but the reform did not compel all students to
complete high school, there is clear scope for upward bias. This concern is even more evident in
column (2), which uses only the 1947 reform as an instrument (removing those born after 1972). In
this specification, where the bias is expected to be largest given that the 1947 reform caused a significant
proportion of students to also complete high school, the 2SLS estimates imply an implausibly large
89 percentage point increase in the probability of voting Conservative.

Table 3: The effect of schooling on voting

                                 Con        Con        Con        Con        Labour     Liberal
                                 2SLS       2SLS       TS2SLS     TS2SLS     TS2SLS     TS2SLS
                                 (1)        (2)        (3)        (4)        (5)        (6)
Completed high school            0.332**    0.885***
                                 (0.132)    (0.311)
10 years of schooling                                  0.124***
                                                       (0.042)
11 or more years of schooling                          0.237**
                                                       (0.111)
Years of schooling                                                0.152***   -0.047     -0.081**
                                                                  (0.054)    (0.046)    (0.031)
First stage sample               BSAS       BSAS       LFS        LFS
Reduced form observations        15,934     9,783      15,934     15,934     15,934     15,934
First stage observations         15,934     9,783      47,552     47,552     47,552     47,552
First stage F test               27.6       22.3       56.9       56.9       56.9       56.9

Notes: In each specification, the variables listed on the left side of the table are instrumented for by the indicators for the 1947 and 1972 reforms. All specifications include a linear birth year term, male, white, black and Asian (south and east combined) dummies, and survey year fixed effects. Specification (2) excludes respondents affected by the 1972 reform. While specifications (1)-(4) have Conservative vote as dependent variable, the dependent variable in specifications (5) and (6) respectively is Labour and Liberal vote. Standard errors clustered by cohort. * denotes p < 0.1, ** denotes p < 0.05, *** denotes p < 0.01.
The presence of two instruments permits a more precise exploration of any bias. Column (3)
uses the 1947 and 1972 reforms to instrument for indicators for completing ten years of schooling
or 11 or more years of schooling. Given that the reforms did not affect attaining nine or fewer years
of schooling, or more than 11 years of schooling, the coefficients in column (3) non-parametrically
estimate the effect of an additional year of late high school. For both years, an additional year
equates to a LATE of 12 percentage points. This shows that, at least at the end of high school, the
effect of schooling is almost exactly linear. Unsurprisingly, the WAPTE estimate in column (4)
shows a similar effect for an additional year of schooling.34 Figure 7 again demonstrates that the
results are not being driven by linear cohort trends, and are in fact highly stable.
The non-parametric and linear results suggest that the dummy for completing high school substantially
overstates the political effect of the final year of high school. The biased estimate is more than double
the true LATE for the final year of school when examining both instruments, and almost six times as large
when focusing only on the 1947 reform. While these results are clearly biased in terms of magnitude,
the more careful analysis nevertheless shows that late high school causes voters to become
substantially more conservative in later life. Reinforcing results from the U.S. (Marshall 2014), this
evidence is consistent with schooling’s economic effects dominating any effects working through
socially liberal attitudes.
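The size of the bias can be checked with back-of-the-envelope arithmetic on the coefficients reported in Tables 2 and 3. The sketch below divides each reduced form coefficient by the matching first stage coefficient, and then compares the high school estimates with the per-year estimate. These simple ratios are only a rough guide: they ignore the weighting across instruments and the covariates, so they are not expected to reproduce the reported 2SLS and TS2SLS estimates exactly. Variable names are illustrative.

```python
# Coefficients copied from Table 2 (linear cohort trend specification).
reduced_form = {"1947 reform": 0.061, "1972 reform": 0.085}
first_stage_years = {"1947 reform": 0.384, "1972 reform": 0.604}
first_stage_high_school = {"1947 reform": 0.176, "1972 reform": 0.275}


def wald_ratios(reduced, first_stage):
    """Reduced form coefficient divided by first stage coefficient, per instrument."""
    return {name: reduced[name] / first_stage[name] for name in reduced}


per_year = wald_ratios(reduced_form, first_stage_years)
per_high_school = wald_ratios(reduced_form, first_stage_high_school)

# Estimates copied from Table 3: column (4) versus columns (1) and (2).
years_estimate = 0.152
high_school_estimates = {"both reforms": 0.332, "1947 reform only": 0.885}
bias_multiples = {name: value / years_estimate
                  for name, value in high_school_estimates.items()}

if __name__ == "__main__":
    print("Per year of schooling:", per_year)
    print("Per high school completion:", per_high_school)
    print("High school estimate relative to per-year estimate:", bias_multiples)
```

The per-year ratios come out at roughly 0.14 to 0.16 for both reforms, close to the reported 0.152, while the high school estimates are about 2.2 and 5.8 times the per-year estimate, in line with the two-to-six-fold range described in the text.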
Given that Britain has had three main political parties throughout the survey period analyzed here, it is
not obvious which party primarily loses votes to the Conservatives. Specifications (5) and (6)
respectively use Labour and Liberal vote indicators as dependent variables, and show that school-
ing decreases the probability of voting for both parties. The reduction is largest, and statistically
significant, for the Liberal Democrats.
34 The point estimate differs because the first stage for other levels of schooling is not exactly zero.
[Figure 7: coefficient plot of the marginal effect of years of schooling (x-axis, 0 to .4) for the Linear, Quadratic, Cubic, Quartic and Quintic specifications.]
Figure 7: TS2SLS estimates using higher-order polynomial controls
Notes: Higher-order polynomial specifications include standardized global birth year trends of order p and standardized age trends of order p (excluding linear age because it is perfectly collinear with linear birth year).
4.4.3 Robustness checks
Beyond the polynomial cohort trends, I now show that the reduced form and TS2SLS estimates
are highly robust to a variety of potential threats to the identification assumptions. All robustness
checks are reported in Table 4.
Although Figure 3 above showed that trends in plausible confounders are continuous through
the 1947 and 1972 reform discontinuities, I also control for the unemployment rate and average
earnings in column (1) of Table 4 and find that the effect if anything increases. To more thoroughly
demonstrate that age is not driving the results, column (2) shows that the results are robust to
including age fixed effects.
I also employ several out-of-sample checks. First, column (3) in Table similarly shows that an
additional year of late high school increases the likelihood of identifying as a Conservative parti-
san by 12 percentage points. This shows that survey respondents are responding consistently when
35
-
Table 4: Robustness checks
Controls Age dummies Partisan BES vote BES partisan(1) (2) (3) (4) (5)
Panel A: Reduced form estimates1947 reform 0.084*** 0.059** 0.045** 0.072*** 0.067***
(0.023) (0.022) (0.023) (0.014) (0.012)1972 reform 0.110*** 0.080*** 0.067** 0.082*** 0.086***
(0.031) (0.033) (0.033) (0.023) (0.024)
Panel B: TS2SLS estimatesYears of schooling 0.223*** 0.153** 0.115** 0.148*** 0.133***
(0.072) (0.061) (0.057) (0.029) (0.025)First stage F test 28.9 67.1 56.9 98.9 98.9
Reduced form observations 15,934 15,934 15,934 14,105 13,765First stage observations 47,552 47,552 47,552 49,016 49,016
Notes:All specifications include a linear birth year term, male, white, black and Asian (south and east combined)dummies, and survey year fixed effects. Specification (1) includes the national unemployment rate and averageearnings index at age 14 as controls. Specification (2) includes a full set of age dummies. Specification (3) takesConservative partisanship is an indicator dependent variable. Specifications (4) and (5) use the BES data withConservative voting and partisanship as dependent variables; a different LFS sample is used to match the BESdistribution. Standard errors clustered by cohort. * denotes p < 0.1, ** denotes p < 0.05, *** denotes p < 0.01.
asked about their political preferences. Second, very similar results are obtained when linking the
British Election Study (BES) with an LFS first stage.35 In terms of both voting and partisan
identification, columns (4) and (5) clearly show a substantively similar increase in Conservative
political preference.36
5 Conclusion
This article addresses an important issue frequently faced by empirical researchers using instru-
mental variable techniques: good (or any) measures of both the outcome and treatment of interest
may not be available in the same dataset. While lacking the outcome or treatment variable en-
tirely may force researchers to abandon their project, using a coarsened measure of a multi-valued
treatment intensity can substantially bias estimates. As estimates of the effect of high school on
political preferences demonstrated, this bias is especially large when the causal response function
is not discontinuous and the instrument induces different respondents to achieve different treatment
intensities.
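As a rough illustration of the coarsening bias summarized above, the following sketch simulates a hypothetical population (not the BSAS or LFS data) in which a binary instrument induces every respondent with fewer than 12 years of schooling to stay one extra year, and the outcome rises linearly by `tau` per year. All names and parameter values (`simulate`, `wald`, `tau`, the 12-year threshold) are assumptions made for the example. Because only respondents starting at 11 years cross the coarsening threshold, the binary first stage captures about a third of the instrument's effect on schooling, so the coarsened Wald estimate should be roughly three times the per-year estimate.

```python
import numpy as np


def wald(y, t, z):
    """Wald/IV estimate: reduced-form difference over first-stage difference."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = t[z == 1].mean() - t[z == 0].mean()
    return reduced_form / first_stage


def simulate(n=200_000, tau=0.1, threshold=12, seed=0):
    """Return (per-year IV estimate, coarsened IV estimate) for a toy population."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)          # binary instrument (hypothetical reform exposure)
    s0 = rng.integers(9, 13, size=n)        # schooling without the reform: 9 to 12 years
    s = s0 + z * (s0 < 12)                  # reform adds one year for anyone below 12 years
    y = tau * s + rng.normal(0.0, 0.5, n)   # linear causal response function plus noise
    d = (s >= threshold).astype(float)      # coarsened binary treatment indicator
    return wald(y, s, z), wald(y, d, z)


if __name__ == "__main__":
    per_year, coarsened = simulate()
    print(f"per-year estimate:  {per_year:.3f}")
    print(f"coarsened estimate: {coarsened:.3f}")
```

In this toy design the reduced form is identical in both estimates; only the first stage in the denominator changes, which is where the upward bias comes from.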
Two-sample IV methods can solve these missing data problems. Two samples can be combined—
if they draw from the same population—even when the treatment is not measured in the same
dataset as the outcome. This allows researchers to estimate quantities that a single dataset could
not, but can also provide consistent estimates of quantities that might otherwise be substantially
biased. In this article, I outline a general approach to implementing two-sample methods. In
particular, I highlight the assumptions required to ensure the consistency of the TS2SLS estimator
as well as a range of formulas for calculating standard errors that adjust for the uncertainty of the
first stage estimation.
35 The first stage sample differs from that used with the BSAS data to better match the BES sample. In particular, the LFS first stage sample draws only upon LFS samples from the relevant election years and matches the BES sample characteristics.
36 There is a similarly large bias when using the dummy for completing high school in the BES data.
These two-sample methods are applied to the question of how education affects political pref-
erences. More specifically, I show that an additional year of late high school significantly increases
downstream support for Britain’s Conservative party. Exploiting two major educational reforms,
the fuzzy regression discontinuity estimates indicate that an additional year of schooling causes a 15
percentage point increase in the probability of voting Conservative later in life. These large effects
are “local” in that they only apply to students that would not have remained in school without the
reforms—albeit a large proportion of the population—and are specific to late high school. While
Marshall (2014) provides clear evidence of an income mechanism in the U.S., this important re-
lationship requires further research. It is also possible that university education instills liberal
attitudes that counteract schooling’s effects.
Appendix
Proof of Proposition 1. Angrist and Imbens (1995) prove that the exclusion restriction and monotonicity yield equation (3). Recognizing $\beta_k = E[Y_{ik} - Y_{ik-1} \mid S_{i1} \geq k > S_{i0}]$ yields equation (3). Because $p_{it} \geq 0$, $\operatorname{sign}(\beta_k) = \operatorname{sign}(E[Y_{it} - Y_{it-1} \mid S_{i1} \geq t > S_{i0}])$, $\forall t \neq k$ where $p_{it} > 0$, ensures $|\beta_k| \leq |\beta^W_k|$. $\square$
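Proof of Proposition 2. Note
\[
\beta^{W,J}_{WAPTE} = \frac{\sum_{t=1}^{J} p_{it}\beta^{J}_{t}}{\sum_{t=1}^{J} p_{it}} = \tau
\quad\text{and}\quad
\beta^{W,\alpha J}_{WAPTE} = \frac{\sum_{t=1}^{\alpha J} p_{it}\beta^{\alpha J}_{t}}{\sum_{t=1}^{\alpha J} p_{it}} = \tau/\alpha,
\]
where the linearity of the causal effect at each intensity interval implies $\alpha\beta^{\alpha J}_{t} = \beta^{J}_{t}$. The result follows. $\square$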
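Proof of Proposition 3. Substituting for $Y_1$ yields:
\[
\hat{\beta}^{TS2SLS} = (\hat{X}_1'\hat{X}_1)^{-1}\hat{X}_1'X_1\beta + (\hat{X}_1'\hat{X}_1)^{-1}\hat{X}_1'u_1. \tag{19}
\]
Dividing top and bottom of each term by $n_1$, taking the probability limit and applying Slutsky’s
theorem yields:
\[
\operatorname*{plim}_{n_1\to\infty} \hat{\beta}^{TS2SLS}
= \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'X_1\right)\beta
+ \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'u_1\right). \tag{20}
\]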
To prove consistency we require i) the first term to equal β and ii) the second term to be 0.
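i). First note that Slutsky’s theorem implies:
\begin{align}
\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1
&= \operatorname*{plim}_{n_1\to\infty}\left(\frac{1}{n_1} X_2'Z_2(Z_2'Z_2)^{-1}Z_1'Z_1(Z_2'Z_2)^{-1}Z_2'X_2\right) \tag{21} \\
&= \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}X_2'Z_2\right)
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_2'Z_2\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_1'Z_1\right) \notag \\
&\quad\times \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_2'Z_2\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_2'X_2\right). \tag{22}
\end{align}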
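Applying the weak law of large numbers and then Assumptions 5(a) and 5(b) yields:
\begin{align}
\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}\,E[Z_1'Z_1]\,E[Z_2'Z_2]^{-1}\,E[Z_2'X_2] \tag{23} \\
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}\,E[Z_2'X_2] \tag{24} \\
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}\,E[Z_1'X_1]. \tag{25}
\end{align}
Similarly,
\begin{align}
\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'X_1
&= \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}X_2'Z_2\right)
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_2'Z_2\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_1'X_1\right) \tag{26} \\
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}\,E[Z_1'X_1] \tag{27} \\
&= \operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1. \tag{28}
\end{align}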
Given the rank condition in Assumption 4(a), this proves part i).
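ii). Substituting out and applying the weak law of large numbers gives:
\begin{align}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'u_1\right)
&= \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}X_2'Z_2\right) \notag \\
&\quad\times \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_2'Z_2\right)^{-1}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}Z_1'u_1\right) \tag{29} \\
&= \left(E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}\,E[Z_1'X_1]\right)^{-1} \notag \\
&\quad\times E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}\,E[Z_1'u_1] \tag{30} \\
&= 0, \tag{31}
\end{align}
where the final line follows from Assumption 3, as well as the full rank and finite moment assumptions. $\square$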
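The estimator analysed in this proof can be sketched numerically. The example below is a minimal illustration on simulated data; the data-generating process, the helper `draw_sample`, and all parameter values are hypothetical and not taken from the paper. It forms $\hat{X}_1 = Z_1(Z_2'Z_2)^{-1}Z_2'X_2$ from a first stage estimated in sample 2 and then regresses $Y_1$ on $\hat{X}_1$ in sample 1, where the treatment itself is never used.

```python
import numpy as np


def ts2sls(y1, z1, x2, z2):
    """Two-sample 2SLS: first stage in sample 2, second stage in sample 1."""
    # First stage in sample 2: Pi_hat = (Z2'Z2)^{-1} Z2'X2
    pi_hat, *_ = np.linalg.lstsq(z2, x2, rcond=None)
    # Impute the regressors for sample 1 from its instruments
    x1_hat = z1 @ pi_hat
    # Second stage in sample 1: beta_hat = (X1_hat'X1_hat)^{-1} X1_hat'Y1
    beta_hat, *_ = np.linalg.lstsq(x1_hat, y1, rcond=None)
    return beta_hat


def draw_sample(n, rng, beta=0.15):
    """Hypothetical population with a confounded treatment and a binary instrument."""
    const = np.ones(n)
    z = rng.integers(0, 2, size=n).astype(float)           # instrument
    ability = rng.normal(size=n)                           # unobserved confounder
    schooling = 10 + 1.0 * z + 0.5 * ability + rng.normal(size=n)
    y = 0.2 + beta * schooling - 0.3 * ability + rng.normal(size=n)
    Z = np.column_stack([const, z])                        # exogenous variables and instrument
    X = np.column_stack([const, schooling])                # exogenous variables and treatment
    return y, X, Z


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y1, _, z1 = draw_sample(100_000, rng)   # sample 1: outcome and instrument only
    _, x2, z2 = draw_sample(100_000, rng)   # sample 2: treatment and instrument only
    print(ts2sls(y1, z1, x2, z2))           # [intercept, schooling coefficient]
```

With two independent draws from the same assumed population, the schooling coefficient should land close to the value of 0.15 used in the simulation, which is what the consistency result leads one to expect.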
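Proof of Proposition 4. Start by separating $\hat{X}$ into its endogenous and exogenous components,
\[
Y_{i1} = X_{i1}\beta_{-S} + T_{i1}\beta_S + u_i = X_{i1}\beta_{-S} + \hat{T}_{i1}\beta_S + [T_{i1} - \hat{T}_{i1}]\beta_S + u_i, \tag{32}
\]
where $\hat{T}_{i1} = Z_{i1}\hat{\Pi} = Z_{i1}(Z_2'Z_2)^{-1}Z_2'T_2$ is the predicted value of the treatment using the first stage
estimates, and $T_{i1}$ is the true and unobserved treatment in sample 1. An OLS regression would
yield:
\[
\sqrt{n_1}\begin{pmatrix} \hat{\beta}_{-T} - \beta_{-S} \\ \hat{\beta}_S - \beta_S \end{pmatrix}
= \left(\frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}\frac{1}{\sqrt{n_1}}\hat{X}_1'u_1
+ \left(\frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}\frac{1}{\sqrt{n_1}}\hat{X}_1'[T_{i1} - \hat{T}_{i1}]\beta_S, \tag{33}
\]
where subscripts $i$ and superscripts $TS2SLS$ are omitted to save space. Using the expansion result
in Murphy and Topel (1985: 374) yields:
\begin{align}
\sqrt{n_1}(\hat{\beta} - \beta) \equiv
\sqrt{n_1}\begin{pmatrix} \hat{\beta}_{-T} - \beta_{-S} \\ \hat{\beta}_S - \beta_S \end{pmatrix}
&\overset{a}{=} \left(\frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}\frac{1}{\sqrt{n_1}}\hat{X}_1'u_1 \notag \\
&\quad + \left(\frac{1}{n_1}\hat{X}_1'\hat{X}_1\right)^{-1}\left(\frac{n_1}{n_2}\right)^{1/2}\frac{1}{n_1}\hat{X}_1'(\hat{\beta}_T' \otimes Z_1)\sqrt{n_2}(\hat{\Pi} - \Pi), \tag{34}
\end{align}
where $(\hat{\beta}_T' \otimes Z_1)$ is the matrix defined in equation (12) of Murphy and Topel (1985).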
Let $\hat{\Pi}$ be a consistent estimator of the first stage for the endogenous variables, such that
$\sqrt{n_2}(\hat{\Pi} - \Pi) \overset{a}{\sim} N(0, V(\Pi))$. Using our consistent first stage estimate, the asymptotic variance
is therefore given by:
\[
V(\hat{\beta} - \beta) = E[\hat{X}_1'\hat{X}_1]^{-1}\left[ V[\beta] + \frac{n_1}{n_2}\,E[\hat{X}_1'(\hat{\beta}_T' \otimes Z_1)]^{-1}\,V[\Pi]\,E[(\hat{\beta}_T' \otimes Z_1)'\hat{X}_1]^{-1} \right] E[\hat{X}_1'\hat{X}_1]^{-1}, \tag{35}
\]
where $V[\beta]$ is the variance of the naive TS2SLS estimator. (Note that $E[\hat{X}_1'u_1] = 0$, in conjunction
with a consistent first stage, implies the consistency of the estimator.)
This establishes the general asymptotic variance formula in Proposition 4. We now apply the
homoskedastic and cluster-robust error structures:
1) Homoskedastic errors. Under homoskedasticity, the naive variance from the TS2SLS regression is simply $\sigma^2_u(\hat{X}_1'\hat{X}_1)^{-1}$. To correct for the first stage estimation, we have:
\begin{align}
\hat{X}_1'(\hat{\beta}_T' \otimes Z_1)\hat{V}(\hat{\Pi})(\hat{\beta}_T' \otimes Z_1)'\hat{X}_1
&= \hat{X}_1'(\hat{\beta}_T' \otimes Z_1)(\Omega \otimes (Z_1'Z_1)^{-1})(\hat{\beta}_T' \otimes Z_1)'\hat{X}_1 \tag{36} \\
&= \hat{X}_1'(\hat{\beta}_T'\Omega\hat{\beta}_T \otimes Z_1(Z_1'Z_1)^{-1}Z_1')\hat{X}_1 \tag{37} \\
&= \hat{\beta}_T'\Omega\hat{\beta}_T\,(\hat{X}_1'\hat{X}_1), \tag{38}
\end{align}
where the first line uses the definitions of homoskedasticity given in the proposition, the second line applies the mixed product property of Kronecker products, and the third line exploits
$Z_1(Z_1'Z_1)^{-1}Z_1'\hat{X}_1 = \hat{X}_1$ (because all exogenous variables are contained in both $\hat{X}_1$ and $Z_1$) and the
fact that $\hat{\beta}_T'\Omega\hat{\beta}_T$ is a scalar. Substituting into the general variance matrix yields the homoskedastic
variance formula in Proposition 4.
2) Clustered errors. In the clustered case, we simply let $V(\hat{\Pi}) = \frac{G_2}{G_2 - 1}\,\Phi \otimes E[Z_2'Z_2]^{-1}$. $\square$
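The Kronecker-product collapse used in the homoskedastic case, equations (36) to (38), can be checked numerically. The sketch below draws arbitrary matrices (the dimensions, the seed, and the function name `kronecker_collapse` are assumptions chosen for illustration) and confirms that the first and last lines agree whenever the fitted regressors lie in the column space of $Z_1$. It verifies only this algebraic step, not the full variance formula of Proposition 4.

```python
import numpy as np


def kronecker_collapse(n=40, q=3, m=2, seed=0):
    """Return the first and last lines of the (36)-(38) chain for random inputs."""
    rng = np.random.default_rng(seed)
    Z1 = rng.normal(size=(n, q))               # instruments in sample 1
    X1_hat = Z1 @ rng.normal(size=(q, 2))      # fitted regressors, in the span of Z1
    beta_T = rng.normal(size=(m, 1))           # coefficients on m endogenous variables
    A = rng.normal(size=(m, m))
    Omega = A @ A.T                            # arbitrary symmetric PSD covariance matrix

    K = np.kron(beta_T.T, Z1)                  # (beta_T' kron Z1), shape n x (m*q)
    V_pi = np.kron(Omega, np.linalg.inv(Z1.T @ Z1))  # Omega kron (Z1'Z1)^{-1}
    lhs = X1_hat.T @ K @ V_pi @ K.T @ X1_hat   # right-hand side of (36)

    scalar = (beta_T.T @ Omega @ beta_T).item()  # beta_T' Omega beta_T is 1 x 1
    rhs = scalar * (X1_hat.T @ X1_hat)         # line (38)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = kronecker_collapse()
    print(np.allclose(lhs, rhs))
```

The agreement relies on the projection identity $Z_1(Z_1'Z_1)^{-1}Z_1'\hat{X}_1 = \hat{X}_1$; if `X1_hat` were not built from `Z1`, the two sides would generally differ.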
References
Abrams, Samuel, Torben Iversen and David Soskice. 2010. “Informal Social Networks and Ratio-
nal Voting.” British Journal of Political Science 41:229–257.
Acemoglu, Daron and Joshua D. Angrist. 2000. “How Large Are Human Capital Externalities?
Evidence from Compulsory Schooling Laws.” NBER Macroeconomics Annual 2000 pp. 9–59.
Angrist, Joshua D. and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect
Schooling and Earnings?” Quarterly Journal of Economics 106(4):979–1014.
Angrist, Joshua D. and Alan B. Krueger. 1992. “The Effect of Age at School Entry on Educational
Attainment: An Application of Instrumental Variables with Moments from Two Samples.” Jour-
nal of the American Statistical Association 87(418):328–336.
Angrist, Joshua D. and Alan B. Krueger. 1995. “Split-sample instrumental variables estimates of
the return to schooling.” Journal of Business and Economic Statistics 13(2):225–235.
Angrist, Joshua D. and Guido W. Imbens. 1995. “Two-Stage Least Squares Estimation of Average
Causal Effects in Models With Variable Treatment Intensity.” Journal of the American Statistical
Association 90(430):431–442.
Angrist, Joshua D., Guido W. Imbens and Donald B. Rubin. 1996. “Identification of Causal Effects
Using Instrumental Variables.” Journal of the American Statistical Association 91(June):444–
455.
Angrist, Joshua D. and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s
Companion. Princeton, NJ: Princeton University Press.
Becker, Gary S. 1993. Human Capital: A Theoretical and Empirical Analysis, with Special Refer-
ence to Education. University of Chicago Press.
Bound, John, David A. Jaeger and Regina M. Baker. 1995. “Problems with instrumental vari-
ables estimation when the correlation between the instruments and the endogenous explanatory
variable is weak.” Journal of the American Statistical Association 90(430):443–450.
Bowles, Samuel and Herbert Gintis. 1976. Schooling in Capitalist America: Educational reform
and the Contradictions of Economic Life. Chicago, IL: Haymarket Books.
Clark, Damon and Heather Royer. 2013. “The Effect of Education on Adult Mortality and Health:
Evidence from Britain.” American Economic Review 103(6):2087–2120.
Dee, Thomas S. 2004. “Are there civic returns to education?” Journal of Public Economics
88:1697–1720.
Devereux, Paul J. and Robert A. Hart. 2010. “Forced to be Rich? Returns to Compulsory Schooling
in Britain.” Economic Journal 120:1345–1364.
Gelman, Andrew, David Park, Boris Shor, Joseph Bafumi and Jeronimo Cortina. 2010. Red State, Blue
State, Rich State, Poor State: Why Americans Vote the Way They Do. Princeton, NJ: Princeton
University Press.
Gerber, Alan S., Gregory A. Huber, David Doherty, Conor M. Dowling and Shang E. Ha. 2010.
“Personality and Political Attitudes: Relationships Across Issue Domains and Political Con-
texts.” American Political Science Review 104(01):111–133.
Gillard, Derek. 2011. “Education in England: A Brief History.” http://www.educationengland.org.uk/history/
Goldin, Claudia D. and Lawrence F. Katz. 2008. The Race Between Education and Technology.
Cambridge, MA: Harvard University Press.
Grenet, Julien. 2013. “Is Extending Compulsory Schooling Alone Enough to Raise Earnings?
Evidence from French and British Compulsory Schooling Laws.” Scandinavian Journal of Economics
115(1):176–210.
Harmon, Colm and Ian Walker. 1995. “Estimates of the Economic Return to Schooling for the
United Kingdom.” American Economic Review 85(5):1278–1286.
Heath, Anthony, Roger Jowell, John Curtice, Julia Field and Clarissa Levine. 1985. How Britain
Votes. Oxford: Pergamon Press.
Honaker, James and Gary King. 2010. “What to Do about Missing Values in Time-Series Cross-
Section Data.” American Journal of Political Science 54(2):561–581.
Imbens, Guido W. and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average
Treatment Effects.” Econometrica 62(2):467–475.
Inglehart, Ronald. 1981. “Post-Materialism in an Environment of Insecurity.” American Political
Science Review 75(4):880–900.
Inoue, Atsushi and Gary Solon. 2005. “Two-Sample Instrumental Variables Estimators.”
Inoue, Atsushi and Gary Solon. 2010. “Two-Sample Instrumental Variables Estimators.” Review
of Economics and Statistics 92(3):557–561.
Iversen, Torben and David Soskice. 2001. “An Asset Theory of Social Policy Preferences.” Amer-
ican Political Science Review 95(4):875–894.
Kam, Cindy D. and Carl L. Palmer. 2008. “Reconsidering the Effects of Education on Political
Participation.” The Journal of Politics 70(3):612–631.
King, Gary, James Honaker, Anne Joseph and Kenneth Scheve. 2001. “Analyzing Incomplete
Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political
Science Review 95(1):49–69.
Lochner, Lance and Enrico Moretti. 2004. “The Effect of Education on Crime: Evidence from
Prison Inmates, Arrests, and Self-Reports.” American Economic Review 94(1):155–189.
Marshall, John. 2014. “Learning to be conservative: How staying in high school changes political
preferences in the United States and Great Britain.” Working paper.
Meltzer, Allan H. and Scott F. Richard. 1981. “A rational theory of the size of government.”
Journal of Political Economy 89:914–927.
Milligan, Kevin, Enrico Moretti and Philip Oreopoulos. 2004. “Does education improve citizen-
ship? Evidence from the United States and the United Kingdom.” Journal of Public Economics
88:1667–1695.
Mincer, Jacob. 1974. Schooling, Experience, and Earnings. New York: Columbia University
Press.
Moene, Karl O. and Michael Wallerstein. 2001. “Inequality, social insurance, and redistribution.”
American Political Science Review pp. 859–874.
Murphy, Kevin M. and Robert H. Topel. 1985. “Estimation and Inference in Two-Step Econometric
Models.” Journal of Business and Economic Statistics 3(4):370–379.
Oreopoulos, Philip. 2006. “Estimating Average and Local Average Treatment Effects of Education
when Compulsory Schooling Laws Really Matter.” American Economic Review 96(1):152–175.
Schoon, Ingrid, Helen Cheng, Catharine R. Gale, G. David Batty and Ian J. Deary. 2010. “Social
status, cognitive ability, and educational attainment as predictors of liberal social attitudes and
political trust.” Intelligence 38(1):144–150.
Sondheimer, Rachel M. and Donald P. Green. 2010. “Using Experiments to Estimate the Effects
of Education on Voter Turnout.” American Journal of Political Science 54(1):174–189.
Sovey, Allison J. and Donald P. Green. 2011. “Instrumental variables estimation in political sci-
ence: A readers’ guide.” American Journal of Political Science 55(1):188–200.
Spence, Michael. 1973. “Job market signaling.” Quarterly Journal of Economics 87(3):355–374.
Staiger, Douglas and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instru-
ments.” Econometrica 65(3):557–586.
Thomassen, Jacques J.A. 2005. The European Voter: A Comparative Study of Modern Democra-
cies. Oxford: Oxford University Press.
Woodin, Tom, Gary McCulloch and Steven Cowan. 2013. “Raising the participation age in
historical perspective: policy learning from the past?” British Educational Research Journal
39(4):635–653.