IDENTIFYING EDUCATION’S POLITICAL EFFECTS WITH INCOMPLETE DATA: INSTRUMENTAL VARIABLE

ESTIMATES COMBINING TWO DATASETS

JOHN MARSHALL∗

MAY 2014

Political scientists are increasingly using instrumental variable (IV) methods, but are often faced with datasets that lack key variables or that only provide coarse variable codings. While completely missing a key variable typically causes projects to be abandoned, coarsening a treatment variable with multiple intensities (e.g. creating a binary treatment indicator) can substantially upwardly bias IV estimates. This bias arises where the coarsening causes the first stage to only capture part of the instrument’s effect. Two-sample IV methods offer a powerful solution to both problems: imputing values for the missing or coarsened variable using a separate dataset drawn from the same population with richer measurement of the treatment consistently estimates the weighted average per-unit treatment effect. Applying this approach in a fuzzy regression discontinuity setting in Great Britain, I show that an additional year of schooling substantially increases the probability of voting Conservative later in life. The estimate for completing high school, however, is upwardly biased by between two and six times.

∗PhD candidate, Department of Government, Harvard University. [email protected]. I thank Anthony Fowler and Horacio Larreguy for illuminating discussions.

1 Introduction

Instrumental variable (IV) techniques are now a standard component of the political scientist’s

methodological toolkit. Sovey and Green’s (2011) meta-analysis identifies more than one hundred

articles published in three top journals over two decades using IV techniques. It is easy to under-

stand why. Interpreted in the heterogeneous potential outcomes framework (Imbens and Angrist

1994; Angrist, Imbens and Rubin 1996), IV approaches promise to identify the average causal

effect of a treatment for the population of units that would not have received the treatment without

the intervention of the instrumental variable.

The prevalence of IV techniques has warranted increased scrutiny. Sovey and Green’s (2011)

review highlights six key concerns in IV analyses, and identifies the types of evidence and argu-

ment required to justify the assumptions underpinning the IV framework. Angrist and Pischke

(2008) also provide clear advice on using IV methods in practice.

However, this article highlights an important additional concern: using a binary (or coarsened)

treatment variable when the true treatment has multiple intensities can substantially upwardly bias

IV estimates. An important example is where a dummy variable for completing high school is

used because years of schooling is not measured. This missing data issue has been ignored by

both political scientists and economists, but frequently arises in empirical applications. I explain

how two-sample IV methods can alleviate the problem when, as is often the case, the data needed

to correct the bias is not available in the original sample. I then use the two-sample approach to

identify the causal effect of schooling on political preferences, using fuzzy regression discontinuity

methods to show that an additional year of high school substantially increases Conservative voting

in Great Britain.

This article first shows how coarsening a multi-valued (or interval) treatment variable intro-

duces upward bias. The reduced form captures the impact of an instrument on an outcome for ev-

ery individual regardless of their (coarsened) treatment intensity. However, the first stage—which

re-weights the reduced form coefficient—underestimates the effect of the instrument on the coars-

ened treatment by failing to recognize that the treatment intensity increases for some individuals

without passing the threshold required to be designated a new coarsened treatment intensity value.

In the case of schooling, the first stage for compulsory schooling laws (CSLs)—a popular instru-

ment for completing high school (e.g. Dee 2004; Lochner and Moretti 2004; Milligan, Moretti and

Oreopoulos 2004)—only captures the individuals CSLs push to complete high school, neglect-

ing any increase in schooling which does not result in completing high school. However, since

the reduced form includes the effects for individuals who experienced greater schooling without

completing high school, this can substantially upwardly bias IV estimates.

The bias is most severe when there is a large first stage for neighboring intensities with large

causal effects. The political effects of completing high school may thus be substantially biased

if each additional year of schooling has a significant causal effect on political preferences and

the instrument increases schooling for many students without inducing them to complete high

school. When the true causal effect is not highly discontinuous at a known point, estimating the

weighted average per-unit treatment effect for an interval treatment intensity is more appropriate.

I will show that IV techniques provide a consistent estimate of this quantity of interest, even when

some categories of the underlying treatment intensity are unobserved. Unlike the case where

a treatment intensity is discretized, there is also a clear and conceptually-appealing counterfactual

interpretation for this estimate.

While a treatment intensity variable can be incorrectly discretized or “miscoded” through the

choice of a researcher, a common problem is that more granular data is not available. Using a

dummy for completing high school, for example, is often necessitated because datasets such as

the American National Election Studies and British Social Attitudes Survey only provide relatively

coarse measures of education. In such cases, any IV estimate of schooling’s political effects may

be significantly biased.

The two-sample IV techniques pioneered by Angrist and Krueger (1992, 1995) can substan-

tially alleviate or solve this problem. Conceptually, these methods estimate the reduced form in

a sample containing data on only the outcome and the instrument, and the first stage in a sample

containing data on only the treatment and the instrument, before combining the two to produce

an IV estimate. The sample used for the first stage effectively serves as a means of imputing the

missing treatment variable. In this sense, two-sample IV methods behave like multiple imputation

techniques (e.g. Honaker and King 2010; King et al. 2001).1 Beyond the standard IV assump-

tions, both samples must be random draws from the same population. I show that the two-sample

2SLS (TS2SLS) estimator first proposed by Angrist and Krueger (1995)—which can accommodate

both overidentification and additional covariates—is consistent, and I also extend Inoue and Solon

(2010) to derive the associated cluster-robust variance matrix which corrects for finite-sample dif-

ferences between samples 1 and 2. Despite their merits, two-sample IV methods have not yet been

used in political science.

Finally, I show how using Britain’s compulsory schooling reforms as instruments for dis-

cretized measures of high school education can significantly upwardly bias estimates of school-

ing’s effects on political preferences. Using Britain’s two major reforms, the upward bias can be

cleanly decomposed: using a dummy for high school completion, instead of the true linear effect

of an additional year of late high school, upwardly biases estimates of schooling’s effect on voting

Conservative by between two and six times. The two-sample IV approach instead finds that an

additional year of late high school increases the probability that a voter votes Conservative in later

life by around 10-15 percentage points.

This substantial difference, which is reiterated by the reduced form estimates, raises a dilemma

for left-of-center parties—like Labour and more recently the Liberals—which have championed

inclusive educational policies at the expense of electoral success. Although it is beyond this

article’s scope to evaluate the mechanisms underpinning these large effects, it preliminarily suggests

1 While multiple imputation involves imputing missing data using other variables within a given sample, two-sample IV imputes all observations for a given variable using a second sample. Unlike multiple imputation programs like Amelia II, the methods used here have analytic solutions.

that education’s political effects are driven by income-based concerns—rather than socially liberal

attitudes that would be expected to cause voters to support the Labour or Liberal parties.

This paper is organized as follows. Section 2 demonstrates analytically the extent of the bias

and discusses the implications for applied empirical work. Section 3 explains how two-sample IV

techniques can alleviate the missing data problem. Section 4 applies these methods to identify the

effect of schooling on voting preferences in Great Britain. Section 5 concludes.

2 IV’s upward bias with coarsened treatments

2.1 Characterizing the bias

To illustrate the upward bias of coarsening a treatment intensity, consider the simplest case where

there is a single randomly assigned binary instrument.2 Denote this instrument for each obser-

vation i ∈N ≡ {1, ...,n} as Zi ∈ {0,1}. The observed treatment intensity Ti ∈ {1, ...,J} assumes

one of J ordered values, where Tiz ≡ Ti(Zi = z) denotes the potential treatment intensity conditional

on the assignment of the instrument Zi = z. Yi is i’s observed outcome of interest, with potential

outcomes Yit ≡ Y (Ti = t) corresponding to i’s treatment assignment Ti = t. To illustrate the prob-

lem, let us assume that the instrumental variables assumptions of monotonicity and the exclusion

restriction hold (see Imbens and Angrist 1994; Angrist, Imbens and Rubin 1996); see below for

formal definitions.

The researcher, whether by choice or necessity, decides to coarsen the treatment intensity. In

particular, in the hope of identifying the effect of experiencing Ti = k > 1, they partition T by

defining the indicator Dik ≡ 1(Ti ≥ k).3 Crucially, the researcher interested in identifying the

effect of obtaining Ti = k is only interested in estimating βk ≡ E[Yik − Yik−1 | Ti1 ≥ k > Ti0]. This

quantifies the local average treatment effect (LATE) of obtaining intensity k beyond only obtaining

the preceding level k−1 for instrument compliers. In the case of schooling, this could be the effect

of completing high school (12th grade) beyond completing 11th grade. In many applications, this

counterfactual is not clearly specified.4

2 The results presented here extend easily to the cases of multi-valued instruments and to the inclusion of control variables.

3 If multiple instruments are available, the coarsening need not be binary.

This approach yields the following system of IV equations to be estimated:

Yi = β̃kDik + ui, (1)

Dik = γZi + εi. (2)

Equation (1) is the structural model defining the relationship between the binary treatment and the

outcome, while equation (2) is the first stage regression of the binary treatment on the instrument.

The true causal effect of obtaining a treatment intensity of k for instrument compliers is βk, while

β̃k represents the population average effect that IV approaches typically cannot identify.5

Angrist and Imbens (1995) show that the Wald estimator βWk for this system of equations can

be expressed as the weighted sum of the causal effect for compliers moving from intensity t−1 to

t for each such interval:

$$\beta^W_k \equiv \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_{ik} \mid Z_i = 1] - E[D_{ik} \mid Z_i = 0]} = \frac{\sum_{t=1}^{J} p_{it}\beta_t}{p_{ik}}, \tag{3}$$

where pit ≡ Pr(Ti1 ≥ t > Ti0) denotes the probability that i only reaches category Ti = t because

they received the instrument Zi = 1, and thus represents the proportion of compliers at treatment

intensity t in the population. pik therefore represents the relevant first stage for ascertaining the

treatment intensity k. βt ≡ E[Yit − Yit−1 | Ti1 ≥ t > Ti0] is the LATE for compliers moving from

treatment intensity t−1 to treatment intensity t.

4 When the treatment is truly binary, the interpretation is clear. However, if the latent treatment is multi-valued, the researcher implicitly argues for the difference between some kind of average of values contained within each discretized treatment condition.

5 As Oreopoulos (2006) shows, as the number of compliers increases, the local average treatment effect converges toward the population average treatment effect.

The following proposition extends Angrist and Imbens (1995) to demonstrate the inconsistency,

and thus a bias that persists even as the sample size grows, associated with the Wald estimator seeking to

identify βk.6

Proposition 1. Suppose the following assumptions hold:

A1. Exclusion restriction: $(T_{i0}, T_{i1}, \{Y_{it}\}_{t=1}^{J})$ are jointly independent of $Z_i$, for all $i \in N$.

A2. Monotonicity: $T_{i1} - T_{i0} \geq 0$ or $T_{i1} - T_{i0} \leq 0$, for all $i \in N$.

Then the dummy variable Wald estimator $\beta^W_k$ of equations (1) and (2) can be expressed as:

$$\beta^W_k - \beta_k = \frac{\sum_{t \neq k} p_{it}\beta_t}{p_{ik}}. \tag{4}$$

Provided sign(βk) = sign(βt) for all t ≠ k where pit > 0, the dummy variable Wald estimator

accentuates the true causal effect: $|\beta_k| \leq |\beta^W_k|$.

All proofs are provided in the Appendix.

This result establishes that the Wald estimator generally over-estimates the true LATE of ob-

taining intensity k. The estimator is consistent only in two special cases. First, when the instrument

only affects reaching intensity k; or pit = 0, ∀t ≠ k. Second, when the causal effect for all intervals

other than k is zero; or βt = 0, ∀t ≠ k. Otherwise, the inconsistency of the estimator is increasing

in both pit/pik and βt for any t ≠ k.
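The size of the inconsistency in equation (4) can be made concrete by evaluating equations (3) and (4) at hypothetical complier shares. The sketch below is illustrative only: the function name `dummy_wald` and all numerical values are assumptions chosen for exposition, not quantities estimated in this paper.

```python
def dummy_wald(p, beta, k):
    """Evaluate equations (3) and (4) for the dummy D = 1(T >= k).

    p[t]    : share of compliers pushed across intensity t by the instrument
    beta[t] : LATE of moving from intensity t-1 to intensity t
    Both are dicts keyed by intensity t; all values are hypothetical.
    """
    if p[k] <= 0:
        raise ValueError("the first stage p[k] must be positive")
    wald = sum(p[t] * beta[t] for t in p) / p[k]            # equation (3)
    bias = sum(p[t] * beta[t] for t in p if t != k) / p[k]  # equation (4)
    return wald, bias


# Hypothetical instrument: 30% of units are pushed across intensity 2 and
# 10% across intensity 3 (the dummy threshold); each step has effect 0.05.
p = {2: 0.30, 3: 0.10}
beta = {2: 0.05, 3: 0.05}
wald, bias = dummy_wald(p, beta, k=3)
print(wald, bias)
```

Under these assumed values the dummy-variable Wald estimand is roughly 0.20, four times the true effect of 0.05 at intensity 3, because three times as many compliers cross the lower threshold as cross k.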

Our education example clearly illustrates the bias. Consider a compulsory schooling law requiring

that students remain in school until age 15 in a country like Britain where high school is

completed at age 16.7 For many students who would have dropped out before age 15 without

the law, the law may not induce the completion of high school. Many likely only stay until 15,

although some may go on to complete high school. This implies that there is a significant first

stage, pit > 0, for levels of schooling below high school. The IV bias, however, only arises if

an additional year of schooling before the completion of high school matters for the outcome of

interest. For outcomes like income, where either human capital or signaling may matter for labor

market returns (e.g. Becker 1993; Mincer 1974; Spence 1973), it is easy to believe that βt > 0.

Similarly, if income maps to political preferences (e.g. Marshall 2014), or remaining in high school

imparts politically-relevant norms, then political outcomes are also likely to suffer from bias.

6 In general, IV estimators are biased but consistent (see Bound, Jaeger and Baker 1995). The term bias is used somewhat loosely here to mean the deviation between the inconsistent and consistent estimators.

7 The U.S. is also a good example, where high school is completed at age 18 but the school leaving age is (or has been) 16 for many states.
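The compulsory schooling example can also be illustrated by simulation. The following is a minimal sketch under invented assumptions: the leaving-age shares, the share of compliers who go on to complete high school, the linear per-year effect `tau`, and the helper `wald` are all hypothetical and are not calibrated to the British data used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Instrument: exposure to a hypothetical reform raising the leaving age to 15.
z = rng.integers(0, 2, size=n)

# Leaving age without the reform: 14, 15 or 16 (16 = completes high school).
t0 = rng.choice([14, 15, 16], size=n, p=[0.5, 0.2, 0.3])

# With the reform, would-be leavers at 14 stay until 15, and a fifth of them
# continue to 16. Nobody leaves earlier, so monotonicity holds.
goes_on = rng.random(n) < 0.2
t1 = np.where(t0 == 14, np.where(goes_on, 16, 15), t0)
t = np.where(z == 1, t1, t0)

# Linear causal response: each extra year shifts the outcome by tau.
tau = 0.05
y = tau * t + rng.normal(0.0, 0.1, size=n)


def wald(y, x, z):
    """Reduced form divided by first stage for a binary instrument."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = x[z == 1].mean() - x[z == 0].mean()
    return reduced_form / first_stage


d = (t >= 16).astype(float)               # coarsened treatment: completed high school
wald_dummy = wald(y, d, z)                # equations (1)-(2)
wald_years = wald(y, t.astype(float), z)  # per-year specification
print(f"dummy: {wald_dummy:.3f}, per year: {wald_years:.3f}")
```

With these assumed shares the population value of the dummy-variable Wald ratio is 0.30, six times the per-year effect of 0.05, since the reduced form reflects all compliers while the first stage for the dummy reflects only those who reach age 16.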

2.2 When is the bias severe?

Proposition 1 demonstrated that the extent of bias depends upon the first stage and the LATE at

different treatment intensities. This analytical insight permits precise description of the extent of

bias in terms of a weighted causal response function (CRF). The CRF provides the causal effect of

the treatment at each intensity.

2.2.1 Sharp jumps in the CRF

When the CRF exhibits sharp discontinuities, as exemplified in Figure 1, the dummy approach

can be most appropriate. If the researcher’s understanding of the problem is strong, then correctly

identifying intensity k—the only point at which there is a (positive) causal effect in the figure—as

the key jump will yield a consistent estimate of βk, provided a suitable instrument exists to ensure

pik > 0. This works well because βt = 0 for all t ≠ k. Therefore, the Wald

estimator is consistent regardless of whether pit > 0 for some other t ≠ k.

Since the true CRF is unobserved, it is hard for researchers to know in practice whether k is the

correct cutoff to use when defining their dummy variable. In general, tipping point equilibria that

lack clear institutional definition may not be straightforward to theorize about. Experiments, on

the other hand, are not subject to these concerns if subjects cannot be partially treated.

Figure 1: Discontinuous causal response function (plot of the outcome Y, with levels Y0 and Y1, against treatment intensity T, with intensities k and k+1 marked)

If the researcher incorrectly surmises that k + 1 is the correct threshold, at best they fail to

detect the existence of the causal effect of intensity k but correctly identify no effect at k+ 1. In

the example of Figure 1, the researcher concludes that βk+1 = 0 provided that their instrument

does not induce subjects to reach intensity k and βt = 0, ∀t ≠ k. In other words, pik = 0 ensures a

correct causal estimate of a quantity that was probably not of primary interest. When pik > 0, the

Wald estimator will produce an inconsistent estimate of the LATE at intensity k+1 given by:

$$\beta^W_{k+1} - \beta_{k+1} = \frac{p_{ik}\beta_k}{p_{i,k+1}} > 0. \tag{5}$$

Although this estimate is approximately right in the sense that there is a causal effect nearby, it

both wrongly attributes the effect to intensity k+1 and does not even provide a consistent estimate

of βk unless pik+1 = pik.

2.2.2 Linear CRFs

The bias associated with using a dummy variable can be particularly large when the true CRF is

linear. Letting the causal effect associated with each interval be βt = τ ≠ 0, the dummy variable

Wald estimator yields:

$$\beta^W_k - \beta_k = \frac{\sum_{t \neq k} p_{it}}{p_{ik}}\,\tau. \tag{6}$$

This implies that more than one half of all compliers must achieve intensity k for the estimate

to be less than double the size of the true coefficient.8 This concern increases with how close

the treatment intensity categories are to one another (i.e. increases in J), because it becomes

increasingly implausible that any instrument could induce all i to receive exactly Ti1 = k.

When the causal response is linear, an alternative Wald estimator—also proposed by Angrist

and Imbens (1995)—estimating the weighted average per-unit treatment effect (WAPTE) is more

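appropriate:

$$\beta^W_{WAPTE} \equiv \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[T_i \mid Z_i = 1] - E[T_i \mid Z_i = 0]} = \frac{\sum_{t=1}^{J} p_{it}\beta_t}{\sum_{t=1}^{J} p_{it}}. \tag{7}$$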

When the true causal effect is τ at each interval, it is exactly recovered by $\beta^W_{WAPTE}$. When the

causal effect is not exactly linear, the estimator disproportionately weights the intervals with the most

compliers.
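The weighting in equation (7) can be checked numerically. This is a sketch under assumed inputs: the function name `wapte`, the complier shares and the per-interval effects are hypothetical values chosen only to show how the estimand behaves under a linear and a non-linear causal response.

```python
def wapte(p, beta):
    """Equation (7): complier-weighted average of the per-interval effects.

    p[t]    : share of compliers crossing intensity t (hypothetical values)
    beta[t] : LATE of moving from intensity t-1 to intensity t
    """
    total = sum(p.values())
    if total <= 0:
        raise ValueError("the instrument must move at least some units")
    return sum(p[t] * beta[t] for t in p) / total


p = {2: 0.30, 3: 0.10}

# Linear CRF: the common per-unit effect tau = 0.05 is recovered.
linear = {2: 0.05, 3: 0.05}

# Non-linear CRF: the estimand leans toward interval 2, where most compliers are.
nonlinear = {2: 0.02, 3: 0.10}

print(wapte(p, linear), wapte(p, nonlinear))
```

In the non-linear case the assumed effects are 0.02 and 0.10, yet the estimand is about 0.04, closer to the effect at the interval holding three quarters of the compliers.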

It is easy to show that the dummy variable approach yields a coefficient at least as large as the

WAPTE when the instrument satisfies monotonicity (Angrist and Imbens 1995).9 Consequently,

if the CRF is that in Figure 1, then the linear approach underestimates the true causal effect at

intensity k by fitting a complier-weighted linear form. In the special case where the instrument

only affects the first stage of interest, or pit = 0, ∀t ≠ k, the WAPTE estimator yields an identical

estimate to the discretized Wald estimator. To the extent that a more conservative estimate is

desired when the CRF is uncertain, the linearization may therefore be preferred.

8 To see this, note that $\frac{\sum_{t \neq k} p_{it}}{p_{ik}} = \frac{p - p_{ik}}{p_{ik}} < 1$ only when $p_{ik} > p/2$, where $p \equiv \sum_j p_{ij}$.

Furthermore, the linear approach may be robust even when not all categories are observed. If

the J observed categories represent a coarsening of the true intervals (e.g. because T is continuous),

the linear causal effect can still be recovered provided the intervals are equally spaced.10

Proposition 2. Suppose assumptions A1 and A2 in Proposition 1 hold. Let only J equally-spaced

categories of Ti be observed when there are in fact αJ equally-spaced categories, where α > 1 is

finite and αJ is an integer. Denote $\beta^{W,J}_{WAPTE}$ and $\beta^{W,\alpha J}_{WAPTE}$ respectively as the Wald estimators in the

observed sample (denoted by superscript J) and unobserved sample (denoted by superscript αJ).

If the effect of Ti is linear and $\beta^J_j = \tau$ for all intervals j, then $\beta^{W,J}_{WAPTE} = \alpha\,\beta^{W,\alpha J}_{WAPTE}$.

This result suggests that any linear relationship can be accurately estimated with the WAPTE es-

timator, even when all intervals cannot be observed in practice. Obtaining the coefficient for the

quantity of interest only requires an adjustment by factor α to provide the average linear causal

effect at the desired unit interval.

2.3 Implications for applied research

The analysis here demonstrates that the shape of the CRF is critical for ascertaining the bias of the

Wald estimator with a binary treatment. Unless the instrument is very specific in inducing subjects

to only reach treatment intensity k or the causal response is non-zero only at that particular point,

the Wald estimator can be severely biased. If the CRF is instead approximately linear in form, it is

more appropriate to estimate the WAPTE.

9 Comparison of the denominators shows that $\sum_{t=1}^{J} p_{it} \geq p_{ik}$ if sign(pit) = sign(pik), ∀t.

10 More generally, even if the spacing is uneven the true causal effect could be identified if the spacing is proportional to the causal effects at each observed intensity.

Although researchers may in some cases have strong prior beliefs over the shape of the CRF,

and thus the most appropriate empirical strategy, definitive evidence is hard to produce. For ex-

ample, it is far from clear whether it is the additional learning imparted every day that students

remain in high school or simply obtaining the diploma that should matter for how an individual

votes. Unfortunately, the researcher must rely on evidence and intuition—including the reduced

form relationship, separate first stage regressions and the dummied-out OLS relationship—in order

to determine the appropriate variable specification when only a single instrument is available.

However, when multiple instruments are available, a sharper empirical assessment is possible.

With p > 1 instruments, p intervals of the CRF can be estimated by instrumenting for p binary

indicators of different treatment levels.11 Under the assumption that different instruments do not

affect different types of compliers differently, this permits the researcher to estimate βt for com-

pliers at the intervals where the researcher believes the per-unit causal effect is likely to be largest.

Large causal effects at t ≠ k provide strong evidence against the kind of CRF required to use βWk

as a consistent estimator for βk. Applying this approach, Section 4 shows that IV estimates for

completing high school can substantially over-estimate education’s political effects.

Carefully examining the effects of different levels of a treatment intensity in a single dataset is

often not possible. As noted above, researchers often only resort to using dummy variables when

better measures are not available. I now show how two-sample IV methods can be used to address

this missing data problem.

11The above analysis can be generalized to the case of multiple instruments.

3 Using two samples to address missing data

This section shows how two-sample IV techniques—a method yet to be employed in political sci-

ence, as far as I am aware—can solve the problem that the researcher is forced to use a dummy

because an insufficient number of categories are measured in their dataset. Of course, when all

categories are available, the researcher is free to re-specify their treatment intensity variable

however they see fit. The two-sample method can similarly address the problem that the treatment

variable is completely unobserved.

The key idea underpinning two-sample IV techniques is that the reduced form and first stage

can be estimated in separate samples. Conceptually, we can then combine these estimates by

respectively replacing the numerator and denominator of the WAPTE estimator in equation (7) or

the Wald estimator in equation (3). Accordingly, two datasets are needed—one which includes

Zi and Yi, and a second which includes Zi and Ti. If covariates Wi are desired, they must also be

observed in both samples. If these datasets are both random draws from the same population, then

the relationship between Zi and Ti in the first stage sample should be equivalent to that which would

have been measured in the reduced form sample had good measures of Ti been available. Under

these conditions, which are formalized below, it is reasonable to effectively impute values

of Ti using our second dataset.

3.1 Estimation

The goal is to estimate the following system of IV equations:

Yi = TiβT +Wiβ−T + ui = Xiβ + ui (8)

Ti = ZiΠ+ εi, (9)

where Zi includes Wi and q excluded instruments. Identification requires that only p≤ q treatment

variables, Ti, can be instrumented for.

Two methods have been proposed for IV estimation with two samples. Angrist and Krueger

(1992) propose a Wald-style estimator where the reduced form estimates are divided by their first

stage counterparts, which can be generalized to the overidentified case where the

instruments outnumber the endogenous variables. Inoue and Solon (2010) show that this

estimator is less efficient than the 2SLS counterpart—proposed by Angrist and Krueger (1995)

for splitting a sample—that will be used in the empirical application here. The advantage of this

estimator is that it corrects for finite-sample differences between the two samples.12 Furthermore,

its extension to multiple instruments and multiple endogenous variables is straight-forward—both

of which are important in many empirical applications, including the analysis in this paper.

In matrix form (stacking over i), the TS2SLS estimator is:

$$\hat\beta^{TS2SLS} = (\hat X_1'\hat X_1)^{-1}\hat X_1' Y_1, \tag{10}$$

where X̂1 = (T̂1,W1) is the matrix of predicted values in sample 1. The OLS regression coefficients

generating T̂1 are based on p first stage regressions estimated in sample 2:

$$\hat X_1 = Z_1\hat\Pi = Z_1(Z_2'Z_2)^{-1}Z_2'X_2. \tag{11}$$
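Equations (10) and (11) translate directly into a few lines of linear algebra. The sketch below computes point estimates only, using simulated data: the function name `ts2sls`, the helper `draw`, and the data-generating values (a first stage of 1.0 and a true effect of 0.5) are assumptions made for illustration rather than part of the paper's application.

```python
import numpy as np


def ts2sls(y1, z1, w1, t2, z2, w2):
    """Two-sample 2SLS point estimates following equations (10) and (11).

    Sample 1 supplies the outcome y1; sample 2 supplies the treatment t2.
    z* are excluded instruments and w* exogenous covariates (here a constant),
    observed in both samples. Returns coefficients ordered (treatment, covariates).
    """
    Z1 = np.column_stack([z1, w1])
    Z2 = np.column_stack([z2, w2])
    # First stage in sample 2: Pi_hat = (Z2'Z2)^{-1} Z2'T2.
    pi_hat, *_ = np.linalg.lstsq(Z2, t2, rcond=None)
    # Impute the treatment in sample 1 from sample 1's own instruments.
    t1_hat = Z1 @ pi_hat
    X1_hat = np.column_stack([t1_hat, w1])
    # Second stage in sample 1: (X1_hat'X1_hat)^{-1} X1_hat'Y1.
    beta_hat, *_ = np.linalg.lstsq(X1_hat, y1, rcond=None)
    return beta_hat


rng = np.random.default_rng(1)


def draw(n):
    """Draw one sample from a hypothetical population with an endogenous treatment."""
    z = rng.integers(0, 2, size=n).astype(float)
    v = rng.normal(size=n)                       # unobserved confounder
    t = 10 + 1.0 * z + v + rng.normal(size=n)
    y = 0.5 * t + 0.5 * v + rng.normal(size=n)   # true effect of t is 0.5
    return y, t, z, np.ones(n)


y1, _, z1, w1 = draw(100_000)   # outcome sample: treatment discarded
_, t2, z2, w2 = draw(100_000)   # treatment sample: outcome discarded
beta_hat = ts2sls(y1, z1, w1, t2, z2, w2)
print(beta_hat[0])
```

The two samples share no observations, so the treatment coefficient is identified only through the instrument, which appears in both. The second-stage regression's own standard errors should not be reported, since they ignore the uncertainty in the imputed treatment.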

3.2 Properties of TS2SLS

The following assumptions are required to ensure the consistency of the TS2SLS estimator:

1. Random sampling from the same population: $\{Y_{1i}, Z_{1i}\}_{i=1}^{n_1}$ and $\{T_{2i}, Z_{2i}\}_{i=1}^{n_2}$ are independently and identically distributed draws of size $n_1$ and $n_2$ from the same population with finite second moments.

2. Exclusion restriction: $E[Z_{1i}'u_{1i}] = 0$.

3. Instrument exogeneity: $E[Z_{1i}'\varepsilon_{1i}] = E[Z_{2i}'\varepsilon_{2i}] = 0$.

4. Rank conditions: (a) $Z_{1i}'Z_{1i}$ and $Z_{2i}'Z_{2i}$ have full rank, (b) $X_{1i}'Z_{2i}$ and $X_{2i}'Z_{2i}$ have full rank.

5. Interchangeable sample moments: (a) $E[Z_{1i}'X_{1i}] = E[Z_{2i}'X_{2i}]$, (b) $E[Z_{1i}'Z_{1i}] = E[Z_{2i}'Z_{2i}]$.

12 Inoue and Solon (2005) show that the TS2SLS estimator remains consistent even when differences in the sampling rates vary with some of the instrumental variables.

Assumption 1 says that the samples must draw from the same population. Assumption 2 is im-

plied by the exclusion restriction above, but is written in terms of expectations. Assumption 3

requires that the instrument be exogenous in the first stage. Assumption 4 is a standard rank

condition required for matrix invertibility. Assumption 5 requires that crucial sample moments can

be interchanged, thereby permitting substitution between samples. As n1 and n2 converge to the

population size, Assumption 5 necessarily holds.

Proposition 3 demonstrates the consistency of TS2SLS, while the proof illustrates the use of

the assumptions above.13

Proposition 3. Under Assumptions 1-5, $\hat\beta^{TS2SLS}$ is an n1-consistent estimator of β.

Correctly calculating the TS2SLS standard errors is not obvious. Calculating the standard er-

rors from a regression of Y1 on X̂1 neglects the uncertainty in the first stage, in addition to distribu-

tional differences between the first stage and reduced form samples. The Murphy and Topel (1985)

two-stage framework for understanding “generated regressors”—accounting for the uncertainty in-

troduced where a variable is estimated as a proxy to enter a separate regression—incorporates such

estimation uncertainty.14 Proposition 4 derives the homoskedastic and cluster-robust variance

(matrices), of which the robust variance is the particular case of G1 = n1 and G2 = n2 clusters. (i is

dropped to facilitate exposition.)

13 Angrist and Krueger’s (1995) proof rests on showing that the TS2SLS estimator converges to the consistent Angrist and Krueger (1992) estimator, because of Assumption 5. The proof provided here instead demonstrates the consistency of TS2SLS directly.

14 Inoue and Solon (2010) acknowledge this approach and derive homoskedastic and heteroskedastic variance matrices in an alternative way, but do not provide a cluster-robust variance estimate.

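Proposition 4. The asymptotic variance of the TS2SLS estimator, $V[\hat\beta^{TS2SLS}]$, is

$$\left[\sigma_u^2 + \frac{n_1}{n_2}\,\hat\beta_S^{TS2SLS\prime}\,\Omega\,\hat\beta_S^{TS2SLS}\right] E[\hat X_1'\hat X_1]^{-1}, \qquad \Omega = E[\varepsilon'\varepsilon \mid \hat X_1] = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1,p} \\ \vdots & \ddots & \vdots \\ \sigma_{p,1} & \cdots & \sigma_p^2 \end{pmatrix} \tag{12}$$

when the reduced form squared error $\sigma_u^2 = E[u^2 \mid \hat X_1]$ and the error covariances Ω of the p first

stage regressions are homoskedastic; when the reduced form and first stage errors are grouped

into G1 and G2 clusters respectively, the cluster-robust variance is

$$E[\hat X_1'\hat X_1]^{-1}\left[ V[\hat\beta^{TS2SLS}] + \frac{n_1}{n_2}\, E[\hat X_1'(\hat\beta_S^{TS2SLS\prime} \otimes Z_1)]\, V(\hat\Pi)\, E[(\hat\beta_S^{TS2SLS\prime} \otimes Z_1)'\hat X_1] \right] E[\hat X_1'\hat X_1]^{-1}, \tag{13}$$

where $\hat\beta_S^{TS2SLS}$ is the vector of coefficients on p endogenous variables, the uncorrected TS2SLS

variance is given by $V[\hat\beta^{TS2SLS}] = \frac{G_1}{G_1-1}\sum_{g=1}^{G_1} E[\hat X_{1g}'\hat u_{1g}\hat u_{1g}'\hat X_{1g}]$ and the variances from m first-stage

regressions are $V(\hat\Pi) = \frac{G_2}{G_2-1}\,\Phi \otimes E[Z_2'Z_2]^{-1}$, where

$$\Phi = \begin{pmatrix} E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat\varepsilon_{2g1}\hat\varepsilon_{2g1}'Z_{2g}] & \cdots & E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat\varepsilon_{2g1}\hat\varepsilon_{2gp}'Z_{2g}] \\ \vdots & \ddots & \vdots \\ E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat\varepsilon_{2gp}\hat\varepsilon_{2g1}'Z_{2g}] & \cdots & E[Z_2'Z_2]^{-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat\varepsilon_{2gp}\hat\varepsilon_{2gp}'Z_{2g}] \end{pmatrix}. \tag{14}$$

Standard errors are given by the square roots of the diagonal elements of $V[\hat\beta^{TS2SLS}]/n_1$. Using

the analogy principle, expectations can be replaced by sample moments.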

In the case of a single endogenous regressor, V(Π̂) is simply the standard cluster-robust

variance matrix for the first stage:

$$E[Z_2'Z_2]^{-1}\left[\frac{G_2}{G_2-1}\sum_{g=1}^{G_2} E[Z_{2g}'\hat\varepsilon_{2g}\hat\varepsilon_{2g}'Z_{2g}]\right] E[Z_2'Z_2]^{-1}. \tag{15}$$

When there are multiple endogenous variables, the first stage estimates may be correlated across

models. This requires the more complex formulation in Proposition 4.
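For the single-regressor case, the sample analogue of equation (15) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `cluster_robust_first_stage` and the simulated data are hypothetical, and expectations are replaced by sums over the sample, so the returned matrix is the estimated variance of the first-stage coefficients themselves.

```python
import numpy as np


def cluster_robust_first_stage(Z, t, groups):
    """Sample analogue of equation (15) for one endogenous variable.

    Z      : (n, q) instrument matrix from sample 2, including any covariates
    t      : (n,) treatment
    groups : (n,) cluster labels
    Returns the first-stage coefficients and their cluster-robust variance.
    """
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    pi_hat = ZtZ_inv @ Z.T @ t
    resid = t - Z @ pi_hat
    labels = np.unique(groups)
    G = len(labels)
    meat = np.zeros((Z.shape[1], Z.shape[1]))
    for g in labels:
        rows = groups == g
        score = Z[rows].T @ resid[rows]   # Z_g' e_g, a q-vector
        meat += np.outer(score, score)    # Z_g' e_g e_g' Z_g
    variance = ZtZ_inv @ (G / (G - 1) * meat) @ ZtZ_inv
    return pi_hat, variance


# Hypothetical first-stage data with an error component shared within clusters.
rng = np.random.default_rng(2)
n, G = 2_000, 50
groups = rng.integers(0, G, size=n)
z = rng.integers(0, 2, size=n).astype(float)
Z = np.column_stack([z, np.ones(n)])
shock = rng.normal(size=G)[groups]
t = 10 + 0.6 * z + shock + rng.normal(size=n)

pi_hat, V = cluster_robust_first_stage(Z, t, groups)
print(pi_hat[0], np.sqrt(V[0, 0]))
```

Passing a distinct label for every observation (for example `np.arange(n)`) gives the heteroskedasticity-robust case of one cluster per observation, scaled by n/(n−1).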

4 High school education and political preferences

I use the two-sample IV methods expounded above to answer an important question about political

behavior: how does high school affect who citizens vote for? Despite widespread interest in the

causal effects of education on political participation (see Sondheimer and Green 2010), education’s

partisan bias has received limited attention from scholars seeking to move beyond survey corre-

lations. Furthermore, identifying the political effects of schooling is challenging because many

surveys provide insufficiently granular measures of education.

There are various ways in which education could affect political preferences. One of the most

robust correlations in political surveys in developed democracies is the link between income and

support for right-wing political parties (e.g. Gelman et al. 2010; Thomassen 2005). If educa-

tion increases income, as human capital theory strongly suggests (e.g. Acemoglu and Angrist

2000; Becker 1993), then additional high school may well increase support for right-wing parties

proposing lower taxes (Meltzer and Richard 1981).15

However, education is also associated with socially liberal attitudes. This link has also been

widely documented in survey research (Dee 2004; Gerber et al. 2010; Schoon et al. 2010), although

15This relationship could similarly work through changing demand for social insurance (Iversenand Soskice 2001; Moene and Wallerstein 2001). In the U.S., Marshall (2014) finds that highschool education predominantly works through tax policy preferences.


it is particularly strong at the university rather than high school level. Rather than supporting right-

wing parties, this impetus generally pushes voters toward left-wing parties who are more likely

to support post-materialist and socially liberal policies (e.g. Heath et al. 1985; Inglehart 1981).

In the United Kingdom, the Labour and Liberal Democrat parties are regarded as more socially

progressive.

Given the formative role of education, there are many other channels through which schooling

could affect political behavior.16 This paper does not seek to disentangle the mechanisms underpin-

ning the relationship, but rather to demonstrate that high school education has important political

implications for a large proportion of voters. Identifying schooling’s causal effects is challenging

because which individuals receive greater education is very unlikely to be random, even after var-

ious observables are controlled for or matched upon (e.g. Kam and Palmer 2008). I use Britain’s

compulsory schooling reforms as an instrument for schooling to identify high school’s political

effects. Britain represents a particularly important case because, unlike the U.S., the reforms affected

a substantial proportion of the population. With a large proportion of compliers, the estimates for

compliers approach the population average treatment effect (see Oreopoulos 2006).

4.1 Compulsory schooling laws in Britain

Great Britain’s education laws define the maximum age by which students must start school and the

minimum age at which students can leave school. To identify the effect of high school education,

I exploit two landmark reforms of the minimum leaving age that came into force in 1947 and

1972. First, Winston Churchill’s wartime coalition government passed the Education Act 1944,

which increased the leaving age from 14 to 15 in England and Wales. The Education (Scotland)

Act 1945 enacted the same reform in Scotland. The new leaving age, which had repeatedly failed

to pass in the 1920s and 1930s due to financial constraints (Gillard 2011), came into force 1st

16For example, education could alter the political composition of social networks (Abrams, Iversen and Soskice 2010), induce politically-biased participation, or teaching could instill different values and norms (Bowles and Gintis 1976).

April 1947 after several years of intensive preparation. Second, Parliament passed the Education

Act 1962 raising the school leaving age to 16, although it was Conservative Edward Heath who

finalized the extension under Statutory Instrument 444 (1972). Like the 1947 reform, Labour had

consistently pushed for the increase,17 while education was widely seen as an economically and

socially beneficial investment at the time (Woodin, McCulloch and Cowan 2013). This second

reform came into force in England, Scotland and Wales on 1st September 1972. Northern Ireland,

which experienced different education reforms (Oreopoulos 2006), is excluded from the analysis.

The reforms substantially altered the education profile of Britain’s students. As Figure 2 shows,

relative to the immediately prior academic cohorts, both reforms induced a large fraction of stu-

dents to remain in school for an additional year. Unlike compulsory schooling reforms in Canada

and the U.S., which affected a small and somewhat idiosyncratic set of students (Clark and Royer

2013; Goldin and Katz 2008; Oreopoulos 2006), Britain’s reforms affected a large proportion of

the population. Almost half of students remained in school one year longer following the 1947

reform, while a quarter remained in school because of the 1972 reform. While the 1947 reform

also increased the proportion staying in school until 16, the 1972 reform did not affect schooling

beyond the high school level.

Although the number of students in school rose considerably, the education system itself did

not greatly change. Prior to the 1947 reform, the government had engaged in a major expansion

effort to increase the number of teachers, buildings and classroom materials. In both cases, the

additional year of schooling was primarily intended to ensure students grasped all the material

they had previously been taught (see Clark and Royer 2013).

Britain’s education reforms have proved popular instruments among labor economists. The

discontinuities in schooling laws have been used to identify positive effects of an additional year

of schooling on income (Devereux and Hart 2010; Grenet 2013; Harmon and Walker 1995;

Oreopoulos 2006), and also used to demonstrate that schooling does not affect mortality rates (Clark

and Royer 2013).18 However, the potential political effects of these reforms have not received

attention.

17Under Labour Prime Minister Gordon Brown, Parliament passed the Education and Skills Act 2008, raising the education leaving age to 18 by 2015.

18There also exists a large literature exploring the economic effects of U.S. compulsory schooling reforms (see Acemoglu and Angrist 2000; Angrist and Krueger 1991; Goldin and Katz 2008). These studies differ in that they exploit cross-state differences using difference-in-differences type strategies.

[Figure 2 panels: "1947 reform" (cohorts 1940 to 1970; series: leave before 15, leave before 16) and "1972 reform" (cohorts 1950 to 2000; series: leave before 16, leave before 17). Y-axis: proportion leaving school. X-axis: cohort (year aged 14).]

Figure 2: Compulsory schooling reforms and staying in school by cohort

Notes: Data based on the Labour Force Survey data used in the empirical analysis below. Black lines represent fractional polynomial regression line fits. Grey dots are birth-year cohort averages.

4.2 Data

In order to test the political implications of these reforms, I use the British Social Attitudes Survey

(BSAS). The BSAS, which randomly samples a nationally-representative cross-section of adults

(aged 18 or above) with postal addresses in Great Britain, has been conducted in the summer of

every year since 1983 except in 1988 and 1992. In ten of the 28 available surveys,19 respondents

were asked which party they voted for in the most recent general election. In the sample used in

this analysis, 34% of respondents reported voting Conservative, while 45% and 16% respectively

voted Labour and Liberal.20 Given the theoretical claims outlined above, the analysis focuses on a

dummy for voting Conservative as the main dependent variable.

I operationalize whether a student is affected by the reform by coding indicators—1(CSLc =

15) = 1(birth year+14∈ [1947,1972]) and 1(CSLc = 16) = 1(birth year+15≥ 1972)—for the

minimum schooling leaving age affecting individuals in cohort c. The residual category is below

15. Although month of birth is not available in the BSAS, respondents can be mapped on the basis

of their year of birth (determined by age in years at the date of the survey).21 Whether an individual

was affected by the reform is thus assigned by academic cohort, defined by the year aged 14 and

15, such that 1(CSLc = 15) = 1 for those aged 14 in any year between 1947 and 1972.22
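The cohort coding just described can be sketched as follows. This is an illustrative reading rather than the author's code: the function name is hypothetical, and it assumes the two indicators are mutually exclusive, with the age-16 law taking precedence for cohorts that satisfy both inequalities as written (birth years 1957 and 1958).

```python
def csl_indicators(birth_year):
    """Return (csl15, csl16) for one respondent, coded by academic cohort.

    Assumption: categories are mutually exclusive, so the upper bound of the
    CSL=15 window is enforced by excluding anyone already coded CSL=16.
    The residual category (0, 0) is a leaving age below 15.
    """
    csl16 = birth_year + 15 >= 1972                    # aged 15 in 1972 or later
    csl15 = (birth_year + 14 >= 1947) and not csl16    # aged 14 from 1947 onward
    return int(csl15), int(csl16)
```

For example, under these assumptions a respondent born in 1932 falls in the residual category, one born in 1933 is coded CSL=15, and one born in 1957 is coded CSL=16.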

19These surveys were conducted in: 1987, 1994, 1995, 1996, 1999, 2001, 2003, 2005, 2008 and 2010.

20The Conservative vote share, the main dependent variable in this paper, closely reflects the survey-weighted average of 36% of votes received by the Conservatives across the period under study. The difference is even smaller in the raw data; as explained below, the TS2SLS approach necessitates removing some observations.

21My estimates of the effects of the reforms on schooling outcomes are very similar to Clark and Royer (2013), who can perfectly assign the instruments using month of birth data. This, in combination with the clear graphical discontinuities shown below, strongly suggests that lacking month of birth is not significantly affecting the results.

22Although Scottish students faced a weaker law between 1972 and 1976, they are coded identically to England/Wales because a similarly large drop in the proportion leaving occurred. Results are robust to excluding Scottish students aged 15 in 1972-76.

However, the BSAS measures of education are problematic. Educational attainment is measured

using six categories, ranging from no qualification to university degree.23 Completing high

school is captured by the second lowest category, which specifies that a respondent has a certifi-

cate of secondary education (CSE) or equivalent. At the end of high school (at age 15 or 16), or a

student’s 11th year of formal schooling, students take CSE exams in a variety of subjects. Given

only 2-3% of students fail any particular CSE exam, obtaining a CSE is an almost perfect proxy for

completing high school. An indicator measuring this is used to examine the results when schooling

is dichotomized at a theoretically appealing point. Although the BSAS also asks respondents what

age they left school, nearly half of the surveys did not allow respondents to answer that they left

school below age 15, and thus cannot differentiate the effect of the 1947 reform from the number

of years of schooling.24

Using only the BSAS sample to identify the effect of years of schooling would require either

coarsening the treatment or substantially reducing the sample size. However, collecting a second

sample containing common basic demographic variables and the age at which an individual left

school can solve this problem. Accordingly, I use Labour Force Survey (LFS) data—an annual

and more recently quarterly household survey—from each year in which an election occurred to

collect a pooled sample of 747,851 voting age respondents.25 Years of schooling is defined by

the age at which a respondent left continuous full-time education minus five, and an upper bound of 13

years of state-supported education is applied.26 Before 2003, the LFS collected both month and

23Respondents with foreign qualifications were excluded.

24This bottom coding is clearly still relevant in the twenty-first century because many of those aged 14 or above in 1947 are still alive. Nevertheless, similar estimates are obtained when restricting the analysis to the years for which age left school could be used as the endogenous variable. For many studies, however, the loss of precision necessitates using a separate sample.

25For the years since the LFS became quarterly, only the July-September sample was used, both to avoid duplication (respondents are surveyed for five consecutive quarters) and to approximate the period when the BSAS survey was conducted. Observations from Northern Ireland and respondents below the age of 18 were excluded to match the BSAS sample.

26After age 18, continuing students bifurcate into university or vocational programs. Given the difficulties of classifying these programs, the upper bound on state-supported schooling is most appropriate. Since the CSL reforms did not affect higher education, this choice is inconsequential for the results.


year of birth, and therefore permitted perfect instrument assignment; since 2003, the instruments

were assigned as in the BSAS.

The two-sample approach is only valid if both samples randomly draw from the same pop-

ulation. Given that the BSAS and LFS are random samples from the population of those with

available addresses,27 both samples are drawn from essentially identical populations. Neverthe-

less, imbalances could remain due to chance, different survey sizes and any differential response

characteristics. To redress the concern that the TS2SLS assumptions are not satisfied, I then chose

a random subsample of the LFS sample to match the BSAS sample distribution in terms of year

of birth, gender, ethnicity, and survey year by randomly choosing observations from within these

blocks.28 This reduced the final sample size to 47,552.29 The summary statistics in Table 1 show

that the first and second moments on the common variables match very well. In combination with

the random sampling from the same adult population, both samples effectively draw from the same

population.
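One way to read the block-matching step is sketched below: count the target (BSAS-like) sample within blocks defined by the matching variables, then draw at random from the donor (LFS-like) sample within each block in the same proportions. The function name, the dictionary-based rows, and the choice of a single common scaling factor are assumptions made for illustration; the text does not spell out these details.

```python
import random
from collections import Counter, defaultdict

def matched_subsample(target, donor, keys, seed=0):
    """Randomly subsample `donor` so its block shares match those of `target`.

    target, donor : lists of dict rows (hypothetical layout)
    keys          : matching variables, e.g. birth year, gender, ethnicity, survey year
    """
    rng = random.Random(seed)

    def block(row):
        return tuple(row[k] for k in keys)

    target_counts = Counter(block(r) for r in target)
    donor_blocks = defaultdict(list)
    for r in donor:
        donor_blocks[block(r)].append(r)

    # Largest scaling factor that every target block can support in the donor data.
    # A target block with no donors gives a factor of zero, so such blocks would
    # need to be dropped from both samples beforehand.
    ratio = min(len(donor_blocks[b]) / n for b, n in target_counts.items())

    sample = []
    for b, n in target_counts.items():
        sample.extend(rng.sample(donor_blocks[b], int(ratio * n)))
    return sample
```

Weighting donor observations to reproduce the target distribution, rather than discarding them, would be an alternative design that retains more of the first stage sample.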

4.3 Empirical strategy

To identify the effect of late high school education on political preferences, I use Britain’s com-

pulsory schooling reforms as instruments for the level of schooling an individual receives. These

reforms have been widely used as instruments, most convincingly in regression discontinuity (RD)

designs (see Clark and Royer 2013), because of the sharp change in educational attainment across

27More precisely, the BSAS uses a multi-stage design where Britain is broken up into sectors defined by postcode, from which households are randomly chosen from the address book. Respondents aged 18 or above within a household are then randomly chosen. The LFS has been an unclustered (“simple”) random sample from the address roll since 1992, having earlier employed a clustered approach from the Valuation Roll and Post Office Address File.

28Due to a lack of observations in the LFS, the final samples used for both datasets excluded respondents aged above 74 and those born before 1922 or after 1987.

29Where sample size concerns are more salient (the first stage is very strong here), another option would be to weight observations in the first stage sample to replicate the reduced form sample distribution. Such a procedure is likely to be more efficient.


Table 1: Summary statistics: BSAS and LFS samples

                                        BSAS                                      LFS
                          Obs.     Mean      Std. dev.  Min.   Max.     Obs.     Mean      Std. dev.  Min.   Max.

Dependent variable
Conservative vote         15,934   0.34      0.47       0      1

Endogenous variables
Schooling                                                                47,552   11.14     1.42       0      13
High school               15,934   0.73      0.44       0      1

Excluded instruments
CSL=15                    15,934   0.50      0.50       0      1        47,552   0.50      0.50       0      1
CSL=16                    15,934   0.39      0.49       0      1        47,552   0.39      0.49       0      1

Pre-treatment covariates
Birth year                15,934   1951.90   14.68      1922   1987     47,552   1961.66   14.64      1922   1987
Age                       15,934   47.12     14.07      18     73       47,552   46.96     14.38      18     73
Male                      15,934   0.44      0.50       0      1        47,552   0.45      0.50       0      1
White                     15,934   0.95      0.21       0      1        47,552   0.95      0.22       0      1
Asian                     15,934   0.02      0.15       0      1        47,552   0.02      0.15       0      1
Black                     15,934   0.02      0.13       0      1        47,552   0.02      0.13       0      1
Survey                    15,934   1999.02   5.91       1987   2010     47,552   1998.90   6.33       1987   2010

cohorts. Although further-apart cohorts could differ systematically, it is hard to see why cohorts

born just before and just after the reform would systematically differ in their political preferences.

Accordingly, this study also employs an RD design where the running variable determining the

treatment is birth year cohort.

The key RD identifying assumption is that partisan preferences are continuous in all covari-

ates other than school leaving age at the reform discontinuity. Given the difficulty of identifying

education’s causal effects using observational data, the RD strategy’s weak assumptions are par-

ticularly appealing. The greatest issue for RD designs is the “sorting” concern that another key

variable simultaneously changes at the discontinuity. Given that cultural shifts are very unlikely to

have affected 15 year olds but not 14 year olds, the most plausible concerns relate to demographic,

socio-economic and labor market characteristics. Figure 3 shows that trends in various proxies for

these variables are essentially continuous through both discontinuities.

I first estimate the effects, δ1 and δ2, of the schooling reforms themselves. This entails estimat-

ing reduced form OLS regressions of the following form in the BSAS sample:

$$Y_{ict} = \delta_1 \mathbf{1}(CSL_c = 15) + \delta_2 \mathbf{1}(CSL_c = 16) + f(\text{birth year}_c) + W_{it}\gamma + \eta_t + \varepsilon_{it}, \tag{16}$$

where 1(CSLc < 15) is the residual category, Wit includes a gender dummy, standardized age

polynomials,30 and dummies for white, black and (south and east) Asian ethnicities, and ηt is

a survey fixed effect. The dependent variable Yict is voting Conservative. f is a flexible global

polynomial function of the running variable designed to capture general trends away from the

reform discontinuities.31 I estimate a variety of specifications for f , ranging from including no

birth year trends to fifth-order polynomial trends to demonstrate the robustness of the relationships.

All specifications report standard errors clustered by cohort.

30For simplicity, the age polynomials are assigned the same polynomial order as f.

31To fully assess the implications of dummying-out high school, it is necessary to include both reforms in the same specification. Consequently, it is imperative to show that the results are robust to highly flexible global polynomial trends.

[Figure 3 panels, each plotted from 1940 to 2000: Panel A: Male (proportion, by cohort: year aged 14); Panel B: White (proportion, by cohort: year aged 14); Panel C: Father manual or unskilled worker (proportion, by cohort: year aged 14); Panel D: Father voted Conservative (proportion, by cohort: year aged 14); Panel E: Unemployment (rate in %, by year); Panel F: Average annual earnings (index, 2000=100, by year).]

Figure 3: Trends in demographic, socio-economic and labor market demographic variables

Notes: The data in Panels A and B is from the LFS. The data in Panels C and D is from the British Election Survey 1979-2010 (because such variables were not widely available in the BSAS), which is used as a robustness check below. The data in Panels E and F is from the Bank of England “UK Economic Data 1700-2009” dataset.

The principal quantity of interest in this paper is the effect of schooling. To estimate the effects

of different measures of schooling, Si, I use Britain’s reform cutoffs as instruments. Since the

reforms do not perfectly determine an individual’s level of schooling, the assignment of Si is fuzzy.

I thus employ a fuzzy RD approach; like standard IV approaches, this additionally requires

monotonicity and the exclusion restriction to hold. Given the large increase in school attendance

following each reform, and the fact that very few students failed to comply with the new leaving

ages, monotonicity is strongly supported. Given the close proximity of the reforms to schooling

choices, there is very limited scope for the reforms to violate the exclusion restriction by affecting

an individual’s political preferences through other channels.

The fuzzy RD entails estimating the following structural equation:

$$Y_{ict} = \beta S_i + f(\text{birth year}_c) + W_i \varphi + \eta_t + \varepsilon_{ict}, \tag{17}$$

where Si will be either a dummy for completing high school, years of schooling, or two dummy

variables for staying in school for exactly 10 years or for more than 10 years. The first stage regression generating

variation in Si is given by:

$$S_i = \alpha_1 \mathbf{1}(CSL_c = 15) + \alpha_2 \mathbf{1}(CSL_c = 16) + f(\text{birth year}_c) + W_i \psi + \eta_t + \varepsilon_{ict}. \tag{18}$$

A strong first stage, which is required to minimize the bias of IV estimates in finite samples (Bound,

Jaeger and Baker 1995; Staiger and Stock 1997), implies that α1 and α2 are significantly different

from zero.

In the case of the dummy for completing high school, equation (17) can simply be estimated

with 2SLS using only BSAS data. Given that years of schooling comes from the LFS, the effects

of years of schooling are instead estimated using TS2SLS where the LFS first stage and BSAS

reduced form are efficiently combined as above with cohort-clustered standard errors computed as

in Proposition 4.
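The point estimate from a two-sample procedure of this kind can be sketched as follows: fit the first stage in the sample that observes schooling, use its coefficients to impute schooling in the sample that observes the outcome, and regress the outcome on the imputed values and controls. All names below are hypothetical, the controls are assumed to be coded identically in both samples, and standard errors (which require the correction discussed earlier) are deliberately omitted.

```python
import numpy as np

def ts2sls(y1, Z1, X1, S2, Z2, X2):
    """Two-sample 2SLS coefficient on the endogenous treatment.

    Sample 1 (outcome sample):   y1 outcome, Z1 instruments, X1 controls
    Sample 2 (treatment sample): S2 treatment, Z2 instruments, X2 controls
    X1 and X2 should include a constant column.
    """
    W1 = np.column_stack([Z1, X1])
    W2 = np.column_stack([Z2, X2])

    # First stage, estimated only in the sample that observes the treatment.
    pi_hat, *_ = np.linalg.lstsq(W2, S2, rcond=None)

    # Impute the treatment in the outcome sample from the shared variables.
    S1_hat = W1 @ pi_hat

    # Second stage: outcome on imputed treatment plus controls.
    D = np.column_stack([S1_hat, X1])
    coef, *_ = np.linalg.lstsq(D, y1, rcond=None)
    return coef[0]
```

Naive OLS standard errors from the second stage would ignore the sampling error in the first stage coefficients, which is why a dedicated variance formula is needed.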


4.4 Results

4.4.1 The effect of compulsory schooling reforms on schooling and political preferences

Figures 4 and 5 plot the first stage and reduced form graphically. The left hand graph in Figure 4

shows a large increase in the average number of years of schooling per cohort following the 1947

reform. This reflects the 40% of students who stayed in school for another year, as shown in Figure

2. The right-hand graph shows that the 1972 reform also substantially increased the average years

of schooling, although the magnitude of the reform was much smaller. This reflects the fact that

by 1972 students were generally remaining in school longer.

Although the cohort averages are noisier, the reduced form plots in Figure 5 suggest that around

the reforms voters differ systematically in their political preferences. Especially following the

1947 reform, there is an upward shift in support for the Conservative party by cohort. The graphs

indicate that cohorts affected by the reform are approximately three percentage points more Con-

servative.32 Given that the 1972 reform affected fewer students, the difference at the discontinuity

is smaller. Although the difference is less clear, the chart also suggests an increase in support for

the Conservatives. The fact that both reforms reverse the trend against the Conservatives—which

is a function of both declining support over time (in the surveys used here) and younger voters

being more left-wing—further suggests that the posited relationship is not being driven by cohort

trends.

Table 2 presents the reduced form and first stage estimates using a simple linear cohort trend.

Although Figures 4 and 5 indicate that trends in both years of schooling and Conservative support

are approximately linear,33 the results—as will be demonstrated below—are not sensitive to this

choice. The first column provides the reduced form estimates, the second column estimates the

first stage in the BSAS sample, and the third column estimates the years of schooling first stage in

32A linear trend, which fits similarly well, indicates an even larger five percentage point effect.

33Note that there is very little data for cohorts born in the 1930s.

[Figure 4 panels: "1947 reform" (cohorts 1940 to 1970) and "1972 reform" (cohorts 1950 to 2000). Y-axis: years of schooling (9 to 12).]

Figure 4: Average years of schooling by birth year cohort (LFS data)

Notes: Black lines represent fractional polynomial regression line fits. Grey dots are cohort averages.

the LFS sample.

The reduced form shows that the reforms induced a large and statistically significant increase

in support for the Conservative party. Cohorts affected by the 1947 reform are six percentage points more

likely to vote Conservative, while the 1972 reform—which affected fewer students—increased

Conservative voting by a further 2.5 percentage points. Such large shifts for affected cohorts

imply that the reforms substantially altered national politics, and could easily have altered the

outcomes of the close 1970s and 2000s elections. Figure 6 demonstrates that the reduced form

coefficients are consistent across specifications using higher-order polynomial terms to account for

[Figure 5 panels: "1947 reform" (cohorts 1940 to 1970) and "1972 reform" (cohorts 1950 to 2000). Y-axis: vote share (0.1 to 0.5).]

Figure 5: Proportion conservative by birth year cohort (BSAS data)

Notes: Black lines represent fractional polynomial regression line fits. Grey dots are cohort averages.

more complex trends in Conservative support. However, by averaging across all individuals, these

estimates underestimate the impact on individuals who only remained in school because of the

reforms. To calculate the effects for such compliers, I turn to the fuzzy RD estimates.

4.4.2 The effect of schooling on political preferences

The first stage estimates confirm that both reforms substantially increased schooling. Looking at

the dummy for completing high school in the BSAS sample, column (2) shows that both the 1947

and 1972 reforms significantly increased the probability of completing high school. Column (3)

[Figure 6 panels: reduced form coefficient estimates for CSL=15 and CSL=16 (y-axis from -0.05 to 0.2) under Linear, Quadratic, Cubic, Quartic and Quintic specifications.]

Figure 6: Reduced form estimates using higher-order polynomial controls

Notes: Higher-order polynomial specifications include standardized global birth year trends of order p and standardized age trends of order p (excluding linear age because it is perfectly collinear with linear birth year).

Table 2: The effect of CSLs on schooling and voting for the Conservative party

                      Vote Con    High school   Schooling
                      OLS         OLS           OLS
                      (1)         (2)           (3)

1947 reform           0.061***    0.176***      0.384***
                      (0.020)     (0.024)       (0.036)
1972 reform           0.085***    0.275***      0.604***
                      (0.033)     (0.043)       (0.074)

Sample                BSAS        BSAS          LFS
Observations          15,934      15,934        47,552
First stage F test                27.6          56.9

Notes: All specifications include a linear birth year term, male, white, black and south Asian dummies, and survey year fixed effects. Standard errors clustered by cohort. * denotes p < 0.1, ** denotes p < 0.05, *** denotes p < 0.01.

instead examines years of schooling in the LFS, and similarly shows that the 1947 reform was

especially effective at keeping students in school. In both cases, the large F statistic—testing the

relevance of the reform dummies—indicates a strong first stage. Table 3 presents the fuzzy RD

results, instrumenting for schooling with the compulsory schooling reforms.
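As a back-of-envelope check using only the linear-trend coefficients reported in Table 2, each reduced form coefficient can be divided by the corresponding first stage coefficient. These instrument-by-instrument ratios are an illustration only: they assume the same specification in both samples and are not the 2SLS or TS2SLS estimators themselves, which combine both instruments.

```latex
% Years of schooling (LFS first stage, column 3 of Table 2)
\frac{\hat{\delta}_1}{\hat{\alpha}_1} = \frac{0.061}{0.384} \approx 0.159,
\qquad
\frac{\hat{\delta}_2}{\hat{\alpha}_2} = \frac{0.085}{0.604} \approx 0.141

% Completed high school dummy (BSAS first stage, column 2 of Table 2)
\frac{\hat{\delta}_1}{\hat{\alpha}_1} = \frac{0.061}{0.176} \approx 0.347,
\qquad
\frac{\hat{\delta}_2}{\hat{\alpha}_2} = \frac{0.085}{0.275} \approx 0.309
```

The per-year ratios are roughly half the size of the ratios for the binary measure, because the binary first stage captures only part of each reform's effect on schooling.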

I first examine the 2SLS estimates where schooling is discretized. Column (1) shows the Wald

estimate, and suggests that voters induced to complete high school by the reform are 33 percentage

points more likely to vote Conservative in later life. The estimated effect is very large by almost any

standard, but particularly when considering that a large segment of the population are compliers.

This estimate, however, could suffer from the bias established above: given Table 2 showed a

significant reduced form effect for the 1947 reform, but the reform did not compel all students to

complete high school, there is clear scope for upward bias. This concern is even more evident in

column (2), which uses only the 1947 reform as an instrument (removing those born after 1972). In

this specification (where the bias is expected to be largest, given that the 1947 reform caused a significant

proportion of students to also complete high school) the 2SLS estimates imply an implausibly large

89 percentage point increase in the probability of voting Conservative.

Table 3: The effect of schooling on voting

                                Con        Con        Con        Con        Labour     Liberal
                                2SLS       2SLS       TS2SLS     TS2SLS     TS2SLS     TS2SLS
                                (1)        (2)        (3)        (4)        (5)        (6)

Completed high school           0.332**    0.885***
                                (0.132)    (0.311)
10 years of schooling                                 0.124***
                                                      (0.042)
11 or more years of schooling                         0.237**
                                                      (0.111)
Years of schooling                                               0.152***   -0.047     -0.081**
                                                                 (0.054)    (0.046)    (0.031)

First stage sample              BSAS       BSAS       LFS        LFS
Reduced form observations       15,934     9,783      15,934     15,934     15,934     15,934
First stage observations        15,934     9,783      47,552     47,552     47,552     47,552
First stage F test              27.6       22.3       56.9       56.9       56.9       56.9

Notes: In each specification, the variables listed on the left side of the table are instrumented for by the indicators for the 1947 and 1972 reforms. All specifications include a linear birth year term, male, white, black and Asian (south and east combined) dummies, and survey year fixed effects. Specification (2) excludes respondents affected by the 1972 reform. While specifications (1)-(4) have Conservative vote as dependent variable, the dependent variable in specifications (5) and (6) respectively is Labour and Liberal vote. Standard errors clustered by cohort. * denotes p < 0.1, ** denotes p < 0.05, *** denotes p < 0.01.

The presence of two instruments permits a more precise exploration of any bias. Column (3)

uses the 1947 and 1972 reforms to instrument for indicators for completing ten years of schooling

or 11 or more years of schooling. Given that the reforms did not affect attaining nine or fewer years

of schooling, or more than 11 years of schooling, the coefficients in column (3) non-parametrically

estimate the effect of an additional year of late high school. For both years, an additional year

equates to a LATE of 12 percentage points. This shows that, at least at the end of high school, the

effect of schooling is almost exactly linear. Unsurprisingly, the WAPTE estimate in column (4)

shows a similar effect for an additional year of schooling.34 Figure 7 again demonstrates that the

results are not being driven by linear cohort trends, and are in fact highly stable.

The non-parametric and linear results suggest that the dummy for completing high school sub-

stantially overstates the political effect of the final year of high school. The bias more than doubles

the true LATE for the final year of school when examining both instruments, but increases sixfold

when focusing only on the 1947 reform. While these results are clearly biased in terms of magni-

tude, our more careful analysis nevertheless shows that late high school causes voters to become

substantially more conservative in later life. Reinforcing results from the U.S. (Marshall 2014), this

evidence is consistent with schooling’s economic effects dominating any effects working through

socially liberal attitudes.
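The size of the overstatement can be read directly from the Table 3 point estimates. The calculation below is simple arithmetic on those reported coefficients, taking the per-year estimate in column (4) as the benchmark; it ignores sampling uncertainty.

```latex
% Both reforms as instruments: column (1) relative to column (4)
\frac{0.332}{0.152} \approx 2.2

% 1947 reform only: column (2) relative to column (4)
\frac{0.885}{0.152} \approx 5.8

% Increment from 10 years to 11 or more years in column (3)
0.237 - 0.124 = 0.113
```

These ratios correspond to the "more than doubles" and "sixfold" descriptions above, and the increment in column (3) is close to the 0.124 estimate for the tenth year.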

Given Britain has had three main political parties throughout the survey period analyzed here, it is

not obvious which party primarily loses votes to the Conservatives. Specifications (5) and (6)

respectively use Labour and Liberal vote indicators as dependent variables, and show that school-

ing decreases the probability of voting for both parties. The reduction is largest, and statistically

significant, for the Liberal Democrats.

34The point estimate differs because the first stage for other levels of schooling is not exactly zero.

[Figure 7 plot: marginal effect of years of schooling (x-axis from 0 to 0.4) under Linear, Quadratic, Cubic, Quartic and Quintic specifications.]

Figure 7: TS2SLS estimates using higher-order polynomial controls

Notes: Higher-order polynomial specifications include standardized global birth year trends of order p and standardized age trends of order p (excluding linear age because it is perfectly collinear with linear birth year).

4.4.3 Robustness checks

Beyond the polynomial cohort trends, I now show that the reduced form and TS2SLS estimates

are highly robust to a variety of potential threats to the identification assumptions. All robustness

checks are reported in Table 4.

Although Figure 3 above showed that trends in plausible confounders are continuous through

the 1947 and 1972 reform discontinuities, I also control for the unemployment rate and average

earnings in column (1) of Table 4 and find that the effect if anything increases. To more thoroughly

demonstrate that age is not driving the results, column (2) shows that the results are robust to

including age fixed effects.

I also employ several out-of-sample checks. First, column (3) in Table 4 similarly shows that an

additional year of late high school increases the likelihood of identifying as a Conservative parti-

san by 12 percentage points. This shows that survey respondents are responding consistently when

35

Table 4: Robustness checks

                             Controls   Age dummies   Partisan   BES vote   BES partisan
                             (1)        (2)           (3)        (4)        (5)

Panel A: Reduced form estimates
1947 reform                  0.084***   0.059**       0.045**    0.072***   0.067***
                             (0.023)    (0.022)       (0.023)    (0.014)    (0.012)
1972 reform                  0.110***   0.080***      0.067**    0.082***   0.086***
                             (0.031)    (0.033)       (0.033)    (0.023)    (0.024)

Panel B: TS2SLS estimates
Years of schooling           0.223***   0.153**       0.115**    0.148***   0.133***
                             (0.072)    (0.061)       (0.057)    (0.029)    (0.025)
First stage F test           28.9       67.1          56.9       98.9       98.9

Reduced form observations    15,934     15,934        15,934     14,105     13,765
First stage observations     47,552     47,552        47,552     49,016     49,016

Notes: All specifications include a linear birth year term, male, white, black and Asian (south and east combined) dummies, and survey year fixed effects. Specification (1) includes the national unemployment rate and average earnings index at age 14 as controls. Specification (2) includes a full set of age dummies. Specification (3) takes Conservative partisanship as an indicator dependent variable. Specifications (4) and (5) use the BES data with Conservative voting and partisanship as dependent variables; a different LFS sample is used to match the BES distribution. Standard errors clustered by cohort. * denotes p < 0.1, ** denotes p < 0.05, *** denotes p < 0.01.


asked about their political preferences. Second, very similar results are obtained when linking the

British Election Study (BES) with an LFS first stage.35 In terms of both voting and partisan identification, columns (4) and (5) clearly show a substantively similar increase in Conservative political preference.36

5 Conclusion

This article addresses an important issue frequently faced by empirical researchers using instru-

mental variable techniques: good (or any) measures of both the outcome and treatment of interest

may not be available in the same dataset. While lacking the outcome or treatment variable en-

tirely may force researchers to abandon their project, using a coarsened measure of a multi-valued

treatment intensity can substantially bias estimates. As estimates of the effect of high school on

political preferences demonstrated, this bias is especially large when the causal response function

is not discontinuous and the instrument induces different respondents to achieve different treatment

intensities.
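The mechanics of this bias can be illustrated with a small simulation. The sketch below is not the paper's data or code: the data-generating process, the effect size `beta`, the baseline schooling shares, and the helper name `wald` are all assumptions chosen for illustration. It assumes a linear causal response function and an instrument that moves compliers across two different schooling thresholds, then compares the IV estimate using years of schooling with the IV estimate using a binary indicator for reaching 11 years.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 0.1  # assumed linear per-year effect of schooling on the outcome

# Hypothetical binary instrument (e.g. exposure to a reform).
z = rng.integers(0, 2, size=n)
# Hypothetical schooling in the absence of the reform.
s0 = rng.choice([9, 10, 11], size=n, p=[0.4, 0.3, 0.3])

# Assumed response: the reform adds one year for everyone below 11 years,
# so compliers cross different thresholds (9 -> 10 and 10 -> 11).
s1 = np.where(s0 < 11, s0 + 1, s0)
s = np.where(z == 1, s1, s0)

y = beta * s + rng.normal(0, 0.5, size=n)

def wald(y, t, z):
    """Wald (IV) estimate: reduced form divided by first stage."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = t[z == 1].mean() - t[z == 0].mean()
    return reduced_form / first_stage

d = (s >= 11).astype(float)  # coarsened treatment: reached 11+ years

iv_years = wald(y, s.astype(float), z)   # close to beta
iv_binary = wald(y, d, z)                # inflated: first stage misses 9 -> 10 movers
print(iv_years, iv_binary)
```

Under these assumptions the reduced form reflects every complier's extra year, while the binary first stage only counts those crossing the 11-year threshold, so the ratio is inflated by roughly the share of all compliers relative to threshold-crossers.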

Two-sample IV methods can solve these missing data problems. Two samples can be combined—

if they draw from the same population—even when the treatment is not measured in the same

dataset as the outcome. This allows researchers to estimate quantities that a single dataset could not, and can also provide consistent estimates of quantities that might otherwise be substantially biased. In this article, I outline a general approach to implementing two-sample methods. In particular, I highlight the assumptions required to ensure the consistency of the TS2SLS estimator, as well as a range of formulas for calculating standard errors that adjust for the uncertainty of the first stage estimation.
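As a minimal sketch of the point estimator, assuming both samples are drawn from the same population and share the instruments and exogenous covariates, the two steps can be written with NumPy as follows. The function name `ts2sls`, the simulated data, and all parameter values are hypothetical; the sketch returns point estimates only and does not apply the first-stage correction to the standard errors.

```python
import numpy as np

def ts2sls(y1, z1, z2, x2):
    """Two-sample 2SLS point estimates (illustrative sketch).

    y1: outcome in sample 1, shape (n1,)
    z1: instruments and exogenous covariates in sample 1, shape (n1, q)
    z2: the same variables measured in sample 2, shape (n2, q)
    x2: regressors (exogenous covariates plus treatment) in sample 2, shape (n2, k)
    """
    # First stage, estimated in sample 2 only: Pi_hat = (Z2'Z2)^{-1} Z2'X2
    pi_hat, *_ = np.linalg.lstsq(z2, x2, rcond=None)
    # Impute the regressors in sample 1, where the treatment is unobserved
    x1_hat = z1 @ pi_hat
    # Second stage: OLS of the sample-1 outcome on the imputed regressors
    beta_hat, *_ = np.linalg.lstsq(x1_hat, y1, rcond=None)
    return beta_hat

rng = np.random.default_rng(1)

def draw(n):
    """Simulate one sample from an assumed population with confounding."""
    z = rng.integers(0, 2, size=n).astype(float)        # binary instrument
    v = rng.normal(size=n)                               # unobserved confounder
    t = 10 + 2.0 * z + v + rng.normal(size=n)            # treatment
    y = 1.0 + 0.15 * t - 0.5 * v + rng.normal(size=n)    # outcome, true effect 0.15
    const = np.ones(n)
    return y, t, np.column_stack([const, z]), const

y1, _, z1, _ = draw(50_000)     # sample 1: outcome kept, treatment discarded
_, t2, z2, c2 = draw(50_000)    # sample 2: treatment kept, outcome discarded
x2 = np.column_stack([c2, t2])

beta_hat = ts2sls(y1, z1, z2, x2)   # [intercept, effect of treatment]
print(beta_hat)
```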

35 The first stage sample differs from that used with the BSAS data to better match the BES sample. In particular, the LFS first stage sample draws only upon LFS samples from the relevant election years and matches the BES sample characteristics.

36 There is a similarly large bias when using the dummy for completing high school in the BES data.


These two-sample methods are applied to the question of how education affects political pref-

erences. More specifically, I show that an additional year of late high school significantly increases

downstream support for Britain’s Conservative party. Exploiting two major educational reforms,

the fuzzy regression discontinuity estimates indicate that an additional year of schooling causes a 15

percentage point increase in the probability of voting Conservative later in life. These large effects

are “local” in that they only apply to students that would not have remained in school without the

reforms—albeit a large proportion of the population—and are specific to late high school. While

Marshall (2014) provides clear evidence of an income mechanism in the U.S., this important re-

lationship requires further research. It is also possible that university education instills liberal

attitudes that counteract schooling’s effects.


Appendix

Proof of Proposition 1. Angrist and Imbens (1995) prove that the exclusion restriction and monotonicity yield equation (3). Recognizing $\beta_k = E[Y_{ik} - Y_{i,k-1} \mid S_{i1} \geq k > S_{i0}]$ yields equation (3). Because $p_{it} \geq 0$, $\operatorname{sign}(\beta_k) = \operatorname{sign}(E[Y_{it} - Y_{i,t-1} \mid S_{i1} \geq t > S_{i0}])$, $\forall t \neq k$ where $p_{it} > 0$, ensures $|\beta_k| \leq |\beta^W_k|$. □

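Proof of Proposition 2. Note

\[
\beta^{W,J}_{WAPTE} = \frac{\sum_{t=1}^{J} p_{it}\,\beta^{J}_{t}}{\sum_{t=1}^{J} p_{it}} = \tau
\quad\text{and}\quad
\beta^{W,\alpha J}_{WAPTE} = \frac{\sum_{t=1}^{\alpha J} p_{it}\,\beta^{\alpha J}_{t}}{\sum_{t=1}^{\alpha J} p_{it}} = \tau/\alpha,
\]

where the linearity of the causal effect at each intensity interval implies $\alpha\beta^{\alpha J}_{t} = \beta^{J}_{t}$. The result follows. □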

Proof of Proposition 3. Substituting for $Y_1$ yields:

\[
\hat\beta^{TS2SLS} = (\hat X_1'\hat X_1)^{-1}\hat X_1' X_1\beta + (\hat X_1'\hat X_1)^{-1}\hat X_1' u_1. \tag{19}
\]

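Dividing top and bottom of each term by $n_1$, taking the probability limit and applying Slutsky’s theorem yields:

\begin{align}
\operatorname*{plim}_{n_1\to\infty} \hat\beta^{TS2SLS} ={}& \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1' X_1\right)\beta \notag\\
&+ \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1' u_1\right). \tag{20}
\end{align}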

To prove consistency we require i) the first term to equal β and ii) the second term to equal 0.

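i). First note that Slutsky’s theorem implies:

\begin{align}
\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1 ={}& \operatorname*{plim}_{n_1\to\infty}\left(\frac{1}{n_1} X_2'Z_2(Z_2'Z_2)^{-1}Z_1'Z_1(Z_2'Z_2)^{-1}Z_2'X_2\right) \tag{21}\\
={}& \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} X_2'Z_2\right)\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_2'Z_2\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_1'Z_1\right) \notag\\
&\times \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_2'Z_2\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_2'X_2\right). \tag{22}
\end{align}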


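Applying the weak law of large numbers and then Assumptions 5(a) and 5(b) yields:

\begin{align}
\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1 &= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}E[Z_1'Z_1]\,E[Z_2'Z_2]^{-1}E[Z_2'X_2] \tag{23}\\
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}E[Z_2'X_2] \tag{24}\\
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}E[Z_1'X_1]. \tag{25}
\end{align}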

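Similarly,

\begin{align}
\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1' X_1 &= \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} X_2'Z_2\right)\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_2'Z_2\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_1'X_1\right) \tag{26}\\
&= E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}E[Z_1'X_1] \tag{27}\\
&= \operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1. \tag{28}
\end{align}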

Given the rank condition in Assumption 4(a), this proves part i).

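ii). Substituting out and applying the weak law of large numbers gives:

\begin{align}
\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1' u_1\right) ={}& \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} X_2'Z_2\right) \notag\\
&\times \left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_2'Z_2\right)^{-1}\left(\operatorname*{plim}_{n_1\to\infty} \frac{1}{n_1} Z_1'u_1\right) \tag{29}\\
={}& \left(E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}E[Z_1'X_1]\right)^{-1} E[X_2'Z_2]\,E[Z_2'Z_2]^{-1}E[Z_1'u_1] \tag{30}\\
={}& 0, \tag{31}
\end{align}

where the final line follows from Assumption 3, as well as the full rank and finite moment assumptions. □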
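The key step in part i), equation (28), can be checked numerically. The following is a rough sketch under assumed conditions: a simulated population with one instrument and one endogenous regressor, equal sample sizes, and hypothetical names (`sample`, `gap`). In practice $X_1$ is unobserved; it is generated here only to verify that the two sample moments approach the same limit.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw Z (constant + instrument) and X (constant + endogenous regressor)."""
    z = np.column_stack([np.ones(n), rng.normal(size=n)])
    t = 1.0 + 0.8 * z[:, 1] + rng.normal(size=n)   # assumed first stage
    x = np.column_stack([np.ones(n), t])
    return z, x

n = 200_000
z1, x1 = sample(n)   # x1 would be unobserved in an application
z2, x2 = sample(n)

# First stage from sample 2, imputed into sample 1
pi_hat = np.linalg.solve(z2.T @ z2, z2.T @ x2)
x1_hat = z1 @ pi_hat

a = x1_hat.T @ x1_hat / n   # (1/n1) X1_hat' X1_hat
b = x1_hat.T @ x1 / n       # (1/n1) X1_hat' X1
gap = np.max(np.abs(a - b)) # shrinks toward zero as n grows
print(gap)
```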


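Proof of Proposition 4. Start by separating $\hat X$ into its endogenous and exogenous components,

\[
Y_{i1} = X_{i1}\beta_{-S} + T_{i1}\beta_S + u_i = X_{i1}\beta_{-S} + \hat T_{i1}\beta_S + [T_{i1} - \hat T_{i1}]\beta_S + u_i, \tag{32}
\]

where $\hat T_{i1} = Z_{i1}\hat\Pi = Z_{i1}(Z_2'Z_2)^{-1}Z_2'T_2$ is the predicted value of the treatment using the first stage estimates, and $T_{i1}$ is the true and unobserved treatment in sample 1. An OLS regression would yield:

\[
\sqrt{n_1}\begin{pmatrix}\hat\beta_{-T} - \beta_{-S}\\ \hat\beta_S - \beta_S\end{pmatrix}
= \left(\frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\frac{1}{\sqrt{n_1}}\hat X_1' u_1
+ \left(\frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\frac{1}{\sqrt{n_1}}\hat X_1'[T_{i1} - \hat T_{i1}]\beta_S, \tag{33}
\]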

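where subscripts $i$ and superscripts $TS2SLS$ are omitted to save space. Using the expansion result in Murphy and Topel (1985: 374) yields:

\begin{align}
\sqrt{n_1}(\hat\beta - \beta) \equiv \sqrt{n_1}\begin{pmatrix}\hat\beta_{-T} - \beta_{-S}\\ \hat\beta_S - \beta_S\end{pmatrix}
\overset{a}{=}{}& \left(\frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\frac{1}{\sqrt{n_1}}\hat X_1' u_1 \notag\\
&+ \left(\frac{1}{n_1}\hat X_1'\hat X_1\right)^{-1}\left(\frac{n_1}{n_2}\right)^{1/2}\frac{1}{n_1}\hat X_1'(\hat\beta_T' \otimes Z_1)\sqrt{n_2}(\hat\Pi - \Pi), \tag{34}
\end{align}

where $(\hat\beta_T' \otimes Z_1)$ is the matrix defined in equation (12) of Murphy and Topel (1985).

Let $\hat\Pi$ be a consistent estimator of the first stage for the endogenous variables, such that $\sqrt{n_2}(\hat\Pi - \Pi) \overset{a}{\sim} N(0, V(\Pi))$. Using our consistent first stage estimate, the asymptotic variance is therefore given by:

\[
V(\hat\beta - \beta) = E[\hat X_1'\hat X_1]^{-1}\left[V[\beta] + \frac{n_1}{n_2}E[\hat X_1'(\hat\beta_T' \otimes Z_1)]^{-1}\,V[\Pi]\,E[(\hat\beta_T' \otimes Z_1)'\hat X_1]^{-1}\right]E[\hat X_1'\hat X_1]^{-1}, \tag{35}
\]

where $V[\beta]$ is the variance of the naive TS2SLS estimator. (Note that $E[\hat X_1' u_1] = 0$, in conjunction with a consistent first stage, implies the consistency of the estimator.)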

This establishes the general asymptotic variance formula in Proposition 4. We now apply the


homoskedastic and cluster-robust error structures:

1) Homoskedastic errors. Under homoskedasticity, the naive variance from the TS2SLS regression is simply $\sigma_u^2(\hat X_1'\hat X_1)^{-1}$. To correct for the first stage estimation, we have:

\begin{align}
\hat X_1'(\hat\beta_T' \otimes Z_1)\hat V(\hat\Pi)(\hat\beta_T' \otimes Z_1)'\hat X_1 &= \hat X_1'(\hat\beta_T' \otimes Z_1)(\Omega \otimes (Z_1'Z_1)^{-1})(\hat\beta_T' \otimes Z_1)'\hat X_1 \tag{36}\\
&= \hat X_1'(\hat\beta_T'\Omega\hat\beta_T \otimes Z_1(Z_1'Z_1)^{-1}Z_1')\hat X_1 \tag{37}\\
&= \hat\beta_T'\Omega\hat\beta_T\,(\hat X_1'\hat X_1), \tag{38}
\end{align}

where the first line uses the definitions of homoskedasticity given in the proposition, the second line applies the mixed product property of Kronecker products, and the third line exploits $Z_1(Z_1'Z_1)^{-1}Z_1'\hat X_1 = \hat X_1$ (because all exogenous variables are contained in both $\hat X_1$ and $Z_1$) and the fact that $\hat\beta_T'\Omega\hat\beta_T$ is a scalar. Substituting into the general variance matrix yields the homoskedastic variance formula in Proposition 4.

2) Clustered errors. In the clustered case, we simply let $V(\hat\Pi) = \frac{G_2}{G_2 - 1}\Phi \otimes E[Z_2'Z_2]^{-1}$. □
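The Kronecker step from equation (36) to equation (37) in the homoskedastic case relies on the mixed product property, $(A \otimes B)(C \otimes D) = AC \otimes BD$. As a quick numerical sanity check, the sketch below verifies that step for small random matrices. The dimensions, the random inputs, and the variable names are assumptions for illustration; `b` stands in for $\hat\beta_T$ as a column vector and `Z` for $Z_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, m = 6, 2, 3                       # hypothetical small dimensions

Z = rng.normal(size=(n, q))             # stands in for Z_1
b = rng.normal(size=(m, 1))             # stands in for beta_T (column vector)
A = rng.normal(size=(m, m))
Omega = A @ A.T                         # an arbitrary symmetric PSD matrix

ZtZ_inv = np.linalg.inv(Z.T @ Z)
K = np.kron(b.T, Z)                     # (beta_T' kron Z_1), shape (n, m*q)

# Left-hand side: the sandwich as written in equation (36)
lhs = K @ np.kron(Omega, ZtZ_inv) @ K.T

# Right-hand side: scalar beta_T' Omega beta_T times the projection onto col(Z)
scalar = (b.T @ Omega @ b).item()
rhs = scalar * (Z @ ZtZ_inv @ Z.T)

print(np.allclose(lhs, rhs))
```

Because the projection matrix leaves anything in the column space of `Z` unchanged, pre- and post-multiplying `rhs` by such a matrix reduces it to the scalar times the cross-product, which is the move made in equation (38).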


References

Abrams, Samuel, Torben Iversen and David Soskice. 2010. “Informal Social Networks and Ratio-

nal Voting.” British Journal of Political Science 41:229–257.

Acemoglu, Daron and Joshua D. Angrist. 2000. “How Large Are Human Capital Externalities?

Evidence from Compulsory Schooling Laws.” NBER Macroeconomics Annual 2000 pp. 9–59.

Angrist, Joshua D. and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect

Schooling and Earnings?” Quarterly Journal of Economics 106(4):979–1014.

Angrist, Joshua D. and Alan B. Krueger. 1992. “The Effect of Age at School Entry on Educational

Attainment: An Application of Instrumental Variables with Moments from Two Samples.” Jour-

nal of the American Statistical Association 87(418):328–336.

Angrist, Joshua D. and Alan B. Krueger. 1995. “Split-sample instrumental variables estimates of

the return to schooling.” Journal of Business and Economic Statistics 13(2):225–235.

Angrist, Joshua D. and Guido W. Imbens. 1995. “Two-Stage Least Squares Estimation of Average

Causal Effects in Models With Variable Treatment Intensity.” Journal of the American Statistical

Association 90(430):431–442.

Angrist, Joshua D., Guido W. Imbens and Donald B. Rubin. 1996. “Identification of Causal Effects

Using Instrumental Variables.” Journal of the American Statistical Association 91(June):444–

455.

Angrist, Joshua D. and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press.

Becker, Gary S. 1993. Human Capital: A Theoretical and Empirical Analysis, with Special Refer-

ence to Education. University of Chicago Press.


Bound, John, David A. Jaeger and Regina M. Baker. 1995. “Problems with instrumental vari-

ables estimation when the correlation between the instruments and the endogenous explanatory

variable is weak.” Journal of the American Statistical Association 90(430):443–450.

Bowles, Samuel and Herbert Gintis. 1976. Schooling in Capitalist America: Educational reform

and the Contradictions of Economic Life. Chicago, IL: Haymarket Books.

Clark, Damon and Heather Royer. 2013. “The Effect of Education on Adult Mortality and Health:

Evidence from Britain.” American Economic Review 103(6):2087–2120.

Dee, Thomas S. 2004. “Are there civic returns to education?” Journal of Public Economics

88:1697–1720.

Devereux, Paul J. and Robert A. Hart. 2010. “Forced to be Rich? Returns to Compulsory Schooling

in Britain.” Economic Journal 120:1345–1364.

Gelman, Andrew, David Park, Boris Shor, Joseph Bafumi and Jeronimo Cortina. 2010. Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do. Princeton, NJ: Princeton University Press.

Gerber, Alan S., Gregory A. Huber, David Doherty, Conor M. Dowling and Shang E. Ha. 2010.

“Personality and Political Attitudes: Relationships Across Issue Domains and Political Con-

texts.” American Political Science Review 104(01):111–133.

Gillard, Derek. 2011. “Education in England: A Brief History.” http://www.educationengland.org.uk/history/

Goldin, Claudia D. and Lawrence F. Katz. 2008. The Race Between Education and Technology. Cambridge, MA: Harvard University Press.

Grenet, Julien. 2013. “Is Extending Compulsory Schooling Alone Enough to Raise Earnings? Evidence from French and British Compulsory Schooling Laws.” Scandinavian Journal of Economics 115(1):176–210.

Harmon, Colm and Ian Walker. 1995. “Estimates of the Economic Return to Schooling for the

United Kingdom.” American Economic Review 85(5):1278–1286.

Heath, Anthony, Roger Jowell, John Curtice, Julia Field and Clarissa Levine. 1985. How Britain

Votes. Pergamon Press Oxford.

Honaker, James and Gary King. 2010. “What to Do about Missing Values in Time-Series Cross-

Section Data.” American Journal of Political Science 54(2):561–581.

Imbens, Guido W. and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average

Treatment Effects.” Econometrica 62(2):467–475.

Inglehart, Ronald. 1981. “Post-Materialism in an Environment of Insecurity.” American Political

Science Review 75(4):880–900.

Inoue, Atsushi and Gary Solon. 2005. “Two-Sample Instrumental Variables Estimators.”

Inoue, Atsushi and Gary Solon. 2010. “Two-Sample Instrumental Variables Estimators.” Review

of Economics and Statistics 92(3):557–561.

Iversen, Torben and David Soskice. 2001. “An Asset Theory of Social Policy Preferences.” Amer-

ican Political Science Review 95(4):875–894.

Kam, Cindy D. and Carl L. Palmer. 2008. “Reconsidering the Effects of Education on Political

Participation.” The Journal of Politics 70(3):612–631.

King, Gary, James Honaker, Anne Joseph and Kenneth Scheve. 2001. “Analyzing Incomplete

Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political

Science Review 95(1):49–69.

Lochner, Lance and Enrico Moretti. 2004. “The Effect of Education on Crime: Evidence from

Prison Inmates, Arrests, and Self-Reports.” American Economic Review 94(1):155–189.


Marshall, John. 2014. “Learning to be conservative: How staying in high school changes political

preferences in the United States and Great Britain.” Working paper.

Meltzer, Allan H. and Scott F. Richard. 1981. “A rational theory of the size of government.”

Journal of Political Economy 89:914–927.

Milligan, Kevin, Enrico Moretti and Philip Oreopoulos. 2004. “Does education improve citizen-

ship? Evidence from the United States and the United Kingdom.” Journal of Public Economics

88:1667–1695.

Mincer, Jacob. 1974. Schooling, Experience, and Earnings. New York: Columbia University

Press.

Moene, Karl O. and Michael Wallerstein. 2001. “Inequality, social insurance, and redistribution.”

American Political Science Review pp. 859–874.

Murphy, Kevin M. and Robert H. Topel. 1985. “Estimation and Inference in Two-Step Econometric Models.” Journal of Business and Economic Statistics 3(4):370–379.

Oreopoulos, Philip. 2006. “Estimating Average and Local Average Treatment Effects of Education

when Compulsory Schooling Laws Really Matter.” American Economic Review 96(1):152–175.

Schoon, Ingrid, Helen Cheng, Catharine R. Gale, G. David Batty and Ian J. Deary. 2010. “Social

status, cognitive ability, and educational attainment as predictors of liberal social attitudes and

political trust.” Intelligence 38(1):144–150.

Sondheimer, Rachel M. and Donald P. Green. 2010. “Using Experiments to Estimate the Effects of Education on Voter Turnout.” American Journal of Political Science 54(1):174–189.

Sovey, Allison J. and Donald P. Green. 2011. “Instrumental variables estimation in political sci-

ence: A readers’ guide.” American Journal of Political Science 55(1):188–200.


Spence, Michael. 1973. “Job market signaling.” Quarterly Journal of Economics 87(3):355–374.

Staiger, Douglas and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instru-

ments.” Econometrica 65(3):557–586.

Thomassen, Jacques J.A. 2005. The European Voter: A Comparative Study of Modern Democra-

cies. Oxford: Oxford University Press.

Woodin, Tom, Gary McCulloch and Steven Cowan. 2013. “Raising the participation age in

historical perspective: policy learning from the past?” British Educational Research Journal

39(4):635–653.

