
NONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW*

Guido W. Imbens

Abstract—Recently there has been a surge in econometric work focusing on estimating average treatment effects under various sets of assumptions. One strand of this literature has developed methods for estimating average treatment effects for a binary treatment under assumptions variously described as exogeneity, unconfoundedness, or selection on observables. The implication of these assumptions is that systematic (for example, average or distributional) differences in outcomes between treated and control units with the same values for the covariates are attributable to the treatment. Recent analysis has considered estimation and inference for average treatment effects under weaker assumptions than typical of the earlier literature by avoiding distributional and functional-form assumptions. Various methods of semiparametric estimation have been proposed, including estimating the unknown regression functions, matching, methods using the propensity score such as weighting and blocking, and combinations of these approaches. In this paper I review the state of this literature and discuss some of its unanswered questions, focusing in particular on the practical implementation of these methods, the plausibility of this exogeneity assumption in economic applications, the relative performance of the various semiparametric estimators when the key assumptions (unconfoundedness and overlap) are satisfied, alternative estimands such as quantile treatment effects, and alternate methods such as Bayesian inference.

I. Introduction

SINCE the work by Ashenfelter (1978), Card and Sullivan (1988), Heckman and Robb (1984), Lalonde (1986), and others, there has been much interest in econometric methods for estimating the effects of active labor market programs such as job search assistance or classroom teaching programs. This interest has led to a surge in theoretical work focusing on estimating average treatment effects under various sets of assumptions. See for general surveys of this literature Angrist and Krueger (2000), Heckman, LaLonde, and Smith (2000), and Blundell and Costa-Dias (2002).

One strand of this literature has developed methods for estimating the average effect of receiving or not receiving a binary treatment under the assumption that the treatment satisfies some form of exogeneity. Different versions of this assumption are referred to as unconfoundedness (Rosenbaum & Rubin, 1983a), selection on observables (Barnow, Cain, & Goldberger, 1980; Fitzgerald, Gottschalk, & Moffitt, 1998), or conditional independence (Lechner, 1999). In the remainder of this paper I will use the terms unconfoundedness and exogeneity interchangeably to denote the assumption that the receipt of treatment is independent of the potential outcomes with and without treatment if certain observable covariates are held constant. The implication of these assumptions is that systematic (for example, average or distributional) differences in outcomes between treated and control units with the same values for these covariates are attributable to the treatment.

Much of the recent work, building on the statistical literature by Cochran (1968), Cochran and Rubin (1973), Rubin (1973a, 1973b, 1977, 1978), Rosenbaum and Rubin (1983a, 1983b, 1984), Holland (1986), and others, considers estimation and inference without distributional and functional-form assumptions. Hahn (1998) derived efficiency bounds assuming only unconfoundedness and some regularity conditions and proposed an efficient estimator. Various alternative estimators have been proposed given these conditions. These estimation methods can be grouped into five categories: (i) methods based on estimating the unknown regression functions of the outcome on the covariates (Hahn, 1998; Heckman, Ichimura, & Todd, 1997, 1998; Imbens, Newey, & Ridder, 2003), (ii) matching on covariates (Rosenbaum, 1995; Abadie & Imbens, 2002), (iii) methods based on the propensity score, including blocking (Rosenbaum & Rubin, 1984) and weighting (Hirano, Imbens, & Ridder, 2003), (iv) combinations of these approaches, for example, weighting and regression (Robins & Rotnitzky, 1995) or matching and regression (Abadie & Imbens, 2002), and (v) Bayesian methods, which have found relatively little following since Rubin (1978). In this paper I will review the state of this literature—with particular emphasis on implications for empirical work—and discuss some of the remaining questions.

The organization of the paper is as follows. In section II I will introduce the notation and the assumptions used for identification. I will also discuss the difference between population- and sample-average treatment effects. The recent econometric literature has largely focused on estimation of the population-average treatment effect and its counterpart for the subpopulation of treated units. An alternative, following the early experimental literature (Fisher, 1925; Neyman, 1923), is to consider estimation of the average effect of the treatment for the units in the sample. Many of the estimators proposed can be interpreted as estimating either the average treatment effect for the sample at hand, or the average treatment effect for the population. Although the choice of estimand may not affect the form of the estimator, it has implications for the efficiency bounds and for the form of estimators of the asymptotic variance; the variances of estimators for the sample-average treatment effect are generally smaller. In section II, I will also discuss alternative estimands. Almost the entire literature has focused on average effects. However, in many cases such measures may mask important distributional changes. These can be captured more easily by focusing on quantiles of the distributions of potential outcomes, in the presence and absence of the treatment (Lehmann, 1974; Doksum, 1974; Firpo, 2003).

Received for publication October 22, 2002. Revision accepted for publication June 4, 2003.

* University of California at Berkeley and NBER. This paper was presented as an invited lecture at the Australian and European meetings of the Econometric Society in July and August 2003. I am also grateful to Joshua Angrist, Jane Herr, Caroline Hoxby, Charles Manski, Xiangyi Meng, Robert Moffitt, and Barbara Sianesi, and two referees for comments, and to a number of collaborators, Alberto Abadie, Joshua Angrist, Susan Athey, Gary Chamberlain, Keisuke Hirano, V. Joseph Hotz, Charles Manski, Oscar Mitnik, Julie Mortimer, Jack Porter, Whitney Newey, Geert Ridder, Paul Rosenbaum, and Donald Rubin for many discussions on the topics of this paper. Financial support for this research was generously provided through NSF grants SBR 9818644 and SES 0136789 and the Giannini Foundation.

The Review of Economics and Statistics, February 2004, 86(1): 4–29. © 2004 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology

In section III, I will discuss in more detail some of the recently proposed semiparametric estimators for the average treatment effect, including those based on regression, matching, and the propensity score. I will focus particularly on implementation, and compare the different decisions faced regarding smoothing parameters using the various estimators.

In section IV, I will discuss estimation of the variances of these average treatment effect estimators. For most of the estimators introduced in the recent literature, corresponding estimators for the variance have also been proposed, typically requiring additional nonparametric regression. In practice, however, researchers often rely on bootstrapping, although this method has not been formally justified. In addition, if one is interested in the average treatment effect for the sample, bootstrapping is clearly inappropriate. Here I discuss in more detail a simple estimator for the variance for matching estimators, developed by Abadie and Imbens (2002), that does not require additional nonparametric estimation.

Section V discusses different approaches to assessing the plausibility of the two key assumptions: exogeneity or unconfoundedness, and overlap in the covariate distributions. The first of these assumptions is in principle untestable. Nevertheless a number of approaches have been proposed that are useful for addressing its credibility (Heckman and Hotz, 1989; Rosenbaum, 1984b). One may also wish to assess the responsiveness of the results to this assumption using a sensitivity analysis (Rosenbaum & Rubin, 1983b; Imbens, 2003), or, in its extreme form, a bounds analysis (Manski, 1990, 2003). The second assumption is that there exists appropriate overlap in the covariate distributions of the treated and control units. That is effectively an assumption on the joint distribution of observable variables. However, as it only involves inequality restrictions, there are no direct tests of this null. Nevertheless, in practice it is often very important to assess whether there is sufficient overlap to draw credible inferences. Lacking overlap for the full sample, one may wish to limit inferences to the average effect for the subset of the covariate space where there exists overlap between the treated and control observations.

In section VI, I discuss a number of implementations of average treatment effect estimators. The first set of implementations involves comparisons of the nonexperimental estimators to results based on randomized experiments, allowing direct tests of the unconfoundedness assumption. The second set consists of simulation studies—using data created either to fulfill the unconfoundedness assumption or to fail it in a known way—designed to compare the applicability of the various treatment effect estimators in these diverse settings.

This survey will not address alternatives for estimating average treatment effects that do not rely on exogeneity assumptions. This includes approaches where selected observed covariates are not adjusted for, such as instrumental variables analyses (Bjorklund & Moffitt, 1987; Heckman & Robb, 1984; Imbens & Angrist, 1994; Angrist, Imbens, & Rubin, 1996; Ichimura & Taber, 2000; Abadie, 2003a; Chernozhukov & Hansen, 2001). I will also not discuss methods exploiting the presence of additional data, such as difference in differences in repeated cross sections (Abadie, 2003b; Blundell et al., 2002; Athey and Imbens, 2002) and regression discontinuity where the overlap assumption is violated (van der Klaauw, 2002; Hahn, Todd, & van der Klaauw, 2000; Angrist & Lavy, 1999; Black, 1999; Lee, 2001; Porter, 2003). I will also limit the discussion to binary treatments, excluding models with static multivalued treatments as in Imbens (2000) and Lechner (2001) and models with dynamic treatment regimes as in Ham and LaLonde (1996), Gill and Robins (2001), and Abbring and van den Berg (2003). Reviews of many of these methods can be found in Shadish, Campbell, and Cook (2002), Angrist and Krueger (2000), Heckman, LaLonde, and Smith (2000), and Blundell and Costa-Dias (2002).

II. Estimands, Identification, and Efficiency Bounds

A. Definitions

In this paper I will use the potential-outcome notation that dates back to the analysis of randomized experiments by Fisher (1935) and Neyman (1923). After being forcefully advocated in a series of papers by Rubin (1974, 1977, 1978), this notation is now standard in the literature on both experimental and nonexperimental program evaluation.

We begin with N units, indexed by i = 1, ..., N, viewed as drawn randomly from a large population. Each unit is characterized by a pair of potential outcomes, Yi(0) for the outcome under the control treatment and Yi(1) for the outcome under the active treatment. In addition, each unit has a vector of characteristics, referred to as covariates, pretreatment variables, or exogenous variables, and denoted by Xi.¹ It is important that these variables are not affected by the treatment. Often they take their values prior to the unit being exposed to the treatment, although this is not sufficient for the conditions they need to satisfy. Importantly, this vector of covariates can include lagged outcomes.

¹ Calling such variables exogenous is somewhat at odds with several formal definitions of exogeneity (e.g., Engle, Hendry, & Richard, 1974), as knowledge of their distribution can be informative about the average treatment effects. It does, however, agree with common usage. See, for example, Manski et al. (1992, p. 28). See also Frolich (2002) and Hirano et al. (2003) for additional discussion.

AVERAGE TREATMENT EFFECTS 5

Finally, each unit is exposed to a single treatment; Wi = 0 if unit i receives the control treatment, and Wi = 1 if unit i receives the active treatment. We therefore observe for each unit the triple (Wi, Yi, Xi), where Yi is the realized outcome:

$$Y_i = Y_i(W_i) = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases}$$

Distributions of (W, Y, X) refer to the distribution induced by the random sampling from the superpopulation.
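As a concrete sketch of this setup (with made-up numbers, not data from the paper), the observed triple (Wi, Yi, Xi) can be constructed from the potential outcomes as follows:

```python
import numpy as np

# Hypothetical potential outcomes and treatment indicators for N = 4 units.
y0 = np.array([1.0, 2.0, 3.0, 4.0])   # Y_i(0): outcome under control
y1 = np.array([2.5, 2.0, 5.0, 4.5])   # Y_i(1): outcome under treatment
w = np.array([0, 1, 0, 1])            # W_i: treatment received

# The realized outcome reveals only one potential outcome per unit:
# Y_i = Y_i(0) if W_i = 0, and Y_i = Y_i(1) if W_i = 1.
y = np.where(w == 1, y1, y0)
print(y.tolist())  # [1.0, 2.0, 3.0, 4.5]
```

The unrealized outcome Yi(1 − Wi) is never observed for any unit, which is exactly why the treatment effect Yi(1) − Yi(0) cannot be computed directly from the data.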

Several additional pieces of notation will be useful in the remainder of the paper. First, the propensity score (Rosenbaum and Rubin, 1983a) is defined as the conditional probability of receiving the treatment,

$$e(x) = \Pr(W = 1 \mid X = x) = \mathbb{E}[W \mid X = x].$$

Also, define, for w ∈ {0, 1}, the two conditional regression and variance functions

$$\mu_w(x) = \mathbb{E}[Y(w) \mid X = x], \qquad \sigma_w^2(x) = \mathbb{V}(Y(w) \mid X = x).$$

Finally, let ρ(x) be the conditional correlation coefficient of Y(0) and Y(1) given X = x. As one never observes Yi(0) and Yi(1) for the same unit i, the data only contain indirect and very limited information about this correlation coefficient.²
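For a covariate taking only a few discrete values, e(x) and μw(x) reduce to within-cell frequencies and means. A minimal sketch with hypothetical data:

```python
import numpy as np

# Illustrative data with a single binary covariate X.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = np.array([0, 0, 1, 1, 0, 1, 1, 1])
y = np.array([1.0, 3.0, 4.0, 6.0, 2.0, 5.0, 7.0, 9.0])

def propensity(x_val):
    """e(x) = Pr(W = 1 | X = x), estimated by the cell frequency of treatment."""
    return w[x == x_val].mean()

def mu(w_val, x_val):
    """mu_w(x) = E[Y | W = w, X = x], estimated by the cell mean of Y."""
    cell = (w == w_val) & (x == x_val)
    return y[cell].mean()

print(propensity(0), propensity(1))  # 0.5 0.75
print(mu(1, 0) - mu(0, 0))           # 3.0
```

With continuous covariates the same objects would require nonparametric smoothing, which is where the implementation choices discussed later in the paper arise.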

B. Estimands: Average Treatment Effects

In this discussion I will primarily focus on a number of average treatment effects (ATEs). This is less limiting than it may seem, however, as it includes averages of arbitrary transformations of the original outcomes. Later I will return briefly to alternative estimands that cannot be written in this form.

The first estimand, and the most commonly studied in the econometric literature, is the population-average treatment effect (PATE):

$$\tau^P = \mathbb{E}[Y(1) - Y(0)].$$

Alternatively we may be interested in the population-average treatment effect for the treated (PATT; for example, Rubin, 1977; Heckman & Robb, 1984):

$$\tau_T^P = \mathbb{E}[Y(1) - Y(0) \mid W = 1].$$

Heckman and Robb (1984) and Heckman, Ichimura, and Todd (1997) argue that the subpopulation of treated units is often of more interest than the overall population in the context of narrowly targeted programs. For example, if a program is specifically directed at individuals disadvantaged in the labor market, there is often little interest in the effect of such a program on individuals with strong labor market attachment.

I will also look at sample-average versions of these two population measures. These estimands focus on the average of the treatment effect in the specific sample, rather than in the population at large. They include the sample-average treatment effect (SATE)

$$\tau^S = \frac{1}{N} \sum_{i=1}^{N} \bigl[Y_i(1) - Y_i(0)\bigr],$$

and the sample-average treatment effect for the treated (SATT)

$$\tau_T^S = \frac{1}{N_T} \sum_{i: W_i = 1} \bigl[Y_i(1) - Y_i(0)\bigr],$$

where $N_T = \sum_{i=1}^{N} W_i$ is the number of treated units. The SATE and the SATT have received little attention in the recent econometric literature, although the SATE has a long tradition in the analysis of randomized experiments (for example, Neyman, 1923). Without further assumptions, the sample contains no information about the PATE beyond the SATE. To see this, consider the case where we observe the sample (Yi(0), Yi(1), Wi, Xi), i = 1, ..., N; that is, we observe both potential outcomes for each unit. In that case τ^S = Σi [Yi(1) − Yi(0)]/N can be estimated without error. Obviously, the best estimator for the population-average effect τ^P is τ^S. However, we cannot estimate τ^P without error even with a sample where all potential outcomes are observed, because we lack the potential outcomes for those population members not included in the sample. This simple argument has two implications. First, one can estimate the SATE at least as accurately as the PATE, and typically more so. In fact, the difference between the two variances is the variance of the treatment effect, which is zero only when the treatment effect is constant. Second, a good estimator for one ATE is automatically a good estimator for the other. One can therefore interpret many of the estimators for PATE or PATT as estimators for SATE or SATT, with lower implied standard errors, as discussed in more detail in section IIE.
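The argument above is easy to trace in code: if both potential outcomes were visible, the SATE and SATT would be simple sample averages. A sketch with hypothetical values:

```python
import numpy as np

# Hypothetical sample in which both potential outcomes are visible,
# so the sample-average effects can be computed exactly.
y0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y1 = np.array([3.0, 2.5, 4.0, 7.0, 5.5])
w = np.array([1, 0, 1, 1, 0])

sate = (y1 - y0).mean()                # tau^S: average effect over all N units
n_t = w.sum()                          # N_T: number of treated units
satt = (y1 - y0)[w == 1].sum() / n_t   # tau_T^S: average effect over treated units

print(sate)  # 1.4
print(satt)  # 2.0
```

The PATE, by contrast, could still not be computed without error from this sample, since the units outside the sample are missing entirely.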

A third pair of estimands combines features of the other two. These estimands, introduced by Abadie and Imbens (2002), focus on the ATE conditional on the sample distribution of the covariates. Formally, the conditional ATE (CATE) is defined as

$$\tau(X) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\bigl[Y_i(1) - Y_i(0) \mid X_i\bigr],$$

and the conditional ATE for the treated (CATT) is defined as

$$\tau(X)_T = \frac{1}{N_T} \sum_{i: W_i = 1} \mathbb{E}\bigl[Y_i(1) - Y_i(0) \mid X_i\bigr].$$

² As Heckman, Smith, and Clemens (1997) point out, however, one can draw some limited inferences about the correlation coefficient from the shape of the two marginal distributions of Y(0) and Y(1).

THE REVIEW OF ECONOMICS AND STATISTICS 6

Using the same argument as in the previous paragraph, it can be shown that one can estimate CATE and CATT more accurately than PATE and PATT, but generally less accurately than SATE and SATT.

The difference in asymptotic variances forces the researcher to take a stance on what the quantity of interest is. For example, in a specific application one can legitimately reach the conclusion that there is no evidence, at the 95% level, that the PATE is different from zero, whereas there may be compelling evidence that the SATE and CATE are positive. Typically researchers in econometrics have focused on the PATE, but one can argue that it is of interest, when one cannot ascertain the sign of the population-level effect, to know whether one can determine the sign of the effect for the sample. Especially in cases, which are all too common, where it is not clear whether the sample is representative of the population of interest, results for the sample at hand may be of considerable interest.

C. Identification

We make the following key assumption about the treatment assignment:

ASSUMPTION 2.1 (UNCONFOUNDEDNESS):

$$\bigl(Y(0), Y(1)\bigr) \perp W \mid X.$$

This assumption was first articulated in this form by Rosenbaum and Rubin (1983a), who refer to it as “ignorable treatment assignment.” Lechner (1999, 2002) refers to this as the “conditional independence assumption.” Following work by Barnow, Cain, and Goldberger (1980) in a regression setting it is also referred to as “selection on observables.”

To see the link with standard exogeneity assumptions, suppose that the treatment effect is constant: τ = Yi(1) − Yi(0) for all i. Suppose also that the control outcome is linear in Xi:

$$Y_i(0) = \alpha + X_i'\beta + \varepsilon_i,$$

with εi ⊥ Xi. Then we can write

$$Y_i = \alpha + \tau \cdot W_i + X_i'\beta + \varepsilon_i.$$

Given the assumption of constant treatment effect, unconfoundedness is equivalent to independence of Wi and εi conditional on Xi, which would also capture the idea that Wi is exogenous. Without this assumption, however, unconfoundedness does not imply a linear relation with (mean-)independent errors.
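Under the constant-effect linear model above, unconfoundedness reduces to ordinary exogeneity, and OLS of Yi on (1, Wi, Xi) recovers τ. A minimal check with noiseless hypothetical data (α = 1, τ = 2, β = 0.5, values chosen purely for illustration):

```python
import numpy as np

# Noiseless data generated from Y_i = alpha + tau*W_i + X_i*beta
# with alpha = 1, tau = 2, beta = 0.5 (hypothetical values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0, 1, 0, 1, 0, 1])
y = 1.0 + 2.0 * w + 0.5 * x

# OLS of Y on (1, W, X); with epsilon independent of (W, X), the
# coefficient on W identifies the constant treatment effect tau.
design = np.column_stack([np.ones_like(x), w, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, tau_hat, beta_hat = coef
print(round(tau_hat, 6))  # 2.0
```

With a heterogeneous treatment effect the same regression would no longer have this simple interpretation, which is the point of the caveat in the text.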

Next, we make a second assumption regarding the joint distribution of treatments and covariates:

ASSUMPTION 2.2 (OVERLAP):

$$0 < \Pr(W = 1 \mid X) < 1.$$

For many of the formal results one will also need smoothness assumptions on the conditional regression functions and the propensity score [μw(x) and e(x)], and moment conditions on Y(w). I will not discuss these regularity conditions here. Details can be found in the references for the specific estimators given below.
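With a discrete covariate, Assumption 2.2 can be probed directly by estimating e(x) cell by cell; a cell containing only treated (or only control) units violates overlap. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical sample with a discrete covariate; the overlap assumption
# requires 0 < Pr(W = 1 | X = x) < 1 in every covariate cell.
x = np.array([0, 0, 0, 1, 1, 1, 2, 2])
w = np.array([0, 1, 1, 0, 0, 1, 1, 1])

for x_val in np.unique(x):
    e_hat = w[x == x_val].mean()   # estimated propensity score in this cell
    ok = 0.0 < e_hat < 1.0
    print(x_val, e_hat, "overlap" if ok else "VIOLATED")
```

In this made-up example the cell x = 2 contains only treated units (e_hat = 1), so E[Y | X = 2, W = 0] cannot be estimated and inferences would have to be restricted to the remaining cells, as discussed in section V.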

There has been some controversy about the plausibility of Assumptions 2.1 and 2.2 in economic settings, and thus about the relevance of the econometric literature that focuses on estimation and inference under these conditions for empirical work. In this debate it has been argued that agents’ optimizing behavior precludes their choices being independent of the potential outcomes, whether or not conditional on covariates. This seems an unduly narrow view. In response I will offer three arguments for considering these assumptions.

The first is a statistical, data-descriptive motivation. A natural starting point in the evaluation of any program is a comparison of average outcomes for treated and control units. A logical next step is to adjust any difference in average outcomes for differences in exogenous background characteristics (exogenous in the sense of not being affected by the treatment). Such an analysis may not lead to the final word on the efficacy of the treatment, but its absence would seem difficult to rationalize in a serious attempt to understand the evidence regarding the effect of the treatment.

A second argument is that almost any evaluation of a treatment involves comparisons of units who received the treatment with units who did not. The question is typically not whether such a comparison should be made, but rather which units should be compared, that is, which units best represent the treated units had they not been treated. Economic theory can help in classifying variables into those that need to be adjusted for versus those that do not, on the basis of their role in the decision process (for example, whether they enter the utility function or the constraints). Given that, the unconfoundedness assumption merely asserts that all variables that need to be adjusted for are observed by the researcher. This is an empirical question, and not one that should be controversial as a general principle. It is clear that settings where some of these covariates are not observed will require strong assumptions to allow for identification. Such assumptions include instrumental variables settings where some covariates are assumed to be independent of the potential outcomes. Absent those assumptions, typically only bounds can be identified (as in Manski, 1990, 2003).

A third, related argument is that even when agents choose their treatment optimally, two agents with the same values for observed characteristics may differ in their treatment choices without invalidating the unconfoundedness assumption if the difference in their choices is driven by differences in unobserved characteristics that are themselves unrelated to the outcomes of interest. The plausibility of this will depend critically on the exact nature of the optimization process faced by the agents. In particular it may be important that the objective of the decision maker is distinct from the outcome that is of interest to the evaluator. For example, suppose we are interested in estimating the average effect of a binary input (such as a new technology) on a firm’s output.³ Assume production is a stochastic function of this input because other inputs (such as weather) are not under the firm’s control: Yi = g(Wi, εi). Suppose that profits are output minus costs (πi = Yi − ci · Wi), and also that a firm chooses a production level to maximize expected profits, equal to output minus costs, conditional on the cost of adopting the new technology,

$$W_i = \arg\max_{w \in \{0,1\}} \mathbb{E}[\pi_i(w) \mid c_i] = \arg\max_{w \in \{0,1\}} \mathbb{E}[g(w, \varepsilon_i) - c_i \cdot w \mid c_i],$$

implying

$$W_i = 1\bigl\{\mathbb{E}[g(1, \varepsilon_i) - g(0, \varepsilon_i) \mid c_i] \ge c_i\bigr\} = h(c_i).$$

If unobserved marginal costs ci differ between firms, and these marginal costs are independent of the errors εi in the firms’ forecast of production given inputs, then unconfoundedness will hold, as

$$\bigl(g(0, \varepsilon_i), g(1, \varepsilon_i)\bigr) \perp c_i.$$

Note that under the same assumptions one cannot necessarily identify the effect of the input on profits, for (πi(0), πi(1)) are not independent of ci. For a related discussion, in the context of instrumental variables, see Athey and Stern (1998). Heckman, LaLonde, and Smith (2000) discuss alternative models that justify unconfoundedness. In these models individuals do attempt to optimize the same outcome that is the variable of interest to the evaluator. They show that selection-on-observables assumptions can be justified by imposing restrictions on the way individuals form their expectations about the unknown potential outcomes. In general, therefore, a researcher may wish to consider, either as a final analysis or as part of a larger investigation, estimates based on the unconfoundedness assumption.
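The firm-adoption example can be simulated. For simplicity I take g(w, ε) = θ·w + ε with θ constant (a special case introduced here purely for illustration), so firms adopt exactly when θ ≥ ci; since ci is drawn independently of εi, treatment status carries no information about potential outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
theta = 1.0                    # constant productivity gain g(1, e) - g(0, e)

eps = rng.normal(size=n)       # production shocks
c = rng.uniform(0.5, 1.5, n)   # unobserved marginal costs, independent of eps

# Firms adopt when the expected gain exceeds the cost: W = 1{theta >= c}.
w = (theta >= c).astype(int)

# Potential outputs: g(0, eps) = eps and g(1, eps) = theta + eps.
y0, y1 = eps, theta + eps

# Because W depends only on c, and c is independent of eps,
# (Y(0), Y(1)) is independent of W: unconfoundedness holds, and the
# mean of Y(0) is the same among adopters and non-adopters up to noise.
diff = y0[w == 1].mean() - y0[w == 0].mean()
print(abs(diff) < 0.1)  # True (difference is sampling noise only)
```

By contrast, profits πi = Yi − ci·Wi depend on ci directly, so the analogous comparison for profits would not be clean, matching the caveat in the text.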

Given the two key assumptions, unconfoundedness and overlap, one can identify the average treatment effects. The key insight is that given unconfoundedness, the following equalities hold:

$$\mu_w(x) = \mathbb{E}[Y(w) \mid X = x] = \mathbb{E}[Y(w) \mid W = w, X = x] = \mathbb{E}[Y \mid W = w, X = x],$$

and thus μw(x) is identified. Thus one can estimate the average treatment effect by first estimating the average treatment effect for a subpopulation with covariates X = x:

$$\begin{aligned} \tau(x) &= \mathbb{E}[Y(1) - Y(0) \mid X = x] = \mathbb{E}[Y(1) \mid X = x] - \mathbb{E}[Y(0) \mid X = x] \\ &= \mathbb{E}[Y(1) \mid X = x, W = 1] - \mathbb{E}[Y(0) \mid X = x, W = 0] \\ &= \mathbb{E}[Y \mid X = x, W = 1] - \mathbb{E}[Y \mid X = x, W = 0]; \end{aligned}$$

followed by averaging over the appropriate distribution of x. To make this feasible, one needs to be able to estimate the expectations E[Y | X = x, W = w] for all values of w and x in the support of these variables. This is where the second assumption enters. If the overlap assumption is violated at X = x, it would be infeasible to estimate both E[Y | X = x, W = 1] and E[Y | X = x, W = 0], because at those values of x there would be either only treated or only control units.
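With a discrete covariate, this identification argument translates directly into an estimator: compute the treated-control difference in mean outcomes within each covariate cell, then average over the empirical distribution of x. A sketch with hypothetical data in which both cells contain treated and control units:

```python
import numpy as np

# Illustrative data with a binary covariate and overlap in both cells.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y = np.array([1.0, 3.0, 4.0, 6.0, 2.0, 4.0, 7.0, 9.0])

# tau(x) = E[Y | X = x, W = 1] - E[Y | X = x, W = 0], then average
# over the empirical distribution of X.
tau_hat = 0.0
for x_val in np.unique(x):
    cell = x == x_val
    tau_x = y[cell & (w == 1)].mean() - y[cell & (w == 0)].mean()
    tau_hat += cell.mean() * tau_x   # weight by Pr(X = x) in the sample
print(tau_hat)  # 4.0
```

This cell-averaging scheme is the blocking idea in miniature; the estimators reviewed in section III differ mainly in how they form the analogue of these cells when x is continuous.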

Some researchers use weaker versions of the unconfoundedness assumption (for example, Heckman, Ichimura, and Todd, 1998). If the interest is in the PATE, it is sufficient to assume that

ASSUMPTION 2.3 (MEAN INDEPENDENCE):

E[Y(w) \mid W, X] = E[Y(w) \mid X],

for w = 0, 1.

Although this assumption is unquestionably weaker, in practice it is rare that a convincing case is made for the weaker assumption 2.3 without the case being equally strong for the stronger version 2.1. The reason is that the weaker assumption is intrinsically tied to functional-form assumptions, and as a result one cannot identify average effects on transformations of the original outcome (such as logarithms) without the stronger assumption.

One can weaken the unconfoundedness assumption in a different direction if one is only interested in the average effect for the treated (see, for example, Heckman, Ichimura, & Todd, 1997). In that case one need only assume

ASSUMPTION 2.4 (UNCONFOUNDEDNESS FOR CONTROLS):

Y(0) \perp W \mid X,

and the weaker overlap assumption

ASSUMPTION 2.5 (WEAK OVERLAP):

\Pr(W = 1 \mid X) < 1.

These two assumptions are sufficient for identification of the PATT and SATT, because the moments of the distribution of Y(1) for the treated are directly estimable.

An important result building on the unconfoundedness assumption shows that one need not condition simultaneously on all covariates.

[Footnote 3: If we are interested in the average effect for firms that did adopt the new technology (PATT), the following assumptions can be weakened slightly.]

THE REVIEW OF ECONOMICS AND STATISTICS 8

The following result shows that all biases due to observable covariates can be removed by conditioning solely on the propensity score:

Lemma 2.1 (Unconfoundedness Given the Propensity Score; Rosenbaum and Rubin, 1983a): Suppose that Assumption 2.1 holds. Then

(Y(0), Y(1)) \perp W \mid e(X).

Proof: We will show that Pr(W = 1 | Y(0), Y(1), e(X)) = Pr(W = 1 | e(X)) = e(X), implying independence of (Y(0), Y(1)) and W conditional on e(X). First, note that

\Pr(W = 1 \mid Y(0), Y(1), e(X)) = E[W \mid Y(0), Y(1), e(X)]
  = E\bigl[ E[W \mid Y(0), Y(1), e(X), X] \mid Y(0), Y(1), e(X) \bigr]
  = E\bigl[ E[W \mid Y(0), Y(1), X] \mid Y(0), Y(1), e(X) \bigr]
  = E\bigl[ E[W \mid X] \mid Y(0), Y(1), e(X) \bigr]
  = E[ e(X) \mid Y(0), Y(1), e(X) ] = e(X),

where the last equality follows from unconfoundedness. The same argument shows that

\Pr(W = 1 \mid e(X)) = E[W \mid e(X)] = E\bigl[ E[W \mid X] \mid e(X) \bigr] = E[ e(X) \mid e(X) ] = e(X). ∎

Extensions of this result to the multivalued treatment case are given in Imbens (2000) and Lechner (2001). To provide intuition for Rosenbaum and Rubin's result, recall the textbook formula for omitted-variable bias in the linear regression model. Suppose we have a regression model with two regressors:

Y_i = \beta_0 + \tau \cdot W_i + \beta_2' X_i + \varepsilon_i.

The bias from omitting X from the regression on the coefficient on W is equal to β₂′δ, where δ is the vector of coefficients on W in regressions of the elements of X on W. By conditioning on the propensity score we remove the correlation between X and W, because X ⊥ W | e(X). Hence omitting X no longer leads to any bias (although it may still lead to some efficiency loss).
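The omitted-variable-bias formula can be checked numerically. The sketch below uses hypothetical data and helper names (my construction): it builds Y exactly as β₀ + τ·Wᵢ + β₂Xᵢ and verifies that the short regression of Y on W delivers τ + β₂δ.

```python
# A hypothetical numeric check (my data and helper names) of the
# omitted-variable-bias formula: with binary W, the OLS slope in a
# regression on W alone is the difference in group means, and the
# short regression of Y on W recovers tau + beta2 * delta.
w = [0, 0, 1, 1]
x = [0.0, 2.0, 3.0, 5.0]       # mean X is 1 for controls, 4 for treated
tau, beta2 = 2.0, 0.5
y = [1.0 + tau * wi + beta2 * xi for wi, xi in zip(w, x)]  # no noise

def mean(v):
    return sum(v) / len(v)

def coef_on_w(dep):
    # OLS coefficient on a binary regressor = difference in group means
    return mean([d for d, wi in zip(dep, w) if wi == 1]) - \
           mean([d for d, wi in zip(dep, w) if wi == 0])

delta = coef_on_w(x)            # coefficient on W in regression of X on W
print(coef_on_w(y))             # 3.5 = tau + beta2 * delta
```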

D. Distributional and Quantile Treatment Effects

Most of the literature has focused on estimating ATEs. There are, however, many cases where one may wish to estimate other features of the joint distribution of outcomes. Lehmann (1974) and Doksum (1974) introduce quantile treatment effects as the difference in quantiles between the two marginal treated and control outcome distributions.4 Bitler, Gelbach, and Hoynes (2002) estimate these in a randomized evaluation of a social program. In instrumental variables settings Abadie, Angrist, and Imbens (2002) and Chernozhukov and Hansen (2001) investigate estimation of differences in quantiles of the two marginal potential-outcome distributions, either for the entire population or for subpopulations.

Assumptions 2.1 and 2.2 also allow for identification of the full marginal distributions of Y(0) and Y(1). To see this, first note that we can identify not just the average treatment effect τ(x), but also the averages of the two potential outcomes, μ₀(x) and μ₁(x). Second, by these assumptions we can similarly identify the averages of any function of the basic outcomes, E[g(Y(0))] and E[g(Y(1))]. Hence we can identify the average values of the indicators 1{Y(0) ≤ y} and 1{Y(1) ≤ y}, and thus the distribution functions of the potential outcomes at y. Given identification of the two distribution functions, it is clear that one can also identify quantiles of the two potential-outcome distributions. Firpo (2002) develops an estimator for such quantiles under unconfoundedness.
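As a sketch of this identification argument (my construction and notation, not the paper's estimator; Firpo (2002) develops the formal approach), the distribution function of Y(w) can be estimated by averaging the within-cell conditional frequencies over the covariate distribution, mirroring the identification of μ_w(x):

```python
# Sketch: F_{Y(w)}(yval) = sum_x Pr(X=x) * Pr(Y <= yval | W=w, X=x),
# estimated by sample frequencies with a discrete covariate.
from collections import Counter

def pot_outcome_cdf(y, w_obs, x, w, yval):
    n = len(x)
    total = 0.0
    for xv, cnt in Counter(x).items():
        cell = [yi for yi, wi, xi in zip(y, w_obs, x) if wi == w and xi == xv]
        total += (cnt / n) * sum(yi <= yval for yi in cell) / len(cell)
    return total

y = [1, 2, 3, 4, 5, 6, 7, 8]
w_obs = [0, 1, 0, 1, 0, 1, 0, 1]
x = [0, 0, 0, 0, 1, 1, 1, 1]
print(pot_outcome_cdf(y, w_obs, x, 1, 4))  # F_{Y(1)}(4) = 0.5 here
```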

E. Efficiency Bounds and Asymptotic Variances for Population-Average Treatment Effects

Next I review some results on the efficiency bound for estimators of the ATEs τ^P and τ_T^P. This requires both the assumptions of unconfoundedness and overlap (Assumptions 2.1 and 2.2) and some smoothness assumptions on the conditional expectations of potential outcomes and the treatment indicator (for details, see Hahn, 1998). Formally, Hahn (1998) shows that for any regular estimator for τ^P, denoted by τ̂, with

\sqrt{N} \cdot (\hat\tau - \tau^P) \xrightarrow{d} N(0, V),

it must be that

V \ge E\left[ \frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} + (\tau(X) - \tau^P)^2 \right].

Knowledge of the propensity score does not affect this efficiency bound.

Hahn also shows that asymptotically linear estimators exist with such variance, and hence such efficient estimators can be approximated as

\hat\tau - \tau^P = \frac{1}{N} \sum_{i=1}^{N} \psi(Y_i, W_i, X_i, \tau^P) + o_p(N^{-1/2}),

where ψ(·) is the efficient score:

\psi(y, w, x, \tau^P) = \left( \frac{wy}{e(x)} - \frac{(1-w)y}{1-e(x)} \right) - \tau^P - \left( \frac{\mu_1(x)}{e(x)} + \frac{\mu_0(x)}{1-e(x)} \right) (w - e(x)).   (1)

[Footnote 4: In contrast, Heckman, Smith, and Clemens (1997) focus on estimation of bounds on the joint distribution of (Y(0), Y(1)). One cannot without strong untestable assumptions identify the full joint distribution, since one can never observe both potential outcomes simultaneously, but one can nevertheless derive bounds on functions of the two distributions.]

AVERAGE TREATMENT EFFECTS 9
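To make the role of the efficient score concrete, here is a minimal sketch (my construction; the plug-in values for e(x) and μ_w(x) are hypothetical, where in practice they would be estimated): setting the sample average of ψ to zero and solving for τ yields a weighting term plus an augmentation term.

```python
# A minimal sketch (my construction) of an estimator built on the
# efficient score (1): setting the sample average of
# psi(Y_i, W_i, X_i, tau) to zero and solving for tau gives the
# weighting-plus-augmentation form below. The propensity score e and
# regression values mu1, mu0 are taken as given plug-ins here.
def score_based_ate(y, w, e, mu1, mu0):
    total = 0.0
    for yi, wi, ei, m1, m0 in zip(y, w, e, mu1, mu0):
        ipw = wi * yi / ei - (1 - wi) * yi / (1 - ei)    # weighting term
        aug = (m1 / ei + m0 / (1 - ei)) * (wi - ei)      # augmentation term
        total += ipw - aug
    return total / len(y)

# toy data with e(x) = 1/2 and the true regression values plugged in
y   = [1, 2, 3, 4, 5, 6, 7, 8]
w   = [0, 1, 0, 1, 0, 1, 0, 1]
e   = [0.5] * 8
mu1 = [3, 3, 3, 3, 7, 7, 7, 7]
mu0 = [2, 2, 2, 2, 6, 6, 6, 6]
print(score_based_ate(y, w, e, mu1, mu0))  # 1.0
```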

Hahn (1998) also reports the efficiency bound for τ_T^P, both with and without knowledge of the propensity score. For τ_T^P the efficiency bound given knowledge of e(X) is

E\left[ \frac{e(X)\,\mathrm{Var}(Y(1) \mid X)}{E[e(X)]^2} + \frac{e(X)^2\,\mathrm{Var}(Y(0) \mid X)}{E[e(X)]^2 (1 - e(X))} + (\tau(X) - \tau_T^P)^2 \frac{e(X)^2}{E[e(X)]^2} \right].

If the propensity score is not known, then, unlike the bound for τ^P, the efficiency bound for τ_T^P is affected. For τ_T^P the bound without knowledge of the propensity score is

E\left[ \frac{e(X)\,\mathrm{Var}(Y(1) \mid X)}{E[e(X)]^2} + \frac{e(X)^2\,\mathrm{Var}(Y(0) \mid X)}{E[e(X)]^2 (1 - e(X))} + (\tau(X) - \tau_T^P)^2 \frac{e(X)}{E[e(X)]^2} \right],

which is higher by

E\left[ (\tau(X) - \tau_T^P)^2 \cdot \frac{e(X)(1 - e(X))}{E[e(X)]^2} \right].

The intuition that knowledge of the propensity score affects the efficiency bound for the average effect for the treated (PATT), but not for the overall average effect (PATE), goes as follows. Both are weighted averages of the treatment effect conditional on the covariates, τ(x). For the PATE the weight is proportional to the density of the covariates, whereas for the PATT the weight is proportional to the product of the density of the covariates and the propensity score (see, for example, Hirano, Imbens, and Ridder, 2003). Knowledge of the propensity score implies one does not need to estimate the weight function and thus improves precision.

F. Efficiency Bounds and Asymptotic Variances for Conditional and Sample Average Treatment Effects

Consider the leading term of the efficient estimator for the PATE, τ̂ = τ^P + ψ̄, where ψ̄ = (1/N) Σᵢ ψ(Yᵢ, Wᵢ, Xᵢ, τ^P), and let us view this as an estimator for the SATE, instead of as an estimator for the PATE. I will show that, first, this estimator is unbiased, conditional on the covariates and the potential outcomes, and second, it has lower variance as an estimator of the SATE than as an estimator of the PATE. To see that the estimator is unbiased, note that with the efficient score ψ(y, w, x, τ) given in equation (1),

E[\psi(Y, W, X, \tau^P) \mid Y(0), Y(1), X] = Y(1) - Y(0) - \tau^P,

and thus

E\bigl[ \hat\tau \mid \{Y_i(0), Y_i(1), X_i\}_{i=1}^{N} \bigr] = \tau^P + \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) - \tau^P \bigr) = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr).

Hence

E\bigl[ \hat\tau - \tau^S \mid \{Y_i(0), Y_i(1), X_i\}_{i=1}^{N} \bigr] = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr) - \tau^S = 0.

Next, consider the normalized variance:

V^S = N \cdot E[(\hat\tau - \tau^S)^2] = N \cdot E[(\bar\psi + \tau^P - \tau^S)^2].

Note that the variance of τ̂ as an estimator of τ^P can be expressed, using the fact that ψ is the efficient score, as

N \cdot E[(\hat\tau - \tau^P)^2] = N \cdot E[\bar\psi^2] = N \cdot E\bigl[ \bigl( (\bar\psi + \tau^P - \tau^S) + (\tau^S - \tau^P) \bigr)^2 \bigr].

Because

E\bigl[ (\bar\psi + \tau^P - \tau^S) \cdot (\tau^S - \tau^P) \bigr] = 0

[as follows by using iterated expectations, first conditioning on X, Y(0), and Y(1)], it follows that

N \cdot E[(\hat\tau - \tau^P)^2] = N \cdot E[(\hat\tau - \tau^S)^2] + N \cdot E[(\tau^S - \tau^P)^2]
  = N \cdot E[(\hat\tau - \tau^S)^2] + E[(Y(1) - Y(0) - \tau^P)^2].

Thus, the same statistic that as an estimator of the population average treatment effect τ^P has a normalized variance equal to V^P, as an estimator of τ^S has the property

\sqrt{N} (\hat\tau - \tau^S) \xrightarrow{d} N(0, V^S),

with

V^S = V^P - E[(Y(1) - Y(0) - \tau^P)^2].

As an estimator of τ^S, the variance of τ̂ is lower than its variance as an estimator of τ^P, with the difference equal to the variance of the treatment effect.

The same line of reasoning can be used to show that

\sqrt{N} (\hat\tau - \tau(\mathbf{X})) \xrightarrow{d} N(0, V^{\tau(\mathbf{X})}),

with

V^{\tau(\mathbf{X})} = V^P - E[(\tau(X) - \tau^P)^2],

and

V^S = V^{\tau(\mathbf{X})} - E[(Y(1) - Y(0) - \tau(X))^2].

An example to illustrate these points may be helpful. Suppose that X ∈ {0, 1}, with Pr(X = 1) = p_x and Pr(W = 1 | X) = 1/2. Suppose that τ(x) = 2x − 1, and σ²_w(x) is very small for all x and w. In that case the average treatment effect is p_x · 1 + (1 − p_x) · (−1) = 2p_x − 1. The efficient estimator in this case, assuming only unconfoundedness, requires separately estimating τ(x) for x = 0 and 1, and averaging these two by the empirical distribution of X. The variance of √N(τ̂ − τ^S) will be small because σ²_w(x) is small, and according to the expressions above, the variance of √N(τ̂ − τ^P) will be larger by 4p_x(1 − p_x). If p_x differs from 1/2, so that the PATE differs from 0, the confidence interval for the PATE in small samples will tend to include zero. In contrast, with σ²_w(x) small enough and N odd [and both N₀ and N₁ at least equal to 2, so that one can estimate σ²_w(x)], the standard confidence interval for τ^S will exclude 0 with probability 1. The intuition is that τ^P is much more uncertain because it depends on the distribution of the covariates, whereas the uncertainty about τ^S depends only on the conditional outcome variances and the propensity score.
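The variance gap in this example can be illustrated with a toy Monte Carlo (my construction, not from the paper; the "estimator" below is an infeasible oracle that sees each unit's effect up to tiny noise, which is enough to exhibit the gap):

```python
# Toy Monte Carlo (my construction) for the binary-X example above.
# As an estimator of the PATE the oracle's normalized variance is larger
# by roughly Var(tau(X)) = 4 * px * (1 - px); as an estimator of the
# SATE it is nearly exact.
import random

random.seed(0)
px, n, reps = 0.8, 200, 2000
tau_p = 2 * px - 1                            # PATE = 2*px - 1
err_p, err_s = [], []
for _ in range(reps):
    xs = [1 if random.random() < px else 0 for _ in range(n)]
    # tau(x) = 2x - 1 plus tiny noise standing in for small sigma^2_w(x)
    tau_i = [2 * xi - 1 + random.gauss(0.0, 0.01) for xi in xs]
    tau_hat = sum(tau_i) / n
    tau_s = sum(2 * xi - 1 for xi in xs) / n  # SATE for this sample
    err_p.append((tau_hat - tau_p) ** 2)
    err_s.append((tau_hat - tau_s) ** 2)

mse_p = sum(err_p) / reps
mse_s = sum(err_s) / reps
# n * mse_p is close to 4*px*(1-px) = 0.64; n * mse_s is close to 0
```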

The difference in asymptotic variances raises the issue of how to estimate the variance of the sample average treatment effect. Specific estimators for the variance will be discussed in section IV, but here I will introduce some general issues surrounding their estimation. Because the two potential outcomes for the same unit are never observed simultaneously, one cannot directly infer the variance of the treatment effect. This is the same issue as the nonidentification of the correlation coefficient. One can, however, estimate a lower bound on the variance of the treatment effect, leading to an upper bound on the variance of the estimator of the SATE, which is equal to V^{τ(𝐗)}. Decomposing the variance as

E[(Y(1) - Y(0) - \tau^P)^2] = E\bigl[ \bigl( E[Y(1) - Y(0) - \tau^P \mid X] \bigr)^2 \bigr] + E\bigl[ \mathrm{Var}(Y(1) - Y(0) \mid X) \bigr]
  = E[(\tau(X) - \tau^P)^2] + E\bigl[ \sigma_1^2(X) + \sigma_0^2(X) - 2\rho(X)\sigma_0(X)\sigma_1(X) \bigr],

with ρ(X) the conditional correlation of Y(0) and Y(1), we can consistently estimate the first term, but can generally say little about the second other than that it is nonnegative. One can therefore bound the variance of τ^S from above by

E[\psi(Y, W, X, \tau^P)^2] - E[(Y(1) - Y(0) - \tau^P)^2]
  \le E[\psi(Y, W, X, \tau^P)^2] - E[(\tau(X) - \tau^P)^2]
  = E\left[ \frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} \right] = V^{\tau(\mathbf{X})},

and use this upper-bound variance estimate to construct confidence intervals that are guaranteed to be conservative. Note the connection with Neyman's (1923) discussion of conservative confidence intervals for average treatment effects in experimental settings. It should be noted that the difference between these variances is of the same order as the variance itself, and therefore not a small-sample problem. Only when the treatment effect is known to be constant can it be ignored. Depending on the correlation between the outcomes and the covariates, this may change the standard errors considerably. It should also be noted that bootstrapping methods in general lead to estimation of E[(τ̂ − τ^P)²] rather than E[(τ̂ − τ(𝐗))²], and the former is generally too large.

III. Estimating Average Treatment Effects

There have been a number of estimators proposed for estimating the PATE and PATT, all of which are also appropriate estimators of the sample versions (SATE and SATT) and the conditional average versions (CATE and CATT). (The implications of focusing on the SATE or CATE rather than the PATE arise only when estimating the variance, and so I will return to this distinction in section IV. In the current section all discussion applies equally to all estimands.) Here I review some of these estimators, organized into five groups.

The first set, referred to as regression estimators, consists of methods that rely on consistent estimation of the two conditional regression functions, μ₀(x) and μ₁(x). These estimators differ in the way they estimate these functions, but all rely on estimators that are consistent for the regression functions.

The second set, matching estimators, compares outcomes across pairs of matched treated and control units, with each unit matched to a fixed number of observations with the opposite treatment. The bias of these within-pair estimates of the average treatment effect disappears as the sample size increases, although their variance does not go to zero, because the number of matches remains fixed.

The third set of estimators is characterized by a central role for the propensity score. Four leading approaches in this set are weighting by the reciprocal of the propensity score, blocking on the propensity score, regression on the propensity score, and matching on the propensity score.

The fourth category consists of estimators that rely on a combination of these methods, typically combining regression with one of its alternatives. The motivation for these combinations is that although in principle any one of these methods can remove all of the bias associated with the covariates, combining two may lead to more robust inference. For example, matching leads to consistent estimators for average treatment effects under weak conditions, so matching and regression can combine some of the desirable variance properties of regression with the consistency of matching. Similarly, a combination of weighting and regression, using parametric models for both the propensity score and the regression functions, can lead to an estimator that is consistent even if only one of the models is correctly specified ("doubly robust" in the terminology of Robins & Ritov, 1997).

Finally, in the fifth group I will discuss Bayesian approaches to inference for average treatment effects.

Only some of the estimators discussed below achieve the semiparametric efficiency bound, yet this does not mean that those should necessarily be preferred in practice—that is, in finite samples. More generally, the debate concerning the practical advantages of the various estimators, and the settings in which some are more attractive than others, is still ongoing, with as yet no firm conclusions. Although all estimators, either implicitly or explicitly, estimate the two unknown regression functions or the propensity score, they do so in very different ways. Differences in smoothness of the regression function or the propensity score, or relative discreteness of the covariates in specific applications, may affect the relative attractiveness of the estimators.

In addition, even the appropriateness of the standard asymptotic distributions as a guide towards finite-sample performance is still debated (see, for example, Robins & Ritov, 1997, and Angrist & Hahn, 2004). A key feature that casts doubt on the relevance of the asymptotic distributions is that the √N consistency is obtained by averaging a nonparametric estimator of a regression function, which itself has a slow nonparametric convergence rate, over the empirical distribution of its argument. The dimension of this argument affects the rate of convergence for the unknown function [the regression functions μ_w(x) or the propensity score e(x)], but not the rate of convergence for the estimator of the parameter of interest, the average treatment effect. In practice, however, the resulting approximations of the ATE can be poor if the argument is of high dimension, in which case information about the propensity score is of particular relevance. Although Hahn (1998) showed, as discussed above, that for the standard asymptotic distributions knowledge of the propensity score is irrelevant (and conditioning only on the propensity score is in fact less efficient than conditioning on all covariates), conditioning on the propensity score involves only one-dimensional nonparametric regression, suggesting that the asymptotic approximations may be more accurate. In practice, knowledge of the propensity score may therefore be very informative.

Another issue that is important in judging the various estimators is how well they perform when there is only limited overlap in the covariate distributions of the two treatment groups. If there are regions in the covariate space with little overlap (propensity score close to 0 or 1), ATE estimators should have relatively high variance. However, this is not always the case for estimators based on tightly parametrized models for the regression functions, where outliers in covariate values can lead to spurious precision for regression parameters. Regions of small overlap can also be difficult to detect directly in high-dimensional covariate spaces, as they can be masked for any single variable.

A. Regression

The first class of estimators relies on consistent estimation of μ_w(x) for w = 0, 1. Given estimates μ̂_w(x) of these regression functions, the PATE, SATE, and CATE are estimated by averaging their difference over the empirical distribution of the covariates:

\hat\tau_{\mathrm{reg}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( \hat\mu_1(X_i) - \hat\mu_0(X_i) \bigr).   (2)

In most implementations the average of the predicted treated outcomes for the treated is equal to the average observed outcome for the treated [so that Σᵢ Wᵢ·μ̂₁(Xᵢ) = Σᵢ Wᵢ·Yᵢ], and similarly for the controls, implying that τ̂_reg can also be written as

\frac{1}{N} \sum_{i=1}^{N} \Bigl[ W_i \cdot \bigl( Y_i - \hat\mu_0(X_i) \bigr) + (1 - W_i) \cdot \bigl( \hat\mu_1(X_i) - Y_i \bigr) \Bigr].
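A sketch of the regression-imputation estimator in equation (2) with separate linear fits for the two arms (one covariate, closed-form OLS; the data and helper names are hypothetical, my construction):

```python
# Sketch of equation (2): fit mu_0 and mu_1 by simple OLS on each arm,
# impute both potential outcomes for every unit, and average the
# difference over the empirical distribution of X.
def ols(xs, ys):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) \
        / sum((xi - mx) ** 2 for xi in xs)
    return my - b * mx, b

def tau_reg(y, w, x):
    a0, b0 = ols([xi for xi, wi in zip(x, w) if wi == 0],
                 [yi for yi, wi in zip(y, w) if wi == 0])
    a1, b1 = ols([xi for xi, wi in zip(x, w) if wi == 1],
                 [yi for yi, wi in zip(y, w) if wi == 1])
    return sum((a1 + b1 * xi) - (a0 + b0 * xi) for xi in x) / len(y)

# toy data generated from y = 1 + x + 2w, so the true effect is 2
x = [0.0, 1.0, 2.0, 3.0]
w = [0, 0, 1, 1]
y = [1.0, 2.0, 5.0, 6.0]
print(tau_reg(y, w, x))  # 2.0
```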

For the PATT and SATT typically only the control regression function is estimated; one need only predict the outcome under the control treatment for the treated units. The estimator then averages the difference between the actual outcomes for the treated and their estimated outcomes under the control:

\hat\tau_{\mathrm{reg},T} = \frac{1}{N_T} \sum_{i=1}^{N} W_i \cdot \bigl( Y_i - \hat\mu_0(X_i) \bigr).   (3)

Early estimators for μ_w(x) included parametric regression functions—for example, linear regression (as in Rubin, 1977). Such parametric alternatives include least squares estimators with the regression function specified as

\mu_w(x) = \beta' x + \tau \cdot w,

in which case the average treatment effect is equal to τ. In this case one can estimate τ directly by least squares estimation using the regression function

Y_i = \alpha + \beta' X_i + \tau \cdot W_i + \varepsilon_i.

More generally, one can specify separate regression functions for the two regimes:

\mu_w(x) = \beta_w' x.

In that case one can estimate the two regression functions separately on the two subsamples and then substitute the predicted values in equation (2). These simple regression estimators may be very sensitive to differences in the covariate distributions for treated and control units. The reason is that in that case the regression estimators rely heavily on extrapolation. To see this, note that the regression function for the controls, μ̂₀(x), is used to predict missing outcomes for the treated. Hence on average one wishes to predict the control outcome at X̄_T, the average covariate value for the treated. With a linear regression function, the average prediction can be written as Ȳ_C + β̂′(X̄_T − X̄_C). With X̄_T very close to the average covariate value for the controls, X̄_C, the precise specification of the regression function will not matter very much for the average prediction. However, with the two averages very different, the prediction based on a linear regression function can be very sensitive to changes in the specification.

More recently, nonparametric estimators have been proposed. Hahn (1998) recommends first estimating nonparametrically, using series methods, the three conditional expectations g₁(x) = E[WY | X = x], g₀(x) = E[(1 − W)Y | X = x], and e(x) = E[W | X = x]. He then estimates μ_w(x) as

\hat\mu_1(x) = \frac{\hat g_1(x)}{\hat e(x)}, \qquad \hat\mu_0(x) = \frac{\hat g_0(x)}{1 - \hat e(x)},

and shows that the estimators for both the PATE and PATT achieve the semiparametric efficiency bounds discussed in section IIE (the latter even when the propensity score is unknown).

Using this series approach, however, it is unnecessary to estimate all three of these conditional expectations (E[YW|X], E[Y(1 − W)|X], and E[W|X]) to estimate μ_w(x). Instead one can use series methods to estimate the two regression functions μ_w(x) directly, eliminating the need to estimate the propensity score (Imbens, Newey, and Ridder, 2003).

Heckman, Ichimura, and Todd (1997, 1998) and Heckman, Ichimura, Smith, and Todd (1998) consider kernel methods for estimating μ_w(x), in particular focusing on local linear approaches. The simple kernel estimator has the form

\hat\mu_w(x) = \sum_{i: W_i = w} Y_i \cdot K\!\left( \frac{X_i - x}{h} \right) \Big/ \sum_{i: W_i = w} K\!\left( \frac{X_i - x}{h} \right),

with a kernel K(·) and bandwidth h. In the local linear kernel regression the regression function μ_w(x) is estimated as the intercept β₀ in the minimization problem

\min_{\beta_0, \beta_1} \sum_{i: W_i = w} \bigl( Y_i - \beta_0 - \beta_1'(X_i - x) \bigr)^2 \cdot K\!\left( \frac{X_i - x}{h} \right).
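The simple kernel estimator above can be sketched as follows (my construction, with a Gaussian kernel standing in for a generic K):

```python
# Sketch of the simple (Nadaraya-Watson) kernel estimator of mu_w at a
# point x0, computed from one treatment arm's observations (xs, ys).
import math

def kernel_reg(x0, xs, ys, h):
    ks = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in xs]  # K((X_i - x)/h)
    return sum(k * yi for k, yi in zip(ks, ys)) / sum(ks)

# at a point equidistant from two observations the weights are equal,
# so the estimate is the simple average of their outcomes
print(kernel_reg(0.0, [-1.0, 1.0], [0.0, 2.0], h=1.0))  # 1.0
```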

In order to control the bias of their estimators, Heckman, Ichimura, and Todd (1998) require that the order of the kernel be at least as large as the dimension of the covariates. That is, they require the use of a kernel function K(z) such that ∫_z z^r K(z) dz = 0 for r ≤ dim(X), so that the kernel must be negative on part of its range, and the implicit averaging involves negative weights. We shall see this role of the dimension of the covariates again for other estimators.

For the average treatment effect for the treated (PATT), it is important to note that with the propensity score known, the estimator given in equation (3) is generally not efficient, irrespective of the estimator for μ₀(x). Intuitively, this is because with the propensity score known, the average Σᵢ WᵢYᵢ/N_T is not efficient for the population expectation E[Y(1)|W = 1]. An efficient estimator (as in Hahn, 1998) can be obtained by weighting all the estimated treatment effects, μ̂₁(Xᵢ) − μ̂₀(Xᵢ), by the probability of receiving the treatment:

\hat\tau_{\mathrm{reg},T} = \sum_{i=1}^{N} \hat e(X_i) \cdot \bigl( \hat\mu_1(X_i) - \hat\mu_0(X_i) \bigr) \Big/ \sum_{i=1}^{N} \hat e(X_i).   (4)

In other words, instead of estimating E[Y(1)|W = 1] as Σᵢ WᵢYᵢ/N_T using only the treated observations, it is estimated using all units, as Σᵢ μ̂₁(Xᵢ)·ê(Xᵢ)/Σᵢ ê(Xᵢ). Knowledge of the propensity score improves the accuracy because it allows one to exploit the control observations to adjust for imbalances in the sampling of the covariates.
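A sketch of the weighted form in equation (4); the propensity-score and regression values below are hypothetical plug-ins (my construction), not estimates from real data:

```python
# Sketch of equation (4): weight the estimated unit-level effects
# mu1(X_i) - mu0(X_i) by the propensity score e(X_i).
def tau_treated_weighted(e, mu1, mu0):
    num = sum(ei * (m1 - m0) for ei, m1, m0 in zip(e, mu1, mu0))
    return num / sum(e)

e   = [0.2, 0.8]               # effect 1 where e = 0.2, effect 3 where e = 0.8
mu1 = [2.0, 5.0]
mu0 = [1.0, 2.0]
print(tau_treated_weighted(e, mu1, mu0))  # (0.2*1 + 0.8*3) / 1.0 = 2.6
```

The treated-weighted average 2.6 exceeds the unweighted average effect of 2, reflecting the extra weight the PATT places on high-propensity covariate values.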

For all of the estimators in this section an important issue is the choice of the smoothing parameter. In Hahn's case, after choosing the form of the series and the sequence, the smoothing parameter is the number of terms in the series. In Heckman, Ichimura, and Todd's case it is the bandwidth of the kernel. The evaluation literature has been largely silent concerning the optimal choice of the smoothing parameters, although the larger literature on nonparametric estimation of regression functions does provide some guidance, offering data-driven methods such as cross-validation criteria. The optimality properties of these criteria, however, are for estimation of the entire function, in this case μ_w(x). Typically the focus is on mean-integrated-squared-error criteria of the form ∫_x [μ̂_w(x) − μ_w(x)]² f_X(x) dx, possibly with an additional weight function. In the current problem, however, one is interested specifically in the average treatment effect, and so such criteria are not necessarily optimal. In particular, global smoothing parameters may be inappropriate, because they can be driven by the shape of the regression function and the distribution of covariates in regions that are not important for the average treatment effect of interest. LaLonde's (1986) data set is a well-known example of this, where much of the probability mass of the nonexperimental control group is in a region with moderate to high earnings where few of the treated units are located. There is little evidence on whether results for average treatment effects are more or less sensitive to the choice of smoothing parameter than results for estimation of the regression functions themselves.

B. Matching

As seen above, regression estimators impute the missing potential outcomes using the estimated regression function. Thus, if Wᵢ = 1, then Yᵢ(1) is observed and Yᵢ(0) is missing and imputed with a consistent estimator μ̂₀(Xᵢ) of the conditional expectation. Matching estimators also impute the missing potential outcomes, but do so using only the outcomes of nearest neighbors of the opposite treatment group. In that respect matching is similar to nonparametric kernel regression, with the number of neighbors playing the role of the bandwidth in the kernel regression. A formal difference is that the asymptotic distribution is derived conditional on the implicit bandwidth, that is, the number of neighbors, which is often fixed at one. Using such asymptotics, the implicit estimate μ̂_w(x) is (close to) unbiased, but not consistent for μ_w(x). In contrast, the regression estimators discussed in the previous section rely on the consistency of μ̂_w(x).

Matching estimators have the attractive feature that, given the matching metric, the researcher only has to choose the number of matches. In contrast, for the regression estimators discussed above, the researcher must choose smoothing parameters that are more difficult to interpret: either the number of terms in a series or the bandwidth in kernel regression. Within the class of matching estimators, using only a single match leads to the most credible inference with the least bias, at most sacrificing some precision. This can make the matching estimator easier to use than those estimators that require more complex choices of smoothing parameters, and may explain some of its popularity.

Matching estimators have been widely studied in practice and theory (for example, Gu & Rosenbaum, 1993; Rosenbaum, 1989, 1995, 2002; Rubin, 1973b, 1979; Heckman, Ichimura, & Todd, 1998; Dehejia & Wahba, 1999; Abadie & Imbens, 2002). Most often they have been applied in settings with the following two characteristics: (i) the interest is in the average treatment effect for the treated, and (ii) there is a large reservoir of potential controls. This allows the researcher to match each treated unit to one or more distinct controls (referred to as matching without replacement). Given the matched pairs, the treatment effect within a pair is then estimated as the difference in outcomes, with an estimator for the PATT obtained by averaging these within-pair differences. Since the estimator is essentially the difference between two sample means, the variance is calculated using standard methods for differences in means or methods for paired randomized experiments. The remaining bias is typically ignored in these studies. The literature has studied fast algorithms for matching the units, as fully efficient matching methods are computationally cumbersome (see, for example, Gu and Rosenbaum, 1993; Rosenbaum, 1995). Note that in such matching schemes the order in which the units are matched may be important.

Abadie and Imbens (2002) study both bias and variance in a more general setting where both treated and control units are (potentially) matched and matching is done with replacement (as in Dehejia & Wahba, 1999). The Abadie-Imbens estimator is implemented in Matlab and Stata (see Abadie et al., 2003).5 Formally, given a sample {(Yᵢ, Xᵢ, Wᵢ)}ᵢ₌₁ᴺ, let ℓ_m(i) be the index l that satisfies W_l ≠ Wᵢ and

\sum_{j: W_j \neq W_i} 1\{ \|X_j - X_i\| \le \|X_l - X_i\| \} = m,

where 1{·} is the indicator function, equal to 1 if the expression in brackets is true and 0 otherwise. In other words, ℓ_m(i) is the index of the unit in the opposite treatment group that is the mth closest to unit i in terms of the distance measure based on the norm ‖·‖. In particular, ℓ₁(i) is the nearest match for unit i. Let ℒ_M(i) denote the set of indices for the first M matches for unit i: ℒ_M(i) = {ℓ₁(i), . . . , ℓ_M(i)}. Define the imputed potential outcomes as

\hat Y_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ \frac{1}{M} \sum_{j \in \mathcal{L}_M(i)} Y_j & \text{if } W_i = 1, \end{cases}

and

\hat Y_i(1) = \begin{cases} \frac{1}{M} \sum_{j \in \mathcal{L}_M(i)} Y_j & \text{if } W_i = 0, \\ Y_i & \text{if } W_i = 1. \end{cases}

The simple matching estimator discussed by Abadie and Imbens is then

\hat\tau_M^{\mathrm{sm}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( \hat Y_i(1) - \hat Y_i(0) \bigr).   (5)
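The simple matching estimator (5) can be sketched as follows (my construction: a scalar covariate with absolute-distance matching, with replacement, and M matches per unit):

```python
# Sketch of equation (5): impute each unit's missing potential outcome
# by the mean outcome of its M nearest neighbors (with replacement) in
# the opposite treatment group, then average the differences.
def match_estimator(y, w, x, M=1):
    n = len(y)
    total = 0.0
    for i in range(n):
        pool = sorted((j for j in range(n) if w[j] != w[i]),
                      key=lambda j: abs(x[j] - x[i]))   # nearest first
        imputed = sum(y[j] for j in pool[:M]) / M
        if w[i] == 1:
            total += y[i] - imputed      # observed Y(1), imputed Y(0)
        else:
            total += imputed - y[i]      # imputed Y(1), observed Y(0)
    return total / n

y = [1.0, 2.0, 3.0, 4.0]
w = [0, 0, 1, 1]
x = [0.0, 1.0, 1.1, 2.0]
print(match_estimator(y, w, x, M=1))  # 1.5
```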

They show that the bias of this estimator is O(N^{-1/k}), where k is the dimension of the covariates. Hence, if one studies the asymptotic distribution of the estimator by normalizing by √N [as can be justified by the fact that the variance of the estimator is O(1/N)], the bias does not disappear if the dimension of the covariates is equal to 2, and will dominate the large-sample variance if k is at least 3.

Let me make clear three caveats to Abadie and Imbens's result. First, it is only the continuous covariates that should be counted in this dimension, k. With discrete covariates the matching will be exact in large samples; therefore such covariates do not contribute to the order of the bias. Second, if one matches only the treated, and the number of potential controls is much larger than the number of treated units, one can justify ignoring the bias by appealing to an asymptotic sequence in which the number of potential controls increases faster than the number of treated units. Specifically, if the number of controls, N₀, and the number of treated, N₁, satisfy N₁^{4/k}/N₀ → 0, then the bias disappears in large samples after normalization by √N₁.

[Footnote 5: See Becker and Ichino (2002) and Sianesi (2001) for alternative Stata implementations of estimators for average treatment effects.]

Third, even though the order of the bias may be high, the actual bias may still be small if the coefficients in the leading term are small. This is possible if the biases for different units are at least partially offsetting. For example, the leading term in the bias relies on the regression function being nonlinear and the density of the covariates having a nonzero slope. If one of these two conditions is at least close to being satisfied, the resulting bias may be fairly limited. To remove the bias, Abadie and Imbens suggest combining the matching process with a regression adjustment, as I will discuss in section IIID.

Another point made by Abadie and Imbens is that matching estimators are generally not efficient. Even in the case where the bias is of low enough order to be dominated by the variance, the estimators are not efficient given a fixed number of matches. To reach efficiency one would need to increase the number of matches with the sample size. If M → ∞, with M/N → 0, then the matching estimator is essentially like a regression estimator, with the imputed missing potential outcomes consistent for their conditional expectations. However, the efficiency gain from such estimators is of course somewhat artificial. If in a given data set one uses M matches, one can calculate the variance as if this number of matches increased at the appropriate rate with the sample size, in which case the estimator would be efficient, or one can calculate the variance conditional on the number of matches, in which case the same estimator would be inefficient. Little is yet known about the optimal number of matches, or about data-dependent ways of choosing this number.

In the above discussion the distance metric in choosingthe optimal matches was the standard Euclidean metric:

dE� x, z� � � x � z��� x � z�.

All of the distance metrics used in practice standardize the covariates in some manner. Abadie and Imbens use the diagonal matrix of the inverse of the covariate variances:

$$d_{AI}(x, z) = (x - z)' \, \mathrm{diag}(\Sigma_X)^{-1} (x - z),$$

where $\Sigma_X$ is the covariance matrix of the covariates. The most common choice is the Mahalanobis metric (see, for example, Rosenbaum and Rubin, 1985), which uses the inverse of the covariance matrix of the pretreatment variables:

$$d_M(x, z) = (x - z)' \Sigma_X^{-1} (x - z).$$

This metric has the attractive property that it reduces differences in covariates within matched pairs in all directions.6 See Rubin and Thomas (1992) for more formal discussions.

Zhao (2004), in an interesting discussion of the choice of metrics, suggests some alternatives that depend on the correlation between covariates, treatment assignment, and outcomes. He starts by assuming that the propensity score has a logistic form,

$$e(x) = \frac{\exp(x'\gamma)}{1 + \exp(x'\gamma)},$$

and that the regression functions are linear:

$$\mu_w(x) = \alpha_w + x'\beta.$$

He then considers two alternative metrics. The first weights absolute differences in the covariates by the coefficients in the propensity score:

$$d_{Z1}(x, z) = \sum_{k=1}^{K} |x_k - z_k| \cdot |\gamma_k|,$$

and the second weights them by the coefficients in the regression function:

$$d_{Z2}(x, z) = \sum_{k=1}^{K} |x_k - z_k| \cdot |\beta_k|,$$

where $x_k$ and $z_k$ are the $k$th elements of the $K$-dimensional vectors $x$ and $z$ respectively.
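As a concrete illustration, both of Zhao's metrics are weighted L1 distances and take only a few lines to compute; the coefficient vectors below are invented for the example, not taken from any application.

```python
import numpy as np

def d_z1(x, z, gamma):
    """Zhao's first metric: absolute covariate differences
    weighted by the propensity-score coefficients |gamma_k|."""
    return np.sum(np.abs(x - z) * np.abs(gamma))

def d_z2(x, z, beta):
    """Zhao's second metric: absolute covariate differences
    weighted by the regression coefficients |beta_k|."""
    return np.sum(np.abs(x - z) * np.abs(beta))

# Hypothetical coefficients for a K = 3 covariate vector.
gamma = np.array([0.5, -1.0, 0.1])   # propensity-score coefficients
beta = np.array([2.0, 0.2, 0.0])     # outcome-regression coefficients

x = np.array([1.0, 0.0, 3.0])
z = np.array([0.0, 1.0, 3.0])

dz1 = d_z1(x, z, gamma)  # 1*0.5 + 1*1.0 + 0*0.1 = 1.5
dz2 = d_z2(x, z, beta)   # 1*2.0 + 1*0.2 + 0*0.0 = 2.2
```

In practice $\gamma$ and $\beta$ would of course be replaced by estimates from the propensity-score and outcome regressions.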

In light of this discussion, it is interesting to consider optimality of the metric. Suppose, following Zhao (2004), that the regression functions are linear with coefficients $\beta_w$. Now consider a treated unit with covariate vector $x$ who will be matched to a control unit with covariate vector $z$. The bias resulting from such a match is $(z - x)'\beta_0$. If one is interested in minimizing for each match the squared bias, one should choose the first match by minimizing over the control observations $(z - x)'\beta_0\beta_0'(z - x)$. Yet typically one does not know the value of the regression coefficients, in which case one may wish to minimize the expected squared bias. Using a normal distribution for the regression errors, and a flat prior on $\beta_0$, the posterior distribution for $\beta_0$ is normal with mean $\hat\beta_0$ and variance $\Sigma_X^{-1}\sigma^2/N$. Hence the expected squared bias from a match is

6 However, using the Mahalanobis metric can also have less attractive implications. Consider the case where one matches on two highly correlated covariates, $X_1$ and $X_2$, with equal variances. For specificity, suppose that the correlation coefficient is 0.9 and both variances are 1. Suppose that we wish to match a treated unit $i$ with $X_{i1} = X_{i2} = 0$. The two potential matches are unit $j$ with $X_{j1} = X_{j2} = 5$ and unit $k$ with $X_{k1} = 4$ and $X_{k2} = 0$. The difference in covariates for the first match is the vector $(5, 5)'$, and the difference in covariates for the second match is $(4, 0)'$. Intuitively it may seem that the second match is better: it is strictly closer to the treated unit than the first match for both covariates. Using the Abadie-Imbens metric $\mathrm{diag}(\Sigma_X)^{-1}$, this is in fact true. Under that metric the distance between the second match and the treated unit is 16, considerably smaller than 50, the distance between the first match and the treated unit. Using the Mahalanobis metric, however, the distance for the first match is 26, and the distance for the second match is much higher at 84. Because of the correlation between the covariates in the sample, the difference between the matches is interpreted very differently under the two metrics. To choose between the standard and the Mahalanobis metric one needs to consider what the appropriate match would be in this case.
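The numbers in this footnote are easy to verify; a minimal sketch, assuming the covariance matrix with unit variances and correlation 0.9 described there:

```python
import numpy as np

# Covariance matrix from the example: unit variances, correlation 0.9.
sigma_x = np.array([[1.0, 0.9],
                    [0.9, 1.0]])

def d_ai(x, z, sigma):
    """Abadie-Imbens metric: inverse of the diagonal of the covariance matrix."""
    weight = np.diag(1.0 / np.diag(sigma))
    d = x - z
    return d @ weight @ d

def d_m(x, z, sigma):
    """Mahalanobis metric: full inverse covariance matrix."""
    d = x - z
    return d @ np.linalg.inv(sigma) @ d

treated = np.array([0.0, 0.0])
match_j = np.array([5.0, 5.0])  # unit j
match_k = np.array([4.0, 0.0])  # unit k

# Abadie-Imbens metric prefers unit k (16 versus 50); the Mahalanobis
# metric prefers unit j (about 26 versus about 84).
ai_j, ai_k = d_ai(treated, match_j, sigma_x), d_ai(treated, match_k, sigma_x)
m_j, m_k = d_m(treated, match_j, sigma_x), d_m(treated, match_k, sigma_x)
```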


$$\mathbb{E}\left[(z - x)'\beta_0\beta_0'(z - x)\right] = (z - x)'\left(\hat\beta_0\hat\beta_0' + \sigma^2\Sigma_X^{-1}/N\right)(z - x).$$

In this argument the optimal metric is a combination of the sample covariance matrix plus the outer product of the regression coefficients, with the former scaled down by a factor $1/N$:

$$d^*(z, x) = (z - x)'\left(\hat\beta_w\hat\beta_w' + \hat\sigma_w^2\,\Sigma_{X,w}^{-1}/N\right)(z - x).$$

A clear problem with this approach is that when the regression function is misspecified, matching with this particular metric may not lead to a consistent estimator. On the other hand, when the regression function is correctly specified, it would be more efficient to use the regression estimators than any matching approach. In practice one may want to use a metric that combines some of the optimal weighting with some safeguards in case the regression function is misspecified.

So far there is little experience with any alternative metrics beyond the Mahalanobis metric. Zhao (2004) reports the results of some simulations using his proposed metrics, finding no clear winner given his specific design, although his findings suggest that using the outcomes in defining the metric is a promising approach.

C. Propensity Score Methods

Since the work by Rosenbaum and Rubin (1983a) there has been considerable interest in methods that avoid adjusting directly for all covariates, and instead focus on adjusting for differences in the propensity score, the conditional probability of receiving the treatment. This can be implemented in a number of different ways. One can weight the observations using the propensity score (and indirectly also in terms of the covariates) to create balance between treated and control units in the weighted sample. Hirano, Imbens, and Ridder (2003) show how such estimators can achieve the semiparametric efficiency bound. Alternatively one can divide the sample into subsamples with approximately the same value of the propensity score, a technique known as blocking. Finally, one can directly use the propensity score as a regressor in a regression approach.

In practice there are two important cases. First, suppose the researcher knows the propensity score. In that case all three of these methods are likely to be effective in eliminating bias. Even if the resulting estimator is not fully efficient, one can easily modify it by using a parametric estimate of the propensity score to capture most of the efficiency loss. Furthermore, since these estimators do not rely on high-dimensional nonparametric regression, this suggests that their finite-sample properties are likely to be relatively attractive.

If the propensity score is not known, the advantages of the estimators discussed below are less clear. Although they avoid the high-dimensional nonparametric regression of the two conditional expectations $\mu_w(x)$, they require instead the equally high-dimensional nonparametric regression of the treatment indicator on the covariates. In practice the relative merits of these estimators will depend on whether the propensity score is more or less smooth than the regression functions, and on whether additional information is available about either the propensity score or the regression functions.

Weighting: The first set of propensity-score estimators use the propensity scores as weights to create a balanced sample of treated and control observations. Simply taking the difference in average outcomes for treated and controls,

$$\frac{\sum W_i Y_i}{\sum W_i} - \frac{\sum (1 - W_i) Y_i}{\sum (1 - W_i)},$$

is not unbiased for $\tau^P = \mathbb{E}[Y(1) - Y(0)]$, because, conditional on the treatment indicator, the distributions of the covariates differ. By weighting the units by the reciprocal of the probability of receiving the treatment, one can undo this imbalance. Formally, weighting estimators rely on the equalities

$$\mathbb{E}\left[\frac{WY}{e(X)}\right] = \mathbb{E}\left[\frac{WY(1)}{e(X)}\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{WY(1)}{e(X)} \,\Big|\, X\right]\right] = \mathbb{E}\left[\frac{e(X)\cdot\mathbb{E}[Y(1)\,|\,X]}{e(X)}\right] = \mathbb{E}[Y(1)],$$

using unconfoundedness in the second-to-last equality, and similarly

$$\mathbb{E}\left[\frac{(1 - W)Y}{1 - e(X)}\right] = \mathbb{E}[Y(0)],$$

implying

$$\tau^P = \mathbb{E}\left[\frac{W \cdot Y}{e(X)} - \frac{(1 - W) \cdot Y}{1 - e(X)}\right].$$

With the propensity score known one can directly implement this estimator as

$$\hat\tau = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{W_i Y_i}{e(X_i)} - \frac{(1 - W_i) Y_i}{1 - e(X_i)}\right). \qquad (6)$$

In this particular form this is not necessarily an attractive estimator. The main reason is that, although the estimator can be written as the difference between a weighted average of the outcomes for the treated units and a weighted average of the outcomes for the controls, the weights do not necessarily add to 1. Specifically, in equation (6), the weights for the treated units add up to $[\sum_i W_i/e(X_i)]/N$. In expectation this is equal to 1, but because its variance is positive, in any given sample some of the weights are likely to deviate from 1.

One approach for improving this estimator is simply to normalize the weights to unity. One can further normalize the weights to unity within subpopulations as defined by the covariates. In the limit this leads to an estimator proposed by Hirano, Imbens, and Ridder (2003), who suggest using a nonparametric series estimator for $e(x)$. More precisely, they first specify a sequence of functions of the covariates, such as power series, $h_l(x)$, $l = 1, \ldots, \infty$. Next, they choose a number of terms, $L(N)$, as a function of the sample size, and then estimate the $L$-dimensional vector $\pi_L$ in

$$\Pr(W = 1 \,|\, X = x) = \frac{\exp\left((h_1(x), \ldots, h_L(x))\pi_L\right)}{1 + \exp\left((h_1(x), \ldots, h_L(x))\pi_L\right)},$$

by maximizing the associated likelihood function. Let $\hat\pi_L$ be the maximum likelihood estimate. In the third step, the estimated propensity score is calculated as

$$\hat e(x) = \frac{\exp\left((h_1(x), \ldots, h_L(x))\hat\pi_L\right)}{1 + \exp\left((h_1(x), \ldots, h_L(x))\hat\pi_L\right)}.$$

Finally they estimate the average treatment effect as

$$\hat\tau_{\text{weight}} = \frac{\sum_{i=1}^{N} W_i Y_i / \hat e(X_i)}{\sum_{i=1}^{N} W_i / \hat e(X_i)} - \frac{\sum_{i=1}^{N} (1 - W_i) Y_i / (1 - \hat e(X_i))}{\sum_{i=1}^{N} (1 - W_i) / (1 - \hat e(X_i))}. \qquad (7)$$

Hirano, Imbens, and Ridder show that with a nonparametric estimator for $\hat e(x)$ this estimator is efficient, whereas with the true propensity score the estimator would not be fully efficient (and in fact not very attractive).
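The contrast between the unnormalized estimator in equation (6) and the normalized version in equation (7) is easy to illustrate on simulated data. The sketch below uses an invented data-generating process and plugs the true propensity score into both formulas; in equation (7) one would normally use the estimated score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

x = rng.uniform(size=n)
e = 0.25 + 0.5 * x                    # true propensity score
w = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * w + x + rng.normal(size=n)  # true average treatment effect = 2

# Equation (6): unnormalized (Horvitz-Thompson) weighting estimator.
tau_ht = np.mean(w * y / e - (1 - w) * y / (1 - e))

# Equation (7): weights normalized to sum to one within each arm.
tau_norm = (np.sum(w * y / e) / np.sum(w / e)
            - np.sum((1 - w) * y / (1 - e)) / np.sum((1 - w) / (1 - e)))
```

Both recover the effect in large samples; the normalized version is typically less variable because the weights in each arm are forced to add to one.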

This estimator highlights one of the interesting features of the problem of efficiently estimating average treatment effects. One solution is to estimate the two regression functions $\mu_w(x)$ nonparametrically, as discussed in Section IIIA; that solution completely ignores the propensity score. A second approach is to estimate the propensity score nonparametrically, ignoring entirely the two regression functions. If appropriately implemented, both approaches lead to fully efficient estimators, but clearly their finite-sample properties may be very different, depending, for example, on the smoothness of the regression functions versus the smoothness of the propensity score. If there is only a single binary covariate, or more generally if there are only discrete covariates, the weighting approach with a fully nonparametric estimator for the propensity score is numerically identical to the regression approach with a fully nonparametric estimator for the two regression functions.

To estimate the average treatment effect for the treated rather than for the full population, one should weight the contribution for unit $i$ by the propensity score $e(x_i)$. If the propensity score is known, this leads to

$$\hat\tau_{\text{weight,tr}} = \frac{\sum_{i=1}^{N} W_i Y_i \cdot e(X_i)/\hat e(X_i)}{\sum_{i=1}^{N} W_i \cdot e(X_i)/\hat e(X_i)} - \frac{\sum_{i=1}^{N} (1 - W_i) Y_i \cdot e(X_i)/(1 - \hat e(X_i))}{\sum_{i=1}^{N} (1 - W_i) \cdot e(X_i)/(1 - \hat e(X_i))},$$

where the propensity score enters in some places as the true score (for the weights, to get the appropriate estimand) and in other cases as the estimated score (to achieve efficiency). In the unknown propensity score case one always uses the estimated propensity score, leading to

$$\hat\tau_{\text{weight,tr}} = \frac{1}{N_1}\sum_{i: W_i = 1} Y_i - \left(\sum_{i: W_i = 0} Y_i\,\frac{\hat e(X_i)}{1 - \hat e(X_i)}\right)\Big/\left(\sum_{i: W_i = 0} \frac{\hat e(X_i)}{1 - \hat e(X_i)}\right).$$
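A sketch of this estimator for the effect on the treated, with controls weighted by the odds $e(X)/(1 - e(X))$; the data-generating process is invented, and the true score stands in for the estimated one.

```python
import numpy as np

def att_weighting(y, w, e_hat):
    """Weighting estimator for the average effect on the treated:
    treated units get weight 1, controls get the odds e/(1 - e)."""
    treated_mean = np.sum(w * y) / np.sum(w)
    odds = e_hat / (1.0 - e_hat)
    control_mean = np.sum((1 - w) * y * odds) / np.sum((1 - w) * odds)
    return treated_mean - control_mean

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(size=n)
e = 0.25 + 0.5 * x
w = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * w + x + rng.normal(size=n)  # constant effect, so ATT = ATE = 2

tau_att = att_weighting(y, w, e)
```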

One difficulty with the weighting estimators that are based on the estimated propensity score is again the problem of choosing the smoothing parameters. Hirano, Imbens, and Ridder (2003) use series estimators, which require choosing the number of terms in the series. Ichimura and Linton (2001) consider a kernel version, which involves choosing a bandwidth. Theirs is currently one of the few studies considering optimal choices for smoothing parameters that focuses specifically on estimating average treatment effects. A departure from standard problems in choosing smoothing parameters is that here one wants to use nonparametric regression methods even if the propensity score is known. For example, if the probability of treatment is constant, standard optimality results would suggest using a high degree of smoothing, as this would lead to the most accurate estimator for the propensity score. However, this would not necessarily lead to an efficient estimator for the average treatment effect of interest.

Blocking on the Propensity Score: In their original propensity-score paper Rosenbaum and Rubin (1983a) suggest the following blocking-on-the-propensity-score estimator. Using the (estimated) propensity score, divide the sample into $M$ blocks of units of approximately equal probability of treatment, letting $J_{im}$ be an indicator for unit $i$ being in block $m$. One way of implementing this is by dividing the unit interval into $M$ blocks with boundary values equal to $m/M$ for $m = 1, \ldots, M - 1$, so that

$$J_{im} = 1\left\{\frac{m - 1}{M} < \hat e(X_i) \le \frac{m}{M}\right\},$$


for $m = 1, \ldots, M$. Within each block there are $N_{wm}$ observations with treatment equal to $w$, $N_{wm} = \sum_i 1\{W_i = w, J_{im} = 1\}$. Given these subgroups, estimate within each block the average treatment effect as if random assignment held:

$$\hat\tau_m = \frac{1}{N_{1m}}\sum_{i=1}^{N} J_{im} W_i Y_i - \frac{1}{N_{0m}}\sum_{i=1}^{N} J_{im}(1 - W_i) Y_i.$$

Then estimate the overall average treatment effect as

$$\hat\tau_{\text{block}} = \sum_{m=1}^{M} \hat\tau_m \cdot \frac{N_{1m} + N_{0m}}{N}.$$

If one is interested in the average effect for the treated, one will weight the within-block average treatment effects by the number of treated units:

$$\hat\tau_{T,\text{block}} = \sum_{m=1}^{M} \hat\tau_m \cdot \frac{N_{1m}}{N_T}.$$
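The blocking estimator can be sketched as follows (equal-width blocks on the score; skipping blocks that lack treated or control units is my assumption, since the formulas presume both groups are present in every block):

```python
import numpy as np

def blocking_estimator(y, w, e_hat, m_blocks=5):
    """Rosenbaum-Rubin blocking estimator: difference in means within
    equal-width propensity-score blocks, averaged with weights
    (N_1m + N_0m) / N."""
    block = np.clip((e_hat * m_blocks).astype(int), 0, m_blocks - 1)
    tau, weight_total = 0.0, 0.0
    for m in range(m_blocks):
        in_block = block == m
        n1 = np.sum(in_block & (w == 1))
        n0 = np.sum(in_block & (w == 0))
        if n1 == 0 or n0 == 0:
            continue  # skip blocks without both groups (assumption)
        tau_m = y[in_block & (w == 1)].mean() - y[in_block & (w == 0)].mean()
        tau += tau_m * (n1 + n0)
        weight_total += n1 + n0
    return tau / weight_total

rng = np.random.default_rng(2)
n = 20_000
x = rng.uniform(size=n)
e = 0.25 + 0.5 * x
w = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * w + x + rng.normal(size=n)  # true effect = 2

tau_block = blocking_estimator(y, w, e, m_blocks=5)
```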

Blocking can be interpreted as a crude form of nonparametric regression where the unknown function is approximated by a step function with fixed jump points. To establish asymptotic properties for this estimator would require establishing conditions on the rate at which the number of blocks increases with the sample size. With the propensity score known, these are easy to determine; no formal results have been established for the unknown propensity score case.

The question arises how many blocks to use in practice. Cochran (1968) analyzes a case with a single covariate and, assuming normality, shows that using five blocks removes at least 95% of the bias associated with that covariate. Since all bias, under unconfoundedness, is associated with the propensity score, this suggests that under normality the use of five blocks removes most of the bias associated with all the covariates. This has often been the starting point of empirical analyses using this estimator (for example, Rosenbaum and Rubin, 1983b; Dehejia and Wahba, 1999) and has been implemented in Stata by Becker and Ichino (2002).7 Often, however, researchers subsequently check the balance of the covariates within each block. If the true propensity score per block is constant, the distribution of the covariates among the treated and controls should be identical, or, in the evaluation terminology, the covariates should be balanced. Hence one can assess the adequacy of the statistical model by comparing the distribution of the covariates among treated and controls within blocks. If the distributions are found to be different, one can either split the blocks into a number of subblocks, or generalize the specification of the propensity score. Often some informal version of the following algorithm is used: If within a block the propensity score itself is unbalanced, the blocks are too large and need to be split. If, conditional on the propensity score being balanced, the covariates are unbalanced, the specification of the propensity score is not adequate. No formal algorithm has been proposed for implementing these blocking methods.

An alternative approach to finding the optimal number of blocks is to relate this approach to the weighting estimator discussed above. One can view the blocking estimator as identical to a weighting estimator, with a modified estimator for the propensity score. Specifically, given the original estimator $\hat e(x)$, in the blocking approach the estimator for the propensity score is discretized to

$$\bar e(x) = \frac{1}{M}\sum_{m=1}^{M} 1\left\{\frac{m}{M} \le \hat e(x)\right\}.$$

Using $\bar e(x)$ as the propensity score in the weighting estimator leads to an estimator for the average treatment effect identical to that obtained by using the blocking estimator with $\hat e(x)$ as the propensity score and $M$ blocks. With sufficiently large $M$, the blocking estimator is sufficiently close to the original weighting estimator that it shares its first-order asymptotic properties, including its efficiency. This suggests that in general there is little harm in choosing a large number of blocks, at least with regard to asymptotic properties, although again the relevance of this for finite samples has not been established.
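The discretization itself is a one-liner; a minimal sketch mapping each value of the estimated score to the block grid:

```python
import numpy as np

def discretized_score(e_hat, m_blocks):
    """Discretize an estimated propensity score to the block grid:
    e_bar(x) = (1/M) * sum over m of 1{m/M <= e_hat(x)}."""
    e_hat = np.asarray(e_hat, dtype=float)
    cutoffs = np.arange(1, m_blocks + 1) / m_blocks
    return (e_hat[:, None] >= cutoffs[None, :]).mean(axis=1)

# With M = 5, scores in blocks 1, 2, 4, and 5 map to 0.0, 0.2, 0.6, 0.8.
e_bar = discretized_score([0.05, 0.33, 0.61, 0.99], m_blocks=5)
```

All units falling in the same block receive the same discretized score, which is what links the weighting estimator based on the discretized score to the blocking estimator.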

Regression on the Propensity Score: The third method of using the propensity score is to estimate the conditional expectation of $Y$ given $W$ and $e(X)$. Define

$$\mu_w(e) = \mathbb{E}[Y(w) \,|\, e(X) = e].$$

By unconfoundedness this is equal to $\mathbb{E}[Y \,|\, W = w, e(X) = e]$. Given an estimator $\hat\mu_w(e)$, one can estimate the average treatment effect as

$$\hat\tau_{\text{regprop}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\mu_1(\hat e(X_i)) - \hat\mu_0(\hat e(X_i))\right).$$

Heckman, Ichimura, and Todd (1998) consider a local linear version of this for estimating the average treatment effect for the treated. Hahn (1998) considers a series version and shows that it is not as efficient as the regression estimator based on adjustment for all covariates.

Matching on the Propensity Score: Rosenbaum and Rubin's result implies that it is sufficient to adjust solely for differences in the propensity score between treated and control units. Since one of the ways in which one can adjust for differences in covariates is matching, another natural way to use the propensity score is through matching. Because the propensity score is a scalar function of the covariates, the bias results in Abadie and Imbens (2002) imply that the bias term is of lower order than the variance term and matching leads to a $\sqrt{N}$-consistent, asymptotically normally distributed estimator. The variance for the case with matching on the true propensity score also follows directly from their results. More complicated is the case with matching on the estimated propensity score. I do not know of any results that give the variance for this case.

7 Becker and Ichino also implement estimators that match on the propensity score.

D. Mixed Methods

A number of approaches have been proposed that combine two of the three methods described in the previous sections, typically regression with one of its alternatives. The reason for these combinations is that, although one method alone is often sufficient to obtain consistent or even efficient estimates, incorporating regression may eliminate remaining bias and improve precision. This is particularly useful in that neither matching nor the propensity-score methods directly address the correlation between the covariates and the outcome. The benefit associated with combining methods is made explicit in the notion developed by Robins and Ritov (1997) of double robustness. They propose a combination of weighting and regression where, as long as the parametric model for either the propensity score or the regression functions is specified correctly, the resulting estimator for the average treatment effect is consistent. Similarly, matching leads to consistency without additional assumptions; thus methods that combine matching and regressions are robust against misspecification of the regression function.

Weighting and Regression: One can rewrite the weighting estimator discussed above as estimating the following regression function by weighted least squares:

$$Y_i = \alpha + \tau \cdot W_i + \varepsilon_i,$$

with weights equal to

$$\lambda_i = \frac{W_i}{\hat e(X_i)} + \frac{1 - W_i}{1 - \hat e(X_i)}.$$

Without the weights the least squares estimator would not be consistent for the average treatment effect; the weights ensure that the covariates are uncorrelated with the treatment indicator and hence the weighted estimator is consistent.

This weighted-least-squares representation suggests that one may add covariates to the regression function to improve precision, for example,

$$Y_i = \alpha + \beta' X_i + \tau \cdot W_i + \varepsilon_i,$$

with the same weights $\lambda_i$. Such an estimator, using a more general semiparametric regression model, was suggested by Robins and Rotnitzky (1995), Robins, Rotnitzky, and Zhao (1995), and Robins and Ritov (1997), and implemented by Hirano and Imbens (2001). In the parametric context Robins and Ritov argue that the estimator is consistent as long as either the regression model or the propensity score (and thus the weights) is specified correctly. That is, in Robins and Ritov's terminology, the estimator is doubly robust.
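A minimal sketch of this weighted regression with covariates, with an invented data-generating process and the true propensity score standing in for an estimated one:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.uniform(size=n)
e = 0.25 + 0.5 * x
w = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * w + x + rng.normal(size=n)  # true effect = 2

# Weights lambda_i = W_i/e(X_i) + (1 - W_i)/(1 - e(X_i)).
lam = w / e + (1 - w) / (1 - e)

# Weighted least squares of Y on (1, X, W); the coefficient on W
# estimates the average treatment effect.
design = np.column_stack([np.ones(n), x, w])
sqrt_lam = np.sqrt(lam)
coef, *_ = np.linalg.lstsq(design * sqrt_lam[:, None], y * sqrt_lam,
                           rcond=None)
tau_dr = coef[2]
```

Here both the weights and the regression are correctly specified; the double-robustness argument says the estimate would remain consistent if only one of the two were.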

Blocking and Regression: Rosenbaum and Rubin (1983b) suggest modifying the basic blocking estimator by using least squares regression within the blocks. Without the additional regression adjustment the estimated treatment effect within blocks can be written as a least squares estimator of $\tau_m$ for the regression function

$$Y_i = \alpha_m + \tau_m \cdot W_i + \varepsilon_i,$$

using only the units in block $m$. As above, one can also add covariates to the regression function,

$$Y_i = \alpha_m + \beta_m' X_i + \tau_m \cdot W_i + \varepsilon_i,$$

again estimated on the units in block m.

Matching and Regression: Because Abadie and Imbens (2002) have shown that the bias of the simple matching estimator can dominate the variance if the dimension of the covariates is too large, additional bias corrections through regression can be particularly relevant in this case. A number of such corrections have been proposed, first by Rubin (1973b) and Quade (1982) in a parametric setting. Following the notation of section IIIB, let $\hat Y_i(0)$ and $\hat Y_i(1)$ be the observed or imputed potential outcomes for unit $i$; the estimated potential outcomes equal the observed outcomes for some unit $i$ and for its match $\ell(i)$. The bias in their comparison, $\mathbb{E}[\hat Y_i(1) - \hat Y_i(0)] - \mathbb{E}[Y_i(1) - Y_i(0)]$, arises from the fact that the covariates $X_i$ and $X_{\ell(i)}$ for units $i$ and $\ell(i)$ are not equal, although they are close because of the matching process.

To further explore this, focusing on the single-match case, define for each unit

$$X_i(0) = \begin{cases} X_i & \text{if } W_i = 0, \\ X_{\ell_1(i)} & \text{if } W_i = 1, \end{cases}$$

and

$$X_i(1) = \begin{cases} X_{\ell_1(i)} & \text{if } W_i = 0, \\ X_i & \text{if } W_i = 1. \end{cases}$$

If the matching is exact, $X_i(0) = X_i(1)$ for each unit. If not, these discrepancies may lead to bias. The difference $X_i(1) - X_i(0)$ will therefore be used to reduce the bias of the simple matching estimator.


Suppose unit $i$ is a treated unit ($W_i = 1$), so that $\hat Y_i(1) = Y_i(1)$ and $\hat Y_i(0)$ is an imputed value for $Y_i(0)$. This imputed value is unbiased for $\mu_0(X_{\ell_1(i)})$ (since $\hat Y_i(0) = Y_{\ell_1(i)}$), but not necessarily for $\mu_0(X_i)$. One may therefore wish to adjust $\hat Y_i(0)$ by an estimate of $\mu_0(X_i) - \mu_0(X_{\ell_1(i)})$. Typically these corrections are taken to be linear in the difference in the covariates for unit $i$ and its match, that is, of the form $\beta_0'[X_i(1) - X_i(0)] = \beta_0'(X_i - X_{\ell_1(i)})$. Rubin (1973b) proposed three corrections, which differ in how $\beta_0$ is estimated.

To introduce Rubin’s first correction, note that one canwrite the matching estimator as the least squares estimatorfor the regression function

Yi�1� � Yi�0� � � εi.

This representation suggests modifying the regression function to

$$\hat Y_i(1) - \hat Y_i(0) = \tau + [X_i(1) - X_i(0)]'\beta + \varepsilon_i,$$

and again estimating $\tau$ by least squares.

The second correction is to estimate $\mu_0(x)$ directly by taking all control units, and estimate a linear regression of the form

$$Y_i = \alpha_0 + \beta_0' X_i + \varepsilon_i$$

by least squares. [If unit $i$ is a control unit, the correction will be done using an estimator for the regression function $\mu_1(x)$ based on a linear specification $Y_i = \alpha_1 + \beta_1' X_i + \varepsilon_i$ estimated on the treated units.] Abadie and Imbens (2002) show that if this correction is done nonparametrically, the resulting matching estimator is consistent and asymptotically normal, with its bias dominated by the variance.

The third method is to estimate the same regression function for the controls, but using only those that are used as matches for the treated units, with weights corresponding to the number of times a control observation is used as a match (see Abadie and Imbens, 2002). Compared to the second method, this approach may be less efficient, as it discards some control observations and weights some more than others. It has the advantage, however, of only using the most relevant matches. The controls that are discarded in the matching process are likely to be outliers relative to the treated observations, and they may therefore unduly affect the least squares estimates. If the regression function is in fact linear, this may be an attractive feature, but if there is uncertainty over its functional form, one may not wish to allow these observations such influence.
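Rubin's second correction can be sketched as follows for treated units with a single scalar covariate (all numbers invented; the nearest-neighbor match and the linear control regression follow the description above):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000
x = rng.uniform(size=n)
e = 0.25 + 0.5 * x
w = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * w + 3.0 * x + rng.normal(size=n)  # true effect = 2

treated = np.flatnonzero(w == 1)
controls = np.flatnonzero(w == 0)

# Second correction: estimate mu_0(x) = alpha_0 + beta_0 * x
# by least squares on all control units.
a0, b0 = np.polyfit(x[controls], y[controls], 1)[::-1]

effects = []
for i in treated:
    j = controls[np.argmin(np.abs(x[controls] - x[i]))]  # nearest control
    # Adjust the matched outcome by beta_0 * (X_i - X_match).
    y0_hat = y[j] + b0 * (x[i] - x[j])
    effects.append(y[i] - y0_hat)

tau_match_adj = float(np.mean(effects))  # effect on the treated
```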

E. Bayesian Approaches

Little has been done using Bayesian methods to estimate average treatment effects, either in methodology or in application. Rubin (1978) introduces a general approach to estimating average and distributional treatment effects from a Bayesian perspective. Dehejia (2002) goes further, studying the policy decision problem of assigning heterogeneous individuals to various training programs with uncertain and variable effects.

To my knowledge, however, there are no applications using the Bayesian approach that focus on estimating the average treatment effect under unconfoundedness, either for the whole population or just for the treated. Neither are there simulation studies comparing operating characteristics of Bayesian methods with the frequentist methods discussed in the earlier sections of this paper. Such a Bayesian approach can be easily implemented with the regression methods discussed in section IIIA. Interestingly, it is less clear how Bayesian methods would be used with pairwise matching, which does not appear to have a natural likelihood interpretation.

A Bayesian approach to the regression estimators may be useful for a number of reasons. First, one of the leading problems with regression estimators is the presence of many covariates relative to the number of observations. Standard frequentist methods tend to either include those covariates without any restrictions, or exclude them entirely. In contrast, Bayesian methods would allow researchers to include covariates with more or less informative prior distributions. For example, if the researcher has a number of lagged outcomes, one may expect recent lags to be more important in predicting future outcomes than longer lags; this can be reflected in tighter prior distributions around zero for the older information. Alternatively, with a number of similar covariates one may wish to use hierarchical models that avoid problems with large-dimensional parameter spaces.

A second argument for considering Bayesian methods is that in an area closely related to this process of estimating unobserved outcomes, that of missing data with the missing-at-random (MAR) assumption, Bayesian methods have found widespread applicability. As advocated by Rubin (1987), multiple imputation methods often rely on a Bayesian approach for imputing the missing data, taking account of the parameter heterogeneity in a manner consistent with the uncertainty in the missing-data model itself. The same methods could be used with little modification for causal models, with the main complication that a relatively large proportion (namely 50% of the total number of potential outcomes) is missing.

IV. Estimating Variances

The variances of the estimators considered so far typically involve unknown functions. For example, as discussed in section IIE, the variance of efficient estimators of the PATE is equal to

$$V^P = \mathbb{E}\left[\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} + \left(\mu_1(X) - \mu_0(X) - \tau\right)^2\right],$$

involving the two regression functions, the two conditional variances, and the propensity score.


There are a number of ways we can estimate this asymptotic variance. The first is essentially by brute force. All five components of the variance, $\sigma_0^2(x)$, $\sigma_1^2(x)$, $\mu_0(x)$, $\mu_1(x)$, and $e(x)$, are consistently estimable using kernel methods or series, and hence the asymptotic variance can be estimated consistently. However, if one estimates the average treatment effect using only the two regression functions, it is an additional burden to estimate the conditional variances and the propensity score in order to estimate $V^P$. Similarly, if one efficiently estimates the average treatment effect by weighting with the estimated propensity score, it is a considerable additional burden to estimate the first two moments of the conditional outcome distributions just to estimate the asymptotic variance.

A second method applies to the case where either the regression functions or the propensity score is estimated using series or sieves. In that case one can interpret the estimators, given the number of terms in the series, as parametric estimators, and calculate the variance this way. Under some conditions that will lead to valid standard errors and confidence intervals.

A third approach is to use bootstrapping (Efron and Tibshirani, 1993; Horowitz, 2002). There is little formal evidence specific for these estimators, but, given that the estimators are asymptotically linear, it is likely that bootstrapping will lead to valid standard errors and confidence intervals at least for the regression and propensity score methods. Bootstrapping may be more complicated for matching estimators, as the process introduces discreteness in the distribution that will lead to ties in the matching algorithm. Subsampling (Politis and Romano, 1999) will still work in this setting.
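For the regression and propensity-score methods, a bootstrap standard error is straightforward to sketch (here for the simple weighting estimator, with an invented data-generating process):

```python
import numpy as np

def ipw_tau(y, w, e):
    """Simple weighting estimator, as in equation (6)."""
    return np.mean(w * y / e - (1 - w) * y / (1 - e))

def bootstrap_se(y, w, e, n_boot=200, seed=0):
    """Nonparametric bootstrap standard error: resample units with
    replacement and recompute the estimate each time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(ipw_tau(y[idx], w[idx], e[idx]))
    return float(np.std(stats, ddof=1))

rng = np.random.default_rng(6)
n = 2_000
x = rng.uniform(size=n)
e = 0.25 + 0.5 * x
w = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * w + x + rng.normal(size=n)

se_boot = bootstrap_se(y, w, e)
```

For matching estimators this resampling scheme is exactly where the ties mentioned above arise, which is why subsampling is the safer default there.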

These first three methods provide variance estimates for estimators of $\tau^P$. As argued above, however, one may instead wish to estimate $\tau^S$ or $\tau(X)$, in which case the appropriate (conservative) variance is

$$V^S = \mathbb{E}\left[\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)}\right].$$

As above, this variance can be estimated by estimating the conditional moments of the outcome distributions, with the accompanying inherent difficulties. $V^S$ cannot, however, be estimated by bootstrapping, since the estimand itself changes across bootstrap samples.

There is, however, an alternative method for estimating this variance that does not require additional nonparametric estimation. The idea behind this matching variance estimator, as developed by Abadie and Imbens (2002), is that even though the asymptotic variance depends on the conditional variance $\sigma_w^2(x)$, one need not actually estimate this variance consistently at all values of the covariates. Rather, one needs only the average of this variance over the distribution, weighted by the inverse of either $e(x)$ or its complement $1 - e(x)$. The key is therefore to obtain a close-to-unbiased estimator for the variance $\sigma_w^2(x)$. More generally, suppose

we can find two treated units with $X = x$, say units $i$ and $j$. In that case an unbiased estimator for $\sigma_1^2(x)$ is

$$\hat\sigma_1^2(x) = (Y_i - Y_j)^2/2.$$

In general it is again difficult to find exact matches, but again, this is not necessary. Instead, one uses the closest match within the set of units with the same treatment indicator. Let $v_m(i)$ be the $m$th closest unit to $i$ with the same treatment indicator ($W_{v_m(i)} = W_i$), and

$$\sum_{l: W_l = W_i,\, l \ne i} 1\left\{\|X_l - X_i\| \le \|X_{v_m(i)} - X_i\|\right\} = m.$$

Given a fixed number of matches, $M$, this gives us $M$ units with the same treatment indicator and approximately the same values for the covariates. The sample variance of the outcome variable for these $M$ units can then be used to estimate $\sigma_1^2(x)$. Doing the same for the control variance function, $\sigma_0^2(x)$, we can estimate $\sigma_w^2(x)$ at all values of the covariates and for $w = 0, 1$.

Note that these are not consistent estimators of the conditional variances. As the sample size increases, the bias of these estimators will disappear, just as we saw that the bias of the matching estimator for the average treatment effect disappears under similar conditions. The rate at which this bias disappears depends on the dimension of the covariates. The variance of the estimators for $\sigma_w^2(X_i)$, namely at specific values of the covariates, will not go to zero; however, this is not important, as we are interested not in the variances at specific points in the covariates distribution, but in the variance of the average treatment effect, $V^S$. Following the process introduced above, this last step is estimated as

$$\hat V^S = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\hat\sigma_1^2(X_i)}{\hat e(X_i)} + \frac{\hat\sigma_0^2(X_i)}{1 - \hat e(X_i)}\right).$$

Under standard regularity conditions this is consistent for the asymptotic variance of the average treatment effect estimator. For matching estimators even estimation of the propensity score can be avoided. Abadie and Imbens show that one can estimate the variance of the matching estimator for the SATE as

VE �1

N�i�1

N �1 �KM�i�

M 2

�Wi

2 �Xi�,

where M is the number of matches and KM(i) is the numberof times unit i is used as a match.
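To make the mechanics concrete, the construction just described can be sketched in a few lines of code. This is not the authors' implementation: the function name, the Euclidean metric, and the convention of pooling unit i with its M closest same-treatment neighbors (so that M = 1 reproduces (Y_i − Y_j)²/2) are choices made here purely for illustration.

```python
import numpy as np

def matching_variance(Y, W, X, M=4):
    """Sketch of the matched-pair variance construction and V_E.

    Y: outcomes (N,); W: binary treatment indicators (N,); X: covariates
    (N, k).  sigma2[i] estimates sigma^2_{W_i}(X_i) from unit i pooled
    with its M closest neighbors sharing its treatment indicator (for
    M = 1 this reduces to (Y_i - Y_j)^2 / 2); K[i] counts how often
    unit i serves as one of the M matches for an opposite-group unit.
    """
    N = len(Y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    idx = np.arange(N)

    sigma2 = np.empty(N)
    K = np.zeros(N)
    for i in range(N):
        same = idx[(W == W[i]) & (idx != i)]
        nearest = same[np.argsort(dist[i, same])[:M]]
        sigma2[i] = np.var(np.append(Y[nearest], Y[i]), ddof=1)
        # unit i's M matches are drawn from the opposite treatment group
        opposite = idx[W != W[i]]
        K[opposite[np.argsort(dist[i, opposite])[:M]]] += 1

    # V_E = (1/N) * sum_i (1 + K_M(i)/M)^2 * sigma^2_{W_i}(X_i)
    return np.mean((1.0 + K / M) ** 2 * sigma2)
```

For V̂_S one would instead average σ̂₁²(X_i)/ê(X_i) + σ̂₀²(X_i)/(1 − ê(X_i)) over the sample, using an estimated propensity score.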

V. Assessing the Assumptions

A. Indirect Tests of the Unconfoundedness Assumption

The unconfoundedness assumption relied upon throughout this discussion is not directly testable. As discussed above, it states that the conditional distribution of the outcome under the control treatment, Y(0), given receipt of the active treatment and given covariates, is identical to the distribution of the control outcome given receipt of the control treatment and given covariates. The same is assumed for the distribution of the active treatment outcome, Y(1). Because the data are completely uninformative about the distribution of Y(0) for those who received the active treatment and of Y(1) for those who received the control, the data cannot directly reject the unconfoundedness assumption. Nevertheless, there are often indirect ways of assessing this assumption, a number of which are developed in Heckman and Hotz (1989) and Rosenbaum (1987). These methods typically rely on estimating a causal effect that is known to equal zero. If the test then suggests that this causal effect differs from zero, the unconfoundedness assumption is considered less plausible. These tests can be divided into two broad groups.

The first set of tests focuses on estimating the causal effect of a treatment that is known not to have an effect, relying on the presence of multiple control groups (Rosenbaum, 1987). Suppose one has two potential control groups, for example, eligible nonparticipants and ineligibles, as in Heckman, Ichimura, and Todd (1997). One interpretation of the test is to compare average treatment effects estimated using each of the control groups. This can also be interpreted as estimating an “average treatment effect” using only the two control groups, with the treatment indicator now a dummy for being a member of the first group. In that case the treatment effect is known to be zero, and statistical evidence of a nonzero effect implies that at least one of the control groups is invalid. Again, not rejecting the test does not imply the unconfoundedness assumption is valid (as both control groups could suffer the same bias), but nonrejection in the case where the two control groups are likely to have different potential biases makes it more plausible that the unconfoundedness assumption holds. The key for the power of this test is to have available control groups that are likely to have different biases, if any. Comparing ineligibles and eligible nonparticipants as in Heckman, Ichimura, and Todd (1997) is a particularly attractive comparison. Alternatively one may use different geographic controls, for example from areas bordering on different sides of the treatment group.

One can formalize this test by postulating a three-valued indicator T_i ∈ {−1, 0, 1} for the groups (e.g., ineligibles, eligible nonparticipants, and participants), with the treatment indicator equal to W_i = 1{T_i = 1}. If one extends the unconfoundedness assumption to independence of the potential outcomes and the group indicator given covariates,

  Y_i(0), Y_i(1) ⊥ T_i | X_i,

then a testable implication is

  Y_i ⊥ 1{T_i = 0} | X_i, T_i ≤ 0.

An implication of this independence condition is being tested by the tests discussed above. Whether this test has much bearing on the unconfoundedness assumption depends on whether the extension of the assumption is plausible given unconfoundedness itself.
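A bare-bones version of this placebo comparison can be sketched as follows. Linear regression adjustment stands in here for whichever estimator one actually uses, and all names are illustrative; the point is only that the “effect” of membership in one control group versus the other should be indistinguishable from zero.

```python
import numpy as np

def placebo_test(Y, G, X):
    """Sketch of the two-control-group placebo test.

    Among units that received no treatment, G = 1 flags membership in
    the first control group (e.g. eligible nonparticipants) and G = 0
    the second (e.g. ineligibles).  Under the extended unconfoundedness
    assumption the coefficient on G in a regression of Y on (1, G, X)
    should be zero; a large t-statistic casts doubt on at least one of
    the control groups.  Linear adjustment is used purely for
    illustration.
    """
    Z = np.column_stack([np.ones_like(Y), G, X])
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ beta
    n, k = Z.shape
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    return beta[1], t_stat
```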

The second set of tests of unconfoundedness focuses on estimating the causal effect of the treatment on a variable known to be unaffected by it, typically because its value is determined prior to the treatment itself. Such a variable can be time-invariant, but the most interesting case is in considering the treatment effect on a lagged outcome. If this effect is not zero, this implies that the treated observations are distinct from the controls; namely, that the distribution of Y_{i,−1} for the treated units is not comparable to the distribution of Y_{i,−1} for the controls. If the estimated effect is instead zero, it is more plausible that the unconfoundedness assumption holds. Of course this does not directly test the assumption; in this setting, being able to reject the null of no effect does not directly reflect on the hypothesis of interest, unconfoundedness. Nevertheless, if the variables used in this proxy test are closely related to the outcome of interest, the test arguably has more power. For these tests it is clearly helpful to have a number of lagged outcomes.

To formalize this, let us suppose the covariates consist of a number of lagged outcomes Y_{i,−1}, …, Y_{i,−T} as well as time-invariant individual characteristics Z_i, so that X_i = (Y_{i,−1}, …, Y_{i,−T}, Z_i). By construction only units in the treatment group after period −1 receive the treatment; all other observed outcomes are control outcomes. Also suppose that the two potential outcomes Y_i(0) and Y_i(1) correspond to outcomes in period zero. Now consider the following two assumptions. The first is unconfoundedness given only T − 1 lags of the outcome:

  Y_i(1), Y_i(0) ⊥ W_i | Y_{i,−1}, …, Y_{i,−(T−1)}, Z_i,

and the second assumes stationarity and exchangeability:

  f_{Y_{i,s}(0) | Y_{i,s−1}(0), …, Y_{i,s−(T−1)}(0), Z_i, W_i}(y_s | y_{s−1}, …, y_{s−(T−1)}, z, w)

does not depend on i and s. Then it follows that

  Y_{i,−1} ⊥ W_i | Y_{i,−2}, …, Y_{i,−T}, Z_i,

which is testable. This hypothesis is what the test described above tests. Whether this test has much bearing on unconfoundedness depends on the link between the two assumptions and the original unconfoundedness assumption. With a sufficient number of lags, unconfoundedness given all lags but one appears plausible, conditional on unconfoundedness given all lags, so the relevance of the test depends largely on the plausibility of the second assumption, stationarity and exchangeability.

B. Choosing the Covariates

THE REVIEW OF ECONOMICS AND STATISTICS

The discussion so far has focused on the case where the covariate set is known a priori. In practice there can be two issues with the choice of covariates. First, there may be some variables that should not be adjusted for. Second, even with variables that should be adjusted for in large samples, the expected mean squared error may be reduced by ignoring those covariates that have only weak correlation with the treatment indicator and the outcomes. This second issue is essentially a statistical one. Including a covariate in the adjustment procedure, through regression, matching, or otherwise, will not lower the asymptotic precision of the average treatment effect if the assumptions are correct. In finite samples, however, a covariate that is not, or is only weakly, correlated with outcomes and treatment indicators may reduce precision. There are few procedures currently available for optimally choosing the set of covariates to be included in matching or regression adjustments, taking into account such finite-sample properties.

The first issue is a substantive one. The unconfoundedness assumption may apply with one set of covariates but not apply with an expanded set. A particular concern is the inclusion of covariates that are themselves affected by the treatment, such as intermediate outcomes. Suppose, for example, that in evaluating a job training program, the primary outcome of interest is earnings two years later. In that case, employment status prior to the program is unaffected by the treatment and thus a valid element of the set of adjustment covariates. In contrast, employment status one year after the program is an intermediate outcome and should not be controlled for. It could itself be an outcome of interest, and should therefore never be a covariate in an analysis of the effect of the training program. One guarantee that a covariate is not affected by the treatment is that it was measured before the treatment was chosen. In practice, however, the covariates are often recorded at the same time as the outcomes, subsequent to treatment. In that case one has to assess on a case-by-case basis whether a particular covariate should be used in adjusting outcomes. See Rosenbaum (1984b) and Angrist and Krueger (2000) for more discussion.

C. Assessing the Overlap Assumption

The second of the key assumptions in estimating average treatment effects requires that the propensity score (the probability of receiving the active treatment) be strictly between zero and one. In principle this is testable, as it restricts the joint distribution of observables; but formal tests are not necessarily the main concern. In practice, this assumption raises a number of questions. The first is how to detect a lack of overlap in the covariate distributions. A second is how to deal with it, given that such a lack exists. A third is how the individual methods discussed in section III address this lack of overlap. Ideally such a lack would result in large standard errors for the average treatment effects.

The first method to detect lack of overlap is to plot distributions of covariates by treatment group. In the case with one or two covariates one can do this directly. In high-dimensional cases, however, this becomes more difficult. One can inspect pairs of marginal distributions by treatment status, but these are not necessarily informative about lack of overlap. It is possible that for each covariate the distributions for the treatment and control groups are identical, even though there are areas where the propensity score is 0 or 1.

A more useful method is therefore to inspect the distribution of the propensity score in both treatment groups, which can directly reveal lack of overlap in high-dimensional covariate distributions. Its implementation requires nonparametric estimation of the propensity score, however, and misspecification may lead to failure in detecting a lack of overlap, just as inspecting various marginal distributions may be insufficient. In practice one may wish to undersmooth the estimation of the propensity score, either by choosing a bandwidth smaller than optimal for nonparametric estimation or by including higher-order terms in a series expansion.
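As a sketch, once a propensity score has been estimated (by whatever method), comparing its distribution across the two groups amounts to something like the following; the function name and binning choice are mine.

```python
import numpy as np

def overlap_summary(e_hat, W, bins=10):
    """Sketch of an overlap check on an estimated propensity score.

    e_hat: estimated propensity scores (however obtained; the text
    suggests undersmoothing); W: binary treatment indicators.  Returns
    the share of each group falling in each propensity-score bin, so
    that cells populated in one group but empty in the other flag
    regions where e(x) is near 0 or 1.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    treated = np.histogram(e_hat[W == 1], bins=edges)[0] / max(W.sum(), 1)
    control = np.histogram(e_hat[W == 0], bins=edges)[0] / max((1 - W).sum(), 1)
    return edges, treated, control
```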

A third way to detect lack of overlap is to inspect the quality of the worst matches in a matching procedure. Given a set of matches, one can, for each component k of the vector of covariates, inspect max_i |x_{i,k} − x_{ℓ₁(i),k}|, the maximum over all observations of the matching discrepancy. If this difference is large relative to the sample standard deviation of the k-th component of the covariates, there is reason for concern. The advantage of this method is that it does not require additional nonparametric estimation.
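This diagnostic is a one-liner in practice. The sketch below assumes the matched pairs have already been formed and aligned row by row; the scaling by the pooled standard deviation is one reasonable choice, not a prescription from the text.

```python
import numpy as np

def worst_match_discrepancy(X_treated, X_matched):
    """Sketch of the worst-match diagnostic.

    X_treated: covariates of the treated units (n, k); X_matched:
    covariates of their matched controls (n, k), row-aligned.  For each
    covariate k this returns max_i |x_{i,k} - x_{m(i),k}| scaled by the
    pooled sample standard deviation of that covariate; a value near or
    above one signals a poorly matched region of the covariate space.
    """
    gap = np.abs(X_treated - X_matched).max(axis=0)
    scale = np.vstack([X_treated, X_matched]).std(axis=0, ddof=1)
    return gap / scale
```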

Once one determines that there is a lack of overlap, one can either conclude that the average treatment effect of interest cannot be estimated with sufficient precision, or decide to focus on an average treatment effect that is estimable with greater accuracy. To do the latter it can be useful to discard some of the observations on the basis of their covariates. For example, one may decide to discard control (treated) observations with propensity scores below (above) a cutoff level. The desired cutoff may depend on the sample size; in a very large sample one may not be concerned with a propensity score of 0.01, whereas in small samples such a value may make it difficult to find reasonable comparisons. To judge such tradeoffs, it is useful to understand the relationship between a unit’s propensity score and its implicit weight in the average-treatment-effect estimation. Using the weighting estimator, the average outcome under the treatment is estimated by summing up outcomes for the treated units with weight approximately equal to 1 divided by their propensity score (and 1 divided by 1 minus the propensity score for control units). Hence with N units, the weight of unit i is approximately 1/[N · e(X_i)] if it is a treated unit and 1/{N · [1 − e(X_i)]} if it is a control. One may wish to limit this weight to some fraction, for example, 0.05, so that no unit will have a weight of more than 5% in the average. Under that approach, the limit on the propensity score in a sample with 200 units is 0.1: units with a propensity score less than 0.1 or greater than 0.9 should be discarded. In a sample with 1,000 units, only units with a propensity score outside the range [0.02, 0.98] will be ignored.
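The arithmetic of this weight cap can be written down directly (the function name is mine): capping a unit's weight at a fraction α requires 1/(N·e(x)) ≤ α for treated units, i.e. e(x) ≥ 1/(N·α), and symmetrically e(x) ≤ 1 − 1/(N·α) for control units.

```python
def trimming_bounds(N, max_weight=0.05):
    """Sketch of the weight-cap trimming rule.

    Capping any single unit's implicit weight at max_weight (here 5%)
    requires e(x) >= 1/(N * max_weight) for treated units and
    e(x) <= 1 - 1/(N * max_weight) for control units.
    """
    cut = 1.0 / (N * max_weight)
    return cut, 1.0 - cut
```

With a 5% cap, trimming_bounds(200) recovers the (0.1, 0.9) cutoffs and trimming_bounds(1000) the (0.02, 0.98) cutoffs used in the examples above.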

In matching procedures one need not rely entirely on comparisons of the propensity score distribution in discarding the observations with insufficient match quality. Whereas Rosenbaum and Rubin (1984) suggest accepting only matches where the difference in propensity scores is below a cutoff point, alternatively one may wish to drop matches where individual covariates are severely mismatched.

Finally, let us consider the three approaches to inference (regression, matching, and propensity score methods) and assess how each handles lack of overlap. Suppose one is interested in estimating the average effect on the treated, and one has a data set with sufficient overlap. Now suppose one adds a few treated or control observations with covariate values rarely seen in the alternative treatment group. Adding treated observations with outlying values implies one cannot estimate the average treatment effect for the treated very precisely, because one lacks suitable controls against which to compare these additional units. Thus with methods appropriately dealing with limited overlap one will see the variance estimates increase. In contrast, adding control observations with outlying covariate values should have little effect, since such controls are irrelevant for the average treatment effect for the treated. Therefore, methods appropriately dealing with limited overlap should in this case show estimates approximately unchanged in bias and precision.

Consider first the regression approach. Conditional on a particular parametric specification for the regression function, adding observations with outlying values of the regressors leads to considerably more precise parameter estimates; such observations are influential precisely because of their outlying values. If the added observations are treated units, the precision of the estimated control regression function at these outlying values will be lower (since few if any control units are found in that region); thus the variance will increase, as it should. One should note, however, that the estimates in this region may be sensitive to the specification chosen. In contrast, by the nature of regression functions, adding control observations with outlying values will lead to a spurious increase in precision of the control regression function. Regression methods can therefore be misleading in cases with limited overlap.

Next, consider matching. In estimating the average treatment effect for the treated, adding control observations with outlying covariate values will likely have little effect on the results, since such observations are unlikely to be used as matches. The results would, however, be sensitive to adding treated observations with outlying covariate values, because these observations would be matched to inappropriate controls, leading to possibly biased estimates. The standard errors would largely be unaffected.

Finally, consider propensity-score estimates. Estimates of the probability of receiving treatment now include values close to 0 and 1. The values close to 0 for the control observations would cause little difficulty, because these units would get close to zero weight in the estimation. The control observations with a propensity score close to 1, however, would receive high weights, leading to an increase in the variance of the average-treatment-effect estimator, correctly implying that one cannot estimate the average treatment effect very precisely. Blocking on the propensity score would lead to similar conclusions.

Overall, propensity score and matching methods (and likewise kernel-based regression methods) are better designed to cope with limited overlap in the covariate distributions than are parametric or semiparametric (series) regression models. In all cases it is useful to inspect histograms of the estimated propensity score in both groups to assess whether limited overlap is an issue.

VI. Applications

There are many studies using some form of unconfoundedness or selection on observables, ranging from simple least squares analyses to matching on the propensity score (for example, Ashenfelter and Card, 1985; LaLonde, 1986; Card and Sullivan, 1988; Heckman, Ichimura, and Todd, 1997; Angrist, 1998; Dehejia and Wahba, 1999; Lechner, 1998; Friedlander and Robins, 1995; and many others). Here I focus primarily on two sets of analyses that can help researchers assess the value of the methods surveyed in this paper: first, studies attempting to assess the plausibility of the assumptions, often using randomized experiments as a yardstick; second, simulation studies focusing on the performance of the various techniques in settings where the assumptions are known to hold.

A. Applications: Randomized Experiments as Checks on Unconfoundedness

The basic idea behind these studies is simple: to use experimental results as a check on the attempted nonexperimental estimates. Given a randomized experiment, one can obtain unbiased estimates of the average effect of a program. Then one can put aside the experimental control group and attempt to replicate these results using a nonexperimental control. If one can successfully replicate the experimental results, this suggests that the assumptions and methods are plausible. Such investigations are of course not generally conclusive, but they are invaluable in assessing the plausibility of the approach. The first such study, and one that made an enormous impact in the econometrics literature, was by LaLonde (1986). Fraker and Maynard (1987) conducted a similar investigation, and many more have followed.


LaLonde (1986) took the National Supported Work program, a fairly small program aimed at particularly disadvantaged people in the labor market (individuals with poor labor market histories and skills). Using these data, he set aside the experimental control group and in its place constructed alternative controls from the Panel Study of Income Dynamics (PSID) and Current Population Survey (CPS), using various selection criteria depending on prior labor market experience. He then used a number of methods, ranging from a simple difference to least squares adjustment, a Heckman selection correction, and difference-in-differences techniques, to create nonexperimental estimates of the average treatment effect. His general conclusion was that the results were very unpredictable and that no method could consistently replicate the experimental results using any of the six nonexperimental control groups constructed. A number of researchers have subsequently tested new techniques using these same data. Heckman and Hotz (1989) focused on testing the various models and argued that the testing procedures they developed would have eliminated many of LaLonde’s particularly inappropriate estimates. Dehejia and Wahba (1999) used several of the semiparametric methods based on the unconfoundedness assumption discussed in this survey, and found that for the subsample of the LaLonde data that they used (with two years of prior earnings), these methods replicated the experimental results more accurately, both overall and within subpopulations. Smith and Todd (2003) analyze the same data and conclude that for other subsamples, including those for which only one year of prior earnings is available, the results are less robust. See Dehejia (2003) for additional discussion of these results.

Others have used different experiments to carry out the same or similar analyses, using varying sets of estimators and alternative control groups. Friedlander and Robins (1995) focus on least squares adjustment, using data from the WIN (Work INcentive) demonstration programs conducted in a number of states, and construct control groups from other counties in the same state, as well as from different states. They conclude that nonexperimental methods are unable to replicate the experimental results. Hotz, Imbens, and Mortimer (2003) use the same data and consider matching methods with various sets of covariates, using single or multiple alternative states as nonexperimental control groups. They find that for the subsample of individuals with positive earnings at some date prior to the program, nonexperimental methods work better than for those with no known positive earnings.

Heckman, Ichimura, and Todd (1997, 1998) and Heckman, Ichimura, Smith, and Todd (1998) study the national Job Training Partnership Act (JTPA) program, using data from different geographical locations to investigate the nature of the biases associated with different estimators, and the importance of overlap in the covariates, including labor market histories. Their conclusions provide the type of specific guidance that should be the aim of such studies. They give clear and generalizable conditions that make the assumptions of unconfoundedness and overlap (at least according to their study of a large training program) more plausible. These conditions include the presence of detailed earnings histories, and control groups that are geographically close to the treatment group, preferably groups of ineligibles, or eligible nonparticipants from the same location. In contrast, control groups from very different locations are found to be poor nonexperimental controls. Although such conclusions are only clearly generalizable to evaluations of social programs, they are potentially very useful in providing analysts with concrete guidance as to the applicability of these assumptions.

Dehejia (2002) uses the Greater Avenues to INdependence (GAIN) data, using different counties as well as different offices within the same county as nonexperimental control groups. Similarly, Hotz, Imbens, and Klerman (2001) use the basic GAIN data set supplemented with administrative data on long-term quarterly earnings (both prior and subsequent to the randomization date) to investigate the importance of detailed earnings histories. Such detailed histories can also provide more evidence on the plausibility of nonexperimental evaluations for long-term outcomes.

Two complications make this literature difficult to evaluate. One is the differences in covariates used; it is rare that variables are measured consistently across different studies. For instance, some studies have yearly earnings data, others quarterly, and others only earnings indicators on a monthly or quarterly basis. This makes it difficult to consistently investigate the level of detail in earnings history necessary for the unconfoundedness assumption to hold. A second complication is that different estimators are generally used; thus any differences in results can be attributed to either estimators or assumptions. This is likely driven by the fact that few of the estimators have been sufficiently standardized that they can be implemented easily by empirical researchers.

All of the studies just discussed took data from actual randomized experiments to test the “true” treatment effect against the estimators used on the nonexperimental data. To some extent, however, such experimental data are not required. The question of interest is whether an alternative control group is an adequate proxy for a randomized control in a particular setting; note that this question does not require data on the treatment group. Although these questions have typically been studied by comparing experimental with nonexperimental results, all that is really relevant is whether the nonexperimental control group can predict the average outcomes for the experimental control. As in Heckman, Ichimura, Smith, and Todd’s (1998) analysis of the JTPA data, one can take two groups, neither subject to the treatment, and ask whether, using data on the covariates for the first control group in combination with outcome and covariate information for the second, one can predict the average outcome in the first. If so, this implies that, had there been an experiment on the population from which the first control group was drawn, the second group would provide an acceptable nonexperimental control. From this perspective one can use data from many different surveys. In particular, one can more systematically investigate whether control groups from different counties, states, or regions, or even different time periods, make acceptable nonexperimental controls.

B. Simulations

A second question that is often confounded with that of the validity of the assumptions is that of the relative performance of the various estimators. Suppose one is willing to accept the unconfoundedness and overlap assumptions. Which estimation method is most appropriate in a particular setting? In many of the studies comparing nonexperimental with experimental outcomes, researchers compare results for a number of the techniques described here. Yet in these settings we cannot be certain that the underlying assumptions hold. Thus, although it is useful to compare these techniques in such realistic settings, it is also important to compare them in an artificial environment where one is certain that the underlying assumptions are valid.

There exist a few studies that specifically set out to do this. Frolich (2000) compares a number of matching estimators and local linear regression methods, carefully formalizing fully data-driven procedures for the estimators considered. To make these comparisons he considers a large number of data-generating processes, based on eight different regression functions (including some highly nonlinear and multimodal ones), two different sample sizes, and three different density functions for the covariate (one important limitation is that he restricts the investigation to a single covariate). For the matching estimator Frolich considered a single match with replacement; for the local linear regression estimators he uses data-driven optimal bandwidth choices based on minimizing the mean squared error of the average treatment effect. The first local linear estimator considered is the standard one: at x the regression function μ(x) is estimated as β̂₀ in the minimization problem

  min_{β₀, β₁} Σ_{i=1}^N (Y_i − β₀ − β₁(X_i − x))² K((X_i − x)/h),

with an Epanechnikov kernel. He finds that this has computational problems, as well as poor small-sample properties. He therefore also considers a modification suggested by Seifert and Gasser (1996, 2000). For given x, define x̄ = Σ_i X_i K((X_i − x)/h) / Σ_i K((X_i − x)/h), so that one can write the standard local linear estimator as

  μ̂(x) = T₀/S₀ + (T₁/S₂)(x − x̄),

where, for r = 0, 1, 2, one has S_r = Σ_i K((X_i − x)/h)(X_i − x̄)^r and T_r = Σ_i K((X_i − x)/h)(X_i − x̄)^r Y_i. The Seifert–Gasser modification is to use instead

  μ̂(x) = T₀/S₀ + [T₁/(S₂ + R)](x − x̄),

where the recommended ridge parameter is R = |x − x̄| · 5/(16h), given the Epanechnikov kernel K(u) = (3/4)(1 − u²)1{|u| ≤ 1}. Note that with high-dimensional covariates such a nonnegative kernel would lead to biases that do not vanish fast enough to be dominated by the variance (see the discussion in Heckman, Ichimura, and Todd, 1998). This is not a problem in Frolich’s simulations, as he considers only cases with a single covariate. Frolich finds that the local linear estimator with the Seifert–Gasser modification performs better than either the matching estimator or the standard local linear estimator.
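The ridged estimator is short enough to sketch directly from the formulas above (single covariate, Epanechnikov kernel; the function name and interface are mine, and setting the ridge term R to zero recovers the standard local linear estimator).

```python
import numpy as np

def local_linear_ridge(x0, X, Y, h, ridge=True):
    """Sketch of the local linear estimator with the Seifert-Gasser
    ridge modification, for a single covariate.

    Epanechnikov kernel K(u) = 0.75*(1 - u^2) on |u| <= 1.  With x_bar
    the kernel-weighted mean of the X_i near x0,
    S_r = sum K_i (X_i - x_bar)^r and T_r = sum K_i (X_i - x_bar)^r Y_i,
    the estimate is T0/S0 + [T1/(S2 + R)] (x0 - x_bar), with ridge term
    R = |x0 - x_bar| * 5/(16*h); R = 0 gives the standard estimator.
    """
    u = (X - x0) / h
    K = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
    if K.sum() == 0:
        return np.nan  # no observations within the bandwidth window
    x_bar = (K * X).sum() / K.sum()
    d = X - x_bar
    S0, S2 = K.sum(), (K * d ** 2).sum()
    T0, T1 = (K * Y).sum(), (K * d * Y).sum()
    R = abs(x0 - x_bar) * 5.0 / (16.0 * h) if ridge else 0.0
    return T0 / S0 + T1 / (S2 + R) * (x0 - x_bar)
```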

Zhao (2004) uses simulation methods to compare matching and parametric regression estimators. He uses metrics based on the propensity score, the covariates, and estimated regression functions. Using designs with varying numbers of covariates and linear regression functions, Zhao finds there is no clear winner among the different estimators, although he notes that using the outcome data in choosing the metric appears a promising strategy.

Abadie and Imbens (2002) study their matching estimator using a data-generating process inspired by the LaLonde study to allow for substantial nonlinearity, fitting a separate binary response model to the zeros in the earnings outcome, and a log linear model for the positive observations. The regression estimators include linear and quadratic models (the latter with a full set of interactions), with seven covariates. This study finds that the matching estimators, and in particular the bias-adjusted alternatives, outperform the linear and quadratic regression estimators (the former using 7 covariates, the latter 35, after dropping squares and interactions that lead to perfect collinearity). Their simulations also suggest that with few matches (between one and four) matching estimators are not sensitive to the number of matches used, and that their confidence intervals have actual coverage rates close to the nominal values.

The results from these simulation studies are overall somewhat inconclusive; it is clear that more work is required. Future simulations may usefully focus on some of the following issues. First, it is obviously important to closely model the data-generating process on actual data sets, to ensure that the results have some relevance for practice. Ideally one would build the simulations around a number of specific data sets through a range of data-generating processes. Second, it is important to have fully data-driven procedures that define an estimator as a function of (Y_i, W_i, X_i), i = 1, …, N, as seen in Frolich (2000). For the matching estimators this is relatively straightforward, but for some others this requires more care. This will allow other researchers to consider meaningful comparisons across the various estimators.

Finally, we need to learn which features of the data-generating process are important for the properties of the various estimators. For example, do some estimators deteriorate more rapidly than others when a data set has many covariates and few observations? Are some estimators more robust against high correlations between covariates and outcomes, or high correlations between covariates and treatment indicators? Which estimators are more likely to give conservative answers in terms of precision? Since it is clear that no estimator is always going to dominate all others, what is important is to isolate salient features of the data-generating processes that lead to preferring one alternative over another. Ideally we need descriptive statistics summarizing the features of the data that provide guidance in choosing the estimator that will perform best in a given situation.

VII. Conclusion

In this paper I have attempted to review the current state of the literature on inference for average treatment effects under the assumption of unconfoundedness. This has recently been a very active area of research where many new semi- and nonparametric econometric methods have been applied and developed. The research has moved a long way from relying on simple least squares methods for estimating average treatment effects.

The primary estimators in the current literature include propensity-score methods and pairwise matching, as well as nonparametric regression methods. Efficiency bounds have been established for a number of the average treatment effects estimable with these methods, and a variety of these estimators rely on the weakest assumptions that allow point identification. Researchers have suggested several ways of estimating the variance of these average-treatment-effect estimators. One, more cumbersome, approach requires estimating each component of the variance nonparametrically. A more common method relies on bootstrapping. A third alternative, developed by Abadie and Imbens (2002) for the matching estimator, requires no additional nonparametric estimation. There is, as yet, no consensus on which are the best estimation methods to apply in practice. Nevertheless, the applied researcher now has a large number of new estimators at her disposal.

Challenges remain in making the new tools more easily applicable. Although software is available to implement some of the estimators (see Becker and Ichino, 2002; Sianesi, 2001; Abadie et al., 2003), many remain difficult to apply. A particularly urgent task is therefore to provide fully implementable versions of the various estimators that do not require the applied researcher to choose bandwidths or other smoothing parameters. This is less of a concern for matching methods, and probably explains a large part of their popularity. Another outstanding question is the relative performance of these methods in realistic settings with large numbers of covariates and varying degrees of smoothness in the conditional means of the potential outcomes and the propensity score.

Once these issues have been resolved, today’s applied evaluators will benefit from a new set of reliable, econometrically defensible, and robust methods for estimating the average treatment effect of current social policy programs under exogeneity assumptions.

REFERENCES

Abadie, A., “Semiparametric Instrumental Variable Estimation of Treatment Response Models,” Journal of Econometrics 113:2 (2003a), 231–263.

Abadie, A., “Semiparametric Difference-in-Differences Estimators,” forthcoming, Review of Economic Studies (2003b).

Abadie, A., J. Angrist, and G. Imbens, “Instrumental Variables Estimation of Quantile Treatment Effects,” Econometrica 70:1 (2002), 91–117.

Abadie, A., D. Drukker, H. Herr, and G. Imbens, “Implementing Matching Estimators for Average Treatment Effects in STATA,” Department of Economics, University of California, Berkeley, unpublished manuscript (2003).

Abadie, A., and G. Imbens, “Simple and Bias-Corrected Matching Estimators for Average Treatment Effects,” NBER technical working paper no. 283 (2002).

Abbring, J., and G. van den Berg, “The Non-parametric Identification of Treatment Effects in Duration Models,” Free University of Amsterdam, unpublished manuscript (2002).

Angrist, J., “Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants,” Econometrica 66:2 (1998), 249–288.

Angrist, J. D., and J. Hahn, “When to Control for Covariates? Panel-Asymptotic Results for Estimates of Treatment Effects,” NBER technical working paper no. 241 (1999).

Angrist, J. D., G. W. Imbens, and D. B. Rubin, “Identification of Causal Effects Using Instrumental Variables,” Journal of the American Statistical Association 91 (1996), 444–472.

Angrist, J. D., and A. B. Krueger, “Empirical Strategies in Labor Economics,” in O. Ashenfelter and D. Card (Eds.), Handbook of Labor Economics vol. 3 (New York: Elsevier Science, 2000).

Angrist, J., and V. Lavy, “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement,” Quarterly Journal of Economics CXIV (1999), 1243.

Ashenfelter, O., “Estimating the Effect of Training Programs on Earnings,” this REVIEW 60 (1978), 47–57.

Ashenfelter, O., and D. Card, “Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs,” this REVIEW 67 (1985), 648–660.

Athey, S., and G. Imbens, “Identification and Inference in Nonlinear Difference-in-Differences Models,” NBER technical working paper no. 280 (2002).

Athey, S., and S. Stern, “An Empirical Framework for Testing Theories about Complementarity in Organizational Design,” NBER working paper no. 6600 (1998).

Barnow, B. S., G. G. Cain, and A. S. Goldberger, “Issues in the Analysis of Selectivity Bias,” in E. Stromsdorfer and G. Farkas (Eds.), Evaluation Studies vol. 5 (San Francisco: Sage, 1980).

Becker, S., and A. Ichino, “Estimation of Average Treatment Effects Based on Propensity Scores,” The Stata Journal 2:4 (2002), 358–377.

Bitler, M., J. Gelbach, and H. Hoynes, “What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments,” Department of Economics, University of Maryland, unpublished paper (2002).

Bjorklund, A., and R. Moffitt, “The Estimation of Wage Gains and Welfare Gains in Self-Selection Models,” this REVIEW 69 (1987), 42–49.

Black, S., “Do Better Schools Matter? Parental Valuation of Elementary Education,” Quarterly Journal of Economics CXIV (1999), 577.

Blundell, R., and M. Costa-Dias, “Alternative Approaches to Evaluation in Empirical Microeconomics,” Institute for Fiscal Studies, Cemmap working paper cwp10/02 (2002).

Blundell, R., A. Gosling, H. Ichimura, and C. Meghir, “Changes in the Distribution of Male and Female Wages Accounting for the Employment Composition,” Institute for Fiscal Studies, London, unpublished paper (2002).

Card, D., and D. Sullivan, “Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment,” Econometrica 56:3 (1988), 497–530.

Chernozhukov, V., and C. Hansen, “An IV Model of Quantile Treatment Effects,” Department of Economics, MIT, unpublished working paper (2001).

Cochran, W., “The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies,” Biometrics 24 (1968), 295–314.

Cochran, W., and D. Rubin, “Controlling Bias in Observational Studies: A Review,” Sankhyā 35 (1973), 417–446.

Dehejia, R., “Was There a Riverside Miracle? A Hierarchical Framework for Evaluating Programs with Grouped Data,” Journal of Business and Economic Statistics 21:1 (2002), 1–11.

Dehejia, R., “Practical Propensity Score Matching: A Reply to Smith and Todd,” forthcoming, Journal of Econometrics (2003).

Dehejia, R., and S. Wahba, “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs,” Journal of the American Statistical Association 94 (1999), 1053–1062.

Doksum, K., “Empirical Probability Plots and Statistical Inference for Nonlinear Models in the Two-Sample Case,” Annals of Statistics 2 (1974), 267–277.

Efron, B., and R. Tibshirani, An Introduction to the Bootstrap (New York: Chapman and Hall, 1993).

Engle, R., D. Hendry, and J.-F. Richard, “Exogeneity,” Econometrica 51:2 (1983), 277–304.

Firpo, S., “Efficient Semiparametric Estimation of Quantile Treatment Effects,” Department of Economics, University of California, Berkeley, PhD thesis (2002), chapter 2.

Fisher, R. A., The Design of Experiments (Boyd, London, 1935).

Fitzgerald, J., P. Gottschalk, and R. Moffitt, “An Analysis of Sample Attrition in Panel Data: The Michigan Panel Study of Income Dynamics,” Journal of Human Resources 33 (1998), 251–299.

Fraker, T., and R. Maynard, “The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs,” Journal of Human Resources 22:2 (1987), 194–227.

Friedlander, D., and P. Robins, “Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods,” American Economic Review 85 (1995), 923–937.

Frolich, M., “Treatment Evaluation: Matching versus Local Polynomial Regression,” Department of Economics, University of St. Gallen, discussion paper no. 2000-17 (2000).

Frolich, M., “What is the Value of Knowing the Propensity Score for Estimating Average Treatment Effects,” Department of Economics, University of St. Gallen (2002).

Gill, R., and J. Robins, “Causal Inference for Complex Longitudinal Data: The Continuous Case,” Annals of Statistics 29:6 (2001), 1785–1811.

Gu, X., and P. Rosenbaum, “Comparison of Multivariate Matching Methods: Structures, Distances and Algorithms,” Journal of Computational and Graphical Statistics 2 (1993), 405–420.

Hahn, J., “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects,” Econometrica 66:2 (1998), 315–331.

Hahn, J., P. Todd, and W. Van der Klaauw, “Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design,” Econometrica 69:1 (2001), 201–209.

Ham, J., and R. LaLonde, “The Effect of Sample Selection and Initial Conditions in Duration Models: Evidence from Experimental Data on Training,” Econometrica 64:1 (1996).

Heckman, J., and J. Hotz, “Alternative Methods for Evaluating the Impact of Training Programs” (with discussion), Journal of the American Statistical Association 84:408 (1989), 862–874.

Heckman, J., H. Ichimura, and P. Todd, “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program,” Review of Economic Studies 64 (1997), 605–654.

Heckman, J., H. Ichimura, and P. Todd, “Matching as an Econometric Evaluation Estimator,” Review of Economic Studies 65 (1998), 261–294.

Heckman, J., H. Ichimura, J. Smith, and P. Todd, “Characterizing Selection Bias Using Experimental Data,” Econometrica 66 (1998), 1017–1098.

Heckman, J., R. LaLonde, and J. Smith, “The Economics and Econometrics of Active Labor Market Programs,” in O. Ashenfelter and D. Card (Eds.), Handbook of Labor Economics vol. 3 (New York: Elsevier Science, 2000).

Heckman, J., and R. Robb, “Alternative Methods for Evaluating the Impact of Interventions,” in J. Heckman and B. Singer (Eds.), Longitudinal Analysis of Labor Market Data (Cambridge, U.K.: Cambridge University Press, 1984).

Heckman, J., J. Smith, and N. Clements, “Making the Most out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts,” Review of Economic Studies 64 (1997), 487–535.

Hirano, K., and G. Imbens, “Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization,” Health Services and Outcomes Research Methodology 2 (2001), 259–278.

Hirano, K., G. Imbens, and G. Ridder, “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica 71:4 (2003), 1161–1189.

Holland, P., “Statistics and Causal Inference” (with discussion), Journal of the American Statistical Association 81 (1986), 945–970.

Horowitz, J., “The Bootstrap,” in J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, vol. 5 (Elsevier North Holland, 2002).

Hotz, J., G. Imbens, and J. Klerman, “The Long-Term Gains from GAIN: A Re-analysis of the Impacts of the California GAIN Program,” Department of Economics, UCLA, unpublished manuscript (2001).

Hotz, J., G. Imbens, and J. Mortimer, “Predicting the Efficacy of Future Training Programs Using Past Experiences,” forthcoming, Journal of Econometrics (2003).

Ichimura, H., and O. Linton, “Asymptotic Expansions for Some Semiparametric Program Evaluation Estimators,” Institute for Fiscal Studies, cemmap working paper cwp04/01 (2001).

Ichimura, H., and C. Taber, “Direct Estimation of Policy Effects,” Department of Economics, Northwestern University, unpublished manuscript (2000).

Imbens, G., “The Role of the Propensity Score in Estimating Dose-Response Functions,” Biometrika 87:3 (2000), 706–710.

Imbens, G., “Sensitivity to Exogeneity Assumptions in Program Evaluation,” American Economic Review Papers and Proceedings (2003).

Imbens, G., and J. Angrist, “Identification and Estimation of Local Average Treatment Effects,” Econometrica 62:2 (1994), 467–475.

Imbens, G., W. Newey, and G. Ridder, “Mean-Squared-Error Calculations for Average Treatment Effects,” Department of Economics, UC Berkeley, unpublished manuscript (2003).

LaLonde, R. J., “Evaluating the Econometric Evaluations of Training Programs with Experimental Data,” American Economic Review 76 (1986), 604–620.

Lechner, M., “Earnings and Employment Effects of Continuous Off-the-Job Training in East Germany after Unification,” Journal of Business and Economic Statistics 17:1 (1999), 74–90.

Lechner, M., “Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption,” in M. Lechner and F. Pfeiffer (Eds.), Econometric Evaluations of Active Labor Market Policies in Europe (Heidelberg: Physica, 2001).

Lechner, M., “Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies,” this REVIEW 84:2 (2002), 205–220.

Lee, D., “The Electoral Advantage of Incumbency and the Voter’s Valuation of Political Experience: A Regression Discontinuity Analysis of Close Elections,” Department of Economics, University of California, unpublished manuscript (2001).

Lehmann, E., Nonparametrics: Statistical Methods Based on Ranks (San Francisco: Holden-Day, 1974).

Manski, C., “Nonparametric Bounds on Treatment Effects,” American Economic Review Papers and Proceedings 80 (1990), 319–323.

Manski, C., G. Sandefur, S. McLanahan, and D. Powers, “Alternative Estimates of the Effect of Family Structure During Adolescence on High School,” Journal of the American Statistical Association 87:417 (1992), 25–37.

Manski, C., Partial Identification of Probability Distributions (New York: Springer-Verlag, 2003).

Neyman, J., “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9” (1923), translated (with discussion) in Statistical Science 5:4 (1990), 465–480.

Politis, D., and J. Romano, Subsampling (Springer-Verlag, 1999).

Porter, J., “Estimation in the Regression Discontinuity Model,” Harvard University, unpublished manuscript (2003).

Quade, D., “Nonparametric Analysis of Covariance by Matching,” Biometrics 38 (1982), 597–611.

Robins, J., and Y. Ritov, “Towards a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models,” Statistics in Medicine 16 (1997), 285–319.

Robins, J. M., and A. Rotnitzky, “Semiparametric Efficiency in Multivariate Regression Models with Missing Data,” Journal of the American Statistical Association 90 (1995), 122–129.

Robins, J. M., A. Rotnitzky, and L.-P. Zhao, “Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data,” Journal of the American Statistical Association 90 (1995), 106–121.

Rosenbaum, P., “Conditional Permutation Tests and the Propensity Score in Observational Studies,” Journal of the American Statistical Association 79 (1984a), 565–574.

Rosenbaum, P., “The Consequences of Adjustment for a Concomitant Variable That Has Been Affected by the Treatment,” Journal of the Royal Statistical Society, Series A 147 (1984b), 656–666.

Rosenbaum, P., “The Role of a Second Control Group in an Observational Study” (with discussion), Statistical Science 2:3 (1987), 292–316.

Rosenbaum, P., “Optimal Matching in Observational Studies,” Journal of the American Statistical Association 84 (1989), 1024–1032.

Rosenbaum, P., Observational Studies (New York: Springer-Verlag, 1995).

Rosenbaum, P., “Covariance Adjustment in Randomized Experiments and Observational Studies,” Statistical Science 17:3 (2002), 286–304.

Rosenbaum, P., and D. Rubin, “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika 70 (1983a), 41–55.

Rosenbaum, P., and D. Rubin, “Assessing the Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome,” Journal of the Royal Statistical Society, Series B 45 (1983b), 212–218.

Rosenbaum, P., and D. Rubin, “Reducing the Bias in Observational Studies Using Subclassification on the Propensity Score,” Journal of the American Statistical Association 79 (1984), 516–524.

Rosenbaum, P., and D. Rubin, “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score,” American Statistician 39 (1985), 33–38.

Rubin, D., “Matching to Remove Bias in Observational Studies,” Biometrics 29 (1973a), 159–183.

Rubin, D., “The Use of Matched Sampling and Regression Adjustments to Remove Bias in Observational Studies,” Biometrics 29 (1973b), 185–203.

Rubin, D., “Estimating Causal Effects of Treatments in Randomized and Non-randomized Studies,” Journal of Educational Psychology 66 (1974), 688–701.

Rubin, D., “Assignment to Treatment Group on the Basis of a Covariate,” Journal of Educational Statistics 2:1 (1977), 1–26.

Rubin, D., “Bayesian Inference for Causal Effects: The Role of Randomization,” Annals of Statistics 6 (1978), 34–58.

Rubin, D., “Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies,” Journal of the American Statistical Association 74 (1979), 318–328.

Rubin, D., and N. Thomas, “Affinely Invariant Matching Methods with Ellipsoidal Distributions,” Annals of Statistics 20:2 (1992), 1079–1093.

Seifert, B., and T. Gasser, “Finite-Sample Variance of Local Polynomials: Analysis and Solutions,” Journal of the American Statistical Association 91 (1996), 267–275.

Seifert, B., and T. Gasser, “Data Adaptive Ridging in Local Polynomial Regression,” Journal of Computational and Graphical Statistics 9:2 (2000), 338–360.

Shadish, W., T. Cook, and D. Campbell, Experimental and Quasi-experimental Designs for Generalized Causal Inference (Boston: Houghton Mifflin, 2002).

Sianesi, B., “psmatch: Propensity Score Matching in STATA,” University College London and Institute for Fiscal Studies (2001).

Smith, J. A., and P. E. Todd, “Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods,” American Economic Review Papers and Proceedings 91 (2001), 112–118.

Smith, J. A., and P. E. Todd, “Does Matching Address LaLonde’s Critique of Nonexperimental Estimators,” forthcoming, Journal of Econometrics (2003).

Van der Klaauw, W., “A Regression-Discontinuity Evaluation of the Effect of Financial Aid Offers on College Enrollment,” International Economic Review 43:4 (2002), 1249–1287.

Zhao, Z., “Using Matching to Estimate Treatment Effects: Data Requirements, Matching Metrics, and Monte Carlo Evidence,” this REVIEW, this issue (2004).
