
An Honest Approach to Parallel Trends∗

    Ashesh Rambachan† Jonathan Roth‡

    November 12, 2020

    Abstract

This paper proposes robust inference methods for difference-in-differences and event-study designs that do not require that the parallel trends assumption holds exactly. Instead, the researcher must only impose restrictions on the possible differences in trends between the treated and control groups. Several common intuitions expressed in applied work can be captured by such restrictions, including the notion that pre-treatment differences in trends are informative about counterfactual post-treatment differences in trends. Our methodology then guarantees uniformly valid (“honest”) inference when the imposed restrictions are satisfied. We first show that fixed length confidence intervals have near-optimal expected length for a practically-relevant class of restrictions. We next introduce a novel inference procedure that accommodates a wider range of restrictions, which is based on the observation that inference in our setting is equivalent to testing a system of moment inequalities with a large number of linear nuisance parameters. The resulting confidence sets are consistent, and have optimal local asymptotic power for many parameter configurations. We recommend researchers conduct sensitivity analyses to show what conclusions can be drawn under various restrictions on the possible differences in trends.

Keywords: Difference-in-differences, event-study, parallel trends, sensitivity analysis, robust inference, partial identification.

∗We are grateful to Isaiah Andrews, Elie Tamer, and Larry Katz for their invaluable advice and encouragement. We also thank Gary Chamberlain, Raj Chetty, Peter Ganong, Ed Glaeser, Nathan Hendren, Ryan Hill, Ariella Kahn-Lang, Jens Ludwig, Sendhil Mullainathan, Claudia Noack, Frank Pinter, Adrienne Sabety, Pedro Sant’Anna, Jesse Shapiro, Neil Shephard, Jann Spiess, Jim Stock, and seminar participants at Brown, Chicago Booth, Dartmouth, Harvard, Michigan, Microsoft, Princeton, Rochester, UCL, and Yale for helpful comments, and Dorian Carloni for sharing data. We gratefully acknowledge financial support from the NSF Graduate Research Fellowship under Grant DGE1745303 (Rambachan) and Grant DGE1144152 (Roth).
†Harvard University, Department of Economics. Email: [email protected]
‡Microsoft. Email: [email protected]


1 Introduction

This paper develops robust causal inference methods for difference-in-differences and related event-study designs.¹ Traditional methods for inference are valid only under the so-called “parallel trends” assumption, yet researchers are often unsure whether this assumption holds in practice. We instead propose methodology that allows for valid inference under weaker assumptions and enables the researcher to conduct sensitivity analysis with respect to these assumptions.

Our methods only require the researcher to impose that the possible differences in trends are restricted to some set ∆. A variety of intuitions expressed in applied work can be formalized via such restrictions. For instance, the intuition for the common practice of testing for pre-existing differences in trends (“pre-trends”) can be formalized via restrictions that impose that the pre-trends are informative about the counterfactual post-treatment differences in trends. Likewise, context-specific knowledge about long-run secular trends or simultaneous policy changes may motivate restrictions that the difference in trends be monotone or have a particular sign. We consider a large class of possible ∆’s that allows the researcher to formalize the aforementioned intuitions as well as many others. Under such restrictions, the treatment effect of interest is typically set-identified.

We then develop methods to conduct uniformly valid inference given a set of restrictions ∆. We introduce two methods for inference, which we show have different strengths depending on the type of restriction ∆ that is imposed, as well as a hybrid approach that combines the two.

Our first method for inference uses optimal fixed length confidence intervals (FLCIs) based on affine estimators, following Donoho (1994). FLCIs have attractive properties for certain types of restrictions – specifically, when ∆ is convex and centrosymmetric – which include our baseline smoothness class used to formalize the intuition behind pre-trends testing. For these types of restrictions, results from Armstrong and Kolesar (2018, 2020) imply that the optimal FLCI has near-optimal expected length when in fact parallel trends holds.

Unfortunately, however, we show that FLCIs can have poor properties for broader classes of restrictions, such as those that incorporate sign or shape restrictions. We provide a novel characterization of when the optimal FLCI is consistent, meaning that any fixed point outside of the identified set falls outside the FLCI with probability approaching one asymptotically. The optimal FLCI is consistent for all parameter values if and only if the length of the identified set is constant over the parameter space. This condition often fails for several leading restrictions on the class of differential trends, such as those that incorporate sign or shape restrictions. Our (in)consistency result may be of general interest in other applications where FLCIs have been used.

1 Throughout the paper, we use the phrase “event-study” to refer to a large class of specifications that estimate dynamic treatment effects along with placebo pre-treatment effects. This includes, but is not limited to, settings with staggered treatment timing; see Related Literature and Section 7 below.

Motivated by this result, we introduce a second method for inference that delivers desirable asymptotic properties over a larger class of possible restrictions on the possible differences in trends. We exploit the observation that conducting inference on the treatment effect of interest under such restrictions is equivalent to a moment inequality problem with linear nuisance parameters. A practical challenge is that the number of nuisance parameters scales linearly in the number of post-treatment periods, and thus will often be large (above 10) in typical empirical applications. This renders many moment inequality procedures, which rely on test inversion over a grid for the full parameter vector, computationally infeasible. We overcome this computational challenge by employing the conditional inference procedure developed in Andrews, Roth and Pakes (2019, henceforth ARP), which delivers computationally tractable confidence sets that uniformly control size in our setting.

We then derive two novel results on the asymptotic performance of conditional confidence sets in our setting. First, we show that the conditional confidence sets are consistent for all polyhedral ∆. Second, we provide a condition under which the conditional confidence sets have local asymptotic power converging to the power envelope (i.e., the upper bound on the power of any procedure that controls size uniformly).² To prove the optimality result, we make use of duality results from linear programming and the Neyman-Pearson lemma to show that both the optimal test and our conditional test converge to a t-test in the direction of the Lagrange multipliers of the linear program that profiles out the nuisance parameters. Both of these asymptotic results are novel, and exploit additional structure in our context not present in the more general setting considered in ARP.

Our results on the local asymptotic power of the conditional confidence sets have two limitations. First, the condition needed for the conditional confidence sets to have optimal local asymptotic power does not hold for all parameter values, and in particular fails when the parameter of interest is point-identified. Second, our optimality results are under asymptotics where sampling variation grows small relative to the length of the identified set. Our asymptotic power results thus may not translate to good finite-sample power when sampling variation is large relative to the length of the identified set, particularly when there are non-binding moment restrictions that are close to binding.

Therefore, we finally introduce a novel hybrid inference procedure that combines the relative strengths of the conditional confidence sets and FLCIs. The hybrid procedure is consistent, and has near-optimal local asymptotic power when the condition for the optimality of the conditional approach holds. Further, we find in simulations (discussed in more detail below) that hybridization with the FLCIs improves performance in finite sample when the binding and non-binding moments are not well-separated.

2 This condition is implied by the linear independence constraint qualification (LICQ), which has been used recently in the moment inequality settings studied in Gafarov (2019); Cho and Russell (2019); Flynn (2019); Kaido and Santos (2014). In contrast to many previous uses of LICQ, however, we do not require this condition to obtain uniform asymptotic size control. See Section 4.3 for further discussion.

To explore the performance of our methods in practice, we conduct simulations calibrated to the 12 recently-published empirical papers surveyed in Roth (2019). We find that the FLCIs perform close to the optimal benchmark for excess length when ∆ is our baseline smoothness class that satisfies the assumptions needed for the consistency and finite-sample near-optimality of the FLCIs. However, as predicted by the theory, the FLCIs can perform poorly relative to the other methods when ∆ additionally includes sign or shape restrictions. The conditional confidence sets perform well across a wider range of simulation designs, and often have excess length within a few percent of the optimum. However, we find they can exhibit poor performance in settings where the binding and non-binding moments are not well-separated relative to the sampling variation in the data, such as when the target parameter is (nearly) point identified. Finally, the hybrid approach performs quite well across a wide range of specifications, with performance typically close to the better of the FLCIs and conditional approach. Based on our Monte Carlo simulations, we recommend the FLCIs for the special settings where the conditions for their consistency and finite-sample near-optimality hold, and otherwise recommend the hybrid approach.

We recommend applied researchers use our methods to conduct sensitivity analyses in which they report confidence sets under varying restrictions on the set of possible differences in trends. For instance, one class of restrictions we consider restricts the extent to which the difference in trends may deviate from linearity, and is governed by a single parameter that determines the degree of possible non-linearity. If the researcher is interested in testing a particular null hypothesis — e.g., the treatment effect in a particular period is zero — then a simple statistic to report is the “breakdown” value of the non-linearity parameter at which the null hypothesis of interest can no longer be rejected.³ The researcher can also report how her conclusions change with the inclusion of additional sign or monotonicity restrictions motivated by context-specific knowledge. Performing such sensitivity analyses makes clear what must be assumed about the possible differences in trends in order to draw specific causal conclusions. We provide an R package, HonestDiD, for implementation of our recommended methods.⁴ We conclude by applying our proposed methodology to two recently published papers.

3 Similar “breakdown” concepts have been proposed in other settings with partial identification (Horowitz and Manski, 1995; Kline and Santos, 2013; Masten and Poirier, 2020).

4 The latest version of HonestDiD may be downloaded at https://github.com/asheshrambachan/HonestDiD.
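To make the “breakdown” idea concrete, the following minimal sketch (ours, not from the paper or the HonestDiD package) finds the breakdown value of the non-linearity parameter by bisection. Here `rejects_null` is a hypothetical user-supplied function mapping M to whether the null is rejected, assumed monotone in M:

```python
def breakdown_value(rejects_null, M_hi=10.0, tol=1e-3):
    """Approximate the largest M at which the null is still rejected.

    `rejects_null` is a hypothetical function M -> bool (e.g., whether a
    robust confidence set under the smoothness class with parameter M
    excludes zero); we assume rejection is monotone: once the null
    survives at some M, it survives for all larger M.
    """
    lo, hi = 0.0, M_hi
    if not rejects_null(lo):
        return 0.0  # not rejected even under an exactly linear trend
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rejects_null(mid):
            lo = mid  # still rejected: the breakdown value is larger
        else:
            hi = mid
    return lo
```

Bisection is natural here because the confidence sets below are nested as M grows, so the rejection region in M is an interval starting at zero.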

Related literature: This paper contributes to an active literature on difference-in-differences and event-study designs by developing robust inference methods that allow for possible violations of the parallel trends assumption. Our approach is most closely related to Manski and Pepper (2018), who show that the treatment effect of interest is partially identified under “bounded variation” assumptions that relax the usual exact parallel trends assumption. We consider identification under a broad class of restrictions on the possible violations of parallel trends, which nests as a special case the bounded variation assumptions considered in Manski and Pepper (2018). This broader class of restrictions can be used to formalize a variety of arguments made intuitively in applied work — e.g., that pre-treatment differences in trends are informative about counterfactual post-treatment differences in trends. Importantly, we develop methods for conducting inference on the causal effects of treatment under these assumptions, whereas Manski and Pepper (2018) only consider identification.

Several other recent papers consider various relaxations of the parallel trends assumption. Keele, Small, Hsu and Fogarty (2019) develop techniques for testing the sensitivity of difference-in-differences designs to violations of the parallel trends assumption, but they do not incorporate information from the observed pre-trends in their sensitivity analysis. Empirical researchers commonly adjust for the extrapolation of a linear trend from the pre-treatment periods when there are concerns about violations of the parallel trends assumption, which is valid if the difference in trends is exactly linear (e.g., Dobkin, Finkelstein, Kluender and Notowidigdo, 2018; Goodman-Bacon, 2018a,b; Bhuller, Havnes, Leuven and Mogstad, 2013). Our methods nest this approach as a special case, but allow for valid inference under less restrictive assumptions about the class of possible differences in trends. Freyaldenhoven, Hansen and Shapiro (2019) propose a method that allows for violations of the parallel trends assumption but requires an additional covariate that is affected by the same confounding factors as the outcome but not by the treatment of interest. Ye, Keele, Hasegawa and Small (2020) consider partial identification of treatment effects when there exist two control groups whose outcomes have a bracketing relationship with the outcome of the treated group. Leavitt (2020) proposes an empirical Bayes approach calibrated to pre-treatment differences in trends, and Bilinski and Hatfield (2020) and Dette and Schumann (2020) propose approaches based on pre-tests for the magnitude of the pre-treatment violations of parallel trends.

Our methods address several concerns related to established empirical practice in difference-in-differences and event-study designs. First, common tests for pre-trends may be underpowered against meaningful violations of parallel trends, potentially leading to severe undercoverage of conventional confidence intervals (Freyaldenhoven et al., 2019; Roth, 2019; Bilinski and Hatfield, 2020; Kahn-Lang and Lang, 2020). Second, statistical distortions from pre-testing for pre-trends may further undermine the performance of conventional inference procedures (Roth, 2019). Third, parametric approaches to controlling for pre-existing trends may be quite sensitive to functional form assumptions (Wolfers, 2006; Lee and Solon, 2011). We address these issues by providing tools for inference that do not rely on an exact parallel trends assumption and that make clear the mapping between assumptions on the potential differences in trends and the strength of one’s conclusions.

Our work is complementary to a growing literature on the causal interpretation of event-study coefficients in two-way fixed effects models in the presence of staggered treatment timing or heterogeneous treatment effects (Meer and West, 2016; Borusyak and Jaravel, 2016; Sun and Abraham, 2020; Athey and Imbens, 2018; de Chaisemartin and D’Haultfœuille, 2018; de Chaisemartin and D’Haultfœuille, 2020; Goodman-Bacon, 2018a; Kropko and Kubinec, 2018; Callaway and Sant’Anna, 2020; Imai and Kim, 2020; Słoczyński, 2018). A key finding is that regression coefficients from conventional approaches may not produce convex weighted averages of treatment effects even if parallel trends holds. Several alternative estimators have been proposed that consistently estimate sensible causal estimands under a suitable parallel trends assumption. Our methodology can be applied to assess the sensitivity of results obtained using these methods to violations of the corresponding parallel trends assumption; see Section 7 for additional discussion.

More broadly, our approach relates to a large and active literature on sensitivity analysis and misspecification robust inference, including Imbens (2003); Rosenbaum (2005); Altonji, Elder and Taber (2005); Conley, Hansen and Rossi (2012); Kolesar and Rothe (2018); Armstrong and Kolesar (2018); Masten and Poirier (2018); Bonhomme and Weidner (2020); Oster (2019), among many others.

    2 General set-up

We now introduce the assumptions, target parameter, and inferential goal considered in the paper. In the main text of the paper, we consider a finite-sample normal model, which arises as an asymptotic approximation to a variety of econometric settings of interest. In the supplementary materials, we show how the finite-sample results presented in this model translate to uniform asymptotic statements.


2.1 Finite sample normal model

Consider the following model

$$\hat{\beta}_n \sim \mathcal{N}(\beta, \Sigma_n), \tag{1}$$

where β̂_n ∈ ℝ^{T̲+T̄} and Σ_n = (1/n)Σ* for Σ* a known, positive-definite (T̲+T̄) × (T̲+T̄) matrix. We refer to β̂_n as the estimated event-study coefficients, and partition β̂_n into vectors corresponding with the pre-treatment and post-treatment periods, β̂_n = (β̂'_{n,pre}, β̂'_{n,post})', where β̂_{n,pre} ∈ ℝ^{T̲} and β̂_{n,post} ∈ ℝ^{T̄}. We adopt analogous notation to partition other vectors that are the same length as β̂_n.

This finite sample normal model (1) can be viewed as an asymptotic approximation to a wide range of econometric settings. Under mild regularity conditions, a variety of estimation strategies in difference-in-differences and event study designs will yield asymptotically normal estimated event-study coefficients,

$$\sqrt{n}\left(\hat{\beta}_n - \beta\right) \xrightarrow{d} \mathcal{N}(0, \Sigma^*).^5$$

This convergence in distribution suggests the finite-sample approximation β̂_n ≈_d N(β, Σ_n), where ≈_d denotes approximate equality in distribution and Σ_n = (1/n)Σ*. We derive results assuming this equality in distribution holds exactly in finite samples. In the supplemental materials, we show that results in the finite sample normal model translate to uniform asymptotic statements for a large class of data-generating processes.

    We assume the mean vector β satisfies the following causal decomposition.

Assumption 1. The parameter vector β can be decomposed as

$$\beta = \underbrace{\begin{pmatrix} \tau_{pre} \\ \tau_{post} \end{pmatrix}}_{:=\,\tau} + \underbrace{\begin{pmatrix} \delta_{pre} \\ \delta_{post} \end{pmatrix}}_{:=\,\delta}, \qquad \text{with } \tau_{pre} \equiv 0. \tag{2}$$

The first term, τ, represents the time path of the dynamic causal effects of interest. We assume the treatment has no causal effect prior to its implementation, so τ_pre = 0. The second term, δ, represents the difference in trends between the treated and untreated groups that would have occurred absent treatment. The parallel trends assumption imposes that δ_post = 0. Therefore, under parallel trends, β_post = τ_post.

5 Examples of estimators that yield asymptotically normal event-study estimates include canonical two-way fixed effects estimators, the GMM procedure proposed by Freyaldenhoven et al. (2019), instrumental variables event-studies (Hudson, Hull and Liebersohn, 2017), the estimation strategies of Sun and Abraham (2020) and Callaway and Sant’Anna (2020) to address issues with non-convex weights on cohort-specific effects in staggered treatment designs, as well as a range of procedures that flexibly control for differences in covariates between treated and untreated groups (e.g., Heckman, Ichimura, Smith and Todd, 1998; Abadie, 2005; Sant’Anna and Zhao, 2020).


Example: Difference-in-differences. We observe an outcome Y_it for a sample of individuals i = 1, ..., N for three time periods, t = −1, 0, 1. Individuals in the treated population (D_i = 1) receive a treatment between period t = 0 and t = 1.⁶ The observed outcome equals Y_{i,t} = D_i Y_{i,t}(1) + (1 − D_i) Y_{i,t}(0), where Y_{i,t}(1) and Y_{i,t}(0) are the potential outcomes for individual i in period t associated with the treatment and control conditions. Assume the treatment has no causal effect prior to implementation, meaning Y_{i,t}(1) = Y_{i,t}(0) for t < 1. The causal estimand of interest is the average treatment effect on the treated (ATT), τ_ATT = E[Y_{i,1}(1) − Y_{i,1}(0) | D_i = 1].

In this setting, researchers commonly estimate the “dynamic event study regression”

$$Y_{it} = \lambda_i + \phi_t + \sum_{s \neq 0} \beta_s \times 1[t = s] \times D_i + \epsilon_{it}. \tag{3}$$

The estimated coefficient β̂_1 is the “difference-in-differences” of sample means across treated and untreated groups between period t = 0 and t = 1, β̂_1 = (Ȳ_{1,1} − Ȳ_{1,0}) − (Ȳ_{0,1} − Ȳ_{0,0}), where Ȳ_{d,t} is the sample mean of Y_it for treatment group d in period t. The “pre-period” coefficient β̂_{−1} can likewise be written as β̂_{−1} = (Ȳ_{1,−1} − Ȳ_{1,0}) − (Ȳ_{0,−1} − Ȳ_{0,0}).

Taking expectations and re-arranging, we see that

$$E[\hat{\beta}_1] = \tau_{ATT} + \underbrace{E[Y_{i,1}(0) - Y_{i,0}(0) \mid D_i = 1] - E[Y_{i,1}(0) - Y_{i,0}(0) \mid D_i = 0]}_{\text{Post-period differential trend } =:\, \delta_1},$$

$$E[\hat{\beta}_{-1}] = \underbrace{E[Y_{i,-1}(0) - Y_{i,0}(0) \mid D_i = 1] - E[Y_{i,-1}(0) - Y_{i,0}(0) \mid D_i = 0]}_{\text{Pre-period differential trend } =:\, \delta_{-1}}.$$

The parameter β = E[β̂] thus satisfies the decomposition (2), where τ_post = τ_ATT is the ATT, δ_post = δ_1 is the difference in trends in untreated potential outcomes between t = 0 and t = 1, and δ_pre = δ_{−1} is the analogous difference in trends for untreated potential outcomes between t = −1 and t = 0. Under suitable regularity conditions, β̂ will also satisfy a central limit theorem, so that (1) will hold approximately in large samples.
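For concreteness, here is a minimal simulation sketch (our illustration; the data-generating values are made up) showing how β̂_1 and β̂_{−1} arise as difference-in-differences of sample means in this three-period example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated three-period data: columns are periods t = -1, 0, 1.
# The treated group receives the effect tau_att plus a differential
# trend delta_1 in period t = 1 only (illustrative values).
N1, N0 = 500, 500
tau_att, delta_1 = 1.0, 0.2
Y_treat = rng.normal(size=(N1, 3)) + np.array([0.0, 0.0, tau_att + delta_1])
Y_ctrl = rng.normal(size=(N0, 3))

Ybar1 = Y_treat.mean(axis=0)  # (Ybar_{1,-1}, Ybar_{1,0}, Ybar_{1,1})
Ybar0 = Y_ctrl.mean(axis=0)   # (Ybar_{0,-1}, Ybar_{0,0}, Ybar_{0,1})

# Event-study coefficients, normalized to period t = 0:
beta_1 = (Ybar1[2] - Ybar1[1]) - (Ybar0[2] - Ybar0[1])   # post coefficient
beta_m1 = (Ybar1[0] - Ybar1[1]) - (Ybar0[0] - Ybar0[1])  # pre "placebo"
print(beta_1, beta_m1)  # beta_1 estimates tau_ATT + delta_1; beta_m1 ~ 0
```

The printed β̂_1 is centered at τ_ATT + δ_1 rather than τ_ATT, which is exactly the bias term the decomposition above isolates.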

Remark 1. We have motivated our normal model from a sampling-based perspective, which is the most common framework for uncertainty in the difference-in-differences literature. While applicable in many cases, the sampling view may be unnatural in some settings, such as when the unit of observation is a state and all 50 states are observed (Manski and Pepper, 2018). In Rambachan and Roth (2020), we show that the normal model (1) also arises from a design-based model that treats the population as fixed and views the assignment of treatment as the source of randomness in the data. In that setting, δ is a function of the finite-population covariance between idiosyncratic treatment probabilities and trends in untreated potential outcomes.

6 For the purposes of this example, we think of the observed sample as consisting of N_1 independent draws from the treated (D = 1) population and N_0 independent draws from the control (D = 0) population, with N = N_0 + N_1, as in Abadie and Imbens (2006).

    2.2 Target parameter and identification

The parameter of interest is a scalar, linear combination of the post-treatment causal effects, θ := l'τ_post for some known T̄-vector l. For example, θ equals the t-th period causal effect τ_t when the vector l equals the t-th standard basis vector. Similarly, θ equals the average causal effect across all post-treatment periods when l = (1/T̄, ..., 1/T̄)'. Point-identification of θ is typically obtained by imposing the parallel trends assumption that δ_post = 0.

We relax the parallel trends assumption by instead assuming that δ lies in a set of possible differences in trends ∆, which is specified by the researcher. This nests the usual parallel trends assumption as a special case with ∆ = {δ : δ_post = 0}. Intuitively, since δ_pre = E[β̂_pre] is identified, the assumption that δ = (δ'_pre, δ'_post)' ∈ ∆ restricts the possible values of δ_post given the (identified) value of the pre-treatment difference in trends δ_pre. It is natural to place restrictions on the relationship between δ_pre and δ_post, since researchers frequently test the null hypothesis that δ_pre = 0 as a way of assessing the plausibility of the assumption that δ_post = 0.

Under the assumption that δ ∈ ∆ ≠ {δ : δ_post = 0}, the parameter θ will typically be set-identified. For a given value of β, the set of values θ consistent with β under the assumption δ ∈ ∆ is

$$\mathcal{S}(\beta, \Delta) := \left\{ \theta : \exists \delta \in \Delta,\ \tau_{post} \in \mathbb{R}^{\bar{T}} \text{ s.t. } l'\tau_{post} = \theta,\ \beta = \delta + \begin{pmatrix} 0 \\ \tau_{post} \end{pmatrix} \right\}, \tag{4}$$

which we refer to as the identified set. When ∆ is a closed and convex set, the identified set has a simple characterization.

Lemma 2.1. If ∆ is closed and convex, then S(β, ∆) is an interval in ℝ, S(β, ∆) = [θ^lb(β, ∆), θ^ub(β, ∆)], where

$$\theta^{lb}(\beta, \Delta) := l'\beta_{post} - \underbrace{\Big( \max_{\delta}\ l'\delta_{post}, \ \text{ s.t. } \delta \in \Delta,\ \delta_{pre} = \beta_{pre} \Big)}_{=:\, b^{max}(\beta_{pre};\, \Delta)}, \tag{5}$$

$$\theta^{ub}(\beta, \Delta) := l'\beta_{post} - \underbrace{\Big( \min_{\delta}\ l'\delta_{post}, \ \text{ s.t. } \delta \in \Delta,\ \delta_{pre} = \beta_{pre} \Big)}_{=:\, b^{min}(\beta_{pre};\, \Delta)}. \tag{6}$$

Proof. Re-arranging terms in (4), the identified set can be equivalently written as S(β, ∆) = {θ : ∃δ ∈ ∆ s.t. δ_pre = β_pre, θ = l'β_post − l'δ_post}. The result is then immediate.

Example: Difference-in-differences (continued). Point identification of the ATT in the difference-in-differences design is typically obtained by assuming that the counterfactual post-treatment difference in trends δ_1 is exactly zero. Instead, we consider imposing that δ = (δ_{−1}, δ_1)' ∈ ∆ for some set ∆. The set ∆ places restrictions on the possible values of the counterfactual post-treatment difference in trends δ_1 (which is not directly identified) given the value of the pre-treatment difference in trends δ_{−1} (which is identified). When ∆ is closed and convex, the identified set for the ATT will be [β_1 − b^max, β_1 − b^min], where b^max = max_δ δ_1 s.t. (β_{−1}, δ_1)' ∈ ∆ and b^min is defined analogously.
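When ∆ is polyhedral (as in the leading cases introduced below), b^max and b^min are values of linear programs, so the identified set can be computed directly. A minimal sketch (our illustration, with made-up inputs) for this three-period example under the smoothness class ∆^SD(M) defined in Section 2.3.1, whose single constraint here is |δ_1 + δ_{−1}| ≤ M:

```python
import numpy as np
from scipy.optimize import linprog

M = 0.1
beta_pre, beta_post = 0.05, 1.0  # illustrative event-study coefficients

# Delta^{SD}(M) in the three-period case: |delta_1 + delta_{-1}| <= M,
# written as A delta <= d for delta = (delta_{-1}, delta_1)'.
A = np.array([[1.0, 1.0], [-1.0, -1.0]])
d = np.array([M, M])

# b^max: maximize delta_1 (linprog minimizes, so negate the objective)
# subject to A delta <= d and delta_{-1} fixed at beta_pre.
bounds = [(beta_pre, beta_pre), (None, None)]
b_max = -linprog(c=[0, -1], A_ub=A, b_ub=d, bounds=bounds, method="highs").fun
b_min = linprog(c=[0, 1], A_ub=A, b_ub=d, bounds=bounds, method="highs").fun

print(beta_post - b_max, beta_post - b_min)  # [theta_lb, theta_ub]
```

With these inputs the identified set is [0.95, 1.15], matching the analytic bounds δ_1 ∈ [−δ_{−1} − M, −δ_{−1} + M] derived in the continued example below.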

    2.3 Possible choices of ∆

The class of possible differences in trends ∆ must be specified by the researcher, and the choice of ∆ will depend on the economic context. We highlight several choices of ∆ that may be reasonable in empirical applications and formalize intuitive arguments that are commonly made by applied researchers regarding possible violations of parallel trends. We ultimately recommend that researchers conduct sensitivity analysis with respect to a range of ∆’s that may be plausible in their context.

    2.3.1 Smoothness restrictions

We begin by introducing a class of restrictions that formalizes the intuition behind the common practice of testing for pre-existing differences in trends (pre-trends). Researchers frequently test for pre-trends as a way of assessing the plausibility of the parallel trends assumption. These tests are motivated by the intuition that the pre-trend is informative about the counterfactual post-treatment difference in trends. In other words, the difference in trends must evolve “smoothly” over time; if not, then the fact that the pre-trend is close to zero would not be informative about the validity of the parallel trends assumption, since the (counterfactual) difference in trends could be close to zero in the pre-treatment period and then jump around the time of treatment.

We formalize this logic by introducing smoothness restrictions on the possible differences in trends. Specifically, we bound the extent to which the difference in trends can deviate from linearity. Deviations from linearity are a natural starting point since applied researchers concerned about possible violations of the parallel trends assumption commonly include treatment-group specific linear trends in their regression specifications.⁷ There are concerns, however, that this linear extrapolation of the pre-trend may not quite be correct (Wolfers, 2006; Lee and Solon, 2011). A natural relaxation is therefore to require only that the difference in trends not deviate “too much” from linearity. We formalize this by bounding the extent to which the slope of the differential trend may change between consecutive periods, requiring that δ lie in the set

$$\Delta^{SD}(M) := \{\delta : |(\delta_{t+1} - \delta_t) - (\delta_t - \delta_{t-1})| \leq M,\ \forall t\}, \tag{7}$$

where for t > 0, δ_t refers to the t-th element of δ_post, δ_{−t} refers to the t-th element of δ_pre, and we adopt the convention that δ_0 = 0.⁸ The parameter M ≥ 0 governs the amount by which the slope of δ can change between consecutive periods.⁹ In the special case where M = 0, ∆^SD(0) requires that the difference in trends be exactly linear.

Figure 1: Intuition for ∆^SD(M)

Alternatively, one might allow the smoothness of the differential trend to depend on the magnitude of the pre-trend. For instance, applied researchers may have intuition that if the observed pre-treatment difference in trends is small, then the counterfactual post-treatment difference in trends would also be small. If the two groups did not follow similar trends in the pre-treatment period, though, it may be more plausible that the difference in trends between the two groups would have changed substantially between the pre- and post-treatment periods. This intuition may be formalized by bounding the percentage change in the slope of the differential trend across periods,

$$\Delta^{RM}(\bar{M}) := \{\delta : |\delta_{t+1} - \delta_t| \leq \bar{M}\,|\delta_t - \delta_{t-1}|,\ \forall t\}. \tag{8}$$

7 That is, researchers may augment specification (3) with group-specific linear trends, an approach Dobkin et al. (2018) refer to as a “parametric event-study.” An analogous approach is to estimate a linear trend using only observations prior to treatment, and then subtract out the estimated linear trend from the observations after treatment (Bhuller et al., 2013; Goodman-Bacon, 2018a,b).

8 Setting δ_0 = 0 corresponds with the common practice of normalizing β_0 = 0, as in specification (3).

9 ∆^SD(M) bounds the discrete analog of the second derivative of δ, and is thus similar to restrictions on the second derivative of the conditional expectation function or density in regression discontinuity settings (Kolesar and Rothe, 2018; Frandsen, 2016; Noack and Rothe, 2020). Smoothness restrictions are also used to obtain partial identification in Kim, Kwon, Kwon and Lee (2018).

Example: Difference-in-differences (continued). In the three-period difference-in-differences model, assuming the differential trend is exactly linear is equivalent to assuming ∆ = {δ : δ_1 = −δ_{−1}}. In contrast, assuming δ ∈ ∆^SD(M) only requires that the linear extrapolation be approximately correct, δ_1 ∈ [−δ_{−1} − M, −δ_{−1} + M]. Likewise, assuming δ ∈ ∆^RM(M̄) bounds the magnitude of δ_1 based on the magnitude of δ_{−1}, i.e. ∆^RM(M̄) = {(δ_{−1}, δ_1)' : |δ_1| ≤ M̄|δ_{−1}|}. The larger the magnitude of the observed pre-period violation in parallel trends, |δ_{−1}|, the wider the range of possible post-period violations of parallel trends. Figure 2 gives a geometric depiction of ∆^SD and ∆^RM in this example.

Figure 2: Example choices for ∆

[Four panels plot δ_1 against δ_{−1}, depicting the sets ∆^SD, ∆^SDPB, ∆^RM, and ∆^RMI.]

Note: Diagrams of potential restrictions ∆ on the set of possible violations of parallel trends in the three-period difference-in-differences model. See discussion in Section 2 for further details on each example.

    2.3.2 Sign and monotonicity restrictions

Context-specific knowledge may sometimes further imply sign or monotonicity restrictions on the differential trend. For instance, there may be simultaneous, confounding policy changes that we expect to have a positive effect on the outcome of interest, in which case we might restrict the post-period bias to be positive, δ ∈ ∆^PB := {δ : δ_t ≥ 0, ∀t ≥ 0}. Likewise, in some empirical settings, there may be secular pre-existing trends that we expect would have continued following the treatment date.¹⁰ We may then wish to impose that the differential trend be increasing, δ ∈ ∆^I := {δ : δ_t ≥ δ_{t−1}, ∀t}. Such sign and monotonicity restrictions may also be combined with smoothness restrictions. For example, ∆^SDPB(M) := ∆^SD(M) ∩ ∆^PB and ∆^RMI(M̄) := ∆^RM(M̄) ∩ ∆^I combine the smoothness restrictions discussed above with restrictions that the difference in trends be positive or monotonically increasing. Figure 2 gives a geometric depiction of ∆^SDPB and ∆^RMI in the three-period difference-in-differences model.

10 Monotone violations of parallel trends are often discussed in applied work. For example, Lovenheim and Willen (2019) argue that violations of parallel trends cannot explain their results because “pre-[treatment] trends are either zero or in the wrong direction (i.e., opposite to the direction of the treatment effect).” Greenstone and Hanna (2014) estimate upward-sloping pre-existing trends and argue that “if the pre-trends had continued” their estimates would be upward biased.

    2.3.3 Polyhedral restrictions

The smoothness, shape, and sign restrictions discussed so far will be applicable in a variety of economic contexts. However, in some cases researchers may have context-specific knowledge that implies other types of restrictions. Throughout the paper, we therefore consider the broader class of ∆’s that take a polyhedral form, i.e. sets which can be expressed as a series of linear restrictions on δ.

Assumption 2 (Polyhedral shape restriction). The class ∆ takes the form ∆ = {δ : Aδ ≤ d} for some known matrix A and vector d, where the matrix A has no all-zero rows.

This class of restrictions encompasses nearly all of the aforementioned examples, as well as many others. Indeed, it is immediate from Figure 2 that with two dimensions, ∆^SD, ∆^SDPB, and ∆^RMI are all polyhedra, and the geometric intuition from the two-period case extends to higher dimensions.¹¹ The one exception is ∆^RM, which is not convex and thus not a polyhedron. However, ∆^RM can be expressed as the union of polyhedra. One can thus form a confidence set for ∆^RM by taking the union of the confidence sets we develop below for each of the polyhedra that compose ∆^RM.

Remark 2 (Bounded variation assumptions). Manski and Pepper (2018) consider identification of treatment effects under so-called “bounded variation assumptions.” These assumptions can be expressed in the polyhedral form introduced in Assumption 2. Within the context of our ongoing difference-in-differences example, MP’s “bounded difference-in-differences variation” assumption corresponds directly with placing a bound on the magnitude of |δ_1| when β̂_1 is the coefficient from specification (3). MP also consider “bounded time” and “bounded state” variation assumptions, which correspond with bounds on the magnitudes of |µ_{11} − µ_{10}| and |µ_{11} − µ_{01}|, where µ_{ds} := E[Y(0) | D = d, t = s]. These restrictions can be accommodated by augmenting the vector β̂ to include the sample means corresponding with estimates of the differences in outcomes for the appropriate treatment-group by time period cells.¹²

11 In the case with one pre-period and one post-period, ∆^SD(M) = {δ : A^SD δ ≤ d^SD} for

$$A^{SD} = \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix} \quad \text{and} \quad d^{SD} = \begin{pmatrix} M \\ M \end{pmatrix}.$$

This generalizes naturally when there are multiple pre-periods and multiple post-periods.
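As a sketch of how the polyhedral form of ∆^SD(M) in footnote 11 generalizes to many periods, the following code (our illustration, not the HonestDiD implementation) builds (A, d) from the discrete second-difference operator:

```python
import numpy as np

def second_diff_constraints(num_pre, num_post, M):
    """Build (A, d) with Delta^{SD}(M) = {delta : A delta <= d}, for
    delta = (delta_{-num_pre}, ..., delta_{-1}, delta_1, ..., delta_{num_post})',
    handling the normalization delta_0 = 0 by dropping that column."""
    T = num_pre + num_post + 1            # periods -num_pre, ..., num_post
    # Discrete second-difference operator on the full path (incl. period 0)
    D2 = np.zeros((T - 2, T))
    for i in range(T - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    D2 = np.delete(D2, num_pre, axis=1)   # impose delta_0 = 0
    A = np.vstack([D2, -D2])              # |second difference| <= M
    d = np.full(2 * (T - 2), M)
    return A, d

A, d = second_diff_constraints(num_pre=1, num_post=1, M=0.1)
print(A)  # [[ 1.  1.] [-1. -1.]], matching footnote 11
```

Sign and monotonicity restrictions such as ∆^PB or ∆^I can be appended as additional rows of (A, d) in the same way.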

Remark 3 (Ashenfelter’s dip). When applying difference-in-differences to job training programs, one might worry that people who enroll in the program choose to do so in response to negative transitory shocks to earnings. If so, then the counterfactual difference in trends will exhibit the so-called Ashenfelter’s dip (Ashenfelter, 1978), in which earnings for the treated group trend downwards (relative to control) before treatment and upwards afterwards. In this type of setting, a researcher might naturally use a polyhedral ∆ to impose i) restrictions on the signs of the pre-treatment and post-treatment biases, as well as ii) restrictions on the magnitude of the rebound effect relative to the pre-treatment shock.

    2.4 Inferential Goal

Given a particular choice of ∆, our goal is to construct confidence sets that are uniformly valid for all parameter values θ in the identified set. We construct confidence sets C_n satisfying

$$\inf_{\delta \in \Delta,\, \tau} \ \inf_{\theta \in \mathcal{S}(\delta + \tau,\, \Delta)} \mathbb{P}_{(\delta, \tau, \Sigma_n)}\left( \theta \in \mathcal{C}_n \right) \geq 1 - \alpha. \tag{9}$$

We subscript the probability operator by (δ, τ, Σ_n) to make explicit that the distribution of β̂_n (and hence C_n) depends on these parameters. In the supplemental materials, we show that the coverage requirement (9) in the normal model translates to uniform asymptotic coverage over a large class of data-generating processes.

In practice, we recommend that applied researchers tailor their choice of confidence set C_n based on their choice of ∆. If the researcher selects ∆^SD(M) or a related ∆ satisfying particular properties described below, we recommend that the researcher use the optimal fixed length confidence interval (FLCI) described in Section 3. Otherwise, we recommend that the researcher use the conditional-FLCI hybrid confidence set, which is introduced in Section 5. Our practical recommendations are based on the theoretical properties of these confidence sets that we derive in Sections 3-5 and the simulation results in Section 6. An applied reader interested in applying our methods but not their theoretical properties may wish to skip ahead to Sections 7-8, in which we provide further details on our practical recommendations and illustrate them in applications to two recently published empirical papers.

12 After augmenting the vector for the event-study coefficients, equation (2) needs to be re-written to replace (0, τ'_post)' with Mτ_post, where M is a matrix that accounts for the fact that elements of τ enter both the event-study coefficients and the augmented terms. Our proposed methods and results do not rely on the structure that M = (0, I)' and thus easily accommodate this modification.


3 Inference using Fixed Length Confidence Intervals

We first consider fixed length confidence intervals (FLCIs) based on affine estimators. We show that FLCIs deliver finite-sample guarantees for certain choices of ∆, including our baseline smoothness class ∆^SD, but may perform poorly for other types of restrictions.

    3.1 Constructing FLCIs

Following Donoho (1994) and Armstrong and Kolesar (2018, 2020), we consider fixed length confidence intervals based on an affine estimator for θ,

$$\mathcal{C}_{\alpha,n}(a, v, \chi) = \left( a + v'\hat{\beta}_n \right) \pm \chi, \tag{10}$$

where a and χ are scalars and v ∈ ℝ^{T̲+T̄}. We wish to minimize the half-length of the confidence interval χ subject to the constraint that C_{α,n}(a, v, χ) satisfies the coverage requirement (9).

To do so, note that a + v'β̂_n ∼ N(a + v'β, v'Σ_n v), and hence |a + v'β̂_n − θ| ∼ |N(b, v'Σ_n v)|, where b = a + v'β − θ is the bias of the affine estimator a + v'β̂_n for θ. Observe further that θ ∈ C_n(a, v, χ) if and only if |a + v'β̂_n − θ| ≤ χ. For fixed values a and v, the smallest value of χ that satisfies (9) is therefore the 1 − α quantile of the |N(b̄, v'Σ_n v)| distribution, where b̄ is the worst-case bias of the affine estimator,

$$\bar{b}(a, v) := \sup_{\delta \in \Delta,\, \tau_{post} \in \mathbb{R}^{\bar{T}}} \left| a + v'\left( \delta + \begin{pmatrix} 0 \\ \tau_{post} \end{pmatrix} \right) - l'\tau_{post} \right|. \tag{11}$$

Let cv_α(t) denote the 1 − α quantile of the folded normal distribution |N(t, 1)|.¹³ For fixed a and v, the smallest value of χ satisfying the coverage requirement (9) is

$$\chi_n(a, v; \alpha) = \sigma_{v,n} \cdot \mathrm{cv}_\alpha\!\left( \bar{b}(a, v) / \sigma_{v,n} \right), \tag{12}$$

where σ_{v,n} := √(v'Σ_n v).

The minimum-length FLCI is then constructed by choosing the values of a and v to minimize (12). This minimization optimally trades off bias and variance, since the half-length χ_n(a, v; α) is increasing in both the worst-case bias b̄ and the variance σ_{v,n}² (assuming α ∈ (0, 0.5]). When ∆ is convex, this minimization can be solved as a nested optimization problem, where both the inner and outer minimizations are convex (Low, 1995; Armstrong and Kolesar, 2018, 2020). We denote by C^FLCI_{α,n} the 1 − α level FLCI with the shortest length,

$$\mathcal{C}^{FLCI}_{\alpha,n} = \left( a_n + v_n'\hat{\beta}_n \right) \pm \chi_n, \tag{13}$$

where χ_n := inf_{a,v} χ_n(a, v; α) and a_n, v_n are the optimal values in the minimization.

13 If t = ∞, we define cv_α = ∞.
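Since cv_α(t) is just a folded-normal quantile, the half-length formula (12) is simple to evaluate numerically; a minimal sketch using scipy's `foldnorm` distribution (the input values of b̄ and σ are illustrative):

```python
import numpy as np
from scipy.stats import foldnorm

def flci_halflength(worst_case_bias, sd, alpha=0.05):
    """Half-length chi_n(a, v; alpha) = sigma * cv_alpha(bbar / sigma),
    where cv_alpha(t) is the 1 - alpha quantile of |N(t, 1)|."""
    if np.isinf(worst_case_bias):
        return np.inf  # cv_alpha = infinity when the bias is unbounded
    t = worst_case_bias / sd  # nonnegative, since bbar is a sup of |.|
    return sd * foldnorm.ppf(1 - alpha, c=t)

print(flci_halflength(worst_case_bias=0.1, sd=0.05))
```

Computing b̄(a, v) itself requires solving the optimization in (11), which is a linear program when ∆ is polyhedral and v is such that the τ_post terms cancel.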

Example: ∆^SD(M). Suppose θ = τ_1. For ∆^SD(M), the affine estimator used by the optimal FLCI takes the form

$$a + v'\hat{\beta}_n = \hat{\beta}_{n,1} - \sum_{s = -\underline{T}+1}^{0} w_s \left( \hat{\beta}_{n,s} - \hat{\beta}_{n,s-1} \right), \tag{14}$$

where the weights w_s sum to one (but may be negative). It takes the event-study coefficient for period 1 and subtracts out a weighted sum of the estimated slopes between consecutive pre-periods. Intuitively, since ∆^SD restricts the changes in the slope of the underlying trend across periods, but not the slope of the trend itself, an affine estimator with finite bias must subtract out an estimate of the slope of the trend between t = 0 and t = 1 using the observed slopes in the pre-period. The worst-case bias will be smaller if more weight is placed on pre-periods closer to the treatment date, but it may reduce variance to place more weight on earlier pre-periods. The weights w_s are optimally chosen to balance this tradeoff.

    3.2 Finite-sample near optimality

In particular cases of interest, such as when ∆ = ∆^SD(M), the FLCIs introduced above have near-optimal expected length in the finite-sample normal model. The following result, which is an immediate consequence of results in Armstrong and Kolesar (2018, 2020), bounds the ratio of the expected length of the shortest possible confidence interval that controls size relative to the length of the optimal FLCI.

Assumption 3. Assume i) ∆ is convex and centrosymmetric (i.e. δ ∈ ∆ implies −δ ∈ ∆), and ii) δ_A ∈ ∆ is such that (δ − δ_A) ∈ ∆ for all δ ∈ ∆.

Proposition 3.1. Suppose δ_A and ∆ satisfy Assumption 3.¹⁴ Let I_α(∆, Σ_n) denote the class of confidence sets that satisfy the finite sample coverage criterion in (9) at the 1 − α level. Then, for any τ_A ∈ ℝ^{T̄}, Σ* positive definite, and n > 0,

$$\frac{\inf_{\mathcal{C}_{\alpha,n} \in \mathcal{I}_\alpha(\Delta, \Sigma_n)} E_{(\delta_A, \tau_A, \Sigma_n)}\left[ \lambda(\mathcal{C}_{\alpha,n}) \right]}{2\chi_n} \;\geq\; \frac{z_{1-\alpha}(1-\alpha) - \tilde{z}_\alpha \Phi(\tilde{z}_\alpha) + \phi(z_{1-\alpha}) - \phi(\tilde{z}_\alpha)}{z_{1-\alpha/2}},$$

where λ(·) denotes the length (Lebesgue measure) of a set and z̃_α = z_{1−α} − z_{1−α/2}.

14 We use δ_A for the null value of δ, rather than δ_0, since we use the notation δ_t to refer to the component of δ corresponding with period t.

Part i) of Assumption 3 is satisfied for ∆^SD but not for our other ongoing examples. For example, ∆^SDPB and ∆^RMI are convex but not centrosymmetric, and ∆^RM is neither convex nor centrosymmetric. Part ii) of Assumption 3 is always satisfied when parallel trends holds in both the pre-treatment and post-treatment periods (δ_A = 0). It also holds whenever δ_A is a linear trend for the case of ∆^SD(M). For α = 0.05, the lower bound in Proposition 3.1 evaluates to 0.72, so the expected length of the shortest possible confidence set that satisfies the coverage requirement (9) is at most 28% shorter than the length of the optimal FLCI.¹⁵ A nice feature of this result is that it places no restrictions on Σ, and thus allows the sampling variation in the data to be large (or small) relative to the length of the identified set.
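The lower bound in Proposition 3.1 is a closed-form function of standard normal quantiles, densities, and CDF values, so the 0.72 figure for α = 0.05 is easy to verify:

```python
from scipy.stats import norm

alpha = 0.05
z1 = norm.ppf(1 - alpha)      # z_{1-alpha}
z2 = norm.ppf(1 - alpha / 2)  # z_{1-alpha/2}
zt = z1 - z2                  # z-tilde_alpha

# Lower bound from Proposition 3.1
bound = (z1 * (1 - alpha) - zt * norm.cdf(zt)
         + norm.pdf(z1) - norm.pdf(zt)) / z2
print(round(bound, 2))  # 0.72
```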

    3.3 (In)Consistency of FLCIs

The finite-sample guarantees discussed above do not apply for several types of restrictions ∆ of importance, including those that incorporate sign and shape restrictions. We now show that the FLCIs can perform poorly under such restrictions. We first provide two illustrative examples, and then introduce a formal inconsistency result.

Example: ∆^SDPB(M) and ∆^SDI(M). Suppose θ = τ_1. One can show that the worst-case bias of an affine estimator over ∆^SDPB(M) is the same as the worst-case bias for that estimator over ∆^SD(M).¹⁶ The same argument applies using ∆^SDI(M) := ∆^SD(M) ∩ ∆^I, which adds the restriction that δ be monotonically increasing to ∆^SD. Since the construction of the optimal FLCI depends only on the worst-case bias and variance of the affine estimator, it follows that the optimal FLCI constructed using ∆^SDPB(M) or ∆^SDI(M) is the same as the one constructed using ∆^SD(M). Therefore, the optimal FLCI does not adapt to additional sign or monotonicity restrictions.

15 Additionally, as noted in Armstrong and Kolesar (2020), the results in Joshi (1969) imply that if ∆ is a linear subspace satisfying part i) of Assumption 3, the FLCI achieves minimax expected length, so any procedure that outperforms it somewhere must also be inferior at some other point in the parameter space.

16 To see why, suppose that the vector δ maximizes the bias for an affine estimator (a, v) over ∆^SD(M). The vector that adds a constant slope to δ, say δ̃_c = δ + c · (−T̲, ..., T̄)', also lies in ∆^SD(M), and for c sufficiently large, δ̃_c will lie in ∆^SDPB(M). Moreover, the bias will be the same for δ and δ̃_c, since if (a, v) has finite worst-case bias it must subtract out a weighted average of the pre-treatment slopes.

Example: ∆^RMI(M̄). Suppose θ = τ_1. If ∆ = ∆^RMI(M̄) and M̄ > 0, then all affine estimators for τ_1 have infinite worst-case bias.¹⁷ Thus, the FLCI is the entire real line.

We next provide a formal result on the (in)consistency of the FLCIs. Specifically, we consider “small-Σ” asymptotics wherein the sampling uncertainty grows small relative to the length of the identified set, and consider when the FLCIs include points bounded away from the identified set with non-vanishing probability.¹⁸ Recall from Lemma 2.1 that the identified set S(β, ∆) is an interval when ∆ is convex, with length equal to θ^ub(β, ∆) − θ^lb(β, ∆) = b^max(β_pre, ∆) − b^min(β_pre, ∆). Since the length of the identified set only depends on ∆ and β_pre, denote it by L_ID(β_pre, ∆). Our next result shows that C^FLCI_{α,n} is consistent if and only if L_ID(β_pre, ∆) is its maximum possible value, provided that the identified set is not the entire real line (in which case any procedure is trivially consistent).

Assumption 4 (Identified set maximal length and finite). Suppose δ_{A,pre} is such that L_ID(δ_{A,pre}, ∆) = sup_{δ_pre ∈ ∆_pre} L_ID(δ_pre, ∆) < ∞, where ∆_pre = {δ_pre ∈ ℝ^{T̲} : ∃δ_post s.t. (δ'_pre, δ'_post)' ∈ ∆} is the set of possible values for δ_pre.

Proposition 3.2. Suppose ∆ is convex and α ∈ (0, .5]. Fix δ_A ∈ ∆ and τ_A ∈ ℝ^{T̄}, and suppose S(δ_A + τ_A, ∆) ≠ ℝ. Then (δ_A, ∆) satisfy Assumption 4 if and only if C^FLCI_{α,n} is consistent, meaning that

$$\lim_{n \to \infty} P_{(\delta_A, \tau_A, \Sigma_n)}\left( \theta_{out} \in \mathcal{C}^{FLCI}_{\alpha,n} \right) = 0 \quad \text{for all } \theta_{out} \notin \mathcal{S}(\delta_A + \tau_A, \Delta).$$

Thus, if Assumption 4 fails, then C^FLCI_{α,n} is inconsistent in the strong sense that it includes fixed points outside of the identified set with non-vanishing probability. It follows that there will be some δ_A ∈ ∆ such that the FLCI is inconsistent under δ_A unless the identified set is always the same length. We also show in Lemma B.26 in the Appendix that the conditions of Proposition 3.1 imply that Assumption 4 holds. Thus, the FLCIs obtain finite sample near-optimality in only a subset of the cases where they are consistent.

Remark 4. In the three-period difference-in-differences example, the length of the identified set corresponds with the height of ∆ in Figure 3, and so Assumption 4 holds if and only if ∆ achieves its maximal height at δ_{−1}. As shown in Figure 3, the assumption holds everywhere for ∆^SD (since the identified set is always the same length), for values of δ where the sign restrictions do not bind for ∆^SDPB, and nowhere for ∆^RMI. The restrictiveness of Assumption 4 thus depends greatly on ∆.

17 This follows immediately from Lemma B.19 below, which shows that the worst-case bias must be at least half the maximum length of the identified set, which is infinite for ∆^RMI(M̄).

18 See, e.g., Kadane (1971) and Moreira and Ridder (2019) for other uses of small-Σ asymptotics.


Figure 3: Diagram of where Assumptions 3 and 4 hold. The values of δ are colored red (neither holds), light green (Assumption 4 only), and dark green (both hold).

[Three panels plot δ_1 against δ_{−1} for ∆^SD, ∆^SDPB, and ∆^RMI.]

Remark 5. Proposition 3.2 implies that FLCIs can potentially be inconsistent when ∆ is convex and centrosymmetric if δ ≠ 0. For example, if ∆ = {δ ∈ ∆^SD(M) | δ_1 ≤ M}, then the FLCI is inconsistent whenever δ_{−1} ≠ 0, even though Proposition 3.1 implies that the FLCI is near-optimal for δ = 0. As discussed above, however, such inconsistency does not arise for our baseline smoothness class ∆^SD(M).

Remark 6. In Appendix A.1, we further show that if Assumption 4 along with an additional condition (Assumption 5, introduced below) hold, then the FLCI also has local asymptotic power approaching the power envelope under the same asymptotics considered in Proposition 3.2.

The results in this section establish that when certain conditions on ∆ are satisfied, the FLCIs are consistent and have desirable finite-sample guarantees in terms of expected length. These conditions hold for our baseline smoothness class ∆^SD, but fail for choices of ∆ that may be of interest in empirical applications, such as those that incorporate sign and monotonicity restrictions. This motivates us to next consider an alternative method for inference that can accommodate a larger range of restrictions.

4 Inference using Conditional Confidence Sets

In this section, we introduce a more general procedure for inference that has good asymptotic properties over a large class of possible restrictions ∆. We show that inference on the partially identified parameter θ = l'τ_post in this setting is equivalent to testing a system of moment inequalities with a potentially large number of nuisance parameters that enter the moments linearly. We then apply the conditional approach developed in ARP to obtain computationally tractable tests and confidence sets. We derive novel results on the asymptotic properties of the conditional test in our context, exploiting additional structure in our setting not found in ARP.

4.1 Representation as a moment inequality problem with linear nuisance parameters

Consider the problem of testing the null hypothesis H_0 : θ = θ̄, δ ∈ ∆ when ∆ = {δ : Aδ ≤ d}. We now show that testing H_0 is equivalent to testing a system of moment inequalities with linear nuisance parameters.

The model (1) implies E_{(δ,τ,Σ_n)}[β̂_n − τ] = δ, and hence δ ∈ ∆ if and only if E_{(δ,τ,Σ_n)}[Aβ̂_n − Aτ] ≤ d. Defining Y_n = Aβ̂_n − d and M_post = [0, I]' to be the matrix such that τ = M_post τ_post, it is immediate that the null hypothesis H_0 is equivalent to the composite null

$$H_0 : \exists \tau_{post} \in \mathbb{R}^{\bar{T}} \text{ s.t. } l'\tau_{post} = \bar{\theta} \ \text{ and } \ E_{(\delta,\tau,\Sigma_n)}\left[ Y_n - A M_{post} \tau_{post} \right] \leq 0. \tag{15}$$

In this equivalent form, τ_post ∈ ℝ^{T̄} is a vector of nuisance parameters that must satisfy the linear constraint l'τ_post = θ̄.

By applying a change of basis, we can further re-write H_0 as an equivalent composite null hypothesis with an unconstrained nuisance parameter. Re-write the expression A M_post τ_post as Ã(θ, τ̃')', where Ã is the matrix that results from applying a suitable change of basis to the columns of A M_post, and τ̃ ∈ ℝ^{T̄−1}.¹⁹ The null H_0 is then equivalent to

$$H_0 : \exists \tilde{\tau} \in \mathbb{R}^{\bar{T}-1} \text{ s.t. } E\left[ \tilde{Y}_n(\bar{\theta}) - \tilde{X}\tilde{\tau} \right] \leq 0, \tag{16}$$

where Ỹ_n(θ̄) = Y_n − Ã_{(·,1)} θ̄ and X̃ = Ã_{(·,−1)}. Since Ỹ_n(θ̄) is normally distributed with covariance matrix Σ̃_n = AΣ_nA' under the finite-sample normal model (1), testing H_0 : θ = θ̄, δ ∈ ∆ is equivalent to testing a set of moment inequalities with linear nuisance parameters.
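A minimal sketch of this reparametrization, following the construction described in footnote 19 below (the helper name, and the use of an orthonormal completion for Γ, are our choices, not the paper's):

```python
import numpy as np
from scipy.linalg import null_space

def reparametrize(A, d, beta_hat, l, theta_bar):
    """Return (Y_tilde, X_tilde) for testing H0: theta = theta_bar.

    `A, d` define Delta = {delta : A delta <= d}; `l` (a float array of
    length T_post) defines theta = l' tau_post. Any full-rank completion
    of Gamma works; we use an orthonormal basis of l's null space.
    """
    T_post = len(l)
    M_post = np.vstack([np.zeros((len(beta_hat) - T_post, T_post)),
                        np.eye(T_post)])        # tau = M_post tau_post
    Gamma = np.vstack([l, null_space(l[None, :]).T])  # first row is l'
    A_tilde = A @ M_post @ np.linalg.inv(Gamma)
    Y_n = A @ beta_hat - d
    Y_tilde = Y_n - A_tilde[:, 0] * theta_bar   # moments at theta = theta_bar
    X_tilde = A_tilde[:, 1:]                    # loadings on free nuisance
    return Y_tilde, X_tilde
```

When T̄ = 1, `X_tilde` has zero columns, consistent with τ̃ being 0-dimensional in that case.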

Remark 7. The hypothesis (16) is a special case of the testing problem studied in ARP, which focuses on testing null hypotheses of the form H_0 : ∃τ s.t. E[Y(θ) − Xτ | X] ≤ 0, almost surely. Our setting is a special case of this framework in which: i) the variable X takes the degenerate distribution X = X̃, and ii) Y(θ) = Ỹ(θ) is linear in θ. The first feature plays an important role in developing our novel consistency and local asymptotic power results presented later in this section: if i) fails and X is continuously distributed, then the tests proposed by ARP will generally not be consistent, as they do not allow for the number of moments to grow with n. The current proof of the optimal local asymptotic result also exploits the geometry of feature ii), although we conjecture that this could be relaxed to allow Y(θ) to vary smoothly in θ.

19 Specifically, let Γ be a square matrix with the vector l' in the first row and remaining rows chosen so that Γ has full rank. Define Ã := A M_post Γ^{−1}. Then

$$A M_{post} \tau_{post} = \tilde{A}\Gamma\tau_{post} = \tilde{A} \begin{pmatrix} \theta \\ \tilde{\tau} \end{pmatrix}, \quad \text{where } \tilde{\tau} := \Gamma_{(-1,\cdot)}\tau_{post}.$$

If T̄ = 1, then τ̃ is 0-dimensional and should be interpreted as 0.

    4.2 Constructing conditional confidence sets

An important practical consideration for testing hypotheses of the form (16) is that the dimension of the nuisance parameter τ̃ ∈ ℝ^{T̄−1} grows linearly with the number of post-periods T̄ and may be large in practice. For instance, in Section 8 we apply our methodology to a recent paper in which T̄ = 23. Moreover, 5 of the 12 recent event-study papers reviewed in Roth (2019) have T̄ > 10. This renders many moment inequality methods, especially those which rely on test inversion over a grid for the full parameter vector, practically infeasible in this context. We now show how the conditional approach of ARP, which directly exploits the linear structure of the hypothesis (16), can be applied to obtain computationally tractable and powerful tests even when the number of post-periods T̄ is large.²⁰

    Suppose we wish to test (16) for some fixed θ̄. The conditional testing approach considerstests based on the test statistic

    η̂ :“ minη,τ̃

    η s.t. Ỹnpθ̄q ´ X̃τ̃ ď σ̃n ¨ η, (17)

    where σ̃n “b

    diagpΣ̃nq. This linear program selects the value of the nuisance parametersτ̃ P RT̄´1 that produces the most slack in the maximum studentized moment. Duality resultsfrom linear programming (e.g. Schrijver (1986), Section 7.4) imply that the value η̂ obtained

[20] Other moment inequality methods have been proposed for subvector inference, but they typically do not exploit the linear structure of our setting; see, e.g., Chen, Christensen and Tamer (2018); Bugni, Canay and Shi (2017); Kaido, Molinari and Stoye (2019); Chernozhukov, Newey and Santos (2015); Romano and Shaikh (2008). Gafarov (2019), Cho and Russell (2019), and Flynn (2019) also provide methods for subvector inference with linear moment inequalities, but in contrast to our approach require a linear independence constraint qualification (LICQ) assumption for size control.


from the primal program (17) equals the optimal value of the dual program,[21]

$$\hat\eta = \max_{\gamma}\ \gamma'\tilde Y_n(\bar\theta) \quad \text{s.t.} \quad \gamma'\tilde X = 0,\ \gamma'\tilde\sigma_n = 1,\ \gamma \ge 0. \tag{18}$$

If a vector $\gamma_*$ is optimal in the dual problem above, then it is a vector of Lagrange multipliers for the primal problem. We denote by $\hat V_n$ the set of optimal vertices of the dual program.[22]

To derive critical values, we analyze the distribution of $\hat\eta$ conditional on the event that a vertex $\gamma_*$ is optimal in the dual problem. Lemma 9 of ARP shows that conditional on the event $\gamma_* \in \hat V_n$ and a sufficient statistic $S_n$ for the nuisance parameters, the test statistic $\hat\eta$ follows a truncated normal distribution,

$$\hat\eta \mid \{\gamma_* \in \hat V_n,\ S_n = s\} \ \sim\ \xi \mid \xi \in [v^{lo}, v^{up}], \tag{19}$$

where $\xi \sim \mathcal{N}\left(\gamma_*'\tilde\mu,\ \gamma_*'\tilde\Sigma_n\gamma_*\right)$, $\tilde\mu = E\left[\tilde Y_n(\bar\theta)\right]$, $S_n = \left(I - \frac{\tilde\Sigma_n\gamma_*}{\gamma_*'\tilde\Sigma_n\gamma_*}\gamma_*'\right)\tilde Y_n(\bar\theta)$, and $v^{lo}, v^{up}$ are known functions of $\tilde\Sigma_n, s, \gamma_*$.[23] All quantiles of the conditional distribution of $\hat\eta$ in the previous display are increasing in $\gamma_*'\tilde\mu$,[24] and the null hypothesis (16) implies $\gamma_*'\tilde\mu \le 0$.

We therefore select the critical value for the conditional test to be the $1-\alpha$ quantile of the truncated normal distribution $\xi \mid \xi \in [v^{lo}, v^{up}]$ under the worst-case assumption that $\gamma_*'\tilde\mu = 0$. Let $\psi^C_\alpha(\tilde Y_n(\bar\theta), \tilde\Sigma_n)$ denote an indicator for whether the conditional test rejects at the $1-\alpha$ level. The conditional test is defined as

$$\psi^C_\alpha(\tilde Y_n(\bar\theta), \tilde\Sigma_n) = 1 \iff F_{\xi \mid \xi\in[v^{lo},v^{up}]}\left(\hat\eta;\ \gamma_*'\tilde\Sigma_n\gamma_*\right) > 1-\alpha, \tag{20}$$

where $F_{\xi \mid \xi\in[v^{lo},v^{up}]}(\cdot\,;\sigma^2)$ is the CDF of $\xi \sim \mathcal{N}(0,\sigma^2)$ truncated to $[v^{lo}, v^{up}]$. It follows immediately from Proposition 6 in ARP that the conditional test controls size,
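For concreteness, the truncated normal CDF in (20) can be evaluated directly from standard normal CDFs; the following is a minimal sketch using scipy (our function name, not from the paper's code):

```python
import numpy as np
from scipy.stats import norm

def trunc_normal_cdf(x, vlo, vup, sigma):
    """CDF of N(0, sigma^2) truncated to [vlo, vup], evaluated at x,
    as used for the conditional critical value in (20)."""
    denom = norm.cdf(vup / sigma) - norm.cdf(vlo / sigma)
    num = norm.cdf(np.clip(x, vlo, vup) / sigma) - norm.cdf(vlo / sigma)
    return num / denom
```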

$$\sup_{\delta\in\Delta,\ \tau}\ \sup_{\theta\in S(\Delta,\ \delta+\tau)}\ E_{(\delta,\tau,\Sigma_n)}\left[\psi^C_\alpha(\tilde Y_n(\theta), \tilde\Sigma_n)\right] \le \alpha. \tag{21}$$

A confidence set satisfying the uniform coverage criterion (9) can be constructed by test inversion for the scalar parameter $\theta$.

[21] Technically, the duality results require that $\hat\eta$ be finite. However, one can show that $\hat\eta$ is finite with probability 1, unless the span of $\tilde X$ contains a vector with all negative entries, in which case the identified set for $\theta$ is the real line. We therefore trivially define our test never to reject if $\hat\eta = -\infty$.

[22] In general, there may not be a unique solution to the dual program. However, Lemma 11 of ARP shows that conditional on any one vertex of the dual program's feasible set being optimal, every other vertex is optimal with either probability 0 or 1. It thus suffices to condition on the event that a vector $\gamma_* \in \hat V_n$.

[23] The cutoffs $v^{lo}$ and $v^{up}$ are the minimum and maximum, respectively, of the set $\left\{x : x = \max_{\gamma\in F_n} \gamma'\left(s + \frac{\tilde\Sigma_n\gamma_*}{\gamma_*'\tilde\Sigma_n\gamma_*}x\right)\right\}$ when $\gamma_*'\tilde\Sigma_n\gamma_* \ne 0$, where $F_n$ is the feasible set of the dual program (18). When $\gamma_*'\tilde\Sigma_n\gamma_* = 0$, we define $v^{lo} = -\infty$ and $v^{up} = \infty$, so the conditional test rejects if and only if $\hat\eta > 0$.

[24] This follows from the fact that the truncated normal distribution $\xi \mid \xi \in [v^{lo}, v^{up}]$ has the monotone likelihood ratio property in its mean (see, e.g., Lemma A.1 in Lee, Sun, Sun and Taylor (2016)).


The conditional confidence set is given by

$$C^C_{\alpha,n} := \left\{\bar\theta : \psi^C_\alpha(\tilde Y_n(\bar\theta), \tilde\Sigma_n) = 0\right\}. \tag{22}$$

Remark 8. For each value of $\bar\theta$, the test statistic $\hat\eta$ can be computed by solving the linear program (17). To form the confidence set $C^C_{\alpha,n}$, one only needs to perform test inversion over a grid of values for the scalar parameter $\theta$, and thus the problem remains highly tractable even when $\bar T$ is large. Moreover, the commonly-used dual simplex algorithm for linear programming returns an optimal vertex of the dual program (18), so an optimal dual vertex $\gamma_*$ can be obtained from standard solvers without further calculation. ∎
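As a concrete illustration of Remark 8, here is a minimal sketch of the primal program (17) using scipy's LP solver (our function and variable names; we assume $\tilde Y_n(\bar\theta)$, $\tilde X$, and $\tilde\sigma_n$ are supplied as numpy arrays, and a recent scipy so that dual marginals are reported):

```python
import numpy as np
from scipy.optimize import linprog

def eta_hat(Y_tilde, X_tilde, sigma_tilde):
    """Solve the primal LP (17): min_{eta, tau} eta  s.t.  Y - X tau <= sigma * eta.
    Returns the test statistic eta_hat and an optimal dual vertex gamma_star."""
    k, p = X_tilde.shape                      # k moments, p = T_bar - 1 nuisance params
    c = np.concatenate(([1.0], np.zeros(p)))  # objective: minimize eta
    A_ub = np.column_stack([-sigma_tilde, -X_tilde])  # Y - X tau - sigma*eta <= 0
    b_ub = -Y_tilde
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (1 + p), method="highs")
    gamma_star = -res.ineqlin.marginals       # Lagrange multipliers solve the dual (18)
    return res.fun, gamma_star
```

One can verify numerically that `gamma_star` is dual-feasible, i.e., $\gamma'\tilde X \approx 0$, $\gamma'\tilde\sigma_n \approx 1$, and $\gamma \ge 0$.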

Remark 9. Consider the simple setting in which we have one post-period ($\bar T = 1$) and are interested in the treatment effect in the first period, $\theta = \tau_1$. In this case, there are no nuisance parameters, and the form of the conditional test simplifies substantially. The test statistic $\hat\eta$ is the maximum of the studentized moments, $\hat\eta = \max_j \tilde Y_{n,j}/\tilde\sigma_{n,j}$, where $\tilde\sigma_{n,j}$ is the standard deviation of $\tilde Y_{n,j}$. The conditional test rejects in this case if and only if $\frac{\Phi(\hat\eta)-\Phi(v^{lo})}{1-\Phi(v^{lo})} > 1-\alpha$. Moreover, if the moments $\tilde Y_n$ are uncorrelated with each other, then $v^{lo}$ is the maximum of the non-binding studentized moments, $v^{lo} = \max_{j\ne\hat j} \tilde Y_{n,j}/\tilde\sigma_{n,j}$, where $\hat j$ denotes the location of the maximum studentized moment. ∎
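The special case in Remark 9 is simple enough to implement in a few lines; a sketch under the remark's assumptions (uncorrelated moments; our function name):

```python
import numpy as np
from scipy.stats import norm

def conditional_test_no_nuisance(Y, sigma, alpha=0.05):
    """Conditional test for T_bar = 1 with uncorrelated moments (Remark 9).
    Rejects iff (Phi(eta) - Phi(vlo)) / (1 - Phi(vlo)) > 1 - alpha."""
    t = Y / sigma                         # studentized moments
    j_hat = np.argmax(t)                  # binding moment
    eta = t[j_hat]                        # test statistic: max studentized moment
    vlo = np.max(np.delete(t, j_hat))     # truncation point: max non-binding moment
    reject = (norm.cdf(eta) - norm.cdf(vlo)) / (1 - norm.cdf(vlo)) > 1 - alpha
    return bool(reject)
```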

4.3 Consistency and optimal local asymptotic power of conditional confidence sets

We now present two novel results on the asymptotic power of the conditional test in our setting. We first show that the conditional test is consistent, meaning that any fixed point outside of the identified set is rejected with probability approaching one as the sample size $n \to \infty$.

Proposition 4.1. The conditional test is consistent: for any $\delta_A \in \Delta$, $\tau_A \in \mathbb{R}^{\bar T}$, and $\theta_{out} \notin S(\delta_A + \tau_A, \Delta)$,

$$\lim_{n\to\infty} P_{(\delta_A,\tau_A,\Sigma_n)}\left(\theta_{out} \notin C^C_{\alpha,n}\right) = 1.$$

Thus, in contrast to the optimal FLCI, the conditional test is consistent for all polyhedral $\Delta$.

We next consider the local asymptotic power of the conditional test. We provide a condition under which the power of the conditional test against local alternatives converges to the power envelope. This condition guarantees that the binding and non-binding moments are sufficiently well-separated at points close to the boundary of the identified set.


Assumption 5. Let $\Delta = \{\delta : A\delta \le d\}$ and fix $\delta_A \in \Delta$. Consider the optimization

$$b^{max}(\delta_{A,pre}) = \max_\delta\ l'\delta_{post} \quad \text{s.t.} \quad A\delta \le d,\ \delta_{pre} = \delta_{A,pre},$$

and assume it has a finite solution. For $\delta^*$ a maximizer to the above problem, let $B(\delta^*)$ index the set of binding inequality constraints, so that $A_{(B(\delta^*),\cdot)}\delta^* = d_{B(\delta^*)}$ and $A_{(-B(\delta^*),\cdot)}\delta^* - d_{-B(\delta^*)} = -\epsilon_{-B(\delta^*)} < 0$. Assume that there exists a maximizer $\delta^*$ to the problem above such that the rank of $A_{(B(\delta^*),post)}$ is equal to $|B(\delta^*)|$. Analogously, assume that there is a finite solution to the analogous problem that replaces max with min, and that there is a minimizer $\delta^{**}$ such that $A_{(B(\delta^{**}),post)}$ has rank $|B(\delta^{**})|$.

Assumption 5 considers the problem of finding the differential trend $\delta \in \Delta$ that is consistent with the pre-trend identified from the data ($\delta_{A,pre}$) and causes $l'\hat\beta_{post}$ to be maximally biased for $\theta := l'\tau_{post}$. It requires that the "right" number of moments bind when we solve this optimization.
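The optimization in Assumption 5 is itself a linear program; the following is a sketch of the upper-bias problem (our function name; we assume $\delta$ is ordered as (pre, post) and that $A$, $d$, and $l$ are given, with a finite solution as Assumption 5 requires):

```python
import numpy as np
from scipy.optimize import linprog

def b_max(A, d, delta_A_pre, l_vec):
    """Solve max_delta l' delta_post  s.t.  A delta <= d, delta_pre = delta_A_pre
    (the upper-bias program in Assumption 5; assumes a finite solution)."""
    n_pre, n_post = len(delta_A_pre), len(l_vec)
    c = -np.concatenate([np.zeros(n_pre), l_vec])   # linprog minimizes, so negate
    A_eq = np.hstack([np.eye(n_pre), np.zeros((n_pre, n_post))])  # fix delta_pre
    res = linprog(c, A_ub=A, b_ub=d, A_eq=A_eq, b_eq=delta_A_pre,
                  bounds=[(None, None)] * (n_pre + n_post), method="highs")
    return -res.fun, res.x   # optimal value b_max and a maximizer delta*
```

The binding set $B(\delta^*)$ can then be read off as the inequality constraints holding with (numerical) equality at the returned maximizer.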

Remark 10. Assumption 5 is closely related to, but slightly weaker than, the linear independence constraint qualification (LICQ). LICQ has been used recently in the moment inequality settings of Gafarov (2019), Cho and Russell (2019), Flynn (2019), and Kaido and Santos (2014); see Kaido, Molinari and Stoye (2020) for a synthesis. We show in Appendix A.2 that LICQ is equivalent to a modified version of Assumption 5 that replaces "there exists a maximizer $\delta^*$" with "for every maximizer $\delta^*$" (and analogously for the minimizer $\delta^{**}$). Thus, LICQ is equivalent to Assumption 5 when the optimizations considered in Assumption 5 have unique solutions, but is potentially stronger when there are multiple solutions. We note that many of the aforementioned papers require LICQ for asymptotic size control, whereas we only impose Assumption 5 for our results on local asymptotic power. ∎

Remark 11. In the special case with one pre-period ($\underline{T} = 1$) and one post-period ($\bar T = 1$), Assumption 5 has a simple graphical interpretation. It is satisfied whenever $\Delta$ has non-empty interior and $\delta$ is not vertically aligned with a vertex. Figure 4 shows the areas at which Assumption 5 holds/fails for three of our ongoing examples. The assumption holds everywhere for $\Delta^{SD}$ when $M > 0$, and Lebesgue almost everywhere for $\Delta^{SDPB}$ and $\Delta^{RMI}$ when $M > 0$ and $\bar M > 0$. The sets $\Delta^{SD}(0)$, $\Delta^{SDPB}(0)$, $\Delta^{RMI}(0)$ all have empty interior, and so Assumption 5 fails for these cases (in which $\theta$ is point-identified). More generally, one can show that Assumption 5 does not hold if $\theta$ is point-identified. ∎


Figure 4: Diagram of where Assumption 5 holds. The assumption holds (fails) for values of $\delta$ plotted in green (red). [Three panels with axes $\delta_1$ and $\delta_{-1}$, one each for $\Delta^{SD}$, $\Delta^{SDPB}$, and $\Delta^{RMI}$.]

Let $\mathcal{I}_\alpha(\Delta, \Sigma_n)$ denote the class of confidence sets that satisfy the finite-sample coverage criterion in (9) at the $1-\alpha$ level. Under Assumption 5, the power of the conditional test against local alternatives converges to the optimum over $\mathcal{I}_\alpha(\Delta, \Sigma_n)$ as $n \to \infty$.

Proposition 4.2. Fix $\delta_A \in \Delta$, $\tau_A$, and suppose $\Sigma^*$ is positive definite. Let $\theta^{ub}_A = \sup S(\delta_A + \tau_A, \Delta)$ be the upper bound of the identified set. Suppose Assumption 5 holds. Then, for any $x > 0$,

$$\lim_{n\to\infty} P_{(\delta_A,\tau_A,\Sigma_n)}\left(\theta^{ub}_A + \tfrac{1}{\sqrt n}x \notin C^C_{\alpha,n}\right) = \lim_{n\to\infty}\ \sup_{C_{\alpha,n}\in\mathcal{I}_\alpha(\Delta,\Sigma_n)} P_{(\delta_A,\tau_A,\Sigma_n)}\left(\theta^{ub}_A + \tfrac{1}{\sqrt n}x \notin C_{\alpha,n}\right) = \Phi(c^* x - z_{1-\alpha}),$$

for a positive constant $c^*$.[25] The analogous result holds replacing $\theta^{ub}_A + \tfrac{1}{\sqrt n}x$ with $\theta^{lb}_A - \tfrac{1}{\sqrt n}x$, for $\theta^{lb}_A$ the lower bound of the identified set (although the constant $c^*$ may differ).

We now provide an outline of the proof of our local asymptotic optimality result, the details of which are contained in the appendix. The proof proceeds in two parts: the first part characterizes the asymptotically optimal test, and the second shows that our conditional test converges to this optimal test.

We first show that under Assumption 5, the local asymptotic power of any test that controls size is bounded above by that of a particular one-sided t-test. Specifically, Assumption 5 implies that there is a unique set of Lagrange multipliers $\bar\gamma$ in the "population version" of

[25] In particular, letting $B = B(\delta^{**})$ as defined in Assumption 5, $c^* = -\bar\gamma_B'\tilde A_{(B,1)}/\sigma_B$, where $\sigma_B = \sqrt{\bar\gamma_B' A_{(B,\cdot)}\Sigma^* A_{(B,\cdot)}'\bar\gamma_B}$ and $\bar\gamma_B$ is a non-zero vector such that $\bar\gamma_B'\tilde A_{(B,-1)} = 0$, $\bar\gamma_B \ge 0$. The vector $\bar\gamma_B$ is unique up to scale.


the test statistic $\hat\eta(\theta^{ub})$ that replaces $\tilde Y(\theta^{ub})$ with its expectation $\tilde\mu(\theta^{ub})$ in (17). We then show that testing $H_0: \theta = \theta^{ub}, \delta \in \Delta$ against a local alternative can be represented as a test of a convex null against a point alternative in the normal location model. Applying the Neyman-Pearson lemma, we show that the optimal test is a one-sided t-test in the direction of $\bar\gamma$ for alternatives sufficiently close to $\theta^{ub}$.

We next show that the conditional test converges in probability to the optimal one-sided t-test discussed above. Since Assumption 5 implies that $\bar\gamma$ is the unique dual solution to the "population version" of $\hat\eta$, it follows that $\bar\gamma$ will be optimal in the dual problem for $\hat\eta$ with probability approaching one as $\Sigma_n \to 0$. Thus, with probability approaching one the test statistic for the conditional test will be $\hat\eta = \bar\gamma'\tilde Y$, which corresponds with that of the one-sided t-test. Finally, recall that the critical value of the conditional test is based on the $1-\alpha$ quantile of the distribution of $\gamma_*'\tilde Y$ conditional on $\gamma_*$ being optimal. However, since $\bar\gamma$ is optimal with probability approaching 1, the distribution of $\bar\gamma'\tilde Y$ conditional on $\bar\gamma$ being optimal approaches its unconditional distribution, which is normal. Thus, the critical value of the conditional test approaches the $1-\alpha$ quantile of the normal distribution.[26]

Remark 12. Equation (21) and Proposition 4.2 together show that the conditional approach uniformly controls size over all values of $\delta \in \Delta$, and is asymptotically efficient when $\delta$ further satisfies Assumption 5. In general, we cannot guarantee that the bounds of the identified set will be differentiable as a function of $\beta = \delta + \tau$, and the impossibility results in Hirano and Porter (2012) imply that no regular estimators of the identified set bounds exist when differentiability fails. However, one can show that if Assumption 5 holds, then the identified set bounds are differentiable in $\beta$. Proposition 4.2 therefore implies that although the conditional test controls size uniformly for all values of $\delta \in \Delta$, this does not come at the expense of efficiency in cases where Assumption 5 holds. Since researchers often do not know ex ante whether Assumption 5 is satisfied, this "robustness" property is desirable. Our results are thus somewhat analogous to results in the weak identification literature showing that certain procedures control size under weak identification but are efficient under strong identification (e.g., Moreira (2003)).

Remark 13 (Relationship to other methods). We are not aware of results analogous to Proposition 4.2 for any other moment inequality procedure that controls size in the finite-sample normal model. Observe that if Assumption 5 holds, then it also holds if $\Delta$ is augmented to include a moment that is non-binding at both endpoints of the identified set. Hence, for Proposition 4.2 to hold, the local asymptotic power of the test needs to be unaffected by the inclusion of such slack moments. For example, although relatively insensitive to the inclusion of slack moments, the procedures of Romano, Shaikh and Wolf (2014) and Andrews and Barwick (2012) are still affected by the inclusion of slack moments via the changes to the first-stage critical value and size-adjustment factor, respectively.[27]

[26] The fact that the conditioning event becomes trivial asymptotically under Assumption 5 explains how the conditional test is able to approach the power envelope for all valid tests, not just conditional tests.

Remark 14 (Finite-sample power of the conditional test). Note that the argument above for the optimality of the conditional approach relies on a unique vector of Lagrange multipliers $\bar\gamma$ being dual-optimal with probability approaching 1 asymptotically. The asymptotic guarantees of Proposition 4.2 thus may not translate to good finite-sample performance in settings where multiple vectors of Lagrange multipliers are optimal with nontrivial probability. Since a vector of Lagrange multipliers corresponds with a set of active moments in the primal problem (17), this will tend to occur in cases where the sets of binding and non-binding moments are not "well-separated" relative to the sampling variation in the data. Such a situation will tend to arise when Assumption 5 is "close" to being violated.[28] ∎

    5 Conditional-FLCI Hybrid Confidence Sets

Taken together, the results in Section 3 show that the FLCIs have attractive finite-sample properties for particular classes $\Delta$ of interest, but they may perform poorly even asymptotically for other types of restrictions. On the other hand, the conditional tests have good asymptotic properties for a wider range of restrictions, but they may perform poorly in finite samples in settings where the binding and non-binding moments are not well-separated relative to the sampling variation in the data. To combine these differing strengths, we propose a novel confidence set that hybridizes the conditional test with the optimal FLCI based on an affine estimator. This conditional-FLCI hybrid confidence set achieves desirable asymptotic properties similar to those of the conditional confidence set for a wide range of $\Delta$s. In simulations (discussed in more detail below), it leads to substantial improvements over the power of the conditional test in a variety of cases where the moments are not well-separated.

The conditional-FLCI hybrid confidence set is constructed by first testing whether a candidate parameter value lies within the optimal level-$(1-\kappa)$ FLCI, and then applying a conditional test to all parameter values that lie within the optimal FLCI. In the second stage, we use a modified version of the conditional test that i) adjusts size to account for the first-stage test, and ii) conditions on the event that the first-stage test fails to reject. Since

[27] In concurrent work, Cox and Shi (2020) propose a new method for testing moment inequalities with nuisance parameters, which, like the ARP test, is strongly insensitive to slack moments. It is thus possible that similar results could be obtained for their test as well.

[28] Indeed, in the supplementary material we prove a uniform version of the optimality result in Proposition 4.2 under a modified version of Assumption 5 that requires the non-binding moments to be uniformly bounded away from zero.


the construction of the hybrid test uses similar steps to the construction of the FLCIs and conditional test, we defer many of the technical details to Appendix A.3.

Formally, suppose that $0 < \kappa < \alpha$.[29] Consider the level $(1-\kappa)$ optimal FLCI, $C^{FLCI}_{\kappa,n} = a_n + v_n'\hat\beta_n \pm \chi_n$. Lemma A.3 shows that the test statistic $\hat\eta$ defined in (17) follows a truncated normal distribution conditional on the parameter value $\bar\theta$ falling within the level $(1-\kappa)$ optimal FLCI. With this result, the construction of the second stage of the conditional-FLCI hybrid test is analogous to the construction of the conditional test, except that it uses the modified size $\tilde\alpha = \frac{\alpha-\kappa}{1-\kappa}$ to account for the first-stage test. The conditional-FLCI hybrid test $\psi^{C\text{-}FLCI}$ equals

$$\psi^{C\text{-}FLCI}_{\kappa,\alpha}(\hat\beta_n, \bar\theta, \tilde\Sigma_n) = 1 \iff \bar\theta \notin C^{FLCI}_{\kappa,n} \ \text{ or } \ F_{\xi \mid \xi\in[v^{lo}_{C\text{-}FLCI},\, v^{up}_{C\text{-}FLCI}]}(\hat\eta) > 1-\tilde\alpha,$$

where $F_{\xi \mid \xi\in[v^{lo}_{C\text{-}FLCI},\, v^{up}_{C\text{-}FLCI}]}(\cdot)$ denotes the CDF of the truncated normal distribution derived in Lemma A.3.

Since the FLCI controls size, the first-stage test rejects with probability at most $\kappa$ under the null that $\theta = \bar\theta$. The second-stage test rejects with probability at most $\tilde\alpha = \frac{\alpha-\kappa}{1-\kappa}$ conditional on $\bar\theta \in C^{FLCI}_{\kappa,n}$. Together, these results imply that the conditional-FLCI hybrid test controls size,

$$\sup_{\delta\in\Delta,\ \tau}\ \sup_{\theta\in S(\Delta,\ \delta+\tau)}\ E_{(\delta,\tau,\Sigma_n)}\left[\psi^{C\text{-}FLCI}_{\kappa,\alpha}(\hat\beta_n, \bar\theta, \tilde\Sigma_n)\right] \le \alpha. \tag{23}$$

We therefore construct a conditional-FLCI hybrid confidence set for the parameter $\theta$ that satisfies (9) by inverting the conditional-FLCI hybrid test,

$$C^{C\text{-}FLCI}_{\kappa,\alpha,n} = \left\{\bar\theta : \psi^{C\text{-}FLCI}_{\kappa,\alpha}(\hat\beta_n, \bar\theta, \tilde\Sigma_n) = 0\right\}. \tag{24}$$

In Appendix A.3, we show that the conditional-FLCI hybrid confidence set inherits desirable asymptotic properties from the conditional approach: it is asymptotically consistent, and under the same conditions as Proposition 4.2, the conditional-FLCI hybrid test has local asymptotic power at least as good as that of the optimal $\frac{\alpha-\kappa}{1-\kappa}$-level test.
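The two-stage logic can be summarized in a short sketch (names ours; `flci` is the level-$(1-\kappa)$ FLCI and `cond_cdf` stands in for the truncated-normal CDF from Lemma A.3, which we treat as given):

```python
def hybrid_test(theta_bar, flci, eta, cond_cdf, alpha=0.05, kappa=0.005):
    """Conditional-FLCI hybrid rejection rule. First stage: reject if theta_bar
    lies outside the level-(1-kappa) FLCI. Second stage: conditional test at
    the modified size alpha_tilde = (alpha - kappa) / (1 - kappa)."""
    lo, hi = flci
    if not (lo <= theta_bar <= hi):
        return True                            # first-stage rejection
    alpha_tilde = (alpha - kappa) / (1 - kappa)
    return cond_cdf(eta) > 1 - alpha_tilde     # second-stage conditional rejection
```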

    6 Simulation study

In this section, we present a series of simulations illustrating the performance of the confidence sets discussed above across a range of relevant data-generating processes. We find good size control for all three of our proposed procedures, and therefore focus in the main text on a comparison of power. In the supplementary material, we present results on size control and other additional simulation results.

[29] In practice, we set $\kappa = \alpha/10$ following Romano et al. (2014) and ARP, although the optimal choice of $\kappa$ is an interesting question for future research.

    6.1 Simulation Design

Our simulations are calibrated using the estimated covariance matrix from the 12 recently-published papers surveyed in Roth (2019).[30] For any given paper in the survey, we denote by $\hat\Sigma$ the estimated variance-covariance matrix from the event-study in the paper, calculated using the clustering scheme specified by the authors. We then simulate event-study coefficients $\hat\beta_s$ from a normal model under the assumption of parallel trends and zero treatment effects, $\hat\beta_s \sim \mathcal{N}(0, \hat\Sigma)$.[31] In simulation $s$, we construct nominal 95% confidence intervals for the parameter of interest $\theta$ using the pair $(\hat\beta_s, \hat\Sigma)$ for each proposed procedure. The parameter of interest is the causal effect in the first post-period ($\theta = \tau_1$).[32]
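A minimal sketch of this simulation draw (the placeholder `Sigma_hat` stands in for a paper's estimated event-study covariance matrix):

```python
import numpy as np

Sigma_hat = np.eye(8)   # placeholder; in practice, a paper's estimated covariance
rng = np.random.default_rng(seed=0)
n_sims = 1000
# Draw event-study coefficients under parallel trends and zero treatment effects:
# beta_s ~ N(0, Sigma_hat), one row per simulation.
beta_draws = rng.multivariate_normal(
    mean=np.zeros(Sigma_hat.shape[0]), cov=Sigma_hat, size=n_sims
)
```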

For a given choice of $\Delta$, we compute the identified set $S(0, \Delta)$ and calculate the expected excess length of each of the proposed confidence sets. The excess length of a confidence set $C(\hat\beta)$ is the length of the part of the confidence set that falls outside of the identified set, defined as $EL(C; \hat\beta) = \lambda(C(\hat\beta) \setminus S(0,\Delta))$, where $\lambda$ denotes the Lebesgue measure. We benchmark the expected excess length of our proposed procedures against the optimal bound over confidence sets that satisfy the uniform coverage requirement (9).[33] For each paper, we conduct 1000 simulations and compute the optimal bound and the average excess length of the FLCI, the conditional confidence set, and the conditional-FLCI hybrid confidence set. The excess-length efficiency of a given procedure equals the ratio of the optimal bound to the simulated expected excess length. We report the efficiency ratios for the median paper in the survey.
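For interval-valued confidence sets and identified sets, the excess length has a simple closed form; a sketch (function name ours):

```python
def excess_length(ci, identified_set):
    """Excess length EL(C) = Lebesgue measure of C \\ S, for intervals
    C = [c_lo, c_hi] and S = [s_lo, s_hi]."""
    c_lo, c_hi = ci
    s_lo, s_hi = identified_set
    total = max(c_hi - c_lo, 0.0)
    overlap = max(min(c_hi, s_hi) - max(c_lo, s_lo), 0.0)
    return total - overlap
```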

We consider three choices of $\Delta$ to highlight the performance of the confidence sets across a range of conditions: $\Delta^{SD}(M)$, $\Delta^{SDPB}(M)$, and $\Delta^{SDI}(M)$. Table 1 summarizes which of our theoretical results hold for each of the simulation designs.

In all simulations, the units of the parameter $M$ are standardized to equal the standard

[30] Roth (2019) systematically reviewed the set of papers containing an event-study plot published in the American Economic Review, AEJ: Applied Economics, and AEJ: Economic Policy between 2014 and mid-2018. Section 4.1 of Roth (2019) discusses the sample selection criteria for this survey of published event studies.

[31] We focus on the normal simulations in the main text since this allows for a tractable computation of the optimal excess length of procedures that control size. In the supplementary material, we show that our procedures perform similarly in simulations based on the empirical distribution in the original paper.

[32] In the supplementary material, we provide simulation results in which the parameter of interest is the average causal effect in the post-periods ($\theta = \bar\tau_{post}$), and find the results are qualitatively unchanged.

[33] A formula for this optimal bound is provided in the supplementary materials, and follows as a corollary from results in Armstrong and Kolesár (2018) on the optimal expected length of a confidence set satisfying the uniform coverage requirement.


                                       $\Delta^{SD}$   $\Delta^{SDPB}$   $\Delta^{SDI}$
    Conditional / Hybrid
      Consistent                            ✓               ✓                ✓
      Asymptotically (near-)optimal         ✓               ✓                ✗
    FLCI
      Consistent                            ✓               ✗                ✗
      Asymptotically (near-)optimal         ✓               ✗                ✗
      Finite-sample near-optimal            ✓               ✗                ✗

Table 1: Summary of expected properties for each simulation design

error of the first post-period event-study coefficient ($\sigma_1$). We show results for a variety of choices of $M/\sigma_1$. All of the procedures and the optimal benchmarks are invariant to scale, meaning that the confidence set (or optimal benchmark) using $(M, \frac{1}{n}\Sigma^*)$ is $\frac{1}{\sqrt n}$ times that using $(\sqrt n M, \Sigma^*)$. Therefore, simulation results on excess length as $M/\sigma_1$ grows large are isomorphic to our asymptotic results presented earlier in which $n \to \infty$ for $\Sigma_n = \frac{1}{n}\Sigma^*$ and $M > 0$ fixed. The simulation results also have a finite-sample interpretation, illustrating how our results change as we allow the set of underlying trends to be more non-linear, holding $\Sigma^*$ constant.

The supplementary materials present results from several alternative simulation exercises. We find similar results using the empirical distribution from the first paper in Roth (2019)'s survey rather than the calibrated normal model studied in the main text. We also find qualitatively similar results when using the average of the post-treatment causal effects as the target parameter.

    6.2 Simulation Results

Results for $\Delta^{SD}(M)$: The top left panel of Figure 5 plots the efficiency ratio for each procedure as a function of $M/\sigma_1$ when $\Delta = \Delta^{SD}$. All procedures perform well as $M/\sigma_1$ grows large, with efficiency ratios approaching 1, illustrating our asymptotic (near-)optimality results. The FLCIs also perform quite well for smaller values of $M/\sigma_1$, including the point-identified case where $M = 0$, illustrating the finite-sample near-optimality results for the FLCIs when Assumption 3 holds. Although the conditional confidence sets have efficiency approaching the optimal bound for $M/\sigma_1$ large, their efficiency when $M/\sigma_1 = 0$ is only about 50%. This reflects the fact that when $M = 0$, the parameter is point-identified and Assumption 5 fails. The conditional-FLCI hybrid substantially improves efficiency for small values of $M/\sigma_1$, while still retaining near-optimal performance as $M/\sigma_1$ grows large.

Figure 5: Simulation results: median efficiency ratios for proposed procedures.

Note: Median efficiency ratios for our proposed confidence sets. The efficiency ratio for a procedure is defined as the optimal expected excess length divided by the procedure's actual expected excess length. The results for the FLCI are plotted in green, the results for the conditional-FLCI hybrid confidence interval in red, and the results for the conditional confidence interval in blue. Results are averaged over 1000 simulations for each of the 12 papers surveyed, and the median across papers is reported here.

Results for $\Delta^{SDPB}(M)$: The top right panel of Figure 5 plots the efficiency ratio for each procedure as a function of $M/\sigma_1$ when $\Delta = \Delta^{SDPB}$. The efficiency ratios for the conditional and hybrid confidence sets are (near-)optimal as $M/\sigma_1$ grows large, highlighting our asymptotic (near-)optimality results for these procedures in this simulation design. However, the efficiency ratios for the FLCIs steadily decrease as $M/\sigma_1$ increases, which reflects the fact that the FLCIs are not consistent in this simulation design when $M > 0$. We again see that the conditional-FLCI hybrid improves efficiency when $M/\sigma_1$ is small, while retaining near-optimal performance as $M/\sigma_1$ grows large.

Results for $\Delta^{SDI}(M)$: The bottom panel of Figure 5 plots the efficiency ratio for each procedure as a function of $M/\sigma_1$ when $\Delta = \Delta^{SDI}$. As summarized in Table 1, the conditions for asymptotic (near-)optimality do not hold for any of our procedures in this simulation design. Nonetheless, the conditional and hybrid procedures still perform quite well for large values of $M/\sigma_1$, with efficiency approaching about 90%. This evidence is encouraging, as it shows that these procedures may perform well asymptotically even in cases where Assumption 5 fails. The efficiency of the FLCIs degrades as $M/\sigma_1$ grows, reflecting that the FLCIs are inconsistent under this simulation design when $M > 0$. Once again, the conditional-FLCI hybrid improves efficiency when $M/\sigma_1$ is small, while retaining performance similar to the conditional approach as $M/\sigma_1$ grows large.

    7 Practical Guidance

We now provide practical guidance on how these methods may be used to assess the robustness of conclusions in difference-in-differences and event-study designs. In particular, we recommend that applied researchers take the following steps.

1) Estimate an "event-study"-type specification that produces a vector of asymptotically normal estimates $\hat\beta$, consisting of "pre-period" coefficients $\hat\beta_{pre}$ and "post-period" coefficients $\hat\beta_{post}$, where the post-period coefficients have a causal interpretation under a suitable parallel trends assumption.

2) Perform a sensitivity analysis in which inference is conducted under different assumptions about the set of possible violations of parallel trends $\Delta$.

3) Provide economic benchmarks for evaluating the different choices of $\Delta$. This involves using context-specific knowledge about potential confounding factors or using information from pre-treatment periods and placebo groups.


We recommend researchers select either the optimal FLCI or the conditional-FLCI hybrid confidence set based upon the properties of their specified choice of $\Delta$. For cases (e.g., $\Delta = \Delta^{SD}(M)$) where the conditions for the consistency of the FLCIs are non-restrictive and the conditions for finite-sample near-optimality under parallel trends hold, we recommend that researchers use the optimal FLCI. Outside of these special cases (e.g., when context-specific knowledge motivates sign or shape restrictions), we recommend that researchers use the conditional-FLCI hybrid confidence set. Our R package, HonestDiD, implements our methods and chooses the recommended procedure by default.

    7.1 When to Use Our Methods

The methods in this paper can be applied in most empirical settings in which researchers use an "event-study plot" to evaluate pre-existing trends. Our methods require that the researcher use an estimator $\hat\beta_n$ with an asymptotically normal limit, $\sqrt n(\hat\beta_n - \beta) \to_d \mathcal{N}(0, \Sigma^*)$, and that the reduced-form parameter $\beta$ satisfy the causal decomposition in Assumption 1. We now address two considerations that commonly arise in practice: i) staggered treatment timing, and ii) anticipatory effects.

First, a recent literature has shown that the coefficients from standard two-way fixed effects models may not be causally interpretable in the presence of staggered treatment timing and heterogeneous treatment effects across cohorts. To address these issues, Sun and Abraham (2020) and Callaway and Sant'Anna (2020) provide alternative strategies for estimating weighted averages of cohort-specific treatment effects at a fixed lag (or, for placebo analysis, lead) relative to treatment, which yield consistent estimates under a suitable parallel trends assumption.[34]

These estimates are asymptotically normal under mild regularity conditions, and so our recommended sensitivity analysis can be applied to gauge sensitivity to violations of the needed parallel trends assumption. In empirical settings with staggered treatment timing and heterogeneous treatment effects, we therefore recommend that researchers first use the methods of Sun and Abraham (2020) or Callaway and Sant'Anna (2020) for estimation, and then apply our results to conduct sensitivity analysis.

Next, in some cases there may be changes in behavior in anticipation of the policy of interest, and therefore $\beta_{pre}$ may reflect the anticipatory effects of the policy (e.g., see Malani and Reif (2015)). This violates Assumption 1, which assumes that pre-treatment coefficients do

[34] The literature on staggered treatment timing considers generalizations of the parallel trends assumption that impose that untreated potential outcomes for each treated cohort move in parallel to those for some control group; possibilities for the control group include never-treated units, not-yet-treated units, or the last cohort to be treated. See, e.g., Assumption 2 of Callaway and Sant'Anna (2020) or Section 4.2 of Sun and Abraham (2020).


not reflect causal effects. A simple solution is available if one is willing to assume that anticipatory effects occur only in a fixed window prior to the policy change. Under such an assumption, the researcher may re-normalize the definition of the "pre-treatment" period to be the period prior to when anticipatory effects can occur, in which case $\beta_{pre}$ is determined only by untreated potential outcomes.

    7.2 Sensitivity Analysis

We recommend that researchers report confidence sets under different assumptions about the set of possible differences in trends $\Delta$. This allows the reader to evaluate what assumptions need to be imposed in order to obtain informative inference.

For instance, in many cases a reasonable baseline choice for $\Delta$ may be $\Delta^{SD}(M)$, which relaxes the assumption of linear differences in trends by imposing that the slope of the differential trend can change by no more than $M$ between consecutive periods. By reporting robust confidence sets for different values of $M$, the researcher may evaluate the extent to which their conclusions change as we allow for the possibility of greater non-linearities in the underlying trend.
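As an illustration of this reporting style, a sensitivity analysis might loop over candidate values of $M$ and print the resulting robust confidence sets. The sketch below uses a hypothetical helper `robust_ci` standing in for the full procedure described above; the function and its signature are ours for illustration, not HonestDiD's API.

```python
# Hypothetical helper: robust_ci(beta_hat, Sigma_hat, M) is assumed to return
# the endpoints of the recommended robust confidence set for theta = tau_1
# under Delta^SD(M); it stands in for the machinery developed above.
for M in [0.0, 0.01, 0.02, 0.05]:
    lo, hi = robust_ci(beta_hat, Sigma_hat, M=M)
    print(f"Delta^SD(M), M = {M}: robust CI = [{lo:.3f}, {hi:.3f}]")
```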