We're going to talk about a class of designs generally known as quasi-experiments. They're very important in evaluating educational programs and policies because often we might not have the right context or situation to do an RCT, a randomized controlled trial, so we need designs that help us approximate those random assignment studies and, hopefully, get us toward a valid estimate of the causal impact of the program or policy.

We're going to start with a bit of review, but hopefully you'll notice that we keep coming back to this same point, and it's very important when we're thinking about outcomes: the program effect is the difference between the observed outcome for an individual under treatment and the outcome that would have occurred for the same individual without treatment. That is our program effect, and as we've discussed, observing both is a physical impossibility with people; time continually moves on, and that disallows us from winding the clock back and changing the way the world is. So the key is the construction of an appropriate counterfactual condition: how do we use a design to create a group of folks who can stand in for this impossible counterfactual of observing the same individual in both conditions? That's a key aspect of thinking about program impact and about how we know whether a program worked. The rest of this presentation is really going to look at designs that we can use to approximate the desired counterfactual condition that we can't create in reality.

Important in this work is that we think about bias. Bias is something that systematically exaggerates or diminishes the program effect; it's tied into how someone gets into a program and it's also related to what their outcomes are. So we always have to be wary of this lurking bias, this unobserved thing that looks like a treatment effect, looks like a program effect, but is really due to some underlying characteristic or condition we didn't account for that's moving in the same direction as the quote, unquote true treatment effect. It's really important that we do the hard work of thinking that through: what are the potential sources of bias that could be manifested in my estimate of a program effect? That's hard work because the direction of the bias is difficult or impossible to know in advance, but resting in a place of strong theory, theory that guides the design of our interventions or programs and our thinking about what to expect in terms of outcomes and how that expected effect might vary across folks, really puts us in a better place to think about bias. And it's all-important because bias can make an effective program look ineffective or an ineffective program look effective, and both of those invalid estimates of the program effect have consequences. I've kind of said it already, but another way to say it is that we need to think deeply about how unobserved characteristics of individuals are related to both assignment and outcomes. Bias creeps in through that relationship.

To put a more formal name on that type of bias, we would call it selection bias. Remember that a program effect really rests on the assumption that the treatment and control groups would have the same outcomes without treatment, so that the underlying expected outcome is the same for both groups of people in the absence of treatment. Whether or not they get the treatment, we would expect to see the same thing happen in both groups; that allows us to compare them and see what the effect of treatment is. That assumption is questionable with non-equivalent comparison groups; no matter how close they look in terms of the things that we do observe, say race or test scores, the assumption can still be called into question, and we can still suspect that there might be selection bias in our estimate of the program effect. Key to this is understanding the process whereby individuals get into different groups, and when that process is unknown is where we run into this problem, so we need to do the work of making known the unknown: what is the process by which people got into different groups? In a random assignment study we know the process, and that's why it works; it's completely and totally based on random assignment, so we can then assume, as the first line says, that the outcome would be the same in both groups without treatment.

Another form of bias that is especially problematic in a random assignment study, but is important to consider in all of our studies, is attrition bias. In a way this is a form of selection bias as well, but here the process selects individuals to drop out or refuse to cooperate. So even if we randomly assign folks to groups and our assumption about potential outcomes holds, a differential, systematic process can cause people with similar characteristics to drop out of, say, the treatment. Let's say low performers systematically drop out of the treatment; then we come to the end and compare whoever is left in our treatment group, which is now high performers, to the control group, which spans the whole range of performers. We're going to get bias in our estimate of the program effect because we've systematically lost all the low performers; we're left with a different, no longer randomly assigned group of high performers, and that can look like a treatment effect. So it's very important, even if we're not in a randomized controlled trial, to think about who remains in and who drops out of our studies, because that changes how we talk about the group that's left and it changes how we think about what the outcome of the program is.

And as you should have seen in the readings and the presentations on threats to validity, there are other sources of bias that can creep into our analysis and our estimation of program effects. It's not just that outcome-related characteristics should be equivalent; the events and experiences of the groups outside of the intervention should also be equivalent. These are threats like history effects or maturation effects.

Quasi-experimental designs, then, are designs that minimize differences between treatment and control groups that influence outcomes, though we might not strictly call them control groups; we'll call them comparison groups. These designs seek as much equivalence as possible, and they force us to think through plausible sources of non-equivalence between groups, and again that's where theory comes back in: what is the theory of our independent variable, what is the theory behind our interventions or programs, what does theory tell us about how people will enter those programs, how they will stay in them and how they will benefit from them? All of those theory-based assumptions and hypotheses are very important in thinking about how we can design in equivalence between groups. And it's important to remark that there is a range of quasi-experimental designs. Some can be considered stronger than others, but ultimately bias is always a concern in these designs, and it's the combination of the design, the underlying theory and the logic that we bring to bear in the whole design process that helps us move bias from a strong possibility to a weaker possibility; that strengthens our inference, although bias can always be there.

In these designs there are three general characteristics or aspects that we need to think about. One is the determinants of outcome: we can control for characteristics that are related to differences in outcomes between groups, pulling them out of our estimate of the program effect, if you will. The second is the determinants of selection: several designs we're going to discuss explicitly model the selection mechanism as a way of controlling selection bias. And the third is reflexive controls: by adding more observations on the same units over time, there are ways we can leverage that to try to mitigate bias.

For the remainder of this presentation I'm generally going to follow along with the Henry chapter in the Wholey book as a way of introducing some of these designs and some of the ideas behind them. I do hope you recognize that the readings, this presentation and the work in this class are broadly an overview of each of these designs; it's intended to get you thinking about what's possible, what paths you could or should take if you're thinking about outcome evaluations. It's not going to provide you the skills to do it, but it will hopefully point you in the direction to go do some more study and work with your advisors on how to put these designs into effect.

So to do that I've also generated a bit of dummy data to illustrate some of the points made in the Henry chapter, and as you can see on the slide I specified a data generation model; I set up the context so that I know what the true program effect is. In this case I created a data set with a posttest, this Y at time t, that I set to be on average 70. I built in a program effect for students in treatment (Z = 1) versus students not in treatment (Z = 0), with a true program effect of five points. I built in an average pre-test of 60 and gave it some contribution to the posttest, so higher performers on the pre-test get a slightly higher posttest. I added an individual-characteristic covariate effect X, which we can think of as, say, a racial demographic or male/female, some other kind of individual characteristic, so someone who has that characteristic has a higher outcome than someone who doesn't, and then I introduced some random error. And I've also introduced a selection bias on X: there are a total of 500 individuals in this dummy data set, 250 in each group, so 250 in the Z = 1 treatment group and 250 in the Z = 0 control group. 200 members of the treatment group are members of X and only 20 in the control group are members of X, so you can see that's going to differentially impact the mean outcome, because the treatment group has more members with X and their scores are going to be higher. So with that context we're going to walk through some of the Henry designs and use this data to demonstrate.
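
To make that data generation model concrete, here is a minimal sketch of it in Python, assuming numpy and pandas and producing a simulated data frame that the later examples can reuse. The presentation doesn't state the pre-test slope or the error variance, so the slope of 1.0 and the error standard deviation of 5 below are illustrative assumptions; the treatment effect of 5, the X effect of 5, the pre-test mean of 60 and the 200-versus-20 selection on X come from the description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_per_group = 250

# Treatment indicator: 250 treated (Z = 1) followed by 250 comparison (Z = 0)
Z = np.repeat([1, 0], n_per_group)

# Selection on X: 200 of the 250 treated carry characteristic X, only 20 of the 250 comparisons do
X = np.concatenate([
    np.repeat([1, 0], [200, 50]),    # treatment group
    np.repeat([1, 0], [20, 230]),    # comparison group
])

# Pre-test: mean 60 plus noise, generated independently of X (as in the talk)
pretest = 60 + rng.normal(0, 5, size=2 * n_per_group)

# Posttest: constant 10 + pre-test contribution + true effect of 5 for Z + effect of 5 for X + error
Y = 10 + 1.0 * pretest + 5 * Z + 5 * X + rng.normal(0, 5, size=2 * n_per_group)

df = pd.DataFrame({"Y": Y, "Z": Z, "X": X, "pretest": pretest})
print(df.groupby("Z")["Y"].mean())   # treated mean is inflated because more treated members have X
```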

The first class of comparison designs are what are generally called naïve comparisons. This is simply comparing existing groups without controlling for anything: we have a group of treated individuals and we compare them to some other group of folks that didn't get the treatment. This is a common starting point; maybe we have an existing group of teachers that will receive the PD and then we go search and find some other group of teachers that didn't get the PD. You can see in the first equation that lowercase delta, the Greek letter we normally give to change, is simply estimated by subtracting the mean of the control group from the mean of the treatment group. That gives us a naïve estimate of the program effect. And then you can see a regression equation that is functionally the same thing: we specify a regression with Z entered for individual i, which gets a one if they were in the treatment group and a zero if not, and this gives us functionally the same estimate.

So in my dummy data what does this look like? At the top I've done a simple table of the mean of Y in the treatment group and the comparison group. You can see that the delta there, the contrast, the difference, is eight points. If we go back to the data generation mechanism, the longer equation down at the bottom, we know that the treatment contrast should be five, so this illustrates that the naïve comparison has bias in it. It has selection bias because part of the treatment selection was on X, which in this naïve comparison is contributing to our estimate of the effect of Z. Some of that positive influence of X, which as you can see I've put in parentheses with the error term, is now inflating our estimate of Z, and that's bias. If we were to do this we would erroneously conclude that the treatment was more powerful than it actually is. Maybe the difference is not that great, but the fact remains that we have a biased estimate. Because I specified this equation I know what the answer is, but we generally do not know the answer, and you can easily see how I could have specified this with no treatment effect and the same kind of selection; I might then have estimated a significant treatment contrast, and that would be erroneous because there's no true effect but I'm estimating that there is one.

So that's a naïve comparison. Generally we want to stay away from them if we can, because we know they don't control for any of these threats to validity and they leave the door wide open for selection bias and other types of bias. Hopefully you can see that another variant of this design would be a pre-/post-comparison; a naïve pre-/post-comparison with no control group would be functionally very similar to what I've presented here.
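
As a sketch of those two equivalent estimates, the raw mean difference and the regression of Y on Z alone, here is what they look like on the simulated data frame df from the earlier block (statsmodels is assumed). Both land near eight points rather than the true five, because the selection on X is absorbed into the coefficient on Z.

```python
import statsmodels.formula.api as smf

# Naive mean difference between treated and comparison groups
delta_naive = df.loc[df.Z == 1, "Y"].mean() - df.loc[df.Z == 0, "Y"].mean()
print(round(delta_naive, 1))                 # roughly 8, not the true effect of 5

# The equivalent regression of Y on the treatment indicator alone
naive_model = smf.ols("Y ~ Z", data=df).fit()
print(naive_model.params["Z"])               # functionally the same estimate
```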

So moving up, I guess, from a naïve comparison, we may implement a control for a pre-test. Often this is a good step to take because it's the same test pre and post, and the unobserved characteristic, X in our case, is likely related to both of those scores; folks that have X, where X is related to higher scores, will show that in both the pre- and posttests. So to some extent the pre-test is going to pull in some of the aspects of the unobserved X and pull some of that bias out of our estimate of the treatment. But it's not perfect, and as you can see again with the dummy data, where I've run a regression with the pre-test added to the indicator of treatment, the unobserved selection piece X is still sitting in the error term; you can see we actually did slightly worse here than we did with the straight naïve comparison. In this dummy data I didn't specify that the pre-test was generated by X; it was just some mean plus some error, so the pre-test can't soak up the X-related bias, and generally you can see we're not in a much better place than we were before. So controlling for a pre-test can sometimes get us a little further down the road to a stronger inference about our program effect, but we have to realize that it doesn't get us all the way there; the door is still open to a lot of unobserved selection factors that could bias our estimate of the program effect.
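
Here is the same idea as a short sketch, again reusing the simulated df and statsmodels; because the simulated pre-test is independent of X, the coefficient on Z stays inflated.

```python
import statsmodels.formula.api as smf

# Add the pre-test as a control; in this dummy data it absorbs essentially none of the
# selection bias, so the estimate of Z remains well above the true effect of 5
pretest_model = smf.ols("Y ~ Z + pretest", data=df).fit()
print(pretest_model.params[["Z", "pretest"]])
```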

Another strategy in this broad class is that we can control for, we can adjust for, covariates. In this case we've got our indicator for the program, Z, and we've got our X, the characteristic of individual i, and we're explicitly modeling that, but we haven't included the pre-test. You can see the estimate of that in the table, and we're doing much better; we've actually slightly underestimated the treatment effect in this case, and we've estimated X pretty much as I specified it. So we've gotten much of the way there. This is an overly simplified example; in practice there are likely to be a host of X's that would need to be controlled for, and some remaining unobserved ones that we don't know about that could still induce selection bias. But you can see that as we add to our model, as we become more explicit about the things related to outcomes and selection, we're getting to a stronger, more valid inference.
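
A corresponding sketch, again assuming the simulated df: with X driving both selection and outcomes, adjusting for it pulls most of the bias out of the coefficient on Z.

```python
import statsmodels.formula.api as smf

# Adjust for the covariate X (but not the pre-test); because X drives both selection and
# outcomes here, modeling it removes most of the selection bias from the estimate of Z
covariate_model = smf.ols("Y ~ Z + X", data=df).fit()
print(covariate_model.params[["Z", "X"]])
```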

So just to wrap up this section, we can combine the previous two, controlling for the pre-test and controlling for the covariate. You can see that again we're explicitly adding in the mechanisms that are related to outcomes and to treatment assignment, and in this case this is exactly the model I used to generate the data. So you can see in column four that the estimate for Z, the treatment effect, is spot on at five, the estimate for X is also at five, and the constant, the average difference here, is ten, which lines up with the 60 I specified for the pre-test and the 70 for the posttest. We have perfectly modeled the generation of the outcome, the Yi at the beginning there (there should be a t on it as well), and we've modeled selection too, so here we have no bias in our estimate of the program effect. Now the real world obviously is much more complicated. The models for how our Y's were created and how Z was assigned are ultimately unknown and unknowable, so we're generally never going to be in the place that column four suggests; we're never going to know the full data generation model. But we need to think about what it might be, and we need to think about what data we have available, or could collect, all undergirded by theory, to help us build these models of what needs to be controlled for. And you really need to think that through. That's important not only for specifying analytic models; it's important for justifying programs, for setting up why this is important to do and why the program could potentially intervene to produce better outcomes for kids or teachers. So it's vital; all the parts of this process are related to each other, and we really need to think through what I like to call the generation mechanisms.
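
And the combined specification as a sketch, once more assuming the simulated df; because it matches the data generation model, the estimates should land near 5 for Z, 5 for X and 10 for the constant.

```python
import statsmodels.formula.api as smf

# Control for both the pre-test and the covariate X; this mirrors the data generation
# model, so the coefficients recover the built-in values (up to sampling error)
full_model = smf.ols("Y ~ Z + pretest + X", data=df).fit()
print(full_model.params)
```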

The next design we're going to talk about is the interrupted time series, and the opportunity for this design often occurs in natural settings; these are sometimes called natural experiments. Let's deal with the time series part first. We have a long, ongoing series of observations on the same variable over time; that's our time series. Those observations can be on the same units: in the strictest, lowest-level sense we can have repeated observations on one individual, getting close to the single-subject design, which we're not going to cover in detail. It could be the same units, the same kids over time, or it could be different units; the example I've given on the slide is speeding violations or traffic accidents, which involve different people, different units, but the same item over time. So we've got this ongoing longitudinal data collection, and the interrupted part is that there is a clear introduction of some treatment at a specific point in that time series. We might often see that treatment as a policy change or a shift in policy. A good potential example that's ongoing here in Baltimore is that because of funding cuts, budgetary restrictions put in place, city schools will more than likely not be funding summer programs except for federally mandated ones, whereas in the past they have had an extensive early elementary summer program; that's not going to be funded this summer, so we have a clear policy shift induced by budgetary cutbacks. We now have a pre and a post, and an expectation that outcomes will be different on either side of the introduction of that treatment (and here the removal of a program can itself be considered the treatment). So you can see this creates the potential for comparisons to be made. With one group the comparison is potentially weak, because it's still open to a lot of threats to validity. We can also think about an interrupted time series with two groups: not only the pre/post comparison around the interruption point, but also a comparison group with the same time series, the same span of time and the same data points, where the interruption is not occurring at the same time. So the interrupted time series opens the potential for comparisons to be made with existing data, and it's a valid design to think about, remembering that it is still open to a lot of our threats to validity, these history and maturation effects. We've got to think those through.

To give an example of what an interrupted time series might look like, you see on this slide a chart with time on the bottom, years spanning roughly 1960 to 1970, and the fatality rate of traffic accidents in Britain, corrected for miles driven and with seasonal variations removed. Take a look at that; that is a time series. What do you notice about the chart? One of the first things you should notice is that there's a general decline in the fatality rate over the period. But you should also notice that in late '67 there is a vertical dotted line, which marks the enactment of the British Road Safety Act, an act put in place in Britain to try to combat drunk driving. It's a fairly involved system, but in essence we have police and other public safety officials addressing drunk driving through the enactment of this law, and it comes in 1967, so that's the interruption point: we have a shift in policy and in the actions of public safety officials toward drunk driving, and we see across the time series a general decline in the traffic fatality rate. So the first thing we should be concerned about here is that there already seemed to be a general decline; what can we tell post-enactment? Is the rate just continuing on its general downward trend, so maybe there is no effect of the program, or is there something else going on here? This is a basic setup of data where we could implement an interrupted time series design to explore that.

This slide presents a chart that drills into the time series from the previous slide, and there are two main points I want to bring up with respect to this design. One, you should immediately notice that it's a smaller slice of the longer time trend, and we need to consider how long the panel of data is that we have; something that looks obvious on a short timescale may look different on a larger timescale. Here we see the two years prior to the enactment and roughly two years after. The trend prior to enactment looks generally flat, then at the enactment point we see a dramatic decrease in fatalities, and then a somewhat upward trend afterwards. The other point is that this chart presents only fatalities on the weekend. When are people more likely to drink and drive? The weekend is an obvious answer, so when we pull out the time period that's most likely to be the primary focus of this act, we see something potentially different from all days combined together. So when we look at this we see some initial, preliminary evidence that we'd want to explore further: there is a shift in the fatality rate post-enactment compared to prior. We can use regression to specify this model and see to what extent that visual impression holds. This is starting to provide some evidence that perhaps the act did have an effect, but it appears to have had an initial effect and then, as people perhaps figured out ways around, say, roadblocks, fatalities start creeping back up.
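
To show what "using regression to specify this model" might look like, here is a minimal segmented regression sketch on made-up monthly data (the numbers below are hypothetical, not the actual British series). The coefficient on post captures the level change at the interruption and the coefficient on time_since captures any change in slope afterwards.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
months = np.arange(48)                       # four years of monthly observations
post = (months >= 24).astype(int)            # hypothetical policy enacted at month 24
time_since = np.where(post == 1, months - 24, 0)

# Hypothetical fatality rate: flat pre-period, a drop of 8 at enactment, slow rebound afterwards
rate = 40 - 8 * post + 0.3 * time_since + rng.normal(0, 2, size=48)
its = pd.DataFrame({"rate": rate, "month": months, "post": post, "time_since": time_since})

# Pre-existing trend (month), level change at the interruption (post), slope change after (time_since)
its_model = smf.ols("rate ~ month + post + time_since", data=its).fit()
print(its_model.params)
```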

The next part of this study shows you a way to think about what the counterfactual might be. The previous slide showed the weekends and the change in traffic fatality frequency; this one shows the weekdays. Compare this chart to the previous one and we're starting to build a case that the Safety Act appears to have an effect where it's intended, dramatically reducing fatalities on the weekend, while during the weekdays there is virtually no visual change in the base rate pre and post. So we're building the potential to show an effect of the enactment on its intended target of reducing fatalities. Overall, interrupted time series are a possibility as a design; they have unique requirements, including attention to long periods of data collection, and we have to realize they're rather weak by themselves. They're quite open to internal threats to validity, specifically history and maturation, because you're depending on changes in the time series to illustrate the effect. So you need to consider that if you have the opportunity to collect data over a long period of time.

Another class of designs that we can use to examine potential effects and impact are what are known as fixed effects models, or fixed effects designs, and these are models where individual units serve as their own controls: we are fixing the contribution of individuals, the unobserved characteristics of those individuals; we're assuming that they're fixed. What does that mean? Basically, in order to utilize fixed effects models we need multiple data points on the same units over time, two, three, four or more, and what we're doing is noticing that certain characteristics of individuals do not change over that time period; we can leverage that fact and remove that part of the individuals' characteristics from our estimate of the program effect. So in essence we have the ability to subtract out all of those time-invariant characteristics. As an example, the book gives a study about TFA (Teach For America), and they utilize fixed effects by noticing that students potentially have multiple teachers throughout the day and those teachers have different backgrounds and training. They found a sample of kids that have, say, a TFA teacher for one subject and a traditionally certified teacher for a different subject. Can we leverage that? The kid is the same in both classrooms, so can we subtract out the contribution of the kid in order to get at an estimate of the difference between the TFA teacher and the non-TFA teacher? That's how fixed effects are utilized there, and in essence they're putting in a dummy for each child. Another example of where one might consider fixed effects is my own work looking at charter schools and school choice. We have this persistent difficulty with selection bias, in that the kids most likely to go through the motions of school choice, and the family backgrounds that engender going through the school choice process, are likely related to outcomes. And we can recognize that we'll often see kids in a traditional public school prior to making a school choice move, and then we'll see the same kids in a school of choice, a charter public school let's say. There are many things about that kid that remain the same across those two settings and two schools that we can subtract out, and part of what we can potentially subtract out is this selection bias. In that case we need to find a situation where we have a large sample of kids that we observe in both places. Now this potentially gets us to removing some of the selection bias, but it has some downsides. One is that we can potentially difference out part of the signal that we're actually trying to estimate. It also requires a large sample, so we lose sample size and power, and we lose generalizability: in the charter example we can only estimate the fixed effects model among kids who make a change, so kids who never change context, kids that only appear in traditional schools or only in charter public schools, are not part of our analytic sample. Our generalization is then about the effect of charters on kids who change, which is slightly different from the question most people want answered, namely whether charters lead to better outcomes for all kids, and that we can't really get at with a fixed effects model.
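
Here is a bare-bones sketch of the logic with made-up data: each hypothetical student is observed twice, once with a TFA teacher and once without, and a dummy for each student absorbs everything about the student that does not change between the two classrooms. The "true" TFA effect of 3 points and the rest of the numbers are invented purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_students = 200
ability = rng.normal(0, 10, n_students)      # unobserved, time-invariant student characteristic

rows = []
for i in range(n_students):
    for tfa in (0, 1):                       # the same student seen in both conditions
        score = 50 + ability[i] + 3 * tfa + rng.normal(0, 4)   # assumed true TFA effect of 3
        rows.append({"student": i, "tfa": tfa, "score": score})
panel = pd.DataFrame(rows)

# A dummy for each student (the fixed effects) subtracts out each kid's own contribution,
# leaving the within-student contrast between the two teacher types
fe_model = smf.ols("score ~ tfa + C(student)", data=panel).fit()
print(fe_model.params["tfa"])
```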

The next class of designs that we need to discuss are matching designs, and these are likely to be relevant to a large proportion of you as you think about how you might design a study to get at outcomes. Here we're constructing a comparison group by finding units that match our treated units, kids or teachers or whoever the units are, on some selected characteristics. We need to think about which characteristics influence outcomes and also influence participation, and hopefully, by matching on that set of observable characteristics, we can get to a place where we can start to build a case that the comparison group is well matched with the treatment group. Even though we didn't have a random process, can we get a comparison group that looks as similar as possible to our treatment group? So we need to think about the characteristics that are related to those two things, outcome and participation; we should think about high-impact characteristics that are non-redundant and come up with a strategy for matching. Now, there are several methods we can use to form those matches, and we're not going to go into any depth on them. There's an article that was part of your readings, I think for the last section, by Elizabeth Stuart, who is here at Hopkins, on propensity score matching; she touches on this. There's also a wealth of information out there about propensity score matching; it's one of the well-regarded ways of forming matches, and there are some other methods as well.

But one point I want to raise: matching seems like a good course of action, and yet one immediate difficulty should be obvious. In order for it to perform well, to get us where we want to go, which is to identify the treatment effect and the treatment effect only, not these selection biases and other threats to validity, we have to observe the characteristics that are related to outcomes and participation, and maybe we're working from administrative data and we don't have a wealth of characteristics on which to match. The other thing we should immediately realize is that matching becomes harder as we add more characteristics. You might think, well, I'll one-to-one match them on these characteristics, but it only takes a couple of characteristics before it becomes a very difficult problem to actually find one-to-one perfect matches on all the covariates. That's where methods like propensity score matching come in: it's a way of taking a large set of characteristics and using them to predict participation among folks that actually receive the treatment and those that don't. That reduces everything to a single probability of receiving treatment; we match based on that probability, we can refine the matches so the groups are very close on that probability, and that forms our two groups. There are some concerns about under-matching; the methods are generally only as good as the characteristics we have available to us, and we have to realize that there's a host of unobserved things we're not matching on. There's the possibility that matching produces worse bias than if no matching were done at all, and we need to think about how reliable and stable our matching variables are and whether they are theoretically relevant to outcomes and participation. So this is a very thin discussion of matching. In the grand scheme, I think you should be considering it as you design your projects: think deeply about the mechanisms, the potential routes for bias to creep into any estimates of program effect you might generate, and consider trying a matching strategy. And like I said, there's a whole range of articles and materials out there to help you think through what the best strategy is to use for matching.
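
As a very rough sketch of the propensity score idea, here is a crude nearest-neighbour match (with replacement) on the simulated df from earlier, assuming statsmodels. A real application would match on a richer set of covariates, check balance, and use a dedicated matching package rather than this simple loop.

```python
import statsmodels.formula.api as smf

# 1. Model the probability of treatment from the observed characteristics
ps_model = smf.logit("Z ~ X + pretest", data=df).fit(disp=0)
df["pscore"] = ps_model.predict(df)

treated = df[df.Z == 1]
controls = df[df.Z == 0]

# 2. Match each treated unit to the control with the closest propensity score (with replacement)
matched_idx = [(controls["pscore"] - p).abs().idxmin() for p in treated["pscore"]]
matched_controls = controls.loc[matched_idx]

# 3. Compare mean outcomes across the matched pairs
print(treated["Y"].mean() - matched_controls["Y"].mean())
```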

The final design we're going to talk about is known as the regression discontinuity design, or RDD, and it's considered one of the more rigorous quasi-experimental designs. At the heart of it, we have a known mechanism whereby assignment is made, dividing people into a treated group and a comparison group. In a randomized controlled trial we know the mechanism and it's random assignment; in a regression discontinuity the assignment mechanism is not random, but we know what that mechanism is and we can potentially model it, and that gives us leverage in understanding and estimating the treatment effect. This underlying assignment mechanism is based on some qualification that individuals have.

So let's dig in a little more on this assignment variable. Really there are few restrictions on it. One is that it cannot be caused by the treatment; that would obviously bias our estimate of the treatment effect, so it should be measured prior to treatment. It's also important that the assignment variable, whatever it is, is something that can't be changed, for example date of birth; that's not going to change. It can be a pre-test of the outcome: we look across the range of scores and set a cut point in that range, and let's say folks that score below it receive treatment and folks that score above it do not. And notably, the assignment variable can even be totally unrelated to the outcome and have no substantive meaning at all; we could actually use child height to assign to treatment and control, and that would work for this design. The best assignment variables are continuous, because we're going to use this variable at the heart of our models to get at the treatment effect.


So this assignment variable is the selection mechanism into the treatment; we know what it is and it can be modeled. The important point is that assignment is based on a cut score that we set in the range of the assignment variable, and assignment is strictly based on that cut score only: for example, above the cut score you receive treatment and below it you do not. An example from the literature is the assignment of summer remediation in a study done by Brian Jacob in Chicago. That one actually ends up being a fuzzy regression discontinuity, which we'll talk about, but essentially they took kids' tests at the end of the year, and kids who scored below a certain criterion, the cut score, had to go to remediation, a summer program, while kids above it did not. Assuming that selection mechanism strictly held at the cut score, so there are no kids who scored just above the line who ended up in remediation and no kids who scored just below the line who ended up not receiving it, then as long as that's strict we have the ability to use a regression discontinuity design. Another example is Reading First, in some studies out of MDRC where Howard Bloom is one of the lead authors. Here we have a case of a limited pool of funds for reading interventions under the Reading First program, and to distribute those funds districts came up with rubrics that assigned points to schools' proposals for receiving the funds and the interventions they would implement. Then, setting a cut score within the range of scores on that rubric, schools that scored above it received the funds and implemented the programs, and those below it did not. So there's another example of a regression discontinuity.

So to restate it in a slightly different way: the regression discontinuity is a complete model of the selection process, so if you completely know and perfectly measure that selection model, you can adjust for differences in selection around the cut point and arrive at an unbiased estimate of the treatment effect. This design can provide a strong warrant for a causal effect when done well: as long as no unknown variable influences or determines assignment to groups and people adhere to that cut point, we have the opportunity to estimate an impact of the program.
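
As an illustration of how the cut-point adjustment works, here is a minimal sharp regression discontinuity sketch on hypothetical data: treatment is determined entirely by scoring below the cut, and the coefficient on the treatment indicator is the estimated jump in the regression line at the cut point. All of the numbers, including the assumed true jump of 4, are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
assign = rng.uniform(20, 80, 1000)           # assignment (running) variable
cut = 50
D = (assign < cut).astype(int)               # score below the cut -> assigned to treatment

# Outcome rises smoothly with the assignment variable, plus an assumed true jump of 4 for the treated
outcome = 30 + 0.5 * assign + 4 * D + rng.normal(0, 3, 1000)
rd = pd.DataFrame({"outcome": outcome, "D": D, "centered": assign - cut})

# Fit the outcome on the centered assignment variable, allowing a different slope on each side;
# the coefficient on D is the estimated discontinuity at the cut point
rd_model = smf.ols("outcome ~ D * centered", data=rd).fit()
print(rd_model.params["D"])
```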

So I think these charts, and you may need to resize your screen to really look at them in more detail, are helpful in illustrating what exactly a regression discontinuity design looks like. On the left side we have a randomized controlled trial, with random assignment to treatment or control; those individuals are represented by X's and O's, and you can see folks assigned to treatment across the whole range of pre-test scores (the charts plot pre-test against posttest), so treatment assignment is randomly dispersed across the range of scores. The top panel shows what the relationship between pre-test and posttest might look like if there were no treatment effect: there's one line with a cloud of individuals around it and no evidence of a treatment effect. If you look down below, that's a case with a treatment effect: the X's are clustered around the upper line and the O's are clustered around the lower line. There's a range of folks along those lines in both treatment conditions, and the gap between the two lines is the treatment effect. So what does a regression discontinuity look like? On the right are regression discontinuity plots. Here we have the assignment variable going across the x-axis and the scores going along the y-axis, and the cut point is the strong vertical line, with treatment assignment determined by that line: all the O's are clustered on one side and all the X's are clustered on the other. The top chart is what a regression discontinuity might look like with no treatment effect, and if you compare the two top panels you can see they look essentially the same, with the exception that our treatment folks are clustered on one side and our comparisons are clustered on the other; functionally the line relationship is the same. If we look at the bottom right we have a chart of a regression discontinuity where there is a treatment effect, and here we can see the discontinuity, which is where the design gets its name: at the cut point there is a clear jump, a clear gap, between the line running through the cloud of comparison individuals and the line running through the cloud of treatment individuals. So at that cut score we can identify a treatment effect. Now naturally there are a lot of considerations that go into pulling this off well, but the point I want to illustrate is that this is another way we can get, with some assumptions, a rigorous estimate of the causal impact of a program. It doesn't rely on random assignment, which can sometimes be unpalatable to folks in the field, and it still allows us to get to a treatment effect. So it's a fascinating design when the opportunity does arise: when we apportion or allocate resources based on a scale, we can potentially use that information to estimate a treatment effect.

So to summarize, and it's been one of our longer presentations, we're just barely scratching the surface of these designs. I think I've said this before, but validity is a property of inferences, not of the designs themselves; one design is not more valid than another, but the inferences that come from those designs have the potential to be valid or not. And this means that with all of our designs, regardless of what they are, even a randomized controlled trial, we must consider and think about all of the potential and plausible confounders, sources of bias and threats to validity that could be contained in our estimate of the treatment effect. That's crucial to determining, with some level of rigor, what the effect of a program is and to giving us a warrant to say this works or this doesn't, and we need to think it through. To reiterate, it's also vital that we go through this thought process whether or not we're doing an outcome evaluation. This thought process should undergird our justification for our program, our expectations for why someone should care about implementing it and why they should spend resources on it, regardless of whether we're going to estimate a program effect or carry out an outcome study. It's embodied in our logic models, it's embodied in the theory of treatment work we've done, it's embodied in thinking about fidelity of implementation, and it's in our literature; that's how we build a case for our programs. A final point I'll leave you with is that comparisons matter, and they matter a lot. We need to always ask ourselves: in comparison to what? What am I comparing this to? And I think that's legitimate whether we're talking about quantitative data, which is primarily what we've been dealing with here, or any data at all. Compared to what? What is this instance telling me with respect to some other instance, and how does that comparison help me understand what I have in front of me? Comparisons are vital to outcome evaluation; we must have something that we're comparing against, whether that's some standard base rate of performance or a group of folks that don't receive the treatment, and we have to think of ways to work comparisons into our studies. A lot of the projects I know about are going to have to think that through. You might say, "But I'm providing professional development to a whole school." Well, we still need to consider what the comparison is, what the potential comparison group is. It may not get me to a causal impact inference, but I should be thinking along those lines: what is another group of folks that don't receive it that I can look at to somehow better understand what I'm seeing from the folks I am implementing with? That's the heart of the matter, and we should be considering comparisons no matter what we do.
