
Page 1

THE OFFICE OF INSTITUTIONAL RESEARCH AND ASSESSMENT

OIRA, Assessment, & Quasi-Experimental Designs: An Introduction

September 2012
Reuben Ternes, OIRA

Page 2

Overview

• Today's presentation has three parts:
• 1) A little bit about me
• 2) A little bit about OIRA
• 3) A little bit about quasi-experimental design research techniques

Page 3

Me

• My background is in Quantitative Psychology.
• I tell most people that I am a statistician, since nobody really knows what a Quantitative Psychologist is.
• When people ask me what I do, I tell them that I work at a public university as an institutional researcher, and that it is my job to figure out how the university can educate students better, faster, smarter, stronger, and cheaper using data that the university routinely collects about its own students.

Page 4

More Me

• My specialty: Quasi-Experimental Designs
• Regression Discontinuity Designs, Propensity Score Matching, Time Series Analysis, etc.
• I will talk more about these later.
• Recent research: Data Mining/Machine Learning Algorithms (mostly Random Forest).
• These are algorithmic systems designed to predict outcomes from large amounts of data.
• I won't be talking about machine learning algorithms (MLAs) much today.

Page 5

OIRA

• Office of Institutional Research and Assessment.
• OIRA:
• Is the keeper of much of the university's official data.
• Is responsible for a great deal of federal and state reporting that deals with official data.
• Tracks enrollment, retention rates, graduation rates, etc.
• Projects enrollments, surveys students (NSSE, CIRP, etc.), and provides data to outside constituents (legislative requests, US News, etc.).

Page 6

OIRA Website

• www.oakland.edu/oira
• Lots of good stuff on our website!
• Current and historical data on:
• Enrollment, degrees, grade data, admissions data
• Survey results (NSSE, CIRP), IPEDS, CDS
• Research reports, presentations (like this one)
• Assessment data/information/resources
• Plans, rubrics, example reports, etc.

Page 7

What Else Does OIRA Do?

• Occasionally, we help faculty and staff with both research design and statistical interpretation.
• Surveys, design of experiments, statistical consultations, etc.
• OIRA also helps the UAC and GEC handle Assessment, as required by the NCA/HLC and other accrediting bodies.
• We also do a great deal of policy analysis.

Page 8

Policy Analysis

• What kinds of policy analysis?
• Should we recommend that all students take 16 or more credits their first semester, regardless of their incoming academic ability?
• If we raise our students' incoming academic profile, what impact will that have on retention and graduation rates?
• Does need-based aid increase retention rates? Does merit-based aid? Which one is better at attracting students?
• Policy analysis and assessment share a lot in common, and the same tools can be applied to both.
• The burden of proof might not be the same, but assessment of student learning outcomes can be greatly enhanced by thinking about it in terms of policy analysis.

Page 9

Quasi-Experimental Designs

• Much of the policy analysis research that I do is focused on causation, not correlation.
• That is, does Policy ‘X’ cause a change in behavior ‘Y’?
• University policy-makers are reluctant to use random assignment to estimate the impact of policy.
• But how can we establish causation without experimental designs and random assignment?
• Quasi-experimental designs!

Page 10

What is a Quasi-Experimental Design (QED)?

• Any research design that attempts to estimate causation without using random assignment.
• This is my definition. No doubt, others have better definitions than this.
• The best overall introduction to quasi-experimental design techniques that I know of is Shadish, Cook, & Campbell (2002).
• Unfortunately, it's a bit outdated...

Page 11

Random Assignment and Selection Bias

• Most quasi-experimental designs focus on the issue of selection bias.
• (i.e., bias that is introduced because participants in the treatment condition were selected non-randomly).
• There are various ways to either reduce or theoretically eliminate selection bias.
• These techniques are what the rest of this presentation is about.
• I'm about to throw a lot of acronyms around.
• When you forget what they mean, stop me and ask for clarification.

Page 12

Overview of QED

• I'll talk about 4 possible designs:
• Propensity Score Matching (PSM)
• Interrupted Time Series (ITS)
• Instrumental Variables Approach (IV)
• Regression Discontinuity Design (RDD)

Page 13

Propensity Score Matching

• Based on matched design principles.
• Construct a control group that is equivalent to the treatment group.
• (This is what random assignment does as n→∞.)
• Do this by finding control group members that are identical to participants.
• For example, if we wanted to study the impact of a red meat diet on health outcomes, we might match participants on the basis of gender, age, and smoking habits.
• Matching procedures work well when the number of matching variables is small and well understood.
• (They don't actually work well for most medical studies.)

Page 14

Matching = Overwhelming?

• How do you match when you have 50+ variables?
• You can't match them all exactly.
• Which variables are most important?
• Solution: Propensity-Score Matching (PSM).
• Assigns a ‘propensity’ score to each participant.
• Score = probability of belonging to the treatment group, based on all of the collective variables that are related to receiving treatment.
• It is easier to match on this score than on all of the variables individually.
• It's a clever way to reduce the dimensionality of the matching procedure (a minimal sketch follows below).
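To make the idea concrete, here is a minimal propensity-score matching sketch in Python. It assumes a hypothetical pandas DataFrame `df` with a binary `treated` flag, a binary `retained` outcome, and a few illustrative covariates; all column names are invented for this example and are not drawn from OIRA data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

covariates = ["hs_gpa", "act", "pell_eligible"]   # hypothetical covariate names


def psm_att(df: pd.DataFrame) -> float:
    """Return a simple matched estimate of the treatment effect on `retained`."""
    # Step 1: estimate each participant's propensity score -- the probability
    # of belonging to the treatment group given the covariates.
    model = LogisticRegression(max_iter=1000)
    model.fit(df[covariates], df["treated"])
    df = df.assign(pscore=model.predict_proba(df[covariates])[:, 1])

    treated = df[df["treated"] == 1]
    control = df[df["treated"] == 0]

    # Step 2: 1-to-1 nearest-neighbor matching on the propensity score
    # (with replacement and no caliper -- a real analysis would add both).
    matched_outcomes = [
        control.loc[(control["pscore"] - p).abs().idxmin(), "retained"]
        for p in treated["pscore"]
    ]

    # Step 3: mean difference between treated outcomes and matched controls.
    return treated["retained"].mean() - float(np.mean(matched_outcomes))
```

In practice one would also add a caliper, check covariate balance after matching, and consider matching without replacement, but the three steps above are the core of the method.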

Page 15

PSM in Action – Visual Estimation with LOESS Smoothing Curves

Page 16

PSM Pros & Cons

• PSM creates control groups that provide less biased estimates of treatment effects.
• Bias is lessened, but never eliminated.
• Needs lots of relevant variables.
• Missing a critical explanatory variable related to treatment is bad.
• Sample size hog. Generally inefficient.
• Results can be sensitive to the way the propensity score was created.
• The score is usually estimated with logistic regression, which must model interactions and non-linearities explicitly.
• I recommend Random Forest or similar techniques over logistic regression due to these modeling difficulties (see the sketch below).
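As a hypothetical continuation of the sketch above, swapping in a Random Forest only changes the step that produces the score; the matching step is unchanged. The `df` and `covariates` names are reused from the previous sketch.

```python
from sklearn.ensemble import RandomForestClassifier

# Only the scoring model changes relative to the logistic-regression sketch.
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=25, random_state=0)
rf.fit(df[covariates], df["treated"])
df["pscore"] = rf.predict_proba(df[covariates])[:, 1]
# Interactions and non-linearities among covariates are captured automatically,
# which is the modeling difficulty the slide attributes to logistic regression.
```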

Page 17

PSM References

• Brief introduction: Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
• Essential reading: Dehejia, R.H., & Wahba, S. (2002). Propensity score matching methods for non-experimental causal studies. Review of Economics and Statistics, 84(1), 151-161.
• Primer: Heinrich, C., Maffioli, A., & Vázquez, G. (2010). A primer for applying propensity-score matching. Impact Evaluation Guidelines, Strategy Development Division, Technical Notes No. IDB-TN-161. Washington, D.C.: Inter-American Development Bank.

Page 18

Interrupted Time Series Analysis

• One of the most compelling ways to show causality is to introduce and suspend an effect repeatedly while measuring an important outcome.
• When graphed, it should look a little bit like the rhythm of a heartbeat.

Page 19

[Chart: ITS Fictitious Example — an outcome (0–40) plotted over time periods 1–18, with two treatment intervals interleaved between baseline periods.]

Page 20

Real Example of ITS Design

[Chart: Math 1 two-year completion rates (0%–35%) by year, 2002–2005, with annotations marking which students take Math 2 vs. Math 3 and where Math 3 was implemented.]

• There was a significant drop in 2-year completion rates for Math 1 just after the implementation of the new Remedial Math 3 course.

Page 21

ITS Pros & Cons

• Many internal threats to validity are rendered unlikely with time series data.
• Visually intuitive. Easy to explain!
• Most problems with ITS occur when there is too little data:
• Either not enough years to estimate trends,
• Or low sample sizes that produce too much variation.
• Estimating effect sizes can sometimes be tricky (a segmented-regression sketch follows below).
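As one concrete way to put a number on an ITS effect, here is a minimal segmented-regression sketch in Python, in the spirit of the Wagner et al. reference on the next slide. The yearly series is simulated purely for illustration, and the variable names and intervention year are invented for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
year = np.arange(2000, 2012)
policy_year = 2006                                       # hypothetical intervention year
time = year - year.min()                                 # linear time trend: 0, 1, 2, ...
post = (year >= policy_year).astype(int)                 # level-change indicator
time_after = np.where(post == 1, year - policy_year, 0)  # slope-change term

# Simulated outcome: a mild upward trend with a drop at the intervention.
outcome = 0.30 + 0.005 * time - 0.06 * post + rng.normal(0, 0.01, size=year.size)

df = pd.DataFrame({"outcome": outcome, "time": time,
                   "post": post, "time_after": time_after})

# Segmented regression: `post` captures the immediate level change at the
# intervention; `time_after` captures any change in trend afterwards.
fit = smf.ols("outcome ~ time + post + time_after", data=df).fit()
print(fit.params[["post", "time_after"]])
```

With real data, a short yearly series like this would also call for autocorrelation-robust standard errors (e.g., Newey-West) before interpreting the estimates.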

Page 22

ITS References

• Brief introduction: Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
• ITS + segmented regression: Wagner, A.K., Soumerai, S.B., Zhang, F., & Ross-Degnan, D. (2002). Segmented regression analysis of interrupted time series studies in medication use research. Journal of Clinical Pharmacy and Therapeutics, 27, 299–309.

Page 23

Instrumental Variables

• Used in regression.
• Estimates the causal impact of a treatment when the explanatory variables are correlated with the regression's error term.
• Let's go over a classic example.
• Does smoking harm health outcomes?

Page 24

IV – Example

• We can't estimate the effect of smoking on health outcomes simply by adding smoking behavior and other relevant variables (age, gender, etc.) into a regression.
• Smoking behavior could be correlated with health outcomes even if smoking did not cause health problems.
• For example, smoking could be correlated with lower SES, which is well known to also be correlated with health outcomes.

Page 25

IV – Selecting the Instrument

• But what if you could find something that was related to smoking behavior, but not to the outcome?
• Taxes?
• Taxes change the cost of smoking, which should change behavior.
• Taxes change by state and over time, which allows analysis on multiple levels.
• Including the IV allows for (theoretically) unbiased treatment estimates (see the two-stage sketch below).
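To show the mechanics, here is a rough two-stage least squares sketch of the smoking example, written as two explicit OLS stages so the logic is visible. It assumes a hypothetical pandas DataFrame `df` with columns `health`, `smoking`, `cig_tax`, and `age`, all invented for illustration.

```python
import statsmodels.formula.api as smf

# Stage 1: predict the endogenous regressor (smoking) from the instrument
# (cigarette taxes) plus the exogenous controls.
stage1 = smf.ols("smoking ~ cig_tax + age", data=df).fit()
df["smoking_hat"] = stage1.fittedvalues

# Stage 2: regress the outcome on the *predicted* smoking behavior. Only the
# tax-driven variation in smoking is used, which is the variation assumed to
# be unrelated to unobserved confounders such as SES.
stage2 = smf.ols("health ~ smoking_hat + age", data=df).fit()
print(stage2.params["smoking_hat"])   # IV estimate of the effect of smoking

# Caveat: hand-rolled 2SLS gives the right point estimate but understates the
# standard errors; a dedicated IV routine should be used for real inference.
```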

Page 26

IV Pros & Cons

• Including the IV allows for (theoretically) unbiased treatment estimates.
• But only for large samples.
• Can be used in conjunction with other types of analyses (RDD, ITS, etc.).
• Good IVs can be difficult to find.
• ‘Weak’ IVs are sometimes ineffective, even if valid.
• Sometimes the IV may appear to be unrelated to the outcome when, in fact, it is related.
• Health-conscious states that have better health outcomes may have a tendency to raise taxes on cigarettes.

Page 27

IV Resources

• IV has changed a great deal in the last decade.
• I haven't been able to keep up with it well.
• There are probably better references than the ones I'm giving here.
• IV and smoking example: Leigh, J.P., & Schembri, M. (2004). Instrumental variables technique: cigarette prices provided better estimate of effects of smoking on SF-12. Journal of Clinical Epidemiology, 57, 284-293.
• IV and education example: Bettinger, E., & Long, B.T. (2009). Addressing the needs of under-prepared students in higher education: Does college remediation work? Journal of Human Resources, 44(3), 736-771.
• IV with RD example: Martorell, P., & McFarlin, I. (2007). Help or hindrance? The effects of college remediation on academic and labor market outcomes (Working Paper). Dallas, TX: University of Texas at Dallas, Texas Schools Project.

Page 28

Regression Discontinuity (RD) Designs

• Often, we separate people into groups based on some arbitrary point along a continuous variable:
• Medicare (65 and older)
• Certification exams (if you reach a certain score, you pass; otherwise, you fail)
• Scholarship criteria (need a certain ACT score or HS GPA)
• Federal aid programs (eligibility based on income)
• Taxes (i.e., your tax bracket)
• Placement of students into remedial courses
• Admission criteria (MCATs, SATs, LSATs, ACTs, etc.)

Page 29

Determining Effectiveness

• It is difficult to determine the effectiveness of a policy or program when it uses a cut-score to assign treatment.
• These groups are separated because we believe that they behave differently in some fashion.
• Comparing them directly doesn't make sense, because we expect them to be different before treatment begins!

Page 30

RD as a Solution

• Regression Discontinuity (RD) designs exploit the assignment rule to estimate the causal impact of the program.
• RDs can provide unbiased treatment estimates when certain assumptions are met:
• The functional form must be modeled correctly.
• The assignment rule must be exact, and modeled correctly.
• Many real-world examples meet these assumptions.
• Many statisticians consider RD to be equivalent to an experimental design with random assignment (assuming the assumptions can be met).

Page 31

The Logic of RD

• Participants just above the cut-off and just below the cut-off should be fairly similar to each other.
• However, because of the assignment rule, they have vastly different experiences.
• We can then compare participants very near to each other, but with different experiences, and use regression to compensate for any remaining differences based on the assignment score.
• When we graph the data, a discontinuity between the regression lines is evidence of an effect (a minimal sketch follows below).
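Here is a minimal sharp-RD sketch of that logic, assuming a hypothetical pandas DataFrame `df` with an `act` running variable (cut-off at 21, as in the example later in this presentation) and a binary `retained` outcome; the column names and bandwidth are illustrative only.

```python
import statsmodels.formula.api as smf

cutoff = 21
df["centered"] = df["act"] - cutoff                  # running variable, centered at the cut-off
df["above"] = (df["act"] >= cutoff).astype(int)      # treatment-assignment indicator

# Restrict to observations near the cut-off and allow separate slopes on each
# side; the coefficient on `above` is the estimated jump (the local effect).
bandwidth = 4
local = df[df["centered"].abs() <= bandwidth]
fit = smf.ols("retained ~ above + centered + above:centered", data=local).fit()
print(fit.params["above"])
```

A real analysis would also check bandwidth sensitivity, the density of observations around the cut-off, and alternative functional forms, as noted on the RD Weaknesses slide.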

Page 32

A Fictitious Example of RD

Page 33

RD Weaknesses

• RD can only assess the impact of a program near the cut-off score.
• The assignment rule is not always clear cut.
• Such cases have issues. They are called ‘fuzzy’ RDs. See Trochim (1984).
• Other minor issues may complicate RD (and make it time consuming):
• Density tests.
• Bandwidth optimality.
• Alternative functional forms.
• Relative to random assignment, RD still remains complicated.

Page 34

RD Resources

• Intro to RD: Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
• Intro to RD: West, S.G., Biesanz, J.C., & Pitts, S.C. (2000). Causal inference and generalization in field settings: Experimental and quasi-experimental designs. In H.T. Reis & C.M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 40-84). New York: Cambridge University Press.
• Advanced discussion: Lee, D., & Lemieux, T. (2009). Regression discontinuity designs in economics (NBER Working Paper). Cambridge, MA: National Bureau of Economic Research.
• Advanced discussion/primer: Lesik, S. (2006). Applying the regression-discontinuity design to infer causality with non-random assignment. Review of Higher Education, 30(1), 1–19.
• IV with RD example: Martorell, P., & McFarlin, I. (2007). Help or hindrance? The effects of college remediation on academic and labor market outcomes (Working Paper). Dallas, TX: University of Texas at Dallas, Texas Schools Project.
• Calcagno, J.C., & Long, B.T. (2008). The impact of postsecondary remediation using a regression discontinuity approach: Addressing endogenous sorting and noncompliance (NCPR Working Paper). New York: National Center for Postsecondary Research.
• Fuzzy RD: Trochim, W. (1984). Research design for program evaluation: The regression-discontinuity approach. Beverly Hills, CA: Sage.
• Formal definition and proof: Rubin, D.B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 4-58.
• History of RD: Cook, T.D. (2008). "Waiting for life to arrive": A history of the regression-discontinuity design in psychology, statistics and economics. Journal of Econometrics, 142(2), 636–654.

Page 35

A Real World Example (If We Have Time)

• Does need-based financial aid improve retention rates?
• Let's examine the data visually.
• Instead of imposing a functional form (like a line or a curve), I will explore the data with LOESS smoothing curves.
• These are free-forming curves that show the shape of the data without imposing strict assumptions about what the data must be like.
• They help with the visual estimation, but they won't give us any regression estimates.
• They are good for exploration, to see if linear regression is appropriate (a plotting sketch follows below).

Page 36

First, Some Background

• In order to qualify for some of our need-based aid programs (institutional grants), you must have an ACT score of 21 or higher and demonstrate need.
• Students who are not eligible for federal or state need-based aid are also not eligible for OU's need-based aid.
• The grant size is usually substantial.
• (There's also a HS GPA requirement, but we will ignore that for now and focus only on ACT scores.)
• The data represent about 5,000 students.

Page 37

Retention Rates by ACT Score and Need Status

Page 38

The Dashed Lines

• The dashed lines represent students who have not demonstrated financial need.
• They are separated into two groups: those with ACT scores of 21 and above, and those with scores below 21.
• Notice that there is NO gap, or discontinuity, between the two groups.
• The two dashed lines could be drawn in the same color and you would never be able to tell where the discontinuity was.

Page 39

Retention Rates by ACT Score and Need Status

Page 40

The Solid Lines

• The red line (on the left) represents students with ACT scores of less than 21.
• These students did not qualify for a large portion of OU's need-based aid.
• The black solid line (on the right) represents students with ACT scores of 21 or more.
• These students potentially qualify for additional need-based aid.
• Notice the HUGE gap between the red and black lines.
• This gap is evidence that need-based aid had a positive impact on the retention of students near the cut-off.

Page 41

Retention Rates by ACT Score and Need Status