TRANSCRIPT
1
Practical Sampling for Impact Evaluation
(aka shedding light on voodoo)
Laura Chioda (LAC Chief Economist office & Adopted by DIME)
Reducing Fragility, Conflict, Crime, and Violence
Lisbon, Portugal, 23-27 March 2014
2
Introduction
Now that you know how to build treatment and control groups in theory, how do you do it in practice?
1. Which population or groups are we interested in, and where do we find them? Selecting whom to interview
2. From that population, how many people/neighborhoods/units should be interviewed/observed? Sample size

Seemingly trivial, but “the devil is in the details”
Example: Suppose we want to understand whether a mix of pro-social mentoring and cognitive behavioral therapy for at-risk youth can mitigate anti-social and violent behavior
Heller et al. (NBER 2013): “Preventing Youth Violence and Dropout: A Randomized Field Experiment”
3
Introduction
Example (1): “Whom to interview” is informed by the research/policy question
1. Everyone (male, female, kids, elderly)?
2. All youth aged 14-16?
3. All youth aged 14-16 in urban areas?
4. All youth aged 14-16 in a particular city and in public schools?
Need some information before sampling: a complete listing of all units of observation available for sampling in each area or group
Introduction
“How many” – Sample Size – depends on a few ingredients
Example (2), intuitively, with Sample Size = 2:
One adolescent receives mentoring to reduce antisocial behavior (treatment)
A second adolescent does not (control)
The two have been selected at random
Impact = the difference between
the # of times these two adolescents come into contact with police (e.g., being stopped, arrested)
the # of times they have been disciplined in school, gotten into altercations or fights

Why does sample size matter? If too small, then you may draw conclusions that are not “robust”:
What if the youth receiving mentoring by chance has very violent peers?
Or, on the contrary, what if the one not receiving mentoring was by chance more risk averse and less impulsive, etc.?
4
5
Introduction
Why not assign the entire population (individuals; youth) either to the treatment or to the control group?
Ideal world: without budget or time constraints, interviewing everyone would be a good solution
In practice, interviews are costly and time consuming → not feasible
e.g., a Census every 10 years vs. more frequent household surveys that sample only a fraction of households

In sum:
Whom to interview is ultimately determined by our research/policy questions
Sample size matters & determines the credibility of results
It allows us to say with some “confidence” whether the average outcome in the treatment group is higher/lower than that in the comparison group
6
Road Map
What will we be doing now with the rest of the time?
1. What do we mean by confidence?
• How does confidence relate to sample size?
2. Ingredients to determine sample size
• Detectable effect size
• Probabilities of avoiding mistakes in inference (type I & type II errors)
• Variance of outcome(s)
3. Multiple treatments
4. Group-disaggregated results
5. Take-up
6. Data quality
7
Sample Size & Confidence (in your results)
Think of sample size as the accuracy of a measuring device:
The more observations you have
The more precise is your “measuring device”
The more confident you are about the conclusions of your evaluation

Example: guess the sentence below knowing only 2 letters
The # of revealed letters is analogous to the # of observations, where each letter, say, costs US$ 100,000
You have US$ 2M with which to uncover up to 21 letters (all of them)
If you guess wrong, you lose all of your investment
8
Sample Size & Confidence (in the results)
Let’s increase the number of “observations” (in this case letters)
This is so much easier
You feel more confident about guessing
Common sense: the more complicated the sentence, the more letters you would need
Below, we discuss the sense in which impacts can be “complicated” to detect and would require larger samples.
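The measuring-device analogy has a precise counterpart: the standard error of a sample mean falls with the square root of the number of observations. A minimal sketch (the outcome and its sd of 10 are hypothetical):

```python
import math

def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Hypothetical outcome (e.g., # of police contacts) with sd = 10:
# quadrupling the sample size only halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(10, n))  # -> 2.0, 1.0, 0.5
```

Because precision grows only with the square root of n, detecting subtle effects gets expensive quickly.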
9
Calculating the Sample Size
We understand confidence to mean “with some degree of certainty” or “with little error”
We are in luck!! This time, the statistical jargon & plain language point to the same notion
The same holds true in the statistical sense; it only entails formalizing what is meant by error.
The statistical derivation of the ideal sample size yields an ugly formula (it looks like voodoo):
N = (z_{α/2} + z_{1−β})² σ² / ( H(1−H) D² )

where D is the smallest effect size we wish to detect, σ² is the variance of the outcome, H is the share of the sample assigned to treatment, and z_{α/2} and z_{1−β} are the standard normal critical values for the significance level and the desired power. With H = 1/2 this reduces to N = 4σ² (z_{α/2} + z_{1−β})² / D².

Would you like me to derive this formula?
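The sample-size formula on this slide can be sketched in a few lines of code; the function name and defaults below are illustrative, and `NormalDist.inv_cdf` supplies the critical values:

```python
import math
from statistics import NormalDist

def total_sample_size(D, sigma, alpha=0.05, power=0.80, H=0.5):
    """Total N = (z_{alpha/2} + z_{power})^2 * sigma^2 / (H*(1-H)*D^2),
    where H is the share of the sample assigned to treatment."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_b = NormalDist().inv_cdf(power)          # critical value for type II error
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / (H * (1 - H) * D ** 2))

# Halving the detectable effect D quadruples the required N.
print(total_sample_size(D=5, sigma=50))   # -> 3140
print(total_sample_size(D=10, sigma=50))  # -> 785
```

Note how N moves with each ingredient: it grows with σ², shrinks with D², and is smallest when the sample is split evenly (H = 1/2).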
10
Calculating the Sample Size
Hopefully you answered “no” to my previous question (otherwise, early lunch break)
The intuitive approach will focus on 3 main ingredients:
1. Detectable effect size
2. Errors in inference: Type II (and type I) errors
3. Variance of outcome(s)
We will answer the following question: How do these 3 ingredients affect the credibility of your results? And therefore your choice of sample size?
11
1st Ingredient: Smallest Effect Size
We do not know in advance the effect of our policy. We want to design a precise way of measuring it
But precision is not cheap: need cost-benefit analysis to decide
1st ingredient: the smallest program effect size that you wish to detect
i.e., the smallest effect for which we would be able to conclude that it is statistically different from zero
“detect” is used in a statistical sense

Example: What if mentoring lowers the cost of crime (e.g., policing and incarceration expenditures) by 5% …
… but costs (extra man hours, tutoring materials, etc.) grow by 4.5%?
What if the aggregate benefits are lower than the cost of the IE?
12
1st Ingredient: Smallest Effect Size
Cost-benefit analysis guides us in determining the “smallest detectable effect”:
One that could be useful for policy
One that could justify the cost of an impact evaluation, etc.

The smaller the (EXPECTED) differences between treatment & control …
… the more precise the instrument has to be to detect them
The larger the sample needs to be
13
1st ingredient: Smallest Effect Size
The larger the sample
the more precise is the measuring device
the easier it is to detect smaller effects
Increasing sample size ≈ increasing precision (of our measuring device)
Who is taller? Which pair requires a more precise measuring device?
14
2nd Ingredient: Type II Error
Why is it important to be able to measure differences with precision?
Example: Treatment = cognitive behavioral therapy (CBT):
#Arrests Treatment very similar to (≈) #Arrests Control
If treatment and control outcomes are not statistically different, then we could conclude that our program has “no” effect for 2 reasons:
1. Because our instrument is not precise (Bad Inference)
2. Because the program indeed had no effect (Good Inference)
Unless we have “enough” observations, we would not be able to decide with confidence whether the “no effect” resulted from possibility 1 or 2.
15
2nd Ingredient: Type I Error (false positive)
In the previous example, suppose that, by pure chance, treatment youth tend to have parents who are more involved in their children’s upbringing (high-quality parental investments).
# Arrests Treatment (statistically) SMALLER than # Arrests Control
We conclude that our program has an effect (despite there being none in truth)
However, the difference depends only on the difference in parents’ involvement (Bad Inference)
Good news: the larger the sample size, the smaller we can make the probability of committing this type of error
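Both error types can be illustrated with a small simulation. This is a sketch with hypothetical numbers: outcomes are standardized, the true effect is 0.5 sd, and the test is a two-sample z-test with known sd:

```python
import random
from statistics import NormalDist, mean

def reject_null(treat, control, sigma, alpha=0.05):
    """Two-sample z-test with known sd and equal group sizes."""
    se = sigma * (2 / len(treat)) ** 0.5
    z = (mean(treat) - mean(control)) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def power_estimate(n, effect, sigma=1.0, reps=500, seed=0):
    """Share of simulated experiments that detect a true effect of size `effect`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0, sigma) for _ in range(n)]
        treat = [rng.gauss(effect, sigma) for _ in range(n)]
        hits += reject_null(treat, control, sigma)
    return hits / reps

# With 10 youth per arm a true 0.5-sd effect is usually missed (a type II
# error); with 100 per arm it is detected most of the time.
print(power_estimate(10, 0.5), power_estimate(100, 0.5))
```

The `alpha` parameter caps the type I error rate; the simulated detection rate is the power, i.e., one minus the type II error rate.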
16
One more Ingredient: Variance of Outcomes (1)
How does the variance of the outcome affect our ability to detect an impact?
Example: Of the two (circled) populations, which animals are bigger on average?
How many observations from each circle would you need to decide?
17
One more Ingredient: Variance of Outcomes (2)
Example: on average, which group has the larger animals?
The comparison is more complicated in this case, so you need more information (i.e., a larger sample)
The answer may depend on which members of the blue & red groups you observe
18
One more Ingredient: Variance of Outcomes (3)
Economic example: let’s look at our adolescents and mentoring
Imagine that the mentoring leads to a decline in disruptive incidents over 2 years (impact) from 60 to 50
Case A: Children are all very similar within treatment arms: distributions of incidents are very concentrated
Case B: Children are more heterogeneous, with distributions of incidents much more spread out (distributions overlap more)
Which instance requires a more precise measuring device?
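Cases A and B translate directly into required sample sizes via a standard two-sample power formula. A sketch, where the sds of 5 and 20 for the concentrated and spread-out cases are hypothetical:

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80):
    """Observations per arm to detect an effect D with outcome sd sigma
    (5% two-sided test, 80% power)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / D) ** 2)

# Impact: disruptive incidents fall from 60 to 50, so D = 10.
print(per_group_n(10, sigma=5))   # Case A (concentrated)  -> 4
print(per_group_n(10, sigma=20))  # Case B (spread out)    -> 63
```

Required n scales with σ², so quadrupling the sd multiplies the sample size by sixteen.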
19
One more Ingredient: Variance of Outcomes (4)
In sum:
More underlying variance (heterogeneity)
→ more difficult to detect differences
→ need a larger sample size

Tricky: How do we know about outcome heterogeneity before we determine our sample size and collect our data?
Ideal: pre-existing data … but often not available
Can use pre-existing data from a similar population
Example: surveys from other school districts, labor force surveys, institutional data on crime, etc.
Common sense
20
What else to consider when determining sample size
Additional features of the design/data that may have implications for the determination of sample size:
1. Multiple treatment arms
2. Group-disaggregated results
3. Take-up
4. Data quality
21
1. Multiple treatments
Testing different mechanisms: is CBT alone enough? Can I increase its effectiveness?
Let’s suppose we are interested in the effects of youth mentoring on academic performance
Reference: Cook et al. (2014)
We would like to test three different treatments:
Treatment 1: Academic remediation only
Treatment 2: Cognitive Behavioral Therapy (CBT) only
Treatment 3: Mentoring/CBT and academic remediation
Intuition: the more comparisons (treatments), the larger the sample size needed to be “confident”
22
1. Multiple treatments
Comparing multiple treatment groups requires very large samples
Analog to having “multiple” impact evaluations bundled into one
The more comparisons you make, the more observations you need
If the various treatments are very similar, differences between the treatment groups can be expected to be particularly small
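One way to make this concrete is a Bonferroni-style correction: with 3 treatment arms plus a control there are 6 pairwise comparisons, so each is tested at α/6 and every group must be bigger. A sketch with hypothetical effect size and sd (Bonferroni is one convention among several):

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80):
    """Observations per arm for one two-sample comparison at level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / D) ** 2)

single = per_group_n(D=5, sigma=20)                    # one comparison
adjusted = per_group_n(D=5, sigma=20, alpha=0.05 / 6)  # 6 pairwise comparisons
print(single, adjusted)  # the corrected test needs noticeably more per group
```

The stricter per-comparison significance level raises the critical value, and the required n grows with its square.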
23
Why do we need strata?
Group-disaggregated results:
Gender: Are effects different for boys and girls?
Location: For different neighborhoods?
For different family structures (e.g., 1- vs. 2-parent households)?

Punch line: To ensure balance across treatment and comparison groups, it is good to divide the sample into strata (aka groups) before assigning treatment
Strata = sub-populations (sub-groups or sub-sets)
Common strata: geography, gender, age, baseline values of the outcome variable
Treatment assignment (or sampling) occurs within these groups (i.e., randomize within strata)
24
What can go wrong, if you do not use strata?
Example: You randomize without stratification. Now you ask: What is the impact in a particular neighborhood?
[symbol] = Treatment & [symbol] = Control, assigned randomly
Can you assess with confidence the impact of mentoring within neighborhoods?
[Map of neighborhoods A, B, and C with treatment and control units]
25
Why do we need strata?
To answer consider a few neighborhoods: Region A: we have almost no kids in the control group Region B: very few observations, can you be confident? Region C: no observations at all
A
26
Why do we need strata?
To answer consider a few neighborhoods: Region A: we have almost no kids in the control group Region B: very few observations, can you be confident? Region C: no observations at all
B
27
Why do we need strata?
To answer consider a few neighborhoods: Region A: we have almost no kids in the control group Region B: very few observations, can you be confident? Region C: no observations at all
C
28
Why do we need strata?
How do we prevent these imbalances and restore confidence in estimates within strata?

Example: you have 6 neighborhoods
Instead of sampling 2400 students regardless of their neighborhood of origin, draw a sample within each neighborhood:
Sample 2400 ÷ 6 = 400 per neighborhood: 200 treatment & 200 control
I.e., random assignment to treatment within geographical units
Within each unit, ½ will be treatment, ½ will be control
Similar logic for gender, family structure, age, etc.
Which strata? Your research & policy question should guide you
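The scheme on this slide can be sketched as a small routine that randomizes separately within each stratum (names are illustrative, not from any particular library):

```python
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, seed=0):
    """Randomly assign half of each stratum to treatment, half to control."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[stratum_of[u]].append(u)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # randomize order within the stratum
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = "treatment"
        for u in members[half:]:
            assignment[u] = "control"
    return assignment

# 2400 students in 6 neighborhoods -> 200 treatment & 200 control in each.
students = list(range(2400))
neighborhood = {s: s % 6 for s in students}
groups = stratified_assignment(students, neighborhood)
```

Because the split is done stratum by stratum, balance within every neighborhood is guaranteed by construction rather than left to chance.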
29
Why do we need strata?
What about now? The treatment and control youth look “balanced” within neighborhoods
Much better!
30
Take up: Example
Rarely can we force people into programs (exception: mandatory school age)
We would now like to offer youth mentoring, but can only offer:
An incentive (e.g., skipping one class during mentorship/treatment)
Advertising for the program (communication campaign)

What if we offer the inducement to 500 youth and only 50 participate (often not at random)?
In practice, because of the low take-up rate, we end up with a less precise measuring device
We won’t be able to detect differences with precision
We can only find an effect if it is really large

Take-up:
A low take-up rate lowers the precision of our comparisons
It effectively decreases the sample size
31
Data Quality
Poor data quality effectively increases the required sample size
Missing observations: quality of data collection, attrition, migration
High measurement error: answers are not always precise
e.g., self-reported behavior or victimization status
e.g., poorly reported peer associations
e.g., recollection bias, framing, pleasing
Poor data quality can be partly addressed with a field coordinator on the ground monitoring data collection
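Classical measurement error adds noise variance on top of the true outcome variance, so the effective sd entering any power calculation grows. A sketch with hypothetical numbers (true sd 20, noise sd 15):

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80):
    """Observations per arm to detect effect D with outcome sd sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / D) ** 2)

# Under classical (independent, additive) measurement error, variances add.
sigma_true, sigma_noise = 20.0, 15.0
sigma_observed = math.sqrt(sigma_true ** 2 + sigma_noise ** 2)  # = 25.0
print(per_group_n(5, sigma_true), per_group_n(5, sigma_observed))  # -> 252 393
```

Here noisy measurement alone raises the required sample per arm by more than half, which is why investing in data quality can be cheaper than enlarging the sample.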
32
In Conclusion
Whom to interview is ultimately determined by our research / policy questions
How Many:

Element → Implication for Sample Size
The more (statistical) confidence/precision → the larger the sample size will have to be
The smaller the effects that we want to detect → the larger the sample size
The more underlying heterogeneity (variance) → the larger the sample size
The more complicated the design (multiple treatments, strata) → the larger the sample size
The lower the take-up rate → the larger the sample size
The lower the data quality → the larger the sample size
33
Power Calculation in Practice: an Example
Calculations can be made in many statistical packages, e.g., Stata or Optimal Design
The Optimal Design software is freely downloadable from the University of Michigan website: http://sitemaker.umich.edu/group-based/optimal_design_software
34
Power Calculation in Practice: an Example
Example: Experiment in Ghana designed to increase the profits of microenterprise firms
Baseline profits:
• 50 cedi per month
• Profits data are typically noisy, so a coefficient of variation > 1 is common

Example Stata code to detect a 10% increase in profits:
• sampsi 50 55, power(0.8) pre(1) post(1) r1(0.5) sd1(50) sd2(50)
• Having both a baseline and an endline decreases the required sample size (pre and post)

Results:
• 10% increase (from 50 to 55): 1,178 firms in each group
• 20% increase (from 50 to 60): 295 firms in each group
• 50% increase (from 50 to 75): 48 firms in each group (but this effect size is not realistic)

What if take-up is only 50%?
• Offer business training that increases profits by 20%, but only half the firms take it up
• Mean for the treated group = 0.5*50 + 0.5*60 = 55
• Equivalent to detecting a 10% increase with 100% take-up: need 1,178 firms in each group instead of 295
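The sampsi results above can be reproduced (up to rounding) with a hand-rolled power calculation. This is a sketch that treats the baseline as an ANCOVA control, so a baseline/endline correlation r shrinks the residual variance by (1 − r²):

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80, r=0.0):
    """Firms per arm to detect a profit change D with outcome sd sigma;
    baseline correlation r reduces residual variance by (1 - r^2) (ANCOVA)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (1 - r ** 2) * (sigma * z / D) ** 2)

# Baseline profits 50 cedi, sd 50, baseline/endline correlation 0.5:
for target in (55, 60, 75):
    print(target, per_group_n(target - 50, sigma=50, r=0.5))  # -> 1178, 295, 48

# 50% take-up dilutes a 20% effect into a 10% intention-to-treat effect
# (0.5*50 + 0.5*60 = 55), so the required n jumps from 295 to 1178.
```

Since take-up t scales the detectable effect by t, the required sample grows with 1/t²: halving take-up quadruples the sample, exactly the 295 → 1,178 jump on the slide.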