TRANSCRIPT
1
Practical Sampling for Impact Evaluation
(aka shedding light on voodoo)
Laura Chioda (LAC Chief Economist office & Adopted by DIME)
Reducing Fragility, Conflict, Crime, and Violence
Lisbon, Portugal, 23-27 March 2014
2
Introduction
Now that you know how to build treatment and control groups in theory, how do you do it in practice?
1. Which population or groups are we interested in, and where do we find them? Selecting whom to interview
2. From that population, how many people/neighborhoods/units should be interviewed/observed? Sample size

Seemingly trivial, but “the devil is in the details”
Example: Suppose we want to understand whether a mix of pro-social mentoring and cognitive behavioral therapy for at-risk youth can mitigate anti-social and violent behavior
Heller et al. (NBER 2013): “Preventing Youth Violence and Dropout: A Randomized Field Experiment”
3
Introduction
Example (1): “Whom to interview” is informed by the research/policy question
1. Everyone (male, female, kids, elderly)?
2. All youth aged 14-16?
3. All youth aged 14-16 in urban areas?
4. All youth aged 14-16 in a particular city and in public schools?
Need some information before sampling: a complete listing of all units of observation available for sampling in each area or group
Introduction
“How many” – Sample Size – depends on a few ingredients
Example (2), intuitively, with Sample Size = 2:
One adolescent receives mentoring to reduce antisocial behavior (treatment)
A second adolescent does not (control)
The two have been selected at random
Impact = the difference between
the # of times these two adolescents come into contact with police (e.g., being stopped, arrested)
the # of times they have been disciplined in school, gotten into altercations or fights

Why does sample size matter? If too small, then you may draw conclusions that are not “robust”:
What if the youth receiving mentoring by chance has very violent peers?
Or, on the contrary, what if the one not receiving mentoring was by chance more risk averse and less impulsive, etc.?
4
5
Introduction
Why not assign the entire population (individuals; youth) either to the treatment or to the control group?
Ideal world: without budget or time constraints, interviewing everyone would be a good solution
In practice, interviews are costly and time consuming → not feasible
e.g., a Census every 10 years vs. more frequent household surveys that sample only a fraction of households

In sum:
Whom to interview is ultimately determined by our research/policy questions
Sample size matters & determines the credibility of results
It allows us to say with some “confidence” whether the average outcome in the treatment group is higher/lower than that in the comparison group
6
Road Map
What will we be doing now with the rest of the time?
1. What do we mean by confidence?
• How does confidence relate to sample size?
2. Ingredients to determine sample size
• Detectable effect size
• Probabilities of avoiding mistakes in inference (type I & type II errors)
• Variance of outcome(s)
3. Multiple treatments
4. Group-disaggregated results
5. Take-up
6. Data quality
7
Sample Size & Confidence (in your results)
Think of sample size as the accuracy of a measuring device:
The more observations you have
The more precise is your “measuring device”
The more confident you are about the conclusions of your evaluation

Example: guess the sentence below knowing only 2 letters
The # of revealed letters is analogous to the # of observations, where each letter, say, costs US$ 100,000
You have US$ 2M with which to uncover up to 21 letters (all of them)
If you guess wrong, you lose all of your investment
8
Sample Size & Confidence (in the results)
Let’s increase the number of “observations” (in this case letters)
This is so much easier
You feel more confident about guessing
Common sense: the more complicated the sentence, the more letters you would need
Below, we discuss the sense in which impacts can be “complicated” to detect and would require larger samples.
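The measuring-device analogy has a precise counterpart: the standard error of a sample mean falls with the square root of the number of observations. A minimal sketch (the outcome and its sd of 10 are hypothetical):

```python
import math

def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Hypothetical outcome (e.g., # of police contacts) with sd = 10:
# quadrupling the sample size only halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(10, n))  # -> 2.0, 1.0, 0.5
```

Because precision grows only with the square root of n, detecting subtle effects gets expensive quickly.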
9
Calculating the Sample Size
We understand confidence to mean “with some degree of certainty” or “with little error”
We are in luck!! This time, the statistical jargon & plain language point to the same notion
The same holds true in the statistical sense; it only entails formalizing what is meant by error.
The statistical derivation of the ideal sample size yields an ugly formula (it looks like voodoo):
N = (z_{α/2} + z_{1−β})² σ² / ( H(1−H) D² )

where D is the smallest effect size we wish to detect, σ² is the variance of the outcome, H is the share of the sample assigned to treatment, and z_{α/2} and z_{1−β} are the standard normal critical values for the significance level and the desired power. With H = 1/2 this reduces to N = 4σ² (z_{α/2} + z_{1−β})² / D².

Would you like me to derive this formula?
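The sample-size formula on this slide can be sketched in a few lines of code; the function name and defaults below are illustrative, and `NormalDist.inv_cdf` supplies the critical values:

```python
import math
from statistics import NormalDist

def total_sample_size(D, sigma, alpha=0.05, power=0.80, H=0.5):
    """Total N = (z_{alpha/2} + z_{power})^2 * sigma^2 / (H*(1-H)*D^2),
    where H is the share of the sample assigned to treatment."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_b = NormalDist().inv_cdf(power)          # critical value for type II error
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / (H * (1 - H) * D ** 2))

# Halving the detectable effect D quadruples the required N.
print(total_sample_size(D=5, sigma=50))   # -> 3140
print(total_sample_size(D=10, sigma=50))  # -> 785
```

Note how N moves with each ingredient: it grows with σ², shrinks with D², and is smallest when the sample is split evenly (H = 1/2).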
10
Calculating the Sample Size
Hopefully you answered “no” to my previous question (otherwise, early lunch break)
The intuitive approach will focus on 3 main ingredients:
1. Detectable effect size
2. Errors in inference: Type II (and type I) errors
3. Variance of outcome(s)
We will answer the following question: How do these 3 ingredients affect the credibility of your results? And therefore your choice of sample size?
11
1st Ingredient: Smallest Effect Size
We do not know in advance the effect of our policy. We want to design a precise way of measuring it
But precision is not cheap: need cost-benefit analysis to decide
1st ingredient: the smallest program effect size that you wish to detect
i.e., the smallest effect for which we would be able to conclude that it is statistically different from zero
“detect” is used in a statistical sense

Example: What if mentoring lowers the cost of crime (e.g., policing and incarceration expenditures) by 5% …
… but costs (extra man hours, tutoring materials, etc.) grow by 4.5%?
What if the aggregate benefits are lower than the cost of the IE?
12
1st Ingredient: Smallest Effect Size
Cost-benefit analysis guides us in determining the “smallest detectable effect”:
One that could be useful for policy
One that could justify the cost of an impact evaluation, etc.

The smaller the (EXPECTED) differences between treatment & control …
… the more precise the instrument has to be to detect them
The larger the sample needs to be
13
1st ingredient: Smallest Effect Size
The larger the sample
the more precise is the measuring device
the easier it is to detect smaller effects
Increasing sample size ≈ increasing precision (of our measuring device)
Who is taller? Which pair requires a more precise measuring device?
14
2nd Ingredient: Type II Error
Why is it important to be able to measure differences with precision?
Example: Treatment = cognitive behavioral therapy (CBT):
#Arrests Treatment very similar to (≈) #Arrests Control
If treatment and control outcomes are not statistically different, then we could conclude that our program has “no” effect for 2 reasons:
1. Because our instrument is not precise (Bad Inference)
2. Because the program indeed had no effect (Good Inference)
Unless we have “enough” observations, we would not be able to decide with confidence whether the “no effect” resulted from possibility 1 or 2.
15
2nd Ingredient: Type I Error (false positive)
In the previous example, suppose that, by pure chance, treatment youth tend to have parents who are more involved in their children’s upbringing (high-quality parental investments).
# Arrests Treatment (statistically) SMALLER than # Arrests Control
We conclude that our program has an effect (despite there being none in truth)
However, the difference depends only on the difference in parents’ involvement (Bad Inference)
Good news: the larger the sample size, the smaller we can make the probability of committing this type of error
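Both error types can be illustrated with a small simulation. This is a sketch with hypothetical numbers: outcomes are standardized, the true effect is 0.5 sd, and the test is a two-sample z-test with known sd:

```python
import random
from statistics import NormalDist, mean

def reject_null(treat, control, sigma, alpha=0.05):
    """Two-sample z-test with known sd and equal group sizes."""
    se = sigma * (2 / len(treat)) ** 0.5
    z = (mean(treat) - mean(control)) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def power_estimate(n, effect, sigma=1.0, reps=500, seed=0):
    """Share of simulated experiments that detect a true effect of size `effect`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0, sigma) for _ in range(n)]
        treat = [rng.gauss(effect, sigma) for _ in range(n)]
        hits += reject_null(treat, control, sigma)
    return hits / reps

# With 10 youth per arm a true 0.5-sd effect is usually missed (a type II
# error); with 100 per arm it is detected most of the time.
print(power_estimate(10, 0.5), power_estimate(100, 0.5))
```

The `alpha` parameter caps the type I error rate; the simulated detection rate is the power, i.e., one minus the type II error rate.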
16
One more Ingredient: Variance of Outcomes (1)
How does the variance of the outcome affect our ability to detect an impact?
Example: Of the two (circled) populations, which animals are bigger on average?
How many observations from each circle would you need to decide?
17
One more Ingredient: Variance of Outcomes (2)
Example: on average, which group has the larger animals?
The comparison is more complicated in this case, so you need more information (i.e., a larger sample)
The answer may depend on which members of the blue & red groups you observe
18
One more Ingredient: Variance of Outcomes (3)
Economic example: let’s look at our adolescents and mentoring
Imagine that the mentoring leads to a decline in disruptive incidents over 2 years (impact) from 60 to 50
Case A: Children are all very similar within treatment arms: distributions of incidents are very concentrated
Case B: Children are more heterogeneous, with distributions of incidents much more spread out (distributions overlap more)
Which instance requires a more precise measuring device?
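Cases A and B translate directly into required sample sizes via a standard two-sample power formula. A sketch, where the sds of 5 and 20 for the concentrated and spread-out cases are hypothetical:

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80):
    """Observations per arm to detect an effect D with outcome sd sigma
    (5% two-sided test, 80% power)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / D) ** 2)

# Impact: disruptive incidents fall from 60 to 50, so D = 10.
print(per_group_n(10, sigma=5))   # Case A (concentrated)  -> 4
print(per_group_n(10, sigma=20))  # Case B (spread out)    -> 63
```

Required n scales with σ², so quadrupling the sd multiplies the sample size by sixteen.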
19
One more Ingredient: Variance of Outcomes (4)
In sum:
More underlying variance (heterogeneity)
→ more difficult to detect differences
→ need a larger sample size

Tricky: How do we know about outcome heterogeneity before we determine our sample size and collect our data?
Ideal: pre-existing data … but often not available
Can use pre-existing data from a similar population
Example: surveys from other school districts, labor force surveys, institutional data on crime, etc.
Common sense
20
What else to consider when determining sample size
Additional features of the design/data that may have implications for the determination of sample size:
1. Multiple treatment arms
2. Group-disaggregated results
3. Take-up
4. Data quality
21
1. Multiple treatments
Testing different mechanisms: is CBT alone enough? Can I increase its effectiveness?
Let’s suppose we are interested in the effects of youth mentoring on academic performance
Reference: Cook et al. (2014)
We would like to test three different treatments:
Treatment 1: Academic remediation only
Treatment 2: Cognitive Behavioral Therapy (CBT) only
Treatment 3: Mentoring/CBT and academic remediation
Intuition: the more comparisons (treatments), the larger the sample size needed to be “confident”
22
1. Multiple treatments
Comparing multiple treatment groups requires very large samples
Analog to having “multiple” impact evaluations bundled into one
The more comparisons you make, the more observations you need
If the various treatments are very similar, differences between the treatment groups can be expected to be particularly small
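One way to make this concrete is a Bonferroni-style correction: with 3 treatment arms plus a control there are 6 pairwise comparisons, so each is tested at α/6 and every group must be bigger. A sketch with hypothetical effect size and sd (Bonferroni is one convention among several):

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80):
    """Observations per arm for one two-sample comparison at level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / D) ** 2)

single = per_group_n(D=5, sigma=20)                    # one comparison
adjusted = per_group_n(D=5, sigma=20, alpha=0.05 / 6)  # 6 pairwise comparisons
print(single, adjusted)  # the corrected test needs noticeably more per group
```

The stricter per-comparison significance level raises the critical value, and the required n grows with its square.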
23
Why do we need strata?
Group-disaggregated results:
Gender: Are effects different for boys and girls?
Location: For different neighborhoods?
For different family structures (e.g., 1- vs. 2-parent households)?

Punch line: To ensure balance across treatment and comparison groups, it is good to divide the sample into strata (aka groups) before assigning treatment
Strata = sub-populations (sub-groups or sub-sets)
Common strata: geography, gender, age, baseline values of the outcome variable
Treatment assignment (or sampling) occurs within these groups (i.e., randomize within strata)
24
What can go wrong, if you do not use strata?
Example: You randomize without stratification. Now you ask: What is the impact in a particular neighborhood?
[symbol] = Treatment & [symbol] = Control, assigned randomly
Can you assess with confidence the impact of mentoring within neighborhoods?
[Map of neighborhoods A, B, and C with treatment and control units]
25
Why do we need strata?
To answer consider a few neighborhoods: Region A: we have almost no kids in the control group Region B: very few observations, can you be confident? Region C: no observations at all
A
26
Why do we need strata?
To answer consider a few neighborhoods: Region A: we have almost no kids in the control group Region B: very few observations, can you be confident? Region C: no observations at all
B
27
Why do we need strata?
To answer consider a few neighborhoods: Region A: we have almost no kids in the control group Region B: very few observations, can you be confident? Region C: no observations at all
C
28
Why do we need strata?
How do we prevent these imbalances and restore confidence in estimates within strata?

Example: you have 6 neighborhoods
Instead of sampling 2400 students regardless of their neighborhood of origin, draw a sample within each neighborhood:
Sample 2400 ÷ 6 = 400 per neighborhood: 200 treatment & 200 control
I.e., random assignment to treatment within geographical units
Within each unit, ½ will be treatment, ½ will be control
Similar logic for gender, family structure, age, etc.
Which strata? Your research & policy question should guide you
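The scheme on this slide can be sketched as a small routine that randomizes separately within each stratum (names are illustrative, not from any particular library):

```python
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, seed=0):
    """Randomly assign half of each stratum to treatment, half to control."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[stratum_of[u]].append(u)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # randomize order within the stratum
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = "treatment"
        for u in members[half:]:
            assignment[u] = "control"
    return assignment

# 2400 students in 6 neighborhoods -> 200 treatment & 200 control in each.
students = list(range(2400))
neighborhood = {s: s % 6 for s in students}
groups = stratified_assignment(students, neighborhood)
```

Because the split is done stratum by stratum, balance within every neighborhood is guaranteed by construction rather than left to chance.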
29
Why do we need strata?
What about now? The treatment and control youth look “balanced” within neighborhoods
Much better!
30
Take up: Example
Rarely can we force people into programs (exception: mandatory school age)
We would now like to offer youth mentoring, but can only offer:
An incentive (e.g., skipping one class during mentorship/treatment)
Advertising for the program (communication campaign)

What if we offer the inducement to 500 youth and only 50 participate (often not at random)?
In practice, because of the low take-up rate, we end up with a less precise measuring device
We won’t be able to detect differences with precision
We can only find an effect if it is really large

Take-up:
A low take-up rate lowers the precision of our comparisons
It effectively decreases the sample size
31
Data Quality
Poor data quality effectively increases the required sample size
Missing observations: quality of data collection, attrition, migration
High measurement error: answers are not always precise
e.g., self-reported behavior or victimization status
e.g., poorly reported peer associations
e.g., recollection bias, framing, pleasing
Poor data quality can be partly addressed with a field coordinator on the ground monitoring data collection
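Classical measurement error adds noise variance on top of the true outcome variance, so the effective sd entering any power calculation grows. A sketch with hypothetical numbers (true sd 20, noise sd 15):

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80):
    """Observations per arm to detect effect D with outcome sd sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / D) ** 2)

# Under classical (independent, additive) measurement error, variances add.
sigma_true, sigma_noise = 20.0, 15.0
sigma_observed = math.sqrt(sigma_true ** 2 + sigma_noise ** 2)  # = 25.0
print(per_group_n(5, sigma_true), per_group_n(5, sigma_observed))  # -> 252 393
```

Here noisy measurement alone raises the required sample per arm by more than half, which is why investing in data quality can be cheaper than enlarging the sample.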
32
In Conclusion
Whom to interview is ultimately determined by our research / policy questions
How Many:

Element → Implication for Sample Size
The more (statistical) confidence/precision → the larger the sample size will have to be
The smaller the effects that we want to detect → the larger the sample size
The more underlying heterogeneity (variance) → the larger the sample size
The more complicated the design (multiple treatments, strata) → the larger the sample size
The lower the take-up rate → the larger the sample size
The lower the data quality → the larger the sample size
33
Power Calculation in Practice: an Example
Calculations can be made in many statistical packages, e.g., Stata or Optimal Design
The Optimal Design software is freely downloadable from the University of Michigan website: http://sitemaker.umich.edu/group-based/optimal_design_software
34
Power Calculation in Practice: an Example
Example: Experiment in Ghana designed to increase the profits of microenterprise firms
Baseline profits:
• 50 cedi per month
• Profits data are typically noisy, so a coefficient of variation > 1 is common

Example Stata code to detect a 10% increase in profits:
• sampsi 50 55, power(0.8) pre(1) post(1) r1(0.5) sd1(50) sd2(50)
• Having both a baseline and an endline decreases the required sample size (pre and post)

Results:
• 10% increase (from 50 to 55): 1,178 firms in each group
• 20% increase (from 50 to 60): 295 firms in each group
• 50% increase (from 50 to 75): 48 firms in each group (but this effect size is not realistic)

What if take-up is only 50%?
• Offer business training that increases profits by 20%, but only half the firms take it up
• Mean for the treated group = 0.5*50 + 0.5*60 = 55
• Equivalent to detecting a 10% increase with 100% take-up: need 1,178 firms in each group instead of 295
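The sampsi results above can be reproduced (up to rounding) with a hand-rolled power calculation. This is a sketch that treats the baseline as an ANCOVA control, so a baseline/endline correlation r shrinks the residual variance by (1 − r²):

```python
import math
from statistics import NormalDist

def per_group_n(D, sigma, alpha=0.05, power=0.80, r=0.0):
    """Firms per arm to detect a profit change D with outcome sd sigma;
    baseline correlation r reduces residual variance by (1 - r^2) (ANCOVA)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (1 - r ** 2) * (sigma * z / D) ** 2)

# Baseline profits 50 cedi, sd 50, baseline/endline correlation 0.5:
for target in (55, 60, 75):
    print(target, per_group_n(target - 50, sigma=50, r=0.5))  # -> 1178, 295, 48

# 50% take-up dilutes a 20% effect into a 10% intention-to-treat effect
# (0.5*50 + 0.5*60 = 55), so the required n jumps from 295 to 1178.
```

Since take-up t scales the detectable effect by t, the required sample grows with 1/t²: halving take-up quadruples the sample, exactly the 295 → 1,178 jump on the slide.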