Angus Deaton, Princeton University
India International Center, October 15th, 2012
TRANSCRIPT
Evidence for policy
Everyone agrees that policies should be based on evidence
Much less agreement about the nature of the evidence
What methods should be used?
Is there a hierarchy of evidence?
Are some kinds of evidence better than others?
Are randomized controlled trials the gold standard?
How do we move from evidence to policy?
Rigorous evidence is of limited value if the step to policy is not well-justified
Two steps: developing evidence and adapting it to policy; the outcome depends on the weakest link
Running examples
Building dams
Do dams lead to poverty reduction?
Sanitation
Total Sanitation Campaign (TSC) and its effects on child mortality and child health
How should such schemes be implemented?
Microfinance
Is MF an effective tool for poverty reduction?
Food subsidies
In kind versus cash? PDS versus CCTs
In general: “finding out what works”
“Rigorous evaluation of CCTs has shown that they work”
Is this true, and if so, what does it mean for India? Or anywhere else?
Background
The “failure” of development economics and the whole development project
Cycling fashions at the World Bank
Infrastructure, structural adjustment, education, health, women, political economy, governance . . . infrastructure
Not just the Bank, but the development community (or at least the community of “developers”)
Unconstrained by evidence
Bank unable to document its contribution, if any
Deep skepticism about its own internal evaluations
Many argued that there had been little or no progress
Much less so now, though it remains unclear whether the development effort by rich countries was positive
Diagnosing the problem
Many possible stories for this state of affairs
One story is a failure to learn from experience
No systematic, “rigorous,” evaluation procedure for projects
Casual empirical evaluation does not give credible answers
We need “rigorous” and “credible” evidence on what works
If the Bank had done this on all of its projects in the past, we would know what works by now, and poverty would be history
Is this just the latest turn of the wheel of fashion, or is there some truth to this?
Better empirical analysis
Certainly true that the quality of empirical analysis was often weak
Correlations that were obviously not causation
Chinese railways
Randomized controlled trials seem to offer solutions to these issues
They establish causality
Solution to the statistical problems of bias, selection, omitted variables (confounding) etc.
These arguments have been very successful
In World Bank, among foundations
J-PAL and others doing many experiments
Chorus of approval
“The World Bank is finally embracing science” Lancet editorial, 2004
“Creating a culture in which rigorous randomized evaluations are promoted, encouraged, and financed has the potential to revolutionize social policy during the 21st century, just as randomized trials revolutionized medicine during the 20th.” Esther Duflo, 2004
Did RCTs revolutionize medicine?
“Britain has given the world Shakespeare, Newtonian physics, the theory of evolution, parliamentary democracy—and the randomized trial” BMJ editorial, 2001.
What is an RCT?
Trial population is randomly divided into two groups, experimentals and controls
Experimentals get treatment
Controls get none
Average outcome in experimental group minus average outcome in control group tells us if the treatment works, and by how much on average
An RCT estimates an average treatment effect
In general, each person (unit) will have a different treatment effect
We cannot observe these for each individual
But RCT gives us the average for the group, which is a lot!
Minimal assumptions, absence of bias, establishing causality are big advantages
But is this really the only “rigorous” evaluation?
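The difference-in-means calculation described on this slide can be sketched in a few lines. This is only an illustration on simulated data, not anything taken from a real trial: the function name `simulate_rct`, the outcome scale, and the distribution of unit-level effects are all assumptions made for the example.

```python
import random
import statistics


def simulate_rct(n_units=10_000, seed=0):
    """Return (estimated ATE, true ATE) for one simulated trial.

    All numbers are hypothetical; each unit has its own treatment effect.
    """
    rng = random.Random(seed)
    # Outcome each unit would have without treatment (made-up scale).
    y0 = [rng.gauss(10.0, 2.0) for _ in range(n_units)]
    # Unit-level treatment effects: they differ across units and are never
    # observed directly in a real trial.
    effect = [rng.gauss(1.0, 3.0) for _ in range(n_units)]

    # Random assignment: shuffle the units, the first half are experimentals.
    order = list(range(n_units))
    rng.shuffle(order)
    experimentals = set(order[: n_units // 2])

    treated = [y0[i] + effect[i] for i in range(n_units) if i in experimentals]
    control = [y0[i] for i in range(n_units) if i not in experimentals]

    estimate = statistics.mean(treated) - statistics.mean(control)
    true_ate = statistics.mean(effect)
    return estimate, true_ate


estimate, true_ate = simulate_rct()
print(f"estimated ATE: {estimate:.2f}, true ATE: {true_ate:.2f}")
```

In the simulation the true average is known, so the estimate can be checked against it; in a real trial only the two group averages are available, and the individual effects stay hidden.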
Examples again
CCTs in Mexico (Progresa): some villages got CCTs, some did not
Better average outcomes for treatment villages
Random selection means it must have been the CCT, not something else
What do we learn?
Will it work in India? External validity.
Will it work for a specific village in Mexico?
Why did it work? If we knew, we could answer the two questions above
Controls knew they were going to get CCTs later? Does that matter?
Mexico had a system of clinics: hard to take kids to a non-existent clinic
Big issue today for Santiago Levy at IADB
Dams: not possible to do randomized dam construction!!
So RCTs cannot be done in all cases
Some have argued that policies should not be implemented in these cases
We do many things routinely for which there have been no RCTs!
Alternative methods
Rohini Pande and Esther Duflo’s work on dams used placement of dams and NSS data on poverty
Dean Spears’ work on TSC uses NFHS and other survey data on health in conjunction with administrative data
Alternative methods of estimating average treatment effects
Weaker than RCTs in some respects
Causality, selection, bias are not automatic and must be argued
More assumptions
Stronger in other respects
Access distribution of treatment effects, not just the average
Usually much larger samples
Triangulation helps to pin down mechanisms at work
RCTs good at saying what happened, not good at saying why
Ex post fairy stories (just-so stories) without evidence
Small RCTs
Are often not large enough to be reliable
Expensive to do, so this is not a matter that is easily fixed
In a small trial, a few outliers can wreak havoc
Example might be microfinance, where one or two women might be able to do really well, and the rest not at all
Get lots of weird and counterintuitive results
No idea if they are real, or method is just broken
Doubt one can learn anything from a trial of 10 experimental villages and 10 control villages in CCT experiment
Experiment is often conducted on a convenience sample
Not easy to get cooperation from all relevant units: e.g. in looking at CCT, those opposed to the idea might be less willing to cooperate
Results are correct only for the convenience population
Not for population that would be affected by the policy
Gold standard rhetoric protects results from questioning
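The outlier problem in small trials can be made concrete with a simulation in which the treatment does nothing at all. The sketch below is hypothetical throughout: the outcome distribution (about 1 unit in 20 does very well, the rest gain nothing) and the names `one_trial` and `spread_of_estimates` are assumptions chosen for the illustration.

```python
import random
import statistics


def one_trial(n_per_arm, rng):
    """Difference in means for one trial in which the true effect is zero."""
    def outcome():
        # Hypothetical outcome: about 1 unit in 20 gains a lot, the rest nothing.
        return 100.0 if rng.random() < 0.05 else 0.0

    treated = [outcome() for _ in range(n_per_arm)]
    control = [outcome() for _ in range(n_per_arm)]
    return statistics.mean(treated) - statistics.mean(control)


def spread_of_estimates(n_per_arm, n_trials=2000, seed=0):
    """Standard deviation of the estimated effect across repeated trials."""
    rng = random.Random(seed)
    estimates = [one_trial(n_per_arm, rng) for _ in range(n_trials)]
    return statistics.pstdev(estimates)


small = spread_of_estimates(10)                  # 10 units per arm
large = spread_of_estimates(1000, n_trials=200)  # 1000 units per arm
print(f"spread with 10 per arm: {small:.1f}; with 1000 per arm: {large:.1f}")
```

Both arms draw from the same distribution, so any estimated "effect" is noise. With 10 units per arm, which arm happens to contain the one or two big winners drives the answer.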
Large scale RCTs
Use all of the units in a country
PDS/CCT experiment for all of rural India
Comparable to large social experiments in the US in the 70s
NJ income tax experiment, SIME/DIME
Rand Health experiment
Rand experiment is an important part of the debate today, others not
Ex post data mining
Null result is never acceptable to the sponsors
Enormous pressure on investigators to find something
Usually by subgroup analysis, or looking for other outcomes
MTO has now examined thousands of outcomes
Some of the statistically significant ones are spurious
And we are back to the small sample problem again
Large experiments not decisive either
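Why some "significant" findings must be spurious when thousands of outcomes are examined can be sketched as follows. The example assumes independent outcomes, a treatment with no effect on any of them, and a conventional 5% two-sided test; the function name `count_false_positives` and all parameter values are hypothetical.

```python
import math
import random
import statistics


def count_false_positives(n_outcomes=1000, n_per_arm=100, seed=0):
    """Count outcomes that test as significant although no effect exists."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_outcomes):
        # Both arms are drawn from the same distribution: the true effect is zero.
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        diff = statistics.mean(treated) - statistics.mean(control)
        std_err = math.sqrt(
            statistics.variance(treated) / n_per_arm
            + statistics.variance(control) / n_per_arm
        )
        # Two-sided test at roughly the 5% level (normal approximation).
        if abs(diff / std_err) > 1.96:
            significant += 1
    return significant


print(count_false_positives(), "of 1000 null outcomes look significant")
```

Roughly 5% of the null outcomes clear the threshold by chance alone, so a search across many outcomes or subgroups will usually turn up something to report.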
Dynamic effects
Many policies take time to work out
Lots of things work as intended in the short-run, fail later
People learn to “work the system”
Food rationing in Britain during the war:
Excellent at first, big nutritional benefits, solidarity
Crooks (“spivs”) learned to exploit it and create a black market
Support eventually vanished, when it was continued too long
Old age pensions in South Africa: cash transfer
Burial insurers were allowed on site to get first access to recipients
Higher level corruption: banks?
Procurement and supply effects in food policy
What would an RCT show?
It works!
Expensive and unethical to continue the experiment
We get the wrong answer, or only part of the answer
Issue in medicine too
Using a perfect evaluation
Suppose we have a result, e.g.
On average, CCTs make people happier than PDS
On average, dams increase poverty
On average, reducing open defecation improves child health and reduces mortality
Suppose also that these were all done perfectly, so there is no dispute about the conclusions
Which, of course, never happens!
What use can we make of those results in policy?
Should the Planning Commission ban new dams?
Should MRD encourage better sanitation?
Should we replace PDS by CCTs?
That dams don’t work on average tells us little about any individual dam
It is an individual dam that comes up for approval, not all dams!
We need to know more: why dams cause poverty, and under what circumstances, none of which comes from an RCT
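A small numerical sketch shows how an adverse average can coexist with many individual cases that go the other way. The effect sizes below are invented purely for illustration and are not estimates for real dams; the function name `dam_effects` is hypothetical.

```python
import random
import statistics


def dam_effects(n_dams=1000, seed=1):
    """Hypothetical effect of each dam on local poverty, in percentage points.

    Positive values mean more poverty, negative values mean less.
    """
    rng = random.Random(seed)
    return [rng.gauss(2.0, 5.0) for _ in range(n_dams)]


effects = dam_effects()
average_effect = statistics.mean(effects)
share_reducing_poverty = sum(e < 0 for e in effects) / len(effects)

print(f"average effect: {average_effect:+.1f} points")
print(f"share of dams that reduce poverty: {share_reducing_poverty:.0%}")
```

Under these made-up numbers the average dam raises poverty, yet a sizeable minority of dams lower it. The average alone cannot say which kind of dam is up for approval.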
What should a village do?
Or any local authority that decides
Given an RCT about CCT v PDS
Again, the average is useful but not decisive
Will it have the same effect for us?
We are not the average village
Again, we need to know why it works, not whether it works
Neighboring village tried and is happy with the outcome
Perhaps this is just an anecdote (“your uncle likes his new TV”)
But for the village, the average outcome is an anecdote too
Perhaps the authorities should visit their neighbors and see what is going on, see if it would work for them
Average is more useful for a public health policy that will be applied to the whole country
Sanitation?
Finding out what works?
A trial and error process
But T & E is NOT the same as an RCT
T & E, endless tinkering, is a good description of the Industrial Revolution
How to invent a steam engine, or a toaster
How medical science works, on procedures and devices
For which trials are close to irrelevant, and in many cases have never been done
T & E using knowledge and intelligence can solve the dimensionality problem
Seeing into the machine
Allows a village, the ministry, or the Planning Commission to make a better choice
It may be able to see whether it would work for them
It may be able to see places where they could adapt it and make it better
Hope to understand the process & how it would work in context
Trial and error, plus local knowledge, hard thought
Experimentation but not necessarily RCTs
What are the “helping factors” that made a trial work?
E.g. clinics in Mexico!
Can teach us why things work which is generalizable knowledge
Causality & helping factors
Do not RCTs reveal causality?
It was the treatment that did it! Not something else
Is this not particularly helpful in policy?
Yes and no. Causality, by itself, is not always useful
The house burned down because the TV was left on
Causal, but not general: TVs do not usually burn down houses
RCT would show this causal effect
But TVs need “helping factors” like bad wiring, or inflammable material left nearby
We have to think about what are the helping factors, how they work, and whether they will work for us
Will a CCT work in a particular village, or during food price inflation, or in a competent v a corrupt state?
Does it need banks, or clinics to make it work?
Does it matter who gets it? Men and women: gender issues in India v Latin America
Replication of an RCT is not useful, because we get different results in different contexts with or without helping factors
Causality is “local”
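The role of a helping factor can be sketched with a simulated trial in which a transfer only works where a clinic exists. Everything here is an assumption made for the illustration: the function name `trial_estimate`, the size of the effect, and the shares of villages with clinics.

```python
import random
import statistics


def trial_estimate(share_with_clinic, n_villages=4000, seed=0):
    """Estimated average effect of a transfer in one simulated context."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_villages):
        has_clinic = rng.random() < share_with_clinic  # the helping factor
        gets_transfer = rng.random() < 0.5             # random assignment
        baseline = rng.gauss(50.0, 5.0)                # hypothetical health score
        # Assumed mechanism: the transfer raises the outcome only if a clinic exists.
        effect = 10.0 if (gets_transfer and has_clinic) else 0.0
        (treated if gets_transfer else control).append(baseline + effect)
    return statistics.mean(treated) - statistics.mean(control)


many_clinics = trial_estimate(share_with_clinic=0.9)
few_clinics = trial_estimate(share_with_clinic=0.1)
print(f"estimated effect where clinics are common: {many_clinics:.1f}")
print(f"estimated effect where clinics are rare:   {few_clinics:.1f}")
```

The same treatment, evaluated by the same method, gives a large estimate in one context and a small one in the other. Neither trial is wrong; each is correct for the mix of helping factors in which it was run.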
Cartwright: Local causality
Open window A and fly kite B. String C opens door D, which allows moths E to escape and eat shirt F. Lighter shirt lowers shoe G on to switch H, which heats iron I, which burns pants J. Smoke K enters tree L and smokes out possum M into basket N, pulling rope O and lifting cage P, allowing woodpecker Q to chew pencil R. (Emergency knife S in case woodpecker or possum gets sick and can’t work.)
Expanding literature
We now have enough RCT papers to judge their quality and the evidence that they claim
Some excellent, some terrible
Just like other empirical papers in development
But they must be judged case by case, like all other empirical work
There is no free pass, just because they are RCTs
Using the phrase “rigorous evaluation” as a code word for RCT is without justification
Right now, in economics, and aid literature, they are being given a free pass
Sometimes absurd generalizations based on small special RCTs
RCTs have no monopoly on rigour, there is no gold standard