successful impact evaluations: lessons from dfid and 3ie · successful impact evaluations: lessons...
TRANSCRIPT
1
INCOMPLETE DRAFT FOR DISCUSSION ONLY; COMMENTS WELCOME
NOT FOR QUOTATION OR GENERAL CIRCULATION
Successful impact evaluations: lessons from DFID and 3ie
Edoardo Masset1, Francis Rathinam2, Megha Nath2, Marcella Vigneri1, Ben Wood2
1CEDIL, 2International Initiative for Impact Evaluation (3ie)
CEDIL Inception Paper
DRAFT
18/01/2018
Abstract
This paper explores the factors affecting the success of impact evaluations, drawing from the
experience of two major funders of evaluations: DFID and 3ie. We first define successful evaluations
using three key dimensions: credibility of the evidence, relevance, and policy influence. Based on a
review of the literature we build a conceptual framework including the main determinants of
success at the stages of design, planning, implementation and analysis. We review selected samples
of impact evaluations recently funded by 3ie and DFID, we identify successful ones and we discuss
the factors that are more likely to be associated with successful studies. We also discuss
unsuccessful cases of impact evaluations and their main causes, and we provide some guidance in
relation to discontinuing studies that are facing difficulties during implementation. We conclude
with a number of reflections that we hope will help researchers and funders to concentrate
resources and efforts where they are most needed for the success of impact evaluations.
2
3
1 Introduction The goal of the present paper is to identify factors that lead to successful impact evaluations.
Lessons from successes and failures can be used by researchers and funders to better design and
monitor impact evaluation studies. It is well known that impact evaluations often do not achieve
their intended aims, but how is success defined? The paper proposes a “theory of change” of
successful impact evaluations and identifies the potential factors determining success. We conduct a
portfolio review of impact evaluations recently funded by DFID and 3ie to identify successful cases
and their determinants. The ultimate goal of the paper is to provide researchers and funders with a
set of recommendations to improve the likelihood of success and to avoid failures at the various
stages of the evaluation process, including design, planning, implementation and analysis.
2 Background 2.1.1 Search of the literature A comprehensive search of the literature on impact evaluations failure and successes is beyond the
scope of this study. In addition, much of the discussion about institutional and researchers’
experiences in conducting impact evaluations is not academic and more likely to be found in blogs.
We therefore decided to limit our search of the literature to blogs, and to papers and reports
mentioned in blogs.
Reviewing blogs systematically present some challenges. Blog webpages are designed with different
levels of sophistication. They always include a search modality, which normally does not allow
filtering through Boolean operators. In some cases an implicit Boolean search is allowed by using
double-quotes: words can be nested together ("impact evaluation"), and additional words can be
added. Whenever double-quotes were possible, we downloaded blogs using the following search
terms: “impact evaluation” failure and “impact evaluation” success. In all other cases, where an
implicit Boolean search was not allowed, we simply searched for ‘impact evaluation’ and retrieved
all the blogs returned by the search.
We searched the following blogs of institutions and organisations working in impact evaluation:
- A tip a day by and for evaluators of the American Evaluation Association
(http://aea365.org/blog/tag/blog/)
- All impact evaluation posts by the GIveWell blog (https://blog.givewell.org/category/impact-
evaluation/)
- Sharing information to improve evaluation of BetterEvaluation
(http://www.betterevaluation.org/blog)
- What works: transforming development through evaluation of the Independent Evaluation
Group at the World Bank (http://ieg.worldbankgroup.org/blogs)
- Blog of Innovations for Poverty Action (https://www.poverty-action.org/blog)
- Evidence matters of the International Initiatives for Impact Evaluation
(http://blogs.3ieimpact.org/)
- Development impact of the World Bank
(https://blogs.worldbank.org/impactevaluations/blog)
- R&E search for evidence of the FHI360 (https://researchforevidence.fhi360.org/)
- Evaluate, the MEASURE evaluation blog (https://measureevaluation.wordpress.com/)
- Impact blogs of USAID (https://blog.usaid.gov/)
- Views from the centre of the Center for Global Development
(https://www.cgdev.org/global-development)
4
- Evidence for action of UNICEF (https://blogs.unicef.org/evidence-for-action/)
- Shaping policy for development of ODI (https://www.odi.org/comment)
We found more than 20 blogs discussing failures and successes in evaluating programmes and the
World Bank website built a web page (Failure) to discuss failure cases
(http://blogs.worldbank.org/impactevaluations/failure). The blogs often included references to
papers, reports and books that also became part of our review. We downloaded all relevant blogs
and documents quoted or linked to them to inform the following literature review and conceptual
framework.
2.2 Successful evaluations We begin by defining ‘success’ of impact evaluations. Since impact evaluations have different goals
and are conducted in very different ways, any indicator of success must be multidimensional. Based
on or reading of the retrieved literature and on the institutional experience of 3ie, we identified the
following dimensions of success:
• Reliable evidence: a successful IE must be answering the evaluation questions in a credible
and convincing way
• Relevance: a successful IE must provide evidence that is valuable and that can be used for
making policy decisions or to inform policies
• Policy impact: an impact evaluation is more successful if its recommendations are adopted
by decision-makers
• Value for money: an evaluation is more successful if achieves its results using fewer
resources in comparison to similar studies and if it achieves this in a timely manner without
incurring cost overruns.
For the purpose of our work, the term ‘reliable evidence’ refers to the causal estimation of the
effects of programmes, normally conducted using experimental and quasi-experimental methods.
The standards employed to assess the quality of evidence are object of debate and we adopt here
the standards normally used in systematic reviews such as those, for example, of the Campbell
collaboration or the International Initiative for Impact Evaluation.1
Relevance in our view is not judged by statistical significance alone, as is often the case in the impact
evaluation literature, and should depend on the size of differences and cost-effectiveness of the
interventions (White, 2014). An evaluation is relevant if offers a convincing explanation of the
mechanisms leading to the outcomes, if it considers the cost of the intervention in comparison to
available alternatives and if it assesses impact, to the extent this is possible, across a variety of
contexts and populations. Relevance can be therefore assessed by the presence of a solid theory of
change, cost-effectiveness analysis, and sub-group analysis and other heterogeneous effects.
Ultimately, relevance consists of the clarity and strength of the conclusions and policy
recommendations.
Policy impact refers to the use of an evaluation in policy-making. Several policy outcomes are
possible following an evaluation. The results of an evaluation can be used to continue or discontinue
a particular intervention, they can be used to modify and improve an intervention, they can be used
to scale-up a project or expand a project to other areas or countries, or they can inform global
debates on policies. The lessons from evaluations can emerge in the long term or in unexpected
1 See for example the criteria used by 3ie for inclusion in the 3ie Repository of Impact Evaluations, which includes all impact evaluations of developed interventions (4,260 as of November 2017) that have been published to date (http://www.3ieimpact.org/media/filer_public/2014/05/22/ier_screening_checklist.pdf)
5
areas and we do not limit the notion of policy impact to policy changes that are immediate
consequences of evaluation results.
Costs of evaluations also vary by context, implementation and study design. For example, MEASURE
Evaluation presents a range of project costs between $400.000 and $850.000, with a single study
costing over $3.5 million (MEASURE evaluation, 2015). 3ie estimated an average cost of $336.000 for
impact evaluations funded between 2009 and 2015 (Puri & Rathinam, 2017). Impact evaluations are
cost-effective in the long run as they improve project design and inform global portfolios (Stickler,
2016). This does not prevent us from valuing more highly an impact evaluation that achieves the
same results in the same context using fewer resources.
The literature also notes that impact evaluations are more successful if produce unforeseen positive
externalities to the funders, the project evaluated or the populations involved in the project. We
refer to spillover effects as unintended positive effect generated by an impact evaluation. For
example, impact evaluations can improve the efficiency of projects, or they can influence the
scaling-up of good projects and improve the quality of data systems (Lopez Avila, 2016). They can
also improve the skills of project managers and other stakeholders through transmission of
knowledge through trainings or informally (Briceno, Cuesta, & Attanasio, 2011) as well as creating
additional knowledge for beneficiaries. Legovini et al (2015) show that World Bank projects
conducted alongside impact evaluations are more likely to be successful because they tend to be
implemented as planned and therefore more likely to achieve their stated goals. The authors
speculatively attribute this effect to (a) better planning and evidence-base in project design, (b)
greater implementation capacity due to training and support by research team and field staff, (c)
better data for policy decisions, and (d) observer effects and motivation. In principle, there might
also be potential negative effects or impact evaluation that can negatively contribute to its success,
though their occurrence has not been reported in the literature. Spillover effects of evaluations are
certainly important but are not requisites of success and are difficult to track. We will consider them
in the data collection but not in the final analysis of project success.
2.3 A framework of analysis for impact evaluations One way of designing a theory of change for successful impact evaluation is by describing an ‘ideal’
impact evaluation. That is, an impact evaluation having all the ingredients, and in the right
combination, to be successful. What makes an evaluation successful, however, varies depending on
the context of implementations and the goals of the exercise. We therefore decided to opt for a
different approach. We first define the fundamental steps of an impact evaluation exercise and we
then identify the factors that may positively or negatively affect an impact evaluation study at each
stage. The stages considered follow the sequence: identification of evaluation questions, planning of
the evaluation study, implementation of the intervention and of the study, and analysis and
dissemination of the results. For each stage of the process, we surveyed the blog literature to
identify the main determinants of success and failure. Stages and determining factors are visually
represented in the chart in Figure 1.
2.3.1 Defining the evaluation questions: Where are ideas coming from? The first step in designing an impact evaluation is defining the evaluation questions that will be
investigated. But where are evaluation questions coming from? We define here three possible
sources of evaluation questions: innovations, current projects and accountability. In some cases,
(innovations), the questions originate in social and economic theory or from results or other
research and evaluation experiences in other countries or contexts. In these cases, “researchers with
theories of hypotheses already in hand set out to search for appropriate sites to run experiments
6
(Karlan & Appel, 2016a).” Innovations are therefore interventions that haven not been implemented
or tested before like those promoted, for example, by the NGO Innovation for Poverty Action. In
other cases (current projects), evaluators assist the implementers or the funders in assessing the
impact of existing interventions. This does not mean that researchers cannot test modifications of
the original interventions. Often the interventions are not fully spelled out and there is considerable
room for the researchers for tinkering and adjusting the interventions (Duflo, 2017). In other cases
(accountability), the evaluation is motivated by the need to hold project implementers and funding
agencies accountable and to ensure that funds are properly invested.
2.3.2 Planning stage Evaluability
Ideally, before the planning stage, an evaluability exercise should be conducted to assess the technical, political and practical feasibility of an evaluation. The OECD-DAC defines evaluability as “The extent to which an activity or project can be evaluated in a reliable and credible fashion”. It is argued that an evaluability assessment reduces the risk of irrelevant or invalid findings, thus preventing waste of resources on inappropriate evaluations (Davis, 2015). Such assessments should aim at answering three broad questions: a) is it plausible to expect impact?, b) would an impact evaluation be useful and used?, and c) is it feasible to assess or measure impact? Peersman et al. (2015) provide guidelines and a useful checklist to apply this type of assessment to impact evaluations. Based on the experience of five experimental evaluations in agriculture, MCC concluded that “To make efficient use of time and financial resources, impact evaluations should be focused where the learning potential is greatest, where rigorous evaluation (with a counterfactual) is feasible and where there is significant commitment of the various stakeholders to the evaluation (MCC, 2012).” Clearly all this is highly desirable but not easily achievable in practice.
Piloting the intervention
Many interventions fail because they cannot be implemented in the planned way. These problems
affect particularly interventions that have never been implemented before and that are being tested
for the first time (innovations). For example, a particular technology may not work in a given
context. Piloting an intervention (also called ‘formative evaluation’ or ‘field testing’) may overcome
this problem (White, 2014). In some cases, interventions are implemented in the wrong settings.
Many such examples are reported by Karlan (2016a) and summarised in the following way by
Mckenzie: “This includes doing projects in the wrong place (e.g. a malaria prevention program where
malaria is not such an issue), at the wrong time (e.g. when a project delay meant the Indian
monsoon season started, making roads impossible and making it more difficult for clients to raise
chickens), with a technically infeasible solution (e.g. trying to deliver multimedia financial literacy in
rural Peru using DVDs when loan officers weren’t able to find audio and video setups) (Mckenzie,
2016).”
Team’s skills
Teams need to have the right set of technical skills and the right combination of disciplinary approaches required by the study. “Hiring the wrong person(s) on the impact evaluation team, or not defining their responsibilities well” came second place in an internal survey of World Bank staff on the most common mistakes in conducting impact evaluations (Vermeersch, 2012). Team composition and experience are commonly used as screening and scoring criteria by agency commissioning impact evaluations. Clarity and quality of ToRs for the survey team are therefore also highly relevant.
7
Figure 1 Factors affecting ‘success’ of evaluations
Stakeholders’ engagement
This is by far the factor most commonly quoted in the literature and blogs. Project implementers, funders and users should be involved since the design stage of an evaluation. What follow is a series, perhaps a bit inordinate, of excerpts from blogs on this specific issue. “Implementer’s staff should be
8
involved to leverage the diverse expertise necessary for a good impact evaluation design (Stickler, 2016).” “Traditionally, evaluators insisted on independence to assure objectivity. But this separation often results in evaluators being brought in after a project is underway and when the opportunity to collect baseline data and test hypotheses is lost” (Keller & Savedoff, 2016). “Interests of stakeholders (researches, implementers and funders) need to be aligned, which may result in reaching compromises about what is evaluated and by what method or design (Lopez Avila, 2016).” For example, Ferguson (2017) discusses a USAID-funded intervention in Mozambique, centered on a combined social and economic intervention for girls at high risk of HIV infection. The study had to deal with a number of challenges such as the absence of a baseline, low take-up, ad hoc implementation etc. to which the researchers responded with a quasi-experimental design that stroke a balance between rigour and the need to answer the evaluation questions. Engagement with stakeholders (implementers and beneficiaries) at early stages of an intervention will allow identifying the right set of indicators for an evaluation (White, 2014). Sometimes evaluators have the buy-in of implementers but “lower-tier staff may have limited bandwidth and flexibility. A couple of examples come from programs which tried to use existing loan officers to deliver financial literacy training, only to find that many of them were not good teachers; and where bank tellers decided to ignore a script for informing customers of a new product because they felt it slowed them down from serving clients as quickly as possible (Mckenzie, 2016).” The evaluation scope should take into account the demands and concerns of programme managers (Briceno et al., 2011). The evaluation questions and the feasibility of the exercise including its technical characteristics should be discussed and jointly discussed with relevant stakeholders (Briceno et al., 2011). “Effective impact evaluations require close integration between implementers and evaluators, starting from the creation of the program logic and throughout implementation. Changes in program implementation can have significant effects on the evaluation methodology (MCC, 2012).” “Alignment between funders, implementers and evaluators is critical to a well-designed evaluation that complements the program (Priedeman Skiles, Hattori, & Curtis, 2014).” An intense collaboration between researchers and the implementing organisation is even more needed when conducting a randomised trial because the researchers become directly involved in project design and implementation. Glennerster (2017) discusses at length the required elements of a fruitful collaboration between evaluators and managers when implementing a randomised controlled trial. Transparency
An evaluation is more likely to follow best practices when activities are open to the scrutiny of the
public and of other researchers. Transparency may include the registration of the evaluation and the
publication of an analysis plan. Evaluation designs should be ideally be registered in order to avoid
subsequent cherry-picking of results (White, 2014). Some authors have designed a software to
declare the study design in computer code form. Given the code declaration, the software can
diagnose statistical power, bias, expected mean and external validity in advance (Blair, Cooper,
Coppock, & Humphreys, 2016). Wood (2015) discusses the benefits of transparency in the use of
datasets and in their analysis to the production and to the use of evaluation results.
2.3.3 Implementation Project implementation
The project may not be implemented in the way it was planned by the government or the NGO in
charge. Mckenzie recounts the problems faced during the evaluation of different mechanisms to
bring enterprises in the formal sector. In Brazil, new ipads to be used as prizes had to be returned to
the US for tax reasons. In Peru, firm’s distance to the tax office could not be used as an instrument
because tax collectors, in order to reach their quotas, would simply inspect the closest firms they
could find. In another example, a team spent $40,000 to send letters to financial institutions in
9
Mexico to offer financial education by participating in the study. The team was hoping to obtain
between 800-1200 responses out of the 40,000 letter sent, but in the end they obtained only 42
replies (D. Mckenzie, 2016a). Another World Bank team set out to evaluate a large rural roads
rehabilitation program in Rwanda that relies on high-frequency market information. Crowdsourcing
seemed like an ideal solution to generate high-frequency data that would collect timely market data.
Things however, went differently from expected: poor connectivity resulted in a low number of
contributors, data collected in an unstructured way were often inaccurate, and data collection
ended up being a poorer version of traditional enumeration as there was only one contributor per
market (Jones & Kondylis, 2016).
Project take-up
Take-up is often quoted as the single most common reason of project failure in the literature. People
may simply not be interested in the intervention and the intervention does not take place. Universal
take-up is unrealistic and every project loses out its intended beneficiaries through a ‘funnel of
attrition’ (White, 2014). For example, IPA partnered with Global Giving on a program that invited
clients of local organizations to provide feedback about the services received by sending SMS
messages to Global Giving. The client feedback would in turn be sent to potential donors and the
researchers were hoping to find whether feedback would change service provision and donors’
attitudes. A total of 37 organizations in Guatemala, Peru, and Ecuador were identified and
randomised in the study, but only 5 took part. Most organisations were concerned about the
negative effects of unfavourable feedback (Karlan & Appel, 2016b). Similar examples abound in the
literature and will be not reported here.
Contamination of the control group
Even the best designed evaluation may be spoiled by contamination of the control group, when the
benefits of the intervention go beyond the project area or when implementers decide to expand the
geographic coverage of the intervention. In some cases, contamination can be irreparable. For
example, a randomised evaluation of an HIV intervention in Andhra Pradesh was aborted after the
funder in collaboration with the State government decided to saturate the entire state with a nearly
identical and much larger intervention that wiped out the randomised control group (Samuels &
McPherson, 2010).
Selection biases
All sorts of biases may affect the results of the analysis. In conducting experiments, biases include
high attrition of study participants, noncompliance, placebo and Hawhtorne effects with inability to
blind. These type of issues are widely discussed in the evaluation literature. A non-technical
discussion of many of these problems, particularly in relation to randomised controlled trials, can be
found in Glennerster et al. (2013). Quasi-experimental studies are even more vulnerable to biases
and the inherent difficulty of controlling for unobserved differences between the comparison
groups.
Sample size and data quality
Few interventions are able to produce dramatic effects and sample sizes have to be sufficiently large
to detect the expected impacts. Wood and Djimeu (2014) discuss how rarely evaluators, particularly
economists, conduct and report the power calculations supporting the sample size selected for their
impact evaluation. As a result, evaluations often have sample sizes that have inadequate statistical
power to detect impact (White, 2014), particularly for outcomes that are marginally affected by
10
interventions and measured with large error such as, for example, incomes. The problem is
compounded if intra-cluster correlations are not taken into account in the power calculations. But
even if researchers get the sample size right many things can go wrong during data collection. Data
collection has to be closely monitored. “Issues in the design and (preparation of) data collection”
was by far the most represented category in a survey of common mistakes in impact evaluations
conducted at the World Bank. Problems included: forgetting individual identifiers, or not printing the
cover page of the questionnaire and missing information on the respondent (Vermeersch, 2012).
There is a vast literature on measurement error in traditional surveys and on enumeration biases,
but problems arise with new methods as well. For example, team of researchers set out to assess
the impact of an intervention on management practices by taking photographs of micro-enterprises
before and after the intervention. But, contrary to expectations, photographers did not stand in the
same spot each time, sunlight affected the quality of the pictures, not all information was captured
by the pictures and enumerators were inexperienced in the use of cameras (Mckenzie, 2012), and
the experiment failed.
Ethical issues
Research involving human subjects requires meeting a minimum set of ethical standards in order to
be acceptable and ultimately to facilitate the implementation of the study. This may require
approval from an Institutional Review Board, obtainment of informed consent, protection of
confidentiality of the information and any other measure that prevents the researchers from
committing any harm (Glennerster, 2017).
Political economy
Local governments and implementers can be or can become opposed to evaluations or to particular
aspects of their design. Campos et al. (2012) provide a number of examples of failed experiments in
which Government officers were openly opposed to randomised evaluations of a series of subsidies
programmes for private firms. The opposition was sometimes accentuated by turn-over of
government staff, which caused severe delays.
Ability to adapt
Rarely are evaluations going according to pre-defined plans. Changes in circumstances often require sudden changes and adjustments in evaluation design or project implementation. In addition, researchers should prepare contingency plans, particularly when conducting randomised trials (Briceno et al., 2011). “Creative adaptation of the evaluation design to fit operations is the norm not the outlier (Priedeman Skiles et al., 2014).“ A review of 10 evaluations of health interventions by MEASURE found that each case faced design and implementation challenges that required creative solutions to move forward. These challenges included the identification and selection of program beneficiaries, random assignment in complex environments, identification of a robust comparison or control group for estimating the counterfactual, heterogeneity of program impacts, timing of baseline data collection, and absence of baseline data and of a counterfactual (Priedeman Skiles et al., 2014). Puri et al (2017), based on examples from impact evaluations funded by 3ie, offer a discussion of how the success of the studies depend on their ability to adapt to unforeseen implementation circumstances.
2.3.4 Analysis Technical quality
An impact evaluation must provide the best possible estimates of project impact given the
circumstances of implementation. What makes a technically high-quality evaluation can be object of
11
debate but there are common standards, such as, for example, risk of bias tools and assessments
used in systematic reviews, and there are quality markers, like publications in peer-reviewed
journals.
Dissemination
The results of a study have to be widely disseminated in order to reach the desired audiences. Briceno et al. (2011) summarise in the following way: “Evaluation findings should be reported in time and should be used for accountability purposes. If the motivating questions were clear, and the evaluation’s reports incorporated answers to these questions, the program’s managers can use the findings to discuss recommendations and to support changes in the program, if necessary. The ideal evaluation process does not end with a published document; an ideal evaluation should document the recommendations’ discussions, the response from program managers and the consequent decisions or improvement aspects resulting from the evaluation.” The literature on communicating evidence and influencing policy is huge, and will not be reported here.
2.3.5 Cross-cutting themes A number of factors affect the success of evaluations regardless of the stage the implementation. We highlight in particular, shock, contexts and quality assurance. Shocks and external events
Project implementation and evaluations can be derailed by unexpected shocks. For example a World
Bank team conducted a survey of 2,500 households and 2,500 enterprises in rural Egypt to assess
the impact of measure to expand the access to finance but the instability in the wake of the Egyptian
revolution meant microfinance organizations didn’t want to expand to poor areas, and the
intervention never took place (D. Mckenzie, 2016b). A risk assessment at the planning stage and the
formulation of contingency plans might be helpful particularly when researchers set out to work in
areas prone to conflict or other disasters.
Implementation context
The complexity of interventions, consisting on multiple activities and projects at multiple levels
(countries, regions etc.) and multiple stakeholders groups, is perceived as a challenge to the design
of causal pathways for an evaluation (Vaessen, 2017) and for the identification of a control group
(Cuong, 2014). This is often addressed by focusing on a narrow component of the programme rather
than on the programme itself, thus offering little guidance in terms of scaling up the whole
intervention (Lopez Avila, 2016). Successes and failures vary with the characteristics of the area in
which the project is implemented, with the type of intervention and with the characteristics of the
implementer. This suggests that study type, implementer, sector of intervention and geographic
area of intervention should be also taken into account.
Quality assurance and PRGs
Progress reviews along the study and peer-review of the results at various stages ensure the quality
of the design, implementation and analysis. The work of evaluators should be supervised through
advisory panels, peer review groups, and other reference groups in order to ensure the technical
quality of the evaluation (Briceno et al., 2011).
12
3 Methodology The remaining of the paper is dedicated to assess the relevance of the determinants of success of
impact evaluations that were identified by the literature. In this exercise we will use a sample of
studies funded by two major funders of impact evaluations: DFID and 3ie. Here, we briefly describe
the selection of the sample of studies and the methods to identify success and its determinants.
3.1 Selection of DFID studies DFID publishes every year a collection of commissioned evaluation studies in international development. The entire body of this work spans a twenty-year period from 1997 to 2017, and includes independent reports categorized by sector and country, and working papers. More recently, the collection has also included a short document with DFID’s management response. For this exercise, we decided to selectively look at the population of impact evaluations supported by DFID since 2012, which includes in total 61 studies. Only 12, out of the original 61 evaluations, were included in our study as the other evaluations did not meet our selection criteria (see Figure 2 for an illustration of the selection process). We employed two selection criteria. The first requires that the study evaluates an intervention aimed at improving living standards rather than a policy or financial support. Following 3ie guidelines for the 3ie database of impact evaluations, we excluded evaluations “studies which only investigate natural or market-based occurrences, or that report on the findings of controlled laboratory experiments with no discernible development intervention.” This led to the exclusion of 20 studies including, for example, the Evaluation of DFID Online Research Portals and Repositories, Evaluation of Budget Support to Sierra Leone 2002 – 2015, Evaluation of the DFID-AFDB Technical Cooperation Agreement (TCA), and Evaluation of the Consultative Group to Assist to the poor - Phase IV Mid-Term Report. The second selection criterion is the use of a counterfactual method: what would happen in the absence of the intervention. In keeping with the guidelines used by the 3ie database of impact evaluations, studies not employing a counterfactual methodology such as RCT, RDD, PSM, IV, and DD were excluded. This led to the exclusion of further 27 evaluations. The excluded evaluations included 7 desk-based reviews, 5 process evaluations, 4 ‘theory-based’ evaluations, 4 value for money cost analyses, 3 qualitative studies, 2 formative evaluations and 1 rapid health facility assessment. Figure 2 Selection of DFID impact evaluations
Only about 20% of all evaluations consist of impact evaluations employing a counterfactual
methodology and only 2 impact evaluations are randomised controlled trials. Most impact
Evaluations
(61)
non-interventions
(21)
interventions
(40)
impact evaluations
(12)
Other evaluations
(28)
13
evaluations are quasi-experimental (11), often consisting of difference-in-difference studies
reinforced by matching methods or, in some cases, by simple project-control comparisons at one
point in time after the intervention (see Table 1).
Table 1 DFID evaluations by study design
Year evaluations Evaluations of interventions
Impact evaluations
RCTs Quasi-experimental
2012 7 4 1 1 0 2013 20 13 2 1 2 2014 7 5 2 0 2 2015 5 3 1 0 1 2016 18 11 3 0 3 2017 4 4 3 0 3 TOTAL 61 40 12 2 11
We disaggregated the evaluations and the impact evaluations using the World Bank sectoral
classification of interventions (Table 2). The bulk of impact evaluations were commissioned in the
agriculture sector, health and population, and education. This is only partially a reflection of the
distribution of all commissioned evaluations. Some sectors like climate change and conflict
management do not have a single impact evaluation while most evaluation of public sector
management are not impact evaluations.
Table 2 DFID evaluations by sector of intervention
Sector of intervention All evaluations Impact evaluations
Agriculture and rural development 7 3 Climate change and environment 3 0 Conflict management and post-conflict reconstruction 4 0 Cross-sectoral 16 2 Economic policy 0 0 Education 3 1 Energy 0 0 Finance 0 0 Health nutrition and population 9 3 Humanitarian 1 0 Information and communications technology 0 0 Private sector development 3 1 Public sector management 12 1 Social protection 1 1 Transportation 0 0 Urban development 0 0 Water, sanitation and hygiene 2 0
All impact evaluations were conducted in South Asia and Sub-Saharan Africa. It appears that a large
number of evaluations conducted in South Asia are impact evaluation, which is not the case in Sub-
Saharan Africa (Table 3).
14
Table 3 DFDI evaluations by geographic region
Region All evaluations Impact evaluations
East Asia & Pacific 1 0 Europe & Central Asia 2 0 Latin America & Caribbean 0 0 Middle East & North Africa 3 0 North America 0 0 South Asia 7 4 Sub-Saharan Africa 30 7 Global 18 1
3.2 Selection of 3ie studies As of end 2017, the number 3ie funded impact evaluations (IE) closed and posted on 3ie website was
112. However, for the analysis of determination of successful IEs, we sampled approximately one-
third of the full list, i.e. 35 studies. To ensure better representation of the original studies, we sorted
all the studies by sector of evaluation (thematic areas), geography, 3ie funding modality and final
report quality. Every third study in the list was marked for inclusion. Some sectors such as
agriculture, health, education and social protection were over-represented, while other sectors such
as finance, private sector development, peace building, urban development, etc. were under-
represented. To get a better representation of all sectors, we have included all the sectors that had
only one study irrespective of its placement in the list. On the other hand, the maximum number of
studies from the over-represented sectors were limited to five. The final sample includes 35 studies.
This is a fair representation of all thematic areas, geography and the quality of studies in the full list
of all closed 3ie grants. See Figure 3 for the distribution of the selected 3ie studies by sector.
Figure 3 Selected 3ie impact evaluations by sector
3ie funding modalities include open windows, thematic windows and policy windows. 3ie’s Open
Window modality is a grant platform for funding impact evaluations in low- and middle-income
countries (L&MIC) irrespective of funding size request or subject area (Puri et.al. 2018). There were
15
four rounds of open window funding (OW1 to OW4) during 2009 – 2012. This unrestricted funding
modality has now been discontinued. 3ie’s other funding modalities placed restrictions such as
specific themes (thematic windows) and or country and policy relevance. The aim of the thematic
windows is to increase the body of rigorous evidence in a single sector where there are fewer
evaluations with the aim of maximizing policy relevance and impact. The policy window supports
evaluation of government programmes implemented and deemed important by 3ie developing
country members. Since most of closed grants now are from the first four open windows, about 88
per cent of the sampled studies are from open windows and the remaining from thematic and policy
windows (see Figure 4).
Figure 4 Selected 3ie evaluations by thematic window
3ie proposal review of open windows ranked all the proposals by their technical quality and their
geographical distribution (Puri et.al. 2018). This process ensured that while highest quality proposals
were selected, there was also a good geographical distribution. Studies from Africa and South Asia
account for more than half of all the studies in our analysis (see Figure 5).
Figure 5 Selected 3ie evaluations by geographic area and study design
3.3 Methods The selected 3ie and DFID studies were coded along a number of variables representing the
determining factors identified in the conceptual framework and listed in Table 4. Some of the factors
were coded numerically like, for example, whether the study was reviewed by an IRB or the size of
the sample. Other factors are qualitative judgments of the study like, for example, the clarity of the
conclusions or their policy relevance. It should be noted however that even numerical variable, such
as, for example, whether the theory of change is well defined, contain an element of qualitative
judgment. Note that while the information below could be collected for most 3ie evaluations, only
few of the indicated variables could be extracted from the DFID evaluation reports and the related
management response documents available.
OW1
OW2
OW3
OW3
OW4
PW2
TW1TW4 23%
49%
9%
3%
6%
3%
6%3%
Distribution of studies by Grant Window
Central AsiaEast Asia Pacific
Latin America and Caribbean
South Asia
Sub Saharan Africa
3%
9%
20%
26%
43%
Geographical Distribution of closed studies
Quasi experimental
RCT
RCT, Quasi experiemental
29%
46%
26%
Distribution of studies over Study Design
16
Table 4 List of determinants of success and their proxy indicators
Stage Factor Metric
Defining questions Theories, project, accountability
Qualitative assessment of the statement of the problem the project and the evaluation are intending to solve. A well-defined theory of change. Are systematic reviews quoted in reference to the problem statement?
Planning Transparency Study registration A Pre analysis plan is available
Stakeholders engagement Meetings and workshops with implementers Training events Endorsement by the implementers
Team composition Skill set of evaluators
Evaluability Evaluability assessment conducted
Pilot Was the intervention piloted/formative evaluation?
Implementation Project implementation Implementation issues Implementation delays Shocks affecting implementation
Take-up Participation rates
Biases Attrition rates
Contamination Contamination reported
Data collection Sample size
Ethical issues IRB approval
Political economy No data available
Adaptation Letters of variation Modification to the design
Results Analysis Mixed-methods Sub-group analysis Intermediate outcomes
Dissemination Policy influence plans Policy briefs Presentations and conferences Media Publications
Success Evidence Clarity of conclusion with respect to evaluation questions Reporting of effect size of the intervention
Relevance The study provides results that can be effectively used for policy: this is based on a qualitative assessment of the conclusions of the study
Policy impact Assessed by our reading of the management response (in the case of DFID) studies and by 3ie own appraisal (in the case of 3ie studies)
Value for money Budget size
Cross-cutting factors
Quality assurance Peer review groups Advisory panel
Context Country and region Sector of intervention (World Bank classification)
4 DFID impact evaluations
4.1 Successful DFID evaluations We assessed the success of the evaluations by three factors: credibility, relevance and policy impact
(Figure 6 illustrates). We do not consider value for money because we have no information on the
budgets assigned to the evaluation and we abstract for the moment from eventual spill-over effects
that are difficult to extract from the documents available. We looked at credibility, relevance and
policy impact sequentially as if each mattered only conditionally to the previous one. Different
pathways are difficult to imagine. It is hard to think of unreliable evidence as been relevant, while
17
irrelevant evidence should not have impact by definition. It is possible for unreliable evidence to
have a policy impact, but this is unlikely to occur in this context.2 In what follows, the reliability of
evidence is mostly based on our qualitative judgement. Whether a study is relevant or not, is based
on our reading of the conclusions drawn by the researchers and the management response. The
policy impact is entirely understood through our reading of the management response. The figure
below illustrates the sequential classification of the studies.
Figure 6 Reliability, relevance and policy impact (DFID evaluations)
Three of the selected evaluations did not provide reliable evidence and were no further considered.
One study assessed the impact of a capacity building intervention in several low income countries.
The evaluation included a difference-in-difference analysis of the impact of the intervention on
knowledge and skills of health care workers. However, the sampling methodology as well as the
analysis were rather inaccurate. The quantitative assessment was only a minor component of the
overall evaluation and there is no sign that it had a major impact on the conclusions. The authors of
the study did not draw strong conclusion from it and management did not seem to take any relevant
lessons from this particular aspect of the evaluation. The evaluation of a result-based education
project in Ethiopia relied on an interrupted time-series design. The intervention had been
implemented at the national level and no other design option was available. However, there were
delays in project implementation and the data were not fully representative. This was acknowledged
by the researchers, while the management response stressed that in this case lack of evidence of
impact did not imply lack of impact. The evaluation of a livelihood programme found significant
impacts of the intervention on poverty reduction. However, the management response exposed the
fact that external reviewers found large discrepancies between the reported income data and data
collected by other agencies, which called in to question the credibility of the entire survey. As a
result, management assigned the final impact evaluation to another provider altogether.
Of the nine evaluations providing credible evidence, three turned out not to be relevant. An impact
evaluation found a number of positive impacts of a distribution system of ORS packages through
street vendors of soft drinks. However, all the evidence related to availability, access and use of the
ORS packages and no evidence was presented on the effectiveness of the intervention. The results of
2 This view is consistent with Gueron and Rolston assessment of the unusual impact of the evaluation studies of welfare programmes in the US in the 80s. The identify five factors: credibility, the finding themselves, their timeliness, the marketing of the results and the political environment (2013). Marketing and policy environment are contextual factors enabling policy impact, while the ‘finding themselves’ and ‘timeliness’ is what we have described as ‘relevance’ of the findings.
reliable evidence
no
(3)
relevance
(8)
no
(3)
polcy impact
(5)
no
(1)
yes
(4)
18
the study appear to be too contextualised and limited to intermediate outcomes to be relevant. No
management response was available for this study.
Table 5 Credibility, relevance and policy impact of DFID evaluations
Study Impact of intervention
credibility relevance Policy impact
Delivering Reproductive Health Results though non-state providers in Pakistan (DRHR)
No impact found
Yes Yes Yes
Evaluation of the Developing Operational Research Capacity in the Health Sector Project (ORCHSP)
Positive impact
No - -
Evaluation of the Pilot Project of Results-Based Aid in the Education Sector in Ethiopia (RBAESE)
No impact found
No - -
Social and Economic Impacts of Tuungane (TUNGANE)
No impact found
Yes Yes No
COLALIFE operational trial zambia Improving use, access, availability and awareness of ORS and zinc for the treatment of diarrhoea in the home (COLALIFE)
Impact on intermediate
outcomes Yes No -
Independent Impact Assessment of the Chars Livelihoods Programme – Phase 1 (CHARS1)
Positive impact
NO - -
Independent Impact Assessment of the Chars Livelihoods Programme – Phase 2 (CHARS2)
Positive impact
Yes Yes Yes
Evaluation of Uganda Social Assistance Grants for Empowerment (SAGE)
Impact on intermediate
outcomes Yes No -
Adolescent Girls Empowerment Programme, Zambia (AGEP)
Mixed results Yes No -
Impact evaluation of the DFID Programme to accelerate improved nutrition for the extreme poor in Bangladesh: Final Report (IEAINB)
No impact found
Yes Yes Yes
Independent Evaluation of the Security Sector Accountability and Police Reform Programme (SSAPR)
Impact on intermediate
outcomes Yes Yes Yes
The evaluation of an empowerment programme in Uganda found a number of impacts on
intermediate outcomes. However, the sample used was scarcely representative. There were delays
in the project implementation and in the evaluation, and the project underwent a number of
changes during the evaluation. The management response stated that by the time the
recommendation became available, they were no longer relevant or had already been implemented
by the project. The evaluation of another empowerment programme in Zambia found inconclusive
evidence or mixed results. However, there were delays in project implementation so that the
planned impact evaluation became an ‘interim’ evaluation of the ‘direction of travel’ rather than the
expected final assessment, thus informing a future study but not project implementation.
The evaluations that produced credible and relevant results had an impact on policy, with possibly
one exception. The evaluation of a health intervention through non-state providers in Pakistan
delivered the ‘surprising’ result that the intervention had no impact on access to health care, which
led to a re-examination of the whole intervention. The evaluation of a livelihood programme in
Bangladesh encouraged the management to continue the intervention and to introduce changes for
improving success. The ‘sobering’ results produced by the evaluation of a programme to accelerate
improved nutrition for the extreme poor in Bangladesh led to a reconsideration of the DFID support
strategy to nutrition interventions in Bangladesh. The evaluation of a policy reform programme in
Zambia found a number of positive impacts on knowledge, attitudes and practices around policing
that management accepted and committed to use in future programmes. The impact evaluation of a
community driven reconstruction programme in the DRC is a possible exception. The evaluation
19
found no impacts on behaviours and outcomes, which, according to the evaluators, called into
question the appropriateness of the whole project model. However, the management concluded
that positive impacts of the intervention outweighed the absence of impact on behavioural change
and therefore opted for a continuation, albeit modified, of the programme.
It is also interesting to note that only one study found a positive impact of the intervention. There
were other two studies finding a positive impact but the evidence provided was not considered
credible. Other three studies found a positive impact but only on intermediate outcomes. All the
remaining studies found no impact of the interventions on final outcomes. It is also interesting to
note that all studies are assessing impacts on a plurality of outcomes and sometimes for several sub-
groups in the population. We found no obvious attempt to cherry-picking of results or selective
reporting, thought the risk exists. More importantly, evaluators and management often found
difficult to make sense of the large number of effects observed on several dimension and the results
are difficult to interpret in a coherent way.
4.2 Determinants of success of DFID evaluations Four of the 12 studies were pilot interventions and the evaluations aimed at assessing their
effectiveness for their potential scale-up. The remaining evaluations were assessing existing
programmes with the intent of improving or scaling-up the interventions or for accountability
purposes. Perhaps a bit surprisingly only 7 studies, just above half the sample, included a good
theory of change of the intervention and only 3 of the studies quoted relevant systematic reviews of
similar interventions. The absence of a well-defined theory of change and of references to
systematic reviews is not helping the definition of good evaluation questions.
Success at planning stage depends on factors such as: the skill set of the evaluators, transparency of
the evaluation, stakeholders’ engagement, and the implementation of pilot or formative studies
before starting the full evaluations. Eight of the study teams appeared to have the required skill set
while the judgment is somewhat suspended for the remaining five. None of the study reports an
evaluability assessment, a pilot or a formative evaluation, and nothing is known about engagement
of stakeholders. It is difficult to say to what extent these activities did not take place or were simply
not reported.
Several projects (6) faced difficulties at the implementation stage which resulted in delays: the RBA
pilot was not well communicated to the regions in time to appreciably affect students’ performance;
in the Tungane evaluation political tensions led to loss of one province and some regions were
inaccessible for safety reasons; in the CHARS evaluation, the initial implementation of the
monitoring database system was relatively crude, especially in the database structures and the
handling of informant identification, also cohorts were mixed within villages; in the adolescent
empowerment programme in Zambia, the voucher scheme took almost two years to agree with
government; in the integrated nutrition programme in Bangladesh procurement and distribution of
micronutrients were considerably delayed; the Security Sector Accountability and Police Reform
Programme did not state its short- and long-term goals until quite late in the implementation period,
and thus there was little time to work collaboratively. Only one of the studies, the Bangladeshi
nutrition project, was affected by a major shock. Attrition above 20% of the baseline sample was
reported as a problem in 3 studies. It is unfortunate that none of the studies report data on take-up
of the interventions so that it is impossible to reflect on this important aspect of the interventions.
Sample sizes are in the order of thousands of observations in most cases but two evaluations
collected information from less than 300 observations.
20
All studies, with 3 exceptions, were mixed method evaluations that included a qualitative
component in the form of interviews, focus group discussions, process evaluations or contribution
analysis. All studies, with 2 exceptions, conducted some sub-group analysis mostly by gender and
age of project participants. None of the studies, with the exception of the Bangladeshi nutrition
study, appear to have analysed intermediate outcomes of the interventions. Nothing can be said
about the dissemination of the results because the information cannot be evicted by the documents
available.
5 3ie impact evaluations
5.1 Successful 3ie studies Assessing the success of 3ie impact evaluations by the criteria of credibility, relevance and policy
impact is slightly different in comparison to the DFID studies (see Figure for an illustration of the
assignment process). Most 3ie studies are set up to be highly credible. Studies are selected among a
large number of submitted proposal using highly selective criteria applied by internal and externa
reviewers. In addition, the evaluations are scrutinised and monitored from design to final publication
in order to avoid errors in design, implementation and analysis. In addition, all 3ie studies are
relevant, to some extent, because relevance is one of the selection criteria adopted to make funding
decisions. This implies that both credibility and relevance of 3ie evaluations tend to be very high. Of
course credibility and relevance at the design stage does not preclude the possibility that these end
up being compromised at the implementation or analysis stage.
Figure 7 Reliability, relevance and policy impact (3ie evaluations)
Only two evaluations, out of the selected 35, produced non credible evidence (see Figure 7). The first
is an evaluation of the impact of a micro irrigation innovation on household income, health and
nutrition. The study had enormous limitations relating to the comparability of three cohorts for
assessing impact and attrition was not properly addressed. The second was a study assessing the
effect of network structure on the diffusion of health information. The study was designed in such a
way that the desired estimation of the causal impact was impossible.
We found the results of five impact evaluation as not being relevant. Relevance can take different
meanings and there is some degree of subjectivity in the definition. 3ie has its own definition of
relevance and recommends its use in the review of grant applications. The definition has changed
over the years but in general it requires that a study should have either (i) substantial scale or
expected effect size (i.e., affecting a large number of people, implying large resource outlays, or
having large effects on a sub-population), or (ii) being innovative intervention, with the potential to
reliable evidence
no
(2)
relevance
(33)
no
(5)
policy impact
(28)
no
(9)
yes
(19)
21
be scaled up. Alternatively, the study should be filling a gap of knowledge in the specific context and
country and informing policy. All studies funded by 3ie are therefore ‘relevant’ by this definition. The
relevance discussed in what follows refers more to their ability achieve these goals.
We classified the following evaluations as not being relevant. A study of an intervention promoting
savings among poor households in Sri Lanka was unable to provide policy recommendations because
affected by low take up and delays.3 A study of an intervention aiming at increasing long term
savings with the design and the presentation of a pension product was affected by implementation
problems, but more importantly did not produce the expected results and turned out to be of no
policy use. A study of an intervention promoting the use of cook stoves to prevent indoor pollution
in Northern Ghana faced so many implementation problems that the results were few and difficult
to interpret. A study promoting male circumcision in Malawi produced very limited results because
of very low take up among the beneficiary population. Similarly, a study of an intervention offering
weather index based insurance to poor farmers in two Indian states failed because no farmer was
interested in the intervention.
From this survey of problematic studies, a pattern seems to emerge in relation to the causes of the
lack of relevance. First, some studies (like the cook stove and the mobile banking evaluations) were
affected by factors totally outside the control of the researchers or by factors not well understood at
the design stage of the evaluation. The evaluation designs of these studies could have been
improved through a better engagement with relevant stakeholders and with an evaluability exercise
prior to the evaluation to assess potential risks. Second, a good number of studies was affected by
implementation delays and low take up rates. Low take-up is an universal problem in project
implementation and evaluation. More recently 3ie, to prevent this type of problems, has been
promoting pilots or formative studies with the goal of testing the feasibility and acceptability of
interventions before introducing large scale experiments. Finally, some interventions were testing
totally new interventions which ended up being ineffective. Some studies are designed in such a way
to be scarcely relevant when successful, and entirely useless when unsuccessful (see box on
Publication is not relevance).
Box Publication is not relevance
The publication of a study in a peer-reviewed journal is sometimes considered an indicator of its
quality. However, published studies are often credible and interesting but not necessarily relevant
for policy. This is particularly true for a recent wave of evaluation studies aimed at testing the impact
of incentives on people’s behaviour. Some impact evaluations of this type are designed to test
research ideas. But where do ideas come from? In the words of Karlan and Appel of Innovations for
Poverty Action (Karlan & Appel, 2016a), ideas are “perhaps extensions or continuations of an existing
theory, or inspired by the results of other research or from experiences in neighbouring countries or
countries far away.” Ideas often come in the form of hypotheses about different types of incentives
influencing behaviours (the “nudge”). For example, in a famous study mothers were offered two
pounds of dal (dried beans) and a set of stainless steel plates if vaccinating their children. The
incentive had a tremendous impact on vaccination rates (a sevenfold increase to 38%), and the
authors (Banerjee & Duflo, 2011) judged this programme “to be the most effective we ever
evaluated, and probably the one that saved the most lives.” One problem with this type of studies is
3 Shortly after the demonstrations began, a glitch was discovered in the IS platform on which the product was built. The mobile operator’s account was debited each time a deposit was made and the bank suspended use of the system. There were delays in Signing of MOU between Banks and Mobile services. Finally, a credit card fraud made the bank more cautious about the access given to the software company which caused some further delays in the launch of the product
22
that if successful, they can hardly be replicated, while if unsuccessful, they are virtually of no use. If
the study is successful, it certainly shows that a particular incentive has a specific effect in a given
population. But what do we know about the impact of other types of incentives and what do we
know about their influence in different populations? Dal is not widely popular outside India. On the
other hand, if the study is unsuccessful it provides no other recommendation apart from confirming
that a particular type of incentive does not work. In the word of Karlan and Appel (2016a), this is the
case of “idea failure” versus “research failure”, the latter being caused by faulty design plans or
unforeseen shocks. Journals are keen to publish this type of studies because they are innovative and
sometimes defy existing theories, but the policy relevance of these studies is sometimes
questionable. 3ie has funded a number of studies that were published in peer-reviewed journals
despite having little or no practical relevance by the standards adopted by 3ie internal and external
reviewers. For example, a published study of an intervention distributing female condoms through
hairdressers in Zambia found that the intervention was more effective when hairdressers were given
social incentives than when given monetary incentives. However, the amount of condoms
distributed was so small that the intervention had no practical relevance in fighting HIV. In another
published study, researchers tested slum upgrading in a number of Latin American countries and
found high levels of satisfaction among families given a new house. However, about half of the
project beneficiaries reported no improvement in their living conditions, which calls into question
the validity of the evaluation or of the project. In yet another published study, researchers tested
the impact on farmers’ productivity of providing contact farmers with training, social incentives (T-
shirts) and material incentives (raincoats) in rural Mozambique. However, the authors were not able
to conduct a baseline, the study was affected by contamination of the control group, and social and
material incentives were not delivered as planned. Yet, in this latter case as in the previous ones, the
evaluation was published by a peer-reviewed journals despite being deemed hardly credible and
practically irrelevant by 3ie reviewers. To summarise, reviewers of academic papers and reviewers of
impact evaluations follow different criteria and the policy relevance of interventions does not figure
highly among the criteria used by academic reviewers.
The policy impact of 3ie studies is difficult to assess because the evaluation funded by 3ie are rarely
directly commissioned by a funder or an implementer to answer a precise question. 3ie was
established precisely with the goal of supporting impact evaluations that would have not otherwise
been funded. It should be no surprise therefore that some of 3ie evaluations are not always
designed to influence decisions directly. Many 3ie evaluations are ‘public goods’ that provide the
international community with invaluable hard evidence to guide policy. This does not mean that the
studies do not have a policy impact or that the evidence produced it is not valuable. Evaluations may
affect the way people frame issues and can provide explanatory causal ideas, becoming part of the
stock of knowledge used by decision makers (Weiss, 1980). Evaluations can provide ideas and
models which shapes our understanding of how the world works. Evaluation can have an
enlightenment function by slowly percolating to the public and policy makers and influencing actions
in subtle and unexpected ways (Nutley, Walter, & Davies, 2008). Of the 28 3ie studies that were
identified as relevant and credible, only six were considered “changing policy or programme design”,
one was influential in “taking a successful programme to scale.” On the other hand, seven studies
“Informed discussions of policies and programmes”, 4 studies improved “the culture of evaluation
evidence use and strengthen the enabling environment”, while the remaining 9 studies appeared to
have no policy impact.
It is also interesting to note that of the 35 studies considered, including the two studies whose
evidence is not credible, less than half (14) found a clear impact of the interventions. Elven studies
found no impact at all, while nine studies found ‘some’ impact on some of the outcomes observed,
23
usually intermediate outcomes rather than final outcomes. The reason for many studies finding
‘some’ impact is that studies are often testing the impact of the intervention on several outcomes at
the same time, some of which are final while other are intermediate. In the absence of a pre-analysis
plan that set out the hypotheses to be tested in advance, it is difficult to interpret the results of
these evaluations as positive results. Lacking a pre-specified set of hypotheses to test, the results
may be reported selectively by the researchers in a cherry-picking exercise led by pursuit of
statistical significance. Of the nine studies reporting ‘some’ impact, only one had prepared a pre-
analysis plan at the design stage and the results produced by these studies are difficult to interpret.
5.2 Determinants of success of 3ie studies In our review of the literature we identified three main reasons for conducting impact evaluations:
innovations, evaluations of existing programmes and accountability. Classifying each study in these
categories is not a simple task and it requires some degree of subjectivity. However, our reading of
the project documents suggests that only 2 evaluations were designed with the goal of assessing
whether the programmes were achieving their objectives: the evaluation of a productive safety net
programme in Ethiopia and the evaluation of a social transfers scheme in Malawi. In several cases,
the goal of the evaluations was to learn about the impact of existing programmes with the goal of
improving the projects in some ways. At least one half of the remaining study consisted of what we
defined innovations, that is interventions that had not been implemented before such as, for
example, the introduction of new cook stoves, mobile banking, chlorine dispensers, male
circumcision, new pension and saving products and the like. Implementation of this type of studies
present some risks related to the implementation and acceptability in the population in comparison
to established and running programmes that can affect the final success of the studies.
On average, characteristics of evaluations funded by 3ie are more aligned with best practices than
characteristics of evaluations funded by DFID because of the strict selection criteria employed and
because of the constant monitoring of the evaluation studies from design to dissemination of
findings. Table 6 illustrates these characteristics. We first discuss briefly the average characteristics
of 3ie evaluation before comparing the differences in determining characteristics of ‘successful’ and
‘unsuccessful’ evaluations.
Despite efforts by 3ie staff and reviewers, only a quarter of evaluation designs includes a well-
defined theory of change. About a third of evaluation designs quote or reference a systematic
review on the topic thus displaying a good knowledge of the topic and an interest in accumulating
evidence by the production of new evidence. Team’s skills are rated by 3ie on a scale from 1 to 4.
This is one of the selection criteria for funding and, not surprisingly, is quite high among funded
evaluations. Only 2 studies had an evaluability assessment prior to an evaluation design.
Surprisingly, only slightly above 50% of the studies were approved by an institutional ethical board.
Even considering that several studies are using secondary data this implies that some field research
is conducted without ethical approval. The proportion of studies that are registered and that have a
pre-analysis plan is very small (less than 20%). More than three quarters of the evaluations have a
peer-review group or an advisory panel, which reflects the high importance given by 3ie to the
quality assurance of funded evaluations. 3ie also actively promotes the engagement of research
teams with project managers and policy makers, so that implementers’ endorsement is universal
and the fraction of evaluation conducting meetings with the implementers is very high.
24
Table 6 Determinants of success of 3ie studies
Determinant Successful Unsuccessful All
Defining questions Well-designed theory of change 0.17 0.29 0.23 Systematic review quoted 0.17 0.53 0.34 Planning Team’s skills 3.17 2.65 2.91 Evaluability assessment 0.00 0.12 0.06 IRB approval 0.61 0.53 0.57 Study registration 0.11 0.24 0.17 Pre-analysis plan 0.17 0.12 0.14 Implementer’s endorsement 1.00 1.00 1.00 Stakeholders meetings 0.50 0.65 0.57 Stakeholders trainings 0.17 0.12 0.14 Peer-review group 0.78 0.65 0.71 Advisory panel 0.72 0.82 0.77 Implementation Pilot or formative study 0.39 0.53 0.46 Implementation issues 0.33 0.47 0.40 Implementation delays 0.33 0.53 0.43 Shocks 0.11 0.12 0.11 Letter of variation 0.94 0.94 0.94 Changes to design 0.00 0.24 0.11 Take-up (average) 0.87 0.61 0.74 Attrition (average) 0.05 0.17 0.12 Sample size 0.89 0.47 0.69 Analysis Mixed methods 0.83 0.88 0.86 Subgroup analysis 0.72 0.71 0.71 Intermediate outcomes 0.83 0.82 0.83 Dissemination Policy impact plan 0.83 0.94 0.89 Policy brief 0.17 0.12 0.14 Presentations 0.50 0.41 0.46 Media 0.28 0.24 0.26 Publications 0.83 0.88 0.86 Budget 2.00 2.00 2.00
At the implementation stage, nearly half of the studies were preceded by a pilot testing the
feasibility of the intervention, which reflects 3ie’s interest in avoiding implementation failures. This
however does not prevent the occurrence of implementations problems, which resulted in delays in
more than 40% of cases. The issue of letters of variations to the original contract is the norm, though
only in about 10% of cases this implies a change in the original evaluation design. The average
attrition rate (12%) is probably above what we would like to observe, but major shocks to study
design are observed in only 10% of evaluations. About 70% of evaluation have a reasonable sample
size. The required sample size is obtained through power calculations and 3ie has been routinely
requesting such calculation from researchers. However, the experience shows that researchers tend
to have overly optimistic expectations about the effectiveness of interventions so that ex-ante
power estimates are normally far lower than those estimated ex-post. For our purposes, we used a
cut-off of 1,000 observations for the intervention treatment arm as an indicator of small/large
sample. By this standard, about 70% of evaluations were based on large samples.
25
At the analysis stage, the use of subgroup analysis, the analysis of intermediate impacts and the use
of mixed methods approaches is widespread. This is expected because the use of mixed methods is
strongly encouraged by 3ie (see Figure 8 for an illustration of the type of mixed-methods used).
Similarly, nearly all evaluations have a policy influence plan, though this does not always translate in
dissemination as shown by the relatively low proportion of studies that are presented, disseminated
through policy briefs or that appear on the media. On the other hand, nearly all evaluations funded
by 3ie end up being published in peer-reviewed journals.
Figure 8 Mixed methods used in 3ie impact evaluations
Our sample of 35 studies is too small to perform statistical tests of differences between successful
and unsuccessful studies across a large number of characteristics. A qualitative comparative analysis
of the different configurations of determinants of success seems to be more appropriate in this case.
However, we can compare the average levels of characteristics in the two groups to see if there are
any glaring differences.
First, differences between successful and unsuccessful characteristics are not too large, and no
differences appear at the design stage. Second, perhaps surprisingly, a good theory of change as well
as having a peer review group and an advisory panel, do not seem to make a difference. Third, at the
planning stage, the only visible difference is in the skill set of the evaluation team, which is higher for
successful evaluations. Stakeholder engagement, which was highlighted as one of the main
determinants of success by the literature, does not seem to make a difference on success. Fourth,
there are some interesting differences at the implementation stage. Successful evaluations are less
affected by implementation issues and delay-. They have lower attrition rates and larger sample size.
Quite visibly, unsuccessful evaluations have lower take up rates. Finally, successful evaluations
report no change in evaluation design. Fifth, all the remaining characteristics at the analysis and
dissemination stage are nearly identical in the two groups.
High attrition rates and low take-up rates are particularly worrisome because they are the source of
a number of problems that do not seem to receive the attention they deserve in impact evaluations.
Different attrition rates in the project and control group can bias the estimation of project effects,
but even if attrition rates are the same in the two groups, they can limit the external validity of the
study. If a large fraction of the population drops out of the study, the results obtained do not apply
to the whole population. When take-up is very low, the validity of the intervention may be
questioned altogether. But even if the impact is estimated on the participants, researchers are not
0 2 4 6 8 10 12 14 16
Key Informant Interviews
Focused Group Discussions
In-depth Interviews
Observational techniques
Subjective measures (suggestion box…
Text Mining and recursive abstraction
Administrative data analysis
Case Studies
Media communication
Frequency of Type of Mixed methods used in the closed studies
26
always clear or aware that this impact cannot be extrapolated to the non-participants and that the
quasi-experimental methods used to assess this impact are credible only under some additional
assumptions.
5.3 Unsuccessful studies Over the last ten years, 3ie has discontinued 8 studies, out of 122 closed grants and 222 closed plus
ongoing evaluations, for a number of studies for different reasons. For example, a trial of an
intervention to promote empowerment in Mauritania was discontinued at an early stage because of
security concerns for the enumerators and the external researchers due to a sudden break out of
hostilities. In another example, an evaluation of a debt relief mechanism in Andhra Pradesh that
would allow poor women to borrow at concessional rates from banks in order to repay high rates
loans to informal moneylenders, was discontinued because unable to select a survey firm on time
for the baseline. In addition, when the team went to the field to refine the survey instruments it
became apparent that all villages in the state were being reached by the intervention thus
preventing the establishment of a credible control group. In another example, the study was
terminated at the analysis stage. An evaluation of an intervention promoting the diffusion of
improved seeds in Uganda was commissioned to a team with limited experience of impact
evaluations. Concerns about the design were raised earlier on by 3ie reviewers and the study started
as an RCT, soon to became a difference-in-difference study and finally turned into a PSM study
without ever adopting a convincingly valid evaluation approach.
While knowing the causes of the failure of studies, several of which are thoroughly documented by
Karlan and Appel on the experience of IPA (2016a), is certainly useful, it is also interesting
considering studies that could have been discontinued but ultimately were not. Mckenzie observes
that implementation failures tend to snowball quickly and that researchers find it hard to know
when to walk away (2016). Similarly, Karlan and Appel observe that “of all the intellectual challenges
that arise in the course of designing and implementing a rigorous research study, the greatest may
be deciding when to pull the plug.” The decision of discontinuing a study is as difficult one for a
manager of an evaluation as it is for a researcher. Here, we summarise some of the arguments that
have been put forward by 3ie reviewers and managers when deciding whether discontinuing a study
or not. We hope that a discussion of these arguments might help avoiding common mistakes and
future decisions.
One common argument against discontinuing a study is sunk costs. It is difficult to discontinue a
study in which many resources and efforts have been invested, sometimes over a number of years.
Sunk cost is a standard fallacy of logical reasoning: disbursements made in the past should not affect
decisions about the future. In principle, managers and researchers should not let past efforts to
influence decisions about the future. However, people have often hopes or expectations that
circumstances may change or that they could be exploited somehow in different ways. Sunk costs
arguments are difficult to resist.
A second common argument partly follows as a consequence of the first. It is hoped that researchers
may devise a change in design that allows the completion of an interesting study despite the change
in environment and other difficulties. Many projects in international development end differently
from what it had been planned, and some degree and ability of an evaluation to adjust to changes in
circumstance should be expected. In this case, while a randomized controlled trial may be no longer
possible, a quasi-experimental study or a qualitative study could turn out to be useful and
researchers and managers could agree on a way forward.
27
A third argument is reputational risk. The success of an evaluation study is a responsibility of
researchers as well as of their managers. Bringing a study to completion and meeting the targets
despite all problems may be sought in order to avoid bad publicity or negative feedback from the
community of researchers and funders. This is not a logical argument in favour of continuing
evaluations of course but something managers have to reckon with.
The publication of an unsuccessful study in an academic journal is another possible argument.
Evaluations that fail at the implementation stage or that are irrelevant from a policy perspective (see
box above on publication is not relevance) can be successfully published in prestigious academic
journals. Researchers have all interest to publish studies as this is often the main motivation of their
work and this motivation can be shared by managers and funders who may sympathize with
researchers pursuing an otherwise irrelevant study.
Finally, some would argue that a study should never be discontinued and should be carried out
according to plan to the extent this is possible regardless of circumstances. The reason is that a
discontinued study is quickly forgotten and will not appear in reviews of evaluations. The
disappearance of a failed study from journals will result in biased assessment of similar types of
interventions. In addition, it could be argued that much could be learnt from failures by other
researchers if the study and its development are properly documented, if anything to prevent a
similar failure to occur in the future.
We believe that while discontinuation should not be taken lightly it should be consistently
considered. We do not believe that evaluations should be discontinued if they are not going to plans.
Many evaluations take place in highly volatile and difficult contexts. A strict application of this rule
would lead to the completion only of projects implemented and evaluated in simple environments.
On the other hand, we do not believe that evaluations should be carried out regardless of emerging
circumstances simply for the purpose of documenting and avoiding reporting bias in research. A
middle ground needs to be found between these two extremes and decisions should be taken on a
case-by-case basis, though some general guidelines could be used.
First, we believe that prospective publications in academic journals and reputational risk for the
funding institution are not valid arguments in favour of continuing a study which is not set to provide
credible and relevant evidence. Academic journals publish studies that are not necessarily credible
and relevant and the reputation of funding institutions is better served by holding firmly to
principles of rigour and relevance. Second, before discontinuing a study affected by low take-up,
researchers and managers should consider the opportunity of applying different evaluation designs
(including qualitative approaches), assessing impact on beneficiaries only (the ‘treatment on the
treated’) rather than on the overall population, and explaining the reasons for low participation
rates. Third, funders should establish monitoring processes to assess uptake and make informed
decisions around implementation rollout or low take-up concerns. Funders should have designated
“go-no go” decision periods after some points in the evaluation. These decisions would include a
formal review of the study and a decision on the viability of the impact evaluation.
6 Conclusions We summarise the main lessons from the review of 3ie and DFID impact evaluations conducted
above in the following way:
• The overall success rate of impact evaluations is not very high. We worked with very small
samples, and percentages can be misleading, but we found only about 30% of DFID
evaluations and only 50% of 3ie evaluations to be ‘successful’ by our definition. Meeting the
28
triple goal of achieving credibility, relevance and policy impact is not a simple task.
Expectations about success of impact evaluations should be set accordingly.
• The higher success rate of 3ie evaluations is probably a result of the greatest effort spent by
the organisation in selecting teams of evaluators and in supporting the teams along all the
evaluation process, from the design, to the implementation of the study, the analysis of the
information collected and the dissemination of the results. The quality assurance role played
by 3ie is rather unique and might not be easily replicated by other organisations.
• The higher success of 3ie studies hides the fact that DFID studies tend to have higher policy
impact while 3ie studies tend to be more credible and relevant. This is likely the reflection of
the different funding goals adopted by the two institutions. 3ie is more likely to fund
evaluations informing global debates and increasing the stock of knowledge (‘public goods’),
while DFID is more likely to commission evaluations with a view to directly improving project
operations.
• 3ie has funded a larger number of evaluations of ‘innovations’, that is interventions never
attempted before. These interventions, and related impact evaluations, have been often
preceded by pilots in order to avoid failures resulting from low acceptability and feasibility. It
should be stressed however that an element of risk is always present in ‘innovations’ and
that not everything can be planned in advance or piloted. In our review, piloted
interventions do not appear to be more successful than others. The experience of 3ie also
shows that ‘innovations’ evaluations are not always relevant, despite being published in
academic journals. They are often non replicable when the interventions are successful and
have virtually not use when the interventions are unsuccessful.
• At the planning and the design stage, the composition of the team of evaluators and their
skills appear to be an important determinant of success, while other factors often stressed
by the literature such as stakeholders’ engagement, a solid theory of change, and peer-
review groups, appear to be less relevant.
• The low rate of registration of evaluations and of pre-analysis plan does not seem to affect
the success of the evaluations but does have a negative effect on policy recommendations.
Many evaluations report ‘some’ impacts after testing multiple hypotheses for several
outcomes in such a way that the results of the interventions are difficult to interpret.
Registration and pre-analysis plan would help researchers to clarify the goals of their study
and to set more clearly the standards used to assess the impact of interventions.
• Most problems affecting impact evaluations seem to emerge at the implementation stage.
In our review of the literature, and of the DFID and 3ie evaluations, we found several
evaluations negatively affected by: delays in programme approval, project modification,
implementation delays, lack of financial and/or human capital to roll out the programme as
planned, lack of buy-in from lower level and last-mile workers, service delivery glitches,
external shocks and low take-up. The latter being one of the main sources of unsuccessful
evaluations with great implications on the credibility of the estimation of project effects and
on their external validity.
• Whether an evaluation should be discontinued when facing implementation problems is a
decision that should be taken on a case-by-case basis. We do not recommend that
evaluations should be discontinued when sticking to plans is impossible, or that they should
be led to completion at all costs regardless of the problems faced. Managers and
researchers should avoid considerations of reputational risk, publishability of research or
sunk costs when making decisions regarding discontinuations, and a process should be
established to consider the opportunity for discontinuing evaluations.
29
• Successful evaluations obtain their data from large samples. Sample size is a crude measure
of the quality of a study but we observe that impact evaluations relying on small samples
(less than 1,000 project observations) are more likely incur into problems of credibility and
relevance.
• The use of mixed methods, subgroup analysis and the analysis of intermediate impact does
not appear to make studies more successful or policy relevant. On the contrary it was
observed that in the absence of registration and pre-analysis plan the production of many
results can make the interpretation of impact more difficult.
• We found no obvious relationship between the size of the budget and the success of the
impact evaluations funded by 3ie, suggesting that successful impact evaluations can be
produced with limited budgets.
Finally, we would like to highlight some limitations of the study and a possible way forward. First,
the conceptual framework and the review of the impact evaluations is inevitably influenced by the
beliefs and personal experience of the authors. There is a good element of subjectivity in this type of
work and a different team of researchers using the same information would probably reach different
conclusions. Second, the studies reviewed are a sample of the impact evaluations funded by 3ie and
DFID and a very small sample of all impact evaluations conducted over the last 5-10 years. The
sample is not representative of all the determinants of success at play in the wider process of
knowledge generation. Third, the small size of the sample limited our analysis to case studies
supported by quantitative summary tables. There are at least two ways in which our study could be
improved or expanded in a second iteration. The first, is by inspecting more closely a number of case
studies of extremely successful impact evaluations in order to draw lessons about factors
determining success. The second is by using a qualitative comparative analysis approach (QCA) to
assess whether particular configurations of factors explain success.
30
References
Banerjee, A., & Duflo, E. (2011). Poor Economics: A radical rethinking of the way to fight global poverty. New York: PublicAffairs.
Blair, G., Cooper, J., Coppock, A., & Humphreys, M. (2016). Declaring and Diagnosing Research Designs. unpublished manuscript.
Briceno, B., Cuesta, L., & Attanasio, O. (2011). Behind the scenes: managing and conducting large scale impact evaluations in Colombia Working Paper 14. New Delhi: International Initiative for Impact Evaluation (3ie).
Campos, F., Coville, A., Fernandes, A. M., Goldstein, M., & McKenzie, D. (2012). Learning from the Experiments That Never Happened: Lessons from Trying to Conduct Randomized Evaluations of Matching Grant Programs in Africa. Washington DC: The World Bank.
Cuong, N. V. (2014). Impact evaluations of development programmes: experiences from Viet Nam. 3ie Working Paper 21. New Delhi: International Initiative for Impact Evaluation (3ie).
Davis, R. (2015). Evalauability assessment. Retrieved from http://www.betterevaluation.org/en/themes/evaluability_assessment
Duflo, E. (2017). The Economist as a Plumber. Ferguson, M. (2017). Turning lemons into lemonade, and then drinking it: Rigorous evaluation under
challenging conditions. Retrieved from https://researchforevidence.fhi360.org/turning-lemons-lemonade-drinking-rigorous-evaluation-challenging-conditions
Glennerster, R. (2017). The Practicalities of Running Randomised Evaluations: Partnerships, Measurement, Ethics, and Transparency. In E. Duflo & A. Banerjee (Eds.), Handbook of Economic Field Experiments (Vol. Vol. 1, pp. 175-243): North Holland.
Glennerster, R., & Takavarasha, K. (2013). Running Randomised Evaluations: A Practical Guide. Princeton, New Jersey: Princeton University Press.
Gueron, J. M., & Rolston, H. (2013). Fighting for Reliable Evidence. New York: Russel Sage Foundation.
Jones, M., & Kondylis, F. (2016). Lessons from a crowdsourcing failure. Retrieved from http://blogs.worldbank.org/impactevaluations/lessons-crowdsourcing-failure
Karlan, D., & Appel, J. (2016a). Failing in the field: What we can learn when field research is going wrong. Princeton University Press: Woodstock, Oxfordshire.
Karlan, D., & Appel, J. (2016b). When the Juice Isn’t Worth the Squeeze: NGOs refuse to participate in a beneficiary feedback experiment Retrieved from http://blogs.worldbank.org/impactevaluations/when-juice-isn-t-worth-squeeze-ngos-refuse-participate-beneficiary-feedback-experiment
Keller, J. M., & Savedoff, D. (2016). Improving Development Policy through Impact Evaluations: Then, Now, and Next? Retrieved from https://www.cgdev.org/blog/improving-development-policy-through-impact-evaluations-then-now-and-next
Legovini, A., Di Maro, V., & Piza, C. (2015). Impact Evaluations Help Deliver Development Projects. World Bank Policy Research Working Paper 7157.
Lopez Avila, D. M. (2016). Implementing impact evaluations: trade-offs and compromises. Retrieved from http://blogs.3ieimpact.org/implementing-impact-evaluations-trade-offs-and-compromises/
MCC. (2012). MCC’s First Impact Evaluations: Farmer Training Activities in Five Countries. MCC Issue Brief.
Mckenzie. (2016). Book Review: Failing in the Field – Karlan and Appel on what we can learn from things going wrong. Retrieved from http://blogs.worldbank.org/impactevaluations/book-review-failing-field-karlan-and-appel-what-we-can-learn-things-going-wrong
Mckenzie, D. (2012). Misadventures in Photographing Impact. Retrieved from http://blogs.worldbank.org/impactevaluations/misadventures-in-photographing-impact
Mckenzie, D. (2016a). Lessons from some of my evaluation failures: Part 1 of ? Retrieved from http://blogs.worldbank.org/impactevaluations/lessons-some-my-evaluation-failures-part-1
31
Mckenzie, D. (2016b). Lessons from some of my evaluation failures: Part 2 of ? Retrieved from http://blogs.worldbank.org/impactevaluations/lessons-some-my-evaluation-failures-part-2
MEASURE evaluation. (2015). Evaluation FAQ: How Much Will an Impact Evaluation Cost? MEASURE evaluation brief November 2015.
Nutley, S. M., Walter, I., & Davies, H. T. O. (2008). Using Evidence: How Research Can Improve Public Services. Bristol: The Policy Press.
Peersman, G., Guijt, I., & Pasanen, T. (2015). Evaluability assessment for impact evaluation: guidance, checklist and decision support. A Methodslab Publication ODI.
Priedeman Skiles, M., Hattori, A., & Curtis, S. L. (2014). Impact Evaluations of Large-Scale Public Health Interventions Experiences from the Field. USAID: MEASURE International Working Paper 14-17.
Puri, J., & Rathinam, F. (2017). Grants for real-world impact evaluations: what are we learning? unpublished manuscript.
Samuels, F., & McPherson, F. (2010). Meeting the challenge of proving impact in Andhra Pradesh, India. Journal of Development Effectiveness, 2(4), 468-485.
Stickler, M. (2016). 5 Things USAID’s Land Office Has Learned about Impact Evaluations. Retrieved from https://blog.usaid.gov/2016/05/mythbusting-5-things-you-should-know-about-impact-evaluations-at-usaid/
Vaessen, J. (2017). Evaluability and why it is Important for Evaluators and Non-Evaluators. Retrieved from http://ieg.worldbankgroup.org/blog/evaluability
Vermeersch, C. (2012). “Oops! Did I just ruin this impact evaluation?” Top 5 of mistakes and how the new Impact Evaluation Toolkit can help. Retrieved from http://blogs.worldbank.org/impactevaluations/oops-did-i-just-ruin-this-impact-evaluation-top-5-of-mistakes-and-how-the-new-impact-evaluation-tool
Weiss, C. H. (1980). Knowledge Creep and Decision Accretion. Knowledge: Creation, Diffusion, Utilization, 1(3), 381-404.
White, H. (2014). Ten things that can go wrong with randomised controlled trials. Retrieved from http://blogs.3ieimpact.org/ten-things-that-can-go-wrong-with-randomised-controlled-trials/
Wood, B. (2015). Replication research promotes open discourse. Evidence Matters http://blogs.3ieimpact.org/replication-research-promotes-open-discourse/.
Wood, B. D. K., & Djimeu, E. (2014). Requiring fuel gauges: A pitch for justifying impact evaluation sample size assumptions. Evidence Matters http://blogs.3ieimpact.org/requiring-fuel-gauges-a-pitch-for-justifying-impact-evaluation-sample-size-assumptions/.