successful impact evaluations: lessons from dfid and 3ie · successful impact evaluations: lessons...

1

INCOMPLETE DRAFT FOR DISCUSSION ONLY; COMMENTS WELCOME

NOT FOR QUOTATION OR GENERAL CIRCULATION

Successful impact evaluations: lessons from DFID and 3ie

Edoardo Masset1, Francis Rathinam2, Megha Nath2, Marcella Vigneri1, Ben Wood2

1CEDIL, 2International Initiative for Impact Evaluation (3ie)

CEDIL Inception Paper

DRAFT

18/01/2018

Abstract

This paper explores the factors affecting the success of impact evaluations, drawing from the

experience of two major funders of evaluations: DFID and 3ie. We first define successful evaluations

using three key dimensions: credibility of the evidence, relevance, and policy influence. Based on a

review of the literature we build a conceptual framework including the main determinants of

success at the stages of design, planning, implementation and analysis. We review selected samples

of impact evaluations recently funded by 3ie and DFID, we identify successful ones and we discuss

the factors that are more likely to be associated with successful studies. We also discuss

unsuccessful cases of impact evaluations and their main causes, and we provide some guidance in

relation to discontinuing studies that are facing difficulties during implementation. We conclude

with a number of reflections that we hope will help researchers and funders to concentrate

resources and efforts where they are most needed for the success of impact evaluations.

3

1 Introduction The goal of the present paper is to identify factors that lead to successful impact evaluations.

Lessons from successes and failures can be used by researchers and funders to better design and

monitor impact evaluation studies. It is well known that impact evaluations often do not achieve

their intended aims, but how is success defined? The paper proposes a “theory of change” of

successful impact evaluations and identifies the potential factors determining success. We conduct a

portfolio review of impact evaluations recently funded by DFID and 3ie to identify successful cases

and their determinants. The ultimate goal of the paper is to provide researchers and funders with a

set of recommendations to improve the likelihood of success and to avoid failures at the various

stages of the evaluation process, including design, planning, implementation and analysis.

2 Background 2.1.1 Search of the literature A comprehensive search of the literature on impact evaluations failure and successes is beyond the

scope of this study. In addition, much of the discussion about institutional and researchers’

experiences in conducting impact evaluations is not academic and more likely to be found in blogs.

We therefore decided to limit our search of the literature to blogs, and to papers and reports

mentioned in blogs.

Reviewing blogs systematically present some challenges. Blog webpages are designed with different

levels of sophistication. They always include a search modality, which normally does not allow

filtering through Boolean operators. In some cases an implicit Boolean search is allowed by using

double-quotes: words can be nested together ("impact evaluation"), and additional words can be

added. Whenever double-quotes were possible, we downloaded blogs using the following search

terms: “impact evaluation” failure and “impact evaluation” success. In all other cases, where an

implicit Boolean search was not allowed, we simply searched for ‘impact evaluation’ and retrieved

all the blogs returned by the search.

We searched the following blogs of institutions and organisations working in impact evaluation:

- A tip a day by and for evaluators of the American Evaluation Association

(http://aea365.org/blog/tag/blog/)

- All impact evaluation posts by the GIveWell blog (https://blog.givewell.org/category/impact-

evaluation/)

- Sharing information to improve evaluation of BetterEvaluation

(http://www.betterevaluation.org/blog)

- What works: transforming development through evaluation of the Independent Evaluation

Group at the World Bank (http://ieg.worldbankgroup.org/blogs)

- Blog of Innovations for Poverty Action (https://www.poverty-action.org/blog)

- Evidence matters of the International Initiatives for Impact Evaluation

(http://blogs.3ieimpact.org/)

- Development impact of the World Bank

(https://blogs.worldbank.org/impactevaluations/blog)

- R&E search for evidence of the FHI360 (https://researchforevidence.fhi360.org/)

- Evaluate, the MEASURE evaluation blog (https://measureevaluation.wordpress.com/)

- Impact blogs of USAID (https://blog.usaid.gov/)

- Views from the centre of the Center for Global Development

(https://www.cgdev.org/global-development)

http://aea365.org/blog/tag/blog/

https://blog.givewell.org/category/impact-evaluation/

https://blog.givewell.org/category/impact-evaluation/

http://www.betterevaluation.org/blog

http://ieg.worldbankgroup.org/blogs

https://www.poverty-action.org/blog

http://blogs.3ieimpact.org/

https://blogs.worldbank.org/impactevaluations/blog

https://researchforevidence.fhi360.org/

https://measureevaluation.wordpress.com/

https://blog.usaid.gov/

https://www.cgdev.org/global-development

4

- Evidence for action of UNICEF (https://blogs.unicef.org/evidence-for-action/)

- Shaping policy for development of ODI (https://www.odi.org/comment)

We found more than 20 blogs discussing failures and successes in evaluating programmes and the

World Bank website built a web page (Failure) to discuss failure cases

(http://blogs.worldbank.org/impactevaluations/failure). The blogs often included references to

papers, reports and books that also became part of our review. We downloaded all relevant blogs

and documents quoted or linked to them to inform the following literature review and conceptual

framework.

2.2 Successful evaluations We begin by defining ‘success’ of impact evaluations. Since impact evaluations have different goals

and are conducted in very different ways, any indicator of success must be multidimensional. Based

on or reading of the retrieved literature and on the institutional experience of 3ie, we identified the

following dimensions of success:

• Reliable evidence: a successful IE must be answering the evaluation questions in a credible

and convincing way

• Relevance: a successful IE must provide evidence that is valuable and that can be used for

making policy decisions or to inform policies

• Policy impact: an impact evaluation is more successful if its recommendations are adopted

by decision-makers

• Value for money: an evaluation is more successful if achieves its results using fewer

resources in comparison to similar studies and if it achieves this in a timely manner without

incurring cost overruns.

For the purpose of our work, the term ‘reliable evidence’ refers to the causal estimation of the

effects of programmes, normally conducted using experimental and quasi-experimental methods.

The standards employed to assess the quality of evidence are object of debate and we adopt here

the standards normally used in systematic reviews such as those, for example, of the Campbell

collaboration or the International Initiative for Impact Evaluation.1

Relevance in our view is not judged by statistical significance alone, as is often the case in the impact

evaluation literature, and should depend on the size of differences and cost-effectiveness of the

interventions (White, 2014). An evaluation is relevant if offers a convincing explanation of the

mechanisms leading to the outcomes, if it considers the cost of the intervention in comparison to

available alternatives and if it assesses impact, to the extent this is possible, across a variety of

contexts and populations. Relevance can be therefore assessed by the presence of a solid theory of

change, cost-effectiveness analysis, and sub-group analysis and other heterogeneous effects.

Ultimately, relevance consists of the clarity and strength of the conclusions and policy

recommendations.

Policy impact refers to the use of an evaluation in policy-making. Several policy outcomes are

possible following an evaluation. The results of an evaluation can be used to continue or discontinue

a particular intervention, they can be used to modify and improve an intervention, they can be used

to scale-up a project or expand a project to other areas or countries, or they can inform global

debates on policies. The lessons from evaluations can emerge in the long term or in unexpected

1 See for example the criteria used by 3ie for inclusion in the 3ie Repository of Impact Evaluations, which includes all impact evaluations of developed interventions (4,260 as of November 2017) that have been published to date (http://www.3ieimpact.org/media/filer_public/2014/05/22/ier_screening_checklist.pdf)

https://blogs.unicef.org/evidence-for-action/

https://www.odi.org/comment

http://blogs.worldbank.org/impactevaluations/failure

5

areas and we do not limit the notion of policy impact to policy changes that are immediate

consequences of evaluation results.

Costs of evaluations also vary by context, implementation and study design. For example, MEASURE

Evaluation presents a range of project costs between $400.000 and $850.000, with a single study

costing over $3.5 million (MEASURE evaluation, 2015). 3ie estimated an average cost of $336.000 for

impact evaluations funded between 2009 and 2015 (Puri & Rathinam, 2017). Impact evaluations are

cost-effective in the long run as they improve project design and inform global portfolios (Stickler,

2016). This does not prevent us from valuing more highly an impact evaluation that achieves the

same results in the same context using fewer resources.

The literature also notes that impact evaluations are more successful if produce unforeseen positive

externalities to the funders, the project evaluated or the populations involved in the project. We

refer to spillover effects as unintended positive effect generated by an impact evaluation. For

example, impact evaluations can improve the efficiency of projects, or they can influence the

scaling-up of good projects and improve the quality of data systems (Lopez Avila, 2016). They can

also improve the skills of project managers and other stakeholders through transmission of

knowledge through trainings or informally (Briceno, Cuesta, & Attanasio, 2011) as well as creating

additional knowledge for beneficiaries. Legovini et al (2015) show that World Bank projects

conducted alongside impact evaluations are more likely to be successful because they tend to be

implemented as planned and therefore more likely to achieve their stated goals. The authors

speculatively attribute this effect to (a) better planning and evidence-base in project design, (b)

greater implementation capacity due to training and support by research team and field staff, (c)

better data for policy decisions, and (d) observer effects and motivation. In principle, there might

also be potential negative effects or impact evaluation that can negatively contribute to its success,

though their occurrence has not been reported in the literature. Spillover effects of evaluations are

certainly important but are not requisites of success and are difficult to track. We will consider them

in the data collection but not in the final analysis of project success.

2.3 A framework of analysis for impact evaluations One way of designing a theory of change for successful impact evaluation is by describing an ‘ideal’

impact evaluation. That is, an impact evaluation having all the ingredients, and in the right

combination, to be successful. What makes an evaluation successful, however, varies depending on

the context of implementations and the goals of the exercise. We therefore decided to opt for a

different approach. We first define the fundamental steps of an impact evaluation exercise and we

then identify the factors that may positively or negatively affect an impact evaluation study at each

stage. The stages considered follow the sequence: identification of evaluation questions, planning of

the evaluation study, implementation of the intervention and of the study, and analysis and

dissemination of the results. For each stage of the process, we surveyed the blog literature to

identify the main determinants of success and failure. Stages and determining factors are visually

represented in the chart in Figure 1.

2.3.1 Defining the evaluation questions: Where are ideas coming from? The first step in designing an impact evaluation is defining the evaluation questions that will be

investigated. But where are evaluation questions coming from? We define here three possible

sources of evaluation questions: innovations, current projects and accountability. In some cases,

(innovations), the questions originate in social and economic theory or from results or other

research and evaluation experiences in other countries or contexts. In these cases, “researchers with

theories of hypotheses already in hand set out to search for appropriate sites to run experiments

6

(Karlan & Appel, 2016a).” Innovations are therefore interventions that haven not been implemented

or tested before like those promoted, for example, by the NGO Innovation for Poverty Action. In

other cases (current projects), evaluators assist the implementers or the funders in assessing the

impact of existing interventions. This does not mean that researchers cannot test modifications of

the original interventions. Often the interventions are not fully spelled out and there is considerable

room for the researchers for tinkering and adjusting the interventions (Duflo, 2017). In other cases

(accountability), the evaluation is motivated by the need to hold project implementers and funding

agencies accountable and to ensure that funds are properly invested.

2.3.2 Planning stage Evaluability

Ideally, before the planning stage, an evaluability exercise should be conducted to assess the technical, political and practical feasibility of an evaluation. The OECD-DAC defines evaluability as “The extent to which an activity or project can be evaluated in a reliable and credible fashion”. It is argued that an evaluability assessment reduces the risk of irrelevant or invalid findings, thus preventing waste of resources on inappropriate evaluations (Davis, 2015). Such assessments should aim at answering three broad questions: a) is it plausible to expect impact?, b) would an impact evaluation be useful and used?, and c) is it feasible to assess or measure impact? Peersman et al. (2015) provide guidelines and a useful checklist to apply this type of assessment to impact evaluations. Based on the experience of five experimental evaluations in agriculture, MCC concluded that “To make efficient use of time and financial resources, impact evaluations should be focused where the learning potential is greatest, where rigorous evaluation (with a counterfactual) is feasible and where there is significant commitment of the various stakeholders to the evaluation (MCC, 2012).” Clearly all this is highly desirable but not easily achievable in practice.

Piloting the intervention

Many interventions fail because they cannot be implemented in the planned way. These problems

affect particularly interventions that have never been implemented before and that are being tested

for the first time (innovations). For example, a particular technology may not work in a given

context. Piloting an intervention (also called ‘formative evaluation’ or ‘field testing’) may overcome

this problem (White, 2014). In some cases, interventions are implemented in the wrong settings.

Many such examples are reported by Karlan (2016a) and summarised in the following way by

Mckenzie: “This includes doing projects in the wrong place (e.g. a malaria prevention program where

malaria is not such an issue), at the wrong time (e.g. when a project delay meant the Indian

monsoon season started, making roads impossible and making it more difficult for clients to raise

chickens), with a technically infeasible solution (e.g. trying to deliver multimedia financial literacy in

rural Peru using DVDs when loan officers weren’t able to find audio and video setups) (Mckenzie,

2016).”

Team’s skills

Teams need to have the right set of technical skills and the right combination of disciplinary approaches required by the study. “Hiring the wrong person(s) on the impact evaluation team, or not defining their responsibilities well” came second place in an internal survey of World Bank staff on the most common mistakes in conducting impact evaluations (Vermeersch, 2012). Team composition and experience are commonly used as screening and scoring criteria by agency commissioning impact evaluations. Clarity and quality of ToRs for the survey team are therefore also highly relevant.

7

Figure 1 Factors affecting ‘success’ of evaluations

Stakeholders’ engagement

This is by far the factor most commonly quoted in the literature and blogs. Project implementers, funders and users should be involved since the design stage of an evaluation. What follow is a series, perhaps a bit inordinate, of excerpts from blogs on this specific issue. “Implementer’s staff should be

8

involved to leverage the diverse expertise necessary for a good impact evaluation design (Stickler, 2016).” “Traditionally, evaluators insisted on independence to assure objectivity. But this separation often results in evaluators being brought in after a project is underway and when the opportunity to collect baseline data and test hypotheses is lost” (Keller & Savedoff, 2016). “Interests of stakeholders (researches, implementers and funders) need to be aligned, which may result in reaching compromises about what is evaluated and by what method or design (Lopez Avila, 2016).” For example, Ferguson (2017) discusses a USAID-funded intervention in Mozambique, centered on a combined social and economic intervention for girls at high risk of HIV infection. The study had to deal with a number of challenges such as the absence of a baseline, low take-up, ad hoc implementation etc. to which the researchers responded with a quasi-experimental design that stroke a balance between rigour and the need to answer the evaluation questions. Engagement with stakeholders (implementers and beneficiaries) at early stages of an intervention will allow identifying the right set of indicators for an evaluation (White, 2014). Sometimes evaluators have the buy-in of implementers but “lower-tier staff may have limited bandwidth and flexibility. A couple of examples come from programs which tried to use existing loan officers to deliver financial literacy training, only to find that many of them were not good teachers; and where bank tellers decided to ignore a script for informing customers of a new product because they felt it slowed them down from serving clients as quickly as possible (Mckenzie, 2016).” The evaluation scope should take into account the demands and concerns of programme managers (Briceno et al., 2011). The evaluation questions and the feasibility of the exercise including its technical characteristics should be discussed and jointly discussed with relevant stakeholders (Briceno et al., 2011). “Effective impact evaluations require close integration between implementers and evaluators, starting from the creation of the program logic and throughout implementation. Changes in program implementation can have significant effects on the evaluation methodology (MCC, 2012).” “Alignment between funders, implementers and evaluators is critical to a well-designed evaluation that complements the program (Priedeman Skiles, Hattori, & Curtis, 2014).” An intense collaboration between researchers and the implementing organisation is even more needed when conducting a randomised trial because the researchers become directly involved in project design and implementation. Glennerster (2017) discusses at length the required elements of a fruitful collaboration between evaluators and managers when implementing a randomised controlled trial. Transparency

An evaluation is more likely to follow best practices when activities are open to the scrutiny of the

public and of other researchers. Transparency may include the registration of the evaluation and the

publication of an analysis plan. Evaluation designs should be ideally be registered in order to avoid

subsequent cherry-picking of results (White, 2014). Some authors have designed a software to

declare the study design in computer code form. Given the code declaration, the software can

diagnose statistical power, bias, expected mean and external validity in advance (Blair, Cooper,

Coppock, & Humphreys, 2016). Wood (2015) discusses the benefits of transparency in the use of

datasets and in their analysis to the production and to the use of evaluation results.

2.3.3 Implementation Project implementation

The project may not be implemented in the way it was planned by the government or the NGO in

charge. Mckenzie recounts the problems faced during the evaluation of different mechanisms to

bring enterprises in the formal sector. In Brazil, new ipads to be used as prizes had to be returned to

the US for tax reasons. In Peru, firm’s distance to the tax office could not be used as an instrument

because tax collectors, in order to reach their quotas, would simply inspect the closest firms they

could find. In another example, a team spent $40,000 to send letters to financial institutions in

9

Mexico to offer financial education by participating in the study. The team was hoping to obtain

between 800-1200 responses out of the 40,000 letter sent, but in the end they obtained only 42

replies (D. Mckenzie, 2016a). Another World Bank team set out to evaluate a large rural roads

rehabilitation program in Rwanda that relies on high-frequency market information. Crowdsourcing

seemed like an ideal solution to generate high-frequency data that would collect timely market data.

Things however, went differently from expected: poor connectivity resulted in a low number of

contributors, data collected in an unstructured way were often inaccurate, and data collection

ended up being a poorer version of traditional enumeration as there was only one contributor per

market (Jones & Kondylis, 2016).

Project take-up

Take-up is often quoted as the single most common reason of project failure in the literature. People

may simply not be interested in the intervention and the intervention does not take place. Universal

take-up is unrealistic and every project loses out its intended beneficiaries through a ‘funnel of

attrition’ (White, 2014). For example, IPA partnered with Global Giving on a program that invited

clients of local organizations to provide feedback about the services received by sending SMS

messages to Global Giving. The client feedback would in turn be sent to potential donors and the

researchers were hoping to find whether feedback would change service provision and donors’

attitudes. A total of 37 organizations in Guatemala, Peru, and Ecuador were identified and

randomised in the study, but only 5 took part. Most organisations were concerned about the

negative effects of unfavourable feedback (Karlan & Appel, 2016b). Similar examples abound in the

literature and will be not reported here.

Contamination of the control group

Even the best designed evaluation may be spoiled by contamination of the control group, when the

benefits of the intervention go beyond the project area or when implementers decide to expand the

geographic coverage of the intervention. In some cases, contamination can be irreparable. For

example, a randomised evaluation of an HIV intervention in Andhra Pradesh was aborted after the

funder in collaboration with the State government decided to saturate the entire state with a nearly

identical and much larger intervention that wiped out the randomised control group (Samuels &

McPherson, 2010).

Selection biases

All sorts of biases may affect the results of the analysis. In conducting experiments, biases include

high attrition of study participants, noncompliance, placebo and Hawhtorne effects with inability to

blind. These type of issues are widely discussed in the evaluation literature. A non-technical

discussion of many of these problems, particularly in relation to randomised controlled trials, can be

found in Glennerster et al. (2013). Quasi-experimental studies are even more vulnerable to biases

and the inherent difficulty of controlling for unobserved differences between the comparison

groups.

Sample size and data quality

Few interventions are able to produce dramatic effects and sample sizes have to be sufficiently large

to detect the expected impacts. Wood and Djimeu (2014) discuss how rarely evaluators, particularly

economists, conduct and report the power calculations supporting the sample size selected for their

impact evaluation. As a result, evaluations often have sample sizes that have inadequate statistical

power to detect impact (White, 2014), particularly for outcomes that are marginally affected by

10

interventions and measured with large error such as, for example, incomes. The problem is

compounded if intra-cluster correlations are not taken into account in the power calculations. But

even if researchers get the sample size right many things can go wrong during data collection. Data

collection has to be closely monitored. “Issues in the design and (preparation of) data collection”

was by far the most represented category in a survey of common mistakes in impact evaluations

conducted at the World Bank. Problems included: forgetting individual identifiers, or not printing the

cover page of the questionnaire and missing information on the respondent (Vermeersch, 2012).

There is a vast literature on measurement error in traditional surveys and on enumeration biases,

but problems arise with new methods as well. For example, team of researchers set out to assess

the impact of an intervention on management practices by taking photographs of micro-enterprises

before and after the intervention. But, contrary to expectations, photographers did not stand in the

same spot each time, sunlight affected the quality of the pictures, not all information was captured

by the pictures and enumerators were inexperienced in the use of cameras (Mckenzie, 2012), and

the experiment failed.

Ethical issues

Research involving human subjects requires meeting a minimum set of ethical standards in order to

be acceptable and ultimately to facilitate the implementation of the study. This may require

approval from an Institutional Review Board, obtainment of informed consent, protection of

confidentiality of the information and any other measure that prevents the researchers from

committing any harm (Glennerster, 2017).

Political economy

Local governments and implementers can be or can become opposed to evaluations or to particular

aspects of their design. Campos et al. (2012) provide a number of examples of failed experiments in

which Government officers were openly opposed to randomised evaluations of a series of subsidies

programmes for private firms. The opposition was sometimes accentuated by turn-over of

government staff, which caused severe delays.

Ability to adapt

Rarely are evaluations going according to pre-defined plans. Changes in circumstances often require sudden changes and adjustments in evaluation design or project implementation. In addition, researchers should prepare contingency plans, particularly when conducting randomised trials (Briceno et al., 2011). “Creative adaptation of the evaluation design to fit operations is the norm not the outlier (Priedeman Skiles et al., 2014).“ A review of 10 evaluations of health interventions by MEASURE found that each case faced design and implementation challenges that required creative solutions to move forward. These challenges included the identification and selection of program beneficiaries, random assignment in complex environments, identification of a robust comparison or control group for estimating the counterfactual, heterogeneity of program impacts, timing of baseline data collection, and absence of baseline data and of a counterfactual (Priedeman Skiles et al., 2014). Puri et al (2017), based on examples from impact evaluations funded by 3ie, offer a discussion of how the success of the studies depend on their ability to adapt to unforeseen implementation circumstances.

2.3.4 Analysis Technical quality

An impact evaluation must provide the best possible estimates of project impact given the

circumstances of implementation. What makes a technically high-quality evaluation can be object of

11

debate but there are common standards, such as, for example, risk of bias tools and assessments

used in systematic reviews, and there are quality markers, like publications in peer-reviewed

journals.

Dissemination

The results of a study have to be widely disseminated in order to reach the desired audiences. Briceno et al. (2011) summarise in the following way: “Evaluation findings should be reported in time and should be used for accountability purposes. If the motivating questions were clear, and the evaluation’s reports incorporated answers to these questions, the program’s managers can use the findings to discuss recommendations and to support changes in the program, if necessary. The ideal evaluation process does not end with a published document; an ideal evaluation should document the recommendations’ discussions, the response from program managers and the consequent decisions or improvement aspects resulting from the evaluation.” The literature on communicating evidence and influencing policy is huge, and will not be reported here.

2.3.5 Cross-cutting themes A number of factors affect the success of evaluations regardless of the stage the implementation. We highlight in particular, shock, contexts and quality assurance. Shocks and external events

Project implementation and evaluations can be derailed by unexpected shocks. For example a World

Bank team conducted a survey of 2,500 households and 2,500 enterprises in rural Egypt to assess

the impact of measure to expand the access to finance but the instability in the wake of the Egyptian

revolution meant microfinance organizations didn’t want to expand to poor areas, and the

intervention never took place (D. Mckenzie, 2016b). A risk assessment at the planning stage and the

formulation of contingency plans might be helpful particularly when researchers set out to work in

areas prone to conflict or other disasters.

Implementation context

The complexity of interventions, consisting on multiple activities and projects at multiple levels

(countries, regions etc.) and multiple stakeholders groups, is perceived as a challenge to the design

of causal pathways for an evaluation (Vaessen, 2017) and for the identification of a control group

(Cuong, 2014). This is often addressed by focusing on a narrow component of the programme rather

than on the programme itself, thus offering little guidance in terms of scaling up the whole

intervention (Lopez Avila, 2016). Successes and failures vary with the characteristics of the area in

which the project is implemented, with the type of intervention and with the characteristics of the

implementer. This suggests that study type, implementer, sector of intervention and geographic

area of intervention should be also taken into account.

Quality assurance and PRGs

Progress reviews along the study and peer-review of the results at various stages ensure the quality

of the design, implementation and analysis. The work of evaluators should be supervised through

advisory panels, peer review groups, and other reference groups in order to ensure the technical

quality of the evaluation (Briceno et al., 2011).

12

3 Methodology The remaining of the paper is dedicated to assess the relevance of the determinants of success of

impact evaluations that were identified by the literature. In this exercise we will use a sample of

studies funded by two major funders of impact evaluations: DFID and 3ie. Here, we briefly describe

the selection of the sample of studies and the methods to identify success and its determinants.

3.1 Selection of DFID studies DFID publishes every year a collection of commissioned evaluation studies in international development. The entire body of this work spans a twenty-year period from 1997 to 2017, and includes independent reports categorized by sector and country, and working papers. More recently, the collection has also included a short document with DFID’s management response. For this exercise, we decided to selectively look at the population of impact evaluations supported by DFID since 2012, which includes in total 61 studies. Only 12, out of the original 61 evaluations, were included in our study as the other evaluations did not meet our selection criteria (see Figure 2 for an illustration of the selection process). We employed two selection criteria. The first requires that the study evaluates an intervention aimed at improving living standards rather than a policy or financial support. Following 3ie guidelines for the 3ie database of impact evaluations, we excluded evaluations “studies which only investigate natural or market-based occurrences, or that report on the findings of controlled laboratory experiments with no discernible development intervention.” This led to the exclusion of 20 studies including, for example, the Evaluation of DFID Online Research Portals and Repositories, Evaluation of Budget Support to Sierra Leone 2002 – 2015, Evaluation of the DFID-AFDB Technical Cooperation Agreement (TCA), and Evaluation of the Consultative Group to Assist to the poor - Phase IV Mid-Term Report. The second selection criterion is the use of a counterfactual method: what would happen in the absence of the intervention. In keeping with the guidelines used by the 3ie database of impact evaluations, studies not employing a counterfactual methodology such as RCT, RDD, PSM, IV, and DD were excluded. This led to the exclusion of further 27 evaluations. The excluded evaluations included 7 desk-based reviews, 5 process evaluations, 4 ‘theory-based’ evaluations, 4 value for money cost analyses, 3 qualitative studies, 2 formative evaluations and 1 rapid health facility assessment. Figure 2 Selection of DFID impact evaluations

Only about 20% of all evaluations consist of impact evaluations employing a counterfactual

methodology and only 2 impact evaluations are randomised controlled trials. Most impact

Evaluations

(61)

non-interventions

(21)

interventions

(40)

impact evaluations

(12)

Other evaluations

(28)

13

evaluations are quasi-experimental (11), often consisting of difference-in-difference studies

reinforced by matching methods or, in some cases, by simple project-control comparisons at one

point in time after the intervention (see Table 1).

Table 1 DFID evaluations by study design

Year evaluations Evaluations of interventions

Impact evaluations

RCTs Quasi-experimental

2012 7 4 1 1 0 2013 20 13 2 1 2 2014 7 5 2 0 2 2015 5 3 1 0 1 2016 18 11 3 0 3 2017 4 4 3 0 3 TOTAL 61 40 12 2 11

We disaggregated the evaluations and the impact evaluations using the World Bank sectoral

classification of interventions (Table 2). The bulk of impact evaluations were commissioned in the

agriculture sector, health and population, and education. This is only partially a reflection of the

distribution of all commissioned evaluations. Some sectors like climate change and conflict

management do not have a single impact evaluation while most evaluation of public sector

management are not impact evaluations.

Table 2 DFID evaluations by sector of intervention

Sector of intervention All evaluations Impact evaluations

Agriculture and rural development 7 3 Climate change and environment 3 0 Conflict management and post-conflict reconstruction 4 0 Cross-sectoral 16 2 Economic policy 0 0 Education 3 1 Energy 0 0 Finance 0 0 Health nutrition and population 9 3 Humanitarian 1 0 Information and communications technology 0 0 Private sector development 3 1 Public sector management 12 1 Social protection 1 1 Transportation 0 0 Urban development 0 0 Water, sanitation and hygiene 2 0

All impact evaluations were conducted in South Asia and Sub-Saharan Africa. It appears that a large

number of evaluations conducted in South Asia are impact evaluation, which is not the case in Sub-

Saharan Africa (Table 3).

14

Table 3 DFDI evaluations by geographic region

Region All evaluations Impact evaluations

East Asia & Pacific 1 0 Europe & Central Asia 2 0 Latin America & Caribbean 0 0 Middle East & North Africa 3 0 North America 0 0 South Asia 7 4 Sub-Saharan Africa 30 7 Global 18 1

3.2 Selection of 3ie studies As of end 2017, the number 3ie funded impact evaluations (IE) closed and posted on 3ie website was

112. However, for the analysis of determination of successful IEs, we sampled approximately one-

third of the full list, i.e. 35 studies. To ensure better representation of the original studies, we sorted

all the studies by sector of evaluation (thematic areas), geography, 3ie funding modality and final

report quality. Every third study in the list was marked for inclusion. Some sectors such as

agriculture, health, education and social protection were over-represented, while other sectors such

as finance, private sector development, peace building, urban development, etc. were under-

represented. To get a better representation of all sectors, we have included all the sectors that had

only one study irrespective of its placement in the list. On the other hand, the maximum number of

studies from the over-represented sectors were limited to five. The final sample includes 35 studies.

This is a fair representation of all thematic areas, geography and the quality of studies in the full list

of all closed 3ie grants. See Figure 3 for the distribution of the selected 3ie studies by sector.

Figure 3 Selected 3ie impact evaluations by sector

3ie funding modalities include open windows, thematic windows and policy windows. 3ie’s Open

Window modality is a grant platform for funding impact evaluations in low- and middle-income

countries (L&MIC) irrespective of funding size request or subject area (Puri et.al. 2018). There were

15

four rounds of open window funding (OW1 to OW4) during 2009 – 2012. This unrestricted funding

modality has now been discontinued. 3ie’s other funding modalities placed restrictions such as

specific themes (thematic windows) and or country and policy relevance. The aim of the thematic

windows is to increase the body of rigorous evidence in a single sector where there are fewer

evaluations with the aim of maximizing policy relevance and impact. The policy window supports

evaluation of government programmes implemented and deemed important by 3ie developing

country members. Since most of closed grants now are from the first four open windows, about 88

per cent of the sampled studies are from open windows and the remaining from thematic and policy

windows (see Figure 4).

Figure 4 Selected 3ie evaluations by thematic window

3ie proposal review of open windows ranked all the proposals by their technical quality and their

geographical distribution (Puri et.al. 2018). This process ensured that while highest quality proposals

were selected, there was also a good geographical distribution. Studies from Africa and South Asia

account for more than half of all the studies in our analysis (see Figure 5).

Figure 5 Selected 3ie evaluations by geographic area and study design

3.3 Methods The selected 3ie and DFID studies were coded along a number of variables representing the

determining factors identified in the conceptual framework and listed in Table 4. Some of the factors

were coded numerically like, for example, whether the study was reviewed by an IRB or the size of

the sample. Other factors are qualitative judgments of the study like, for example, the clarity of the

conclusions or their policy relevance. It should be noted however that even numerical variable, such

as, for example, whether the theory of change is well defined, contain an element of qualitative

judgment. Note that while the information below could be collected for most 3ie evaluations, only

few of the indicated variables could be extracted from the DFID evaluation reports and the related

management response documents available.

OW1

OW2

OW3

OW3

OW4

PW2

TW1TW4 23%

49%

9%

3%

6%

3%

6%3%

Distribution of studies by Grant Window

Central AsiaEast Asia Pacific

Latin America and Caribbean

South Asia

Sub Saharan Africa

3%

9%

20%

26%

43%

Geographical Distribution of closed studies

Quasi experimental

RCT

RCT, Quasi experiemental

29%

46%

26%

Distribution of studies over Study Design

16

Table 4 List of determinants of success and their proxy indicators

Stage Factor Metric

Defining questions Theories, project, accountability

Qualitative assessment of the statement of the problem the project and the evaluation are intending to solve. A well-defined theory of change. Are systematic reviews quoted in reference to the problem statement?

Planning Transparency Study registration A Pre analysis plan is available

Stakeholders engagement Meetings and workshops with implementers Training events Endorsement by the implementers

Team composition Skill set of evaluators

Evaluability Evaluability assessment conducted

Pilot Was the intervention piloted/formative evaluation?

Implementation Project implementation Implementation issues Implementation delays Shocks affecting implementation

Take-up Participation rates

Biases Attrition rates

Contamination Contamination reported

Data collection Sample size

Ethical issues IRB approval

Political economy No data available

Adaptation Letters of variation Modification to the design

Results Analysis Mixed-methods Sub-group analysis Intermediate outcomes

Dissemination Policy influence plans Policy briefs Presentations and conferences Media Publications

Success Evidence Clarity of conclusion with respect to evaluation questions Reporting of effect size of the intervention

Relevance The study provides results that can be effectively used for policy: this is based on a qualitative assessment of the conclusions of the study

Policy impact Assessed by our reading of the management response (in the case of DFID) studies and by 3ie own appraisal (in the case of 3ie studies)

Value for money Budget size

Cross-cutting factors

Quality assurance Peer review groups Advisory panel

Context Country and region Sector of intervention (World Bank classification)

4 DFID impact evaluations

4.1 Successful DFID evaluations We assessed the success of the evaluations by three factors: credibility, relevance and policy impact

(Figure 6 illustrates). We do not consider value for money because we have no information on the

budgets assigned to the evaluation and we abstract for the moment from eventual spill-over effects

that are difficult to extract from the documents available. We looked at credibility, relevance and

policy impact sequentially as if each mattered only conditionally to the previous one. Different

pathways are difficult to imagine. It is hard to think of unreliable evidence as been relevant, while

17

irrelevant evidence should not have impact by definition. It is possible for unreliable evidence to

have a policy impact, but this is unlikely to occur in this context.2 In what follows, the reliability of

evidence is mostly based on our qualitative judgement. Whether a study is relevant or not, is based

on our reading of the conclusions drawn by the researchers and the management response. The

policy impact is entirely understood through our reading of the management response. The figure

below illustrates the sequential classification of the studies.

Figure 6 Reliability, relevance and policy impact (DFID evaluations)

Three of the selected evaluations did not provide reliable evidence and were no further considered.

One study assessed the impact of a capacity building intervention in several low income countries.

The evaluation included a difference-in-difference analysis of the impact of the intervention on

knowledge and skills of health care workers. However, the sampling methodology as well as the

analysis were rather inaccurate. The quantitative assessment was only a minor component of the

overall evaluation and there is no sign that it had a major impact on the conclusions. The authors of

the study did not draw strong conclusion from it and management did not seem to take any relevant

lessons from this particular aspect of the evaluation. The evaluation of a result-based education

project in Ethiopia relied on an interrupted time-series design. The intervention had been

implemented at the national level and no other design option was available. However, there were

delays in project implementation and the data were not fully representative. This was acknowledged

by the researchers, while the management response stressed that in this case lack of evidence of

impact did not imply lack of impact. The evaluation of a livelihood programme found significant

impacts of the intervention on poverty reduction. However, the management response exposed the

fact that external reviewers found large discrepancies between the reported income data and data

collected by other agencies, which called in to question the credibility of the entire survey. As a

result, management assigned the final impact evaluation to another provider altogether.

Of the nine evaluations providing credible evidence, three turned out not to be relevant. An impact

evaluation found a number of positive impacts of a distribution system of ORS packages through

street vendors of soft drinks. However, all the evidence related to availability, access and use of the

ORS packages and no evidence was presented on the effectiveness of the intervention. The results of

2 This view is consistent with Gueron and Rolston assessment of the unusual impact of the evaluation studies of welfare programmes in the US in the 80s. The identify five factors: credibility, the finding themselves, their timeliness, the marketing of the results and the political environment (2013). Marketing and policy environment are contextual factors enabling policy impact, while the ‘finding themselves’ and ‘timeliness’ is what we have described as ‘relevance’ of the findings.

reliable evidence

no

(3)

relevance

(8)

no

(3)

polcy impact

(5)

no

(1)

yes

(4)

18

the study appear to be too contextualised and limited to intermediate outcomes to be relevant. No

management response was available for this study.

Table 5 Credibility, relevance and policy impact of DFID evaluations

Study Impact of intervention

credibility relevance Policy impact

Delivering Reproductive Health Results though non-state providers in Pakistan (DRHR)

No impact found

Yes Yes Yes

Evaluation of the Developing Operational Research Capacity in the Health Sector Project (ORCHSP)

Positive impact

No - -

Evaluation of the Pilot Project of Results-Based Aid in the Education Sector in Ethiopia (RBAESE)

No impact found

No - -

Social and Economic Impacts of Tuungane (TUNGANE)

No impact found

Yes Yes No

COLALIFE operational trial zambia Improving use, access, availability and awareness of ORS and zinc for the treatment of diarrhoea in the home (COLALIFE)

Impact on intermediate

outcomes Yes No -

Independent Impact Assessment of the Chars Livelihoods Programme – Phase 1 (CHARS1)

Positive impact

NO - -

Independent Impact Assessment of the Chars Livelihoods Programme – Phase 2 (CHARS2)

Positive impact

Yes Yes Yes

Evaluation of Uganda Social Assistance Grants for Empowerment (SAGE)


outcomes Yes No -

Adolescent Girls Empowerment Programme, Zambia (AGEP)

Mixed results Yes No -

Impact evaluation of the DFID Programme to accelerate improved nutrition for the extreme poor in Bangladesh: Final Report (IEAINB)

No impact found

Yes Yes Yes

Independent Evaluation of the Security Sector Accountability and Police Reform Programme (SSAPR)


outcomes Yes Yes Yes

The evaluation of an empowerment programme in Uganda found a number of impacts on

intermediate outcomes. However, the sample used was scarcely representative. There were delays

in the project implementation and in the evaluation, and the project underwent a number of

changes during the evaluation. The management response stated that by the time the

recommendation became available, they were no longer relevant or had already been implemented

by the project. The evaluation of another empowerment programme in Zambia found inconclusive

evidence or mixed results. However, there were delays in project implementation so that the

planned impact evaluation became an ‘interim’ evaluation of the ‘direction of travel’ rather than the

expected final assessment, thus informing a future study but not project implementation.

The evaluations that produced credible and relevant results had an impact on policy, with possibly

one exception. The evaluation of a health intervention through non-state providers in Pakistan

delivered the ‘surprising’ result that the intervention had no impact on access to health care, which

led to a re-examination of the whole intervention. The evaluation of a livelihood programme in

Bangladesh encouraged the management to continue the intervention and to introduce changes for

improving success. The ‘sobering’ results produced by the evaluation of a programme to accelerate

improved nutrition for the extreme poor in Bangladesh led to a reconsideration of the DFID support

strategy to nutrition interventions in Bangladesh. The evaluation of a policy reform programme in

Zambia found a number of positive impacts on knowledge, attitudes and practices around policing

that management accepted and committed to use in future programmes. The impact evaluation of a

community driven reconstruction programme in the DRC is a possible exception. The evaluation

19

found no impacts on behaviours and outcomes, which, according to the evaluators, called into

question the appropriateness of the whole project model. However, the management concluded

that positive impacts of the intervention outweighed the absence of impact on behavioural change

and therefore opted for a continuation, albeit modified, of the programme.

It is also interesting to note that only one study found a positive impact of the intervention. There

were other two studies finding a positive impact but the evidence provided was not considered

credible. Other three studies found a positive impact but only on intermediate outcomes. All the

remaining studies found no impact of the interventions on final outcomes. It is also interesting to

note that all studies are assessing impacts on a plurality of outcomes and sometimes for several sub-

groups in the population. We found no obvious attempt to cherry-picking of results or selective

reporting, thought the risk exists. More importantly, evaluators and management often found

difficult to make sense of the large number of effects observed on several dimension and the results

are difficult to interpret in a coherent way.

4.2 Determinants of success of DFID evaluations Four of the 12 studies were pilot interventions and the evaluations aimed at assessing their

effectiveness for their potential scale-up. The remaining evaluations were assessing existing

programmes with the intent of improving or scaling-up the interventions or for accountability

purposes. Perhaps a bit surprisingly only 7 studies, just above half the sample, included a good

theory of change of the intervention and only 3 of the studies quoted relevant systematic reviews of

similar interventions. The absence of a well-defined theory of change and of references to

systematic reviews is not helping the definition of good evaluation questions.

Success at planning stage depends on factors such as: the skill set of the evaluators, transparency of

the evaluation, stakeholders’ engagement, and the implementation of pilot or formative studies

before starting the full evaluations. Eight of the study teams appeared to have the required skill set

while the judgment is somewhat suspended for the remaining five. None of the study reports an

evaluability assessment, a pilot or a formative evaluation, and nothing is known about engagement

of stakeholders. It is difficult to say to what extent these activities did not take place or were simply

not reported.

Several projects (6) faced difficulties at the implementation stage which resulted in delays: the RBA

pilot was not well communicated to the regions in time to appreciably affect students’ performance;

in the Tungane evaluation political tensions led to loss of one province and some regions were

inaccessible for safety reasons; in the CHARS evaluation, the initial implementation of the

monitoring database system was relatively crude, especially in the database structures and the

handling of informant identification, also cohorts were mixed within villages; in the adolescent

empowerment programme in Zambia, the voucher scheme took almost two years to agree with

government; in the integrated nutrition programme in Bangladesh procurement and distribution of

micronutrients were considerably delayed; the Security Sector Accountability and Police Reform

Programme did not state its short- and long-term goals until quite late in the implementation period,

and thus there was little time to work collaboratively. Only one of the studies, the Bangladeshi

nutrition project, was affected by a major shock. Attrition above 20% of the baseline sample was

reported as a problem in 3 studies. It is unfortunate that none of the studies report data on take-up

of the interventions so that it is impossible to reflect on this important aspect of the interventions.

Sample sizes are in the order of thousands of observations in most cases but two evaluations

collected information from less than 300 observations.

20

All studies, with 3 exceptions, were mixed method evaluations that included a qualitative

component in the form of interviews, focus group discussions, process evaluations or contribution

analysis. All studies, with 2 exceptions, conducted some sub-group analysis mostly by gender and

age of project participants. None of the studies, with the exception of the Bangladeshi nutrition

study, appear to have analysed intermediate outcomes of the interventions. Nothing can be said

about the dissemination of the results because the information cannot be evicted by the documents

available.

5 3ie impact evaluations

5.1 Successful 3ie studies Assessing the success of 3ie impact evaluations by the criteria of credibility, relevance and policy

impact is slightly different in comparison to the DFID studies (see Figure for an illustration of the

assignment process). Most 3ie studies are set up to be highly credible. Studies are selected among a

large number of submitted proposal using highly selective criteria applied by internal and externa

reviewers. In addition, the evaluations are scrutinised and monitored from design to final publication

in order to avoid errors in design, implementation and analysis. In addition, all 3ie studies are

relevant, to some extent, because relevance is one of the selection criteria adopted to make funding

decisions. This implies that both credibility and relevance of 3ie evaluations tend to be very high. Of

course credibility and relevance at the design stage does not preclude the possibility that these end

up being compromised at the implementation or analysis stage.

Figure 7 Reliability, relevance and policy impact (3ie evaluations)

Only two evaluations, out of the selected 35, produced non credible evidence (see Figure 7). The first

is an evaluation of the impact of a micro irrigation innovation on household income, health and

nutrition. The study had enormous limitations relating to the comparability of three cohorts for

assessing impact and attrition was not properly addressed. The second was a study assessing the

effect of network structure on the diffusion of health information. The study was designed in such a

way that the desired estimation of the causal impact was impossible.

We found the results of five impact evaluation as not being relevant. Relevance can take different

meanings and there is some degree of subjectivity in the definition. 3ie has its own definition of

relevance and recommends its use in the review of grant applications. The definition has changed

over the years but in general it requires that a study should have either (i) substantial scale or

expected effect size (i.e., affecting a large number of people, implying large resource outlays, or

having large effects on a sub-population), or (ii) being innovative intervention, with the potential to

reliable evidence

no

(2)

relevance

(33)

no

(5)

policy impact

(28)

no

(9)

yes

(19)

21

be scaled up. Alternatively, the study should be filling a gap of knowledge in the specific context and

country and informing policy. All studies funded by 3ie are therefore ‘relevant’ by this definition. The

relevance discussed in what follows refers more to their ability achieve these goals.

We classified the following evaluations as not being relevant. A study of an intervention promoting

savings among poor households in Sri Lanka was unable to provide policy recommendations because

affected by low take up and delays.3 A study of an intervention aiming at increasing long term

savings with the design and the presentation of a pension product was affected by implementation

problems, but more importantly did not produce the expected results and turned out to be of no

policy use. A study of an intervention promoting the use of cook stoves to prevent indoor pollution

in Northern Ghana faced so many implementation problems that the results were few and difficult

to interpret. A study promoting male circumcision in Malawi produced very limited results because

of very low take up among the beneficiary population. Similarly, a study of an intervention offering

weather index based insurance to poor farmers in two Indian states failed because no farmer was

interested in the intervention.

From this survey of problematic studies, a pattern seems to emerge in relation to the causes of the

lack of relevance. First, some studies (like the cook stove and the mobile banking evaluations) were

affected by factors totally outside the control of the researchers or by factors not well understood at

the design stage of the evaluation. The evaluation designs of these studies could have been

improved through a better engagement with relevant stakeholders and with an evaluability exercise

prior to the evaluation to assess potential risks. Second, a good number of studies was affected by

implementation delays and low take up rates. Low take-up is an universal problem in project

implementation and evaluation. More recently 3ie, to prevent this type of problems, has been

promoting pilots or formative studies with the goal of testing the feasibility and acceptability of

interventions before introducing large scale experiments. Finally, some interventions were testing

totally new interventions which ended up being ineffective. Some studies are designed in such a way

to be scarcely relevant when successful, and entirely useless when unsuccessful (see box on

Publication is not relevance).

Box Publication is not relevance

The publication of a study in a peer-reviewed journal is sometimes considered an indicator of its

quality. However, published studies are often credible and interesting but not necessarily relevant

for policy. This is particularly true for a recent wave of evaluation studies aimed at testing the impact

of incentives on people’s behaviour. Some impact evaluations of this type are designed to test

research ideas. But where do ideas come from? In the words of Karlan and Appel of Innovations for

Poverty Action (Karlan & Appel, 2016a), ideas are “perhaps extensions or continuations of an existing

theory, or inspired by the results of other research or from experiences in neighbouring countries or

countries far away.” Ideas often come in the form of hypotheses about different types of incentives

influencing behaviours (the “nudge”). For example, in a famous study mothers were offered two

pounds of dal (dried beans) and a set of stainless steel plates if vaccinating their children. The

incentive had a tremendous impact on vaccination rates (a sevenfold increase to 38%), and the

authors (Banerjee & Duflo, 2011) judged this programme “to be the most effective we ever

evaluated, and probably the one that saved the most lives.” One problem with this type of studies is

3 Shortly after the demonstrations began, a glitch was discovered in the IS platform on which the product was built. The mobile operator’s account was debited each time a deposit was made and the bank suspended use of the system. There were delays in Signing of MOU between Banks and Mobile services. Finally, a credit card fraud made the bank more cautious about the access given to the software company which caused some further delays in the launch of the product

22

that if successful, they can hardly be replicated, while if unsuccessful, they are virtually of no use. If

the study is successful, it certainly shows that a particular incentive has a specific effect in a given

population. But what do we know about the impact of other types of incentives and what do we

know about their influence in different populations? Dal is not widely popular outside India. On the

other hand, if the study is unsuccessful it provides no other recommendation apart from confirming

that a particular type of incentive does not work. In the word of Karlan and Appel (2016a), this is the

case of “idea failure” versus “research failure”, the latter being caused by faulty design plans or

unforeseen shocks. Journals are keen to publish this type of studies because they are innovative and

sometimes defy existing theories, but the policy relevance of these studies is sometimes

questionable. 3ie has funded a number of studies that were published in peer-reviewed journals

despite having little or no practical relevance by the standards adopted by 3ie internal and external

reviewers. For example, a published study of an intervention distributing female condoms through

hairdressers in Zambia found that the intervention was more effective when hairdressers were given

social incentives than when given monetary incentives. However, the amount of condoms

distributed was so small that the intervention had no practical relevance in fighting HIV. In another

published study, researchers tested slum upgrading in a number of Latin American countries and

found high levels of satisfaction among families given a new house. However, about half of the

project beneficiaries reported no improvement in their living conditions, which calls into question

the validity of the evaluation or of the project. In yet another published study, researchers tested

the impact on farmers’ productivity of providing contact farmers with training, social incentives (T-

shirts) and material incentives (raincoats) in rural Mozambique. However, the authors were not able

to conduct a baseline, the study was affected by contamination of the control group, and social and

material incentives were not delivered as planned. Yet, in this latter case as in the previous ones, the

evaluation was published by a peer-reviewed journals despite being deemed hardly credible and

practically irrelevant by 3ie reviewers. To summarise, reviewers of academic papers and reviewers of

impact evaluations follow different criteria and the policy relevance of interventions does not figure

highly among the criteria used by academic reviewers.

The policy impact of 3ie studies is difficult to assess because the evaluation funded by 3ie are rarely

directly commissioned by a funder or an implementer to answer a precise question. 3ie was

established precisely with the goal of supporting impact evaluations that would have not otherwise

been funded. It should be no surprise therefore that some of 3ie evaluations are not always

designed to influence decisions directly. Many 3ie evaluations are ‘public goods’ that provide the

international community with invaluable hard evidence to guide policy. This does not mean that the

studies do not have a policy impact or that the evidence produced it is not valuable. Evaluations may

affect the way people frame issues and can provide explanatory causal ideas, becoming part of the

stock of knowledge used by decision makers (Weiss, 1980). Evaluations can provide ideas and

models which shapes our understanding of how the world works. Evaluation can have an

enlightenment function by slowly percolating to the public and policy makers and influencing actions

in subtle and unexpected ways (Nutley, Walter, & Davies, 2008). Of the 28 3ie studies that were

identified as relevant and credible, only six were considered “changing policy or programme design”,

one was influential in “taking a successful programme to scale.” On the other hand, seven studies

“Informed discussions of policies and programmes”, 4 studies improved “the culture of evaluation

evidence use and strengthen the enabling environment”, while the remaining 9 studies appeared to

have no policy impact.

It is also interesting to note that of the 35 studies considered, including the two studies whose

evidence is not credible, less than half (14) found a clear impact of the interventions. Elven studies

found no impact at all, while nine studies found ‘some’ impact on some of the outcomes observed,

23

usually intermediate outcomes rather than final outcomes. The reason for many studies finding

‘some’ impact is that studies are often testing the impact of the intervention on several outcomes at

the same time, some of which are final while other are intermediate. In the absence of a pre-analysis

plan that set out the hypotheses to be tested in advance, it is difficult to interpret the results of

these evaluations as positive results. Lacking a pre-specified set of hypotheses to test, the results

may be reported selectively by the researchers in a cherry-picking exercise led by pursuit of

statistical significance. Of the nine studies reporting ‘some’ impact, only one had prepared a pre-

analysis plan at the design stage and the results produced by these studies are difficult to interpret.

5.2 Determinants of success of 3ie studies In our review of the literature we identified three main reasons for conducting impact evaluations:

innovations, evaluations of existing programmes and accountability. Classifying each study in these

categories is not a simple task and it requires some degree of subjectivity. However, our reading of

the project documents suggests that only 2 evaluations were designed with the goal of assessing

whether the programmes were achieving their objectives: the evaluation of a productive safety net

programme in Ethiopia and the evaluation of a social transfers scheme in Malawi. In several cases,

the goal of the evaluations was to learn about the impact of existing programmes with the goal of

improving the projects in some ways. At least one half of the remaining study consisted of what we

defined innovations, that is interventions that had not been implemented before such as, for

example, the introduction of new cook stoves, mobile banking, chlorine dispensers, male

circumcision, new pension and saving products and the like. Implementation of this type of studies

present some risks related to the implementation and acceptability in the population in comparison

to established and running programmes that can affect the final success of the studies.

On average, characteristics of evaluations funded by 3ie are more aligned with best practices than

characteristics of evaluations funded by DFID because of the strict selection criteria employed and

because of the constant monitoring of the evaluation studies from design to dissemination of

findings. Table 6 illustrates these characteristics. We first discuss briefly the average characteristics

of 3ie evaluation before comparing the differences in determining characteristics of ‘successful’ and

‘unsuccessful’ evaluations.

Despite efforts by 3ie staff and reviewers, only a quarter of evaluation designs includes a well-

defined theory of change. About a third of evaluation designs quote or reference a systematic

review on the topic thus displaying a good knowledge of the topic and an interest in accumulating

evidence by the production of new evidence. Team’s skills are rated by 3ie on a scale from 1 to 4.

This is one of the selection criteria for funding and, not surprisingly, is quite high among funded

evaluations. Only 2 studies had an evaluability assessment prior to an evaluation design.

Surprisingly, only slightly above 50% of the studies were approved by an institutional ethical board.

Even considering that several studies are using secondary data this implies that some field research

is conducted without ethical approval. The proportion of studies that are registered and that have a

pre-analysis plan is very small (less than 20%). More than three quarters of the evaluations have a

peer-review group or an advisory panel, which reflects the high importance given by 3ie to the

quality assurance of funded evaluations. 3ie also actively promotes the engagement of research

teams with project managers and policy makers, so that implementers’ endorsement is universal

and the fraction of evaluation conducting meetings with the implementers is very high.

24

Table 6 Determinants of success of 3ie studies

Determinant Successful Unsuccessful All

Defining questions Well-designed theory of change 0.17 0.29 0.23 Systematic review quoted 0.17 0.53 0.34 Planning Team’s skills 3.17 2.65 2.91 Evaluability assessment 0.00 0.12 0.06 IRB approval 0.61 0.53 0.57 Study registration 0.11 0.24 0.17 Pre-analysis plan 0.17 0.12 0.14 Implementer’s endorsement 1.00 1.00 1.00 Stakeholders meetings 0.50 0.65 0.57 Stakeholders trainings 0.17 0.12 0.14 Peer-review group 0.78 0.65 0.71 Advisory panel 0.72 0.82 0.77 Implementation Pilot or formative study 0.39 0.53 0.46 Implementation issues 0.33 0.47 0.40 Implementation delays 0.33 0.53 0.43 Shocks 0.11 0.12 0.11 Letter of variation 0.94 0.94 0.94 Changes to design 0.00 0.24 0.11 Take-up (average) 0.87 0.61 0.74 Attrition (average) 0.05 0.17 0.12 Sample size 0.89 0.47 0.69 Analysis Mixed methods 0.83 0.88 0.86 Subgroup analysis 0.72 0.71 0.71 Intermediate outcomes 0.83 0.82 0.83 Dissemination Policy impact plan 0.83 0.94 0.89 Policy brief 0.17 0.12 0.14 Presentations 0.50 0.41 0.46 Media 0.28 0.24 0.26 Publications 0.83 0.88 0.86 Budget 2.00 2.00 2.00

At the implementation stage, nearly half of the studies were preceded by a pilot testing the

feasibility of the intervention, which reflects 3ie’s interest in avoiding implementation failures. This

however does not prevent the occurrence of implementations problems, which resulted in delays in

more than 40% of cases. The issue of letters of variations to the original contract is the norm, though

only in about 10% of cases this implies a change in the original evaluation design. The average

attrition rate (12%) is probably above what we would like to observe, but major shocks to study

design are observed in only 10% of evaluations. About 70% of evaluation have a reasonable sample

size. The required sample size is obtained through power calculations and 3ie has been routinely

requesting such calculation from researchers. However, the experience shows that researchers tend

to have overly optimistic expectations about the effectiveness of interventions so that ex-ante

power estimates are normally far lower than those estimated ex-post. For our purposes, we used a

cut-off of 1,000 observations for the intervention treatment arm as an indicator of small/large

sample. By this standard, about 70% of evaluations were based on large samples.

25

At the analysis stage, the use of subgroup analysis, the analysis of intermediate impacts and the use

of mixed methods approaches is widespread. This is expected because the use of mixed methods is

strongly encouraged by 3ie (see Figure 8 for an illustration of the type of mixed-methods used).

Similarly, nearly all evaluations have a policy influence plan, though this does not always translate in

dissemination as shown by the relatively low proportion of studies that are presented, disseminated

through policy briefs or that appear on the media. On the other hand, nearly all evaluations funded

by 3ie end up being published in peer-reviewed journals.

Figure 8 Mixed methods used in 3ie impact evaluations

Our sample of 35 studies is too small to perform statistical tests of differences between successful

and unsuccessful studies across a large number of characteristics. A qualitative comparative analysis

of the different configurations of determinants of success seems to be more appropriate in this case.

However, we can compare the average levels of characteristics in the two groups to see if there are

any glaring differences.

First, differences between successful and unsuccessful characteristics are not too large, and no

differences appear at the design stage. Second, perhaps surprisingly, a good theory of change as well

as having a peer review group and an advisory panel, do not seem to make a difference. Third, at the

planning stage, the only visible difference is in the skill set of the evaluation team, which is higher for

successful evaluations. Stakeholder engagement, which was highlighted as one of the main

determinants of success by the literature, does not seem to make a difference on success. Fourth,

there are some interesting differences at the implementation stage. Successful evaluations are less

affected by implementation issues and delay-. They have lower attrition rates and larger sample size.

Quite visibly, unsuccessful evaluations have lower take up rates. Finally, successful evaluations

report no change in evaluation design. Fifth, all the remaining characteristics at the analysis and

dissemination stage are nearly identical in the two groups.

High attrition rates and low take-up rates are particularly worrisome because they are the source of

a number of problems that do not seem to receive the attention they deserve in impact evaluations.

Different attrition rates in the project and control group can bias the estimation of project effects,

but even if attrition rates are the same in the two groups, they can limit the external validity of the

study. If a large fraction of the population drops out of the study, the results obtained do not apply

to the whole population. When take-up is very low, the validity of the intervention may be

questioned altogether. But even if the impact is estimated on the participants, researchers are not

0 2 4 6 8 10 12 14 16

Key Informant Interviews

Focused Group Discussions

In-depth Interviews

Observational techniques

Subjective measures (suggestion box…

Text Mining and recursive abstraction

Administrative data analysis

Case Studies

Media communication

Frequency of Type of Mixed methods used in the closed studies

26

always clear or aware that this impact cannot be extrapolated to the non-participants and that the

quasi-experimental methods used to assess this impact are credible only under some additional

assumptions.

5.3 Unsuccessful studies Over the last ten years, 3ie has discontinued 8 studies, out of 122 closed grants and 222 closed plus

ongoing evaluations, for a number of studies for different reasons. For example, a trial of an

intervention to promote empowerment in Mauritania was discontinued at an early stage because of

security concerns for the enumerators and the external researchers due to a sudden break out of

hostilities. In another example, an evaluation of a debt relief mechanism in Andhra Pradesh that

would allow poor women to borrow at concessional rates from banks in order to repay high rates

loans to informal moneylenders, was discontinued because unable to select a survey firm on time

for the baseline. In addition, when the team went to the field to refine the survey instruments it

became apparent that all villages in the state were being reached by the intervention thus

preventing the establishment of a credible control group. In another example, the study was

terminated at the analysis stage. An evaluation of an intervention promoting the diffusion of

improved seeds in Uganda was commissioned to a team with limited experience of impact

evaluations. Concerns about the design were raised earlier on by 3ie reviewers and the study started

as an RCT, soon to became a difference-in-difference study and finally turned into a PSM study

without ever adopting a convincingly valid evaluation approach.

While knowing the causes of the failure of studies, several of which are thoroughly documented by

Karlan and Appel on the experience of IPA (2016a), is certainly useful, it is also interesting

considering studies that could have been discontinued but ultimately were not. Mckenzie observes

that implementation failures tend to snowball quickly and that researchers find it hard to know

when to walk away (2016). Similarly, Karlan and Appel observe that “of all the intellectual challenges

that arise in the course of designing and implementing a rigorous research study, the greatest may

be deciding when to pull the plug.” The decision of discontinuing a study is as difficult one for a

manager of an evaluation as it is for a researcher. Here, we summarise some of the arguments that

have been put forward by 3ie reviewers and managers when deciding whether discontinuing a study

or not. We hope that a discussion of these arguments might help avoiding common mistakes and

future decisions.

One common argument against discontinuing a study is sunk costs. It is difficult to discontinue a

study in which many resources and efforts have been invested, sometimes over a number of years.

Sunk cost is a standard fallacy of logical reasoning: disbursements made in the past should not affect

decisions about the future. In principle, managers and researchers should not let past efforts to

influence decisions about the future. However, people have often hopes or expectations that

circumstances may change or that they could be exploited somehow in different ways. Sunk costs

arguments are difficult to resist.

A second common argument partly follows as a consequence of the first. It is hoped that researchers

may devise a change in design that allows the completion of an interesting study despite the change

in environment and other difficulties. Many projects in international development end differently

from what it had been planned, and some degree and ability of an evaluation to adjust to changes in

circumstance should be expected. In this case, while a randomized controlled trial may be no longer

possible, a quasi-experimental study or a qualitative study could turn out to be useful and

researchers and managers could agree on a way forward.

27

A third argument is reputational risk. The success of an evaluation study is a responsibility of

researchers as well as of their managers. Bringing a study to completion and meeting the targets

despite all problems may be sought in order to avoid bad publicity or negative feedback from the

community of researchers and funders. This is not a logical argument in favour of continuing

evaluations of course but something managers have to reckon with.

The publication of an unsuccessful study in an academic journal is another possible argument.

Evaluations that fail at the implementation stage or that are irrelevant from a policy perspective (see

box above on publication is not relevance) can be successfully published in prestigious academic

journals. Researchers have all interest to publish studies as this is often the main motivation of their

work and this motivation can be shared by managers and funders who may sympathize with

researchers pursuing an otherwise irrelevant study.

Finally, some would argue that a study should never be discontinued and should be carried out

according to plan to the extent this is possible regardless of circumstances. The reason is that a

discontinued study is quickly forgotten and will not appear in reviews of evaluations. The

disappearance of a failed study from journals will result in biased assessment of similar types of

interventions. In addition, it could be argued that much could be learnt from failures by other

researchers if the study and its development are properly documented, if anything to prevent a

similar failure to occur in the future.

We believe that while discontinuation should not be taken lightly it should be consistently

considered. We do not believe that evaluations should be discontinued if they are not going to plans.

Many evaluations take place in highly volatile and difficult contexts. A strict application of this rule

would lead to the completion only of projects implemented and evaluated in simple environments.

On the other hand, we do not believe that evaluations should be carried out regardless of emerging

circumstances simply for the purpose of documenting and avoiding reporting bias in research. A

middle ground needs to be found between these two extremes and decisions should be taken on a

case-by-case basis, though some general guidelines could be used.

First, we believe that prospective publications in academic journals and reputational risk for the

funding institution are not valid arguments in favour of continuing a study which is not set to provide

credible and relevant evidence. Academic journals publish studies that are not necessarily credible

and relevant and the reputation of funding institutions is better served by holding firmly to

principles of rigour and relevance. Second, before discontinuing a study affected by low take-up,

researchers and managers should consider the opportunity of applying different evaluation designs

(including qualitative approaches), assessing impact on beneficiaries only (the ‘treatment on the

treated’) rather than on the overall population, and explaining the reasons for low participation

rates. Third, funders should establish monitoring processes to assess uptake and make informed

decisions around implementation rollout or low take-up concerns. Funders should have designated

“go-no go” decision periods after some points in the evaluation. These decisions would include a

formal review of the study and a decision on the viability of the impact evaluation.

6 Conclusions We summarise the main lessons from the review of 3ie and DFID impact evaluations conducted

above in the following way:

• The overall success rate of impact evaluations is not very high. We worked with very small

samples, and percentages can be misleading, but we found only about 30% of DFID

evaluations and only 50% of 3ie evaluations to be ‘successful’ by our definition. Meeting the

28

triple goal of achieving credibility, relevance and policy impact is not a simple task.

Expectations about success of impact evaluations should be set accordingly.

• The higher success rate of 3ie evaluations is probably a result of the greatest effort spent by

the organisation in selecting teams of evaluators and in supporting the teams along all the

evaluation process, from the design, to the implementation of the study, the analysis of the

information collected and the dissemination of the results. The quality assurance role played

by 3ie is rather unique and might not be easily replicated by other organisations.

• The higher success of 3ie studies hides the fact that DFID studies tend to have higher policy

impact while 3ie studies tend to be more credible and relevant. This is likely the reflection of

the different funding goals adopted by the two institutions. 3ie is more likely to fund

evaluations informing global debates and increasing the stock of knowledge (‘public goods’),

while DFID is more likely to commission evaluations with a view to directly improving project

operations.

• 3ie has funded a larger number of evaluations of ‘innovations’, that is interventions never

attempted before. These interventions, and related impact evaluations, have been often

preceded by pilots in order to avoid failures resulting from low acceptability and feasibility. It

should be stressed however that an element of risk is always present in ‘innovations’ and

that not everything can be planned in advance or piloted. In our review, piloted

interventions do not appear to be more successful than others. The experience of 3ie also

shows that ‘innovations’ evaluations are not always relevant, despite being published in

academic journals. They are often non replicable when the interventions are successful and

have virtually not use when the interventions are unsuccessful.

• At the planning and the design stage, the composition of the team of evaluators and their

skills appear to be an important determinant of success, while other factors often stressed

by the literature such as stakeholders’ engagement, a solid theory of change, and peer-

review groups, appear to be less relevant.

• The low rate of registration of evaluations and of pre-analysis plan does not seem to affect

the success of the evaluations but does have a negative effect on policy recommendations.

Many evaluations report ‘some’ impacts after testing multiple hypotheses for several

outcomes in such a way that the results of the interventions are difficult to interpret.

Registration and pre-analysis plan would help researchers to clarify the goals of their study

and to set more clearly the standards used to assess the impact of interventions.

• Most problems affecting impact evaluations seem to emerge at the implementation stage.

In our review of the literature, and of the DFID and 3ie evaluations, we found several

evaluations negatively affected by: delays in programme approval, project modification,

implementation delays, lack of financial and/or human capital to roll out the programme as

planned, lack of buy-in from lower level and last-mile workers, service delivery glitches,

external shocks and low take-up. The latter being one of the main sources of unsuccessful

evaluations with great implications on the credibility of the estimation of project effects and

on their external validity.

• Whether an evaluation should be discontinued when facing implementation problems is a

decision that should be taken on a case-by-case basis. We do not recommend that

evaluations should be discontinued when sticking to plans is impossible, or that they should

be led to completion at all costs regardless of the problems faced. Managers and

researchers should avoid considerations of reputational risk, publishability of research or

sunk costs when making decisions regarding discontinuations, and a process should be

established to consider the opportunity for discontinuing evaluations.

29

• Successful evaluations obtain their data from large samples. Sample size is a crude measure

of the quality of a study but we observe that impact evaluations relying on small samples

(less than 1,000 project observations) are more likely incur into problems of credibility and

relevance.

• The use of mixed methods, subgroup analysis and the analysis of intermediate impact does

not appear to make studies more successful or policy relevant. On the contrary it was

observed that in the absence of registration and pre-analysis plan the production of many

results can make the interpretation of impact more difficult.

• We found no obvious relationship between the size of the budget and the success of the

impact evaluations funded by 3ie, suggesting that successful impact evaluations can be

produced with limited budgets.

Finally, we would like to highlight some limitations of the study and a possible way forward. First,

the conceptual framework and the review of the impact evaluations is inevitably influenced by the

beliefs and personal experience of the authors. There is a good element of subjectivity in this type of

work and a different team of researchers using the same information would probably reach different

conclusions. Second, the studies reviewed are a sample of the impact evaluations funded by 3ie and

DFID and a very small sample of all impact evaluations conducted over the last 5-10 years. The

sample is not representative of all the determinants of success at play in the wider process of

knowledge generation. Third, the small size of the sample limited our analysis to case studies

supported by quantitative summary tables. There are at least two ways in which our study could be

improved or expanded in a second iteration. The first, is by inspecting more closely a number of case

studies of extremely successful impact evaluations in order to draw lessons about factors

determining success. The second is by using a qualitative comparative analysis approach (QCA) to

assess whether particular configurations of factors explain success.

30

References

Banerjee, A., & Duflo, E. (2011). Poor Economics: A radical rethinking of the way to fight global poverty. New York: PublicAffairs.

Blair, G., Cooper, J., Coppock, A., & Humphreys, M. (2016). Declaring and Diagnosing Research Designs. unpublished manuscript.

Briceno, B., Cuesta, L., & Attanasio, O. (2011). Behind the scenes: managing and conducting large scale impact evaluations in Colombia Working Paper 14. New Delhi: International Initiative for Impact Evaluation (3ie).

Campos, F., Coville, A., Fernandes, A. M., Goldstein, M., & McKenzie, D. (2012). Learning from the Experiments That Never Happened: Lessons from Trying to Conduct Randomized Evaluations of Matching Grant Programs in Africa. Washington DC: The World Bank.

Cuong, N. V. (2014). Impact evaluations of development programmes: experiences from Viet Nam. 3ie Working Paper 21. New Delhi: International Initiative for Impact Evaluation (3ie).

Davis, R. (2015). Evalauability assessment. Retrieved from http://www.betterevaluation.org/en/themes/evaluability_assessment

Duflo, E. (2017). The Economist as a Plumber. Ferguson, M. (2017). Turning lemons into lemonade, and then drinking it: Rigorous evaluation under

challenging conditions. Retrieved from https://researchforevidence.fhi360.org/turning-lemons-lemonade-drinking-rigorous-evaluation-challenging-conditions

Glennerster, R. (2017). The Practicalities of Running Randomised Evaluations: Partnerships, Measurement, Ethics, and Transparency. In E. Duflo & A. Banerjee (Eds.), Handbook of Economic Field Experiments (Vol. Vol. 1, pp. 175-243): North Holland.

Glennerster, R., & Takavarasha, K. (2013). Running Randomised Evaluations: A Practical Guide. Princeton, New Jersey: Princeton University Press.

Gueron, J. M., & Rolston, H. (2013). Fighting for Reliable Evidence. New York: Russel Sage Foundation.

Jones, M., & Kondylis, F. (2016). Lessons from a crowdsourcing failure. Retrieved from http://blogs.worldbank.org/impactevaluations/lessons-crowdsourcing-failure

Karlan, D., & Appel, J. (2016a). Failing in the field: What we can learn when field research is going wrong. Princeton University Press: Woodstock, Oxfordshire.

Karlan, D., & Appel, J. (2016b). When the Juice Isn’t Worth the Squeeze: NGOs refuse to participate in a beneficiary feedback experiment Retrieved from http://blogs.worldbank.org/impactevaluations/when-juice-isn-t-worth-squeeze-ngos-refuse-participate-beneficiary-feedback-experiment

Keller, J. M., & Savedoff, D. (2016). Improving Development Policy through Impact Evaluations: Then, Now, and Next? Retrieved from https://www.cgdev.org/blog/improving-development-policy-through-impact-evaluations-then-now-and-next

Legovini, A., Di Maro, V., & Piza, C. (2015). Impact Evaluations Help Deliver Development Projects. World Bank Policy Research Working Paper 7157.

Lopez Avila, D. M. (2016). Implementing impact evaluations: trade-offs and compromises. Retrieved from http://blogs.3ieimpact.org/implementing-impact-evaluations-trade-offs-and-compromises/

MCC. (2012). MCC’s First Impact Evaluations: Farmer Training Activities in Five Countries. MCC Issue Brief.

Mckenzie. (2016). Book Review: Failing in the Field – Karlan and Appel on what we can learn from things going wrong. Retrieved from http://blogs.worldbank.org/impactevaluations/book-review-failing-field-karlan-and-appel-what-we-can-learn-things-going-wrong

Mckenzie, D. (2012). Misadventures in Photographing Impact. Retrieved from http://blogs.worldbank.org/impactevaluations/misadventures-in-photographing-impact

Mckenzie, D. (2016a). Lessons from some of my evaluation failures: Part 1 of ? Retrieved from http://blogs.worldbank.org/impactevaluations/lessons-some-my-evaluation-failures-part-1

http://www.betterevaluation.org/en/themes/evaluability_assessment

https://researchforevidence.fhi360.org/turning-lemons-lemonade-drinking-rigorous-evaluation-challenging-conditions

https://researchforevidence.fhi360.org/turning-lemons-lemonade-drinking-rigorous-evaluation-challenging-conditions

http://blogs.worldbank.org/impactevaluations/lessons-crowdsourcing-failure

http://blogs.worldbank.org/impactevaluations/when-juice-isn-t-worth-squeeze-ngos-refuse-participate-beneficiary-feedback-experiment

http://blogs.worldbank.org/impactevaluations/when-juice-isn-t-worth-squeeze-ngos-refuse-participate-beneficiary-feedback-experiment

https://www.cgdev.org/blog/improving-development-policy-through-impact-evaluations-then-now-and-next

https://www.cgdev.org/blog/improving-development-policy-through-impact-evaluations-then-now-and-next

http://blogs.3ieimpact.org/implementing-impact-evaluations-trade-offs-and-compromises/

http://blogs.3ieimpact.org/implementing-impact-evaluations-trade-offs-and-compromises/

http://blogs.worldbank.org/impactevaluations/book-review-failing-field-karlan-and-appel-what-we-can-learn-things-going-wrong

http://blogs.worldbank.org/impactevaluations/book-review-failing-field-karlan-and-appel-what-we-can-learn-things-going-wrong

http://blogs.worldbank.org/impactevaluations/misadventures-in-photographing-impact

http://blogs.worldbank.org/impactevaluations/lessons-some-my-evaluation-failures-part-1

31

Mckenzie, D. (2016b). Lessons from some of my evaluation failures: Part 2 of ? Retrieved from http://blogs.worldbank.org/impactevaluations/lessons-some-my-evaluation-failures-part-2

MEASURE evaluation. (2015). Evaluation FAQ: How Much Will an Impact Evaluation Cost? MEASURE evaluation brief November 2015.

Nutley, S. M., Walter, I., & Davies, H. T. O. (2008). Using Evidence: How Research Can Improve Public Services. Bristol: The Policy Press.

Peersman, G., Guijt, I., & Pasanen, T. (2015). Evaluability assessment for impact evaluation: guidance, checklist and decision support. A Methodslab Publication ODI.

Priedeman Skiles, M., Hattori, A., & Curtis, S. L. (2014). Impact Evaluations of Large-Scale Public Health Interventions Experiences from the Field. USAID: MEASURE International Working Paper 14-17.

Puri, J., & Rathinam, F. (2017). Grants for real-world impact evaluations: what are we learning? unpublished manuscript.

Samuels, F., & McPherson, F. (2010). Meeting the challenge of proving impact in Andhra Pradesh, India. Journal of Development Effectiveness, 2(4), 468-485.

Stickler, M. (2016). 5 Things USAID’s Land Office Has Learned about Impact Evaluations. Retrieved from https://blog.usaid.gov/2016/05/mythbusting-5-things-you-should-know-about-impact-evaluations-at-usaid/

Vaessen, J. (2017). Evaluability and why it is Important for Evaluators and Non-Evaluators. Retrieved from http://ieg.worldbankgroup.org/blog/evaluability

Vermeersch, C. (2012). “Oops! Did I just ruin this impact evaluation?” Top 5 of mistakes and how the new Impact Evaluation Toolkit can help. Retrieved from http://blogs.worldbank.org/impactevaluations/oops-did-i-just-ruin-this-impact-evaluation-top-5-of-mistakes-and-how-the-new-impact-evaluation-tool

Weiss, C. H. (1980). Knowledge Creep and Decision Accretion. Knowledge: Creation, Diffusion, Utilization, 1(3), 381-404.

White, H. (2014). Ten things that can go wrong with randomised controlled trials. Retrieved from http://blogs.3ieimpact.org/ten-things-that-can-go-wrong-with-randomised-controlled-trials/

Wood, B. (2015). Replication research promotes open discourse. Evidence Matters http://blogs.3ieimpact.org/replication-research-promotes-open-discourse/.

Wood, B. D. K., & Djimeu, E. (2014). Requiring fuel gauges: A pitch for justifying impact evaluation sample size assumptions. Evidence Matters http://blogs.3ieimpact.org/requiring-fuel-gauges-a-pitch-for-justifying-impact-evaluation-sample-size-assumptions/.

http://blogs.worldbank.org/impactevaluations/lessons-some-my-evaluation-failures-part-2

https://blog.usaid.gov/2016/05/mythbusting-5-things-you-should-know-about-impact-evaluations-at-usaid/

https://blog.usaid.gov/2016/05/mythbusting-5-things-you-should-know-about-impact-evaluations-at-usaid/

http://ieg.worldbankgroup.org/blog/evaluability

http://blogs.worldbank.org/impactevaluations/oops-did-i-just-ruin-this-impact-evaluation-top-5-of-mistakes-and-how-the-new-impact-evaluation-tool

http://blogs.worldbank.org/impactevaluations/oops-did-i-just-ruin-this-impact-evaluation-top-5-of-mistakes-and-how-the-new-impact-evaluation-tool

http://blogs.3ieimpact.org/ten-things-that-can-go-wrong-with-randomised-controlled-trials/

http://blogs.3ieimpact.org/replication-research-promotes-open-discourse/

http://blogs.3ieimpact.org/requiring-fuel-gauges-a-pitch-for-justifying-impact-evaluation-sample-size-assumptions/

http://blogs.3ieimpact.org/requiring-fuel-gauges-a-pitch-for-justifying-impact-evaluation-sample-size-assumptions/

successful impact evaluations: lessons from dfid and 3ie · successful impact evaluations: lessons...

Documents