
Left Behind by Optimal Design: The Challenge of Designing Effective Teacher Performance Pay Programs∗

Isaac Mbiti† Mauricio Romero‡ Youdi Schipper§

June 19, 2018

Abstract

A growing body of evidence suggests that teacher performance pay systems can improve student learning outcomes, particularly in settings where existing mechanisms for teacher accountability are weak. We use a field experiment in Tanzanian public primary schools to directly compare the effectiveness on early grade learning of two randomly assigned teacher performance pay systems: a pay for percentile system, which is more complex but can (under certain conditions) induce optimal effort among teachers, and a simple system that rewards teachers based on student proficiency levels. We find that both systems improve student test scores. However, despite the theoretical advantages of the pay for percentile system, the proficiency system is at least as effective in raising student learning. Moreover, we find suggestive evidence that the pay for percentile system favors students from the top of the distribution, highlighting the challenge of designing incentives that can deliver optimal and equitable learning gains for all students.

JEL Classification: C93, H52, I21, M52, O15
Keywords: teacher performance pay, pay for percentile, incentive design, Tanzania

∗ We especially thank Karthik Muralidharan for his involvement as a collaborator in the early stages of this project and for subsequent discussions. We are grateful to Bobby Pakzad-Hurson, John Friedman, Joseph Mmbando, Molly Lipscomb, Omer Ozak, Jay Shimshack, and seminar participants at UC San Diego and Universidad del Rosario for comments. Erin Litzow and Jessica Mahoney provided excellent research assistance through Innovations for Poverty Action. We are also grateful to EDI Tanzania for their excellent data collection and implementation efforts. The EDI team included Respichius Mitti, Phil Itanisia, Timo Kyessey, Julius Josephat, Nate Sivewright, and Celine Guimas. The evaluation was partially funded by a grant from the REACH Trust fund at the World Bank. We are grateful to Peter Holland, Jessica Lee, Arun Joshi, Owen Ozier, Salman Asim, and Cornelia Jesse for their support through the REACH initiative.
† University of Virginia, J-PAL, IZA; [email protected]
‡ University of California - San Diego; [email protected]
§ Twaweza; [email protected]


1 Introduction

Over the past two decades, global education priorities have started to shift from increasing primary school enrollment to promoting policies that improve learning. This shift has been driven in part by the growing body of evidence revealing poor and stagnant levels of learning among students in developing countries, despite significant investments in education (World Bank, 2018b). Given the central role of teachers in the education production function (Hanushek & Rivkin, 2012; Chetty, Friedman, & Rockoff, 2014b, 2014a), as well as the substantial share of the education budget devoted to their compensation, there is increasing interest in policies that can improve teacher effectiveness (Bruns, Filmer, & Patrinos, 2011; Mbiti, 2016). By strengthening the links between teacher remuneration and performance indicators, such as objectively measured student learning outcomes, teacher performance pay programs are seen as a promising pathway to improve education quality. Consequently, the adoption rate of such programs has increased significantly over the past two decades. For instance, the share of US school districts featuring teacher performance pay programs increased by more than 40% from 2004 to 2012 (Imberman, 2015). Moreover, there is a global trend towards increased adoption of such programs across the OECD, as well as in less developed countries, such as Brazil, Chile, and Pakistan (Alger, 2014; Ferraz & Bruns, 2012; Barrera-Osorio & Raju, in press; Contreras & Rau, 2012).[1]

Previous studies have shown that the effectiveness of teacher performance pay systems depends on key elements of the incentive program design (Neal & Schanzenbach, 2010; Neal, 2011; Mbiti et al., 2017; Bruns & Luque, 2015; Loyalka, Sylvia, Liu, Chu, & Shi, in press).[2] Incentive schemes based on proficiency thresholds are commonly used (for example, by many US states under No Child Left Behind), as they are clearer and easier to comprehend and implement relative to schemes based on more complex alternatives, such as value-added measures. The administrative simplicity of such designs may be particularly appealing for developing countries with weaker state capacity. However, proficiency-based incentive systems are theoretically less effective than more complex alternatives and can encourage teachers to focus on students who are close to the proficiency threshold (Neal & Schanzenbach, 2010; Neal, 2011). In contrast, more complex incentive systems, such as those based on rank-order tournaments (e.g., pay for percentile designs), may induce greater and potentially socially optimal levels of effort among teachers compared to simpler schemes (Barlevy & Neal, 2012; Neal, 2011; Loyalka et al., in press). However, such systems are harder to implement and may be difficult for teachers to fully comprehend, which can undermine their effectiveness (Goodman & Turner, 2013; Fryer, 2013; Neal, 2011). Furthermore, under certain conditions (or specifications of the education production function), these complex systems can potentially increase inequality in learning by encouraging teachers to focus on students with higher baseline test scores. Given the challenge of designing effective incentive systems, empirical evidence that can shed light on the potential trade-offs between complex and simpler teacher incentive systems is especially useful for policy makers from countries with weak state capacity.

[1] 13 of 34 countries that provided information about their teacher policies under the World Bank's Systems Approach for Better Education Results (SABER) initiative provided monetary bonuses to high-performing teachers (World Bank, 2018a).

[2] These elements include: whose performance is incentivized (individual or collective); what the incentive metric is (test scores, or input measures like attendance or preparation); how large the expected incentive payment is; and the mapping between the performance metric and the teacher incentive payment. This last design element is the focus of our study.

In this paper, we compare the effectiveness of a more complex teacher incentive scheme to a simpler system using a randomized experiment in a set of 180 Tanzanian public primary schools. In particular, we compare the effectiveness of the pay for percentile scheme proposed by Barlevy and Neal (2012) to a simpler design with multiple proficiency thresholds ('levels'). We compare the student learning outcomes in both treatment groups to our control group. Both types of incentive programs rewarded teachers in math, Kiswahili, and English in first, second, and third grade. In addition, the per-student bonus budget was equalized across grades, subjects, and treatment arms. In both incentive designs, we determined individual teacher reward payments based on actual student performance on externally administered tests. The mean teacher bonus paid in the second year of the evaluation was 3.5 percent of the annual net salary. Following Neal (2013), we evaluate the incentive programs using data from both the 'high-stakes' test that is administered to all students in order to determine teacher bonuses, and the 'low-stakes' test that is administered to a sample of students for research purposes. Both types of tests are collected in control schools, although the results of the 'high-stakes' test do not trigger any payments in these schools.

In the 60 schools assigned to the pay for percentile arm, students are first tested and then placed in one of several groups based on their initial level of learning. At the end of the school year, students are re-tested and ranked within their assigned group. Teachers are then rewarded in proportion to their students' ranks within each group. By effectively handicapping the differences in initial student performance across teachers, the pay for percentile system does not penalize teachers who serve disadvantaged students. In addition, since the reward schedule (or the mapping of student rankings to teacher bonuses) is exactly the same across all groups, it could encourage teachers to focus on all students, rather than solely on those who are marginal or close to a proficiency threshold. Barlevy and Neal (2012) show that pay for percentile can induce socially optimal levels of effort among teachers. In addition, they also show that the system can lead to equitable learning gains under certain specifications of the education production function. However, the system can be difficult for teachers to comprehend, and challenging to implement, as school systems would need to create and manage databases that track student learning over time.

In the 60 schools assigned to receive incentives based on proficiency targets, teachers earn bonuses based on their students' mastery of several grade-specific skills that are outlined in the national curriculum. As teacher incentive programs that use single proficiency thresholds typically encourage teachers to focus on students close to the passing threshold, we included several passing thresholds to encourage teachers to broaden their focus. In our design, the skill thresholds range from very basic skills to more complex skills, which allows teachers to earn rewards from a wide range of students. As reward payments for each skill are inversely proportional to the number of students that attain the skill, harder-to-pass skills are rewarded more. Given the clarity of the reward structure, this system is easier to understand and simpler to implement, as it only requires that school systems administer one test at the end of the year. However, since the system uses proficiency thresholds, it is arguably less likely to induce optimal effort among teachers. Moreover, as rewards are based on absolute learning levels, systems using proficiency targets may disadvantage teachers who serve students from poorer backgrounds.

Despite the theoretical advantages of the pay for percentile system, the simpler proficiency levels incentive system is as effective as, and sometimes more effective than, the pay for percentile system in this setting. After two years, using data from the sample of students who took the low-stakes test, we find that test scores in math increased by about 0.07σ under both systems (statistically significant at the 10 percent level). Kiswahili scores increased by 0.11σ (p-value < 0.01) under the proficiency levels system compared to a 0.056σ (p-value 0.11) increase under the pay for percentile system. Test scores in English increased by 0.11σ (p-value 0.19) in proficiency levels schools, and 0.19σ (p-value 0.02) in pay for percentile schools, although these estimates are not statistically distinguishable. The results using the high-stakes data show similar patterns, although the point estimates are generally larger, which is likely due to increased student effort (Levitt, List, Neckermann, & Sadoff, 2016; Gneezy et al., 2017; Mbiti et al., 2017). At the end of two years, math test scores in the high-stakes exam increased by 0.14σ (p-value < 0.01) under the proficiency levels system compared to a 0.093σ (p-value 0.02) increase under the pay for percentile system. Kiswahili scores increased by 0.18σ (p-value < 0.01) under the proficiency levels system compared to a 0.085σ (p-value 0.061) increase under the pay for percentile system. English test scores increased by 0.28σ (p-value < 0.01) in proficiency levels schools, and 0.23σ (p-value < 0.01) in pay for percentile schools. While the treatment estimates for math and English test scores are statistically indistinguishable across treatments, the levels treatment estimate for Kiswahili is statistically larger than the pay for percentile estimate at the 5% level. These results contrast with the findings of Loyalka et al. (in press), who find that student math test scores increased the most under a pay for percentile system compared to other systems (using high-stakes data only).

In addition, Loyalka et al. (in press) find that pay for percentile teacher incentives led to equitable learning gains across the distribution of students.[3] In contrast, we find suggestive evidence that better prepared students benefited more under the pay for percentile treatment. However, we find that the learning gains under the levels system were more equitably distributed, except for English in the second year of the evaluation. Formal statistical tests reject the equality of treatment effects across the distribution of student baseline test scores for pay for percentile schools in both years for Kiswahili, and in the first year for math. In order to interpret our findings, we build a stylized model where we compare the distributional effects of our incentive systems on student learning under two different specifications of the education production function: a specification where all students benefit equally from teacher effort, and one in which students with higher baseline test scores benefit more from teacher effort.[4] Following the predictions of our model, our empirical results suggest that the productivity of teacher effort is higher for students with better baseline test scores in the education production function.[5]

Even though Barlevy and Neal (2012) show that pay for percentile systems can still induce socially optimal levels of effort in the presence of this type of heterogeneity (or production function specification), our results suggest that teacher performance pay programs need to be carefully designed to account for equity concerns.[6]

We use our comprehensive set of survey data collected from school administrators, teachers, and students to shed light on theoretically relevant mechanisms. Consistent with our results on test scores, we find that teacher effort was generally higher under the levels system compared to the pay for percentile system. For instance, teachers were more likely to be off task in pay for percentile schools relative to levels schools. Given the well-documented concerns about teacher misunderstandings of incentive design (Goodman & Turner, 2013; Fryer, 2013), we show that teacher comprehension was high under both systems, which allows us to rule out a relative lack of understanding as a driving factor. Although previous studies have shown that women exert less effort in tournaments (Niederle & Vesterlund, 2007, 2011), we do not find any heterogeneity by teacher gender. We also explore the heterogeneity in treatment effects by teacher ability and find that more able teachers were more responsive to the levels design, while there was no differential pattern by teacher ability in the pay for percentile system. We also find that teachers in the pay for percentile design expected to earn lower bonuses, perhaps due to the increased uncertainty and complexity of the pay for percentile system relative to the levels system. In addition, teachers in the levels system were better able to articulate clear and specific targets for their students on the high-stakes exams, perhaps due to the clearer reward structure.

[3] Muralidharan and Sundararaman (2011) found equitable learning gains from a value-added teacher incentive design.

[4] The basic model examined by Barlevy and Neal (2012) also assumed that all students benefit equally from teacher effort.

[5] This is consistent with Banerjee and Duflo (2011) and Glewwe, Kremer, and Moulin (2009), who argue that education systems in developing countries tend to favor students from the top quantiles.

[6] The effect of incentives on inequality has also been explored in the context of a firm. Bandiera, Barankay, and Rasul (2007) find that performance pay for managers increased earnings inequality among workers because managers focused their efforts on high-ability workers.

Our results highlight the challenges of designing teacher incentives that are both effective and equitable. Program design features such as the mapping of test scores to rewards, the size of the bonus, the potential for free-riding, and the measure of student learning used to reward teachers play important roles in determining the effectiveness of the program (Neal, 2011; Loyalka et al., in press; Ganimian & Murnane, 2016; Bruns & Luque, 2015). These differences are the primary drivers of the heterogeneity in the effectiveness of teacher performance pay programs (Imberman & Lovenheim, 2015; Neal, 2011; Ganimian & Murnane, 2016). For instance, Glewwe, Ilias, and Kremer (2010); Lavy (2002, 2009); Muralidharan and Sundararaman (2011); Balch and Springer (2015); and Loyalka et al. (in press) find that teacher performance pay increases student test scores. However, another set of studies find that these programs have limited effects on student learning (Fryer, 2013; Barrera-Osorio & Raju, in press; Goldhaber & Walch, 2012; Goodman & Turner, 2013; Springer et al., 2011).

Our study contributes to a growing literature on the potential of teacher incentives to improve learning outcomes in developing countries. By comparing the pay for percentile system with a simpler, budget-equivalent proficiency system, our study provides some of the first empirical evidence of the effectiveness of pay for percentile in sub-Saharan Africa, benchmarked against a simpler system.[7] This comparison allows us to explore the trade-offs faced by education authorities who have to consider the effectiveness, feasibility of implementation, and equity of different incentive designs with limited information about the education production function.

[7] Gilligan, Karachiwalla, Kasirye, Lucas, and Neal (2018) evaluate a pay for percentile teacher incentive program in a set of rural schools in Uganda. They find that pay for percentile incentives have no impact on student learning, except in schools with textbooks. They do not compare their pay for percentile system to other incentive designs.

As our results show, under certain specifications of the education production function, complex and theoretically optimal designs can favor better students. However, simpler systems, though potentially less optimal, may be more robust from an equity perspective to different production function specifications. This is consistent with a large body of literature from contract theory, such as Carroll (2015) and Carroll and Meng (2016), who show that simpler incentive schemes are more robust mechanisms to resolve principal-agent problems when there is uncertainty about the specification of the production function.

2 Experimental Design

2.1 Context

Tanzania allocates about a fifth of overall government spending (roughly 3.5 percent of GDP) to education (World Bank, 2017). Much of this spending has been devoted to promoting educational access. As a consequence, net enrollment rates in primary school have increased from 53 percent in 2000 to 80 percent in 2014 (World Bank, 2017). Despite these gains in educational access, educational quality remains a major concern. Resources and materials are scarce. For example, only 14 percent of schools had access to electricity, and just over 40 percent had access to potable water (World Bank, 2017). Nationwide, there are approximately 43 pupils per teacher (World Bank, 2017), although early grades will often have much larger class sizes. Moreover, approximately 5 pupils shared a single mathematics textbook, while 2.5 pupils shared a reading textbook in 2013 (World Bank, 2017). Consequently, student learning levels are quite low. In 2012, data from nationwide assessments showed that only 38 percent of children aged 9-13 were able to read and do arithmetic at Grade 2 level, suggesting that educational quality is a pressing policy problem (Uwezo, 2013).

The poor quality of education is driven in part by the limited accountability in the education system. Quality assurance systems (e.g., school inspectors) typically focus on superficial issues, such as the state of the school garden, rather than on issues that can promote learning (Mbiti, 2016). The lack of accountability is further reflected in teacher absence rates. Data from unannounced spot-checks show that almost a quarter of teachers were absent from school, and only half of the teachers who were at school were in the classroom during further spot-checks (World Bank, 2011). As a result, almost 60 percent of planned instructional time was lost (World Bank, 2011).

Despite these high absence rates, teachers' unions continue to lobby for better pay as a way to address quality concerns in the education system, even though studies have found that the correlation between teacher compensation and student learning is extremely low (Kane, Rockoff, & Staiger, 2008; Bettinger & Long, 2010; Woessmann, 2011). Teachers earn approximately 500,000 TZS per month (roughly US$300), or roughly 4.5 times GDP per capita (World Bank, 2017).[8] In addition, approximately 60 percent of the education budget is devoted to teacher compensation. Despite the relatively lucrative wages of Tanzanian teachers, the teachers' union called a strike in 2012 to demand a 100 percent increase in pay (Reuters, 2012; PRI, 2013).[9]

2.2 Interventions and Implementation

We compare the effectiveness of the pay for percentile scheme proposed by Barlevy and Neal (2012) to a simple proficiency threshold design, where the budgets are equalized to facilitate cost-effectiveness comparisons. The interventions were formulated and managed by Twaweza, an East African civil society organization that focuses on citizen agency and public service delivery. The intervention was part of a series of projects launched under a broader program umbrella named KiuFunza ('Thirst for learning' in Kiswahili).[10] A budget of $150,000 per year for teacher and head teacher incentives was split between the two treatment arms in proportion to the number of students enrolled. As a result, the total prize in each treatment arm was approximately $3 per student. All interventions were implemented by Twaweza in partnership with EDI (a Tanzanian research firm) and a set of local district partners. To ensure school support for the incentive scheme, we also offered a performance bonus to head teachers. This bonus equals 20 percent of the combined bonus of all incentivized teachers in his or her school. The head teacher bonus was communicated transparently but did not affect the bonus calculations of the teachers, since the teacher bonus budgets were communicated excluding the head teacher budget.
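The budget arithmetic described above can be sketched in a few lines. This is an illustrative sketch only: the enrollment figures are hypothetical, and the function names are ours, used purely to make the proportional split and the 20 percent head teacher top-up concrete.

```python
# Sketch of the KiuFunza bonus budget logic described in the text.
# Assumptions: the $150,000 annual pot is split between the two treatment
# arms in proportion to enrollment, and each school's head teacher bonus
# equals 20 percent of the combined bonuses of its incentivized teachers.

TOTAL_BUDGET = 150_000.0  # USD per year, both arms combined

def split_between_arms(n_p4pctile: int, n_levels: int) -> tuple[float, float]:
    """Divide the pot across the two arms in proportion to enrollment."""
    total = n_p4pctile + n_levels
    return (TOTAL_BUDGET * n_p4pctile / total,
            TOTAL_BUDGET * n_levels / total)

def head_teacher_bonus(teacher_bonuses: list[float]) -> float:
    """Head teacher receives 20% of the school's combined teacher bonuses."""
    return 0.20 * sum(teacher_bonuses)

# Hypothetical enrollments of 24,000 and 26,000 students per arm:
p4p_budget, levels_budget = split_between_arms(24_000, 26_000)
# Either way the pot works out to $3 per enrolled student, consistent
# with the roughly $3 per student reported in the text.
```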

[8] The average teacher in a sub-Saharan African country earns almost four times GDP per capita, compared to OECD teachers, who earn 1.3 times GDP per capita (OECD, 2017; World Bank, 2017).

[9] In recent years, other teacher strikes have occurred in South Africa, Kenya, Guinea, Malawi, Swaziland, Uganda, Benin, and Ghana.

[10] The first set of interventions was launched in 2013 and evaluated by Mbiti et al. (2017).


Within each intervention arm, Twaweza distributed information describing the program to schools and communities via public meetings in early 2015 and 2016. The implementation teams also conducted additional mid-year school visits to re-familiarize teachers with the program, gauge teacher understanding of the bonus payment mechanisms, and answer any remaining questions. At the end of the school year, all students in Grades 1, 2, and 3 in every school, including control schools, are tested in Kiswahili, English, and math. As this test was used to determine teacher incentive payments, it is 'high-stakes' (from the teacher's perspective). Our 'low-stakes' research test was conducted on a different day around the same time. Both sets of tests were based on the Tanzanian curriculum and were developed by Tanzanian education professionals using the Uwezo learning assessment test development framework.[11]

[11] Uwezo learning assessments have been routinely conducted in Kenya, Tanzania, and Uganda since 2010.

2.2.1 Proficiency thresholds (levels) design

Proficiency-based systems are easier for teachers to understand and provide more actionable targets. Consequently, such systems are likely to increase motivation among teachers and head teachers, but they have well-known limitations. For example, they are unable to adequately account for differences in the initial distribution of student preparation across schools and classrooms. In addition, this type of system can encourage teachers to focus on students close to the proficiency threshold, at the expense of students who are sufficiently above or below the threshold (Neal & Schanzenbach, 2010). To mitigate this concern, our levels design features multiple thresholds, rather than a single threshold, ranging from very basic skills to more advanced skills in the curriculum. This design allows teachers to earn bonuses for helping a broader set of students, including students with lower baseline test scores and those with higher baseline test scores.

The levels treatment pays teachers in proportion to the number of skills students in grades 1-3 are able to master in mathematics, Kiswahili, and English. The bonus budget for each subject is split evenly between skills, while the per-pass bonus paid ex post equals the skill budget divided by the number of students passing the skill. Consequently, harder skills have a higher per-pass bonus. The total bonus for a teacher consists of the per-skill rewards aggregated over all skills and all students who pass a particular skill. Teachers can earn larger bonuses if they have more students and if their students are proficient in a larger number of skills, especially harder skills.

Table 1 shows the skills to be tested in each grade-subject combination. The total budget is split across grades in proportion to the number of students enrolled in each grade. The budget is then divided equally among subjects and skills within each subject. At the end of the year, teachers are paid according to the following formula:

P_{sj} = \frac{X_s}{\sum_{i \in N_L} \mathbb{1}\{a_i > T_s\}} \sum_{k \in J} \mathbb{1}\{a_k > T_s\}    (1)

where P_{sj} is the payment of teacher j for skill s, J is the set of students of teacher j, a_k is the test score of student k, T_s is the passing threshold for skill s, X_s is the total amount of money available for skill s, and N_L is the set of all students in schools across Tanzania in the 'levels' treatment.

For each skill, teachers earn more money as more students in their class score higher than the passing threshold. The payment for the skill is higher if fewer students are above the threshold. In other words, the reward is higher for teachers if students master a 'difficult' skill, which is defined by the overall passing rate of each skill.
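As a concrete illustration of the per-skill payment rule, a minimal sketch follows. All budgets, scores, and class sizes are hypothetical, and the function name is ours; the sketch only encodes the division of the skill budget by the nationwide pass count described above.

```python
# Sketch of the per-skill payment rule in the levels design.
# The per-pass bonus for skill s is the skill budget X_s divided by the
# number of students across all levels schools who pass the skill; a
# teacher's payment is that bonus times the number of her own students
# who pass.

def skill_payment(skill_budget: float,
                  all_scores: list[float],
                  class_scores: list[float],
                  threshold: float) -> float:
    """Payment to one teacher for a single skill."""
    passers_overall = sum(1 for a in all_scores if a > threshold)
    if passers_overall == 0:
        return 0.0  # nobody passed, so no per-pass bonus is defined
    per_pass_bonus = skill_budget / passers_overall  # rarer skills pay more
    passers_in_class = sum(1 for a in class_scores if a > threshold)
    return per_pass_bonus * passers_in_class

# Hypothetical: a $1,000 skill budget and 500 passers nationwide give a
# per-pass bonus of $2; a teacher with 20 passing students earns $40.
payment = skill_payment(1_000, [60.0] * 500 + [10.0] * 1_500,
                        [60.0] * 20 + [10.0] * 5, 50.0)
```

In the actual program a teacher's class is of course a subset of the nationwide student set; the sketch passes them as separate lists purely for readability.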

Table 1: Skills tested in the levels design

          Kiswahili      English        Math
Grade 1   Letters        Letters        Counting
          Words          Words          Numbers
          Sentences      Sentences      Inequalities
                                        Addition
                                        Subtraction
Grade 2   Words          Words          Inequalities
          Sentences      Sentences      Addition
          Paragraphs     Paragraphs     Subtraction
                                        Multiply
Grade 3   Story          Story          Addition
          Comprehension  Comprehension  Subtraction
                                        Multiplication
                                        Division

2.2.2 Pay for percentile design

The pay for percentile design is based on the work of Barlevy and Neal (2012), who show that this incentive structure can, under certain conditions, induce teachers to exert socially optimal levels of effort. For each subject-grade combination, we created student groups with similar initial learning levels based on test score data from the previous school year. Students without test scores in second and third grade were grouped together in an 'unknown' ability group. Since none of the first grade students had incoming test scores, we created broad groups based on the historical average test scores for the school. Thus, all first-grade students within a school were assigned to the same group. We then compensated teachers proportionally to the rank of their students at the end of the school year relative to all other students with a similar baseline level of knowledge.

More formally, let $a^{t-1}_i$ be the score of student $i$ at the end of the previous school year. Students are divided into $k$ groups according to $a^{t-1}_i$. We divided the total pot of money allocated to a subject-grade combination, $X_g$, across the $k$ groups in proportion to the number of students in each group. That is,

$$X^k_g = \frac{X_g \cdot n_k}{N_g},$$

where $N_g$ is the total number of students in grade $g$, $n_k$ is the number of students in group $k$, and $X^k_g$ is the amount of money allocated to group $k$ in grade $g$. At the end of the year, we ranked students (into 100 ranks) within each group according to their endline test score $a^t_i$, and within each group we gave teachers points proportional to the rank of their students. A teacher would receive 99 points for a student in the top 1% of group $k$, and no points for a student in the bottom 1% of the group. Thus, within each group we have:

$$X^k_g = \frac{X_g \cdot n_k}{N_g} = \sum_{i=1}^{100} b\,(i-1)\,\frac{n_k}{100},$$

where $b\,(i-1)$ is the amount of money paid for each student in rank $i$. Therefore, $b = \frac{X_g}{N_g}\cdot\frac{2}{99}$. The total money $X_g$ allocated to a subject-grade is proportional to the number of students in each grade and is divided equally among the three subjects. In other words, $X_g = \frac{X_T \cdot N_g}{3 \sum_{g=1}^{3} N_g}$, where $X_T$ is the total amount of money available for the pay for percentile design. The total amount of money paid per rank is therefore the same across all groups, in all subjects, and in all grades, and is equal to $b = \frac{X_T}{3 \sum_{g=1}^{3} N_g}\cdot\frac{2}{99}$. For example, in the first year, the total prize money was $70,820 and total enrollment was 22,296 in pay for percentile schools. Therefore, the payment per 'rank' was $0.0178. For a student in the top 1%, teachers earned $1.77, and for a student in the top 50% teachers earned $0.89.
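As a consistency check on the payout formulas above, the per-rank payment can be verified to exhaust the subject-grade pot (a sketch with illustrative numbers; the helper name is ours):

```python
def pay_per_rank(pot, n_students):
    # b = (pot / n_students) * (2 / 99), as in the payout rule above.
    return pot / n_students * 2 / 99

# Budget check: with n students spread evenly over the 100 ranks
# (n / 100 students per rank, each paying the teacher b * rank points),
# total payments recover the subject-grade pot.
n, pot = 100, 1000.0
b = pay_per_rank(pot, n)
total_paid = sum(b * rank for rank in range(100)) * (n / 100)
# total_paid is (up to floating point) 1000.0
```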

Although this design can deliver socially optimal levels of effort, it can be challenging to implement at scale, particularly in settings with weak administrative capacity, such as Tanzania. For instance, maintaining the child-level databases of learning required to calculate teacher value-added, and ensuring the integrity of the testing system, are non-trivial administrative challenges. Moreover, the pay-for-percentile system may prove difficult to grasp for the individual teacher. It presents each teacher with a series of tournaments, so the bonus payoff is relatively hard to predict, even if the design guarantees a fair system. The uncertainty introduced by being pitted against students from schools across the whole country may dilute the incentive.

2.3 A Note on English

As Kiswahili is the official language of instruction in Tanzania, English is taught as a second language in primary schools. Because English is rarely spoken outside the classroom, English language skills are quite low in Tanzania. For instance, roughly 12 percent of Grade 3 students could pass a Grade 2 level test in English (Uwezo, 2012). Moreover, there is suggestive evidence that only the best students would be close to the proficiency threshold used in the Uwezo assessment (Mbiti et al., 2017). Given the challenges of teaching English in Tanzania, the subject was removed from the national curriculum in Grades 1 and 2 in 2015 to allow teachers to focus on numeracy and literacy in Kiswahili in those grades. English would still be taught in Grade 3, under a revised curriculum. However, there was little guidance from the Ministry of Education on how to transition to the new curriculum. Hence, there was substantial variation in how the curriculum changes were actually implemented by schools. Some schools stopped teaching English in 2015, while others stopped in 2016. In addition, there was no official guidance on whether to use Grade 1 English materials in Grade 3, as there were no new books issued to reflect the curriculum changes. As a result, we dropped English from the incentives in Grades 1 and 2 in 2016, but included Grade 3 English teachers. To avoid confusion, we also communicated that our end-of-year English test in 2016 would still use the pre-reform Grade 3 curriculum. Given these issues in the implementation of the curriculum reform, it is unclear how to interpret the results for English in Grades 1 and 2 in 2015, and for Grade 3 in both years.

3 Theoretical Framework

We present a set of simple models to clarify the potential behavioral responses by teachers and schools in our interventions. We first characterize equilibrium effort levels of teachers in both incentive systems, and then impose some additional structure and use numerical methods to obtain a set of qualitative predictions about the distribution of teacher effort across students of varying baseline learning levels.


3.1 Basic Setup

In our simple setup, students vary in their initial level of learning and are indexed by $l$. Further, each classroom of students is taught by a single teacher, indexed by $j$. We assume student learning (or test scores) at endline is determined by the following process:

$$a^l_j = a^l_{j(t-1)} + \gamma^l e^l_j + v^l_j$$

where $a^l_j$ is the endline learning level of a student with learning level $l$ taught by teacher $j$, and $a^l_{j(t-1)}$ is the student's baseline level of learning. $\gamma^l$ captures the productivity of teacher effort ($e^l_j$) and is assumed to be constant across teachers. In other words, we assume teachers are equally capable.12 $v^l_j$ is an idiosyncratic random shock to student learning. We assume that effort is costly, and that the cost function, $c^l(e^l_j)$, is twice differentiable and convex, such that $c^{l\prime}(\cdot) > 0$ and $c^{l\prime\prime}(\cdot) > 0$.

A social planner would choose teacher effort to maximize the total expected value of student learning, net of the total costs of teacher effort, as follows:

$$\sum_j \sum_l E\left(a^l_{j(t-1)} + \gamma^l e^l_j + v^l_j\right) - c^l(e^l_j)$$

The first order conditions for this problem are:

$$\gamma^l = c^{l\prime}(e^l_j) \quad (2)$$

for all $l$ and all $j$.
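To anticipate the quadratic-cost case used later in Section 3.4, condition (2) pins down the planner's effort level in closed form (a sketch under that assumed cost function):

```latex
c^l(e) = e^2 \;\Rightarrow\; c^{l\prime}(e) = 2e
\quad\Rightarrow\quad e^{l*}_j = \frac{\gamma^l}{2},
```

so the planner asks for more effort exactly where effort is more productive.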

3.2 Pay for Percentile

In the 'Pay for Percentile' design there are $L$ rank-order tournaments based on student performance, where $L$ is the number of student types or groupings, such that students in the same group are similar to each other. Under this incentive scheme, teachers maximize their expected payoffs, net of costs, from each rank-order tournament. The teacher's maximization problem becomes:

$$\sum_l \left( \sum_{k \neq j} \pi P\left(a^l_j > a^l_k\right) - c^l(e^l_j) \right)$$

12 Barlevy and Neal (2012) also impose this assumption in their basic setup.


The first order conditions for the teacher's problem are:

$$\sum_{k \neq j} \pi \gamma^l f^l\left(\gamma^l (e^l_j - e^l_k)\right) = c^{l\prime}(e^l_j)$$

for all $l$, where $f^l$ is the density function of $\varepsilon^l_{j,k} = v^l_j - v^l_k$.

In a symmetric equilibrium,

$$(N-1)\,\pi\,\gamma^l f^l(0) = c^{l\prime}(e^l) \quad (3)$$

Without loss of generality, if the cost function is the same across groups (i.e., $c^{l\prime}(x) = c^{\prime}(x)$) but the productivity of effort varies ($\gamma^l$), then the teacher will exert higher effort where he or she is more productive (since the cost function is convex). Pay for percentile can lead to an efficient outcome, as shown by Barlevy and Neal (2012), if the social planner's objective is to maximize total learning and the payoff is $\pi = \frac{1}{(N-1) f^l(0)}$.
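Under the stylized assumptions used later in Section 3.4 (standard normal shocks and quadratic cost $c(e) = e^2$), condition (3) can be evaluated directly. This sketch (function names ours) confirms that the efficient prize reproduces the planner's effort $\gamma/2$:

```python
import math

def p4p_effort(n_contestants, prize, gamma):
    """Symmetric-equilibrium effort from FOC (3) with c(e) = e^2 (c' = 2e).
    With v ~ N(0, 1), epsilon = v_j - v_k ~ N(0, 2), so f(0) = 1/(2*sqrt(pi))."""
    f0 = 1.0 / (2.0 * math.sqrt(math.pi))
    return (n_contestants - 1) * prize * gamma * f0 / 2.0

# With the efficient prize pi = 1 / ((N - 1) * f(0)), equilibrium effort
# matches the planner's optimum e = gamma / 2 from condition (2).
f0 = 1.0 / (2.0 * math.sqrt(math.pi))
pi_eff = 1.0 / (99 * f0)
e_star = p4p_effort(100, pi_eff, gamma=1.0)  # -> 0.5
```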

3.3 Levels

In our "levels" incentive scheme, teachers earn bonuses whenever a student's test score is above a pre-specified learning threshold. As each subject has multiple thresholds $t$, we can specify each teacher's maximization problem as:

$$\sum_l \left( \sum_t P\left(a^l_j > T_t\right) \frac{\Pi_t}{\sum_l \sum_n C^l_n P\left(a^l_n > T_t\right)} - c^l(e^l_j) \right)$$

where $T_t$ is the learning needed to unlock the threshold-$t$ payment, $\Pi_t$ is the total amount of money available for threshold $t$, and $C^l_n$ is the number of students of type $l$ in teacher $n$'s class.

Assuming $N$ is large, so that teachers ignore their own effect on the overall pass rates, the first order conditions for the teacher's maximization problem become:

$$\sum_t \gamma^l h^l\left(T_t - a^l_{j(t-1)} - \gamma^l e^l_j\right) \frac{\Pi_t}{\sum_l \sum_k C^l_k P\left(v^l_k > T_t - a^l_{k(t-1)} - \gamma^l e^l_k\right)} = c^{l\prime}(e^l_j) \quad (4)$$

for all $l$, where $h^l$ is the density function of $v^l_j$. Although we assume that each individual teacher's effort does not affect the overall pass rate, we cannot ignore this effect in equilibrium. Thus, we can characterize our symmetric equilibrium as:

$$\sum_t \gamma^l h^l\left(T_t - a^l_{(t-1)} - \gamma^l e^l\right) \frac{\Pi_t}{\sum_l N C^l_n P\left(v^l > T_t - a^l_{(t-1)} - \gamma^l e^l\right)} = c^{l\prime}(e^l) \quad (5)$$

for all $l$.

3.4 A Comparison of Optimal Teacher Effort

We impose additional structure on our conceptual framework and use numerical methods to generate qualitative predictions about the level of teacher effort across the baseline test-score distribution of students. We compute equilibrium teacher responses under two different stylized scenarios (or assumptions about the productivity of teacher effort in the production function) to illustrate how changes in these assumptions can alter equilibrium responses. The goal of this exercise is to highlight the importance of the production function specification for the distribution of learning gains in both of our treatments.

We assume that the teacher's cost function is quadratic (i.e., $c(e) = e^2$), and that the shock to student learning follows a standard normal distribution (i.e., $v_i \sim N(0,1)$). We further assume that there are 1,000 teachers, and each teacher has 17 students in his or her class. Within each class, we let student baseline learning levels be uniformly distributed from -4 to 4 in 0.5 intervals (i.e., there is one student with each baseline learning level in each class). We set the prize per student in both schemes at $1. Therefore, in the pay for percentile scheme the prize per contest won is $2/99 (see Section 3.2), and in the levels scheme the total prize is such that there is $1 per student. For simplicity, we assume that there are three proficiency thresholds; in the multiple threshold scenario the prize is held constant and split evenly across all three thresholds. We first compute the optimal teacher response assuming a single proficiency threshold, varying the threshold value from -1 to 1. We then compute the multiple threshold case by considering all three thresholds simultaneously.
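The single-threshold levels equilibrium (condition (5)) can be solved by damped fixed-point iteration. The sketch below shows one way to implement the exercise under the stated parameterization; solver details such as the damping factor and iteration count are our choices, not the paper's:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def sf(x):
    """P(v > x) for v ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def levels_equilibrium(threshold, gamma=1.0, iters=3000, damp=0.1):
    """Solve the single-threshold analogue of condition (5) with c(e) = e^2,
    17 baseline levels from -4 to 4, and a $1-per-student prize pot."""
    baselines = [-4.0 + 0.5 * i for i in range(17)]
    pot = 1.0 * len(baselines)            # prize pot per class
    effort = [0.0] * len(baselines)
    for _ in range(iters):
        gaps = [threshold - a - gamma * e for a, e in zip(baselines, effort)]
        pass_mass = sum(sf(g) for g in gaps)   # expected passers per class
        new = [gamma * phi(g) * pot / pass_mass / 2.0 for g in gaps]
        effort = [e + damp * (n - e) for e, n in zip(effort, new)]
    return baselines, effort

base, eff = levels_equilibrium(threshold=0.0)
# Effort is bell-shaped: highest near the threshold, negligible in the tails.
```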

Our numerical approach allows us to explore how teachers focus their efforts on students of different learning levels under both types of systems. Following the baseline model described in Barlevy and Neal (2012), we first assume that the productivity of teacher effort ($\gamma$) is constant and equal to one, regardless of a student's initial learning level. We then solve the model numerically. Figures 1a and 1b show the optimal teacher responses for different levels of student initial learning. Under the pay for percentile scheme, the optimal response would result in teachers exerting equal levels of effort across all their students, regardless of their initial learning level. In contrast, the multiple threshold levels scheme would result in a bell-shaped effort curve, where teachers would focus on students near the threshold and would not exert much effort on students in the tails of the initial learning distribution (see the solid line in Figure 1b). Thus, our numerical exercise suggests that if teacher productivity is invariant to the initial level of student learning, then the pay for percentile scheme will better serve students at the tails of the distribution.

Figure 1: Incentive design and optimal effort with constant productivity of teacher effort. (a) Gains: $\gamma$ constant across initial levels of learning. (b) Levels: $\gamma$ constant across initial levels of learning, with thresholds at -1, 0, and 1, and the multiple-threshold case. [Figure not reproduced; both panels plot effort against initial levels of learning.]

We relax the assumption of constant productivity of teacher effort, and allow it to vary with the initial learning levels of students. For simplicity, we specify a linear relationship between teacher productivity ($\gamma^l$) and student learning levels such that $\gamma^l = 2 + 0.5\,a^l_{(t-1)}$. Figures 2a and 2b show the numerical solutions of optimal teacher effort for different initial levels of student learning. In the pay for percentile system, focusing on better prepared students increases the likelihood of winning the rank-order contest (among that group of students), while the marginal unit of effort applied to the least prepared students has a relatively smaller effect on the likelihood of winning the rank-order tournament among that group of students. Thus, in equilibrium, teachers will focus more on better prepared students, and will not have an incentive to deviate from this strategy, given the structure and payoffs of the tournament. In contrast, the levels scheme yields a bell-shaped curve similar to, but slightly skewed relative to, the baseline constant productivity case.


Our numerical exercise suggests that testing for the equality of treatment effects across the distribution of student baseline test scores in the pay for percentile arm allows us to better understand the specification of teacher effort in the education production function. Moreover, the exercise suggests that the teacher effort response curve is less sensitive to changes in the production function under the levels system.

Figure 2: Incentive design and optimal effort when the productivity of teacher effort is correlated with the initial level of student learning. (a) Gains: $\gamma$ increases with initial levels of learning. (b) Levels: $\gamma$ increases with initial levels of learning, with thresholds at -1, 0, and 1, and the multiple-threshold case. [Figure not reproduced; both panels plot effort against initial levels of learning.]

4 Data and Empirical Specification

4.1 Sample Selection

The teacher incentive programs were evaluated using a randomized design. First, ten districts were randomly selected (see Figure 3).13 The study sample of 180 schools was taken from a previous field experiment, studied by Mbiti et al. (2017), where all students in Grades 1, 2, and 3 had been tested at the end of 2014. These tests provided the baseline student-level test score information required to implement the pay for percentile treatment. Within each district, we randomly allocated schools to one of our three experimental groups. Thus, in each district, 6 schools were assigned to the 'levels' treatment, 6 schools to the pay for percentile treatment, and 6 schools served as controls. In total, there are 60 schools in each group. The sample was also stratified by treatment arm of the previous RCT and by an index of the overall learning level of students in each school. Further details are included in Appendix A.1.

13 The program was administered in 11 districts, as one district was included non-randomly by Twaweza for piloting and training. We do not survey schools in this district.

Figure 3: Districts in Tanzania from which schools are selected

Note: We drew a nationally representative sample of 180 schools from a random sample of 10 districts in Tanzania.

4.2 Data and Balance

Over the two-year evaluation our survey teams visited each school at the beginning and end of the year. We gathered detailed information about each school from the head teacher, including: facilities, management practices, and head teacher characteristics. We also conducted individual surveys with each teacher in our evaluation, covering personal characteristics such as education and experience, and effort measures, such as teaching practices. In addition, we conducted classroom observations, where we recorded teacher-student interactions and other measures of teacher effort, such as teacher absence.

Within each school we surveyed and tested a random sample of 40 students (10 students each from Grades 1, 2, 3, and 4). Grade 4 students were included in our research sample in order to measure potential spillovers to other grades. Students in Grades 1, 2, and 3 who were sampled in the first year of the program were tracked over the two-year evaluation period. Students in Grade 4 in the first year were not tracked into Grade 5 due to budget constraints. In the second year of the program we sampled an additional 10 incoming Grade 1 students. We collected a variety of data from our student sample including test scores, individual characteristics such as age and gender, and perceptions of the school environment. Crucially, the test scores collected from this sample of students are 'low-stakes' for teachers and students. We supplement the results from this set of 'low-stakes' student tests with the results from the 'high-stakes' tests, which are used to determine teacher bonus payments and are conducted in all schools, including control schools.

Although the content (subject order, question type, phrasing, difficulty level) is consistent across low- and high-stakes tests, there are a number of important differences in test administration. The low-stakes test took longer (40 minutes) than the high-stakes test (15 minutes). The low-stakes test had more questions in each section (Kiswahili, English, and math) to avoid bottom- and top-coding, and also included an 'other subject' module at the end to test spillover effects. The testing environment was also different. Low-stakes tests were administered by an enumerator, who took children out of the classroom and tested them one by one during a regular school day. In the high-stakes test, all students in Grades 1-3 were tested on an agreed test day. As most schools used the high-stakes test as the end-of-year test, students in higher grades were often given a day off, while Twaweza test teams administered the one-on-one tests in designated classrooms. A number of measures were introduced to enhance test security. First, to prevent test taking by non-target-grade candidates, students could only be tested if their name had been listed and their photo taken at baseline. Each student listed at baseline received an individual pre-printed test form. One test out of ten test sets was randomly assigned to each student for additional test security. Tests were handled, administered, and electronically scored by Twaweza teams without any interference from teachers.

Most student, school, teacher, and household characteristics are balanced across treatment arms (see Table 2, Column 4). The average student in our sample is 8.9 years old in 2013, attends a school with 679 students, and is taught by a teacher who is 38 years old. We are able to track 88% of the students in our sample at the end of the second year.


Table 2: Summary statistics across treatment groups at baseline (February 2015)

                               (1)        (2)        (3)       (4)
                             Control   P4Pctile    Levels    p-value (all equal)

Panel A: Students
Age                            8.88       8.94       8.89      0.35
                              (1.60)     (1.67)     (1.60)
Male                           0.50       0.48       0.51      0.05*
                              (0.50)     (0.50)     (0.50)
Kiswahili test score          -0.00       0.01       0.01      0.14
                              (1.00)     (0.99)     (0.98)
English test score             0.00       0.04      -0.02      0.71
                              (1.00)     (1.03)     (1.04)
Math test score               -0.00      -0.01      -0.01      0.56
                              (1.00)     (1.04)     (1.00)
Tested in yr0                  0.91       0.89       0.90      0.41
                              (0.29)     (0.31)     (0.30)
Tested in yr1                  0.87       0.87       0.88      0.20
                              (0.33)     (0.34)     (0.32)
Tested in yr2                  0.88       0.88       0.89      0.56
                              (0.33)     (0.32)     (0.32)
Poverty index (PCA)            0.01      -0.08       0.01      0.42
                              (1.99)     (1.94)     (1.98)

Panel B: Schools
Total enrollment             643.42     656.35     738.37      0.67
                            (331.22)   (437.74)   (553.33)
Facilities index (PCA)         0.18      -0.11      -0.24      0.07*
                              (1.23)     (0.97)     (1.01)
Urban                          0.15       0.13       0.17      0.92
                              (0.36)     (0.34)     (0.38)
Single shift                   0.63       0.62       0.62      0.95
                              (0.49)     (0.49)     (0.49)

Panel C: Teachers (Grades 1-3)
Male                           0.42       0.38       0.35      0.19
                              (0.49)     (0.49)     (0.48)
Age (Yrs)                     37.89      37.02      37.70      0.18
                             (11.35)    (11.23)    (11.02)
Experience (Yrs)              13.97      12.91      13.54      0.11
                             (11.93)    (11.47)    (11.14)
Private school experience      0.03       0.01       0.03      0.05*
                              (0.17)     (0.11)     (0.17)
Tertiary education             0.87       0.88       0.87      0.74
                              (0.33)     (0.32)     (0.33)

This table presents the mean and standard error of the mean (in parentheses) for several characteristics of students (Panel A), schools (Panel B), and teachers (Panel C) across treatment groups. Column 4 shows the p-value from testing whether the mean is equal across all treatment groups (H0: mean is equal across groups). Randomization was stratified by district, previous treatment arm, and quality strata. The quality strata variable for schools was created using principal component analysis on students' test scores. Schools were categorized into one of two strata depending on whether they were above or below the median for the first principal component. The p-value is for a test of equality of means, after controlling for the stratification variables used during randomization. The poverty index is the first component from a principal component analysis of the following assets: mobile phone, watch/clock, refrigerator, motorbike, car, bicycle, television, and radio. The school facilities index is the first component from a principal component analysis of indicator variables for: outer wall, staff room, playground, library, and kitchen. Standard errors are clustered at the school level for the test of equality. * p < 0.10, ** p < 0.05, *** p < 0.01


4.3 Empirical Specification

We estimate the effect of our interventions on students' test scores using the following OLS specification:

$$Z_{isdt} = \delta_0 + \delta_1 \text{Levels}_s + \delta_2 \text{P4Pctile}_s + \gamma_z Z_{isd,t=0} + \gamma_d + \gamma_g + X_i \delta_3 + X_s \delta_4 + \varepsilon_{isd}, \quad (6)$$

where $Z_{isdt}$ is the test score of student $i$ in school $s$ in district $d$ at time $t$. Levels and P4Pctile are binary variables that capture the treatment assignment of each school. $\gamma_d$ is a set of district fixed effects, $\gamma_g$ is a set of grade fixed effects, $X_i$ is a vector of student characteristics (age, gender, and grade), and $X_s$ is a set of school characteristics including facilities, students per teacher, school committee characteristics, average teacher age, average teacher experience, average teacher qualifications, and the fraction of female teachers.
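Equation (6) can be estimated with any OLS routine. As a minimal sketch on simulated data (variable names and the simulated coefficients are ours; the actual estimation adds fixed effects, the full control set, and school-clustered standard errors):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Three-arm assignment: control, levels, or pay for percentile.
arm = rng.integers(0, 3, n)
levels = (arm == 1).astype(float)
p4pctile = (arm == 2).astype(float)
baseline = rng.normal(size=n)
# Simulated endline score with illustrative treatment effects.
endline = 0.10 * levels + 0.07 * p4pctile + 0.5 * baseline + rng.normal(size=n)

# OLS: endline on intercept, treatment dummies, and baseline score.
X = np.column_stack([np.ones(n), levels, p4pctile, baseline])
beta, *_ = np.linalg.lstsq(X, endline, rcond=None)
# beta[1] and beta[2] are the delta_1 (levels) and delta_2 (P4Pctile) estimates.
```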

We scale our test scores using an IRT model and then normalize them using the mean and variance of the control schools to facilitate a clear interpretation of our results. We include baseline test scores and district fixed effects in our specifications to increase precision. We also balanced the timing of our survey activities, including the low-stakes tests, on a weekly basis across treatment arms. Hence, our results are not driven by imbalanced survey timing.
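The control-group normalization step can be sketched as follows (a minimal helper of our own; the actual scaling first fits the IRT model):

```python
def normalize_to_control(scores, is_control):
    """Standardize scores using the control-group mean and standard
    deviation, so treatment effects read in control-group SD units."""
    ctrl = [s for s, c in zip(scores, is_control) if c]
    mu = sum(ctrl) / len(ctrl)
    var = sum((s - mu) ** 2 for s in ctrl) / (len(ctrl) - 1)
    sd = var ** 0.5
    return [(s - mu) / sd for s in scores]

z = normalize_to_control([0.0, 2.0, 4.0, 6.0], [True, True, True, False])
# -> [-1.0, 0.0, 1.0, 2.0]: control scores now have mean 0 and SD 1
```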

We examine the impacts of the incentives using both the low-stakes (survey) and high-stakes (intervention) testing data. However, given the limited set of student characteristics in the high-stakes data, this analysis includes fewer student-level controls.

We also use a similar specification to examine teacher behavioral responses. We further explore heterogeneity in treatment effects by interacting our treatment indicators with a variety of school, teacher, and student characteristics. These include student gender and age, teacher content knowledge, and school facilities and pupil-teacher ratios. We also explore heterogeneity in treatment effects by baseline student ability to examine whether teachers focus their efforts on a particular type of student. Specifically, we assign students to quintiles based on their baseline test scores, and then estimate the difference between the treatment and the control group within each quintile.
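The quintile assignment can be sketched as follows (a helper of our own; tie handling and the within-quintile treatment-control comparison are omitted):

```python
def assign_quintiles(baseline_scores):
    """Assign each student to a baseline-score quintile (1 = lowest)."""
    order = sorted(range(len(baseline_scores)), key=lambda i: baseline_scores[i])
    quintile = [0] * len(baseline_scores)
    for pos, i in enumerate(order):
        quintile[i] = pos * 5 // len(baseline_scores) + 1
    return quintile

q = assign_quintiles(list(range(10)))
# -> [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```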

5 Results

5.1 Test Scores

Table 3 shows the impact of the incentives on student learning in math and Kiswahili using data from the low-stakes test (Panel A), as well as the high-stakes test (Panel B).


In the first year, both incentive schemes resulted in small learning gains on the low-stakes test (see Panel A in Table 3). However, the treatment effects of the levels incentive scheme were consistently larger than those of the pay for percentile system (Panel A, Columns 1 and 2). In particular, the effect in Kiswahili was larger by .08σ (p-value .077). In the second year of the program, we find that the estimated treatment effects on the low-stakes test are generally larger than the first-year estimates. Math test scores improved by .067σ (p-value .09) in the levels system, and by .07σ (p-value .056) in the pay for percentile system. Kiswahili test scores improved by .11σ (p-value < 0.01) under the levels system, but only by .056σ (p-value .11) under the pay for percentile system. Finally, English test scores improved by .11σ (p-value .19) in the levels system, but by .19σ (p-value .02) in the pay for percentile scheme. Although the results show that the estimated learning gains are generally larger under the levels system, formal hypothesis tests show that the differences are only significant for Kiswahili in year one (Panel A, Column 2).

Most of the existing literature on teacher incentives relies on data from the high-stakes tests that are used to determine teacher rewards (Muralidharan & Sundararaman, 2011; Fryer, 2013; Neal & Schanzenbach, 2010). Following this norm, we also present the treatment effects of our interventions using the high-stakes exams (Panel B). Generally, the estimated treatment effects are larger than those estimated using the low-stakes data (Panel A). However, these differences are not statistically significant in most cases (Panel C). In addition, the differences between the estimated treatment effects across the two incentive designs tend to be larger in the high-stakes data. For example, the effect of the levels scheme is larger for Kiswahili by .11σ (p-value .026) in the first year, and by .093σ (p-value .045) in the second year.

The larger treatment effects found in the high-stakes data are likely driven by test-taking effort, as teachers have incentives to motivate their students to take the tests seriously. The importance of student test-taking effort has been documented in other settings, such as an evaluation of teacher and student incentives in Mexico City (Behrman, Parker, Todd, & Wolpin, 2015). As discussed in Section 4.2, the administration of the high-stakes test was tightly controlled and conducted by our implementation team. This mitigates any concerns about outright cheating.14 Assuming that all the differences between our high-stakes and low-stakes results are driven by test-taking effort, this suggests that student effort can increase test score results by between 0.02σ and 0.2σ (see Panel C). This is generally in line with the findings of Gneezy et al. (2017).

Given the reward structure, teachers in both treatment arms would be motivated to ensure that their students took the high-stakes exam. In the second year of the study, teachers in the levels schools increased student participation in the high-stakes exam by 5 percentage points. Their counterparts in pay for percentile schools increased participation by 3 percentage points (see Table A.3). Following Lee (2009), we compute bounds on the treatment effects by trimming the excess test takers from the left and right tails of the high-stakes test distribution, respectively. Focusing on the year two results for brevity, this bounding exercise suggests that the treatment effects for math range from -0.023 to 0.32 in the levels treatment and from 0.014 to 0.17 in the pay for percentile treatment. The bounds for Kiswahili range from 0.027 to 0.35 in the levels treatment and from -0.0032 to 0.17 in the pay for percentile treatment (see Table A.4).

14 Details of the Twaweza test data integrity process are available on request.
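The bounding logic can be sketched as a count-based simplification of Lee (2009) trimming (helper name ours; the paper's implementation trims by the difference in participation rates):

```python
def lee_bounds(treat_scores, control_scores):
    """Trim the 'excess' treatment-group test takers from the top of the
    score distribution (lower bound) or from the bottom (upper bound),
    then compare means against the control group."""
    cbar = sum(control_scores) / len(control_scores)
    excess = len(treat_scores) - len(control_scores)
    if excess <= 0:
        d = sum(treat_scores) / len(treat_scores) - cbar
        return d, d
    srt = sorted(treat_scores)
    kept = len(treat_scores) - excess
    lower = sum(srt[:kept]) / kept - cbar    # drop the top scorers
    upper = sum(srt[excess:]) / kept - cbar  # drop the bottom scorers
    return lower, upper

lo, hi = lee_bounds([0, 1, 2, 3, 4, 5], [1, 2, 3])
# -> (-1.0, 2.0)
```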


Table 3: Effect on test scores

                                   (1)     (2)       (3)      (4)     (5)       (6)
                                          Year 1                     Year 2
                                  Math  Kiswahili  English   Math  Kiswahili  English

Panel A: Low-stakes
Levels (α1)                       .038    .044      .014    .067*    .11***    .11
                                 (.047)  (.047)    (.086)  (.039)   (.039)    (.085)
P4Pctile (α2)                    -.017   -.035     -.049    .07*     .056      .19**
                                 (.04)   (.039)    (.076)  (.037)   (.035)    (.081)
N. of obs.                       4,781   4,781     1,532   4,869    4,869     1,533
Gains-Levels (α3 = α2 − α1)      -.055   -.08*     -.063    .003    -.057      .079
p-value (H0: α3 = 0)              .21     .077      .41     .95      .16       .3

Panel B: High-stakes
Levels (β1)                       .11**   .13***    .18***  .14***   .18***    .28***
                                 (.047)  (.048)    (.067)  (.045)   (.046)    (.069)
P4Pctile (β2)                     .066*   .017      .16***  .093**   .085*     .23***
                                 (.039)  (.043)    (.058)  (.04)    (.045)    (.055)
N. of obs.                      48,077  48,077    14,664  59,680   59,680    15,458
Gains-Levels (β3 = β2 − β1)      -.047   -.11**    -.014   -.044    -.093**   -.047
p-value (H0: β3 = 0)              0.30    0.026     0.83    0.31     0.045     0.53

Panel C: High-stakes − Low-stakes
β1 − α1                           .065    .075      .14     .063     .056      .15
p-value (β1 − α1 = 0)             .13     .097      .12     .11      .2        .14
β2 − α2                           .078    .046      .2      .021     .025      .041
p-value (β2 − α2 = 0)             .072    .29       .017    .6       .55       .64
β3 − α3                           .012   -.029      .056   -.042    -.031     -.11
p-value (β3 − α3 = 0)             .78     .53       .52     .3       .51       .28

Results from estimating Equation 6 for different subjects at both follow-ups. Panel A uses data from the low-stakes exam taken by a sample of students. Control variables include student characteristics (age, gender, grade, and lagged test scores) and school characteristics (PTR, an infrastructure PCA index, a PCA index of how close the school is to different facilities, and an indicator for whether the school is single shift or not). Panel B uses data from the high-stakes exam taken by all students. Control variables include student characteristics (gender and grade) and school characteristics (PTR, an infrastructure PCA index, a PCA index of how close the school is to different facilities, and an indicator for whether the school is single shift or not). Panel C tests the difference between the treatment estimates in Panels A and B. Clustered standard errors, by school, in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01

5.2 Spillovers to Other Grades and Subjects

As the teacher incentives only covered math, English, and Kiswahili in Grades 1, 2, and 3,there are concerns that teachers and schools could focus on these grades and subjects tothe detriment of other grades and subjects. On the one hand, schools may shift resources

23

Page 25: Left Behind by Optimal Design: The Challenge of Designing … · 2020-06-10 · Left Behind by Optimal Design: The Challenge of Designing Effective Teacher Performance Pay Programs

such as textbook purchases from higher grades to Grades 1, 2, and 3. Additionally,teachers may cut back on teaching non-incentivized subjects, such as science. On theother hand, if our incentive programs improve literacy and numeracy skills, they maypromote student learning in other subjects, and these gains may persist over time. Inorder to asses possible spillovers, we examine learning outcomes in science for Grades1, 2, and 3. We also examine test scores in Grade 4 to test for any negative spillovers inhigher grades, as well as the persistence of any learning gains induced by the program(in the second year of the evaluation).

Overall, we do not see decreases in the test scores of fourth graders, which suggests that schools were not disproportionately shifting resources away from higher grades (Table 4, Panel A). As third graders in the first year of our program transitioned to fourth grade in the second year of the program, we can use the fourth-grade results in the second year (Table 4, Panel A, Columns 3 and 4) to explore the persistence of any learning gains produced by the incentive programs. Although the point estimates are mostly positive, they are not statistically significant.15 This is consistent with learning gains from both incentive programs fading out over time. Contrary to the concerns of teacher performance pay critics, the effects of both programs on science test scores are generally positive, suggesting that any estimated gains attributable to the incentives are not coming at the expense of learning in other subjects or domains (see Table 4, Panel B).

15 The standard errors are larger than those in Panel A of Table 3 due to smaller sample sizes.
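The treatment-contrast test reported in the bottom rows of Tables 4 to 6 (the difference α3 = α2 − α1, with standard errors clustered by school) can be sketched as follows. This is an illustrative Python simulation, not the study's code or data; all variable names, sample sizes, and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 180 schools, 30 sampled students each, randomized by school
n_schools, n_students = 180, 30
school = np.repeat(np.arange(n_schools), n_students)
arm = np.repeat(rng.integers(0, 3, n_schools), n_students)  # 0=control, 1=levels, 2=P4Pctile
levels = (arm == 1).astype(float)
p4pctile = (arm == 2).astype(float)
school_shock = np.repeat(rng.normal(0, 0.3, n_schools), n_students)
y = 0.10 * levels + 0.05 * p4pctile + school_shock + rng.normal(0, 1, n_schools * n_students)

# OLS of test scores on the two treatment dummies: beta = [const, a1, a2]
X = np.column_stack([np.ones_like(y), levels, p4pctile])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Cluster-robust (CR1) covariance: sum the per-school score vectors
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((3, 3))
for g in range(n_schools):
    Xg, ug = X[school == g], resid[school == g]
    sg = Xg.T @ ug
    meat += np.outer(sg, sg)
n, k = len(y), X.shape[1]
vcov = n_schools / (n_schools - 1) * (n - 1) / (n - k) * bread @ meat @ bread

# Wald test of a3 = a2 - a1 = 0
c = np.array([0.0, -1.0, 1.0])
a3 = c @ beta
se_a3 = np.sqrt(c @ vcov @ c)
print(f"a1={beta[1]:.3f}  a2={beta[2]:.3f}  a3={a3:.3f}  SE={se_a3:.3f}  z={a3 / se_a3:.2f}")
```

The covariance matrix treats each school as one independent unit, which matches the design: treatment was assigned at the school level, so student errors within a school are allowed to be arbitrarily correlated.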


Table 4: Spillovers to other grades and subjects

                             (1)     (2)       (3)      (4)     (5)       (6)
Panel A: Grade 4
                                    Year 1                      Year 2
                             Math    Kiswahili English  Math    Kiswahili English
Levels (α1)                  .13∗∗   .044      .17∗∗    .061    .041      .082
                             (.062)  (.05)     (.084)   (.063)  (.065)    (.069)
P4Pctile (α2)                -.03    -.032     .032     -.0054  .026      .058
                             (.054)  (.054)    (.077)   (.06)   (.061)    (.063)
N. of obs.                   1,513   1,513     1,513    1,482   1,482     1,482
Gains-Levels α3 = α2 − α1    -.16∗∗  -.077     -.14∗    -.067   -.014     -.025
p-value (H0 : α3 = 0)        .011    .13       .077     .25     .82       .72

Panel B: Science (Grades 1-3)
                             Year 1  Year 2
Levels (α1)                  .069    .083
                             (.063)  (.06)
P4Pctile (α2)                -.005   .079
                             (.05)   (.057)
N. of obs.                   4,781   4,869
Gains-Levels α3 = α2 − α1    -.074   -.0044
p-value (H0 : α3 = 0)        .24     .94

Results from estimating Equation 6 for Grade 4 students (Panel A) and for students in Grades 1-3 in science (Panel B). Control variables include student characteristics (age, gender, grade, and lagged test scores) and school characteristics (PTR, an infrastructure PCA index, a PCA index of how close the school is to different facilities, and an indicator for whether the school is single shift or not). Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

5.3 Teacher Effort

In this section, we examine teacher responsiveness to the incentives. We use teacher presence in school and in the classroom as broad measures of teacher effort. Teacher presence is measured by our survey team, shortly after the team arrives at a school in the morning. Overall, we do not find any differences in this dimension of teacher effort across our treatments (see Table 5, Panel A). We also examine student reports about teacher effort, such as assigning homework and providing extra help. In the first year of the program, students report receiving more help from teachers in the levels system relative to the pay for percentile system. This difference is statistically significant at the 7 percent level (see Table 5, Panel B, Column 1). Students also report receiving more homework from teachers in the levels system relative to the pay for


percentile system, although the difference is not statistically significant (Panel B, Column 2). However, in the second year, students report receiving similar levels of extra help and homework assignments from teachers in both incentive systems; the point estimate on receiving extra help from pay for percentile teachers is significant (Panel B, Column 3).

Table 5: Teacher Presence and Teaching Strategies

                             (1)         (2)           (3)         (4)
Panel A: Spot-checks
                                   Year 1                    Year 2
                             In school   In classroom  In school   In classroom
Levels (α1)                  0.012       0.0061        -0.025      0.025
                             (0.053)     (0.057)       (0.050)     (0.053)
P4Pctile (α2)                -0.012      -0.023        -0.0050     0.023
                             (0.044)     (0.050)       (0.044)     (0.044)
N. of obs.                   180         180           180         180
Mean control                 .71         .32           .67         .37
Gains-Levels α3 = α2 − α1    -.024       -.029         .02         -.0021
p-value (H0 : α3 = 0)        .65         .6            .71         .97

Panel B: Student reports
                                   Year 1                    Year 2
                             Extra help  Homework      Extra help  Homework
Levels (α1)                  0.011       0.033         0.0052      0.0029
                             (0.018)     (0.024)       (0.0096)    (0.018)
P4Pctile (α2)                -0.022      -0.0055       0.016∗      -0.023
                             (0.017)     (0.024)       (0.0097)    (0.019)
N. of obs.                   9,006       9,006         9,557       9,557
Mean control                 .12         .1            .018        .093
Gains-Levels α3 = α2 − α1    -.034∗      -.038         .011        -.026
p-value (H0 : α3 = 0)        .073        .16           .29         .24

Panel A presents teacher-level data on teacher absenteeism (Columns 1 and 3) and time-on-task (Columns 2 and 4). Panel B presents student-level data on teacher effort (as reported by students): extra help (Columns 1 and 3) and homework (Columns 2 and 4). Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

As in-class observations are typically affected by Hawthorne effects, our survey teams collected data on teacher behavior while standing outside the classroom for a few minutes, before teachers noticed they were being observed. Although these reports are not as detailed as within-classroom observation protocols, they are arguably better able to capture


broad measures of typical teacher behavior.16 Our findings are shown in Table 6, and we focus on the estimated differences between the two incentive systems reported in the bottom row (α3). We do not find any statistically significant differences in the likelihood that we observed teachers teaching, although the point estimates are larger for levels teachers (Column 1). Teachers in pay for percentile schools were 2.2 percentage points (almost 50 percent) less likely to be engaged in classroom management activities (such as taking attendance or disciplining students) compared to levels teachers (Column 2). Teachers in pay for percentile schools were also 7.7 percentage points (29 percent) more likely to be off task, or engaged in irrelevant activities such as reading a newspaper or sending a text message (Column 3). Finally, we do not observe differences between the two incentives in distracted or off-task students, although the coefficient for pay for percentile schools shows a marginally significant reduction in student distraction (Column 4).

Table 6: External classroom observation

                             (1)       (2)                   (3)               (4)
                             Teaching  Classroom management  Teacher off task  Student off task
Levels (α1)                  0.011     -0.0016               -0.011            -0.0068
                             (0.043)   (0.010)               (0.042)           (0.018)
P4Pctile (α2)                -0.048    -0.024∗∗              0.066∗            -0.023∗
                             (0.036)   (0.011)               (0.035)           (0.014)
N. of obs.                   2,080     2,080                 2,080             2,080
Control mean                 .69       .041                  .27               .048
Gains-Levels α3 = α2 − α1    -.059     -.022∗∗               .077∗             -.016
p-value (H0 : α3 = 0)        .2        .037                  .082              .28

The outcome variables in this table come from independent classroom observations performed by the research team for a few minutes, before teachers noticed they were being observed. Teachers are classified as doing one of three activities: teaching (Column 1), managing the classroom (Column 2), or being off task (Column 3). If students are distracted, we classify the class as having students off task (Column 4). Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

16 Schools in Tanzania have open layouts where classrooms are built in blocks, with open space in the middle. This allows surveyors to simply stand in the open space and observe the class from a distance.


5.4 Heterogeneity by Student Characteristics

We explore heterogeneity in treatment effects across the distribution of student baseline test scores in Figures 4 (math), 5 (Kiswahili), and 6 (English).17 In addition to providing evidence on which students benefit more from the incentives, the analysis also sheds light on the functional form of the productivity of teacher effort. In particular, if the treatment effects in the pay for percentile system are equal across all student ability groups, this would imply that the productivity of teacher effort does not vary by student ability, as shown in Figure 1. However, if better prepared students benefit more in the pay for percentile scheme, this would suggest that the productivity of teacher effort is higher for better prepared students, as shown in Figure 2.

In the first year of the program, both math and Kiswahili teachers in the pay for percentile system (labeled “P4Pctile”) focused their attention on their best students, whereas teachers in the levels system (labeled “Levels”) focused on the top half of their class (Figures 4a and 5a). English teachers under both systems seem to focus more on top students, although none of the individual quintile estimates are statistically significant (Figure 6a). In the second year of the program, we do not see such an overt focus on top students in mathematics in either incentive system (Figure 4b). However, Kiswahili teachers under the levels system focused on all of their students in the second year, while teachers in the pay for percentile system focused on the very best students (Figure 5b). In contrast, English teachers in the levels scheme focused on the top students, while teachers in the pay for percentile scheme seem to focus more on the middle quintiles (Figure 6b).

Overall, the pay for percentile results in math (Year 1) and Kiswahili (both years) suggest that the productivity of teacher effort is higher among better prepared students. The results in English are less informative, given the changes in the curriculum and the general difficulty of teaching the language in Tanzania.18

17 We also explore heterogeneity by additional student characteristics, such as gender, as well as school characteristics, such as the pupil-teacher ratio, and find limited evidence of heterogeneity along those characteristics (see Tables A.5 and A.6 for details).

18 Previous research that examined a simple single-proficiency incentive scheme for teachers found that teachers in Kiswahili and math focused on students in the middle of the distribution, while teachers in English focused on the top students (Mbiti et al., 2017).
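The quintile-by-quintile treatment effects plotted in Figures 4 to 6 come from interacting the treatment indicator with baseline-quintile dummies. The Python sketch below illustrates that specification on simulated data; the effect sizes, sample size, and variable names are all hypothetical, and controls and clustering are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
treat = rng.integers(0, 2, n).astype(float)
baseline = rng.normal(0, 1, n)

# Assign quintiles from the baseline test-score distribution
edges = np.quantile(baseline, [0.2, 0.4, 0.6, 0.8])
quintile = np.searchsorted(edges, baseline)          # 0..4

# Hypothetical effect rising in baseline ability (a Year 1 P4Pctile-like pattern)
true_effect = np.array([0.0, 0.02, 0.05, 0.10, 0.20])
y = 0.5 * baseline + treat * true_effect[quintile] + rng.normal(0, 1, n)

# Quintile dummies plus treatment-by-quintile dummies give one effect per quintile
Q = np.eye(5)[quintile]
X = np.column_stack([Q, Q * treat[:, None], baseline[:, None]])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
effects = beta[5:10]                                  # quintile-specific treatment effects
print("effect by quintile:", np.round(effects, 3))
```

The figures' hypothesis tests (Q1 = Q5 and joint equality across quintiles) are Wald tests on these interaction coefficients.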


Figure 4: Math

[Figure: treatment effects on math test scores by quintile of baseline test score (y-axis: treatment effect; x-axis: quintile 1-5), with separate panels for the Levels and P4Pctile arms.]

(a) Year 1. Levels: p-value(H0:Q1=Q5) = .11; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .59. P4Pctile: p-value(H0:Q1=Q5) = .00069; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .017.

(b) Year 2. Levels: p-value(H0:Q1=Q5) = .97; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .95. P4Pctile: p-value(H0:Q1=Q5) = .37; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .33.

Figure 5: Kiswahili

[Figure: treatment effects on Kiswahili test scores by quintile of baseline test score (y-axis: treatment effect; x-axis: quintile 1-5), with separate panels for the Levels and P4Pctile arms.]

(a) Year 1. Levels: p-value(H0:Q1=Q5) = .085; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .51. P4Pctile: p-value(H0:Q1=Q5) = .0016; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .027.

(b) Year 2. Levels: p-value(H0:Q1=Q5) = .87; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .9. P4Pctile: p-value(H0:Q1=Q5) = .13; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .043.


Figure 6: English

[Figure: treatment effects on English test scores by quintile of baseline test score (y-axis: treatment effect; x-axis: quintile), with separate panels for the Levels and P4Pctile arms.]

(a) Year 1. Levels: p-value(H0:Q1=Q5) = .6; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .34. P4Pctile: p-value(H0:Q1=Q5) = .27; p-value(H0:Q1=Q2=Q3=Q4=Q5) = .37.

(b) Year 2 (quintiles 1, 2, 4, and 5 shown). Levels: p-value(H0:Q1=Q5) = .0026; p-value(H0:Q1=Q2=Q4=Q5) = .018. P4Pctile: p-value(H0:Q1=Q5) = .5; p-value(H0:Q1=Q2=Q4=Q5) = .61.

5.5 Mechanisms

Given the limited empirical evidence on pay for percentile schemes, we rely on the large body of theoretical and empirical evidence on rank-order tournaments to guide our analysis of the potential mechanisms that drive the differences in behavior and outcomes between the two types of incentives. We focus particularly on differences in the incentive structure of the two systems. For instance, the levels system is easier to understand and could provide clear learning targets for classrooms, compared to the pay for percentile system. In addition, we explore the importance of a number of theoretically relevant teacher characteristics, such as teacher ability, as well as differences in expectations, cooperation, and goal-setting.

5.5.1 Heterogeneity by Teacher Characteristics

Table 7 explores heterogeneity by teacher characteristics, including gender, age, content knowledge, and measures of teacher effectiveness, across all three subjects. There is a growing body of evidence showing that women tend to be more averse to competition and to exert relatively less effort than men in competitive situations, such as rank-order tournaments (Niederle & Vesterlund, 2007, 2011). Although the pay for percentile scheme is more competitive than the levels scheme, we do not find that women are less responsive to its competitive pressure (Column 1). We also do not find any heterogeneous effects by teacher age, which proxies for experience.

Previous studies on rank-order tournaments, such as Brown (2011) and Schotter and Weigelt (1992), have shown that heterogeneity in participant ability can negatively impact the efficacy of tournaments. For example, if a tournament features a number of strong players, then less-able players may (correctly) surmise that they have a limited chance of winning. As a result, such players may be discouraged from increasing their effort when faced with strong players. Similarly, strong players may also reduce their effort when faced with less-able competitors (Brown, 2011). In the context of our study, this theory suggests that heterogeneity in ability among teachers may reduce the efficacy of the incentives among both the least and most able teachers in the pay for percentile scheme. In contrast, we would not expect to find a discouragement effect of heterogeneity in the levels incentives. We use three different measures of teacher ability to explore the heterogeneity in treatment effects: an index of teacher content knowledge, scaled by an IRT model; an index of head teacher evaluations of individual teacher effectiveness; and an index of teachers' perceptions of their own self-efficacy.19

Although studies such as Metzler and Woessmann (2012) have shown that teacher content knowledge is predictive of student learning outcomes, we do not find any significant heterogeneity in our treatment effects by teacher content knowledge (Column 3). We further find that more effective teachers, as measured by the head teacher's rating, are more responsive on average to the levels incentives compared to teachers in the pay for percentile system; this difference is significant for math (Panel A, Column 4). Our findings could potentially reflect a greater discouragement effect among pay for percentile teachers relative to levels teachers. We also examine heterogeneous effects by teacher beliefs about their individual efficacy in Column 5. We generally find that teachers who believed they were more capable responded more to both incentives (Column 5, Panels A and B), although we find the reverse relationship in English (Column 5, Panel C). Overall, the patterns of heterogeneity by teachers' self-ratings are statistically indistinguishable across the two incentive designs.

19 Teachers were tested on all three subjects, and we created an index of content knowledge using an IRT model. Head teachers were asked to rate teacher performance on seven dimensions, including the ability to ensure that students learn and classroom management skills. We create an index based on teacher responses to the following five statements: ‘I am capable of motivating students who show low interest in school’, ‘I am capable of implementing alternative strategies in my classroom’, ‘I am capable of getting students to believe they can do well in school’, ‘I am capable of assisting families in helping their children do well in school’, and ‘I am capable of providing an alternative explanation or example when students are confused’.
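A heterogeneity specification of the kind behind Table 7 interacts each treatment dummy with the teacher covariate. The Python sketch below uses simulated data with hypothetical effect sizes, purely to illustrate how the interaction coefficients are read; it is not the study's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3000
levels = rng.integers(0, 2, n).astype(float)
p4pctile = ((1 - levels) * rng.integers(0, 2, n)).astype(float)  # mutually exclusive arms
ability = rng.normal(0, 1, n)          # e.g., an IRT-scaled teacher content-knowledge index
y = (0.05 * levels + 0.03 * p4pctile + 0.02 * ability
     + 0.07 * levels * ability + 0.01 * p4pctile * ability
     + rng.normal(0, 1, n))

# Main effects plus interactions; the covariate enters on its own so the
# interaction coefficients capture differential responsiveness to each scheme.
X = np.column_stack([np.ones(n), levels, p4pctile, ability,
                     levels * ability, p4pctile * ability])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
lev_int, p4p_int = beta[4], beta[5]    # Levels x Cov, P4Pctile x Cov
print(f"Levels x Cov = {lev_int:.3f}, P4Pctile x Cov = {p4p_int:.3f}, "
      f"diff = {p4p_int - lev_int:.3f}")
```

The bottom rows of Table 7 then test whether the two interaction coefficients differ, i.e. whether the covariate shifts responsiveness more under one scheme than the other.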


Table 7: Heterogeneity by teacher characteristics

                             (1)      (2)         (3)      (4)        (5)
Panel A: Math
                             Male     Age         IRT      HT Rating  Self Rating
Levels*Covariate (α1)        0.033    0.00080     0.016    0.073∗∗∗   0.041
                             (0.070)  (0.0016)    (0.037)  (0.021)    (0.035)
P4Pctile*Covariate (α2)      -0.017   0.00056     -0.025   0.012      0.058∗
                             (0.060)  (0.0016)    (0.038)  (0.022)    (0.035)
N. of obs.                   9,650    9,650       9,650    4,869      9,650
α3 = α2 − α1                 -.05     -.00024     -.041    -.062∗∗    .017
p-value (H0 : α3 = 0)        .49      .88         .2       .012       .61

Panel B: Kiswahili
                             Male     Age         IRT      HT Rating  Self Rating
Levels*Covariate (α1)        -0.081   -0.0000038  0.0022   0.069∗∗    0.085∗∗
                             (0.069)  (0.0011)    (0.034)  (0.031)    (0.034)
P4Pctile*Covariate (α2)      0.013    0.000058    0.0053   0.051      0.076∗∗
                             (0.067)  (0.0011)    (0.030)  (0.034)    (0.032)
N. of obs.                   9,650    9,650       9,650    4,869      9,650
α3 = α2 − α1                 .094     .000062     .0031    -.019      -.0092
p-value (H0 : α3 = 0)        .19      .95         .93      .56        .8

Panel C: English
                             Male     Age         IRT      HT Rating  Self Rating
Levels*Covariate (α1)        0.082    0.0039      0.071    0          -0.091
                             (0.12)   (0.0024)    (0.098)  (.)        (0.081)
P4Pctile*Covariate (α2)      -0.011   0.0013      -0.068   0          -0.15∗∗
                             (0.12)   (0.0024)    (0.088)  (.)        (0.076)
N. of obs.                   6,314    6,314       6,314    0          6,314
α3 = α2 − α1                 -.093    -.0025      -.14     0          -.064
p-value (H0 : α3 = 0)        .46      .29         .14      .          .34

The outcome variables are student test scores. The data include both follow-ups. Each column shows the heterogeneous treatment effect by a different teacher characteristic: sex (Column 1), age (Column 2), content knowledge scaled by an IRT model (Column 3), head-teacher rating (Column 4; only asked for math and Kiswahili teachers at the end of the second year), and self-rating (Column 5). Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

5.5.2 Teacher Understanding

Complex teacher incentive programs may be less effective if teachers are unable to understand the details of the design, and thus cannot optimally allocate their effort (Goodman & Turner, 2013; Loyalka et al., in press). These concerns are potentially more important in contexts with weak state capacity, where the state may be less able to effectively disseminate the details of a complex incentive program to teachers. As the pay for percentile system is more complex, our results may reflect differences in teacher understanding of the incentive systems. To ameliorate these concerns, we developed culturally appropriate materials, including examples, analogies, and illustrations, which we used to communicate the details of the incentive program to teachers.20 We also sent teams to visit schools multiple times to reinforce teachers' familiarity with the main features of the program. During our visits, we tested teachers to ensure they understood the details of the incentive program they were assigned to. We then conducted a review session to discuss the answers to the quiz questions to further ensure that teachers understood the design details. The results of the teacher comprehension tests are shown in Figure 7. As we asked different questions during each survey round (baseline, midline, and endline), we cannot compare trends in understanding over time. Despite the lack of temporal comparability, teacher comprehension was generally high and roughly equal across both types of incentive programs. This provides some assurance that our results are not driven by differences in program comprehension.

Figure 7: Do Teachers Understand the Interventions?

[Figure: share of correct answers on the teacher comprehension quizzes, by treatment arm (Levels vs. P4Pctile), at baseline (BL), midline (ML), and endline (EL) in 2015 and 2016.]

Although teacher understanding was relatively high, we also test for heterogeneity in treatment effects by teachers' understanding (at endline). Since there is no comparable test for control-group teachers, we cannot interact the treatment variable with teacher understanding. Rather, we split each treatment group into a high (above-average) understanding group and a low (below-average) understanding group, and estimate the treatment effects for these sub-treatment groups relative to the entire control group. Within each treatment arm, we test for differences between the high-understanding and low-understanding groups to determine whether better understanding leads to higher student test scores.21 The results are shown in Table 8 for math and Kiswahili. Focusing on the differences by understanding within each incentive group, we cannot reject the equality of the coefficients, which suggests that better program understanding is not associated with larger treatment effects. This is likely due to the extensive communication efforts that repeatedly reinforced the main details of the program design to teachers.

20 We worked closely with Twaweza's communications unit to develop our dissemination strategy and communications. The communications unit is experienced, and highly specialized in developing materials to inform and educate the general public in Tanzania.

21 As some teachers were not present when we conducted the teacher comprehension tests, we created an additional group for teachers with no test in both treatments.
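The sub-treatment-group construction described above can be sketched as follows. This is an illustrative Python simulation with hypothetical names and effect sizes, not the study's code; controls and clustering are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
arm = rng.integers(0, 3, n)            # 0 = control, 1 = levels, 2 = P4Pctile
quiz = rng.uniform(0, 1, n)            # endline comprehension score
med = np.median(quiz[arm > 0])         # split point among treated teachers
high = (arm > 0) & (quiz >= med)

# Four sub-treatment dummies, each measured against the entire control group
d = {
    "levels_high":   ((arm == 1) & high).astype(float),
    "levels_low":    ((arm == 1) & ~high).astype(float),
    "p4pctile_high": ((arm == 2) & high).astype(float),
    "p4pctile_low":  ((arm == 2) & ~high).astype(float),
}
# Hypothetical truth: understanding does not change the effect within either arm
y = (0.06 * (d["levels_high"] + d["levels_low"])
     + 0.03 * (d["p4pctile_high"] + d["p4pctile_low"])
     + rng.normal(0, 1, n))

X = np.column_stack([np.ones(n)] + list(d.values()))
beta = np.linalg.lstsq(X, y, rcond=None)[0]
for name, b in zip(d, beta[1:]):
    print(f"{name}: {b:+.3f}")
print("levels high-low:", round(beta[1] - beta[2], 3))
print("p4pctile high-low:", round(beta[3] - beta[4], 3))
```

The within-arm high-minus-low contrasts correspond to the "High-Low" rows at the bottom of Table 8.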


Table 8: Heterogeneity by teacher's understanding

                                          (1)      (2)
                                          Math     Kiswahili
Levels (high-understanding)               0.031    0.075∗
                                          (0.044)  (0.042)
Levels (low-understanding)                0.073∗   0.082∗∗
                                          (0.041)  (0.037)
P4Pctile (high-understanding)             0.0066   0.027
                                          (0.035)  (0.036)
P4Pctile (low-understanding)              0.049    -0.0080
                                          (0.043)  (0.041)
N. of obs.                                9,650    9,650
Levels: High-Low                          -.042    -.0073
p-value (Levels: High-Low = 0)            .27      .84
P4Pctile: High-Low                        -.043    .035
p-value (P4Pctile: High-Low = 0)          .31      .41
P4Pctile: High - Levels: High             -.025    -.048
p-value (P4Pctile: High - Levels: High = 0)  .59   .26
P4Pctile: Low - Levels: Low               -.024    -.09
p-value (P4Pctile: Low - Levels: Low = 0)    .64   .052

The outcome variables are student test scores in math (Column 1) and Kiswahili (Column 2). Each regression pools the data for both follow-ups. Teachers are classified as above or below the median in each follow-up. Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

5.5.3 Cooperation

Critics of teacher performance pay systems often argue that individual teacher incentives can negatively impact cooperative behavior among teachers. These concerns may be especially relevant for teachers in the pay for percentile scheme, or more generally in rank-order tournament schemes, as teachers could potentially sabotage other teachers to improve their relative ranking. As our incentive program was only implemented in Grades 1, 2, and 3, the exclusion of other teachers may reduce cooperation and increase jealousy within treatment schools. In addition, tournament-style incentive programs, such as pay for percentile schemes, can encourage sabotage in extreme cases.22

We examine the extent to which teacher incentives reduce cooperative behavior between teachers in the treatment schools relative to the control schools in Table 9. Cooperation, measured by the level of assistance provided by other teachers, was generally lower under both types of incentives. Program teachers reported receiving between 0.32 and 0.42 fewer instances of assistance from their fellow teachers (see Table 9, Column 1). Relative to the control group mean, this translates to a 23 percent to 30 percent reduction in the level of assistance. The differences between the levels treatment and the pay for percentile treatment are not statistically significant. We do not find any statistically significant treatment effects on the extensive margin of receiving help from other teachers (Column 2). However, when we examine the quality of assistance, we find that teachers in the levels treatment report being almost six percentage points (or 8 percent) less likely to receive quality assistance from their peers, while teachers in the pay for percentile treatment report a negligible reduction (Column 3). The reduction in the quality of assistance is significantly larger in levels schools, which could reflect more jealousy among non-participating teachers. As the levels system is easier to understand, non-participating teachers are perhaps better able to judge the potential pay-off of the incentive system relative to teachers in pay for percentile systems. As head teachers were included in both incentive designs, they may be more motivated to assist program-eligible teachers in their schools. However, our results in Column 4 show that there was no significant change in head teacher assistance to program teachers in either treatment relative to control schools.

22 We examined the potential for sabotage. However, we found very low rates (5 percent or less) of reported sabotage attempts by other teachers, with no differences across treatments.


Table 9: Teacher behavioral responses: cooperation

                     (1)              (2)              (3)               (4)
                     Help from other  Help from other  Help/advice from  Help/advice from
                     teachers         teachers         other teachers    head teacher
                     (# last month)   (last month>1)   (very good/good)  (very good/good)
Levels (α1)          -.32∗∗           -.047            -.058∗            -.025
                     (.15)            (.036)           (.031)            (.031)
P4Pctile (α2)        -.42∗∗           -.046            -.0015            .026
                     (.18)            (.034)           (.026)            (.026)
N. of obs.           1,991            1,991            1,998             1,940
Mean control         1.3              .4               .75               .78
α3 = α2 − α1         -.094            .0012            .057∗             .05
p-value (α3 = 0)     .5               .97              .081              .14

This table shows the effect of treatment on teacher reports of help and cooperation from other teachers: the number of times the teacher received help from other teachers (Column 1), whether the teacher received any help from other teachers (Column 2), a high rating of the advice or help received from other teachers (Column 3), and a high rating of the advice or help received from the head teacher (Column 4). Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

5.5.4 Teacher Expectations

Even though we equalized the budgets across our treatments, teachers' beliefs about their potential earnings could differ across the two incentive systems. In the pay for percentile system, the fact that the final bonus payment depends on the relative performance of other teachers is more salient. Hence, teachers may be less confident about their ability to receive larger payouts compared to their peers in the levels treatment, where payouts are determined by students' proficiency levels. Prior to the realization of the bonus payments, we collected teachers' expected earnings from the incentives, as well as their beliefs about their performance relative to other teachers in the district. As these questions were only applicable to teachers in the incentive programs, we simply compare teachers in the pay for percentile arm to those in the levels program, which serves as the omitted category in Table 10. Teachers in pay for percentile schools had lower bonus earnings expectations compared to their peers in the levels system. They expected almost 95,000 TZS (42 USD) less in bonus payments than teachers in the levels system. This represents an 18 percent reduction relative to the mean expectations of teachers in the levels system (Column 1) and 36 percent of the realized mean bonus payment in 2016. The lower expectations among pay for percentile teachers could be driven by the greater uncertainty of earnings in rank-order tournaments, such as pay for percentile systems. While the competitive pressure can be motivating, it can also be demotivating if an individual teacher has low subjective beliefs of winning relative to other competitors.23

We also examine differences in teachers' beliefs about their relative ranking within their district based on their (expected) bonus winnings in Columns 2 to 4. Overall, we do not find any differences across the treatments in teachers' beliefs about their rankings. The results suggest that teachers were quite optimistic about their projected earnings; 9 percent of teachers expected to be among the bottom earners (Column 2) and 7 percent were worried about earning a low bonus (Column 5). On the other hand, 80 percent expected to be among the top earners in the district (Column 4).

Table 10: Teachers' earning expectations

                  (1)          (2)            (3)            (4)         (5)
                  Bonus (TZS)  Bottom of the  Middle of the  Top of the  Worried low
                               district       district       district    bonus
P4Pctile (α2)     -94,330∗∗    -.029          -.0092         .035        -.02
                  (37,169)     (.03)          (.059)         (.045)      (.026)
N. of obs.        653          676            676            676         676
Mean Levels       525,641      .086           .48            .8          .074

This table shows the effect of treatment on teachers' self-reported expectations: the expected payoff (Column 1), the expected relative ranking in the district (Columns 2-4), and whether the teacher is worried about receiving a low bonus payment (Column 5). Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

5.5.5 Goal Setting

In addition to being easier to understand, the levels system provides teachers with a setof clear learning targets and goals for their students. This can help guide their instruc-tional strategies and areas of focus in the classroom, and perhaps even support indi-vidualized coaching.24 Recall that all teachers in our study, including control teachers,are provided with student baseline reports, which detail initial levels of student per-formance. Although teachers in both treatment groups were equally likely to use thesereports, there is suggestive evidence that teachers in the levels system were slightly more

23As mentioned above, since the budget per student was equal across both systems, the average payoffis the same across both systems.

24 Recent papers in the (behavioral) economics literature provide evidence on general productivity effects of setting goals, for example Koch and Nafziger (2011), Gomez-Minambres (2012), and Dalton et al. (2015).



likely to use these baseline reports to set goals (results not shown). We explore differences in goal-setting behavior between teachers in our incentive systems in Table 11. We do not find any differences in goals set for general school-level exams (Column 1). However, we find that teachers in the levels system were almost 8 percentage points more likely (or twice as likely) to have set clear goals for the high-stakes Twaweza exam (Column 2). In contrast, teachers in pay for percentile schools were 2.5 percentage points more likely to have set clear goals, although this effect is not statistically significant (Column 2). Although we cannot reject the equality of the two estimates, the results provide some suggestive evidence that the levels system facilitated more goal setting and targeting on the high-stakes (Twaweza) exam. Teachers in both systems are less likely to set goals for general student learning (Column 3), as well as for their own knowledge (Column 4). In terms of the high-stakes Twaweza test, which was administered to all schools, teachers in both incentive arms were approximately 7 percentage points (roughly 8 percent) more likely to set a general goal for the test (Column 5). Additionally, teachers in levels schools were almost 10 percentage points (50 percent) more likely to set a specific numerical target for the Twaweza high-stakes test, compared to just under 4 percentage points for teachers in pay for percentile schools (Column 6). Although these differences are not statistically distinguishable, the point estimates suggest a greater incidence of goal setting among teachers in the levels design.



Table 11: Goal Setting

                             Goals                           Twaweza test goals
                 School    Twaweza   Student    Own        General   Specific
                 exam      exam      learning   knowledge            (number)
                 (1)       (2)       (3)        (4)        (5)       (6)

Levels (α1)      -.02      .076∗∗    -.088∗∗    -.097∗∗    .067∗∗    .095∗
                 (.053)    (.029)    (.04)      (.037)     (.031)    (.052)
P4Pctile (α2)    -.047     .025      -.077∗     -.066∗     .076∗∗∗   .036
                 (.048)    (.027)    (.042)     (.037)     (.022)    (.042)
N. of obs.       1,016     1,016     1,016      1,016      1,016     1,016
Mean control     .46       .078      .34        .25        .89       .19
α3 = α2 − α1     -.027     -.05      .011       .031       .0094     -.059
p-value(α3 = 0)  .58       .14       .78        .42        .7        .27

This table shows the effect of treatment on whether teachers set professional goals (Columns 1-4) and specific goals for the Twaweza exam (Columns 5-6): whether they set goals for the school exams (Column 1), the Twaweza exams (Column 2), student learning (Column 3), and improving content knowledge (Column 4); and whether they have general goals for student performance on the Twaweza exam (Column 5) or specific (numeric) goals (Column 6). Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

As the high-stakes Twaweza test covers the major skills that are rewarded in the levels treatment, we can compute skill-specific pass rates for each student in each treatment. For example, we can compute the fraction of students who pass addition on the second grade math test. The ability to create skill-specific pass rates for all students enables us to use the high-stakes Twaweza testing data to explore differential patterns in student pass rates by treatment in Appendix Tables A.7, A.8, and A.9. Generally, the results show that teachers in levels schools are better able to improve specific skills on the high-stakes test. This is consistent with the idea that the levels treatment enables teachers to better focus instruction using the explicit incentive structure as a target. In addition, because the levels treatment is clear about which skills will be tested, and how skill proficiency is rewarded, it allows teachers to better teach to the test. However, as the treatment effects for teachers in the levels design are also higher, these patterns could be entirely driven by those gains. Thus, these results are merely suggestive of increased targeting or teaching to the test in levels schools.
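The skill-specific pass rates underlying these comparisons are simple groupwise means. A minimal sketch of the computation, using hypothetical records rather than the study’s actual data (the record layout and names are illustrative only):

```python
from collections import defaultdict

# Hypothetical records: (treatment arm, skill, passed-indicator).
records = [
    ("levels", "addition", 1), ("levels", "addition", 1), ("levels", "addition", 0),
    ("p4pctile", "addition", 1), ("p4pctile", "addition", 0),
    ("control", "addition", 1), ("control", "addition", 0),
]

def pass_rates(rows):
    """Fraction of students passing each skill, by treatment arm."""
    totals = defaultdict(lambda: [0, 0])  # (arm, skill) -> [passes, students]
    for arm, skill, passed in rows:
        totals[(arm, skill)][0] += passed
        totals[(arm, skill)][1] += 1
    return {key: passes / n for key, (passes, n) in totals.items()}

rates = pass_rates(records)
```

Comparing `rates` across arms for a fixed skill mirrors the treatment contrasts reported in the appendix tables.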

5.6 Cost effectiveness

The total annual cost of the teacher incentive programs was 7.23 USD per student. This cost estimate includes both the direct costs (value of incentive payments) as well as the



implementation costs (test design and implementation, communications, audit, transfer costs, etc.) of the program. We do not have accounting data at the treatment arm level and do not attempt to separate the costs by program. XX TO DISCUSS XX.25

While cost-effectiveness comparisons require several assumptions (e.g., external validity) (Dhaliwal, Duflo, Glennerster, & Tulloch, 2013), we estimate the increase in test scores per dollar spent, as this is the relevant policy question. The parameters for this calculation are presented in XX.

Using the treatment effects from the high-stakes exam, the pay for percentile program increases test scores by 1.17 SD per 100 USD spent and the levels program increases them by 2.47 SD per 100 USD spent. These assumptions make the cost-effectiveness comparable to most programs, since they are evaluated using high-stakes exams and cost-effectiveness is calculated using the long-term cost of the program.
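The per-dollar figures follow from dividing a treatment effect (in SD) by the cost per student and rescaling to 100 USD. A sketch of the arithmetic, where the 0.18 and 0.09 SD inputs are illustrative values in the range of the Year 2 high-stakes estimates, not necessarily the exact effects used in the paper:

```python
COST_PER_STUDENT = 7.23  # reported total annual cost, USD per student

def sd_per_100_usd(effect_sd, cost=COST_PER_STUDENT):
    """Test-score gain (in SD) per 100 USD spent per student."""
    return effect_sd / cost * 100

levels_ce = sd_per_100_usd(0.18)    # illustrative levels effect, ~2.49
p4pctile_ce = sd_per_100_usd(0.09)  # illustrative pay-for-percentile effect, ~1.24
```

With these inputs the levels program delivers roughly twice the learning gain per dollar, matching the ordering of the reported 2.47 versus 1.17 SD figures.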

Education administrators face pressure from teachers to increase salaries but are increasingly confronted with questions about learning progress and performance accountability. In this context it is not unrealistic to assume that a government contemplating the introduction of a teacher performance pay scheme would finance it out of a budget that would otherwise have paid for an unconditional pay increase. In such a scenario, the principal cost of the incentive program is the administrative cost of implementing the program (communicating the bonus offer, independently measuring and recording student learning, and organising the payments), not the cost of the bonus itself. If we exclude the costs of the bonus, our estimates of cost-effectiveness are XX for the pay for percentile program and YY for the levels program.

These estimates suggest that both programs are cost-effective compared to several other interventions in developing countries reviewed by Kremer, Brannen, and Glennerster (2013). However, teacher incentives in Western Kenya have been shown to increase test scores by over 6 SD per 100 USD spent.

6 Conclusion

We present estimates of the impact on early grade learning of two teacher incentive programs implemented as a randomised trial in a nationally representative sample of

25 The cost of the pre-treatment testing required for pay for percentile is not included in the cost figure, since this cost would only be incurred once (ability groups could be based on endline data after the first year of implementation). Pay for percentile also took a little longer to explain, but reports from field team leaders indicate that the difference with levels was small. The main cost difference is in data management: preparing the ability groups and programming the payment calculations. However, these are largely fixed costs, and so unit costs would be relatively small in steady state. Overall, pay for percentile is the more expensive program to implement.



180 public primary schools in Tanzania. Specifically, we compare a multiple-threshold proficiency incentive design with a pay for percentile system in terms of impact on independently measured test scores of students in Grades 1-3, relative to a control group.

We report four main findings. First, we find that both programs led to increases in test scores in the focal grades. Second, we do not find any negative effects on test scores for non-incentivised subjects or grades. Third, despite the theoretical advantage of the pay for percentile system, overall this design is not more effective than the simple proficiency system in improving mean student test scores. In particular, we find that point estimates of impact on high-stakes test scores are higher under the proficiency system for all incentivised subjects, and significantly higher for Kiswahili. Fourth, and contrary to expectations, the pay for percentile system leads to learning improvements primarily among the best students, while the levels system benefited students from a wider range of initial abilities.

To help interpret our results we report on potential mechanisms. We do not find that teacher understanding of the programs, as measured by scores on quizzes about program rules, is associated with impact. However, we do find that teachers in the percentile pay program had significantly lower expectations of bonus earnings compared to teachers in the levels system (equal to 36 percent of the realised mean bonus in 2016). We find some evidence of reduced cooperation between teachers as a result of the incentives.

Goal setting mechanism.

Overall, a simpler system with multiple thresholds can actually outperform a more complex incentive system, especially in countries with low administrative capacity such as Tanzania.

References

Alger, V. E. (2014). Teacher incentive pay that works: A global survey of programs that improve student achievement.

Balch, R., & Springer, M. G. (2015). Performance pay, test scores, and student learning objectives. Economics of Education Review, 44, 114-125. doi: 10.1016/j.econedurev.2014.11.002

Bandiera, O., Barankay, I., & Rasul, I. (2007). Incentives for managers and inequality among workers: Evidence from a firm-level experiment. The Quarterly Journal of Economics, 122(2), 729-773.

Banerjee, A., & Duflo, E. (2011). Poor economics: A radical rethinking of the way to fight global poverty. PublicAffairs.

Barlevy, G., & Neal, D. (2012). Pay for percentile. American Economic Review, 102(5), 1805-1831. doi: 10.1257/aer.102.5.1805

Barrera-Osorio, F., & Raju, D. (in press). Teacher performance pay: Experimental evidence from Pakistan. Journal of Public Economics.

Behrman, J. R., Parker, S. W., Todd, P. E., & Wolpin, K. I. (2015). Aligning learning incentives of students and teachers: Results from a social experiment in Mexican high schools. Journal of Political Economy, 123(2), 325-364. doi: 10.1086/675910

Bettinger, E. P., & Long, B. T. (2010). Does cheaper mean better? The impact of using adjunct instructors on student outcomes. The Review of Economics and Statistics, 92(3), 598-613.

Brown, J. (2011). Quitters never win: The (adverse) incentive effects of competing with superstars. Journal of Political Economy, 119(5), 982-1013.

Bruns, B., Filmer, D., & Patrinos, H. A. (2011). Making schools work: New evidence on accountability reforms. World Bank Publications.

Bruns, B., & Luque, J. (2015). Great teachers: How to raise student learning in Latin America and the Caribbean. World Bank Publications.

Carroll, G. (2015). Robustness and linear contracts. American Economic Review, 105(2), 536-563.

Carroll, G., & Meng, D. (2016). Locally robust contracts for moral hazard. Journal of Mathematical Economics, 62, 36-51.

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014a). Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593-2632.

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014b). Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood. American Economic Review, 104(9), 2633-2679.

Contreras, D., & Rau, T. (2012). Tournament incentives for teachers: Evidence from a scaled-up intervention in Chile. Economic Development and Cultural Change, 61(1), 219-246.

Dhaliwal, I., Duflo, E., Glennerster, R., & Tulloch, C. (2013). Comparative cost-effectiveness analysis to inform policy in developing countries: A general framework with applications for education. Education Policy in Developing Countries, 285-338.

Ferraz, C., & Bruns, B. (2012). Paying teachers to perform: The impact of bonus pay in Pernambuco, Brazil. Society for Research on Educational Effectiveness.

Fryer, R. G. (2013). Teacher incentives and student achievement: Evidence from New York City public schools. Journal of Labor Economics, 31(2), 373-407.

Ganimian, A. J., & Murnane, R. J. (2016). Improving education in developing countries: Lessons from rigorous impact evaluations. Review of Educational Research, 86(3), 719-755.

Gilligan, D., Karachiwalla, N., Kasirye, I., Lucas, A., & Neal, D. (2018). Educator incentives and educational triage in rural primary schools. (mimeo)

Glewwe, P., Ilias, N., & Kremer, M. (2010). Teacher incentives. American Economic Journal: Applied Economics, 2(3), 205-227. doi: 10.1257/app.2.3.205

Glewwe, P., Kremer, M., & Moulin, S. (2009). Many children left behind? Textbooks and test scores in Kenya. American Economic Journal: Applied Economics, 1(1), 112-135.

Gneezy, U., List, J. A., Livingston, J. A., Sadoff, S., Qin, X., & Xu, Y. (2017). Measuring success in education: The role of effort on the test itself (Working Paper No. 24004). National Bureau of Economic Research. doi: 10.3386/w24004

Goldhaber, D., & Walch, J. (2012). Strategic pay reform: A student outcomes-based evaluation of Denver’s ProComp teacher pay initiative. Economics of Education Review, 31(6), 1067-1083. doi: 10.1016/j.econedurev.2012.06.007

Goodman, S. F., & Turner, L. J. (2013). The design of teacher incentive pay and educational outcomes: Evidence from the New York City bonus program. Journal of Labor Economics, 31(2), 409-420.

Hanushek, E. A., & Rivkin, S. G. (2012). The distribution of teacher quality and implications for policy. Annual Review of Economics, 4(1), 131-157.

Imberman, S. A. (2015). How effective are financial incentives for teachers? IZA World of Labor.

Imberman, S. A., & Lovenheim, M. F. (2015). Incentive strength and teacher productivity: Evidence from a group-based teacher incentive pay system. Review of Economics and Statistics, 97(2), 364-386.

Kane, T. J., Rockoff, J. E., & Staiger, D. O. (2008). What does certification tell us about teacher effectiveness? Evidence from New York City. Economics of Education Review, 27(6), 615-631.

Kremer, M., Brannen, C., & Glennerster, R. (2013). The challenge of education and learning in the developing world. Science, 340(6130), 297-300.

Lavy, V. (2002). Evaluating the effect of teachers’ group performance incentives on pupil achievement. Journal of Political Economy, 110(6), 1286-1317.

Lavy, V. (2009). Performance pay and teachers’ effort, productivity, and grading ethics. American Economic Review, 99(5), 1979-2011. doi: 10.1257/aer.99.5.1979

Lee, D. S. (2009). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. The Review of Economic Studies, 76(3), 1071-1102.

Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2016). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. American Economic Journal: Economic Policy, 8(4), 183-219.

Loyalka, P. K., Sylvia, S., Liu, C., Chu, J., & Shi, Y. (in press). Pay by design: Teacher performance pay design and the distribution of student achievement. Journal of Labor Economics.

Mbiti, I. (2016). The need for accountability in education in developing countries. Journal of Economic Perspectives, 30(3), 109-132.

Mbiti, I., Muralidharan, K., Romero, M., Schipper, Y., Manda, C., & Rajani, R. (2017). Inputs, incentives, and complementarities in primary education: Experimental evidence from Tanzania. (mimeo)

Metzler, J., & Woessmann, L. (2012). The impact of teacher subject knowledge on student achievement: Evidence from within-teacher within-student variation. Journal of Development Economics, 99(2), 486-496.

Muralidharan, K., & Sundararaman, V. (2011). Teacher performance pay: Experimental evidence from India. Journal of Political Economy, 119(1), 39-77.

Neal, D. (2011). The design of performance pay in education. In E. A. Hanushek, S. Machin, & L. Woessmann (Eds.), Handbook of the Economics of Education (Vol. 4, pp. 495-550). Elsevier. doi: 10.1016/B978-0-444-53444-6.00006-7

Neal, D. (2013). The consequences of using one assessment system to pursue two objectives. The Journal of Economic Education, 44(4), 339-352.

Neal, D., & Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and test-based accountability. Review of Economics and Statistics, 92(2), 263-283. doi: 10.1162/rest.2010.12318

Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? The Quarterly Journal of Economics, 122(3), 1067-1101.

Niederle, M., & Vesterlund, L. (2011). Gender and competition. Annual Review of Economics, 3(1), 601-630.

OECD. (2017). Teachers’ salaries (indicator). Data retrieved from https://data.oecd.org/eduresource/teachers-salaries.htm doi: 10.1787/f689fb91-en

PRI. (2013). Tanzanian teachers learning education doesn’t pay. Retrieved 13/09/2017, from https://www.pri.org/stories/2013-12-20/tanzanian-teachers-learning-education-doesnt-pay

Reuters. (2012). Tanzanian teachers in strike over pay. Retrieved 13/09/2017, from http://www.reuters.com/article/ozatp-tanzania-strike-20120730-idAFJOE86T05320120730

Schotter, A., & Weigelt, K. (1992). Asymmetric tournaments, equal opportunity laws, and affirmative action: Some experimental results. The Quarterly Journal of Economics, 107(2), 511-539.

Springer, M. G., Ballou, D., Hamilton, L., Le, V.-N., Lockwood, J., McCaffrey, D. F., . . . Stecher, B. M. (2011). Teacher pay for performance: Experimental evidence from the Project on Incentives in Teaching (POINT). Society for Research on Educational Effectiveness.

Uwezo. (2012). Are our children learning? Annual learning assessment report 2011 (Tech. Rep.). Retrieved from http://www.twaweza.org/uploads/files/UwezoTZ2013forlaunch.pdf

Uwezo. (2013). Are our children learning? Numeracy and literacy across East Africa (Uwezo East Africa Report). Nairobi: Uwezo.

Woessmann, L. (2011). Cross-country evidence on teacher performance pay. Economics of Education Review, 30(3), 404-418. doi: 10.1016/j.econedurev.2010.12.008

World Bank. (2011). Service delivery indicators: Tanzania (Tech. Rep.). The World Bank, Washington, D.C.

World Bank. (2017). World development indicators. Data retrieved from https://data.worldbank.org/data-catalog/world-development-indicators

World Bank. (2018a). Systems Approach for Better Education Results (SABER). Data retrieved from http://saber.worldbank.org/index.cfm?indx=8&pd=1&sub=1

World Bank. (2018b). World development report 2018: Learning to realize education’s promise. The World Bank.



A Appendix

A.1 Randomization Details

From a previous RCT (KiuFunza I), we have the baseline data necessary to implement the pay for percentile incentive scheme (to split students into groups, and properly seed each contest) for 180 schools. That experiment had two treatments and a control group, and treatment assignment was stratified by district (a practice we continue in this experiment). In each district, there are seven schools in each of the previous treatments (seven schools in C1 and seven in C2) and four in the control group (C3).

We randomly assign schools from the previous treatment groups into the new treatment groups. However, in order to study the long-term impacts of teacher incentives, we assign a higher proportion of schools from treatment C1 (which involved threshold teacher incentives) to the “levels” treatment. Similarly, we assign a higher proportion of schools from the control group of the previous experiment (C3) to the control group of this experiment.

For this experiment, we stratify the random treatment assignment by district, previous treatment, and an index of the overall learning level of students in each school.26 Table A.1 summarizes the number of schools randomly allocated to each treatment arm based on their assignment in the previous experiment. In short, in each district we have 18 schools, with six schools in each of the new treatment groups (levels, gains, and control). In each of the new treatments, there are 60 schools; 30 of these schools are above the median in baseline learning and 30 are below.

All regressions account for all three levels of stratification: district, previous treatment, and an index of the overall learning level of students in each school.
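The within-stratum assignment can be sketched as follows. For simplicity the sketch allocates the three arms in equal proportions within each stratum, whereas the actual design used the unequal allocation probabilities of Table A.1; the function and variable names are illustrative, not the study’s code:

```python
import random

def stratified_assign(schools, arms=("levels", "gains", "control"), seed=0):
    """Randomly assign schools to treatment arms within strata.

    schools: list of (school_id, stratum) pairs, where a stratum combines
    district, previous treatment, and above/below-median baseline learning.
    """
    rng = random.Random(seed)
    by_stratum = {}
    for sid, stratum in schools:
        by_stratum.setdefault(stratum, []).append(sid)
    assignment = {}
    for ids in by_stratum.values():
        rng.shuffle(ids)  # random order within the stratum
        for i, sid in enumerate(ids):
            assignment[sid] = arms[i % len(arms)]  # balanced round-robin
    return assignment
```

Because assignment is balanced within each stratum, regressions that control for stratum fixed effects compare like with like, which is why all regressions include the three stratification levels.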

Table A.1: Treatment allocation

                           KiuFunza II
                Levels   Gains   Control   Total
KiuFunza I  C1      40      20        10      70
            C2      10      30        30      70
            C3      10      10        20      40
Total               60      60        60     180

26 We create an overall measure of student learning and split schools as above or below the median.



A.2 Additional Tables

A.2.1 Balance in Teacher Turnover

Table A.2: Teacher turnover

                                Still teaching incentivized
                                grades/subjects
                                Yr 1        Yr 2
                                (1)         (2)

Levels (α1)                     .066        .065
                                (.043)      (.04)
P4Pctile (α2)                   .054        .088∗∗
                                (.036)      (.034)
N. of obs.                      882         882
Mean control                    .73         .59
Gains-Levels α3 = α2 − α1       -.013       .022
p-value (H0 : α3 = 0)           .75         .56

Proportion of teachers teaching math, English, or Kiswahili in grades 1, 2, and 3 at the beginning of 2015 that are still teaching them (in the same school) at the end of 2015 (Column 1) and 2016 (Column 2). Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01



A.2.2 Effects on Test-Takers and Lee Bounds on High-Stakes Test

Table A.3: Number of test takers in high-stakes exam

                          (1)        (2)

Levels (α1)               0.02       0.05∗∗∗
                          (0.02)     (0.01)
P4Pctile (α2)             -0.00      0.03∗∗
                          (0.02)     (0.01)
N. of obs.                540        540
Mean control group        0.78       0.83
α3 = α2 − α1              -0.02      -0.03∗∗
p-value(α3 = 0)           0.20       0.04

The dependent variable is the proportion of test takers (number of test takers divided by enrollment in each grade) in the high-stakes exam. The unit of observation is the school-grade. Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01



Table A.4: Lee bounds for high-stakes exams

                           Year 1                   Year 2
                        Math       Kiswahili    Math       Kiswahili
                        (1)        (2)          (3)        (4)

Levels (α1)             0.11∗∗     0.13∗∗∗      0.14∗∗∗    0.18∗∗∗
                        (0.05)     (0.05)       (0.04)     (0.05)
P4Pctile (α2)           0.07∗      0.02         0.09∗∗     0.09∗
                        (0.04)     (0.04)       (0.04)     (0.05)
N. of obs.              48,077     48,077       59,680     59,680
α3 = α2 − α1            -0.047     -0.11∗∗      -0.044     -0.093∗∗
p-value(α3 = 0)         0.30       0.026        0.31       0.045
Lower 95% CI (α1)       0.00066    0.021        -0.023     0.027
Higher 95% CI (α1)      0.23       0.25         0.32       0.35
Lower 95% CI (α2)       -0.012     -0.070       0.014      -0.0032
Higher 95% CI (α2)      0.14       0.10         0.17       0.17
Lower 95% CI (α3)       -0.16      -0.24        -0.22      -0.27
Higher 95% CI (α3)      0.063      0.00099      0.11       0.057

The dependent variable is the standardized test score for different subjects. For each subject we present Lee (2009) bounds for all the treatment estimates (i.e., trimming the left/right tail of the distribution in Levels and Gains schools so that the proportion of test takers is the same as in control schools). Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01
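The trimming logic behind these bounds can be sketched for a single treatment-control comparison of means; this is a simplified illustration of the Lee (2009) procedure under the assumption that the treatment group has the higher test-taking rate (as in Table A.3), not the paper’s full regression implementation:

```python
from statistics import mean

def lee_bounds(y_treat, y_ctrl):
    """Sharp bounds on a treatment effect under differential attrition.

    y_treat, y_ctrl: lists of test scores, with None for students who were
    enrolled but did not take the test. The treated outcome distribution is
    trimmed so the two groups have the same effective test-taking rate.
    """
    obs_t = sorted(y for y in y_treat if y is not None)
    obs_c = [y for y in y_ctrl if y is not None]
    r_t = len(obs_t) / len(y_treat)              # test-taking rate, treatment
    r_c = len(obs_c) / len(y_ctrl)               # test-taking rate, control
    k = round((r_t - r_c) / r_t * len(obs_t))    # treated observations to trim
    lower = mean(obs_t[:len(obs_t) - k]) - mean(obs_c)  # trim top -> lower bound
    upper = mean(obs_t[k:]) - mean(obs_c)               # trim bottom -> upper bound
    return lower, upper
```

Trimming the top of the treated distribution gives the most pessimistic (lower) bound on the effect, and trimming the bottom gives the most optimistic (upper) bound.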

A.2.3 Additional Heterogeneity in Treatment Effects



Table A.5: Heterogeneity by student characteristics

                               Math                         Swahili                       English
                         Male     Age      Test(Yr0)  Male     Age      Test(Yr0)  Male     Age      Test(Yr0)
                         (1)      (2)      (3)        (4)      (5)      (6)        (7)      (8)      (9)

Levels*Covariate (α2)    -0.025   0.011    0.026      0.011    -0.024   0.023      -0.059   -0.0021  0.091
                         (0.039)  (0.015)  (0.033)    (0.039)  (0.016)  (0.027)    (0.081)  (0.041)  (0.055)
P4Pctile*Covariate (α1)  0.0095   0.0089   0.063∗∗    0.0023   -0.0051  0.040      -0.048   0.032    0.066
                         (0.042)  (0.016)  (0.027)    (0.039)  (0.016)  (0.026)    (0.082)  (0.042)  (0.057)
N. of obs.               9,650    9,650    9,650      9,650    9,650    9,650      3,065    3,065    3,065
α3 = α2 − α1             .035     -.0024   .037       -.009    .019     .017       .011     .034     -.024
p-value (H0 : α3 = 0)    .4       .88      .23        .82      .22      .52        .89      .35      .59

Each column interacts the treatment effect with a different student characteristic: sex (Columns 1, 4, and 7), age (Columns 2, 5, and 8), and baseline test scores (Columns 3, 6, and 9). Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01




Table A.6: Heterogeneity by school characteristics

                                 Math                             Swahili                          English
                         Facilities  PTR       Fraction   Facilities  PTR       Fraction   Facilities  PTR       Fraction
                                               Weak                             Weak                             Weak
                         (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)        (9)

Levels*Covariate (α2)    0.031      -0.00015   -0.16      0.033      -0.0019    -0.23      0.063      -0.0040∗   -0.42
                         (0.023)    (0.0015)   (0.18)     (0.031)    (0.0013)   (0.17)     (0.043)    (0.0022)   (0.26)
P4Pctile*Covariate (α1)  -0.027     -0.0025∗∗  -0.24      0.0024     -0.0021    -0.32∗∗    0.072      -0.0026    -0.34
                         (0.026)    (0.0012)   (0.15)     (0.032)    (0.0013)   (0.16)     (0.044)    (0.0024)   (0.30)
N. of obs.               9,650      9,650      9,650      9,650      9,650      9,650      3,065      3,065      3,065
α3 = α2 − α1             -.057∗∗    -.0024     -.079      -.031      -.00025    -.088      .0093      .0014      .075
p-value (H0 : α3 = 0)    .023       .18        .62        .28        .88        .55        .82        .64        .78

Each column interacts the treatment effect with a different school characteristic: a facilities index (Columns 1, 4, and 7), the pupil-teacher ratio (Columns 2, 5, and 8), and the fraction of students below the median student in the country (Columns 3, 6, and 9). Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01


A.2.4 Pass Rates

Table A.7: Pass rates using ‘levels’ thresholds in Kiswahili

                      Silabi    Words     Sentences  Paragraph  Story     Reading
                                                                          Comprehension
                      (1)       (2)       (3)        (4)        (5)       (6)

Panel A: Year 1
Levels (β1)           .064∗∗    .059∗∗    .071∗∗∗    .075∗∗∗    .038      .024
                      (.026)    (.024)    (.023)     (.022)     (.024)    (.026)
P4Pctile (β2)         -.0057    .015      .011       .026       -.0099    -.0034
                      (.025)    (.022)    (.021)     (.02)      (.021)    (.022)
N. of obs.            17,886    33,440    33,440     15,554     14,678    14,678
Control mean          .4        .59       .5         .37        .52       .56
β3 = β2 − β1          -.069∗∗∗  -.044∗    -.06∗∗     -.049∗∗    -.048∗∗   -.027
p-value (H0 : β3 = 0) .0086     .081      .011       .017       .045      .27

Panel B: Year 2
Levels (β1)           .09∗∗∗    .085∗∗∗   .08∗∗∗     .046∗∗     .0032     .053∗∗
                      (.021)    (.02)     (.018)     (.019)     (.026)    (.021)
P4Pctile (β2)         .047∗∗    .036∗     .032∗      -.0089     -.027     .012
                      (.023)    (.02)     (.019)     (.02)      (.022)    (.019)
N. of obs.            26,746    44,262    44,262     17,516     15,493    33,009
Control mean          .3        .6        .48        .43        .61       .56
β3 = β2 − β1          -.044∗∗   -.049∗∗∗  -.048∗∗∗   -.055∗∗∗   -.03      -.041∗
p-value (H0 : β3 = 0) .027      .0082     .0058      .0042      .22       .053

The dependent variable is whether a student passed a given skill in the high-stakes exam. Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01



Table A.8: Pass rates using ‘levels’ thresholds in math

                      Counting  Numbers   Inequalities  Addition  Subtraction  Multiplication  Division
                      (1)       (2)       (3)           (4)       (5)          (6)             (7)

Panel A: Year 1
Levels (β1)           .0034     .014      .03∗∗         .05∗∗     .043∗∗       .038∗∗          .035∗
                      (.0091)   (.021)    (.014)        (.021)    (.02)        (.017)          (.018)
P4Pctile (β2)         .031∗∗∗   .031∗     .033∗∗∗       .018      .016         .023            .0095
                      (.0078)   (.018)    (.012)        (.018)    (.016)       (.016)          (.018)
N. of obs.            17,886    17,886    33,440        48,118    48,118       30,232          14,678
Control mean          .93       .64       .74           .59       .5           .23             .22
β3 = β2 − β1          .028∗∗∗   .017      .0027         -.033     -.027        -.015           -.026
p-value (H0 : β3 = 0) .0012     .4        .85           .12       .16          .37             .17

Panel B: Year 2
Levels (β1)           .000686   .0411∗∗   .0265∗∗       .0442∗∗   .0462∗∗      .0514∗∗∗        .0395∗∗
                      (.0078)   (.019)    (.011)        (.019)    (.019)       (.014)          (.017)
P4Pctile (β2)         .0108     .0595∗∗∗  .0388∗∗∗      .0394∗∗   .026         .0254∗∗         .0223
                      (.0071)   (.017)    (.01)         (.017)    (.017)       (.013)          (.017)
N. of obs.            26,746    26,746    44,262        59,755    59,755       15,493          15,493
Control mean          .94       .68       .79           .6        .56          .11             .18
β3 = β2 − β1          .01       .018      .012          -.0049    -.02         -.026           -.017
p-value (H0 : β3 = 0) .12       .31       .23           .78       .24          .11             .34

The dependent variable is whether a student passed a given skill in the high-stakes exam. Standard errors, clustered by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01



Table A.9: Pass rates using ‘levels’ thresholds in English

                      Silabi    Words    Sentences  Paragraph  Story    Reading Comprehension
                      (1)       (2)      (3)        (4)        (5)      (6)

Panel A: Year 1
Levels (β1)           .095∗∗∗   .05∗∗∗   .023∗∗∗    .015∗∗     .0079∗   .013∗
                      (.021)    (.013)   (.0087)    (.0065)    (.0046)  (.0078)
P4Pctile (β2)         .036∗∗    .028∗∗   .0041      .0073      .0079∗   .019∗∗∗
                      (.016)    (.011)   (.007)     (.0055)    (.0046)  (.0064)
N. of obs.            17,886    33,440   33,440     15,554     14,678   14,678
Control mean          .087      .075     .023       .007       .021     .036
β3 = β2 − β1          -.059∗∗∗  -.022∗   -.019∗∗    -.0073     -.00001  .0057
p-value (H0: β3 = 0)  .0034     .074     .043       .29        1        .44

Panel B: Year 2
Levels (β1)                                                    .0074    .022∗∗
                                                               (.0061)  (.0086)
P4Pctile (β2)                                                  .012∗    .02∗∗
                                                               (.0068)  (.0079)
N. of obs.            0         0        0          0          10,735   10,735
Control mean          .         .        .          .          .017     .025
β3 = β2 − β1                                                   .0048    -.0016
p-value (H0: β3 = 0)                                           .5       .88

The dependent variable is whether a student passed a given skill in the high-stakes exam. Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01

A.2.5 National Assessments



Table A.10: National assessments

Panel A: Grade 4 - SFNA
                      Grade 4 SFNA 2015              Grade 4 SFNA 2016              Grade 4 SFNA 2017
                      Pass     Score    Test takers  Pass     Score    Test takers  Pass     Score    Test takers
                      (1)      (2)      (3)          (4)      (5)      (6)          (7)      (8)      (9)
Levels (α1)           -0.06∗   -0.20∗∗  3.04         -0.05∗∗  -0.21∗∗  15.77        0.02     -0.06    25.61∗
                      (0.04)   (0.08)   (8.79)       (0.03)   (0.08)   (9.84)       (0.03)   (0.08)   (13.75)
P4Pctile (α2)         -0.05    -0.18∗∗  -6.20        0.01     -0.01    0.93         0.01     -0.11    4.83
                      (0.03)   (0.08)   (8.24)       (0.03)   (0.08)   (8.03)       (0.03)   (0.07)   (10.98)
N. of obs.            13,853   13,853   166          12,487   12,487   153          14,575   14,575   148
N. of schools         168      168      166          157      157      153          153      153      148
Mean control group    0.67     2.87     63.6         0.81     3.24     65.3         0.79     3.30     75.8
α3 = α2 − α1          0.018    0.019    -9.23        0.059∗∗  0.19∗∗   -14.8        -0.012   -0.049   -20.8
p-value (H0: α3 = 0)  0.61     0.82     0.33         0.032    0.018    0.13         0.67     0.48     0.15

Panel B: Grade 7 - PSLE
                      Grade 7 PSLE 2015              Grade 7 PSLE 2016              Grade 7 PSLE 2017
                      Pass     Score    Test takers  Pass     Score    Test takers  Pass     Score    Test takers
Levels (α1)           -0.02    -0.07    6.99         0.00     -0.05    4.02         0.03     0.10     7.00
                      (0.04)   (0.08)   (6.99)       (0.03)   (0.07)   (7.56)       (0.03)   (0.06)   (8.76)
P4Pctile (α2)         -0.04    -0.07    -4.00        -0.02    -0.03    -2.29        -0.00    0.02     0.59
                      (0.03)   (0.08)   (6.48)       (0.03)   (0.06)   (5.75)       (0.03)   (0.06)   (7.08)
N. of obs.            11,616   11,616   165          10,031   10,031   155          12,070   12,070   155
N. of schools         167      167      165          158      158      155          158      158      155
Mean control group    0.71     2.98     55.3         0.67     2.83     52.4         0.69     2.86     61.9
α3 = α2 − α1          -0.020   -0.0043  -11.0        -0.029   0.016    -6.32        -0.032   -0.074   -6.41
p-value (H0: α3 = 0)  0.63     0.96     0.10         0.42     0.84     0.39         0.30     0.23     0.47

Clustered standard errors, by school, in parentheses. ∗ p < 0.10, ∗∗ p < 0.05, ∗∗∗ p < 0.01



A.2.6 Wasted money

To estimate how much money was ‘wasted’ in the levels scheme on students passing certain thresholds regardless of teachers’ effort, we do the following.

• First, we estimate the model Yi = βXi + εi, where Yi indicates whether a student passed an ability or not and Xi is a set of student controls (including region, grade, and school characteristics), restricting the sample to the control group.

• We then use this model to predict the probability of passing each ability in the treatment group. This assumes that, in the absence of the treatment, the treatment group would behave like the control group.

• We then compute the average predicted pass rate and compare it to the actual pass rate.

Specifically, we estimate max(0, Ŷ/Y), where Ŷ is the average predicted pass rate and Y is the actual pass rate in the treatment group. This is the proportion of the money paid for results that are not related to additional effort exerted by the teacher. The results are below.

Table A.11: Wasted money

          Kiswahili  Math  English
          (1)        (2)   (3)

Year 1    7%         4%    0%
Year 2    5%         10%   0%
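The estimation steps above can be sketched as follows. This is a minimal illustration with fabricated data: a region-level pass rate stands in for the control-group regression Yi = βXi + εi, the waste share is read as the ratio of the predicted to the actual pass rate, and all names and probabilities are hypothetical.

```python
import random

random.seed(1)

# Hypothetical data: each student is a (region, passed) pair. Regions have
# different baseline pass rates; the treatment adds a uniform effort boost.
def simulate(boost, n=4000):
    rows = []
    for _ in range(n):
        region = random.choice(["A", "B"])
        base = 0.5 if region == "A" else 0.3
        rows.append((region, 1 if random.random() < base + boost else 0))
    return rows

control = simulate(0.0)
treatment = simulate(0.05)

# Step 1: fit a "model" on the control group only. A region-level pass rate
# stands in for a regression with richer student controls.
pass_rate = {}
for region in ("A", "B"):
    outcomes = [y for r, y in control if r == region]
    pass_rate[region] = sum(outcomes) / len(outcomes)

# Step 2: predicted pass rate of the treatment group absent the treatment,
# assuming it would otherwise behave like the control group.
predicted = sum(pass_rate[r] for r, _ in treatment) / len(treatment)

# Step 3: actual pass rate in the treatment group, and the share of
# payouts made for passes not attributable to additional teacher effort.
actual = sum(y for _, y in treatment) / len(treatment)
wasted_share = max(0.0, predicted / actual)
```

In the paper, the prediction model includes region, grade, and school characteristics, and the waste share is computed separately by subject and year, as reported in Table A.11.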
