NBER WORKING PAPER SERIES

LEARNING JOB SKILLS FROM COLLEAGUES AT WORK: EVIDENCE FROM A FIELD EXPERIMENT USING TEACHER PERFORMANCE DATA

John P. Papay
Eric S. Taylor
John H. Tyler
Mary Laski

Working Paper 21986
http://www.nber.org/papers/w21986

NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
February 2016

We thank Jonah Rockoff as well as seminar participants at the NBER Summer Institute, New York Federal Reserve, and Harvard Graduate School of Education for helpful comments and suggestions. The Bill & Melinda Gates Foundation provided financial support of this research, and we benefited greatly from discussions with our program officer Steve Cantrell. We are equally indebted to the Tennessee Department of Education, and particularly Nate Schwartz, Laura Booker, Tony Pratt, Luke Kohlmoos, and Sara Heyburn for their collaboration throughout. Finally, we thank Verna Ruffin, superintendent in Jackson-Madison County Schools, and the principals and teachers who participated in the program. All opinions and errors are our own. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.

NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.

© 2016 by John P. Papay, Eric S. Taylor, John H. Tyler, and Mary Laski. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.

Learning Job Skills from Colleagues at Work: Evidence from a Field Experiment Using Teacher Performance Data
John P. Papay, Eric S. Taylor, John H. Tyler, and Mary Laski
NBER Working Paper No. 21986
February 2016
JEL No. I2, J24, M53

ABSTRACT

We study on-the-job learning among classroom teachers, especially learning skills from coworkers. Using data from a new field experiment, we document meaningful improvements in teacher job performance when high- and low-performing teachers working at the same school are paired and asked to work together on improving the low-performer’s skills. In particular, pairs are asked to focus on specific skills identified in the low-performer’s prior performance evaluations. In the classrooms of low-performing teachers treated by the intervention, students scored 0.12 standard deviations higher than students in control classrooms. These improvements in teacher performance persisted, and perhaps grew, in the year after treatment. Empirical tests suggest the improvements are likely the result of low-performing teachers learning skills from their partner.

John P. Papay
Box 1938
340 Brook Street
Brown University
Providence, RI [email protected]

Eric S. Taylor
Harvard Graduate School of Education
Gutman Library 469
6 Appian Way
Cambridge, MA [email protected]

John H. Tyler
Box 1938
340 Brook Street
Brown University
Providence, RI 02912
and [email protected]

Mary Laski
Box 1938
340 Brook Street
Brown University
Providence, RI [email protected]

“Some types of knowledge can be mastered better if simultaneously related to a practical problem.” Gary Becker (1962)

Can employees learn job skills from their coworkers? Whether and how peers contribute

to on-the-job learning, and at what costs, are practical questions for personnel management.

Economists’ interest in these questions dates to at least Alfred Marshall (1890) and, more

recently, Gary Becker (1962) and Robert Lucas (1988). Yet, despite the intuitive role for

coworkers in human capital development, empirical evidence of learning from coworkers is

scarce.2 In this paper we present new evidence from a random-assignment field experiment in

U.S. public schools: low-performing classroom teachers in treatment schools were each matched

to a high-performing colleague in their school, and pairs were encouraged to work together on

improving their teaching skills. We report positive treatment effects on teacher productivity, as

measured by contributions to student achievement growth, particularly for low-performing

teachers. We then test empirical predictions consistent with peer learning and other potential

mechanisms.

While there is limited evidence on learning from coworkers specifically, there is a

growing literature on productivity spillovers among coworkers generally. Moretti (2004) and

Battu, Belfield, and Sloane (2003) document human capital spillovers broadly, using variation

between firms, but without insight into mechanisms. Several other papers, each focusing on a

specific firm or occupation as we do, also find spillovers; the apparent mechanisms are shared

production opportunities or peer influence on effort (Ichino and Maggi 2000, Hamilton,

Nickerson and Owan 2003, Bandiera, Barankay and Rasul 2005, Mas and Moretti 2009,

Azoulay, Graff Zivin, and Wang 2010). Moreover, these spillovers may be substantial. Lucas

(1988) suggests human capital spillovers, broadly speaking, could explain between-country

differences in income.

2 We are focused in this paper on coworker peers and learning on-the-job. A large literature examines the role of peers in classroom learning and other formal education settings (for a review see Sacerdote 2010).

One empirical example of learning from coworkers comes from the study of classroom

teachers. Jackson and Bruegmann (2009) find a teacher’s productivity, as measured by

contribution to her students’ test score growth, improves when a new higher-performing

colleague arrives at her school; then, consistent with peer learning, the improvements persist

after she is no longer working with the same colleague (i.e., teaching the same grade in the same

school). The authors estimate that prior coworker effectiveness explains about one-fifth of the

variation in teacher performance.

In this paper we also focus on classroom teachers. While we believe the paper makes an

important general contribution, a better understanding of on-the-job learning among teachers

specifically has sizable potential value for students and economies. Classroom teaching

represents a substantial investment of resources: one out of ten college-educated workers in the

U.S. is a public school teacher, and public schools spend $285 billion annually on teacher wages

and benefits (U.S. Census Bureau 2015, Table 6).3 And there is substantial variability in teacher

job performance: measured both in the short-run with students’ test scores (see Jackson, Rockoff,

and Staiger 2014 for a review) and the long-run with students’ economic and social success years

later as adults (Chetty, Friedman, and Rockoff 2014). One seemingly consistent source of

differences in teacher performance is experience on the job (Rockoff 2004, Harris and Sass 2011,

Papay and Kraft 2015). Estimated differences due to experience are much larger than differences

in formal pre-service or in-service training (Jackson, Rockoff, and Staiger 2014).

3 Authors’ calculations of workforce share from Current Population Survey 1990-2010.

We report here on a field experiment in Tennessee designed to study on-the-job, peer

learning between teachers who work at the same school. Schools were randomly assigned to

either treatment or a business-as-usual control. In treatment schools, low-performing teachers

were each matched to a high-performing partner using detailed micro-data from prior

performance evaluations. In Tennessee, teachers are observed in the classroom multiple times

per year and scored in 19 specific skills (e.g., “questioning,” “lesson structure and pacing,”

“managing student behavior”). Each low-performing “target” teacher was identified as such

because his prior evaluation scores were particularly low in one or more of the 19 skill areas.

Then his high-performing “partner” was chosen because she had high scores in (many of) the

same skill areas. Each pair of teachers was encouraged by their principal to work together during

the school year on improving teaching skills identified by evaluation data. Thus the topics and

skills teachers worked on were specific to each pair and varied between pairs. More generally,

pairs were encouraged to examine each other’s evaluation results, observe each other teaching in

the classroom, discuss strategies for improvement, and follow up on each other’s commitments

throughout the school year.4

We find that treatment—pairing classroom teachers to work together on improving

skills—improves teachers’ job performance, as measured by their students’ test score growth. At

the end of the school year, the average student in a treatment school scores 0.06σ (student

standard deviations) higher on standardized math and reading/language arts tests than she would

have in a control school, regardless of whether her teacher participated in a partnership. The

gains are concentrated among “target” teachers; in target teachers’ classrooms students score

0.12σ higher. These are meaningful gains. One standard deviation in teacher performance is

typically estimated to be 0.15-0.20σ (Hanushek and Rivkin 2010). In other words, a gain of

0.12σ is roughly equivalent to the difference between being assigned to a median teacher instead

of a bottom quartile teacher. These improvements in teacher performance persist, and perhaps

grow, in the school year following treatment; in year two the effect for target teachers is a

marginally significant 0.25σ.

4 The treatment was designed in a collaboration between the research team and several people at the Tennessee Department of Education. TNDOE also played key roles in carrying out the experiment and collecting data.
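The comparison between a median and a bottom quartile teacher can be checked with a short calculation. The sketch below is illustrative only and rests on two assumptions not stated in the text: that teacher effects are normally distributed, and that "bottom quartile teacher" means a teacher at the 25th percentile of that distribution.

```python
from statistics import NormalDist

# Assumption: teacher effects are normally distributed.
# z_25 is the 25th percentile in teacher standard deviation units (negative).
z_25 = NormalDist().inv_cdf(0.25)

# Range for one teacher standard deviation, in student standard deviations,
# as cited in the text (Hanushek and Rivkin 2010).
gaps = {}
for teacher_sd in (0.15, 0.20):
    # Gap between the median teacher (z = 0) and the 25th percentile teacher.
    gaps[teacher_sd] = (0.0 - z_25) * teacher_sd
    print(f"teacher SD = {teacher_sd:.2f}: median minus 25th percentile = {gaps[teacher_sd]:.3f}")
```

Under these assumptions the gap falls between roughly 0.10σ and 0.13σ, which brackets the estimated 0.12σ effect for target teachers.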

Interpreting these differences as causal effects of treatment rests mainly on the random

assignment of schools. While the “target” and “partner” roles were not randomly assigned, the

roles were assigned by algorithm for both treatment and control schools prior to randomization,

as we detail in Section 1. The estimates in the previous paragraph are intent-to-treat estimates

based on algorithm-assigned roles.

After documenting average treatment effects, we turn to examining mechanisms. In

particular we ask: Can the performance improvements be attributed to growth in teachers’ skills

from peer learning, or are other changes in behavior or effort behind the estimated effects?

Larger effects for target teachers, effects which persist over time, are highly suggestive of skill

growth, but could also result if partnering increased target teachers’ motivation or effort, or

provided new opportunities to share resources or tasks (Jackson and Bruegmann 2009). In

Section 3, we test a number of empirical predictions motivated by these potential mechanisms. If

the underlying mechanism is skill growth, we would predict larger treatment effects for target

teachers when the high-performing partner’s skill strengths match more of the target teacher’s

weak areas. We would also predict larger improvements in classroom observation scores for the

skills matched, but smaller or no improvements in the skills left unmatched. Both predictions are

true empirically. If, however, the mechanism is shared production or resources, we would predict

larger effects when teacher pairs teach similar grade-levels or subjects. If the mechanism is effort

or motivation, we would predict larger effects when there were larger gaps in prior performance

between paired teachers, on the assumption that the comparison of performance induces greater

effort (as in Mas and Moretti 2009). Neither of these latter predictions is borne out in the data. In

short, the available data suggest target teachers learned new skills from their partner.

One contextual feature of the experiment is also important to interpreting these results.

The detailed micro-data with which teachers were paired are taken from the state’s performance

evaluation system for public school teachers, which the Tennessee Department of Education

introduced in 2011. Locally the treatment was known as the “Evaluation Partnership Program.”

These connections to formal evaluation, and its stakes, likely influenced principals’ and teachers’

willingness to participate and the nature of their participation.5 The evaluation context also

affects the counterfactual behavior of control schools and teachers. This context may partly

explain why we find positive effects in this case while other research does not consistently find

effects of formal mentoring or formal on-the-job training for teachers (see reviews by Jackson,

Rockoff, and Staiger 2014, and Yoon et al. 2007).6 More generally, this paper also belongs to a

small literature on how evaluation programs affect teacher performance (Taylor and Tyler 2012,

Steinberg and Sartain 2015, Bergman and Hill 2015, Dee and Wyckoff 2015). Taylor and Tyler

(2012) study veteran teachers who were evaluated by and received feedback from experienced,

high-performing teachers; the resulting improvements in teacher productivity persisted for years

after the peer evaluation ended.

The performance improvements documented in this paper suggest teachers can learn job

skills from their colleagues—empirical evidence of the intuitive benefit of skilled coworkers in

human capital development. The magnitude of those improvements suggests peer learning may

be as important as on-the-job experience in teacher skill development (Rockoff 2004, Papay and

Kraft forthcoming); indeed, peer learning may be a key contributor to the oft-cited estimates of

returns to experience in teaching. Most practically, the treatment and results suggest promising

ideas for managing the sizable teacher workforce.

5 In one-on-one interviews, some participating teachers said they were willing to participate because teacher pairs were matched based on specific skills and not on a holistic measure of performance.

6 Exceptions include an example of mentoring studied by Rockoff (2008) and an example of training studied by Angrist and Lavy (2001).

Next, in Section 1, we describe the treatment in detail, along with other features of the

experimental setting and data. In Section 2 we describe the average treatment effects and

treatment effects by teachers’ assigned partnership role. Section 3 discusses potential

mechanisms and presents tests of empirical predictions related to those mechanisms. We

conclude in Section 4 with some further discussion of the results.

1. Treatment, Setting, and Data

1.1 Treatment

We report on a field experiment designed to study on-the-job, peer learning between

teachers who work at the same school. At schools randomly assigned to the treatment

condition—known in the schools as the “Evaluation Partnership Program”—low-performing

“target” teachers were paired with a high-performing “partner” teacher, and each pair was

encouraged to work together on improving each other’s teaching skills over the course of the

school year. Importantly, teachers were matched using micro-data from state-mandated

performance evaluations. As described further in the next section, these prior evaluations include

separate performance ratings for many specific instructional skills (e.g., “questioning,” “lesson

structure and pacing,” “managing student behavior”). Each target teacher was identified as such

because he had low scores in one or more specific skill areas; his matched partner was selected

because she had high scores in (many of) the same skill areas. Pairs were approached by their

school principal and asked to work together for the year focusing on the strength-matched-to-

weakness skill areas, with the goal of improving instructional skills. Thus the topics and skills

teachers worked on were specific to each pair and varied between pairs. More generally, pairs

were encouraged to scrutinize each other’s evaluation results, observe each other teaching in the

classroom, discuss strategies for improvement, and follow up on each other’s commitments

throughout the school year.

While individual teacher pairings were the focus of the intervention, treatment was

assigned at the school level. Thus the success of individual pairs may have been influenced by

the principal’s role or support, or influenced by other teacher pairs in the school working in the

same kinds of ways. Certainly the extensive margin of treatment take-up was in the hands of the

school principal, as described below.

1.1.1 Teacher Evaluation in Tennessee

All public school teachers in Tennessee are evaluated annually. Beginning in the 2011-12

school year, the state introduced new, more-intensive requirements for teacher evaluation. The

new evaluations include both (i) direct assessments of teaching skills in classroom observations,

and (ii) measures of teachers’ contributions to student achievement. We focus in this section on

the classroom observation scores because they are the micro-data used in matching target and

partner pairs, and the motivation for each pair’s work together.

Each teacher is observed while teaching and scored multiple times during the course of

the school year, typically by the school principal or vice-principal. Observations and scores are

structured around a rubric known as the “TEAM rubric” which measures 19 different

instructional skill areas or “indicators” which are grouped into three categories: instruction,

planning, and classroom environment.7 The rubric is based in part on the work of Charlotte

Danielson (1996). Skill areas include things like “managing student behavior,” “instructional

plans,” “teacher content knowledge,” and many others. As an example, Figure 1 reproduces the

rubric for “Questioning.” Teachers are scored from 1-5 on each skill area: 1 significantly below

expectations, 2 below expectations, 3 at expectations, 4 above expectations, 5 significantly above

expectations. As the Figure 1 example suggests, the rubric language describes relatively specific

skills and behaviors, not vague general assessments of teaching effectiveness. As described in

the next section, we use these micro-data on 19 different skill areas to match high- and low-

performing teachers in working pairs.

Table 1 lists all 19 skills and provides summary statistics for teachers’ scores in the year

prior to the experiment. Teachers score lowest in “problem solving” and highest in “content

knowledge.” Scores for classroom environment skills are most variable, especially “managing

student behavior.” Table 1 draws on teachers in the treatment and control schools, but these

statistics are similar when calculated using all Tennessee schools.

At the end of the school year, the classroom observation micro-data are aggregated to

produce a final, overall, univariate observation evaluation for each teacher. In the 2012-13 school

year, the second under the state’s new requirements, teachers in our sample scored quite high in

classroom observations: more than four-fifths received an overall average rating of 3 or higher

(Table 1 Column 3), while just 1.2 percent scored 2 or lower.8 Critically for this study, however,

there was substantial variation between and within teachers at the 19-skill micro-data level as

7 TEAM stands for Tennessee Educator Acceleration Model. Most Tennessee districts, including the district where this paper’s data were collected, use the TEAM rubric. Some districts use alternative rubrics. The full TEAM rubric is available at: http://team-tn.org/wp-content/uploads/2013/08/TEAM-General-Educator-Rubric.pdf.

8 The small percentage of “low” final overall scores is not atypical, even after the revisions in teacher evaluation programs in recent years (Weisberg 2009, New York Times, March 30, 2013).

detailed in Table 1. Two out of five teachers received a score lower than 3 in at least one of the

19 skill areas; and, among that 41 percent, the average number of skills with a score below 3 is

5.8. (The threshold of below 3 was used to identify skills for matching pairs, as described in the

next section.) In the state evaluation process, the overall observation rating is combined with

student achievement data to produce a summative score for each teacher. It is important to note

that, while certainly used in the state’s formal evaluation scores, student achievement data were

not used in the matching of teachers or communication with teachers about the goals of the

teacher partnerships.

1.1.2 Identifying and Matching High- and Low-Performing Teachers

For the purposes of this experiment, a teacher was identified as “low-performing” or a

“target” if he had a score less than 3 in one or more of the 19 skill areas. Similarly, a teacher was

identified as a “high-performing” potential “partner” if she had a score of 4 or higher in one or

more skill areas. Both the set of targets and the set of potential partners were identified based on

pre-experiment evaluation scores: the average of a teacher’s scores from the prior school year

2012-13 and, when available, the first observation of 2013-14.9

Our matching algorithm followed these steps and rules: (1) Consider each possible

pairing of a target and a partner teacher who work in the same school, and calculate the total

number of skill areas (out of 19 possible) where there is a strength-to-weakness skill match for

that pairing. A strength-to-weakness match occurs when the target teacher has a score less than 3

9 If a teacher was identified as both a “target” and “partner” by these rules, the teacher was included only on the target list. Additionally, any potential “target” teacher with an overall observation rating of “above expectations” (4) or “significantly above expectations” (5) was excluded from the target teacher list. Two schools had few or no teachers identified as “target” by these rules. In those two cases, after random assignment, we used a threshold of less than or equal to 3, instead of strictly less than 3. One school had many teachers identified as “target.” In that school we limited the set of target teachers to those with 8 or more skill areas (out of 19) with a score less than 3. While these adjustments aided in the practical implementation of the program, our ITT estimates use only the original assignments of teachers, not the assignments after these school-specific adjustments.

in a given skill, and the partner has a score of 4 or higher in the same skill. (2) For each school,

list all possible configurations of pairings where each potential partner is matched to just one

target teacher. (3) Choose the set of pairings which maximizes the number of strength-to-

weakness matches (out of 19 * T possible, where T is the number of target teachers).10 This

algorithm produced a set of “recommended” matches for each school. We created recommended

match lists for both treatment and control schools pre-experiment, but the lists were only

provided to principals in the treatment schools.11
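Steps (1) through (3) can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: it enumerates every configuration by brute force rather than using the Hungarian Algorithm, it assumes every target teacher receives a partner and that a school has at least as many potential partners as targets, and the teacher IDs and four-skill score lists are hypothetical (the actual rubric has 19 skills).

```python
from itertools import permutations

def skill_matches(target_scores, partner_scores):
    """Step (1): count skills where the target scores below 3
    and the partner scores 4 or higher."""
    return sum(1 for t, p in zip(target_scores, partner_scores) if t < 3 and p >= 4)

def best_pairing(targets, partners):
    """Steps (2) and (3): try every configuration in which each partner is
    used at most once, and keep the one with the most skill matches.
    Assumes len(partners) >= len(targets)."""
    target_ids = sorted(targets)
    partner_ids = sorted(partners)
    best_total, best_assignment = -1, {}
    for chosen in permutations(partner_ids, len(target_ids)):
        total = sum(
            skill_matches(targets[t], partners[p])
            for t, p in zip(target_ids, chosen)
        )
        if total > best_total:
            best_total, best_assignment = total, dict(zip(target_ids, chosen))
    return best_total, best_assignment

# Hypothetical school with two targets and three potential partners,
# each scored from 1 to 5 on four skills.
targets = {"T1": [2, 2, 4, 3], "T2": [3, 2, 2, 2]}
partners = {"P1": [5, 4, 3, 3], "P2": [3, 4, 4, 5], "P3": [4, 3, 3, 3]}

total, pairs = best_pairing(targets, partners)
print(total, pairs)
```

Brute-force enumeration grows factorially with the number of teachers, so it is practical only for small schools; an assignment-problem solver reaches the same optimum far more efficiently.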

The principal in each treatment school was responsible for introducing each target-

partner pair, explaining why the two had been paired, and encouraging the pair in their work

together. Each teacher, target or partner, ultimately decided to what extent she would participate

and, in that sense, the experiment can be thought of as an encouragement-style design.

Additionally, principals were encouraged to review the recommended list of matches and make

changes as they saw fit. This latitude for principals was intended to help improve matches using

local knowledge not observed in the evaluation data; for example, a recommended match may

have paired two individuals known to not work well together. To encourage data-informed

adjustments, we provided principals with a list of additional potential matches for each target

teacher. These additional matches were the five potential partners with the highest strength-to-

weakness scores from step (1) above, regardless of the optimization and constraints in steps (2)

and (3). Principals received a spreadsheet listing the 19 skill areas for each teacher to show

where target and potential partner teachers matched. An example is shown in Figure 2. We

provided the proposed matches and information for participating principals and teachers to

principals in early November 2013.12

10 The core of this matching approach is sometimes called the Hungarian Algorithm or Method.

11 We can say definitively that no control school received a list of target teachers and proposed partners. However, we discussed the idea of the program with all principals, and we cannot ensure that the kernel of the idea was not adopted by principals in the control schools. Because principals conduct the observations, they had on hand all of the information necessary to conduct such pairings. Our discussions with district officials, though, suggest that principals in control schools did not undertake such a program. Any such activities would bias downward our estimated average treatment effects.

1.2 Sample and Random Assignment

The experiment was conducted at 14 elementary and middle schools (7 treatment, 7

control) in a medium-sized district in Tennessee. The district, Jackson-Madison County School

System, is the 12th largest in the state, enrolling approximately 13,000 students. Across the

district, 77 percent of students are economically disadvantaged, 61 percent are African-

American, 32 percent are white, and 7 percent are Hispanic. The district spends about $9,750 per

pupil annually. In 2013-14, the state of Tennessee’s measure of student test score growth ranked

Jackson-Madison as a Level 1 district, the lowest-performing category in the state.

In the summer of 2013, principals from all 21 of the district’s elementary and middle

schools were briefed on the Evaluation Partnership Program and 14 agreed to participate in the

study. We blocked the 14 schools in seven pairs based on level (elementary or middle) and

student enrollment, and randomly assigned one to treatment within each pair.13
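The blocked design can be illustrated with a short sketch. The text does not say exactly how schools were paired on level and enrollment, so the rule used here (sort by enrollment within level, then pair adjacent schools) is an assumption, as are the school names, enrollments, and random seed.

```python
import random

def make_pairs(schools):
    """Pair schools within level by adjacent enrollment rank.
    `schools` is a list of (name, level, enrollment) tuples.
    Assumes an even number of schools in each level."""
    by_level = {}
    for name, level, enrollment in schools:
        by_level.setdefault(level, []).append((enrollment, name))
    pairs = []
    for level in sorted(by_level):
        ranked = [name for _, name in sorted(by_level[level])]
        pairs.extend(zip(ranked[0::2], ranked[1::2]))
    return pairs

def randomize_within_pairs(pairs, seed=2013):
    """Assign exactly one school in each pair to treatment."""
    rng = random.Random(seed)
    assignment = {}
    for pair in pairs:
        treated = rng.choice(pair)
        for school in pair:
            assignment[school] = "treatment" if school == treated else "control"
    return assignment

# Hypothetical schools, for illustration only.
schools = [
    ("Elem A", "elementary", 300), ("Elem B", "elementary", 520),
    ("Elem C", "elementary", 310), ("Elem D", "elementary", 500),
    ("Middle A", "middle", 700), ("Middle B", "middle", 650),
]
pairs = make_pairs(schools)
assignment = randomize_within_pairs(pairs)
```

Because exactly one school per pair is treated, treatment and control groups are balanced on level by construction and closely balanced on enrollment.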

During the experiment year 2013-14, the 14 schools enrolled nearly 3,000 students in

grades 4 through 8 with at least 136 teachers in mathematics and reading/language arts. We focus

on grades 4 through 8 because these are the grades where we observe pre- and post-experiment

student achievement measured by state tests. Descriptive information on the students and

teachers is provided in Table 2.

12 Participating principals and teachers mainly communicated directly with the research team. The exception is that lists of proposed matches were emailed to principals by the TNDOE. The research team prepared the match reports using de-identified data, where teachers were only known by randomly generated ID numbers. The TNDOE then replaced the random IDs with actual names and sent the reports to principals.

13 This paper focuses on elementary and middle schools where we have data on student achievement test scores. We also recruited the district’s high schools and pre-kindergarten schools. Two high schools and one pre-k school agreed to participate, were randomized, and participated in the program.

Interpreting the results of this experiment as causal effects rests largely on the success of

randomly assigning schools to treatment and control conditions. Table 2 reports the traditional

test of randomization, comparing pre-treatment characteristics of students, teachers, and teacher

pairs. These tests include fixed effects for the randomization block pairs. We read the results as

evidence of successful randomization. Most differences are substantively quite small.14 Only 2 of

the 20 characteristics show a statistically significant difference between treatment and control

means: the proportion of English language learner students, and the difference in baseline

observation scores for teacher pairs. Additionally, in Appendix Table A1, we check for covariate

balance separately for teachers in each assigned program role: low-performing target teachers,

high-performing partners, and teachers not assigned a role. The results are similar.

1.3 Data

The Tennessee Department of Education provided two primary sources of data for this

paper. First, we use the teacher evaluation micro-data described above for the years 2012-13

through 2014-15. The pre-randomization evaluation data are used in matching target and partner

teachers; the post-randomization evaluation data are used as outcome measures of observed

teacher job performance. Second, we use state administrative records for the years 2012-13

through 2014-15 that include (i) student scores from annual state standardized tests in math and

reading/language-arts in grades 3 through 8, (ii) information on student demographics and

special educational programs, (iii) records linking each student to her assigned teacher(s) for

each subject each year, and (iv) information about teacher experience and prior performance. We

standardize all test scores (mean zero, standard deviation one) within year-grade-subject cells

14 Note that the teacher value-added scores are in teacher standard deviation units, not student standard deviations as is often the case. In student standard deviations the differences are roughly one-tenth to one-fifth the magnitudes in Tables 2 and A1.

using the statewide distribution (as opposed to the district specific mean and standard deviation).

Additionally, TNDOE provided data on where each teacher was working at the beginning of the

2015-16 school year.
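The test score standardization described above can be sketched as follows. This is an illustration under stated assumptions: the record layout and field names are hypothetical, the population standard deviation is used (the text does not specify which estimator), and each cell is assumed to contain at least two distinct scores.

```python
from collections import defaultdict
from statistics import fmean, pstdev

def standardize_within_cells(records):
    """Add a z-score to each record, using the mean and standard deviation
    of its own year-grade-subject cell."""
    cells = defaultdict(list)
    for r in records:
        cells[(r["year"], r["grade"], r["subject"])].append(r["score"])
    stats = {cell: (fmean(scores), pstdev(scores)) for cell, scores in cells.items()}
    standardized = []
    for r in records:
        mean, sd = stats[(r["year"], r["grade"], r["subject"])]
        standardized.append({**r, "z": (r["score"] - mean) / sd})
    return standardized

# Hypothetical statewide records: two cells, grade 4 math and grade 5 math.
records = [
    {"year": 2014, "grade": 4, "subject": "math", "score": 700},
    {"year": 2014, "grade": 4, "subject": "math", "score": 720},
    {"year": 2014, "grade": 4, "subject": "math", "score": 740},
    {"year": 2014, "grade": 5, "subject": "math", "score": 750},
    {"year": 2014, "grade": 5, "subject": "math", "score": 770},
]
standardized = standardize_within_cells(records)
```

To follow the paper's approach, the cell means and standard deviations would be computed from all students in the state, not only from students in the 14 study schools.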

2. Effects on Student Achievement

In this paper, we ask two primary empirical questions: First, did the treatment—pairing

classroom teachers to work together on improving skills—benefit (harm) teacher performance?

Second, if there were improvements, is there evidence that those improvements are the result of

growth in teachers’ skills from peer learning?

Our primary measure of teacher performance is growth in student achievement test

scores. Student learning that is measurable in standardized tests is, to be certain, only one aspect

of a teacher’s job responsibilities. Nevertheless, existing empirical evidence suggests student test

scores capture important variation in teacher performance (Jackson, Rockoff, and Staiger 2014

provide a review of the literature).

2.1 Average Treatment Effects on Student Achievement

Our most straightforward treatment effect estimates simply compare mean test scores in

treatment and control schools. The treatment-control difference in means, $\delta$, is estimated by

fitting the regression specification

$$A_{ijkt} = \delta \, EPP_{s(j)} + X_i \beta + \pi_{b(s)} + \varepsilon_{ijkt} \tag{1}$$

where $A_{ijkt}$ is the end-of-year $t$ (the experiment year) test score in subject $k$ (math or reading)

for student $i$ assigned to teacher $j$ in school $s$.15 The treatment indicator, $EPP_{s(j)}$, varies only at

the school-level $s$. All estimates in the paper include randomization block fixed effects, $\pi_{b(s)}$.

Throughout the paper, all statistical inferences account for error clustering within

schools. We report p-values obtained by the wild cluster bootstrap-t method suggested by

Cameron, Gelbach, and Miller (2008). This approach provides asymptotic refinement when the

number of clusters is small, as in our setting with 14 clusters. Cameron and coauthors show that

rejection rates can be as high as 10 percent for a nominally $\alpha = 0.05$ test using conventional

clustering methods; using the wild cluster bootstrap-t method rejection rates approach 5 percent.

However, in this setting inference using conventional cluster-robust standard errors (as

implemented in Stata) turns out to be quite similar.16

We report 𝛿 without and with additional covariates 𝑋𝑖, which are included to improve

precision. The additional covariates include student 𝑖’s prior achievement—measured with the

average of student 𝑖’s math and reading scores from the prior school year (𝑡 − 1)—as well as her

gender, race/ethnicity, English language learner status, and special education status.17 The vector

𝑋𝑖 also includes a pre-experiment “value added” measure of teacher 𝑗’s contributions to student

test scores in subject 𝑘; this measure comes from the state’s TVAAS system for 2011-12 and

15 The student-teacher link records allow students to be linked to more than one teacher for a given subject, though three-quarters in our sample are linked to just one teacher. When a student has more than one teacher, the state assigns a “percent responsibility” to each teacher. When a student has two or more teachers, we include one observation for each student-by-teacher pairing and weight by the “percent responsibility.” But our results are robust to assigning students to the one teacher with the highest weight.

16 Errors may also be correlated within a group of students taught by the same teacher. In our setting teachers are nested within schools. Thus, clustering errors at the school level is equivalent to clustering at both the school and teacher levels simultaneously (Cameron, Gelbach, and Miller 2011).

17 While the field experiment occurred only in one district, we observe student test scores throughout the state. As a result we have very little missing data in 𝑋𝑖; for example, less than 4 percent of students are missing baseline achievement. When baseline achievement or another given covariate is missing, we replace it with a value of zero and include an indicator = 1 for all students missing the given covariate. Our results are robust to excluding these approximately 4 percent of students.

2012-13. Last, 𝑋𝑖 includes grade-by-subject fixed effects and we allow the slope on prior

achievement score to differ by subject and grade.
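
The handling of missing covariates described in footnote 17 (fill with zero, add a missingness indicator) can be sketched as follows. The function name and the example scores are hypothetical, and `None` is assumed to mark a missing value.

```python
def fill_with_indicator(values):
    """Return (filled, missing_flag) lists; None marks a missing value."""
    filled = [0.0 if v is None else v for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing


# Hypothetical baseline scores for four students, one of them missing.
prior_score = [0.35, None, -1.20, 0.08]
filled, missing = fill_with_indicator(prior_score)
print(filled, missing)
```

Both columns then enter the regression together, so the coefficient on the indicator absorbs any average difference in outcomes for students whose baseline score is missing.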

Estimates of the average treatment effect on student achievement, 𝛿 in Equation 1, are

reported in Table 3 Panel A. These are school-level intent-to-treat effects, and do not use any

variation in assigned teacher roles or treatment take-up.18 Students in treatment schools score

about 0.06σ (student standard deviations) higher than their peers in control schools, on average

pooling math and reading outcomes. The difference is marginally statistically significant

(𝑝 = 0.08) when we control for pre-treatment covariates.

The estimated average treatment effect is somewhat larger when we focus on math

achievement alone (0.07σ; 𝑝 = 0.040) and somewhat smaller for reading and language arts

alone (0.05σ; 𝑝 = 0.072). This is consistent with the typical pattern in empirical research on

schooling: most general interventions affect reading achievement less than math achievement.

Additionally, the reading estimates are more sensitive to the inclusion of pre-treatment

covariates, though we cannot reject that the reading estimates in Column 1 and 2 are equivalent

statistically.

These average differences in student achievement can be interpreted as causal effects of

treatment under the traditional experimental design assumption: At the beginning of the

experiment, there was no difference in potential outcomes—student achievement growth, teacher

or school performance, etc.—between treatment and control samples. This assumption rests

18 We focus the paper’s discussion on ITT estimates. Only one of the seven treatment schools did not participate at all in the Evaluation Partnership Program. The non-participating treatment school received the partnership match lists and program materials just as the other six schools did, but chose not to move forward. Thus the implied first-stage for a school-level TOT estimate would be about 0.86, suggesting the TOT estimates would be about 15 percent larger than the numbers reported in Table 3 Panel A.

largely on the success of random assignment; evidence in support of successful random

assignment is presented in Table 2.

These positive average treatment effects are educationally and economically meaningful.

Gains of 0.06σ represent roughly one-third of a standard deviation in teacher performance, which

is typically estimated at 0.15-0.20σ in math and somewhat smaller in reading (Hanushek and

Rivkin 2010, Jackson, Staiger, and Rockoff 2014). Put differently, the 0.06σ difference is

roughly equivalent to the difference between being assigned to a median teacher and a 63rd

percentile teacher. Additionally, these average treatment effects are also roughly one-quarter the

estimated gain from reducing class size by 30 percent in elementary grades (Krueger 1999), or the

estimated gain from doubling the amount of class time middle and high school students spend in

math (Taylor 2014, Cortes, Goodman, and Nomi 2015). However, unlike reducing class size or

increasing class time, the current treatment—pairing classroom teachers to work together on

improving skills—does not require any meaningful increase in teacher salary expenditures.
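
The percentile comparison above can be reproduced with a short calculation. This is a sketch under two assumptions that are ours for illustration: teacher effects are normally distributed, and the teacher standard deviation is 0.18σ, roughly the middle of the 0.15-0.20σ range cited above.

```python
from statistics import NormalDist

effect = 0.06      # average treatment effect, in student standard deviations
teacher_sd = 0.18  # assumed value within the cited 0.15-0.20 range

z = effect / teacher_sd                  # effect in teacher standard deviations
percentile = 100 * NormalDist().cdf(z)   # percentile of a teacher z SDs above the median
print(round(z, 2), round(percentile))    # prints: 0.33 63
```

Using 0.15σ or 0.20σ instead moves the answer only a few percentile points in either direction.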

2.2 Treatment Effects for Target Teachers and Other Teachers

Next we estimate treatment effects separately for teachers with different roles in the

partnership program. The experiment was designed first to improve the job performance of low-

performing “target” teachers; thus, the estimates in the top panel of Table 3 may mask important

heterogeneity by role. Teachers were assigned to one of three roles: (i) low-performing target

teachers, (ii) high-performing potential partner teachers, and (iii) all other teachers who were not

assigned a role in partnerships.

Building on Specification 1, we estimate the following regression to test for differences

by assigned teacher role:

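$$\begin{aligned}
A_{ijkt} ={} & \delta_T \left(\mathit{EPP}_{s(j)} \times \mathit{Target}_{j(i)}\right) + \delta_P \left(\mathit{EPP}_{s(j)} \times \mathit{Partner}_{j(i)}\right) + \delta_N \left(\mathit{EPP}_{s(j)} \times \mathit{NoRole}_{j(i)}\right) \\
& + \alpha_P\, \mathit{Partner}_{j(i)} + \alpha_N\, \mathit{NoRole}_{j(i)} + X_i \beta + \pi_{b(s)} + \varepsilon_{ijkt}
\end{aligned} \qquad (2)$$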

where 𝑇𝑎𝑟𝑔𝑒𝑡𝑗, 𝑃𝑎𝑟𝑡𝑛𝑒𝑟𝑗, and 𝑁𝑜𝑅𝑜𝑙𝑒𝑗 are a set of mutually-exclusive and exhaustive indicator

variables varying between teachers. This specification is algebraically equivalent to a more

conventional specification with a main effect of treatment and interactions with two of the three

teacher roles. 𝑇𝑎𝑟𝑔𝑒𝑡𝑗 = 1 if teacher 𝑗 was listed as a low-performing target teacher on the

principal’s Evaluation Partnership Program report. Recall that the report was created for both

treatment and control schools, but only provided to treatment schools. Similarly, 𝑃𝑎𝑟𝑡𝑛𝑒𝑟𝑗 = 1 if

teacher 𝑗 was listed as a high-performing potential partner (either in the recommended pairings

list or list of other potential partners). All other teachers have 𝑁𝑜𝑅𝑜𝑙𝑒𝑗 = 1, and were not listed

on the principal’s report. The estimates from Specification 2 are best interpreted as intent-to-treat

because the role indicators are based on the original reports created by the research team, and not

based on any post-randomization endogenous decisions by principals or teachers. All other

details of estimation are the same as for Specification 1.
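
The algebraic equivalence noted above can be written out explicitly. The symbols $\delta$, $\gamma_P$, and $\gamma_N$ are introduced only for this illustration. The more conventional specification, with a main effect of treatment and interactions with two of the three roles, is

$$\begin{aligned}
A_{ijkt} ={} & \delta\, \mathit{EPP}_{s(j)} + \gamma_P \left(\mathit{EPP}_{s(j)} \times \mathit{Partner}_{j(i)}\right) + \gamma_N \left(\mathit{EPP}_{s(j)} \times \mathit{NoRole}_{j(i)}\right) \\
& + \alpha_P\, \mathit{Partner}_{j(i)} + \alpha_N\, \mathit{NoRole}_{j(i)} + X_i \beta + \pi_{b(s)} + \varepsilon_{ijkt}.
\end{aligned}$$

Because the role indicators are mutually exclusive and exhaustive, $\mathit{Target}_{j(i)} + \mathit{Partner}_{j(i)} + \mathit{NoRole}_{j(i)} = 1$, so the main effect can be split across the three roles. Collecting terms gives

$$\delta_T = \delta, \qquad \delta_P = \delta + \gamma_P, \qquad \delta_N = \delta + \gamma_N.$$

The two specifications therefore fit the data identically; Specification 2 simply reports the role-specific effects directly rather than as differences from the target-teacher effect.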

Treatment effects are largest for low-performing target teachers. As reported in the

bottom panel of Table 3, treatment leads to test-score gains of 0.12σ in target teachers’

classrooms (compared to students of teachers who would have been target teachers in control

schools, pooling math and reading). The estimates for high-performing partner teachers are

positive, but much smaller and not statistically significant. Indeed the estimates for partner

teachers are quite similar to the estimates for teachers who were not assigned a role in the

program.

Again, these improvements are meaningful. A gain of 0.12σ is roughly equivalent to the

difference between being assigned to a median teacher instead of a bottom quartile teacher. A

gain of 0.12σ is also at least as large as the difference in performance between a novice teacher

and a 5 to 10 year veteran (Rockoff 2004, Papay and Kraft 2015). However, the gains for target

teachers are not necessarily substituting for experience gains. Later in the paper we examine

heterogeneity of treatment effects by experience.
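
As a rough check on the median versus bottom-quartile comparison, the sketch below computes the gap between a median teacher and a teacher at the 25th percentile. It assumes normally distributed teacher effects, an illustrative teacher standard deviation of 0.18σ, and reads "bottom quartile teacher" as a teacher at the 25th percentile; all three are our assumptions.

```python
from statistics import NormalDist

teacher_sd = 0.18                    # assumed, in student standard deviations
z_q1 = NormalDist().inv_cdf(0.25)    # z-score of the 25th percentile (negative)
gap = -z_q1 * teacher_sd             # median minus 25th-percentile teacher
print(round(gap, 2))                 # prints: 0.12
```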

Moreover, 0.12σ likely underestimates the effect of treatment on teachers who actually

participate in the program. Table 4 reports estimates of program take-up by teacher role (the first

stage results from a traditional 2SLS estimate of TOT).19 Among treatment teachers assigned by

the research team to the target role, 61 percent participated in the program in the target role,

suggesting a treatment-on-the-treated estimate of about 0.20σ (= 0.12/0.61). These

improvements are large but similar to the gains documented by Taylor and Tyler (2012) studying

a program of evaluation and feedback, especially the gains Taylor and Tyler estimate for the ex-

ante lowest performing teachers.
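
The treatment-on-the-treated arithmetic is the intent-to-treat estimate divided by the take-up rate. The sketch below reproduces the numbers quoted in the text; the function name is hypothetical, and the rescaling is only valid under the usual assumptions (control teachers had no access to the program, and assignment affected outcomes only through participation).

```python
def tot_from_itt(itt, take_up):
    """Scale an intent-to-treat effect by the first-stage take-up rate."""
    if not 0 < take_up <= 1:
        raise ValueError("take_up must be in (0, 1]")
    return itt / take_up


target_tot = tot_from_itt(0.12, 0.61)   # target teachers: ITT and take-up from the text
school_first_stage = 6 / 7              # six of seven treatment schools participated
print(round(target_tot, 2), round(school_first_stage, 2))  # prints: 0.2 0.86
```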

2.3 Treatment Effects After Two Years

Last we estimate treatment effects one year after treatment. Table 5 shows estimates of

Specifications 1 and 2 using test score data for students taught by treatment and control teachers

in 2014-15.20 The improvements in teacher performance persist, and perhaps grow, in the year

after treatment. The estimated effect for low-performing target teachers is 0.25σ, and marginally

19 The sample and specification are identical to Table 3 Panel B Column 1 following Equation 2, except that the dependent variables are indicators for participation in a specific role.

20 Some of the 136 teachers in the sample switched schools between 2013-14 and 2014-15. As long as a teacher is working in a Tennessee public school teaching grades 4-8 we include her in the sample. In these cases “school”, s, in Specifications 1 and 2 for purposes of estimation is her school during 2013-14. For all teachers “role” is her program role during 2013-14.

significant (𝑝 = 0.068, Column 1 Panel B). The estimated average treatment effect is also larger

at 0.11σ, though the p-value is 0.22 (Column 1 Panel A).

Two considerations are important in interpreting these second-year results. The first

consideration is attrition. The sample for Table 5 Column 1 includes 96 of the original 136

treatment and control teachers. Some teachers were not working in Tennessee public schools;

others switched to teaching non-tested grades or subjects.21 To assess the potential influence of

this attrition, Columns 2 and 3 provide bounds for the treatment effect following the approach

described by Lee (2011).22 Bounds for the average treatment effect are 0.05σ and 0.14σ. Bounds

for the effect among low-performing target teachers are 0.21σ and 0.34σ (with p-values of

0.052 and 0.040 respectively). In short, the continued performance improvement among target

teachers is robust to attrition, in part, perhaps, because attrition was relatively well balanced.
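
The search over which teachers to trim, described in the footnote on the modified Lee bounds procedure, can be illustrated with a deliberately simplified sketch. Here the estimator is a difference in mean teacher-level scores rather than Specification 2 fit to student-level data, and the teachers, scores, and function names are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def effect(teachers, excluded=frozenset()):
    """Treatment-control difference in mean scores, dropping excluded ids."""
    kept = [t for t in teachers if t["id"] not in excluded]
    treated = [t["score"] for t in kept if t["treated"]]
    control = [t["score"] for t in kept if not t["treated"]]
    return mean(treated) - mean(control)


def trimmed_bound(teachers, trim_treated, n_trim, lower=True):
    """Try every set of n_trim teachers from one arm; keep the extreme estimate."""
    pool = [t["id"] for t in teachers if t["treated"] == trim_treated]
    choose = min if lower else max
    best = choose(combinations(pool, n_trim),
                  key=lambda ids: effect(teachers, frozenset(ids)))
    return effect(teachers, frozenset(best)), set(best)


teachers = [
    {"id": "T1", "treated": True, "score": 0.2},
    {"id": "T2", "treated": True, "score": 0.4},
    {"id": "T3", "treated": True, "score": 0.1},
    {"id": "C1", "treated": False, "score": 0.0},
    {"id": "C2", "treated": False, "score": 0.1},
    {"id": "C3", "treated": False, "score": -0.2},
    {"id": "C4", "treated": False, "score": 0.3},
]

full = effect(teachers)
lower, lower_ids = trimmed_bound(teachers, trim_treated=False, n_trim=1, lower=True)
upper, upper_ids = trimmed_bound(teachers, trim_treated=False, n_trim=1, lower=False)
print(round(lower, 3), round(full, 3), round(upper, 3))
```

Trimming the control teacher whose removal lowers the estimate the most gives the lower bound, and the mirror-image choice gives the upper bound. Setting `n_trim` above one searches over unique combinations of teachers.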

The second consideration is teachers who were target teachers in 2013-14 and again in

2014-15. In 2014-15 the treatment schools continued to use the Evaluation Partnership Program,

and teacher roles and partnerships were re-assigned based on evaluations from 2013-14. Two

teachers were recommended as target teachers in both years. These repeat target teachers may

have an outsized influence on the estimated year-two effect if the effect is increasing in the dose

of treatment, or if the improvement in performance only occurs in years when a teacher is

21 We find no statistically significant treatment effects on the probability that teachers leave their school or stop teaching altogether (at least in Tennessee public schools; results in Appendix Table A3).

22 Our setting requires one modification to the standard Lee bounds procedure. We are concerned about attrition at the teacher level, but have outcome data at the student level. Thus determining which teachers to “trim” is not a simple function of an observed variable. To determine which teachers to trim we follow these steps: (i) Estimate the number of teachers who should be trimmed separately by role. The answer turns out to be to trim one control target, one treatment partner, and three control teachers with no role. (ii) Repeatedly estimate $\hat{\delta}_T$ in Specification 2 excluding one control target teacher in each iteration. Identify the control target teacher 𝑗 whose exclusion minimizes $\hat{\delta}_T$. (iii) Repeat step (ii) excluding partner teachers to find the exclusion which minimizes $\hat{\delta}_P$. (iv) Repeat step (ii) excluding no-role teachers to find the exclusions which minimize $\hat{\delta}_N$. In this case each iteration excludes a unique combination of three control no-role teachers. (v) The five teachers identified in steps (ii)-(iv) are the teachers we trim to estimate the lower bound estimate. (vi) Repeat steps (ii)-(iv), this time maximizing each $\hat{\delta}$, to identify the five upper-bound teachers.

actively participating in a partnership. In Table 5 Column 4 we report estimates from a

specification identical to Column 1, except that treatment effect for target teachers is interacted

with an indicator = 1 if teacher 𝑗 was a target teacher in both years. Column 4 does not provide

causal estimates of one versus two years of treatment; being treated a second year is endogenous.

Column 4 does provide evidence consistent with the view that improvements in job performance for target

teachers persist, perhaps grow, after treatment ends. The 0.25σ improvement holds for teachers

who were targets just one year, though it is not statistically significant.

3. Growth in Teachers’ Skills and Other Potential Mechanisms

While the previous section documents educationally meaningful and economically

significant impacts, the average effect estimates do not shed light on the mechanisms through

which the peer pairings influence productivity. We now move on to our second empirical

question: Can the improvements in student learning—the positive average treatment effects—be

attributed to growth in teachers’ skills from peer learning? Or are other changes in behavior or

effort behind the treatment effects?

In treatment schools, low-performing teachers were paired with a high-performing

partner, and each pair was explicitly asked to work together on improving teaching skills. Thus

our first hypothesized mechanism is teacher skill growth. Nevertheless, there are at least two

other potential mechanisms that could contribute to the treatment effects: changes in teachers’

motivation or effort, and changes in shared tasks (joint production) or resources. These three

categories of mechanism are not mutually exclusive; all three could be contributing, to varying

extents, to the average treatment effects. Jackson and Bruegmann (2009) describe how these

three categories of teacher spillovers likely affect performance in a typical school context. Our

discussion of these three mechanisms focuses on how the treatment’s pairing of teachers may

have changed that typical context. The experimental setting and control counterfactual rule out

many first-order features of these mechanisms as we highlight in the next paragraphs. Later we

present empirical tests of predictions from these three hypothesized mechanisms. We cannot test

all predictions empirically, especially in the case of the effort or motivation hypothesis, and thus

the analysis is not definitive. However, the tests we can conduct are a step toward sorting out the

relative contributions of different mechanisms.

While the stated purpose (to participating teachers) of the intervention was to improve

teacher instructional skills, a second potential mechanism is that program participation could

result in changes in teachers’ motivation or effort. Asking a low-performing teacher to spend

more time with a high-performing colleague, and talk together about performance explicitly, may

have made her more optimistic or enthusiastic about work or made her more embarrassed about

her poor performance. Similarly, treatment teachers may have felt more accountability to their

new partner. These interactions may, in turn, lead to increased effort—either transitory increases

in effort (e.g., motivated by specific accountability to one’s partner or direct monitoring by one’s

partner) or lasting increases in effort (e.g., finding a new preferred equilibrium level of effort as a

result of interacting with one’s partner). There is evidence for coworker effects on effort outside

the education sector (for example, Mas and Moretti 2009). However, the scope for changes in

optimism, embarrassment, or accountability is limited in the current experimental comparison:

the treatment likely increased the degree of interaction with one coworker, but the mix of

coworkers and typical coworker interactions were the same in treatment and control schools.

Furthermore, teachers (and particularly low-performing teachers) in Tennessee already face

fairly strong extrinsic incentives to increase their performance, as their schools are under

substantial test-based accountability pressures and as individuals they are at risk of losing their

jobs for low evaluation ratings. The effect of any marginal accountability to one’s partner is

likely to be small.

The third potential mechanism to consider is changes in teachers’ opportunities to share

resources or production tasks. Teacher partnerships formed by the treatment program may have

expanded to activities outside the original program scope. For example, teachers paired by the

treatment may have been more likely to share existing lesson plans or cooperate in creating new

lessons in ways that benefited productivity. Again, the control counterfactual rules out the

typical, first-order resource or task sharing among teachers within a school. For example,

teachers in treatment and control schools could exchange lessons and have group collaboration

time; thus, to explain treatment effects, the shared production must be a specific result of the new

partnership.

Our objective in this section is to present empirical evidence to help discriminate among

these three potential mechanisms. We have already seen two important empirical results that are

consistent with teacher skill growth. First, improvements in performance are largest for the low-

performing target teachers; the average effect was perhaps entirely driven by target teacher

improvements. Second, the performance improvements persisted, and perhaps even grew, in the

year after treatment ended. However, the pattern of larger effects for target teachers could result

from an asymmetric change in teachers’ motivation or effort, or an asymmetric change in

resource or task sharing as a result of treatment. Such changes might then persist after the

partnership has formally ended. In the remainder of this section, we present several additional

empirical tests. In short, we examine whether treatment effects vary with the characteristics of

teacher pairings in ways that are most consistent with skill growth or with either of the other two

potential mechanisms.

3.1 Learning Skills

First, if low-performing target teachers did learn new skills from their high-performing

partner, we would expect larger treatment effects when the partner teacher’s specific skill

strengths matched the target teacher’s specific weaknesses. In Table 6 Columns 2 through 5 we

test this prediction. (For ease of comparison, we reproduce our main results from Table 3 in

Column 1.) In Column 2, for the target teachers, we interact the treatment indicator with the

proportion of teacher 𝑗’s weak skill areas matched by her recommended partner’s strong skill

areas.23 The “proportion skills matched” measure is based on the one-to-one pairings

recommended in the original principal reports (in the spirit, again, of intent-to-treat estimates).

We have standardized the “proportion skills matched” (mean 0, s.d. 1) for comparison with other

measures of pair characteristics in Table 6. Consistent with peer learning, the coefficient is

positive and statistically significant (𝑝 = 0.052). Student achievement gains are larger in pairs

where the high-performing teacher is better suited to teach new skills to her low-performing

partner.
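
The “proportion skills matched” measure, as constructed in footnote 23, can be sketched in a few lines. The function name and the example scores are hypothetical; the thresholds (target below 3, partner 4 or greater) come from the footnote.

```python
def proportion_skills_matched(target_scores, partner_scores):
    """Share of the target's weak skill areas (score < 3) where the partner is strong (>= 4)."""
    weak = [k for k, score in enumerate(target_scores) if score < 3]
    if not weak:
        return None  # no weak areas, so no match was attempted
    matched = sum(1 for k in weak if partner_scores[k] >= 4)
    return matched / len(weak)


# Five skill areas for brevity; the evaluation rubric has 19.
target = [2, 2, 3, 4, 1]
partner = [4, 3, 5, 5, 5]
share = proportion_skills_matched(target, partner)
print(share)
```

In this example the target is weak in three areas and the partner is strong in two of them, so the measure is two-thirds. The regressions use a standardized version of this quantity.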

Moreover, the result for “proportion skills matched” is not driven by target teachers who

simply have more weak skills to match on. In Column 3 we replace “proportion skills matched”

with “number of skills attempted to match”, and in Column 4 we include both simultaneously.

The target teacher’s number of weak skills does not itself predict a larger treatment effect; the

23 Recall that proposed pairings were determined algorithmically based on matches in 19 specific skill areas measured in each teacher’s prior evaluation micro-data. We count up the number of skill areas in which there is a match: the target teacher has a score less than 3 and the recommended partner has a score of 4 or greater (see Section 1). Then, we divide the number of matches by the number of areas in which the target teacher scored less than 3.

effect is larger only when weak skills are matched. By contrast, if attention to low skill scores, by

the program or by the partnership, generated a change in teacher effort, we would have expected

a positive coefficient on “number of skills attempted to match.”

In Column 5 we replace the continuous proportion matched with an indicator = 1 if the

proportion skills matched is above the pair median (about 0.5). This less-parametric approach

also shows larger performance improvements when target and partner teachers are better

matched on skills, and the estimates are more precise. In fact, the treatment effects appear

concentrated among target teachers who were better matched to partners with relevant skills to

share. By contrast, if low-performing target teachers’ motivation, effort, or joint production

behavior changed, we would likely see positive treatment effects even when there are few or no

strength-to-weakness skill matches. This is apparently not the case. The estimated treatment

effect when the proportion skills matched is below median (Column 5 Row 1) is positive

(0.036σ; 𝑝 = 0.340), but not statistically significant and similar in size to the point estimates for

partner and no assigned role teachers.

Evidence for skill growth is also apparent when we examine other aspects of teacher

performance. To this point we have focused on performance as measured by student test score

growth. Table 7 reports treatment effects for a second measure of teacher performance:

evaluation scores from direct, in-class observations of teaching practices. (These are observation

scores from teachers’ formal evaluations described in Section 1.) In Column 1 the outcome is

teacher 𝑗’s average score across all 19 skills and all classroom observations.24 All outcome

variables are standardized (mean 0, s.d. 1). As measured in direct observations, target teachers’

24 As described in Section 1, teachers are scored on 19 skills multiple times per year. We first calculate an average score for each skill, standardize the skill scores, and then average the skill scores to obtain the overall average. We use all post-randomization scores from 2013-14 and 2014-15. We regress each outcome on a treatment indicator, randomization block fixed effects, and controls for teacher experience to improve precision.

skills improve on average by a marginally significant 0.05 teacher standard deviations. The

coefficients for partner teachers are similar in magnitude, but estimates for both partner and no-

role teachers have very wide (implicit) confidence intervals.
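
The construction of the overall observation score in footnote 24 (average within skill, standardize each skill across teachers, then average across skills) might look like the sketch below. The data layout, the skill names, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def composite_scores(ratings):
    """ratings[teacher][skill] is a list of observation scores for that skill."""
    teachers = sorted(ratings)
    skills = sorted({s for t in teachers for s in ratings[t]})
    z_by_teacher = {t: [] for t in teachers}
    for s in skills:
        # Step 1: average each teacher's repeated observations of this skill.
        avg = {t: mean(ratings[t][s]) for t in teachers if s in ratings[t]}
        values = list(avg.values())
        mu, sd = mean(values), pstdev(values)
        # Step 2: standardize the skill score across teachers.
        for t, a in avg.items():
            z_by_teacher[t].append((a - mu) / sd if sd > 0 else 0.0)
    # Step 3: average the standardized skill scores within teacher.
    return {t: mean(z) for t, z in z_by_teacher.items()}


ratings = {
    "A": {"questioning": [2, 3], "pacing": [3, 3]},
    "B": {"questioning": [4, 4], "pacing": [4, 5]},
    "C": {"questioning": [3, 3], "pacing": [4, 4]},
}
scores = composite_scores(ratings)
print({t: round(v, 2) for t, v in scores.items()})
```

Standardizing before averaging keeps any one skill with unusually dispersed ratings from dominating the composite.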

Consistent with the skill growth mechanism, notably, target teachers’ improvements are

concentrated in skills where there was a match—a match between the target’s weak skills and

her partner’s strong skills. We examine treatment effects separately (Column 2) for the subset of

skills where there was a match—the target scored less than 3 and her partner scored 4 or higher—and

(Column 3) the subset of skills where the target was weak—she scored less than 3—but there

was no match. We also report (Column 4) the effect for skill areas where we did not attempt to

match because the target scored 3 or higher.25 In matched skills target teachers improved 0.27

teacher standard deviations. But, by contrast, target teachers improved much less, if at all, in

weak skill areas where there was no match with a partner strength. In summary, performance

gains are larger in skill areas where the high-performing teacher is better suited to teach new

skills to her low-performing partner.26

3.2 Motivation or Effort

The evidence presented above is largely consistent with teachers building skills, rather

than simply increasing motivation or effort. Here, we present additional evidence to differentiate

between these mechanisms. If changes in effort contributed to the treatment effects, we might

expect the treatment effect to be positively correlated with the size of the gap between the target

25 For any individual teacher these three categories are mutually exclusive. Each of the 19 skills must belong to one and only one of matched, not matched, or not attempted to match.

26 One note of caution is warranted in interpreting these classroom observation results. Many of the observations are conducted by the school principal who, in treatment schools, was certainly aware of the program and its goals of improved practices and improved evaluation scores. Treatment principals may have, consciously or unconsciously, inflated observation scores to recognize participation in the program; or, alternatively, principals may have been more critical or more aware of low performance as a result of the program. However, to produce the results in Table 7, the principals would have had to differentially score target teachers on the matched skills specifically.

teacher’s baseline job performance and her partner’s baseline performance. For example, Mas

and Moretti (2009) find that low-performing grocery check-out clerks increase their work effort

when a higher-performing peer works at the same time and can observe the low-performer’s

effort; but high-performers are not affected by other high-performers. We test this prediction in

Table 6 Columns 6-9, interacting the treatment indicator with the difference in prior-year skill

measures. Note that while a positive correlation would be consistent with motivation, it might

also be consistent with peer learning. A larger gap in skills may indicate a pair where the high-

performer has more things to teach her partner.

Looking across the estimates in these columns, we do not find evidence of such a positive

correlation. In fact, the imprecise point estimates are all negative. In Column 6, we examine

differences in prior-year classroom observation scores (partner teacher minus target teacher).

These observation scores are the rubric scores gathered in the formal evaluation process

described in Section 1. The point estimate is small, negative, and not statistically significant. In

Column 7 we include interactions for both difference in observation scores (possibly indicative

of effort) and proportion skills matched (indicative of learning). Both measures have been

standardized (mean 0, s.d. 1 throughout Table 6) to facilitate comparisons like Column 7. If

anything, the coefficient on proportion skills matched is more positive than in Column 2, and the

coefficient on observation scores more negative than in Column 6.

In Columns 8 and 9 we use an alternative measure of the gap between the target and

partner baseline performance: the difference in prior “value added scores”.27 Value added scores

are designed to measure each individual teacher’s contribution to student test score growth. The

27 We use value added scores provided by the state’s TVAAS evaluation system. We calculate an overall value added score for each teacher by averaging all her subject-by-year value added scores from 2011-12 and 2012-13.

pattern for difference in value added is quite similar to the pattern for difference in observation

scores.

3.3 Sharing Resources or Tasks

The third mechanism category concerns new opportunities to share productive resources

or job tasks. Being paired with another teacher may foster willingness or opportunities to

collaborate at work in ways that improve performance even if the target teacher’s skills do not

change. Under this hypothesis we would expect the treatment effect to be greater when teacher

pairs teach the same grade-level or subject area. The assumption motivating this test is that, even

absent the treatment, shared production activities are easier or higher-return when teachers teach

the same grade-level or subject. Sharing lesson plans is a simple concrete example: for a 6th

grade mathematics teacher, sharing plans with another 6th grade math teacher would be more

helpful than with an 8th grade teacher or a social studies teacher. To test this mechanism, in

Column 10, we interact the target treatment effect with an indicator = 1 if the target and partner

teachers are both teaching the subject 𝑘. In Column 11, we interact the effect with an indicator

= 1 if the target and partner are both teaching the same grade level(s).28 In both cases the point

estimate is positive, which would be consistent with this third mechanism. But in both cases the

estimates are small and (implicit) confidence intervals are quite wide—we could not reject large

negative effects inconsistent with this mechanism. Furthermore, the estimated treatment effects

for teachers in non-matched grades and subjects remain large, statistically significant, and of

approximately the same magnitude as in Column 1. And, as above, this effect could derive in

28 In practice teachers often teach more than one grade level, especially in middle and high school. We calculate the student-weighted average of grade level for each teacher. The indicator is = 1 if the difference in that average grade level is less than one.

part because peer learning is facilitated by shared subject or grade level. In short, we do not find

strong evidence for a shared production mechanism.
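
The same-grade indicator defined in footnote 28 (student-weighted average grade level, with a match when the two averages differ by less than one) can be sketched as follows. The function names and enrollment counts are hypothetical.

```python
def average_grade(students_by_grade):
    """Student-weighted average grade level; keys are grades, values are student counts."""
    total = sum(students_by_grade.values())
    return sum(grade * n for grade, n in students_by_grade.items()) / total


def same_grade(target, partner):
    """Indicator = 1 if the two teachers' average grade levels differ by less than one."""
    return 1 if abs(average_grade(target) - average_grade(partner)) < 1 else 0


target = {6: 50, 7: 25}      # mostly 6th grade, some 7th
partner_near = {7: 60}       # 7th grade only
partner_far = {8: 40}        # 8th grade only
print(same_grade(target, partner_near), same_grade(target, partner_far))  # prints: 1 0
```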

To summarize the results across the three mechanism categories, first, we find evidence

consistent with the hypothesis that low-performing target teachers learn new skills from their

high-performing partner. Namely, improvements in performance, as measured by student test

scores, are concentrated among target teachers; improvements are larger when more deficient

skills are matched by the partner’s skills; and improvements persist, perhaps grow, in the year

following treatment. Additionally, improvements, as measured in classroom observations, are

concentrated in matched skills. By contrast, predictions for other mechanisms are not borne out

in our data. We do not find evidence consistent with a motivation or effort hypothesis, nor a

shared resources or task hypothesis. Our empirical tests are partial and the three mechanisms are

not mutually exclusive, so we cannot rule out any of these mechanisms. However, the available

evidence suggests skill growth accounts for part of, perhaps much of, the average treatment

effect.

4. Discussion and Conclusion

In this paper we study on-the-job learning among classroom teachers, especially learning

skills from coworkers, in a field experiment. We document meaningful improvements in job

performance among treatment teachers, and the patterns of performance gains suggest teachers

learned job skills from their colleagues. The gains are empirical evidence of the intuitive, long-

theorized benefits of coworkers in human capital development.

The relatively low-performing teachers targeted by our intervention—and ultimately their

students—benefited substantially from partnering with a higher-performing colleague at their

school. Target teachers’ performance improved 0.12 student standard deviations in the year of

treatment, and perhaps double that much in the following year. These performance

improvements were larger when teacher partnerships were better matched on strong and weak

skills, but otherwise we find little evidence of heterogeneity.

Our estimates are consistent with prior evidence of learning from coworkers among

teachers. Jackson and Bruegmann (2009) find that when a teacher begins working with higher-

performing colleagues her own performance improves as a result. A one standard deviation

increase in peer performance, as measured by prior contributions to student test score growth,

generates a 0.03-0.04σ improvement in own performance, also measured with current student

test scores. Importantly, Jackson and Bruegmann define “working with” as teaching in the same

grade and school as the peer; the peer work we study is more direct, and the effects are larger.

Taylor and Tyler (2012) find that teacher performance improves 0.05σ during a school year in

which a teacher from a different school and the school principal conduct classroom-observation-

based evaluations and subsequently provide feedback. Both Jackson and Bruegmann (2009) and

Taylor and Tyler (2012) report that the gains in performance are sustained into the future; indeed

in the latter case the effects grow from 0.05σ during the peer evaluation year to 0.10σ in the

years after the peer evaluation year. We similarly find that performance improvements persist,

perhaps grow, in the year following treatment.

Notably, the treatment effects for target teachers appear to hold for both

experienced and inexperienced teachers. In other words, the treatment is not simply substituting

for the returns to experience. If anything, the treatment effect was 0.03σ larger for target teachers

with more than ten years of experience, though the result is far from statistically significant

(Appendix Table A2). Performance growth for experienced teachers is rare in the literature, but


this result is consistent with results in Taylor and Tyler (2012). In Appendix Table A2 we

examine several potential sources of heterogeneity, including teaching experience, prior

performance, school level, and the difference in experience between target and partner. None

show a statistically significant relationship to the treatment effect.

The experiment and results suggest practical alternatives to formal on-the-job training,

especially in professional occupations. The contrast in approaches is particularly strong in the

case of teachers. Formal courses, called “professional development,” are today the primary

approach to on-the-job training for public school teachers. Collectively K-12 schools spend

about $18 billion per year on professional development courses, of which $3 billion is paid to

external providers (Gates Foundation 2014); the average teacher spends at least 20 hours each

year in “professional development”.29 Despite the substantial commitment of resources, the

empirical evidence suggests little effect on teacher performance (see reviews by Jackson,

Rockoff, and Staiger 2014, and Yoon et al. 2007). Similarly, public school systems spend

tremendous resources paying for teachers’ graduate tuition, and paying higher salaries once

teachers obtain their graduate degree. There is limited evidence that such degrees significantly

improve teacher effectiveness (Jackson, Rockoff, and Staiger 2014).

By contrast, the one-on-one personalized approach to on-the-job training we study in this

paper is apparently much more successful and much less costly. The primary marginal cost for

treatment schools was the time that target and partner teachers allocated to working with each

other. The opportunity cost of that time is important, although it could plausibly substitute for the

time teachers would have spent in formal professional development courses. The cost is low in

part because the high-performing coworker provides the teaching expertise. There could be

29 Authors’ calculation from the Schools and Staffing Survey 2011-12.


additional costs if higher-performing partners substitute away from other activities, especially

attention to their own students. However, we do not find evidence of reduced performance

among the higher-performing partners. If anything, the high-performers may have also benefited

from participating in the pairings.

One important contextual feature of the experiment is the formal teacher evaluation

system. All teachers in our study—treatment and control, target and partner and no role—are

subject to Tennessee’s new formal performance evaluation system. Teacher pairs were identified

based on prior evaluation results, and teacher pairs were encouraged, in part, to work on

improving evaluation results. These connections to formal evaluation likely influenced

principals’ and teachers’ willingness to participate, and the nature of their participation. For these

reasons we think this study contributes to the small, still-developing literature on how

evaluation affects teacher performance (Taylor and Tyler 2012, Steinberg and Sartain 2015,

Bergman and Hill 2015). One additional result on this subject comes from a survey of teachers at

the end of the experiment. Teachers were asked a series of questions to measure their attitude

toward formal evaluation, for example, “I have a favorable impression of the teacher evaluation

system” rated on a six point agree/disagree scale.30 Judging from survey responses, teachers in

treatment schools left with more favorable opinions of evaluation: attitudes about evaluation

were 0.23 standard deviations more positive, as measured by a composite of the four survey

questions. However, survey response rates were lower in treatment schools (approximately 45

30 The other three questions on this topic were: “In general, my colleagues have a favorable impression of the teacher evaluation system” and “I receive valuable feedback and guidance through teacher evaluation that helps me improve,” both rated on the six-point agree/disagree scale, and “What do you feel is the primary purpose of the teacher evaluation system? To help teachers improve. To rate teachers. Some of both.” Surveys were collected by the authors working directly with participating schools. To create a composite score for evaluation attitudes we conducted a factor analysis of these questions and used the predicted first factor as our dependent variable. The factor analysis inputs were the three agree/disagree responses and separate binary indicators for “To help teachers improve” and “To rate teachers”. The first factor explains nearly 100 percent of the variation in responses to these four questions.


percent versus 66 percent), and thus this result should be interpreted with caution. If treatment

suppressed responses from teachers with negative opinions, then the treatment effect on attitudes

could easily be negative, but the empirical direction of any non-response bias is not clear.

The teacher job performance improvements documented in this paper suggest learning

from colleagues is at least as valuable as formal on-the-job training or the gains from experience

in developing teaching skills. Indeed peer learning may be a key contributor to the oft-cited

estimates of returns to experience in teaching. Most practically, the treatment and results suggest

promising ideas for managing the sizable teacher workforce.


References

Angrist, J. & Lavy, V. (2001). Does Teacher Training Affect Pupil Learning? Evidence from Matched Comparisons in Jerusalem Public Schools. Journal of Labor Economics, 19(2), 343–69.

Azoulay, P., Zivin, J. S., & Wang, J. (2010). Superstar Extinction. Quarterly Journal of Economics, 125(2), 549-589.

Bandiera, O., Barankay, I., & Rasul, I. (2005). Social Preferences and the Response to Incentives: Evidence from Personnel Data. Quarterly Journal of Economics, 120(3), 917-962.

Battu, H., Belfield, C. R., & Sloane, P. J. (2003). Human Capital Spillovers within the Workplace: Evidence for Great Britain. Oxford Bulletin of Economics and Statistics, 65(5), 575-594.

Becker, G. S. (1962). Investment in Human Capital: A Theoretical Analysis. Journal of Political Economy, 70(5 part 2), 9-49.

Bergman, P., & Hill, M. J. (2015). The Effects of Making Performance Information Public: Evidence From Los Angeles Teachers and a Regression Discontinuity Design. CESifo Working Paper 5383.

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. Review of Economics and Statistics, 90(3), 414-427.

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). Robust Inference with Multiway Clustering. Journal of Business and Economic Statistics, 29(2), 238-249.

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014). Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood. American Economic Review, 104(9), 2633-2679.

Cortes, K. E., Goodman, J. S., & Nomi, T. (2015). Intensive Math Instruction and Educational Attainment: Long-Run Impacts of Double-Dose Algebra. Journal of Human Resources, 50(1), 108-158.

Danielson, C. (1996). Enhancing Professional Practice: A Framework for Teaching. Alexandria, VA: Association for Supervision and Curriculum Development.

Dee, T. S., & Wyckoff, J. (2015). Incentives, Selection, and Teacher Performance: Evidence from IMPACT. Journal of Policy Analysis and Management, 34(2), 267-297.

Gates Foundation. (2014). Teachers Know Best: Teachers' Views on Professional Development. Seattle: Bill & Melinda Gates Foundation.


Hamilton, B. H., Nickerson, J. A., & Owan, H. (2003). Team Incentives and Worker Heterogeneity: An Empirical Analysis of the Impact of Teams on Productivity and Participation. Journal of Political Economy, 111(3), 465-497.

Hanushek, E. A., & Rivkin, S. G. (2010). Generalizations about Using Value-Added Measures of Teacher Quality. American Economic Review, 100(2), 267-271.

Harris, D., & Sass, T. (2011). Teacher Training, Teacher Quality, and Student Achievement. Journal of Public Economics, 95, 798-812.

Ichino, A., & Maggi, G. (2000). Work Environment and Individual Background: Explaining Regional Shirking Differentials in a Large Italian Firm. Quarterly Journal of Economics, 115(3), 1057-1090.

Jackson, C. K., & Bruegmann, E. (2009). Teaching Students and Teaching Each Other: The Importance of Peer Learning for Teachers. American Economic Journal: Applied Economics, 1(4), 85-108.

Jackson, C. K., Rockoff, J. E., & Staiger, D. O. (2014). Teacher Effects and Teacher-Related Policies. Annual Review of Economics, 6(1), 801-825.

Krueger, A. B. (1999). Experimental Estimates of Education Production Functions. Quarterly Journal of Economics, 114(2), 497-532.

Lucas, R. E. (1988). On the Mechanics of Economic Development. Journal of Monetary Economics, 22(1), 3-42.

Marshall, A. (1890). Principles of Economics. London: Macmillan.

Mas, A., & Moretti, E. (2009). Peers at Work. American Economic Review, 99(1), 112-145.

Moretti, E. (2004). Workers' Education, Spillovers, and Productivity: Evidence from Plant-Level Production Functions. American Economic Review, 94(3), 656-690.

Kraft, M. A. & Papay, J. P. (2014). Do Supportive Professional Environments Promote Teacher Development? Explaining Heterogeneity in Returns to Teaching Experience. Educational Evaluation and Policy Analysis, 36(4), 476-500.

Papay, J. P. & Kraft, M. A. (2015). Productivity Returns to Experience in the Teacher Labor Market: Methodological Challenges and New Evidence on Long-Term Career Improvement. Journal of Public Economics, forthcoming.

Rockoff, J. E. (2004). The Impact of Individual Teachers on Student Achievement: Evidence from Panel Data. American Economic Review, 94(2), 247-252.


Rockoff, J. E. (2008). Does Mentoring Reduce Turnover and Improve Skills of New Employees? Evidence from Teachers in New York City. Columbia Business School Working Paper.

Sacerdote, B. I. (2011). Peer Effects in Education: How Might They Work, How Big Are They and How Much Do We Know Thus Far? In Handbook of Economics of Education Volume 3, Hanushek, E. A., Machin, S., & Woessmann, L. eds. Amsterdam: North Holland.

Steinberg, M. P., & Sartain, L. S. (2015). Does Teacher Evaluation Improve School Performance? Experimental Evidence from Chicago’s Excellence in Teaching Pilot. Education Finance and Policy, forthcoming Winter 2015.

Taylor, E. S. (2014). Spending More of the School Day in Math Class: Evidence from a Regression Discontinuity in Middle School. Journal of Public Economics, 117, 162-181.

Taylor, E. S., & Tyler, J. H. (2012). The Effect of Evaluation on Teacher Performance. American Economic Review, 102(7), 3628-3651.

U.S. Census Bureau. (2015). Public Education Finances: 2013. Educational Finance Branch G13-ASPEF. Washington, D.C.: United States Census Bureau.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Teacher Effectiveness. New York: The New Teacher Project.

Yoon, K. S., Duncan, T., Lee, S. W., Scarloss, B., & Shapley, K. (2007). Reviewing the Evidence on How Teacher Professional Development Affects Student Achievement: Issues & Answers Report, REL 2007–No. 033. Washington, DC: US Department of Education, Institute of Education Sciences.


Questioning

Significantly Above Expectations (5)

• Teacher questions are varied and high-quality, providing a balanced mix of question types:
    o knowledge and comprehension;
    o application and analysis; and
    o creation and evaluation.
• Questions require students to regularly cite evidence throughout lesson.
• Questions are consistently purposeful and coherent.
• A high frequency of questions is asked.
• Questions are consistently sequenced with attention to the instructional goals.
• Questions regularly require active responses (e.g., whole class signaling, choral responses, written and shared responses, or group and individual answers).
• Wait time (3-5 seconds) is consistently provided.
• The teacher calls on volunteers and non-volunteers, and a balance of students based on ability and sex.
• Students generate questions that lead to further inquiry and self-directed learning.
• Questions regularly assess and advance student understanding.
• When text is involved, majority of questions are text based.

At Expectations (3)

• Teacher questions are varied and high-quality providing for some, but not all, question types:
    o knowledge and comprehension;
    o application and analysis; and
    o creation and evaluation.
• Questions usually require students to cite evidence.
• Questions are usually purposeful and coherent.
• A moderate frequency of questions asked.
• Questions are sometimes sequenced with attention to the instructional goals.
• Questions sometimes require active responses (e.g., whole class signaling, choral responses, or group and individual answers).
• Wait time is sometimes provided.
• The teacher calls on volunteers and non-volunteers, and a balance of students based on ability and sex.
• When text is involved, majority of questions are text based.

Significantly Below Expectations (1)

• Teacher questions are inconsistent in quality and include few question types:
    o knowledge and comprehension;
    o application and analysis; and
    o creation and evaluation.
• Questions are random and lack coherence.
• A low frequency of questions is asked.
• Questions are rarely sequenced with attention to the instructional goals.
• Questions rarely require active responses (e.g., whole class signaling, choral responses, or group and individual answers).
• Wait time is inconsistently provided.
• The teacher mostly calls on volunteers and high-ability students.

Figure 1—Example from TEAM rubric, “Questioning” skills


Figure 2—Sample report for school principals showing potential partner matches for target teachers

Row labels: Teacher Name, followed by Potential Match Teacher Names

Skill columns: Instructional Plans (IP); Student Work (SW); Assessment (AS); Expectations (EX); Managing Student Behavior (MSB); Environment (ENV); Respectful Culture (RC); Standards and Objectives (SO); Motivating Students (MS); Presenting Instructional Content (PIC); Lesson Structure and Pacing (LS); Activities and Materials (ACT); Questioning (QU); Academic Feedback (FEED); Grouping Students (GRP); Teacher Content Knowledge (TCK); Teacher Knowledge of Students (TKS); Thinking (TH); Problem Solving (PS)

Jane Blue        o o o o o
  Jane Brown     x x x x x x x x x x x x x x x x x x
  Jane Yellow    x x x x x x x x x x x x x x x x
  John Red       x x x x x x x x x x x x x
  Jane Orange    x x x x x x x x x x x x x x
  John Pink      x x x x x x x x x x x x x

John Green       o o o
  John Black     x x x x x x x x x x x x x x x x x x x
  Jane Yellow    x x x x x x x x x x x x x x x x
  Jane Orange    x x x x x x x x x x x x x x
  John Pink      x x x x x x x x x x x x x
  Jane White     x x x x x x x x x x x x

Note: An 'x' indicates that the teacher had an average score of 4 or higher on that element, an 'o' indicates an average score of less than 3.
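The marking rule in the note above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' reporting code: the function name `skill_mark` and the example scores are hypothetical, and average scores from 3 up to (but not including) 4 are assumed to be left blank because the note names no symbol for them.

```python
def skill_mark(avg_score):
    """Return the report symbol for one skill's average observation score.

    'x' marks an average of 4 or higher, 'o' marks an average below 3.
    Leaving scores in [3, 4) blank is an assumption of this sketch.
    """
    if avg_score >= 4:
        return "x"
    if avg_score < 3:
        return "o"
    return ""


# Hypothetical scores for illustration only; not data from the study.
example_scores = {"Questioning (QU)": 2.5, "Thinking (TH)": 3.5, "Environment (ENV)": 4.0}
example_marks = {skill: skill_mark(score) for skill, score in example_scores.items()}
```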


Table 1—Classroom observation evaluation scores characteristics, year prior to experiment

(A) Scores in 19 skill areas, and scores averaging across skill areas

                                        Mean     St. Dev.   Proportion scoring < 3
                                        (1)      (2)        (3)

Overall average                         3.66     (0.68)     0.18

Instruction average                     3.56     (0.69)     0.21
  Problem Solving                       3.19     (0.78)     0.23
  Thinking                              3.34     (0.75)     0.16
  Questioning                           3.39     (0.74)     0.17
  Grouping Students                     3.43     (0.79)     0.16
  Lesson Structure and Pacing           3.43     (0.89)     0.22
  Academic Feedback                     3.49     (0.79)     0.14
  Activities and Materials              3.56     (0.80)     0.13
  Standards and Objectives              3.59     (0.81)     0.15
  Presenting Instructional Content      3.64     (0.86)     0.15
  Teacher Knowledge of Students         3.75     (0.88)     0.13
  Motivating Students                   3.85     (0.81)     0.08
  Teacher Content Knowledge             4.12     (0.78)     0.05

Planning average                        3.62     (0.80)     0.17
  Assessment                            3.38     (0.91)     0.18
  Student Work                          3.66     (0.89)     0.10
  Instructional Plans                   3.83     (0.88)     0.09

Environment average                     3.97     (0.82)     0.11
  Expectations                          3.81     (0.92)     0.10
  Managing Student Behavior             3.94     (0.96)     0.09
  Respectful Culture                    4.06     (0.89)     0.07
  Environment                           4.07     (0.92)     0.06

(B) Teachers with skill scores < 3

                                                        Proportion   Mean
                                                        (4)          (5)

One or more skill areas < 3                             0.41
Number of skills with score < 3                                      2.37
Number of skills with score < 3 | at least one < 3                   5.84

Note: Authors' calculations using 2012-13 data for treatment and control schools. Data for Tennessee as a whole are similar. Observation scores in natural units (1-5 scale).


Table 2—Student, teacher, and pair characteristics,

and pre-treatment balance

                                     Cont. mean    Treat. mean   Diff = 0
                                     (st. dev.)    (st. dev.)    p-value
                                     (1)           (2)           (3)

Student characteristics
  Baseline test scores
    Mathematics                      0.078         0.036         0.356
                                     (0.531)       (0.658)
    Reading/language arts            0.065         0.038         0.576
                                     (0.562)       (0.668)
    Average                          0.066         0.040         0.480
                                     (0.541)       (0.646)
  Female                             0.490         0.488         0.952
  Race/ethnicity
    White                            0.333         0.300         0.724
    African-American                 0.610         0.594         0.828
    Latino(a)                        0.047         0.087         0.176
    Other                            0.010         0.018         0.160
  English language learner           0.015         0.038         0.004
  Special education                  0.118         0.126         0.704
  Retained in grade                  0.001         0.002         0.300

Teacher characteristics
  Years of experience                12.151        13.800        0.220
                                     (11.92)       (10.18)
  Baseline job performance
    Value-added                      -0.023        0.112         0.472
                                     (0.720)       (0.783)
    Classroom observation score      3.656         3.846         0.364
                                     (0.595)       (0.551)

Teacher pair characteristics
  Proportion skills matched          0.566         0.599         0.756
  Difference in
    Value-added                      0.123         0.688         0.172
                                     (0.561)       (0.840)
    Observation score                1.157         1.011         0.040
                                     (0.412)       (0.440)
    Years of experience              2.319         3.764         0.708
                                     (14.79)       (12.11)
  Teach same grade(s)                0.386         0.487         0.436
  Teach same subject(s)              0.794         0.938         0.388

Note: Means and standard deviations net of randomization block fixed effects. Baseline student test scores and baseline teacher value-added standardized (mean 0 s.d. 1) using the statewide distributions for students and teachers respectively. Observation scores in natural units (1-5 scale). Column 3 reports wild cluster bootstrap-t p-values (Cameron, Gelbach, and Miller 2008) with 500 replications.


Table 3—Treatment effect on student achievement (intent to treat)

                                        Treatment effect        Observations
                                                                Student    Teacher
                                        (1)         (2)         (3)        (4)

(A) Average treatment effect
  Math and reading                      0.055       0.056       2947       136
                                        [0.224]     [0.080]
  Math                                  0.092       0.067       2874       86
                                        [0.032]     [0.040]
  Reading                               0.018       0.052       2637       88
                                        [0.648]     [0.072]

(B) Treatment effect by teacher role
  Math and reading
    Low-performing target teachers      0.082       0.123       2947       136
                                        [0.024]     [0.000]
    High-performing partner teachers    0.039       0.029
                                        [0.532]     [0.252]
    No assigned role                    0.013       0.029
                                        [0.824]     [0.468]

Pre-experiment covariates                           √

Note: Panel A: Each cell is an estimate from a separate regression. The dependent variable is a student test score, standardized (mean 0, standard deviation 1) within subject by grade-level cells using the statewide student distribution. All regressions include randomization block fixed effects. The vector of pre-experiment covariates includes: (i) Baseline student achievement: the average of each student's prior year math and reading scores. The prior test-score slope is allowed to vary by outcome subject and grade-level. (ii) Teacher pre-experiment value-added in the outcome subject: the average of 2012-13 and 2011-12 TVAAS scores. (iii) Indicators for student gender, race/ethnicity, English language learner status, special education status, and whether the student is repeating the grade. When baseline test scores or value-added are missing we set the value to zero and include an indicator = 1 for missing. If the student had two or more teachers in a given subject, we include one observation per teacher and weight each observation by the proportion of responsibility allocated by the state to the teacher. Three quarters of students had one teacher in a given subject. Panel B: Each column reports estimates from a separate regression. Estimation is identical to Panel A, except that the single treatment indicator is replaced with three indicators: treatment * target teacher, treatment * partner teacher, and treatment * no assigned role. The specification also includes main effects for teacher role (i.e., "partner" and "no assignment" with "target" the omitted category). P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.
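Two data-preparation steps in this note, the zero-fill with a missing indicator and the one-observation-per-teacher weighting, can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the dictionary layout, and the example values are hypothetical and are not taken from the authors' code or data.

```python
def impute_with_indicator(values):
    """Replace missing (None) entries with 0.0 and return a parallel 0/1 missing flag.

    Both the filled covariate and the flag would then enter the regression.
    """
    filled = [0.0 if v is None else float(v) for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing


def expand_by_teacher(student_id, teacher_shares):
    """Create one observation per teacher for a student.

    teacher_shares maps a (hypothetical) teacher id to that teacher's
    proportion of responsibility, which is used as the regression weight.
    """
    return [
        {"student": student_id, "teacher": teacher, "weight": share}
        for teacher, share in teacher_shares.items()
    ]


# Hypothetical values for illustration only; not data from the study.
filled, missing = impute_with_indicator([0.5, None, -1.0])
rows = expand_by_teacher("s1", {"t1": 0.75, "t2": 0.25})
```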


Table 4—Teacher participation

                                      Dep. var. = 1 if participated as a...
                                      Target        Partner
                                      (1)           (2)

Treatment * assigned role:
  low-performing target               0.608         0.154
                                      [0.000]       [0.344]
  high-performing partner             0.070         0.366
                                      [0.140]       [0.000]
  no assignment                       0.073         0.014
                                      [0.176]       [0.744]

F-statistic excluded instruments
  jointly zero                        18.522        31.520

Note: Each column reports estimates from a separate LPM regression; specifically, the first stage regressions from 2SLS estimation where actual role is instrumented with assigned role. Estimation is identical to Table 3 Panel B Column 1, except that the dependent variables are indicators = 1 if we observe participation in the target or partner roles respectively. The sample includes 14 schools, 2,947 students, and 136 teachers. P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.


Table 5—Treatment effects on student achievement in year two Dep. var. = student math and reading test scores

                                                    Treatment effect bounds
                                                    (Lee bounds)
                                                    lower       upper
                                        (1)         (2)         (3)         (4)

(A) Average treatment effect
  Math and reading                      0.106       0.051       0.141
                                        [0.220]     [0.556]     [0.212]

(B) Treatment effect by teacher role
  Math and reading
    Low-performing target teachers      0.252       0.207       0.340       0.252
                                        [0.068]     [0.052]     [0.040]     [0.212]
    Low-performing target teachers                                          -0.000
      * target again in year two                                            [0.988]
    High-performing partner teachers    0.056       -0.025      0.072       0.056
                                        [0.684]     [0.776]     [0.624]     [0.688]
    No assigned role                    0.013       0.008       0.047       0.013
                                        [0.908]     [0.864]     [0.536]     [0.908]

Note: Each column within panels reports estimates from a separate regression. The dependent variable is a student test score from spring 2015, standardized (mean 0, s.d. 1) within subject by grade-level cells using the statewide distribution. All regressions include randomization block fixed effects, and the vector of pre-experiment covariates described in Table 3. Additionally, Panel B regressions include main effects for teacher role (i.e., "partner" and "no assignment" with "target" the omitted category). Columns 2 and 3 report lower and upper bounds on the treatment effect estimate following the approach suggested by Lee (2009). We trim at the teacher level (dropping all students assigned to the trimmed teachers) within groups defined by teacher role (target, partner, no role) as described in the text. Column 4 repeats the specification and sample in Column 1, but adds an interaction between the target treatment effect and an indicator = 1 if the teacher was a target in year two. Regressions are weighted as described in Table 3. The sample includes 14 schools, 2,876 students, and 96 teachers. P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.


Table 6—Treatment effect heterogeneity by assigned teacher role and teacher pair characteristics Dep. var. = student math and reading test scores

                                      (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)      (10)     (11)

Treatment * low-performing target     0.123    0.133    0.121    0.128    0.036    0.124    0.138    0.124    0.139    0.112    0.113
                                      [0.000]  [0.000]  [0.000]  [0.000]  [0.340]  [0.000]  [0.000]  [0.000]  [0.000]  [0.024]  [0.000]
Treatment * high-performing partner   0.029    0.030    0.029    0.030    0.029    0.029    0.031    0.029    0.030    0.028    0.027
                                      [0.252]  [0.216]  [0.268]  [0.220]  [0.208]  [0.264]  [0.196]  [0.252]  [0.200]  [0.252]  [0.272]
Treatment * no assignment             0.029    0.027    0.028    0.026    0.028    0.029    0.027    0.029    0.028    0.029    0.030
                                      [0.468]  [0.496]  [0.492]  [0.552]  [0.476]  [0.468]  [0.512]  [0.464]  [0.484]  [0.468]  [0.444]

Treatment * target interaction coefficients [p-values], in left-to-right order of the columns that include each term:

  * proportion skills matched (std): 0.056 [0.052]; 0.066 [0.020]; 0.061 [0.028]; 0.067 [0.044]
  * number of skills attempted to match (std): -0.016 [0.632]; -0.028 [0.344]
  * proportion skills matched, above median (binary): 0.156 [0.000]
  * difference in prior observation scores (std): -0.015 [0.816]; -0.032 [0.480]
  * difference in prior value-added scores (std): -0.014 [0.636]; -0.036 [0.456]
  * both teaching the same subject (binary): 0.036 [0.612]
  * both teaching the same grade (binary): 0.045 [0.308]

Note: Each column reports estimates from a separate regression. Column 1 is identical to Table 3 Panel B Column 2. For Columns 2-11, all details of estimation are the same as Column 1 (see notes for Table 3), except that we add various characteristics of the target teacher and teacher pair as additional right-hand-side variables, and interact those characteristics with the treatment indicator for target teachers. The interaction coefficients are shown above. In each case the regression also includes a main effect for characteristic shown above. A "skill match" occurs when the target teacher has a score below 3 in the skill and the assigned partner has a score of 4 or higher (19 skills possible). The denominator in "proportion skills matched" is the number of skills where the target has a low score. All "difference" measures are partner score minus target score. The "proportion", "difference", and "number" measures have been standardized (mean 0, s.d. 1) within the sample. "Both teaching the same subject" is an indicator = 1 if assigned partner and target both teach the subject of the outcome score (math or reading). "Both teaching the same grade" is an indicator = 1 if the difference in average grade level of the assigned partner's students and the average grade level of the target's students is less than one. When a pair characteristic is missing we set the value to zero and include an indicator = 1 for missing. Regressions are weighted as described in Table 3. The sample includes 14 schools, 2,947 students, and 136 teachers. P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.
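The pair characteristics defined in this note can be made concrete with a small Python sketch. It assumes, hypothetically, that each teacher's skill scores are held in a dictionary keyed by skill and that grade assignments are given as student counts by grade level; the function names and example values are illustrative and are not the authors' code or data.

```python
def proportion_skills_matched(target_scores, partner_scores, low=3, high=4):
    """Share of the target's low skills (score < low) where the partner scores >= high.

    Returns None when the target has no low skills, since the
    denominator (number of low-scoring skills) would be zero.
    """
    low_skills = [skill for skill, score in target_scores.items() if score < low]
    if not low_skills:
        return None
    matched = [s for s in low_skills if partner_scores.get(s, float("-inf")) >= high]
    return len(matched) / len(low_skills)


def same_grade(target_students_by_grade, partner_students_by_grade):
    """Return 1 if the student-weighted average grade levels differ by less than one."""
    def weighted_avg(students_by_grade):
        total = sum(students_by_grade.values())
        return sum(grade * n for grade, n in students_by_grade.items()) / total

    gap = abs(weighted_avg(target_students_by_grade) - weighted_avg(partner_students_by_grade))
    return int(gap < 1)


# Hypothetical scores and class sizes for illustration only.
target = {"QU": 2.5, "TH": 2.0, "PS": 3.5}
partner = {"QU": 4.5, "TH": 3.5, "PS": 4.0}
share = proportion_skills_matched(target, partner)   # QU matched, TH not matched
grade_flag = same_grade({6: 50, 7: 50}, {7: 100})    # averages 6.5 and 7.0
```

In this example the target is low in two skills and the partner is strong in one of them, so the proportion is one half; the standardization "within the sample" described in the note would be applied afterward and is not shown.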


Table 7—Treatment effect on teaching skills scored in classroom observations

                                      Dep. var. = Average of evaluation scores on…
                                      All 19       Skills       Skills not    Skills where no attempt
                                      skills       matched      matched       was made to match
                                      (1)          (2)          (3)           (4)

Low-performing target teachers        0.050        0.267        0.034         -0.071
                                      [0.060]      [0.000]      [0.500]       [0.348]
High-performing partner teachers      0.048
                                      [0.836]
No assigned role                      -0.062
                                      [0.892]

Note: Each cell reports an estimate from a separate regression. The dependent variable is a measure of observed teaching practices from formal classroom observations conducted as part of the teacher's performance evaluation (see text for more details). In Column 1 the outcome is the teacher's average score across all 19 skills. In Columns 2 through 4 the outcome is the target teacher's average score across a specific subset of skills described in the column headers. The skills which contribute to each subset differ from teacher to teacher. But for any given teacher the three columns are mutually exclusive and exhaustive. Evaluation scores contributing to the dependent variables are from the 2013-14 and 2014-15 school years. Dependent variables are standardized (mean 0, standard deviation 1), and each of the 19 skill scores was standardized before taking averages. All regressions include randomization block fixed effects, and controls for teacher experience (indicators for quartiles of teacher experience). P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.
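The construction of the dependent variables, standardizing each skill score and then averaging over a teacher-specific subset of skills, can be sketched as follows. This is an illustration under assumptions: the data layout and function names are hypothetical, the population standard deviation is an arbitrary choice for the sketch, and the final re-standardization of the averages mentioned in the note is omitted.

```python
from statistics import mean, pstdev


def zscores_by_skill(scores):
    """Standardize each skill across teachers.

    scores: {teacher: {skill: raw score}} -> {teacher: {skill: z-score}}.
    Assumes every skill has some variation across teachers.
    """
    skills = {skill for row in scores.values() for skill in row}
    stats = {}
    for skill in skills:
        values = [row[skill] for row in scores.values() if skill in row]
        stats[skill] = (mean(values), pstdev(values))
    return {
        teacher: {s: (v - stats[s][0]) / stats[s][1] for s, v in row.items()}
        for teacher, row in scores.items()
    }


def subset_average(z_row, subset):
    """Average a teacher's standardized scores over one subset of skills."""
    return mean(z_row[s] for s in subset)


# Hypothetical raw scores for two teachers; not data from the study.
raw = {"A": {"QU": 2, "TH": 3}, "B": {"QU": 4, "TH": 5}}
z = zscores_by_skill(raw)
matched_avg_a = subset_average(z["A"], ["QU", "TH"])
```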


Appendix Table A1—Pre-treatment balance by teacher role

                                    Treat. - cont. difference [p-value] by teacher's assigned role
                                    Target       Partner      No role
                                    (1)          (2)          (3)

Teacher characteristics
  Years of experience               -0.073       1.497        2.462
                                    [0.852]      [0.528]      [0.468]
  Baseline job performance
    Value-added                     -0.355       -0.025       0.472
                                    [0.220]      [0.852]      [0.088]
    Classroom observation score     0.189        0.243        -0.072
                                    [0.216]      [0.268]      [0.900]

Note: Each cell reports a treatment minus control difference in means. The three estimates in each row come from a single regression. The dependent variable is described by the row label. All regressions include randomization block fixed effects, and main effects for teacher role (i.e., "partner" and "no assignment" with "target" the omitted category). P-values in brackets for the test that the difference equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.


Appendix Table A2—Treatment effect heterogeneity, additional teacher, partner, and pair characteristics Dep. var. = student math and reading test scores

                                      (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)

Treatment * low-performing target     0.057    0.126    0.136    0.116    0.187    0.118    0.130    0.095
                                      [0.220]  [0.344]  [0.000]  [0.000]  [0.008]  [0.000]  [0.000]  [0.060]
Treatment * high-performing partner   0.028    0.027    0.029    0.029    0.028    0.030    0.050    0.032
                                      [0.276]  [0.316]  [0.280]  [0.252]  [0.276]  [0.256]  [0.064]  [0.204]
Treatment * no assignment             0.030    0.033    0.029    0.031    0.022    0.028    0.040    0.025
                                      [0.456]  [0.400]  [0.436]  [0.448]  [0.596]  [0.464]  [0.488]  [0.536]

Treatment * target interaction coefficients [p-values], in left-to-right order of the columns that include each term:

  * target's years of experience: 0.005 [0.224]; -0.009 [0.728]
  * target's years of experience ^ 2: 0.000 [0.428]
  * target has fewer than 10 years experience (binary): -0.029 [0.692]
  * target's prior value-added scores (std): -0.022 [0.532]
  * target's prior observation scores (std): 0.046 [0.464]
  * target is an elementary school teacher (binary, middle school omitted): 0.010 [0.864]
  * difference in years of experience (partner - target): 0.012 [0.848]
  * partner's years of experience: 0.003 [0.340]

Note: Each column reports estimates from a separate regression. The dependent variable is a student test score, standardized (mean 0, s.d. 1) within subject by grade-level cells using the statewide distribution. All regressions include randomization block fixed effects, and the vector of pre-experiment covariates described in Table 3. Additionally, all regressions include main effects for teacher role (i.e., "partner" and "no assignment" with "target" the omitted category). In each specification we interact the treatment indicator with a characteristic of the target teacher, pair, or partner teacher. When a characteristic is missing we set the value to zero and include an indicator = 1 for missing. Regressions are weighted as described in Table 3. The sample includes 14 schools, 2,947 students, and 136 teachers. P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.
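The treatment of missing characteristics described in this note (set the value to zero and add a missing indicator) can be illustrated with a short Python sketch. The function names and the example values are hypothetical, and the way the interaction column is built here is an assumption for illustration, not a description of the authors' exact regressors.

```python
def zero_fill_with_indicator(values):
    """Replace missing entries (None) with 0 and flag them with an indicator."""
    filled = [0.0 if v is None else float(v) for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing


def build_interaction_columns(treat, characteristic):
    """Return regressors for a treatment-by-characteristic specification.

    treat is a list of 0/1 treatment indicators; characteristic is a list of
    numbers with None marking a missing value.
    """
    filled, missing = zero_fill_with_indicator(characteristic)
    interaction = [t * c for t, c in zip(treat, filled)]
    return {
        "characteristic": filled,
        "characteristic_missing": missing,
        "treat_x_characteristic": interaction,
    }


# Hypothetical data: three teachers, one with unknown years of experience.
columns = build_interaction_columns(treat=[1, 1, 0], characteristic=[5, None, 12])
```

Including the indicator lets the regression absorb the average difference for observations with a missing value, so the arbitrary zero does not distort the coefficient on the characteristic itself.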


Appendix Table A3—Treatment effects on teacher job changes

| | No longer teaching in the same school, one year later (1) | No longer teaching in the same school, two years later (2) | No longer teaching in Tennessee, one year later (3) | No longer teaching in Tennessee, two years later (4) |
|---|---|---|---|---|
| (A) Average treatment effects | | | | |
| All teachers | 0.084 | 0.067 | 0.089 | 0.036 |
| | [0.184] | [0.384] | [0.012] | [0.556] |
| Control mean | 0.277 | 0.415 | 0.092 | 0.246 |
| (B) Treatment effects by teacher role | | | | |
| Low-performing target teachers | 0.038 | 0.010 | 0.073 | 0.028 |
| | [0.776] | [0.956] | [0.608] | [0.856] |
| | | | 0.125 | 0.250 |
| High-performing partner teachers | 0.070 | 0.079 | 0.109 | 0.188 |
| | [0.668] | [0.652] | [0.324] | [0.216] |
| | | | 0.080 | 0.200 |
| No assigned role | 0.144 | 0.087 | 0.085 | -0.118 |
| | [0.272] | [0.400] | [0.176] | [0.288] |
| | | | 0.083 | 0.292 |

Note: Each column within panels reports estimates from a separate linear probability model (LPM) regression. The dependent variable is an indicator = 1 if the teacher is no longer teaching in the same school (or in the state of Tennessee) in 2014-15, the year after treatment in 2013-14 (or in 2015-16, two years after treatment). All regressions include randomization block fixed effects. Additionally, all regressions in Panel B include main effects for teacher role (i.e., "partner" and "no assignment" with "target" the omitted category). The sample includes 14 schools and 136 teachers. P-values in brackets for the test that the coefficient equals zero. P-values estimated using wild cluster (school) bootstrap-t methods (Cameron, Gelbach, and Miller 2008) with 500 replications.
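As a rough illustration of the LPM specification in this note, the Python sketch below regresses a 0/1 exit indicator on a treatment indicator plus one dummy per randomization block, using ordinary least squares. It assumes NumPy is available. The function name `lpm_treatment_effect` and the eight-teacher data set are hypothetical, and the sketch omits the role main effects and the bootstrap inference used in the table.

```python
import numpy as np


def lpm_treatment_effect(left, treat, block):
    """OLS coefficient on treatment in a linear probability model
    with block fixed effects.

    left  : 0/1 indicator, 1 if the teacher is no longer teaching in the school
    treat : 0/1 treatment indicator
    block : randomization block label for each teacher
    """
    left = np.asarray(left, dtype=float)
    treat = np.asarray(treat, dtype=float)
    blocks, inverse = np.unique(np.asarray(block), return_inverse=True)
    # One dummy per block replaces the intercept (fixed effects).
    dummies = np.eye(len(blocks))[inverse]
    X = np.column_stack([treat, dummies])
    coef, *_ = np.linalg.lstsq(X, left, rcond=None)
    return float(coef[0])


# Hypothetical data: two blocks, each with two treated and two control teachers.
block = ["A", "A", "A", "A", "B", "B", "B", "B"]
treat = [1, 1, 0, 0, 1, 1, 0, 0]
left = [1, 1, 0, 1, 0, 1, 0, 0]
effect = lpm_treatment_effect(left, treat, block)
```

Because the outcome is binary, the coefficient reads as a difference in exit probability between treatment and control teachers within the same randomization block.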
