
LEFT BEHIND BY DESIGN: PROFICIENCY COUNTS AND TEST-BASED ACCOUNTABILITY

Derek Neal and Diane Whitmore Schanzenbach*

Abstract—We show that within the Chicago Public Schools, both the introduction of NCLB in 2002 and the introduction of similar district-level reforms in 1996 generated noteworthy increases in reading and math scores among students in the middle of the achievement distribution but not among the least academically advantaged students. The stringency of proficiency requirements varied among the programs implemented for different grades in different years, and our results suggest that changes in proficiency requirements induce teachers to shift more attention to students who are near the current proficiency standard.

We were told to cross off the kids who would never pass. We were told to cross off the kids who, if we handed them the test tomorrow, they would pass. And then the kids who were left over, those were the kids we were supposed to focus on.

—Middle school staff member (de Vise, 2007)†

I. Introduction

FOR more than a decade, test-based accountability systems have been a key element of many education reform proposals at the state and district levels, and the No Child Left Behind Act (NCLB) of 2001 created a federal mandate for test-based accountability in every state. A key feature of NCLB is the requirement that each state adopt an accountability system built, in large part, on standardized testing in reading and math for students in grades 3 through 8. The law seeks to hold schools accountable for student performance by mandating that schools make the results of these standardized assessments available to parents and report not only aggregate results but also results specific to particular demographic groups, such as groups defined by race or special education status. These reports must convey the fractions of students in particular schools and demographic groups within schools who have achieved proficiency in a particular subject for their grade level. NCLB spells out a set of sanctions that schools should expect to face if they persistently report proficiency levels below the targets set by their state for each calendar year.

In this paper, we use data from the Chicago Public Schools (CPS) to examine how a specific aspect of the implementation of NCLB affects the distribution of measured changes in achievement among students. The implementation of NCLB in most states and the design of many state and local accountability systems tie rewards and sanctions to the number of students in certain groups scoring above given proficiency thresholds. We use the introduction of two separate accountability systems in CPS, a district-wide system implemented in 1996 and the introduction of NCLB in 2002, to investigate the impacts of proficiency-count accountability systems on the distribution of student performance.

In all our analyses, we focus on test score outcomes among students in a given grade. We compare students who took a specific high-stakes exam under a new accountability system with students who took the same exam under low stakes the year before the accountability system was implemented. Furthermore, because we restrict our comparisons to students who took exams either right before or right after the implementation of an accountability system, we can make these comparisons holding constant student performance on a similar low-stakes exam in an earlier grade. Thus, we are able to measure changes in test scores associated with the accountability system at different points in the distribution of prior achievement.

Holmstrom and Milgrom (1991) warn that when workers perform complex jobs involving many tasks, pay-for-performance schemes based on objective measures of output often create incentives for workers to shift effort among the various tasks they perform in ways that improve their own performance rating but hinder the overall mission of the organization. Holmstrom and Milgrom cite “teaching to the test” in response to test-based accountability systems as an obvious example of this phenomenon, and much of the existing empirical literature on test-based accountability focuses on whether the test score increases that commonly follow the introduction of such systems represent actual increases in subject mastery. The literature explores many ways that schools may seek to inflate their assessment scores without actually increasing their students’ subject mastery. Schools may coach students for assessments, manipulate the population of students tested, or even alter students’ answer sheets between assessment and grading.1

Received for publication February 12, 2008. Revision accepted for publication September 18, 2008.

† Quote from an anonymous middle school staff member in “Rockville School’s Efforts Raise Questions of Test-Prep Ethics” by Daniel de Vise, Washington Post, March 4, 2007.

* Neal: University of Chicago and NBER; Schanzenbach: University of Chicago and NBER.

We thank Elaine Allensworth, John Q. Easton, and Todd Rosenkranz of the Consortium on Chicago School Research for their assistance in using the data. We thank Amy Nowell of Chicago Public Schools (CPS). We thank participants in the University of California, Berkeley, Labor Economics Workshop, the Federal Reserve Bank of Chicago’s Labor Economics seminar, the Harris School’s Public Policy and Economics Workshop, and the joint meeting of the Institute for Research on Poverty’s Summer Research Workshop and the Chicago Workshop on Black-White Inequality. We thank Fernando Alvarez, Gadi Barlevy, Kelly Bedard, Julie Berry Cullen, Jennifer Jennings, Brian Jacob, John Kennan, Roger Myerson, Kalina Michalska, Phil Reny, and Balazs Szentes for useful comments and discussions, and Chloe Hutchinson, Garrett Hagemann, Richard Olson, and Andy Zuppann for helpful research assistance. We owe special thanks to Phil Hansen for being so generous with his time and his knowledge of accountability within CPS. D.N. thanks the Searle Freedom Trust for generous research support as well as Lindy and Michael Keiser for support through the University of Chicago’s Committee on Education. Both thank the Population Research Center of NORC and the University of Chicago for research support.

The Review of Economics and Statistics, May 2010, 92(2): 263–283
© 2010 The President and Fellows of Harvard College and the Massachusetts Institute of Technology

We depart from most of the existing literature by examining a different multitasking concern. Instead of focusing on how the use of standardized assessments shapes what teachers teach and what types of coaching they give their students, we examine how the rules that accountability systems use to turn student test scores into performance rankings for schools determine how teachers allocate their efforts among different students. We show that the use of proficiency counts as performance measures provides strong incentives for schools to focus on students who are near the proficiency standard but weak incentives to devote extra attention to students who are either already proficient or have little chance of becoming proficient in the near term.

Even in a world with perfect assessments that cannot be manipulated by schools in any way, the details of how one maps students’ test scores into a performance rating for their school dictate how teachers allocate their attention among students of different baseline achievement levels. Because part of the impetus for NCLB and related reforms is the belief that some groups of academically disadvantaged students have historically received poor service from their public schools, we believe that our results speak to a design issue that is of first-order importance.2

We provide results that characterize the distribution of test score changes among fifth graders in Chicago following the introduction of NCLB in 2002, and we present similar results for fifth graders tested in Chicago in 1998 following the introduction of a school accountability system that was similar to NCLB on many dimensions. The results for both sets of fifth graders follow a strikingly consistent pattern. Students at the bottom of the distribution of measured third-grade achievement score the same or lower following these reforms than one would have expected given the prereform relationships between third- and fifth-grade scores, but students in the middle of the distribution score significantly higher than expected. Further, there is at best mixed evidence of gains among students in the top decile.

We also present results for sixth graders tested in 1998. These students were affected directly by both the school-level accountability system instituted within CPS and a separate set of test score cutoffs used to determine summer school placement and retention decisions. Chicago’s effort to end social promotion linked summer school attendance and retention decisions to score cutoffs that were much lower than the proficiency cutoffs used to determine school-level performance. Thus, sixth graders who had little chance of contributing to their school’s overall proficiency rating faced strong incentives to work harder in school. The results for these sixth graders follow the same general pattern observed in the fifth-grade results. However, the estimated gains among sixth graders tend to be larger at each decile, and our estimated treatment effects for the least able sixth graders are never negative. We conclude that NCLB provides relatively weak incentives to devote extra attention to students who have no realistic chance of becoming proficient in the near term or students who are already proficient.3

The distributional consequences of the Illinois implementation of NCLB are complex. Hanushek and Raymond (2004) argue, based on National Assessment of Educational Progress (NAEP) data and differences over time and among states in the stakes associated with state-level accountability systems, that test-based accountability reduces racial achievement gaps, and our results are not inconsistent with this conclusion. The CPS contain relatively few white students, and average test scores did increase following both NCLB and the CPS reforms of 1996. Thus, although we do not have comparable data from other school districts in Illinois, our results certainly admit the possibility that NCLB narrowed the achievement gaps between whites and minorities in the state. However, the group of students within CPS who were likely not helped and may have been harmed by NCLB is sizable and predominantly black and Hispanic.

Several studies on the use of proficiency counts in accountability systems other than NCLB provide results that are consistent with ours. Reback (2008) uses data from Texas during the 1990s to measure how schools allocated efforts in response to a statewide accountability system. He finds that achievement gains are larger among students whose gains are likely to make the greatest marginal contribution to their school’s overall proficiency rating. Burgess et al. (2005) use data from England to show that achievement gains are lower among less able students if they attend schools in which a large fraction of the student body are marginal students with respect to an important score threshold in the English accountability system. On the other hand, Springer (2008) analyzes data from the testing program that Idaho instituted following the introduction of NCLB and argues that he does not see evidence that the use of proficiency counts in NCLB led to increased teacher effort among only one particular group of students.

1 See Carnoy and Loeb (2002), Grissmer and Flanagan (1998), Hanushek and Raymond (2004), Jacob (2005), and Koretz (2002). These studies address the concern that teaching to the test artificially inflates test scores following the introduction of high-stakes testing. See Cullen and Reback (2006) for an assessment of strategic efforts among Texas schools to improve reported scores by manipulating which students are exempt from testing. Jacob and Levitt (2003) provide evidence that some teachers or principals in Chicago actually changed student answers after high-stakes assessments in the 1990s.

2 Lazear (2006) notes that the scope of assessments may also influence the distribution of gains among students. Those who find learning difficult may not be affected if assessments are too broad because they and their teachers may find it too costly for them to reach proficiency.

3 NCLB does contain a provision that requires that all students be proficient by 2014. However, this provision of the law does not constitute a credible threat. NCLB contains a reauthorization requirement for 2007 and has still not been reauthorized. Goals that push the limits of credulity and are not required by NCLB until 2014 should play a small role in shaping teachers’ and principals’ expectations concerning how the law will be enforced today.

All of these papers differ from ours methodologically because none of the authors had access to data on achievement growth prior to the introduction of accountability. We have test score data on all Chicago students starting in the early 1990s, and the tests used for NCLB purposes in Illinois and those used for the district’s accountability program in the late 1990s were administered in years prior to the introduction of these accountability systems. Thus, ours is the only study in this literature with access to control groups that took the assessments used in accountability systems as low-stakes exams prior to the introduction of accountability.4

Several studies of particular schools also find results consistent with those we present below. Gillborn and Youdell (2000) coined the term educational triage to describe their findings from case studies of English schools. They document how these schools targeted specific groups of students for special instruction in order to maximize the number of students who performed above certain thresholds in the English system. More recently, Booher-Jennings (2005) and White and Rosenbaum (2007) present evidence from case studies of two schools serving economically disadvantaged students in Texas and Chicago, respectively. Both studies provide clear evidence that teachers and administrators made conscious and deliberate decisions to shift resources away from low-performing students and toward students who had more realistic chances of exceeding key threshold scores.

In the next section, we present a model of teacher effort within schools. Then we turn to the details of the 1996 and 2002 reforms and their implementation in Chicago before turning to our empirical results. After presenting our results, we discuss the challenges that policymakers face if they wish to replace NCLB’s reliance on proficiency counts with a system of measuring progress that will value the achievement gains of all students. Currently, a number of states have been granted waivers that allow them to assess school performance using more continuous measures of student outcomes than simple proficiency counts. We analyze the likely effects of these alternative schemes using variants of the same model of teacher effort that we describe in the next section. Our model clearly illustrates that these waivers make it easier to design accountability systems that do not build in direct incentives to leave some children behind, but we argue that tough design issues remain unresolved. In our conclusion, we discuss the extent to which our results from Chicago speak to the likely effects of NCLB in other large cities.

II. Keeping Score Using Proficiency Counts

Consider a school that is part of a test-based accountability system. Two policies shape the actions of teachers and principals. To begin, the central administration, in cooperation with parents, provides enough monitoring to make sure the school provides some baseline level of instruction to all students. Because our empirical work measures changes in performance that follow the introduction of accountability within groups of students who are similar with respect to prior achievement levels, it is not essential for our purposes that baseline instruction be identical for all students, but this assumption does allow us to easily describe both our model and our empirical results in terms of the changes induced by accountability systems. We do not address the socially optimal amount of effort per teacher or the socially optimal allocation of effort among students. We take as given for now that teachers and the principal enjoy rents given their pay and the baseline allocation of effort per student. Thus, we view the introduction of test-based accountability as an attempt to extract more effort from teachers, and we examine how this attempt to increase overall teacher effort also changes the distribution of teacher effort among students of different abilities.

Given the monitoring system that guarantees baseline effort, also consider a testing system that labels each student as either passing or failing. Further, assume that the principal and teachers incur costs that are a function of the number of their students who fail. These costs may take many forms depending on the details of the accountability system.5 The key point is that NCLB keeps score, and the earlier Chicago accountability system kept score, based on the number of students whose test scores exceed certain thresholds. Thus, we model our hypothetical accountability system as a penalty function that imposes costs on teachers and principals when students do not reach a proficiency standard, and we assume that these costs are strictly convex in the number of students who fail.6

Our school can improve individual test scores by providing extra instruction beyond the minimum effort level that the district can enforce through its monitoring technology. We ignore any agency problem between principals and teachers and model the school as a unitary decision-making unit.

4 The Idaho and Texas data provide no information about how various schools performed in the absence of NCLB. Springer (2008) notes that his results may reflect “customary school behavior irrespective of NCLB’s threat of sanctions.”

5 Under NCLB, schools must report publicly how many of their students are proficient, and they face serious sanctions if their proficiency rates remain below statewide targets. In Chicago, the district adopted a system in 1996 that measured school-level performance based on the number of students exceeding national norms on specific exams. In addition, Chicago schools and students faced additional pressures related to a separate set of lower thresholds (on the same tests) that determined whether students in grades 3, 6, and 8 were required to attend summer school and possibly repeat their grade.

6 NCLB also includes provisions concerning the fraction of students who are proficient within certain demographic groups defined by race, family income, and disability status. Incorporating these subgroup provisions in our model would complicate our analyses but not change our results. The high cost of bringing low-achieving students up to proficiency would still imply that schools could optimally allocate no extra instruction to their low-achieving students. Further, these provisions are less important for NCLB implementation in Chicago than many other school districts because schools in Chicago are highly segregated by race and income.

Because the baseline level of instruction for all students is not a choice variable for the school, the school’s problem is to minimize the total cost incurred by the allocation of extra instruction among its students and the penalties associated with student failures. Suppose that there are N students in a school and each student has ability

$$\alpha_i, \quad i = 1, 2, \ldots, N.$$

Further, assume that for any individual i, her score on the accountability test is

$$t_i = e_i + \alpha_i + \varepsilon_i,$$

where $e_i$ is extra instruction received by student i and $\varepsilon_i$ is the measurement error on i’s test drawn from $F(\varepsilon)$, which has a unimodal density $f(\varepsilon)$.

The cutoff score for passing is $\bar{t}$. We assume that N is large, and we approximate the school’s objective function by treating the expected number of students who fail in each school as the actual number of failures in each school. Thus, the school’s problem is as follows:

$$\min_{e_i} \; \rho\left( \sum_{i=1}^{N} F(\bar{t} - e_i - \alpha_i) \right) + \sum_{i=1}^{N} c(e_i) \tag{1}$$

$$\text{s.t. } e_i \geq 0 \quad \forall\, i = 1, 2, \ldots, N.$$

Here, $\rho[\cdot]$ is a penalty function that describes the sanctions suffered by a school of size N under the accountability system, and $c'(e) > 0$, $c''(e) > 0$, $\forall e \geq 0$. The penalty function is strictly increasing and convex in the number of students who are not proficient. The first-order conditions that define optimal effort require

$$\rho'(\cdot)\, f(\bar{t} - e_i^* - \alpha_i) = c'(e_i^*) \quad \forall\, i = 1, 2, \ldots, N.$$
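To make the effort-allocation logic concrete, the following minimal numerical sketch solves problem (1) directly. The functional forms are our assumptions, not the paper’s: a quadratic penalty $\rho(x) = x^2$, a quadratic cost $c(e) = e^2$, and standard normal measurement error.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustrative functional forms (our choices): rho(x) = x**2, c(e) = e**2,
# eps ~ N(0, 1), and a proficiency cutoff t_bar = 0.
alpha = np.linspace(-4.0, 4.0, 41)   # ability levels, centered on the cutoff
t_bar = 0.0

def school_cost(e):
    # Expected number of failures, then penalty plus total instruction cost.
    expected_failures = norm.cdf(t_bar - e - alpha).sum()
    return expected_failures**2 + np.sum(e**2)

res = minimize(school_cost, x0=np.full(alpha.size, 0.1),
               bounds=[(0.0, None)] * alpha.size, method="L-BFGS-B")
e_star = res.x
# e_star peaks for students near t_bar and falls toward zero for students
# far below or far above the cutoff: extra instruction goes to the middle.
```

Under these assumptions, the optimal $e_i^*$ is largest for abilities near the cutoff and essentially zero in both tails, matching the first-order condition above.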

The precise shapes of the penalty function, the cost function, the distribution of ability types, and the distribution of measurement errors interact to determine the exact pattern of optimal investments. However, we know that in any setting that involves convex cost and penalty functions as well as a unimodal density for the measurement error, the optimal investment pattern will exhibit two properties. First, it is easy to show the following: $\bar{t} - \alpha_i > \bar{t} - \alpha_j \Rightarrow \bar{t} - \alpha_i - e_i^* \geq \bar{t} - \alpha_j - e_j^* \;\; \forall\, i, j$. This means that optimal investments never cause one type to pass another in terms of effective skill. The responses of schools to this type of accountability system may narrow the achievement gaps between various skill groups, but these responses will not eliminate or reverse any of these gaps. We are not surprised that our empirical results firmly support this prediction,7 but we do believe that it is important to recognize that the type of accountability system described here, which is intended to capture the key elements of NCLB, should not be expected to fully eliminate achievement gaps between any two groups of students.

Second, while it is easy to generate examples such that schools devote no extra effort to students below some critical ability level or above some critical ability level, appendix B demonstrates that solutions do not exist that involve a school’s allocating no extra effort to a given student but applying positive extra effort to other students who are both more and less able. If the solution to the school’s problem involves positive extra effort for some students and no extra effort for others, the students who do not receive extra attention will be either more or less skilled than those who do.

The focus of our empirical work is the claim that accountability systems built around proficiency counts provide incentives for schools to provide extra help to students in the middle of the ability distribution while providing few incentives for these schools to direct extra attention to students who are either far below proficiency or already proficient. We think that the absence of strong incentives to help students who are achieving at the lowest levels is especially noteworthy because this feature of the NCLB design is at odds with the stated goals of the legislation, and the implication that the least able may be left behind by design is a quite robust feature of our model. Although we have assumed that the marginal product of instruction is independent of student ability, we could make the more common assumption that ability and instruction are complements in the production of knowledge. In this scenario, the relative cost of raising scores among less able students increases, and it remains straightforward to construct scenarios in which students below a given ability level receive no extra attention even though more able students do benefit from the accountability system.

It is worth noting that under this type of accountability system, the choice of $\bar{t}$ determines the distribution of achievement gains. Consider an increase in the standard for proficiency $\bar{t}$. It is easy to show that this increase in the standard can only decrease and never increase the number of high-ability students who receive no extra instruction. Thus, higher standards can only benefit and never harm the most able students. However, a higher standard may actually increase the number of low-ability students that a given school ignores by increasing the number who have little or no chance of being proficient in the near term.8 Although NCLB encourages each state to set challenging proficiency standards, states that set high standards may direct teacher effort away from disadvantaged children.

7 We define ability groups based on baseline achievement in previous grades, and our data appendix shows that average math and reading scores always increase from one ability group to the next for both our treatment and control cohorts.

8 A higher standard does not necessarily generate this result. A higher standard also raises the baseline failure rate and, because the penalty function is convex, raises the gain associated with moving any single student up to the proficiency standard.

Also note that one can easily construct a more general model that embeds our analyses of effort allocation within schools as one component in a model of the labor market for teachers and principals. Here, differences among schools in the indirect utilities associated with the solutions to the effort allocation problems faced by various schools will drive the sorting of teachers and principals among schools. Assuming the penalty function $\rho$ is the same for all schools of the same size, schools with more able students provide a superior working environment for principals and teachers because academically disadvantaged students raise the cost of meeting any specific passing rate given a common proficiency standard. If the distribution of initial student ability is worse in school A than school B, teachers and principals in school A must work harder than those in school B to achieve the same standing under the accountability system, and this should adversely affect the relative supply of teachers who want to teach in school A.

There has been little empirical work to date on how accountability systems affect teacher labor markets, but Clotfelter et al. (2004) examine changes in teacher retention rates in North Carolina following the introduction of a statewide accountability system in 1996 that raised the relative cost of teaching in schools with large populations of disadvantaged students. They document significant declines in retention rates among schools with many academically disadvantaged students, and their results are difficult to square with the hypothesis that the additional departures from these schools were driven primarily by an increase in the departure of incompetent teachers.

III. High-Stakes Testing in Chicago

We use data in the years surrounding the introduction of two separate accountability systems in CPS. The first, implemented in 1996, linked school-level probation status to the number of students who achieved a given level of proficiency in reading on the Iowa Test of Basic Skills (ITBS). It also linked grade retention decisions concerning individual students in “promotion gate” grades to the achievement of specific proficiency levels in reading and math. The second system is the 2002 implementation of NCLB testing in Illinois, which initially covered student performance in grades 3, 5, and 8 on the Illinois State Achievement Test (ISAT).

During 1996, a new administration within the CPS introduced a number of reforms, and these reforms attached serious consequences to standardized test results.9 In the fall of 1996, CPS introduced a school accountability system. Among elementary schools, probation status was determined primarily by the fraction of students who earned reading scores equal to or greater than the national norm for their grade. Schools on probation were forced to create and implement school improvement plans, and these schools knew that they faced the threat of reconstitution if their students’ scores did not improve. Although math scores were not a major factor in determining probation status, schools also faced pressure to improve math scores. As part of the reform efforts, CPS chose to publicly report proficiency rates in math and reading at the school level. Principals and teachers knew that the reading and math performance of their students would be reported in local newspapers, and these school report cards measured school performance using the number of students who performed at or above national norms in reading or math. With regard to sanctions and public reports, proficiency counts were the key metric of school performance in the CPS system.

In addition, other score thresholds in reading and math played a large role in the reform. In March 1996, before the school accountability system was introduced, CPS announced a plan to end social promotion. The new elementary school promotion policy required students in third, sixth, and eighth grades to score above specific thresholds in math and reading or attend summer school. These cutoff scores were far below the national norms that CPS would later use to calculate proficiency rates for schools, but they were clearly relevant hurdles for students in the bottom half of the CPS achievement distribution. Even median students likely faced more than a 20% risk of summer school if they exerted no extra effort. Students who attended summer school were tested again at the end of summer and retained if they still had not reached the target score levels for their grade.

This policy was announced in late March 1996. CPS exempted third- and sixth-grade students from the policy until spring of 1997, but the new policy did link eighth-grade summer school and retention decisions to the 1996 spring test results. Since the promotion policy was announced only weeks before testing began, we believe that the eighth-grade exams in spring 1996 do not reflect all the impacts of the reform, but the eighth-grade exams are affected by the March announcement.10 Thus, we restrict our attention here to the fifth- and sixth-grade results.

The retention policies in the CPS reforms are interesting from our perspective because CPS also built these policies around cutoff scores and because retentions forced students and their families to deal with a summer school program that they did not choose. Thus, retentions represented a source of potential frustration for parents and another source of performance pressure linked to proficiency counts. Further, the lower cutoff scores for summer school put many students at risk of summer school while still giving almost all students a real chance to avoid it. This was not the case with regard to the proficiency levels used to determine school-level performance under the 1996 reforms, and it was not the case with regard to the ISAT proficiency cutoffs under NCLB in 2002. Thus, the results for sixth-grade students allow us to see what the distribution of achievement gains looks like when more students have a realistic chance of meeting an important threshold score.

9 See Bryk (2003) and Jacob (2003) for more on the history of recent reform efforts in CPS.

10 In related work, we have discovered that the school-level correlation between ITBS scores and IGAP (Illinois Goals Assessment Program) scores dropped notably for eighth graders in 1996.

The 1996 CPS reforms adopted the ITBS as the primary performance assessment in reading and math. Different forms of the test were given in different years, but in our analyses of ITBS data, we concentrate only on years when Form L was given. These years, 1994, 1996, and 1998, are the only ones surrounding the 1996 reform that permit a comparison of prereform and postreform cohorts using a common form of the ITBS. Our analyses seek to measure changes in scores relative to prereform baselines at different points in the distribution of prior achievement. If we use years other than the Form L years, our results will reflect not only any real differences in the effects of the reform at various ability levels but also any differences among ability levels in the accuracy of the psychometric methods used to place scores from different forms on a common scale. While it is not easy to equate scores among forms in a manner that is correct on average, the task of equating scores in a manner that is accurate at each point in the distribution of ability is even more demanding.

In the 1998–99 school year, the Illinois State Board of Education (ISBE) introduced a new exam to measure the performance of students relative to the state learning standards and administered the test statewide, but only in grades 3, 5, and 8. For many reasons, CPS viewed the ISAT as a collection of relatively low-stakes exams during the springs of 1999, 2000, and 2001.11 However, in fall 2001, with the passage of NCLB looming, the ISBE placed hundreds of schools in Illinois on a watch list based on their 1999 through 2001 scores on ISAT and also declared that the 2002 ISAT exams would be high-stakes exams.

By the time President Bush signed NCLB in early January 2002, it had become crystal clear that the 2002 ISAT would be the NCLB exam for Illinois. Further, the state announced in February that for the purpose of calculating how long each school had been failing under NCLB, 1999 would be designated as the baseline year and school status in the year 2000 would retroactively count as the first year of accountability. This meant that many schools in Chicago expected to start to face sanctions immediately if their proficiency counts on the 2002 spring ISAT exams did not improve significantly. Thus, in one year, the ISAT went from a relatively low-stakes state assessment to a decidedly high-stakes exam.

Like the 1996 CPS reforms, NCLB employs proficiency counts as the key metric of school performance. States are required to institute a statewide annual standardized test in grades 3 through 8, subject to parameters set by the U.S. Department of Education. States set their own proficiency standards as well as a schedule of target levels for the percentage of proficient students at the school level. If the fraction of proficient students in a school is above the goal, the school is said to have met the standard for Adequate Yearly Progress (AYP).12 Under some circumstances, if a school does not have enough proficient students in the current year but does have a substantially higher fraction than in previous years, the school may be considered to have met the AYP standard under what is called the “Safe Harbor Provision.” A school that persistently fails to meet the AYP requirement will face increasing sanctions. These include mandatory offering of public school choice and extra services for current students, and at some point, the school may face reconstitution.

We are not able to conduct our analyses of ISAT scores using a sample restricted to students who took the exact same form of the exam. ISBE typically administered ISAT using two forms simultaneously. These forms shared a large number of common items both within and across years, and thus the assessment program was designed in a manner that facilitated ISBE’s use of an item response theory model to place all scores on a common scale from 120 to 200. We cannot control for any form effects in our ISAT analyses because the CPS data that we use do not allow us to determine which form a given student took in a given year. Nonetheless, an independent audit of the ISAT did conclude that ISAT scores are comparable over time and among forms of the exam.13

IV. Changes in Scores

All the figures presented in this section follow a common format. They display differences between mean test scores in a specific grade following the introduction of high-stakes testing and mean predicted scores based on data from the period prior to high-stakes testing. We create our estimation samples using selection rules that take the following form: we include persons who were enrolled in CPS in year t and year t − 2 in grades n and n − 2, respectively, and we restrict our samples to students who were tested in math and reading in both years.14 Appendix A provides a detailed description of how we construct our samples and the characteristics of our treatment and control samples. Relative to our control cohorts, we observe slightly higher rates of follow-up testing for the cohort affected by NCLB and slightly lower rates of follow-up testing for the cohorts that experienced the earlier CPS reforms. However, in both cases, our prereform and postreform cohorts match well on baseline characteristics within our estimation samples, which we define by achievement decile, grade, and reform year.

11 The ISAT was not a “no-stakes” exam in 1999–2001. ISAT performance played a small role in the CPS rules for school accountability over this time, and the state monitored ISAT performance as well. Nonetheless, according to Phil Hansen, Chicago’s former chief accountability officer, CPS began participating in ISAT under the understanding that the results would not be part of any “high-stakes accountability plan.” In late fall 1999, the state made several announcements that signaled a change in this position, and CPS protested. Then, in January 2000, ISBE moderated its stance and informed CPS that it would appoint a task force to recommend a “comprehensive school designation system” for state-level accountability and a set of guidelines that would exempt schools with low ISAT scores from being placed on the state’s Academic Early Warning List if they “show evidence of continued improvement.” Thus, in the springs of 1999, 2000, and 2001, CPS took the ISAT with the expectation that the results would not have significant direct consequences in terms of the state accountability system.

12 In addition, the fraction of students passing in each subgroup above a minimum size must meet the standard. For example, NCLB defines subgroups by race, socioeconomic status, and special education category.

13 Wick (2003) provides a technical audit of the ISAT.
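As an illustration only (the column names and data layout below are our assumptions, not a description of the CPS files), this selection rule amounts to a self-join of a long student-year score file on student ID:

```python
import pandas as pd

# scores: one row per student-year with hypothetical columns
# ['student_id', 'year', 'grade', 'math', 'read'].
def build_cohort(scores: pd.DataFrame, t: int, n: int) -> pd.DataFrame:
    """Students enrolled in year t (grade n) and year t-2 (grade n-2),
    tested in math and reading in both years."""
    final = scores[(scores.year == t) & (scores.grade == n)]
    base = scores[(scores.year == t - 2) & (scores.grade == n - 2)]
    cohort = final.merge(base, on="student_id", suffixes=("", "_base"))
    return cohort.dropna(subset=["math", "read", "math_base", "read_base"])

# Example: postreform NCLB cohort = build_cohort(scores, 2002, 5);
# prereform comparison cohort = build_cohort(scores, 2001, 5).
```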

With regard to our analyses of the CPS accountability system, the two-year intervals reflect the fact that 1994, 1996, and 1998 are years around the CPS reform that involve assessment using the same form of the ITBS. We present results for fifth and sixth graders because these are the cohorts tested in 1998 that did not face any promotion hurdles under the CPS reforms in 1996 or 1997. The two-year interval is also necessary in our analyses of the 2002 implementation of NCLB. ISBE administered the ISAT in only third, fifth, and eighth grades. We cannot analyze eighth-grade scores in the pre-NCLB period given controls for fifth-grade achievement because the ISAT was first administered in 1999, but we can use the third-grade scores from 1999 and the fifth-grade scores from 2001 to estimate the pre-NCLB relationship between ISAT scores in fifth and third grades.15

In all our analyses, we compare outcomes in a specific grade for two different cohorts of students. Both cohorts took tests in two grades, and both cohorts took their tests in the lower grade under low stakes. However, the latter cohort took exams in the higher grade under high stakes. For our ISAT results, these stakes reflect the Illinois 2002 implementation of NCLB. For our ITBS results, these stakes reflect the 1996 introduction of CPS’s accountability system. Our goal is to examine how test scores in the higher grade change following the introduction of an accountability system based on proficiency counts, controlling for achievement in the lower grade, and we are particularly interested in the possibility that the effects of accountability may differ among various levels of prior student achievement in the lower grade.

For the purpose of describing our estimation procedure, we refer to the cohorts tested in both grades under low stakes as the prereform cohorts and the cohorts tested under high stakes in the higher grade as the postreform cohorts. Our estimation procedure is as follows (a code sketch follows the list):

● We begin by using the prereform cohort to estimate the first principal component of math and reading scores in the baseline grade. This principal component serves as a baseline achievement index.

● We use the coefficient estimates from this principal component analysis and the lower-grade math and reading scores from the postreform cohort to construct indices of baseline achievement for students in the postreform cohort as well. These indices tell us where the postreform students would have been in the distribution of baseline achievement for the prereform cohort.

● We divide the pre- and postreform samples into ten cells based on the deciles of the prereform distribution of baseline achievement.

● Given these cells, we run twenty separate regressions. For each of our ten samples of prereform students, we run two regressions of the form

$$y_{igk} = c + \beta_1 y_{i(g-2)\mathrm{math}} + \beta_2 y_{i(g-2)\mathrm{read}} + \beta_3 \left( y_{i(g-2)\mathrm{math}} \cdot y_{i(g-2)\mathrm{read}} \right) + u_{igk},$$

where $y_{igk}$ is the score of student i in grade g on the assessment in subject k.

● Based on these regression results, we form predicted scores, $\hat{y}_{igk}$, for each person in the postreform cohort and then form the differences between these predicted values, $\hat{y}_{igk}$, and the actual grade g scores in math and reading for the postreform cohort.

● Finally, we calculate the average of these differences in math and reading for each of our ten samples of students in the postreform cohort and plot these averages.16
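The following Python sketch is our illustration of the six steps above on generic score arrays; the array layout is assumed, and standard errors are omitted:

```python
import numpy as np

def decile_treatment_effects(pre_base, pre_out, post_base, post_out):
    """Steps 1-6 above. *_base: (N, 2) lower-grade math/read scores;
    *_out: (N, 2) final-grade math/read scores; rows are students."""
    # Step 1: first principal component of prereform baseline scores.
    mean = pre_base.mean(axis=0)
    _, _, Vt = np.linalg.svd(pre_base - mean, full_matrices=False)
    w = Vt[0]                                   # PC loadings
    # Step 2: apply the prereform weights to both cohorts.
    idx_pre = (pre_base - mean) @ w
    idx_post = (post_base - mean) @ w
    # Step 3: decile cells from the prereform index distribution.
    cuts = np.quantile(idx_pre, np.linspace(0.1, 0.9, 9))
    cell_pre = np.digitize(idx_pre, cuts)       # values 0..9
    cell_post = np.digitize(idx_post, cuts)

    def design(b):  # regressors: constant, math, read, math*read
        return np.column_stack([np.ones(len(b)), b[:, 0], b[:, 1],
                                b[:, 0] * b[:, 1]])

    effects = np.zeros((10, 2))                 # deciles x (math, read)
    for d in range(10):
        Zp = design(pre_base[cell_pre == d])
        Zq = design(post_base[cell_post == d])
        for k in range(2):                      # Step 4: two regressions per cell
            beta, *_ = np.linalg.lstsq(Zp, pre_out[cell_pre == d, k], rcond=None)
            # Steps 5-6: mean (actual - predicted) for the postreform cohort.
            effects[d, k] = np.mean(post_out[cell_post == d, k] - Zq @ beta)
    return effects
```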

A. NCLB Results

Figures 1A and 1B present our estimates of the changes in fifth-grade reading and math scores associated with the 2002 implementation of NCLB in Illinois. For students whose third-grade scores place them in the bottom two deciles of the 1999 achievement distribution, there is no evidence that NCLB led to higher ISAT scores in fifth grade. Three of the four estimated treatment effects for these deciles are negative. The only statistically significant estimated effect implies that fifth graders in 2002, whose third-grade scores placed them in the bottom decile of the 1999 third-grade achievement distribution, scored just over one-half point lower in math than expected given the observed relationship between third-grade scores in 1999 and fifth-grade scores in 2001. Because the ISAT scale is designed to generate a standard deviation of 15 for all scores, this estimated effect represents a decline of roughly 0.04 standard deviations. In contrast, deciles 3 through 9 enjoy higher-than-expected ISAT scores in both math and reading. We observe the largest score gains in math and reading in the sixth decile, where fifth graders in 2002 scored just under 0.1 standard deviations higher in reading and more than 0.13 standard deviations higher in math than comparable fifth graders scored in 2001.

14 We use the last year a student was in third grade as the third-grade year. We obtain similar results if we use test scores from the first year of third grade.

15 In an earlier version of this paper, we also presented comparisons between the 1999–2001 cohort and the 2001–2003 cohort. However, we subsequently learned that the interval between the 2001 third-grade test and the 2003 fifth-grade test was at least nine weeks shorter than the intervals for the cohorts that we deal with here. While the patterns in these results are quite similar to those presented in figures 1A and 1B, we cannot rule out the possibility that the difference in time between assessments as well as other differences in test administration for the 2001–2003 samples affect those results.

16 The bands in the figures are 95% confidence intervals. We calculate these intervals accounting for the fact that we must estimate what the expected score for each student would have been in the absence of NCLB. We obtain the adjustments to the variances of our estimates of mean cell differences by taking the sample average of the elements of the matrix $Z \Omega Z'$, where N is the number of fifth-grade observations in 2002, Z is the $N \times 4$ matrix of third-grade score variables used to produce predicted scores, and $\Omega$ is the estimated variance-covariance matrix from the regression of 2001 fifth-grade math or reading scores on these third-grade variables from 1999.
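Footnote 16’s adjustment can be computed without forming the full $N \times N$ matrix: the average of the elements of $Z \Omega Z'$ equals $\bar{z}' \Omega \bar{z}$, where $\bar{z}$ is the vector of column means of Z. A minimal sketch (our illustration, assuming Z and Omega are numpy arrays of the stated shapes):

```python
import numpy as np

def prediction_variance_adjustment(Z, Omega):
    """Average element of Z @ Omega @ Z.T, i.e., the contribution of
    coefficient-estimation error to the variance of a mean predicted
    score, computed without building the N x N matrix.
    Z: (N, 4) regressor matrix; Omega: (4, 4) coefficient covariance."""
    z_bar = Z.mean(axis=0)          # column means of Z
    return z_bar @ Omega @ z_bar    # equals the mean of all N**2 elements
```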

Figure 1C presents the expected proficiency rates in math and reading for each of the deciles included in figures 1A and 1B.17 These are the rates expected given the third-grade performance of students who were in fifth grade in 2002 and the relationship between third- and fifth-grade performance for the 2001 cohort of fifth graders. For example, the figure tells us that in the absence of NCLB, the fifth graders in 2002 who fell in the fifth decile of our baseline achievement distribution would have faced just over a 20% chance of reaching the proficiency standard for math and just under a 35% chance of reaching the reading standard.

In light of figure 1C, we are not surprised that we did not find an increase in ISAT scores in 2002 among students in the bottom two deciles. The Illinois proficiency standards are lofty goals for these students, and they face less than a 10% chance of reaching either standard. The fact that we do find significant positive effects for students in the third decile suggests that students may receive some benefit from these types of reforms even if they have at best modest hopes of reaching the threshold for proficiency. It is important to note that the model outlined in section II can accommodate this result. We assume that the cost function associated with investment in any particular student is convex. If small investments in students are rather inexpensive at the margin, schools may find it optimal to make such investments, even in students who are notably below the current proficiency standard.18 On the other hand, our results for the third-decile students may reflect spillover effects that are not present in our model. In any event, figures 1A and 1C demonstrate that students with the lowest levels of prior achievement did not appear to achieve higher ISAT scores following NCLB, and among these students, the Illinois proficiency standards represented almost unattainable goals. Taken as a whole, these results support our contention that NCLB is not designed to leave no child behind.

B. Interpretation and Robustness of the NCLB Results

Several issues regarding the interpretation of our results deserve further attention. First, figures 1A and 1B present estimated changes in the scores on specific assessments. We can state clearly that the ISBE implementation of NCLB worked better, in terms of raising ISAT scores, for some students than others and that it may have been counterproductive among the least able students in CPS; it is worth noting that this claim does not rest on a particular choice of scaling for the ISAT scores. We find no evidence of positive effects among students in the bottom two deciles but clear evidence of significant increases in ISAT scores among students in deciles 3 through 9. If all the estimated effects were the same sign, we might worry that any comparisons among cells concerning the magnitude of estimated effects could be sensitive to our choice of scale for reporting test scores, but our main emphasis here is a qualitative, not a quantitative, claim. Scores are higher than expected for students who are in the middle of the baseline achievement distribution and the same or lower than expected for those at the bottom of this distribution. Although NCLB raised average ISAT scores in Chicago, the implementation of NCLB in Chicago did not help, and may have hurt, the children who were likely the furthest behind when they began school. Our model suggests that this outcome should not be a surprise, but it is also not consistent with the stated purpose of NCLB.

[Figure 1: Fifth-grade results for 2002. A. Change in fifth-grade reading scores, 2002 versus 2001. B. Change in fifth-grade math scores, 2002 versus 2001. C. Expected 2002 proficiency in fifth grade by decile of the third-grade achievement distribution, 1999.]

17 These expected proficiency rates are predicted values based on the estimated coefficients from a logit model of fifth-grade proficiency in 2001 given third-grade math and reading scores in 1999.

18 Examine our first-order condition above. The value of $c'(e)$ near $e = 0$ will play a large role in determining how many students receive extra attention as a consequence of the accountability program.
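Footnote 17’s expected proficiency rates can be reproduced with a standard logit fit on the prereform cohort. The sketch below is our illustration only: the data are synthetic, the variable names are hypothetical, scikit-learn stands in for whatever software the authors used, and the simple baseline index replaces the paper’s first principal component.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)                  # synthetic stand-in data
# Prereform cohort: third-grade scores (1999) and a 0/1 indicator for
# reaching the fifth-grade proficiency cutoff in 2001.
X_1999 = rng.normal(size=(5000, 2))             # columns: math, read
proficient_2001 = (X_1999.sum(axis=1) + rng.normal(size=5000) > 1).astype(int)

logit = LogisticRegression().fit(X_1999, proficient_2001)

# Postreform cohort: apply the prereform coefficients to the 2002 fifth
# graders' third-grade (2000) scores, then average within deciles.
X_2000 = rng.normal(size=(5000, 2))
p_hat = logit.predict_proba(X_2000)[:, 1]
index = X_2000.mean(axis=1)                     # paper uses the first PC
decile = np.digitize(index, np.quantile(index, np.linspace(0.1, 0.9, 9)))
expected_rate = [p_hat[decile == d].mean() for d in range(10)]
```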

We would like to conduct placebo experiments using ISAT data from the years before 2002 in order to rule out the possibility that we are simply picking up preexisting differences among ability levels in the trends of third- to fifth-grade changes in test scores among CPS students. However, this is not possible because only three years of ISAT data exist prior to 2002, and we need four years of data to measure differences in third- to fifth-grade achievement trajectories between two cohorts of students. Nonetheless, we can construct comparisons in reading and math using two cohorts tested under the same policy regime. The 2005 and 2004 cohorts of fifth graders were tested in both fifth and third grades under NCLB. Thus, we construct figures describing changes in fifth-grade scores between 2005 and 2004 in order to examine changes in scores between two cohorts tested under similar policy regimes. Figures 2A and 2B do not offer even a hint of the clear pattern that is observed in figures 1A and 1B. We see sizable losses in reading and some noteworthy gains in math among the top deciles, but there is no common pattern for math and reading results, and there is no evidence of important gains in the middle of the distribution relative to the lower deciles.

We do not know why there are some statistically significant deviations from 0 in these figures. In any pair of years, especially during the early years of a new policy regime, there may be differences in test administration or curricular priorities that create such differences.19 Our main point is that these figures describe differences between two cohorts that experienced broadly similar accountability environments, and these differences in no way fit the pattern observed in figures 1A and 1B.

Table 1 contains the results from two different robustness checks. We perform these checks not only on our 2002 analyses but also on the 1998 analyses presented in the next section. In the first exercise, we add controls for race, gender, and free-lunch status to the regression models that we use to form predicted final-grade scores. The second exercise involves adding school fixed effects to these models and is not quite as straightforward as the first. Because there are over 400 elementary schools in Chicago and roughly 2,000 fifth-grade students per year in each of our baseline achievement deciles, many schools are represented in a given baseline achievement decile in 2002 but not in 2001. Thus, we cannot estimate separate regressions for each of our deciles and simply add school fixed effects without losing a significant number of observations.20

19 NCLB is designed to be more demanding over time, and the target proficiency rate did increase modestly in Illinois between 2004 and 2005.

20 We used four groups: deciles 1 and 2, 3 through 5, 6 through 8, and 9 and 10. We employed a richer polynomial in third-grade achievement scores to compensate for the use of four broader regression cells instead of ten. We still calculate average treatment effects for each decile to facilitate comparisons with our other results.

[FIGURE 2.—PLACEBO TESTS BASED ON 2005 RESULTS. A: CHANGE IN FIFTH-GRADE READING SCORES, 2005 VERSUS 2004. B: CHANGE IN FIFTH-GRADE MATH SCORES, 2005 VERSUS 2004.]


Nonetheless, we can estimate the relationships between fifth- and third-grade scores for the 2001–1999 cohort within broader ability cells while including school fixed effects. Using only within-school variation in student outcomes, we find results that follow the same pattern as those presented in figures 1A and 1B.

For each specification, table 1 also presents the estimated average score gains for the entire sample. We find that NCLB is associated with increases in overall average scores. Thus, our results are consistent with the large body of research that finds positive impacts of accountability systems on average test scores at the state, district, or school level.

TABLE 1.—ROBUSTNESS CHECKS

Each cell reports an estimated treatment effect with its standard error in parentheses. For each subject, the three specifications are the main effect, the specification with demographic controls, and the specification with school fixed effects; these correspond to columns (1)–(12) of the original layout.

A: 5th-Grade ISAT (2002 vs. 2001), Scale-Score Units
(deciles of the 1999 third-grade achievement distribution)

           Math                                          Reading
Decile     Main          Demog. Ctrls  School FE        Main          Demog. Ctrls  School FE
1         -0.56 (0.23)  -0.46 (0.22)  -0.78 (0.23)     -0.21 (0.22)  -0.11 (0.22)  -0.37 (0.22)
2         -0.18 (0.23)   0.01 (0.22)   0.01 (0.22)      0.13 (0.23)   0.21 (0.23)   0.28 (0.22)
3          1.18 (0.24)   1.14 (0.23)   1.12 (0.24)      0.57 (0.24)   0.55 (0.24)   0.52 (0.24)
4          1.53 (0.27)   1.53 (0.26)   1.41 (0.26)      1.08 (0.27)   1.05 (0.26)   1.06 (0.25)
5          0.86 (0.29)   0.74 (0.28)   1.10 (0.27)      0.78 (0.27)   0.69 (0.27)   0.94 (0.27)
6          2.09 (0.29)   2.03 (0.28)   2.01 (0.28)      1.38 (0.28)   1.32 (0.28)   1.17 (0.27)
7          1.66 (0.31)   1.67 (0.30)   1.90 (0.28)      1.28 (0.29)   1.31 (0.28)   1.46 (0.26)
8          1.85 (0.31)   1.78 (0.30)   1.88 (0.31)      1.10 (0.29)   1.05 (0.29)   1.16 (0.28)
9          1.25 (0.33)   1.33 (0.32)   0.79 (0.32)      0.58 (0.29)   0.67 (0.29)   0.13 (0.29)
10         0.77 (0.35)   0.68 (0.33)   1.25 (0.35)      0.02 (0.31)  -0.02 (0.30)   0.41 (0.31)
Overall    0.94 (0.09)   0.95 (0.09)   0.96 (0.09)      0.61 (0.08)   0.62 (0.08)   0.61 (0.08)

B: 5th-Grade ITBS (1998 vs. 1996), Grade-Equivalent Units
(deciles of the 1994 third-grade achievement distribution)

           Math                                             Reading
Decile     Main            Demog. Ctrls    School FE        Main            Demog. Ctrls    School FE
1         -0.017 (0.023)  -0.015 (0.023)  -0.026 (0.023)   -0.106 (0.032)  -0.105 (0.032)  -0.110 (0.031)
2          0.062 (0.022)   0.063 (0.022)   0.072 (0.022)    0.037 (0.031)   0.037 (0.030)   0.040 (0.030)
3          0.089 (0.021)   0.093 (0.021)   0.081 (0.021)    0.066 (0.030)   0.070 (0.030)   0.053 (0.029)
4          0.114 (0.021)   0.118 (0.020)   0.122 (0.019)    0.095 (0.029)   0.099 (0.028)   0.109 (0.027)
5          0.119 (0.020)   0.119 (0.019)   0.120 (0.019)    0.116 (0.028)   0.117 (0.027)   0.113 (0.027)
6          0.101 (0.020)   0.105 (0.020)   0.094 (0.019)    0.081 (0.027)   0.077 (0.027)   0.084 (0.026)
7          0.053 (0.019)   0.060 (0.019)   0.057 (0.017)    0.115 (0.026)   0.124 (0.026)   0.101 (0.024)
8          0.085 (0.020)   0.095 (0.020)   0.087 (0.019)    0.066 (0.027)   0.075 (0.027)   0.074 (0.026)
9          0.070 (0.019)   0.070 (0.019)   0.038 (0.019)    0.052 (0.029)   0.050 (0.028)   0.000 (0.028)
10        -0.008 (0.019)  -0.006 (0.019)   0.022 (0.019)   -0.081 (0.033)  -0.084 (0.032)  -0.032 (0.033)
Overall    0.066 (0.006)   0.069 (0.006)   0.066 (0.006)    0.043 (0.009)   0.046 (0.009)   0.043 (0.009)

C: 6th-Grade ITBS (1998 vs. 1996), Grade-Equivalent Units
(deciles of the 1994 fourth-grade achievement distribution)

           Math                                             Reading
Decile     Main            Demog. Ctrls    School FE        Main            Demog. Ctrls    School FE
1          0.073 (0.025)   0.064 (0.025)   0.070 (0.024)    0.060 (0.034)   0.054 (0.033)   0.048 (0.032)
2          0.192 (0.025)   0.172 (0.024)   0.196 (0.024)    0.191 (0.034)   0.174 (0.033)   0.201 (0.035)
3          0.235 (0.023)   0.220 (0.023)   0.222 (0.021)    0.233 (0.031)   0.234 (0.032)   0.237 (0.029)
4          0.246 (0.022)   0.237 (0.022)   0.262 (0.020)    0.230 (0.030)   0.226 (0.030)   0.217 (0.028)
5          0.209 (0.021)   0.212 (0.021)   0.209 (0.020)    0.158 (0.029)   0.164 (0.029)   0.172 (0.027)
6          0.259 (0.021)   0.252 (0.021)   0.245 (0.019)    0.186 (0.027)   0.181 (0.027)   0.177 (0.027)
7          0.208 (0.020)   0.205 (0.020)   0.224 (0.018)    0.150 (0.026)   0.149 (0.026)   0.161 (0.023)
8          0.189 (0.019)   0.187 (0.019)   0.187 (0.019)    0.160 (0.026)   0.164 (0.026)   0.158 (0.026)
9          0.152 (0.018)   0.144 (0.018)   0.128 (0.018)    0.049 (0.026)   0.052 (0.026)  -0.002 (0.026)
10         0.086 (0.018)   0.095 (0.017)   0.116 (0.017)    0.037 (0.032)   0.053 (0.031)   0.092 (0.031)
Overall    0.183 (0.007)   0.177 (0.007)   0.184 (0.006)    0.142 (0.009)   0.142 (0.009)   0.143 (0.009)

Note: We describe our estimation procedure in section IV. Table A1 describes the samples by decile. The scale for the ISAT scores ranges from 120 to 200. The ISAT is designed to have a standard deviation of 15 for the population of fifth-grade students in Illinois. The ITBS scores are in grade-equivalent units; for example, 5.1 is the achievement level associated with the end of the first month of fifth grade. Note 17 describes how we calculate the standard errors in our main specification and in the specification with additional controls for race, gender, and free-lunch status. We use 1,000 bootstrap replications to compute the standard errors for the models with school fixed effects because we cannot consistently estimate the variance-covariance matrices for regressions that include over 400 school fixed effects.


However, in contrast to this literature, our primary concern is not the extent to which these average gains generalize to alternative assessments. We emphasize that whatever permanent skill increases are associated with these average gains are not gains enjoyed by the students at the bottom of the baseline achievement distribution.21

Figures 1A to 1C provide only indirect support for our model because we do not have direct measures of teacher effort, and other mechanisms could generate the patterns we observe in these figures. If schools, in response to NCLB, picked curricula that worked best for students near proficiency and less well for the most and least able students, a similar pattern might emerge. Nonetheless, any alternative explanation for our results must explain how NCLB leads to changes in educational practice that benefit many students but not students with the lowest levels of prior achievement.

Without arbitrary assumptions about the exact shape of the penalty function, our model cannot generate clear predictions concerning exactly how the shape of figures 1A and 1B should change when we restrict the sample to schools with certain types of baseline students. However, two features of the model are clear. First, students near the proficiency standard ex ante always receive extra attention because this is the most cost-effective strategy for increasing proficiency counts. Second, schools with low ex ante proficiency rates and few students near the proficiency threshold cannot avoid sanctions by simply directing attention to students near the proficiency standard. Thus, if the penalty function is convex enough, these schools will find it optimal to direct some extra effort toward students who are well below the proficiency standard. Although our sample sizes are not large enough to make strong inferences, we find suggestive evidence that students who are far below proficiency do fare better in schools with the lowest expected proficiency rates.

When we repeat our analyses within samples of schools that are comparable in terms of their expected proficiency rates prior to NCLB, we find, as we expect given our model, clear and noteworthy gains among students in the middle deciles of baseline achievement regardless of whether schools are under modest or great pressure from NCLB's AYP rules. Further, there is suggestive but not statistically significant evidence that students at the bottom of the achievement distribution do in fact fare better if they attend a school with an expected proficiency rate of less than 25% than if they attend a school with an expected proficiency rate between 25% and 40%. Among schools with expected proficiency rates less than 25%, our estimated treatment effects for the bottom two deciles are -0.17 and 0.13 in math and 0.38 and 0.52 in reading, but among schools with expected proficiency rates between 25% and 40%, the corresponding results are -0.77 and -0.48 in math and -0.70 and -0.19 in reading.22 These differences are noteworthy, but because our sample sizes within school type are so much smaller, we cannot reject the null hypothesis that the effects of NCLB among students in the bottom deciles are the same across these two school types. Nonetheless, the qualitative pattern in these results is consistent with our expectations given our model.23

C. Effects of the 1996 CPS Reforms

Figures 3A and 3B present estimates of the effects of the 1996 CPS reforms on reading and math scores in fifth grade. Here, we are comparing the performance of students tested in 1998 with the performance that we would have expected from similar students in 1996. The results for fifth-grade reading in figure 3A represent the effects of the policy changes that most closely resemble NCLB. Although CPS put reading first in its reform effort and made school-level probation decisions based primarily on proficiency counts in reading, CPS also published school ratings for math and reading in local newspapers based on proficiency counts. Further, fifth graders did not face a threat of summer school if they did poorly on the ITBS, and thus the CPS efforts to end social promotion, which are not part of NCLB, should not have affected results for fifth graders to the same degree that they affected the performance of students in sixth grade. Fifth-grade teachers and parents may well have responded to the promotion hurdles that awaited these students as sixth graders in 1999. However, we do not expect fifth-grade students to make significant changes in their focus and effort based on the consequences attached to sixth-grade exams because children discount the future heavily at this age. This creates an important difference between our fifth- and sixth-grade results.24 In a standard model of student effort, students will increase their effort in response to an immediate threat of summer school if the cost of such an increase is offset by a significant reduction in the likelihood of attending summer school, and we will see that our results for sixth graders are consistent with this hypothesis.

21 The results in figures 1A and 1B are also robust to different methods of measuring the heterogeneous effects of NCLB on test scores. We have experimented with finer partitions of the baseline achievement distribution and have used local linear regression methods to estimate the plots in our figures as continuous functions. We also examined numerous mean differences in pre- and postreform test scores for samples of students grouped according to cells defined by both their reading and math scores. Regardless of the methods we have used, we have found no evidence of gains in math or reading scores among students at the bottom of the third-grade achievement distribution, and this is also true regarding our analyses of changes in fifth-grade scores following the 1996 reforms within CPS. Further, we have always found groups of students in the middle of the third-grade achievement distribution who experienced increases, relative to prereform baselines, in their fifth-grade scores.

22 In our model, no student should ever be harmed directly by the introduction of an accountability system because we have made the strong assumption that districts perfectly monitor some baseline level of effort before and after the introduction of accountability, and we do not model group instruction or related choices concerning curricular selection or the pace of instruction. Nonetheless, if schools responded to NCLB by tailoring all group instruction to the needs of students near the proficiency standard, other students could be harmed directly.

23 In schools with an expected proficiency rate greater than 0.4, our samples of students in the bottom deciles are too small to support meaningful inferences.

24 We do not analyze seventh-grade scores in 1998 because the sixth-grade promotion hurdle in 1997 is a source of endogenous composition changes in the 1998 seventh-grade sample.


The pattern of results in figure 3A is quite similar to the pattern observed in our analyses of NCLB. Here, the scale is in grade equivalents, and a 0.1 change represents roughly one month of additional achievement. The overall standard deviations of fifth-grade scores in our 2002 samples are roughly 1.2 for math and 1.5 for reading. Thus, estimated achievement gains of 0.1 or slightly more for several cells in the middle of the ability distribution are noteworthy. Still, we find zero or negative estimated achievement effects among students in either tail. Further, figure 3B shows a similar but slightly less dramatic pattern of changes in fifth-grade math scores. Figure 3C shows that the CPS proficiency standards were slightly more demanding than the ISAT proficiency standards used in 2002, and thus it is noteworthy that fifth-grade ITBS scores did increase among students in the third decile of the prior achievement distribution, even though one would have expected less than 5% of these students to pass either the math or reading threshold in the prereform period. Nonetheless, the teachers and parents of these students knew that they would face a promotion hurdle as sixth graders in 1999, and as we demonstrate below, the standards for promotion were within the reach of these students.

Figures 4A and 4B present results for sixth graders. Here, we are clearly not measuring just the effects of the school probation rules and the public reporting of proficiency counts in local newspapers. We anticipate that the rules governing summer school attendance and retention decisions shaped not only the actions of teachers and parents but also the effort of students during the school year. Students in sixth grade faced summer school if they performed below certain targets in reading or math, and these targets were much lower than the proficiency standards used to measure school performance. Taking all of these factors into account, we are not surprised that while our results for sixth graders follow the same overall pattern observed among fifth graders, the estimated sixth-grade gains associated with the CPS reforms are larger at every decile in both math and reading, and there is some evidence of gains even in the lowest decile.

Figure 4C is similar to figure 3C except that it plots, for each decile, the probabilities of exceeding the summer school cutoffs for sixth graders.25 The striking difference between figures 3C and 4C may offer some insight concerning why the estimated gains from accountability in figures 4A and 4B are more apparent in the lower deciles of the achievement distribution. Even students in the lowest decile of fourth-grade achievement had almost a 20% chance of reaching the individual math or reading cutoffs that determined summer school attendance after sixth grade.

25 The primary focus of Jacob (2005) is the average change in scores in response to the introduction of the CPS accountability system. However, Jacob does examine an interaction between student ability and the impact of high-stakes testing. Although Jacob's method involves using scores from many years, and thus many different forms of the ITBS exam, as well as a much more restrictive specification of how heterogeneity influences the impacts of high stakes, he comes to a conclusion that squares broadly with our results: "Students who had been scoring at the 10th–50th percentile (in the national distribution) in the past fared better than their classmates who had either scored below the 10th percentile, or above the 50th percentile" (pp. 776–777).

[FIGURE 3.—FIFTH-GRADE RESULTS FOR 1998. A: CHANGE IN FIFTH-GRADE READING SCORES, 1998 VERSUS 1996. B: CHANGE IN FIFTH-GRADE MATH SCORES, 1998 VERSUS 1996. C: EXPECTED 1998 PROFICIENCY IN FIFTH GRADE BY DECILE OF THE THIRD-GRADE ACHIEVEMENT DISTRIBUTION FOR 1994.]


White and Rosenbaum (2007) suggest that among sixth graders, CPS schools targeted their instructional efforts toward students who could avoid summer school only if they made progress during the school year.26 We argue in section II that less demanding proficiency targets can increase the amount of attention that teachers devote to less able students, and the contrast between our results for fifth and sixth grades is consistent with our conjecture. However, we cannot rule out the possibility that even students at the lowest levels of prior achievement simply worked harder than similar students in previous cohorts because they wanted to avoid summer school.

Figures 4A and 4B indicate that students in the third and fourth deciles of prior achievement scored over 0.2 higher in math and reading than one would have expected prior to the 1996 reforms. These are large effects, since 0.2 represents two full months of achievement on the ITBS grade-equivalent scale, and it is worth noting that figure 4C implies that CPS set the summer school cutoff scores such that students in these deciles faced both a significant chance of avoiding summer school and a significant chance of attending summer school, depending on how they progressed during the year.27

As we note above, table 1 contains results from two robustness checks that we have conducted for each of our analyses.

26 However, it is not completely clear that the least able CPS students benefited from this program. In a previous version of the paper, we presented results for twenty prior achievement cells. The estimated sixth-grade effects for those in the bottom 5% of the ability distribution were quite close to zero and not statistically significant. See Roderick and Engel (2001) for more work on the motivational responses of low-achieving children to the retention policy in CPS.

27 Another literature explores how students respond when they face different stakes and performance standards on tests. See Betts and Grogger (2003), as well as Becker and Rosen (1992), who apply insights from Lazear and Rosen's (1981) tournament model to the design of academic testing systems that determine rewards and punishments for students. This literature suggests that less able students will not be affected by these systems if they have no realistic chance of ever reaching the key cutoff scores. Thus, the decision by CPS to set modest standards for grade promotion may have been advantageous for generating more effort among low-achieving students.

[FIGURE 4.—SIXTH-GRADE RESULTS FOR 1998. A: CHANGE IN SIXTH-GRADE READING SCORES, 1998 VERSUS 1996. B: CHANGE IN SIXTH-GRADE MATH SCORES, 1998 VERSUS 1996. C: EXPECTED 1998 PASS RATES IN SIXTH GRADE, SUMMER SCHOOL CUTOFFS, BY DECILE OF THE FOURTH-GRADE ACHIEVEMENT DISTRIBUTION FOR 1994. D: CHANGE IN SIXTH-GRADE MATH SCORES ON LOW-STAKES IGAP TEST, 1998 VERSUS 1996.]


These results come from models of ITBS achievement that included school fixed effects and from models that included additional controls for race, gender, and free-lunch eligibility. Results from these alternative specifications are quite similar to those in figures 3A, 3B, 4A, and 4B.

Table 2 presents results that parallel those in table 1, but here the dependent variables are indicators for scoring above either the proficiency standards or the cutoff scores associated with the summer school policy. The results follow the patterns that we expect given our figures. However, some readers may wonder why the changes in summer school pass rates are so modest among sixth graders in deciles 6 through 8, given that figures 4A and 4B present noteworthy gains in achievement for these deciles. The answer lies in figure 4C.

TABLE 2.—TREATMENT EFFECT ON PROFICIENCY RATES

Each cell reports an estimated treatment effect with its standard error in parentheses. For each subject, the two specifications are the main effect and the specification with demographic controls; these correspond to columns (1)–(8) of the original layout.

A: Proficiency on 5th-Grade ISAT (2002 vs. 2001)
(deciles of the 1999 third-grade achievement distribution)

           Math                             Reading
Decile     Main            Demog. Ctrls     Main            Demog. Ctrls
1          0.006 (0.004)   0.007 (0.004)   -0.002 (0.005)  -0.001 (0.005)
2          0.014 (0.006)   0.016 (0.006)   -0.006 (0.008)  -0.006 (0.008)
3          0.057 (0.009)   0.056 (0.009)    0.019 (0.011)   0.018 (0.011)
4          0.077 (0.013)   0.077 (0.012)    0.041 (0.014)   0.038 (0.014)
5          0.082 (0.015)   0.077 (0.014)    0.035 (0.015)   0.032 (0.015)
6          0.114 (0.017)   0.111 (0.016)    0.076 (0.017)   0.073 (0.017)
7          0.080 (0.016)   0.080 (0.016)    0.078 (0.016)   0.078 (0.016)
8          0.054 (0.015)   0.053 (0.015)    0.077 (0.014)   0.075 (0.014)
9          0.008 (0.012)   0.010 (0.012)    0.031 (0.010)   0.033 (0.010)
10         0.011 (0.006)   0.010 (0.006)    0.016 (0.006)   0.015 (0.006)
Overall    0.047 (0.004)   0.047 (0.003)    0.033 (0.004)   0.032 (0.004)

B: Proficiency on 5th-Grade ITBS (1998 vs. 1996)
(deciles of the 1994 third-grade achievement distribution)

           Math                             Reading
Decile     Main            Demog. Ctrls     Main            Demog. Ctrls
1          0.002 (0.003)   0.002 (0.003)    0.001 (0.004)   0.002 (0.004)
2          0.008 (0.004)   0.008 (0.004)    0.004 (0.004)   0.004 (0.004)
3          0.013 (0.006)   0.014 (0.006)    0.012 (0.006)   0.012 (0.006)
4          0.026 (0.008)   0.027 (0.008)    0.030 (0.009)   0.030 (0.009)
5          0.032 (0.011)   0.032 (0.010)    0.033 (0.011)   0.038 (0.011)
6          0.042 (0.013)   0.045 (0.013)    0.034 (0.013)   0.031 (0.013)
7          0.031 (0.014)   0.034 (0.014)    0.052 (0.014)   0.056 (0.014)
8          0.045 (0.014)   0.049 (0.014)    0.013 (0.015)   0.016 (0.015)
9          0.037 (0.011)   0.037 (0.011)    0.027 (0.013)   0.027 (0.013)
10        -0.006 (0.006)  -0.005 (0.006)   -0.011 (0.007)  -0.010 (0.007)
Overall    0.023 (0.003)   0.025 (0.003)    0.020 (0.003)   0.021 (0.003)

C: Proficiency on 6th-Grade ITBS (1998 vs. 1996) Relative to Summer School Cutoff
(deciles of the 1994 fourth-grade achievement distribution)

           Math                             Reading
Decile     Main            Demog. Ctrls     Main            Demog. Ctrls
1          0.054 (0.011)   0.051 (0.011)    0.037 (0.011)   0.037 (0.012)
2          0.101 (0.014)   0.093 (0.014)    0.072 (0.015)   0.064 (0.015)
3          0.119 (0.014)   0.113 (0.014)    0.111 (0.015)   0.113 (0.015)
4          0.084 (0.012)   0.081 (0.012)    0.081 (0.014)   0.080 (0.014)
5          0.054 (0.010)   0.057 (0.010)    0.047 (0.013)   0.050 (0.013)
6          0.047 (0.008)   0.044 (0.008)    0.059 (0.011)   0.057 (0.010)
7          0.015 (0.006)   0.014 (0.006)    0.030 (0.008)   0.031 (0.008)
8          0.012 (0.004)   0.012 (0.004)    0.014 (0.006)   0.014 (0.006)
9          0.002 (0.002)   0.002 (0.003)    0.008 (0.003)   0.008 (0.003)
10        -0.001 (0.001)  -0.001 (0.004)    0.001 (0.002)   0.001 (0.004)
Overall    0.047 (0.003)   0.049 (0.003)    0.044 (0.003)   0.044 (0.003)

Note: See notes to table 1. This table presents results that parallel those in table 1, but the dependent variable is an indicator for being proficient in fifth grade or for exceeding the summer school cutoff in sixth grade. Further, we use a logit model rather than linear regression to create predicted proficiency or pass rates given student characteristics. We calculate the standard errors using 1,000 bootstrap replications.


Because the summer school cutoff scores are quite low relative to the NCLB or CPS proficiency standards, the vast majority of students in these deciles should have been able to avoid summer school without receiving extra help from their teachers or working harder on their own. For these students, extra help from teachers or extra effort in class provided insurance against a small but not negligible risk of summer school. Such actions could easily produce noteworthy achievement gains while having small impacts on average pass rates.

D. Changes in a Low-Stakes Outcome

We note above that much of the literature on responses to high-stakes testing programs addresses the possibility that teachers take actions that inflate students' scores on high-stakes exams relative to their actual skill levels. Thus, some may wonder how the reforms that we analyze affect distributions of outcomes on low-stakes assessments. We are not able to address this issue in detail because almost all of the assessments that students took outside these two accountability systems are either not comparable for the pre- and postreform cohorts or not given in the grades we consider.28

However, we are able to construct figure 4D. This figure parallels figure 4B, but the outcome variable is the sixth-grade math IGAP (Illinois Goals Assessment Program) test rather than the sixth-grade math ITBS test used in the CPS accountability plan. The cells are defined exactly as in figure 4B. As before, only students who took the ITBS in both fourth and sixth grades are included in the analyses, and the conditioning variables are the fourth-grade ITBS scores.

Three features of the IGAP results are noteworthy. First, because the standard deviation of sixth-grade math IGAP scores is 86, the gains reported in deciles 3 through 7 of figure 4D are noteworthy: all exceed 0.1 standard deviations. Second, the gains in figure 4D are typically a bit smaller than those in figure 4B when both sets of results are expressed in standard deviation units.29 Third, the same hump-shaped pattern that appears in figure 4B also appears in figure 4D. These results provide additional evidence that the CPS system benefited students in the middle of the achievement distribution more than those at the bottom of the distribution.

We also note that several of the estimated gains in IGAP scores presented in figure 4D are almost certainly inflated by the process of selection into IGAP test taking. The samples in figures 4A and 4B include sixth graders in the 1998 and 1996 cohorts who took the ITBS in fourth and sixth grades, but students included in the analyses for figure 4D must also have taken the sixth-grade IGAP math test in 1996 or 1998. In both 1996 and 1998, those who took the IGAP have higher average baseline achievement. Further, if we restrict our attention to students who took both the IGAP and ITBS, we find larger implied gains in ITBS math scores between 1998 and 1996 than those presented for the full sample in figure 4B. For most deciles, the differences between our results in figure 4B and those from parallel analyses restricted to students who took both the ITBS and IGAP are quite small. However, this is not the case in the first and second deciles, where, respectively, over one-fourth and one-tenth of the students represented in figure 4B did not take the IGAP.30

To gauge how much selection into IGAP testing affects the results in figure 4D, we calculate a simple selection correction factor based on the results in figure 4B and our parallel set of ITBS math results for the sample of students who took both the IGAP and ITBS. We calculate ratios of the estimated changes in ITBS math scores presented in figure 4B to the corresponding estimated changes among the select sample of students who took both the IGAP and ITBS. We then form selection-corrected IGAP results by taking the product of these ratios and our estimated IGAP effects in figure 4D.31 The implied corrections are small for most deciles. However, the estimated IGAP gain for decile 1 falls from 6.3 to 2.9, and the estimated gain for decile 2 falls from 10 to 8.7. These selection corrections do not change the pattern of results presented in figure 4D but simply accentuate the overall hump shape that is already present.

E. Summary

Our results show a consistent pattern across assessments, subjects, and years. Accountability systems built around proficiency counts do not generate a uniform distribution of changes in measured achievement. Students who have no realistic chance of becoming proficient in the near term appear to gain little from the introduction of these systems. However, our results from schools with ex ante proficiency rates of less than 25% provide suggestive evidence that low-achieving students may fare slightly better if they attend schools that cannot meet target proficiency levels by concentrating only on students who are already near proficiency.

28 The IGAP reading test is not comparably scored over time. See Jacob (2005) for details. Further, the IGAP was not given in fifth grade. By the time of NCLB in 2002, the ITBS had actually become a relatively low-stakes exam in Chicago. However, between 2001 and 2002, CPS changed to a completely different form of the exam.

29 The standard deviation of sixth-grade ITBS scores is 1.38. The average ITBS gain is 0.132 standard deviations, while the average IGAP gain is 0.118 standard deviations. Given the correction for selection into IGAP testing that we describe in the following paragraph, the average IGAP gain falls to 0.11 standard deviations. Using regression models with sixth-grade ITBS and IGAP scores from 1994 through 1998 as dependent variables and student demographic characteristics, policy indicators, and time trends as control variables, Jacob (2005, pp. 781–782) reaches a similar conclusion concerning the impact of the CPS reform on sixth-grade ITBS math gains. However, he cautiously attributes the growth in sixth-grade IGAP math scores to a preexisting trend in IGAP achievement.

30 We have constructed figure 4B while restricting our outcome samples to students with valid IGAP math scores, and we find that the estimated gains in ITBS math scores more than double for the first decile and increase by roughly 13% for the second.

31 This method is valid if the ratios of treatment effects for the select versus full samples are the same for IGAP and ITBS effects. It makes sense that this correction should matter most in the first decile. Rates of IGAP testing are lowest in this decile. Further, those who did not take the IGAP have lower baseline achievement and are less likely to view the summer school cutoffs as realistic goals.


Our results for sixth graders in 1998, who were able to avoid summer school if they scored above relatively modest thresholds, are consistent with the proposition that lower proficiency thresholds shift the benefits of these systems toward students with lower baseline achievement levels.32

V. Potential Reforms to Accountability Systems

The central lesson of the model and empirical work presented here is that an accountability system built around proficiency counts may not help students who are currently far above or far below these thresholds. In this section, we ask whether recent proposals for changing the AYP system can help make NCLB a policy that generates improved instruction for all students. We assume that the goal of NCLB or related accountability programs is to induce a uniform increase in the amount of extra instruction that teachers give to students of all abilities. We acknowledge that there is no reason to believe that increasing the effort allocated to each student by the same amount is socially optimal. However, by analyzing how different AYP scoring systems influence the allocation of teacher effort relative to this standard, we illustrate the key issues that designers of accountability systems must face when trying to elicit any particular distribution of effort that may be deemed desirable.

Education policymakers are currently devoting significant attention to two alternative schemes for measuring AYP at the school level. First, several states have adopted indexing systems based on multiple thresholds.33 In such a system, students who score above the highest threshold contribute, as in other states, one passing score toward the school's proficiency count. However, students who fall short of this highest threshold but do manage to exceed lower thresholds count as varying fractions of a passing student, depending on how many thresholds they meet. Other states have adopted value-added systems that measure how much scores have improved, on average, between two test dates.34

These approaches do not build in strong incentives to focus attention only on students near a single proficiency standard, and one can easily construct examples in which these systems reduce the number of students who receive no extra attention under NCLB. However, there are important trade-offs between the two approaches.

Using the notation from section III, consider the following index system:

$$\min_{e_i} \; I\!\left[\frac{1}{T^u}\sum_{i=1}^{N} E(\tilde{t}_i)\right] + \sum_{i=1}^{N} c(e_i). \quad (2)$$

$T^u$ is the maximum possible score on the high-stakes assessment, and we normalize the floor of the scale to 0. Here,

$$\tilde{t}_i = \min\bigl\{T^u, \; \max[0, \; \alpha_i + e_i + \varepsilon_i]\bigr\}.$$

Thus, all scores are constrained to lie between the floor and ceiling of the scale used for assessment: $\tilde{t}_i \in [0, T^u]$ for all $i = 1, 2, \ldots, N$. In a complete analysis of equation (2), we would need to address the fact that the relationship between $e_i$ and the expected value of student $i$'s test score may be a function of both the floor and the ceiling on the test scale. However, we assume that the tests in question are designed to ensure that neither the floor nor the ceiling of the scale is relevant for investment decisions regarding students, regardless of student ability.35

In the index system described by equation (2), proficiency for a given student is no longer 0 or 1 but rather the student's score expressed as a fraction of the maximum score, and the penalty function $I[\cdot]$ describes how sanctions decline, for a school of size $N$, as the total proficiency count of the school increases. This indexing system resembles those used in some states, but it differs in two ways. First, the indexing is continuous, so that all score increases count the same toward the school's proficiency score regardless of a student's initial ability, $\alpha_i$.36 Second, because we have set the standard for proficiency at the maximum possible score and assumed this score is never reached, we have eliminated the existence of students for whom $e_i = 0$ simply because they are already too accomplished relative to the proficiency standard.

We can easily compare this characterization of indexing in equation (2) to the following value-added system:

$$\min_{e_i} \; \Phi\!\left[\sum_{i=1}^{N} \bigl(E(\tilde{t}_i) - \bar{\alpha}\bigr)\right] + \sum_{i=1}^{N} c(e_i). \quad (3)$$

Here, $\bar{\alpha}$ is the average performance in the school on a previous assessment. Because of the linearity in our model, equation (3) describes a value-added system in which schools of a given size are rewarded or sanctioned according to $\Phi[\cdot]$ based on their total net improvement in student achievement.

32 We have conducted our analyses of fifth graders using proficiency as the outcome variable, and as one would expect based on the results we report in our figures, proficiency rates in the bottom two deciles of achievement remain almost constant following the two reforms. We do see small increases in the number of 1998 sixth graders in the first decile who meet the summer school cutoffs for reading and math achievement. This is expected because the cutoffs are set at such low levels.

33 As of spring 2007, these are Alabama, Florida, Iowa, Louisiana, Massachusetts, Michigan, Minnesota, Mississippi, New Hampshire, New Mexico, Oklahoma, Pennsylvania, Rhode Island, South Carolina, Vermont, Wisconsin, and Wyoming. New York also has a small indexing component in its system.

34 As of January 2009, fifteen states had received federal approval of growth model plans, including North Carolina, Tennessee, Delaware, Arkansas, Florida, Iowa, Ohio, Alaska, Arizona, Minnesota, Missouri, Pennsylvania, and Texas.

35 The existence of floors or ceilings implies that there may exist regions of ability types at the top and bottom of the ability distribution such that the return to investment in students is diminished because their observed scores are likely to remain at the ceiling or floor even if their latent scores improve. By ignoring these possibilities, we are implicitly assuming that the distribution of ability types and the distribution of measurement errors are bounded in a manner that makes the floor and ceiling scores unattainable given the optimal vector of effort choices.

36 We do not think of ability as a fixed endowment but rather as the level of competency at the beginning of a given school year, which should reflect investments made by both schools and parents in previous years.


With regard to the goal of leaving no child behind, both of the systems in equations (2) and (3) represent the best of all possible worlds in many respects. Any given increase in the expected score of any given student makes the same contribution to the school's standing under the accountability system regardless of the student's initial achievement level. Therefore, in this setting, the optimal vector of effort allocations will dictate an identical increase in attention for all students as long as we maintain our assumption that $c(e)$ is strictly convex.

The systems described here are useful benchmarks because they demonstrate what is required to design an accountability system that does not direct effort toward a particular group of students. Note that the systems described here require a team of incredibly skilled test developers. The test scales in equations (2) and (3) are such that the effort cost of increasing an individual's expected score by any fixed amount is the same for all students. In practice, differences in the costs of improving student scores by particular increments at different points on a given scale will influence the allocation of effort among students.37

While both indexing and value-added systems offer means for eliciting improved allocations of teacher effort to all students, and not just those near proficiency standards, indexing and value-added are not equally desirable on all dimensions. Under any system that ties rewards and sanctions to levels of achievement, including a continuous indexing system, the minimized sum of effort costs and sanctions borne by the staff is a function of the distribution of prior student achievement in the school. Thus, it may be difficult to design an index system that challenges the best schools without setting goals for disadvantaged schools that are not attainable given their resources. The Clotfelter et al. (2004) results suggest that when indexing systems set unattainable goals for schools in disadvantaged communities, these systems may actually do harm by causing these schools to lose the teachers they need most. Under the value-added system described in equation (3), the total cost of achieving the optimal proficiency score is not affected by the distribution of initial ability.38 Thus, it might be possible to use such a value-added system to increase the quality of instruction for all students without distorting the supply of teachers among schools.39

Still, value-added methods are not a panacea. To begin, value-added measures are often much noisier than measures of the current level of student performance. In principle, one could address this concern by developing more reliable assessments, but it is still important to note that teachers may well demand increases in other aspects of their compensation if their standing under an accountability system is greatly influenced by measurement error in performance measures.40

In addition, the absence of a natural scale for knowledge raises important questions about any method that seeks to determine which groups of students made the most academic progress. Reardon (2007) uses data from the Early Childhood Longitudinal Study–Kindergarten cohort (ECLS–K) to show that measured differences between the magnitude of the black-white test score gap in first grade and in fifth grade among a single cohort of students can be quite sensitive to the specific scale used to report the scores, even if all scores from all candidate scales are standardized to have a mean of 0 and a variance of 1. The ECLS–K data do not permit researchers to make definitive statements about how much bigger the black-white test score gap is among fifth graders than among first graders because black and white students begin first grade with different achievement levels, and there is no natural metric for knowledge that tells us how to compare the size of this achievement gap with the corresponding gap observed in fifth grade. Since value-added measures are measures of achievement growth for a population of students, the claim that value-added is greater in school A than in school B is a claim that, on average, achievement growth was greater in school A than in school B during the past year. But if it is difficult to make robust judgments concerning whether achievement growth was greater among white students than black students in a nationally representative panel, it will also be difficult to make robust judgments concerning the relative magnitudes of average achievement growth in different schools.41

In the end, designers of accountability systems face an important trade-off. Any index system built around cutoff scores will make it more costly to attract teachers to disadvantaged schools as long as all schools are held to the same proficiency standards. On the other hand, systems built around measures of achievement growth will provide incentives and hand out sanctions based on performance measures that may be noisy and not robust to seemingly arbitrary choices concerning scales. Nonetheless, both methods may reduce the incentives some schools currently face to leave the least advantaged behind.

37 Further, if $c(e)$ is linear instead of strictly convex, it is easy to construct examples such that the optimal vector of effort allocations for both of the problems above includes increased attention for only some students. Given a penalty function, there will be a specific total sum of test scores or a total proficiency score such that the constant marginal cost of raising the total beyond this point is greater than the reduction in sanctions associated with such an increase, and there is nothing in the structure of this problem that guarantees $e_i > 0$ for all $i$ at this point. Thus, even with these ideal scales, we need to assume strictly decreasing returns to teacher effort at the student level to rule out the possibility that schools will target only a subset of students in their efforts to avoid sanctions.

38 This can be shown easily by substituting the formula for $\tilde{t}_i$ into equation (3), if one assumes that the floor and ceiling on the test scale are never binding at the optimal effort vector.

39 Here, we are implicitly assuming that the increase in the effort cost of teaching will not generate a decline in teacher quality that offsets the increased effort given by remaining teachers.

40 See Kane and Staiger (2002) for more on problems caused by measurement error in value-added systems.

41 Reardon’s (2007) results are driven in part, by the fact that the typicalblack student is making gains over a different region of the scale than thetypical white student. Because students sort among schools on ability, asimilar problem arises when measuring relative achievement growthamong schools.


VI. Conclusion

A significant ethnographic literature documents instances in specific schools where schools responded to accountability systems by targeting so-called bubble kids for extra help while simultaneously providing no special attention to students who were already proficient or unlikely to become proficient given feasible interventions. Here, we use unique data from Chicago that permit us to cleanly measure how the entire distribution of student achievement changes following the introduction of accountability systems built around proficiency counts. Our findings are quite consistent with the conclusions in the ethnographic literature, and we are the first to document educational triage on a large scale by comparing cohorts of students who took the same exams under different accountability regimes.

Our results do not suggest that NCLB has failed to improve performance among all academically disadvantaged students in Chicago. Figures 1A and 1B show that 2002 ISAT test scores among fifth graders were higher than one would have expected prior to NCLB over most of the prior achievement distribution, and it is important to note that even CPS students in the fourth decile of the third-grade achievement distribution faced just over 20% and just under 15% chances of being proficient in reading and math, respectively, prior to NCLB. Thus, many low-achieving students in Chicago appear to have done better on the ISAT under NCLB than they would have otherwise. However, for at least the bottom 20% of students, there is little evidence of significant gains and a possibility of lower-than-expected scores in math. If we assume that similar results hold for all elementary grades now tested under NCLB, we have reason to believe that at a given time, there are more than 25,000 CPS students being left behind by NCLB.42

This large number is the result of several factors interacting together. First, as a state, Illinois has set standards that are challenging for disadvantaged students. According to a report by the Consortium on Chicago School Research, Easton et al. (2003), just over half of the nation's fifth graders would be expected to achieve the ISAT proficiency standard in reading, and just under half would be expected to achieve the ISAT standard in math. Second, students in Chicago are quite disadvantaged: more than 80% of CPS students receive free or reduced-price lunch benefits. Third, CPS is one of the largest districts in the country.

We do not have data on individual test scores from other states, and we cannot assess the extent to which our results from Chicago reflect a pattern that is common among other school districts in other states. However, we have reason to believe that while the pattern of NCLB effects we have identified may not be ubiquitous, it is also not unique to Chicago. New York City, Cincinnati, Cleveland, and many other cities educate large populations of disadvantaged students in states with accountability systems that are roughly comparable to the 2002 system implemented in Illinois.43 Based on our results, it is reasonable to conjecture that hundreds of thousands of academically disadvantaged students in large cities are currently being left behind because the use of proficiency counts in NCLB does not provide strong incentives for schools to direct more attention toward them. Further, NCLB may be generating this type of educational triage in nonurban districts as well.44

Any school that views AYP as a binding constraint and also educates a significant number of students who have little hope of reaching proficiency faces a strong incentive to shift attention away from its lowest-achieving students and toward students near proficiency.

Because our results show significant increases in achievement for students near the proficiency standard, our results are consistent with the proposition that accountability systems can generate increases in achievement. However, our results also indicate that the rules used to transform the test score outcomes for all students into a single set of accountability ratings for schools play an important role in determining which students experience these achievement gains. More work is required to design systems that truly leave no child behind.

43 See National Center for Education Statistics (2007). On the other hand, Boston, Detroit, and Philadelphia are in states that use index systems to calculate AYP. Further, Houston, Dallas, and other cities in Texas face a state accountability system built around proficiency standards that are not as demanding as the 2002 standards in Illinois and possibly more in reach for disadvantaged students.

44 Commercial software now exists that makes it easier for schools to monitor and improve their AYP status. See http://www.schoolnet.com for an example. Schools that wish to create lists of students who are most likely to become proficient given extra instruction can easily do so.

REFERENCES

Becker, William E., and Sherwin Rosen, "The Learning Effect of Assessment Evaluation in High School," Economics of Education Review 11:2 (1992), 107–118.

Betts, Julian, and Jeffrey Grogger, "The Impact of Grading Standards on Student Achievement, Educational Attainment, and Entry-Level Earnings," Economics of Education Review 22:4 (2003), 343–352.

Booher-Jennings, Jennifer, "Below the Bubble: 'Educational Triage' and the Texas Accountability System," American Educational Research Journal 42:2 (2005), 231–268.

Bryk, Anthony S., "No Child Left Behind, Chicago Style," in Paul E. Peterson and Martin R. West (Eds.), No Child Left Behind? The Politics and Practice of School Accountability (Washington, DC: Brookings Institution Press, 2003).

Burgess, Simon, Carol Propper, Helen Slater, and Deborah Wilson, "Who Wins and Who Loses from School Accountability? The Distribution of Educational Gain in English Secondary Schools," CMPO working paper (July 2005).

42 This is a conservative estimate. There are more than 30,000 students in the bottom 20% of the current third- to eighth-grade CPS student population. Although we cannot rule out the possibility that some students in the bottom two deciles of the achievement distribution receive extra help because their teachers see potential that is not reflected in their prior test scores, our results suggest that few students fall into this category. Further, we conjecture that there are as many or more students in the third and fourth deciles who receive little or no extra help because their teachers realize that they are less likely to improve than other students with similar prior achievement levels.


Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Educational Evaluation and Policy Analysis 24:4 (2002), 305–331.

Clotfelter, Charles, Helen Ladd, Jacob Vigdor, and Aliaga Diaz, "Do School Accountability Systems Make It More Difficult for Low-Performing Schools to Attract and Retain High-Quality Teachers?" Journal of Policy Analysis and Management 23:3 (2004), 251–271.

Cullen, Julie B., and Randall Reback, "Tinkering toward Accolades: School Gaming under a Performance Accountability System," in Timothy J. Gronberg and Dennis W. Jansen (Eds.), Advances in Applied Microeconomics 14 (2006), 1–34.

de Vise, Daniel, "Rockville School's Efforts Raise Questions of Test-Prep Ethics," Washington Post, March 4, 2007.

Easton, John Q., Macarena Correa, Stuart Luppescu, Hye-Sook Park, Steve Ponisciak, Todd Rosenkranz, and Sue Sporte, "How Do They Compare? ITBS and ISAT Reading and Mathematics in Chicago Public Schools, 1999 to 2002," Consortium on Chicago School Research data brief (February 2003).

Gillborn, David, and Deborah Youdell, Rationing Education: Policy, Practice, Reform and Equity (Philadelphia: Open University Press, 2000).

Grissmer, David, and Ann Flanagan, "Exploring Rapid Achievement Gains in North Carolina and Texas," National Education Goals Panel (November 1998).

Hanushek, Eric A., and Margaret E. Raymond, "Does School Accountability Lead to Improved Student Performance?" NBER working paper no. 10591 (2004).

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership and Job Design," Journal of Law, Economics and Organization 7 (1991), 24–52.

Jacob, Brian A., "A Closer Look at Achievement Gains under High-Stakes Testing in Chicago," in Paul E. Peterson and Martin R. West (Eds.), No Child Left Behind? The Politics and Practice of School Accountability (Washington, DC: Brookings Institution Press, 2003).

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Journal of Public Economics 89:5–6 (2005), 761–796.

Jacob, Brian A., and Steven Levitt, "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating," Quarterly Journal of Economics 118:3 (2003), 843–877.

Kane, Thomas J., and Douglas O. Staiger, "Volatility in School Test Scores: Implications for Test-Based Accountability Systems," Brookings Papers on Education Policy 5 (2002), 235–283.

Koretz, Daniel M., "Limitations in the Use of Achievement Tests as Measures of Educators' Productivity," Journal of Human Resources 37:4 (2002), 752–777.

Lazear, Edward P., "Speeding, Terrorism, and Teaching to the Test," Quarterly Journal of Economics 121:3 (2006), 1029–1061.

Lazear, Edward P., and Sherwin Rosen, "Rank-Order Tournaments as Optimum Labor Contracts," Journal of Political Economy 89:5 (1981), 841–864.

National Center for Education Statistics, "Mapping 2005 State Proficiency Standards onto the NAEP Scales," NCES report 2007-482 (2007).

Reardon, Sean, "Thirteen Ways of Looking at the Black-White Test Score Gap," Stanford University mimeograph (March 2007).

Reback, Randall, "Teaching to the Rating: School Accountability and the Distribution of Student Achievement," Journal of Public Economics 92:5–6 (2008), 1394–1415.

Roderick, Melissa, and Mimi Engel, "The Grasshopper and the Ant: Motivational Responses of Low-Achieving Students to High-Stakes Testing," Educational Evaluation and Policy Analysis 23:3 (2001), 197–227.

Springer, Matthew G., "The Influence of an NCLB Accountability Plan on the Distribution of Student Test Score Gains," Economics of Education Review 27:5 (2008), 556–563.

White, Katie Weits, and James E. Rosenbaum, "Inside the Black Box of Accountability: How High-Stakes Accountability Alters School Culture and the Classification and Treatment of Students and Teachers," in Alan R. Sadovnik, Jennifer A. O'Day, George W. Bohrnstedt, and Kathryn M. Borman (Eds.), No Child Left Behind and the Reduction of the Achievement Gap: Sociological Perspectives on Federal Education Policy (Florence, KY: Routledge, 2007).

Wick, John W., "Independent Assessment of the Technical Characteristics of the Illinois Standards Achievement Test (ISAT)" (Chicago: Illinois State Board of Education, 2003).

APPENDIX A

Data Construction

In our analyses of the effects of NCLB, we restrict our samples to students who were tested in fifth grade in 2002, the first year of NCLB, or in 2001. We further restrict the sample to students who were last tested in third grade exactly two years prior. Here, we discuss two alternative procedures.

First, we could have simply selected the first or last third-grade test available for each fifth-grade student in our 2001 and 2002 samples without restricting the interval between scores. We chose not to pursue this strategy because the ISAT was not given until 1999. Students tested in fifth grade in 2001 who entered third grade in 1998 and then repeated part or all of third or fourth grade do not have an ISAT score for their initial third-grade year, and depending on the details of their grade progression, they may not have a third-grade ISAT score at all. This is not true of similar students who entered third grade in 1999 and were tested as fifth graders in 2002. Thus, the sample of fifth graders tested in 2001 with valid third-grade scores contains fewer students who experienced retention in third or fourth grade than the comparable sample of fifth graders tested in 2002. By restricting the samples to students who were last tested in third grade exactly two years prior, we hold the grade-progression patterns in the treatment and control samples constant.

A second alternative procedure involves conditioning on a different progression pattern by restricting the samples to students who were tested two years prior during their first year in third grade. These samples would include only students with "normal" grade progression. We conducted analyses on these samples and found results quite similar to those in figures 1A and 1B.
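For concreteness, the baseline restriction can be sketched in a few lines of pandas. The file name and the column layout (one row per student, year, and grade, with columns student_id, year, grade, math, and read) are hypothetical stand-ins, not the authors' actual CPS extract:

```python
# Minimal sketch of the baseline sample restriction described above.
import pandas as pd

tests = pd.read_csv("cps_test_scores.csv")  # hypothetical file

def build_cohort(fifth_grade_year):
    """Fifth graders tested in fifth_grade_year whose most recent
    third-grade test was taken exactly two years earlier."""
    fifth = tests[(tests["grade"] == 5) & (tests["year"] == fifth_grade_year)]
    third = tests[tests["grade"] == 3]
    # Keep each student's LAST third-grade test (handles repeaters).
    last_third = (third.sort_values("year")
                       .groupby("student_id", as_index=False)
                       .last())
    both = fifth.merge(last_third, on="student_id", suffixes=("_5", "_3"))
    # The key restriction: exactly two years between the two tests.
    return both[both["year_5"] - both["year_3"] == 2]

treatment = build_cohort(2002)  # first NCLB cohort (3rd grade in 2000)
control = build_cohort(2001)    # last low-stakes cohort (3rd grade in 1999)
```

Under these assumptions, the second alternative amounts to keeping each student's first third-grade test instead (replacing .last() with .first()) while imposing the same two-year gap.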

In 1999, 27,205 students with valid scores took the ISAT in third grade for the final time. The comparable sample for 2000 contains 27,851 students; 20,060 of these 1999 third graders and 21,199 of these 2000 third graders appear in the ISAT fifth-grade test files for 2001 and 2002, respectively. Thus, the sample retention rate is slightly higher in the 2000–2002 sample (76.1% versus 73.7% for the 1999–2001 sample). One source of this difference in retention rates is that fewer student ID numbers match looking forward from the 1999 sample; this primarily reflects both fewer exits from CPS among the 2000 sample and fewer incorrectly coded student ID numbers in the relevant ISAT files. For all our analyses of ITBS scores in the 1990s, our retention rates for both the prereform and postreform cohorts are always between 80% and 82%. Because CPS administered these exams as part of its own accountability system, there were fewer problems with matching exams to correct student ID numbers. In the end, our ISAT analyses contain 18,305 and 19,651 students from the 2001 and 2002 samples, respectively, who have valid scores on both exams and were tested without accommodations in fifth grade. The rates of follow-up testing without accommodations are 0.673 for the 1999–2001 cohort and 0.706 for the 2000–2002 cohort.
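As a quick arithmetic check, the retention and follow-up rates quoted here follow directly from the stated counts:

```python
# Rates implied by the counts reported above (no new data).
matched = {"1999-2001": (20_060, 27_205), "2000-2002": (21_199, 27_851)}
for cohort, (m, n) in matched.items():
    print(cohort, round(m / n, 3))  # 0.737 and 0.761

final = {"1999-2001": (18_305, 27_205), "2000-2002": (19_651, 27_851)}
for cohort, (m, n) in final.items():
    print(cohort, round(m / n, 3))  # 0.673 and 0.706: valid scores on both
                                    # exams, tested without accommodations
```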

Rates of follow-up testing increase with baseline achievement for both cohorts. This gradient largely reflects our decision to exclude students who were tested with accommodations in fifth grade. Because our data do not record whether students were allowed to take the third-grade tests with accommodations or what types of accommodations fifth graders received, we are concerned that the introduction of NCLB could have affected the types of accommodations offered in ways that we cannot measure. For completeness, we did conduct similar analyses including the samples of accommodated fifth graders. This approach yields larger samples of students in both 2001 and 2002 whose third-grade scores signal low baseline achievement, and the estimated treatment effects associated with the bottom two deciles of baseline achievement are then uniformly below those in figures 1A and 1B but within the confidence intervals presented.

Panel A of table A1 describes the data used to construct figures 1A and 1B and also describes differences in baseline characteristics between the 2001 and 2002 samples. The predicted fifth-grade scores for 2002 are based on the third-grade scores for the 2002 cohort and the estimated coefficients from regressions of fifth-grade math and reading scores on polynomials in the third-grade math and reading scores for the 2001 cohort. Because the 2002 cohort has slightly lower overall third-grade scores, the average predicted scores in 2002 are below those for 2001 in both math and reading. Panels B and C of table A1 provide similar descriptions of the data used to construct figures 3A and 3B and figures 4A and 4B, respectively. Although panel A shows higher rates of follow-up testing in the bottom deciles of achievement in the post-NCLB cohort, panels B and C show lower rates of follow-up testing in these deciles for the postreform cohorts. Here, the overall predicted average scores among the 1998 cohorts are slightly higher than the corresponding average scores among the prereform cohorts.
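The construction of the predicted scores can be sketched as follows. Only the fit-on-2001, predict-for-2002 structure comes from the description above; the cubic order of the polynomials and the synthetic score vectors are illustrative assumptions:

```python
# Sketch of the predicted-score construction: fit fifth-grade scores on
# polynomials in both third-grade scores within the 2001 cohort, then
# apply the fitted coefficients to the 2002 cohort.
import numpy as np

rng = np.random.default_rng(0)
n01, n02 = 18_305, 19_651  # sample sizes from this appendix

# Synthetic stand-ins for the actual score vectors (illustrative only).
math3_01 = rng.normal(150, 10, n01)                      # 2001 cohort, 3rd grade
read3_01 = rng.normal(150, 10, n01)
math5_01 = 10 + 0.95 * math3_01 + rng.normal(0, 5, n01)  # 2001 cohort, 5th grade
math3_02 = rng.normal(149, 10, n02)                      # 2002 cohort, 3rd grade
read3_02 = rng.normal(149, 10, n02)

def poly_features(m3, r3, order=3):
    """Intercept plus polynomial terms in both third-grade scores."""
    cols = [np.ones_like(m3)]
    for k in range(1, order + 1):
        cols += [m3 ** k, r3 ** k]
    return np.column_stack(cols)

# Fit on the 2001 (last low-stakes) cohort ...
beta, *_ = np.linalg.lstsq(poly_features(math3_01, read3_01), math5_01,
                           rcond=None)
# ... then predict fifth-grade math scores for the 2002 cohort.
# Reading scores are handled the same way.
math5_02_pred = poly_features(math3_02, read3_02) @ beta
```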


TABLE A1.—TREATMENT AND CONTROL SCORES AND SAMPLE SIZES

A: 2002 vs. 2001 Samples, 5th-Grade ISAT
(deciles of the 3rd-grade score index, 1999 sample)

                           Follow-up            Average Math Score               Average Reading Score
         Sample Size      Testing Rate      2001      2002      2002         2001      2002      2002
Decile   2001     2002    2001     2002   (Actual)  (Actual)  (Predicted)  (Actual)  (Actual)  (Predicted)
1        1,833    2,447   0.470    0.558   140.8     140.3     140.8        139.6     139.7     139.9
2        1,845    2,540   0.608    0.655   144.7     144.4     144.6        144.0     144.1     144.0
3        1,825    2,287   0.644    0.699   147.0     148.1     146.9        146.4     147.1     146.5
4        1,825    1,783   0.690    0.697   149.7     151.1     149.5        149.1     150.3     149.2
5        1,826    1,745   0.686    0.715   152.6     153.4     152.5        151.6     152.6     151.8
6        1,828    1,691   0.725    0.748   154.9     156.9     154.8        154.0     155.5     154.1
7        1,838    1,718   0.717    0.762   158.6     160.1     158.5        157.1     158.7     157.4
8        1,825    1,736   0.758    0.778   162.3     163.9     162.0        160.6     162.1     161.0
9        1,840    1,810   0.770    0.806   168.0     168.9     167.7        165.3     166.1     165.5
10       1,820    1,894   0.809    0.817   178.7     179.5     178.7        174.2     174.4     174.4
Total   18,305   19,651   0.673    0.706   155.7     155.5     154.6        154.2     154.0     153.4

B: 1998 vs. 1996 Samples, 5th-Grade ITBS
(deciles of the 3rd-grade score index, 1994 sample)

                           Follow-up            Average Math Score               Average Reading Score
         Sample Size      Testing Rate      1996      1998      1998         1996      1998      1998
Decile   1996     1998    1996     1998   (Actual)  (Actual)  (Predicted)  (Actual)  (Actual)  (Predicted)
1        2,193    1,964   0.737    0.670    3.71      3.69      3.71         3.53      3.41      3.52
2        2,211    1,892   0.787    0.719    4.09      4.17      4.11         3.87      3.91      3.87
3        2,167    1,895   0.815    0.778    4.38      4.49      4.40         4.13      4.20      4.14
4        2,162    1,917   0.831    0.805    4.68      4.82      4.70         4.50      4.59      4.49
5        2,177    2,035   0.848    0.825    4.95      5.09      4.97         4.78      4.90      4.79
6        2,206    2,064   0.837    0.829    5.22      5.33      5.23         5.13      5.19      5.11
7        2,176    2,220   0.841    0.849    5.55      5.62      5.57         5.43      5.53      5.42
8        2,181    2,264   0.847    0.865    5.82      5.94      5.86         5.82      5.84      5.77
9        2,172    2,180   0.846    0.864    6.23      6.32      6.25         6.31      6.35      6.29
10       2,176    2,313   0.837    0.854    6.92      6.96      6.97         7.38      7.34      7.42
Total   21,821   20,744   0.821    0.804    5.15      5.30      5.24         5.09      5.20      5.16

C: 1998 vs. 1996 Samples, 6th-Grade ITBS
(deciles of the 4th-grade score index, 1994 sample)

                           Follow-up            Average Math Score               Average Reading Score
         Sample Size      Testing Rate      1996      1998      1998         1996      1998      1998
Decile   1996     1998    1996     1998   (Actual)  (Actual)  (Predicted)  (Actual)  (Actual)  (Predicted)
1        2,406    2,370   0.720    0.696    4.40      4.45      4.37         4.04      4.07      4.01
2        2,314    2,092   0.795    0.740    4.93      5.15      4.96         4.56      4.76      4.57
3        2,366    2,206   0.811    0.787    5.28      5.53      5.30         4.94      5.18      4.95
4        2,370    2,263   0.835    0.818    5.62      5.88      5.64         5.28      5.51      5.28
5        2,342    2,245   0.834    0.848    5.92      6.16      5.95         5.63      5.79      5.63
6        2,372    2,389   0.841    0.855    6.20      6.46      6.21         5.94      6.12      5.94
7        2,351    2,338   0.843    0.865    6.54      6.77      6.56         6.30      6.45      6.30
8        2,364    2,416   0.855    0.868    6.86      7.07      6.88         6.71      6.86      6.70
9        2,362    2,543   0.847    0.861    7.37      7.51      7.36         7.26      7.33      7.28
10       2,348    2,551   0.844    0.861    8.20      8.28      8.20         8.46      8.54      8.50
Total   23,595   23,413   0.820    0.817    6.13      6.37      6.19         5.91      6.12      5.97



Having noted these differences in testing rates and average predicted scores by cohort, we stress that in all three panels, the average scores of the prereform cohorts and the average predicted scores for the postreform cohorts match well within deciles. Further, as we note in table 1, we have conducted our analyses including extra controls for gender, race, and eligibility for free lunch, and our results are almost identical. In the 1990s, our postreform cohorts are slightly more prepared on average than the prereform cohorts, and the opposite is true in the NCLB years, but within our decile groups, our treatment and control groups match well in terms of academic preparation during the prereform periods.

Some may worry that because the first-decile follow-up testing rate in 2001 is 16% lower than the corresponding rate observed in 2002 (0.470 versus 0.558), our negative estimated achievement gains for this decile in 2002 could be driven by an increase in follow-up testing among low-performing students following NCLB. However, table A1 shows no evidence that this is the case with regard to observed characteristics. The expected scores for first-decile students in the 2002 cohort are the same as or slightly greater than those of first-decile students in 2001. Further, one would have to assume an incredible amount of selection on unmeasured traits within this first decile in order to alter the basic pattern of results in figures 1A and 1B. Even if one assumes that the average treatment effect among the additional 16% tested in 2002 is as low as $-2$, which is comparable in absolute value to the largest of our positive estimated treatment effects, the implied average treatment effects for the balance of the sample remain less than $-0.28$ in math and less than $0.13$ in reading.
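To make the bounding calculation explicit, one can treat the 2002 first decile as a mixture of students who would have been followed up in either year and the additional students tested only after NCLB. Writing $p \approx 0.16$ for the share of additional students, $\Delta$ for the estimated effect in the full 2002 first decile, and $\Delta_{\text{new}}$ for the assumed effect among the additional students, the observed effect is a weighted average:

$$\Delta = p\,\Delta_{\text{new}} + (1-p)\,\Delta_{\text{rest}} \quad\Longrightarrow\quad \Delta_{\text{rest}} = \frac{\Delta - p\,\Delta_{\text{new}}}{1-p}.$$

Setting $\Delta_{\text{new}} = -2$ and inserting the first-decile estimates from figures 1A and 1B delivers the $-0.28$ (math) and $0.13$ (reading) bounds quoted above. This mixture decomposition is a stylized reading of the testing-rate gap, not notation taken from the paper.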

APPENDIX B

Educational Triage

Here we show that if some students are receiving extra help, students who receive no extra help must be in the extremes of the ability distribution.

Recall the notation from section II. The school's problem is described by

$$\min_{\{e_i\}} \; \sum_{i=1}^{N} F(\bar{t} - \alpha_i - e_i) + \sum_{i=1}^{N} c(e_i) \quad \text{s.t.} \quad e_i \geq 0, \; i = 1, 2, \ldots, N, \tag{A1}$$

where $F(\cdot)$ and $c(\cdot)$ are increasing functions, $c(\cdot)$ is strictly convex, and the density $f(\cdot) = F'(\cdot)$ is unimodal.

Proposition. For any three ability levels $\alpha_H > \alpha_M > \alpha_L$, any effort plan $e^*$ that satisfies

$$e_H^* > 0, \qquad e_M^* = 0, \qquad e_L^* > 0$$

cannot be a solution to equation (A1).

Proof. Case 1: Assume that $\bar{t} - \alpha_L - e_L^* \leq \bar{t} - \alpha_M$, or $\alpha_L + e_L^* \geq \alpha_M$. Define an alternative plan $e$:

$$e_L = e_M^* + (\alpha_M - \alpha_L), \qquad e_M = e_L^* - (\alpha_M - \alpha_L), \qquad e_H = e_H^*.$$

This plan swaps the post-effort positions of students $L$ and $M$, so the penalties are the same for the $e$ and $e^*$ plans; but because $c(\cdot)$ is strictly convex, the total cost is higher for the $e^*$ plan.

Case 2: $f'(\bar{t} - \alpha_M) \geq 0$. Because $f(\cdot)$ is unimodal, this implies that $f(\bar{t} - \alpha_M) \geq f(\bar{t} - \alpha_H - e_H^*)$. Thus, there exists $\varepsilon > 0$ such that we can form an alternative plan $e$:

$$e_L = e_L^*, \qquad e_M = \varepsilon, \qquad e_H = e_H^* - \varepsilon.$$

The penalties of $e$ are weakly less than those of $e^*$, and the costs are lower.

Case 3: $\alpha_L + e_L^* < \alpha_M$ and $f'(\bar{t} - \alpha_M) < 0$. These conditions imply that $f(\bar{t} - \alpha_M) > f(\bar{t} - \alpha_L - e_L^*)$. Thus, there exists $\varepsilon > 0$ such that we can construct an alternative effort plan $e$:

$$e_L = e_L^* - \varepsilon, \qquad e_M = \varepsilon, \qquad e_H = e_H^*.$$

Both the penalties and the costs are lower for the alternative plan $e$.
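The proposition is easy to illustrate numerically. The sketch below solves a small version of problem (A1) under convenience assumptions that are not taken from the paper: standard normal test noise, so that $F = \Phi$ has a unimodal density, and quadratic effort cost $c(e) = e^2$:

```python
# Numerical illustration of problem (A1): choose e_i >= 0 to minimize
# total failure probability plus effort cost. F is the standard normal
# cdf and c(e) = e^2 -- convenience assumptions, not from the paper.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

t_bar = 0.0                         # proficiency threshold
alpha = np.linspace(-3.0, 3.0, 13)  # ability levels alpha_i

def objective(e):
    return norm.cdf(t_bar - alpha - e).sum() + np.sum(e ** 2)

res = minimize(objective, x0=np.zeros_like(alpha), method="L-BFGS-B",
               bounds=[(0.0, None)] * alpha.size)

for a, e in zip(alpha, res.x):
    print(f"alpha = {a:+.2f}   effort = {e:.3f}")
```

With these inputs, optimal effort is largest for students whose ability sits near the threshold and shrinks to (numerically) zero in both tails: the students left without extra help are at the extremes of the ability distribution, which is the triage pattern the proposition formalizes.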
