Auditing for Score Inflation Using Self-Monitoring Assessments:

Findings from Three Pilot Studies

A working paper of the Education Accountability Project at the Harvard Graduate School of Education

http://projects.iq.harvard.edu/eap

Daniel Koretz1 Jennifer L. Jennings2

Hui Leng Ng1 Carol Yu1

David Braslow1 Meredith Langi1

1 Harvard Graduate School of Education

2 New York University

November 25, 2014

Acknowledgements: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305AII0420, and by the Spencer Foundation, through Grants 201100075 and 201200071, to the President and Fellows of Harvard College. The authors also thank the New York State Education Department for providing the data used in this study. The opinions expressed are those of the authors and do not represent views of the Institute, the U.S. Department of Education, the Spencer Foundation, or the New York State Education Department or its staff.


Abstract

A substantial body of research has shown that test-based accountability programs

often produce score inflation. Most studies have evaluated inflation by comparing trends

on a high-stakes test and a lower-stakes audit test, such as the National Assessment of

Educational Progress. However, Koretz & Beguin (2010) noted the weaknesses of using

external tests for auditing and suggested instead using self-monitoring assessments

(SMAs), which incorporate into high-stakes tests audit items that are sufficiently novel

that they are not susceptible to test preparation aimed at more predictable items. This

paper reports the results of the first three trials of the SMA approach. The studies varied

in design, but all were conducted with the New York State mathematics tests in grades 4,

7, and 8 in 2011 and 2012. Despite a severe conservative bias created by a number of

aspects of the study designs, we found that the audit component functioned as expected in

many of the trials. Specifically, the difference in performance between nonaudit and audit

items was associated with factors that earlier research showed to be related to test

preparation and score inflation, such as “bubble-student” status (scoring just below the

Proficient cut in the previous year) and poverty. However, a number of trials yielded null

findings. These findings underscore the need for additional research investigating the

optimal characteristics of items used for auditing gains.


A substantial body of research has shown that test-based accountability programs

lead to test preparation that focuses attention on the specifics of the test used for

accountability rather than the broader domain of content and skills that the test is

intended to represent (e.g., Hamilton et al., 2007; Stecher, 2002). One consequence of

these behaviors is score inflation, that is, score increases substantially larger than the

improvements in learning that they represent. Studies have shown that the resulting bias

can be very large (e.g., Jacob, 2007; Klein, Hamilton, McCaffrey, & Stecher, 2000;

Koretz & Barron, 1998), and can lead to substantially different inferences about students’

academic progress.

Most studies of score inflation have followed a single design. To be meaningful,

increases in scores must generalize to the domain from which the test samples. If

performance gains do generalize to the domain, they should show a reasonable degree of

generalization to other tests that sample from the same domain and are intended to

support similar inferences. Accordingly, most studies examine the consistency of gains

between a high-stakes test and a lower-stakes audit test, most often the National

Assessment of Educational Progress (NAEP).

Koretz & Beguin (2010) noted numerous disadvantages of this approach. A

suitable audit test may be unavailable or, like NAEP, may be available only for a few

grades or only at high levels of aggregation. The substantive appropriateness of available

audit tests may be arguable, as they may have been designed to support somewhat

different inferences. Motivational differences may bias comparisons between the audit

and high-stakes tests, although this risk is mitigated somewhat when trends are compared.


As a way to avoid these limitations, Koretz & Beguin (2010) suggested self-

monitoring assessments (SMAs). SMAs incorporate audit components into the high-

stakes test itself, using differences in performance between these audit components and

routine operational items as a measure of score inflation. In principle, SMAs could

eliminate many of the limitations of using a separate test for auditing.

This study presents the results of the first three trials of the SMA principle, all

conducted in the context of New York State’s mathematics testing program in grades 4,

7, and 8. The first of these was conducted in the context of a statewide field test, while

the others were conducted in operational forms of the state’s high-stakes tests. These

three trials provide very conservative tests of the SMA design, for several reasons. First,

we faced a severe lack of statistical power caused by small samples of students in

the first study and small samples of audit items in all three studies. Second, there is no

prior research evaluating different audit-item designs that could be used to guide the

efforts reported here. Finally, we lacked empirical evidence about the test-preparation

strategies teachers used to address the operational items for which we wrote audit items.

Nonetheless, the studies together demonstrate that SMAs are a viable approach for

detecting score inflation in the context of large-scale testing programs.

Background

The Problem of Score Inflation

The risk of score inflation has been noted in the measurement literature for more

than half a century. For example, Lindquist (1951), writing in an era of low-stakes

testing, noted that


Because of the nature and potency of the rewards and penalties associated

in actual practice with high and low achievement test scores of students,

the behavior measured by a widely used test tends in itself to become the

real objective of instruction, to the neglect of the (different) behavior with

which the ultimate objective is concerned (p. 153).

Madaus (e.g., 1988) and Shepard (e.g., 1988) both warned that high-stakes testing was

likely to create test-specific coaching that would in turn produce score inflation.

The first empirical study of score inflation (Koretz, Linn, Dunbar, and Shepard,

1991) examined a system that, while high-stakes by the standards of the day, was very

low-stakes by today’s standards, entailing no concrete sanctions or rewards. Koretz et al.

examined changes in scores when the district substituted one commercially produced

achievement test for another, and they readministered the older test four years after its

final high-stakes administration. They found inflation in mathematics of half an academic

year by the end of third grade. Since that time, studies in a variety of different contexts

have confirmed that score inflation is common (e.g., Hambleton et al., 1995; Haney,

2000; Ho, 2009; Ho & Haertel, 2006; Jacob, 2005, 2007; Klein, Hamilton, McCaffrey, &

Stecher, 2000; Koretz & Barron, 1998; Koretz, Linn, Dunbar & Shepard, 1991). For

example, Klein, Hamilton, McCaffrey, & Stecher (2000) found that gains on the high-

stakes Texas TAAS test were roughly two to six times the size of the state’s gains on

NAEP. Jacob (2007) found that gains on several state high-stakes mathematics tests were

roughly double those on NAEP. Koretz & Barron (1998) found that gains on Kentucky’s

high-stakes mathematics tests exceeded gains in NAEP roughly by a factor of four, and


Hambleton et al. (1995) found that gains of roughly three-fourths of a standard deviation

on the state’s fourth-grade reading test were accompanied by no gain whatsoever on

NAEP.

A number of studies have examined educators’ responses to testing and have

found behaviors that may be linked to score inflation. Among the reported responses that

may be related to inflation are narrowing of instruction to focus on tested content,

adapting instruction to the format of test items, focusing instruction on incidental aspects

of tests, and cheating (e.g., Stecher, 2002). For example, in a comprehensive study of

responses to test-based accountability in three states, Hamilton et al. (2007) found that 55

to 78 percent of teachers reported “emphasizing assessment styles and formats of

problems,” and roughly half reported spending more time teaching test taking strategies

(p. 103). Such practices have been widely documented (e.g., Abrams, Pedulla, & Madaus,

2003; Luna & Turner, 2001; Pedulla et al., 2003; Stecher & Barron, 2001; Stecher et al.,

2008).

Koretz & Hamilton (2006) distinguished among seven different types of test

preparation, including those that are likely to produce meaningful gains in achievement

and those that are likely to produce score inflation. Educators may respond to the

pressures to raise scores by allocating more time to instruction, finding more effective

instructional strategies, or simply working harder. All of these, within limits, may

produce meaningful gains in scores, that is, gains that reflect commensurate increases in

student learning. At the other extreme, educators may cheat, as in the recent large-scale

scandal in the Atlanta public schools (Severson, 2011), which can only produce inflated


scores. Similarly, they may manipulate the population of test-takers, which will not bias

the scores of individual students but will inflate aggregate scores (e.g., Figlio & Getzler,

2006).

More relevant here are two categories of responses that Koretz & Hamilton

(2006) labeled reallocation and coaching. Reallocation entails better aligning

instructional resources, such as time, to the content of the specific test used for

accountability. Coaching refers to focusing instruction on minor, usually incidental

characteristics of the test, including unimportant details of content and the types of

presentations used in the items in the specific test. For example, item writers typically use

Pythagorean triples in items assessing the Pythagorean Theorem because students are

unable to compute square roots by hand. One common form of coaching in response to

this is telling students to memorize the two Pythagorean triples that most often appear in

test items, 3:4:5 ($3^2 + 4^2 = 5^2$) and 5:12:13 (e.g., Rubinstein, 2000). This allows

students to answer the item correctly without actually learning the theorem or being able

to apply it in real life.

Reallocation can be beneficial, but it can also generate score inflation. For

example, if a test reveals that a school’s students are weak in proportional reasoning, one

would want educators to bolster instruction in that area, and these responses might

include increasing the allocation of time to it. This reallocation, if effective, will improve

the mathematics achievement the test score is intended to proxy. However, if reallocation

entails shifting resources away from content that is de-emphasized or omitted from the

specific test but is nonetheless important to the inference based on scores, the result will


be score inflation. Numerous studies have found that many teachers report decreasing

their emphasis on elements of the curriculum that are de-emphasized by the test (Pedulla

et al., 2003; Stecher, 2002) and widely available test-preparation materials help educators

do so (Haney, 2000; Stecher, 2002). Similarly, while there are cases in which coaching

might induce meaningful gains—for example, if students are confronted with a format

that is so novel that their performance without familiarization would be biased

downward—it will generally bias scores upwards.

Variations in Test Preparation and Score Inflation

Although most studies of score inflation and test preparation have examined only

trends in the student population as a whole, a growing body of research has documented

variations in both across types of students and schools. We make use of these variations

in our identification strategy, described below.

Research has shown that teachers in more disadvantaged schools, i.e., those serving disproportionately poor or non-Asian minority students, are likely

to focus more than others on test preparation. These schools often place a stronger emphasis on drilling test-style items and on teaching test-taking strategies

(Cimbricz, 2002; Diamond & Spillane, 2004; Eisner, 2001; Firestone, Camilli, Yurecko,

Monfils, & Mayrowetz, 2000; Herman & Golan, 1993; Jacob, Stone, & Roderick, 2004;

Jones, Jones, & Hargrove, 2003; Ladd & Zelli, 2002; Lipman, 2002; Luna & Turner,

2001; McNeil, 2000; McNeil & Valenzuela, 2001; Taylor et al., 2002; Urdan & Paris,

1994). These findings are not surprising, as low-performing schools face the most


pressure to raise scores, face more severe obstacles to improving performance, and often

have less experienced teachers and principals.

Research on variations in score inflation is less abundant but is for the most part

consistent with the research on behavioral responses in showing greater inflation in

disadvantaged schools. Several reports show that the gains made by low-income and black and Hispanic students relative to white students on state tests are not matched

on audit tests (Klein et al., 2000; Jacob, 2007; Ho & Haertel, 2006). Shen (2008) further

elaborated on the link between pedagogical practices and score inflation by examining

differences in the performance of items that were more or less “teachable.” She showed

that schools had greater improvements over time in performance on more teachable

items, and that this trend was more pronounced in disadvantaged schools. Similarly,

Jennings, Bearak, and Koretz (2011) have demonstrated that reductions in the racial

achievement gap in New York State mathematics tests were driven by black and Hispanic students' improvement on items testing predictably assessed standards. These

limited studies suggest that the variations in test preparation noted above tend to produce

the greatest score inflation for the most vulnerable students.

Another documented pattern is the reallocation of instructional resources toward

students whose anticipated scores are just under the proficiency cut score (“bubble

students”) and away from students whose anticipated scores are comfortably above or

below proficient, a practice described as educational triage (Booher-Jennings, 2005;

Gillborn & Youdell, 2000; Neal & Schanzenbach, 2010; Stecher et al., 2008).

Accountability systems that only provide rewards or sanctions based on the percentage of


students who score above the proficiency cut score incentivize teachers and schools with

limited resources to focus those resources in ways that seem more likely to result in the

greatest increase in proficiency rates. If students far beneath the cut score would require

more resources than are available to get above the cut score, and if students comfortably

above the cut score require fewer resources to demonstrate proficiency, then teachers

may conclude that focusing their pedagogical effort on the bubble students will yield a larger reward. This phenomenon is particularly important in the present context because

it creates particularly clear incentives to focus on test preparation for students near the cut

score, and a number of quantitative studies have established that “bubble students”

appear to make more progress on high-stakes tests when proficiency-based accountability

systems are in place (Jennings & Sohn, 2014; Neal & Schanzenbach, 2010; Reback,

2008). As we will later discuss when we detail our methods, these students thus prove to

be an analytically useful group for identifying test-specific coaching.

Predictable Sampling and the Design of Self-Monitoring Assessments

Successive operational test forms typically show predictable patterns. These

patterns can include the amount of emphasis given to different content (including

predictable omissions of content) and the ways in which content is presented. These

predictable patterns in turn provide the opportunity to narrow test preparation to focus on

the particulars of the test and thereby to generate score inflation. Test preparation

materials often draw users’ attention to these predictable patterns and provide strategies

for performing well on predictable items.


Holcombe, Jennings, & Koretz (2013) provided a framework for classifying the

various types of predictable patterns that can enable inflation. They suggest that the

sampling needed to create a test can be seen as comprising four distinct levels. First, one

decides which elements from the domain will be represented in the standards or

curriculum. The material sampled by the standards may exclude or deemphasize some

content that is important for stakeholders’ targets of inference. Second, the test authors

must decide which of the standards will be tested, and among those that will be tested,

how frequently they will be tested and how much emphasis they will be given in a typical

form. Third, when standards are reasonably broad, the authors then must decide how to

sample from the range of content and skills each standard implies. All of these stages of

sampling represent a narrowing of the substantive range of the test.

The final stage of narrowing entails the sampling of non-substantive elements,

that is, elements that are not relevant to the intended inference. Much of this sampling

may be inadvertent. Some small details of content may be of this sort—for example,

using only regular polygons in geometry items, or using only positive slopes in

quadrant 1 in items about slopes, when the inference does not call for this narrowing.

However, many of the predictable non-substantive elements in a test can be seen as

aspects of presentation rather than content, e.g., item format, the particular graphics used

with mathematics items, and so on.

Audit items in an SMA should assess material that is relevant to the inference

without replicating the predictable patterns that create opportunities for test preparation.

For example, if all operational items assessing knowledge of the Pythagorean Theorem


make use of common Pythagorean triples, a suitable audit item might be a calculator item

with a non-integer solution. Students whose test preparation focused on Pythagorean

triples in lieu of appropriate instruction about the theorem would find the audit item much

more difficult than would students receiving appropriate instruction on the Pythagorean

Theorem.
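As a purely hypothetical illustration (not one of the items used in these studies), an audit item built around legs of 5 and 7 has hypotenuse

$$c = \sqrt{5^2 + 7^2} = \sqrt{74} \approx 8.6,$$

so memorized triples provide no shortcut, while a student who understands the theorem can answer it with a calculator.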

Opportunities for SMA Trials in New York State

The New York State Education Department (NYSED) offered us two

opportunities to pilot-test an SMA, and changes in the state’s operational program

provided a third trial. To our knowledge, these are the first trials of the SMA approach.

Study 1 was implemented in the context of a stand-alone field test administered

statewide in June 2011 for purposes of linking and pilot testing new items. We

administered mathematics audit items in grades 4, 7, and 8. Six of our audit items and six

operational or anchor items were included in 12-item forms that were allocated randomly

to samples of students. A total of 6,301 students were sampled, each of whom was

administered a single audit-test booklet.

Study 2 stemmed from changes made to New York’s operational testing program. In

2009, NYSED acknowledged that gains on some of the state’s high-stakes mathematics

tests might be inflated (Tisch, 2009). One of the steps NYSED took to lessen the risk of

continued score inflation was to broaden the state’s tests, partly to make them less

predictable. The tests administered in the spring of 2011 were longer and incorporated

items assessing standards that had not previously been tested. This allowed us to use the


items assessing previously untested standards as an audit component. We did this in the

same three grades as in Study 1.

Study 2 offered several advantages compared with Study 1. Administering the

audit items in the context of the operational high-stakes assessment eliminated the risk of

motivational biases. Because the audit items were in the operational forms, the sample

sizes were very large, approaching 200,000. Finally, the distributions of raw scores on the

non-audit component of the Study 1 test forms were badly right censored, as is common

with high-stakes tests. Because of changes made to the operational test in 2011, the

distributions of raw scores on the non-audit components were less right censored in

Study 2.

However, Study 2 also had two important weaknesses compared with Study 1.

First, the audit is limited to one of the four categories of audit items described below, all

of which were administered in the other trials. Second, none of the items in Study 2 were

specifically selected to serve as audit items, and they may be poorly suited for that

purpose. For example, they may share attributes with other items that allow the effects of

coaching to generalize to them. To the extent that this is true, Study 2 would produce

conservative estimates of score inflation.

In Study 3, conducted in the spring of 2012, we embedded planned audit items in

all operational mathematics test forms in grades 4, 7, and 8. The items used in Study 3

overlapped substantially with those used in Study 1 and included all four categories of

audit items described below. The strengths of this study were administering planned

items in the context of an operational assessment and the large sample of students.


However, this study was weakened by NYSED’s administration procedures. NYSED

matrix-samples at the school level. The state’s schools are divided into four

representative groups, and every student in every school in any one of the four groups

receives a single test form. We were permitted to place only four audit items in any form,

so no school received more than four audit items. Therefore, despite the large samples of

students who received any one form, the very small number of items per school results in

problems of limited statistical power.

Data

We created our analytic samples using the same approach in all three studies. From

the original datasets described above, we dropped schools with fewer than 3 students1.

We then dropped students missing either an audit or non-audit test score. Lastly, we

excluded a small number of students (less than 1%) with apparently anomalous scores.2

These decisions left us with approximately 94% of the original sample for Study 1, 89% for Study 2, and 90% for Study 3. These analytic samples differed little from our original data. Additionally, the samples across the three studies are very similar in demographic composition (see Table 1).

1 The very low threshold of n>=3 was necessitated by the design of the sample we were given in Study 1.

Using a higher cutoff would have eliminated too many schools. However, the use of empirical Bayes

estimates should minimize the impact of choice of threshold. We used data from Study 2, which was based

on nearly complete census data, to evaluate the effect of replacing this cutoff with n>=10. This had no

appreciable effects on the results obtained.

2 We also dropped a very small number of students with mismatched form booklets. In addition, we

dropped P.S. 184 Shuang Wen School, a public school in New York City with an immersion program in

Mandarin Chinese. A large percent of the school’s students were Asian, and an extreme value relative to

the rest of the schools in our sample inflated coefficients for the school proportion-Asian variable.
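As an illustration only, the sample-construction steps described above could be implemented roughly as follows; the pandas DataFrame and its column names (school_id, student_id, p_audit, p_nonaudit, anomalous_flag) are hypothetical, not the study's actual data structures.

```python
import pandas as pd

def build_analytic_sample(df: pd.DataFrame, min_school_n: int = 3) -> pd.DataFrame:
    """Apply the sample restrictions described above (hypothetical column names)."""
    # 1. Drop schools with fewer than `min_school_n` students.
    school_n = df.groupby("school_id")["student_id"].transform("count")
    df = df[school_n >= min_school_n]
    # 2. Drop students missing either the audit or the non-audit score.
    df = df.dropna(subset=["p_audit", "p_nonaudit"])
    # 3. Exclude the small number of students flagged for anomalous scores
    #    or mismatched form booklets (see footnote 2).
    return df[~df["anomalous_flag"]]
```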


Methods

Item Writing and Construction of Outcome Measures

Our selection and writing of audit items were guided by the framework by

Holcombe et al. (2013) described above, but we found that we had to modify it

somewhat.

In principle, one could design audit items for each of the stages of narrowing

described by Holcombe et al. (2013): from the domain to the standards, from the total set

of standards to the subset of standards tested and emphasized, from the description of

standards to the selected content and skills sampled from within each, and from the range

of possible representations and other non-substantive elements to those actually used. However, although

unimportant content details and aspects of presentation are conceptually distinct, we

could not differentiate reliably between them. For example, presenting slopes only as

positive in quadrant 1 could be seen either as a detail of content or as a matter of

presentation. This difficulty was exacerbated by the narrow wording of some of the New

York standards, which constrained the variety of items that could be written. (E.g., one

Grade 8 standard in geometry, 8G1, read “Identify pairs of vertical angles as congruent”).

Therefore, we modified the framework by combining these two stages.

This left us with four types of audit items. Not-in-standards (NIS) items

represented content omitted from the state’s standards but often included in the definition


of the domain, e.g., in the NAEP frameworks. Untested-standards (US) items assess

previously untested state standards (i.e., narrowing in terms of the selection of specific

standards for inclusion in the state test). Because US items measure content explicitly

included in the intended inference, they should test for inflation from undesirable

reallocation. The third type of item was designed to counter wording of standards that we

considered to be narrowing content unduly. These broadened-at-the-standards-level (BS)

items very modestly broadened the range of the standard. For example, in the case of 8G1

above, the broader content of interest is the concept of congruency of angles, not

necessarily restricted to pairs of vertical angles. The final category was designed to

respond to narrowing that arose in the writing of the actual operational items to represent

a given standard (e.g., repeated use of the same representation or specific content over

successive years). These broadened-at-the-item level (BI) items broadened the within-

standard diversity of test items while adhering to the specific wording of the standards.

BI items should be sensitive to test preparation focused on specific substantive or non-

substantive characteristics of past items.

In Studies 1 and 3, we were permitted to administer only multiple-choice audit

items, and the number of audit items was constrained. In Study 2, the number of audit

items was determined by design decisions by the state’s contractor. All but one of the

Study 2 audit items were multiple-choice.

The large majority of our audit items for Studies 1 and 3 were publicly available,

previously administered items from large-scale assessments, including the NAEP, the

Trends in International Mathematics and Science Study (TIMSS), and state assessments


from Massachusetts and Pennsylvania. It was often necessary to modify these items,

either for content or to match the four-distractor format of the New York tests. In

addition, we and colleagues at Cito, a testing firm in the Netherlands, wrote a small

number of items. In Study 2, the audit items were all untested-standards (US)

items included in the operational 2011 forms by the state’s testing contractor, without

guidance from us.

Studies 1 and 3. We included items of all four types in Studies 1 and 3. Counts

are shown in Table 2. Because we wanted to re-test the audit design in the context of an

operational assessment, many of the audit items administered in Study 1 were reused in

Study 3. All items were reused at the seventh grade, and 14 of 16 were reused at the

fourth grade. However, we reused only 9 of 16 audit items in eighth grade because we

found that several failed to function well based on results from Study 1. For Study 3, we replaced these items with new audit items, selected from the same three public sources or created based on the same principles as for Study 1.

For Study 1, our audit items were embedded in 12-item forms, with 6 audit and 6

non-audit items per form. We selected the non-audit items from past operational and

anchor items in the 2011 Field Test. We matched each of the audit items that were

broadened at the standards or item level (BI and BS items) with a non-audit item that

closely matched it in content but that differed on one or more predictable characteristics

potentially associated with coaching. This matching was intended to guard against

differences in difficulty arising from intentional differences in content rather than from

test preparation. Such matching was not possible by definition for untested-standards


(US) and not-in-standards (NIS) items. Nonetheless, we were able to pair each of the US

and NIS audit items with a non-audit item testing the same content area defined by the

learning standard or strand (e.g., Algebra) in the New York standards.

The strategy we followed in selecting and writing items was intended to minimize

the risk of confounding, and therefore, differences between BI / BS audit items and

nonaudit items were minor. The cost of this attempt to create minimal contrasts is the risk

of introducing too little novelty, which could lead to spurious null findings or

underestimates of the inflation effect.

In contrast, in Study 3, the entire operational form, minus our audit items, was

used as the non-audit component. This was done to lessen the problem of low reliabilities

of difference scores created by 6-item forms, but it precluded the close matching we

carried out in Study 1. We did post-hoc matching of items in Study 3, as described below.

Study 2. The audit items in Study 2 were all items that assessed previously

untested standards (US items). These were added as part of the state’s efforts to broaden

their tests and were selected without input from us. We used the remaining items in the

operational state test as the non-audit items. In seventh grade, we excluded only one item

that NYSED dropped because of a negative correlation with test scores. In each of fourth and

eighth grades, we dropped three items that assessed previously untested standards but that

were extremely easy (p-values ranging from .79 to .96), as it is unlikely that extremely easy

items can serve effectively as audit items.


The dependent variable in all three studies was based on the simple difference

between the raw proportion correct on the non-audit portion of the test and that on the

audit portion:

(1) $Y_{is} = p_{is}^{nonaudit} - p_{is}^{audit}$

where i indexes individuals and s indexes schools. We standardized this difference to

mean 0, standard deviation 1.
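A minimal sketch of this outcome construction, assuming scored (0/1) item responses in a pandas DataFrame; the column names and item lists are hypothetical, not taken from the actual test files.

```python
import pandas as pd

def standardized_difference(responses: pd.DataFrame,
                            nonaudit_items: list[str],
                            audit_items: list[str]) -> pd.Series:
    """Equation (1): nonaudit minus audit proportion correct,
    standardized to mean 0 and standard deviation 1."""
    p_nonaudit = responses[nonaudit_items].mean(axis=1)  # proportion correct, non-audit items
    p_audit = responses[audit_items].mean(axis=1)        # proportion correct, audit items
    diff = p_nonaudit - p_audit
    return (diff - diff.mean()) / diff.std()

# Hypothetical usage, one row per student with 0/1 scored responses:
# y = standardized_difference(responses, ["op1", "op2", "op3"], ["aud1", "aud2", "aud3"])
```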

Predictors. Our predictor variables included a variety of student-level

characteristics, school averages of these characteristics, an indicator of students’

proximity to the proficiency cut score (termed “bubble status”), and dummy variables for

regions of the state.

Demographics. We included student-level dummy variables for African-

American, Hispanic and Asian students, leaving white and other-race students as our

omitted comparison group. 3 We also included an indicator of low-income status, which NYSED defines as whether or not a student participates in the free and reduced-price lunch program or another economic assistance program. 4 School-level demographic variables were computed as the school means of these student-level variables.

3 In most cases, parents reported race. When parents do not report race, districts are responsible for assigning classifications.

4 For detailed information about the criteria for the low-income variable, see University of the State of New York (2011), p. 44.

Bubble status. We also included an indicator of bubble status. We defined these as students whose scores on the previous year's state test were as much as three raw score points below the cut score between Level 2 and Level 3 ("Proficient"). We used this as a proxy for educators' anticipated scores for the students in the target year. The proportions of bubble students were then computed for schools. A sensitivity analysis comparing alternative definitions of this variable is discussed below.
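For illustration, the bubble-status indicator and its school-level aggregate could be computed as below; the variable names and prior-year score columns are hypothetical, and the three-point bandwidth follows the definition above.

```python
import pandas as pd

def bubble_status(prior_raw_score: pd.Series,
                  proficient_cut: pd.Series,
                  bandwidth: int = 3) -> pd.Series:
    """Flag students scoring up to `bandwidth` raw-score points below the
    prior-year Level 2 / Level 3 ("Proficient") cut score."""
    gap = proficient_cut - prior_raw_score
    return ((gap > 0) & (gap <= bandwidth)).astype(int)

# Hypothetical usage, including the school-level proportion of bubble students:
# students["bubble"] = bubble_status(students["raw_prev"], students["cut_prev"])
# school_bubble = students.groupby("school_id")["bubble"].mean()
```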

Region. We created dummy variables to separate districts into three categories,

based on differences in district size, urbanicity, and demographics: New York City; other

urban (Buffalo, Rochester, Syracuse, and Yonkers, the four largest districts after New

York City); and all other districts (the omitted category). We found that the dummy

variable for other urban had little effect, so we dropped it from analysis, combining these

four districts with the omitted category.

Analytical Approach

Across the three studies, three grades, and multiple forms (three or four per grade in Study 1, depending on the grade; a single form per grade in Study 2; and four per grade in Study 3), we had a total of 25 audit forms. We conducted most analyses at the form level because of the lack

of an effective way to link across forms.

Because our data did not permit us to place the audit and non-audit components

on a single scale, we could not use simple differences in difficulties of the two

components as an indicator of potential inflation (Koretz & Beguin, 2010). Instead, we

used a difference-in-differences approach to detect potential score inflation. That is, we

examined the extent to which the difference in difficulty between the audit and non-audit

components is predicted by variables shown previously to be related to variations in test

preparation and score inflation: poverty, race/ethnicity, and especially bubble status.


The audit test score, $p_{is}^{audit}$ in equation (1), is ideally an unbiased estimator of the student's uninflated achievement, $\theta_{is}$:

(2) $p_{is}^{audit} = \theta_{is} + \epsilon_{is}$, with $E(\epsilon_{is}) = 0$.

We assume it is unbiased because the items are novel with respect to the specific content or presentation that facilitates inappropriate test preparation and score inflation. To the extent that this assumption is violated, the difference score is downwardly biased as an indicator of inflation.

However, $p_{is}^{nonaudit}$ in equation (1) is potentially biased by inflation, $\zeta_{is}$, which is expected to vary both within and between schools. In addition, $p_{is}^{nonaudit}$ may differ from $p_{is}^{audit}$ by a difficulty factor, $\delta$, that is unrelated to inflation:

(3) $p_{is}^{nonaudit} = \theta_{is} + \delta + \zeta_{is} + \nu_{is}$.

Therefore one can re-express the audit measure in equation (1) as:

(4) $Y_{is} = \delta + \zeta_{is} + (\nu_{is} - \epsilon_{is})$.

Note that the non-inflationary difference in difficulty, $\delta$, is shown as a constant,

not varying across schools or students within schools. Student-level variations in

difficulty unrelated to inflation contribute to measurement error. School-level variations

in difficulty pose a more complex issue. In the general case, school performance may

differ between parts of a test for systematic reasons unrelated to inflation. For example,

when another test that reflects a somewhat different framework is used as an audit, there

is a risk that schools will vary in the emphasis they give to certain aspects of the audit test

because of curricular differences that are independent of their responses to the high-

stakes test. In this case, however, all of the content of both subtests is explicitly included


in the target of inference, and therefore in the content that schools are expected to teach.

Therefore, if some schools show strong performance on the tested standards that does not

generalize to the untested standards included in the inference, we can consider that score

inflation. Accordingly, we can treat school-level differences in performance between the

two subtests as comprising only inflation and school-level error and not non-inflationary

differences in difficulty, although we cannot determine whether the inflation is a result of

deliberate responses to testing. Our analytical models incorporate estimates of school-

level error.

The outcome shown in equation (4) poses two difficulties for our analysis. First,

the student-level error in the difference score, $(\nu_{is} - \epsilon_{is})$, will be large because of the short length of the audit test. Below we show that the estimated reliability of $(p_{is}^{nonaudit} - p_{is}^{audit})$, based on the internal consistency reliabilities of the two subtests, is extremely

low. This creates a severe conservative bias—that is, a large risk of a Type II error. This

problem is ameliorated in our estimates of school-level relationships, but nonetheless,

this lack of statistical power remains an important limitation of the data. In response, we

screened out the trials with the lowest reliability of the differences, as explained below.

An even more important limitation is that $\delta$ cannot be estimated from our data. To estimate $\delta$, we would need uninflated estimates of the difficulties of both subtests, but we do not have such an estimate for the non-audit subtest, which was administered only under conditions vulnerable to inflation. This is what precludes our placing the two components on a single scale. Therefore, $\delta$ and $\zeta$ are confounded, and the simple difference $(p_{is}^{nonaudit} - p_{is}^{audit})$ cannot be interpreted as an indicator of inflation.


Our use of a difference-in-differences approach removes the effects of $\delta$. Specifically, we investigated whether variations in $(p_{is}^{nonaudit} - p_{is}^{audit})$ are systematically

associated with student- and school-level variables that have been shown in prior studies

to be associated with either score inflation or inflation-inducing instructional behaviors:

race/ethnicity, economic disadvantage, and bubble status. Because we expect that

inflation-inducing behaviors vary across schools, we examined these relationships both

within and between schools.

Analysis of Item and Test Statistics

To estimate the internal consistency reliability of the audit-item sets and to flag

potentially problematic items, we calculated item-rest correlations, omitted-item alphas,

and Cronbach’s alpha for all item sets, by form. In addition, we calculated classical

estimates of the student-level reliability of the raw score difference (nonaudit - audit)

separately for each form. Because we knew that the short audit forms would lead to

unreliable difference scores, we used these estimates to screen the 25 separate trials

(across studies, grades, and forms) for adequate reliability, excluding numerous trials because of insufficient reliability.
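A minimal sketch of these classical computations, assuming 0/1 item scores in pandas objects; it is illustrative only, not the analysis code used in the studies.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of scored items (one column per item)."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def difference_score_reliability(x: pd.Series, y: pd.Series,
                                 rxx: float, ryy: float) -> float:
    """Classical reliability of the difference x - y given each subscore's
    reliability; negative sample estimates are set to zero (see footnote 5)."""
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cov_xy = x.cov(y)
    rel = (rxx * vx + ryy * vy - 2 * cov_xy) / (vx + vy - 2 * cov_xy)
    return max(rel, 0.0)
```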

Regression Models

Our primary model was a two-level random-intercepts model:

(5) $Y_{is} = \beta_{0s} + \mathbf{X}_{is}\boldsymbol{\beta}_1 + \beta_2 N_s + \epsilon_{is}$

$\beta_{0s} = \gamma_0 + \mathbf{Z}_s\boldsymbol{\gamma}_1 + u_s$

where X is a vector of student-level variables, Z is the corresponding vector of school-

level means, and N is the NYC dummy. This model appropriately adjusts standard errors


for clustering and also accommodates our hypothesis that variations in inflation-inducing

behaviors arise between schools.

We grand-mean centered the student-level predictors. This yields parameter

estimates for level-2 variables that are direct estimates of context effects (Raudenbush &

Bryk, 2002). That is, the parameter estimates for level-2 variables indicate the extent to

which the school means differ by more than the level-1 model would predict. We also

group-mean centered level-2 variables for ease of interpretation.
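For illustration, a model of this form could be fit with statsmodels' MixedLM as sketched below; the variable names, and the assumption that predictors have already been centered as described above, are hypothetical choices rather than the authors' actual code.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_random_intercepts(df: pd.DataFrame):
    """Two-level random-intercepts model of the standardized difference score.
    `df` is assumed to hold grand-mean-centered student-level predictors,
    centered school means (sch_*), an NYC dummy, and a school identifier."""
    formula = ("y ~ bubble + black + hispanic + asian + low_income + nyc"
               " + sch_bubble + sch_black + sch_hispanic + sch_asian + sch_low_income")
    model = smf.mixedlm(formula, data=df, groups=df["school_id"])
    result = model.fit()
    # result.random_effects gives empirical Bayes (BLUP) school intercepts.
    return result
```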

Post hoc Analysis of Matched Items

We analyzed matched audit and non-audit items in the three studies for two

different purposes.

In Studies 1 and 3, we analyzed matched items to help assess whether the audit

items functioned as intended. We expect audit items to be more difficult than matched

non-audit items because of the addition of specific novel content or representations.

While such raw differences in difficulty cannot prove whether an audit worked, they are

indicative of whether the audit items successfully assessed content in ways that were

sufficiently novel relative to operational items.
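A minimal sketch of this matched-item comparison, assuming 0/1 scored responses and a hypothetical list of (audit item, matched non-audit item) column pairs:

```python
import pandas as pd

def matched_pvalue_differences(responses: pd.DataFrame,
                               pairs: list[tuple[str, str]]) -> pd.DataFrame:
    """Compare classical p-values (proportions correct) for matched pairs;
    a positive difference means the audit item is harder, as expected."""
    rows = []
    for audit_item, nonaudit_item in pairs:
        p_audit = responses[audit_item].mean()
        p_nonaudit = responses[nonaudit_item].mean()
        rows.append({"audit_item": audit_item,
                     "nonaudit_item": nonaudit_item,
                     "p_audit": p_audit,
                     "p_nonaudit": p_nonaudit,
                     "difference": p_nonaudit - p_audit})
    return pd.DataFrame(rows)
```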

This analysis was straightforward in Study 1, for which most forms were

constructed using matched item pairs. For Study 3, we identified operational items that

assessed the same content standards as our audit items for consideration as possible

matches. Since some content standards encompass many possible mathematics topics,

two authors independently examined and evaluated each operational item to determine


whether it was an acceptable match based on the content it assessed. There were no

disagreements, and 17 of 22 items (77%) were retained for matching.

In Study 2, matching for this purpose was precluded by the use of untested-

standards audit items, but we carried out another form of matching to evaluate whether

the untested items were sufficiently different from other operational items to warrant

using them as audits. We constructed categories of closely related content standards and

looked for non-audit items that assessed standards from the same categories as those for

the audit items. Using a similar review process as for the other studies, two authors

independently decided whether any of the matches were close enough to be potentially

problematic. We found no problematic matches.

Results

Descriptive Statistics and Test Statistics

The distributions of the difference scores in our outcome variable are

approximately normal, although the distributions of the individual p-values of the audit

and non-audit portions are in some instances skewed. The skewness statistics for the

distributions of difference scores range from .002 to .203. Figure 1 shows the

distributions of the outcome in grades 4 and 8 in Study 2; grade 4 is the most skewed and

grade 8 is one of the least skewed.

The internal consistency reliability of the audit item sets varied markedly, from

0.28 to 0.72, with a mean of 0.52 (see Table 3). While item-test and item-rest correlations

are difficult to interpret when forms are very short, we saw little evidence of problematic

items. Rather, the low reliabilities reflect the very short lengths of the audit components.


Applying the Spearman-Brown prophecy formula to data pooled across forms within each grade suggests that if the audit forms were lengthened to only 10 items each, the reliabilities would fall within a range from 0.46 to 0.76, with most above 0.6.
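As a worked illustration (assuming a six-item audit form with the mean reliability of 0.52 reported above), lengthening to 10 items corresponds to a lengthening factor of $k = 10/6$, and the Spearman-Brown projection is

$$\rho_{10} = \frac{k\rho_{6}}{1 + (k - 1)\rho_{6}} = \frac{(10/6)(0.52)}{1 + (10/6 - 1)(0.52)} \approx 0.64,$$

which is consistent with the pooled-form projections reported here.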

The estimated reliabilities of the difference scores we used as measures of

inflation are very low (see Table 3). This is attributable to the low reliability of the audit

scores and the strong correlations between the non-audit and audit scores (ranging from

0.57 to 0.81). As a result, the estimated reliabilities of our outcomes range from

effectively 0 to 0.18.5 We are particularly interested in school-level relationships, and our

use of empirical Bayes estimates will lessen this problem of low reliability for school

effects. Nonetheless, this low reliability will introduce a strong conservative bias to the

coefficients in our regressions.

To simplify the presentation of our results, we report in detail only the results of analysis of forms that had a student-level reliability of the difference score of at least 0.05. We assume that in most cases, analysis of forms that did not reach this threshold is too noisy to interpret with confidence, and indeed, these models had few statistically significant findings. One consistent pattern in the excluded models is noted in the text. Regression results for all excluded forms are available online (Appendix D). [Note: Appendix D is included in this draft, but we expect that space constraints will require moving it online.]

5 Conceptually, reliability cannot be negative. In practice, when reliability is very low, one can obtain negative estimates from error. In our case, we could obtain negative estimates for an additional reason. Using the classical model, the estimated reliability of a difference score will be negative whenever $r_{xx}\sigma_x^2 + r_{yy}\sigma_y^2 < 2 r_{xy}\sigma_x\sigma_y$, where x and y are the two tests that are differenced. This inequality is more likely to hold when the test with a larger variance has a considerably lower reliability—precisely what our data produce. Following convention, we set all negative estimates to zero.


Matched-Item Analysis

In the fourth grade, both matched-item and all-item comparisons showed the

audit items to be more difficult than non-audit items, with p-value differences ranging

from 0.12 to 0.31 (Table 3). In contrast, most of the seventh-grade forms showed very

small differences in difficulty, suggesting that our intended audit items did not function

as such. Most seventh-grade forms that showed very small differences in p-values also

showed reliabilities of the difference <.05, so we did not use the difference in difficulty

as a reason to exclude any additional forms. The eighth-grade forms varied in terms of

p-value differences, but in this grade as well, we saw no reason to exclude forms on this

basis if they exceeded our threshold for reliability.

Regression Results

To place the following results in context, recall that the problem of low statistical

power was most severe in Study 1 and least severe in Study 2. In addition, given that our

screening on reliability and simple differences suggests that our efforts to construct audit

forms were most successful in the fourth grade, we consider the fourth-grade regression

results to be the most important.

The variable which we consider most directly related to the pressures that cause

inflation, bubble status, showed a substantial relationship to the outcome at both the


student and school levels. At the student level, the bubble-status dummy was strongly

related to the nonaudit-audit difference in all seven of our included fourth-grade models

(Table 4). Across the models, the nonaudit-audit difference was larger for bubble students

by 0.16 to 0.38 standard deviations. Moreover, this relationship was statistically

significant at p < .001 in six of the seven models. The proportion of bubble students was

also substantially related to the school mean difference in Studies 2 and 3, and

significantly so in three of four models. However, this aggregate relationship was

inconsistent in sign and not significant in Study 1, which had the most severe problems of

statistical power.

Fourth-grade results for the low-income variables were weaker and were only

moderately consistent with expectations. In Study 2, which had the smallest problem of

statistical power, both the student-level poverty variable and the school proportion of poor students were positively and significantly related to the nonaudit-audit difference,

although these estimated relationships were considerably weaker than those of the

bubble-status variables. For example, the effect of the student-level poverty dummy in

Study 2 was 0.075 standard deviations, roughly one-third the size of the coefficient for

the bubble-status variable. In Study 3, results from two forms were consistent with

expectations; the exception was Form C, which is the form that also failed to show a

significant relationship with the school proportion of bubble students. The low-income

variables did not predict the outcome significantly in Study 1.

Results for the race/ethnicity variables were similar to those for low income.

Studies 2 and 3 showed expected relationships for the student-level Hispanic and Black


dummy variables in seven of eight models. However, Study 1 again did not yield

consistent findings for these variables.

Only two seventh-grade forms survived screening for extremely low reliability of

the difference scores: Study 2 and one form from Study 1. As in grade 4, the student-level

bubble-status dummy was substantially related to the outcome; the average bubble-

nonbubble difference was approximately 0.25 standard deviations (Table 5). The

association with the proportion of bubble students was very large and highly significant

in Study 2 but failed to reach statistical significance in Study 1. In Study 2, all of the

other student- and school-level variables conformed to expectations: poor students, black

students, and Hispanic students all showed larger nonaudit-audit differences, and the

proportions of all three types of students predicted larger school mean differences on the

outcome. As in the fourth grade, these relationships were all smaller than the coefficients

of the bubble-status variables. In Study 1, however, while the school-level coefficients

were in the expected direction, no coefficient other than the student-level bubble-status

dummy was statistically significant.

The results from grade eight were less encouraging. Four forms survived

screening, two each from Study 1 and Study 3. Study 2, which generally provided more statistical power, yielded difference scores in this grade with a reliability of only 0.03, so

it was excluded. The results from the two Study 3 forms were only partially consistent

with expectations. The coefficients of three student-level demographic variables, black,

Hispanic, and low-income, were positive and statistically significant in five of six cases,

although the estimates were not large, ranging from .04 to .09 standard deviations (Table

5). The estimates for the aggregates of these variables were inconsistent. Most striking,

the estimate for the bubble-status dummy was positive and significant in only one of the

two forms, and the coefficients for the proportion of bubble students were inconsistent in

sign. The two Study 1 forms yielded no clear patterns.

While we do not report the results for the 12 forms screened out, we noted one

pattern consistent with our expectations. Despite the very low reliability of the outcome

in the excluded forms, the coefficients for the bubble-status dummy, the proportion of

bubble students, or both were positive and significant in 9 of the 12 excluded forms (see

online Appendix D).

Sensitivity Analyses

We conducted sensitivity tests to evaluate the robustness of our results to three

possible changes: (i) the bandwidth used to define bubble status; (ii) allowing effects to differ between New York City and the rest of the state; and (iii) a different standardization

procedure. Our findings are robust to the first two choices, but they are sensitive to the

method used for standardization.

Bubble status. The definition of bubble status should include a small number of

students who are close enough to reaching proficiency that teachers have an incentive to

provide them with extra help, but the operational definition is arbitrary. The definition

used in our primary results, three points below proficient in the previous year, classified 6% to 15% of students as bubble students across grades and studies. We replicated our regressions

using two alternative definitions: two points below proficient, and from two points below

to two points above proficient. These alternative definitions resulted in only modest

differences in the proportions of students included and in estimated coefficients and did

not produce qualitatively different findings (Appendix A).
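To make these operational definitions concrete, the following is a minimal sketch in Python of how the three bubble-status indicators could be coded. The column names (prior_score for the previous year's scale score, prior_cut for that year's proficiency cut score) and the exact handling of boundary scores are illustrative assumptions, not the variable definitions used in the New York State data files.

import pandas as pd

def add_bubble_flags(df: pd.DataFrame) -> pd.DataFrame:
    # Distance of the prior-year score from the prior-year proficiency cut.
    gap = df["prior_score"] - df["prior_cut"]
    # Primary definition: within three points below the proficiency cut.
    df["bubble_minus3"] = ((gap >= -3) & (gap < 0)).astype(int)
    # Alternative 1: within two points below the cut.
    df["bubble_minus2"] = ((gap >= -2) & (gap < 0)).astype(int)
    # Alternative 2: from two points below to two points above the cut.
    df["bubble_pm2"] = ((gap >= -2) & (gap <= 2)).astype(int)
    return df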

New York City. New York City, which enrolls substantially over 30 percent of

the students in New York State, might have distinct patterns of score inflation because it

differs in its demographic makeup and has a unique school accountability system.

However, adding interactions with a New York City dummy variable had appreciable effects only in Study 2, where the main effect for the New York City dummy was negative (suggesting less overall inflation) while the coefficients for black students and

for schools with many black students were positive and larger than elsewhere in the state

(Appendix B). These findings were not replicated in the other studies, however, perhaps in part because of problems of statistical power. Therefore, our primary results include neither the New York City dummy nor its interactions.

Standardization. There are two defensible but substantively different ways to

standardize the subtest scores used to generate our outcome variable, and these two

methods produce different results (Appendix C). Our outcome variable was constructed

by differencing the proportion of possible points on the audit and nonaudit portions of the

test and then standardizing these difference scores. An alternative would be to

standardize the two components separately and then difference the standardized scores.
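A minimal sketch of the two constructions, in Python, may make the contrast concrete. The array names and the placeholder data are hypothetical stand-ins for each student's proportion of possible points on the nonaudit and audit portions; they are not drawn from the study data.

import numpy as np

rng = np.random.default_rng(0)
nonaudit_prop = rng.uniform(0.3, 1.0, size=1000)  # placeholder data for illustration
audit_prop = rng.uniform(0.2, 0.9, size=1000)     # placeholder data for illustration

def standardize(x):
    return (x - x.mean()) / x.std()

# Approach used in this paper: difference the proportions, then standardize the difference.
outcome_difference_then_standardize = standardize(nonaudit_prop - audit_prop)

# Alternative: standardize each component first, then take the difference.
# This removes any difference in spread between the audit and nonaudit components.
outcome_standardize_then_difference = standardize(nonaudit_prop) - standardize(audit_prop)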

This decision has an impact because of differences between the distributions of

the audit and non-audit scores. The standard deviations of the audit scores are often larger

than those of the nonaudit scores. For example, in grade 7 in Study 2, the standard

deviation of non-audit scores was 0.20, while that of the audit scores was 0.27.

The choice of whether to standardize at the component level hinges on one’s

interpretation of these differences in distributions. If one believes that they are simply an

artifact of scaling, then it would be reasonable to standardize at the component level and

thus remove the artifact. On the other hand, if one believes that the differences in

distributions reflect the phenomenon under investigation, then standardizing at the

component level would be inappropriate because it would remove the subject of study.

Our interpretation is that the variance of the non-audit component is reduced by the

phenomenon under investigation, score inflation, as the rapid gains in scores on high-

stakes tests are often accompanied by severe right censoring (Ho & Yu, 2013). Thus, we

considered the smaller standard deviation of the non-audit scores to be substantively meaningful and decided before comparing results that standardizing at the component

level would not be appropriate.

Discussion

This paper presents results from the first three trials of self-monitoring

assessments (SMAs), proposed by Koretz & Beguin (2010) as a means of monitoring

score inflation by embedding audit items in operational tests. These trials were not

designed to evaluate the magnitude of score inflation. Rather, they were designed as an

initial exploration of the feasibility of the SMA approach.

The findings presented here support the feasibility of the SMA approach. Indeed,

because of the severe problems of statistical power in these studies, our findings are

likely to substantially understate the potential of SMAs. This is an important finding,

given that score inflation is widespread and often very large and that auditing with an

independent test is expensive, time-consuming, and subject to several important

weaknesses (Koretz & Beguin, 2010). However, there were null results as well that we

believe have important implications for further research and development.

As Koretz & Beguin (2010) explained, unless one designs and scales audit items

at the inception of a testing program, one cannot place the audit and nonaudit items on a

single scale to obtain direct estimates of inflation. Instead, we were limited to a

difference-in-differences approach, in which we examined the extent to which variables that create pressure to raise scores, and that earlier research has shown to be associated with test preparation and inflation, predict the nonaudit-audit difference in our data.
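One way to write the student-level model this approach implies is sketched below. The notation is ours, and the treatment of the school-level term $u_j$ as a random school effect is an assumption rather than an equation quoted from the paper; it is meant only to show the structure suggested by the student- and school-level predictors in Tables 4 and 5. Here $d_{ij}$ is the standardized nonaudit-audit difference for student $i$ in school $j$, the barred terms are school-level proportions, and $\varepsilon_{ij}$ is a student-level residual.

$$
d_{ij} = \beta_0 + \beta_1\,\mathrm{Bubble}_{ij} + \beta_2\,\mathrm{Black}_{ij} + \beta_3\,\mathrm{Hispanic}_{ij} + \beta_4\,\mathrm{LowInc}_{ij} + \gamma_1\,\overline{\mathrm{Bubble}}_{j} + \gamma_2\,\overline{\mathrm{Black}}_{j} + \gamma_3\,\overline{\mathrm{Hispanic}}_{j} + \gamma_4\,\overline{\mathrm{LowInc}}_{j} + u_j + \varepsilon_{ij}
$$

Under the score-inflation hypothesis described above, the coefficients on these pressure-related predictors are expected to be positive.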

The findings presented above should be interpreted in the context of the large and

unavoidable risk of Type II errors. In Study 1, we confronted low statistical power

because of the small samples of students administered each form in the field test. More

important, in all three studies, our data had very low statistical power because of limits on

the number of audit items we could administer—in the most extreme case, Study 3, only

four audit items per school. This created a severe lack of statistical power despite the

large samples of students in Studies 2 and 3. The impact was apparent in the very low

student-level reliability of the nonaudit-audit difference scores—so low that in the case of

12 forms, the results seemed to be largely noise, could not be interpreted with confidence,

and were excluded from detailed reporting.
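The classical test theory expression for the reliability of a difference score makes the problem concrete; it is a standard result rather than a formula taken from this paper. With $X$ the nonaudit score, $Y$ the audit score, $\rho_{XX'}$ and $\rho_{YY'}$ their reliabilities, $\sigma_X$ and $\sigma_Y$ their standard deviations, and $\rho_{XY}$ their correlation,

$$
\rho_{DD'} = \frac{\sigma_X^2\,\rho_{XX'} + \sigma_Y^2\,\rho_{YY'} - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y}{\sigma_X^2 + \sigma_Y^2 - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y}.
$$

When the audit component contains only a few items, $\rho_{YY'}$ is small while the two scores remain substantially correlated, so the numerator, and with it the reliability of the difference, is driven toward zero even in very large student samples.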

The strongest support for the SMA approach is found in the fourth-grade results.

All but one of the fourth-grade forms survived screening, and most of the results in the

fourth-grade models conformed to our expectations. Particularly important, in our opinion, is the finding that the bubble-status variable was a consistent and strong predictor of the nonaudit-audit difference. Of the several predictors we used, we

consider the bubble-status variable the one that is most directly related to the pressure on

teachers to raise scores. Findings with respect to demographic predictors were also

positive, although the magnitudes of these effects were generally considerably smaller than

those of the bubble-status variables. The findings from grades seven and eight,

particularly the latter, included markedly more null findings but provided some support

for our expectations.

While some of the variation in results is likely the result of noise, we suspect that

there is a more important factor: variations in the design of our audit items. The trials

presented here were the first attempts to implement SMAs, so we had no empirical

evidence from prior studies guiding the development of our audit items. In the case of items broadened at the item or standards level, we based our designs solely on

examination of operational test items. We arrayed all items previously administered in

chronological order and examined them for predictable patterns of specific content and

representations. We assumed that coaching would focus on the most obvious of these

regularities, and we designed our audit items to undo that specific regularity. We

attempted to make the differences between audit and non-audit items as minor as possible

apart from this single change. This was intentional, to make the matched items close to a

minimal contrast.

We suspect that some of our audit items did not function as expected for one of

two reasons: our speculation about the nature of coaching for the operational items was

incorrect, or the changes we made were too small. However, our data are not sufficient to

permit us to distinguish these possible factors from the effects of noise.

Study 2 differed from the others in relying solely on audit items that tested

previously untested standards (US items). These are appropriate for auditing because

their content is an explicit component of the target of inference. Indeed, this is why New

York State chose to add the US items, which its contractor designed as operational

items to be scored, not for audit purposes. We note that in grades four and seven, the

results of Study 2 were the most consistent with our expectations, while in grade eight,

the Study 2 form was screened out because of extremely low reliability of the outcome.

This further underscores the current lack of clarity about the design of audit items: in

some cases, US items work unusually well, while in others, they do not function

successfully as audit items.

The conventional call for additional research is particularly appropriate here.

First, given that these studies were the first of their kind, were conducted in a single setting using a single operational testing program, and entailed substantial data

limitations, the need for replications is clear. Second, there is as yet very little research

guiding the development of audit items. We suggest two specific avenues for future

research on the design of audit items. First, it would be valuable to attempt to link the

design of items to systematic data on the characteristics of preparation for the operational

test in question, rather than relying solely on the characteristics of previously

administered items. This information might be gathered by means of surveys,

observational or video studies, or examination of test-preparation materials. Second,

additional trials of the SMA approach could be structured to provide comparisons among

different item designs—for example, to compare the functioning of several different audit

items for a single operational item.

Finally, the problem of statistical power that we confronted in these studies is

itself an important focus for further investigation. Our most serious problems of power

arose from the sparse sampling of items, not the sampling of students. Space for audit

items will be limited in large-scale testing programs, particularly those that provide

operational test scores for individual students, because of limits on total test length. As

the field learns how to construct more effective audit items, the power of nonaudit-audit

comparisons will increase, but the number of audit items is likely to remain an issue

nonetheless. This suggests a need for research on the size of the audit component required for effective use in operational assessments with various test designs.
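One rough way to think about this sizing question, under classical assumptions and treating the audit items as roughly parallel, is the Spearman-Brown projection of the audit subscore's reliability. This is a standard psychometric identity, not an analysis from these studies; $\rho_1$ is the reliability contributed by a single audit item and $k$ is the number of audit items.

$$
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}
$$

For example, with an illustrative per-item value of $\rho_1 = 0.05$, a four-item audit component yields a subscore reliability of about $4(0.05)/(1 + 3(0.05)) \approx 0.17$, and the reliability of a nonaudit-audit difference built on such a short component will typically be far lower, consistent with the screening results in Table 3.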

References

Abrams, L. M., Pedulla, J. J., & Madaus, G. F. (2003). Views from the classroom:

Teachers’ opinions of statewide testing programs. Theory into Practice, 42(1), 18-

29.

Booher-Jennings, J. (2005). Below the bubble: "Educational Triage" and the Texas

Accountability System. American Educational Research Journal, 42(2), 231-268.

Cimbricz, S. (2002). State-mandated testing and teachers’ beliefs and practice. Education

Policy Analysis Archives, 10(2). Retrieved from

http://epaa.asu.edu/epaa/v10n2.html

Diamond, J. B., & Spillane, J. P. (2004). High-stakes accountability in urban elementary

schools: Challenging or reproducing inequality? Teachers College Record, 106(6),

1145-1176.

Eisner, E. (2001). What does it mean to say a school is doing well? Phi Delta Kappan,

82(5), 367-372.

Figlio, D., & Getzler, L. (2006). Accountability, ability and disability: Gaming the

system? In T. Gronberg & D. Jansen (Eds.), Advances in Applied Microeconomics

(Vol. 14, pp. 35–49). Elsevier.

Firestone, W. A., Camilli, G., Yurecko, M., Monfils, L., & Mayrowetz, D. (2000). State

standards, socio-fiscal context and opportunity to learn in New Jersey. Education

Policy Analysis Archives, 8(35). Retrieved from http://olam.ed.asu.edu/epaa/v8n35.

Gillborn, D., & Youdell, D. (2000). Rationing education: Policy, practice, reform, and equity. Buckingham, UK: Open University Press.

Hambleton, R. K., Jaeger, R. M., Koretz, D., Linn, R. L., Millman, J., & Phillips, S. E.

(1995). Review of the measurement quality of the Kentucky Instructional Results

Information System, 1991–1994. Frankfort: Office of Education Accountability,

Kentucky General Assembly, June.

Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J.,

Naftel, S., & Barney, H. (2007). Standards-based accountability under No Child

Left Behind: Experiences of teachers and administrators in three states. Santa

Monica, CA: RAND Corporation.

Haney, W. (2000). The myth of the Texas miracle in education. Education Policy

Analysis Archives, 8(41). Retrieved from

http://epaa.asu.edu/ojs/article/view/432/828

Herman, J. L., & Golan, S. (1993). The effects of standardized testing on teaching and

schools. Educational Measurement: Issues and Practice, 12(4), 20-25, 41-42.

Ho, A. D. (2009). A nonparametric framework for comparing trends and gaps across

tests. Journal of Educational and Behavioral Statistics, 34(2), 201-228.

Ho, A. D., & Haertel, E. H. (2006). Metric-Free Measures of Test Score Trends and Gaps

with Policy-Relevant Examples (CSE Report 665). Los Angeles, CA: National Center

for Research on Evaluation, Standards, and Student Testing (CRESST), Center for

the Study of Evaluation, University of California, Los Angeles.

Ho, A., & Yu, C. (2013). Descriptive statistics for modern test score distributions:

Skewness, kurtosis, discreteness, and ceiling effects. Unpublished working paper.

Holcombe, R., Jennings, J. L., & Koretz, D. (2013). The roots of score inflation: An

examination of opportunities in two states’ tests. In G. Sunderman (Ed.), Charting

Reform, Achieving Equity in a Diverse Nation, 163-189. Greenwich, CT:

Information Age Publishing

Jacob, B. A. (2005). Accountability, incentives and behavior: The impact of high-stakes

testing in the Chicago public schools. Journal of Public Economics, 89(5-6), 761-

796. doi: 10.1016/j.jpubeco.2004.08.004

Jacob, B. (2007). Test-based accountability and student achievement: An investigation of

differential performance on NAEP and state assessments. Cambridge, MA:

National Bureau of Economic Research (Working Paper 12817).

Jacob, R. T., Stone, S., & Roderick, M. (2004). Ending social promotion: The response of

teachers and students. Chicago, IL: Consortium on Chicago School Research.

Retrieved March 29, 2011, from http://www.eric.ed.gov/PDFS/ED483823.pdf

Jennings, J. L., Bearak, J. M., & Koretz, D. M. (2011). Accountability and racial inequality in American education. Paper presented at the annual meetings of the American Sociological Association.

Jones, M. G., Jones, B. D., & Hargrove, T. Y. (2003). The unintended consequences of

high-stakes testing. Lanham, MD: Rowman & Littlefield Publishers, Inc.

Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test

scores in Texas tell us? Santa Monica, CA: RAND (Issue Paper IP-202). Last

accessed from http://www.rand.org/publications/IP/IP202/ on June 4, 2013.

Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS). MR-1014-EDU, Santa Monica: RAND.

Koretz, D., & Béguin, A. (2010). Self-Monitoring Assessments for Educational

Accountability Systems. Measurement: Interdisciplinary Research & Perspective,

8, 92–109. doi:10.1080/15366367.2010.508685

Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L.

Brennan (Ed.), Educational measurement (4th ed.), 531-578. Westport, CT:

American Council on Education/Praeger.

Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high-

stakes testing: Preliminary evidence about generalization across tests, in R. L.

Linn (chair), The Effects of High Stakes Testing, symposium presented at the

annual meetings of the American Educational Research Association and the

National Council on Measurement in Education, Chicago, April.

Ladd, H. F., & Zelli, A. (2002). School-based accountability in North Carolina: The

responses of school principals. Educational Administration Quarterly, 38(4), 494-

529. doi: 10.1177/001316102237670

Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F.

Lindquist (Ed.), Educational Measurement (pp. 119-184). Washington, DC:

American Council on Education.

Lipman, P. (2002). Making the global city, making inequality: The political economy and

cultural politics of Chicago school policy. American Educational Research

Journal, 39(2), 379-419.

Luna, C., & Turner, C. L. (2001). The impact of the MCAS: Teachers talk about high-

stakes testing. The English Journal, 91(1), 79-87.

Madaus, G. F. (1988). The distortion of teaching and testing: High-stakes testing and

instruction. Peabody Journal of Education, 65(3), 29-46.

McNeil, L. M. (2000). Contradictions of school reform: Educational costs of

standardized testing. New York, NY: Routledge.

McNeil, L. M. & Valenzuela, A. (2001). The harmful impact of the TAAS system of

testing in Texas: Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield

(Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in

public education (pp. 127-150). New York, NY: Century Foundation.

Neal, D., & Schanzenbach, D. (2010). Left behind by design: Proficiency counts and

test-based accountability. Review of Economics & Statistics, 92(2), 263–283.

New York City Department of Education (2007). Mayor Bloomberg and Chancellor

Klein release first-ever public school Progress Reports [Press release]. Retrieved

from http://schools.nyc.gov/Offices/mediarelations/NewsandSpeeches/2007-

2008/20071105_progress_reports.htm.

Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J.

(2003). Perceived effects of state-mandated testing programs on teaching and

learning: Findings from a national survey of teachers. Boston, MA: National Board

on Educational Testing and Public Policy. Retrieved from

http://www.bc.edu/research/nbetpp/statements/nbr2.pdf

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and

data analysis methods, second edition. Thousand Oaks, CA: Sage.

Rubinstein, J. (2000). Cracking the MCAS grade 10 math. New York: Princeton Review

Publishing.

Severson, K. (2011). A scandal of cheating, and a fall from grace. The New York Times,

September 7, p. A16. Last retrieved on June 5, 2013 from

http://www.nytimes.com/2011/09/08/us/08hall.html?pagewanted=all&_r=0.

Shen, X. (2008). Do unintended effects of high-stakes testing hit disadvantaged schools

harder? (Doctoral dissertation, Stanford University).

Shepard, L. A. (1988). The harm of measurement-driven instruction. Paper presented at

the annual meeting of the American Educational Research Association,

Washington, DC.

Stecher, B. (2002). Consequences of large-scale, high-stakes testing on school and

classroom practice. In L. Hamilton, et al., Test-based Accountability: A Guide for

Practitioners and Policymakers. Santa Monica: RAND. Retrieved from

http://www.rand.org/publications/MR/MR1554/MR1554.ch4.pdf

Stecher, B. M., & Barron, S. I. (2001). Unintended consequences of test-based

accountability when testing in “milepost” grades. Educational Assessment, 7(4),

259-281.

Stecher, B. M., Epstein, S., Hamilton, L. S., Marsh, J. A., Robyn, A., McCombs, J. S.,

Russell, J., & Naftel, S. (2008). Pain and gain: Implementing NCLB in three states,

2004 – 2006. Santa Monica, CA: RAND. Retrieved from

http://www.rand.org/pubs/monographs/2008/RAND_MG784.pdf

Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2002). A survey of teachers’

perspectives on high-stakes testing in Colorado: What gets taught, what gets lost

(CSE Technical Report 588). Los Angeles, CA: University of California. Retrieved

September 20, 2010, from http://eric.ed.gov/PDFS/ED475139.pdf

Tisch, M. (2009). What the NAEP results mean for New York. Latham, N.Y.: New York

School Boards Association (November 9). Last retrieved on June 24, 2012 from

http://www.nyssba.org/index.php?src=news&refno=1110&category=On%20Boar

d%20Online%20November%209%202009.

University of the State of New York (2011). New York State Student Information

Repository System (SIRS) manual. Albany, N. Y.: Author. Last accessed June 24,

2013 from page 244 of http://www.p12.nysed.gov/irs/sirs/archive/2010-

11SIRSManual6-2.pdf.

Urdan, T. C., & Paris, S. G. (1994). Teachers’ perceptions of standardized achievement

tests. Educational Policy, 8(2), 137-157.

Tables

Table 1

Descriptive statistics for analytic samples, by grade

                               Grade 4                        Grade 7                        Grade 8
                               Study 1  Study 2  Study 3      Study 1  Study 2  Study 3      Study 1  Study 2  Study 3
# of Students                  6,301    185,164  193,538      7,662    185,615  195,273      8,420    186,645  195,941
# of Schools                   852      2,342    2,433        573      1,356    1,426        517      1,332    1,504
Student-Level
  % White                      0.52     0.51     0.48         0.55     0.52     0.50         0.56     0.53     0.51
  % Asian                      0.09     0.08     0.18         0.08     0.08     0.18         0.09     0.08     0.08
  % Black                      0.17     0.18     0.23         0.17     0.18     0.22         0.16     0.18     0.18
  % Hispanic                   0.21     0.22     0.08         0.19     0.21     0.08         0.19     0.20     0.22
  % Other                      0.01     0.01     0.02         0.01     0.01     0.01         0.01     0.01     0.01
  % Low Income                 0.51     0.55     0.56         0.48     0.51     0.53         0.46     0.50     0.52
  % Bubble Student             0.14     0.15     0.14         0.08     0.09     0.08         0.08     0.09     0.07
School-Level
  Proportion White             0.52     0.53     0.51         0.52     0.50     0.48         0.51     0.50     0.49
  Proportion Asian             0.08     0.06     0.07         0.06     0.05     0.05         0.07     0.05     0.05
  Proportion Black             0.18     0.19     0.20         0.20     0.22     0.23         0.21     0.22     0.23
  Proportion Hispanic          0.21     0.20     0.21         0.21     0.22     0.22         0.21     0.21     0.22
  Proportion Other             0.01     0.01     0.02         0.01     0.01     0.01         0.01     0.01     0.01
  Proportion Low Income        0.51     0.55     0.55         0.56     0.57     0.58         0.54     0.56     0.56
  Proportion Bubble Student    0.14     0.15     0.14         0.09     0.10     0.07         0.08     0.10     0.07

Table 2

Number and type of audit item for the three studies, by grade

                     Broadened at   Broadened at      Untested    Not in
                     item level     standards level   standards   standards   Total
Study 1   Grade 4         6              6                0           6         18
          Grade 7         8              3                1           6         18
          Grade 8         7              4                4           9         24
Study 2   Grade 4         0              0               13           0         13
          Grade 7         0              0               15           0         15
          Grade 8         0              0               13           0         13
Study 3   Grade 4         4              5                1           6         16
          Grade 7         6              3                1           6         16
          Grade 8         4              1                2           8         15*

*Note: One audit item was incorrectly transcribed in production and could not be used.

Table 3

Screening forms based on the reliability of the difference and differences in difficulty

                        rxx     Difference in difficulty        Decision
                                (matched items / all items)
Grade 4
  Study 1, Form 1       0.07    0.24                            Keep
  Study 1, Form 2       0.18    0.31                            Keep
  Study 1, Form 3       0.10    0.24                            Keep
  Study 2               0.11    0.16                            Keep
  Study 3, Form A       0.14    0.12 / 0.24                     Keep
  Study 3, Form B       0.05    0.29 / 0.23                     Keep
  Study 3, Form C       0.10    0.19                            Keep
  Study 3, Form D       0.00    0.12 / 0.24                     Exclude
Grade 7
  Study 1, Form 1       0.00    0.10                            Exclude
  Study 1, Form 2       0.01    0.10                            Exclude
  Study 1, Form 3       0.10    0.23                            Keep
  Study 2               0.08    0.016                           Keep
  Study 3, Form A       0.00    -0.04 / 0.11                    Exclude
  Study 3, Form B       0.00    0.08 / 0.06                     Exclude
  Study 3, Form C       0.00    0.03 / 0.01                     Exclude
  Study 3, Form D       0.00    0.13                            Exclude
Grade 8
  Study 1, Form 1       0.09    0.20                            Keep
  Study 1, Form 2       0.04    0.16                            Exclude
  Study 1, Form 3       0.07    0.20                            Keep
  Study 1, Form 4       0.00    0.06                            Exclude
  Study 2               0.03    0.2                             Exclude
  Study 3, Form A       0.12    0.23 / 0.23                     Keep
  Study 3, Form B       0.08    0.08 / 0.12                     Keep
  Study 3, Form C       0.00    0.21                            Exclude
  Study 3, Form D       0.03    -0.12 / 0.12                    Exclude

Table 4

Regression results for studies 1, 2, and 3, selected forms and variables, grade 4

                              Study 1                           Study 2      Study 3
                              Form 1     Form 2     Form 3                   Form A     Form B     Form C
Student-Level Variables
  Black                       0.118      -0.037     0.07        0.043***     0.092***   -0.019     0.080***
                              (0.083)    (0.095)    (0.089)     (0.009)      (0.019)    (0.019)    (0.019)
  Hispanic                    -0.017     -0.033     0.041       0.088***     0.107***   0.062***   0.076***
                              (0.075)    (0.080)    (0.079)     (0.008)      (0.017)    (0.017)    (0.017)
  Low Income                  0.041      0.113      -0.054      0.075***     0.069***   0.100***   -0.034**
                              (0.062)    (0.068)    (0.067)     (0.007)      (0.013)    (0.013)    (0.013)
  Bubble Student              0.266***   0.381***   0.196**     0.219***     0.306***   0.185***   0.159***
                              (0.064)    (0.069)    (0.067)     (0.006)      (0.014)    (0.013)    (0.014)
School-Level Variables
  Proportion Black            -0.105     0.19       -0.176      -0.125***    -0.060     -0.058     0.034
                              (0.136)    (0.152)    (0.148)     (0.025)      (0.042)    (0.045)    (0.051)
  Proportion Hispanic         0.049      0.429**    0.053       -0.048       0.017      0.000      0.142**
                              (0.138)    (0.144)    (0.146)     (0.027)      (0.044)    (0.048)    (0.055)
  Proportion Low Income       0.131      0.148      0.229       0.143***     0.098*     0.153***   -0.002
                              (0.110)    (0.120)    (0.121)     (0.024)      (0.039)    (0.044)    (0.049)
  Proportion Bubble Students  0.027      0.107      -0.357      0.408***     0.636***   0.455**    0.300
                              (0.168)    (0.181)    (0.187)     (0.083)      (0.144)    (0.157)    (0.180)
  N                           2101       2113       2087        185164       48113      45773      44257

Note. *p<0.05 **p<0.01 ***p<0.001

Table 5

Main regression results for studies 1, 2, and 3, selected forms and variables, grade 7 and 8

                              Grade 7                  Grade 8
                              Study 1    Study 2       Study 1                Study 3
                              Form 3                   Form 1     Form 3      Form A      Form B
Student-Level Variables
  Black                       0.02       0.050***      0.05       0.112       0.075***    0.068***
                              (0.079)    (0.009)       (0.087)    (0.091)     (0.018)     (0.018)
  Hispanic                    -0.002     0.114***      0.049      0.099       0.057***    0.085***
                              (0.071)    (0.008)       (0.080)    (0.082)     (0.016)     (0.016)
  Low Income                  -0.024     0.075***      -0.048     -0.062      0.035**     -0.008
                              (0.053)    (0.006)       (0.061)    (0.061)     (0.012)     (0.012)
  Bubble Student              0.246***   0.246***      0.149      0.096       0.181***    0.005
                              (0.073)    (0.008)       (0.080)    (0.082)     (0.017)     (0.018)
School-Level Variables
  Proportion Black            0.05       -0.267***     -0.085     -0.034      -0.453***   0.109
                              (0.141)    (0.034)       (0.148)    (0.150)     (0.062)     (0.066)
  Proportion Hispanic         0.152      -0.246***     0.154      0.002       -0.375***   0.278***
                              (0.151)    (0.040)       (0.156)    (0.151)     (0.071)     (0.075)
  Proportion Low Income       0.165      0.217***      -0.055     0.07        0.221***    -0.265***
                              (0.125)    (0.036)       (0.129)    (0.124)     (0.067)     (0.071)
  Proportion Bubble Students  0.213      0.790***      0.803*     0.504       0.356       -0.606*
                              (0.265)    (0.161)       (0.324)    (0.323)     (0.294)     (0.303)
  N                           2552       185615        2118       2137        46838       47078

Note. *p<0.05 **p<0.01 ***p<0.001

Figures

Figure 1. Histograms of outcome variable, difference in proportion correct on non-audit and audit items, for grades 4 (top) and 8 (bottom) in study 2

Appendix A

Table A1

Regression results for sensitivity analysis: changing definition of bubble student for studies 1, 2, and 3 in grade 7

                              Study 1                               Study 2                                  Study 3
                              Minus 3    Minus 2    Plus/Minus 2    Minus 3     Minus 2     Plus/Minus 2     Minus 3     Minus 2     Plus/Minus 2
Student-Level
  Asian                       -0.048     -0.049     -0.046          -0.115***   -0.117***   -0.113***        -0.052***   -0.052***   -0.050***
                              (0.053)    (0.053)    (0.053)         (0.010)     (0.010)     (0.010)          (0.010)     (0.010)     (0.010)
  Black                       0.002      0.003      0.000           0.050***    0.052***    0.049***         -0.013      -0.013      -0.014
                              (0.047)    (0.047)    (0.047)         (0.009)     (0.009)     (0.009)          (0.009)     (0.009)     (0.009)
  Hispanic                    0.029      0.031      0.028           0.114***    0.114***    0.111***         -0.004      -0.004      -0.004
                              (0.042)    (0.042)    (0.042)         (0.008)     (0.008)     (0.008)          (0.008)     (0.008)     (0.008)
  Low Income                  0.022      0.022      0.023           0.075***    0.078***    0.077***         -0.032***   -0.032***   -0.033***
                              (0.031)    (0.031)    (0.031)         (0.006)     (0.006)     (0.006)          (0.006)     (0.006)     (0.006)
  Bubble Student              0.117**    -0.098*    -0.096*         0.246***    -0.161***   -0.156***        0.085***    0.005       0.008
                              (0.043)    (0.040)    (0.040)         (0.008)     (0.021)     (0.020)          (0.009)     (0.023)     (0.023)
  NYC                         -0.098*    0.091      0.109**         -0.156***   0.241***    0.256***         0.007       0.092***    0.111***
                              (0.040)    (0.051)    (0.037)         (0.020)     (0.009)     (0.007)          (0.023)     (0.011)     (0.008)
School-Level
  Proportion Asian            -0.149     -0.171     (0.165)         -0.466***   -0.496***   -0.441***        -0.030      -0.036      (0.031)
                              (0.115)    (0.114)    (0.115)         (0.070)     (0.069)     (0.069)          (0.077)     (0.076)     (0.077)
  Proportion Black            -0.042     -0.043     (0.034)         -0.267***   -0.265***   -0.254***        -0.113**    -0.115**    -0.111**
                              (0.083)    (0.083)    (0.083)         (0.034)     (0.034)     (0.033)          (0.038)     (0.038)     (0.038)
  Proportion Hispanic         0.081      0.079      0.081           -0.246***   -0.245***   -0.233***        -0.022      -0.025      (0.022)
                              (0.087)    (0.087)    (0.087)         (0.040)     (0.040)     (0.039)          (0.046)     (0.046)     (0.046)
  Proportion Low Income       0.124      0.134      0.125           0.217***    0.230***    0.207***         -0.021      -0.015      (0.019)
                              (0.073)    (0.073)    (0.073)         (0.036)     (0.036)     (0.036)          (0.043)     (0.042)     (0.042)
  Proportion Bubble Student   0.163      0.007      0.023           0.790***    0.871***    0.756***         0.296       0.358       0.208
                              (0.152)    (0.186)    (0.138)         (0.161)     (0.201)     (0.129)          (0.195)     (0.251)     (0.165)
  Intercept                   0.040*     0.040*     0.039*          0.059***    0.062***    0.059***         -0.005      -0.004      (0.005)
                              (0.019)    (0.019)    (0.019)         (0.010)     (0.010)     (0.010)          (0.010)     (0.010)     (0.010)
  N                           7,662      7,662      7,662           185,615     185,615     185,615          182,580     182,580     182,580

Note. *p<0.05 **p<0.01 ***p<0.001

Appendix B

Table B1

Regression results for sensitivity analysis: NYC interactions in study 2, grade 7

                                        Grade 7
Student-Level Variables
  Asian                                 -0.120*** (0.015)
  Black                                 0.031** (0.012)
  Hispanic                              0.083*** (0.011)
  Low Income                            0.070*** (0.007)
  Bubble Student                        0.216*** (0.008)
  NYC                                   -0.216*** (0.018)
School-Level Variables
  Proportion Asian                      -0.365*** (0.065)
  Proportion Black                      -0.180*** (0.033)
  Proportion Hispanic                   -0.058 (0.036)
  Proportion Low Income                 0.156*** (0.027)
  Proportion Bubble Students            0.294** (0.098)
NYC Interactions
  NYC x Asian                           0.009 (0.022)
  NYC x Black                           0.040 (0.020)
  NYC x Hispanic                        0.021 (0.018)
  NYC x Low Income                      0.039* (0.018)
  NYC x Bubble Student                  0.009 (0.014)
  NYC x Proportion Asian                0.412*** (0.095)
  NYC x Proportion Black                0.276*** (0.062)
  NYC x Proportion Hispanic             0.211** (0.067)
  NYC x Proportion Low Income           -0.082 (0.060)
  NYC x Proportion Bubble Student       0.136 (0.185)
  N                                     185,164

Note. *p<0.05 **p<0.01 ***p<0.001

Appendix C

Table C1

Regression results for sensitivity analysis: standardizing before the difference is taken in outcome variable for studies 1, 2, and 3 in grade 7

                              Study 1                      Study 2                      Study 3
                              Standardize   Standardize    Standardize   Standardize    Standardize   Standardize
                              Before        After          Before        After          Before        After
Student-Level Variables
  Asian                       -0.049        -0.048         0.024***      -0.115***      0.066***      -0.052***
                              (0.049)       (0.053)        (0.007)       (0.010)        (0.009)       (0.010)
  Black                       0.005         0.002          -0.071***     0.050***       -0.114***     -0.013
                              (0.044)       (0.047)        (0.006)       (0.009)        (0.008)       (0.009)
  Hispanic                    0.029         0.029          0.000         0.114***       -0.077***     -0.004
                              (0.039)       (0.042)        (0.006)       (0.008)        (0.007)       (0.008)
  Low Income                  0.024         0.022          -0.054***     0.075***       -0.130***     -0.032***
                              (0.029)       (0.031)        (0.004)       (0.006)        (0.005)       (0.006)
  Bubble Student              0.113**       0.117**        0.086***      0.246***       -0.012        0.085***
                              (0.040)       (0.043)        (0.005)       (0.008)        (0.008)       (0.009)
  NYC                         -0.095*       -0.098*        -0.047**      -0.156***      0.057***      0.007
                              (0.037)       (0.040)        (0.015)       (0.020)        (0.016)       (0.023)
School-Level Variables
  Proportion Asian            -0.146        -0.149         -0.087        -0.466***      0.223***      -0.030
                              (0.107)       (0.115)        (0.052)       (0.070)        (0.053)       (0.077)
  Proportion Black            -0.039        -0.042         -0.159***     -0.267***      -0.048        -0.113**
                              (0.078)       (0.083)        (0.025)       (0.034)        (0.027)       (0.038)
  Proportion Hispanic         0.074         0.081          -0.141***     -0.246***      0.034         -0.022
                              (0.082)       (0.087)        (0.029)       (0.040)        (0.032)       (0.046)
  Proportion Low Income       0.123         0.124          0.006         0.217***       -0.209***     -0.021
                              (0.068)       (0.073)        (0.027)       (0.036)        (0.030)       (0.043)
  Proportion Bubble Students  0.16          0.163          -0.046        0.790***       -0.352*       0.296
                              (0.142)       (0.152)        (0.119)       (0.161)        (0.142)       (0.195)
  Intercept                   0.039*        0.040*         0.002         0.059***       -0.026***     -0.005
                              (0.017)       (0.019)        (0.007)       (0.010)        (0.007)       (0.010)
  N                           7,662         7,662          185,615       185,615        182,580       182,580

Note. *p<0.05 **p<0.01 ***p<0.001

Appendix D

Table D1

Regression results for forms excluded from Tables 4 and 5 because of very low or zero reliability of the difference score, for grades 4 & 8

                              Grade 4      Grade 8
                              Study 3      Study 1                  Study 2      Study 3
                              Form D       Form 3      Form 4                    Form B       Form C
Student-Level Variables
  Black                       -0.033       0.112       0.128        -0.014       0.068***     -0.033
                              (0.018)      (0.091)     (0.084)      (0.009)      (0.018)      (0.019)
  Hispanic                    0.016        0.099       0.052        0.006        0.085***     0.007
                              (0.016)      (0.082)     (0.078)      (0.008)      (0.016)      (0.018)
  Low Income                  -0.025*      -0.062      0.047        -0.018**     -0.008       -0.032*
                              (0.012)      (0.061)     (0.057)      (0.006)      (0.012)      (0.013)
  Bubble Student              0.160***     0.096       0.036        0.074***     0.005        0.072***
                              (0.013)      (0.082)     (0.076)      (0.008)      (0.018)      (0.019)
School-Level Variables
  Proportion Black            -0.038       -0.034      0.035        -0.195***    0.109        0.114
                              (0.045)      (0.150)     (0.140)      (0.037)      (0.066)      (0.059)
  Proportion Hispanic         -0.107*      0.002       0.035        -0.174***    0.278***     0.081
                              (0.050)      (0.151)     (0.145)      (0.045)      (0.075)      (0.072)
  Proportion Low Income       0.038        0.07        -0.135       0.101*       -0.265***    -0.190**
                              (0.044)      (0.124)     (0.115)      (0.041)      (0.071)      (0.064)
  Proportion Bubble Students  0.552***     0.504       -0.118       0.509**      -0.606*      0.292
                              (0.153)      (0.323)     (0.292)      (0.166)      (0.303)      (0.311)
  N                           44538        2137        2055         186645       47078        44830

Note. *p<0.05 **p<0.01 ***p<0.001

Table D2

Regression results for forms excluded from Table 5 because of very low or zero reliability of the difference score, for grade 7

                              Grade 7
                              Study 1                  Study 3
                              Form 1      Form 2       Form A      Form B       Form C      Form D
Student-Level Variables
  Black                       0.028       -0.019       -0.064***   0.076***     -0.011      -0.060**
                              (0.077)     (0.078)      (0.019)     (0.018)      (0.016)     (0.019)
  Hispanic                    0.150*      -0.064       -0.015      0.034*       -0.013      -0.031
                              (0.070)     (0.070)      (0.017)     (0.017)      (0.015)     (0.017)
  Low Income                  0.088       0.017        -0.013      -0.064***    0.010       -0.060***
                              (0.053)     (0.052)      (0.013)     (0.012)      (0.011)     (0.013)
  Bubble Student              0.034       0.055        0.127***    0.127***     0.086***    -0.008
                              (0.071)     (0.073)      (0.019)     (0.019)      (0.017)     (0.020)
School-Level Variables
  Proportion Black            -0.063      -0.147       -0.145*     -0.127**     -0.149**    -0.054
                              (0.135)     (0.140)      (0.061)     (0.049)      (0.048)     (0.054)
  Proportion Hispanic         -0.147      0.22         -0.094      -0.040       0.001       0.004
                              (0.142)     (0.145)      (0.070)     (0.054)      (0.056)     (0.064)
  Proportion Low Income       0.207       -0.013       0.029       -0.125*      0.009       -0.063
                              (0.119)     (0.122)      (0.066)     (0.053)      (0.050)     (0.059)
  Proportion Bubble Students  0.306       0.063        0.991**     0.014        0.434       -0.384
                              (0.250)     (0.247)      (0.315)     (0.254)      (0.257)     (0.278)
  N                           2579        2531         46652       47329        45080       43519

Note. *p<0.05 **p<0.01 ***p<0.001