Auditing for Score Inflation Using Self-Monitoring Assessments:
Findings from Three Pilot Studies
A working paper of the Education Accountability Project at the Harvard Graduate School of Education
http://projects.iq.harvard.edu/eap
Daniel Koretz1 Jennifer L. Jennings2
Hui Leng Ng1 Carol Yu1
David Braslow1 Meredith Langi1
1 Harvard Graduate School of Education
2 New York University
November 25, 2014

Acknowledgements: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A110420, and by the Spencer Foundation, through Grants 201100075 and 201200071, to the President and Fellows of Harvard College. The authors also thank the New York State Education Department for providing the data used in this study. The opinions expressed are those of the authors and do not represent views of the Institute, the U.S. Department of Education, the Spencer Foundation, or the New York State Education Department or its staff.
Abstract
A substantial body of research has shown that test-based accountability programs
often produce score inflation. Most studies have evaluated inflation by comparing trends
on a high-stakes test and a lower-stakes audit test, such as the National Assessment of
Educational Progress. However, Koretz & Beguin (2010) noted the weaknesses of using
external tests for auditing and suggested instead using self-monitoring assessments
(SMAs), which incorporate into high-stakes tests audit items that are sufficiently novel
that they are not susceptible to test preparation aimed at more predictable items. This
paper reports the results of the first three trials of the SMA approach. The studies varied
in design, but all were conducted with the New York State mathematics tests in grades 4,
7, and 8 in 2011 and 2012. Despite a severe conservative bias created by a number of
aspects of the study designs, we found that the audit component functioned as expected in
many of the trials. Specifically, the difference in performance between nonaudit and audit
items was associated with factors that earlier research showed to be related to test
preparation and score inflation, such as “bubble-student” status (scoring just below the
Proficient cut in the previous year) and poverty. However, a number of trials yielded null
findings. These findings underscore the need for additional research investigating the
optimal characteristics of items used for auditing gains.
A substantial body of research has shown that test-based accountability programs
lead to test preparation that focuses attention on the specifics of the test used for
accountability rather than the broader domain of content and skills that the test is
intended to represent (e.g., Hamilton et al., 2007; Stecher, 2002). One consequence of
these behaviors is score inflation, that is, score increases substantially larger than the
improvements in learning that they represent. Studies have shown that the resulting bias
can be very large (e.g., Jacob, 2007; Klein, Hamilton, McCaffrey, & Stecher, 2000;
Koretz & Barron, 1998), and can lead to substantially different inferences about students’
academic progress.
Most studies of score inflation have followed a single design. To be meaningful,
increases in scores must generalize to the domain from which the test samples. If
performance gains do generalize to the domain, they should show a reasonable degree of
generalization to other tests that sample from the same domain and are intended to
support similar inferences. Accordingly, most studies examine the consistency of gains
between a high-stakes test and a lower-stakes audit test, most often the National
Assessment of Educational Progress (NAEP).
Koretz & Beguin (2010) noted numerous disadvantages of this approach. A
suitable audit test may be unavailable or, like NAEP, may be available only for a few
grades or only at high levels of aggregation. The substantive appropriateness of available
audit tests may be arguable, as they may have been designed to support somewhat
different inferences. Motivational differences may bias comparisons between the audit
and high-stakes tests, although this risk is mitigated somewhat when trends are compared.
As a way to avoid these limitations, Koretz & Beguin (2010) suggested self-
monitoring assessments (SMAs). SMAs incorporate audit components into the high-
stakes test itself, using differences in performance between these audit components and
routine operational items as a measure of score inflation. In principle, SMAs could
eliminate many of the limitations of using a separate test for auditing.
This study presents the results of the first three trials of the SMA principle, all
conducted in the context of New York State’s mathematics testing program in grades 4,
7, and 8. The first of these was conducted in the context of a statewide field test, while
the others were conducted in operational forms of the state’s high-stakes tests. These
three trials provide very conservative tests of the SMA design, for several reasons. First,
we faced a severe lack of statistical power caused by small samples of students in
the first study and small samples of audit items in all three studies. Second, there is no
prior research evaluating different audit-item designs that could be used to guide the
efforts reported here. Finally, we lacked empirical evidence about the test-preparation
strategies teachers used to address the operational items for which we wrote audit items.
Nonetheless, the studies together demonstrate that SMAs are a viable approach for
detecting score inflation in the context of large-scale testing programs.
Background
The Problem of Score Inflation
The risk of score inflation has been noted in the measurement literature for more
than half a century. For example, Lindquist (1951), writing in an era of low-stakes
testing, noted that
Because of the nature and potency of the rewards and penalties associated
in actual practice with high and low achievement test scores of students,
the behavior measured by a widely used test tends in itself to become the
real objective of instruction, to the neglect of the (different) behavior with
which the ultimate objective is concerned (p. 153).
Madaus (e.g., 1988) and Shepard (e.g., 1988) both warned that high-stakes testing was
likely to create test-specific coaching that would in turn produce score inflation.
The first empirical study of score inflation (Koretz, Linn, Dunbar, and Shepard,
1991) examined a system that, while high-stakes by the standards of the day, was very
low-stakes by today’s standards, entailing no concrete sanctions or rewards. Koretz et al.
examined changes in scores when the district substituted one commercially produced
achievement test for another, and they readministered the older test four years after its
final high-stakes administration. They found inflation in mathematics of half an academic
year by the end of third grade. Since that time, studies in a variety of different contexts
have confirmed that score inflation is common (e.g., Hambleton et al., 1995; Haney,
2000; Ho, 2009; Ho & Haertel, 2006; Jacob, 2005, 2007; Klein, Hamilton, McCaffrey, &
Stecher, 2000; Koretz & Barron, 1998; Koretz, Linn, Dunbar & Shepard, 1991). For
example, Klein, Hamilton, McCaffrey, & Stecher (2000) found that gains on the high-
stakes Texas TAAS test were roughly two to six times the size of the state’s gains on
NAEP. Jacob (2007) found that gains on several state high-stakes mathematics tests were
roughly double those on NAEP. Koretz & Barron (1998) found that gains on Kentucky’s
high-stakes mathematics tests exceeded gains in NAEP roughly by a factor of four, and
Hambleton et al. (1995) found that gains of roughly three-fourths of a standard deviation
on the state’s fourth-grade reading test were accompanied by no gain whatsoever on
NAEP.
A number of studies have examined educators’ responses to testing and have
found behaviors that may be linked to score inflation. Among the reported responses that
may be related to inflation are narrowing of instruction to focus on tested content,
adapting instruction to the format of test items, focusing instruction on incidental aspects
of tests, and cheating (e.g., Stecher, 2002). For example, in a comprehensive study of
responses to test-based accountability in three states, Hamilton et al. (2007) found that 55
to 78 percent of teachers reported “emphasizing assessment styles and formats of
problems,” and roughly half reported spending more time teaching test taking strategies
(p. 103). Such practices have been widely documented (e.g., Abrams, Pedulla, & Madaus,
2003; Luna & Turner, 2001; Pedulla et al., 2003; Stecher & Barron, 2001; Stecher et al.,
2008).
Koretz & Hamilton (2006) distinguished among seven different types of test
preparation, including those that are likely to produce meaningful gains in achievement
and those that are likely to produce score inflation. Educators may respond to the
pressures to raise scores by allocating more time to instruction, finding more effective
instructional strategies, or simply working harder. All of these, within limits, may
produce meaningful gains in scores, that is, gains that reflect commensurate increases in
student learning. At the other extreme, educators may cheat, as in the recent large-scale
scandal in the Atlanta public schools (Severson, 2011), which can only produce inflated
scores. Similarly, they may manipulate the population of test-takers, which will not bias
the scores of individual students but will inflate aggregate scores (e.g., Figlio & Getzler,
2006).
More relevant here are two categories of responses that Koretz & Hamilton
(2006) labeled reallocation and coaching. Reallocation entails better aligning
instructional resources, such as time, to the content of the specific test used for
accountability. Coaching refers to focusing instruction on minor, usually incidental
characteristics of the test, including unimportant details of content and the types of
presentations used in the items in the specific test. For example, item writers typically use
Pythagorean triples in items assessing the Pythagorean Theorem because students are
unable to compute square roots by hand. One common form of coaching in response to
this is telling students to memorize the two Pythagorean triples that most often appear in
test items, 3:4:5 ($3^2 + 4^2 = 5^2$) and 5:12:13 (e.g., Rubinstein, 2000). This allows
students to answer the item correctly without actually learning the theorem or being able
to apply it in real life.
Reallocation can be beneficial, but it can also generate score inflation. For
example, if a test reveals that a school’s students are weak in proportional reasoning, one
would want educators to bolster instruction in that area, and these responses might
include increasing the allocation of time to it. This reallocation, if effective, will improve
the mathematics achievement the test score is intended to proxy. However, if reallocation
entails shifting resources away from content that is de-emphasized or omitted from the
specific test but is nonetheless important to the inference based on scores, the result will
be score inflation. Numerous studies have found that many teachers report decreasing
their emphasis on elements of the curriculum that are de-emphasized by the test (Pedulla
et al., 2003; Stecher, 2002) and widely available test-preparation materials help educators
do so (Haney, 2000; Stecher, 2002). Similarly, while there are cases in which coaching
might induce meaningful gains—for example, if students are confronted with a format
that is so novel that their performance without familiarization would be biased
downward—it will generally bias scores upwards.
Variations in Test Preparation and Score Inflation
Although most studies of score inflation and test preparation have examined only
trends in the student population as a whole, a growing body of research has documented
variations in both across types of students and schools. We make use of these variations
in our identification strategy, described below.
Research has shown that teachers in more disadvantaged schools, i.e., those
serving disproportionately poor or non-Asian minority students, are likely
to focus more than others on test preparation. These schools often have a stronger
emphasis on assigning drills of test-style items and teaching test taking strategies
(Cimbricz, 2002; Diamond & Spillane, 2004; Eisner, 2001; Firestone, Camilli, Yurecko,
Monfils, & Mayrowetz, 2000; Herman & Golan, 1993; Jacob, Stone, & Roderick, 2004;
Jones, Jones, & Hargrove, 2003; Ladd & Zelli, 2002; Lipman, 2002; Luna & Turner,
2001; McNeil, 2000; McNeil & Valenzuela, 2001; Taylor et al., 2002; Urdan & Paris,
1994). These findings are not surprising, as low-performing schools face the most
pressure to raise scores, face more severe obstacles to improving performance, and often
have less experienced teachers and principals.
Research on variations in score inflation is less abundant but is for the most part
consistent with the research on behavioral responses in showing greater inflation in
disadvantaged schools. Several reports show that the gains made by low-income
and black and Hispanic students relative to white students on state tests are not matched
on audit tests (Klein et al., 2000; Jacob, 2007; Ho & Haertel, 2006). Shen (2008) further
elaborated on the link between pedagogical practices and score inflation by examining
differences in the performance of items that were more or less “teachable.” She showed
that schools had greater improvements over time in performance on more teachable
items, and that this trend was more pronounced in disadvantaged schools. Similarly,
Jennings, Bearak, and Koretz (2011) have demonstrated that reductions in the racial
achievement gap on New York State mathematics tests were driven by black and
Hispanic students' improvement on items testing predictably-assessed standards. These
limited studies suggest that the variations in test preparation noted above tend to produce
the greatest score inflation for the most vulnerable students.
Another documented pattern is the reallocation of instructional resources toward
students whose anticipated scores are just under the proficiency cut score (“bubble
students”) and away from students whose anticipated scores are comfortably above or
below proficient, a practice described as educational triage (Booher-Jennings, 2005;
Gillborn & Youdell, 2000; Neal & Schanzenbach, 2010; Stecher et al., 2008).
Accountability systems that only provide rewards or sanctions based on the percentage of
students who score above the proficiency cut score incentivize teachers and schools with
limited resources to focus those resources in ways that seem more likely to result in the
greatest increase in proficiency rates. If students far beneath the cut score would require
more resources than are available to get above the cut score, and if students comfortably
above the cut score require fewer resources to demonstrate proficiency, then teachers
may see that focusing their pedagogical effort on the bubble students may result in a
larger reward. This phenomenon is particularly important in the present context because
it creates particularly clear incentives to focus on test preparation for students near the cut
score, and a number of quantitative studies have established that “bubble students”
appear to make more progress on high-stakes tests when proficiency-based accountability
systems are in place (Jennings & Sohn, 2014; Neal & Schanzenbach, 2010; Reback,
2008). As we will later discuss when we detail our methods, these students thus prove to
be an analytically useful group for identifying test-specific coaching.
Predictable Sampling and the Design of Self-Monitoring Assessments
Successive operational test forms typically show predictable patterns. These
patterns can include the amount of emphasis given to different content (including
predictable omissions of content) and the ways in which content is presented. These
predictable patterns in turn provide the opportunity to narrow test preparation to focus on
the particulars of the test and thereby to generate score inflation. Test preparation
materials often draw users’ attention to these predictable patterns and provide strategies
for performing well on predictable items.
Holcombe, Jennings, & Koretz (2013) provided a framework for classifying the
various types of predictable patterns that can enable inflation. They suggest that the
sampling needed to create a test can be seen as comprising four distinct levels. First, one
decides which elements from the domain will be represented in the standards or
curriculum. The material sampled by the standards may exclude or deemphasize some
content that is important for stakeholders’ targets of inference. Second, the test authors
must decide which of the standards will be tested, and among those that will be tested,
how frequently they will be tested and how much emphasis they will be given in a typical
form. Third, when standards are reasonably broad, the authors then must decide how to
sample from the range of content and skills each standard implies. All of these stages of
sampling represent a narrowing of the substantive range of the test.
The final stage of narrowing entails the sampling of non-substantive elements,
that is, elements that are not relevant to the intended inference. Much of this sampling
may be inadvertent. Some small details of content may be of this sort—for example,
using only regular polygons in geometry items, or using only positive slopes in
quadrant 1 in items about slopes, when the inference does not call for this narrowing.
However, many of the predictable non-substantive elements in a test can be seen as
aspects of presentation rather than content, e.g., item format, the particular graphics used
with mathematics items, and so on.
Audit items in an SMA should assess material that is relevant to the inference
without replicating the predictable patterns that create opportunities for test preparation.
For example, if all operational items assessing knowledge of the Pythagorean Theorem
make use of common Pythagorean triples, a suitable audit item might be a calculator item
with a non-integer solution. Students whose test preparation focused on Pythagorean
triples in lieu of appropriate instruction about the theorem would find the audit item much
more difficult than would students receiving appropriate instruction on the Pythagorean
Theorem.
Opportunities for SMA Trials in New York State
The New York State Education Department (NYSED) offered us two
opportunities to pilot-test an SMA, and changes in the state’s operational program
provided a third trial. To our knowledge, these are the first trials of the SMA approach.
Study 1 was implemented in the context of a stand-alone field test administered
statewide in June 2011 for purposes of linking and pilot testing new items. We
administered mathematics audit items in grades 4, 7, and 8. Six of our audit items and six
operational or anchor items were included in 12-item forms that were allocated randomly
to samples of students. A total of 6,301 students were sampled, each of whom was
administered a single audit-test booklet.
Study 2 stemmed from changes made to New York’s operational testing program. In
2009, NYSED acknowledged that gains on some of the state’s high-stakes mathematics
tests might be inflated (Tisch, 2009). One of the steps NYSED took to lessen the risk of
continued score inflation was to broaden the state’s tests, partly to make them less
predictable. The tests administered in the spring of 2011 were longer and incorporated
items assessing standards that had not previously been tested. This allowed us to use the
items assessing previously untested standards as an audit component. We did this in the
same three grades as in Study 1.
Study 2 offered several advantages compared with Study 1. Administering the
audit items in the context of the operational high-stakes assessment eliminated the risk of
motivational biases. Because the audit items were in the operational forms, the sample
sizes were very large, approaching 200,000. Finally, the distributions of raw scores on the
non-audit component of the Study 1 test forms were badly right censored, as is common
with high-stakes tests. Because of changes made to the operational test in 2011, the
distributions of raw scores on the non-audit components were less right censored in
Study 2.
However, Study 2 also had two important weaknesses compared with Study 1.
First, the audit is limited to one of the four categories of audit items described below, all
of which were administered in the other trials. Second, none of the items in Study 2 were
specifically selected to serve as audit items, and they may be poorly suited for that
purpose. For example, they may share attributes with other items that allow the effects of
coaching to generalize to them. To the extent that this is true, Study 2 would produce
conservative estimates of score inflation.
In Study 3, conducted in the spring of 2012, we embedded planned audit items in
all operational mathematics test forms in grades 4, 7, and 8. The items used in Study 3
overlapped substantially with those used in Study 1 and included all four categories of
audit items described below. The strengths of this study were administering planned
items in the context of an operational assessment and the large sample of students.
However, this study was weakened by NYSED’s administration procedures. NYSED
matrix-samples at the school level. The state’s schools are divided into four
representative groups, and every student in every school in any one of the four groups
receives a single test form. We were permitted to place only four audit items in any form,
so no school received more than four items. Therefore, despite the large samples of
students who received any one form, the very small number of items per school results in
problems of limited statistical power.
Data
We created our analytic samples using the same approach in all three studies. From
the original datasets described above, we dropped schools with fewer than 3 students1.
We then dropped students missing either an audit or non-audit test score. Lastly, we
excluded a small number of students (less than 1%) with apparently anomalous scores.2
These decisions left us with approximately 94% of the original sample for Study 1, 89% for Study 2, and 90% for Study 3. These analytic samples differed little from our original data. Additionally, the samples across the three studies are very similar in demographic composition (see Table 1).

1 The very low threshold of n>=3 was necessitated by the design of the sample we were given in Study 1. Using a higher cutoff would have eliminated too many schools. However, the use of empirical Bayes estimates should minimize the impact of the choice of threshold. We used data from Study 2, which was based on nearly complete census data, to evaluate the effect of replacing this cutoff with n>=10. This had no appreciable effects on the results obtained.

2 We also dropped a very small number of students with mismatched form booklets. In addition, we dropped P.S. 184 Shuang Wen School, a public school in New York City with an immersion program in Mandarin Chinese. A large percentage of the school's students were Asian, and an extreme value relative to the rest of the schools in our sample inflated coefficients for the school proportion-Asian variable.
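To make the sample-construction rules above concrete, here is a minimal pandas sketch. The column names (school_id, p_nonaudit, p_audit, anomalous_flag) are hypothetical stand-ins, and the anomalous-score screen is only a placeholder because the text does not state its criterion.

```python
import pandas as pd

def build_analytic_sample(df: pd.DataFrame, min_school_n: int = 3) -> pd.DataFrame:
    """Apply the sample restrictions described above."""
    # Drop schools with fewer than `min_school_n` students (n >= 3 here;
    # footnote 1 reports that n >= 10 made no appreciable difference in Study 2).
    df = df.groupby("school_id").filter(lambda g: len(g) >= min_school_n)
    # Drop students missing either the audit or the non-audit score.
    df = df.dropna(subset=["p_nonaudit", "p_audit"])
    # Placeholder for the exclusion of apparently anomalous scores
    # (the specific criterion is not spelled out in the text).
    if "anomalous_flag" in df.columns:
        df = df[~df["anomalous_flag"]]
    return df
```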
Methods
Item writing and Construction of Outcome Measures
Our selection and writing of audit items were guided by the framework by
Holcombe et al. (2013) described above, but we found that we had to modify it
somewhat.
In principle, one could design audit items for each of the stages of narrowing
described by Holcombe et al. (2013): from the domain to the standards, from the total set
of standards to the subset of standards tested and emphasized, from the description of
standards to the selected content and skills sampled from within each, and from the range
of possible representations and other non-substantive elements. However, although
unimportant content details and aspects of presentation are conceptually distinct, we
could not differentiate reliably between them. For example, presenting slopes only as
positive in quadrant 1 could be seen either as a detail of content or as a matter of
presentation. This difficulty was exacerbated by the narrow wording of some of the New
York standards, which constrained the variety of items that could be written. (E.g., one
Grade 8 standard in geometry, 8G1, read “Identify pairs of vertical angles as congruent”).
Therefore, we modified the framework by combining these two stages.
This left us with four types of audit items. Not-in-standards (NIS) items
represented content omitted from the state’s standards but often included in the definition
of the domain, e.g., in the NAEP frameworks. Untested-standards (US) items assess
previously untested state standards (i.e., narrowing in terms of the selection of specific
standards for inclusion in the state test). Because US items measure content explicitly
included in the intended inference, they should test for inflation from undesirable
reallocation. The third type of item was designed to counter wording of standards that we
considered to be narrowing content unduly. These broadened-at-the-standards-level (BS)
items very modestly broadened the range of the standard. For example, in the case of 8G1
above, the broader content of interest is the concept of congruency of angles, not
necessarily restricted to pairs of vertical angles. The final category was designed to
respond to narrowing that arose in the writing of the actual operational items to represent
a given standard (e.g., repeated use of the same representation or specific content over
successive years). These broadened-at-the-item level (BI) items broadened the within-
standard diversity of test items while adhering to the specific wording of the standards.
BI items should be sensitive to test preparation focused on specific substantive or non-
substantive characteristics of past items.
In Studies 1 and 3, we were permitted to administer only multiple-choice audit
items, and the number of audit items was constrained. In Study 2, the number of audit
items was determined by design decisions by the state’s contractor. All but one of the
Study 2 audit items were multiple-choice.
The large majority of our audit items for Studies 1 and 3 were publicly available,
previously administered items from large-scale assessments, including the NAEP, the
Trends in International Mathematics and Science Study (TIMSS), and state assessments
from Massachusetts and Pennsylvania. It was often necessary to modify these items,
either for content or to match the four-distractor format of the New York tests. In
addition, we and colleagues at Cito, a testing firm in the Netherlands, wrote a small
number of items. In Study 2, the audit items were all previously untested standard (US)
items included in the operational 2011 forms by the state’s testing contractor, without
guidance from us.
Studies 1 and 3. We included items of all four types in Studies 1 and 3. Counts
are shown in Table 2. Because we wanted to re-test the audit design in the context of an
operational assessment, many of the audit items administered in Study 1 were reused in
Study 3. All items were reused at the seventh grade, and 14 of 16 were reused at the
fourth grade. However, we reused only 9 of 16 audit items in eighth grade because we
found that several failed to function well based on results from Study 1. We thus replaced
these items with new audit items, selected from the same three public sources or created
based on the same principles as for Study 1, in Study 3.
For Study 1, our audit items were embedded in 12-item forms, with 6 audit and 6
non-audit items per form. We selected the non-audit items from past operational and
anchor items in the 2011 Field Test. We matched each of the audit items that were
broadened at the standards or item level (BI and BS items) with a non-audit item that
closely matched it in content but that differed on one or more predictable characteristics
potentially associated with coaching. This matching was intended to guard against
differences in difficulty arising from intentional differences in content rather than from
test preparation. Such matching was not possible by definition for untested-standards
(US) and not-in-standards (NIS) items. Nonetheless, we were able to pair each of the US
and NIS audit items with a non-audit item testing the same content area defined by the
learning standard or strand (e.g., Algebra) in the New York standards.
The strategy we followed in selecting and writing items was intended to minimize
the risk of confounding, and therefore, differences between BI / BS audit items and
nonaudit items were minor. The cost of this attempt to create minimal contrasts is the risk
of introducing too little novelty, which could lead to spurious null findings or
underestimates of the inflation effect.
In contrast, in Study 3, the entire operational form, minus our audit items, was
used as the non-audit component. This was done to lessen the problem of low reliabilities
of difference scores created by 6-item forms, but it precluded the close matching we
carried out in Study 1. We did post-hoc matching of items in Study 3, as described below.
Study 2. The audit items in Study 2 were all items that assessed previously
untested standards (US items). These were added as part of the state’s efforts to broaden
their tests and were selected without input from us. We used the remaining items in the
operational state test as the non-audit items. In seventh grade, we excluded only one item
that NYSED dropped because of a negative correlation with test scores. In each of fourth and
eighth grades, we dropped three items that assessed previously untested standards but that
were extremely easy (p-values ranging from .79 to .96), as it is unlikely that extremely easy
items can serve effectively as audit items.
The dependent variable in all three studies was based on the simple difference
between the raw proportion correct on the non-audit portion of the test and that on the
audit portion:
(1) $Y_{is} = p_{is}^{\text{nonaudit}} - p_{is}^{\text{audit}}$
where i indexes individuals and s indexes schools. We standardized this difference to
mean 0, standard deviation 1.
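A minimal sketch of this outcome construction, assuming a student-level pandas DataFrame with hypothetical columns p_nonaudit and p_audit holding the raw proportion correct on each component:

```python
import pandas as pd

def make_outcome(df: pd.DataFrame) -> pd.Series:
    """Standardized nonaudit-audit difference score, Y_is in equation (1)."""
    diff = df["p_nonaudit"] - df["p_audit"]         # raw proportion-correct difference
    return (diff - diff.mean()) / diff.std(ddof=1)  # standardize to mean 0, SD 1
```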
Predictors. Our predictor variables included a variety of student-level
characteristics, school averages of these characteristics, an indicator of students’
proximity to the proficiency cut score (termed “bubble status”), and dummy variables for
regions of the state.
Demographics. We included student-level dummy variables for African-
American, Hispanic and Asian students, leaving white and other-race students as our
omitted comparison group. 3 We also included an indicator of low-income status, which is defined by NYSED as whether or not a student participates in the free and reduced-price lunch program or another economic assistance program. 4 The school-level demographic variables were computed as the school means of these student-level variables.
3 In most cases, parents reported race. When parents do not report race, districts are responsible for assigning classifications.

4 For detailed information about the criteria for the low-income variable, see University of the State of New York (2011), p. 44.

Bubble status. We also included an indicator of bubble status. We defined these as students whose scores in the previous year's state test were as much as three raw score points below the cut score between Level 2 and Level 3 ("Proficient"). We used this as a
proxy for educators’ anticipated scores for the students in the target year. The proportions
of bubble students were then computed for schools. A sensitivity analysis comparing
alternative definitions of this variable is discussed below.
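A sketch of the bubble-status indicator and its school-level aggregate. The column names prior_raw_score and prior_level3_cut are hypothetical stand-ins for the previous year's raw score and the Level 2/Level 3 cut score, which the paper does not name explicitly.

```python
import pandas as pd

def add_bubble_status(df: pd.DataFrame, bandwidth: int = 3) -> pd.DataFrame:
    """Flag students scoring within `bandwidth` raw points below the prior-year
    Proficient cut, and compute each school's proportion of bubble students."""
    gap = df["prior_level3_cut"] - df["prior_raw_score"]
    df = df.assign(bubble=((gap > 0) & (gap <= bandwidth)).astype(int))
    df["school_prop_bubble"] = df.groupby("school_id")["bubble"].transform("mean")
    return df
```

The bandwidth parameter also accommodates the alternative definitions used in the sensitivity analysis described later.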
Region. We created dummy variables to separate districts into three categories,
based on differences in district size, urbanicity, and demographics: New York City; other
urban (Buffalo, Rochester, Syracuse, and Yonkers, the four largest districts after New
York City); and all other districts (the omitted category). We found that the dummy
variable for other urban had little effect, so we dropped it from analysis, combining these
four districts with the omitted category.
Analytical approach
Across the three studies, three grades, and multiple forms (three or four per grade
in Study 1 depending on grade, a single form in Study 2, and four in Study 3), we had a
total of 25 audit forms. We conducted most analyses at the form level because of the lack
of an effective way to link across forms.
Because our data did not permit us to place the audit and non-audit components
on a single scale, we could not use simple differences in difficulties of the two
components as an indicator of potential inflation (Koretz & Beguin, 2010). Instead, we
used a difference-in-differences approach to detect potential score inflation. That is, we
examined the extent to which the difference in difficulty between the audit and non-audit
components is predicted by variables shown previously to be related to variations in test
preparation and score inflation: poverty, race/ethnicity, and especially bubble status.
The audit test score, $p_{is}^{\text{audit}}$ in equation (1), is ideally an unbiased estimator of the student's uninflated achievement, $\theta_{is}$:

(2) $p_{is}^{\text{audit}} = \theta_{is} + \epsilon_{is}$, with $E(\epsilon_{is}) = 0$.
We assume it is unbiased because the items are novel with respect to the specific content
or presentation that facilitates inappropriate test preparation and score inflation. To the
extent that this assumption is violated, the difference score is downwardly biased as an
indicator of inflation.
However, $p_{is}^{\text{nonaudit}}$ in equation (1) is potentially biased by inflation, $\zeta_{is}$, which is expected to vary both within and between schools. In addition, $p_{is}^{\text{nonaudit}}$ may differ from $p_{is}^{\text{audit}}$ by a difficulty factor, $\delta$, that is unrelated to inflation:

(3) $p_{is}^{\text{nonaudit}} = \theta_{is} + \delta + \zeta_{is} + \nu_{is}$.

Therefore one can re-express the audit measure in equation (1) as:

(4) $Y_{is} = \delta + \zeta_{is} + (\nu_{is} - \epsilon_{is})$.
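Making the algebra behind equation (4) explicit, substituting equations (2) and (3) into equation (1) gives

$Y_{is} = p_{is}^{\text{nonaudit}} - p_{is}^{\text{audit}} = (\theta_{is} + \delta + \zeta_{is} + \nu_{is}) - (\theta_{is} + \epsilon_{is}) = \delta + \zeta_{is} + (\nu_{is} - \epsilon_{is}),$

so the student's true achievement $\theta_{is}$ cancels and the difference score comprises only the constant difficulty offset, inflation, and error.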
Note that the non-inflationary difference in difficulty, $\delta$, is shown as a constant,
not varying across schools or students within schools. Student-level variations in
difficulty unrelated to inflation contribute to measurement error. School-level variations
in difficulty pose a more complex issue. In the general case, school performance may
differ between parts of a test for systematic reasons unrelated to inflation. For example,
when another test that reflects a somewhat different framework is used as an audit, there
is a risk that schools will vary in the emphasis they give to certain aspects of the audit test
because of curricular differences that are independent of their responses to the high-
stakes test. In this case, however, all of the content of both subtests is explicitly included
in the target of inference, and therefore in the content that schools are expected to teach.
Therefore, if some schools show strong performance on the tested standards that does not
generalize to the untested standards included in the inference, we can consider that score
inflation. Accordingly, we can treat school-level differences in performance between the
two subtests as comprising only inflation and school-level error and not non-inflationary
differences in difficulty, although we cannot determine whether the inflation is a result of
deliberate responses to testing. Our analytical models incorporate estimates of school-
level error.
The outcome shown in equation (4) poses two difficulties for our analysis. First,
the student-level error in the difference score, $(\nu_{is} - \epsilon_{is})$, will be large because of the short length of the audit test. Below we show that the estimated reliability of $\left(p_{is}^{\text{nonaudit}} - p_{is}^{\text{audit}}\right)$ based on the internal consistency reliabilities of the two subtests is extremely
low. This creates a severe conservative bias—that is, a large risk of a Type II error. This
problem is ameliorated in our estimates of school-level relationships, but nonetheless,
this lack of statistical power remains an important limitation of the data. In response, we
screened out the trials with the lowest reliability of the differences, as explained below.
An even more important limitation is that $\delta$ cannot be estimated from our data. To estimate $\delta$, we would need uninflated estimates of the difficulties of both subtests, but we do not have such an estimate for the non-audit subtest, which was administered only under conditions vulnerable to inflation. This is what precludes our placing the two components on a single scale. Therefore, $\delta$ and $\zeta$ are confounded, and the simple difference $\left(p_{is}^{\text{nonaudit}} - p_{is}^{\text{audit}}\right)$ cannot be interpreted as an indicator of inflation.
Our use of a difference-in-differences approach removes the effects of $\delta$. Specifically, we investigated whether variations in $\left(p_{is}^{\text{nonaudit}} - p_{is}^{\text{audit}}\right)$ are systematically
associated with student- and school-level variables that have been shown in prior studies
to be associated with either score inflation or inflation-inducing instructional behaviors:
race/ethnicity, economic disadvantage, and bubble status. Because we expect that
inflation-inducing behaviors vary across schools, we examined these relationships both
within- and between schools.
Analysis of Item and Test Statistics
To estimate the internal consistency reliability of the audit-item sets and to flag
potentially problematic items, we calculated item-rest correlations, omitted-item alphas,
and Cronbach’s alpha for all item sets, by form. In addition, we calculated classical
estimates of the student-level reliability of the raw score difference (nonaudit - audit)
separately for each form. Because we knew that the short audit forms would lead to
unreliable difference scores, we used these estimates to screen the 25 separate trials
(across studies, grades, and forms) for adequate reliability, excluding numerous forms because of insufficient reliability.
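A sketch of these classical item and test statistics, assuming a hypothetical students-by-items matrix of 0/1 scores for one form; the functions follow the standard classical formulas rather than any code used in the study.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a students-by-items matrix of 0/1 item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_rest_correlations(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(items.shape[1])])

def omitted_item_alphas(items: np.ndarray) -> np.ndarray:
    """Alpha recomputed with each item deleted in turn."""
    return np.array([cronbach_alpha(np.delete(items, j, axis=1))
                     for j in range(items.shape[1])])
```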
Regression Models
Our primary model was a two-level random-intercepts model:
(5) $Y_{is} = \beta_{0s} + \mathbf{X}_{is}\boldsymbol{\beta}_{1} + \beta_{2}N_{s} + \varepsilon_{is}$

$\beta_{0s} = \gamma_{0} + \mathbf{Z}_{s}\boldsymbol{\gamma}_{1} + u_{s}$
where X is a vector of student-level variables, Z is the corresponding vector of school-
level means, and N is the NYC dummy. This model appropriately adjusts standard errors
for clustering and also accommodates our hypothesis that variations in inflation-inducing
behaviors arise between schools.
We grand-mean centered the student-level predictors. This yields parameter
estimates for level-2 variables that are direct estimates of context effects (Raudenbush &
Bryk, 2002). That is, the parameter estimates for level-2 variables indicate the extent to
which the school means differ by more than the level-1 model would predict. We also
group-mean centered level-2 variables for ease of interpretation.
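One way to fit the model in equation (5) is sketched below with statsmodels' MixedLM, used here only as one possible estimator; the paper does not say what software was used. The column names (Y, nyc, school_id, and the student-level predictors) are hypothetical stand-ins, and the level-2 means are left uncentered for brevity.

```python
import pandas as pd
import statsmodels.formula.api as smf

student_vars = ["bubble", "black", "hispanic", "asian", "low_income"]

def fit_random_intercepts(df: pd.DataFrame):
    """Fit Y on grand-mean-centered student predictors, their school means,
    and the NYC dummy, with a random intercept for school."""
    df = df.copy()
    for v in student_vars:
        df[v + "_c"] = df[v] - df[v].mean()                       # grand-mean centering
        df["sch_" + v] = df.groupby("school_id")[v].transform("mean")
    formula = ("Y ~ " + " + ".join(v + "_c" for v in student_vars)
               + " + " + " + ".join("sch_" + v for v in student_vars)
               + " + nyc")
    model = smf.mixedlm(formula, data=df, groups=df["school_id"])
    return model.fit()
```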
Post hoc Analysis of Matched Items
We analyzed matched audit and non-audit items in the three studies for two
different purposes.
In Studies 1 and 3, we analyzed matched items to help assess whether the audit
items functioned as intended. We expect audit items to be more difficult than matched
non-audit items because of the addition of specific novel content or representations.
While such raw differences in difficulty cannot prove whether an audit worked, they are
indicative of whether the audit items successfully assessed content in ways that were
sufficiently novel relative to operational items.
This analysis was straightforward in Study 1, for which most forms were
constructed using matched item pairs. For Study 3, we identified operational items that
assessed the same content standards as our audit items for consideration as possible
matches. Since some content standards encompass many possible mathematics topics,
two authors independently examined and evaluated each operational item to determine
whether it was an acceptable match based on the content it assessed. There were no
disagreements, and 17 of 22 items (77%) were retained for matching.
In Study 2, matching for this purpose was precluded by the use of untested-
standards audit items, but we carried out another form of matching to evaluate whether
the untested items were sufficiently different from other operational items to warrant
using them as audits. We constructed categories of closely related content standards and
looked for non-audit items that assessed standards from the same categories as those for
the audit items. Using a similar review process as for the other studies, two authors
independently decided whether any of the matches were close enough to be potentially
problematic. We found no problematic matches.
Results
Descriptive Statistics and Test Statistics
The distributions of the difference scores in our outcome variable are
approximately normal, although the distributions of the individual p-values of the audit
and non-audit portions are in some instances skewed. The skewness statistics for the
distributions of difference scores range from .002 to .203. Figure 1 shows the
distributions of the outcome in grades 4 and 8 in study 2; grade 4 is the most skewed and
grade 8 is one of the least skewed.
The internal consistency reliability of the audit item sets varied markedly, from
0.28 to 0.72, with a mean of 0.52 (see Table 3). While item-test and item-rest correlations
are difficult to interpret when forms are very short, we saw little evidence of problematic
items. Rather, the low reliabilities reflect the very short lengths of the audit components.
Applying the Spearman-Brown prophecy formula to data pooled across forms within grade suggests that if the audit forms were lengthened to only 10 items each, the reliabilities would fall within a range from 0.46 to 0.76, with most above 0.6.
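The Spearman-Brown projection just described can be computed with a small helper; the 6-item length and 0.52 reliability in the example come from the audit forms above, but the projected value is only illustrative.

```python
def spearman_brown(reliability: float, old_len: int, new_len: int) -> float:
    """Projected reliability when a form of old_len items is lengthened to new_len."""
    k = new_len / old_len
    return k * reliability / (1 + (k - 1) * reliability)

# e.g., a 6-item audit component with alpha = 0.52 lengthened to 10 items
print(round(spearman_brown(0.52, 6, 10), 2))  # roughly 0.64
```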
The estimated reliabilities of the difference scores we used as measures of
inflation are very low (see Table 3). This is attributable to the low reliability of the audit
scores and the strong correlations between the non-audit and audit scores (ranging from
0.57 to 0.81). As a result, the estimated reliabilities of our outcomes range from
effectively 0 to 0.18.5 We are particularly interested in school-level relationships, and our
use of empirical Bayes estimates will lessen this problem of low reliability for school
effects. Nonetheless, this low reliability will introduce a strong conservative bias to the
coefficients in our regressions.
To simplify the presentation of our results, we report in detail only the results of
analysis of forms that had a student-level reliability of the difference score of at least 0.05. We assume that in most cases, analysis of forms that did not reach this threshold is too
noisy to interpret with confidence, and indeed, these models had few statistically significant findings. One consistent pattern in the excluded models is noted in the text. Regression results for all excluded forms are available online at (Appendix D). [Note: Appendix D is included in this draft, but we expect that space constraints will require moving it online.]

5 Conceptually, reliability cannot be negative. In practice, when reliability is very low, one can obtain negative estimates from error. In our case, we could obtain negative estimates for an additional reason. Using the classical model, the estimated reliability of a difference score will be negative whenever $r_{xx}\sigma_{x}^{2} + r_{yy}\sigma_{y}^{2} < 2r_{xy}\sigma_{x}\sigma_{y}$, where x and y are the two tests that are differenced. This inequality is more likely to hold when the test with a larger variance has a considerably lower reliability—precisely what our data produce. Following convention, we set all negative estimates to zero.
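A small illustration of the classical difference-score reliability underlying footnote 5; the formula is the standard classical one, and the inputs below are hypothetical values chosen only to show how a higher-variance but less reliable audit component drives the estimate to zero.

```python
def difference_score_reliability(r_xx: float, r_yy: float, r_xy: float,
                                 sd_x: float, sd_y: float) -> float:
    """Classical reliability of the difference x - y, floored at zero as in the text."""
    num = r_xx * sd_x ** 2 + r_yy * sd_y ** 2 - 2 * r_xy * sd_x * sd_y
    den = sd_x ** 2 + sd_y ** 2 - 2 * r_xy * sd_x * sd_y
    return max(num / den, 0.0)

# Hypothetical inputs: a reliable non-audit score (SD 0.20) and a less reliable,
# higher-variance audit score (SD 0.27) that correlate 0.70.
print(difference_score_reliability(r_xx=0.90, r_yy=0.50, r_xy=0.70,
                                   sd_x=0.20, sd_y=0.27))  # prints 0.0
```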
Matched-Item Analysis
In the fourth grade, both matched-item and all-item comparisons showed the
audit items to be more difficult than non-audit items, with p-value differences ranging
from 0.12 to 0.31 (Table 3). In contrast, most of the seventh-grade forms showed very
small differences in difficulty, suggesting that our intended audit items did not function
as such. Most seventh-grade forms that showed very small differences in p-values also
showed reliabilities of the difference <.05, so we did not use the difference in difficulty
as a reason to exclude any additional forms. The eighth-grade forms varied in terms of
p-value differences, but in this grade as well, we saw no reason to exclude forms on this
basis if they exceeded our threshold for reliability.
Regression Results
To place the following results in context, recall that the problem of low statistical
power was most severe in Study 1 and least severe in Study 2. In addition, given that our
screening on reliability and simple differences suggests that our efforts to construct audit
forms were most successful in the fourth grade, we consider the fourth-grade regression
results to be the most important.
The variable which we consider most directly related to the pressures that cause
inflation, bubble status, showed a substantial relationship to the outcome at both the
student and school levels. At the student level, the bubble-status dummy was strongly
related to the nonaudit-audit difference in all seven of our included fourth-grade models
(Table 4). Across the models, the nonaudit-audit difference was larger for bubble students
by 0.16 to 0.38 standard deviations. Moreover, this relationship was statistically
significant at p < .001 in six of the seven models. The proportion of bubble students was
also substantially related to the school mean difference in Studies 2 and 3, and
significantly so in three of four models. However, this aggregate relationship was
inconsistent in sign and not significant in Study 1, which had the most severe problems of
statistical power.
Fourth-grade results for the low-income variables were weaker and were only
moderately consistent with expectations. In Study 2, which had the smallest problem of
statistical power, both student-level poverty and the proportion of poor students in the
school variables were positively and significantly related to the nonaudit-audit difference,
although these estimated relationships were considerably weaker than those of the
bubble-status variables. For example, the effect of the student-level poverty dummy in
Study 2 was 0.075 standard deviations, roughly one-third the size of the coefficient for
the bubble-status variable. In Study 3, results from two forms were consistent with
expectations; the exception was Form C, which is the form that also failed to show a
significant relationship with the school proportion of bubble students. The low-income
variables did not predict the outcome significantly in Study 1.
Results for the race/ethnicity variables were similar to those for low income.
Studies 2 and 3 showed expected relationships for the student-level Hispanic and Black
dummy variables in seven of eight models. However, Study 1 again did not yield
consistent findings for these variables.
Only two seventh-grade forms survived screening for extremely low reliability of
the difference scores: Study 2 and one form from Study 1. As in grade 4, the student-level
bubble-status dummy was substantially related to the outcome; the average bubble-
nonbubble difference was approximately 0.25 standard deviations (Table 5). The
association with the proportion of bubble students was very large and highly significant
in Study 2 but failed to reach statistical significance in Study 1. In Study 2, all of the
other student- and school-level variables conformed to expectations: poor students, black
students, and Hispanic students all showed larger audit-non-audit differences, and the
proportions of all three types of students predicted larger school mean differences on the
outcome. As in the fourth grade, these relationships were all smaller than the coefficients
of the bubble-status variables. In Study 1, however, while the school-level coefficients
were in the expected direction, no coefficient other than the student-level bubble-status
dummy was statistically significant.
The results from grade eight were less encouraging. Four forms survived
screening, two each from Study 1 and Study 3. Study 2, which generally provided more
statistical power, yielded difference scores in this grade with a reliability of only 0.03, so
it was excluded. The results from the two Study 3 forms were only partially consistent
with expectations. The coefficients of three student-level demographic variables, black,
Hispanic, and low-income, were positive and statistically significant in five of six cases,
although the estimates were not large, ranging from .04 to .09 standard deviations (Table
5). The estimates for the aggregates of these variables were inconsistent. Most striking,
the estimate for the bubble-status dummy was positive and significant in only one of the
two forms, and the coefficients for the proportion of bubble students were inconsistent in
sign. The two Study 1 forms yielded no clear patterns.
While we do not report the results for the 12 forms screened out, we noted one
pattern consistent with our expectations. Despite the very low reliability of the outcome
in the excluded forms, the coefficients for the bubble-status dummy, the proportion of
bubble students, or both were positive and significant in 9 of the 12 excluded forms (see
online Appendix D).
Sensitivity Analyses
We conducted sensitivity tests to evaluate the robustness of our results to three
possible changes: (i) the bandwidth used to define bubble status; (ii) estimating
differential effects in NYC and the rest of the state; and (iii) a different standardization
procedure. Our findings are robust to the first two choices, but they are sensitive to the
method used for standardization.
Bubble status. The definition of bubble status should include a small number of
students who are close enough to reaching proficiency that teachers have an incentive to
provide them with extra help, but the operational definition is arbitrary. The definition
used in our primary results, three points below proficient in the previous year, classified
6% to 15% as bubble students across grades and studies. We replicated our regressions
using two alternative definitions: two points below proficient, and from two points below
to two points above proficient. These alternative definitions resulted in only modest
differences in the proportions of students included and in estimated coefficients and did
not produce qualitatively different findings (Appendix A).
New York City. New York City, which enrolls substantially over 30 percent of
the students in New York State, might have distinct patterns of score inflation because it
differs in its demographic makeup and has a unique school accountability system.
However, adding interactions with a New York City dummy variable had appreciable
effects only in Study 2. In Study 2, the main effect for the New York City dummy was
negative (suggesting less overall inflation), while the coefficients for black students and
for schools with many black students were positive and larger than elsewhere in the state
(Appendix B). However, these findings were not replicated in the other studies, perhaps
in part because of problems of statistical power. Therefore, we did not include
interactions with the New York City dummy in our primary results.
Standardization. There are two defensible but substantively different ways to
standardize the subtest scores used to generate our outcome variable, and these two
methods produce different results (Appendix C). Our outcome variable was constructed
by differencing the proportion of possible points on the audit and nonaudit portions of the
test and then standardizing these difference scores. An alternative would be to
standardize the two components separately and then difference the standardized scores.
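The two options can be written compactly as below, reusing the hypothetical component columns from the earlier sketches; the paper's outcome corresponds to option A.

```python
import pandas as pd

def zscore(x: pd.Series) -> pd.Series:
    return (x - x.mean()) / x.std(ddof=1)

def outcome_difference_then_standardize(df: pd.DataFrame) -> pd.Series:
    """Option A (used here): difference the proportions, then standardize."""
    return zscore(df["p_nonaudit"] - df["p_audit"])

def outcome_standardize_then_difference(df: pd.DataFrame) -> pd.Series:
    """Option B (the alternative): standardize each component, then difference."""
    return zscore(df["p_nonaudit"]) - zscore(df["p_audit"])
```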
This decision has an impact because of differences between the distributions of
the audit and non-audit scores. The standard deviations of the audit scores are often larger
than those of the nonaudit scores. For example, in grade 7 in Study 2, the standard
deviation of non-audit scores was 0.20, while that of the audit scores was 0.27.
The choice of whether to standardize at the component level hinges on one’s
interpretation of these differences in distributions. If one believes that they are simply an
artifact of scaling, then it would be reasonable to standardize at the component level and
thus remove the artifact. On the other hand, if one believes that the differences in
distributions reflect the phenomenon under investigation, then standardizing at the
component level would be inappropriate because it would remove the subject of study.
Our interpretation is that the variance of the non-audit component is reduced by the
phenomenon under investigation, score inflation, as the rapid gains in scores on high-
stakes tests are often accompanied by severe right censoring (Ho & Yu, 2013). Thus, we
considered the greater standard deviation in the non-audit scores as substantively
meaningful and decided before comparing results that standardizing at the component
level would not be appropriate.
Discussion
This paper presents results from the first three trials of self-monitoring
assessments (SMAs), proposed by Koretz & Beguin (2010) as a means of monitoring
score inflation by embedding audit items in operational tests. These trials were not
designed to evaluate the magnitude of score inflation. Rather, they were designed as an
initial exploration of the feasibility of the SMA approach.
The findings presented here support the feasibility of the SMA approach. Indeed,
because of the severe problems of statistical power in these studies, our findings are
likely to substantially understate the potential of SMAs. This is an important finding,
given that score inflation is widespread and often very large and that auditing with an
independent test is expensive, time-consuming, and subject to several important
weaknesses (Koretz & Beguin, 2010). However, there were null results as well that we
believe have important implications for further research and development.
As Koretz & Beguin (2010) explained, unless one designs and scales audit items
at the inception of a testing program, one cannot place the audit and nonaudit items on a
single scale to obtain direct estimates of inflation. Instead, we were limited to a
difference-in-differences approach, where we examined the extent to which variables that
create pressure to raise scores and that have been shown in earlier research to be
associated with test preparation and inflation, predict the nonaudit-audit difference in our
data.
The findings presented above should be interpreted in the context of the large and
unavoidable risk of Type II errors. In Study 1, we confronted low statistical power
because of the small samples of students administered each form in the field test. More
important, in all three studies, our data had very low statistical power because of limits on
the number of audit items we could administer—in the most extreme case, Study 3, only
four audit items per school. This created a severe lack of statistical power despite the
large samples of students in Studies 2 and 3. The impact was apparent in the very low
student-level reliability of the nonaudit-audit difference scores—so low that in the case of
12 forms, the results seemed to be largely noise, could not be interpreted with confidence,
and were excluded from detailed reporting.
The strongest support for the SMA approach is found in the fourth-grade results.
All but one of the fourth-grade forms survived screening, and most of the results in the
fourth-grade models conformed to our expectations. Particularly important, in our
opinion, is the finding that the bubble-status variable was a consistent and strong
predictor of the nonaudit-audit difference. Of the several predictors we used, we
consider the bubble-status variable the one most directly related to the pressure on
teachers to raise scores. Findings with respect to demographic predictors were also
positive, although the magnitudes of these effects were generally considerably smaller
than those of the bubble-status variables. The findings from grades seven and eight,
particularly the latter, included markedly more null findings but provided some support
for our expectations.
While some of the variation in results is likely the result of noise, we suspect that
there is a more important factor: variations in the design of our audit items. The trials
presented here were the first attempts to implement SMAs, so we had no empirical
evidence from prior studies guiding the development of our audit items. In the case of
audit items that broadened coverage at the item or standards level, we based our designs solely on
examination of operational test items. We arrayed all items previously administered in
chronological order and examined them for predictable patterns of specific content and
representations. We assumed that coaching would focus on the most obvious of these
regularities, and we designed our audit items to undo that specific regularity. We
attempted to make the differences between audit and non-audit items as minor as possible
apart from this single change. This was intentional, to make the matched items close to a
minimal contrast.
We suspect that some of our audit items did not function as expected for one of
two reasons: our speculation about the nature of coaching for the operational items was
incorrect, or the changes we made were too small. However, our data are not sufficient to
permit us to distinguish these possible factors from the effects of noise.
Study 2 differed from the others in relying solely on audit items that tested
previously untested standards (US items). These are appropriate for auditing because
their content is an explicit component of the target of inference. Indeed, this is why New
York State chose to add the US items, which its contractor designed as operational
items to be scored, not for audit purposes. We note that in grades four and seven, the
results of Study 2 were the most consistent with our expectations, while in grade eight,
the Study 2 form was screened out because of extremely low reliability of the outcome.
This further underscores the current lack of clarity about the design of audit items: in
some cases, US items work unusually well, while in others, they do not function
successfully as audit items.
The conventional call for additional research is particularly appropriate here.
First, given that these studies were the first of their kind, were conducted in a single
setting using a single operational testing program, and entailed substantial data
limitations, the need for replications is clear. Second, there is as yet very little research
guiding the development of audit items. We suggest two specific avenues for future
research on the design of audit items. First, it would be valuable to attempt to link the
design of items to systematic data on the characteristics of preparation for the operational
test in question, rather than relying solely on the characteristics of previously
administered items. This information might be gathered by means of surveys,
observational or video studies, or examination of test-preparation materials. Second,
additional trials of the SMA approach could be structured to provide comparisons among
different item designs—for example, to compare the functioning of several different audit
items for a single operational item.
Finally, the problem of statistical power that we confronted in these studies is
itself an important focus for further investigation. Our most serious problems of power
arose from the sparse sampling of items, not the sampling of students. Space for audit
items will be limited in large-scale testing programs, particularly those that provide
operational test scores for individual students, because of limits on total test length. As
the field learns how to construct more effective audit items, the power of nonaudit-audit
comparisons will increase, but the number of audit items is likely to remain an issue
nonetheless. This suggests a need for research on the size of audit components needed for
effective use in operational assessments with various test designs.
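A rough calculation illustrates how slowly precision improves as audit items are added. Treating a single student's audit score as a proportion correct over k items, the binomial approximation to its standard error is sqrt(p(1 - p)/k). The short sketch below uses hypothetical values, not estimates from our data, and shows that even quadrupling a four-item audit block leaves considerable item-sampling error at the student level.

    import math

    p = 0.6  # hypothetical probability of answering an audit item correctly
    for k in (4, 8, 16, 32):
        se = math.sqrt(p * (1 - p) / k)
        print(f"{k:2d} audit items: SE of proportion correct is about {se:.3f}")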
References
Abrams, L. M., Pedulla, J. J., & Madaus, G. F. (2003). Views from the classroom:
Teachers’ opinions of statewide testing programs. Theory into Practice, 42(1), 18-
29.
Booher-Jennings, J. (2005). Below the bubble: "Educational Triage" and the Texas
Accountability System. American Educational Research Journal, 42(2), 231-268.
Cimbricz, S. (2002). State-mandated testing and teachers’ beliefs and practice. Education
Policy Analysis Archives, 10(2). Retrieved from
http://epaa.asu.edu/epaa/v10n2.html
Diamond, J. B., & Spillane, J. P. (2004). High-stakes accountability in urban elementary
schools: Challenging or reproducing inequality? Teachers College Record, 106(6),
1145-1176.
Eisner, E. (2001). What does it mean to say a school is doing well? Phi Delta Kappan,
82(5), 367-372.
Figlio, D., & Getzler, L. (2006). Accountability, ability and disability: Gaming the
system? In T. Gronberg & D. Jansen (Eds.), Advances in Applied Microeconomics
(Vol. 14, pp. 35–49). Elsevier.
Firestone, W. A., Camilli, G., Yurecko, M., Monfils, L., & Mayrowetz, D. (2000). State
standards, socio-fiscal context and opportunity to learn in New Jersey. Education
Policy Analysis Archives, 8(35). Retrieved from http://olam.ed.asu.edu/epaa/v8n35.
Gillborn, D. & Youdell, D. (2000). Rationing education: policy, practice, reform, and
equity. Buckingham, UK: Open University Press.
Hambleton, R. K., Jaeger, R. M., Koretz, D., Linn, R. L., Millman, J., & Phillips, S. E.
(1995). Review of the measurement quality of the Kentucky Instructional Results
Information System, 1991–1994. Frankfort: Office of Education Accountability,
Kentucky General Assembly, June.
Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J.,
Naftel, S., & Barney, H. (2007). Standards-based accountability under No Child
Left Behind: Experiences of teachers and administrators in three states. Santa
Monica, CA: RAND Corporation.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy
Analysis Archives, 8(41). Retrieved from
http://epaa.asu.edu/ojs/article/view/432/828
Herman, J. L., & Golan, S. (1993). The effects of standardized testing on teaching and
schools. Educational Measurement: Issues and Practice, 12(4), 20-25, 41-42.
Ho, A. D. (2009). A nonparametric framework for comparing trends and gaps across
tests. Journal of Educational and Behavioral Statistics, 34(2), 201-228.
Ho, A. D., & Haertel, E. H. (2006). Metric-Free Measures of Test Score Trends and Gaps
with Policy-Relevant Examples (CSE Report 665). Los Angeles, CA: National Center
for Research on Evaluation, Standards, and Student Testing (CRESST), Center for
the Study of Evaluation, University of California, Los Angeles.
Ho, A., & Yu, C. (2013). Descriptive statistics for modern test score distributions:
Skewness, kurtosis, discreteness, and ceiling effects. Unpublished working paper.
Holcombe, R., Jennings, J. L., & Koretz, D. (2013). The roots of score inflation: An
examination of opportunities in two states’ tests. In G. Sunderman (Ed.), Charting
Reform, Achieving Equity in a Diverse Nation, 163-189. Greenwich, CT:
Information Age Publishing
Jacob, B. A. (2005). Accountability, incentives and behavior: The impact of high-stakes
testing in the Chicago public schools. Journal of Public Economics, 89(5-6), 761-
796. doi: 10.1016/j.jpubeco.2004.08.004
Jacob, B. (2007). Test-based accountability and student achievement: An investigation of
differential performance on NAEP and state assessments. Cambridge, MA:
National Bureau of Economic Research (Working Paper 12817).
Jacob, R. T., Stone, S., & Roderick, M. (2004). Ending social promotion: The response of
teachers and students. Chicago, IL: Consortium on Chicago School Research.
Retrieved March 29, 2011, from http://www.eric.ed.gov/PDFS/ED483823.pdf
Jennings, J. L., Bearak, J. M., & Koretz, D. M. (2011). Accountability and racial
inequality in American education. Paper presented at the Annual Meetings of the
American Sociological Association.
Jones, M. G., Jones, B. D., & Hargrove, T. Y. (2003). The unintended consequences of
high-stakes testing. Lanham, MD: Rowman & Littlefield Publishers, Inc.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test
scores in Texas tell us? Santa Monica, CA: RAND (Issue Paper IP-202). Last
accessed from http://www.rand.org/publications/IP/IP202/ on June 4, 2013.
Koretz, D., & Barron, S. I. (1998). The Validity of Gains on the Kentucky Instructional
Results Information System (KIRIS). MR-1014-EDU, Santa Monica: RAND.
Koretz, D., & Béguin, A. (2010). Self-Monitoring Assessments for Educational
Accountability Systems. Measurement: Interdisciplinary Research & Perspective,
8, 92–109. doi:10.1080/15366367.2010.508685
Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L.
Brennan (Ed.), Educational measurement (4th ed.), 531-578. Westport, CT:
American Council on Education/Praeger.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high-
stakes testing: Preliminary evidence about generalization across tests, in R. L.
Linn (chair), The Effects of High Stakes Testing, symposium presented at the
annual meetings of the American Educational Research Association and the
National Council on Measurement in Education, Chicago, April.
Ladd, H. F., & Zelli, A. (2002). School-based accountability in North Carolina: The
responses of school principals. Educational Administration Quarterly, 38(4), 494-
529. doi: 10.1177/001316102237670
Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F.
Lindquist (Ed.), Educational Measurement (pp. 119-184). Washington, DC:
American Council on Education.
Lipman, P. (2002). Making the global city, making inequality: The political economy and
cultural politics of Chicago school policy. American Educational Research
Journal, 39(2), 379-419.
Luna, C., & Turner, C. L. (2001). The impact of the MCAS: Teachers talk about high-
stakes testing. The English Journal, 91(1), 79-87.
Madaus, G. F. (1988). The distortion of teaching and testing: High-stakes testing and
instruction. Peabody Journal of Education, 65(3), 29-46.
McNeil, L. M. (2000). Contradictions of school reform: Educational costs of
standardized testing. New York, NY: Routledge.
McNeil, L. M. & Valenzuela, A. (2001). The harmful impact of the TAAS system of
testing in Texas: Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield
(Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in
public education (pp. 127-150). New York, NY: Century Foundation.
Neal, D., & Schanzenbach, D. (2010). Left behind by design: Proficiency counts and
test-based accountability. Review of Economics & Statistics, 92(2), 263–283.
New York City Department of Education (2007). Mayor Bloomberg and Chancellor
Klein release first-ever public school Progress Reports [Press release]. Retrieved
from http://schools.nyc.gov/Offices/mediarelations/NewsandSpeeches/2007-
2008/20071105_progress_reports.htm.
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J.
(2003). Perceived effects of state-mandated testing programs on teaching and
learning: Findings from a national survey of teachers. Boston, MA: National Board
on Educational Testing and Public Policy. Retrieved from
http://www.bc.edu/research/nbetpp/statements/nbr2.pdf
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and
data analysis methods, second edition. Thousand Oaks, CA: Sage.
Rubinstein, J. (2000). Cracking the MCAS grade 10 math. New York: Princeton Review
Publishing.
Severson, K. (2011). A scandal of cheating, and a fall from grace. The New York Times,
September 7, p. A16. Last retrieved on June 5, 2013 from
http://www.nytimes.com/2011/09/08/us/08hall.html?pagewanted=all&_r=0.
Shen, X. (2008). Do unintended effects of high-stakes testing hit disadvantaged schools
harder? (Doctoral dissertation, Stanford University).
Shepard, L. A. (1988). The harm of measurement-driven instruction. Paper presented at
the annual meeting of the American Educational Research Association,
Washington, DC.
Stecher, B. (2002). Consequences of large-scale, high-stakes testing on school and
classroom practice. In L. Hamilton, et al., Test-based Accountability: A Guide for
Practitioners and Policymakers. Santa Monica: RAND. Retrieved from
http://www.rand.org/publications/MR/MR1554/MR1554.ch4.pdf.
Stecher, B. M., & Barron, S. I. (2001). Unintended consequences of test-based
accountability when testing in “milepost” grades. Educational Assessment, 7(4),
259-281.
Stecher, B. M., Epstein, S., Hamilton, L. S., Marsh, J. A., Robyn, A., McCombs, J. S.,
Russell, J., & Naftel, S. (2008). Pain and gain: Implementing NCLB in three states,
2004 – 2006. Santa Monica, CA: RAND. Retrieved from
http://www.rand.org/pubs/monographs/2008/RAND_MG784.pdf
Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2002). A survey of teachers’
perspectives on high-stakes testing in Colorado: What gets taught, what gets lost
(CSE Technical Report 588). Los Angeles, CA: University of California. Retrieved
September 20, 2010, from http://eric.ed.gov/PDFS/ED475139.pdf
Tisch, M. (2009). What the NAEP results mean for New York. Latham, N.Y.: New York
School Boards Association (November 9). Last retrieved on June 24, 2012 from
http://www.nyssba.org/index.php?src=news&refno=1110&category=On%20Boar
d%20Online%20November%209%202009.
University of the State of New York (2011). New York State Student Information
Repository System (SIRS) manual. Albany, N. Y.: Author. Last accessed June 24,
2013 from page 244 of http://www.p12.nysed.gov/irs/sirs/archive/2010-
11SIRSManual6-2.pdf.
Urdan, T. C., & Paris, S. G. (1994). Teachers’ perceptions of standardized achievement
tests. Educational Policy, 8(2), 137-157.
Tables

Table 1
Descriptive statistics for analytic samples, by grade

                                Grade 4                       Grade 7                       Grade 8
                        Study 1  Study 2  Study 3     Study 1  Study 2  Study 3     Study 1  Study 2  Study 3
# of Students             6,301  185,164  193,538       7,662  185,615  195,273       8,420  186,645  195,941
# of Schools                852    2,342    2,433         573    1,356    1,426         517    1,332    1,504
Student-Level
 % White                   0.52     0.51     0.48        0.55     0.52     0.50        0.56     0.53     0.51
 % Asian                   0.09     0.08     0.18        0.08     0.08     0.18        0.09     0.08     0.08
 % Black                   0.17     0.18     0.23        0.17     0.18     0.22        0.16     0.18     0.18
 % Hispanic                0.21     0.22     0.08        0.19     0.21     0.08        0.19     0.20     0.22
 % Other                   0.01     0.01     0.02        0.01     0.01     0.01        0.01     0.01     0.01
 % Low Income              0.51     0.55     0.56        0.48     0.51     0.53        0.46     0.50     0.52
 % Bubble Student          0.14     0.15     0.14        0.08     0.09     0.08        0.08     0.09     0.07
School-Level
 Proportion White          0.52     0.53     0.51        0.52     0.50     0.48        0.51     0.50     0.49
 Proportion Asian          0.08     0.06     0.07        0.06     0.05     0.05        0.07     0.05     0.05
 Proportion Black          0.18     0.19     0.20        0.20     0.22     0.23        0.21     0.22     0.23
 Proportion Hispanic       0.21     0.20     0.21        0.21     0.22     0.22        0.21     0.21     0.22
 Proportion Other          0.01     0.01     0.02        0.01     0.01     0.01        0.01     0.01     0.01
 Proportion Low Income     0.51     0.55     0.55        0.56     0.57     0.58        0.54     0.56     0.56
 Proportion Bubble Student 0.14     0.15     0.14        0.09     0.10     0.07        0.08     0.10     0.07
Table 2
Number and type of audit item for the three studies, by grade

                   Broadened at   Broadened at       Untested    Not in
                   item level     standards level    standards   standards   Total
Study 1  Grade 4        6               6                0           6         18
         Grade 7        8               3                1           6         18
         Grade 8        7               4                4           9         24
Study 2  Grade 4        0               0               13           0         13
         Grade 7        0               0               15           0         15
         Grade 8        0               0               13           0         13
Study 3  Grade 4        4               5                1           6         16
         Grade 7        6               3                1           6         16
         Grade 8        4               1                2           8         15*
*Note: One audit item was incorrectly transcribed in production and could not be used.
Table 3
Screening forms based on the reliability of the difference and differences in difficulty

                              Difference in difficulty
                     rxx     Matched items   All items   Decision
Grade 4
 Study 1, Form 1     0.07                      0.24       Keep
 Study 1, Form 2     0.18                      0.31       Keep
 Study 1, Form 3     0.10                      0.24       Keep
 Study 2             0.11                      0.16       Keep
 Study 3, Form A     0.14        0.12          0.24       Keep
 Study 3, Form B     0.05        0.29          0.23       Keep
 Study 3, Form C     0.10                      0.19       Keep
 Study 3, Form D     0.00        0.12          0.24       Exclude
Grade 7
 Study 1, Form 1     0.00                      0.10       Exclude
 Study 1, Form 2     0.01                      0.10       Exclude
 Study 1, Form 3     0.10                      0.23       Keep
 Study 2             0.08                      0.016      Keep
 Study 3, Form A     0.00       -0.04          0.11       Exclude
 Study 3, Form B     0.00        0.08          0.06       Exclude
 Study 3, Form C     0.00        0.03          0.01       Exclude
 Study 3, Form D     0.00                      0.13       Exclude
Grade 8
 Study 1, Form 1     0.09                      0.20       Keep
 Study 1, Form 2     0.04                      0.16       Exclude
 Study 1, Form 3     0.07                      0.20       Keep
 Study 1, Form 4     0.00                      0.06       Exclude
 Study 2             0.03                      0.2        Exclude
 Study 3, Form A     0.12        0.23          0.23       Keep
 Study 3, Form B     0.08        0.08          0.12       Keep
 Study 3, Form C     0.00                      0.21       Exclude
 Study 3, Form D     0.03       -0.12          0.12       Exclude
Table 4
Regression results for studies 1, 2, and 3, selected forms and variables, grade 4

                                       Study 1                      Study 2              Study 3
                            Form 1     Form 2     Form 3                       Form A     Form B     Form C
Student-Level Variables
 Black                       0.118     -0.037      0.07       0.043***         0.092***  -0.019      0.080***
                            (0.083)    (0.095)    (0.089)    (0.009)          (0.019)    (0.019)    (0.019)
 Hispanic                   -0.017     -0.033      0.041      0.088***         0.107***   0.062***   0.076***
                            (0.075)    (0.080)    (0.079)    (0.008)          (0.017)    (0.017)    (0.017)
 Low Income                  0.041      0.113     -0.054      0.075***         0.069***   0.100***  -0.034**
                            (0.062)    (0.068)    (0.067)    (0.007)          (0.013)    (0.013)    (0.013)
 Bubble Student              0.266***   0.381***   0.196**    0.219***         0.306***   0.185***   0.159***
                            (0.064)    (0.069)    (0.067)    (0.006)          (0.014)    (0.013)    (0.014)
School-Level Variables
 Proportion Black           -0.105      0.19      -0.176     -0.125***        -0.060     -0.058      0.034
                            (0.136)    (0.152)    (0.148)    (0.025)          (0.042)    (0.045)    (0.051)
 Proportion Hispanic         0.049      0.429**    0.053     -0.048            0.017      0.000      0.142**
                            (0.138)    (0.144)    (0.146)    (0.027)          (0.044)    (0.048)    (0.055)
 Proportion Low Income       0.131      0.148      0.229      0.143***         0.098*     0.153***  -0.002
                            (0.110)    (0.120)    (0.121)    (0.024)          (0.039)    (0.044)    (0.049)
 Proportion Bubble Students  0.027      0.107     -0.357      0.408***         0.636***   0.455**    0.300
                            (0.168)    (0.181)    (0.187)    (0.083)          (0.144)    (0.157)    (0.180)
 N                           2101       2113       2087       185164           48113      45773      44257
Note. *p<0.05 **p<0.01 ***p<0.001
Table 5
Main regression results for studies 1, 2, and 3, selected forms and variables, grade 7 and 8

                                   Grade 7                             Grade 8
                            Study 1    Study 2        Study 1                 Study 3
                            Form 3                    Form 1     Form 3       Form A     Form B
Student-Level Variables
 Black                       0.02       0.050***       0.05       0.112        0.075***   0.068***
                            (0.079)    (0.009)        (0.087)    (0.091)      (0.018)    (0.018)
 Hispanic                   -0.002      0.114***       0.049      0.099        0.057***   0.085***
                            (0.071)    (0.008)        (0.080)    (0.082)      (0.016)    (0.016)
 Low Income                 -0.024      0.075***      -0.048     -0.062        0.035**   -0.008
                            (0.053)    (0.006)        (0.061)    (0.061)      (0.012)    (0.012)
 Bubble Student              0.246***   0.246***       0.149      0.096        0.181***   0.005
                            (0.073)    (0.008)        (0.080)    (0.082)      (0.017)    (0.018)
School-Level Variables
 Proportion Black            0.05      -0.267***      -0.085     -0.034       -0.453***   0.109
                            (0.141)    (0.034)        (0.148)    (0.150)      (0.062)    (0.066)
 Proportion Hispanic         0.152     -0.246***       0.154      0.002       -0.375***   0.278***
                            (0.151)    (0.040)        (0.156)    (0.151)      (0.071)    (0.075)
 Proportion Low Income       0.165      0.217***      -0.055      0.07         0.221***  -0.265***
                            (0.125)    (0.036)        (0.129)    (0.124)      (0.067)    (0.071)
 Proportion Bubble Students  0.213      0.790***       0.803*     0.504        0.356     -0.606*
                            (0.265)    (0.161)        (0.324)    (0.323)      (0.294)    (0.303)
 N                           2552       185615         2118       2137         46838      47078
Note. *p<0.05 **p<0.01 ***p<0.001
Figures
Figure 1. Histograms of the outcome variable, the difference in proportion correct between non-audit and audit items, for grades 4 (top) and 8 (bottom) in Study 2.
Appendix A

Table A1
Regression results for sensitivity analysis: changing definition of bubble student for studies 1, 2, and 3 in grade 7

                                        Study 1                              Study 2                              Study 3
                          Plus Minus 3  Minus 2    Plus Minus 2  Plus Minus 3  Minus 2    Plus Minus 2  Plus Minus 3  Minus 2    Plus Minus 2
Student-Level
 Asian                      -0.048      -0.049      -0.046        -0.115***    -0.117***   -0.113***     -0.052***    -0.052***   -0.050***
                            (0.053)     (0.053)     (0.053)       (0.010)      (0.010)     (0.010)       (0.010)      (0.010)     (0.010)
 Black                       0.002       0.003       0.000         0.050***     0.052***    0.049***     -0.013       -0.013      -0.014
                            (0.047)     (0.047)     (0.047)       (0.009)      (0.009)     (0.009)       (0.009)      (0.009)     (0.009)
 Hispanic                    0.029       0.031       0.028         0.114***     0.114***    0.111***     -0.004       -0.004      -0.004
                            (0.042)     (0.042)     (0.042)       (0.008)      (0.008)     (0.008)       (0.008)      (0.008)     (0.008)
 Low Income                  0.022       0.022       0.023         0.075***     0.078***    0.077***     -0.032***    -0.032***   -0.033***
                            (0.031)     (0.031)     (0.031)       (0.006)      (0.006)     (0.006)       (0.006)      (0.006)     (0.006)
 Bubble Student              0.117**    -0.098*     -0.096*        0.246***    -0.161***   -0.156***      0.085***     0.005       0.008
                            (0.043)     (0.040)     (0.040)       (0.008)      (0.021)     (0.020)       (0.009)      (0.023)     (0.023)
 NYC                        -0.098*      0.091       0.109**      -0.156***     0.241***    0.256***      0.007        0.092***    0.111***
                            (0.040)     (0.051)     (0.037)       (0.020)      (0.009)     (0.007)       (0.023)      (0.011)     (0.008)
School-Level
 Proportion Asian           -0.149      -0.171      (0.165)       -0.466***    -0.496***   -0.441***     -0.030       -0.036      (0.031)
                            (0.115)     (0.114)     (0.115)       (0.070)      (0.069)     (0.069)       (0.077)      (0.076)     (0.077)
 Proportion Black           -0.042      -0.043      (0.034)       -0.267***    -0.265***   -0.254***     -0.113**     -0.115**    -0.111**
                            (0.083)     (0.083)     (0.083)       (0.034)      (0.034)     (0.033)       (0.038)      (0.038)     (0.038)
 Proportion Hispanic         0.081       0.079       0.081        -0.246***    -0.245***   -0.233***     -0.022       -0.025      (0.022)
                            (0.087)     (0.087)     (0.087)       (0.040)      (0.040)     (0.039)       (0.046)      (0.046)     (0.046)
 Proportion Low Income       0.124       0.134       0.125         0.217***     0.230***    0.207***     -0.021       -0.015      (0.019)
                            (0.073)     (0.073)     (0.073)       (0.036)      (0.036)     (0.036)       (0.043)      (0.042)     (0.042)
 Proportion Bubble Student   0.163       0.007       0.023         0.790***     0.871***    0.756***      0.296        0.358       0.208
                            (0.152)     (0.186)     (0.138)       (0.161)      (0.201)     (0.129)       (0.195)      (0.251)     (0.165)
 Intercept                   0.040*      0.040*      0.039*        0.059***     0.062***    0.059***     -0.005       -0.004      (0.005)
                            (0.019)     (0.019)     (0.019)       (0.010)      (0.010)     (0.010)       (0.010)      (0.010)     (0.010)
 N                           7,662       7,662       7,662         185,615      185,615     185,615       182,580      182,580     182,580
Note. *p<0.05 **p<0.01 ***p<0.001
Appendix B

Table B1
Regression results for sensitivity analysis: NYC interactions in study 2, grade 7

                                        Grade 7
Student-Level Variables
 Asian                             -0.120***  (0.015)
 Black                              0.031**   (0.012)
 Hispanic                           0.083***  (0.011)
 Low Income                         0.070***  (0.007)
 Bubble Student                     0.216***  (0.008)
 NYC                               -0.216***  (0.018)
School-Level Variables
 Proportion Asian                  -0.365***  (0.065)
 Proportion Black                  -0.180***  (0.033)
 Proportion Hispanic               -0.058     (0.036)
 Proportion Low Income              0.156***  (0.027)
 Proportion Bubble Students         0.294**   (0.098)
NYC Interactions
 NYC x Asian                        0.009     (0.022)
 NYC x Black                        0.040     (0.020)
 NYC x Hispanic                     0.021     (0.018)
 NYC x Low Income                   0.039*    (0.018)
 NYC x Bubble Student               0.009     (0.014)
 NYC x Proportion Asian             0.412***  (0.095)
 NYC x Proportion Black             0.276***  (0.062)
 NYC x Proportion Hispanic          0.211**   (0.067)
 NYC x Proportion Low Income       -0.082     (0.060)
 NYC x Proportion Bubble Student    0.136     (0.185)
 N                                 185,164
Note. *p<0.05 **p<0.01 ***p<0.001
Appendix C

Table C1
Regression results for sensitivity analysis: standardizing before the difference is taken in
outcome variable for studies 1, 2, and 3 in grade 7

                                     Study 1                      Study 2                      Study 3
                            Standardize  Standardize     Standardize  Standardize     Standardize  Standardize
                            Before       After           Before       After           Before       After
Student-Level Variables
 Asian                      -0.049       -0.048           0.024***    -0.115***        0.066***    -0.052***
                            (0.049)      (0.053)         (0.007)      (0.010)         (0.009)      (0.010)
 Black                       0.005        0.002          -0.071***     0.050***       -0.114***    -0.013
                            (0.044)      (0.047)         (0.006)      (0.009)         (0.008)      (0.009)
 Hispanic                    0.029        0.029           0.000        0.114***       -0.077***    -0.004
                            (0.039)      (0.042)         (0.006)      (0.008)         (0.007)      (0.008)
 Low Income                  0.024        0.022          -0.054***     0.075***       -0.130***    -0.032***
                            (0.029)      (0.031)         (0.004)      (0.006)         (0.005)      (0.006)
 Bubble Student              0.113**      0.117**         0.086***     0.246***       -0.012        0.085***
                            (0.040)      (0.043)         (0.005)      (0.008)         (0.008)      (0.009)
 NYC                        -0.095*      -0.098*         -0.047**     -0.156***        0.057***     0.007
                            (0.037)      (0.040)         (0.015)      (0.020)         (0.016)      (0.023)
School-Level Variables
 Proportion Asian           -0.146       -0.149          -0.087       -0.466***        0.223***    -0.030
                            (0.107)      (0.115)         (0.052)      (0.070)         (0.053)      (0.077)
 Proportion Black           -0.039       -0.042          -0.159***    -0.267***       -0.048       -0.113**
                            (0.078)      (0.083)         (0.025)      (0.034)         (0.027)      (0.038)
 Proportion Hispanic         0.074        0.081          -0.141***    -0.246***        0.034       -0.022
                            (0.082)      (0.087)         (0.029)      (0.040)         (0.032)      (0.046)
 Proportion Low Income       0.123        0.124           0.006        0.217***       -0.209***    -0.021
                            (0.068)      (0.073)         (0.027)      (0.036)         (0.030)      (0.043)
 Proportion Bubble Students  0.16         0.163          -0.046        0.790***       -0.352*       0.296
                            (0.142)      (0.152)         (0.119)      (0.161)         (0.142)      (0.195)
 Intercept                   0.039*       0.040*          0.002        0.059***       -0.026***    -0.005
                            (0.017)      (0.019)         (0.007)      (0.010)         (0.007)      (0.010)
 N                           7,662        7,662           185,615      185,615         182,580      182,580
Note. *p<0.05 **p<0.01 ***p<0.001
Appendix D

Table D1
Regression results for forms excluded from Tables 4 and 5 because of very low or zero
reliability of the difference score, for grades 4 & 8

                                Grade 4                            Grade 8
                                Study 3       Study 1                  Study 2     Study 3
                                Form D        Form 3     Form 4                    Form B     Form C
Student-Level Variables
 Black                          -0.033         0.112      0.128        -0.014       0.068***  -0.033
                                (0.018)       (0.091)    (0.084)       (0.009)     (0.018)    (0.019)
 Hispanic                        0.016         0.099      0.052         0.006       0.085***   0.007
                                (0.016)       (0.082)    (0.078)       (0.008)     (0.016)    (0.018)
 Low Income                     -0.025*       -0.062      0.047        -0.018**    -0.008     -0.032*
                                (0.012)       (0.061)    (0.057)       (0.006)     (0.012)    (0.013)
 Bubble Student                  0.160***      0.096      0.036         0.074***    0.005      0.072***
                                (0.013)       (0.082)    (0.076)       (0.008)     (0.018)    (0.019)
School-Level Variables
 Proportion Black               -0.038        -0.034      0.035        -0.195***    0.109      0.114
                                (0.045)       (0.150)    (0.140)       (0.037)     (0.066)    (0.059)
 Proportion Hispanic            -0.107*        0.002      0.035        -0.174***    0.278***   0.081
                                (0.050)       (0.151)    (0.145)       (0.045)     (0.075)    (0.072)
 Proportion Low Income           0.038         0.07      -0.135         0.101*     -0.265***  -0.190**
                                (0.044)       (0.124)    (0.115)       (0.041)     (0.071)    (0.064)
 Proportion Bubble Students      0.552***      0.504     -0.118         0.509**    -0.606*     0.292
                                (0.153)       (0.323)    (0.292)       (0.166)     (0.303)    (0.311)
 N                               44538         2137       2055          186645      47078      44830
Note. *p<0.05 **p<0.01 ***p<0.001
Table D2
Regression results for forms excluded from Table 5 because of very low or zero reliability
of the difference score, for grade 7

                                            Grade 7
                                  Study 1                     Study 3
                            Form 1     Form 2      Form A     Form B     Form C     Form D
Student-Level Variables
 Black                       0.028     -0.019      -0.064***   0.076***  -0.011     -0.060**
                            (0.077)    (0.078)     (0.019)    (0.018)    (0.016)    (0.019)
 Hispanic                    0.150*    -0.064      -0.015      0.034*    -0.013     -0.031
                            (0.070)    (0.070)     (0.017)    (0.017)    (0.015)    (0.017)
 Low Income                  0.088      0.017      -0.013     -0.064***   0.010     -0.060***
                            (0.053)    (0.052)     (0.013)    (0.012)    (0.011)    (0.013)
 Bubble Student              0.034      0.055       0.127***   0.127***   0.086***  -0.008
                            (0.071)    (0.073)     (0.019)    (0.019)    (0.017)    (0.020)
School-Level Variables
 Proportion Black           -0.063     -0.147      -0.145*    -0.127**   -0.149**   -0.054
                            (0.135)    (0.140)     (0.061)    (0.049)    (0.048)    (0.054)
 Proportion Hispanic        -0.147      0.22       -0.094     -0.040      0.001      0.004
                            (0.142)    (0.145)     (0.070)    (0.054)    (0.056)    (0.064)
 Proportion Low Income       0.207     -0.013       0.029     -0.125*     0.009     -0.063
                            (0.119)    (0.122)     (0.066)    (0.053)    (0.050)    (0.059)
 Proportion Bubble Students  0.306      0.063       0.991**    0.014      0.434     -0.384
                            (0.250)    (0.247)     (0.315)    (0.254)    (0.257)    (0.278)
 N                           2579       2531        46652      47329      45080      43519
Note. *p<0.05 **p<0.01 ***p<0.001