Validating Formative and Interim Assessments Under ESSA
Michael B. Bunch
Measurement Incorporated
Introduction
This paper focuses on an orderly integration of formative, interim, and summative
assessments to improve student achievement, as described in the Every Student Succeeds
Act (ESSA) of 2015. Within that framework, it addresses the validation of formative
assessments in terms of growth over time, rather than long-term prediction of
performance. The primary vehicle for this work is Measurement Incorporated’s Project
Essay Grade (PEG), an automated scoring engine we have been using for several years for
formative, interim, and summative assessments.
ESSA
ESSA provides hundreds of millions of dollars for states to develop (either individually or
collaboratively) balanced assessment systems comprising summative, interim, and
formative assessments and to assist local education agencies in implementing them.
Specifically:
“SEC. 1204. Innovative assessment and accountability demonstration authority.
“(a) Innovative assessment system defined.—The term ‘innovative assessment system’
means a system of assessments that may include—
“(1) competency-based assessments, instructionally embedded assessments, interim
assessments, cumulative year-end assessments, or performance-based assessments that
combine into an annual summative determination for a student, which may be administered
through computer adaptive assessments; and
“(2) assessments that validate when students are ready to demonstrate mastery or
proficiency and allow for differentiated student support based on individual learning needs.
Title II (Section 2103) authorizes funds to provide training to classroom teachers, principals,
or other school leaders for the purpose of implementing these assessments. This training
can include in-service training on new assessment technologies and means of integrating results of
those assessments into classroom instruction.
Integrating Formative, Interim, and Summative Assessments
The relationships among formative, interim, and summative assessments are depicted in
Figure 1. In this system, students take a pretest (a form of interim assessment or possibly a
summative assessment from a previous instructional period) in order to ascertain content
mastery. Since there will be a range of scores on this test, instruction may be individualized,
as may subsequent formative assessments; i.e., as students proceed through a course of
instruction, they will likely be engaged in different activities. Their formative assessments
are then tailored to their progress. While such an approach might be difficult to manage
with teacher-made paper tests, it becomes more manageable with online, on-demand tests
such as those afforded in a balanced assessment system. Moreover, in such a system,
feedback can be immediate and targeted so that, at least in theory, students benefit from a
series of formative assessments and improve their chances of doing well on subsequent
interim assessments as well as the summative assessment.
Figure 1. Relationships among formative, interim, and summative assessments
Validation of summative assessments has been the central focus of the educational
measurement profession since its inception. More recently, validation of interim
assessments has gained attention. Validation of formative assessments may require
innovative approaches. If the objective of formative (and even interim) assessments is to
improve student performance on subsequent assessments, predictive validity frameworks
will fail precisely when formative assessments succeed: if they properly inform intervening
instruction, students who perform poorly on early assessments improve at different rates,
so that most or all students perform better on later assessments and early scores lose their
predictive power. Thus, the principal focus of
validation of formative assessments is the degree to which they promote improvement on
interim and summative assessments.
PEG
The following is a brief introduction to Project Essay Grade (PEG). For a more detailed
description of the program, its history, and its applications, see Bunch, Vaughn, & Miel
(2016).
Overview. Ellis Batten Page (1924–2005) is widely acknowledged as the father of
automated essay scoring. Page (1966) reported on an early effort to understand how
human beings graded student essays and to translate the process into a computer program.
That program, Project Essay Grade, or PEG®, was designed to score student essays using
mainframe computers in the 1960s.
As a result of Page’s work, two new terms entered the lexicon: trin and prox. A trin is an
intrinsic characteristic of writing, such as diction or style. A prox is a quantifiable
approximation of that intrinsic characteristic. These terms have since been replaced by
“features,” and there is no practical distinction between intrinsic and objectified features.
“Artificial intelligence,” at least in this context, has been replaced by “automated essay
scoring.”
The initial PEG work focused on essays written by 276 high school students and graded by
four English teachers. Those essays yielded 31 proxes (assigned by PEG in accordance with
rules devised by Page) used as predictors of scores assigned by teachers. Page and his
colleagues calculated the correlation between a weighted composite of the 31 proxes and
the scores assigned by teachers. The resulting multiple R was .71. Considering that the
correlation between scores assigned by two English teachers is typically not much higher,
these results were quite remarkable.
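Page's procedure can be reproduced in a few lines with modern tools. The sketch below uses a simulated feature matrix standing in for the 31 proxes and teacher scores, which are not publicly available; it shows only the mechanics of estimating a multiple R, not the 1966 results.

```python
# Sketch of Page-style prox regression. Data are simulated, not the 1966 dataset:
# 276 essays, 31 proxes, and one composite teacher score per essay.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(276, 31))                            # prox values for each essay
y = X @ rng.normal(size=31) + rng.normal(scale=4.0, size=276)  # simulated teacher scores

model = LinearRegression().fit(X, y)
# Multiple R is the correlation between predicted and observed scores.
multiple_r = np.corrcoef(model.predict(X), y)[0, 1]
print(f"multiple R = {multiple_r:.2f}")
```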
Page applied the tools available to him as an English teacher: a deep understanding of the
intrinsic qualities of good writing (trins) and the ability to translate those qualities into
objective units (proxes). He then applied the tools available to him as a psychometrician:
multiple regression and the ability to interpret its results. In doing so, he created the field of
automated essay scoring (AES).
Measurement Incorporated (MI) acquired PEG in 2003 and has worked continuously to
improve it and keep it relevant to current trends in large-scale assessment. Currently, close
to a million students participate in PEG-related programs in over 750 schools in 25 states,
the District of Columbia, and three foreign countries. Table 1 summarizes current
applications of PEG technology.
Table 1
Summary of Products and Programs Powered by PEG
State Programs: Utah Compose (grades 3-12); Texas PEG Writing (grades 3-12); NC Write (grades 3-12); CBAS (Connecticut, grades 3-12)
National Programs: PEG Writing (grades 3-12); PEG Writing Scholar (primarily community college)
Other Applications: WPP Online (with Educational Records Bureau); PEG Korea; PEG Hong Kong; PEG Sweden
As automated scoring has matured over the past 50 years, and as trins and proxes have
given way to features, the goal of programs like PEG has been to improve the prediction of
human-rendered scores. As multiple R has given way to more sophisticated metrics
(e.g., quadratic weighted kappa, or QWK), that goal has evolved into achieving an AES QWK
ever closer to unity, or at least equal to or greater than the QWK between human raters.
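For readers unfamiliar with the metric, QWK treats scores as ordered categories and penalizes each human-machine disagreement by the square of its distance, so being off by two points costs four times as much as being off by one. A minimal sketch with made-up score vectors:

```python
# Quadratic weighted kappa (QWK) between human and engine scores.
# The score vectors below are illustrative, not real rater data.
from sklearn.metrics import cohen_kappa_score

human  = [2, 3, 4, 2, 1, 3, 4, 4, 2, 3]
engine = [2, 3, 3, 2, 1, 3, 4, 4, 3, 3]

qwk = cohen_kappa_score(human, engine, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 would indicate perfect agreement
```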
The goal of matching human raters was officially reached in 2012. Documenting the first
Automated Student Assessment Prize (ASAP) competition, Morgan, Shermis, Van Deventer, & Vander Ark
(undated) reported that five vendors’ automated essay scoring programs (with MI in the
lead) had surpassed human readers in score stability, as shown in Figure 2.
Figure 2. Results of 2012 ASAP competition
Since 2012, the race to increase QWK, even incrementally, has continued. Values of QWK
closer and closer to unity have been achieved, primarily by the addition of features and more
sophisticated scoring models. Unfortunately, the number of features as well as the
mathematical form of the scoring models can complicate interpretation of results.
Improvement in formative tools and applicability. Although the transition from 31 trins
and proxes to as many as 800 features has improved PEG’s accuracy in scoring, an
unintended consequence has been that scores are no longer as readily explicable. In the
formative context, PEG is used to score student writing and provide targeted feedback for
improving the essay. The primary objective is to improve the student’s writing ability. As
such, the feedback generation should be tightly coupled with the scoring engine, so that if a
student earnestly follows the suggestions provided, he/she can expect to see an
improvement in the score of the next submitted revision.
When the number of scoring features exceeds the number of essays scored, it is
theoretically possible to predict scores perfectly on a training set, but the model may not
make any sense to someone trying to use the results to guide instruction. For the past three
years, MI has been working on ways to reduce the total number of features without too
much loss of reliability in an effort to produce scores that are more immediately
understandable and instructionally useful. For example, because we report six different trait
scores, it may be possible to create models to predict just one trait very well without
reference to the other five. To this end, we have focused on creating models (groups of
variables) with the following properties:
Instructionally meaningful – the variables within the group all measure a common
concept relevant to writing quality that can be clearly articulated in the feedback
(e.g., all related to Organization).
Uniform – the variables within a single group should all contribute to the score in the
same way. A group is characterized as a positive group if increasing its variables
should increase the score. Similarly, when variables in a negative group are
increased, the score should decrease.
Decorrelated – different variable groups should be decorrelated (i.e., independent)
as much as possible. A student should be able to follow our feedback to adjust
variables in a given group without affecting those in another group.
A model of this form will allow us to tie the targeted feedback directly to the trait score with
desired effect on the score (assuming the student follows the feedback).
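One way to realize these three properties, sketched below with hypothetical features for the Organization trait, is to collapse each group into a single index and constrain the sign of its weight, so that following the feedback for a group can only move the score in the advertised direction. This is an illustration of the idea, not MI's production model.

```python
# Sketch: grouped, sign-constrained scoring model (hypothetical features and data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
transitions  = rng.poisson(4, n).astype(float)  # positive group: more should help
paragraphing = rng.poisson(3, n).astype(float)  # positive group
run_ons      = rng.poisson(2, n).astype(float)  # negative group: more should hurt

# Collapse each group into one index; negate the negative group so every
# coefficient can be constrained to be non-negative.
X = np.column_stack([transitions + paragraphing, -run_ons])
y = 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)  # simulated trait score

model = LinearRegression(positive=True).fit(X, y)
print(model.coef_)  # both weights >= 0: feedback direction matches score direction
```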
Professional development. In Figure 1, summative assessment lies just to the right of the
horizontal arrow labeled “Instruction.” Interim assessment may be within or beyond the
purview of the classroom teacher, but formative assessment lies squarely within the control
of the teacher, and it is here that the opportunity to align instruction and assessment is
greatest. However, to take advantage of that opportunity, teachers usually need
professional development in assessment literacy; specifically, they need assistance in
setting up and managing an assessment system that provides continuous feedback.
MI staff have worked with several schools and school systems to implement PEG-based
systems, providing professional development with regard to the content as well as the
mechanics of the system. Engagement with teachers and administrators extends well
beyond initial training; it includes ongoing technical support and consultation. The following
example from Durham (North Carolina) Public Schools (DPS) illustrates how we approach
professional development and its impact on student achievement.
NC Write Basic Training – creating courses, adding prompts, responding to student
essays, utilizing reports
NC Write Advanced Training – utilizing advanced features, setting up peer review,
working collaboratively with colleagues
NC Write Workshop – creating prompts and/or lesson plans involving NC Write
Additional workshops, tailored to individual schools’ needs, were also provided. For
example, one session was specifically designed for a school’s art, music, library, and
technology teachers. They learned how to collaborate with general classroom teachers
within NC Write to support cross-curricular writing. Following that training, all specialist
teachers were added to existing regular education courses, allowing cross-curricular
collaboration and support.
In addition to on-site sessions, all DPS administrators and teachers had access to free NC
Write User Experience Webinars. These webinars were conducted by members of the NC
Write team who were also former educators. The webinars were conducted live so
participants could ask questions and were also recorded so they could be viewed on
demand at any time. During the 2015-16 school year, MI provided 20 webinars for DPS.
Validation
As noted previously, predictive validity frameworks will fail if the formative assessments
properly inform intervening instruction to the point that all or most students reach mastery,
regardless of their starting points. Growth or performance relative to expectation would be
more appropriate criteria. Nichols, Meyers, & Burling (2009) provided a framework for
evaluating formative assessments, in response to criticisms of “so-called” formative
assessments made by Wiliam & Black (1996, p. 543): “…in order to serve a formative
function, an assessment must yield evidence that, with appropriate construct-referenced
interpretations, indicates the existence of a gap between actual and desired levels of
performance, and suggests actions that are in fact successful in closing the gap.”
In short, formative assessments are valid to the extent that they permit or guide instruction
that leads to improved student performance, typically measured by scores on subsequent
formative, interim, or summative assessments. Thus, our focus should be on score increases
from one formative assessment to another (formative to formative), from a series of
formative assessments to an interim assessment (formative to interim), or from a series of
formative and interim assessments to the summative assessment (formative/interim to
summative). It might also be appropriate to give some attention to the relationship
between the formative assessments and the curriculum or instruction (alignment). In fact,
we should start there.
Alignment. Nichols et al. (2009) note that formative assessment validation does not begin
with performance improvement; rather, it begins with evidence of a meaningful
relationship between the assessment and the relevant criterion: alignment. The literature
on alignment has focused almost exclusively on summative assessments (cf. Porter,
Smithson, Blank, & Zeidner, 2007; Webb, 2007). Porter introduced the Survey of Enacted Curriculum
(SEC; seconline.wceruw.org). Webb has given us the Webb Alignment Tool (WAT,
wat.wceruw.org). Both employ Webb’s depth of knowledge (DOK,
dese.mo.gov/divimprove/sia/msip/DOK_Chart.pdf) scale. With these tools, educators have
been able to plot curriculum, instruction, and assessment on a two-dimensional grid to create a
variety of useful visual displays.
Alignment of a single summative assessment to an enacted curriculum is a fairly time-
consuming process that would be prohibitive for a series of several formative assessments.
Alignment of and for formative assessment tends to be more ad hoc. Greenstein (2010) has
provided a template for classroom teachers to create formative assessments and align them
with instruction based on pretest data (see Figure 1 above). Her approach is to create the
formative assessments on the fly, make quick adjustments based on student performance,
and integrate assessment and instruction on an almost daily basis, allowing each to inform
the other. This is essentially the manner in which PEG aligns writing formative assessments
and instructional modules and the manner in which classroom teachers use them. Success
of the alignment process can then be measured in terms of progress on subsequent
formative, interim, and summative assessments.
The Greenstein (2010) book is written from the perspective of a classroom teacher and, like
those of Rick Stiggins (e.g., Stiggins, 2014; Stiggins & Chappuis, 2012; Stiggins & Conklin,
1992) is based more in practice than in theory. Nevertheless, her recommendations are very
much in line with those of Nichols et al. (2009).
Formative to formative. Wilson (2012) found that the PEG-based Connecticut Benchmark
Assessment System for Writing (CBAS-Writing) was effective not only in identifying
struggling writers but in helping them move from struggling to proficient. His sample
included over 9,000 students in grades 3-12 and a collection of over 40,000 PEG-scored
essays. Using cluster analysis, Wilson was able to differentiate struggling writers from
proficient writers with high reliability. Through repeated use of PEG, two-thirds of the struggling
writers were able to move to a higher cluster. Typically, five to six revisions of an essay were
sufficient to move a struggling writer to the next higher cluster. For the remaining struggling
writers, additional teacher intervention was necessary.
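Wilson's clustering step can be illustrated with a simple k-means model on trait-score profiles; the data below are simulated, and his actual procedure and variables may have differed.

```python
# Sketch: clustering writers into struggling vs. proficient groups (simulated data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
struggling = rng.normal(loc=2.0, scale=0.5, size=(100, 6))  # six trait scores each
proficient = rng.normal(loc=4.0, scale=0.5, size=(100, 6))
scores = np.vstack([struggling, proficient])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
# Writers in the lower-mean cluster would be flagged for teacher intervention.
print(np.bincount(labels))
```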
The need for additional teacher intervention highlights a key feature of PEG-based writing
systems: they allow for four kinds of interactions:
Student-system – PEG provides feedback in terms of immediate scores on six traits
as well as comments on spelling and grammar. In addition, these systems provide
links to trait-based skill builders.
Teacher-system – PEG provides opportunities for teachers to monitor students’
progress and time on task.
Teacher-student – PEG also allows teachers to post feedback and suggestions to
students.
Student-student – on PEG writing sites, students can read one another’s essays and
provide feedback and suggestions.
Wilson & Andrada (2016) used hierarchical linear modeling (HLM; Raudenbush & Bryk,
2002) to analyze effects of PEG writing feedback on subsequent PEG scores. The resultant
model showed that scores improve with practice and feedback, up to about five attempts.
In other words, students’ scores improved steadily on the first through fifth revision of a
PEG-scored essay (about half a score point per revision), but not appreciably with
subsequent attempts. However, the scores on the fifth revision were significantly better
than those on the first draft.
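The growth pattern Wilson & Andrada describe maps naturally onto a two-level model with drafts nested within students. The sketch below fits such a model to simulated data built to mimic their finding (about half a point per revision, flattening after the fifth); it is not their analysis.

```python
# Sketch: HLM-style growth model of PEG scores across revisions (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for student in range(200):
    ability = rng.normal()                    # random intercept per student
    for revision in range(1, 8):
        gain = 0.5 * min(revision, 5)         # growth flattens after five drafts
        score = 10 + ability + gain + rng.normal(scale=0.8)
        rows.append({"student": student, "revision": revision, "score": score})
data = pd.DataFrame(rows)

# Level 1: score ~ revision; level 2: random intercept for each student.
result = smf.mixedlm("score ~ revision", data, groups=data["student"]).fit()
print(result.summary())
```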
Formative to interim. Returning to the notion of alignment, it must be assumed that the formative, interim, and summative assessments that are aligned to the curriculum and instruction are also aligned to one another. This is not as easy as it may sound. In writing assessment, for example, some assessments take a holistic approach to performance, while others are trait-based. PEG happens to be trait-based, yielding scores on six traits (Ideas and Development, Organization, Style, Sentence Structure, Word Choice, and Conventions) plus
a total score that is an unweighted sum of the six trait scores. To date, we are aware of no
reported studies of interim assessment performance related to PEG or any other formative
writing assessment program.
Formative/interim to summative. PEG writing systems have also been shown to affect
outcomes on summative assessments in two states. In Utah, Wilson (2016) found that
participation in a year-long application of PEG-based writing feedback had positive effects
on Utah’s SAGE Writing test as well as on the overall SAGE ELA test. Using the same HLM
techniques as cited above, Wilson found positive effects for number of essays written,
number of drafts per essay, and number of lessons engaged in based on PEG feedback. The
following excerpt is taken from that report:
Findings from analyses of each research question were consistent: Utah Compose usage is
positively associated with increased performance on the SAGE ELA and Writing assessments.
This was true both for students and for schools, and true even after controlling for prior literacy
achievement. In sum, the more students and schools used Utah Compose the better their
individual and school performance on SAGE. Findings underscore the educational benefits of
Utah Compose.
In Delaware, Wilson (under review) used path analysis (Wright, 1934) to tease out an
indirect effect of PEG writing on the Smarter Balanced total ELA scale score via student self-
reports on writing efficacy. Wilson examined self-efficacy (belief in one’s own competence
in a particular endeavor) as an intervening variable because of its theorized effect on
persistence in that endeavor (cf. Bruning & Kauffman, 2016). Students (n=56) in grade 6
responded to prompts at three points during the school year (October, January, and
March). The October and March prompts required the students to read stimulus materials
and/or watch video. The October prompt was informational, while the March prompt was
persuasive. The January prompt had no associated stimuli.
Students wrote a total of 1,027 essays scored by PEG and spent a total of 708 minutes on
PEG-based lessons. A control group (n=58) wrote essays using GoogleDocs without feedback
or links to remedial lessons. PEG provided feedback on six traits as well as in-text comments
on spelling and grammar errors. At the end of the year, all students responded to a Smarter
Balanced writing prompt (stimulus based, informational). Path analysis revealed that
involvement in PEG writing over the course of the year had no more direct impact on scores
on the summative assessment than did involvement in GoogleDocs writing. However, PEG
writing did have a direct effect on writing self-efficacy, and writing self-efficacy had a direct
effect on scores on the summative assessment. GoogleDocs (the control) had no such
effect.
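The indirect effect at the heart of this design can be estimated with the classic product-of-coefficients approach from path analysis. The sketch below uses simulated data with hypothetical effect sizes; it mirrors the structure of Wilson's model, not his numbers.

```python
# Sketch: indirect effect of PEG use on summative scores via writing self-efficacy.
# Data, variable names, and effect sizes are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
peg = np.repeat([1.0, 0.0], [56, 58])              # 1 = PEG group, 0 = control
efficacy  = 0.6 * peg + rng.normal(size=114)       # path a: treatment -> mediator
summative = 0.5 * efficacy + rng.normal(size=114)  # path b: mediator -> outcome

a = sm.OLS(efficacy, sm.add_constant(peg)).fit().params[1]
X = sm.add_constant(np.column_stack([peg, efficacy]))
b = sm.OLS(summative, X).fit().params[2]           # efficacy effect, controlling for group

print(f"indirect effect (a*b) = {a * b:.3f}")
```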
Wilson’s finding is interesting in that the interim (PEG/GoogleDocs) writing assignments differed
in important ways from the summative writing assessment; namely, the interim
assignments focused on an assortment of stimulus-based and non-stimulus-based prompts
in two genres, while the summative assessment employed a stimulus-based prompt in a
single genre and combined the writing score with a reading score. A separate writing scale
score was not available. Students who participated in the PEG program over the course of
the academic year saw their writing performance improve and therefore persisted in their
efforts to improve it. This finding is similar to that of Perie, Marion, & Gong
(2009, p. 12) that “repeated testing, in and of itself, contributed to retention.”
Conclusions and Recommendations
The rules are changing – again. While the basic definitions of validity have shifted with each
new edition of Educational Measurement (Cureton, 1951; Cronbach, 1971; Messick, 1989;
Kane, 2006), the integration of formative, interim, and summative assessments has
fundamentally changed the nature of what we seek to validate. Indeed, Mike Kane (2006)
speaks of validity as argument, after the manner of Stephen Toulmin (1958). Kane’s
framework is an excellent one for validation of formative and interim assessment.
These changes in how we view validity or even how we view assessment were not brought
about instantly by ESSA, or by any previous renewal/revision of the Elementary and
Secondary Education Act (ESEA) of 1965 (Public Law 89-10). They have been brought about
gradually by a recognized need to focus on assessment at the classroom and individual
student level. Here we find that our longstanding definitions of validity don’t work very
well. Theory works only slightly better. Pragmatism is the order of the day.
The framers of ESSA recognized that assessment embedded in instruction works. It even
seems a bit presumptuous to refer to this kind of testing as “innovative” (Section 1204(a)),
given that we have known about it for more than 50 years. Greenstein (2010) and others
writing from the perspective of the classroom have helped us realize the potential of
embedded, formative assessments, and Page (1966) and others who followed have given us
tools to make formative assessment a real possibility for teachers.
Automated scoring of essays has been proven to be reliable (Morgan et al., undated). Now,
automated scoring, particularly when coupled with instantaneous feedback and links to
instructional modules, has been shown to be valid in that it helps students write better,
both in the short term and in the longer term. This particular definition of validity may seem
to be a throwback to the earlier definitions in which validity referred to an instrument
rather than to the interpretation of scores derived from that instrument for a particular
purpose, but it is not. It simply recognizes a different context for assessing validity.
Results with the PEG program so far have been very encouraging. While this paper has
focused on PEG’s ability to score essays and provide useful feedback, it should be noted
that PEG is also being used to score short-answer constructed-response items. So far, most
of these items have been embedded in summative assessments. Formative assessments
with similar item types are not far behind. We expect to see similar research on the utility of
PEG-based systems with these assessments very soon.
Neither the current Standards for Educational and Psychological Testing (AERA/APA/NCME,
2014) nor the Operational Best Practices for Statewide Large-Scale Assessment Programs
(CCSSO/ATP, 2013) has a great deal to say about formative assessments. Moving forward,
we should make sure that a workable definition of validation of such assessments makes its
way into the next edition of these guidebooks. As classroom teachers, principals, and other
school leaders improve their assessment literacy (a stated goal of Section 2103 of ESSA), the
original purpose of formative (embedded) assessments will be fulfilled.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: AERA.
Bruning, R. H., & Kauffman, D. F. (2016). Self-efficacy beliefs and motivation in writing development. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd Ed.) (pp. 160-173). New York, NY: Guilford Press.
Bunch, M. B., Vaughn, D., & Miel, S. (2016). Automated scoring in assessment systems. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Technology Tools for Real-World Skill Development (pp. 611-626). Hershey, PA: IGI Global.
Council of Chief State School Officers & Association of Test Publishers (2013). Operational Best Practices for Statewide Large-Scale Assessment Programs. Washington, DC: CCSSO.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational Measurement (2nd Ed.). Washington, DC: American Council on Education.
Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. San Francisco: Jossey Bass.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational Measurement. Washington, DC: American Council on Education.
Every Student Succeeds Act of 2015. Public Law 114-95, 129 Stat. 1802 (2015).
Greenstein, L. (2010). What Teachers Really Need to Know About Formative Assessment. Alexandria, VA: Association for Supervision and Curriculum Development.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th Ed.). Westport, CT: American Council on Education/Praeger.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd Ed.). Washington, DC: American Council on Education/Macmillan.
Morgan, J., Shermis, M. D., Van Deventer, L. & Vander Ark, T. (undated). Automated Student Assessment Prize: Phase 1 & Phase 2: A Case Study to Promote Focused Innovation in Student Writing Assessment. Retrieved 9/1/14 from http://gettingsmart.com/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf
Nichols, P. D., Meyers, J. L., & Burling, K. S. (2009). A framework for evaluating and planning assessments intended to improve student achievement. Educational Measurement: Issues and Practice, 28(3), 14-23.
Page, E. B. (1966). The imminence of…grading essays by computer. Phi Delta Kappan, 47(2), 238-243.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5-13.
Porter, A.C., Smithson, J., Blank, R., & Zeidner, T. (2007). Alignment as a teacher variable. Applied Measurement in Education, 20(1), 27-51.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd Ed.). Thousand Oaks, CA: Sage.
Stiggins, R. J. (2014). Defensible Teacher Evaluation: Student Growth Through Classroom Assessment. Thousand Oaks, CA: Corwin Press.
Stiggins, R. J. & Chappuis, J. (2012). Introduction to Student-Involved Assessment for Learning (6th Ed.). Boston, MA: Pearson Education.
Stiggins, R. J. & Conklin, N. F. (1992). In Teachers’ Hands: Investigating the Practice of Classroom Assessment. Albany, NY: State University of New York Press.
Toulmin, S. E. (1958). The Uses of Argument. New York: Cambridge University Press.
Webb, N.L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20(1), 7-25.
Wiliam, D. & Black, P. (1996). Meaning and consequence: A basis for distinguishing formative and summative functions of assessment. British Educational Research Journal, 22(5), 537-548.
Wilson, J. (2012). Using CBAS-WRITE to Identify Struggling Writers and Improve Writing Skills. Paper presented at the Third Annual Connecticut Assessment Forum, Rocky Hill, CT.
Wilson, J. (2016). Executive Summary of Findings from Analyses of Utah Compose and SAGE Data for Academic Year (AY) 2014-15 and 2015-16.
Wilson, J. (under review). Effects of automated writing evaluation software on writing
quality, writing self-efficacy, and state test performance: A study of PEG Writing. Computers & Education.
Wilson, J. & Andrada, G. N. (2016). Using automated feedback to improve writing quality:
Opportunities and challenges. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Technology Tools for Real-World Skill Development (pp. 678-703). Hershey, PA: IGI Global.
Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5(3), 161-215.