Validating Formative and Interim Assessments Under ESSA
Michael B. Bunch
Measurement Incorporated
Introduction
This paper focuses on an orderly integration of formative, interim, and summative
assessments to improve student achievement, as described in the Every Student Succeeds
Act (ESSA) of 2015. Within that framework, it addresses the validation of formative
assessments in terms of growth over time, rather than long-term prediction of
performance. The primary vehicle for this work is Measurement Incorporated’s Project
Essay Grade (PEG), an automated scoring engine we have been using for several years for
formative, interim, and summative assessments.
ESSA
ESSA provides hundreds of millions of dollars for states to develop (either individually or
collaboratively) balanced assessment systems comprising summative, interim, and
formative assessments and to assist local education agencies in implementing them.
Specifically:
“SEC. 1204. Innovative assessment and accountability demonstration authority.
“(a) Innovative assessment system defined.—The term ‘innovative assessment system’
means a system of assessments that may include—
“(1) competency-based assessments, instructionally embedded assessments, interim
assessments, cumulative year-end assessments, or performance-based assessments that
combine into an annual summative determination for a student, which may be administered
through computer adaptive assessments; and
“(2) assessments that validate when students are ready to demonstrate mastery or
proficiency and allow for differentiated student support based on individual learning needs.
Title II (Section 2103) authorizes funds to provide training to classroom teachers, principals,
or other school leaders for the purpose of implementing these assessments. This training
can include in-service training on new assessment technologies and means of integrating results of
those assessments into classroom instruction.
Integrating Formative, Interim, and Summative Assessments
The relationships among formative, interim, and summative assessments are depicted in
Figure 1. In this system, students take a pretest (a form of interim assessment or possibly a
summative assessment from a previous instructional period) in order to ascertain content
mastery. Since there will be a range of scores on this test, instruction may be individualized,
as may subsequent formative assessments; i.e., as students proceed through a course of
instruction, they will likely be engaged in different activities. Their formative assessments
are then tailored to their progress. While such an approach might be difficult to manage
with teacher-made paper tests, it becomes more manageable with online, on-demand tests
such as those afforded in a balanced assessment system. Moreover, in such a system,
feedback can be immediate and targeted so that, at least in theory, students benefit from a
series of formative assessments and improve their chances of doing well on subsequent
interim assessments as well as the summative assessment.
Figure 1. Relationships among formative, interim, and summative assessments
Validation of summative assessments has been the central focus of the educational
measurement profession since its inception. More recently, validation of interim
assessments has gained attention. Validation of formative assessments may require
innovative approaches. If the objective of formative (and even interim) assessments is to
improve student performance on subsequent assessments, predictive validity frameworks
will fail precisely when formative assessments succeed: if they properly inform intervening
instruction, students who perform poorly on early assessments improve at different rates,
so that most or all students perform better on later assessments and early scores lose their
predictive power. Thus, the principal focus of
validation of formative assessments is the degree to which they promote improvement on
interim and summative assessments.
PEG
The following is a brief introduction to Project Essay Grade (PEG). For a more detailed
description of the program, its history, and its applications, see Bunch, Vaughn, & Miel
(2016).
Overview. Ellis Batten Page (1924–2005) is widely acknowledged as the father of
automated essay scoring. Page (1966) reported on an early effort to understand how
human beings graded student essays and to translate the process into a computer program.
That program, Project Essay Grade, or PEG®, was designed to score student essays using
mainframe computers in the 1960s.
As a result of Page’s work, two new terms entered the lexicon: trin and prox. A trin is an
intrinsic characteristic of writing, such as diction or style. A prox is a quantifiable
approximation of that intrinsic characteristic. These terms have since been replaced by
“features,” and there is no practical distinction between intrinsic and objectified features.
“Artificial intelligence,” at least in this context, has been replaced by “automated essay
scoring.”
The initial PEG work focused on essays written by 276 high school students and graded by
four English teachers. Those essays yielded 31 proxes (assigned by PEG in accordance with
rules devised by Page) used as predictors of scores assigned by teachers. Page and his
colleagues calculated the correlation between a weighted composite of the 31 proxes and
the scores assigned by teachers. The resulting multiple R was .71. Considering that the
correlation between scores assigned by two English teachers is typically not much higher,
these results were quite remarkable.
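Page's procedure can be reproduced in a few lines with modern tools. The sketch below uses a simulated feature matrix standing in for the 31 proxes and teacher scores, which are not publicly available; it shows only the mechanics of estimating a multiple R, not the 1966 results.

```python
# Sketch of Page-style prox regression. Data are simulated, not the 1966 dataset:
# 276 essays, 31 proxes, and one composite teacher score per essay.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(276, 31))                            # prox values for each essay
y = X @ rng.normal(size=31) + rng.normal(scale=4.0, size=276)  # simulated teacher scores

model = LinearRegression().fit(X, y)
# Multiple R is the correlation between predicted and observed scores.
multiple_r = np.corrcoef(model.predict(X), y)[0, 1]
print(f"multiple R = {multiple_r:.2f}")
```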
Page applied the tools available to him as an English teacher: a deep understanding of the
intrinsic qualities of good writing (trins) and the ability to translate those qualities into
objective units (proxes). He then applied the tools available to him as a psychometrician:
multiple regression and the ability to interpret its results. In doing so, he created the field of
automated essay scoring (AES).
Measurement Incorporated (MI) acquired PEG in 2003 and has worked continuously to
improve it and keep it relevant to current trends in large-scale assessment. Currently, close
to a million students participate in PEG-related programs in over 750 schools in 25 states,
the District of Columbia, and three foreign countries. Table 1 summarizes current
applications of PEG technology.
Table 1
Summary of Products and Programs Powered by PEG
State Programs: Utah Compose (grades 3-12); Texas PEG Writing (grades 3-12); NC Write (grades 3-12); CBAS (Connecticut, grades 3-12)
National Programs: PEG Writing (grades 3-12); PEG Writing Scholar (primarily community college)
Other Applications: WPP Online (with Educational Records Bureau); PEG Korea; PEG Hong Kong; PEG Sweden
As automated scoring has matured over the past 50 years, and as trins and proxes have
given way to features, the goal of programs like PEG has been to improve the prediction of
human-rendered scores. As multiple R has given way to more sophisticated metrics
(e.g., quadratic weighted kappa, or QWK), that goal has evolved into achieving an AES QWK
ever closer to unity, or at least equal to or greater than the QWK between human raters.
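For readers unfamiliar with the metric, QWK treats scores as ordered categories and penalizes each human-machine disagreement by the square of its distance, so being off by two points costs four times as much as being off by one. A minimal sketch with made-up score vectors:

```python
# Quadratic weighted kappa (QWK) between human and engine scores.
# The score vectors below are illustrative, not real rater data.
from sklearn.metrics import cohen_kappa_score

human  = [2, 3, 4, 2, 1, 3, 4, 4, 2, 3]
engine = [2, 3, 3, 2, 1, 3, 4, 4, 3, 3]

qwk = cohen_kappa_score(human, engine, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 would indicate perfect agreement
```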
The goal of matching human raters was officially reached in 2012. Documenting the first
Automated Student Assessment Prize (ASAP) competition, Morgan, Shermis, Van Deventer, & Vander Ark
(undated) reported that five vendors’ automated essay scoring programs (with MI in the
lead) had surpassed human readers in score stability, as shown in Figure 2.
Figure 2. Results of 2012 ASAP competition
Since 2012, the race to increase QWK, even incrementally, has continued. Values of QWK
closer and closer to unity have been achieved, primarily by the addition of features and more
sophisticated scoring models. Unfortunately, the number of features as well as the
mathematical form of the scoring models can complicate interpretation of results.
Improvement in formative tools and applicability. Although the transition from 31 trins
and proxes to as many as 800 features has improved PEG’s accuracy in scoring, an
unintended consequence has been that scores are no longer as readily explicable. In the
formative context, PEG is used to score student writing and provide targeted feedback for
improving the essay. The primary objective is to improve the student’s writing ability. As
such, the feedback generation should be tightly coupled with the scoring engine, so that if a
student earnestly follows the suggestions provided, he/she can expect to see an
improvement in the score of the next submitted revision.
When the number of scoring features exceeds the number of essays scored, it is
theoretically possible to predict scores perfectly on a training set, but the model may not
make any sense to someone trying to use the results to guide instruction. For the past three
years, MI has been working on ways to reduce the total number of features without too
much loss of reliability in an effort to produce scores that are more immediately
understandable and instructionally useful. For example, because we report six different trait
scores, it may be possible to create models to predict just one trait very well without
reference to the other five. To this end, we have focused on creating models (groups of
variables) with the following properties:
Instructionally meaningful – the variables within the group all measure a common
concept relevant to writing quality that can be clearly articulated in the feedback
(e.g., all related to Organization).
Uniform – the variables within a single group should all contribute to the score in the
same way. A group is characterized as a positive group if increasing its variables
should increase the score. Similarly, when variables in a negative group are
increased, the score should decrease.
Decorrelated – different variable groups should be decorrelated (i.e., independent)
as much as possible. A student should be able to follow our feedback to adjust
variables in a given group without affecting those in another group.
A model of this form will allow us to tie the targeted feedback directly to the trait score with
desired effect on the score (assuming the student follows the feedback).
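One way to realize these three properties, sketched below with hypothetical features for the Organization trait, is to collapse each group into a single index and constrain the sign of its weight, so that following the feedback for a group can only move the score in the advertised direction. This is an illustration of the idea, not MI's production model.

```python
# Sketch: grouped, sign-constrained scoring model (hypothetical features and data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
transitions  = rng.poisson(4, n).astype(float)  # positive group: more should help
paragraphing = rng.poisson(3, n).astype(float)  # positive group
run_ons      = rng.poisson(2, n).astype(float)  # negative group: more should hurt

# Collapse each group into one index; negate the negative group so every
# coefficient can be constrained to be non-negative.
X = np.column_stack([transitions + paragraphing, -run_ons])
y = 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)  # simulated trait score

model = LinearRegression(positive=True).fit(X, y)
print(model.coef_)  # both weights >= 0: feedback direction matches score direction
```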
Professional development. In Figure 1, summative assessment lies just to the right of the
horizontal arrow labeled “Instruction.” Interim assessment may be within or beyond the
purview of the classroom teacher, but formative assessment lies squarely within the control
of the teacher, and it is here that the opportunity to align instruction and assessment is
greatest. However, to take advantage of that opportunity, teachers usually need
professional development in assessment literacy; specifically, they need assistance in
setting up and managing an assessment system that provides continuous feedback.
MI staff have worked with several schools and school systems to implement PEG-based
systems, providing professional development with regard to the content as well as the
mechanics of the system. Engagement with teachers and administrators extends well
beyond initial training; it includes ongoing technical support and consultation. The following
example from Durham (North Carolina) Public Schools (DPS) illustrates how we approach
professional development and its impact on student achievement.
NC Write Basic Training – creating courses, adding prompts, responding to student
essays, utilizing reports
NC Write Advanced Training – utilizing advanced features, setting up peer review,
working collaboratively with colleagues
NC Write Workshop – creating prompts and/or lesson plans involving NC Write
Additional workshops, tailored to individual schools’ needs, were also provided. For
example, one session was specifically designed for a school’s art, music, library, and
technology teachers. They learned how to collaborate with general classroom teachers
within NC Write to support cross-curricular writing. Following that training, all specialist
teachers were added to existing regular education courses, allowing cross-curricular
collaboration and support.
In addition to on-site sessions, all DPS administrators and teachers had access to free NC
Write User Experience Webinars. These webinars were conducted by members of the NC
Write team who were also former educators. The webinars were conducted live so
participants could ask questions and were also recorded so they could be viewed on
demand at any time. During the 2015-16 school year, MI provided 20 webinars for DPS.
Validation
As noted previously, predictive validity frameworks will fail if the formative assessments
properly inform intervening instruction to the point that all or most students reach mastery,
regardless of their starting points. Growth or performance relative to expectation would be
more appropriate criteria. Nichols, Meyers, & Burling (2009) provided a framework for
evaluating formative assessments, in response to criticisms of “so-called” formative
assessments made by Wiliam & Black (1996, p. 543): “…in order to serve a formative
function, an assessment must yield evidence that, with appropriate construct-referenced
interpretations, indicates the existence of a gap between actual and desired levels of
performance, and suggests actions that are in fact successful in closing the gap.”
In short, formative assessments are valid to the extent that they permit or guide instruction
that leads to improved student performance, typically measured by scores on subsequent
formative, interim, or summative assessments. Thus, our focus should be on score increases
from one formative assessment to another (formative to formative), from a series of
formative assessments to an interim assessment (formative to interim), or from a series of
formative and interim assessments to the summative assessment (formative/interim to
summative). It might also be appropriate to give some attention to the relationship
between the formative assessments and the curriculum or instruction (alignment). In fact,
we should start there.
Alignment. Nichols et al. (2009) note that formative assessment validation does not begin
with performance improvement; rather, it begins with evidence of a meaningful
relationship between the assessment and the relevant criterion: alignment. The literature
on alignment has focused almost exclusively on summative assessments (cf. Porter,
Smithson, Blank, & Zeidner, 2007; Webb, 2007). Porter introduced the Survey of Enacted Curriculum
(SEC; seconline.wceruw.org). Webb has given us the Webb Alignment Tool (WAT,
wat.wceruw.org). Both employ Webb’s depth of knowledge (DOK,
dese.mo.gov/divimprove/sia/msip/DOK_Chart.pdf) scale. With these tools, educators have
been able to plot curriculum, instruction, and assessment on a two-dimensional grid to create a
variety of useful visual displays.
Alignment of a single summative assessment to an enacted curriculum is a fairly time-
consuming process that would be prohibitive for a series of several formative assessments.
Alignment of and for formative assessment tends to be more ad hoc. Greenstein (2010) has
provided a template for classroom teachers to create formative assessments and align them
with instruction based on pretest data (see Figure 1 above). Her approach is to create the
formative assessments on the fly, make quick adjustments based on student performance,
and integrate assessment and instruction on an almost daily basis, allowing each to inform
the other. This is essentially the manner in which PEG aligns writing formative assessments
and instructional modules and the manner in which classroom teachers use them. Success
of the alignment process can then be measured in terms of progress on subsequent
formative, interim, and summative assessments.
The Greenstein (2010) book is written from the perspective of a classroom teacher and, like
those of Rick Stiggins (e.g., Stiggins, 2014; Stiggins & Chappuis, 2012; Stiggins & Conklin,
1992) is based more in practice than in theory. Nevertheless, her recommendations are very
much in line with those of Nichols et al. (2009).
Formative to formative. Wilson (2012) found that the PEG-based Connecticut Benchmark
Assessment System for Writing (CBAS-Writing) was effective not only in identifying
struggling writers but in helping them move from struggling to proficient. His sample
included over 9,000 students in grades 3-12 and a collection of over 40,000 PEG-scored
essays. Using cluster analysis, Wilson was able to differentiate struggling writers from
proficient writers with high reliability. Through repeated use of PEG, two-thirds of the struggling
writers were able to move to a higher cluster. Typically, five to six revisions of an essay were
sufficient to move a struggling writer to the next higher cluster. For the remaining struggling
writers, additional teacher intervention was necessary.
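Wilson's clustering step can be illustrated with a simple k-means model on trait-score profiles; the data below are simulated, and his actual procedure and variables may have differed.

```python
# Sketch: clustering writers into struggling vs. proficient groups (simulated data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
struggling = rng.normal(loc=2.0, scale=0.5, size=(100, 6))  # six trait scores each
proficient = rng.normal(loc=4.0, scale=0.5, size=(100, 6))
scores = np.vstack([struggling, proficient])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
# Writers in the lower-mean cluster would be flagged for teacher intervention.
print(np.bincount(labels))
```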
The need for additional teacher intervention highlights a key feature of PEG-based writing
systems: they allow for four kinds of interactions:
Student-system – PEG provides feedback in terms of immediate scores on six traits
as well as comments on spelling and grammar. In addition, these systems provide
links to trait-based skill builders.
Teacher-system – PEG provides opportunities for teachers to monitor students’
progress and time on task.
Teacher-student – PEG also allows teachers to post feedback and suggestions to
students.
Student-student – on PEG writing sites, students can read one another’s essays and
provide feedback and suggestions.
Wilson & Andrada (2016) used hierarchical linear modeling (HLM; Raudenbush & Bryk,
2002) to analyze effects of PEG writing feedback on subsequent PEG scores. The resultant
model showed that scores improve with practice and feedback, up to about five attempts.
In other words, students’ scores improved steadily on the first through fifth revision of a
PEG-scored essay (about half a score point per revision), but not appreciably with
subsequent attempts. However, the scores on the fifth revision were significantly better
than those on the first draft.
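The growth pattern Wilson & Andrada describe maps naturally onto a two-level model with drafts nested within students. The sketch below fits such a model to simulated data built to mimic their finding (about half a point per revision, flattening after the fifth); it is not their analysis.

```python
# Sketch: HLM-style growth model of PEG scores across revisions (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for student in range(200):
    ability = rng.normal()                    # random intercept per student
    for revision in range(1, 8):
        gain = 0.5 * min(revision, 5)         # growth flattens after five drafts
        score = 10 + ability + gain + rng.normal(scale=0.8)
        rows.append({"student": student, "revision": revision, "score": score})
data = pd.DataFrame(rows)

# Level 1: score ~ revision; level 2: random intercept for each student.
result = smf.mixedlm("score ~ revision", data, groups=data["student"]).fit()
print(result.summary())
```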
Formative to interim. Returning to the notion of alignment, it must be assumed that the formative, interim, and summative assessments that are aligned to the curriculum and instruction are also aligned to one another. This is not as easy as it may sound. In writing assessment, for example, some assessments take a holistic approach to performance, while others are trait-based. PEG happens to be trait-based, yielding scores on six traits (Ideas and Development, Organization, Style, Sentence Structure, Word Choice, and Conventions) plus
a total score that is an unweighted sum of the six trait scores. To date, we are aware of no
reported studies of interim assessment performance related to PEG or any other formative
writing assessment program.
Formative/interim to summative. PEG writing systems have also been shown to affect
outcomes on summative assessments in two states. In Utah, Wilson (2016) found that
participation in a year-long application of PEG-based writing feedback had positive effects
on Utah’s SAGE Writing test as well as on the overall SAGE ELA test. Using the same HLM
techniques as cited above, Wilson found positive effects for number of essays written,
number of drafts per essay, and number of lessons engaged in based on PEG feedback. The
following excerpt is taken from that report:
Findings from analyses of each research question were consistent: Utah Compose usage is
positively associated with increased performance on the SAGE ELA and Writing assessments.
This was true both for students and for schools, and true even after controlling for prior literacy
achievement. In sum, the more students and schools used Utah Compose the better their
individual and school performance on SAGE. Findings underscore the educational benefits of
Utah Compose.
In Delaware, Wilson (under review) used path analysis (Wright, 1934) to tease out an
indirect effect of PEG writing on the Smarter Balanced total ELA scale score via student self-
reports on writing efficacy. Wilson examined self-efficacy (belief in one’s own competence
in a particular endeavor) as an intervening variable because of its theorized effect on
persistence in that endeavor (cf. Bruning & Kauffman, 2016). Students (n=56) in grade 6
responded to prompts at three points during the school year (October, January, and
March). The October and March prompts required the students to read stimulus materials
and/or watch video. The October prompt was informational, while the March prompt was
persuasive. The January prompt had no associated stimuli.
Students wrote a total of 1,027 essays scored by PEG and spent a total of 708 minutes on
PEG-based lessons. A control group (n=58) wrote essays using GoogleDocs without feedback
or links to remedial lessons. PEG provided feedback on six traits as well as in-text comments
on spelling and grammar errors. At the end of the year, all students responded to a Smarter
Balanced writing prompt (stimulus based, informational). Path analysis revealed that
involvement in PEG writing over the course of the year had no more direct impact on scores
on the summative assessment than did involvement in GoogleDocs writing. However, PEG
writing did have a direct effect on writing self-efficacy, and writing self-efficacy had a direct
effect on scores on the summative assessment. GoogleDocs (the control) had no such
effect.
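The indirect effect at the heart of this design can be estimated with the classic product-of-coefficients approach from path analysis. The sketch below uses simulated data with hypothetical effect sizes; it mirrors the structure of Wilson's model, not his numbers.

```python
# Sketch: indirect effect of PEG use on summative scores via writing self-efficacy.
# Data, variable names, and effect sizes are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
peg = np.repeat([1.0, 0.0], [56, 58])              # 1 = PEG group, 0 = control
efficacy  = 0.6 * peg + rng.normal(size=114)       # path a: treatment -> mediator
summative = 0.5 * efficacy + rng.normal(size=114)  # path b: mediator -> outcome

a = sm.OLS(efficacy, sm.add_constant(peg)).fit().params[1]
X = sm.add_constant(np.column_stack([peg, efficacy]))
b = sm.OLS(summative, X).fit().params[2]           # efficacy effect, controlling for group

print(f"indirect effect (a*b) = {a * b:.3f}")
```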
Wilson’s finding is interesting in that the interim (PEG/GoogleDocs) writing assignments differed
in important ways from the summative writing assessment; namely, the interim
assignments focused on an assortment of stimulus-based and non-stimulus-based prompts
in two genres, while the summative assessment employed a stimulus-based prompt in a
single genre and combined the writing score with a reading score. A separate writing scale
score was not available. Students who participated in the PEG program over the course of
the academic year saw their writing performance improve and therefore persisted in their
efforts to improve it. This finding is similar to that of Perie, Marion, & Gong
(2009, p. 12) that “repeated testing, in and of itself, contributed to retention.”
Conclusions and Recommendations
The rules are changing – again. While the basic definitions of validity have shifted with each
new edition of Educational Measurement (Cureton, 1951; Cronbach, 1971; Messick, 1989;
Kane, 2006), the integration of formative, interim, and summative assessments has
fundamentally changed the nature of what we seek to validate. Indeed, Mike Kane (2006)
speaks of validity as argument, after the manner of Stephen Toulmin (1958). Kane’s
framework is an excellent one for validation of formative and interim assessment.
These changes in how we view validity or even how we view assessment were not brought
about instantly by ESSA, or by any previous renewal/revision of the Elementary and
Secondary Education Act (ESEA) of 1965 (Public Law 89-10). They have been brought about
gradually by a recognized need to focus on assessment at the classroom and individual
student level. Here we find that our longstanding definitions of validity don’t work very
well. Theory works only slightly better. Pragmatism is the order of the day.
The framers of ESSA recognized that assessment embedded in instruction works. It even
seems a bit presumptuous to refer to this kind of testing as “innovative” (Section 1204(a)),
given that we have known about it for more than 50 years. Greenstein (2010) and others
writing from the perspective of the classroom have helped us realize the potential of
embedded, formative assessments, and Page (1966) and others who followed have given us
tools to make formative assessment a real possibility for teachers.
Automated scoring of essays has been proven to be reliable (Morgan et al., undated). Now,
automated scoring, particularly when coupled with instantaneous feedback and links to
instructional modules, has been shown to be valid in that it helps students write better,
both in the short term and in the longer term. This particular definition of validity may seem
to be a throwback to the earlier definitions in which validity referred to an instrument
rather than to the interpretation of scores derived from that instrument for a particular
purpose, but it is not. It simply recognizes a different context for assessing validity.
Results with the PEG program so far have been very encouraging. While this paper has
focused on PEG’s ability to score essays and provide useful feedback, it should be noted
that PEG is also being used to score short-answer constructed-response items. So far, most
of these items have been embedded in summative assessments. Formative assessments
with similar item types are not far behind. We expect to see similar research on the utility of
PEG-based systems with these assessments very soon.
Neither the current Standards for Educational and Psychological Testing (AERA/APA/NCME,
2014) nor the Operational Best Practices for Statewide Large-Scale Assessment Programs
(CCSSO/ATP, 2013) has a great deal to say about formative assessments. Moving forward,
we should make sure that a workable definition of validation of such assessments makes its
way into the next edition of these guidebooks. As classroom teachers, principals, and other
school leaders improve their assessment literacy (a stated goal of Section 2103 of ESSA), the
original purpose of formative (embedded) assessments will be fulfilled.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: AERA.
Bruning, R. H., & Kauffman, D. F. (2016). Self-efficacy beliefs and motivation in writing development. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd Ed.) (pp. 160-173). New York, NY: Guilford Press.
Bunch, M. B., Vaughn, D., & Miel, S. (2016). Automated scoring in assessment systems. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Technology Tools for Real-World Skill Development (pp. 611-626). Hershey, PA: IGI Global.
Council of Chief State School Officers & Association of Test Publishers (2013). Operational Best Practices for Statewide Large-Scale Assessment Programs. Washington, DC: CCSSO.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational Measurement (2nd Ed.). Washington, DC: American Council on Education.
Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. San Francisco: Jossey Bass.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational Measurement. Washington, DC: American Council on Education.
Every Student Succeeds Act of 2015. Public Law 114-95, 129 Stat. 1802 (2015).
Greenstein, L. (2010). What Teachers Really Need to Know About Formative Assessment. Alexandria, VA: Association for Supervision and Curriculum Development.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th Ed.). Westport, CT: American Council on Education/Praeger.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd Ed.). Washington, DC: American Council on Education/Macmillan.
Morgan, J., Shermis, M. D., Van Deventer, L. & Vander Ark, T. (undated). Automated Student Assessment Prize: Phase 1 & Phase 2: A Case Study to Promote Focused Innovation in Student Writing Assessment. Retrieved 9/1/14 from http://gettingsmart.com/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf
Nichols, P. D., Meyers, J. L., & Burling, K. S. (2009). A framework for evaluating and planning assessments intended to improve student achievement. Educational Measurement: Issues and Practice, 28(3), 14-23.
Page, E. B. (1966). The imminence of…grading essays by computer. Phi Delta Kappan, 47(2), 238-243.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5-13.
Porter, A.C., Smithson, J., Blank, R., & Zeidner, T. (2007). Alignment as a teacher variable. Applied Measurement in Education, 20(1), 27-51.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd Ed.). Thousand Oaks, CA: Sage.
Stiggins, R. J. (2014). Defensible Teacher Evaluation: Student Growth Through Classroom Assessment. Thousand Oaks, CA: Corwin Press.
Stiggins, R. J. & Chappuis, J. (2012). Introduction to Student-Involved Assessment for Learning (6th Ed.). Boston, MA: Pearson Education.
Stiggins, R. J. & Conklin, N. F. (1992). In Teachers’ Hands: Investigating the Practice of Classroom Assessment. Albany, NY: State University of New York Press.
Toulmin, S. E. (1958). The Uses of Argument. New York: Cambridge University Press.
Webb, N.L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20(1), 7-25.
Wiliam, D. & Black, P. (1996). Meaning and consequence: A basis for distinguishing formative and summative functions of assessment. British Educational Research Journal, 22(5), 537-548.
Wilson, J. (2012). Using CBAS-WRITE to Identify Struggling Writers and Improve Writing Skills. Paper presented at the Third Annual Connecticut Assessment Forum, Rocky Hill, CT.
Wilson, J. (2016). Executive Summary of Findings from Analyses of Utah Compose and SAGE Data for Academic Year (AY) 2014-15 and 2015-16.
Wilson, J. (under review). Effects of automated writing evaluation software on writing
quality, writing self-efficacy, and state test performance: A study of PEG Writing. Computers & Education.
Wilson, J. & Andrada, G. N. (2016). Using automated feedback to improve writing quality:
Opportunities and challenges. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Technology Tools for Real-World Skill Development (pp. 678-703). Hershey, PA: IGI Global.
Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5(3), 161-215.