Critique (Ver. 3.4) of Mark D. Shermis & Ben
Hammer, “Contrasting State-of-the-Art Automated
Scoring of Essays: Analysis” http://www.scoreright.org/NCME_2012_Paper3_29_12.pdf
Les C. Perelman, Ph.D.
Comparative Media Studies
Massachusetts Institute of Technology
13 March 2013
Critique of Mark D. Shermis & Ben Hammer, “Contrasting State-of-the-Art Automated
Scoring of Essays: Analysis” -- Ver. 3.4 Mar. 13, 2013 by Les C. Perelman is licensed
under a Creative Commons Attribution 3.0 Unported License.
Critique of Shermis & Hammer Ver. 3.4 Final page 2 13 March 2013 L. C. Perelman
Abstract
Although the unpublished study by Shermis & Hammer (2012) received substantial publicity
about its claim that automated essay scoring (AES) of student essays was as accurate as scoring
by human readers, a close examination of the paper’s methodology and the datasets used
demonstrates that such a claim is not supported by the data in the study. The study’s
methodology used one variable for comparing human readers and a different variable for
comparing machine scores, this difference artificially privileging the machines in half the
datasets. Moreover, conclusions were drawn without the performance of statistical tests and
inferences were based solely on impressionistic and sometimes inaccurate comparisons. In
addition, there was no standard testing of the model as a whole for significance, which given the
large number of comparisons, allowed machine variables to surpass human readers merely
through random chance. Finally, half of the datasets used were not essays but short one-
paragraph responses involving literary analysis or reading comprehension that were not
evaluated on any construct involving writing. Because of the widespread publicity surrounding
this study and that its findings may be used by states and state consortia in implementing the
Common Core State Standards Initiative, the authors should make the test dataset publicly
available for analysis.
Introduction.1 On April 16, 2012, Mark D. Shermis, Dean of the School of Education at the
University of Akron, presented a paper at the annual meeting of the National Council on
Measurement in Education in Vancouver, British Columbia, entitled “Contrasting State-of-the-Art
in Automated Scoring of Essays: Analysis.” Despite its fairly nondescript title, the paper made the claim that
machines graded essays as well as expert human raters, a claim that was publicized in various
press releases and newspaper articles. A press release from the University of Akron, for
example, stated “A direct comparison between human graders and software designed to score
student essays achieved virtually identical levels of accuracy, with the software in some cases
proving to be more reliable, a groundbreaking study has found.” (Man and machine: Better
writers, better grades, 2012). A headline in Inside Higher Education read “A Win for the Robo-
Readers” and the story included statements such as “The study, funded by the William and Flora
Hewlett Foundation, compared the software-generated ratings given to more than 22,000 short
essays, written by students in junior high schools and high school sophomores, to the ratings
given to the same essays by trained human readers. The differences, across a number of different
brands of automated essay scoring software (AES) and essay types, were minute.” (Kolowich,
2012). Even the venerable British publication The New Scientist reported “The essay marks
handed out by the machines were statistically identical to those from the human graders, says
[Jaison] Morgan. ‘The result blew away everyone's expectations,’ he says.” (Giles, 2012). Yet
these reports and other statements can best be characterized as unsubstantiated hyperbole. The
paper, which has never appeared in a peer-reviewed journal, employs an inconsistent and highly
dubious methodology that favors the machines over the human graders. Even with this biased
methodology, however, the data still show that for traditional writing assignments--as opposed to
merely paragraph-length summaries or paragraph-length literary analyses, which were not
graded on any construct involving writing ability-- human scorers perform better overall than
machines.
Here is the paper’s central claim:
The results demonstrated that overall, automated essay scoring was capable of producing
scores similar to human scores for extended-response writing items with equal performance
for both source-based and traditional writing genre [sic].
That claim, however, is clearly not supported by the data. Conversely, the data support the
assertion that human scorers performed more reliably than the machines on the longer traditional
writing assignments.
1 I would like to acknowledge and thank my friend and colleague Norbert Elliot among others for their extremely
helpful criticism and advice that helped me revise and improve this paper.
1. Flawed experimental design and analysis
a. Having a pair of readers compete against eight scoring engines is, in essence, like
running multiple t-tests. Any single high machine score among the nine scores by
eight different vendors compared on five different metrics could be a random
anomaly, or, to put it in more colloquial terms, a lucky guess, especially since the
sizes of the individual test datasets were relatively small, ranging from 304 to 601.
b. Before any individual comparisons were made, no overall test, such as an ANOVA or
a regression, was run to see if the machines, as a group, performed significantly
different (lower or higher) than the human scorers.
c. Although various claims were made in the paper, no statistical tests were presented by
the authors. Instead, the authors present impressionistic assertions such as “In
general, performance on kappa was slightly less with the exception of essay prompts
#5 & #6. On these data sets, the AES engines, as a group, matched or exceeded
human performance.” (p. 23). No parameters are given for what constitutes
matching or exceeding human performance. Moreover, the authors never mention
that the humans and machines are being matched against different variables (see #3 below).
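The multiple-comparisons concern in point 1(a) can be made concrete with a short sketch. The family-wise chance of at least one spuriously favorable comparison grows quickly with the number of comparisons; the 5% per-comparison "lucky win" rate and the independence of comparisons assumed below are illustrative simplifications, not figures from the study.

```python
# Illustrative sketch of the multiple-comparisons problem in section 1:
# with eight engines compared on five metrics, the chance that at least
# one machine/metric combination beats the human readers by pure luck
# is large. The 5% per-comparison rate and independence of comparisons
# are simplifying assumptions for illustration only.

def familywise_chance(n_comparisons: int, p_single: float = 0.05) -> float:
    """Probability that at least one of n independent comparisons
    yields a spuriously favorable result."""
    return 1.0 - (1.0 - p_single) ** n_comparisons

engines, metrics = 8, 5
p = familywise_chance(engines * metrics)
print(f"P(at least one lucky machine result) = {p:.2f}")
```

Under these assumptions the chance of at least one lucky machine result across forty comparisons is well above four in five, which is why an overall test of the model, as point 1(b) notes, should precede any individual comparisons.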
2. Apples and Oranges I: Two very different types of writing assignments. Although the
study was exploring how well machines could grade extended-response writing, (i.e. essays),
fewer than one-half of the datasets were what is commonly defined as extended-response
writing. The remaining five datasets were essentially paragraphs, not essays. The team
conducting the study was aware of the problem. One of the senior managers of the project
reports in an email (Morgan, 2012) that the team spent three months trying to obtain sets of
long-form constructed-response essays, asking every state and even requesting
data from international sources. He admits that some of the datasets defined as “essays” in
the study were shorter than the length the team desired, but defends the sample by arguing
that even these short pieces of writing are categorized as essays by the states, meaning that at
least one of the fifty states defines it that way. Moreover, these essays and the way that they
are scored may be unrepresentative of state assessments as a whole. The most problematic of
the eight datasets, #3, #4, #5, & #6, all come from, at most, two states.
Even more troubling, datasets #3, #4, #5, & #6 are also not measures of writing ability.
Shermis & Hammer (p.6) do note that these “four essays were ‘source-based,’ that is, the
questions asked in the prompt referred to a source document that students read as part of the
assessment.” That description, however, is misleading. They are not source-based writing
assignments in the commonly used sense that writing ability is evaluated partly on how well
the writing engages a text or effectively employs information from a text. The construct
being measured in each of these datasets is literary analysis for datasets #3 & #4 and reading
comprehension in datasets #5 & #6. Indeed, except for the word “clear” being included for
scores of “3” & “4” in the rubrics for datasets #5 & #6, there is no reference in any of the
rubrics of these four datasets to any trait associated with writing. Moreover, the rubrics and
other materials for datasets #5 & #6 explicitly define them as a reading test, while defining a
different scale for writing tests. (See Appendix). Weigle (2013) defines such exercises as
Assessing Content through Writing (ACW)—in which the construct is the student’s
understanding and knowledge about specific content. Consequently, half the datasets were
not writing assignments. In addition, some of these assignments had rubrics and other
scoring materials that were very amenable to grading by computer. The training materials
for the reading summary in dataset #6, for example, contained an explicit bulleted list of the
items that should be included in the summary. (See Appendix.)
3. Moving the Goal Posts and Apples and Oranges II: What scoring rules are being
compared? The most serious problem is that for half of the datasets-- #3, #4, #5, #6-- the
performance measure for human readers is different from the performance measure for
machines. For all of the measures beginning with Table 8, the text uses the variable H1H2,
the reliability between the two readers, as the measure for reader reliability, while the
measure for machine performance is reliability between machine and the resolved score
(RS).
In most essay testing situations, the standard practice is that the resolved score is the sum of
the two reader scores if the scores are identical or adjacent; or, if the scores vary by more
than one point, the resolved score is established by one or two supervisors re-reading the
essay. Yet only one of the datasets in the study, #1, follows the standard best practice of
combining two equal or adjacent scores to compute RS. Dataset #7 combines composite
scores regardless of the size of the difference between them. Dataset #8 also appears to
combine composite scores regardless of the size of the difference but has a third reader
adjudicating 17.7% of the RS’s randomly. Dataset #2 uses the score of only the first reader
as the RS, regardless of the second reader’s score. The remaining four datasets, #3, #4, #5,
and #6, all compute RS as the higher of the two scores.2 This procedure, followed by 55%
of the essays in the aggregate data sample and by four of the eight datasets, which come
from, at most, two states, skews many of the metrics used in favor
of the AES scores.
The confusion between human scores and resolved score is found throughout the text. The
report states, for example, on page 22, “all vendor engines generated predicted means within
0.10 of the human mean for Dataset #3 which had a rubric range of 0-3.” The report,
however, is referring to the mean of the resolved score not the mean of the human raters,
which were, in actuality, lower than the resolved score by 0.11 and 0.17 respectively. (See
Table B following – to avoid confusion tables in this Critique are labeled by letters while
the tables in the Shermis & Hammer Report are labeled with numbers)
Essay scores, be they holistic, trait, or analytic, are always continuous variables, not
integers, even though graders almost always have to give integer values as scores. The
report recognizes this fact in the observation on page 24 that values for the Pearson r “might
2 Although the adjudication rules given in the essay set descriptions for Essay Sets #3 & #4 do not mention it,
examination of the training set revealed that, like Essay Sets #5 & #6, the resolved score was computed by taking
the higher of two adjacent scores. There were no sets of scores in the training sample for Essay Sets #3 & #4 that
contained pairs of scores differing by more than one point, and there were no third-rater scores. Consequently, four of the
datasets, from at most two states, computed the resolved score by taking the higher score if the two rater scores were
not identical. The authors mention, on page 9, instances in which the higher of the two scores in one dataset (dataset
#5) was not the resolved score. In the two instances I identified, the two readers’ scores were not adjacent and
the resolved score was probably an adjudicated score.
have been higher except that the vendors were asked to predict integer values only.” Each
reader has to select a single integer value even though some essays might be on the border
between two adjacent integers. Some 3’s on a 4-point-scale might be very high 3’s
bordering on a 4, while other 3’s may be very low 3’s bordering on a 2. (Some of the
training materials for the datasets included essays scores with plus and minus signs.) In the
terminology of Classical Test Theory, the True Score might be 3.3 or 2.8. Consequently,
adjacent agreement in the correct direction between two readers (e.g., one rater gives an
essay a score of 3 and the second rater gives the essay a score of 4) will more closely
approximate a True Score of 3.4 than two scores of 3. Note, however, that adjacency needs
to be in the correct direction. If a human rater gives that same essay a 3 and an AES
algorithm gives the essay a score of 2, the scores are still adjacent and might produce the
same Kappa but the machine score is much farther away from the true score. The Quadratic
Weighted Kappa does compensate for direction. However, the Quadratic Weighted Kappa
is meant to compare the scores of two autonomous readers, not a reader score and an
artificially resolved score.
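The squared-distance weighting at issue can be made concrete with a short, self-contained sketch. The function below is a standard from-scratch quadratic weighted kappa; the two rating vectors are invented purely for illustration, with every disagreement adjacent in the first and two points apart in the second.

```python
# A from-scratch quadratic weighted kappa. Exact agreement is zero in
# both toy comparisons below, but the quadratic weighting penalizes the
# two-point disagreements far more heavily than the adjacent ones.
# The rating vectors are hypothetical, invented for illustration.

from collections import Counter

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Quadratic weighted kappa between two integer rating vectors."""
    n = max_rating - min_rating + 1
    total = len(a)
    # observed count matrix of rating pairs
    obs = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        obs[x - min_rating][y - min_rating] += 1
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2               # quadratic weight
            expected = hist_a[i + min_rating] * hist_b[j + min_rating] / total
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

human = [1, 2, 3, 4, 1, 2, 3, 4]
off_by_one = [2, 3, 4, 3, 2, 3, 4, 3]   # every disagreement adjacent
off_by_two = [3, 4, 1, 2, 3, 4, 1, 2]   # every disagreement two points apart

print(quadratic_weighted_kappa(human, off_by_one, 1, 4))  # 0.5
print(quadratic_weighted_kappa(human, off_by_two, 1, 4))  # -0.6
```

The statistic, as noted above, was designed to compare two autonomous raters; in the study it was instead computed between a rater (or engine) score and the artificially resolved score.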
Resolving scores merely by selecting the higher one ignores the continuous nature of the
scores being measured and penalizes human raters while giving AES algorithms a
substantial advantage by allowing them to optimize agreement with RS by rounding up.
In the case of the essay that has a True Score of 3.4, for example, there are four likely
pairs of scores that would be produced by two human raters: 3-3, 3-4, 4-3, and 4-4. Note
that in three of these four cases, selecting the higher score makes 4 the resolved score,
and that in two of these three instances one of the two reader scores will be lower than
the resolved score. This bias can be observed in the data in the tables at the end of the
report. The reader means given in Table 4 of the report, for the five score sets that either
combine rater scores to compute RS (#1, #7, & #8) or use a single score as RS (#2A &
#2B), track the means of the RS much more closely than the AES means do, as
illustrated in Table A, while Table B displays the means for the datasets that used the
higher human rater score as the resolved score. The contrast between the two tables in
the difference between the average human rater means and the resolved score means,
which in Table B are consistently higher, is striking and illustrates how the second
method skews the results against human raters.

Table A: Test Set Means for Resolved Score = Sum of Scores or Single Score

Set  H1     H2     RS     Diff. Avg. Human      Range of AES     Range of Diff. of AES
                          Rater Means from      Mean Scores      Mean Scores from RS Means
                          RS Means
1    8.61   8.62   8.62   -0.01                 8.49 – 8.80      -0.13 – 0.18
2A   –      3.39   3.41   -0.02                 3.33 – 3.41      -0.08 – 0.00
2B   –      3.34   3.32    0.02                 3.18 – 3.37      -0.14 – 0.05
7    20.02  20.24  20.13   0.00                 19.46 – 20.05    -0.67 – -0.08
8    36.45  36.70  36.67  -0.09                 37.04 – 37.79     0.37 – 1.12
Table B: Test Set Means for Resolved Score = Higher Human Rater Score

Set  H1    H2    RS    Diff. Avg. Human      Range of AES    Range of Diff. of AES
                       Rater Means from      Mean Scores     Mean Scores from RS Means
                       RS Means
3    1.79  1.73  1.90  0.14                  1.84 – 1.95     -0.06 – 0.05
4    1.38  1.40  1.51  0.12                  1.34 – 1.57     -0.17 – 0.06
5    2.31  2.35  2.51  0.18                  2.44 – 2.54     -0.07 – 0.03
6    2.57  2.58  2.75  0.18                  2.54 – 2.83     -0.04 – 0.08
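The rounding-up mechanism behind the pattern in Table B can be checked with a toy simulation of the 3.4 true-score example given earlier. The probability model, in which each rater independently rounds a 3.4 true score up to 4 with probability 0.4, is an illustrative assumption, not the study's data.

```python
# Toy simulation of the rounding-up bias: two raters independently score
# an essay whose true score is 3.4, each giving a 3 or a 4, and the
# resolved score is the HIGHER of the two. The resolved-score mean
# drifts above the rater mean, matching the pattern in Table B.
# P(rater gives 4) = 0.4 is a simplifying assumption for illustration.

import random

random.seed(0)
true_score = 3.4
p_four = true_score - 3              # chance a rater rounds up to 4

rater_scores, resolved = [], []
for _ in range(100_000):
    r1 = 4 if random.random() < p_four else 3
    r2 = 4 if random.random() < p_four else 3
    rater_scores += [r1, r2]
    resolved.append(max(r1, r2))     # "higher of the two" resolution rule

mean_raters = sum(rater_scores) / len(rater_scores)
mean_resolved = sum(resolved) / len(resolved)
print(f"rater mean ~ {mean_raters:.2f}, resolved mean ~ {mean_resolved:.2f}")
```

Under this model the rater mean stays near the 3.4 true score while the resolved mean drifts toward 3.64, a gap of the same order as the 0.12-0.18 differences in Table B.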
Separating the datasets into two groups, those that use a single human score or a sum of
two human scores to compute the resolved score and those that use the higher score as the
resolved score, presents two very different sets of values for the metrics used in the study.
Tables A & B already demonstrate the substantial difference for means. Similar
distinctions can be shown in the other tables. Indeed, in the five measures of agreement,
exact agreement (Table 8), exact and adjacent agreement (Table 10), Kappas (Table 12),
Quadratic Weighted Kappas (Table 14), and the Pearson r (Table 16), the human raters,
across the group of datasets, clearly outperform the AES engines on the first three and have
mixed results for the Quadratic Weighted Kappa and Pearson r. Curiously, for the Quadratic
Weighted Kappa (Table 14) the relationship of the two groups is inverted – human raters
in two of the four datasets that use the higher score as the resolved score (#3 & #4) as well
as score sets #2A & #2B outperform the AES engines while AES engines outperform
human raters in the other datasets. This anomaly may partially be an artifact of the
Quadratic Weighted Kappa measuring correspondence not between two raters, as was
intended, but between a rater (or machine score) and the artificial construct of the resolved
score as higher of the two scores. Another possible explanation is offered by Brenner &
Kliebsch (1996), who note that quadratically weighted kappa coefficients tend to
increase with larger scales while unweighted kappa coefficients decrease. They note that
“variation of the quadratically weighted kappa coefficient with the number of categories
appears to be strongest in the range from two to five categories.” (p. 201). As displayed in
Table 3 of the report, the scales for datasets #3 & #4 consisted of a scale of four (0-3),
while datasets #5 & #6 consisted of a scale of five (0-4). With the exception of the four
point scale for score set #2B, all the other datasets had scales greater than five. For
dataset #1 the range of the rubric was 1-6 and the range of the resolved score was 2-12.
For scoring set #2A, the range was 1-6; for scoring set #7, the range of the rubric was 0-
12, and the range of the resolved score was 0-24. For dataset #8, the range of the rubric
was 0-30, and the range of the resolved score was 0-30.
Datasets #7 and #8, however, are not really holistic scores, as reported by Shermis and
Hammer. They are composite scores. Dataset #7, a narrative essay, not an expository
essay as reported in Table 2, consists of four (not six) traits, Ideas, Organization, Style,
and Conventions each rated on a 0-3 scale. (See Appendix.) The resolved score as given in
the training set was the total of the scores of each of the two readers producing a range of
0-12. (The scoring guide states that scores for Ideas were doubled, but that was not the
case in the totals given in the training sample.) Dataset #8, also a narrative essay, is scored on
six traits on a 1-6 scale for each, but only four of them, Ideas and Content, Organization,
Sentence Fluency, and Conventions, are counted in computing the resolved score, with
Conventions being double weighted to produce a scale of 10-60 (See Appendix). The
training sets for both datasets #7 & #8 include the individual human trait scores, and it might
have been much more illuminating if the machines had been asked to compute individual trait
scores and those, rather than the composite scores, had been compared to the human scores.
The standard method for comparing the reliability of machine scores to human scores is to
compare the reliability of the machine scores against each of the two human scores and then
to compare those values to the reliability of the human scorers with each other. (McCurry, 2010;
Bennett, 2006). In those studies, as in many others, humans clearly outperformed machines.
Yet the study instead chose to use different variables for humans and machines. The study
could have used the data collected to test the hypothesis that AES can match human scoring
simply by using the sum of the two reader scores as the dependent variable as was the case
with datasets #1, #7, & #8.
Although the values for the two readers’ individual scores compared to the resolved score (H1
& H2) are consistently higher than the machine scores for all of the metrics displayed in the
tables, that could well be an artifact of the individual reader score being a contributing
element to the resolved score. However, of the nine score sets, the H2 scores for #2A
and #2B are completely independent of the resolved score, because reader H1 defined the
resolved score and H2’s scores were used only for computing grading reliability. That H2 in
score sets #2A & #2B outperformed all of the machines in every metric, except for one
machine in one metric, offers some evidence that the high individual reader scores compared
to the resolved score are not solely an artifact of their being a part of the whole. As shown in
Table C, #2A, which measured ideas, content, organization, style, and voice, had an exact
agreement value of 0.76, compared to the range of machine values of 0.55-0.70. Its Kappa
was 0.62, compared to the range of machine values of 0.30-0.51. Its Quadratic Weighted
Kappa was 0.80, compared to the range of machine values of 0.62-0.74. And its Pearson r
was 0.73, compared to the range of machine values of 0.62-0.74. Similarly, #2B, which
measured conventions of grammar, usage, punctuation, and spelling, had an exact
agreement value of 0.73 compared to the range of machine values of 0.55-0.69. Its Kappa
was 0.56, compared to the range of machine values of 0.27-0.49. Its Quadratic Weighted
Kappa was 0.76, compared to the range of machine values of 0.62-0.74. And its Pearson r
was 0.76, compared to the range of machine values of 0.55-0.71.
Table C: Dataset #2 – H2 Score Compared to Resolved Score vs. Machine Scores

Metric                     H2 2A   Range of Machine   H2 2B   Range of Machine
                                   Scores (2A)                Scores (2B)
Exact agreement            0.76    0.55 – 0.70        0.73    0.55 – 0.69
Kappa                      0.62    0.30 – 0.51        0.56    0.27 – 0.47
Quadratic Weighted Kappa   0.80    0.62 – 0.74        0.76    0.62 – 0.74
Pearson r                  0.73    0.62 – 0.74        0.76    0.55 – 0.71
4. Smoke and Mirrors: Overall, the report minimizes the accuracy of the human scorers
and over-represents the accuracy of machine scoring, even with the skewed variables.
To support this assertion, the comparisons between human readers and machine
scores are condensed in Tables D-G. I did not include a table for adjacent & exact agreement
because, with many of the scales being 1-4, adjacent & exact agreement was often at 0.99 for
both humans and machines.

Table D: Exact Agreement Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.64 0.64 0.64 0.44 0.43 .31 - .47
2a --- 0.76 0.76 0.68 0.66 .55 - .70
2b --- 0.73 0.73 0.66 0.65 .55 - .69
3 0.89 0.83 0.72 0.69 0.67 .61 - .72
4 0.87 0.89 0.76 0.65 0.64 .47 - .72
5 0.77 0.79 0.59 0.68 0.66 .47 - .71
6 0.80 0.81 0.63 0.64 0.64 .51 - .69
7 0.28 0.28 0.28 0.12 0.12 .07 - .15
8 0.35 0.35 0.29 0.16 0.16 .08 - .23
Exact agreement is summarized in Table D. The report aggregates the ranges of agreement
for H1H2 among all eight datasets and all nine rows of data, stating on page 22 that “The
human exact agreements ranged from 0.28 on dataset #8 to 0.76 for dataset #2.” The report
then states, “the predicted machine score and had a range from 0.07 on dataset #2 [sic] to
0.72 on datasets #3 and #4. An inspection of the deltas on Table 9 shows that machines
performed particularly well on datasets #5 and 6, two of the source-based essays.”
Aside from the careless error of incorrectly attributing the 0.07 exact agreement to dataset
#2 instead of to dataset #7, the report ignores how human scorers performed better than the
machines for most of the datasets. Of the nine scores, the human rater agreement
coefficients exceeded the top score of the machines in six of them, tying in a seventh. In
dataset #1 both readers performed 0.17 better than the best performing machine. In dataset
2a, the single “read-behind” reader performed 0.06 better than the best performing machine.
In dataset 2b, the single “read-behind” reader performed 0.04 better than the best performing
machine. The next four datasets are content based. For datasets #3 & #4, the agreement of
the two readers outperforms all but one of the machines and ties that one.
Table E: Kappa Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.53 0.53 0.45 0.29 0.28 .16 - .33
2a --- 0.62 0.62 0.48 0.46 .30 - .51
2b --- 0.56 0.56 0.45 0.42 .27 - .49
3 0.83 0.77 0.57 0.53 0.52 .45 - .59
4 0.82 0.84 0.65 0.50 0.50 .30 - .60
5 0.69 0.71 0.44 0.55 0.52 .28 - .59
6 0.70 0.71 0.45 0.46 0.46 .31 - .55
7 0.23 0.23 0.18 0.07 0.07 .03 - .09
8 0.26 0.26 0.16 0.09 0.08 .04 - .13
Table E summarizes the Kappa scores. On page 23, the report states that “in general,
performance on kappa was slightly less with the exception of essay prompts #5 & #6. On
these datasets, the AES engines, as a group, matched or exceeded human performance.”
While this last claim is true for dataset #5, it was not true for dataset #6, where the value for
H1H2 fell right in the middle of the machine scores. Moreover, the machine performance
was not “slightly” lower than human performance as measured by H1H2; it was substantially
lower for all datasets except #5 & #6, as can be observed simply by comparing H1H2 with the
median and range values of the machine scores in Table E.
Table F: Quadratic Weighted Kappa Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.77 0.78 0.73 0.78 0.77 .66 - .82
2a --- 0.80 0.80 0.70 0.70 .62 - .74
2b --- 0.76 0.76 0.66 0.65 .55 - .69
3 0.92 0.89 0.77 0.72 0.71 .65 - .75
4 0.93 0.94 0.85 0.76 0.77 .67 - .81
5 0.89 0.90 0.74 0.81 0.79 .64 - .82
6 0.89 0.89 0.74 0.76 0.74 .65 - .81
7 0.78 0.77 0.72 0.77 0.75 .58 - .84
8 0.75 0.74 0.61 0.68 0.67 .60 - .73
Table F summarizes the scores on the quadratic weighted kappa. As mentioned previously,
the machines do better on the quadratic weighted kappa except for score sets #2A and #2B
and the literary analysis questions, datasets #3 & #4.
The performance of H1H2 against the machines as measured by the Pearson r, summarized
in Table G, is mixed.
Table G: Pearson r Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.93 0.93 0.73 0.80 0.77 .76 - .82
2a --- 0.80 0.80 0.71 0.70 .62 - .74
2b --- 0.76 0.76 0.67 0.66 .55 - .71
3 0.92 0.89 0.77 0.72 0.71 .65 - .75
4 0.94 0.94 0.85 0.76 0.77 .68 - .82
5 0.89 0.90 0.75 0.81 0.79 .65 - .84
6 0.89 0.89 0.74 0.77 0.75 .65 - .81
7 0.93 0.93 0.72 0.78 0.76 .58 - .84
8 0.87 0.88 0.61 0.70 0.68 .62 - .73
5. Finally, although the paper gives the total sample size as 22,029, only 4,343 essays in
eight different datasets comprised the actual test set. The larger number was collected, but
most of the essays were used as the training set for the machines, with another set reserved
as a validation set. While large training sets are common for AES, the authors should have
emphasized the size of the actual analytic sample rather than the number of all the essays
collected, a reporting strategy that inflates the apparent sample size.
Conclusion. Even with an experimental design that used different measures for human and
machine scorers and that privileged the machines in half the datasets, the study clearly does not
demonstrate that machines can replicate human scores. Indeed, comparing the performance of
human graders matching each other to the machines matching the resolved score still gives some
indication that the human raters may be significantly more reliable than machines. Even with the
very flawed overall design of the study, further and rigorous statistical analysis of data may yield
some interesting and extremely important information. Moreover, there are pressing policy
decisions that argue for further analysis of this data. Given that this paper has been reported to
both the Partnership for Assessment of Readiness of College and Careers and the Smarter
Balanced Assessment Consortium, and, consequently, may inform decisions by the two consortia
about the use of automated essay scoring in high stakes testing, it is imperative that the authors
publicly post the raw test set data from this study for further analysis that could possibly either
confirm their conclusions or refute them.
References
Man and machine: Better writers, better grades. (2012, April 12). Retrieved from The
University of Akron News: http://www.uakron.edu/im/online-
newsroom/news_details.dot?newsId=40920394-9e62-415d-b038-15fe2e72a677&
Bennett, R. E. (2006). Technology and Writing Assessment: Lessons Learned from the US
National Assessment of Educational Progress. International Association for Educational
Assessment. Singapore. Retrieved March 9, 2013, from
http://www.iaea.info/documents/paper_1162a26d7.pdf
Brenner, H., & Kliebsch, U. (1996, March). Dependence of Weighted Kappa Coefficients on the
Number of Categories. Epidemiology, 7(2), 199-202.
Giles, J. (2012, April 25). AI graders get top marks for scoring essay questions. Retrieved from
The New Scientist: http://www.newscientist.com/article/mg21428615.000-ai-graders-get-
top-marks-for-scoring-essay-questions.html
Kolowich, S. (2012, April 13). A Win for the Robo-Readers. Retrieved from Inside Higher
Education: http://www.insidehighered.com/news/2012/04/13/large-study-shows-little-
difference-between-human-and-robot-essay-graders
McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as
human readers? Assessing Writing, 15, 118-129.
Morgan, J. (2012, June 19). RE: Request for data on AES study. Message to the author. Email.
Shermis, M. D., & Hammer, B. (2012). Contrasting State-of-the-Art Automated Scoring of
Essays: Analysis. Retrieved March 3, 2013, from ASAP:
http://www.scoreright.org/NCME_2012_Paper3_29_12.pdf
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical
considerations. Assessing Writing, 18, 85-99.
Appendix
Prompts, Rubrics, and Other Materials
Source: http://www.kaggle.com/c/asap-aes/data
Dataset #1
Dataset #2
Writing Prompt
Question 1
IXE01020 version 2.1
“All of us can think of a book that we hope none of our children or any other
children have taken off the shelf. But if I have the right to remove that book
from the shelf—that work I abhor—then you also have exactly the same right
and so does everyone else. And then we have no books left on the shelf for any
of us.”
Katherine Paterson
Author
Write a persuasive essay to a newspaper reflecting your views on censorship in
libraries. Do you believe that certain materials, such as books, music, movies,
magazines, etc., should be removed from the shelves if they are found offensive?
Support your position with convincing arguments from your own experience,
observations, and/or reading.
Your writing will be scored on the following aspects:
• Ideas and content: Does your writing accomplish the assigned task?
• Organization: Does your writing contain an introduction, a body, and a conclusion?
• Style: Do the language and vocabulary in your writing help to convey a clear
message and to create interest?
• Voice: Are the tone and language appropriate for your intended audience?
• Language Conventions: Have you used correct sentence structure, grammar, and
punctuation?
Dataset #3
Essay Set #3 Type of essay: Source Dependent Responses
Grade level: 10
Training set size: 1,726 essays
Final evaluation set size: 575 essays
Average length of essays: 150 words
Scoring: 1st Reader Score, 2nd Reader Score, Resolved CR Score
Rubric range: 0-3
Resolved CR score range: 0-3
Source Essay
ROUGH ROAD AHEAD: Do Not Exceed Posted Speed Limit
by Joe Kurmaskie
FORGET THAT OLD SAYING ABOUT NEVER taking candy from strangers. No, a better piece of advice for
the solo cyclist would be, “Never accept travel advice from a collection of old-timers who haven’t left
the confines of their porches since Carter was in office.” It’s not that a group of old guys doesn’t know
the terrain. With age comes wisdom and all that, but the world is a fluid place. Things change.
At a reservoir campground outside of Lodi, California, I enjoyed the serenity of an early-summer evening
and some lively conversation with these old codgers. What I shouldn’t have done was let them have a
peek at my map. Like a foolish youth, the next morning I followed their advice and launched out at first
light along a “shortcut” that was to slice away hours from my ride to Yosemite National Park.
They’d sounded so sure of themselves when pointing out landmarks and spouting off towns I would
come to along this breezy jaunt. Things began well enough. I rode into the morning with strong legs and
a smile on my face. About forty miles into the pedal, I arrived at the first “town.” This place might have
been a thriving little spot at one time—say, before the last world war—but on that morning it fit the
traditional definition of a ghost town. I chuckled, checked my water supply, and moved on. The sun was
beginning to beat down, but I barely noticed it. The cool pines and rushing rivers of Yosemite had my
name written all over them.
Twenty miles up the road, I came to a fork of sorts. One ramshackle shed, several rusty pumps, and a
corral that couldn’t hold in the lamest mule greeted me. This sight was troubling. I had been hitting my
water bottles pretty regularly, and I was traveling through the high deserts of California in June.
I got down on my hands and knees, working the handle of the rusted water pump with all my strength. A
tarlike substance oozed out, followed by brackish water feeling somewhere in the neighborhood of two
hundred degrees. I pumped that handle for several minutes, but the water wouldn’t cool down. It didn’t
matter. When I tried a drop or two, it had the flavor of battery acid.
The old guys had sworn the next town was only eighteen miles down the road. I could make that! I
would conserve my water and go inward for an hour or so—a test of my inner spirit.
Not two miles into this next section of the ride, I noticed the terrain changing. Flat road was replaced by
short, rolling hills. After I had crested the first few of these, a large highway sign jumped out at me. It
read: ROUGH ROAD AHEAD: DO NOT EXCEED POSTED SPEED LIMIT.
The speed limit was 55 mph. I was doing a water-depleting 12 mph. Sometimes life can feel so cruel.
I toiled on. At some point, tumbleweeds crossed my path and a ridiculously large snake—it really did
look like a diamondback—blocked the majority of the pavement in front of me. I eased past, trying to
keep my balance in my dehydrated state.
The water bottles contained only a few tantalizing sips. Wide rings of dried sweat circled my shirt, and
the growing realization that I could drop from heatstroke on a gorgeous day in June simply because I
listened to some gentlemen who hadn’t been off their porch in decades, caused me to laugh.
It was a sad, hopeless laugh, mind you, but at least I still had the energy to feel sorry for myself. There
was no one in sight, not a building, car, or structure of any kind. I began breaking the ride down into
distances I could see on the horizon, telling myself that if I could make it that far, I’d be fine.
Over one long, crippling hill, a building came into view. I wiped the sweat from my eyes to make sure it
wasn’t a mirage, and tried not to get too excited. With what I believed was my last burst of energy, I
maneuvered down the hill.
In an ironic twist that should please all sadists reading this, the building—abandoned years earlier, by
the looks of it—had been a Welch’s Grape Juice factory and bottling plant. A sandblasted picture of a
young boy pouring a refreshing glass of juice into his mouth could still be seen.
I hung my head.
That smoky blues tune “Summertime” rattled around in the dry honeycombs of my deteriorating brain.
I got back on the bike, but not before I gathered up a few pebbles and stuck them in my mouth. I’d read
once that sucking on stones helps take your mind off thirst by allowing what spit you have left to
circulate. With any luck I’d hit a bump and lodge one in my throat.
It didn’t really matter. I was going to die and the birds would pick me clean, leaving only some expensive
outdoor gear and a diary with the last entry in praise of old men, their wisdom, and their keen sense of
direction. I made a mental note to change that paragraph if it looked like I was going to lose
consciousness for the last time.
Somehow, I climbed away from the abandoned factory of juices and dreams, slowly gaining elevation
while losing hope. Then, as easily as rounding a bend, my troubles, thirst, and fear were all behind me.
GARY AND WILBER’S FISH CAMP—IF YOU WANT BAIT FOR THE BIG ONES, WE’RE YOUR BEST BET!
“And the only bet,” I remember thinking.
As I stumbled into a rather modern bathroom and drank deeply from the sink, I had an overwhelming
urge to seek out Gary and Wilber, kiss them, and buy some bait—any bait, even though I didn’t own a
rod or reel.
An old guy sitting in a chair under some shade nodded in my direction. Cool water dripped from my
head as I slumped against the wall beside him.
“Where you headed in such a hurry?”
“Yosemite,” I whispered.
“Know the best way to get there?”
I watched him from the corner of my eye for a long moment. He was even older than the group I’d
listened to in Lodi.
“Yes, sir! I own a very good map.”
And I promised myself right then that I’d always stick to it in the future.
“Rough Road Ahead” by Joe Kurmaskie, from Metal Cowboy, copyright © 1999 Joe Kurmaskie.
Prompt
Write a response that explains how the features of the setting affect the cyclist. In your response,
include examples from the essay that support your conclusion.
Rubric Guidelines
Score 3: The response demonstrates an understanding of the complexities of the text.
Addresses the demands of the question
Uses expressed and implied information from the text
Clarifies and extends understanding beyond the literal
Score 2: The response demonstrates a partial or literal understanding of the text.
Addresses the demands of the question, although may not develop all parts equally
Uses some expressed or implied information from the text to demonstrate understanding
May not fully connect the support to a conclusion or assertion made about the text(s)
Score 1: The response shows evidence of a minimal understanding of the text.
May show evidence that some meaning has been derived from the text
May indicate a misreading of the text or the question
May lack information or explanation to support an understanding of the text in relation to the
question
Score 0: The response is completely irrelevant or incorrect, or there is no response.
Adjudication Rules
If Reader‐1 Score and Reader‐2 Score are exact or adjacent, adjudication by a third reader is not
required.
If Reader‐1 Score and Reader‐2 Score are not adjacent or exact, then adjudication by a third
reader is required.
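The adjudication rule above can be expressed as a simple check. The sketch below is an illustration of that rule only (not code from the study or its vendors), assuming integer scores on the 0-3 rubric:

```python
def needs_adjudication(reader1: int, reader2: int) -> bool:
    """Return True when the two reader scores are neither exact nor
    adjacent, i.e., they differ by more than one point and a third
    reader must resolve the score."""
    return abs(reader1 - reader2) > 1

# Examples on the 0-3 rubric range:
print(needs_adjudication(2, 2))  # exact scores -> False
print(needs_adjudication(1, 2))  # adjacent scores -> False
print(needs_adjudication(0, 3))  # discrepant scores -> True
```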
Dataset #4
Essay Set #4 Type of essay: Source Dependent Responses
Grade level: 10
Training set size: 1,772 essays
Final evaluation set size: 589 essays
Average length of essays: 150 words
Scoring: 1st Reader Score, 2nd Reader Score, Resolved CR Score
Rubric range: 0-3
Resolved CR score range: 0-3
Source Essay
Winter Hibiscus by Minfong Ho
Saeng, a teenage girl, and her family have moved to the United States from Vietnam. As Saeng walks
home after failing her driver’s test, she sees a familiar plant. Later, she goes to a florist shop to see if the
plant can be purchased.
It was like walking into another world. A hot, moist world exploding with greenery. Huge flat leaves,
delicate wisps of tendrils, ferns and fronds and vines of all shades and shapes grew in seemingly random
profusion.
“Over there, in the corner, the hibiscus. Is that what you mean?” The florist pointed at a leafy potted
plant by the corner.
There, in a shaft of the wan afternoon sunlight, was a single blood-red blossom, its five petals splayed
back to reveal a long stamen tipped with yellow pollen. Saeng felt a shock of recognition so intense, it
was almost visceral.1
“Saebba,” Saeng whispered.
A saebba hedge, tall and lush, had surrounded their garden, its lush green leaves dotted with vermilion
flowers. And sometimes after a monsoon rain, a blossom or two would have blown into the well, so that
when she drew the well water, she would find a red blossom floating in the bucket.
Slowly, Saeng walked down the narrow aisle toward the hibiscus. Orchids, lanna bushes, oleanders,
elephant ear begonias, and bougainvillea vines surrounded her. Plants that she had not even realized
she had known but had forgotten drew her back into her childhood world.
When she got to the hibiscus, she reached out and touched a petal gently. It felt smooth and cool, with a
hint of velvet toward the center—just as she had known it would feel.
And beside it was yet another old friend, a small shrub with waxy leaves and dainty flowers with
purplish petals and white centers. “Madagascar periwinkle,” its tag announced. How strange to see it in
a pot, Saeng thought. Back home it just grew wild, jutting out from the cracks in brick walls or between
tiled roofs.
And that rich, sweet scent—that was familiar, too. Saeng scanned the greenery around her and found a
tall, gangly plant with exquisite little white blossoms on it. “Dok Malik,” she said, savoring the feel of the
word on her tongue, even as she silently noted the English name on its tag, “jasmine.”
One of the blossoms had fallen off, and carefully Saeng picked it up and smelled it. She closed her eyes
and breathed in, deeply. The familiar fragrance filled her lungs, and Saeng could almost feel the light
strands of her grandmother’s long gray hair, freshly washed, as she combed it out with the fine-toothed
buffalo-horn comb. And when the sun had dried it, Saeng would help the gnarled old fingers knot the
hair into a bun, then slip a dok Malik bud into it.
Saeng looked at the white bud in her hand now, small and fragile. Gently, she closed her palm around it
and held it tight. That, at least, she could hold on to. But where was the fine-toothed comb? The
hibiscus hedge? The well? Her gentle grandmother?
A wave of loss so deep and strong that it stung Saeng’s eyes now swept over her. A blink, a channel
switch, a boat ride into the night, and it was all gone. Irretrievably, irrevocably gone.
And in the warm moist shelter of the greenhouse, Saeng broke down and wept.
It was already dusk when Saeng reached home. The wind was blowing harder, tearing off the last
remnants of green in the chicory weeds that were growing out of the cracks in the sidewalk. As if
oblivious to the cold, her mother was still out in the vegetable garden, digging up the last of the onions
with a rusty trowel. She did not see Saeng until the girl had quietly knelt down next to her.
Her smile of welcome warmed Saeng. “Ghup ma laio le? You’re back?” she said cheerfully. “Goodness,
it’s past five. What took you so long? How did it go? Did you—?” Then she noticed the potted plant that
Saeng was holding, its leaves quivering in the wind.
Mrs. Panouvong uttered a small cry of surprise and delight. “Dok faeng-noi!” she said. “Where did you
get it?”
“I bought it,” Saeng answered, dreading her mother’s next question.
“How much?”
For answer Saeng handed her mother some coins.
“That’s all?” Mrs. Panouvong said, appalled, “Oh, but I forgot! You and the
Lambert boy ate Bee-Maags . . . .”
“No, we didn’t, Mother,” Saeng said.
“Then what else—?”
“Nothing else. I paid over nineteen dollars for it.”
“You what?” Her mother stared at her incredulously. “But how could you? All the seeds for this
vegetable garden didn’t cost that much! You know how much we—” She paused, as she noticed the
tearstains on her daughter’s cheeks and her puffy eyes.
“What happened?” she asked, more gently.
“I—I failed the test,” Saeng said.
For a long moment Mrs. Panouvong said nothing. Saeng did not dare look her mother in the eye.
Instead, she stared at the hibiscus plant and nervously tore off a leaf, shredding it to bits.
Her mother reached out and brushed the fragments of green off Saeng’s hands. “It’s a beautiful plant,
this dok faeng-noi,” she finally said. “I’m glad you got it.”
“It’s—it’s not a real one,” Saeng mumbled.
“I mean, not like the kind we had at—at—” She found that she was still too shaky to say the words at
home, lest she burst into tears again. “Not like the kind we had before,” she said.
“I know,” her mother said quietly. “I’ve seen this kind blooming along the lake. Its flowers aren’t as
pretty, but it’s strong enough to make it through the cold months here, this winter hibiscus. That’s what
matters.”
She tipped the pot and deftly eased the ball of soil out, balancing the rest of the plant in her other hand.
“Look how root-bound it is, poor thing,” she said. “Let’s plant it, right now.”
She went over to the corner of the vegetable patch and started to dig a hole in the ground. The soil was
cold and hard, and she had trouble thrusting the shovel into it. Wisps of her gray hair trailed out in the
breeze, and her slight frown deepened the wrinkles around her eyes. There was a frail, wiry beauty to
her that touched Saeng deeply.
“Here, let me help, Mother,” she offered, getting up and taking the shovel away from her.
Mrs. Panouvong made no resistance. “I’ll bring in the hot peppers and bitter melons, then, and start
dinner. How would you like an omelet with slices of the bitter melon?”
“I’d love it,” Saeng said.
Left alone in the garden, Saeng dug out a hole and carefully lowered the “winter hibiscus” into it. She
could hear the sounds of cooking from the kitchen now, the beating of eggs against a bowl, the sizzle of
hot oil in the pan. The pungent smell of bitter melon wafted out, and Saeng’s mouth watered. It was a
cultivated taste, she had discovered—none of her classmates or friends, not even Mrs. Lambert, liked
it—this sharp, bitter melon that left a golden aftertaste on the tongue. But she had grown up eating it
and, she admitted to herself, much preferred it to a Big Mac.
The “winter hibiscus” was in the ground now, and Saeng tamped down the soil around it. Overhead, a
flock of Canada geese flew by, their faint honks clear and—yes—familiar to Saeng now. Almost
reluctantly, she realized that many of the things that she had thought of as strange before had become,
through the quiet repetition of season upon season, almost familiar to her now. Like the geese. She
lifted her head and watched as their distinctive V was etched against the evening sky, slowly fading into
the distance.
When they come back, Saeng vowed silently to herself, in the spring, when the snows melt and the
geese return and this hibiscus is budding, then I will take that test again.
“Winter Hibiscus” by Minfong Ho, copyright © 1993 by Minfong Ho, from Join In, Multiethnic Short Stories, by Donald R. Gallo, ed.
Prompt
Read the last paragraph of the story. "When they come back, Saeng vowed silently to herself, in the spring, when the snows melt and the geese return and this hibiscus is budding, then I will take that test again." Write a response that explains why the author concludes the story with this paragraph. In your response, include details and examples from the story that support your ideas.
Rubric Guidelines
Score 3: The response demonstrates an understanding of the complexities of the text.
Addresses the demands of the question
Uses expressed and implied information from the text
Clarifies and extends understanding beyond the literal
Score 2: The response demonstrates a partial or literal understanding of the text.
Addresses the demands of the question, although may not develop all parts equally
Uses some expressed or implied information from the text to demonstrate understanding
May not fully connect the support to a conclusion or assertion made about the text(s)
Score 1: The response shows evidence of a minimal understanding of the text.
May show evidence that some meaning has been derived from the text
May indicate a misreading of the text or the question
May lack information or explanation to support an understanding of the text in relation to the
question
Score 0: The response is completely irrelevant or incorrect, or there is no response.
Adjudication Rules
If Reader‐1 Score and Reader‐2 Score are exact or adjacent, adjudication by a third reader is not
required.
If Reader‐1 Score and Reader‐2 Score are not adjacent or exact, then adjudication by a third
reader is required.
Dataset #5
MCAS Reading 2010 Grade 8
C-24 “Narciso Rodriguez”
Itemnumber: 272837
Contract: 1117
Describe the mood created by the author in the memoir. Support your answer with relevant and specific information from the memoir.
Scoring Guide
Score 4: The response is a clear, complete, and accurate description of the mood created by the author. The response includes relevant and specific information from the memoir.
Score 3: The response is a mostly clear, complete, and accurate description of the mood created by the author. The response includes relevant but often general information from the memoir.
Score 2: The response is a partial description of the mood created by the author. The response includes limited information from the memoir and may include misinterpretations.
Score 1: The response is a minimal description of the mood created by the author. The response includes little or no information from the memoir and may include misinterpretations. OR The response relates minimally to the task.
Score 0: The response is incorrect or irrelevant or contains insufficient information to demonstrate comprehension.
Blank: No response.
Scoring Notes: The response should describe a mood of gratitude, love, or any similar appreciative mood. The response may include, but is not limited to:
• In paragraph 2: the author says he is “eternally grateful” to his parents for instilling in him a love of cooking. He also credits them for his appreciation of Cuban music, “which I adore to this day.” In general he notes their having made an inviting home filled with “endless celebrations” out of “modest” means.
• In paragraph 3: the author credits his parents for instilling in him a great sense of “family” due to the “environment” they created. This sense of family extended to everyone in a time when the larger world was uninviting.
• In paragraph 4: the author mentions his family’s generosity in allowing others to stay with them and notes its reciprocal nature.
• In paragraph 5: the author recognizes that his parents came to America “selflessly” in order to “give their children a better life.” He details their challenges and obstacles and observes that they “endured.”
• In paragraph 6: the author states, “I will always be grateful to my parents for their love and sacrifice. I’ve often told them that what they did was a much more courageous thing than I could have ever done.” He mentions his admiration and having thanked them yet admits that he has “no way to express my gratitude.”
• In paragraph 7: the author states, “I will never forget that house or its gracious neighborhood or the many things I learned there about how to love. I will never forget how my parents turned this simple house into a home.”
In conclusion, the author creates an appreciative mood by describing all of the things he is grateful for, including his parents and the home they made for him.
Other interpretations are acceptable if supported by relevant evidence from the text.
Dataset #6
MCAS Reading Grade 10
C-9 “The Mooring Mast”
Itemnumber: 280239
Contract: 1117
Based on the excerpt, describe the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. Support your answer with relevant and specific information from the excerpt.
Scoring Guide
Score 4: The response is a clear, complete, and accurate description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes relevant and specific information from the excerpt.
Score 3: The response is a mostly clear, complete, and accurate description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes relevant but often general information from the excerpt.
Score 2: The response is a partial description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes limited information from the excerpt and may include misinterpretations.
Score 1: The response is a minimal description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes little or no information from the excerpt and may include misinterpretations. OR The response relates minimally to the task.
Score 0: The response is totally incorrect or irrelevant, or contains insufficient evidence to demonstrate comprehension.
Blank: No response.
Scoring Notes: The obstacles to dirigible docking include:
• Building a mast on top of the building
• Meeting with engineers and dirigible engineers
• Transmitting the stress of the dirigible all the way down the building; the frame had to be shored up to the tune of $60,000
• Housing the winches and other docking equipment
• Dealing with flammable gases
• Handling the violent air currents at the top of the building
• Confronting laws banning airships from the area
• Getting close enough to the building without puncturing
Other explanations will be accepted if supported by relevant evidence from the text.
Dataset #7
Writing - Grade 7 Released Items Fall 2010
Analytic Rubric: Narrative Writing, Grades 4 and 7

Ideas (points doubled)
Score 0: Ideas are not focused on the task and/or are undeveloped.
Score 1: Tells a story with ideas that are minimally focused on the topic and developed with limited and/or general details.
Score 2: Tells a story with ideas that are somewhat focused on the topic and are developed with a mix of specific and/or general details.
Score 3: Tells a story with ideas that are clearly focused on the topic and are thoroughly developed with specific, relevant details.

Organization
Score 0: No organization evident.
Score 1: Organization and connections between ideas and/or events are weak.
Score 2: Organization and connections between ideas and/or events are logically sequenced.
Score 3: Organization and connections between ideas and/or events are clear and logically sequenced.

Style
Score 0: Ineffective use of language for the writer’s purpose and audience.
Score 1: Limited use of language, including lack of variety in word choice and sentences, may hinder support for the writer’s purpose and audience.
Score 2: Adequate command of language, including effective word choice and clear sentences, supports the writer’s purpose and audience.
Score 3: Command of language, including effective and compelling word choice and varied sentence structure, clearly supports the writer’s purpose and audience.

Conventions
Score 0: Ineffective use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation.
Score 1: Limited use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation for the grade level.
Score 2: Adequate use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation for the grade level.
Score 3: Consistent, appropriate use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation for the grade level.

Condition Codes (any condition code will result in a score of 0 for all traits)
A: Off-topic
B: Illegible or written in a language other than English
C: Blank
D: Insufficient to rate

* Standard English is the form of English most widely accepted for writing in schools.
Dataset #8
Essay Set #8 Type of essay: Persuasive/Narrative/Expository
Grade level: 10
Training set size: 918 essays
Final evaluation set size: 305 essays
Average length of essays: 650 words
Scoring: Rater1Comp, Rater2Comp, Rater3Comp, Resolved Score
Rater1Comp Rubric range: 0-30
Rater2Comp Rubric range: 0-30
Rater3Comp Rubric range: 0-60
Resolved score range: 0-60
Prompt
We all understand the benefits of laughter. For example, someone once said, “Laughter is the shortest
distance between two people.” Many other people believe that laughter is an important part of any
relationship. Tell a true story in which laughter was one element or part.
Rubric Guidelines
A rating of 1-6 on the following six traits:
Ideas and Content
Score 6: The writing is exceptionally clear, focused, and interesting. It holds the reader’s attention
throughout. Main ideas stand out and are developed by strong support and rich details suitable to
audience and purpose. The writing is characterized by
• clarity, focus, and control.
• main idea(s) that stand out.
• supporting, relevant, carefully selected details; when appropriate, use of resources provides
strong, accurate, credible support.
• a thorough, balanced, in-depth explanation / exploration of the topic; the writing makes
connections and shares insights.
• content and selected details that are well-suited to audience and purpose.
Score 5: The writing is clear, focused and interesting. It holds the reader’s attention. Main ideas stand
out and are developed by supporting details suitable to audience and purpose. The writing is
characterized by
• clarity, focus, and control.
• main idea(s) that stand out.
• supporting, relevant, carefully selected details; when appropriate, use of resources provides
strong, accurate, credible support.
• a thorough, balanced explanation / exploration of the topic; the writing makes connections and
shares insights.
• content and selected details that are well-suited to audience and purpose.
Score 4: The writing is clear and focused. The reader can easily understand the main ideas. Support is
present, although it may be limited or rather general. The writing is characterized by
• an easily identifiable purpose.
• clear main idea(s).
• supporting details that are relevant, but may be overly general or limited in places; when
appropriate, resources are used to provide accurate support.
• a topic that is explored / explained, although developmental details may occasionally be out of
balance with the main idea(s); some connections and insights may be present.
• content and selected details that are relevant, but perhaps not consistently well-chosen for
audience and purpose.
Score 3: The reader can understand the main ideas, although they may be overly broad or simplistic, and
the results may not be effective. Supporting detail is often limited, insubstantial, overly general, or
occasionally slightly off-topic. The writing is characterized by
• an easily identifiable purpose and main idea(s).
• predictable or overly-obvious main ideas; or points that echo observations heard elsewhere; or
a close retelling of another work.
• support that is attempted, but developmental details are often limited, uneven, somewhat off-
topic, predictable, or too general (e.g., a list of underdeveloped points).
• details that may not be well-grounded in credible resources; they may be based on clichés,
stereotypes or questionable sources of information.
• difficulties when moving from general observations to specifics.
Score 2: Main ideas and purpose are somewhat unclear or development is attempted but minimal. The
writing is characterized by
• a purpose and main idea(s) that may require extensive inferences by the reader.
• minimal development; insufficient details.
• irrelevant details that clutter the text.
• extensive repetition of detail.
Score 1: The writing lacks a central idea or purpose. The writing is characterized by
• ideas that are extremely limited or simply unclear.
• attempts at development that are minimal or nonexistent; the paper is too short to
demonstrate the development of an idea.
Organization
Score 6: The organization enhances the central idea(s) and its development. The order and structure are
compelling and move the reader through the text easily. The writing is characterized by
• effective, perhaps creative, sequencing and paragraph breaks; the organizational structure fits
the topic, and the writing is easy to follow.
• a strong, inviting beginning that draws the reader in and a strong, satisfying sense of resolution
or closure.
• smooth, effective transitions among all elements (sentences, paragraphs, ideas).
• details that fit where placed.
Score 5: The organization enhances the central idea(s) and its development. The order and structure are
strong and move the reader through the text. The writing is characterized by
• effective sequencing and paragraph breaks; the organizational structure fits the topic, and the
writing is easy to follow.
• an inviting beginning that draws the reader in and a satisfying sense of resolution or closure.
• smooth, effective transitions among all elements (sentences, paragraphs, ideas).
• details that fit where placed.
Score 4: Organization is clear and coherent. Order and structure are present, but may seem formulaic.
The writing is characterized by
• clear sequencing and paragraph breaks.
• an organization that may be predictable.
• a recognizable, developed beginning that may not be particularly inviting; a developed
conclusion that may lack subtlety.
• a body that is easy to follow with details that fit where placed.
• transitions that may be stilted or formulaic.
• organization which helps the reader, despite some weaknesses.
Score 3: An attempt has been made to organize the writing; however, the overall structure is
inconsistent or skeletal. The writing is characterized by
• attempts at sequencing and paragraph breaks, but the order or the relationship among ideas
may occasionally be unclear.
• a beginning and an ending which, although present, are either undeveloped or too obvious (e.g.,
“My topic is...”; “These are all the reasons that...”).
• transitions that sometimes work. The same few transitional devices (e.g., coordinating
conjunctions, numbering, etc.) may be overused.
• a structure that is skeletal or too rigid.
• placement of details that may not always be effective.
• organization which lapses in some places, but helps the reader in others.
Score 2: The writing lacks a clear organizational structure. An occasional organizational device is
discernible; however, the writing is either difficult to follow and the reader has to reread substantial
portions, or the piece is simply too short to demonstrate organizational skills. The writing is
characterized by
Essay Set #8 4
• some attempts at sequencing, but the order or the relationship among ideas is frequently
unclear; a lack of paragraph breaks.
• a missing or extremely undeveloped beginning, body, and/or ending.
• a lack of transitions, or when present, ineffective or overused.
• a lack of an effective organizational structure.
• details that seem to be randomly placed, leaving the reader frequently confused.
Score 1: The writing lacks coherence; organization seems haphazard and disjointed. Even after
rereading, the reader remains confused. The writing is characterized by
• a lack of effective sequencing and paragraph breaks.
• a failure to provide an identifiable beginning, body and/or ending.
• a lack of transitions.
• pacing that is consistently awkward; the reader feels either mired down in trivia or rushed along
too rapidly.
• a lack of organization which ultimately obscures or distorts the main point.
Voice
Score 6: The writer has chosen a voice appropriate for the topic, purpose, and audience. The writer
demonstrates deep commitment to the topic, and there is an exceptional sense of “writing to be read.”
The writing is expressive, engaging, or sincere. The writing is characterized by
• an effective level of closeness to or distance from the audience (e.g., a narrative should have a
strong personal voice, while an expository piece may require extensive use of outside resources
and a more academic voice; nevertheless, both should be engaging, lively, or interesting.
Technical writing may require greater distance.).
• an exceptionally strong sense of audience; the writer seems to be aware of the reader and of
how to communicate the message most effectively. The reader may discern the writer behind
the words and feel a sense of interaction.
• a sense that the topic has come to life; when appropriate, the writing may show originality,
liveliness, honesty, conviction, excitement, humor, or suspense.
Score 5: The writer has chosen a voice appropriate for the topic, purpose, and audience. The writer
demonstrates commitment to the topic, and there is a sense of “writing to be read.” The writing is
expressive, engaging, or sincere. The writing is characterized by
• an appropriate level of closeness to or distance from the audience (e.g., a narrative should have a
strong personal voice, while an expository piece may require extensive use of outside resources
and a more academic voice; nevertheless, both should be engaging, lively, or interesting.
Technical writing may require greater distance.).
• a strong sense of audience; the writer seems to be aware of the reader and of how to
communicate the message most effectively. The reader may discern the writer behind the
words and feel a sense of interaction.
• a sense that the topic has come to life; when appropriate, the writing may show originality,
liveliness, honesty, conviction, excitement, humor, or suspense.
Score 4: A voice is present. The writer seems committed to the topic, and there may be a sense of
“writing to be read.” In places, the writing is expressive, engaging, or sincere. The writing is
characterized by
• a suitable level of closeness to or distance from the audience.
• a sense of audience; the writer seems to be aware of the reader but has not consistently
employed an appropriate voice. The reader may glimpse the writer behind the words and feel a
sense of interaction in places.
• liveliness, sincerity, or humor when appropriate; however, at times the writing may be either
inappropriately casual or personal, or inappropriately formal and stiff.
Score 3: The writer’s commitment to the topic seems inconsistent. A sense of the writer may emerge at
times; however, the voice is either inappropriately personal or inappropriately impersonal. The writing is
characterized by
• a limited sense of audience; the writer’s awareness of the reader is unclear.
• an occasional sense of the writer behind the words; however, the voice may shift or disappear a
line or two later and the writing become somewhat mechanical.
• a limited ability to shift to a more objective voice when necessary.
• text that is too short to demonstrate a consistent and appropriate voice.
Score 2: The writing provides little sense of involvement or commitment. There is no evidence that the
writer has chosen a suitable voice. The writing is characterized by
• little engagement of the writer; the writing tends to be largely flat, lifeless, stiff, or mechanical.
• a voice that is likely to be overly informal and personal.
• a lack of audience awareness; there is little sense of “writing to be read.”
• little or no hint of the writer behind the words. There is rarely a sense of interaction between
reader and writer.
Score 1: The writing seems to lack a sense of involvement or commitment. The writing is characterized
by
• no engagement of the writer; the writing is flat and lifeless.
• a lack of audience awareness; there is no sense of “writing to be read.”
• no hint of the writer behind the words. There is no sense of interaction between writer and
reader; the writing does not involve or engage the reader.
Word Choice
Score 6: Words convey the intended message in an exceptionally interesting, precise, and natural way
appropriate to audience and purpose. The writer employs a rich, broad range of words which have been
carefully chosen and thoughtfully placed for impact. The writing is characterized by
• accurate, strong, specific words; powerful words energize the writing.
• fresh, original expression; slang, if used, seems purposeful and is effective.
• vocabulary that is striking and varied, but that is natural and not overdone.
• ordinary words used in an unusual way.
• words that evoke strong images; figurative language may be used.
Score 5: Words convey the intended message in an interesting, precise, and natural way appropriate to
audience and purpose. The writer employs a broad range of words which have been carefully chosen
and thoughtfully placed for impact. The writing is characterized by
• accurate, specific words; word choices energize the writing.
• fresh, vivid expression; slang, if used, seems purposeful and is effective.
• vocabulary that may be striking and varied, but that is natural and not overdone.
• ordinary words used in an unusual way.
• words that evoke clear images; figurative language may be used.
Score 4: Words effectively convey the intended message. The writer employs a variety of words that are
functional and appropriate to audience and purpose. The writing is characterized by
• words that work but do not particularly energize the writing.
• expression that is functional; however, slang, if used, does not seem purposeful and is not
particularly effective.
• attempts at colorful language that may occasionally seem overdone.
• occasional overuse of technical language or jargon.
• rare experiments with language; however, the writing may have some fine moments and
generally avoids clichés.
Score 3: Language lacks precision and variety, or may be inappropriate to audience and purpose in
places. The writer does not employ a variety of words, producing a sort of “generic” paper filled with
familiar words and phrases. The writing is characterized by
• words that work, but that rarely capture the reader’s interest.
• expression that seems mundane and general; slang, if used, does not seem purposeful and is not
effective.
• attempts at colorful language that seem overdone or forced.
• words that are accurate for the most part, although misused words may occasionally appear;
technical language or jargon may be overused or inappropriately used.
• reliance on clichés and overused expressions.
• text that is too short to demonstrate variety.
Score 2: Language is monotonous and/or misused, detracting from the meaning and impact. The writing
is characterized by
• words that are colorless, flat or imprecise.
• monotonous repetition or overwhelming reliance on worn expressions that repeatedly detract
from the message.
• images that are fuzzy or absent altogether.
Score 1: The writing shows an extremely limited vocabulary or is so filled with misuses of words that the
meaning is obscured. Only the most general kind of message is communicated because of vague or
imprecise language. The writing is characterized by
• general, vague words that fail to communicate.
• an extremely limited range of words.
• words that simply do not fit the text; they seem imprecise, inadequate, or just plain wrong.
Sentence Fluency
Score 6: The writing has an effective flow and rhythm. Sentences show a high degree of craftsmanship,
with consistently strong and varied structure that makes expressive oral reading easy and enjoyable. The
writing is characterized by
• a natural, fluent sound; it glides along with one sentence flowing effortlessly into the next.
• extensive variation in sentence structure, length, and beginnings that add interest to the text.
• sentence structure that enhances meaning by drawing attention to key ideas or reinforcing
relationships among ideas.
• varied sentence patterns that create an effective combination of power and grace.
• strong control over sentence structure; fragments, if used at all, work well.
• stylistic control; dialogue, if used, sounds natural.
Score 5: The writing has an easy flow and rhythm. Sentences are carefully crafted, with strong and
varied structure that makes expressive oral reading easy and enjoyable. The writing is characterized by
• a natural, fluent sound; it glides along with one sentence flowing into the next.
• variation in sentence structure, length, and beginnings that add interest to the text.
• sentence structure that enhances meaning.
• control over sentence structure; fragments, if used at all, work well.
• stylistic control; dialogue, if used, sounds natural.
Score 4: The writing flows; however, connections between phrases or sentences may be less than fluid.
Sentence patterns are somewhat varied, contributing to ease in oral reading. The writing is
characterized by
• a natural sound; the reader can move easily through the piece, although it may lack a certain
rhythm and grace.
• some repeated patterns of sentence structure, length, and beginnings that may detract
somewhat from overall impact.
• strong control over simple sentence structures, but variable control over more complex
sentences; fragments, if present, are usually effective.
• occasional lapses in stylistic control; dialogue, if used, sounds natural for the most part, but may
at times sound stilted or unnatural.
Score 3: The writing tends to be mechanical rather than fluid. Occasional awkward constructions may
force the reader to slow down or reread. The writing is characterized by
• some passages that invite fluid oral reading; however, others do not.
• some variety in sentence structure, length, and beginnings, although the writer falls into
repetitive sentence patterns.
• good control over simple sentence structures, but little control over more complex sentences;
fragments, if present, may not be effective.
• sentences which, although functional, lack energy.
• lapses in stylistic control; dialogue, if used, may sound stilted or unnatural.
• text that is too short to demonstrate variety and control.
Score 2: The writing tends to be either choppy or rambling. Awkward constructions often force the
reader to slow down or reread. The writing is characterized by
• significant portions of the text that are difficult to follow or read aloud.
• sentence patterns that are monotonous (e.g., subject-verb or subject-verb-object).
• a significant number of awkward, choppy, or rambling constructions.
Score 1: The writing is difficult to follow or to read aloud. Sentences tend to be incomplete, rambling, or
very awkward. The writing is characterized by
• text that does not invite—and may not even permit—smooth oral reading.
• confusing word order that is often jarring and irregular.
• sentence structure that frequently obscures meaning.
• sentences that are disjointed, confusing, or rambling.
Conventions
Score 6: The writing demonstrates exceptionally strong control of standard writing conventions (e.g.,
punctuation, spelling, capitalization, grammar and usage) and uses them effectively to enhance
communication. Errors are so few and so minor that the reader can easily skim right over them unless
specifically searching for them. The writing is characterized by
• strong control of conventions; manipulation of conventions may occur for stylistic effect.
• strong, effective use of punctuation that guides the reader through the text.
• correct spelling, even of more difficult words.
• correct grammar and usage that contribute to clarity and style.
• skill in using a wide range of conventions in a sufficiently long and complex piece.
• little or no need for editing.
Score 5: The writing demonstrates strong control of standard writing conventions (e.g., punctuation,
spelling, capitalization, grammar and usage) and uses them effectively to enhance communication.
Errors are few and minor. Conventions support readability. The writing is characterized by
• strong control of conventions.
• effective use of punctuation that guides the reader through the text.
• correct spelling, even of more difficult words.
• correct capitalization; errors, if any, are minor.
• correct grammar and usage that contribute to clarity and style.
• skill in using a wide range of conventions in a sufficiently long and complex piece.
• little need for editing.
Score 4: The writing demonstrates control of standard writing conventions (e.g., punctuation, spelling,
capitalization, grammar and usage). Significant errors do not occur frequently. Minor errors, while
perhaps noticeable, do not impede readability. The writing is characterized by
• control over conventions used, although a wide range is not demonstrated.
• correct end-of-sentence punctuation; internal punctuation may sometimes be incorrect.
• spelling that is usually correct, especially on common words.
• correct capitalization; errors, if any, are minor.
• occasional lapses in correct grammar and usage; problems are not severe enough to distort
meaning or confuse the reader.
• moderate need for editing.
Score 3: The writing demonstrates limited control of standard writing conventions (e.g., punctuation,
spelling, capitalization, grammar and usage). Errors begin to impede readability. The writing is
characterized by
• some control over basic conventions; the text may be too simple or too short to reveal mastery.
• end-of-sentence punctuation that is usually correct; however, internal punctuation contains
frequent errors.
• spelling errors that distract the reader; misspelling of common words occurs.
• capitalization errors.
• errors in grammar and usage that do not block meaning but do distract the reader.
• significant need for editing.
Score 2: The writing demonstrates little control of standard writing conventions. Frequent, significant
errors impede readability. The writing is characterized by
• little control over basic conventions.
• many end-of-sentence punctuation errors; internal punctuation contains frequent errors.
• spelling errors that frequently distract the reader; misspelling of common words often occurs.
• capitalization that is inconsistent or often incorrect.
• errors in grammar and usage that interfere with readability and meaning.
• substantial need for editing.
Score 1: Numerous errors in usage, spelling, capitalization, and punctuation repeatedly distract the
reader and make the text difficult to read. In fact, the severity and frequency of errors are so
overwhelming that the reader finds it difficult to focus on the message and must reread for meaning.
The writing is characterized by
• very limited skill in using conventions.
• basic punctuation (including end-of-sentence punctuation) that tends to be omitted, haphazard,
or incorrect.
• frequent spelling errors that significantly impair readability.
• capitalization that appears to be random.
• a need for extensive editing.
Adjudication Rules
Each student essay is rated on six Writing traits (I, O, V, W, S, C) by two independent raters: Rater 1 and
Rater 2. Rater 3 provides a third (resolution) rating for each trait, triggered by the following rules:
Standard Rule: Non-adjacent scores (differing by more than one point) between the 1st and 2nd
scorer on any of the 6 traits generate a resolution read.
Cusp Rule: If the first or second scorer gives all 4s on:
o Ideas and Content
o Organization
o Sentence Fluency
o Conventions,
and the other scorer gives one 3 and three 4s in these four categories, a resolution is required.
Voice and Word Choice are excluded – it does not matter what scores occur for Voice or Word
Choice (though non-adjacent Voice and Word Choice scores will still trigger the Standard Rule).
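The two rules above can be expressed as a short decision procedure. The following is a minimal sketch of the adjudication logic as stated, not the dataset's actual scoring code; the trait keys and function names are my own:

```python
# Traits: I = Ideas and Content, O = Organization, V = Voice,
# W = Word Choice, S = Sentence Fluency, C = Conventions.
ALL_TRAITS = ("I", "O", "V", "W", "S", "C")
CORE_TRAITS = ("I", "O", "S", "C")  # traits the Cusp Rule examines

def standard_rule(r1, r2):
    """Non-adjacent scores (differing by more than 1) on any of the six traits."""
    return any(abs(r1[t] - r2[t]) > 1 for t in ALL_TRAITS)

def cusp_rule(r1, r2):
    """One rater gives all 4s on the four core traits; the other gives
    exactly one 3 and three 4s on those same traits."""
    a = [r1[t] for t in CORE_TRAITS]
    b = [r2[t] for t in CORE_TRAITS]
    all_fours = lambda s: all(x == 4 for x in s)
    one_three = lambda s: sorted(s) == [3, 4, 4, 4]
    return (all_fours(a) and one_three(b)) or (all_fours(b) and one_three(a))

def needs_resolution(r1, r2):
    """True if Rater 3 must provide a resolution rating."""
    return standard_rule(r1, r2) or cusp_rule(r1, r2)
```

For example, two raters who agree everywhere except that one gives Sentence Fluency a 3 against the other's 4 (with all 4s elsewhere on the core traits) trigger the Cusp Rule even though every pair of scores is adjacent.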
Total Composite Score:
For most essays:
= (I_R1+I_R2) + (O_R1+O_R2) + (S_R1+S_R2) + 2 (C_R1+C_R2)
When there is a Rater 3 set of scores for the essay, the Total Composite Score formula
changes to:
= 2 (I_R3) + 2 (O_R3) + 2 (S_R3) + 4 (C_R3), or equivalently 2 (I_R3+O_R3+S_R3+C_R3) + 2 (C_R3)
Note the use of only four of the six traits.
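The two composite formulas above can be sketched as follows (an illustrative rendering under my own naming, not the official scoring code). Note that Conventions is double-weighted in both formulas, and that a single resolved trait score counts for twice its face value, keeping the two formulas on the same scale:

```python
def composite_two_raters(r1, r2):
    """(I_R1+I_R2) + (O_R1+O_R2) + (S_R1+S_R2) + 2*(C_R1+C_R2).
    Voice and Word Choice do not enter the composite."""
    return sum(r1[t] + r2[t] for t in ("I", "O", "S")) + 2 * (r1["C"] + r2["C"])

def composite_resolved(r3):
    """2*I_R3 + 2*O_R3 + 2*S_R3 + 4*C_R3, i.e. 2*(I+O+S+C) + 2*C,
    used when a Rater 3 resolution set exists."""
    return 2 * (r3["I"] + r3["O"] + r3["S"] + r3["C"]) + 2 * r3["C"]
```

With each trait scored 1–6, both formulas range from 10 (all 1s) to 60 (all 6s), and identical scores from all raters yield the same composite under either formula.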