Critique (Ver. 3.4) of Mark D. Shermis & Ben
Hammer, “Contrasting State-of-the-Art Automated
Scoring of Essays: Analysis” http://www.scoreright.org/NCME_2012_Paper3_29_12.pdf
Les C. Perelman, Ph.D.
Comparative Media Studies
Massachusetts Institute of Technology
13 March 2013
Critique of Mark D. Shermis & Ben Hammer, “Contrasting State-of-the-Art Automated
Scoring of Essays: Analysis” -- Ver. 3.4 Mar. 13, 2013 by Les C. Perelman is licensed
under a Creative Commons Attribution 3.0 Unported License.
Critique of Shermis & Hammer Ver. 3.4 Final page 2 13 March 2013 L. C. Perelman
Abstract
Although the unpublished study by Shermis & Hammer (2012) received substantial publicity
about its claim that automated essay scoring (AES) of student essays was as accurate as scoring
by human readers, a close examination of the paper’s methodology and the datasets used
demonstrates that such a claim is not supported by the data in the study. The study’s
methodology used one variable for comparing human readers and a different variable for
comparing machine scores, this difference artificially privileging the machines in half the
datasets. Moreover, conclusions were drawn without the performance of statistical tests and
inferences were based solely on impressionistic and sometimes inaccurate comparisons. In
addition, there was no standard testing of the model as a whole for significance, which given the
large number of comparisons, allowed machine variables to surpass human readers merely
through random chance. Finally, half of the datasets used were not essays but short one-
paragraph responses involving literary analysis or reading comprehension that were not
evaluated on any construct involving writing. Because of the widespread publicity surrounding
this study and that its findings may be used by states and state consortia in implementing the
Common Core State Standards Initiative, the authors should make the test dataset publicly
available for analysis.
Introduction.1 On April 16, 2012, Mark D. Shermis, Dean of the School of Education at the
University of Akron, presented a paper at the annual meeting of the National Council on
Measurement in Education in Vancouver, British Columbia, entitled “Contrasting State-of-the-Art
in Automated Scoring of Essays: Analysis.” Despite its fairly nondescript title, the paper made the claim that
machines graded essays as well as expert human raters, a claim that was publicized in various
press releases and newspaper articles. A press release from the University of Akron, for
example, stated “A direct comparison between human graders and software designed to score
student essays achieved virtually identical levels of accuracy, with the software in some cases
proving to be more reliable, a groundbreaking study has found.” (Man and machine: Better
writers, better grades, 2012). A headline in Inside Higher Education read “A Win for the Robo-
Readers” and the story included statements such as “The study, funded by the William and Flora
Hewlett Foundation, compared the software-generated ratings given to more than 22,000 short
essays, written by students in junior high schools and high school sophomores, to the ratings
given to the same essays by trained human readers. The differences, across a number of different
brands of automated essay scoring software (AES) and essay types, were minute.” (Kolowich,
2012). Even the venerable British publication The New Scientist reported “The essay marks
handed out by the machines were statistically identical to those from the human graders, says
[Jaison] Morgan. ‘The result blew away everyone's expectations,’ he says.” (Giles, 2012). Yet
these reports and other statements can best be characterized as unsubstantiated hyperbole. The
paper, which has never appeared in a peer-reviewed journal, employs an inconsistent and highly
dubious methodology that favors the machines over the human graders. Even with this biased
methodology, however, the data still show that for traditional writing assignments--as opposed to
merely paragraph-length summaries or paragraph-length literary analyses, which were not
graded on any construct involving writing ability-- human scorers perform better overall than
machines.
Here is the paper’s central claim:
The results demonstrated that overall, automated essay scoring was capable of producing
scores similar to human scores for extended-response writing items with equal performance
for both source-based and traditional writing genre [sic].
That claim, however, is clearly not supported by the data. Conversely, the data support the
assertion that human scorers performed more reliably than the machines on the longer traditional
writing assignments.
1 I would like to acknowledge and thank my friend and colleague Norbert Elliot among others for their extremely
helpful criticism and advice that helped me revise and improve this paper.
1. Flawed experimental design and analysis
a. Having a pair of readers compete against eight scoring engines is, in essence, like
running multiple t-tests. Any single high machine score among the nine scores by
eight different vendors compared on five different metrics could be a random
anomaly, or, to put it in more colloquial terms, a lucky guess, especially since the
sizes of the individual test datasets were relatively small, ranging from 304 to 601.
b. Before any individual comparisons were made, no overall test, such as an ANOVA or
a regression, was run to see if the machines, as a group, performed significantly
different (lower or higher) than the human scorers.
c. Although various claims were made in the paper, no statistical tests were presented by
the authors. Instead, the authors present impressionistic assertions such as “In
general, performance on kappa was slightly less with the exception of essay prompts
#5 & #6. On these data sets, the AES engines, as a group, matched or exceeded
human performance.” (p. 23). No parameters are given for what constitutes
matching or exceeding human performance. Moreover, the authors never mention
that the humans and machines are being matched against different variables (see #3 below).
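The multiple-comparisons concern in point 1(a) can be made concrete with a short sketch. The family-wise chance of at least one spuriously favorable comparison grows quickly with the number of comparisons; the 5% per-comparison "lucky win" rate and the independence of comparisons assumed below are illustrative simplifications, not figures from the study.

```python
# Illustrative sketch of the multiple-comparisons problem in section 1:
# with eight engines compared on five metrics, the chance that at least
# one machine/metric combination beats the human readers by pure luck
# is large. The 5% per-comparison rate and independence of comparisons
# are simplifying assumptions for illustration only.

def familywise_chance(n_comparisons: int, p_single: float = 0.05) -> float:
    """Probability that at least one of n independent comparisons
    yields a spuriously favorable result."""
    return 1.0 - (1.0 - p_single) ** n_comparisons

engines, metrics = 8, 5
p = familywise_chance(engines * metrics)
print(f"P(at least one lucky machine result) = {p:.2f}")
```

Under these assumptions the chance of at least one lucky machine result across forty comparisons is well above four in five, which is why an overall test of the model, as point 1(b) notes, should precede any individual comparisons.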
2. Apples and Oranges I: Two very different types of writing assignments. Although the
study was exploring how well machines could grade extended-response writing, (i.e. essays),
fewer than one-half of the datasets were what is commonly defined as extended-response
writing. The remaining five datasets were essentially paragraphs, not essays. The team
conducting the study was aware of the problem. One of the senior managers of the project
reports in an email (Morgan, 2012) that the team spent three months trying to obtain sets of
long-form constructed-response essays, asking every state and even requesting
data from international sources. He admits that some of the datasets defined as “essays” in
the study were shorter than the length the team desired, but defends the sample by arguing
that even these short pieces of writing are categorized as essays by the states, meaning that at
least one of the fifty states defines it that way. Moreover, these essays and the way that they
are scored may be unrepresentative of state assessments as a whole. The most problematic of
the eight datasets, #3, #4, #5, & #6, all come from, at most, two states.
Even more troubling, datasets #3, #4, #5, & #6 are also not measures of writing ability.
Shermis & Hammer (p.6) do note that these “four essays were ‘source-based,’ that is, the
questions asked in the prompt referred to a source document that students read as part of the
assessment.” That description, however, is misleading. They are not source-based writing
assignments in the commonly used sense that writing ability is evaluated partly on how well
the writing engages a text or effectively employs information from a text. The construct
being measured in each of these datasets is literary analysis for datasets #3 & #4 and reading
comprehension in datasets #5 & #6. Indeed, except for the word “clear” being included for
scores of “3” & “4” in the rubrics for datasets #5 & #6, there is no reference in any of the
rubrics of these four datasets to any trait associated with writing. Moreover, the rubrics and
other materials for datasets #5 & #6 explicitly define them as a reading test, while defining a
different scale for writing tests. (See Appendix). Weigle (2013) defines such exercises as
Assessing Content through Writing (ACW)—in which the construct is the student’s
understanding and knowledge about specific content. Consequently, half the datasets were
not writing assignments. In addition, some of these assignments had rubrics and other
scoring materials that were very amenable to grading by computer. The training materials
for the reading summary in dataset #6, for example, contained an explicit bulleted list of the
items that should be included in the summary. (See Appendix.)
3. Moving the Goal Posts and Apples and Oranges II: What scoring rules are being
compared? The most serious problem is that for half of the datasets-- #3, #4, #5, #6-- the
performance measure for human readers is different from the performance measure for
machines. For all of the measures beginning with Table 8, the text uses the variable H1H2,
the reliability between the two readers, as the measure for reader reliability, while the
measure for machine performance is reliability between machine and the resolved score
(RS).
In most essay testing situations, the standard practice is that the resolved score is the sum of
the two reader scores if the scores are identical or adjacent; or, if the scores vary by more
than one point, the resolved score is established by one or two supervisors re-reading the
essay. Yet only one of the datasets in the study, #1, follows the standard best practice of
combining two equal or adjacent scores to compute RS. Dataset #7 combines composite
scores regardless of the size of the difference between them. Dataset #8 also appears to
combine composite scores regardless of the size of the difference but has a third reader
adjudicating 17.7% of the RS’s randomly. Dataset #2 uses the score of only the first reader
as the RS, regardless of the second reader’s score. The remaining four datasets, #3, #4, #5,
and #6, all compute RS as the higher of the two scores.2 This procedure, followed by 55%
of the essays in the aggregate data sample and by four of the eight datasets, which come
from, at most, two states, skews many of the metrics used in favor
of the AES scores.
The confusion between human scores and resolved score is found throughout the text. The
report states, for example, on page 22, “all vendor engines generated predicted means within
0.10 of the human mean for Dataset #3 which had a rubric range of 0-3.” The report,
however, is referring to the mean of the resolved score not the mean of the human raters,
which were, in actuality, lower than the resolved score by 0.11 and 0.17 respectively. (See
Table B following – to avoid confusion tables in this Critique are labeled by letters while
the tables in the Shermis & Hammer Report are labeled with numbers)
Essay scores, be they holistic, trait, or analytic, are always continuous variables, not
integers, even though graders almost always have to give integer values as scores. The
report recognizes this fact in the observation on page 24 that values for the Pearson r “might
2 Although the adjudication rules given in the essay set descriptions for Essay Sets #3 & #4 do not mention it,
examination of the training set revealed that, like Essay Sets #5 & #6, the resolved score was computed by taking
the higher of two adjacent scores. There were no sets of scores in the training sample for Essay Sets #3 & #4 that
contained pairs of scores differing by more than one point, and there were no third-rater scores. Consequently, four of the
datasets, from at most two states, computed the resolved score by taking the higher score if the two rater scores were
not identical. The authors mention, on page 9, instances in which the higher of the two scores in one dataset (dataset
#5) was not the resolved score. In the two instances I identified, the two readers’ scores were not adjacent and
the resolved score was probably an adjudicated score.
have been higher except that the vendors were asked to predict integer values only.” Each
reader has to select a single integer value even though some essays might be on the border
between two adjacent integers. Some 3’s on a 4-point-scale might be very high 3’s
bordering on a 4, while other 3’s may be very low 3’s bordering on a 2. (Some of the
training materials for the datasets included essays scores with plus and minus signs.) In the
terminology of Classical Test Theory, the True Score might be 3.3 or 2.8. Consequently,
adjacent agreement in the correct direction between two readers (e.g., one rater gives an
essay a score of 3 and the second rater gives the essay a score of 4) will more closely
approximate a True Score of 3.4 than two scores of 3. Note, however, that adjacency needs
to be in the correct direction. If a human rater gives that same essay a 3 and an AES
algorithm gives the essay a score of 2, the scores are still adjacent and might produce the
same Kappa but the machine score is much farther away from the true score. The Quadratic
Weighted Kappa does compensate for direction. However, the Quadratic Weighted Kappa
is meant to compare the scores of two autonomous readers, not a reader score and an
artificially resolved score.
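The squared-distance weighting at issue can be made concrete with a short, self-contained sketch. The function below is a standard from-scratch quadratic weighted kappa; the two rating vectors are invented purely for illustration, with every disagreement adjacent in the first and two points apart in the second.

```python
# A from-scratch quadratic weighted kappa. Exact agreement is zero in
# both toy comparisons below, but the quadratic weighting penalizes the
# two-point disagreements far more heavily than the adjacent ones.
# The rating vectors are hypothetical, invented for illustration.

from collections import Counter

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Quadratic weighted kappa between two integer rating vectors."""
    n = max_rating - min_rating + 1
    total = len(a)
    # observed count matrix of rating pairs
    obs = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        obs[x - min_rating][y - min_rating] += 1
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2               # quadratic weight
            expected = hist_a[i + min_rating] * hist_b[j + min_rating] / total
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

human = [1, 2, 3, 4, 1, 2, 3, 4]
off_by_one = [2, 3, 4, 3, 2, 3, 4, 3]   # every disagreement adjacent
off_by_two = [3, 4, 1, 2, 3, 4, 1, 2]   # every disagreement two points apart

print(quadratic_weighted_kappa(human, off_by_one, 1, 4))  # 0.5
print(quadratic_weighted_kappa(human, off_by_two, 1, 4))  # -0.6
```

The statistic, as noted above, was designed to compare two autonomous raters; in the study it was instead computed between a rater (or engine) score and the artificially resolved score.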
Resolving scores merely by selecting the higher one ignores the continuous nature of the
scores being measured and penalizes human raters while giving AES algorithms a
substantial advantage by allowing them to optimize agreement with RS by rounding up.
In the case of the essay that has a True Score of 3.4, for example, there are four likely
pairs of scores that would be produced by two human raters: 3-3, 3-4, 4-3, and 4-4. Note
that in three of these four cases, selecting the higher score makes 4 the resolved score,
and that in two of these three instances one of the two reader scores will be lower than
the resolved score. This bias can be observed in the data in the tables at the end of the
report. The reader means given in Table 4 of the report, for the five score sets that either
combine rater scores to compute RS (#1, #7, & #8) or use a single score as RS (#2A &
#2B), track the means of the RS much more closely than the AES means do, as
illustrated in Table A, while Table B displays the means for the datasets that used the
higher human rater score as the resolved score. The contrast between the two tables in
the difference between the average human rater means and the resolved score means,
which in Table B are consistently higher, is striking and illustrates how the second
method skews the results against human raters.

Table A: Test Set Means for Resolved Score = Sum of Scores or Single Score

Set  H1     H2     RS     Diff. Avg. Human      Range of AES     Range of Diff. of AES
                          Rater Means from      Mean Scores      Mean Scores from RS Means
                          RS Means
1    8.61   8.62   8.62   -0.01                 8.49 – 8.80      -0.13 – 0.18
2A   –      3.39   3.41   -0.02                 3.33 – 3.41      -0.08 – 0.00
2B   –      3.34   3.32    0.02                 3.18 – 3.37      -0.14 – 0.05
7    20.02  20.24  20.13   0.00                 19.46 – 20.05    -0.67 – -0.08
8    36.45  36.70  36.67  -0.09                 37.04 – 37.79     0.37 – 1.12
Table B: Test Set Means for Resolved Score = Higher Human Rater Score

Set  H1    H2    RS    Diff. Avg. Human      Range of AES    Range of Diff. of AES
                       Rater Means from      Mean Scores     Mean Scores from RS Means
                       RS Means
3    1.79  1.73  1.90  0.14                  1.84 – 1.95     -0.06 – 0.05
4    1.38  1.40  1.51  0.12                  1.34 – 1.57     -0.17 – 0.06
5    2.31  2.35  2.51  0.18                  2.44 – 2.54     -0.07 – 0.03
6    2.57  2.58  2.75  0.18                  2.54 – 2.83     -0.04 – 0.08
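The rounding-up mechanism behind the pattern in Table B can be checked with a toy simulation of the 3.4 true-score example given earlier. The probability model, in which each rater independently rounds a 3.4 true score up to 4 with probability 0.4, is an illustrative assumption, not the study's data.

```python
# Toy simulation of the rounding-up bias: two raters independently score
# an essay whose true score is 3.4, each giving a 3 or a 4, and the
# resolved score is the HIGHER of the two. The resolved-score mean
# drifts above the rater mean, matching the pattern in Table B.
# P(rater gives 4) = 0.4 is a simplifying assumption for illustration.

import random

random.seed(0)
true_score = 3.4
p_four = true_score - 3              # chance a rater rounds up to 4

rater_scores, resolved = [], []
for _ in range(100_000):
    r1 = 4 if random.random() < p_four else 3
    r2 = 4 if random.random() < p_four else 3
    rater_scores += [r1, r2]
    resolved.append(max(r1, r2))     # "higher of the two" resolution rule

mean_raters = sum(rater_scores) / len(rater_scores)
mean_resolved = sum(resolved) / len(resolved)
print(f"rater mean ~ {mean_raters:.2f}, resolved mean ~ {mean_resolved:.2f}")
```

Under this model the rater mean stays near the 3.4 true score while the resolved mean drifts toward 3.64, a gap of the same order as the 0.12-0.18 differences in Table B.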
Separating the datasets into two groups, those that use a single human score or a sum of
two human scores to compute the resolved score and those that use the higher score as the
resolved score, presents two very different sets of values for the metrics used in the study.
Tables A & B already demonstrate the substantial difference for means. Similar
distinctions can be shown in the other tables. Indeed, in the five measures of agreement,
exact agreement (Table 8), exact and adjacent agreement (Table 10), Kappas (Table 12),
Quadratic Weighted Kappas (Table 14), and the Pearson r (Table 16), the human raters,
across the group of datasets, clearly outperform the AES engines on the first three and have
mixed results for the Quadratic Weighted Kappa and Pearson r. Curiously, for the Quadratic
Weighted Kappa (Table 14) the relationship of the two groups is inverted – human raters
in two of the four datasets that use the higher score as the resolved score (#3 & #4) as well
as score sets #2A & #2B outperform the AES engines while AES engines outperform
human raters in the other datasets. This anomaly may partially be an artifact of the
Quadratic Weighted Kappa measuring correspondence not between two raters, as was
intended, but between a rater (or machine score) and the artificial construct of the resolved
score as higher of the two scores. Another possible explanation is offered by Brenner &
Kliebsch (1996), who note that quadratically weighted kappa coefficients tend to
increase with larger scales while unweighted kappa coefficients decrease. They note that
“variation of the quadratically weighted kappa coefficient with the number of categories
appears to be strongest in the range from two to five categories.” (p. 201). As displayed in
Table 3 of the report, the scales for datasets #3 & #4 consisted of a scale of four (0-3),
while datasets #5 & #6 consisted of a scale of five (0-4). With the exception of the four
point scale for score set #2B, all the other datasets had scales greater than five. For
dataset #1 the range of the rubric was 1-6 and the range of the resolved score was 2-12.
For scoring set #2A, the range was 1-6; for scoring set #7, the range of the rubric was 0-
12, and the range of the resolved score was 0-24. For dataset #8, the range of the rubric
was 0-30, and the range of the resolved score was 0-30.
Datasets #7 and #8, however, are not really holistic scores, as reported by Shermis and
Hammer. They are composite scores. Dataset #7, a narrative essay, not an expository
essay as reported in Table 2, consists of four (not six) traits, Ideas, Organization, Style,
and Conventions each rated on a 0-3 scale. (See Appendix.) The resolved score as given in
the training set was the total of the scores of each of the two readers producing a range of
0-12. (The scoring guide states that scores for Ideas were doubled, but that was not the
case in the totals given in the training sample.) Dataset #8, also a narrative essay, is scored on
six traits on a 1-6 scale for each, but only four of them, Ideas and Content, Organization,
Sentence Fluency, and Conventions, are counted in computing the resolved score, with
Conventions being double weighted to produce a scale of 10-60 (See Appendix). The
training sets for both datasets #7 & #8 include the individual human trait scores, and it might
have been much more illuminating if the machines had been asked to compute individual trait
scores and those, rather than the composite scores, had been compared to the human scores.
The standard method for comparing the reliability of machine scores to human scores is to
compare the reliability of the machine scores against each of the two human scores and then
to compare those values to the reliability of the human scorers with each other. (McCurry, 2010;
Bennett, 2006). In those studies, as in many others, humans clearly outperformed machines.
Yet the study instead chose to use different variables for humans and machines. The study
could have used the data collected to test the hypothesis that AES can match human scoring
simply by using the sum of the two reader scores as the dependent variable as was the case
with datasets #1, #7, & #8.
Although the values for the two readers’ individual scores compared to the resolved score (H1
& H2) are consistently higher than the machine scores for all of the metrics displayed in the
tables, that could well be an artifact of the individual reader score being a contributing
element to the resolved score. However, of the nine score sets, the H2 scores for #2A
and #2B are completely independent of the resolved score, because reader H1 defined the
resolved score and H2’s scores were used only for computing grading reliability. That H2 in
score sets #2A & #2B outperformed all of the machines in every metric, except for one
machine in one metric, offers some evidence that the high individual reader scores compared
to the resolved score are not solely an artifact of their being a part of the whole. As shown in
Table C, #2A, which measured ideas, content, organization, style, and voice, had an exact
agreement value of 0.76, compared to the range of machine values of 0.55-0.70. Its Kappa
was 0.62, compared to the range of machine values of 0.30-0.51. Its Quadratic Weighted
Kappa was 0.80, compared to the range of machine values of 0.62-0.74. And its Pearson r
was 0.73, compared to the range of machine values of 0.62-0.74. Similarly, #2B, which
measured conventions of grammar, usage, punctuation, and spelling, had an exact
agreement value of 0.73 compared to the range of machine values of 0.55-0.69. Its Kappa
was 0.56, compared to the range of machine values of 0.27-0.49. Its Quadratic Weighted
Kappa was 0.76, compared to the range of machine values of 0.62-0.74. And its Pearson r
was 0.76, compared to the range of machine values of 0.55-0.71.
Table C: Dataset #2 – H2 Score Compared to Resolved Score vs. Machine Scores

Metric                     H2 2A   Range of Machine   H2 2B   Range of Machine
                                   Scores (2A)                Scores (2B)
Exact agreement            0.76    0.55 – 0.70        0.73    0.55 – 0.69
Kappa                      0.62    0.30 – 0.51        0.56    0.27 – 0.47
Quadratic Weighted Kappa   0.80    0.62 – 0.74        0.76    0.62 – 0.74
Pearson r                  0.73    0.62 – 0.74        0.76    0.55 – 0.71
4. Smoke and Mirrors: Overall, the report minimizes the accuracy of the human scorers
and over-represents the accuracy of machine scoring, even with the skewed variables.
To support this assertion, the comparisons between human readers and machine
scores are condensed in Tables D-G. I did not include a table for adjacent & exact agreement
because, with many of the scales being 1-4, adjacent & exact agreement was often at 0.99 for
both humans and machines.

Table D: Exact Agreement Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.64 0.64 0.64 0.44 0.43 .31 - .47
2a --- 0.76 0.76 0.68 0.66 .55 - .70
2b --- 0.73 0.73 0.66 0.65 .55 - .69
3 0.89 0.83 0.72 0.69 0.67 .61 - .72
4 0.87 0.89 0.76 0.65 0.64 .47 - .72
5 0.77 0.79 0.59 0.68 0.66 .47 - .71
6 0.80 0.81 0.63 0.64 0.64 .51 - .69
7 0.28 0.28 0.28 0.12 0.12 .07 - .15
8 0.35 0.35 0.29 0.16 0.16 .08 - .23
Exact agreement is summarized in Table D. The report aggregates the ranges of agreement
for H1H2 among all eight datasets and all nine rows of data, stating on page 22 that “The
human exact agreements ranged from 0.28 on dataset #8 to 0.76 for dataset #2.” The report
then states, “the predicted machine score and had a range from 0.07 on dataset #2 [sic] to
0.72 on datasets #3 and #4. An inspection of the deltas on Table 9 shows that machines
performed particularly well on datasets #5 and 6, two of the source-based essays.”
Aside from the careless error of incorrectly attributing the 0.07 exact agreement to dataset
#2 instead of to dataset #7, the report ignores how human scorers performed better than the
machines for most of the datasets. Of the nine scores, the human rater agreement
coefficients exceeded the top score of the machines in six of them, tying in a seventh. In
dataset #1 both readers performed 0.17 better than the best performing machine. In dataset
2a, the single “read-behind” reader performed 0.06 better than the best performing machine.
In dataset 2b, the single “read-behind” reader performed 0.04 better than the best performing
machine. The next four datasets are content based. For datasets #3 & #4, the agreement of
the two readers outperforms all but one of the machines and ties that one.
Table E: Kappa Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.53 0.53 0.45 0.29 0.28 .16 - .33
2a --- 0.62 0.62 0.48 0.46 .30 - .51
2b --- 0.56 0.56 0.45 0.42 .27 - .49
3 0.83 0.77 0.57 0.53 0.52 .45 - .59
4 0.82 0.84 0.65 0.50 0.50 .30 - .60
5 0.69 0.71 0.44 0.55 0.52 .28 - .59
6 0.70 0.71 0.45 0.46 0.46 .31 - .55
7 0.23 0.23 0.18 0.07 0.07 .03 - .09
8 0.26 0.26 0.16 0.09 0.08 .04 - .13
Table E summarizes the Kappa scores. On page 23, the report states that “in general,
performance on kappa was slightly less with the exception of essay prompts #5 & #6. On
these datasets, the AES engines, as a group, matched or exceeded human performance.”
While this last claim is true for dataset #5, it was not true for dataset #6, where the value for
H1H2 fell right in the middle of the machine scores. Moreover, the machine performance
was not “slightly” lower than human performance as measured by H1H2; it was substantially
lower for all datasets except #5 & #6, as can be observed simply by comparing H1H2 with the
median and range values of the machine scores in Table E.
Table F: Quadratic Weighted Kappa Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.77 0.78 0.73 0.78 0.77 .66 - .82
2a --- 0.80 0.80 0.70 0.70 .62 - .74
2b --- 0.76 0.76 0.66 0.65 .55 - .69
3 0.92 0.89 0.77 0.72 0.71 .65 - .75
4 0.93 0.94 0.85 0.76 0.77 .67 - .81
5 0.89 0.90 0.74 0.81 0.79 .64 - .82
6 0.89 0.89 0.74 0.76 0.74 .65 - .81
7 0.78 0.77 0.72 0.77 0.75 .58 - .84
8 0.75 0.74 0.61 0.68 0.67 .60 - .73
Table F summarizes the scores on the quadratic weighted kappa. As mentioned previously,
the machines do better on the quadratic weighted kappa except for score sets #2A and #2B
and the literary analysis questions, datasets #3 & #4.
The performance of H1H2 against the machines as measured by the Pearson r, summarized
in Table G, is mixed.
Table G: Pearson r Summary

Essay Set   H1     H2     H1H2   Machine Median   Machine Mean   Machine Range
1 0.93 0.93 0.73 0.80 0.77 .76 - .82
2a --- 0.80 0.80 0.71 0.70 .62 - .74
2b --- 0.76 0.76 0.67 0.66 .55 - .71
3 0.92 0.89 0.77 0.72 0.71 .65 - .75
4 0.94 0.94 0.85 0.76 0.77 .68 - .82
5 0.89 0.90 0.75 0.81 0.79 .65 - .84
6 0.89 0.89 0.74 0.77 0.75 .65 - .81
7 0.93 0.93 0.72 0.78 0.76 .58 - .84
8 0.87 0.88 0.61 0.70 0.68 .62 - .73
5. Finally, although the paper gives the total sample size as 22,029, only 4,343 essays in
eight different datasets comprised the actual test set. The larger number was collected, but
most of the essays were used as the training set for the machines, with another set reserved
as a validation set. While large training sets are common for AES, the authors should have
emphasized the size of the actual analytic sample rather than the number of all the essays
collected, a reporting strategy that inflates the apparent sample size.
Conclusion. Even with an experimental design that used different measures for human and
machine scorers and that privileged the machines in half the datasets, the study clearly does not
demonstrate that machines can replicate human scores. Indeed, comparing the performance of
human graders matching each other to the machines matching the resolved score still gives some
indication that the human raters may be significantly more reliable than machines. Even with the
very flawed overall design of the study, further and rigorous statistical analysis of data may yield
some interesting and extremely important information. Moreover, there are pressing policy
decisions that argue for further analysis of this data. Given that this paper has been reported to
both the Partnership for Assessment of Readiness of College and Careers and the Smarter
Balanced Assessment Consortium, and, consequently, may inform decisions by the two consortia
about the use of automated essay scoring in high stakes testing, it is imperative that the authors
publicly post the raw test set data from this study for further analysis that could possibly either
confirm their conclusions or refute them.
References
Man and machine: Better writers, better grades. (2012, April 12). Retrieved from The
University of Akron News: http://www.uakron.edu/im/online-
newsroom/news_details.dot?newsId=40920394-9e62-415d-b038-15fe2e72a677&
Bennett, R. E. (2006). Technology and Writing Assessment: Lessons Learned from the US
National Assessment of Educational Progress. International Association for Educational
Assessment. Singapore. Retrieved March 9, 2013, from
http://www.iaea.info/documents/paper_1162a26d7.pdf
Brenner, H., & Kliebsch, U. (1996, March). Dependence of Weighted Kappa Coefficients on the
Number of Categories. Epidemiology, 7(2), 199-202.
Giles, J. (2012, April 25). AI graders get top marks for scoring essay questions. Retrieved from
The New Scientist: http://www.newscientist.com/article/mg21428615.000-ai-graders-get-
top-marks-for-scoring-essay-questions.html
Kolowich, S. (2012, April 13). A Win for the Robo-Readers. Retrieved from Inside Higher
Education: http://www.insidehighered.com/news/2012/04/13/large-study-shows-little-
difference-between-human-and-robot-essay-graders
McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as
human readers? Assessing Writing, 15, 118-129.
Morgan, J. (2012, June 19). RE: Request for data on AES study. Message to the author. Email.
Shermis, M. D., & Hammer, B. (2012). Contrasting State-of-the-Art Automated Scoring of
Essays: Analysis. Retrieved March 3, 2013, from ASAP:
http://www.scoreright.org/NCME_2012_Paper3_29_12.pdf
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical
considerations. Assessing Writing, 18, 85-99.
Appendix
Prompts, Rubrics, and Other Materials
Source: http://www.kaggle.com/c/asap-aes/data
Dataset #1
Dataset #2
Writing Prompt
Question 1
IXE01020 version 2.1
“All of us can think of a book that we hope none of our children or any other
children have taken off the shelf. But if I have the right to remove that book
from the shelf—that work I abhor—then you also have exactly the same right
and so does everyone else. And then we have no books left on the shelf for any
of us.”
Katherine Paterson
Author
Write a persuasive essay to a newspaper reflecting your views on censorship in
libraries. Do you believe that certain materials, such as books, music, movies,
magazines, etc., should be removed from the shelves if they are found offensive?
Support your position with convincing arguments from your own experience,
observations, and/or reading.
Your writing will be scored on the following aspects:
• Ideas and content: Does your writing accomplish the assigned task?
• Organization: Does your writing contain an introduction, a body, and a conclusion?
• Style: Do the language and vocabulary in your writing help to convey a clear
message and to create interest?
• Voice: Are the tone and language appropriate for your intended audience?
• Language Conventions: Have you used correct sentence structure, grammar, and
punctuation?
Dataset #3
Essay Set #3 Type of essay: Source Dependent Responses
Grade level: 10
Training set size: 1,726 essays
Final evaluation set size: 575 essays
Average length of essays: 150 words
Scoring: 1st Reader Score, 2nd Reader Score, Resolved CR Score
Rubric range: 0-3
Resolved CR score range: 0-3
Source Essay
ROUGH ROAD AHEAD: Do Not Exceed Posted Speed Limit
by Joe Kurmaskie
FORGET THAT OLD SAYING ABOUT NEVER taking candy from strangers. No, a better piece of advice for
the solo cyclist would be, “Never accept travel advice from a collection of old-timers who haven’t left
the confines of their porches since Carter was in office.” It’s not that a group of old guys doesn’t know
the terrain. With age comes wisdom and all that, but the world is a fluid place. Things change.
At a reservoir campground outside of Lodi, California, I enjoyed the serenity of an early-summer evening
and some lively conversation with these old codgers. What I shouldn’t have done was let them have a
peek at my map. Like a foolish youth, the next morning I followed their advice and launched out at first
light along a “shortcut” that was to slice away hours from my ride to Yosemite National Park.
They’d sounded so sure of themselves when pointing out landmarks and spouting off towns I would
come to along this breezy jaunt. Things began well enough. I rode into the morning with strong legs and
a smile on my face. About forty miles into the pedal, I arrived at the first “town.” This place might have
been a thriving little spot at one time—say, before the last world war—but on that morning it fit the
traditional definition of a ghost town. I chuckled, checked my water supply, and moved on. The sun was
beginning to beat down, but I barely noticed it. The cool pines and rushing rivers of Yosemite had my
name written all over them.
Twenty miles up the road, I came to a fork of sorts. One ramshackle shed, several rusty pumps, and a
corral that couldn’t hold in the lamest mule greeted me. This sight was troubling. I had been hitting my
water bottles pretty regularly, and I was traveling through the high deserts of California in June.
I got down on my hands and knees, working the handle of the rusted water pump with all my strength. A
tarlike substance oozed out, followed by brackish water feeling somewhere in the neighborhood of two
hundred degrees. I pumped that handle for several minutes, but the water wouldn’t cool down. It didn’t
matter. When I tried a drop or two, it had the flavor of battery acid.
The old guys had sworn the next town was only eighteen miles down the road. I could make that! I
would conserve my water and go inward for an hour or so—a test of my inner spirit.
Not two miles into this next section of the ride, I noticed the terrain changing. Flat road was replaced by
short, rolling hills. After I had crested the first few of these, a large highway sign jumped out at me. It
read: ROUGH ROAD AHEAD: DO NOT EXCEED POSTED SPEED LIMIT.
The speed limit was 55 mph. I was doing a water-depleting 12 mph. Sometimes life can feel so cruel.
I toiled on. At some point, tumbleweeds crossed my path and a ridiculously large snake—it really did
look like a diamondback—blocked the majority of the pavement in front of me. I eased past, trying to
keep my balance in my dehydrated state.
The water bottles contained only a few tantalizing sips. Wide rings of dried sweat circled my shirt, and
the growing realization that I could drop from heatstroke on a gorgeous day in June simply because I
listened to some gentlemen who hadn’t been off their porch in decades, caused me to laugh.
It was a sad, hopeless laugh, mind you, but at least I still had the energy to feel sorry for myself. There
was no one in sight, not a building, car, or structure of any kind. I began breaking the ride down into
distances I could see on the horizon, telling myself that if I could make it that far, I’d be fine.
Over one long, crippling hill, a building came into view. I wiped the sweat from my eyes to make sure it
wasn’t a mirage, and tried not to get too excited. With what I believed was my last burst of energy, I
maneuvered down the hill.
In an ironic twist that should please all sadists reading this, the building—abandoned years earlier, by
the looks of it—had been a Welch’s Grape Juice factory and bottling plant. A sandblasted picture of a
young boy pouring a refreshing glass of juice into his mouth could still be seen.
I hung my head.
That smoky blues tune “Summertime” rattled around in the dry honeycombs of my deteriorating brain.
I got back on the bike, but not before I gathered up a few pebbles and stuck them in my mouth. I’d read
once that sucking on stones helps take your mind off thirst by allowing what spit you have left to
circulate. With any luck I’d hit a bump and lodge one in my throat.
It didn’t really matter. I was going to die and the birds would pick me clean, leaving only some expensive
outdoor gear and a diary with the last entry in praise of old men, their wisdom, and their keen sense of
direction. I made a mental note to change that paragraph if it looked like I was going to lose
consciousness for the last time.
Somehow, I climbed away from the abandoned factory of juices and dreams, slowly gaining elevation
while losing hope. Then, as easily as rounding a bend, my troubles, thirst, and fear were all behind me.
GARY AND WILBER’S FISH CAMP—IF YOU WANT BAIT FOR THE BIG ONES, WE’RE YOUR BEST BET!
“And the only bet,” I remember thinking.
As I stumbled into a rather modern bathroom and drank deeply from the sink, I had an overwhelming
urge to seek out Gary and Wilber, kiss them, and buy some bait—any bait, even though I didn’t own a
rod or reel.
An old guy sitting in a chair under some shade nodded in my direction. Cool water dripped from my
head as I slumped against the wall beside him.
“Where you headed in such a hurry?”
“Yosemite,” I whispered.
“Know the best way to get there?”
I watched him from the corner of my eye for a long moment. He was even older than the group I’d
listened to in Lodi.
“Yes, sir! I own a very good map.”
And I promised myself right then that I’d always stick to it in the future.
“Rough Road Ahead” by Joe Kurmaskie, from Metal Cowboy, copyright © 1999 Joe Kurmaskie.
Prompt
Write a response that explains how the features of the setting affect the cyclist. In your response,
include examples from the essay that support your conclusion.
Rubric Guidelines
Score 3: The response demonstrates an understanding of the complexities of the text.
Addresses the demands of the question
Uses expressed and implied information from the text
Clarifies and extends understanding beyond the literal
Score 2: The response demonstrates a partial or literal understanding of the text.
Addresses the demands of the question, although may not develop all parts equally
Uses some expressed or implied information from the text to demonstrate understanding
May not fully connect the support to a conclusion or assertion made about the text(s)
Score 1: The response shows evidence of a minimal understanding of the text.
May show evidence that some meaning has been derived from the text
May indicate a misreading of the text or the question
May lack information or explanation to support an understanding of the text in relation to the
question
Score 0: The response is completely irrelevant or incorrect, or there is no response.
Adjudication Rules
If Reader‐1 Score and Reader‐2 Score are exact or adjacent, adjudication by a third reader is not
required.
If Reader‐1 Score and Reader‐2 Score are not adjacent or exact, then adjudication by a third
reader is required.
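The adjudication rule above can be expressed as a simple check. The sketch below is an illustration of that rule only (not code from the study or its vendors), assuming integer scores on the 0-3 rubric:

```python
def needs_adjudication(reader1: int, reader2: int) -> bool:
    """Return True when the two reader scores are neither exact nor
    adjacent, i.e., they differ by more than one point and a third
    reader must resolve the score."""
    return abs(reader1 - reader2) > 1

# Examples on the 0-3 rubric range:
print(needs_adjudication(2, 2))  # exact scores -> False
print(needs_adjudication(1, 2))  # adjacent scores -> False
print(needs_adjudication(0, 3))  # discrepant scores -> True
```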
Dataset #4
Essay Set #4 Type of essay: Source Dependent Responses
Grade level: 10
Training set size: 1,772 essays
Final evaluation set size: 589 essays
Average length of essays: 150 words
Scoring: 1st Reader Score, 2nd Reader Score, Resolved CR Score
Rubric range: 0-3
Resolved CR score range: 0-3
Source Essay
Winter Hibiscus by Minfong Ho
Saeng, a teenage girl, and her family have moved to the United States from Vietnam. As Saeng walks
home after failing her driver’s test, she sees a familiar plant. Later, she goes to a florist shop to see if the
plant can be purchased.
It was like walking into another world. A hot, moist world exploding with greenery. Huge flat leaves,
delicate wisps of tendrils, ferns and fronds and vines of all shades and shapes grew in seemingly random
profusion.
“Over there, in the corner, the hibiscus. Is that what you mean?” The florist pointed at a leafy potted
plant by the corner.
There, in a shaft of the wan afternoon sunlight, was a single blood-red blossom, its five petals splayed
back to reveal a long stamen tipped with yellow pollen. Saeng felt a shock of recognition so intense, it
was almost visceral.1
“Saebba,” Saeng whispered.
A saebba hedge, tall and lush, had surrounded their garden, its lush green leaves dotted with vermilion
flowers. And sometimes after a monsoon rain, a blossom or two would have blown into the well, so that
when she drew the well water, she would find a red blossom floating in the bucket.
Slowly, Saeng walked down the narrow aisle toward the hibiscus. Orchids, lanna bushes, oleanders,
elephant ear begonias, and bougainvillea vines surrounded her. Plants that she had not even realized
she had known but had forgotten drew her back into her childhood world.
When she got to the hibiscus, she reached out and touched a petal gently. It felt smooth and cool, with a
hint of velvet toward the center—just as she had known it would feel.
And beside it was yet another old friend, a small shrub with waxy leaves and dainty flowers with
purplish petals and white centers. “Madagascar periwinkle,” its tag announced. How strange to see it in
a pot, Saeng thought. Back home it just grew wild, jutting out from the cracks in brick walls or between
tiled roofs.
And that rich, sweet scent—that was familiar, too. Saeng scanned the greenery around her and found a
tall, gangly plant with exquisite little white blossoms on it. “Dok Malik,” she said, savoring the feel of the
word on her tongue, even as she silently noted the English name on its tag, “jasmine.”
One of the blossoms had fallen off, and carefully Saeng picked it up and smelled it. She closed her eyes
and breathed in, deeply. The familiar fragrance filled her lungs, and Saeng could almost feel the light
strands of her grandmother’s long gray hair, freshly washed, as she combed it out with the fine-toothed
buffalo-horn comb. And when the sun had dried it, Saeng would help the gnarled old fingers knot the
hair into a bun, then slip a dok Malik bud into it.
Saeng looked at the white bud in her hand now, small and fragile. Gently, she closed her palm around it
and held it tight. That, at least, she could hold on to. But where was the fine-toothed comb? The
hibiscus hedge? The well? Her gentle grandmother?
A wave of loss so deep and strong that it stung Saeng’s eyes now swept over her. A blink, a channel
switch, a boat ride into the night, and it was all gone. Irretrievably, irrevocably gone.
And in the warm moist shelter of the greenhouse, Saeng broke down and wept.
It was already dusk when Saeng reached home. The wind was blowing harder, tearing off the last
remnants of green in the chicory weeds that were growing out of the cracks in the sidewalk. As if
oblivious to the cold, her mother was still out in the vegetable garden, digging up the last of the onions
with a rusty trowel. She did not see Saeng until the girl had quietly knelt down next to her.
Her smile of welcome warmed Saeng. “Ghup ma laio le? You’re back?” she said cheerfully. “Goodness,
it’s past five. What took you so long? How did it go? Did you—?” Then she noticed the potted plant that
Saeng was holding, its leaves quivering in the wind.
Mrs. Panouvong uttered a small cry of surprise and delight. “Dok faeng-noi!” she said. “Where did you
get it?”
“I bought it,” Saeng answered, dreading her mother’s next question.
“How much?”
For answer Saeng handed her mother some coins.
“That’s all?” Mrs. Panouvong said, appalled, “Oh, but I forgot! You and the
Lambert boy ate Bee-Maags . . . .”
“No, we didn’t, Mother,” Saeng said.
“Then what else—?”
“Nothing else. I paid over nineteen dollars for it.”
“You what?” Her mother stared at her incredulously. “But how could you? All the seeds for this
vegetable garden didn’t cost that much! You know how much we—” She paused, as she noticed the
tearstains on her daughter’s cheeks and her puffy eyes.
“What happened?” she asked, more gently.
“I—I failed the test,” Saeng said.
For a long moment Mrs. Panouvong said nothing. Saeng did not dare look her mother in the eye.
Instead, she stared at the hibiscus plant and nervously tore off a leaf, shredding it to bits.
Her mother reached out and brushed the fragments of green off Saeng’s hands. “It’s a beautiful plant,
this dok faeng-noi,” she finally said. “I’m glad you got it.”
“It’s—it’s not a real one,” Saeng mumbled.
“I mean, not like the kind we had at—at—” She found that she was still too shaky to say the words at
home, lest she burst into tears again. “Not like the kind we had before,” she said.
“I know,” her mother said quietly. “I’ve seen this kind blooming along the lake. Its flowers aren’t as
pretty, but it’s strong enough to make it through the cold months here, this winter hibiscus. That’s what
matters.”
She tipped the pot and deftly eased the ball of soil out, balancing the rest of the plant in her other hand.
“Look how root-bound it is, poor thing,” she said. “Let’s plant it, right now.”
She went over to the corner of the vegetable patch and started to dig a hole in the ground. The soil was
cold and hard, and she had trouble thrusting the shovel into it. Wisps of her gray hair trailed out in the
breeze, and her slight frown deepened the wrinkles around her eyes. There was a frail, wiry beauty to
her that touched Saeng deeply.
“Here, let me help, Mother,” she offered, getting up and taking the shovel away from her.
Mrs. Panouvong made no resistance. “I’ll bring in the hot peppers and bitter melons, then, and start
dinner. How would you like an omelet with slices of the bitter melon?”
“I’d love it,” Saeng said.
Left alone in the garden, Saeng dug out a hole and carefully lowered the “winter hibiscus” into it. She
could hear the sounds of cooking from the kitchen now, the beating of eggs against a bowl, the sizzle of
hot oil in the pan. The pungent smell of bitter melon wafted out, and Saeng’s mouth watered. It was a
cultivated taste, she had discovered—none of her classmates or friends, not even Mrs. Lambert, liked
it—this sharp, bitter melon that left a golden aftertaste on the tongue. But she had grown up eating it
and, she admitted to herself, much preferred it to a Big Mac.
The “winter hibiscus” was in the ground now, and Saeng tamped down the soil around it. Overhead, a
flock of Canada geese flew by, their faint honks clear and—yes—familiar to Saeng now. Almost
reluctantly, she realized that many of the things that she had thought of as strange before had become,
through the quiet repetition of season upon season, almost familiar to her now. Like the geese. She
lifted her head and watched as their distinctive V was etched against the evening sky, slowly fading into
the distance.
When they come back, Saeng vowed silently to herself, in the spring, when the snows melt and the
geese return and this hibiscus is budding, then I will take that test again.
“Winter Hibiscus” by Minfong Ho, copyright © 1993 by Minfong Ho, from Join In, Multiethnic Short Stories, by Donald R. Gallo, ed.
Prompt
Read the last paragraph of the story. "When they come back, Saeng vowed silently to herself, in the spring, when the snows melt and the geese return and this hibiscus is budding, then I will take that test again." Write a response that explains why the author concludes the story with this paragraph. In your response, include details and examples from the story that support your ideas.
Rubric Guidelines
Score 3: The response demonstrates an understanding of the complexities of the text.
Addresses the demands of the question
Uses expressed and implied information from the text
Clarifies and extends understanding beyond the literal
Score 2: The response demonstrates a partial or literal understanding of the text.
Addresses the demands of the question, although may not develop all parts equally
Uses some expressed or implied information from the text to demonstrate understanding
May not fully connect the support to a conclusion or assertion made about the text(s)
Score 1: The response shows evidence of a minimal understanding of the text.
May show evidence that some meaning has been derived from the text
May indicate a misreading of the text or the question
May lack information or explanation to support an understanding of the text in relation to the
question
Score 0: The response is completely irrelevant or incorrect, or there is no response.
Adjudication Rules
If Reader‐1 Score and Reader‐2 Score are exact or adjacent, adjudication by a third reader is not
required.
If Reader‐1 Score and Reader‐2 Score are not adjacent or exact, then adjudication by a third
reader is required.
Dataset #5
MCAS Reading 2010 Grade 8
C-24 “Narciso Rodriguez”
Itemnumber: 272837
Contract: 1117
Describe the mood created by the author in the memoir. Support your answer with relevant and specific information from the memoir.
Scoring Guide
Score 4: The response is a clear, complete, and accurate description of the mood created by the author. The response includes relevant and specific information from the memoir.
Score 3: The response is a mostly clear, complete, and accurate description of the mood created by the author. The response includes relevant but often general information from the memoir.
Score 2: The response is a partial description of the mood created by the author. The response includes limited information from the memoir and may include misinterpretations.
Score 1: The response is a minimal description of the mood created by the author. The response includes little or no information from the memoir and may include misinterpretations. OR The response relates minimally to the task.
Score 0: The response is incorrect or irrelevant or contains insufficient information to demonstrate comprehension.
Blank: No response.
Scoring Notes: The response should describe a mood of gratitude, love, or any similar appreciative mood. The response may include, but is not limited to:
• In paragraph 2: the author says he is “eternally grateful” to his parents for instilling in him a love of cooking. He also credits them for his appreciation of Cuban music, “which I adore to this day.” In general he notes their having made an inviting home filled with “endless celebrations” out of “modest” means.
• In paragraph 3: the author credits his parents for instilling in him a great sense of “family” due to the “environment” they created. This sense of family extended to everyone in a time when the larger world was uninviting.
• In paragraph 4: the author mentions his family’s generosity in allowing others to stay with them and notes its reciprocal nature.
• In paragraph 5: the author recognizes that his parents came to America “selflessly” in order to “give their children a better life.” He details their challenges and obstacles and observes that they “endured.”
• In paragraph 6: the author states, “I will always be grateful to my parents for their love and sacrifice. I’ve often told them that what they did was a much more courageous thing than I could have ever done.” He mentions his admiration and having thanked them yet admits that he has “no way to express my gratitude.”
• In paragraph 7: the author states, “I will never forget that house or its gracious neighborhood or the many things I learned there about how to love. I will never forget how my parents turned this simple house into a home.”
In conclusion, the author creates an appreciative mood by describing all of the things he is grateful for, including his parents and the home they made for him.
Other interpretations are acceptable if supported by relevant evidence from the text.
Dataset #6
MCAS Reading Grade 10
C-9 “The Mooring Mast”
Itemnumber: 280239
Contract: 1117
Based on the excerpt, describe the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. Support your answer with relevant and specific information from the excerpt.
Scoring Guide
Score 4: The response is a clear, complete, and accurate description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes relevant and specific information from the excerpt.
Score 3: The response is a mostly clear, complete, and accurate description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes relevant but often general information from the excerpt.
Score 2: The response is a partial description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes limited information from the excerpt and may include misinterpretations.
Score 1: The response is a minimal description of the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. The response includes little or no information from the excerpt and may include misinterpretations. OR The response relates minimally to the task.
Score 0: The response is totally incorrect or irrelevant, or contains insufficient evidence to demonstrate comprehension.
Blank: No response.
Scoring Notes: The obstacles to dirigible docking include:
• Building a mast on top of the building
• Meeting with engineers and dirigible engineers
• Transmitting the stress of the dirigible all the way down the building; the frame had to be shored up to the tune of $60,000
• Housing the winches and other docking equipment
• Dealing with flammable gases
• Handling the violent air currents at the top of the building
• Confronting laws banning airships from the area
• Getting close enough to the building without puncturing
Other explanations will be accepted if supported by relevant evidence from the text.
Dataset #7
Writing - Grade 7 Released Items Fall 2010
Analytic Rubric: Narrative Writing, Grades 4 and 7

Ideas (points doubled)
Score 0: Ideas are not focused on the task and/or are undeveloped.
Score 1: Tells a story with ideas that are minimally focused on the topic and developed with limited and/or general details.
Score 2: Tells a story with ideas that are somewhat focused on the topic and are developed with a mix of specific and/or general details.
Score 3: Tells a story with ideas that are clearly focused on the topic and are thoroughly developed with specific, relevant details.

Organization
Score 0: No organization evident.
Score 1: Organization and connections between ideas and/or events are weak.
Score 2: Organization and connections between ideas and/or events are logically sequenced.
Score 3: Organization and connections between ideas and/or events are clear and logically sequenced.

Style
Score 0: Ineffective use of language for the writer’s purpose and audience.
Score 1: Limited use of language, including lack of variety in word choice and sentences, may hinder support for the writer’s purpose and audience.
Score 2: Adequate command of language, including effective word choice and clear sentences, supports the writer’s purpose and audience.
Score 3: Command of language, including effective and compelling word choice and varied sentence structure, clearly supports the writer’s purpose and audience.

Conventions
Score 0: Ineffective use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation.
Score 1: Limited use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation for the grade level.
Score 2: Adequate use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation for the grade level.
Score 3: Consistent, appropriate use of conventions of Standard English* for grammar, usage, spelling, capitalization, and punctuation for the grade level.

Condition Codes (any condition code will result in a score of 0 for all traits)
A: Off-topic
B: Illegible or written in a language other than English
C: Blank
D: Insufficient to rate

* Standard English is the form of English most widely accepted for writing in schools.
Dataset #8
Essay Set #8 Type of essay: Persuasive/Narrative/Expository
Grade level: 10
Training set size: 918 essays
Final evaluation set size: 305 essays
Average length of essays: 650 words
Scoring: Rater1Comp, Rater2Comp, Rater3Comp, Resolved Score
Rater1Comp Rubric range: 0-30
Rater2Comp Rubric range: 0-30
Rater3Comp Rubric range: 0-60
Resolved score range: 0-60
Prompt
We all understand the benefits of laughter. For example, someone once said, “Laughter is the shortest
distance between two people.” Many other people believe that laughter is an important part of any
relationship. Tell a true story in which laughter was one element or part.
Rubric Guidelines
A rating of 1-6 on the following six traits:
Ideas and Content
Score 6: The writing is exceptionally clear, focused, and interesting. It holds the reader’s attention
throughout. Main ideas stand out and are developed by strong support and rich details suitable to
audience and purpose. The writing is characterized by
• clarity, focus, and control.
• main idea(s) that stand out.
• supporting, relevant, carefully selected details; when appropriate, use of resources provides
strong, accurate, credible support.
• a thorough, balanced, in-depth explanation / exploration of the topic; the writing makes
connections and shares insights.
• content and selected details that are well-suited to audience and purpose.
Score 5: The writing is clear, focused and interesting. It holds the reader’s attention. Main ideas stand
out and are developed by supporting details suitable to audience and purpose. The writing is
characterized by
• clarity, focus, and control.
• main idea(s) that stand out.
• supporting, relevant, carefully selected details; when appropriate, use of resources provides
strong, accurate, credible support.
• a thorough, balanced explanation / exploration of the topic; the writing makes connections and
shares insights.
• content and selected details that are well-suited to audience and purpose.
Score 4: The writing is clear and focused. The reader can easily understand the main ideas. Support is
present, although it may be limited or rather general. The writing is characterized by
• an easily identifiable purpose.
• clear main idea(s).
• supporting details that are relevant, but may be overly general or limited in places; when
appropriate, resources are used to provide accurate support.
• a topic that is explored / explained, although developmental details may occasionally be out of
balance with the main idea(s); some connections and insights may be present.
• content and selected details that are relevant, but perhaps not consistently well-chosen for
audience and purpose.
Score 3: The reader can understand the main ideas, although they may be overly broad or simplistic, and
the results may not be effective. Supporting detail is often limited, insubstantial, overly general, or
occasionally slightly off-topic. The writing is characterized by
• an easily identifiable purpose and main idea(s).
• predictable or overly-obvious main ideas; or points that echo observations heard elsewhere; or
a close retelling of another work.
• support that is attempted, but developmental details are often limited, uneven, somewhat off-
topic, predictable, or too general (e.g., a list of underdeveloped points).
• details that may not be well-grounded in credible resources; they may be based on clichés,
stereotypes or questionable sources of information.
• difficulties when moving from general observations to specifics.
Score 2: Main ideas and purpose are somewhat unclear or development is attempted but minimal. The
writing is characterized by
• a purpose and main idea(s) that may require extensive inferences by the reader.
• minimal development; insufficient details.
• irrelevant details that clutter the text.
• extensive repetition of detail.
Score 1: The writing lacks a central idea or purpose. The writing is characterized by
• ideas that are extremely limited or simply unclear.
• attempts at development that are minimal or nonexistent; the paper is too short to
demonstrate the development of an idea.
Organization
Score 6: The organization enhances the central idea(s) and its development. The order and structure are
compelling and move the reader through the text easily. The writing is characterized by
• effective, perhaps creative, sequencing and paragraph breaks; the organizational structure fits
the topic, and the writing is easy to follow.
• a strong, inviting beginning that draws the reader in and a strong, satisfying sense of resolution
or closure.
• smooth, effective transitions among all elements (sentences, paragraphs, ideas).
• details that fit where placed.
Score 5: The organization enhances the central idea(s) and its development. The order and structure are
strong and move the reader through the text. The writing is characterized by
• effective sequencing and paragraph breaks; the organizational structure fits the topic, and the
writing is easy to follow.
• an inviting beginning that draws the reader in and a satisfying sense of resolution or closure.
• smooth, effective transitions among all elements (sentences, paragraphs, ideas).
• details that fit where placed.
Score 4: Organization is clear and coherent. Order and structure are present, but may seem formulaic.
The writing is characterized by
• clear sequencing and paragraph breaks.
• an organization that may be predictable.
• a recognizable, developed beginning that may not be particularly inviting; a developed
conclusion that may lack subtlety.
• a body that is easy to follow with details that fit where placed.
• transitions that may be stilted or formulaic.
• organization which helps the reader, despite some weaknesses.
Score 3: An attempt has been made to organize the writing; however, the overall structure is
inconsistent or skeletal. The writing is characterized by
• attempts at sequencing and paragraph breaks, but the order or the relationship among ideas
may occasionally be unclear.
• a beginning and an ending which, although present, are either undeveloped or too obvious (e.g.,
“My topic is...”; “These are all the reasons that...”).
• transitions that sometimes work. The same few transitional devices (e.g., coordinating
conjunctions, numbering, etc.) may be overused.
• a structure that is skeletal or too rigid.
• placement of details that may not always be effective.
• organization which lapses in some places, but helps the reader in others.
Score 2: The writing lacks a clear organizational structure. An occasional organizational device is
discernible; however, the writing is either difficult to follow and the reader has to reread substantial
portions, or the piece is simply too short to demonstrate organizational skills. The writing is
characterized by
Essay Set #8 4
• some attempts at sequencing, but the order or the relationship among ideas is frequently
unclear; a lack of paragraph breaks.
• a missing or extremely undeveloped beginning, body, and/or ending.
• a lack of transitions, or when present, ineffective or overused.
• a lack of an effective organizational structure.
• details that seem to be randomly placed, leaving the reader frequently confused.
Score 1: The writing lacks coherence; organization seems haphazard and disjointed. Even after
rereading, the reader remains confused. The writing is characterized by
• a lack of effective sequencing and paragraph breaks.
• a failure to provide an identifiable beginning, body and/or ending.
• a lack of transitions.
• pacing that is consistently awkward; the reader feels either mired down in trivia or rushed along
too rapidly.
• a lack of organization which ultimately obscures or distorts the main point.
Voice
Score 6: The writer has chosen a voice appropriate for the topic, purpose, and audience. The writer
demonstrates deep commitment to the topic, and there is an exceptional sense of “writing to be read.”
The writing is expressive, engaging, or sincere. The writing is characterized by
• an effective level of closeness to or distance from the audience (e.g., a narrative should have a
strong personal voice, while an expository piece may require extensive use of outside resources
and a more academic voice; nevertheless, both should be engaging, lively, or interesting.
Technical writing may require greater distance.).
• an exceptionally strong sense of audience; the writer seems to be aware of the reader and of
how to communicate the message most effectively. The reader may discern the writer behind
the words and feel a sense of interaction.
• a sense that the topic has come to life; when appropriate, the writing may show originality,
liveliness, honesty, conviction, excitement, humor, or suspense.
Score 5: The writer has chosen a voice appropriate for the topic, purpose, and audience. The writer
demonstrates commitment to the topic, and there is a sense of “writing to be read.” The writing is
expressive, engaging, or sincere. The writing is characterized by
• an appropriate level of closeness to or distance from the audience (e.g., a narrative should have a
strong personal voice, while an expository piece may require extensive use of outside resources
and a more academic voice; nevertheless, both should be engaging, lively, or interesting.
Technical writing may require greater distance.).
• a strong sense of audience; the writer seems to be aware of the reader and of how to
communicate the message most effectively. The reader may discern the writer behind the
words and feel a sense of interaction.
• a sense that the topic has come to life; when appropriate, the writing may show originality,
liveliness, honesty, conviction, excitement, humor, or suspense.
Score 4: A voice is present. The writer seems committed to the topic, and there may be a sense of
“writing to be read.” In places, the writing is expressive, engaging, or sincere. The writing is
characterized by
• a suitable level of closeness to or distance from the audience.
• a sense of audience; the writer seems to be aware of the reader but has not consistently
employed an appropriate voice. The reader may glimpse the writer behind the words and feel a
sense of interaction in places.
• liveliness, sincerity, or humor when appropriate; however, at times the writing may be either
inappropriately casual or personal, or inappropriately formal and stiff.
Score 3: The writer’s commitment to the topic seems inconsistent. A sense of the writer may emerge at
times; however, the voice is either inappropriately personal or inappropriately impersonal. The writing is
characterized by
• a limited sense of audience; the writer’s awareness of the reader is unclear.
• an occasional sense of the writer behind the words; however, the voice may shift or disappear a
line or two later and the writing become somewhat mechanical.
• a limited ability to shift to a more objective voice when necessary.
• text that is too short to demonstrate a consistent and appropriate voice.
Score 2: The writing provides little sense of involvement or commitment. There is no evidence that the
writer has chosen a suitable voice. The writing is characterized by
• little engagement of the writer; the writing tends to be largely flat, lifeless, stiff, or mechanical.
• a voice that is likely to be overly informal and personal.
• a lack of audience awareness; there is little sense of “writing to be read.”
• little or no hint of the writer behind the words. There is rarely a sense of interaction between
reader and writer.
Score 1: The writing seems to lack a sense of involvement or commitment. The writing is characterized
by
• no engagement of the writer; the writing is flat and lifeless.
• a lack of audience awareness; there is no sense of “writing to be read.”
• no hint of the writer behind the words. There is no sense of interaction between writer and
reader; the writing does not involve or engage the reader.
Word Choice
Score 6: Words convey the intended message in an exceptionally interesting, precise, and natural way
appropriate to audience and purpose. The writer employs a rich, broad range of words which have been
carefully chosen and thoughtfully placed for impact. The writing is characterized by
• accurate, strong, specific words; powerful words energize the writing.
• fresh, original expression; slang, if used, seems purposeful and is effective.
• vocabulary that is striking and varied, but that is natural and not overdone.
• ordinary words used in an unusual way.
• words that evoke strong images; figurative language may be used.
Score 5: Words convey the intended message in an interesting, precise, and natural way appropriate to
audience and purpose. The writer employs a broad range of words which have been carefully chosen
and thoughtfully placed for impact. The writing is characterized by
• accurate, specific words; word choices energize the writing.
• fresh, vivid expression; slang, if used, seems purposeful and is effective.
• vocabulary that may be striking and varied, but that is natural and not overdone.
• ordinary words used in an unusual way.
• words that evoke clear images; figurative language may be used.
Score 4: Words effectively convey the intended message. The writer employs a variety of words that are
functional and appropriate to audience and purpose. The writing is characterized by
• words that work but do not particularly energize the writing.
• expression that is functional; however, slang, if used, does not seem purposeful and is not
particularly effective.
• attempts at colorful language that may occasionally seem overdone.
• occasional overuse of technical language or jargon.
• rare experiments with language; however, the writing may have some fine moments and
generally avoids clichés.
Score 3: Language lacks precision and variety, or may be inappropriate to audience and purpose in
places. The writer does not employ a variety of words, producing a sort of “generic” paper filled with
familiar words and phrases. The writing is characterized by
• words that work, but that rarely capture the reader’s interest.
• expression that seems mundane and general; slang, if used, does not seem purposeful and is not
effective.
• attempts at colorful language that seem overdone or forced.
• words that are accurate for the most part, although misused words may occasionally appear;
technical language or jargon may be overused or inappropriately used.
• reliance on clichés and overused expressions.
• text that is too short to demonstrate variety.
Score 2: Language is monotonous and/or misused, detracting from the meaning and impact. The writing
is characterized by
• words that are colorless, flat or imprecise.
• monotonous repetition or overwhelming reliance on worn expressions that repeatedly detract
from the message.
• images that are fuzzy or absent altogether.
Score 1: The writing shows an extremely limited vocabulary or is so filled with misuses of words that the
meaning is obscured. Only the most general kind of message is communicated because of vague or
imprecise language. The writing is characterized by
• general, vague words that fail to communicate.
• an extremely limited range of words.
• words that simply do not fit the text; they seem imprecise, inadequate, or just plain wrong.
Sentence Fluency
Score 6: The writing has an effective flow and rhythm. Sentences show a high degree of craftsmanship,
with consistently strong and varied structure that makes expressive oral reading easy and enjoyable. The
writing is characterized by
• a natural, fluent sound; it glides along with one sentence flowing effortlessly into the next.
• extensive variation in sentence structure, length, and beginnings that add interest to the text.
• sentence structure that enhances meaning by drawing attention to key ideas or reinforcing
relationships among ideas.
• varied sentence patterns that create an effective combination of power and grace.
• strong control over sentence structure; fragments, if used at all, work well.
• stylistic control; dialogue, if used, sounds natural.
Score 5: The writing has an easy flow and rhythm. Sentences are carefully crafted, with strong and
varied structure that makes expressive oral reading easy and enjoyable. The writing is characterized by
• a natural, fluent sound; it glides along with one sentence flowing into the next.
• variation in sentence structure, length, and beginnings that add interest to the text.
• sentence structure that enhances meaning.
• control over sentence structure; fragments, if used at all, work well.
• stylistic control; dialogue, if used, sounds natural.
Score 4: The writing flows; however, connections between phrases or sentences may be less than fluid.
Sentence patterns are somewhat varied, contributing to ease in oral reading. The writing is
characterized by
• a natural sound; the reader can move easily through the piece, although it may lack a certain
rhythm and grace.
• some repeated patterns of sentence structure, length, and beginnings that may detract
somewhat from overall impact.
• strong control over simple sentence structures, but variable control over more complex
sentences; fragments, if present, are usually effective.
• occasional lapses in stylistic control; dialogue, if used, sounds natural for the most part, but may
at times sound stilted or unnatural.
Score 3: The writing tends to be mechanical rather than fluid. Occasional awkward constructions may
force the reader to slow down or reread. The writing is characterized by
• some passages that invite fluid oral reading; however, others do not.
• some variety in sentence structure, length, and beginnings, although the writer falls into
repetitive sentence patterns.
• good control over simple sentence structures, but little control over more complex sentences;
fragments, if present, may not be effective.
• sentences which, although functional, lack energy.
• lapses in stylistic control; dialogue, if used, may sound stilted or unnatural.
• text that is too short to demonstrate variety and control.
Score 2: The writing tends to be either choppy or rambling. Awkward constructions often force the
reader to slow down or reread. The writing is characterized by
• significant portions of the text that are difficult to follow or read aloud.
• sentence patterns that are monotonous (e.g., subject-verb or subject-verb-object).
• a significant number of awkward, choppy, or rambling constructions.
Score 1: The writing is difficult to follow or to read aloud. Sentences tend to be incomplete, rambling, or
very awkward. The writing is characterized by
• text that does not invite—and may not even permit—smooth oral reading.
• confusing word order that is often jarring and irregular.
• sentence structure that frequently obscures meaning.
• sentences that are disjointed, confusing, or rambling.
Conventions
Score 6: The writing demonstrates exceptionally strong control of standard writing conventions (e.g.,
punctuation, spelling, capitalization, grammar and usage) and uses them effectively to enhance
communication. Errors are so few and so minor that the reader can easily skim right over them unless
specifically searching for them. The writing is characterized by
• strong control of conventions; manipulation of conventions may occur for stylistic effect.
• strong, effective use of punctuation that guides the reader through the text.
• correct spelling, even of more difficult words.
• correct grammar and usage that contribute to clarity and style.
• skill in using a wide range of conventions in a sufficiently long and complex piece.
• little or no need for editing.
Score 5: The writing demonstrates strong control of standard writing conventions (e.g., punctuation,
spelling, capitalization, grammar and usage) and uses them effectively to enhance communication.
Errors are few and minor. Conventions support readability. The writing is characterized by
• strong control of conventions.
• effective use of punctuation that guides the reader through the text.
• correct spelling, even of more difficult words.
• correct capitalization; errors, if any, are minor.
• correct grammar and usage that contribute to clarity and style.
• skill in using a wide range of conventions in a sufficiently long and complex piece.
• little need for editing.
Score 4: The writing demonstrates control of standard writing conventions (e.g., punctuation, spelling,
capitalization, grammar and usage). Significant errors do not occur frequently. Minor errors, while
perhaps noticeable, do not impede readability. The writing is characterized by
• control over conventions used, although a wide range is not demonstrated.
• correct end-of-sentence punctuation; internal punctuation may sometimes be incorrect.
• spelling that is usually correct, especially on common words.
• correct capitalization; errors, if any, are minor.
• occasional lapses in correct grammar and usage; problems are not severe enough to distort
meaning or confuse the reader.
• moderate need for editing.
Score 3: The writing demonstrates limited control of standard writing conventions (e.g., punctuation,
spelling, capitalization, grammar and usage). Errors begin to impede readability. The writing is
characterized by
• some control over basic conventions; the text may be too simple or too short to reveal mastery.
• end-of-sentence punctuation that is usually correct; however, internal punctuation contains
frequent errors.
• spelling errors that distract the reader; misspelling of common words occurs.
• capitalization errors.
• errors in grammar and usage that do not block meaning but do distract the reader.
• significant need for editing.
Score 2: The writing demonstrates little control of standard writing conventions. Frequent, significant
errors impede readability. The writing is characterized by
• little control over basic conventions.
• many end-of-sentence punctuation errors; internal punctuation contains frequent errors.
• spelling errors that frequently distract the reader; misspelling of common words often occurs.
• capitalization that is inconsistent or often incorrect.
• errors in grammar and usage that interfere with readability and meaning.
• substantial need for editing.
Score 1: Numerous errors in usage, spelling, capitalization, and punctuation repeatedly distract the
reader and make the text difficult to read. In fact, the severity and frequency of errors are so
overwhelming that the reader finds it difficult to focus on the message and must reread for meaning.
The writing is characterized by
• very limited skill in using conventions.
• basic punctuation (including end-of-sentence punctuation) that tends to be omitted, haphazard,
or incorrect.
• frequent spelling errors that significantly impair readability.
• capitalization that appears to be random.
• a need for extensive editing.
Adjudication Rules
Each student essay is rated on six Writing traits (I, O, V, W, S, C) by two independent raters: Rater 1 and
Rater 2. Rater 3 provides a third (resolution) rating for each trait, triggered by the following rules:
Standard Rule: Non-adjacent scores (differing by more than one point) between the 1st and 2nd
scorer on any of the 6 traits generate a resolution read.
Cusp Rule: If the first or second scorer gives all 4s on:
o Ideas and Content
o Organization
o Sentence Fluency
o Conventions,
and the other scorer gives one 3 and three 4s in these four categories, a resolution is required.
Voice and Word Choice are excluded – it does not matter what scores occur for Voice or Word
Choice (though non-adjacent Voice and Word Choice scores will still trigger the Standard Rule).
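The two rules above can be expressed as a short decision procedure. The following is a minimal sketch of the adjudication logic as stated, not the dataset's actual scoring code; the trait keys and function names are my own:

```python
# Traits: I = Ideas and Content, O = Organization, V = Voice,
# W = Word Choice, S = Sentence Fluency, C = Conventions.
ALL_TRAITS = ("I", "O", "V", "W", "S", "C")
CORE_TRAITS = ("I", "O", "S", "C")  # traits the Cusp Rule examines

def standard_rule(r1, r2):
    """Non-adjacent scores (differing by more than 1) on any of the six traits."""
    return any(abs(r1[t] - r2[t]) > 1 for t in ALL_TRAITS)

def cusp_rule(r1, r2):
    """One rater gives all 4s on the four core traits; the other gives
    exactly one 3 and three 4s on those same traits."""
    a = [r1[t] for t in CORE_TRAITS]
    b = [r2[t] for t in CORE_TRAITS]
    all_fours = lambda s: all(x == 4 for x in s)
    one_three = lambda s: sorted(s) == [3, 4, 4, 4]
    return (all_fours(a) and one_three(b)) or (all_fours(b) and one_three(a))

def needs_resolution(r1, r2):
    """True if Rater 3 must provide a resolution rating."""
    return standard_rule(r1, r2) or cusp_rule(r1, r2)
```

For example, two raters who agree everywhere except that one gives Sentence Fluency a 3 against the other's 4 (with all 4s elsewhere on the core traits) trigger the Cusp Rule even though every pair of scores is adjacent.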
Total Composite Score:
For most essays:
= (I_R1+I_R2) + (O_R1+O_R2) + (S_R1+S_R2) + 2 (C_R1+C_R2)
When there is a Rater 3 set of scores for the essay, the Total Composite Score formula
changes to:
= 2 (I_R3) + 2 (O_R3) + 2 (S_R3) + 4 (C_R3), or equivalently 2 (I_R3+O_R3+S_R3+C_R3) + 2 (C_R3)
Note the use of only four of the six traits.
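The two composite formulas above can be sketched as follows (an illustrative rendering under my own naming, not the official scoring code). Note that Conventions is double-weighted in both formulas, and that a single resolved trait score counts for twice its face value, keeping the two formulas on the same scale:

```python
def composite_two_raters(r1, r2):
    """(I_R1+I_R2) + (O_R1+O_R2) + (S_R1+S_R2) + 2*(C_R1+C_R2).
    Voice and Word Choice do not enter the composite."""
    return sum(r1[t] + r2[t] for t in ("I", "O", "S")) + 2 * (r1["C"] + r2["C"])

def composite_resolved(r3):
    """2*I_R3 + 2*O_R3 + 2*S_R3 + 4*C_R3, i.e. 2*(I+O+S+C) + 2*C,
    used when a Rater 3 resolution set exists."""
    return 2 * (r3["I"] + r3["O"] + r3["S"] + r3["C"]) + 2 * r3["C"]
```

With each trait scored 1–6, both formulas range from 10 (all 1s) to 60 (all 6s), and identical scores from all raters yield the same composite under either formula.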