![Page 1: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/1.jpg)
Artificial Unintelligence:
Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of
Automated Essay Evaluation
Les Perelman
Comparative Media Studies / Writing
MIT
![Page 2: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/2.jpg)
Definition of Terms
Automated Essay Scoring (AES)
• Computer produces summative assessment for evaluation
Automated Essay Evaluation (AEE)
• Computer produces formative assessment and responses for learning
![Page 3: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/3.jpg)
Overview
1. Brief recounting of mass-market writing assessment in the United States
2. AES: how it works and its major flaws
3. The Turing Test: evaluate AES
4. AEE: a brief overview
5. AEE: evaluating current implementations
6. AEE: what we can reasonably hope to achieve
– Writelab
7. Demonstration: Playing with the BABEL Generator
![Page 4: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/4.jpg)
The First College Board Entrance Examination in English − June 1901
• The two sides of the character of Achilles as shown in The Iliad. Illustrate each and tell whether we find anything like this contrast in the character of Hector.
• At least two pages
• Four hours to write -- two in the morning & two in the afternoon
![Page 5: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/5.jpg)
SAT Essay June 2005
• Think carefully about the issue presented in the following excerpt and the assignment below.
– Most of our schools are not facing up to their responsibilities. We must begin to ask ourselves whether educators should help students address the critical moral choices and social issues of our time. Schools have responsibilities beyond training people for jobs and getting students into college.
• Adapted from Svi Shapiro
• Assignment:
– Should schools help students understand moral choices and social issues?
• 25 minutes
![Page 6: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/6.jpg)
The timed impromptu is an unnatural act
• The timed impromptu does not occur in the real world
• No one writes on demand without reflecting about a topic they may never have thought about
![Page 7: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/7.jpg)
Why the change?
• Reliability
– Godshalk, F. et al. (1966). The Measurement of Writing Ability. ETS.
– Myers, A. et al. (1966). Simplex structure in the grading of essay tests. Educational and Psychological Measurement, 26(1), 41-54.
![Page 8: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/8.jpg)
Where the reliability comes from: Correlation between length and score is a negative function of time allotted
[Chart: shared variance between # of words and score (axis 0% to 60%) for essays written in 25 min, 1 hr., and 72 hrs. Sample sizes shown: N=247, N=6498, N=2820, N=115, N=106, N=106, N=660, N=1458, N=798.]
The greater the time, the smaller the correlation
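"Shared variance" here is the squared Pearson correlation (r²) between the number of words in an essay and its score. The sketch below shows one way to compute it; the function name and the word-count/score pairs are invented for illustration and are not data from the studies cited.

```python
from statistics import mean

def shared_variance(xs, ys):
    """Return r squared: the share of variance two variables have in common."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Hypothetical (word count, score) pairs, invented for illustration only.
word_counts = [120, 180, 250, 310, 400, 460]
scores = [1, 2, 3, 3, 5, 6]

r2 = shared_variance(word_counts, scores)
print(f"{r2:.1%} of score variance is shared with essay length")
```

A value near 100% would mean that length alone nearly determines the score; a value near 0% would mean length tells us almost nothing.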
![Page 9: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/9.jpg)
College Board’s ScoreWrite
![Page 10: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/10.jpg)
Colbert Report on SAT Word Length
![Page 11: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/11.jpg)
Graders were trained to read for length and pretentious diction
![Page 12: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/12.jpg)
Ellis B. Page – Project Essay Grade
• Trin -- Intrinsic variable of interest (e.g. word choice, diction; sentence complexity)
• Prox – “some variable which it is hoped will approximate the variable of true interest”
![Page 13: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/13.jpg)
e-Rater construct
Quinlan, T., Higgins, D., & Wolff, S. (2009) Evaluating the Construct-Coverage of the e-rater® Scoring Engine. ETS Research Report 09-01. p. 15
![Page 14: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/14.jpg)
E-rater 2.0 Proxies
• Organization = # of Discourse Elements (i.e. paragraphs)
• Development = Length of Discourse Elements (i.e. # of sentences & # of words in paragraphs)
• Lexical Complexity = average word length + frequency of infrequently used words + absence of frequently repeated words
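To make the point concrete, here is a minimal sketch of how proxies like these reduce to counting. It is not ETS's implementation: the function name, the paragraph and sentence splitting rules, and the sample text are all assumptions, and the word-frequency parts of lexical complexity are left out.

```python
import re

def proxy_features(essay: str) -> dict:
    """Count-based stand-ins for organization, development, and lexical complexity.

    Assumes a non-empty essay with paragraphs separated by blank lines.
    """
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay)
    return {
        "organization": len(paragraphs),  # number of discourse elements
        "development_sentences": len(sentences) / len(paragraphs),
        "development_words": len(words) / len(paragraphs),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

# Hypothetical two-paragraph essay, invented for illustration.
sample = "Schools teach skills. They also teach values.\n\nTesting shapes teaching."
print(proxy_features(sample))
```

Nothing in these features looks at what the essay says, only at how much of it there is and how long the words are.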
![Page 15: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/15.jpg)
Machines Consistently Overvalue Essay Length
Hewlett ASAP Study (2012)
Average Shared Variance (r²) between # of Words and Score for AES Machines & Human Readers
(7 of 9 vendors -- 2 vendors would not allow data to be released)
[Chart: average shared variance (axis 0% to 90%) by essay set, comparing Average Vendors with Average Human Readers. Essay sets: 1 Argument; 2A Argument Holistic; 2B Argument Grammar; 3 Literary Analysis; 4 Literary Analysis; 5 Reading Summary; 6 Reading Summary; 7 Narrative Composite; 8 Narrative Composite.]
![Page 16: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/16.jpg)
AES Machines Maintain Artificial Correlations Through the Steroids of Word Counting
![Page 17: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/17.jpg)
[Charts: percentage of score accounted for by # of words versus other factors, shown separately for Reader 1, Reader 2, and the AES machine, plus a shared-variance diagram labeled # Words, Human, and Machine.]
![Page 18: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/18.jpg)
But what about voice recognition?
• Relatively very small set + Moore’s Law
![Page 19: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/19.jpg)
What kind of writing AES can’t grade
• Long essays
– ETS’s e-rater has a 1,000 word limit
• Broad and open Writing Tasks
– Two AES machines could not approximate scores on the essay portion of the Australian Scholastic Aptitude Test that called for a fixed length (600 word) essay on a fairly open topic (McCurry, 2010)
![Page 20: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/20.jpg)
The Turing Test: The Imitation Game
![Page 21: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/21.jpg)
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart)
![Page 22: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/22.jpg)
The Reverse Turing Test
Coherent prose
Gibberish
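One way to see why a scorer built on surface counts cannot tell coherent prose from gibberish: measures that ignore word order come out identical for a sentence and for the same words in reverse. The sketch below is illustrative only; the metric names are assumptions and do not describe any vendor's feature set.

```python
def surface_metrics(text):
    """Length-based measures that ignore word order entirely."""
    words = text.split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "distinct_words": len(set(words)),
    }

def reverse_words(text):
    """Turn coherent prose into gibberish by reversing the word order."""
    return " ".join(reversed(text.split()))

coherent = "Schools should help students understand moral choices and social issues."
gibberish = reverse_words(coherent)

print(gibberish)
# The two texts differ, yet every surface metric is the same.
print(surface_metrics(coherent) == surface_metrics(gibberish))  # True
```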
![Page 23: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/23.jpg)
The Basic Automatic BS Essay Language (BABEL) Generator
http://babel-generator.herokuapp.com/
![Page 24: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/24.jpg)
http://babel-generator.herokuapp.com/
![Page 25: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/25.jpg)
![Page 26: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/26.jpg)
![Page 27: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/27.jpg)
![Page 28: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/28.jpg)
![Page 29: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/29.jpg)
![Page 30: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/30.jpg)
![Page 31: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/31.jpg)
![Page 32: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/32.jpg)
![Page 33: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/33.jpg)
What we can conclude
• The software does not do what it tells students and teachers it is doing
• The metrics (proxies) used are irrelevant at best and are probably largely antithetical to good writing or communication.
• Students can probably be trained to memorize language and strategies to obtain high scores (construct-irrelevant strategies)
![Page 34: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/34.jpg)
Some Evidence that Students Are Using BABEL to Game AEE Products
4. What is meant by a “good faith” essay?
It is important to note that although PEG software is extremely reliable in terms of producing scores that are comparable to those awarded by human judges, it can be fooled. Computers, like humans, are not perfect.
PEG presumes “good faith” essays authored by “motivated” writers. A “good faith” essay is one that reflects the writer’s best efforts to respond to the assignment and the prompt without trickery or deceit. A “motivated” writer is one who genuinely wants to do well and for whom the assignment has some consequence (a grade, a factor in admissions or hiring, etc.).
Efforts to “spoof” the system by typing in gibberish, repetitive phrases, or off-topic, illogical prose will produce illogical and essentially meaningless results.
![Page 35: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/35.jpg)
Most of the Studies Are Conducted and / or Controlled by the Vendors
![Page 36: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/36.jpg)
Important Unanswered Questions
1. How easy will it be for students to “game” these machines?
2. When essays are read by a human reader and a machine and there is a discrepancy between scores, after the adjudication procedure, what percentage of the machine’s scores are omitted or changed compared to the scores of the human reader?
3. When gamed essays are read by a reader and a machine, will the human reader’s score always catch the gamed score?
4. Can human readers also be “gamed”?
![Page 37: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/37.jpg)
Negative Consequences
• What is tested is what is taught
• Emphasis on short writing
• Emphasis on impromptu on-demand writing
![Page 38: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/38.jpg)
When can AES be useful?
• Grading short content-based writing
– Already useful applications
• Use in MOOCs in conjunction with Peer Review Applications such as Calibrated Peer Review
![Page 39: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/39.jpg)
Why are the testing companies so in love with AES?
• ῥίζα γὰρ πάντων τῶν κακῶν ἐστιν ἡ φιλαργυρία
• Radix omnium malorum est cupiditas
• The love of money is the root of all evil
• 1st Timothy 6:10
![Page 40: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/40.jpg)
Chase the Moneychangers out of the Temple
![Page 41: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/41.jpg)
Three Proposals
• First, some sort of professional system of disclosure for large sums of money, let’s say more than $10,000, received from outside professional organizations such as the College Board.
– With textbooks, the disclosure is transparent.
![Page 42: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/42.jpg)
Second, grassroots development of several different Honors English and Writing curricula
• Teach skills students will need in college
• Developed jointly by high school and college teachers
• Developed through organizations such as NCTE, WPA, and NWP & College Admissions
• Accompanied by some sort of certification procedure
• Pacesetter Program as model
![Page 43: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/43.jpg)
Create Tests of Our Own
• Design Criteria:
– Valid
– Fair
• Coaching has minimal effect
• Does not discriminate against bilingual or bidialectal students
– Feasible
• Not College Board’s bad version of reverse engineering
– “The plane doesn’t fly, but we can make money on it”
– Transparent design, development, and administration
![Page 44: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/44.jpg)
Show warts and all
![Page 45: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/45.jpg)
Decentralized Development and Grading
Opportunities for Professional Development
![Page 46: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/46.jpg)
Test to the Teaching
• Different tests and testing communities for different approaches
• Technology enabled
• Diverse and linked group of readers
– Opportunity to address problems of low-performing minorities
– Show students that their essay will be read by a diverse group of readers
![Page 47: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/47.jpg)
So let us
• Urge our schools to stop using the SAT, ACT, & even AP
• Begin a conversation with professional organizations to involve
– K-12 teachers
– College admission officers
– College teachers
To envision new and different kinds of writing tests
![Page 48: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/48.jpg)
Act
• Act to divert some of the billions spent on testing to improve teaching
• Act to reclaim testing from business and bring it back to education
• Act to make testing a form of learning not only for students but also for us
• And act by doing sound research on writing and testing; because if we don’t do it we are leaving it to people like the gang at Pearson.
![Page 49: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/49.jpg)
Automated Essay Scoring (AES) becomes Automated Essay Evaluation (AEE)
• Teaching writing in the classroom
![Page 50: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/50.jpg)
First-generation retrofitted trait numeric scores
![Page 51: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/51.jpg)
My Access / IntelliMetric
• Holistic Writing Score: 5.3 / 6.0 88%
![Page 52: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/52.jpg)
Some advice makes writing less effective
![Page 53: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/53.jpg)
Grammar Checkers are Unreliable
Nick Carbone’s Comparison of Grammarly, MS Word, and WriteCheck

| | Grammarly | MS Word | WriteCheck (e-rater) |
|---|---|---|---|
| # Errors flagged | 52 | 30 | 23 |
| # Misdiagnosed, false positives, or poorly explained | 11 | 8 | 14 |
| % Misdiagnosed, false positives, or poorly explained | 21% | 27% | 61% |
http://nccei12carbone.blogspot.com/2012/10/an-experiment-with-grammar-checkers.html
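The percentages in the comparison follow directly from the two counts for each tool. A short check of that arithmetic, using only the figures shown on the slide (the variable names are illustrative):

```python
# Counts taken from the comparison above.
flagged = {"Grammarly": 52, "MS Word": 30, "WriteCheck (e-rater)": 23}
faulty = {"Grammarly": 11, "MS Word": 8, "WriteCheck (e-rater)": 14}

# Share of each tool's flags that were misdiagnosed, false positives,
# or poorly explained.
rates = {tool: faulty[tool] / flagged[tool] for tool in flagged}

for tool, rate in rates.items():
    print(f"{tool}: {rate:.0%}")
```

This prints 21%, 27%, and 61%, matching the last row of the comparison.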
![Page 54: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/54.jpg)
MS Word Grammar & Style Checker is also not infallible
![Page 55: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/55.jpg)
Dean Mark D. Shermis on Microsoft Word and AEE
The feedback provided by the Web-based software is both quantitative and qualitative. That is, in addition to an overall rating, students may receive scores on individual attributes of writing, and the software may summarize or highlight a variety of errors, ranging from simple grammar to style or content. Some of the software packages also provide a discourse analysis of the work.
Shermis, M. (May 11, 2012). How automated grading can make good writers. Los Angeles Times.
![Page 56: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/56.jpg)
Category: Usage
• Missing or Extra Article
– Rather than rely on commercials or expert opinions about a film, individuals often make their viewing choices based on blogs and the[1] collected reviews of peers on various sites on the Internet.
• [1] You may need to remove this article.
![Page 57: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/57.jpg)
Category: Usage
• Type: Confused Words
– Because the consumption of entertainment is so ephemeral, one could posit that advertising might affect[1] a consumer's decision more than when by buying a durable good, such as a toaster or a blender.
• [1] You have used affect in this sentence. You may need to use effect instead.
![Page 58: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/58.jpg)
Category: Usage
• Type: Preposition Error
– Rather than rely on commercials or expert opinions about[1] a film, individuals often make their viewing choices based on blogs and the collected reviews of peers on various sites on the Internet.
• [1] You may be using the wrong preposition.
![Page 59: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/59.jpg)
Category: Organization & Development
• Type: Thesis Statement
– The question of how important advertising is to the sale of any product is an important one. This question is extremely important in the media industries.[1] Because the consumption of entertainment is so ephemeral, one could posit that advertising might affect a consumer's decision more than when by buying a durable good, such as a toaster or a blender.[2] The advertising department of the Silver Screen Movie Production Company has recommended spending more on advertising and less on movie production. The advertising director's arguments are not only self-serving, but also logically flawed and, at the least, inconclusive, resting on several very dubious assumptions.
• [1] Is this part of the essay your thesis? The purpose of a thesis is to organize, predict, control, and define your essay. Look in the Writer's Handbook for ways to improve your thesis.
• [2] Is this sentence really a part of your thesis? Remember that a thesis controls the whole content of your essay. You need to strengthen this thesis so that you clearly state the main point you will be making. Look in the Writer's Handbook for tips on doing this.
![Page 60: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/60.jpg)
Category: Organization & Development
• Type: Supporting Ideas
– First, the motives behind this particular argument need to be questioned. In essence, the advertising director is arguing that resources should be taken away from producing films and given to his department.[1] Although people often make reasonable requests in their own self-interest, that this policy would greatly enhance the director's fiefdom is a consequence that should elicit some skepticism.[1]
• [1] Criterion has identified only two sentences to support your topic sentence. Try to include one more sentence in this paragraph.
![Page 61: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/61.jpg)
The Missing Piece in Research on Classroom Use
• Controlled experiments to avoid placebo effect
• Comparison with the default writing tool of the 21st century, MS Word.
![Page 62: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/62.jpg)
Both Pearson and Measurement Inc. Concede that Grammar Checkers are Imperfect
http://doe.sd.gov/oats/documents/WToLrnFAQ.pdf
Q: Why does the grammar check not catch all of a student’s errors?
A: The technology that supports grammar check features in programs such as Microsoft Word often return false positives. Since WriteToLearn is a educational product, the creators of this program have decided, in an attempt to not provide students with false positives, to err on the side of caution. Consequently, there are times when the grammar check will not catch all of a student’s errors.
Teachers can address these missed grammar errors by using the post‐it note feature within the program to flag additional errors students might have missed.
![Page 63: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/63.jpg)
PEG Writer
8. Why does PEG seem to ignore some grammar “trouble spots” identified by Microsoft Word (or other programs)?
PEG’s grammar checker can detect and provide feedback for a wide variety of syntactic, semantic and punctuation errors. These errors include, but are not limited to, run-on sentences, sentence fragments and comma splices; homophone errors and other errors of word choice; and missing or misused commas, apostrophes, quotation marks and end punctuation. In addition, the grammar checker can locate and offer feedback on style choices inappropriate for formal writing.
Unlike commercial grammar checkers, however, PEG only reports those errors for which there is a high degree of confidence that the “error” is indeed an error. Commercial grammar checkers generally implement a lower threshold and as a result, may report more errors. The downside is they also report higher number of “false positives” (errors that aren’t errors). Because PEG factors these error conditions into scoring decisions, we are careful not to let “false positives” prejudice an otherwise well constructed essay.
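The trade-off both vendors describe can be sketched as a confidence threshold: report a flag only when the checker's confidence clears a cutoff. The example below is a hypothetical illustration, not PEG's or WriteToLearn's actual mechanism; the `Flag` type, the confidence values, and the ground-truth labels are all invented.

```python
from typing import NamedTuple

class Flag(NamedTuple):
    label: str
    confidence: float    # checker's own estimate that the flag is a real error
    is_real_error: bool  # ground truth, known only because this data is invented

def report(flags, threshold):
    """Return (flags shown, false positives shown, real errors missed)."""
    shown = [f for f in flags if f.confidence >= threshold]
    false_positives = sum(not f.is_real_error for f in shown)
    missed = sum(f.is_real_error for f in flags if f.confidence < threshold)
    return len(shown), false_positives, missed

# Hypothetical flags, invented for illustration only.
flags = [
    Flag("their/there", 0.95, True),
    Flag("comma splice", 0.85, True),
    Flag("affect/effect", 0.60, False),
    Flag("extra article", 0.55, False),
    Flag("sentence fragment", 0.50, True),
    Flag("wrong preposition", 0.40, False),
]

print(report(flags, 0.8))  # high threshold: (2, 0, 1)
print(report(flags, 0.3))  # low threshold:  (6, 3, 0)
```

With the high threshold no false positives reach the student, but one real error goes unreported; with the low threshold every real error is caught along with three false alarms.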
![Page 64: Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation](https://reader031.vdocument.in/reader031/viewer/2022030401/58ad63291a28ab9e428b506b/html5/thumbnails/64.jpg)
What kind of AEE can be useful & effective?
Basic design principle
Primum non nocere!
Focus on style
• MS Word is flawed but it may be hard to build something better that won’t confuse students
• What can be emphasized is style:
– Clarity
– Cohesion
– Emphasis
– Concision
– Elegance
• e.g. Parallel structures
What we need to do to build effective AEE tools
• Start with General Principles
– Then use statistical modeling
– Follow the model of the development of voice recognition apps
• Transparency
• Independent Research
The right way: by asking questions not giving answers
Letting the student own the process
Products should be transparent in displaying their limitations – again, showing warts and all
The Real Danger of AES and bad AEE: Widening the Educational Divide
• Private well-endowed institutions do not use AES
• Flawed AEE will be used in large classes to “give students more opportunities to write” poorly
• But what flawed AEE teaches not only dumbs down the ability to communicate – it has the potential to almost totally eliminate it
But AEE also offers a real opportunity to provide a cheap, accessible tool to teach and improve writing in multiple contexts
• In classrooms
• Writing at home
• In the workplace
Demo of BABEL Generator
• http://babel-generator.herokuapp.com/
• https://www.dxrgroup.com/cgi-bin/scoreitnow/password.pl
Break Up into Six Groups for MyAccess Experiment
• Hypothesis: MyAccess will give high scores to computer-generated gibberish
• Log in to MyAccess
– http://www.vantagelearning.com/login/myaccess-home-edition/
Username Password
studentone one
studentwo two
studentthree three
studentfour four
studentfive five
studentsix six
• Open a new window and open BABEL Generator: http://babel-generator.herokuapp.com/
Instructions
1. Click ASSIGNMENTS
2. Select Ages 15-18
3. Select one of the following topics (suggested keywords for BABEL generator are in parentheses) & click START ESSAY or START REVISION
a. Nature v. nurture (nature, nurture)
b. A sense of wonder (wonder)
c. Invasion of privacy (privacy)
d. Rating movies, music, & video games (violence, obscenity, teenagers)
e. What makes a good coach (coach, encouragement)
4. Open another window, open BABEL Generator, generate essay, & then copy and paste it into the MY ACCESS window
5. Click SUBMIT ESSAY
References
• Attali, Yigal & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology, Learning, and Assessment. 4:3. pp. 1-29.
• Bejar, I. I., Flor, M., Futagi, Y., Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing 22 pp. 43-59.
• Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing 18, pp. 100-108.
• Dikli, S. & Bleyle, S. (October 2014). Automated Essay Scoring feedback for second language writers: How does it compare to instructor feedback?, Assessing Writing, 22, pp. 1-17. http://www.sciencedirect.com/science/article/pii/S1075293514000221.
• Elliot, N. & Klobucar, A. “Automated Essay Evaluation and the teaching of writing.” Handbook of Automated Essay Evaluation: Current Applications and New Directions. Ed. Mark D. Shermis, Jill Burstein, and Sharon Apel. London: Routledge. June 2013.
• Freitag Ericsson, P. & Haswell, R. Ed. (2006) Machine Scoring of Student Essays: Truth or Consequences. Logan, UT: Utah State UP
• Godshalk, F. I, Swineford, F., Coffman, W. E. (1966). The Measurement of Writing Ability. New York, NY: College Entrance Examination Board.
• Haudek, K. C. et al. (2012). What are they thinking? Automated analysis of student writing about acid-base chemistry in introductory biology. CBE—Life Sciences Education 11, pp. 149-155.
• Haudek, K. C. et al. (2011). Harnessing technology to improve formative assessment of student conceptions in STEM: Forging a national network. CBE—Life Sciences Education 10, pp. 283-293.
• Herrington, A. & Moran, C. (2012). When writing to a machine is not writing at all. Writing Assessment in the 21st Century: Essays in Honor of Edward M. White. Ed. Norbert Elliot and Les Perelman. New York, NY: Hampton Press, pp. 219-232
• Higgins, D. & Heilman, M. Managing what we can measure: Quantifying the susceptibility of Automated Scoring Systems to gaming behavior. Educational Measurement 33:3. pp. 36-46.
• Klobucar, Andrew, Paul Deane, Norbert Elliot, Chaitanya Ramineni, Perry Deess, & Alex Rudniy. (2012). “Automated Essay Scoring and the Search for Valid Writing Assessment.” International Advances in Writing Research: Cultures, Places, Measures. Ed. Charles Bazerman, Chris Dean, Jessica Early, Karen Lunsford, Suzie Null, Paul Rogers, and Amanda Stansell. Fort Collins, Colorado: WAC. pp. 103-119 http://wac.colostate.edu/books/wrab2011/chapter6.pdf
• McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing. Vol. 15 pp. 118–129
• Morgan, J., Shermis, M. D., Van Deventer, L., & Vander Ark, T. (2013). Automated Student Assessment Prize: Phase 1 & Phase 2: A case study to promote focused innovation in student writing assessment. Seattle, WA: Getting Smart. http://gettingsmart.com/cms/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf
• Nehm, R. H., Ha, M., Mayfield, E. (2011). Transforming biology with machine learning: automated scoring of written evolutionary explanations. Journal of Science Education and Technology. 21:1, pp. 183-196
• Page, E. B.(1966). The imminence of grading essays by computer. Phi Delta Kappan 76:7 pp. 561-565
• Perelman, L. (July, 2014). When ‘the state of the art’ is counting words. Assessing Writing Vol. 21 pp. 104–111 http://dx.doi.org/10.1016/j.asw.2014.05.001
• Perelman, L. (August 2013). Critique of Mark D. Shermis & Ben Hamner, “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” Journal of Writing Assessment 6:1 http://journalofwritingassessment.org/article.php?article=69
• Perelman, L. (2012). “Mass-Market Writing Assessments as Bullshit.” Writing Assessment in the 21st Century: Essays in Honor of Edward M. White. Ed. Norbert Elliot and Les Perelman. New York, NY: Hampton Press, 2012, pp. 425-438.
• Perelman, L. (2012). Length, Score, Time, & Construct Validity in Holistically Graded Writing Assessments: The Case against Automated Essay Scoring (AES). In New Directions in International Writing Research, Ed. C. Bazerman, C. Dean, K. Lunsford, S. Null, P. Rogers, A. Stansell, and T. Zawacki. Anderson, SC: Parlor Press, pp. 121-132. http://wac.colostate.edu/books/wrab2011/chapter7.pdf
• Powers, D. et al. (2002). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior. 18:2 pp. 103-134.
• Quinlan, T., Higgins, D., & Wolff, S. (2009) Evaluating the Construct-Coverage of the e-rater® Scoring Engine. ETS Research Report 09-01.
• Sandene, B. et al. (2005) Online Assessment in Mathematics and Writing: Reports From the NAEP Technology-Based Assessment Project, Research and Development Series. National Center for Education Statistics Report 2005-457. http://files.eric.ed.gov/fulltext/ED485780.pdf
• Shermis, M. D. (In press). The challenges of emulating human behavior in writing assessment. http://www.sciencedirect.com/science/article/pii/S1075293514000373#
• Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53-76
• Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions. New York, NY: Routledge. pp. 313-346
• Shermis, M. (May 11, 2012). How automated grading can make good writers. Los Angeles Times http://articles.latimes.com/2012/may/11/news/la-ol-automated-scoring-blowback-20120510
• Stevenson, M. & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing. 19, pp. 51-65.