Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved.
TRANSCRIPT
Catherine Trapani, Educational Testing Service
ECOLT: October 2010

e-rater® for TOEFL® Independent and Integrated Tasks
Overview
• Background and Context
• About e-rater®
• Research protocol for operational use
• Initial Results for TOEFL Independent item
• Recommendations and Actions
• Questions and Discussion
Background

Quality of human scoring
• No pretesting because of security concerns
• Human rater agreement is variable:
  – On the 5-point rubric, across 38 prompts sampled in 2008:
    • Exact agreement varied from 57% to 62%
    • Exact plus adjacent agreement: 97.5% to 99%

Quantity of human raters
• Frequently administered assessment
• Fluctuating demand, peak volumes

Demand for quicker score turnaround
• Human scoring still desired
• Market wants quicker score reporting
e-rater® — ETS’ Automated Scoring of Essays
What is e-rater?
Automatically evaluates essay quality
– Provides holistic scores
  • Predictions of scores from trained human raters
– Emphasizes writing quality over content
– Evaluation of features
– Provides feedback on essays (e.g., “diagnostic” feedback)
– Advisories filter out responses not consistent with good-faith submissions
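As one illustration of the advisory idea, here is a toy sketch of checks that might flag responses inconsistent with good-faith submissions. The check names and thresholds are assumptions for illustration, not e-rater’s actual advisories:

```python
def advisories(text, min_words=50, max_repeat_ratio=0.5):
    """Return a list of advisory flags for a response (toy checks).

    Thresholds are illustrative assumptions, not operational values.
    """
    flags = []
    words = text.split()
    if len(words) < min_words:
        flags.append("TOO_SHORT")
    if words:
        # Flag responses dominated by a single repeated word.
        top = max(words.count(w) for w in set(words))
        if top / len(words) > max_repeat_ratio:
            flags.append("EXCESSIVE_REPETITION")
    return flags
```

A flagged response would be routed out of automated scoring rather than given an e-rater score.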
Examples of e-rater (Micro)Features
Where do these features come from?
• A parser assigns a part of speech to every word
• e-rater examines adjacent or nearly adjacent pairs of words with expected relationships
• Rare or nonexistent word combinations are identified as possible errors, and appropriate feedback is issued

… At a basic level, the features are what the NLP scientists have successfully created …
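The steps above can be sketched as follows, using a toy lexicon and a hand-built table of expected part-of-speech pairs. Both tables are assumptions for illustration; the real system relies on a trained parser and corpus statistics:

```python
# Toy word -> part-of-speech lexicon (illustrative assumption).
POS = {"the": "DET", "a": "DET", "dog": "NOUN", "dogs": "NOUN",
       "runs": "VERB", "run": "VERB", "quickly": "ADV"}

# Adjacent POS pairs treated as well-formed (illustrative assumption).
EXPECTED = {("DET", "NOUN"), ("NOUN", "VERB"), ("VERB", "ADV")}

def flag_pairs(tokens):
    """Return adjacent word pairs whose POS combination is unexpected,
    i.e. candidates for an error flag and feedback."""
    tags = [POS.get(w.lower(), "UNK") for w in tokens]
    return [f"{tokens[i]} {tokens[i + 1]}"
            for i in range(len(tokens) - 1)
            if (tags[i], tags[i + 1]) not in EXPECTED]

print(flag_pairs(["the", "dog", "runs", "quickly"]))  # []
print(flag_pairs(["the", "runs", "dog"]))  # ['the runs', 'runs dog']
```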
New feature?

• Correlation of feature with human scores?
• Correlation of feature with other, already existing features?
• Measurement scientists conduct evaluation
How Does e-rater Predict the Human Score?

• Organization and development
  – 2 features
• Error rates
  – Grammar, Usage, Mechanics, Style
  – 4 features
• Lexical complexity
  – 2 features
• Prompt-specific vocabulary usage
  – 2 features
• Proposed new feature?
  – Detection of indications of “good writing” (a “positive feature”)
    • Use of collocations
    • Use of prepositions

Predicted essay score
E-rater engine upgrade – an introduction

• Annual process for introducing enhancements
• Data sets representing all clients
• Known baseline performance
• Add the proposed NLP features into a development engine in IT
• Reproduce model performance results with the proposed feature
• Take the difference between known and proposed performance of existing models
E-rater engine upgrade – an introduction (2)

• Results must represent an improvement in performance, OR
• An increase in English-language construct coverage, with no degradation from current performance
Results
• In July 2010, e-rater version 10.1 was released with a new feature that detects good use of collocations and prepositions
Impact of this new feature – TOEFL Independent Task

Descriptive Feature Name                                    Relative Weight
Organization                                                29
Development                                                 27
Mechanics                                                   10
Usage                                                        8
Grammar                                                      6
Lexical Complexity, Average word length                      6
Positive Writing Indicators – collocations & prepositions    4
Style                                                        4
Lexical Complexity, Sophistication                           4
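The relative weights can be read as coefficients in a weighted combination of feature values. A minimal sketch, assuming (as a simplification of the real regression model) that each feature value has already been expressed on the 1–5 rubric scale; the feature keys and the simple weighted average are illustrative only:

```python
# Relative weights from the slide (they sum to roughly 100).
WEIGHTS = {
    "organization": 29, "development": 27, "mechanics": 10,
    "usage": 8, "grammar": 6, "avg_word_length": 6,
    "collocations_prepositions": 4, "style": 4, "sophistication": 4,
}
TOTAL = sum(WEIGHTS.values())

def predict(features):
    """Weighted average of feature values, assumed pre-scaled to the
    1-5 rubric scale (an illustrative simplification)."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS) / TOTAL
```

Under this toy assumption, an essay whose every feature value sits at 4 receives a predicted score of 4.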
e-rater Model Types: Prompt-Specific
[Diagram: each prompt (A–D) has its own training data feeding its own model (Model A–D), which is then applied to evaluation data for that prompt.]

• Each model is trained on responses to a particular prompt
• Advantages:
  – Tailored to particular prompt characteristics
  – High agreement with human raters
  – Incorporates content features
• Disadvantages:
  – Higher demand for training data
  – Requires pre-testing of prompts
e-rater Model Types: Generic
• A single model is trained on responses to a variety of prompts
• Advantages:
  – Smaller data set required for training
  – Scoring standards are the same across prompts
  – Applicable to prompts that are not pre-tested
• Disadvantages:
  – No content features
  – Differences between particular prompts are not accounted for
  – Agreement with human raters is lower

[Diagram: training data pooled across prompts A–D feeds one Generic Model, which is then applied to evaluation data from all prompts.]
Implementation Research for Proposed Use
• Purpose is to evaluate expected
  – Quality of ratings and reported scores
  – Effectiveness of using e-rater operationally
• Research questions
  – Is e-rater performance comparable to human scores?
  – Is there any differential performance for subgroups of concern?
  – Is there any significant impact when e-rater scores are used in reported scores?
Implementation Research (2)
• Construct relevance?
  – Consistency of e-rater features with TOEFL scoring rubrics?
• Relationship between e-rater and human ratings?
  – Overall agreement rates?
  – Degradation from human–human agreement rates to human–automated agreement rates?
  – Standardized difference in scores between humans and e-rater?
  – Subgroup differences (fairness concerns)?
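The agreement statistics named above (exact and adjacent agreement, standardized difference, weighted kappa) can be sketched generically as follows. The function names are mine, and this is a standard textbook formulation rather than ETS’s implementation:

```python
from statistics import mean, stdev

def agreement_stats(h, e):
    """Exact agreement, exact-plus-adjacent agreement, and the
    standardized mean difference between two score vectors."""
    n = len(h)
    exact = sum(a == b for a, b in zip(h, e)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(h, e)) / n
    pooled_sd = ((stdev(h) ** 2 + stdev(e) ** 2) / 2) ** 0.5
    std_diff = (mean(e) - mean(h)) / pooled_sd
    return {"exact": exact, "exact_plus_adjacent": adjacent,
            "std_diff": std_diff}

def quadratic_weighted_kappa(h, e, low=1, high=5):
    """Quadratic-weighted kappa over a low..high score scale."""
    k = high - low + 1
    obs = [[0.0] * k for _ in range(k)]  # observed confusion counts
    for a, b in zip(h, e):
        obs[a - low][b - low] += 1
    n = len(h)
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n   # chance-expected counts
    return 1 - num / den
```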
Implementation Research (3)
• Impact on reported Writing scores
  – Change in reported score under multiple implementation possibilities?
    • More or less conservative approaches
      – Contributory score
      – Confirmatory score
  – Differential impact on subgroups?
    • Gender
    • Native language
    • Native country
  – Association of writing scores with external variables?
Human Scoring Performance
• Human scoring is the baseline performance against which e-rater is evaluated.
human1 by human2 agreement

prompt   N         human1        human2        std    wtd     % exact   % exact    corr
                   mean    sd    mean    sd    diff   kappa             plus adj
All      132,347   3.35    0.85  3.35    0.85  0.01   0.69    60        98         0.69
E-rater Performance
• Human scoring is the baseline against which e-rater is evaluated.
• Absolute statistical guidelines also exist.
• Human–human agreement: 60% exact, 98% exact + adjacent
• Human–e-rater agreement: 59% exact, 99% exact + adjacent
• No degradation in performance from human–human performance

human1 by e-rater agreement

human1        e-rater       std    wtd     % exact   % exact    corr
mean    sd    mean    sd    diff   kappa   agree     adj agree
3.35    0.85  3.36    0.87  0.02   0.70    59        99         0.74
Impact on Writing Scale Scores (0-30 points)
• There is no significant impact to the candidate from the implementation of e-rater for the TOEFL Independent task.
Change in scale score from all-human scoring to simulated scores with e-rater:

Change (scale-score points)   Percent
0                             60
1                             23
2                             16
3                             1
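A sketch of the kind of simulation behind this comparison: replace the second human rating with an e-rater score in a contributory design and tabulate how often the rounded task score changes. The averaging-and-rounding rule here is my assumption for illustration, not the operational TOEFL scaling:

```python
from collections import Counter

def score_change_distribution(h1, h2, er):
    """Percent of responses whose rounded task score changes by
    0, 1, 2, ... points when e-rater replaces the second human.

    Uses a simple average-then-round rule (an assumption; note that
    Python's round() rounds halves to the nearest even integer).
    """
    changes = [abs(round((a + b) / 2) - round((a + c) / 2))
               for a, b, c in zip(h1, h2, er)]
    n = len(changes)
    return {d: 100.0 * c / n for d, c in sorted(Counter(changes).items())}
```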
Correlations with Scaled Scores
Variable                          Human rating   E-rater (integer)
Reading Scale Score               0.54           0.56
Listening Scale Score             0.55           0.53
Speaking Scale Score              0.59           0.57
(Read+Listen+Speak) Scale Score   0.63           0.63
Correlations of e-rater scores with TOEFL construct scale scores are on par with the correlations of human ratings with those same scores.
Conclusions and Future Work
• The research team recommended to the TOEFL Committee of Examiners the operational use of a generic e-rater model for the Independent task.
  – Operational use began in July 2009.
• Subsequently, the research team recommended operational use of a generic model for the Integrated task.
  – Operational use will begin shortly.
QUESTIONS?
DISCUSSION?
Thank you!
Contact Information:
Cathy Trapani
[email protected]
(609) 734-5640