Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved.
TRANSCRIPT
Catherine Trapani, Educational Testing Service
ECOLT: October 2010

e-rater® for TOEFL® Independent and Integrated Tasks
Overview
• Background and Context
• About e-rater®
• Research protocol for operational use
• Initial Results for TOEFL Independent item
• Recommendations and Actions
• Questions and Discussion
Background

Quality of human scoring
• No pretesting because of security concerns
• Human rater agreement is variable:
  – On the 5-point rubric, across 38 prompts sampled in 2008:
    • Exact agreement varied from 57% to 62%
    • Exact plus adjacent agreement: 97.5% to 99%

Quantity of human raters
• Frequently administered assessment
• Fluctuating demand, peak volumes

Demand for quicker score turnaround
• Human scoring still desired
• Market wants quicker score reporting
e-rater® — ETS’ Automated Scoring of Essays
What is e-rater?
Automatically evaluates essay quality
– Provides holistic scores
  • Predictions of scores from trained human raters
– Emphasizes writing quality over content
– Evaluation of features
– Provides feedback on essays (e.g., “diagnostic” feedback)
– Advisories filter out responses not consistent with good-faith submissions
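As one illustration of the advisory idea, here is a toy sketch of checks that might flag responses inconsistent with good-faith submissions. The check names and thresholds are assumptions for illustration, not e-rater’s actual advisories:

```python
def advisories(text, min_words=50, max_repeat_ratio=0.5):
    """Return a list of advisory flags for a response (toy checks).

    Thresholds are illustrative assumptions, not operational values.
    """
    flags = []
    words = text.split()
    if len(words) < min_words:
        flags.append("TOO_SHORT")
    if words:
        # Flag responses dominated by a single repeated word.
        top = max(words.count(w) for w in set(words))
        if top / len(words) > max_repeat_ratio:
            flags.append("EXCESSIVE_REPETITION")
    return flags
```

A flagged response would be routed out of automated scoring rather than given an e-rater score.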
Examples of e-rater (Micro)Features
Where do these features come from?
• A parser assigns a part of speech to every word
• e-rater examines adjacent or nearly adjacent pairs of words with expected relationships
• Rare or nonexistent word combinations are identified as possible errors, and appropriate feedback is issued

… At a basic level, the features are what the NLP scientists have successfully created …
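The steps above can be sketched as follows, using a toy lexicon and a hand-built table of expected part-of-speech pairs. Both tables are assumptions for illustration; the real system relies on a trained parser and corpus statistics:

```python
# Toy word -> part-of-speech lexicon (illustrative assumption).
POS = {"the": "DET", "a": "DET", "dog": "NOUN", "dogs": "NOUN",
       "runs": "VERB", "run": "VERB", "quickly": "ADV"}

# Adjacent POS pairs treated as well-formed (illustrative assumption).
EXPECTED = {("DET", "NOUN"), ("NOUN", "VERB"), ("VERB", "ADV")}

def flag_pairs(tokens):
    """Return adjacent word pairs whose POS combination is unexpected,
    i.e. candidates for an error flag and feedback."""
    tags = [POS.get(w.lower(), "UNK") for w in tokens]
    return [f"{tokens[i]} {tokens[i + 1]}"
            for i in range(len(tokens) - 1)
            if (tags[i], tags[i + 1]) not in EXPECTED]

print(flag_pairs(["the", "dog", "runs", "quickly"]))  # []
print(flag_pairs(["the", "runs", "dog"]))  # ['the runs', 'runs dog']
```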
New feature?

• Correlation of feature with human scores?
• Correlation of feature with other, already existing features?
• Measurement scientists conduct evaluation
How Does e-rater Predict the Human Score?

• Organization and development
  – 2 features
• Error rates
  – Grammar, Usage, Mechanics, Style
  – 4 features
• Lexical complexity
  – 2 features
• Prompt-specific vocabulary usage
  – 2 features
• Proposed new feature?
  – Detection of indications of “good writing” (a “positive feature”)
    • Use of collocations
    • Use of prepositions

Predicted essay score
E-rater engine upgrade – an introduction

• Annual process for introducing enhancements
• Data sets representing all clients
• Known baseline performance
• Add the proposed NLP features into a development engine in IT
• Reproduce model performance results with the proposed feature
• Take the difference between known and proposed performance of existing models
E-rater engine upgrade – an introduction (2)

• Results must represent an improvement in performance, OR
• An increase in English-language construct coverage, with no degradation from current performance
Results
• In July 2010, e-rater version 10.1 was released with a new feature that detects good use of collocations and prepositions
Impact of this new feature – TOEFL Independent Task

Descriptive Feature Name                                    Relative Weight
Organization                                                29
Development                                                 27
Mechanics                                                   10
Usage                                                        8
Grammar                                                      6
Lexical Complexity, Average word length                      6
Positive Writing Indicators – collocations & prepositions    4
Style                                                        4
Lexical Complexity, Sophistication                           4
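The relative weights can be read as coefficients in a weighted combination of feature values. A minimal sketch, assuming (as a simplification of the real regression model) that each feature value has already been expressed on the 1–5 rubric scale; the feature keys and the simple weighted average are illustrative only:

```python
# Relative weights from the slide (they sum to roughly 100).
WEIGHTS = {
    "organization": 29, "development": 27, "mechanics": 10,
    "usage": 8, "grammar": 6, "avg_word_length": 6,
    "collocations_prepositions": 4, "style": 4, "sophistication": 4,
}
TOTAL = sum(WEIGHTS.values())

def predict(features):
    """Weighted average of feature values, assumed pre-scaled to the
    1-5 rubric scale (an illustrative simplification)."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS) / TOTAL
```

Under this toy assumption, an essay whose every feature value sits at 4 receives a predicted score of 4.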
e-rater Model Types: Prompt-Specific
[Diagram: each prompt (A–D) has its own training data feeding its own model (Model A–D), which is then applied to evaluation data for that prompt.]

• Each model is trained on responses to a particular prompt
• Advantages:
  – Tailored to particular prompt characteristics
  – High agreement with human raters
  – Incorporates content features
• Disadvantages:
  – Higher demand for training data
  – Requires pre-testing of prompts
e-rater Model Types: Generic
• A single model is trained on responses to a variety of prompts
• Advantages:
  – Smaller data set required for training
  – Scoring standards are the same across prompts
  – Applicable to prompts that are not pre-tested
• Disadvantages:
  – No content features
  – Differences between particular prompts are not accounted for
  – Agreement with human raters is lower

[Diagram: training data pooled across prompts A–D feeds one Generic Model, which is then applied to evaluation data from all prompts.]
Implementation Research for Proposed Use
• Purpose is to evaluate expected
  – Quality of ratings and reported scores
  – Effectiveness of using e-rater operationally
• Research questions
  – Is e-rater performance comparable to human scores?
  – Is there any differential performance for subgroups of concern?
  – Is there any significant impact when e-rater scores are used in reported scores?
Implementation Research (2)
• Construct relevance?
  – Consistency of e-rater features with TOEFL scoring rubrics?
• Relationship between e-rater and human ratings?
  – Overall agreement rates?
  – Degradation from human–human agreement rates to human–automated agreement rates?
  – Standardized difference in scores between humans and e-rater?
  – Subgroup differences (fairness concerns)?
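The agreement statistics named above (exact and adjacent agreement, standardized difference, weighted kappa) can be sketched generically as follows. The function names are mine, and this is a standard textbook formulation rather than ETS’s implementation:

```python
from statistics import mean, stdev

def agreement_stats(h, e):
    """Exact agreement, exact-plus-adjacent agreement, and the
    standardized mean difference between two score vectors."""
    n = len(h)
    exact = sum(a == b for a, b in zip(h, e)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(h, e)) / n
    pooled_sd = ((stdev(h) ** 2 + stdev(e) ** 2) / 2) ** 0.5
    std_diff = (mean(e) - mean(h)) / pooled_sd
    return {"exact": exact, "exact_plus_adjacent": adjacent,
            "std_diff": std_diff}

def quadratic_weighted_kappa(h, e, low=1, high=5):
    """Quadratic-weighted kappa over a low..high score scale."""
    k = high - low + 1
    obs = [[0.0] * k for _ in range(k)]  # observed confusion counts
    for a, b in zip(h, e):
        obs[a - low][b - low] += 1
    n = len(h)
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n   # chance-expected counts
    return 1 - num / den
```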
Implementation Research (3)
• Impact on reported Writing scores
  – Change in reported score under multiple implementation possibilities?
    • More or less conservative approaches
      – Contributory score
      – Confirmatory score
  – Differential impact on subgroups?
    • Gender
    • Native language
    • Native country
  – Association of writing scores with external variables?
Human Scoring Performance
• Human scoring is the baseline performance against which e-rater is evaluated.
human1 by human2 agreement

prompt   N         human1        human2        std    wtd     % exact   % exact    corr
                   mean    sd    mean    sd    diff   kappa             plus adj
All      132,347   3.35    0.85  3.35    0.85  0.01   0.69    60        98         0.69
E-rater Performance
• Human scoring is the baseline against which e-rater is evaluated.
• Absolute statistical guidelines also exist.
• Human–human agreement: 60% exact, 98% exact + adjacent
• Human–e-rater agreement: 59% exact, 99% exact + adjacent
• No degradation in performance from human–human performance

human1 by e-rater agreement

human1        e-rater       std    wtd     % exact   % exact    corr
mean    sd    mean    sd    diff   kappa   agree     adj agree
3.35    0.85  3.36    0.87  0.02   0.70    59        99         0.74
Impact on Writing Scale Scores (0-30 points)
• There is no significant impact to the candidate from the implementation of e-rater for the TOEFL Independent task.
Change in scale score from all-human scoring to simulated scores with e-rater:

Change (scale-score points)   Percent
0                             60
1                             23
2                             16
3                             1
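A sketch of the kind of simulation behind this comparison: replace the second human rating with an e-rater score in a contributory design and tabulate how often the rounded task score changes. The averaging-and-rounding rule here is my assumption for illustration, not the operational TOEFL scaling:

```python
from collections import Counter

def score_change_distribution(h1, h2, er):
    """Percent of responses whose rounded task score changes by
    0, 1, 2, ... points when e-rater replaces the second human.

    Uses a simple average-then-round rule (an assumption; note that
    Python's round() rounds halves to the nearest even integer).
    """
    changes = [abs(round((a + b) / 2) - round((a + c) / 2))
               for a, b, c in zip(h1, h2, er)]
    n = len(changes)
    return {d: 100.0 * c / n for d, c in sorted(Counter(changes).items())}
```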
Correlations with Scaled Scores
Variable                          Human rating   E-rater (integer)
Reading Scale Score               0.54           0.56
Listening Scale Score             0.55           0.53
Speaking Scale Score              0.59           0.57
(Read+Listen+Speak) Scale Score   0.63           0.63
Correlations of e-rater scores with TOEFL construct scale scores are on par with the correlations of human ratings with those same scores.
Conclusions and Future Work
• The research team recommended to the TOEFL Committee of Examiners the operational use of a generic e-rater model for the Independent task.
  – Operational use began in July 2009.
• Subsequently, the research team recommended operational use of a generic model for the Integrated task.
  – Operational use will begin shortly.
QUESTIONS?
DISCUSSION?
Thank you!
Contact Information:
Cathy Trapani
[email protected]
(609) 734-5640