Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved. Catherine Trapani, Educational Testing Service. ECOLT: October 2010. e-rater® for TOEFL® Independent and Integrated Tasks



e-rater® for TOEFL® Independent and Integrated Tasks

Catherine Trapani
Educational Testing Service

ECOLT: October 2010


Overview

• Background and Context
• About e-rater®
• Research protocol for operational use
• Initial Results for TOEFL Independent item
• Recommendations and Actions
• Questions and Discussion


Background

Quality of human scoring
• No pretesting because of security concerns
• Human rater agreement is variable:
  – 5-point rubric, across 38 prompts sampled in 2008
    • Exact agreement varied from 57% to 62%
    • Exact-plus-adjacent agreement varied from 97.5% to 99%

Quantity of human raters
• Frequently administered assessment
• Fluctuating demand, peak volumes

Demand for quicker score turnaround
• Human scoring still desired
• Market wants quicker score reporting


e-rater®: ETS's Automated Scoring of Essays


What is e-rater?

Automatically evaluates essay quality
– Provides holistic scores
  • Predictions of scores from trained human raters
– Emphasizes writing quality over content
– Evaluation of features
– Provides feedback on essays (e.g., “diagnostic” feedback)
– Advisories filter out responses not consistent with good faith submissions


Examples of E-rater (micro)Features


Where do these features come from?

• A parser assigns a part of speech to every word
• e-rater examines adjacent or nearly adjacent pairs of words with expected relationships
• Rare or nonexistent word combinations are identified as possible errors, and appropriate feedback is issued

… At a basic level, the features are what the NLP scientists have successfully created …
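The word-pair idea in the bullets above can be illustrated with a toy sketch. It is not e-rater's actual parser or error model: the miniature reference corpus, the `min_count` threshold, and all function names are hypothetical, and the sketch looks at plain adjacent word pairs rather than part-of-speech-tagged pairs.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Return the adjacent word pairs (bigrams) in a list of tokens."""
    return list(zip(tokens, tokens[1:]))


def build_pair_counts(reference_sentences):
    """Count adjacent word pairs across a reference corpus of well-formed text."""
    counts = Counter()
    for sentence in reference_sentences:
        counts.update(adjacent_pairs(sentence.lower().split()))
    return counts


def flag_rare_pairs(sentence, pair_counts, min_count=1):
    """Return pairs seen fewer than min_count times in the reference corpus."""
    tokens = sentence.lower().split()
    return [pair for pair in adjacent_pairs(tokens) if pair_counts[pair] < min_count]


# Hypothetical miniature reference corpus; a realistic one would be very large.
reference = [
    "she depends on her friends",
    "he depends on the bus",
    "they rely on the bus",
]
counts = build_pair_counts(reference)

# Pairs never seen in the reference corpus are flagged as possible errors.
flagged = flag_rare_pairs("she depends of her friends", counts)
print(flagged)
```

With so small a corpus, any unseen pair is flagged, so a realistic threshold would have to separate genuinely rare combinations from ones that are merely missing from the reference data.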


New feature?

• Correlation of feature with human scores?
• Correlation of feature with other, already existing features?
• Measurement scientists conduct evaluation
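A minimal sketch of the two correlation checks listed above, assuming Pearson correlation is the statistic of interest (the slide does not name one). The feature name, the data values, and the `screen_feature` helper are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def screen_feature(candidate, human_scores, existing_features):
    """Correlate a candidate feature with human scores and with each existing feature."""
    return {
        "human": pearson(candidate, human_scores),
        "existing": {
            name: pearson(candidate, values)
            for name, values in existing_features.items()
        },
    }


# Made-up values for five essays (not real e-rater data).
human = [1, 2, 3, 4, 5]
candidate = [0.1, 0.3, 0.2, 0.6, 0.8]
existing = {"grammar_errors": [5, 4, 4, 2, 1]}

report = screen_feature(candidate, human, existing)
print(report)
```

A candidate that correlates with human scores but is nearly redundant with an existing feature adds little new information, which is presumably why both checks appear on the slide.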


How Does e-rater Predict the Human Score?

• Organization and development
  – 2 features
• Error rates
  – Grammar, Usage, Mechanics, Style
  – 4 features
• Lexical complexity
  – 2 features
• Prompt-specific vocabulary usage
  – 2 features
• Proposed new feature?
  – Detection of indications of “good writing”
    • A “positive feature”
    • Use of collocations
    • Use of prepositions

Predicted essay score


E-rater engine upgrade - an introduction

• Annual process for introducing enhancements
• Data sets representing all clients
• Known baseline performance
• Add the proposed NLP features into a development engine in IT
• Reproduce model performance results with proposed feature
• Take difference between known and proposed performance of existing models


E-rater engine upgrade - an introduction (2)

• Results must represent improvement in performance OR

• Increase in English Language construct coverage, with no degradation from current performance
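The acceptance rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not ETS's procedure: performance is taken to be a single higher-is-better number per model, "improvement" is read as a positive mean difference, and `tolerance` and all names are hypothetical.

```python
def evaluate_upgrade(baseline, proposed, adds_construct_coverage, tolerance=0.0):
    """Compare proposed against known baseline performance, model by model.

    baseline, proposed: dicts mapping a model name to a performance value
    (assumed higher is better). Returns (deltas, accept).
    """
    # Difference between proposed and known performance for each existing model.
    deltas = {name: proposed[name] - baseline[name] for name in baseline}

    # Assumption: "no degradation" means no model drops by more than tolerance.
    no_degradation = all(delta >= -tolerance for delta in deltas.values())

    # Assumption: "improvement" means the mean difference exceeds tolerance.
    mean_delta = sum(deltas.values()) / len(deltas)
    improved = mean_delta > tolerance

    accept = improved or (adds_construct_coverage and no_degradation)
    return deltas, accept


# Hypothetical performance values for two existing models.
baseline = {"model_a": 0.70, "model_b": 0.68}
unchanged = {"model_a": 0.70, "model_b": 0.68}

# Same performance, broader construct coverage: accepted under the second rule.
print(evaluate_upgrade(baseline, unchanged, adds_construct_coverage=True))
```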


Results

• In July 2010, e-rater version 10.1 was released with a new feature that detects good use of collocations and prepositions


Impact of this new feature - TOEFL Independent Task

Descriptive Feature Name                                     Relative Weight
Organization                                                 29
Development                                                  27
Mechanics                                                    10
Usage                                                        8
Grammar                                                      6
Lexical Complexity, Average word length                      6
Positive Writing Indicators – collocations & prepositions    4
Style                                                        4
Lexical Complexity, Sophistication                           4
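One way to read the relative weights is as the shares of a weighted linear combination of feature values. The sketch below assumes that reading, which the slide does not spell out. The weights come from the table, which sums to 98 as transcribed, so the code renormalizes them. The dictionary keys and the standardized feature values are made up, and the mapping from composite to the reported score scale is left out.

```python
# Relative weights as listed on the slide (they sum to 98 as transcribed).
RELATIVE_WEIGHTS = {
    "organization": 29,
    "development": 27,
    "mechanics": 10,
    "usage": 8,
    "grammar": 6,
    "lexical_complexity_avg_word_length": 6,
    "positive_writing_indicators": 4,
    "style": 4,
    "lexical_complexity_sophistication": 4,
}


def normalized_weights(weights):
    """Rescale relative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def weighted_composite(standardized_features, weights):
    """Weighted sum of standardized feature values (assumed linear model)."""
    norm = normalized_weights(weights)
    return sum(norm[name] * standardized_features[name] for name in norm)


# Hypothetical essay: average on every feature except strong organization.
essay = {name: 0.0 for name in RELATIVE_WEIGHTS}
essay["organization"] = 1.0
print(weighted_composite(essay, RELATIVE_WEIGHTS))
```

Under this reading the new positive-writing feature carries about 4% of the composite, the same share as Style.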


e-rater Model Types: Prompt-Specific

[Figure: Prompt-specific models. Prompts A, B, C, and D each supply their own training data and evaluation data; a separate e-rater model (Model A, Model B, Model C, Model D) is trained for each prompt.]

• Each model is trained on responses to a particular prompt
• Advantages:
  – Tailored to particular prompt characteristics
  – High agreement with human raters
  – Incorporates content features
• Disadvantages:
  – Higher demand for training data
  – Requires pre-testing of prompts


e-rater Model Types: Generic

• A single model is trained on responses to a variety of prompts
• Advantages:
  – Smaller data set required for training
  – Scoring standards the same across prompts
  – Applicable to prompts that are not pre-tested
• Disadvantages:
  – No content features
  – Differences between particular prompts are not accounted for
  – Agreement with human raters is lower

[Figure: Generic models. Training data pooled from Prompts A, B, C, and D trains a single generic e-rater model, which is applied to the evaluation data.]
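The difference between the two model types comes down to how the training data is grouped. The sketch below contrasts them with a deliberately tiny stand-in model: a one-feature least-squares line. The data, the single feature, and the function names are all hypothetical; e-rater itself uses many features.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_prompt_specific(responses):
    """Fit one model per prompt, each using only that prompt's responses."""
    by_prompt = defaultdict(list)
    for prompt, feature, score in responses:
        by_prompt[prompt].append((feature, score))
    # zip(*rows) splits the (feature, score) rows into two parallel tuples.
    return {prompt: fit_line(*zip(*rows)) for prompt, rows in by_prompt.items()}


def train_generic(responses):
    """Fit a single model on responses pooled across all prompts."""
    features = [feature for _, feature, _ in responses]
    scores = [score for _, _, score in responses]
    return fit_line(features, scores)


# Made-up (prompt, feature value, human score) triples.
responses = [
    ("A", 0.2, 3), ("A", 0.8, 5), ("A", 0.5, 4),
    ("B", 0.6, 4), ("B", 0.7, 4), ("B", 0.1, 1),
    ("C", 0.5, 4), ("C", 0.9, 5), ("C", 0.3, 3),
    ("D", 0.9, 5), ("D", 0.4, 3), ("D", 0.2, 2),
]

prompt_models = train_prompt_specific(responses)
generic_model = train_generic(responses)
print(prompt_models)
print(generic_model)
```

The generic fit yields one scoring function that can be applied to a prompt with no training responses of its own, which is the property that matters when prompts cannot be pre-tested.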


Implementation Research for Proposed Use

• Purpose is to evaluate expected
  – Quality of ratings and reported scores
  – Effectiveness of using e-rater operationally
• Research questions
  – Is e-rater performance comparable to human scores?
  – Is there any differential performance for subgroups of concern?
  – Is there any significant impact when e-rater scores are used in reported scores?


Implementation Research (2)

• Construct relevance?
  – Consistency of e-rater features with TOEFL scoring rubrics?
• Relationship between e-rater and human ratings?
  – Overall agreement rates?
  – Degradation from human-human agreement rates to human-automated agreement rates?
  – Standardized difference in scores between humans and e-rater?
  – Subgroup differences (fairness concerns)?


Implementation Research (3)

• Impact on reported Writing scores
  – Change in reported score under multiple implementation possibilities?
    • More or less conservative approaches
      – Contributory score
      – Confirmatory score
  – Differential impact on subgroups?
    • Gender
    • Native Language
    • Native Country
  – Association of writing scores with external variables?
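The slide does not define the two implementation approaches. One common reading, assumed here, is that a contributory score combines the e-rater score with the human score in the reported result, while a confirmatory score keeps the human score as the reported result and uses e-rater only to flag discrepancies for a second human rating. The averaging rule, the `threshold` value, and the function names below are all hypothetical.

```python
def contributory_score(human, erater):
    """Assumed reading: the e-rater score contributes directly to the result.

    Here the reported score is simply the average of the two scores.
    """
    return (human + erater) / 2


def confirmatory_score(human, erater, threshold=1.0, second_human=None):
    """Assumed reading: e-rater only checks the human score.

    Returns (score, needs_second_rating). If the two scores agree within
    threshold, the human score stands. Otherwise a second human rating is
    needed; once supplied, the two human ratings are averaged.
    """
    if abs(human - erater) <= threshold:
        return human, False
    if second_human is None:
        return None, True  # no reportable score until a second human rates it
    return (human + second_human) / 2, False


print(contributory_score(4, 3))
print(confirmatory_score(4, 3.6))
print(confirmatory_score(4, 2.0))
print(confirmatory_score(4, 2.0, second_human=3))
```

Under these assumptions the confirmatory approach is the more conservative one, because the automated score never enters the reported score directly.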


Human Scoring Performance

• Human scoring is the baseline performance against which e-rater is evaluated.

human1 by human2

                   human1        human2        stats
prompt   N         mean   sd     mean   sd     std diff   wtd kappa   % exact   % exact plus adj   corr
All      132,347   3.35   0.85   3.35   0.85   0.01       0.69        60        98                 0.69
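The statistics in this table can be computed from two lists of ratings. The sketch below is illustrative and makes assumptions the slide does not confirm: the weighted kappa uses quadratic weights, the standardized difference divides the mean difference by a pooled standard deviation, and the score range is passed in as a parameter. The sample ratings are made up.

```python
import math
from statistics import mean, stdev


def exact_agreement(a, b):
    """Proportion of responses where both ratings are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def adjacent_agreement(a, b):
    """Proportion of responses where ratings differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)


def standardized_difference(a, b):
    """Mean difference (b minus a) over the pooled standard deviation (assumed)."""
    pooled_sd = math.sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    return (mean(b) - mean(a)) / pooled_sd


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights (assumed weighting)."""
    k = max_score - min_score + 1
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were statistically independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Made-up ratings for five essays from two raters.
rater1 = [3, 4, 2, 5, 3]
rater2 = [3, 3, 2, 4, 5]
print(exact_agreement(rater1, rater2))
print(adjacent_agreement(rater1, rater2))
print(standardized_difference(rater1, rater2))
print(quadratic_weighted_kappa(rater1, rater2, min_score=1, max_score=5))
```

The same functions apply unchanged when the second list holds e-rater scores rounded to integers, which is how a human-human baseline can be compared with human-automated agreement.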


E-rater Performance

• Human scoring is the baseline against which e-rater is evaluated.
• Absolute statistical guidelines also exist.
• Human-human agreement: 60% exact, 98% exact + adjacent
• Human-e-rater agreement: 59% exact, 99% exact + adjacent
• No degradation in performance from human-human performance

human1 by e-rater

human1        e-rater       stats
mean   sd     mean   sd     std diff   wtd kappa   % agree   % adj agree   corr
3.35   0.85   3.36   0.87   0.02       0.70        59        99            0.74


Impact on Writing Scale Scores (0-30 points)

• There is no significant impact to the candidate from the implementation of e-rater for the TOEFL Independent task.

Change in scale score from all-human scoring to simulated scores with e-rater

Change   Percent
0        60
1        23
2        16
3        1


Correlations with Scaled Scores

Variable                            Human rating   E-rater (integer)
Reading Scale Score                 0.54           0.56
Listening Scale Score               0.55           0.53
Speaking Scale Score                0.59           0.57
(Read+Listen+Speak) Scale Score     0.63           0.63

Correlations of e-rater scores with TOEFL construct scale scores are on par with the correlations of human ratings with those same scores.


Conclusions and Future Work

• The research team recommended to the TOEFL Committee of Examiners the operational use of a generic e-rater model for the Independent task.
  – Operational use began in July 2009
• Subsequently, the research team recommended operational use of a generic model for the Integrated task.
  – Operational use will begin shortly


QUESTIONS?

DISCUSSION?

Thank you!

Contact Information: Cathy [email protected] (609) 734-5640
